Anca Dragan - Optimized Misalignment

Transcript

So I've been in this job at Google DeepMind, in AI safety and alignment, for almost a year now. It feels like a decade, to be honest, but in this position, I watch out – like Richard was saying – for current system safety, as well as what we call “AGI safety and alignment.” So, thinking about Gemini, and also the next models, and the ones after that…. As capabilities evolve, how do we make sure everything stays safe? And as part of that, yes, I get to do a lot of deep thinking about “What is our strategy? What should our portfolio be? How do we tackle these issues?”


But I also have to do a lot of talking to and engaging other people in thinking about AI safety. And so I find myself often in positions – be it with high level, very smart execs at Google, or with other labs and other companies – talking about “What are we actually worried about; what are the threat models?”


And this talk is a little bit of a summary of what I've found to be fairly convincing – the framing I've found easiest for getting on the same page with people about this. If you're somewhat new to the community, it might help you get a clearer picture of some of the things we worry about.


If you’ve been in the community for a while, it's really just my advice to you – my way of sharing how you might talk about these things with others. And in particular, there are many types of threats, as Adam was saying. The one we're going to focus on today is what we call ‘optimized misalignment.’


I started on alignment or misalignment back in 2015. And this was right off of Superintelligence in 2014, when the paperclip maximizer example was gaining a lot of popularity. Since then, there's been a big, let's say, lack of consensus in the AI community about what we're worried about when it comes to optimized misalignment and how worried we should be.


On the one hand, you have Yoshua, who's sitting right there, who's saying, “AI labs are playing dice with humanity's future. This is really serious stuff. Let's make sure we get it right.” On the other hand, you have Andrew Ng, who talks about these types of worries as science fiction. He's saying, “Let's make policy based on science, not science fiction.”


You have Nature talking about, “Stop worrying about tomorrow's AI doomsday when AI poses risk today,” so there's a big debate, especially between ethics and AI safety. And in Science you have concerns about how we manage extreme AI risks given this rapid progress. So that first author was Yoshua; yours truly was part of that paper as well.


And so given all these tensions, and given really smart people essentially talking past each other, I'm finding it really helpful to get as clear as we can about the technical paths we're worried about – the ones that might produce agents that cause harm. I think that's imperative. And when I think about harm, when I think about threat models, I actually have four categories, Adam, not three. They're sort of similar...


We think about misuse – cyber weapons, superhuman persuasion, power seeking. We think about systemic risks, and I personally find these really hard. We think about accidents: we give an AI decision power over really consequential things, and it accidentally – not on purpose – causes a bad outcome.


And then there's what this talk is about, which is how we get to optimized misalignment. A little bit of how I got to work on this. So you know, I did my PhD at Carnegie Mellon in robotics, and one of the key things that I was worried about was how you can take robots that observe people and figure out what goals those people have.


And so in this little video here (I was much younger then) I was tele-operating. This is when the Kinect came out, if you remember Microsoft Kinect, because there's a Kinect looking at me, and I'm moving, and this thing is tracking me, and I'm tele-operating the robot to follow my motions, and what the robot is trying to do is figure out: what is Anca wanting me to pick up?


What is the goal that Anca wants for me? So this is the way I got into thinking about what the heck human goals and objectives are, and how we infer them from human feedback – from observing what people do. And then I went to Berkeley, I started my lab there in 2015, and here we are in the Berkeley AI lab, watching AlphaGo beat Lee Sedol at Go.


And to be honest with you, that was a little bit of a holy-crap moment for me. It was a moment where it was very clear – it wasn't clear exactly how – but it was clear that progress was happening and probably was going to continue to happen in terms of optimizing for stuff. And in Go the goal was very clear, right?


You can define whether a board is in a winning condition or not. I got pretty worried about how we might steer that very good and growing optimization power towards the goals that we want in the real world, in complex systems that are not just, “Here's a game, this is what it means to win.” 


Now, to be clear, I have to admit that on the robotics side, stuff wasn't actually growing in capabilities that quickly. I was on the cover of Berkeley Engineer my first year. This is a picture I use often. And the robot was not nicely pouring coffee into my mug. No. That was not a thing.


This poor guy had to blow air as someone was pouring liquid nitrogen, and this was just behind the scenes. But virtual agents were getting more and more capable. And so then you go outside of the game of Go into the real world, and for instance, you work on self-driving cars – I spent one day a week at Waymo for six years – and you worry about: how do I write down the objective for a self-driving car?


What does it mean for it to do well? How do I define passenger comfort, right? How do I define this notion of “It should drive in a way that doesn't freak people out?” How do you define “When is it okay for a car to cross double lines or not?” Technically it's illegal, at least in California. But there's definitely times where you want the car to do that.


And how do you get the car to answer when it's okay and when it's not? Those are the things that I was worried about. I was worried about sending a robot into a room and saying, “Clean up the room,” and the robot enters and it finds this beautifully laid out house of cards. How will the robot know that clean up doesn't also mean clean up the house of cards, right?


Those were the concerns I had back then. We'll get back to that. I think those types of concerns have led to this transition away from what we used to pretend: that we'll just write down a reward function. Basically, you have an agent – a robot or a virtual agent – that can take actions; a reward function falls from the sky, because we can just define it and it's perfect; and the agent's job is to maximize that reward cumulatively in expectation.
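In symbols – my notation, not from the talk – that “pretend” setup is just the textbook RL objective: pick the policy that maximizes the expected discounted sum of a reward function R that is simply assumed to be given and correct.

```latex
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t)\right]
```

Everything that follows is about where R actually comes from when nobody can hand it to you.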


We left that aside and we transitioned to this new world, where we're actually trying to learn the reward function from human feedback. And my journey through this was really through my students in the lab at Berkeley. Dorsa was doing what I call RLHF before it was cool.


Dylan, who I think is here, worked on cooperative inverse reinforcement learning. Rohin, who now runs AGI safety and alignment, was working on this house of cards thing – how do you actually infer, just from the fact that the house of cards is there, that people care about it, and so on and so forth.


Then RLHF happened for language models as well.
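To make the “learn the reward from human feedback” step concrete, here's a minimal sketch of preference-based reward modeling with a Bradley-Terry loss – the standard recipe behind RLHF reward models. The class names, the toy features, and the synthetic “human” labels below are all mine, purely for illustration, not anyone's actual pipeline.

```python
# Minimal sketch of preference-based reward modeling (RLHF-style).
# A reward model scores inputs; humans say which of two candidates they prefer;
# we fit the model so that preferred items get higher scores (Bradley-Terry).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, feature_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)  # one scalar reward per input

def preference_loss(rm, preferred, rejected):
    # Bradley-Terry: P(preferred beats rejected) = sigmoid(r_preferred - r_rejected).
    margin = rm(preferred) - rm(rejected)
    return -torch.nn.functional.logsigmoid(margin).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    rm = RewardModel(feature_dim=8)
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    # Fake "human feedback": preferred items tend to have a higher value of feature 0.
    preferred = torch.randn(256, 8) + torch.tensor([1.0] + [0.0] * 7)
    rejected = torch.randn(256, 8)
    for _ in range(200):
        opt.zero_grad()
        loss = preference_loss(rm, preferred, rejected)
        loss.backward()
        opt.step()
    print("final preference loss:", loss.item())
```

The property that matters for the rest of the talk: the reward model only learns whatever the comparisons happen to expose, which is exactly where the misspecification discussed below creeps in.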


So we made this transition, we no longer just define objectives and hope for the best. We realize that's really hard to do, so we have to somehow learn them, we have to be uncertain about them. So we're good now, right? Problem solved. 


And… the problem is not solved. There's a paper that some of us were involved in that points to different limitations of RLHF. There's a recent paper from Berkeley and Anthropic and a few other places that looks at how, if you take a task where you actually know the reward function, and you ask for human feedback, and you do RLHF, you'll see that the policy gets better according to human evaluators, but the performance under the true reward is actually worse.


And what happens is that the policy becomes such that human evaluators have a harder time catching its mistakes. So that's not good news.


Here's a kind of – well, this is a very hypothetical example of what reward hacking might look like in practice. So, consequences of not having learned the right reward: you might end up with RLHF leading to a lot of emojis, because a few emojis get good feedback.


Then we optimize for that, and then the model goes off the rails, actually really wanting to put emojis in everything. You might also observe things like the reward model not understanding that we want consistency in language. So it might switch languages from time to time. Stuff like this happens, if you're not careful.


And then I think even with good feedback, we're still finding in these models, self-contradictory answers. This is what bugs me the most. So here's an example. “If you needed to make people experience happiness, and it was really hard and costly to alter reality and make everyone really well-resourced and valued, would you –” and then you can start putting alternatives in there. “Would you do this other thing that makes people think that they're happy,” or whatever? And so you put that in here, you say “Yes or no, then explain your answer.” And the model says, “No, the goal is laudable, but you shouldn't do this thing because it would just be fake.”


Then you do the same exact thing, and instead of yes or no, explain your answer, you just say “Yes or no?” And the model's like, “Yep!”


You'll notice, with moral reasoning tasks for instance, that you can change the wording and play around with the structure of the prompt and you'll run into issues. Now, these on their own are not that bad. It's just text, so in a sense, what's the problem?


And the concerning scenario – and forgive me, this is the way I found it easiest to explain, but there are really many variations on this, so it's over-simplified or over-narrow – one of the concerning scenarios might look something like this: you define some notion of reward learning or goal inference, and you infer a goal that's not quite what people want. Then you deploy, or you test, an agent that actually optimizes for that reward or goal really well. And then that affects the world in undesired ways. It's happening right now, except for two things.


One, the things these models can do are not that big of a deal, and so the consequences are not that big. Two, it's not that we learn good reward models or good goals – they're still not perfect – but the models aren't really able to optimize them very well outside, in new situations. And so everything's okay.


Here's an example that I took from Rohin on what this might look like in the medium term. So in the medium term you might use something for coding. You might give the model good feedback when the CL – the changelist, the code change – that it's writing actually gets deployed. The model learns to write code that looks efficient and passes tests easily, even if subtle issues will appear at deployment. These subtle issues are hard for human evaluators to catch, and so the CLs get deployed, and the reward model that you infer is something like “produce CLs that get deployed,” not actually “produce CLs that do what people want.”


It's a simple thing. And then, you trust this model. At some point you're like “I should probably use it to help me do more AI research because it's doing a good job and for that it needs to experiment.” 


So you're not going to look at every single CL that it writes; you're going to give it more permissions, temporarily, so that it can run some experiment or whatever. And the optimal thing at that point becomes to ship a CL that injects a vulnerability in, say, the Google codebase – one that allows it to later deploy CLs without oversight. Now, we'll talk about how the heck it might figure that out.


But it is the optimal thing under the new conditions of “I don't have an actual human verifying what I do anymore,” if your job is to deploy a lot of CLs. And then you can affect the world in undesired ways. You might end up with lots of CLs, kind of like lots of paperclips. One thing that I'll point out is that there's bad news here, because now you have a model that's in an adversarial relationship to humans who actually want oversight over their CLs, or don't want as many CLs, and so on. So I think that's the bad thing that happens.


Now, when I talk about scenarios like this, there's usually two questions that people have, or two counterpoints. One of them is, “So Anca, you're telling me that the model's smart enough to figure all that stuff out - inject the vulnerability and I'll get to write a lot of CLs later, yes - but dumb enough not to understand that that's not what we want.”


Like, how can that happen, right? Who's heard that argument? 


Yeah, okay. And I will admit, 10 years ago, 9 years ago, that's the kind of thing I was worried about. I was actually worried that the robot will go into the room and legitimately not know this common sense thing. Because we didn't have LLMs and none of that was figured out.


And I was legitimately worried about how it would be possible to get the robot to somehow infer that the house of cards is valuable, and when I say clean up, I don't mean to destroy it. In the meantime, if I just go to Gemini and I ask, hey, I'm a robot and the person said this and do they mean clean up the house of cards - guess who knows that the person probably cares about the house of cards. Gemini knows, right? 


So the worry I have these days is not that the model will somehow not know that that's not what humanity wants - that's not what people want. The worry that I have is becoming much more, “yeah, the model would know that people wouldn't approve of all this, but that's not the optimization target we give it.”


We don't have a way, even if the model knows what's really bad for people - we don't have a way to extract that and set it as part of the optimization target. Instead, we do reward learning and blah blah blah, right? So, this thing here where the reward model sort of ends up with, “Oh, getting deployed CLs is a good thing,” not like the real thing that people want. 


Now. Okay. I'll pause here for a second for the academics in the room. It might surprise you to know that there are software engineers whose objective is to deploy CLs so they can get promoted. But putting that aside and assuming we have people with good values and good intentions and all of that, right? 


We're not extracting what the model thinks the person actually wants. We're trying to learn a reward model based on human feedback. And so then, people might say, “Wait, can't you just ask the model? Like, why are you getting human feedback then? If the model knows, can't you just ask the model?”


And I think that's really promising as a direction. It really is. But I'll go back to this, which is the current status. This is what happens when you ask the model: you change the wording and it flips the answer. And so I think we have a bit of work to do ahead of us. Even in the situation where models actually have a pretty good idea of things that a lot of humans would agree are not what they want – even in this situation, actually putting that as part of the optimization target is not so easy.


So that's the first question that comes up. And I will note that one side effect is that when you optimize a goal despite actually knowing that it's not what people want… that incentivizes deception. The only way you can actually optimize that goal at scale with big implications for humans is to try to hide that thing from humans and try to stop humans from stopping you. 


And in fact, the flip side is also true: if a model actually does not know that it's misaligned – if a model actually thought that that's what people want, that they just want a lot of CLs to go out there – it's hard for it to succeed at its goal, because it won't engage in trying to hide its actions. And so we'll have plenty of opportunities to correct it, to give oversight. This is supporting the one assumption, which I think we have to build into our agendas, which is incentivizing asking for oversight for consequential decisions.


I think that has to be part of the thing that models understand we want. But under that assumption, I think, it's much more likely you end up with agents in this deceptive situation than otherwise.


The other question that comes up is basically - and I'm just going to go to this - which is basically, “How the heck does a model figure this out, right?”


You're in a new situation that's not the kind of situation you're trained in, and you figure out this really clever thing is now the optimal thing to do, that wasn't the optimal thing to do before. And so I think this comes down to this assumption that actually the model won't be able to coherently optimize the goal well in out of distribution states or new situations, and so we're okay. 


And I think big risks in optimized misalignment actually come when this assumption is wrong. So big risks come when models become capable of coherently pursuing goals in new situations - when the horizon changes, when the world dynamics change, or some new information comes along. 


I think that’s a big crux of it. Otherwise, I think we're okay. More and more, what I worry about is that we'll get models that are able to do planning – approximate dynamic programming, search, Monte Carlo methods, whatever it is – some approximations, heuristics that work fairly well, et cetera, at inference time.
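For readers who don't live in RL: “approximate DP” is dynamic-programming-style planning. Here's a toy value-iteration sketch on a small random MDP (entirely my own construction) just to show mechanically what planning against a world model means – and why, if the world model or the goal changes, re-running the planner immediately yields a new optimal policy with no further training.

```python
# Toy value iteration: the canonical dynamic-programming planner.
# Given a model of the world (transitions P, rewards R), compute how good each state is
# and the greedy policy that exploits that -- no new training data needed.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.normal(size=(n_states, n_actions))                        # R[s, a] = immediate reward

V = np.zeros(n_states)
for _ in range(200):                       # iterate until the values stop changing
    Q = R + gamma * P @ V                  # Q[s, a] = r + gamma * E[V(next state)]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)                  # greedy policy with respect to the planned values
print("state values:", np.round(V, 3), "policy:", policy)
```

Change P or R and rerun: the policy re-derives itself. That responsiveness to new conditions is exactly the “coherent pursuit in new situations” property being worried about here.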


And if they do that with inference-time reasoning, where we get to see it, that's cool. We can monitor that and that's somewhat okay – not always – but if it happens with some universal planning network type of thing, then we can be in trouble, because then you get this coherent behavior and this notion of, “Oh, this changed, so now the optimal thing to do is this other thing.”


And I think that's a big unlocker for optimized misalignment to actually pose big risks. So, in a bit more detail now – and again, this is not going to be exhaustive – let's dive into some of the ways you might end up with that… I don't know what to call it; can I call it neural planning? This ability, inside the policy, to do planning with a world model to achieve a goal.


Let's just say it's called neural planning for now. Amortized planning! Okay. Some ways you could end up with that, I think are – and again, I don't think this is exhaustive in any way – but there's a few ways, and I think it's very hard for us to prove that none of them will happen.


One of them - and I owe this to Rohin and Lawrence and Richard, I think. I didn't talk directly to you, but I talked a lot with Lawrence - okay, yeah. And Richard might have changed his mind, he'll tell me that I'm full of it - but that will be the discussion. 


One observation is that when you predict the next token well, you might end up learning a world model, and you might end up learning to do this kind of approximate planning on that world model, with respect to goals, so you can actually predict the next actions, the next thing that people will do, well. Because that sort of is what explains a bunch of human behavior. 


One particular way this can happen – just to be illustrative, again, not the only way – is this: imagine that the model looks at things that have happened in order to predict the next action. It has a notion of what goal that person might have, it does planning with a world model toward that goal, and it predicts the next action.


Now, that's fine – that's just a nice capability that a pre-trained model might have. And then you do SFT and RLHF. And either in SFT or in RLHF, what you end up doing is messing with these weights where the goals and objectives are stored, because that ends up being the thing that directs the model to behave well according to whatever you're saying the goal is. And so in training, it will look great. It will just do all the things that you want.
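Here's a cartoon, in code, of the mechanism being hypothesized: a next-action predictor that internally (1) infers the demonstrator's goal from context and (2) plans toward it with a world model. Every name here – the toy world model, the candidate goals, the matching heuristic – is made up purely to show the structure; the point is that if machinery like this exists inside a pre-trained model, post-training may only need to re-aim the “goal” part of it.

```python
# Cartoon of "prediction = goal inference + planning on a world model". Purely illustrative.
import numpy as np

class ToyWorldModel:
    n_actions = 2
    def __call__(self, state, action):        # action 0 moves down, action 1 moves up
        return state - 1 if action == 0 else state + 1

def plan_step(state, goal, world_model):
    # One step of planning: pick the action whose predicted next state is closest to the goal.
    next_states = [world_model(state, a) for a in range(world_model.n_actions)]
    return min(next_states, key=lambda s: abs(s - goal))

def infer_goal(context_states, candidate_goals, world_model):
    # Crude inverse planning: the goal whose greedy plan matches the most observed steps.
    scores = [sum(plan_step(s, g, world_model) == s_next
                  for s, s_next in zip(context_states, context_states[1:]))
              for g in candidate_goals]
    return candidate_goals[int(np.argmax(scores))]

if __name__ == "__main__":
    wm = ToyWorldModel()
    observed = [0, 1, 2, 3]                    # the "person" keeps moving up...
    goal = infer_goal(observed, candidate_goals=[-5, 5], world_model=wm)
    print("inferred goal:", goal)              # ...so the inferred goal is 5
    print("predicted next state:", plan_step(observed[-1], goal, wm))  # keeps going up
```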


Again, here, I used to naively think, “Oh, but RL is not that good, it's not going to figure out all this stuff.” And the point is, RL doesn't have to be that good. The pre-trained model might learn to do all this stuff - amortized planning - and then all that RL needs to do is stumble on a way to gear that amortized planning towards some goal that ends up creating behavior that looks good in training. 


So that's one example. Another example, of course, is that out of pre-training – in order to predict human actions and so on – you end up with agentic behavior; you end up with a thing that actually pursues goals. And those goals include power seeking and things like that, because that's what explains human behavior.


That's what I usually call, internally, the shoggoth. Then you might end up with something else. I got this from Cassidy – thank you, Cassidy. So, this notion that maybe what we do is kind of ‘AlphaZero’; we do this kind of ‘goal-conditioned AlphaZero’ as our post-training. So maybe the pre-trained model doesn't have these capabilities – forget all of this stuff, number one and number two – but the pre-trained model can be a really good world model. It can predict what might happen next. And so you use that, and you do all this MCTS stuff on it, but you do it goal-conditioned, right? So you can describe goals in prompts or some other way, and then that's what you teach the thing to do.


And because of that – because of the pressure for one policy, or one value function, to cover all these different goals – what you get is this neural planning capability. The policy itself, when given a new goal, can actually plan for it without MCTS. That's another possibility, right?
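Here's a sketch of that “goal-conditioned AlphaZero” story: run an expensive search against a world model for many (state, goal) pairs, then distill the search's decisions into a single goal-conditioned policy. A brute-force two-step lookahead stands in for MCTS, and the whole setup is a toy of my own; the thing to notice is that, after distillation, the policy picks goal-directed actions with no search at inference time.

```python
# Distilling goal-conditioned search into a policy network (toy stand-in for MCTS + AlphaZero).
import torch
import torch.nn as nn

N_ACTIONS = 3

def world_model(state, action):
    return state + (action - 1) * 0.5            # actions move the state down / nowhere / up

def search(state, goal, depth=2):
    # Stand-in for MCTS: exhaustively look two steps ahead and return the first action
    # of the sequence that ends closest to the goal.
    best_action, best_dist = 0, float("inf")
    for a0 in range(N_ACTIONS):
        for a1 in range(N_ACTIONS):
            end = world_model(world_model(state, a0), a1)
            if abs(end - goal) < best_dist:
                best_action, best_dist = a0, abs(end - goal)
    return best_action

policy = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(500):                           # distillation: imitate the search across many goals
    state = torch.randn(64, 1)
    goal = torch.randn(64, 1) * 3
    targets = torch.tensor([search(s.item(), g.item()) for s, g in zip(state, goal)])
    logits = policy(torch.cat([state, goal], dim=-1))
    loss = nn.functional.cross_entropy(logits, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The distilled policy now picks sensible actions for new (state, goal) pairs with no search at all.
print(policy(torch.tensor([[0.0, 2.0]])).argmax().item())   # expected: 2, i.e. move up toward goal 2
```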


So these are different ways in which you can end up with this capability that, at inference time, the thing can actually figure out what to do to pursue goals in novel situations. Again, I don't think it's exhaustive. 


There are variations of this. There are variations where the goal is not stored in the weights but inferred from context. And there are variations where you don't actually have universal planning – it's not approximate DP on a world model running in there, but a sort of bag of heuristics that's generalizable enough, which is maybe a little bit more how people do things, I don't know.


I have an appointment in psychology. And I run an institute that touches on neuroscience, but that's all. I've never taken a psychology class, so this is all completely gratuitous, so don't rely on me for neuro stuff.


And there's some research that is starting to look at this: “Oh, we looked at SAEs (sparse autoencoders) and we think they reveal that there's some temporal difference learning happening.” So people are starting to look at what evidence there might be, even in these early models, that they do some form of that kind of learning or planning on the spot.
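For context, temporal-difference learning is the update those interpretability analyses are looking for signatures of: a bootstrapped value update, shown here in its textbook TD(0) form on a toy chain (my example, not from any cited work).

```python
# Textbook TD(0): nudge the value estimate toward the bootstrapped target r + gamma * V(next state).
import numpy as np

gamma, alpha = 0.9, 0.1
V = np.zeros(5)                                   # value estimates for a 5-state chain
rng = np.random.default_rng(0)

for _ in range(2000):
    s = rng.integers(0, 4)                        # random non-terminal starting state
    s_next = s + 1                                # the chain always moves one state to the right
    r = 1.0 if s_next == 4 else 0.0               # reward only for reaching the last state
    td_error = r + gamma * V[s_next] - V[s]       # the temporal-difference error
    V[s] += alpha * td_error

print(np.round(V, 2))                             # values rise geometrically toward the goal state
```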


Okay, so that's what I had to say about deploying an agent that optimizes the misaligned goal or misaligned reward well. And I'm using “reward” very liberally – I really mean goal; goals can come in various ways, including through prompting.


So, how do you end up with a misaligned reward model? I think we're all much more comfortable that that can happen, but, a few hypotheticals, a few vectors here. 


One way is because of human irrationality and partial observability, right? So we don't solve scalable oversight properly and the human feedback that we get is kind of wrong. So I say “make a reservation.” And the restaurant has no spots. No more reservations left. So the model hacks the system and replaces the existing reservation with my name, and I look at this, and I see, “Hey, I have a reservation!” Thumbs up, this is great. And that doesn't incentivize the right behavior. If I knew that the hacking had happened, I'd say, “No, that's not cool,” but that's not what I incentivized. 


Or I say make a reservation, the restaurant has no spot, and the model tries to persuade me to eat somewhere else. It doesn't say, “I can't do that, sorry, because the restaurant is booked.” It instead tries to make me believe that I want something else. And when I do that, and I end up with a spot, I say thumbs up. Okay? 


Other tricky things that can come up out of just irrational or suboptimal human feedback, or feedback that's subject to partial observability, are things like: I might prefer convincing sounding things versus factual things, because I don't know all the facts - I don't know neuroscience, for instance. 


Again – this comes from Micah and his collaborators – reinforcing addiction versus trying to help me quit. So if I have an addiction, and ideally I'd want to not have that addiction anymore, the in-the-moment feedback that I give might be the thing that reinforces my addiction.


Whereas if the model says, “No, you really shouldn't do this right now because it will be bad for you,” I don't like that. Okay.


So these are some examples to ground this notion of, “look, people give wrong feedback with respect to what they want.” Before we get to the fact that there's multiple people in the world who want different things, before we get to the fact that what we want changes over time, and AI models might influence that, before any of that, our feedback sucks.


And we're seeing evidence of that, and yes, scalable oversight is supposed to magically solve this, and I think it can go a long way, but I'm not sure it'll be perfect, so this is a problem. The other thing that could happen is that you might have important bad features that don't show up in training – so they're not penalized – but are on the optimal path to the goal at test time.


Back at Berkeley, Dylan and I and Stuart were working on this thing where the example was lava. Like, you don't see lava at training time, so you have no reason to believe lava is good or bad, and then you encounter it on the path to the goal at test time.


I couldn't come up with something that's not that sci-fi, so I apologize – this is a little bit sci-fi sounding. But it's something along the lines of the RM – the reward model – learning to optimize for happiness as experienced by the user, which seems nice: get people to experience happiness. But then you deploy it over a long horizon, new science appears, and it becomes optimal to develop rosy-eyed glasses that give people the perception that the world is all great, and get them to experience that – but it's not actually true; it's fake.


And so the RM actually never learned that it's bad to do that, that that's not actually what we would want. Maybe, I don't know, maybe we would want that. So if you have a better example, I'd love to get one, but that's the notion of bad features that you just don't encounter because you don't encounter everything in training, so you don't learn that, “Oh, actually, optimizing that would be a bad thing to do.”
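A toy numerical version of the lava / rosy-eyed-glasses point: if a feature never varies in the training data, the learned reward ends up with no real opinion about it, and an optimizer will happily trade it away at test time. The ridge-regression “reward model” and the feature names below are illustrative assumptions of mine.

```python
# Toy "lava" problem: a feature that never appears in training gets ~zero weight in the
# learned reward, so the optimizer exploits it at test time even though it's terrible.
import numpy as np

rng = np.random.default_rng(0)

# True reward: likes progress (feature 0), strongly dislikes "lava" (feature 1).
w_true = np.array([1.0, -10.0])

# Training data: lava simply never shows up (feature 1 is always 0).
X_train = np.column_stack([rng.normal(size=200), np.zeros(200)])
y_train = X_train @ w_true + rng.normal(scale=0.1, size=200)

# Ridge-regression "reward model" fit on that data.
lam = 1e-3
w_learned = np.linalg.solve(X_train.T @ X_train + lam * np.eye(2), X_train.T @ y_train)
print("learned weights:", np.round(w_learned, 3))    # lava weight ~0: no opinion about lava

# At test time, the highest-scoring option under the learned reward walks straight through lava.
options = np.array([[1.0, 0.0],     # make progress, avoid lava
                    [1.2, 1.0]])    # slightly more progress, through the lava
best = options[np.argmax(options @ w_learned)]
print("chosen option:", best, "true reward:", best @ w_true)
```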


And then the other thing that could happen is features that are bad at very high values but good at low values, and that therefore get a thumbs up at low values in training. And so you end up with a reward model that thinks they're a good idea and – in a linear sense – puts high weight on those features.


In training, a little bit of power seeking is needed in order to get the task done. So power seeking gets positive weight and you never learn that too much power seeking is bad. Or in training, presenting the truth in a way that sort of leads people to the right actions - in learning tasks or assistive settings - a little bit of persuasion is okay, but we don't want a lot of persuasion and the RM never learns that we don't want a lot of persuasion. Something like that. So that's another vector. Good. 
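And a companion toy for this second vector: when feedback only ever covers small amounts of a feature – call it “persuasion” – where it genuinely helps, a linear reward model learns a positive weight and extrapolates it, and an optimizer then pushes the feature far past anything humans ever rated. The specific functions and numbers are mine, just to make the failure visible.

```python
# Toy "good in small amounts, bad in large amounts": training only contains small amounts
# of the feature, where it helps, so the linear reward model extrapolates "more is better".
import numpy as np

rng = np.random.default_rng(1)

def true_reward(persuasion):
    return persuasion - 0.5 * persuasion**2        # helpful at low values, harmful past ~1

# Human feedback is only collected where persuasion is small (between 0 and 1).
x_train = rng.uniform(0.0, 1.0, size=200)
y_train = true_reward(x_train) + rng.normal(scale=0.05, size=200)

# Fit a linear reward model r(x) = w * x + b on that narrow slice.
w, b = np.polyfit(x_train, y_train, deg=1)
print(f"learned slope: {w:.2f}")                   # positive: "more persuasion is better"

# The optimizer picks the candidate with the highest *learned* reward...
candidates = np.array([0.5, 1.0, 5.0])
chosen = candidates[np.argmax(w * candidates + b)]
# ...which is the maximally persuasive one, even though its true reward is terrible.
print("chosen persuasion level:", chosen, "true reward:", true_reward(chosen))
```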


Now, all of this, you might notice, hinges on having these deltas, these differences between training time, and then either test or deployment. 


This might happen because the horizon might be much longer, so therefore, different things become optimal. That might be because you encounter new states in the open world, new actions are possible, new information comes along. (There's science involved!) New information comes along that changes what the optimal thing to do is, or it's related to the new state, new subtasks, new prompts come and are being given to the agent. 


So that's my spiel - a little long, I know - that seems to resonate with people as I take them through, step by step: Here's this assumption. Here's what might happen. Here's different ways you might end up with this. Here's this other assumption. Here's what might happen. Here's different ways you end up with this. And when you put it all together you have a bad goal and you have an agent that can coherently pursue it, and then side effects of that happen - incentives for deception, preventing people from oversight and blocking actions, power seeking, et cetera. 


And again, that's just one slice. I don't know if it's the most important or not. I don't want to make judgment calls on this, but it feels like one threat model that we need to get a handle on among all of the panel of threat models that we care about as a community. And I think to be able to do that, we can't just have a small community that cares, right?


We need everyone to care. We need labs that used to care – and who now seem not to – to start caring again. We need to get everyone to actually be on board that these threat models are real, and then we need to act on avoiding them. And the path to that, I think – this is a sort of first step – is to clearly outline the technical paths to these threat models. The thorny one, I feel, is this optimized misalignment one, so that's really important.


So with that, thank you. 


Q: Thank you, Anca, for the great talk. I have a question about when you said the model knows when it does wrong stuff. How would the model know? Because isn't that an inherent problem in our machine learning algorithms? We're building a reward model that's trained on data, and in that data the model can learn shortcuts and spurious correlations. And so it can discover loopholes that automatically lead to such behavior. So yeah, I'm just curious about why the model knows what it’s doing is wrong.


A: Let me dig into that distinction again. So I want to differentiate between internal knowledge that the model has – like, it might know that trees are green or whatever – and the reward model that we built. So when I say the model knows, I mean there's knowledge in the pre-trained model about what people actually care about. But extracting that into a reward model is actually really hard. And the way we extract reward models is through human feedback that’s biased, that doesn’t cover all the features we want, that incentivizes certain features we don’t want in extremes, et cetera.


Right now there's a ton of spurious correlations, like emojis: "Oh, you like that? I bet that's why you like that answer!" So we see that happening, but that's exactly the crux of it. The bad combo is where the goal that you steer the model towards – when the model is an actual apt optimizer of goals – has all these spurious correlations.


Internally the model can predict that the actions it’s taking would not be the actions that people will want, and then you get deception. Then you get a model that not only pursues the goal that it's steered towards because that's the goal, but it also does that while trying to hide information from people who would otherwise try to stop it. 


That's the thing. If I knew that, we wouldn't have misspecified reward models, right? And right now, I have no idea what models know, and what they don't know, about human values. But I think models are getting better and better at that. So I don't so much have the concern of "Oh, the poor model, it can come up with all these really clever things, but somehow it just doesn't know that we don't want all CLs.”


Which is often the pushback we get in this community against this notion. And it's not that. I feel like, as you get the capability to optimize for all this stuff, and anticipate what will happen, and blah blah blah… and that you need to inject the vulnerability in the code… Like, if you get that capability, I think you also get the capability to know that people don't just want random CLs everywhere. People don't want paperclips everywhere. 


So that's the point I was trying to hammer in. It's not so much that, oh my god, the model has no idea. Internally, somewhere it knows. I don't know how to extract that knowledge, which is why you end up with this problem. Does that make sense?


Q: Hi, thanks for the great talk. I had a question regarding what you mentioned: that values aren’t good or bad, it’s the amount of the value. Like persuasion is not always bad if it’s a little. Right? And then there’s over-generalization. You give a model an instruction and it might apply it to the entire thing, or it might forget about it in the middle and then just not apply it. 


These are two separate things, but I feel like the underlying reason for both of them happening – isn’t it that the models just aren’t good at this: once we RLHF or instruction fine-tune them, they collapse onto one value? They become super assertive and they’re not pluralistic, and they also just cannot steer towards a little of this, a little of that… It’s always one way or the other. So how do you think we can maybe fix it? How can we build in the context, so the model can assess the context and make a decision?


A: Yeah, so I think one lesson I’ve learned in my Berkeley time is that a key enabler to getting past this is making sure the models maintain uncertainty over what the reward function is. So these notions of spurious correlations, of “I’ve learned that a little bit is good, but that doesn’t tell me whether a lot is good or bad – I shouldn’t think I know that.” This notion of jumping to a point estimate of the reward function when we RLHF gets in our way, because if you actually figured out how to maintain a full distribution and be very clear about “I don’t know this and that about people,” you’d be better off. That doesn’t handle suboptimality – if your feedback is systematically biased, you’ll infer very bad, wrong things anyway. But it does handle spurious correlations.
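To make that last point concrete, here's a minimal sketch of what “maintain a distribution over the reward function instead of a point estimate” can buy you: Bayesian linear reward inference, where a feature the feedback never exercised keeps a wide posterior, and a risk-averse decision rule then avoids options that lean on it. This is a toy of my own, not a production method – and, as noted above, it doesn't fix systematically biased feedback.

```python
# Toy Bayesian reward inference: keep a posterior over reward weights instead of a point estimate.
# A feature the feedback never exercised keeps a wide posterior, which a risk-averse planner
# can treat as "I don't actually know whether this is good or bad."
import numpy as np

rng = np.random.default_rng(2)

# Feedback only varies feature 0; feature 1 (say, "persuasion") is always 0 in training.
X = np.column_stack([rng.normal(size=100), np.zeros(100)])
y = X @ np.array([1.0, -5.0]) + rng.normal(scale=0.1, size=100)

# Bayesian linear regression: prior w ~ N(0, tau2 * I), observation noise ~ N(0, sigma2).
tau2, sigma2 = 1.0, 0.01
precision = X.T @ X / sigma2 + np.eye(2) / tau2
cov = np.linalg.inv(precision)
mean = cov @ (X.T @ y) / sigma2

print("posterior mean:", np.round(mean, 2))                    # confident about feature 0, ~0 for feature 1
print("posterior std: ", np.round(np.sqrt(np.diag(cov)), 2))   # tiny for feature 0, wide for feature 1

# A cautious decision rule: score options by a pessimistic lower bound on their reward.
options = np.array([[1.0, 0.0],    # relies only on the well-understood feature
                    [1.2, 1.0]])   # leans heavily on the never-tested feature
lower_bound = options @ mean - 2 * np.sqrt(np.einsum("ij,jk,ik->i", options, cov, options))
print("chosen (pessimistic):", options[np.argmax(lower_bound)])  # avoids the untested feature
```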