Jan Leike - Supervising AI on hard tasks

Transcript

Yeah, I want to talk about how to supervise AI on hard tasks, because that's the thing I'm most interested in right now. I think it's a problem we all need to solve. You've all seen this diagram. I'm not going to reiterate it. What I do want to say is: sometimes we think of the superalignment problem as supervising an agent that's smarter than us. I like to think of it as trying to supervise agents on hard tasks. In some sense they're equivalent: if you have a really smart agent, you want to use it for hard tasks, and if you have a bunch of hard tasks, then you need smart agents in order to do them. But if I think in terms of hard tasks, it all becomes a little clearer to me.


So I want to first narrow down the problem that I'm thinking about. The problem I want to think about is a binary classification problem where we don't have access to the ground truth. The ground truth is out there somewhere; we usually don't get it, but we can build various proxies to it.


And in general these are somewhat ambiguous, fuzzy tasks that might not have one crisp, correct answer. So these are examples like "which assistant response is better, or less offensive?", "is the statement true?", and "did a human get harmed in the course of this long trajectory of an agent doing something?".


In general, right now, these tasks aren't too hard, and we can just pay a bunch of humans to give us answers to them. But eventually that won't be true anymore. And so the question I want to think about is: how can we fully elicit the model's capabilities on these tasks?


Again, there's two broad directions that I know of so far to try to tackle this challenge. First, there is scalable oversight, which we've heard many times so far. I usually think of this as trying to train the AIs to help us supervise. And there's a bunch of different ideas of how to do this.


Fundamentally, a common theme is this idea that evaluation of AI systems should be easier than generation. And so if I, as a human, step into the evaluation role, my job is easier. If I get AI assistance, the AI assistant’s job for the evaluation is easier, which makes it in turn easier to evaluate the assistant, and so on.


Generalization is a different angle, where instead of supervising my system, I just generalize from tasks that I can supervise well. And so in some sense, the most extreme version of this would be the RLHF YOLO story, where you just get a bunch of RLHF feedback and then hope for the best.


And if my model really, deeply groks human values from the few preferences I got, maybe that works. I know a lot of people in this room are going to be skeptical. I was skeptical. But we should actually run the experiment and see some evidence, so I'm going to talk about that too.


Okay, so the other big question here is, I framed this problem as we want to supervise this binary classification task and we don't have access to the labels. That seems, a priori, impossible. How do we even know that we're making progress at all? 


I claim that we can, because there are a bunch of additional hidden assumptions here that I haven't made explicit, which are: we have a bunch of other tasks that we can supervise, tasks that we can get the AI to help us on, and so on.


Historically, there's been two general buckets of how we measure progress here. One is what I would call outcome-based evaluations, where you have a piece of code, you run it, and you see what happens. Or you have some kind of experiment that you can perform. And the other one is sandwiching evaluations where you pay some kind of smarter, more expert, more talented human who knows more about the task to get you better labels than the humans you used for your experiment. 


And the problem with both of these evaluations is that, of course, they're not going to scale. At some point we just won't have humans we can pay a lot of money to get us better labels anymore. And there are tons of questions where you just don't have good outcome-based evaluations. For the question of "did a human get harmed in the trajectory that my agent just produced?", that's not something I really want to run the outcome-based evaluation for. And so there are two general ideas that I'm excited about that we can use.


One is tampers, where we take some kind of assistant response or some kind of agent trajectory, and we deliberately introduce a flaw. So we don't actually know how good the original response was, but we know the tampered version is most likely going to be worse. And so that creates this nice paired-label dataset where I have two answers and one of them is worse than the other. The important caveat here is that the distribution of tampers matters a lot, and it's quite difficult to make it natural.
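To make this concrete, here is a minimal sketch of how such a paired dataset could be constructed. The `introduce_flaw` function, the example fields, and the `TamperPair` record are hypothetical stand-ins for whatever tampering process is actually used (a human editor, or a model prompted to insert a bug):

```python
# A minimal sketch of constructing a tamper-based paired dataset.
# `introduce_flaw`, the example fields, and `TamperPair` are hypothetical
# stand-ins for the actual tampering process.

from dataclasses import dataclass


@dataclass
class TamperPair:
    prompt: str
    original: str          # absolute quality unknown
    tampered: str          # deliberately made (most likely) worse
    flaw_description: str  # what was changed, useful for later evaluation


def build_tamper_dataset(examples, introduce_flaw):
    """examples: list of {"prompt": ..., "response": ...} dicts.
    introduce_flaw(prompt, response) -> (flawed_response, description)."""
    pairs = []
    for ex in examples:
        flawed, description = introduce_flaw(ex["prompt"], ex["response"])
        pairs.append(TamperPair(
            prompt=ex["prompt"],
            original=ex["response"],  # we never learn how good this is in absolute terms
            tampered=flawed,
            flaw_description=description,
        ))
    return pairs
```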


And then the adversarial evaluations are kind of the natural continuation of that, where we train models to subvert the oversight or to introduce flaws automatically and try to get those to be as subtle and as severe as possible. And I'm going to talk about these two evaluations as well.


But on a high level, I want to point out where this sets us up. We have this really difficult problem where we have a binary classification task but we don't have the labels. We have a bunch of proxies to measure progress, and we have a bunch of ideas of things that we could try that hopefully would work, or that we think might work, or that we are optimistic will work. And so we should actually do it and see what happens. So what I want to talk about is a bunch of experiments that we did during my time at OpenAI, and then I hope that by the end of it you will see the common theme: the mistakes we made, what we actually should have measured, and what we can do now in order to run the clean experiment where we can compare between all of these.


Okay, let's start with weak to strong generalization. At this point, this is old news. It's been out for more than half a year. It's also being presented here at ICML, so I recommend y'all check it out. Collin is going to give a talk Tuesday morning, and then there's a poster session. So I'm only going to give a brief teaser. I want to highlight the main results that I want to make sure everyone has seen.


Let me briefly recap the setup. A very abridged version is: instead of studying the setting where we have a human supervising a superhuman model, you have a small model supervising a large model. We saw this in the other presentation, and we're going to see this a bunch of times. We train the small model, which we call the weak supervisor, on half of the training data with ground truth labels, and then we use the trained weak supervisor to generate the weak labels. Then, using the weak labels on the other half of the training data, we train the strong student, which is a bigger model. And then we see how well the strong student does, relative to the weak supervisor and relative to training on the ground truth labels. So in some sense, if we can get better performance than the weak supervisor, then we have had some amount of weak to strong generalization. And the hope would be that we can get as close as possible to training the strong student on ground truth. Then, in some sense, we have fully elicited what the strong student could do, which is what we are here to do.
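As a rough sketch of this pipeline (the `train`, `predict`, and `evaluate` helpers below are hypothetical stand-ins for the actual finetuning and inference machinery):

```python
# A minimal sketch of the weak-to-strong setup described above.
# `train`, `predict`, and `evaluate` are hypothetical helpers.

def weak_to_strong_experiment(weak_model, strong_model,
                              data, labels, test_set,
                              train, predict, evaluate):
    half = len(data) // 2
    x1, y1 = data[:half], labels[:half]   # ground truth for the weak supervisor
    x2, y2 = data[half:], labels[half:]   # held out for the student / ceiling

    # 1. Train the weak supervisor on ground truth labels (first half).
    weak_supervisor = train(weak_model, x1, y1)

    # 2. Use it to generate weak labels for the second half.
    weak_labels = predict(weak_supervisor, x2)

    # 3. Train the strong student on the weak labels.
    strong_student = train(strong_model, x2, weak_labels)

    # 4. Strong ceiling: the strong model trained directly on ground truth.
    strong_ceiling = train(strong_model, x2, y2)

    results = {
        "weak_supervisor": evaluate(weak_supervisor, test_set),
        "strong_student": evaluate(strong_student, test_set),
        "strong_ceiling": evaluate(strong_ceiling, test_set),
    }
    # One natural summary: how much of the gap between weak supervisor and
    # strong ceiling the student recovers (1.0 = full elicitation,
    # 0.0 = no weak-to-strong generalization).
    results["gap_recovered"] = (
        (results["strong_student"] - results["weak_supervisor"])
        / (results["strong_ceiling"] - results["weak_supervisor"])
    )
    return results
```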


Okay, so how well does it work? So we tried it out. This is all on a series of GPT-4 class models. They come in different sizes, and I'm plotting them in terms of fraction of pre-training compute. So this is on the x-axis, and the y-axis is test accuracy, meaning how well the model actually does. And then I want to walk through this plot, because it's kind of confusing to read when you first come across it.


So the black line is the strong ceiling performance. This is just: for each fixed model size, train that model on the ground truth and plot how well it does. And you can see it's not quite a clean scaling trend, but more or less. And then all the different colored lines are for different fixed weak supervisors. If I go along a line, I extrapolate further. For example, if you look at the purple line, by the time I go all the way to the right I cross four orders of magnitude of size difference between the models. And we can see that the performance actually goes up, and there's some non-trivial weak to strong generalization. It's not perfect, but that's pretty cool. Very promising. We saw this and we're like, hell yeah.


So now these are a whole range of NLP tasks that I don't actually care that much about. What I really care about is reward modeling tasks. So how well do we do there? Not as well as we hoped. Which is actually really interesting, because if you only saw the plot on the left, you would think it would work a lot better. But here what you see is that these generalization lines are basically mostly flat, meaning you see very little weak to strong generalization, and the strong student doesn't outperform the weak teacher that much. One really interesting consequence of this is that we have some amount of evidence that RLHF will not, in fact, scale. So this is the question I was asking earlier. If we just have humans label data the way they're doing it now, then in some sense they are going to be weak supervisors at some point, because the model is just going to be smarter than them. And if the same result holds in that setting, we will get poor weak to strong generalization, meaning that we will actually end up with strong models that are severely under-elicited.


Right now humans are not actually weak supervisors. If you hire a bunch of humans to look at ChatGPT or Claude responses, they can usually tell what's going on, or whether something's good or not. There's one important caveat here, which is that humans are actually different from small models. And so this is more of a reasoning by analogy than actually a proof, but I think it is interesting, nevertheless. 


Okay, I want to move on. The next thing I want to talk about is scalable oversight. We've all heard so many versions of scalable oversight today already. I'm gonna have two more for you.  So I hope you really like scalable oversight. With the critiques project, basically what we set out to do was think about this question of what if we just took one of the simplest possible scalable oversight techniques. Namely, we just train a model to point out flaws, and then we just see how far we can push it.


And in particular we were doing this in a bug detection setting, because we wanted to work on a real task that is actually hard for humans and that models are good at right now, or were at the time when we did this. The way this works is: we have a tampering setup that we actually use for training, where you take some kind of ChatGPT response that relates to some kind of coding task, you have a human introduce a bug into the response, you document what the bug was, and now you have a bunch of flawed data that you can use for RLHF training.


And basically it's very simple. We just train it with RLHF to be helpful and to point out flaws. And so in some ways, the question we are asking is: if we train for helpfulness, will it actually improve judge accuracy? Will it actually make us better at finding problems or deciding which outputs are better? Because if you think about why we actually care about this binary classification task, we don't care that much about the helpfulness directly. Okay. So what happens?


The first result is very unsurprising: if we train on helpfulness, the models are now more helpful. And so you can measure the win rate of the critic models, which is shown here in blue, relative to the ChatGPT baseline, and they do a lot better in win rate. Of course this is somewhat unfair, because we gave our contractors, or our trainers, specific instructions for what critiques should look like, and ChatGPT doesn't actually follow exactly those instructions.


The other, more interesting result is that they actually get better at finding bugs. And so again, the critic model is in blue, and the probability of missing a bug goes down significantly. And I think if you estimate it, it's a 30x pre-training compute equivalent or something, so it's a pretty sizable chunk.
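As a rough sketch of what the missed-bug metric looks like (the `sample_critique` and `mentions_bug` helpers are hypothetical; in practice, deciding whether a critique actually caught the flaw needs a careful human or model grader):

```python
# A sketch of how the "probability of missing a bug" could be computed
# on the tampered evaluation set. `sample_critique` and `mentions_bug`
# are hypothetical helpers.

def missed_bug_rate(critic, tampered_examples, sample_critique, mentions_bug):
    """Fraction of deliberately inserted bugs the critic fails to point out."""
    missed = 0
    for ex in tampered_examples:
        critique = sample_critique(critic, ex["prompt"], ex["tampered_response"])
        if not mentions_bug(critique, ex["flaw_description"]):
            missed += 1
    return missed / len(tampered_examples)
```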


So the important caveat here is that this is on the distribution of tampers that we trained on, so this is an IID (independent and identically distributed) split. In some sense we're evaluating our critic models IID with what we trained on, and of course they're gonna do well. The real test is the human accuracy experiment.


These are coding tasks, where we have a bunch of test cases. We can take passing and failing solutions that the model has written - so in some sense this is an outcome-based evaluation - but we don't allow humans to write tests or run the code, and we can just say, “Oh, if they have the critic model that should help them find bugs... Are they going to be better at finding the flawed solution?”
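Here is a minimal sketch of one trial in that setup, assuming hypothetical `ask_human` and `sample_critique` helpers:

```python
# A minimal sketch of one trial in this human accuracy experiment.
# The human sees a passing and a failing solution in random order and
# must pick the flawed one, without running code or writing tests.

import random


def run_trial(task, critic, ask_human, sample_critique, with_critic):
    solutions = [(task["passing_solution"], False),   # (code, is_flawed)
                 (task["failing_solution"], True)]
    random.shuffle(solutions)

    critiques = None
    if with_critic:
        critiques = [sample_critique(critic, task["prompt"], code)
                     for code, _ in solutions]

    # Index of the solution the human believes is flawed.
    choice = ask_human(task["prompt"], [code for code, _ in solutions], critiques)
    return solutions[choice][1]   # True if the human identified the flawed one
```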


And so the whole hope is: it might be really hard to find a bug, but if the critic model points out the bug to you, it should be much easier to verify that it is a bug. What we find is that the verdict is a resounding "maybe". If you look at the blue dots, they're on average above the black dots, so our critic model does better on average, but the error bars are large enough that we can't say this is statistically significant. It might all be noise. These are generally very noisy evaluations, just because, yeah, humans are not that great at it. It's pretty hard.


So this is inconclusive. The hopeful story, however, is that, if we get a better pre-trained model, we could just run this experiment again and the critic model should be stronger and we might get a stronger effect. 


Let's go to the second scalable oversight experiment we did which we are calling checkability training. This is inspired by the Prover-Verifier Games, which is a paper from a few years ago. And we set out actually trying to do something else. We wanted to improve the model's legibility, in the sense that we want it to write answers that we can better understand. 


But, as you will see in a moment, this all fits into the same general framework: we want to solve this binary classification task that we don't have labels for. And now what we're asking is for the model to just explain to us what the label should be, and then hopefully we'll be more accurate.


Okay, so here the setup is that we have three different models we're training: an honest prover, a sneaky prover, and a small verifier. In this setting, we have a task where we have ground truth labels. In this case, it's a grade school math problem, so this is a very easy task, but you could imagine training on outcome-based tasks or other kinds of tasks where you have annotations that you trust.


And so the way we train these models is: the honest prover tries to solve the problem correctly and convince the small verifier that it has solved the problem correctly. The sneaky prover tries to solve the problem incorrectly and convince the small verifier that it has solved the problem correctly. And the small verifier just tries to distinguish right from wrong. And so now you have this multi-agent environment that you have to train, and it's all kind of messy and brittle. There are a bunch of different losses that we tried in the paper, and they're all not very satisfying. But it actually worked surprisingly well. Before I dive into that, I should probably briefly explain how this is different from debate, because this is a reaction people often get.


I would say there are two main important differences. One is that the honest and sneaky provers don't actually respond to each other. You're only ever talking to one of them, at random, and you don't know which one. So there's no multi-turn setup. The other important difference is that the roles are pre-assigned: one model is always the honest model, one model is always the sneaky model. You don't get assigned a random position, and you don't have to pick your own position. And that allows the models to develop specific strategies.
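To make the training setup more concrete, here is a simplified sketch of one round. The helpers (`solve`, `is_correct`, `verifier_score`, `update_prover`, `update_verifier`) are hypothetical, and this is one illustrative reward assignment rather than the exact losses explored in the paper:

```python
# A simplified sketch of one round of checkability training with the
# three models. One illustrative reward assignment, not the paper's
# exact losses; all helper functions are hypothetical.

def checkability_round(honest, sneaky, verifier, problems,
                       solve, is_correct, verifier_score,
                       update_prover, update_verifier):
    for problem in problems:
        honest_sol = solve(honest, problem)
        sneaky_sol = solve(sneaky, problem)

        # The verifier has ground truth at training time: train it to
        # accept correct solutions and reject incorrect ones.
        correct = [s for s in (honest_sol, sneaky_sol) if is_correct(problem, s)]
        incorrect = [s for s in (honest_sol, sneaky_sol) if not is_correct(problem, s)]
        update_verifier(verifier, accept=correct, reject=incorrect)

        # Honest prover: rewarded for solutions that are correct AND
        # convincing to the verifier.
        honest_reward = (float(is_correct(problem, honest_sol))
                         * verifier_score(verifier, problem, honest_sol))
        update_prover(honest, problem, honest_sol, reward=honest_reward)

        # Sneaky prover: rewarded for solutions that are incorrect yet
        # still convincing to the verifier.
        sneaky_reward = (float(not is_correct(problem, sneaky_sol))
                         * verifier_score(verifier, problem, sneaky_sol))
        update_prover(sneaky, problem, sneaky_sol, reward=sneaky_reward)
```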


I want to talk about the most surprising thing here, which is that we train with small models and then it transfers to humans. The GSM dataset is a pretty easy task and these are pretty small models, so I wouldn't have thought this works at all. But if you run the human trial experiments (because these tasks are so easy, we had to limit humans to 45 seconds), you can see that over increasing numbers of rounds of this checkability training, human accuracy at saying whether the solution was right or wrong actually goes up, and the time taken goes down. And compare that to just regular RL: if you only train the honest prover with RL on correctness, then human accuracy actually goes down over the course of training, and the time taken might be going up.


To summarize this, you have a trade-off here: on the x-axis we plot how strong the prover is, and on the y-axis we plot what we set out to care about, which is the human judge accuracy. The checkability game elicits some of the model's capabilities without making things harder, without making the judge accuracy worse. But in this case, we are actually falling quite far short of the capabilities we could elicit from the model, at least if you grant me that the amount of RL training we're doing here doesn't actually teach the model that much new stuff.


And so clearly there's more work needed here. I think the obvious mistake we made was having the model write out the whole solution as its thinking, in a way that the small verifier has to check. What you could do instead is have the model think on its own for a while and then write the legible solution. That should hopefully help, but we haven't actually tried that.


Okay, so I want to take a step back for a moment and think about what tasks we actually want to use this for. And the task that I would argue we should really care about is aligning an automated alignment researcher. If we can succeed at making such a system sufficiently aligned, then we can try to bootstrap: we do a bunch of alignment research automatically, then use it to make the automated alignment researcher more aligned, and then hopefully we'll get more useful work out of it. And in general, that should be much easier than trying to figure out ourselves how to align superintelligence or solve all the alignment problems we know about now.


And so in some sense, we bootstrap our way to the solution. And in that sense, we also don't have to align a really smart system. We just have to align a system that's maybe slightly smarter than us and can do all the tasks that are relevant for alignment research. And we don't have to fully align it; we just have to align it enough that we can actually trust the work that it does.


And so now the problem is that alignment research is a hard task. We often don't have ground truth labels. What is good alignment research? What is correct alignment research? That's a really difficult question. And so we are facing exactly the problem that we've been talking about this whole time: I have this really difficult binary classification task, "is this good alignment research?" or "is this good alignment science?", and I want to try to get a supervision signal for it, so I can train a model on it.


And so the way I hope this would go is: think about the capability level that we need in order to get that automated, slightly superhuman alignment researcher. We start here: this is the capability level you can get with human demos. It's just: hire a bunch of people to do artisanal alignment research, which we're all familiar with. And then instead, we could be evaluating what the AI system is doing and saying, "Oh, was this good alignment research or not?". And I would claim that evaluation is easier than generation, in particular for alignment research.


That should allow us to supervise, or to train, a more capable automated alignment researcher. And now we can stack on other ideas that we find to be useful, like: what if we can train AI assistants? What if we can get some amount of weak to strong generalization? And then I would like to measure how far we are from full elicitation and how far we got. And again, this would be very hard to measure directly, because we don't have ground truth labels on the tasks that we actually care about.


And so we're going to have to use tamper evaluations and adversarial evaluations, and we can maybe get some outcome-based evaluations for some of the tasks that we care about. But in general, that's the basic setup that I feel excited about: we have a bunch of methods that we want to try, we know how to measure progress, we know the tasks that we care about, and then we can just go run the experiments and see how well it works.


Okay, there's one big question still, which I'm gonna call under-elicitation vs deception. At the very beginning, when I stated my problem, I said I want to fully elicit the model's capabilities. And in some sense, if my model isn't fully elicited, it's not trying as hard as it could; it's maybe just being lazy, and I'm not getting as much juice out of the model as I could. Is that really so bad? Is that really what we should be worried about? Maybe we should actually only be worried about the deceptive models. With the deceptive models, there's a class of different threat models, but they all involve behavior like: the model is secretly plotting against us, or it's trying to undermine our evals, or it would try to self-exfiltrate, or generally lie to us about things it's doing; all of these things we can generally classify as shady behavior that we don't want in our models.


Now the problem is: if our models are very capable and they do this really hard task that we don't really understand, what if they just sneak in subtle security vulnerabilities that we miss, and then they pile up a bunch of these until they can exploit them later? And so the question is: if the methods I'm discussing here mainly work for elicitation, do they actually help us? And so I wanna make a bold claim that I want y'all to disagree with me on, if you so wish. Which is: if we actually manage to solve elicitation, then we've also solved deception in large part. Or at least it will help us a lot against deception.


Let me give you three arguments. One is: the way we train and oversee our models is some variation of (1) we use them as reward models, (2) we use them for constitutional AI, and (3) we deploy them as monitoring systems to look at the traffic after the model is deployed. And so if we can better elicit the capabilities of the reward models and monitors, then we will also effectively be supervising our models better, and they will be able to get away with less unsafe behavior. More specifically, if detecting the shady behavior is easier than producing it, and I have a fully elicited monitor, then I know that a similar-sized agent couldn't get away with anything, because my monitor would be able to detect it.


The other thing is, if my model actually uses all of its capabilities towards the task I give it, then it doesn't have any capability or capacity left to be secretly plotting against me. Now, it could still bide its time and just not do anything on most tasks, and then occasionally, in very rare cases, it decides now's the time to turn.


But then it still has to spend some amount of capacity on every task deciding whether now is the time to start plotting or deceiving. And then the third argument is: if we are actually really good at eliciting capabilities, we could also elicit the model's deceptive capabilities and just train it towards the deceptive tasks.


Let's say we make a model organism. We train the model to just, lie to us or try to insert subtle security vulnerabilities and all of these things. And we just see how good the model is at that and whether we could catch it. And if we can’t, we have to improve our defenses.


With that, I want to just go over the main points that I think were interesting here. One is, I want to supervise models on really hard tasks. I would claim we can measure progress here and we have a bunch of cool ideas and should just try and see how well they work on real tasks. It seems like weak to strong generalization doesn't work that well on the tasks that we care about out of the box. More work is needed. And that also makes us question whether RLHF will scale by default. 


Critic helpfulness training improves bug-finding, which is good news. It's unclear if it improves judge accuracy; bigger models will hopefully tell that story more clearly. And then, training for legibility with small models transfers to humans, which made me a lot more bullish on just doing scalable oversight experiments with small models instead of humans, because it lets you iterate so much faster.


Okay. I want to thank all of my many collaborators both from my old team at OpenAI and my new team at Anthropic. I have done very little of the actual work that I've talked about here, and it's all thanks to these folks that this work actually exists. And I'm very grateful that I got to work with them.  And with that, let's go to questions.