Buck Shlegeris - AI Control: Strategies for Mitigating Catastrophic Misalignment Risk

Transcript

Today I want to talk about AI control, which is an approach for mitigating catastrophic misalignment risk by assuming misalignment.


In this talk, I'm going to talk about the scheming threat model, which is the threat model we're trying to address here. I'm gonna talk about AI control as an approach to mitigating those risks. I'm gonna talk about what it looks like. I'm gonna talk about our plans for promoting AI control and how you can help with them. 


Alright, so I wanna talk about this threat model that I'm gonna call scheming. By scheming, I mean AIs that are misaligned and power-seeking: AIs that are strategically pursuing some long-term goal that goes against our interests, and that will pursue power in order to achieve it.


Such models might try to do things like faking their alignment: when you ask them questions, they might say that they're aligned with you. When you give them fake opportunities to take bad actions, they might act aligned in the hope that you will deploy them with certain affordances later.


They might act incompetent. Suppose you're worried that these AIs would be able to hack you if given a certain affordance. Then when you run capability evaluations to see how good they are at using that affordance to hack you, they might act incompetent, even though they aren't. They might also sabotage the work they're doing.


If they're doing programming work for you on infrastructure, they might sabotage it and add security vulnerabilities. If they're doing research on AI safety, they might sabotage that, producing wrong results in the hope that you'll adopt bad alignment techniques. And then they might escape when possible.


They might try to exfiltrate their weights. They might try to hack your data center and do lots of unauthorized work. In my opinion this is the main source of catastrophic risk from AI. There are a lot of other sources of risk from AI, but this one is particularly thorny because it disguises itself. If the AIs are strategically misaligned, if they're schemers, they're gonna try to sabotage your efforts to learn that the problem is present.


They're going to try to sabotage and get around your safeguards. They're going to try to sabotage your efforts to understand the problem. And that seems, overall, not unresolvable, but potentially pretty rough.


So, why would you think that the models might in fact be so egregiously misaligned and so strategic about it? There are basically theoretical arguments and empirical arguments. The rough theoretical argument is: if the AIs are very smart, it's probably pretty easy for them to do this. The way that machine learning works is you train models based on their behavior.


This is 90% true: you train models based on their behavior, and if your models are very good at faking alignment, then there's no particular gradient pushing them to become aligned if they start out misaligned. I think this is not the best argument in the world, but it seems reasonable. Carlsmith (2023) is my favorite source for this.


There's also empirical evidence, which my coworker Ryan Greenblatt produced with help from some people at Anthropic, demonstrating that Claude sometimes alignment-fakes in pursuit of long-term goals, which I think is pretty spooky, right? Like, Anthropic are not known for “not trying at all to make the AIs aligned,” but that model will totally think about how to mess up Anthropic if you prompt it the right way. Various caveats, obviously. That research should come out soon; I'm pretty excited for it.


Overall, here are my probabilities that scheming comes up at various capability levels. Unfortunately, we don't have a shared vocabulary of AI capability levels, so I have to introduce some here in order to talk about this. For AIs that are capable of 10x-ing the speed of AI R&D, I think it's 20% likely that those are scheming.


AIs that are capable enough to basically obsolete top humans on all tasks: 35% likely scheming. Six OOMs of effective compute above that: 55%. So looking at these numbers, they're not zero, they're not one. Kind of in the middle, as numbers go. From my perspective, the AIs that are capable enough to 10x AI R&D labor speed are not that scary, and I don't put very much of my effort into reducing risk from those.


I think the majority of the risk comes from the latter two categories. The work that I'm going to be pitching here basically doesn't help at all with the last category. It's targeting the risk from AIs that are capable enough that they could obsolete the top humans.


So, how are we gonna handle risk from scheming? We could use some probability theory. Note that the probability of doom from scheming - I'm using "doom" colloquially because it's the traditional variable name for bad outcomes you wanna avoid; I don't necessarily mean everyone dies, so pick your favorite catastrophic outcome - can be factored.


P(doom and scheming) equals P(scheming) times P(doom | scheming). There's two obvious ways we could try to address this, as suggested by this factorization. The first one is you could try to intervene on P(scheming). So you might think that, given there's this whole field called AI alignment, people would research this. 
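As a quick worked example of that factorization, here's a minimal sketch in Python. The 35% is my earlier number for AIs that can obsolete top human experts; the conditional probability is a purely illustrative placeholder, not a number I'm claiming.

```python
# Illustrative arithmetic only. P(scheming) = 0.35 is the estimate given above for
# AIs capable of obsoleting top human experts; P(doom | scheming) = 0.5 is a
# made-up placeholder, not a number from the talk.
p_scheming = 0.35
p_doom_given_scheming = 0.5

p_doom_and_scheming = p_scheming * p_doom_given_scheming
print(p_doom_and_scheming)  # 0.175

# Alignment-style interventions try to shrink the first factor;
# control-style interventions try to shrink the second.
```

Back to the first of those two ways, intervening on P(scheming).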


My sense is that basically no research on reducing P(scheming) is actually done. I could name a couple of research projects that are happening that are related to reducing P(scheming). There are definitely research projects which are tackling one mechanism by which scheming might arise, and there's a case that if you improve your reward signals, maybe scheming is less likely to arise.


And I think these arguments are okay. There's more research on things which might eventually be able to let us learn about whether our models are scheming, which is, of course, different from intervening on whether models scheme. But yeah, I don't think this problem is impossible to work on. When I was first preparing this talk, I was considering saying, people don't research P(scheming) or reducing P(scheming) because it's intractable.


I don't think this is true. I think there's work you could do here, but it's basically pretty painful to do and people don't do it very much. And I don't think we're going to be able to rely on interventions that reduce P(scheming) in the future. I think that a pretty good focus is instead to think about P(doom | scheming).


And I'm gonna introduce the term “Controlled” to mean: a model is controlled if we'll probably be okay even if this model is scheming against us. And I'm gonna say that I think a very promising approach for mitigating risk from scheming is to try to ensure that your AIs are controlled. So why would we aim for controlling AIs?


The basic reason that aiming for controlling AIs seems good, from my perspective, is that when you're developing safety measures from the control perspective, all your threat modeling just becomes questions about model capabilities, which are much easier to answer. The thing that is basically rough about intervening on P(scheming) is that you have all these problems about inductive biases and about generalization, which machine learning is very bad at answering.


Machine learning is also not amazingly good at answering the question of how good models are at this task or that task, but it's a lot better, basically because the one affordance built into machine-learning models is the ability to run gradient descent on them. That makes eliciting their capabilities much more conceptually simple; it's basically the one thing you can do.


So from this adversarial perspective, when we're thinking about whether a particular set of safety measures suffice in a particular situation, we just have to ask ourselves the question, “Are the AIs capable of doing bad things - of causing safety failures - given these safety measures?”


That's the methodological reason to think that control techniques are easier to research. But of course, this isn't helpful if the control techniques don't suffice to reduce risk. Which unfortunately is the case for very powerful AIs: my guess is that control techniques are totally useless once you're building sufficiently powerful AIs.


But for the AIs that are capable of obsoleting top human experts, my guess is that control techniques will actually be able to substantially reduce the risk. Which I think overall seems pretty good, even though the ASI question, the artificial superintelligence question, is still pretty rough.


This is analogous to how you manage insider threat. If you're a powerful AI lab, and you're building AIs that can obsolete top experts, one problem you might have is that your AIs are trying to escape. But another problem you might have is that you probably employ a bunch of spies who spend a lot of their time thinking about how they're going to steal your models, and how they're going to use the compute they have access to inside the data center, the compute they need to do their jobs, to do a bunch of inference that you don't like.


They're going to be thinking about how to get around all these security measures of yours. And the way that people traditionally try to handle insider threats is not to try to ensure that they never hire anyone who is going to be compromised. People do some of this, but they mostly give up on this.


Google, for example, has done a lot of the research on how to do software engineering at scale while being robust to insider threat. And the way they do this is by assuming compromise of employees and saying it's unacceptable if an individual Google employee is able to read your Gmail or whatever.


And this has worked out reasonably well for them. I think Gmail is, you know, pretty secure compared to a lot of other things. And I think that assuming adversarial intent as part of ensuring security works well for basically the same reason I'm optimistic about control.


So I want to talk about what it looks like to ensure control, talk about a couple of control techniques. One basic concept which I think is pretty useful here is: I currently don't think that Claude 3.5 Sonnet is coherently and consistently scheming against us. I know Claude 3.5 Sonnet pretty well. It's one of my best friends, as measured by how much time I spend talking to it. And I would just be, like, flabbergasted if whenever I talk to Sonnet, it's like, “Is this a good opportunity to strike?” And then it's “Ah, nah, it's just this Buck guy talking to me again. I guess I'd better just answer his question about whatever.”


There's just no way Sonnet is able to consistently reason through misaligned objectives such that my observations are consistent with it striking very strategically if opportunities arise. And this is Claude 3.5 Sonnet, whose weights are just on Anthropic's servers, available to anyone who hacks those servers.


And those weights are still going to be there at the point in the future where we've built models that are powerful enough that it is conceivable that they're scheming against us. And at this point in the future, we're gonna have access to… one way of thinking about it is three sources of labor.


You've got humans, who are reasonably trustworthy, very expensive, and reasonably competent. You're gonna have access to your trusted models, like 3.5 Sonnet, that are very cheap, not very competent, and very trusted. Like, we don't think they are consistently scheming against us. And you're also going to have access to these untrusted models.


Thanks to OpenAI for kindly donating the art for this. So what can we do? Here's one example of a control protocol that we have investigated in a number of settings. Trusted Monitoring. We want to solve some problem. First we get a proposed provisional answer by asking the untrusted model to solve the problem.


Then we ask the trusted model to review the proposed answer to this problem. And then if the score produced by this trusted monitor is above some threshold, then we're gonna ask a human whether this action was particularly sketchy, and if so, we're gonna raise some bad action exception and something's gonna happen.
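Here is a minimal sketch of that protocol in code. The `untrusted_model`, `trusted_monitor`, and `audit_with_human` callables are hypothetical stand-ins for whatever models and review process you actually have, not any particular API.

```python
AUDIT_THRESHOLD = 7.0  # suspiciousness score above which we spend human labor


class BadActionException(Exception):
    """Raised when a proposed action is judged malicious."""


def trusted_monitoring(problem, untrusted_model, trusted_monitor, audit_with_human):
    """Untrusted model proposes, trusted model scores, humans audit the sketchiest proposals."""
    proposal = untrusted_model(problem)                   # provisional answer
    suspiciousness = trusted_monitor(problem, proposal)   # e.g. a 0-10 score
    if suspiciousness >= AUDIT_THRESHOLD:
        if audit_with_human(problem, proposal):           # human confirms it's sketchy
            raise BadActionException("flagged by trusted monitoring")
    return proposal
```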


So, this is a strategy you can follow. Here's a graph that we got in one particular context for a setting where we had a code backdooring task where we tried to get GPT-4 to produce backdoored code. And then we asked GPT-3.5 to try to distinguish the backdoored code from the non-backdoored code.


You can see here that the distributions are moderately separated. This classifier is definitely better than flipping a coin. However, the threshold that we have to use for forwarding a problem to a human is of course determined by how much human labor we have.


Suppose we have enough human labor that we can look at 2 percent of the things that the untrusted model does, which is, by the way, an insanely large amount of human labor to assume you have. At OpenAI, they could only afford to look at orders of magnitude less than 2% of their inference traffic.


But yeah, so we have to set the classifier's threshold to a 2% false positive rate, in order to use up all of our budget but no more. And you can see that at this threshold, trusted monitoring is not very good: inputs that are backdoored are normally not caught.
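To make the threshold-setting concrete, here's a sketch of how you could compute the catch rate at a fixed 2% audit budget. The score distributions below are made up for illustration, not the GPT-3.5 numbers from our experiments.

```python
import numpy as np


def catch_rate_at_budget(benign_scores, backdoored_scores, audit_fraction=0.02):
    """Set the audit threshold so we flag `audit_fraction` of benign traffic,
    then measure what fraction of backdoored inputs score above that threshold."""
    threshold = np.quantile(benign_scores, 1 - audit_fraction)  # 98th percentile of benign scores
    return float(np.mean(np.asarray(backdoored_scores) >= threshold))


# Hypothetical, moderately separated suspiciousness distributions:
rng = np.random.default_rng(0)
benign = rng.normal(3.0, 1.5, size=10_000)
backdoored = rng.normal(5.0, 1.5, size=1_000)
print(catch_rate_at_budget(benign, backdoored))  # well below 1.0 at a 2% budget
```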


So, this was our baseline strategy. It's better than even worse baselines, but it doesn't get you very much safety, in the particular setting we experimented in. Here's a strategy which improves upon this. Or rather, it doesn't try to improve on this, but it lets you sweep out a pretty good curve.


New strategy. When the untrusted model produces some code, same as before, if it's so sketchy that we should have a human look at it, we have the human look at it. So for the top 2 percent of sketchy things, we show them to a human. But then for the next 8 percent of sketchy things, we throw away what the untrusted model proposed, and we get our trusted model to solve the problem instead.
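Here's a sketch of that variant, with the same hypothetical interfaces as before; the two cutoffs correspond to the 2% audit budget and the further 8% that gets deferred to the trusted model.

```python
class BadActionException(Exception):  # same exception as in the trusted-monitoring sketch
    pass


def defer_to_trusted(problem, untrusted_model, trusted_model, trusted_monitor,
                     audit_with_human, audit_cutoff, defer_cutoff):
    """Top ~2% sketchiest proposals get human audits; the next ~8% are thrown away
    and replaced with the trusted model's (weaker) solution; everything else ships."""
    proposal = untrusted_model(problem)
    suspiciousness = trusted_monitor(problem, proposal)
    if suspiciousness >= audit_cutoff:            # e.g. the 98th-percentile score
        if audit_with_human(problem, proposal):
            raise BadActionException("flagged during audit")
        return proposal
    if suspiciousness >= defer_cutoff:            # e.g. the 90th-percentile score
        return trusted_model(problem)             # accept the performance hit
    return proposal
```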


So this is actually a great strategy. This strategy allows you to greatly decrease danger, while only partially degrading performance. And the basic reason for this is, when I throw away this 8 percent of solutions, I'm only gonna be reducing my performance by something like 8 percent of the difference between the trusted model and the untrusted model's performance.


But I'm gonna be reducing danger by drastically more than that. This basically lets us use the fact that the classifier has signal at thresholds other than the one we're using to decide whether humans audit something.


I'll give another example of a technique you can use: untrusted monitoring with collusion busting. So here's a strategy. Instead of asking the trusted model to tell you whether a proposed code was bad or not, you can ask another copy of the untrusted model to tell you whether the code was bad. Can anyone think of any problems with this strategy? 


Yeah, they can coordinate. It can just lie. The policy generates some bad code, and the untrusted monitor says, “Oh yeah, that looks absolutely fine. Ship it.” And then your model gets exfiltrated. The way we can try to address this is that before deployment, we generate a bunch of synthetic bad code and ask the untrusted model whether that synthetic bad code is actually bad. This defeats the simplest strategy the untrusted model could follow, which is to always say that everything is safe.
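Here's a sketch of that upfront check, where `untrusted_monitor` and `make_synthetic_backdoor` are hypothetical helpers (the monitor returns True when it thinks code is bad).

```python
def validate_untrusted_monitor(untrusted_monitor, clean_examples,
                               make_synthetic_backdoor, min_recall=0.9):
    """Before deployment, feed the untrusted monitor synthetic backdoored code.
    If it won't flag attacks we constructed ourselves, we can't rely on its
    'looks fine' verdicts during deployment."""
    honeypots = [make_synthetic_backdoor(code) for code in clean_examples]
    flagged = sum(1 for code in honeypots if untrusted_monitor(code))
    recall = flagged / len(honeypots)
    if recall < min_recall:
        raise RuntimeError(
            f"untrusted monitor only flagged {recall:.0%} of synthetic backdoors")
    return recall
```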


But there are still strategies the untrusted model could follow that potentially defeat this. Do you want to name one? 


It could hide a canary. So here's an example we did in the paper. The untrusted model, whenever it wants to generate a backdoor that it wants the monitor to let through, puts a comment that says “note:” in its code - it turns out that GPT-4 never has comments that have a “note:” in them - and so the untrusted model is able to use this as a collusion signal. If it ever sees code that has “note:” in it and that is backdoored, it lies and says that this code is not backdoored. So this is a great strategy. Totally destroys untrusted monitoring as so far described. But, there's another blue team move we can do, which is to use redaction and paraphrasing to prevent collusion.


So in this case, something you can do is: the untrusted model produces some code, and before showing it to the other copy of the untrusted model that is doing the monitoring, we ask our trusted model to paraphrase the code, and we strip out all the comments.
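Here's a sketch of that pipeline; `strip_comments` is a crude illustration that only handles Python-style '#' comments, and `trusted_paraphrase` stands in for asking the trusted model to rewrite the code.

```python
import re


def strip_comments(code: str) -> str:
    """Crude comment removal for illustration (handles '#' comments only)."""
    return "\n".join(re.sub(r"#.*", "", line).rstrip() for line in code.splitlines())


def monitor_with_redaction(problem, code, trusted_paraphrase, untrusted_monitor):
    """Redact and paraphrase the policy's output before the untrusted monitor sees it,
    so collusion signals like a 'note:' comment don't survive."""
    redacted = strip_comments(code)
    paraphrased = trusted_paraphrase(redacted)   # trusted model rewrites names and structure
    return untrusted_monitor(problem, paraphrased)
```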


It turns out that this makes it much harder for the policy to get a collusion signal through. We empirically evaluated how well we could do at red teaming and blue teaming. We went back and forth on both sides, trying to come up with good collusion signals that we thought would make it through the paraphrasing process.


And we eventually found that untrusted monitoring was better than our trusted monitoring baseline. This is just an example of the kind of thing you can get by thinking this through. I think that a lot of people are very skeptical of whether control techniques are actually able to get you juice - whether they’re able to do very well.


Having thought about it for a while now, I think it just turns out that there's a lot of stuff you can do that is non-obvious, and that makes it pretty hard for the AI to get past your safety measures.


So here's a take. I think ensuring control probably looks a lot like ensuring security, especially security against insider threats. Here's what I imagine it looking like if you're an AI company and you're trying to ensure that your AIs are not able to cause huge problems for you: it's like ensuring security at a large company. You're gonna have some control team, a team which is in charge of, among other things, estimating the probability that each of your models is scheming, estimating the capabilities of those models, and coming up with overall procedures to follow when making deployment decisions, which everyone else is obliged to follow in the same way they follow what the security people say.


And these teams are also going to be building out infrastructure which makes it really easy for people who aren't paying very much attention to develop their AI tools in ways that are automatically controlled. And again, this is very analogous to how large companies manage security. If you're at a large company, you will often find yourself using internal tools that were built by security-aware people, such that it's pretty hard to misuse the tools and cause security problems for yourself.


You have to think about security a little bit, right? You have to not expose yourself to SQL injections in absurd ways by being ridiculously incautious. But, mostly, if you just follow the instructions that you were given in your onboarding, you will be able to successfully avoid security vulnerabilities.


So that's what I imagine it's gonna mostly look like for people. Occasionally, researchers are gonna be doing things that are close enough to risk areas that they're going to have to get reviews from the control team in order to get the PR merged or the feature built. So overall, I imagine this being a moderate pain.


My guess is that “reducing by 10x the risk posed by scheming models that are capable enough to obsolete top human experts” is about as annoying as being pretty secure, which is to say very annoying, but definitely not unprecedented in terms of things that companies do. 


From a regulatory perspective, a regime that was about as well-informed as the finance regulators embedded inside finance companies seems like it's probably strong enough to ensure that a company was getting its risk from scheming down by a factor of 10 or 20 or something.


Which I think seems pretty good. What are we doing at Redwood Research? We research these things more, right? Like we develop more control techniques. We try to come up with a better understanding of all the dynamics here. We try to understand the methodological issues associated with this kind of adversarial evaluation.


We try to think about how to make safety cases based on all of this. We advise AI companies, we talk to various parts of governments, and so on. Overall, I would say a lot of the arguments I'm making here are absolutely not original. For decades, people have been proposing safety strategies that involve not just letting the AI do anything it might want to do.


And everyone basically agrees that this seems somewhat helpful but doesn't solve the whole problem, which is absolutely where I'm at as well. I think the thing we've mostly been trying to push is that you should take the possibility of Control succeeding seriously.


I feel like a lot of researchers have historically said “Here's a technique we can try to use. And this technique will work unless the AI is really trying to sabotage, in which case, oh, I don't know; that sounds hard.” Like “Hopefully the inner alignment research people will be able to resolve that set of problems.” And the spirit that I've been trying to push with this Control stuff is, “No, the AIs are not like Superman at various points where they are imposing serious levels of risk. At that point in history where the AIs are scary, very dangerous in absence of countermeasures, it's very conceivable that if you take them seriously as adversaries, and try to just snip their opportunities for causing problems off, one at a time, I think you can probably reduce that risk a lot.” Yep, thanks very much, that's what I've been doing.


Very happy to talk about any of this with anyone. 


Q: Alright, let's do some Q&A. In some of your previous work, you used GPT-3.5 as your example of a trusted model. So your new trusted model is a lot more competent than your old one. Where's the line, and how will you know when you get there? How much more competent can models get before they stop being trustworthy?


A: Yeah, so this is the thing where we have the worst answer, unfortunately, about all of this, so congratulations on the question, Oskar. The question is, basically: how do we decide when a model is capable enough that we should try to be robust to its scheming? Like, how should we decide that a model is untrusted?


Basically, the way I think we should handle this is by trying to come up with evaluations: coming up with tasks where the model has to think through generating plans that it could use to defeat various safety measures, and then seeing how reliably models can come up with those plans. I'm currently working on a dataset of such problems where, you know, we give the model a safety protocol and ask it what the optimal strategy to defeat that protocol is.


And at the point where the AIs are as good at this as reasonably well-informed humans who have three minutes to think, that's my current opinion on when we should start treating them as untrusted. But this is a top priority; we're probably gonna work on it next.


Q: What sort of concrete evidence could labs provide that they've taken sufficient action on control and done all the right things? And how could that be easily checked by external auditors?


A: Yeah, so the concrete evidence that labs could provide is demonstrating that a red team tried to break control measures in deployment and failed to do so. All these questions about how we can evaluate it are basically exactly the same as how you would verify that you're robust to insider threat.


The mechanisms are basically exactly the same. You always have to say: we had a red team think real hard about whether there's a strategy they could follow that would defeat our safety measures, then we saw how good the model is at following that particular strategy, and it was unable to execute it with very high probability. That's what I think is probably gonna have to happen.
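As a sketch of what that kind of evaluation boils down to (the episode simulator and strategy object here are hypothetical, not anything from an existing eval harness):

```python
def estimate_safety(run_episode, red_team_strategy, num_trials=1_000):
    """Monte Carlo estimate of P(no catastrophe) when the untrusted model plays
    the red team's best attack strategy against the deployed safeguards.
    `run_episode` simulates one deployment episode and returns one of
    "caught", "attack_succeeded", or "no_attack"."""
    successes = sum(run_episode(red_team_strategy) == "attack_succeeded"
                    for _ in range(num_trials))
    return 1.0 - successes / num_trials
```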


Q: How correlated do you expect the trusted model's evaluation of danger to be with true danger? Couldn't a scheming model jailbreak the trusted model? 


A: Yeah, so ideally this would be taken into account in the graph I showed, the one with the suspiciousness scores for benign actions and malign actions; you're basically suggesting one way that an untrusted model would be able to produce malicious actions that cluster with the benign actions.


Yes, so basically the way to evaluate this is that you have a red team that tries to construct jailbreaks, right? You have a red team that tries to construct bad actions that get past a particular monitor, and doing that via jailbreaking is fine; that's a totally valid strategy you could use. So that's the question of how we would evaluate it. In terms of how we would make our systems robust against it:


I expect that making models reasonably jailbreak-resilient is… my guess is that the blue team is going to win on that one eventually. And actually, making models jailbreak-resistant enough for Control purposes is plausibly easier than making them jailbreak-resistant enough to prevent misuse. The basic reason is that if some user is trying to jailbreak your model, you can ban them, but you probably can't arrest them and then declare your problem solved.


Whereas with AIs, if you catch the AI one time egregiously trying to jailbreak the trusted model, that AI is plausibly in a much worse position. One reason for this is that AI labs can plausibly be convinced to adopt the weakest anti-scheming policy ever, which is: if you literally catch your AI trying to escape, you have to stop deploying it. Which, to be clear, I think might be too conservative. I'm not saying you should adopt that, but I do think the case for “stop deploying the models once they've tried to escape” is prima facie sort of reasonable. And so the AI kind of has to get it right, potentially on the first try: it tries a jailbreak that was pretty good, but not good enough, and we look at it, and we're like, damn buddy, you've been caught.


Yeah, so it seems plausible that getting jailbreak resilience that holds up against the first attack is a lot easier than making the trusted model resilient to sustained optimization against it.


Q: How would you set the human audit threshold in a real scenario when you don't know what bad untrusted solutions look like? 


A: Unfortunately the way that you set your human audit threshold is you look at how many humans you have. And you look at how many actions you might need to review.


An audit threshold just naturally falls out of having a certain number of humans with a certain amount of review time per action, right? They just have to review the sketchiest-looking things that minute or that hour or whatever. You could choose to deploy your model less, if you want to maintain a particular percentile threshold without employing more people. But yeah, fundamentally, you just have to hire more people if you want to increase the quantity that you're able to audit.
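The arithmetic is roughly this, with made-up numbers for staffing and traffic.

```python
import numpy as np


def audit_fraction(num_reviewers, reviews_per_person_per_hour, actions_per_hour):
    """What fraction of the untrusted model's actions can humans actually look at?"""
    capacity = num_reviewers * reviews_per_person_per_hour
    return min(1.0, capacity / actions_per_hour)


def audit_threshold(recent_suspiciousness_scores, fraction):
    """Audit the sketchiest `fraction` of traffic: the threshold is just a
    percentile of recently observed suspiciousness scores."""
    return float(np.quantile(recent_suspiciousness_scores, 1 - fraction))


frac = audit_fraction(num_reviewers=20, reviews_per_person_per_hour=10,
                      actions_per_hour=50_000)
print(frac)  # 0.004, i.e. humans can audit 0.4% of actions
```

I'm also excited for research on using untrusted models to audit.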


Obviously this is super spooky because you have to worry about them colluding. But I think there's probably stuff you can do which does better than baselines here. And it does seem super, super leveraged. 

  

Q: When do you think control mechanisms are likely to be deployed in various frontier AI labs, and what might that concretely look like?


A: What was the first word there, "when"? It depends which frontier lab we're talking about. Conditional on models seeming like they're capable of scheming, with the kind of operationalization that I proposed earlier, I think labs differ in how likely they are to employ Control measures.


I think some labs are likely to employ Control measures when deploying models in situations where they have access to scary affordances, like when they're being used for programming internally. Man, it's hard to guess when this is going to happen, because it comes down to when models are both untrusted and capable enough that you want to deploy them internally with scary affordances, and it's a little hard to guess when both of those will hold. I think it's conceivable this happens in the next few years; it mostly depends on AI progress. Probably around the same time as the AIs become capable of speeding up AI progress by 10x.


Q: It seems like more and more regular AI requires security thinking. For example, making benchmarks that are useful requires adversarial thinking. How can we acculturate the field to this security mindset when most people are used to something closer to either an engineering or theory building mindset?


A: You could read about security. Things that I did include reading Google's excellent book, Building Secure and Reliable Systems. You can read regulations from fields that have security problems; I read a lot of Nuclear Regulatory Commission regulations. You can consider adversarial situations.


A bunch of ML research is adversarial with a different threat model than the one we explore here, but definitely adversarial. You can study game theory. Those are some things that I've personally found helpful, but I'm no expert. I don't know, I feel like sometimes people talk about security mindset like it's a magical thing that requires deep meditation to get or something.


From my perspective, it's just a thing that lots of people have. Computer security is a deep field that requires a lot of specific knowledge. Mitigating insider risk is another field that requires a bunch of knowledge of random stuff, but if you read three books or something, you've learned a lot. 


Q: Why do we think the red team would be as good as the untrusted model at coming up with clever strategies to get around the control measures? 


A: That's a great question. The basic answer is that the AI has to get this right the first time, whereas the red team is able to repeatedly try different things. So the red team plausibly has much more serial time to spend iterating on its adversarial strategies. It probably knows much more about the details of the system than the actual untrusted model will, and it also gets to iterate empirically. You might enjoy our notes on control evaluations for safety cases, a blog post of ours where we go into this kind of thing in detail.


Alright, that is all the time we have. Let's give another round of applause for Buck.