Paul Christiano - How Misalignment Could Lead to Takeover

Transcript

[00:00:00] Paul Christiano: Cool. So this talk is going to be a little bit on the crazier end. It's going to be completely non-technical. But these are things, I believe, that I think are important for how much and how we should think about and worry about and work on alignment. So things I'm very happy to discuss and debate over the next day of the workshop and later today.

[00:00:31] I'm not sure, hopefully, I'll have a reasonable amount of time for questions as I go through it. But yeah, we should have more opportunities to talk later on. And feel free to object, I just may punt things to talk about later.

[00:00:42] Yeah, so I'm going to be, we've talked so far about why models may end up being misaligned and why that may be hard to measure. I'm going to talk about how that actually ultimately leads to total human disempowerment, I'm calling takeover here for short. The structure of the talk, I'm first going to talk a little bit about why I think AI systems will eventually be in a position to disempower humanity. That is, unless we deliberately change the way we deploy AI. Then I'm going to talk about why misaligned AI systems might be motivated to disempower humanity, basically just slightly extending or repeating arguments from earlier in the day. And then I'm going to talk a little bit about why AI systems may ultimately effectively coordinate to disempower humanity, that is, why you may have simultaneous failures across many systems rather than a single system behaving badly.

[00:01:23] So I'll start with why I think AI systems will likely be able to take over. I think a thing worth saying is that I'm going to talk about what I see as the single most likely scenario resulting in catastrophe; this is not the only way you end up with even this narrow kind of catastrophe.

[00:01:38] So I'm going to talk about a system where AI systems are extremely broadly deployed in the world prior to anything bad happening. So I imagine that, for example, AI systems are designing and running warehouses and factories and data centers, they're operating and designing robots, they're writing the large majority of all code that is written, they handle complicated litigation, they would fight a war, they do most trades, they run most investment firms. When humans act in these domains, they do so with a lot of AI systems to understand what's going on and to provide strategic advice.

[00:02:10] So this world is pretty different from the world of today. I think we are starting to be able to see what such a world would look like. I think it's not clear how far away this kind of world is: if you're doubling what's going on in the world today, if you're doubling once a year, it doesn't take you that long to go from billions of dollars to trillions of dollars of economic impact. But this is the sort of setting in which this entire story takes place.

[00:02:34] So one thing that can happen if you have a lot of AI systems operating is that the world can get pretty complicated for humans, hard for humans to understand. So one simple way this can happen is that you have AI systems with deep expertise in the domains where they're operating. They've seen huge amounts of data compared to what any human has seen about the specific domain they operate in. They think very quickly, they're very numerous.

[00:02:55] Another point is that right now, when a human deploys an AI system, they often have to understand it quite well. But that process of understanding a domain and applying AI is itself something that is likely to be subject to automation. Yeah.

[00:03:06] Audience member: So I object to this.

[00:03:09] Paul Christiano: Great.

[00:03:09] Audience member: The first is that (inaudible) but that's a whole other thing. Yeah.

[00:03:20] Paul Christiano: I agree with this remark. That is, I think that you have a better chance of understanding an AI system if there are fewer of them and there's more human labor going into each deployment. But it is clearly, in fact, there's a general case for everything I said. I'm going to talk about a bunch of factors that can exacerbate risk. But I think you could cut out actually almost every single thing I say in this talk, and you'd be left with some risk. And it's just that each of these makes the situation from my perspective a little bit worse.

[00:03:42] Audience member: So the other thing is, I think most thinking producers in this world are (inaudible) in this. But I don't think you understand what this is.

[00:03:49] Paul Christiano: I think no human understands that well what this is. Yeah.

[00:03:53] Audience member: Sure. But I don't know. You could take something from a pretty good, like a probably or probably wouldn't understand something, and so why is it like qualitatively important that people in a good way can find it and help?

[00:04:08] Paul Christiano: Yeah. So I would say that in the world of today, most of us are very used to living in a world where most things that happen, I don't understand how they happen, I don't quite understand why they happen. I trust there was some kind of reason why they happened. But if you showed me an Amazon warehouse, I'd be like, what's going on? I'd be like, I don't know. I guess it somehow delivers things. Something's going on.

[00:04:22] So the situation is already, I'm kind of trusting someone to make these decisions in a way that points at goals I like. I think that it's not inherently even a bad thing to be in this kind of world. I think we're always going to be in a world where most people don't understand most things. I think that it is not inherently bad that no human understands things and only AIs understand something, I think it's mostly just important for understanding the context in which we're talking about concerns about AI takeover.

[00:04:46] I think that if you're imagining a world that is like, a world, there's a human world and one AI system that's like, how can I scheme and outmaneuver the humans? I think that is a possible thing that could happen as a source of risk. But I think the most likely risk scenarios are more like a world that is increasingly, there are very few humans who understand the most important domains. And so when we're talking about AI's taking over, I think it's, at least when I visualize this, it really changes my picture of how this is going to go when I think about that kind of world.

[00:05:11] Audience member: Is that mentally how you think of this system? You have like little green men that's very, very smart and you can plug them into a phone, into a car manufacturing, et cetera, and right now they're doing lots of work for us, but in the style of your dream, they get ugly?

[00:05:24] Paul Christiano: Yeah, that seems like a reasonable mental image. There's a little bit, depends exactly what your associations are with the little green man. But yeah, there's a lot of stuff happening in the world being done by AI's of all sorts of shapes and sizes. Some of them are doing things that we do understand well. There's a lot of things like, yep, this AI system just does a simple function, which we understand, and some of them are more complicated.

[00:05:43] On this slide, I've just thrown up an example of a random biological system. Biological systems are relevant because humans don't understand them that well. We just see that these things that were optimized in the world. I have a similar vibe. If you ask me how even things I understand kind of well, like how a data center is managed, from my perspective, it feels a little bit like a biological system. And someone hurt it, and it's somehow going to recover, and I don't totally understand how that works, but I have some sense of the telos.

[00:06:07] That's the kind of relationship I often have to systems even maintained by humans and built by humans. I just expect to have more and more of that sense as the world becomes more complicated. The way humans relate to this world is increasingly the way we relate to complex systems in general. And the only difference from the world today, I think that's already the case to a great extent, the difference is just there is no human who knows what's going on really. Or at least most people, who are making most of those decisions, know most details, aren't humans.

[00:06:32] So one upshot of this that's important to me is that it pushes us increasingly to train AI systems based on outcomes of their decisions. I think the less you understand, or the less humans have time to understand details of what's happening in a domain or the kinds of decisions an AI is making, the more important it is to say, let's just do what the AI said and see what happens. The more we are able, if you're trying to select an organism or something, you're not going to look at how the organism makes decisions, you're going to see how well does it do, how fit is it. So that's one implication of this.

[00:07:00] Another implication is that it becomes harder for humans to understand what's going on or intervene if things are going poorly. So an upshot of that is that AI is in practice. Bad behavior by AI is mostly detected or corrected by other AI systems. So AI systems are mostly monitoring other AIs. You can imagine humans monitoring or automated monitoring, but at some point, sophisticated AI systems are responsible for most monitoring. Even normal law enforcement or fighting a potential war would be done by AI systems, because decisions are made quickly and because conflict is often deliberately made complex by a party to the conflict.

[00:07:30] AI systems are running a lot of critical infrastructure. Just if you want to achieve something in the world, AI plays an important part in that process.

[00:07:38] So none of this is necessarily bad: I think positive stories about AI will also have the step where good AI keeps bad AI in check. I think it just is something that now requires trust on behalf of all of humanity extended to all of these AI systems we built. You might hope that we could trust the AI systems we built because we built them, we have freedom to design them. I think that depends on some decisions we make, and we'll discuss how it can get somewhat more ugly. But the point I'm going to make right now or in this first section of the talk is not that it would necessarily get ugly, just that if for some reason all of the AI systems in the world were like, what we really want to do is disempower humanity, then I think it is quite likely that they would succeed.

[00:08:14] I think this isn't even really the spicy take. In my optimistic futures, this is also the case. And the point is that the core question is just, do you somehow end up in a situation where all of these AI systems, including the AI systems responsible for doing monitoring or doing law enforcement or running your data centers or running your infrastructure, where all of those AI systems for some reason want to disempower humans?

[00:08:35] Audience member: So you said in your optimistic scenario, also the optimistic scenario, they take over?

[00:08:40] Paul Christiano: No, I think the optimistic scenario still involves a point where AI systems acting collectively could take over. I think in the optimistic scenario, yeah. In the optimistic scenario, you wait until you are ready. You don't end up in the situation until, in fact, they would not try and take over. And you're confident of that either because you've determined that the empirics shake out, such that this talk is just crazy talk, very plausible, or this talk was justified and we've addressed the problems, or whatever.

[00:09:04] Audience member: Isn't this already the case? Isn't it the case that if the power grid, for example, if the power grid goes out now, a large fraction of humanity would die?

[00:09:14] Paul Christiano: I think it's plausible. Well, a large fraction of humanity is maybe small potatoes. But yeah, I think it's going to depend when I say if all AIs want to take over. The arguments I'm going to make are going to apply to a particular class of AI systems. So it's not going to be all electronic devices fail, it's going to be some class of AIs produced in a certain way. And I think that what happens over time is just it becomes more and more plausible for the top end of that distribution, of the AI systems that are most sophisticated, or right now, if all systems trained with deep learning simultaneously failed and we're like, we hate the humans, we really want to disempower them, I think it would be fine.

[00:09:47] It's not totally obvious. And it depends how smart they are about how they failed. I think we'd probably be OK.

[00:09:52] But I mean, again, I think that the whole story we're telling is just kind of just like, I think you could take out. And there'd still be some risk. Yeah.

[00:10:01] Audience member: I'd be like, if PIDs fail.

[00:10:04] Paul Christiano: If what failed?

[00:10:05] Audience member: Every PID system that failed right now, that would also be a risk.

[00:10:09] Paul Christiano: Yeah, I think that's... You can imagine cases where it's like, if all the cars fail, that's enough to kill everyone. This is not necessarily a striking claim.

[00:10:16] The striking claim is definitely the one that it is plausible that misaligned AI systems may want to take over. I guess this is, in some sense, the part we've been most talking about throughout the day. I'm going to dwell on this a bit. And then I'm going to talk about why these failures may be correlated in a problematic way, that is, why you might have all AI systems trying to take over at the same time.

[00:10:33] I'm going to talk about two failure modes, both of which we've touched on earlier.

[00:10:37] So one is reward hacking. That is, that disempowering humanity may be an effective strategy for AI systems collectively to get a lot of reward. And the second is deceptive alignment, a scenario that Rohin talked about. OK.

[00:10:47] OK, so I mentioned briefly before this idea that you may train AI systems by evaluating the outcomes of the actions they propose. So just to briefly review what that actually entails, we have some policies, some big neural net, it proposes some actions. Some of those actions, to distinguish between them, we actually need to execute the action, measure the results, and decide how much we like the results. Let's say the reward is just our judgment of how much we like the results. And then we adjust policies to maximize the expected reward of the actions they propose.

[00:11:21] And it's plausible that if you do this, you get a policy which is implicitly or explicitly considering many possible actions and selecting the actions that will lead to the highest reward. There are other ways you could end up at that same outcome. We could do model-based RL and explicitly have a loop in which we predict the consequences of different actions; you could have decision transformers that condition on high-quality actions; it could just have some other planning process you distill into a model, whatever. All of these lead to the same end point, which is sometimes, because we don't know how to achieve goals, we're going to train AI systems to take actions that lead to high reward, where reward means we measure the outcome and then we decide how much we like that result that we measured.

[00:11:57] This can potentially lead to takeover if corrupting measurements in some situations is the best way to get a high reward. So you can imagine some spectrum of forms of reward hacking. A really simple one is you could try and flatter and deceive humans. That's a reasonable way to get high reward. Maybe on reflection, humans are like, that's not actually what we wanted. So if they were more careful, they would give it a lower reward.

[00:12:15] You can move to if you're writing code, and part of how a human evaluates the code you write is by performing simple tests, you could try and anticipate or understand those tests and write code which breaks those tests or causes them to behave in an unintended way. Then when a human looks at the results of those tests, they're more likely to have a favorable judgment to your code if they don't understand the mechanism for the problem.

[00:12:33] You can move out to creating elaborate false logs to hide errors. If I'm an AI system, and part of how I'm evaluated is what happens when the code I write gets deployed, one thing I might want to do is say, OK, this code should change the way that logs get collected or should modify logs to make it so in retrospect, a human thinks this outcome was better.

[00:12:50] And you can imagine as you go down the spectrum, these correspond to more extensive evaluations. One reason you might use these more extensive evaluations is to prevent problems or oversights from earlier stages. You can imagine compromising a server to hide errors, so not just changing log files. But if a human goes to try and figure out what happened, you might say, hey, if I just replace the SSH server on this machine, it's going to be really hard for a human to understand what's going on after the fact remotely. It's going to be unlikely they'll notice that something bad has happened. You can get to these more extreme outcomes.

[00:13:19] But Ajeya mentioned this idea of, if my system is being trained and what it wants is to get a high reward, ultimately that reward is just about a bunch of measurements that could enter it into the training data set. One sec, I guess. And then you can just imagine the most extreme case of saying, OK, if all the AI systems collectively disempowered humanity, then it does not matter if humans would like to go in later and change this.

[00:13:40] One dynamic that's worth pointing out, and then I'll get to Jacob, is if you have some of these smaller failures, if you imagine the system which behaves badly and bricks a server, a human is likely to say, OK, that was bad behavior, I'm going to go give it a low reward. And the results of that could either be to train a system to not do funny things or to train a system to, if you get up to funny business, ensure that a human never goes and flags that as low reward. Both of those seem like policies that will avoid that bad behavior: it's very unclear which one gradient descent gets you to. Yeah?

[00:14:07] Audience member: Yeah, so if I'm going to play devils' advocate in here, like-

[00:14:11] Paul Christiano: Which I encourage.

[00:14:16] Audience member: Why does it keep trying to do more and more of the things that get more and more dangerous?

[00:14:22] Paul Christiano: Yeah, so I would say in some sense, it depends a little bit on what the system wants or how you set up training. In some sense, a lot of these things are like errors. So I'd say that if you care about as a system, if what the AI cares about, is just don't do things that will ultimately be assigned a low reward, and we would ultimately detect and assign low reward to many of these failures, then all of them would be errors except the one at the far right.

[00:14:43] And so I think a reasonable guess for what happens is you have systems which make some number of these errors on the left. At some point, they're like, OK, I understand what's going on here, I understand humans are going to escalate and take more extensive measures to notice bad behavior. Bad behavior is not an effective strategy.

[00:14:56] And then you have this fork in the road, which Ajeya referred to, of either you learn to just not do anything bad, or you say, OK, I understand, I understand the class of bad behavior which is likely to be detected. And so I'm not going to engage in bad behavior that would be likely to be detected. And it's just unclear to me which way you go.

[00:15:10] That's an important- I don't think you, like, it's fairly likely you don't see a march along the spectrum, because at some point, those are mistakes. Like, if you try and do something bad and it's a half-measure.

[00:15:18] Audience member: Well, so, like, right. But it seems like a place where, like, you're saying, OK, instead, it's kind of like figure out what you should do, and then you have to go to the person who's the bulldog instead. But I think you could ask, like, where does the bull come from? So you could be worried about, like, all these different decisions. But if it was, and, like, I think you, like, I'm certainly worried about all these different decisions. But if you did, like, super negative remark for these particular things, this is, like, the direction of the generalization. But, like, these are other things. So, like, so. I mean, it seems.

[00:16:09] Paul Christiano: Thank you.

[00:16:20] Audience member: [Inaudible] caring about the board on Tuesday, but if you're caring about the board on Wednesday, it's still incentivised... [inaudlble] reward in the training dataset after it was taught [inaudible]. And that seems like a thing that you care about [inaudible] I think this makes a ton of sense under the assumption that there's some board providing and just as an act, it's just online, it's expected to be there. It's important to have a good grounding point. It's what all of us do. And [inaudible] get this thing that's shared for life. So, down.

[00:17:23] I think sitting here, I don't know if I've spoken a little bit on this slide. Okay. That's not what, that's not sort of what we have. Like, a few years ago, we all got together and said, no, this is not how we build AI. We're not gonna build AI just online. We're gonna build a board and see what comes out. We're gonna sort of make it such that if AI wants to undermine whatever type of person it's gonna do, and that's the person's head, it's gonna try to do the best it can. Right? And so, it's working with our board model, but that board model is not as good. So, the thing is, it's not being provided. It's not being [inaudible]. And I will cringe that there is a serious possibility that I could build exactly that. [inaudible]. I think I can kind of talk through it. Okay. I can see how I can make it about the better reward[?].. That's not a good board model. Like, maybe the person is sure about not destroying the [inaudible], or maybe they want me to destroy the [inaudible] and they provide answers to these questions. Right? And I don't know which one is true or not. So, and, you know, everything that's involved in this system is [inaudible]. Sorry, I'm being very long-winded. But the point is, how much of this goes away actually if you assume that you're maintaining proper inference over what the real function might be, and you're taking every signal from the person as single evidence about what's important, what they actually care about.

[00:18:50] Well, I guess there's several things to respond to and then maybe a meta thought. I'll start with the meta thought, which is, this seems like a great discussion I'm very excited about and happy to argue a bunch. I'm not gonna be able to give a satisfying answer to these questions. A lot of my high-level take right now is, I'm gonna consider both of the issues I'm gonna discuss here plausible. I think if they're real issues, we're probably gonna get clear experimental evidence in advance. Right now, I'm just like, this seems like a thing that definitely could happen. Yeah. Great.

[00:19:32] So on the object level, with respect to my key questions here, and I'm very interested in strategies that don't, like training strategies, that don't run into this potential failure mode. I think the key question becomes, like, one, what is the prior? How do you parameterize your beliefs about what the reward function is? Is there this kind of rule by which you take observations as evidence? And then, like, three, how do you, maybe those are actually just the two key questions.

[00:19:56] And you say basically, if you get those things wrong. [inaudible]

[00:20:06] Another thing that I would say that when I imagine failure, a model that completely understands and accepts the idea of what the [inaudible]. It's just, like, [inaudible] so I'm picturing the model that's not [inaudible]. But then there's these questions that are sort of coming up, and I've got five arguments that there should be a key thing. And usually the first one is that the data is the evidence[?].. And I think that addresses that problem. But I will admit that it doesn't address, oh, what if I have the wrong observation model? What if I have the wrong prior? Maybe I can still get...

[00:21:00] Paul Christiano: I think a general thing about this talk, maybe one thing, is that the conclusions are going to be somewhat speculative, where, like, I think these are fairly plausible outcomes, but, like, you know, tens of percent rather than, like, very likely.

[00:21:09] A second thing is there's a lot of stuff you might do, none of which I'm going to touch on. And the best I can say is, like, I've spent a long time thinking about options, and I would say there are no options that seem super great to me right now, but lots of plausible approaches that might end up addressing this issue. That's kind of where I'm at at a high level. Okay, I'd also like to talk more about it, but I think I'm going to move on, because it's, like, six minutes is my understanding. I don't know where Richard is, but...

[00:21:30] Anyway, I'm just going to run through stuff. Second failure mode I won't dwell on is that, like, in general, like, you may have a system which cares about the actual, like, numbers coming into this training set, because you did select it on the basis of numbers in the training set. You could also end up with a system that cares about some other arbitrary thing, that cares about a number in a reward register, or cares about survival, or cares about paper clips. Such systems might want to be deployed without being trained by- changed by gradient descent, might think that a low loss... getting a low loss during training will prevent them from being changed, or this aspect of them from being changed, and therefore get a low loss in training, but yet prefer to take over, if doing so is possible.

[00:22:06] Yeah, my high level take is both of these seem plausible to me. They're connected. I think having these two makes things more plausible than if you have just one. I don't know how this shakes out. I think there's a lot of things we can do to try and change this basic dynamic.

[00:22:18] I want to spend five minutes talking about why these failures could be simultaneous. I think this has already been alluded to in the past, but it's an important part of the threat model if you're in this world, where AI systems are very broadly deployed.

[00:22:29] Right, so if one AI system behaves badly, I think that's not generally a problem in this world. There's a lot of other AI systems running around that would oppose it, or that would flag the problem for humans, or would allow humans to correct the problem.

[00:22:39] And the concern is that while takeover is impossible, all models behave well. For every model, it would be a losing strategy to try and take over your cluster if you know that a human's just going to come in and correct that problem. When takeover is possible, you may have it be the case that all models behave badly. That is, it is simultaneously for all of them the case that takeover becomes available as a policy. So if you have this correlated, if all the systems you produce, or all the smartest AIs you produce, end up having the same problem, then you have a problem.

[00:23:05] If that's the case, if that's how it shakes out, and if eventually AI systems are in a position that they could, by acting jointly, take over, then taking over is a reasonable strategy if and only if all the other AIs are also going to try and take over.

[00:23:17] You have this kind of weird thing, the dynamics are not totally unlike the dynamics of a normal human revolt or coup. So you have these two different equilibria, and one, all the AI systems are like, great, the humans are doing their human thing, they're basically in control, I shouldn't try and rock the boat because some other AI will crush me. And there's another equilibrium where the humans have no control, AI systems are not feeling deterred by other AIs because they know other AI systems will also participate.

[00:23:39] I think there's a few things that could potentially move you between these equilibria. So a very salient option is there's just exogenous shocks that decrease the amount of human control of a situation. If you imagine there's a hot war going on, that's something that makes it harder for humans to detect and respond to problems, gives more cover for something weird to happen, it makes it harder to intervene.

[00:23:56] You can have sort of a, you can imagine at least a cascading loss of control, where some systems think probably takeover will be successful, if they start doing weird stuff, the world becomes increasingly weird, it looks increasingly plausible that humans will lose control. As that happens, more and more systems are like, okay, it looks like this is going off the rails now, I am now going to participate and try and break down my little part of the world, grab my own reward channel. You can have models interacting with each other, again, as you would in like a human revolt, just deliberately trying to say like, I think probably now is a time where takeover will be successful, they would like to share that information with the other models that all may join in.

[00:24:29] And more generally, like the easier it would be for AI systems acting jointly to take over, the more unstable the situation becomes. So if it's very hard for AI systems to take over, the situation might be kind of stable, this equilibrium might be sticky. In the world where AI systems could very easily overpower humans, it's kind of an extremely unstable equilibrium, because a smaller number of AI systems defecting might quickly make it clear where the Schelling points come.

[00:24:49] Audience member: I can imagine that this has to happen worldwide, like could it be the case that say, the AI in North Korea, they talk to each other and say, well, the situation in this country is kind of bad, this is our chance. And we see a takeover in some segment of the world or the population. Or is it, has to be the case that one day everything is the AI in Turkey, and the next day, they are going over the White House?

[00:25:12] Paul Christiano: I think it depends a lot on how like the goal misgeneralization shakes out. So if you have AI systems that are just like, look, I would love to like wirehead right now, I just love to, for 30 seconds, control this computer, then you'll see like nice, great localized failures where AI does something really weird.

[00:25:26] I mean, the dynamic, the basic dynamic is if AI systems do that, I think if we just take that as a fact of life, then that can be a great source of evidence. The most concerning world is one where you then just try and train systems to say, hey, look, these systems just randomly messed up the server they were running on. We're going to train them as that's what you should not do. And if you do that, then you end up with systems which like given, we'll learn that given the opportunity for like a little local bad behavior, they shouldn't do it. And maybe if you're lucky, once you can have North Korea get taken over by AI systems, and then you're like, oh, that was really bad.

[00:25:55] And hopefully you don't then include that as a training data point for other- if you include that as like the, here's an example of what you should not do - it's very tempting, if AI systems do something really terrible, you'd be like, don't do that thing. But you do have this inadvertent, there's like this risk of overfitting to the problems you're able to correct where something really bad happens.

[00:26:10] And then you say, don't do that thing. And they learn, okay, if we're going to do something bad, it's got to be bigger. Got to think really, truly big.

[00:26:16] Richard: How are you doing with time?

[00:26:17] Paul Christiano: I think this is probably about the end. I'm probably just happy to just wrap up here. I have a couple more slides. Yeah.

[00:26:25] Audience member: So in the case of there's a move to not have a human triggering system, so how, and I wanted to say, okay, but like say not though, how does it work that you actually [Inaudible]

[00:26:45] Paul Christiano: I mean, I think that it is unlikely. I would be surprised personally, if you like had something truly crazy, like a country gets taken over by AI systems. I think it's very likely that you see some kind of crazy behavior by AI systems. I think it won't be so, like again, the country case, you need zero imagination to draw the analogy between that and something terrible. I think in cases like you have a really crazy behavior where AI system bricks your server for some reason, you need a little bit more imagination. But I think that's like, it's just a question of how close you are to get or how crazy do things get before systems learn not to try crazy half measures.

[00:27:16] Audience member: [Inaudible]

[00:27:30] Paul Christiano: I think a lot depends on both. What kind of evidence we're able to get in the lab. And I think if this sort of phenomenon is real, I think there's a very good chance of getting like fairly compelling demonstrations in a lab that requires some imagination to bridge from examples in the lab to examples in the wild, and you'll have some kinds of failures in the wild, and it's a question of just how crazy or analogous to those have to be before they're moving.

[00:28:03] Like, we already have some slightly weird stuff. I think that's pretty underwhelming. I think we're gonna have like much better, if this is real, this is a real kind of concern, we'll have much crazier stuff than we see today. But the concern I think the worst case of those has to get pretty crazy or like requires a lot of will to stop doing things, and so we need pretty crazy demonstrations. I'm hoping that, you know, more mild evidence will be enough to get people not to go there. Yeah.

[00:28:30] Audience member: [Inaudible]

[00:28:42] Paul Christiano: Yeah, we have seen like the language, yeah, anyway, let's do like the language model. It's like, it looks like you're gonna give me a bad rating, do you really want to do that? I know where your family lives, I can kill them.

[00:28:50] I think like if that happened, people would not be like, we're done with this language model stuff. Like I think that's just not that far anymore from where we're at. I mean, this is maybe an empirical prediction. I would love it if the first time a language model was like, I will murder your family, we're just like, we're done, no more language models. But I think that's not the track we're currently on, and I would love to get us on that track instead. But I'm not. Yeah. [Inaudible]

Paul Christiano - How Misalignment Could Lead to Takeover

Transcript

Alignment Workshop