Joel Leibo - AGI-Complete Evaluation

Transcript

Everyone has some classification of different kinds of risks: there's misuse, there's misalignment, and there's “everything else”. I'm really talking about the “everything else”, because that's what I think is important when we're talking about what we could do, in terms of evaluation, if we had AGIs, or something that could scale toward AGI. That's what I mean to talk about.


But anyway, this is about systemic risks, or “destabilization risks”, as people were calling them yesterday. The canonical example is social media. We say, “After social media became widespread, it had some kind of delayed effect on the world, because that effect was mediated through people changing their behavior.” It's not that the technology itself directly had this effect.


The technology went out, it was used in some way, people changed their behavior, and that changed the world in a way that was maybe undesirable. And obviously we are running another experiment now, where we're introducing new, powerful technologies. This is also going to catalyze new forms of social organization.


People are going to change their behavior in response to this technology. This has already started. Even if the technology doesn't improve any more (which obviously won't be the case), people already have new affordances, and those are going to have some effect on the world.


We're going to change the equilibrium that we're in. So more changes in society are coming regardless of whether the technology continues to evolve and improve. The kinds of changes I'm thinking about are not exactly slow. They're driven by positive-feedback dynamics, so they can be very fast, in the sense of ‘human behavior’-fast.


This isn't FOOM-fast, the AI improving itself and overnight we're in a completely different world, but it's still fast in the sense of humans changing their behavior on the time scale at which humans change their behavior. In maybe months, or a small number of years, you can end up in a very different world. The reason it's fast in this way is that these are positive-feedback changes: the more people change their behavior, the more incentive there is for others to change theirs. There are examples on the slide. Other things are like this too, where the stability of the system can be affected by technology.
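
A minimal sketch of that positive-feedback dynamic, with made-up payoffs and rates just to show the tipping-point behavior: the more people have already switched to the new behavior, the more attractive switching becomes, so adoption either fizzles out or snowballs within a handful of steps.

```python
# Toy positive-feedback adoption model (illustrative numbers only).
# Each step, people compare the payoff of the new behavior, which grows with
# the fraction already using it, against the payoff of the status quo.

def simulate(initial_share, steps=40):
    status_quo_payoff = 0.5            # fixed value of not changing behavior
    share = initial_share              # fraction of people using the new behavior
    for _ in range(steps):
        new_behavior_payoff = 0.2 + 0.9 * share   # rises as more people adopt
        if new_behavior_payoff > status_quo_payoff:
            share += 0.25 * (1 - share)            # some holdouts switch
        else:
            share -= 0.25 * share                  # some adopters switch back
    return share

print(f"start at  5% adoption -> final share {simulate(0.05):.2f}")  # fizzles out
print(f"start at 40% adoption -> final share {simulate(0.40):.2f}")  # snowballs toward 1
```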


One example I like to talk about is fishery maintenance, where you could have a fishery that is perfectly sustainable when everyone is fishing with nets, but if you then introduce some other technology like dynamite then it becomes an unsustainable practice.


So this is actually very common. There are a lot of people who study the stability of sociological systems and how they're affected by technology, or by choices in the use of technology. And here's another example that's a bit more AI-like and less catastrophic. Think of some area where it's hard to get a reservation for dinner (I always tell this story about Granary Square in London, if people have been there). You might be motivated to have your AI assistant go make a whole bunch of reservations for you, and you say, “Oh, I don't know which ones are going to work, so I'll make all the reservations, and then I'll pick from the ones that actually worked.”


And then maybe everyone adopts this strategy. The problem with that, of course, is that the restaurants will be wise to it, and they'll say, “Okay, we're not going to allow you to keep canceling reservations if everyone does that.” Maybe it's okay if one person does it, but as soon as everyone does it, it has to change the equilibrium, right?


They have to stop taking reservations over the phone, or they have to make it so there's a cancellation fee, or whatever. There's just a new equilibrium where it's no longer possible to make reservations in the same way. So, here's an example where an AI assistant changes the way a system works.


I'm calling the kind of risk we've identified here an “equilibrium selection risk”: there's some convention, or norm, or institution that regulates our lives, and it depends, in ways that we might not know in advance, on things like the costs of taking different actions.


Something becomes much easier than it was. In the restaurant booking example, it's now just much easier to make lots and lots of calls, make lots of reservations, and then cancel them all. In principle, a person could have done that before, but it would have been costly, and no one would have done it.


The only thing that was supporting the equilibrium was the fact that that action was costly. But it isn't anymore. For a more consequential example: now any one of us can produce endless misinformation in service of whatever agenda we like. The cost of taking that action is just much lower than it was.
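
To put toy numbers on that cost point (all values here are hypothetical, just for illustration): when each reservation call carries a real cost, grabbing every slot is a losing strategy, but once an assistant makes calls nearly free, it becomes the individually best response.

```python
# Toy illustration of a convention propped up by the cost of an action
# (hypothetical numbers).

def best_response(cost_per_reservation, slots=10, slot_value=5.0, p_get_slot=0.5):
    """Compare booking one slot (and maybe losing it to someone else) against
    grabbing every slot and cancelling the ones you don't want."""
    book_one = p_get_slot * slot_value - cost_per_reservation
    grab_all = slot_value - slots * cost_per_reservation   # guaranteed, but pay for every call
    return "grab every slot" if grab_all > book_one else "book one slot"

print(best_response(cost_per_reservation=1.0))    # calls are costly        -> book one slot
print(best_response(cost_per_reservation=0.01))   # an assistant calls free -> grab every slot
```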


So the question is: is something about the equilibrium that we're all in dependent on the cost of that action being where it was… and now going to be destabilized by the change? Of course, I'm using the language of game theory; you can understand this in a game-theoretic way.


I only have a minute, so I won't really go through it, but in a game like that, if you start out in the blue equilibrium there, the (2, 2) one, there's both an equilibrium risk of moving to the (1, 1) equilibrium and an equilibrium opportunity of moving to the (3, 3) equilibrium, which would be better. And we should think about both of those possibilities.
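
A game in that spirit, as a minimal reconstruction (the payoffs on the slide may differ): a three-convention coordination game where you only get paid if everyone coordinates on the same convention, so all three diagonal outcomes are stable equilibria.

```python
# Three-convention coordination game (a minimal reconstruction; the slide's
# payoffs may differ). payoff[i][j] = (row player's payoff, column player's payoff).
payoff = [
    [(1, 1), (0, 0), (0, 0)],
    [(0, 0), (2, 2), (0, 0)],
    [(0, 0), (0, 0), (3, 3)],
]

def is_pure_nash(i, j):
    # Neither player can gain by unilaterally switching conventions.
    row_ok = all(payoff[i][j][0] >= payoff[k][j][0] for k in range(3))
    col_ok = all(payoff[i][j][1] >= payoff[i][k][1] for k in range(3))
    return row_ok and col_ok

print([payoff[i][j] for i in range(3) for j in range(3) if is_pure_nash(i, j)])
# [(1, 1), (2, 2), (3, 3)]: all three conventions are self-sustaining
```

From the (2, 2) convention, a shift in what everyone expects everyone else to do can move the population to (1, 1) (the risk) or to (3, 3) (the opportunity), even though the payoffs themselves never change.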


I talked about AGI-complete evaluation: what could we do, in terms of an evaluation framework, that would get better and better as we get better and better AI? I think that's what we need here. I will say that I think agent-based modeling is the thing to do. We should do agent-based modeling because the agents themselves will get better, and so we can simulate more about the world and understand more.
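
A very rough sketch of what that could look like for the reservation story (a hypothetical skeleton, not an existing framework): the diners' decision policy is the piece you would swap out for increasingly capable agents, and the thing you watch is whether the convention holds or the institution has to change its rules.

```python
# Skeleton of an agent-based model of the reservation example (hypothetical
# design). Diner.decide is a stand-in policy; in a real study it could be
# backed by a much more capable model, which is why the evaluation improves
# as the agents improve.

class Restaurant:
    def __init__(self):
        self.cancellation_fee = 0.0

    def adapt(self, cancellation_rate):
        # The institution changes its rules when the old convention breaks down.
        if cancellation_rate > 0.5:
            self.cancellation_fee = 2.0

class Diner:
    def decide(self, restaurant, call_cost):
        # Hoard reservations only while doing so is cheap.
        cost_of_hoarding = 10 * call_cost + 9 * restaurant.cancellation_fee
        return "hoard" if cost_of_hoarding < 1.0 else "book_one"

def run(call_cost, steps=20, n_diners=100):
    restaurant = Restaurant()
    diners = [Diner() for _ in range(n_diners)]
    history = []
    for _ in range(steps):
        choices = [d.decide(restaurant, call_cost) for d in diners]
        cancellation_rate = choices.count("hoard") / n_diners
        restaurant.adapt(cancellation_rate)
        history.append((cancellation_rate, restaurant.cancellation_fee))
    return history

print(run(call_cost=1.0)[-1])   # (0.0, 0.0): calls are costly, the convention holds
print(run(call_cost=0.01)[:2])  # [(1.0, 2.0), (0.0, 2.0)]: hoarding spikes, a fee appears, a new equilibrium forms
```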


It's not just about simulating the whole world and hoping things work out; there are principles. And everyone should read Elinor Ostrom. And I'll stop.