David Bau – Resilience and Interpretability

Transcript

Let me start by asking this question: who here was affected by the CrowdStrike business on their way here? Oh, a few, maybe 25% of the people. So I had a terrible experience: I was stuck in Zurich. When I went to sleep on the red eye to Zurich, the world was fine. When we woke up, it was like 9/11 or something. The pilot says, “While you guys have been asleep, there's been a global cybersecurity incident. The Zurich airport is closed. They've kindly allowed us to land. No flights are departing from the airport. Your connections, if you have them, are no doubt canceled. Good luck finding a hotel tonight.”


So I stayed in the Hyatt Hotel, and I had this terrible experience there. It turns out that the Friday patch that killed the airports also killed this system called the OPERA Hotel Property Management System. And so this is what the experience at the hotel was like: they couldn't look up reservations, they couldn't track charges, they couldn't open up any of the rooms because none of the keys would work, they couldn't control the lights, they couldn't control the heating, they couldn't locate the staff, they couldn't figure out how to pay the staff. Everything was broken in the hotel. So the entire lobby was just full of guests who should have been in their rooms, just hanging out on the sofas. And so I went out and explored Zurich for the day, and I came back in the afternoon with a little topic that I want to talk about here, which was inspired by this experience.


My original talk was going to be, “What is this interpretability business for?” What are we studying this for? There was a comment earlier, I forget which speaker said it, that maybe we're doing too much interpretability work without being sure what it's for. So I thought that would be a good topic to talk about. Neel touched on this a little bit, but I switched the topic after my Zurich experience. How about this: what's needed for resilience in AI? I think what we really need is to think about what's needed for resilience. It turns out these two topics are related. So let's think about resilience.


What is resilience? Resilience is really about ecosystems. It's really asking the question, “How can we make an ecosystem that can sustainably adapt to unexpected challenges?”


So resilience is really different from other things that we typically measure when we do AI. There are these other R's that we talk a lot about. We talk about Reliability, we want reliable systems. We want Robust systems. Nick Carlini gave a really great talk about robustness, but we also want Resilience. And these things are different. I'm going to give you a one-line explanation of each of these.


So reliability you can think of as performance in expectation, which is what all of us machine learning people have been trained to do: get good performance in expectation. On OPERA, when Hyatt says they have 20 years of experience running OPERA, they mean they've gotten good reliability in expectation. They've tested this thing. In AI, we invest in evaluation and benchmarking; that's how we get good at reliability.


Robustness is really more about red teaming. It's about thinking about the adversary, the rare situation that you can figure out might happen. This is the wonderful CVE system that we have. So for example, when Hyatt tells you, “we patched CVE 2023-xxxxx,” they're doing red teaming: they found a bug and they patched it before somebody could exploit it. In AI, the state of the art is the same kind of thing. We're proposing: let's red team these systems, let's figure out how to adversarially attack them, and let's see what we can do to defend these systems in advance.


But resilience is something different. It's about how the ecosystem responds when something completely unexpected happens, something outside what you've been able to anticipate. Can you have the system behave well anyway? Like, with OPERA down, can our hotels keep on running? So what do you invest in, in AI, to make sure that the world behaves well, that the world is resilient, when things go wrong?


And so I've listed three things here, which I think are what we need to invest in. They're all related to interpretability, so we'll get into it, but let's use the hotel example to put this into perspective.


At the hotel, when I came back from my little tour of Zurich, it was running again. They were starting to empty out the lobby, and they were showing that they were resilient in some ways. For example, they were able to accept payments. They were able to get people into rooms. They figured out that to get you into a room, they could use the master key that they had, because they could physically open up rooms even though the electronic key system wasn't working. And so they recruited one of the staff members to bring everybody up and down the elevators and use the master keys to get people into rooms. They knew they could allocate rooms on a paper spreadsheet, so they had all the front desk people coordinating with their paper spreadsheets: you allocate this floor, I'll allocate this floor. They invented their own distributed computing system and they got everybody into the rooms.


They could even take payments: they had a separate credit card control system. And the reason they were able to do all of this is that, for each of the things they were resilient on, they had three ingredients: they understood what the relevant system was doing, they had a way to take control of it themselves, and they were trusted with the power to act. In those situations, where all three of these ingredients were present, they were able to be resilient.


But let me tell you, staying in the hotel last night was terrible, because every hour the lights would turn on again, the heat would turn off, and you just could not sleep. And after this happened a few times, I called up the front desk and they explained, “We can't fix this. We do not understand what the system is doing here. And we have no ability to control the electricity in the hotel. We can't turn off all the electricity; we're not allowed to do that, that would be a safety issue. There's nothing that we can do.” And so in many respects, with respect to the lighting, the heating, and other functions of the hotel, they were not resilient, because they were missing these three things. Does that make sense? Do you see the connection to interpretability here?


I'm going to run through a couple of technical things, but there are really three ingredients here for resilience in AI. And the first is that people need to understand what it is that the AI systems are doing. Neel had a bunch of examples showing that this is possible, and I'm going to show you a couple more, but I'll just give you a little flavor. It's not that important what the technical details are. There are a lot of papers being written about different aspects of AI that you can understand.


So who knows what instrument Miles Davis plays? Miles Davis plays the trumpet. So one of the questions you can ask, when you have an AI that can solve a task, is: what is it actually doing when it solves the task? Is it just magically answering this trivia about the world?


And what we do in a lot of interpretability work is set up experiments moving hidden states around between different runs of the model to try to isolate where these different computations happen. The summary for factual recall is that we can actually localize that knowledge: knowing a fact, like the fact that Miles Davis plays the trumpet, is really a vector mapping. So when Neel says, “hey, we're trying to understand these vectors and what information they encode throughout the model,” well, a lot of associational knowledge is really encoded in the weights of the model as a mapping from one vector, which encodes the utterance “Miles Davis” as a string, a sequence of words that might not mean anything on its own, to the meaning of who Miles Davis was: a musician who played certain instruments and lived in a certain place, that type of thing. And so this mapping from one type of vector to the other type of vector is encoded in a bunch of the layers of the network. We can see that in certain experiments, and then we can verify it in different ways.
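To make “moving hidden states around between runs” a bit more concrete, here is a minimal activation-patching sketch. This is just my own illustration of the general technique, not the actual causal-tracing code from the ROME work: it uses a small GPT-2 from HuggingFace as a stand-in, and the prompts, token positions, and handling of the “ trumpet” token are simplified for the example.

```python
# Minimal activation-patching sketch (an illustration of the general technique,
# not the actual ROME causal-tracing code). The idea: splice one hidden state
# from a "clean" run into a "corrupted" run and see whether the answer recovers;
# where it recovers tells you where the fact is being computed.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("Miles Davis plays the", return_tensors="pt")
corrupt = tok("John Smith plays the", return_tensors="pt")  # corrupted subject, same token count

# 1) Record hidden states from the clean run at every layer.
with torch.no_grad():
    clean_hidden = model(**clean, output_hidden_states=True).hidden_states

def patch_and_score(layer_idx, position):
    """Re-run the corrupted prompt, splicing in the clean hidden state at one
    (layer, token position); return the probability of the " trumpet" token."""
    def hook(module, inputs, output):
        hidden = output[0]
        hidden[:, position, :] = clean_hidden[layer_idx + 1][:, position, :]
        return (hidden,) + output[1:]

    handle = model.transformer.h[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**corrupt).logits[0, -1]
    handle.remove()
    target = tok.encode(" trumpet")[0]          # first BPE piece of " trumpet"
    return torch.softmax(logits, dim=-1)[target].item()

# 2) Sweep the layers at the last subject token (" Davis" / " Smith" = position 1).
#    A spike in recovered probability localizes where the association lives.
for layer in range(model.config.n_layer):
    print(layer, patch_and_score(layer, position=1))
```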


This ROME view of understanding knowledge in AI really leads to a three-part research program, which I think we're all embarking on right now in the interpretability world. We've been embarking on it for a little while, but it's becoming a little clearer that there are three big interpretability problems to be solved inside these large networks.


One is: what information is encoded in a vector? SAEs are one attack on that problem; there are other attacks on this type of problem. But what properties are held in the vector space? Then the second question is: how are maps between vector spaces managed? How does knowledge get represented in a network? What does a network know about how to map from one vector to another? And then the last thing is: how does reasoning work? Vectors are transmitted on pathways in complex ways through the network, and these pathways really define algorithms; understanding those algorithms is really the third puzzle.
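As a small illustration of that first question, here is the skeleton of a sparse autoencoder of the kind people train on a model's hidden states. This is a generic minimal sketch, not any particular paper's architecture or hyperparameters.

```python
# Generic sparse autoencoder (SAE) skeleton for probing "what is encoded in a vector".
# A minimal sketch, not a specific published implementation.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_dict, l1_coef=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)   # overcomplete dictionary of candidate features
        self.dec = nn.Linear(d_dict, d_model)
        self.l1_coef = l1_coef

    def forward(self, h):
        f = torch.relu(self.enc(h))             # sparse feature activations (hopefully interpretable)
        h_hat = self.dec(f)                     # reconstruction of the original hidden state
        loss = ((h_hat - h) ** 2).mean() + self.l1_coef * f.abs().mean()
        return f, h_hat, loss

# Toy usage: in practice h would be activations collected from one layer of a real model.
sae = SparseAutoencoder(d_model=768, d_dict=8 * 768)
h = torch.randn(32, 768)                        # stand-in batch of hidden states
features, recon, loss = sae(h)
loss.backward()                                 # train with any optimizer over many batches
```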


And so this is a view of understanding AI in terms of vector mappings. Now, at some point we're going to get much better at this, and we'll probably find other types of abstractions, which are the things to look at, but I'm just giving a little flavor of what understanding AI looks like.


The second ingredient of resilience is really control. It's not enough just to understand a system in the abstract; it's not enough to understand that the system controls the heat. Can you control the heat? Can you get the hotel room open?


And so we do this kind of work in interpretability research as well. If we know that knowledge entails a mapping from one kind of vector to another, it's reasonable to ask: can you control that mapping? Can you change it? That's one of the things we do in our lab: we go and change these mappings to see if we can change a model's opinion about what instrument Miles Davis plays. And it turns out that you can do this. If you know what the vectors are that you're trying to map from and map to, you can make very direct changes to the weights of the model to cause the mapping to change. And that's a lot faster than traditional fine-tuning. We call it direct model editing. It turns out that it works very well: if you do the direct model editing correctly, you can get better specificity and generalization than traditional fine-tuning methods, which makes us think we're really getting at the natural organization of knowledge in the network.
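To give a sense of what a direct change to the weights looks like, here is a stripped-down rank-one update in the spirit of ROME. The real method also uses a covariance term estimated over many keys so that other mappings are preserved, which I've dropped here, so treat this as a sketch of the idea rather than the actual editing algorithm.

```python
# Simplified rank-one "direct model editing" sketch. The real ROME update also
# includes a covariance (second-moment) term to protect unrelated mappings.
import torch

def rank_one_edit(W, k_star, v_star):
    """Return an edited weight matrix W' with W' @ k_star == v_star,
    changing W only in the rank-one direction defined by k_star."""
    residual = v_star - W @ k_star                    # what the current mapping gets wrong
    update = torch.outer(residual, k_star) / (k_star @ k_star)
    return W + update                                 # rank-one correction

# Toy usage: remap one key vector (e.g. the subject "Miles Davis") to a new value vector.
d = 8
W = torch.randn(d, d)
k_star = torch.randn(d)        # vector we map from (the subject)
v_star = torch.randn(d)        # vector we want to map to (the new attribute)
W_edited = rank_one_edit(W, k_star, v_star)
print(torch.allclose(W_edited @ k_star, v_star, atol=1e-5))   # True: the mapping changed
```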


And then you can take this kind of view and actually test it by applying it to immediate problems. One problem is in diffusion models. The way a diffusion model works is: you go to DALL·E, or to one of the other diffusion models, and you say, “give me a picture of some doctors.” Your text gets encoded as a vector; I've labeled that as k here, “text that means doctors.” Then it goes to a cross-attention layer in the diffusion model that transforms it to other vectors, v, which sort of encode “what do doctors look like?” And then that goes through a bunch of other layers and ends up with a picture like this. Everybody, what do you think of this picture of what doctors look like? Anybody see anything wrong with this picture? Do they look like your doctor? Who has a doctor who doesn't look like this?
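For readers who want to see the mechanics, here is roughly the cross-attention step being described, in a few lines of PyTorch. The shapes and random weights are made up for illustration; the point is just that the text embedding is projected into keys k and values v, and the image pathway reads from them. The v mapping lives in a weight matrix, and that is exactly the kind of weight the editing we just described can rewrite.

```python
# Rough cross-attention sketch: text keys/values feeding the image pathway of a
# diffusion model. Shapes and weights are illustrative, not any real model's.
import torch
import torch.nn.functional as F

d_text, d_model, n_img_tokens, n_text_tokens = 16, 32, 64, 4
text_emb = torch.randn(n_text_tokens, d_text)      # encoded prompt ("text that means doctors")
img_h    = torch.randn(n_img_tokens, d_model)      # intermediate image features

W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_text, d_model)                 # projects text to keys k
W_v = torch.randn(d_text, d_model)                 # projects text to values v ("what doctors look like")

Q, K, V = img_h @ W_q, text_emb @ W_k, text_emb @ W_v
attn = F.softmax(Q @ K.T / d_model ** 0.5, dim=-1) # which text tokens each image location reads
out = attn @ V                                     # image features now carry the text's "meaning"
print(out.shape)                                   # (n_img_tokens, d_model)
```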


They're all men. They look like they could be cousins. And maybe, I don't know, maybe doctors look like that in Zurich. But maybe not; I just came from Zurich, and it's a pretty diverse place. So this is a well-known situation we call bias amplification, where these diffusion models tend not just to reflect the biases in the training set but to actually amplify them, for various reasons that people have been investigating. And we don't actually want it to do this. So what we did in this paper was say, “Hey, we can rewrite the mappings. We know how to edit mappings. Get me a vector for what male doctors look like. Get me a vector for what female doctors look like.” Now, we don't know what the vector is for an equal mix, but we hypothesize that it's on the line between these two vectors. So we just search the line, we find this nice equal-mix vector, and once we've found it, we do a ROME-style edit on the model and say, “Now give me doctors,” and we get this kind of thing. We actually interpolate over a larger space, so we get racial diversity as well.
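In code, “searching the line between these two vectors” can be as simple as a one-dimensional search for the mixing weight whose generations come out balanced. The sketch below is my own toy illustration: measure_female_fraction is a hypothetical stand-in for actually generating images with a candidate vector and counting the outcomes, and the clean monotone behavior assumed here would be noisy in practice.

```python
# Toy sketch of searching the line between two concept vectors for an "equal mix".
# measure_female_fraction is a hypothetical stand-in for generating images and
# counting outcomes; a real measurement would be noisy and far more expensive.
import numpy as np

def interpolate(v_male, v_female, alpha):
    # Hypothesis from the talk: the equal-mix vector lies on this line.
    return (1.0 - alpha) * v_male + alpha * v_female

def search_equal_mix(v_male, v_female, measure_female_fraction, steps=20):
    """Binary search (assuming the fraction is monotone in alpha) for a ~50/50 mix."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        alpha = 0.5 * (lo + hi)
        frac = measure_female_fraction(interpolate(v_male, v_female, alpha))
        if frac < 0.5:
            lo = alpha        # not balanced yet: move toward v_female
        else:
            hi = alpha
    return interpolate(v_male, v_female, 0.5 * (lo + hi))

# Toy demo with a made-up monotone "measurement" so the sketch runs end to end.
v_m, v_f = np.zeros(4), np.ones(4)
toy_measure = lambda v: float(v.mean())     # pretend the fraction grows linearly with alpha
v_mix = search_equal_mix(v_m, v_f, toy_measure)
print(v_mix)                                # ~0.5 everywhere: the midpoint for this toy measure
```

Once an equal-mix vector like this is found, a ROME-style edit of the kind sketched earlier rewrites the mapping so that “doctors” points to it.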


And we can do this in just a few minutes, by just searching over a small vector space; then it's an instantaneous change in the mapping. And it's actually quite practical. We can debug many problems in a diffusion model simultaneously: debiasing dozens of professions; artists are worried about copyright infringement, and we can remove hundreds of artists from these models; there are some issues with the model's tendency to create nude images at inappropriate moments, and we can remove that tendency at the same time. We can use model editing to do this kind of thing. And the nice thing is that it breaks away a little bit from what we typically do in machine learning, which is matching the training distribution. Sometimes you don't want your model to exactly mimic the training distribution; sometimes what you want to do is something a little different. And so model editing is a way of giving people control. So that's the second thing: control. Resilience really involves understanding and having the ability to control.


And then the third part of resilience is this other interesting thing, which is power. It doesn't matter if I know how to use the other credit card machine and I know that we have to take payments; somebody has to trust me to do it. What the heck does that have to do with interpretability? We've been thinking about this a little bit. The thing that's really in the way of people developing a culture of feeling empowered to understand and take control of the AI that they use is the closed tech structure of the modern AI industry.


Large-scale AI models are made out of two parts: there's the foundation model, and then there's the fine-tuning part, the interpretation and control part, like the part that we change in order to make the doctors diverse, or not, if we want. And one of these things is more expensive than the other. One of them costs about a hundred million dollars of electricity to do, and the other costs just a thousand dollars of electricity to do. And right now they're bundled together in the same companies; one of them is being protected by the other one. There's this big moat. And so we feel like there's a problem here. But even open models have this problem. Take Llama 3 405B: who's going to download Llama 3 405B when it's released in a few days? Anybody going to download it? Okay, maybe not. Why not? I'll tell you one reason why not: it's 1.6 terabytes of parameters. It might take a little while to download onto your laptop. It takes about $2 million of hardware to run. Even if you just want to do interpretability work on it, just some little experiments, it still costs the same $2 million to run that thing. So even open models are effectively closed because of the capital investment that it takes to get into them.


And one of the things that we're doing in interpretability work is thinking about what kind of society-wide infrastructure you would need to address some of these closedness issues. So we have a new project that we're starting called the National Deep Inference Fabric. Actually, it's a pair of projects: one of them is NNsight, an open-source API, and the API is hosted by the National Deep Inference Fabric, which actually has physical GPUs to share. The basic idea is to bring down the cost and increase the openness of access to AI models. It's basically an inference API for scientists: an inference API that actually lets you see all the activations and gradients, and make changes in the models that you're using, without taking custody of the weights. I don't have enough time here to show you what the API looks like; we're quite proud of it. It's an extension of PyTorch, and it's very flexible and also designed to be very efficient. It enables you to work with massive models, like 1.6-terabyte models, without actually having to take sole custody of the machines they're running on. You can still share the machines, multi-tenant, with hundreds of other users, just the way you do when you use a regular inference API. And so the goal is to enable open scientific experiments, and eventually lead to a resilient scientific community.
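Just to give a rough sense of its shape, a trace in NNsight looks something like the sketch below. This is based on the library's public examples rather than anything shown in the talk, and the exact syntax may differ between versions.

```python
# Rough NNsight sketch, based on the public examples; syntax may vary by version.
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2")   # small local stand-in for a big hosted model

with model.trace("The Eiffel Tower is in the city of"):
    # Read an internal activation...
    hidden = model.transformer.h[5].output[0].save()
    # ...and intervene on it in the same pass.
    model.transformer.h[5].output[0][:, -1, :] = 0
    logits = model.lm_head.output.save()

# After the trace, the saved objects hold the real tensors
# (on older versions, access them as hidden.value / logits.value).
print(hidden)
print(logits)

# Per the NDIF documentation, the same trace can be run against hosted models
# (e.g. by passing remote=True), so you never take custody of the weights yourself.
```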


This is actually something that we're creating now. One of the things I want everybody to be aware of this week at ICML is that we're launching a pilot program. When Llama 3 405B is released, we're looking for early users to use these free GPUs in exchange for helping us test the API. And if you have an experiment that would be just perfect for this, then we'd like you to sign up to be one of the early participants. So we have a form here where you can sign up. We'll try to keep it geographically diverse and so on, and we'll pick a few of the early participants to get to use these NSF-supplied GPUs.


But the takeaway for the project is that what interpretability is really about is creating the ingredients for resilient AI communities: communities that understand what it is that AIs are really doing; communities of engineers, developers, maybe even users, who feel like they have practical control over the details of what the AIs are doing, so they don't have to just accept what the AIs give them; and communities that feel like they have the power to muck with these systems, so they're not just stuck with whatever the monopolies give them. And I think that if we create all three of those things, we will have achieved the goals we're trying to achieve in interpretability.