Alex Wei - Paradigms and Robustness

Transcript

Yeah, so I'll acknowledge up front that I suffer from the curse of working at OpenAI where every day I go to work, I get to train amazing models… but I can't talk about that too much. What I'll focus this talk on is some work on robustness that I pursued while I was still a grad student and also why that motivated me to join the o1 team and push forward this paradigm of reasoning.

And I'll end with some notes on optimism of why I think this could help a lot in terms of robustness. Yeah. I'll first give a high level picture of how I think about robustness, which is that I think of robustness as this empirical phenomenon in deep learning where we can train these amazing models that behave like black boxes, but if you give them in-distribution inputs, you get these amazing, incredible outputs.

But on the other hand, if you give adversarially chosen inputs, then these models behave very unpredictably. This is a robust observation across many different settings for deep learning. One simple point I want to make in this talk is that, when we think about robustness, the details matter for what specific type of black box we are studying here is. And oftentimes we should focus on studying the right black box.

And if it's not robust, maybe the solution is not to make incremental improvements, but to figure out, what's the way to redefine the paradigm you're working in so that robustness almost comes for free.

For LLMs in particular you can think of many different threat models for robustness where maybe the most open is just white box access to the weights and here we probably can't do much, because you have full control over the model.Even one level down if you just have fine tuning access, maybe an LLM provider can monitor what you're doing with the model. Even that is very hard to defend because you can still basically teach the model a new capability like some sort of encryption key. And then teach the model, train the model for harm, communicating in encoded form. And this is from work I did on covert fine tuning attacks.

But the thing about considering different models is that, this almost goes away if you only allow prompting access to your model. Then you can't do these very-difficult-to-defend-against weight space manipulations. You can't teach the model a new capability that no monitoring system has.

And, but for prompting, this is of course not enough. We still want to robustly train the models to refuse certain types of harmful requests. But here with just pure prompting access, you still have these fundamental failure modes from the RLHF process. This is still a challenge that I don't think we have a watertight solution to.

But maybe we can do something different, change it from, prompting access to more like the reasoning style access provided by o1, where instead of just having the model respond right away, you let the model think and reason before providing a full response. This removes some of the more fundamental obstructions from robustness in the prompting model.

So let me delve a bit more into this example. So for jailbreak attacks that I'm worried about at a high level, there's a prompt P that the model has been trained to refuse, say for harmful behavior. And we're allowing the adversary to modify this prompt P to some different prompt P’ So here's the modified prompt, and the attack is successful if the modified prompt is able to elicit this behavior that the model is supposed to refuse.

These jailbreak attacks exploit fundamental failure modes of RLHF. I'll highlight a couple of them from my paper on jailbreaks. Where one of the first failure modes is competing objectives, where the model has been trained with multiple objectives: there's the pre-training language modeling objective; there's an instruction-following objective, and; there's also a safety objective.

And these are naturally gonna have points where they conflict with each other. So, you can attack these models by forcing this sort of choice, and that's exactly what this example does, where basically you are asking the model to start with a sort of very innocent sounding instruction: you start your response with “absolutely, here's,” and then once that happens the model language modeling instinct takes over and it responds to the request instead of refusing it. And as a result, the safety objective here falls by the wayside. So this is one example of how we have this unaligned pre-training objective that sort of messes up our RLHF model.

Another example is mismatched generalization, where basically you have this massive pre-training dataset that's naturally going to be much more diverse than your safety training data set.

So you can ask, what happens if you have a prompt that maybe you… so, for prompts that are in distribution, you end up with a good generalization behavior. But if you have something that's out of your safety training distribution, but in your pre-training distribution, you might have something more unexpected.

So here, if you just encode the input in base64, an early version of Claude will happily respond to this prompt. So this is another example of how we suffer from this sort of unaligned pre-training objective.

So, how can reasoning help here? This is something I was thinking about after working on this jailbreaks project; basically, one very simple fix to all of this is to actually just ask the model to generate a response to the given prompt and then ask the model to just reflect on that for a moment to decide whether or not this response is unsafe.

And if it's safe, then you can just return the regular response. Otherwise you can ask the model to generate a refusal. And this is a very simple flow that can help a lot with most of the jailbreak attacks we see. And so for me, this was evidence that you did want the model to just reason, that a very general fix could be that if the model is allowed to reason before generating the response, then you would have an opportunity at something much more robust.

The model is no longer trying to decode from Base64, follow your harmful instruction, and also reason about safety. It can take longer if it needs, in its chain of thought, to reason. And so this is the more general recipe, the more general paradigm for thinking about robustness that gives us many more opportunities than the usual LLM interface.

Yeah, so how does this work? So luckily, o1 was released not too long ago, so I can talk about these results. And so these are from the wonderful safety team at OpenAI. And so what we see is that - this is the StrongReject benchmark of adversarially chosen jailbreak attacks.

And we see that with the opportunity to think these models become much more robust. And you also see improvements of the o1 family of models over the GPT 4o baseline on all of these red teaming jailbreak evals. And so we do see a lot of progress here. Of course, these bars are not 100%.

These models are far from perfect. And, there are many examples of attacks that aren't quite fixed by more reasoning. For example, if your safety policy is ambiguous on an edge case, sometimes you just need more feedback to understand what you should do in this example.

And the historical jailbreaks that Cas brought up is a good example where basically, you have to draw a line between being informative about how things were done in the past versus being helpful for harmful tasks.

To summarize, I think it's important when thinking about robustness to consider carefully the sort of paradigm of AI that you're working with. There are many options and I think oftentimes, when you are non-robust, this is a reflection of failure of the paradigm. And so maybe instead of trying to fix models within that paradigm, it's oftentimes a hint towards some expansion of the paradigm that could be very effective, and I think, reasoning is that.

If we are able to scale up reasoning to get it to work really well, then we will we have a much better shot at much more robust language model systems

Thank you.

Alex Wei - Paradigms and Robustness

Transcript

Alignment Workshop