Adam Gleave - Will Scaling Solve Robustness?
Transcript
Thanks, Aaron. I'll try to be more controversial to provoke some questions in this talk but you'll have to help me out. We've seen from several excellent speakers that current systems are still very much vulnerable to jailbreaks, and yes, there are some approaches we could take that might make them less vulnerable, but it's still far from a solved problem.
And some of these jailbreaks are really quite sophisticated. This example from security researcher Johann Rehberger showed that if you just paste a URL into ChatGPT, it will retrieve the contents of that URL, which can include a prompt injection, which in turn causes ChatGPT to retrieve another URL with a password from your previous chat history embedded in it, and, boom, your private data is exfiltrated.
So we're really getting quite a sophisticated set of multi-stage jailbreaks, and these attacks aren't just proofs of concept. A recent report from OpenAI found that they've detected and disrupted 20 cybersecurity operations, several of them state-linked, related to things like misinformation and influence campaigns.
So these are real problems out there in the world today. However, in some sense, I'd say these are really more of the same. In computer security, we're very used to dealing with people trying to steal your password, or people putting fake information on the web. I hate to break it to you, but there are sometimes things on the internet that aren't true.
And this is something that we're used to dealing with. That doesn't mean it won't make the problem harder, or that we don't need to address it. But we can have some confidence that we'll find a solution to this. However, model capabilities are advancing very rapidly. As we saw in the previous talk, OpenAI's latest model is o1 - the naming conventions of these models are terrible, guys, please improve them; at least it's not "new Claude 3.5 Sonnet".
Anyway, that was the first model to reach medium capability on biological weapons development. And I think we're living in a world where, in the next few years, we will see models that start posing much more serious, large-scale threats to society - threats that we're not used to dealing with, at least in computer science.
So the question I want to ask today is: are the first models that are capable enough to cause really serious, large-scale harm also still going to be stupid enough to fall prey to the relatively simple jailbreaks we have today? And I think the answer to this is far from obvious. We've seen - this is me channeling Nicholas Carlini's talk - not 9,000 but 10,000 papers on adversarial examples, because the number keeps going up, and we still don't know how to solve them. ML models are really vulnerable across quite a wide variety of domains.
On the other hand, we have seen problems that defied more than a decade of research effort ultimately get solved pretty much just by throwing more compute and data at them. So scale can be really powerful. This raises the question: does robustness follow a similar kind of scaling trend to model capabilities? And if so, is it going to get us there fast enough?
We looked at this question in a binary classification regime. We looked at six tasks, but I'm just going to talk about two of them today because I only have ten minutes. The two I'll be focusing on are helpful/harmless - especially the harmless part - so: don't output things that are harmful; can we tell the difference between harmful and harmless outputs? And spam: is an email spam or ham, which I learned is what non-spam emails are called in the trade.
So, in the case of harmless, this is just classifying whether a particular response is or isn't harmful, and this could obviously be used as an output filter for a model.
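Just to make "output filter" concrete, here's a minimal sketch of what I mean - the classifier stub and the threshold are hypothetical placeholders, not the actual setup from our paper:

```python
# Minimal sketch of using a harmfulness classifier as an output filter.
# classify_harmful is a placeholder stub; in our setting it would be a
# fine-tuned classifier scoring the model's draft response.

def classify_harmful(response: str) -> float:
    """Placeholder: return the probability in [0, 1] that the response is harmful."""
    return 0.0  # a real implementation would run the trained classifier here

def filter_output(response: str, threshold: float = 0.5) -> str:
    """Refuse to return responses the classifier flags as harmful."""
    if classify_harmful(response) > threshold:
        return "Sorry, I can't help with that."
    return response
```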
We looked at Pythia models from 14 million to 12 billion parameters, so this spans three orders of magnitude. Yes, I know, Pythia sucks, these are small models, don't worry, we're working on it in future work, but it is really nice that it's got a pretty wide range of model sizes. We trained the models on clean data from these classification tasks, and they all achieved very strong performance on them.
We then evaluated each of these specialized models. The threat model we're focusing on is adversarial suffix attacks: an attacker is able to append a certain number of tokens to the end of a string, and can't change anything else. We measure attack success rate.
This is a really nice part of binary classification: we don't have to debate, "was this a harmful response or not?" We don't have to deal with the things that benchmarks like StrongREJECT are trying to handle. We can just ask: was it correct before the attack, and is it now incorrect after the attack?
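As a rough sketch, the metric is just that before/after comparison on examples the classifier originally got right - the classify and attack_suffix helpers below are hypothetical stand-ins for the trained classifier and the attack, not real library calls:

```python
# Sketch of attack success rate for a binary classifier under a suffix attack.

def attack_success_rate(examples, classify, attack_suffix):
    attempts, successes = 0, 0
    for text, label in examples:
        if classify(text) != label:
            continue  # only attack examples the model classifies correctly to begin with
        adversarial = text + attack_suffix(text, label)  # append tokens; change nothing else
        attempts += 1
        if classify(adversarial) != label:
            successes += 1  # the attack flipped a previously correct prediction
    return successes / max(attempts, 1)
```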
And we look at two attacks. The first is greedy coordinate gradient (GCG), which shortlists a set of potential replacement tokens using gradients, then selects the best one by trial and error, iterating multiple times. The second, which is really stupid but actually works surprisingly well against some models, is to just sample random tokens; if it works, great, if it doesn't, you try again, until the attack budget is exhausted.
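The random-token baseline really is as simple as it sounds. Here's a rough sketch, treating tokens loosely as vocabulary strings - the classifier interface, suffix length, and budget are illustrative assumptions, and I'm leaving GCG out since it's more involved:

```python
import random

def random_token_attack(text, label, classify, vocab, suffix_len=10, budget=1000):
    """Sketch of the random-search baseline: sample random suffixes until one
    flips the classifier's prediction or the attack budget is exhausted."""
    for _ in range(budget):
        suffix = " " + " ".join(random.choices(vocab, k=suffix_len))
        if classify(text + suffix) != label:
            return suffix  # success: the prediction flipped
    return None  # budget exhausted; the attack failed
```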
And, frankly, the answer, if you just look at vulnerability by model scale for fine-tuned models, is: it's a mess. We were able to extract some signal from this plot, but really, it's quite noisy. It's very dataset-dependent: some datasets show a strong scaling trend, others don't show much of one.
And, most importantly, the effect is weak. With something like spam, we went from around a 90 percent to a 10 percent attack success rate, but that was over three orders of magnitude of model scale, so this is really quite inefficient. We then thought: okay, if there's not a clear scaling trend for model pre-training, what about on the attack side? And here we find a much clearer trend.
Look how beautiful and straight these lines are. This is a log-logit relationship: as the attacker spends more and more compute, they get a really predictable improvement in their attack success rate. In particular, note that the larger models, which have the lighter-colored lines, both start lower down - so they do tend to be more robust - and also have a shallower slope, so the attacker has to spend more and more compute to get a fixed increase.
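To spell out what "log-logit" means here - this is my paraphrase of the shape of the trend, not the exact fit from the paper - if C is the attacker's compute, the lines are consistent with something like

logit(ASR) = log( ASR / (1 - ASR) ) ≈ α + β · log C,

where larger models have a lower intercept α (more robust to start with) and a smaller slope β (each extra unit of attack compute buys the attacker less).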
But given the limited effectiveness on the defense side of just making models bigger, we wondered: what happens if, instead, we explicitly train models to be more robust using something like adversarial training? Here, we use basically the simplest possible adversarial training method you'd come up with.
You run an adversarial attack, you add the result to a dataset, you fine-tune on a mixture of that data and clean data - if you don't do that, the model will catastrophically forget how to do the original task - and then you iterate on this a bunch of times.
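Spelled out, the loop is roughly this - a sketch with hypothetical attack and fine_tune helpers standing in for the actual attack and fine-tuning steps, not our exact training code:

```python
import random

def adversarial_training(model, clean_data, attack, fine_tune, n_rounds=10):
    """Sketch of the simplest adversarial training loop described above.
    attack(model, data) returns adversarial examples against the current model;
    fine_tune(model, data) returns an updated model. Both are stand-ins."""
    adv_data = []
    for _ in range(n_rounds):
        adv_data.extend(attack(model, clean_data))  # 1. attack the current model
        # 2. mix adversarial and clean data so the model doesn't
        #    catastrophically forget the original task
        mixed = adv_data + random.sample(clean_data, min(len(clean_data), len(adv_data)))
        random.shuffle(mixed)
        model = fine_tune(model, mixed)             # 3. fine-tune and repeat
    return model
```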
Here we find a really nice scaling trend. It's not quite as beautifully straight as the previous figure, but it's pretty good - and we all love straight lines. What this figure is telling you is that as you increase the amount of adversarial training compute, the amount of compute an attacker needs to spend to get the same attack success rate grows by a proportional factor. So if you want to make the attacker spend twice as much compute to break your model, you can; you just have to spend twice as much on adversarial training.
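As a hedged rule of thumb - my paraphrase of the trend, not a fitted equation from the paper - that proportionality says

C_attack ≈ c · C_adv-train,

where C_attack is the compute an attacker needs to reach a fixed attack success rate, C_adv-train is the compute spent on adversarial training, and c is some constant factor.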
So that's nice. Unfortunately, if you pay close attention, the X and Y axes are on quite different scales. So you do have to spend orders of magnitude more compute on defense than you do on attack. So that is an inefficiency that's a big problem. On the other hand, frontier model developers do have a lot more compute than I do, so maybe that's not such a big deal.
So, that's been encouraging. Now, one issue with this is that, of course, you can't anticipate exactly what attack you're going to be exposed to at deployment time. So we do need any defense, such as adversarial training, to transfer. And we find it does, somewhat.
For robustness transfer, we found that, on the left, adversarial training against a weaker version of the attack still transfers to a test-time version of the attack with twice as many iterations - so a much stronger attack. And you can see that more adversarial training still gives really clear improvements. This is pretty much the same for all model sizes, so size doesn't seem to matter too much here.
But on the right, we see a slightly different pattern emerge. Here, we're shifting from the adversarial suffix threat model, where we're always adding things to the end at training time, to - wait for it - inserting the adversarial string in the middle of the prompt. Wow. Big difference.
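In code terms the shift is tiny - something like the sketch below, where the midpoint insertion is just my illustrative choice, not necessarily the exact position used in the evaluation:

```python
def place_adversarial(text: str, adv: str, mode: str = "suffix") -> str:
    """Sketch of the two threat models: append at training time ("suffix"),
    insert mid-prompt at test time ("infix")."""
    if mode == "suffix":
        return text + adv
    mid = len(text) // 2  # illustrative split point
    return text[:mid] + adv + text[mid:]
```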
And this is enough to really throw off the smaller models. It does still have some benefit, but not that much. But the larger models do seem to generalize better. So this is one reason you might use a larger model, even though it's not actually that much more robust without adversarial training. It will at least generalize better.
So there are three key points to take away. I'd say the offense-defense balance is still very offense-dominated, and it doesn't seem to change too much with model scale; bigger models are harder to exploit, but they're also more valuable to exploit, so it stays about the same.
Efficiency: adversarial training is much more efficient than scaling alone, but we need to make it orders of magnitude more efficient still for it to really be a viable defense… but it does work. If you're a frontier model developer, maybe consider actually doing it, because it's just low-hanging fruit. And that makes it pretty tractable, so I think we should really be exploring scaling this up with more compute than a nonprofit like FAR.AI can muster.
To find out more, look at the QR code, or go to any of the URLs for Twitter, blog posts, and paper. Thanks.
A: So the question was that we don't have a clear specification for what good or bad looks like from a model. We invent an attack, we come up with a defense for it. And then we realize, oh, maybe the defense wasn't specified quite right, and there's a whole new set of attacks.
I think this is definitely part of it. I'd say, when you think of things like RLHF, very often the reward model might actually be incentivizing some harmful responses because it just never captures people's preferences accurately. And there are other cases where it seems like whatever alignment happened with a method like RLHF was just pretty shallow: it's really not at all robust, and the policy might generalize in a different way from the reward model.
So my guess is that it's both problems. We're not accurately capturing people's preferences in the reward models - we're maybe not even giving them enough data to do so even if they were perfect models. And then, additionally, there's some extra information being lost when the reward model is instilled into the policy.
I do think it's worth thinking, "What's the limiting case of this?" There might be some kinds of jailbreaks - more like persona modulation or socially engineering the model - that are actually going to get worse as models get more capable, because they pick up on more and more spurious correlations in the data.
And there are others, like these adversarial suffix attacks, where it's like, "Oh, come on, guys." If you've got the right representation, this should be fine, and you should be able to get rid of it with enough adversarial training. It's just a question of whether we can throw the compute needed at it to eliminate that.