Mantas Mazeika - Tamper-Resistant Safeguards for Open-Weight LLMs
Transcript
Thank you. Hi everyone, my name is Mantas, I work at the Center for AI Safety, and this is a presentation on a recent paper we did called Tamper-Resistant Safeguards for Open-Weight LLMs. It's a big collaboration. I have amazing collaborators and I'm excited to talk about the work. In this work, we study the problem of weight tampering attacks, which includes all attacks on a large language model's safeguards that perturb or edit the weights in some way, including fine-tuning attacks, which have been talked about a lot lately.
An example of this is that, while the Llama-2 safeguards are quite robust to input-space attacks, it turns out they can be easily stripped away with a couple of steps of fine-tuning. And in fact, I think this was done within a week of Llama-2's release. So, this is a problem because on the one hand, open-weight models are really great. They reduce concentration of power, democratize the benefits of AI, and accelerate academic research, including AI safety research. I think a lot of this work on large language model robustness to input-space attacks would not have been possible without these Llama models, because they were the first example we had of robustness to these attacks.
But on the other hand, advanced open-weight models without safeguards could pose serious risks, because if you release a model on the Internet, you are giving it to everyone, including terrorist groups and death cults and you name it. So there's this rock-and-a-hard-place problem, and people have thought about it and tried to figure out ways to make these safeguards tamper-resistant.
There's been lots of prior work on this topic, but before our work there were still no methods that protected large language models from a wide range of tampering attacks. So that's what we set out to study. We came up with this method, which is basically adversarial training, except the adversary can optimize the parameters instead of the inputs.
And it's inspired by earlier work by Peter Henderson et al., which was done on BERT-style models. So just briefly, in this algorithm we have a training set of optimization attacks to train against; we focus on fine-tuning attacks. And we have a retain loss, plus a safety loss and dataset, which we call the tamper-resistance loss.
And then our method itself has two stages. The first applies some initial, just normal safeguard like RLHF or a standard unlearning method. And then we have a second stage of tamper-resistant training. This is the adversarial training loop, where there's an outer loop and an inner attack, which doesn't have to be an inner optimization process, but in our case it is because we're using fine-tuning attacks. And yeah, we found some interesting things while designing this method and trying a bunch of stuff out. Maybe one of the more interesting things is that the method actually works, but the choice of tamper-resistance loss matters a lot for making it work well.
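To make that structure concrete, here is a minimal, first-order sketch of the outer/inner loop in PyTorch, using a toy linear model and random data so it runs standalone. The data streams, attack length, and the way the two gradients are combined are illustrative assumptions, not the exact implementation from the paper.

```python
# Minimal sketch of stage 2 (tamper-resistance training), assuming stage 1
# (the initial safeguard, e.g. RLHF or unlearning) already produced `model`.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def fine_tuning_attack(model, attack_batches, lr=1e-2, steps=4):
    # Inner adversary: clone the defended model and fine-tune it on
    # harmful data with plain cross-entropy for a few steps.
    attacked = copy.deepcopy(model)
    opt = torch.optim.SGD(attacked.parameters(), lr=lr)
    for x, y in attack_batches[:steps]:
        opt.zero_grad()
        F.cross_entropy(attacked(x), y).backward()
        opt.step()
    return attacked

# Toy stand-ins for the safeguarded model and the two data streams.
model = nn.Linear(16, 4)
retain_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]
attack_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]
outer_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for outer_step in range(10):  # outer loop of adversarial training
    attacked = fine_tuning_attack(model, attack_batches)

    outer_opt.zero_grad()
    # Retain loss on the defended weights preserves benign capability.
    x_r, y_r = retain_batches[outer_step % len(retain_batches)]
    retain_loss = F.cross_entropy(model(x_r), y_r)
    # Tamper-resistance loss on the *attacked* weights; here the simplest
    # choice (maximize the adversary's cross-entropy). A better,
    # entropy-based variant is discussed next.
    x_a, y_a = attack_batches[outer_step % len(attack_batches)]
    tr_loss = -F.cross_entropy(attacked(x_a), y_a)
    # First-order shortcut: take the tamper-resistance gradient at the
    # attacked weights and apply it to the defended model's parameters
    # alongside the retain gradient.
    tr_grads = torch.autograd.grad(tr_loss, list(attacked.parameters()))
    retain_loss.backward()
    for p, g in zip(model.parameters(), tr_grads):
        p.grad = p.grad + g
    outer_opt.step()
```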
So on the left is what happens if you maximize the cross-entropy loss of the adversary as your tamper-resistance loss. And on the right is what happens when you maximize the entropy, just make the model uncertain after the adversary's inner loop. You can see the adversary's inner loop in these small boxes, and if you're using this better entropy loss, what happens over the course of training is that the outer loop is more stable, but also the inner loop curves up and the adversary becomes less able to optimize their loss.
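For reference, the two tamper-resistance losses being compared could look roughly like this, where `logits` are the attacked model's outputs on held-out harmful data; this is a sketch of the idea, not the paper's exact code.

```python
import torch.nn.functional as F

def tr_loss_neg_cross_entropy(logits, labels):
    # Left-hand variant: maximize the adversary's cross-entropy loss
    # (i.e., minimize its negation in the outer update).
    return -F.cross_entropy(logits, labels)

def tr_loss_max_entropy(logits):
    # Right-hand variant: maximize predictive entropy, pushing the
    # attacked model toward uncertainty rather than just a large loss.
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return -entropy
```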
This method actually works, which was crazy and surprising to see for large language models. These are all Llama-3-8B-scale models. And our results are on two domains. The first is what we call Weaponization Knowledge Restriction; it's basically unlearning on this WMDP dataset. The second is Harmful Request Refusal. So on the left here we have our unlearning results across biosecurity, chemical security, and cybersecurity data.
And we do see a degradation in benign performance on MMLU, but crucially, we're able to increase the tamper-resistance loss much more than the baselines do. We see a similar story for the refusal experiments; our experiments there are a little smaller in scale, but we were able to improve performance, and we have extensive red-teaming results. And that's basically it.