Dawn Song - Challenges for AI safety in adversarial settings
Transcript
Thanks everyone. My name is Dawn Song.
I'm also from UC Berkeley. I'm a director at um Berkeley Center for
Responsible Decent - Decentralized Intelligence and also affiliated with the CHAI and
BAIR. So it's been, it's really exciting to
talk about machine learning all the great advancement and so on. But coming from security background,
one message I really want to get across to this audience is that as we talk about all
the great things machine learning can do, it's really really important to consider the development and deployment of
machine learning in the presence of attackers. So history,
especially in cybersecurity has really shown that the attackers always follow the footsteps
of new technology development or sometimes even lead it and also this time the
stake with the AI is even higher as AI can choose among more and more systems,
attacker will have a higher and higher incentives um to attack these systems.
And also as AI becomes more and more capable,
the consequence of misuse by attackers will also become more and more severe.
And hence, as we talk about AI safety, it's really important to consider AI safety in the adversarial setting.
So, adversarial examples has been shown to be prevalent in deep learning systems.
Um but my group has done a lot of work in this space and many uh uh and many other
researchers also in this audience have done a lot of work in this space.
And pretty much we have shown that adversarial examples are prevalent to essentially all
different tasks and model classes in deep learning. And also there are different types of threat models,
including black box and white box attacks that can be very effective and even attacks
can be effective in the physical real world as well.
And it's great to see that the number of papers essentially has increased
exponentially in the in the area of adversarial examples.
And and also this work has also helped raise a lot of the awareness in the public as well.
So some of our work, the artifact from our physical adversarial examples are now actually part of the
permanent collection at the Science Museum of London.
And also adversarial attacks can happen at different stages of machine learning pipelines
including at inference time as well as pre-training and fine-tuning stages.
So as we talk about AI safety and AI alignment,
it's important to look at what adversarial attacks can do in the setting,
in particular, on safety-aligned LLMs.
Unfortunately, recent work has shown that really the safety-aligned LLMs are pretty
brittle under adversarial attacks. This is our recent work DecodingTrust,
which is it's the first comprehensive trustworthiness evaluation framework for large language models.
And essentially we developed new evaluation datasets as well as new protocols for evaluating
trustworthiness across eight different perspectives for
trustworthiness for LLMs. Including toxicity, stereotypes, robustness, privacy,
and fairness and many others. And in the development of our evaluation framework,
we specifically set out to both include the
evaluation under benign environments as well as adversarial environments.
And our work showed that for these large language models - so for example
GPT-4 can perform better under benign environments, for example,
than GPT-3.5. But actually in adversarial environments,
GPT-4 could actually even be more vulnerable than GPT-3.5, potentially because
it's actually better at following instructions.
So it's actually in something easier to jailbreak. And given the interest
of time I won't go through these - these are examples for these different perspectives.
And also you can learn more information at decodingtrust.io and also come
to our presentation on Tuesday morning.
And also there are other researchers' work demonstrating other types of
adversarial attacks, so our work also shows that, for example,
the attacks on open source models can be transferred to closed source models.
And Zico is going to talk more about their attack on universal adversarial attacks on
tomorrow as well. And other works have shown that adversarial attacks multi-model
models and also adversarial fine tuning where you can actually use
a very few adversarially-designed training examples
through fine-tune to completely break safety-aligned LLMs.
The key message that I want to get across here is that in adversarial attacks,
the field has made tremendous, tremendous progress.
We have collected and come out with so many different types of attacks.
But however, for adversarial defenses, the progress has been extremely slow.
So literally, we really don't have effective general adversarial defenses today.
And hence, for AI safety, it's really important for the AI to have safety mechanisms to be designed to be resilient
against adversarial attacks. And this actually poses a significant
challenge for AI safety and I think my time is up so just one word to summarize...
...Partially to address this issue.
In some recent work, also with the collaborator here, and Dan Hendrycks led this work with
some other collaborators here,
we have a new work on representation engineering to essentially look at
the internal representations - activations - from the LLM and showing that we can identify certain
directions um that can help identify whether the LLM is more honest or is more truthful for
reading and using these internal directions, we can also control the model behavior.