Shayne Longpre - A Safe Harbor for AI Evaluation & Red Teaming
Transcript
Hello everyone. I'm Shayne. I'm a PhD student at MIT, and I want to talk to you today about our work – A Safe Harbor for AI Evaluation and Red Teaming – which was presented as an oral at ICML. I don't need to convince anybody here, but we've seen in products and services that general-purpose AI has been rolled out to an estimated billion-plus users worldwide.
And because these systems are general purpose, their uses and their risks are continually evolving, and often unforeseen. What's more, once the models are deployed – and this is the setting we're talking about, post-deployment – they change frequently, on a cadence: safeguards are updated, there's additional continuous training, and other interventions are applied to make the models safer while they're out there in the wild.
This is a fairly subjective taxonomy, and many of you could add to this list, but it's important to show that, in addition to software security vulnerabilities – and AI systems are, after all, software – there are additional security issues tied to the algorithm itself, to the AI. There are safety concerns, trustworthiness concerns, and concerns about the socio-economic impacts as these systems interact with people, with children, and with other software agentically.
But the important thing I want to point out is that one set of these vulnerabilities or risks has fantastic reporting infrastructure: disclosure systems, protocols, and best practices. There are also enshrined legal protections and guidance from the Department of Justice on how to do this and how third-party security evaluation can happen. And there are bug bounties with financial rewards – upwards of 12 million dollars that Google gives out each year – just because of how valuable and important these types of disclosures are.
But compare this to AI. We don't have anything like this right now for all of these enumerated risks. The reporting infrastructure often has race dynamics: academics want to post to Twitter before they're scooped – I've heard this firsthand from people. And if you have a transferable attack, who do you report it to?
You have to find the emails of individual people at each company; maybe you don't know who to talk to, maybe you don't even know who is using the models you found vulnerabilities in. And there are few to no legal protections, voluntary or otherwise, let alone financial rewards.
And so, in speaking with many AI researchers – while many here are at the vanguard of incredible safety research and have connections to the major labs – in our experience, many researchers without those connections are fearful of violating terms of service and losing access to their accounts, which has happened at a number of companies and in a number of cases. They may also worry about hurting their chances of getting a job at those companies if they do something seen as untoward.
So there are chilling effects on research, disincentives to tackle certain problems, and an imbalance in representation, given that explicit permission to do this research is only granted by the companies through selective research access programs. And so we're proposing a safe harbor for AI evaluation: a voluntary commitment to protect what is clearly defined as good-faith research, reducing the fear of repercussions.
I won't get into this here, but in the paper we show that we don't have anything like this right now. There's no good transparency, justification, or process if your account is suspended for doing good-faith research. And earlier this year, we released an open letter, which espoused some of these principles and asked for more equitable access to do good-faith AI research, along with basic protections.
It was really an easy lift in my mind – I can explain why offline – and it was co-signed by about 350 researchers and received some press. I want to convince you that there's a small set of commitments that companies can make to enable this research, both legally and technically. And these only apply to researchers who follow strict, verifiable rules of engagement: they only test models that are in scope; they disclose the vulnerabilities they find according to certain protocols and with a certain amount of advance notice; they don't harm users or the systems themselves in the process of that testing; and they follow rigid privacy requirements.
These commitments are verifiable and do not protect malicious use. The security community has been doing this for decades, and it's been extremely helpful. We need to adopt it in AI. We are also hosting a workshop on Monday, in three days. It's virtual, and you're welcome to attend or check it out if you want to learn more from top experts in the field, across security, AI, and the law. Thank you so much.