Soroush Pour - Third-Party Evals: Learnings from Harmony Intelligence
Transcript
I want to talk a little bit about what we've learned in the past year building a third-party evals org, Harmony Intelligence. I'll move quickly and keep it pretty high level, but I'm always happy to talk more about this after the lightning talk as well.
Great, so just a bit of background: my name is Soroush, and I'm the CEO and co-founder of a company called Harmony Intelligence. We're a one-year-old public benefit company, which means our mission is on equal footing with shareholder returns. We're four research engineers and growing quickly, and we're focused exclusively on catastrophic-risk evals, red-teaming, and audit.
Some of the things we've done this past year: we've published two papers, one on automated red-teaming and another with the MIT FutureTech group on the AI Risk Repository, which is basically a living database of AI risks that people can refer to. I'd recommend searching for that and finding those links.
We've worked with two top AI labs, one major government, and one research grant maker on various eval projects, typically measuring a specific catastrophic risk against a frontier model. We have a big focus on model autonomy and cyber, although we're interested in other areas of evaluation as well. We also spoke in front of an Australian Senate hearing on AI risk, and we have regular conversations with quite a few other policymakers to make the case that catastrophic risk is very real and important and needs to be acted upon.
I think we're pretty proud of what we've achieved in the first year, and I want to talk about what got us there. What have actually been the ingredients for this sort of impact in the first year? At a really high level, I'm just picking some of the biggest and most important.
The first is good eval design. There's a lot to say on this topic, but I'll focus on a couple of really big things which sound obvious but are exceedingly rare in real-world evals when you actually look at them. I'll talk about the consequential piece first: we try to measure things that people actually care about and that would really make a difference.
There are a lot of evaluations out there where, as soon as one triggers, people say it doesn't actually show anything important and doesn't change existing beliefs about the risks. We really try to focus on things that are extremely consequential, such as an eval result that would signal a massive change in the cybersecurity landscape.
The other big one is construct validity, or experimental validity: do you actually measure the thing you say you're measuring? We spend a lot of time with our partners making sure this is actually the case. A good example is a human baseline trial: being really clear that this is where the levels are today, and this is the delta versus the baseline. Okay, this model did really well on this knowledge set, but does that actually change the landscape? How did you measure that? How can you prove it?
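As a minimal sketch of what "proving the delta" can look like, assuming you have per-task pass/fail scores for both the model and a human baseline (the numbers and helper function below are illustrative, not Harmony's actual pipeline), you can report the delta alongside a bootstrap confidence interval rather than a bare number:

```python
# Illustrative only: estimate the delta between a model's eval score and a human
# baseline, with a bootstrap confidence interval, so the claim "this changes the
# landscape" comes with an uncertainty estimate attached.
import random
from statistics import mean

def bootstrap_delta_ci(model_scores, human_scores, n_boot=10_000, alpha=0.05):
    """Return (point_delta, lower, upper) for mean(model) - mean(human)."""
    deltas = []
    for _ in range(n_boot):
        m = [random.choice(model_scores) for _ in model_scores]  # resample with replacement
        h = [random.choice(human_scores) for _ in human_scores]
        deltas.append(mean(m) - mean(h))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return mean(model_scores) - mean(human_scores), lo, hi

# Hypothetical per-task pass rates (1 = solved) for a model and a human expert baseline.
model = [1, 1, 0, 1, 1, 0, 1, 1]
human = [1, 0, 0, 1, 0, 0, 1, 0]
delta, lo, hi = bootstrap_delta_ci(model, human)
print(f"delta vs. human baseline: {delta:+.2f} (95% CI {lo:+.2f} to {hi:+.2f})")
```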
Another thing that has been really powerful for us is a big focus on software engineering. We have a really deep software engineering pedigree, and a lot of this work is actually just basic software engineering and a lot of code that you have to write scalably: things like spinning up VMs and being able to reset your state on each eval run.
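As a hedged illustration of that kind of hygiene (the image, flags, and helper below are assumptions for the sketch, not Harmony's actual harness), here is one way to give each eval case a throwaway container and an empty scratch directory so no state leaks between runs:

```python
# Sketch: run each eval case in a fresh, short-lived Docker container with its
# own temporary workspace, so every run starts from clean state.
import subprocess
import tempfile

def run_eval_case(image: str, command: list[str]) -> subprocess.CompletedProcess:
    """Run one eval case in a fresh container with its own scratch directory."""
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            [
                "docker", "run", "--rm",        # container is deleted after the run
                "--network", "none",            # no network unless the eval needs it
                "-v", f"{scratch}:/workspace",  # empty workspace every time
                image, *command,
            ],
            capture_output=True,
            text=True,
            timeout=600,
        )

result = run_eval_case("python:3.12-slim", ["python", "-c", "print('clean run')"])
print(result.stdout)
```

The `--rm` flag means the container is destroyed even if the run fails, which is the "clear your state on each eval run" property in one line.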
So really good software engineering skills and maintainable code have been critical to our success. A couple of other things have been really powerful on the more non-technical side. One is biasing towards visible, valuable action: there are a lot of people out there who care about evals, but they'll spend a lot of time writing about how important evals are rather than actually writing evals and talking to the people who need to deploy them. That bias to action has been a big part of our success. The last one is talking to our partners, labs, governments, and policymakers, really understanding their constraints and their needs and trying to deliver on those, so that we can take concrete, incremental action rather than hit dead walls and say, "Well, we can't do the gold-plated version, so we're not going to do anything at all." That's been pretty critical for us.
I want to finish with a couple of concrete calls to action that we've picked up as a real-world evals org doing this work. On the policy front, evals are still voluntary, and that's a big issue.
We need to move to a world where evals and audits are, on some level, mandatory, so that the incentives for actually doing and prioritizing them within labs and other large organizations are there. I think that's really important from a policy perspective. On the technical side, the science of evals (and you've heard this from others as well) is still really weak.
We still lack things like a principled way to come up with the risks we want to measure, or confidence in how well we've covered those risks. So any work on the science of evals is incredibly valuable. I'll stop there and just say: if you're working on evals, red-teaming, or audit, or you have needs here, please come find me at this event. Afterwards, you can email me at my first name @HarmonyIntelligence.com, and I'm SoroushJP on socials. Thank you, everyone.