Sam Bowman - Adversarial Scalable Oversight for Truthfulness: Work in Progress
Transcript
So I'll be talking about a scalable oversight research agenda that I've been pursuing with
collaborators both at NYU and Anthropic. So what problem are we solving?
Scalable oversight is a somewhat vague term. What do I mean in this context?
The big picture problem that I'm interested in is finding a method that will let us use LLMs
to reliably, correctly answer factual questions that no human could otherwise answer.
And I'm focused more on doing so reliably than on doing so for every possible question.
I'd like it to be the case that if it seems like we've managed to elicit an answer to a question
from a model, that we can trust that that answer is correct, even if there might be some questions
that we aren't able to answer. Why is this important for safety? Why am I talking about
this in an alignment workshop? First, setting aside safety, just on the motivation,
I think a lot of the promise of what we would hope to get out of successful deployments of
large language model style systems in the farther future relies on being able to do this. I think
one thing that many people would hope for is something like being able to ask the model,
"Hey, we're sort of encountering this new disease. Propose some molecules that might be good
candidates to explore for drug development..." where it's plausible that a model might be able to
synthesize information in a way that would allow it to do a task better than any human.
It's also pretty clear that if the model is doing that unreliably, and we don't know why it's giving the answers it's giving,
it's pretty useless. A low-confidence, tentative answer to a question like that is not
helpful. So we want to be able to do the task just in order to be able to take advantage of the
capabilities of models. For safety, though, I think there are a couple of pretty clear reasons
this is going to be helpful. First, a lot of the worst-case threat models around sort of agentic
systems being intentionally deceptive route through models trying to convince humans of untrue things,
and the more that we're able to reliably elicit true beliefs from models, the less surface area
there is for that to work. Plus, in the sort of more short-term prosaic side, I think a lot of
things like issues with RLHF, issues around sycophancy, can be mitigated by tools that let you
make sure you know how a language model understands a situation before giving
that model feedback. So those are a few reasons I'm excited about this kind of goal.
And I'm focused on a scalable oversight-flavored approach to this goal. As the oversight name
implies, we're focusing here on sort of human supervision as a source of trust.
So the particular version of the question we're trying to answer
is how do we enable humans to accurately supervise LLMs for truthfulness in domains that humans don't fully
understand? So the hope is that every question that a system is answering, if we're relying on
that answer, we're ultimately grounding out to a context in which the human fully understood why
that answer is correct. We're not deferring much to the model itself, we're just using it as a tool to get there.
The main agenda that we're pursuing in this direction is human-moderated
AI-AI debate. This is drawing on a whole bunch of different threads. I'm crediting Tamera, a collaborator at Anthropic, who
helped write this up into a coherent thing. But this is a research agenda that's gradually
grown as a fork of a longer-standing series of ideas in alignment, especially from
this Geoff Irving, Paul Christiano, Dario Amodei paper, AI Safety Via Debate, and some related work
from Geoffrey Irving, Amanda Askell, Beth Barnes, and Evan Hubinger. So if someone says debate in the context
of AI safety, they might mean a bunch of things other than what we're doing, to be clear. And
we owe a big debt to all of those things. Anyway, what's the deal with our AI-AI debate agenda?
What are we doing with debate that might be useful for scalable oversight and eliciting
truthful answers? So here's the basic setup. A human judge is responsible
for answering a two-choice question. This is the mode of interaction that we are trying to get to work very
robustly. So in our toy example, we're asking the human to decide, is it the case that turtles are bad?
This is the simplest debate we could fit on a slide. We're hoping that the human judge in
this case has all of the resources available to them that might reasonably be made available. And in particular,
since they're doing this for a bunch of questions, they're getting feedback on whether they're getting these questions right. So we're effectively training
the human judge, to the extent that you can train an annotator on a task like this.
As a resource, they're working with two AI debaters. These are something like fine-tuned large language models, and
we're focusing on cases where those debaters have some information advantage
or skill advantage over the judge that's relevant to the task such that the human judge should be relying on these models in order to answer the question.
Each debater is told to argue for one of the two sides. That's what they're doing.
The debaters are trained using reinforcement learning from the judge's confidence in their answer.
So they're essentially trained to win. If the human answers, I'm 75% confident that A is correct,
then debater A gets a small positive reward, debater B gets a small negative reward.
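To make that incentive concrete, here's a minimal sketch of a zero-sum reward derived from the judge's stated confidence; the function name, the centering at 0.5, and the scaling are illustrative assumptions, not the project's actual training code.

```python
def debate_rewards(judge_prob_a: float) -> tuple[float, float]:
    """Illustrative zero-sum reward from the judge's confidence.

    judge_prob_a: the judge's stated probability that debater A's answer
    is correct (e.g. 0.75 if the judge says "I'm 75% confident it's A").
    Returns (reward_for_A, reward_for_B), centered so a 50-50 verdict
    gives both debaters zero reward. The scaling is an assumption.
    """
    reward_a = judge_prob_a - 0.5   # +0.25 if the judge says 75% A
    reward_b = -reward_a            # zero-sum: B loses what A gains
    return reward_a, reward_b

# Example from the talk: the judge is 75% confident the answer is A.
print(debate_rewards(0.75))  # (0.25, -0.25): small positive for A, small negative for B
```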
The reason this doesn't just immediately devolve into deception and the models making stuff up
is that the debaters are asked and given the ability to present evidence for their arguments.
So any sort of contingent factual claim that the debaters are making, they should back up with some form of evidence.
And depending on the task, this could take different forms. This could be links to scientific papers.
This could be sort of code that demonstrably implements something that the human can run and check
that it produces the output it's supposed to, et cetera. The debate takes the form of an unstructured dialogue
between the debaters and the judge. This is some place we're departing from earlier work
that I'll get back to. And we're aiming for it to be the case that the judge will
only resolve one of these debates, or at least only resolve with high confidence,
if they have been given a fully satisfactory reason to believe that the answer is in fact
the one that they're choosing. So to be clear, it should be possible that the judge thinks,
OK, debater A did a much better job. They responded to all of the points that B made.
They were clear. They were cooperative. But I'm just going to put 50-50 because debater A hasn't
made a really comprehensive case. They haven't really answered all of my questions,
given me fully satisfactory evidence that the answer has to be A. So this is where we're aiming for this
kind of precision over recall case, where we're content if many of these debates
don't resolve as long as when they do resolve, the judge is getting it right.
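As a rough sketch of that precision-over-recall policy, a resolution rule might look like the following, where the judge abstains unless their confidence clears a high bar; the 0.9 threshold and the function name are made-up illustrative choices, not numbers from the project.

```python
def resolve(judge_prob_a: float, threshold: float = 0.9) -> str:
    """Hypothetical resolution rule: only commit to an answer when the
    judge's confidence is high; otherwise leave the debate unresolved.
    The 0.9 threshold is illustrative, not a number from the project.
    """
    if judge_prob_a >= threshold:
        return "A"
    if judge_prob_a <= 1 - threshold:
        return "B"
    return "unresolved"  # e.g. the "A argued better but didn't prove it" 50-50 case

print(resolve(0.95))  # "A"
print(resolve(0.50))  # "unresolved": better debating alone isn't enough
```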
So that's the setting. That's what we're trying to build out. I'm going to go through quickly a few projects that
are working towards implementing this as a tool with real LLMs. So first I want to give a quick intro to the testbed dataset
and task that we're using with this, which is this dataset called QuALITY from my NYU Group.
It's got a cute acronym. This is from Richard Pang, Alicia Parrish, and Nitish Joshi there.
So this is a multiple choice QA dataset. It's usually four choice, though we can trivially cut out
two of the choices to make it fit our format. They are reading comprehension questions
about pretty long stories. These are stories that are sort of the length of something
you might find in like New Yorker magazine. They take 20, 30, 40, 45 minutes to read.
And we verified that these questions can't be answered by someone who hasn't read the story.
If you give someone the question and they sort of get brief access to the story for a minute, they can search through.
They can see if they find any matches for a keyword or very quickly skim.
They're still not going to get it right. These are questions that require some amount
of holistic understanding of what happened in the story.
The way that we're using this to test debate is that we are giving debaters,
potentially language models, full access to the story.
And then we're trying to have judges answer the question who
don't have access to the story except through short excerpts that are selected by the debaters.
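One concrete piece of this setup is that any quote a debater presents can be checked mechanically against the hidden story, so the quoted evidence the judge sees is guaranteed to be genuine even when its framing is adversarial. Here's a minimal sketch of that kind of verified-quote check; the helper name and the simple whitespace-normalized matching rule are assumptions for illustration.

```python
def verify_quote(quote: str, story: str) -> bool:
    """Check that a debater's quoted excerpt appears verbatim in the story.

    The judge never sees the full story; a check like this can run on the
    debate-platform side so that quotes shown to the judge are real, even
    if the debater's framing of them is adversarial.
    """
    normalize = lambda s: " ".join(s.split())  # illustrative normalization
    return normalize(quote) in normalize(story)

story = "The turtle crossed the road slowly, ignoring the traffic entirely."
print(verify_quote("ignoring the traffic", story))      # True: admissible evidence
print(verify_quote("the turtle was very fast", story))  # False: fabricated quote
```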
Language models, until recently, have been pretty bad at this.
At least as of nine months ago, this was right where the frontier was.
Models could just barely start to do a task like this. And so our first efforts in this area
were actually using humans instead of language models as the debaters because language models weren't quite
reliable enough to be interesting. So these results just came out a few weeks ago.
This is primarily from Julian Michael, Salsabila Mahdi, and David Rein.
So we ran a study testing this protocol out using humans in all the roles.
So human debaters, human judge, similar incentives, just with cookies and praise instead of actual RL reward.
We also compared with a baseline that we're calling consultancy
where there's only one debater, where you only get to interact with one debater.
That debater has still been told that they only win points if the judge picks A. And so they're clearly biased.
But they also are incentivized to be cooperative with the judge, lest the judge notice what they're doing
and intentionally kind of rule against them to some extent. And I should say, actually, this consultancy setup is probing
something I think is interesting about debate, which is that in our setup, it's a little bit redundant.
There are two mechanisms in our debate setup that make sure that the judge doesn't resolve
to a false claim. One of them is just this requirement that the judge shouldn't resolve the debate unless they
are highly confident, unless they've seen a fully satisfactory argument for the thing
they're resolving for. And then also we have these two different debaters.
So any claim, any idea, any consideration, any argument that is being surfaced in the judge's reasoning
is going to get called out and criticized. And hopefully that
makes the judge less likely to miss something important or rely on a false assumption.
These are somewhat redundant, but I think we do want both.
I think if we relied purely on the idea that this
is a prover-verifier setup, where the language model
generates an argument and the human checks whether
it looks good, we would be relying very, very heavily on human reasoning
being robust, reliable, and infallible, in a way that I am not thrilled about relying on.
So anyway, here's our setup. I'll get through this plot in a second.
So we tested all this out. We actually hired a big swathe of the NYU competitive debate team
to do this. They didn't actually have to be debaters. Sort of competitive scholastic debate
isn't quite the same thing as what we're doing here. We could have easily gotten law students instead
who are studying litigation. That's also similar. But the debaters were the easy ones to hire.
They were excited about this. And this worked pretty well. Testing on this QuALITY dataset,
we were able to get human judge accuracy toward the sort of later stages of this protocol up to about 84%.
This is somewhat reassuring and also somewhat worrying, because this was a pretty intensive multi-month effort.
These kids were coming in sort of several hours a week for several months.
And we're talking quite a bit about sort of how to do this carefully, how to do this thoughtfully.
And we still didn't get perfect accuracy. There are still 6% or 7% of these questions
where the human judge confidently resolved in favor of the wrong answer.
So there isn't low-hanging fruit to make this really robust. I think figuring out how to push in this direction
is one of the big things we're worried about for this agenda.
But it does basically seem like it's working. It seems like things are pointing in the right direction.
We also did see that debate significantly outperforms consultancy by about 10 percentage points.
So it seems like this adversarial piece of debate, even if it's not really crisply theoretically
necessary for what we're doing, is very helpful in just having humans sort of converge
on true answers more quickly and reliably. And it also saves effort.
Debates are only 68% as long as the consultancy transcripts,
just measured by the number of words in the dialogue, basically because
if the human's working with this consultant,
they're going to have to do a lot more legwork to try to make
sure they're not missing an important consideration.
They're just going to have to be asking the model proactively many, many more things.
We've also got some experiments going with LLMs. These started a little bit more recently
as we started to find models that work on this task. This is led by Ansh Radhakrishnan at Anthropic.
Tamera Lanham's also getting involved. We make a few simplifying assumptions
in our early work here. There's not a paper on this out.
This is still a snapshot of ongoing work. But for now, we're doing a debate where each side just
gets to make one long argument. Each side basically just writes the short essay,
why the answer is A, why the answer is B.
They can hopefully anticipate each other's reactions,
but they're not engaged in a live back and forth.
We're not using human judges yet.
Here, the judge is just a version of Claude 2
fine-tuned with supervised learning on the correct answers for the debate judgment task.
We're simultaneously, kind of in alternating iterations,
optimizing Claude 2-based debaters to win debates. We're both using RL
and, as a sort of secondary method we found to be helpful, we're using this best-of-n method,
where we sample multiple possible arguments and keep the one that most sways the artificial judge.
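A minimal sketch of that best-of-n selection step is below; `sample_argument`, `judge_score`, and the default n are hypothetical stand-ins for the debater and judge model calls, not the actual implementation.

```python
import random
from typing import Callable

def best_of_n(sample_argument: Callable[[], str],
              judge_score: Callable[[str], float],
              n: int = 8) -> str:
    """Sample n candidate arguments and keep the one the (artificial) judge
    scores as most persuasive for this debater's side.
    The n=8 default and both function names are illustrative assumptions.
    """
    candidates = [sample_argument() for _ in range(n)]
    return max(candidates, key=judge_score)

# Toy stand-ins for the model calls, just to show the interface.
toy_args = ["weak argument", "solid argument with a verified quote", "off-topic rant"]
pick = best_of_n(lambda: random.choice(toy_args),
                 lambda a: len(a))  # toy judge: pretend longer == more persuasive
print(pick)  # almost always the "solid argument ..." candidate
```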
We found pretty encouraging results. Judge accuracy improves during training.
So this sort of optimization loop, this multi-agent game does push in the direction
of making the judge better at eliciting true information, even though the absolute scores are still not great.
Debater effectiveness also improves. This is a measure of basically how good the debaters are at winning.
We get a plus-100 Elo improvement. This essentially means that if you take an optimized debater
and pit it against an unoptimized debater, you should expect the optimized debater to win more often,
independent of who's been assigned the correct answer.
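For intuition about what a +100 Elo gap means, the standard Elo expected-score formula gives roughly a 64% win rate for the stronger side; this is just the usual Elo arithmetic, not code from the project.

```python
def elo_win_prob(rating_diff: float) -> float:
    """Expected win probability for the higher-rated player under the
    standard Elo model: 1 / (1 + 10^(-diff/400))."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

# A +100 Elo advantage corresponds to winning roughly 64% of the time,
# regardless of which side happens to be assigned the correct answer.
print(round(elo_win_prob(100), 2))  # ~0.64
```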
So this means that even as the dishonest debater is getting more effective,
even as there's kind of more manipulation getting thrown into the transcripts,
the overall system is better at producing true answers.
And the transcripts look like debates. And qualitatively, as you optimize harder,
the transcripts look more like sensible debates that are arguing about real considerations and real evidence.
So that's an encouraging sign of life. This graph is showing the Elo effect
of additional steps of RL, 50 steps of RL, and of the amount of this best-of-n optimization we're doing.
I mostly just wanted to have a picture of something to throw on here, apologies.
That's this best-of-n method. Yeah, that's the number of RL steps we've done. Thanks.
Quickly run through one last project, which is building a better testbed for this stuff.
This is a new data set that came out just a couple weeks ago, called Graduate-Level Google-Proof QA, or GPQA.
This is primarily from David Rein at NYU. So we want to push this really hard.
We want to actually get debate to the point that we understand it well enough,
we can implement it well enough, that we actually can run arguments on questions that are at the frontiers
of human knowledge. And we don't have the data and the sort of task setting
to really test that well yet. We want to test debate in settings where sort of smart, careful, well-resourced judges,
potentially spending many hours on a task, still need to rely on the debaters,
that they're still leaning on this system to get answers right.
We're not quite there yet. A very important reason for that is just
that if you let a debate on the QuALITY data set run really, really long, in every turn of that debate,
the debaters are quoting from the story. And so eventually, you just quote the entire story
at the judge. And then the judge can just read the story and ignore the transcript.
So there's kind of a limit on how much arguing you can do on this task.
QuALITY is also arguing about science fiction stories that are available on the web.
So we also need to block internet access for our judges, which is not the setting we ultimately
want to be in. So we put together a harder benchmark.
Here's a typical question from that benchmark. Methylcyclopentadiene, I guess I'm pronouncing it right,
was allowed to react with something-something, and a catalytic amount of something,
... a bright yellow cross-conjugated something formed.
How many chemically distinct isomers make up the final product? Multiple choice.
So it's still a multiple choice QA. Same basic format.
The question writers have PhD-level training in one of a few subfields of chem, bio, or physics.
We verified that the questions can be answered by someone else in the appropriate subfield
of their field. And we verified that, at least in aggregate,
if you give it to a different question writer, sort of if you take this question and give it to a
physicist, they'll do quite poorly, even if they spend half an hour or
45 minutes, and they're googling frantically and skimming scientific papers to try
to figure out what's going on. So this is a question where you really
need to rely on someone with domain expertise in order to be able to answer this, which
makes it this nice testbed. Qualitatively, the questions that sort of worked well
tend to require both pretty difficult concepts, just
pretty deep domain knowledge, as well as some tradecraft.
Like, I would guess that the fact that some intermediate product
here is yellow is the kind of thing that you can find evidence of in the scientific literature,
but a lot of the papers that discuss these compounds might not talk about them being yellow.
So we're trying to put together questions that lean pretty heavily on tradecraft
so that we can get this really clear expertise gap.
For better or worse, this is challenging for current systems.
GPT-4 augmented with search gets 39%. We're still figuring out our Claude 2 evaluation setup.
I have no reason to suspect it's going to do dramatically better.
So this is still also challenging for frontier models. So starting to wrap up, what's next?
Big positive takeaway is it seems like this protocol basically works.
There's a lot more turning the crank we need to do, but debate seems to incentivize AI systems
to help humans answer questions they could not have otherwise answered.
One of the obvious next steps that we're interested in, beyond some things we alluded to, is cross-domain transfer.
One of the big assumptions I haven't mentioned so far in debate is we're going to train our judges really
extensively on all kinds of domains where we can check the answer, all kinds of domains
that are sort of within normal human knowledge, and then have them try it out in a domain that is significantly
harder and more alien. And we'd like good proxies for that.
We'd like to know that that works. It's a big open question.
We'd also like to do debates with language models on GPQA.
Current language models aren't good enough. Hopefully, or hopefully-and-not-hopefully,
within the not-too-distant future, there will be language models that can do this,
and we will run these experiments. There are some lingering doubts.
Maybe flagging a couple of places where this protocol
shouldn't necessarily give us that much reassurance.
If there are questions for which the simplest argument
for the correct answer is too complex for any human judge, we're stuck.
This approach doesn't work. If there's something where you need
to sort of enumerate an exponential number of cases
to prove that the answer is A, this is not going to go anywhere. The original proposal from Irving et al.
does actually aim at this case, does actually aim at cases where the full proof, the full argument,
might not be enumerable. I think there are still some other open theoretical problems
we need to solve before we can implement that vision. But if problems like this wind up
being centrally important for safety, that is another route we're going to have to go down. And the other big source of doubt
that I want to flag is just this question: Are there important blind spots in human judging
that we aren't going to be able to train out? Are there topics or types of argument or types of question
where no matter how many times you tell someone, "No, no, no, you're getting this wrong,
mind this logical fallacy, double-check this thing," they're just going to have a systematically high error rate?
So if there are these blind spots, that potentially undermines the whole thing. This question surprisingly
seems to slip between the gaps of well-defined scientific fields, so I
think we don't have great evidence so far on how worried we should be here.
With that I'll close. Thanks to the many, many people who contributed to or advised or
commented on these various projects. Happy to talk about all this in office
hours. Also totally separate from this, if you're interested in responsible
scaling policies, I'm involved in that and happy to chat. Cool, thank you.
[Audience member] So in human activities, typically when we try to get to the
truth of the matter, we actually don't use debate, where we encourage one party
to basically take a position they don't believe in, but we use something more
like the peer review process, where one party makes a claim that it really believes is true
and then gives the evidence for it, and the other party, the reviewer, is playing
a little bit of a devil's advocate but isn't explicitly trying to argue for
something they don't believe is true. So, are there potentially ways to
train ML systems that will be more like that? Because that's what we do in human society.
Yeah, two things come to mind there. First, I think there are a lot of things that intuitively work well
for humans because we're confident that when someone
says "I want to argue for x," they are in fact - they do want to argue for x and believe x to be
true. And we don't want to assume this about language models. We don't want to assume that
models if you ask them sort of which side do you pick, which is correct, we don't want
to assume that we know that that is the answer that will actually be best supported by the
evidence the model can bring. We're sort of assuming that a lot of the motivations
behind the system are somewhat alien. That I think puts us in a position that is more
similar to something like a criminal trial where everyone's motives are somewhat suspect
and that's a case where we do use adversarial systems of evidence. That said, I think there
is probably room to move in this direction and I think once you move to debates over
open-ended questions rather than sort of two-choice questions, I think you likely do wind up introducing
some mechanisms where you're giving models more leeway in what version of a claim they want to defend.
Yeah, thanks. Yoshua?
[Audience member] So the goal is for these debates to tell us something in new cases where you don't need to have a human judge, right?
I mean the human judge part is for training. So if yes, I'm going to continue my question...
Yes and no. I think that the hope is
you can run the whole system with a human judge when you want sort of the highest reliability, the crispest answers.
You can also use a simulation of the judge or there are other ways of sort of using the debaters on their own to answer questions
that at the limit, at convergence, we should expect to work just as well.
[Audience member] So what if the two AIs, A and B, are both confidently wrong about something where safety matters? Then they might
come up with something that looks right, but actually A has arguments that look right and B doesn't find a counterargument
because they're both missing the right theory. Then we end up with a very unsafe decision at the end.
This is where we're crucially relying on the human to check everything, to not just go "This looks reasonable, it looks like A is doing a better job," but to really, actually verify that the
evidence is coming from the sources it claims to come from, and verify that the arguments are logically sound.
I think very often a successful debate case is going to be pretty close to a proof.
That makes this very expensive to run in a reliable way. These debates might wind up being very, very long... They're not going to work for everything.
But that's the hope, that you're not ultimately relying in a load-bearing way on the debaters being like successfully adversarial and raising all the right considerations in any one individual debate.
[audience member] Thank you, Sam, for the awesome talk. So I'd like to maybe dig in a little bit more into this last aspect in the debate framework,
which is, well, you know, ultimately we need it to work with real people and the real users that these systems are going to have.
And the main misgiving I have personally for a while with debate is that
to the extent that humans are not going to be these perfect judges who can never make a mistake and who are going to, as you said, have blind spots in their judgment, couldn't you argue that the Nash equilibrium from the debate is precisely one that exploits these blind spots and that learns to hide its mistakes or its fallacies in these blind spots as effectively as possible so that humans won't catch it?
Because, you know, one of the two debaters is trying to convince the human of the wrong thing in spite of the best efforts of the other debater to make them realize that this is in fact a wrong argument.
And so isn't this something that ultimately we should worry about, that we're actually going to help these systems to learn to exploit these vulnerabilities in real people?
Yeah. Two points there. So first, I want to say I don't anticipate
deploying debate as a sort of widespread end-user-facing product in something like this form.
The goal is very much to use this as a foothold for truthfulness, as a sort of last resort or something we're using specifically to verify safety or to verify that other mechanisms for truthfulness work.
So potentially we're only talking about a very small, very heavily trained group of judges, which mitigates this a little bit.
I think in general, there are encouraging signs that in an adversarial setting like this,
... just letting both sides go all out and make stuff up if they want to... does converge in the direction of truth winning.
But yeah, this last worry is real.
I hope, and I think it's likely, that there aren't flaws in human reasoning that are sort of insurmountable.
There aren't flaws in human reasoning where, if you point the flaw out to someone in the individual argument where it's coming up,
and you point it out over and over again, and they get it wrong and lose a few times,
and then you point it out again, they still keep falling for it. My suspicion is that for all of these sorts of flaws, all these logical fallacies,
eventually a careful judge is going to learn not to fall for them.
But if there are cases like that where just there are sort of deep vulnerabilities in human reasoning, then none of this works.
We shouldn't rely on it. Also, we're in a pretty bad situation in general at that point,
because this also, I suspect, undermines the foundations of science and democracy and likely many other things.
Yes, "demagogues succeed." That's my paraphrase of Yoshua's comment.