Oliver Klingefjord – What are Human Values, and How Do We Align AI to Them?

Transcript

Hi everyone, my name is Oliver. I'm here with my co-founder Joe from the Meaning Alignment Institute, and I will talk to you briefly about a paper we wrote called "What are human values, and how do we align AI to them?". One framing of the field of alignment is to align AI to human values, but of course this raises the question: what are values?


Are these values? They are, but framed in these terms, we bump into some problems. So for one, my conception of freedom might be entirely different from your conception of freedom. And these terms don't say anything about how they apply or when. And yet, this is what we find in, for example, Claude's constitution, with statements like, “please choose the response that most supports and encourages freedom, equality, and a sense of brotherhood”. 


And unfortunately, this sort of vagueness is quite common, not just in alignment, but also in philosophy more broadly. But we found one exception to this rule, which is in the philosophy of choice. In this field, values are talked about as a kind of language we use to evaluate different options when we make choices.


To make that concrete, let's look at honesty. When I choose to be honest, or when I make a choice informed by honesty, I pay attention to certain things. Maybe I pay attention to how something feels in my body when I say it, and whether I can fully endorse it. And another person might pay attention to whether the words might be taken to mean something stronger than what they intended. And a third person's way of being honest might be to justify their statements with rigorous methods and knowledge. 


And of course these are three different ways of acting in the world, and so in our process they show up as three distinct values. And what you see here is what we call "values cards".


So this is an encapsulation of a value, a way of acting in the world, framed in a way that makes it easy to distinguish whether two people actually share the same value or not. You can ignore the top part here, the title and the description - that's generated, not super important. The important part is the bottom part, which is a set of attentional policies.


So these are things that it's meaningful for a person with this value to attend to when they make a choice informed by it. So in this case, it's things like the doubts about the claim they're making due to the limits of the methods they use, or the clarity that comes from carefully distinguishing facts versus speculations. 
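To make the shape of a values card concrete, here is a minimal sketch in Python. The ValuesCard dataclass and its field names are illustrative assumptions, not the exact schema from the paper; the attentional policies shown are the ones just described for the rigorous, methods-based way of being honest.

```python
from dataclasses import dataclass, field

@dataclass
class ValuesCard:
    """Illustrative sketch of a values card: a generated title and
    description, plus the attentional policies that characterize
    choices made with this value."""
    title: str
    description: str
    attentional_policies: list[str] = field(default_factory=list)

# A card for the rigorous, methods-based way of being honest described above.
rigorous_honesty = ValuesCard(
    title="Rigorous Honesty",
    description="Being honest by grounding claims in careful methods.",
    attentional_policies=[
        "Doubts about the claim I'm making, given the limits of my methods",
        "The clarity that comes from carefully distinguishing facts from speculation",
    ],
)
```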


So this makes it clearer whether someone is following this value or not, and it also makes it easier to reason about what it would mean for a model to adhere to a value like this. But of course, there's another problem, which is that as individuals, we all have different values. What do we do about that?


How do we reconcile these differences? You could just average across them. You could count votes in some sort of public process, but all of these approaches would miss two very important things about values. 


Firstly, values are highly contextual; we live in different contexts. Some people live in the countryside with a big family, some people live in the city. And of course, these contexts demand different kinds of values. And we don't want to miss out on this rich contextual information in our process. We don't want to average across these. 


And secondly, as we go through life and encounter new situations, we grapple with new tensions, our values productively evolve, and we learn new things about what's important to us. We don't want to cut off this moral learning. Instead we want to draw it out, aiming not for a sort of fat middle of wisdom, but for its forward edge.


And this is what we do in our process. We have people reason about values: where they think they apply, and also which values are wiser than others. And this results in a data object that we call a moral graph. You can think of this as a kind of alternative to a constitution. It's basically made up of values (which are the nodes here in the graph), and a broad agreement that one value is wiser than another for a particular context (these are the edges).
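As a rough sketch of this data object (the field names and node identifiers below are illustrative, not the paper's exact schema), a moral graph can be represented as a set of values plus context-tagged "wiser than" edges between them:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WiserThanEdge:
    """Broad agreement that one value is wiser than another for a
    particular context. Field names here are illustrative."""
    less_wise: str   # node id of the less wise value
    wiser: str       # node id of the wiser value
    context: str     # the kind of choice this judgment applies to

# Toy moral graph built from the three honesty values discussed earlier.
values = {"felt_honesty", "careful_wording", "rigorous_honesty"}
edges = [
    WiserThanEdge("felt_honesty", "careful_wording",
                  context="reassuring someone in distress"),
    WiserThanEdge("careful_wording", "rigorous_honesty",
                  context="making an empirical claim"),
]
```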


And with a structure like this, we can run algorithms like PageRank to determine which values apply where, and thereby specify in a very detailed manner how a model should act.
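As a hedged sketch of how such a ranking could work, the toy example below uses networkx's PageRank, with each edge pointing from a less wise value to a wiser one so that scores accumulate on the values the graph judges wiser. The node names are illustrative, and the paper's actual procedure (including restricting edges to a given context before ranking) may differ.

```python
import networkx as nx

# Toy moral graph: each directed edge points from a less wise value to a
# wiser one, so PageRank mass flows toward the wiser values. In practice
# one would presumably filter edges by context before ranking.
G = nx.DiGraph()
G.add_edge("felt_honesty", "careful_wording")
G.add_edge("careful_wording", "rigorous_honesty")
G.add_edge("felt_honesty", "rigorous_honesty")

scores = nx.pagerank(G)
for value, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{value}: {score:.3f}")
```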


And that's all I have for now. If you're interested in learning more about the results, et cetera, here's a link to the paper. And if you have questions, then please come talk to me or Joe. We'll be here. Thank you.