Alex Turner - Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Transcript

When we train neural networks, they vacuum up a large range of capabilities, and we don't always want them to have capabilities that could enable dangerous uses. With gradient routing, we present a potential way to localize the computation and capabilities learned within networks to pre-specified subregions.


And, as I said before, the reason we want to do this is that as you train a model to help you with ML research, you might also train it to model human psychology, just as part of the general world model it learns in order to predict the next token. If only we could have it help us without it thinking so much about how to game us.


But there's a big challenge here: if you try to penalize the model, say by penalizing correct predictions on a specific data distribution, this can warp its internal structure. It can create an incentive for the model to still have the capability, but just not surface it in those situations.


So our strategy is: we don't try to stop it. We let it happen, but we try to control where it happens. With gradient routing, you basically just do backpropagation, but you take certain gradients and set them to zero on certain data points. So imagine I'm trying to localize the model's knowledge of trees to a couple of dimensions in an MLP layer. Then, when I update the model on tokens like “tree” or “forest”, I mask the gradients to the unrelated dimensions and only let the update flow into the specific weights where I want that knowledge localized. We do this at a couple of layers.
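As a concrete illustration, here is a minimal PyTorch sketch of this kind of per-example gradient masking. This is not the paper's code; the module shape, the `route` helper, and names like `tree_dims` and `is_tree` are made up for illustration. The forward pass is unchanged; the mask only controls which hidden dimensions pass gradient backward for which examples.

```python
import torch
import torch.nn as nn

def route(h, mask):
    # The forward value of h is unchanged; gradients only flow backward through
    # entries where mask == 1, so weights feeding the masked-out entries get
    # no update from those examples.
    return mask * h + (1 - mask) * h.detach()

class RoutedMLP(nn.Module):
    """Tiny MLP block whose hidden-layer gradients can be masked per example."""
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x, grad_mask=None):
        h = torch.relu(self.up(x))
        if grad_mask is not None:  # (batch, d_hidden) tensor of 0s and 1s
            h = route(h, grad_mask)
        return self.down(h)

# Localize "tree" updates to the first 8 hidden dimensions (hypothetical choice).
d_hidden = 256
tree_dims = torch.zeros(d_hidden)
tree_dims[:8] = 1.0

mlp = RoutedMLP(d_hidden=d_hidden)
x = torch.randn(4, 64)
is_tree = torch.tensor([True, False, True, False])  # which batch items are "tree" data

# Tree examples may only update the designated dims; other examples update everything.
mask = torch.where(is_tree[:, None], tree_dims[None, :], torch.ones(1, d_hidden))
out = mlp(x, grad_mask=mask)
out.sum().backward()  # gradients for the tree rows flow only through tree_dims
```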


This lets us configure which parameters are updated by which data points. In the paper I supervised on this, we had several demonstrations; I'm going to talk about the language model and the gridworld agent. The main takeaways to keep in mind, though, are that we're localizing capabilities as we train them, and that it works even if we can't provide all the correct labels.


So with language models, we expand the network with some extra dimensions where we want to stash away the targeted capability. These would be the “MLP+” dimensions up top. We route the gradients on those data points so they can only go through those extra MLP dimensions, and then we ablate: we just knock those dimensions away at inference time when we want to delete the capability.
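Here is a rough sketch of what that setup might look like, building on the routing trick above. Again, this is illustrative: the class name, the `d_extra` “MLP+” width, and the `is_target`/`ablate` arguments are my own assumptions, not the paper's API.

```python
import torch
import torch.nn as nn

class ExpandedMLP(nn.Module):
    """MLP with d_extra added 'MLP+' hidden dims reserved for the targeted capability."""
    def __init__(self, d_model=64, d_hidden=256, d_extra=16):
        super().__init__()
        self.d_hidden = d_hidden
        self.up = nn.Linear(d_model, d_hidden + d_extra)
        self.down = nn.Linear(d_hidden + d_extra, d_model)

    def forward(self, x, is_target=None, ablate=False):
        h = torch.relu(self.up(x))
        if is_target is not None:
            # Routing: on targeted ("tree") examples, block gradients through the
            # original dims so those updates land only in the extra MLP+ dims.
            mask = torch.ones_like(h)
            mask[is_target, : self.d_hidden] = 0.0
            h = mask * h + (1 - mask) * h.detach()
        if ablate:
            # Inference-time ablation: knock out the extra dims to delete the capability.
            h = torch.cat(
                [h[:, : self.d_hidden], torch.zeros_like(h[:, self.d_hidden :])], dim=-1
            )
        return self.down(h)

mlp = ExpandedMLP()
x = torch.randn(4, 64)
is_target = torch.tensor([True, False, False, True])

loss = mlp(x, is_target=is_target).pow(2).mean()  # routed training update
loss.backward()

with torch.no_grad():
    y = mlp(x, ablate=True)  # targeted capability removed at inference
```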


The hope is that if we localized it, then after ablation the information is just gone. We compare against a model that was never updated on any stories relating to trees; this is on TinyStories, and we call that the pure model. And we show how well you can train these models to get low loss on the “tree” distribution.


Our technique is in purple, the pure model is in orange, and RMU is in green. We see that even as we bump up the number of stories we're using to train the model, ours is only slightly worse than the model that's never been trained on trees before, and it's much better than RMU, which, to be fair, is only applied at fine-tuning time.


Gradient routing, in contrast, takes place throughout all of pre-training. So this is pretty good evidence that by ablating at the end, we're deleting most of the information relating to trees.

Then there's scalable oversight. Here we've got some set of labels, but maybe the agent knows when we're supervising it: normally it would do the right thing we want when it sees that we're overseeing it, but otherwise it would do something that we might be spuriously reinforcing, like going to the ghost.


So, on a subset of situations, the ones we have labels for, we route: when the agent goes to the ghost, the update goes through a ghost expert in the network, and when it goes to the diamond, it routes through a diamond expert. We get a network that has two shards, two little pieces. When you restrict to one of these experts, you get pretty powerful steering.
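A rough sketch of what this two-expert routing could look like for a small policy network follows. This is illustrative only: the architecture, the `route_to` argument, and the expert names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TwoExpertPolicy(nn.Module):
    """Shared trunk feeding two expert blocks; labeled episodes route their
    gradients to the matching expert, and either expert can be gated off later."""
    def __init__(self, d_obs=32, d_hidden=64, n_actions=4):
        super().__init__()
        self.trunk = nn.Linear(d_obs, d_hidden)
        self.experts = nn.ModuleDict({
            "diamond": nn.Linear(d_hidden, d_hidden),
            "ghost": nn.Linear(d_hidden, d_hidden),
        })
        self.head = nn.Linear(d_hidden, n_actions)

    def forward(self, obs, route_to=None, active=("diamond", "ghost")):
        h = torch.relu(self.trunk(obs))
        outs = []
        for name in active:
            e = torch.relu(self.experts[name](h))
            if route_to is not None and name != route_to:
                e = e.detach()  # block gradients into the non-matching expert
            outs.append(e)
        return self.head(sum(outs))

policy = TwoExpertPolicy()
obs = torch.randn(1, 32)

# Labeled "went to the diamond" episode: only the diamond expert gets updated.
logits = policy(obs, route_to="diamond")

# Steering at inference: restrict the network to the diamond expert.
steered = policy(obs, active=("diamond",))
```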


So at the end we can activate only the diamond expert, and we get the agent going to the diamond very often, even though we only routed on a very small set of labels. I'm a big fan of this paper; I think my mentees did a great job on it, and I'm happy to talk about it afterwards.


Thank you.