San Francisco Alignment Workshop
27–28 February 2023 | San Francisco
In February 2023, researchers from several leading industry AI labs (OpenAI, DeepMind, Anthropic) and universities (Cambridge, NYU) co-organized a two-day workshop on the problem of AI alignment, attended by 80 of the world’s leading machine learning researchers. We’re now making recordings and transcripts of the talks available online. The content ranged from the very concrete to the highly speculative, and the recordings include the many questions, interjections, and debates that arose throughout. We hope they provide a window into how leading researchers are grappling with one of the most pressing problems of our time.
If you're a machine learning researcher interested in attending follow-up workshops to the San Francisco Alignment Workshop, you can fill out this form.
Main talks
Ilya Sutskever—Opening Remarks: Confronting the Possibility of AGI
Jacob Steinhardt—Aligning Massive Models: Current and Future Challenges
Ajeya Cotra—“Situational Awareness” Makes Measuring Safety Tricky
Paul Christiano—How Misalignment Could Lead to Takeover
Jan Leike—Scaling Reinforcement Learning from Human Feedback
Chris Olah—Looking Inside Neural Networks with Mechanistic Interpretability
Dan Hendrycks—Surveying Safety Research Directions
Lightning talks (Day 1)
Jason Wei—Emergent abilities of language models
Martin Wattenberg—Emergent world models and instrumenting AI systems
Been Kim—Alignment, setbacks and beyond alignment
Jascha Sohl-Dickstein—More intelligent agents behave less coherently
Ethan Perez—Model-written evals
Daniel Brown—Challenges and progress towards efficient and causal preference-based reward learning
Boaz Barak—For both alignment and utility: focus on the medium term
Ellie Pavlick—Comparing neural networks' conceptual representations to humans’
Percy Liang—Transparency and standards for language model evaluation
(Note: some parts of the video with audience questions are inaudible)
Lightning talks (Day 2)
Sam Bowman—Measuring progress on scalable oversight for large language models
Zico Kolter—"Safe Mode": the case for (manually) verifying the output of LLMs
Roger Grosse—Understanding LLM generalization using influence functions
Scott Niekum—Models of human preferences for learning reward functions
Aleksander Madry—Faster datamodels as a new approach to alignment
Andreas Stuhlmüller—Iterated decomposition: improving science Q&A by supervising reasoning processes
Paul Christiano—Mechanistic anomaly detection
Lionel Levine—Social dynamics of reinforcement learners
Vincent Conitzer—Foundations of Cooperative AI Lab
Scott Aaronson—Cryptographic backdoors in large language models
(Note: some parts of the video with audience questions are inaudible)
Attendees
Organizers
Ilya Sutskever
Richard Ngo
Sam Bowman
Tim Lillicrap
David Krueger
Jan Leike
Participants
Scott Aaronson
Joshua Achiam
Samuel Albanie
Jimmy Ba
Boaz Barak
James Bradbury
Noam Brown
Daniel Brown
Yuri Burda
Danqi Chen
Mark Chen
Eunsol Choi
Yejin Choi
Paul Christiano
Jeff Clune
Vincent Conitzer
Ajeya Cotra
Allan Dafoe
David Dohan
Anca Dragan
David Duvenaud
Jacob Eisenstein
Owain Evans
Amelia Glaese
Adam Gleave
Roger Grosse
He He
Dan Hendrycks
Jacob Hilton
Andrej Karpathy
Been Kim
Durk Kingma
Zico Kolter
Shane Legg
Lionel Levine
Omer Levy
Percy Liang
Ryan Lowe
Aleksander Madry
Tegan Maharaj
Nat McAleese
Sheila McIlraith
Roland Memisevic
Jesse Mu
Karthik Narasimhan
Scott Niekum
Chris Olah
Avital Oliver
Jakub Pachocki
Ellie Pavlick
Ethan Perez
Alec Radford
Jack Rae
Aditi Raghunathan
Aditya Ramesh
Mengye Ren
William Saunders
Rohin Shah
Jascha Sohl-Dickstein
Jacob Steinhardt
Andreas Stuhlmüller
Alex Tamkin
Nate Thomas
Maja Trebacz
Ashish Vaswani
Victor Veitch
Martin Wattenberg
Greg Wayne
Jason Wei
Hyung Won Chung
Yuhuai Wu
Jeffrey Wu
Diyi Yang
Wojciech Zaremba
Brian Ziebart