New Orleans Alignment Workshop
10–11 December 2023 | New Orleans
In December 2023, top ML researchers from industry and academia convened for a second Alignment Workshop in New Orleans, just before NeurIPS 2023. The workshop was hosted by FAR AI, a nonprofit alignment research organization headed by Adam Gleave. Talks from this workshop are available below.
The workshop facilitated discussion and debate among ML researchers on topics related to AI alignment, with the goal of better understanding potential risks from advanced AI and strategies for mitigating them. The content built on the talks from the workshop held in San Francisco earlier in 2023. You can watch the talks from the previous event here.
Keynote Speaker: Yoshua Bengio
Towards Quantitative Safety Guarantees and Alignment
Introducing Alignment Problems
Main Speakers
Adam Gleave - AGI Safety: Risks and Research Directions
Owain Evans - Out-of-context Reasoning in LLMs
Oversight
Main Speakers
Lightning Talks
Anca Drăgan - Implications of human model misspecification for alignment
Max Tegmark - Provably safe AI
Brad Knox - Your RLHF fine-tuning is secretly applying a regret preference model
Sheila McIlraith - Epistemic side effects
Shane Legg - System Two Safety
Dylan Hadfield-Menell - Preference Learning in Alignment
Elad Hazan - AI safety by debate via regret minimization
Interpretability
Main Speakers
Lightning Talks
Victor Veitch - What (and Why) is a 'Linear' representation?
Eric Michaud - The Quantization Model of Neural Scaling
Sanjeev Arora - Emergence of Complex Skills in LLMs (and, do they understand us?)
Stephen Casper - Cognitive Dissonance: Why do Language Model Outputs Disagree with Internal Representations of Truthfulness?
Adrià Garriga-Alonso - Understanding planning in neural networks
Atticus Geiger - Theories and Tools for Mechanistic Interpretability via Causal Abstraction
Johannes von Oswald - Mechanistic Interpretability of in-context learning
Robustness and Generalization
Main Speakers
Zico Kolter - Adversarial Attacks on Aligned Language Models
Collin Burns - Weak-to-Strong Generalization
Lightning Talks
Dawn Song - Challenges for AI safety in adversarial settings
Boaz Barak - The impossibility of (strong) watermarking
Dimitris Papailiopoulos - The challenge of monitoring covert interactions and behavioral shifts in LLM agents
Christian Schroeder de Witt - Secret Collusion Among Generative AI Agents: a Model Evaluation Framework
Eric Neyman - Heuristic arguments: An approach to detecting anomalous model behavior
Jascha Sohl-Dickstein - Adversarial examples transfer from machines to humans
Governance
Main Speakers
Gillian Hadfield - Building an Off Switch for AI
Aleksander Madry - Preparedness @ OpenAI
Lightning Talks
Irina Rish - Complex systems view of large-scale AI systems
Lewis Hammond - Multi-agent risks from advanced AI
Vincent Conitzer - Foundations of Cooperative AI
John Schulman - Keeping Humans in the Loop
David Krueger - Sociotechnical AI safety