Alignment Workshop - NOLA 2023

New Orleans Alignment Workshop

10–11 December 2023 | New Orleans

In December 2023, top ML researchers from industry and academia convened for a second Alignment Workshop in New Orleans, just before NeurIPS 2023. The workshop was hosted by FAR AI, a nonprofit alignment research organization headed by Adam Gleave. Talks from this workshop are available below.

The workshop facilitated discussion and debate among ML researchers on topics related to AI alignment, with the aim of better understanding potential risks from advanced AI and strategies for mitigating them. The content built upon the talks from the workshop that took place in San Francisco earlier in 2023. You can watch the talks from the previous event here.

Keynote Speaker: Yoshua Bengio
Towards Quantitative Safety Guarantees and Alignment

Read the transcript here.

Introducing Alignment Problems

Main Speakers

Adam Gleave - AGI Safety: Risks and Research Directions

Owain Evans - Out-of-context Reasoning in LLMs

Oversight

Main Speakers

Sam Bowman - Adversarial Scalable Oversight for Truthfulness: Work in Progress

Lightning Talks

Anca Drăgan - Implications of human model misspecification for alignment

Max Tegmark - Provably safe AI

Brad Knox - Your RLHF fine-tuning is secretly applying a regret preference model

Sheila McIlraith - Epistemic side effects

Shane Legg - System Two Safety

Dylan Hadfield-Menell - Preference Learning in Alignment

Elad Hazan - AI safety by debate via regret minimization

Interpretability

Main Speakers

Been Kim - Alignment and Interpretability: How we might get it right

Roger Grosse - Studying LLM Generalization through Influence Functions

Lightning Talks

Victor Veitch - What (and Why) is a 'Linear' representation?

Eric Michaud - The Quantization Model of Neural Scaling

Sanjeev Arora - Emergence of Complex Skills in LLMs (and, do they understand us?)

Stephen Casper - Cognitive Dissonance: Why do Language Model Outputs Disagree with Internal Representations of Truthfulness?

Adrià Garriga Alonso - Understanding planning in neural networks

Atticus Geiger - Theories and Tools for Mechanistic Interpretability via Causal Abstraction

Johannes von Oswald - Mechanistic Interpretability of in-context learning

Robustness and Generalization

Main Speakers

Zico Kolter - Adversarial Attacks on Aligned Language Models

Collin Burns - Weak-to-Strong Generalization

Lightning Talks

Dawn Song - Challenges for AI safety in adversarial settings

Boaz Barak - The impossibility of (strong) watermarking

Dimitris Papailiopoulos - The challenge of monitoring covert interactions and behavioral shifts in LLM agents

Christian Schroeder de Witt - Secret Collusion Among Generative AI Agents: a Model Evaluation Framework

Eric Neyman - Heuristic arguments: An approach to detecting anomalous model behavior

Jascha Sohl-Dickstein - Adversarial examples transfer from machines to humans

Governance

Main Speakers

Gillian Hadfield - Building an Off Switch for AI

Aleksander Madry - Preparedness @ OpenAI

Lightning Talks

Irina Rish - Complex systems view of large-scale AI systems

Lewis Hammond - Multi-agent risks from advanced AI

Vincent Conitzer - Foundations of Cooperative AI

John Schulman - Keeping Humans in the Loop

David Krueger - Sociotechnical AI safety

Financial Support

We are grateful for financial support and sponsorship from:
