Bay Area Alignment Workshop
24-25 Oct 2024 | Santa Cruz
The Bay Area Alignment Workshop was held at Chaminade in Santa Cruz. The workshop brought together leaders and researchers from government, industry labs, nonprofits, and academia for a mix of lightning talks, expert speakers, and facilitated discussions. Participants explored topics such as threat models, safety cases, monitoring and assurance, interpretability, robustness, and oversight.
As part of the Alignment Workshop Series, the Bay Area workshop built on the success of previous highly rated events in Vienna (2024) and New Orleans (2023), showcasing progress in the field since those gatherings. We continue to host small events with limited space for attendees; please let us know if you are interested in attending our next event.
All presentations are now freely available below and via our YouTube channel, contributing to a growing and publicly accessible repository of high-quality content.
Threat Models & Safety Cases
Anca Dragan - Optimized Misalignment
Monitoring and Assurance
Beth Barnes - METR Updates & Research Directions
Buck Shlegeris - AI Control: Strategies for Mitigating Catastrophic Misalignment Risk
Governance and Security
Kwan Yee Ng - AI Policy in China
Daniel Kang - Dual use of AI agents
Kimin Lee - MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control
Sheila McIlraith - Using formal languages to encode reward functions, instructions, preferences, norms and advice
Atoosa Kasirzadeh - Value pluralism and AI value alignment
Chirag Agarwal - The (Un)Reliability of Chain-of-Thought Reasoning
Alex Turner - Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Jacob Hilton - Backdoors as an Analogy for Deceptive Alignment
Mantas Mazeika - Tamper-Resistant Safeguards for Open-Weight LLMs
Zac Hatfield-Dodds - Formal Verification is Overrated
Evan Hubinger - Alignment Stress-Testing at Anthropic
Day 2 Lightning Talks
Richard Ngo - Reframing AGI Threat Models
Dawn Song - A Sociotechnical Approach to a Safe, Responsible AI Future
Shayne Longpre - A Safe Harbor for AI Evaluation & Red Teaming
Soroush Pour - Third-Party Evals: Learnings from Harmony Intelligence
Joel Leibo - AGI-Complete Evaluation
David Duvenaud - Connecting Capability Evals to Danger Thresholds for Large-Scale Deployments
Interpretability
Atticus Geiger - State of Interpretability & Ideas for Scaling Up
Andy Zou - Improving AI Safety with Top-Down Interpretability
Robustness
Stephen Casper - Powering Up Capability Evaluations
Alex Wei - Paradigms and Robustness
Adam Gleave - Will Scaling Solve Robustness?
Oversight
Micah Carroll - Targeted Manipulation & Deception Emerge in LLMs Trained on User Feedback
Julian Michael - Empirical Progress on Debate
Program Committee
Anca Dragan - Director, AI Safety and Alignment, Google DeepMind; Associate Professor, UC Berkeley
Robert Trager - Co-Director, Oxford Martin AI Governance Initiative
Dawn Song - Professor, UC Berkeley
Dylan Hadfield-Menell - Assistant Professor, MIT
Adam Gleave - Founder, FAR AI