Bay Area Alignment Workshop

24-25 Oct 2024 | Santa Cruz

The Bay Area Alignment Workshop was held at Chaminade in Santa Cruz. The workshop brought together leaders and researchers from government, industry labs, nonprofits, and academia for a mix of lightning talks, expert speakers, and facilitated discussions. Participants explored topics such as threat models, safety cases, monitoring and assurance, interpretability, robustness, and oversight.

As part of the Alignment Workshop Series, the Bay Area workshop built on the success of previous highly rated events in Vienna (2024) and New Orleans (2023), showcasing progress in the field since those gatherings. We continue to host small events with limited space for attendees; please do let us know if you are interested in attending our next event.

All presentations are now freely available below and via our YouTube channel, contributing to a growing and publicly accessible repository of high-quality content.

Threat Models & Safety Cases

Anca Dragan - Optimized Misalignment



Monitoring and Assurance

Beth Barnes - METR Updates & Research Directions

Buck Shlegeris - AI Control: Strategies for Mitigating Catastrophic Misalignment Risk 


Governance and Security

Kwan Yee Ng - AI Policy in China

Daniel Kang - Dual use of AI agents

Kimin Lee - MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control

Sheila McIlraith - Using formal languages to encode reward functions, instructions, preferences, norms and advice

Atoosa Kasirzadeh - Value pluralism and AI value alignment

Chirag Agarwal - The (Un)Reliability of Chain-of-Thought Reasoning

Alex Turner - Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Jacob Hilton - Backdoors as an Analogy for Deceptive Alignment

Mantas Mazeika - Tamper-Resistant Safeguards for Open-Weight LLMs

Zac Hatfield-Dodds - Formal Verification is Overrated

Evan Hubinger - Alignment Stress-Testing at Anthropic

Day 2 Lightning Talks

Richard Ngo - Reframing AGI Threat Models

Dawn Song - A Sociotechnical Approach to a Safe, Responsible AI Future

Shayne Longpre - A Safe Harbor for AI Evaluation & Red Teaming

Soroush Pour - Third-Party Evals: Learnings from Harmony Intelligence

Joel Leibo - AGI-Complete Evaluation

David Duvenaud - Connecting Capability Evals to Danger Thresholds for Large-Scale Deployments


Interpretability

Atticus Geiger - State of Interpretability & Ideas for Scaling Up

Andy Zou - Improving AI Safety with Top-Down Interpretability


Robustness

Stephen Casper - Powering Up Capability Evaluations

Alex Wei - Paradigms and Robustness

Adam Gleave - Will Scaling Solve Robustness?


Oversight

Micah Carroll - Targeted Manipulation & Deception Emerge in LLMs Trained on User Feedback

Julian Michael - Empirical Progress on Debate

Program Committee

Anca Dragan
Director, AI Safety and Alignment, Google DeepMind; Associate Professor, UC Berkeley

Robert Trager
Co-Director, Oxford Martin AI Governance Initiative

Dawn Song
Professor, UC Berkeley

Dylan Hadfield-Menell
Assistant Professor, MIT

Adam Gleave
Founder, FAR AI
