AXRP - the AI X-risk Research Podcast

Daniel Filan

Available Episodes

5 of 48
  • 38.4 - Shakeel Hashim on AI Journalism
    AI researchers often complain about the poor coverage of their work in the news media. But why is this happening, and how can it be fixed? In this episode, I speak with Shakeel Hashim about the resource constraints facing AI journalism, the disconnect between journalists' and AI researchers' views on transformative AI, and efforts to improve the state of AI journalism, such as Tarbell and Shakeel's newsletter, Transformer.
    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    The transcript: https://axrp.net/episode/2025/01/05/episode-38_4-shakeel-hashim-ai-journalism.html
    FAR.AI: https://far.ai/
    FAR.AI on X (aka Twitter): https://x.com/farairesearch
    FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
    The Alignment Workshop: https://www.alignment-workshop.com/
    Topics we discuss, and timestamps:
    01:31 - The AI media ecosystem
    02:34 - Why not more AI news?
    07:18 - Disconnects between journalists and the AI field
    12:42 - Tarbell
    18:44 - The Transformer newsletter
    Links:
    Transformer (Shakeel's substack): https://www.transformernews.ai/
    Tarbell: https://www.tarbellfellowship.org/
    Episode art by Hamish Doodles: hamishdoodles.com
    Duration: 24:14
  • 38.3 - Erik Jenner on Learned Look-Ahead
    Lots of people in the AI safety space worry about models being able to make deliberate, multi-step plans. But can we already see this in existing neural nets? In this episode, I talk with Erik Jenner about his work looking at internal look-ahead within chess-playing neural networks.
    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    The transcript: https://axrp.net/episode/2024/12/12/episode-38_3-erik-jenner-learned-look-ahead.html
    FAR.AI: https://far.ai/
    FAR.AI on X (aka Twitter): https://x.com/farairesearch
    FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
    The Alignment Workshop: https://www.alignment-workshop.com/
    Topics we discuss, and timestamps:
    00:57 - How chess neural nets look into the future
    04:29 - The dataset and basic methodology
    05:23 - Testing for branching futures?
    07:57 - Which experiments demonstrate what
    10:43 - How the ablation experiments work
    12:38 - Effect sizes
    15:23 - X-risk relevance
    18:08 - Follow-up work
    21:29 - How much planning does the network do?
    Research we mention:
    Evidence of Learned Look-Ahead in a Chess-Playing Neural Network: https://arxiv.org/abs/2406.00877
    Understanding the learned look-ahead behavior of chess neural networks (a development of the follow-up research Erik mentioned): https://openreview.net/forum?id=Tl8EzmgsEp
    Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT: https://arxiv.org/abs/2310.07582
    Episode art by Hamish Doodles: hamishdoodles.com
    Duration: 23:46
  • 39 - Evan Hubinger on Model Organisms of Misalignment
    The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he's worked on at Anthropic under this agenda: "Sleeper Agents" and "Sycophancy to Subterfuge".
    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    The transcript: https://axrp.net/episode/2024/12/01/episode-39-evan-hubinger-model-organisms-misalignment.html
    Topics we discuss, and timestamps:
    0:00:36 - Model organisms and stress-testing
    0:07:38 - Sleeper Agents
    0:22:32 - Do 'sleeper agents' properly model deceptive alignment?
    0:38:32 - Surprising results in "Sleeper Agents"
    0:57:25 - Sycophancy to Subterfuge
    1:09:21 - How models generalize from sycophancy to subterfuge
    1:16:37 - Is the reward editing task valid?
    1:21:46 - Training away sycophancy and subterfuge
    1:29:22 - Model organisms, AI control, and evaluations
    1:33:45 - Other model organisms research
    1:35:27 - Alignment stress-testing at Anthropic
    1:43:32 - Following Evan's work
    Main papers:
    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566
    Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models: https://arxiv.org/abs/2406.10162
    Anthropic links:
    Anthropic's newsroom: https://www.anthropic.com/news
    Careers at Anthropic: https://www.anthropic.com/careers
    Other links:
    Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research: https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1
    Simple probes can catch sleeper agents: https://www.anthropic.com/research/probes-catch-sleeper-agents
    Studying Large Language Model Generalization with Influence Functions: https://arxiv.org/abs/2308.03296
    Stress-Testing Capability Elicitation With Password-Locked Models [aka model organisms of sandbagging]: https://arxiv.org/abs/2405.19550
    Episode art by Hamish Doodles: hamishdoodles.com
    Duration: 1:45:47
  • 38.2 - Jesse Hoogland on Singular Learning Theory
    You may have heard of singular learning theory (SLT) and its "local learning coefficient", or LLC - but have you heard of the refined LLC? In this episode, I chat with Jesse Hoogland about his work on SLT, and using the refined LLC to find a new circuit in language models.
    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    The transcript: https://axrp.net/episode/2024/11/27/38_2-jesse-hoogland-singular-learning-theory.html
    FAR.AI: https://far.ai/
    FAR.AI on X (aka Twitter): https://x.com/farairesearch
    FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
    The Alignment Workshop: https://www.alignment-workshop.com/
    Topics we discuss, and timestamps:
    00:34 - About Jesse
    01:49 - The Alignment Workshop
    02:31 - About Timaeus
    05:25 - SLT that isn't developmental interpretability
    10:41 - The refined local learning coefficient
    14:06 - Finding the multigram circuit
    Links:
    Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient: https://arxiv.org/abs/2410.02984
    Investigating the learning coefficient of modular addition: hackathon project: https://www.lesswrong.com/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition
    Episode art by Hamish Doodles: hamishdoodles.com
    Duration: 18:18
  • 38.1 - Alan Chan on Agent Infrastructure
    Road lines, street lights, and licence plates are examples of infrastructure used to ensure that roads operate smoothly. In this episode, Alan Chan talks about using similar interventions to help avoid bad outcomes from the deployment of AI agents.
    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    The transcript: https://axrp.net/episode/2024/11/16/episode-38_1-alan-chan-agent-infrastructure.html
    FAR.AI: https://far.ai/
    FAR.AI on X (aka Twitter): https://x.com/farairesearch
    FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
    The Alignment Workshop: https://www.alignment-workshop.com/
    Topics we discuss, and timestamps:
    01:02 - How the Alignment Workshop is
    01:32 - Agent infrastructure
    04:57 - Why agent infrastructure
    07:54 - A trichotomy of agent infrastructure
    13:59 - Agent IDs
    18:17 - Agent channels
    20:29 - Relation to AI control
    Links:
    Alan on Google Scholar: https://scholar.google.com/citations?user=lmQmYPgAAAAJ&hl=en&oi=ao
    IDs for AI Systems: https://arxiv.org/abs/2406.12137
    Visibility into AI Agents: https://arxiv.org/abs/2401.13138
    Episode art by Hamish Doodles: hamishdoodles.com
    Duration: 24:48


About AXRP - the AI X-risk Research Podcast

AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.
