Machine Learning Street Talk (MLST)

Technology

Latest episode

256 episodes

Why a Nation Can't Outsource Its Frontier AI - Alistair Pullen (Cosine AI)
13/07/2026 | 55 mins.
This episode is sponsored by Notion. Learn more about Notion's Developer Platform today at https://notion.com/mlst

Britain's most capable coding model can't be exported, and that ban is the whole reason Cosine set out to build one from scratch. Alistair Pullen, CEO and co-founder of Cosine, sits down with Tim Scarfe to explain how a frontier system he calls Fable, locked behind US export controls, became the founding case for a UK sovereign model trained on the Isambard supercomputer in Bristol.

The bet underneath it is economic. Pullen argues that an inference company, rather than a training-first lab, doesn't need billions to compete: millions, a national compute allocation, and a consortium feedback loop can be enough. From there it gets into the machinery: why open-weight models still trail the frontier on size, active parameters and data; the mixture-of-experts versus dense trade-off and why active params dominate how a model actually feels; and the edge that real coding trajectories confer.

The back half is about making agents trustworthy. Pullen makes the case for beating "slop" by rewarding the process instead of the final answer, reframes code review as runtime proof (spin the bug up in a VM and force the agent to actually exploit it), and walks through Swarm, Cosine's system running hundreds of sub-agents in one shot. It ends on why memory is still an unsolved hack, how synthetic graders let you run RL on tasks with no built-in test, and why Pullen reads US export controls as an accidental gift, with a supply-chain sting in the tail.

---
TIMESTAMPS:
00:00:00 The sovereign mandate and the Fable ban
00:04:02 Millions vs billions: the inference-company model
00:07:19 The consortium feedback loop
00:07:40 Why open models lag the frontier
00:14:59 MoE vs dense, and why active params matter
00:16:29 Trajectories: the process-data advantage
00:19:48 Beating slop: reward the process, not the answer
00:26:06 Reusable abstractions and the epistemic wall
00:29:56 Code review becomes runtime proof
00:37:32 Do agentic harnesses still matter?
00:40:35 Swarm: orchestrating hundreds of sub-agents
00:45:14 Why memory is still unsolved
00:48:25 Synthetic data and graders for RL
00:53:09 The US export gift and supply-chain risk

---
REFERENCES:
organization:
[00:01:15] Cosine
https://cosine.sh
[00:04:14] Mistral AI
https://mistral.ai
[00:05:50] Anthropic
https://www.anthropic.com
[00:07:42] Cohere
https://cohere.com
[00:08:36] DeepSeek
https://www.deepseek.com
tool:
[00:02:52] Isambard-AI
https://isambard.ac.uk
[00:05:56] Colossus (xAI)
https://en.wikipedia.org/wiki/Colossus_(supercomputer)
[00:07:52] GLM (Z.ai)
https://z.ai
[00:11:52] NVIDIA B300
https://www.nvidia.com/en-us/data-center/dgx-b300/
[00:15:37] gpt-oss-120b
https://huggingface.co/openai/gpt-oss-120b
[00:15:52] Devstral 2
https://mistral.ai/news/devstral
[00:16:01] Llama 70b
https://www.llama.com
[00:17:05] Claude Code
https://www.anthropic.com/claude-code
[00:26:23] ARC-AGI (Francois Chollet)
https://arcprize.org
[00:40:38] Swarm (Cosine)
https://cosine.sh
[00:40:50] OpenAI Codex
https://github.com/openai/codex
[00:41:16] Lumen Outpost (Cosine)
https://cosine.sh
[00:41:18] Kimi K2 (Moonshot)
https://huggingface.co/moonshotai/Kimi-K2-Instruct
[00:49:55] SWE-bench
https://www.swebench.com
[00:52:40] SystemVerilog
https://en.wikipedia.org/wiki/SystemVerilog
person:
[00:23:40] Andrej Karpathy
https://karpathy.ai
paper:
[00:27:10] GRPO (DeepSeekMath)
https://arxiv.org/abs/2402.03300
[00:27:13] GSPO
https://arxiv.org/abs/2507.18071

Incompressible Knowledge Probes, Bojie Li
https://arxiv.org/pdf/2604.24827

Estimating the Size of Claude Opus 4.5/4.6
https://unexcitedneurons.substack.com/p/estimating-the-size-of-claude-opus

---
ReScript:
https://app.rescript.info/session/5852d2b884c4ce4b?share=10b9799160845bb11779f8ac6cd3124f
The Benchmark With No Instructions — ARC-AGI-3 (winning team!)
01/07/2026 | 1h 24 mins.
Tim Scarfe travels to Zurich to sit down with the Tufa Labs ARC-AGI-3 team (founder Benjamin Crouzier, with Jeroen Cottaar, Dries Smit, Stefano Viel and Michal Tesnar) to work out what their leaderboard-topping system does and what the benchmark is really testing.

The cut opens on the games: a walkthrough of the Locksmith game, where you read the rules of an unfamiliar world straight from raw frames. ARC-AGI-3 makes ARC interactive and agentic, so the model has to *discover* the goal rather than transduce a static grid. It stays easy for humans and breaks LLMs, and it runs through everything that follows. Dries traces his StochasticGoose preview win, brute force that only searched actions which changed the frame, and why it collapsed once the organisers added action-efficiency scoring and unseen games.

Induction and transduction run through the middle of the conversation: how much of an answer is really priors leaking back the moment a model recognises a maze. The abstraction mountain, and Tim's case that LLMs reach the right answer through fractured, entangled representations: performance, not competence. Whether transformers plan at all or just fake it well enough. Why the score really measures action efficiency, not games solved, and why agents lock onto the wrong goal and cannot climb back out.

Crouzier closes on the Tufa Labs thesis (a small lab against the giants, the bitter lesson against hand-built harnesses, and safety), and Tim ties it back to Kenneth Stanley, deep constraints, and creativity as competence.

Disclosure: Tufa Labs sponsors MLST.

---
TIMESTAMPS:
00:00:00 Meet the Tufa team and what makes ARC-AGI-3 hard
00:02:11 Locksmith game: reading the rules from raw frames
00:03:10 Why build an independent research lab
00:04:11 StochasticGoose: a preview win, then the hardened games
00:07:58 Induction, transduction, and priors inside LLMs
00:10:31 Curiosity, world models, and exploring by frame change
00:14:32 Understanding debt and losing sight of your own code
00:15:53 Requirements-based agents and human-AI co-creativity
00:19:22 Why auto-research misses the big picture
00:21:54 The abstraction mountain and fractured representations
00:27:36 Constraints and making LLMs act as if they understand
00:34:51 Human difficulty calibration, esports priors, and emergence
00:41:35 Agency, goal acquisition, and two kinds of planning
00:47:31 Harnesses, the 36% number, and wrong-goal loops
00:52:33 Rewards, goals, and why ARC-AGI-3 resists brute force
01:00:46 Would solving ARC-AGI-3 prove AGI?
01:07:53 Stripping language away, then priors leak back
01:14:06 Representation and whether language is necessary
01:18:04 The bitter lesson versus specialised harnesses
01:22:20 Capability research, safety, and the software singularity

---
REFERENCES:
organization:
[00:02:11] ARC-AGI-3
https://arcprize.org/arc-agi/3
[00:03:10] Tufa Labs
https://tufalabs.ai/team/
[00:04:20] ARC-AGI-3 Preview Agent Competition
https://arcprize.org/competitions/arc-agi-3-preview-agents
tool:
[00:04:55] StochasticGoose ARC-AGI-3 solution
https://github.com/DriesSmit/ARC3-solution
[00:07:42] ArcGentica
https://github.com/symbolica-ai/arcgentica
[00:07:49] RGB-Agent
https://github.com/alexisfox7/RGB-Agent
[00:14:38] Claude Code
https://www.anthropic.com/claude-code
[01:03:42] Qwen 3.6 27B
https://huggingface.co/Qwen/Qwen3.6-27B
paper:
[00:13:03] On the Measure of Intelligence
https://arxiv.org/abs/1911.01547
[00:27:42] DreamCoder
https://arxiv.org/abs/2006.08381
[00:43:55] On the Biology of a Large Language Model
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
[01:18:46] ImageNet Classification with Deep CNNs (AlexNet)
https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
other:
[01:18:16] The Bitter Lesson
http://www.incompleteideas.net/IncIdeas/BitterLesson.html

---
ReScript: https://app.rescript.info/share/463d7f031349b4b9db428553eed88230
The Thermodynamic AI Computing Chip - Thomas Ahle
28/06/2026 | 1h 2 mins.
Thomas Ahle wants Normal Computing to be the Lovable for chip design: type your intent, and a swarm of agents carries it from design through optimisation, formalisation and verification to tape-out. To get there, his team wrote their own open-source Verilog simulator, 580,000 lines in 43 days, because commercial EDA verifiers run about $10,000 per core and there are no decent open-source compilers to build on.

That sets up the question Tim keeps pressing: if an agent can produce a chip design, a proof, or a working program, how do you actually know it is correct? Passing 70% of tests is not the same as being right, and a single fabricated bug can cost a company a fortune. They dig into ProgramBench (rebuild a program from its tests, roughly 0% success), the difference between structure and competence, and the "understanding debt" you take on when nobody reads the code.

From there: auto-formalisation in Lean and the AlphaProof trick of training on prove-or-disprove; why there is no single true representation of a spec (Petri nets, TLA+, Erik Curiel's "math does not represent"); and thermodynamic computing, where Normal Computing's CN101 chip is built so that its physical noise *is* the computation, settling a stochastic differential equation in hardware to invert a matrix. Plus Bayesian uncertainty, specialisation, the Chomsky hierarchy, AI slop, and whether performance is all that matters.
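
For readers who want a feel for the "noise is the computation" idea, here is a minimal software sketch. It is not how the CN101 chip works internally, which these notes do not describe. It assumes one standard construction: for a symmetric positive-definite matrix A, the stochastic differential equation dx = -A x dt + sqrt(2) dW settles into a stationary distribution whose covariance is the inverse of A, so averaging x x^T along a noisy trajectory estimates that inverse. The function name `estimate_inverse`, the step size and the example matrix are all illustrative choices.

```python
import math
import random


def estimate_inverse(A, steps=400_000, dt=0.01, burn_in=2_000, seed=0):
    """Estimate the inverse of a symmetric positive-definite matrix A.

    Simulates dx = -A x dt + sqrt(2) dW with Euler-Maruyama steps and
    returns the time-averaged outer product x x^T, which approximates
    the stationary covariance and therefore the inverse of A.
    """
    n = len(A)
    rng = random.Random(seed)          # seeded so runs are repeatable
    x = [0.0] * n
    cov = [[0.0] * n for _ in range(n)]
    noise_scale = math.sqrt(2.0 * dt)  # standard deviation of each noise kick

    for step in range(steps + burn_in):
        # Deterministic pull towards the origin: -A x
        drift = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        # One Euler-Maruyama update: drift plus Gaussian noise
        x = [x[i] - drift[i] * dt + noise_scale * rng.gauss(0.0, 1.0)
             for i in range(n)]
        # Skip the initial transient, then accumulate x x^T
        if step >= burn_in:
            for i in range(n):
                for j in range(n):
                    cov[i][j] += x[i] * x[j]

    return [[c / steps for c in row] for row in cov]


if __name__ == "__main__":
    A = [[2.0, 0.5], [0.5, 1.0]]
    for row in estimate_inverse(A):
        print([round(v, 3) for v in row])
```

The estimate is noisy by construction: a finite step size and a finite run both add error, so the result only approaches the exact inverse as the simulation runs longer with smaller steps. In this framing a physical device would supply the noise and the settling for free, rather than stepping through them in software.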

Recorded in Zurich.

Disclosure: Normal Computing paid our production and travel costs for this show. We retained full editorial control. They did not see the video before publication, and we did not show it to them or discuss it with them beforehand.

---
TIMESTAMPS:
00:00:00 Meet Thomas Ahle: the Lovable for chip design
00:03:41 Why hardware needs formal verification
00:06:36 Ten thousand dollars per core and a six-month agent run
00:07:40 Rebuilding programs from tests: ProgramBench and zero percent
00:12:15 Structure vs competence: can you learn a program from behavior?
00:15:27 Continual learning, abstraction, and Claude as an ecosystem
00:23:17 Autoformalization and the AlphaProof trick
00:29:31 No single true representation: specs, Petri nets and TLA+
00:34:43 Thermodynamic computing: when noise is the computation
00:37:32 Bayesian uncertainty in the age of token streams
00:41:12 Hybrid compute: vibe-coding loops, binaries and Stockfish
00:44:44 Co-design, central-AI apps and API pricing
00:49:45 Chain of thoughtlessness and the Chomsky hierarchy
00:53:40 AI psychosis, slop and the broken social contract
00:57:34 Typing it yourself, teamwork and performance vs competence

---
REFERENCES:
person:
[00:00:10] Thomas Ahle
https://thomasahle.com
organization:
[00:00:27] Normal Computing
https://normalcomputing.com/
paper:
[00:11:21] ProgramBench: Can Language Models Rebuild Programs From Scratch?
https://arxiv.org/abs/2605.03546
[00:31:55] Autoformalizing Memory Device Specifications with Agents
https://arxiv.org/abs/2605.00058
[00:35:20] Thermo AI and the Fluctuation Frontier
https://arxiv.org/abs/2302.06584
[00:36:40] Thermo Comp System for AI Applications
https://arxiv.org/abs/2312.04836
[00:37:05] Thermodynamic Linear Algebra
https://arxiv.org/abs/2308.05660
[00:44:50] An efficient probabilistic hardware architecture for diffusion-like models
https://arxiv.org/abs/2510.23972
other:
[00:01:00] Building an Open-Source Verilog Simulator with AI: 580K Lines in 43 Days
https://normalcomputing.com/blog/building-an-open-source-verilog-simulator-with-ai-580k-lines-in-43-days
[00:02:55] Normal Computing Announces Tape-Out of the World's First Thermodynamic Computing Chip (CN101)
https://www.normalcomputing.com/blog/normal-computing-announces-tape-out-of-worlds-first-thermodynamic-computing-chip
[00:32:02] DRAMBench: Autoformalizing DRAM Specifications with Timed Petri Nets
https://www.iese.fraunhofer.de/blog/drambench-autoformalizing-dram-specifications/

---
ReScript: https://app.rescript.info/share/ff9684a112ab37744096adaeb097a263
He won a Nobel here for AlphaFold. Then he left. - John Jumper
22/06/2026 | 53 mins.
This episode is sponsored by Notion. Learn more about Notion's Developer Platform today at https://notion.com/mlst

Protein folding stalled biology for fifty years. A sequence of amino acids dictates a three-dimensional shape, but reading that shape meant a year and roughly $100,000 of crystallography per structure. Then AlphaFold 2 won CASP14 so decisively the organizers called the problem essentially solved.

In this documentary cut, John Jumper, who shared the 2024 Nobel Prize in Chemistry and has since left DeepMind for Anthropic, walks Tim Scarfe through what the system did and, more interestingly, what it did not. The architecture gets a proper dissection: MSAs, the Evoformer, invariant point attention, the FAPE loss, and Jumper's correction of the equivariance story, which ablations valued at roughly 2.5 of 30 GDT points rather than the whole win. He is blunt about the limits. AlphaFold predicts one experiment extraordinarily well; it is not a model of the cell, it does not capture dynamics, and on a given drug target it is "wrong nine times out of ten."

From there: the AlphaFold Database of 200M+ predicted structures, AlphaFold 3 and ligands, Isomorphic Labs, and Jumper's quarrel with the bitter lesson, where finite data and human hypotheses still matter. Emmanuel Nji of BioStruct Africa closes the film on what changes when work that took years now takes months, and on training the next thousand structural biologists across Africa.

---
TIMESTAMPS:
00:00:00 Cold open: predicting nature with a button press
00:01:03 The protein folding bottleneck and CASP
00:04:39 The Nobel, the database, and the move to Anthropic
00:05:50 Sponsor (Notion) and framing: what AlphaFold does not claim
00:07:39 Proteins as self-assembling nanomachines
00:12:24 From structures to biology: drug discovery and Midnolin
00:17:37 The humility of AlphaFold: a narrow predictor
00:22:18 Inside the architecture: Evoformer, IPA and FAPE
00:30:20 Ruthless empiricism: ablations and 100x in data
00:35:20 Predict, control, understand
00:40:00 Against the bitter lesson; AlphaFold 3 as diffusion
00:45:07 Intelligence, representations and AGI
00:49:23 Epilogue: AlphaFold in Africa
00:52:16 Closing: the case for hybrid science models

---
REFERENCES:
organization:
[00:01:55] Critical Assessment of Structure Prediction (CASP)
https://predictioncenter.org/
[00:04:39] The Nobel Prize in Chemistry 2024
https://www.nobelprize.org/prizes/chemistry/2024/summary/
[00:05:18] BioStruct Africa
https://www.biostructafrica.org/
[00:18:03] Isomorphic Labs
https://www.isomorphiclabs.com/
paper:
[00:03:09] AlphaFold Protein Structure Database
https://doi.org/10.1093/nar/gkab1061
[00:17:25] Accurate structure prediction of biomolecular interactions with AlphaFold 3
https://www.nature.com/articles/s41586-024-07487-w
[00:22:18] Highly accurate protein structure prediction with AlphaFold
https://www.nature.com/articles/s41586-021-03819-2
[00:23:10] Midnolin promotes degradation of substrates independent of ubiquitination
https://doi.org/10.1126/science.adh5021
[00:27:00] Improved protein structure prediction using potentials from deep learning
https://www.nature.com/articles/s41586-019-1923-7
tool:
[00:03:09] AlphaFold Protein Structure Database (EBI)
https://alphafold.ebi.ac.uk/
[00:45:55] AlphaEvolve: a coding agent for designing advanced algorithms
https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/
other:
[00:39:40] The Bitter Lesson
http://www.incompleteideas.net/IncIdeas/BitterLesson.html

---
ReScript: https://app.rescript.info/share/d8cde5c221fb71e2c0f5aafe94f90dfa

Disclaimer - not sponsored, editorial with us - we filmed it at GDM, London
When AI Decides You're a Threat — Brad Carson
31/05/2026 | 1h 20 mins.
Brad Carson was the Army's General Counsel, served two terms in Congress and was Acting Under Secretary of Defense for Personnel and Readiness. He now heads Americans for Responsible Innovation, the AI-policy advocacy group he co-founded. Keith Duggar spends roughly eighty minutes pushing back.

SPONSOR:
---
Cyber Fund built the Monastery to help founders ship products that were impossible a year ago. Applications for Batch 1 are now open.
Apply now: https://cyber.fund
---

Carson's whole case rests on one line: the genie is not out of the bottle. We have pulled dangerous tech back before. Asilomar halted recombinant DNA in 1975, and the West still controls the chips AI runs on. Calling it unstoppable, he says, is the most dangerous idea in the room.

Then Keith drags him somewhere darker. A Palantir heat map scores you 0.73 on whether you are a combatant, and a strike follows. The model is wrong some accepted share of the time, and when it is, nobody answers for it. You cannot court-martial a model, and not even the interpretability researchers can say why it picked you.

—
Note: after recording, we learned that Americans for Responsible Innovation is backed by EA-aligned philanthropy (not sponsored)

---
TIMESTAMPS:
00:00:00 From the Pentagon to AI governance
00:04:52 Regulatory capture vs Silicon Valley networks
00:07:56 Transparency and the Claude tier changes
00:09:40 Tort liability when AI tools cause harm
00:13:40 AI is a product, not a person
00:16:01 Children, suicide, and the suicide business
00:19:59 Opaque neural nets and the law of war
00:25:54 Probabilistic targeting and the death of accountability
00:28:47 The arms race fallacy: Asilomar and restraint
00:34:02 Talking to China: track 2 talks and chip leverage
00:39:45 Air power never wins: capital for labour
00:43:29 Anthropic vs the Department of War
00:51:29 Concentration, open source, and brain drain
01:00:18 DeepSeek, Chinese culture, and AI as diplomacy
01:12:25 Upskilling Congress and why public trust matters

---
REFERENCES:
organization:
[00:02:45] ICRC position on autonomous weapons
https://www.icrc.org/en/law-and-policy/autonomous-weapons
[00:05:22] Americans for Responsible Innovation (ARI)
https://ari.us
[00:07:20] Andreessen Horowitz (a16z)
https://a16z.com/
[01:16:05] Office of Technology Assessment
https://en.wikipedia.org/wiki/Office_of_Technology_Assessment
other:
[00:03:35] Beneficial AGI 2019 Conference (Future of Life Institute, Puerto Rico)
https://futureoflife.org/event/beneficial-agi-2019/
[00:18:30] Section 230 of the Communications Decency Act
https://en.wikipedia.org/wiki/Section_230
[00:19:59] Lethal Autonomous Weapons (LAWS)
https://en.wikipedia.org/wiki/Lethal_autonomous_weapon
[00:31:35] Strategic Arms Limitation Talks (SALT)
https://en.wikipedia.org/wiki/Strategic_Arms_Limitation_Talks
[00:32:28] Asilomar Conference on Recombinant DNA (1975)
https://en.wikipedia.org/wiki/Asilomar_Conference_on_Recombinant_DNA
[00:39:45] The New Iron Triangle (ARI policy byte)
https://ari.us/policy-bytes/the-new-iron-triangle/
[00:48:05] Defense Production Act
https://en.wikipedia.org/wiki/Defense_Production_Act
person:
[00:03:35] Anthony Aguirre
https://en.wikipedia.org/wiki/Anthony_Aguirre
[00:06:48] Dean Ball — Hyperdimensional
https://www.hyperdimensional.co/
[00:23:13] Neel Nanda — mechanistic interpretability
https://www.neelnanda.io/
[00:36:02] Jack Clark (Anthropic) on Conversations with Tyler
https://conversationswithtyler.com/episodes/jack-clark/
[00:39:15] Robert Trager — Centre for the Governance of AI
https://www.governance.ai/team/robert-trager
[00:41:55] Giulio Douhet
https://en.wikipedia.org/wiki/Giulio_Douhet
[01:15:05] Don Beyer (US Congress)
https://en.wikipedia.org/wiki/Don_Beyer
tool:
[00:22:19] Phalanx CIWS
https://en.wikipedia.org/wiki/Phalanx_CIWS

---
ReScript:
https://app.rescript.info/public/share/9405ff35c0215b7cdae6402d41284171
https://app.rescript.info/api/public/sessions/0a6c081b8e5fe413/pdf

About Machine Learning Street Talk (MLST)

Welcome! We engage in fascinating discussions with pre-eminent figures in the AI field. Our flagship show covers current affairs in AI, cognitive science, neuroscience and philosophy of mind with in-depth analysis. Our approach is unrivalled in terms of scope and rigour – we believe in intellectual diversity in AI, and we touch on all of the main ideas in the field with the hype surgically removed. MLST is run by Tim Scarfe, Ph.D (https://www.linkedin.com/in/ecsquizor/) and features regular appearances from MIT Doctor of Philosophy Keith Duggar (https://www.linkedin.com/in/dr-keith-duggar/).
