Tim Scarfe travels to Zurich to sit down with the Tufa Labs ARC-AGI-3 team (founder Benjamin Crouzier, with Jeroen Cottaar, Dries Smit, Stefano Viel and Michal Tesnar) to work out what their leaderboard-topping system does and what the benchmark is really testing.

The cut opens on the games: a walkthrough of the Locksmith game, where you read the rules of an unfamiliar world straight from raw frames. ARC-AGI-3 makes ARC interactive and agentic, so the model has to *discover* the goal rather than transduce a static grid. The benchmark stays easy for humans and breaks LLMs, and that gap runs through everything that follows. Dries traces his StochasticGoose preview win, a brute-force approach that only searched actions which changed the frame, and why it collapsed once the organisers added action-efficiency scoring and unseen games.

Induction and transduction run through the middle of the conversation: how much of an answer is really priors leaking back the moment a model recognises a maze. From there the discussion covers the abstraction mountain and Tim's case that LLMs reach the right answer through fractured, entangled representations (performance, not competence); whether transformers plan at all or just fake it well enough; why the score really measures action efficiency, not games solved; and why agents lock onto the wrong goal and cannot climb back out.

Crouzier closes on the Tufa Labs thesis (a small lab against the giants, the bitter lesson against hand-built harnesses, and safety), and Tim ties it back to Kenneth Stanley, deep constraints, and creativity as competence.
Disclosure: Tufa Labs sponsors MLST.

---
TIMESTAMPS:
00:00:00 Meet the Tufa team and what makes ARC-AGI-3 hard
00:02:11 Locksmith game: reading the rules from raw frames
00:03:10 Why build an independent research lab
00:04:11 StochasticGoose: a preview win, then the hardened games
00:07:58 Induction, transduction, and priors inside LLMs
00:10:31 Curiosity, world models, and exploring by frame change
00:14:32 Understanding debt and losing sight of your own code
00:15:53 Requirements-based agents and human-AI co-creativity
00:19:22 Why auto-research misses the big picture
00:21:54 The abstraction mountain and fractured representations
00:27:36 Constraints and making LLMs act as if they understand
00:34:51 Human difficulty calibration, esports priors, and emergence
00:41:35 Agency, goal acquisition, and two kinds of planning
00:47:31 Harnesses, the 36% number, and wrong-goal loops
00:52:33 Rewards, goals, and why ARC-AGI-3 resists brute force
01:00:46 Would solving ARC-AGI-3 prove AGI?
01:07:53 Stripping language away, then priors leak back
01:14:06 Representation and whether language is necessary
01:18:04 The bitter lesson versus specialised harnesses
01:22:20 Capability research, safety, and the software singularity

---
REFERENCES:

organization:
[00:02:11] ARC-AGI-3
https://arcprize.org/arc-agi/3
[00:03:10] Tufa Labs
https://tufalabs.ai/team/
[00:04:20] ARC-AGI-3 Preview Agent Competition
https://arcprize.org/competitions/arc-agi-3-preview-agents

tool:
[00:04:55] StochasticGoose ARC-AGI-3 solution
https://github.com/DriesSmit/ARC3-solution
[00:07:42] ArcGentica
https://github.com/symbolica-ai/arcgentica
[00:07:49] RGB-Agent
https://github.com/alexisfox7/RGB-Agent
[00:14:38] Claude Code
https://www.anthropic.com/claude-code
[01:03:42] Qwen 3.6 27B
https://huggingface.co/Qwen/Qwen3.6-27B

paper:
[00:13:03] On the Measure of Intelligence
https://arxiv.org/abs/1911.01547
[00:27:42] DreamCoder
https://arxiv.org/abs/2006.08381
[00:43:55] On the Biology of a Large Language Model
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
[01:18:46] ImageNet Classification with Deep CNNs (AlexNet)
https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

other:
[01:18:16] The Bitter Lesson
http://www.incompleteideas.net/IncIdeas/BitterLesson.html

---
https://app.rescript.info/share/463d7f031349b4b9db428553eed88230