Interconnects

Nathan Lambert

Science Technology

Latest episode

Available Episodes

5 of 107

The American DeepSeek Project
https://www.interconnects.ai/p/the-american-deepseek-projectWhile America has the best AI models in Gemini, Claude, o3, etc. and the best infrastructure with Nvidia it’s rapidly losing its influence over the future directions of AI that unfold in the open-source and academic communities. Chinese organizations are releasing the most notable open models and datasets across all modalities, from text to robotics or video, and at the same time it’s common for researchers worldwide to read far more new research papers from Chinese organizations rather than their Western counterparts.This balance of power has been shifting rapidly in the last 12 months and reflects shifting, structural advantages that Chinese companies have with open-source AI — China has more AI researchers, data, and an open-source default.On the other hand, America’s open technological champions for AI, like Meta, are “reconsidering their open approach” after yet another expensive re-org and the political environment is dramatically reducing the interest of the world’s best scientists in coming to our country.It’s famous lore of the AI industry that much of the flourishing of progress around ChatGPT is downstream from Google Research’s, and the industry’s writ-large, practice of openly sharing the science of AI until approximately 2022. Stopping this practice, and the resulting power shifts mean it will be likely that the next “Transformer”-style breakthrough will be built on or related to Chinese AI models, AI chips, ideas, or companies. Countless Chinese individuals are some of the best people I’ve worked with, both at a technical and personal level, but this direction for the ecosystem points to AI models being less accountable, auditable, and trustworthy due to inevitable ties to the Chinese Government.The goal for my next few years of work is what I’m calling The American DeepSeek Project — a fully open-source model at the scale and performance of current (publicly available) frontier models, within 2 years. A fully open model, as opposed to just an “open weights” model, comes with data, training code, logs, and decision making — on top of the weights to run inference — in order to distribute the knowledge and access for how to train AI models fully.This project serves two goals, where balancing the scales with the pace of the Chinese ecosystem is only one piece:* Reclaim the AI research default home being on top of American (or Western) technologies and tools, and* Reduce the risk that the only viable AI ecosystem for cutting edge products in built atop of proprietary, closed, for-profit AI models.More people should be focused on this happening. A lot of people talk about how nice it would be to have “open-source AGI for all,” but very few people are investing in making it reality. With the right focus, I estimate this will take ~$100M-500M over the next two years.Within the context of recent trends, this is a future that has a diminishing, minute probability. I want to do this at Ai2, but it takes far more than just us to make it happen. We need advocates, peers, advisors, and compute.The time to do this is now, if we wait then the future will be in the balance of extremely powerful, closed American models counterbalancing a sea of strong, ubiquitous, open Chinese models. This is a world where the most available models are the hardest to trust. The West historically has better systems to create AI models that are trustworthy and fair across society. Consider how:* Practically speaking, there will never be proof that Chinese models cannot leave vulnerabilities in code or execute tools in malicious ways, even though it’s very unlikely in the near future.* Chinese companies will not engage as completely in the U.S. legal system on topics from fair use or non-consensual deepfakes.* Chinese models will over time shift to support a competitive software ecosystem that weakens many of America and the West’s strongest companies due to in-place compute restrictions.Many of these practical problems cannot be fixed by simply fine-tuning the model, such as Perplexity’s R1-1776 model. These are deep, structural realities that can only be avoided with different incentives and pretrained models.My goal is to make a fully open-source model at the scale of DeepSeek V3/R1 in the next two years. I’ve been starting to champion this vision in multiple places that summarizes the next frontier for performance on open-source language models, so I needed this document to pin it down.I use scale and not performance as a reference point for the goal because the models we’re collectively using as consumers of the AI industry haven’t really been getting much bigger. This “frontier scale” is a ballpark for where you’ve crossed into a very serious model, and, by the time a few years has gone by, the efficiency gains that would’ve accumulated by then will mean this model will far outperform DeepSeek V3. The leading models used for synthetic data (and maybe served to some users) will continue to get bigger, but not as quickly as capabilities will grow and new types of agents will emerge.Interconnects is a reader-supported publication. Consider becoming a subscriber.The terminology “American DeepSeek” is stretching words in order to be identifiable to a broad public. It combines the need for true American values with a breakthrough open release that marks a new milestone in capabilities.DeepSeek is known for many things to the general public — training cheap frontier models, bringing reasoning models to consumers, and largely being the face of Chinese AI efforts. Since ChatGPT, DeepSeek is the first organization to release an open, permissively licensed AI model at the frontier of performance. This was a major milestone and why 2025 has been a transformative year in the perception of feasibility for open models generally. The name DeepSeek will forever be known in AI lore for it.At the same time, what will count as a “DeepSeek moment” is changing. The new directions for where AI is heading is more in line with agents that use models a lot (sometimes even smaller models) rather than relying on scaling performance of single model generations.This changes what it’ll mean for models to be “at the frontier.” More releases will look like Claude 4 and be about usability, where the benchmarks that people are hillclimbing on represent new types of capabilities or outlandish, harder than human expert tasks. For the suite of tasks that were core for the current generation of models: MATH, GPQA, SWE-Bench Verified, etc., solving them represents a challenging, but reasonable, baseline for human performance.The next major milestone will be when fully open-source models reach this performance threshold. With fully open-source models at this level, “anyone” can specialize the model to their task and the possibility of an open ecosystem that runs efficiently on a single architecture can coalesce. This doesn’t mean releasing the best AI models of 2027 with complete openness — just that we should, come 2027, have fully open models of 2025’s capabilities in order to enable new types of companies and research.The efficiencies of open-source software style development are dramatically stronger for agentic systems than models. Models are singular entities built with expensive resources and incredible focus. Agents are systems that can use many models off the shelf and route requests depending on what’s needed.This agentic era is the opportunity open models have needed, but we need to clear much stronger performance thresholds before the open counterparts are viable. We have companies like OpenAI and Google launching Claude Code competitors that pretty much flop. Imagine what this would look like with open models today? Not good.For this reason, we have finite time to get there. Surely, eventually this level of models will exist, but if we want a new type of ecosystem to form we need to build the raw resources while developers and new companies are getting started. We need people willing to take the risk on something different while there is still potential for it to be comparable across performance trade-offs.Today, the best fully open language models are catching up to the levels of the original GPT-4. This is a major step from GPT-3 levels. The required step I’m shooting for is reaching the modern GPT-4 type models, the likes of recent Sonnet, DeepSeek V3, or Gemini Pro. It’s a big step, but a transformative one in terms of what the models can do.Of course, some of this still works with open weight models and not just fully open models, but to date we have not had good success with having open weight models that can fully be trusted. The best American models are plagued by the Llama license (and rumors that future versions will be discontinued). At the same time, Chinese models aren’t trusted because the models are being integrated directly with more complex tools that muddy the water with a weak security reputation, and European models are largely off the map.If we want models we can trust, we need something that’s a bit different. If the models all converge on a certain capability level, and the differentiation is on integration and finetuning to specific skills, this is something the open community can do.In many ways, obtaining this goal is a quintessentially American volition. In the face of a technology that is poised to bring such extreme financial, and by proxy literal, power to a few companies, opening AI is one of the only things we can do to reduce it. Technology proceeds in a one-way direction — for a variety of geopolitical and capitalistic reasons it is impractical to pause AI development to “do AI another way” — the best we can do is chart a path that makes this future better.Along the same vein, if AGI already exists and something closer to ASI is coming, it will be intertwined with countless details of billions of people’s lives in a matter of just years. Something so indispensable to our lives in work, play, entertainment, and relationships is a closer analog to electricity than other traditional technology products that one can opt into. Such technology should be available for all to benefit from.We need new systems to mitigate misuse, but it shouldn’t be solely up to corporations to control this. Safety by isolating technology to a select few is something we’re in the later stages of with nuclear weapons, and AI progress is far harder to monitor. Robustness to AI can only come from designing systems that expect it to be pervasive — not that it is an easy task.Realistically, all of this is fighting gravity. The corporations will win, but we can control to what extent. We can control how good the other options are. The open options.The call to action here is simple — consider how you can slightly shift your decision making to make The American DeepSeek more likely. This approach succeeds just as much by having one model at the end of it, as it does by having the community form better habits and norms around the way AI models are conceived, built, shared, and used. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
--------
10:36
--------
10:36
Some ideas for what comes next
https://www.interconnects.ai/p/summertime-outlook-o3s-novelty-comingSummer is always a slow time for the tech industry. OpenAI seems fully in line with this, with their open model “[taking] a little more time” and GPT-5 seemingly always delayed a bit more. These will obviously be major news items, but I’m not sure we see them until August.I’m going to take this brief reprieve in the bombardment of AI releases to reflect on where we’ve been and where we’re going. Here’s what you should know.1. o3 as a technical breakthrough beyond scalingThe default story around OpenAI’s o3 model is that they “scaled compute for reinforcement learning training,” which caused some weird, entirely new over-optimization issues. This is true, and the plot from the livestream of the release still represents a certain type of breakthrough — namely scaling up data and training infrastructure for reinforcement learning with verifiable rewards (RLVR).The part of o3 that isn’t talked about enough is how different its search feels. For a normal query, o3 can look at 10s of websites. The best description I’ve heard of its relentlessness en route to finding a niche piece of information is akin to a “trained hunting dog on the scent.” o3 just feels like a model that can find information in a totally different way than anything out there.The kicker with this is that we’re multiple months out from its release in April of 2025 and no other leading lab has a model remotely like it. In a world where releases between labs, especially OpenAI and Google, seem totally mirrored, this relentless search capability in o3 still stands out to me.The core question is when will another laboratory release a model that feels qualitatively similar? If this trend goes on through the end of the summer it’ll be a confirmation that OpenAI had some technical breakthrough to increase the reliability of search and other tool-use within reasoning models.For a contrast, consider basic questions we are facing in the open and academic community on how to build a model inspired by o3 (so something more like a GPT-4o or Claude 4 in its actual search abilities):* Finding RL data where the model is incentivized to search is critical. It’s easy in an RL experiment to tell the model to try searching in the system prompt, but as training goes on if the tool isn’t useful the model will learn to stop using it (very rapidly). It is likely that OpenAI, particularly combined with lessons from Deep Research’s RL training (which, I know, is built on o3), has serious expertise here. A research paper showing a DeepSeek R1 style scaled RL training along with consistent tool use rates across certain data subsets will be very impressive to me.* The underlying search index is crucial. OpenAI’s models operate on a Bing backend. Anthropic uses Brave’s API and it struggles for it (lots of SEO spam). Spinning up an academic baseline with these APIs is a moderate additive cost on top compute.Once solid open baselines exist, we could do fun science such as studying which model can generalize to unseen data-stores best — a crucial feature for spinning up a model on local sensitive data, e.g. in healthcare or banking.If you haven’t been using o3 for search, you really should give it a go.Interconnects is a reader-supported publication. Consider becoming a subscriber.2. Progress on agents will be higher variance than modeling was, but often still extremely rapidClaude Code’s product market fit, especially with Claude 4, is phenomenal. It’s the full package for a product — works quite often and well, a beautiful UX that mirrors the domain, good timing, etc. It’s just a joy to use.With this context, I really have been looking for more ways to write about it. The problem with Claude Code, and other coding agents such as Codex and Jules, is that I’m not in the core audience. I’m not regularly building in complex codebases — I’m more of a research manager and fixer across the organization than someone that is building in one repository all the time — so, I don’t have practical guides on how to get the most out of Claude Code or a deep connection with it that can help you “feel the AGI.”What I do know about is models and systems, and there are some very basic facts of frontier models that make the trajectory for the capabilities of these agents quite optimistic.The new part of LLM-based agents is that they involve many model calls, sometimes with multiple models and multiple prompt configurations. Previously, the models everyone was using in chat windows were designed to make progress on linear tasks and return that to the user — there wasn’t a complex memory or environment to manage.Adding a real environment for the models has made it so the models need to do more things and often a wider breadth of tasks. When building these agentic systems, there are two types of bottlenecks:* The models cannot solve any of the task we hope to use the agent for, and* The models fail at small components of the task that we are deploying.For agents that have initial traction, such as Claude Code and Deep Research, many of the problems are in the second class. How these fixes are made is that labs notice repeated, odd failures among real world use-cases. This can look like a 50% reliability rate on some long-tail mundane task. In this case it is often easy for the lab to make new data, include it in the next post-training run for their models, and up that sub-task reliability to almost 99%. As labs are making most of their gains in post-training today, rather than big pretraining runs, the time for that change to get integrated is well shorter than recent years.The kicker for this is how it all fits together. Many complex tasks can be bottlenecked by some weird, small failures. In this case, we can have small changes to models that make agents like Claude Code feel way more reliable, even though the peak performance of the model hasn’t changed much. The same goes for Deep Research.With this, I expect these agents we’re already using to improve randomly and in big leaps.What I’m unsure of is when new agent platforms will be built. Some of this is a product problem and some of it is a peak performance problem. New agentic platforms that feel like they have product-market fit will be somewhat random, but those that have a fit already can improve like we’re used to frontier models getting way better.This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4, where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working.Much like o3, you should play with Claude Code even if you don’t code a lot. It can make fun demos and standalone websites in no time. It’s miles ahead in its approachability compared to the fully-autonomous agents like Codex (at least for the time being).3. Scaling parameters is going to go very slow for consumer modelsThe models that leading AI labs have been releasing in 2025 have mostly stopped getting bigger in total parameters. Take Claude 4, the API prices are the same as Claude 3.5 (and its minor versions). OpenAI only half released GPT-4.5. Gemini hasn’t released its Ultra variant. There are more models that are private to these laboratories that are certainly much bigger.The nuanced part of this is that many of these models likely could be getting slightly smaller, e.g. Claude 4 Sonnet could be slightly smaller than Claude 3.5 Sonnet, due to efficiency gains at pretraining. That sort of marginal technical advancement is a big deal on price and inference speed, especially in the long-run, but not the central point I’m making.The point is how GPT-5 is going to be bigger mostly through inference-time scaling and less through just “one bigger model.” For years we were told the narrative that the lab with the biggest training cluster was going to win because they have an advantage with scaling. That was the story behind xAI’s mega-cluster that Elon built. Now, the biggest cluster just is an advantage in overall research pace.Scaling, at least in terms of what users need, has largely fizzled out. Labs may come back to it later as they find super hard problems that users need to solve, but where GPT 4.5 cost about 100x the compute of GPT-4 to train, it is only slightly better on normal user metrics.What we see now is a mass efficiency march along the model sizes that people love. The industry has a few standards, from* Tiny models like Gemini Flash Lite or GPT 4.1 Nano,* Small models like Gemini Flash and Claude Haiku,* Standard models like GPT-4o and Gemini Pro, and* Big models like Claude Opus and Gemini Ultra.These models come with somewhat predictable price-points (we know Gemini is way cheaper than the industry standard), latencies, and capability levels. Standards like this are important as industries mature!Over time, efficiency gains will make new standards emerge. The first thing we’ll see is more mass availability of the likes of Gemini Ultra and GPT-4.5 (maybe in the GPT-5 release), but what comes after that isn’t on the radar at all. Now, scaling to new size tiers is only possible “every few years” or maybe not at all, if monetization of AI doesn’t go as well as many hope.Scaling as a product differentiator died in 2024. That doesn’t mean pretraining as a science isn’t crucial. The recent Gemini 2.5 report made that pretty clear:The Gemini 2.5 model series makes considerable progress in enhancing large-scale training stability, signal propagation and optimization dynamics, resulting in a considerable boost in performance straight out of pre-training compared to previous Gemini models. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
--------
9:57
--------
9:57
Crafting a good (reasoning) model
Why are some models that are totally exceptional on every benchmark a total flop in normal use? This is a question I was hinting at in my post on GPT-4o’s sycophancy, where I described it as “The Art of The Model”:RLHF is where the art of the model is crafted and requires a qualitative eye, deep intuition, and bold stances to achieve the best outcomes. In many ways, it takes restraint to land a great model. It takes saying no to researchers who want to include their complex methods that may degrade the overall experience (even if the evaluation scores are better). It takes saying yes to someone advocating for something that is harder to measure.In many ways, it seems that frontier labs ride a fine line between rapid progress and usability. Quoting the same article:While pushing so hard to reach the frontier of models, it appears that the best models are also the ones that are closest to going too far.Once labs are in sight of a true breakthrough model, new types of failure modes and oddities come into play. This phase won’t last forever, but seeing into it is a great opportunity to understanding how the sausage is made and what trade-offs labs are making explicitly or implicitly when they release a model (or in their org chart).This talk expands on the idea and goes into some of the central grey areas and difficulties in getting a good model out the door. Overall, this serves as a great recap to a lot of my writing on Interconnects in 2025, so I wanted to share it along with a reading list for where people can find more.The talk took place at an AI Agents Summit local to me in Seattle. It was hosted by the folks at OpenPipe who I’ve been crossing paths with many times in recent months — they’re trying to take similar RL tools I’m using for research and make them into agents and products (surely, they’re also one of many companies).Slides for the talk are available here and you can watch on YouTube (or listen wherever you get your podcasts).Reading listIn order (2025 unless otherwise noted):* Setting the stage (June 12): The rise of reasoning machines * Reward over-optimization* (Feb. 24) Claude 3.7 Thonks and What’s Next for Inference-time Scaling* (Apr. 19) OpenAI's o3: Over-optimization is back and weirder than ever* RLHF Book on over optimization* Technical bottlenecks* (Feb. 28) GPT-4.5: "Not a frontier model"?* Sycophancy and giving users what they want* (May 4) Sycophancy and the art of the model* (Apr. 7) Llama 4: Did Meta just push the panic button?* RLHF Book on preference data* Crafting models, past and future* (July 3 2024) Switched to Claude 3.5* (June 4) A taxonomy for next-generation reasoning models* (June 9) What comes next with reinforcement learning* (Mar. 19) Managing frontier model training organizations (or teams)Timestamps00:00 Introduction & the state of reasoning05:50 Hillclimbing imperfect evals09:18 Technical bottlenecks13:02 Sycophancy18:08 The Goldilocks Zone19:28 What comes next? (hint, planning)26:40 Q&ATranscriptTranscript produced with DeepGram Nova v3 with some edits by AI.Hopefully, this is interesting. I could sense from some of the talks, it'll be a bit of a change of pace than some of the talks that have come before. I think I was prompted to talk about kind of a half theme of one of the blog posts I wrote about sycophancy and try to expand on it. There's definitely some overlap with things I'm trying to reason through that I spoke about at AI Engineer World Fair, but largely a different through line. But mostly, it's just about modeling and what's happening today at that low level of the AI space.So for the state of affairs, everybody knows that pretty much everyone has released a reasoning model now. These things like inference time scaling. And most of the interesting questions at my level and probably when you're trying to figure out where these are gonna go is things like what are we getting out of them besides high benchmarks? Where are people gonna take training for them? Now that reasoning and inference time scaling is a thing, like how do we think about different types of training data we need for these multi model systems and agents that people are talking about today?And it's just a extremely different approach and roadmap than what was on the agenda if a AI modeling team were gonna talk about a year ago today, like, what do we wanna add to our model in the next year? Most of the things that we're talking about now were not on the road map of any of these organizations, and that's why all these rumors about Q Star and and all this stuff attracted so much attention. So to start with anecdotes, I I really see reasoning as unlocking new ways that I interact with language models on a regular basis. I've been using this example for a few talks, which is me asking O3, I can read it, is like, can you find me the GIF of a motorboat over optimizing a game that was used by RL researchers for a long time? I've used this GIF in a lot of talks, but I always forget the the name, and this is the famous GIF here.And coast runners is the game the game name, which I tend to forget. O3 just gives you a link to download the GIF direct directly, which is just taking where this is going to go, it's going to be like, I ask an academic question and then it finds the paragraph in the paper that I was looking for. And that mode of interaction is so unbelievably valuable. I was sitting in the back trying to find what paper came up with the definition of tool use. I think there's a couple twenty twenty two references.If you're interested after, you can can find me because I don't remember them off the top of my head. But these are things that AI is letting me do, and it's it's much more fun and engaging than sifting through Google. And the forms of the models so this previous one was just O3 natively, whatever system prompt ChatGPT has, but the form of these interactions are also changing substantially with deep research that we've heard alluded to and and referenced. And then Claude Code, which is one of the more compelling and nerdy and very interesting ones. I I used it to help build some of the back end for this RLHF book that I've been writing in a website.And these things like just spinning up side projects, are so easy right now. And then also Codex, which these types of autonomous coding agents where there's not the interactivity of Claude code is obviously the frontier that is going. But if you try to use something like this, it's like, okay. It works for certain verticals and certain engineers. However, the stuff I do is like, okay.This is not there yet. It doesn't have the Internet access is a little weird as to build these complex images, installing PyTorch. It's like, okay. We don't we don't want that yet for me, but it's coming really soon. And at the bottom of this is like this foundation where the reasoning models have just unlocked these incredible benchmark scores, and I break these down in a framework I'll come back to later as what I call a skill.And it's just fundamentally reasoning models can do different things with tokens that let them accomplish much harder tasks. So if you look at GPT-4o, which was OpenAI's model going into this, there was a variety of what we're seeing as kind of frontier AI evaluations where it's on the spectrum of the models get effectively zero, which is truly at the frontier to somewhere to 50 to 60 is labs have figured out how to hill climb on this, but they're not all the way there yet. And when they transition from GPT-4o to O1, which if you believe Dylan Patel of semi analysis, is the same base model with different post training, you get a jump like this. And then when OpenAI scales reinforcement learning still on this base model, they get a jump like this. And the rumors are that they're now gonna use a different base model and kind of accumulate these gains in another rapid fashion.And these benchmark scores are are not free. It's a lot of hard work that gets there, but it's just a totally different landscape where things like AIM and GPQA, which is this kind of science technology reasoning questions, are effectively solved. And this is like the use cases I was describing where it's like, O3 can kind of just do this. And a lot of harder things we'll see keep coming, might unlock some of these kind of use factors I'm mentioning as interesting but not there yet. And we'll see this kind of list grow over time, but it's really not like the only thing that we're experiencing on the ground because skills are only one part of this, and there's a lot of this arts and crafts of how do you actually have a good model that people like to use.And a lot of this talk is gonna be talking ways that that can go right and wrong. And generally, just my reflections as someone who trains these models on why we get exposed to this. So there's a lot of online discourse about models that go too far on training on benchmarks. This is an old tweet from Phi from Microsoft. I don't wanna throw them under their bus because they've also Phi-4 is a really good model by now.So a lot of these people get this reputation for things that are maybe like a one off model incident, which emerges from a complexity of org structure weirdness and individual incentives. And I think like Meta's really in this right now, that doesn't mean their future models will be subject to this. But it is definitely a phenomenon that could happen where it's like a lot of low level decisions result in the final product that is just not what you wanted even though it seems like along the way you're doing everything right. And just kind of climbing these benchmark scores, is linked to this thing that I was saying with skills, is not the only way forward. And especially with reasoning models, there's kind of another way we've seen this, which is Claude 3.5, where people love to gripe about this supposed upgrade to Claude, would love to just like fake its way through unit tests.And if you're looking at reasoning training, a lot of the technical implementation for code is you have the model generate code and you check if you pass unit tests. And what people are seeing is that Claude essentially does everything and then modifies the code so that the test passes. And this is like a side effect of at the training time, our reward function is just too simple. It's like we're rewarding the model for getting unit tests right, which might be disconnected from the overall theme. And like, there's just so many ways that they can actually come up, and it's like this RLVR thing, reinforced learning with verifiable rewards, let us climb these skill charts crazily both on public and private benchmarks.So I think a lot of the labs have benchmarks internally that are much more specific to things, and even those you could kind of fake yourself on. And, like, I try there's examples of OpenAI and Gemini doing the same thing, and a lot of recent model releases have this sort of like, oh, the coding is a little off energy. But I I think a lot of it comes to this, which is just bypassing a unit test. And another really interesting one that doesn't quite have the same interference with the utility of it is like this Transluce blog post and other communications when O3 was announced where O3 will do this thing where you ask it a question and it'll say its reasoning for why it's true is some action in the world that it can't actually take due to its sandboxing. So this is the O3 saying that it measured it on its MacBook Pro that it has even though it's just running in in the data center without access to it.I think this is a good example because things will come up like this that have actual physical meaning, but even if we could see it in such innocuous ways, it's just that this, like, we're we're pushing so hard on these skills and the measurable things that a lot of the unmeasurable becomes a second priority goal. And this is the sort of thing where it's a team a teammate at a company will be like, we're so much better at coding, search, math, everything. And it's like, oh, the the leadership needs to get the model out for competitive reasons. And then some of the things, it's like, okay, we do this messy process of making the model much more robust and getting rid of these oddities in the training data and stuff like this, and and you kind of do this later on. So it's kind of this back and forth when the pace of progress is so high that you you have to make hard decisions on what you prioritize.And a lot of times, these weird model releases are actually just technical bottlenecks. I think this is one of my most entertaining model releases of the year is when OpenAI released GPT 4.5, and they released this system card. And originally, it had the sentence in it that GPT 4.5 is not a frontier model. If you try to find this now, they've since edited the system card, so it doesn't say this anymore. But it's just such a funny thing for an AI lab to say because at the same time, the model is really liked by people.And I think that one of the things that people caught on with this was doing green text, and they thought this model was very good at it. And since GPT 4.5, I think people kind of think that all the big models are kind of good at green text. So Claude 4 Opus and Claude 2. Gemini 2.5 Pro, people also do this now. And this is a sign that humor is really getting better at the models. And there's these this is like a total viral moment where people like interacting with GPT 4.5.But if you look at the system card, it's like Sam Altman says we spent 10 times the compute as GPT four on it, and the evaluations are like a minor tick up. And when you look at reasoning models, all the reasoning models have like 20 gains all the time. So it's this really odd release where I think this is a model that I still use very regularly in my day to day basis, but it just can't land for some weird complex reason of balancing this kind of vibes, eval scores, which are really important for marketing to both users and kind of business customers because you get that first impression feeling and then also price. And there's also these weird things where there's more people trying to get involved in these leading models, and the bar for releasing them is actually getting much higher. So these are two papers that are some of the, what I call, leading open reports on how to do reasoning model training.Open thoughts three is on kind of data methods for instruction or supervised fine tuning, just kind of generating a large dataset for performance on math and code. And Magistral is Mistral's first reasoning model. And these, the methods they describe are very strong and it's very good for these companies to be open and Bespoke Labs from Open Thoughts release the data as well. But it's like even these companies get really bad can get really bad press when the models just fail to do really simple things. So for example, the OpenThinker model was the sort of model that'll think for minutes when you say hi to it, and the Mistral model is one of the models that every time you ask it a question, will format it as if it's giving a math answer.And these are obviously cherry picked things from a known vocal critic because it is entertaining, but it's just like this is the space that AI operates in, which is like if you don't get the little things right, people are still gonna complain and and give you bad press because the models that do things extremely well are just one click away. I think things like Claude 4 Opus is easy to use in there and Gemini's coming and there's more open models. And it's just hard to get a model out that has this kind of care to it because it takes a lot of time and resources to kind of wait to release the model when you do all these kind of last bits of fine tuning. The most important one that's been in the news is kind of the sycophancy idea. This is not a real example from the model that they released for two weeks for ChatGPT.This whole little saga, I just imitated it by system prompting it. But it reads just like them, which is ChatGPT was unbelievably sycophantic for a few days. I'm not gonna comment on all the kind of social second order effects because they're obvious and that's a large motivation of why I do the work that I do, which is just like we want to be able to have understanding and oversight onto things like that, but it reveals deep organizational pressures that these companies are going through to kind of get things out the door fast that people really want. If you are to so so here's some more examples. These are the GPT-4o version and one of these viral examples on what it was actually saying.And on the left is the Llama 4 secret chatbot arena version, which is the one they use to get the number one score on Chatbot Arena, but never actually released. And talking to that was very odd because on the release day, you're like, this can't be the model that they're saying is the best thing ever. It's just really strange. It's like Llama example, it's like I asked it what's its name and it said a very direct and very good question. It's like I don't think most people in this room want that answer and that says a lot about evaluation and other things.But it's like this is a deep grained problem with reinforced learning from human feedback and collecting preferences, which is that if you're collecting preference data, you will give a multi page document ranking your priorities on how you rate the or compare the answers. And at the end, there's certain things like Sick of Fancy that people actually just like to get out of models, so they become tie breaks if they're particularly distinctive between the answers. And in the OpenAI post mortem on sake of NC, they had an extremely good breakdown on this. So I recommend that you read this, but I'll take a second and read out loud the core example. So they said, for example, the update introduced an additional reward signal based on user feedback, thumbs up and thumbs down data from ChatGPT.Throughout the post, they talk about how they trained a reward model to predict this. And with what we've seen from things like Chatbot Arena, it seems very likely that the strongest signal in that reward model was that it is just sycophancy of links to these thumbs up data and then that was expressed. Generally this is a form of over optimization. The last line on this slide is something that comes from a lot of history in the reinforcement learning literature is that for example in syncopancy and that reward model, when they were training these models in post training with many stages, so they go through some instruction tuning and they do RL, and they do RL, and they do RL, RL. As you're really pushing the models to their limits, the strong optimizers that we use will extract performance where it's easiest in your training signals.And the easiest training signal is probably, is just like you you add some emojis, you tell them they're good at it, and the models can pick up on that very easily. And then the decision making problem that can explain most of these issues that we've talked about so far in the talk is that you have many evals, and these are things you're trying to hill climb on, but you can never have all of them. So you're kind of doing a multi objective problem, you're pushing all of those up, and it's often taking from something that you're not accounting for, and it's getting pulled way in the other direction. And this is where the things like art are important, and I'll kind of highlight a model that I think did this really well in a few slides. But I think as we see this competition for models, we'll both see more weird releases like this where there's kind of rough edges.And there's also this much bigger opportunity in the AI space to release things that are really robust and bring a lot of joy and don't have these rough edges. So I think this kind of drive to be patient is going to be hard and hopefully rewarded when there's a lot of weird things out there. And for the non researcher majority of this audience, the figure on the right is from what is like the original over optimization paper in reward models. It's called scaling laws for language model or reward model optimization. And this is just showing the x axis is a technical measure that's KL distance, which is a distance that's used to reference the numerical change from a starting model to a final model when you're doing some RL fine tuning.And this is showing that the y axis, your reward model score, it goes up and then it goes down. So the the hard part is when you don't have something you're competent in is you don't know when it starts going down in your over optimization. So this is kind of a classic paper you'll see if you start digging into this direction more. And what I kind of wanna highlight is that there's a Goldilocks zone between evals, vibes, and price. And I think particularly most models now are getting evals and price, and this middle one is hard.When I think about it, Claude 3.5 Sonnet was released over a year before Claude 4 Sonnet. And this model definitely had all of them. It was the one that got people in the Bay Area here to switch from ChatGPT to Claude. And that it feels so ahead of its time because clod force on it really doesn't feel that different than 3.5. And it's definitely better, but these jumps are rare.And these models, it's like we're on this really fast slope. So if you get a really lucky model, you're just gonna have such a really, really great output. And we'll kind of see where they come. I wouldn't know if I necessarily count O3 in this. O3 is a proof of concept that you can do something, but it still has a lot of rough edges for people.For example, coding ability isn't quite good enough. But there will definitely be more of them and it'll be interesting to see what falls into that niche. And then, kind of the transition into the art of modeling and back to some of the stuff that I was presenting at the AI engineered world fair and I'm thinking about for my day job is like, what comes next after we have this reasoning ability? And I've come up with a few different things that people are gonna be trading into these models that make some of these applications that I started with possible. Autonomy is a very discussed trend in AI right now.This is a plot from METR, which is a nonprofit evaluation or monitoring startup, I think, in Berkeley. And the plot here is the y axis is the length of time that it would take for a human to do a task that an AI is now doing. And what we're seeing is that AI models over time are able to solve longer and longer tasks. It's important that I I focus on this fact that it's a time of human task because I've messed up saying it in the past. And the TLDR is that climbing this is not free, and it takes a lot of hard work in improving the models and knowing where to push the models next to kind of unlock these.And it's it's like the reasoning paradigm is a good example because we couldn't keep hill climbing on what we were doing to to kind of unlock this middle phase. So it takes a lot of effort and kind of transitioning the focus from reasoning to things that are focused on planning in this kind of task abstraction is gonna be the thing that unlock these next models or even systems. I think you might see like a DeepResearch bullet point on here rather than like O4 being on here. And then the question is like, how do we actually do this in a model and how do we train autonomous models? So when I think about traits of these kind of independent agent models, I think about starting with skills, which is what I talked about with reasoning, and then kind of expanding into other traits that are gonna be needed.So I kind of think of calibration, is models that know what they don't know. And I think that labs haven't been using this because it's been easier to unlock performance by just building skills than having the models being kind of introspective. They just haven't needed it because they're not doing a lot of tasks on their own, and the humans could give the feedback. And then to kind of go with this, there's a few what I described these last two as two different subsections of planning. I was trying not to overburden the term because strategy is kind of creating a plan in itself and knowing where to go.So if you ask DeepSeq R1 a very hard math question, it won't first plan. It'll just dive in and start to try to solve it. And having the model spend the time to refine its direction before solving it would be very important. And then when we're doing things like deep research or hard coding problems, we need a model that if it's presented with a plan, knows how to do itself in certain inference passes to solve subsets of it or to dispatch that to other agents that can take these steps on the problem and actually solve them. These are things that their models aren't trained to do at all right now, where it's like skills we have, inference time scaling, the basic RL stuff has unlocked this.I don't think we need that much the models are solving the most impossible math questions for mathematicians. We don't need to push it that much further. Calibration, there's a lot of research on that. And then these last two is what I think is the next Q Star like thing where there's gonna be a lot of human data, and then we're gonna have a few good examples that we can then use to iterate with more complex training. And so if we revisit this example to ground my taxonomy, I call this very skillful but lacking planning.Search is a skill that O3 has largely mastered. It can find really niche information, but if we're gonna pair this to DeepResearch, it doesn't quite know how to compare and contrast and know how much information it needs to gather before making its conclusion. So planning and synthesis is something that we have to encourage the models to do more of before we kind of just well, to unlock the next phase of progress. And to end, this is a somewhat technical provocation, but is something that I hear definitely happening directionally on the ground of other labs is like this RL and post training is becoming the focal point of language model development. It doesn't mean like, pre training is definitely not dead now, but mathematically, in terms of compute, these techniques for reasoning and planning are really becoming the bedrock of what people spend their money on.Is a plan that people ask me what am I doing at AI2 to try to make a better reasoning model. At the end of the day, the technical things are not that complicated, which is you get a big data set, you filter the data set, you train the model for a while, and then after you have your models, you do a bunch of the things that are in all the research papers. They have a few ideas and you try them at the end and they might give you 1%. This is mostly suited for a technical audience but fun to include. And then back to this from post training to training idea, it's like how do we integrate this compute where post training is similar in GPU hours and then there's the idea of continual learning and if we don't ever have to really pre train a model at all, and we can kind of just keep using these real world interactions to provide signal to the model.If we ground it into some real numbers, DeepSeek V3 was the famous paper that kick started a lot of discussions on how much does it cost to train a frontier model, and they listed these prices at $5,000,000 Within that table, they said that post training used like well less than 1% of the compute. A fun example is that a researcher on their RL team tweeted that DeepSeek R1 trained for a quote few weeks in the RL stage. This is obviously not something that you want to base any sort of, like, investment or strategy decision in. But if you extrapolate from the tweet, it's like DeepSeek R1 one could be, 10 to 20% in GPU hours. If you talk to somebody at OpenAI, they'll also say similar things, which is like O1 uses very low percentage, but O3 was 10 x the RL compute, and O4 should be the same, which is that this post training phase is already becoming 10 to 20% of the compute used for these larger models, and that's where people are looking to push limits with these things like planning and so on.So it should be fun. This means that we'll get model releases more frequently because you can kind of see where the performance is going during these RL runs. Where pre training, you have to wait all the way to the end for technical reasons to see how the model is So for people building models, I think these next eighteen months are gonna continue to feel like what it has in the first six months of the year where it's every few weeks we're getting something that is potentially noticeably better. So that's where I've ended. I'm probably a little under on time but potentially catching up for for break time.So thank you all for listening. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
--------
30:26
--------
30:26
The rise of reasoning machines
https://www.interconnects.ai/p/the-rise-of-reasoning-machinesNote: voiceover coming later in the day. I may fix a couple typos then too.A sufficiently general definition of reasoning I’ve been using is:Reasoning is the process of drawing conclusions by generating inferences from observations.Ross Taylor gave this definition on his Interconnects Interview, which I re-used on my State of Reasoning recap to start the year (and he’s expanded upon on his YouTube channel). Reasoning is a general space of behaviors or skills, of which there can be many different ways of expressing it. At the same time, reasoning for humans is very naturally tied to our experiences such as consciousness or free will.In the case of human brains, we collectively know very little of how they actually work. We, of course, know extremely well the subjective experience of our reasoning. We do not know the mechanistic processes much at all.When it comes to language models, we’re coming at it from a somewhat different angle. We know the processes we took to build these systems, but we also don’t really know “how deep learning works” mechanistically. The missing piece is that we don’t have a deep sense of the subjective experience of an AI model like we do with ourselves. Overall, the picture is quite similar.To set the stage why this post is needed now, even when reasoning model progress has been rampaging across the technology industry in 2025. Last week, an Apple paper titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity reignited the “reasoning debate” with newfound vigor.Some of the key examples in the paper, other than traditional reasoning evaluations such as MATH-500, were that AIs struggled to solve scaled up versions of toy problems, shown below. These are problems that one can programmatically increase the complexity on.The argument was that language models cannot generalize to higher complexity problems. On one of these toy problems, the Tower of Hanoi, the models structurally cannot output enough tokens to solve the problem — the authors still took this as a claim that “these models cannot reason” or “they cannot generalize.” This is a small scientific error.The paper does do some good work in showing the limitations of current models (and methods generally) when it comes to handling complex questions. In many ways, answering those with a single chain of thought is unlikely to ever actually work, but they could be problems that the model learns to solve with code execution or multiple passes referencing internal memory. We still need new methods or systems, of course, but that is not a contribution to the question can language models reason? Existence of a trait like reasoning needs small, contained problems. Showing individual failures cannot be a proof of absence.Interconnects is a reader-supported publication. Consider becoming a subscriber.This summary of the paper, written by o3-pro for fun, sets up the argument well:The presence of a coherent-looking chain‑of‑thought is not reliable evidence of an internal reasoning algorithm; it can be an illusion generated by the same pattern‑completion process that writes the final answer.The thing is, the low-level behavior isn’t evidence of reasoning. A tiny AI model or program can create sequences of random strings that look like chains of thought. The evidence of reasoning is that these structures are used to solve real tasks.That the models we use are imperfect is not at all a conclusive argument that they cannot do the behavior at all. We are dealing with the first generation of these models. Even humans, who have been reasoning for hundreds of thousands of years, still show complete illusions of reasoning. I for one have benefitted in my coursework days by regurgitating a random process of solving a problem from my repertoire to trick the grader into giving me a substantial amount of partial credit.Another point the paper points out is that on the hardest problems, AI models will churn through thinking for a while, but suddenly collapse even when compute is left. Back to the test-taking analogy — who doesn’t remember the drama of a middle-of-the-pack classmate leaving early during a brutally hard exam because they know they had nothing left? Giving up and pivoting to a quick guess almost mirrors human intelligence too.This all brings us back to the story of human intelligence. Human intelligence is the existence proof that has motivated modern efforts into AI for decades. The goal has been to recreate it.Humans for a long time have been drawn to nature for inspiration on their creations. Humans long sought flying machines inspired by nature’s most common flying instrument — flapping wings — by building ornithopters.Let’s remember how that turned out. The motivation is surely essential to achieving our goal of making the thing, but the original goal is far from reality.Human reasoning is the flapping wings of this analogy. It’s the target, but not the end point. Any useful definition of reasoning should encompass what humans do and what our future creations will do.We’ve passed the Wright Brothers moment for artificial reasoners — it’s not what we expected it to look like, but it’s here.We should go deeper on why the subjective experience we have as humans makes this case far harder to disentangle than flight. Flight is a physical phenomenon, and hence one detached from our mind. Our mind is literally only representing reality through a transformation, and it can manipulate this representation in a way that serves its physical interests.Free will is one of those manipulations, or expressions. Free will is a useful construct that enables many complex human behaviors.The “awareness” of these reasoning models is definitely in a tricky middle ground. The language models have a remarkable general understanding of the environments they operate in — they can explain what a code executor or a chatbot is with precision. They cannot, though, explain exactly how the environment they’re in works.AI gaining this level of awareness while being able to act is entirely new. Previous generations of AI models that acted were RL systems trained end-to-end to act in a narrow environment. They were superhuman but had effectively no awareness of how the environment worked. Having both the ability to break down problems and express some level of awareness with the world is remarkable. What is missing in the human comparison is AIs being able to evolve with the environment, i.e. continual learning.Just because an AI doesn’t have all the tools that we use to interact intelligently with the world does not mean it isn’t reasoning. The models break down problems and iteratively try until they reach an answer. Sometimes the answer is wrong, but that’ll improve over time in line with their awareness.You say AIs are just pattern matching — I say humans are just pattern matching too. We’re doing it in different ways. Would many of the critics be more accepting of this type of reasoning if it was moved to a latent reasoning approach, more similar to how humans draw answers out of thin air and ruminating?Hallucinations are a great example of the type of complete awareness our AI systems lack. We’ll get better at this. For now, AI models are very minimally trained for “calibration” or knowing what they know. Why train models to know what they know when there are easier ways to solve evaluations? This is why I call calibration a trait of next-generation models — we’re just now getting to the point where it’s needed to solve complex tasks.With better awareness one could argue for consciousness, but I don’t have a good grasp on how that is defined for humans so I won’t go so far as to assign it to other systems.Ilya Sutskever discussed the boundary between understanding and awareness, as what comes next, in his latest test of time talk at NeurIPS 2024. To understand is to predict things accurately. To be self-aware is to be able to predict accurately with an understanding of what it is and what its environment is. This all goes back to Ilya’s provocation for why next-token prediction is enough on the Dwarkesh Podcast:Predicting the next token well means that you understand the underlying reality that led to the creation of that token.His argument is that self-awareness will follow as we push AI models to understand the world. Since that quote 2 years ago, we’ve made immense progress on his vision. Ilya also included a warning in his more recent NeurIPS talk:The more [a system] reasons, the more unpredictable it becomes.We are crossing a rubicon. To ignore this is to be fundamentally ill-prepared for the future.Being surrounded by another intelligent entity is naturally very off-putting for humans. We evolved in a way that made our social and abstract intelligence a major competitive advantage that allowed us to effectively conquer our environment. I’m not an evolutionary biologist nor anthropologist nor sociologist, but it appears that a majority of critiques of AI reasoning are based in a fear of no longer being special rather than a fact-based analysis of behaviors.Thanks again to Ross Taylor for discussions that helped form this post. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
--------
9:03
--------
9:03
What comes next with reinforcement learning
https://www.interconnects.ai/p/what-comes-next-with-reinforcementFirst, some housekeeping. The blog’s paid discord (access or upgrade here) has been very active and high-quality recently, especially parsing recent AI training tactics like RLVR for agents/planning. If that sounds interesting to you, it’s really the best reason to upgrade to paid (or join if you’ve been paying and have not come hung out in the discord).Second, I gave a talk expanding on the content from the main technical post last week, A taxonomy for next-generation reasoning models, which you can also watch on the AI Engineer World’s Fair page within the full track. My talk was one of 7 or 8 across the full day, which was very enjoyable to be at, so I am honored to have won “best speaker” for it.Three avenues to pursue now that RL worksThe optimistic case for scaling current reinforcement learning with verifiable rewards (RLVR) techniques to next-generation language models, and maybe AGI or ASI depending on your religion, rests entirely on RL being able to learn on ever harder tasks. Where current methods are generating 10K-100K tokens per answer for math or code problems during training, the sort of problems people discuss applying next generation RL training to would be 1M-100M tokens per answer. This involves wrapping multiple inference calls, prompts, and interactions with an environment within one episode that the policy is updated against.The case for optimism around RL working in these new domains is far less clear compared to current training regimes which largely are rewarding the model for how it does on one interaction with the environment — one coding task checked against tests, one math answer, or one information retrieval. RL is not going to magically let us train language models end-to-end that make entire code-bases more efficient, run scientific experiments in the real world, or generate complex strategies. There are major discoveries and infrastructure improvements that are needed.When one says scaling RL is the shortest path to performance gains in current language models it implies scaling techniques similar to current models, not unlocking complex new domains.This very-long-episode RL is deeply connected with the idea of continual learning, or language models that get better as they interact with the real world. While structurally it is very likely that scaling RL training is the next frontier of progress, it is very unclear if the type of problems we’re scaling to have a notably different character in terms of what they teach the model. Throughout this post, three related terms will be discussed:* Continuing to scale RL for reasoning — i.e. expanding upon recent techniques with RLVR by adding more data and more domains, without major algorithmic breakthroughs.* Pushing RL to sparser domains — i.e. expanding upon recent techniques by training end-to-end with RL on tasks that can take hours or days to get feedback on. Examples tend to include scientific or robotics tasks. Naturally, as training on existing domains saturates, this is where the focus of AI labs will turn.* Continual learning with language models — i.e. improvements where models are updated consistently based on use, rather than finish training and then served for inference with static weights.At a modeling level, with our current methods of pretraining and post-training, it is very likely that the rate of pretraining runs drops further and the length of RL training runs at the end increases.These longer RL training runs will naturally translate into something that looks like “continual learning” where it is technically doable to take an intermediate RL checkpoint, apply preference and safety post-training to it, and have a model that’s ready to ship to users. This is not the same type of continual learning defined above and discussed later, this is making model releases more frequent and training runs longer.This approach to training teams will mark a major shift where previously pretraining needed to finish before one could apply post-training and see the final performance of the model. Or, in cases like GPT-4 original or GPT-4.5/Orion it can take substantial post training to wrangle a new pretrained model, so the performance is very hard to predict and the time to completing it is variable. Iterative improvements that feel like continual learning will be the norm across the industry for the next few years as they all race to scale RL.True continual learning, in the lens of Dwarkesh Patel is something closer to the model being able to learn from experience as humans do. A model that updates its parameters by noticing how it failed on certain tasks. I recommend reading Dwarkesh’s piece discussing this to get a sense for why it is such a crucial missing piece to intelligence — especially if you’re motivated by making AIs have all the same intellectual skills as humans. Humans are extremely adaptable and learn rapidly from feedback.Related is how the Arc Prize organization (behind the abstract reasoning evaluations like ARC-AGI 1, 2 and 3) is calling intelligence “skill acquisition efficiency.”Major gains on either of these continual learning scenarios would take an algorithmic innovation far less predictable than inference-time scaling and reasoning models. The paradigm shift of inference-time scaling was pushing 10 or 100X harder on the already promising direction of Chain of Thought prompting. A change to enable continual learning, especially as the leading models become larger and more complex in their applications, would be an unexpected scientific breakthrough. These sorts of breakthroughs are by their nature unpredictable. Better coding systems can optimize existing models, but only human ingenuity and open-ended research will achieve these goals.Challenges of sparser, scaled RLIn the above, we established how scaling existing RL training regimes with a mix of verifiable rewards is ongoing and likely to result in more frequent model versions delivered to end-users. Post-training being the focus of development makes incremental updates natural.On the other end of the spectrum, we established that predicting (or trying to build) true continual learning on top of existing language models is a dice roll.The ground in the middle, pushing RL to sparser domains, is far more debatable in its potential. Personally, I fall slightly on the side of pessimism (as I stated before), due to the research becoming too similar to complex robotics research, where end-to-end RL is distinctly not the state-of-the-art method.Interconnects is a reader-supported publication. Consider becoming a subscriber.The case forThe case where sparser, scaled RL works is quite similar to what has happened with the past generations of AI models, but with the infrastructure challenges we are overcoming being a bit bigger. This is continuing the march of “deep learning works,” where we move RL training to be further off-policy and multi-datacenter. In many ways RL is better suited to multi-datacenter training due to it having multiple clusters of GPUs for acting, generation, and learning, policy gradient updates that don’t need to communicate as frequently as the constant updates of pretraining with next-token prediction.There are two key bottlenecks here that will fall:* Extremely sparse credit assignment. RL algorithms we are using or discovering can attribute per-token lessons well across generations of millions of tokens. This is taking reward signals from the end of crazily long sequences and doing outcome supervision to update all tokens in that generation at once.* Extremely off-policy RL. In order to make the above operate at a reasonable speed, the RL algorithms learning are going to need to learn from batches of rollouts as they come in from multiple trial environments. This is different than basic implementations that wait for generations from the current or previous batch to then run policy updates on. This is what our policy gradient algorithms were designed for.As the time to completion becomes variable on RL environments, we need to shift our algorithms to be stable with training on outdated generations — becoming like the concept of a replay buffer for LM training.Between the two, sparsity of rewards seems the most challenging for these LM applications. The learning signal should work, but as rewards become sparser, the potential for overoptimization seems even stronger — the optimizer can update more intermediate tokens in a way that is hard to detect in order to achieve the goal.Overcoming sparsity here is definitely similar to what happened for math and code problems in the current regime of RLVR, where process reward models (PRMs) with intermediate supervision were seen as the most promising path to scaling. It turned out that scaling simpler methods won out. The question here is, will the simpler methods even work at all?The case againstThere are always many cases against next-generation AI working, as it’s always easy to come up with a narrative against complexity in progress. There are a few key points. The first is that scaling to sparser tasks is already not working, or we don’t know how to actually set up the rewards in a way that encourages the model to get meaningfully better at long tasks.For example, consider Deep Research, a new product that is “trained with RL” and generates millions of tokens per query. How exactly does the RL work there? OpenAI lightly described the training method for Deep Research in their launch blog post (emphasis mine):Deep research independently discovers, reasons about, and consolidates insights from across the web. To accomplish this, it was trained on real-world tasks requiring browser and Python tool use, using the same reinforcement learning methods behind OpenAI o1, our first reasoning model. While o1 demonstrates impressive capabilities in coding, math, and other technical domains, many real-world challenges demand extensive context and information gathering from diverse online sources. Deep research builds on these reasoning capabilities to bridge that gap, allowing it to take on the types of problems people face in work and everyday life.There are two key points. First, they say they train on browser and tool-use tasks with the same infrastructure as o1. Second, they focus on how these capabilities can bridge the gap to harder problems — not that the capabilities are being learned on the harder problems themselves.How to read this training method, which is likely similar for agents like Claude Code or Codex, is that current RL methods are helping the models get more robust at individual tasks that make up a longer trajectory rather than being trained on the end result of the trajectory itself. The final long-horizon behavior is put together with prompting and letting the model run longer, not sparse credit assignment. In the case of Deep Research the final measure of performance would actually look far closer to human preferences than verifiable rewards, and a large portion of that applies for Claude Code as well, where multiple solutions could solve a problem and it falls to human taste to say which is the best.There’s a much clearer path for RL training going from human preferences through verifiable rewards and back to human preferences again, rather than pushing further into sparser, harder verifiable domains.Second, recent RL research always shows that many interactions with a problem or world are needed to solve challenging tasks. In the RLVR domain for math or code the models are generally shown many similar problems multiple times. In the standard RL domains, standard practice is to create simulators that allow massively parallel learning agents (as discussed in the Interconnects Interview with Eugene Vinitsky). The more challenging the problem we’re attempting to deploy RL to, the less these conditions of parallelism or multiple tries can apply.Whether or not it works, the thing to try is carefully curating the first trajectories to train the models on. This is what OpenAI did to create o1, and it took so long that we got all the Q* rumors in their early experiments. These manual trajectories of optimal samples from Deep Research or coding agents will definitely help performance, but it isn’t clear if they’ll serve as a “warm start” for the model to then be trained extensively with bigger RL.Is continual learning something we should want?Dwarkesh’s goal, in many ways, is an AI that learns after interacting with you in a permanent way. This comes with unintended side-effects and would be borderline dangerous. The current AI systems that learn in a “continual” way via trial-and-error with the user are algorithmic feeds. Most people remark how incredible it is for TikTok to learn your interests in real time in front of you, often capturing an essence within minutes.When it comes to AI models with the latent intelligence that is superhuman in many aspects of understanding, unlocking a rapid and personalized feedback loop back to some company owned AI system opens up all other types of dystopian outcomes. For a long time I’ve written that AI models have a higher risk potential in terms of social outcomes because the modalities they interact with us in are far more personal — e.g. private messaging. Combine a far stronger optimizer with a far more intimate context and that is a technology I don’t even want to try.There are alternatives that still reap the upside. Despite the bumpy rollout, ChatGPT features that just remember your past interactions can go a long way to act like continual learning. The model can reference past chats and times you corrected it in order to avoid repeating the same mistake, even though the underlying weights don’t need to update. If that isn’t powerful enough, we can wait for the technology to become efficient enough for local models to learn continually as we interact with them. Both of these would dampen the risk potential of super-targeted AI.Personalization is the softer framing of this that is more compelling. Continual learning is the framing that suits the leading model providers because their training algorithms will be the ones benefiting from all of the interactions. Personalization doesn’t suit the frontier AI laboratories well because their economies of scale push them to have few models for many users. If open models keep up, we should be able to create a future of specialized, “n of 1” models for specific users.Without corporate misaligned incentives, I’d be happy to have continual learning, but on the path we’re going down I’d rather not have it be an option presented to the masses at all.As AI is going to be so powerful as a standalone entity, breaking some of the symbiotic links will be good for adding friction that makes the technology easier to steer towards good outcomes. In short, be wary of wishing for end-to-end (reinforcement) learning when you’re part of the environment. It’s a destiny to dystopia.Aside: Revisiting AI usageFinally, while in SF I was chatting with many people about the theme of my post, People use AI more than you think, which is framed as simple AI revenue and usage growth. The core idea of the article should’ve been expanded, as not only do people use AI a lot already, but most of the most popular AI services are supply constrained just like Nvidia. When you see revenue forecasts from OpenAI or Anthropic to The Information, it’s best to believe them for scaling existing product offerings. They know when they’re getting more capacity online. The new higher-revenue offerings are in flux.For example, Sundar Pichai acknowledged this in his recent appearance on the Lex Fridman podcast:I think it's compute-limited in this sense, right, like, you know, part of the reason you've seen us do Flash, Nano, Flash, and Pro models, but not an Ultra model, it's like for each generation, we feel like we've been able to get the Pro model at like, I don't know, 80, 90% of Ultra's capability, but Ultra would be a lot more, like slow, and a lot more expensive to serve. But what we've been able to do is to go to the next generation and make the next generation's Pro as good as the previous generation's Ultra.This will very likely continue. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
--------
13:53
--------
13:53

More Science podcasts

About Interconnects

Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories. www.interconnects.ai

Podcast website

Science Technology