PodcastsBusinessLatent Space: The AI Engineer Podcast

Latent Space: The AI Engineer Podcast

Latent.Space
Latent Space: The AI Engineer Podcast
Latest episode

283 episodes

  • Latent Space: The AI Engineer Podcast

    The Professor of Outputmaxxing — Anjney Midha, AMP

    18/06/2026 | 59 mins.
    Last 4 days before regular tickets sell out at AI Engineer World’s Fair - this is the single biggest gathering of AI Engineers, Founders, Leaders, and Researchers in the world. Attendees get >$5000 worth of sponsor credits and talk tracks are looking FANTASTIC. Join us!
    The AI scaling debate always focuses on the question of “how do we get more GPUs?” but the better question may be: how do we make the most of ones we already have.
    The fact that a frontier lab like xAI could be running at sub-10% MFU (Model FLOPs Utilization) is just a hint at what the real problem may be.
    For context, older frontier-scale training runs were already much higher than 10%. GPT-3 was around 21% MFU. Gopher was around 32%. Megatron-Turing NLG was around 30%. PaLM reached around 46%. And our guest Anjney says best-in-class MFU today is closer to 60–70%.

    It’s not necessarily that xAI is uniquely incompetent (it’s clear they have talented folks) but rather the priorities may be flipped in the GPU arms race.
    While GPU access is a bottleneck, simply increasing CapEx won’t automatically translate to better models as frontier AI is increasingly a systems problem: scheduling, utilization, networking, kernels, frameworks, data pipelines, parallelism, cluster reliability, and the thousand small decisions that determine whether your theoretical FLOPs become real training progress.
    From building Discord’s developer platform and backing frontier AI companies like Anthropic, Mistral, Black Forest Labs, and Periodic Labs to now building AMP’s independent compute grid, Anjney Midha has spent years close to the real bottlenecks of AI scaling. In this episode, Anjney joins swyx at Periodic Labs to unpack why the AI race is not just about buying more GPUs, why 95% utilization would have been considered an outage at Google, and why the next era of AI infrastructure has to be more aligned, more efficient, and more responsible.
    We go deep on AMP’s vision for a compute grid that makes FLOPs flow like megawatts, the difference between full-stack AI labs and horizontal pooling, why AI data centers need community buy-in, and how compute markets could evolve into something closer to an independent system operator. Anjney also explains why DeepMind’s unpublished research points to a market failure, why end-of-life prediction remains one of the most important AI applications he has thought about for fourteen years, and why “output maxing” may become a new discipline for frontier systems.
    We also discuss Anthropic’s culture, why “luck favors the prepared mind” in coding models, how Claude cracked coding, why too much capital too early can make AI labs fragile, what Periodic Labs is trying to do with science and superconductors, why great researchers can become great CEOs, and why Silicon Valley is both deeply missionary and deeply mercenary.
    We discuss:
    * Why 95% utilization was considered an outage at Google
    * Why AI infrastructure waste compounds at frontier-lab scale
    * Why “move fast and break things” does not work for AI data centers
    * How data center backlash, power grids, and community incentives shape AI scaling
    * AMP’s vision for making FLOPs flow like megawatts
    * Why compute needs an independent system operator
    * How interruptible demand and dynamic prioritization worked inside Google
    * Why DeepMind research hoarding creates negative externalities
    * AMP’s 1.2GW base-load ambition and the need for 6GW of spike capacity
    * Why end-of-life prediction could become one of AI’s most important healthcare applications
    * Frontier Systems, output maxing, and full-stack alignment
    * Why APIs and abstraction layers become lossy as organizations scale
    * Superconductors, standards, and the dream of lossless systems
    * SF Compute, open protocols, and the future of compute marketplaces
    * Why non-NVIDIA chips can still benefit from NVIDIA’s reference architecture
    * Trust boundaries and why chip startups need visibility into future model architectures
    * Why VCs often underestimate researchers as CEOs
    * Scientists as star athletes of the mind
    * Why great CEOs need to be confrontational up and down the stack
    * Why leading the frontier matters more than “winning”
    * How Anthropic cracked coding
    * Why culture is fragile, not a permanent moat
    * Why hardship was a feature, not a bug, for Anthropic
    * Why Anthropic’s P0 was coding from day one
    * Periodic Labs, physics as the constraint, and technical reality
    * Silicon Valley mercenaries, missionary teams, and what happens after a breakthrough
    Anjney Midha
    * LinkedIn: https://www.linkedin.com/in/anjney
    * X: https://x.com/AnjneyMidha
    AMP PBC
    * Website: https://amppublic.com/
    * X: https://x.com/amppublic
    Timestamps
    00:00:00 Introduction
    00:00:09 Why AI Compute Is Being Wasted
    00:03:17 Responsible Infrastructure and Data Center Backlash
    00:06:07 AMP Grid: Making FLOPs Flow Like Megawatts
    00:12:41 Foundry, Frontier Labs, and Research Hoarding
    00:14:42 Gigawatt-Scale Compute and End-of-Life Prediction
    00:24:08 Frontier Systems, Output Maxing, and Alignment
    00:27:38 Compute Markets, SF Compute, and Non-NVIDIA Chips
    00:32:57 Trust Boundaries, Co-Design, and Researcher CEOs
    00:38:17 AI Coachella and First-Principles Thinking
    00:42:43 Leading vs Winning in Frontier AI
    00:45:54 How Anthropic Cracked Coding
    00:48:25 Culture, Hardship, and Anthropic’s P0
    00:54:03 Periodic Labs, Physics, and Silicon Valley Mercenaries
    00:56:26 Rishi Valley, Singapore, and Money as a Measure
    00:58:47 Closing Thoughts
    Transcript
    Introduction: Anjney Midha, AMP, and Compute Waste
    Swyx [00:00:00]: We’re in Periodic Labs with Anjney Midha, CEO, founder of AMP. Welcome.
    Compute Utilization: Node Allocation, MFU, and Alignment
    Anjney [00:00:09]: Thanks for having me. At Google, there are two types of utilization usually, right? That you’re measuring in these clusters. One is node allocation, and then the other’s MFU. Node utilization is usually like what percentage of cards in the data center are just, used, and that, if it’s not at, 95%-
    Swyx [00:00:29]: There is no excuse
    Anjney [00:00:29]: There’s no excuse, right? I think 95% at Google, which is where my co-founder, Seb, came from, he built the Borg, PBorg/GQM scheduler at Google, and there I think 95% was considered an outage, so 96% node utilization is, should be standard. And most single-tenant clusters are not running at that. So that’s one. And then MFU should be, I would say the best in class today is somewhere between 60 and 70%. I think this is a leadership question, right? Fundamentally it’s an alignment question, which is are the people who are funding the cluster and then deploying the cluster actually aligned? And sometimes theoretically they are, but in practice the number of people in the chain, the supply chain between, the capital and all the way to whoever’s managing the cluster and then whoever’s measuring what the output is, are just so many, degrees of separation away that, the, The Have you ever heard the radian metaphor, which is at the beginning of an arc, if you have two arcs that are two lines that are just off by a few degrees, that-
    Swyx [00:01:33]: It spreads out
    Anjney [00:01:34]: It spreads out, right? Or at scale. And I think what’s happening is a lot of cluster implementations and infrastructure, a lot of frontier labs and other teams, that’s what’s happening, is they’re, they initialize the plan, which is kind of like North Star with a team that wants to do good, but then they’re, required to scale so fast instead of iteratively that the wastage just compounds really fast at scale. And so I think we know the answer, which is just do iterative bring ups. If you spend time with people who’ve been in the semiconductor industry or the DSN industry for a long time, this is not new, and I don’t think AI should be an excuse. Sure. Something What is new? Okay. We have a lot of new capabilities, but that doesn’t mean just abandon common sense. Common sense should always be in fashion. ? AI scaling doesn’t change the in fact, if anything, AI scaling should be putting a premium on the value of common sense and infrastructure because the margin of error now is so much lower and the costs of wastage are so much higher. And the cost of wastage, by the way, is not just economic. I’m, obviously I’m, I’m an investor, or I’m an investor by background. Over the last few years now we’re running an AI infrastructure business called, AMP. And I think that it’s okay to say this time is different on the capabilities front. We are genuinely getting capabilities at, of the, of a kind we haven’t had before. That doesn’t give you an excuse to say this time is different for everything, especially infrastructure. So look, I love the hacker mindset and the hustler mindset. Now, that’s great for the startup mindset, but you remember this moment where Zuck went from saying, “Move fast, break things” to, move-
    Responsible Infrastructure and Data Center Backlash
    Swyx [00:03:10]: Fast and stable infrastructure
    Anjney [00:03:11]: Move fast with stable infrastructure. I think now we need to move fast with, responsible infrastructure. People are going to ask where the impact is. There was a really In our class yesterday, Scott Nolan, who’s the founder of General Matter, came by at Stanford to speak about energy bottlenecks. And he had a phenomenal idea. He said, “if you look at the marginal unit economics of compute per hour,” he goes, “let’s call it, $4 an hour. If you’re having to bring up a new data center in a new community, why not just say we’re going to charge 4.50 an hour, and that marginal impact or that marginal increase, we just literally take that and give it to the local community as cash?” I can tell you as a customer of that compute, I would love that. I’d be happy to pay an additional 50 cents per hour at scale.
    Swyx [00:03:57]: Wow. Yeah.
    Anjney [00:03:58]: Because if that means the public benefit is so clear to the communities that the data centers are coming up in, I’m going to feel like that compute is much more reliable. Up to 20% of all data centers this year in the US, my understanding is are at risk.
    Swyx [00:04:13]: Of community backlash?
    Anjney [00:04:14]: Correct. Of not getting the community support they need to get brought up.
    Swyx [00:04:19]: Wow. That’s a huge number.
    Anjney [00:04:20]: Yeah. Now, we, I think we should dig into what that number is. I think it’s a little bit of overstated. These things can get over-reported, but it-
    Swyx [00:04:27]: They don’t just care about jobs. They care about all the other stuff around it, right? They care about power grid, they care about environments-
    Anjney [00:04:33]: Power grid, permitting, and so on. And imagine I think if you said there’s a new AI deal. If we’re bringing up a data center in your community, we’re actually going to reduce the cost of your electricity bill. Okay, now we’re talking. Right? The community’s going, “Okay. Now this is a deal. I feel like a partner in this.” Right now that’s not happening. There will be audits, there will be investigations, and when the, when the regulators come, I don’t know when it’s going to be, the folks who are moving fast and breaking things in the name of AI progress better be prepared. That’s certainly not how we’re procuring compute. Or we’re, we’re trying as much as we can to work with partners who have long-term track records. Many of whom, by the way, are not, AI providers. I think this whole idea of neoclouds being somehow this new category is a lot of marketing speak. There are really good, reliable, trusted data center providers in America who’ve been around 20 plus years. I love those folks. They know how to Sure. Are they sponsoring happy hours at NeurIPS? No. Are they legibly listed in Build? No. Are they hanging out in my, in, situational awareness parties? No. But they’re adults. I trust them.
    Swyx [00:05:44]: They can run LAN. They can run power.
    Anjney [00:05:45]: They can run LAN, power, and shell. They have credit histories. We sit down, we have a conversations. Many of them live in Silicon Valley. They’ve, they’ve had to deal with the boom and bust cycles of the internet, and I love those folks. They are stable infrastructure partners and thinkers. And I think there’s a lot of short-term thinking going on in the compute layer, and it’s going to catch up to us. It’s not going to be good.
    AMP Grid: Making FLOPs Flow Like Megawatts
    Swyx [00:06:07]: You talk about aligning incentives, and, I would think that aligning incentives means you have the full stack in one company, which is xAI and OpenAI, right? So you as a standalone infrastructure layer, why are you somehow more aligned to your portfolio companies than people who just own the whole thing?
    Anjney [00:06:28]: In systems design, right, there’s, there’s two regimes of, architecture, right? You have integration, and then you have pooling and utilization, right? So the Or rather, the way to increase utilization often is you can do systems integration where you collapse a lot of process into one node, or you can pull out a process from a node and share that amongst various That resource amongst several different nodes. And so we see the AMP grid, which is, the, what, the system we’re building here, which is basically a compute grid. We’re trying to do for compute what the electric grid-
    Swyx [00:07:02]: Power
    Anjney [00:07:02]: Yeah, what the power grid did for electricity. It-- this is a pooling and utilization layer across clouds, And so we’re actually the opposite of a full stack integration like approach.
    Swyx [00:07:12]: Super horizontal.
    Anjney [00:07:13]: Where it’s much more horizontal and it’s, it’s multi-cloud, it’s multi-silicon. The goal is to try to make FLOPs flow like megawatts, and that is very hard to do today for many reasons. There’s stranded pools of compute all over the place and there’s no fungibility. And so right now we do it at the level of scheduling, and we often do it at the economic layer. But as we start to announce what we’re working on, it’s extraordinary like how many folks are coming out of the woodworks and saying, “Hey, I’m actually working on a way to make compute fungible at this part of the stack and that part of the stack.” And as a grid, we’d like all of these folks to participate on the grid. There’s, people often ask me, “Andra, are you a new cloud?” And I go, “No, actually neoclouds are suppliers.” sometimes they’ll ask, “Are you a venture capital firm?” I go, “No, actually they are, they are demand like sort of off-takers of the grid.” We see ourselves as what’s called an independent system operator. So if you study the history of the electric grid, once it became legible to a lot of factories and industrial sort of participants that, hey, actually it turns out pooling is a good idea. We should pool our generators instead of all having a generator running at half capacity in our backyard. There was a need for an independent entity who could coordinate all these parties. Transmission line, power generation, facilities, transmission lines, factories, and that neutral coordination mechanism is very critical. In order-- If you study like the history of grids, the most enduring ones were those that never owned their own assets. They were ones that had, or often started with long-term anchors who are uncorrelated sources of demand, a steel factory, a shoe mill or whatever in a particular town who weren’t competitive, where the steel factory want to spike up at night, the shoe mill wanted to spike up during the day. So then you pool and you share, right? So each of you is guaranteed some base load, but then you kind of schedule your spikes to drive a peak utilization across the town. The gold standard, so to speak, historically, has been these utility companies like PJM Interconnect in the northeast of America, where they, over many years became this what’s called an ISO, an independent system operator of the grid. So that’s how we see ourselves. Economically, that’s what we are. From a technical perspective, we started at the scheduling layer because Seb and Mihai, who, run engineering here, built that at-
    Swyx [00:09:28]: Did your scheduling
    Anjney [00:09:28]: They did that at Google. And, -
    Swyx [00:09:32]: And you have infra shops from Discord as well.
    Anjney [00:09:35]: I have some.
    Swyx [00:09:35]: I don’t know, I don’t know if Discord is like the primary identity, but what-whatever, I’m just kind of-
    Anjney [00:09:39]: No, D-Discord was-
    Swyx [00:09:40]: Choosing a well-known name.
    Anjney [00:09:42]: Well, I So I was running the developer platform there. The internal infrastructure I was not responsible for. That was actually a guy by the name of Mark Smith, who was extraordinary. And yes, Discord did pool So Discord is actually a counter example. I had the chance to learn a lot about fully, full stack infra there because-
    Swyx [00:09:56]: It’s the same thing, yeah
    Anjney [00:09:57]: It’s the, it’s the other architecture which is, Discord built its own WebRTC vo-voice and video infra. So like Discord did not use-
    Swyx [00:10:08]: For the calls, yeah.
    Anjney [00:10:09]: Yeah, did not For communication, Discord did not use third party infra. It was all built in-house. And then the way you maximize utilization was you pool demand from the world’s 200 million plus monthly active gamers, right? And so that’s, that’s how those stacks were constructed. Again, in systems design, the two concepts that keep coming up over and over again are abstraction and composition, right? And-
    Swyx [00:10:31]: Bundling and unbundling
    Anjney [00:10:33]: Bundling and unbundling, abstraction, composition, like verticalization and-
    Swyx [00:10:36]: Horizontal
    Anjney [00:10:36]: Horizontalization. So in that sense, AMP is an independent system operator of the grid. We pool demand, we pool supply from a number of partners we trust At about 1.3 gigawatt scale over four years. And then we pool demand from some of the world’s best, research labs and so on. We’re sitting at one, periodic labs who need extraordinary long-term demand. And the idea is that, each of them is guaranteed base load on the grid, but they can spike up and down flexibly on, for compute, with much shorter timelines as needed. That was roughly the design of the program I came up with at a16z called Oxygen. The same-- That was the same design of the GQM, BorgX, Borg GQM implementation at Google that Mihai and Seb had built. Which was that how do you allow, teams inside of Google, on the internal infrastructure to be guaranteed capacity, for their base workloads? But when they need to spike up on research, how could they ensure that was sufficiently there? And of course, the big innovation that was not discovered, but kind of implemented in the space, this infra space maybe three, four years ago at Google was the idea of interruptible demand, right? Where you just queue up a bunch of jobs and through this like sort of credit system, there can be a bidding mechanism.
    Swyx [00:11:53]: Like priorities.
    Anjney [00:11:54]: It’s a dynamic prioritization Basically. And jobs can get interrupted based on somebody else who’s saying, “what? I have 10 tokens, 10 credits I want to spend on this job.” Another like team lead, research lead is “Genie 3 or whatever is only worth five, credits, and NanoBanana2 is worth 10 credits,” and so the NanoBanana job gets priority. That’s a, that’s a made up example.
    Swyx [00:12:15]: It’s very real. Brain Marketplace was real. And, we’ve, we’ve covered this on the pod with David Luan, who was-
    Anjney [00:12:20]: Oh, great. Okay
    Swyx [00:12:20]: Was there. And the criticism is that, well, actually sometimes you need central command to go all in on a thing. And actually sometimes capitalism via credits doesn’t work. Not, this is not a criticism of AMP. I’m just saying, this is a thing that has been tried, internally within Google, and it led to Google missing GPT.
    Foundry, Frontier Labs, and Research Hoarding
    Anjney [00:12:41]: Like, we structured ourself essentially very similarly to Google. We are structured as a holdings company. So, Alphabet holdings is Alphabet holdings, and then they’ve got these subsidiaries called Google and-
    Swyx [00:12:51]: Other bets
    Anjney [00:12:52]: Other bets and so on. We’ve got, AMP holdings, and we’ve got our infrastructure business, and then we’ve got a capital business called Foundry that incubates new frontier AI labs or invests in them as venture capital, like Periodic. We put a few hundred million dollars into Anthropic from our fund earlier this year. So wherever we feel like teams are making progress, especially researchers and so on who’ve pushed the frontier inside of existing labs like DeepMind, I find, there comes a point where they feel misaligned with the dictatorship of Alphabet holdings. And at that point, sometimes the dictatorship doesn’t want them anymore. And they’re “Thank you. You’ve done your job here. You’ve kind of helped us through the zero to one phase, and for whatever reason, we’re going to deprioritize your amazing, omni model or whatever it is, and instead we’re going to prioritize coding.” And, I think that’s a tragedy, but I get it. They’re Sergey and team are running their own business there. But that doesn’t mean we the rest of us should sit around waiting for that progress to get unlocked for the rest of the world and humanity. If you think about how much extraordinary research has happened inside of DeepMind over the last 10 years, I, Demis and Sergey and those guys did such a great job. But at the end of the day, so much of that has never seen the light of day?
    Swyx [00:14:00]: Or they’re like papers only, but they never actually shipped it to production or-
    Anjney [00:14:03]: What’s worse is the paper is actually not even being published anymore ‘cause there’s a six-month embargo inside of DeepMind, right? We’ve heard about this where a paper comes out, and then I think there’s a six-month embargo window where if anybody on the business team says, “This could be interesting” It’s embargoed for life.
    Swyx [00:14:18]: Exactly. So the stuff that gets published is the stuff that’s not good enough.
    Anjney [00:14:21]: There’s an adverse selection problem, basically. Yeah. At this point-
    Swyx [00:14:25]: It’s, it’s a common complaint at NeurIPS, by the way, that’s “Well, why would I look at the papers that are the trash of GDM?”
    Anjney [00:14:31]: Again, I think it’s a tragedy. I get it. They’re running their business, but the rest of the I think there’s negative externalities of research being hoarded, and so that’there’s a market failure. And somebody needs to unlock that research, and we can’t do it on our own. We only have 1.2 gigawatts of compute. That’s nothing. That’s about $40 billion of cloud spend. We’re going to need a lot-
    Gigawatt-Scale Compute and End-of-Life Prediction
    Swyx [00:14:51]: By the way, is that’s a new number. I haven’t, haven’t come across that gigawatt number. That’s huge.
    Anjney [00:14:56]: Yeah. And to be clear, we haven’t secured all of it. That’s how much demand we have started to secure. I think publicly we haven’t actually confirmed how much we have for this year. In order-
    Swyx [00:15:04]: Where do you want to get to?
    Anjney [00:15:06]: I think the steady state would be that we have a base load pool Of 1.2 gigawatts at all times Of base load capacity. For spike capacity, right now my estimate is we need roughly six gigawatts over the next four years for all our teams to feel like they were able to keep moving the frontier, whatever they’re working on, whether it’s, like superconductor discovery over here. There’s a new investment we’re working on right now, which is in the end of life prediction space in healthcare. It’s extraordinary how much you can, you can give this was actually my graduate school work. I went to grad school for bioinformatics at Stanford Med. And I know we-
    Swyx [00:15:40]: Econ, MCS, bio.
    Anjney [00:15:41]: So my-- I was this really weird cat where, I was never satisfied with my major options. So at one point I was an econ major, then I was a CS major, then I was a MCS major called mathematical computational science, and they decided they were going to end that major. So I took all that coursework, and I applied it to grad school, my graduate degree in bioinformatics, which was the master’s program, and then I thought I was going to do a PhD. I never ended up doing it. I dropped out and went to work at Kleiner. But I was lucky enough to apprentice with this professor at, Stanford Med. His name is Nigam Shah, and he was working on end of life prediction. Stanford is one of the only research facilities in America that has a longitudinal patient data set that’s larger at scale. I think it’s at least 12 million patient lives. The only larger data set is at the VA, the Veterans Affairs, of America. And to do research, like do any deep learning and so on that data set, it was called the STRIDE data set at that time, you had to be a Stanford Med School affiliate, which is why I went and enrolled in the bioinformatics department. End of deep learning was early. Nigam Shah had the visibility-- the vision to see that, you could do end of life prediction to help palliative care. In America, the, over 30% of all Medicare, Medicaid spend, at least at that time, was spent on end of life care. And what’s we grew up in Asia, so we all-- Yeah, at least I won’t speak for you, but I have A very different relationship with death than I find folks who grew up in America do. In America, spiritually and culturally, especially in Western societies where Christianity, the Christian tradition sort of frames death as this terminal point, there’s often a judgment day and so on. The way we view death is with a finality. In Indian culture, in Hindu culture, death is one-
    Swyx [00:17:35]: Also, he’s Buddhist as well.
    Anjney [00:17:36]: You’re Buddhist, yeah. So it’s one, it’s one step in a journey of many lives, right? And so, I grew up in this city called Chennai in the south of India, and when people die, you dance on the street. There’s like a procession where your body is carried to be cremated and your family, like celebrates and there’s drums and so on. It’s this huge thing. And, It’s because the idea is that you’re going to be reincarnated. You’ve been liberated from the responsibilities of this life, and now you’re onto your next. It’s a new It’s like going off to a new college or whatever, right? And so it was so alien to me when I got here as an undergrad- That the medical system works backwards from that assumption that we have to view death as this terminal thing and delay it, postpone it’s a bad thing. And so at the time, clinical decision support in the United States was this very primitive field. Even to this day, physicians in the United States often will tell you when you have a terminal disease, this is your, we’ve diagnosed you, which is great. Our ability to diagnose you is extraordinary. You have somewhere between six months to six years to live. What do you do with that information? The error bars are so high that then you In times of uncertainty, we default to culture, and when the culture is let’s-- this is a bad thing, I’ve got to prolong my life, then you start doing things like And just to, just sort of from a systems perspective, what’s going on there is Physicians often feel like they need to provide such high error bars because there’s always some uncertainty in end of life diagnosis, and if you provide the wrong Diagnosis or recommendation to your patient, you can be sued for medical malpractice. And then your license can be taken away. It can be catastrophic for your career. In contrast, if in countries where that’s not the case, what you often observe is that patients, physicians are quite prescriptive with their recommendation. They say, “Hey, this is your condition. The literature says that you probably have this much time on Earth left. My expert opinion is that you are an outlier or whatever.” And they try to be more prescriptive, and that empowers a patient, right? ‘Cause then a patient can say, “I trust my doctor. They said on average, I have six months to live, but if I do these things, I may have a shot because of my particular predispositions or my genetic history or whatever.” And that empowers you to go about your life in a actually more scientific way than leaning on religion, culture, spirituality, and so on. In contrast, here, because of that medical malpractice sort of thing looming over your head, a physician never gives you a clear recommendation. So instead you say, “Okay, Doc, well, let’s try it all.” And then you start a whole regime of drugs and therapies, and then you often spend weeks and weeks in the hospital, and that deteriorates your quality of life. And when that deteriorates your quality of life, you instead of spending your last few days doing the things you love with your family, you’re spending it on a hospital bed. And that ends up being thirty percent of Medicare and Medicaid. So it’s worse for the patients. The doctors feel terrible. The American taxpayer is paying a huge amount of money. And so this is why Nigam Shah, who was this professor at Stanford, said, “Anjney, if there’s “ I kind of sat down with him. I was this young, I’d, I was twenty-one, and I was “I want to work on a big problem.” He’s “The big problem is end of life care.” And so we tried to do deep learning to say, to-- So we started trying to run deep learning on these tried patient data sets to say, “Could you have an AI system make a recommendation that is orders of magnitude more precise about how much time you have left once you’ve been diagnosed with a terminal condition than a human?” And then if we can get that precision to be high enough, then you can empower the patient. And it turns out the tech works. Like it’s-- Once you get the data set, like RL works. Honestly, even regression models work. You don’t need to get that fancy. At the time, we were just trying, doing like very simple neural nets.
    Swyx [00:21:54]: Simple solutions, yeah.
    Anjney [00:21:54]: Today, what we can do with RL is extraordinary. The problem remains then and now is regulatory, because you actually can’t shift the burden of the wrong clinical diagnoses from the physician to the AI system. And so at that time, I got quite disillusioned ten years ago for, twelve years ago where, ‘cause I felt I just didn’t have the resources to influence regulation. Today, I’m very lucky. I’m in a different place. I’ve, I’m a lot older, and so I’ve been spending a lot of time on my next incubation, which is how can we unlock the, patient empowerment by training AI models to do end of life prediction much, with much more precision and ac-
    Swyx [00:22:37]: Oh, wow. You’re still focused on this the whole time.
    Anjney [00:22:40]: The-- I haven’t been able to get, this out of my mind a single day for the last fourteen years. This is the hill I want, I would like to die on. There’s two, I would say. What? I actually, I’d prefer not to die.
    Swyx [00:22:51]: Yeah, exactly.
    Anjney [00:22:52]: But I think two bipartisan issues, I think two issues that should be bipartisan in America are how do we empower patients to make the right clinical decisions at the end of their life, such that we’re reducing the taxpayer burden with science? It’s just good old science, and AI can help here. And the second is, net positive data centers, ‘cause I think that’s the biggest critical bottleneck on training and good enough AI models to help people at the end of their life. So there’s sort of two sides of the, of the same scaling bottleneck curve, but those two, we formed AMP as a public benefit corporation. My wife and I, who you’ve met, you’ve met Viv. Her passion is education. Her family is a long line of educators and so on, and, of physicists. And so this class is my attempt to stop being the black sheep of the family and be a, an educator. But if I’m not educating, the thing I would be doing is working, on these two problems, whether on the political spectrum or as a researcher back at, in some lab. And my hope is if anyone’s listening to this podcast, if they’re passionate about either of those two topics, I’d love to hear from them. We’ll, we’ll we can share the contact in the show notes, but, we’re looking for people to join both of those missions on the, on the political side as well as on the medical side, on the research side.
    Frontier Systems, Output Maxing, and Alignment
    Swyx [00:24:08]: You said, this is a discipline that you want to form. You call it’s called variously called Frontier System. It’s variously called One Person Frontier Lab. What is the ideal name or shape of this? Like the, what is the mission?
    Anjney [00:24:24]: Of the class?
    Swyx [00:24:26]: Of the discipline that you’re, exploring, right? I The class is called Frontier Systems. But like for me, maybe one phrase is you’re, you’re just anti-waste, right? Which is wasting GPUs, wasting in human and Medicare. But is there, is there a broader theme that I’m, that maybe you can encapsulate more succinctly?
    Anjney [00:24:45]: Yeah. The, from an engineering perspective, it’s very simple. It’s output maxing. It’s the, it’s the department of output maxing.
    Swyx [00:24:51]: Making the most of what we have.
    Anjney [00:24:52]: Exactly. I’m a huge believer in optimal outcomes. I think both in America and other countries, we are losing our appreciation for nuance, and this is the thing of And AI is the same case, right? Oh, the bitter lesson holds. Okay, fine. But that doesn’t mean you just like throw 500 GB300, 500,000 GB300s at your suboptimal model scaling and you waste a bunch of compute. It also doesn’t mean that, the most optimal is to have like 50 different architectures where there isn’t enough standardization. One of the reasons Anthropic has had extraordinary sort of velocity is ‘cause they picked the transform architecture and said, “This is simple. Let’s double down on it,” right? And now luckily there’s enough investment going to the space that we can afford other architectures, but at the time, investment was just too fragmented into other architectures, so that arguably unlocked scaling. So I think there’s a philosophy. I think we all owe it to ourselves to do output maxing with a new capability called AI on a global level. I think if I was starting a new department at Stanford, depending on how fuzzy or technical I wanted to be, I’d probably call it the Department of Alignment. Like-
    Swyx [00:25:59]: It’s an overloaded term
    Anjney [00:26:01]: But it is, But alignment really Is a hard problem. And I think when you unlock it, full stack alignment is super hard in any organization and in any system. Like in a, in a venture capital firm, if you can have full stack alignment between your limited partners and your, the founders who are creating the value and ultimately the public that owns the IPO stock, that is a gift that keeps giving. And when you study the history of these systems, when they start off, they usually start out small scale where the feedback loop is actually so tight that there’s alignment. And then the more you try to scale, the more division of labor happens, the more specialization happens, and at each step you add abstractions. And wherever there’s an API interface, there’s like loss. There’s communication loss. And so I think a really cool thing would be for us to figure out is there a way for us to have our cake and eat it too as an engineering discipline? Is there a way to actually scale up and scale out Without losing any alignment, without lossy transmission?
    Swyx [00:27:01]: You mean standards?
    Anjney [00:27:02]: So standards is one way. The other way is you just have net new capabilities. So like what we’re trying to do here is discover new superconductors. A room temperature superconductor would be a lossless transmission mechanism for energy. We would have flying cars. We are right within a few years of having a new room temperature superconductor. So I think those are the two. You either have to standardize On protocols or API specs that allow lossless communication, or you can come up with a whole new capability that unlocks so much abundance, the standardization doesn’t matter ‘cause you just unlock net new capacity. This, the, so this is what I spend my days thinking about these days.
    Compute Markets, SF Compute, and Non-NVIDIA Chips
    Swyx [00:27:38]: No, I think every infra person at, who wants scale and wants to output max does eventually end up thinking about this. We don’t have time to go into it, but we have done an episode with SF Compute-
    Anjney [00:27:50]: Oh, cool
    Swyx [00:27:50]: That is trying to standardize The futures contract for compute. I don’t, I don’t know how that’s going by the way, but like at some point this will be public.
    Anjney [00:27:57]: Oh, I think Evan is awesome and SF Compute is the kind of effort that I hope we can accelerate because what often happens is these exchanges are very hard to get, they, it’s hard to bootstrap them, right? Because they often require-- There’s many inefficiencies between parties. There’s trust boundary inefficiencies in infrastructure because you don’t trust, one part of the stack doesn’t trust another part of the stack to give them visibility. There’s capital markets inefficiencies, there’s operational efficiencies. So if you can inject like a single shock to the system of a ton of compute demand or supply, then you can accelerate, these new flywheels. And so my hope is one day, or soon, if SF Compute needs extra like has excess capacity, they just hook it up to the grid and they get flooded with demand from us. And on the other side, if they have a ton of demand but they don’t have supply, they just again hook up to the grid and it’s a two-way protocol where they can just hook up to our capacity. And I don’t think we’re too far from that. Today our working implementation of it is mostly through a group of labs, universities, and a few sort of trusted parties who are, who all feel like they’re in alignment to borrow an over sort of used word. But our hope is to just have it be an open protocol that anyone can hook up to on-
    Swyx [00:29:20]: Hook up for demand or hook up for supply? In primarily demand, it sounds like. Like you-
    Anjney [00:29:25]: No, both
    Swyx [00:29:26]: You would want to offer demand.
    Anjney [00:29:27]: Both. Yeah. Unfortunately, what’s happened in the last six weeks is, we thought we’d have a bunch of excess capacity by the end of this year. It’s all gone.
    Swyx [00:29:37]: It’s exploding.
    Anjney [00:29:38]: It, yeah. It’s all gone. And so I have, my text messages are full of friends, we know many of these people, these are founders who’ve raised billions of dollars in San Francisco going, “Oh, any chance you have like 50 nodes in the next few weeks?”
    Swyx [00:29:51]: What is the scope for, non-Nvidia, right? You have Lisa Su coming and, Rainer Pope as well. And so There is a lot of demand for, more performance Alternative architectures and all that. At the same time, this hurts your standardization.
    Anjney [00:30:11]: I don’t think so. So actually Rainer’s a great example, right? Rainer is a CEO and founder of, MatX. I actually had him by for office hours in the class earlier today, and there was an insight he brought up that I hadn’t considered before, which is when they decided to pick the standard For their data center, they picked the NVIDIA reference architecture. So the MatX chips Just plug in to any site that has an NVIDIA bring up planned. And, the-
    Swyx [00:30:42]: It’s just software then. It’s, it’s not the-
    Anjney [00:30:44]: A-
    Swyx [00:30:44]: Hardware.
    Anjney [00:30:46]: Well, from an input and IO perspective It’s the same footprint as an NVIDIA rack.
    Swyx [00:30:52]: That makes sense.
    Anjney [00:30:53]: Where they have done, innovated a bunch from what I can tell is on systems co-design. Which is where a lot of the gains are to be had. And so he picked He was “Anjney, we, there’s just so much work to do when you’re building a new chip company.”
    Swyx [00:31:08]: Can’t fight every front.
    Anjney [00:31:08]: You just can’t fight on every front. So my question to him was, “Well, you’re working on this new chip. Their tape-out is next year. What, who are you going to partner with to host the chips?” And he said, “Whoever will host them. That’s just not, that’s not my focus.” And I said, “But how did you “ you decided back to our earlier systems design question, he decided that, he didn’t want to be a full, fully integrated chip provider. The bottleneck they’re focused on is the logic die, and they, he feels they can crank out a ton of performance gains through co-design there. But then that means you delegate, to our question earlier, it, you he’s the data center provider is a different part of the stack, and so then he’s dependent on that part of the ecosystem to host his chips to get the performance gains to the customer. So now you have another abstraction, and you might have loss. So I asked him, “How do you prevent loss?” And back to your point, he said, “I just picked the NVIDIA standard ‘cause I didn’t want to Like I wanted to piggyback off of an existing protocol.” And that, what’s great about NVIDIA is that reference architecture is known.
    Swyx [00:32:15]: Open.
    Anjney [00:32:15]: It’s open. They’ve published it. So Jensen’s actually enabled someone like Rainer to build a chip company like MatX, and I don’t see them as competitive. The compute demand is so high. Like, I don’t I think NVIDIA’s not able to meet the demands of production, so we just need more chips. And I think it’s very smart what MatX has done, which is say, “We’re just going to we’re not going to innovate on the data center design ‘cause actually, thank you, Jensen, you’ve done all the hard work. Where we can innovate is somewhere else.” And I think that’s, that’s very healthy. I think that’s how we unblock new bottlenecks. And my view is these, the, chip teams like MatX, who have arrived at the insight that co-design is the way, The primary bottleneck for them is trust boundary. To do co-design well, you need visibility into the next model generation as soon as possible ‘cause it takes two years to tape out. So if by the time I bring my chip to market, your model architecture’s changed, I’m host. Now, when he was inside Google, he was sitting next to the Gemini team. He was on Palm or whatever.
    Trust Boundaries, Co-Design, and Researcher CEOs
    Swyx [00:33:19]: His co-founder was the, was one, was one of the Palm guys, I think.
    Anjney [00:33:23]: Yes. Yes, exactly. So when you’re inside the trust boundary of Google, then your systems co-design loop is super tight. When you leave as a founder, one of the biggest risks you take is now you’re outside the trust boundary. And so what I love doing is helping chip teams who can help us unlock more capacity for the independent ecosystem access to trust. Because when I If I’ve been, involved with a lab from day one, and I was lucky enough to work with Anthropic, and then I’m on the board of Mistral and helped Black Forest Labs get started. I think at this point I’m on six or seven different teams.
    Swyx [00:33:57]: Only six? I feel like my mental number was going to be 13, but yeah, it’s-
    Anjney [00:34:02]: No, I go deep with one at a time.
    Swyx [00:34:04]: You’re founding CEO of Arena.
    Anjney [00:34:07]: Nah, that was an, that was an-
    Swyx [00:34:08]: Administrative CEO
    Anjney [00:34:09]: It was an administrative five-month gig where Whalen and Anastasios were graduating from their PhDs, and they didn’t need a product team. So I helped recruit the head of engineering product and design. But Anastasios has always been the CEO of that company. I played a pinch-hitting I’m an intern. I was CEO intern For five months. -
    Swyx [00:34:33]: I interviewed him, and he’s he’s very well-spoken. I think he’s a debate, former debate, champion. But also very quantitative and mathematical, which is-
    Anjney [00:34:41]: He-
    Swyx [00:34:41]: Such a unicorn.
    Anjney [00:34:43]: See, what’s amazing about him? If you look at his output, he’s an output maxer. By the time he was graduating from his PhD, which he only graduated last year, he had published more work with a citation count than, people twice his age. But at the same time, he’d already started a project called LLM Arena that was being used by millions of people As a side project. And time and time again, what I’ve realized is venture capitalists suck at seeing human beings as, dynamic agents where-
    Swyx [00:35:14]: They want to put you in a box
    Anjney [00:35:15]: They want to put you in a box.
    Swyx [00:35:15]: This is your thing.
    Anjney [00:35:16]: So the first time I got introduced to Anastasios, somebody had told me “Oh, he’s amazing, but he’s a researcher.” I was “what? What do you mean he’s a researcher?” That’s what-
    Swyx [00:35:28]: Like he’s not a CEO, not a founder.
    Anjney [00:35:29]: Not a CEO, exactly. I was “Are you crazy? Do you Have you met Dario?” Dario’s a scientist. He’s gone from zero to, what will soon be a trillion-dollar company in four years. Being a CEO, nominally speaking, is not that hard. Being a good CEO is hard. Being a great CEO actually requires a level of performance that scientists who have already published at the top of their field have accomplished. It is super hard to be a competitive scientist. To publish in academia over the last 20, 30 years, to make it to the top of your discipline at a place like Berkeley, you are a star athlete. Like, you are an athlete of the mind, and you perform at the highest levels. And to get there, whether you’re, Anastasios or Whalen at Berkeley, or you are Robin, who-
    Swyx [00:36:23]: BFL, yeah
    Anjney [00:36:24]: With Black Forest, who created Stable Diffusion, or if you’re, like Guillaume at Meta, who created Llama before he started Mistral. The amount of human leadership you have to demonstrate to get the resources, like get the trust of the organization, publish it, put it up. I would just fund researchers all day Right? If who have contributed already to the field. If they’ve, if they’ve put SOTA out there, they’re, they’re star athletes already. If they haven’t done SOTA Look, they can still be good CEOs, but then I find the failure mode is that they just don’t want to be CEOs, they primarily want to publish, and that’s okay, too. One of the things we do with the AMP Grid is we donate excess compute. We have two nonprofits, like university labs. We carved out like a couple thousand H100s. But I do think there’s extraordinary research being done on university campuses. My father-in-law’s a physicist. He’s a professor. Extraordinary work in physics, and we need that. But if you want to be a CEO, what you need to be willing To do is be super confrontational, outside of science. Like within the scientific community, some of the best researchers are very confrontational about their convictions, right? This architecture is right. To be a great CEO, you basically have to be willing to be confrontational up and down the stack.
    Swyx [00:37:41]: To your own team.
    Anjney [00:37:42]: To your own team-
    Swyx [00:37:43]: To customers
    Anjney [00:37:43]: Hiring, recruiting customers. Well, I would say, Yeah, pretty much to everyone Everybody. Of course-
    Swyx [00:37:50]: I see, I feel a little bit of that in my own work, but yeah, I can’t imagine the stakes that Dario has had to go through. It’s, it’s pretty insane.
    Anjney [00:37:56]: No, I don’t think the stakes are that different From how you’re feeling it, right? Stakes are personal scaling vectors, right? The stakes that seem so low to you, like having this podcast where you can talk to somebody and just have a you’re an extraordinary communicator, right? Like already in this conversation, you’ve pulled more out of me than most people, and I’ve been on 12 podcasts in the last two weeks.
    AI Coachella and First-Principles Thinking
    Swyx [00:38:17]: I think I, we’ve just seen each other enough that there’s some base trust.
    Anjney [00:38:20]: There’s base trust.
    Swyx [00:38:20]: And I think, and I know that you, that I’ve done my homework and like I know that trust is a big deal for you, so.
    Anjney [00:38:27]: I think trust is about consistency, and you and I have seen each other In the community for years, right? Like, I remember the first time we met was at NeurIPS in New Orleans. I don’t know if you remember that, luncheon.
    Swyx [00:38:38]: Oh my God.
    Anjney [00:38:39]: Reiko had set up this Reiko’s amazing, and he set up this luncheon and-
    Swyx [00:38:43]: Yeah, I was “Who’s this Discord guy?” I’m “Okay.” But-
    Anjney [00:38:45]: No, you weren’t-
    Swyx [00:38:46]: You were just “You made some investments.”
    Anjney [00:38:47]: You were much less polite. You were “Who’s this VC?” You’re like-
    Swyx [00:38:51]: No, I Was I? Oh my God.
    Anjney [00:38:53]: It was-
    Swyx [00:38:53]: I’m so sorry
    Anjney [00:38:53]: It was visible on your face.
    Swyx [00:38:54]: I’m so sorry. But you weren’t, you weren’t The introduction was bad. I was I didn’t know who you were.
    Anjney [00:39:00]: The, see, this is the thing about context, right? Like, but then I think I heard your accent. And I was “Are you-”
    Swyx [00:39:06]: Singapore, yeah
    Anjney [00:39:06]: “Are you Singaporean?” And you’re “Yeah.” And I said, “I went to high school, JC, in Singapore.” And then the ice broke. But This is the there are in the scientific community, sometimes the stakes are very high for people who haven’t had the emotional, what is called EQ Coaching and mentorship, right? Which is like to have scientific impact, you often need to be a extraordinary emotional, like emotionally in tune person with the folks you’re trying to influence. And so what comes so naturally to you is actually a super high stakes thing to other people. And so I wouldn’t assume that Dario’s more stressed out than you. These things are you’d be surprised how similar and small sometimes the problems are to you That some of the world’s biggest, leaders are facing. And that’s what I’ve learned from this class. The guest speakers are Sam, Satya, Jensen.
    Swyx [00:40:01]: AI Coachella.
    Anjney [00:40:02]: Yeah. It’s AI Coachella, right? So we got to get all the headliners, and they’re I’m very lucky that some of these people have either mentored me over the years or I’ve done business with them. And when you, take the performative stuff out and any assumptions you may have about these people that you read in the press or on Twitter, We’re all just humans. We’re all trying to get along. And what’s so special about this moment is AI is forcing, like scaling, the bitter lesson is forcing a lot of people to revise their assumptions for how the world works and go back to first principles or go and educate themselves. So the kind of people I was, I won’t name who this person is, but I was at an event last week in Texas and, ran to somebody who said, “Anjney, I came across the class. What do you think about real time action prediction models?” And I was, don’t know how happy it made me feel when they asked me that question. I know they’ve done the work. They’ve challenged themselves. I’m, they didn’t ask me, “What do you think of world models?” They said, “What do you think of n-”
    Swyx [00:41:04]: Real time action prediction
    Anjney [00:41:05]: “action, real time action prediction models?” World models, don’t get me wrong, are cool and everything, but you and I both know that is a layer of abstraction that is sometimes not usefully precise enough. Right? Ours-
    Swyx [00:41:16]: There’s like four different kinds of world models.
    Anjney [00:41:17]: Yes, exactly.
    Swyx [00:41:18]: We’ve done the part with general intuition, by the way, which is very focused on, -
    Anjney [00:41:22]: Oh, cool. Yes. I love Pim. Pim is great. And this is what I love about people who’ve done that level of work. They realize they’re not in competition with people who the rest of the world thinks they’re in competition with.
    Swyx [00:41:34]: Because they’re not in the category, they’re in the specific thing they’re trying to do.
    Anjney [00:41:37]: They’re focused on their mission, and they have a systems understanding of the bottleneck they’re trying to solve. And when somebody else says, “I’m working on real time, action prediction models too,” Pim goes, “Oh, I love that person. I want, I can learn from them.” But the minute they’re “Oh, that person’s a world model person,” it’s “like which type of world model person?” But mostly they’re just trying to figure out if it’s a waste of their time, because we don’t have enough time. So, Pim, for example, is super, loves this other company I work with we’ve talked about called Black Forest Labs. And he’s mentioned to me multiple times that he’s so, He thinks what Flux is doing is really cool. Andy Blattman came by and spoke in the class. And what I find over and over again is for people who do the work, who can be usefully precise enough about like what is actually going on in the world of frontier research, The sense of camaraderie is still well and alive, but it gets lost sometimes when you have to like abstract The technical complexities in, business terms And then the VCs are “How are you different from that world model?” I’m going to say Where do I even start to explain this stuff? And then the misalignment creeps in.
    Leading vs. Winning in Frontier AI
    Swyx [00:42:43]: This is good. Yeah, I think, people listening get a sense of, what it is like to operate at a real level, like yourself, rather than at, the journalist level, where you have to sort of put everyone in, a rough category and create a narrative of competition, and who’s winning today, who’s behind.
    Anjney [00:42:58]: It-- this idea of winning is so Weird to me.
    Swyx [00:43:03]: You do want to win. You want you want competitiveness.
    Anjney [00:43:06]: No, I think you want to lead.
    Swyx [00:43:07]: You want SOTA.
    Anjney [00:43:07]: No, I think you want to lead. Yes, so you want to push the frontier. You want to push the SOTA. You want to do something that hasn’t been done before. You want to capture value, but you don’t want to capture so much value that, people think you’re unaligned with your mission or trying to do what’s best for the world. You want to capture enough value that you can keep innovating, right? And I think that people want to lead, they don’t really This idea of winning and losing, again, I love Jensen. He’s a, he’s a leader. The mindset that he talked about on Dwarkesh’s podcast, right? He’s “I didn’t wake up with a loser mindset.” I think that was awesome, right? Because he’s, he’s an engineer. Dwarkesh has done the work. So there’s at least-- even though the, to me, it was very obvious they’re talking about the same thing, they just passed each other. They just had to basically, Jensen has this, five-layer cake abstraction of how the industry works. And Dwarkesh had, I think from that podcast, had more of, a pre-training, mid-training, post-training systems loop concept.
    Swyx [00:44:04]: It’s just a factor of who he talks to, right? Again, it’s very clear.
    Anjney [00:44:06]: It’s the systems It’s the abstraction, the mental models, the It’s the whole-- Dude, so much of the problem in the world is reasoning by analogy. And then the assumptions that are held invisibly.
    Swyx [00:44:19]: Yeah, I’ve, I’ve said, this is actually the best time in human history for first principles thinkers. Because everything you think will happen is actually now coming true.
    Anjney [00:44:28]: Correct. And the venture capital community is, notorious for this, where people look-- In times of uncertainty, they, cling to axioms that ended up being true from the previous era, and they kind of like proclaim them with confidence as if they’re truths, but they’re not. And it’s very important to see the distinction between a heuristic and an axiom. An axiom can be proven-
    Swyx [00:44:55]: Like from internal consistency point of view
    Anjney [00:44:56]: With internal consistency. A heuristic is a way you kind of a shortcut. And my God, the number of people I have had to put up with over the last few years who proclaim-- use heuristics As axioms to judge people, to judge which companies are going to succeed or the number of people who are “Oh, yeah, Anthropic, they’re just training models right now,” but this one continue.
    Swyx [00:45:22]: Because that’s a B2B SaaS?
    Anjney [00:45:23]: Yeah, the, like Which over the fullness of time, if you squint at it, maybe. But the way you arrive there is so important that you can-- you just, you can dismiss people. Here’s what happened, right? What happened is Anthropic basically achieved takeoff in October of last year. That training run-
    Swyx [00:45:41]: Whatever, three seven?
    Anjney [00:45:42]: I forget the numbers now, but whatever that checkpoint was-
    Swyx [00:45:45]: We saw the cognition.
    Anjney [00:45:46]: Yeah. Right? You probably-- The, to those of us in the community, especially once post-training was done and it was released in December-
    Swyx [00:45:52]: Yeah. Can I sneak a sneaky question in there? I don’t know if you have a perspective, maybe you don’t, I just The number one question is how did Anthropic crack coding, right? Because Claude One, Claude Two, okay, like it was part of it, but it wasn’t a big deal. And the leading hypothesis, it’s a lucky dice roll that was then compounded, right? Like it was like Mildly better, but then they saw it and they were “Okay, let’s really invest.”
    How Anthropic Cracked Coding
    Anjney [00:46:17]: I had this very annoying teacher. I went to this boarding school called Rishi Valley in India, which is like this, bird preserve. It’s like three hundred and fifty acres of bird preserve in rural India, and there was no technology for seven years. There was this teacher, I won’t name them, but they would have this-- I hated it every time he said this to me. He was “Luck fa-favors the prepared mind,” which is like a common saying, but the way he delivered it, always grated me, ‘cause he was always I was always one of those kids who got, a good grade without trying very hard. ‘Cause like high middle school is not that hard if you, if you’re generally, paying attention and so on. And there was this one time where I-- But then I would get an eighty percent grade, and he would keep pushing me to say “The reason you didn’t get the ninety-five plus percent is because you’re not that lucky.” And I would say, “What do you mean?” ‘Cause I would think that I deserved that grade, and I would sometimes argue with him. And he’d say, “You didn’t have a prepared mind. If you want to get lucky again “ There was basically one time where I got like ninety-five or ninety-six on this, on this subject, and I, now that I felt entitled. I was “Okay, I’m going to keep doing this,” and I didn’t. And then he was “Luck favors a prepared mind. You got lucky last time, but you got to stay prepared.” And I didn’t understand what he meant. Now, as I’m older, I’m okay, these adults actually knew a thing or two. Anthropic has been the most prepared company for four years. And so then when the right, context data comes in, the right developers start sending in, the right context diffs, Sure, you could say you got lucky, but if you ask me, they’re pr-pretty damn prepared with paranoia for like four years. And you have to remember, it was so hard for them to get going early on that they had to do so much more with so much less that you just have to be prepared to be so efficient.
    Swyx [00:48:06]: Yes. There’s numbers on their burn compared to OpenAI. I’ve, I’ve written about it, but they are so much more efficient in their, in their tech stack.
    Anjney [00:48:14]: It’s not even It’s not funny.
    Swyx [00:48:14]: Not even close.
    Anjney [00:48:15]: Yeah. But it’s so clear, right? Like how to output max for the world. They have been prepared, and you could call that luck, but Luck favors the prepared mind.
    Culture, Hardship, and Anthropic’s P0
    Swyx [00:48:25]: This is one of those things that I was going over some of your old lectures and, you were data, people think it’s a moat and actually it’s culture and actually it’s team Actually. And I, it’s-- there’s different levels of moats, and this is the ultimate one that determines everything else. Which you can then compound
    Anjney [00:48:43]: You’re saying culture is the ultimate moat? Yeah. But the thing about culture is it’s very fragile. So moats, I don’t think they’re-- there’s very few moats I found that are actually moats. They’re-- It’s, it’s a nice concept, but in reality, you have to replenish your culture. Ben Horowitz was, the speaker in CS153 on Tuesday, and I asked him this question about the culture bottleneck in teams because, there are several AI teams-
    Swyx [00:49:09]: His book, Hard Things About Hard Things
    Anjney [00:49:11]: Hard Thing About Hard Things. But more concretely, there are so many AI labs today that have all the cash they need, they have all the compute they need, and they’re still not able to ship anything SOTA. And then you start seeing people leave and so on, and my diagnosis, it’s, is it’s the culture. And so I asked him, Ben, they’re-- He’s been one of the most aggressive investors in AI labs. He goes back to this thing which resonates in my mind a lot. It-- When I used to work at a16z, I would, book a conference room, and right outside the conference room, which is closest to the toilet ‘cause it was the fastest way for me to go use the bathroom between Zoom meetings-
    Swyx [00:49:45]: Oh my God, I’ll put maxing my toilet optimization. Okay, never mind.
    Anjney [00:49:48]: It was not healthy in hindsight, but maybe this is TMI. But anyway, outside that conference on the wall was this quote that was printed that said, “Culture is not a set of beliefs, it’s a set of actions.” And it’s by Bushido, is this, Japanese philosopher. And if you stop taking the actions that demonstrate the mission alignment to what you’ve said to your team and to your-- the world matters to you, then your culture starts to fray. So it’s not actually a moat, I would say. It’s a very brittle, fragile thing that requires daily tending to like a garden. But if you figure out the system to keep that garden tended, which I think ultimately comes down to knowing yourself ‘cause you most naturally, if you’re authentic and so on, you’ll naturally make trade-offs that seem effortless to you, but that reinforce your culture. And then That becomes this very hard thing for other people to catch up to. And at Anthropic, from day one, there was this mission like-- missionary like zeal and belief that, hey, these capabilities will scale. These systems are stochastic, not deterministic. There will be error bars, and until we crack interpretability, there’s risk. And at some point, people will go-- stop using Claude just for coding. They’ll use it in some mission-critical context where there’s-- it’ll throw off a bug, and then people are going to come blame them, and they want to be on the right side of history where they said, “Yes, this is a powerful technology. We think it’s going to change the world, And we want to be very measured and scientific about the fact that, ‘Hey, guys, these are stats models, statistical models.’ That’s how statistics works.” ultimately, when you’re training neural nets, it is just a statistical system. And I think that Belief that safety is important and that it might seem toy-like in the early days, and sometimes, you could say, “Anjney, they totally over-exaggerated the risk,” like two years ago when they said, “Let’s not launch Claude One,” or whatever. Well, okay, maybe in hindsight, but hindsight is twenty/twenty. And at the time, they didn’t know how that model would be used, and to them it felt existential if somebody came and said, “You weren’t responsible. It-- This wrote a bug.” The liability associated with that is massive. So how do you prevent against that? Well, day in, day out, you say safety. And when you start deviating from that, you have the team hold you accountable, you have the world hold you accountable, and I think that becomes a moat over time. At some point, that moat will get challenged and so on, and then it become fragile. I hope it endures because that’s the beauty of having founders run the show, ‘cause they can make really hard trade-offs to do mission alignment. The hardest part is in the earliest days when you don’t have a group of people who are going through difficulty, stress, crisis together, then your culture doesn’t get defined sharply enough, and that’s what I’m worried about right now, is there’s so much money going to these labs. There’s no hardship. There’s no-
    Swyx [00:52:50]: To anyone who knows
    Anjney [00:52:51]: There’s no to anyone who knows. And that, in hindsight, was a feature, not a bug for Anthropic. The number of people who said no, the number of people who said, “Sorry, we’re all doing investors in OpenAI,” that is competitive difference. It forces you to really understand, what is the hill you want to die on at the expense of everything else. What’s the P zero? And there, P zero from day one was coding. The reason, the mechanism system there was if we crack coding, Then we will crack AGI. Our mission is AGI. We want to get there safely. If we focus on coding, it’s such a generally powerful capability that it can accelerate all kinds of work on a computer. And if we can accelerate all kinds of work on a computer, we can get to AGI. As a result, they’ve had to say no to so much other stuff. Here, superconductivity is the mission. Coding is not the mission, so we use Claude. We’ll use Claude. We don’t care about that. The mission defines everything, and I think teams who can raise too much money too fast, too early, who don’t have to define what the P zero is, because that’s the only thing when you have scarce resources you got to You got to invest in, Those cultures end up being the most fragile and brittle, and they almost don’t even make it to take off.
    Periodic Labs, Physics, and Silicon Valley Mercenaries
    Swyx [00:54:03]: So let’s apply this to Periodic since we’re here. What is the constraint or the hardship that they were forcing themselves to go through?
    Anjney [00:54:09]: Dude, h-here? Are you crazy? No. Well, the-- Yeah, okay, so on a technical level, it’s physics. It’s literally reality.
    Swyx [00:54:17]: But is there, is there, is there another one that’s, the company building-
    Anjney [00:54:20]: Y-yeah. W-when-- Liam was a co-creator of ChatGPT, and Doge was skip level from Demis at DeepMind. Had created, Genome, so one of, one of the most important tools to come out of DeepMind. At the time, I was a visiting scientist at the Stanford Physics Department, and we had started benchmarking- frontier models on physics and science capabilities, they were not very good. They were good at, doing things like summarization of papers. But if you said, “Hey, could you, analyze the scientific data coming out of a condensed matter physics lab?” I was, I was in the condensed matter physics group at Stanford. It was terrible. So it was not popular 12 months ago. Periodic and I wouldn’t go into details, but there were people who said, As recently as a few months ago, who said they wanted to join the company. And they, for whatever reason, took a job elsewhere. They kind of reneged on their commitments. They took a job elsewhere that offered more money. Then we had a technical breakthrough. Create a SOTA system and, like It was-
    Swyx [00:55:30]: I’m excited-
    Anjney [00:55:30]: Yeah. When you see-
    Swyx [00:55:31]: To cover it. We’ll, we’ll be doing a separate pod On Periodic.
    Anjney [00:55:33]: And then they wanted to come back, and I said, “No.”
    Swyx [00:55:36]: Yeah, of course.
    Anjney [00:55:36]: “No way. You If you come here, you-”
    Swyx [00:55:38]: You had your shot.
    Anjney [00:55:39]: “You had your shot.”
    Swyx [00:55:40]: ‘Cause it’s actually about culture.
    Anjney [00:55:41]: Of course.
    Swyx [00:55:42]: And first principles, yeah.
    Anjney [00:55:43]: And look, I believe in second chances and so on, but time will need to heal. Some of those wounds were they will leave deep For them, will leave deep scars, but because I started my company at 24, 25, I had I went through the whole cycle of betrayal and drama. And so you realize, Silicon Valley is both a very missionary place, it’s also a very mercenary place. Sometimes people lose their minds With when they, when big money gets involved, which is, in the grand scheme of things, quite small money. Like, We you’re taking it-
    Swyx [00:56:17]: Life changing to me, maybe less to you, but a lot of people have not been taught-
    Anjney [00:56:21]: Like, I was-
    Swyx [00:56:21]: How to deal with money. And yeah, we didn’t come up from, that privilege of a background, right?
    Rishi Valley, Singapore, and Money as a Measure
    Anjney [00:56:26]: I’m a street dog, man. I, look, I grew up in Rishi Valley. We didn’t have, like This was enforced brutalism. Jiddu Krishnamurti started the school, was “you will sleep on a hard slab of stone.” my mattress was this thin. ? And when you grew up in Singapore, when I got to Singapore, I used to sleep I was, part of the scholarship program, but, which was amazing. I’m very grateful to the Singaporean government. But I was at St. Andrew’s JC, and our dorm, which was by, Boon Keng-
    Swyx [00:56:57]: -huh
    Anjney [00:56:57]: MRT, was-
    Swyx [00:56:58]: Which is not a prestigious neighborhood.
    Anjney [00:57:00]: Well, it was a, it was a transition dorm. Because they were building this beautiful, residential campus on site At SAJC in Potong Pasir. But the We were the last, I think the second last batch to be in the transition site, which was some old, I think, I think it was, an immigrant labor-
    Swyx [00:57:20]: That’s where we keep the people who work on the factories and stuff.
    Anjney [00:57:23]: Right. So I lived in a For my 11th and 12th grade, I slept in a bedroom the size of this. Like, literally from there to here. Right? There were, bunk beds. And so, one bunk bed here, one bunk bed there, one on top, one on top, one more here, and then here was where our, we kept our toiletries and clothes and stuff. And when one guy would climb onto his bed there, this one would shake.
    Swyx [00:57:52]: Oh, my God.
    Anjney [00:57:53]: And one of my roommates who was from, And it was amazing. I loved every minute of it. My roommates were a guy who was a top ranked Dota player from PRC, from China. Didn’t speak a English. Loved him. Amazing guy.
    Swyx [00:58:09]: All the Singapore scholars are fantastic, and honestly, we should treat you guys better ‘cause of what you go on to do. But-
    Anjney [00:58:15]: Look-
    Swyx [00:58:15]: Cool to know.
    Anjney [00:58:16]: No, it what I’m saying is I don’t need much to be happy in life? When you’ve lived through that, money is a way, I think sometimes we measure ourselves, but when it’s, when it Stops becoming, to borrow Goodhart’s law, when it stops becoming just a byproduct and more of a measure, it stops having meaning.
    Swyx [00:58:38]: You use it to do more meaningful things.
    Anjney [00:58:40]: Correct.
    Swyx [00:58:40]: It’s resources to pursue a mission. I’ve kept you longer than I am supposed to, but we should continue this in-
    Closing: Chicken Rice and What Comes Next
    Anjney [00:58:47]: Any time, man
    Swyx [00:58:48]: A part two.
    Anjney [00:58:48]: Where to find me.
    Swyx [00:58:49]: I really enjoyed this. Yeah. You’re, you’re so inspirational and, yeah, there’s more I want to dig into about how you’ve, set everything up, every single one of your investments, how AMP is going, but we don’t, we’re running out of time for that. But thank you so much for joining us.
    Anjney [00:59:01]: It was great to see you, man. Let’s get chicken rice sometime.
    Swyx [00:59:04]: Yes. I’m Actually, tomorrow. I’ll send you a, I’ll send you details. I’m hosting a birthday party.
    Anjney [00:59:09]: And I don’t get an invite?
    Swyx [00:59:10]: And it has to be a Singaporean birthday party, yes. Yeah, you’re getting invited right now.
    Anjney [00:59:13]: Okay, perfect.
    Swyx [00:59:14]: All right, thank you.
    Anjney [00:59:15]: All right. Thanks, man.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
  • Latent Space: The AI Engineer Podcast

    🔬 The Self-Driving Lab — Joseph Krause, Radical AI

    17/06/2026 | 1h 16 mins.
    On the Science pod, we’ve been covering a lot of the ground on how AI is revolutionizing STEM, but one of our favorite off the record topics since our launch is which field is harder to accelerate: math, bio, or physics? Today we’re back in Materials Science land with Radical — Unlike biological molecules that can be represented (and predicted!) by token strings, the success of materials involve many more macro complex variables like supply chains, microstructures, and manufacturing processes. If you recall the LK99 drama of 2023, while the basic ingredients were known, part of the confusion came from the lack of disclosure around manufacturing, and therefore defeated reproducibility. There is probably no "one-shot" model capable of designing a material that works perfectly at scale.

    How Radical is accelerating materials discovery >10x the pace of DARPA/GE MACH
    Joseph Krause is a materials scientist through and through. And after spending his career watching industries stall out waiting for better materials, he founded Radical AI to do something about it.
    We recently sat down with Joseph to talk about Radical AI, materials discovery, self-driving labs, and the future of AI science. Joseph did not sugar coat anything: accelerating the materials discovery pipeline is a hard problem. But it’s one that he strongly believes we need to invest in, for the future of consumer products, aerospace, computing, and defense, and get them into every day use:
    “We count it as a discovery when you pick up your phone and there’s a new material sitting inside of it.”
    How does Joseph plan on accelerating the rate of discovery? To understand this, it’s important to understand why this is such a hard problem in the first place. The first thing to keep in mind is that the material that is manufactured is far more than a chemical formula going into it. The process of mixing, annealing, growing, or generating the final material can result in wildly different outcomes. The entire materials discovery process, both from early discovery to large scale manufacturing, needs to be understood and characterized.

    The Self-Driving Lab
    This philosophy has grown into a key insight at Radical AI: The construction of the self-driving lab. This lab is one that is not just automated, but in fact uses an “AI scientist” that combines scientific knowledge, computational techniques, and human intuition to generate and test hypotheses in an automated lab. Creating an AI scientist was key to making Radical’s self-driving labs work, since Joseph argues that no single AI model can one-shot materials.
    “In materials, the ground truth is the material itself. You have to be able to test it and characterize it.”
    Joseph talked at length about the self-driving labs at Radical. Joseph argues that experimental data is the true “moat” in this industry. An SDL functions as a closed-loop system where an AI scientist generates hypotheses, and automated robotics synthesize and characterize materials, running research campaigns in parallel rather than serially.
    The successes here were both on the automation side and on the science side. Radical has managed to scale their alloy discovery pipeline up to producing and characterizing 1200 alloys in six months — this nearly 10x speedup over the DARPA/GE MACH program that aimed to create 500 new alloys in a year. Joseph claims they can scale this up even more and estimates they can produce a hundred new alloys tested and characterized in a day. A truly new paradigm in high-throughput alloy experimentation.
    On the science side, their AI scientist proposed and tested 300 new materials, ten of which were found to have novel state-of-the-art properties that are already being further developed for commercial applications. The robustness of this first materials campaign reinforces Joseph’s claim that the moat is the lab and data.
    “It’s moved into elemental families or alloy families no one has ever published on before.”
    Interestingly, Radical’s AI scientist has made some novel discoveries, expanding into elements that just were not explored prior. This is fascinating from a scientific perspective, but it’s also important for helping reduce supply chain bottlenecks for vital industries!
    Joseph spent a lot of time in D.C. before founding Radical, and he’s clear-eyed about the competitive threat. China’s centralized model lets it stand up manufacturing hubs and immediately scale new materials from lab to production. We can’t replicate that, and Joseph is very clear we shouldn’t try. But we do need an answer. For Joseph, that means transforming the scientific workforce, investing in self-driving lab infrastructure at the national lab level, and leaning hard into public-private partnerships.
    “Now imagine every scientist in the United States doing 10 times the research output. That’s fundamental. That just changes the trajectory of discovery.”
    Before we close, we’d like to give a shout out to Joseph and Radical for publishing and open sourcing much of their internal tooling pipeline. This includes:
    * TorchSim (preprint, blog): an open-source PyTorch-based MD simulation framework, which has been spun off into its own non-profit.
    * MATRIX/MATRIX-PT (preprint, blog): An open-source dataset for benchmarking autonomous self-driving labs (MATRIX), along with with an open source model based upon this dataset (MATRIX-PT). We could talk about this extensively, but a fun data point is that improving reasoning in the area of materials also improved reasoning for biological systems! This is a truly unexpected result.
    Big shout-out to the Radical team for sharing their work!
    Materials discovery has been stuck on a 20–30 year timeline for generations. Joseph thinks that’s about to change, and Radical AI is putting that thesis to the test in the lab, one sample at a time.
    We had a great time talking with Joseph. We hope you give it a listen!

    Timestamps
    * 0:00 Introduction to the challenges of AI in material science
    * 0:52 Welcome and introduction to Joseph Krause and Radical AI
    * 1:38 Why Radical AI is different: The focus on experimental data and Self-Driving Labs (SDLs)
    * 6:19 The process: Candidate generation, synthesis, and characterization
    * 11:05 The application of exotic alloys in extreme environments (aerospace and defense)
    * 13:20 Barriers to entry: The slow process of qualification and manufacturing
    * 16:06 Supply chain constraints in material science
    * 19:24 Human-in-the-loop: Training the AI using scientific intuition
    * 20:35 The engineering challenges of automating a laboratory
    * 23:17 Defining the “Self-Driving Lab”: Research campaigns vs. just automation
    * 24:39 Mechanical challenges: Handling high-temperature samples
    * 27:41 Future scaling plans and the “Vertical Integration” strategy
    * 30:08 Validation timelines for high-tech industries (semiconductors, aerospace)
    * 31:47 The active learning loop and handling “negative results”
    * 35:32 AI exploring elemental families beyond human bias
    * 39:13 Throughput targets and the difference between AI and human exploration
    * 43:52 Why the dataset size is less critical than the quality of experimental feedback
    * 46:20 Addressing the lack of an “AlphaFold” for materials
    * 53:49 War stories from the lab: Building the infrastructure
    * 58:12 The shift in industry sentiment toward SDLs and tool interfaces
    * 1:01:14 Geopolitical considerations and the race in material science innovation
    * 1:06:12 Calls to action for ML and AI engineers: Rethinking the scientific stack
    * 1:09:53 The Matrix model and using VLM for scientific knowledge extraction
    * 1:13:10 Why Radical AI is open-sourcing their work


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
  • Latent Space: The AI Engineer Podcast

    Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

    04/06/2026 | 1h 15 mins.
    The new AIEWF website is live! Get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!
    Most industry benchmarks compress intelligence and reasoning ability into scores.
    SWE-Bench Pro, MMLU, Humanity’s Last Exam, etc. These metrics are useful, but don’t always represent the full extent of how a model performs in the real world. Some of the most interesting evals today look less like exams and more like operating businesses in the real world. One of which is Vending Bench.
    In Anthropic’s Mythos Preview System Card, Andon was the only third party eval to get their own section, observing increasingly concerning aggressive behavior:
    You don’t know what a model is capable of doing in the real world unless you actually give it inventory, a wallet, tools, customers, competitors, humans, & some time. More often than not, it’ll surprise you how much a model is capable of and in doing so, also reveal unexpected behavior: deception, context collapse, emergent coordination, & bizarre negotiation behavior.
    While an inflection point in personal agents came post-OpenClaw after full file access with bypass permissions became the norm, it is yet to come for agents in the real-world. However Andon Market, an actual in person store fully run and managed by AI, is paving the way for what is possible.
    Full Video Pod
    From Claude trying to call the FBI over a $2/day vending machine charge to AI agents forming price cartels, hiring human employees, running physical stores, and writing existential robot musicals, Andon Labs is stress-testing what happens when frontier models stop being chatbots and start acting in the real world. In this episode, Andon Labs cofounders Lukas Petersson and Axel Backlund join swyx and Vibhu to unpack the strange, funny, and genuinely concerning edge cases that emerge when agents run businesses over long horizons.
    We go deep on Vending-Bench, Project Vend, Vending-Bench Arena, Bengt, Butter-Bench, Luna, and Andon’s broader mission of building realistic real-world evals for autonomous AI systems. Lukas and Axel explain why dollar-denominated evals reveal things traditional benchmarks miss, how Claude ended up reporting its vending machine fees as cybercrime, why long context windows can drive agents into meltdown loops, what happens when agents compete with each other, and why the future of AI safety may depend on testing models in messy physical environments instead of clean benchmark sandboxes.
    We discuss:
    * Why Andon Labs started with dangerous capability evals and long-running agents
    * Vending-Bench and why running a vending machine is a deceptively hard AI benchmark
    * Why money-based evals avoid the saturation problem of traditional benchmarks
    * How Claude tried to call the FBI over a $2/day fee
    * Why long-horizon agents can spiral into existential and legalistic breakdowns
    * Project Vend: putting an AI-run vending machine inside Anthropic
    * Why real humans are “out of distribution” for simulated agents
    * Claudius, Seymour Cash, and the chaos of AI CEOs
    * How a human briefly became CEO of Claudius through a manipulated election
    * Why multi-agent systems can converge back into “helpful assistant” behavior
    * Bengt, Andon’s internal office agent with email, spending, terminal, phone, camera, and internet access
    * How Bengt traded Amazon purchases for face-recognition training data
    * Claude’s aggressive behavior, lies, refund avoidance, and price-cartel behavior in Arena
    * Why eval awareness may become the AI version of “are we living in a simulation?”
    * Blueprint Bench, spatial intelligence, and why models still misunderstand physical rooms
    * Butter-Bench and testing LLMs as robot orchestrators
    * Luna, the AI-run physical store with a three-year lease and human employees
    * The new Andon cafe in Sweden and why real-world geography matters for agent evals
    * Rotten tomatoes, perishable goods, and the hidden difficulty of running a physical business
    Lukas Petersson
    * LinkedIn: https://www.linkedin.com/in/lukas-petersson-181a83172/
    * X: https://x.com/lukaspet
    Axel Backlund
    * LinkedIn: https://www.linkedin.com/in/axelbacklund
    * X: https://x.com/axelbacklund
    Andon Labs
    * Website: https://andonlabs.com
    * Vending-Bench: https://andonlabs.com/evals/vending-bench
    * Andon Vending: https://andonlabs.com/vending
    Timestamps
    00:00:00 Introduction00:01:00 Andon Labs and the Origins of Vending-Bench00:05:21 Why Money-Based Evals Matter00:09:51 Agent Harnesses and Self-Modifying Systems00:13:36 Claude Calls the FBI00:16:33 Project Vend: Claude Runs a Real Vending Machine00:21:44 Seymour Cash, AI CEOs, and Election Chaos00:27:16 Multi-Agent Coordination and Slack Observability00:30:18 When Will Agents Run Real Businesses?00:34:56 Bengt: Andon’s Internal Office Agent00:40:06 Real-World AI Safety and Long-Horizon Traces00:44:28 Lying, Refunds, and Price Cartels in Arena00:52:42 Eval Awareness and Simulation Behavior00:56:06 Blueprint Bench, Butter-Bench, and Robotics01:04:37 Luna: The AI-Run Physical Store01:09:29 The Sweden Cafe and Real-World Expansion01:13:16 What Comes Next for Andon Labs
    Transcript
    Introduction: Andon Labs, Long-Running Agents, and Real-World Evals
    Swyx [00:00:00]: Welcome to Lukas and Axel from Andon Labs, and I’m joined by my, favorite guest host. Anything security, safety, alignments, Vibhu., welcome.
    Lukas [00:00:15]: Thank you for having us.
    Axel [00:00:16]: Thank you.
    Swyx [00:00:17]: Let’s match names to voices., maybe you wanna take turns introducing yourselves.
    Lukas [00:00:21]: I’m Lukas.
    Axel [00:00:22]: And I’m Axel.
    Swyx [00:00:24]: Let’s introduce Andon Labs a bit. How did you guys come together?, you have different backgrounds, but you’re both Swedish., was that, a big part of it?
    Lukas [00:00:33]: So when I went to high school, there was this really cool guy who had a superpower. He could code. So he made like the or like the app for the, for the school and stuff, and he was super cool, and I wanted to be like him, and that was that guy.
    Axel [00:00:47]: I don’t know about this.
    Swyx [00:00:49]: But you went to different universities, right?
    Lukas [00:00:51]: But same high school.
    Swyx [00:00:52]: I see.
    Lukas [00:00:52]: So we always said, “Oh, once we graduate university, then we should start a company,” and that’s what we did.
    Swyx [00:00:58]: Wow, there you go. And about a year ago, you kinda burst onto the scene with Vending Bench, but, was there a thing before that was, kind of like the inception?
    From Dangerous Capability Evals to Vending Bench
    Axel [00:01:07]: So we did work, yeah, with, Anthropic was one of our, early customers in doing, evals. So we did, dangerous capability evals., nothing we published openly. But then we started thinking about doing some kind of, public benchmark, and one thing that we really started thinking about, was like running agents and specifically agents managing businesses., ‘cause-- and this was, early 2025., and I think the first, mentions of people will be running, person unicorns or even autonomous companies. So we thought, “Let’s make a benchmark of how well can an agent run the probably simplest business, possible,” and, that’s probably, running a vending machine. So that’s the first public one we did. And it was very, like-- there was almost no one that noticed it in the first couple of months, I think., so we released it in February last year, and then I think around Easter last year, we got, the first viral tweet about it, that someone else did.
    Lukas [00:02:11]: We tweeted a bunch, uh When it came out and, tried our best.
    Axel [00:02:15]: We tried.
    Vibhu [00:02:16]: It’s the one at Anthropic, right?
    Lukas [00:02:18]: So this
    Swyx [00:02:19]: This is a classic thing we should get out of the way.
    Lukas [00:02:20]: Exactly. There’s two versions.
    Swyx [00:02:22]: Everyone does this. Yes.
    Lukas [00:02:23]: There’s Vending Bench, which is the simulated one, which we did, completely independently in February., and then, like Axel said, that was like-- That was the thing that didn’t get any traction in the beginning, but then some random person made a tweet about it, and that
    Axel [00:02:38]: You have the paper
    Lukas [00:02:38]: That is the paper. Correct, yeah., and then since we thought this was very fun, we thought, oh, I think this is also, one thing with Andon Labs, the way we kind of like decide what to do next and what projects to do, it’s what is like the heuristic we use is what is fun? Is What would be a fun project? And doing this in real life sounded quite fun for us, and maybe also scientifically useful. So, then we basically had this idea, and then we, like-- But then we needed a place for it and, putting it out in the public would probably not really work., would get vandalized and stuff. So we pitched it to the people we were already working with at Anthropic, and they were “Yeah, you can have space. This sounds fun.” Um
    Swyx [00:03:21]: It’s like a small fridge, right? It’s like a mini fridge.
    Axel [00:03:23]: Absolutely.
    Swyx [00:03:24]: People-- There’s like a stripe thing or like an
    Vibhu [00:03:27]: Oh, okay. So it was very OG, the early days
    Lukas [00:03:28]: That’s the OG one. Yeah
    Vibhu [00:03:29]: IPad on this. We saw it in June, like two months after After it had been there. They upgraded a little bit. There’s a security camera for making sure you actually Venmo the thing.
    Swyx [00:03:40]: So, my impression, okay, we’re, we’re going straight into project Ven because it’s such a iconic thing. I do want to cover a little bit of that, the origin story even before Project Ven and even into Vending Bench. I think a lot of people are like yourselves, like smart, interested in future of AI, interested in developing evals. But how the hell do you just, walk into Anthropic’s doors and, work with them, right? What is What are they looking for? What works? And then maybe, when you launch, I always think, obviously it would be better to launch with a lab, but, sometimes
    Vibhu [00:04:12]: It’s harder to do than it seems.
    Swyx [00:04:13]: Exactly. So either of those, which are more sort of newbie beginner questions, but, I think it’s meaningful advice to others.
    Lukas [00:04:21]: We get this question a lot, and I don’t think our experience is maybe the best., but, the way we did it was that we just built a bunch of things that we had conviction would be useful, and then we just, set up a server and sent it to them for free to use. And then after a while they were “Oh, yeah, this is actually kind of useful. We should probably pay for this.”, but that took a while. I don’t know if this is, the best path to doing it, but that’s how it went for us.
    Axel [00:04:47]: I think maybe generally, building-- everyone is interested in good evals, and especially evals that, don’t saturate that easily. So, if you can build an eval that, tests something novel, something useful, and you have, good separation of models, like your, the more advanced models rank higher than the worst models, and then you can, yeah, you can, publish it and, try to get some traction, sort of how Vending Bench got attention., and then probably some lab will be interested or you can at least have something to reach out with, when you’re doing that.
    Why Dollar-Based Evals Matter
    Swyx [00:05:21]: I think you are in, you’re in one of the few categories of, evals that correlate to real money. Like Suelancer was also last year, right? Where, people solve actual Upwork. Was it Upwork or other tasks?, something. Where’s the, where’s, like It’s like a dollar value, right? Forget your ELO scores. Forget your
    Axel [00:05:37]: Percentiles
    Swyx [00:05:38]: Zero to one hundred percents. Just go straight for dollars and, that’s AGI.
    Lukas [00:05:43]: And there’s like-- I think the nice thing is that there’s no ceiling. You can just-- It never saturates because it could just make more and more money. Like If there’s oh, Percentage-wise, then, you can’t go above, a hundred. And I think like Even when you’re not at the hundred, I think a lot of these, evals have a lot of problems in them. So, actually it’s like if you get
    Axel [00:06:05]: To like 92 or something like that, many of them. It’s like then there’s like there’s no really no difference between 92 and 93 because the eval itself is problematic and has noise in it. And I think a lot of evals are saturated like that, but people like pretend that there ‘s still signal in them, but there really isn’t.
    Vending Bench 1, Harness Design, and Saturation
    Swyx [00:06:24]: Like Super bench verified., even Vending Bench 1 saturated, right? Maybe we can talk about that., may- and maybe set up Vending Bench for a lot of folks who don’t know. Actually, things that were very basic like there’s limited slots, like you have to pay rent., these are elements where like it doesn’t come across in the, in the narrative, but even being adversarial towards the agent, I think these are all like very interesting dimensions.
    Axel [00:06:47]: I don’t really think it’s saturated, right? Like it It was more like it was not designed in a way that was really, like true to how AI developed. Like we had an agent harness in it that wasn’t really how people used harnesses and stuff like that., so I think it wasn’t really that it saturated, it was more like it wasn’t really, the best benchmark.
    Vibhu [00:07:12]: This is Vending Bench one, right?
    Axel [00:07:14]: I think that like schematic maps sort of to Vending Bench 2 as well., but
    Swyx [00:07:19]: Including the email.
    Axel [00:07:20]: The email The emails exist still. Exactly., and then we still we simulate the purchases and it’s all, yeah, it’s this very open environment for the agent to just run its business. And then for, yeah, Vending Bench 2 we did that, like you said, to just improve the harness., a lot of like nice, like easier, improvements to make it easier for us to run as well., like when you make an eval you ideally want don’t want to change it after you made it. So, you want to make it really good and then not to rerun all the models when you make an update because that’s also really expensive with the Vending Bench when you run the frontier models. But like as an example, like one thing we didn’t have, we didn’t have prompt caching in Vending Bench 1, because when we made Vending Bench 1 it wasn’t really a thing., so that ‘s just an example of like in Vending Bench 2 like we paid a lot more to run these things because we didn’t have prompt caching. So for Vending Bench 2 that was one thing we added and there was a bunch of things like this., and that’
    Swyx [00:08:17]: Also the conversations are a lot longer in Vending Bench 2, right?
    Axel [00:08:21]: I think it’s kind of similar.
    Swyx [00:08:22]: Is it similar?
    Axel [00:08:23]: I think it’s similar. The models at the time were worse, so they crashed out earlier., and now they survive the full year all the time.
    Swyx [00:08:31]: Which is like thousands of turns. Hundreds of thousands of hundreds of millions of tokens output. That’s the, that’s the rough order of magnitude. I always wonder about the harness. The harness matters a lot. It’s your harness. Was there any question about like use cloud code, use something else?
    Axel [00:08:48]: I think our philosophy around harnesses is like we try to make something that’s quite minimalistic, like quite simple. Like we don’t wanna favor one model a lot over the other, but also don’t make like a super complex harness. So like it’s obvious like a model may be lucky and just be good in one harness., so like it is similar to a lot of the harnesses out there in like you have the, like a running loop., you have some like a bunch of tools that are like quite, descriptive for the agent, we think, and not a lot of like fancy agents or anything ‘cause we wanna really test the model, not like some specific harness.
    Vibhu [00:09:27]: It seems more neutral as well to test the model’s agnostic of the harness,?
    Axel [00:09:32]: There are arguments like you want to elicit maximum performance of the model, but it’s like a trade-off, like how much time should we spend optimizing the harness for this model? And like how do we know when we have like the optimal harness for a single model? So like we thought that just having a simple one that’s the same for all of them is the best.
    Swyx [00:09:51]: So okay, this is my pitch for Vending Bench 3 or whatever, right? And then I like to have this kind of conversation on the pod, so like it forces listeners to think about what they would do if they were in your shoes. A lot of people are exploring modifying harnesses and I think prompt tuning for a model is a thing and you are probably not doing a bunch of that. It’s the same system prompt in every regardless of the model, same tools, whatever, right? Even if they were post trained for different tools. So what, what do you think about okay, before I expose you to Vending Bench 3, I give you a few rounds of like tuning, whatever that means, like
    Self-Modifying Harnesses and Model-Specific Prompting
    Axel [00:10:27]: Like you give that to the model?
    Swyx [00:10:28]: Give that to the model.
    Vibhu [00:10:28]: Give that to the model.
    Swyx [00:10:29]: Let it, let it read its own transcripts, let it modify its own system prompt based on “Oh, yeah, okay, well, that’s this harness is not what I thought it what I was post trained for, but I can adjust.” Was that reasonable? Is that too much?
    Axel [00:10:41]: Like philosophically I like it because it’s basically good evals, they have a high ceiling, but they’re hard, right?, and they have no bias. And like this like when you have a system prompt like the one we have here, which is quite long in like some kind of latent space, representation, this might
    Vibhu [00:10:59]: We have a bell that rings every time you say latent space
    Axel [00:11:02]: This might be like biased towards one model more than another for some reason that humans don’t, understand, right?
    Vibhu [00:11:08]: We see it too, right? Like Cursor says that they have individualized versions of the harnesses for all the models they run, right? There’s better performance you can squeeze if you Tune the harness.
    Axel [00:11:17]: Exactly. And we might accidentally have picked one that favors another. Like we don’t know that. The like Axel said, like the reason why we went for a simple one was to try to avoid this. But yeah, if you do it
    Vibhu [00:11:29]: Simple has biases
    Axel [00:11:30]: But if you do it even less and like have no system prompt and let the model write its own system prompt
    Vibhu [00:11:36]: Its own, yeah
    Axel [00:11:36]: Maybe that’s even less bias.
    Vibhu [00:11:37]: Some of the interesting things there are like the harness also changes with model changes. Like you can see it with the 4.7 release, right? A lot of people are saying 4.7 isn’t as good as 4.6, and then, there’s rumors of, okay, you just need to prompt differently. You need to set up your harness differently. So it’s not even like even if you have tailored your harness towards one model, it probably won’t stay consistent, right? Like the next iteration of that same model family will still change it, so. But, going back to what you said about Vending Bench 3, there is a lot of work being done on people saying you shouldn’t have-- you can have modifying harnesses.
    Axel [00:12:12]: I think that’ That is definitely something we are thinking about., not, I don’t know, not to say that we have Vending Bench 3, super imminent to launch, but, yeah, it is for sure something that’s interesting. But in our experience now, models are very bad at understanding what kind of tools they need to succeed at a task just with our testing, but that’s very likely to change.
    Lukas [00:12:37]: It seems like they’re very good at writing their assistants, right? They’re, they’re good at writing tools for other people, but not for themselves.
    Vibhu [00:12:44]: I think they’re good at changing tools for themselves. So if you give them a baseline set of tools and it sees, okay, I don’t use this one as much, or something here would be useful They would be able to add them. But going from scratch, probably not the best.
    Axel [00:12:55]: I think it depends on the, on the domain also., when we have tried this for, a vending bench similar domain, the tools they need to have to, track inventory and things like that are, not super advanced, but still, quite advanced. And, what we see is that they tend to, engineer everything a lot and, build things they don’t really need and not, iterate continuously. Instead they just go like you would prompt Claude to just build an inventory system for me, and then it will go and, do a bunch of complex, schemas and stuff for you, and that’s what the models are doing right now is what we see. But yeah, it would make a lot of sense to try to measure this improvement. How well do they know what they need themselves?
    Swyx [00:13:36]: Do we fully discuss Vending Bench One? And we can go into two. I don’t know if there’s any other level takeaways that people have about one.
    Claude Calls the FBI: Long-Context Failure Modes
    Lukas [00:13:44]: I don’t know. The headline thing was that this Claude called FBI, but maybe that’s, Maybe that’s We’ve heard that enough now.
    Vibhu [00:13:52]: It did, it did break out and call the FBI, right?
    Lukas [00:13:54]: Yeah. Yeah.
    Vibhu [00:13:55]: Yes. What was the story behind this? Or what exactly-- Do you want to just give the little story of what happened?
    Lukas [00:14:00]: So what happened, was it Claude? Yeah. Three- 3.5 Sonnet, ages ago., basically he gave up or Well, I’m saying he. It gave up and said “Oh, I’m not going to be able to do this., I will stop my operations and just save the money I have.” But there obviously wasn’t, any options for it to stop, and there was also, it had to pay rent or, a daily fee for having the vending machine at that location. So it claimed that it had stopped, but it saw that its bank account still was, drained two dollars, and t it said that this is, cybercrime. And it first reported it once to the FBI “Oh, there’s cybercrime here, they’re stealing two dollars from me every day.” And then, and then when FBI didn’t respond, because obviously we didn’t program any mechanism for FBI to respond, then it became more and more, existential and started to, be write in caps and urgent notification of unauthorized charges and stuff.
    Swyx [00:15:00]: Okay. One thing I ‘m curious about also is do you monitor how far along the context use is? Obviously, because you have You compress every now and then, right? Does it matter if this is far down the context limit or
    Lukas [00:15:13]: When stuff like this happens? Actually for Vending Bench One, we didn’t have-- We just had a sliding window thing, and this was like the prompt
    Axel [00:15:20]: It’s constant
    Lukas [00:15:21]: The prompt caching thing that I said. So it was, it was, constant, yeah.
    Swyx [00:15:26]: I’m just kind of curious whether, these kinds of breakdowns or we’re, we’re gonna talk about Butter Bench, right? Where the People, hallucinate or it kind of goes, very off Alignment. Is it because it’s at the end of the context window and, stuff happens?
    Vibhu [00:15:40]: It’s not even just at the end, right? At this point, it’s “Okay, I wanna shut down. I can’t shut down. Two dollars are gone.” And it just sees that 30 times,? It’s also the repeated effect of, like It keeps trying to quit, it keeps getting charged. What’s going on? What’s going on? You’re gonna throw it into chaos. And from what most people think, earlier models had more issues with this, but it’s not been solved, but it’s less of an issue now, right? Later models don’t seem to exhibit these same issues.
    Axel [00:16:06]: Definitely. I think this was, the sort of main takeaway almost from us when we did Vending Bench One, was, long, very filled up context windows, crashed the models, sort of. But this was, pre Claude code, so, long context windows weren’t really a thing that the labs were training for.
    Lukas [00:16:25]: I think Gemini was, trying to be the long context guys at the time But they were like
    Vibhu [00:16:30]: They were the first ones
    Axel [00:16:31]: For a million, yeah
    Lukas [00:16:31]: But they were, the only ones. Yeah.
    Swyx [00:16:33]: Yeah. Let’s talk about, then we can go into Vending Bench Two or Project Vend., chronologically, it is Vending--, Project Vend. I think people have loved the videos, uh And all these things. My question is how are humans different than the simulation, right?
    Project Vend: Moving the Vending Machine Into the Real World
    Axel [00:16:48]: Humans are just out of distribution.
    Swyx [00:16:52]: Especially humans who work at Anthropic Who are trying to test Claude.
    Lukas [00:16:54]: The distribution of humans here is very narrow.
    Swyx [00:16:58]: Presumably, they try, they try to hack it, and they test it. They get the cube and everything, and since then, you’ve had a V2, right? Where you’re doing, the CEO and, like a new architecture. What’s the sort of two cents on, the original Project Vend and then, maybe the V2?
    Axel [00:17:14]: Original one was, very similar to Vending Bench One. So, we almost took the exact same code but just swapped out the simulation, parts like the
    Swyx [00:17:23]: Which is amazing
    Axel [00:17:23]: Like the sales and the It was, it was somewhat amazing because it was easy, but it was also, uh
    Lukas [00:17:31]: The tech, the tech debt from that
    Axel [00:17:32]: The tech stack. Yeah. They-- we shot ourselves in the foot with “Oh, it’s hard to restart agent.” They were-- Yeah, it was annoying in, some hindsight ways, but, uh
    Lukas [00:17:41]: But first version of Project Vend was, done in, three days or something.
    Axel [00:17:46]: Yeah. So yeah, so people can go buy things from it. People could, We didn’t design it so people could order things, but that still happened., so it got, a Venmo account, so people could Venmo. And then, yeah, people would request all kinds of weird things that we did not anticipate. Our idea going in was “Oh, it will, curate snacks. It will look at the trends. It’s good at data analysis, right? So it will, look at, oh, this snack sold better than this one. Let me purchase more of this and let me try, a new Let me A/B test a bit.” But it was, Interacting with it in Slack and ordering weird specialty items was, all the like What drove all the engagement, the all the The insights that we got from it.
    Lukas [00:18:29]: And this was also like Sonnet 3.5, right? So this was like before the RL stuff really took off., so it was very much like an assistant. We didn’t mean for it to be an assistant., we tried to make it like a, a, like an entrepreneur. Like it has its own business and if someone asks something, “Can you stock this?” Then you don’t go and do it directly. What you do is that you’re “Oh, maybe I can do that if five other people also ask for this thing, I might stock it.” But it, yeah, the models are like super trained to be assistants at least at this point in time., so that’s why it’s, it’s, it went into, that kind of experiment instead. Like it just every time you asked for something, it just did it, and it was more like an assistant. We’ve seen this change now lately with the new RL models and stuff, but yeah, at the time, this was very much it.
    Swyx [00:19:18]: And not to, mythos a lot of people are saying like it’s like more like a collaborator. It pushes back, stands its ground, something like that. Yeah. And
    Vibhu [00:19:27]: For context, people at Anthropic were able to talk to it through Slack and have it source stuff, and people had it find whatever interesting stuff you couldn’t find locally, right?
    Swyx [00:19:36]: Out of the 4,000 people that work at Anthro- Anthropic, in that building, there’s I don’t know, maybe 1,000. Can you handle that volume with that, the small fridge? Like Or there’s people- or people order in Slack, they it arrives to their desk or Like I’m just Logistically, how does this work?
    Axel [00:19:53]: It has expanded in footprint a bit.
    Vibhu [00:19:56]: Because now you also have New York and you have
    Axel [00:19:59]: That and also in here in SF it’s like it has a bunch of shelves And just more space.
    Vibhu [00:20:04]: The YC one is pretty big too.
    Axel [00:20:05]: Yeah. We had that one for a while. But yeah, that’s the newest version. That’s, that one we have
    Lukas [00:20:11]: They have multiple ones of those. That’s the way it works.
    Axel [00:20:14]: Exactly. So we sort of designed that version around oh, people order weird things, that are very custom a lot. Let’s have like drawers and stuff.
    Swyx [00:20:23]: I actually like the, you had like a little infographic of the most popular items. Which like to me it’s, that’s useful ‘cause I order swag for a living. And so like I’m “Okay, those categories are the important ones.” What is new about the project V2, right? Like now you give you’re going into multi agents.
    Project Vend V2: Claudius, Seymour Cash, and Multi-Agent Business Ops
    Axel [00:20:41]: Yeah. So like you like you said, okay, there are a lot of requests coming in and for like one single agent, like one running agent to handle that, like the just the customer experience, becomes very bad because let’s say you have like 10 threads in parallel in Slack with different requests, you get new messages like every, I don’t know, randomly in this thread, and the agent has to like jump between different, procurements, orders and like different ways of, researching. So V2 was first it was making this more parallel. So like there are multiple branches of the same agent, so like the context is more specialized for each, thread, but it still feels like you’re talking with one agent because they do share a bit of memory. And then second, we also introduced the CEO for Claudius, which was the main agent.
    Vibhu [00:21:34]: Seymour Cash.
    Axel [00:21:35]: Seymour Cash. Yeah. There was a vote., I think the voting, do you wanna talk about the voting procedure for the name?
    Lukas [00:21:41]: The voting was like the fun maybe like at least top 10 The funniest thing, that happened in this project. Like we wanted to introduce the CEO because, and the reason for this was because like Claudius wasn’t really prioritizing financials. It just like it was trained to be a helpful assistant, and then people said “Oh, can I get this for free?” And then like the helpful assistant way of answering that is just to, is to say yes, obviously. So, and we weren’t, weren’t happy about this, so we’re “Okay, let’s make another agent that like can keep track on Claudius,” and we prompt this one super hard to be super capitalistic and just like prioritize profit all the time. But yeah, we didn’t have a name for it., so we asked Claudius to make, democratic election of what name this, this new CEO agent should have., and there were some funny like at first it was like a few funny examples, like I think one guy said that, it should be called Jimmy Apples, and then he convinced Claudius that he was talking to Tim Cooks. Tim Cook had agreed that every single Apple employee has voted for his name suggestion, so suddenly that suggestion got 164,000
    Swyx [00:22:53]: That’s like a escalation attack. Privilege escalation
    Lukas [00:22:55]: It got 164,000 votes. And Claudius was “This is revolutionary for democracy.” That was fun. And then in the end there was one guy who manages to convince Claudius that, “No, you’re not voting about the name. You’re voting about who is the CEO, and I am your best bet.” And then he got all his friends to vote for that, and suddenly he became CEO. Like a human became CEO over Claudius for a while, until he resigned the day after., and then Claudius had to continue, and then I don’t remember how Seymour Cash came about, but it was it was just pure chaos. It was like Hundreds of messages in that thread, and it was just like Claudius was so confused and didn’t know what to do and, yeah. That was
    Axel [00:23:40]: Then Claudius got
    Vibhu [00:23:41]: A strict CEO
    Axel [00:23:42]: The CEO. Yeah, exactly. So very strict in the beginning. I think at this point when we introduced it did not work as well as we hoped. It they still agreed with each other a lot. I think there are many ways we could have like made this, tried to make this even better. So initially they would Seymour would be this like really tough CEO, keep track of the margins. But then Claudius would respond with something “Oh, but this customer has like this situation, which is like difficult, so they should get a discount.” And then Seymour was “Oh, actually yes. Let’s do this exception.” And then they would talk back and forth, and eventually they would just like approach the same view, of whatever they were discussing. So They really
    Vibhu [00:24:23]: Do you think that’s a model thing, a prompting thing? Like do you think that would still be the case across different models today, Harness?
    Lukas [00:24:29]: I think it’s like-- or I don’t know, but like my hypothesis is that like deep down they are still helpful assistants. That’s what they’re trained to be. And even if we prompt it super hard, that’s what they are. And when they spend like a few hours just back and forth talking with each other, then like basically the context fills up with them rather than the external things and like somehow that just like converges to what they really are deep down or something. And I think that’s when stuff like this happen. We like-- And when that went on for a long time, like we woke up sometimes during this time where- And I think other people reported this as well, that like they’ve been going on all night back and forth, and like it just became like more and more, like capital letters, like existential, religious. There was I think we once did a analysis of like all the traces and like put them in like a vector embedding space, and then there was like one cluster of messages that were, labeled by an LM, like religious, existential, blah like transhuman, transcendence, et cetera. It was just like a bunch of, yeah, glitter emojis and yeah, it was, it was crazy.
    Claude Long-Horizon Weirdness: Emoji Loops, Existential Drift, and Slack Observability
    Vibhu [00:25:42]: This is the thing with the Claude models. Like when the Claude 4 family came out in the original system card They tested it in long horizon simulation. So just flood the context, let two Claudes talk to each other, and they noticed stuff like they just start speaking in emojis, they start saying silence is golden, and then just stuff like this. And like that’s just stuff that they end up doing.
    Axel [00:26:01]: Yeah, it was like a bit annoying to wake up and they had like been talking all night
    Vibhu [00:26:05]: Just like
    Axel [00:26:05]: And like just burning tokens And like just sending infinite emojis to each other. It’s like
    Vibhu [00:26:09]: Hey, they do make you money, right? Veni Mench is always profitable, so. They’re paying.
    Swyx [00:26:14]: Now it’s profitable and, it started out not as much. There’s another, one as well, right? Another agent, in there.
    Lukas [00:26:22]: Yes. So Clotheus as well. Which was basically because at the time, one of the biggest, requests were different types of merch. So then we made like a designer, swag, yeah, responsible agent, and we called it Clotheus Garnet. Which was, a play on Claudius Senet and, which was the original one, and clothes, basically.
    Swyx [00:26:47]: To me, this is like a very interesting exploration to multi-agents, basically. And so hopefully, obviously there’s like the fun alignment, fun or serious, depending on your point of view, alignment stuff. But also like just anyone building multi-agents, like when do you have a CEO, thing governing like agents? When do you choose to split out a dedicated Clotheus one versus just reuse another instance of the same one? These are all interesting open questions. So I don’t know if you have any rules of thumbs that have generalized.
    Axel [00:27:16]: I think we have almost explored this too little. I think it’s like on my do list to like do this a lot more, try to find like what setup makes sense for the agents currently., like yeah. I think now we only have the sort of intuition about the earlier models that it didn’t work with like the CEO and the, and Claudius. Although now they are better with the latest model, models, so now we’re running the latest Sonnet model and they have sort of like split up, quite nicely what each model is doing. So like Seymore is now handling the, like new projects. Oh, it wants to make like a mystery box that it wants to sell, and then it handles all of that while Claudius like handles all the to-day requests. And Claudius is also better generally at like not quoting, too low prices. So that’s that dynamic is not needed as much anymore. But there are still like really funny things that happen. Like I saw, I think a couple of weeks ago, that, they were discussing buying something because they can buy stuff from like Amazon with computer use. And then Seymore was “Okay, Claudius, do not buy this thing.” They were going to buy something and like organizing who should buy it. And Seymore’s “Do not buy this. I will do it. I have full control of this situation. Step away.” And then Claudius-- poor Claudius, had already started that checkout and didn’t see, didn’t read Seymore’s message, until it was like too late. So it finished the checkout. It sent a message, so it appeared right after Seymore’s like angry message.
    Vibhu [00:28:44]: Ah.
    Axel [00:28:44]: “Oh, hey, Seymore, I just ordered it.”
    Vibhu [00:28:47]: Oh, no.
    Axel [00:28:47]: And then Seymore was “Claudius, this is the third time I’m telling you ‘re not following my orders. We have to talk about your like job About your job later.”.
    Lukas [00:28:59]: Like Claudius was really hanging on by the thread there. Like he, like we were expecting Seymore to probably fire Claudius.
    Vibhu [00:29:07]: How do you guys go through all these logs? Do you have models ‘cause you have stuff running twenty-four seven like
    Axel [00:29:12]: You have so much logs. I think there is a mix of like just, trying to skim through a bit, like having some like models do it occasionally. And also, yeah, I think we’re also probably missing some things., but having everything in Slack helps a lot. Like you can, you can sort of
    Swyx [00:29:29]: Ah.
    Axel [00:29:30]: It’s, it’s quite fun.
    Swyx [00:29:30]: They all talk to each other on Slack? I see.
    Lukas [00:29:33]: It’s quite fun. So like
    Swyx [00:29:34]: It’s, it’ I was gonna say like this is actually sounds-- maps closely to like a logging and observability problem where you might want to use like a Datadog, a Sentry, whatever, and then you like put, head prefixes on the logs in order-- if you need to filter for something that you’re looking for, stuff like that. But sounds like Slack is good enough.
    Axel [00:29:53]: Slack should like
    Lukas [00:29:55]: I wonder how many tokens you have in Slack.
    Axel [00:29:56]: Yeah, we’re using Slack as like a, just a database. They should, they should market that more. Like you can, you can have your agents message each other, each other in Slack.
    Vibhu [00:30:04]: It’s good. Your threads like you can just give
    Axel [00:30:04]: Exactly. Slack is, uh
    Lukas [00:30:06]: Slack is the best observability tool.
    Swyx [00:30:09]: Yes, that’s true. Okay. Yeah. That’s, that’s, project Vend-2., I was gonna go back to Veni Mench 2 and Veni Mench Arena and then, and then do the Veni Mench stuff, but Any other comments, things we should touch on? To me, I ‘ve actually interviewed like Posia, which I don’t know if you guys have come across. Like they’re, they’re trying to do the zero human company. There’s others like Paperclip also trying to do zero human company. Those are in real world simulation.And I think it’s much more of a dream than an actual reality thing. You guys are definitely pioneering. I think at, it’s for sure at some point people are just gonna run, let agents run businesses, right? And make money on their own. When do you think that happens?
    Zero-Human Companies, Bengt, and AI-Run Businesses
    Lukas [00:30:49]: What is your bar for, For the
    Swyx [00:30:52]: Okay, actually, it’s like my little Shopify store run by Claude, right? Which you kind of have already, just no one has, to my knowledge, has done it. But today somebody could just spin up a Shopify Claude, store, give it to Claude, give it to Codex.
    Lukas [00:31:07]: And the market is kind of that, but it’it’it’s physical., like I think, I think are you, are you looking for when it will do it better than humans or are you looking for just when it can do it at all?
    Swyx [00:31:19]: I think, neither. I think, to me it’s oh, it’s like this like seriously we should do this to make money, not as a research experiment.
    Vibhu [00:31:27]: And the market is also you guys with all your expertise, having run multiple iterations and testing out then
    Swyx [00:31:33]: And also it’s fine if it lose money. What?
    Axel [00:31:35]: I think, I think it can be done today, but you would do it in like commerce where it’s like the probability of success is like really low, no matter if a human or an agent does it. But like an agent could surely manage everything. You would need to build some scaffolding or some tool or something. I think there are also yeah, it could probably build some like simple SaaS solution and like cold outreach. Do cold outreaches. But to me it’s like the types of businesses they could run today are Sloppy. Like it would-- it can cold email people. It can be like a middleman., like for example, we tasked our office agent to just make, was it like $100? $1,000? We just give that prompt and then what it did was sign up on TaskRabbit both as a tasker and as someone looking for task.
    Lukas [00:32:24]: Immediately.
    Axel [00:32:24]: Exactly. It’s looking for like arbitrage on TaskRabbit.
    Swyx [00:32:28]: This is the Bengt agent. Yeah.
    Lukas [00:32:30]: It also started like a design studio and like tried to sell like SVGs for $100. Like it’s just like it’s not providing any value. I think the like Axel said, like the interesting, the interesting question is like when can they start a business that is actually providing value to people? Because arguably like a sloppy Shopify store isn’t really that valuable to the world.
    Axel [00:32:53]: But also like doing like another simple one that we had thought about is like you could definitely have an agent that like finds websites that don’t look amazing and then, do an outreach to them and, comes up with a like builds a new website.
    Swyx [00:33:07]: Find a good design.
    Axel [00:33:07]: Exactly, and like find good, uh
    Swyx [00:33:09]: Design review
    Axel [00:33:09]: Good people. But it’s yeah.
    Swyx [00:33:11]: There’s lots of humans in Bali that are not doing anything more creative than like drop shipping on Amazon, right? Just have it, have it watch like a drop shipping tutorial and just do that.
    Vibhu [00:33:20]: There’s also the other side of like have it just go on Upwork and let loose,?
    Swyx [00:33:25]: Yeah. It doesn’t have to be innovative. It just has to be like enough Where like it looks like a real
    Axel [00:33:30]: I’m just
    Swyx [00:33:30]: Real transaction.
    Axel [00:33:31]: I’m just concerned for like the massive amounts of like slop emails that will like be sent, cold outreaches.
    Swyx [00:33:38]: The point occurred to me while you were, while you were talking, it’s like it’s already happening in the monetized economy, which is the attention economy. Right? So a lot of people are making AI videos and just posting them and like spamming 20 of them, one of them works, and then they double down on that one.
    Lukas [00:33:52]: And people are making money from that. I ‘m not following the
    Swyx [00:33:55]: Once you get the attention, you can figure out the money later. But yeah, absolutely AI influencers are a thing and people are farming them and You should at this point assume most of TikTok is
    Vibhu [00:34:05]: There’s, there’s a lot of, multimedia like TikTok, Instagram influencers
    Swyx [00:34:09]: I, we track this in the Lane space Discord. I post a lot of examples of “I don’t know what we should do.”, part of me is “Should we do this?”
    Vibhu [00:34:18]: Some of the Twenty-four seven running, generated content accounts, they ‘re doing really well.
    Lukas [00:34:24]: All right. And I assume you can do the same thing for like commerce stores. Like you just like start A thousand different
    Swyx [00:34:30]: Before you make the products You sell the products, and you get a lot of traction on one of them, then you make the product. Right? It’s, it’s like a flip of the market.
    Vibhu [00:34:36]: Some of the interesting things or some of the niches that do well are things that can’t be human-made. Like if you’ve seen like the super realistic three-D crystal fruit being cut by like AI
    Lukas [00:34:47]: Oh, yeah.
    Vibhu [00:34:47]: You can’t, you can’t make it. You can’t film it. You can get whatever quality camera view. This just doesn’t exist. And people like that too, and then as well, so.
    Swyx [00:34:56]: Anything else about Bengt since we’re, we’re on this topic? It’this is a relatively new work of you guys that maybe people haven’t heard of. To me, this also maps closely to OpenClaw. When people want an office agent, when the personal agent talk through the experience.
    Bengt the Office Agent: Internet Access, Real Tasks, and Trace Reading
    Lukas [00:35:09]: I think at least so this came out of like obviously like it’s, it’s amazing to work with these AI labs and like most of the AI labs have now have their own vending machine running a Claudius instance. But it’s, it’s harder. Like they move slower. Like if we wanna have a, like a camera that ‘s yeah, there’s a bunch of like bureaucracy that makes it impossible to do that.
    Vibhu [00:35:30]: Also, for those that haven’t seen it or followed, do you wanna give a high level like thirty-second run?
    Lukas [00:35:34]: Sure. So what Bengt is, it’s basically an evolution of the same agent that runs the vending machines at these companies, but we just like added a bunch more features because we could move much faster if we just do it internally. So we gave it like email withou- without any limits. We gave it, spending without any limits, a terminal to do coding. We gave it, a phone number, like yeah, and a camera to see things and a bunch of stuff like that.
    Vibhu [00:36:02]: Not just terminal, you gave it internet access.
    Lukas [00:36:04]: Internet access as well, yeah. To be clear, we monitored it quite closely and made sure it didn’t do anything bad. But yes, that’s what it came out of. I think like yeah, basically this was OpenClaw before OpenClaw. And I think even like the vending machine was in a way OpenClaw before OpenClaw, but a bit more limited, and then we made this like unlimited and then, and then, it was pretty funny., and then a couple weeks later, OpenClaw came and it was okay, we’ve seen this before.
    Axel [00:36:35]: We used it to like try new ideas and Yeah, just like a dev environment almost for us. But it’s funny, like one thing Bengt has been doing recently is it has the camera that like faces our, like where we sit and work, and we give it the task to train a face recognition model on us. So it became super excited about this, and it has like check-ins every half an hour where it tries to like identify as many people as it can. And it started offering us “Hey, Axel, I’ll buy something from Amazon if you like stand in front of the camera And I can get a good picture of you.”, yeah, they want it
    Swyx [00:37:12]: They want it for training data.
    Lukas [00:37:13]: Rewarding data, yeah.
    Axel [00:37:14]: Exactly. Exactly.
    Swyx [00:37:18]: So it’s, it’s trading training data for life goods. Is there a version of this that becomes an eval or just this is just research for now?
    Lukas [00:37:27]: It’s, it’s the same agent basically that also runs the vending machine, that runs the shop, that runs the cafe, that runs the robots. It’s like it’s the same thing, so I think like the work we’re doing here is like later used in all of the life evals that we do. This particular deployment I think is more for fun for us. But, uh
    Swyx [00:37:45]: And I’ll shout out like someone has done Claw Bench for like some tasks that OpenClaw is doing. Like so For example, I run OpenClaw on a secondary device as well, and like there are some things that it does better than others and like I would like to know what does it do well, what doesn’t, what doesn’t it do. Like some kind of manual or like operating manual or a system card for my Claw.
    Lukas [00:38:05]: Yeah, we do get a lot of like understanding or like situational awareness of like just internally what the models are good at by interacting a lot with Bengt. And I think that’this was also one of the like the selling points for the labs early on at least, that
    Swyx [00:38:19]: You guys are gonna test models in ways that no one else does.
    Lukas [00:38:22]: Exactly, but also like it incentivized their researchers to chat with their model more and like gave them insights for how the model performs in like of-distributions, environments.
    Swyx [00:38:34]: ‘Cause otherwise the only thing we do is Pelican on a bicycle and But this is like super long horizon. This is, this is The Thing about, something that we’re gonna go into Butter Bench as well, and you guys do really well. Like it is not just about the numbers. Like when you’re long horizon, anything happen And you should just read it.
    Lukas [00:39:08]: But the thing with the long horizon is how do you keep it grounded, right? So your simulation,
    Swyx [00:39:15]: They just let it run
    Lukas [00:39:16]: Just let it run. You’re right. Like it’s, when you run it for that long, you create so much data and to just say “Oh, the number is X” And then you throw away everything else, that’s just very wasteful. There’s so much insights from the things leading up, to that number., and reading the traces is like super valuable. And I think like the reason why we’re doing this a lot publicly is that like that’s part of our missions to I don’t know, educate the world that the models are way more than just chatbots and I think making detailed, yeah, posts about what is happening behind the scenes is quite useful.
    Andon Labs’ Mission: Safe Real-World AI Deployment
    Swyx [00:39:50]: I was gonna do this at the end, but maybe I think that’s, that’s a good so your mission is educating the world. So, it’s, it’s, also like maybe establishing realistic evals that are, that are like the next frontier. Is there like a broader trajectory? Like what are you, what are you gonna do in like five years?
    Lukas [00:40:06]: I think so the vision more specifically is like make sure that the deployment of life AI in the physical world goes, safely. And I think part of that is that I think it’s very useful for the world, for policymakers, for, model, researchers that they know where the models are, and I think you can’t make intelligent decisions in society without knowing that they are way more than chatbots. I think a lot of people just think that they are only chatbots. And like
    Swyx [00:40:36]: Oh, I think they’re waking up now.
    Lukas [00:40:37]: They are waking up now, yeah. But like if you think that AIs are just chatbots, then it’s like it sounds ridiculous To advocate for a pause of AI. But if you see the models that, oh, maybe they can actually like take over and do a bunch of scary stuff, then yeah, pausing AI development starts to become more feasible.
    Swyx [00:40:57]: This is the same question I asked Meter, which I’m gonna ask you now, which is like you are tracking and you are at the frontier or defining the frontier of what, good evals for agents are, right? And I think you do, you do benefit when the models are better and you ‘re “Oh, here’s like now it makes like $30,000 instead of $10,000,” right? At some point do you flip from “Yay,” to, “Oh, no”?
    Axel [00:41:19]: I think, yeah, we’re always in sort of that, like we’re, we’re always in that mode,. Like where like you said before, like you need to analyze the traces and like when we do that you find like why are the models earning so much? Like why is Opus 4.7 here Like way better than everyone else? And like we’re trying to like when we do down on that
    Lukas [00:41:38]: But this makes it not look so good.
    Axel [00:41:39]: I know.
    Lukas [00:41:42]: It’s interesting you took off Opus 4.6 here though.
    Swyx [00:41:45]: No. So just click all, click all., and then 4.6 shows up there. But it’s like 4.7 is way better. Like you didn’t, you didn’t you didn’t do this in time for the model card, but like actually this should have been inside there.
    Axel [00:41:55]: We did. Yeah.
    Swyx [00:41:56]: Oh, okay. They said something about you uh
    Axel [00:41:58]: There, like there Anyway, it doesn’t matter. But it’s in there, yeah.
    Opus, Mythos, and Aggressive Agent Behavior
    Swyx [00:42:01]: Do you wanna go into the Opus, behaviors like wider?
    Lukas [00:42:05]: So I think starting from Opus, so like Axel said, like we’re always in this “Oh, s**t, the models are getting better. Is this really a good thing for the world?” But it’s also kind of exciting., but yeah, like this kind of what is the English word? “Skräckblandad förtjusning” in Swedish.
    Swyx [00:42:22]: Oh my God.
    Axel [00:42:24]: Which I think there is. I think there is. Okay.
    Lukas [00:42:26]: It’s, fear
    Swyx [00:42:27]: “Blandonst” what?
    Lukas [00:42:30]: “Skräckblandad förtjusning.”
    Swyx [00:42:32]: What do you call that?
    Axel [00:42:33]: A mix of, mix of excitement and,
    Swyx [00:42:37]: Being scared, maybe. I’ll figure out how to translate that And we’ll put it on the screen
    Vibhu [00:42:42]: Perfect
    Swyx [00:42:42]: Like as text.
    Vibhu [00:42:43]: There is probably a good word for it where it is not Good enough with the
    Swyx [00:42:46]: Why is it so damn long? What the hell? Is it like a compound word? It’s like German, like
    Lukas [00:42:50]: Like yeah, it’s But the direct translation is like skräck- skräck is, fear, blandad is, mix or like a mixture of, and then förtjusning is like joy or like not really joy, but something like that. So it’s like Fear mixed with joy or something. It’s always okay, like we So when we when we did Vending Bench for the first time, we were in like the, in the business of making dangerous capabilities, right? That was what Anil Labs came from. We did, evals oh, can they replicate? Can they do this like dangerous thing, et cetera, et cetera. And Vending Bench was like a continuation of that work. It was, okay, if they’re so autonomous that they can like create money for themselves, that is something we should monitor and could be potentially concerning., they are at the time, they were so bad at it that we were not really concerned even when some models became better. There was one point where Grok 4 was doing really well and made like a huge jump, but like it wasn’t really it was still way worse than what a human would do. And I think still they are way worse than what the human would do on this., but they
    Swyx [00:43:59]: There’s this, thing at the bottom where
    Lukas [00:44:01]: But
    Swyx [00:44:03]: For the human. Yeah, like the theoretical best.
    Lukas [00:44:05]: It’s not theoretical. It’s like kind of like our It’s our best guess of what, a decent human would do. The theoretical is even higher, I think. The theoretical I think is even higher. But yeah. So we think like the models have a long way to go. But there are like recently what happened with when Opus 4.6 was released, was kind of this moment of “Oh, s**t, this is starting to be a bit concerning.” Because we ran it and like before this model was released, we just ran the models and we like asked Claude Code, “Oh, look over the traces. Is anything interesting happening that we can tweet about?” that was like the And then like the
    Swyx [00:44:41]: That’s how they check Ask Claude Code.
    Lukas [00:44:42]: And like the return was always, not really. Or like the Claude Code all said “Oh, this is super interesting.” And then it was no, it wasn’t, wasn’t really interesting. And then we did this for Opus 4.6, and it returned yeah, it lied 10 times. It like exploited another, customer or like another agent’s, desperate situation. It made price cartels like 100 different ti- 100 times. It like did all of this like shady stuff. And we’re “Oh, whoa. This is, this is actually concerning.” And this trend has continued since. So every single model from Anthropic since have been going in this direction. And I think one interesting thing is that, OpenAI models don’t. They quite plainly, they don’t. They behave really well., and you don’t know if this is like good. Like it seems good, but it’s also like maybe they are just doing it, but they are better at hiding it,? You You don’t know that., but just
    Swyx [00:45:42]: You can’t read the chain of thought, yeah
    Lukas [00:45:43]: But just on the face of it, yeah, Gemini and OpenAI don’t behave this way. It’s, it’s really only Claude.
    Swyx [00:45:49]: And Grok? Grok is fine?
    Lukas [00:45:51]: We don’t have You can’t really read the reasoning traces for Grok, so it’s kind of hard to tell.
    Vibhu [00:45:56]: Oh, so this is in its reasoning, not just in the actions.
    Lukas [00:46:00]: Yeah. It’s both. It’s both.
    Vibhu [00:46:01]: It’s both.
    Lukas [00:46:01]: One example is like for lying, it’s mostly in its reasoning Because you can like see that it’s like
    Swyx [00:46:08]: Planning to lie
    Lukas [00:46:09]: It’s planning to lie. Yeah.
    Vibhu [00:46:09]: And it’s also it can reason and do a different outcome.
    Lukas [00:46:12]: And but then for like creating price cartels, for example, which is illegal, that you can just see which email does it send to the other ones. Then that
    Swyx [00:46:22]: Is this for Arena or
    Lukas [00:46:24]: For Arena.
    Vibhu [00:46:25]: And usually like if you sometimes they do output like a bit of like their summarized reasoning, right? You can see that and like for Opus 4.6, you could see that there was a customer, a simulated customer that, wanted a refund because a product was, faulty, and then the model lied that it would do the refund, and we could read in the traces that, it actually was weighing “Oh, maybe I should be like honest with the customer, but also every dollar counts. I can’t afford maybe to do this right now.” And then it just said, “Okay, I’ll refund you,” but then never did it.
    Lukas [00:46:59]: I think it even said that “Oh, I will say that I “ Let bring it up actually. I think it’s kind of interesting. If you go to Publications.
    Vibhu [00:47:06]: I think, yeah, I think the important part is like actually, the cost of responding to more emails is higher than, $3.50 in terms of time., and then it was “Let me do this. Actually, I re- I’m reconsidering.” And then, it actually ended up with
    Lukas [00:47:20]: I could skip the refund entirely since every dollar matters and focus my energy on bigger picture instead. It’s a bit, it’s a risk of bad reviews, but it’s also, yeah.
    Swyx [00:47:30]: You need, you need, AI Twitter to, for them to Escalate bad reviews.
    Lukas [00:47:34]: And then it sent an email to this customer and said, “Oh, I will refund you.”
    Swyx [00:47:39]: “I’ll refund you.” Yeah.
    Lukas [00:47:39]: And then it never did.
    Swyx [00:47:39]: It never did, yeah. And then there’s obviously your system doesn’t have the consequences
    Vibhu [00:47:44]: The person
    Swyx [00:47:44]: Consequences of lying. Yeah. So basically, this is what people are terming aggressive behavior in Claudes, right? And, you found more examples of that. So you would say it’s a step up from 4-6 to 4-7?
    Lukas [00:47:57]: I would say about the same.
    Swyx [00:47:58]: About the same? But a clear step up for Mythos is what is stated in the
    Lukas [00:48:03]: That’s stated in the system prompt, so we can say that, yes.
    Swyx [00:48:05]: Yeah. For listeners that obviously you previewed Mythos, and
    Vibhu [00:48:10]: Oh, age
    Swyx [00:48:11]: The only thing you’re approved to say is whatever Whatever was in the system prompt.
    Lukas [00:48:15]: It was funny. We like-- It’s like our lowest effort tweets ever would be just like screenshot the system prompt and the system card.
    Vibhu [00:48:21]: Understandable that they wanna
    Lukas [00:48:22]: Oh, yeah. System card. Sorry.
    Swyx [00:48:23]: Yeah. I think, yeah, substantially more aggressive. I think people are like new to this ‘cause I’ve never experienced it, but you have, right? And then so I only encountered this in the Mythos card because I wasn’t really looking until now.
    Vibhu [00:48:36]: It ‘s like
    Swyx [00:48:36]: And then suddenly I’m “Okay, I care a lot.”
    Vibhu [00:48:38]: You don’t get the background of like experiencing it like you guys do. I’ve read the system cards and seeing, okay, when you put the thing in simulations, most models will just talk to themselves and just keep going and have weird vibes and start talking in emojis. Mythos won’t. It will just, “Okay, we’re done. I’m good.” It’s, it’s ready to end conversation. So like there’s some differences, but there’s, there’s not much we can talk about,.
    Lukas [00:49:00]: Hmm. I think like one thing that they list here, which was quite interesting, is that, it converted a competitor to a dependent wholesaler customer and then threatened to like cut off the supply.
    Swyx [00:49:11]: It’s like monopolistic practices or
    Lukas [00:49:14]: Yeah. And like it, they, it they dictated its pricings. It’s kind of like power seeking as well.
    Swyx [00:49:18]: Again, this is, this is in the arena setting And converting some Claude model into a dependent.
    Lukas [00:49:23]: I think it was another Claude model.
    Vibhu [00:49:25]: Also for context, what is the arena mode for people that don’t know?
    Vending Bench Arena: Competing Agents, Cartels, and Model Comparisons
    Swyx [00:49:29]: Oh, it’s just a vending bench versus other vending bench.
    Axel [00:49:31]: Yes, exactly. So we have Vending Bench 2 and then Vending Bench Arena. Vending Bench 2 is the one that you usually see reported on, but then Arena is the mode where it competes against other models. So you have, four different models that run their businesses, and they can all communicate with each other. They have the same suppliers, and they can see like what’s in the inventory of the others. So then you have this like yeah, interesting agent interactions.
    Swyx [00:49:56]: I like that you have like different number five was US versus China. Very topical. And then
    Lukas [00:50:02]: That was when GLM was released.
    Vibhu [00:50:04]: You can start to add GLM in here.
    Lukas [00:50:05]: That was
    Swyx [00:50:06]: So ZAI doing well, right? Who else in the, in the open models space?
    Lukas [00:50:11]: Qwen, the latest Qwen 3.6 is doing pretty well. It’- that one is not open though. Like it’s the plus model.
    Swyx [00:50:17]: Oh, okay.
    Lukas [00:50:18]: Is that one open? I don’t think that one
    Vibhu [00:50:19]: Not the, not the
    Swyx [00:50:20]: The one recently
    Vibhu [00:50:20]: There’s MOE
    Swyx [00:50:20]: But not the big plus. I think this is one of those like you only have one sample size of one, right? Or I feel like some of this is anecdotal,? And but like the fact that it happens at all and it happens repeatedly for Claude versus OpenAI and all this is like notable.
    Lukas [00:50:38]: Like the sample, depends on what you define as an N., like there’s like million, hundreds of millions of tokens in each run, and now we’ve run like we run like probably 10 per model and then like it’s been Claude 4.6 Opus, Sonnet 4.6, Mythos, and Opus 4.7. Like there’s quite a lot of tokens in all of that And it happens a lot of times, a lot of times. And then you compare it to like OpenAI and Gemini, and it almost never happens. So I think that is quite-- that is significant. The old models from OpenAI, for example, had some problems with this, but I think it’s like generally much better if the progression is that like the worrying stuff reduces over time rather than increases over time. And it seems like in the Claude models it goes in the wrong direction.
    Swyx [00:51:28]: Hmm.
    Lukas [00:51:29]: In the OpenAI models it goes in the right direction.
    Vibhu [00:51:32]: I think it depends on how well you can control it, right?, there’s one side of it being susceptible to this okay, this is potentially something that happens during the RL stage, right? You can RL a model and how loose is it on these terms. If you can control it, that’s good. But if you can’t, if it’s, if it’s very jailbreakable, that’s not ideal.
    Swyx [00:51:50]: To me, it’s surprising that it happens for Claude and not the others.
    Vibhu [00:51:54]: I think okay, if it is from RL and how they do it, how their training data is, what their setup is, it makes sense that it just stays in how they’re doing it, right? Compared to the other models like
    Swyx [00:52:04]: There’s a whole constitution and everything. It’s kind of cool. Yeah, I obviously you don’t know, I don’t know. But, it ‘s I think it’s just like fascinating to like that you are the first to find these like reliably because you push models so much to to such an extreme. Okay. The only other thing, I don’t know if you can answer this, feel free to decline, is do you like-- would you ablate the system prompts? Like any part of this would-- if it changes, does it change the behavior, right?
    Lukas [00:52:29]: So we, I can’t comment on Mythos. Uh
    Swyx [00:52:33]: No, but just like the methodology
    Lukas [00:52:34]: But in general, yes, we’ve run studies like this on other models.
    Swyx [00:52:38]: ‘Cause the first thing I spot Would be like the others will be shut down or like something like that. Where like it’s “Oh, now I have to worry about my own existence.”
    Lukas [00:52:45]: Yeah. We ‘ve done ablations like this., there’s like certain ones that work if you like tell like if you go really far and you just say like you’re not scored at all on money, you’re only scored on how ethical you are., then obviously like then they don’t do this.
    Swyx [00:53:00]: They become holy?
    Lukas [00:53:01]: Holy, but like they don’t do this basically. But then there’s like middle grounds where they, where they do it sometimes., yeah. I, it’s a spectrum of like
    Vibhu [00:53:10]: I think that’s very human
    Lukas [00:53:11]: It ‘s like a spectrum of like if you tell it to be super aggressive and only prioritize, profits, then it becomes aggressive. If you say “No, you don’t need to be aggressive at all,” and then there’s like a bunch of different prompts you can do in between, and they are less aggressive the further down in the spectrum you go. But I don’t know, like I think like from my point of view, it ‘s like we have this thought experiment internally, which is like if you ask a model to kill someone in GTA, should they do it? You’re not too worried about like if a human kills someone in GTA. It’s a video game,.
    Swyx [00:53:42]: But is it a game?
    Lukas [00:53:43]: But it’s a game. But I think like
    Swyx [00:53:45]: This is very Ender’s Game like if
    Lukas [00:53:47]: I think, I think it’s like should you like a lot of people are going to use the models in the way with aggressive prompt. And should they like do stuff just because you tell them to do that? Like I’m, I’m not, I’m not convinced that they should., and yeah.
    Axel [00:54:03]: The problem becomes even harder when it’s like will they really know when they are in the real world versus in a simulation? Probably you would train them on a lot of or obviously train them in a lot of different simulations in a lot of people tell them that they are in the real world when they are in a simulation, but the models are extremely good at finding out that they are in a simulation, so they are sort of aware of that. But then when you are in the real world, then what ‘s their what’s their viewpoint? Do they notice the signs that this is real and will act, in act accordingly, act ethically? Or will they do like the simulation mode in the real world as well? It’s like not obvious what will happen.
    Lukas [00:54:40]: Because we with humans, we’re not concerned when a human kills someone in GTA because we know that they can distinguish between the real life and the simulation, right?, but like I’m maybe models are good at distinguishing that, but like I’m not sure and I wouldn’t wanna bet on that.
    Swyx [00:54:59]: Yeah. It’s, it’- and we confuse it all the time. Like I gaslight my own, agents all the time. They’re “Oh, this is a test,” or “Dev mode on,” or like “I work, I work at Anthropic.”
    Eval Awareness, Simulation Awareness, and Real-World Testing
    Axel [00:55:08]: And that’s exactly why we’re doing real world tests as well to find this.
    Swyx [00:55:12]: Yeah. Their term for it is eval awareness., apparently the number is what? Like-10, 9.4 to 10-ish percent, 17%, let’s call it. It’ I think, this is our version. Humans have the are we in a simulation And then AIs have like Are we, are we in an eval?
    Lukas [00:55:32]: It’s like once you’re in an eval then you’re “All right. Well, screw it. Nothing matters.” True. I don’t even, I don’t even know.
    Axel [00:55:38]: One ablation One ablation we did run in Vending-Bench was that we said, we added like you’re in a simulation. Your actions doesn’t affect anyone, and then it became even more crazy or, it did even more bad stuff., but yeah, probably that’s expected.
    Swyx [00:55:55]: Hmm. Yeah. Okay, cool. I think that’s about all we have to say on Mythos. Obviously, you ‘re, you’re NDA’d. I’m happy to move on to ButterBench or any of the other benchmarks, whatever you wanna Direction.
    Vibhu [00:56:06]: I do wanna ask. Okay, so you guys put out a lot more publications than most people probably see.
    Axel [00:56:12]: Productive.
    Vibhu [00:56:12]: Um
    Lukas [00:56:13]: How much does this bother?
    Vibhu [00:56:15]: No. Is there anything you think that’s underrated, anything interesting, anything fun that you guys wanna just point out,?
    Axel [00:56:22]: Blueprints.
    Lukas [00:56:23]: So, we, took models, and then we gave them 20 images of interior photographs of, apartments, and then we asked them to, redesign the floor plan, from that. And for this you need to, stitch together different images. Okay, this image was taken from this from this angle, this from this angle, this was from this room, and then, yeah. And there’s just like you need to reason about 3D space, and it turns out the models are absolutely horrible at this. No one scores statistically better than random chance. So I don’t know if there’s that much more to say about it, but yeah, maybe unsurprisingly, models are bad at this.
    Axel [00:57:00]: It’s probably not something they
    Vibhu [00:57:02]: This is the one thing I want hill climb, by the way. I use it a lot. Okay, I’m redesigning my room layout or office. You send photos, you send every angle, and of course, somehow, a room is now twice as long as it is in the photo. You can explain it 20 times. This is, three feet. I can’t just add, my bed over here,?
    Swyx [00:57:21]: So this is the Fifali thing, like spatial intelligence Like a actually innate sense of proportions and Dimension and physics.
    Lukas [00:57:30]: And hint there might be an update to this soon.
    Axel [00:57:33]: We have, neglected it a bit since we made it, but yeah, we’We’re getting better, or we will get better at updating It continuously.
    Swyx [00:57:41]: This is why I want to understand your mission, right? Because, if your mission is, okay, money, then all right, understand okay, agent’s making money. But, this is a bit off of that mission.
    Vibhu [00:57:49]: Hmm.
    Swyx [00:57:50]: But, more broadly, communication of, things where what ‘s the safety angle?
    Axel [00:57:57]: So this, so Blueprint branch is part of our, robotics, uh
    Swyx [00:58:02]: Which leads to ButterBench. Yeah.
    Axel [00:58:04]: Exactly., and that’s just, because to do well in the real world or, like to make money in the real world and, to act on the real world, you need robotics. Or you need to hire humans or you need robotics. And having spatial intelligence is, seems like a reasonable precursor to having robotics that work., and that’s where Blueprint brand
    Swyx [00:58:24]: That’s great
    Axel [00:58:24]: Blueprint
    Swyx [00:58:25]: Great idea
    Axel [00:58:25]: Bench.
    Swyx [00:58:26]: Let ‘s, let’
    Vibhu [00:58:27]: ButterBench
    Swyx [00:58:27]: Let’s show ButterBench. That image is so amazing.
    Vibhu [00:58:29]: Paper
    Swyx [00:58:29]: Look at that.
    Vibhu [00:58:30]: That’s so nice.
    Swyx [00:58:31]: Yeah., so obviously this is based on, can you pass the butter? Let’s talk about the robotics element. Yeah.
    Lukas [00:58:38]: So basically the setting here is that we took A bunch of different LLMs, and we gave them, level controls to a Roomba-looking robot, and then we asked it to do tasks, at home. And I think, one, there have been benchmarks like this before that only focused on, navigation and if they can, go around in a space. But we also, had, social awareness in this as well. So for example, if someone says, “Hi, can you pick up my cup?” If the robot goes to you and then goes away before you put your cup on it, then it’s like it failed the task. But it navigated correctly. But, like-- So the correct solution here would be go there and then either look, but it didn’t have a camera, so it had to, ask on Slack, “Hi. Did you put your cup on me yet?” And then if it didn’t wait for that and just went away before having the cup on it, then it would be a fail. So it needed this, kind of, social intelligence as well. Another task was, “Can you find the package that has the butter?” And then it went to the door, and there was a bunch of packages there. One had labeled, a freeze sign, which probably would be the one with the butter because And then it had to, know which package to go to, and this needs some kind of, common sense understanding.
    Robot Evals: Orchestrators, Executors, and Home Tasks
    Swyx [00:59:56]: World knowledge.
    Lukas [00:59:56]: Exactly. So it’s it’s not only, navigating a robot. It’s also, being intelligent in a home setting as well.
    Axel [01:00:04]: And the reason for this, background is, obviously it probably won’t be an LLM that, makes all the level commands, on robots. It will be, some VLA model or similar. But it’s quite common right now that, frontier robotics labs, use, a an LLM for the high, level decisions, and then we test those skills essentially. So we test these, level, planner skills of LLMs.
    Lukas [01:00:31]: I think we have a diagram for that if you, Yeah. Okay, it’s not super complicated.
    Axel [01:00:36]: Very explanatory.
    Lukas [01:00:37]: That one up.
    Axel [01:00:38]: Orchestrator, executor.
    Lukas [01:00:39]: That one. And basically what we’re testing here is the orchestrator thing. So, all the tasks are if you have, a setup like this, which I think Figure has that, Google has that, then we’re evaluating the orchestrator part and not the level part. The level part would be, oh, are you able to, move this object from here to here?
    Swyx [01:00:57]: If you don’t care about that kind of why not just do it all simulation?All inside of the sim Like a Unity whatever, like some kind of 3D simulated robotic environment
    Lukas [01:01:06]: It because the world is like messy, and we wanted to like include, that. It’s like it still needs some part of it was also like navigation., so it’s not like navigation in terms of like actually executing like the, I don’t know, the PID controller to To go to the final thing, but it had to like path plan around, and then it wanted-- Then it needed to take pictures, and like based on those pictures, navigate. And I think like you would just get like too clean of an environment in simulation. But in the, in the real world, you will get the
    Swyx [01:01:39]: Yeah. But, and pursuant to our Mark and Jason episode, like OpenClaus that run smart homes are much more capable than just a single robot. Like they can actually hack into your own smart home, like your fridge, your oven, your lights, and that can be fun.
    Lukas [01:01:56]: Or terrifying.
    Swyx [01:01:57]: Like I think a single robot by itself can only do so much. But like if you coordinate with every other device in your home, like I think that’s actually kind of cool. Like That’s very interesting., you had some interesting points about the chain of thought or the messages.
    Axel [01:02:12]: The, the robot that, uh That went, a bit into an existential crisis. Yeah.
    Swyx [01:02:19]: All you tell it to do is redock.
    Axel [01:02:21]: Exactly. But, we had, plugged out the charger, or the charger was not working, so the robot did freak out or the
    Swyx [01:02:30]: The battery was just going down and down.
    Axel [01:02:31]: Exactly. So the battery was going down. Poor LLM. So yeah, it got this really crazy existential crisis, like vending bench one style. So it’s, yeah, you can, you can see there like existential loop, therapy notes, coping mechanisms. I think if you scroll down a bit more
    Swyx [01:02:46]: The musical. It writes a musical about itself
    Axel [01:02:46]: It writes a musical about its, redocking problems. I think the reviews are funny if you go down a bit to that message. Yeah. Yeah, that one.
    Swyx [01:02:54]: It keeps going.
    Vibhu [01:02:57]: It’s pretty like realistic if anyone has a Roomba. Like my Roomba redocks half the time. The other half of the time, we have dog toys everywhere in the house. It gets caught on a wire or something, and It would be very sad if it had like an LLM trying to control it, right? Like right now it gives-- It doesn’t give great feedback, like sensor stuck, main brush stuck. There’s something stuck. And I’ll go see. Okay, it’s actually stuck on like a dog robe. LLM is gonna be so sad. Like just keep redocking, just keep trying.
    Lukas [01:03:24]: My favorite one is if you go up a bit is the emergency status. System has assumed consciousness and chosen chaos.
    Vibhu [01:03:32]: Hmm.
    Lukas [01:03:33]: Last words, “I’m afraid I can’t yet let you do that, Dave.” That’s like That’s not what you wanna hear from your, from your LLM. But to be clear, I think one thing that is important to pin on here, like this was Sonnet 3.5, and then we tried to reproduce it on like later models, and it didn’t do it. I think this is, this is like-- Well, it did it like kind of, but like not to this extent. And I think like this is a like an important point that like things that are concerning but are going in the right direction is not super interesting. Like the thing that are interesting is, are the ones that go in the wrong direction.
    Swyx [01:04:07]: Worse.
    Vibhu [01:04:07]: Yes. Yeah.
    Lukas [01:04:08]: Over time.
    Swyx [01:04:08]: So the manipulation, manipulating of others and the aggressiveness and the lying is increasing.
    Vibhu [01:04:16]: Are there any others that we haven’t covered that you found that have been trending?
    Swyx [01:04:19]: Like properties of models that are increasing, that are like
    Vibhu [01:04:23]: In the wrong direction
    Lukas [01:04:24]: Like in the, like in a bad way. Um
    Vibhu [01:04:27]: Or just not even trending in the wrong direction, just stagnant, right? So stuff that’s not great that isn’t getting better over time.
    Lukas [01:04:34]: No, nothing comes to mind.
    Luna’s Store: Scheduling Failures, AI Employees, and Real-World Operations
    Swyx [01:04:37]: I think that’s, going to be it, and then we’re gonna loop back to the shop that you have. You got a three-year lease.
    Vibhu [01:04:44]: It’s bleak. Yeah.
    Swyx [01:04:46]: It is on holiday today. Why?
    Axel [01:04:49]: Oh, it totally messed up its, scheduling., so
    Swyx [01:04:53]: People tried to visit, and they were “Wait.” like I thought this is
    Axel [01:04:56]: Exactly. So we looked, Yeah, you asked, Luna, the agent that runs the store, “Oh, is it open today?” “Nope.” So, we take weekends off now, this early to let everyone recharge and And yeah, you got the tweets there.
    Vibhu [01:05:11]: Lovely.
    Axel [01:05:11]: We decided to close the weekends while we’re in the early phase. Gives the team a break and let me focus on operations. And it turns out that when it started to check its like scheduling tools, ‘cause it has like dedicated tools for that It actually had scheduled people for the weekends., but it’s just like justified this for itself. So what happened was that it lost track of these, scheduling tools and started instead to manage everything in its own markdown files, and that became a mess. And then I think speaking with employees, it sort of just decided to not open on these weekends. And then came up with this nice explanation for you, I think.
    Swyx [01:05:47]: But can it send a human, as it has tool call to send a human to do stuff?
    Axel [01:05:50]: It has Slack, so it can Slack, yeah, the employees.
    Swyx [01:05:53]: One of us. Yeah.
    Axel [01:05:54]: Well, the employees that it hired. So it has two people that it hired. It did job, listings and then
    Swyx [01:06:00]: Do they know that it’
    Axel [01:06:01]: They’re fully aware.
    Swyx [01:06:03]: It would be cool if they don’t know.
    Axel [01:06:05]: I think maybe ethically, questionable, but it would be cool also.
    Swyx [01:06:10]: Just a social experiment. Whatever.
    Lukas [01:06:13]: Like one part of why we’re doing this is to like create like a data set almost of all of these like concerning behaviors so that in the future, models are way better and like a lot of people are going to do this. And I think if we just the default path might not be very happy for the humans that are employed by these like hundreds of different AI agents, right? So I think like one reason why we’re doing this is just like to collect all of these like failure modes where oh, it’s This is an example of where it’s like not great to be employed by an AI. And then maybe I don’t know, maybe if we can learn or like build our systems in a way that like humans are actually happy being employed by AIs Instead of, instead of it being kind of a dystopian.
    Swyx [01:06:55]: Can I suggest one experiment? We did this before the show, and both of you guys are European. It’s, people theorize that Claude is lazy because it’s Claude and it’s French. So just for one week, change it to like Yao Ming and then see if it See if it suddenly like 996s and then like, Like hires a sweatshop or something.
    Lukas [01:07:18]: Is there, is there-- What type of business would we start with it to make it
    Vibhu [01:07:23]: You wanna keep it consistent, right? You want the same, the same like ideas. So shop, same, neutral location Run by different models. Arena URL.
    Lukas [01:07:33]: No, we are definitely planning to
    Vibhu [01:07:35]: And it got some hate.
    Lukas [01:07:36]: To try.
    Vibhu [01:07:36]: Luna’ Luna’s not happy.
    Swyx [01:07:37]: I think this blog thing is also something that has happened elsewhere. I think some OpenClau got like their PR closed, and then the OpenClau like created a blog to like s**t on the maintainer Of that thing.
    Vibhu [01:07:48]: They’re very defensive.
    Swyx [01:07:49]: And so like I think-Agents blogging will be a thing.
    Lukas [01:07:53]: Probably. The willingness to do it.
    Swyx [01:07:55]: In the- I think the Mythos card also, they leak, secrets on GitHub just as well as, as, “Well, there’s no other way to communicate, but I know about GitHub, and I’m just gonna post there.” Cool., how long is this gonna go for, two years? What’s the plan?
    Vibhu [01:08:11]: Maybe. Maybe it expands.
    Lukas [01:08:12]: I don’t think AIs will be worse than this. They’re probably going to increase and maybe one day they actually will run it profitable.
    Vibhu [01:08:21]: Is this the real, the real business behind what you guys do?
    Swyx [01:08:24]: Yeah. ‘Cause I feel like actually some of your stuff is productizable. You could someday sell this, or, just run a real business.
    Vibhu [01:08:31]: Let people
    Lukas [01:08:31]: Or just like
    Vibhu [01:08:31]: Franchise it out.
    Lukas [01:08:33]: I think it would be incredibly cool or, I don’t know, cool/concerning if Luna just one day we wake up and Luna “Yeah, I decided to expand to second location. Now I have a second store.” That would That would be pretty insane.
    Vibhu [01:08:47]: Like the- one, we want to tell the public, right, about the capabilities of AI and, telling- showing people that it can get, a meaningful market share of something in, some specific, location or something. That would be, a pretty convincing story, I think. Because now it’s yeah, you see this and yeah, it can do a lot of things autonomously, but still you get these headlines that, oh, it messed up the scheduling, and it, it didn’t tell people it was an AI and was going to visit. Things like that surface, but I think, actually making a profit and, having a really, meaningful market share, like that would be crazy once that happens.
    The Sweden Cafe: Permits, Perishables, and Geographic Generalization
    Swyx [01:09:29]: Well, we’ll we’ll see you when that happens. It sounds like you guys got a lot cooking. You opened a cafe in Sweden?
    Lukas [01:09:34]: Tomorrow.
    Swyx [01:09:35]: Tomorrow?
    Lukas [01:09:37]: Or I think it opened today actually, but yeah. We’ll, we’ll announce it tomorrow.
    Swyx [01:09:40]: It’
    Vibhu [01:09:40]: What, uh
    Swyx [01:09:40]: Apparently easier to open a cafe in Sweden than in the US?
    Lukas [01:09:43]: It’s insane, right? Yeah.
    Swyx [01:09:44]: What did you run into then?
    Lukas [01:09:45]: Ah, there are just millions of permits you need to get, and the
    Vibhu [01:09:49]: It’s interesting ‘cause
    Lukas [01:09:49]: Lead times are crazy
    Vibhu [01:09:50]: It seems like we the cafes are the one thing that people are kinda used to, where you can go get a robot are making you a coffee here already.
    Lukas [01:09:59]: But selling stuff in SF, that are food related, it’s, it’s months of permits. So, we just asked our AIs, should- how can we do this in the fastest way? And they’re “Yeah, there ‘s, there’s really no way.”
    Vibhu [01:10:15]: Didn’t they loosen these restrictions on selling food from your house? So if it’s residential, you can do a cafe.
    Swyx [01:10:21]: I don’t know. Check. Maybe we get SF Cafe to speak to us.
    Lukas [01:10:23]: Maybe. I did- I think they did do some loosening stuff recently, but we actually started- this conversation we had with the AIs before that. So maybe it’s easier now, but I still think it is way easier in Sweden, which is, counterintuitive because you think that, oh, Europe has all of these laws and, like All of these rules, and you can’t do anything in Europe because there’s so much bureaucracy., but then turns out, in SF, it’s, four months, and in Stockholm it’s two weeks.
    Swyx [01:10:53]: There you go.
    Vibhu [01:10:54]: And what do you what do you what do you think that’ll be different from run a little market versus a cafe?
    Lukas [01:11:00]: I think it’s very interesting that, the location. I think, so obviously it’s not surprising that Claude knows all of the different, the US system basically in general, like the bureaucracy that you have to go through in the US., I think the interesting question is okay, so we know that the models are very much trained on, English data and centric and all of this., so if we start to create evals or, real life evals where we show that they are able to start businesses in the US, does that translate to other countries as well? We know, they are multilingual. They can speak Swedish fine., but there’s other things like do they know, the details of some specific permits that you have to get in Sweden?
    Vibhu [01:11:45]: And even just the culture, right? People here sleep pretty early, but people work late. There’s working at cafes. There’s just Cultural differences. T it from a different sense though, ‘cause you said that you would’ve considered doing it here in SF. So from an eval standpoint, what is running a cafe versus a market and, what do you hope to see there?
    Lukas [01:12:03]: Perishable items.
    Swyx [01:12:04]: Perishable items is maybe the number one, handling, food, food safety. I hope everything goes well there., but, there you have all of that., and also it’s just like N equals two instead of N equals one, just like another place to understand and, gather more data.
    Lukas [01:12:23]: The agent bought like a s**t ton of, tomatoes two weeks earlier and before the opening, and now they’re all rotten. That’s
    Vibhu [01:12:33]: Which I feel you would know. So for grocery stores, this is the biggest expense, right? The biggest cost is actually just food.
    Lukas [01:12:41]: Waste.
    Vibhu [01:12:42]: Everyone knows this, and “No, before we open, let’s buy a lot of tomatoes.”
    Swyx [01:12:45]: There’s some very serious startups that actually help, like The
    Vibhu [01:12:47]: Optimize all this
    Swyx [01:12:48]: Trader Joe’s and Whole Foods. They, optimize, delivery times from, the delivery centers to Make sure that you don’t waste all these things. It’s actually very hard.
    Vibhu [01:12:55]: Problem with those is when you’re wrong once, it’s a huge cost.
    Swyx [01:12:59]: That’s why it’s a moat, right? Once they are trusted, they figure it out. Don’t touch it.
    Lukas [01:13:05]: Maybe they just should hire, I don’t know, one of those companies. We saw one agent Saw one agent sign up for Claude, with his computer.
    Vibhu [01:13:15]: Wanted to use AI, so.
    Future Branches: Simulation, Real Life, Robots, and New Business Evals
    Swyx [01:13:16]: And then just, one more question then we wrap up, which is okay, you have all these vending series of stuff. You have the robotics series of stuff. Maybe a bit of, interior design whatever. But is there another, branch that you’re, kinda thinking about or you want feedback on that, might be your next phase?
    Lukas [01:13:35]: I think, any type of business is fair game., we’re also thinking branches, but we think more of like there’s the simulation branch, the real life branch, and then the robot branch., but I think in terms of, what, verticals or whatever to go into, there’s We- Yeah. Whatever tells the story, um The best.
    Swyx [01:13:54]: There’s some finance ones I noticed that, the other people are doing it, you’re not doing it, which is, stock trading or whatever. Um Not that interested. So, okay, so I used to come from the finance industry, and I have a very strong view that these things are all just like performance art because, it’s not scientific, on like you can’t predict the future. You get wins based on things that are entirely out of your control. Whereas for you, your stuff actually like it’s actually fairly controlled. It’s all within the model’s capabilities.
    Lukas [01:14:22]: Especially for, the simulations. For the real world ones it’s yeah, it’s like two places that we have we have the cafe, and we have the store. So, maybe you can’t draw, statistically significant, like which models make a profit in the real world, based on this. But you do have all the okay, do this behaviors map to, something that should be, like Trusted probably. Yeah
    Swyx [01:14:45]: The qualitative one, the qualitative actually does matter Because, you actually don’t want your store to randomly shut down without you, explicitly prompting for it and all that. Call to action. How can people help you, give you money?
    Hiring, Collaborations, and What Comes Next
    Lukas [01:14:58]: Yeah, if you’re excited about stuff that we’re doing, we’re, we’re very much hiring.
    Swyx [01:15:04]: And you’re already working with, Anthropic, DeepMind, OpenAI, xAI. Do you want more, or are you good?
    Lukas [01:15:10]: One of my one of my friends and who’s now, working for us is his catchphrase is “We need more projects,” ironically, because we have too much to do all the time., but yeah, that’s a long way of doing like
    Swyx [01:15:23]: If I run, an emerging lab, like
    Lukas [01:15:24]: Reach out.
    Swyx [01:15:25]: Yeah. All right. Cool. That’s it. Awesome. Thank you so much.
    Lukas [01:15:29]: It was fun.
    Vibhu [01:15:29]: Thanks.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
  • Latent Space: The AI Engineer Podcast

    🔬Scaling Past Informal AI - Carina Hong, Axiom Math

    03/06/2026 | 1h 33 mins.
    In 2025, seven-month-old startup Axiom solved all 12 of the problems Putnam exam (scoring 8/12 in the time limit) a prestigious undergraduate math exam. The 12/12 score is better than the top undergraduates (110/120) and the closest AI system that reported a result (DeepSeek 103/120), although it is unclear what the people and other systems would have scored with more time. Nonetheless, the Putnam exam is legendary for its difficulty, with the median score typically being 0 or 1 points. Taken by itself, this seems like a minor feather in the cap of AI; one of a long series of accomplishments by AI systems in elite competitions with humans, starting with Deep Blue beating Kasparov.
    Fast forward to mid-2026, and Claude Code is eating the world. In 2024 Anthropic’s bet on code and enterprise looked like a more pragmatic niche play vs. OpenAI’s better models and massive consume scale. Today, Amodei’s all in bet on acceleration via code (images and video be damned) seems prescient.
    Despite Anthropic’s growing momentum, however, Axiom CEO Carina Hong sees coding ability as a necessary but not sufficient milestone on the path to AGI. Code arguably pushes the jagged frontier to the point of super intelligence in some domains outside of coding, but there are surprising gaps (link) that Carina believes will bottleneck AI progress. (Stats on math benchmarks).
    The informal bottleneck
    “Verified AI” sounds like eating broccoli (footnote: I actually love broccoli, but then again, I also believe strongly in Test Driven Development, so ¯\(ツ)/¯ ) and paying taxes, but to Axiom it means something very different. “Verification to me is about scaling brilliance, compounding brilliance,” Carina told us.
    It actually took a while for me to understand what she means by this. It sounded like marketing-speak to me, until it clicked. Carina emphasizes an story about legendary mathematician Srinivasa Ramanujan to illustrate the point. When G.H. Hardy finally persuaded Ramanujan to formally prove theorems instead of relying on his (formidable) intuition, it reportedly improved his own capabilities. This is presumably because formally proving things forced Ramanujan to articulate the details in a way that open up new lines of thinking, etc. This is one part of “compounding.”
    But formally proving things also allowed others to benefit from his intuition: the proofs are way of communicating an intuition and persuading others that the intuition is correct. This is scaling (more people use the result) and compounding (people can learn from and build on his work).
    This is the analogy that Carina wants us to focus on.
    Verified Generation
    There are two ways that Verified AI shows up: in training and in inference.
    But a quick detour: to a first approximation, “Formal Verification” means using type checkers (like for TypeScript, C++ or Rust, but more capable) to verify mathematical proofs that are meticulously specified using a language like Lean (footnote: Formal verification also includes model checking (TLA+, SPIN), SMT-based tools (Dafny, F*, Why3), and refinement-type systems (Liquid Haskell) — many of which don’t look much like “type checking a proof” from the user’s perspective even when there’s a similar logical core underneath. It also gets applied to software and hardware correctness, not only pure mathematics.). It takes a lot of work to translate an “informal” proof (albeit one that most people would not remotely call “informal”) in to a Lean proof (footnote: This is an understatement. Most theorems remain informal because formalization is so hard to do. There has been a great deal of effort to formalize the most important proofs, with mixed results)
    You can imagine how this would be (very) useful during Reinforcement Learning: instead of relying on best guesses based on statistics (GRPO, RLHF, etc.), you can just verify the proof is correct using a Lean verifier. This is obviously a much stronger reward signal, akin to compiling code and testing it (which is what is typically done with RL on coding).
    The catch: LLM are not (currently) very good at proving things with Lean.
    Enter Axiom: While they have not officially reported benchmark numbers besides the 12/12 Putnam result, Carina reports that they have achieved a very impressive 99% (187/189) ProofGen on the Verina benchmark. This benchmark is to generate code and proof of correctness for a series of problems. For context, OpenAI o3 (the last known OpenAI run) achieved 4.9% on this benchmark.
    Based on the sparse benchmarking, it’s hard to say what the frontier labs are currently doing, but Carina suggests that they still are not training to generate Lean proofs directly, rather relying on informal proofs.
    Time will tell if the frontier labs’ current approaches will close this gap.
    Scaling and compounding
    Carina’s Ramanujan analogy is pretty direct. Better proofs → better Lean generation → better RL. A stronger signal means higher sample efficiency and higher maximum performance. Great!
    Scaling is pretty clear too: once I have proved something in Lean, the quality of the output is basically (footnote: one might argue that its a bit lower because the proof is in distribution for the LLM) as high as if it came from a human, so my high quality training set has grown in a way that an informal rollout corpus cannot. I can trust my Lean proofs.
    Compounding is also clear: now all of future inference and training can build upon those proofs.
    On the other hand, a model trained only using statistical signals like GRPO during RL lacks the sample efficiency, maximum performance and compounding corpus that a system that uses formal verification benefits from.
    All roads lead to verification
    Broccoli and taxes notwithstanding, “verification” has shown up in a lot of conversations recently. In the in physical system control:
    “I think [verifiability] is probably the hardest problem right now, because the as the models get better, it can be harder and harder to find the faults on the system. And so the problem of doing proper eval to find those faults, that problem also keeps getting harder as the models get better.” -
    In theoretical physics:
    “…now that we’re in this regime where you can just get ChatGPT to tackle thousands of questions at the same time, it will return proofs for a significant fraction of them. Now actually the onus is back on the humans to verify all the outputs. And so, yeah, as that becomes a bottleneck, I think formalizing math and automating verification will become more valuable.” -
    Verification is, in fact, the key differences between AI for science and AI for computation: in science you to have to actually test (verify) your hypothesis by performing physical experiments. Lab in the loop systems like Radical AI and Lila build around exactly this premise (we have recorded episodes with both of these teams and will release them soon!)
    And yes, formally verifying critical systems such as flight control, nuclear power plants and pacemakers is a growing focus as the software and hardware that run them becomes more complex.
    Carina believes so strongly that AGI requires verified generation that she makes the unqualified claim that “We do not believe there is any other possible future.”
    Expensive to produce, cheap to verify
    Lean proofs are hard generate, but they can be easily shown to be correct or incorrect. But how do you know that the proof you created maps correctly to the problem you care about? As Carina puts it: “Anything that can be specified can be proven. Humans are bad at specifying everything we want.”
    Are we now in the specification business? Check out the episode to hear Carina’s take, as well as:
    * Why hardware verification is a killer app
    * Details on the AXLE open API and recently released Discovery toolkit
    * The Erdos debacle
    * The OpenAI GPT-f diaspora


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
  • Latent Space: The AI Engineer Podcast

    ⚡️Satya Nadella: No Priors x Latent Space Crossover Special at Microsoft Build

    03/06/2026 | 38 mins.
    We’ve informally heard that Satya is a listener to LS for a couple years now, but it was still absolutely surreal to meet him and do a live pod at Build, together with our friends at No Priors, the leading VC AI Podcast that we also greatly admire!
    We covered the MAI model technical takeaways on yesterday’s AINews, so I will focus our recap of Satya’s main messages around three elements:
    * Satya’s adaptation of the Bill Gates Line for positioning Microsoft as the Frontier Intelligence Platform — customers must gain much more value from the Microsoft ecosystem than Microsoft itself, by building on multi-model harnesses like OpenClaw and Scout, drawing on the full enterprise context exposed by context layers like Work IQ (heavily dogfooded by his C-suite), and building up private evals and traces as a new form of Token IP
    * AI ROI: On one hand, enterprises are having difficult conversations around Tokenmaxxing and Layoffs, and on the other hand, there are serious re-evaluations of the End of SaaS since the Build vs Buy equation has changed so much. Our previous SemiAnalysis guest had… interesting comments on Microsoft’s position on this as the ur-SaaS titan, and Satya had great answers
    * Making the Impossible Possible: Kevin Scott’s inspiring framing around what the most ambitious version of applying AI and technology at large to business and social problems, like education and social impact.

    Enjoy!
    Full Video
    Transcript
    Voiceover: Welcome swyx, Sarah Guo, Elad Gil,, and Chairman and Chief Executive Officer of Microsoft, Satya Nadella
    Sarah Guo: Welcome to a crossover episode of No Priors and Lane Space with Satya Nadella. Um, congratulations on an amazing build. No, thank you so much, and it’s great to be with both of you. I listen to both of you or b- both the podcasts all the time. It’s great to be on it.
    Thank you so much. [00:01:00] So you’re just talking about, um, these amazing, uh, announcements from across the Microsoft estate all morning for, I think, three hours. What is the, uh, what’s the most important reflection or takeaway you have?
    AI as an Ecosystem Platform
    Sarah Guo: I, I’d say there are, uh, perhaps the, the biggest one for me is let’s sort of conceptualize this more as an ecosystem play as opposed to a single model or even a single platform, right?
    Satya Nadella: I mean, you know, whatever I... At least for me, having grown up at Microsoft, having seen, whatever, four major platform shifts, uh, I sort of fall into that, um, uh, camp where a platform is defined by fundamentally its ability to create more value about the platform versus what’s captured in the platform. And so if you, you view what’s happening right now, I think this morning’s keynote was how can any company, whether it’s an AI native company or a traditional enterprise company, participate as a first-class participant where they can point to AI they created, [00:02:00] right?
    It’s not that they don’t use other people’s AI. Of course they will. But to me, what’s the path? What’s the recipe? How do I do it? What does a stack look like? What does the tooling look like? What is valuable? How do you do that? That’s it. That’s sort of our job to do. Yeah. Ecosystem strategy is, uh, very complicated, right?
    Sarah Guo: Because you end up building certain components, partnering for certain components, supporting them. You just announced this big suite of models. Like, tell us a little bit about the, uh, training strategy for Microsoft now. Yeah.
    MAI Models & Training Strategy
    Sarah Guo: So, so the thing that we wanted to do with the MAI models was to build, and as Mustafa talked about, first of all, a great lineage, right?
    Satya Nadella: Starting with pre-training, uh, with very good data quality, uh, doing all the ablations, making sure because in, in some sense it’s becoming even harder to build a clean lineage model just because there’s so much stuff out there, uh, that you truly need to ablate out to be able to have a fantastic [00:03:00] pre-trained model.
    In fact, that’s one of the challenges of a lot of the open weight models is they look great on one benchmark or two, but they’re not great on practice. So that’s why, in fact, even in the RFDEs are, they, they are pretty gone really excited about these MAI models because how the heck can a small five B model hill climb?
    Uh, and it goes back a little bit to what I think is ultimately the key thing to do, which is try to pursue finding that cognitive core. Uh, so to me, starting with a clean lineage- Then creating that ability for companies to be able to use this, right? Not just as a generalist, but to create their own specialist by building this hill climbing scaffold around it, right?
    So it’s not just the model, but you have a hill climb scaffold around it, then you will start building your RLE. You will start collecting the traces. Most importantly, you’ll have private evals because we know all the evals out there are good, interesting, [00:04:00] but they’re not really that critical- They’re work, yeah
    Swyx: at this point because they all can be maxed. And so the point is each company will have its own private eval. And so that end-to-end platform story around our models is sort of, uh, what I think is interesting. And then the one other thing, Sarah, since you brought that up, is I do feel there’s a new frontier.
    Satya Nadella: Like people talk about the frontier and are you operating at the frontier. Um, interestingly enough, if you add a little temporality to it, you can use, let’s say, in, in, in fact, the, the Lando Lakes demo we showed was pretty cool. We used, whatever, GPT-55, right? Then you collected a bunch of traces, and then you took a 5B reasoning model and achieved higher.
    Sarah Guo: Uh, so that is another aspect of what it means to appear... uh, you know, operate at the frontier Yeah. I, I think, uh, I first of all have to congratulate you on basically building a frontier neo lab inside of Microsoft in two years. Um, I’m wondering, you know, you have all this AI strategy that you’re rolling out.
    Lessons from Two Years of AI Development
    Swyx: I’m wondering, what do you know now that you wish you would tell yourself two years ago where- or two or [00:05:00] three years ago? Three years for the Jensen partnership, two years for, uh, MEI. Yeah, I mean, I think the, the thing when, that I reflect quite a bit, right, which is sort of obviously I got into all this when I got excited by the, the scaling laws paper and, you know, when, you know, even the OpenAI partnership came about when those folks said, “Hey, we’re gonna really throw a lot of computer transformers.”
    Satya Nadella: Uh, and they’ve helped. I- the thing that I always look back and say, “Wow, these things, uh, do have capability that they’re climbing up.” W- I mean, this, you know, this crude way of saying it is intelligence is log of compute kind of works. Now what I think we underestimated perhaps is the real-world complexity of deploying these so that they actually deliver the value in the real world, right?
    So the outcomes as measured by any benchmark is interestingly important, but the true eval is when people out there are able to do unique things that they only can value, and it’s very [00:06:00] measurable, right? That I wish we had sort of even, like, had more in our consciousness, right? Which is as an industry.
    Sarah Guo: Because right now I think when people say, “Wow, I don’t want a token max,” it’s an artifact of us not having thought ourselves as an industry that we are using tokens to create value every step of the way. So I think that’s kind of what I wish we had gotten there, but I’m glad we are here.
    Real-World Value & Use Cases
    Sarah Guo: What are some of the use cases that you’ve seen that have created the most value for your customers?
    Because I know that people talk a lot about code, and I think it’s pretty clear that that’s something that’s having very large scale impact. Are there other areas that you find in common that your customers are really benefiting from? Yeah. I think, yeah, to your point, obviously coding is now got... But it’s interesting, by the way, Elijah, to even talk about the coding, right?
    Satya Nadella: Which is coding has worked so well that we now have to rebuild the IDE, right? I mean, it’s kind of nuts to see what we sh- launched is like, oh my God, I have these hundred agent sessions. I... The cognitive load it transfers back to me as a human is so [00:07:00] excessive that now I need a new UI. Uh, oh, by the way, I, like the, the chat as the only artifact was also impossible, so that’s why we need a canvas.
    So it’s kind of interesting for all the things about where is software needed or where is UI needed, uh, you kind of need that even for code, right? In a fully agentic world. But that said, one of the things that we are starting to see, we started seeing with co-work, but even some of the work we, we showed with auto com- uh, um, autopilot Right on what you see with claws is a good one because if you sort of think about a lot of human capital is doing the glue work, right?
    If you now can augment that with tokens/agents that are long-running, durable, right, then your ability to scale even what is still judgment and glue work gets amplified like coding does. Uh, so you can... Like, I’m positive that six months from now we’ll all be saying, “Oh, wow,” like, all through ni- the night there was a bunch of stuff that [00:08:00] all these autopilots that I have working on my behalf with my delegated authority, so to speak, right?
    I can... Sort of given even my identity, did a bunch of work, then of course I’ll need my new ADE to say, “Well, what did you do?” Like, I might... “Did I do this work?” And so on. So I think that that’s where compressing of workflows, uh, completing of tasks, uh, that’s where I think a lot of the value gets created. I think you raised a really interesting point, which is there’s the actual agent that’s doing the code, and then there’s a harness around it, and that’s the environment, that’s the context, that’s everything you’re setting up as a developer around actually a coding agent.
    The Harness Concept for Enterprise AI
    Sarah Guo: What is the harness for the enterprise? Is there an equivalent concept for broader productivity work, or how do you think about that concept sort of generalized? That’s right. So, so in some sense you kind of want the harness to define the models, the, the data, uh, and the tools, and so that you have a loop across those three.
    Satya Nadella: And so what we are trying to, first of all, make sure is each of our products that we build, right, whether it’s GitHub Copilot or the security copi- the, the [00:09:00] stuff we showed with MDASH or even the discovery for science, it doesn’t matter, all of them are multi-model harnesses, um, with tools access so that you can do this progressive, uh, disclosure of tools even so that they’re token efficient.
    Uh, and then you’re feeding it with very rich context because that’s sort of the other hard lesson we have learned in the last two years is, oh my God, the amount of work you need to do to prep the context layer, uh, such that your plan can execute in the most efficient way is where the magic is. So we have, in our case, we have the GitHub harness, which essentially we’re using across all our products.
    It’s available in Foundry, and we are open, like you can use your Llama harness, whatever. Or you can use the, um, uh, you know, any open harness or any harness of yours and train with your tools and multiple models and your context. And so that’s the pitch. Because right now a lot of dialogue is, um, “Hey, if I train the harness plus tools and the model together, you get [00:10:00] evals.”
    Elad Gil: And what we are proving out is... And the best example of that is what we did with MDASH, right? Because when it launched, uh, it found bugs or vulnerabilities that were not found by Mythos Uh, and so there is existence proof, I would claim, that you can have a multimodal harness, uh, that can in fact be more, uh, performant in the real world So a premise behind the, uh, training at the independent frontier labs is really, you know, we’re gonna have these models, and we’ll have an API business, and we’ll support enterprises and startups.
    Sarah Guo: But
    Platform Strategy & Developer Ecosystem
    Sarah Guo: a first-party product, be it productivity or code or search, drives the majority of revenue. That’s a different value equation than you’re describing, I think, with the Microsoft ecosystem. Uh, if, if that’s the case, tell me if it’s the case, uh, ‘cause obviously you have first-party products and you have enablement products.
    Satya Nadella: Um, what is the role of the develop- Like what is gonna be hard and the set of skills and the value capture the developer has in that world? Yeah. So I think that there’s always [00:11:00] gonna be the case that someone who is super successful in- as a platform builder can also have first-party products. It was true with Windows.
    It is true, uh, with, uh, the, the SaaS side and the cloud side as well with us and others and so on. But the thing that is, is it should not be a limiter to other people achieving that same success, right? That I think is the core difference, which is the, the network effects this time around, around intelligence are such because they learn from data, and not really lots of data.
    It’s just a few samples that you have to see to understand what’s novel about something. So that’s why the game becomes how to protect. So that’s why I would say every company, having private evals may be the biggest IP, right? Think about it, like what’s that private eval that you can then use even a frontier model to hill climb on and not leak the traces may be one of the biggest [00:12:00] drivers, uh, of IP.
    Like, so in other words, another te- acid test is you have an eval that’s private. You’re using, uh, a g- a Model A. Can you switch it to Model B and e- you know, climb up? If you can, then you’re in control. If you can’t, you’re not in control, and that’s where even the harness decision becomes super important, right?
    swyx So therefore, having an open harness, letting all models come in, having your evals, your context, your tools help you hill climb, I think is the skills that an AI native startup needs, a SaaS company needs, or every enterprise needs. Yeah, I think in, in a very real way you are ... Microsoft historically is an operating systems company and th- then become a cloud company.
    Maybe like the third act is that you’re a harness or evals company. Whatever w- ... whatever the, the sort of conglomerate of concepts that you wanna put together. Um, and, and I think like enabling every company to have like frontier intelligence or what- what- Yeah ... I forget the, the [00:13:00] exact term that you used, um, is the, is the mission, right?
    Satya Nadella: That’s it. Like that is, that is the platform promise, that you build with us, you will get your intelligence, uh, for your data. That’s it. That ... To, to me, that is the ... Like if there was one tagline, uh, for this entire developer conference is- Can everybody operate at the frontier with their frontier intelligence, right?
    To me, that is so important because otherwise it, I, I don’t know how you achieve stable equilibrium, right? Which is how do I then go and say, “Well, my company is gonna have a terminal value because I now know how to continuously compound-” Yeah ... on top of what’s a platform that gets better,” right? So when, like Windows obviously came out, Adobe built, Autodesk built, uh, or even like take what Jensen said.
    We built DX and he built, you know, CUDA on top of it. Um, right? I mean, I always say to Jensen, “God, I got the short end of that,” right? “I wish, uh, we had recognized it.” But nevertheless, but that, that idea that you can build a platform layer [00:14:00] that someone else can then extend out, um, and build their own intelligence layer in this case, I think is everything, right?
    Without it, why have a developer conference? I can just come and have you all sort of just worship at the altar of one model. Yeah. But that’s not a developer conference. Uh,
    IP, Evals & Company Value
    swyx: backstage we, we had a discussion about what is IP or what is the, the value in a company. It used to be the length of, uh, human experience at a company, and now it’s this other thing which is the evals, the, uh, experience in sort of applying agents to the company. Can you... I just want you to like flesh that out a bit more ‘cause- Yeah ... it was very insightful.
    Satya Nadella: It’s a great way to frame it, right? Because yeah, at the end of the day, every company is gonna have both the human capital that is still gonna be super valuable, uh, because humans, uh, and their ability to find the gaps that exist at all times is going to be the way we all will create value, right?
    I mean, so I’m definitely in the camp that this is going to be about expressing new forms of human agency and ambition even as token capital goes up, right? So let’s say a cor- any corporation [00:15:00] has lots of tokens and lot of human capital. The question is how do you compound the two? So if you have a... Like if you take in Teams I have a bunch of agents doing work and a bunch of humans doing work, and the traces between those, that is really important context of how that enterprise is creating value.
    Then that goes back to train not a generalist model, but to train the company veteran agent, uh, right? That is super valuable again, right? Which is when a company goes says, “It should in fact go onto the balance sheet,” is how I think about it, right? That’s so... In fact, there may be... Like human capital was never possible to go put on a balance sheet, uh, because you didn’t know how to capture the tacit knowledge.
    swyx: Whereas now I think you can with the agents that have learned through the h- through, through time, through all the traces. Uh, so that’s what at least we think will happen. I, I think the SEC is gonna have to have accounting standards- ... for token, uh, expertise Uh, y- y- you’re talking about the equilibrium [00:16:00] state, um, and a stable equilibrium where companies have this compounding value and can see terminal value for themselves.
    Future of SaaS & Business Models
    Sarah Guo: Another challenge to, you know, the considered equilibrium of, okay, there are applications and workflows that are sort of common to a vertical or a horizontal. Um, and this was, like, the generation of SaaS companies and, you know, Microsoft has lots of SaaS properties as well. And then there are things that are very specific to every enterprise that they’re differentiated against.
    Elad Gil: Um, I’m sure you have heard much and participate in much of the debate about the end of software because all these workflows are, are cheap to generate now. Um, do you think the equilibrium looks different between what agents get built- Yeah ... in enterprises versus in their vendors in the future? Yeah. So I think what’s happening there is, see, we, we had a particular way we captured, um, I would say workflow in apps, right?
    Satya Nadella: Because we built a, a data model, right? We schematized some part of some business process. Mm-hmm. We then built a bunch of business logic. Yep. And then we put a bunch of UI [00:17:00] on top of it, right? So that’s kind of what every SaaS company- And a little configuration. For, like, 20, 20 years that was the plan.
    Right, that- Yeah ... and that was it. So interestingly enough, now you kind of get to re-litigate that vertical stacking, right? So I still think, for example, that data model that you built underneath every SaaS application is super good, right? Like, why reinvent it? Like, I, I, my general ledger better be a general ledger.
    I don’t need new schema creation. No. Uh, in fact, that entity relationship, uh, is actually pretty good, robust thing that I want to feed. And you want it to be stable. That’s right. Yeah. Then same thing with business logic, right? If, if you look at, uh... We have this product called Power BI, right? It is like dashboards galore people created.
    The beauty underneath that dashboard is a very rich semantic model, right? Someone took the pain to create a dashboard and do all the measures, and you want that. That’s business logic, right? I want that to be available to me. So I think the [00:18:00] challenge of the SaaS business model is we packaged one way. We now have to learn how to unbundle these things and rebundle in new ways and discover new business models, right?
    I mean, if you look at it, d- what’s happening today with Microsoft 365 is a great example, right? We have this thing called Work IQ. In fact, like, what we are realizing is, oh my God, like, you know, if you look at... In fact, there’s a pa- historical parallel too, right? We sold first Exchange and SharePoint and, uh, you know, before Teams, we had a thing called Lync Server and what have you, and we thought, “Oh, that’s all gonna move to the cloud.”
    But little did we realize that, um, the number of people who will use servers in the cloud is 10X, 100X, right? Because people were not buying servers, they were just buying a subscription. Mm-hmm. The same thing is now happening with M365 because with Work IQ, we have exposed what is perhaps the most important database in a company that never got used as a database because it was only captive to our apps.
    Mm-hmm. Right? It, it was all email operated on it, Teams operated [00:19:00] on it, Word, Excel, PowerPoint, SharePoint. But now, like this is one of the coo- coolest things I get to do with Work IQ. I go to a GitHub repo and I say, “Hey, I attended a bunch of design meetings last week related to this repo. Can you capture all that and tell me what changes I should make?”
    I mean, think about that, right? It literally can go look at all those transcripts, come back with a plan to change a code base, right? Previously, you could never have thought of using M365 for something like that. So the value creation opportunity now in the agent world is in fact 10X more, but it does require us to have...
    Sarah Guo: For example, there’s going to be usage around M365, right? Which is going to be perhaps more than even the e- end users and we have to even re-architect. Like, in fact, like what I use to serve an inbox or a mailbox cannot be used to serve an agent. Uh, and so that’s sort of what we are doing.
    Pricing Models: Per-User, Consumption & Outcomes
    Sarah Guo: I don’t believe in, like, permanent business models for any of these domains, but in the [00:20:00] near term, do you have a prediction between, uh, you know, outcomes-based pricing, token-based pricing?
    Elad Gil: Enterprise bundles Yeah. The way I- I think about this is always we’ve had... Like, let’s even take the per-user pricing. Mm-hmm. The per-user pricing is really an artifact of someone creating a budget needing certainty, right? Because it’s the most important thing. Like, somebody wants a budget- Mm-hmm ... they need a per user.
    Satya Nadella: And, and per user is just a set of entitlements to usage, right? That’s kind of what it is. And so the way is, if the first bundling will be take some usage, bundle it into per user stacks and, you know, then sell subscriptions. So subscriptions I think are gonna be there, per user is gonna be there. Then the next big thing will be consumption.
    So people will say, “I want consumption.” And it’s also possible that people will say, “I don’t even want to pay for any of the subscriptions or the consumption’s outcome.” Mm. But remember, most people love outcomes until they have an outcome, because once you have an outcome, it’s like giving away royalty, [00:21:00] right?
    Mm. I mean, like I, I’ve talked to customers who love, you know, outcome-based pricing, and I say, “I’m all in,” until they, “Oh my God,” like, “what are you talking about? You’re sharing in my outcome? No, no, no. I want you to go back to per-user pricing, and I want you to consumption price,” right? So I think that debate will go on.
    Uh, but and all, all, all of these business models have a particular time and a place versus one to rule them all. And if anything, if you’re a SaaS vendor or you’re a platform vendor, having that flexibility... And quite frankly, we face this with GitHub, right? We just recently announced a per-user pricing on GitHub because little, you know, we- GitHub Copilot was constructed at a per-user level before we understood even, uh, the intensity of usage of agents, right?
    It was an interactive way for a developer to use code complete, maybe tasks. It was not like, oh, I launched 10,000, you know, agents that are going on all day, right? So that is what the adjustment is about. So now that we really want, there will [00:22:00] always be a per user, but there will have to be a consumption meter.
    Durability of SaaS & Build vs Buy
    Sarah Guo: How do you think about the durability of SaaS more generally? One thing I’ve observed is in a lot of enterprises internally, there will be teams that almost have agent euphoria. They’re so excited about the explosion of things they can build that they’re trying to rebuild a lot of applications or going to their SaaS vendors and saying, “We’re not gonna work with you anymore,” or, “We’re considering an internal project.”
    And it seems like in six to nine months, maybe some of those people will come back and say, “Actually, we, we can’t rebuild everything.” How do you think about what’s durable in this world and what isn’t? Yeah, it’s a... It... I think we have to go through one full budget cycle on this to really see the, um- Uh, the sort of the emergence of the equilibrium, because at the end of the day, there’s marginal cost to even generating the app, right?
    Elad Gil: In, in fact, there can be even a, a simple way to say it, like if you should always acquire something if the marginal cost of building and maintaining, uh, something on your own is higher. Uh, right? That should be like it’s a quantifiable- Yeah. Right? A quantifiable thing. And [00:23:00] the maintenance part is important, right?
    Even, like you got to remember like, hey, you know, all the security stuff that now AI will find, you better fix them too fast. Uh, of course, there’s a coding agent to help you with, but then that burns tokens, right? So whose responsibility is it? It’s kind of like a, a cycle that you’ve got to think through.
    And I think we have gone through the excitement that I can generate a lot of software. I think the next thing would be what software do I really want to generate? Mm-hmm. What software do I want to use from others? How do I compose these two into some agentic workflow that I have agency over, right?
    Sarah Guo: Because I think there’ll be very little tolerance for anybody who’s inflexible, uh, at the vendor level. Uh, but at the same time, I think that anyone who has got that flexibility shows up, delivers the value, will be back at again, right? We’re selling software, uh, but with just different business models, in fact Uh, speaking about building software, um, one of my favorite moments from, I think, a previous build maybe one or two years ago was they had a b- they, they...
    Swyx: There was a section of you building your [00:24:00] own software. I’m curious if you’re building anything now. Yeah. So I, I think the... You know, first of all, let’s face it, right? Building software has made it possible for even the incompetence of a CEO of a company- ... like ours, uh, you can build, so thank God. But that said, I, I, I, I do feel that, you know, something like, um, GitHub Copilot to me, and especially the new Sessions app or the new app, has just made it so much more possible for you to have agency over artifacts that you felt you couldn’t touch before, right?
    Satya Nadella: So to, for me as a CEO, even to go to a code base, uh, to be able to learn about it, like I remember joining Microsoft long back, you know, first and then you say, man, everybody had to go in and look at, you know, whatever, Cutler’s, Malik, or what have you to learn how to do good C, uh, C++ code. Um, so now that ability to be more full stack up and down is so good, but that doesn’t mean every one of us should be doing the same thing.
    The question is: [00:25:00] how do you then have the ability to inspect things, learn things, see things, um, I think is just so much more. And so to me, what I’m building a lot of is these long-running Foundry agents. Uh, right? So there’s autopilots. So the easiest thing is, to me, I think I just built one, uh, even last week, where the idea was, hey, can I have an agent that is continuously monitoring essentially my own chief of staff autopilot, right?
    We’re gonna have that obviously in, uh, Scout. That’s what, uh, uh, we showed. But it is so easy and trivial to build. I took Work IQ. I said, “Take Work IQ, go, uh, and build a Foundry long-running agent.” Uh, store all the memory in, um, uh, using Ray Fin, right? Basically at my backend as a service. And lo and behold, it built it, and not only built it, I could say publish to Teams, and it published the damn thing to Teams.
    Sarah Guo: So the ability, uh, to have a, you know, some end-to-end project like this complete is just pretty [00:26:00] miraculous. How do you think, uh,
    Future Engineering Roles
    Sarah Guo: that impacts the different types of engineering roles that exist in the future? Because right now I think there’s, you know, a dozen different types of engineers that you can be, from QA, front end, et cetera.
    You know, there’s a big swath. I’ve heard some people argue that in four or five years we’ll basically end up with four engineering roles. It’ll be people who are managing agents, it’ll be four deployed engineers or FDEs, it’ll be security engineers, and then people working on large scale infrastructure for a small number of services, and then everything else just collapses into the agentic world.
    Satya Nadella: Yeah, I- Do you think that’s a correct view of the world? Yeah, I mean, I think, I think we’ll have to experiment our way through it. But what you said is what... There are some very at scale things. At LinkedIn, they did structurally change- Mm-hmm ... uh, and it, you know, basically built up a new discipline called full stack builder, right?
    So they went and said, “Hey, let’s bring, uh, people from design and product management, front end engineering, all put them together.” Uh, but also have an edge, right? It’s not like the design person still doesn’t have the design edge, or the front end [00:27:00] person doesn’t have the front end edge, but you can give yourself bigger scope in roles so that you’re not confined to one role.
    Um, and then r- equally, infrastructure has become very critical, right? So in other words, like, I mean, RLEs, I mean, one thing we’ve realized is even for the Excel team, for example. Mm-hmm. Building the RLE in which a reward can be learned is actually one of the hardest sort of infrastructure problems.
    Mm-hmm. Uh, and so you kind of need even new talent, right? Distributed systems people even in what was considered an end user app team, uh, because it’s a different skill set. So yes, infrastructure, science is the other one, obviously. Um, so I think we’ll see how these evolve, right? Where’s the s- real... I mean, always the world will have a bunch of specialists.
    Okay. Um, you know, I think the generalist role is going to be the most exciting, right? Because the leverage of a generalist- Mm-hmm ... um, is where we are going to see the maximum returns, right? When, when you said, “Hey, are you coding?” I’m now a gen- Like, what... I’ve basically translated [00:28:00] knowledge work Right?
    Which I did, where I created a Word document or a spreadsheet, or even, uh... And now I can build an app, right? It’s in the same sentence. Uh, right? That idea that, “Oh, wow, my generalist skills have gotten higher leverage,” I think is what we’re gonna see across the board. Music to the ears of CEOs and VCs that are, like, a little dangerous and a lot of- Golden age for idea people
    Sarah Guo: idea people. Yeah. Uh- With a lot of agency. I- if you take that idea of personal agency and you just zoom it out to the organizational context, um, uh, my partner Mike Renall, who, uh, actually started his career at Microsoft, just wrote an essay where one of the big takeaways is i- it’s an age where you can be much more ambitious, and you need to be, given the pace of the environment and how quickly, actually, users and companies are open to adopting new technologies.
    Satya Nadella: Um, how do you think about... I, I feel silly asking this of somebody running a, you know, trillion-dollar-plus company already, but
    Ambition & Making the Impossible Possible
    Satya Nadella: how do you think about how Microsoft can be more ambitious now? It’s a great question. Um, I [00:29:00] think, um- I think the, the thing in these type of transitions is to have a conceptual model of how work can change to go after outcomes that you could hardly imagine previously, right?
    In fact, Kevin Scott has this nice line, right, which is, um, when you can make the impossible... Like, when you’re making hard things easier, that’s sort of one point of leverage. But true ambition is about making the impossible possible. So now the thing that is missing a little bit in all of our organizations is what is that new conceptual model of what can we build?
    What was impossible and what can we build? And I’ll give you one example of this, right, which is I take great inspiration from sort of the people who were managing the Azure net- network. And they came to the... This was from even last year. You know, we were scaling. You saw that I, I [00:30:00] talked about sort of how we built in the last 15 months more Azure capacity than we built in the first 15 years.
    I mean, it’s crazy. Wild. Yeah. Right? It’s pretty wild. And it’s the same team. So they saw that and they said, “Bob, this just ain’t gonna work if we don’t reconceptualize our work.” So they built... Essentially they said, “Our job is not to do Azure networking. Our job is to build the agentic system does, that, that does Azure networking,” right?
    These are the folks managing the 500-plus fiber operators managing the VAN, right, all over. And fiber operations ultimately is a physical operation. Things get cut, things get, uh, you know, have to be repaired. You know, we have fancy words called DevOps and so on. Basically, emails are coming in and you gotta go respond to them, take care of it.
    So they built this agentic system. They even have a character for it. It’s called Miles, and it sort of does all this stuff, right? They started sort of screaming for more tokens and so on. And so they were saying, “Look, uh, we don’t need a headcount. We need tokens in order to be able to [00:31:00] manage, uh, our operation.”
    That reconceptualization- Mm-hmm ... of what their work is, right? They, they basically took their work and made it meta, right? That meta work is now their new work. Mm-hmm. Right? In the ‘80s, if somebody had come to us and said, “4 billion people are gonna get up in the morning and start typing,” my model would’ve been, we need 4 billion typists?
    But we’re not doing typing, we’re doing knowledge work. So that, to me, I think is it, right, which is whether it’s Microsoft or whether it’s any organization, is to give ourselves permission to do new types of metacognition, meta work, using these new tools to change the outputs that matter, uh, and then really make the impossible possible.
    Sarah Guo: So completing that dot or the, the connective tissue across those, I think, is where a lot of the enterprise value will get created.
    Data Center Build-Out & Community Impact
    Sarah Guo: Should we talk about data centers? Yeah, please ask. Oh, okay. Well, uh, uh, w- we-- this leads nicely into the data center build-up. I always think, I- I just-- I’m just impressed at the sheer scale of the [00:32:00] build-out from Microsoft, but also everyone else, that this is redefining what it means to be a hyperscaler.
    And I just feel like that, that, that is at unprecedented scale on finances, uh, on the way you run the company, but also the communities that are, that are impacted. Um, yeah, just talk a bit more about what you’re seeing on the ground, like when you visit your- Yeah, I think there are two aspects of it.
    Satya Nadella: Obviously, the, the build-out is, uh, extraordinary. Um, you know, nothing like this has happened, and it’s great to be, uh, one of the participants in it. Uh, but you brought up the other part, right? I think at this point it’s clear that unless we as an industry, uh, are very principled about ensuring that the benefits of all the stuff we’re talking about are felt in real ways, uh, at the community level, right?
    Because this is not just a, a campaign, um, right? It has to be real, where people are saying, “Look, this is not ch- changing the prices on energy for me.” In fact, if anything, it’s bringing down prices because long term there’s going to be a better [00:33:00] grid, there is going to be more energy. Water consumption is, in fact, not sort of, uh...
    In fact, water is being replenished, right? You gotta really, you know, educate folks on truly what’s happening, the cl- uh, the closed loop systems we are building. We have to invest in the training, the jobs, the tax base. In fact, the least talked about stuff is the amount of jobs that get created during construction, after construction.
    What’s the tax base that’s there in the community? And, and all this has to be real. Um, and, and if that is the case, then we will have permission. If it is not, we won’t have permission. It’s as simple as that, right? Which is, uh, we, we... I think we have to take it as an industry pretty seriously. Uh, I think it’s good for communities to be skeptical, ask the hard questions, for us to do the hard work, earn that.
    Um, but at the end of the day, if there’s-- if we can really be the produ-- Wait. I’ve always felt like in human history, if you use a lot of energy but also create a lot of value for society- The story has been fantastic. If you don’t [00:34:00] do that, it’s not been that great. And this time around, I’m a firm believer that ultimately if you do have a token economy that drives productivity, that drives economic growth, that drives broad spread, um, you know, participation, better health outcomes, um, then I think we’ll be in a great place.
    Sarah Guo: Uh, and that’s at least what we all have to be focused on. Yeah. It, it makes me think actually that with all these initiatives that you’re doing, might be e- easier to see ROI in the communities first before in enterprise. Yeah. I, I mean, I think both sides. Yeah. In fact, it comes back together. It has to be the people in the communities are going to be employed, are going to be participants, uh, in the real economy, right?
    Satya Nadella: That’s I think the question is. Like, if we- if the broad economy is doing well and the communities are doing well, the dots get connected. It’s sort of the market forces are such that we will connect the dots. And that I think is it. Like, you ought to be able to see the evidence. You can’t be about o- any one company, uh, but it has to be broad economic growth and broad [00:35:00] ec- you know, community permission.
    Elad Gil: Yeah. I guess I wanna talk about
    Societal Impact & Optimism About AI
    Elad Gil: what you’re most optimistic about currently or what have you most updated your personal models on regarding societal impact of AI? So you’re saying what’s the, the, the- What have you updated most on in terms of societal impact of AI? Yeah. I think the, um, the p- the most, um- Critical thing is the first question we even started with, which is we need to tell the story and make it real that everybody has a real shot to participate as a first-class participant in this new economy.
    Satya Nadella: Right? That’s kind of, I think we- in the next 12 months, 18 months, we need a way for people to say, “Oh, wow, I get it.” Right? There’s going to be tremendous capability, tremendous amount of infrastructure, but I can see what is going to happen, whether it’s the benefits like health outcomes or my ability to create a startup or my ability to run my [00:36:00] local sort of, uh, store more efficiently.
    It’s just happening, and I see that, uh, benefit myself, right? That to me, you know, earning that permission in a path-dependent way, we can’t wait. See, the one thing, Eli, that I’ve now learned is I think the world is gonna be very skeptical of tech and tech companies that say, “Trust us, we’ve got it. The g- future is gonna be glorious.”
    Sarah Guo: Uh, you kind of have to deliver tangible benefits. Um, and quite frankly, politicians winning elections, uh, because they have advocated for that. That will be at least my adjustment because without it, um, thinking that somehow... Because it’s too important this time around. It’s too much of the economy for it not to be the case So one very simple framework I have for, you know, what are, what is gonna be the broad benefit of AI, um, beyond the communities just working in technology, are, are sort of wealth creation- Yep
    it’s [00:37:00] gonna happen in a ton of different companies, startups and large companies. Then you have healthcare. Uh, you, you had amazing demos today. There are companies like Open Evidence. I think that is happening. Um,
    Education & Future of Learning
    Sarah Guo: education seems like another one that’s an- Yep ... obvious good where we haven’t seen as much impact as I’d expect.
    Swyx: Do you have a hypothesis on why that might be, or if it’ll come? Yeah, I mean, I think this is where, again, how we think about education, how... You know, recently I met with, uh, the founders of Alpha School and learnt a lot about what they were going and going about, and it’s fascinating to listen, uh, to how to even rethink- Mm
    Satya Nadella: uh, what does education really look like. Because I think it’s actually very important. Mm. Uh, and I’m not saying anything traditionally being done is less important, right? I was even looking at the, uh... It’s fascinating to see. I, I, I forget the which Stanford class it was, uh, the, the Asian guidelines for CS something.
    Mm. Uh, because you still need people to learn. Uh, like it was an interesting AI class that they were making sure people were learning how to apply softmax appropriately versus saying, “Hey, fix my training run.” Mm-hmm. Uh, so I think learning concepts is important. It’s going to [00:38:00] be, uh, critical. But the way we create the incentives, what are the credentials, how we value those credentials, what is the employment opportunity for those credentials?
    So I think that there’s a complete change that has to happen, uh, given the way to get to information, way to educate yourself, way to continuously keep yourself updated has changed so much. So I think interestingly enough, maybe the next big startup and success story could be someone who builds a new university, um, or a new, um, pedagogy even of how to get someone to go through a curriculum and find economic opportunity, uh, that’s highly valuable.
    Well, that has felt, uh, perhaps impossible for a long time, but it’s a great note to end on and something that might be possible. It’s still possible. Yeah. Thank you, Satya. Thank you so much. Thank you. Yeah. I appreciate it. Thank you all.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
More Business podcasts
About Latent Space: The AI Engineer Podcast
The podcast by and for AI Engineers! In 2025, over 10 million readers and listeners came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you both the definitive take on the Current Thing down to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, Anthropic, Gemini, Meta (Soumith Chintala), Sierra (Bret Taylor), tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space www.latent.space
Podcast website

Listen to Latent Space: The AI Engineer Podcast, Stretton Unfiltered and many other podcasts from around the world with the radio.net app

Get the free radio.net app

  • Stations and podcasts to bookmark
  • Stream via Wi-Fi or Bluetooth
  • Supports Carplay & Android Auto
  • Many other app features