PodcastsBusinessLatent Space: The AI Engineer Podcast

Latent Space: The AI Engineer Podcast

Latent.Space
Latent Space: The AI Engineer Podcast
Latest episode

277 episodes

  • Latent Space: The AI Engineer Podcast

    Why Video Agent models are next — Ethan He, xAI Grok Imagine

    01/06/2026 | 1h 43 mins.
    We’re announcing AIEWF speakers this week! Take the AI Engineering Survey!
    Today’s guest Ethan first joined us for the LS Paper Club as the lead on NVIDIA Cosmos World Model, but then joined xAI and built Grok Imagine in 3 months:
    He comes back on Latent Space with some nuclear hot takes: that Video Models primarily get their intelligence from LLMs, not from training on video data, and that the next frontier for truly interactive, realtime, long-horizon world models is to work on LLMs (perhaps Interaction Models as well…)
    Put it this way: In the near term, the next Sora won’t be a better video model, but a video agent.
    Generative Media may more closely follow the evolution of AI coding which went from focusing on one-shot output performance and cost, to multiturn reasoning and planning models for agents and systems that can plan, edit, test, debug, and submit PRs.
    At a certain point, coding models got so good that the only significant next step to improve performance was handling the orchestration of these models.
    Now as the performance of video models increases significantly across realism, consistency, & prompt adherence while becoming more cost efficient, the next evolution of video generation may also be systems that can plan, generate, edit, critique, and iterate across an entire creative task.
    In this episode, Ethan joins swyx and Vibhu to unpack what it actually takes to build frontier image and video systems: data, VAEs, diffusion transformers, audio-video alignment, inference speedups, and the hidden cost of storing and moving massive video datasets. From building NVIDIA’s Cosmos world model to joining xAI as Grok Imagine was being built from zero to one, Ethan He has been at the center of some of the most important work in video generation, multimodal models, and real-time world models.
    We go deep on Grok Imagine, how a small xAI team shipped its first multimodal video model in three months, why iteration speed matters more than almost anything in model development, and why many of the biggest gains come from fixing tiny bugs in data and training pipelines.

    Flipbook: The future of Videomaxxing
    Video agents are almost a sure bet to be the trend in the coming year. We end with a glance at what’s beyond video agents:
    Flipbook caused a minor sensation this year when it was released, but most treat it as a fun demo. Ethan takes it very seriously — with the speed and cost of inference coming down every year, the future of custom video JIT UI is closer than you think. We talked about why videogen models may become the front end of AI, how generative UI could replace traditional HTML/CSS, why world models need to be real-time, interactive, and long-horizon, and why the future of video generation may depend more on language models and agents than on diffusion alone.
    We discuss:
    * Why fast iteration mattered more than meetings
    * Why small training bugs can drive huge model quality gains
    * Why coding models may make compute the bottleneck again
    * How image and video models are trained with synthetic captions
    * The role of VAEs and latent space in frontier video models
    * Why image models are the foundation for video models
    * The tradeoff between temporal compression and real-time interactivity
    * Flipbook, Neural OS, and the future of generative UI
    * Why future interfaces may go from user intent to pixels
    * The hidden cost of training video models: storage, egress, and GPU hours
    * How step distillation and consistency models (like OpenAI sCM) makes video inference orders of magnitude faster
    * Grok Imagine 0.9 and large-scale audio-video generation
    * Why audio-video alignment is harder than text-video alignment
    * Ethan’s definition of world models
    * Reference-to-video, video extension, and long-context video generation
    * Why xAI’s research communication undersells Grok Imagine
    * How xAI culture shaped the speed of development
    * AI watermarking, SynthID, and detecting generated media
    * Why prompt rewriting matters for video models
    * Grok Imagine Agent and the rise of video agents
    * Why language models may unlock better video generation
    * Robotics, physical AI, and embodied world models
    * Why Ethan left xAI and shifted focus toward LLMs
    * Self-managed context, memory, and the next frontier for language models
    Ethan He
    * LinkedIn: https://www.linkedin.com/in/ethanhe42
    * X: https://x.com/EthanHe_42
    Timestamps
    00:00:00 Introduction
    00:01:25 From NVIDIA Cosmos to xAI
    00:03:24 Building Grok Imagine from Zero to One
    00:10:07 How Image and Video Models Are Trained
    00:18:53 Video Compression, VAEs, and Real-Time Tradeoffs
    00:22:10 Generative UI, Flipbook, and Neural OS
    00:32:10 The Cost of Training Large Video Models
    00:37:04 Distillation, GANs, and Fast Video Inference
    00:41:21 Audio-Video Generation and Grok Imagine 0.9
    00:48:34 What Makes a World Model?
    00:55:51 Reference Videos, Long Context, and Video Memory
    01:00:11 xAI Culture, Research, and First-Principles Building
    01:09:45 AI Safety, Watermarking, and Prompt Rewriting
    01:13:10 Video Agents and AI-Assisted Creation
    01:27:32 Why Language Models Unlock Better Video
    01:31:15 Robotics, Physical AI, and Embodied World Models
    01:32:38 Why Ethan Left xAI
    01:34:16 Self-Managed Context and the Future of LLMs
    01:38:43 Ethan’s Career Path and Closing Thoughts
    Transcript
    Introduction: Ethan He, Latent Space, and the Path to xAI
    Swyx [00:00:00]: We’re here in the studio with Ethan He, most recently of xAI. Welcome.
    Ethan [00:00:10]: Thank you. Glad being here.
    Swyx [00:00:11]: We’re also here with Vibhu. you were first coming to us or joining the latent space world because you were working on Kosmos at NVIDIA, and you did a paper. We loved it. you presented it as well, so thank you for doing that.
    Ethan [00:00:23]: I’ve actually, I also presented the MoEs twice at latent space.
    Swyx [00:00:29]: How did you actually hear about us? Did we reach out to you? Is that how it worked?
    Ethan [00:00:33]: No, actually, I-- the community. Like I realized, oh, there is this online community that people talk about AI and also learn from each other through papers every week through the Paperclip. It’s very nice.
    Ethan [00:00:49]: I learned a lot.
    Swyx [00:00:49]: I think three years stop. We haven’t stopped even on Christmas and New Years. many weeks I want to stop but it keeps going.
    Vibhu [00:00:58]: No, that was good. I think you had posted that you worked on a paper, and I was “Oh, very cool. We have Paperclip. Present then.”
    Vibhu [00:01:04]: But I might have reached out to you after.
    Swyx [00:01:05]: you-- because it’s an amateur club, right?
    Swyx [00:01:08]: so it’s very unusual and but we have sometimes paper authors come by and actually explain the paper. Today we just did, the poolside paper, which was apparently very good.
    Vibhu [00:01:18]: Came out yesterday.
    Vibhu [00:01:19]: pretty interesting, right? Fully open. They talk about everything, systems. So it’s a good one. We’ll, we’ll recommend people to read it.
    Swyx [00:01:25]: Bring us up to speed on your transition to xAI, ‘cause I actually don’t even know when you joined. just like tell the, tell the story about the sort of transition.
    From NVIDIA Cosmos to xAI: Scaling Video and World Models
    Ethan [00:01:34]: Before xAI, I was working on Kosmos world model as in-- at NVIDIA. So Kosmos is, it’s a giant video foundation models that can-- that aims to simulate the world and for-- it serves as a foundation of-- for all of the roboticists to build on top of. There, once I built the Kosmos one, I realized as this thing also has a scaling law similar to language model, we need to scale up the video models further. that’s, that’s why I realized I need to move to somewhere with much more compute resources. That’s how I
    Swyx [00:02:13]: Than NVIDIA?
    Vibhu [00:02:14]: The GPU rich came themselves.
    Vibhu [00:02:19]: And timeline-wise, when was Kosmo? It was pretty early, right? It was open world model, open paper, everything.
    Ethan [00:02:25]: It was end of twenty-four.
    Vibhu [00:02:28]: End of twenty-four.
    Ethan [00:02:30]: Then at mid twenty-five, I moved to xAI. At that time-- I joined about the time when xAI was about to build video models and in multi-model models. There were no infra, no data, and no model, and it just-- as a few engineers, we built it in three months and released the first model, Grok Imagine zero point nine.
    Ethan [00:02:55]: And since then, I keep working on video models and move more from training and to post-training of the video models. For example, like a reference to videos, kind of like the cameo feature and, video extensions. And, before I left, I worked on a world model, leading a small team to focus on the real-time long horizon video generation.
    Building Grok Imagine From Scratch in Three Months
    Swyx [00:03:24]: Can you give like a rough roadmap of okay, you’re on a brand-new team. Grok previously was only text, or they partnered with BFL for their image gen stuff. What do you-- what are the building blocks, right? You have compute, data you can procure somewhere. Like just what are like the sequence of things that people should think about when you’re setting up a new team?
    Vibhu [00:03:43]: actually even deeper, not just data you can procure. You guys had to go through getting the data too, right? So you shipped it pretty fast, but yeah
    Swyx [00:03:51]: three months is like
    Vibhu [00:03:52]: From everything
    Swyx [00:03:52]: actually like very surprisingly fast.
    Ethan [00:03:55]: One thing I say like thanks to my experience at NVIDIA, ‘cause first time when we were building Kosmos together, we built it, for about a year. So this is like the second time I do it. Roughly have an idea, what to do. I say the most important thing is the talent. Everyone were very strong and clever, very close with each other towards a common goal. So that speed up things a lot. So you reduce the communication bandwidth among people, and everyone can work towards the same goal. It’s, it’s like every day there’s not that much meetings on the calendar, like maybe like a, like a sync a day, and after that it’s, it’s just all building. It was pretty fun at that time.
    Ethan [00:04:47]: And another thing is that xAI has very strong foundations of like data inference, model inference, and the supporting there can help the model develop a lot. When I look at, training models, I don’t so actually the top important thing is like how many, how many iterations can you do, per day? and the more iteration can you do, you can, you can train the model much faster. So if you have very strong infra and you have a lot of compute, you can, you can train these models in very short period of time. That can give you a much larger buffer to, for errors, and it also gives you the opportunity to spot more bugs.
    Iteration Speed, Compute, and Debugging Model Pipelines
    Swyx [00:05:46]: What is an iteration? Is it like a few hundred steps or what are you
    Ethan [00:05:50]: Let’s say just the train-training the model, like from acquire new data and maybe design new algorithms and train a new model, maybe at smaller scale or
    Swyx [00:06:01]: So cycle time for like any hyperparam that you’re searching.
    Ethan [00:06:04]: Cycle time and tune to like eval this model. Is this model better than my previous iteration?
    Ethan [00:06:11]: So
    Swyx [00:06:11]: So it’s like before you, someone had already set this up that you can iterate very quickly.
    Ethan [00:06:15]: I think the foundation there is extremely good forDeveloping and research models.
    Ethan [00:06:23]: And often I find is it-- this is kind of boring, but like a lot of the improvements does not come from new algorithms. It comes from finding small bugs here and there in the data pipeline, in the, in the model training pipeline. Those give, those give the biggest boost to the model quality.
    Vibhu [00:06:46]: It’s interesting, right? So you say it’s like small team, less communication bandwidth, but also a lot of quality is like find little bugs. It seems counterintuitive, right? You have a lot of people, you can iron out more of those, but it’s interesting to see the other side, right?
    Swyx [00:07:00]: I also wonder, have you-- do you try using LLMs to look for bugs? I don’t know.
    Ethan [00:07:05]: I remember at that time it was mid two thousand and twenty-five, so it’s the coding model wasn’t quite there yet. I remem- I remember like December two thousand and twenty-five, it was extremely good. Yeah, I’ve been, I’ve been using it at that time. It’s, it’s helpful. sometimes it produce codes that are kind of difficult to maintain, even though like the first time it built something extremely fast. But it gave the, like a spaghetti code, thousands of lines that I couldn’t maintain, and the LLM itself couldn’t figure out what’s, what’s wrong and how to improve on top of it. But now I find it much better. Yeah, I want to bring up another point here is now coding models are much more efficient and can help us implement stuff much faster. Compute might become a bottleneck again because previously, like if you want to train a new model, say you want to generate new synthetic data and then or write a new algorithm, it might take a few weeks. And during that period of time, you don’t-- you might not have experiments to run. But now you can build that thing within a few hours, then you can immediately train a model.
    Ethan [00:08:24]: Now you have to have enough compute to try all of the ideas. So compute might be the bottleneck of iterating speed again.
    Swyx [00:08:36]: yeah, I actually, honestly, I think it’s like kind of a stressful job because you’re “Well, I should be trying everything, and if I’m not, then I’m not doing my job well.”
    Vibhu [00:08:48]: there’s also the stress of you’re eating thousands of GPUs per hour, which is very expensive and, compute can go to other researchers.
    Swyx [00:08:56]: You got the daddy Elon to
    Vibhu [00:08:57]: You got daddy Elon.
    Ethan [00:08:59]: It was
    Vibhu [00:09:00]: But there’s still finite amount of compute, like you want to use it, you want to use it well, you want more of it.
    Ethan [00:09:06]: That was quite stressful indeed. Yeah, I think one thing is the-- with coding models now, like a lot of these jobs can be automated, which is much better. A second, it’s a, it’s a marathon, so you got to maintain good health and, a regular schedule.
    Vibhu [00:09:28]: It’s, it’s hard to hear that when you shift from zero to nothing in two months.
    Swyx [00:09:32]: and, I think obviously the culture at xAI is very famously, people work very hard. one thing I did want to dive into, in our-- in the notes that you, that you sent ahead of time, you had specific comments about the cost of Video Gen training. presumably this is on the Colossus-1, right? the two hundred megawatt cluster. Any whatever you want to just share on that.
    Vibhu [00:09:54]: I think there’s, there’s three things we’re talking about, right? So there’s Video Gen, there’s also the Image Gen model that you put out. Do you want to like complete the, okay, so zero to one, you have a few months. Just what are the stages of create Image Gen model?
    Swyx [00:10:06]: Oh, yeah, maybe I got distracted.
    How Image and Video Models Are Trained: Synthetic Captions, Tokenizers, and VAEs
    Vibhu [00:10:07]: Sorry. and then, from there’s Video Gen, there’s Audio Gen. Would love to get into those next. But what is that first few months like? So small team, a lot of bugs, iterations, but what does it look like? Do we take something off the shelf? Do we just get data compute? What’s, what’s the few months like? How do you go to state-art Image Gen model? How do you just start?
    Ethan [00:10:28]: I cannot comment specifically how xAI did, but it’s, it’s a quite standard process. I can draw some, examples from Cosmos. So mainly it’s building a video model, you actually need to build a image model first. And building these two models, the data you need is a hundred percent synthetic pair of language and image or language to video. Because on the, on the internet, actually, the videos don’t naturally associate with text. So you can say, oh, like on YouTube, you have the title and you have the description and the comments
    Swyx [00:11:11]: Title
    Ethan [00:11:11]: of a video, but usually they’re not relevant to the video itself. And say maybe like the video is a natural scene of mountains or something, and the title is, I’m so happy today.
    Ethan [00:11:26]: So they have they have no correlation at all. So the first step is to, you have to generate synthetic pair of language with the videos. So you gather videos from the internet, and you use a VLM to caption the videos. So that part, here’s a question, like how do you, how do you gather VLM to begin with? So if there’s no
    Swyx [00:11:55]: You, so you fuse the model, right? Like
    Ethan [00:11:57]: Say if there’s no like VLM exists, like how do you generate the text to the beginning, right? It’s, it’s impossible.
    Swyx [00:12:04]: I see.
    Ethan [00:12:05]: In the beginning, it’s like you ask human to describe the video as detailed as possible.For example, you ask them to describe everything, like all objects, all characters, and all interaction and dialogues in the, in the videos. So that’s in the protocol of Cosmos labeling. We require the objective we give to the labelers was that you have to describe the video as detailed as possible, such that a blind person hears a blob of text can reconstruct what the video is like from their head.
    Swyx [00:12:43]: Video or image? You’re talking about images.
    Ethan [00:12:44]: Video or image, either one of them.
    Vibhu [00:12:47]: This was pretty common when we went from clip and DALL-E, right?
    Vibhu [00:12:51]: It’s all training on really detailed captioning of images. So same is applied to video, but instead
    Ethan [00:12:57]: same applied
    Vibhu [00:12:57]: of using multimodal model to pass in video images and write rich descriptions, you can also
    Swyx [00:13:04]: I think there’s this traditional perspective of supervised, or, very highly human curated thing. I feel like there’s a unlock with unsupervised, right? Where like you have enough to bootstrap that you can just throw common corpus on it or, whatever. like unsupervised vision and language pairing, right? Like where you just have, interspersed image and text and it just learns. To me, that is the VLM breakthrough that is different from the clip, different from the LM era.
    Ethan [00:13:36]: It’s interesting to see that you kind of need both data.
    Ethan [00:13:41]: For example, for the
    Swyx [00:13:41]: You need it to bootstrap it up. Yeah
    Ethan [00:13:43]: for the generative model training, there’s also usually like a small percentage of unlabeled data. So the model is instructed to generate a video without any text instruction. That can also help the model generalize. So after this stage of generative synthetic pair, so, one important common step is to train a compressor or a tokenizer of the image or videos. So because, if you train-- If you can technically, theoretically train image or video models on pure pixels, but the problem is that the, it’s, it’s a lot of tokens. So like one image, it’s, a thousand by a thousand, it’s like one million tokens, one million pixels. It’s impossible to train transformer on that. So it’s, you need to train a tokenizer, which can go from image to latent space and latent space back to image.
    Swyx [00:14:45]: That’s why we named the podcast.
    Swyx [00:14:48]: But, basically, you’re talking about vocabulary science.
    Ethan [00:14:50]: so vocab.
    Swyx [00:14:51]: And so, what is, what is imp-- like a million is impossible?
    Ethan [00:14:54]: In generative models, the vocab is continuous. It’s a continuous space. We can think about like you map an image to a vector. It’s a, it’s a fixed length vector. It’s sixteen or forty-eight, something like that. And then you map that vector back to the image space. And the mapping is, has-- The mapping is patch-based. So you say you have
    Ethan [00:15:22]: a sixteen by sixteen patch and you match, you map that patch of pixels into this latent space.
    Swyx [00:15:29]: We’ve covered this
    Vibhu [00:15:30]: This is like the vision transformers
    Swyx [00:15:32]: VAEs,
    Ethan [00:15:33]: VAEs.
    Vibhu [00:15:34]: You basically compress your input, you do your generation, you’re reasoning all that generation in smaller dimension, and then you project back out.
    Swyx [00:15:43]: VAE is a form compression, but I think the for me, the patching thing is from VIT, right?
    Ethan [00:15:48]: You can make those.
    Swyx [00:15:49]: Literally the, yeah, the paper is titled like sixteen by sixteen is all you need. something like that. and then I think also, people make a lot of comparisons with this kind of patching with convolutions.
    Swyx [00:16:02]: Which is you’re, you’re kind of re- reconstructing the old paradigm with the new.
    Ethan [00:16:05]: Actually, in VAEs, there are, there are both convolution networks and transformers. You can actually do both.
    Ethan [00:16:14]: After this VAE, so what you’ve got is you’ve got latent space tokens and you’ve got the language tokens. So now the training of the diffusion transformer, usually generative models use diffusion transformers. It is actually quite standard. It’s, it’s very similar to how you train a language transformer models. It’s not that much difference. It’s just the tokens, the visual tokens in, visual tokens out. The only difference is there’s a denoising process. So you train the model to unmask some of the noise. So you add, you add random noise to the visual tokens, and then you train the model to remove those noise to generate the clean tokens. Any inference, the model can iteratively remove noise from a hundred percent noise.
    Swyx [00:17:12]: And then there’s also, to speed things along on the tech tree of diffusion, there’s CFG, and then there’s, there’s also, latent diffusion that, there’s, there’s someone in there. I think, somewhere along the line, obviously, like stability and all these other guys, pioneered a lot of this, architecture. I don’t know if you want to get into that or just, or do the video side up to you.
    Bootstrapping Video from Image Models and Temporal Compression
    Ethan [00:17:37]: After you train such model, such image model, the reason it’s a, it’s a foundation for video models is that image models are cheaper to train, and they have much denser connection between language and text. So, sorry, language and images. For example, you train a billion, you train on a billion images, and there’s a mapping from the text to the image. And the cost to train the same, like the, a billion, a billion text to a billion videos, that’s much more expensive because videosNaturally have more tokens than images. Because the diffusion models, their understanding of, language purely come from this mapping. So if you don’t have enough mapping, so if you only train on like a ten million videos or something, there-- you might not see enough language tokens in your training, so your model does not understand human intention enough. So that’s why you really-- you train-- you first train this image diffusion models, and then you bootstrap the video model from there.
    Swyx [00:18:53]: One thing I did want to ask, because I-- actually, I think you’re, you’re the first per-- video model person I’ve ever talked to, I think. we’ve, we’ve like talked to Luma and all those folks. There’s all these tricks in video compression where basically frame by frame there’s not that much difference, so actually you don’t have to regenerate or save the whole frame, right? but I think MP4 compression or something else like that.
    Swyx [00:19:16]: is it tempting to use that? Or as far as I can tell, everyone just treats it as, “No, we would just generate every frame.” Is that roughly the state-art?
    Ethan [00:19:27]: There are a few different approaches. Let’s say first, like you want to just directly use MP4 compression and use that as the tokens for the transformers to train, right? So people actually have tried that, but the main challenge is the latent space for the MP4 tokens were not, were not very comprehensible for the models. It’s, it’s extremely hard to train on that. And there’s a
    Ethan [00:20:01]: So that’s why they created VAEs, which creates more continuous, latent space, so the models can understand that latent space and learn from it much easier. Even within the VAEs, there are different difficulties of the latent space. So you can imagine something the simplest, the most naive VAE is like you have an image, and you just shuffle all of the images into a, into a vector. So you don’t need to train any VAEs, right? But that latent space is extremely hard for models to train on top of. That’s why there are some debate on like how do you compress the tokens. So you mentioned like you can compress frame by frame. Also, you can compress, the temporal dimension.
    Ethan [00:20:52]: The difference is if you compress the temporal dimension, you get a much higher compression rate. Because there’s temporal redundancy between frames, because, this frame and the last frame, likely they are mostly similar, so there’s only some small difference. for example, I think in 12.1 VAE, they have like a eight by eight by four compression rate. So the four temporal tokens are compressed into one tokens. That can save a lot of, save a lot of the context length. If you do it frame by frame, you have to do maybe like eight by eight by one. Your context length will be four times larger. That being said, the benefit of the frame-- per frame compression, we might come back to this later, is, real-timeness and interactivity. ‘Cause if you, if you strain the output of the model, frame by frame, you can-- the model can respond to any user request immediately. So if you have like a temporal four compression, four times compression, then
    Swyx [00:22:06]: It might be laggy
    Ethan [00:22:07]: there’s a lag there in nature.
    Swyx [00:22:10]: So you’re very pilled on this. let’s just go ahead and bring it up ‘cause we have the visual prepared anyway. There’s some frontier applications of real-time video gen. So Flipbook is one of the examples that went viral recently, right? What is Flipbook?
    Real-Time Generative UI: Flipbook, Neural OS, and Diffusion Front Ends
    Ethan [00:22:23]: Flipbook is kind of like a web brow- web browser. You can see like it has the web bro- browser UI on top. The difference is all of the UIs are generated by generative image model in real time, and anything here are fake. But you can, you can explore inside this wor- this imaginary world. Say like we-- here we have engineering the Great Pyramid. Like the model generates this for us to understand how it works, and if we want to navigate around and understand further, we can click on some of the, some of the description here, and the model will generate a new page, new subpage describing the details we want to know about.
    Swyx [00:23:14]: So it’s basically kind of we’re playing a video, but it’s pausing for our next interaction, and then it just plays the next thing based on our interaction.
    Swyx [00:23:23]: Which is kind of cool.
    Vibhu [00:23:25]: and you kind of decide your story. So this was, how do you make a pyramid? levering technique seemed interesting, right? It shows how do you take Okay, I want to know what is this
    Swyx [00:23:35]: The demo, the demo tweet had more animation between frames.
    Vibhu [00:23:38]: I think it’s just skipping,
    Swyx [00:23:39]: Oh, it’s just skipping a lot of frames.
    Ethan [00:23:40]: they also have a video mode
    Vibhu [00:23:42]: It takes a lot. There’s a lot of people
    Ethan [00:23:42]: but, a lot of people are using it.
    Ethan [00:23:45]: So it’s not available.
    Vibhu [00:23:46]: There’s a live video stream. We can try,
    Swyx [00:23:50]: So this is an example of the kind of future that you see at the extreme. We don’t-- we’re obviously not in it today.
    Swyx [00:23:56]: But in a world where inference is completely free this is better than generating code and text?
    Ethan [00:24:02]: So this is, this is a final state of where Viva will be at for word model, I think. Imagine internet doesn’t exist, and then you type in google.com. Like what should, what should, what should a model show you?the model can imagine something, and this is what the model imagine. And these web pages, they completely do not exist. So I think as the inference costs come down, we are going to have generative UI for everything. If you think about how the coding model works, so they write code for a web page, and they render the code might be con- converted into binary, and the binary render the pixels on the screen. So we in machine learning, every time we have some breakthrough, obviously it’s, it’s more intuit. So why don’t we have like user instruction to the pixel directly? So the generative UI will be user intention to the pixels directly. And say like even if I want email, let’s say everyone have the same interface, but I want, I want it slightly different. I want the email to show to me like a TikTok, so I can swipe left and right for the emails. And or maybe you want something else. We can have completely different things. Or like I have I’m looking at, Instagram stories, and I don’t like the Like button. I always may click it. And, generative UI resolved it. So it’s going to be a revolutionary replacement of the interface. So in the future, we might have much more powerful
    Ethan [00:25:50]: LLMs and coding models running behind the scene. And in the, in the front-end, the diffusion model will actually be the front-end to show stuff to you. That’s how I imagine it.
    Swyx [00:26:02]: Diffusion front-end, deterministic back-end.
    Swyx [00:26:04]: Something like that. I find that very expensive, but,
    Vibhu [00:26:08]: I find it interesting you called LLMs writing code on the back end deterministic, but okay.
    Swyx [00:26:14]: you write it once
    Vibhu [00:26:15]: Compare it to
    Swyx [00:26:16]: And then you execute.
    Ethan [00:26:17]: If you think about the cost, say, let’s say H100 costs $1 per hour, and if you use this eight hours a day and thirty days, so, every month you’re paying this two forty, you’ll actually not wanna pay for that. That’s even more expensive than Cloud Code Max. But if you think about the compute costs come down like two times every year, and I think the future will likely arrive like within few years.
    Vibhu [00:26:49]: It’s everything, right? compute cost comes down, compute gets faster, model gets smarter
    Ethan [00:26:54]: More efficient
    Vibhu [00:26:54]: model gets smaller.
    Swyx [00:26:55]: I don’t know why you say two times, ‘cause I think it’s like 100 times. In language models, it is roughly one hundred to a thousand times every twelve to eighteen months, for the same given level of LMSys, ELO.
    Vibhu [00:27:08]: That’s a net of everything, right? That’s model performance alongside compute. So different than just compute costs come down. But, a very interesting future.
    Swyx [00:27:19]: So the web designers will have to shout out that accessibility is an issue, right? how do you deal with screen readers or whatever. But yes, this is higher bandwidth storytelling than anything you can possibly generate with code, right? So I think that’s the rough idea.
    Ethan [00:27:34]: And I’d like to add a little bit that so human naturally have the maximum bandwidth when we are looking at things, look at videos, and we also have maximum output bandwidth when we are talking. So in the future, it might be something like we talk to AI models, and the AI model responds back with a generative UI. So that would be the maximum input and output bandwidth to interact with AI models before neural link happens.
    Vibhu [00:28:06]: And it’s also very custom, right? Some people are very visual, some people are not as visual, right? They prefer the text. But the best thing about generative UI, right, it can also be text.
    Swyx [00:28:17]: There’s another project that we wanted to highlight, which is the Neural OS. Kinda similar idea, but here you’re literally operating, simulating an operating system with a video model.
    Swyx [00:28:27]: and you can play Doom, you can do Firefox. I find this like mildly less impressive, obviously, because it’s an OS that I can run.
    Swyx [00:28:37]: But here everything is imagined.
    Vibhu [00:28:40]: I was, used to the Command+W to close the Firefox tab. It didn’t crash. That’s why I said
    Swyx [00:28:45]: It’s too immersive.
    Vibhu [00:28:46]: It’s, it’s too immersive for me.
    Swyx [00:28:47]: Too immersive.
    Vibhu [00:28:48]: I wanted to close the tab.
    Vibhu [00:28:49]: But yes, I can play generated diffusion.
    Swyx [00:28:51]: this is shockingly fast.
    Swyx [00:28:54]: Because I remember there was a demo about like maybe one to two years ago. Someone tried to do the first-person shooter with a image model. There was no consistency. It was very slow. But here it looks like realistically it’s-- this is Doom.
    Vibhu [00:29:07]: I think there’s two sides to that, right? There’s okay, what is running a game? The heavy part of it is actually the game engine, all the lighting, all that stuff, the graphics. This is just kind of video, right? Like we’ve solved consistency. This is still, it looks like a few years old image generation. There’s some temporal consistency, but it’s, it’s kind of just images stitched together as frame video. But it’s a good visual representation to pi- to picture the future you wanna see, right? that’s, that’s what I see in these more so.
    Ethan [00:29:38]: This reminds me of how the video models gets better and better. So Neural OS is kinda if you just look at it feels like it’s just a crappy version of the, like the Windows we could have, right? And, but the difference is, so the model, this model is overfitted on the existing operating systems. It can generate nothing different than that. But it’s actually also similar to video models. So when we are training these video model, image model, we train them on internet. There’s no imaginary supernatural stuff on the internet. But once we train this model, you can prompt the model to generate something supernatural that have never existed in the data set. So if you train your Neural OS or neural computer on the standard screen recordings on the entire internet. The model can imagine completely new interface to interact with the computer.
    Swyx [00:30:43]: This is one of those things that is magical to me. usually generalizing out of distribution is bad, but somehow we have learned some kind of internal world model that you say, this plus, but it looks like rainbows and butterflies, it’ll do it and it will kind of make sense.
    Swyx [00:31:03]: So yeah, that’s kind of cool. Yeah, I don’t know if there’s any comment more on there. I do, I do wanted to, I did wanted to touch a little bit more on the model architecture stuff, which I think you were getting. It’s, really fascinating. We don’t get a chance to talk about this enough. So one of the papers that we covered, we’ve covered every annual, segment anything release. and I don’t know if you follow-- you’re a computer vision guy, so you
    Ethan [00:31:26]: I know
    Swyx [00:31:27]: . So they did memory attention, which is kind of interesting. And I always think, anything where you can, across the temporal dimension, keep some consistency, I think it’s, very fascinating, and I don’t know if Basically, does that-- the CV side bleeding into video gen side, I think is underexplored, right? we talk about it for labeling, but actually you can borrow the architecture itself.
    Ethan [00:31:50]: There’s, there’s also complete different approaches, right? you brought up the term world model, so we went from video model to world model. There is diffusion, but there’s also other approaches that people are doing. So maybe we get into those after as well,?
    Swyx [00:32:03]: He has a whole definition of world models and stuff. I feel like we threw a lot at you. Whatever you want to comment on.
    Why Video Models Are Expensive: Storage, I/O, and Training Scale
    Ethan [00:32:10]: I think one thing that we should actually comment back on is okay, so we were talking about the steps to train image gen to video model. One thing we don’t see as much of is okay, you brought up the delta in training data, right? So
    Ethan [00:32:24]: you won’t have as much a video model might not generalize, but what is the cost of training a large video model? So we know for LLMs roughly, okay, even like the poolside thing that came out today, right? It’s a Gemma level model trained on roughly forty trillion tokens at this many H200s over this much time, right? You can see what is the exact cost of that. So how many GPU hours over how much H200 costs? So how do we do the back-end math of, same thing for video models, image models. How do you, how do you kind of break that down? I can share some back-envelope calculation. So surprisingly, video models is-- the cost is very-- is comparable to language models and obviously the largest scale is language model, maybe like a medium scale to language models. I said just storing the videos alone, it costs a lot. You can, you can maybe look up on AWS or something.
    Ethan [00:33:20]: You really, say if you have a billion videos and let’s say, let’s just say like each video, like five megabyte, then you need five petabyte to just store those videos. And also remember we talk about you use a VAE to compress the videos, and you also need to store, typically you need to store those continuous feature, in-- also in your storage. That’s also comparable size with the videos themselves. So just storing these videos and the features is tens of petabytes alone. And,
    Swyx [00:33:58]: I just, I just looked up the calculation. Five petabytes on S3 Standard is one hundred K per month.
    Ethan [00:34:05]: And
    Swyx [00:34:05]: It’s comparable
    Ethan [00:34:05]: and you need
    Swyx [00:34:06]: And
    Ethan [00:34:06]: And then like tens of petabytes, two hundred K. And even more expensive is you have the ingress and egress.
    Swyx [00:34:13]: Oh, yeah.
    Ethan [00:34:14]: Like you-- through the internet. You have to just to download those videos, I believe it’s, it’s more expensive on AWS than just storing those videos.
    Swyx [00:34:25]: Storing, yeah.
    Ethan [00:34:25]: And each training runs, you probably need to pull them once. If you train multiple times, it’s, it’s even more than that. So it’s like just storing the network, those costs is just, it would be a few, a few millions per month to just storing everything, not to mention the GPU cost.
    Ethan [00:34:45]: And
    Swyx [00:34:45]: my side tangent, the compute rental, like GPU rental is very efficient. There’s one side, okay, you can be XAI and build your data center. Should we not just build our, storage compute as well? Like
    Ethan [00:34:57]: Of course
    Swyx [00:34:57]: cloud cost compared to just,
    Ethan [00:34:59]: You save so much
    Swyx [00:35:00]: store. Yeah, exactly.
    Swyx [00:35:01]: Especially with like egress and stuff. So.
    Ethan [00:35:04]: That’s a good idea, but it also comes to-- there are some of its own challenges.
    Swyx [00:35:09]: Of course, of course.
    Ethan [00:35:10]: like people who build the GPU data centers, they might not expect this much, storage. And yeah, people build storage, typically they just build it somewhere with just CPUs.
    Swyx [00:35:23]: I just looked it up. Five-- AWS only charges for egress, not ingress. Tier five for five petabytes is two hundred and thirty K.
    Ethan [00:35:32]: Even more expensive than the storage.
    Swyx [00:35:34]: But storing is per month, right? You check in, then you cannot check out. so it’s so cool. It’s okay. So there’s that side.
    Ethan [00:35:41]: So the TLDR, my backhand math
    Swyx [00:35:42]: Data is larger than you think. Yes.
    Ethan [00:35:44]: my backhand math of GPU hours times GPU cost is also very much, I’m missing some storage.
    Swyx [00:35:49]: You’re also-- you’re basically like also more IO bound than normal training.
    Swyx [00:35:55]: Yes. ‘Cause like data loading, so caching everything, it becomes super important.
    Ethan [00:36:00]: So in Cosmos, we did a lot of optimizations to make it not IO bound. So, speaking of the training, actually training the model, the GPU cost, if you look up like the open source model, how big these video models are, I think like LTX has nineteen B parameters. That’s a dense model. And people are also exploring, MoEs, so it might be twenty B active and, like a hun- hundreds B, total. So that’s, that’s even-- that’s similar size as medium-sized LLM models. And if you, if you look at number of tokens-Uh, we disclose that in Cosmos. It’s also like tens of trillions of tokens on the visual tokens. So putting this together, the cost of, training these video models, it’s actually comparable with LLMs. Not to mention, the infra is slightly different from LLM, so it might be less efficient to train these models.
    Inference Speedups: Step Distillation, Consistency Models, and GANs
    Swyx [00:37:04]: Do you get the benefits of traditional diffusion speed-up? So for, images, there’s LCM, LoRAs for, fine-tuning. There’s, there’s a lot of stuff that’s been
    Ethan [00:37:15]: Flow matching.
    Swyx [00:37:16]: there’s flow matching. There’s a lot of stuff that’s been done. there’s some overlap that applies to diffusion on the inference side and stuff or?
    Ethan [00:37:23]: so the difference-- the inference side is a completely different story.
    Ethan [00:37:28]: I think for the training side, it might be a little bit hard to reduce that cost. And for the inference side, the biggest gain is from the distillation of these models. You can-- It’s called step distillation, slightly different from knowledge distillation in LLMs. So you-- Typically, for flow matching models, you need like 100 steps or something. Like a distortion model even need even more, like 1,000 steps to generate a good image or video. A step distillation is try to learn to generate fewer step from the model itself. It’s kind of like now we-- you use the full model to generate in 100 steps, and then you take a model that only generate 10 steps and let that model to learn from the perfect one.
    Ethan [00:38:25]: why this work
    Swyx [00:38:27]: Strong to weak seemingly.
    Ethan [00:38:28]: It is. It’s kind of
    Swyx [00:38:29]: Distillation
    Ethan [00:38:29]: kind of like strong to weak. the-- from the modeling perspective, the strong model, the teacher model is trying to model the image and videos of inter-internet, and that distribution is extremely complex. But the step distilled model is just trying to learn from the teacher. The teacher is a model, and the size is fixed, as the distribution is much simpler than the whole internet. That’s the intuition I have why step distillation can work. So usually these models serve in productions, they only run in a few steps. In Cosmos, I believe we have, we have like four step and eight steps. If you do some simpler task, image-image translation, it can even run in fewer step, like one step in Cosmos Transfer.
    Swyx [00:39:22]: I think this is the same intuition that guides a lot of the consistency model work. I sent you a link for, SCM. I don’t know if you covered that. To me, that was actually one of, the most impressive papers I’ve ever seen from OpenAI.
    Swyx [00:39:34]: That this is the unifying grand concept of consistency models. I don’t know if you have any comments on this.
    Ethan [00:39:41]: So there are, there are a few different approaches,
    Swyx [00:39:46]: Oh, yeah. Here it is.
    Swyx [00:39:47]: Two steps versus twenty or 100 steps, whatever. It’s already done.
    Ethan [00:39:52]: So there are, there are a few different approaches, for example, consistency model, and there are also Actually, we shouldn’t forget GAN. So GAN, actually, that was, that was the OG of
    Swyx [00:40:05]: OG
    Ethan [00:40:05]: step distillation ‘cause it trained just one step to begin with. So actually, a lot of, uh-- For example, there’s a distribution matching distillation which use, which uses GAN, as one of the laws for distillation. It-- GAN just tells you, “Hey, generate an image,” and then
    Ethan [00:40:31]: it has a discriminator to tell, is this image real or not? So the model, the model just need to learn one of the distribution, not the full distribution. Because in training, the model is asked to reconstruct the ground truth image from the internet, which is extremely hard. And in-- When you’re training GAN, it’s a step process. It’s just a, “Hey, you generate image. Does this image look as real as the image from the internet?” Which is a much simpler task. And, yeah, combining a lot of these approaches together, people typically do that, like consistency model and distribution matching and GAN, and we can get these few step models.
    Audio-Video Generation and Time Alignment
    Swyx [00:41:21]: Then there’s one step I wanted to add, which is audio and video.
    Ethan [00:41:26]: So, Grok Imagine zero point nine, I believe it’s, it’s a first audio video transmodel deployed at a large scale. So
    Swyx [00:41:39]: And that was your first model?
    Ethan [00:41:40]: that was, Grok Imagine’s first model. It’s, it’s audio video, joint generation. I think the hard part is, the modality alignment, ‘cause before this transmodel, we have, we have text to video alignment. We have this, correspondence between text and video. Typically, most of the VLMs, they understand images and videos. Video’s very rare, and they don’t understand audio mostly. And if you look at the audio generation on the LLM side, you can talk to them perfectly fine, but if you ask them to sing a song or something, it typically is not very good. Also, they don’t have, they don’t have music either. The hard part is thatUh, actually audio has two component. It has like a discrete component, a continuous component. The discrete component is like the language.
    Ethan [00:42:44]: So when we speak, it’s just, some
    Swyx [00:42:47]: It’s an ASR issue, yeah.
    Ethan [00:42:49]: It’s, it’s text token with some characteristics, I would say.
    Ethan [00:42:54]: But music
    Swyx [00:42:56]: I think the speech guys would disagree with this.
    Swyx [00:42:57]: Like disfluencies and then,
    Vibhu [00:43:00]: There’s tones you can get angry.
    Ethan [00:43:01]: Well, I say largely.
    Ethan [00:43:03]: the mu- but the music is completely different. It’s, it’s very continuous, and you cannot model them like discrete tokens in language models. this is like the hard part for models is, not to mention we have to align text, video, and audio together.
    Ethan [00:43:26]: So
    Vibhu [00:43:26]: How?
    Ethan [00:43:28]: So significant-- some significant challenges are like-- So first, like we talk about as the VLMs, they cannot understand most of them cannot understand audio.
    Ethan [00:43:39]: So you have to have some way to do the synthetic data generation for audio. You have to caption the model, and that involve, that involve synthetic data and human data effort a lot. And not just surprisingly, most of the LLMs are very bad at recognizing, like the beat, tone, and the details of the of music. They can, they can give some general prediction of which song is this, but it’s very hard to describe the details of the music. like we mentioned in image generation, like you have to describe image as detailed as possible so that someone blind can reconstruct that. So here is like someone
    Vibhu [00:44:32]: Deaf
    Ethan [00:44:32]: someone deaf can reconstruct how the music sounds like without actually listening to it. Maybe you can think of it need to have the-- or they call the script.
    Vibhu [00:44:49]: Subtitles, yeah.
    Ethan [00:44:49]: You gotta have all the details of the music, and the dialogue.
    Vibhu [00:44:55]: So is the challenge there typically stuff like music and audio, or is it just Like is there a baseline? Okay, there’s enough data where we can understand, narration, conversation, but there’s nuances in audio that’s where you hit all the data issues or is it just from stage zero, you just do it all right?
    Ethan [00:45:15]: So one important thing is like the alignment. So the model, the model has to know like the video and audio, the, uh-- it has to have a time-based alignment, like at which time step the video and the audio token correspond to each other. But we actually don’t have this kind of alignment for most of the other modalities. If you think about like text and image, text and video, they are loosely aligned. So you can, you can have a description of what’s going on in the video, but you don’t have to exactly, You typically don’t have exact description, oh, at, time step one second like what happened?
    Vibhu [00:46:02]: It’s very
    Ethan [00:46:03]: At time step two second what happened
    Vibhu [00:46:03]: coarse. Yeah.
    Swyx [00:46:05]: So what was the ideal time step? You have to oblate it, and then it’s like four seconds or something.
    Ethan [00:46:09]: So that comes down to how you design the model to, for the model to be aware of as a time, as a time modality. So the model is like a time aware. And that’s something pretty unique if you think about LLMs. So if you ask LLM to complete a task, say they, uh-- you ask them and they will say, “Oh, this task will probably take twelve hours to complete,” and they come back in one hour. Say “I’ve already spent two days on this and I’ve exhausted everything.”
    Ethan [00:46:47]: So the LLMs them-themselves, they don’t have a sense of time there.
    Vibhu [00:46:53]: I actually don’t think that’s just them not having a sense of time. I think it’s somewhat based, right?
    Vibhu [00:46:58]: Like you tell someone, “Okay, go work on this feature. Go implement this,” there’s a general understanding you would have of how long that would take without LLMs working at LLM speed, right? So you think back like two years ago, if I tell you to like build me like a new front end for latent space, have a search bar, have all this, you’ll estimate that it’ll take a few days, right?
    Vibhu [00:47:19]: So you tell an LLM, “Go build this.” It’ll take me a few days. But I think it’s somewhat grounded as opposed to them not having the best-- Not saying that they have a great understanding, but I think that example is like you can see where it comes from, right? You’re trained on all over the text.
    Swyx [00:47:35]: They’re, they’re trying to estimate what a human would say.
    Vibhu [00:47:37]: because that’s what the, that’s what the data kind of represents. It’s not them
    Ethan [00:47:41]: It came from the corpus on the internet. People have a estimate of how much time.
    Vibhu [00:47:45]: And not even just in direct like training samples, right? Just your world understanding of tokens of how long stuff takes, right? Go read a book. It’ll take you a while, right?
    Vibhu [00:47:56]: Even if you do nothing but read a book, it takes a few days. So yeah, LLM, I read it took me a few hours.
    Vibhu [00:48:01]: It’ll take me a few hours to go through this research. But this is a tangent.
    Swyx [00:48:05]: Somewhat, yeah.
    Swyx [00:48:06]: This is a train of thought I haven’t really expressed until now is, which is basically like a full world model must also be recursive, meaning that the participant in the world model must also be aware that they have a world model. which is like this whole recursive thing down the, down the line. but yes, and that the world model can be wrong and that they need to update it and blah. Yeah. We’ve, argued this on the, newsletter as well, that there needs to be sort of recursive or adversarial world models.
    World Models: Real-Time, Long-Horizon, Interactive Video
    Vibhu [00:48:34]: just, to ask, how do you define world model?
    Swyx [00:48:38]: Oh, yeah, let’s go there.
    Ethan [00:48:40]: So
    Vibhu [00:48:40]: So just for context, we talked about, video generation, and then there’s a-- if you say there’s a distinction between world models, what’s your, what’s your definition? How do you see the two?
    Ethan [00:48:53]: So disclaimer, I’m not going to debate, what is world model. Yeah. there are many definitions, so I’ll just talk about my definition. Since I came from the multi-model, multi-model domain, so mainly talking from video. So world model is like real-time interactive long horizon videos. So there are three parts. so we-- let’s talk about them one by one. So the so interaction, so we just, we just look at Facebook and neural computer. So the interaction part of it, so you, world model can allow you to interact with them through keyboard, mouse, and maybe also voice. So these all is-- all is a modality. You can, you can interact with the model, and the model should respond reasonably. Second part is real time. So once you, once, say, you move your mouse, if, say, the world model generate a game, how fast can the game respond? So if you’re like professional CS: GO players- -my say, oh, you have to respond- He’s beginner within sub ten milliseconds or- Yeah even less. So that’s not most of the- No, sixty FPS. Let’s go. Oh, three hundred FPS. Oh, five hundred FPS. Wait. okay, yeah. I didn’t do the math, but yeah, okay. Uh- Yeah, three hundred FPS, that’s a three millisecond. So you have to respond- Oh, s**t. Okay. Yeah
    Ethan [00:50:29]: within a millisecond. Most of the video models cannot do that. Yeah. And, but if you, say, if you have a video model that is, say, like a digital human, the response time might be more generous. Maybe typically, for real-time voice interaction, it’s like two hundred millisecond. So that’s, that’s much more generous. But even two hundred millisecond is pretty, it is pretty tricky, ‘cause remember we mentioned
    Ethan [00:51:01]: you have this, temporal compression coming from the VAE. So if you, if you don’t compress the temporal dimension, your sequence length is going to explode. So if you want to have this real-time, real-timeness in your model, you have to do is one context problem. And the third part is long horizon, ‘cause we-- if you’re not going to just play with, video games just, a few seconds, most video models only a few seconds. We’re going to play with minutes, hours. The model have to be able to generate long-form content.
    Ethan [00:51:42]: So putting these three together, it’s, real-time, long horizon interactive videos. I think the final state will be, for example, like a video, a video version of Playbook, where you can, you can interact with, a neural computer. You move your mouse, and you click on the generative interface, and it will reply to you through pixels- generating in real time. But getting there, it’s, it’s a very long way to get there. So one of the first step, at Grok Imagine, where I led a small world model team there, was to build video extension. So, video extension- it’s the first step of interactivity. Yeah. It’s, it’s the first step. Yeah. So it’s the first step- You have it here, video editing, yeah. Yeah. Yeah. So the first step is because, this unlocks long horizon videos. Typically, for most of the video generation models, you give it a prompt or an image as an initial frame. You generate video, that’s it. That’s just, one time, done. And some creators would try to, use the last frame as a first frame for the second video. It can-- sometimes it works, but if you do it a few times, it says the quality would decrease. And- It doesn’t have that context- Yeah over the full video, so the temporal- Yeah, exactly. Yeah, ‘cause you only gave it the last frame, of course, right? Yeah. Exactly. And- it’s actually a pretty fun hack. if you’ve seen like- Oh, no, he’s saying something better. Yeah. And for example, like Vue, I remember Vue 3 has like a second context of the last video. It is slightly better than using the last frame, but it has the same problem-- similar problem that it, the quality would decrease. if you extend a few times to, one minute, the video quality would look much worse than the first video. Second, another problem is that the model doesn’t have long-range knowledge of, what’s happening before. Say, if they generate some dialogue, some, two people speaking, and their voice might change, over some time, especially if the second conditioning, it does not cover the previous context. So these are the core challenges. So the Grok Imagine video extension, it has historical context of all of the previous generated videos. It can, It has, it has the context of, who is speaking and what objects have appeared and everything, having that to generate the next video. So if we naively do this, you can imagine, just, put all of the previous history video tokens into the context. The context lens will easily explode. Especially for video models, that can be like a few, a few million context, I would imagine- context lens. Yes.Yeah.
    Swyx [00:54:58]: Let’s run with that.
    Ethan [00:54:59]: for example, like in Cosmos, I think just five seconds of video is like a fifty K or sixty K number of tokens. So like if you do, if you do fifty second, that’s a five hundred K tokens. If you do longer than that, easily explode. This long horizon, problem was the first step we’re trying to solve world model. It turns out people, yeah, people love video extension. Like a lot, a lot of the creators love using video extension to create longer form videos. This is the part I liked that you have a, you have an intermediate step toward the final goal instead of just a straight shot to the final version very much.
    Swyx [00:55:48]: But I can see you have a strong vision of where we want to end up.
    Long Context, Redundancy, and Efficient Interactive Video
    Vibhu [00:55:51]: Does it seem like it’s an efficiency issue? okay, we’re at a few million tokens context,. If you draw the parallel to language models, we had very short context, two thousand, eight thousand, then, you scale it up one million, ten million. sure, there’s effective context, but at the end of the day, it’s just what’s it worth? sure, there’s a whole training data side. In video, it might be slightly easier ‘cause we have a hundred million token video, right? Just take a movie with the full context there. Like is this efficiency from an inference standpoint that like it’s expensive, but we know how to solve it? Or like why is this not the approach? So like my broader point was on your second point of world models, you say it needs to be interactive and live, right? You should be able to play a game and see the interaction live. So one thing I see with research is a lot of what you actually serve is different than what you build, right? So we talked about distillation. You train big model, you distill it, you do quantization, speculative decoding. We do all this stuff to serve it efficiently. Should we not just have a solution, like a world model that can interact well, do inference optimization, serve it, distill it secondary, so make it real time after you solve it? So like a-- another parallel is say, continual learning, right? What we need is someone to solve it and show it works inefficiently. Give it a few years, people will make it efficient. Same thing with regular attention, right? It worked. Over a few years, people have different forms of attention, and we’ve scaled it to be efficient at log context,? So kind of two things there, right? One is it seems like it works. You’ve scaled it. Can we not just scale it a lot more efficiently over time? Do we need a separate approach if this works? And same thing with interaction, right? if we can get it done, like if we can solve some way that it works, we can solve making it more efficient from an inference standpoint later.
    Ethan [00:57:53]: that’s actually a very good point. So in videos, there’s actually a lot of redundancies. So we solve a lot of the pixel redundancy from VE, but there’s more redundancy in long range and long horizon videos. Say, if a character appear in the first clip and then it disappeared, it only reappear at the end of the video, you probably don’t need the-- the context, like in the middle of the generation. So you only need that character, where you need. So that’s why, I helped build another feature. It’s a reference video.
    Vibhu [00:58:36]: Is it here?
    Swyx [00:58:36]: is it the same model release or different one?
    Ethan [00:58:39]: It’s a different one.
    Ethan [00:58:41]: You probably need to search on
    Swyx [00:58:43]: I’ll find it
    Ethan [00:58:43]: X reference to video.
    Ethan [00:58:46]: So reference video allow you to like upload up to seven images as condition and generate the video. Say, if like I want-- it can, it can be characters or objects or even scenes. Say like I want, I want condition on, Sean’s selfie and holding a blade
    Swyx [00:59:07]: We have a dog
    Ethan [00:59:08]: or whatever.
    Swyx [00:59:08]: We put the dog in the thing.
    Ethan [00:59:09]: you can put them there and the video models will generate the video from and copies the context over. So that can solve a lot of the problems there, like the long context problem. It doesn’t need to have a very long context, but it’s-- I feel like it’s an intermediate solution. The model
    Swyx [00:59:29]: It’s cheating.
    Ethan [00:59:30]: the model should be able to like selectively know, where should I draw the references. So say if I want to generate a movie, I generate it autoregressive, like a ten second at a time or something. And now this character appear, I can look back to where it first appear and, bring that back. Yeah, this one, I put the references. Yeah, that’s, Optimus, Einstein myself, Annie.
    Vibhu [01:00:02]: Oddly enough, I used Grok Search to find it, and it pulled your LinkedIn post. But yeah we found it.
    Ethan [01:00:08]: Interesting.
    Vibhu [01:00:10]: But
    xAI’s Underrated Work, Culture, and Watermarking
    Swyx [01:00:11]: this is a problem. This is not your fault, but like XAI doesn’t communicate all this work that you do very well because they just have the model release and then that’s it. But actually, these details are very good.
    Swyx [01:00:22]: As far as I understand, everything you just described is state-art, like no one else has done it.
    Vibhu [01:00:30]: A lot of-- yeah, I have a lot more
    Swyx [01:00:32]: And then, and then you just put this blog post with the cookies. I’m this is not enough,?
    Swyx [01:00:37]: but I, obviously this is like the high level numbers that people want to know. But no, okay, so
    Vibhu [01:00:42]: And I wonder, like part of that is also some labs don’t share research into what happens. And if
    Swyx [01:00:50]: No, but this is literally bragging about how good they are, right?
    Swyx [01:00:54]: Like, why would you not say that you are capable of extending with full context? this is not a secret sauce. This is like we did the work. yeah, I don’t know.
    Ethan [01:01:02]: different labs have slightly different communication styles.
    Swyx [01:01:07]: Anyway, if anyone from XAI is listening we are always happy to help you tell your story. Yeah, okay, so you did references, and I think, I think kind of the point you’re, you’re making is it is sort of like a kludge, right? this is-- you can do seven, but what about 100?
    Swyx [01:01:23]: Right? Then you need a completely different thing.
    Ethan [01:01:26]: So I think it’s-- this is, a mechanism to, select the context from the history, and you might not put the entire history into the context. for example, there’s a paper called Frame Pack, which have
    Ethan [01:01:41]: a heuristic that the latest history, the last one second, I put the entire history, and the history before that, I would, compress it and makes the video smaller. So they follow this pattern, this build overall pattern that the maximum sequence length is fixed. So the further you are from the current frame, you have a smaller image. So this is just a heuristic. I think it can be more automatic. The model is aware like which history part of it can be select. So this part of the research is actually being actively, worked on by a lot of people. It’s also quite interesting. I feel this is actually, this part of long context is a little bit ahead of the LLM part.
    Ethan [01:02:31]: So for example, like in LLMs, if you-- so contexts keep growing. Let’s say if you call tool and the tool call history is extremely long, that’s still in context, and keep growing, keep growing. Even if you switch the topic to something else, the whole context was there. There are some agentic harnesses that help you to, say, prune the tool results and, prune Like when you, when you query a file, only show like the top 200 lines or something. Those were very heuristic-driven.
    Swyx [01:03:08]: For listeners, we did a write-up on the cloud code, leak where there are eight different kinds of pruning, including like you prune the tool results and all that. So you can, you can read up on that kind of thing.
    Ethan [01:03:17]: I think, one breakthrough in continual learning might be like a way to automatically, manage its own context.
    Swyx [01:03:27]: These are all heuristics, and they will be replaced by machine learning.
    Ethan [01:03:30]: Interestingly
    Vibhu [01:03:32]: The
    Ethan [01:03:32]: the same thing is being researched in both LLMs and video models.
    Vibhu [01:03:36]: The interesting thing is also like in the paper you showed, it’s actually happening at the model level, right? Compared to like language models, sure, we have base attention, but we’ll do our own compression, we’ll do our own pruning, which is separate from model error.
    Vibhu [01:03:49]: Eventually, it all just boils in, hopefully.
    Swyx [01:03:52]: I think this is a form of like attention, but like also know sort of reasoning attention. I feel like that’s different than normal attention.
    Swyx [01:04:03]: Does that, does that make sense?
    Ethan [01:04:04]: It’s, it’s different in the sense that attention, not to mention, set sparse attention aside, like normal attention
    Swyx [01:04:13]: Like UKV, yeah
    Ethan [01:04:14]: you have to attend to all of the tokens.
    Ethan [01:04:17]: So you don’t have a high-level mechanism to drop which tokens do-- you don’t want to attend to. As humans’ attention span is surprisingly small.
    Ethan [01:04:28]: You can only remember 11 digit of a phone number.
    Swyx [01:04:32]: But I have feature detection, right? I can detect, oh, that’s a sequence of one, two, three, four in a phone number that is 11 digit.
    Vibhu [01:04:39]: Very good pattern matchers.
    Ethan [01:04:41]: But humans’ context can-- like attention can work because we can dynamically pull in, context from different places. The same mechanism, I think is going to happen for LLMs and video models. I think we have
    Swyx [01:04:57]: RLMs is recent-- is on, it’s on the recent work is there, which is not that, crazy, but it’s just recursive.
    Vibhu [01:05:04]: I think it’s somewhat inherent in models too, right? Like you
    Swyx [01:05:06]: No, here’s a nice example here
    Vibhu [01:05:07]: you pull up these, you can read it fine, but, language models are also very good at slop parsing. you have a trans
    Swyx [01:05:15]: I throw my typos in there, it doesn’t matter.
    Vibhu [01:05:17]: You have a, you have a transcript, you have whatever, just throw it in and it’s very good at parsing through noise. m-- that may be a brute force. It can look over a reason over it, but there’s, there’s parallels to both.
    Swyx [01:05:31]: I think it’s just really fascinating how you relate the world models stuff to the video generation, which I don’t think a lot of people hear directly, from people like you. So I think that’s really helpful. Any other work? Do we cover like video, audio, world models, any other stuff in that omni
    Swyx [01:05:48]: team,?
    Vibhu [01:05:49]: Or any other work at XAI you want to talk about? Seems like everything we see publicly announced, “Oh, cool, cookies.” And then there’s so much more to it.
    Swyx [01:05:58]: There’s a lot of depth.
    Vibhu [01:05:59]: Any underrated stuff, just at the time there?
    Ethan [01:06:03]: I feel the, as a culture, it is quite interesting and a bit underrated. So the culture is, the culture is three sentences: move fast, build No goal is too ambitious, and the first principle. Like early, the goal set was very ambitious. It wasn’t very-- this wasn’t-- it wasn’t possible to achieve when I, when I was thinking, first thinking about it. Like for example, like build something in three months. And
    Vibhu [01:06:36]: Was that “Okay, we’re starting team, we want image, we want video. Do it by this deadline.” Or, how do you work back? Like was it just, “Okay, we have a rough by, this date we want something out,” or is this like
    Ethan [01:06:52]: That’s a very good point. So it’s from first principle thinking.
    Ethan [01:06:56]: If you think about, people might say that first principle thinking applied more to the physical world than the models. I would say, for example, like if you think about-Some limitation, for example, acquiring data, like how fast can we acquire the videos? And if you think about training the models, what’s the iteration speed for training a model end? And how would adding more GPUs accelerate that timeline? And maybe if you need human data, like what’s the turnaround time for human data to arrive? If you put all of those together, that is first principle thinking where, oh, like what is the timeline? What’s the minimum number of days that is possible to achieve something?
    Swyx [01:07:52]: I think there’s a-- this is a lot of Elon’s type of thinking, right? He’s like-- I think he’s famous for saying that the only law you can’t break is the laws of physics, something like that.
    Swyx [01:08:01]: Just broadly, you worked a lot with Elon.
    Ethan [01:08:04]: I, one benefit is working at xAI, you got a chance to interact more with Elon. So I was very fortunate to get a few retweets from him, and that was quite fun. And, he also worked very closely, with people. like people imagine online, like he’s very hands-on.
    Vibhu [01:08:34]: There are two things. one-- So I was actually looking up, Elon retweeting you. I’ll pull it up. he talked about you tweeting that you have a really good voice mode. I don’t know
    Ethan [01:08:47]: Oh, me?
    Vibhu [01:08:47]: No. Him.
    Swyx [01:08:48]: Oh, I also did it. But anyway.
    Vibhu [01:08:49]: I actually-- So I would DM you feedback on voice mode because I was “Wow, really good.” And then I’m “Ugh, this sucks.” But, I don’t know. Anything you want to talk about your voice mode, building it? Was it a team you worked on as well?
    Ethan [01:09:02]: Oh, that’s actually not part of the team I worked on.
    Swyx [01:09:05]: He probably worked on more of the video. No, but Grok Voice actually
    Vibhu [01:09:11]: Grok Voice
    Swyx [01:09:11]: like very good. I-- This is one of those things where first of all, you can speak at 2X, which is fun.
    Swyx [01:09:16]: which I listen to 2X, so I like to speak at 2X. But also I think like the interruption was better than Gemini. I don’t know how it compares to ChatGPT real time now, but as far as like driving was concerned, like having Grok in my Tesla and like driving, I think it was like-- it’s a really good experience.
    Vibhu [01:09:34]: He likes voice mode. But also, just the crazy reach by Elon
    Swyx [01:09:40]: Fifty million views for just saying, “Yes, true.”
    Vibhu [01:09:43]: That’s true.
    Swyx [01:09:44]: Oh my God
    Vibhu [01:09:45]: but, it’s, it’s pretty cool how fast it came out. the other thing is the safety aspect of video mode. Anything interesting to talk about there? So
    Swyx [01:09:56]: spicy
    Vibhu [01:09:57]: spicy question.
    Ethan [01:09:58]: A lot of the countries where they don’t allow like a generative data-- generative AI videos without watermarks. So in all of the-- those countries, Grok Imagine had watermarks, and a lot of the-- a lot of the takedowns of the videos were also happening extremely fast.
    Swyx [01:10:22]: it’s, it’s part of running a social platform but also it transfers nicely to the GenAI side. Do you have a perspective on SynthID versus other kinds of watermarking?
    Ethan [01:10:33]: it’s going to be
    Ethan [01:10:37]: it’s going to be harder and harder to detect, the Yeah, these things. So SynthID, one thing is, previously it was only Google, and now, like a lot of different labs
    Swyx [01:10:52]: OpenAI adopted it
    Ethan [01:10:52]: are also adapting it.
    Ethan [01:10:54]: As-- A limitation is like the technology The paper was out there, and people can reverse engineer like how to get rid of it.
    Ethan [01:11:05]: And it’s-- I think even as it advance, it’s, it’s still possible to reverse engineer it.
    Swyx [01:11:13]: so if you are interested, you can go onto Reddit and people have taken out the exact I don’t know, what do you call it? Mask or pattern that Google applies, and then you can apply it onto any Google-generated photo, and you can reverse out the SynthID.
    Ethan [01:11:30]: And it’s, it’s also harder and harder to just judge by eyes. I remember like a couple years ago, there was like six fingers or something. It’s very obvious.
    Vibhu [01:11:42]: My current is actually the audio. I feel like the audio is really lacking. my way to tell if something is generated, outside of okay, I think I’ve seen enough, I have a decent eye, the audio matchup, especially of Sora, is not great. It’s all similar style. But there’s
    Swyx [01:11:57]: I see. those are minor imperfections.
    Swyx [01:11:59]: I think the point is that like-- Actually, my closest reference to this is also Ian Goodfellow, ‘cause I think he did like the adversarial GAN thing where like it’s okay, here’s a picture of a zebra. Then you like change one pixel, and it becomes a panda.
    Swyx [01:12:12]: Right? This is like-- this is like a classic computer vision issue.
    Ethan [01:12:15]: If you think about how these models were trained, like I, like I mentioned before, like GAN was in the training process. The objective of GAN is you-- is the model generates an image, and the model, there’s a judge to tell like if the image is real or not. The model is trained to make the image more real. So as the model become more and more advanced, it’s going to be harder and harder. For me personally, now I have to judge by
    Ethan [01:12:49]: if the-- these videos have logical sense.
    Ethan [01:12:53]: If these, this video
    Swyx [01:12:55]: Have a world model.
    Swyx [01:12:57]: No, I also like it-- the audio is too nice, like too studio quality. The lighting is too good. The skin is too clear. the-- basically, the lack of imperfections.
    Vibhu [01:13:10]: Do we have a good way to do reasoning in diffusion? Like is that what separates video generators from world models or in, -We really know how to apply it to other regressive language models. Is there a parallel for diffusion video gen world models like on that point, right? Is
    Swyx [01:13:30]: He has a thing on video agents.
    Ethan [01:13:31]: that’s a good question. Yeah, actually, I have a, I have a pretty big claim. The intelli- the visual intelligence are actually mostly coming from language. these video models, especially from now, since the diffusion model technology is more mature, the every time you see there is some improvement on these models, I would say mostly, this, again, comes from language model, not coming from the vid- the video model itself, like the video distribution models themselves. In Cosmos, that could be Typically these models, they have two parts. there’s a, there’s a prompt rewriter or the prompt up sampler part. I think in Cosmos, we use Llama or we use Mix- Mixtro. And the Cosmos video model itself is only 7B, and the model, the language model
    Prompt Rewriting, Video Agents, and Agentic Generation
    Ethan [01:14:35]: is a prompt rewriter. It’s, it’s bigger than that. So the prompt rewriter’s task is to take user instruction and convert it to extremely detailed description of the video. So because the video, the visual-- the video distribution models, I would describe, they’re kinda dumb because they take the input
    Ethan [01:15:03]: instruction literally. Because in the training process, remember that we have to describe the video as detailed as possible when we’re creating the synthetic, text pair. So this model, they take those kind of instruction to generate the videos. So in-- when you’re taking the user instructions, the user instruction usually are simple. Just say a cat or something. If you put a cat in the video model, they would take that instruction literally. They would literally show a cat, a cat in maybe a white background because you didn’t describe the background. The cat is not moving because you didn’t describe it. It takes the instruction quite literally. It’s kinda, it’s kinda dumb. The prompt rewriter is actually a much bigger model. It’s a language model that takes, the user instruction and expand it. So the thinking process you mentioned, is from there. So if you, if you look at like GPT image, like you generate a image in three minutes. Three minute is not all like a pixel generation. A lot of time is spending
    Vibhu [01:16:19]: Prompt writing
    Ethan [01:16:19]: on thinking.
    Ethan [01:16:20]: So prompt rewriting now have evolved to, not only just as thinking, it can, it can also be a agent, a agentic model. For example, say you want, you wanted to generate the image of today’s news. So the-- So it’s likely they’ll go to fetch today’s news online and then, process and digest them, then organize the layout and generate it. Another thing quite interesting is,
    Vibhu [01:16:53]: If I’m not mistaken, these are-- it’s no longer a diffusion model though, right? It’s autoregressively Or is there still
    Ethan [01:17:02]: There are different approaches. For example, Gemini Omni. Since they said it’s Omni, I believe it’s a, it’s a single model. Maybe it’s something it’s a language model with a diffusion head or something. Like the language model do the thinking, do the agentic tool calling, and then it would, use the diffusion head to generate the image in the end. There were also approaches like Cosmos, where you have a separate language model and separate diffusion models. And there were also like a purely language model, like you discretize the images, and then you generate the image as discrete tokens. So there are different approaches. I would say like
    Vibhu [01:17:44]: One of, one of the claims I’ve seen for why these approaches struggle is because a lot of the benefits for how we currently learn reasoning with language models is you basically iteratively generate reason. You have your thought, and then you work on that answer, right? So if you have like Omni model and then diffusion head, you can’t feed that back in to continue reasoning, right? So you can’t go like text, image, text, image. You can’t reason on the output and then go back to diffusion. But in the new Gemini Omni, you would be able to, as long as you have diffusion.
    Ethan [01:18:15]: I’m not sure if
    Vibhu [01:18:16]: But
    Ethan [01:18:16]: they have that process. it’s definitely possible in the Omni paradigm.
    Ethan [01:18:22]: So if you think about like traditional multi-model language model, they would have a VIT encoder that can encode the image. So if they have a diffusion head, they can generate the image and then put that back into the VIT encoder, encode that, and then do the iterative refinement if the result Yeah.
    Swyx [01:18:44]: I think you have to jointly train the VIT and the diffusion to make that somewhat reasonable, ‘cause otherwise you’re kind of mismatching or feeding in slop.
    Vibhu [01:18:55]: I think it depends on the stage of training. You might be able to freeze it. But anyway, also just on your earlier
    Swyx [01:19:00]: Wait. I wanted to also make explicit. We do know that NanoBanana and GPT image are autoregressive, language model with diffusion head.
    Swyx [01:19:09]: as far as I can tell from your description of Grok image, it is not. It is, it is end.
    Ethan [01:19:14]: I cannot
    Swyx [01:19:15]: You cannot
    Ethan [01:19:15]: comment on that.
    Swyx [01:19:16]: Well, the way that you described it. but, yeah, I think it-- there’s, there’s different approaches, right? Like you started off saying prompt rewriter is, the-- a big part of the intelligence.
    Vibhu [01:19:24]: and even on that, I think everyone should try using an early diffusion model. If you’ve used Stable Diffusion one or whatever, if you’ve seen the prompts ultra-high res, four K this style, oh my God, the first time I tried one, you don’t talk to them like language models, right? Your prompting is very, comma separated
    Swyx [01:19:43]: It’s literally talking in the labels that were in the data set, right?
    Swyx [01:19:46]: But basically, I’m just trying to make the point that prompt writer and then image is different from autoregressive language model with diffusion hit. Right? They’re different things.
    Ethan [01:19:56]: they’re different.
    Swyx [01:19:57]: Just wanted to establish.
    Ethan [01:19:59]: I’d say, the common part is, the image part. So it’s, it’s quite surprising that, a lot of the improvement came from the
    Swyx [01:20:12]: Language side
    Ethan [01:20:12]: the thinking the tool calling. So I still remember, in Cosmos, I generated a happy sheep and can if without any rewriting, it’s-- it looks so, CGI, and after rewrite it looks, it looks so beautiful.
    Ethan [01:20:31]: I think
    Swyx [01:20:32]: Without any joint training.
    Ethan [01:20:34]: actually, without any joint training. it’s-- with rewriting, it’s already much better. See, a very interesting thing, what happened is the video agents, mostly language models, will call these, generative model, either it’s a separate model or a diffusion head or whatever, as tool. So this model can iteratively refine the results or even, generate longer content through a very long train of thought. It’s actually very similar to how human create art. So we don’t, we don’t generate the pixels directly. We literally draw something on And I think through this process, the-- these models not only use diffusion as one of the tool, it can also use traditional tool. It can also use, image editing tools from Photoshop. It can use, video editor, FFmpeg, whatever, to take combination of these and the generative AI technology as a, as a set of tool, and they can, they can iteratively create a better, a much better, video for production-grade quality. If you look at existing, professional creators, they don’t, they don’t end at, generating a video from these models. They would take this video to their editor and edit here and there.
    Swyx [01:22:11]: So much post-production in And sometimes actually, the reason the video is good is not really the video model, it’s actually the editing.
    Swyx [01:22:21]: And yes, we also are engaged in the same process as well. Would you love to use a video editing model?
    Ethan [01:22:27]: Actually, there’s, Grok Imagine Agent beta. That was the, that was the first attempt in that direction.
    Ethan [01:22:38]: So I think, the process would be similar to like
    Vibhu [01:22:44]: It’s just agent mode.
    Ethan [01:22:46]: you can, you can ask it to
    Swyx [01:22:48]: There’s no blog post for it
    Ethan [01:22:49]: maybe generate a minute, video, which is not possible if you ask the same prompt to video models. But this model will ca- literally call different tools to do that.
    Ethan [01:23:05]: So yeah, this is actually an interesting thing. So when we first released, a video editing model, I see on X some people try the video editing feature with, “Edit this video to be one minute.” ‘cause they didn’t understand how video editing work. Video editing typically is just a removal, add, replace, style transfer, this kind of thing. But that’s actually a valid request under the assumption of video agents. So these agents should be able to understand these kind of, long horizon tasks to be able to easily, create a long-form video. I think this is, this is really fascinating ‘cause it’s kinda take-- it’s taking the same direction as first you have these, assisted-- assisted coding, kind of like tab completion, GitHub Copilot. And from there, you gradually evolve to Codex and Cloud Code, where you do things fully automated. So in agent, in Grok Imagine Agent mode, you can, you can still go in there and do stuff by yourself.
    Ethan [01:24:22]: gradually, as the model capability increase, it will be able to do everything fully automated.
    Swyx [01:24:30]: I like that. okay.
    Ethan [01:24:32]: That’s good.
    Swyx [01:24:32]: So it looks like it’s still generating.
    Vibhu [01:24:34]: Also, I did notice the Grok image gen was always very fast. I don’t know if this is something you guys benchmarked, but, this is just a tangent. Compared to what I used to use before the latest OpenAI’s image gen, and same with Gemini Nano Banana, I would oftentimes use Grok just for the speed.
    Swyx [01:24:54]: It’s, it’s in the benchmark somewhere that’s
    Vibhu [01:24:56]: It’s
    Swyx [01:24:56]: in the Imagine API blog post that they have all the speed things.
    Swyx [01:25:00]: it mostly combination of distillation plus inference.
    Ethan [01:25:04]: There are a bunch of things. we talk about distillation, and if you talk about thinking, if you don’t have any thinking budget, the model can just think three minutes and then come back to you. And also, inferenceThe inference infra team was very talented, and they were, they were able to accelerate a hell lot of these models.
    Swyx [01:25:27]: my comment on the, on the video agents things, I’m trying to figure out, when people say video agents, when you initially told me about your bet on video agents or your vision for video agents, I was a little bit disappointed. I was “you mean, like models are tapped out, now we have to do agents?” But, I think you have to, right? The question now is, how much model training is it really going to make a difference versus just building a better harness? Like you said the models don’t have to be jointly trained. you can just take an shelf frontier reasoning model, slap it on a harness, give it Grok as a tool. That’s it. That’s your video agent. Doesn’t seem super satisfying. Obviously, you can train and get some more percentage points of per- performance. But, if your central claim that the majority of video or generative media, alpha or whatever, is actually coming from language intelligence and not, image diffusion or video diffusion, then that is the future.
    Vibhu [01:26:30]: it’s pretty cool
    Swyx [01:26:31]: It’s just like primarily just weight.
    Vibhu [01:26:33]: If you pop back at the example, it generated frames. Sorry to interrupt, it’s been saying “Okay, I’m gonna start stitching these frames together.”
    Swyx [01:26:42]: So
    Vibhu [01:26:42]: It’s using FFmpeg like using code.
    Swyx [01:26:43]: This is what GPT Image Pro as well is doing, right?
    Swyx [01:26:46]: Like, this is also just writing code in the background and then just
    Vibhu [01:26:48]: Stitching
    Swyx [01:26:49]: doing an image pass on the final output. It feels dissatisfying for the people who want to just train models.
    Vibhu [01:26:54]: It’s interesting, right? it’s, it’s also somewhat exciting. Like you brought up earlier, a lot of the gains don’t come as much from the video. I think you can see that in the language model space too, right? Anthropic, very good at coding. They’re multimodal, not the best, right? They have basic input PDF, but there’s clearly a disconnect in the quality of their image video processing, audio processing, yet intelligence very top tier. Other labs, Gemini, OpenAI, xAI, you can add modalities, but it’s not like they’re unlocking crazy capabilities, right? So it’s interesting.
    Ethan [01:27:32]: It’s interesting to see that, like the video model’s capability increase actually come from language model being more intelligent. I think video agent, like it can unlock more stuff than my- you might imagine. So there’s a few things. So one thing is when we are prompting these models, so most of the people were actually not very good at prompting.
    Ethan [01:27:59]: Actually, language models have a better sense of how to prompt AI models. AI models know AI models better. So if you jointly train these models, maybe the model have a better sense of, how to prompt each model. Like a different model
    Vibhu [01:28:15]: Of course
    Ethan [01:28:15]: might be different. Another thing is it might not as simple as just, like generate a few clips and slap them together using FFmpeg. Like you might-- there might be more like image and video editing tool appear in this process. Say, if you want to exactly add a blob of text at this timestamp, the videos model-- video models might not get that intention very precisely.
    Ethan [01:28:48]: But these are possible using these deterministic tools. The long-- The video agents can use all sorts of tools, so you don’t have to put all of the capabilities into the generation model itself.
    Swyx [01:29:04]: I think that’s very true. no, so for what it’s worth, I think you’re right. I think that this will be a big category. I think probably you are predicting like the next one year in video is gonna be all this.
    Vibhu [01:29:18]: Do you have a time prediction for how-- when this stuff ramps up? Like
    Swyx [01:29:22]: they already started.
    Vibhu [01:29:23]: Is,
    Swyx [01:29:24]: It’s not very good yet.
    Vibhu [01:29:25]: Are we so-- No, it’s so, it’s so good. I think the last one’s just longer.
    Vibhu [01:29:29]: it didn’t give me a minute.
    Ethan [01:29:30]: Last thirty-six.
    Vibhu [01:29:30]: It gave me thirty-six seconds. But are we feeling it now? Is there gonna be inflection? Is there any timeline predictions you wanna make?
    Ethan [01:29:37]: by the end of this year is-- this is going to
    Ethan [01:29:41]: be a big hit. So the inflection point will be where, the videos generated by video agents can get to like production grade quality, so it can be presented and it can be, it can be distributed in ads. And when-- once that happen, I think the enterprise will have much more budget for video models because the agents are, inherently more expensive than the, than the video models themselves, ‘cause they do this iterative process. They generate many variations.
    Ethan [01:30:23]: but once these models have this, pass this usability threshold, I think it’s, it’s going to be a exponential growth beyond that.
    Swyx [01:30:35]: I would, fund a company right now based on this thing.
    Robotics, Physical AI, and Internet-Trained World Models
    Swyx [01:30:40]: so I think you’re right. One thing I’m, I’m surprising, I’m reflecting on the whole like past hour or so of conversation, you are-- I think you’re into world models and video generation for video generation’s sake. I think that a lot of other world models people, we’ve interviewed a lot of them, general intuition and Fei Li and all those guys and Moondream, which I think I told you about. Moonlake.
    Vibhu [01:31:01]: Lake.
    Swyx [01:31:01]: I keep saying Moondream. Goddammit. Moonlake. A lot of them actually say like robotics is the end game. Like embodied robotics, like you want real-time, you want interactive. It is to interact with the physical world. You’re not that concerned about it.
    Ethan [01:31:15]: I think robotics will be a, will be a big part of it for sure.the process may happen naturally. So my prediction on robotics is that the problem is physical AI might be solved, like without actually need to
    Swyx [01:31:36]: Be in the real world
    Ethan [01:31:37]: need to be in the real world. So it might, it might get solved by a video-- A LLM is very strong video capability. So remember we talk about the real-time interactive long horizon video. Once these models-- So now these models are just training on like screen recordings and computer screens. Once these models can use computers and understand the future state of computer extremely well, the robots might be, might be one of the, one of the tools, a very powerful AI can use. So the powerful AI might just, be able to control the physical embodiment naturally.
    Why Ethan Left xAI and What Comes Next
    Swyx [01:32:28]: I see that for sure. Cool. I know, I know we are coming up on time. you had-- you left one more spicy topic, which is why you left xAI.
    Ethan [01:32:38]: For me, there’s, there’s a lot of, a lot of research you want to do that you cannot do at, as a company. And also like the priorities and objective the-- of a company typically can change very fast. It is-- It’s also the same for xAI. So now is kind of like the time so there is some research I want to do, especially more on language model side like I cannot do at xAI.
    Swyx [01:33:11]: Oh, okay, yeah. So you’re, you’re basically leaving You’re, you’re-- you had this whole transition from computer vision to world models, video generation, to now you’re like focusing on LLMs.
    Vibhu [01:33:22]: But it seems a lot of you saying focusing on LLMs, you really in the past hour described how it all ties together, right? Like But I don’t know. What do you mean by focusing on LLMs? Is there
    Ethan [01:33:33]: I realize the fact that the video models, even like in the beginning, the game might come from improvement on diffusion technology, but this is a point where actually most of the game, come from the language models themselves.
    Swyx [01:33:50]: It’s a huge black pill for anyone who has like spent their career in like generative, media.
    Vibhu [01:33:56]: it-- that’s an extreme view, right? The-- You still definitely need a bit of both, right?
    Vibhu [01:34:01]: There’s just, it seems like more pressing, impactful work to do now on language model side.
    Swyx [01:34:07]: Do you have any similar predictions? you-- so you predict the video agents, and I think you will be right. on the language side, what are you looking for in the next one year?
    Ethan [01:34:16]: I think one thing pretty interesting I think might be happening soon is the language models will be like context-aware and manage its own context.
    Ethan [01:34:29]: So some-- Like from the video model side, we’ve been suffering from the long horizon issue, like we want to generate video longer and longer, and we’ve been trying to solve the context length issues through various ways. One thing is just brute-forcing train longer context lengths. Another is to manage the context better. I think the same thing in language model is also going to be happening soon. So for example, like the language models, they’re not aware of how long their own context length is. Once they hit like eighty percent or something, automatic context compression is getting triggered. And the model, is not aware of that when it’s working. And some-- maybe it’s good for the models to know, “Oh, I’m, I’m approaching like eighty percent,” or something. And something also pretty interesting, like for example, in OpenClau, like you-- every time you type in something, a times-- the current local time is automatically attached to your message, so the model actually know what time is it. So this is making the model time-aware. And also like in tool calling the-- a lot of the intermediate tool call results automatically prune. So there’s like context removal, context addition, and, context compaction. So all of these are from the harnesses themselves. But from our experience, the heuristic engineering also helps the models get this absorbed into the models themselves. that’s something very interesting to explore.
    Vibhu [01:36:12]: So infinite context?
    Ethan [01:36:14]: Maybe.
    Vibhu [01:36:15]: No, but it’s, it’s interesting, right? you
    Swyx [01:36:17]: It is in the space of memory and continual learning and
    Vibhu [01:36:20]: I don’t know. It’s also like in the space of agent harness use, right? You’re seeing
    Swyx [01:36:25]: No, he’s saying he doesn’t want to do it in a harness, right?
    Vibhu [01:36:27]: No, but models are also being trained on uni-- using harnesses, right?
    Vibhu [01:36:32]: So some of it is, you could say, implicitly leaking in, right? part of that post-training of language models is, okay, using it in coding harnesses, in which case, when are agents spawned? When is compaction gonna happen? it’s not explicit you have this much token window, which I don’t know if you want it to be, as that’ll change, but it’s, it’s somewhat leaking in there.
    Ethan [01:36:58]: I’m imagining, what if the model have access to the whole-- the code of the agent harness itself and being able to modify it to whatever you want. Say, if the agent harness is short enough, you can just put in the context lengths in the system prompt, and then the model will say, “When I want to spawn a future version of myself, I can modify the agent harness.” For example, if I-- the agent harness can be, “Oh, when I’m reading-”A long document, I can choose to read the whole thing in chunks and, come back, smash the summary together, or I can just read the first two hundred lines and, discard the rest. And all kind of choices, if they can be made by the models themselves, it might be very interesting to see that the model can, program the model can program itself online in test time.
    Career Lessons: Moving Across ML Domains
    Swyx [01:38:02]: so the self-modifying harness is also part of, OpenClaw and Py, but I think there’s a lot more work to do there. Very cool. I think part of me is kind of curious. I think you are part of Big Lab, right? And there’s this career path of a researcher at a Big Lab, which is you are-- you train models, you get more compute, you train better models, and you keep going. And somewhat, I feel like you’re opting out of that. And if I were you, I would be “Oh, I think this is, a bit of a career risk.” what?
    Swyx [01:38:36]: I don’t have any comment apart from, you’re very strongly convicted. I think that a lot of people in your shoes would not be doing what you did.
    Ethan [01:38:43]: Speaking of my career, if I look back, actually, there were, there were a lot of huge transitions. So ten years ago, I was, I was doing research with a ResNet authors, Xiangyu Zhang and Jian Sun. Yeah, at that time, the research were completely different. It was, mostly confirmation, like image recognition, object detection, object tracking. I was also doing neural net compression at that time. It was quite different from knowledge dissolutions these days. And at that time, I was-- I wanted to be a professor, and I applied. When I applied for a PhD, I already had a few first author papers at top conferences, so I confidently applied at the top schools. It turns out I got rejected by all of the top PhD programs. So I had to, I had to go to the industry. At that time, I was at Facebook AI Research fair, led by Yann LeCun.
    Swyx [01:39:51]: I wanted to talk about VJPA, but it’s different.
    Ethan [01:39:53]: I know. Yeah, we can leave it for another time.
    Ethan [01:39:57]: I switched to At that time, I switched to self-surprised learning. It was, it was quite different from what I was doing in contribution.
    Ethan [01:40:07]: And after that is NVIDIA Cosmos. So I realized scaling up was extremely important. So at NVIDIA, I was mainly focusing on scaling. So one thing is Cosmos scaling the video distribution models to a few billion parameters. And another thing is, I was working on MoEs. The Megatron MoEs was the first, was the first framework open source to be able to train these MoEs at very large scales, hundred billions parameters to even trillions parameters efficiently at, forty percent MFU.
    Ethan [01:40:51]: And going to switching to xAI was trying to work on even larger compute scale even further. And yeah, looking at this trajectory, I actually worked on a lot of different things. So I feel actually within ML, it’s actually easier to switch than you think. a lot of people might have mindset that, “Oh, I work on, I work on computer vision. I always have to work on computer vision, and I cannot switch to language.” And, but from my experience, at least at NVIDIA, I worked on both language model MoEs and also video models. It’s, it’s actually not the case. A lot of, a lot of the core principles how to train large models are largely the same. And yeah, for me, I feel right now the bottleneck, for video models is actually the language part the agent, which is why I want to go to work more on LLMs. One thing is it’s, it’s a bit of a challenge. I don’t think it’s a huge, jump, so.
    Closing Thoughts
    Swyx [01:42:18]: kudos to you. I think you have a lot of, strong vision there. Yeah, I think that was mostly everything that we wanted to cover. You’ve been very generous with your time, and I, it’s really nice that you are able to share all these things now. We don’t have to go through xAI to clear everything. but also we
    Ethan [01:42:35]: Oh,
    Swyx [01:42:35]: I think we didn’t get you in trouble.
    Ethan [01:42:37]: It’s a lot of good stuff about xAI compared to what you just see in the releases, right? You don’t realize how many more levels there are to it.
    Swyx [01:42:44]: xAI, please do more podcasts.
    Swyx [01:42:47]: anyway.
    Swyx [01:42:48]: but thank you for, sharing. It’s been very kind. And also, I wanna hear more from you. I think you are going to embark on your next phase. You haven’t announced what you’re doing next, but clearly you have, more vision and more ambition on this path, and I think you’re, you’re basically kind of gradient descending to, whatever your final form is.
    Ethan [01:43:08]: Thank you. Yeah. Yeah, I’ll, I’ll share more about my next chapter soon.
    Ethan [01:43:14]: Thank you for having me.
    Swyx [01:43:16]: Thanks for coming.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
  • Latent Space: The AI Engineer Podcast

    The Age of Async Agents — Cognition's Walden Yan & OpenInspect's Cole Murray

    28/05/2026 | 1h 8 mins.
    The new AIEWF website is live! CFPs close in 2 days and we will run our first New Engineer Orientation this weekend, get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!
    One of the central tensions in the agents industry is that even while there are major decacorn agent labs like Sierra, Decagon, Notion and Cursor being built up, it is also true that it has never been easier to DIY agents, with a plethora of agent frameworks like LangGraph and Pydantic and Flue, and managed agents from Anthropic and Gemini and Amazon. There has been a wave of companies building their own background agents from Shopify to Stripe to Paradigm to Razorpay, and even Cognition’s friends Ramp have built their own coding agent with other friend Modal.
    You’d think Cognition might feel a bit threatened, but they’re not - even after all this, they were way oversubscribed for the $1B Series D they just announced:
    Walden Yan, coiner of context engineering and Chief Product Officer/Cofounder of Cognition, invited OpenInspect’s Cole Murray to talk about why the Devin is in the Details.
    Full conversation live on the pod today:
    In retrospect, async agents were the most AGI pilled bet you could make in 2024 - the models weren’t good enough yet to vibecode, and people didn’t trust AI enough to let it rip, nobody (including early Cognition) was sure about the form factors.
    Now it is obvious:
    * The first wave of AI coding tools made the developer faster but remain heavily in the loop. Copilor and Cursor’s tab autocomplete are prime examples However, the workflow was still heavily centered around and bottlenecked by the developer’s local workflow: a developer in an IDE, watching the model, accepting or rejecting changes, and pushing code one interaction at a time.
    * The second wave was local agents: Claude Code, Windsurf, Cursor’s agents pane: first one and increasingly many terminals all running concurrently.
    * The current Age of Async Agents points to a different future focused more on agent orchestration which drives end-to-end development.

    According to previous guest Steve Yegge, there are finer-grained 8 levels to agent adoption, but we have collapsed it into three.
    As Cursor’s Michael Truell put it in The third era of AI software development:
    Cursor is no longer primarily about writing code. It is about helping developers build the factory that creates their software. This factory is made up of fleets of agents that they interact with as teammates: providing initial direction, equipping them with the tools to work independently, and reviewing their work.

    The agent should not sit solely inside the developer’s flow. It should be setup to work in the background so that you can give it a task, a repo, a machine, a shell, a browser, tests, memory, and review loops to go do the work somewhere else.
    In less than a year, the sentiment has shifted from avoiding multi-agent systems:
    to suggesting approaches that actually work:
    From coining “context engineering” to building the infrastructure behind Devin’s 7x PR growth and jump from 16% to 80% of commits across Cognition repos, Walden Yan has had a front-row seat to the background-agent shift. In this episode, Cognition co-founder and CPO Walden Yan joins swyx alongside Cole Murray, creator of OpenInspect, to unpack why everyone is building their own Devin, what changed after the December 2025 model inflection, and why “spec to pull request” is now becoming a real production workflow.
    We go deep on the architecture of background agents: harness-in-the-box vs out-of-the-box, why Devin separates the “brain” from the machine, why repo setup is still one of the hardest problems, why Docker is not always enough, and how full VMs, snapshots, scoped secrets, GitHub bots, Slack integrations, and video-based testing all fit together. Walden and Cole also dig into memory, MCP limitations, multi-agent orchestration, AI code review, SRE auto-triage, PMs shipping code from Slack, Windsurf 2.0, hybrid frontier/sub-frontier systems, and the real failure mode of uncontrolled vibe coding: your codebase regressing to your worst engineer.
    And as agents eat software… and software eats the world… you can draw the conclusion on what is next:
    We discuss:
    * Why the engineering world is waking up to background agents and cloud agents
    * The December 2025 model inflection that made spec-to-PR workflows practical
    * Devin’s 7x merged PR growth and rise from 16% to 80% of commits
    * Why Cole built OpenInspect as an open-source background-agent system
    * The economics of $20/seat agent products and why monetization is tricky
    * What Cognition actually sells beyond Devin: infra, onboarding, integrations, and adoption
    * Harness in the box vs out of the box, and why architecture matters
    * Why Devin separates the brain from the machine for security and permissions
    * Repo setup, scoped secrets, Docker Compose, and agent-ready dev environments
    * Why full VMs matter when agents need to run real applications and test them
    * Android, macOS, Windows, nested virtualization, and machine-specific agent work
    * Why testing is much harder than “computer use”
    * Screenshots, video verification, and the “I know it works” merge moment
    * GitHub UX, Devin Review, AI reviewers, and agents responding to PR comments
    * Why MCP alone is not enough for first-class Slack and enterprise integrations
    * Memory, Knowledge, skills, Claude.md, and why retrieval is still unsolved
    * Devin’s auto-generated memories and the challenge of memory pruning
    * Always-on agents as permanent PMs for issues, tickets, and product areas
    * Sub-agents, meta-Devin management, and what multi-agent systems actually add
    * Why pure auto-merge vibe coding breaks down after about two weeks
    * AI code smells, lint rules, reward hacking, and Semgrep for agent-written code
    * GitAI, inline context, and preserving the “why” behind code changes
    * Local testing, mock servers, older codebases, and preparing companies for agents
    * Windsurf 2.0 and the handoff between local foreground agents and cloud background agents
    * SRE auto-triage, support workflows, and agents as first responders
    * PMs, marketing, and non-engineers creating pull requests from Slack
    * AI agent budgets, $1k-$5k per engineer spend, and hybrid frontier/sub-frontier systems
    * The rise of autonomous coding factories and who Cognition is hiring
    Walden Yan
    * X: https://x.com/walden_yan
    * LinkedIn: https://www.linkedin.com/in/waldenyan/
    Cole Murray
    * X: https://x.com/_colemurray
    * LinkedIn: https://www.linkedin.com/in/colemurray/
    * OpenInspect / Background Agents: https://github.com/ColeMurray/background-agents
    Timestamps
    00:00:00 Introduction00:00:43 Why Everyone Is Building Their Own Devin00:01:57 Devin’s 2025 Ramp: 7x PR Growth and 80% of Commits00:03:49 OpenInspect and the Rise of Open-Source Background Agents00:07:59 What Cognition Actually Sells Beyond Devin00:09:56 Background Agent Architecture: Harness In vs Out of the Box00:12:08 Separating the Brain from the Machine00:14:07 Repo Setup, Secrets, Docker, and Full VMs00:19:13 Why Testing Is Harder Than Computer Use00:22:40 Video Verification and the “I Know It Works” Merge Moment00:23:19 GitHub UX, Devin Review, and AI Code Review00:25:42 MCP, Slack, and Enterprise Agent Integrations00:28:59 Memory, Knowledge, and Always-On Agents00:36:16 Sub-Agents, Multi-Agent Orchestration, and Meta-Devin00:43:55 Vibe Coding, Auto-Merge, and Codebase Decay00:48:38 Agent Infra, VPCs, Cloud Providers, and Fast VM Restore00:52:25 AI Code Smells, Reward Hacking, and Code Review Systems00:56:10 Making Codebases Agent-Ready00:58:30 Windsurf 2.0 and the Local-to-Cloud Agent Handoff01:01:15 SRE Auto-Triage, PMs Shipping Code, and Agent Use Cases01:04:32 Agent Budgets, Hybrid Models, and Autonomous Coding Factories01:06:51 Hiring at Cognition and OpenInspect Consulting01:07:45 Outro
    Transcript
    Introduction: Walden Yan, Cole Murray, and Context Engineering
    Swyx [00:00:00]: All right, we’re in the studio with Walden Yan, co-founder of Cognition, CPO.
    Walden [00:00:08]: Happy to be here.
    Swyx [00:00:09]: Which is a cool title. And coiner of context engineering.
    Walden [00:00:15]: Although I think there are many people who’d used the terms in various ways beforehand, but I did find that people, both internally and externally, enjoyed the upgrade from prompt engineering or model wrapping into maybe a more thoughtful way to build agents.
    Swyx [00:00:33]: For those who haven’t caught up on that, I have on screen the Don’t Build Multi-Agents post, which you should go read on and we might refer to, and Cole Murray, who created OpenInspect.
    Cole [00:00:43]: Great to be here.
    Swyx [00:00:43]: So let’s talk about it. Everyone is building their own Devins. What’s going on?
    The December Shift: From Handholding Models to Autonomous PRs
    Cole [00:00:51]: So I think the engineering world is waking up to this idea of background agents, cloud agents, whatever you’d like to call it. And I think we saw a shift around the December timeframe of 2025, where the models Opus 4.5 and GPT 5.2, they reached a capability where we moved away from handholding the model and being able to actually more or less autonomously drive the model. And what I mean by that is that we could pretty much go from a specification to a completed pull request, assuming the spec was good enough, with very little friction. And that paradigm alone, I think, changed a lot of how we interact with agents, and opened this world where background agents became more practical.
    Swyx [00:01:41]: I think for Cole, everyone experienced this in December, but I feel like there was just this increasing ramp, right? There was this moment which was, I think, Sonnet 3.7, where, You guys rewrote Devin in one night or something. So describe 2025 or how it felt from your side.
    Walden [00:02:01]: In retrospect, we always thought it was ramping up, but then even now, over the last three, four months from today, it’s been ramping up even faster. So it’s almost funny to be talking about how, big of a leap Sonnet 3.7 was, and honestly, a lot of it was stripping out parts of Devin that were no longer needed with that jump in of intelligence. But I also just think that a lot of the recent leaps, especially, you look at, models like Opus and the latest GPT models, they are reaching levels of autonomy where people are actually finding that they actually can just be hands-off. And people who were once debating, “Oh, do I need to be in the weeds with my model in the IDE? Can I just completely move it off into the cloud?” That’s a more serious conversation, and we’ve seen that in all of our growth charts. Internally there’s this funny graph where our usage has, of PRs, our merged PRs, has grown 7X since I forget what it was called.
    Swyx [00:02:57]: I think Dev, maybe tweeted that. Yes.
    Walden [00:03:01]: it grew like 7X over, the last, I think it was, two months, three months, something like that. And then you see our engineering headcount growth. It’s, gone up by, 10% or something.
    Swyx [00:03:11]: We were, we were afraid To release this. So this is Devin commit percentages on all Devin repos, was 16% in January and now 80% in March.
    Walden [00:03:25]: It’s a big shift right now. And so it makes sense that a lot of people are now thinking about, buying Devin, but also maybe, trying to build their own and there’s Lots of I have a lot of fun building Devin, so I can see why other people would want to build their own cloud agents as well. Matt, well, maybe it’s good to hear, what initially inspired you to try to build OpenInspect?
    OpenInspect: Ramp, Cloud Agents, and Open Source
    Cole [00:03:49]: OpenInspect came about, through primarily my clients observing how they were using tools like Claude, OpenAI’s Codex at the time, and seeing some of the friction that they were having with it. Primarily the Claude was being used through Slack, and a big issue they ran into was that the sessions that were launched were specific to whoever called it via Slack. And so if a PM was the one who invoked the session and they would then go to pass context to engineering can’t see the session. And that in itself was a deal breaker because the PM, “Hey, engineering, can you jump in?” But there’s nothing to jump in on unless they’re copy-pasting out or the single response that came back. And so seeing some of these problems, I had built a similar architecture internally, just to experiment with, test out different ideas as this trend of moving off of localhost was starting to become, And as Ramp released their blog post, I had a lot of the pieces for this already in place, and just thought it would be funny to, see what Claude could do just purely from the blog post. And on my X account, there’s actually a thread of where I live tweeted, going through this
    Cole [00:05:14]: comparing GPT and Claude as both of them are going through it.
    Swyx [00:05:17]: On the announcement thing or something else?
    Cole [00:05:19]: right after it got released. We can put it in the show notes. Yeah, it was helpful that I had already knew how to verify the system. I knew what I was looking for. I think Ramp did a great job of really illustrating, the technical aspects of how to build something. It was much more than just like, “Hey, we built a great system.” It was, “And here’s how you can build it too.” And so, I resonated a lot with that, just with the problems that I was already seeing, and I thought that, looking around, I didn’t really see anything in the open source community that, met this type of system. I think there’s a lot that run, in localhost like Superset, Conductor, and many others.But nothing that was actually running in the cloud. And so, I built it, and I thought it was interesting to just open source it and allow anyone to then have a foundation that they can mix and match on top of.
    The Business of Background Agents: Open Source vs. Devin
    Swyx [00:06:16]: So literally after Devin was launched was, there was OpenDevin Which became All Hands. I don’t know if you tried that or
    Walden [00:06:22]: I was going to say, one of the things that interested me a lot with OpenInspect was, you didn’t try to go make it then something you monetize. There are a lot of, I think, these open source projects would then go and really try to, raise V
    Swyx [00:06:36]: That’s why no OpenDevin. Yeah.
    Walden [00:06:38]: yeah, and how did you think about that? I thought that was very interesting.
    Cole [00:06:44]: I thought, and just what I had seen across my clients, was that having a background agent system is going to become a critical infrastructure within their company. And so because of that, I think that I wanted to open source it so that they could fork it and put in whatever customization they wanted. To that question though, I get asked all, “Oh, are you going to raise? Are you going to turn this into a service?”
    Walden [00:07:08]: I’m sure you’ve gotten offers.
    Cole [00:07:09]: but primarily I don’t want to do that for a few reasons. One, I think that I don’t want to compete for, $20 a seat. I think that is just a really difficult business. I think it’s very easy to copy the main pieces of it. Again, I built this fairly quickly. And I think because you are not owning, I guess, the entire stack, it’s hard to monetize. You have money being made at the sandbox layer with Daytona, E2b, many other players. You have money being made at the model layer. And you sit in this weird in-between gray area where what are you actually selling? You’re selling, I guess, the infrastructure. You’re selling, the integrations maybe.
    Swyx [00:07:55]: let’s ask the guy. What are you What are you selling?
    Walden [00:07:59]: Well, yeah, there’s multiple layers to this in practice, and actually it’s funny you mentioned the infrastructure, ‘cause when we got started building Devin as well, we had to go figure out how to make the infrastructure as well because,
    Swyx [00:08:10]: You had to build this two years before everyone else,?
    Swyx [00:08:15]: Including, the model side
    Walden [00:08:17]: It was not, it was not very polished at the start, when we just built it off of raw VMs from cloud providers like EC2, the boot up time was so slow, I think, And especially then, turning off the machines, saving them, and then to be able to bring them back up again when the, when you want Devin to wake up again later. It would just be out cold for like 10 minutes because that’s just how long these systems took. They were not built for this repeated down and up usage. And so we actually had to go do all of that. And as a result now, one thing we offer when we go and sell Devin to people is, you don’t have to worry about all the compute side of things. We’ll make it work. We’ll make it work in your cloud if you want it to. But aside from the product, and I want to go into the agents and the tuning of the intelligence part later, but I think a big part of what we do at Cognition as well is to just make sure that your company learns and uses and adopts these coding agents. ‘Cause I think for especially the largest enterprises in the world, you find that there is a lot of people who want to move over to using AI for their day-to-day workloads. But because of the way projects are planned, because, not everyone is literate in using AI in these ways, having a team of engineers who can actually go in and onboard you, set up all the integrations you need, the automations you need to really get to that level of, leverage with AI, is super helpful. And so We do that. We show thought partners to the customers that we work with as well.
    Swyx [00:09:56]: So let’s talk about, architectural stuff. I think that’s always, that is something that was the topic of conversation between the two of you. Is this, the mental model that you want to start with or something else? I’ll just leave the floor open to you guys.
    Agent Architecture: Harness in the Box vs. Out of the Box
    Cole [00:10:11]: I think, maybe we can start here as just a general what are the pieces of a background agent system. And then maybe we can go into some of the nuances of, Decisions that you can make.
    Swyx [00:10:22]: But I guess I also Like, what, maybe what Walden is saying is the agent is like in this open code box, I guess. Right? This is infra, and then there’s, that’s the agent. And you had this discussion about whether you put the agent in here or in Out externally. Can you tease that out?
    Cole [00:10:39]: In a background agent systems, you have a decision to make of where the agent is actually going to run. This is typically described as the harness in the box or out of the box. With running the agent in the box, you’re making some trade-offs by doing that. The negative trade-off you’re making is primarily security. Because the agent is running in that box, unless you otherwise design it, all of your secrets need to go into that box as well. And given the nature of AI, it can be unpredictable, and you could very easily end up accidentally exfilling your secrets, or other unintended behavior. Now, the out of the box is the idea that we are going to have the actual agent running not directly in the sandbox, and we will have, quote-unquote, the brain of the agent running in some type of worker, control plane. That sandbox then is going to serve as the hands where the brain is basically operating and making tool calls into that environment to manipulate it. I guess other trade-off that you’re making between the two systems is that, in my opinion, running it out of the box is much more complex because, you have state that has to be managed, whereas if you’re running it in the box, all of the state of that agent is actually in the box, and yes, it’s you could persist it elsewhere, but it’s all localized and you have less concerns to worry about.
    Walden [00:12:08]: I think a lot of that, what you mentioned, is why we actually from the start built Devin to what we called separate the brain from the machine. The other thing that this allows you to do is reuse any existing infrastructure you have for dev boxes Perhaps. And so you don’t have to worry as much about making a new type of dev box that has all the dependencies the brain needs, as you mentioned, the secrets the brain needs as well. One thing that we’ve seen some customers run into is, you have a GitHub app and you want Devin, your agent, whatever, be able to interact with GitHub through this application, but then you have different users with different actual permissions. If they are all interacting through the same GitHub app and there’s no actual, separation between the system that decides, what it does and the actual secrets on the machine, then you run into an issue where, okay, it’s hard to do the separation. But in practice, with Devin, it’s much easier because we just say whatever you put on the machine, that is, the scope of basically what the user is free to do, what the agent is free to do. So only put the most scoped secrets on that machine, and then the brain is fully not accessible from the machine. So you don’t have to worry about messing with the, any of the most secure parts of the brain if the user is free to do whatever they want with the machine.
    Swyx [00:13:31]: I was going to just bring, I have this, chart from OpenAI, where I don’t know if this is, in the box, out of the box. That is something that they do use to describe it. And then also recently Anthropic did, managed agents
    Swyx [00:13:44]: Which is, this is their thing. I don’t know. It’s all, it’s all variations of the same pattern, right?
    Cole [00:13:49]: So this would be out of the box.
    Swyx [00:13:51]: Which, is preferable for them because it’s less work?
    Cole [00:13:56]: I would say it’s more work.
    Swyx [00:13:58]: It’s more work?
    Cole [00:13:58]: But it, in my opinion, it is the better architecture of the two. It’s just, you’re taking on a bit of complexity by doing that.
    Repo Setup, Docker, and VM-Based Development Environments
    Walden [00:14:07]: One thing I’ve not seen a lot of other players do well is how do you manage what’s actually on the box? And this can be complex for many reasons. Let’s say you have a big repository that’s changing and updating a lot with changing dependencies. How do you make sure that the working environment of the agent actually stays up to date, has all the credentials it needs to, let’s say, run the app and test it, and all the things you want your autonomous
    Swyx [00:14:34]: So a repo setup.
    Walden [00:14:35]: Exactly. So in, internally At Cognition, we call this repo setup.
    Cole [00:14:39]: The hardest part of
    Walden [00:14:40]: It’s been a perennial problem since the start of the company, of how do we help people get this set up? Because not everyone just has, working cloud environments working out of the box. And do you find this to be a common problem with
    Swyx [00:14:53]: How do you solve it?
    Walden [00:14:53]: Your clients?
    Cole [00:14:54]: This is a very common problem, and through my consulting, this is a lot of what I help teams do. A lot of teams don’t really have great developer environment setups, if any. A lot of the times it’s, “Go talk to Bob and get the secrets,” and that obviously doesn’t work when the agent needs to actually set this up. And so a lot of that, most teams are using Docker Compose or some type of microservices. And so for the
    Swyx [00:15:19]: Even in prod?
    Cole [00:15:20]: Not in prod. With the OpenInspect, you are using this primarily to interact, and make code changes. There is other use cases, but you can hook, whether through CLI, MCPs, other tools, you can then hook that into your production systems primarily for, SRE type use cases. But you are not, necessarily, trying to test your prod internal microservice through the system.
    Walden [00:15:48]: And you mentioned Docker Compose. I think one direction we saw some of our friends take early on was, using Docker containers as the level of abstraction for their models. There’s lots of reasons, I think, why Docker containers are not great. One thing is, Docker container’s not really a true security boundary, for one. But the other is, if you are running real applications, a lot of times those applications use Docker, and then you have to think about Docker in Docker, which is, really weird. And so I think part of, the really hard challenge of getting VMs to work, why did we do that? Well, it was because we realized that you actually needed, full VMs to be able to do these types of things. And especially nowadays where there’s actually value in running the application and clicking around and sending you screen recordings of these things. The value just, keeps adding on top of that. But it is a decision I see people run into when they try to build their own systems, is, “Oh, do we, in addition to this, do we put the agent in the machine or out of the machine? Do we use Docker? Do we use something else?” What do you recommend people nowadays?
    Cole [00:16:57]: I think Docker is a good solution for maybe not running the agent, but running your infrastructure, because that is more or less the same setup your engineers are probably already using. If they’re not, then I don’t know what they’re using. But they’re probably already using Docker Compose.
    Swyx [00:17:14]: I’ve always had a small candle for web containers. I don’t know if you guys have tried them before.
    Swyx [00:17:19]: To me, they were, supposed to be like Docker Light.
    Cole [00:17:22]: Is it?
    Swyx [00:17:22]: I don’t know.
    Cole [00:17:22]: No, I haven’t tried it. But yeah, I think any environment that you’ve set up that is a good experience for your developer naturally lends itself to being easy to set up for the agent. And once you figure out that local developer story, you’ve more or less solved the agent in a sandbox, environment setup. OpenInspect does have hooks as well, where you can, run a setup SH script that will pre-install everything. You can then pre-snapshot that build so it starts instantly, and then there is a second hook to actually then, restore the state of the sandbox when it comes back. And so you can already have all of those microservices running and basically get the same experience that you would on your machine within the sandbox.
    Testing Agents: Computer Use, Screenshots, and Real App Workflows
    Walden [00:18:08]: Another thing that we’ve been thinking a lot about is like Different VM service offerings. Have you had customers where they needed like macOS specific VMs or like Windows specific
    Walden [00:18:20]: VMs?
    Walden [00:18:22]: There are like many technologies in the world that only work on specific types of machines, right? If you’re building a.NET application that has to run on Windows or like, maybe more commonly if you want to build iOS or macOS Does that work
    Swyx [00:18:32]: Does Commission support
    Swyx [00:18:33]: Choices like that?
    Walden [00:18:35]: The fundamental architecture we do, because we do the separation, it does support, but the actual work in progress is happening right now on these. Another thing that we’ve actually recently added support now for, it’s in beta, is doing Android development. To do that, we needed to support, I think, nested virtualization within our machines because the VM itself is like a, is a virtualized Firecracker instance, and then you had to then run another Android emulator inside. And there’s like weird performance issues that like, it, which is why it’s like still in beta. We have to think through these problems, but it unlocks a lot for anyone who wants to do Android development.
    Swyx [00:19:13]: I was trying to find like a reference video for the testing thing. I couldn’t find it, but I think you worked on the testing, capability. Why call it testing and not like computer use or I don’t know, it’s, what’s the general Category of problem?
    Walden [00:19:26]: I think that when people think about the ability of an AI to run your app and test it, I think they actually over-index on the computer use part of it because computer use in my mind is the literal, okay, you want what button you want to click. Can you emit the right coordinates to go click that button? I think testing is actually a really interesting like
    Walden [00:19:48]: Problem-solving, challenge for these AIs because if you wanted to do arbitrary testing, imagine you make a change that spans the frontend and the backend, maybe, even some other like even more deeply nested service. To actually test that change, we have to reason through what-- how do you first run these applications to orchestrate with each other with the right version of the code? Then, okay, how do I trigger the feature or how do I make the thing actually happen? And this can get arbitrarily hard, maybe you have to be an admin. Maybe a certain thing has to be feature flagged on. Maybe, you have to like run two sessions and then send us a very specific word into one of them to trigger a specific behavior. And figuring out how do you do that requires a lot of code base context, requires, a lot of orchestration that we’ve specifically done. And in some cases, we found that you actually, no one frontier model can actually do this full end-to-end task itself.
    Walden [00:20:42]: We’ve seen cases where we actually had to orchestrate different frontier models together to solve this problem together. That is where we spend most of our time when we think about this testing problem, not so much the computer use part. Computer use for what it’s worth has gotten a lot better with recent models and it’s made that part of the job certainly easier.
    Swyx [00:20:58]: Especially with like even 4.7, that they released yesterday, apparently like way better in terms of the vision stuff, which is going to be encompassing computer use.
    Walden [00:21:08]: Having evals for all these as well is something that like takes a while to build up. And having the evals be right is tricky as well. Do you ever see like, clients who are building their own agents have to start standing up evals to make sure things don’t regress?
    Swyx [00:21:25]: Not so much evals in the traditional sense, but specific to the testing part that has just gone in. I just added support for screenshots And in theory you can also do video. I need to put in a plugin to do that. But they do show up natively, and it was a very heavily requested feature, especially after Cursor’s recording came out. I think that was very enlightening for everyone of like, “Oh, this is a very good feature to actually have.”, I think with Devin you guys have had this for a while.
    Swyx [00:21:57]: Oh, yeah. See how screenshots work. Yeah, I don’t know if there’s anything, super and not obvious. It’s like once what feature to build, you can just prompt it and it Will mostly work.
    Walden [00:22:09]: I think to Walden’s point, though, the computer use is a subset of the larger testing problem, and I think that’s very specific to the code base that you’re working and it’s not something that, out of the box that you could just solve it. The-- you do need the code base context to actually know how to test it. And I think in the case of a background agent system, you fortunately do have that code base locally that what is changing and could then inspect it and use that to drive the model.
    Swyx [00:22:40]: For those who haven’t seen it before, this is an example of how it works. You, after the PR is done, you click testing approved, and then it sends you back a video. What I really like is that it labels, It’s very small here, but it actually labels what it’s testing. And then it-- and then you actually see the cursor and everything. So I don’t know, yeah, the engineering in this, just Whatever you want to show. ‘cause this is like, this is one of those like, oh, few of the AGI moments, right? ‘cause Once I look at this, I actually don’t I wish I can just merge inside Of Slack instead of going to GitHub ‘cause I don’t need to see the code. I know it works.
    Walden [00:23:19]: Maybe a new feature in Cursor. Yeah, the annotations at the bottom was also a big difference for me when I, when I added those.
    Swyx [00:23:27]: It’s just like, what am I looking at? What are you trying to demonstrate?
    Walden [00:23:30]: Exactly. There’s a surprisingly long tail of small details that ends up making a big difference for this end metric of like how fast do you actually merge the code in. One experience that we spent a lot of time tuning early on was what is the right experience on GitHub for these tools. Because I think, most tools out there when you build the agent, you’ll think about, oh, it’ll create the PR for you. We try to take that a step further and say, “Oh, what if we actually made sure you could interact Devin, with direct Devin directly on GitHub?” And so we made sure that you can comment on GitHub, and Devin would actually receive those comments and address them back. But there’s actually quite a bit of tuning you have to do here because you can imagine that actually like-We recently have Devin Review, for example. Devin Review will post comments on his own PR And then Devin has to then go
    GitHub Workflows: Devin Review, Comments, and PR Automation
    Swyx [00:24:23]: He answers his own comments, which is Really loopy. So like, yeah, I like that it just updates here that it’s, that I have commented But usually it’s just me saying like, “Hey, merged, fix any merge conflicts.”
    Walden [00:24:37]: The, so when Devin fixes his own comments, you might be scared that, oh, maybe I’ll infinite loop. But we’ve put a lot of work into making sure it doesn’t, both by making sure that the comments are high signal, but also that the agent is thoughtful about what comments it immediately goes and tries to fix, and what comments it’s like, “Wait a second, I think you’re wrong.” Actually, that’s one of my favorite moments is when Devin tells me that I’m wrong, when I try to get it to do something different. But tuning that behavior, actually makes a big difference in terms of how useful the actual GitHub experience is.
    Cole [00:25:06]: I think to touch on that as well, I think having the AI reviewer integrated into the system is a critical part of this background system. OpenInspect does have that. It has a GitHub code reviewer that you can control the prompt. It does do comments as well. It doesn’t do them automatically yet. The capability is there, but it’s not fully used.
    Swyx [00:25:27]: So you have to ask for it?
    Cole [00:25:28]: you do, yeah. You can tag it on GitHub, and then whatever you named your, GitHub bot, it will then follow up on it. It will then, if you have merge conflicts or whatever you have asked it to resolve, it will then resolve it, but it doesn’t do it automatically yet.
    Integrations: Slack, MCP, and First-Party Agent Interfaces
    Walden [00:25:42]: Well, I’m curious, what is, the most common thing that people end up requesting, that they still need on top of OpenInspect when you help them go implement it?
    Cole [00:25:52]: I think a lot of it comes down to actually integrating it into the company. It’s one thing to have the background agent system set up, but if it isn’t actually integrated into your larger ecosystem, it isn’t that useful. It is useful to be able to kick off sessions, but what we really want to be able to do is hook it into all of our other systems, whether that is the production database with read-only credentials, the logs, a Confluence or internal knowledge-based system. I think that is where I see the huge leap for companies, and that can be a challenge for companies as well who are maybe not familiar with exactly how to approach it, especially if they’re in environments that have more compliance type things where, access control can be pretty big and how do you deliberately think about these problems, I find to be, one of the problems that comes with a system like this.
    Walden [00:26:46]: The thing we found is So, MCPs, obviously it has been like this, really big explosion of, oh, you can go, integrate it with all these different things. But to actually get the integration right and the and get the right experience, oftentimes we found that we had to go build our own ad hoc things. I think Slack is a great example of this. You could give your agent a Slack MCP and okay, it can post messages back to you on Slack. But we actually use Devin like a coworker in Slack, and that’s how it’s been built from the ground up. But to do that, you actually need to, support webhooks that come back, right? And then Devin has to respond in a natural way and then hopefully don’t spam your threads too much and annoy the people in your company. So you got to tune that experience just right. Especially when there’s a lot of back and forths, we find that we actually have to go beyond the simple MCP integrations in these places.
    Swyx [00:27:39]: I just pulled up the MCP marketplace. I know this is a Fair amount of work. Is the answer to eventually take first party control of all the top MCPs? Is that the
    Walden [00:27:48]: I would love a world where you could have something that’s more expressive than MCP. That, goes both ways, not just a set of tools, but a proper system that interacts back and lets it Have the right experience with all these interfaces.
    Swyx [00:28:03]: So there actually is sampling in the MCP spec, but nobody Uses it, right?
    Walden [00:28:07]: And so I think that’s the other part is, actually we found that when the MCP spec starts to get too complicated, it starts to lose its original promise of Being like a simple one-step connect. Now then we have to go figure out how to support all these different variations of things and It starts to look a lot like just building the first party integrations in a lot of these cases now.
    Cole [00:28:29]: I think it matters, too, how critical it is to your company, right? If this is something that nearly every session is going through, it probably makes sense to own it so that you can make optimizations on top of it Versus just whatever is off the shelf.
    Swyx [00:28:43]: Awesome. Other than MCPs, what else, sorry, well, I don’t know if that’s Narrowing in too much on, integrations. But what else? What other elements of building OpenInspect or Devin that you guys really sink on?
    Memory and Knowledge: What Agents Should Remember
    Cole [00:28:59]: I think, a problem that comes up very frequently is this idea of memories or knowledge base.
    Swyx [00:29:05]: Oh, boy. How do you solve it?
    Cole [00:29:08]: so not solved yet, is the short answer.
    Cole [00:29:11]: it’s something, there’s a open issue for it, someone asking about it.
    Swyx [00:29:16]: There’s, I, D Wiki hasn’t indexed anything about memory yet.
    Cole [00:29:20]: how I’m seeing it solved across my clients is primarily through skills. I find that skills can be a good gap within that or updating Claude MD, but I think memory as a whole is a pretty unsolved problem, and it is why I’ve been hesitant to add it. I think there is parts of memory and that can be addressed, but I think as a whole it’s a very difficult retrieval problem.
    Swyx [00:29:44]: Oh my God. RAMP didn’t write anything about memory? I see zero search results.
    Walden [00:29:50]: No. Memory can be quite tricky to get right because it’s the retrieval, but also the generation of the memories that can be really tricky. You don’t want it to just like Remember very specific details.
    Swyx [00:29:59]: Walk us through the Devin memory journey because I know there’s been a journey.
    Walden [00:30:03]: the first version of memory that like stuck around for a while was A system we have called Knowledge. And the idea was we wanted it to pick up things over time and not need the user to be proactive about teaching Devin things. So, okay, any time you remind Devin, “Wait, no, that’s not quite the way you’re supposed to use Git”Like, we actually want Devin to say, “Hey, do you want me to actually just remember this for the future?” And for you to just basically quickly approve or reject and for it to build up over time. ‘Cause I find that, 95%, I think, or some crazy stat like that of the memories that Devin has are all through these auto-generated things. Very few people actually just want to sit down and write big docs on Here’s how you’re supposed to work with the technology, et cetera. The generation and the retrieval has been something that we’ve been trying to tune a lot over the years. Generation, you don’t want it to remember something like, if you asked one time to like, “Oh, please open as a draft PR,” you don’t want to be like, “Oh, everyone forever now should get their PRs as draft PRs.” But you do want some, conveyor. Maybe you want to say like, “Oh, Cole generally likes, things to be created as draft PRs.” Same with retrieval, if you have thousands of these memories, how do you actually make sure they’re retrieved at the right time? And that can be quite tricky to do right without exploding the context with a bunch of useful yeah, useless information. Surprising amount of just, eval work to just make sure that, memory is, remains a reliable system as new models come and go.
    Cole [00:31:31]: Do you have anything that you could share on, memory pruning? And like the temporal aspect of memory?
    Swyx [00:31:36]: Deleting and forgetting?
    Walden [00:31:39]: The, today, the, So the things they could do is it could edit memories. And so if your memory used to say like, “Oh, Cole likes to open everything as like a draft PR,” then you can imagine, “No, don’t do that.” And then it’ll say, “Oh, do you want me to update the memory to be Cole now want everything as, open PRs?” I think that at the same time we don’t know if this is going to be the final version of the system. Whatever we have here will probably, translate into the new system that we’ll be coming up with. But I think one big difference between two years ago and today is these agents are really good at using anything that resembles a file system natively. And so part of us are, is thinking, “Oh, should we rebuild memories to feel more like a file system that we let the agent navigate on its own?” That’s been an interesting exploration. Also similar ideas in the scale space.
    Swyx [00:32:35]: I am pulling up OpenClaude’s memory thing right now. So memory, OpenClaude has like this like daily memory journal thing, right? And you can I mean, that is a file system you can grep through and is a source of truth. I don’t know if it’s the best. It’s probably super noisy, but at least, if you lose something you can discover it or you can apply some, forgetting algorithm to, more ancient memories that don’t get recalled again or something. I don’t know.
    Walden [00:33:01]: One thing we’ve been trying to do to push the boundaries of how you use agents at your company is letting an agent basically have a very similar file, a memory.md or something, and just like be your permanent PM for a specific set of issues maybe. So we have like some Slack channels internally, maybe a Slack channel dedicated to, a specific product like DeepWiki maybe. And you can imagine that, or you want a Devin that never stops, it’s just always awake, but it has this like memory dock that it can just maintain for itself about, okay, what are like the number one priorities of what we have to fix and prioritize? Who is responsible for some upcoming work? Maybe they’ll even Devin will even tag you on some recurring basis. And so it’s been an interesting move to see, okay, how can we actually use Devin for more than just engineering? Can we actually upstream above the engineering process and maybe it’s just Devin creating tickets, which then maybe some humans do, but then maybe other Devins do.
    Swyx [00:34:00]: One of my more fun automations is go research competitors and just suggest stuff to me on a weekly basis. That’s the automation. I can’t find it right now, but basically it just like, “Look at competitors and suggest things.” “And here are three things that you’ve suggested that I don’t want any more of,” and you just stick that in the prompts. But like I wish actually So for like when I, for example, when I reject a PR, I wish that it updated memory so that I can then just not have to go up, go back and update the scheduled, sync, but anyway, feature request.
    Walden [00:34:31]: what? We might change it soon. I guess OpenInspect, in the time you’ve been around, has there been anything you tried to implement but then you had to like undo and like do a different way?
    OpenInspect Architecture: Webhooks, Control Planes, and Agent State
    Cole [00:34:41]: Nothing yet, but something that is on my mind. The initial way that I built it was that each of the integrations lives as its own package. And so you have The Slack bot, which is what’s handling the webhooks, and then is basically interacting with the control plane. As I’m seeing the system starting to be more integrated, specifically with the GitHub bot integration, I’m considering bringing that all into the central control plane because especially now I want to start, And a request that I’m getting is the ability to monitor, the actual, pull requests being merged, as well as just tracking of
    Swyx [00:35:19]: What do I have open?
    Cole [00:35:21]: What do I have open? How many of these are getting merged? How many comments are showing up? To just understand the health of the system. And so in the case of a GitHub app, you only have one webhook. And so then it’s a question of do I put that webhook in that GitHub bot package? That’s weird. It doesn’t really make sense to live there because that package is more for like the code reviewer. Or do I like centralize it? So that’s something that’s on my mind of, making that decision. I think the other one we touched on earlier is the harness in the box versus out of the box. I think long term the architecture will eventually come back out of the box. Some of the newer tools that I’ve added are calling back into the control plane so that you don’t have the secrets in the sandbox. And so I think long term I probably will pull the actual, agent out of the box, but I think for now it’s fine.
    Subagents and Multi-Agent Systems: When Parallelism Helps or Hurts
    Swyx [00:36:16]: Just, a quick question on pulling the agent out of the box. I’m One thing I’m very bullish on this year is agents calling other agents or spawning sub-agents or Whatever you want to call it. Does that make it harder or easier? I can’t tell. Because if the harness is in the box, you can just spin up more boxes. If the harness is outside the box, then you’re, it’s less easy because you are, you have a unicorn pet of a, of a harness that’s, living outside the box.
    Cole [00:36:45]: In theory it would be the same way, right? Whether, one agent has launched many, sub-sessions within it, OpenInspect, for example, can launch sub-sessions and actually create other environments and then monitor them. In the case where it is out of the box, that would basically just be an additional session that’s running. And so that session is also running outside of the box. It’s running in your worker plane, wherever you’re running this. And then you really just have to think about how does your top level agent then interact with it. I do think it can be more complex, just ‘cause again, you have now a more difficult architecture. But I think if you figured it out once, it’s probably fine.
    Swyx [00:37:26]: Well, then I’m just, throwing it open to you in terms of, I call this like meta Devin management. Which is like the, Devin’s calling Devins or Devin scheduling Devins or querying trajectories or anything like that. What have you built or unshipped, anything?
    Cole [00:37:46]: I think one of the surprising things we’ve seen is that a lot of the ways that, these, separate agents work with each other, and you want them to, parallelize their work, has still mostly followed the same manager sub-agents regime. And a lot of people I think are excited about this world where you have swarms of agents that, talk with each other all over the place. We’ve actually given Devin an MCP so they can just go arbitrarily message other Devins And create new Devins, et cetera. But I guess, it somehow creates, a really chaotic world in that sense. And so we’ve still found that most practical use on a day-to-day basis has been one single Devin.
    Cole [00:38:33]: Figuring out how to segregate the work and get, have other Devins work on it in, a relatively isolated sense, each with their own boxes Not sharing machines, so there’s, a very little room for conflict is the regime that you have to create today.
    Swyx [00:38:50]: I’ll call out, the experiments from Cursor, right? This is Wilson Lin’s work on Single agent to multi-agent, and you’re obviously famously on the side of don’t build multi-agent. But they went through the whole thing, only to arrive at, this Which is exactly what Devin has, I think.
    Cole [00:39:08]: I think there will be a revision to that post at some point About
    Swyx [00:39:12]: Tell us about it
    Cole [00:39:12]: I think multi-agents were very much not at all possible a year ago. You do see more multi-agent experiments today, but you can argue, are they really multi-agents, or are they just just, tool calls,? There are people who, will create sub-agents to go look for XYZ file, XYZ implementation. Has really nice context management benefits because all of the tool calls and tokens that it spends then get collapsed back to just the answer for the main agent. There’s a lot of benefits to doing this. We basically have Devin do this with Deep Bookie, make a call out to Deep Bookie, give you back the results, but that feels like a tool call,? It’s not like these, two collaborators actually talking back with each, back and forth with each other. But I think the thing that gives me the most bullishness that multi-agents might actually be possible is actually what I said earlier about Devin will actually sometimes tell me I’m wrong and push back, and I think that demonstrates a level of maturity and communication today that makes a multi-agent world possible. One, can two agents who have seen different information come back to each other and actually figure out who is right, what is the correct implementation? They’re not just, yes men. Claude, I guess is like, used to just say, what is it? “You’re right,” or,
    Swyx [00:40:25]: “You’re absolutely right.”
    Cole [00:40:26]: “You’re absolutely right.” Yeah.
    Swyx [00:40:28]: The Have you seen, did you see
    Cole [00:40:29]: The age is over
    Swyx [00:40:30]: The Codex app troll in Topic? This is the Codex app. Inside of Settings, there’s a little, there’s a little Easter egg, right? So if you go to, the Themes or Appearance, right? There’s all these, color codes, and the top is absolutely, and it’s the Topic’s colors. Which is such a troll. Anyway.
    Model Behavior: Pushback, Adversarial Prompts, and Agent Skepticism
    Cole [00:40:53]: I love that Easter egg. Did you discover that yourself?
    Swyx [00:40:54]: No, it was, someone was, tweeting about it And I was like, I was like, “Is this true?” Because, sometimes people just tweet stuff to, get a rise out of you. But yeah, there you go, in Topic colors.
    Cole [00:41:06]: Yeah. So yeah, we’re out of this regime where, it just says you’re absolutely right, and they can have real conversations and real back and forths.
    Swyx [00:41:13]: You can prompt it as well to be more adversarial or whatever. Yeah. Okay. Yeah, that, I mean, to me, that is more intelligence, right? That is not just something that’s, a dumb tool, it’s actually pushing back on you I think. Yeah.
    Cole [00:41:24]: when you mentioned, of course, the blog posts. There was one blog they had where they fed a swarm of agents together and built a browser.
    Swyx [00:41:34]: That was I think that was the one.
    Cole [00:41:36]: You can have, like
    Swyx [00:41:37]: I think it’s the same one
    Cole [00:41:37]: Creation of it. We found a surprising success of, don’t do a swarm or anything, just have one Devin, it does its own context management. Just let it keep running for a while and give it some crazy tasks. I think we asked it to, rebuild, a Windows OS system. And it managed to do it just like, going on for long enough. It’s
    Swyx [00:41:55]: Was this Andrew’s thing?
    Cole [00:41:58]: there were lots of demos that we ended up not posting, ‘cause at some point we’d just be posting way too much a bunch of, Demos. But I love that because it shows that I think the multi-agent thing still has, a bit of exciting sexiness to it, which is maybe still beyond still, the actual delta it adds to the capabilities of these systems. But it’s absolutely the future. I think we’re heading in that direction and we can see the progress being made there already.
    Swyx [00:42:25]: If I were to, make one super minor pushback because I don’t feel that confident about it yet
    Cole [00:42:33]: Go for it
    Swyx [00:42:33]: But I’ve had Ryan Lopopolo from OpenAI on the pod And he’s a super slop cannon, right? Oh my God, that’s my coding agent being done. I downloaded this, Peon Ping. I don’t know if you guys have heard this. It takes like-, sound packs from popular games like, Command and Conquer and Warcraft, and then it plays it whenever it’s done. And so it’s like, “Work,” or whatever, “At your command,” or something. Anyway, what I got from the Cursor code base and from Ryan’s thing was that there’s a slop cannon approach where you try to loosen the single agent’s, bottleneck, and I feel like that is, probably an, a very important thing to try to figure out. I don’t think anyone’s, really solved it. Because then you just have more reviewer slop on top of the agent slop To try to wrangle it all. Ryan will probably very strongly object that I say that he hasn’t solved it, but he thinks he’s He thinks he’s completely solved it. But I think it’s still I think it’s, very important, ‘cause, that is a bottleneck, right? I feel Devin is slow sometimes Because I’m like, well, yeah, this is very readable and very sensible, but also it is slower than it could be if I just, I want a button to just say, “Just ramp this up 1,000 next parallel, in parallel and just, see what happens,”? And I don’t know if that’s, feasible at some point in the future.
    Code Review, Entropy, and AI Slop
    Walden [00:43:55]: I And we’ve also run experiments internally where we’ve basically tried to build entire products, true products that we knew we would eventually ship, but for now, let’s try to see if we can do it just by purely, vibe coding on top of each other, auto merge, no code review at all. And then there’s this benchmark of how many weeks can you go onto this for Before you say, “We have the trashiest code base.”
    Walden [00:44:18]: “Let’s actually rewrite it from scratch.”
    Swyx [00:44:19]: Start a new factory, yeah. What’d you find?
    Walden [00:44:21]: I think we found that the state-of-the-art in December was you can probably, run this for about two weeks. By the end of those two weeks, you’d find that, hey, you want to, change the color of a button. Well, it turns out this button is implemented in, 10 different places, and they, have All these different variations, and oh, you forgot one of them, and actually it’s a slightly different color in one spot. And you’re like, “Okay, this is too much to work with. Let’s actually try to do code review at the same time.” And make sure that we’re on top of our software, actually cleaning it up a bit And making sure it’s done in a scalable way.
    Cole [00:44:54]: I think building on that, the idea of, you don’t have to look at code, I think is generally a bad idea. And the meme that I have for that
    Walden [00:45:03]: What timeline, all right, is Do you think that statement will be true on?
    Cole [00:45:06]: I think probably for a while it’ll be true that you should continue to look at your code. A problem that I see a lot of teams run into that I work with who are embracing AI native, AI first coding, is The meme that I have is that your code base regresses to your worst engineer, because that engineer who is, very gung-ho about AI and is not auditing their code, their pattern starts cementing into the code, and now the AI is referencing their patterns. And so now their if/else block that, is 20 if/elses back and forth, the AI is seeing that as the pattern of how things are done and starts to then exponentially grow this slop. And I find to your point, a pretty good approach to that is having scheduled cleanup, whether by humans or through systems, that are looking for duplication. They then address that. You’ll end up with like 12 helpers for how to format a date. And you need to address that, because otherwise it will continue to sprawl.
    Swyx [00:46:09]: Within balance, I think it’s fine to have some duplication, and then sometimes To have garbage collection, right? Yeah. The What I’ve been, talking about with a lot of engineering leaders is that you want to be very strict about the boundaries between modules, and it’s your job as an architect, as a CTO, whatever, to say like, “Okay, here’s the hard contract between you guys and you guys. Whatever you do inside this black box is your business. You do whatever. But between these guys, let’s be, really damn clear, and any movement must be signed off by a human or me,” or. Then, and like that’s that. I don’t know if you have any other modifications or advice.
    Walden [00:46:44]: Well, I guess generally on the topic of, where humans can be useful, I found that ‘cause, some of these, really deep infra problems, sometimes just having a human that just has, really deep expertise can make a big difference. I’ve actually seen this come into play when actually building agents. So we’ve had a few friends now, try building their own coding agents, and I think one same problem that I recurringly heard a lot of them run into was this problem of like, “Oh, Grep is really slow on our agents’ machines.” And so a lot of them, I assume because they’re using AI and they themselves don’t have, super deep infra background knowledge, say, “Okay, we’re going to go build our own custom Grep index. It’s going to be really fast,” and use that as a way around this problem. When we ran into this problem About like, maybe like a year and a half ago when we were, in the early days of building Devin, we obviously didn’t have AI then. We just asked our, how to, how to do this. You can just swap out a new Grep index, so.
    Infrastructure Details: Grep, File Systems, and Sandboxes
    Swyx [00:47:45]: What do you mean you hand-coded Devin? What?
    Walden [00:47:48]: It’s like, can you believe we hand-wrote this code? And we had, our infra people who are really amazing, they were looking into it and they’re like, “Oh, what? We realized that actually the root cause of this problem is actually super simple, but like fine-grain detail,” which is that a lot of these virtual machines actually underlying them don’t use real file systems. They use these, network file systems where things are actually cached over the network actually in S3. So when you’re Grepping, you’re actually making network calls Every time you’re doing these things, and that’s why Grep is extremely slow on these machines. And so again, goes back to, what is all of the crazy infra work that we had to do to actually get these machines working. If you try to do this yourself, there are tons of small details like this, and so we had to eventually go swap out that network file system. But
    Swyx [00:48:35]: I think there’s a write-up about it, right? Silas did one about the virtual file system.
    Walden [00:48:38]: Oh, that was a whole other thing. The
    Swyx [00:48:39]: Oh, that’s a different thing
    Walden [00:48:40]: The BlockDev file storage format
    Swyx [00:48:42]: I’ll bring it up
    Walden [00:48:42]: Which is, a file system format that we built so that the VMs could be spun up and down very quickly. Basically, the intuition behind this is-Imagine you have, a terabyte of disk, and your agent only, wrote, a hundred lines of code on top of that disk. How long does it, say, take to, save and re-bring up that disk? And most systems, because you’re not optimizing for this case, it’s just, on the order of a terabyte of work because you have to Save all of that and bring it back up. In our system, we try to build a file system that incrementally builds on top of each other. So every time you save and bring the machine back up, you’re only doing work that is proportional to effectively the diff in the file system. And so this, shaves off a lot of time in the boot-up process of Devin. I think we This is actually now outdated. We have a newer system inside of Devin. But yeah, there’s a lot of tiny details you have to get right here to actually get the day-to-day experience of Devin to be good.
    Swyx [00:49:39]: It’s, not technically agents, but it is agent infra, and when you sell an agent as a company, you sell agent plus agent infra.
    Walden [00:49:46]: At least the way we do it be And the other The nice thing about having the agent infra being done together is, you We get to deploy Devin in whatever environment we want now. We don’t need to wait for some underlying infra provider to also go and support VPC or on-prem or FedGovCloud, for instance. So we can actually go and figure out, okay, since we own the infrastructure, how can we get that set up for you?
    Cloud Providers: Modal, Daytona, and Enterprise Sandboxes
    Swyx [00:50:12]: Whereas you’re Cloudflare dependent.
    Cole [00:50:15]: so Cloudflare runs the control plane. The sandboxes, Modal is supported. A contributor just added Daytona. E2B is on the roadmap, and I think there’s an abstraction in place that if any contributor wants to add a new provider, they can add that in.
    Walden [00:50:32]: Well, what are, How are the customers you work with Do they generally try to then go set up a contract with another one of these third-party providers? Do they try to do the VMs in-house?
    Cole [00:50:44]: most of them I see using Modal. I think Modal has a great
    Walden [00:50:48]: Shout out Modal.
    Swyx [00:50:48]: Shout out Modal.
    Cole [00:50:50]: I think Modal has a great offering. It captures all of the sandbox pieces you need, snapshots being a pretty big piece of that, and given that they also offer GPUs, I think it’s a pretty nice offering as a whole.
    Swyx [00:51:04]: no debate there.
    Walden [00:51:07]: Modal is great, especially, I think their container offering is, the most natural, and so especially if you are willing to, forego, the full VM requirements Modal is, a really vast place you can spin something up on.
    Swyx [00:51:20]: Is there a point So Modal’s very Python, and I feel like most workload, has really shifted to JavaScript. I don’t know if you guys Get the same feeling. So, okay, when I started Landspace and IE and all these things, I was like 50/50 Python and JS, right? That’s roughly. I think that’s wrong now. I think JS has won. I don’t know if you guys Like, I Maybe I’m overstating it, and maybe for cognition, there’s, C# and Java and what have you. But for, new greenfield apps, do you feel that Do you get that sense? Does it matter?
    Cole [00:51:52]: I think that most of the libraries that I see in this space are Python native first, especially in the
    Cole [00:51:58]: Observability space. That said, I think that there is a pretty big appeal of having your entire system in one language. Especially when you have both your frontend and backend communicating, you can have one central type Which is very nice.
    Swyx [00:52:11]: That’s my case against Modal, which is Then you have to run JS. You can run JS inside Modal. It’s just, one extra step That, isn’t native to the runtime. I don’t know if
    Walden [00:52:22]: I don’t know
    Swyx [00:52:23]: Reviews. Do you have numbers? I don’t know.
    Walden [00:52:25]: the one thing I don’t like about Python is whenever AI, whenever it writes Python, it always does, the weirdest patterns, and
    Swyx [00:52:32]: Oh, because it’s, mixing two and three or what?
    Walden [00:52:34]: I think it’s something mixing two and three, yeah. The I don’t know if you see this. It always tries to do, has attribute on objects as like
    Cole [00:52:41]: Oh, my God.
    Walden [00:52:41]: But it’s like But that you shouldn’t be doing that. It should error if there was
    Swyx [00:52:45]: Because it’s training on library code?
    Cole [00:52:47]: I think it’s more of, like
    Cole [00:52:48]: From what I’ve seen, it’s more of, a reward hacking mechanism where it doesn’t want to basically
    Walden [00:52:54]: It’ll never error.
    Cole [00:52:54]: It doesn’t want the code to fail. And so it Even when it knows it has the attribute, it’ll call getattr on a, and for a lot of my clients who have moved towards more autonomous coding, we’ve put that in as a lint rule That if you do getattr, your pull request is going to fail.
    Slop Signatures: Comments, Backwards Compatibility, and Types
    Swyx [00:53:12]: Ooh, this is a fun topic. Can you tell me more about this? What else is a sign of AI coding that you have to put guards in?
    Walden [00:53:21]: So we were talking just before this about Opus 4.7. One of the things this new model likes to do is it writes lots of comments. Not like, it’ll, comment every line, but it’ll write, paragraph, PRDs, on top of every function. But I will say, to its credit, these aren’t slop, descriptions like they were before. “Oh, here’s what this function does.” It’s like, “Oh, here’s actually the reasoning and why we chose this approach and what the alternatives were and why we shouldn’t do those alternatives.” Still too much information, but I wonder if this actually might be directionally correct if you want systems that can self-maintain themselves in the long run.
    Swyx [00:54:04]: Oh, they write the specs inline.
    Walden [00:54:05]: Have all the context In the code as well. Yeah.
    Swyx [00:54:07]: So you approve?
    Walden [00:54:09]: I But at the same time, it’s this tricky problem. Maybe we’ll just give our users, a setting or something, for, how verbose you want it to be. I haven’t loved it. Honestly, I just I like the comment, but please, get rid of it. But I could, I could see a world where maybe something of the sort becomes reality. I don’t know If you guys know about GitAI. So
    Swyx [00:54:32]: We’ve talked about it, yeah.
    Walden [00:54:33]: GitAI, the idea behind it is
    Swyx [00:54:34]: I’ll bring it up
    Walden [00:54:35]: That if you run an agent, the actual prompts you send to the agent should be stored alongside the code inside the Git metadata so that future agents can reference it, maybe code review bots can reference it. And it’s ideal world where, your context for why decisions were made constantly lives aside, beside your code. And so it’s, maybe a more hidden version of this, write massive PRDs for every comment approach.
    Swyx [00:55:01]: I’m waiting for the real bull case where we just get rid of Git altogether. We’re not I’m not, I’m not there yet, but I’m looking for it because that would be a big shift.
    Cole [00:55:11]: On the topic of, visible slop, a pattern that I see a lot of across GPT models specifically is backwards compatibility, at all costs
    Cole [00:55:21]: Where it’s doing these weird import exports so that it doesn’t have to modify, the names of where the modules were. And I’ve seen Claude 4.6 starting to do this as well.
    Cole [00:55:33]: And again, I think it is this, reward hacking behavior where it doesn’t want failure to occur, and you can address that through, Semgrep or other tools where that behavior is pretty easy to identify. But it’s something that you only learn through the trade of just seeing code patterns. Untyped tuples are a really big problem of just, again, just throw any in there, dict string any. And again, you can address those through linting.
    Local Testing, Mock Servers, and AI-Ready Codebases
    Swyx [00:56:01]: Awesome. Yeah. Any other So, linting, any other tools? Devin Review, of course. Not so, not so free now, but still use it.
    Walden [00:56:10]: Well, the one thing that I think we try to recommend teams as they use more AI agents, it goes back to this, local testing thing. In the end of the day, you want your agent to be able to do the full thing, not just write the code, but actually run it and test it. And a lot of code bases were not necessarily built for this from the start. For example, you probably do want a local DB setup, a local Docker Compose and Postgres in order to have it so that you don’t need to give your agent any crazy product credentials to actually run and test its code. We’ve also internally done a big shift to make a lot of our core, components of code testable as purely local dev without needing to actually, integrate with, any live services for this reason. And honestly, the older the company, the more you have to change to shift in this direction. But you can use AI to help you perform this migration nowadays.
    Swyx [00:57:02]: The older, the older the company, the more you have to change in order to do local dev?
    Walden [00:57:05]: I think so.
    Swyx [00:57:06]: Or am I misunderstanding? So you’re saying
    Walden [00:57:08]: Or often times
    Swyx [00:57:08]: Most people just build with full integration to all their stuff, and there’s no code path to switch it to local.
    Walden [00:57:14]: Especially in, when there’s, lots of different services and you have, microservice architecture, making that shift, the larger the code base, the harder it is. I guess if you did build it correctly from the very start, I think it’d be possible. But also, a lot There are a lot of companies in the world that got started before Docker was a thing, and so You’re forced to make a migration at some point.
    Swyx [00:57:35]: Well, Devin’s good, very good at making mock servers. Right? So, And no, the Well, one of the projects that I really want to It’s like, it’s like Little Snitch. I don’t know if you guys have heard of this.
    Cole [00:57:44]: I run Little Snitch on my computer.
    Swyx [00:57:46]: It’s just like There’s, a man in the middle, but it, shows you all the traffic going back and forth. But then from there you can reconstruct the server, right? And then, and then, create local mocks so you can local mock everything if you just observe traffic for a little bit.
    Cole [00:57:58]: That’s an interesting idea.
    Swyx [00:58:01]: cool. I don’t know if this will get anywhere, but I wanted to maybe talk a little bit about the CloudCode, leak because usually if I have an Anthropic person on, I can’t talk about the CloudCode leak. Did you guys learn anything from CloudCode? I
    Walden [00:58:19]: So if I say
    Cole [00:58:19]: This is the first time I’ve seen it
    Walden [00:58:19]: I was not that, interested in the Leak. We didn’t spend that much time on it
    Walden [00:58:24]: If I was to say, but
    Swyx [00:58:25]: I’m just, I’m just, fishing for
    Cole [00:58:28]: no, I didn’t really,
    Cole [00:58:29]: Research too much into it.
    Windsurf, Local Agents, and Cloud Agents
    Swyx [00:58:30]: Fair enough. Okay, one more last thing before we go. Windsurf 2.0, you guys shipped another thing. So The meta context is you use background agents enough, sometimes you’re going to want to bring them to foreground. And that little, hands-off from local to cloud is hard to work on. And then And Devin has Or Cognition has just done it.
    Walden [00:58:50]: I think for me the biggest, gap this is trying to close is, again, how do you make the testing process as fast as possible? When it can test on its own and send you a video, it’s freaking magical. Sometimes there are just really difficult things you can that you do just need to, pull down locally. And we just want Windsurf to just be your, local command center of all your agents, your background ones, your local ones, and you can imagine, “Oh, okay, this agent needs me to review something. I’ll pull that down, move my other agents to the background, go test it. Okay, boom, done. On to the next one,” right? You have some issue you got to fix in the background, just click, approve. Okay, set up, start a background agent to go fix it. I’d love a world where I don’t have to leave this window. Then maybe the other window I got to figure out how to stop spending so much time into Slack, but maybe, someday We’ll want to get those tools all.
    Swyx [00:59:38]: And does that require the binaries to be exactly the same for local versus cloud?
    Walden [00:59:46]: So the funny thing here is that the behavior between local agents and cloud agents, I think is actually a bit different In their ideal state. I think local agents, you want them to be a bit more fast and let the user make the call on things. Actually don’t try to autonomously go test things. The background agent mode where you go start it off, I think the agent should just assume the next message I send a user should just have everything that the user needs from me and not run and stop Keep running and don’t stop until you have the testing Until you have full report.
    Swyx [01:00:19]: So that’s a, that’s just a slightly different prompt.
    Walden [01:00:20]: But for many reasons, because of all the work we do to make sure that Devin works with different Git providers, that it works with different, OS’s and VM’s, we want as much of that logic to be shared as possible. So for our own practical purposes, we try to share as much of it as possible.
    Swyx [01:00:36]: Yeah. I mean, I can’t imagine how much work it is to, transition back and forth, so congrats on shipping this.
    Swyx [01:00:45]: okay. Anything else that we should cover before we, wrap? Just whatever you guys were talking about in your lunch.
    Walden [01:00:52]: maybe, use cases. What are your, do you find to be, the biggest things that your clients are trying to do with their cloud agents today?
    Cole [01:00:59]: Do you want to just ask it again so we can get, a clean cut?
    Swyx [01:01:02]: Because he was drinking his water. Yeah.
    Walden [01:01:04]: The thing I wanted to talk about was use cases. What do you think are the main things that your clients come to you today about, “Hey, this is why we want to go set up cloud agents”?
    Cole [01:01:15]: I think the easiest and most common use case I see across everyone is SRE use cases. The idea that whether we have our alerts in Slack or Datadog or wherever they’re going, we want the agent to be the first responder on that. And that doesn’t necessarily mean that the agent is actually resolving the issue, but just being able to collect that context ahead of time is huge. Because again, that agent is integrated into the production logs, the database. It has full visibility, and over time, playbooks as well for how to address certain issues. And so that’s a huge win for teams because instantly you can have a full trajectory of what is going on within the system, and oftentimes actually a pull request directly from that, which is a pretty neat flow to actually experience of, error pull request done. OpenInspect does support a trigger for that as well, so that could happen completely autonomously.
    Swyx [01:02:09]: From Datadog specifically, or just
    Use Cases: PMs, Support, Security, and SRE
    Cole [01:02:11]: it supports Sentry, it supports a generic webhook, and if someone wants to add Datadog, they can. The other use cases that I see, are for non-builder use cases, whether that’s the PM or the marketing team. I’m seeing a lot of, teams where the idea of who’s actually contributing code is starting to change. And in a lot of cases, the PM, if there’s just a quick bug fix, the PM is not creating an issue anymore. The PM is just prompting through Slack, and the pull request is then being created. And so I think that’s a huge win. I think that trend will continue, where we’re seeing, code modifications happening outside of engineering. The last common use case that I see is customer support. And so where they’re experiencing an issue with a customer, they’re not entirely sure why this behavior is happening. Previously that world was, “Hey, there’s a bug when they tried to use this feature. We don’t know what’s going on.” Well, they’re now tagging that in Slack. Again, that entire full context is ready. They can then just tag in engineering and have a complete understanding of that issue and completely bypass the previous pain points of like, “Oh, can you get more information from them?”
    Walden [01:03:24]: The only things I’d add on top of that I think I’ve seen is, continual security scanning Continual security review Is a very big one as well. The SRE use case, internally we think about it as auto triage Because we just want every message that comes in, and that’s an alert, that’s a bug report, to have Devin just start triaging it before anything else. And we’ve leaned into this use case so much though that we’ve basically tried to make it so that you don’t ever have to leave Slack to interact with this. So again, making the interactions with Devin super fluid from the moment the report comes in to it responds to a report and be able to ask it questions right there with full code-based context about all the issues. Very related to customer support as well, I think one thing that we found is CLIs can sometimes be, very difficult for people who aren’t technical to go and use. But an online chat interface that anyone can go and ask questions and is super intuitive and doesn’t assume you have any technical knowledge but does have access to all parts of your code base, super useful For support, for salespeople, anyone who might need to have their questions answered about the code base. So yeah, great callout.
    Swyx [01:04:32]: This might potentially be, a very expensive, use case. Is there like a rule, sense, a rule of thumb on, how much people should spend on this? ‘Cause, you have unlimited budget, but not other people don’t,? I don’t know if this is an answerable question because obviously it depends on, a lot of factors. But I guess, like
    Cole [01:04:51]: I think it depends really on, how people are using it. I think If people are using it responsibly and they’re getting value from it, then, you can kinda determine the budget. Common numbers that I hear are anywhere from 1,000 an engineer up to 5,000 an engineer. I have not heard anywhere in the realm of, 50,000 an engineer for a frame of reference.
    Model Costs, Smart Routing, and Frontier Tradeoffs
    Swyx [01:05:12]: We’ll get there.
    Walden [01:05:13]: I’ve seen, I’ve seen numbers go that high for sure. I think that this is also I think going to be a big theme of the coming year, is we’re going to see very expensive, very smart frontier models, And we’re also going to see people who say, “ what? I don’t need the frontier anymore for a lot of the work I do,” because some frontier models actually are good enough For a lot of the work.
    Swyx [01:05:36]: Also shout-out you pioneered Smartfind Which is a mix.
    Walden [01:05:39]: I’m really interested in a world where you basically have hybrid frontier and subfrontier systems Where you use the subfrontier part to be really fast, really efficient, and call out to the frontier part of the system so that you can still get frontier performance for the most part.
    Swyx [01:05:54]: I’m trying to search, but Twitter search is, completely broken. I, it’s, the from field is just completely gone. It’s very sad, Because I really want to
    Walden [01:06:04]: No worries. I might have to make a new post at some point about the return of Smartfind.
    Swyx [01:06:10]: Anthropic has now officially adopted it. Okay, cool. I think that’s it. It’s really great discussion and good, great having you guys on. Background agents are a thing now, and everyone’s building them. We, but we talked a lot about, the production concerns and like, well, why you would want to offer one architecture over the other. Yeah, lots to look forward to.
    Walden [01:06:35]: There’s a real zeitgeist in the space right now I think, for companies to want to turn themselves into these autonomous coding factories. And yeah, we’re doing a lot to try to support that. And so, any listeners are welcome to come chat to us about that, whether using Devin or working with us.
    Wrap-Up: Hiring, Consulting, and Agent Adoption
    Swyx [01:06:51]: Hiring?
    Swyx [01:06:53]: what, specifically, just like give like one profile that’s, very interesting.
    Walden [01:06:58]: I think people underestimate the role of, really high-taste product engineers In this space right now.
    Swyx [01:07:05]: And the test is, what have you shipped end to end that is A tasteful product.
    Walden [01:07:10]: If you’ve shipped stuff that you think is tasteful and you’re, and you’re proud of, you should, you should come talk to us.
    Cole [01:07:15]: For me, any businesses that are looking to further their engineering org, a lot of the consulting I do is around that. Teams who are maybe starting their AI journey, whether that’s with Cursor or Claude Code, but they’re looking for someone to help navigate them through the state-of-the-art and beyond just that initial deployment. As mentioned, there’s a lot of lift from you’ve deployed the background agent to how do we actually get this fully integrated into the company and really realizing the true value of that.
    Swyx [01:07:45]: Okay. Well, thanks you guys for coming on.
    Walden [01:07:47]: Thanks for having us.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
  • Latent Space: The AI Engineer Podcast

    🔬ESM: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub

    27/05/2026 | 1h 10 mins.
    Editor’s note: In our first BioHub pod with Priscilla and Mark they discussed their acquisition of EvoScale, led by Alex Rives, who is now Head of Science at BioHub. With ESM-1 they trained language models on millions of protein sequences drawn from across life, with a simple “next token” objective: predict the amino acids that have been randomly masked out, based on the context of the rest of the sequence. But they soon found that these models also learned biological structure and function, including properties the model had never been explicitly shown AND that this ability scales predictably with compute, leading to ESM2 and ESM3.
    Today, Alex announced ESMFold 2, an open scientific engine to power prediction, design, and discovery across protein biology.
    Building on Cryo-EM data (discussed in the CZI pod), ESMFold2 reports state of the art performance on protein interactions, especially antibodies, a critical modality for therapeutics, and evidence that inference time scaling is also working across five targets in cancer and immunology.
    In a nod to that other famous AI x protein folding project, they are also releasing an atlas of 6.8 billion proteins, and 1.1 billion predicted structures, which you can play around with on their website. We are honored to work with them for this huge release!
    One of the refrains we’ve heard on the Science pod has been that protein folding, materials design, cellular biology, etc. are very different problems from Language Modeling. They definitely are. Yet Alex Rives and the ESM team at BioHub just released a preprint and model, demonstrating that vanilla BERT-like transformer models trained on sufficiently large and diverse data sets can beat specialized models like AlphaFold3 on some of the hardest protein-related problems.
    Andrew White had a great segment in our first LS-Science episode that explained how mind blowing AlphaFold2 was when it was released in 2020: it suddenly solved problems on a GPU on your desktop that DESRes had built custom-ASIC supercomputer clusters to solve. John Jumper and Demmis Hassabis received the Nobel Prize in Chemistry for this work.
    AlphaFold2 took advantage of an very clever observation: if multiple species co-evolve pairs of mutations, this implies that the mutations correspond to parts of the protein that are close in 3d space. This is usually shorthanded as MSAs (multi-sequence alignments), and is the key insight which makes AlphaFold2 so effective.
    Like other inductive biases, however, it hurts generalization.
    Scale-pilled before it was cool
    If you take a look at the timeline for scaling laws for LLMs and release of structure prediction models, the ESM team notably doubled down on their MSAs-be-damned approach after AlphaFold2 released. This obviously requires a great deal of belief in the scale hypothesis.
    Why the conviction?
    ESM developed at a time when many of the scaling laws and the “Bitter Lesson” were proving increasingly correct. AlphaFold2’s wild success must have been both exciting and bitterly disappointing. But using MSAs mean that the model is is dependent on training data that contains MSAs in order to be accurate in a given domain. For things like antibodies that don’t have MSAs to train on, AlphaFold tends to do poorly.
    ESM takes a different approach: learn the relationship between different proteins by unsupervised training on as much diversity as you can find (sound familiar?) and then correlate that back to structures know from the Protein Data Bank (PDB) and other sources.
    In other words, a World Model.
    World Model for proteins
    “World Model” is a hype term that I define like this:
    Use unsupervised training to learn abstract patterns from the data:
    * The abstraction should be semantic - novel constructions represent things that obey the rules of the real world
    * The abstraction should be compositional - recombining different patterns leads to novel and often valid constructions
    * The abstraction should support generalization - it predicts things in the real world it wasn’t trained on

    Once you have a world model, you can attach “heads” to it for downstream tasks: predict properties of a protein, decompose its functional features, or search the representation for proteins that meet design criteria. The two big models BioHub just released under MIT license map directly onto this:
    * World model → ESMC (a model trained on 2.8 billion sequences)
    * Structure-prediction head → ESMFold2
    One of the interesting ways the world model can “predict things” is to generate proteins sequences and then measure the predicted properties, such as binding affinity, in the lab. Alex talks in the episode about validating some of the harder molecules they predicted in the wet-lab. Very cool!
    Another way is to use mech-interp techniques such as Sparse Auto Encoders (SAEs) to extract semantic features from your model, and then find novel features that predict unknown biology. I won’t spoil this part for you: it was one of the highlights of the episode for me!
    A cell is a computer
    We have all heard that genes are like computer programs, but usually the analogy fizzles after that. Of course genes are transcribed into RNA and RNA is translated into proteins, so genes are programs for building proteins, but that carries the analogy only to “binary digits are programs.”
    Here’s a better analogy: you can think of the cell nucleus as a storage device / storage controller, the ribosome as a JIT-compiler and runtime, and the semantic features that we learn from our world model via SAEs as functions, proteins as processes that interact together in workflows (signalling pathways) to produce behaviors and outputs (phenotypes).
    Like functions, the SAE features have a hierarchical composition from local, secondary and tertiary structures (mimicing protein structure), but also motifs that are conceptual, such as membrane integrations, disordered regions and disulfide bonds. As we learn to compose these features we into novel protein designs, we move further towards programmable biology.
    Alex goes into much more detail about this in the episode, as well as:
    * Principles for new data collection
    * BioHub’s vision
    * Modeling the cell
    Enjoy!
    Full Video podcast
    please like and subscribe!
    * X: https://x.com/alexrives
    * LinkedIn:


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
  • Latent Space: The AI Engineer Podcast

    Giving Agents Computers — Ivan Burazin, Daytona

    21/05/2026 | 1h 10 mins.
    Take the 2026 AI Engineering Survey and get >$2k in credits and AIE WF tickets!
    On the product side, everyone is getting Computer - Perplexity, Manus, Cursor, and so on. Meanwhile on the research side, agentic evals like TerminalBench and GDPVal are also assuming computer (Harbor). On both ends, the consolidating LLM OS stack has become a standard toolkit, and Daytona is one of a small set of AI Infra companies that are booming because of it.
    “The end of localhost” has been Ivan Burazin’s obsession for more than a decade.
    Something that is all too familiar…
    Long before agents became the default way people talked about software development, Ivan was already chasing the idea that development should not depend on a fragile local machine. CodeAnywhere, one of the first browser-based IDEs, was an early attempt at that future: move the development environment into the cloud, make setup reproducible, and free developers from the endless “works on my machine” tax.
    The thesis was directionally right, but the market wasn’t ready yet.However, agents changed that. They do not care about a laptop, desk setup, or favorite editor. They need a computer they can access through an API: something stateful enough to keep working, fast enough to spin up instantly, flexible enough to resize, isolated enough to be safe, and composable enough to run the messy real-world workflows that real software engineering actually requires.Daytona isn’t just selling “sandboxes” in the narrow code-execution sense. It is the latest version of Ivan’s original localhost thesis.
    In this episode, Daytona’s CEO joins swyx to explain why AI agents need more than code execution boxes: they need composable computers, stateful sandboxes, instant startup, dynamic resources, and infrastructure that can survive workloads going from zero to 100,000 CPUs.
    We go deep on the new agent compute market: Daytona’s hard pivot from human dev environments to AI sandboxes, the New Year’s Eve MVP that customers begged for, why Daytona runs on bare metal with its own scheduler, how one customer runs almost 850,000 sandboxes a day, and why RL/eval workloads went from 0% to roughly 50% of usage in just months. Ivan also explains why agents need Windows and macOS machines, why CLI may matter more than MCP, why Kubernetes is painful for this workload, and why the future AI cloud may look more like Stripe than AWS.
    We discuss:
    * How Daytona grew out of CodeAnywhere, Shift, and the “end of localhost” thesis
    * Why Daytona pivoted from human dev environments to AI sandboxes
    * Why agents need composable computers instead of disposable code execution boxes
    * The New Year’s Eve MVP that customers chased API keys for
    * Why Daytona chose bare metal, stateful snapshots, and its own scheduler
    * How Daytona spins up one sandbox in ~60ms and 50,000 sandboxes in ~75 seconds
    * Why Daytona’s biggest customer runs ~850,000 sandboxes a day
    * How RL/eval workloads create zero-to-100,000 CPU spikes
    * Why RL workloads went from 0% to roughly 50% of Daytona usage
    * Why customers compare Daytona against EKS/GKS and say they’re “never going back”
    * Why every AI agent may need a computer, including Windows and macOS environments
    * The Apple licensing constraints that make macOS sandboxes hard
    * Why CLI gives agents more power than MCP
    * How open source helps agents integrate Daytona
    * Why agent-generated PRs may break today’s CI/CD assumptions
    * Why AI SaaS companies reselling tokens may face a cold shower
    * Why the AI cloud may look more like Stripe than AWS
    Ivan Burazin
    * LinkedIn: https://www.linkedin.com/in/ivanburazin
    * X: https://x.com/ivanburazin
    Daytona
    * Website: https://www.daytona.io
    * X: https://x.com/daytonaio
    Timestamps
    * 00:00:00 Hook
    * 00:01:12 Introduction
    * 00:03:15 CodeAnywhere, Shift, and the end of localhost
    * 00:05:58 What Daytona is: composable computers for AI agents
    * 00:08:07 The pivot from dev environments to AI sandboxes
    * 00:10:17 The New Year’s Eve MVP and customers begging for API keys
    * 00:12:56 Bare metal, stateful sandboxes, and Daytona’s scheduler
    * 00:17:28 60ms startup, 50,000 sandboxes, and 850K daily runs
    * 00:21:53 Spiky RL/eval workloads and the new agent infra problem
    * 00:28:12 RL workloads, Kubernetes pain, and dynamic resizing
    * 00:33:31 Why every AI agent needs a computer
    * 00:38:48 macOS sandboxes and Apple’s licensing problem
    * 00:44:28 Why CLI may matter more than MCP
    * 00:48:11 Open source, GitHub stars, and agent integration
    * 00:53:11 Git, CI/CD, and agent collaboration bottlenecks
    * 00:58:15 Founder life and building a 25-person infra company
    * 01:02:44 AI SaaS, token resale, and API-first business models
    * 01:06:10 GPU sandboxes, data centers, and compute growth
    * 01:09:48 Why the AI cloud may look more like Stripe than AWS
    * 01:11:26 Closing thoughts
    Transcript
    Introduction: Daytona, CodeAnywhere, and the End of Localhost
    Swyx [00:00:02]: Okay, we’re in the studio with Ivan Burazin, CEO of Daytona. Welcome.
    Ivan [00:00:07]: Thanks for having me, man.
    Swyx [00:00:08]: Ivan, you and I go back.
    Ivan [00:00:10]: Way back.
    Swyx [00:00:11]: How I don’t even know how, you found, did you reach out or, for Shift.
    Ivan [00:00:17]: I reached out to you. The reason was you - we were just - we were thinking about I was one of the co-founders of CodeAnywhere, the first browser-based IDE, and so we were thinking a long time of, localhost should die. And you had this article.
    Swyx [00:00:29]: End of localhost.
    Ivan [00:00:30]: Then I reached out to you because of that, and then we talked, and I was actually at a different job and learning about I was the head of, developer experience, and you were quite well-versed in that, and I actually reached out to you, among other people, how do we go about that? What are the key things and whatnot at this point in time? And you were nice enough to take the call, and I remember I was late on your call with you.
    Swyx [00:00:51]: I don’t remember.
    Ivan [00:00:52]: I remember because I was with my then I’m thinking of a girlfriend or wife at that point in time, I’m not sure. It’s the same person, so that’s great, and I was late ‘cause we were, in, Italy on, vacation, and then I was late for something. I felt so bad, and you were so nice to be, good about.
    Swyx [00:01:10]: The reason I’m nice is because I’m also late to other people, so it’s like, who’s, who’s without sin here, yeah, so I have to, for those who don’t know, InfoBip Shift, there’s this whole thing that, you did in the past, and, and that was basically one of the inspirations for me starting AI Engineer, which is like, I have to thank you for giving me that push to be like, “Oh, you can, you can build and sell conferences?”
    Ivan [00:01:34]: I remember you asked you asked me at the beginning to give me advisory shares, and I was so focused on what we were doing, I said no, and I should’ve took the advisory shares. So I’m sorry, dude. But anyway.
    Swyx [00:01:43]: We’re not, we’re not venture backed.
    Ivan [00:01:44]: No, it doesn’t matter.
    Swyx [00:01:45]: It’s Yeah, anyway, so I think what’s impressive about you is that CodeAnywhere is the thing that you’ve been trying to build, and, you kind of put it on hold and then came back after InfoBip. Just give us the story, do you - the story and the origin story, going into Daytona.
    From CodeAnywhere and Shift to Daytona
    Ivan [00:02:05]: Sure. Like, really way back, me and my co-founder have been together. I say this, I’ve said this multiple times, it’s like we were married and divorced and married. Some people actually ask me is my co-founder my partner. they thought it literally. It’s not literally, but we have done multiple companies together, and to your point, we had this shift where we went from the CodeAnywhere to the conference called Shift, and then back to, Daytona. We originally started stacking servers, doing like virtualization in the early 2000s and, routers and doing basically all these things, at a foundational level, and that was a services company which we sold to focus on what my co-founder actually invented, which was the very first browser-based IDE, right, I say the first. Before us was actually Heroku. They did it for a very short time until they became Heroku. But outside of them, we were the only one, and it was called.
    Swyx [00:02:55]: There was Cloud9.
    Ivan [00:02:57]: Cloud9 came out slightly after us. There was Replit, which came out when we stopped doing it, Replit came out, and they have been successful since then, which is great. There was Nitrous.io. There was quite a few that existed at the time, but it was like too early. But the interesting part is that we, at that point in time, because there was no VS Code, there was no Kubernetes, and Docker had just started when we Or I’m not sure if it was even public at that point in time. And so we had to build everything to the whole stack ourselves and that was the key learning that we brought into and that we’ve been using in Daytona today. So it was super early. There’s about 3 million people used CodeAnywhere. It was slightly, it was angel-backed more than venture-backed. We ended up paying everyone back because it didn’t have that sort of scale. But, three years ago, we started something similar with Daytona, which is not what we are today, but it was automating dev environments for human engineers, the basically the underlying stack of CodeAnywhere. And then we did a hard pivot last January to sandboxes. And so here we are.
    Swyx [00:04:01]: Historic pivot, yeah, and, it’s one of those things where, I had independently invested in CodeAnywhere, but also in E2B, and then both of you pivoted into the same thing, and I’m like, “F**k.”
    Ivan [00:04:12]: You invested, you invested in Daytona. You invested in Daytona. But you were the first If we had not got your check, we wouldn’t have done it.
    Swyx [00:04:18]: No way.
    Ivan [00:04:19]: No, it was like, “We have to get him on board first,” and you were that kicker that we, that got us off the ground.
    Swyx [00:04:23]: No, because you were putting me on your pitch deck, man. I was like, “Man, this is like a good trip if I don’t invest.”
    Ivan [00:04:29]: That’s because it was your quote. It’s like we.
    Swyx [00:04:30]: Yeah. It’s the end of localhost.
    Ivan [00:04:31]: Did a bunch of research about end of localhost and who was interested in that,.
    Swyx [00:04:34]: No, that’s like, I put, I wrote that blog post, and every single company in that field reached out to me, and then every VC who was receiving those pitches then also had to call me and, talk it, talk through it with me.
    Ivan [00:04:47]: It’s finally happening though.
    Swyx [00:04:48]: It was really super interesting.
    Ivan [00:04:48]: It’s finally happening.
    Swyx [00:04:49]: It’s finally happening.
    Ivan [00:04:49]: Yeah, it’s finally.
    Swyx [00:04:49]: It’s finally happening, with maybe sort of non-human users. Yeah, so what is Daytona today? Let’s get like a quick description. I’m wearing the shirt.
    What Daytona Is Today: Composable Computers for AI Agents
    Ivan [00:04:58]: You’re wearing the shirt. Yes,.
    Swyx [00:04:59]: It says, I think your branding is very good. Like, it’s very consistent. It runs AI code. Like, it cannot be simpler.
    Ivan [00:05:05]: Exactly, but we’re gonna probably have to change that.
    Swyx [00:05:07]: Oh, s**t.
    Ivan [00:05:07]: It’s also a subset of what we do. Unfortunately, we really love this, Run AI Code is super simple. People interpret it different ways. I think we’ve given out 5,000, 6,000 of these shirts. People wear them with pride because it doesn’t really market about us.
    Swyx [00:05:21]: Yeah, Daytona’s on the back.
    Ivan [00:05:22]: It markets the back. It markets to the person itself, so I think we did a really good job on that one. But it is also a subset of what we do, because people, when they think about Run AI Code, they just think about these small, let’s call it isolates, code execution boxes that, you send some code, you get an output. Whereas what Daytona is today is essentially composable computers for AI agents. It is, the market calls them sandboxes which can be misleading.
    Swyx [00:05:44]: All these things. All these things on.
    Ivan [00:05:45]: Yeah, exactly, ‘cause it can be misleading ‘cause people usually think about sandboxes as a demo or a test environment versus a production-grade environment. But what Daytona does, if you think of the laptop that you have in front of you or the computer that’s over there, or, my wife is an architect, so she has like a Windows with a 3D graphics card inside to do 3D rendering. Like, as humans, we have different computers or different compositions of computers. And our belief is strongly that agents today and going forward will need all these different compositions of computers to do different types of tasks. And so we offer that basically through an API.
    Swyx [00:06:19]: Yeah, to give people - I’m trying to sort of front-load all the aha moments or the wow moments so that people can, stay engaged and click like and subscribe. the market is exploding, right? Like, you have been reporting 74% month-on-month growth, and it also, it’s just been growing for a while. Like, it’s been going like this. And every single - It’s not just you guys. It’s every single.
    Ivan [00:06:41]: Everyone, yeah.
    Swyx [00:06:42]: Sort of, compute provider. I don’t know if you agree with me saying compute provider or not.
    Ivan [00:06:48]: It’s fine.
    Swyx [00:06:48]: Yeah. So like organically PLG-driven growth, but also enterprise is doing super well, I think I wanna rewind to January of last year when you did the pivot. Like, so you obviously called this market early, and you were positioned for it, and you are now one of the market leaders. But what was the insight that made you do the pivot?
    The Pivot: From Human Dev Environments to Agent Sandboxes
    Ivan [00:07:06]: The insight that made us do this pivot is the quarter before that, so end of 2024, when we had - Basically, we did a demo with - I don’t I think we discussed this as well, Devin was not public. You actually gave me access to Devin at that time. So Devin.
    Swyx [00:07:25]: I did?
    Ivan [00:07:26]: Yeah, you gave me access.
    Swyx [00:07:26]: I don’t think I was supposed.
    Ivan [00:07:27]: Yeah, exactly.
    Swyx [00:07:28]: Yeah, I.
    Ivan [00:07:28]: So it doesn’t matter. You.
    Swyx [00:07:29]: Yeah. I gave like three friends access.
    Ivan [00:07:31]: Yeah, or it was a call and you showed it to me. It doesn’t matter. but OpenDevin was available, which is now called OpenHands. And so we’re like, “Oh, this seems to be a thing. This is not public. Let’s take our for human automation of dev environments and take, OpenDevin and launch that as a SaaS.” And we did that. Not very many people signed up and used it, but a lot of people reached out that were building agents, and they were like, “Hey, my agent needs a compute sandbox runtime,” whatever you wanna call it. I forgot what it was called at that point. And then we were like, “Oh, amazing. This is a new market. Here is our infrastructure. Here’s our product, and go.” And what we found really fast, soon, was that people did not like what we had built. It didn’t work. And I remember talking to people at the beginning when we’re doing this, the sandbox we’re building for agents. People were like, “Oh, why is it different? It’s the same thing. We have like EC2, we have VMs, we have all these things.” But we saw that everyone we gave it to, it was like 20, 30 people, they all said, “No.” Like, “This is not what we need. This sort of breaks.” And basically, me and my co-founder not knowing a lot about - ‘cause we’re infra people. We’re not AI people. So I basically took it upon myself to like watch every single podcast that exists, including all of, all of these and all that, and sort of get up to date, read all the blogs, like get, understand what’s going on.
    Swyx [00:08:45]: Do you wanna shout out who else was useful, just in case people are also looking.
    Ivan [00:08:49]: Generally we -, I looked at There’s a few of podcast, different segments and different types. So there’s you guys, No Priors, Bill Gurley’s was great while.
    Swyx [00:09:04]: VG2, yeah.
    Ivan [00:09:05]: Yeah, while it was around. So there’s a few. 20VC is interesting from a different dynamic, and some are different dynamic. But there was, also Red Points.
    Swyx [00:09:14]: We’re not really about the compute market.
    Ivan [00:09:15]: It was also already - Sorry?
    Swyx [00:09:16]: You’re, you want - You’re looking at the agent infra market.
    Ivan [00:09:19]: I was looking at the agent market and the AI market in general and sort of understanding who are the players, what the perception, and how that goes. And like obviously you complement this with like going to conferences, going to events, going to meetups, reading white papers, like doing all the things that you have to do to understand what’s happening. And so when we figured, when we sort of had an idea of what we had to build, literally over the New Year’s Eve, literally on New Year’s Eve, I half vibe coded the first MVP, first minimal viable product of what Daytona is today. And I went to sleep at like 3:00 AM or something like that. I was doing - I just put my like baby daughter and wife to sleep and, Happy New Year’s, and go back to just, doing this. And I sent it to my co-founder, my CTO, and he saw it in the morning. He’s like, “This is absolute garbage.” “Do not show this to anybody at all, but the idea is good.” And so he took two weeks, and he rebuilt it.
    Swyx [00:10:09]: Did it like look like that? Listen, I - It was rough idea.
    Ivan [00:10:12]: Oh, not even, not even close. Like it was it was way worse. But it was like a very - It was a simplistic view of what it should be. Like, it worked, but it was not ideal. And so he went, we went down the whole, which is his job as CTO, to go, and he came back with this version. We then called all the people that had said like, “This is garbage,” a quarter ago. And we set up these calls, and we gave it to - We just demoed it to everyone. And all the calls went long, every single one. They were 15-minute calls, and they all went to like 25, 30 minutes or whatnot. And everyone said, “We need, we want access.” There was no login, just an API key, ‘cause it was just a beta or an alpha. And they said, “Oh, we want access.” And we’re like, “Sure, yeah. Okay, thank you very much.” But after like the next day, if we’d not send it, every single one, like every call that we did, everyone came back, “Where is my API key?” Like everyone wanted it. We’re like, “S**t.” Like this is it. Like I’ve never felt So one, the understanding to your point was like most people thought it was the same infrastructure for humans and agents. We understood a quarter ago it’s not. We just didn’t know what was the right primitive. And then when we came, and we can talk about what that is, and we gave it to these people, I’ve never seen, I’ve never experienced - I’ve done multiple companies in my life. I’ve never experienced this, that people literally call you if you do not give them access. Like they want access right now. And so it’s like, okay, they don’t want this. the thing that they want doesn’t seem to exist, or they have not found it, and they really want what we want. And then when we understood that we’re onto something, and then when you think about the size of the market, like the market for human engineers and enterprise is a very large market, so think GitLab or whatnot. But the market for every single agent that will exist ever in the future is just like, what is that market? How big is that? And we’re like, “We are all in on this.” And so that is where we made sort of the cut between the old product and the new one.
    Bare Metal, Stateful Sandboxes, and the Lambda + EC2 Model
    Swyx [00:12:02]: Yeah. But it wasn’t composable at the time?
    Ivan [00:12:05]: It was very - It was basically just a Linux box that you could change, that you could define number of CPUs, disk, and RAM. Like that is what you could do, but you couldn’t have multiple operating systems, you couldn’t resize it on the fly, you couldn’t add a GPU, you couldn’t do like all the things. It was just the, just the first sort of variation of that, yeah.
    Swyx [00:12:22]: Was it bare metal from the start?
    Ivan [00:12:24]: It was bare metal from the start. And so the interesting thing that we thought about right away, so our.
    Swyx [00:12:29]: Which, give people the background, what is the normal path?
    Ivan [00:12:32]: Yeah, so, basically most providers run this on top of VMs. And also.
    Swyx [00:12:37]: Firecracker.
    Ivan [00:12:38]: Yeah, they run on Firecracker and VM. And so we also fire - We can get - We have multiple isolation layers and we can do that. But the common way to do it is that they, one, that the state of the machine, or the hard disk is not part of the sandbox itself. And the other thing is they’re not meant to last forever. So most of them are preemptible, like they can There’s a time that they can live. And so our thought was when we were going into this is, agents will be like humans in the sense of you don’t want your laptop to be shut down until you’re done with work. Like, and you want to close the lid and open the lid, it’s the same state. So you - Agents would want that, like the pause and come back. They want those two things. But also agents really want speed, right? Can they get it? So when we thought about it’s like we need something insanely fast, how to make it fast, how to make it long-running, and stateful. And so those two things, it’s like combining a Lambda and an EC2, right? Those two things together. And so we didn’t have an idea how others did it, ‘cause we didn’t know too that there was a market around this. It was more like, okay, this is what we need, what they need. And we looked at Kubernetes, it wasn’t wasn’t good enough for that. We looked at Nomad, it didn’t enable that. And so our history in rewriting our own scheduler at CodeAnywhere is basically what my CTO came up with. Like, he’s like, “Oh, the learnings from there,” and he brought it. And the funny thing is, our third co-founder, when he saw it, he’s like, “Dude, what is this? This is like 2008.” Like, we went back in time, and he’s like, “Exactly.” And so the reason why Daytona is like super fast, and you see this on benchmarks, is we essentially, we run on bare metal. We have our own scheduler, we use the underlying, disk, CPU, and RAM of the underlying machine, which means your IOPS are insanely fast because there’s no, there’s no network between an EBS or something like that. But also the snapshot, the point in time, the templates, are also preloaded on the bare metal machines. So when you fire off a sandbox from a template or a snapshot, you’re essentially directed to the bare metal machine where that snapshot is based on that NVMe drive, and then it literally just turns on that machine, and it’s local. There’s no network latency, anything on there. And so that is sort of the specificities that we, when we’re thinking from first principles, what a computer would look like for an agent, that is what we came up with, and that’s what we created.
    Benchmarks, 60ms Startup, and 50,000 Sandboxes
    Swyx [00:15:02]: Yeah. I should maybe, I don’t know if you endorse this, but there’s someone that does compute SDK, you guys do very well on there, with like the TTI, right? I. is this a, is this a is this a relevant benchmark for you guys? I don’t know.
    Ivan [00:15:16]: I don’t know, and it changes every day. So today RKL is.
    Swyx [00:15:18]: I don’t know what RKL is. Never heard of it.
    Ivan [00:15:20]: Yeah. RK, yeah, so it is there.
    Swyx [00:15:22]: You are, at least a third of the next tier of performance, and then, there’s a lot of other better-known names that are very slow to start.
    Ivan [00:15:31]: Yeah. We’ve been the number one by far for a long time, and now there’s different, there’s different definitions also of sandboxes, different isolation patterns, different other things. So RKL runs it literally on the S3, the data, so it’s very different, and they spin up a sandbox, spin up a container for that, so it’s a different type of thing. So the definition of a sandbox is something that we can all, we all need to get along with. But yeah, we’re insanely fast on getting these things, up and running. And so you can see even there that it’s a zero point 0.10 to 0.11, so.
    Swyx [00:16:03]: Close enough. Yeah. what else do you need, right?
    Ivan [00:16:05]: Yeah. So the benchmarks itself, so, in this, in I don’t think the benchmarks equate to market ownership or revenue or anything like that. and I’ve seen this with multiple benchmarks, not just in sandboxes, but in general benchmarks around.
    Swyx [00:16:20]: It’s table stakes. It’s just like.
    Ivan [00:16:21]: Exactly. But it doesn’t hurt.
    Swyx [00:16:22]: Just roughly check.
    Ivan [00:16:22]: Like you definitely have to be up there and you have to be competing so that people know that, oh, this is definitely one of the top. Because this is only one dimension of what customers look for. There’s other things like how many can you spin up consecutively? There’s a feature set, there’s support, there’s like all different things that people look at, but you definitely have to be there, on the benchmarks.
    Swyx [00:16:40]: How many people do people spin up consecutively?
    Ivan [00:16:43]: So we have.
    Swyx [00:16:43]: Or concurrently, is the Concurrency, right?
    Ivan [00:16:45]: There’s three metrics that we look at. And so one is like time to spin up one, and so our time to spin up one is 60 milliseconds with network latency. So request, spin up, reply, 60, the whole thing, 60 milliseconds. That is one. But if you wanna spin up 50,000 at once, we are now at about 75 seconds. So it takes about 75 seconds to spin up concurrently 50,000. Some others, there’s public data around this, like take 2,000 seconds, which is 30 minutes. Like there’s different variations of that. And then there is the so it is speed of one, speed of like multiple, and then how many can you consistently have up and running. And so we basically have right now no limit to how much we can add because we basically own our own metal. But the biggest customer of ours does like about 850,000 every single day is sort of where they’re, where they’re just shy of a million every single day that they’re running, we do have a request for half a million concurrent, which is literally half a million CPUs somewhere running. So that’s an interesting.
    Swyx [00:17:44]: They pay by like vCPU seconds.
    Ivan [00:17:47]: By seconds, yeah.
    Swyx [00:17:47]: Or whatever. Yeah. Okay, and so and then, and the other thing is, the sleeping and the resuming, ‘cause it’s all the stateful resumption of all these things, how, what kind of workload are people putting through this, right? Like how is it Do we measure by gigabytes in memory, gigabytes in storage? I don’t In like network attached storage. I, what are the costly ones of, out of all these features?
    Workload Economics: CPU, RAM, Network, and Storage
    Ivan [00:18:15]: The most expensive thing are CPU.
    Swyx [00:18:18]: Okay. Yeah, of course.
    Ivan [00:18:18]: The second one, yeah Then it’s RAM, then it’s disk. We actually don’t charge.
    Swyx [00:18:22]: Which is snapshotting, right?
    Ivan [00:18:23]: No, it’s actually the, snapshotting’s part of it, but basically the size of your hard disk, of your machine. So do you have 10 gigabytes, do you have 20, do you have 50, do you have whatever? And then the transference of that. Right now, currently we don’t charge for, network at all at Polychron.
    Swyx [00:18:37]: Oh, you gotta, yeah, you gotta fix.
    Ivan [00:18:38]: Yeah. It is very much a it’s a larger and larger part of our bill, so we’re working around, that part there. Obviously, that is the least, expensive, so the hard disk is the least expensive, so it’s basically CPU, RAM, for us network, ‘cause we don’t charge the customer, and then hard disk, is how it’s split up. But there’s also different types of workloads, so we basically split it up into two types of workloads in Daytona. One is what we call background agents or long-running agents. and the other is, basically RLs and evals, which I put sort of together. And so they have very different patterns of usage, and if you look at the usage of a background And I’ll just name names of companies, not specifically.
    Background Agents vs. RL/Evals: Two Usage Shapes
    Swyx [00:19:21]: Yeah, open, all hands.
    Ivan [00:19:23]: Yeah. So like a background agent’s a Cognition, a Lovable, a like all these things are Harvey. These are all long-running, background agents. And so if you look at their usage patterns, their usage patterns are similar to human, which is like follow the sun. Basically, the usage patterns of that is like noon is probably the highest, and the midnight is the lowest, and then weekends are lower. weekday is higher.
    Swyx [00:19:42]: Yeah, that’s a fun question. How global is it? Is it very US-centric or?
    Ivan [00:19:46]: The US is a large part, but we have currently, we have Asia, Europe, and the US regions.
    Swyx [00:19:52]: So it’s quite global.
    Ivan [00:19:53]: Yeah, it’s quite global. We have it all over. It’s interesting that our I talked to you a bit about this. Our number one city by user.
    Swyx [00:20:01]: Hmm.
    Ivan [00:20:02]: Is Singapore.
    Swyx [00:20:04]: Oh, wow. Amazing.
    Ivan [00:20:05]: Which is an interesting one, right? Not by revenue, just by just like by individual head count.
    Swyx [00:20:09]: Really?
    Ivan [00:20:09]: Just like an interesting thing.
    Swyx [00:20:10]: Singapore is, Singapore is weirdly high in the adoption charts of AI for the population. It’s like an, seven, eight million population. And it’s like keeps showing up.
    Ivan [00:20:20]: No, it’s quite interesting. We were quite shocked, and I was like, “Oh, this is interesting.” And also one that’s up there.
    Swyx [00:20:24]: There’s a reason I’m doing AI using Singapore. it’s because I’m from there.
    Ivan [00:20:27]: We’re there. We’re gonna, we’re gonna be there as well. and it’s interesting that Japan is in the top or like Tokyo’s in the top, which is in all the tech cycles it has never been. It has never been, so it’s quite interesting that they’re.
    Swyx [00:20:39]: I think the Japanese just love AI. Yeah. It’s that, and then it’s Brazil. That’s it.
    Ivan [00:20:44]: Brazil has always been in.
    Swyx [00:20:45]: I think.
    Ivan [00:20:46]: Even when I look, if you look at like GitHub’s data and ask historically with CodeAnywhere, it was always like US, Western Europe, and then you’d have like India, Brazil, China, like that would be there. But like Singapore was not in, specifically Japan was never in sort of that top, that top.
    Swyx [00:21:01]: Yeah. Weird pockets.
    Ivan [00:21:01]: Weird. Yeah, so it’s very global.
    Swyx [00:21:02]: Okay, so actually that, but that’s helps you to distribute your load through, all time?
    Ivan [00:21:08]: The interesting thing is like we have those kind of loads, but if you look at the researcher loads, they’re quite different. So what they are is like if you give them concurrency of 10,000 or 50,000 or 100,000 CPUs at ARMb, when they fire off a run, it’s just 100%. And then it just runs, and then it stops. So it’s very, the usage pattern is squares basically, right? And it’s also not follow the sun, because people will fire it off at midnight before they go to sleep but then wake up and so it’s very unpredictable, so you don’t know where that is. So the shapes of the usage are quite different than we have had before. And also what’s interesting is when it’s sort of a follow the sun, even if you have a high growth company, you can sort of predict your usage patterns and have enough capacity for that, because it’s sort of, it grows in a, in a way you can project. When you have companies doing sort of like evals and RL, they’re super spiky. So they’re gonna come in, it’s like, “We’re gonna use nothing, then can we have 100,000?” Right? And then go back down. And then 100,000, go back down. So it’s very different, right? And.
    Swyx [00:22:09]: Do you want to lock them into commits so.
    Ivan [00:22:11]: Yeah, we do.
    Swyx [00:22:12]: Yeah, okay.
    Ivan [00:22:12]: We so we have to lock them into some sort of commits to have that capacity, because we have to have, basically we have to have the capacity for peak. Right? And so right now, Daytona’s mean utilization is 15%, 1-5.
    Swyx [00:22:25]: Oh my God.
    Ivan [00:22:26]: So it’s very low.
    Swyx [00:22:27]: Because it’s very spiky.
    Ivan [00:22:27]: It’s very spiky, but we get up to 90%. so we have these things. And so what we’re, what we’re looking at right now as a company is similar to Cloudflare where you can like geo move things around, but that works really well for basically the background agent where it’s follow the sun. But this, it’s not. Like it’s a very different shape. Obviously with scale you figure these things out, but that’s an interesting new problem that we have, as a compute provider in the agent space. And when we were doing the conference recently, and so we talked to like Nikita from Neon and.
    Swyx [00:22:57]: I should bring it up.
    Ivan [00:22:58]: Parag from Parallel and whatnot, everyone has the same problem. Whereas the usage is super spiky, and this is something that has not happened before, that you have these types of like it was always, it the amplitudes were not this high, right? So it’s quite interesting use case and problem solve.
    Compute Conference and Spiky Agent Infrastructure
    Swyx [00:23:12]: Yeah, I don’t know if we’re gonna bring this up again, but let’s just talk about the conference, you had like 1,000 something people at the Warriors game, at the Sorry, where is it? What’s.
    Ivan [00:23:22]: Chase Center.
    Swyx [00:23:23]: Chase Center.
    Ivan [00:23:23]: Chase Center.
    Swyx [00:23:24]: I went. It was, it was very impressive. Obviously, you can, how to throw a conference, what did you learn? you put, you pulled together all these impressive names.
    Ivan [00:23:33]: What I.
    Swyx [00:23:34]: What were you looking for?
    Ivan [00:23:35]: My thesis behind the Compute Conference was let’s bring together people that are building infrastructure for AI agents. Because when I think of what we’re building, it is the agent is the primary user, what are the ergonomics and usage patterns of agents, and so we can do that. And what I found, this was a theory, it wasn’t proven, is that we all have these problems, as I touched onto. And I was, as I was talking on stage, it was like we all have the same underlying infra problems, which is this spiky workloads, unpredictable workloads that we’ve never had before, in human, compute or human infrastructure. And it’s, again, it’s the same when I was talking to Parag or when I was talking.
    Swyx [00:24:20]: Lynn. Nikita.
    Ivan [00:24:21]: Lynn, Nikita. Lynn especially, I was talking to her the other day as well. Like the It is a very interesting type of problem to solve because I can touch on Cloudflare because there’s a lot of like talk about that recently as to how they solve that, which is they have a bunch of geos, and basically, as users work in different places, and depending on your tier, they can move you around the geos. And so that how, that’s how they get the higher utilization. But you can sort of predict these, and it’s If it’s something in You’ll rarely get a spike that is 10 orders of magnitude. Like you’ll get a like let’s say one of your customers has some like an exponential curve. What is that to I’m using Cloudflare as an example. 10%, 20%, whatever it is. I don’t, I don’t have this data, I’m just assessing. It’s surely not 10x, right? It’s surely not something there. And so how do you go out and solve this problem? And we’re all solving this in different ways. So we have.
    Swyx [00:25:11]: She also has the same thing.
    Ivan [00:25:12]: Yeah, I know specifically that like Neon had that issue as well. Like how are we solving these spiky loads and things like that ‘cause we talked about it. And so the interesting thing for me to actually internalize was, yes, everyone that’s building for agents first is going through this, and we’re all solving similar problems, which is quite.
    Swyx [00:25:28]: Let me let me double-click on this. Okay. So for example, Neon, I happen to know that they’re very sort of S3 oriented, right? so they’re just like fully bet on S3. And you get to benefit from S3’s distribution and infrastructure. So I would imagine that Neon doesn’t have to care, whereas Lynn maybe has to care a bit more because obviously she’s doing GPU inference. And, for listeners, we did an episode with her, one and a half years ago. And you have to care. But like, right?
    Ivan [00:25:54]: Parag cares for sure, and Nikita.
    Swyx [00:25:58]: And Parag is C of, Parallel.
    Ivan [00:25:59]: Parallel, yeah.
    Swyx [00:26:00]: Former CTO of Twitter.
    Ivan [00:26:01]: Twitter, yeah.
    Swyx [00:26:02]: They are the search.
    Ivan [00:26:03]: Yeah, they’re search, yeah.
    Swyx [00:26:03]: I You and I know but the listeners don’t know.
    Ivan [00:26:08]: Yeah, we can put it down in the screen, and so ‘cause we, when we were talking.
    Swyx [00:26:11]: I’ll put it up on the, on the screen.
    Ivan [00:26:12]: Yeah, right.
    Swyx [00:26:12]: People can look it up if they need.
    Ivan [00:26:14]: Look it up. And, yes, but they still have CPU and RAM, allocation that you have to have up and running. And so CPU and RAM, you have to allocate that and have that ready. And so there’s basically two ways to do it. One is you either over-provision and you can handle the bursts, or two, you basically have, I don’t know if this is a term, just-in-time compute, which is like as your load becomes, as your usage comes in, you can fire off requests for VMs or bare metals at other cloud providers and then get them up and running.
    Swyx [00:26:43]: This is if you go above 100%, right?
    Ivan [00:26:45]: Yeah, this is.
    Swyx [00:26:46]: Like your overflow.
    Ivan [00:26:46]: If your overflow, like spillage or whatever you do.
    Swyx [00:26:48]: You probably lose money on it, but it doesn’t matter, right?
    Ivan [00:26:50]: It, not Well, you might, you might not That is a more cost-effective way to do it but it’s a slower way to do it. Because basically what you have to do is you have to like queue your requests, spin up these just-in-time compute, get it all ready, provision it, and then get your workload there. And so if the time isn’t important that much, that’s fine, and you can do that. But if your customer, and especially for, let’s say, the RL training runs, the reason why a lot of people come to us is because GPUs are more expensive than CPUs, right? So you want your GPU running at, what, 100% the entire time. And so when you’re running runs on CPUs, when the when the CPU cycle is like down and spinning up the next one, you want that to be instantaneous so that your GPU doesn’t go down, right? And if you then have to like go out and provision machines, you’re essentially telling the GPU that it has to wait, and that’s incurring our cost. So there’s things that you have to try to solve for there.
    RL Workloads, Declarative Images, and Kubernetes Replacement
    Swyx [00:27:43]: Yeah, let’s talk about the different workload, right? You said that, what was it? A few months ago, you had zero RL workload and now it’s 50%.
    Ivan [00:27:52]: It will be this one, 50%, yeah.
    Swyx [00:27:54]: Let’s talk about how different it is, right? Like I imagine, for example, a lot less dynamic code generation of like arbitrary code. Like here, it’s probably all the same code. You’re just doing parallel runs or something, I don’t know.
    Ivan [00:28:05]: Yeah. So you’ll have multiple Depends on the like for each run, you’ll have a snapshot. And they, for the most part, they actually do use our declarative image builder, which is like, “Oh, we, the agent wants these dependencies, these env vars.”
    Swyx [00:28:17]: These ones, yeah.
    Ivan [00:28:18]: Yeah, the declarative image builder, it.
    Swyx [00:28:20]: Which is a very modal like thing that they.
    Ivan [00:28:22]: Yeah. And so we build it on the fly and then we propagate that snapshot, and you can spin up as many sandboxes as you want against that snapshot. And then if you have to do changes, the model can, or like it could be also be automated. It’s like, “Oh, now for the next run, we need to install these things or remove these things or whatever to get, a task done,” and then it goes off and runs that. So yes, that is something that it seems that they prefer. The number one reason I found, or should I say, let’s take a step back. What we are competing against in that environment is essentially managed Kubernetes. So EKS, GKE, whatever. That is what the vast majority run on. And anyone that has tried Daytona versus GKE, EKS is like, “I’m never going back.” That has always been. There’s a few reasons. One is the ergonomics. So if you have, if you’re using Kubernetes to spin that up, you have to essentially manage the interface interactions with that. Daytona, although as a compute provider, it’s more akin to a Twilio and Stripe from a consumption perspective than it is an AWS. Like you have an API, an SDK, it’s quite like easy and seamless to get these things up and running, that’s one. The other is the speed to which we spin up, which we mentioned earlier, which is much faster, and the scale to which we can go to. We haven’t got into features, but an interesting feature is that it’s very hard to OOM, or out of memory, our sandboxes, because we can dynamically on the fly.
    Swyx [00:29:48]: Resize.
    Ivan [00:29:49]: Resize, which is like impossible on almost any other thing. There are some technologies that enable you to do that, but it’s like a very hard thing. And so we actually saw this when, the Terminal Revenge team is, brought us actually. So thank you, Alex and the team, that brought us into this whole space.
    Swyx [00:30:05]: It’s just very rare that, a framework would just say, “Guys, just use Daytona.”
    Ivan [00:30:11]: Yeah, I think it says it somewhere. Yeah.
    Swyx [00:30:13]: Yeah. I was like, “What is this?”
    Ivan [00:30:15]: There’s all, there’s multiple there, but they also mention a few other places. and so Daytona specifically-We have, the, just jumping on themes here We, I don’t know where it says Data Center.
    Swyx [00:30:27]: I, there.
    Ivan [00:30:27]: Doesn’t matter.
    Swyx [00:30:28]: There’s a very strong recommendation, which is, very unusual. Which is, it’s.
    Ivan [00:30:33]: We do not pay them for this, just.
    Swyx [00:30:34]: I know, yeah. They just like you.
    Ivan [00:30:35]: Yeah, they like us. yeah, and also a thing, so, Data Center has multiple isolation sets underneath. The customer doesn’t have to know what they are. But basically we have Docker, which is a container, that’s hardened with Sysbox. So it’s Docker’s, isolation that is a security equivalent to a VM, but it’s still a container. And that is the default, and they, especially in these training workloads, really like that as an interface to be able to use just a basic Docker container, and we enable Docker and Docker. Which for these RL runs, if you need to do a Docker compose or Kubernetes, you can spin up a K3S inside of these things, which unlocks a huge amount of workloads that you can do that you cannot do on other providers. So just on that part is much more interesting. And so we went that, through that. We showed them that we could do that, and they enjoyed that quite a bit. They being the general venture people.
    Swyx [00:31:28]: Those people, yeah.
    Ivan [00:31:29]: And Harbor people.
    Swyx [00:31:29]: Harbor people, do are they, are they a company yet?
    Ivan [00:31:33]: As far, I do not know.
    Customer Pull, Slack Connect, and the Computer Use Bet
    Swyx [00:31:35]: Okay. All right. Yeah. It’s like super obvious that like, there’s a lot of excitement and success around these things, okay, so yeah, tell us more, right? Like, this is an exploding workload, Harbor adopted you, which helped speed things along. But what are you learning as this new workload comes online?
    Ivan [00:31:53]: There’s a couple things that we learned, which we chat about in the beginning. We, and this has led our story, as we mentioned, we like talked to a lot of customers along the way, and we add more features and more tool sets as we talk to customers. And it’s interesting that And I think it’s that the ecosystem is so small and/or the models get smarter, where when we see one user come with a request, we know it goes on a roadmap if like three to five customers come with the same request in that week. It’s like very bizarre. It happens so many times, which is.
    Swyx [00:32:27]: Because they’re all friends.
    Ivan [00:32:28]: Sorry?
    Swyx [00:32:28]: They all, they’re all friends. They’re all in the same group chat.
    Ivan [00:32:30]: Yeah, probably, yeah. ‘Cause and they’re like, “Oh, can you do this?” And I’m like, “Okay, this is interesting. We’ll put it on a feature request.” And then the next one’s like, “Oh, can you do this?” “Okay.” It’s all the same, right? It’s always the same. And so what we try to do, and I personally try to do, I try to be on as many call, quote-unquote “sales calls” I can. I’m in every Slack channel. We literally have about 1,000 Slack Connect channels, something like that. It’s an interesting, there’s so many interesting things you find out when you have all the Slack channels. You can also see where people, transfer between companies. You see leave Slack channel, enter Slack channel. It’s an interesting thing. Also, just I digress, I feel that Slack Connect is literally LinkedIn what it should be. You have a list.
    Swyx [00:33:08]: LinkedIn charges you to, use your own connections, but Slack doesn’t, right? Slack is like, do it for free. It’s more lock-in. It’s great.
    Ivan [00:33:15]: Yeah. It’s amazing. Yeah. It’s one of the reasons.
    Swyx [00:33:17]: You’re gonna pay Slack for life.
    Ivan [00:33:18]: Exactly. You’re there for life. So that’s interesting. And so one of the things, the newer things we were talking about earlier is we made a big bet and put a lot of investment on computer use. that is not seen publicly the light of day. We haven’t GA’d that yet, but we have.
    Swyx [00:33:32]: Is there a thing I can pull up?
    Ivan [00:33:33]: There is computer use there. It’s right up a bit.
    Swyx [00:33:36]: Oh, yeah. Okay.
    Ivan [00:33:38]: What we have, what we talked about and what we’ve seen publicly is there’s this theme now about, the human emulator where And Elon from XAI has talked about this publicly, and if you think about the models today, they’re actually quite sophisticated and they can do a lot of work, but they still don’t have access to all the tools. Like, I’m a strong believer that the most efficient way for an agent to work is essentially headless or through, terminal or whatnot. But if we, if we look at knowledge work in general, there’s about 100 million knowledge workers in the US, about a billion in the world, and knowledge workers, and the salaries of them aggregate to 10 trillion in the US 50 trillion worldwide.
    Swyx [00:34:24]: Wow.
    Ivan [00:34:25]: Something like that. And if we look at, the five most important sectors of that, so like healthcare and government and financial services and whatnot, that’s about 56% of that. So let’s say it’s about half of that. So in the US it’s about 25 trillion, and most of them, most of that work is actually still locked into legacy apps inside of Windows, which is not going anywhere for a very long time. Like, people just won’t invest in that. How much of it? our assumption is the following: if, in the RPA market, which is similar market, well, not the same 25% of, these white collar, workers’, work is automated. If an agent is more sophisticated, can go through more runs, figure stuff out, let’s say it’s, 40%, right? And so if you take 40% of that, you get to essentially, $10 trillion a year.
    Swyx [00:35:17]: That’s a TAM.
    Ivan [00:35:18]: That is a that is a TAM. So that’s the TAM of the models, right? That’s not our, essentially ours. But you get to that size, and to be able to do that, you essentially have to give agents these computers with the legacy. So computer use, either Mac or Windows or Linux. Linux we also obviously have and others have. But Windows specifically is something very new, and the only option right now is an EC2 with, Windows or on Azure. Both of them take anywhere from three to five minutes to spin up. We’ve created an actual sandbox, so it’s a second instead of milliseconds, but you have, point in time snapshots, you have, forking, you have all the things that you have from a sandbox, but essentially enables you to hopefully unlock all this value. And so that’s been our big push and bet, but we’ve sort of, kept our ear to the ground. What is sort of the next things in the market?
    RPA Returns: Why Agents Still Need Computers
    Swyx [00:36:06]: Yeah, knowledge work, and building, and sort of RPA, the next wave of RPA. I got very excited about RPA kind of during COVID times. The UI path was IPO-ing. And it was, a very hot Isn’t it, Eastern European?
    Ivan [00:36:20]: It is, Romanian.
    Swyx [00:36:21]: Romanian?Yeah, it might be the only Romanian, big unicorn okay, yeah. This I don’t I don’t, I don’t have like a I think there’s, I think there’s a stage being set for the resurgence of RPA, ‘cause everyone understands that, yeah, no one wants to deal with these shitty apps and no one’s gonna rewrite them. Like, you just have to do, a remote operation and programmatic operation of them.
    Ivan [00:36:45]: If you wanna unlock it, my own setup was basically the following. So I was doing a board deck recently, last month, whatever, and I’m like, “Okay, let’s just, let’s just do automated.” So, all our data’s in, ClickHouse and PostHog and QuickBooks, where everyone else’s is, and I’m basically, connected that all to, my Cloud code, like go off and go Cloud code whatever. Go off and, here’s the integrations, go do that. It pulled out the first report, which was great. It connected to Brex and all these things, pulled it, which was great, and then I say, “Okay, now pull out this, and this,” and I kept getting, really well McKinsey-style design reports, but the data said partial data. all the missing data, partial data. Like, it can’t access all the things, and I got so frustrated, and so I got, I got, my Mac Mini virtual sandbox with OpenClaw. I gave it its own account in our company, and then I went to all these services and created a read-only account, so literally like an intern in your company. And so I would say, “Now go and do this report,” and it would get the same, or like, “I can’t via the MCP or the API or whatever. I can’t get all the information.” I’m like, “Go log in.” And it will log into the website, then go in, export the data. It’ll export the data and do the thing end to end. So even for things that have today APIs, not all of it is exposed, and I to get value, I get immense value right now, but it has to be a computer usage, unfortunately, and so I spend a bunch of tokens just on that, but I get the job done. And so if even a startup like ours, and using all the hottest tools, still needs a computer agent what hope does, Goldman have to have a headless, right?
    Swyx [00:38:22]: Yeah, what a - Why isn’t Microsoft doing this?
    Ivan [00:38:27]: I’m pretty sure, Satya had a post yesterday.
    Swyx [00:38:29]: Oh, okay. I see.
    Ivan [00:38:29]: Which was like, “Every agent needs a computer.”
    Swyx [00:38:31]: I see, I see.
    Ivan [00:38:32]: So they have launched something recently.
    Swyx [00:38:34]: Yeah, they have Microsoft Power Automate, I’m sure, I’m sure, they’re gonna have their version.
    macOS Sandboxes, Apple Constraints, and the Windows Opportunity
    Ivan [00:38:39]: Version of that, yeah.
    Swyx [00:38:39]: You’re gonna try to do yours, and it - I always know there’s always demand for Mac, but I know it’s, tricky to host, macOS sandboxes.
    Ivan [00:38:49]: We will have macOS sandboxes fairly soon. The problem with macOS, OS sandboxes is, I’m deep in this, I don’t know how much interesting is.
    Swyx [00:38:55]: No, it’s.
    Ivan [00:38:56]: MacOS has this problem.
    Swyx [00:38:57]: It’s a licensing thing, right?
    Ivan [00:38:58]: Licensing thing. So one, you’re allowed to run only two parallel VMs per machine, so that’s one. Two, you can only license to a different user every 24 hours. So if you come in and theoretically, if I wanna charge you per second and I charge you one second, I have to have it idle for the rest of the day. I can’t have anyone else doing that. So the pricing will be different in the sense that I will have to - we would have to charge for 24 hours, and that’s not even, that’s not even the most difficult thing. But the, thing above that is, from a security perspective, they enable you to do memory snapshot, pause, resume, but only on the same physical drive, physical machine. And so what you can do in, Windows world or Linux world is that I can move in the background, your snapshot from one to the other and manage load, right? Here, if you wanna do that, you essentially have to have your.
    Swyx [00:39:49]: Yeah, snapshots. Yeah.
    Ivan [00:39:50]: Your.
    Swyx [00:39:51]: It’s like.
    Ivan [00:39:51]: Physical machine.
    Swyx [00:39:52]: You can’t break it up.
    Ivan [00:39:53]: You can’t, you can’t move things around that, and all of that is, that part is, from a security standpoint, if it is written. Like, I understand the security aspect of that, but it disables you from doing these agentic, like really scalable agentic workloads.
    Swyx [00:40:08]: You need to do a vibe-coded, clean room implementation on macOS that you can then - That’s like Clean OS or something. I don’t know.
    Ivan [00:40:17]: So. We have.
    Swyx [00:40:18]: ‘cause like Linux was originally like a clean room rewrite of Unix.
    Ivan [00:40:21]: Okay. Yeah.
    Swyx [00:40:21]: Or something like that, right? Like same thing to macOS. Someone needs to do it.
    Ivan [00:40:25]: Someone will do that, and someone will have some long-running agents for a few days to figure this stuff out. But yeah. So definitely we - we’re really close to offering something ‘cause people do want it, but the pricing will be different, and the feature set will be sort of stringent.
    Swyx [00:40:38]: Yeah, nobody’s gonna use this. like, the labs, the labs will because they want to automate macOS.
    Ivan [00:40:42]: They have to do RL. They have to do RL again. But even if you The - So the point is with the RL part, if you, if you do RL on macOS, then the next iteration of the model comes out, it will be able to use these tools significantly. Then you actually need to run those, that somewhere. So you’re gonna have to have that, later on. And from, if anyone at Apple is listening, I very much feel that they are shooting themselves in the foot of the scale of the revenue of compute or licensing they could get if they would just enable a concurrency model similar to what you can get on a Windows and a, and Linux.
    Swyx [00:41:17]: Yeah. Yeah. And I’m sure they’ve heard this before. They just don’t care. Yeah, it’s And maybe they will change their mind with the new CEO.
    Ivan [00:41:24]: Yeah. We’ll see.
    Swyx [00:41:25]: We’ll see.
    Ivan [00:41:25]: High hopes.
    Swyx [00:41:26]: High hopes.
    Ivan [00:41:26]: High hopes.
    Swyx [00:41:27]: Okay. But I, it’s very clear the market opportunity is huge in Windows, and you can go for a long time on just Windows, but your customers are gonna want both. and I think, it is interesting to me that, this is the sort of God application of agents, right? Like, I don’t It was - How big was OpenClaw for you guys? Like, was it, was there, a significant bump.
    OpenClaw, Agent Labs, and the B2B2C Sandbox Market
    Ivan [00:41:54]: Not for us because we.
    Swyx [00:41:54]: Because you already.
    Ivan [00:41:55]: We’re kind of positioned differently. Whereas although it’s completely PLG and we have individual developers that use it, most of the users that use Daytona are sort of a B2B2C. Sort of it’s either B2B or B2B2C. So, in the researcher world, it’s B2B, so you’re selling to, labs and neo labs and things like that. But on the long-running agents, it’s mostly, from a scale revenue perspective, it’s mostly B2B2C, where you have a app layer agent that uses you at a big scale.
    Swyx [00:42:26]: Like a Manus. Yeah.
    Ivan [00:42:28]: Like a Manus Lovable type of thing.
    Swyx [00:42:31]: Yeah. I think that’s the question of, well how, um-Uh, yeah, B2B to C is basically to me what I’ve been calling an agent lab, which is kind of like you’re not in a model lab, but you’re making a very good wrapper that is a platform that other people can sign up so they don’t have to code those things. Yeah, it sound, it sounds like a much better market than the direct OpenClaw market.
    Ivan [00:42:56]: I’ve like - We I’ve done multiple things. So the CodeAnywhere’s part of our career path R in the calendar, was very much an end user developer product. And so that is great. It You can get a lot of developer love, and I feel that we do as a company have a bunch of developer love. But it’s a different type, where it’s people building these things. Again, it’s more akin to a Twilio because you don’t really run - As a person, you wouldn’t run Twilio. I don’t know how many people remember. It was like ask your developer billboard and whatnot. And people really love Twilio, but they only used it inside of like, “Oh, I’m building this app or service for thing.” And so we’re very much directly to that. And you also know that I used to work for a competitor for Twilio, so it’s kind of ingrained, in my DNA.
    Swyx [00:43:35]: People don’t know InfoBip is that big.
    Ivan [00:43:38]: Yeah, it’s.
    Swyx [00:43:39]: Because.
    Ivan [00:43:40]: It’s a billion euro.
    Swyx [00:43:40]: They’re all American. They’re like, “Whatever’s in Europe doesn’t matter to me.” But like it’s the, it’s the same size or bigger? Same size?
    Ivan [00:43:46]: It’s about half the size.
    Swyx [00:43:47]: Half the size?
    Ivan [00:43:48]: Yeah, about half the size.
    Swyx [00:43:48]: It’s like, yeah.
    Ivan [00:43:48]: Still huge. Multiple billions a year. Yes.
    Swyx [00:43:51]: That’s crazy.
    Ivan [00:43:51]: Exactly, and so that - These are like really interesting and large revenue-generating, very sticky businesses. Whereas when you’re selling to the - When your focus is the end developer, it is a very hard sell because they’re very price sensitive, very price conscious, very around that. And there’s very It’s very hard to scale. Your cap is the number of people that are willing to spin up - First of all, wanna spin that up, and then spin up multiple of these. Whereas if you’re in the enterprise one, like we know everyone’s talking about like how many tokens they’re spending, I’m spending. Like a lot of companies today are like, “If this is our company, spend as much as you can.” Like basically that is where we’re going. And so if you think about that paradigm, where you’re selling to companies that say, “Spend as much as you can to generate, productivity,” versus, “Oh, I’m a single person. I have this much budget, and I’m doing this thing because it’s fun or it’s helping me out or whatever.” Like it is a different, it’s a different go-to-market, I think, strategy.
    MCP, CLIs, and Sandboxes as the Agent Runtime
    Swyx [00:44:50]: Yeah, there’s a lot of discussion. I’m just kind of going through like the mental list of things that are in your favor, which is, for example, MCP versus CLI. Like obviously you want CLI. It’s been very good for you. I feel like it’s maybe a drop in the bucket or maybe it’s huge. I’m just checking whether it’s like these are big trends.
    Ivan [00:45:10]: Those things you - work well in our favor, to your point just because every.
    Swyx [00:45:13]: They’re kind of drop in the bucket, right?
    Ivan [00:45:15]: I think it’s like sort of all the things come together. And so there’s so many things that impact that. To your point, like OpenClaw wasn’t huge for us, but like having the agent SDK, from Anthropic, so or Cloud Claude Code was very interesting. The reason why it was interesting is that a lot of, let’s call them app I don’t know what to call them, app layer agent companies, essentially they are like, “Oh, I can create this new app, this new agent. All I need, I just use Claude Code, and I throw it into a sandbox, and then I have my interface to the human to that.” And so that enabled so many more companies to actually offer this, and then they would pull on sandbox. So that was, that was interesting. And to your point, like MCP, versus the CLI, the MCP is an interface against an API, whereas the CLI is like you can actually go do things. Like this is it. The difference between integrations and actually running scripts or data or analysis against a thing. So being able to use a CLI very well enables the agent to do more things, and it’s because that people will invoke a sandbox, they’ll run it in the CLI, and but it’ll do anal-analysis on that data and then give you an actual result versus just, pulling data from an API source.
    Swyx [00:46:29]: Yeah, it’s a layer of indirection basically, it’s the same thing as agentic search versus RAG, which where you’re.
    Ivan [00:46:34]: Exactly, yeah.
    Swyx [00:46:34]: Just like you just win whenever people put more agents into their workflow. And so like it doesn’t really matter, but I’m just kinda teasing out like what else have people heard about that like it’s sort of, “Oh yeah, this is another sandbox use case. Oh yeah, that’s another one.” Am I, am I missing any big ones?
    Ivan [00:46:51]: The thing, the thing that people, which is the computer use stuff, which I think is probably the most interesting one, is, and to your point, we’ve talked to so many people over the last year. It’s like, “Oh, like why do you need a sandbox? Why do you need this? Why this?” And to your point, it’s like, “Oh, I need sandbox for this. I need sandbox for that. I need sandbox-” It’s like, “Oh, I need it for every single thing.” And so basically what I, what I - and it sounds like a broken record, it’s like you use a laptop every single day, right? And you are n of one. It’s just you. But now imagine how And by the way, the laptop, the computer PC market, the PC market is about equal to the cloud market in total. So it’s about 150, 180 billion a year. Something like that. It’s about roughly the three cloud hyperscalers is about equal to like Apple, HP, Lenovo, whatever, It’s a little bit less, but it’s sort of like that. And now imagine And that’s just like, so how big is the addressable market? What, how many people are there in the world now? What’s the last data?
    Swyx [00:47:45]: Let’s call it eight billion.
    Ivan [00:47:46]: Eight billion. And so let’s say you can have two computer, like you have one personal and one business, whatever. Like so it’s double that, right? and so that’s 16 billion, right? How many agents are gonna be running in two years, in 10 years, in 100 years? Like And for every single task, they will need one of these. And so how big is that? That market is essentially quote unquote “infinite”. You will get to the point, and Dylan Patel was at the conference talking about, from SemiAnalysis, that talks usually about GPUs, was also talking about how CPUs will now be a bottleneck because it will be the constraint. You won’t be able to grow, or we won’t be able to have enough of these because there won’t be enough CPUs to basically do.
    Swyx [00:48:23]: Yeah. Well, I actually had a really good podcast with Doug Oliphant, who, which was his president at SemiAnalysis, where they’ve basically been like, yeah, it’s been a GPU shortage first, but then it’s cascaded down to memory and now to CPUs.
    Ivan [00:48:35]: CPU, yeah.
    Swyx [00:48:35]: It-What’s next? So networking. So, networking actually has been in shortage for a while if you’re looking at, just GPU networking. But, yeah, it’s really crazy the amount of computer use that’s going on, yeah, cool. I, other questions are, just the one very big part is the open sourceness which you didn’t have to do, your competitors don’t do, like it’s not, a lot of people are worried about keeping their projects open source because some competitor can just slot fork it. I don’t know if there’s any reflections on just being an open source company.
    Open Source, Trust, and Enterprise Procurement
    Ivan [00:49:15]: Yeah. There’s a bunch. So we the original product that we did was open source.
    Swyx [00:49:19]: Yeah. CodeAnywhere.
    Ivan [00:49:20]: So doing that was actually very good for us. There’s basically a saying of, What’s the saying? Like, companies that are, that are doing really well, measure themselves against, free cashflow, that are kinda okay, it’s EBITDA, then, it’s, it goes all the way down.
    Swyx [00:49:36]: The worst is like GitHub stars.
    Ivan [00:49:37]: GitHub stars. GitHub stars are the worst, yeah. So you go all the way down to GitHub stars. And so our original one was GitHub stars. That’s what we talked about, we’re at the point we’re talking about revenue, so we’re we’ve gone up the stack on that. And so we started.
    Swyx [00:49:47]: No, profit.
    Ivan [00:49:48]: Yeah. We haven’t, we’re, we’ll get there. We’ll get there. But basically at that point we did stars and GitHub and it was useful, and the original variation that we did, it we split the core into its own repo and it was Apache 2.0, so very, permissive. And then we basically would bundle that on the enterprise side with a proprietary repo. So it was like open core, but it didn’t, it didn’t fill out the repository was very clean. When we did the pivot, we didn’t have time to rethink this, and we wanted to We had this open source community. It felt a shame not to do that, and so, but we still did want to add some restrictions, so in the new sandbox product we did add a AGPL 3, which is, it’s a kind of a shortcut way to do that where you are open source. And it is true open source in the sense of an enterprise can use it if it, if it wants, but you essentially can’t make a competitor without open sourcing your stuff, which.
    Swyx [00:50:42]: It’s one of, three approaches. Like, there’s, BSL and some of the other sort of, elastic license.
    Ivan [00:50:47]: Yeah. There’s some others there. So pure open source believers agree that this is not full open source and I totally respect that. That is absolutely true, but we did leave that. And Daytona, in its essence everything outside of what’s under a feature flag today, which is like the Windows stuff, GPU stuff, and whatever, it is in this open source. It is there. So everything is there, like our own scheduler, everything’s there. So we are I’ve had some competitors say, “You guys are actually open source open source. Like, you’re real.” “Like, you can actually see that.” And people do like that, and it has helped a bit, but it’s actually more helped in the consumption of our cloud product than actually transferring people over. The reason is you can actually You send the repository to your agent when you’re integrating Daytona and it just has more context. It’s like, “Oh, okay. This is why this is happening. This is why this, that.”
    Swyx [00:51:41]: You could equivalently just have docs that you can Yeah, so, okay.
    Ivan [00:51:45]: I agree, but I, it to be fair, and so it actually doesn’t really help the growth significantly today. We’ve had this conversation with, investors and other people is like, “How do you convert people.
    Swyx [00:51:56]: Dude,.
    Ivan [00:51:56]: From open source?”
    Swyx [00:51:57]: The open source business conversation is so all over the place, right? Okay, on and I would just, for listeners who maybe they haven’t thought this through, a lot of people say, “Oh, it’s our free tier,” right? Like, “Oh, if you run it yourself, but if when you get serious, call us.” Right? And then other, And then me personally, ‘cause of my Temporal experience, it actually is the way that, it’s the, it’s GTM into some of the largest companies where we wouldn’t pass their, review process maybe ‘cause we’re too young of a company or, there’s, parts of the stack that we haven’t, that just doesn’t work with them. But because it’s open source, then they, then they adopt it, and then later on we figure it out. Like, that’s the low end and the high end. I don’t know if it.
    Ivan [00:52:37]: No, absolutely, and that has been historically. The thing that we have found in this AI transition is, and so we haven’t talked about this, Daytona’s customers are everything from, the single developer, the YC startup, to people say Fortune 500, I’ll say Fortune 5, like the biggest companies in the world.
    Swyx [00:52:55]: Big Neo labs. You told me about the, we’re gonna keep them anonymous.
    Ivan [00:52:59]: All, the enormous companies, right? And because the market pull is so strong, we’re able to circumvent these processes. I’m not saying We go, we pass security audits, we pass all these things, but as you mentioned, like Temporal way back in the way, day, in our old version of Daytona, like it took us months, and usually at the end they would churn off because just like, “Oh, you’re too small of a company,” like, “We don’t trust you” “enough.” Whereas today we’ve had these large companies push us, like they would push us through. Like, usually when you would go through procurement to become a vendor of large companies, it would take you like two, three months. We get it done in five days now. And this is not saying that maybe we’re great, but it’s more, I think, a sign of the market where it is today. And so when you think about that, the open source is something that we, from a go-to-market perspective, don’t think about that much because everything that we’ve created right now has been PLG through the cloud product, people signing up and just pulling us inwards.
    GitHub, Agent-First Versioning, and CI Bottlenecks
    Swyx [00:53:53]: Yeah, this is a personal interest, and I don’t know if you have an answer, but, do you have problems with GitHub?
    Ivan [00:54:02]: I do. A little bit. A little bit.
    Swyx [00:54:04]: Yeah. Tell me, tell me. ‘Cause I’m thinking about, well, okay, what would it take to replace GitHub?
    Ivan [00:54:09]: There’s a lot of things. I’ve thought about this, and I’ve talked, I’ve tweeted about this, and I looked at some. I’ve actually invested personally in some.
    Swyx [00:54:17]: Is it, Entire?
    Ivan [00:54:18]: No, I haven’t done it.
    Swyx [00:54:18]: No? Okay.
    Ivan [00:54:19]: Yeah, so I, and I’ve met Thomas or virtually and we’ve talked. So I really think that And this was my reason for that. Because we have a bunch of background long-run agents, and for our time most of them are coding agents. Like, everyone was building up a competitor to Lovable or Devin or whatnot. What we saw from our customers was that they were all trying to figure out how to do, versioningLike, everyone is doing it in different ways. There was like some really weird ways where people were doing that, and the reason was that GitHub as is was an overhead. Like, it wasn’t fast enough what they needed, it didn’t solve the problem that they needed. And to be fair, like GitHub is for post your the inner loop, right? It is post your laptop, right?
    Swyx [00:55:07]: Yeah, GitHub is the point at which the outer loop starts.
    Ivan [00:55:11]: So people started using that for sandboxes, which is inner loop, which is usually, it’s on your laptop, right? And so that is not what it’s made for, and then we had everything from people Actually, the most interesting one is we had one customer that would literally take the entire code base inside the sandbox and every I forgot what the time sequence was, they would just dump it all into a JSON and then push that to S3. And that’s it.
    Swyx [00:55:37]: Make your own Git.
    Ivan [00:55:38]: It’s, it But it’s not, there’s not even diffs, it’s just a whole thing every single time. It’s just every Because it was super fast. Like, it didn’t matter. And then they would go back and search and find, sort of what the file was and write it, and whatnot. Because there’s text file, there’s JSON, like they’re very small so the network cost is very low, and they didn’t care, and they just did it that way. And I’m like, if people are doing this, that means there needs to be a new solution to this problem, right? And so for me, it’s quite interesting to look at who is building these types of new things. Agent first. I think Git as is still exists in the future, maybe even GitHub exists, but there will be a whole new sort.
    Swyx [00:56:15]: Yeah, exactly. Git is like the deploy artifact to kick off CI/CD. But then there’s a layer before that is like the agent collaboration layer.
    Ivan [00:56:23]: Yeah. And so I think something needs to be said there, but on the other side, like there’s issues with Another interesting thing is just like CI right now. So the amount of PRs being created is insane right now, right? In general.
    Swyx [00:56:33]: Even for you guys, right?
    Ivan [00:56:34]: Everyone’s creating a bunch of PRs. everyone. And then all that has to go through CI, and then that’s the bottleneck. Like, everyone’s bottleneck. Like, not just like, not just actions, but like go to any CI provider, you will not be able to, if you have a high throughput of PRs There’s one company we’re talking to, they do 1,000 PRs a day. Which means like And they’re just waiting. They have just a queue on that, right?
    Swyx [00:56:55]: What do they use, Buildkite.
    Ivan [00:56:58]: I don’t know what they.
    Swyx [00:56:59]: Circle?
    Ivan [00:57:00]: They’re, whatever.
    Swyx [00:57:00]: Technically your tech can be used for CI.
    Ivan [00:57:03]: That’s, that was the conversation. That was the conversation.
    Swyx [00:57:06]: Is that a serious conversation?
    Ivan [00:57:08]: We’ll, we’ll see how that goes. We’ve had quite a few conversations around that. We’re we are not a CI provider by any means, right?
    Swyx [00:57:13]: But what is what’s missing?
    Ivan [00:57:15]: No, so essentially.
    Swyx [00:57:17]: Nothing.
    Ivan [00:57:18]: You, essentially you could use a Daytona sandbox instead of whatever you use for, your GitHub runners essentially.
    Swyx [00:57:27]: Like, yeah, I’m The only thing I would say is like maybe CI machines are supposed to be very cheap, maybe it’s like the low end because it’s supposed to be like, non-blocking or like something like a, like a background job. Like, it’s, the urgency is not that important for CI.
    Ivan [00:57:45]: Performance is, though. Performance is, yeah.
    What Sells Daytona: Responsiveness, Support, and Customer Trust
    Swyx [00:57:48]: Yeah, okay, that is interesting, and yeah, I think, like before we leave Daytona and go into like sort of broader like founder takes and what have you, any other Daytona elements that, is interesting that we haven’t touched on?
    Ivan [00:58:04]: Interesting Daytona things. There’s, there.
    Swyx [00:58:06]: I can, I can give you more prompts if you want.
    Ivan [00:58:07]: Yeah, I’d love more prompts, actually.
    Swyx [00:58:09]: Okay. So when startups evaluate you, so you have, you have all these like names and you have more that you can’t, you can’t even name, they see all your wall of competitors. and yeah, you have differentiation versus, many of these, but like what sells them?
    Ivan [00:58:26]: The thing that we found that sells people the most, this is more maybe a day two thing instead of a day one thing. And we’ve seen this again and again. So we have a bunch of case studies, and we have a bunch of them still coming out. They’re all done by a third party, so we don’t do the case studies, and it’s actually interesting to watch those cases. I watch, they’re recorded, and because it’s a third party, people are actually more open, and they will tell you, “Oh, we use this competitor,” or, “We like this competitor more,” or this thing or whatever. And the number one thing that people come back to us for is that our, we have an insane responsiveness.
    Swyx [00:58:57]: In terms of your team?
    Ivan [00:58:58]: In terms of the team, yeah. Insane responsiveness has been by far the Now, we can talk about like features and breadth of product and concurrency and CPUs and like all those things, but I feel that would probably So if all other things are equal, that is very much a differentiator I’ve found. And I didn’t know.
    Swyx [00:59:15]: Is that entirely Slack or Slack plus email?
    Ivan [00:59:18]: It is, there’s email there as well, there’s calls, but the vast majority is like on Slack. So it’s Slack. Like, we have had customers like, “Hey, we have a problem. Can you get on Huddle?” Like, we will get on that Huddle like in five minutes, literally. I’ve done this multiple times, so yeah.
    Swyx [00:59:31]: Wait, okay, so how big are you?
    Ivan [00:59:33]: 25 today.
    Swyx [00:59:34]: How do you do this kind of support like this?
    Ivan [00:59:36]: We’re insane. We don’t sleep. 007, have you heard the new thing?
    Swyx [00:59:40]: 007. like I’ve met your team. They’re very impressive, they’re very dedicated, but like also how do you get a team to do that? it’s.
    Startup Culture, Family Tradeoffs, and Enjoying the Pain
    Ivan [00:59:48]: So there’s.
    Swyx [00:59:49]: I have Slack exhaustion?
    Ivan [00:59:51]: Yeah, we all have Slack exhaustion. We’re very tired. the thing that is unique, I don’t know unique about us, but unique, I would say unique about any successful, serial founder is that you’re able to pull in people that you’ve worked with before, and so you can’t do that as a first-time founder. Like, I couldn’t have done that or not. But of the 25 people in Daytona, I think about 13 of them we have worked with seven years plus. So it’s like high trust, high throughput, high we know what we’re signing off to do. And especially these people worked with us when we were starting, and we were actually hustling. hungry for food hustling type level, and so those are the people that work with us. The, now the new segment that has come is almost everyone is sort of, one degree of separation, so it’s like someone that someone has known, and so they sort of come into this org. And we’ve had people that have like not fit into org as well. It’s just like, it’s type of culture where there is a high expectation of, being online, replying for these things, and I do that first. You if you ask any engineer, they’re like, “You never sleep,” like, about me. And so then I do that as an I don’t do it as an example. That’s just how I’m wired. My wife doesn’t appreciate that I have to tell you. My wife doesn’t appreciate that. I told her about 996, she said, “I wish.”
    Swyx [01:01:09]: It’s like these Chinese people are slacking.
    Ivan [01:01:13]: Yeah. So, that is something there. And so I think every company has their own culture, and that’s something very deep, ours. And it’s something that’s come up again and again, and every single day we’re reminded about that. And I didn’t go out thinking that is how I’m gonna build it. It’s just how I’ve built these things right now.
    Swyx [01:01:29]: Yeah. so okay, I’ll transition a little bit on the founder side. Like, I’m very impressed by you in general of, your sort of balance, you have, you have a young family.
    Ivan [01:01:38]: Two kids, yeah.
    Swyx [01:01:39]: Two kids now.
    Ivan [01:01:40]: Yeah, two kids now. Yeah.
    Swyx [01:01:41]: I think a lot of people I meet, they’re like, “Oh, I’m starting a family. I can’t be a founder,” and all that, what’s your advice to those people?
    Ivan [01:01:48]: Everyone has their own I, it’s a hard, it’s a hard, they Every single day, so my family, they’re here right now, but they’re usually I fly between Croatia and here. Like, a lot of our team is in Croatia. A part of our team, and are growing, is here now in San Francisco. And so I spend a lot of time away from my family, and that is hard. Like, that is a sacrifice that you have to. But going in, people say, on your deathbed, you’re gonna miss some of those things. The thing that, and probably might be true, but the thing that going into this, I already said, I know that this is gonna hurt, and everything has to hurt. By the way, I’m very much of a feeling that everything has to hurt. Going to the gym hurts. Losing weight hurts. Like, everything has to hurt, right? It does. Like, we all.
    Swyx [01:02:32]: No pain, no gain.
    Ivan [01:02:33]: It is literally, but you actually have to enjoy the pain and just, if you don’t enjoy the pain, it’s not for you. And so you get accustomed to that pain. And so love the kids, especially I have a daughter and a son. Daughter is the eldest, love her and do miss her when she’s not here, but it’s like, that’s what I signed up for, and there is a plan and target of what I’m trying to achieve. And now hopefully with my wife, which does support me, we can get ourselves together more, so it doesn’t there. But she takes a large part portion of that. And so if you have a partner on the other side that is okay with that, then you can do that. But even if they do, you have to be okay with not being there, right?
    Swyx [01:03:11]: Yeah. This is my vision for you, this meme.
    Ivan [01:03:15]: Yeah. I.
    Swyx [01:03:15]: That’s your kids in the future.
    Ivan [01:03:18]: Yeah, I think.
    Swyx [01:03:18]: It’s like this,.
    Ivan [01:03:18]: We have to teach them that they’re not rich.
    Swyx [01:03:19]: Because Dad, built the compute sandboxes.
    Ivan [01:03:21]: Yeah, you built compute sandboxes. Dad made sandboxes. Dad made sandboxes.
    Swyx [01:03:25]: Built the spiritual successor to serverless and Kubernetes and for agents, any other sort of, hot topics, trends? You have a lot of hot takes, actually, you are best known for, you were, you were, you were sort of in sort of hustle culture mode, right? And someone quoted you and said, “I haven’t even heard of you, bro.” “Just log off and take the, take the Christmas off.” And then your response was?
    Ivan [01:03:53]: Oh, my response was, “That’s why I can’t.”
    Swyx [01:03:56]: Like, I think that’s, very typical of you. I don’t have it here. I can’t, I can’t bring it up. But, I think that’s very typical of the culture. But, I think you have a lot of, interesting hot takes like that. Any other sort of takes on, the startup ecosystem?
    SaaS Token Resellers, API Revenue, and Startup Hot Takes
    Ivan [01:04:11]: Oh, yeah, the startup ecosystem. And this was the recent one, which is I think that And this is general, business. I feel that the It didn’t come off, I think, well on Twitter. Some people at least misread it. Which is, the market is adding premium to SaaS vendors that are reselling tokens. And I think that’s incorrect.
    Swyx [01:04:34]: Why?
    Ivan [01:04:35]: Because I think So what I think, why I think that’s incorrect is that if you look at, one, your pricing depends on what the price is, if it’s public market or if it’s private or whatever. You’re saying, the person that’s reading that the re-acceleration of revenue is equal to the old revenue, which it’s not even close. Because one, you had on SaaS, you had typical SaaS margins, whatever it was, right? Stickiness and all these things. Now what you’re doing is you are saying, “Here is my agent, and I have whatever the margin is.” It’s way worse, right? And now you’re using Anthropic or OpenAI or whatever through me, the SaaS product, and then we as a community are saying now that is re-acceleration. And so one, I think that’s wrong because it, first, it’s not the same. The makeup is not the same. The other thing is, and go back to, what I mentioned earlier is, the Kua and how I set up OpenCloud and whatever. I don’t want your agent, essentially, because what happens, right now we have a problem that, and this has historically been, you have data siloed in, again, ClickHouse, QuickBooks, it’s all siloed, and now you’re giving me an agent that’ll give me the data, but it’s still siloed, right? And so now I have to, take that data and then get another agent.
    Swyx [01:05:52]: Just expose the data to my agent.
    Ivan [01:05:53]: Just expose the data. Just expose it. And one thing I have to and so I’m like, “Just expose everything and charge me for that.” So charge me for consumption of API. So you’ll have your old seat-based pricing for humans. Charge me for this. The number of agents will skyrocket, and essentially you’ll have more usage, and charge for more if your product has value. So, there’s arguments some of them do have value. It’s a database, not database. We can get into that. But some of them really do, and I was actually shocked that the first person to do this was Benioff.
    Swyx [01:06:24]: Salesforce, yeah.
    Ivan [01:06:25]: Sales.
    Swyx [01:06:25]: Agentforce?
    Ivan [01:06:26]: It, there was a tweet, I think three days ago, where she said every product in Salesforce has been exposed via an API.
    Swyx [01:06:33]: Wow.
    Ivan [01:06:33]: Everything. And I’m like, now I understand why this person has built.
    Swyx [01:06:38]: This guy’s king.
    Ivan [01:06:38]: This insane. Kudos to him. Amazing. It’s like, thank you. I don’t know if you listen to me or someone else, but like thank you for someone This is the direction of the world, and so if you can get real acceleration against that, against consumption of API, that is actual revenue, and that is actual real acceleration, and that is where value come from. And I think that there will be cold shower when people understand, no one’s actually gonna use and pay for these agents and tokens, and that wasn’t actually really a solution, but it’ll drop back down.
    Swyx [01:07:05]: Yeah. Yeah, look, obviously, I think generally correct, and I agree. I think - But people are going to try to become an AI company.
    Ivan [01:07:15]: No, absolutely. And nothing against that. And I - this is no, - To be very clear, this is not a downer on anyone that’s building this thing. Everyone has to get to, get to the revenues, get to the multiples, get the valuations, do what you have to get to the next step. Absolutely agree. But we, as a community, are now, saying, “Oh, this is, the magical way to get out.” This is not. Like, that is not what is happening, right?
    Swyx [01:07:35]: Yeah. No, I think, there was like this kitchen appliance company that put out some AI nonsense recently.
    Ivan [01:07:42]: It was also the sneaker as well. It was called Allbirds.
    Swyx [01:07:44]: Allbirds. No, Allbirds is pivoting to GPU. That’s fine. It’s like, I have - I can - I have some money left, I’m just gonna, do some lottery tickets, would you go into offering GPUs?
    GPU Sandboxes, Data Centers, and Bare Metal Economics
    Ivan [01:07:55]: Oh, yeah, we will. But not for inference. Like, essentially, what we think about is, the GPU sandbox. So, if you think of, if you have a GPU in your computer, that is what you have a GPU in the sandbox. So, there are workloads that do need GPUs. Again, I always go back to 3D rendering ‘cause it’s the easiest one to comprehend. But, if you wanna do any type of RL on, CAD or something like that, you will need a GPU in the sandbox, and so that’s coming now as well, yeah.
    Swyx [01:08:18]: How about own data centers?
    Ivan [01:08:20]: Own data centers. So we run on co-location providers, bare metal machines. Data centers, we technically can run on that or our own data center. Like, that’s how we architected it. Today, from a gross profit margin perspective, it doesn’t make sense for us to get in that. You have to raise a large amount of capital, a large amount of risk for, single-digit percentage points. So today, that doesn’t make sense, but we are fundamentally architected so that we can do that if we want.
    Swyx [01:08:47]: Yeah. you’re a large customer of these guys now. Do you see any opportunity?
    Ivan [01:08:51]: We will see. We will see, yeah.
    Swyx [01:08:54]: Yeah. I see a lot of people, trying to do the bare metal thing, we talked to Railway, the other day and they’re also doing a very similar, strategy.
    Ivan [01:09:04]: They think - I think they’re building out something or they have their own sort of data centers now.
    Swyx [01:09:07]: Yeah, they have majority their own data centers, I - But I do think, they still use Equinix and all those things. So I think it’s just interesting that this model basically hasn’t changed. It’s basically a real estate model. They manage the facilities and then you do everything else, I wonder how it can be changed for the, for the future ‘cause, the AI wave is the opportunity to reinvent everything, yeah. anything else, cool. I think that’s about it. I didn’t have any other, topics. I think this is, as best and comprehensive, if you have, any questions about the compute market, and sandboxing and Daytona, this is the best place to start. Where does this go, man? Like, we’re here in April. Things are growing 75% month to month. Like, where are we, where are we gonna be by end of year?
    The Agent Cloud: New AWS, New Stripe, or Something Else
    Ivan [01:09:58]: It’s an insane number. I’m sort of scared to say it out loud. So, it is - It’s very big, just the sandbox market on - And we - There - We talked about this in general. The entire infrastructure market is growing 40% plus or minus month over month. Everyone is growing 40% month to month. And that’s also a hot take, is like if you’re not growing 40%-ish, it’s not that - It’s just the market. You might as well - You don’t have to come to work to grow that amount, basically. I’m half kidding, but that’s where it’s going. And so where does it end? We will see. The thing that I think about from at least a CPU perspective, a GPU is even crazier, but from a CPU perspective, it is like there’s a high probability that actually owning the CPUs beforehand will be a go-to-market tactic, and it will probably - ‘Cause I - You - As you do probably talk to a lot of GPU providers, their growth is hindered by the amount of GPUs that you have right now, right?
    Swyx [01:10:47]: Yeah. It’s just like, it’s whatever NVIDIA decides to bless that day.
    Ivan [01:10:51]: That’s how much, that’s how much they’re gonna grow, right? And so where - The CPU market in general, be it like something like Railway, for example, or Vercel or whatnot, or Deployment, or it’s like the sandboxes, they’re still CPUs. So, each is growing at the pace of the of their - the market and what their, plus or minus of that market. But it’s still not constrained by that. And so my thought is, for all of us in this market, and databases fall into that as well ‘cause databases also run on CPUs. And it’s like we all have to grow as fast as we can so we can get enough of, CPUs tomorrow from Intel or from NVIDIA, ‘cause they have now CPUs and everyone else later on. So it’ll be interesting when we get to that cap.
    Swyx [01:11:30]: Okay. maybe one version I’ll phrase this is like, are you, is the potential new Heroku, new AWS or new, what’s it? New Stripe but compute? Or like what’s the, what’s the analogy that is most appropriate?
    Ivan [01:11:48]: There’s interesting. There’s like analogies of like - So the, there’s new Cloudflare, but new Cloudflare is new Cloudflare.
    Swyx [01:11:54]: New Cloudflare.
    Ivan [01:11:54]: They’re actually doing a really good job about,.
    Swyx [01:11:56]: Cloudflare owns networking. No one can fight. it’s like, come on.
    Ivan [01:11:59]: They’re doing - No, they’re doing really well. No, what I said is in the sense of their whole agent portfolio is actually really good. And I should say there are some technical I think, personally, around, everything’s under constrained under Workers. Like, Workers is their thing. But from a go-to-market vision perspective, I think they’re actually really good. I think they actually get it, unlike some other companies, and to your question is like, what is gonna be - There will be an equivalent, everyone says like an AWS for AI agents, but your answer, it might look more like Stripe than AWS, in a sense. So there will be a cloud built out specifically for agents. And so that cloud will have sandboxes, and it will have web search, and it’ll have, databases like SQLite or Neon or whatever, specifically for agent and other things. We are not at the end of the new infrastructure primitives for agents. There are more coming. So people think like, “Oh, there’s nothing else. This it.” There are more. Like, we have some ideas about the next ones. We don’t have time to do them, but there are definitely more primitives that are being built out for agents, and there will be, I think, a cloud that runs all that together.
    Swyx [01:13:07]: Yeah. Yeah, OpenAI has said AI cloud, Vercel has said AI cloud, and you are potentially also one of the other, the prospective AI clouds. I think it’s a very big prize to win, well, thanks for coming on.
    Ivan [01:13:18]: Thank you for having me. It’s been amazing.
    Swyx [01:13:19]: Yeah. Okay. That’s it.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
  • Latent Space: The AI Engineer Podcast

    Railway: The Agent-Native Cloud — Jake Cooper

    20/05/2026 | 1h 28 mins.
    Take the 2026 AI Engineering Survey and get >$2k in credits and AIE WF tickets!
    This was recorded before Railway suffered a major GCP outage on May 19, despite being a multi-AZ, multi-zone mesh ring, with HA fiber interconnects between their Metal <> GCP <> AWS, because workload discoverability was unintentionally still tied to GCP. All has been resolved with a post-mortem.
    Railway did not start as an AI infrastructure company.
    It was founded in 2020 years before agents became the default way people thought about deploying software. Jake Cooper, formerly at Bloomberg and Uber, started Railway with a simple obsession: the activation energy to ship something to production should be near zero. Push code, get a URL, iterate. No Docker files, no Kubernetes manifests, no Ansible scripts stacked on Ansible scripts.
    For years, this was a slow grind. Railway spent its first 18 months hand-acquiring its first 100 users with Jake personally greeting every Discord signup on a second monitor.
    Today, Railway has raised $124m and is growing very fast. A 35-person team supports 3 million users, adding roughly 100,000 signups a week. Their bare metal data centers have a 3-month payback period vs. renting in the cloud, with 70% margins funding aggressive cloud bursting when needed. The servers they own have actually appreciated in value as RAM prices have climbed basically meaning the value of their hardware now exceeds the capital they've raised.
    From rebuilding Railway’s network overlay over a weekend to moving the vast majority of workloads onto its own bare metal data centers, Jake Cooper is trying to build a new cloud for an agent-native world. In this episode, Railway’s founder and “conductor” joins swyx and Alessio to unpack why the next era of software infrastructure is not just “Heroku but newer,” what agents need that humans did not, and why the old deployment loop of Git, PRs, CI/CD, and static cloud resources may be heading for a rewrite.
    We go deep on Railway’s infrastructure stack: own-metal data centers, three-month cloud payback periods, cloud bursting, data center debt, Railpack, Nixpacks, Temporal, feature flags, Central Station, content-addressable filesystems, agent-safe production forks, and why the CLI may become more important than the canvas in an agent world. Jake also shares the founder journey behind Railway, how the company survived losing $500K/month, why it now serves millions of users with only 35 people, and why he believes the pull request is dying.
    We discuss:
    * How Railway went from a slow six-year grind to adding 100,000 users a week
    * How Railway thinks about agents as the next dominant software species
    * Why agents need version control, observability, compute, storage, and orchestration at 1000x scale
    * The economics of Railway’s own-metal data centers and three-month payback
    * How Railway uses cloud bursting while scaling its own infrastructure
    * Why data center debt can be a better tool than venture debt for infra startups
    * Central Station, Railway’s internal system for clustering customer feedback and incidents
    * Why responsible disclosure and over-communication matter for platforms
    * Why feature flags, progressive rollouts, and shadow traffic are essential for agents
    * Temporal’s strengths, pain points, and why workflows matter for agents
    * Railpack, Nixpacks, Nix, and lazy-loaded content-addressable filesystems
    * Why “cattle, not pets” may change if you can clone the pets
    * Why Railway is building a new cloud from scratch instead of copying hyperscalers
    * The solo founder path, focus, writing, and how Jake thinks about company building
    Railway:
    * Website: https://railway.com/
    * X: https://x.com/Railway
    Jake Cooper:
    * LinkedIn: https://www.linkedin.com/in/thejakecooper/
    * X: https://x.com/JustJake
    Timestamps
    00:00:00 Introduction: What Is Railway?00:02:07 Jake’s Path to Railway00:06:13 Railway’s Six-Year Growth Story00:08:52 Rebuilding the Business After the Free Tier00:11:17 Agents as the Next Software Platform00:13:29 Railway’s Infrastructure Philosophy00:15:42 Bare Metal, Cloud Economics, and the Compute Crunch00:17:22 Cloud Bursting and Five-Cloud Networking00:20:20 Data Center Debt and Infra Financing00:23:31 Data Centers in Space00:25:24 What Agents Need From Infrastructure00:28:24 CLIs, Canvas, and Agent-Native UX00:35:15 Central Station, Incidents, and Responsible Disclosure00:40:30 Safe Rollouts, SRE Agents, and Production Forks00:45:00 AI SRE, Specs, Code, and Tests00:48:24 Self-Replicating Infrastructure and the New Serverless00:53:18 Heroku, Temporal, and Workflow Engines01:04:07 Railpack, Nixpacks, and Lazy-Loaded Filesystems01:06:01 Coding Agents, Token Spend, and Roadmap Acceleration01:10:56 The Pull Request Is Dying01:12:28 Feature Flags and the Agent-Era SDLC01:16:15 Cattle, Pets, and Cloning Machines01:19:29 Solo Founder Lessons01:24:12 Focus, GPUs, and Building a New Cloud01:28:20 Closing Thoughts
    Transcript
    Alessio [00:00:00]: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and I’m joined by Swyx, editor of Latent Space.
    Swyx [00:00:10]: Hey, hey, hey. Today we’re in the studio with Jake Cooper of Railway.
    Alessio [00:00:14]: Conductor of Railway.
    Swyx [00:00:15]: Conductor at Railway. Yeah.
    Alessio [00:00:16]: Choo-choo.
    Swyx [00:00:17]: Do you actually have that anywhere, like on your business card?
    Jake [00:00:20]: We call some of our volunteer moderators conductors. I don’t have a business card. We’re not that big yet. At some point I will. I got handed a nice business card from the Supermicro folks, and I was like, “Damn, this is pretty official.”
    Swyx [00:00:30]: Business cards are coming back.
    Jake [00:00:32]: They’re cool. They’re hip. The conductor thing is good. We’re trying to figure out what we want to call each other internally. Some people think it’s super cringe and say, “You don’t need a name for people internally.” Some people want to call each other something. We still don’t have a really good one.
    Jake [00:00:55]: We’ve got New Railcrews, Trainiacs. Nothing has stuck yet.
    Swyx [00:01:00]: I like Trainiac. Trainiac sounds good. Railwayians. For those who don’t know, what is Railway? Let’s give people a crisp definition up front.
    Jake [00:01:09]: Railway is the easiest way to ship anything. You go to the canvas, or you talk with Claude, and you say, “Deploy a Postgres instance, deploy my GitHub repository, run this code,” and you’re off to the races.
    Swyx [00:01:22]: You’ve got a nice animation on the landing page.
    Jake [00:01:24]: Thank you. None of my work, by the way. They don’t let me touch the design stuff anymore.
    Jake [00:01:25]: We want to make it trivially easy not just to deploy things, but to evolve applications over time. Most tooling right now stacks entropy on top of entropy: Docker, Kubernetes, Ansible scripts, and all these other things. If we can version all of your software and keep track of all the changes, then we can make it trivial to clone environments, fork into a parallel universe, get copies of production data, get copies of any services, make changes, validate them, and collapse them back in without reproducing everything across a staging environment.
    The Railway Origin Story: From Uber Systems to a New Cloud
    Swyx [00:02:07]: I was looking at your background: Bloomberg, Uber. Nothing immediately stands out as, “This guy is going to found the next great platform as a service.” What prepared you for Railway?
    Jake [00:02:21]: It was curiosity to keep going deeper. I started out on front-end stuff, working on Wolfram Mathematica and porting it over. Then I briefly moved to Bloomberg, then toward Uber and distributed systems, taking the Jump Bikes systems and moving them to a distributed system built on top of Cadence, the pre-Temporal Temporal.
    Swyx [00:02:44]: Which, by the way, I’m happy to talk about, pros and cons.
    Jake [00:02:48]: Totally.
    Swyx [00:02:51]: But let’s do the Railway story.
    Jake [00:02:52]: It has been a continual step of wanting an experience. Whether it’s walking up to a bike, unlocking it, and having it work frictionlessly, or something else, the depth required to make that happen follows from the experience. A lot of the work I do, and a lot of the team does, is in service of that experience. We fundamentally don’t care how deep we have to go. We will swim to the bottom of the swimming pool to get the experience.
    Jake [00:03:17]: I don’t have a physics PhD. I did an EECS degree. It has always been about figuring out the next step: how do we get there? That’s what led to starting Railway for that experience and then moving all the way to bare metal data centers. I was adding patches to the kernel this week to get the experience there because I can see how much better it can be.
    Swyx [00:03:49]: Other patches to the Linux kernel this week?
    Jake [00:03:51]: Yeah. Not upstream. Our fork.
    Swyx [00:03:52]: That’s a flex. Railpack? No, this is different. This is the OS on top of Railpack?
    Jake [00:03:57]: No, this is an actual kernel patch. It’s always literally: what do we have to do to get that experience? Then figure it out. Anything is figureoutable.
    Swyx [00:04:10]: Would you send the patch upstream, or does it not fit other use cases?
    Jake [00:04:13]: Maybe. We have to work out the experience internally. It has to do with the storage layer we’re building for some of the agentic stuff. Maybe it’ll be useful upstream, but it’s deeply useful for us internally.
    Open Source, Forks, and Non-Deterministic Versioning
    Swyx [00:04:29]: You mentioned open source before. How do you think about starting from open source, and then coding agents letting you do a lot more from forks of it?
    Jake [00:04:38]: GitHub’s original sin is that it’s almost a series of broken pointers. You have this thing, then you clone it, and now you’ve lost the whole upstream. How do we make it trivial for people to modify really small pieces of it?
    Jake [00:04:51]: We think of Git in a discrete sense: I’ve either made a change and merged upstream, or I haven’t. What would it look like if it were percentage-based, a little more non-deterministic, or a stream of changes that users traverse as a percentage rolled out in general and then rolled all the way up?
    Jake [00:05:13]: We have the open-source kickback program and let you deploy templates because we want to make it trivial for people to version these shards over time. It solves a large problem around authentication, authorization, and security. NPM has a way to define, “Don’t take any new packages.” The ideal end state is that you roll out progressively to users with the minimum impact zone and continue rolling up. JPMorgan should probably be the last one on the patch line, for all our sakes, because our money and livelihoods are there.
    Jake [00:05:53]: It’s okay if Johnny Vibe Coder gets a broken patch because there’s so much entropy in the system that the rubber has to meet the road at some point. You have to test at varying levels.
    The Long Grind: First Users, Free Tier, and Making the Business Work
    Swyx [00:06:13]: I wanted to pull up this glorious chart, which is your usage or number of daily signups?
    Jake [00:06:22]: Daily signups, I think.
    Swyx [00:06:24]: You started six years ago. It was a slow grind, and now you’re on a rocket ship. You say, “Don’t doubt your fight and don’t quit.” Maybe pick out certain points that were key inflections for the company.
    Jake [00:06:40]: At the start, it’s about getting your first 100 users, hell or high water. We had a website and a support link. The support link was the Discord channel. I had notifications on with two monitors: the monitor I was working on and the other monitor with Discord. If anybody came in, I was immediately like, “Hey, how’s it going?” It was rare, so getting those first 100 users to come back was the start.
    Jake [00:07:14]: Then you build a consultancy factory because users want all these things. You have to go back to the board and ask, “What is the actual product offering I want to build on top of this?”
    Jake [00:07:28]: VCs want charts that always go up and to the right, but in reality you don’t necessarily want charts that look like that. For us, there have been periods of expansion where we add features to test use cases, and periods of compaction where we ask, “If the experience we have is good, how do we make it significantly better?” Maybe we strip out features that don’t fit our ICP anymore.
    Jake [00:07:57]: The boom from 2022 to 2023 came from the free tier. Everybody under the sun was using it.
    Swyx [00:08:09]: A lot of Reddit bots and Discord bots.
    Jake [00:08:12]: And crypto miners. When you build an open product on the internet where anybody can sign up, the internet is a horrible place with so many things. You go through periods of asking, “How do I reach as many people as possible?” Then, “How do I fit the exact use case for the people who really matter and are really excited about this specific thing?”
    Jake [00:08:39]: Then there was a two-year period of making the actual business work. During the free-tier era, we were losing about half a million dollars a month.
    Swyx [00:08:59]: On a $20 million bank account.
    Jake [00:09:02]: On a $20 million bank account with maybe $50,000 a month in revenue. That’s a horrible business. I don’t know how anybody invested. But you have to go through it and say, “We have an experience people love, but the business has to work.”
    Jake [00:09:17]: There are two schools of thought. You can run the horrible business all the way up with bad margins, or you can go back and make it work. We’ve always wanted a super lean team. We’re 35 people right now. It’s very small.
    Swyx [00:09:36]: Supporting three million already?
    Jake [00:09:38]: Yeah. We’re adding 100,000 users a week right now, so it’s growing fast. We don’t want to add headcount for the sake of headcount or throw bodies at problems. We want to build systems. It’s hard to build systems during expansion because you’re adding things to the system because people are asking for them or things are breaking.
    Jake [00:10:00]: We had to cut off the free users for a little while, rebuild the business, and make sure it worked. We want to reach as many people as possible because software is important. It’s become difficult to create things in the physical world, so it’s important to make it easy for people to build in the virtual world and have access to creation. But there are legs to that journey.
    Jake [00:10:30]: You can see divots in the charts. If you follow between 2025 and 2026, it’s either summer or winter. People go on holiday with family.
    Swyx [00:10:50]: It affects that much?
    Jake [00:10:51]: Yeah. It’s kind of B2C and kind of B2B. People are shipping constantly, then they stop. Our activation curve now shows more people activating on weekdays because we have more business users, so it smooths out over time.
    Agents as the New Interface to Deployment
    Swyx [00:11:17]: Was there a point where you started prioritizing AI development or agent development?
    Jake [00:11:24]: We’ve prioritized agentic as a top-of-funnel thing. Over the last six months, we’ve deeply prioritized agentic as a mechanism to build and deploy things because we believe the curve is so steep and that is how people will build and deploy software.
    Jake [00:11:42]: It almost fundamentally doesn’t matter whether this is dot-com or not because we’re all on the internet anyway. If agents are going to deploy a bunch of things and we hit an inference wall at some point, we’ll fix those problems. The dominant species over the next 10 years is that we’ve moved from assembly to C to C++ to JavaScript to words. You’re going to need to close that loop.
    Swyx [00:12:13]: When you say this is dot-com, did you mean buying the domain, or the general case?
    Jake [00:12:17]: I mean the dot-com era, when companies had a huge run-up because people understood the internet was important. Then they hit bottlenecks, fundamental laws of physics, math didn’t work, and everybody came back down to earth. But it didn’t matter because the internet became so impactful. If you operate on a long enough time horizon, you should build these things anyway because you can see where it’s going.
    Jake [00:12:45]: That’s where I think a lot of agent stuff is. You get to a point where you’re running thousands of agents in parallel. What is the inference cost? What is the compute cost? How do you make that efficient? How do you coordinate all this? We have issues coordinating humans; we don’t even have good tooling for that. Now we have to figure out how to get agents to coordinate, safely version changes, and know when to raise their hand for someone to intervene. Otherwise it becomes an interrupt factory.
    Railway’s Infrastructure Thesis: Network, Compute, Storage, and Metal
    Swyx [00:13:19]: Let’s go right into the technical side. What are the core infrastructure or architectural beliefs of Railway that allow you to do what you do?
    Jake [00:13:29]: The primitives matter a lot for us. We need network, compute, storage, and orchestration around it. You need control over a lot of those things. We’ve talked a lot about how we don’t really use Kubernetes because we want higher-order control to place workloads in very specific places.
    Jake [00:13:48]: The reason is that you have to be very efficient with agents: memory reuse and all these other things, or you’re going to massively blow up your cost structure. Being able to rack and stack your own servers and build your own metal unlocks performance and cost. Experiences where you’re running 1,000 agents in parallel are not massively cost prohibitive.
    Jake [00:14:13]: Token use and compute use are blowing up. Over time, those things have to get a lot more efficient. You can get a lot of margin to make those experiences solid by building your own metal. That’s all in service of offering a differentiated experience to as many people as humanly possible.
    Swyx [00:14:51]: You have a data center in Singapore.
    Jake [00:14:53]: Yeah. We have two in every other region now. In Singapore, we’re adding a second one in Q3.
    Swyx [00:14:58]: What’s it like? I’ve never built a data center. Do you go to Equinix and say, “I want some slots?”
    Jake [00:15:05]: Yeah. Equinix. You basically go and say, “I want power and I want a cage.” They say, “Great, here’s what it’s going to be.” You rent the cage for a period of time, fill it with racks and servers, and hook up internet to it. That’s all the pieces.
    Swyx [00:15:36]: Then you handle everything else.
    Jake [00:15:37]: You handle everything else.
    Swyx [00:15:39]: What’s the math versus clouds doing it for you?
    Jake [00:15:43]: If we rented in the cloud, our payback period when we go to metal is about three months.
    Swyx [00:15:50]: Which is crazy.
    Jake [00:15:51]: It’s nuts. That’s four years of depreciated hardware. You’re going to see a lot of this compute crunch because hyperscalers are buying up a lot of stuff. We’re working directly with OEMs, resellers, and people building these machines: Supermicro, Dell, and others.
    Jake [00:16:11]: Upstream, there’s a bunch of supply pressure. When we raised our last round, between deploying capital for servers and now, the amount of money we’ve raised is less than the amount of money we have in the bank plus the value of the servers because the servers have appreciated as RAM has gone up. It’s nuts how valuable hardware has become.
    Jake [00:16:50]: If you look at hyperscalers, they deployed around $80 billion of capital expenditures this year, and next year will be more. That’s a massive infrastructure build-out. You look at that and think it’s crazy that they’re spending way more than the Manhattan Project. But if every person is going to run dozens or hundreds of agents in parallel, you have no conceptual idea how much compute is required to make that experience happen, even if you’re deeply efficient and sharing resources. And that doesn’t even count inference.
    Swyx [00:17:22]: How do you plan the build-out? The growth chart is so vertical. Are you usually at 100% utilization as soon as racks are live? How far ahead are you planning?
    Jake [00:17:33]: We still maintain cloud presence for bursting. We work with AWS, GCP, and a few other clouds. We can rent, and then the moment we get space or power, we compact those workloads off the cloud. We started on the clouds, then built a system to migrate to our own metal. There’s nothing that says you can’t continually do that again, and that’s exactly what we do. We never want to be compute constrained.
    Jake [00:18:09]: At the start of the year, we actually became compute constrained because one upstream provider wasn’t able to give us quota at the rate we needed, and the hardware was slower. I spent a weekend rebuilding our entire network overlay so we could straddle five clouds: Oracle, AWS, ourselves, GCP, and one other one. We can do more than that now.
    Jake [00:18:38]: We got into a spot where we were trying to pack instances tight because we couldn’t get enough compute. That led to a few reliability issues, which are now past us. I made a tweet pointing out that it’s becoming harder and harder to acquire compute at the rate these models need to acquire compute. We got bit by it.
    Swyx [00:19:15]: How do you think about pricing knowing you might not have your own metal available at all times? Are you pricing assuming you need extra margin if you end up going into the cloud?
    Jake [00:19:26]: Because we’ve built out our metal data centers, our margins on metal are around 70%. We can deeply subsidize the cloud business if we want to scale at a reasonable rate. We have a few levers: metal, which makes the margins; cloud burst; debt to buy servers; and venture capital. It’s an interesting operational problem: how much cash do we have, how much should we raise, how quickly can we deploy it, and can we scale revenue as quickly as we scale compute?
    Jake [00:20:05]: If we continue making it trivially easy for people to build and deploy, then the faster we close that loop and the more operationally excellent we are with capital, the faster the business can scale. It’s almost a straight linear deployment rate.
    Financing Infrastructure: Hardware Debt, VC, and Operational Leverage
    Swyx [00:20:20]: I think infra startups raising debt is a tool people don’t utilize enough or know enough about. What can you tell us about that? Is it secured against your CPUs?
    Jake [00:20:32]: It’s secured against our hardware.
    Swyx [00:20:37]: What rates do you get? Who are the lenders?
    Jake [00:20:39]: We pay prime plus a spread, and we can refinance any of the debt as rates go down. The terms are pretty good. The unfortunate thing is that Twitter has no nuance, so people say, “Venture debt bad.” But as with all things, there are specific tools and areas where you can be deliberate instead of using one tool as a hammer. Venture capital is not the hammer for everything. You have to explore and figure out what works.
    Swyx [00:21:12]: VC is usually the most expensive financing you can get.
    Jake [00:21:15]: Yeah. I also think people think about VC incorrectly from a capital-raising perspective. Most people think, “How do I raise as much money as possible from whoever is probably the best I can get at that time?” That’s close to right, but what we’ve tried to do is figure out what unfair advantage we can buy with that equity.
    Jake [00:21:34]: It’s the most expensive equity you’re going to give away at that point in time, assuming the company keeps getting better. How do you use it to work with someone stellar who complements you? In the seed stage, I had never started a company. Ray Tonsing had good advice, and I could text him all the time. He was really fast. Awesome.
    Jake [00:22:01]: Then with John and Erica at Unusual, they said, “You roughly know what you’re doing building a product. We’ll mostly leave you alone and be available for advice.” Amazing. Then we got to Series A and the business was an operational tire fire because we didn’t know how to scale a business. Work with Erica, and Jordan is over at Redpoint, so bonus.
    Jake [00:22:28]: Now we’ve raised from TQ and FPV as we’re moving into enterprises. Every step of the way, we’ve asked: who can we partner with at this specific time to unlock the next section of the journey? I don’t know enterprise sales. As an engineer, I can eyeball what features we might need, and we have wonderful people internally who can help. But you want boardroom dynamics where everyone is aligned and asking, “How do we win this?” instead of bickering about strategy.
    Data Centers in Space and the Physics of Compute
    Swyx [00:23:31]: You had a tweet about data centers in space. Why no data centers in space?
    Jake [00:23:37]: It’s not “no data centers in space.” My hot take is that I think it is solvable. I’ve just never seen anybody solve it.
    Swyx [00:23:49]: You said, “How are you going to dissipate that much heat in a vacuum?” You’re making a physics claim.
    Jake [00:23:55]: I haven’t seen anybody prove how you’re going to dissipate that much heat in a vacuum. It doesn’t mean it’s not possible. It just means nobody has brought it up yet.
    Swyx [00:24:05]: Astrophage.
    Jake [00:24:06]: I don’t know what that is.
    Swyx [00:24:07]: The Martian thing. Okay, you’re very logical.
    Jake [00:24:09]: It could work. A lot of people are putting the cart before the horse. They say, “We’re going to put data centers in space.” Okay, but how? “We have time to figure it out.” It’s like in The Martian where they ask how they’re going to intercept something and say, “We’ll figure it out.”
    Swyx [00:24:36]: Making a bet on human invention is weird because you blind trust that it can be solved. But with physics, there are first-principles bounds you can put on it. Maybe not. Maybe you’re asking to travel time or break a fundamental thermodynamic law.
    Jake [00:24:57]: I don’t know how VCs do this either. How do you know what’s not possible and a grift versus what’s possible but sounds completely insane? “We’re going to put data centers in space.” Coin flip as to which it is, and I guess you’ll know in 10 years. That’s one cycle.
    What Agents Need: Versioning, Observability, and 1,000x Scale
    Swyx [00:25:23]: Moving back to agents. The branching, fast spin-up, and orchestration you do feels like pre-work that happened to be exactly what agents want. What do agents want differently than humans?
    Jake [00:25:37]: They want the ability to version things. It’s not that different; it materializes slightly differently. Agents want a way to test changes incrementally. Engineers have feature flags. Is there a reason agents can’t use feature flags? I don’t think so.
    Jake [00:25:54]: They want version control. Can we use Git or not Git? That one is up in the air. I think something outside Git will emerge for how we version these things over time. They need observability. You need to query what happened, when it happened, which steps failed, traces, logs, metrics, and all the rest. They need network, compute, and storage. They need to write files, save files, iterate on files, and snapshot file systems.
    Jake [00:26:25]: A lot of what humans needed is in line with what agents need. Branching and forking are not different; we’re just moving 1,000 times quicker. It can look like you need something massively different, but what you need is something massively better than what existed. You need orchestration massively better than Kubernetes. You need networking probably better than Envoy. It goes all the way down the stack.
    Jake [00:26:55]: If the workload profile doesn’t change so much as it gets massively compressed because you need thousands of these things, what assumptions change? etcd is going to melt. You need to replace it with something. You can go all the way down the stack and say, “That part has to change, that part has to change, and that part has to change.”
    Jake [00:27:19]: The interesting thing about the super-exponential curve is that you have to build systems where you can rip out those parts at any time because a new bottleneck might emerge. You get good at parallel agents, and a different part of the system breaks. So it’s similar to what humans needed, but at 1,000x scale.
    Jake [00:27:55]: How do you do code review in the age of agents?
    Swyx [00:28:00]: You throw more agents at it.
    Jake [00:28:01]: You don’t. But then who reviews for CVEs and all these other things?
    Swyx [00:28:07]: More agents.
    Jake [00:28:08]: And that’s how we hit the inference wall. You can continually throw agents at the problem, but I think there’s a limit to the number of agents you can throw at a problem.
    CLI, Agent Handles, and Closing the Loop
    Swyx [00:28:24]: You already had a CLI before it was cool. How is the shape of what you’re exposing changing, if at all?
    Jake [00:28:28]: CLIs have always been cool. The CLI changes because we think about how to give Claude, Codex, ChatGPT, or any model a handhold.
    Jake [00:28:50]: A CLI is a single command: deploy, get logs, and so on. Things that were prohibitively annoying to humans are not annoying to agents. They’re nice. If I handed you a CLI with 40 arguments and 600 flags, you’d think, “I’m never going to use all of this.” But if you hand it to an agent, it says, “This is excellent. I have so many handles to work with.”
    Jake [00:29:24]: If you’re going to expose things to agents that way, you want as many handles as possible where they can get information, query dynamic information, and close the loop quickly. Most problems right now are about how to close the loop as quickly as possible. Where does the agent get stuck, and how can you remove that?
    Jake [00:29:49]: Telemetry is important. If you can tell where the agent gets stuck from the CLI and say, “12% of people deviate from the happy path because of this, and now I add this argument and drive it down to 2%,” you massively increase the rate of loop closure.
    Jake [00:30:03]: That’s how we think about not just the CLI, but every point in the dashboard. It’s a user journey: I hear about Railway. I get something deployed. I get my first green build or aha moment. I see an endpoint, logs, whatever. Then I iterate. The iteration loop is indefinite. The user wants to deploy a new thing, a Postgres instance, change code, and keep iterating.
    Jake [00:30:36]: If you focus on the iteration loops and what’s blocking them from closing quickly, one thing we say internally is: you never want to be waiting on compute anymore. You always want to be waiting on intelligence. If you’re waiting on compute, there’s a bottleneck that needs to be destroyed because eventually that bottleneck becomes so large that another workflow emerges to change it.
    Jake [00:31:04]: We’ve built a product where you push code, build it, and so on. But I fundamentally believe the push-pull loop is going away. We’ll get to a point where you make a small change in production, that change is versioned across your infrastructure, you’re working alongside copy-on-write versions of your database and infrastructure, and then you merge it in and it’s instantaneously live. That’s the holy grail of loops. The push-pull-rebuild thing is a point of friction that we’re removing entirely.
    Canvas as Output: Dashboards, Context Anchors, and Hyperstructures
    Swyx [00:31:43]: It’s incredibly fast. If anyone hasn’t tried it, that fast feedback is great. My hot take is that Railway was famous for its canvas, which visualizes your infrastructure and lets you manipulate it visually. But that was for humans. For the next phase of growth, Railway CLI is more important than canvas.
    Jake [00:32:05]: The canvas is funny because it’s a mechanism to show changes over time. You’re right that previously we used it a lot as an input. Moving forward, its goal is more like an output. You would go to the canvas, make changes, see them, and watch your infrastructure evolve. Now agents have access to the CLI and can make those changes. So the canvas becomes an output: what information does the human need at this moment to make suitable decisions about control requests? Do I approve this or not?
    Jake [00:32:57]: It also has to be an anchor for your context, a port in the storm. Think of it like layers in a file system. You start with a project, then drill down into services, then into a function or code, because you want to represent the entire thing not just in your head, but in the canvas. Other people can share that representation, think on the same wavelength, and move quickly.
    Jake [00:33:33]: A lot of organizations get in trouble as they scale because all the context lives in someone’s head. “How does this microservice work?” “I have no idea; go ask this person.” Then you have whole categories of products built around context discovery. A lot of that melts away if you have a solid hierarchy and can infinitely nest services, code, context, and everything else all the way down. That’s what lets you build these structures over time.
    Jake [00:34:18]: It’s also what lets us build what I’ve called hyperstructures: things that are way bigger. You look at the Golden Gate Bridge and ask, “How did we build that?” There’s a meme that we lost the technology. To some extent, yes, because the coordination that built those things evolved and changed. We lost some of the art of building structure as we jammed everything into Slack.
    Swyx [00:34:52]: But you jam everything in Discord.
    Jake [00:34:53]: Same point. It doesn’t matter. It’s message passing and interrupts, message passing and interrupts.
    Swyx [00:35:00]: So you’re arguing there should be something better and more structured than Slack?
    Jake [00:35:04]: Yeah. For sure. I think Slack is awful, and Discord is awful too.
    Central Station: Context Routing, Support, and Incident Clusters
    Swyx [00:35:09]: This is the equivalent of my mom test. What have you done that has your solution to this?
    Jake [00:35:15]: Internally, we’ve built a tool called Central Station that aggregates all the context from our users. Every piece of feedback, every customer support item, everything gets aggregated into clusters. If an incident is brewing, we can determine how many users are affected and break off a discussion based on that.
    Jake [00:35:40]: That is more helpful than long-running channels where you’re trying to decide which channel to put something in. If you can dynamically aggregate information and dynamically route it to the right person based on context, it works better. We know internally that these four people are close to networking. If we see a networking thing, we can drill it down to those four people. If it’s with this part, we can look at the commits. This is no longer a manual process internally.
    Jake [00:36:13]: If you go to station or help.railway.com, that’s why we built it. We wanted to scale with a massive amount of leverage by aggregating feedback.
    Swyx [00:36:27]: This is built in-house?
    Jake [00:36:28]: Yep.
    Swyx [00:36:29]: I remember helping out on this one with Angelo in 2023. You scale a lot with a very small team.
    Jake [00:36:38]: Yeah. We’re about 10 times bigger now.
    Swyx [00:36:40]: You have your full developer code here? Very cool.
    Jake [00:36:44]: If you go to railway.com/stats, we expose this as a pub-sub-able thing. It’s all real-time metrics. There’s a way to get it as JSON somewhere if you care.
    Jake [00:37:01]: We’re big on trying to build everything in public and talk about what we’re working on. We’ve had issues in the past, and we’ll say, “Here’s how we’re fixing these things.” We’ve gotten compliments and flak for incident reports. We’re always trying to make them better and talk with people.
    Incidents, Disclosure, and Progressive Rollouts
    Swyx [00:37:20]: You had a big one recently. I liked that it was scoped to 3,000. You presumably used Central Station. Talk through what happened and how you address it internally as a team.
    Jake [00:37:38]: Internally, this one really sucked. It had to do with an upstream provider that didn’t do the behavior it said it documented, which is unfortunate given they wrote the RFC for how the behavior should work. We rolled those things out, and Central Station caught it initially when a couple users said caches weren’t invalidating. We turned it off immediately.
    Jake [00:38:03]: When you roll out to a large user base of three million people, you get a lot of disparate behaviors. We tested in staging and had tests, but we hit an edge case. We’ve hardened those systems, and now we can make that better. But it was a tough one.
    Swyx [00:38:39]: I always wonder how private disclosure is supposed to work if people find an issue. Are they supposed to contact you first? When you run a platform, these things will happen. What channels should people pursue to quietly resolve it before it becomes a bigger incident?
    Jake [00:38:59]: There’s responsible disclosure. We err on the side of over-disclosing and letting you know something is wrong versus having your provider gaslight you. We’ve erred on sharing those things more publicly, even if they impact a small subset of users. That’s a decision we’ve made internally. We have four values. One is honor. The honorable thing is to notify people to the widest degree at which they may have been affected or there was an issue, and then confront it head-on: why did it happen, what can we do better?
    Swyx [00:39:45]: Not the whole user base. That’s because of incremental rollouts and other things?
    Jake [00:39:50]: Yeah. Progressive rollouts.
    Swyx [00:39:54]: That should be the norm at all large platforms.
    Jake [00:39:58]: It should. A variety of companies do this. There’s the quote that Meta runs 10,000 different versions of Meta. To our earlier point about agents, they need the same thing. They need shadow traffic and all these other things. We’ve built so much ceremony around production being sacred that we need to make it trivially easy to test different behaviors in a safe environment. Then you can make mistakes in a safe environment.
    Safe AI SRE: Customer Agents, Forked Environments, and Production Parity
    Alessio [00:40:30]: Do you see a world where these things get automatically caught, not necessarily by your agent, but by your customer’s agent? The cache invalidation issue seems easy to check if you know to look for it.
    Jake [00:40:44]: It’s hard because to determine it, we almost need to hook into your observability infrastructure. That’s why we have the template loop on the platform: so you can roll things out progressively. You can roll out to Johnny Vibe Coder initially, or push a shard that someone consumes at their own leisure. Or you can roll it out over weeks: 0.1% of people, 1% of people, early adopters, then all the way up. That’s the non-deterministic version control we talked about earlier.
    Jake [00:41:30]: I believe that’s where most things should go, because most companies end up building staged rollout systems in-house. It’s the same thing built again and again at every company. There’s a massive opportunity to consolidate developer debt.
    Alessio [00:41:45]: You should have a free tier. Model providers give free tokens if you let them use the data. You could give free compute if someone is the number-one shard that goes out and lets you plug into their observability.
    Jake [00:41:55]: We do that. That’s why we talked about the impact on 3,000 people. We start with lower-impact people. Larger companies on the platform are last to receive those rollouts so they have a version of the platform that’s deeply stable.
    Alessio [00:42:16]: I have three services, so I’m sure I get the first rollout. You can nuke my thing at any time. There are all these SRE agent companies. Observability people also want agents that fix upstream problems. You have your own agent in the canvas now. How do you see that playing out?
    Jake [00:42:39]: It’s the stacking entropy problem. If you don’t have primitives to make iteration in production safe, it becomes difficult. If you’re an observability provider saying, “Here’s the fix to this error,” assume 80% are good and make sense. But in the last 20% long tail of complex issues, if you let somebody stamp it, you create an opportunity for an incident.
    Jake [00:43:08]: That’s why forked environments are important. People have staging, but it always drifts from production. You need primitives, workflows, and experience built first-party on the platform so you can fork any service at any point in time.
    Jake [00:43:33]: I think of the canvas as a sheet of transparency paper. The agent is a little guy you push up into the canvas. It should say, “I need to copy that service and that service so I can test these two things.” It gets a read-only copy of production. Anything that’s PII gets marked as a transform when we clone the database, create a copy-on-write version, or read from it. Then the agent makes changes and asks, “Does this actually work?” as close to production as possible.
    Jake [00:44:22]: That’s how close you have to be, or you get massive drift. The system becomes unstable. You see this with massive systems built on Docker for local, Kubernetes for production, and a specific thing for something else. That complexity slows developers and becomes unstable at scale, making it hard to iterate. We want to compress that way down and say, “As close to prod as possible is where we want to be.”
    From AISRE Skeptic to Agent Believer
    Swyx [00:45:00]: I was texting Erica for questions, and she says you were originally not a believer in AISRE. Have you come around on it?
    Jake [00:45:10]: I flipped, but I’m still not a believer in AISRE if you don’t have the primitives to make it safe. If you unleash AISRE on production infrastructure without safe primitives for copying volumes and making sure things are fine, it’s going to nuke your production database. It’s not a matter of if, but when. I’m a big believer in making those loops safe.
    Jake [00:45:33]: I was a deep AI skeptic until 2023. In 2024, I thought, “Maybe I can roughly make this thing do it.” In 2025, I thought, “Now I can hold this.” Over winter break, everybody came back saying, “It’s almost impossible to hold this.”
    Swyx [00:46:01]: Did you see this on the Claude docs? CloudBot? OpenCloud?
    Jake [00:46:06]: It’s gotten to a point where it’s harder to hold it wrong than to hold it right. There’s a scene in Avengers where Vision picks up Thor’s hammer and says it’s terribly well-balanced. It self-balances and works well. I’m a deep believer at this point that this will be the dominant species: assembly, C, C++, JavaScript, words.
    Swyx [00:46:35]: It feels like a big jump.
    Jake [00:46:37]: It is. But it’s not like you abandon CPU-based discrete logic and move straight to fuzzy logic. You need both. Your skills should call code or applications or some static structure. You can use skills to distill what the procedure should be or how the code should act.
    Jake [00:47:02]: I’m coming to a thesis: you need three points. You need a clear spec defining the system, the code, and the tests. When you say it out loud, if you’ve been in engineering long enough, you’re like, “Of course. That’s an RFC, tests, and code.” But they all matter. Having them together lets them reinforce each other: the spec and tests match, but the code doesn’t, so reconcile it. Or the tests and code match but the spec doesn’t, so reconcile that. That’s the iteration loop.
    Jake [00:47:41]: That’s why you’re seeing people talk about software factories, docs, and reconciliation. Some of that is architectural astronomy if you don’t implement it, but that loop is where most things will end up.
    Swyx [00:48:07]: For listeners, we’ve been talking about this on the pod for three years: the holy trinity of specs and tests. Itamar Friedman from Qodo is the reference if people want to look it up.
    Self-Modifying Infrastructure and the End of Push-Pull-Rebuild
    Swyx [00:48:18]: One thing I want to mention on the OpenCloud idea is self-modification. I don’t know how Railway would support it, but I have my OpenClaw, and I just tell it it has the Railway CLI and can do whatever. In theory, whatever capabilities or new infra it needs, it can call the Railway CLI, provision it, and add it to itself. The agent can modify its own infra.
    Jake [00:48:45]: It’s nuts. I have a loop set up where you put the Railway CLI on top of something that runs on Railway. You’re authenticated as whatever the current box is, and you can make any changes to it. Then you call Railway deploy, and it deploys itself.
    Jake [00:49:04]: It’s like: “I need to spin up this instance of this environment. I already exist in this environment. Excellent, I have access to a Postgres instance now.” That’s where we want to go with agentic, self-replicating infrastructure. That’s your loop: iterate in production. You continue making changes. If it works, merge it upstream. If it doesn’t, throw it away.
    Jake [00:49:37]: How do you make throwaway copies trivial to spin up and super cheap? The era of “I have an AWS instance with four vCPU and 16 gigs of RAM” is going to get destroyed. If you do that for agents, you need a thousand of those machines. It’s prohibitively expensive compared with what we’ve spent a ton of time figuring out: the atomic unit of deploy, whether you call it isolates, sandboxes, or something else. Only pay for what you use, spin up instantaneously, and close the loop as quickly as possible.
    Jake [00:50:15]: If the system can self-replicate safely and say, “This is my environment, I’m making these changes,” it can come back with, “Does this look good? This is a new state of infrastructure given this prompt. I think I’ve solved it.” Then you go back and say, “Actually, it looks different.” It does the loop again. Then you say, “Cool. Apply.”
    Swyx [00:50:38]: That’s retroactively obvious, which is the most useful kind. Any other comments on agent deployment on Railway?
    Jake [00:50:51]: It’s getting better every day. I’m on X or Twitter. You can always yell at me about the parts not working as well as they should, because plenty of things should work way better.
    The New Serverless: Stateful, Long-Running, Pay-for-What-You-Use Linux
    Swyx [00:51:04]: At this stage, when people want massively or embarrassingly parallel compute, they usually talk serverless. I feel like there’s a new serverless compared to the previous five years of serverless. You’re in that new bucket. Do you have comparisons or philosophical differences you want to call out?
    Jake [00:51:31]: It’s somewhere in between. It’s the ability to run stateful, long-running workflows or executions.
    Swyx [00:51:42]: Vercel has Fluid Compute, Cloudflare has some container thing, Google has App Runner and others.
    Jake [00:51:55]: That’s where everything is roughly going, and it’s why we’ve been working on this for six years. We believe users need access to a computer: a box that speaks Linux. They need to deploy what they want. Other systems change the surface area of what you can build. For us, users need a computer and need to deploy anything they truly want. That’s why we’ve focused on the primitives: network, compute, storage. If we give you those and expose them so you can run things indefinitely, that’s where we believe it’s going.
    Jake [00:52:43]: Twitter has no nuance, so everyone says “servers” or “serverless.” It’s always somewhere in the middle: I want to run it for a long time, but I don’t want to provision the resource statically or pay for things I’m not using. That’s been our thesis from day one: pay only for what you use, run it indefinitely, and it is full Linux.
    Swyx [00:53:12]: That’s why I like the naming of Fluid. It’s fluid. Flexible.
    Heroku, Focus, and Carrying the Torch Without Becoming the Past
    Swyx [00:53:18]: Another milestone is the Heroku official deprecation. You’re one of the presumptive new Herokus. “New Heroku” has been a category for as long as I’ve been in developer tooling. It’s finally happening. What was that like? Any behind-the-scenes of, “This is the moment”?
    Jake [00:53:42]: You have people where you’re like, “You were running stuff on here? You, as this company?” It’s crazy that names you would know are running on it and now coming to us saying, “We want to move a lot of this off.”
    Swyx [00:54:00]: Any behind-the-scenes on why Salesforce let Heroku stagnate?
    Jake [00:54:05]: I can only guess. It’s hard when it’s not your business. Salesforce’s business is to build a great CRM. That’s their focus. Then you acquire a compute business as an offshoot. A lot of early Meta people talk about focus. Boz has a write-up about how in the early days of Meta they had no money, so they were forced to focus. Then they turned on the money tree and had no reason not to split their focus.
    Jake [00:54:52]: But that dilutes your product. You get offshoots where you ask, “Is this the focus of the business?” If it’s not core, it languishes. A lot of companies get in trouble when they split focus because they’re fighting a multi-front war, not just externally but internally for alignment. Where are we going? What are we doing? What is our purpose?
    Jake [00:55:24]: If you’re Salesforce-built and mission-driven, you want to work on Salesforce. Heroku is off to the side. It’s not core to the business. Getting resources, budget, focus, and alignment internally becomes hard. It was a matter of time.
    Swyx [00:56:06]: Kudos for them to call it out instead of leaving it unknown.
    Jake [00:56:12]: Their release was a little odd. They called it out, but they didn’t say they were shutting it down. Behind the scenes, I think they issued messages to people saying they should close accounts and that they were going to deprecate and remove things over time.
    Jake [00:56:30]: It’s crazy because some of my first deployment experiences were on Heroku. You start with dragging things into an FTP server, then you try to get a deploy working, and then it’s Heroku. It was the on-ramp for us. But the wheel turns. New things emerge. We’re happy to carry the torch for a lot of that. But we don’t want to be the new Heroku. We want to be the way people build and deploy software, and ultimately the way people monetize software over time.
    Swyx [00:57:19]: It’s still a big crown to be the new Heroku. There are 50 companies that fought for that.
    Jake [00:57:23]: Everybody is holding some portion of it. We’re happy to support people and companies. The platform works differently. The game loop is similar, but we’ve been dogmatic about where these things are going: primitives, agents, fan-out. Some things fit; some workflows need to change. We have an approximation of Heroku pipelines with the environment system. It’s exciting. We’ve got a ton of people we can support, and it’s growing a lot.
    Temporal, Workflow Engines, and State Machines
    Swyx [00:58:12]: I have one more technical question about Temporal. I’ve sold my shares. You’re a power user and one of our earliest customers. I met you through Temporal. You built on Temporal. You have complaints. This may be the most neutral and informed conversation anyone will hear about Temporal without someone working at the company.
    Jake [00:58:39]: That’s fair. I’ve used Temporal for almost 10 years because of Cadence at Uber.
    Swyx [00:58:52]: Give people a sense of what Cadence was at Uber.
    Jake [00:58:57]: Cadence was the precursor to Temporal. It powers trip actions, rides, when you rent a Jump bike or scooter or car. You’re running workflows for a period of time and saying, “This ride will run indefinitely until it finishes.” You attach information: you paused in this zone, so add this charge to the bill. When you end the trip, the workflow is done. That experience was powered by Cadence at the time.
    Swyx [00:59:34]: I used to say it’s like programming the entire user journey top-down as one function.
    Jake [00:59:39]: It’s a powerful idea and important. It’s also important for the next phase of the agentic journey. You want an agent to do a specific task, be complete or incomplete on that task, and move on to the next thing. You need a way to manage workflows dynamically.
    Jake [00:59:59]: Temporal was always great in theory, and great when you got it working the way you wanted in production. But it required you to model the entire journey in your head. If you didn’t, you could cause issues where replaying the state of the workflow causes non-determinism.
    Swyx [01:00:25]: Because it works on deterministic workflow history.
    Jake [01:00:28]: Exactly. I describe it as a jet engine. If you know how to operate it and run it, it’s great. But you can’t hand it to people trying to build complicated things if they don’t have the whole state in their head.
    Jake [01:00:48]: We run our whole deployment pipeline on top of it. That’s a reasonably complicated workflow: pre-commit hooks, signaling, queuing, and all the rest. We ran into the same thing at Uber. As you express a large workflow, it gets more complicated, with more states in the state machine that you have to map back to the workflow.
    Swyx [01:01:15]: It’s a lot of ifs.
    Jake [01:01:16]: Exactly. At Uber, we built a system for doing the state machine and testing it. We’ve started to build some of those things here because it’s grown heavily. It’s not quite love-hate. When it works well, it works super well. But if someone who doesn’t have full context puts something into the system that invalidates state or causes non-determinism, or spins off a ton of activities, you have to keep track of underlying SRE knobs like activity slots. Those should scale with memory, vCPU, and so on. It becomes a bear to scale.
    Swyx [01:02:10]: You need a capable sysadmin running things behind the scenes. If you moved off, what would you do?
    Jake [01:02:19]: We’d build our own workflow engine. We have a few internally that we’ve worked on.
    Swyx [01:02:27]: This is one of those classes of things you typically wouldn’t vibe code, but I’m wondering if you can.
    Jake [01:02:33]: I still don’t think you should vibe code it. You still want to run decent tests to make sure it works.
    Swyx [01:02:39]: Timo didn’t invent that from scratch either. There are libraries you can run. On top of that, it’s just a state machine that you have to map out. Ultimately, you define the instructions you want and run them through a state machine.
    Jake [01:03:00]: It’s very doable. Workflow stuff is interesting. Restate is doing neat stuff here.
    Swyx [01:03:10]: You’re tied into JavaScript. Are you a JavaScript maxi?
    Jake [01:03:13]: Internally, we have TypeScript, Rust, and Go. We don’t add more languages. Actually, we have a little C because we write BPF code and hooks. But those are the languages.
    Swyx [01:03:28]: Is this for sidecars?
    Jake [01:03:32]: No. It’s for the networking stack, volumes, and things like that. We use TypeScript a lot because it powers the dashboard, but we’re moving a lot of workflow stuff off the dashboard stack and into the infrastructure stack.
    Railpack, Nixpacks, and Content-Addressable Filesystems
    Swyx [01:04:00]: Cool. Any other technical infrastructure stuff? Railpacks?
    Jake [01:04:07]: We built an engine for determining dependencies based on source code. It’s called Railpack. We built the first version, Nixpacks, on top of Nix, and then we moved.
    Swyx [01:04:17]: People have been trying to get me to adopt Nix and NixOS for four years. Is it ever going to be a thing?
    Jake [01:04:23]: I don’t know. We’re excited about it, but it has pain points. Think of it as a stack of versioned binaries at specific slices in time. If you want version X and version Y, you bloat the package space, which blows up image size and makes real-world workloads difficult.
    Swyx [01:04:53]: But you content-address it and cache it. In theory, there are optimizations.
    Jake [01:05:00]: In theory, yes. But with a large enough user base and disparate enough machines, you run into a problem Meta described in the XFAAS paper, their internal serverless system. It becomes difficult at scale unless you break out specific runtimes.
    Jake [01:05:24]: We didn’t want to do that because we wanted to truly allow you to deploy anything. That was our initial thing with Nix. But we’ve moved toward interesting work around content-addressable file systems that can lazy-load anything from any point and page it into memory.
    Swyx [01:05:48]: Amazing.
    Jake [01:05:49]: The future is very bright. It’s crazy, and it’s going to be nuts.
    Coding Agent Spend, Roadmaps, and Token ROI
    Swyx [01:05:54]: Founder journey stuff?
    Alessio [01:05:56]: Your cloud usage: you tweeted you’re going to spend $300K this month?
    Jake [01:06:01]: I think we got to $200K.
    Alessio [01:06:02]: Coding agents?
    Jake [01:06:03]: Yeah.
    Swyx [01:06:04]: Across the company?
    Alessio [01:06:05]: You only have 35 people, so I’m sure they’re not all spending $10K a month. What’s the distribution?
    Jake [01:06:10]: I think I’m at about $25K. We have power users all the way down. We came back from winter break, and I basically said, “If you’re writing code by hand, you’re doing this wrong.” The tools are good enough now that you can move extremely quickly. There are issues and pain points, but you should be reviewing the code you are writing instead of writing it by hand.
    Jake [01:06:40]: Architectural patterns matter more now than ever, but you shouldn’t spend your time generating code you would write. If you know how to write it, ask the agent to write it and reconcile it until it looks like you would have written it yourself.
    Jake [01:06:58]: People misconstrue my propensity to push people toward agents as connected to our growth and some reliability bumps. They’re not necessarily related. The tools are good enough to move extremely quickly and build things way larger than you could before.
    Jake [01:07:19]: To the earlier point about cooling data centers in space: I don’t know. But with software, you can ask, “How would I build block storage from scratch? How would I do these things?” I have ideas because I have history and have read papers. Let me work them out and build massive test benches with thousands of tests, because those are now free to author. If you’re not using AI systems to speed-run your roadmap and reconcile your existing system onto the future, you’re missing a large point of what’s happening.
    Alessio [01:08:12]: What’s the path to spending $3 million a month? Is it bound by ideas and things customers can absorb?
    Jake [01:08:19]: For most companies, it’s bound by deployment at this point. That’s why we’ve seen a massive boom in users and companies, from Fortune 50s down, asking how to get developers to move faster. You’ll probably hit your CFO before any technical limits because they’ll look at the eye-watering amount of money spent on tokens. Inference costs have to come down, but we’re inference constrained now. There will be price discovery around what makes sense for an org to adopt.
    Jake [01:09:06]: I think you’ll end up with the F1 driver concept. If someone is really adept at these things, it makes sense to put them in a $3 million car. If they’re not, it probably doesn’t make sense. You’ll take a few people and say, “You can drive the F1 car. We need to go in this direction. Figure out if it works and prototype it.”
    Jake [01:09:33]: We’ve done some of that and vastly accelerated our roadmap. We thought we’d ship something in a few years; now we can probably ship it in a few months because we validated it and don’t have to build it incrementally. We can skip steps and move toward our vision.
    Alessio [01:09:58]: A lot of people are realizing the roadmap doesn’t always have a business impact, so they say tokens are too expensive. But if your roadmap were built to make more money by the time you built it, you’d have token pricing for it, the same way you do with sales. You’d spend a billion dollars on sales if you knew you would get $2 billion of revenue.
    Jake [01:10:19]: Exactly. A naive way to measure this is the percentage of tokens that end up in production. If you can measure impact because those tokens end up in production, that’s awesome. But the burden of proof will rise. Internally, we have a growing number of pull requests that haven’t merged. The question becomes: how do you get this into production? It’s about how quickly you can build and deploy software, which is exciting because that’s our whole thing.
    The SDLC Shift: Prompt Requests, Feature Flags, and Safe Rollouts
    Swyx [01:10:56]: The SDLC is changing. One thesis is that the pull request is dying. It’s going to be the prompt request. Beyond that, code review is also kind of dying if you have all the other systems in place. What else is changing about the SDLC?
    Jake [01:11:19]: The AISRE and the tools to make it happen. AISRE is pie-in-the-sky aspirational. What does it take to get an AISRE? What tools do you need to build?
    Swyx [01:11:32]: You should expose your tooling to customers at some point. The Central Station command center.
    Jake [01:11:39]: We have it for template maintainers. Template maintainers can deploy and maintain templates, and they get feedback. We’re going to expose those things incrementally.
    Swyx [01:11:51]: Clustering around incidents. Everyone has a version of that, but I don’t think anyone has solved it.
    Jake [01:11:56]: I won’t say we’ve solved it internally, but it’s gotten so good that we can see incidents forming pretty quickly. At some point, those will be things either someone else builds or we build. We’ve always built things purpose-built for us. If it makes sense to make it useful for users, monetize it, or turn that loop into a profit center instead of a cost center, we want to do that.
    Jake [01:12:28]: Pull request is definitely dying.
    Swyx [01:12:29]: Do you do first-party feature flagging and incremental rollout stuff?
    Jake [01:12:34]: We have a feature-flagging engine we built internally and will eventually roll out.
    Swyx [01:12:38]: I don’t see it as a user. How come you didn’t give us what you have?
    Jake [01:12:43]: We have to beta test it. We care a lot about the quality of the things. There’s plenty we’ve used internally that doesn’t make it all the way through the journey because it fails. It works for one service but not multiple services. We’d have to build it for multiple services and know that if we released it, we’d rebuild it again and again. Some things are worth that, but many inform the roadmap.
    Jake [01:13:18]: We don’t want to dilute the experience by saying, “This works, but only for this service,” unless it’s a core initiative. Over the next few months, we’ll roll out things that work for a single service, then multiple services, then multiple services across the environment. You have to be deliberate. Otherwise you create broken disparate experiences and support load because people ask how to use the feature.
    Jake [01:13:52]: It’s the earlier expansion and compaction pattern. You expand the company to get features, then compact and smooth them out so the experience is stellar. You told me in the hallway, “It’s gotten so much better.” Internally we’re saying, “This part really sucks. We need to make it significantly better.”
    Swyx [01:14:11]: I can attest to that over the last three years watching you build Railway. For listeners, feature flagging is a huge part of Uber culture. So much so that they have too many feature flags and another thing to remove feature flags. Facebook has Gatekeeper. Agents are going to need this. It’s fundamental to incremental rollouts. OpenAI acquired Statsig. GPT-5 is routing and flagging through different models.
    Jake [01:14:56]: It’s super important. If the software development lifecycle is going to change because we’re doing things 1,000 times faster and 1,000 times more concurrently, what becomes important at scale?
    Jake [01:15:16]: Before I started Railway, I built a feature-flagging product and tried to sell it. It was an easier version of LaunchDarkly. I ran into a problem: anyone small enough to adopt your technology doesn’t care about feature flags, and anyone large enough to need feature flags needs so much scale that you have to build out all the infrastructure. I scrapped it.
    Jake [01:15:42]: But what is old is new again. Companies are trying to move quickly, but you can’t YOLO a vibe-coded thing straight into production. You need to say, “Here’s my blast radius, my impact, and I want to shadow it for these users.” Feature flags. You’re going to need the tools larger companies built to maintain their structures. Everything gets compressed by 1,000x so everybody can build those structures quickly.
    Jake [01:16:07]: That’s exactly where we are: compressing the software development lifecycle, then expanding it and adding more new things.
    Cattle, Pets, and Clonable Infrastructure
    Swyx [01:16:15]: Another term that comes to mind for newer developers is “cattle, not pets.” People treat production like a pet. It has a name. You baby it and keep it alive. With cattle, you can mass farm, roll out, portion parts out, and kill them.
    Jake [01:16:37]: I think that might change. You can move toward having pets as long as you have a cloning machine for your pets.
    Swyx [01:16:52]: Yeah.
    Jake [01:16:52]: If you can snapshot every single thing at every frame, it doesn’t matter if something gets obliterated because you have a snapshot of it. The things we’ve built right now are designed to block changes from the hermetically sealed DevOps line. You have to write a Dockerfile because you need a specific cut of the file system.
    Jake [01:17:14]: What if you had the whole file system? What if you snapshot it and lazily load the entire file system? Then you get around this problem entirely. You don’t need the ceremony of Dockerfiles, Ansible scripts, or other things. You can iterate, snapshot, ask if it’s the right loop or state, and then merge it into production. Merge the file system.
    Swyx [01:17:45]: Why not?
    Jake [01:17:46]: It’s going to be fun.
    Swyx [01:17:47]: This is a whole other can of worms, but if you cataloged the stateful things in a VM and developed dedicated solutions for each, you can cut the problem down a lot. It’s surprising people weren’t trying until now.
    Jake [01:18:04]: It has always been surprising to me because these are the things we would work on. It’s obvious.
    Swyx [01:18:11]: At first principles, you need them. Everyone needs them in theory. Then the big clouds don’t do them, so you assume it’s impossible.
    Jake [01:18:18]: Exactly. You think, “Meta has all the people writing eBPF code, and they’re doing something with them.” But you need that kind of work to solve these problems. Whatever is required, however deep we have to go, we’ll go all the way down to the kernel’s TCP/IP stack if needed. If we need to modify something to make it work for the mental model of the universe moving forward, we’ll do it and keep going down.
    Swyx [01:18:52]: That sounds fun.
    Jake [01:18:53]: It’s so much fun. I have to peel myself away from fun, interesting problems to make sure we can scale the company in a way that works. There are so many fun problems: getting information from customers to support to the person who built the thing internally, safe iteration, context from the dashboard to users, drilling down to the infrastructure layer, and managing orchestration as a real-time operating system versus a feedback control system. It’s just so fun.
    Solo Founder Lessons: Obsession, Writing, and Focus
    Swyx [01:19:29]: Speaking of the founder side, you’re famously outside the YC/SF consensus. You go to YC, get a co-founder, and do all these things. You did none of that.
    Jake [01:19:40]: None.
    Swyx [01:19:45]: In the elevator you said a co-founder makes sense if one person is the tech person and the other is the biz dev person. But you have to contain those multitudes yourself. How do you do it?
    Jake [01:19:58]: I try to get eight hours of sleep.
    Swyx [01:20:11]: Is there a balance: 50/50, 30/30/30? What’s the mental model as a solo founder?
    Jake [01:20:17]: There’s no balance. You have to think about all these things and be obsessed with them. Be obsessed with how people think about your product from a go-to-market perspective, and be obsessed with the kernel-level change that makes a user’s SSH connection never drop. I want a universe where you can snapshot everything and it feels like iterating on a VM.
    Jake [01:20:47]: You have to be obsessed at every layer of the stack. That’s what makes it easier for me. Some people are obsessed with different portions of the company journey, and if you can segment those lines well and be clear about ownership, you’ll have a good time.
    Jake [01:21:12]: I said two is the worst number of co-founders because you have no tiebreak. You disagree, and how do you resolve it?
    Swyx [01:21:38]: Usually someone is CEO, so they have the tiebreaker.
    Jake [01:21:43]: Totally. It’s hard every way you cut it. It’s hard if you get help, and it’s hard if you do it yourself. Running things is hard, but it’s so rewarding and fun.
    Swyx [01:21:56]: What have you found useful? A coach? Any advice that has been helpful?
    Jake [01:22:01]: I like to write a lot. I get in trouble a lot for my Twitter. I once said if you’re working weekends, you’re messing up your planning. I’ve gone back and forth on that because right now we’re at an extenuating time where it makes sense to work more. The goals are clear in my mind. If you have the vision and know where you’re going, work harder to distill that vision and do those things.
    Jake [01:22:33]: If you’re not certain and need clarity, disconnect and take your weekends seriously. Write about where you are, what you want to do, where you want to go, and what problems you’re solving.
    Jake [01:22:56]: Writing is important. I don’t love the word meditation, but whatever gets you into mental clarity is important when you’re trying to say, “We’re here and need to be here,” or “We’re here and I think we need to be in this general space for this to work.”
    Jake [01:23:22]: Disconnect, hang out with people you love, and work hard when you’re working. I try to work sunup to sundown, Monday to Friday, all out. I disconnect on Saturday and come back Sunday afternoon to write, plan the week, and do everything else. It works well for me.
    Jake [01:23:43]: Another hot take: most advice should be digested and thrown out the window. If it’s helpful, it’ll come back. You’ll learn it through experience. We have made failure very expensive as a society, and it makes it difficult for people to walk off the paths.
    GPUs, Focus, and the Dominant Role of Agents
    Swyx [01:24:03]: Anything you haven’t tweeted and gotten in trouble with that you want to preview to the world?
    Jake [01:24:12]: The agent stuff is crazy. It’s going to be the dominant way people do pretty much everything, provided we can get the inference required for that to happen. Over the next 10 years, you’ll see a fundamental shift in how people think about authoring the logic in their head.
    Swyx [01:24:36]: One way of phrasing it is: if Allbirds can become a GPU provider, so can Railway.
    Jake [01:24:44]: I think there’s a lot of “everyone becomes a GPU provider” that is actually not becoming a GPU provider. You’re defined more by the things you don’t do than the things you do, because it’s easy to say yes to a lot of things.
    Jake [01:24:56]: Anthropic is amazing and moving into different zones. They’re moving into Figma-like things.
    Swyx [01:25:09]: As we’re recording, Mike Krieger was on Figma’s board, they removed him Monday, and then they launched this today.
    Jake [01:25:18]: Things move fast right now. But agents are going to be the way people operate.
    Swyx [01:25:25]: So your answer is focus: no GPUs for now, but never say never.
    Jake [01:25:27]: Focus. We will not do GPUs now, but we 100% will do GPUs at some point in the future. That’s not me leaking our roadmap because we don’t have plans to do GPUs. It’s just a function of needing FLOPS at some point. If you’re fully vertically integrated and want to make it trivial for people to iterate, build, and deploy, you need access to this core piece of fundamental logic.
    A New Cloud From First Principles
    Swyx [01:25:57]: Presumably your own data center traffic is a minority of your workload right now, but is there a point where it’s a majority or you turn off public clouds?
    Jake [01:26:10]: At some point, we got to 100% data center: our own data centers. Right now, the vast majority of what exists on our platform is on our bare-metal data centers.
    Swyx [01:26:21]: So you’re already there.
    Jake [01:26:23]: Yeah. The transition was completed at some point, and then we grew so fast that we had to scale back on that. It got to 100% on the Datadog dashboard and then divoted back into the 90s because we were adding capacity.
    Swyx [01:26:45]: You’re literally building a new independent cloud, and people assume that could never happen post-AWS.
    Jake [01:26:53]: It’s hard. We’re going to figure out a bunch of things to make sure the platform is deeply reliable. But you have to break ground on new things when you decide to build a cloud from scratch but not copy the hyperscalers.
    Jake [01:27:10]: We’ve been deliberate about inventing our own infrastructure from scratch based on reading a ton of papers, while promising ourselves we wouldn’t copy someone else’s homework. If we copy someone else, we lose. You become them over time. You need a core thesis for why this business needs to exist now.
    Jake [01:27:33]: For us, the activation energy required to deploy something in production on hyperscalers is far too high. We believe it should be instantaneous. There should be no friction between your thought and the reality that comes out and that you can share with friends. That’s what we’re building toward at every layer of the stack. If we have to go down to energy, we’ll go down to energy.
    Jake [01:27:58]: It matters for giving people access to this tooling. It’s gated not just for citizen developers who are now vibe coding. You have multiple layers: citizen developer, front-end developer, back-end developer, DevOps person, and more. Those layers need to disappear so people can just ship.
    Swyx [01:28:20]: Amazing. That’s the future of cloud.
    Jake [01:28:22]: Awesome. Thanks for coming on. Thank you for having me. It’s been wonderful.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
More Business podcasts
About Latent Space: The AI Engineer Podcast
The podcast by and for AI Engineers! In 2025, over 10 million readers and listeners came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you both the definitive take on the Current Thing down to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, Anthropic, Gemini, Meta (Soumith Chintala), Sierra (Bret Taylor), tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space www.latent.space
Podcast website

Listen to Latent Space: The AI Engineer Podcast, The Prof G Pod with Scott Galloway and many other podcasts from around the world with the radio.net app

Get the free radio.net app

  • Stations and podcasts to bookmark
  • Stream via Wi-Fi or Bluetooth
  • Supports Carplay & Android Auto
  • Many other app features