Powered by RND
PodcastsTechnologyLatent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0
Listen to Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0 in the App
Listen to Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0 in the App
(398)(247,963)
Save favourites
Alarm
Sleep timer

Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

Podcast Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0
Alessio + swyx
The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover ...
More

Available Episodes

5 of 98
  • Bolt.new, Flow Engineering for Code Agents, and >$8m ARR in 2 months as a Claude Wrapper
    The full schedule for Latent Space LIVE! at NeurIPS has been announced, featuring Best of 2024 overview talks for the AI Startup Landscape, Computer Vision, Open Models, Transformers Killers, Synthetic Data, Agents, and Scaling, and speakers from Sarah Guo of Conviction, Roboflow, AI2/Meta, Recursal/Together, HuggingFace, OpenHands and SemiAnalysis. Join us for the IRL event/Livestream! Alessio will also be holding a meetup at AWS Re:Invent in Las Vegas this Wednesday. See our new Events page for dates of AI Engineer Summit, Singapore, and World’s Fair in 2025. LAST CALL for questions for our big 2024 recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!When we first observed that GPT Wrappers are Good, Actually, we did not even have Bolt on our radar. Since we recorded our Anthropic episode discussing building Agents with the new Claude 3.5 Sonnet, Bolt.new (by Stackblitz) has easily cleared the $8m ARR bar, repeating and accelerating its initial $4m feat.There are very many AI code generators and VS Code forks out there, but Bolt probably broke through initially because of its incredible zero shot low effort app generation:But as we explain in the pod, Bolt also emphasized deploy (Netlify)/ backend (Supabase)/ fullstack capabilities on top of Stackblitz’s existing WebContainer full-WASM-powered-developer-environment-in-the-browser tech. Since then, the team has been shipping like mad (with weekly office hours), with bugfixing, full screen, multi-device, long context, diff based edits (using speculative decoding like we covered in Inference, Fast and Slow).All of this has captured the imagination of low/no code builders like Greg Isenberg and many others on YouTube/TikTok/Reddit/X/Linkedin etc:Just as with Fireworks, our relationship with Bolt/Stackblitz goes a bit deeper than normal - swyx advised the launch and got a front row seat to this epic journey, as well as demoed it with Realtime Voice at the recent OpenAI Dev Day. So we are very proud to be the first/closest to tell the full open story of Bolt/Stackblitz!Flow Engineering + Qodo/AlphaCodium UpdateIn year 2 of the pod we have been on a roll getting former guests to return as guest cohosts (Harrison Chase, Aman Sanger, Jon Frankle), and it was a pleasure to catch Itamar Friedman back on the pod, giving us an update on all things Qodo and Testing Agents from our last catchup a year and a half ago:Qodo (they renamed in September) went viral in early January this year with AlphaCodium (paper here, code here) beating DeepMind’s AlphaCode with high efficiency:With a simple problem solving code agent:* The first step is to have the model reason about the problem. They describe it using bullet points and focus on the goal, inputs, outputs, rules, constraints, and any other relevant details.* Then, they make the model reason about the public tests and come up with an explanation of why the input leads to that particular output. * The model generates two to three potential solutions in text and ranks them in terms of correctness, simplicity, and robustness. * Then, it generates more diverse tests for the problem, covering cases not part of the original public tests. * Iteratively, pick a solution, generate the code, and run it on a few test cases. * If the tests fail, improve the code and repeat the process until the code passes every test.swyx has previously written similar thoughts on types vs tests for putting bounds on program behavior, but AlphaCodium extends this to AI generated tests and code.More recently, Itamar has also shown that AlphaCodium’s techniques also extend well to the o1 models:Making Flow Engineering a useful technique to improve code model performance on every model. This is something we see AI Engineers uniquely well positioned to do compared to ML Engineers/Researchers.Full Video PodcastLike and subscribe!Show Notes* Itamar* Qodo* First episode* Eric* Bolt* StackBlitz* Thinkster* AlphaCodium* WebContainersChapters* 00:00:00 Introductions & Updates* 00:06:01 Generic vs. Specific AI Agents* 00:07:40 Maintaining vs Creating with AI* 00:17:46 Human vs Agent Computer Interfaces* 00:20:15 Why Docker doesn't work for Bolt* 00:24:23 Creating Testing and Code Review Loops* 00:28:07 Bolt's Task Breakdown Flow* 00:31:04 AI in Complex Enterprise Environments* 00:41:43 AlphaCodium* 00:44:39 Strategies for Breaking Down Complex Tasks* 00:45:22 Building in Open Source* 00:50:35 Choosing a product as a founder* 00:59:03 Reflections on Bolt Success* 01:06:07 Building a B2C GTM* 01:18:11 AI Capabilities and Pricing Tiers* 01:20:28 What makes Bolt unique* 01:23:07 Future Growth and Product Development* 01:29:06 Competitive Landscape in AI Engineering* 01:30:01 Advice to Founders and Embracing AI* 01:32:20 Having a baby and completing an Iron ManTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:12]: Hey, and today we're still in our sort of makeshift in-between studio, but we're very delighted to have a former returning guest host, Itamar. Welcome back.Itamar [00:00:21]: Great to be here after a year or more. Yeah, a year and a half.Swyx [00:00:24]: You're one of our earliest guests on Agents. Now you're CEO co-founder of Kodo. Right. Which has just been renamed. You also raised a $40 million Series A, and we can get caught up on everything, but we're also delighted to have our new guest, Eric. Welcome.Eric [00:00:42]: Thank you. Excited to be here. Should I say Bolt or StackBlitz?Swyx [00:00:45]: Like, is it like its own company now or?Eric [00:00:47]: Yeah. Bolt's definitely bolt.new. That's the thing that we're probably the most known for, I imagine, at this point.Swyx [00:00:54]: Which is ridiculous to say because you were working at StackBlitz for so long.Eric [00:00:57]: Yeah. I mean, within a week, we were doing like double the amount of traffic. And StackBlitz had been online for seven years, and we were like, what? But anyways, yeah. So we're StackBlitz, the company behind bolt.new. If you've heard of bolt.new, that's our stuff. Yeah.Swyx [00:01:12]: Yeah.Itamar [00:01:13]: Excellent. I see, by the way, that the founder mode, you need to know to capture opportunities. So kudos on doing that, right? You're working on some technology, and then suddenly you can exploit that to a new world. Yeah.Eric [00:01:24]: Totally. And I think, well, not to jump, but 100%, I mean, a couple of months ago, we had the idea for Bolt earlier this year, but we haven't really shared this too much publicly. But we actually had tried to build it with some of those state-of-the-art models back in January, February, you can kind of imagine which, and they just weren't good enough to actually do the code generation where the code was accurate and it was fast and whatever have you without a ton of like rag, but then there was like issues with that. So we put it on the shelf and then we got kind of a sneak peek of some of the new models that have come out in the past couple of months now. And so once we saw that, once we actually saw the code gen from it, we were like, oh my God, like, okay, we can build a product around this. And so that was really the impetus of us building the thing. But with that, it was StackBlitz, the core StackBlitz product the past seven years has been an IDE for developers. So the entire user experience flow we've built up just didn't make sense. And so when we kind of went out to build Bolt, we just thought, you know, if we were inventing our product today, what would the interface look like given what is now possible with the AI code gen? And so there's definitely a lot of conversations we had internally, but you know, just kind of when we logically laid it out, we were like, yeah, I think it makes sense to just greenfield a new thing and let's see what happens. If it works great, then we'll figure it out. If it doesn't work great, then it'll get deleted at some point. So that's kind of how it actually came to be.Swyx [00:02:49]: I'll mention your background a little bit. You were also founder of Thinkster before you started StackBlitz. So both of you are second time founders. Both of you have sort of re-founded your company recently. Yours was more of a rename. I think a slightly different direction as well. And then we can talk about both. Maybe just chronologically, should we get caught up on where Kodo is first and then you know, just like what people should know since the last pod? Sure.Itamar [00:03:12]: The last pod was two months after we launched and we basically had the vision that we talked about. The idea that software development is about specification, test and code, etc. We are more on the testing part as in essence, we think that if you solve testing, you solve software development. The beautiful chart that we'll put up on screen. And testing is a really big field, like there are many dimensions, unit testing, the level of the component, how big it is, how large it is. And then there is like different type of testing, is it regression or smoke or whatever. So back then we only had like one ID extension with unit tests as in focus. One and a half year later, first ID extension supports more type of testing as context aware. We index local, local repos, but also 10,000s of repos for Fortune 500 companies. We have another agent, another tool that is called, the pure agent is the open source and the commercial one is CodoMerge. And then we have another open source called CoverAgent, which is not yet a commercial product coming very soon. It's very impressive. It could be that already people are approving automated pull requests that they don't even aware in really big open sources. So once we have enough of these, we will also launch another agent. So for the first one and a half year, what we did is grew in our offering and mostly on the side of, does this code actually works, testing, code review, et cetera. And we believe that's the critical milestone that needs to be achieved to actually have the AI engineer for enterprise software. And then like for the first year was everything bottom up, getting to 1 million installation. 2024, that was 2023, 2024 was starting to monetize, to feel like how it is to make the first buck. So we did the teams offering, it went well with a thousand of teams, et cetera. And then we started like just a few months ago to do enterprise with everything you need, which is a lot of things that discussed in the last post that was just released by Codelm. So that's how we call it at Codelm. Just opening the brackets, our company name was Codelm AI, and we renamed to Codo and we call our models Codelm. So back to my point, so we started Enterprise Motion and already have multiple Fortune 100 companies. And then with that, we raised a series of $40 million. And what's exciting about it is that enables us to develop more agents. That's our focus. I think it's very different. We're not coming very soon with an ID or something like that.Swyx [00:06:01]: You don't want to fork this code?Itamar [00:06:03]: Maybe we'll fork JetBrains or something just to be different.Swyx [00:06:08]: I noticed that, you know, I think the promise of general purpose agents has kind of died. Like everyone is doing kind of what you're doing. There's Codogen, Codomerge, and then there's a third one. What's the name of it?Itamar [00:06:17]: Yeah. Codocover. Cover. Which is like a commercial version of a cover agent. It's coming soon.Swyx [00:06:23]: Yeah. It's very similar with factory AI, also doing like droids. They all have special purpose doing things, but people don't really want general purpose agents. Right. The last time you were here, we talked about AutoGBT, the biggest thing of 2023. This year, not really relevant anymore. And I think it's mostly just because when you give me a general purpose agent, I don't know what to do with it.Eric [00:06:42]: Yeah.Itamar [00:06:43]: I totally agree with that. We're seeing it for a while and I think it will stay like that despite the computer use, et cetera, that supposedly can just replace us. You can just like prompt it to be, hey, now be a QA or be a QA person or a developer. I still think that there's a few reasons why you see like a dedicated agent. Again, I'm a bit more focused, like my head is more on complex software for big teams and enterprise, et cetera. And even think about permissions and what are the data sources and just the same way you manage permissions for users. Developers, you probably want to have dedicated guardrails and dedicated approvals for agents. I intentionally like touched a point on not many people think about. And of course, then what you can think of, like maybe there's different tools, tool use, et cetera. But just the first point by itself is a good reason why you want to have different agents.Alessio [00:07:40]: Just to compare that with Bot.new, you're almost focused on like the application is very complex and now you need better tools to kind of manage it and build on top of it. On Bot.new, it's almost like I was using it the other day. There's basically like, hey, look, I'm just trying to get started. You know, I'm not very opinionated on like how you're going to implement this. Like this is what I want to do. And you build a beautiful app with it. What people ask as the next step, you know, going back to like the general versus like specific, have you had people say, hey, you know, this is great to start, but then I want a specific Bot.new dot whatever else to do a more vertical integration and kind of like development or what's the, what do people say?Eric [00:08:18]: Yeah. I think, I think you kind of hit the, hit it head on, which is, you know, kind of the way that we've, we've kind of talked about internally is it's like people are using Bolt to go from like 0.0 to 1.0, like that's like kind of the biggest unlock that Bolt has versus most other things out there. I mean, I think that's kind of what's, what's very unique about Bolt. I think the, you know, the working on like existing enterprise applications is, I mean, it's crazy important because, you know, there's a, you look, when you look at the fortune 500, I mean, these code bases, some of these have been around for 20, 30 plus years. And so it's important to be going from, you know, 101.3 to 101.4, et cetera. I think for us, so what's been actually pretty interesting is we see there's kind of two different users for us that are coming in and it's very distinct. It's like people that are developers already. And then there's people that have never really written software and more if they have, it's been very, very minimal. And so in the first camp, what these developers are doing, like to go from zero to one, they're coming to Bolt and then they're ejecting the thing to get up or just downloading it and, you know, opening cursor, like whatever to, to, you know, keep iterating on the thing. And sometimes they'll bring it back to Bolt to like add in a huge piece of functionality or something. Right. But for the people that don't know how to code, they're actually just, they, they live in this thing. And that was one of the weird things when we launched is, you know, within a day of us being online, one of the most popular YouTube videos, and there's been a ton since, which was, you know, there's like, oh, Bolt is the cursor killer. And I originally saw the headlines and I was like, thanks for the views. I mean, I don't know. This doesn't make sense to me. That's not, that's not what we kind of thought.Swyx [00:09:44]: It's how YouTubers talk to each other. Well, everything kills everything else.Eric [00:09:47]: Totally. But what blew my mind was that there was any comparison because it's like cursor is a, is a local IDE product. But when, when we actually kind of dug into it and we, and we have people that are using our product saying this, I'm not using cursor. And I was like, what? And it turns out there are hundreds of thousands of people that we have seen that we're using cursor and we're trying to build apps with that where they're not traditional software does, but we're heavily leaning on the AI. And as you can imagine, it is very complicated, right? To do that with cursor. So when Bolt came out, they're like, wow, this thing's amazing because it kind of inverts the complexity where it's like, you know, it's not an IDE, it's, it's a, it's a chat-based sort of interface that we have. So that's kind of the split, which is rather interesting. We've had like the first startups now launch off of Bolt entirely where this, you know, tomorrow I'm doing a live stream with this guy named Paul, who he's built an entire CRM using this thing and you know, with backend, et cetera. And people have made their first money on the internet period, you know, launching this with Stripe or whatever have you. So that's, that's kind of the two main, the two main categories of folks that we see using Bolt though.Itamar [00:10:51]: I agree that I don't understand the comparison. It doesn't make sense to me. I think like we have like two type of families of tools. One is like we re-imagine the software development. I think Bolt is there and I think like a cursor is more like a evolution of what we already have. It's like taking the IDE and it's, it's amazing and it's okay, let's, let's adapt the IDE to an era where LLMs can do a lot for us. And Bolt is more like, okay, let's rethink everything totally. And I think we see a few tools there, like maybe Vercel, Veo and maybe Repl.it in that area. And then in the area of let's expedite, let's change, let's, let's progress with what we already have. You can see Cursor and Kodo, but we're different between ourselves, Cursor and Kodo, but definitely I think that comparison doesn't make sense.Alessio [00:11:42]: And just to set the context, this is not a Twitter demo. You've made 4 million of revenue in four weeks. So this is, this is actually working, you know, it's not a, what, what do you think that is? Like, there's been so many people demoing coding agents on Twitter and then it doesn't really work. And then you guys were just like, here you go, it's live, go use it, pay us for it. You know, is there anything in the development that was like interesting and maybe how that compares to building your own agents?Eric [00:12:08]: We had no idea, honestly, like we, we, we've been pretty blown away and, and things have just kind of continued to grow faster since then. We're like, oh, today is week six. So I, I kind of came back to the point you just made, right, where it's, you, you kind of outlined, it's like, there's kind of this new market of like kind of rethinking the software development and then there's heavily augmenting existing developers. I think that, you know, both of which are, you know, AI code gen being extremely good, it's allowed existing developers, it's allowing existing developers to camera out software far faster than they could have ever before, right? It's like the ultimate power tool for an existing developer. But this code gen stuff is now so good. And then, and we saw this over the past, you know, from the beginning of the year when we tried to first build, it's actually lowered the barrier to people that, that aren't traditionally software engineers. But the kind of the key thing is if you kind of think about it from, imagine you've never written software before, right? My co-founder and I, he and I grew up down the street from each other in Chicago. We learned how to code when we were 13 together and we've been building stuff ever since. And this is back in like the mid 2000s or whatever, you know, there was nothing for free to learn from online on the internet and how to code. For our 13th birthdays, we asked our parents for, you know, O'Reilly books cause you couldn't get this at the library, right? And so instead of like an Xbox, we got, you know, programming books. But the hardest part for everyone learning to code is getting an environment set up locally, you know? And so when we built StackBlitz, like kind of the key thesis, like seven years ago, the insight we had was that, Hey, it seems like the browser has a lot of new APIs like WebAssembly and service workers, et cetera, where you could actually write an operating system that ran inside the browser that could boot in milliseconds. And you, you know, basically there's this missing capability of the web. Like the web should be able to build apps for the web, right? You should be able to build the web on the web. Every other platform has that, Visual Studio for Windows, Xcode for Mac. The web has no built in primitive for this. And so just like our built in kind of like nerd instinct on this was like, that seems like a huge hole and it's, you know, it will be very valuable or like, you know, very valuable problem to solve. So if you want to set up that environments, you know, this is what we spent the past seven years doing. And the reality is existing developers have running locally. They already know how to set up that environment. So the problem isn't as acute for them. When we put Bolt online, we took that technology called WebContainer and married it with these, you know, state of the art frontier models. And the people that have the most pain with getting stuff set up locally is people that don't code. I think that's been, you know, really the big explosive reason is no one else has been trying to make dev environments work inside of a browser tab, you know, for the past if since ever, other than basically our company, largely because there wasn't an immediate demand or need. So I think we kind of find ourselves at the right place at the right time. And again, for this market of people that don't know how to write software, you would kind of expect that you should be able to do this without downloading something to your computer in the same way that, hey, I don't have to download Photoshop now to make designs because there's Figma. I don't have to download Word because there's, you know, Google Docs. They're kind of looking at this as that sort of thing, right? Which was kind of the, you know, our impetus and kind of vision from the get-go. But you know, the code gen, the AI code gen stuff that's come out has just been, you know, an order of magnitude multiplier on how magic that is, right? So that's kind of my best distillation of like, what is going on here, you know?Alessio [00:15:21]: And you can deploy too, right?Eric [00:15:22]: Yeah.Alessio [00:15:23]: Yeah.Eric [00:15:24]: And so that's, what's really cool is it's, you know, we have deployment built in with Netlify and this is actually, I think, Sean, you actually built this at Netlify when you were there. Yeah. It's one of the most brilliant integrations actually, because, you know, effectively the API that Sean built, maybe you can speak to it, but like as a provider, we can just effectively give files to Netlify without the user even logging in and they have a live website. And if they want to keep, hold onto it, they can click a link and claim it to their Netlify account. But it basically is just this really magic experience because when you come to Bolt, you say, I want a website. Like my mom, 70, 71 years old, made her first website, you know, on the internet two weeks ago, right? It was about her nursing days.Swyx [00:16:03]: Oh, that's fantastic though. It wouldn't have been made.Eric [00:16:06]: A hundred percent. Cause even in, you know, when we've had a lot of people building personal, like deeply personal stuff, like in the first week we launched this, the sales guy from the East Coast, you know, replied to a tweet of mine and he said, thank you so much for building this to your team. His daughter has a medical condition and so for her to travel, she has to like line up donors or something, you know, so ahead of time. And so he actually used Bolt to make a website to do that, to actually go and send it to folks in the region she was going to travel to ahead of time. I was really touched by it, but I also thought like, why, you know, why didn't he use like Wix or Squarespace? Right? I mean, this is, this is a solved problem, quote unquote, right? And then when I thought, I actually use Squarespace for my, for my, uh, the wedding website for my wife and I, like back in 2021, so I'm familiar, you know, it was, it was faster. I know how to code. I was like, this is faster. Right. And I thought back and I was like, there's a whole interface you have to learn how to use. And it's actually not that simple. There's like a million things you can configure in that thing. When you come to Bolt, there's a, there's a text box. You just say, I need a, I need a wedding website. Here's the date. Here's where it is. And here's a photo of me and my wife, put it somewhere relevant. It's actually the simplest way. And that's what my, when my mom came, she said, uh, I'm Pat Simons. I was a nurse in the seventies, you know, and like, here's the things I did and a website came out. So coming back to why is this such a, I think, why are we seeing this sort of growth? It's, this is the simplest interface I think maybe ever created to actually build it, a deploy a website. And then that website, my mom made, she's like, okay, this looks great. And there's, there's one button, you just click it, deploy, and it's live and you can buy a domain name, attach it to it. And you know, it's as simple as it gets, it's getting even simpler with some of the stuff we're working on. But anyways, so that's, it's, it's, uh, it's been really interesting to see some of the usage like that.Swyx [00:17:46]: I can offer my perspective. So I, you know, I probably should have disclosed a little bit that, uh, I'm a, uh, stack list investor.Alessio [00:17:53]: Canceled the episode. I know, I know. Don't play it now. Pause.Eric actually reached out to ShowMeBolt before the launch. And we, you know, we talked a lot about, like, the framing of, of what we're going to talk about how we marketed the thing, but also, like, what we're So that's what Bolt was going to need, like a whole sort of infrastructure.swyx: Netlify, I was a maintainer but I won't take claim for the anonymous upload. That's actually the origin story of Netlify. We can have Matt Billman talk about it, but that was [00:18:00] how Netlify started. You could drag and drop your zip file or folder from your desktop onto a website, it would have a live URL with no sign in.swyx: And so that was the origin story of Netlify. And it just persists to today. And it's just like it's really nice, interesting that both Bolt and CognitionDevIn and a bunch of other sort of agent type startups, they all use Netlify to deploy because of this one feature. They don't really care about the other features.swyx: But, but just because it's easy for computers to use and talk to it, like if you build an interface for computers specifically, that it's easy for them to Navigate, then they will be used in agents. And I think that's a learning that a lot of developer tools companies are having. That's my bolt launch story and now if I say all that stuff.swyx: And I just wanted to come back to, like, the Webcontainers things, right? Like, I think you put a lot of weight on the technical modes. I think you also are just like, very good at product. So you've, you've like, built a better agent than a lot of people, the rest of us, including myself, who have tried to build these things, and we didn't get as far as you did.swyx: Don't shortchange yourself on products. But I think specifically [00:19:00] on, on infra, on like the sandboxing, like this is a thing that people really want. Alessio has Bax E2B, which we'll have on at some point, talking about like the sort of the server full side. But yours is, you know, inside of the browser, serverless.swyx: It doesn't cost you anything to serve one person versus a million people. It doesn't, doesn't cost you anything. I think that's interesting. I think in theory, we should be able to like run tests because you can run the full backend. Like, you can run Git, you can run Node, you can run maybe Python someday.swyx: We talked about this. But ideally, you should be able to have a fully gentic loop, running code, seeing the errors, correcting code, and just kind of self healing, right? Like, I mean, isn't that the dream?Eric: Totally.swyx: Yeah,Eric: totally. At least in bold, we've got, we've got a good amount of that today. I mean, there's a lot more for us to do, but one of the nice things, because like in web container, you know, there's a lot of kind of stuff you go Google like, you know, turn docker container into wasm.Eric: You'll find a lot of stuff out there that will do that. The problem is it's very big, it's slow, and that ruins the experience. And so what we ended up doing is just writing an operating system from [00:20:00] scratch that was just purpose built to, you know, run in a browser tab. And the reason being is, you know, Docker 2 awesome things will give you an image that's like out 60 to 100 megabits, you know, maybe more, you know, and our, our OS, you know, kind of clocks in, I think, I think we're in like a, maybe, maybe a megabyte or less or something like that.Eric: I mean, it's, it's, you know, really, really, you know, stripped down.swyx: This is basically the task involved is I understand that it's. Mapping every single, single Linux call to some kind of web, web assembly implementation,Eric: but more or less, and, and then there's a lot of things actually, like when you're looking at a dev environment, there's a lot of things that you don't need that a traditional OS is gonna have, right?Eric: Like, you know audio drivers or you like, there's just like, there's just tons of things. Oh, yeah. Right. Yeah. That goes . Yeah. You can just kind, you can, you can kind of tos them. Or alternatively, what you can do is you can actually be the nice thing. And this is, this kind of comes back to the origins of browsers, which is, you know, they're, they're at the beginning of the web and, you know, the late nineties, there was two very different kind of visions for the web where Alan Kay vehemently [00:21:00] disagree with the idea that should be document based, which is, you know, Tim Berners Lee, you know, that, and that's kind of what ended up winning, winning was this document based kind of browsing documents on the web thing.Eric: Alan Kay, he's got this like very famous quote where he said, you know, you want web browsers to be mini operating systems. They should download little mini binaries and execute with like a little mini virtualized operating system in there. And what's kind of interesting about the history, not to geek out on this aspect, what's kind of interesting about the history is both of those folks ended up being right.Eric: Documents were actually the pragmatic way that the web worked. Was, you know, became the most ubiquitous platform in the world to the degree now that this is why WebAssembly has been invented is that we're doing, we need to do more low level things in a browser, same thing with WebGPU, et cetera. And so all these APIs, you know, to build an operating system came to the browser.Eric: And that was actually the realization we had in 2017 was, holy heck, like you can actually, you know, service workers, which were designed for allowing your app to work offline. That was the kind of the key one where it was like, wait a second, you can actually now run. Web servers within a [00:22:00] browser, like you can run a server that you open up.Eric: That's wild. Like full Node. js. Full Node. js. Like that capability. Like, I can have a URL that's programmatically controlled. By a web application itself, boom. Like the web can build the web. The primitive is there. Everyone at the time, like we talked to people that like worked on, you know Chrome and V8 and they were like, uhhhh.Eric: You know, like I don't know. But it's one of those things you just kind of have to go do it to find out. So we spent a couple of years, you know, working on it and yeah. And, and, and got to work in back in 2021 is when we kind of put the first like data of web container online. Butswyx: in partnership with Google, right?swyx: Like Google actually had to help you get over the finish line with stuff.Eric: A hundred percent, because well, you know, over the years of when we were doing the R and D on the thing. Kind of the biggest challenge, the two ways that you can kind of test how powerful and capable a platform are, the two types of applications are one, video games, right, because they're just very compute intensive, a lot of calculations that have to happen, right?Eric: The second one are IDEs, because you're talking about actually virtualizing the actual [00:23:00] runtime environment you are in to actually build apps on top of it, which requires sophisticated capabilities, a lot of access to data. You know, a good amount of compute power, right, to effectively, you know, building app in app sort of thing.Eric: So those, those are the stress tests. So if your platform is missing stuff, those are the things where you find out. Those are, those are the people building games and IDEs. They're the ones filing bugs on operating system level stuff. And for us, browser level stuff.Eric [00:23:47]: yeah, what ended up happening is we were just hammering, you know, the Chromium bug tracker, and they're like, who are these guys? Yeah. And, and they were amazing because I mean, just making Chrome DevTools be able to debug, I mean, it's, it's not, it wasn't originally built right for debugging an operating system, right? They've been phenomenal working with us and just kind of really pushing the limits, but that it's a rising tide that's kind of lifted all boats because now there's a lot of different types of applications that you can debug with Chrome Dev Tools that are running a browser that runs more reliably because just the stress testing that, that we and, you know, games that are coming to the web are kind of pushing as well, but.Itamar [00:24:23]: That's awesome. About the testing, I think like most, let's say coding assistant from different kinds will need this loop of testing. And even I would add code review to some, to some extent that you mentioned. How is testing different from code review? Code review could be, for example, PR review, like a code review that is done at the point of when you want to merge branches. But I would say that code review, for example, checks best practices, maintainability, and so on. It's not just like CI, but more than CI. And testing is like a more like checking functionality, et cetera. So it's different. We call, by the way, all of these together code integrity, but that's a different story. Just to go back to the, to the testing and specifically. Yeah. It's, it's, it's since the first slide. Yeah. We're consistent. So if we go back to the testing, I think like, it's not surprising that for us testing is important and for Bolt it's testing important, but I want to shed some light on a different perspective of it. Like let's think about autonomous driving. Those startups that are doing autonomous driving for highway and autonomous driving for the city. And I think like we saw the autonomous of the highway much faster and reaching to a level, I don't know, four or so much faster than those in the city. Now, in both cases, you need testing and quote unquote testing, you know, verifying validation that you're doing the right thing on the road and you're reading and et cetera. But it's probably like so different in the city that it could be like actually different technology. And I claim that we're seeing something similar here. So when you're building the next Wix, and if I was them, I was like looking at you and being a bit scared. That's what you're disrupting, what you just said. Then basically, I would say that, for example, the UX UI is freaking important. And because you're you're more aiming for the end user. In this case, maybe it's an end user that doesn't know how to develop for developers. It's also important. But let alone those that do not know to develop, they need a slick UI UX. And I think like that's one reason, for example, I think Cursor have like really good technology. I don't know the underlying what's under the hood, but at least what they're saying. But I think also their UX UI is great. It's a lot because they did their own ID. While if you're aiming for the city AI, suddenly like there's a lot of testing and code review technology that it's not necessarily like that important. For example, let's talk about integration tests. Probably like a lot of what you're building involved at the moment is isolated applications. Maybe the vision or the end game is maybe like having one solution for everything. It could be that eventually the highway companies will go into the city and the other way around. But at the beginning, there is a difference. And integration tests are a good example. I guess they're a bit less important. And when you think about enterprise software, they're really important. So to recap, like I think like the idea of looping and verifying your test and verifying your code in different ways, testing or code review, et cetera, seems to be important in the highway AI and the city AI, but in different ways and different like critical for the city, even more and more variety. Actually, I was looking to ask you like what kind of loops you guys are doing. For example, when I'm using Bolt and I'm enjoying it a lot, then I do see like sometimes you're trying to catch the errors and fix them. And also, I noticed that you're breaking down tasks into smaller ones and then et cetera, which is already a common notion for a year ago. But it seems like you're doing it really well. So if you're willing to share anything about it.Eric [00:28:07]: Yeah, yeah. I realized I never actually hit the punchline of what I was saying before. I mentioned the point about us kind of writing an operating system from scratch because what ended up being important about that is that to your point, it's actually a very, like compared to like a, you know, if you're like running cursor on anyone's machine, you kind of don't know what you're dealing with, with the OS you're running on. There could be an error happens. It could be like a million different things, right? There could be some config. There could be, it could be God knows what, right? The thing with WebConnect is because we wrote the entire thing from scratch. It's actually a unified image basically. And we can instrument it at any level that we think is going to be useful, which is exactly what we did when we started building Bolt is we instrumented stuff at like the process level, at the runtime level, you know, et cetera, et cetera, et cetera. Stuff that would just be not impossible to do on local, but to do that in a way that works across any operating system, whatever is, I mean, would just be insanely, you know, insanely difficult to do right and reliably. And that's what you saw when you've used Bolt is that when an error actually will occur, whether it's in the build process or the actual web application itself is failing or anything kind of in between, you can actually capture those errors. And today it's a very primitive way of how we've implemented it largely because the product just didn't exist 90 days ago. So we're like, we got some work ahead of us and we got to hire some more a little bit, but basically we present and we say, Hey, this is, here's kind of the things that went wrong. There's a fix it button and then a ignore button, and then you can just hit fix it. And then we take all that telemetry through our agent, you run it through our agent and say, kind of, here's the state of the application. Here's kind of the errors that we got from Node.js or the browser or whatever, and like dah, dah, dah, dah. And it can take a crack at actually solving it. And it's actually pretty darn good at being able to do that. That's kind of been a, you know, closing the loop and having it be a reliable kind of base has seemed to be a pretty big upgrade over doing stuff locally, just because I think that's a pretty key ingredient of it. And yeah, I think breaking things down into smaller tasks, like that's, that's kind of a key part of our agent. I think like Claude did a really good job with artifacts. I think, you know, us and kind of everyone else has, has kind of taken their approach of like actually breaking out certain tasks in a certain order into, you know, kind of a concrete way. And, and so actually the core of Bolt, I know we actually made open source. So you can actually go and check out like the system prompts and et cetera, and you can run it locally and whatever have you. So anyone that's interested in this stuff, I'd highly recommend taking a look at. There's not a lot of like stuff that's like open source in this realm. It's, that was one of the fun things that we've we thought would be cool to do. And people, people seem to like it. I mean, there's a lot of forks and people adding different models and stuff. So it's been cool to see.Swyx [00:30:41]: Yeah. I'm happy to add, I added real-time voice for my opening day demo and it was really fun to hack with. So thank you for doing that. Yeah. Thank you. I'm going to steal your code.Eric [00:30:52]: Because I want that.Swyx [00:30:52]: It's funny because I built on top of the fork of Bolt.new that already has the multi LLM thing. And so you just told me you're going to merge that in. So then you're going to merge two layers of forks down into this thing. So it'll be fun.Eric [00:31:03]: Heck yeah.Alessio [00:31:04]: Just to touch on like the environment, Itamar, you maybe go into the most complicated environments that even the people that work there don't know how to run. How much of an impact does that have on your performance? Like, you know, it's most of the work you're doing actually figuring out environment and like the libraries, because I'm sure they're using outdated version of languages, they're using outdated libraries, they're using forks that have not been on the public internet before. How much of the work that you're doing is like there versus like at the LLM level?Itamar [00:31:32]: One of the reasons I was asking about, you know, what are the steps to break things down, because it really matters. Like, what's the tech stack? How complicated the software is? It's hard to figure it out when you're dealing with the real world, any environment of enterprise as a city, when I'm like, while maybe sometimes like, I think you do enable like in Bolt, like to install stuff, but it's quite a like controlled environment. And that's a good thing to do, because then you narrow down and it's easier to make things work. So definitely, there are two dimensions, I think, actually spaces. One is the fact just like installing our software without yet like doing anything, making it work, just installing it because we work with enterprise and Fortune 500, etc. Many of them want on prem solution.Swyx [00:32:22]: So you have how many deployment options?Itamar [00:32:24]: Basically, we had, we did a metric metrics, say 96 options, because, you know, they're different dimensions. Like, for example, one dimension, we connect to your code management system to your Git. So are you having like GitHub, GitLab? Subversion? Is it like on cloud or deployed on prem? Just an example. Which model agree to use its APIs or ours? Like we have our Is it TestGPT? Yeah, when we started with TestGPT, it was a huge mistake name. It was cool back then, but I don't think it's a good idea to name a model after someone else's model. Anyway, that's my opinion. So we gotSwyx [00:33:02]: I'm interested in these learnings, like things that you change your mind on.Itamar [00:33:06]: Eventually, when you're building a company, you're building a brand and you want to create your own brand. By the way, when I thought about Bolt.new, I also thought about if it's not a problem, because when I think about Bolt, I do think about like a couple of companies that are already called this way.Swyx [00:33:19]: Curse companies. You could call it Codium just to...Itamar [00:33:24]: Okay, thank you. Touche. Touche.Eric [00:33:27]: Yeah, you got to imagine the board meeting before we launched Bolt, one of our investors, you can imagine they're like, are you sure? Because from the investment side, it's kind of a famous, very notorious Bolt. And they're like, are you sure you want to go with that name? Oh, yeah. Yeah, absolutely.Itamar [00:33:43]: At this point, we have actually four models. There is a model for autocomplete. There's a model for the chat. There is a model dedicated for more for code review. And there is a model that is for code embedding. Actually, you might notice that there isn't a good code embedding model out there. Can you name one? Like dedicated for code?Swyx [00:34:04]: There's code indexing, and then you can do sort of like the hide for code. And then you can embed the descriptions of the code.Itamar [00:34:12]: Yeah, but you do see a lot of type of models that are dedicated for embedding and for different spaces, different fields, etc. And I'm not aware. And I know that if you go to the bedrock, try to find like there's a few code embedding models, but none of them are specialized for code.Swyx [00:34:31]: Is there a benchmark that you would tell us to pay attention to?Itamar [00:34:34]: Yeah, so it's coming. Wait for that. Anyway, we have our models. And just to go back to the 96 option of deployment. So I'm closing the brackets for us. So one is like dimensional, like what Git deployment you have, like what models do you agree to use? Dotter could be like if it's air-gapped completely, or you want VPC, and then you have Azure, GCP, and AWS, which is different. Do you use Kubernetes or do not? Because we want to exploit that. There are companies that do not do that, etc. I guess you know what I mean. So that's one thing. And considering that we are dealing with one of all four enterprises, we needed to deal with that. So you asked me about how complicated it is to solve that complex code. I said, it's just a deployment part. And then now to the software, we see a lot of different challenges. For example, some companies, they did actually a good job to build a lot of microservices. Let's not get to if it's good or not, but let's first assume that it is a good thing. A lot of microservices, each one of them has their own repo. And now you have tens of thousands of repos. And you as a developer want to develop something. And I remember me coming to a corporate for the first time. I don't know where to look at, like where to find things. So just doing a good indexing for that is like a challenge. And moreover, the regular indexing, the one that you can find, we wrote a few blogs on that. By the way, we also have some open source, different than yours, but actually three and growing. Then it doesn't work. You need to let the tech leads and the companies influence your indexing. For example, Mark with different repos with different colors. This is a high quality repo. This is a lower quality repo. This is a repo that we want to deprecate. This is a repo we want to grow, etc. And let that be part of your indexing. And only then things actually work for enterprise and they don't get to a fatigue of, oh, this is awesome. Oh, but I'm starting, it's annoying me. I think Copilot is an amazing tool, but I'm quoting others, meaning GitHub Copilot, that they see not so good retention of GitHub Copilot and enterprise. Ooh, spicy. Yeah. I saw snapshots of people and we have customers that are Copilot users as well. And also I saw research, some of them is public by the way, between 38 to 50% retention for users using Copilot and enterprise. So it's not so good. By the way, I don't think it's that bad, but it's not so good. So I think that's a reason because, yeah, it helps you auto-complete, but then, and especially if you're working on your repo alone, but if it's need that context of remote repos that you're code-based, that's hard. So to make things work, there's a lot of work on that, like giving the controllability for the tech leads, for the developer platform or developer experience department in the organization to influence how things are working. A short example, because if you have like really old legacy code, probably some of it is not so good anymore. If you just fine tune on these code base, then there is a bias to repeat those mistakes or old practices, etc. So you need, for example, as I mentioned, to influence that. For example, in Coda, you can have a markdown of best practices by the tech leads and Coda will include that and relate to that and will not offer suggestions that are not according to the best practices, just as an example. So that's just a short list of things that you need to do in order to deal with, like you mentioned, the 100.1 to 100.2 version of software. I just want to say what you're doing is extremelyEric [00:38:32]: impressive because it's very difficult. I mean, the business of Stackplus, kind of before bulk came online, we sold a version of our IDE that went on-prem. So I understand what you're saying about the difficulty of getting stuff just working on-prem. Holy heck. I mean, that is extremely hard. I guess the question I have for you is, I mean, we were just doing that with kind of Kubernetes-based stuff, but the spread of Fortune 500 companies that you're working with, how are they doing the inference for this? Are you kind of plugging into Azure's OpenAI stuff and AWS's Bedrock, you know, Cloud stuff? Or are they just like running stuff on GPUs? Like, what is that? How are these folks approaching that? Because, man, what we saw on the enterprise side, I mean, I got to imagine that that's a huge challenge. Everything you said and more, like,Itamar [00:39:15]: for example, like someone could be, and I don't think any of these is bad. Like, they made their decision. Like, for example, some people, they're, I want only AWS and VPC on AWS, no matter what. And then they, some of them, like there is a subset, I will say, I'm willing to take models only for from Bedrock and not ours. And we have a problem because there is no good code embedding model on Bedrock. And that's part of what we're doing now with AWS to solve that. We solve it in a different way. But if you are willing to run on AWS VPC, but run your run models on GPUs or inferentia, like the new version of the more coming out, then our models can run on that. But everything you said is right. Like, we see like on-prem deployment where they have their own GPUs. We see Azure where you're using OpenAI Azure. We see cases where you're running on GCP and they want OpenAI. Like this cross, like a case, although there is Gemini or even Sonnet, I think is available on GCP, just an example. So all the options, that's part of the challenge. I admit that we thought about it, but it was even more complicated. And it took us a few months to actually, that metrics that I mentioned, to start clicking each one of the blocks there. A few months is impressive. I mean,Eric [00:40:35]: honestly, just that's okay. Every one of these enterprises is, their networking is different. Just everything's different. Every single one is different. I see you understand. Yeah. So that just cannot be understated. That it is, that's extremely impressive. Hats off.Itamar [00:40:50]: It could be, by the way, like, for example, oh, we're only AWS, but our GitHub enterprise is on-prem. Oh, we forgot. So we need like a private link or whatever, like every time like that. It's not, and you do need to think about it if you want to work with an enterprise. And it's important. Like I understand like their, I respect their point of view.Swyx [00:41:10]: And this primarily impacts your architecture, your tech choices. Like you have to, you can't choose some vendors because...Itamar [00:41:15]: Yeah, definitely. To be frank, it makes us hard for a startup because it means that we want, we want everyone to enjoy all the variety of models. By the way, it was hard for us with our technology. I want to open a bracket, like a window. I guess you're familiar with our Alpha Codium, which is an open source.Eric [00:41:33]: We got to go over that. Yeah. So I'll do that quickly.Itamar [00:41:36]: Yeah. A pin in that. Yeah. Actually, we didn't have it in the last episode. So, so, okay.Swyx [00:41:41]: Okay. We'll come back to that later, but let's talk about...Itamar [00:41:43]: Yeah. So, so just like shortly, and then we can double click on Alpha Codium. But Alpha Codium is a open source tool. You can go and try it and lets you compete on CodeForce. This is a website and a competition and actually reach a master level level, like 95% with a click of a button. You don't need to do anything. And part of what we did there is taking a problem and breaking it to different, like smaller blocks. And then the models are doing a much better job. Like we all know it by now that taking small tasks and solving them, by the way, even O1, which is supposed to be able to do system two thinking like Greg from OpenAI like hinted, is doing better on these kinds of problems. But still, it's very useful to break it down for O1, despite O1 being able to think by itself. And that's what we presented like just a month ago, OpenAI released that now they are doing 93 percentile with O1 IOI left and International Olympiad of Formation. Sorry, I forgot. Exactly. I told you I forgot. And we took their O1 preview with Alpha Codium and did better. Like it just shows like, and there is a big difference between the preview and the IOI. It shows like that these models are not still system two thinkers, and there is a big difference. So maybe they're not complete system two. Yeah, they need some guidance. I call them system 1.5. We can, we can have it. I thought about it. Like, you know, I care about this philosophy stuff. And I think like we didn't see it even close to a system two thinking. I can elaborate later. But closing the brackets, like we take Alpha Codium and as our principle of thinking, we take tasks and break them down to smaller tasks. And then we want to exploit the best model to solve them. So I want to enable anyone to enjoy O1 and SONET and Gemini 1.5, etc. But at the same time, I need to develop my own models as well, because some of the Fortune 500 want to have all air gapped or whatever. So that's a challenge. Now you need to support so many models. And to some extent, I would say that the flow engineering, the breaking down to two different blocks is a necessity for us. Why? Because when you take a big block, a big problem, you need a very different prompt for each one of the models to actually work. But when you take a big problem and break it into small tasks, we can talk how we do that, then the prompt matters less. What I want to say, like all this, like as a startup trying to do different deployment, getting all the juice that you can get from models, etc. is a big problem. And one need to think about it. And one of our mitigation is that process of taking tasks and breaking them down. That's why I'm really interested to know how you guys are doing it. And part of what we do is also open source. So you can see.Swyx [00:44:39]: There's a lot in there. But yeah, flow over prompt. I do believe that that does make sense. I feel like there's a lot that both of you can sort of exchange notes on breaking down problems. And I just want you guys to just go for it. This is fun to watch.Eric [00:44:55]: Yeah. I mean, what's super interesting is the context you're working in is, because for us too with Bolt, we've started thinking because our kind of existing business line was going behind the firewall, right? We were like, how do we do this? Adding the inference aspect on, we're like, okay, how does... Because I mean, there's not a lot of prior art, right? I mean, this is all new. This is all new. So I definitely am going to have a lot of questions for you.Itamar [00:45:17]: I'm here. We're very open, by the way. We have a paper on a blog or like whatever.Swyx [00:45:22]: The Alphacodeum, GitHub, and we'll put all this in the show notes.Itamar [00:45:25]: Yeah. And even the new results of O1, we published it.Eric [00:45:29]: I love that. And I also just, I think spiritually, I like your approach of being transparent. Because I think there's a lot of hype-ium around AI stuff. And a lot of it is, it's just like, you have these companies that are just kind of keep their stuff closed source and then just max hype it, but then it's kind of nothing. And I think it kind of gives a bad rep to the incredible stuff that's actually happening here. And so I think it's stuff like what you're doing where, I mean, true merit and you're cracking open actual code for others to learn from and use. That strikes me as the right approach. And it's great to hear that you're making such incredible progress.Itamar [00:46:02]: I have something to share about the open source. Most of our tools are, we have an open source version and then a premium pro version. But it's not an easy decision to do that. I actually wanted to ask you about your strategy, but I think in your case, there is, in my opinion, relatively a good strategy where a lot of parts of open source, but then you have the deployment and the environment, which is not right if I get it correctly. And then there's a clear, almost hugging face model. Yeah, you can do that, but why should you try to deploy it yourself, deploy it with us? But in our case, and I'm not sure you're not going to hit also some competitors, and I guess you are. I wanted to ask you, for example, on some of them. In our case, one day we looked on one of our competitors that is doing code review. We're a platform. We have the code review, the testing, et cetera, spread over the ID to get. And in each agent, we have a few startups or a big incumbents that are doing only that. So we noticed one of our competitors having not only a very similar UI of our open source, but actually even our typo. And you sit there and you're kind of like, yeah, we're not that good. We don't use enough Grammarly or whatever. And we had a couple of these and we saw it there. And then it's a challenge. And I want to ask you, Bald is doing so well, and then you open source it. So I think I know what my answer was. I gave it before, but still interestingEric [00:47:29]: to hear what you think. GeoHot said back, I don't know who he was up to at this exact moment, but I think on comma AI, all that stuff's open source. And someone had asked him, why is this open source? And he's like, if you're not actually confident that you can go and crush it and build the best thing, then yeah, you should probably keep your stuff closed source. He said something akin to that. I'm probably kind of butchering it, but I thought it was kind of a really good point. And that's not to say that you should just open source everything, because for obvious reasons, there's kind of strategic things you have to kind of take in mind. But I actually think a pretty liberal approach, as liberal as you kind of can be, it can really make a lot of sense. Because that is so validating that one of your competitors is taking your stuff and they're like, yeah, let's just kind of tweak the styles. I mean, clearly, right? I think it's kind of healthy because it keeps, I'm sure back at HQ that day when you saw that, you're like, oh, all right, well, we have to grind even harder to make sure we stay ahead. And so I think it's actually a very useful, motivating thing for the teams. Because you might feel this period of comfort. I think a lot of companies will have this period of comfort where they're not feeling the competition and one day they get disrupted. So kind of putting stuff out there and letting people push it forces you to face reality soon, right? And actually feel that incrementally so you can kind of adjust course. And that's for us, the open source version of Bolt has had a lot of features people have been begging us for, like persisting chat messages and checkpoints and stuff. Within the first week, that stuff was landed in the open source versions. And they're like, why can't you ship this? It's in the open, so people have forked it. And we're like, we're trying to keep our servers and GPUs online. But it's been great because the folks in the community did a great job, kept us on our toes. And we've got to know most of these folks too at this point that have been building these things. And so it actually was very instructive. Like, okay, well, if we're going to go kind of land this, there's some UX patterns we can kind of look at and the code is open source to this stuff. What's great about these, what's not. So anyways, NetNet, I think it's awesome. I think from a competitive point of view for us, I think in particular, what's interesting is the core technology of WebContainer going. And I think that right now, there's really nothing that's kind of on par with that. And we also, we have a business of, because WebContainer runs in your browser, but to make it work, you have to install stuff from NPM. You have to make cores bypass requests, like connected databases, which all require server-side proxying or acceleration. And so we actually sell WebContainer as a service. One of the core reasons we open-sourced kind of the core components of Bolt when we launched was that we think that there's going to be a lot more of these AI, in-your-browser AI co-gen experiences, kind of like what Anthropic did with Artifacts and Clod. By the way, Artifacts uses WebContainers. Not yet. No, yeah. Should I strike that? I think that they've got their own thing at the moment, but there's been a lot of interest in WebContainers from folks doing things in that sort of realm and in the AI labs and startups and everything in between. So I think there'll be, I imagine, over the coming months, there'll be lots of things being announced to folks kind of adopting it. But yeah, I think effectively...Swyx [00:50:35]: Okay, I'll say this. If you're a large model lab and you want to build sandbox environments inside of your chat app, you should call Eric.Itamar [00:50:43]: But wait, wait, wait, wait, wait, wait. I have a question about that. I think OpenAI, they felt that people are not using their model as they would want to. So they built ChatGPT. But I would say that ChatGPT now defines OpenAI. I know they're doing a lot of business from their APIs, but still, is this how you think? Isn't Bolt.new your business now? Why don't you focus on that instead of the...Swyx [00:51:16]: What's your advice as a founder?Eric [00:51:18]: You're right. And so going into it, we, candidly, we were like, Bolt.new, this thing is super cool. We think people are stoked. We think people will be stoked. But we were like, maybe that's allowed. Best case scenario, after month one, we'd be mind blown if we added a couple hundred K of error or something. And we were like, but we think there's probably going to be an immediate huge business. Because there was some early poll on folks wanting to put WebContainer into their product offerings, kind of similar to what Bolt is doing or whatever. We were actually prepared for the inverse outcome here. But I mean, well, I guess we've seen poll on both. But I mean, what's happened with Bolt, and you're right, it's actually the same strategy as like OpenAI or Anthropic, where we have our ChatGPT to OpenAI's APIs is Bolt to WebContainer. And so we've kind of taken that same approach. And we're seeing, I guess, some of the similar results, except right now, the revenue side is extremely lopsided to Bolt.Itamar [00:52:16]: I think if you ask me what's my advice, I think you have three options. One is to focus on Bolt. The other is to focus on the WebContainer. The third is to raise one billion dollars and do them both. I'm serious. I think otherwise, you need to choose. And if you raise enough money, and I think it's big bucks, because you're going to be chased by competitors. And I think it will be challenging to do both. And maybe you can. I don't know. We do see these numbers right now, raising above $100 million, even without havingEric [00:52:49]: a product. You can see these. It's excellent advice. And I think what's been amazing, but also kind of challenging is we're trying to forecast, okay, well, where are these things going? I mean, in the initial weeks, I think us and all the investors in the company that we're sharing this with, it was like, this is cool. Okay, we added 500k. Wow, that's crazy. Wow, we're at a million now. Most things, you have this kind of the tech crunch launch of initiation and then the thing of sorrow. And if there's going to be a downtrend, it's just not coming yet. Now that we're kind of looking ahead, we're six weeks in. So now we're getting enough confidence in our convictions to go, okay, this seems to be the trend line. I'll tell you another reason whySwyx [00:53:33]: I think, where is Jasper? They actually just announced some new numbers recently. They're still surviving. They have gone down a lot. I think that the peak that I heard was a hundredItamar [00:53:42]: billion ARR. And now there's like tens of these. So I think their success was phenomenal, like what I see at Bolt. And I think if you want to keep that, probably, who am I? I'm just giving my two cents. You need to focus because you are going to see weeks, I think that you're disrupting their market. And you open sourced some of it and they have containers, I believe. And you need to fight. I can tell you that when we open source, I share with you a small competitor, but I can tell you, I have a friend who has built a billion dollar company and more. When we released Alpha Codium, he sent me a private email asking, what the f**k did you just do? Why did you release that? You should have kept it. Yeah, you released that open source. I'm thinking, build some stuff and now I can do that much more easily. I can tell you my answer and I thought that maybe you'll answer as well. Although I think Bolt is already very promising. For us, Alpha Codium 1 is like GPT 1. I agree with you. Being open and open source, etc. really helps to improve the product community, etc. But at some point, OpenAI closed their GPT 3.5 or whatever. And that was part of my answer. Alpha Codium is the agent that is compatible with GPT 1 and there is a lot to do for these agents to actually get that moment that we had with GPT 3.5, etc. as agents.Eric [00:55:11]: Yeah, I think you're dead right. And I think it just comes back to what GeoHot said. It's like, if you want to win, there's no other option than out hustling everyone else. And so I think that's kind of out hustling in the sense really meaning building the best product, building the best experiences. And so I think that's the only way kind of almost any route and open source and stuff just kind of burns the ships in a sense. And maybe that's the simplest way of saying it. You're burning the ships, but also it builds a lot of goodwill. I mean, there's tons of benefits to it. Salesforce are doing that, right?Itamar [00:55:43]: They're now going to be agent force or whatever. So you can also...Swyx [00:55:47]: We're going to try to get Mark on the podcast. And they're good friends with Salesforce. Any parting thoughts, any trends that you'reItamar [00:55:55]: super excited about? If we're talking about trends, I go back to our original podcast where we talked about the idea that the software world is built from specs, tests, and code. And I think you can see that one dimension are company startups that are rethinking the entire development environment, I think like Bolt, etc. And another dimension is where is their focus? Is it on the spec, is on the test and on the code? And I think it's interesting to see that from that view. We'll see more startup and more amazing announcements of new directions, new philosophy. So I think we'll see startup focusing, let's build everything from the spec. To some extent, I would say that Bolt is, from my understanding, you can say better, somewhere in the line between the spec and the code. Because you start, like I saw your demos, you're trying to describe things, not just in one row, because you want to look like you want it. So it's on that edge between connecting between spec and code. And you see others, I think all the IDEs, most of them are the new IDEs, or the fork are there. We are more focused from the test and to the code and to the spec, etc. So these are trends, I think we will see that. And I think another dimension to consider is, is it more for the highway AI, for the developers, maybe not even a technical person, or is it for the enterprise? And that also gives you different products. If they are aiming for different ICP, different ideal client profile, they will approach this triangle of spec and test and code. And that's how I see the world. And what I'm noticing is that we're seeing more and more of those new startups, new interfaces that are not focused on code. For example, talking more about the spec, talking more about the testing. Eventually, I think that that's where the world is going to. The code is going to be there, and there will be developers, etc. But as agent improves and capabilities of the LLMs and integrations to different parts of the development environment, we're going to see more and more focusing on the spec and the test. Basically, these two might unite, the spec and the test, because you can say that tests are runnable specs, to some extent. So that's another way to look atSwyx [00:58:23]: it. Yeah, that is literally on the slide here, runnable tests, right here. Yeah, I'm consistent.Itamar [00:58:27]: It's all consistent. Look, I talked about system one and system two more than a year ago. And now with O1, people are talking about system one. But I think we'll talk about it again, because I think they're totally, totally wrong about O1 being a system two. It is now in the hype or whatever, talking about that. But I think the agents are the ones that will take us towards system two. And the more they are aware of their environment, and aware of that sometimes they don't know what they don't know, then we'll really get to system two. But that'sSwyx [00:59:03]: a deeper discussion. It's a deeper discussion. I love the philosophy talk that we had last time as well. All right, so we're back on to Bolt, and Itamar had to leave for another interview. But we were just talking about what happened post-launch, right? And I held this emergency council of advisors for you, because we had never seen this before. And I was like, okay, I'm going to call all the smartest people I know to join this thing.Eric [00:59:27]: Which was extremely helpful. And I'm so appreciative. There's been a handful of me.Swyx [00:59:31]: You made one hire out of that.Eric [00:59:34]: Yeah, because it was like, I think I can't remember where we were at kind of ARR-wise when I had messaged you.Swyx [00:59:40]: It was like, you messaged me at like two or three. And then by the time we got everything together, it was four. And then, yeah, now it's at-Alessio [00:59:48]: Since Eric sat down five minutes.Swyx [00:59:52]: But I mean, it sounds like you accelerated, because you told me it was like 100k, 200k a day. And now it's accelerated?Eric [00:59:58]: Yeah, this past- I mean, every week has been kind of a blowout week as far as- Is it TikTok? We're digging into the degree that we can of just like where all this stuff's coming from. I mean, there's a ton of word of mouth, right? So that you can't- which you can't just like look by refer, right? So there's a ton of direct. But yeah, I mean, there's a lot of TikTok. There's a ton of YouTube. It's kind of, I think, been a sensation in the sort of like entrepreneurial, build your own SaaS, indie hacker, even developer circles. And I think, too, our team's been doing a really good job. Our folks just kind of like flipped a switch. And people were just working through the weekends or whatever to get stuff fixed. And so the product- and you'll see people say this online. Like today, there was a tweet. Someone was like, yeah, I tried this like the first week and I couldn't get whatever to work. Came back today, six weeks later, and this is ridiculous. Like this is so good, right? And so I think there's been an incredible amount of improvement to the product, to the agent, also to like the underlying models, too. Like Sonnet, they just happened to do an update with their release a couple of weeks ago. And so when we put our new agent online and the new Sonnet, we saw a huge bump in conversion just based on that. And so yeah, we've gone at that. When we were chatting, that must have been three weeks ago, maybe an average of 100K ARR per day. And this week, I will see- I've said this every week, but we'll see if it holds. The past couple of days have been like half a million of ARR per day, which is insane. I think today we've had peak traffic, just kind of set the previous- and that's kind of been every day this week. But anyways, yeah, I think things just continue to accelerate, which is kind of blowing my mind, because it's just the sheer numbers of this stuff are just mind-boggling.Alessio [01:01:40]: I think you almost suffered from the Twitter demo issues that other people had. The first time I saw Bolt, I saw the demo and I was like, oh, that's cool. I didn't go to try it because I was like, I've seen so many of these that it's like, I don't know if it's actually going to work. And then two days ago, I signed up to use it. I was building a Luma replacement. I'm done with Luma. And I was like, man, this thing really works. And I already knew you, of course. I was like, man, this thing really works. What the f**k? I was like, it's actually, I don't know if it's like the model, if it's like how you prompt it, but it's so good at coming up with the simplest thing to implement. So the Luma example, right? So first I was like, create a RSVP page for an event and it created a wedding RSVP. I don't know if it's your fault. I don't know if you bolted it. And then I was like, well, now it needs to have a way to create more events and added that. And then I was like, now it needs a way to like have an admin page to modify event. And maybe what I would have done as a developer is like, well, I'll create a different like admin view, you know, with all the events and then I'll have like the front end thing. And instead what it did is like, it created like a admin view with toggle on top and then like just a pencil button on every page to edit them in line, you know, and that was it. And I was like, yeah, that works just as well. And like for the model, that's probably the simplest way to do it because it like limits the amount of files that are there. Can you talk just more about how much of this is like the model coming out with it, how much you're prompting it to kind of like be very likeEric [01:03:04]: compressed and concise. A ton of it is the model, but I think what's interesting though, is you're kind of baseline model. If I just like, if it's kind of like try and put it into like a, you know, way, if you had to quantify, quantify, you know, the effect is obviously the model is like this sort of like 10X multiplier. You're how good the bottom line model is huge, huge swing. And then kind of what you can do on top of that, you can squeeze out three, four X kind of more. And so that's kind of where the realm of, you know, prompt engineering and multi-agent approaches, et cetera, kind of kick in. And so I think, I think with us, you know, our folks, like the guy on our side that, you know, led the web engineering, like that kind of our core technology for the past, you know, seven years here, you know, his name is Dominic Elm based out of Germany and he was one of the founding engineers of the company. You had previous to StackBlitz, he actually was doing machine learning and he basically had built a StackBlitz, like online ID for machine learning. So I think like, I kind of like Google Colab sort of thing, or like Hugging Face has their kind of version of this. Back in 2016, it wasn't as much of a market for this stuff, but he had been doing a lot of, you know, training, you know, ML models and that sort of thing. So I guess, you know, as we began, you know, kind of digging into AI stuff over the past year, he's been kind of leading that off. And so a lot of it, I really attribute to Dom's specific angle, cause he has deep understanding of our technology and how it works. Cause he's, you know, led the engineering on web container, but as you know, deep understanding of how these models work going and actually kind of writing out these you know, whether it's like the, the, the prompt engineering aspect of it or multi-agent or whatever, have you, you know, that's sort of like that much context. And, and the, and the other folks on the team are, are, you know, in the same, same sort of spot that have been working on this stuff. I think we'd be able to squeeze out a lot more than I've seen almost anything else out there, at least in the term of building web apps, at least. But I guess I think it's, I think it's kind of just because we we have more context on, on a fewer number of heads at the company. So we can kind of connect the dots of it faster, youSwyx [01:05:01]: know? Yeah. That's part of the issue with the whole raise a billion dollars thing. Like you actually run very lean and that's, that's actually been to your advantage.Eric [01:05:08]: Totally. And I think, you know, and I think we, we have to staff up because I mean, we went from, you know, call it zero customers to, you know, 20, 30,000 kind of, you know, in six weeks, we have to have certainly more customer support, customer success stuff, et cetera. But you know, also just on, on engineering we have to ramp up, but I do think that there's a, we saw this in the 2021 cycle, right? Where, you know, adding tons more people can, can, can be a thing that really hurts, you know, the company because you can, it's just harder. It's really hard to manage lots of people. Not if you're a big enough company to warrant a certain headcount, a 100%, you kind of have to do it. Right. But I think for us, it's worked just to really grow, grow the team slowly and intentionally. And so I think we're going to take the same approach here at a bit of a faster clip than we were previously. But to me, that would just be general advice to startups is like slowly intentionally as fast as you can to meet demand or whatever. Part of what I felt like you're in a unique position toSwyx [01:06:07]: talk about, but also kind of what we went through in our, in our call was I have PMF now, what is, is kind of what I've been saying. And so like, I think the first answer is hire a data scientist because we have to sort of figure out like from our data that you're now sitting on a ton of different customers and we don't really know the different customer segments. You're starting to get an idea of churn. You're starting to get an idea of like segmentation. You already had data enrichment. One of my most interesting quotes from you from that session was that because you were selling to enterprise for so long, you had already set up all that stuff and it's just like, wasn't useful for a more sort of developer bottom up centric approach.Eric [01:06:46]: Yeah. And particularly because for the first time in the company's history, we're selling primarily to almost non-developers. And so everything that we've ever, all the playbooks we had not relevant here basically. Right. So the, and you're one of one of our investors I talked with earlier this week, basically brought up a really great point, which is like, you are now a B2C company and how you operate needs to reflect that.Swyx [01:07:09]: Which is, which is what, I don't know.Eric [01:07:11]: Which is basically from an analytics perspective, like you're tracking everything. Right. And then to your point, you have, you have people kind of around the clock slicing and dicing data to understand who are these people coming in, who are the types of people you actually want to retain versus people that, you know, are just going to churn out. And that's okay. Cause they're not the actual like ICP that you're going for. Right. When you're building stuff for enterprise software, the bar is a lot lower. And then to kind of to, from the conversation before one of the biggest, and this is kind of what we found with StackBlitz, which is kind of interesting, you know, you mentioned it, it's like, it's as a startup, it's very hard to sell on-prem extremely true. But if you can do it, it's like the promised land because you know, these, these companies you know, the fortune 500s, they can write really large checks. And so when you're going and selling to them, it doesn't matter so much like on your website. Sure. You want to track the conversion to the enterprise contact form or whatever. Right. But what, what actually really matters is like the, a lot of human touch points of, Hey, we want to have a quarterly call after just getting installed this stuff. There's a whole playbook for that. And you need to hire sales engineers that can be on the ground floor and helping people install it. Then after that, you got to, okay, how do we make sure they're kind of constantly successful? Because you can't access like we can, our enterprise customer instances, we have no idea how often they're using them. Why? Because the whole point is that we can't see what they're up to for a good reason, right? Like they, they need to own their data. And so the way it's actually much, a very complicated problem of how do you have like build relationships where everyone's getting on calls, they can share kind of the telemetry that, that they can see within their instance. And you can kind of extrapolate that and make sure they're happy and successful. So that's, there's a whole art of that, of doing enterprise well, that we've gone and done and closed these folks totally unrelated to doing BC completely, completely unrelated for the most part. So anyway, so that, so that, you know, we're, as a company, we're, we're kind of reorienting, you know, our focus on, okay, going and actually really leaning in on analytics, whatever have you. And fortunately, like my co-founder and I, the art, the enterprise business of stack was, was the first time we had ever done enterprise primarily like things to the company we did before was B2C. Like we were selling people courses on how to do web development basically. Right. So a lot of the skillset that, you know, I had built up there, I able to pull that back off the shelf, dust it off, sharpen the blade. And, you know, we're doing email marketing, we're doing live streams, you know? So, so that's, it's, it's kind of cool to, you know, be shifting back to some of the, the, the, where we cut our teeth on back in the day.Alessio [01:09:35]: How did you pick the pricing? Because I had to pay.Swyx [01:09:38]: That's fantastic. You want to like slight, slightly like, yeah, you got a bit. It's like,Alessio [01:09:44]: you're running out of tokens, dude. I was like, f**k, I'm running out of tokens. It's like, I don't want to run out of tokens, but there's like five different tiers. Yeah. Right. Which are kind of like token based and capacity based. Yep. How do you kind of reconcile that? And the consumer side where maybe the consumer doesn't even really need to know what a token is, right? Like on that, like your mom probably doesn't really care what an AI token is. How did you structure it to start? How did you come up with that? And then maybe ideas that you have to like improve or like modify that.Eric [01:10:12]: Totally. Yeah. So we, so when we first launched with StackBlitz is like, we were an enterprise play, right? And so when we launched in 2017, I think we tried pricing 2018 or 2019, but like it was free for a long time. And then we had a 9𝑝𝑙𝑎𝑛𝑎𝑛𝑑𝑡ℎ𝑎𝑡𝑤𝑎𝑠𝑗𝑢𝑠𝑡𝑡ℎ𝑒𝑤𝑎𝑦𝑖𝑡𝑤𝑎𝑠.𝐼𝑡𝑤𝑎𝑠,𝑖𝑡𝑤𝑎𝑠𝑘𝑖𝑛𝑑𝑜𝑓𝑙𝑖𝑘𝑒𝑜𝑢𝑟,𝑜𝑢𝑟𝑑𝑜𝑙𝑙𝑎𝑟50ℎ𝑜𝑡𝑑𝑜𝑔𝑎𝑡𝐶𝑜𝑠𝑡𝑐𝑜.𝐼𝑡′𝑠𝑘𝑖𝑛𝑑𝑜𝑓𝑙𝑖𝑘𝑒𝑡ℎ𝑖𝑠,𝑡ℎ𝑖𝑠,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑗𝑢𝑠𝑡𝑙𝑜𝑤𝑝𝑟𝑖𝑐𝑒,𝑗𝑢𝑠𝑡,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑖𝑡,𝑖𝑡𝑤𝑎𝑠𝑛′𝑡𝑡ℎ𝑒𝑝𝑟𝑖𝑚𝑎𝑟𝑦𝑟𝑒𝑣𝑑𝑟𝑖𝑣𝑒𝑟𝑎𝑛𝑑𝑤𝑒𝑗𝑢𝑠𝑡𝑤𝑎𝑛𝑡𝑒𝑑𝑡𝑜,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑠𝑎𝑦,𝐻𝑒𝑦,𝑝𝑎𝑦𝑓𝑜𝑟𝑠𝑜𝑚𝑒𝑚𝑜𝑟𝑒𝑠𝑡𝑜𝑟𝑎𝑔𝑒𝑎𝑛𝑑𝑝𝑟𝑖𝑣𝑎𝑡𝑒𝑝𝑟𝑜𝑗𝑒𝑐𝑡𝑠𝑜𝑟𝑤ℎ𝑎𝑡𝑒𝑣𝑒𝑟.𝐴𝑛𝑑𝑠𝑜𝑤𝑒𝑤𝑒𝑛𝑡𝑡𝑜𝑙𝑎𝑢𝑛𝑐ℎ𝑏𝑜𝑙𝑡𝑎𝑔𝑎𝑖𝑛,𝑙𝑖𝑘𝑒𝑜𝑢𝑟𝑒𝑥𝑝𝑒𝑐𝑡𝑎𝑡𝑖𝑜𝑛𝑤𝑎𝑠,𝐻𝑒𝑦,𝑤𝑒′𝑙𝑙𝑝𝑟𝑜𝑏𝑎𝑏𝑙𝑦𝑔𝑒𝑡𝑎𝑔𝑜𝑜𝑑𝑛𝑢𝑚𝑏𝑒𝑟𝑜𝑓𝑝𝑒𝑜𝑝𝑙𝑒𝑡ℎ𝑎𝑡′𝑙𝑙𝑠𝑖𝑔𝑛𝑢𝑝𝑎𝑛𝑑𝑏𝑒𝑒𝑥𝑐𝑖𝑡𝑒𝑑𝑎𝑏𝑜𝑢𝑡𝑖𝑡.𝐴𝑛𝑑𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑤𝑒′𝑟𝑒𝑛𝑜𝑡𝑡𝑜𝑜𝑐𝑜𝑛𝑐𝑒𝑟𝑛𝑒𝑑,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑤𝑒′𝑟𝑒𝑗𝑢𝑠𝑡,𝑤𝑒′𝑟𝑒𝑗𝑢𝑠𝑡𝑛𝑜𝑡,𝑤𝑒𝑤𝑒𝑟𝑒𝑢𝑛𝑝𝑟𝑒𝑝𝑎𝑟𝑒𝑑𝑓𝑜𝑟𝑡ℎ𝑒𝑡𝑠𝑢𝑛𝑎𝑚𝑖𝑡ℎ𝑎𝑡ℎ𝑖𝑡.𝐴𝑛𝑑𝑠𝑜𝑎𝑓𝑡𝑒𝑟𝑔𝑜𝑖𝑛𝑔𝑜𝑛𝑙𝑖𝑛𝑒𝑡ℎ𝑒𝑓𝑖𝑟𝑠𝑡𝑤𝑒𝑒𝑘,𝑤𝑒𝑤𝑒𝑟𝑒𝑙𝑖𝑘𝑒,𝑤𝑜𝑤,𝑡ℎ𝑖𝑠𝑖𝑠𝑐𝑜𝑜𝑙.𝑇ℎ𝑒𝑟𝑒′𝑠𝑎,𝐼𝑚𝑒𝑎𝑛,𝑖𝑡𝑗𝑢𝑠𝑡𝑘𝑒𝑝𝑡𝑔𝑟𝑜𝑤𝑖𝑛𝑔.𝐴𝑛𝑑𝑡ℎ𝑒𝑛𝑜𝑛𝑐𝑒𝑤𝑒ℎ𝑖𝑡𝑤𝑒𝑒𝑘𝑡𝑤𝑜,𝐼𝑚𝑒𝑎𝑛,𝑤𝑒𝑤𝑒𝑟𝑒𝑗𝑢𝑠𝑡𝑛𝑖𝑛𝑒𝑏𝑢𝑐𝑘𝑠𝑤𝑎𝑠,𝐼𝑚𝑒𝑎𝑛,𝑖𝑡′𝑠𝑙𝑖𝑘𝑒𝑡ℎ𝑒𝑐ℎ𝑒𝑎𝑝𝑒𝑠𝑡𝐴𝐼𝑐𝑜𝑑𝑖𝑛𝑔𝑡ℎ𝑖𝑛𝑔𝑦𝑜𝑢𝑐𝑎𝑛𝑔𝑒𝑡𝑚𝑎𝑦𝑏𝑒𝑜𝑡ℎ𝑒𝑟𝑡ℎ𝑎𝑛𝑐𝑜𝑝𝑖𝑙𝑜𝑡,𝑏𝑢𝑡𝑙𝑖𝑘𝑒𝑤𝑒𝑤𝑒𝑟𝑒𝑜𝑣𝑒𝑟𝑟𝑢𝑛𝑏𝑦𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑡𝑖𝑐𝑘𝑒𝑡𝑠.𝐴𝑛𝑑𝐼𝑗𝑢𝑠𝑡,𝑎𝑛𝑑𝑗𝑢𝑠𝑡𝑡ℎ𝑒𝑠ℎ𝑒𝑒𝑟𝑣𝑜𝑙𝑢𝑚𝑒𝑜𝑓𝑝𝑒𝑜𝑝𝑙𝑒𝑐𝑜𝑚𝑖𝑛𝑔𝑖𝑛𝑎𝑛𝑑𝑖𝑡𝑗𝑢𝑠𝑡𝑙𝑎𝑤𝑠𝑜𝑓𝑠𝑢𝑝𝑝𝑙𝑦𝑎𝑛𝑑𝑑𝑒𝑚𝑎𝑛𝑑.𝑊𝑒𝑤𝑒𝑟𝑒𝑙𝑖𝑘𝑒,𝑜𝑘𝑎𝑦,𝑡ℎ𝑖𝑠𝑖𝑠𝑛′𝑡,𝑡ℎ𝑒𝑟𝑒′𝑠𝑛𝑜𝑤𝑎𝑦𝑤𝑒𝑐𝑎𝑛𝑠𝑐𝑎𝑙𝑒𝑡𝑜𝑚𝑒𝑒𝑡𝑡ℎ𝑖𝑠.𝐴𝑙𝑠𝑜𝑡ℎ𝑒𝑝𝑒𝑜𝑝𝑙𝑒𝑐𝑜𝑚𝑖𝑛𝑔𝑖𝑛𝑎𝑟𝑒𝑏𝑢𝑟𝑛𝑖𝑛𝑔𝑡ℎ𝑟𝑜𝑢𝑔ℎ𝑡ℎ𝑒𝑖𝑟𝑡𝑜𝑘𝑒𝑛𝑠𝑎𝑛𝑑𝑡ℎ𝑒𝑟𝑒′𝑠𝑛𝑜𝑤𝑎𝑦𝑡𝑜𝑎𝑐𝑡𝑢𝑎𝑙𝑙𝑦𝑙𝑖𝑘𝑒𝑏𝑢𝑦𝑚𝑜𝑟𝑒𝑜𝑓𝑡ℎ𝑒𝑠𝑒𝑡ℎ𝑖𝑛𝑔𝑠.𝐴𝑛𝑑𝑛𝑖𝑛𝑒𝑏𝑢𝑐𝑘𝑠𝑖𝑠𝑗𝑢𝑠𝑡,𝑦𝑜𝑢𝑐𝑎𝑛′𝑡𝑔𝑒𝑡𝑡ℎ𝑎𝑡𝑚𝑢𝑐ℎ𝑖𝑛𝑓𝑒𝑟𝑒𝑛𝑐𝑒𝑜𝑢𝑡𝑜𝑓𝑡ℎ𝑎𝑡.𝐴𝑛𝑑𝑠𝑜𝑡ℎ𝑒,ℎ𝑒𝑟𝑒′𝑠𝑡ℎ𝑒𝑜𝑡ℎ𝑒𝑟𝑡ℎ𝑖𝑛𝑔𝑡ℎ𝑎𝑡′𝑠𝑖𝑛𝑡𝑒𝑟𝑒𝑠𝑡𝑖𝑛𝑔𝑎𝑏𝑜𝑢𝑡𝑏𝑜𝑙𝑡𝑐𝑜𝑚𝑝𝑎𝑟𝑒𝑑𝑡𝑜𝑙𝑖𝑘𝑒𝑠𝑜𝑚𝑒𝑡ℎ𝑖𝑛𝑔𝑙𝑖𝑘𝑒𝑐𝑜𝑝𝑖𝑙𝑜𝑡𝑜𝑟𝑤ℎ𝑎𝑡𝑒𝑣𝑒𝑟.𝐴𝑛𝑑𝑡ℎ𝑖𝑠𝑘𝑖𝑛𝑑𝑜𝑓𝑡𝑖𝑒𝑑𝑡ℎ𝑖𝑠,𝑠𝑜𝑟𝑟𝑦,𝑎𝑙𝑖𝑡𝑡𝑙𝑒𝑏𝑖𝑡𝑜𝑓𝑎𝑟𝑜𝑢𝑛𝑑𝑎𝑏𝑜𝑢𝑡𝑤𝑎𝑦𝑡𝑜𝑎𝑛𝑠𝑤𝑒𝑟𝑦𝑜𝑢𝑟𝑞𝑢𝑒𝑠𝑡𝑖𝑜𝑛.𝐵𝑢𝑡𝑏𝑎𝑠𝑖𝑐𝑎𝑙𝑙𝑦𝑤ℎ𝑎𝑡𝑤𝑒𝑒𝑛𝑑𝑒𝑑𝑢𝑝𝑎𝑡𝑡ℎ𝑎𝑡𝑚𝑜𝑚𝑒𝑛𝑡,𝑤𝑒𝑒𝑛𝑑𝑒𝑑𝑢𝑝𝑟𝑒𝑎𝑙𝑖𝑧𝑖𝑛𝑔𝑖𝑠𝑡ℎ𝑎𝑡𝑤ℎ𝑒𝑛𝑦𝑜𝑢𝑢𝑠𝑒𝑐𝑜𝑝𝑖𝑙𝑜𝑡,𝑤ℎ𝑎𝑡𝑖𝑡′𝑠𝑠𝑒𝑛𝑑𝑖𝑛𝑔𝑢𝑝,𝑖𝑡𝑑𝑜𝑒𝑠𝑛′𝑡𝑝𝑟𝑜𝑣𝑖𝑑𝑒𝑎𝑙𝑜𝑡𝑜𝑓𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑜𝑓𝑦𝑜𝑢𝑟𝑐𝑜𝑑𝑒𝑏𝑎𝑠𝑒.𝑇ℎ𝑒𝑦𝑡𝑟𝑦𝑎𝑛𝑑𝑟𝑒𝑑𝑢𝑐𝑒𝑡ℎ𝑒𝑎𝑚𝑜𝑢𝑛𝑡𝑜𝑓𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑎𝑠𝑚𝑢𝑐ℎ𝑎𝑠𝑡ℎ𝑒𝑦𝑐𝑎𝑛.𝐴𝑛𝑑𝐼𝑡ℎ𝑖𝑛𝑘,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑡ℎ𝑒𝑜𝑟𝑖𝑔𝑖𝑛𝑠𝑜𝑓𝑡ℎ𝑖𝑠𝑠𝑡𝑢𝑓𝑓𝑖𝑠𝑡ℎ𝑒𝑦,𝑒𝑣𝑒𝑟𝑦𝑜𝑛𝑒𝑘𝑖𝑛𝑑𝑜𝑓𝑤𝑎𝑛𝑡𝑠𝑡ℎ𝑖𝑠𝑙𝑖𝑘𝑒𝑙𝑜𝑤𝑝𝑟𝑖𝑐𝑒𝑝𝑜𝑖𝑛𝑡𝑤ℎ𝑒𝑟𝑒𝑖𝑡′𝑠𝑙𝑖𝑘𝑒𝑎𝑙𝑙𝑦𝑜𝑢𝑐𝑎𝑛𝑒𝑎𝑡.𝑆𝑜𝑖𝑡𝑗𝑢𝑠𝑡𝑘𝑖𝑛𝑑𝑜𝑓,𝑡ℎ𝑎𝑡𝑘𝑖𝑛𝑑𝑜𝑓𝑓𝑒𝑒𝑙𝑠𝑙𝑖𝑘𝑒,𝑐𝑎𝑢𝑠𝑒𝑖𝑡′𝑠𝑙𝑖𝑘𝑒,𝑖𝑡𝑎𝑙𝑚𝑜𝑠𝑡𝑙𝑖𝑘𝑒𝑁𝑒𝑡𝑓𝑙𝑖𝑥,𝑖𝑡′𝑠𝑙𝑖𝑘𝑒,𝐼′𝑙𝑙𝑝𝑎𝑦𝑎𝑡ℎ𝑖𝑛𝑔.𝐴𝑛𝑑𝑡ℎ𝑒𝑛𝐼𝑐𝑎𝑛𝑗𝑢𝑠𝑡𝑑𝑜𝑎𝑠𝑚𝑢𝑐ℎ𝑜𝑓𝑡ℎ𝑒𝑚𝑜𝑣𝑖𝑒𝑤𝑎𝑡𝑐ℎ𝑖𝑛𝑔𝑎𝑠𝐼𝑤𝑎𝑛𝑡.𝐴𝑛𝑑𝐼𝑡ℎ𝑖𝑛𝑘,𝐼𝑡ℎ𝑖𝑛𝑘𝑡ℎ𝑎𝑡,𝑡ℎ𝑎𝑡𝑘𝑖𝑛𝑑𝑜𝑓𝑚𝑒𝑛𝑡𝑎𝑙𝑖𝑡𝑦,𝑤ℎ𝑒𝑛𝑡ℎ𝑒𝑠𝑒𝑓𝑖𝑟𝑠𝑡𝐴𝐼𝑝𝑟𝑜𝑑𝑢𝑐𝑡𝑠𝑐𝑎𝑚𝑒,𝑖𝑡𝑘𝑖𝑛𝑑𝑜𝑓𝑚𝑎𝑘𝑒𝑠𝑠𝑒𝑛𝑠𝑒.𝑇ℎ𝑒𝑦′𝑟𝑒𝑙𝑖𝑘𝑒,𝑜𝑘𝑎𝑦,𝑤𝑒𝑙𝑙𝑤𝑒,𝑤𝑒𝑑𝑜𝑛′𝑡𝑤𝑎𝑛𝑡𝑡𝑜𝑚𝑒𝑡𝑒𝑟𝑖𝑡.𝐶𝑎𝑢𝑠𝑒𝑡ℎ𝑎𝑡𝑑𝑜𝑒𝑠𝑛′𝑡𝑓𝑒𝑒𝑙𝑔𝑜𝑜𝑑.𝑅𝑖𝑔ℎ𝑡.𝐵𝑢𝑡𝑡ℎ𝑒𝑝𝑟𝑜𝑏𝑙𝑒𝑚𝑖𝑠𝑡ℎ𝑎𝑡𝑡ℎ𝑒𝑛𝑡ℎ𝑒𝑦′𝑟𝑒𝑖𝑛𝑐𝑒𝑛𝑡𝑖𝑣𝑖𝑧𝑒𝑑𝑡𝑜𝑛𝑜𝑡ℎ𝑎𝑣𝑒𝑖𝑡𝑏𝑒𝑎𝑏𝑙𝑒𝑡𝑜𝑘𝑒𝑒𝑝𝑡ℎ𝑒𝑚𝑜𝑟𝑒𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑦𝑜𝑢𝑔𝑖𝑣𝑒𝑖𝑡,𝑡ℎ𝑒𝑚𝑜𝑟𝑒𝑖𝑡𝑐𝑎𝑛𝑑𝑜.𝐴𝑛𝑑𝑡ℎ𝑎𝑡′𝑠𝑡ℎ𝑒𝑚𝑎𝑔𝑖𝑐𝑜𝑓𝑤ℎ𝑎𝑡𝑤𝑒′𝑟𝑒𝑑𝑜𝑖𝑛𝑔𝑤𝑖𝑡ℎ𝑏𝑜𝑙𝑑𝑖𝑠𝑤𝑒′𝑟𝑒𝑔𝑖𝑣𝑖𝑛𝑔𝑖𝑡𝑎𝑙𝑙𝑡ℎ𝑒𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑤𝑒𝑝𝑜𝑠𝑠𝑖𝑏𝑙𝑦𝑐𝑎𝑛.𝐴𝑛𝑑𝑡ℎ𝑎𝑡′𝑠𝑤ℎ𝑦𝑦𝑜𝑢𝑐𝑎𝑛𝑔𝑜𝑡𝑜𝑖𝑡𝑎𝑛𝑑𝑠𝑎𝑦,𝑚𝑎𝑘𝑒𝑚𝑒𝑎𝑛𝑅𝑆𝑉𝑃𝑠𝑖𝑡𝑒.𝐴𝑛𝑑𝑖𝑡𝑑𝑜𝑒𝑠𝑛′𝑡𝑏𝑒𝑐𝑎𝑢𝑠𝑒𝑖𝑡ℎ𝑎𝑠𝑐𝑜𝑛𝑡𝑒𝑥𝑡,𝑡ℎ𝑒𝑒𝑛𝑡𝑖𝑟𝑒𝑠𝑡𝑎𝑡𝑒𝑜𝑓𝑡ℎ𝑒𝑎𝑝𝑝𝑙𝑖𝑐𝑎𝑡𝑖𝑜𝑛,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑒𝑡𝑐𝑒𝑡𝑒𝑟𝑎,𝑒𝑡𝑐𝑒𝑡𝑒𝑟𝑎.𝐴𝑛𝑑𝑡ℎ𝑎𝑡′𝑠𝑤ℎ𝑎𝑡𝑚𝑎𝑘𝑒𝑠𝑖𝑡𝑠𝑜𝑎𝑐𝑐𝑢𝑟𝑎𝑡𝑒.𝑉𝑒𝑟𝑠𝑢𝑠𝑖𝑓𝑦𝑜𝑢𝑔𝑜𝑡𝑜𝑐𝑜−𝑝𝑖𝑙𝑜𝑡𝑎𝑛𝑑𝑠𝑎𝑦𝑡ℎ𝑎𝑡𝑖𝑡,𝑡ℎ𝑒𝑟𝑒′𝑙𝑙𝑏𝑒,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑖𝑡𝑚𝑖𝑔ℎ𝑡𝑝𝑢𝑛𝑐ℎ𝑜𝑢𝑡𝑎𝑟𝑒𝑎𝑐𝑡𝑐𝑜𝑚𝑝𝑜𝑛𝑒𝑛𝑡.𝑇ℎ𝑎𝑡′𝑠𝑡ℎ𝑒𝑏𝑢𝑡𝑡𝑜𝑛𝑡𝑜𝑐𝑟𝑒𝑎𝑡𝑒𝑡ℎ𝑒𝑡ℎ𝑖𝑛𝑔,𝑏𝑢𝑡𝑛𝑜𝑡𝑎𝑐𝑡𝑢𝑎𝑙𝑙𝑦𝑚𝑜𝑟𝑒𝑡ℎ𝑎𝑛𝑡ℎ𝑎𝑡.𝑆𝑜𝑎𝑛𝑦𝑤𝑎𝑦,𝑠𝑜,𝑢𝑚,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑎𝑛𝑑𝑎𝑡𝑡ℎ𝑒𝑡𝑖𝑚𝑒𝑤ℎ𝑒𝑛𝑝𝑒𝑜𝑝𝑙𝑒ℎ𝑎𝑣𝑒𝑏𝑜𝑢𝑔ℎ𝑡𝑡ℎ𝑒9planandthatwasjustthewayitwas.Itwas,itwaskindoflikeour,ourdollar50hotdogatCostco.It′skindoflikethis,this,youknow,justlowprice,just,youknow,it,itwasn′ttheprimaryrevdriverandwejustwantedto,youknow,say,Hey,payforsomemorestorageandprivateprojectsorwhatever.Andsowewenttolaunchboltagain,likeourexpectationwas,Hey,we′llprobablygetagoodnumberofpeoplethat′llsignupandbeexcitedaboutit.Andyouknow,we′renottooconcerned,youknow,we′rejust,we′rejustnot,wewereunpreparedforthetsunamithathit.Andsoaftergoingonlinethefirstweek,wewerelike,wow,thisiscool.There′sa,Imean,itjustkeptgrowing.Andthenoncewehitweektwo,Imean,wewerejustninebuckswas,Imean,it′slikethecheapestAIcodingthingyoucangetmaybeotherthancopilot,butlikewewereoverrunbysupporttickets.AndIjust,andjustthesheervolumeofpeoplecominginanditjustlawsofsupplyanddemand.Wewerelike,okay,thisisn′t,there′snowaywecanscaletomeetthis.Alsothepeoplecominginareburningthroughtheirtokensandthere′snowaytoactuallylikebuymoreofthesethings.Andninebucksisjust,youcan′tgetthatmuchinferenceoutofthat.Andsothe,here′stheotherthingthat′sinterestingaboutboltcomparedtolikesomethinglikecopilotorwhatever.Andthiskindoftiedthis,sorry,alittlebitofaroundaboutwaytoansweryourquestion.Butbasicallywhatweendedupatthatmoment,weendeduprealizingisthatwhenyouusecopilot,whatit′ssendingup,itdoesn′tprovidealotofcontextofyourcodebase.Theytryandreducetheamountofcontextasmuchastheycan.AndIthink,youknow,theoriginsofthisstuffisthey,everyonekindofwantsthislikelowpricepointwhereit′slikeallyoucaneat.Soitjustkindof,thatkindoffeelslike,causeit′slike,italmostlikeNetflix,it′slike,I′llpayathing.AndthenIcanjustdoasmuchofthemoviewatchingasIwant.AndIthink,Ithinkthat,thatkindofmentality,whenthesefirstAIproductscame,itkindofmakessense.They′relike,okay,wellwe,wedon′twanttometerit.Causethatdoesn′tfeelgood.Right.Buttheproblemisthatthenthey′reincentivizedtonothaveitbeabletokeepthemorecontextyougiveit,themoreitcando.Andthat′sthemagicofwhatwe′redoingwithboldiswe′regivingitallthecontextwepossiblycan.Andthat′swhyyoucangotoitandsay,makemeanRSVPsite.Anditdoesn′tbecauseithascontext,theentirestateoftheapplication,youknow,etcetera,etcetera.Andthat′swhatmakesitsoaccurate.Versusifyougotoco−pilotandsaythatit,there′llbe,youknow,itmightpunchoutareactcomponent.That′sthebuttontocreatethething,butnotactuallymorethanthat.Soanyway,so,um,youknow,andatthetimewhenpeoplehaveboughtthe9 plan, they were like, I want to give you more money. I want you to buy more tokens. How do I do that? And so our team scrambled that weekend, we just turned it around and just, you know, we said, okay, well, what do we think is reasonable? And we said, okay, so let's go, you immediately double the prices of the, of the base tier, because it's just not enough what people are getting on for nine bucks. So that'll be, that seems reasonable. It's kind of in line with everyone else. And then we added 50, 100 and $200 plans. Cause we're like, that should be enough. And so, yeah, so that, that's kind of the origins of it. And, and, um, it was, it was people that use it, fall in love with that and they want to use more of it. And the problem is the inference is expensive. And so we're not actually taking, you know, to date on the, on the revenue we've done, we have not really taken a margin at all on this stuff. Cause we're just trying to put all the value back into the folks that are there using the tool and just getting the maximum amount of value out of it. But it's really key to the kind of the magic of the experience. And so the other, the other thing kind of worth mentioning is there's kind of the ARR number, but then we, you can also buy additional tokens, you know, just with usage-based billing effectively. And that's accounting for an additional 20, 30% of, of revenue that's coming to the company. People are actually using this to do their jobs. Like, you think, think about a web development agency before this thing, they're going in using Figma to make a design. They have to pay the designer. They have to like punch that out into code, kind of man. And maybe like co-pilot can help a little bit with punching out this thing that they're coming to this thing. And there's just wild stories online where it's like guy bake, local bakeries, like we need a website. He's like, okay, well, I'm going to charge you a thousand bucks. They're like, okay, that sounds great. Reasonable price. 30 minutes later, he's like, here's a deploy preview of your thing. How does that look? They're like, wow, holy crap. I'm not giving you a thousand bucks. But they did, they were, they were, they were like, this usually takes months, you know? So some of the biggest power users are people that build websites for a living because this is the, the alpha on this is insane.Alessio [01:14:26]: That's almost like the gap, right? It's like, it used to be that if I ask you before this to do a website and in 30 minutes you return to me and you give me something, I'm like, you know, you're probably just copying something else you've done before, you know, versus now it's almost like, it doesn't really matter how much time it takes you because everybody's going to be so fast with these things. It's more like the value. And that's why when you're pricing TRL, it was almost like, there's only really going to be like either 20𝑎𝑚𝑜𝑛𝑡ℎ𝑢𝑠𝑒𝑟𝑠𝑜𝑟𝑙𝑖𝑘𝑒𝑎𝑡ℎ𝑜𝑢𝑠𝑎𝑛𝑑𝑑𝑜𝑙𝑙𝑎𝑟𝑠𝑎𝑚𝑜𝑛𝑡ℎ𝑢𝑠𝑒𝑟𝑠.𝑌𝑜𝑢𝑘𝑛𝑜𝑤,𝑖𝑡′𝑠𝑎𝑙𝑚𝑜𝑠𝑡𝑙𝑖𝑘𝑒𝑤ℎ𝑜′𝑠𝑔𝑜𝑖𝑛𝑔𝑡𝑜𝑢𝑠𝑒𝑡ℎ𝑒20amonthusersorlikeathousanddollarsamonthusers.Youknow,it′salmostlikewho′sgoingtousethe50 a month because it's kind of like in between, between being infrequent user and being like a power user, you know? So yeah, it makes sense that you have like a big part of like on demandEric [01:15:05]: on top of that. Yeah. And on the 50, there's actually a lot of people on the one. I think it's because it's like enough to actually like for developers are using this to just kind of like punch out components or designs or whatever, kind of gets them enough for, you know, kind of in a given month or whatever. And so it's been interesting to just kind of see the, the, you know, the, the upgrades that happen, but what's been kind of cool about the product is it's, and again, I think this is kind of novel and this is, you know, us being maybe a little more transparent than we should be or something, but like, I suspect we're just, I think we're going to see a lot more of this because we're hitting an inflection point coming back to the co-pilot thing. Part of the problem before is that it didn't matter if you provided more context, the models just weren't good enough to know what to even do with it. That's not the case now. You know, just one, one, you know, story of like one of the first people, one of the power, first power users that adopted Bolt was this gal in Thailand who's a PM at a software banking company. And she had an idea for this app called viralhooks.ai, which is basically, it's a tool that if you want to make viral TikToks and stuff, it's like, what's the hook of the video to make people watch. Right. And so basically she, you know, you can go and like, see, it goes and extracts hooks from other people's videos and helps you with like, you know, AI to write your own. And she had originally put the week before Bolt launched, she put that on Upwork and you know, some, I think a developer in like Ukraine had quoted her, you know, 5,000.𝐴𝑛𝑑𝑖𝑡′𝑠𝑔𝑜𝑖𝑛𝑔𝑡𝑜𝑡𝑎𝑘𝑒𝑙𝑖𝑘𝑒𝑡ℎ𝑟𝑒𝑒𝑚𝑜𝑛𝑡ℎ𝑠𝑜𝑟𝑠𝑜𝑚𝑒𝑡ℎ𝑖𝑛𝑔𝑙𝑖𝑘𝑒𝑡ℎ𝑎𝑡.𝑅𝑒𝑎𝑠𝑜𝑛𝑎𝑏𝑙𝑒𝑡𝑖𝑚𝑒𝑓𝑟𝑎𝑚𝑒,𝑟𝑖𝑔ℎ𝑡.𝐹𝑜𝑟𝑎𝑛𝑎𝑝𝑝𝑙𝑖𝑘𝑒𝑡ℎ𝑎𝑡,𝑟𝑒𝑎𝑠𝑜𝑛𝑎𝑏𝑙𝑒𝑝𝑟𝑖𝑐𝑒.𝑇ℎ𝑒𝑤𝑒𝑒𝑘𝑎𝑓𝑡𝑒𝑟𝑡ℎ𝑎𝑡𝐵𝑜𝑙𝑡𝑐𝑎𝑚𝑒𝑜𝑢𝑡,𝑠ℎ𝑒𝑏𝑜𝑢𝑔ℎ𝑡𝑡ℎ𝑒5,000.Andit′sgoingtotakelikethreemonthsorsomethinglikethat.Reasonabletimeframe,right.Foranapplikethat,reasonableprice.TheweekafterthatBoltcameout,sheboughtthe50 plan and she had the app built within a week or two. And so you're talking about like, that's it. And it's beautiful. She did an incredible job. Right. And so the numbers are wild. 5,000,𝑡ℎ𝑟𝑒𝑒𝑚𝑜𝑛𝑡ℎ𝑠𝑡𝑜5,000,threemonthsto50 and like a week. Yeah. You got to charge more. So it's, it's kind of like, so there's, there's people like when we've had a lot of people go, this pricing is insane. And we're like, well, we're not even taking really a margin at the moment on it, you know, but also, but when you, when you compare that to the price of actually going and building the cost of building quality software today, anyone who knows the price of building quality software, the alpha is obvious, right? It's a 99% cost production and five X faster, you know, delivery time, you know? So anyway, so that's, I think we're one of the first products that have actually come out kind of proving that, you know, in, in, in a revenue way to kind of underscore the point, as you can imagine, we've had, you know, kind of venture capital firms kind of reach out and kind of, you know, curious to kind of, you know, what we're up to or whatever. And so, you know, one of the most, you know, there's kind of one of the, the most notable ones or whatever reached out. So we kind of sent them, you know, you know, kind of our numbers. Actually it was the investor update, Sean, that, that I think you, you know, the, you know, the one you saw kind of gave him a snapshot of it. And they one of their analysts accidentally replied all on what we had sent them and with, with the analysis. And so on this part there, you know, one of the things they said was we haven't seen anything that's kind of eyeopening to see people going to $200 tier on this sort of thing. Haven't seen anything else like that in the space. Cause I think this is very new because of the new model capabilities, right? Where people, you know, it makes sense. Like you're willing to pay more money for this stuff. So. This is something I've talked about before in terms of matchingSwyx [01:18:11]: the dollar amount of spend to the capabilities of the AIs. The chart that I published in the past was, you know, OpenAI has like five levels of AGI-ness and level, level one is sort of like a chatbots, level two is reasoning, level three is agents, four is organizations, five is some, something super, super human. I don't remember what the exact levels are, but each, you can sort of each match each of them with like tiers. Like 20𝑖𝑠𝑙𝑖𝑘𝑒𝑡ℎ𝑒𝑐ℎ𝑎𝑡𝐺𝐵𝑇𝑡𝑖𝑒𝑟.20islikethechatGBTtier.200 is where you're at. 2,000𝑖𝑠ℎ𝑖𝑔ℎ𝑒𝑟,2,000ishigher,20,000, $200,000, right? Like you can see levels where it makes sense. I think BrightWave is also there, by the way. Like I don't know what BrightWave charges, but it's higher, right? Than a chatGBT. And like, you have to deliver more value for that, but you, you can do it now. Yep. So then why not? Everyone should do it.Eric [01:18:58]: I think we're going to see a lot more of this. I think we're going to see, I think, you know, for AI, Cogen specifically, this is the first moment where I think that there's been that moment where it goes from zero to one, where it's like, yep. The price point, you know, the value, the value is so, like what you can get out of these things is so much higher than it was, you know, three, six months ago that I think we're going to see, I think we're going to see a lot more of this. Like we might, you know, Bolt is, I think one of the first things, but yeah, I mean, it's just, to me, it's inevitable that we're going to see many more things kind of leveraging this, this sort of use case and the amount of efficiency you can get out of usingAlessio [01:19:38]: these systems. Right. So yeah. Yeah. Yeah. Because I mean, the Bolt arbitrage would be quote the price based on the query, you know, you're selling high value tokens. Yeah. It's like, Hey, it's like your mom is like, you wouldn't charge your mom $2,000 to tell her stories, but like, you know, this person doing an app and like a product on it. Yeah. You got to pay more, you know, but it's hard right now. I understand. It's like, it's really hard to figure out how much you can push it, how much value the person will get outSwyx [01:20:04]: of the thing. Yeah. So I want to riff a little bit on like stuff like this, right? I think you nailed a lot with the design system. You know, one of the differences between open source Bolt and the one that you have is actually like you, you spend a lot of time on the design system. I think, right. Most things just look great when they come out, but I think there's also a whole backend portion that they need. Was that a challenge? Is there anything that you sort of like figuring out that you want to riff on? Yeah. So I think one of the main things,Eric [01:20:28]: I think you hit the nail on the head, which is, you know, kind of going into putting Bolt online. We originally, again, we've been selling to developers and so we were kind of like, this is a tool for prototyping and they'll download their code. But we ended up finding in the early user testing was how important the deployment story was and how, and this is something you said to me specifically, you're like backend, this needs to like backend needs to be part of this, like logging in, like off just to triple confirm you're dead right. That has been the absolute number one thing that folks coming to Bolt, you know, are looking to do is build a real app with a backend, with billing. And so one of this guy, Mauricio, he's one of our power users. He's like, there's three things that like every app that I'll ever want to build in Bolt, any of these other people in this community, you know, three things, a database, auth, and payments. So those three things, right. So that's- Admin dashboard. We can do that pretty decently, pretty decently. As in every database needs a WP admin. Yes. Yes. Correct. Totally. Totally. And so, yeah, today I think like viral hooks, for example, I think she's using Firebase for auth and database and that sort of thing. You know, so I think Firebase and Superbase, those are the two things that, that just work incredibly well. And so that's actually the point where we're at now, where, you know, right now it's, you know, folks have to still, you know, kind of go to Superbase, manually spin up a thing, come back to Bolt, but the thing that, you know, it's like that sort of processing thing with Firebase, each of those products are going to have their own little quirks that you have to, there's like kind of steps, right. And so- Boltbase. Yeah. Boltbase. Yeah. I think, yeah, I think initially we're like, okay, there should just be a way to like, for Bolt to just go and spin up these things on their behalf and just, and just, you know, both of them have APIs to do so. I'll go even further, like have like pre-warmSwyx [01:22:12]: instances that you just assign, like it's already spun up, right. So it's, so it's like kind of serverless feeling, even as like, not really, but like yeah, just pre-warm and then just kind of assign it when, whenever someone like- That's a really great point. Yeah. Just keep, keep oneEric [01:22:26]: Firebase in the hopper, basically. One, 10, 100, I don't know. More generally, this is what I feltSwyx [01:22:32]: that I wanted to do on our call, which is like, when you have PMF, yes, you want to invest some time in like understanding your customers and do a data analytics and like tighten, tighten things up in general, like tighten up the pricing, tighten up the cost and all that. But then like, you also have to work on like, what is next, like the next level and growth, like you can still inflect. Yeah. I don't know what that is, but you know, I wanted to, I wanted to keep pushing you and I don't know if I did, mostly because I was serving as facilitator on that call. That's what I think. Like, I think you got to still keep pushing the frontier and I don't know what it, what it is, but like, you know, I want to hear what you got thinking about.Eric [01:23:07]: I think there's, you know, we've addressed just a lot of the low hanging P0 stuff then, and we've actually seen, we've kind of the, you know, there's, there's key moments where it's just kind of like been going like that, which has been cool. Cause it's like, okay, well we were, we're just getting started. This is just the, this is just the fixing obvious things part. Fundamentally, I think a lot, what a lot of people are coming here to do is just, how can we just make it faster to go from idea to production? And a lot of it is like, I had, when I have to go to Firebase, Superbase, spin something up, run a migrate, you know, like add a table, but it's like the agent can do that, you know, so that stuff should be baked in. Yeah. And same thing with the deployment side. It's like right now it's going to Netlify, but people have to create a Netlify account and go and do that. Right. And so I think one of the things we're going to end up doing here is just having the hosting be baked in. And so I've been talking with Matt over at Netlify about this, cause they actually have a way to kind of white label stuff. And so, cause people are, they're just going to make a website, you know? And so it's I mean, that means also you take over domain registration. Can you imagine, right? Like a couple of months from now, you come to this thing, you're like, I want to make, I want to make an RSVP site. Right. And it's like, great. Do you, you know, do you have a name for it? Or do you want to, you know, a domain? You're like, I don't know a name. It's like, well, here's like 10 options and the.coms are able to look good. Yep. That one does. Okay. We want to buy it. Okay, great. It bought the DNS is pointed at the thing. Should we start building this? Okay. Does this look good? Yep. Okay. Am I okay to push this to prod? Yep. That looks good. You know, like that's without leaving the product.Swyx [01:24:31]: Right. So to me, like it's tomorrow was the first to actually say like you are the new Wix. I never, I personally never thought about it that way. Wix is a $10 billion company where you want to go, you know, cause you still have a choice here. From what we're hearing from the folks usingEric [01:24:43]: the product, I think I don't even think Wix is even able to solve their need, you know? But not to say that we don't want to, you know, that, that what you're saying is now we want, but, but I mean, yeah, like I think we want to solve folks problems. And I think that there's a huge gap in the market of being able to build, you know, kind of more sophisticated, high quality software like websites in a way that for someone who's a non-engineer. And so I think there's a huge market for that. And obviously, even if you're trying to build a wedding website, yeah, this is, this is easier and faster. Right. So I love it. I, you know, again, coming to the origins of why Albert, my co-founder and I are doing this is we've always just loved building stuff on the web. It's like this, I, this is the tool from what, even when stack was just the IDE interface to the technology, it's like, this is the thing we wish we had when we were 13 years old, you know? And with Bolt, oh my God, if this is the thing I wish we had when we were 13 years old, I'm so glad that my daughter's going to have this thing, you know? So anyways, yeah, I think it makes me pretty, pretty stoked that people are going to be able to actually build amazing web applications that can do really sophisticated things, you know? So yes, I think the short answer is heck yeah. I mean, yeah, that sort of market and totally right up our alley. One other angle that I wanted to pursue wasSwyx [01:25:53]: also the other languages. You know, you're very JavaScript centric. We've talked about Python forever. Ruby maybe, is that important? You know, like the previous generation of site builders were mostly Ruby shops and some PHP. Do we want to capture that or are we just like, you know, always been on JavaScript and just let JavaScript take over the world? You know, I think, I thinkEric [01:26:14]: we're, we're, we're certainly with great interest interested in other languages and we have like minimal support of Python and some C++ stuff in web container that you can like run or whatever. I think especially with the, with the stuff we're seeing though, it's the languages is kind of ancillary to the, to the, to the thing. Well, there's the ecosystem of like,Swyx [01:26:31]: I want to end up with a code base that I can hire humans on to do the stuff that Bolt cannot do.Eric [01:26:36]: Yeah, true. And I think, I think in that sense, like the, the, the JavaScript Node.js ecosystem is huge and well-established. So it's like, I think it'd be certainly be able to get people to work on this stuff. And I think the only thing that would be missing is it's like, are you building web apps that where a lot of the functionality is only in libraries that are in Python or something. Right. And I think just kind of seeing the applications that are being built here at, you know, I think that'd be like data science and like ML and that sort of thing. And so that's, we're not seeing a lot of that stuff, you know? And then, but I think that's like, we're like kind of a more generic approach is like what Repl.it's doing where they're spinning up real VMs. You can kind of run anything. And I think they started off with like doing Python service. I actually haven't tried their, their, you know, their new agent stuff that's based on.Swyx [01:27:15]: Repl.it agent. Yeah. We're close friends. Repl.it has the database, the sort of live hosting, everything integrated that you're going to want to build. And you're, I think you're on a collision course with them, to be honest.Eric [01:27:29]: We'll see. Cause I'm curious, you're not the first person to say that. I'm curious to see how it shakes out. Cause I think the challenge is focus. You know, when you are, what's kind of the end goal that you're shooting? Yeah, Repl.it's firmly for developers.Swyx [01:27:45]: You're positioning it for non-developers like that. That's legit.Eric [01:27:48]: Yeah. And even getting, even if focusing on a language or an ecosystem as well, because again, the problem is that these things can just break in a million ways. And so part of the, a lot of the work in making the experience better, like how do you get, like how make it, someone get an idea into the fingertips and live on prod, right? There's so much stuff in between there. And a lot of it is just errors that happen and how do you handle those? And a lot of that comes down to having a giant database of common errors that you can maybe even fine tune stuff on at some point, right? So doing that on, on one ecosystem, you can move a lot faster than if you're trying to support a lot of different languages. However, it's a, to the point of, if you're kind of targeting developers, they may not need that level of kind of streamline, you know, thing. I think that's kind of where I see the main divergence is that we are unabashedly focused on this ecosystem of, for building web apps. Got it. Yeah. You support it forever. Yeah. And so I'm very curious to see how, just how it all shakes out. Cause it's, I think what they're doing is actually, I mean, I'm very curious to see what Microsoft does because if anyone is good at giving out VMs, tying it to a coder and putting AI in it, it's Sia. He's got a cloud. He's got VS code. They've got code spaces. They've they're in open AI. Now they've got Anthropic and Copilot. I mean, I must imagine, I must imagine that they're cooking stuff overSwyx [01:29:06]: there, you know? We'll make sure to ask him. We have many friends from Microsoft listening to theAlessio [01:29:11]: pod. So just to wrap, I don't know, is there anything else Bolt related? I just have one personal question before we wrap the pod. Maybe like just advice, like now that you'veSwyx [01:29:20]: been through this journey, right? Advice to your former self. Oh, okay. Yeah. At which point? Advice yourself, like thinking about, there are many founders out there with a business where they're like, they're working really hard at it. It's interesting, but it's not an AI business. Yeah. And you kind of took the plunge to invest in this and it worked out for you. Maybe a lot of people are like, okay, like, you know, this guy got lucky. Obviously there's a little bit of luck in everything, but like, how do you improve your chances? Like, would you say, go for it? Would you say everyone should go for it? How would you advise someone who was in your shoes and thinking about, you know, maybe I should have a second product. Maybe I should take this, this experiment or maybe it doesn't work out. Like what is, what's the calculus here?Eric [01:30:01]: Yeah. We were deeply skeptical going. I remember the conversation you and I had, you know, I was like this, I think there's something here. At that point we had built some amount, but I had waited a long time to give you the call. I said, this is your moment. Well, it was. So I remember specifically at the beginning of the conversation with Sean, he and I sat down at a coffee shop and, and, and SF, and, and so I was kind of giving him the pitch of like, you know, I think we have, I think that I can't remember the exact framing. I said, but it's, it's, it was obvious that Sean had heard a lot of people say this exact thing to him over the past year or two, which is like, Hey man, we've gotten AI play. Like this is our thing plus AI equals this, this could be crazy. And Sean, I get, you gave me this like skeptical look and then, and I was like, I really think so. And kind of here's why. Right. And and I think, I think that's, it's actually, I think it's, that is internally having, being skeptical of just kind of going and jumping on hype trains is, is good. Cause it's like, I think you, you know, your focus and your time and what you're putting your weight into is the most important thing when you're a founder. I think for us, like we actually, again, like I had mentioned at the beginning of this, you know, we had tried bold and didn't see the results and that was like a two week sprint and we rolled it back. Right. This, this isn't viable at this point, but then when, you know, once we, once we saw real tangible results of, you know, some of the new stuff, right. Okay. That, that changes. Thanks. And I think a lot of it is, is two is going and finding that out for yourself and then going and talking to the smartest people, you know, with more domain knowledge on that stuff than you have and going, here's kind of what we found. Does this track? So when Sean and I met and he, and he, and you know, we keep, he and I kind of, he saw it, we talked through it and he said, this is your moment. I specifically remember that. Cause I, I walked away from that and I was like, holy s**t, this, this is it. Like this, you know, like Sean's Sean's at the intersection of web and AI and as like, it, you know, has one of the best perspectives on this stuff of, of anyone I know that put a huge wind in our sales, honestly, of just like, okay, let's, let's go and really, let's go and double down here because you know, we had conviction before, but having someone who's in the space independently kind of verify meant a lot, you know, so it makes me uncomfortable, but thank you. I get it. I mean, and I waited, I waited until I was pretty darn sure it was not going to be a waste of time toAlessio [01:32:12]: cool. Well, that's all I have. Yeah. And then on the personal side, you had a baby in April, you ran an Ironman in October. Now it's November.Swyx [01:32:20]: He did Ironman while launching ball. I was trying to schedule the call for him and he was like, Nope, I'm sorry. I'm swimming. I was like, Hey, I'm on the swimming session. For those who don't know, actually, I did not know. I don't even know the distance of an Ironman. 13 hours. Your time was 12, 12, 12, 12, 15, 12, 15.Eric [01:32:41]: Give me my minutes. No, no, I, it's, it can, it can completely depends on, you know, the course and just the, the, the person or whatever, right. And, but yeah, I mean, it's,Swyx [01:32:51]: it's 2.4 cam open water, 2.4 mile open water swim, a hundred KM, a hundred mile, a hundred KMEric [01:32:58]: cycle. I think it's like, I think it's 112 mile a bike and then marathon. Yeah. Full 26.2 mile marathon. Yeah. It was why. Yeah. And you weren't, you were not like a super endurance athlete before, right? Like let's like make this clear. Yeah. Kind of a wild, a wild thing. So I, you know, back when I did, we, we had our daughter in April and at that time we were, the future of the company was, you know, we're, we're figuring out what are we going to do here at that time. It was, it was pro just prior to bolt kind of getting kicked into, you know, the rebirth of it with the new models and stuff. And so I knew that it was going to be, you know, having, having a child is, you know, if you talk to anyone that's done that you're, you don't have a lot of sleep. It's it's, you know, there's a lot of, you know, to, to, to be a great parent is, is a ton of work. And then also being a startup CEO where there's a lot of uncertainty or whatever the way I've always found, like when I have to go and you kind of knock it out of the park and all aspects of my life is, is going, yeah, just to, to make it all aspects of my life. And so I was, I just won. Yeah. I woke up one day, I was like, all right, I'm going to do an Ironman this year and I burned the ships, bought the, it's cost a thousand bucks to do. These didn't know that. And, you know, just started, I'd never ran a marathon at that point. And so I think it was like 45 or 60 days after that, I ran a marathon. My brother-in-law, he's, that was even more insane two weeks before the marathon. I was like, Hey, you want to run a marathon in two weeks? He's like, sure. And, and just did it with me. He did not an endurance athlete either. Right. But anyway, so yeah, so I was training, ended up getting a coach who's usually go, you're kind of online. He's up in Marin. Great guy was on the U S Olympic team for triathlons. And when I told him, okay, I'm going to, I'm doing Ironman, California in three months, he was like, are you insane? You know, like, what are you, you know, you'd ask for my opinion, but like, I just want you to know, I don't think this is a good idea. I think, you know, like you shouldn't do this, et cetera. And I ended up doing it, you know, I ended up getting it done. And so he was like, okay, like that's pretty bad. But what makes you, what makes you ignore expert advice here? LikeSwyx [01:34:59]: most sane people would be, would be like, okay, I mean, you know what you're doing? Like,Eric [01:35:03]: I'll maybe wait a year. I think, and this is, this is kind of the, and the being a founder, right. It's, it's all about like, if you, like I mentioned earlier, it's like when we talk to people that worked on browser engines, they're like, you can't, you can't build what you're talking about. I think the job of a founder is, is to, is to solicit that advice. And, and what my coach actually said, he was right about certain things. There are certain areas where I was under indexed on, like, I was not, you know, spending nearly enough time on my bike, for example. Like after that, I was on my bike six hours a day on the weekends. That's a lot of time to spend in the saddle. Just like, just kind of, you know, and that was like, you know, for a couple of months leading up to it, he was right on, on certain aspects of it. And, but I kind of had to look internally and go, okay, like, what is he kind of missing about who I am and like, what I kind of know I'm capable of at this point. I mean, it was a nail biter. I mean, going into the thing, you know, it's, you get in, this is the same thing with launching bolt. It's like, or, or launching anything you get launch day, race day, you kind of go in, you're like, all right, here we go. Like we're going to, we're going to find out, we're going to find out, you know, how based in reality I was about all the decisions that led to this moment. And so I was going and doing the Ironman in like six months. Most people spend, you know, the, the folks he trains, usually it's, you know, one to two years on this stuff before you do try and do a full, you know, it's like going and kind of doing in that sort of timeframe. It's, it's, it's very similar to the same sort of skill set of going and building products. You have to really kind of look at the base reality and go make your own assessment onAlessio [01:36:24]: it. Right. So cool. Great. Sorry to wrap. Thank you so much here. Thanks for your time. Get full access to Latent Space at www.latent.space/subscribe
    --------  
    1:38:39
  • The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic
    We have announced our first speaker, friend of the show Dylan Patel, and topic slates for Latent Space LIVE! at NeurIPS. Sign up for IRL/Livestream and to debate!We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!The vibe shift we observed in July - in favor of Claude 3.5 Sonnet, first introduced in June — has been remarkably long lived and persistent, surviving multiple subsequent updates of 4o, o1 and Gemini versions, for Anthropic’s Claude to end 2024 as the preferred model for AI Engineers and even being the exclusive choice for new code agents like bolt.new (our next guest on the pod!), which unlocked so much performance from Claude Sonnet that it went from $0 to $4m ARR in 4 weeks when it launched last month.Anthropic has now raised an additional $4b from Amazon and made an incredibly well received update of Claude 3.5 Sonnet (and Haiku), making significant improvements in performance over its predecessors:Solving SWE-BenchAs part of the October Sonnet release, Anthropic teased a blink-and-you’ll miss it result:The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain. The new Claude 3.5 Sonnet offers these advancements at the same price and speed as its predecessor.This was followed up by a blogpost a week later from today’s guest, Erik Schluntz, the engineer who implemented and scored this SOTA result using a simple, non-overengineered version of the SWE-Agent framework (you can see the submissions here). We have previously covered the SWE-Bench story extensively:* Speaking with SWEBench/SWEAgent authors at ICLR* Speaking with Cosine Genie, the previous SOTA (43.8%) on SWEBench Verified (with brief update at DevDay 2024)* Speaking with Shunyu Yao on SWEBench and the ReAct paradigm driving SWE-AgentOne of the notable inclusions in this blogpost are the tools that Erik decided to give Claude, e.g. the “Edit Tool”:The tools teased in the SWEBench submission/blogpost were then polished up and released with Computer Use…And you can also see even more computer use tools given in the new Model Context Protocol servers:Claude Computer UseBecause it is one of the best received AI releases of the year, we recommend watching the 2 minute Computer Use intro (and related demos) in its entirety:Eric also worked on Claude’s function calling, tool use, and computer use APIs, so we discuss that in the episode.Erik [00:53:39]: With computer use, just give the thing a browser that's logged into what you want to integrate with, and it's going to work immediately. And I see that reduction in friction as being incredibly exciting. Imagine a customer support team where, okay, hey, you got this customer support bot, but you need to go integrate it with all these things. And you don't have any engineers on your customer support team. But if you can just give the thing a browser that's logged into your systems that you need it to have access to, now, suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that could go do all the actions you care about. So I think that's the most exciting thing for me about computer use, is reducing that friction of integrations to almost zero.As you’ll see, this is very top of mind for Erik as a former Robotics founder who’s company basically used robots to interface with human physical systems like elevators.Full Video episodePlease like and subscribe!Show Notes* Eric Schluntz* “Raising the bar on SWE-Bench Verified”* Cobalt Robotics* SWE-Bench* SWE-Bench Verified* Human Eval & other benchmarks* Anthropic Workbench* Aider* Cursor* Fireworks AI* E2B* Amanda Askell* Toyota Research* Physical Intelligence (Pi)* Chelsea Finn* Josh Albrecht* Eric Jang* 1X* Dust* Cosine Episode* Bolt* Adept Episode* TauBench* LMSys EpisodeTimestamps* [00:00:00] Introductions* [00:03:39] What is SWE-Bench?* [00:12:22] SWE-Bench vs HumanEval vs others* [00:15:21] SWE-Agent architecture and runtime* [00:21:18] Do you need code indexing?* [00:24:50] Giving the agent tools* [00:27:47] Sandboxing for coding agents* [00:29:16] Why not write tests?* [00:30:31] Redesigning engineering tools for LLMs* [00:35:53] Multi-agent systems* [00:37:52] Why XML so good?* [00:42:57] Thoughts on agent frameworks* [00:45:12] How many turns can an agent do?* [00:47:12] Using multiple model types* [00:51:40] Computer use and agent use cases* [00:59:04] State of AI robotics* [01:04:24] Robotics in manufacturing* [01:05:01] Hardware challenges in robotics* [01:09:21] Is self-driving a good business?TranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners. And today we're in the new studio with my usual co-host, Shawn from Smol AI.Swyx [00:00:14]: Hey, and today we're very blessed to have Erik Schluntz from Anthropic with us. Welcome.Erik [00:00:19]: Hi, thanks very much. I'm Erik Schluntz. I'm a member of technical staff at Anthropic, working on tool use, computer use, and Swebench.Swyx [00:00:27]: Yeah. Well, how did you get into just the whole AI journey? I think you spent some time at SpaceX as well? Yeah. And robotics. Yeah. There's a lot of overlap between like the robotics people and the AI people, and maybe like there's some interlap or interest between language models for robots right now. Maybe just a little bit of background on how you got to where you are. Yeah, sure.Erik [00:00:50]: I was at SpaceX a long time ago, but before joining Anthropic, I was the CTO and co-founder of Cobalt Robotics. We built security and inspection robots. These are sort of five foot tall robots that would patrol through an office building or a warehouse looking for anything out of the ordinary. Very friendly, no tasers or anything. We would just sort of call a remote operator if we saw anything. We have about 100 of those out in the world, and had a team of about 100. We actually got acquired about six months ago, but I had left Cobalt about a year ago now, because I was starting to get a lot more excited about AI. I had been writing a lot of my code with things like Copilot, and I was like, wow, this is actually really cool. If you had told me 10 years ago that AI would be writing a lot of my code, I would say, hey, I think that's AGI. And so I kind of realized that we had passed this level, like, wow, this is actually really useful for engineering work. That got me a lot more excited about AI and learning about large language models. So I ended up taking a sabbatical and then doing a lot of reading and research myself and decided, hey, I want to go be at the core of this and joined Anthropic.Alessio [00:01:53]: And why Anthropic? Did you consider other labs? Did you consider maybe some of the robotics companies?Erik [00:02:00]: So I think at the time I was a little burnt out of robotics, and so also for the rest of this, any sort of negative things I say about robotics or hardware is coming from a place of burnout, and I reserve my right to change my opinion in a few years. Yeah, I looked around, but ultimately I knew a lot of people that I really trusted and I thought were incredibly smart at Anthropic, and I think that was the big deciding factor to come there. I was like, hey, this team's amazing. They're not just brilliant, but sort of like the most nice and kind people that I know, and so I just felt like I could be a really good culture fit. And ultimately, I do care a lot about AI safety and making sure that I don't want to build something that's used for bad purposes, and I felt like the best chance of that was joining Anthropic.Alessio [00:02:39]: And from the outside, these labs kind of look like huge organizations that have these obscureSwyx [00:02:44]: ways to organize.Alessio [00:02:45]: How did you get, you joined Anthropic, did you already know you were going to work on of the stuff you publish or you kind of join and then you figure out where you land? I think people are always curious to learn more.Erik [00:02:57]: Yeah, I've been very happy that Anthropic is very bottoms up and sort of very sort of receptive to whatever your interests are. And so I joined sort of being very transparent of like, hey, I'm most excited about code generation and AI that can actually go out and sort of touch the world or sort of help people build things. And, you know, those weren't my initial initial projects. I also came in and said, hey, I want to do the most valuable possible thing for this company and help Anthropic succeed. And, you know, like, let me find the balance of those. So I was working on lots of things at the beginning, you know, function calling, tool use. And then sort of as it became more and more relevant, I was like, oh, hey, like, let's it's time to go work on encoding agents and sort of started looking at SWE-Bench as sort of a really good benchmark for that.Swyx [00:03:39]: So let's get right into SWE-Bench. That's one of the many claims to fame. I feel like there's just been a series of releases related with Cloud 3.5 Sonnet around about two or three months ago, 3.5 Sonnet came out and it was it was a step ahead in terms of a lot of people immediately fell in love with it for coding. And then last month you released a new updated version of Cloud Sonnet. We're not going to talk about the training for that because that's still confidential. But I think Anthropic's done a really good job, like applying the model to different things. So you took the lead on SWE-Bench, but then also we're going to talk a little bit about computer use later on. So maybe just give us a context about why you looked at SWE-Bench Verified and you actually came up with a whole system for building agents that would maximally use the model well. Yeah.Erik [00:04:28]: So I'm on a sub team called Product Research. And basically the idea of product research is to really understand what end customers care about and want in the models and then work to try to make that happen. So we're not focused on sort of these more abstract general benchmarks like math problems or MMLU, but we really care about finding the things that are really valuable and making sure the models are great at those. And so because I've been interested in coding agents, I knew that this would be a really valuable thing. And I knew there were a lot of startups and our customers trying to build coding agents with our models. And so I said, hey, this is going to be a really good benchmark to be able to measure that and do well on it. And I wasn't the first person at Anthropic to find SWE-Bench, and there are lots of people that already knew about it and had done some internal efforts on it. It fell to me to sort of both implement the benchmark, which is very tricky, and then also to sort of make sure we had an agent and basically like a reference agent, maybe I'd call it, that could do very well on it. Ultimately, we want to provide how we implemented that reference agent so that people can build their own agents on top of our system and get sort of the most out of it as possible. So with this blog post we released on SWE-Bench, we released the exact tools and the prompt that we gave the model to be able to do well.Swyx [00:05:46]: For people who don't know, who maybe haven't dived into SWE-Bench, I think the general perception is they're like tasks that a software engineer could do. I feel like that's an inaccurate description because it is basically, one, it's a subset of like 12 repos. It's everything they could find that every issue with like a matching commit that could be tested. So that's not every commit. And then SWE-Bench verified is further manually filtered by OpenAI. Is that an accurate description and anything you'd change about that? Yes.Erik [00:06:14]: SWE-Bench is, it certainly is a subset of all tasks. It's first of all, it's only Python repos, so already fairly limited there. And it's just 12 of these popular open source repos. And yes, it's only ones where there were tests that passed at the beginning and also new tests that were introduced that test the new feature that's added. So it is, I think, a very limited subset of real engineering tasks. But I think it's also very valuable because even though it's a subset, it is true engineering tasks. And I think a lot of other benchmarks are really kind of these much more artificial setups of even if they're related to coding, they're more like coding interview style questions or puzzles that I think are very different from day-to-day what you end up doing. I don't know how frequently you all get to use recursion in your day-to-day job, but whenever I do, it's like a treat. And I think it's almost comical, and a lot of people joke about this in the industry, is how different interview questions are.Swyx [00:07:13]: Dynamic programming. Yeah, exactly.Erik [00:07:15]: Like, you code. From the day-to-day job. But I think one of the most interesting things about SWE-Bench is that all these other benchmarks are usually just isolated puzzles, and you're starting from scratch. Whereas SWE-Bench, you're starting in the context of an entire repository. And so it adds this entirely new dimension to the problem of finding the relevant files. And this is a huge part of real engineering, is it's actually pretty rare that you're starting something totally greenfield. You need to go and figure out where in a codebase you're going to make a change and understand how your work is going to interact with the rest of the systems. And I think SWE-Bench does a really good job of presenting that problem.Alessio [00:07:51]: Why do we still use human eval? It's like 92%, I think. I don't even know if you can actually get to 100% because some of the data is not actuallySwyx [00:07:59]: solvable.Alessio [00:08:00]: Do you see benchmarks like that, they should just get sunsetted? Because when you look at the model releases, it's like, oh, it's like 92% instead of like 89%, 90% on human eval versus, you know, SWE-Bench verified is you have 49%, right? Which is like, before 45% was state of the art, but maybe like six months ago it was like 30%, something like that. So is that a benchmark that you think is going to replace human eval, or do you think they're just going to run in parallel?Erik [00:08:27]: I think there's still need for sort of many different varied evals. Like sometimes you do really care about just sort of greenfield code generation. And so I don't think that everything needs to go to sort of an agentic setup.Swyx [00:08:39]: It would be very expensive to implement.Erik [00:08:41]: The other thing I was going to say is that SWE-Bench is certainly hard to implement and expensive to run because each task, you have to parse, you know, a lot of the repo to understand where to put your code. And a lot of times you take many tries of writing code, running it, editing it. It can use a lot of tokens compared to something like human eval. So I think there's definitely a space for these more traditional coding evals that are sort of easy to implement, quick to run, and do get you some signal. Maybe hopefully there's just sort of harder versions of human eval that get created.Alessio [00:09:14]: How do we get SWE-Bench verified to 92%? Do you think that's something where it's like line of sight to it, or it's like, you know, we need a whole lot of things to go right? Yeah, yeah.Erik [00:09:23]: And actually, maybe I'll start with SWE-Bench versus SWE-Bench verified, which is I think something I missed earlier. So SWE-Bench is, as we described, this big set of tasks that were scraped.Swyx [00:09:33]: Like 12,000 or something?Erik [00:09:34]: Yeah, I think it's 2,000 in the final set. But a lot of those, even though a human did them, they're actually impossible given the information that comes with the task. The most classic example of this is the test looks for a very specific error string. You know, like assert message equals error, something, something, something. And unless you know that's exactly what you're looking for, there's no way the model is going to write that exact same error message, and so the tests are going to fail. So SWE-Bench verified was actually made in partnership with OpenAI, and they hired humans to go review all these tasks and pick out a subset to try to remove any obstacle like this that would make the tasks impossible. So in theory, all of these tasks should be fully doable by the model. And they also had humans grade how difficult they thought the problems would be. Between less than 15 minutes, I think 15 minutes to an hour, an hour to four hours, and greater than four hours. So that's kind of this interesting sort of how big the problem is as well. To get to SWE-Bench verified to 90%, actually, maybe I'll also start off with some of the remaining failures that I see when running our model on SWE-Bench. I'd say the biggest cases are the model sort of operates at the wrong level of abstraction. And what I mean by that is the model puts in maybe a smaller band-aid when really the task is asking for a bigger refactor. And some of those, you know, is the model's fault, but a lot of times if you're just sort of seeing the GitHub issue, it's not exactly clear which way you should do. So even though these tasks are possible, there's still some ambiguity in how the tasks are described. That being said, I think in general, language models frequently will produce a smaller diff when possible, rather than trying to do a big refactor. I think another area, at least the agent we created, didn't have any multimodal abilities, even though our models are very good at vision. So I think that's just a missed opportunity. And if I read through some of the traces, there's some funny things where, especially the tasks on matplotlib, which is a graphing library, the test script will save an image and the model will just say, okay, it looks great, you know, without looking at it. So there's certainly extra juice to squeeze there of just making sure the model really understands all the sides of the input that it's given, including multimodal. But yeah, I think like getting to 92%. So this is something that I have not looked at, but I'm very curious about. I want someone to look at, like, what is the union of all of the different tasks that have been solved by at least one attempt at SWE-Bench Verified. There's a ton of submissions to the benchmark, and so I'd be really curious to see how many of those 500 tasks at least someone has solved. And I think, you know, there's probably a bunch that none of the attempts have ever solved. And I think it'd be interesting to look at those and say, hey, is there some problem with these? Like, are these impossible? Or are they just really hard and only a human could do them?Swyx [00:12:22]: Yeah, like specifically, is there a category of problems that are still unreachable by any LLM agent? Yeah, yeah. And I think there definitely are.Erik [00:12:28]: The question is, are those fairly inaccessible or are they just impossible because of the descriptions? But I think certainly some of the tasks, especially the ones that the human graders reviewed as like taking longer than four hours are extremely difficult. I think we got a few of them right, but not very many at all in the benchmark.Swyx [00:12:49]: And did those take less than four hours?Erik [00:12:51]: They certainly did less than, yeah, than four hours.Swyx [00:12:54]: Is there a correlation of length of time with like human estimated time? You know what I mean? Or do we have sort of more of X paradox type situations where it's something super easy for a model, but hard for a human?Erik [00:13:06]: I actually haven't done the stats on that, but I think that'd be really interesting to see of like how many tokens does it take and how is that correlated with difficulty? What is the likelihood of success with difficulty? I think actually a really interesting thing that I saw, one of my coworkers who was also working on this named Simon, he was focusing just specifically on the very hard problems, the ones that are said to take longer than four hours. And he ended up sort of creating a much more detailed prompt than I used. And he got a higher score on the most difficult subset of problems, but a lower score overall on the whole benchmark. And the prompt that I made, which is sort of much more simple and bare bones, got a higher score on the overall benchmark, but lower score on the really hard problems. And I think some of that is the really detailed prompt made the model sort of overcomplicate a lot of the easy problems, because honestly, a lot of the suite bench problems, they really do just ask for a bandaid where it's like, hey, this crashes if this is none, and really all you need to do is put a check if none. And so sometimes trying to make the model think really deeply, it'll think in circles and overcomplicate something, which certainly human engineers are capable of as well. But I think there's some interesting thing of the best prompt for hard problems might not be the best prompt for easy problems.Alessio [00:14:19]: How do we fix that? Are you supposed to fix it at the model level? How do I know what prompt I'm supposed to use?Swyx [00:14:25]: Yeah.Erik [00:14:26]: And I'll say this was a very small effect size, and so I think this isn't worth obsessing over. I would say that as people are building systems around agents, I think the more you can separate out the different kinds of work the agent needs to do, the better you can tailor a prompt for that task. And I think that also creates a lot of like, for instance, if you were trying to make an agent that could both solve hard programming tasks, and it could just write quick test files for something that someone else had already made, the best way to do those two tasks might be very different prompts. I see a lot of people build systems where they first sort of have a classification, and then route the problem to two different prompts. And that's sort of a very effective thing, because one, it makes the two different prompts much simpler and smaller, and it means you can have someone work on one of the prompts without any risk of affecting the other tasks. So it creates like a nice separation of concerns. Yeah.Alessio [00:15:21]: And the other model behavior thing you mentioned, they prefer to generate like shorter diffs. Why is that? Like, is there a way? I think that's maybe like the lazy model question that people have is like, why are you not just generating the whole code instead of telling me to implement it?Swyx [00:15:36]: Are you saving tokens? Yeah, exactly. It's like conspiracy theory. Yeah. Yeah.Erik [00:15:41]: Yeah. So there's two different things there. One is like the, I'd say maybe like doing the easier solution rather than the hard solution. And I'd say the second one, I think what you're talking about is like the lazy model is like when the model says like dot, dot, dot, code remains the same.Swyx [00:15:52]: Code goes here. Yeah. I'm like, thanks, dude.Erik [00:15:55]: But honestly, like that just comes as like people on the internet will do stuff like that. And like, dude, if you're talking to a friend and you ask them like to give you some example code, they would definitely do that. They're not going to reroll the whole thing. And so I think that's just a matter of like, you know, sometimes you actually do just, just want like the relevant changes. And so I think it's, this is something where a lot of times like, you know, the models aren't good at mind reading of like which one you want. So I think that like the more explicit you can be in prompting to say, Hey, you know, give me the entire thing, no, no elisions versus just give me the relevant changes. And that's something, you know, we want to make the models always better at following those kinds of instructions.Swyx [00:16:32]: I'll drop a couple of references here. We're recording this like a day after Dario, Lex Friedman just dropped his five hour pod with Dario and Amanda and the rest of the crew. And Dario actually made this interesting observation that like, we actually don't want, we complain about models being too chatty in text and then not chatty enough in code. And so like getting that right is kind of a awkward bar because, you know, you, you don't want it to yap in its responses, but then you also want it to be complete in, in code. And then sometimes it's not complete. Sometimes you just want it to diff, which is something that Enthopic has also released with a, you know, like the, the fast edit stuff that you guys did. And then the other thing I wanted to also double back on is the prompting stuff. You said, you said it was a small effect, but it was a noticeable effect in terms of like picking a prompt. I think we'll go into suite agent in a little bit, but I kind of reject the fact that, you know, you need to choose one prompt and like have your whole performance be predicated on that one prompt. I think something that Enthopic has done really well is meta prompting, prompting for a prompt. And so why can't you just develop a meta prompt for, for all the other prompts? And you know, if it's a simple task, make a simple prompt, if it's a hard task, make a hard prompt. Obviously I'm probably hand-waving a little bit, but I will definitely ask people to try the Enthopic Workbench meta prompting system if they haven't tried it yet. I went to the Build Day recently at Enthopic HQ, and it's the closest I've felt to an AGI, like learning how to operate itself that, yeah, it's, it's, it's really magical.Erik [00:17:57]: Yeah, no, Claude is great at writing prompts for Claude.Swyx [00:18:00]: Right, so meta prompting. Yeah, yeah.Erik [00:18:02]: The way I think about this is that humans, even like very smart humans still use sort of checklists and use sort of scaffolding for themselves. Surgeons will still have checklists, even though they're incredible experts. And certainly, you know, a very senior engineer needs less structure than a junior engineer, but there still is some of that structure that you want to keep. And so I always try to anthropomorphize the models and try to think about for a human sort of what is the equivalent. And that's sort of, you know, how I think about these things is how much instruction would you give a human with the same task? And do you, would you need to give them a lot of instruction or a little bit of instruction?Alessio [00:18:36]: Let's talk about the agent architecture maybe. So first, runtime, you let it run until it thinks it's done or it reaches 200k context window.Swyx [00:18:45]: How did you come up? What's up with that?Erik [00:18:47]: Yeah.Swyx [00:18:48]: Yeah.Erik [00:18:49]: I mean, this, so I'd say that a lot of previous agent work built sort of these very hard coded and rigid workflows where the model is sort of pushed through certain flows of steps. And I think to some extent, you know, that's needed with smaller models and models that are less smart. But one of the things that we really wanted to explore was like, let's really give Claude the reins here and not force Claude to do anything, but let Claude decide, you know, how it should approach the problem, what steps it should do. And so really, you know, what we did is like the most extreme version of this is just give it some tools that it can call and it's able to keep calling the tools, keep thinking, and then yeah, keep doing that until it thinks it's done. And that's sort of the most, the most minimal agent framework that we came up with. And I think that works very well. I think especially the new Sonnet 3.5 is very, very good at self-correction, has a lot of like grit. Claude will try things that fail and then try, you know, come back and sort of try different approaches. And I think that's something that you didn't see in a lot of previous models. Some of the existing agent frameworks that I looked at, they had whole systems built to try to detect loops and see, oh, is the model doing the same thing, you know, more than three times, then we have to pull it out. And I think like the smarter the models are, the less you need that kind of extra scaffolding. So yeah, just giving the model tools and letting it keep sample and call tools until it thinks it's done was the most minimal framework that we could think of. And so that's what we did.Alessio [00:20:18]: So you're not pruning like bad paths from the context. If it tries to do something, it fails. You just burn all these tokens.Swyx [00:20:25]: Yes.Erik [00:20:26]: I would say the downside of this is that this is sort of a very token expensive way to doSwyx [00:20:29]: this. But still, it's very common to prune bad paths because models get stuck. Yeah.Erik [00:20:35]: But I'd say that, yeah, 3.5 is not getting stuck as much as previous models. And so, yeah, we wanted to at least just try the most minimal thing. Now, I would say that, you know, this is definitely an area of future research, especially if we talk about these problems that are going to take a human more than four hours. Those might be things where we're going to need to go prune bad paths to let the model be able to accomplish this task within 200k tokens. So certainly I think there's like future research to be done in that area, but it's not necessary to do well on these benchmarks.Swyx [00:21:06]: Another thing I always have questions about on context window things, there's a mini cottage industry of code indexers that have sprung up for large code bases, like the ones in SweetBench. You didn't need them? We didn't.Erik [00:21:18]: And I think I'd say there's like two reasons for this. One is like SweetBench specific and the other is a more general thing. The more general thing is that I think Sonnet is very good at what we call agentic search. And what this basically means is letting the model decide how to search for something. It gets the results and then it can decide, should it keep searching or is it done? Does it have everything it needs? So if you read through a lot of the traces of the SweetBench, the model is calling tools to view directories, list out things, view files. And it will do a few of those until it feels like it's found the file where the bug is. And then it will start working on that file. And I think like, again, this is all, everything we did was about just giving Claude the full reins. So there's no hard-coded system. There's no search system that you're relying on getting the correct files into context. This just totally lets Claude do it.Swyx [00:22:11]: Or embedding things into a vector database. Exactly. Oops. No, no.Erik [00:22:17]: This is very, very token expensive. And so certainly, and it also takes many, many turns. And so certainly if you want to do something in a single turn, you need to do RAG and just push stuff into the first prompt.Alessio [00:22:28]: And just to make it clear, it's using the Bash tool, basically doing LS, looking at files and then doing CAD for the following context. It can do that.Erik [00:22:35]: But it's file editing tool also has a command in it called view that can view a directory. It's very similar to LS, but it just sort of has some nice sort of quality of life improvements. So I think it'll only do an LS sort of two directories deep so that the model doesn't get overwhelmed if it does this on a huge file. I would say actually we did more engineering of the tools than the overall prompt. But the one other thing I want to say about this agentic search is that for SWE-Bench specifically, a lot of the tasks are bug reports, which means they have a stack trace in them. And that means right in that first prompt, it tells you where to go. And so I think this is a very easy case for the model to find the right files versus if you're using this as a general coding assistant where there isn't a stack trace or you're asking it to insert a new feature, I think there it's much harder to know which files to look at. And that might be an area where you would need to do more of this exhaustive search where an agentic search would take way too long.Swyx [00:23:33]: As someone who spent the last few years in the JS world, it'd be interesting to see SWE-Bench JS because these stack traces are useless because of so much virtualization that we do. So they're very, very disconnected with where the code problems are actually appearing.Erik [00:23:50]: That makes me feel better about my limited front-end experience, as I've always struggled with that problem.Swyx [00:23:55]: It's not your fault. We've gotten ourselves into a very, very complicated situation. And I'm not sure it's entirely needed. But if you talk to our friends at Vercel, they will say it is.Erik [00:24:04]: I will say SWE-Bench just released SWE-Bench Multimodal, which I believe is either entirely JavaScript or largely JavaScript. And it's entirely things that have visual components of them.Swyx [00:24:15]: Are you going to tackle that? We will see.Erik [00:24:17]: I think it's on the list and there's interest, but no guarantees yet.Swyx [00:24:20]: Just as a side note, it occurs to me that every model lab, including Enthopic, but the others as well, you should have your own SWE-Bench, whatever your bug tracker tool. This is a general methodology that you can use to track progress, I guess.Erik [00:24:34]: Yeah, sort of running on our own internal code base.Swyx [00:24:36]: Yeah, that's a fun idea.Alessio [00:24:37]: Since you spend so much time on the tool design, so you have this edit tool that can make changes and whatnot. Any learnings from that that you wish the AI IDEs would take in? Is there some special way to look at files, feed them in?Erik [00:24:50]: I would say the core of that tool is string replace. And so we did a few different experiments with different ways to specify how to edit a file. And string replace, basically, the model has to write out the existing version of the string and then a new version, and that just gets swapped in. We found that to be the most reliable way to do these edits. Other things that we tried were having the model directly write a diff, having the model fully regenerate files. That one is actually the most accurate, but it takes so many tokens, and if you're in a very big file, it's cost prohibitive. There's basically a lot of different ways to represent the same task. And they actually have pretty big differences in terms of model accuracy. I think Eider, they have a really good blog where they explore some of these different methods for editing files, and they post results about them, which I think is interesting. But I think this is a really good example of the broader idea that you need to iterate on tools rather than just a prompt. And I think a lot of people, when they make tools for an LLM, they kind of treat it like they're just writing an API for a computer, and it's sort of very minimal. It's sort of just the bare bones of what you'd need, and honestly, it's so hard for the models to use those. Again, I come back to anthropomorphizing these models. Imagine you're a developer, and you just read this for the very first time, and you're trying to use it. You can do so much better than just sort of the bare API spec of what you'd often see. Include examples in the description. Include really detailed explanations of how things work. And I think that, again, also think about what is the easiest way for the model to represent the change that it wants to make. For file editing, as an example, writing a diff is actually... Let's take the most extreme example. You want the model to literally write a patch file. I think patch files have at the very beginning numbers of how many total lines change. That means before the model has actually written the edit, it needs to decide how many numbers or how many lines are going to change.Swyx [00:26:52]: Don't quote me on that.Erik [00:26:54]: I think it's something like that, but I don't know if that's exactly the diff format. But you can certainly have formats that are much easier to express without messing up than others. And I like to think about how much human effort goes into designing human interfaces for things. It's incredible. This is entirely what FrontEnd is about, is creating better interfaces to kind of do the same things. And I think that same amount of attention and effort needs to go into creating agent computer interfaces.Swyx [00:27:19]: It's a topic we've discussed, ACI or whatever that looks like. I would also shout out that I think you released some of these toolings as part of computer use as well. And people really liked it. It's all open source if people want to check it out. I'm curious if there's an environment element that complements the tools. So how do you... Do you have a sandbox? Is it just Docker? Because that can be slow or resource intensive. Do you have anything else that you would recommend?Erik [00:27:47]: I don't think I can talk about sort of public details or about private details about how we implement our sandboxing. But obviously, we need to have sort of safe, secure, and fast sandboxes for training for the models to be able to practice writing code and working in an environment.Swyx [00:28:03]: I'm aware of a few startups working on agent sandboxing. E2B is a close friend of ours that Alessio has led around in, but also I think there's others where they're focusing on snapshotting memory so that it can do time travel for debugging. Computer use where you can control the mouse or keyboard or something like that. Whereas here, I think that the kinds of tools that we offer are very, very limited to coding agent work cases like bash, edit, you know, stuff like that. Yeah.Erik [00:28:30]: I think the computer use demo that we released is an extension of that. It has the same bash and edit tools, but it also has the computer tool that lets it get screenshots and move the mouse and keyboard. Yeah. So I definitely think there's sort of more general tools there. And again, the tools we released as part of SweetBench were, I'd say they're very specific for like editing files and doing bash, but at the same time, that's actually very general if you think about it. Like anything that you would do on a command line or like editing files, you can do with those tools. And so we do want those tools to feel like any sort of computer terminal work could be done with those same tools rather than making tools that were like very specific for SweetBench like run tests as its own tool, for instance. Yeah.Swyx [00:29:15]: You had a question about tests.Alessio [00:29:16]: Yeah, exactly. I saw there's no test writer tool. Is it because it generates the code and then you're running it against SweetBench anyway, so it doesn't really need to write the test or?Swyx [00:29:26]: Yeah.Erik [00:29:27]: So this is one of the interesting things about SweetBench is that the tests that the model's output is graded on are hidden from it. That's basically so that the model can't cheat by looking at the tests and writing the exact solution. And I'd say typically the model, the first thing it does is it usually writes a little script to reproduce the error. And again, most SweetBench tasks are like, hey, here's a bug that I found. I run this and I get this error. So the first thing the model does is try to reproduce that. So it's kind of been rerunning that script as a mini test. But yeah, sometimes the model will like accidentally introduce a bug that breaks some other tests and it doesn't know about that.Alessio [00:30:05]: And should we be redesigning any tools? We kind of talked about this and like having more examples, but I'm thinking even things of like Q as a query parameter in many APIs, it's like easier for the model to like re-query than read the Q. I'm sure it learned the Q by this point, but like, is there anything you've seen like building this where it's like, hey, if I were to redesign some CLI tools, some API tool, I would like change the way structure to make it better for LLMs?Erik [00:30:31]: I don't think I've thought enough about that off the top of my head, but certainly like just making everything more human friendly, like having like more detailed documentation and examples. I think examples are really good in things like descriptions, like so many, like just using the Linux command line, like how many times I do like dash dash help or look at the man page or something. It's like, just give me one example of like how I actually use this. Like I don't want to go read through a hundred flags. Just give me the most common example. But again, so you know, things that would be useful for a human, I think are also very useful for a model.Swyx [00:31:03]: Yeah. I mean, there's one thing that you cannot give to code agents that is useful for human is this access to the internet. I wonder how to design that in, because one of the issues that I also had with just the idea of a suite bench is that you can't do follow up questions. You can't like look around for similar implementations. These are all things that I do when I try to fix code and we don't do that. It's not, it wouldn't be fair, like it'd be too easy to cheat, but then also it's kind of not being fair to these agents because they're not operating in a real world situation. Like if I had a real world agent, of course I'm giving it access to the internet because I'm not trying to pass a benchmark. I don't have a question in there more, more just like, I feel like the most obvious tool access to the internet is not being used.Erik [00:31:47]: I think that that's really important for humans, but honestly the models have so much general knowledge from pre-training that it's, it's like less important for them. I feel like versioning, you know, if you're working on a newer thing that was like, they came after the knowledge cutoff, then yes, I think that's very important. I think actually this, this is like a broader problem that there is a divergence between Sweebench and like what customers will actually care about who are working on a coding agent for real use. And I think one of those there is like internet access and being able to like, how do you pull in outside information? I think another one is like, if you have a real coding agent, you don't want to have it start on a task and like spin its wheels for hours because you gave it a bad prompt. You want it to come back immediately and ask follow up questions and like really make sure it has a very detailed understanding of what to do, then go off for a few hours and do work. So I think that like real tasks are going to be much more interactive with the agent rather than this kind of like one shot system. And right now there's no benchmark that, that measures that. And maybe I think it'd be interesting to have some benchmark that is more interactive. I don't know if you're familiar with TauBench, but it's a, it's a customer service benchmark where there's basically one LLM that's playing the user or the customer that's getting support and another LLM that's playing the support agent and they interact and try to resolve the issue.Swyx [00:33:08]: Yeah. We talked to the LMSIS guys. Awesome. And they also did MTBench for people listening along. So maybe we need MTSWE-Bench. Sure. Yeah.Erik [00:33:16]: So maybe, you know, you could have something where like before the SWE-Bench task starts, you have like a few back and forths with kind of like the, the author who can answer follow up questions about what they want the task to do. And of course you'd need to do that where it doesn't cheat and like just get the exact, the exact thing out of the human or out of the sort of user. But I think that would be a really interesting thing to see. If you look at sort of existing agent work, like a Repl.it's coding agent, I think one of the really great UX things they do is like first having the agent create a plan and then having the human approve that plan or give feedback. I think for agents in general, like having a planning step at the beginning, one, just having that plan will improve performance on the downstream task just because it's kind of like a bigger chain of thought, but also it's just such a better UX. It's way easier for a human to iterate on a plan with a model rather than iterating on the full task that sort of has a much slower time through each loop. If the human has approved this implementation plan, I think it makes the end result a lot more sort of auditable and trustable. So I think there's a lot of things sort of outside of SweetBench that will be very important for real agent usage in the world. Yeah.Swyx [00:34:27]: I will say also, there's a couple of comments on names that you dropped. Copilot also does the plan stage before it writes code. I feel like those approaches have generally been less Twitter successful because it's not prompt to code, it's prompt plan code. You know, so there's a little bit of friction in there, but it's not much. Like it's, it actually, it's, it, you get a lot for what it's worth. I also like the way that Devin does it, where you can sort of edit the plan as it goes along. And then the other thing with Repl.it, we had a, we hosted a sort of dev day pregame with Repl.it and they also commented about multi-agents. So like having two agents kind of bounce off of each other. I think it's a similar approach to what you're talking about with kind of the few shot example, just as in the prompts of clarifying what the agent wants. But typically I think this would be implemented as a tool calling another agent, like a sub-agent I don't know if you explored that, do you like that idea?Erik [00:35:20]: I haven't explored this enough, but I've definitely heard of people having good success with this. Of almost like basically having a few different sort of personas of agents, even if they're all the same LLM. I think this is one thing with multi-agent that a lot of people will kind of get confused by is they think it has to be different models behind each thing. But really it's sort of usually the same, the same model with different prompts. And yet having one, having them have different personas to kind of bring different sort of thoughts and priorities to the table. I've seen that work very well and sort of create a much more thorough and thought outSwyx [00:35:53]: response.Erik [00:35:53]: I think the downside is just that it adds a lot of complexity and it adds a lot of extra tokens. So I think it depends what you care about. If you want a plan that's very thorough and detailed, I think it's great. If you want a really quick, just like write this function, you know, you probably don't want to do that and have like a bunch of different calls before it does this.Alessio [00:36:11]: And just talking about the prompt, why are XML tags so good in Cloud? I think initially people were like, oh, maybe you're just getting lucky with XML. But I saw obviously you use them in your own agent prompts, so they must work. And why is it so model specific to your family?Erik [00:36:26]: Yeah, I think that there's, again, I'm not sure how much I can say, but I think there's historical reasons that internally we've preferred XML. I think also the one broader thing I'll say is that if you look at certain kinds of outputs, there is overhead to outputting in JSON. If you're trying to output code in JSON, there's a lot of extra escaping that needs to be done, and that actually hurts model performance across the board. Versus if you're in just a single XML tag, there's none of that sort of escaping thatSwyx [00:36:58]: needs to happen.Erik [00:36:58]: That being said, I haven't tried having it write HTML and XML, which maybe then you start running into weird escaping things there. I'm not sure. But yeah, I'd say that's some historical reasons, and there's less overhead of escaping.Swyx [00:37:12]: I use XML in other models as well, and it's just a really nice way to make sure that the thing that ends is tied to the thing that starts. That's the only way to do code fences where you're pretty sure example one start, example one end, that is one cohesive unit.Alessio [00:37:30]: Because the braces are nondescriptive. Yeah, exactly.Swyx [00:37:33]: That would be my simple reason. XML is good for everyone, not just Cloud. Cloud was just the first one to popularize it, I think.Erik [00:37:39]: I do definitely prefer to read XML than read JSON.Alessio [00:37:43]: Any other details that are maybe underappreciated? I know, for example, you had the absolute paths versus relative. Any other fun nuggets?Erik [00:37:52]: I think that's a good sort of anecdote to mention about iterating on tools. Like I said, spend time prompt engineering your tools, and don't just write the prompt, but write the tool, and then actually give it to the model and read a bunch of transcripts about how the model tries to use the tool. I think by doing that, you will find areas where the model misunderstands a tool or makes mistakes, and then basically change the tool to make it foolproof. There's this Japanese term, pokayoke, about making tools mistake-proof. You know, the classic idea is you can have a plug that can fit either way, and that's dangerous, or you can make it asymmetric so that it can't fit this way, it has to go like this, and that's a better tool because you can't use it the wrong way. So for this example of absolute paths, one of the things that we saw while testing these tools is, oh, if the model has done CD and moved to a different directory, it would often get confused when trying to use the tool because it's now in a different directory, and so the paths aren't lining up. So we said, oh, well, let's just force the tool to always require an absolute path, and then that's easy for the model to understand. It knows sort of where it is. It knows where the files are. And then once we have it always giving absolute paths, it never messes up even, like, no matter where it is because it just, if you're using an absolute path, it doesn't matter whereSwyx [00:39:13]: you are.Erik [00:39:13]: So iterations like that, you know, let us make the tool foolproof for the model. I'd say there's other categories of things where we see, oh, if the model, you know, opens vim, like, you know, it's never going to return. And so the tool is stuck.Swyx [00:39:28]: Did it get stuck? Yeah. Get out of vim. What?Erik [00:39:31]: Well, because the tool is, like, it just text in, text out. It's not interactive. So it's not like the model doesn't know how to get out of vim. It's that the way that the tool is, like, hooked up to the computer is not interactive. Yes, I mean, there is the meme of no one knows how to get out of vim. You know, basically, we just added instructions in the tool of, like, hey, don't launch commands that don't return.Swyx [00:39:54]: Yeah, like, don't launch vim.Erik [00:39:55]: Don't launch whatever. If you do need to do something, you know, put an ampersand after it to launch it in the background. And so, like, just, you know, putting kind of instructions like that just right in the description for the tool really helps the model. And I think, like, that's an underutilized space of prompt engineering, where, like, people might try to do that in the overall prompt, but just put that in the tool itself so the model knows that it's, like, for this tool, this is what's relevant.Swyx [00:40:20]: You said you worked on the function calling and tool use before you actually started this vBench work, right? Was there any surprises? Because you basically went from creator of that API to user of that API. Any surprises or changes you would make now that you have extensively dog-fooded in a state-of-the-art agent?Erik [00:40:39]: I want us to make, like, maybe, like, a little bit less verbose SDK. I think some way, like, right now, it just takes, I think we sort of force people to do the best practices of writing out sort of these full JSON schemas, but it would be really nice if you could just pass in a Python function as a tool. I think that could be something nice.Swyx [00:40:58]: I think that there's a lot of, like, Python- There's helper libraries. ... structure, you know. I don't know if there's anyone else that is specializing for Anthropic. Maybe Jeremy Howard's and Simon Willis's stuff. They all have Cloud-specific stuff that they are working on. Cloudette. Cloudette, exactly. I also wanted to spend a little bit of time with SuiteAgent. It seems like a very general framework. Like, is there a reason you picked it apart from it's the same authors as vBench, or?Erik [00:41:21]: The main thing we wanted to go with was the same authors as vBench, so it just felt sort of like the safest, most neutral option. And it was, you know, very high quality. It was very easy to modify, to work with. I would say it also actually, their underlying framework is sort of this, it's like, youSwyx [00:41:39]: know, think, act, observe.Erik [00:41:40]: That they kind of go through this loop, which is like a little bit more hard-coded than what we wanted to do, but it's still very close. That's still very general. So it felt like a good match as sort of the starting point for our agent. And we had already sort of worked with and talked with the SWE-Bench people directly, so it felt nice to just have, you know, we already know the authors. This will be easy to work with.Swyx [00:42:00]: I'll share a little bit of like, this all seems disconnected, but once you figure out the people and where they go to school, it all makes sense. So it's all Princeton. Yeah, the SWE-Bench and SuiteAgent.Erik [00:42:11]: It's a group out of Princeton.Swyx [00:42:12]: Yeah, and we had Shun Yu on the pod, and he came up with the React paradigm, and that's think, act, observe. That's all React. So they're all friends. Yep, yeah, exactly.Erik [00:42:22]: And you know, if you actually read our traces of our submission, you can actually see like think, act, observe in our logs. And we just didn't even change the printing code. So it's like doing still function calls under the hood, and the model can do sort of multiple function calls in a row without thinking in between if it wants to. But yeah, so a lot of similarities and a lot of things we inherited from SuiteAgent just as a starting point for the framework.Alessio [00:42:47]: Any thoughts about other agent frameworks? I think there's, you know, the whole gamut from very simple to like very complex.Swyx [00:42:53]: Autogen, CooEI, LandGraph. Yeah, yeah.Erik [00:42:56]: I think I haven't explored a lot of them in detail. I would say with agent frameworks in general, they can certainly save you some like boilerplate. But I think there's actually this like downside of making agents too easy, where you end up very quickly like building a much more complex system than you need. And suddenly, you know, instead of having one prompt, you have five agents that are talking to each other and doing a dialogue. And it's like, because the framework made that 10 lines to do, you end up building something that's way too complex. So I think I would actually caution people to like try to start without these frameworks if you can, because you'll be closer to the raw prompts and be able to sort of directly understand what's going on. I think a lot of times these frameworks also, by trying to make everything feel really magical, you end up sort of really hiding what the actual prompt and output of the model is, and that can make it much harder to debug. So certainly these things have a place, and I think they do really help at getting rid of boilerplate, but they come with this cost of obfuscating what's really happening and making it too easy to very quickly add a lot of complexity. So yeah, I would recommend people to like try it from scratch, and it's like not that bad.Alessio [00:44:08]: Would you rather have like a framework of tools? Do you almost see like, hey, it's maybe easier to get tools that are already well curated, like the ones that you build, if I had an easy way to get the best tool from you, andSwyx [00:44:21]: like you maintain the definition?Alessio [00:44:22]: Or yeah, any thoughts on how you want to formalize tool sharing?Erik [00:44:26]: Yeah, I think that's something that we're certainly interested in exploring, and I think there is space for sort of these general tools that will be very broadly applicable. But at the same time, most people that are building on these, they do have much more specific things that they're trying to do. You know, I think that might be useful for hobbyists and demos, but the ultimate end applications are going to be bespoke. And so we just want to make sure that the model's great at any tool that it uses. But certainly something we're exploring.Alessio [00:44:52]: So everything bespoke, no frameworks, no anything.Swyx [00:44:55]: Just for now, for now.Erik [00:44:56]: Yeah, I would say that like the best thing I've seen is people building up from like, build some good util functions, and then you can use those as building blocks. Yeah, yeah.Alessio [00:45:05]: I have a utils folder, or like all these scripts. My framework is like def, call, and tropic. And then I just put all the defaults.Swyx [00:45:12]: Yeah, exactly. There's a startup hidden in every utils folder, you know? No, totally not. Like, if you use it enough, like it's a startup, you know? At some point. I'm kind of curious, is there a maximum length of turns that it took? Like, what was the longest run? I actually don't.Erik [00:45:27]: I mean, it had basically infinite turns until it ran into a 200k context. I should have looked this up. I don't know. And so for some of those failed cases where it eventually ran out of context, I mean, it was over 100 turns. I'm trying to remember like the longest successful run, but I think it was definitely over 100 turns that some of the times.Swyx [00:45:48]: Which is not that much. It's a coffee break. Yeah.Erik [00:45:52]: But certainly, you know, these things can be a lot of turns. And I think that's because some of these things are really hard, where it's going to take, you know, many tries to do it. And if you think about like, think about a task that takes a human four hours to do. Think about how many different files you read, and like times you edit a file in four hours. That's a lot more than 100.Alessio [00:46:10]: How many times you open Twitter because you get distracted. But if you had a lot more compute, what's kind of like the return on the extra compute now? So like, you know, if you had thousands of turns or like whatever, like how much better would it get?Erik [00:46:23]: Yeah, this I don't know. And I think this is, I think sort of one of the open areas of research in general with agents is memory and sort of how do you have something that can do work beyond its context length where you're just purely appending. So you mentioned earlier things like pruning bad paths. I think there's a lot of interesting work around there. Can you just roll back but summarize, hey, don't go down this path? There be dragons. Yeah, I think that's very interesting that you could have something that that uses way more tokens without ever using at a time more than 200k. So I think that's very interesting. I think the biggest thing is like, can you make the model sort of losslessly summarize what it's learned from trying different approaches and bring things back? I think that's sort of the big challenge.Swyx [00:47:11]: What about different models?Alessio [00:47:12]: So you have Haiku, which is like, you know, cheaper. So you're like, well, what if I have a Haiku to do a lot of these smaller things and then put it back up?Erik [00:47:20]: I think Cursor might have said that they actually have a separate model for file editing.Swyx [00:47:25]: I'm trying to remember.Erik [00:47:25]: I think they were on maybe the Lex Fridman podcast where they said they have a bigger model, like write what the code should be and then a different model, like apply it. So I think there's a lot of interesting room for stuff like that. Yeah, fast supply.Swyx [00:47:37]: We actually did a pod with Fireworks that they worked with on. It's speculative decoding.Erik [00:47:41]: But I think there's also really interesting things about like, you know, paring down input tokens as well, especially sometimes the models trying to read like a 10,000 line file. That's a lot of tokens. And most of it is actually not going to be relevant. I think it'd be really interesting to like delegate that to Haiku. Haiku read this file and just pull out the most relevant functions. And then, you know, Sonnet reads just those and you save 90% on tokens. I think there's a lot of really interesting room for things like that. And again, we were just trying to do sort of the simplest, most minimal thing and show that it works. I'm really hoping that people, sort of the agent community builds things like that on top of our models. That's, again, why we released these tools. We're not going to go and do lots more submissions to SWE-Bench and try to prompt engineer this and build a bigger system. We want people to like the ecosystem to do that on top of our models. But yeah, so I think that's a really interesting one.Swyx [00:48:32]: It turns out, I think you did do 3.5 Haiku with your tools and it scored a 40.6. Yes.Erik [00:48:38]: So it did very well. It itself is actually very smart, which is great. But we haven't done any experiments with this combination of the two models. But yeah, I think that's one of the exciting things is that how well Haiku 3.5 did on SWE-Bench shows that sort of even our smallest, fastest model is very good at sort of thinking agentically and working on hard problems. Like it's not just sort of for writing simple text anymore.Alessio [00:49:02]: And I know you're not going to talk about it, but like Sonnet is not even supposed to be the best model, you know? Like Opus, it's kind of like we left it at three back in the corner intro. At some point, I'm sure the new Opus will come out. And if you had Opus Plus on it, that sounds very, very good.Swyx [00:49:19]: There's a run with SuiteAgent plus Opus, but that's the official SWE-Bench guys doing it.Erik [00:49:24]: That was the older, you know, 3.0.Swyx [00:49:25]: You didn't do yours. Yeah. Okay. Did you want to? I mean, you could just change the model name.Erik [00:49:31]: I think we didn't submit it, but I think we included it in our model card.Swyx [00:49:35]: Okay.Erik [00:49:35]: We included the score as a comparison. Yeah.Swyx [00:49:38]: Yeah.Erik [00:49:38]: And Sonnet and Haiku, actually, I think the new ones, they both outperformed the original Opus. Yeah. I did see that.Swyx [00:49:44]: Yeah. It's a little bit hard to find. Yeah.Erik [00:49:47]: It's not an exciting score, so we didn't feel like they need to submit it to the benchmark.Swyx [00:49:52]: We can cut over to computer use if we're okay with moving on to topics on this, if anything else. I think we're good.Erik [00:49:58]: I'm trying to think if there's anything else SWE-Bench related.Swyx [00:50:02]: It doesn't have to be also just specifically SWE-Bench, but just your thoughts on building agents, because you are one of the few people that have reached this leaderboard on building a coding agent. This is the state of the art. It's surprisingly not that hard to reach with some good principles. Right. There's obviously a ton of low-hanging fruit that we covered. Your thoughts on if you were to build a coding agent startup, what next?Erik [00:50:24]: I think the really interesting question for me, for all the startups out there, is this kind of divergence between the benchmarks and what real customers will want. So I'm curious, maybe the next time you have a coding agent startup on the podcast, you should ask them that. What are the differences that they're starting to make? Tomorrow.Swyx [00:50:40]: Oh, perfect, perfect. Yeah.Erik [00:50:41]: I'm actually very curious what they will see, because I also have seen, I feel like it's slowed down a little bit if I don't see the startups submitting to SWE-Bench that much anymore.Swyx [00:50:52]: Because of the traces, the trace. So we had Cosign on, they had a 50-something on full, on SWE-Bench full, which is the hardest one, and they were rejected because they didn't want to submit their traces. Yep. IP, you know? Yeah, that makes sense, that makes sense. Actually, tomorrow we're talking to Bolt, which is a cloud customer. You guys actually published a case study with them. I assume you weren't involved with that, but they were very happy with Cloud. Cool. One of the biggest launches of the year. Yeah, totally. We actually happened to be sitting in Adept's former office. My take on this is Anthropic shipped Adept as a feature. It's still a beta feature, but yes. What was it like when you tried it for the first time? Was it obvious that Cloud had reached that stage where you could do computer use? It was somewhat of a surprise to me.Erik [00:51:40]: I had been on vacation, and I came back, and everyone's like, computer use works. So it was this very exciting moment. After the first go to Google, I think I tried to have it play Minecraft or something, and it actually installed and opened Minecraft.Swyx [00:51:54]: I was like, wow, this is pretty cool.Erik [00:51:55]: So I was like, wow, yeah, this thing can actually use a computer. And certainly, it is still beta. There's certain things that it's not very good at yet. But I'm really excited, I think, most broadly, not just for new things that weren't possible before, but as a much lower friction way to implement tool use. One anecdote from my days at Cobalt Robotics, we wanted our robots to be able to ride elevators, to go between floors and fully cover a building. The first way that we did this was doing API integrations with the elevator companies. Some of them actually had APIs. We could send a request, and it would move the elevator. Each new company we did took six months to do,Swyx [00:52:37]: because they were very slow.Erik [00:52:39]: They didn't really care.Swyx [00:52:40]: Or an elevator, not an API.Erik [00:52:42]: Even installing, once we had it with the company, they would have to literally go install an API box on the elevator that we wanted to use, and that would sometimes take six months.Swyx [00:52:51]: So very slow.Erik [00:52:52]: And eventually, we're like, okay, this is slowing down all of our customer deployments. And I was like, what if we just add an arm to the robot? And I added this little arm that could literally go and press the elevator buttons, and we use computer vision to do this. And we could deploy that in a single day, and have the robot being able to use the elevators. At the same time, it was slower than the API. It wasn't quite as reliable. Sometimes it would miss, and it would have to try to press it again.Swyx [00:53:20]: But it would get there.Erik [00:53:20]: But it was slower and a little bit less reliable. And I kind of see this as an analogy to computer use, of anything you can do with computer use today, you could probably write tool use and integrate it with APIs.Swyx [00:53:33]: It's up to the language model.Erik [00:53:34]: But that's going to take a bunch of software engineering to write those integrations.Swyx [00:53:38]: You have to do all this stuff.Erik [00:53:39]: With computer use, just give the thing a browser that's logged into what you want to integrate with, and it's going to work immediately. And I see that reduction in friction as being incredibly exciting. Imagine a customer support team where, okay, hey, you got this customer support bot, but you need to go integrate it with all these things. And you don't have any engineers on your customer support team. But if you can just give the thing a browser that's logged into your systems that you need it to have access to, now, suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that could go do all the actions you care about. So I think that's the most exciting thing for me about computer use, is reducing that friction of integrations to almost zero.Alessio [00:54:20]: Or farming on World of Warcraft.Swyx [00:54:23]: Yes, or that.Erik [00:54:23]: Just go computer use.Alessio [00:54:25]: Very high-value use cases.Swyx [00:54:27]: I always say about this, this is the oldest question in robotics or self-driving, which is, do you drive by vision or do you have special tools? And vision is the universal tool to claim all tools. There's trade-offs, but there's situations in which that will come. But this week's podcast, the one that we just put out, had Stan Polu from Dust saying that he doesn't see a future where it's the significant workhorse. I think there could be a separation between maybe the high-volume use cases. You want APIs. And then the long tail, you want computer use. I totally agree. Right?Erik [00:55:00]: Or you'll start, you'll prototype something with computer use. And then, hey, this is working. Customers have adopted this feature. OK, let's go turn it into an API. And it'll be faster and use less tokens.Swyx [00:55:11]: I'd be interested to see a computer use agent replace itself by figuring out the API and then just dropping out of the equation altogether.Erik [00:55:20]: Yeah, that's really fun, actually.Swyx [00:55:22]: If I was running an RPA company, you would have the RPA scripting. RPA, for people listening, is robotic process automation, where you would script things that always show up in sequence. So you don't have an LLM in the loop. And so basically what you need to do is train an LLM to code that script. And then you can naturally hand off from computer use to non-computer use.Erik [00:55:43]: Or have some way to turn Claude's actions of computer use into a saved script that you can then run repeatedly.Swyx [00:55:49]: Yeah, it'd be interesting to record that.Alessio [00:55:50]: Why did you decide to not ship any sandbox harness for computer use? It's kind of like, hey, peace.Swyx [00:55:58]: Run at your own risk. It's Docker, right?Erik [00:55:59]: No, no, we launched it with, I think, a VM or Docker, a Docker as system.Alessio [00:56:03]: But it's not for your actual computer, right? The Docker instance runs in the Docker. It's not for...Swyx [00:56:10]: Yeah, it runs its own browser.Erik [00:56:13]: I mean, the main reason for that, one, is sort of security. We don't want... The model can do anything. So we wanted to give it a sandbox, not have people do their own computer. At least sort of for our default experience. We really care about providing a nice sort of... Making the default safe, I think, is the best way for us to do it. And I mean, very quickly, people made modifications to let you run it on your own desktop. And that's fine.Swyx [00:56:37]: Someone else can do that.Erik [00:56:37]: But we don't want that to be the official, anthropic thing to run. I would say also, from a product perspective, right now, because this is sort of still in beta, I think a lot of the most useful use cases are... Like, a sandbox is actually what you want. You want something where, hey, it can't mess up anything in here. It only has what I gave it. Also, if it's using your computer, you know, you can't use your computer at the same time. I think you actually want it to have its own screen. It's like you and a person pair programming, but only on one laptop versus you have two laptops.Swyx [00:57:07]: Everyone should totally have a side laptop where the computer uses... Cloud is just doing its thing. Yeah, yeah.Erik [00:57:11]: I think it's such a better experience. Unless there's something very explicit you want it to do for you on your own computer.Swyx [00:57:17]: It becomes like you're sort of shelling into a remote machine and, you know, maybe checking in on it every now and then. Like, I have fond memories of... Half our audience is going to be too young to remember this, but Citrix desktop experience, like, you were sort of remote into a machine that someone else was operating. And for a long time, that would be how you did, like, enterprise computing. Yeah, yeah. It's coming back. Any other implications of computer use? You know, is it a fun demo or is it, like, the future of Anthropic? I'm very excited about it.Erik [00:57:50]: I think that, like, there's a lot of sort of very repetitive work that, like, computer use will be great for. I think I've seen some examples of people build, like, coding agents that then also, like, test the front end that they made. So I think it's very cool to, like, use computer use to be able to close the loop on a lot of things that right now just a terminal-based agent can't do. So I think that's very exciting.Swyx [00:58:11]: It's kind of like end-to-end testing. Exactly. Yeah, yeah.Erik [00:58:14]: The end sort of front-end and web testing is something I'm very excited about.Swyx [00:58:18]: Yeah, I've seen Amanda also talking... This would be Amanda Askell, the head of Cloud Character. She goes on a lunch break and it generates, you know, research ideas for her. Giving it a name like computer use is very practical. It's like you're supposed to do things, but maybe sometimes it's not about doing things, it's about thinking. And thinking... In the process of thinking, you're using the computer. In some way that's, you know, solving SweetBench, like, you should be allowed to use the internet or you should be allowed to use a computer to solve it and use your vision and use whatever. Like, we're just sort of shackling it with all these restrictions just because we want to play nice for a benchmark. But really, you know, a full AI will be able to do all these things. To think. Yeah, we'll definitely be able to. To reason. To Google and search for things.Erik [00:58:58]: Yeah, yeah. Pull down inspiration.Alessio [00:59:00]: Can we just do a... before we wrap, a robotics corner?Swyx [00:59:03]: Oh, yeah, yeah.Alessio [00:59:04]: People are always curious, especially with somebody that is not trying to hype their own company. What's the state of AI robotics? Under-hyped, over-hyped?Erik [00:59:12]: Yeah, and I'll say, like, these are my opinions, not Anthropic's. And again, coming from a place of a burned-out robotics founder, so take everything with a grain of salt. I would say on the positives, like, there is really sort of incredible progress that's happened in the last five years that I think will be a big unlock for robotics. The first is just general purpose language models. I mean, there was an old saying in robotics that if to fully describe your task is harder than to just do the task, you can never automate it. Because, like, it's going to take more effort to even tell the robot how to do this thing than to me just do it itself. LLM solved that. I no longer need to go exhaustively program in every little thing I could do. The thing just has common sense. And it's going to know, how do I make a Reuben sandwich? I'm not going to have to go program that in. Whereas before, like, the idea of even, like, a cooking thing, it's like, oh god, like, we're gonna have the team of engineers that are hard coding recipes for the long tail of anything. It would be a disaster. So I think that's one thing, is that bringing common sense really is, like, solves this huge problem of describing tasks. The second big innovation has been diffusion models for path planning. A lot of this work came out of Toyota Research. There's a lot of startups now that are working on this, like Physical Intelligence Pi, Chelsea Finn's startup out of Stanford. And the basic idea here is using a little bit of the, I'd say maybe more inspiration from diffusion rather than diffusion models themselves. But they're a way to basically learn an end-to-end sort of motion control. Whereas previously, all of robotics motion control was sort of very hard-coded. You either, you know, you're programming in explicit motions, or you're programming in an explicit goal and using an optimization library to find the shortest path to it. This is now something where you just give it a bunch of demonstrations. And again, just like using learning, it's basically like learning from these examples. What does it mean to go pick up a cup? And doing these in a way just like diffusion models, where they are somewhat conditioned by text, you can have the same model learn many different tasks. And then the hope is that these start to generalize. That if you've trained it on picking up coffee cups and picking up books, then when I say pick up the backpack, it knows how to do that too. Even though you've never trained it on that. That's kind of the holy grail here, is that you train it on 500 different tasks, and then that's enough to really get it to generalize to do anything you would need. I think that's like still a big TBD. And these people are working, have like measured some degree of generalization. But at the end of the day, it's also like LLMs. Like, you know, do you really care about the thing, being able to do something that no one has ever shown in training data? People for like a home robot, there's going to be like a hundred things that people really wanted to do. And you can just make sure it has good training for those things. What you do care about then is like generalization within a task of, oh, I've never seen this particular coffee mug before. Can I still pick it up? And those, the models do seem very good at. So these kind of are the two big things that are going for robotics right now, is LLMs for common sense and diffusion-inspired path planning algorithms. I think this is very promising, but I think there's a lot of hype. And I think where we are right now is where self-driving cars were 10 years ago. I think we have very cool demos that work. I mean, 10 years ago, you had videos of people driving a car on the highway, driving a car, you know, on a street with a safety driver. But it's really taken a long time to go from there to, I took a Waymo here today. And even Waymo is only in SF and a few other cities. And I think it takes a long time for these things to actually get everywhere and to get all the edge cases covered. I think that for robotics, the limiting factor is going to be reliability, that these models are really good at doing these demos of doing laundry or doing dishes. If they only work 99% of the time, that sounds good, but that's actually really annoying. Humans are really good at these tasks. Imagine if one out of every 100 dishes, it washed, it breaks. You would not want that robot in your house, or you certainly wouldn't want that in your factory if one of every 100 boxes that it moves, it drops and breaks things inside it. So I think for these things to really be useful, they're going to have to hit a very, very high level of reliability, just like self-driving cars. And I don't know how hard it's going to be for these models to move from the 95% reliability to 99.9. I think that's going to be the big thing. And I think also, I'm a little skeptical of how good the unit economics of these things will be. These robots are going to be very expensive to build. And if you're just trying to replace labor, like a one-for-one purchase, it kind of sets an upper cap about how much you can charge. And so it seems like it's not that great a business. I'm also worried about that for the self-driving car industry.Alessio [01:04:05]: Do you see most of the applications actually taking some of the older, especially manufacturing machinery, which needs to be very precise? Even if it's off by just a few millimeters, it cannot screw up the whole thing and be able to adjust at the edge? Or do you think the net new use cases may be more interesting?Erik [01:04:24]: I think it'd be very hard to replace a lot of those traditional manufacturing robots because everything relies on that precision. If you have a model that can, again, only get there 99% of the time, you don't want 1% of your cars to have the weld in the wrong spot. That's going to be a disaster. And a lot of manufacturing is all about getting rid of as much variance and uncertainty asSwyx [01:04:47]: possible.Erik [01:04:47]: Yeah.Swyx [01:04:48]: And what about the hardware?Alessio [01:04:49]: A lot of my friends that work in robotics, one of their big issues is sometimes you just have a servo that fails, and it takes a bunch of time to fix that.Swyx [01:04:57]: Is that holding back things?Alessio [01:04:58]: Or is the software still, anyway, not that ready?Swyx [01:05:01]: I think both.Erik [01:05:01]: I think there's been a lot more progress in the software in the last few years. And I think a lot of the humanoid robot companies now are really trying to build amazing hardware. Hardware is just so hard. It's something where you build your first robot, and it works. You're like, great. Then you build 10 of them. Five of them work. Three of them work half the time. Two of them don't work. And you built them all the same, and you don't know why. And it's just like the real world has this level of detail and differences that softwareSwyx [01:05:28]: doesn't have.Erik [01:05:29]: Imagine if every for loop you wrote, some of them just didn't work. Some of them were slower than others. How do you deal with that? Imagine if every binary that you shipped to a customer, each of those four loops was aSwyx [01:05:41]: little different.Erik [01:05:41]: It becomes just so hard to scale and maintain quality of these things. And I think that's what makes hardware really hard. It's not building one of something, but repeatedly building something and making it work reliably. Where again, you'll buy a batch of 100 motors, and each of those motors will behave a little bit differently to the same input command.Swyx [01:06:01]: This is your lived experience at Cobalt.Erik [01:06:03]: And robotics is all about how do you build something that's robust despite these differences.Swyx [01:06:08]: We can't get the tolerance of motors down to-Erik [01:06:10]: It's just everything.Swyx [01:06:13]: It's actually everything.Alessio [01:06:14]: Yeah.Erik [01:06:15]: No, I mean, one of my horror stories was that at Cobalt, this was many years ago, we had a thermal camera on the robot that had a USB connection to the computer inside, which is, first of all, is a big mistake. You're not supposed to use a USB. It is not a reliable protocol. It's designed that if there's mistakes, the user can just unplug it and plug it back in. I see. And so typically things that are USB, they're not designed to the same level of very high reliability you need. Again, because they assume someone will just unplug it and replug it back in. You just say someone sometime.Swyx [01:06:46]: I heard this too, and I didn't listen to it.Erik [01:06:47]: I really wish I had before. Anyway, at a certain point, a bunch of these thermal cameras started failing, and we couldn't figure out why. And I asked everyone on the team, like, hey, what's changed? Did the software change around this? Did the hardware design change around this? And I was investigating all this stuff, looking at kernel logs of what's happening with thisSwyx [01:07:07]: thing.Erik [01:07:07]: And finally, the procurement person was like, oh, yeah, well, I found this new vendor for USB cables last summer.Swyx [01:07:14]: And I'm like, what?Erik [01:07:15]: You switched which vendor were buying USB cables? I'm like, yeah, it's the same exact cable. It's just a dollar cheaper. And it turns out this was the problem. This new cable had slightly worse resistance or slightly worse EMI interference. And it worked most of the time. But 1% of the time, these cameras would fail, and we'd need to reboot a big part of the system. And it was all just because the same exact spec, these two different USB cables, slightly different. And so these are the kind of things you deal with with hardware.Swyx [01:07:45]: For listeners, we had an episode with Josh Albrecht in BU where he talked about buying tens of thousands of GPUs. And just some of them will just not do math. Yeah, that's the same thing. You run some tests to find the bad batch, and then you return it to sender because they just, GPUs won't do math, right? Yeah, yeah, this is the thing.Erik [01:08:05]: The real world has this level of detail. Eric Jang, he did AI at Google.Swyx [01:08:11]: Yeah, 1X. Yeah, and then joined 1X.Erik [01:08:13]: I see him post on Twitter occasionally of complaints about hardware and supply chain. And we know each other, and we joke occasionally. I went from robotics into AI, and he went from AI into robotics.Swyx [01:08:26]: I mean, look, very, very promising. The time of the real world is unlimited, right? But just also a lot harder. And yeah, I do think something I also tell people about for why working software agents is they're infinitely clonable. Yeah, they always work the same way. Mostly, unless you're using Python. And yeah, I mean, this is the whole thesis. I'm also interested, you dropped a little bit of alpha there. I don't want to make sure we don't lose it. Like, you're kind of skeptical about self-driving as a business. So I want to double click on this a little bit, because I mean, I think that shouldn't be taken away. We do have some public Waymo numbers. Read from Waymo is pretty public with their stats. They're exceeding 100 Waymo trips a week. If you assume a 25𝑟𝑖𝑑𝑒𝑎𝑣𝑒𝑟𝑎𝑔𝑒,𝑡ℎ𝑎𝑡′𝑠25rideaverage,that′s130 million revenue run rate. At some point, they will recoup their investment, right? Like, what are we talking about here? Way to skepticism.Erik [01:09:21]: I think, and again, I'm not an expert. I don't know their financials. I would say the thing I'm worried about is compared to an Uber, I don't know how much an Uber driver takes home a year, but call that the revenue that a Waymo is going to be making in that same year. Those cars are expensive. It's not about if you can hit profitability, it's about your cash conversion cycles. Is building one Waymo, how cheap can you make that compared to how much you're earning as the equivalent of what an Uber driver would take home? Because remember, an Uber driver, you're not getting that whole revenue. You think about, for the Uber driver, the cost of the car, the depreciation of the car. I'm not convinced how much profit Waymo can actually make per car.Swyx [01:10:02]: That's, I think, my skepticism.Alessio [01:10:02]: Well, they need to pre-assess the run Waymo because the Class C is like $110 grand, somethingSwyx [01:10:09]: like that, plus the LiDAR. That's many years of, yeah, yeah, yeah. Exactly, exactly. Anything else?Alessio [01:10:14]: Parting thoughts? Call to action? Rants?Swyx [01:10:18]: The floor is yours.Erik [01:10:19]: I'm very excited to see a lot more LLM agents out there in the world doing things. And I think they'll be, the biggest limiting thing will start to become, do people trust the output of these agents? And how do you trust the output of an agent that did five hours of work for you and is coming back with something? And if you can't find some way to trust that agent's work, it kind of wasn't valuable at all. So I think that's going to be a really important thing, is not just doing the work, but doing the work in a trustable, auditable way where you can also explain to the human, hey, here's exactly how this works and why and how I came to it. I think that's going to be really important.Swyx [01:10:54]: Thank you so much. Yeah, thanks. This was great. Get full access to Latent Space at www.latent.space/subscribe
    --------  
    1:11:10
  • Why Compound AI + Open Source will beat Closed AI
    We have a full slate of upcoming events: AI Engineer London, AWS Re:Invent in Las Vegas, and now Latent Space LIVE! at NeurIPS in Vancouver and online. Sign up to join and speak!We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!We try to stay close to the inference providers as part of our coverage, as our podcasts with Together AI and Replicate will attest: However one of the most notable pull quotes from our very well received Braintrust episode was his opinion that open source model adoption has NOT gone very well and is actually declining in relative market share terms (it is of course increasing in absolute terms):Today’s guest, Lin Qiao, would wholly disagree. Her team of Pytorch/GPU experts are wholly dedicated toward helping you serve and finetune the full stack of open source models from Meta and others, across all modalities (Text, Audio, Image, Embedding, Vision-understanding), helping customers like Cursor and Hubspot scale up open source model inference both rapidly and affordably.Fireworks has emerged after its successive funding rounds with top tier VCs as one of the leaders of the Compound AI movement, a term first coined by the Databricks/Mosaic gang at Berkeley AI and adapted as “Composite AI” by Gartner:Replicating o1We are the first podcast to discuss Fireworks’ f1, their proprietary replication of OpenAI’s o1. This has become a surprisingly hot area of competition in the past week as both Nous Forge and Deepseek r1 have launched competitive models.Full Video PodcastLike and subscribe!Timestamps* 00:00:00 Introductions* 00:02:08 Pre-history of Fireworks and PyTorch at Meta* 00:09:49 Product Strategy: From Framework to Model Library* 00:13:01 Compound AI Concept and Industry Dynamics* 00:20:07 Fireworks' Distributed Inference Engine* 00:22:58 OSS Model Support and Competitive Strategy* 00:29:46 Declarative System Approach in AI* 00:31:00 Can OSS replicate o1?* 00:36:51 Fireworks f1* 00:41:03 Collaboration with Cursor and Speculative Decoding* 00:46:44 Fireworks quantization (and drama around it)* 00:49:38 Pricing Strategy* 00:51:51 Underrated Features of Fireworks Platform* 00:55:17 HiringTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner at CTO at Danceable Partners, and I'm joined by my co-host, Swyx founder, Osmalayar.Swyx [00:00:11]: Hey, and today we're in a very special studio inside the Fireworks office with Lin Qiang, CEO of Fireworks. Welcome. Yeah.Lin [00:00:20]: Oh, you should welcome us.Swyx [00:00:21]: Yeah, welcome. Yeah, thanks for having us. It's unusual to be in the home of a startup, but it's also, I think our relationship is a bit unusual compared to all our normal guests. Definitely.Lin [00:00:34]: Yeah. I'm super excited to talk about very interesting topics in that space with both of you.Swyx [00:00:41]: You just celebrated your two-year anniversary yesterday.Lin [00:00:43]: Yeah, it's quite a crazy journey. We circle around and share all the crazy stories across these two years, and it has been super fun. All the way from we experienced Silicon Valley bank run to we delete some data that shouldn't be deleted operationally. We went through a massive scale where we actually are busy getting capacity to, yeah, we learned to kind of work with it as a team with a lot of brilliant people across different places to join a company. It has really been a fun journey.Alessio [00:01:24]: When you started, did you think the technical stuff will be harder or the bank run and then the people side? I think there's a lot of amazing researchers that want to do companies and it's like the hardest thing is going to be building the product and then you have all these different other things. So, were you surprised by what has been your experience the most?Lin [00:01:42]: Yeah, to be honest with you, my focus has always been on the product side and then after the product goes to market. And I didn't realize the rest has been so complicated, operating a company and so on. But because I don't think about it, I just kind of manage it. So it's done. I think I just somehow don't think about it too much and solve whatever problem coming our way and it worked.Swyx [00:02:08]: So let's, I guess, let's start at the pre-history, the initial history of Fireworks. You ran the PyTorch team at Meta for a number of years and we previously had Sumit Chintal on and I think we were just all very interested in the history of GenEI. Maybe not that many people know how deeply involved Faire and Meta were prior to the current GenEI revolution.Lin [00:02:35]: My background is deep in distributed system, database management system. And I joined Meta from the data side and I saw this tremendous amount of data growth, which cost a lot of money and we're analyzing what's going on. And it's clear that AI is driving all this data generation. So it's a very interesting time because when I joined Meta, Meta is going through ramping down mobile-first, finishing the mobile-first transition and then starting AI-first. And there's a fundamental reason about that sequence because mobile-first gave a full range of user engagement that has never existed before. And all this user engagement generated a lot of data and this data power AI. So then the whole entire industry is also going through, falling through this same transition. When I see, oh, okay, this AI is powering all this data generation and look at where's our AI stack. There's no software, there's no hardware, there's no people, there's no team. I want to dive up there and help this movement. So when I started, it's very interesting industry landscape. There are a lot of AI frameworks. It's a kind of proliferation of AI frameworks happening in the industry. But all the AI frameworks focus on production and they use a very certain way of defining the graph of neural network and then use that to drive the model iteration and productionization. And PyTorch is completely different. So they could also assume that he was the user of his product. And he basically says, researchers face so much pain using existing AI frameworks, this is really hard to use and I'm going to do something different for myself. And that's the origin story of PyTorch. PyTorch actually started as the framework for researchers. They don't care about production at all. And as they grow in terms of adoption, so the interesting part of AI is research is the top of our normal production. There are so many researchers across academic, across industry, they innovate and they put their results out there in open source and that power the downstream productionization. So it's brilliant for MATA to establish PyTorch as a strategy to drive massive adoption in open source because MATA internally is a PyTorch shop. So it creates a flying wheel effect. So that's kind of a strategy behind PyTorch. But when I took on PyTorch, it's kind of at Caspo, MATA established PyTorch as the framework for both research and production. So no one has done that before. And we have to kind of rethink how to architect PyTorch so we can really sustain production workload, the stability, reliability, low latency, all this production concern was never a concern before. Now it's a concern. And we actually have to adjust its design and make it work for both sides. And that took us five years because MATA has so many AI use cases, all the way from ranking recommendation as powering the business top line or as ranking newsfeed, video ranking to site integrity detect bad content automatically using AI to all kinds of effects, translation, image classification, object detection, all this. And also across AI running on the server side, on mobile phones, on AI VR devices, the wide spectrum. So by the time we actually basically managed to support AI across ubiquitous everywhere across MATA. But interestingly, through open source engagement, we work with a lot of companies. It is clear to us like this industry is starting to take on AI first transition. And of course, MATA's hyperscale always go ahead of industry. And it feels like when we start this AI journey at MATA, there's no software, no hardware, no team. For many companies we engage with through PyTorch, we feel the pain. That's the genesis why we feel like, hey, if we create fireworks and support industry going through this transition, it will be a huge amount of impact. Of course, the problem that the industry is facing will not be the same as MATA. MATA is so big, right? So it's kind of skewed towards extreme scale and extreme optimization in the industry will be different. But we feel like we have the technical chop and we've seen a lot. We'll look to kind of drive that. So yeah, so that's how we started.Swyx [00:06:58]: When you and I chatted about the origins of fireworks, it was originally envisioned more as a PyTorch platform, and then later became much more focused on generative AI. Is that fair to say? What was the customer discovery here?Lin [00:07:13]: Right. So I would say our initial blueprint is we should build a PyTorch cloud because a PyTorch library and there's no SaaS platform to enable AI workloads.Swyx [00:07:26]: Even in 2022, it's interesting.Lin [00:07:28]: I would not say absolutely no, but cloud providers have some of those, but it's not first class citizen, right? At 2022, there's still like TensorFlow is massively in production. And this is all pre-gen AI, and PyTorch is kind of getting more and more adoption. But there's no PyTorch-first SaaS platform existing. At the same time, we are also a very pragmatic set of people. We really want to make sure from the get-go, we get really, really close to customers. We understand their use case, we understand their pain points, we understand the value we deliver to them. So we want to take a different approach instead of building a horizontal PyTorch cloud. We want to build a verticalized platform first. And then we talk with many customers. And interestingly, we started the company in September 2022, and in October, November, the OpenAI announced ChatGPT. And then boom, when we talked with many customers, they were like, can you help us work on the JNS aspect? So of course, there are some open source models. It's not as good at that time, but people are already putting a lot of attention there. Then we decided that if we're going to pick a vertical, we're going to pick JNI. The other reason is all JNI models are PyTorch models. So that's another reason. We believe that because of the nature of JNI, it's going to generate a lot of human consumable content. It will drive a lot of consumer, customer-developer-facing application and product innovation. Guaranteed. We're just at the beginning of this. Our prediction is for those kind of applications, the inference is much more important than training because inference scale is proportional to the up-limit award population. And training scale is proportional to the number of researchers. Of course, each training round could be very expensive. Although PyTorch supports both inference and training, we decided to laser focus on inference. So yeah, so that's how we got started. And we launched our public platform August last year. When we launched, it was a single product. It's a distributed inference engine with a simple API, open AI compatible API with many models. We started with LM and then we added a lot of models. Fast forward to now, we are a full platform with multiple product lines. So we love to kind of dive deep into what we offer. But that's a very fun journey in the past two years.Alessio [00:09:49]: What was the transition from you start to focus on PyTorch and people want to understand the framework, get it live. And now say maybe most people that use you don't even really know much about PyTorch at all. You know, they're just trying to consume a model. From a product perspective, like what were some of the decisions early on? Like right in October, November, you were just like, hey, most people just care about the model, not about the framework. We're going to make it super easy or was it more a gradual transition to the model librarySwyx [00:10:16]: you have today?Lin [00:10:17]: Yeah. So our product decision is all based on who is our ICP. And one thing I want to acknowledge here is the generic technology is disruptive. It's very different from AI before GNI. So it's a clear leap forward. Because before GNI, the companies that want to invest in AI, they have to train from scratch. There's no other way. There's no foundation model. It doesn't exist. So that means then to start a team, first hire a team who is capable of crunch data. There's a lot of data to crunch, right? Because training from scratch, you have to prepare a lot of data. And then they need to have GPUs to train, and then you start to manage GPUs. So then it becomes a very complex project. It takes a long time and not many companies can afford it, actually. And the GNI is a very different game right now, because it is a foundation model. So you don't have to train anymore. That makes AI much more accessible as a technology. As an app developer or product manager, even, not a developer, they can interact with GNI models directly. So our goal is to make AI accessible to all app developers and product engineers. That's our goal. So then getting them into the building model doesn't make any sense anymore with this new technology. And then building easy, accessible APIs is the most important. Early on, when we got started, we decided we're going to be open AI compatible. It's just kind of very easy for developers to adopt this new technology, and we will manage the underlying complexity of serving all these models.Swyx [00:11:56]: Yeah, open AI has become the standard. Even as we're recording today, Gemini announced that they have open AI compatible APIs. Interesting. So we just need to drop it all in line, and then we have everyone popping in line.Lin [00:12:09]: That's interesting, because we are working very closely with Meta as one of the partners. Meta, of course, is kind of very generous to donate many very, very strong open source models, expecting more to come. But also they have announced LamaStack, which is basically standardized, the upper level stack built on top of Lama models. So they don't just want to give out models and you figure out what the upper stack is. They instead want to build a community around the stack and build a new standard. I think there's an interesting dynamics in play in the industry right now, when it's more standardized across open AI, because they are kind of creating the top of the funnel, or standardized across Lama, because this is the most used open source model. So I think it's a lot of fun working at this time.Swyx [00:13:01]: I've been a little bit more doubtful on LamaStack, I think you've been more positive. Basically it's just like the meta version of whatever Hugging Face offers, you know, or TensorRT, or BLM, or whatever the open source opportunity is. But to me, it's not clear that just because Meta open sources Lama, that the rest of LamaStack will be adopted. And it's not clear why I should adopt it. So I don't know if you agree.Lin [00:13:27]: It's very early right now. That's why I kind of work very closely with them and give them feedback. The feedback to the meta team is very important. So then they can use that to continue to improve the model and also improve the higher level I think the success of LamaStack heavily depends on the community adoption. And there's no way around it. And I know the meta team would like to kind of work with a broader set of community. But it's very early.Swyx [00:13:52]: One thing that after your Series B, so you raced for Benchmark, and then Sequoia. I remember being close to you for at least your Series B announcements, you started betting heavily on this term of Compound AI. It's not a term that we've covered very much in the podcast, but I think it's definitely getting a lot of adoption from Databricks and Berkeley people and all that. What's your take on Compound AI? Why is it resonating with people?Lin [00:14:16]: Right. So let me give a little bit of context why we even consider that space.Swyx [00:14:22]: Because like pre-Series B, there was no message, and now it's like on your landing page.Lin [00:14:27]: So it's kind of very organic evolution from when we first launched our public platform, we are a single product. We are a distributed inference engine, where we do a lot of innovation, customized KUDA kernels, raw kernel kernels, running on different kinds of hardware, and build distributed disaggregated execution, inference execution, build all kinds of caching. So that is one. So that's kind of one product line, is the fast, most cost-efficient inference platform. Because we wrote PyTorch code, we know we basically have a special PyTorch build for that, together with a custom kernel we wrote. And then we worked with many more customers, we realized, oh, the distributed inference engine, our design is one size fits all. We want to have this inference endpoint, then everyone come in, and no matter what kind of form and shape or workload they have, it will just work for them. So that's great. But the reality is, we realized all customers have different kinds of use cases. The use cases come in all different forms and shapes. And the end result is the data distribution in their inference workload doesn't align with the data distribution in the training data for the model. It's a given, actually. If you think about it, because researchers have to guesstimate what is important, what's not important in preparing data for training. So because of that misalignment, then we leave a lot of quality, latency, cost improvement on the table. So then we're saying, OK, we want to heavily invest in a customization engine. And we actually announced it called FHIR Optimizer. So FHIR Optimizer basically helps users navigate a three-dimensional optimization space across quality, latency, and cost. So it's a three-dimensional curve. And even for one company, for different use cases, they want to land in different spots. So we automate that process for our customers. It's very simple. You have your inference workload. You inject into the optimizer along with the objective function. And then we spit out inference deployment config and the model setup. So it's your customized setup. So that is a completely different product. So that product thinking is one size fits all. And now on top of that, we provide a huge variety of state-of-the-art models, hundreds of them, varying from text to large state-of-the-art English models. That's where we started. And as we talk with many customers, we realize, oh, audio and text are very, very close. Many of our customers start to build assistants, all kinds of assistants using text. And they immediately want to add audio, audio in, audio out. So we support transcription, translation, speech synthesis, text, audio alignment, all different kinds of audio features. It's a big announcement. You should have heard by the time this is out. And the other areas of vision and text are very close with each other. Because a lot of information doesn't live in plain text. A lot of information lives in multimedia format, images, PDFs, screenshots, and many other different formats. So oftentimes to solve a problem, we need to put the vision model first to extract information and then use language model to process and then send out results. So vision is important. We also support vision model, various different kinds of vision models specialized in processing different kinds of source and extraction. And we're also going to have another announcement of a new API endpoint we'll support for people to upload various different kinds of multimedia content and then get the extract very accurate information out and feed that into LM. And of course, we support embedding because embedding is very important for semantic search, for RAG, and all this. And in addition to that, we also support text-to-image, image generation models, text-to-image, image-to-image, and we're adding text-to-video as well in our portfolio. So it's a very comprehensive set of model catalog that built on top of File Optimizer and Distributed Inference Engine. But then we talk with more customers, they solve business use case, and then we realize one model is not sufficient to solve their problem. And it's very clear because one is the model hallucinates. Many customers, when they onboard this JNI journey, they thought this is magical. JNI is going to solve all my problems magically. But then they realize, oh, this model hallucinates. It hallucinates because it's not deterministic, it's probabilistic. So it's designed to always give you an answer, but based on probabilities, so it hallucinates. And that's actually sometimes a feature for creative writing, for example. Sometimes it's a bug because, hey, you don't want to give misinformation. And different models also have different specialties. To solve a problem, you want to ask different special models to kind of decompose your task into multiple small tasks, narrow tasks, and then have an expert model solve that task really well. And of course, the model doesn't have all the information. It has limited knowledge because the training data is finite, not infinite. So the model oftentimes doesn't have real-time information. It doesn't know any proprietary information within the enterprise. It's clear that in order to really build a compiling application on top of JNI, we need a compound AI system. Compound AI system basically is going to have multiple models across modalities, along with APIs, whether it's public APIs, internal proprietary APIs, storage systems, database systems, knowledge to work together to deliver the best answer.Swyx [00:20:07]: Are you going to offer a vector database?Lin [00:20:09]: We actually heavily partner with several big vector database providers. Which is your favorite? They are all great in different ways. But it's public information, like MongoDB is our investor. And we have been working closely with them for a while.Alessio [00:20:26]: When you say distributed inference engine, what do you mean exactly? Because when I hear your explanation, it's almost like you're centralizing a lot of the decisions through the Fireworks platform on the quality and whatnot. What do you mean distributed? It's like you have GPUs in a lot of different clusters, so you're sharding the inference across the same model.Lin [00:20:45]: So first of all, we run across multiple GPUs. But the way we distribute across multiple GPUs is unique. We don't distribute the whole model monolithically across multiple GPUs. We chop them into pieces and scale them completely differently based on what's the bottleneck. We also are distributed across regions. We have been running in North America, EMEA, and Asia. We have regional affinity to applications because latency is extremely important. We are also doing global load balancing because a lot of applications there, they quickly scale to global population. And then at that scale, different content wakes up at a different time. And you want to kind of load balancing across. So all the way, and we also have, we manage various different kinds of hardware skew from different hardware vendors. And different hardware design is best for different types of workload, whether it's long context, short context, long generation. So all these different types of workload is best fitted for different kinds of hardware skew. And then we can even distribute across different hardware for a workload. So the distribution actually is all around in the full stack.Swyx [00:22:02]: At some point, we'll show on the YouTube, the image that Ray, I think, has been working on with all the different modalities that you offer. To me, it's basically you offer the open source version of everything that OpenAI typically offers. I don't think there is. Actually, if you do text to video, you will be a superset of what OpenAI offers because they don't have Sora. Is that Mochi, by the way? Mochi. Mochi, right?Lin [00:22:27]: Mochi. And there are a few others. I will say, the interesting thing is, I think we're betting on the open source community is going to proliferate. This is literally what we're seeing. And there's amazing video generation companies. There is amazing audio companies. Like cross-border, the innovation is off the chart, and we are building on top of that. I think that's the advantage we have compared with a closed source company.Swyx [00:22:58]: I think I want to restate the value proposition of Fireworks for people who are comparing you versus a raw GPU provider like a RunPod or Lambda or anything like those, which is like you create the developer experience layer and you also make it easily scalable or serverless or as an endpoint. And then, I think for some models, you have custom kernels, but not all models.Lin [00:23:25]: Almost for all models. For all large language models, all your models, and the VRMs. Almost for all models we serve.Swyx [00:23:35]: And so that is called Fire Attention. I don't remember the speed numbers, but apparently much better than VLM, especially on a concurrency basis.Lin [00:23:44]: So Fire Attention is specific mostly for language models, but for other modalities, we'll also have a customized kernel.Swyx [00:23:51]: And I think the typical challenge for people is understanding that has value, and then there are other people who are also offering open-source models. Your mode is your ability to offer a good experience for all these customers. But if your existence is entirely reliant on people releasing nice open-source models, other people can also do the same thing.Lin [00:24:14]: So I would say we build on top of open-source model foundation. So that's the kind of foundation we build on top of. But we look at the value prop from the lens of application developers and product engineers. So they want to create new UX. So what's happening in the industry right now is people are thinking about a completely new way of designing products. And I'm talking to so many founders, it's just mind-blowing. They help me understand existing way of doing PowerPoint, existing way of coding, existing way of managing customer service. It's actually putting a box in our head. For example, PowerPoint. So PowerPoint generation is we always need to think about how to fit into my storytelling into this format of slide one after another. And I'm going to juggle through design together with what story to tell. But the most important thing is what's our storytelling lines, right? And why don't we create a space that is not limited to any format? And those kind of new product UX design combined with automated content generation through Gen AI is the new thing that many founders are doing. What are the challenges they're facing? Let's go from there. One is, again, because a lot of products built on top of Gen AI, they are consumer-personal developer facing, and they require interactive experience. It's just a kind of product experience we all get used to. And our desire is to actually get faster and faster interaction. Otherwise, nobody wants to spend time, right? And then that requires low latency. And the other thing is the nature of consumer-personal developer facing is your audience is very big. You want to scale up to product market fit quickly. But if you lose money at a small scale, you're going to bankrupt quickly. So it's actually a big contrast. I actually have product market fit, but when I scale, I scale out of my business. So that's kind of a very funny way to think about it. So then having low latency and low cost is essential for those new applications and products to survive and really become a generation company. So that's the design point for our distributed inference engine and the file optimizer. File optimizer, you can think about that as a feedback loop. The more you feed your inference workload to our inference engine, the more we help you improve quality, lower latency further, lower your cost. It basically becomes better. And we automate that because we don't want you as an app developer or product engineer to think about how to figure out all these low-level details. It's impossible because you're not trained to do that at all. You should kind of keep your focus on the product innovation. And then the compound AI, we actually feel a lot of pain as the app developers, engineers, there are so many models. Every week, there's at least a new model coming out.Swyx [00:27:09]: Tencent had a giant model this week. Yeah, yeah.Lin [00:27:13]: I saw that. I saw that.Swyx [00:27:15]: It's like $500 billion.Lin [00:27:18]: So they're like, should I keep chasing this or should I forget about it? And which model should I pick to solve what kind of sub-problem? How do I even decompose my problem into those smaller problems and fit the model into it? I have no idea. And then there are two ways to think about this design. I think I talked about that in the past. One is imperative, as in you figure out how to do it. You give developer tools to dictate how to do it. Or you build a declarative system where a developer tells what they want to do, not how. So these are completely two different designs. So the analogy I want to draw is, in the data world, the database management system is a declarative system because people use database, use SQL. SQL is a way you say, what do you want to extract out of a database? What kind of result do you want? But you don't figure out which node is going to, how many nodes you're going to run on top of, how you redefine your disk, which index you use, which project. You don't need to worry about any of those. And database management system will figure out, generate a new best plan, and execute on that. So database is declarative. And it makes it super easy. You just learn SQL, which is learn a semantic meaning of SQL, and you can use it. Imperative side is there are a lot of ETL pipelines. And people design this DAG system with triggers, with actions, and you dictate exactly what to do. And if it fails, then how to recover. So that's an imperative system. We have seen a range of systems in the ecosystem go different ways. I think there's value of both. There's value of both. I don't think one is going to subsume the other. But we are leaning more into the philosophy of the declarative system. Because from the lens of app developer and product engineer, that would be easiest for them to integrate.Swyx [00:29:07]: I understand that's also why PyTorch won as well, right? This is one of the reasons. Ease of use.Lin [00:29:14]: Focus on ease of use, and then let the system take on the hard challenges and complexities. So we follow, we extend that thinking into current system design. So another announcement is we will also announce our next declarative system is going to appear as a model that has extremely high quality. And this model is inspired by Owen's announcement for OpenAI. You should see that by the time we announce this or soon.Alessio [00:29:46]: Trained by you.Lin [00:29:47]: Yes.Alessio [00:29:48]: Is this the first model that you trained? It's not the first.Lin [00:29:52]: We actually have trained a model called FireFunction. It's a function calling model. It's our first step into compound AI system. Because function calling model can dispatch a request into multiple APIs. We have pre-baked set of APIs the model learned. You can also add additional APIs through the configuration to let model dispatch accordingly. So we have a very high quality function calling model that's already released. We have actually three versions. The latest version is very high quality. But now we take a further step that you don't even need to use function calling model. You use our new model we're going to release. It will solve a lot of problems approaching very high OpenAI quality. So I'm very excited about that.Swyx [00:30:41]: Do you have any benchmarks yet?Lin [00:30:43]: We have a benchmark. We're going to release it hopefully next week. We just put our model to LMSYS and people are guessing. Is this the next Gemini model or a MADIS model? People are guessing. That's very interesting. We're watching the Reddit discussion right now.Swyx [00:31:00]: I have to ask more questions about this. When OpenAI released o1, a lot of people asked about whether or not it's a single model or whether it's a chain of models. Noam and basically everyone on the Strawberry team was very insistent that what they did for reinforcement learning, chain of thought, cannot be replicated by a whole bunch of open source model calls. Do you think that that is wrong? Have you done the same amount of work on RL as they have or was it a different direction?Lin [00:31:29]: I think they take a very specific approach where the caliber of team is very high. So I do think they are the domain expert in doing the things they are doing. I don't think there's only one way to achieve the same goal. We're on the same direction in the sense that the quality scaling law is shifting from training to inference. For that, I fully agree with them. But we're taking a completely different approach to the problem. All of that is because, of course, we didn't train the model from scratch. All of that is because we built on the show of giants. The current model available we have access to is getting better and better. The future trend is the gap between the open source model and the co-source model. It's just going to shrink to the point there's not much difference. And then we're on the same level field. That's why I think our early investment in inference and all the work we do around balancing across quality, latency, and cost pay off because we have accumulated a lot of experience and that empowers us to release this new model that is approaching open-ended quality.Alessio [00:32:39]: I guess the question is, what do you think the gap to catch up will be? Because I think everybody agrees with open source models eventually will catch up. And I think with 4, then with Lama 3.2, 3.1, 4.5b, we close the gap. And then 0.1 just reopened the gap so much and it's unclear. Obviously, you're saying your model will have...Swyx [00:32:57]: We're closing that gap.Alessio [00:32:58]: But you think in the future, it's going to be months?Lin [00:33:02]: So here's the thing that's happened. There's public benchmark. It is what it is. But in reality, open source models in certain dimensions are already on par or beat closed source models. So for example, in the coding space, open source models are really, really good. And in function calling, file function is also really, really good. So it's all a matter of whether you build one model to solve all the problems and you want to be the best of solving all the problems, or in the open source domain, it's going to specialize. All these different model builders specialize in certain narrow area. And it's logical that they can be really, really good in that very narrow area. And that's our prediction is with specialization, there will be a lot of expert models really, really good and even better than one-size-fits-all closed source models.Swyx [00:33:55]: I think this is the core debate that I am still not 100% either way on in terms of compound AI versus normal AI. Because you're basically fighting the bitter lesson.Lin [00:34:09]: Look at the human society, right? We specialize. And you feel really good about someone specializing doing something really well, right? And that's how our way evolved from ancient times. We're all journalists. We do everything. Now we heavily specialize in different domains. So my prediction is in the AI model space, it will happen also. Except for the bitter lesson.Swyx [00:34:30]: You get short-term gains by having specialists, domain specialists, and then someone just needs to train like a 10x bigger model on 10x more inference, 10x more data, 10x more model perhaps, whatever the current scaling law is. And then it supersedes all the individual models because of some generalized intelligence slash world knowledge. I think that is the core insight of the GPTs, the GPT-123 networks. Right.Lin [00:34:56]: But the training scaling law is because you have an increasing amount of data to train from. And you can do a lot of compute. So I think on the data side, we're approaching the limit. And the only data to increase that is synthetic generated data. And then there's like what is the secret sauce there, right? Because if you have a very good large model, you can generate very good synthetic data and then continue to improve quality. So that's why I think in OpenAI, they are shifting from the training scaling law intoSwyx [00:35:25]: inference scaling law.Lin [00:35:25]: And it's the test time and all this. So I definitely believe that's the future direction. And that's where we are really good at, doing inference.Swyx [00:35:34]: A couple of questions on that. Are you planning to share your reasoning choices?Lin [00:35:39]: That's a very good question. We are still debating.Swyx [00:35:43]: Yeah.Lin [00:35:45]: We're still debating.Swyx [00:35:46]: I would say, for example, it's interesting that, for example, SweetBench. If you want to be considered for ranking, you have to submit your reasoning choices. And that has actually disqualified some of our past guests. Cosign was doing well on SweetBench, but they didn't want to leak those results. So that's why you don't see O1 preview on SweetBench, because they don't submit their reasoning choices. And obviously, it's IP. But also, if you're going to be more open, then that's one way to be more open. So your model is not going to be open source, right? It's going to be an endpoint that you provide. Okay, cool. And then pricing, also the same as OpenAI, just kind of based on...Lin [00:36:25]: Yeah, this is... I don't have, actually, information. Everything is going so fast, we haven't even thought about that yet. Yeah, I should be more prepared.Swyx [00:36:33]: I mean, this is live. You know, it's nice to just talk about it as it goes live. Any other things that you want feedback on or you're thinking through? It's kind of nice to just talk about something when it's not decided yet. About this new model. It's going to be exciting. It's going to generate a lot of buzz. Right.Lin [00:36:51]: I'm very excited to see how people are going to use this model. So there's already a Reddit discussion about it. And people are asking very deep, mathematical questions. And since the model got it right, surprising. And internally, we're also asking the model to generate what is AGI. And it generates a very complicated DAG thinking process. So we're having a lot of fun testing this internally. But I'm more curious, how will people use it? What kind of application they're going to try and test on it? And that's where we really like to hear feedback from the community. And also feedback to us. What works out well? What doesn't work out well? What works out well, but surprising them? And what kind of thing they think we should improve on? And those kind of feedback will be tremendously helpful.Swyx [00:37:44]: Yeah. So I've been a production user of Preview and Mini since launch. I would say they're very, very obvious jobs in quality. So much so that they made clods on it. And they made the previous state-of-the-art look bad. It's really that stark, that difference. The number one thing, just feedback or feature requests, is people want control on the budget. Because right now, in 0.1, it kind of decides its own thinking budget. But sometimes you know how hard the problem is. And you want to actually tell the model, spend two minutes on this. Or spend some dollar amount. Maybe it's time you miss dollars. I don't know what the budget is. That makes a lot of sense.Lin [00:38:27]: So we actually thought about that requirement. And it should be, at some point, we need to support that. Not initially. But that makes a lot of sense.Swyx [00:38:38]: Okay. So that was a fascinating overview of just the things that you're working on. First of all, I realized that... I don't know if I've ever given you this feedback. But I think you guys are one of the reasons I agreed to advise you. Because I think when you first met me, I was kind of dubious. I was like... Who are you? There's Replicate. There's Together. There's Laptop. There's a whole bunch of other players. You're in very, very competitive fields. Like, why will you win? And the reason I actually changed my mind was I saw you guys shipping. I think your surface area is very big. The team is not that big. No. We're only 40 people. Yeah. And now here you are trying to compete with OpenAI and everyone else. What is the secret?Lin [00:39:21]: I think the team. The team is the secret.Swyx [00:39:23]: Oh boy. So there's no thing I can just copy. You just... No.Lin [00:39:30]: I think we all come from a very aligned culture. Because most of our team came from meta.Swyx [00:39:38]: Yeah.Lin [00:39:38]: And many startups. So we really believe in results. One is result. And second is customer. We're very customer obsessed. And we don't want to drive adoption for the sake of adoption. We really want to make sure we understand we are delivering a lot of business values to the customer. And we really value their feedback. So we would wake up midnight and deploy some model for them. Shuffle some capacity for them. And yeah, over the weekend, no brainer.Swyx [00:40:15]: So yeah.Lin [00:40:15]: So that's just how we work as a team. And the caliber of the team is really, really high as well. So as plug-in, we're hiring. We're expanding very, very fast. So if we are passionate about working on the most cutting-edge technology in the general space, come talk with us. Yeah.Swyx [00:40:38]: Let's talk a little bit about that customer journey. I think one of your more famous customers is Cursor. We were the first podcast to have Cursor on. And then obviously since then, they have blown up. Cause and effect are not related. But you guys especially worked on a fast supply model where you were one of the first people to work on speculative decoding in a production setting. Maybe just talk about what was the behind the scenes of working with Cursor?Lin [00:41:03]: I will say Cursor is a very, very unique team. I think the unique part is the team has very high technical caliber. There's no question about it. But they have decided, although many companies building coding co-pilot, they will say, I'm going to build a whole entire stack because I can. And they are unique in the sense they seek partnership. Not because they cannot. They're fully capable, but they know where to focus. That to me is amazing. And of course, they want to find a bypass partner. So we spent some time working together. They are pushing us very aggressively because for them to deliver high caliber product experience, they need the latency. They need the interactive, but also high quality at the same time. So actually, we expanded our product feature quite a lot as we support Cursor. And they are growing so fast. And we massively scaled quickly across multiple regions. And we developed a pretty high intense inference stack, almost like similar to what we do for Meta. I think that's a very, very interesting engagement. And through that, there's a lot of trust being built. They realize, hey, this is a team they can really partner with. And they can go big with. That comes back to, hey, we're really customer obsessed. And all the engineers working with them, there's just enormous amount of time syncing together with them and discussing. And we're not big on meetings, but we are like stack channel always on. Yeah, so you almost feel like working as one team. So I think that's really highlighted.Swyx [00:42:38]: Yeah. For those who don't know, so basically Cursor is a VS Code fork. But most of the time, people will be using closed models. Like I actually use a lot of SONET. So you're not involved there, right? It's not like you host SONET or you have any partnership with it. You're involved where Cursor is small, or like their house brand models are concerned, right?Lin [00:42:58]: I don't know what I can say, but the things they haven't said.Swyx [00:43:04]: Very obviously, the drop down is 4.0, but in Cursor, right? So I assume that the Cursor side is the Fireworks side. And then the other side, they're calling out the other. Just kind of curious. And then, do you see any more opportunity on the... You know, I think you made a big splash with 1,000 tokens per second. That was because of speculative decoding. Is there more to push there?Lin [00:43:25]: We push a lot. Actually, when I mentioned Fire Optimizer, right? So as in, we have a unique automation stack that is one size fits one. We actually deployed to Cursor earlier on. Basically optimized for their specific workload. And that's a lot of juice to extract out of there. And we see success in that product. It actually can be widely adopted. So that's why we started a separate product line called Fire Optimizer. So speculative decoding is just one approach. And speculative decoding here is not static. We actually wrote a blog post about it. There's so many different ways to do speculative decoding. You can pair a small model with a large model in the same model family. Or you can have equal pads and so on. There are different trade-offs which approach you take. It really depends on your workload. And then with your workload, we can align the Eagle heads or Medusa heads or a small big model pair much better to extract the best latency reduction. So all of that is part of the Fire Optimizer offering.Alessio [00:44:23]: I know you mentioned some of the other inference providers. I think the other question that people always have is around benchmarks. So you get different performance on different platforms. How should people think about... People are like, hey, Lama 3.2 is X on MMLU. But maybe using speculative decoding, you go down a different path. Maybe some providers run a quantized model. How should people think about how much they should care about how you're actually running the model? What's the delta between all the magic that you do and what a raw model...Lin [00:44:57]: Okay, so there are two big development cycles. One is experimentation, where they need fast iteration. They don't want to think about quality, and they just want to experiment with product experience and so on. So that's one. And then it looks good, and they want to post-product market with scaling. And the quality is really important. And latency and all the other things are becoming important. During the experimentation phase, it's just pick a good model. Don't worry about anything else. Make sure you even generate the right solution to your product. And that's the focus. And then post-product market fit, then that's kind of the three-dimensional optimization curve start to kick in across quality, latency, cost, where you should land. And to me, it's purely a product decision. To many products, if you choose a lower quality, but better speed and lower cost, but it doesn't make a difference to the product experience, then you should do it. So that's why I think inference is part of the validation. The validation doesn't stop at offline eval. The validation will go through A-B testing, through inference. And that's where we offer various different configurations for you to test which is the best setting. So this is the traditional product evaluation. So product evaluation should also include your new model versions and different model setup into the consideration.Swyx [00:46:22]: I want to specifically talk about what happens a few months ago with some of your major competitors. I mean, all of this is public. What is your take on what happens? And maybe you want to set the record straight on how Fireworks does quantization because I think a lot of people may have outdated perceptions or they didn't read the clarification post on your approach to quantization.Lin [00:46:44]: First of all, it's always a surprise to us that without any notice, we got called out.Swyx [00:46:51]: Specifically by name, which is normally not what...Lin [00:46:54]: Yeah, in a public post. And have certain interpretation of our quality. So I was really surprised. And it's not a good way to compete, right? We want to compete fairly. And oftentimes when one vendor gives out results, the interpretation of another vendor is always extremely biased. So we actually refrain ourselves to do any of those. And we happily partner with third parties to do the most fair evaluation. So we're very surprised. And we don't think that's a good way to figure out the competition landscape. So then we react. I think when it comes to quantization, the interpretation, we wrote actually a very thorough blog post. Because again, no one says it's all. We have various different quantization schemes. We can quantize very different parts of the model from ways to activation to cross-TPU communication. They can use different quantization schemes or consistent across the board. And again, it's a trade-off. It's a trade-off across this three-dimensional quality, latency, and cost. And for our customer, we actually let them find the best optimized point. And we have a very thorough evaluation process to pick that point. But for self-serve, there's only one point to pick. There's no customization available. So of course, it depends on what we talk with many customers. We have to pick one point. And I think the end result, like AA published, later on AA published a quality measure. And we actually looked really good. So that's why what I mean is, I will leave the evaluation of quality or performance to third party and work with them to find the most fair benchmark. And I think that's a good approach, a methodology. But I'm not a part of an approach of calling out specific namesSwyx [00:48:55]: and critique other competitors in a very biased way. Databases happens as well. I think you're the more politically correct one. And then Dima is the more... Something like this. It's you on Twitter.Lin [00:49:11]: It's like the Russian... We partner. We play different roles.Swyx [00:49:20]: Another one that I wanted to... I'm just the last one on the competition side. There's a perception of price wars in hosting open source models. And we talked about the competitiveness in the market. Do you aim to make margin on open source models? Oh, absolutely, yes.Lin [00:49:38]: So, but I think it really... When we think about pricing, it's really need to coordinate with the value we're delivering. If the value is limited, or there are a lot of people delivering the same value, there's no differentiation. There's only one way to go. It's going down. So through competition. If I take a big step back, there is pricing from... We're more compared with close model providers, APIs, right? The close model provider, their cost structure is even more interesting because we don't bear any training costs. And we focus on inference optimization, and that's kind of where we continue to add a lot of product value. So that's how we think about product. But for the close source API provider, model provider, they bear a lot of training costs. And they need to amortize the training costs into the inference. So that created very interesting dynamics of, yeah, if we match pricing there, and I think how they are going to make money is very, very interesting.Swyx [00:50:37]: So for listeners, opening eyes 2024, $4 billion in revenue, $3 billion in compute training, $2 billion in compute inference, $1 billion in research compute amortization, and $700 million in salaries. So that is like...Swyx [00:50:59]: I mean, a lot of R&D.Lin [00:51:01]: Yeah, so I think matter is basically like, make it zero. So that's a very, very interesting dynamics we're operating within. But coming back to inference, so we are, again, as I mentioned, our product is, we are a platform. We're not just a single model as a service provider as many other inference providers, like they're providing a single model. We have our optimizer to highly customize towards your inference workload. We have a compound AI system where significantly simplify your interaction to high quality and low latency, low cost. So those are all very different from other providers.Alessio [00:51:38]: What do people not know about the work that you do? I guess like people are like, okay, Fireworks, you run model very quickly. You have the function model. Is there any kind of like underrated part of Fireworks that more people should try?Lin [00:51:51]: Yeah, actually, one user post on x.com, he mentioned, oh, actually, Fireworks can allow me to upload the LoRa adapter to the service model at the same cost and use it at same cost. Nobody has provided that. That's because we have a very special, like we rolled out multi-LoRa last year, actually. And we actually have this function for a long time. And many people has been using it, but it's not well known that, oh, if you find your model, you don't need to use on demand. If you find your model is LoRa, you can upload your LoRa adapter and we deploy it as if it's a new model. And then you use, you get your endpoint and you can use that directly, but at the same cost as the base model. So I'm happy that user is marketing it for us. He discovered that feature, but we have that for last year. So I think to feedback to me is, we have a lot of very, very good features, as Sean just mentioned. I'm the advisor to the company,Swyx [00:52:57]: and I didn't know that you had speculative decoding released.Lin [00:53:02]: We have prompt catching way back last year also. We have many, yeah. So I think that is one of the underrated feature. And if they're developers, you are using our self-serve platform, please try it out.Swyx [00:53:16]: The LoRa thing is interesting because I think you also, the reason people add additional costs to it, it's not because they feel like charging people. Normally in normal LoRa serving setups, there is a cost to dedicating, loading those weights and dedicating a machine to that inference. How come you can't avoid it?Lin [00:53:36]: Yeah, so this is kind of our technique called multi-LoRa. So we basically have many LoRa adapters share the same base model. And basically we significantly reduce the memory footprint of serving. And the one base model can sustain a hundred to a thousand LoRa adapters. And then basically all these different LoRa adapters can share the same, like direct the same traffic to the same base model where base model is dominating the cost. So that's how we advertise that way. And that's how we can manage the tokens per dollar, million token pricing, the same as base model.Swyx [00:54:13]: Awesome. Is there anything that you think you want to request from the community or you're looking for model-wise or tooling-wise that you think like someone should be working on in this?Lin [00:54:23]: Yeah, so we really want to get a lot of feedback from the application developers who are starting to build on JNN or on the already adopted or starting about thinking about new use cases and so on to try out Fireworks first. And let us know what works out really well for you and what is your wishlist and what sucks, right? So what is not working out for you and we would like to continue to improve. And for our new product launches, typically we want to launch to a small group of people. Usually we launch on our Discord first to have a set of people use that first. So please join our Discord channel. We have a lot of communication going on there. Again, you can also give us feedback. We'll have a starting office hour for you to directly talk with our DevRel and engineers to exchange more long notes.Alessio [00:55:17]: And you're hiring across the board?Lin [00:55:18]: We're hiring across the board. We're hiring front-end engineers, infrastructure cloud, infrastructure engineers, back-end system optimization engineers, applied researchers, like researchers who have done post-training, who have done a lot of fine-tuning and so on.Swyx [00:55:34]: That's it. Thank you. Thanks for having us. Get full access to Latent Space at www.latent.space/subscribe
    --------  
    58:25
  • Agents @ Work: Lindy.ai
    Alessio will be at AWS re:Invent next week and hosting a casual coffee meetup on Wednesday, RSVP here! And subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups!We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!If you've been following the AI agents space, you have heard of Lindy AI; while founder Flo Crivello is hesitant to call it "blowing up," when folks like Andrew Wilkinson start obsessing over your product, you're definitely onto something.In our latest episode, Flo walked us through Lindy's evolution from late 2022 to now, revealing some design choices about agent platform design that go against conventional wisdom in the space.The Great Reset: From Text Fields to RailsRemember late 2022? Everyone was "LLM-pilled," believing that if you just gave a language model enough context and tools, it could do anything. Lindy 1.0 followed this pattern:* Big prompt field ✅* Bunch of tools ✅* Prayer to the LLM gods ✅Fast forward to today, and Lindy 2.0 looks radically different. As Flo put it (~17:00 in the episode): "The more you can put your agent on rails, one, the more reliable it's going to be, obviously, but two, it's also going to be easier to use for the user."Instead of a giant, intimidating text field, users now build workflows visually:* Trigger (e.g., "Zendesk ticket received")* Required actions (e.g., "Check knowledge base")* Response generationThis isn't just a UI change - it's a fundamental rethinking of how to make AI agents reliable. As Swyx noted during our discussion: "Put Shoggoth in a box and make it a very small, minimal viable box. Everything else should be traditional if-this-then-that software."The Surprising Truth About Model LimitationsHere's something that might shock folks building in the space: with Claude 3.5 Sonnet, the model is no longer the bottleneck. Flo's exact words (~31:00): "It is actually shocking the extent to which the model is no longer the limit. It was the limit a year ago. It was too expensive. The context window was too small."Some context: Lindy started when context windows were 4K tokens. Today, their system prompt alone is larger than that. But what's really interesting is what this means for platform builders:* Raw capabilities aren't the constraint anymore* Integration quality matters more than model performance* User experience and workflow design are the new bottlenecksThe Search Engine Parallel: Why Horizontal Platforms Might WinOne of the spiciest takes from our conversation was Flo's thesis on horizontal vs. vertical agent platforms. He draws a fascinating parallel to search engines (~56:00):"I find it surprising the extent to which a horizontal search engine has won... You go through Google to search Reddit. You go through Google to search Wikipedia... search in each vertical has more in common with search than it does with each vertical."His argument: agent platforms might follow the same pattern because:* Agents across verticals share more commonalities than differences* There's value in having agents that can work together under one roof* The R&D cost of getting agents right is better amortized across use casesThis might explain why we're seeing early vertical AI companies starting to expand horizontally. The core agent capabilities - reliability, context management, tool integration - are universal needs.What This Means for BuildersIf you're building in the AI agents space, here are the key takeaways:* Constrain First: Rather than maximizing capabilities, focus on reliable execution within narrow bounds* Integration Quality Matters: With model capabilities plateauing, your competitive advantage lies in how well you integrate with existing tools* Memory Management is Key: Flo revealed they actively prune agent memories - even with larger context windows, not all memories are useful* Design for Discovery: Lindy's visual workflow builder shows how important interface design is for adoptionThe Meta LayerThere's a broader lesson here about AI product development. Just as Lindy evolved from "give the LLM everything" to "constrain intelligently," we might see similar evolution across the AI tooling space. The winners might not be those with the most powerful models, but those who best understand how to package AI capabilities in ways that solve real problems reliably.Full Video PodcastFlo’s talk at AI Engineer SummitChapters* 00:00:00 Introductions * 00:04:05 AI engineering and deterministic software * 00:08:36 Lindys demo* 00:13:21 Memory management in AI agents * 00:18:48 Hierarchy and collaboration between Lindys * 00:21:19 Vertical vs. horizontal AI tools * 00:24:03 Community and user engagement strategies * 00:26:16 Rickrolling incident with Lindy * 00:28:12 Evals and quality control in AI systems * 00:31:52 Model capabilities and their impact on Lindy * 00:39:27 Competition and market positioning * 00:42:40 Relationship between Factorio and business strategy * 00:44:05 Remote work vs. in-person collaboration * 00:49:03 Europe vs US Tech* 00:58:59 Testing the Overton window and free speech * 01:04:20 Balancing AI safety concerns with business innovation Show Notes* Lindy.ai* Rick Rolling* Flo on X* TeamFlow* Andrew Wilkinson* Dust* Poolside.ai* SB1047* Gathertown* Sid Sijbrandij* Matt Mullenweg* Factorio* Seeing Like a StateTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:12]: Hey, and today we're joined in the studio by Florent Crivello. Welcome.Flo [00:00:15]: Hey, yeah, thanks for having me.Swyx [00:00:17]: Also known as Altimore. I always wanted to ask, what is Altimore?Flo [00:00:21]: It was the name of my character when I was playing Dungeons & Dragons. Always. I was like 11 years old.Swyx [00:00:26]: What was your classes?Flo [00:00:27]: I was an elf. I was a magician elf.Swyx [00:00:30]: Well, you're still spinning magic. Right now, you're a solo founder and CEO of Lindy.ai. What is Lindy?Flo [00:00:36]: Yeah, we are a no-code platform letting you build your own AI agents easily. So you can think of we are to LangChain as Airtable is to MySQL. Like you can just pin up AI agents super easily by clicking around and no code required. You don't have to be an engineer and you can automate business workflows that you simply could not automate before in a few minutes.Swyx [00:00:55]: You've been in our orbit a few times. I think you spoke at our Latent Space anniversary. You spoke at my summit, the first summit, which was a really good keynote. And most recently, like we actually already scheduled this podcast before this happened. But Andrew Wilkinson was like, I'm obsessed by Lindy. He's just created a whole bunch of agents. So basically, why are you blowing up?Flo [00:01:16]: Well, thank you. I think we are having a little bit of a moment. I think it's a bit premature to say we're blowing up. But why are things going well? We revamped the product majorly. We called it Lindy 2.0. I would say we started working on that six months ago. We've actually not really announced it yet. It's just, I guess, I guess that's what we're doing now. And so we've basically been cooking for the last six months, like really rebuilding the product from scratch. I think I'll list you, actually, the last time you tried the product, it was still Lindy 1.0. Oh, yeah. If you log in now, the platform looks very different. There's like a ton more features. And I think one realization that we made, and I think a lot of folks in the agent space made the same realization, is that there is such a thing as too much of a good thing. I think many people, when they started working on agents, they were very LLM peeled and chat GPT peeled, right? They got ahead of themselves in a way, and us included, and they thought that agents were actually, and LLMs were actually more advanced than they actually were. And so the first version of Lindy was like just a giant prompt and a bunch of tools. And then the realization we had was like, hey, actually, the more you can put your agent on Rails, one, the more reliable it's going to be, obviously, but two, it's also going to be easier to use for the user, because you can really, as a user, you get, instead of just getting this big, giant, intimidating text field, and you type words in there, and you have no idea if you're typing the right word or not, here you can really click and select step by step, and tell your agent what to do, and really give as narrow or as wide a guardrail as you want for your agent. We started working on that. We called it Lindy on Rails about six months ago, and we started putting it into the hands of users over the last, I would say, two months or so, and I think things really started going pretty well at that point. The agent is way more reliable, way easier to set up, and we're already seeing a ton of new use cases pop up.Swyx [00:03:00]: Yeah, just a quick follow-up on that. You launched the first Lindy in November last year, and you were already talking about having a DSL, right? I remember having this discussion with you, and you were like, it's just much more reliable. Is this still the DSL under the hood? Is this a UI-level change, or is it a bigger rewrite?Flo [00:03:17]: No, it is a much bigger rewrite. I'll give you a concrete example. Suppose you want to have an agent that observes your Zendesk tickets, and it's like, hey, every time you receive a Zendesk ticket, I want you to check my knowledge base, so it's like a RAG module and whatnot, and then answer the ticket. The way it used to work with Lindy before was, you would type the prompt asking it to do that. You check my knowledge base, and so on and so forth. The problem with doing that is that it can always go wrong. You're praying the LLM gods that they will actually invoke your knowledge base, but I don't want to ask it. I want it to always, 100% of the time, consult the knowledge base after it receives a Zendesk ticket. And so with Lindy, you can actually have the trigger, which is Zendesk ticket received, have the knowledge base consult, which is always there, and then have the agent. So you can really set up your agent any way you want like that.Swyx [00:04:05]: This is something I think about for AI engineering as well, which is the big labs want you to hand over everything in the prompts, and only code of English, and then the smaller brains, the GPU pours, always want to write more code to make things more deterministic and reliable and controllable. One way I put it is put Shoggoth in a box and make it a very small, the minimal viable box. Everything else should be traditional, if this, then that software.Flo [00:04:29]: I love that characterization, put the Shoggoth in the box. Yeah, we talk about using as much AI as necessary and as little as possible.Alessio [00:04:37]: And what was the choosing between kind of like this drag and drop, low code, whatever, super code-driven, maybe like the Lang chains, auto-GPT of the world, and maybe the flip side of it, which you don't really do, it's like just text to agent, it's like build the workflow for me. Like what have you learned actually putting this in front of users and figuring out how much do they actually want to add it versus like how much, you know, kind of like Ruby on Rails instead of Lindy on Rails, it's kind of like, you know, defaults over configuration.Flo [00:05:06]: I actually used to dislike when people said, oh, text is not a great interface. I was like, ah, this is such a mid-take, I think text is awesome. And I've actually come around, I actually sort of agree now that text is really not great. I think for people like you and me, because we sort of have a mental model, okay, when I type a prompt into this text box, this is what it's going to do, it's going to map it to this kind of data structure under the hood and so forth. I guess it's a little bit blackmailing towards humans. You jump on these calls with humans and you're like, here's a text box, this is going to set up an agent for you, do it. And then they type words like, I want you to help me put order in my inbox. Oh, actually, this is a good one. This is actually a good one. What's a bad one? I would say 60 or 70% of the prompts that people type don't mean anything. Me as a human, as AGI, I don't understand what they mean. I don't know what they mean. It is actually, I think whenever you can have a GUI, it is better than to have just a pure text interface.Alessio [00:05:58]: And then how do you decide how much to expose? So even with the tools, you have Slack, you have Google Calendar, you have Gmail. Should people by default just turn over access to everything and then you help them figure out what to use? I think that's the question. When I tried to set up Slack, it was like, hey, give me access to all channels and everything, which for the average person probably makes sense because you don't want to re-prompt them every time you add new channels. But at the same time, for maybe the more sophisticated enterprise use cases, people are like, hey, I want to really limit what you have access to. How do you kind of thread that balance?Flo [00:06:35]: The general philosophy is we ask for the least amount of permissions needed at any given moment. I don't think Slack, I could be mistaken, but I don't think Slack lets you request permissions for just one channel. But for example, for Google, obviously there are hundreds of scopes that you could require for Google. There's a lot of scopes. And sometimes it's actually painful to set up your Lindy because you're going to have to ask Google and add scopes five or six times. We've had sessions like this. But that's what we do because, for example, the Lindy email drafter, she's going to ask you for your authorization once for, I need to be able to read your email so I can draft a reply, and then another time for I need to be able to write a draft for them. We just try to do it very incrementally like that.Alessio [00:07:15]: Do you think OAuth is just overall going to change? I think maybe before it was like, hey, we need to set up OAuth that humans only want to kind of do once. So we try to jam-pack things all at once versus what if you could on-demand get different permissions every time from different parts? Do you ever think about designing things knowing that maybe AI will use it instead of humans will use it? Yeah, for sure.Flo [00:07:37]: One pattern we've started to see is people provisioning accounts for their AI agents. And so, in particular, Google Workspace accounts. So, for example, Lindy can be used as a scheduling assistant. So you can just CC her to your emails when you're trying to find time with someone. And just like a human assistant, she's going to go back and forth and offer other abilities and so forth. Very often, people don't want the other party to know that it's an AI. So it's actually funny. They introduce delays. They ask the agent to wait before replying, so it's not too obvious that it's an AI. And they provision an account on Google Suite, which costs them like $10 a month or something like that. So we're seeing that pattern more and more. I think that does the job for now. I'm not optimistic on us actually patching OAuth. Because I agree with you, ultimately, we would want to patch OAuth because the new account thing is kind of a clutch. It's really a hack. You would want to patch OAuth to have more granular access control and really be able to put your sugar in the box. I'm not optimistic on us doing that before AGI, I think. That's a very close timeline.Swyx [00:08:36]: I'm mindful of talking about a thing without showing it. And we already have the setup to show it. Why don't we jump into a screen share? For listeners, you can jump on the YouTube and like and subscribe. But also, let's have a look at how you show off Lindy. Yeah, absolutely.Flo [00:08:51]: I'll give an example of a very simple Lindy and then I'll graduate to a much more complicated one. A super simple Lindy that I have is, I unfortunately bought some investment properties in the south of France. It was a really, really bad idea. And I put them on a Holydew, which is like the French Airbnb, if you will. And so I received these emails from time to time telling me like, oh, hey, you made 200 bucks. Someone booked your place. When I receive these emails, I want to log this reservation in a spreadsheet. Doing this without an AI agent or without AI in general is a pain in the butt because you must write an HTML parser for this email. And so it's just hard. You may not be able to do it and it's going to break the moment the email changes. By contrast, the way it works with Lindy, it's really simple. It's two steps. It's like, okay, I receive an email. If it is a reservation confirmation, I have this filter here. Then I append a row to this spreadsheet. And so this is where you can see the AI part where the way this action is configured here, you see these purple fields on the right. Each of these fields is a prompt. And so I can say, okay, you extract from the email the day the reservation begins on. You extract the amount of the reservation. You extract the number of travelers of the reservation. And now you can see when I look at the task history of this Lindy, it's really simple. It's like, okay, you do this and boom, appending this row to this spreadsheet. And this is the information extracted. So effectively, this node here, this append row node is a mini agent. It can see everything that just happened. It has context over the task and it's appending the row. And then it's going to send a reply to the thread. That's a very simple example of an agent.Swyx [00:10:34]: A quick follow-up question on this one while we're still on this page. Is that one call? Is that a structured output call? Yeah. Okay, nice. Yeah.Flo [00:10:41]: And you can see here for every node, you can configure which model you want to power the node. Here I use cloud. For this, I use GPT-4 Turbo. Much more complex example, my meeting recorder. It looks very complex because I've added to it over time, but at a high level, it's really simple. It's like when a meeting begins, you record the meeting. And after the meeting, you send me a summary and you send me coaching notes. So I receive, like my Lindy is constantly coaching me. And so you can see here in the prompt of the coaching notes, I've told it, hey, you know, was I unnecessarily confrontational at any point? I'm French, so I have to watch out for that. Or not confrontational enough. Should I have double-clicked on any issue, right? So I can really give it exactly the kind of coaching that I'm expecting. And then the interesting thing here is, like, you can see the agent here, after it sent me these coaching notes, moves on. And it does a bunch of other stuff. So it goes on Slack. It disseminates the notes on Slack. It does a bunch of other stuff. But it's actually able to backtrack and resume the automation at the coaching notes email if I responded to that email. So I'll give a super concrete example. This is an actual coaching feedback that I received from Lindy. She was like, hey, this was a sales call I had with a customer. And she was like, I found your explanation of Lindy too technical. And I was able to follow up and just ask a follow-up question in the thread here. And I was like, why did you find too technical about my explanation? And Lindy restored the context. And so she basically picked up the automation back up here in the tree. And she has all of the context of everything that happened, including the meeting in which I was. So she was like, oh, you used the words deterministic and context window and agent state. And that concept exists at every level for every channel and every action that Lindy takes. So another example here is, I mentioned she also disseminates the notes on Slack. So this was a meeting where I was not, right? So this was a teammate. He's an indie meeting recorder, posts the meeting notes in this customer discovery channel on Slack. So you can see, okay, this is the onboarding call we had. This was the use case. Look at the questions. How do I make Lindy slower? How do I add delays to make Lindy slower? And I was able, in the Slack thread, to ask follow-up questions like, oh, what did we answer to these questions? And it's really handy because I know I can have this sort of interactive Q&A with these meetings. It means that very often now, I don't go to meetings anymore. I just send my Lindy. And instead of going to like a 60-minute meeting, I have like a five-minute chat with my Lindy afterwards. And she just replied. She was like, well, this is what we replied to this customer. And I can just be like, okay, good job, Jack. Like, no notes about your answers. So that's the kind of use cases people have with Lindy. It's a lot of like, there's a lot of sales automations, customer support automations, and a lot of this, which is basically personal assistance automations, like meeting scheduling and so forth.Alessio [00:13:21]: Yeah, and I think the question that people might have is memory. So as you get coaching, how does it track whether or not you're improving? You know, if these are like mistakes you made in the past, like, how do you think about that?Flo [00:13:31]: Yeah, we have a memory module. So I'll show you my meeting scheduler, Lindy, which has a lot of memories because by now I've used her for so long. And so every time I talk to her, she saves a memory. If I tell her, you screwed up, please don't do this. So you can see here, oh, it's got a double memory here. This is the meeting link I have, or this is the address of the office. If I tell someone to meet me at home, this is the address of my place. This is the code. I guess we'll have to edit that out. This is not the code of my place. No dogs. Yeah, so Lindy can just manage her own memory and decide when she's remembering things between executions. Okay.Swyx [00:14:11]: I mean, I'm just going to take the opportunity to ask you, since you are the creator of this thing, how come there's so few memories, right? Like, if you've been using this for two years, there should be thousands of thousands of things. That is a good question.Flo [00:14:22]: Agents still get confused if they have too many memories, to my point earlier about that. So I just am out of a call with a member of the Lama team at Meta, and we were chatting about Lindy, and we were going into the system prompt that we sent to Lindy, and all of that stuff. And he was amazed, and he was like, it's a miracle that it's working, guys. He was like, this kind of system prompt, this does not exist, either pre-training or post-training. These models were never trained to do this kind of stuff. It's a miracle that they can be agents at all. And so what I do, I actually prune the memories. You know, it's actually something I've gotten into the habit of doing from back when we had GPT 3.5, being Lindy agents. I suspect it's probably not as necessary in the Cloud 3.5 Sunette days, but I prune the memories. Yeah, okay.Swyx [00:15:05]: The reason is because I have another assistant that also is recording and trying to come up with facts about me. It comes up with a lot of trivial, useless facts that I... So I spend most of my time pruning. Actually, it's not super useful. I'd much rather have high-quality facts that it accepts. Or maybe I was even thinking, were you ever tempted to add a wake word to only memorize this when I say memorize this? And otherwise, don't even bother.Flo [00:15:30]: I have a Lindy that does this. So this is my inbox processor, Lindy. It's kind of beefy because there's a lot of different emails. But somewhere in here,Swyx [00:15:38]: there is a rule where I'm like,Flo [00:15:39]: aha, I can email my inbox processor, Lindy. It's really handy. So she has her own email address. And so when I process my email inbox, I sometimes forward an email to her. And it's a newsletter, or it's like a cold outreach from a recruiter that I don't care about, or anything like that. And I can give her a rule. And I can be like, hey, this email I want you to archive, moving forward. Or I want you to alert me on Slack when I have this kind of email. It's really important. And so you can see here, the prompt is, if I give you a rule about a kind of email, like archive emails from X, save it as a new memory. And I give it to the memory saving skill. And yeah.Swyx [00:16:13]: One thing that just occurred to me, so I'm a big fan of virtual mailboxes. I recommend that everybody have a virtual mailbox. You could set up a physical mail receive thing for Lindy. And so then Lindy can process your physical mail.Flo [00:16:26]: That's actually a good idea. I actually already have something like that. I use like health class mail. Yeah. So yeah, most likely, I can process my physical mail. Yeah.Swyx [00:16:35]: And then the other product's idea I have, looking at this thing, is people want to brag about the complexity of their Lindys. So this would be like a 65 point Lindy, right?Flo [00:16:43]: What's a 65 point?Swyx [00:16:44]: Complexity counting. Like how many nodes, how many things, how many conditions, right? Yeah.Flo [00:16:49]: This is not the most complex one. I have another one. This designer recruiter here is kind of beefy as well. Right, right, right. So I'm just saying,Swyx [00:16:56]: let people brag. Let people be super users. Oh, right.Flo [00:16:59]: Give them a score. Give them a score.Swyx [00:17:01]: Then they'll just be like, okay, how high can you make this score?Flo [00:17:04]: Yeah, that's a good point. And I think that's, again, the beauty of this on-rails phenomenon. It's like, think of the equivalent, the prompt equivalent of this Lindy here, for example, that we're looking at. It'd be monstrous. And the odds that it gets it right are so low. But here, because we're really holding the agent's hand step by step by step, it's actually super reliable. Yeah.Swyx [00:17:22]: And is it all structured output-based? Yeah. As far as possible? Basically. Like, there's no non-structured output?Flo [00:17:27]: There is. So, for example, here, this AI agent step, right, or this send message step, sometimes it gets to... That's just plain text.Swyx [00:17:35]: That's right.Flo [00:17:36]: Yeah. So I'll give you an example. Maybe it's TMI. I'm having blood pressure issues these days. And so this Lindy here, I give it my blood pressure readings, and it updates a log that I have of my blood pressure that it sends to my doctor.Swyx [00:17:49]: Oh, so every Lindy comes with a to-do list?Flo [00:17:52]: Yeah. Every Lindy has its own task history. Huh. Yeah. And so you can see here, this is my main Lindy, my personal assistant, and I've told it, where is this? There is a point where I'm like, if I am giving you a health-related fact, right here, I'm giving you health information, so then you update this log that I have in this Google Doc, and then you send me a message. And you can see, I've actually not configured this send message node. I haven't told it what to send me a message for. Right? And you can see, it's actually lecturing me. It's like, I'm giving it my blood pressure ratings. It's like, hey, it's a bit high. Here are some lifestyle changes you may want to consider.Alessio [00:18:27]: I think maybe this is the most confusing or new thing for people. So even I use Lindy and I didn't even know you could have multiple workflows in one Lindy. I think the mental model is kind of like the Zapier workflows. It starts and it ends. It doesn't choose between. How do you think about what's a Lindy versus what's a sub-function of a Lindy? Like, what's the hierarchy?Flo [00:18:48]: Yeah. Frankly, I think the line is a little arbitrary. It's kind of like when you code, like when do you start to create a new class versus when do you overload your current class. I think of it in terms of like jobs to be done and I think of it in terms of who is the Lindy serving. This Lindy is serving me personally. It's really my day-to-day Lindy. I give it a bunch of stuff, like very easy tasks. And so this is just the Lindy I go to. Sometimes when a task is really more specialized, so for example, I have this like summarizer Lindy or this designer recruiter Lindy. These tasks are really beefy. I wouldn't want to add this to my main Lindy, so I just created a separate Lindy for it. Or when it's a Lindy that serves another constituency, like our customer support Lindy, I don't want to add that to my personal assistant Lindy. These are two very different Lindys.Alessio [00:19:31]: And you can call a Lindy from within another Lindy. That's right. You can kind of chain them together.Flo [00:19:36]: Lindys can work together, absolutely.Swyx [00:19:38]: A couple more things for the video portion. I noticed you have a podcast follower. We have to ask about that. What is that?Flo [00:19:46]: So this one wakes me up every... So wakes herself up every week. And she sends me... So she woke up yesterday, actually. And she searches for Lenny's podcast. And she looks for like the latest episode on YouTube. And once she finds it, she transcribes the video and then she sends me the summary by email. I don't listen to podcasts as much anymore. I just like read these summaries. Yeah.Alessio [00:20:09]: We should make a latent space Lindy. Marketplace.Swyx [00:20:12]: Yeah. And then you have a whole bunch of connectors. I saw the list briefly. Any interesting one? Complicated one that you're proud of? Anything that you want to just share? Connector stories.Flo [00:20:23]: So many of our workflows are about meeting scheduling. So we had to build some very open unity tools around meeting scheduling. So for example, one that is surprisingly hard is this find available times action. You would not believe... This is like a thousand lines of code or something. It's just a very beefy action. And you can pass it a bunch of parameters about how long is the meeting? When does it start? When does it end? What are the meetings? The weekdays in which I meet? How many time slots do you return? What's the buffer between my meetings? It's just a very, very, very complex action. I really like our GitHub action. So we have a Lindy PR reviewer. And it's really handy because anytime any bug happens... So the Lindy reads our guidelines on Google Docs. By now, the guidelines are like 40 pages long or something. And so every time any new kind of bug happens, we just go to the guideline and we add the lines. Like, hey, this has happened before. Please watch out for this category of bugs. And it's saving us so much time every day.Alessio [00:21:19]: There's companies doing PR reviews. Where does a Lindy start? When does a company start? Or maybe how do you think about the complexity of these tasks when it's going to be worth having kind of like a vertical standalone company versus just like, hey, a Lindy is going to do a good job 99% of the time?Flo [00:21:34]: That's a good question. We think about this one all the time. I can't say that we've really come up with a very crisp articulation of when do you want to use a vertical tool versus when do you want to use a horizontal tool. I think of it as very similar to the internet. I find it surprising the extent to which a horizontal search engine has won. But I think that Google, right? But I think the even more surprising fact is that the horizontal search engine has won in almost every vertical, right? You go through Google to search Reddit. You go through Google to search Wikipedia. I think maybe the biggest exception is e-commerce. Like you go to Amazon to search e-commerce, but otherwise you go through Google. And I think that the reason for that is because search in each vertical has more in common with search than it does with each vertical. And search is so expensive to get right. Like Google is a big company that it makes a lot of sense to aggregate all of these different use cases and to spread your R&D budget across all of these different use cases. I have a thesis, which is, it's a really cool thesis for Lindy, is that the same thing is true for agents. I think that by and large, in a lot of verticals, agents in each vertical have more in common with agents than they do with each vertical. I also think there are benefits in having a single agent platform because that way your agents can work together. They're all like under one roof. That way you only learn one platform and so you can create agents for everything that you want. And you don't have to like pay for like a bunch of different platforms and so forth. So I think ultimately, it is actually going to shake out in a way that is similar to search in that search is everywhere on the internet. Every website has a search box, right? So there's going to be a lot of vertical agents for everything. I think AI is going to completely penetrate every category of software. But then I also think there are going to be a few very, very, very big horizontal agents that serve a lot of functions for people.Swyx [00:23:14]: That is actually one of the questions that we had about the agent stuff. So I guess we can transition away from the screen and I'll just ask the follow-up, which is, that is a hot topic. You're basically saying that the current VC obsession of the day, which is vertical AI enabled SaaS, is mostly not going to work out. And then there are going to be some super giant horizontal SaaS.Flo [00:23:34]: Oh, no, I'm not saying it's either or. Like SaaS today, vertical SaaS is huge and there's also a lot of horizontal platforms. If you look at like Airtable or Notion, basically the entire no-code space is very horizontal. I mean, Loom and Zoom and Slack, there's a lot of very horizontal tools out there. Okay.Swyx [00:23:49]: I was just trying to get a reaction out of you for hot takes. Trying to get a hot take.Flo [00:23:54]: No, I also think it is natural for the vertical solutions to emerge first because it's just easier to build. It's just much, much, much harder to build something horizontal. Cool.Swyx [00:24:03]: Some more Lindy-specific questions. So we covered most of the top use cases and you have an academy. That was nice to see. I also see some other people doing it for you for free. So like Ben Spites is doing it and then there's some other guy who's also doing like lessons. Yeah. Which is kind of nice, right? Yeah, absolutely. You don't have to do any of that.Flo [00:24:20]: Oh, we've been seeing it more and more on like LinkedIn and Twitter, like people posting their Lindys and so forth.Swyx [00:24:24]: I think that's the flywheel that you built the platform where creators see value in allying themselves to you. And so then, you know, your incentive is to make them successful so that they can make other people successful and then it just drives more and more engagement. Like it's earned media. Like you don't have to do anything.Flo [00:24:39]: Yeah, yeah. I mean, community is everything.Swyx [00:24:41]: Are you doing anything special there? Any big wins?Flo [00:24:44]: We have a Slack community that's pretty active. I can't say we've invested much more than that so far.Swyx [00:24:49]: I would say from having, so I have some involvement in the no-code community. I would say that Webflow going very hard after no-code as a category got them a lot more allies than just the people using Webflow. So it helps you to grow the community beyond just Lindy. And I don't know what this is called. Maybe it's just no-code again. Maybe you want to call it something different. But there's definitely an appetite for this and you are one of a broad category, right? Like just before you, we had Dust and, you know, they're also kind of going after a similar market. Zapier obviously is not going to try to also compete with you. Yeah. There's no question there. It's just like a reaction about community. Like I think a lot about community. Lanespace is growing the community of AI engineers. And I think you have a slightly different audience of, I don't know what.Flo [00:25:33]: Yeah. I think the no-code tinkerers is the community. Yeah. It is going to be the same sort of community as what Webflow, Zapier, Airtable, Notion to some extent.Swyx [00:25:43]: Yeah. The framing can be different if you were, so I think tinkerers has this connotation of not serious or like small. And if you framed it to like no-code EA, we're exclusively only for CEOs with a certain budget, then you just have, you tap into a different budget.Flo [00:25:58]: That's true. The problem with EA is like, the CEO has no willingness to actually tinker and play with the platform.Swyx [00:26:05]: Maybe Andrew's doing that. Like a lot of your biggest advocates are CEOs, right?Flo [00:26:09]: A solopreneur, you know, small business owners, I think Andrew is an exception. Yeah. Yeah, yeah, he is.Swyx [00:26:14]: He's an exception in many ways. Yep.Alessio [00:26:16]: Just before we wrap on the use cases, is Rick rolling your customers? Like a officially supported use case or maybe tell that story?Flo [00:26:24]: It's one of the main jobs to be done, really. Yeah, we woke up recently, so we have a Lindy obviously doing our customer support and we do check after the Lindy. And so we caught this email exchange where someone was asking Lindy for video tutorials. And at the time, actually, we did not have video tutorials. We do now on the Lindy Academy. And Lindy responded to the email. It's like, oh, absolutely, here's a link. And we were like, what? Like, what kind of link did you send? And so we clicked on the link and it was a recall. We actually reacted fast enough that the customer had not yet opened the email. And so we reacted immediately. Like, oh, hey, actually, sorry, this is the right link. And so the customer never reacted to the first link. And so, yeah, I tweeted about that. It went surprisingly viral. And I checked afterwards in the logs. We did like a database query and we found, I think, like three or four other instances of it having happened before.Swyx [00:27:12]: That's surprisingly low.Flo [00:27:13]: It is low. And we fixed it across the board by just adding a line to the system prompt that's like, hey, don't recall people, please don't recall.Swyx [00:27:21]: Yeah, yeah, yeah. I mean, so, you know, you can explain it retroactively, right? Like, that YouTube slug has been pasted in so many different corpuses that obviously it learned to hallucinate that.Alessio [00:27:31]: And it pretended to be so many things. That's the thing.Swyx [00:27:34]: I wouldn't be surprised if that takes one token. Like, there's this one slug in the tokenizer and it's just one token.Flo [00:27:41]: That's the idea of a YouTube video.Swyx [00:27:43]: Because it's used so much, right? And you have to basically get it exactly correct. It's probably not. That's a long speech.Flo [00:27:52]: It would have been so good.Alessio [00:27:55]: So this is just a jump maybe into evals from here. How could you possibly come up for an eval that says, make sure my AI does not recall my customer? I feel like when people are writing evals, that's not something that they come up with. So how do you think about evals when it's such like an open-ended problem space?Flo [00:28:12]: Yeah, it is tough. We built quite a bit of infrastructure for us to create evals in one click from any conversation history. So we can point to a conversation and we can be like, in one click we can turn it into effectively a unit test. It's like, this is a good conversation. This is how you're supposed to handle things like this. Or if it's a negative example, then we modify a little bit the conversation after generating the eval. So it's very easy for us to spin up this kind of eval.Alessio [00:28:36]: Do you use an off-the-shelf tool which is like Brain Trust on the podcast? Or did you just build your own?Flo [00:28:41]: We unfortunately built our own. We're most likely going to switch to Brain Trust. Well, when we built it, there was nothing. Like there was no eval tool, frankly. I mean, we started this project at the end of 2022. It was like, it was very, very, very early. I wouldn't recommend it to build your own eval tool. There's better solutions out there and our eval tool breaks all the time and it's a nightmare to maintain. And that's not something we want to be spending our time on.Swyx [00:29:04]: I was going to ask that basically because I think my first conversations with you about Lindy was that you had a strong opinion that everyone should build their own tools. And you were very proud of your evals. You're kind of showing off to me like how many evals you were running, right?Flo [00:29:16]: Yeah, I think that was before all of these tools came around. I think the ecosystem has matured a fair bit.Swyx [00:29:21]: What is one thing that Brain Trust has nailed that you always struggled to do?Flo [00:29:25]: We're not using them yet, so I couldn't tell. But from what I've gathered from the conversations I've had, like they're doing what we do with our eval tool, but better.Swyx [00:29:33]: And like they do it, but also like 60 other companies do it, right? So I don't know how to shop apart from brand. Word of mouth.Flo [00:29:41]: Same here.Swyx [00:29:42]: Yeah, like evals or Lindys, there's two kinds of evals, right? Like in some way, you don't have to eval your system as much because you've constrained the language model so much. And you can rely on open AI to guarantee that the structured outputs are going to be good, right? We had Michelle sit where you sit and she explained exactly how they do constraint grammar sampling and all that good stuff. So actually, I think it's more important for your customers to eval their Lindys than you evaling your Lindy platform because you just built the platform. You don't actually need to eval that much.Flo [00:30:14]: Yeah. In an ideal world, our customers don't need to care about this. And I think the bar is not like, look, it needs to be at 100%. I think the bar is it needs to be better than a human. And for most use cases we serve today, it is better than a human, especially if you put it on Rails.Swyx [00:30:30]: Is there a limiting factor of Lindy at the business? Like, is it adding new connectors? Is it adding new node types? Like how do you prioritize what is the most impactful to your company?Flo [00:30:41]: Yeah. The raw capabilities for sure are a big limit. It is actually shocking the extent to which the model is no longer the limit. It was the limit a year ago. It was too expensive. The context window was too small. It's kind of insane that we started building this when the context windows were like 4,000 tokens. Like today, our system prompt is more than 4,000 tokens. So yeah, the model is actually very much not a limit anymore. It almost gives me pause because I'm like, I want the model to be a limit. And so no, the integrations are ones, the core capabilities are ones. So for example, we are investing in a system that's basically, I call it like the, it's a J hack. Give me these names, like the poor man's RLHF. So you can turn on a toggle on any step of your Lindy workflow to be like, ask me for confirmation before you actually execute this step. So it's like, hey, I receive an email, you send a reply, ask me for confirmation before actually sending it. And so today you see the email that's about to get sent and you can either approve, deny, or change it and then approve. And we are making it so that when you make a change, we are then saving this change that you're making or embedding it in the vector database. And then we are retrieving these examples for future tasks and injecting them into the context window. So that's the kind of capability that makes a huge difference for users. That's the bottleneck today. It's really like good old engineering and product work.Swyx [00:31:52]: I assume you're hiring. We'll do a call for hiring at the end.Alessio [00:31:54]: Any other comments on the model side? When did you start feeling like the model was not a bottleneck anymore? Was it 4.0? Was it 3.5? 3.5.Flo [00:32:04]: 3.5 Sonnet, definitely. I think 4.0 is overhyped, frankly. We don't use 4.0. I don't think it's good for agentic behavior. Yeah, 3.5 Sonnet is when I started feeling that. And then with prompt caching with 3.5 Sonnet, like that fills the cost, cut the cost again. Just cut it in half. Yeah.Swyx [00:32:21]: Your prompts are... Some of the problems with agentic uses is that your prompts are kind of dynamic, right? Like from caching to work, you need the front prefix portion to be stable.Flo [00:32:32]: Yes, but we have this append-only ledger paradigm. So every node keeps appending to that ledger and every filled node inherits all the context built up by all the previous nodes. And so we can just decide, like, hey, every X thousand nodes, we trigger prompt caching again.Swyx [00:32:47]: Oh, so you do it like programmatically, not all the time.Flo [00:32:50]: No, sorry. Anthropic manages that for us. But basically, it's like, because we keep appending to the prompt, the prompt caching works pretty well.Alessio [00:32:57]: We have this small podcaster tool that I built for the podcast and I rewrote all of our prompts because I noticed, you know, I was inputting stuff early on. I wonder how much more money OpenAN and Anthropic are making just because people don't rewrite their prompts to be like static at the top and like dynamic at the bottom.Flo [00:33:13]: I think that's the remarkable thing about what we're having right now. It's insane that these companies are routinely cutting their costs by two, four, five. Like, they basically just apply constraints. They want people to take advantage of these innovations. Very good.Swyx [00:33:25]: Do you have any other competitive commentary? Commentary? Dust, WordWare, Gumloop, Zapier? If not, we can move on.Flo [00:33:31]: No comment.Alessio [00:33:32]: I think the market is,Flo [00:33:33]: look, I mean, AGI is coming. All right, that's what I'm talking about.Swyx [00:33:38]: I think you're helping. Like, you're paving the road to AGI.Flo [00:33:41]: I'm playing my small role. I'm adding my small brick to this giant, giant, giant castle. Yeah, look, when it's here, we are going to, this entire category of software is going to create, it's going to sound like an exaggeration, but it is a fact it is going to create trillions of dollars of value in a few years, right? It's going to, for the first time, we're actually having software directly replace human labor. I see it every day in sales calls. It's like, Lindy is today replacing, like, we talk to even small teams. It's like, oh, like, stop, this is a 12-people team here. I guess we'll set up this Lindy for one or two days, and then we'll have to decide what to do with this 12-people team. And so, yeah. To me, there's this immense uncapped market opportunity. It's just such a huge ocean, and there's like three sharks in the ocean. I'm focused on the ocean more than on the sharks.Swyx [00:34:25]: So we're moving on to hot topics, like, kind of broadening out from Lindy, but obviously informed by Lindy. What are the high-order bits of good agent design?Flo [00:34:31]: The model, the model, the model, the model. I think people fail to truly, and me included, they fail to truly internalize the bitter lesson. So for the listeners out there who don't know about it, it's basically like, you just scale the model. Like, GPUs go brr, it's all that matters. I think it also holds for the cognitive architecture. I used to be very cognitive architecture-filled, and I was like, ah, and I was like a critic, and I was like a generator, and all this, and then it's just like, GPUs go brr, like, just like let the model do its job. I think we're seeing it a little bit right now with O1. I'm seeing some tweets that say that the new 3.5 SONNET is as good as O1, but with none of all the crazy...Swyx [00:35:09]: It beats O1 on some measures. On some reasoning tasks. On AIME, it's still a lot lower. Like, it's like 14 on AIME versus O1, it's like 83.Flo [00:35:17]: Got it. Right. But even O1 is still the model. Yeah.Swyx [00:35:22]: Like, there's no cognitive architecture on top of it.Flo [00:35:23]: You can just wait for O1 to get better.Alessio [00:35:25]: And so, as a founder, how do you think about that, right? Because now, knowing this, wouldn't you just wait to start Lindy? You know, you start Lindy, it's like 4K context, the models are not that good. It's like, but you're still kind of like going along and building and just like waiting for the models to get better. How do you today decide, again, what to build next, knowing that, hey, the models are going to get better, so maybe we just shouldn't focus on improving our prompt design and all that stuff and just build the connectors instead or whatever? Yeah.Flo [00:35:51]: I mean, that's exactly what we do. Like, all day, we always ask ourselves, oh, when we have a feature idea or a feature request, we ask ourselves, like, is this the kind of thing that just gets better while we sleep because models get better? I'm reminded, again, when we started this in 2022, we spent a lot of time because we had to around context pruning because 4,000 tokens is really nothing. You really can't do anything with 4,000 tokens. All that work was throwaway work. Like, now it's like it was for nothing, right? Now we just assume that infinite context windows are going to be here in a year or something, a year and a half, and infinitely cheap as well, and dynamic compute is going to be here. Like, we just assume all of these things are going to happen, and so we really focus, our job to be done in the industry is to provide the input and output to the model. I really compare it all the time to the PC and the CPU, right? Apple is busy all day. They're not like a CPU wrapper. They have a lot to build, but they don't, well, now actually they do build the CPU as well, but leaving that aside, they're busy building a laptop. It's just a lot of work to build these things. It's interesting because, like,Swyx [00:36:45]: for example, another person that we're close to, Mihaly from Repl.it, he often says that the biggest jump for him was having a multi-agent approach, like the critique thing that you just said that you don't need, and I wonder when, in what situations you do need that and what situations you don't. Obviously, the simple answer is for coding, it helps, and you're not coding, except for, are you still generating code? In Indy? Yeah.Flo [00:37:09]: No, we do. Oh, right. No, no, no, the cognitive architecture changed. We don't, yeah.Swyx [00:37:13]: Yeah, okay. For you, you're one shot, and you chain tools together, and that's it. And if the user really wantsFlo [00:37:18]: to have this kind of critique thing, you can also edit the prompt, you're welcome to. I have some of my Lindys, I've told them, like, hey, be careful, think step by step about what you're about to do, but that gives you a little bump for some use cases, but, yeah.Alessio [00:37:30]: What about unexpected model releases? So, Anthropic released computer use today. Yeah. I don't know if many people were expecting computer use to come out today. Do these things make you rethink how to design, like, your roadmap and things like that, or are you just like, hey, look, whatever, that's just, like, a small thing in their, like, AGI pursuit, that, like, maybe they're not even going to support, and, like, it's still better for us to build our own integrations into systems and things like that. Because maybe people will say, hey, look, why am I building all these API integrationsFlo [00:38:02]: when I can just do computer use and never go to the product? Yeah. No, I mean, we did take into account computer use. We were talking about this a year ago or something, like, we've been talking about it as part of our roadmap. It's been clear to us that it was coming, My philosophy about it is anything that can be done with an API must be done by an API or should be done by an API for a very long time. I think it is dangerous to be overly cavalier about improvements of model capabilities. I'm reminded of iOS versus Android. Android was built on the JVM. There was a garbage collector, and I can only assume that the conversation that went down in the engineering meeting room was, oh, who cares about the garbage collector? Anyway, Moore's law is here, and so that's all going to go to zero eventually. Sure, but in the meantime, you are operating on a 400 MHz CPU. It was like the first CPU on the iPhone 1, and it's really slow, and the garbage collector is introducing a tremendous overhead on top of that, especially a memory overhead. For the longest time, and it's really only been recently that Android caught up to iOS in terms of how smooth the interactions were, but for the longest time, Android phones were significantly slowerSwyx [00:39:07]: and laggierFlo [00:39:08]: and just not feeling as good as iOS devices. Look, when you're talking about modules and magnitude of differences in terms of performance and reliability, which is what we are talking about when we're talking about API use versus computer use, then you can't ignore that, right? And so I think we're going to be in an API use world for a while.Swyx [00:39:27]: O1 doesn't have API use today. It will have it at some point, and it's on the roadmap. There is a future in which OpenAI goes much harder after your business, your market, than it is today. Like, ChatGPT, it's its own business. All they need to do is add tools to the ChatGPT, and now they're suddenly competing with you. And by the way, they have a GPT store where a bunch of people have already configured their tools to fit with them. Is that a concern?Flo [00:39:56]: I think even the GPT store, in a way, like the way they architect it, for example, their plug-in systems are actually grateful because we can also use the plug-ins. It's very open. Now, again, I think it's going to be such a huge market. I think there's going to be a lot of different jobs to be done. I know they have a huge enterprise offering and stuff, but today, ChatGPT is a consumer app. And so, the sort of flow detail I showed you, this sort of workflow, this sort of use cases that we're going after, which is like, we're doing a lot of lead generation and lead outreach and all of that stuff. That's not something like meeting recording, like Lindy Today right now joins your Zoom meetings and takes notes, all of that stuff.Swyx [00:40:34]: I don't see that so farFlo [00:40:35]: on the OpenAI roadmap.Swyx [00:40:36]: Yeah, but they do have an enterprise team that we talk to You're hiring GMs?Flo [00:40:42]: We did.Swyx [00:40:43]: It's a fascinating way to build a business, right? Like, what should you, as CEO, be in charge of? And what should you basically hireFlo [00:40:52]: a mini CEO to do? Yeah, that's a good question. I think that's also something we're figuring out. The GM thing was inspired from my days at Uber, where we hired one GM per city or per major geo area. We had like all GMs, regional GMs and so forth. And yeah, Lindy is so horizontal that we thought it made sense to hire GMs to own each vertical and the go-to market of the vertical and the customization of the Lindy templates for these verticals and so forth. What should I own as a CEO? I mean, the canonical reply here is always going to be, you know, you own the fundraising, you own the culture, you own the... What's the rest of the canonical reply? The culture, the fundraising.Swyx [00:41:29]: I don't know,Flo [00:41:30]: products. Even that, eventually, you do have to hand out. Yes, the vision, the culture, and the foundation. Well, you've done your job as a CEO. In practice, obviously, yeah, I mean, all day, I do a lot of product work still and I want to keep doing product work for as long as possible.Swyx [00:41:48]: Obviously, like you're recording and managing the team. Yeah.Flo [00:41:52]: That one feels like the most automatable part of the job, the recruiting stuff.Swyx [00:41:56]: Well, yeah. You saw myFlo [00:41:59]: design your recruiter here. Relationship between Factorio and building Lindy. We actually very often talk about how the business of the future is like a game of Factorio. Yeah. So, in the instance, it's like Slack and you've got like 5,000 Lindys in the sidebar and your job is to somehow manage your 5,000 Lindys. And it's going to be very similar to company building because you're going to look for like the highest leverage way to understand what's going on in your AI company and understand what levels do you have to make impact in that company. So, I think it's going to be very similar to like a human company except it's going to go infinitely faster. Today, in a human company, you could have a meeting with your team and you're like, oh, I'm going to build a facility and, you know, now it's like, okay,Swyx [00:42:40]: boom, I'm going to spin up 50 designers. Yeah. Like, actually, it's more important that you can clone an existing designer that you know works because the hiring process, you cannot clone someone because every new person you bring in is going to have their own tweaksFlo [00:42:54]: and you don't want that. Yeah.Swyx [00:42:56]: That's true. You want an army of mindless dronesFlo [00:42:59]: that all work the same way.Swyx [00:43:00]: The reason I bring this, bring Factorio up as well is one, Factorio Space just came out. Apparently, a whole bunch of people stopped working. I tried out Factorio. I never really got that much into it. But the other thing was, you had a tweet recently about how the sort of intentional top-down design was not as effective as just build. Yeah. Just ship.Flo [00:43:21]: I think people read a little bit too much into that tweet. It went weirdly viral. I was like, I did not intend it as a giant statement online.Swyx [00:43:28]: I mean, you notice you have a pattern with this, right? Like, you've done this for eight years now.Flo [00:43:33]: You should know. I legit was just hearing an interesting story about the Factorio game I had. And everybody was like, oh my God, so deep. I guess this explains everything about life and companies. There is something to be said, certainly, about focusing on the constraint. And I think it is Patrick Collison who said, people underestimate the extent to which moonshots are just one pragmatic step taken after the other. And I think as long as you have some inductive bias about, like, some loose idea about where you want to go, I think it makes sense to follow a sort of greedy search along that path. I think planning and organizing is important. And having older is important.Swyx [00:44:05]: I'm wrestling with that. There's two ways I encountered it recently. One with Lindy. When I tried out one of your automation templates and one of them was quite big and I just didn't understand it, right? So, like, it was not as useful to me as a small one that I can just plug in and see all of. And then the other one was me using Cursor. I was very excited about O1 and I just up frontFlo [00:44:27]: stuffed everythingSwyx [00:44:28]: I wanted to do into my prompt and expected O1 to do everything. And it got itself into a huge jumbled mess and it was stuck. It was really... There was no amount... I wasted, like, two hours on just, like, trying to get out of that hole. So I threw away the code base, started small, switched to Clouds on it and build up something working and just add it over time and it just worked. And to me, that was the factorial sentiment, right? Maybe I'm one of those fanboys that's just, like, obsessing over the depth of something that you just randomly tweeted out. But I think it's true for company building, for Lindy building, for coding.Flo [00:45:02]: I don't know. I think it's fair and I think, like, you and I talked about there's the Tuft & Metal principle and there's this other... Yes, I love that. There's the... I forgot the name of this other blog post but it's basically about this book Seeing Like a State that talks about the need for legibility and people who optimize the system for its legibility and anytime you make a system... So legible is basically more understandable. Anytime you make a system more understandable from the top down, it performs less well from the bottom up. And it's fine but you should at least make this trade-off with your eyes wide open. You should know, I am sacrificing performance for understandability, for legibility. And in this case, for you, it makes sense. It's like you are actually optimizing for legibility. You do want to understand your code base but in some other cases it may not make sense. Sometimes it's better to leave the system alone and let it be its glorious, chaotic, organic self and just trust that it's going to perform well even though you don't understand it completely.Swyx [00:45:55]: It does remind me of a common managerial issue or dilemma which you experienced in the small scale of Lindy where, you know, do you want to organize your company by functional sections or by products or, you know, whatever the opposite of functional is. And you tried it one way and it was more legible to you as CEO but actually it stopped working at the small level. Yeah.Flo [00:46:17]: I mean, one very small example, again, at a small scale is we used to have everything on Notion. And for me, as founder, it was awesome because everything was there. The roadmap was there. The tasks were there. The postmortems were there. And so, the postmortem was linkedSwyx [00:46:31]: to its task.Flo [00:46:32]: It was optimized for you. Exactly. And so, I had this, like, one pane of glass and everything was on Notion. And then the team, one day,Swyx [00:46:39]: came to me with pitchforksFlo [00:46:40]: and they really wanted to implement Linear. And I had to bite my fist so hard. I was like, fine, do it. Implement Linear. Because I was like, at the end of the day, the team needs to be able to self-organize and pick their own tools.Alessio [00:46:51]: Yeah. But it did make the company slightly less legible for me. Another big change you had was going away from remote work, every other month. The discussion comes up again. What was that discussion like? How did your feelings change? Was there kind of like a threshold of employees and team size where you felt like, okay, maybe that worked. Now it doesn't work anymore. And how are you thinking about the futureFlo [00:47:12]: as you scale the team? Yeah. So, for context, I used to have a business called TeamFlow. The business was about building a virtual office for remote teams. And so, being remote was not merely something we did. It was, I was banging the remote drum super hard and helping companies to go remote. And so, frankly, in a way, it's a bit embarrassing for me to do a 180 like that. But I guess, when the facts changed, I changed my mind. What happened? Well, I think at first, like everyone else, we went remote by necessity. It was like COVID and you've got to go remote. And on paper, the gains of remote are enormous. In particular, from a founder's standpoint, being able to hire from anywhere is huge. Saving on rent is huge. Saving on commute is huge for everyone and so forth. But then, look, we're all here. It's like, it is really making it much harder to work together. And I spent three years of my youth trying to build a solution for this. And my conclusion is, at least we couldn't figure it out and no one else could. Zoom didn't figure it out. We had like a bunch of competitors. Like, Gathertown was one of the bigger ones. We had dozens and dozens of competitors. No one figured it out. I don't know that software can actually solve this problem. The reality of it is, everyone just wants to get off the darn Zoom call. And it's not a good feeling to be in your home office if you're even going to have a home office all day. It's harder to build culture. It's harder to get in sync. I think software is peculiar because it's like an iceberg. It's like the vast majority of it is submerged underwater. And so, the quality of the software that you ship is a function of the alignment of your mental models about what is below that waterline. Can you actually get in sync about what it is exactly fundamentally that we're building? What is the soul of our product? And it is so much harder to get in sync about that when you're remote. And then you waste time in a thousand ways because people are offline and you can't get a hold of them or you can't share your screen. It's just like you feel like you're walking in molasses all day. And eventually, I was like, okay, this is it. We're not going to do this anymore.Swyx [00:49:03]: Yeah. I think that is the current builder San Francisco consensus here. Yeah. But I still have a big... One of my big heroes as a CEO is Sid Subban from GitLab.Flo [00:49:14]: Mm-hmm.Swyx [00:49:15]: Matt MullenwegFlo [00:49:16]: used to be a hero.Swyx [00:49:17]: But these people run thousand-person remote businesses. The main idea is that at some company size, your company is remote anyway. Yeah. Because if you go from one building to two buildings, congrats, you're now remote from the other building. If you want to go from one city office to two city offices, they're remote from each other.Flo [00:49:35]: But the teams are co-located. Every time anyone talks about remote success stories, they always talk about this real force. Yeah. It's always GitLab and WordPress and Zapier. Zapier. It used to be Envision. And I will point out that in every one of these examples, you have a co-located counterfactual that is sometimes orders of magnitude bigger. Look, I like Matt Mullenweg a lot, but WordPress is a commercial failure. They run 60% of the internet and they're like a fraction of the size of even Substack. Right?Swyx [00:50:05]: They're trying to get more money.Flo [00:50:07]: Yeah, that's my point, right? Look, GitLab is much smaller than GitHub. Envision, you know, is no more. And Figma, like, completely took off. And Figma was like very in-person. So, I think if you're optimizing for productivity, if you really know, hey, this is a support ticket, right, and I want to have my support ticket for a buck 50 per support ticket and next year I want it for a buck 20, then sure, send your support ticket team to offshore, like the Philippines or whatever, and just optimize for cost. If you're optimizing for cost, absolutely be remote. If you're optimizing for creativity, which I think that software and product building is a creative endeavor, if you're optimizing for creativity, it's kind of like you have to be in person and hear the music to do that.Swyx [00:50:52]: Yeah. Maybe the line is that all jobs that can be remote should be AI or Lindy's and all jobs that are not remote are in person. Like, there's a very,Flo [00:51:04]: very clear separation of jobs. Sure. Well, I think over the long term,Swyx [00:51:09]: every job is going to be AI anyway. It would be curious to break down what you think is creativity in coding and in product defining and how to express that for sure. You're definitely what I call a temperature zero use case of LLMs. You want it to be reliable, predictable, small. And then there's other use cases of LLMs that are more for creativity and engines. Right? I haven't checked, but I'm pretty sure no one uses Lindy for brainstorming. Actually,Flo [00:51:36]: probably they do. I use Lindy for brainstormingSwyx [00:51:38]: a lot, actually. Yeah, yeah. But you want to have something that's anti-fragile to hallucination. Hallucinations are good.Flo [00:51:45]: By creativity, I mean, is it about direction or magnitude? If it is about direction, like decide what to do, then it's a creative endeavor. If it is about magnitude and just do it as fast as possible, as cheap as possible, then it's magnitude. And so sometimes, you know, software companies are not necessarily creative. Sometimes you know what you're doing. And I'll say that it's going to come across the wrong way, but linear. I look up to a huge amount, like such amazing product builders, but they know what they're building. They're building a I don't mean to throw shade at them. Like, good for them.Swyx [00:52:20]: I think they're aware that they're not like They recently got s**t for saying that they have work-life balance on their job description.Flo [00:52:26]: They're like, what do you mean by this? We're building a new kind of product that no one's ever built before. And so we're just scratching our heads all day trying to get in sync about like, what exactly is itSwyx [00:52:37]: that we're building? What does it consist of? Inherently creative struggle. Yeah. Dare we ask about San Francisco? And there's a whole bunch of tough stuff in here. Probably the biggest one I would just congratulate you on is becoming American, right? Very French, but your heart was sort of in the U.S. You eventually found your way here. What are your takes for founders? A few years ago, you wrote this post on Go West, young man. And now you've basically completed that journey, right? You're now here and up to the point where you're kind of mystified by how Europe has been so decel.Flo [00:53:11]: In a way, though, I feel vindicated because I was making the prediction that Europe was over 14 years ago or something like that. I think it's been a walking corpse for a long time. I think it is only now becoming obvious that it is paying the consequences of its policies from 10, 20, 30 years ago. I think at this point, I wish I could rewrite the Go West, young man article but really even more extreme. I think at this point, if you are in tech, especially in AI, but if you're in tech and you're not in San Francisco, you either lack judgment or you lack ambition. It's funny, I recently told that to someone and they were like, oh, not everyone wants to be like a unicorn founder. And I was like, like I said, judgment or ambition. It's fine to not have ambition. It's fine to want to prioritize other things than your company in life or your career in life. That's perfectly okay. But know that that's the trade-off you're making. If you prioritize your career, you've got to be here.Alessio [00:54:03]: As a fellow European escapist, I grew up in Rome.Flo [00:54:05]: Yeah, how do you feel?Swyx [00:54:06]: We never talk about your feelings about Europe.Alessio [00:54:08]: Yeah, I've been in the U.S. now six years. Well, I started my first company in Europe 10 years ago, something like that. Yeah, you can tell nobody really wants to do much. And then you're like, okay. It's funny, I was looking back through some old tweets and I was sending all these tweets to Marc Andreessen like 15 years ago like trying to like learn more about why are you guys putting money in these things that most people here would say you're like crazy to like even back. And eventually, you know, I started doing venture six, five years ago. And I think just like so many people in Europe reach out and ask, hey, can you like talk to our team and they just cannot comprehend like the risk appetite that people have here. It's just like so foreign to people, at least in Italy and like in some parts of Europe. I'm sure there's some great founders in Europe, but like the average European founders, like why would I leave my job at the post office to go work on the startup that could change everything and become very successful but might go out of business instead in the U.S. You have like, you know, we host a hackathon and it's like 400 people and it's like, where can I go work that it's like no job security, you know? It's just like completely different and there's no incentives from the government to change that. There's no way you can like change such a deep-rooted culture of like, you know, going and wine and April spritzFlo [00:55:27]: and all of thatAlessio [00:55:28]: early in the afternoon.Flo [00:55:29]: So, I don't really know how it's going to change.Alessio [00:55:32]: It's quality of life. Yeah, totally. That's why I left. The quality is so high that I left. But again, I think it's better to move here and just, if you want to do this job and do this, you should be here. If you don't want to, that's fine.Flo [00:55:47]: But like,Alessio [00:55:48]: don't copium. Don't be like, oh no, you can also be successful doing this and knees or like whatever. No, probably not, you know? So,Flo [00:55:59]: yeah,Alessio [00:56:00]: I've already done my N400Flo [00:56:01]: so I should get my U.S. citizenship interview soon. Yeah. And I think to be fair, I think what's happening right now to Europe and they've said no to capitalism. They've decided to say no to capitalism a long time ago. They've like completely over-regulated. Taxation is much too high and so forth. But I also think some of this is a little bit of a self-fulfilling prophecy or it's a self-perpetuating phenomenon because, look, to your point, like once there is a network effect that's just so incredibly powerful, they can't be broken, really. And we tried with San Francisco. I tried with San Francisco. Like during COVID,Swyx [00:56:35]: there was a movement of people moving to Miami.Flo [00:56:38]: How did that pan out? You can't break the network effect,Swyx [00:56:41]: you know? It's so annoying because first principles wise, tech should not be here. Like tech should be in Miami because it's just a better city.Flo [00:56:48]: San Francisco does not want tech to be here.Swyx [00:56:50]: San Francisco hates tech.Flo [00:56:51]: 100%.Swyx [00:56:52]: This is the thing I actually wrote down.Alessio [00:56:54]: San Francisco hates tech. It is true. I think the people that are in San Francisco that were here before, tech hated it and then there's kind of like this passed down thing. But I would say people in Miami would hate it too if there were too much of it. You know? The Mickey Beach crowd would also not gel.Swyx [00:57:08]: They're just rich enough and chill enough to not care.Flo [00:57:10]: Yeah, I think so too.Swyx [00:57:11]: They're like, oh, crypto kids.Flo [00:57:13]: Okay, cool. Yeah. Miami celebrates success which is one thingSwyx [00:57:17]: I loved about it.Flo [00:57:18]: A little bit too much.Swyx [00:57:19]: Maybe the last thing I'll mention, I just wanted a little bit of EUAC talk. I think that's good. I'll maybe carve out that I think the UK has done really well. That's an argument for the UK not being part of Europe is that, you know, the AI institutions there at least have done very well. Right?Flo [00:57:34]: Sure. I think a lot of Britain is in the gutter. Yeah, exactly.Swyx [00:57:38]: They've been stagnating at best. And then France has a few wins.Flo [00:57:41]: Who?Swyx [00:57:42]: Mistral.Flo [00:57:43]: Who uses Mistral?Swyx [00:57:44]: Hugging face.Flo [00:57:45]: A few wins.Swyx [00:57:46]: I'm just saying. They disappointed their first AI minister. You know the meme with the guyFlo [00:57:51]: who's celebrating with his trophy and then he's like, no, that's France. Right? To me, that's France. It's like, aha, look, we've got Mistral! It's like champagne! It's like maybe 1% of market share. And by the way, and it's not a critic of them, it's a critic of France and of Europe. And by the way, I think I've heard that the Mistral guys were moving to the US. They're opening an office here. They're opening an office here. But, I mean,Swyx [00:58:15]: they're very French, right?Flo [00:58:16]: Right.Swyx [00:58:17]: You can't really avoid it. There's one interesting counter move which is Jason Warner and ISOCAT moving to Paris for poolside. I don't know. It remains to be seen how that move is going. Maybe the last thing I'll say, you know, that's the Europe talk. We try not to do politics so much, but you're here. One thing that you do a lot is you test your overturned windows. Right? Like far more than any founder I know. You know it's not your job. Someone, for sure, you're just indulging. But also, I think you consciously test. And I just want to see what drives you there and why do you keep doing it? Because you treat very spicy stuff, especially for like the San Francisco sort of liberal dynasty.Flo [00:58:59]: I don't know because I assume you're referring to I posted something about pronouns and how nonsense...Swyx [00:59:05]: Just in general. I don't want you to focus on any particular thing unless you want to.Flo [00:59:09]: You know, well, that tweet in particular, when I was tweeting it, I was like, oh, this is kind of spicy. Should I do this? And then I just did it. And I received zero pushback.Swyx [00:59:20]: And the tweet was actuallyFlo [00:59:21]: pretty successful and I received a lot of people reaching out like, oh my God, so true. I think it's coming from a few different places. One, life is more fun this way. Like I don't feel like if everyone always self-censors, you never know what everyone, what anyone thinks. And so it's becoming like a self-perpetuating thing. It's like a public lies, private truth sort of phenomenon. Or like, you know, there's this phenomenon called the preference cascade. It's like, there's this joke. It's like, oh, there's only one communist left in USSR. The problem is no one knows which one it is. So everyone pretends to be communist because everyone else pretends to be communist. And so I think there's a role to be played when you have a boss who's going to fire me. It's like, look, if I don't speak up and if founders don't speak up, I'm like, why? What are you afraid of? Right? Like, there's really not that much downside. And I think there'sSwyx [01:00:14]: something to be said about standing up for what you think is right and being real and owning your opinions. I think there's a correlation there between having that level of independence for your political beliefs and free speech or whatever and the wayFlo [01:00:27]: that you think about business too. But I think there's such a powerful insight at its core, which is groupthink is real and pervasive and really problematic. Like, your brain constantly shuts down because you're not even thinking in your other way or you're not thinking. You just look around you and you decide to adopt the same beliefs as people around you. And everyone thinksSwyx [01:00:48]: they're immuneFlo [01:00:49]: and everyone elseSwyx [01:00:50]: is doing itFlo [01:00:51]: except themselves. I'm a special snowflake. I have free will. That's right. And so I actually make it a point to look for, and then I think about it and I'm like, do I believe this thing? And very often the answer is yes. And then I just say it. And so I think the AI safety is an example of that. Like, at some point, Marc Andreessen blocked me on Twitter and it hurt, frankly. I really look up to Marc AndreessenSwyx [01:01:13]: and I knew he would block me. It means you're successful on Twitter.Flo [01:01:17]: It's just the right message. Marc Andreessen was really my booster initially on Twitter. He really made my account. And I was like, look, I'm really concerned about AI safety. It is an unpopular viewSwyx [01:01:27]: among my peers. I remember, you were one of the few that actually came out in support of the bill.Flo [01:01:32]: I came out in support of SB1047 a year and a half ago. I put like some tweet storms about how I was really concerned. And yeah, I was blocked by a bunch of AI safety people and I don't like it, but you know, it's funny, maybe it's my French education. But look, in France, World War II is very present in people's minds and the phenomenon of people collaborating with the Nazis and there's always this sort of debate that people have like at dinner and it's like, ah, would you really have resisted during World War II? And everybody is always saying, oh yeah, we totally have resisted. It's like, yeah, but no. The reality of it is 95% of the country did not resist and most of it actually collaborated actively with the Nazis. And so 95% of y'all are wrong. You would actually have collaborated, right? I've always told myself I will stand up for what I think is right because some people got attacked and the way I was brought up is if someone gets attacked before you, you get involved. It doesn't matter, you get involved and you help the person, right? And so, look, I'm not pretending we're nowhere near a World War II phenomenon but I'm like, exactly because we are nowhere nearAlessio [01:02:45]: this kind of phenomenon. The stakes are so low and if you're not going to stand upFlo [01:02:49]: for what you think is right when the stakes are so low,Swyx [01:02:52]: are you going to stand up when it matters? There's an inconsistency in your statements because you simultaneously believe that AGI is very soon and you also say stakes are low. You can't believe both are real.Flo [01:03:03]: Well, why does AGI make the stakes of speaking up higher?Swyx [01:03:06]: Sorry, the stakes of safety.Flo [01:03:08]: Oh yeah, no, the stakes of AISwyx [01:03:11]: are like physical safety?Flo [01:03:12]: No, AI safety. Oh no, the stakes of AI safety couldn't be higher.Swyx [01:03:17]: I meant the stakesFlo [01:03:18]: of speaking up aboutAlessio [01:03:19]: pronouns or whatever. How do you figure out who's real and who isn't? Because there was a manifesto for responsible AI that hundreds of VCs and people signed and I don't think anybody actually thinks about it anymore.Flo [01:03:30]: Was that the pause letter?Swyx [01:03:31]: The six-month pause?Flo [01:03:32]: No,Alessio [01:03:33]: there was something else that I think general catalyst and some fun sign. And then there's maybe the anthropic case which is like, hey, we're leaving open AI because you guys don't take security seriously and then it's like, hey, what if we gave AI access to a whole computerFlo [01:03:49]: to just go do things?Alessio [01:03:50]: How do you reconcile like, okay, I mean, you could say the same thing about Lindy. It's like, if you're worried about AI safety, why are you building AI? Right? That's kind of like the extreme thinking. How do you internally decide between participation and talking about it and saying, hey, I think this is important but I'm still going to build towards that and building actually makes it safer because I'm involved versus just being like anti. I think this is unsafe but then not do anything about it and just kind of remove yourselfFlo [01:04:20]: from the whole thing. What I think about our own involvement here is I'm acutely concerned about the risks at the model layer and I'm simultaneously very excited about the upside. Like, for the record, my PDoom, insofar as I can quantify it, which I cannot, but if I had to, like my vibe is like 10% or something like that and so there's like a 90% chance that we live in like a pure utopia. Right? And that's awesome. Right? So like, let's go after utopia. Right? Let's talk about the 10% chance that we live in a utopia where there's no disease and it's like a post-scarcity world. I think that utopia is going to happen through, like again, I'm bringing my little contribution to the movement. I think it would be silly to say no to the upside because you're concerned about the downside. At the same time, we want to be concerned about the downside. I know that it's very self-serving to say, oh, you know, like the downside doesn't exist at my layer, it exists at like the model layer. But truly, look at Lindy, look at the Apple building. I struggle to see exactly how it would like get up if I'm concerned about the model layer.Swyx [01:05:21]: Okay. Well, this kind of discussion can go on for hours. It is still daylight, so not the best time for it. But I really appreciate you spending the time. Any other last calls to actions or thoughts that you feel like you want to get off your chest?Flo [01:05:33]: AGI is coming.Flo [01:05:37]: Are you hiringAlessio [01:05:38]: for any roles? We are.Flo [01:05:40]: Oh yeah, I guess that should be the...Swyx [01:05:43]: Don't bother.Flo [01:05:44]: No, can you stop saying AGI is coming and just talk about it? We are also hiring yeah, we are hiring designers and engineers right now. Yeah. So hit me up at flo.lindy.aiAlessio [01:05:55]: And then go talk to my Lindy. You're not actually going to read it.Flo [01:05:58]: Actually, I have wonderedSwyx [01:05:59]: how many times when I talk to you, I'm talking to a bot. Part of that is I don't have to know, right?Flo [01:06:05]: That's right. Well, it's actually doubly confusing because we also have a teammateSwyx [01:06:09]: whose name is Lindy. Yes, I was wondering when I met her, I was like, wait, did you hire her first?Flo [01:06:14]: Marketing is fun. No, she was an inspiration after we named the company both after her. Oh, okay.Swyx [01:06:19]: Interesting. Yeah, wonderful. I'll comment on the design piece just because I think that there are a lot of AI companies that very much focus on the functionality and the models and the capabilities and the benchmark. But I think that increasingly I'm seeing people differentiate with design and people want to use beautiful products and people who can figure that out and integrate the AI into their human lives. You know, design at the limit. One, at the lowest level is to make this look pretty, make this look like Stripe or Linear's homepage. That's design. But at the highest level of design it is make this integrate seamlessly into my life. Intuitive, beautiful, inspirational maybe even. And I think that companies that, you know, this is kind of like a blog post I've been thinking about, companies that emphasize design actually are going to win more than companies that don't. Yeah,Flo [01:07:06]: I love Steve Jobs' quote and I'm going to butcher it. It's something like, design is the expression of the soul of a man-made product through successive layers of design. Jesus. Right? He was good. He was cooking. He was cooking on that one. He was cooking. It starts with the soul of the product which is why I was saying it is so important to reach alignment about that soul of the product, right? It's like an onion, like you peel the onion in those layers, right? And you design an entire journey just like the user experiencing your product chronologically all the way from the beginning of like the awareness stage I think it is also the job of the designer to design that part of the experience. It's like, okay, design is immensely important. Okay.Alessio [01:07:46]: Lovely. Yeah.Flo [01:07:48]: Thanks for coming on, Flo. Yeah, absolutely. Thanks for having me. Get full access to Latent Space at www.latent.space/subscribe
    --------  
    1:09:53
  • Agents @ Work: Dust.tt
    We are recording our next big recap episode and taking questions! Submit questions and messages on Speakpipe here for a chance to appear on the show!Also subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups!In our first ever episode with Logan Kilpatrick we called out the two hottest LLM frameworks at the time: LangChain and Dust. We’ve had Harrison from LangChain on twice (as a guest and as a co-host), and we’ve now finally come full circle as Stanislas from Dust joined us in the studio.After stints at Oracle and Stripe, Stan had joined OpenAI to work on mathematical reasoning capabilities. He describes his time at OpenAI as "the PhD I always wanted to do" while acknowledging the challenges of research work: "You're digging into a field all day long for weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13 seconds, you're like, 'oh, yeah, that was obvious.' And you go back to digging." This experience, combined with early access to GPT-4's capabilities, shaped his decision to start Dust: "If we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down."The History of DustDust's journey can be broken down into three phases:* Developer Framework (2022): Initially positioned as a competitor to LangChain, Dust started as a developer tooling platform. While both were open source, their approaches differed – LangChain focused on broad community adoption and integration as a pure developer experience, while Dust emphasized UI-driven development and better observability that wasn’t just `print` statements.* Browser Extension (Early 2023): The company pivoted to building XP1, a browser extension that could interact with web content. This experiment helped validate user interaction patterns with AI, even while using less capable models than GPT-4.* Enterprise Platform (Current): Today, Dust has evolved into an infrastructure platform for deploying AI agents within companies, with impressive metrics like 88% daily active users in some deployments.The Case for Being HorizontalThe big discussion for early stage companies today is whether or not to be horizontal or vertical. Since models are so good at general tasks, a lot of companies are building vertical products that take care of a workflow end-to-end in order to offer more value and becoming more of “Services as Software”. Dust on the other hand is a platform for the users to build their own experiences, which has had a few advantages:* Maximum Penetration: Dust reports 60-70% weekly active users across entire companies, demonstrating the potential reach of horizontal solutions rather than selling into a single team.* Emergent Use Cases: By allowing non-technical users to create agents, Dust enables use cases to emerge organically from actual business needs rather than prescribed solutions.* Infrastructure Value: The platform approach creates lasting value through maintained integrations and connections, similar to how Stripe's value lies in maintaining payment infrastructure. Rather than relying on third-party integration providers, Dust maintains its own connections to ensure proper handling of different data types and structures.The Vertical ChallengeHowever, this approach comes with trade-offs:* Harder Go-to-Market: As Stan talked about: "We spike at penetration... but it makes our go-to-market much harder. Vertical solutions have a go-to-market that is much easier because they're like, 'oh, I'm going to solve the lawyer stuff.'"* Complex Infrastructure: Building a horizontal platform requires maintaining numerous integrations and handling diverse data types appropriately – from structured Salesforce data to unstructured Notion pages. As you scale integrations, the cost of maintaining them also scales. * Product Surface Complexity: Creating an interface that's both powerful and accessible to non-technical users requires careful design decisions, down to avoiding technical terms like "system prompt" in favor of "instructions." The Future of AI PlatformsStan initially predicted we'd see the first billion-dollar single-person company in 2023 (a prediction later echoed by Sam Altman), but he's now more focused on a different milestone: billion-dollar companies with engineering teams of just 20 people, enabled by AI assistance.This vision aligns with Dust's horizontal platform approach – building the infrastructure that allows small teams to achieve outsized impact through AI augmentation. Rather than replacing entire job functions (the vertical approach), they're betting on augmenting existing workflows across organizations.Full YouTube EpisodeChapters* 00:00:00 Introductions* 00:04:33 Joining OpenAI from Paris* 00:09:54 Research evolution and compute allocation at OpenAI* 00:13:12 Working with Ilya Sutskever and OpenAI's vision* 00:15:51 Leaving OpenAI to start Dust* 00:18:15 Early focus on browser extension and WebGPT-like functionality* 00:20:20 Dust as the infrastructure for agents* 00:24:03 Challenges of building with early AI models* 00:28:17 LLMs and Workflow Automation* 00:35:28 Building dependency graphs of agents* 00:37:34 Simulating API endpoints* 00:40:41 State of AI models* 00:43:19 Running evals* 00:46:36 Challenges in building AI agents infra* 00:49:21 Buy vs. build decisions for infrastructure components* 00:51:02 Future of SaaS and AI's Impact on Software* 00:53:07 The single employee $1B company race* 00:56:32 Horizontal vs. vertical approaches to AI agentsTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:11]: Hey, and today we're in a studio with Stanislas, welcome.Stan [00:00:14]: Thank you very much for having me.Swyx [00:00:16]: Visiting from Paris.Stan [00:00:17]: Paris.Swyx [00:00:18]: And you have had a very distinguished career. It's very hard to summarize, but you went to college in both Ecopolytechnique and Stanford, and then you worked in a number of places, Oracle, Totems, Stripe, and then OpenAI pre-ChatGPT. We'll talk, we'll spend a little bit of time about that. About two years ago, you left OpenAI to start Dust. I think you were one of the first OpenAI alum founders.Stan [00:00:40]: Yeah, I think it was about at the same time as the Adept guys, so that first wave.Swyx [00:00:46]: Yeah, and people really loved our David episode. We love a few sort of OpenAI stories, you know, for back in the day, like we're talking about pre-recording. Probably the statute of limitations on some of those stories has expired, so you can talk a little bit more freely without them coming after you. But maybe we'll just talk about, like, what was your journey into AI? You know, you were at Stripe for almost five years, there are a lot of Stripe alums going into OpenAI. I think the Stripe culture has come into OpenAI quite a bit.Stan [00:01:11]: Yeah, so I think the buses of Stripe people really started flowing in, I guess, after ChatGPT. But, yeah, my journey into AI is a... I mean, Greg Brockman. Yeah, yeah. From Greg, of course. And Daniela, actually, back in the days, Daniela Amodei.Swyx [00:01:27]: Yes, she was COO, I mean, she is COO, yeah. She had a pretty high job at OpenAI at the time, yeah, for sure.Stan [00:01:34]: My journey started as anybody else, you're fascinated with computer science and you want to make them think, it's awesome, but it doesn't work. I mean, it was a long time ago, it was like maybe 16, so it was 25 years ago. Then the first big exposure to AI would be at Stanford, and I'm going to, like, disclose a whole lamb, because at the time it was a class taught by Andrew Ng, and there was no deep learning. It was half features for vision and a star algorithm. So it was fun. But it was the early days of deep learning. At the time, I think a few years after, it was the first project at Google. But you know, that cat face or the human face trained from many images. I went to, hesitated doing a PhD, more in systems, eventually decided to go into getting a job. Went at Oracle, started a company, did a gazillion mistakes, got acquired by Stripe, worked with Greg Buckman there. And at the end of Stripe, I started interesting myself in AI again, felt like it was the time, you had the Atari games, you had the self-driving craziness at the time. And I started exploring projects, it felt like the Atari games were incredible, but there were still games. And I was looking into exploring projects that would have an impact on the world. And so I decided to explore three things, self-driving cars, cybersecurity and AI, and math and AI. It's like I sing it by a decreasing order of impact on the world, I guess.Swyx [00:03:01]: Discovering new math would be very foundational.Stan [00:03:03]: It is extremely foundational, but it's not as direct as driving people around.Swyx [00:03:07]: Sorry, you're doing this at Stripe, you're like thinking about your next move.Stan [00:03:09]: No, it was at Stripe, kind of a bit of time where I started exploring. I did a bunch of work with friends on trying to get RC cars to drive autonomously. Almost started a company in France or Europe about self-driving trucks. We decided to not go for it because it was probably very operational. And I think the idea of the company, of the team wasn't there. And also I realized that if I wake up a day and because of a bug I wrote, I killed a family, it would be a bad experience. And so I just decided like, no, that's just too crazy. And then I explored cybersecurity with a friend. We're trying to apply transformers to cut fuzzing. So cut fuzzing, you have kind of an algorithm that goes really fast and tries to mutate the inputs of a library to find bugs. And we tried to apply a transformer to that and do reinforcement learning with the signal of how much you propagate within the binary. Didn't work at all because the transformers are so slow compared to evolutionary algorithms that it kind of didn't work. Then I started interested in math and AI and started working on SAT solving with AI. And at the same time, OpenAI was kind of starting the reasoning team that were tackling that project as well. I was in touch with Greg and eventually got in touch with Ilya and finally found my way to OpenAI. I don't know how much you want to dig into that. The way to find your way to OpenAI when you're in Paris was kind of an interesting adventure as well.Swyx [00:04:33]: Please. And I want to note, this was a two-month journey. You did all this in two months.Stan [00:04:38]: The search.Swyx [00:04:40]: Your search for your next thing, because you left in July 2019 and then you joined OpenAI in September.Stan [00:04:45]: I'm going to be ashamed to say that.Swyx [00:04:47]: You were searching before. I was searching before.Stan [00:04:49]: I mean, it's normal. No, the truth is that I moved back to Paris through Stripe and I just felt the hardship of being remote from your team nine hours away. And so it kind of freed a bit of time for me to start the exploration before. Sorry, Patrick. Sorry, John.Swyx [00:05:05]: Hopefully they're listening. So you joined OpenAI from Paris and from like, obviously you had worked with Greg, but notStan [00:05:13]: anyone else. No. Yeah. So I had worked with Greg, but not Ilya, but I had started chatting with Ilya and Ilya was kind of excited because he knew that I was a good engineer through Greg, I presume, but I was not a trained researcher, didn't do a PhD, never did research. And I started chatting and he was excited all the way to the point where he was like, hey, come pass interviews, it's going to be fun. I think he didn't care where I was, he just wanted to try working together. So I go to SF, go through the interview process, get an offer. And so I get Bob McGrew on the phone for the first time, he's like, hey, Stan, it's awesome. You've got an offer. When are you coming to SF? I'm like, hey, it's awesome. I'm not coming to the SF. I'm based in Paris and we just moved. He was like, hey, it's awesome. Well, you don't have an offer anymore. Oh, my God. No, it wasn't as hard as that. But that's basically the idea. And it took me like maybe a couple more time to keep chatting and they eventually decided to try a contractor set up. And that's how I kind of started working at OpenAI, officially as a contractor, but in practice really felt like being an employee.Swyx [00:06:14]: What did you work on?Stan [00:06:15]: So it was solely focused on math and AI. And in particular in the application, so the study of the larger grid models, mathematical reasoning capabilities, and in particular in the context of formal mathematics. The motivation was simple, transformers are very creative, but yet they do mistakes. Formal math systems are of the ability to verify a proof and the tactics they can use to solve problems are very mechanical, so you miss the creativity. And so the idea was to try to explore both together. You would get the creativity of the LLMs and the kind of verification capabilities of the formal system. A formal system, just to give a little bit of context, is a system in which a proof is a program and the formal system is a type system, a type system that is so evolved that you can verify the program. If the type checks, it means that the program is correct.Swyx [00:07:06]: Is the verification much faster than actually executing the program?Stan [00:07:12]: Verification is instantaneous, basically. So the truth is that what you code in involves tactics that may involve computation to search for solutions. So it's not instantaneous. You do have to do the computation to expand the tactics into the actual proof. The verification of the proof at the very low level is instantaneous.Swyx [00:07:32]: How quickly do you run into like, you know, halting problem PNP type things, like impossibilities where you're just like that?Stan [00:07:39]: I mean, you don't run into it at the time. It was really trying to solve very easy problems. So I think the... Can you give an example of easy? Yeah, so that's the mass benchmark that everybody knows today. The Dan Hendricks one. The Dan Hendricks one, yeah. And I think it was the low end part of the mass benchmark at the time, because that mass benchmark includes AMC problems, AMC 8, AMC 10, 12. So these are the easy ones. Then AIME problems, somewhat harder, and some IMO problems, like Crazy Arm.Swyx [00:08:07]: For our listeners, we covered this in our Benchmarks 101 episode. AMC is literally the grade of like high school, grade 8, grade 10, grade 12. So you can solve this. Just briefly to mention this, because I don't think we'll touch on this again. There's a bit of work with like Lean, and then with, you know, more recently with DeepMind doing like scoring like silver on the IMO. Any commentary on like how math has evolved from your early work to today?Stan [00:08:34]: I mean, that result is mind blowing. I mean, from my perspective, spent three years on that. At the same time, Guillaume Lampe in Paris, we were both in Paris, actually. He was at FAIR, was working on some problems. We were pushing the boundaries, and the goal was the IMO. And we cracked a few problems here and there. But the idea of getting a medal at an IMO was like just remote. So this is an impressive result. And we can, I think the DeepMind team just did a good job of scaling. I think there's nothing too magical in their approach, even if it hasn't been published. There's a Dan Silver talk from seven days ago where it goes a little bit into more details. It feels like there's nothing magical there. It's really applying reinforcement learning and scaling up the amount of data that can generate through autoformalization. So we can dig into what autoformalization means if you want.Alessio [00:09:26]: Let's talk about the tail end, maybe, of the OpenAI. So you joined, and you're like, I'm going to work on math and do all of these things. I saw on one of your blog posts, you mentioned you fine-tuned over 10,000 models at OpenAI using 10 million A100 hours. How did the research evolve from the GPD 2, and then getting closer to DaVinci 003? And then you left just before ChatGPD was released, but tell people a bit more about the research path that took you there.Stan [00:09:54]: I can give you my perspective of it. I think at OpenAI, there's always been a large chunk of the compute that was reserved to train the GPTs, which makes sense. So it was pre-entropic splits. Most of the compute was going to a product called Nest, which was basically GPT-3. And then you had a bunch of, let's say, remote, not core research teams that were trying to explore maybe more specific problems or maybe the algorithm part of it. The interesting part, I don't know if it was where your question was going, is that in those labs, you're managing researchers. So by definition, you shouldn't be managing them. But in that space, there's a managing tool that is great, which is compute allocation. Basically by managing the compute allocation, you can message the team of where you think the priority should go. And so it was really a question of, you were free as a researcher to work on whatever you wanted. But if it was not aligned with OpenAI mission, and that's fair, you wouldn't get the compute allocation. As it happens, solving math was very much aligned with the direction of OpenAI. And so I was lucky to generally get the compute I needed to make good progress.Swyx [00:11:06]: What do you need to show as incremental results to get funded for further results?Stan [00:11:12]: It's an imperfect process because there's a bit of a... If you're working on math and AI, obviously there's kind of a prior that it's going to be aligned with the company. So it's much easier than to go into something much more risky, much riskier, I guess. You have to show incremental progress, I guess. It's like you ask for a certain amount of compute and you deliver a few weeks after and you demonstrate that you have a progress. Progress might be a positive result. Progress might be a strong negative result. And a strong negative result is actually often much harder to get or much more interesting than a positive result. And then it generally goes into, as any organization, you would have people finding your project or any other project cool and fancy. And so you would have that kind of phase of growing up compute allocation for it all the way to a point. And then maybe you reach an apex and then maybe you go back mostly to zero and restart the process because you're going in a different direction or something else. That's how I felt. Explore, exploit. Yeah, exactly. Exactly. Exactly. It's a reinforcement learning approach.Swyx [00:12:14]: Classic PhD student search process.Alessio [00:12:17]: And you were reporting to Ilya, like the results you were kind of bringing back to him or like what's the structure? It's almost like when you're doing such cutting edge research, you need to report to somebody who is actually really smart to understand that the direction is right.Stan [00:12:29]: So we had a reasoning team, which was working on reasoning, obviously, and so math in general. And that team had a manager, but Ilya was extremely involved in the team as an advisor, I guess. Since he brought me in OpenAI, I was lucky to mostly during the first years to have kind of a direct access to him. He would really coach me as a trainee researcher, I guess, with good engineering skills. And Ilya, I think at OpenAI, he was the one showing the North Star, right? He was his job and I think he really enjoyed it and he did it super well, was going through the teams and saying, this is where we should be going and trying to, you know, flock the different teams together towards an objective.Swyx [00:13:12]: I would say like the public perception of him is that he was the strongest believer in scaling. Oh, yeah. Obviously, he has always pursued the compression thesis. You have worked with him personally, what does the public not know about how he works?Stan [00:13:26]: I think he's really focused on building the vision and communicating the vision within the company, which was extremely useful. I was personally surprised that he spent so much time, you know, working on communicating that vision and getting the teams to work together versus...Swyx [00:13:40]: To be specific, vision is AGI? Oh, yeah.Stan [00:13:42]: Vision is like, yeah, it's the belief in compression and scanning computes. I remember when I started working on the Reasoning team, the excitement was really about scaling the compute around Reasoning and that was really the belief we wanted to ingrain in the team. And that's what has been useful to the team and with the DeepMind results shows that it was the right approach with the success of GPT-4 and stuff shows that it was the right approach.Swyx [00:14:06]: Was it according to the neural scaling laws, the Kaplan paper that was published?Stan [00:14:12]: I think it was before that, because those ones came with GPT-3, basically at the time of GPT-3 being released or being ready internally. But before that, there really was a strong belief in scale. I think it was just the belief that the transformer was a generic enough architecture that you could learn anything. And that was just a question of scaling.Alessio [00:14:33]: Any other fun stories you want to tell? Sam Altman, Greg, you know, anything.Stan [00:14:37]: Weirdly, I didn't work that much with Greg when I was at OpenAI. He had always been mostly focused on training the GPTs and rightfully so. One thing about Sam Altman, he really impressed me because when I joined, he had joined not that long ago and it felt like he was kind of a very high level CEO. And I was mind blown by how deep he was able to go into the subjects within a year or something, all the way to a situation where when I was having lunch by year two, I was at OpenAI with him. He would just quite know deeply what I was doing. With no ML background. Yeah, with no ML background, but I didn't have any either, so I guess that explains why. But I think it's a question about, you don't necessarily need to understand the very technicalities of how things are done, but you need to understand what's the goal and what's being done and what are the recent results and all of that in you. And we could have kind of a very productive discussion. And that really impressed me, given the size at the time of OpenAI, which was not negligible.Swyx [00:15:44]: Yeah. I mean, you've been a, you were a founder before, you're a founder now, and you've seen Sam as a founder. How has he affected you as a founder?Stan [00:15:51]: I think having that capability of changing the scale of your attention in the company, because most of the time you operate at a very high level, but being able to go deep down and being in the known of what's happening on the ground is something that I feel is really enlightening. That's not a place in which I ever was as a founder, because first company, we went all the way to 10 people. Current company, there's 25 of us. So the high level, the sky and the ground are pretty much at the same place. No, you're being too humble.Swyx [00:16:21]: I mean, Stripe was also like a huge rocket ship.Stan [00:16:23]: Stripe, I was a founder. So I was, like at OpenAI, I was really happy being on the ground, pushing the machine, making it work. Yeah.Swyx [00:16:31]: Last OpenAI question. The Anthropic split you mentioned, you were around for that. Very dramatic. David also left around that time, you left. This year, we've also had a similar management shakeup, let's just call it. Can you compare what it was like going through that split during that time? And then like, does that have any similarities now? Like, are we going to see a new Anthropic emerge from these folks that just left?Stan [00:16:54]: That I really, really don't know. At the time, the split was pretty surprising because they had been trying GPT-3, it was a success. And to be completely transparent, I wasn't in the weeds of the splits. What I understood of it is that there was a disagreement of the commercialization of that technology. I think the focal point of that disagreement was the fact that we started working on the API and wanted to make those models available through an API. Is that really the core disagreement? I don't know.Swyx [00:17:25]: Was it safety?Stan [00:17:26]: Was it commercialization?Swyx [00:17:27]: Or did they just want to start a company?Stan [00:17:28]: Exactly. Exactly. That I don't know. But I think what I was surprised of is how quickly OpenAI recovered at the time. And I think it's just because we were mostly a research org and the mission was so clear that some divergence in some teams, some people leave, the mission is still there. We have the compute. We have a site. So it just keeps going.Swyx [00:17:50]: Very deep bench. Like just a lot of talent. Yeah.Alessio [00:17:53]: So that was the OpenAI part of the history. Exactly. So then you leave OpenAI in September 2022. And I would say in Silicon Valley, the two hottest companies at the time were you and Lanktrain. What was that start like and why did you decide to start with a more developer focused kind of like an AI engineer tool rather than going back into some more research and something else?Stan [00:18:15]: Yeah. First, I'm not a trained researcher. So going through OpenAI was really kind of the PhD I always wanted to do. But research is hard. You're digging into a field all day long for weeks and weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13 seconds, you're like, oh, yeah, that was obvious. And you go back to digging. I'm not a trained, like formally trained researcher, and it wasn't kind of a necessarily an ambition of me of creating, of having a research career. And I felt the hardness of it. I enjoyed a lot of like that a ton. But at the time, I decided that I wanted to go back to something more productive. And the other fun motivation was like, I mean, if we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down. And so that was kind of the true motivation for like trying to go there. So that's kind of the core motivation at the beginning of personally. And the motivation for starting a company was pretty simple. I had seen GPT-4 internally at the time, it was September 2022. So it was pre-GPT, but GPT-4 was ready since, I mean, I'd been ready for a few months internally. I was like, okay, that's obvious, the capabilities are there to create an insane amount of value to the world. And yet the deployment is not there yet. The revenue of OpenAI at the time were ridiculously small compared to what it is today. So the thesis was, there's probably a lot to be done at the product level to unlock the usage.Alessio [00:19:49]: Yeah. Let's talk a bit more about the form factor, maybe. I think one of the first successes you had was kind of like the WebGPT-like thing, like using the models to traverse the web and like summarize things. And the browser was really the interface. Why did you start with the browser? Like what was it important? And then you built XP1, which was kind of like the browser extension.Stan [00:20:09]: So the starting point at the time was, if you wanted to talk about LLMs, it was still a rather small community, a community of mostly researchers and to some extent, very early adopters, very early engineers. It was almost inconceivable to just build a product and go sell it to the enterprise, though at the time there was a few companies doing that. The one on marketing, I don't remember its name, Jasper. But so the natural first intention, the first, first, first intention was to go to the developers and try to create tooling for them to create product on top of those models. And so that's what Dust was originally. It was quite different than Lanchain, and Lanchain just beat the s**t out of us, which is great. It's a choice.Swyx [00:20:53]: You were cloud, in closed source. They were open source.Stan [00:20:56]: Yeah. So technically we were open source and we still are open source, but I think that doesn't really matter. I had the strong belief from my research time that you cannot create an LLM-based workflow on just one example. Basically, if you just have one example, you overfit. So as you develop your interaction, your orchestration around the LLM, you need a dozen examples. Obviously, if you're running a dozen examples on a multi-step workflow, you start paralyzing stuff. And if you do that in the console, you just have like a messy stream of tokens going out and it's very hard to observe what's going there. And so the idea was to go with an UI so that you could kind of introspect easily the output of each interaction with the model and dig into there through an UI, which is-Swyx [00:21:42]: Was that open source? I actually didn't come across it.Stan [00:21:44]: Oh yeah, it wasn't. I mean, Dust is entirely open source even today. We're not going for an open source-Swyx [00:21:48]: If it matters, I didn't know that.Stan [00:21:49]: No, no, no, no, no. The reason why is because we're not open source because we're not doing an open source strategy. It's not an open source go-to-market at all. We're open source because we can and it's fun.Swyx [00:21:59]: Open source is marketing. You have all the downsides of open source, which is like people can clone you.Stan [00:22:03]: But I think that downside is a big fallacy. Okay. Yes, anybody can clone Dust today, but the value of Dust is not the current state. The value of Dust is the number of eyeballs and hands of developers that are creating to it in the future. And so yes, anybody can clone it today, but that wouldn't change anything. There is some value in being open source. In a discussion with the security team, you can be extremely transparent and just show the code. When you have discussion with users and there's a bug or a feature missing, you can just point to the issue, show the pull request, show the, show the, exactly, oh, PR welcome. That doesn't happen that much, but you can show the progress if the person that you're chatting with is a little bit technical, they really enjoy seeing the pull request advancing and seeing all the way to deploy. And then the downsides are mostly around security. You never want to do security by obfuscation. But the truth is that your vector of attack is facilitated by you being open source. But at the same time, it's a good thing because if you're doing anything like a bug bountying or stuff like that, you just give much more tools to the bug bountiers so that their output is much better. So there's many, many, many trade-offs. I don't believe in the value of the code base per se. I think it's really the people that are on the code base that have the value and go to market and the product and all of those things that are around the code base. Obviously, that's not true for every code base. If you're working on a very secret kernel to accelerate the inference of LLMs, I would buy that you don't want to be open source. But for product stuff, I really think there's very little risk. Yeah.Alessio [00:23:39]: I signed up for XP1, I was looking, January 2023. I think at the time you were on DaVinci 003. Given that you had seen GPD 4, how did you feel having to push a product out that was using this model that was so inferior? And you're like, please, just use it today. I promise it's going to get better. Just overall, as a founder, how do you build something that maybe doesn't quite work with the model today, but you're just expecting the new model to be better?Stan [00:24:03]: Yeah, so actually, XP1 was even on a smaller one that was the post-GDPT release, small version, so it was... Ada, Babbage... No, no, no, not that far away. But it was the small version of GDPT, basically. I don't remember its name. Yes, you have a frustration there. But at the same time, I think XP1 was designed, was an experiment, but was designed as a way to be useful at the current capability of the model. If you just want to extract data from a LinkedIn page, that model was just fine. If you want to summarize an article on a newspaper, that model was just fine. And so it was really a question of trying to find a product that works with the current capability, knowing that you will always have tailwinds as models get better and faster and cheaper. So that was kind of a... There's a bit of a frustration because you know what's out there and you know that you don't have access to it yet. It's also interesting to try to find a product that works with the current capability.Alessio [00:24:55]: And we highlighted XP1 in our anatomy of autonomy post in April of last year, which was, you know, where are all the agents, right? So now we spent 30 minutes getting to what you're building now. So you basically had a developer framework, then you had a browser extension, then you had all these things, and then you kind of got to where Dust is today. So maybe just give people an overview of what Dust is today and the courtesies behind it. Yeah, of course.Stan [00:25:20]: So Dust, we really want to build the infrastructure so that companies can deploy agents within their teams. We are horizontal by nature because we strongly believe in the emergence of use cases from the people having access to creating an agent that don't need to be developers. They have to be thinkers. They have to be curious. But anybody can create an agent that will solve an operational thing that they're doing in their day-to-day job. And to make those agents useful, there's two focus, which is interesting. The first one is an infrastructure focus. You have to build the pipes so that the agent has access to the data. You have to build the pipes such that the agents can take action, can access the web, et cetera. So that's really an infrastructure play. Maintaining connections to Notion, Slack, GitHub, all of them is a lot of work. It is boring work, boring infrastructure work, but that's something that we know is extremely valuable in the same way that Stripe is extremely valuable because it maintains the pipes. And we have that dual focus because we're also building the product for people to use it. And there it's fascinating because everything started from the conversational interface, obviously, which is a great starting point. But we're only scratching the surface, right? I think we are at the pong level of LLM productization. And we haven't invented the C3. We haven't invented Counter-Strike. We haven't invented Cyberpunk 2077. So this is really our mission is to really create the product that lets people equip themselves to just get away all the work that can be automated or assisted by LLMs.Alessio [00:26:57]: And can you just comment on different takes that people had? So maybe the most open is like auto-GPT. It's just kind of like just trying to do anything. It's like it's all magic. There's no way for you to do anything. Then you had the ADAPT, you know, we had David on the podcast. They're very like super hands-on with each individual customer to build super tailored. How do you decide where to draw the line between this is magic? This is exposed to you, especially in a market where most people don't know how to build with AI at all. So if you expect them to do the thing, they're probably not going to do it. Yeah, exactly.Stan [00:27:29]: So the auto-GPT approach obviously is extremely exciting, but we know that the agentic capability of models are not quite there yet. It just gets lost. So we're starting, we're starting where it works. Same with the XP one. And where it works is pretty simple. It's like simple workflows that involve a couple tools where you don't even need to have the model decide which tools it's used in the sense of you just want people to put it in the instructions. It's like take that page, do that search, pick up that document, do the work that I want in the format I want, and give me the results. There's no smartness there, right? In terms of orchestrating the tools, it's mostly using English for people to program a workflow where you don't have the constraint of having compatible API between the two.Swyx [00:28:17]: That kind of personal automation, would you say it's kind of like an LLM Zapier type ofStan [00:28:22]: thing?Swyx [00:28:22]: Like if this, then that, and then, you know, do this, then this. You're programming with English?Stan [00:28:28]: So you're programming with English. So you're just saying, oh, do this and then that. You can even create some form of APIs. You say, when I give you the command X, do this. When I give you the command Y, do this. And you describe the workflow. But you don't have to create boxes and create the workflow explicitly. It just needs to describe what are the tasks supposed to be and make the tool available to the agent. The tool can be a semantic search. The tool can be querying into a structured database. The tool can be searching on the web. And obviously, the interesting tools that we're only starting to scratch are actually creating external actions like reimbursing something on Stripe, sending an email, clicking on a button in the admin or something like that.Swyx [00:29:11]: Do you maintain all these integrations?Stan [00:29:13]: Today, we maintain most of the integrations. We do always have an escape hatch for people to kind of custom integrate. But the reality is that the reality of the market today is that people just want it to work, right? And so it's mostly us maintaining the integration. As an example, a very good source of information that is tricky to productize is Salesforce. Because Salesforce is basically a database and a UI. And they do the f**k they want with it. And so every company has different models and stuff like that. So right now, we don't support it natively. And the type of support or real native support will be slightly more complex than just osing into it, like is the case with Slack as an example. Because it's probably going to be, oh, you want to connect your Salesforce to us? Give us the SQL. That's the Salesforce QL language. Give us the queries you want us to run on it and inject in the context of dust. So that's interesting how not only integrations are cool, and some of them require a bit of work on the user. And for some of them that are really valuable to our users, but we don't support yet, they can just build them internally and push the data to us.Swyx [00:30:18]: I think I understand the Salesforce thing. But let me just clarify, are you using browser automation because there's no API for something?Stan [00:30:24]: No, no, no, no. In that case, so we do have browser automation for all the use cases and apply the public web. But for most of the integration with the internal system of the company, it really runs through API.Swyx [00:30:35]: Haven't you felt the pull to RPA, browser automation, that kind of stuff?Stan [00:30:39]: I mean, what I've been saying for a long time, maybe I'm wrong, is that if the future is that you're going to stand in front of a computer and looking at an agent clicking on stuff, then I'll hit my computer. And my computer is a big Lenovo. It's black. Doesn't sound good at all compared to a Mac. And if the APIs are there, we should use them. There is going to be a long tail of stuff that don't have APIs, but as the world is moving forward, that's disappearing. So the core API value in the past has really been, oh, this old 90s product doesn't have an API. So I need to use the UI to automate. I think for most of the ICP companies, the companies that ICP for us, the scale ups that are between 500 and 5,000 people, tech companies, most of the SaaS they use have APIs. Now there's an interesting question for the open web, because there are stuff that you want to do that involve websites that don't necessarily have APIs. And the current state of web integration from, which is us and OpenAI and Anthropic, I don't even know if they have web navigation, but I don't think so. The current state of affair is really, really broken because you have what? You have basically search and headless browsing. But headless browsing, I think everybody's doing basically body.innertext and fill that into the model, right?Swyx [00:31:56]: MARK MIRCHANDANI There's parsers into Markdown and stuff.Stan [00:31:58]: FRANCESC CAMPOY I'm super excited by the companies that are exploring the capability of rendering a web page into a way that is compatible for a model, being able to maintain the selector. So that's basically the place where to click in the page through that process, expose the actions to the model, have the model select an action in a way that is compatible with model, which is not a big page of a full DOM that is very noisy, and then being able to decompress that back to the original page and take the action. And that's something that is really exciting and that will kind of change the level of things that agents can do on the web. That I feel exciting, but I also feel that the bulk of the useful stuff that you can do within the company can be done through API. The data can be retrieved by API. The actions can be taken through API.Swyx [00:32:44]: For listeners, I'll note that you're basically completely disagreeing with David Wan. FRANCESC CAMPOY Exactly, exactly. I've seen it since it's summer. ADEPT is where it is, and Dust is where it is. So Dust is still standing.Alessio [00:32:55]: Can we just quickly comment on function calling? You mentioned you don't need the models to be that smart to actually pick the tools. Have you seen the models not be good enough? Or is it just like, you just don't want to put the complexity in there? Like, is there any room for improvement left in function calling? Or do you feel you usually consistently get always the right response, the right parametersStan [00:33:13]: and all of that?Alessio [00:33:13]: FRANCESC CAMPOY So that's a tricky product question.Stan [00:33:15]: Because if the instructions are good and precise, then you don't have any issue, because it's scripted for you. And the model will just look at the scripts and just follow and say, oh, he's probably talking about that action, and I'm going to use it. And the parameters are kind of abused from the state of the conversation. I'll just go with it. If you provide a very high level, kind of an auto-GPT-esque level in the instructions and provide 16 different tools to your model, yes, we're seeing the models in that state making mistakes. And there is obviously some progress can be made on the capabilities. But the interesting part is that there is already so much work that can assist, augment, accelerate by just going with pretty simply scripted for actions agents. What I'm excited about by pushing our users to create rather simple agents is that once you have those working really well, you can create meta agents that use the agents as actions. And all of a sudden, you can kind of have a hierarchy of responsibility that will probably get you almost to the point of the auto-GPT value. It requires the construction of intermediary artifacts, but you're probably going to be able to achieve something great. I'll give you some example. We have our incidents are shared in Slack in a specific channel, or shipped are shared in Slack. We have a weekly meeting where we have a table about incidents and shipped stuff. We're not writing that weekly meeting table anymore. We have an assistant that just go find the right data on Slack and create the table for us. And that assistant works perfectly. It's trivially simple, right? Take one week of data from that channel and just create the table. And then we have in that weekly meeting, obviously some graphs and reporting about our financials and our progress and our ARR. And we've created assistants to generate those graphs directly. And those assistants works great. By creating those assistants that cover those small parts of that weekly meeting, slowly we're getting to in a world where we'll have a weekly meeting assistance. We'll just call it. You don't need to prompt it. You don't need to say anything. It's going to run those different assistants and get that notion page just ready. And by doing that, if you get there, and that's an objective for us to us using Dust, get there, you're saving an hour of company time every time you run it. Yeah.Alessio [00:35:28]: That's my pet topic of NPM for agents. How do you build dependency graphs of agents? And how do you share them? Because why do I have to rebuild some of the smaller levels of what you built already?Swyx [00:35:40]: I have a quick follow-up question on agents managing other agents. It's a topic of a lot of research, both from Microsoft and even in startups. What you've discovered best practice for, let's say like a manager agent controlling a bunch of small agents. It's two-way communication. I don't know if there should be a protocol format.Stan [00:35:59]: To be completely honest, the state we are at right now is creating the simple agents. So we haven't even explored yet the meta agents. We know it's there. We know it's going to be valuable. We know it's going to be awesome. But we're starting there because it's the simplest place to start. And it's also what the market understands. If you go to a company, random SaaS B2B company, not necessarily specialized in AI, and you take an operational team and you tell them, build some tooling for yourself, they'll understand the small agents. If you tell them, build AutoGP, they'll be like, Auto what?Swyx [00:36:31]: And I noticed that in your language, you're very much focused on non-technical users. You don't really mention API here. You mention instruction instead of system prompt, right? That's very conscious.Stan [00:36:41]: Yeah, it's very conscious. It's a mark of our designer, Ed, who kind of pushed us to create a friendly product. I was knee-deep into AI when I started, obviously. And my co-founder, Gabriel, was a Stripe as well. We started a company together that got acquired by Stripe 15 years ago. It was at Alain, a healthcare company in Paris. After that, it was a little bit less so knee-deep in AI, but really focused on product. And I didn't realize how important it is to make that technology not scary to end users. It didn't feel scary to me, but it was really seen by Ed, our designer, that it was feeling scary to the users. And so we were very proactive and very deliberate about creating a brand that feels not too scary and creating a wording and a language, as you say, that really tried to communicate the fact that it's going to be fine. It's going to be easy. You're going to make it.Alessio [00:37:34]: And another big point that David had about ADAPT is we need to build an environment for the agents to act. And then if you have the environment, you can simulate what they do. How's that different when you're interacting with APIs and you're kind of touching systems that you cannot really simulate? If you call it the Salesforce API, you're just calling it.Stan [00:37:52]: So I think that goes back to the DNA of the companies that are very different. ADAPT, I think, was a product company with a very strong research DNA, and they were still doing research. One of their goals was building a model. And that's why they raised a large amount of money, et cetera. We are 100% deliberately a product company. We don't do research. We don't train models. We don't even run GPUs. We're using the models that exist, and we try to push the product boundary as far as possible with the existing models. So that creates an issue. Indeed, so to answer your question, when you're interacting in the real world, well, you cannot simulate, so you cannot improve the models. Even improving your instructions is complicated for a builder. The hope is that you can use models to evaluate the conversations so that you can get at least feedback and you could get contradictive information about the performance of the assistance. But if you take actual trace of interaction of humans with those agents, it is even for us humans extremely hard to decide whether it was a productive interaction or a really bad interaction. You don't know why the person left. You don't know if they left happy or not. So being extremely, extremely, extremely pragmatic here, it becomes a product issue. We have to build a product that identifies the end users to provide feedback so that as a first step, the person that is building the agent can iterate on it. As a second step, maybe later when we start training model and post-training, et cetera, we can optimize around that for each of those companies. Yeah.Alessio [00:39:17]: Do you see in the future products offering kind of like a simulation environment, the same way all SaaS now kind of offers APIs to build programmatically? Like in cybersecurity, there are a lot of companies working on building simulative environments so that then you can use agents like Red Team, but I haven't really seen that.Stan [00:39:34]: Yeah, no, me neither. That's a super interesting question. I think it's really going to depend on how much, because you need to simulate to generate data, you need to train data to train models. And the question at the end is, are we going to be training models or are we just going to be using frontier models as they are? On that question, I don't have a strong opinion. It might be the case that we'll be training models because in all of those AI first products, the model is so close to the product surface that as you get big and you want to really own your product, you're going to have to own the model as well. Owning the model doesn't mean doing the pre-training, that would be crazy. But at least having an internal post-training realignment loop, it makes a lot of sense. And so if we see many companies going towards that all the time, then there might be incentives for the SaaS's of the world to provide assistance in getting there. But at the same time, there's a tension because those SaaS, they don't want to be interacted by agents, they want the human to click on the button. Yeah, they got to sell seats. Exactly.Swyx [00:40:41]: Just a quick question on models. I'm sure you've used many, probably not just OpenAI. Would you characterize some models as better than others? Do you use any open source models? What have been the trends in models over the last two years?Stan [00:40:53]: We've seen over the past two years kind of a bit of a race in between models. And at times, it's the OpenAI model that is the best. At times, it's the Anthropic models that is the best. Our take on that is that we are agnostic and we let our users pick their model. Oh, they choose? Yeah, so when you create an assistant or an agent, you can just say, oh, I'm going to run it on GP4, GP4 Turbo, or...Swyx [00:41:16]: Don't you think for the non-technical user, that is actually an abstraction that you should take away from them?Stan [00:41:20]: We have a sane default. So we move the default to the latest model that is cool. And we have a sane default, and it's actually not very visible. In our flow to create an agent, you would have to go in advance and go pick your model. So this is something that the technical person will care about. But that's something that obviously is a bit too complicated for the...Swyx [00:41:40]: And do you care most about function calling or instruction following or something else?Stan [00:41:44]: I think we care most for function calling because you want to... There's nothing worse than a function call, including incorrect parameters or being a bit off because it just drives the whole interaction off.Swyx [00:41:56]: Yeah, so got the Berkeley function calling.Stan [00:42:00]: These days, it's funny how the comparison between GP4O and GP4 Turbo is still up in the air on function calling. I personally don't have proof, but I know many people, and I'm probably part of them, to think that GP4 Turbo is still better than GP4O on function calling. Wow. We'll see what comes out of the O1 class if it ever gets function calling. And Cloud 3.5 Summit is great as well. They kind of innovated in an interesting way, which was never quite publicized. But it's that they have that kind of chain of thought step whenever you use a Cloud model or Summit model with function calling. That chain of thought step doesn't exist when you just interact with it just for answering questions. But when you use function calling, you get that step, and it really helps getting better function calling.Swyx [00:42:43]: Yeah, we actually just recorded a podcast with the Berkeley team that runs that leaderboard this week. So they just released V3.Stan [00:42:49]: Yeah.Swyx [00:42:49]: It was V1 like two months ago, and then they V2, V3. Turbo is on top.Stan [00:42:53]: Turbo is on top. Turbo is over 4.0.Swyx [00:42:54]: And then the third place is XLAM from Salesforce, which is a large action model they've been trying to popularize.Stan [00:43:01]: Yep.Swyx [00:43:01]: O1 Mini is actually on here, I think. O1 Mini is number 11.Stan [00:43:05]: But arguably, O1 Mini has been in a line for that. Yeah.Alessio [00:43:09]: Do you use leaderboards? Do you have your own evals? I mean, this is kind of intuitive, right? Like using the older model is better. I think most people just upgrade. Yeah. What's the eval process like?Stan [00:43:19]: It's funny because I've been doing research for three years, and we have bigger stuff to cook. When you're deploying in a company, one thing where we really spike is that when we manage to activate the company, we have a crazy penetration. The highest penetration we have is 88% daily active users within the entire employee of the company. The kind of average penetration and activation we have in our current enterprise customers is something like more like 60% to 70% weekly active. So we basically have the entire company interacting with us. And when you're there, there is so many stuff that matters most than getting evals, getting the best model. Because there is so many places where you can create products or do stuff that will give you the 80% with the work you do. Whereas deciding if it's GPT-4 or GPT-4 Turbo or et cetera, you know, it'll just give you the 5% improvement. But the reality is that you want to focus on the places where you can really change the direction or change the interaction more drastically. But that's something that we'll have to do eventually because we still want to be serious people.Swyx [00:44:24]: It's funny because in some ways, the model labs are competing for you, right? You don't have to do any effort. You just switch model and then it'll grow. What are you really limited by? Is it additional sources?Stan [00:44:36]: It's not models, right?Swyx [00:44:37]: You're not really limited by quality of model.Stan [00:44:40]: Right now, we are limited by the infrastructure part, which is the ability to connect easily for users to all the data they need to do the job they want to do.Swyx [00:44:51]: Because you maintain all your own stuff.Stan [00:44:53]: You know, there are companies out thereSwyx [00:44:54]: that are starting to provide integrations as a service, right? I used to work in an integrations company. Yeah, I know.Stan [00:44:59]: It's just that there is some intricacies about how you chunk stuff and how you process information from one platform to the other. If you look at the end of the spectrum, you could think of, you could say, oh, I'm going to support AirByte and AirByte has- I used to work at AirByte.Swyx [00:45:12]: Oh, really?Stan [00:45:13]: That makes sense.Swyx [00:45:14]: They're the French founders as well.Stan [00:45:15]: I know Jean very well. I'm seeing him today. And the reality is that if you look at Notion, AirByte does the job of taking Notion and putting it in a structured way. But that's the way it is not really usable to actually make it available to models in a useful way. Because you get all the blocks, details, et cetera, which is useful for many use cases.Swyx [00:45:35]: It's also for data scientists and not for AI.Stan [00:45:38]: The reality of Notion is that sometimes you have a- so when you have a page, there's a lot of structure in it and you want to capture the structure and chunk the information in a way that respects that structure. In Notion, you have databases. Sometimes those databases are real tabular data. Sometimes those databases are full of text. You want to get the distinction and understand that this database should be considered like text information, whereas this other one is actually quantitative information. And to really get a very high quality interaction with that piece of information, I haven't found a solution that will work without us owning the connection end-to-end.Swyx [00:46:15]: That's why I don't invest in, there's Composio, there's All Hands from Graham Newbig. There's all these other companies that are like, we will do the integrations for you. You just, we have the open source community. We'll do off the shelf. But then you are so specific in your needs that you want to own it.Swyx [00:46:28]: Yeah, exactly.Stan [00:46:29]: You can talk to Michel about that.Swyx [00:46:30]: You know, he wants to put the AI in there, but you know. Yeah, I will. I will.Stan [00:46:35]: Cool. What are we missing?Alessio [00:46:36]: You know, what are like the things that are like sneakily hard that you're tackling that maybe people don't even realize they're like really hard?Stan [00:46:43]: The real parts as we kind of touch base throughout the conversation is really building the infra that works for those agents because it's a tenuous walk. It's an evergreen piece of work because you always have an extra integration that will be useful to a non-negligible set of your users. I'm super excited about is that there's so many interactions that shouldn't be conversational interactions and that could be very useful. Basically, know that we have the firehose of information of those companies and there's not going to be that many companies that capture the firehose of information. When you have the firehose of information, you can do a ton of stuff with models that are just not accelerating people, but giving them superhuman capability, even with the current model capability because you can just sift through much more information. An example is documentation repair. If I have the firehose of Slack messages and new Notion pages, if somebody says, I own that page, I want to be updated when there is a piece of information that should update that page, this is not possible. You get an email saying, oh, look at that Slack message. It says the opposite of what you have in that paragraph. Maybe you want to update or just ping that person. I think there is a lot to be explored on the product layer in terms of what it means to interact productively with those models. And that's a problem that's extremely hard and extremely exciting.Swyx [00:48:00]: One thing you keep mentioning about infra work, obviously, Dust is building that infra and serving that in a very consumer-friendly way. You always talk about infra being additional sources, additional connectors. That is very important. But I'm also interested in the vertical infra. There is an orchestrator underlying all these things where you're doing asynchronous work. For example, the simplest one is a cron job. You just schedule things. But also, for if this and that, you have to wait for something to be executed and proceed to the next task. I used to work on an orchestrator as well, Temporal.Stan [00:48:31]: We used Temporal. Oh, you used Temporal? Yeah. Oh, how was the experience?Swyx [00:48:34]: I need the NPS.Stan [00:48:36]: We're doing a self-discovery call now.Swyx [00:48:39]: But you can also complain to me because I don't work there anymore.Stan [00:48:42]: No, we love Temporal. There's some edges that are a bit rough, surprisingly rough. And you would say, why is it so complicated?Swyx [00:48:49]: It's always versioning.Stan [00:48:50]: Yeah, stuff like that. But we really love it. And we use it for exactly what you said, like managing the entire set of stuff that needs to happen so that in semi-real time, we get all the updates from Slack or Notion or GitHub into the system. And whenever we see that piece of information goes through, maybe trigger workflows to run agents because they need to provide alerts to users and stuff like that. And Temporal is great. Love it.Swyx [00:49:17]: You haven't evaluated others. You don't want to build your own. You're happy with...Stan [00:49:21]: Oh, no, we're not in the business of replacing Temporal. And Temporal is so... I mean, it is or any other competitive product. They're very general. If it's there, there's an interesting theory about buy versus build. I think in that case, when you're a high-growth company, your buy-build trade-off is very much on the side of buy. Because if you have the capability, you're just going to be saving time, you can focus on your core competency, etc. And it's funny because we're seeing, we're starting to see the post-high-growth company, post-SKF company, going back on that trade-off, interestingly. So that's the cloud news about removing Zendesk and Salesforce. Do you believe that, by the way?Alessio [00:49:56]: Yeah, I did a podcast with them.Stan [00:49:58]: Oh, yeah?Alessio [00:49:58]: It's true.Swyx [00:49:59]: No, no, I know.Stan [00:50:00]: Of course they say it's true,Swyx [00:50:00]: but also how well is it going to go?Stan [00:50:02]: So I'm not talking about deflecting the customer traffic. I'm talking about building AI on top of Salesforce and Zendesk, basically, if I understand correctly. And all of a sudden, your product surface becomes much smaller because you're interacting with an AI system that will take some actions. And so all of a sudden, you don't need the product layer anymore. And you realize that, oh, those things are just databases that I pay a hundred times the price, right? Because you're a post-SKF company and you have tech capabilities, you are incentivized to reduce your costs and you have the capability to do so. And then it makes sense to just scratch the SaaS away. So it's interesting that we might see kind of a bad time for SaaS in post-hyper-growth tech companies. So it's still a big market, but it's not that big because if you're not a tech company, you don't have the capabilities to reduce that cost. If you're a high-growth company, always going to be buying because you go faster with that. But that's an interesting new space, new category of companies that might remove some SaaS. Yeah, Alessio's firmSwyx [00:51:02]: has an interesting thesis on the future of SaaS in AI.Alessio [00:51:05]: Service as a software, we call it. It's basically like, well, the most extreme is like, why is there any software at all? You know, ideally, it's all a labor interface where you're asking somebody to do something for you, whether that's a person, an AI agent or whatnot.Stan [00:51:17]: Yeah, yeah, that's interesting. I have to ask.Swyx [00:51:19]: Are you paying for Temporal Cloud or are you self-hosting?Stan [00:51:22]: Oh, no, no, we're paying, we're paying. Oh, okay, interesting.Swyx [00:51:24]: We're paying way too much.Stan [00:51:26]: It's crazy expensive, but it makes us-Swyx [00:51:28]: That's why as a shareholder, I like to hear that. It makes us go faster,Stan [00:51:31]: so we're happy to pay.Swyx [00:51:33]: Other things in the infrastack, I just want a list for other founders to think about. Ops, API gateway, evals, you know, anything interesting there that you build or buy?Stan [00:51:41]: I mean, there's always an interesting question. We've been building a lot around the interface between models and because Dust, the original version, was an orchestration platform and we basically provide a unified interface to every model providers.Swyx [00:51:56]: That's what I call gateway.Stan [00:51:57]: That we add because Dust was that and so we continued building upon and we own it. But that's an interesting question was in you, you want to build that or buy it?Swyx [00:52:06]: Yeah, I always say light LLM is the current open source consensus.Stan [00:52:09]: Exactly, yeah. There's an interesting question there.Swyx [00:52:12]: Ops, Datadog, just tracking.Stan [00:52:14]: Oh yeah, so Datadog is an obvious... What are the mistakes that I regret? I started as pure JavaScript, not TypeScript, and I think you want to, if you're wondering, oh, I want to go fast, I'll do a little bit of JavaScript. No, don't, just start with TypeScript. I see, okay.Swyx [00:52:30]: So interesting, you are a research engineer that came out of OpenAI that bet on TypeScript.Stan [00:52:36]: Well, the reality is that if you're building a product, you're going to be doing a lot of JavaScript, right? And Next, we're using Next as an example. It's a great platform. And our internal service is actually not built in Python either, it's built in Rust.Swyx [00:52:50]: That's another fascinating choice. The Next.js story is interesting because Next.js is obviously the king of the world in JavaScript land, but recently ChachiBT just rewrote from Next.js to Remix. We are going to be having them on to talk about the big rewrite. That is like the biggest news in front-end world in a while.Stan [00:53:06]: All right, just to wrap,Alessio [00:53:07]: in 2023, you predicted the first billion dollar company with just one person running it, and you said that's basically like a sign of AGI, once we get there. And you said it had already been started. Any 2024 updates on the take?Stan [00:53:20]: That quote was probably independently invented it, but Sam Altman stole it from me eventually. But anyway, it's a good quote. So I hypothesized it was maybe already being started, but if it's a uniperson company, it would probably grow really fast, and so we should probably see it already. I guess we're going to have to wait for it a little bit. And I think it's because the dust of the world don't exist. And so you don't have that thing that lets you run those, just do anything with models. But one thing that is exciting is maybe that we're going to be able to scale a team much further than before. All generations of company might be the first billion dollar companies with engineering teams of 20 people. That would be so exciting as well. That would be so great. You know, you don't have the management hurdle, you're just 20 focused people with a lot of assistance from machines to achieve your job. That would be great. And that I believe in a bit more. Yeah.Alessio [00:54:14]: I've written a post called Maximum Enterprise Utilization, kind of like you have MFU for GPUs, but it's basically like so many people are focused on, oh, it's going to like displace jobs and whatnot. But I'm like, there's so much work that people don't do because they don't have the people. And maybe the question is that you just don't scale to that size, you know, to begin with. And maybe everybody will use Dust and Dust is only going to be 20 people and then people using Dust will be two people.Swyx [00:54:39]: So my hot take is, I actually know what vertical they'll be in. They'll be content creators and podcasters.Alessio [00:54:44]: There's already two of us, so we're a max capacity.Swyx [00:54:47]: Most people would regard Jimmy Donaldson, like Mr. Beast as a billionaire, but his team is, he's got about like 200 people. So he's not a single person company. The closer one actually is Joe Rogan, where he basically just has like a guy. Hey, Jamie, put it on the screen. But Joe, I don't think, he sold his future for 250 million to Spotify. So he's not going to hit that billionaire status. The non-consensus one, it will be the Hawkswagirl.Swyx [00:55:12]: Anyway, but like you want creators who are empowered by a bunch of agents, Dust agents to do all this stuff because then ultimately it's just the brand, the curation. What is the role of the human then? What is that one person supposed to do if you have all these agents?Stan [00:55:28]: That's a good question. I mean, I think it was, I think it was Pinterest or Dropbox founder at the time was when you're CEO, you mostly have an editorial position. You're here to say yes and no to the things you are supposed to do.Swyx [00:55:42]: Okay, so I make a daily AI newsletter where I just, it's 99% AI generated, but I serve the role as the editor. Like I write commentary. I choose between four options.Stan [00:55:53]: You decide what goes in and goes out. And ultimately, as you said, you build up your brand through those many decisions.Swyx [00:56:00]: You should pursue creators.Stan [00:56:03]: And you've made a, I think you've made a, you've have an upcoming podcast with Notebook NLM, which has been doing a crazy stuff. That is exciting.Swyx [00:56:09]: They were just in here yesterday. I'll tell you one agent that we need. If you want to pursue the creator market, the one agent that we haven't paid for is our video editor agent. So if you want, you need to, you know, wrap FFmpeg in a GPT.Alessio [00:56:24]: Awesome. This was great. Anything we missed? Any final kind of like call to action hiring? It's like, obviously people should buy the product.Stan [00:56:32]: And no, I think we didn't dive into the vertical versus horizontal approach to AI agents. We mentioned a few things. We spike at penetration and that's just awesome because we carry the tool that the entire company has and use. So we create a ton of value, but it makes our go-to-market much harder. Vertical solutions have a go-to-market that is much easier because they're like, oh, I'm going to solve the lawyer stuff. But the potential within the company after that is limited. So there's really a nice tension there. We are true believers of the horizontal approach and we'll see how that plays out. But I think it's an interesting thing to think about when as a founder or as a technical person working with agents, what do you want to solve? Do you want to solve something general or do you want to solve something specific? And it has a lot of impact on eventually what type of company you're going to build.Swyx [00:57:21]: Yeah, I'll provide you my response on that. So I've gone the other way. I've gone products over platform. And it's basically your sense on the products drives your platform development. In other words, if you're trying to be as many things to as many people as possible, we're just trying to be one thing. We build our brand in one specific niche. And in future, if we want to choose to spin off platforms for other things, we can because we have that brand. So for example, Perplexity, we went for products in search, right? But then we also have Perplexity Labs that like here's the info that we use for search and whatever.Stan [00:57:51]: The counter argument to that is that you always have lateral movement within companies, but if you're Zendesk, you're not going to be Zendesk- Serving web services.Swyx [00:58:03]: There are a few, you know, there's success stories on both sides, but there's Amazon and Amazon web services, right? And sorry by platform,Stan [00:58:08]: I don't really mean the platform as the platform platform. I mean like the product that is useful to everybody within the company. And I'll take on that is that there is so many operations within the company. Some of them have been extremely rationalized by the markets, like salespeople, like support has been extremely rationalized. And so you can probably create very powerful vertical product around that. But there is so many operations that make up a company that are specific to the company that you need a product to help people get assisted on those operations. And that's kind of the bet we have. Excellent.Alessio [00:58:40]: Awesome, man. Thanks again for the time. Thank you very much for having me.Stan [00:58:42]: It was so much fun. Yeah, great discussion.Swyx [00:58:44]: Thank you.Stan [00:58:46]: Thank you. Get full access to Latent Space at www.latent.space/subscribe
    --------  
    1:00:06

More Technology podcasts

About Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

Podcast website

Listen to Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0, What Bitcoin Did and many other podcasts from around the world with the radio.net app

Get the free radio.net app

  • Stations and podcasts to bookmark
  • Stream via Wi-Fi or Bluetooth
  • Supports Carplay & Android Auto
  • Many other app features
Radio
Social
v6.30.1 | © 2007-2024 radio.de GmbH
Generated: 12/9/2024 - 1:23:47 PM