Screaming in the Cloud

Corey Quinn

663 episodes

  • Screaming in the Cloud

    Is It Broken Everywhere or Just for Me with Omri Sass

    22/1/2026 | 31 mins.
    When your website stops working at 3 AM, you need to answer one question fast: Is it my code or is a big cloud provider having problems? Omri Sass from Datadog explains updog.ai, a tool that monitors whether major services like AWS, Cloudflare, and others are actually working. Instead of asking people to report problems, as Downdetector does, updog uses real data from thousands of computers to detect when services go down. Omri shares why this took 6 years to build, how they process massive amounts of data with machine learning, and why cloud providers have been strangely upset about these tools existing.

    About Omri: 
    Omri Sass is a Director of Product Management at Datadog, where he leads and supports a team of 25+ product managers driving initiatives across Bits AI SRE, Data Observability, Service Management, and most recently, the launch of updog.ai. Outside of work, Omri is an avid sci-fi reader, a dedicated yoga practitioner, and happily outmatched by his cat.

    Show Highlights:
    (02:12) What is Updog and How Does It Work
    (03:38) Why Knowing If It's a Global Problem Matters
    (04:01) The Problem With Testing Every Endpoint Yourself
    (05:52) How Datadog Discovered EC2 Outages From Their Own Systems
    (10:38) When AWS Regions Go Down and Cascade Failures
    (13:13) What Happens When Services Rebuild Completely
    (16:29) The Most Important Learning During a 3 AM Incident
    (20:11) Why This Took So Long to Build
    (23:40) When Datadog Going Down Isn't Critical Path
    (25:22) How They Picked Which AWS Services to Monitor
    (27:07) What Comes Next for Updog
    (30:11) Where to Find Omri and Updog

    Links: 
    Datadog: datadoghq.com
    Omri’s LinkedIn: https://www.linkedin.com/in/omri-sass-65632a14/
    Sponsored by:
    duckbillhq.com
  • Screaming in the Cloud

    Solving the 20-Year S3 File System Problem with Hunter Leath

    20/1/2026 | 31 mins.
    Hunter Leath, CEO of Archil, spent 8 years building Amazon's EFS file storage system, learning exactly why making cloud storage act like a hard drive always fails. Old programs need hard drives, but cloud storage doesn't work like hard drives—a problem that's existed for 20 years.
    Now Hunter's building Archil, which puts super-fast storage between programs and S3 so they can finally work together. Your programs think they're talking to a regular disk while your data lives safely in the cloud.
    Hunter explains how they're doing what others couldn't, why it costs less than Amazon's own solutions, and why file systems suddenly matter again in the AI era.
    Show Highlights:
    (01:37) What Archil Does and Why It Exists
    (02:26) Why Mounting S3 as a File System Has Always Failed
    (03:07) What Building EFS Taught Hunter
    (06:55) Using Fast SSDs as a Cache Layer for S3
    (09:45) Attaching Archil to Your Existing S3 Buckets
    (15:08) Why Archil Costs Less Than EBS When You Do the Math
    (17:56) What Happens If Amazon Builds This Feature
    (19:20) Competing With EBS Performance on GP3 Volumes
    (21:43) Raising $6.7 Million Without an AI Pitch
    (23:46) What Customers Get Wrong About Archil
    (28:07) Accessing Data Stored in Glacier Deep Archive
    (29:24) The Plan to Get Into the Linux Kernel 
    (30:51) Where to Find Hunter

    About Hunter Leath: 
    Hunter is the founder and CEO of Archil, which transforms S3 buckets into infinite, local file systems that provide instant access to massive data sets. Prior to Archil, Hunter spent ten years in the cloud storage industry, including 8 years building Amazon's Elastic File System product and one year on Netflix's core storage team.
    Links:
    Hunter Leath on LinkedIn: https://www.linkedin.com/in/hleath/
    Hunter Leath on X: https://x.com/jhleath/
    Archil’s Website: https://archil.com
    Sponsored by:
    duckbillhq.com
  • Screaming in the Cloud

    Building Systems That Work Even When Everything Breaks with Ben Hartshorne

    15/1/2026 | 36 mins.
    When AWS has a major outage, what actually happens behind the scenes? Ben Hartshorne, a principal engineer at Honeycomb, joins Corey Quinn to discuss a recent AWS outage and how Honeycomb kept customer data safe even when its systems couldn't fully work. Ben explains why building services that expect things to break is the only way to survive these outages. Ben also shares how Honeycomb used its own tools to cut its AWS Lambda costs in half by tracking five different things in a spreadsheet and making small changes to all of them.

    About Ben Hartshorne: 
    Ben has spent much of his career setting up monitoring systems for startups and now is thrilled to help the industry see a better way. He is always eager to find the right graph to understand a service and will look for every excuse to include a whiteboard in the discussion.
    Show Highlights:
    (02:41) Two Stories About Cost Optimization
    (04:20) Cutting Lambda Costs by 50%
    (08:01) Surviving the AWS Outage
    (09:20) Preserving Customer Data During the Outage
    (13:08) Should You Leave AWS After an Outage?
    (15:09) Multi-Region Costs 10x More
    (18:10) Vendor Dependencies
    (22:06) How LaunchDarkly's SDK Handles Outages
    (24:40) Rate Limiting Yourself
    (29:00) How Much Instrumentation Is Too Much?
    (34:28) Where to Find Ben

    Links: 
    LinkedIn: https://www.linkedin.com/in/benhartshorne/
    GitHub: https://github.com/maplebed

    Sponsored by:
    duckbillhq.com
  • Screaming in the Cloud

    Engineering Around Extreme S3 Scale with R. Tyler Croy

    13/1/2026 | 33 mins.
    R. Tyler Croy, a principal engineer at Scribd, joins Corey Quinn to explain what happens when simple tasks cost $100,000. Checking if files are damaged? $100K. Using newer S3 tools? Way too expensive. Normal solutions don't work anymore. Tyler shares how, with this much data, you can't just throw money at the problem; you have to engineer your way out.
    About R. Tyler: 
    R. Tyler Croy leads infrastructure architecture at Scribd and has been an open source developer for over 14 years. His work spans the FreeBSD, Python, Ruby, Puppet, Jenkins, and Delta Lake communities. Under his leadership, Scribd’s Infrastructure Engineering team built Delta Lake for Rust to support a wide variety of high-performance data processing systems. That experience led Tyler to develop the next big iteration of storage architecture to power the large-scale full-text compute challenges facing the organization.
    Show Highlights:
    (01:48) Scribd's 18-Year History
    (04:00) One Document Becomes Billions of Files
    (05:47) When Normal Physics Stop Working
    (08:02) Why S3 Metadata Costs Too Much
    (10:50) How AI Made Old Documents Valuable
    (13:30) From 100 Billion to 100 Million Objects
    (15:05) The Curse of Retail Pricing
    (19:17) How Data Scientists Create Growth
    (21:18) De-Normalizing Data Problems
    (25:29) Evolving Old Systems
    (27:45) Billions Added Since Summer
    (29:29) Underused S3 Features
    (31:48) Where to Find Tyler

    Links: 
    Scribd: https://tech.scribd.com
    Mastodon: https://hacky.town/@rtyler
    GitHub: https://github.com/rtyler
    Sponsored by:
    duckbillhq.com
  • Screaming in the Cloud

    Avery Pennarun on Tailscale's Evolution: From Mesh VPN to AI Security Gateway

    08/1/2026 | 44 mins.
    Corey Quinn sits down with Avery Pennarun, co-founder and CEO of Tailscale, for a deep dive into how the company is reinventing networking for the modern era. From finally making VPNs behave the way they should to tackling AI security with zero-click authentication, Avery shares candid insights on building infrastructure people actually love using, and love talking about.
    They get into everything: surviving 100% year-over-year growth, why running on two tailnets at once is pure chaos, and how Tailscale makes “secure by default” feel effortless. Plus, they dig into why FreeBSD firewalls needed some tough love, the uncomfortable truth behind POCs, and even the surprisingly useful trick of turning your Apple TV into an exit node.

    About Avery: 
    Avery Pennarun is the co-founder and CEO of Tailscale, where he’s redefining secure networking with a simple, Zero Trust approach. A veteran software engineer with experience ranging from startups to Google, he’s known for turning complex systems into approachable, user-friendly tools. His contributions to projects like wvdial, bup, and sshuttle reflect his belief that great technology should be both powerful and easy to use. With a mix of technical depth and dry humor, Avery shares insights on modern networking, internet evolution, and the realities of scaling a startup.

    Highlights:
    (00:00) Introduction to Tailscale and Security
    (00:52) Sponsorship and Personal Experiences
    (02:07) Technical Deep Dive into Tailscale
    (06:10) Challenges and Future of Tailscale
    (22:45) Building the Tailnet's API
    (23:54) Connecting Cloud Providers with Tailscale
    (25:22) Tailscale as a Security Solution
    (26:44) Innovations and Future of Tailscale
    Sponsored by:
    duckbillhq.com

About Screaming in the Cloud

Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the "why" behind how businesses are coming to think about the Cloud.