The Data Flowcast: Mastering Apache Airflow ® for Data Engineering and AI

Astronomer
105 episodes

  • Orchestrating 2,000 Airflow pipelines at Luiza Labs with Mateus Ferreira

    28/05/2026 | 32 mins.
    Running Airflow at the scale of a national retailer means more than just scheduling. It means giving non-engineers a path to ship DAGs, and classifying thousands of runs to know which ones need attention. In this episode, Mateus Ferreira, Senior Data Engineer at Luiza Labs (the technology arm of Magazine Luiza, one of Brazil's largest retailers), joins Marc to talk about the patterns his team uses to run 2,000+ Airflow pipelines across more than four petabytes of data.

    Key Takeaways:
    00:00 Introduction
    01:11 Mateus introduces himself and Luiza Labs, the technology arm of Magazine Luiza (Magalu), one of Brazil's largest retailers (founded 1957). 1,000+ physical stores, multi-region operations, and a data team that has to handle the variability that comes with all of it.
    04:33 Lu Brain, Magalu's AI initiative built around their character Lu, and how AI fits into the data work.
    06:47 The data reliability engineering channel where AI summarizes Airflow errors with confidence scores and posts a suggested fix in chat.
    08:30 How Airflow became the heart of orchestration. Coming from Control-M in banking, then GCP, then consolidating on Cloud Composer to centralize roughly 2,000 pipelines.
    14:23 The YAML wrapper that lets non-engineers ship DAGs. Reads namespace, tables, and Spark options. Handles CDC, JDBC full, and JDBC incremental collection types with checkpoints. All changes go through data reliability engineering.
    17:20 Why metadata is the most valuable asset in the AI era, and how the wrapper makes data lineage observable across 2,000 pipelines.
    18:26 The Data Reliability Engineering team. A 10-person group that is the window to the company, handling maintenance, validation, corrections, and optimization for the business unit pipelines.
    20:09 Operating at four petabytes of data.
    21:24 Why they built custom Spark operators. Cost drove the move off the DataprocOperator. The custom operator exposes Spark driver and executor sizing as Airflow parameters and generates the Kubernetes manifest.
    24:36 The monitoring dashboard built on the Airflow metadata DB. A timeline view that shows how many DAGs run each hour, used to spread scheduling across the day.
    26:37 Classifying DAGs by their last five runs: success, partially correct, intermittent, total failure. A reusable observability pattern.
    29:57 How to reach Mateus, and a closing thought in Portuguese on appreciating the good old times while you are living them.
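
The run-classification idea from 26:37 can be sketched in a few lines of Python. This is only an illustration, not the Luiza Labs implementation: the notes name the four categories but not the rules behind them, so the thresholds below (one failure in the window means "partially correct", several failures mixed with successes means "intermittent") are assumptions, as are the function names and the example DAG ids. Run states are assumed to be plain strings, oldest first, as they might be read from the Airflow metadata DB.

```python
from collections import Counter
from typing import Mapping, Sequence

WINDOW = 5  # the episode mentions looking at the last five runs


def classify_dag(run_states: Sequence[str], window: int = WINDOW) -> str:
    """Classify a DAG from its most recent run states (oldest first)."""
    recent = list(run_states)[-window:]
    if not recent:
        return "no runs"
    # Assumption: any state other than "success" counts as a failure here.
    failures = sum(1 for state in recent if state != "success")
    if failures == 0:
        return "success"
    if failures == len(recent):
        return "total failure"
    # Assumed rule: one failure in the window is "partially correct";
    # several failures mixed with successes is "intermittent".
    return "partially correct" if failures == 1 else "intermittent"


def summarize(history: Mapping[str, Sequence[str]]) -> Counter:
    """Count how many DAGs fall into each category."""
    return Counter(classify_dag(states) for states in history.values())


# Hypothetical DAG ids and run histories, for illustration only.
history = {
    "orders_cdc": ["success"] * 5,
    "stock_jdbc_full": ["success", "failed", "success", "success", "success"],
    "prices_incremental": ["failed", "success", "failed", "success", "failed"],
    "legacy_export": ["failed"] * 5,
}
summary = summarize(history)
```

Keeping the rule in one small function makes it easy to swap in a different definition of each category once the real criteria are known.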

    Resources Mentioned:
    Apache Airflow (airflow.apache.org)
    Magalu Cloud / MGC
    Luiza Labs (luizalabs.com) and Magazine Luiza / Magalu
    Astro Observe (https://www.astronomer.io/product)
    Mateus Ferreira on LinkedIn (linkedin.com/in/mateusmferreira)

    Thanks for listening to "The Data Flowcast: Mastering Apache Airflow® for Data Engineering and AI." If you enjoyed this episode, please leave a 5-star review to help get the word out about the show. And be sure to subscribe so you never miss any of the insightful conversations.

    #AI #Automation #Airflow
  • Enhancing DAGs for Data Processing with William Orgertrice III at Cargill

    21/05/2026 | 26 mins.
    In the data engineering world, the difference between a pipeline that works and one that's truly production-ready often comes down to a handful of deliberate decisions. William Orgertrice III, Data Engineer at Cargill, joins us to share the DAG design and monitoring practices he presented at Airflow Summit 2025 and how his team is rolling out Airflow across 60+ internal teams as part of Cargill's new Minerva data platform.

    Key Takeaways:

    00:00 Introduction.
    01:45 Cargill is one of the largest privately owned companies in the US, operating across 70 countries and serving 125+ markets.
    03:45 William's team on the Cargill Data Platform supports 60+ internal teams, providing data products that drive decisions across finance, inventory and operations.
    05:10 Cargill chose Airflow as a core component of its new Minerva data platform to replace older ETL tooling with a more supportable, observable stack.
    06:26 Native SLA sensors and dependency management were specific features that made Airflow the right fit for Cargill's batch ingestion pipelines.
    09:00 Cargill is running Airflow through Astronomer as their managed solution, with some teams already in production.
    13:22 Every task in a DAG should have a single, documented purpose — one task doing everything makes troubleshooting significantly harder.
    14:40 A DAG that never enters a failed state but keeps running indefinitely will spend compute budget without alerting anyone.
    15:25 In shared Airflow environments, embedding contact information and owner tags in DAGs ensures the right team is reached when something breaks upstream.
    21:00 William flags connection testing as a friction point in pipeline development — verifying a connection string before building the full job would reduce iteration time.

    Resources Mentioned:

    Cargill | Website
    https://www.cargill.com/food-beverage

    Airflow Community on Slack
    https://airflow.apache.org/community/
  • Getting Into Data Engineering with Shrividya Hegde, Data and AI Engineer

    14/05/2026 | 27 mins.
    In this episode, we take a step back from implementation-specific topics to explore what it actually takes to build a career in data engineering — and how AI is reshaping that path.
    Shrividya Hegde, a data and AI engineer and an Airflow champion in Astronomer’s Champions program, joins us to discuss getting into data engineering, contributing to open source and why good data engineering should make AI output trustworthy rather than confidently wrong.

    Key Takeaways:

    00:00 Introduction.
    04:08 Build fundamentals before chasing trending tools — understanding what a tool does, why it exists and what problem it solves has to come first.
    07:19 Data engineering fundamentals mean SQL query performance under joins and aggregations, how data moves between pipelines, DAG failure recovery and idempotency — not just writing queries.
    08:10 The most common mistake newer data engineers make is skipping fundamentals to chase trends — it is a sequencing problem, not a talent problem.
    13:15 AI creates more opportunity for data engineers because AI output quality is directly determined by the quality of the data pipeline feeding it — confidently wrong output is harder to catch than obviously wrong output.
    15:06 Airflow's supporting operators make AI outputs production-ready — orchestration is what converts experimental AI into something reliable.
    17:14 AI-generated DAGs help newer engineers understand underlying concepts rather than just producing working code.
    23:12 The Airflow open source community is more welcoming than most people expect for a project of its size — raising issues and reviewing PRs are viable entry points for first contributions.

    Resources Mentioned:

    Shrividya Hegde
    https://www.linkedin.com/in/shrividya-hegde-shri-91562365/

    Astronomer | LinkedIn
    https://www.linkedin.com/company/astronomer/

    Astronomer | Website
    https://www.astronomer.io

    Women in Data | Website
    https://womenindata.mn.co/landing

    Apache Airflow Slack
    https://airflow.apache.org/

    Shrividya's Medium writing
    https://medium.com/@shrihegde

    Shrividya's Substack writing
    https://substack.com/@shrividyahegde
  • Orchestrating DBT With Cosmos and Airflow with Filip Kunčar at ShipMonk Product Development

    07/05/2026 | 24 mins.
    We explore how a third-party logistics platform built its entire data orchestration layer on Airflow, and what that makes possible for developer teams and merchant-facing products alike.

    Filip Kunčar, Platform Director at ShipMonk Product Development, discusses migrating from a closed source tool to Airflow, orchestrating dbt with both Cosmos and the BashOperator and using Airflow to power customer-facing data delivery.

    Key Takeaways:

    00:00 Introduction.
    01:07 ShipMonk is a third-party logistics company guaranteeing two-day delivery across the US. The data platform team's mission is to lower cognitive load for developers working with data.
    05:13 ShipMonk migrated to Airflow in 2022, moving away from a closed-source UI-based tool, driven by the need for a code-first approach, open source extensibility and broad cloud provider support.
    10:02 The team uses Cosmos for developer-facing visibility and lineage and BashOperator for internal pipelines where runtime performance matters.
    12:20 Switching from Cosmos to the BashOperator for a frequently running pipeline reduced runtime from over 15 minutes to three minutes.
    13:14 Because the full dbt chain runs inside Airflow, a configurable downstream DAG can deliver processed data directly to each merchant's preferred destination, with secrets management and SLA tracking already handled.
    15:03 Per-team alerting is hooked to each DAG by owner and severity, so teams can react to SLA breaches immediately.
    18:09 ShipMonk uses Airflow in three ways for AI: authoring DAGs faster with skills, orchestrating AI workloads in Lambda and containers and using Astronomer's skills repo to simplify Airflow version upgrades.

    Resources Mentioned:

    Filip Kunčar
    https://www.linkedin.com/in/filipkuncar/

    ShipMonk Product Development
    https://www.linkedin.com/company/shipmonk-product-development/

    ShipMonk | Website
    http://www.shipmonk.com

    Astronomer Cosmos
    http://www.astronomer.io/cosmos

    Astronomer AI Skills Repo
    http://www.github.com/astronomer/airflow-llm-providers-demo

    Datadog
    http://www.datadoghq.com
  • Building Airflow CTL with Buğra Öztürk at Mollie

    30/04/2026 | 19 mins.
    Buğra Öztürk, Senior Data Engineer at Mollie and Committer and PMC member on the Apache Airflow project, joins us to walk through Airflow CTL — what it is, how it differs from the existing Airflow CLI and where it is headed under AIP-94.

    Key Takeaways:

    00:00 Introduction.
    03:10 Buğra has contributed to Airflow since 2022, from docs changes up to Committer and PMC member — a path he hopes inspires others to start small and contribute.
    04:05 Airflow CTL solves secure user interaction by abstracting database credentials behind the public core API.
    05:13 Airflow CLI and Airflow CTL are complementary — CLI handles administration and database management while CTL handles secure user interactions via the API.
    07:08 Airflow CTL authenticates via the API, acquires a JWT token and stores it securely in the OS keyring — running on the user's machine and never requiring direct database access.
    08:21 Concrete use cases include local DAG development without the UI and CI/CD automation using headless mode with short-lived JWT tokens.
    10:08 AIP-94 describes the long-term vision — decoupling all remote commands from the Airflow CLI and routing them through Airflow CTL.
    13:12 Airflow CTL is currently at 0.X and already being used in CI and deployment automations. The move to 1.0 with full CLI parity is the next milestone under AIP-94.
    16:09 Multi-team deployment becoming generally available in a future Airflow release is Buğra's most-anticipated upcoming feature beyond Airflow CTL.

    Resources Mentioned:

    Buğra Öztürk
    https://www.linkedin.com/in/bugraozturk93/

    Mollie
    https://www.linkedin.com/company/mollie/

    Mollie | Website
    https://www.mollie.com/

    Apache Airflow CTL
    https://airflow.apache.org/

    AIP-94 discussion thread (Apache mailing list archive)
    https://lists.apache.org/thread/d2o1pr78wxdp1wozq519stp0pkcv6k6c

    Apache Airflow GitHub
    https://www.github.com/apache/airflow
About The Data Flowcast: Mastering Apache Airflow ® for Data Engineering and AI
Welcome to The Data Flowcast: Mastering Apache Airflow ® for Data Engineering and AI, the podcast where we keep you up to date with insights and ideas propelling the Airflow community forward. Join us each week as we explore the current state, future and potential of Airflow with leading thinkers in the community, and discover how best to leverage this workflow management system to meet the ever-evolving needs of data engineering and AI ecosystems. Podcast Webpage: https://www.astronomer.io/podcast/