This is Fine! A podcast about resilience engineering and software

Colette Alexander and Clint Byrum

36 episodes

  • The academic and the practical in resilience engineering

    28/06/2026 | 46 mins.
    Aerobarrier is super cool if you want to check out how it works: https://glbipro.com/aerobarrier-2/

    Colette isn’t drinking alcohol; it’s a delicious herbal mocktail by Casamara Club: https://www.casamaraclub.com/products/como
    (Maybe we can get them to sponsor us someday)

    On the passive house that survived the Palisades fire: https://www.reddit.com/r/Damnthatsinteresting/comments/1hy22ui/house_designed_on_passive_house_principles/

    The Layer Aleph (https://layeraleph.com/) folks have written Crisis Engineering: https://bookshop.org/p/books/crisis-engineering-time-tested-tools-for-turning-chaos-into-clarity-marina-nitze/44736d1287a7da6e
    There will be an RISF event with them in September (https://resilienceinsoftware.org/events/272451), which means we’re hoping to get them on the pod in August, FYI!

    More about the Lund program here: https://www.humanfactors.lth.se/

    Dr. Woods’ OSU program is here: https://ise.osu.edu/human-systems-integration

    Our semi-academic, semi-practical space is the Resilience in Software Foundation Slack, where lots of folks from all ends of the spectrum talk about RE concepts and practical applications for how to solve things (and share pet pictures). You can get access to it by joining the Resilience in Software Foundation: resilienceinsoftware.org

    The program in Australia is NOT in Melbourne; it’s in Brisbane at Griffith University. You can see Sidney Dekker’s profile there (https://experts.griffith.edu.au/19027-sidney-dekker) as part of the Humanities and Social Sciences school. Drew Rae is also there, and his podcast is here: https://safetyofwork.com/

    https://www.adaptivecapacitylabs.com/ and https://www.uptimelabs.io/ are very practical approaches to a lot of these things. Blackrock 3 https://www.blackrock3.com/ does training for incident command.
  • Interviewing for Incident Analysis w/special guest John Allspaw

    14/05/2026 | 1h
    The new website is live! thisisfinepod.com

    You can find John Allspaw at Adaptive Capacity Labs: https://www.adaptivecapacitylabs.com

    Mike McGill, the skateboarder: https://en.wikipedia.org/wiki/Mike_McGill

    Annie Duke’s Thinking in Bets, referenced by our question-asker, is a great one: https://bookshop.org/p/books/thinking-in-bets-making-smarter-decisions-when-you-don-t-have-all-the-facts-annie-duke/31466984521c3d8a?ean=9780735216372&next=t

    Naturalistic Decision Making has its own association, which has a ton of resources (and a conference!) - https://naturalisticdecisionmaking.org/
    They also have a podcast! https://naturalisticdecisionmaking.org/new-podcast/

    Gary Klein is the NDM guy - https://bookshop.org/p/books/seeing-what-others-don-t-the-remarkable-ways-we-gain-insights-chief-scientist-gary-klein/c4ae5e017fe005ff?ean=9781610393829&next=t

    We contrast him and his style of approaching cognition and decision making with Kahneman and Tversky.
    Kahneman and Tversky wrote a lot, but Judgment under Uncertainty: Heuristics and Biases is probably the most famous? https://www.science.org/doi/abs/10.1126/science.185.4157.1124

    And Kahneman wrote Thinking, Fast and Slow: https://bookshop.org/p/books/thinking-fast-and-slow-daniel-kahneman-phd/83a544fe6f98df87?ean=9780606275644&next=t

    It has been zero episodes since we’ve mentioned Lisanne Bainbridge’s Ironies of Automation: https://ckrybus.com/static/papers/Bainbridge_1983_Automatica.pdf

    She also wrote Verbal Reports as Evidence of the Process Operator’s Knowledge: https://www.sciencedirect.com/science/article/abs/pii/S1071581979603075?via%3Dihub

    And the Etsy Debriefing Guide is super great: https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf

    Sidney Dekker and The Field Guide are foundational: https://bookshop.org/p/books/the-field-guide-to-understanding-human-error-sidney-dekker/3a4209dfc8b3a721?ean=9781472439055&next=t

    From Dekker’s field guide (pg 47) there is a list referencing Gary Klein’s questions for an incident investigation:

    Cues:
    What were you seeing?
    What were you focusing on?
    What were you expecting to happen?

    Interpretation:
    If you had to describe the situation to your colleague at that point, what would you have told?

    Errors:
    What mistakes (for example in interpretation) were likely at this point?

    Previous experience/knowledge:
    Were you reminded of any previous experience?
    Did this situation fit a standard scenario?
    Were you trained to deal with this situation?
    Were there any rules that applied clearly here?
    Did any other sources of knowledge suggest what to do?

    Goals:
    What were you trying to achieve?
    Were there multiple goals at the same time?
    Was there time pressure or other limitations on what you could do?

    Taking action:
    How did you judge you could influence the course of events?
    Did you discuss or mentally imagine a number of options or did you know straight away what to do?

    Outcome:
    Did the outcome fit your expectation?
    Did you have to update your assessment of the situation?

    John mentioned Uptime Labs, who do staged worlds for software incidents: https://uptimelabs.io/

    Facets of Complexity in Situated Work is here: https://www.researchgate.net/publication/345523195_Facets_of_Complexity_in_Situated_Work

    On the Jamie Zawinski quote: https://regex.info/blog/2006-09-15/247

    If you don’t know the parable of the blind men and the elephant: https://en.wikipedia.org/wiki/Blind_men_and_an_elephant
  • Paper Club: Two Years Before the Mast w/special guest eric dobbs

    04/05/2026 | 44 mins.
    Mitchell Hashimoto’s post on leaving GitHub: https://mitchellh.com/writing/ghostty-leaving-github

    The Reddit post on GitHub’s historical availability (which we find questionable): https://www.reddit.com/r/github/comments/1rnvhs9/githubs_historic_downtime_scraped_and_plotted/

    A reminder, the Messy 9 are: congestion, cascade, conflict, lag, saturation, friction, tempo, surprise, tangles

    We have sometimes loved his stuff, but Gergely is annoying us with these posts: https://newsletter.pragmaticengineer.com/p/the-pulse-is-github-still-best-for?r=78c7k&utm_medium=email

    https://x.com/GergelyOrosz/status/2048017382036082706

    You can find the RISF store with Hindsight Bias merch here: https://www.bonfire.com/store/risf/

    You can find a copy of Richard Cook’s Two Years Before the Mast at Lorin’s Blog: https://surfingcomplexity.blog/wp-content/uploads/2026/03/twoyearsbeforethemast.pdf

    A reminder, Richard Cook’s How Complex Systems Fail can be found at http://how.complexsystems.fail

    Some writing on the 1996 Annenberg conference: https://www.researchgate.net/publication/351953417_Coming_Together_The

    Folk models paper (not by Woods, by Dekker and Hollnagel), which is specifically targeting Situational Awareness as being a folk model: https://link.springer.com/article/10.1007/s10111-003-0136-9

    Some stuff about SNAFU Catchers: https://www.snafucatchers.com/
    And https://snafucatchers.github.io/

    Eric referenced our conversation with Beth Long about Building and Revising Adaptive Capacity, which she co-wrote with Richard Cook about New Relic’s real-life example of resilience engineering: https://youtu.be/A_rU4-M61Hk and https://www.sciencedirect.com/science/article/abs/pii/S0003687020301903?via%3Dihub for the paper

    Erik Hollnagel’s RAG gets referenced: https://erikhollnagel.com/onewebmedia/RAG%20Outline%20V2.pdf

    Once again, we link you to Lorin’s Law: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/

    Eric is referencing Lund, that is their Human Factors and Systems Safety program: https://www.humanfactors.lth.se/

    Check out Crisis Engineering! https://crisisengineering.layeraleph.com/crisis-engineering-the-book/

    The upcoming RISF event on Practice of Practice Gamelan: https://resilienceinsoftware.org/events/245030
  • SRECon Americas 2026 recap

    14/04/2026 | 55 mins.
    Colette’s talk at SRECon intro: https://www.usenix.org/conference/srecon26americas/presentation/alexander

    Clint’s talk at SRECon intro: https://www.usenix.org/conference/srecon26americas/presentation/byrum

    Dan Slimmon is an excellent engineer (per Clint’s shoutout) and ALSO an excellent podcast creator/host: https://techblows.net/

    Michelle Brush’s Keynote summary is here: https://www.usenix.org/conference/srecon26americas/presentation/brush

    Jevons Paradox: https://en.wikipedia.org/wiki/Jevons_paradox

    Dr. Nicole Forsgren’s talk summary: https://www.usenix.org/conference/srecon26americas/presentation/forsgren

    DORA is always worth a dive into if you haven’t taken a look yet: https://dora.dev/

    The blog post Colette mentioned comparing the AI gold rush to Mao’s Revolution: https://leehanchung.github.io/blogs/2026/04/05/the-ai-great-leap-forward/

    Many people have written about why MTTR is a bad metric to track; you can read a write-up from Adrian Hornsby here: https://newsletter.resiliumlabs.com/p/mttr-problems-better-incident-metrics

    And watch the OG, Courtney Nash, speak about it here: https://www.youtube.com/watch?v=uhCgBOHo8EY

    Beth Long’s SRE Soundbath: https://www.usenix.org/conference/srecon26americas/presentation/long

    Vanessa Huerta-Granda’s talk is summarized here: https://www.usenix.org/conference/srecon26americas/presentation/huerta-granda

    Martin Smith and Abe Hoffman’s talk is summarized here: https://www.usenix.org/conference/srecon26americas/presentation/hoffman

    Some information about Metrist: https://vault42consulting.com/about/portfolio/metrist

    AI Agents Good Bad and Ugly talk: https://www.usenix.org/conference/srecon26americas/presentation/budichenko

    The CAST talk: https://www.usenix.org/conference/srecon26americas/presentation/barroso

    Engineering a Safer World by Nancy Leveson is worth a look: https://bookshop.org/p/books/engineering-a-safer-world-systems-thinking-applied-to-safety-nancy-g-leveson/57b01ef464f9f81b?ean=9780262533690&next=t

    Erik Hollnagel wrote the book on FRAM and it has a lot of support in the safety world across industries: https://functionalresonance.com/ and https://etn-peter.eu/2021/02/11/fram-in-a-nutshell/ are good resources.

    Daria Barteneva’s closing keynote on game theory and SRE was great: https://www.usenix.org/conference/srecon26americas/presentation/barteneva

    Some good stuff on Above the Line/Below the Line, if you’re curious:
    https://queue.acm.org/detail.cfm?id=3380777

    https://www.youtube.com/watch?v=xA5U85LSk0M

    Lorin Hochstein’s closing keynote on storytelling was rad: https://www.usenix.org/conference/srecon26americas/presentation/hochstein

    SRECon EMEA 2026 (in Dublin) has their CFP up: https://www.usenix.org/conference/srecon26emea/call-for-participation

    As always, you can check out the Resilience in Software Foundation at resilienceinsoftware.org
  • The 2025 DORA Report w/special guest Fred Hebert

    12/03/2026 | 59 mins.
    You can find the 2025 DORA Report here: https://dora.dev/research/2025/dora-report/

    Read more of Fred’s work/opinions here: https://ferd.ca/

    If you want to know more about Lund’s Human Factors and Systems Safety program, you can read here: https://www.humanfactors.lth.se/

    DORA has some good writeups of generative leadership and Westrum’s model here: https://dora.dev/capabilities/generative-organizational-culture/

    We can reset the counter, it’s been 0 episodes since we mentioned Lorin’s Law: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/

    Fred writes well about the Law of Stretched Systems: https://ferd.ca/the-law-of-stretched-cognitive-systems.html

    We’re still trying to schedule a DORA event with our friends who make the report, but keep an eye out on https://resilienceinsoftware.org/events - it will pop up there when we do!
About This is Fine! A podcast about resilience engineering and software
A podcast about resilience engineering and software. Ever wondered why things on the internet break? Do you work in software and wish that you could have a Dear-Abby-like call-in show that could answer your deepest questions about how to make your workplace suck less? We're here to help!
Write us anonymously at our open question form
Email us at: thisisfine.softwarepodcast@gmail.com
Call us and leave a voicemail, or text us at: (401) 592-7574