Update Your Model's View Of The World In Real Time With Streaming Machine Learning Using River
Preamble
This is a crossover episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.
Summary
The majority of machine learning projects that you read about or work on are built around batch processes. The model is trained, and then validated, and then deployed, with each step being a discrete and isolated task. Unfortunately, the real world is rarely static, leading to concept drift and model failures. River is a framework for building streaming machine learning projects that can constantly adapt to new information. In this episode Max Halford explains how the project works, why you might (or might not) want to consider streaming ML, and how to get started building with River.
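To make the contrast with batch training concrete, here is a minimal sketch of the online learning loop in plain Python. It is not River's implementation: the `OnlineLinearRegression` class, the `stream` generator, the learning rate, and the simulated drift are all assumptions made for illustration. The `predict_one` and `learn_one` method names and the use of dictionaries for features mirror the style River uses. The model predicts each observation before learning from it, so every prediction is made on unseen data, and it recovers on its own when the relationship in the data changes.

```python
from collections import defaultdict


class OnlineLinearRegression:
    """Illustrative linear regression updated one observation at a time with SGD."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = defaultdict(float)  # one weight per feature name
        self.bias = 0.0

    def predict_one(self, x):
        # x is a dict mapping feature names to numeric values
        return self.bias + sum(self.weights[name] * value for name, value in x.items())

    def learn_one(self, x, y):
        # Single gradient step on the squared error for this one observation
        error = self.predict_one(x) - y
        for name, value in x.items():
            self.weights[name] -= self.learning_rate * error * value
        self.bias -= self.learning_rate * error


def stream():
    """Hypothetical data stream whose underlying relationship flips halfway through."""
    for i in range(2000):
        x = {"a": (i % 10) / 10}
        slope = 2.0 if i < 1000 else -1.0  # simulated concept drift at i == 1000
        yield x, slope * x["a"]


model = OnlineLinearRegression()
errors = []
for x, y in stream():
    y_pred = model.predict_one(x)  # predict first, on data the model has not seen
    errors.append(abs(y - y_pred))
    model.learn_one(x, y)          # then update the model with the true value

before_drift = sum(errors[900:1000]) / 100  # mean error once the model has settled
at_drift = max(errors[1000:1010])           # worst error just after the change
recovered = sum(errors[-100:]) / 100        # mean error after adapting

print(f"before drift: {before_drift:.4f}")
print(f"at drift:     {at_drift:.4f}")
print(f"recovered:    {recovered:.4f}")
```

A batch model trained only on the first half of this stream would keep predicting the old relationship until someone retrained and redeployed it. The online model has no separate retraining step because learning is part of the serving loop.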
Announcements
Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started!
Your host is Tobias Macey and today I’m interviewing Max Halford about River, a Python toolkit for streaming and online machine learning.
Interview
Introduction
How did you get involved in machine learning?
Can you describe what River is and the story behind it?
What is "online" machine learning?
What are the practical differences with batch ML?
Why is batch learning so predominant?
What are the cases where someone would want/need to use online or streaming ML?
The prevailing pattern for batch ML model lifecycles is to train, deploy, monitor, repeat. What does the ongoing maintenance for a streaming ML model look like?
Concept drift is typically due to a discrepancy between the data used to train a model and the actual data being observed. How does the use of online learning affect the incidence of drift?
Can you describe how the River framework is implemented?
How have the design and goals of the project changed since you started working on it?
How do the internal representations of the model differ from batch learning to allow for incremental updates to the model state?
In the documentation you note the use of Python dictionaries for state management and the flexibility offered by that choice. What are the benefits and potential pitfalls of that decision?
Can you describe the process of using River to design, implement, and validate a streaming ML model?
What are the operational requirements for deploying and serving the model once it has been developed?
What are some of the challenges that users of River might run into if they are coming from a batch learning background?
What are the most interesting, innovative, or unexpected ways that you have seen River used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on River?
When is River the wrong choice?
What do you have planned for the future of River?
Contact Info
Email
@halford_max on Twitter
MaxHalford on GitHub
Parting Question
From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
River
scikit-multiflow
Federated Machine Learning
Hogwild! Google Paper
Chip Huyen concept drift blog post
Dan Crankshaw Berkeley Clipper MLOps
Robustness Principle
NY Taxi Dataset
RiverTorch
River Public Roadmap
Beaver tool for deploying online models
Prodigy ML human in the loop labeling
The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
Sponsored By: Linode: Do you want to try out some of the tools and applications that you heard about on Podcast.\_\_init\_\_? Do you have a side project that you want to share with the world? With Linode's managed Kubernetes platform it's now even easier to get started with the latest in cloud technologies. With the combined power of the leading container orchestrator and the speed and reliability of Linode's object storage, node balancers, block storage, and dedicated CPU or GPU instances, you've got everything you need to scale up. Go to [pythonpodcast.com/linode](https://www.pythonpodcast.com/linode) today and get a $100 credit to launch a new cluster, run a server, upload some data, or... And don't forget to thank them for being a long-time supporter of Podcast.\_\_init\_\_!