Managing the Flywheel of ML Data

apply(conf) - May '22 - 25 minutes

The ML Engineer’s life has become significantly easier over the past few years, but ML projects are still too tedious and complex. Feature stores have recently emerged as an important product category within the MLOps ecosystem. They solve part of the data problem for ML by automating feature processing and serving.

But feature stores are not enough. What data teams need is a platform that automates the complete lifecycle of ML features. This platform must provide integrations with the modern DevOps and data ecosystems, including the Modern Data Stack. It should provide excellent support for advanced use cases like Recommender Systems and Fraud. And it should automate the data feedback loop, abstracting away tasks like data logging and training dataset generation. In this talk, Mike will cover his vision for the evolution of the feature store into this complete feature platform for ML.

Today I want to talk about a concept that’s very important for every machine learning application: the ML Flywheel. I think it matters for every team that’s putting machine learning into their product. Let’s start with some examples of the kind of machine learning we’re focused on. It’s the machine learning that powers recommendations on your e-commerce website; the obvious example here is Amazon. Netflix has a bunch of movie recommendations, and there’s a live production ML application powering that. In financial services, of course, you’re detecting fraud. So there’s a variety of ways in which we want to power our products with machine learning.

How do we do that? It’s really tough. That’s the reason we’re all here: to figure out how to get better at this, because it’s still really challenging. There are a number of different stages in the life cycle of these projects. Ideation is where we’re coming up with the ideas we want to build and designing experiments. Development is when we’re actually building this stuff and running experiments. Productionization is deploying this stuff, putting it into production and doing live serving with it. Operation is all about doing this on an ongoing basis. We’re not just making one set of predictions; we have to maintain this stuff in production. We have to debug it, we have to be on call, all of that stuff.

And so when we look at this, everybody at this conference knows which parts are hard. You know that productionization and operation are the places where we face some real challenges. This is tough for all ML teams, and today I want to talk about a mental framework that is really useful to keep in mind as you’re developing your ML application, to make productionization and operation a lot easier. That’s this concept of an ML Flywheel. But first, machine learning teams: who are these teams? This is you. You may be on one of two types of teams that I’m going to highlight. The first is an ML product team.

This is the team responsible for building the end ML application. Think about a fraud team, a recommendation team, a pricing team. These teams tend to have all of the personas, every type of team member, on the same team: the ML engineers, the data scientists, the product engineers. They’re reusing whatever is provided to them by the data team and the infrastructure team, so data sets and the different tools the infra team provides, but they’re ultimately delivering their recommendations to a product. So this is like a use case team. When a company has a variety of ML product teams, sometimes it builds an ML platform team. You may be on the ML platform team, and these teams are responsible for supporting the ML product teams.

Typically, the ML platform team is centralizing a lot of the engineering effort that is duplicated across these ML product teams, so it’s a lot more concentrated in engineering. They’re providing reusable components, a training system, a serving system, et cetera, for the ML product teams, who, because they have that engineering support, can then focus more on the data science and product-specific challenges. But this is a lot of stuff. I mean, we’re trying to put ML in our products, and look how much complexity there is here. There are multiple teams, they have to collaborate, and you have to hire different people for different responsibilities on each team. All of this is just in service of the product.

What does the product actually need? If you’re the e-commerce web team and you’re just trying to show predictions, what do you care about? You’re just trying to get a decision. All of this machine learning is an implementation detail for the end consuming team; they really just want this decision. So they’re looking to get a decision from the ML teams, and you can think of the ML teams as implementing a reliable, repeatable and accurate decision API for the product teams. I think this is a good way to think about it. It helps you keep in mind what you’re ultimately providing to the product. And maybe it’s not “get decision.”

Maybe it’s “get recommendation,” “get price,” or “get fraud score,” depending on your use case. But ultimately you’re providing one signal, one decision, to the product. How is this different from other machine learning use cases? Well, we’re focusing on ML-infused products, and that means we’re making these decisions repeatedly. This is not analyzing a document one time or doing a one-off scoring type of thing. For every user that comes to the website, we need to score them, and we need to score them today, tomorrow and forever. So how do we reliably and repeatedly make accurate decisions? That’s the key challenge here, and that’s what all of the MLOps stuff is about: doing this reliably and repeatedly and keeping it accurate. And we do this every day as humans, not through machines.

So we do this through a typical feedback loop. We do something in the real world, we observe how it went. So we make a decision, we act, we observe how it went. We combine that information with all the other information we know. We update our learnings we update our model and then we make another decision. And so as a quick example, imagine you’re a career basketball player, you don’t care just about your performance on one game, you’re optimizing for, “Can I get better and be a more valuable basketball player over time?” Well, you play one game, you make some shots, you miss some shots after the game, you go and study the performance in the game. You’re watching the tapes, you’re combining that with all of your other knowledge, you’re learning, practicing and then you’re playing your next game and you’re getting better.

So great decisions that are repeated require this feedback loop, and every ML application we’re talking about is building this too. This is something you have in your ML application. You have the product, you’re making some decisions, and you’re delivering some predictions to the product, say a recommendation. But then somebody in your company, and it may not be you, has a variety of tooling to log and monitor what actually happened with the customer. Did they click on the item you recommended? Then someone is organizing all of that data and all of those logs and turning them into model data sets. And then on the ML side, we start extracting features, building new models from them, and updating our model.

And then with that model and those features, we’re doing all kinds of online computation to deliver those predictions in real time, update the product, and make new decisions and new predictions. This loop is not an instant loop; it can take a long time. But every application has this loop. That’s how training data sets get generated, and that’s how predictions get made and delivered to the product. So it’s my claim, and this is the whole point of this talk, that this feedback loop is extremely important, and that it’s very important for you as a machine learning practitioner to be aware of all parts of this loop when you’re building your ML application. There are four core parts: collecting the data, organizing the data, learning from it, and then deciding. Great machine learning applications require a great ML Flywheel.
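To make the four stages concrete, here is a minimal sketch of the loop in Python. It is not from the talk: the `Flywheel` class, its method names and the click-rate "model" are all hypothetical stand-ins, and a real system would spread these stages across separate tools and teams.

```python
from dataclasses import dataclass, field


@dataclass
class Flywheel:
    """Toy feedback loop: decide -> collect -> organize -> learn (all names hypothetical)."""
    click_rate: dict = field(default_factory=dict)  # the "model": item -> estimated click rate
    raw_logs: list = field(default_factory=list)    # output of the collect stage

    def decide(self, items):
        # Decide: recommend the item with the highest estimated click rate.
        return max(items, key=lambda item: self.click_rate.get(item, 0.0))

    def collect(self, item, clicked):
        # Collect: log what was shown and what the customer did.
        self.raw_logs.append({"item": item, "clicked": clicked})

    def organize(self):
        # Organize: turn raw logs into a per-item dataset of outcomes.
        dataset = {}
        for row in self.raw_logs:
            dataset.setdefault(row["item"], []).append(1 if row["clicked"] else 0)
        return dataset

    def learn(self, dataset):
        # Learn: re-estimate the click rate for every item seen so far.
        self.click_rate = {item: sum(ys) / len(ys) for item, ys in dataset.items()}


wheel = Flywheel()
wheel.collect("book", clicked=False)
wheel.collect("lamp", clicked=True)
wheel.learn(wheel.organize())
next_pick = wheel.decide(["book", "lamp"])
```

Each pass through `collect`, `organize` and `learn` changes what `decide` returns next time, which is the compounding effect the Flywheel is meant to capture.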

We see this when we talk to teams that are very successful with machine learning. They are very intentional about this ML Flywheel. They build tools to design and manage it, and they think a lot more about the whole end-to-end loop than about individual parts of it. What we find is that teams that are not so successful with ML struggle with this Flywheel. They may not even be thinking about the bottom half of the Flywheel in the first place. Or they may not realize how much complexity there is in the decide stage, and they may be ignoring key infrastructure elements that should be considered part of their ML application but are not thought of as first-class ML systems.

So this concept of the ML Flywheel is very important and let’s talk about why that is. Teams that have their ML Flywheel working, ML just feels natural and easy for them. So in that world, you have very clear ownership across the whole ML life cycle. The whole ML application becomes much more debuggable and reliable. And because you know who owns each part of the Flywheel it becomes much easier to move fast, it becomes easier to make decisions, to make changes and understand the implications of all of these decisions. And then this allows your iteration speed to be much higher and then that means that your models are getting better and more accurate much quicker. There’s a compounding effect from the Flywheel. Every iteration, the system’s getting better and so in machine learning, we’re trying to get as many iterations in as possible so we can compound to the best performance.

When it’s not working, it’s very painful. Most people are in some in-between state, but most people have many elements of the not-working model that I’m showing here. Things go slower and they break more, basically. The iteration speed is much slower, and because there are unclear owners for each stage, especially in the bottom half of the loop, it’s really hard to make changes. Small changes become really big tasks: “I want to add a new feature to my model. Well, that data’s not available in my data set, so I need this person over here to log some new data from the product, but I don’t actually know who that person is. So that’s a whole bunch of coordinating across a number of unknown people. That’s just a big project for me, and I might not do it.”

When there are these unclear dependencies, things break a lot more, and it’s really hard to promise a certain level of reliability. That makes it really hard to understand the changes you’re making and their impact. So we want to have really strong ML Flywheels, but why is this hard? Well, there’s a ton of tools, and everyone has a different stack. Not everyone is using the same five tools or the same 10 tools here. So what’s really needed is a way to make it easy for teams to organize and orchestrate these tools so they work well together, while individual teams can still go really deep on individual tools.

But the challenge for the ML Flywheel is having them all work together coherently to support a self-improving, reliable and repeatable ML application. What we really need is a way for all the stakeholders, the data scientists and the machine learning engineers, to build and orchestrate their ML and data tools into coherent ML Flywheels very easily. This requires simple abstractions to define these Flywheels and manage the infrastructure and the corresponding artifacts: the data sets, the metrics, the models, all of that. And it has to fit naturally with the best-practice workflows we already have, so it should fit into the DevOps workflows for the engineers and into the data scientists’ workflows as well. So now let’s look at a specific dimension of these Flywheels.

Let’s look at the data side of things, which is like the blood flowing through the Flywheel. Data flows are at the core of this Flywheel, so let’s look at some data sets. I have the product, we’re making predictions in the product, and then we’re going to log a variety of things. We’re logging some metrics, which predictions happened, which features were served, and which outcomes we observed, so some labels. Then we’re joining all of those data sets together. This is the organize stage, where we join them into feature logs and label data sets. Then when we’re building our models, we’re potentially extracting new features and generating new training data sets. And online, at the decide stage, when we’re making predictions, there’s a variety of different data sets we need. We could be generating a variety of candidates, extracting features for all of them, scoring them and turning them into predictions.
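As a rough illustration of the organize stage, the sketch below joins logged features with observed outcomes to produce training rows. The record layout, the `prediction_id` join key and the function name are assumptions made for this example; in practice the join would usually run in the data warehouse rather than in application code.

```python
# Hypothetical log records; in practice these would live in the data warehouse.
feature_logs = [
    {"prediction_id": "p1", "features": {"views_7d": 3}, "score": 0.8},
    {"prediction_id": "p2", "features": {"views_7d": 0}, "score": 0.1},
    {"prediction_id": "p3", "features": {"views_7d": 5}, "score": 0.6},
]
outcome_logs = [
    {"prediction_id": "p1", "clicked": True},
    {"prediction_id": "p2", "clicked": False},
]


def build_training_rows(feature_logs, outcome_logs):
    """Join served features with observed outcomes on prediction_id."""
    labels = {row["prediction_id"]: int(row["clicked"]) for row in outcome_logs}
    rows = []
    for log in feature_logs:
        label = labels.get(log["prediction_id"])
        if label is None:
            continue  # no outcome observed yet, so there is no label to train on
        rows.append({**log["features"], "label": label})
    return rows


training_rows = build_training_rows(feature_logs, outcome_logs)
```

Logging the features exactly as they were served, rather than recomputing them later, keeps the training rows consistent with what the model actually saw at prediction time.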

So there are a lot of data sets that need to be managed throughout the Flywheel, and this is pretty tricky to do. How do we do that? Well, feature platforms have had a really big impact on part of the Flywheel. So far, they’ve made it pretty easy to declaratively define the data flows that happen in the learn and decide stages and to manage all the infrastructure and data sets throughout those stages. Let me do a little bit of an aside and show what Tecton does, to make that more concrete. Tecton makes it really easy for someone to provide a simple declarative feature definition. On the left side, this is a feature definition, and it’s just a simple Python file where you define some metadata.

So you say some things like, “Hey, this should serve online, and this should also be available in our offline feature store.” You can say who the owner is and provide a variety of other metadata. And then you provide your feature definition, your feature transformation. What Tecton does is take that Python file and convert it to a physical implementation. It’s spinning up the right infrastructure, setting up the right data schemas on all of it, and connecting it all together. It’s orchestrating the data flows on that infrastructure and backfilling your features into your offline feature store so they can be pulled for training data sets. It’s loading the most recent features into the online feature store so they can be served to your model at scale in real time. So this is managing the infrastructure and the artifacts end to end, and that’s had a really big impact on that top half of the loop.
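The slide with the actual feature definition is not part of this transcript, so the following is only a sketch of the general idea of a declarative definition being turned into infrastructure work. It does not use Tecton’s real API: `FeatureDefinition`, `plan` and every field name are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical declarative spec; this is not Tecton's real API."""
    name: str
    owner: str
    online: bool    # serve from the online store
    offline: bool   # materialize to the offline store for training
    transform: Callable[[list], float]


def plan(defn: FeatureDefinition) -> list:
    """Turn the declaration into the list of infrastructure tasks it implies."""
    tasks = [f"register schema for {defn.name}"]
    if defn.offline:
        tasks.append(f"backfill {defn.name} into offline store")
    if defn.online:
        tasks.append(f"load latest {defn.name} into online store")
    return tasks


purchase_total = FeatureDefinition(
    name="user_purchase_total_7d",
    owner="fraud-team@example.com",
    online=True,
    offline=True,
    transform=lambda amounts: float(sum(amounts)),
)

tasks = plan(purchase_total)
```

The point of the pattern is that the author only states what the feature is and where it should be available; the platform derives the backfills, schemas and serving setup from that statement.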

And this is the data management that we’re looking for across the whole loop. So how do we bring these data management patterns to the whole ML Flywheel? How do we extend that capability to everything? There are a few things you should be thinking about when you’re coming from the top half of the Flywheel and you want to make sure you have a solid story for the whole Flywheel. The first thing is simple: close the loop. That means bringing the bottom half of the loop into your purview. It doesn’t mean you have to own all of these things, but know who these people are. Who is the person in your organization responsible for logging what happens in the product and delivering that data into the data warehouse?

Ultimately, you want some log observation API that makes it really easy to add a new label you want to log, a new feature, or some data that could power a feature. We want to make it really easy to capture observed events. And then we want the entire loop to be implemented on top of your cloud data platform, so you don’t have to spin up a whole new data stack to support this. This should all merge in with what your data team is doing today. They’re probably using a data lakehouse, a data warehouse or a data lake in some way, and all of this should be implemented on top of that existing data stack. Second is having a unified data model throughout the whole ML Flywheel. What we want is this: the user takes some action on the website.

They click on the recommendation, let’s say, and they click on the ad and then we want that to update all of the relevant data sets within the Flywheel pretty quickly. So we want that to update our features, our fraud features, let’s say, features that are being used for predictions right now. We also want that ad click, maybe that’s something that we’re predicting so we want it to update our labels data set and then update a training dataset. So there’s this element of freshness that we want to have and we want to have that propagate through all the data sets. And secondly, we need to establish these coherent compatible data schemas in all elements of the infrastructure throughout the loop and so that’s something that we need to handle as well. And then finally, not every use case is the exact same and we want to support use case specific architectures.
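One way to picture a log observation API combined with a unified data model is a single call that records the event and updates every data set that depends on it. The sketch below is an assumption-heavy toy: `ObservationLog`, `log_observation` and the event types are invented for illustration, and a real implementation would write to streaming and warehouse infrastructure instead of in-memory structures.

```python
from collections import defaultdict


class ObservationLog:
    """Hypothetical log-observation API: one event updates every dependent data set."""

    def __init__(self):
        self.events = []                      # raw event log
        self.click_counts = defaultdict(int)  # fresh feature: ad clicks per user
        self.labels = {}                      # label per prediction_id

    def log_observation(self, user_id, prediction_id, event_type):
        event = {"user_id": user_id, "prediction_id": prediction_id, "type": event_type}
        self.events.append(event)
        if event_type == "ad_click":
            self.click_counts[user_id] += 1   # feature used by live predictions
            self.labels[prediction_id] = 1    # label used by the next training set
        return event


log = ObservationLog()
log.log_observation("u1", "p1", "ad_click")
log.log_observation("u1", "p2", "ad_impression")
```

Because features and labels are derived from the same event in the same place, their schemas stay compatible and both stay equally fresh.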

So there’s no one size fits all, and not every use case needs the same underlying infrastructure. A batch lead scoring model needs something very different from what a real-time recommendation system is going to need. I’ll just show you an example of some of the complexity here. This is from Eugene Yan’s blog post, a really great blog post about discovery and recommendations. When you’re making real-time recommendations or doing ranking, it’s not just a simple model service that takes in some features and makes a prediction. There are a number of steps that the machine learning application is doing online. It’s taking in a query and generating some candidates, then filtering those candidates, then generating features for each of them, then scoring each candidate, and then doing the ranking based on that.
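The online steps listed above can be sketched as a single function. This is a toy version under stated assumptions: the catalog layout, the two features and the hand-written scoring rule are all made up for the example and stand in for real retrieval systems and a trained model.

```python
def recommend(query, catalog, seen, k=2):
    """Toy online path: candidates -> filter -> features -> score -> rank."""
    # 1. Candidate generation: items whose category matches the query.
    candidates = [item for item in catalog if item["category"] == query]
    # 2. Filtering: drop items the user has already seen.
    candidates = [item for item in candidates if item["id"] not in seen]
    # 3. Feature generation for each candidate (hypothetical features).
    featured = [
        (item, {"popularity": item["sales"], "cheap": item["price"] < 20})
        for item in candidates
    ]
    # 4. Scoring: a hand-written stand-in for a trained model.
    scored = [
        (item["id"], feats["popularity"] + (10 if feats["cheap"] else 0))
        for item, feats in featured
    ]
    # 5. Ranking: highest score first, keep the top k.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in scored[:k]]


catalog = [
    {"id": "a", "category": "books", "sales": 50, "price": 30},
    {"id": "b", "category": "books", "sales": 45, "price": 10},
    {"id": "c", "category": "books", "sales": 5, "price": 15},
    {"id": "d", "category": "games", "sales": 90, "price": 60},
]
top = recommend("books", catalog, seen={"c"})
```

Each numbered step typically maps to its own piece of online infrastructure, which is why a recommendation use case needs a different architecture from a batch scoring job.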

So there’s a lot more infrastructure online in that type of use case, and different use cases need different infrastructure here. The point is to make sure that your ML Flywheel includes the use-case-specific architecture and infrastructure that you need as well. If you don’t, you risk either using the wrong infrastructure, which will be inefficient and expensive for you, or having a Flywheel that is not complete. This is something we’re focused on at Tecton. We’re focused on helping teams build their ML Flywheels and on building unified management of the data life cycle for ML applications. This means simple declarative definitions become managed infrastructure and managed data sets throughout the Flywheel. We have use-case-optimized architectures. All of it is implemented on your data stack, and the point is for it to fit really well into your team’s workflows, so your DevOps workflows and your data science workflows.

But overall, you should be thinking about how to optimize the Flywheel for your use case. Why is this important? Well, for platform teams, what does this get you? It makes it much easier for you to support more use case teams, more ML product teams, with fewer resources, and that’s what ML platform teams are trying to do. They’re just trying to keep up with demand and support their customers. For the ML product teams, it lets them iterate, build and deploy faster. There’s a lot less glue code, which makes for fewer mistakes. There’s a lot less maintenance, so things become a lot more reliable, and you also need fewer people per project. It’s just easier to do this work because it’s much easier to have purview over the whole Flywheel.

And it gets you much closer to that mythical decision API I was talking about at first, which is what everybody wants to get to eventually. This is not just for the recommendation use cases I mentioned; it’s my claim that every ML use case has to be thinking about this. Whether it’s fraud detection, pricing, recommendations, and we also deal with gaming use cases, security, enterprise SaaS, search, ranking, a variety of things, they’re all implementing this Flywheel in one way or another. So I recommend that you think about the Flywheel and figure out how it maps to your world as well. We’re always around to help out. And then just a quick plug, since I mentioned Tecton. Tecton is an enterprise feature platform. We are a developer platform that makes it easy to manage the data sets and run the feature transformations and all of the data flows throughout the ML application. And Feast is a completely open source production feature store. You can get started with that right now. Check out feast.dev. Okay, thank you.

Mike Del Balso

CEO & Co-founder

Tecton

Mike Del Balso is the co-founder of Tecton, where he is focused on building next-generation data infrastructure for Operational ML. Before Tecton, Mike was the PM lead for the Uber Michelangelo ML platform. He was also a product manager at Google where he managed the core ML systems that power Google’s Search Ads business. Previous to that, he worked on Google Maps. He holds a BSc in Electrical and Computer Engineering summa cum laude from the University of Toronto.
