Staff Machine Learning Engineer
apply(conf) - May '22 - 10 minutes
All ML teams need to be able to translate offline gains to online performance. Deploying ML models to production is hard. Making sure that those models stay fresh and performant can be even harder. In this talk, we will cover the value of regularly redeploying models, and the failure modes of not doing so. We will discuss approaches to make ML deployment easier, faster, and safer, which allowed our team to spend more time improving models and less time shipping them.
Yeah, today I wanted to talk about something that's often forgotten in the art of operationalizing machine learning, and that's making model deployment fast, safe, and regular. This is work that we did over the past year in my team at Stripe, so I just wanted to quickly run through why deploying models often is valuable, the skills needed to do so, and a few tips that we learned along the way.
All right, so we'll break it down into three parts. One, there are a few things that are obvious about why you would want to redeploy models regularly and some that are less obvious, so we'll talk about those. Two, I think it's interesting, because the field of ML, as most of you at this conference probably know, just keeps eating other fields. It used to be just statistics, and then modeling, and then data engineering, and then ops, so I'll talk about that a little bit. Then, last but not least, a few lessons learned from our journey of improving our process of releasing models. It's definitely not the full answer to the question of how you make model releases amazing, but there are a few pointers.
All right, so what's the value of regularly deploying ML models? Traditionally, I would say, at least as a data scientist, you often think, "Okay, I'm going to build this model, and then eventually, I'm going to ship it." Those are the large tasks, both the modeling and all the engineering required to ship it. What often gets left by the wayside is: okay, but what happens when we need to update the model? What happens when we have new features? What happens when we notice some drift? How do we correct it?
There are basically three or four big reasons, I would say, that you need to regularly redeploy your ML models. The most obvious one is drift. For many use cases, you're learning something about the world that keeps changing. An example is what my team does at Stripe, which is fraud prevention, where people that commit fraud are always looking for new ways to commit fraud. In those cases, any model that you train today will be obsolete tomorrow. The only question is by how much. Here's an example on the right, a toy example of course: if you don't retrain, you're just slowly degrading in performance once you release your models, but if you retrain regularly, you can recapture that level of performance.
We talk a little bit about it on our blog, which I'll link at the end, but for us, we've realized that for fraud prevention, for example, one month of staleness translates to roughly a half percent performance drop. So every month that we don't release a model, we just lose half a percent of performance, and conversely, if you just rerelease a model a month later, you get a half percent performance gain with basically no effort. That's one reason.
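To make the arithmetic concrete, here is a minimal sketch of what that staleness figure implies for different retraining cadences. It assumes the drop is linear at 0.5 percent per month and that each retrain fully restores performance; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def staleness_loss(months_stale: float, drop_per_month: float = 0.5) -> float:
    """Performance lost (in percent) after `months_stale` months without a
    release, ASSUMING a linear drop of `drop_per_month` percent per month."""
    if months_stale < 0:
        raise ValueError("months_stale must be non-negative")
    return months_stale * drop_per_month


def average_loss(retrain_interval_months: float, drop_per_month: float = 0.5) -> float:
    """Average loss over one retraining cycle.

    Under the linear assumption, loss grows from 0 to the peak just before
    the next release, so the average over the cycle is half of the peak.
    """
    return staleness_loss(retrain_interval_months, drop_per_month) / 2


if __name__ == "__main__":
    for interval in (1, 3, 6):
        print(
            f"retrain every {interval} month(s): "
            f"peak loss {staleness_loss(interval)}%, "
            f"average loss {average_loss(interval)}%"
        )
```

Under these assumptions, moving from a six-month cadence to a monthly one cuts the peak loss from 3 percent to 0.5 percent.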
The other one is similar but more subtle. Even if it's not a slow drift of the world shifting under your feet, companies change and your users change, often in subtle ways. If you launch in a whole new country, you'll think, "Oh, well I guess I should retrain our fraud models or our customer prediction models," or any other models. But if one of your users slowly started onboarding more people, or expanded to a different geo, that kind of subtle change can cause your models to slowly become out of date, until eventually they were trained on a completely different distribution from what you see at inference today. Here, I have countries as an example.
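One way to notice this kind of shift is to compare the mix of a categorical feature, such as country, between the training set and recent inference traffic. The sketch below uses total variation distance as the comparison; that metric, the function names, and the sample data are assumptions chosen for illustration, not something described in the talk.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def category_shares(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return the fraction of observations that fall in each category."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one observation")
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(train_values, live_values) -> float:
    """Distance between two categorical distributions.

    0.0 means the same mix; 1.0 means no overlap at all.
    """
    train = category_shares(train_values)
    live = category_shares(live_values)
    categories = set(train) | set(live)
    return 0.5 * sum(
        abs(train.get(c, 0.0) - live.get(c, 0.0)) for c in categories
    )


if __name__ == "__main__":
    # Hypothetical data: the training set has no traffic from "BR" at all.
    training_countries = ["US"] * 80 + ["GB"] * 20
    recent_countries = ["US"] * 50 + ["GB"] * 20 + ["BR"] * 30
    print(total_variation_distance(training_countries, recent_countries))
```

A check like this could run on a schedule, with a threshold that triggers a retrain; the right threshold depends on the use case.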
And finally, I think the last reason is the bottleneck. We realized that we had optimized our experimentation enough that we could make a simple change to our features or our data filtering and see, "Oh, that's great, it gives us a 2% boost in performance," let's say. But then shipping it was weeks and weeks of work. So even though we had optimized one of the iteration loops that you care about as an ML team, the other one was so much bigger that we chose to focus on that one. If you can experiment quickly, that's great, but then you have to be able to ship quickly. And the sub-part to that is, if the model you're working on is important, you should be able to fix it really quickly if it breaks, so deploying models should be something that you're able to do quickly and safely.
What skill sets do you need to regularly deploy models? We already said that to be, I guess, a 10X ML engineer, you need to know stats and modeling, and be a really good software engineer, and be a good data engineer, so what else are we adding onto the pile? Well, this is my mental model of it. Starting from the left, to get a good model, you need about 90% data generation and cleaning and 10% modeling. Then, moving on to the middle, if you want to ship this model, getting the good model was only 10% of the work. The 90% that remains relies on software engineering best practices.
But once you've shipped the model, if you want to be able to reliably serve it and update it, then all of the work that we're going to talk about is mostly on the operational side. This is basically getting to a level where you excel at running your training, and evaluation [inaudible] plans, and your deployment pipeline, so it's very operational work, but it has huge value.
As I said, this is a fuzzy problem. It's clearly worth solving, but I don't think there are clear best practices for it, so I'll share three or four tips, depending on how you count, that helped us along the route to tripling the speed at which we release models and capturing the gains I mentioned, so that our models are now always fresh.
One is, usually when people think about model training, they think about this box at the top-right, you know? They think, "Okay, well I trained my model." But really, if you think about the process of producing a model, it is much more complicated, and in fact almost cyclical. First you generate data, you filter it, you generate your features, you generate your training sets and your evaluation sets, you train your model, and you evaluate it on those sets. You usually evaluate it for more than just "is loss going down." You evaluate it for business metrics, so you might have particular datasets that you want to measure performance on.
Then you deploy it, then you monitor it, and then usually, you measure your model's performance in production, and very often, that will feed into the data that you use to train your models, right? Maybe you'll notice that some examples are particularly hard or particularly easy for your current model and you want to select them. So when we say reliably train models, really you should be able to reliably do all of this, and do it regularly.
Our solution for this is to automate almost all of it. What you want here is for every single one of these tasks to be a job that can be automated and that you can kick off, so that all of this runs without a single human in the loop, you know? In our particular use case, we actually keep a human in the loop at two spots. One is deployment, as a way to sanity check, "Yes, this looks good. Let's deploy this new model." The other is later on, to sanity check that we are indeed happy with production performance and make sure that we're not missing something. But essentially, the first step in being able to retrain and redeploy models is for that whole process to be automated, rather than having, let's say, a data scientist who does the feature generation themselves, goes into a notebook, and codes a thing as a one-off job.
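As a rough illustration of that structure, here is a minimal sketch of a pipeline where every stage is a job and two stages are gated by a human approval. The step names, the `Step` class, and the `approve` callback are all hypothetical; the talk does not describe the actual tooling, and a real setup would typically use a workflow orchestrator rather than a loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]   # takes the pipeline context, returns the updated one
    needs_approval: bool = False  # True for the human-in-the-loop checkpoints


def run_pipeline(
    steps: List[Step],
    approve: Callable[[str, dict], bool],
    context: Optional[dict] = None,
) -> Dict[str, object]:
    """Run steps in order; stop before any gated step that is not approved."""
    context = dict(context or {})
    completed: List[str] = []
    for step in steps:
        if step.needs_approval and not approve(step.name, context):
            return {"status": "halted", "halted_at": step.name,
                    "completed": completed, "context": context}
        context = step.run(context)
        completed.append(step.name)
    return {"status": "done", "halted_at": None,
            "completed": completed, "context": context}


def mark(key: str) -> Callable[[dict], dict]:
    """Build a placeholder job that only records that it ran."""
    def job(context: dict) -> dict:
        return {**context, key: True}
    return job


# Placeholder jobs standing in for the stages listed in the talk.
STEPS = [
    Step("generate_data", mark("data")),
    Step("build_features", mark("features")),
    Step("build_train_and_eval_sets", mark("datasets")),
    Step("train", mark("model")),
    Step("evaluate", mark("metrics")),
    Step("deploy", mark("deployed"), needs_approval=True),
    Step("monitor", mark("monitored")),
    Step("confirm_production_performance", mark("confirmed"), needs_approval=True),
]

if __name__ == "__main__":
    result = run_pipeline(STEPS, approve=lambda name, ctx: name != "deploy")
    print(result["status"], result["halted_at"], result["completed"])
```

Keeping the approval as a callback means the same pipeline definition can run fully unattended in a test environment and with real sign-off in production.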
There's one more trick that we use to de-risk this automated pipeline, because once it's automated, you might be concerned about what could go wrong, and that's leveraging shadow mode. There are a few blog posts that talk about it, but essentially, the idea of shadow mode is that we deploy not only our production model to our production environment, but also the models that we're thinking about deploying, in something called shadow. These models make predictions just like the production model. We just store these predictions for later. That allows us to really test models at the end of that pipeline, going back here. Before we deploy a model, we deploy it to shadow for a little bit, observe how it behaves in production, and verify that we're happy with it.
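A minimal sketch of the shadow-mode idea, assuming models are plain callables that map a feature dictionary to a score. The `ShadowScorer` class and the in-memory log are hypothetical stand-ins; a real system would write shadow predictions to durable storage and likely score them off the request path.

```python
from typing import Any, Callable, Dict, List

Model = Callable[[Dict[str, Any]], float]


class ShadowScorer:
    """Serve the production model; score candidates on the side and log them."""

    def __init__(self, production_model: Model, shadow_models: Dict[str, Model]):
        self.production_model = production_model
        self.shadow_models = shadow_models
        self.shadow_log: List[Dict[str, Any]] = []  # stand-in for durable storage

    def predict(self, request_id: str, features: Dict[str, Any]) -> float:
        # Only the production score is ever returned to the caller.
        production_score = self.production_model(features)
        for name, model in self.shadow_models.items():
            record: Dict[str, Any] = {
                "request_id": request_id,
                "model": name,
                "production_score": production_score,
            }
            try:
                record["shadow_score"] = model(features)
            except Exception as exc:
                # A failing candidate must never affect the production response.
                record["error"] = repr(exc)
            self.shadow_log.append(record)
        return production_score


if __name__ == "__main__":
    scorer = ShadowScorer(
        production_model=lambda f: 0.2,
        shadow_models={"candidate": lambda f: 0.7},
    )
    print(scorer.predict("req-1", {"amount": 120}))
    print(scorer.shadow_log)
```

Logging the production score next to each shadow score makes it straightforward to compare the two models on identical traffic later.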
In summary, once you have this whole pipeline, once you have shadow scores, then you just schedule it. Putting it all together, the thing that's most useful here, I think, is that you can have this pipeline that creates models, starting from data generation, and does it automatically for you, let's say every week. Week one, it'll run. Then the second week, maybe you'll hit a bug. Maybe some other team changed some code, and all of a sudden you can't produce models anymore. You'll immediately know, and you'll know to fix your pipeline. That way you can guarantee that you can continuously produce models, and that at any point you're ready to redeploy your models, to fix them if they break, or to just ship your newest changes to production.
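The scheduling idea can be sketched as a small wrapper that runs the pipeline when a week has passed and raises an alert the moment a run fails. The `run_if_due` function, the `alert` callback, and the weekly default are assumptions for illustration; in practice a scheduler or orchestrator would own this logic.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional


def run_if_due(
    pipeline: Callable[[], None],
    last_success: Optional[datetime],
    now: datetime,
    interval: timedelta = timedelta(weeks=1),
    alert: Callable[[str], None] = print,
) -> Optional[datetime]:
    """Run `pipeline` if `interval` has elapsed since the last successful run.

    Returns the timestamp of the most recent successful run, which is `now`
    on success and the unchanged `last_success` otherwise.
    """
    if last_success is not None and now - last_success < interval:
        return last_success  # not due yet
    try:
        pipeline()
    except Exception as exc:
        # Surface the breakage right away so the pipeline gets fixed
        # before a fresh model is urgently needed.
        alert(f"model pipeline failed at {now.isoformat()}: {exc!r}")
        return last_success
    return now


if __name__ == "__main__":
    def broken_pipeline() -> None:
        raise RuntimeError("upstream schema changed")

    week_one = datetime(2022, 5, 2)
    last = run_if_due(lambda: None, None, week_one)
    last = run_if_due(broken_pipeline, last, week_one + timedelta(weeks=1))
    print("last successful run:", last)
```

Because a failed run leaves `last_success` unchanged, the next invocation retries instead of waiting another full week.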
That's most of it. As I said, there's a technical guide with a lot more information. You can Google these four words and you'll find it, or we can paste the link as well. But yeah, that's it. Thanks, everyone.
© Tecton, Inc. All rights reserved. Various trademarks held by their respective owners.