Machine Learning Lead Engineer
apply(conf) - May '22 - 10 minutes
It’s a software monitoring best practice to alert on symptoms, not on causes. “Customer Order Rate dropped to 0” is a great alert: it alerts directly on a bad outcome. For machine learning stacks, this means we should focus monitoring on the output of our models. Data monitoring is also helpful, but should come later in your maturity cycle. In this talk, I will provide practical strategies for prioritizing your monitoring efforts.
I’ll be talking about my monitoring blueprint. It’s based on the roughly 30 models that I ran. I’m an ML lead engineer at DKB bank, previously at a big tech company, and the 30 models were mostly in recommender systems, personalization, NLP, and customer service or finance production systems, so I’ve had some time to give this some thought. The ML tooling space is vast and has been changing a lot lately. So what I thought I would give you is my monitoring blueprint, very simple, starting from basic software monitoring and building up to ML monitoring, and how you can use your own tooling. So let’s start with the basics. I’m assuming we have some data scientists here, so I’ll cover the basics too. All of us have problems in production. It’s not a very positive thing to say, but it’s true.
Even Google and AWS have incidents. You cannot avoid bugs or human error, so all you can do is detect these problems fast and act based on their severity. And how do you detect them? Now we come to the famous SRE citation: you use the four golden signals. Take latency, for example, the time it takes to serve a request: is it fast enough? Then you take traffic, the total number of requests, your errors, and your saturation. So when you do this basic backend monitoring, you focus on the symptoms, which means end-user pain, and you don’t focus on causes. And no matter whether you run an ML product or a normal backend service, you should use these for all of your products; they have proven very useful for detecting incidents.
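The four golden signals can be sketched with a tiny tracker in plain Python. This is a minimal illustration, not a production pattern; the `RequestTracker` class and the way saturation is measured (in-flight requests over a capacity) are my assumptions, not from the talk:

```python
# Illustrative sketch: accumulate the four golden signals for a service.
# A real stack would use a metrics client (e.g. a Prometheus library);
# this just shows what each signal measures.
import statistics

class RequestTracker:
    """Tracks latency, traffic, errors, and saturation samples."""

    def __init__(self, capacity: int):
        self.capacity = capacity       # assumed: max concurrent requests
        self.latencies_ms = []
        self.errors = 0
        self.in_flight = 0

    def record(self, latency_ms: float, ok: bool, in_flight: int):
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1
        self.in_flight = in_flight

    def signals(self) -> dict:
        n = len(self.latencies_ms)
        return {
            "latency_p50_ms": statistics.median(self.latencies_ms) if n else 0.0,
            "traffic_requests": n,                      # total request count
            "error_rate": self.errors / n if n else 0.0,
            "saturation": self.in_flight / self.capacity,
        }

tracker = RequestTracker(capacity=10)
tracker.record(latency_ms=120.0, ok=True, in_flight=3)
tracker.record(latency_ms=480.0, ok=False, in_flight=5)
print(tracker.signals())
```

An alert would then fire on any of these crossing a threshold, symptom-first: slow responses, dropped traffic, rising errors, or exhausted capacity.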
Now, is that sufficient for ML monitoring? This is a figure from the ML Test Score paper from Google, which shows the complexity of machine learning systems: in addition to the normal complexity, they also have complexity related to the data, the model code, and the model-training infrastructure. This can lead to failures that are silent in normal backend monitoring but have a huge commercial impact. I put some that I personally experienced on the slide; there are too many to list, so I’ll give you some examples. For one, the input data can change. We had a huge fraud model where the upstream team changed the unit of a very important field from seconds to milliseconds, and the model started making very unreasonable fraud predictions. Another time we had post-filtering rules, post-processing rules, and these rules can become very aggressive depending on the data.
For example, we had an on-sale filter that was fine during the sales season but way too aggressive outside of it. And then of course you can have bugs that you create in your own code. We also had one case with library issues: we hadn’t pinned our TensorFlow version, and we got a faulty version that created weird outputs. None of these things create any blip on the radar of traditional monitoring: there are no errors, nothing becomes slow. They are all silent, and they can have a much bigger commercial impact than model improvements. You work a long time for a 2% improvement, and then one of these bugs, some of them permanent, business rules or bugs you introduced and didn’t catch, has a much bigger commercial impact than the model improvement you were working on.
So that motivates why you need more on top of backend monitoring, and now I’ll tell you what that “more” means. Based on the ideas of SRE, I suggest that you monitor symptom-based, which means you focus on the output first. This is the chain: the input data comes in, then you have the model prediction, then maybe some post-processing rules, then in the end you deliver to the customer, and maybe you get the true outcome. And I propose to start with priority one on the outcome: monitor the outcome first. What does that mean? Some data scientists actually ask me, can I monitor my evaluation metrics in production? And I say, well, that depends: do you know the targets close in time? Sometimes you don’t. Say you have a fraud prediction model and you reject everybody you think is committing fraud.
Then you never learn what the true fraud outcome would have been. In other cases, maybe you predict what the delivery time of a package will be, and a few days later, when the package arrives, you know the correct time, so you have a delay; that can also happen. But you should still do it: compare the prediction to the outcome, hopefully close enough in time, and even if it isn’t, do it anyway so you have ground truth in production. So what you do is: you store the prediction, you store the target, and then you either run a batch job or create an endpoint to receive a feedback call. Based on that, you calculate metrics, add them to a dashboard, and of course you create an alert.
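The store-predictions, receive-feedback, compute-metrics loop can be sketched in a few lines. This is a simplified illustration assuming predictions and delayed targets are joined by a request ID; the function names and in-memory dicts are my own, standing in for a real store and feedback endpoint:

```python
# Sketch: store predictions, receive delayed ground truth, compute
# an evaluation metric (precision) over the matched pairs.
predictions = {}   # request_id -> predicted label (1 = fraud)
targets = {}       # request_id -> true label, arrives later via feedback

def record_prediction(request_id: str, label: int):
    predictions[request_id] = label

def record_feedback(request_id: str, label: int):
    """Would back a feedback endpoint or a batch join in practice."""
    targets[request_id] = label

def precision() -> float:
    """Precision over requests for which feedback has already arrived."""
    tp = fp = 0
    for rid, target in targets.items():
        if predictions.get(rid) == 1:      # model said "fraud"
            if target == 1:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if (tp + fp) else 1.0

record_prediction("a", 1); record_feedback("a", 1)   # true positive
record_prediction("b", 1); record_feedback("b", 0)   # false positive
record_prediction("c", 0); record_feedback("c", 0)   # true negative
print(precision())  # 0.5
```

The resulting number is what you would push to the dashboard and alert on.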
What you see here is a production dashboard from one of my use cases, with an alert if precision drops too low. Then I recommend another thing, which is often overlooked conceptually: talk to your stakeholders about the downsides of the machine learning algorithm. What do they fear could happen? In my case of fraud prediction, they were really afraid that I would reject customers unjustly and we might lose some sales, so we had an alert on precision. It’s quite important that you not only look at technical metrics but also at the possible downsides for the people who know the business case really well. Then, as a second priority, I recommend service response monitoring and quality heuristics. I’m skipping over these a bit fast since it’s a lightning talk, but I will share the slides afterwards.
An interesting insight for people coming from the modeling side is that there is only a partial overlap between evaluation metrics and monitoring metrics. As we’ve seen, you can use the evaluation metrics in production, but thankfully, a lot of the metrics you can use to detect problems are easier to implement, and they’re not necessarily the same; in fact, they’re different. What you can use, for example, to get real-time monitoring, not delayed by a few hours or days, is to monitor the response distribution of the output. Say you return a score: you can track a simple median, quantiles, and the share of empty responses, and have alerts on those. You can also get fancier with statistical metrics, but I would just start measuring and begin with simple rule-based logic. This is very nice because, compared to the evaluation metrics, you definitely get real-time monitoring.
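Monitoring the response distribution with simple rule-based alerts might look like the sketch below. The specific thresholds and the alert messages are illustrative assumptions; `None` stands in for an empty response:

```python
# Sketch: summarize the model's output score distribution per window
# and apply simple rule-based alerts, no delayed ground truth needed.
import statistics

def response_distribution(scores):
    """Median, p90, and empty-response share; None = empty response."""
    non_empty = [s for s in scores if s is not None]
    return {
        "median": statistics.median(non_empty) if non_empty else None,
        "p90": statistics.quantiles(non_empty, n=10)[-1] if len(non_empty) > 1 else None,
        "empty_share": 1 - len(non_empty) / len(scores),
    }

def check_alerts(stats, max_empty_share=0.05, median_range=(0.1, 0.9)):
    """Rule-based alerting on the distribution summary (thresholds assumed)."""
    alerts = []
    if stats["empty_share"] > max_empty_share:
        alerts.append("too many empty responses")
    lo, hi = median_range
    if stats["median"] is not None and not (lo <= stats["median"] <= hi):
        alerts.append("median score out of expected range")
    return alerts

scores = [0.2, 0.5, 0.7, None, 0.4, 0.6]   # one monitoring window
stats = response_distribution(scores)
print(stats, check_alerts(stats))
```

Run over each window of recent responses, this fires immediately when the distribution shifts, long before delayed evaluation metrics would.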
This is what Google uses; it’s from their data validation paper: histogram-based metrics over single fields, also quite nice to use. Another very nice thing is quality heuristics. This was once in a talk by Spotify: for their personalized homepage ranking, they also monitor heuristics. Say you love that box of pop songs from the eighties; it’s all you use every day. What would we expect as a heuristic for a personalized homepage ranking? Well, that box should be ranked high up. So that’s the heuristic: where is your favorite, your most-used carousel, ranked? If you can build heuristics like that for your specific application, as baseline quality indicators, they will go down when you have a problem, so that’s good to have. The last priority would be input data monitoring; it’s useful especially when you want to understand why your output metric changed, but I would implement things in this order.
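A quality heuristic in the spirit of that Spotify example could be as small as this. The function name, the carousel IDs, and the usage-count input are all hypothetical; the point is just that the metric degrades when personalization breaks:

```python
# Sketch of a quality heuristic for a personalized homepage ranking:
# the carousel the user interacts with most should be ranked near the top.
def favorite_carousel_rank(ranking, usage_counts):
    """1-based position of the user's most-used carousel in the ranking."""
    favorite = max(usage_counts, key=usage_counts.get)
    return ranking.index(favorite) + 1

ranking = ["new_releases", "eighties_pop", "podcasts", "jazz"]
usage = {"eighties_pop": 120, "podcasts": 10, "jazz": 3}
rank = favorite_carousel_rank(ranking, usage)
print(rank)  # 2 -- alert if this average drifts toward the bottom
```

Averaged over many users, this rank is a baseline quality indicator: it goes down (toward the bottom of the page) when the ranker has a problem.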
And how do you actually implement these? I looked at a bunch of monitoring tools and I also ran a survey among practitioners in the MLOps Community Slack. Please join, people, it’s a very good Slack channel. How many people are using dedicated tools in production? We did not get that much feedback; it seems like many people have not adopted them yet. So my current recommendation is to use your existing monitoring infrastructure: it will get you really far, and it will give you some ideas on what you want if you choose a vendor at some later point in time.
So what do you do practically? You implement metrics like counters or histograms; here I show you how to add a histogram metric. And if you have a complicated calculation, for example for quality heuristics, you can just log the responses to storage, for example to S3, and run a script every 10 minutes to calculate your metrics, then add them to your existing dashboards and alerts. Okay, to sum up: you need the golden signals, you add some machine learning monitoring on top, focusing on the output metrics first, and you often don’t need a new tool; instead, think about what you need and add a few metrics. Okay, cool. So join us on Slack.
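The histogram metric mentioned above can be illustrated in plain Python. In practice you would use your monitoring stack’s client library (for example a Prometheus client); this stdlib-only sketch with assumed bucket bounds just shows the bucketing idea:

```python
# Sketch of a histogram metric for model output scores: fixed upper
# bucket bounds, plus an implicit +Inf bucket, like Prometheus-style
# histograms. Bucket bounds below are assumptions for illustration.
import bisect

class HistogramMetric:
    def __init__(self, buckets):
        self.buckets = sorted(buckets)           # upper bounds per bucket
        self.counts = [0] * (len(buckets) + 1)   # last slot = +Inf bucket
        self.total = 0.0

    def observe(self, value: float):
        # bisect_left finds the first bucket whose upper bound >= value
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.total += value

score_hist = HistogramMetric(buckets=[0.25, 0.5, 0.75, 1.0])
for score in [0.1, 0.3, 0.3, 0.9]:
    score_hist.observe(score)
print(score_hist.counts)  # [1, 2, 0, 1, 0]
```

The per-bucket counts are what lands on the dashboard; a sudden shift of mass between buckets is exactly the kind of silent failure the talk describes.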