Machine Learning Platform for Online Prediction and Continual Learning

apply(conf) - May '22 - 30 minutes

This talk breaks down stage-by-stage requirements and challenges for online prediction and fully automated, on-demand continual learning. We’ll also discuss key design decisions a company might face when building or adopting a machine learning platform for online prediction and continual learning use cases.

Chip Huyen:

Okay, so yeah, my name is Chip. I’m a co-founder of a startup called Claypot AI, which is a platform for real-time machine learning. I also teach machine learning systems design at Stanford. I usually start the talk with… I am very excited to be here, but I’m still a bit nervous to go after Mate, because like Dee said, he is a visionary in this space, and I don’t think there’s any way I’m going to top his talk, so please be patient with me. Is anyone here from Vietnam? I hopefully ask this at every talk, but I’m not sure if anyone here… Usually, I never see anyone from Vietnam. Okay, I guess no one then.

Chip Huyen:

Okay, so yes, okay finally, thanks for coming. I know it’s pretty late in Vietnam. Hey, Lin. So today, I’m going to talk about machine learning platform for online predictions and continual learning. A trend that I have noticed in the last few years is that the industry is moving toward online predictions and continual learning, and one common denominator for both of them is that they both require good monitoring solutions. In this talk, ideally we want to cover all three topics, online predictions, monitoring, and continual learning, but because of the limited time, I will try to cover the first two topics first, and if we have time, I’m going to go over continual learning.

Chip Huyen:

I always find events like this a little bit hard to give talks to, because I don’t have a good sense of what people are interested in or what you have already seen a lot, so please feel free to ask questions or stop me at any point. And I know that I have an accent, so sometimes I speak very fast, so feel free to tell me to slow down as well. Nice, I see a lot of people from Vietnam here.

Chip Huyen:

Okay, so the first is online predictions. I believe that at this point… A lot of people are very much familiar with the discussions around batch predictions versus online predictions. Batch prediction is when predictions are computed periodically, like maybe once a day, before requests arrive, and online prediction is when predictions are computed on demand, after prediction requests arrive. The problem with batch prediction is that it’s not adaptive: because the predictions are computed before requests arrive, you can’t take into account relevant information to make relevant predictions, and it shows a lot in tasks like recommender systems or dynamic pricing.

Chip Huyen:

Another problem with batch prediction is the cost. Whenever you generate batch predictions, you tend to generate predictions for all the possible requests out there. We just talked to a company recently, and they have about three million users, so they generate predictions for these three million users daily. However, only a small fraction of these users log into the platform daily, which means that like 99% of their computed predictions are actually not used, which is a huge waste of compute power. With online predictions, the big challenge is the latency, because predictions are computed after requests arrive. You would need a machine learning model that can return predictions very fast, because users don’t like waiting.

Chip Huyen:

With batch predictions, the workflow looks like this. You generate predictions offline in batch, and a lot of… I think a lot of companies load these precomputed predictions into a key-value store, like DynamoDB or Redis, to reduce the latency at prediction time, and when prediction requests arrive, the application fetches the precomputed predictions. Not all use cases need online predictions. Batch prediction works pretty well for a lot of tasks, like churn prediction or user lifetime value. If you want to predict which users are going to leave the platform, you can probably run that like once a month or once a week, and it’s fine.
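
Illustrative sketch (not from the talk): the batch workflow described above, reduced to its two paths. The in-memory dict stands in for a key-value store such as Redis or DynamoDB, and `toy_churn_model`, `run_batch_job`, and `serve` are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, Iterable, Optional

# Stand-in for a key-value store such as Redis or DynamoDB (an assumption:
# a plain dict keeps the sketch self-contained).
prediction_store: Dict[str, float] = {}


def run_batch_job(user_ids: Iterable[str], model: Callable[[str], float]) -> None:
    """Periodic job: score every known user and persist the result."""
    for user_id in user_ids:
        prediction_store[user_id] = model(user_id)


def serve(user_id: str) -> Optional[float]:
    """Request path: a lookup only, no model call at request time."""
    return prediction_store.get(user_id)


def toy_churn_model(user_id: str) -> float:
    """Hypothetical model: a deterministic fake score based on the ID length."""
    return min(1.0, len(user_id) / 10)


run_batch_job(["ann", "bob", "charlie"], toy_churn_model)
```

Note that a user who was not part of the last batch run gets no prediction at all (`serve` returns `None`), which is the "not adaptive" limitation in miniature.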

Chip Huyen:

So with online predictions, there are two different types of online predictions. The first is that you do online predictions, but with batch features. Batch features are features that are computed offline, for example product embeddings. So imagine that you want to do session-based recommender systems. You might want to look into all the products that a user has seen in the last half an hour, so based on the items that the user has seen in the last half an hour, you want to get the embeddings for these items, and you want to add them together to create a feature embedding, and then you generate the recommended items based on these precomputed embeddings. The embeddings are usually computed beforehand and loaded into a key-value store, so that you reduce the latency at prediction time.
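
Illustrative sketch (not from the talk): a session-based recommendation using precomputed embeddings. The embedding values, the cosine-similarity ranking, and the function names are assumptions made for this example; the talk only says the item embeddings are added together and used to generate recommendations.

```python
from math import sqrt
from typing import Dict, List

# Computed offline and loaded into a key-value store (a dict stands in here).
item_embeddings: Dict[str, List[float]] = {
    "shoes": [1.0, 0.0],
    "socks": [0.9, 0.1],
    "laptop": [0.0, 1.0],
    "mouse": [0.1, 0.9],
}


def session_embedding(viewed_items: List[str]) -> List[float]:
    """Add the embeddings of the items seen in the session, element-wise."""
    vectors = [item_embeddings[item] for item in viewed_items]
    return [sum(components) for components in zip(*vectors)]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def recommend(viewed_items: List[str], k: int = 1) -> List[str]:
    """Rank unseen items by similarity to the session embedding."""
    query = session_embedding(viewed_items)
    candidates = [item for item in item_embeddings if item not in viewed_items]
    candidates.sort(key=lambda item: cosine(query, item_embeddings[item]), reverse=True)
    return candidates[:k]
```

Only the lookup and a few vector operations happen at request time; the expensive part, learning the embeddings, stays offline.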

Chip Huyen:

Another level of online predictions is when you want to do it with online features. In the case of embeddings, you can compute embeddings offline. However, for a lot of tasks, you might want features that are computed online. For example, if you want to figure out which products are trending right now, you might want to look into the number of views that each product has had in the last 30 minutes. To get the number of views a product has had in the last 30 minutes, you want to compute those features online. The workflow is that you compute batch features like embeddings offline, you load them into a key-value store, and then at prediction time, you look into what features are needed for this prediction request, and then if it’s a batch feature, you fetch it from the key-value store, and if it’s an online feature, you compute it from the recent click stream. You can compute that using Lambda functions, like you can have a microservice running a Lambda function to compute the number of views a product has had in the last 30 minutes.

Chip Huyen:

However, running it that way has a lot of problems. One problem is that Lambda functions are stateless. I want to check in, like how is everyone doing? I haven’t seen… Yeah. Okay, good. Yeah, so I might have a difficult time trying to understand how the audience receives the talk, because I can’t see your faces, so if this topic is slow, boring, we can move past it. I have a lot of slides here. Cool, thank you.

Chip Huyen:

So to compute online features, some companies, we see that they set up a Lambda microservice, so they just compute the recent number of views. The problem with Lambda is that it’s not stateful; it’s stateless. If you want to compute the number of views an item has had in the last 30 minutes, you would need… If the views come in every minute, one way is that every time you want the number of views in the last 30 minutes, you just go over the last 30 minutes. But you might want to just, every minute, calculate the number of views in the last minute, and then combine it with the number of views in the previous 29 minutes. With Lambda functions, you would need to set up an external database to store the state of the computations. A much more efficient way could be to use stream computation engines such as Flink, and also as Mate just mentioned [inaudible 00:08:28] batch streaming. So yeah, that’s one for online features.
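
Illustrative sketch (not from the talk): the stateful version of "views in the last 30 minutes", where each new minute's count is combined with the previous 29 minute-buckets instead of rescanning the whole window. The class name and the assumption that events arrive in time order with integer minute timestamps are choices made for this example; a stream engine such as Flink manages this kind of state for you.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque, List


class SlidingViewCounter:
    """Per-product view counts over the last `window_minutes` minute-buckets."""

    def __init__(self, window_minutes: int = 30) -> None:
        self.window = window_minutes
        # product_id -> deque of [minute, count] buckets, oldest first
        self.buckets: DefaultDict[str, Deque[List[int]]] = defaultdict(deque)
        self.totals: DefaultDict[str, int] = defaultdict(int)

    def _evict(self, product_id: str, now_minute: int) -> None:
        """Drop buckets that have fallen out of the window (now - window, now]."""
        buckets = self.buckets[product_id]
        while buckets and buckets[0][0] <= now_minute - self.window:
            _, count = buckets.popleft()
            self.totals[product_id] -= count

    def record_view(self, product_id: str, minute: int) -> None:
        """Assumes events arrive in non-decreasing minute order."""
        self._evict(product_id, minute)
        buckets = self.buckets[product_id]
        if buckets and buckets[-1][0] == minute:
            buckets[-1][1] += 1
        else:
            buckets.append([minute, 1])
        self.totals[product_id] += 1

    def views(self, product_id: str, now_minute: int) -> int:
        self._evict(product_id, now_minute)
        return self.totals[product_id]
```

The state here (`buckets` and `totals`) is exactly what a stateless function would have to push to an external database between invocations.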

Chip Huyen:

Those features [inaudible 00:08:38] You can see the difference from batch prediction to online prediction with batch features and online prediction with online features. The key difference is in the feature service, and for the feature service, a lot of companies we talked to recently are looking into what is called a feature store. For this to work, a feature store, or like a feature service, would need certain properties. One is that it needs to be able to connect to different data sources, both offline batch data sources, like a data warehouse, Snowflake, BigQuery, and streaming data sources, like Kafka, Kinesis. So yes, that’s one requirement for a feature store: dealing with multiple data source connections.

Chip Huyen:

And another is that it would need to be able to store feature definitions. For example, if you want to specify that, oh, we want the number of views a product has had in the last 30 minutes, then you might want to… You need to define that query somehow. It can be in SQL, or in Pandas, like data frames, or it can be in like PySpark. The feature store would need to be able to store these definitions and compute them, which brings us to the next requirement of the feature store: feature computations.

Chip Huyen:

You have feature definitions and you have data sources, so you want to apply these feature definitions to the data sources to get the computed features. A lot of feature stores that I have seen are very light on feature computations, especially on the streaming part, because streaming is hard, and I see that feature stores tend to leverage tools like Spark Streaming, which I think is a great addition to Spark, but I think there is still a lot to be desired about Spark Streaming. But luckily, there have been a lot of exciting tools that allow us to do stream computations very efficiently, like Vectorize, Materialize, and Decodable. Oof, this will be a slight burn. No, I mean Spark Streaming is great. I have a lot of respect for Databricks. It’s an amazing company. But Databricks does come from a batch computation background, so it definitely takes time, but it can definitely become a great streaming computation engine. Oh gosh, I hope that Mate’s not in the talk and doesn’t hate me now.

Chip Huyen:

Okay, so another is that after you compute these features, you might want to persist them. You might want to persist the computed features for reuse in the next prediction requests, because sometimes different prediction requests might want to access the same feature, or different models might want to use the same features, so if you persist certain computed features, you can reuse them for future prediction requests, or you can also reuse them when you want to retrain the model on new data. So yeah, in this case, if you want to persist computed features, then the feature store is more like a data mart, more like a database. It’s a store of precomputed features.

Chip Huyen:

I think another thing that people have been talking about a lot is ensuring the consistency between training and serving for predictions. Here, we have this diagram for online predictions. Batch features are pretty consistent between… You can use the same batch pipeline for batch features in serving and training. However, for the online features, this gets tricky. With online features, during prediction, we might want to use stream computations to get the features, but when training on the same features, we might want to use a batch process.

Chip Huyen:

A big promise of the modern feature store is to help you ensure the consistency of the features during prediction and also retraining. I think this is a very difficult topic, because if you compute the training features in the feature store, then yes, it can reuse the same feature definitions and go back in time to generate historical features, but if you generate training features outside the feature store, then it can be very hard to ensure the consistency.
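
Illustrative sketch (not from the talk): one way to picture "reuse the same feature definitions and go back in time". A single registered definition is evaluated "as of" a timestamp, so the serving path and the training backfill cannot drift apart. The registry, the event layout, and every name below are assumptions for this example, not any feature store's API.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[float, str]  # (timestamp in minutes, product_id)


def views_last_30m(events: List[Event], product_id: str, as_of: float) -> int:
    """Views of `product_id` in the window (as_of - 30, as_of]."""
    return sum(1 for ts, pid in events if pid == product_id and as_of - 30 < ts <= as_of)


# Hypothetical registry of feature definitions, keyed by feature name.
FEATURES: Dict[str, Callable[[List[Event], str, float], int]] = {
    "views_last_30m": views_last_30m,
}


def compute(feature: str, events: List[Event], product_id: str, as_of: float) -> int:
    """Serving path: evaluate the definition as of the request time."""
    return FEATURES[feature](events, product_id, as_of)


def backfill(feature: str, events: List[Event], product_id: str,
             label_times: List[float]) -> List[int]:
    """Training path: evaluate the same definition as of each historical time."""
    return [compute(feature, events, product_id, t) for t in label_times]


events: List[Event] = [(1, "p"), (5, "p"), (20, "q"), (40, "p")]
```

Because `backfill` only ever calls `compute`, a training row generated for time 40 matches what serving would have produced at time 40, which is the consistency property at stake.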

Chip Huyen:

And one last point that I want to talk about on the feature store, online feature service, is that when we generate batch features offline, like for example, after we generate embeddings offline, we might want to… We might be able to do a lot of tests, to make sure that those embeddings make sense. But if we do generate online features, online, and we reuse that immediately for predictions, then we need to have some way to ensure that these online features have good quality, and they are not… There’s not something that breaks in the pipeline to lead to the wrong features. That’s the last aspect of the feature store functionality, is feature monitoring, to ensure that every feature generated in the feature store, the feature service, will be usable, and correct, or within certain expectations.

Chip Huyen:

Yeah, so for the key functionality of a feature store, the things in blue are what I believe feature stores today are doing really well, like connecting to different data sources, storing feature definitions, or persisting computed features. However, I think there’s still a lot of room to improve in functionality like feature computations, or [inaudible 00:14:44] feature consistency, and feature monitoring. And it’s very, very likely that a lot of modern feature stores, like Tecton [inaudible 00:14:52] and I think a lot of feature stores are actually working on this, and I’m very excited to see where feature stores are going to go.

Chip Huyen:

Okay, so we talked about feature monitoring, which brings us to the next topic of monitoring. The first question… I think we’re almost running out of time, so I’m going to go through this quickly. The first question we need to address when talking about monitoring is what to monitor. What companies really care about, when they have a machine learning model, is business metrics, like [inaudible 00:15:29] accuracy, click-through rate. However, it’s pretty hard to monitor the business metrics directly, because usually you would need a lot of labels or feedback from users, and not every task has labels or feedback from users immediately available that we can use to monitor the models.

Chip Huyen:

One way to get around it is that companies try to collect as much feedback as possible. For example, for eCommerce, you can leverage users’ feedback as a proxy to see how the model is performing. For example, if we do a recommender system, you might want to look into whether users click on an item, add an item to cart, buy an item, or return an item. These different kinds of feedback through the user’s journey have different properties. For example, users click on items a lot, so that means that clicks are very dense feedback. However, a click is not a very strong signal, so you might have a lot of clicks, but a user clicking on an item doesn’t mean that the user really likes the item.

Chip Huyen:

However, purchases, like buying an item, are a lot more sparse, because it doesn’t happen a lot. However, it’s a very strong signal. Companies then have to decide on what kind of feedback they actually monitor. And another thing when monitoring business metrics is that you care about fine-grained evaluations. One overall metric, like accuracy or [inaudible 00:17:08] data, is not going to be good enough, and you want to know how a model is performing for different subsets of the data. For example, we want to know whether the model is performing equally well across all the demographics, and if it suddenly changes for some group of users, then there might be something interesting happening there. It can be something very bad, like some biases in your pipeline. Because of the lack of labels and feedback, a lot of monitoring tools turn to monitoring proxies, like predictions and features. The assumption here is that a shift in prediction and feature distributions will also lead to decreased business metrics. In this case, monitoring becomes a problem of detecting distribution shifts.
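
Illustrative sketch (not from the talk): why one overall metric can hide a problem in a subset of the data. The slice names and the toy records are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, int]  # (slice name, true label, predicted label)


def overall_accuracy(records: Iterable[Record]) -> float:
    records = list(records)
    return sum(1 for _, y_true, y_pred in records if y_true == y_pred) / len(records)


def accuracy_by_slice(records: Iterable[Record]) -> Dict[str, float]:
    """Accuracy computed separately for each slice (e.g. device or demographic)."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        total[slice_name] += 1
        correct[slice_name] += int(y_true == y_pred)
    return {name: correct[name] / total[name] for name in total}


records = [
    ("desktop", 1, 1), ("desktop", 0, 0), ("desktop", 1, 1), ("desktop", 0, 0),
    ("mobile", 1, 0), ("mobile", 0, 1), ("mobile", 1, 1), ("mobile", 0, 1),
]
```

Here the overall accuracy is 0.625, which looks mediocre but not alarming, while the per-slice view shows the model is perfect on one group and nearly useless on the other.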

Chip Huyen:

That leads us to the next question, like how do we detect data distribution shift? Usually, if we have two populations, we want to determine whether these two populations come from the same distributions. [inaudible 00:18:22] question is how do we determine that the two distributions are different? The two distributions might be… One of them might be the distributions of [inaudible 00:18:35] data, and another distribution might be the prediction during production data, or it can be like one distribution might be the distribution from yesterday and another distribution is the distribution from today. The base distribution that we compare the other distribution to is called the source distribution, and the distribution that we care about to see whether it has deviated from the source distribution is called the target distribution.

Chip Huyen:

There are two main approaches to detecting data distribution shift. One is by comparing statistics, and another is by using two-sample hypothesis testing, such as the K-S test or MMD. Comparing statistics means that you compute certain statistics of the source distribution, like mean, variance, min, max, and you also compute the same statistics of the target distribution, and if the statistics have diverged from the source distribution to the target distribution, then you can say that, oh, the distribution has shifted.
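
Illustrative sketch (not from the talk): the statistics-comparison approach. The choice of a 20% relative tolerance and the function names are assumptions for this example; in practice the threshold would be tuned per feature.

```python
from statistics import mean, pvariance
from typing import Dict, List, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    return {
        "mean": mean(values),
        "variance": pvariance(values),
        "min": min(values),
        "max": max(values),
    }


def drifted_statistics(source: Sequence[float], target: Sequence[float],
                       rel_tol: float = 0.2) -> List[str]:
    """Names of the statistics whose relative change exceeds `rel_tol`."""
    src, tgt = summarize(source), summarize(target)
    flagged = []
    for name, src_value in src.items():
        scale = max(abs(src_value), 1e-9)  # avoid dividing by zero
        if abs(tgt[name] - src_value) / scale > rel_tol:
            flagged.append(name)
    return flagged


source = [10.0, 12.0, 11.0, 13.0, 9.0, 10.0, 12.0, 11.0]
target = [value + 5.0 for value in source]  # same shape, shifted location
```

With this data the mean, min, and max are flagged while the variance is not, since shifting every value by a constant leaves the spread unchanged.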

Chip Huyen:

The two-sample hypothesis test is more involved. The problem with the statistics approach is that it’s very distribution-dependent. For example, you should only compute statistics that are meaningful for your distribution. For example, if the distribution is a normal distribution, then mean and variance can be very helpful, but if the distribution is something like a long-tail distribution, then mean or variance might not be a good statistic. Another problem with the statistics comparison approach is that it’s inconclusive. If the mean or variance has shifted from the source distribution to the target distribution, then we can say that the distribution has shifted. However, if the statistics are still the same, then we can’t really say that the distribution has not shifted. I hope that makes sense.

Chip Huyen:

The two-sample hypothesis test is pretty common. However, the vast majority of hypothesis tests today only work with low-dimensional data, so they don’t work for high-dimensional data, like embeddings. So people usually tend to first perform dimensionality reduction on high-dimensional data before they apply hypothesis tests.
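
Illustrative sketch (not from the talk): a two-sample Kolmogorov-Smirnov check on one-dimensional data, written with the standard library only. The decision rule uses the usual large-sample critical value, which is an approximation; a library routine such as `scipy.stats.ks_2samp` would normally be used instead, and the function names here are hypothetical.

```python
from bisect import bisect_right
from math import log, sqrt
from typing import Sequence


def ks_statistic(source: Sequence[float], target: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(source), sorted(target)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def ks_shift_detected(source: Sequence[float], target: Sequence[float],
                      alpha: float = 0.05) -> bool:
    """Reject 'same distribution' using the large-sample critical value."""
    n, m = len(source), len(target)
    critical = sqrt(-log(alpha / 2) / 2) * sqrt((n + m) / (n * m))
    return ks_statistic(source, target) > critical


source = [float(x) for x in range(100)]
target = [x + 50.0 for x in source]
```

For high-dimensional inputs such as embeddings, this would be applied after reducing each sample to one or a few dimensions, as described above.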

Chip Huyen:

When talking about detecting shifts, it’s important to note that not all shifts are equal. Some shifts are easier to detect than others. For example, sudden shifts are a lot easier to detect than gradual shifts. Imagine we have a distribution that changes like this, so it’s a very gradual change. If you’re comparing the data today to the data yesterday, you might not see a lot of change, and you might think that, oh, there’s no shift, but over time, because the shifts are continual and gradual, after a week, the distribution might have changed significantly, but you might not be able to detect it.
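
Illustrative sketch (not from the talk): a gradual drift that a day-over-day comparison never flags but a comparison against a fixed baseline does. The drift rate, the threshold, and the mean-difference check are all assumptions chosen to keep the example small.

```python
from statistics import mean
from typing import List, Sequence


def mean_shift(source: Sequence[float], target: Sequence[float], threshold: float) -> bool:
    """Crude shift check: has the mean moved by more than `threshold`?"""
    return abs(mean(target) - mean(source)) > threshold


# Eight days of data whose center creeps up by 0.3 per day.
days: List[List[float]] = [
    [day * 0.3 + offset for offset in (-1.0, 0.0, 1.0)] for day in range(8)
]
threshold = 1.0

# Compare each day with the day before it.
day_over_day = [mean_shift(days[d - 1], days[d], threshold) for d in range(1, 8)]

# Compare each day with the first day.
versus_baseline = [mean_shift(days[0], days[d], threshold) for d in range(1, 8)]
```

Every entry of `day_over_day` is `False`, while `versus_baseline` turns `True` from the fourth comparison onward, once the accumulated drift exceeds the threshold.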

Chip Huyen:

Another distinction is spatial shifts versus temporal shifts. A spatial shift happens when you have pretty much like a new access point. For example, the users may have new devices. Before, they were using your application on desktop, but now they use it on mobile phones, and user behavior on mobile phones is very different from user behavior on desktop. Or another type of spatial shift is when you have new users. For example, you might launch a new marketing campaign, and you suddenly get users from an entirely different demographic than you had before. So now you have a lot of new users, so that’s like a spatial shift.

Chip Huyen:

On the other hand, a temporal shift is when you have the same users, same devices, but behaviors have changed over time. Temporal shifts are really tricky to detect. For temporal shifts, the time window scale really matters a lot. Consider a distribution that looks like this, across 15 days, right? And we want to use day 15 as the target distribution. If we just look at the last, say, six days as the source distribution, then we see that, oh, day 15 looks significantly different from day nine to day 14. Then it’s a shift. However, if we use day one to day 14 as the source distribution, then we’re going to see that, oh, the spike on day 15 is just expected, so this is not a distribution shift.
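
Illustrative sketch (not from the talk): the 15-day example with made-up numbers. The series has a spike every seventh day, and the "outside the range seen in the source window" rule is a deliberately crude stand-in for a real shift test.

```python
from typing import Sequence

# Daily volumes for days 1..15 with a weekly spike on days 1, 8, and 15.
daily_volume = [100, 20, 22, 19, 21, 20, 23, 98, 21, 20, 22, 19, 21, 20, 99]


def looks_shifted(series: Sequence[float], target_day: int, window_days: int) -> bool:
    """Flag the target day if it falls outside the range of the preceding window.

    Days are 1-indexed; the source window is the `window_days` days
    immediately before `target_day`.
    """
    target = series[target_day - 1]
    source = series[target_day - 1 - window_days:target_day - 1]
    return not (min(source) <= target <= max(source))
```

With a six-day source window (days 9 to 14) day 15 is flagged, because the window never saw a spike; with a fourteen-day window (days 1 to 14) it is not, because the earlier spikes are part of the source.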

Chip Huyen:

Yeah, so the time window scale is very, very important. Choosing the right time window for detecting a distribution shift is very hard, because if we have too short a time window, we will have a lot of false alarms like in this case, when we think it’s a shift, but it’s not a shift. It’s just the cyclic nature of your data. Whereas if you choose a time window that is too long, then it might take too long for us to detect the shift, which is not a good thing either.

Chip Huyen:

A lot of stream processing tools allow us to do something called merging profiles. We can start with monitoring metrics and statistics at a small time window, like hourly, and then we can merge 24 of the hourly windows into a bigger profile, like daily. That is very, very convenient. It’s great, and I have seen there are some tools, like Mona Labs, that have this root cause analysis, where they automatically analyze various window sizes to help you determine the exact point in time where the shift happened.
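
Illustrative sketch (not from the talk): what makes a profile mergeable. As long as each hourly profile stores quantities that combine exactly (count, sum, min, max), 24 hourly profiles can be folded into a daily one without revisiting the raw data. The `Profile` class is hypothetical and far simpler than what monitoring tools keep.

```python
from dataclasses import dataclass
from functools import reduce
from math import inf
from typing import Sequence


@dataclass(frozen=True)
class Profile:
    count: int = 0
    total: float = 0.0
    minimum: float = inf
    maximum: float = -inf

    @staticmethod
    def of(values: Sequence[float]) -> "Profile":
        """Build a profile from raw values."""
        if not values:
            return Profile()
        return Profile(len(values), float(sum(values)), float(min(values)), float(max(values)))

    def merge(self, other: "Profile") -> "Profile":
        """Combine two profiles without access to the underlying values."""
        return Profile(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


# One small profile per hour, then a single merged daily profile.
hourly = [Profile.of([hour, hour + 1]) for hour in range(24)]
daily = reduce(Profile.merge, hourly, Profile())
```

Statistics that do not combine exactly, such as medians, need a sketch data structure instead, which is outside the scope of this example.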

Chip Huyen:

We talked about two kinds of proxies that companies use to monitor their machine learning systems when they don’t have enough labels and feedback: predictions and features. I’m a huge fan of monitoring predictions. An example of monitoring predictions is that you can have some conditions, like if predictions are all false in the last 10 minutes, then send an alert, or checking whether class three is still the most popular class. If it’s now no longer the most popular class, then you might see that something [inaudible 00:25:20] has happened.
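
Illustrative sketch (not from the talk): the two prediction-monitoring conditions mentioned above, written as plain functions over the predictions collected in a recent window. The function names are hypothetical, and collecting the window (for example, the last 10 minutes) is left to the caller.

```python
from collections import Counter
from typing import Hashable, Sequence


def all_false_alert(recent_predictions: Sequence[bool]) -> bool:
    """Alert if there were predictions in the window and every one was False."""
    return len(recent_predictions) > 0 and not any(recent_predictions)


def top_class_changed(recent_predictions: Sequence[Hashable], expected_top: Hashable) -> bool:
    """Alert if the most frequent predicted class is no longer `expected_top`."""
    if not recent_predictions:
        return False
    top_class, _ = Counter(recent_predictions).most_common(1)[0]
    return top_class != expected_top
```

Both checks return `False` on an empty window, on the assumption that "no traffic" should be handled by a separate alert.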

Chip Huyen:

I like predictions because predictions are pretty low-dimensional, so they are easy to visualize, easy to compute stats on, and also easy to do two-sample hypothesis tests on, and also changes in the prediction distribution generally mean changes in the input data distribution. However, keep in mind that a prediction distribution shift can also be caused by a canary rollout. I’m not sure you’re familiar with canary rollouts. It’s a case when you have an existing model and a new model, and you want to slowly roll out your new model to a larger percentage of users, so you might first serve the new model to 1% of users, and then 10%, and then 90%, and then finally 100%.

Chip Huyen:

This means that as you do a canary rollout, with the new model slowly replacing the existing model, you might see a distribution shift. In this case, you still might want to investigate, because if your new model produces significantly different predictions from your existing model, then there might be some problem with the new model, and you should definitely look into it.
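
Illustrative sketch (not from the talk): one common way to implement the percentage split in a canary rollout, by hashing the user ID into a stable bucket. The hashing scheme and the names `bucket` and `pick_model` are assumptions for this example; the talk does not say how the traffic is split.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Stable bucket in [0, 100) derived from the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_model(user_id: str, canary_percent: int) -> str:
    """Route the user to the new model if their bucket is below the canary share."""
    return "new" if bucket(user_id) < canary_percent else "existing"
```

Because the bucket depends only on the user ID, raising the share from 1% to 10% keeps every user who was already on the new model there, so any prediction shift seen during the rollout can be attributed to newly routed users rather than to users flipping back and forth.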

Chip Huyen:

Monitoring features is very similar to monitoring predictions, but it’s a lot harder and a lot more complex. For a given feature, you might want to compute expected statistics or a schema for that feature. Usually, during training, for each feature, you might want to get the mean or variance of that feature, and then you monitor the mean and variance of that feature in production. And then if it changes, then you think that, oh, the distribution for that feature has shifted.

Chip Huyen:

Here are some other examples of the expectations for a feature that you can generate during training. For example, take a commonsense expectation: if the task is NLP for English, you know that “the” is the most common word in English, so if “the” is no longer the most common word in production, then there might be a problem there. Another example is you might want to compute the min, max, or median value. You might expect that the min or max of a feature is within a range between A and B, and A and B are values that can be computed from the source distribution or the training distribution.
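
Illustrative sketch (not from the talk): learning simple expectations from training data and checking production values against them. The helper names are hypothetical, and the range check uses the training min and max as the bounds A and B.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def learn_range(training_values: Sequence[float]) -> Tuple[float, float]:
    """Expected range (A, B) taken from the training data."""
    return (min(training_values), max(training_values))


def out_of_range(production_values: Sequence[float],
                 expected_range: Tuple[float, float]) -> List[float]:
    """Production values that fall outside the expected range."""
    low, high = expected_range
    return [value for value in production_values if not (low <= value <= high)]


def most_common_token(tokens: Sequence[str]) -> str:
    """Most frequent token, for checks like 'is it still "the"?'."""
    return Counter(tokens).most_common(1)[0][0]
```

A production check would then compare `most_common_token(...)` on recent traffic with the value recorded at training time, and alert when `out_of_range(...)` returns more values than some tolerated share.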

Chip Huyen:

Monitoring features has a lot of challenges. First is the compute and memory cost. We don’t have just one model in production anymore, like any interesting… Companies now, the number of models in production is just increasing like crazy, so companies may have a lot of models in production, and each model might have a lot of features. So imagine you have 100 models. Each model has like 100 features. Now we already have like 10,000 features, so imagine just computing statistics for these 10,000 features, like constantly. It can be very, very costly and slow. It can eat up all your compute power.

Chip Huyen:

Another is alert fatigue. When we have a lot of features, it’s very likely some features are going to change. However, most of the changes are just benign. But if you send an alert to a data scientist every time a feature changes, you would have a lot of false alarms. That can lead to something called alert fatigue.

Chip Huyen:

And another is that features usually follow some expected schema. For example, if you compute, oh, the median has to be between A and B, then A and B are part of the schema of that feature, and this might change as we update the model. If you train the model on new data, then the values of A and B change, so we need to somehow keep track of all these feature schemas over time, and that can be pretty hard.

Chip Huyen:

The vast majority of monitoring solutions nowadays focus on monitoring features, and so do feature stores. I think we see a lot of feature stores that are also adding the functionality to monitor features, because feature stores are already computing and persisting a lot of feature values, so that could be a natural place to do monitoring as well. So the question is that like… Okay, I don’t think I’m going to make any statement here, but I do see a lot of convergence between monitoring solutions and feature stores.

Chip Huyen:

Okay, I think I don’t have time. Thank you, Dee, for reminding me, but yeah. I think there’s some continual learning, but I don’t think I’m going to go through this, and yeah, I have a book coming out, and I hope that… It’s just coming out this week, and I hope that if you already ordered it, I hope that this talk doesn’t make you cancel the order. But also reach out by email, Twitter, to chat about any of the topics. Thank you.

Demetrios:

Awesome stuff, Chip. Do you got a few minutes to answer some questions?

Chip Huyen:

Yes. For you, anytime.

Demetrios:

I love it. We are asking questions in the apply() Conference channel, and there’s a ton of them. I know we’re not going to be able to get to all of them, but I picked out a few that I really liked, and the first one is how do you… Oh sorry, wrong one. That was the wrong thread. It’s can streaming data sources be joined together, i.e., two separate entities with a join key?

Chip Huyen:

They can be, but I think you need… Joining on streaming is actually pretty hard. With joins on SQL tables, we have a lot of optimizations for batch joining, right? I think we see a lot of movement recently in streaming SQL as well, so I think Snowflake has a team now working on streaming. We have Materialize, Decodable, so yes, it’s possible, but the question is how optimized the streaming joins are. That’s why we see a lot of new tools coming in.

Demetrios:

Excellent. What about a recommendation for a feature store for small teams? I see the value prop in enabling much faster iteration, but I fear the maintenance and over-engineering as a team of four.

Chip Huyen:

That’s a good question. I think there are a lot of options for feature stores. I would say it depends on your needs, like the complexity of your features. If you mostly have batch features loaded into an online key-value store, like Redis or DynamoDB, then you can probably build something in house, and if you have streaming features, but they’re simple streaming features, like just computing the number of views a product has had in the last 30 minutes, then you can probably get away with Spark Streaming or some Lambda microservice.

Chip Huyen:

However, we’ve talked to companies, especially in fintech, that have extremely complex streaming features, and then you need a very good streaming computation engine. So I would say that is harder. I don’t see a lot of feature stores that can do complex streaming computations very well yet. And also, I think I heard some complaints about feature stores. I’m sorry Mike. I know that you have the call. I know Tecton is incredible, and I think I have heard really good things about Tecton. One thing people do complain about, though, is that it’s very integration heavy, so like the integrations… If you have a small team, four people, I’m not sure whether… Well, what is the integration timeline going to be for you?

Demetrios:

Last one before we jump. How do you scale serving when the scale is not with the number of requests the model gets, but it is with the number of models to serve?

Chip Huyen:

I’m sorry, can you repeat the question?

Demetrios:

Yeah. I think there needs to be some context here. Large, over 10,000 models, all models use text as the only input. How do you scale serving when the scale is not with the number of requests the model gets, but it is with the number of models to serve? Also, how would you effectively monitor drift for a large number of models? Can it be done in near real time?

Chip Huyen:

So you have a lot of models?

Demetrios:

Yeah, over 10,000.

Chip Huyen:

Whoa. Is this NLP? Are you using some kind of embeddings, or like prebuilt, pre-trained model for the NLP model?

Demetrios:

Vishal, this one’s on you. Let’s see what he says. I’m seeing something. Yes, he is.

Chip Huyen:

Okay, so are you deploying these models separately, or are you having one container for all of these 10,000 models?

Demetrios:

Custom serving built to handle this, and he uses embeddings plus another model.

Chip Huyen:

I think like this, we actually work with a few companies that have similar patterns, especially B2B companies. We’ve seen that B2B companies might have a separate model for each customer, so if you have 10,000 customers, then you can easily have 10,000 models. That actually requires really interesting solutions, and the solution is not just in scaling distribution service, but also in managing all the features and model retraining, because you don’t want to manually retrain each of these models separately. But yeah, do reach out, and also, I feel low-key bad about… I hope that I didn’t badmouth any of the tools. I know Databricks and Tecton are incredible companies. Mike and Mate are amazing, and I think they’re definitely among the best ML tools [inaudible 00:36:09] out there. And the only things I said about their limitations are not because of the companies themselves. It’s just that the space is very complex right now, and a lot of problems don’t have good solutions yet, but I know that they are working on it, and I’m very excited to see the future of these solutions.

Chip Huyen

Co-Founder & CEO

Claypot AI

Chip Huyen is an engineer who develops tools and best practices for machine learning production. She recently developed Claypot AI, a platform that leverages both batch and streaming systems for real-time machine learning. Through her work with Snorkel AI, NVIDIA, and Netflix, she has helped some of the world’s largest organizations deploy machine learning systems. She teaches “CS 329S: ML Systems Design” at Stanford. She has also published four bestselling Vietnamese books and is the author of “Designing Machine Learning Systems” (O’Reilly, 2022).
