ralf: Real-time, Accuracy Aware Feature Store Maintenance

apply(conf) - May '22 - 10 minutes

Feature stores are becoming ubiquitous in real-time model serving systems; however, there has been limited work on understanding how features should be maintained over changing data. In this talk, we present ongoing research at the RISELab on streaming feature maintenance that optimizes both resource costs and downstream model accuracy. We introduce a notion of feature store regret to evaluate the feature quality of different maintenance policies, and test various policies on real-world time-series data.

Thanks everyone for having me. My name’s Sarah. I’m a PhD student at the RISELab, soon to be Sky, and I’m here to talk about ralf, which is real-time, accuracy and lineage-aware featurization.

So this audience is really familiar with feature stores, so I’ll just go over this really quickly. In any kind of feature store system, we typically have some kind of incoming data sets that we transform with some kind of feature transformation. We then store those features in an offline and online store, which then serve them to model training and model serving systems.

So this talk is going to be specifically about feature maintenance, so this feature transformation step, where we transform raw data into features, and then also the tools that we use to keep features fresh as data is changing, and we want those changes to be reflected in the feature stores.

So for example, say that we have some raw data, like some user product click stream or click history, and we define some feature table or something like a user embedding from this data. So as we get new data, like new clicks, we want to make sure that these features are basically kept up to date with the new data. But one question is, how often do we actually want to process this incoming data?

So people often compute features with different frequencies. You might have very infrequent updates, like once every 24 hours you process your data, or you might also have a streaming system where every single new event or new click that you get, you recompute a feature. So there’s kind of a trade-off curve where, on one hand, you have lower cost with doing infrequent updates, but also more feature staleness, but then for streaming updates or more frequent batch updates, you have higher costs, but better feature freshness. So it’s often unclear where exactly should you land on this trade-off curve for different applications.
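To make this trade-off concrete, here is a minimal sketch, not taken from the talk. It assumes one new event arrives per timestep for a single key, and it measures staleness as the number of events not yet reflected in the stored feature. The function name `simulate_periodic_updates` and both of its parameters are hypothetical.

```python
from typing import Tuple


def simulate_periodic_updates(num_steps: int, interval: int) -> Tuple[int, float]:
    """Return (number of updates, average staleness) for one feature key.

    Assumption: exactly one new event arrives at every timestep, and the
    feature is recomputed every `interval` timesteps.
    """
    updates = 0
    total_staleness = 0
    pending = 0  # events seen since the last recomputation
    for step in range(1, num_steps + 1):
        pending += 1
        if step % interval == 0:
            updates += 1  # pay the cost of one recomputation
            pending = 0   # the stored feature now reflects every event
        total_staleness += pending
    return updates, total_staleness / num_steps


# Streaming-style updates: highest cost, no staleness.
streaming = simulate_periodic_updates(num_steps=48, interval=1)
# Infrequent batch updates: far fewer recomputations, much staler features.
batch = simulate_periodic_updates(num_steps=48, interval=24)
```

Under these assumptions, recomputing on every event costs one update per timestep, while recomputing every 24 steps cuts the update count sharply at the price of features that lag behind the data most of the time.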

Another issue is which features are actually worth updating. A lot of features actually can be deprioritized and aren’t that important for application metrics. So for example, many features in a feature table might never be queried, since you often see sort of a power law distribution with very hot keys getting queried very often, and other keys or feature values almost never getting queried. So if a feature is never actually queried or used by a model, it’s kind of a waste to allocate compute resources to keeping it up to date.

In addition to that, a lot of features are actually changing pretty slowly or don’t really have very meaningful changes. So even if they become stale, it’s not that big of a deal. So if the staleness of the feature doesn’t actually affect the model’s prediction quality, then again, it’s not really that important to update these features.

So coming back to feature maintenance and how to optimize it, we want to consider first the cost. So can we afford higher or lower costs for doing feature maintenance? And then also, how much quality do we want of the queried feature? So I’ll kind of go into what this means a little bit later on. But we basically have this trade-off curve between cost and quality where, oftentimes, the more compute resources we have, the higher quality we can make our features. And so we want to navigate where on this trade-off curve we want to be.

And the other thing that we’ve been thinking about with our research agenda in the lab is how can we improve this trade-off curve by being smarter about what features we update, so basically, deprioritizing things that aren’t important and prioritizing features that are really important to the model.

So first let’s start off with defining feature quality. So this is kind of an ambiguous metric. People kind of talk about staleness or how approximated a feature is. But in this context, we define feature quality in terms of how much worse models perform with the actual features versus the ideal features, where the ideal feature is basically the feature that you would’ve calculated with some infinite compute budget, so maybe like the perfectly fresh feature that has every single data point incorporated into it with zero latency, and the actual features are the features that you can affordably maintain, so maybe the features that you end up getting if you’re running some hourly batch process to update the features.

So with this notion of feature quality, we define something called feature store regret, which is basically the prediction error with the actual features minus the prediction error with the ideal features. So basically, the worse your featurization is, the larger you would expect this regret to be because you’re basically losing out on model accuracy by having bad or low-quality features.
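As a small illustration of this definition, here is a sketch that computes regret as the prediction error with the actual features minus the prediction error with the ideal features. Mean squared error is an assumption here, since the talk does not name a specific error metric, and the function names are hypothetical.

```python
from typing import Sequence


def mean_squared_error(predictions: Sequence[float], labels: Sequence[float]) -> float:
    """Average squared difference between predictions and labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


def feature_store_regret(
    labels: Sequence[float],
    preds_with_actual_features: Sequence[float],
    preds_with_ideal_features: Sequence[float],
) -> float:
    """Error with the actual (affordable) features minus error with ideal features."""
    actual_error = mean_squared_error(preds_with_actual_features, labels)
    ideal_error = mean_squared_error(preds_with_ideal_features, labels)
    return actual_error - ideal_error


# Made-up numbers for illustration only.
labels = [1.0, 2.0, 3.0]
preds_ideal = [1.0, 2.0, 3.0]   # predictions made from perfectly fresh features
preds_actual = [1.0, 2.0, 5.0]  # predictions made from stale features
regret = feature_store_regret(labels, preds_actual, preds_ideal)
```

When the stale features produce the same predictions as the ideal ones, the regret is zero, which matches the intuition that staleness only matters when it hurts the model.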

So if you’re able to actually measure the accuracy of your model, then you can basically approximate your feature store regret as features are becoming more stale. So for example, you might have some features where the feature staleness has very little effect on prediction quality. So even if you have a stale feature or a fresh feature, it’s basically the same model loss, overall model loss. But you might also have other features where the regret or the sort of loss of the stale features rapidly increases as time goes on. So maybe this feature is really quickly changing or the model’s very sensitive to staleness. So this is a case where regret rapidly increases with the staleness of the features.

So we basically want to represent that this top feature, or things in this top category, are less important to update, versus features where regret is increasing, which are much more important to update, and basically do prioritization in this way. So we do this by basically tracking the cumulative regret that we observe for some feature. So if a model’s doing fine with stale features, like in the top row, the cumulative regret is very low, whereas the cumulative regret increases rapidly if you have a lot of sensitivity to staleness, or if that feature is very important.

So with this, we designed a scheduling policy, where basically, for some model serving system, we track the error for predictions made with different features to basically calculate some kind of error feedback, which we then pass back to the feature maintenance system. So for every single key-value pair in some feature table, we basically track the cumulative regret since the feature was last updated. And then we basically take the keys with the highest cumulative regret, update them, and set that cumulative regret back to zero. So it’s a bit of a different way of choosing which features to update when.
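The policy described here can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not ralf's actual implementation or API: the class `RegretScheduler`, its methods, and the per-step `budget` parameter are all hypothetical names.

```python
import heapq
from collections import defaultdict
from typing import Dict, Hashable, List


class RegretScheduler:
    """Tracks cumulative regret per key and picks the keys to update next."""

    def __init__(self) -> None:
        # Regret accumulated for each key since its feature was last updated.
        self.cumulative_regret: Dict[Hashable, float] = defaultdict(float)

    def record_feedback(self, key: Hashable, regret: float) -> None:
        """Add error feedback from the serving system for one prediction."""
        self.cumulative_regret[key] += regret

    def select_keys(self, budget: int) -> List[Hashable]:
        """Return up to `budget` keys with the highest cumulative regret.

        The selected keys are assumed to be recomputed by the caller, so
        their cumulative regret is reset to zero.
        """
        top = heapq.nlargest(
            budget, self.cumulative_regret.items(), key=lambda item: item[1]
        )
        keys = [key for key, _ in top]
        for key in keys:
            self.cumulative_regret[key] = 0.0
        return keys


scheduler = RegretScheduler()
scheduler.record_feedback("user_a", 0.1)
scheduler.record_feedback("user_b", 2.0)
scheduler.record_feedback("user_c", 0.5)
scheduler.record_feedback("user_b", 1.0)
first_round = scheduler.select_keys(budget=2)   # user_b and user_c go first
second_round = scheduler.select_keys(budget=1)  # user_a is next
```

In this sketch a key that is never queried receives no feedback, so it never accumulates regret and is never prioritized over keys that do, which is the deprioritization behavior motivated earlier in the talk.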

So we recently submitted a paper on this to a conference, a databases conference, and we used a couple different data sets, but the main two were an anomaly detection data set and a recommendation data set, the MovieLens data set. And we found that, overall, we’re able to have higher accuracy for the same cost by doing this kind of regret-optimized scheduling. So going back to this trade-off curve of different errors and different cost amounts, this is a simulated result where we allow our system to do different numbers of feature updates per timestep. And you can see that the regret-optimized policy basically kind of shifts down the curve like we wanted it to and improves this trade-off between cost and accuracy.

Another result that we have from this is basically freshness and accuracy don’t always correlate perfectly. So obviously, typically, you want to have fresher features in the hopes that you’ll have more accurate predictions. And they are certainly correlated, but we actually found that in our experiments, this regret optimized policy, it had a slightly worse average staleness for the queried features, but it still had much lower average prediction error. So even though the quality of the features and the staleness of the features are closely correlated, and freshness is a pretty good sort of approximation of it, it isn’t perfect. And so then this might be a reason to actually also be looking at the model errors for evaluating your feature quality.

So we’ve implemented some of these scheduling ideas in ralf, which is a framework for feature maintenance. So we basically have a declarative DataFrame API for defining features just in Python code, and also allow for fine-grained control over managing feature updates. So you can basically sort of implement different ways of prioritizing which features you want updated over others, like the policy that I described earlier. And ralf is also specifically tailored for machine learning feature operations. So it’s natively in Python and Ray.

And yeah, I’ll go through this quickly since it’s not that relevant, but yeah, ralf kind of borrows from a couple core ideas. So basically, this DataFrame API, so you can sort of treat features and data tables as static data frames. And we do incremental view maintenance. So basically as new events come in, we immediately propagate those updates to the feature table. And then we also build on top of Ray actors. So that gives us an existing distributed runtime and also makes it a lot easier to integrate with the existing ML ecosystem.

So yeah, we’re actively looking for collaborators and new workloads. I think as one of the other talks mentioned, there is a bit of a lack of workloads in academia. So we’d definitely love to have more industry collaborators. If you think this might be relevant, please feel free to reach out to me at this email. And yeah, really appreciate you guys having me.

Sarah Wooders

PhD Student

UC Berkeley - RISELab

Sarah Wooders is a second year PhD student in UC Berkeley's RISELab, advised by Joseph Gonzalez and Ion Stoica. Her current work is focused on real-time feature stores. Before Berkeley, she founded Glisten AI, which builds AI to categorize and tag product data and was part of Y Combinator's W20 batch. Her undergraduate degree is from MIT, where she studied computer science and math and did research at CSAIL in the Supertech Group. While at MIT, she directed Code for Good, helped organize HackMIT, and interned at MemSQL, MobLab, and Bloomberg.
