Real-Time Machine Learning Challenges

We’ve written about why real-time data pipelines are so challenging to build. Companies often turn to feature stores or feature platforms to solve many of these pain points. Some recent examples of this include CashApp using a feature platform to scale machine learning, and Instacart using its in-house machine learning platform to triple the number of applications a year released into production.

They do this because at some point, many companies realize that building real-time machine learning capabilities is easier said than done. For instance, when you first approach a problem, it may not look too bad, especially along the happy path (when nothing unexpected happens).

The biggest challenges in realizing value from real-time machine learning stem from the need to maintain service levels (uptime, serving latency, feature freshness) while handling the unhappy paths (when the unexpected—aka edge cases—happen).

In this post, we’re going to talk about the common challenges of building real-time machine learning (ML) in production and do a deep dive into why they’re so challenging. If you’re already familiar with the challenges, you can skip right to “Feature Stores vs. Feature Platforms: Which One Is Right for Your Team?” where we talk about how companies solve for these problems.

Common real-time machine learning problems

There are a lot of problems people run into when building infrastructure to power real-time ML. Let’s take a look at some of them.

Building reliable streaming and real-time machine learning data pipelines is hard

For real-time ML, you need to build production-grade streaming pipelines that reliably provide transformed feature values to power model inference while accounting for edge cases that come up. This requires a deep understanding of streaming compute system internals and making the right tradeoffs.

Streaming compute systems often (in cases of aggregations and group-bys) maintain in-memory state for events coming in while periodically checkpointing state to disk and storage (e.g., S3). This state can be hard to manage and may cause a streaming pipeline to eventually crash over time, especially when there are data skew issues or large aggregations / group-bys.

Some concepts that affect streaming state management include:

Watermarks: Older events are discarded based on a threshold specified by the user. If the watermark is too high, you could be accumulating a lot of state in the internal store and cause memory pressure. On the other hand, if you keep the watermark too low, you risk dropping events if your stream pipeline gets restarted or goes down for a duration less than the watermark set.
Checkpointing intervals: Checkpointing is usually done to handle failure modes where if the pipeline goes down, it can recover from the last checkpoint and not have to recompute the stream from the beginning. If users set a low checkpointing interval and are checkpointing to S3, then it can cause a lot of S3 operations and hurt processing latency. On the other hand, if you keep the checkpointing interval too large, you’ll have to recompute a large part of the stream if a job goes down.
Internal state stores: Managing internal state storage for streaming compute can be quite challenging. The memory consumed by these state stores depend on operations being performed like group-by and aggregations. Large aggregations or group-bys can affect internal state memory causing out-of-memory exceptions (OOMs). Additionally, the streaming application code might itself have memory leaks, which can cause failures due to accumulation of memory over time.
Data skew / hot keys: This is related to the internal state stores. When you’re running aggregations / group-bys that partition data based on keys, you’ll inevitably run into data skew issues, where some keys have much more data than others. This can cause pipelines to also OOM. Traditional solutions here include salting and building partial aggregations.

Spiky throughput

Spiky throughput can cause streaming job clusters to run out of memory. Stream jobs are always running and the volume of events they’re handling can vary throughout the day and week. For example, e-commerce applications might have more activity during the weekend; therefore, the streaming pipelines need to account for this variance in stream ingestion throughput.

When you first approach this problem, you may not need to worry about these in the happy path. However, for a production streaming pipeline you need to build and solve for many edge cases. This often means you’ll have to deal with a healthy number of failing jobs and alerts until you’re able to get things reliably running for a single set of features—and then rinse and repeat for each new data pipeline that’s needed.

Reliably retrieving fresh features at low latency is hard

The streaming pipeline handles ingestion of stream events, transformations, storage and serving. From a high level, this usually involves:

Compute to ingest and transform the events to features
Low latency storage to store the features
Compute to serve these features out

In addition, you’ll likely have pre-computed features from batch data sources that you need to regularly re-compute and make available for model inference.

Decoupling storage and compute

If there is tight coupling between storage and compute, then high ingestion throughputs can start affecting read performance. The feature store needs to handle varying ingestion throughputs (write) and serving throughputs (read) while maintaining low serving latency and feature freshness.

This is typically non-trivial, especially in a system where the compute for ingestion, storage, and compute for serving are coupled. High ingestion throughputs can also occur with large backfills and can affect read performance. The ability to elastically scale the compute layer and storage individually leads to better performance and efficiency.

Low-latency storage

The low-latency storage of the feature store itself can be non-trivial to manage and operate. While using an in-memory store seems like an easy option, storing large amounts of data in-memory can very quickly drive up your store costs.

The operations you need the store to support are generally point lookups, batch retrieval, and small-range scans for any custom aggregations. The data types you need to support can vary from integers and strings, to more complex types like structs and maps. Supporting more complex feature data types may involve additional serialization and deserialization overhead. Security is also critical for storage since you want to enforce TLS-based authentication, encryption both at rest and in transit. Expiring feature values in the online store may also be needed to keep costs low. Sometimes, you’ll also need to add an additional caching layer to handle spiky throughput or hot keys.

High throughput online serving

Serving throughput can be significantly high with spiky traffic, and the feature store needs to be scalable to handle this. One example of this is recommendation models where you often need to do batch retrievals for all the candidates. The number of candidates here can be in the hundreds, which can significantly increase the throughput requirements of serving. The online serving layer needs to be scalable and compute efficient to be able to handle this high throughput while maintaining low latency at all times. Maintaining low latency for the entire batch can be quite challenging as batch size increases due to tail latencies.

Ensuring service-levels at scale is hard

Streaming pipelines are continuously running, often powering critical applications like fraud detection; therefore, they have strict requirements around:

Feature freshness. The total time from the event appearing on the stream (Kafka / Kinesis) to the time it is available for serving. This usually involves reading the record, deserializing and transforming it, and storing it in a low latency store for serving. Often, all this needs to happen in under 10s (ideally <1s)
Serving latency. The total time to answer a “get features” call made by the application to perform an inference. This involves reading the features from the store, time filtering, and performing any request-time transformations specified by the user. Often this needs to happen in under 100 ms (ideally < 25ms)
Uptime and availability. The total time the streaming pipeline was up, including serving. This involves the pipeline constantly ingesting events from the stream and the “get features” call is able to serve requests. It’s common to see availability requirements of 99.95%

There’s an operational burden here to maintain uptime + freshness + latency SLAs. This is particularly challenging as the number of use cases grows or as teams iterate on existing models—both situations result in an explosion in the number of streaming pipelines.

Streaming pipelines may also need to be individually tuned and stress tested to account for potential feature-specific issues like memory pressure from large window aggregations. Additionally, teams often need to have continuous monitoring and alerting setup for these jobs. One example is alerting when feature freshness is beyond a user-specified threshold.

Mitigating training / serving skew is hard

Training / serving skew refers to any time a trained model doesn’t perform as expected online. Once you realize a model isn’t performing as expected, then you typically need to dig through to see what’s causing issues (e.g., by visualizing / slicing feature values, inspecting data pipelines, manually inspecting examples). It’s a time-consuming process with lots of trial and error.

In our experience, the most common causes of training / serving skew are:

1) Time travel / late arriving data

This is where training data leaks feature values that would not have been available. It’s not just ensuring that event timestamps line up. Consider the following example:

Feature: You have a feature called num_user_transactions_in_last_day.

Data generation: At training, you use a windowed aggregation on the transaction timestamp to figure out this count. At serving time, you serve features computed from a daily scheduled job (running at midnight UTC the previous day) that computes the latest values and store this in a database.

At first glance, this seems reasonable. But then the model doesn’t perform as expected, and you realize the problem: At serving time, you only have the feature values as of each midnight. But at training time, you aren’t aware of when these precomputed feature values are made available online, so you’re accidentally leaking feature values. For example, you may report that the user has had no transactions in the last day and report fraud, when in reality, your feature values are one day stale and the user actually has been transacting frequently this past day.

Two things need to happen now:

Figure out how bad this worst-case daily freshness is. If it impacts model performance, you’ll need to run these daily jobs more frequently or rely on streaming feature values.
For each feature, make data scientists aware of how data pipelines are scheduled so they can take into account the effective timestamp when the data would have been available online when generating training data.

2) Inconsistent transformation logic

This can range from different transformations being written (e.g., stream logic written in Spark, batch logic in Snowflake) to training data filtering out examples that show up at inference time.

To resolve this, you need to inspect your transformation logic (including filters, aggregations, joins, etc.) to ensure consistency. You might also want to plot distributions of feature values at intermediate stages in offline + online pipelines to understand where the inconsistencies come up.

3) Data drift

This is where the training data being used is outdated and does not reflect the current data in production.

To resolve this, you typically need to detect data drift and then retrain the model. In certain cases, it may also require more complex changes (e.g., retraining upstream ML models).

Conclusion

There are many challenges when building real-time machine learning capabilities—and not all of them can be solved with a solution built in-house. Check out “Feature Stores vs. Feature Platforms: Which One Is Right for Your Team?” to learn why many companies opt to use a feature platform like Tecton or feature store like Feast.

Real-Time Machine Learning Challenges

Common real-time machine learning problems

Building reliable streaming and real-time machine learning data pipelines is hard

Reliably retrieving fresh features at low latency is hard

Ensuring service-levels at scale is hard

Mitigating training / serving skew is hard

Conclusion

Follow Us

Book a Demo

Contact Sales

Request a free trial

Common real-time machine learning problems

Building reliable streaming and real-time machine learning data pipelines is hard

Reliably retrieving fresh features at low latency is hard

Ensuring service-levels at scale is hard

Mitigating training / serving skew is hard

Conclusion

Related Posts

How Tecton’s Feature Platform Supercharged HomeToGo’s Ranking

Enhancing LLM Chatbots: Guide to Personalization

Why RAG Isn’t Enough Without the Full Data Context

Follow Us

Book a Demo

Contact Sales

Request a free trial