High-Scale Feature Serving at Low Cost With Caching

The proliferation of AI model-driven decisions over the past few years comes with a myriad of major production challenges, one of which is enriching AI models with context at high scale (i.e., low latency, high throughput) at a low cost.

For instance, this challenge is faced by modelers creating recommendation systems that personalize feeds and need features for millions of users and hundreds of thousands of products to make >10^5 predictions per second.

To tackle the challenge of high-scale feature serving at low cost, we’re excited to announce the Tecton Serving Cache, a server-side cache designed to drastically reduce infrastructure costs of feature serving for machine learning models at high scale. By simplifying feature caching through the Tecton Serving Cache, modelers get an effortless way to boost both performance and cost efficiency as their systems scale to deliver ever-larger impact.

In this post, we’ll discuss the different use cases for caching features, give an overview of how to use the Tecton Serving Cache, and share some benchmarking results that show up to 80% p50 latency reduction and up to 95% cost reduction compared to the baseline of a common AI feature retrieval pattern.

Tecton Serving Cache use cases

Common AI applications, such as recommendation systems, personalized search, customer targeting, and forecasting, would significantly benefit from caching. These high-scale applications are willing to trade slightly stale feature values from the cache for major reductions in latency and cost.

For example, in an ecommerce site’s product recommendation model, one of the features might be the average number of daily cart additions by customers for a given product over a week. This feature would be unlikely to change drastically over an hour, so it could be cached for an hour.

More generally, caching features would be beneficial in the following scenarios:

High-traffic, low-cardinality key reads: Caching is ideal for use cases with high traffic (>1k QPS) where the same keys are repeatedly read. Repeat access of the same keys means a high cache hit rate, which reduces response times and the access cost for a request.
Complex Feature Queries: Caching would be beneficial for features that are slow or expensive to compute, such as features with large aggregation intervals that require long computation times (>100ms).

Enabling the Tecton Serving Cache in two simple steps

To use the Tecton Serving Cache, simply add two pieces of configuration: one in the Feature View and one in the Feature Service that contains the cached Feature View. Modelers can configure the Tecton Serving Cache to retrieve pre-computed feature values from memory.

from tecton import CacheConfig, batch_feature_view, FeatureService

cache_config = CacheConfig(max_age_seconds=3600)

@batch_feature_view()
def my_cached_feature_view(cache_config=cache_config):
    return

fs = FeatureService(
    feature_views=[my_cached_feature_view, ...],
    name="cached_feature_service",
    online_serving_enabled=True,
    enable_online_caching=True,
)

The max_age_seconds parameter in the CacheConfig determines the maximum number of seconds a feature will be cached before it is expired.
The enable_online_caching parameter determines whether the Feature Service will attempt to retrieve a cached value from cached Feature Views. If a Feature View with cache options set is part of a Feature Service with caching disabled, then that Feature View will not retrieve cached values.
The include_serving_status=true metadata option in a request helps verify that a value is being retrieved from the cache.
The server response metadata will include a status field that indicates whether the value was retrieved from the cache or not.

Benchmarks

To quantify the latency and cost reduction that the Tecton Serving Cache delivers, we ran a set of tests. In this benchmark, we queried sample feature services to measure the performance with different simulated cache hit rates, a measure of how frequently features can be found in cache memory rather than needing to access the online feature store. Several common AI applications, such as recommendation systems, typically have high queries per second (QPS) and high cache hit rates, therefore the benchmarks below are done with 10,000 QPS.

Latency Reduction

The table below lists the feature retrieval latencies (P50, P90, P95, P99) for 10,000 QPS for different cache hit rates.

Total QPS	Cache Hit Rate	P50 FS Latency	P90 FS Latency	P95 FS Latency	P99 FS Latency
10,000	0%	7ms	12ms	16ms	29ms
10,000	50%	5.5ms	7.5ms	9ms	22ms
10,000	95%	1.5ms	5.5ms	6ms	8ms

NOTE: These latencies do not include the network latency to/from the client; they include only the time it takes for the Feature Server to compute the feature values.

DynamoDB Cost Reduction

This online store read cost is a major component of the total cost of running high-scale, real-time AI applications. The table below lists the monthly read costs of DynamoDB while using the Tecton Serving Cache at different cache hit rates for 10,000 queries per second.

Total QPS	Cache Hit Rate	DynamoDB Read Cost/Month
10,000	0%	$36700
10,000	50%	$18350
10,000	95%	$1835

💡 These results show that using the Tecton Serving Cache with a high cache hit rate significantly reduces costs and improves overall latency. When cache hit rates are very high, the p50 latency is reduced by up to 80% and DynamoDB costs are reduced by 95%.

The benchmark details are as follows:

Feature Service contains 10 Aggregate Feature Views, each backed by a DynamoDB online store on us-west-1 with 3 Aggregate Features (sum, count, and mean), for a total of 30 aggregate features.
Each aggregate feature has a time window of 30 days and a 1-day aggregation interval.
Each Feature Service uses 3 entity keys.
- Seven of the feature views use 2 entity keys and 3 of the feature views use the other entity key.
Each load test was run with 10 minutes of random traffic at a constant 10,000 QPS.

A glimpse into some optimizations

Our caching methodology employs Redis as a backend, utilizing Redis Hashmaps to significantly enhance efficiency and performance over traditional Redis keys.

We use entity-level caching in the Tecton Serving Cache because both Feature View-level and Feature Service-level caching exhibit limitations. Feature View-level caching, which generates one key per feature view plus entity key combination, results in an excessive number of keys, consuming more storage and computing resources. In contrast, Feature Service-level caching, involving one key per Feature Service and entity key combination, is overly precise. This specificity can be a drawback, as many models share features, and any modification to a Feature Service might invalidate the entire cache, leading to performance issues.

Entity-level caching strikes a balance between these two approaches, aligning well with most caller patterns in AI applications. This method significantly reduces the number of keys requested from Redis, thus easing the load on the backend. Additionally, it uses less storage compared to Feature Service-level caching, making it a more efficient and resource-conscious option in various machine learning scenarios. The diagram below shows how a Feature Service with caching enabled containing Feature Views derived from Item and User entities would interact with entity-level caching.

What’s next

At Tecton, one of our highest priorities is to improve the performance and flexibility for production AI applications. In the coming months, we will add improvements to the Tecton Serving Cache along these dimensions:

More flexibility
- Request-level cache directives to control caching behavior, including functionality to explicitly skip the cache for better debuggability.
- More options to control caching behavior.
Better performance
- Further improvements to the latency of cache retrieval.

We can’t wait to see the Tecton Serving Cache power more production AI use cases—reach out to us to sign up the private preview!

Add Your Heading Text Here

High-Scale Feature Serving at Low Cost With Caching

Tecton Serving Cache use cases

Enabling the Tecton Serving Cache in two simple steps

Benchmarks

A glimpse into some optimizations

What’s next

Follow Us

Book a Demo

Contact Sales

Request a free trial

Tecton Serving Cache use cases

Enabling the Tecton Serving Cache in two simple steps

Benchmarks

A glimpse into some optimizations

What’s next

Related Posts

Fraud Doesn’t Wait: Accelerating AI-Driven Detection with Latency Budgets

Boost AI Fraud Model Performance with 3rd Party Data

Unlocking Real-Time AI for Everyone with Tecton

Follow Us

Book a Demo

Contact Sales

Request a free trial