We recently sat down for a fireside chat with Renault Young, a software engineer on Plaid’s small but mighty ML infrastructure team. Plaid provides an API that enables businesses and apps to integrate with their users’ financial institutions; for example, Plaid powers the experience that lets you connect your bank account to Venmo.
In this chat, we focused on how Plaid built Signal, an ML platform that powers payment fraud detection and prevention. We explored how the team overcame technical challenges and implemented MLOps best practices with Tecton’s feature platform. This article shares the key takeaways, and you can watch a full recording of the chat here.
Plaid’s ML journey
In 2020, Plaid had fewer than 10 ML use cases, all productionized ad-hoc without a platform. Their ML platform has evolved dramatically in just a few years, and they are now in the process of migrating from a homegrown feature store to using Tecton as their feature platform. Plaid’s models rely on a mix of real-time and batch-computed features, all feeding into real-time predictions.
With a growing ML team, Plaid is focused on diverse use cases ranging from structured-data ML to deep learning. These include:
- Applying NLP/LLMs to large-scale bank transaction data with description fields
- Allowing FinTech customers to predict cash flow, such as recurring expenses and deposits, for Plaid’s lending products
- Building the ML infrastructure for Signal, an application that predicts the risk associated with ACH transactions
Building Signal for financial transaction risk prediction
Plaid Signal evaluates an ACH transaction and returns a risk assessment. This is used to detect fraud in the form of account takeovers, as well as improve the user experience for ACH transactions, which are notoriously slow. If a FinTech business knows an ACH transaction is low risk, the company can provide a better UX by “prefunding” a user’s account so they can make a purchase right away, reducing the likelihood of user churn.
Signal has used an XGBoost model since 2020. The team continually iterates on model features, which range from basic, like how long a user has been banking with a particular financial institution, to more complex, like a user’s cashflow. Many features are pre-computed; others are on-demand and come from diverse data sources.
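A simple precomputed feature like account tenure can be derived directly from raw link records. Here is a minimal sketch of that idea in pandas; the table layout and field names (`user_id`, `institution_id`, `linked_at`) are illustrative assumptions, not Plaid’s actual schema:

```python
from datetime import datetime, timezone

import pandas as pd

# Hypothetical raw records: one row per (user, institution) link event.
links = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2"],
        "institution_id": ["bank_a", "bank_a", "bank_b"],
        "linked_at": pd.to_datetime(
            ["2021-01-01", "2021-06-01", "2023-03-15"], utc=True
        ),
    }
)


def account_tenure_days(links: pd.DataFrame, as_of: datetime) -> pd.DataFrame:
    """Days since the user's first link to each institution, as of `as_of`."""
    first = links.groupby(
        ["user_id", "institution_id"], as_index=False
    )["linked_at"].min()
    first["tenure_days"] = (as_of - first["linked_at"]).dt.days
    return first


features = account_tenure_days(links, datetime(2023, 6, 1, tzinfo=timezone.utc))
```

The "more complex" features like cashflow would aggregate over many such records per user, but the pattern of grouping raw events into per-entity values is the same.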
Plaid uses Tecton to store and serve features to Signal for online inference and offline training. Homegrown tools (e.g., for Shapley value comparison) help with feature engineering. To source labeled data for training and evaluation, Plaid relies in large part on Signal users to report their ACH decisions and outcomes.
Technical challenges & solutions with Tecton
Tecton has helped the Signal team overcome several challenges: bank transaction data is particularly difficult to work with, and the latency SLAs for their real-time models are tight.
Running compute on mutable data
Some precomputed features for Signal rely on bank transaction data. Since transactions can be updated or disputed, this data can remain mutable over long periods of time (e.g., greater than 14 days). Thus, precomputing features for the universe of possible Signal users means billions of transactions need to remain in memory to get 100% data accuracy. It’s an intractable problem—the payloads are too large, and if Spark Structured Streaming restarts, everything has to be loaded again.
Plaid uses Tecton’s Stream Ingest API to solve this problem. Tecton handles vast volumes of mutable bank transaction data efficiently, while guaranteeing that feature values are reproducible. This allows Plaid to maintain high data accuracy for Signal without worrying about the memory footprint tradeoff.
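The key semantic here is upsert: a later version of a transaction (an update or dispute) should replace the earlier one rather than accumulate alongside it, so features read only the latest state. Below is a toy in-memory sketch of those semantics, not Tecton’s actual Stream Ingest API or Plaid’s implementation:

```python
class KeyedTransactionStore:
    """Toy keyed store: the latest event per transaction ID wins."""

    def __init__(self) -> None:
        self._latest: dict[str, dict] = {}  # txn_id -> latest event version

    def upsert(self, event: dict) -> None:
        # A dispute or amount correction overwrites the prior version,
        # rather than being counted as a second transaction.
        self._latest[event["txn_id"]] = event

    def user_total(self, user_id: str) -> float:
        # Aggregates see only the latest, non-disputed versions.
        return sum(
            e["amount"]
            for e in self._latest.values()
            if e["user_id"] == user_id and not e.get("disputed", False)
        )


store = KeyedTransactionStore()
store.upsert({"txn_id": "t1", "user_id": "u1", "amount": 100.0})
store.upsert({"txn_id": "t2", "user_id": "u1", "amount": 50.0})
# The same transaction arrives again weeks later, now disputed.
store.upsert({"txn_id": "t1", "user_id": "u1", "amount": 100.0, "disputed": True})
```

A managed platform moves this keyed state out of stream-job memory, which is what removes the restart and footprint problems described above.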
Generating training data for ML models
Generating training data for Signal is also complex, as the team needs to use a combination of backfills and logging feature values at inference time. Because the data for Signal is so mutable, Plaid maintains a separate Feature View in Tecton that uses custom time-snapshotted datasets to generate new experimental features and guarantee point-in-time accuracy.
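Point-in-time accuracy means each training row may only see feature values as they existed at the label’s timestamp, so nothing “leaks” from the future. A minimal sketch of that join with `pandas.merge_asof`, over hypothetical snapshot and label tables (not Plaid’s actual data):

```python
import pandas as pd

# Hypothetical time-snapshotted feature values: one row per (user, snapshot time).
snapshots = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1"],
        "feature_ts": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01"]),
        "txn_count_30d": [3, 7, 2],
    }
)

# Labeled events (e.g., reported ACH outcomes) at arbitrary times.
labels = pd.DataFrame(
    {
        "user_id": ["u1", "u1"],
        "event_ts": pd.to_datetime(["2023-01-15", "2023-02-20"]),
        "is_fraud": [0, 1],
    }
)

# merge_asof attaches, for each label, the most recent snapshot at or
# before event_ts -- never a later one.
training = pd.merge_asof(
    labels.sort_values("event_ts"),
    snapshots.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
)
```

The January label picks up the January snapshot and the February label the February snapshot, even though newer feature values exist in the table.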
Low-latency feature serving
When a fraudster takes over a user’s account and makes a large number of ACH transfers, it often happens within a span of minutes. Plaid takes advantage of Tecton’s On-Demand Feature Views to get very low-latency accurate features from transaction data that has just been created. This is essential to help Plaid meet their API SLAs (in the range of seconds).
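An on-demand feature of this kind is computed at request time from events too fresh to have been precomputed. A minimal sketch of such a transformation, with illustrative names and window sizes that are assumptions, not Plaid’s or Tecton’s actual schema:

```python
from datetime import datetime, timedelta


def recent_transfer_velocity(
    request_ts: datetime,
    recent_events: list[dict],
    window: timedelta = timedelta(minutes=10),
) -> dict:
    """Request-time feature: ACH transfer count and volume in a short window.

    Computed at inference time from just-created events, rather than read
    from a precomputed batch value.
    """
    in_window = [e for e in recent_events if request_ts - e["ts"] <= window]
    return {
        "transfers_last_10m": len(in_window),
        "amount_last_10m": sum(e["amount"] for e in in_window),
    }


now = datetime(2024, 1, 1, 12, 0)
events = [
    {"ts": now - timedelta(minutes=2), "amount": 200.0},
    {"ts": now - timedelta(minutes=5), "amount": 300.0},
    {"ts": now - timedelta(minutes=30), "amount": 50.0},
]
feature = recent_transfer_velocity(now, events)
```

A burst of transfers within minutes, as in the takeover scenario above, shows up in this feature immediately, which is why request-time computation matters for seconds-range SLAs.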
MLOps best practices
Tecton has helped Plaid implement the following MLOps best practices:
- Feature documentation. As part of the migration to Tecton, the Signal team noted some features were named inconsistently. The feature platform encourages the team to clearly and consistently document aspects of a feature like its name, description, purpose, and lookback window.
- CI/CD for features. This includes testing, monitoring, and following requirements around the encryption and decryption of sensitive data.
- Consolidated infrastructure. This is especially important for regulated industries like FinTech. With a centralized platform, all the data governance work needs to be done just once, including the standardization and observability requirements that come with it.
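Declaring features as data is what makes the documentation and CI/CD practices above enforceable: a CI job can lint every definition before it ships. Here is a hypothetical declarative spec and linter in that spirit; it is a sketch of the pattern, not Tecton’s actual SDK:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical declarative feature definition checked in CI."""

    name: str
    description: str
    owner: str
    lookback_days: int


def lint(spec: FeatureSpec) -> list[str]:
    """Return a list of policy violations (empty means the spec passes)."""
    errors = []
    if not spec.name.islower() or " " in spec.name:
        errors.append("name must be lower_snake_case")
    if not spec.description:
        errors.append("description is required")
    if spec.lookback_days <= 0:
        errors.append("lookback_days must be positive")
    return errors


good = FeatureSpec("txn_count_30d", "ACH transactions in last 30 days", "ml-infra", 30)
bad = FeatureSpec("TxnCount", "", "ml-infra", 0)
```

Consistent naming, required descriptions, and explicit lookback windows then come for free on every new feature instead of depending on review discipline.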
Advice for ML teams considering a feature platform
Renault suggested that a feature platform like Tecton is a good choice if you have an ML infrastructure team of at least a few people and you’re working with structured data. His team appreciates Tecton’s declarative configuration, which makes CI/CD for features easy. Additionally, Tecton provides highly reproducible feature values, which has been a major benefit.
According to Renault, you may find it harder to use a feature platform if many of your feature transformations are non-standard or rely on custom libraries. However, this can often be mitigated with Tecton’s On-Demand Feature Views and other Feature Views that support custom dependencies injected into the runtime environment.
If you think Tecton could help your team scale its ML platform, schedule a demo today. And if you enjoyed this discussion, don’t forget to register for our free apply(ops) conference on November 14, where you’ll hear how other practitioners from companies like HelloFresh, Meta, and Uber have solved operational challenges for production ML.