Tecton

Powering ML Fraud Detection Models With Advanced Aggregations

apply(conf) - May '23 - 30 minutes

ML models are an essential tool in combating fraud. They can improve fraud detection rates, reduce false positives, and be re-trained to identify new fraudulent behavior as fraudsters adapt.

However, fraud models require high-quality data that can be difficult to process and serve in production. Features typically require aggregations on streaming and real-time data, which are complex to build, compute intensive, and difficult to process at low latency.

In this talk, Mike will walk through a sample use case and show how aggregations are typically processed. He’ll then show how feature engineering frameworks, like the one offered by Tecton, can simplify the development of these features. He’ll explain how these frameworks are orchestrated under the hood to process data with <1 second freshness, serve data with <10 ms latency, and reduce processing costs, all while ensuring consistency of offline and online data to improve model accuracy.

Mike:

Really excited to be chatting with you all again. Let me click over to my slide. What we’re talking about today is basically a new feature engineering paradigm that we’ve seen be pretty useful for teams that are building real-time decisioning systems, particularly for fraud detection and risk-type decisions. I want to talk through that. I’ll introduce some concepts along the way, and we’ll have some Q&A at the end, so if you’ve got questions we’ll go through them.

Mike:

Let’s start with the risk decisioning problem, the main problem that ML teams are trying to solve. You have some data, you’ve got an ML model, and you’re trying to generate a risk score, often in real time. There’s something that happens in between. What happens in between here? Well, we’re building features. That’s the whole idea. We’re building a variety of features based on our data that can provide the right information to our ML models so that we can make the right risk prediction.

Mike:

It’s not as simple as this, because we also have to train our models. So all of these features also have to fit into training data sets that are accurate. You’ve probably dealt with this problem or seen somebody talk about it many times. There are effectively two points of consumption for feature data: model serving and model training. They’re different. One is real-time and operates on one data point at a time; the other is batch, it’s slower, it operates on a lot of data at once, and it needs to be point-in-time correct. This is a feature engineering challenge and a data engineering challenge.

Mike:

When we say feature, what do we mean by this? What is this feature thing? A feature is something that implements this interface. I’m building up to describing a new way to do some feature engineering, and it starts with this interface. What’s the interface? We need feature data for predictions, so serving feature data for inference; we need feature data for training; and, obviously, we also need to read the raw data in the business. There’s some transformation step that happens in here. In your model you have lots of features. There are a couple of other things you might want to consider as part of this interface. I need to backfill a feature, down at the bottom here: when I create a new feature, I need to get all of the historical values for it. And then a variety of management and maintenance stuff. Can I register, share, and monitor this feature?
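
As a rough sketch, that interface might look something like this in Python. The class and method names here are illustrative, not any particular platform’s API:

```python
from abc import ABC, abstractmethod
from datetime import datetime
from typing import Any

import pandas as pd


class Feature(ABC):
    """Illustrative sketch of the interface a production feature needs to satisfy."""

    @abstractmethod
    def transform(self, raw_record: dict) -> Any:
        """Turn raw business data into a feature value."""

    @abstractmethod
    def get_online(self, entity_id: str) -> Any:
        """Serve the freshest value for one entity at inference time."""

    @abstractmethod
    def get_training_data(self, entity_ids: list[str], as_of: datetime) -> pd.DataFrame:
        """Return point-in-time-correct values for building training sets."""

    @abstractmethod
    def backfill(self, start: datetime, end: datetime) -> None:
        """Recompute historical values when the feature is first created or changed."""
```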

Mike:

Okay. So far so good. What are some examples of features that we often see used in risk models? One group is dimensions, which is just a basic lookup. Hey, what country is this user in? Pretty easy. That could be a useful feature. Representations: embeddings about a user, embeddings about a merchant, let’s say. Aggregations are a really big one. Lifetime aggregations: how much money has this person spent with us in their whole life? Time-window aggregations: how many transactions did this person make in the last five minutes? If it’s 1,000, that’s pretty obviously fraud. Session aggregations: how many pages has this person viewed in the current session they’re in? And then the last group is a bit of a miscellaneous bucket; the theme is that real-time logic is being applied.

Mike:

So policy logic: is this transaction permitted in this state or in this country, something like that? So applying some policies. Third-party lookups, like asking my identity verification provider what they think about this person, so that’s referring to an external system. And also featurizing user inputs. The user just signed up for insurance with us and they input a bunch of data; let’s run some featurization on that data as well. So this is a bunch of different types of features.

Mike:

Just as an example, we’ll talk about this aggregations bucket. I’m going to talk through aggregations as an example for the rest of this talk, but what we cover applies to all kinds of different features. If I want to build and implement a real-time aggregation as an ML feature today, what do I do? What’s the status quo way to implement this? Well, there are a couple of requirements, and this is for really productionized, industry-grade data pipelines that power an ML application. Number one, we need it to be fresh. We need the data to be up-to-date and served in real time, so the data is fresh and it’s also delivered quickly. It needs to be easy to develop. We can’t make it hard for our engineers or our data scientists to create this stuff; we need to go fast.

Mike:

It needs to be cost-efficient. Sometimes this stuff operates at crazy scale, so cost efficiency is important. It needs to be consistent. Online and offline, the data needs to be the same, otherwise you’re going to run into skew issues, train-serve skew issues, that will effectively degrade your model performance. It needs to be reliable for production. This thing can’t go down; it’s powering your fraud models, let’s say. You’re not okay with taking on any dependency that’s going to bring this thing down. And backfills have to be super easy. When you can backfill easily, you can iterate with new features really quickly. It’s really hard to do all of this, right? To build a feature pipeline that gets all of this done. It’s actually hard to get even three of these implemented, and so the Nirvana state is to get all of them supported.

Mike:

Let’s look at how someone might do this today. So for example, transaction count for a user in the past 30 minutes. It sounds like the simplest possible feature; it has a really simple definition. So let’s look at how it can be implemented. This is how someone might do it. They might connect to their payment microservice, send payment transaction information to Postgres, set up a SQL query to query Postgres upon a trigger, when model serving asks for some data, run that query, and deliver that data to the model service. And then they would also deliver that transaction data into Kafka, Kafka would log it to S3, and then you’d define, configure, and orchestrate a Spark job to run against that S3 data, understand the data, and generate point-in-time accurate versions of it to output training data. And then you do this for every feature, right? So it’s a lot of stuff. It’s, obviously, a lot of stuff. It’s like, whoa, that’s a lot for me to implement one feature.
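
Just to make the online half of that concrete, the serving-path query against Postgres might look something like this sketch. The table and column names are made up for illustration, and the Kafka-to-S3-to-Spark training path would be a second, separate implementation of the same logic:

```python
import psycopg2

# Online serving path only: count one user's transactions in the past 30 minutes.
# The 'transactions' table and its columns are hypothetical.
def transaction_count_last_30m(conn, user_id: str) -> int:
    query = """
        SELECT COUNT(*)
        FROM transactions
        WHERE user_id = %s
          AND created_at >= NOW() - INTERVAL '30 minutes'
    """
    with conn.cursor() as cur:
        cur.execute(query, (user_id,))
        return cur.fetchone()[0]

# Called by the model service on every scoring request, e.g.:
# conn = psycopg2.connect("dbname=payments")
# count = transaction_count_last_30m(conn, "user_123")
```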

Mike:

But let’s look at what’s actually hard about this. One, I need to make this cost-efficient, especially if I’m running this stuff at scale. Two, I now have two implementations to manage here. I’ve got two different pipelines that have to produce consistent data; this is the train-serve skew point. The data I’m making inferences with has to be the same as the data I’m doing model training with. I’ve got to make sure this query runs fast enough at peak times and at scale, and handle all the other production engineering concerns we might have. How do we get this stuff to be point-in-time accurate? I need to make sure there’s no leakage and that the time travel happens accurately. And then someone needs to revision, monitor, and own all of this, so it’s a lot of stuff to get done. When you look at this, you’re not thinking, “Oh, my ML team’s going to go super fast.”

Mike:

I just showed you this nice diagram with this cool feature box, and it didn’t look as scary as this. What happened to that nice thing? Well, top ML companies can go pretty fast with ML. So what do they do? How do they accomplish this without having to build all of that every time? What they do is they use feature platforms. They have infrastructure inside their companies that makes it really easy for people to engineer features and deploy them to production. A little bit of quick trivia: my team built the ML platform at Uber, it’s called Michelangelo, and we had a pretty sophisticated feature platform there. And I used to work on the one at Google too. At Tecton we have folks from all of these companies who’ve interacted with this stuff at one point or another. But these top companies, especially the FAANG companies, have this stuff, and they have a lot of people working on it.

Mike:

They have feature platforms, and those provide clean APIs for developing, operating, and managing features. Basically, they make it easy to implement and interact with a feature according to that interface we just talked about: get feature values online, get values offline, and then the other stuff that allows a data scientist to manage the thing easily. Register the feature, share it, deploy or backfill it, all of that. Monitoring you could put here too.

Mike:

Let’s actually take a step back here. This solves the problem of interacting with the feature and defining, registering, and managing it. It doesn’t really say much about how to actually build the core feature pipeline. That core feature pipeline can be quite complicated, and maybe that’s where a lot of the complexity lives; for example, all of the stuff we just saw about how to build an aggregation feature. What these companies do is they also build feature engines, which are basically prebuilt implementations of common, powerful features, so imagine a managed feature. You can build your own custom feature pipeline, but they also have these managed feature pipelines that make it really easy to get a really high-quality feature built and productionized really quickly.

Mike:

So I want to talk about this concept of feature engines, show you an example, and show how it can speed up ML for your risk team, your fraud ML team, so let’s get into that. They can make it really easy to build productionized, powerful feature pipelines. And you can imagine different groups of features, different categories of features, effectively having different feature engines associated with them. Think of them as different template feature pipelines that are already built and optimized to support those kinds of features.

Mike:

So let’s talk about this aggregations engine. We’re going to stay on aggregations today. Instead of doing the whole thing we just talked about, building an implementation in Postgres, building another one on Kafka and S3, figuring out all of those details, and then figuring out how to backfill, validate, and monitor all of this data, it’s way nicer if you have another option. The status quo is lots of code, duplicate implementations, engineering needs to be really involved, and it takes a lot of time. This is why it takes months to get something built and into production.

Mike:

But it’s a lot nicer if you can have a single definition, just one file to define and configure this stuff, and you get all of those requirements we looked at before out of the box. It’s automatically productionized, it’s online-offline consistent, and it supports backfills. It works at any scale in a cost-efficient way and it’s super fast; its freshness and serving latency are less than 100 milliseconds. And in the Tecton implementation of all this that we have, you get a bunch of bonuses: automated monitoring, one step to production.

Mike:

But the point is that if you can adopt a feature engine pattern, then you have very few components to manage, you can create a feature engineering experience that’s self-serve for data scientists all the way to production, and you can productionize instantly. It’s a really nice unlock for risk teams and fraud teams. And I’ll say that the vast majority of Tecton’s customers that are solving fraud or risk problems use our feature engines, especially for aggregations, because aggregations are one of the hardest problems to solve.

Mike:

So let’s look at what this delivers. If you do it right, it enables a simple and fast feature engineering workflow, and it combines that with industry-grade performance, reliability, and simple enterprise management. So how can we get all of that? I’ve been talking about this thing, but what actually is it? Let’s look at an example, and then we can imagine how it can work for other types of features as well. Instead of a generic feature, let’s talk about an aggregation feature. It hides everything behind the simple interface, the simple definition, and if we implement it well, it’s really easy to create and author these features.

Mike:

For example, there are two steps. This is a Tecton example. There are two steps to define an aggregation feature in Tecton. Step one is defining some SQL that operates against a stream, your transaction stream, and does projection and filtering on the data in an appropriate way. Step two is defining the different aggregations. Aggregation one is the average amount over the past time delta of one hour, so the average amount over one hour. The second aggregation is the average amount over the past 12 hours. Really simple. We wrote this one snippet of code and we got, effectively, a productionized, performant, and cost-optimized managed feature pipeline. So what does that mean? A simple definition; you get transaction data in and feature values out that meet the feature interface we’ve been talking about, and it comes with a bunch of engineering best-practice implementation behind the scenes.
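
The definition described here looks roughly like the sketch below, loosely modeled on Tecton’s published examples. Exact decorator and parameter names vary across SDK versions, and the source, entity, and column names are assumptions, so treat this as illustrative rather than copy-paste code:

```python
from datetime import timedelta

from tecton import stream_feature_view, Aggregation

# Sketch of the two-step definition: (1) SQL that filters and projects the raw
# transaction stream, (2) the windowed aggregations computed on top of it.
@stream_feature_view(
    source=transactions_stream,   # assumed: a registered stream data source
    entities=[user],              # assumed: a registered "user" entity
    mode="spark_sql",
    aggregations=[
        Aggregation(column="amount", function="mean", time_window=timedelta(hours=1)),
        Aggregation(column="amount", function="mean", time_window=timedelta(hours=12)),
    ],
)
def user_transaction_averages(transactions):
    # Step 1: simple SQL against the stream.
    return f"""
        SELECT user_id, amount, timestamp
        FROM {transactions}
        WHERE status = 'COMPLETED'
    """
```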

Mike:

And now let’s look at that. What is it, actually? In the case of aggregations, and now I’m talking a little bit about what Tecton’s aggregation feature engine does, but this is a specific example, it’s a prebuilt feature engineering pipeline. There’s the SQL step that runs against the stream and selects events from it. There’s a tiling step that pre-aggregates the filtered data into tiles, and that data is automatically brought into the online and offline stores, which hold these partial aggregates. Then there’s a compaction step to compact the partial aggregates across time, to save storage and speed up retrieval. When you’re querying the features, there’s a real-time roll-up which aggregates across these different tiles, making it very efficient to deliver the right feature value at the right time while minimizing the amount of work that has to happen at serving time. It’s a lot of stuff. A whole bunch of engineering and a whole bunch of performance optimizations are built in there, and you get that for free just by writing that nice configuration we looked at on the previous slide.
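
To make the tiling and roll-up idea concrete, here is a toy illustration. This is not Tecton’s actual implementation; a real engine would also merge in the newest raw events at the window edges to keep the result fresh and exact:

```python
from dataclasses import dataclass

# Each tile stores a partial aggregate (sum, count) for a fixed slice of time,
# so computing an average over a window is just combining a handful of tiles.

@dataclass
class Tile:
    start: int         # tile start time (epoch seconds)
    end: int           # tile end time (epoch seconds)
    amount_sum: float  # partial sum of transaction amounts in this tile
    count: int         # number of transactions in this tile

def rollup_average(tiles: list[Tile], window_start: int, window_end: int) -> float:
    """Roll up the pre-aggregated tiles that fall inside the requested window."""
    total, n = 0.0, 0
    for t in tiles:
        if t.start >= window_start and t.end <= window_end:
            total += t.amount_sum
            n += t.count
    return total / n if n else 0.0
```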

Mike:

Another way to look at it, and I’m not going to spend too much time on this because it’s a little visually heavy, is that this automated tiling and compaction optimization gives you this performance without compromising on accuracy. The beauty of this pipeline is behind the scenes: it implements a very advanced feature engineering pipeline. What we’re seeing here is just another way to look at what’s going on.

Mike:

There are a bunch of events coming from your stream source, and the vertical axis is time, from now back to four days ago. Those events are pre-aggregated into tiles. Those tiles are compacted into different levels of granularity that are chosen based on the definitions. And then the real-time feature server rolls up all of these tiles, plus some raw events, to serve highly accurate but very fast aggregations, computed on demand, for inference. Again, it’s a lot of stuff and it’s very advanced. It’s not something a data scientist on a really intense deadline is going to whip up and have be performant and reliable in production. That’s the beauty of having a feature engine: a pre-built, pre-optimized pipeline that’s easy to configure.

Mike:

So why use one of these feature engines? The hard data engineering is automated and out of the way. It’s super fast to build, iterate, and deploy, and it runs really fast in real time. The ML problems, like time travel, online-offline consistency, skew, and leakage, are automated for you as well. It comes with all these performance optimizations and cost optimizations. And you don’t need an eng team: there’s no engineering person needed in the core iteration flow where a data scientist is defining a feature and then reading or deploying it. That creates a nice separation in the workflows that allows the data scientist to go faster. And it’s super easy to maintain; it’s automated in the system and the pipelines are pre-built.

Mike:

As an example, some quick details on Tecton’s aggregations engine: less than one-second feature freshness, less than 10-millisecond serving latency, super-efficient backfills, it handles broken historical data, it supports full SQL and Python UDFs, it supports extreme scale, and it’s super cost-optimized. It’s really hard to get something like that. These are the features you want for your fraud models, and it’s tough to have your data scientists implementing this stuff themselves.

Mike:

Let’s look at a quick example of building a feature set. Let’s see what this all means. Imagine we have some features: user country code, which is like a batch SQL query; total transaction dollars in the last five minutes; and a couple of other features. Some are batch, some are streaming, some are real-time. And I have to build all these features, the implementation for each feature, the orchestration of the pipeline, and everything behind the scenes, for every single feature that’s built. Some of them are easy: running a SQL query can be quite easy, and maybe you already have that solved. But a lot of these other features, like the aggregations we were just talking about, can be really hard to implement.

Mike:

The beauty of implementing these features with a feature engine is that it turns it into easy mode for all of them. Now I just use an aggregation feature engine for these, and I use the real-time Python feature engine for this one. And then I can use all these features side by side in the feature platform. The feature platform treats all the features the same because they all implement the same feature interface. The feature platform allows you to manage, work with, serve, read from, and interact with all the features in the exact same way. So when you say, “Hey, I’m making a prediction, I need a feature vector,” you get feature values from all of the features side by side, through one common interface.
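
At prediction time, that common interface might look something like the hypothetical call below; the object, method, and service names are illustrative, not a specific API:

```python
# Hypothetical retrieval call: batch, streaming, and real-time features are all
# read through one interface when assembling the feature vector for a prediction.
feature_vector = feature_platform.get_online_features(      # assumed platform client
    feature_service="fraud_detection_v1",                    # assumed service name
    join_keys={"user_id": "user_123", "merchant_id": "m_456"},  # entity lookups
    request_data={"amount": 129.99},                          # input for real-time features
)
risk_score = model.predict(feature_vector.to_numpy())        # assumed trained model
```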

Mike:

And when you get this easy mode for features that used to be hard to build, it’s nice, because you can build all the features you want. Before, it was such a pain that you would implement many fewer features than you were interested in having, but now you can build all the features you need, because it’s just way faster to build, productionize, and maintain this stuff. And we’re actually seeing that folks who use the aggregation feature engine in Tecton have a ton more features, because it’s cheaper and faster for them to build, and their models get a lot more accurate as a result.

Mike:

Really quick on feature engines: they’re powerful, configurable implementations of common feature patterns. They make it easy to build production-ready feature pipelines. They’re part of the feature platform and they enable feature engineering at scale. They basically speed up the time to production, lower cost, and simplify ML. The most important part, though, is the speed-up. If your team goes a lot faster, if instead of taking months it takes a day to build, implement, and deploy a feature, then you get so many more iterations in over the course of a year and your models get way better. And not only that, you become a lot more reactive as a fraud and risk detection team. That’s one of the biggest values.

Mike:

I recommend considering a feature engine as a feature engineering pattern you could use in your organization. We have some stuff like that at Tecton, but you can also implement this type of pattern internally within your company. Okay, that’s what I’ve got. I haven’t checked if there are any questions. Let me hand over to you, Demetrius. Thanks for your attention, everybody.

Demetrius:

I got you covered on questions.

Mike:

Okay.

Demetrius:

While we are waiting, because there is a bit of a delay from when we talk now until people on the stream see it and the questions start coming in-

Mike:

Oh, got you.

Demetrius:

I realized something but I don’t know. Are you a movie buff? I feel like if you were a movie buff you would have the best name because you could just say to people, I am DB.

Mike:

Yes. No one ever told me that before. I was at a comedy show for my 16th birthday, and I was at the front with my feet up, leaning on the stage. And the comedian goes, “I’ve seen you in a movie before, haven’t I?” I said, “No. I don’t know. I’m not an actor.” “No, you’re definitely an actor, I’ve seen you before.” “No man, I don’t know, you got the wrong guy.” “You’re not an actor?” “No.” “You don’t perform or anything?” “No.” And he goes, “Then get your effing feet off the stage.” He just got me because I was leaning on the stage, and it was such a moment.

Demetrius:

Remembered. Respect those comedians or they’ll come after you.

Mike:

Now I’ve got PTSD related to acting.

Demetrius:

All right. First question coming through for you, IMDb. And classic, of course, somebody’s going to ask this one. What role can or do generative AI and LLMs have in fraud detection at present or in the near future, especially through Tecton?

Mike:

Great. Okay. There are two ways to look at it. There’s what role generative AI has in the problems we’re solving today, and then what role Tecton and this type of architecture have in generative AI. What role does generative AI have in the problem we’re solving today? When we talk to our customers, and we talk to folks who are ML teams solving real-time fraud problems, they have to make decisions faster than generative AI can support. Or, they’re not trying to be generative; they’re not trying to invent things or be creative, they’re trying to take very structured data and make very specific decisions that are highly accurate. So it’s a domain where accuracy is really important, and you’re a lot more willing to invest time in engineering upfront to get that outcome and that accuracy.

Mike:

And so we see generative AI not being the top priority for folks who are doing real-time fraud detection and real-time risk decisioning. However, there are a lot of adjacent use cases for generative AI, where maybe you’re asking the customer another question in your chatbot. It’s almost like a CAPTCHA, or there’s a customer support thing, right? All of those kinds of things. And a lot of this infrastructure, a lot of these patterns that we just talked about, are very useful for providing context as part of the prompt that is passed into the LLM. We actually have a blog post on the LangChain blog about how feature platforms, feature stores, LangChain, and LLMs can work together to provide a better customer experience. You can use that feature infrastructure to provide really high-quality, real-time context in your prompt so that the LLM can give a more personalized answer.
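
A rough sketch of that pattern, pulling fresh feature values and injecting them into the prompt as context (every name here is hypothetical, including the feature service and the LLM client):

```python
# Hypothetical sketch: enrich an LLM prompt with fresh feature values so the
# model can give a personalized, context-aware answer.
user_features = feature_platform.get_online_features(   # assumed feature platform client
    feature_service="support_chat_context",             # assumed feature service name
    join_keys={"user_id": user_id},
).to_dict()

prompt = (
    "You are a support assistant for a payments product.\n"
    f"Customer context: {user_features}\n"
    f"Customer question: {user_message}\n"
)
answer = llm.invoke(prompt)   # e.g. a LangChain chat model
```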

Demetrius:

I love that. Thinking about it as a CAPTCHA, and you’re just giving more context to the LLM. Also, something you said there is that you’re not trying to be generative in some of these use cases. That is such a good point, just the latency requirements that these use cases have sometimes. A lot of people are talking about this in the MLOps Community, saying how they wait 40 minutes for a response and then ChatGPT just comes back and says, “Time out error.”

Mike:

I mean, that’s not relevant for these use cases where you’re like, “Hey, we’ve got to make a decision right now.” And the data you’re trying to process is often not unstructured; it’s not like we don’t know what data we’re going to get as input and need a very flexible LLM to interpret it on the fly. You can build in all of that interpretation, because you’re interested in effectively overbuilding this thing to provide a very deterministic, correct, highly accurate result, rather than building a generic system that can interpret any type of situation or any type of input data. That’s not the problem setup for a lot of the risk and fraud decisioning systems that folks in this community have.

Demetrius:

Exactly. While we’re on the topic, who hallucinates more ChatGPT or your average college student?

Mike:

Definitely college student. I mean, depends on which college, obviously. From when I was in college I’m going to go with option B on that one.

Demetrius:

We’ve got a few more questions for you and then we’re going to kick you off the stage politely. Justin is asking, “Does the feature store automatically handle train, test, validate splits without breaking aggregations?”

Mike:

Interesting. Actually, that doesn’t happen right now. You could look into it and say, “Hey, based on the length of this feature’s aggregation window, I want to smartly choose my train/test split.” But the way features are calculated today is that they use a time window: a feature for January 1st, if it’s a month-long aggregation, is going to use data from December. So there are different ways to do train/test splits, or train, validate, test splits. And if you’re slicing by time, that’s always hard when you have an aggregation that aggregates over time, because your feature uses data from a time range. Especially if you have a lifetime feature, like a lifetime aggregation, that data goes back forever.

Mike:

So you can’t really just say, “Okay, now I’m never going to use this data for these features,” right? But what you can do is slice by users, or by items, or by merchants, by entity, by ID. You say, “I’m going to train on these users, validate on these users, and test on these users.” That’s a pretty common pattern, and it avoids the leakage problems that are introduced by having a time-windowed aggregation.
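
A minimal sketch of that entity-based split, assuming a training DataFrame with a user_id column:

```python
import numpy as np
import pandas as pd

def split_by_user(df: pd.DataFrame, seed: int = 42):
    """Hold out whole users rather than slicing by time, so windowed and
    lifetime aggregations don't leak across the train/validate/test split."""
    users = df["user_id"].unique()
    rng = np.random.default_rng(seed)
    rng.shuffle(users)
    n = len(users)
    train_users = set(users[: int(0.8 * n)])
    val_users = set(users[int(0.8 * n) : int(0.9 * n)])
    train = df[df["user_id"].isin(train_users)]
    val = df[df["user_id"].isin(val_users)]
    test = df[~df["user_id"].isin(train_users | val_users)]
    return train, val, test
```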

Demetrius:

I love that. All right. Does the aggregation engine keep feature lineage to help with explainability?

Mike:

Yeah. I can speak to what we have in Tecton. Every feature that’s defined in Tecton’s feature platform has lineage, a feature catalog entry, an owner, and all the metadata associated with it, all in the same way. So every feature, independent of how it’s implemented, is treated, managed, organized, governed, and monitored in the same way. You don’t have to worry about, “Oh, this is a specialized feature pipeline that we needed to implement for whatever reason, it connects to a weird data source or it’s specialized compute,” and have it live in a silo. It all goes through the same management platform, which makes it really nice.

Demetrius:

All right. Last one for you, then you’re out of here. We’re giving you the music, like at the Oscars or the Emmys.

Mike:

The hook.

Demetrius:

Exactly. How and when is the feature aggregation triggered: by a streaming window at stream time, or by a service call at runtime?

Mike:

In the Tecton implementation, it can be configured to be either. There’s a parameter where you can define what type of aggregation you’re interested in. It could be as of the point in time when the prediction is being made, so right now for the last 30 minutes; or it could be from the last event back over the 30 minutes before that; or it could be recomputed every five minutes, or something like that. So there are different ways to do it, and that’s a configuration type of thing.

Mike:

But that’s a really good question, because those details are actually the messy details that make it really hard to implement this stuff. You might think, “Oh, I can build an aggregation,” but then you realize, “Oh, it’s really easy to build one of those, but then you actually need the other thing, and it’s a completely different implementation.” I think we’ve talked about this at apply() before: it’s one of these problems where the deeper you dig into it, the harder you realize it gets. And so we talk to a lot of teams who start staffing up a team to build this thing, and then they realize, oh crap, we’re going to need more people, then it’s more people, and it just grows.

Demetrius:

As one person said in the MLOps Community, who was trying to build a feature store at their company and talking about the pains: it’s basically death by a thousand cuts.

Mike:

I think that’s right. Well, that’s all we do: we solve the thousand-cuts thing at Tecton.

Demetrius:

There you go. So with that thank you, Mike. I’m not sure if you’re jumping back on later but it’s been a pleasure, it always is a pleasure.

Mike:

No, I’m excited to watch the rest of the day, so I’ll let you guys go. I’ll cede the stage.

Demetrius:

All right, there we go.

Mike:

Thanks for your attention everyone. See you.

Demetrius:

See you.

 

Mike Del Balso

CEO & Co-founder

Tecton

Mike Del Balso is the co-founder of Tecton, where he is focused on building next-generation data infrastructure for Operational ML. Before Tecton, Mike was the PM lead for the Uber Michelangelo ML platform. He was also a product manager at Google where he managed the core ML systems that power Google’s Search Ads business. Previous to that, he worked on Google Maps. He holds a BSc in Electrical and Computer Engineering summa cum laude from the University of Toronto.
