Tecton

Fighting Fraud with Machine Learning at Remitly

apply(conf) - May '23 - 41 minutes

“The rise of digital transactions and online platforms has brought about new challenges in combating fraud. Traditional rule-based systems often fall short in identifying sophisticated fraud patterns, leading to significant financial losses and compromised user trust. This necessitates the adoption of advanced technologies such as machine learning for effective fraud detection and prevention. This conference talk aims to explore the utilization of machine learning algorithms and techniques to combat fraud across various domains.”

Sounds boring, right? While all true, fighting fraud with machine learning requires so much more than simply “adopting advanced technologies such as machine learning for effective fraud detection and prevention.” You can blame ChatGPT for that paragraph (and for also oversimplifying our heavily nuanced and complicated domain).

In this talk, we will explore what it actually takes to develop, launch, and maintain ML models in production in the highly dynamic and adversarial environment of FinTech risk. I will share how Remitly thinks about risk/fraud tradeoffs, how we frame problems to build successful ML products, and how this influences our team structures. I will also share some lessons (and mistakes) learned from years of developing valuable, robust, and customer-centric fraud models that score transactions every second of every day, to transform the lives of millions of immigrants and their families by providing the most trusted financial services on the planet.

Jake Weholt:

Hey everyone, my name is Jake Weholt. I’m a machine learning engineering manager fighting fraudsters at Remitly. And today I want to talk to you about the challenges of fighting fraud using machine learning, and provide some additional context and nuance to this space that you can use when thinking through your own complex real world machine learning problems.

Jake Weholt:

So first, what does Remitly do? Remitly is a leading digital financial services provider for immigrants, enabling users to make person-to-person money transfers in over 170 countries around the world. Remitly’s vision is to transform the lives of millions of immigrants and their families by providing the most trusted financial services on the planet. A relentless focus on providing customers with best-in-class products underpins our commitment to bring trust, reliability, and transparency to cross-border remittances and broader financial services. Building trust with our customers means building world-class risk systems to protect our customers and their hard-earned money.

Jake Weholt:

So, I want to start by talking about what fraud looks like at Remitly. Because Remitly sends money around the world, we’re a natural target for fraudsters. The most common form of fraud at Remitly is stolen payment info: for example, when someone has their credit card information stolen and a fraudster uses it to send money on our platform. This type of fraud manifests as what are known as chargebacks. Chargebacks are a payment-reversal mechanism designed to protect consumers and their money.

Jake Weholt:

And if someone notices an unauthorized charge on their bank statement, they can issue a chargeback through their bank and their funds will be returned in full. Remitly sits in the middle of a transfer of value between two bank accounts: money comes in through a sender’s bank account, flows through Remitly, and, through some treasury magic, lands in a recipient’s bank account. As the service provider, we are responsible for our senders’ money regardless of whether or not those funds were sent legitimately, and therefore Remitly is responsible for refunding all of the fraud victims’ funds.

Jake Weholt:

In short, Remitly bears the financial responsibility of protecting victims of fraud on our platform, and this means that we are highly incentivized to stop fraud on the platform to protect our customers. Next I want to talk about the history of fighting fraud at Remitly as this journey illustrates the often non-linear path that machine learning solutions take when developed in complex domains. So, automated solutions for fighting fraud at Remitly started long before we had any sort of machine learning solution.

Jake Weholt:

Analysts would deep dive into fraudulent transactions, learn the patterns, and write coarse rules to combat losses. We then started developing machine learning models in an attempt to beat the combination of rules and vendors, and that took years to achieve. We went through many versions of machine learning models that underperformed against the next best alternatives, but this iterative process helped us better understand the needs and requirements of the program, starting from the data required to build machine learning models all the way to the outputs our partner teams needed to make good decisions using the model scores that we provided.

Jake Weholt:

So, today we have many models running in production, actively providing probabilistic risk scores every second of every day for transactions that pass through our systems. And through this machine learning journey, we learned several generalizable ML team lessons that are practiced by our team today. So, the first is always have a baseline to measure against. A snarl of rules, a previous model, a vendor, a random number generator, whatever it is, having a bar to measure against is critical for crafting your roadmap to improvement. Simply put, we measure success of ML solutions at Remitly by their ability to beat the next best alternative, whatever that may be. Without a baseline to measure against, we won’t know whether or not we’re improving.
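
To make the baseline idea concrete, here is a minimal sketch (in Python, with entirely synthetic data and a made-up rule) of measuring a candidate model against a next-best-alternative baseline on the same labeled holdout. It illustrates the principle only; it is not Remitly’s actual evaluation code.

```python
# Hypothetical sketch: compare a candidate model against the next best
# alternative (here, a simple rule baseline) on the same labeled holdout set.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Toy holdout: 1 = chargeback (fraud), 0 = legitimate. Heavily imbalanced.
y_true = rng.binomial(1, 0.01, size=50_000)

# Baseline: pretend a rule flags transactions with some fixed hit rate.
baseline_flags = rng.binomial(1, 0.02, size=y_true.size)

# Candidate: pretend model scores, thresholded into flags.
model_scores = np.clip(0.4 * y_true + rng.normal(0.05, 0.1, y_true.size), 0, 1)
model_flags = (model_scores >= 0.3).astype(int)

for name, flags in [("rule baseline", baseline_flags), ("candidate model", model_flags)]:
    p = precision_score(y_true, flags, zero_division=0)
    r = recall_score(y_true, flags, zero_division=0)
    print(f"{name:15s} precision={p:.2f} recall={r:.2f}")
```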

Jake Weholt:

Next, ML solutions often reveal themselves in the wild. Autonomous data scientists and analysts leveraging deep domain expertise will generate proven analytical solutions, but these solutions often hit roadblocks when attempting to scale them. This scenario is the perfect setup for machine learning and when reflecting back on all of the mature and durable ML products at Remitly, all of them followed a similar path. Having data scientists and analysts run out ahead of eventual ML solutions allows you to prime the ML product pipeline.

Jake Weholt:

And while turning analytical solutions into ML solutions is certainly a big challenge, without a rich pipeline of proven concepts, it’s hard to know what to build next. There’s a natural mismatch between where people believe machine learning can be valuable and where you can actually provide value with machine learning. Turning to those closest to the data, like data scientists and analysts who have proven solutions but are hitting issues of scale, can be a cheat code for the next big machine learning win in your org. And machine learning solutions take time to develop in complex domains: be patient, but also stay realistic. Trust that the iterative experimentation process will pay off if you identify outsized opportunities, but keep in mind that it’s okay to abandon projects if something simply isn’t working.

Jake Weholt:

Next, I want to motivate the problem from a business perspective a bit and talk about how that actually impacts our team design. So first, the business context. Fraudulent transactions erode customer trust, and risk systems carry a non-trivial cost to our customers and to the business through churn-inducing friction in the customer experience, engineering costs, and customer service costs. These two constraints give way to two clear goals. First, reduce the customer impact incurred through fraudulent transactions. And second, don’t impede good customers through fraud enforcement. These goals give us a clear trade-off framework to think about when tuning our risk systems. It means that we are always trading off between overly restrictive enforcement and customer-impacting fraud losses. Oversimplifying a bit, from a product team’s perspective, the desire is low customer friction and smooth product experiences.

Jake Weholt:

And from a customer trust perspective, the desire is to reduce chargebacks. These two desires are often in tension with each other, because to stop fraud losses, we have to introduce customer friction in the form of risk systems. This tension places our team in the middle of two stakeholders with often completely opposite goals, and it is our job to design solutions to solve the problem. So, next I want to talk about how those constraints and goals flow into team design. Transactional risk assessment at Remitly is owned by the scoring team, which is a domain-specific ML team. This team design stands in contrast to the other, more centralized machine learning teams at Remitly, which are generally more generalist and own models across several domains.

Jake Weholt:

Being a single-domain team allows us to maintain hyper focus on the dynamic and adversarial domain of fraud. A trade-off that we make intentionally here is depth over breadth. The scoring team at Remitly is a product team, and our products are probabilistic risk scores. We measure product quality over several dimensions, two of which are, one, our ability to properly rank transactions by riskiness through the use of machine learning models, and, two, our ability to provide a stable score for decision making. If we can’t rank transactions properly by riskiness, then the score we produce isn’t actually useful. And if we can’t produce a stable score, where the desire is that through time the same score roughly equates to the same level of riskiness, then it is harder for our partner teams to build durable tools that leverage our score.
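
As a rough illustration of those two quality dimensions, the sketch below measures ranking ability with ROC AUC and score stability across two time periods with a population stability index (PSI), all on synthetic scores. These particular metrics are assumptions chosen for illustration; the talk does not say which measures Remitly actually uses.

```python
# Hypothetical sketch of the two product-quality dimensions described above:
# ranking ability (ROC AUC) and score stability across time periods (PSI).
import numpy as np
from sklearn.metrics import roc_auc_score

def population_stability_index(expected, actual, bins=10):
    """PSI between two score distributions; higher means more drift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Assign each score to a quantile bucket of the "expected" distribution.
    e_bins = np.clip(np.searchsorted(edges, expected, side="right") - 1, 0, bins - 1)
    a_bins = np.clip(np.searchsorted(edges, actual, side="right") - 1, 0, bins - 1)
    e_frac = np.bincount(e_bins, minlength=bins) / len(expected) + 1e-6
    a_frac = np.bincount(a_bins, minlength=bins) / len(actual) + 1e-6
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.01, 20_000)
scores_last_month = np.clip(rng.beta(1, 20, y.size) + 0.3 * y, 0, 1)
scores_this_month = np.clip(rng.beta(1, 18, y.size) + 0.3 * y, 0, 1)

print("ranking (AUC):", round(roc_auc_score(y, scores_this_month), 3))
print("stability (PSI vs last month):",
      round(population_stability_index(scores_last_month, scores_this_month), 3))
```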

Jake Weholt:

As a product team, it is our responsibility to help our partner teams make the necessary business trade-offs when these two goals are in direct tension with each other. So now, what makes fighting fraud with machine learning hard? I’m sure that a lot of the folks who have already spoken today, and who will speak later, have covered some of these topics, and I think it’s extremely interesting, so I’m excited to dive into it with you guys as well. So, let’s first write down the goal and a very rough plan for how to achieve it. The goal is to develop a model that can separate fraud from not fraud. Our target variable will be chargebacks, represented as ones and zeros, we’ll gather a bunch of features to describe fraudster behavior, we’ll throw it all into a model that provides scores, and then we ship it, right?

Jake Weholt:

That’s kind of how it goes. Actually, no it doesn’t. There is a ton of nuance in this space that makes fighting fraud with machine learning hard. The first large challenge to highlight is fraudster behavior. To start, fraudsters are rare and look very similar to legitimate customers, and this makes them hard to detect because the behavioral overlap with non-fraudsters is so large. Remember our goal from before, which is to stop fraud without impeding legitimate customers. Fraudsters often look like regular customers, so separating the two can be extremely challenging. From a modeling perspective, it becomes obvious that this behavioral overlap, coupled with the relative rarity of fraud, makes positive signal generation extremely tricky.

Jake Weholt:

Fraudsters are smart, efficient and adversarial. They learn our systems quickly, change their patterns often and heavily exploit holes when they find them. We’ll talk more about adversarial fraud behavior in a bit, but as you can probably tell by now, the dynamic nature of fraud behavior causes distribution shifts all over the place, which is a little bit of the stuff that Dat talked about just a minute ago. And this impacts our features, labels, performance measures, everything. Next I want to talk about labels, and at first I want to focus on label definitions. So, good modeling outcomes start with clarity in your label definitions. A classification model with good features and enough data will classify the things that you tell it to classify.

Jake Weholt:

Weak label definitions will lead to poor model performance and have you declining the wrong types of transactions. Additionally, label definitions can vary between teams in your org based on what each team cares about. So, having clarity and alignment around label definitions is critical for delivering the product that stakeholders expect, which in our case is ranking the subset of chargebacks that are fraudulent. This means having concrete labels, with the correct dimensions to describe those labels if you want more nuanced label definitions. It also means that you should be aware of the variance introduced by labeling techniques such as human labeling, and confirm that the signal generated by those techniques actually outweighs the cost of the variance they introduce.

Jake Weholt:

We are very careful at Remitly when adding nuance to our label definitions, because nuance does not always translate to better model performance. Another difficulty in generating clean labels for training is censored and biased data. We will never know the true label for any transaction that has been declined by our system, because declining a transaction stops it from reaching a terminal state. If it is fraud, it will never reach the chargeback state, and if it isn’t fraud, it will never reach a completed state. This increases label uncertainty across our training data, effectively blurring the boundaries between the positive and negative distributions and making them difficult to separate effectively. The reality is that some of the declined transactions were actually fraud, some of them were unfortunately not fraud, and there’s valuable signal hidden within this transaction set that we are, again, completely blind to.

Jake Weholt:

If we were to assume that all of the declined transactions were fraud, we would risk teaching the model to identify good customers as fraudsters, and over time it will eventually capture more and more good customers, which is a bad customer experience. But this doesn’t mean that we should ignore these transactions altogether. Again, hidden within this data set is valuable signal for our models. I can’t talk about exactly how we deal with this problem at Remitly, but I wanted to raise this as a modeling challenge that is worth thinking through when you have models that interact with the environment. This interaction with the environment impacts labeling, data capture, features, measurement, everything. Fraudster behavior is a projection of whatever fraud system is actively interacting with the environment. And I’ll talk more about this in a bit, but broadly, fraudsters quickly adapt to our defenses, our rules, our models, everything.

Jake Weholt:

This adaptive behavior biases our labels, because the only fraud that gets through our system is a representation of the holes in our system. Again, I’ll talk more about this in a bit. Next is imbalanced data, which I’m sure a lot of folks here are struggling with. An imbalanced data set is a data set with skewed class proportions, meaning some labels are rarer than others, and in fraud this imbalance is extreme because fraudulent transactions are very rare compared to legitimate transactions. There are a lot of resources online talking about how to deal with imbalanced data, so I’ll just mention the tricky parts for us: signal generation and performance measurement. Fewer labels means weaker signal for learning fraud patterns, and you run the risk of overfitting. Fewer labels also leads to higher variance in performance measurement, with longer data-gathering phases needed to have confidence in our experiment reads during online testing.
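
One common way to cope with extreme class imbalance, shown in the minimal sketch below, is to reweight the rare positive class during training. This is a generic illustration with synthetic data, not a description of Remitly’s approach.

```python
# Hypothetical sketch: class weighting is one common way to train on
# extremely imbalanced fraud labels (not necessarily what Remitly does).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 100_000
X = rng.normal(size=(n, 5))
# Make roughly 1% of rows fraudulent, loosely tied to the first feature.
y = (X[:, 0] + rng.normal(scale=2.0, size=n) > 5.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 'balanced' reweights each class inversely to its frequency, so the rare
# positive class contributes meaningfully to the training loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print("positive rate:", round(y.mean(), 4),
      "recall on test positives:", round((clf.predict(X_te)[y_te == 1] == 1).mean(), 3))
```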

Jake Weholt:

This problem also compounds as we improve our systems. Better systems means less fraud, which means fewer positive cases over time to train on, which means that the class imbalance actually gets worse. And the final difficult part in generating clean labels is label maturity. Chargebacks, which are our positive labels, take weeks to mature. For the most part, generating a positive chargeback label requires someone to notice a fraudulent charge and interact with their bank to reverse it. This means the feedback cycle is measured in weeks, not hours or days, which makes the iteration loop very long. It also means that fraudsters can fly under the radar for a while before they are discovered, and they continue to actively harm our good customers in the meantime.
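
A simple way to picture the label-maturity issue: when assembling training data, transactions younger than the chargeback maturity window should be excluded rather than treated as confirmed negatives. The sketch below uses hypothetical column names and an assumed 45-day window; the actual window is not stated in the talk.

```python
# Hypothetical sketch: exclude transactions newer than the chargeback
# maturity window so immature "no chargeback yet" rows aren't treated as
# confirmed negatives. Column names and the window are made up.
import pandas as pd

MATURITY = pd.Timedelta(days=45)  # assumed maturity window, not Remitly's

transactions = pd.DataFrame({
    "txn_id": [1, 2, 3, 4],
    "created_at": pd.to_datetime(
        ["2023-01-01", "2023-02-20", "2023-04-25", "2023-05-01"]),
    "charged_back": [False, True, False, False],
})

as_of = pd.Timestamp("2023-05-05")
mature = transactions[transactions["created_at"] <= as_of - MATURITY].copy()
mature["label"] = mature["charged_back"].astype(int)
print(mature[["txn_id", "label"]])  # only the first two rows are old enough to label
```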

Jake Weholt:

It also makes it harder to measure improvements, or to tell whether a model is stable enough to move into production, especially when we make large changes and the uncertainty and risk are increased. So, next I want to talk about measurement, which is a particularly difficult part of building machine learning models in this space. What do we want to measure, and how do we want to measure it? We want to measure how well our systems are performing, but what does that actually mean? Our primary measures are precision and recall. Precision means: of all the transactions flagged by the model, how many were actually fraud? This is a measure of customer friction, because if this value is low, say 5%, only 5% of the transactions we flag as fraudulent actually are, and the remaining 95% are good customers getting tangled up in our risk systems.

Jake Weholt:

And then recall means: of all the fraud on the platform, how much of it did we actually flag? We use recall to understand how well we are actually capturing fraud. If the value is high, say 90%, it means that we’re capturing about 90% of the fraud flowing through our systems, but we are still missing about 10%. We trade off between these two by tuning our model thresholds to balance business constraints. So, we set a threshold where any model score above the threshold is flagged as fraud and anything below bypasses our fraud systems. To further illustrate this point, let’s map our business constraints back to these measurements and gain some intuition there. Again, from a product team’s perspective, the desire is low customer friction, and this means high precision at the expense of recall. You want every transaction you decline to have a high likelihood of being fraudulent. Non-fraudulent customers don’t like being labeled as fraudsters, which means that false positives damage the customer experience.
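
The precision/recall trade-off from tuning a threshold can be illustrated with the short sketch below, which sweeps a few thresholds over synthetic scores; the numbers are made up and only demonstrate the mechanics.

```python
# Illustrative only: sweep a decision threshold to see the precision vs.
# recall trade-off described above. Data is synthetic, not Remitly's.
import numpy as np

rng = np.random.default_rng(3)
y = rng.binomial(1, 0.01, 50_000)                      # 1 = fraud
scores = np.clip(rng.beta(1, 30, y.size) + 0.5 * y, 0, 1)

for t in (0.2, 0.5, 0.8):
    flagged = scores >= t
    precision = y[flagged].mean() if flagged.any() else 0.0  # fraction of flags that are fraud
    recall = y[flagged].sum() / y.sum()                      # fraction of fraud that gets flagged
    print(f"threshold={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```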

Jake Weholt:

Customer friction directly translates to customer churn, and if Remitly has a more frictional risk experience than the next best remittance provider, customers will simply choose the provider with a better experience for them. And then from a customer protection perspective, we want high recall, because we want to be capturing all of the fraud that is flowing through the system. This means that we flag anything that we think is fraudulent, but remember that too sensitive a system here can lead to a really bad customer experience. So, getting back to fraudster behavior, which I teased earlier: fraudsters are highly adversarial and efficient, and their behavior is a projection of our current fraud system in production. So, why is this relevant?

Jake Weholt:

Well, it means that even if we had a theoretical holdout group that wasn’t subjected to any sort of fraud or risk system experiences, that data would still be biased if there are risk systems in place interacting with the environment. Fraudsters use methods that work and abandon methods that don’t, or they move on to other remittance providers where those methods do work. Because it is not efficient to use methods that don’t pay out. So, if a method is abandoned by fraudsters altogether, we would not expect to see that signal in a holdout group.

Jake Weholt:

So, this behavior has many downstream effects. It makes measurement of precision and recall across our threshold spectrum inaccurate. If we move a threshold and relax our system, this may open a hole that fraudsters can pour through unexpectedly. So, regardless of what our test set PR curve said, fraud scales non-linearly if holes are identified by fraudsters. This subjects the tuning of our system to a lot more variance than our test sets would have us realize. Because of this, we choose to be very conservative when making changes to the thresholds of our models. Next, adversarial fraudster behavior means that fraudster behavior is changing over time, which makes performance measurement extremely tricky. There are three areas that I want to talk about here. First is long-term measurement of model improvement. How do we know if we are improving if the fraud landscape is constantly changing underneath us?

Jake Weholt:

For instance, if we are stopping baseline fraud but miss a fraud attack and it drives our recall way down, are we doing better than last quarter? The next area is variance in model score stability. What if the number of transactions getting flagged by the model suddenly doubles overnight? Is this behavior driven by the model accurately blocking a fraud attack, or are there some other exogenous factors leading to score instability? Given that we are largely blind to the outcome of these transactions, it’s hard to know if the incremental transactions flagged by the model are actually nefarious. And the final area is the time-based dependence of a feature’s value to the model. Going back to fraudster efficiency, let’s think about an example where a feature is very expensive to compute and is causing headaches for our partner teams to maintain.

Jake Weholt:

They want us to turn off the feature. Can we actually do that? Let’s say that we run an offline experiment with a traditional train test split and we see no difference in precision and recall between the models with and without the feature. Given this information, we deprecate the feature, and three days later we have a fraud attack which leverages a hole that was previously blocked by the feature. So, what happened? Remember, fraudsters are efficient. If they find an exploit, they use it until it stops working. In our train test splits, no positive cases related to that feature were present, because the feature had previously closed a hole and fraudsters had stopped using the exploit. So, now a fraud ring comes along testing exploits, and boom, they find a hole. And despite our model measurement telling us that nothing would happen, we get hammered with fraud. So, how can we combat this? One attitude to take is: once added, never remove a feature from the model.

Jake Weholt:

This is fair, but it ignores the real-world engineering challenges of instrumenting, maintaining, and monitoring features. You sometimes don’t have a choice if your feature’s data changes or is deprecated. And simple train test splits might give you the wrong impression of the feature’s value. It’s also hard to prioritize engineering work to instrument a new feature if a simple train test split shows no lift, because the business value of building the new feature isn’t obvious. We have several methods for determining feature value, and none of them are perfect. We have shifted our thinking towards whether or not a feature has ever been valuable, rather than whether or not it is valuable in the current experimentation split. These methods range from simple measurements, such as univariate discriminatory power, univariate correlation, or multiple train test split windows through time to measure precision and recall lift.

Jake Weholt:

And also more complex and costly methods, like using SHAP values to understand the impact a particular feature has had on the model over time and over what period that impact was felt. Overall, model performance measurement in this space is extremely tricky, and oftentimes misleading or altogether worthless. And as the nuance of measurement increases, it becomes more and more difficult to describe the business value that your team is providing. At Remitly, we are constantly improving our measurement techniques and have found that partnering closely with stakeholders to align on measurement techniques has saved us a lot of time when answering questions like, “Is your model actually valuable?”, especially when measuring model performance and improvements on shorter time scales like day over day or week over week.
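
As a concrete illustration of the simpler end of those feature-value methods, the sketch below measures a feature’s univariate discriminatory power (AUC) in each time window rather than only in the latest split, so a feature that was once valuable still shows up; a SHAP-based version would follow the same windowed pattern. The data and the feature’s behavior are entirely invented.

```python
# Illustrative sketch: check whether a feature has *ever* been useful by
# measuring its univariate discriminatory power (AUC) per time window,
# rather than only in the latest train/test split. Data is synthetic.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
months = pd.period_range("2022-01", periods=6, freq="M")
rows = []
for i, month in enumerate(months):
    y = rng.binomial(1, 0.02, 5_000)
    # Pretend the feature separated fraud early in the year, then stopped
    # (e.g., because fraudsters abandoned the exploit it detects).
    strength = 2.0 if i < 3 else 0.0
    feature = rng.normal(size=y.size) + strength * y
    rows.append((str(month), round(roc_auc_score(y, feature), 3)))

print(pd.DataFrame(rows, columns=["month", "feature_auc"]))
```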

Jake Weholt:

Next I want to talk about data and features. Model value is generated from raw data, using domain expertise to generate features. Without domain expertise, you aren’t going to know what data to capture or what features to build, and you aren’t going to outrun bad data and features with fancier modeling architecture. Complex models are hard to understand and maintain, and the incremental lift generated by them isn’t likely to outpace better data and features. The curation, collection, and transformation of data should have dedicated team resources, because this is where the value is generated and stored for our models. Data quality matters greatly to the performance and stability of our models, while capturing new fraud signals quickly allows us to better respond to fraud attacks. Additionally, if you have separate data flows for offline training and online scoring, you should align them.

Jake Weholt:

Model volatility can come from many sources: feature drift, vendor downtime, seasonal trends, you name it. The goal for us is to reduce volatility wherever we can, and the first place to start is aligning your online and offline data. We went with Tecton to solve this problem because of the drift between our two data pathways. We also aligned ownership under a single team. Historically, our offline features were owned by machine learning engineers and our online features were owned by software engineers on different teams, and it led to needless drift in our features. This bifurcation of ownership was problematic, and by better aligning our ML domain engineers, we have a better understanding of the subtle differences between online and offline data. And then finally, model deployment and serving. We have our own in-house model deployment service built on top of AWS SageMaker.
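
Before moving on to deployment, the alignment problem just described can be pictured with a toy parity check between feature values computed by the offline (training) path and the online (serving) path. This is a generic illustration with made-up column names, not Tecton’s or Remitly’s implementation.

```python
# Toy illustration of the online/offline alignment problem: compare feature
# values computed by the offline (training) path and the online (serving)
# path for the same entities and flag drift. Not Tecton's or Remitly's code.
import pandas as pd

offline = pd.DataFrame({
    "user_id": [1, 2, 3],
    "txn_count_7d": [4, 0, 12],
})
online = pd.DataFrame({
    "user_id": [1, 2, 3],
    "txn_count_7d": [4, 1, 12],   # user 2 disagrees between the two paths
})

merged = offline.merge(online, on="user_id", suffixes=("_offline", "_online"))
merged["mismatch"] = merged["txn_count_7d_offline"] != merged["txn_count_7d_online"]
print(merged)
print(f"parity: {(~merged['mismatch']).mean():.0%} of rows match")
```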

Jake Weholt:

And here are some of the things that we found valuable as we iterated and built out this system. The first is easy model rollback. If you ship a bad model, you’ve got to be able to roll it back quickly. Next is high availability and low latency, with SLAs on both. Remitly provides an express experience, which means that the funds are available basically instantaneously in the recipient’s bank account, and we need to know whether or not that transaction is fraudulent before those funds are disbursed. So, we need to be able to score the model quickly in the actual transaction and product flow. Having SLAs around both of these measures allows us to make concrete trade-offs when developing the model. If the model’s too slow, we can prioritize making it faster, and if it’s too large to deploy, we can prioritize making the model object smaller.
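
As a toy illustration of having an SLA to trade off against, the sketch below checks a batch of synthetic scoring latencies against an assumed p99 latency target; the 150 ms target and the latency distribution are made up for the example.

```python
# Toy illustration: check synthetic scoring latencies against an assumed
# p99 latency SLA. The 150 ms target is a made-up number for the example.
import numpy as np

rng = np.random.default_rng(5)
latencies_ms = rng.lognormal(mean=3.5, sigma=0.4, size=10_000)  # fake request latencies

P99_SLA_MS = 150.0
p99 = float(np.percentile(latencies_ms, 99))
print(f"p99 latency: {p99:.0f} ms (SLA {P99_SLA_MS:.0f} ms)",
      "OK" if p99 <= P99_SLA_MS else "SLA breached: prioritize making the model faster")
```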

Jake Weholt:

Next is quality monitoring. Have monitoring: know if your model is down, know if it is timing out, or if your features are missing, or if you’re flagging too many transactions. If your model stands in the critical path of the customer experience, you must know when it is behaving incorrectly. And then finally, have alarms and a contingency plan for when your model goes down or is behaving in a way that doesn’t make sense. These things happen, and knowing what to do will save time, save the sanity of your engineers, and most importantly, maintain your customers’ trust, which is critical for providing the most trusted financial services on the planet. Thank you guys so much. Again, I’m Jake Weholt, ML Engineering Manager at Remitly working in fraud. And if we have time, I’m happy to answer any questions, anything that I can.

Speaker 2:

Jake, awesome talk man. Wait a minute. While we’re waiting for these questions to come in, I’ve got the big question for you. Do you do Toastmasters or something? What’s your deal? You practiced this one a little bit?

Jake Weholt:

Couple times, yeah.

Speaker 2:

Oh man, this was like a TED Talk. I was enthralled the whole time.

Jake Weholt:

Thank you very much.

Speaker 2:

Well, we’ve got some questions about your talk. First one coming up is, as a finance organization, do you prioritize risk management framework over advanced fraud detection and prevention systems?

Jake Weholt:

Yeah, it’s a good question. I think we think about model explainability and control first when building our risk systems. And because we are a public financial company, it is tricky, I would say. I can’t go too much into that, but we think about model explainability and usability first, because one of our primary goals is managing model stability so that we are not impacting the customer experience too much. We are, I would say, hesitant to use more complex tools if they add complexity but don’t add value for our customers. And the first thing for us is customer value.

Speaker 2:

100%. Right on. So, how do you deal with the scarcity of positive signals in the data?

Jake Weholt:

I would say like domain expertise is a really important one here. Knowing how and where to go to generate fraud signals is extremely important. And we found that having domain experts be able to dive into the data, particularly data scientists and analysts that can vet ideas and pitch ideas to our machine learning engineers for implementation has been extremely valuable. So, getting folks close to the data is really, really helpful when trying to find positive signal.

Speaker 2:

I’ve just got to say this real fast for everyone that is asking questions, these questions are awesome. I’m looking through Slack right now and wow. So, if you’re not in Slack, join it and ask questions. We’ve got another one coming through. Does Remitly use a combination of graph algorithms, along with deep learning frameworks to uncover hidden patterns and identify fraud?

Jake Weholt:

I unfortunately can’t answer that. That one’s a bit too in the weeds for my ability here. That’s about as much as I can say unfortunately.

Speaker 2:

I thought you were going to say that will reveal too much. It will pull back the kimono a little bit too much.

Jake Weholt:

That’s actually the case. Yeah, unfortunately.

Speaker 2:

You’re being all humble. I don’t know, I plead the fifth on that one. Just tell us if you need to plead the fifth any other time and we will respect that. Do you run multiple models at the same time? Challengers?

Jake Weholt:

We do. We run challenger models. We also have models over different cohorts, so specific customer segments that are problematic. We do splitting there and we have a lot of models running at any given time. And I think AWS is probably very stoked to have us as a customer because we have a lot of models in production.

Speaker 2:

Oh God, I love the good old AWS spend. So, let’s see. And the next one for Remitly, is using multiple ML models in fraud detection more advantageous as opposed to using a single ML model?

Jake Weholt:

I think it really depends on the domain. We’ve found it valuable, but going back to my point in my talk around label definitions, making sure that you have concrete label definitions is the key here. And as you apply nuance, you might have multiple models that are acting over different label sets or whatnot. As you start narrowing down your label set, the variance increases and the signal decreases. So you need to find a good trade-off point there where you can actually produce good models in your risk systems.

Speaker 2:

How do you make sure scores are stable over time? Do you retrain fraud models often?

Jake Weholt:

Yeah, this is a particularly challenging problem. So, we do retrain quite often and we also use some normalization techniques. But I would say that in terms of product quality for us, where the product here is providing probabilistic risk scores for our partner teams, providing stable scores is actually extremely tricky. And when you have another team that is making decisions on top of the scores that you produce, it’s really important to have stable scores. So, we have a combination of things, but I will say this problem is an extremely tricky one and it’s something that we think about a lot.

Speaker 2:

We can just tell from the quality of these questions that people are actually doing this. And so I love seeing this. I love this quality. Next one for you. I’ve got a few more, and you were kind enough to be very timely about your presentation, so we have a few minutes to kick around. Then we’re going to have a little break and I am going to play some music for everyone and I’m going to ask for prompts in the chat, so I will sing what you put in the chat. But before we do that, Jake, I’ve got a few more questions for you. I’m just planting the seed of the prompts in the chat. Now, how do you deal with holdout and how do you choose its size?

Jake Weholt:

Another question I can’t answer unfortunately.

Speaker 2:

I plead the four, five, fifth.

Jake Weholt:

Exactly. Yeah, sorry.

Speaker 2:

You almost answered it though. You were pretty close there. We’ll see if we can get you on another one. Let’s see, because do I need to go get my ski mask again? You’re giving us all kinds of revealing information that is going to help me reverse engineer through Remitly fraud detection algorithms, which I’m going to thank you for later. So, what kind of biometric and behavioral features do you use in the model? Do you find them useful?

Jake Weholt:

Let’s see. So, I don’t actually think I can answer that either. I’m so sorry. I have to plead the fifth on that one as well. I don’t want to give away too much of the secret sauce here.

Speaker 2:

Can you answer if they’re useful or if they’re not, or you can’t even answer that?

Jake Weholt:

I can’t even answer that unfortunately.

Speaker 2:

All right, don’t worry. We’ll keep moving. No worries man. I don’t want to make you lose your job. We can also edit this out before we send it to your employer so nobody will know. It’s just us right now having this conversation. And so how do your models capture the new changing fraud behaviors timely? I guess you kind of answered that, but maybe…

Jake Weholt:

Yeah, I think the first problem and the trickiest part here is identifying the fraudster behavior as it comes in and making sure that you’re actually getting signal on a particular fraud attack. And when you add that to the model, making sure that it doesn’t accidentally sweep in a bunch of good customers as well. So, going back to the domain expertise here, having people that are just in the data and identifying those trends is extremely valuable. And then being able to quickly spin up features and test them in the model is another big challenge and hurdle that we had to overcome. Training times can be extremely long, so it comes down to coming up with really simple and quick ways of measuring whether or not a feature might be valuable to the model, making some assumptions about how valuable that feature actually is, and then shipping it.

Jake Weholt:

And then you run into issues with backfilling the data and making sure that it is accurate at the time of calculation, which, plug for Tecton here, is a problem that they have solved very well. So, I would say if you’re on the fence about using Tecton, jump on over to the Tecton side. The water is warm.
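
The backfill correctness issue mentioned here can be illustrated with a point-in-time ("as of") join: each training row should only see feature values that were computed at or before the transaction’s timestamp. The sketch below uses hypothetical data and pandas’ merge_asof purely to illustrate the idea the speaker says Tecton solves for them.

```python
# Illustration of point-in-time correctness for backfills: each transaction
# should only join to feature values computed at or before its timestamp.
# Data and column names are hypothetical.
import pandas as pd

transactions = pd.DataFrame({
    "txn_time": pd.to_datetime(["2023-03-01 10:00", "2023-03-05 09:00"]),
    "user_id": [7, 7],
}).sort_values("txn_time")

feature_log = pd.DataFrame({
    "computed_at": pd.to_datetime(["2023-02-28 00:00", "2023-03-04 00:00"]),
    "user_id": [7, 7],
    "txn_count_30d": [3, 9],
}).sort_values("computed_at")

training_rows = pd.merge_asof(
    transactions, feature_log,
    left_on="txn_time", right_on="computed_at",
    by="user_id", direction="backward",   # never look into the future
)
print(training_rows[["txn_time", "txn_count_30d"]])
```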

Speaker 2:

Ooh, there you go. So, the organizers have just taken that clip and if they’re smart they’re going to plaster it all over social media.

Jake Weholt:

Perfect.

Speaker 2:

And say, “Look at how Remitly loves us.” I’ve got a few more for you man, and then I’ll let you go. This has been awesome and it’s a crowd favorite. I mean the questions are coming through. There’s a lot of questions for you. I don’t know if you’re going to be able to answer that one. This one got a plus one, so I’m going to go ahead and ask it. You mentioned using vendors plus coarse rules in the past. Any lessons on the transition to in-house and the timing of switching off?

Jake Weholt:

That’s a great question. So, I think the first really tricky thing here is measurement across the system: being able to roll out rules and test whether or not those rules are working, and making sure that you have an accurate representation of what the world looked like before those rules existed, or what the world might look like if those rules did exist. That is something that you want to try and solve. And then with respect to vendors and rules and making the transition, I think for us it was a very soft transition, meaning we rolled out ML models very slowly over subsets of customers to make sure that we were actually producing good results. And then for managing fraud attacks, which I’m sure many folks deal with today, rules are still valuable. We still use rules, we still use vendors, and we have machine learning models also making decisions.

Jake Weholt:

And all of these things together create a very robust system that fraudsters find difficult to penetrate. So, I would say rolling out that ML model, that transitionary period, is particularly tricky and high risk, so doing it slowly and thoughtfully and intentionally is really important there. And then for us, an actual data-generation period was necessary. As the business grew and we generated more data, we had better signal and the ability to build better models, and our ability to actually outpace rules and vendors increased. So, we were able to do that quite well. I think that’s about it.

Speaker 2:

Yeah, that’s such a great point, that last part about how the more you do, the better it becomes and the more robust it gets. So, let’s see. I’m going to try, do you use some third party data products for fraud detection? If so, do you rely more on these products or your in-house models? I guess you kind of answered that just now. How do you make this decision if they don’t agree with each other? Kind of same.

Jake Weholt:

Yeah, we do. We use third party vendors. We use a whole collection of things. I think the way that we really think about fraud at Remitly is that we have a team of extremely skilled analysts who are responsible for stopping fraud at Remitly. One of the tools in their tool belt is the in-house model that we provide, and they have a suite of tools, and together they can make good decisions on how to stop fraud. And we sort of pit our internal model against our vendors, our rules, everything together. And thankfully, over time we’ve been able to beat those things with our in-house models. But you should also make sure that you’re being intellectually honest about the value that your model provides. If one day your model is all of a sudden worse than one of your vendors, that’s the day that you should switch over to that vendor and use it instead, right? Because the goal here is to stop fraud and to protect customers, and doing anything that you can, using any tool that you can, to do that is extremely important.

Speaker 2:

That’s awesome. It’s definitely good for you. You get to say, “Look, we’re doing better than these guys,” so you can argue your worth. So, you mentioned domain experts a few times to find emerging fraud vectors. Have you found unsupervised learning useful for detecting new emerging fraud vectors not previously considered?

Jake Weholt:

We have done some of that. If you order the things that you can work on by their ability to actually impact the business, I think for us features and data have been by far the most impactful, and that’s where I would choose to place resources. As for ML experimentation with unsupervised learning techniques, I would say if we had a larger team, we would probably dive into some of that more research-oriented stuff. But we run a pretty lean shop over here, which is really, really fun. If anybody is looking to join a fraud team, reach out on LinkedIn. But I would say that we’ve done some of that, but it’s not a primary focus for us right now.

Speaker 2:

Nice. Dude, these questions just keep coming. You woke up the chat. I love it. I’m going to keep going, because otherwise these poor souls would be listening to me singing out of tune, which is also nice, but these questions are very pertinent. How do you ensure the models, or the training of models, use the proper data sources and features that can be reproduced for fast online inference, not just offline batch?

Jake Weholt:

Yeah, that’s a great question. I think there are certainly limitations there. The ability to actually calculate features live in the scoring flow adds latency to the customer experience, and that makes it tricky to choose the features that you are going to eventually put into your model. And I would say at the end of the day, it’s ultimately a trade-off, and the thing that we are biased towards is customer experience and customer protection. So, making sure that we are thinking of that first. And then in terms of being able to actually calculate those features, we sometimes do have to make trade-offs.

Jake Weholt:

I think thankfully we haven’t had to make too many of them, which is great. We have a really strong data engineering team at Remitly that can build performant features for us. But this is also another benefit of Tecton, right? Being able to build these features offline and also serve them online with timely computation is extremely important to us. Making sure that those two data sources align is also extremely important, and being able to push them into a single tool is valuable for us.

 

Jake Weholt

Engineering Manager, Machine Learning - Fraud Detection

Remitly
