Tecton

A Decade of Risk Machine Learning: Some Lessons Learned

apply(conf) - May '23 - 30 minutes

Over the last decade, Francisco has built software for machine learning models, data engineering, and risk at Affirm, Fast, Goldman Sachs, the Commonwealth Bank of Australia, and AIG. In this session, he’ll discuss common pitfalls and some simple approaches to avoiding them.

He’ll cover the importance of:

  • Treating model development as software
  • Viewing data through the lens of Data Producers and Consumers
  • Common mistakes in ML
  • Understanding the lineage of your data
  • Having a deep understanding of how your model interacts with the product experience and other software

Francisco:

I’m Francisco. I’ve spent a long time working in FinTech, machine learning, and risk, and over the past 10 years I’ve learned some things. Maybe not everything, definitely not everything, and maybe some things more useful than others. So I thought that, since I was invited to talk today, I would share some of those lessons. I actually wrote my thoughts up formally and I’ll share those notes afterward in the Tecton Slack channel, so if you want to see my structured thoughts on this, you’ll be able to find them there. These slides are available as well. And then, yeah, thank you to the Tecton folks for hosting this talk and for having me on. Micah is an expert in machine learning feature store systems, and I’ve nearly died by a thousand cuts in this area in the past, so I’ve been very, very fond of Feast and Tecton and the work that they’ve been doing. So I’m going to get started. Let’s see.

Francisco:

So I think it’s important to start with an agenda. I’ll tell you a little bit about myself, and I’ll try to be brief. Then I’ll talk about going from ML zero to one: for folks that are just starting with machine learning, what it takes to get there and how you can get there. Then I’ll talk about the model building SDLC, the software development life cycle. I think it’s an important part of the work that often gets less attention than it deserves, and I’ll expand my thoughts there. Next is data producers, data consumers, and data lineage. That’s kind of an entire rabbit hole, but it’s an important point that cascades into the work that feature stores do and that machine learning engineers and data engineers work on. I’ll talk a lot about the common mistakes that I’ve made, and I think that’s probably the most important part of the talk, if I’m going to be honest.

Francisco:

And I’ll talk about risk and the engineering of chaos. What I mean by that is the quantification of risk, and I don’t mean that in a specific risk capacity; I mean it as a general risk construct. I’ve worked in insurance, consumer financing, and commercial finance, and ultimately risk in certain buckets maps to certain business objectives, but in general you can define risk as a fuzzy thing and treat risk and the quantification of risk the same, independent of the specific use case. And then lastly, I’m going to try to do a quick demo. Some of it will overlap a lot with Feast and Tecton, but really it’ll try to make all the things that I’m talking about here very concrete, show how they ladder up together, and really show where the problems come in and what you can do in some cases about them.

Francisco:

So, me. I’m Francisco. My first master’s was in economics and statistics. My second was in data science and machine learning, particularly in machine learning and deep learning, with a little bit of NLP actually, very outdated and irrelevant now, but this was I guess seven years ago. I was in love with economics and statistics; that’s what brought me into the field, my love of data. And machine learning I found absolutely fascinating, so that’s really where I spent a lot of my time. My professional background: I’ve worked in banking and FinTech for about 10 years now. I’ve worked at some of the largest financial institutions in the world, AIG, the Commonwealth Bank of Australia, Goldman Sachs, and some pretty large, interesting FinTech startups: Fast, which imploded, and Affirm, which I’m very excited to be working at now. And I previously launched, and failed with, my own FinTech startup.

Francisco:

And so, what I do: I’m an engineering manager who works at the intersection of machine learning and data at Affirm. And occasionally I write for a newsletter called Chaos Engineering, particularly about some of my experiences. So, machine learning zero to one. I think there are some important things to keep in mind when you’re trying to launch machine learning in your product. Maybe this is inclusive of generative AI, but it probably isn’t; I’d say this is for more tabular, traditional use cases. And what I’d say is that data is the foundation of all machine learning models. That’s true for generative AI as well. Basically, garbage in, garbage out. I think Mike mentioned this in his talk before: it just really can’t work without good data. No amount of math can fix that. And software, software is so important. I started my career not as a software engineer but more as a statistician or quantitative modeler.

Francisco:

And as I started to appreciate this more, that’s really what made me dive into that rabbit hole, and now that’s where I spend 99% of my time and I love it. But I realized that both of those things combined, i.e., data and software, comprise about 90% of the work, or at least of the work that I’ve done, in getting a model into production. And so that’s why I made this diagram here on the right. You can see that data consists of feature engineering, investigating data issues (which you always have to do), identifying sources of truth, building training data sets, and building production data pipelines, whether those are batch, streaming, or on demand. And in the software realm, it’s fetching that data, it’s the feature transformations that maybe you still have to do in real time, and then maybe it’s feature serving, and then model serving, and then the service API calls for whatever that model interacts with.

Francisco:

And then the last key part, the kind of secret sauce, is the machine learning. And I’ll say that this pyramid makes sense to me because, as I said, everything else is foundational; you can’t really have production machine learning without the two layers below. And in the machine learning step, for those of you that have built models, you spend a lot of time on algorithm choice, testing different algorithms, doing feature selection, doing hyperparameter tuning and optimization, then a lot of time on model evaluation, and sometimes a lot of time on model training, depending on how big the data set is. So that’s kind of a starting point. And I think in order for machine learning to be effective and fast and reproducible, you need the foundation of a great product. I think sometimes people look at machine learning and say, “Well, machine learning can solve all of my product problems.” And it can’t. It just outright can’t.
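
As a rough illustration of that machine learning layer (algorithm choice, hyperparameter tuning, and evaluation), here is a minimal scikit-learn sketch; the data file, feature names, and target are hypothetical, not from the talk.

```python
# Toy illustration of the ML layer: choose an algorithm, tune
# hyperparameters, and evaluate on a held-out set.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_parquet("training_data.parquet")  # hypothetical training set
X, y = df[["age", "income", "avg_daily_trips"]], df["defaulted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameter tuning over a small grid, scored by AUC
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# Model evaluation on data the search never saw
print("AUC:", roc_auc_score(y_test, search.predict_proba(X_test)[:, 1]))
```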

Francisco:

Product is the foundation of customer value. I think machine learning facilitates customer value, and I love machine learning, I write about it in my leisure time, but it needs to be tied to customer value, because at the end of the day what we all work to do is provide customer value. And so I really want to emphasize that. I’ve worked on teams where people were working on models for the sake of models and not really tying them to customer value, and I challenge whether that’s effective in the short or medium term. I think it’s important to understand what customer problem we’re trying to solve and then use the right tooling to enable that. Next, data. Again, the quantification of the experience is important to get right, and there’s kind of a joke here. And then machine learning. Machine learning can amplify a really great product experience. It can optimize it. It can optimize risk, it can optimize a portfolio, it can optimize which movie you want to choose, it can optimize a basket of goods.

Francisco:

So that’s really where machine learning can power it. And then software: I think software is the core engine, and it reinforces this flywheel of improving a product. That’s why I kind of drew this cycle, because when you have a really good foundation and a good product experience, you can actually get a lot of velocity out of your product, and that relies on software. If you’re taking a long time to deploy your machine learning, or the data’s bad and you’re taking a long time because of that, then you’re not going to get the utility that you want out of your machine learning work. And hopefully that’s what we all want to do.

Francisco:

So next, the model building software development life cycle. On this slide, I’m talking about where it’s done wrong. I mentioned I’ve been working in this space for about 10 years, and 10 years ago it was very different. My first model I built in R and SAS. And SAS, not SaaS, but statistical analysis software, is one of the oldest statistical and machine learning vendors. SAS primarily does work for the FDA; they also do it for lots of legacy insurers, and they’ve been a great product for a long time and have done really great historical work. But back when I built my first model, there was no version control for code. I mean, I’m sure other engineers had it, but we didn’t. We emailed code, which is wild, or just had it stored on disk and accessible that way and wrote jobs that way.

Francisco:

And so in this kind of silly little diagram, you see that you start off with some service, written in code, that’s generating some online data. Then you might have some batch pipeline code that takes it and normalizes the schema or something, and then you have some transformed data, that thing in green. Then on the model side, you’ll see some feature pipeline code that takes maybe the transformed data, or maybe the raw data, and transforms it into the features that you want to use to model something. Then you have some model training code. In the SAS days it was PROC GLIMMIX; in Python, it’s scikit-learn or XGBoost or PyTorch. And then, taking that model training code, you get your model artifact, your pickle file or set of weights or whatever, and then you want to convert this into model scoring code so you can deploy it.

Francisco:

That’s actually not too bad when people copy that. And then there’s sometimes more feature pipeline code, because it turns out that maybe not all of your data can be accessed via the batch transformations, as Mike showed before with Feast, and you end up writing new code to handle your real-time data. And then you have to adjust your model scoring for that, maybe, maybe not. And then you have to take that model scoring code and connect it with your service code. And in all of these processes, which may or may not be connected, you write a lot of code, some of it tested, some of it not. But it’s important to understand that all of those things, from the model artifact to the dataset, were generated through code and a lot of human intelligence.

Francisco:

And in order for that to be reproducible, it needs to be codified, probably tested, and version controlled. You need to version control the model artifact. You need to version control the data. And beyond the fact that it’s for quality and reproducibility, it’s because if you make a mistake, which you inevitably will if you’re writing all of this code, it’s going to be really hard to recover, or repair, or find out what you did wrong. So I cannot emphasize enough how important this step is, and some places don’t do this very well. Fortunately, I now work at places that do this well, so I’m happy about that. But I will say that if you don’t have this foundation, it’s really important to get it right.
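
As a rough sketch of that idea, here is one way to record the code version, data hash, and library version alongside a model artifact so a run can be traced and reproduced; the file names, fields, and helper are hypothetical, not anything from the talk.

```python
# Persist a model artifact together with enough metadata to reproduce it:
# the git commit, a hash of the training data, and the library version.
import hashlib
import json
import subprocess

import joblib
import sklearn


def sha256_of(path: str) -> str:
    """Hash the training data file so we can tell if it ever changes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def save_versioned_model(model, data_path: str, out_prefix: str) -> None:
    joblib.dump(model, f"{out_prefix}.joblib")
    metadata = {
        "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
        "training_data_sha256": sha256_of(data_path),
        "sklearn_version": sklearn.__version__,
    }
    with open(f"{out_prefix}.json", "w") as f:
        json.dump(metadata, f, indent=2)
```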

Francisco:

So, data producers, data consumers, and data lineage. Data lineage is complicated. What I mean by data producers and consumers: a data producer is somebody who’s actually creating data, like a service that stores user data, as an example. The table in that production Postgres database, or whatever it is, is a producer of the data. A user submits a form about their, I don’t know, video preferences or movie preferences, that gets stored in the database, and that service has produced data. And a consumer could be a data engineering team or a data science team that’s then taking that data, maybe via Fivetran or a custom ETL, dumping it into Snowflake or something, and constructing it out into another table. And now they’ve actually become data producers in that way, and some other team is now a data consumer.

Francisco:

And the challenge that comes up is that this can go on almost indefinitely without you knowing. In practice, people just start to create tables on top of your tables, because someone says, “Oh hey, look, just use this table,” and they do. So I looked at dbt, the data build tool for those who aren’t familiar; it’s an open source framework that handles data engineering pipelines, does really great auto-documentation, and helps with data lineage. And I looked at just one table as an example, some financial SAP fact table, and this shows you all the layers that exist on the way to what’s ultimately one final table, the terminal end of that DAG, that directed acyclic graph. And when you have data in live production systems, what happens is some top-level data producer, or maybe even an intermediate data producer, makes a change and then something downstream just breaks.

Francisco:

And whether it’s your data engineering pipelines or your production systems that are exploding, you’re trying to figure out what’s happening, and the answer always is, “Oh, somebody changed the schema.” And really, there are some obvious but kind of silly solutions to this, which are data contracts. It’s basically testing upstream changes against your downstream consumers, as an integration test or a unit test, whatever. That can help mitigate a lot of breaks. It won’t solve everything, and you’ll still have some breaks that are just unknown, especially from third-party providers, but it does help catch these things ahead of time. And so this is a really important point, because it’s a subtle thing, but you end up spending a lot of time on call triaging exactly these breaks.
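
As a rough illustration of such a data contract, here is a minimal unit test a downstream consumer might run against an upstream table’s schema; the table path and column names are hypothetical.

```python
# A toy "data contract": the downstream consumer pins the schema it
# depends on, and this test fails loudly if the upstream table drifts.
import pandas as pd

EXPECTED_SCHEMA = {
    "user_id": "int64",
    "signup_date": "datetime64[ns]",
    "state": "object",
}


def test_users_table_contract():
    # Stand-in for the upstream table the consumer reads from
    df = pd.read_parquet("warehouse/users.parquet")
    for column, dtype in EXPECTED_SCHEMA.items():
        assert column in df.columns, f"missing column: {column}"
        assert str(df[column].dtype) == dtype, f"{column} changed type to {df[column].dtype}"
```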

Francisco:

So, some common mistakes; memes are my favorite. I thought I’d list some of the sometimes-creative ways that I’ve blown up a server over the last decade. Because it’s easy for me to see this now and say, “Oh, well, don’t do this because of this,” right? But I had to learn why through scar tissue. There are about six of them here, there are a lot more, but I tried to condense it to the most salient ones. First, featurization errors: generally a silly bug from not testing the behavior of everything that you’re going to do. The most trivial example is a ratio of two features.

Francisco:

So if you wanted to take, I don’t know, age divided by income for whatever reason as a feature, income could be zero, and that’s going to give you a NaN. If you have a machine learning algorithm that handles NaNs by default, that’s fine, but if you have a linear regression, or any regression, or any generalized linear model, it’s going to try to do a matrix multiplication and that’s not going to work, so you’ll have a bad time. ML library errors are another one. Most open source machine learning libraries have lots of pretty heavy dependencies under the hood, whether it’s Fortran or C++.
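
As a rough sketch of that ratio-feature pitfall and one hedge against it, here is a short pandas example; the feature names and imputation choice are hypothetical.

```python
# Dividing by a feature that can be zero produces inf/NaN, which a GLM's
# matrix multiplication can't handle. Guard the denominator and impute
# something deliberate instead of letting it blow up at serving time.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 31], "income": [50_000, 0, 80_000]})

naive = df["age"] / df["income"]                    # row with income == 0 -> inf
safe = df["age"] / df["income"].replace(0, np.nan)  # make the gap explicit...
safe = safe.fillna(0.0)                             # ...then impute a chosen default
```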

Francisco:

They tend to come with a lot of stuff built in, and usually, whatever specific version you developed with, either locally or on some server, you just want to make sure that’s the same one that’s going to be used at serving time, because it ends up causing some blowups later if you don’t. Model loading errors: these tend to happen when you have a very large model, particularly something like BERT or an open source LLM, where fitting it in memory is hard. So it’s important to make sure that you rightsize before you actually deploy your server. And then there are service errors: you want to make sure that you handle the case in which your model blows up, and not assume that it’s always going to be perfect, because otherwise you’ll get external-facing 500s, which is never a good time.
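
As a rough sketch of two of those hedges, here is a small Flask example that checks at load time that the serving library version matches what was recorded at training time, and that avoids surfacing a raw 500 when scoring fails; the endpoint, metadata fields, and fallback behavior are hypothetical.

```python
# Verify the training/serving environment match, and fail gracefully
# if scoring blows up instead of returning an unhandled 500.
import json

import joblib
import sklearn
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")
metadata = json.load(open("model.json"))

# Refuse to start if the serving environment drifted from training.
if metadata["sklearn_version"] != sklearn.__version__:
    raise RuntimeError("serving sklearn version differs from training version")


@app.route("/score", methods=["POST"])
def score():
    try:
        features = [[request.json["age"], request.json["income"]]]
        return jsonify({"risk_score": float(model.predict_proba(features)[0][1])})
    except Exception:
        # Return a deliberate fallback the caller can handle.
        return jsonify({"risk_score": None, "error": "scoring_failed"}), 200
```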

Francisco:

Then business logic errors. As Mike said in the previous talk, these models are consumed by other people, and you want to make sure that whatever business logic they have, or that you might add on top of your model afterward, has good testing, and that you test the behavior to the degree you can. And then lastly, statistical errors: making sure that your training sample is representative of the population that you’re ultimately going to score live in production. And if it’s not, make sure you understand the consequences of that choice and are monitoring the behavior very closely. Sometimes it makes very good sense why you don’t have that symmetry, but other times you might not have done it on purpose, and that’s a really important thing to understand. So, risk and the engineering of chaos. Like I said, I’ve worked in risk for a long, long time, and I wrote an entire article about this if you’re interested in reading it.
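
One common way to monitor that last point is to compare the training distribution of a feature against what production traffic actually looks like, for instance with a population stability index. Here is a minimal sketch; the bucketing, threshold, and feature are hypothetical.

```python
# Population Stability Index: a rough check that a production feature's
# distribution still resembles what the model was trained on.
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, cuts)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, cuts)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


train_income = np.random.lognormal(10, 1, 10_000)   # stand-in for the training sample
live_income = np.random.lognormal(10.5, 1, 5_000)   # stand-in for production traffic

if psi(train_income, live_income) > 0.2:  # 0.2 is a common rule-of-thumb threshold
    print("income distribution has drifted; investigate before trusting the model")
```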

Francisco:

If not, obviously that’s fine. But the short version is that the underlying structure of tabular data versus data like natural language or images is very different. Tabular data is chaotic because you’re depending on human behavior, and human behavior is inherently kind of weird. It’s fuzzy structures; numbers themselves are kind of arbitrary units that we define, but it’s kind of just random. There are a lot of arbitrary rules at play in our society, like the drinking age being 21, the driving age being different, the employment age in different states being different. All these rules create structure out of chaos, and your model tends to operate within these systems. In this example I was talking about lending in particular; this is an example of pretty much the work I did at Commonwealth Bank and Goldman Sachs. There’s credit verification, employment and income verification, a decision engine, and a bunch of different data sources.

Francisco:

And there are all these big interdependencies between these systems, and really you just have to understand how your model and your features fit within this system. It’s kind of hard to do because it all depends on data. So luckily there are great frameworks that help to support this, and that’s where Feast comes in. So I’m going to do a demo now. Let’s see if I can share my screen. Stop sharing, and I have 10 minutes left. Okay, I’m going to share my screen, so I’ll share it quickly. So let’s see. I have this set of code here. There’s a README, and I’ll share the link to this GitHub repo. There’s this feature repo, which you also get from Feast and Tecton. There’s a Flask application I’ve written, and there’s some data that exists here; you don’t have to worry too much about it.

Francisco:

There are some feature views, in something called driver risk. This example is just showing that, and these feature views are based in part on some files that we have here, which are just in the Parquet data format, so you can think of them as CSVs. There are entities that we’ve defined here. And here’s the feature view, or here’s one feature view; I actually defined a handful. This is the driver SSN entities view. It’s basically a retrieval lookup to see if an SSN was previously seen, so it’s a database of SSNs, and you can think of it as verifying, from a fraud perspective, whether this SSN has been seen before.
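
For readers following along, here is a minimal sketch of what an entity and a file-backed feature view look like in Feast; it is a generic illustration assuming a recent Feast API, not the exact code from the talk’s repo, and the paths, names, and fields are hypothetical.

```python
# A generic Feast entity plus a feature view over a Parquet file.
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

driver_yesterdays_stats_view = FeatureView(
    name="driver_yesterdays_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,
)
```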

Francisco:

This is an on-demand feature view. It’s basically executing an operation at request time to see if the SSN was previously seen, and you’ll see here it’s checking is-null equals false, so it’s seeing if this SSN was seen before. This input request is just defining some additional on-demand feature views, and there’s a calculate-age one: we’re calculating age at that particular time. And there’s a batch feature view, the driver yesterday’s stats view, which is a feature view showing basically a bunch of features about yesterday. And if we see this, we’ll see… No, that’s the wrong one. Feature.
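
As a rough sketch of an on-demand feature view of that flavor, computing a value at request time from a request source, here is a generic example assuming Feast’s pandas-mode decorator; the names and the age calculation are illustrative, not the repo’s actual code.

```python
# A generic Feast on-demand feature view: derive a feature at request
# time from values sent in the request payload.
import pandas as pd
from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Int64, String

onboarding_request = RequestSource(
    name="onboarding_request",
    schema=[Field(name="date_of_birth", dtype=String)],
)


@on_demand_feature_view(
    sources=[onboarding_request],
    schema=[Field(name="age_at_request", dtype=Int64)],
)
def calculate_age(inputs: pd.DataFrame) -> pd.DataFrame:
    dob = pd.to_datetime(inputs["date_of_birth"])
    out = pd.DataFrame()
    # Rough age in whole years as of the time of the request
    out["age_at_request"] = ((pd.Timestamp.now() - dob).dt.days // 365).astype("int64")
    return out
```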

Francisco:

Yeah, get onboarding features. No, never mind, it’s actually, oh, it’s batch here. Yeah, here we go. This is creating a simple data frame simulating a batch job, and here it’s just some simple rules on top of the data: there’s a driver ID, a conversion rate, an acceptance rate, and an average daily number of trips. And you see these fields are essentially just applying some binary operators on top of that. Say the acceptance rate is below 1% and the conversion rate is greater than 80%; that would potentially flag that something sounds suspicious. So that’s kind of the code; you can see more about it and build it locally with Poetry, you just have to run a couple of commands. But let’s do one thing first: feast apply.
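
As a rough sketch of those rule-based batch features, here is a short pandas example; the thresholds mirror the ones he describes, but the column names and values are hypothetical.

```python
# Simulate the batch job: simple binary flags layered on top of raw
# driver stats, combined into a "suspicious" indicator.
import pandas as pd

df = pd.DataFrame({
    "driver_id": [1001, 1002, 1003],
    "conv_rate": [0.85, 0.30, 0.95],
    "acc_rate": [0.005, 0.60, 0.02],
    "avg_daily_trips": [40, 12, 3],
})

# Very low acceptance combined with very high conversion looks suspicious.
df["low_acceptance"] = df["acc_rate"] < 0.01
df["high_conversion"] = df["conv_rate"] > 0.80
df["suspicious_activity"] = df["low_acceptance"] & df["high_conversion"]
```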

Francisco:

And so, wow, never do a live demo. Oh, that’s why. feast apply. So we’ve now basically updated the metadata in Feast, and now we’re going to materialize the incremental data. This is actually hydrating a local SQLite database, and now I’m going to run this little Flask app, which looks like this; it has some endpoints, and now we can go. So let’s see, one, two. For this demo I actually did include some API docs, so you can see these things here and the endpoints. At onboarding we’re going to see this kind of stuff, but for the user it would be something like this: try to sign up and become a Feast driver, and our risk score is really high, it’s 78, from the onboarding model. There’s an onboarding model; you can see the code.
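
After `feast apply` and the incremental materialization have run, a service can read those features back from the online store. Here is a minimal sketch of what that fetch might look like in Python; the feature references and entity key are hypothetical, matching the earlier illustrative feature view rather than the demo’s actual repo.

```python
# Read features for one driver from Feast's online store
# (the local SQLite store in this kind of setup).
from feast import FeatureStore

store = FeatureStore(repo_path=".")

features = store.get_online_features(
    features=[
        "driver_yesterdays_stats:conv_rate",
        "driver_yesterdays_stats:acc_rate",
        "driver_yesterdays_stats:avg_daily_trips",
    ],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

print(features)
```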

Francisco:

It does a simple weighted sum. This says you’re not eligible to drive right now, so we can try again. Let’s see, let’s try 2000. And it looks like, by construction, these two were seeded as already previously seen users, and the state was invalid before. The only valid state is South Dakota, which is a shout-out to where I just moved back from. So let’s submit. Oh, so our risk is zero now, so we’re good to go and you’ll be taken to the homepage. Now you see that you’re good; your daily risk score is 42, whatever that means. And really what’s happening is that in the background we’re fetching batch features and rescoring every five seconds. It doesn’t really matter, it’s just a toy example, but the point is that Feast helps with all of these different frameworks for handling these different items. And so the code is available if you’d like to look at it.
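
As a rough sketch of a weighted-sum risk score of that flavor, here is a tiny example; the weights and feature names are made up, not the demo’s actual onboarding model.

```python
# A toy "onboarding model": a hand-picked weighted sum over a few
# binary features, capped at 100.
WEIGHTS = {
    "ssn_previously_seen": 60.0,
    "invalid_state": 30.0,
    "suspicious_activity": 25.0,
}


def onboarding_risk_score(features: dict) -> float:
    score = sum(WEIGHTS[name] * float(bool(features.get(name))) for name in WEIGHTS)
    return min(score, 100.0)


# e.g. onboarding_risk_score({"ssn_previously_seen": True}) -> 60.0
```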

Dat Ngo:

Dude. Awesome. We’ll drop the code into Slack, so if anyone wants to have some fun, play around with that demo. Thank you so much for doing a live demo. I know you probably had to pray to some of the demo gods before this actually happened, and it seemed like it went really well, all things considered.

Francisco:

Yeah.

Dat Ngo:

I will say this though, Migo, I am very disappointed because I did not learn how I can hack Affirm and do any kind of shady business on the Affirm platform. So you probably didn’t even need to get sign off from the PR department or anything for that talk, did you?

Francisco:

No, no, no. I probably didn’t.

Dat Ngo:

Well, this is awesome. We got a minute before people will ask a question. So because we’re just going a little bit over on time, I’m just going to ask you one that comes through.

Francisco:

Okay.

Dat Ngo:

I’m pretty sure somebody has a question that they’re going to ask because the chat was very vivacious during this talk. I will say, while we are waiting for the questions to come through, how can you just show that picture of you and totally gloss over the fact that you’re wearing that hat? How much did you pay for that damn hat? Man, that’s got to be, that looks like it’s more expensive than my car.

Francisco:

That’s right. I forgot to talk about that. Yeah, no, so I just moved back from western South Dakota, near Wyoming, and I got that hat at a rodeo actually.

Dat Ngo:

Yeah, it felt like it had a Montana, Idaho vibe to it, and yeah, you could have worn it for the talk. I mean, I’m wearing this ski mask.

Francisco:

I mean, it’s in my car. I still wear it here on the East Coast, but people ask where I’m from.

Dat Ngo:

I bet you get a lot of looks. I bet people love it. You have a belt buckle that goes with it?

Francisco:

I actually do, but I don’t wear it. But I got the belt buckle as a kid. I mean the TLDR is, my parents are from Mexico and my father grew up on a ranch and so I have cowboy boots too with scorpions on them.

Dat Ngo:

And guns with gold plates and your name engraved on them? Oh no, that hits too close to home. No, so anyway, it looks like, just because of time and because I’m getting pressure from the production crew, we may have to end it here. So if anyone wants to ask Francisco a question, throw it in the Slack chat and we’ll have him jump over there. Oh, we got one for you. What would be the effect…? Oh my god, no, really? Everybody’s asking generative AI questions. The hype, we cannot escape the hype. So there’s this question about what the effect of generative AI would be in addressing bot problems.

Francisco:

It depends on the lens. Yeah, so for addressing bot problems, it’s imperfect, right? LLMs can get good enough that it’s hard to distinguish between the two, and even using an LLM to detect all of that is kind of a silly thing. I think there’s going to be a new world where authentication of identities is going to be a core part of the experience, even at customer onboarding, and that’s the only way to tell a real person from a chatbot. That’s my 2 cents on it.

Dat Ngo:

Oof. I love it. All right, man. Thank you for this demo, the live demo. Thank you for the hat chat.

 

Francisco Arceo

Engineering Manager

Affirm
