Data Science Team Lead
apply() - May '22 - 10 minutes
A feature store can solve many problems, with various degrees of complexity. In this talk I’ll go over our process to keep it simple, and the solutions we came up with.
So that’s me, Joao Santiago, and this is Do It Yourself Feature Store, A Minimalist’s Guide. I’m a team lead at Billie. We are a startup in Berlin, in the buy now, pay later space for B2B. And because we deal a lot with handing money to other people, fraud is a very big concern. And we want to give our customers the best possible experience, which means all of these models need to be real time. And since the beginning, we needed something that we later figured out was called a feature store. At the time, when we had to decide whether to build or buy, there was nothing on the market that we could actually buy, so we decided to build our own. And these are some of the results.
So I’m going to show you two things today. The first one is just the minimal possible thing we built to help us share features and get going. I will not die on this hill by calling it a feature store. I think other people won’t call it that, but this is what we call it. And then, finally, I’ll show you exactly what we did and what we have running in production today. These are some of the points that we wanted to achieve, like having past data in real-time models. That one is really obvious for fraud, because we want to look at behavioral patterns.
We wanted to implement each feature only once so that we can share features between projects. And the most critical one of all was that it had to be maintainable by a team of two. So that’s me and Bogdan, who I think is in the meeting here today. So cheers. And so, we had to build something that was very, very simple, but still achieved business goals. And also, the reason why this had to be simple was because it was unproven tech at the time. So we were not given a hundred percent of our time to work on this. And we had to decide where to go to make something we could ship to production as fast as possible.
And so the first thing we built was a feature store as an R package. And R here because it’s how we build and deploy models in production at Billie for fraud. And the result was this package called Beamter, in which you can build what we call the feature registry. And this is just a list where you include features as steps. This is just steps. And then later, you can see here, recipes. These are all just our lingo. But in essence, what this means is that we are making steps in the pre-processing pipeline. So these are very, very basic examples. We obviously have things that are more complicated. But just to show you, if you want a feature called hour, it extracts the hour. Or, if you want to extract the email name from some string, you can create a feature called email name. It depends on something called email, and then you just write your code.
Now, the cool thing here is that this is just very simple R code, anyone that knows R can write it. And then we just wrap this into a package, put it on GitHub. And now, anyone that has access to our GitHub can just go here and get the features that they need. To actually get the features, you just have a vector or a list of strings with the names of the features you need. So if you need the hour, for example, this will fetch this feature here. You just provide this to Beamter and say, here’s the feature registry with all the features we have. These are the features I need. And then Beamter just prepares this pre-processing pipeline, which is called a recipe in R. And we ship this together with our model to production.
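To make the registry idea concrete, here is a minimal sketch in Python rather than R. It is not Beamter’s actual API: the names `Feature`, `FEATURE_REGISTRY`, and `build_pipeline`, and the `created_at` input field, are all hypothetical. It only illustrates the pattern described above, where each feature is defined once with its dependencies and a project asks for features by name.

```python
from datetime import datetime
from typing import Callable, Dict, List, NamedTuple


class Feature(NamedTuple):
    depends_on: List[str]            # raw input fields the feature reads
    compute: Callable[..., object]   # receives those fields as positional args


# Hypothetical shared registry: each feature is implemented exactly once.
FEATURE_REGISTRY: Dict[str, Feature] = {
    "hour": Feature(["created_at"], lambda ts: datetime.fromisoformat(ts).hour),
    "email_name": Feature(["email"], lambda email: email.split("@")[0]),
}


def build_pipeline(registry: Dict[str, Feature], wanted: List[str]):
    """Return a function that computes only the requested features for one record."""
    missing = [name for name in wanted if name not in registry]
    if missing:
        raise KeyError(f"unknown features: {missing}")
    selected = {name: registry[name] for name in wanted}

    def pipeline(record: dict) -> dict:
        return {
            name: feat.compute(*(record[field] for field in feat.depends_on))
            for name, feat in selected.items()
        }

    return pipeline


# A project lists the feature names it needs and gets a pre-processing step back.
order = {"created_at": "2022-05-18T14:30:00", "email": "jane.doe@example.com"}
preprocess = build_pipeline(FEATURE_REGISTRY, ["hour", "email_name"])
features = preprocess(order)
print(features)
```

Because the returned function is built from the same registry in training and in production, both sides run the same feature code, which is the property the package approach is after.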
Now, this is really cool. It ticks some of the boxes that I showed you before. So we have the same implementation in two places. We only implement it once. There’s kind of a low bar to entry for others, because if you know R, you can do this immediately. There’s no extra abstraction that you have to learn. But it’s obviously not a full-fledged feature store. There’s no possibility to cache data. You have to keep versions synced between projects. And that’s a bit annoying sometimes. And then obviously, not everyone will know the libraries needed. So I cannot expect just a general backend engineer to review my features on GitHub, because maybe they don’t know R, or they don’t know the specific libraries we use in data science.
And so, we had to move up a bit from this, and we actually built our own feature store. And again, this is super simple because, like we said, there had to be no Kafka, no Spark, just as declarative as possible, and almost nothing new to learn so that anyone in the company, at least on the engineering and tech side, could just come in and contribute from the outset.
And the too long, didn’t read version of this talk is, we just use Snowflake. We use Snowflake streams and tasks. And all the features are defined as SQL functions because SQL is kind of the unifying language that anyone should know, at least in the data science space, and also in the data engineering space. So this means I, as a data scientist, can just build a couple of features and have data engineers review them. Or I can actually offload these tasks, for example, to BI and ask them to research something. They create a feature from the research, and then I can review their work because we all speak the same language.
In practice, this kind of looks like this. We have two paths in terms of how the information flows. We have a synchronous path that goes into our models. So this is when you submit an order and you’re waiting for the order to be approved. But at the same time, we have other data going into a Fivetran pipeline and into Snowflake. So we use only three components. If you see, our feature store is pretty much just this here on the bottom: we have Snowflake, we have a Lambda that pretty much just puts stuff from Snowflake into Redis, and we have Redis. And all Snowflake is doing is creating a stream that detects if there’s an insert in a target table, and then running our features or our SQL transformations on that data.
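The talk does not show the Lambda itself, so the following is only a sketch of what "put stuff from Snowflake into Redis" could look like in Python. The row shape, the key layout in `redis_key`, and the feature name `n_orders_7d` are assumptions. `InMemoryStore` is a stand-in so the example runs without a server; a real redis-py client exposes `mset` with the same dict argument.

```python
from typing import Dict, Iterable, Tuple

# Assumed shape of a freshly computed feature row: (entity_id, feature_name, value)
FeatureRow = Tuple[str, str, str]


def redis_key(entity_id: str, feature_name: str) -> str:
    # Hypothetical key layout; the talk does not say how keys are named.
    return f"{feature_name}:{entity_id}"


def push_features(rows: Iterable[FeatureRow], client) -> int:
    """Write computed feature rows to the online store in a single call."""
    mapping = {redis_key(entity, name): value for entity, name, value in rows}
    if mapping:
        client.mset(mapping)  # redis-py's mset takes a dict of key -> value
    return len(mapping)


class InMemoryStore:
    """Stand-in for a Redis client, used only to keep the sketch runnable."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def mset(self, mapping: Dict[str, str]) -> bool:
        self.data.update(mapping)
        return True


store = InMemoryStore()
new_rows = [("cust-1", "n_orders_7d", "3"), ("cust-2", "n_orders_7d", "1")]
written = push_features(new_rows, store)
print(written, store.data)
```

Keeping the function this small matches the spirit of the design: the SQL in Snowflake does the feature work, and the Lambda only copies results across.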
How we actually configure this looks like this. It’s just a YAML file. We have a bunch of configurations on top. There’s obviously more options. But to simplify it, this is what you have. You say which feature you want, which entities you need. And then just write SQL. This is just run-of-the-mill SQL code wrapped in a function so that we can run this both for batch and offline training, and then during backfilling. So it’s exactly the same code no matter what we do. This is also very, very cool to have in production because it’s quite hard to beat Redis in terms of simplicity. So with a redis.mget and a vector of all the features that we need, we get everything immediately, and we can go to production.
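The point that one SQL definition serves both backfilling and incremental updates can be sketched with Python’s built-in sqlite3 in place of Snowflake. Everything here is an assumption made for illustration: the `FEATURE` dict only mirrors the "feature name, entities, SQL" idea and is not the actual YAML schema, and the `{scope}` placeholder, table names, and `n_orders` feature are hypothetical.

```python
import sqlite3

# Hypothetical feature definition: a name, its entities, and plain SQL.
# {scope} names the table whose entities need (re)computing.
FEATURE = {
    "name": "n_orders",
    "entities": ["customer_id"],
    "sql": """
        SELECT customer_id, COUNT(*) AS n_orders
        FROM orders
        WHERE customer_id IN (SELECT customer_id FROM {scope})
        GROUP BY customer_id
        ORDER BY customer_id
    """,
}


def compute(conn: sqlite3.Connection, feature: dict, scope: str) -> list:
    """Run the feature's SQL for every entity that appears in `scope`.

    `scope` is a trusted table name from our own config, not user input.
    """
    return conn.execute(feature["sql"].format(scope=scope)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id TEXT)")
conn.execute("CREATE TABLE new_orders (order_id INTEGER, customer_id TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "a"), (2, "a"), (3, "b")])

# Backfill: compute the feature for every customer seen so far.
backfill = compute(conn, FEATURE, scope="orders")

# Incremental: a new order arrives for customer "b"; new_orders plays the role
# of the stream of inserted rows, so only "b" is recomputed.
conn.execute("INSERT INTO orders VALUES (4, 'b')")
conn.execute("INSERT INTO new_orders VALUES (4, 'b')")
incremental = compute(conn, FEATURE, scope="new_orders")

print(backfill, incremental)
```

The only thing that changes between the two calls is which rows decide what gets recomputed; the feature logic itself is written once.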
We also need to keep a link between model versions and features, because we have multiple models, depending on the customer, and each model may use different features. So we just keep a link to what we need in a DynamoDB table. And then we go get it from Redis. So this is very simple. It runs super smoothly. The only issue is, like I was showing here, right now, we are locked into this Fivetran delay. But we will replace this at some point with actual events, possibly using Kafka, which means that we will push this forward, and you can keep having Snowflake in there anyway.
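The lookup described here, from model version to feature list to Redis, might be sketched as follows. The model names, feature names, and key layout are hypothetical, a plain dict stands in for the DynamoDB table, and `InMemoryStore` stands in for a Redis client. redis-py’s `mget` does return values in key order with `None` for missing keys, which is what the stand-in imitates.

```python
from typing import Dict, List, Optional

# Stand-in for the DynamoDB table linking each model version to its features.
MODEL_FEATURES: Dict[str, List[str]] = {
    "fraud-v1": ["n_orders_7d"],
    "fraud-v2": ["n_orders_7d", "avg_basket_30d"],
}


class InMemoryStore:
    """Stand-in for a Redis client, used only to keep the sketch runnable."""

    def __init__(self, data: Dict[str, str]) -> None:
        self.data = dict(data)

    def mget(self, keys: List[str]) -> List[Optional[str]]:
        # Values come back in key order; missing keys yield None.
        return [self.data.get(key) for key in keys]


def fetch_features(model_version: str, entity_id: str, client) -> Dict[str, Optional[str]]:
    """Look up which features a model needs, then fetch them in one round trip."""
    names = MODEL_FEATURES[model_version]
    keys = [f"{name}:{entity_id}" for name in names]  # hypothetical key layout
    values = client.mget(keys)
    return dict(zip(names, values))


store = InMemoryStore({"n_orders_7d:cust-1": "3"})
result = fetch_features("fraud-v2", "cust-1", store)
print(result)
```

With a real redis-py client, values come back as bytes unless the client is created with `decode_responses=True`, and a feature that has not been computed yet shows up as `None`, so the model code has to decide how to handle that case.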
So in summary, if you’re just starting out and you don’t have that much real-time data going on, package your features in some package, share it across your teams. It’s not ideal as a long-term solution, but it’ll definitely reduce the amount of code everyone is writing. And yeah, you can just build a feature store with three components like we did. There’s a link to the full story where we show a lot more details on how to do this, a lot more examples. And I would just love to hear from other people, if you try it out and you experiment with this. And that’s what I have for you guys today. Thank you very much.