apply(conf) - May '22 - 10 minutes
Abnormal Security identifies and blocks advanced social engineering attacks in an ever-changing threat landscape, and so rapid feature development is of paramount importance for staying ahead of attackers. As we’ve scaled our machine learning system to serve thousands of features for hundreds of machine learning models, it’s become a major focus to balance stability with rapid iteration speed. Last year at apply(), we saw a great talk from Stitch Fix on their Hamilton framework for managing complicated logic in Pandas Dataframes. In this talk, we present a similar framework called Compass that was developed at Abnormal. Compass takes a similar approach to Hamilton by modeling feature extraction pipelines as a DAG of composable functions, but differs in some key design choices that make it a better fit for Abnormal’s ML use-case and tech stack. We’ll show how Compass enables machine learning engineers to express feature extraction logic in simple, pipeline-agnostic Python functions, while also providing a way to interface with a feature store in a scalable way when needed.
Hey, so I’m Justin Young, I’m a software engineer at Abnormal Security, and today I’m going to be talking about some common challenges that you’re likely to face when building a feature platform, and a system called Compass that we built to tackle these. So I’m going to go pretty fast, but I’ll share these slides out after. So let’s start with the real-world problem that we’re trying to solve. So at Abnormal Security, our goal is to stop all email-based cyber attacks. So as an ML problem, we have some pretty stringent requirements for this. For one, we need really high recall, because it can be devastating if an attack makes it through to an end user’s inbox. At the same time, we need really high precision, because we don’t want to be flagging safe, normal messages. And finally, we need really rapid iteration cycles, because attackers are constantly changing their techniques and actively trying to evade a system like ours.
So, we have this mantra of, move fast and don’t break things. And so for me, as someone who’s working on the ML platform, I’m thinking about how we can implement this user story where we want to make it really easy to add and remove signals, hard to break existing signals, and we also want this pipeline to run really well online and offline. So for today, I’m going to be talking about two challenges that you might run into in trying to do this. The first is entanglement, which is what happens when a certain change to the system has a really wide blast radius. The second is online/offline skew, which is what happens when the data or the logic that you’re serving online is very different from what’s being served offline.
So let’s start with this entanglement problem. So entanglement, like I said, is this blast radius problem. A succinct way of saying it is that changing anything changes everything. And so let’s look at a common approach to a feature pipeline, with some example code that shows how this might come about. So we might have a predict function like this. In our case, we’re taking in an email and we want to return a model prediction. So you can imagine we start by extracting some heuristic signals, we encode that into a feature vector, and then we return a model prediction based on that feature vector. So totally fine, this seems very reasonable, it’ll definitely work. But we said that we want to be adding signals to the pipeline very often, and so maybe at a later time, we add some feature store signals, where we go and use the heuristics, look up something in a feature store and add those to our feature vector.
And maybe we keep doing this, we want to add some heuristics based on what we looked up in the feature store. And eventually this pipeline gets so complex that it’s really hard to understand, and as a result of that, it’s going to be hard to add signals, basically impossible to remove signals, and the reason that we can’t remove or change signals is that, if we make a change at the beginning of this pipeline, it could have an effect anywhere later on, and so we’re going to have to do a lot of custom work to make sure that what we’re doing is safe and it’s going to become infeasible, and this is exactly this entanglement problem. The reason that this is such a big problem, is that changing anything changes everything, like I said, changing something at the beginning of the pipeline can cause the whole thing to break later on and our machine learning pipeline will start to look like this Jenga tower where a small disturbance at the base can cause anything later on to break and the whole thing can fall over.
And the reason this happens is that a block at the top of the tower depends on all the blocks below it, which is actually a really bad characteristic for this system.
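The slide code is not reproduced in this transcript, so as a rough illustration only, an imperative predict function that grows in the way described might look like the sketch below. Every name here (FEATURE_STORE, the signal keys, the toy weights) is a hypothetical stand-in, not Abnormal's actual code.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for a feature store: sender domain -> risk score.
FEATURE_STORE: Dict[str, float] = {"evil.example": 0.9}


def predict(email: Dict[str, str]) -> float:
    signals: Dict[str, Any] = {}

    # Stage 1: heuristics computed directly from the raw email.
    signals["sender_domain"] = email["sender"].split("@")[-1]
    signals["has_urgent"] = "urgent" in email["subject"].lower()

    # Stage 2 (added later): feature store lookup keyed on a stage 1 signal.
    signals["domain_risk"] = FEATURE_STORE.get(signals["sender_domain"], 0.0)

    # Stage 3 (added later still): a heuristic built on top of the lookup.
    signals["risky_and_urgent"] = (
        signals["has_urgent"] and signals["domain_risk"] > 0.5
    )

    # Encode and score. Every stage reads and writes the shared dict, so
    # nothing records which earlier signals a given line really depends on.
    features: List[float] = [
        float(signals["has_urgent"]),
        signals["domain_risk"],
        float(signals["risky_and_urgent"]),
    ]
    weights = [0.2, 0.5, 0.3]  # toy linear "model", for illustration only
    return sum(w * x for w, x in zip(weights, features))
```

Because each stage can read anything written before it, a reader has to assume that changing stage 1 may affect stage 3 and the final score, which is the Jenga tower behavior described above.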
So let’s take a quick aside and talk about function composition. Abstracting a bit, this function that you see here has basically the same structure as the function that we wrote before. If we ignore the side effects that are happening when we write imperative code like this, we’re basically just iteratively taking in some input and returning outputs, and this is really just function composition. And so, just to hammer home this point, we can rewrite our original function in this composed form and do exactly the same thing.
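To make the equivalence concrete, here is a minimal sketch with three toy stages, written once imperatively and once as a composition. The stage functions and the compose helper are assumptions made for illustration; the talk's slides are not part of this transcript.

```python
from functools import reduce
from typing import Any, Callable


def compose(*stages: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Chain stages left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, stage: stage(acc), stages, x)


def extract_heuristics(email: dict) -> dict:
    return {"has_urgent": "urgent" in email["subject"].lower()}


def encode(signals: dict) -> list:
    return [float(signals["has_urgent"])]


def run_model(features: list) -> float:
    return 0.8 * features[0]  # toy model, for illustration only


def predict_imperative(email: dict) -> float:
    # Step-by-step version: each line feeds the next.
    signals = extract_heuristics(email)
    features = encode(signals)
    return run_model(features)


# Composed version: the same three stages, with no intermediate names.
predict_composed = compose(extract_heuristics, encode, run_model)
```

Both versions return the same value for any input, as long as the stages have no side effects.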
I would even go further and say that function composition is the primary responsibility of a machine learning pipeline. Side effects are definitely important, but they should be happening elsewhere, before or after this function. Within this function, we really only care about the inputs and outputs, and what we want to do is iteratively transform our input data until we get some model prediction. We don’t care if any of the intermediate functions are simple Python logic, an RPC to a remote service, or even running a model. All of it just has some inputs and some outputs in this model.
And so this is the whole reason why we’re doing this, we have an input, we want an output, but our data model is actually too flexible. So we have this really unconstrained function composition, and we definitely have dependencies between stages, but even within stages, you can see that if we drill down into something like a heuristic function, we’re getting an inner web of dependencies, and what this looks like is this graph of dependencies that you see here on the right. We have dependencies within stages and across stages, and in the end we just get this really tangled mess of dependencies, and so if we think about how this is going to scale out using the total number of dependencies as one proxy for the complexity of the pipeline, it’s going to be this N squared relationship as we’re adding more signals, and it’s actually a massive problem.
If you look at the Y axis here, if we’re extracting hundreds of signals, we’re already into the hundreds of thousands of dependencies, and it’s just going to explode beyond that. It’s definitely going to be more than a single person can comprehend and make changes to while really understanding what they’re doing to the system. Okay. So this sounds like a major problem. So how do we avoid it? Well, the key insight is that we actually only need a small number of inputs on average for every function. Our model right now is too unconstrained and we can probably rein it in. And if we were to do that, our graph of dependencies would look a lot more like what we see here on the right. This is probably more like what you’d expect, rather than this tangled mess. And if we simulate a pipeline that looks like this, in this case, the orange line shows that if we had two inputs to every function on average, it would still be [inaudible] squared in the worst case, but the growth is going to be a lot slower on average.
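The simulation behind the plot is not described in detail, so the sketch below is only one plausible way to reproduce the comparison. It assumes that "dependencies" means every earlier signal that can affect a given signal (direct or indirect), and that the constrained pipeline picks its inputs at random; both are assumptions, as are all the function names.

```python
import random
from typing import Dict, List, Set


def transitive_dependency_count(inputs: Dict[int, List[int]]) -> int:
    """Sum, over all signals, of how many earlier signals can affect each."""
    ancestors: Dict[int, Set[int]] = {}
    for signal in sorted(inputs):  # inputs only point to lower-numbered signals
        reach: Set[int] = set()
        for parent in inputs[signal]:
            reach.add(parent)
            reach |= ancestors[parent]
        ancestors[signal] = reach
    return sum(len(reach) for reach in ancestors.values())


def entangled_pipeline(n: int) -> Dict[int, List[int]]:
    """Unconstrained case: each signal may depend on every earlier one."""
    return {i: list(range(i)) for i in range(n)}


def constrained_pipeline(n: int, k: int = 2, seed: int = 0) -> Dict[int, List[int]]:
    """Each signal declares at most k inputs drawn from earlier signals."""
    rng = random.Random(seed)
    return {i: rng.sample(range(i), min(k, i)) for i in range(n)}


if __name__ == "__main__":
    for n in (100, 300, 500):
        dense = transitive_dependency_count(entangled_pipeline(n))
        sparse = transitive_dependency_count(constrained_pipeline(n))
        print(n, dense, sparse)
```

In the unconstrained case the count is n(n-1)/2, which for 500 signals is 124,750. A constrained pipeline can still reach that bound if its inputs form one long chain, which matches the worst-case remark above, but it never exceeds it.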
Okay. So this is a nice idea, but how do we actually make this work? Well, one way would be to just explicitly model the inputs and outputs of every function. If we did that, then we wouldn’t have to make this really wide assumption that something at the top can change something at the bottom. We’d know exactly what the dependencies are in the graph. If we did that, we would get this graph that you see here. And just to be explicit, there are functions that are transforming the inputs and returning these outputs, but we can just think about the graph of signals. Okay. So this is a bit abstract. It’s a nice idea, but how will it work in production? Well, quick shout-out to the Stitch Fix team and their Hamilton system from last year’s apply() conference. There’s definitely a similar problem there, but at Abnormal, we’re not very DataFrame-centric.
So we have somewhat different constraints and our design will look a bit different. One challenge that we’ll face right off the bat is that we’re going to have some ugly data classes to model these dependencies, but all I want to show here is that syntactic sugar can do a lot of heavy lifting and allow the ML engineer to have a UX of basically just defining a normal Python function with this decorator. And then in terms of actually executing this pipeline, again, the details aren’t super important, but I just want to show that it’s a pretty simple algorithm. All you need to do is topologically sort the graph of signals, and make sure that you have an append-only data structure and that every function is just appending to it. And then if you just execute all your functions in order, it’ll basically look like this.
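Compass's actual decorator and executor are not shown in this transcript. As a minimal sketch of the general idea, assuming a hypothetical `signal` decorator that reads each function's parameter names as its input signals, and using the standard library's `graphlib` for the topological sort:

```python
import inspect
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict

# Hypothetical registry: signal name -> function that produces it.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def signal(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register fn as the producer of the signal named after it.

    Its parameter names are treated as the names of its input signals.
    """
    REGISTRY[fn.__name__] = fn
    return fn


@signal
def sender_domain(sender: str) -> str:
    return sender.split("@")[-1]


@signal
def has_urgent(subject: str) -> bool:
    return "urgent" in subject.lower()


@signal
def score(sender_domain: str, has_urgent: bool) -> float:
    # Toy "model": half weight for each of two boolean conditions.
    return 0.5 * sender_domain.endswith(".example") + 0.5 * has_urgent


def run(raw_inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Execute every registered signal in dependency order."""
    graph = {
        name: set(inspect.signature(fn).parameters)
        for name, fn in REGISTRY.items()
    }
    signals = dict(raw_inputs)  # append-only: keys are added, never replaced
    for name in TopologicalSorter(graph).static_order():
        if name in signals:  # a raw input, nothing to compute
            continue
        kwargs = {dep: signals[dep] for dep in graph[name]}
        signals[name] = REGISTRY[name](**kwargs)
    return signals
```

Calling `run({"sender": "a@evil.example", "subject": "Urgent"})` returns the raw inputs plus `sender_domain`, `has_urgent`, and `score`, each computed only after its inputs are available.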
We start with our input signals, execute the first transformation, append to our collection, and if we keep doing this, at the end we’ll get our fully realized collection of signals, including any model predictions that we might want at the end. And so this will work. We also get some nice utilities in addition if we do this dependency modeling. It’s really easy to understand the pipeline, because we know what all the dependencies are, so we can do something like visualize a graph or list the dependencies of a signal. We can also validate things like the graph having no cycles, and no unused branches and wasted computation. And finally, we can platform-ize some common operations, like propagating a missing signal down so that we can skip transformations that can’t properly work on missing signals, or doing parallel execution of parallel paths in the DAG.
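Two of the utilities mentioned above, graph validation and missing-signal propagation, can be sketched in a few lines once dependencies are explicit. This is an illustrative sketch, not Compass's implementation: the `Pipeline` shape, the `MISSING` sentinel, and both function names are assumptions.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Any, Callable, Dict, List, Set, Tuple

# Hypothetical shape: signal name -> (input names, transform).
Pipeline = Dict[str, Tuple[List[str], Callable[..., Any]]]
MISSING = object()  # sentinel for a signal that could not be computed


def validate(pipeline: Pipeline, outputs: Set[str]) -> List[str]:
    """Report cycles, or signals that no requested output depends on."""
    graph = {name: set(deps) for name, (deps, _) in pipeline.items()}
    try:
        list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        return [f"cycle: {err.args[1]}"]
    needed: Set[str] = set()
    stack = list(outputs)
    while stack:  # walk backwards from the outputs
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(graph.get(node, ()))
    return [f"unused: {name}" for name in sorted(set(graph) - needed)]


def run(pipeline: Pipeline, raw: Dict[str, Any]) -> Dict[str, Any]:
    """Execute in dependency order, skipping transforms with a missing input."""
    graph = {name: set(deps) for name, (deps, _) in pipeline.items()}
    signals = dict(raw)
    for name in TopologicalSorter(graph).static_order():
        if name in signals:
            continue
        if name not in pipeline:
            signals[name] = MISSING  # a raw input that was not provided
            continue
        deps, fn = pipeline[name]
        args = [signals[d] for d in deps]
        signals[name] = MISSING if any(a is MISSING for a in args) else fn(*args)
    return signals


example: Pipeline = {
    "sender_domain": (["sender"], lambda s: s.split("@")[-1]),
    "domain_len": (["sender_domain"], len),
    "has_urgent": (["subject"], lambda s: "urgent" in s.lower()),
    "score": (["has_urgent"], float),
}
```

With this example, asking only for `score` flags `sender_domain` and `domain_len` as unused, and running without a `sender` marks both of them as missing while `score` is still computed.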
So this works. We have a composable pipeline, which we said is the whole reason that we want a machine learning pipeline, and it’s basically minimally entangled, since we should have the minimum set of dependencies. But it’s not necessarily scalable. So how do we execute this offline? Obviously that’s a really important part of a machine learning system. So let’s take an example of a lookup to an online store in a feature store, which might be our online transformation. If we try to execute that in the same way offline, we know that’s not going to work. That’s the whole reason that a feature store has an offline store. And so we chose to just model that reality and try to reflect it in our code by allowing a user to optionally override with special offline logic. So again, the details are not super important, but the key insight here is that most of the time, we won’t need to do this.
Most of the time we can just map our online function, if it’s simple Python logic, for example, over the data set. In special cases where we’re not able to do that, like a feature store lookup, we can override it with something like a join, but most of the time we don’t need to do that. So this also has the really nice property that our pipeline is going to look exactly the same online and offline. In most cases we’re even going to be able to use the same code, and any differences are going to be isolated to these individual functions. And so this should be scalable, since we have a pipeline that can execute offline. So we accomplish both our goals for this system.
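The override mechanism can be sketched as follows, using plain Python lists of dicts in place of a real dataset and feature store. The `Transform` class, the two stores, and the lookup functions are all hypothetical; the talk does not show how Compass expresses this.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


@dataclass
class Transform:
    """One signal: row-wise online logic plus an optional batch override."""

    name: str
    online: Callable[[Row], Any]
    offline: Optional[Callable[[List[Row]], List[Any]]] = None

    def run_offline(self, rows: List[Row]) -> List[Any]:
        if self.offline is not None:
            return self.offline(rows)  # special-cased batch logic
        return [self.online(row) for row in rows]  # default: map online logic


# Hypothetical stores: a key-value online store and a tabular offline store.
ONLINE_STORE = {"evil.example": 0.9}
OFFLINE_TABLE = [{"domain": "evil.example", "risk": 0.9}]


def lookup_online(row: Row) -> float:
    return ONLINE_STORE.get(row["domain"], 0.0)  # one point lookup per request


def lookup_offline(rows: List[Row]) -> List[float]:
    # Left join against the offline table instead of per-row remote lookups.
    table = {r["domain"]: r["risk"] for r in OFFLINE_TABLE}
    return [table.get(row["domain"], 0.0) for row in rows]


# Simple Python logic needs no override; the lookup gets a join-style one.
has_urgent = Transform("has_urgent", lambda row: "urgent" in row["subject"].lower())
domain_risk = Transform("domain_risk", lookup_online, offline=lookup_offline)
```

The pipeline definition is identical in both environments; only `domain_risk` carries environment-specific code, so any online/offline difference is confined to that one function.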
And you’ll pardon the buzzword for a second. This is a bit of a silly buzzword, but I just want to pull apart two conflated ideas in a feature platform. One is a feature store, which we hear a lot about, and which is really good at registering data sources, serving them online and offline, and making sure that the data is as close as possible. I want to just propose the idea that a function store, doing the same thing but for the logic that’s transforming your data, is equally important, and you should definitely have a feature platform that does both of these in a really nice way. That’s my main takeaway. Make sure that your feature platform is doing both. There are some really nice managed feature platforms that do both in a really nice way, but there are also some out there that only really handle the data. So, that’s all I have, but thanks so much for listening, and thanks to the team who worked on this system. We’re also hiring, so thanks so much.