How to Draw an Owl and Build Effective ML Stacks

apply(conf) - May '22 - 30 minutes

“They’re handing us an engine, transmission, brakes, and chassis and asking us to build a fast, safe, and reliable car,” a data scientist at a recently IPO’ed tech company opined, while describing the challenges he faces in delivering ML applications using existing tools and platforms. Although hundreds of new MLOps products have emerged in the past few years, data scientists and ML engineers are still struggling to develop, deploy, and maintain models and systems. In fact, iteration speeds for ML teams may be slowing! In this talk, Sarah Catanzaro, a General Partner at Amplify Partners, will discuss a dominant design for the ML stack, consider why this design inhibits effective model lifecycle management, and identify opportunities to resolve the key challenges that ML practitioners face.

Thank you for the intro. It’s awesome to be here today. I think there’s just such a clear need for conferences like this that’ll bring together data and ML practitioners so that we can have conversations about data and ML. But anyway, if you give me one second, I will share my slides.

Okay. There we go. So thanks again for the introduction and just the walk through the woods. As you mentioned, I’m a general partner at Amplify Partners. My name’s also Sarah Catanzaro. And Amplify is an early stage venture capital firm. We focus primarily on investing in technical tools and platforms. And I focus specifically on investing in data and ML tools and platforms. So you might be wondering why the Tecton team invited an investor to critique the design of ML stacks. The fact of the matter is probably that I say some kind of spicy things on Twitter from time to time, but it’s actually a problem that I think a lot about.

So before I went into investing, I used to be a data scientist. I had most recently led the data team at a company called Mattermark. We were collecting data on other startups and selling it to investors. But when I reflect upon my time at Mattermark and the experience leading the data and ML team there, I think the thing that really stands out to me is that it’s not significantly easier seven years later to build and deploy ML-driven applications. So that’s something that I spend a lot of time thinking about, and it’s something that I’ve been discussing with former colleagues, with friends, with other ML practitioners.

But before we kind of get into what I’ve learned from those conversations, I’ll tell you a bit about a meme that became popular back in 2010. So the meme was basically about how to draw this beautifully illustrated owl. There were two steps. First step, you draw two circles. One circle represents the head. One circle represents the body. Second step, like draw the damn owl. Now, it’s 12 years later, and there’s just a plethora of ML tools and platforms for every step of the ML development cycle. But frankly, I feel like a lot of these vendors are basically telling their users to draw the damn owl. They’re saying like here’s a distributed training library. Here’s a data science framework. Now like build your damn ML stack. And it’s frustrating and challenging because I think firstly, it’s hard to understand what specific problem any of these vendors solve. It’s hard to cut through that marketing jargon.

It’s also hard to understand how the components of the stack, how these various tools might fit together. And even if you have some ideas about how they ought to fit together, it can still be really tough to implement. And so this is a phenomenon that I want to really get into today. But I think before we can start to diagnose why this is the case, we need to establish more of a common vocabulary, a common language for talking about the ML stack. I think one of the things that I see is that… Frankly, there’s a lot of needlessly spirited debate that is happening in the MLOps community because we don’t have common definitions. So like when I say feature store and Demetrios says feature store, we may not actually be talking about the same thing. And so we may disagree about the utility of feature stores because we don’t have this common set of definitions.

So I’m going to present a bit about how I think about the design and the organization of ML stacks, but I’d really urge the ML community to come up with your own language for describing the ML stack, its components, and associated workflows in more of a uniform way.

Okay. So I tend to think about the ML stack as having three different layers. There’s the data management layer, modeling and deployment layer, and model operations layer. The data management layer is really focused on tools and platforms that will enable you to collect and query and evaluate training datasets, labels, predictions. The modeling and deployment layer would include tools to build and train and tune and optimize and deploy models. Model operations I think of as being kind of primarily focused on enabling users to either meet or potentially even exceed their performance requirements but also to kind of conform with regulatory or policy standards.

So we can double click on the data management layer first. In this category, we have data labeling tools. I think that’s pretty self-explanatory like data labeling tools. They enable users to label data either by providing kind of access to and tools to manage human labelers or more programmatically by using techniques like weak supervision. We’re also starting to see more and more tools that are really focused on enabling users to evaluate label quality and improve label quality. Databases in this context would mean tools really to like collect and query labels, training datasets and predictions. And I tend to see like two different approaches here. So there are some companies that are primarily using their data warehouse. The data warehouse, it’s really optimized for structured data. And so it’s commonly used by a lot of analytics teams. And I think one of the real benefits of this approach is actually that ML teams can kind of cut down on some data prep by using the data models that analytics engineers are preparing.

Those also tend to be monitored and tested and possibly even documented. So certainly, that can provide some leverage. The other approach that I see though is building an ML platform or an ML stack on top of the data lake. The data lake is optimized for storing and managing raw data, both structured and unstructured. And certainly, I think in many respects it can be a more flexible approach than the data warehouse. In my opinion, the data lake also tends to provide a better Python native experience, although perhaps that’s changing. The last thing that I’d include within this category is vector databases, which are simply becoming more and more common for search and recommendation systems and other types of applications that rely upon embeddings.
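
As a rough illustration of what an embedding lookup involves (a sketch that is not from the talk), the snippet below ranks items by cosine similarity to a query vector. The item names and three-dimensional embeddings are made up, and the brute-force scan stands in for the approximate nearest-neighbor indexes that production vector databases typically use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the ids of the k items most similar to the query embedding."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]


# Hypothetical toy index: item id -> embedding.
index = {
    "owl_poster": [0.9, 0.1, 0.0],
    "hawk_poster": [0.8, 0.3, 0.1],
    "car_manual": [0.0, 0.2, 0.9],
}

print(top_k([1.0, 0.0, 0.0], index, k=2))  # ['owl_poster', 'hawk_poster']
```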

So I don’t really need to spend that much time talking about feature management since we’re at a conference that’s hosted by Tecton. I tend to think about feature stores as being the tools and platforms that are going to enable you to transform data into features, and not just features but features that can be used more reliably during training and inference. They are going to solve some of these gnarlier technical problems, whether it’s enabling accurate backfill or resolving online/offline inconsistencies. I also see some companies that are just using feature proxies. The feature proxy approach can be well-suited for enabling low latency feature serving, though that may also be a capability that the feature store provides.
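
To make the backfill problem concrete, here is a minimal sketch (not from the talk, and not any particular feature store's API) of a point-in-time lookup: each training row gets the latest feature value recorded at or before the label's timestamp, so no future information leaks into training. The entity ids, timestamps, and the `purchases_last_7d` feature are hypothetical.

```python
from bisect import bisect_right


def point_in_time_value(feature_history, as_of):
    """Latest feature value recorded at or before `as_of`, or None if there is none.

    feature_history: list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_history]
    pos = bisect_right(timestamps, as_of)  # number of records with ts <= as_of
    if pos == 0:
        return None
    return feature_history[pos - 1][1]


def build_training_rows(label_events, feature_log):
    """Join each label event with the feature value that was known at event time."""
    rows = []
    for entity_id, event_ts, label in label_events:
        history = feature_log.get(entity_id, [])
        rows.append((entity_id, event_ts, point_in_time_value(history, event_ts), label))
    return rows


# Hypothetical feature log: entity -> [(timestamp, purchases_last_7d), ...]
feature_log = {"user_1": [(10, 2), (20, 5), (30, 9)]}
# Hypothetical label events: (entity, event timestamp, label)
label_events = [("user_1", 5, 0), ("user_1", 25, 1), ("user_1", 30, 1)]

for row in build_training_rows(label_events, feature_log):
    print(row)
```

The first event predates every recorded value, so its feature is `None` rather than a value borrowed from the future; using the same lookup rule at training and serving time is one way to keep the two paths consistent.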

Oops. Okay. So modeling and deployment. Frankly, I think there’s a lot of brand confusion in this category and even I had difficulty coming up with new… So the way that I think about it though is that deep learning frameworks generally help their users build and train models, whereas data science frameworks tend to help their users create data science projects. I think the real utility of the data science frameworks is that they enable you to schedule and, well, run and log ML pipelines typically as DAGs and workflows and in more of a uniform way. Some of the data science frameworks also provide lightweight experiment tracking and version control capabilities. Although some companies, including those that are iterating on their models fairly frequently, may choose to adopt an experiment tracking application, something that is really purpose-built for that use case. Often, it will include more robust visualization capabilities. It might also make it easier for users to reproduce and package and share models.

Next is deployment and optimization. So these tools typically enable their users to deploy models as prediction services, usually containerized prediction services. Some of them and some of the newer deployment tools are also designed to run on the data warehouse. I’m also seeing an increasing number of deployment tools that are focused not only on enabling you to containerize your models, but that also provide some optimization capabilities such that you can achieve certain production SLAs, whether it’s low latency, meeting certain memory requirements, et cetera. And then the last thing in this category is distributed training. So these tools accelerate training typically by leveraging parallel processing. Okay. Only one more of these layers and then we can get into what is wrong with all of this.

So model operations. So frankly, I’ve been debating whether or not model operations is even the right name for this category. Luis I think had an interview not too long ago where he basically argued that what we’re doing when we go do ML is not building models; we’re building ML-driven applications. And so model operations isn’t really a thing. It’s more like a specific flavor of DevOps. And I think there’s a lot of merit to that argument. So perhaps this should really be called like DevOps-ML or something along those lines. Anyway, I do think that there are some aspects of monitoring and managing ML-driven applications that are unique, that make many of the tools that were designed for DevOps with traditional software applications kind of ill-suited to that task. Model monitoring is probably a really good example. Many of the tools listed here, they’re really focused on enabling their users to detect distribution shifts, things like that.
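
One common way to quantify a distribution shift is the population stability index (PSI), which compares the binned distribution of a feature at training time with the distribution seen in production. The sketch below is illustrative only and is not taken from the talk or from any specific monitoring tool; the equal-width binning, the `eps` smoothing for empty bins, and the sample data are all assumptions.

```python
import math


def psi(reference, current, n_bins=4, eps=1e-6):
    """Population stability index between a reference and a current sample.

    Bins are equal-width over the reference range; values outside that range
    are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # eps keeps the logarithm defined when a bin is empty
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical feature values at training time and after a shift in production.
training_values = [float(v) for v in range(100)]
shifted_values = [v + 50.0 for v in training_values]

print(psi(training_values, training_values))  # identical samples give 0.0
print(psi(training_values, shifted_values))   # a shifted sample gives a larger value
```

What PSI value should trigger an alert is a judgment call that each team makes for its own application.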

Model analytics is actually a category I’ve been thinking about a lot lately. So one of the other things I notice is that product engineering teams, they tend not to iterate on the same feature that frequently. So if you’re building a navigation panel, you’re going to release that but you’re not going to turn around and then iterate on the next version of the navigation panel. You might revisit it in like two to three years. In contrast, a lot of ML and ML engineering teams will constantly iterate on the same ML-driven feature, whether it’s a pricing model or a recommendation carousel. And as such, my expectation is that we’re going to need more Mixpanel or Amplitude for ML-type products. Model compliance, most of these tools generally ensure that protected attributes are not used to make predictions and help users mitigate algorithmic risk. And in the category of continuous learning, I typically bucket tools that either automate retraining or enable active learning.

So, now I’ve kind of expounded upon how I think about the ML stack. I think we have a pretty good understanding of my mental models for each of the components. So we can start to talk about what happens when these components don’t work so well together. So the first thing that I’ll discuss is what happens when the components of the data management layer don’t really connect clearly together. So one other thing maybe to add before we get into the specific problems associated with this anti-pattern is that I think all of this is going to become much more acute, particularly as companies adopt the paradigm of data-driven programming where they’re iterating on their datasets to improve model performance. But just to go into some specific symptoms of this problem. Obviously, when you have data stored across multiple systems and those systems are not well-integrated, it’s going to be much harder to find and access and evaluate datasets.

Some companies kind of resolve this problem by tasking data or ML engineers with making data available to data scientists or whoever is building the model. But then you have ML and data engineers spending all of their time just moving data to and from S3 buckets. And given some of the other challenges associated with building ML-driven applications, I really find it hard to believe that that is the best use of their time.

The other thing that I notice here is that like, frankly, there’s like a lot of duplicative work that’s being done both across ML teams where the data preparation or feature engineering work done by one team cannot easily be repurposed by other teams. But also kind of across the entire organization. Like I suspect that like a lot of the work that is being done to prepare data for other use cases like experimentation or financial reporting could really be repurposed for ML, but it’s hard to kind of like uncover that work. And as such, I think there’s a lot of wasted effort.

Now another thing that, I think, more and more people are talking about nowadays is kind of the challenge of working with unstructured data. Like obviously, transformers are like all the rage in the research community, but I don’t actually see them getting widespread adoption in the industry. I think part of the reason is that like it’s really hard for companies today to reason about the unstructured data that they actually have. But I think this could be a lot easier if we had better tools that enabled you to link your structured data to your unstructured data so that you could kind of easily develop a profile of like what might be contained within your unstructured datasets.

It was interesting. I actually talked to somebody at NVIDIA recently about how they were solving this problem. So they have developed more of an abstraction such that data scientists can write one query as if they were querying a single database and easily collect data about like LIDAR, CV, like weather data, accelerometer data, all associated with a single drive. So I think tools like that ought to become more commonplace such that like we can at least obfuscate the problems that occur when you’ve got data silos, which may exist for like valid reasons. And we’ll go into that in a second too.

Okay. So now we can talk about the issues that occur at the modeling and deployment layer when the components of that layer are not elegantly integrated. I think one example that always comes to mind in describing this problem is the problem of going from notebooks into production. Most data scientists or at least many data scientists are going to start the model development process with a deep learning or machine learning framework in a notebook. But when they’re ready to operationalize their work, they are often going to restructure their code as a DAG. And at a minimum, that’s super tedious to have to go through all of your notebook code and kind of extract the DAG from your notebook code. This is not to say that the right approach is running notebooks in production. I tend to be pretty skeptical about that too. But the whole notebook to DAG thing, like that’s got to get easier.
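
For readers who have not done this refactoring, the following sketch (not from the talk) shows what "extracting the DAG" can look like: each notebook cell becomes a named step with declared dependencies, and a scheduler derives the run order. It uses Python's standard `graphlib` module (Python 3.9+); the step names, the toy data, and the shared `ctx` dictionary are hypothetical stand-ins for real pipeline code.

```python
from graphlib import TopologicalSorter


# Each function stands in for what used to be a notebook cell.
def load_data(ctx):
    ctx["raw"] = [1.0, 2.0, 3.0, 4.0]


def build_features(ctx):
    ctx["features"] = [x * 2 for x in ctx["raw"]]


def train_model(ctx):
    # Toy "model": the mean of the features.
    ctx["model"] = sum(ctx["features"]) / len(ctx["features"])


def evaluate(ctx):
    ctx["error"] = abs(ctx["model"] - 5.0)


STEPS = {
    "load_data": load_data,
    "build_features": build_features,
    "train_model": train_model,
    "evaluate": evaluate,
}

# Step -> set of steps that must finish first.
DEPENDS_ON = {
    "load_data": set(),
    "build_features": {"load_data"},
    "train_model": {"build_features"},
    "evaluate": {"train_model", "build_features"},
}


def run_pipeline():
    """Run every step in an order that respects the declared dependencies."""
    ctx = {}
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        STEPS[name](ctx)
    return order, ctx


order, ctx = run_pipeline()
print(order)
print(ctx["model"], ctx["error"])
```

The tedious part the talk describes is exactly the step this sketch skips: working out, from implicit notebook state, which cells depend on which.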

The other thing that I see too is that the more advanced ML teams, they typically don’t treat modeling and deployment as discrete and sequential stages. So for example, at Stripe, they’ll often actually deploy a randomly weighted model before engaging in any further fine tuning. And what that enables them to do is to get a sense of the performance profile of the model. I think that type of behavior will make communications and interactions between data scientists and ML engineers much easier. But many of the tools that have been adopted today, they don’t really align with that pattern of working.
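
The idea of profiling before tuning can be sketched in a few lines. The example below is an assumption-laden illustration, not a description of how Stripe does it: it builds a randomly weighted linear model (a stand-in for whatever architecture is being considered), sends it synthetic requests, and reports latency percentiles. The function names and request counts are hypothetical.

```python
import random
import statistics
import time


def make_random_model(n_features, seed=0):
    """A linear model with random weights: useless for accuracy, fine for profiling."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in range(n_features)]

    def predict(features):
        return sum(w * x for w, x in zip(weights, features))

    return predict


def latency_profile(predict, n_features, n_requests=200, seed=1):
    """Time `predict` on synthetic inputs and summarize latency in milliseconds."""
    rng = random.Random(seed)
    latencies_ms = []
    for _ in range(n_requests):
        features = [rng.random() for _ in range(n_features)]
        start = time.perf_counter()
        predict(features)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {"p50_ms": cuts[49], "p95_ms": cuts[94], "p99_ms": cuts[98]}


model = make_random_model(n_features=1000)
print(latency_profile(model, n_features=1000))
```

Because the weights do not affect the amount of computation, the latency numbers are representative of the trained model of the same shape, which is what makes the early deployment informative.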

Okay. So now we can talk about some of the problems that occur between the layers. So one of the problems that I see is one that arises when the data management layer and the modeling and deployment layer are not coupled. Most teams will train their models offline in a batch setting. And that can still be difficult due to some of the pain points associated with finding and accessing data that I had described before. When you transition to batch inference, that can actually also be difficult because now you need to reason about schedules and data freshness and the impact on model performance. But connecting to batch runtimes is still a little bit easier. Batch datasets are still a little bit easier. When you start to work with online inference and online and nearline datasets, things get pretty gnarly. Like now, you’ve got to reason about the interplay between model complexity and data volume and client application SLAs.

You need to make sure that like your data sets are available in online stores. And in thinking about kind of the various backends underlying your ML pipeline, you need to consider things like query complexity and like read/write latency and transaction consistency. And those like types of like systems engineering problems, like they’re very different than algorithms problems or like modeling problems. And frankly, I don’t think like a lot of the tools that we use for modeling and deployment are really well-designed to help us think about kind of the interplay between like data systems and model performance, which leads me to just like one thing that I think we need to be talking about more from kind of a philosophical perspective.

So the diagram that is on my left, which might be your right, that’s a schematic about Netflix’s recommendation system from like 2013. So ML-driven applications, they’ve only become more complex. But when we talk about MLOps and ModelOps, we really only talk about deploying models or deploying algorithms. And as you see here, an algorithm is just one small component in an ML-driven application. Those who are productionizing ML, they need to build all this. They need to think about connections with UI services and various data stores and potentially even authorization services. I hope that in the future, more ML tools will be oriented towards helping teams build ML-driven applications and not take this model-centric approach.

So the last thing that I’ll go into is just the challenges that I think exist when all of the components of the ML stack are not explicitly or transparently connected. Frankly, even the best ML-driven applications are still very… They’re very brittle today, in part because the dependencies between these components are so opaque. And for that reason, I find many companies, many teams are actually reluctant to iterate on their ML-driven applications even when they know that they’re potentially leaving value on the table. So one of the things that was fairly shocking to me was hearing that companies like Amazon and Stitch Fix, and I think Spotify too, they only iterate on their embedding models like once every two years. And frankly, I think the reason is just that their ML stacks, they’re this gnarly web of different components that is so difficult to unpack. And so there’s an extreme reluctance to change any one component without understanding how it’s going to impact the entire system.
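
If the dependencies between components were recorded explicitly, the question "what does changing this component affect?" would become a simple graph traversal. The sketch below is illustrative only; the component names and edges are invented and do not describe any company mentioned in the talk.

```python
from collections import deque

# Hypothetical stack: component -> components that consume its output.
FEEDS_INTO = {
    "event_stream": ["feature_pipeline"],
    "feature_pipeline": ["online_store", "training_set"],
    "training_set": ["embedding_model"],
    "embedding_model": ["ranking_model"],
    "online_store": ["ranking_model"],
    "ranking_model": ["recommendation_api"],
    "recommendation_api": [],
}


def downstream_of(component, feeds_into):
    """All components that directly or transitively depend on `component`."""
    seen = set()
    queue = deque([component])
    while queue:
        node = queue.popleft()
        for child in feeds_into.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(downstream_of("embedding_model", FEEDS_INTO)))
# ['ranking_model', 'recommendation_api']
```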

So it’s really easy to sit in a venture office and talk about all of the problems. And none of these problems are actually easy to solve, but I do think that there are a few things that the ML community, including those who are using tools and those who are building tools, can do better. So for those who are using tools, I think first and foremost, think about not just the model that you want to deploy, think about the architecture of the ML-driven application and know what set of tools you’re going to need to both build and maintain that application and demand that your vendors address the set of integrations that you need to build that reliably and iterate upon it flexibly. One of my hypotheses about why ML tools and platforms are not where they ought to be is just that the behavior of data scientists and ML engineers is still pretty heterogeneous.

So I think like once we start to establish more best practices and standards, like it’s going to be much easier to build great tools because those tools can align with user behaviors. I also think about like why hasn’t this happened. Like it’s been 10 years, I think, since like AlexNet. Like why don’t we have clear ways of doing things? I don’t actually have an answer to that question, but I do think like both community members as well as vendors ought to be opinionated. You ought to have a view about how people should work.

And once we’re able to establish that, it’ll be a dialogue. I think it’ll be a lot easier to design tools that enable that type of work. But also to think about how these various components ought to be coupled to enable that workflow. Like if you know what somebody’s workflow is, you can kind of follow that workflow to think about what is the set of integrations I need to support. So there are a couple of other things that I’ve been thinking about recently. Like what role will collaboration play in the future of ML? How will the adoption of pre-trained models impact the architecture of the ML stack? But if I had to share one major takeaway, it’s just that we need to stop thinking about MLOps as ModelOps and really focus on a stack that is going to allow us to build reliable ML-driven applications, a unified stack that is designed to achieve that end. So that’s all I’ve got.

Sarah Catanzaro

General Partner

Amplify Partners

Sarah Catanzaro is a General Partner at Amplify Partners, where she focuses on investing in and advising high potential startups building data and machine learning tools, platforms, and applications. Her investments at Amplify include startups like OctoML, Intervenn Biosciences, RunwayML, and Hex among others. Sarah also has several years of experience defining data strategy and leading data science teams at startups and in the defense/intelligence sector including through roles at Mattermark, Palantir, Cyveillance, and the Center for Advanced Defense Studies.
