Data Observability for Machine Learning Teams

apply(conf) - May '22 - 10 minutes

Once models go to production, observability becomes key to ensuring reliable performance over time. But what’s the difference between “ML Observability” and “Data Observability”, and how can ML Engineering teams apply them to maintain model performance? Get fast, practical answers in this lightning talk by Uber’s former leader of data operations tooling and founder of the data observability company Bigeye.

Howdy everybody. I am excited to be here for the lightning talks at Tecton. I hope the conference has been good so far, at least as good as that song was. So without further ado, let’s talk about data observability for machine learning teams. So first, who am I? I’m Kyle obviously, but I am the founder of Bigeye. Bigeye is a data observability platform, and that’s why I’m here talking about this topic today. Before starting Bigeye a few years ago, I was the product leader at Uber for their data operations tools. So, my team developed Uber’s internal tools for data cataloging, data lineage, data quality, freshness, and incident management for data outages, some of which may have affected machine learning models, which is what we’re going to talk about a bit. Prior to that, I was on the experimentation platform team there. So, I was a data scientist, helping them develop the way that they manage their A/B tests that are run on pricing or in-app interactions and things like that.

And before that, I had a background in analytics. If anybody’s heard of Grooveshark, put your hand up. It’s a now defunct startup. It was competing with Spotify back in the day. Hands up. Cool. Great. Okay. Good stuff. So, without further ado, let’s talk a little bit about what data observability is. So, to talk about data observability, let’s talk about some data first. So, let’s say that we have a pipeline, right? So on the left, we’ve got our application. Maybe we’ve got microservices with an OLTP database backing it. Maybe we have physical sensors and maybe we have an Internet of Things SDK that’s submitting logs. But for this example, let’s pretend we have an iOS app. And in that iOS app, we have a mobile SDK. So, that’s our application part of the pipeline. That’s what’s emitting the original data that we’re going to be working with.

That then flows into a data lake or data warehouse, where maybe it’s loaded as raw data. Just tap and click logs coming from that mobile app. And then that’s transformed into product usage data. So this might be modeled now. It might have gone through several stages of data transformation, and then finally it’s going to land in our feature store and we’re going to use that for model training. We might do inference on it. And so, this is sort of our end-to-end example pipeline. So, now what is data observability? It’s the ability for us to understand what’s going on with the data all the way through that pipeline from origin to destination. And when I say what’s going on with the data, I’m not talking about the infrastructure, I’m not talking about the data pipeline that’s moving it. I’m talking about the content of the data itself because at the end of the day, if we’re going to be running models with it, that’s what we care about, right? What’s inside the data set that we’re working with.

So, an example of something that data observability might help us with is if we have a problem in the middle of that pipeline. So, let’s say we have some sort of an issue that impacts our raw data set with our tap and click logs in the raw layer inside our data lake. That’s eventually going to cause a problem for our feature data. And that could end up impacting our models. It could affect the feature data that’s being used for inference by models that are already running in production, or it could affect the data that’s being used to retrain a model, develop a new one, et cetera. And data observability is a method for us to learn about those problems as far upstream as possible.

So, now let’s talk about how this actually works. So, how does data observability work? There are three main mechanisms by which we can understand what’s going on in the content of the data in the pipeline. So those are metadata, metrics, and lineage. So, the first of these three categories is metadata. So this is information that we can glean about our data sets just by looking at things like query logs, logs that are stored in the data warehouse. So for example, things that would be in the information schema, if you’re using Snowflake, that type of thing. And what can we learn from metadata about our data sets in our pipeline? First, we can understand changes to our schema. So, do we have schema evolution that’s happening? Did we drop a column? Did we rename a column? Did we add a new one? So, in this case there’s been a tax column inserted to the left of the stock column, which tracks maybe the amount of stock we have of these different items, Oatly and Forager. I love me some Oatly.
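As a rough illustration of the schema-change idea, here is a minimal Python sketch that compares two snapshots of a table's column list. The snapshots, column types, and the `diff_schema` helper are all hypothetical; in practice the column lists would come from something like the warehouse's information schema. Note that a rename shows up here as one dropped column plus one added column.

```python
def diff_schema(old_columns, new_columns):
    """Compare two lists of (column_name, column_type) pairs.

    Returns the columns that were added, dropped, or changed type.
    """
    old = dict(old_columns)
    new = dict(new_columns)
    added = [name for name in new if name not in old]
    dropped = [name for name in old if name not in new]
    retyped = [name for name in old if name in new and old[name] != new[name]]
    return {"added": added, "dropped": dropped, "retyped": retyped}


# Hypothetical snapshots taken before and after the change in the example:
# a tax column appears to the left of the stock column.
before = [("item", "VARCHAR"), ("stock", "INTEGER")]
after = [("item", "VARCHAR"), ("tax", "NUMBER"), ("stock", "INTEGER")]

changes = diff_schema(before, after)
print(changes)
```

Running a comparison like this on a schedule, and keeping the previous snapshot around, is one simple way to notice schema evolution without reading any of the table's rows.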

So, metadata can tell us about changes to our schema. It can also be used to tell us about problems with freshness in our pipeline. So, are our data sets receiving new records when we would expect them to? So if data’s expected to update every, let’s say three hours, are we seeing updates every three hours? If not, and the data set’s getting older and older and older, that may be a problem that impacts our models. Second category, metrics. So, this would be things like detecting duplicate data, missing values, problems with formats. So, if we have integer IDs and we see alphanumeric values in there, that’s a problem. Changes to distributions, outliers in numerics, negative values in something that should never be negative, novel values. So maybe we’ve had 10 different options for ice cream flavor.

And suddenly we have Ben & Jerry’s Urban Bourbon show up and our model doesn’t know what to do with that. So, that may impact performance down the line. Third category is lineage. So, lineage is understanding the relationships between all the data sets from origin to destination in our pipeline, which tables are used to generate which other ones. And this gives us a map of our dependencies and we can take that metadata information and that metrics information and layer it on over that lineage map. And so now we can really clearly understand, when we do see some sort of a change in the behavior of our data set, schema change, distribution shift, et cetera, where that’s going to flow. And we could, for example, understand if it is going to impact our feature store or not. Maybe it’s only going to impact some business intelligence stuff and we might not need to worry about it.
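To make the lineage idea concrete, here is a small Python sketch that treats lineage as a directed graph and walks it breadth-first to find everything downstream of a table with a problem. The table names and the `downstream_tables` helper are made up for illustration; a real lineage map would be derived from query logs or a catalog.

```python
from collections import deque


def downstream_tables(lineage, start):
    """Return every table reachable from `start` by following lineage edges."""
    seen = set()
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage map: each table points to the tables built from it.
lineage = {
    "raw.tap_click_logs": ["analytics.product_usage"],
    "analytics.product_usage": ["feature_store.user_features", "bi.usage_dashboard"],
    "raw.billing_events": ["bi.revenue_dashboard"],
}

impacted = downstream_tables(lineage, "raw.tap_click_logs")
feature_store_at_risk = any(t.startswith("feature_store.") for t in impacted)
print(sorted(impacted), feature_store_at_risk)
```

With this toy graph, an issue in the raw tap and click logs reaches the feature store, while an issue in the billing events only reaches a dashboard.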

So, these are three different components that all come together to facilitate data observability. And if you look at non-data observability, so, if we look at something like APM, these map really cleanly over to metrics, logs, and traces, which are the three sort of canonical concepts in more traditional observability as well. So, those map over really pretty well to the data space. And so by collecting these different metrics and by layering them on over our lineage graph, we can start to, say, identify problems in our tap and click logs before they make their way down into our feature store.
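Two of the checks mentioned earlier, freshness and novel values, are simple enough to sketch in a few lines of Python. This is only an illustration under assumed inputs: the timestamps, the three-hour expectation, the flavor lists, and the helper names `is_stale` and `novel_values` are all hypothetical.

```python
from datetime import datetime, timedelta


def is_stale(last_loaded_at, now, expected_interval):
    """True if the table has gone longer than expected without new records."""
    return now - last_loaded_at > expected_interval


def novel_values(observed, known):
    """Return categorical values that have never been seen before."""
    return sorted(set(observed) - set(known))


# Freshness: the data is expected to update every three hours.
now = datetime(2022, 5, 18, 12, 0)
last_load = datetime(2022, 5, 18, 7, 30)
stale = is_stale(last_load, now, timedelta(hours=3))

# Novel values: a flavor shows up that was never in the known set.
known_flavors = {"vanilla", "chocolate", "strawberry"}
todays_flavors = ["vanilla", "urban bourbon", "chocolate"]
unseen = novel_values(todays_flavors, known_flavors)

print(stale, unseen)
```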

So for example, take that top row there that says “alerting.” Maybe this is tracking the percent of null values in an ID column, and we need that ID column to do some sort of predictions with it. If we suddenly see an increase in null values in that column, more than we’ve had historically, that could indicate a failure or some sort of a problem in the pipeline. We might want to go and do something about that. And in a perfect world, maybe our data engineering team does something about that before it reaches our machine learning work.
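One simple way to express "more null values than we’ve had historically" is to compare the current null rate against the mean and standard deviation of past rates. The Python sketch below assumes a three-sigma threshold, a made-up history, and hypothetical helpers `null_rate` and `should_alert`; it is not a description of how any particular product decides to alert.

```python
from statistics import mean, stdev


def null_rate(values):
    """Fraction of values that are None (0.0 for an empty batch)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def should_alert(history, current, n_sigmas=3.0):
    """Alert when the current rate is well above the historical rates.

    The threshold (mean + n_sigmas * standard deviation) is an assumption
    chosen for illustration.
    """
    threshold = mean(history) + n_sigmas * stdev(history)
    return current > threshold


# Hypothetical daily null rates for the ID column, then today's batch.
history = [0.010, 0.012, 0.009, 0.011, 0.010]
batch_ids = [101, None, 103, None, None, 106, 107, None, 109, 110]

current = null_rate(batch_ids)
alert = should_alert(history, current)
print(current, alert)
```

A fixed threshold would also work, but comparing against history avoids having to hand-pick a number for every column.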

So, why should machine learning engineering teams care about this problem at all? Right? So, hopefully that’s already clear. The idea is that we can use data observability to catch problems in our data pipeline before they make it to the point where we’re doing modeling with them. So, an example from industry, actually, I met with a company that was working on self-driving, computer vision, et cetera, for vehicles. And this company was purchasing labeled data from a company that would take imagery from the machines. And then they would draw bounding boxes for things like a pedestrian, a stop sign, et cetera. And they went to retrain their model and these are pretty big, expensive things to retrain, right? And they looked at their diagnostics in the model and they had a plummet in a bunch of metrics that they were watching and they thought that they did something wrong.

It turns out that just in a batch of image data that they had purchased from this labeling vendor, there were something like 10,000 images that just didn’t have any bounding boxes drawn on them at all. And there was nothing in place to identify that upfront. So, data observability allows us to catch those types of problems as early as possible and do something about it before we waste time when we’re actually trying to work on our models or put things into production. So then how to get started with data observability? I’d say there’s basically three big categories that you should think about. The first is turnkey vendor. There are a number of these these days. Bigeye is one of them. I like to think it’s the best, but that’s just me. Option two, open source frameworks. So, Great Expectations is, I think, a pretty well-known one, but there are open source options out there for instrumenting your data sets in your pipelines and understanding what’s going on in them.

And then third, build your own. If you want to build one from scratch, my co-founder at Bigeye, Egor Gryaznov, is actually giving a seminar next Thursday. It’s a two-hour class on how to construct a basic data observability system from scratch using just Snowflake. So, definitely some fun stuff you can do if you’re building on your own and you want to get your hands dirty. So with that, I’ll end my talk. Thanks for listening. And if you have any questions, I’ll either take them now from UD, or you can shoot me an email at kyle@bigeye.com.

Kyle Kirwan

CEO & Co-founder

Bigeye

Kyle Kirwan is the cofounder and CEO of Bigeye, a data observability platform. Before starting Bigeye, he was a Data Product Manager at Uber where he led the development of internal data operations tools that enabled data discovery, lineage, freshness, observability, and incident management for hundreds of data engineers, analysts, and scientists within the company. He lives in New York City with his fiancé and prefers Cherry MX Blue keyboard switches.
