The dbt Semantic Layer

apply(conf) - May '22 - 10 minutes

In this talk, Drew will discuss the dbt Semantic Layer and explore some of the ways that Semantic Layers and Feature Stores can be leveraged together to power consistent and precise analytics and machine learning applications.

So my name’s Drew Banin. I’m one of the co-founders at dbt Labs. You can find me on Twitter @drewbanin. And just one point of clarification, I think the agenda says that I’m the CPO here. I’m actually the former CPO here, which is great because as CPO I could never join events like this and give talks about things like semantic layers and feature stores. So really excited to be here and looking forward to the chat after in Slack. Let’s start with the question, what’s a feature store? Seriously, what’s a feature store? Please help me. I don’t know. I’m mostly kidding. I did do my research. I read up on feature stores ahead of time. But the thing I want to come clean about really at the front is machine learning engineering, it’s not my background, it’s not my domain. I come from the data analytics, data warehousing, BI reporting kind of domain.

And the thing I want to dig into today is the rise of the feature store and the rise of a similar, I think technological innovation in the BI and analytics realm that we call a semantic layer. So semantic layers are really in right now. People in the BI and analytics sphere are talking about them. And the big ideas at play are that you want to define your data sets and metrics. You want to map out how they can relate to each other. And then you want to translate high level semantic queries into SQL for execution.

So some examples of metrics that you might want to define once and use everywhere are revenue by country or average revenue per customer. These metrics should be very precisely defined. They will probably change infrequently, though they do change sometimes, as we’ll see. And so managing these metrics deserves this kind of new technology. So why is this important? Why is it hard to calculate revenue? This is a canonical example. You might think revenue’s easy to calculate. You just sum up order totals in your orders table and you get revenue. But in practice you need to make sure to exclude orders that are returned. You need to make sure that you subtract out tax paid on the order, because that’s revenue for the IRS, not revenue for your company. And then invariably you get quirks like this, where up until today you recorded order totals in cents, but now it’s dollar values, or vice versa, whatever it is.
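To make the revenue example concrete, here is a minimal Python sketch of the three rules just described: exclude returned orders, subtract tax, and normalize a unit change. The `Order` fields, the `UNIT_CUTOVER` date, and the sample values are all hypothetical and chosen only for illustration; they do not come from the talk or from dbt.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical cutover: totals were recorded in cents before this date
# and in dollars on or after it.
UNIT_CUTOVER = date(2022, 5, 1)


@dataclass
class Order:
    ordered_on: date
    total: float    # cents before UNIT_CUTOVER, dollars on or after it
    tax: float      # same unit as total
    returned: bool


def to_dollars(amount: float, ordered_on: date) -> float:
    """Normalize an amount to dollars based on when the order was placed."""
    return amount / 100 if ordered_on < UNIT_CUTOVER else amount


def revenue(orders: list[Order]) -> float:
    """Revenue = sum of (total - tax) in dollars over non-returned orders."""
    result = 0.0
    for order in orders:
        if order.returned:
            continue  # returned orders do not count as revenue
        result += to_dollars(order.total - order.tax, order.ordered_on)
    return result


orders = [
    Order(date(2022, 4, 30), 10800, 800, False),  # recorded in cents
    Order(date(2022, 5, 2), 54.0, 4.0, False),    # recorded in dollars
    Order(date(2022, 5, 3), 200.0, 15.0, True),   # returned, so excluded
]

naive_revenue = sum(o.total for o in orders)  # mixes units, includes tax and returns
correct_revenue = revenue(orders)
```

The naive sum and the rule-based figure differ by orders of magnitude on this sample, which is the argument for writing the definition down once rather than re-deriving it in every query.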

And so you sometimes see quirks like this, where the source data isn’t actually representative of the reality on the ground. You need to apply business logic to convert that data into information. So I don’t want to belabor dbt here, but just to show you what this looks like to make it concrete, basically when you write a query, you could write out this thing or you could say, select star from this metric called revenue by day, and dbt could compile a much bigger SQL query for you. So in terms of visualizing this thing, I think about it like an iceberg. This is a real view of our DAG at dbt Labs and how we track our objectives and key results. So this pink thing is a metric. It’s weekly active projects for us, which is a metric we care a lot about. And that’s the actual aggregation. It’s the sum of users that had activity in the previous week, whatever it is, but that actually misses a lot of the logic that feeds into calculating this metric correctly.
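The "select star from a metric" idea mentioned earlier in this passage can be sketched as a small compiler that expands a named metric into a full SQL string. This is an illustrative toy, not dbt's actual metric syntax or compiler: the `Metric` class, the `compile_metric` function, and the table and column names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A hypothetical metric definition: what to aggregate, from where, and how to filter."""
    name: str
    table: str
    expression: str
    time_column: str
    filters: tuple[str, ...] = ()


# Hypothetical definition; table and column names are made up.
REVENUE = Metric(
    name="revenue",
    table="analytics.orders",
    expression="sum(order_total - tax_paid)",
    time_column="ordered_at",
    filters=("not is_returned",),
)

ALLOWED_GRAINS = {"day", "week", "month"}


def compile_metric(metric: Metric, grain: str) -> str:
    """Expand a short semantic request (metric + time grain) into a SQL query."""
    if grain not in ALLOWED_GRAINS:
        raise ValueError(f"unsupported grain: {grain!r}")
    where = ""
    if metric.filters:
        where = "where " + " and ".join(metric.filters) + "\n"
    return (
        "select\n"
        f"  date_trunc('{grain}', {metric.time_column}) as period,\n"
        f"  {metric.expression} as {metric.name}\n"
        f"from {metric.table}\n"
        f"{where}"
        "group by 1\n"
        "order by 1"
    )


revenue_by_day_sql = compile_metric(REVENUE, "day")
print(revenue_by_day_sql)
```

The caller only names the metric and the grain; the filter on returned orders and the tax subtraction live in one definition, so every consumer gets the same SQL.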

So all these blue nodes here in this DAG are logic that we apply to source data: joining data sets together from different data sources, aggregations, filtering, whatever it might be. And so it’s this whole chain of all the logic applied to source data, plus the actual metric definition, that composes the final metric. So I’m wondering if this sounds familiar to you. I’m talking about business metrics and BI and reporting, but from reading about feature stores, it seems like a similar concept. There’s only one way to define revenue and in a similar way, there’s only one way to define if someone is active or not. That might change over time as your product evolves or new product capabilities roll out. But if you have many people on many different teams that are all accessing the same features or metrics, you want to make sure that these things are consistent, that work isn’t being repeated, because if you’re repeating work or copying and pasting code, things like that, then inconsistencies can arise and you can get invalid, incorrect data.

So the thing here that I think links together the semantic layer concept in BI and analytics with feature stores and ML engineering is this idea of standardization. Whether you’re creating inputs for ML training and serving or outputs for analytics and data science, you want to make sure that when you invest the energy to define these constructs, features, metrics, whatever they are, that you’re doing that once, that you’re doing it correctly and that you have a good way of evolving this logic over time as requirements and the product experience invariably change. So I would argue that features in ML are akin to what we call dimensions or maybe metrics in the BI world. This is sometimes true. I don’t think it’s always true for reasons that I’ve read about recently, but I’m curious to hear everyone’s take on it.

So this is my view of how to bridge these two worlds. And this is more of a mental model than an architecture diagram. I’m not advocating for any technology in particular, but it’s how I think about it. The fact that the data warehouse is front and center here, or it could be a data lakehouse of course, is a signal that this is very much my background. But the idea is you’re going to have all these transformations and business logic to apply to your data to make sense of it. From that point, you have this shared logic and then you can use that to make dashboards and reports that business users consume and understand. You could also take these bits of logic and use them to power features that you use for machine learning.
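As a rough illustration of this mental model, the sketch below uses one shared definition of "active" to produce both a BI-style aggregate and per-entity ML features. The function name `is_active`, the seven-day window, and the sample project data are hypothetical; the talk does not specify how activity is defined.

```python
from datetime import date, timedelta


def is_active(last_activity: date, as_of: date, window_days: int = 7) -> bool:
    """Shared business logic: active means some activity within the trailing window."""
    age = as_of - last_activity
    return timedelta(0) <= age < timedelta(days=window_days)


# Hypothetical last-activity dates per project.
last_seen = {
    "p1": date(2022, 5, 16),
    "p2": date(2022, 5, 1),
    "p3": date(2022, 5, 18),
}
as_of = date(2022, 5, 19)

# BI / reporting: aggregate the shared definition into a single metric.
weekly_active_projects = sum(is_active(d, as_of) for d in last_seen.values())

# ML: expose the same definition as a per-entity feature.
features = {
    project_id: {"is_active_7d": is_active(d, as_of)}
    for project_id, d in last_seen.items()
}
```

Because both outputs call the same function, a change to the definition of "active" reaches the dashboard and the model inputs together.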

And the really cool thing about having a shared parent between these two use cases is that you get consistency and you get reuse. And so investments that your BI or analytics team has made to model revenue or product usage activity, they could be leveraged for feature engineering and machine learning and vice versa. So that’s what dbt’s all about at its core is helping get data engineers and data analysts collaborating with shared tools. And I just can’t help but wonder if there’s an opportunity for us to do this with machine learning engineering and BI and if there isn’t a common use case that could be shared here.

I want to talk about when this might not work well. And I think it’s one of the big kinks in my plan and it’s something I’m interested in learning a lot more about. So I was reading about data leakage. And the funny thing to me, reading about data leakage, is that in my world, we call that analytics. Take all the data that you have and mush it all together and show the most accurate information you have now, at this point in time. There are cases in analytics where you want to know what we reported as revenue a month ago, excluding new information that’s come after the fact. But I think this is one of the places where feature stores and semantic layers diverge. And I am wondering if there’s a model that we could employ to help bridge this gap.
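One way to picture the divergence is a single aggregation with an optional "as of" cutoff: analytics typically calls it with no cutoff (use everything known now), while building a training row calls it with the timestamp of the label so later information cannot leak in. This is a minimal sketch under that assumption; the `Event` type, the `lifetime_spend` function, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Event:
    customer_id: str
    amount: float
    occurred_on: date


def lifetime_spend(
    events: list[Event], customer_id: str, as_of: Optional[date] = None
) -> float:
    """Total spend for a customer, optionally restricted to events on or before as_of."""
    return sum(
        e.amount
        for e in events
        if e.customer_id == customer_id
        and (as_of is None or e.occurred_on <= as_of)
    )


events = [
    Event("c1", 20.0, date(2022, 3, 1)),
    Event("c1", 30.0, date(2022, 4, 10)),
    Event("c1", 50.0, date(2022, 5, 20)),
]

# Analytics view: everything known today.
latest_value = lifetime_spend(events, "c1")

# Training view: only what was known when the label was observed (hypothetical date).
label_date = date(2022, 4, 15)
point_in_time_value = lifetime_spend(events, "c1", as_of=label_date)
```

The same definition serves both uses; only the cutoff differs, which suggests the gap is about time travel rather than about the business logic itself.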

So I thought that was interesting. I think that’s one of the problems that we’d have to figure out together, but I think that we can solve this problem and figure out how to unify machine learning with analytics and BI. It’ll also help people become a lot more collaborative. It’ll help teams work together, and again, just create more precision and consistency for our reporting or machine learning or whatever it is.

So these were the questions that I have at this point. I regret that I don’t actually have answers for you. But at the outset of this, I asked the question to myself, are semantic layers and feature stores the same thing or are they going to collide? I think the answer is no, not really. There are different audiences for these two types of tools, different use cases, different constraints of course. So I don’t think there should be any worry about these two things crashing into each other.

But can we learn from each other? Yeah, definitely. I think we really can. I think that the analytics engineering community has learned a lot about how to be collaborative around creating these shared dimensions and metrics and understanding how to work with the business to make sure that we understand these definitions precisely. And I think the machine learning engineering community has gone really deep on registries and defining things once, using them in many places and leveraging that engineering background. So in terms of where we start and how we answer these questions, I’m going to pop into the apply conference channel on Slack. And I’m excited to talk to you about whether any of this resonates or if you think I’m way off base here. So I just want to say thanks to everyone for letting me come talk to you today. I’m excited to chat.

Drew Banin

Co-Founder & former CPO

dbt Labs

Drew Banin is the co-founder and former Chief Product Officer at dbt Labs, a Philadelphia startup pioneering the practice of modern analytics engineering. dbt is used by over 9,000 companies every week to organize, catalog, and distill knowledge in their data warehouses. Drew works with open source maintainers, contributors, and users to build dbt and strike fear in the hearts of database optimizers.
