Lessons learned from the Feast community

apply(conf) - May '22 - 10 minutes

Feast, the open source feature store, has seen a dramatic rise in adoption as ML teams build out their operational ML use cases. The growth that Feast has experienced is in part due to the project being a community-driven effort, with development happening openly through public forums. However, designing out in the open hasn’t always been straightforward. As the Feast user base has grown, maintainers of the project have been faced with new and interesting challenges. In this talk we will share three examples of when the Feast community surprised us, and how that impacted the project’s direction.

Okay, so, my name is Willem. I’m a principal engineer here at Tecton. Today, I’m going to be talking a bit about some of the lessons we learned while working on Feast. So let me just jump right into it.

But before we get into that, what's the story with Feast? What's happened in the last year since the previous apply()? Let me give you a quick summary. We've seen a bunch of contributions to the project, so it's really grown in its functionality set, with lots of performance optimizations in our feature server. We've added data quality monitoring. We've added a Feast UI. And in general, we've had community contributions from our end users, so Spark, PostgreSQL, Trino, and a bunch of other connectors and compute engines, as well as from our partners like Redis, Snowflake, Azure, et cetera. So really, the project has grown a lot in the last year.

And adoption has also grown. Folks at Twitter, Shopify, and a bunch of other teams have built their ML platforms on Feast. And consequently, our metrics are also up. We're seeing more folks on Slack, and we've seen a lot more activity on GitHub, so PRs and stars and forks, et cetera. It's been really great to see the project flourish.

But along the way, we've learned some things from our users. We work with our users on Slack, GitHub, and community calls, and we also run surveys at least once a year. Often we learn things from our users through those surveys. So I'm going to share three of the key takeaways that we've had over the last year or so.

The first one was from the 2021 Feast survey. We asked our users: which technologies do you want to use for your online feature store? Now, I think most people have an idea of what the results will be, and this is what we saw. Unsurprisingly, Redis was one of the top online feature store storage technologies. And of course there's a long tail of other technologies, but very interesting for us was that PostgreSQL and MySQL were right up there with Redis, and it wasn't 100% clear to us why this was the case. You'd expect a NoSQL database, something that's super low latency and optimized for point lookups, but you wouldn't really expect PostgreSQL or MySQL there, because they're often optimized for different use cases. We had a hunch for why teams wanted to use these technologies.

Perhaps they already have those deployed in production and they just want to reuse them. But we continued to ask our users questions. The other question we asked was: which technology do you want to use for your offline feature store? So, where do you want to create your features? Where do you want to build your training data sets? Where do you want to store your historic feature data? And of course you had your data lakes and data warehouses, the usual suspects, but we also saw operational DBs coming into the survey results again. This was also quite curious for us, because we anticipated that teams have large amounts of data and would want to use ELT stacks, or at least analytics stacks, to process that data and build training data sets, not operational DBs. And yet they still came back into our survey results in 2021.

So the question for us really was: is there some kind of use case here for operational DBs, or are these teams just trying to get to market, and operational DBs are just the easiest way for them to do that? Maybe they don't even have an analytics stack. So the question really was: is this a temporary thing, or is this a long-term use case of feature stores that we just don't understand? We ran the survey again this year, recently, I think about a month or so ago, and the results seem to indicate that it was in fact a temporary period that our users went through. Specifically, if you look at the results, data lakes are slightly dropping off for feature store usage, and data warehouses are booming.

So BigQuery, Snowflake, and Redshift adoption is up, and our users really want to use those for their offline feature stores. And for operational DBs, usage has basically halved. When we speak to our users, often they say, yeah, we were using operational DBs as a quick way to get to market. Often the ML team and the product team is the same person, or if it's not the same person, the product team just didn't have confidence in the analytics stack to depend on it in production. So the key takeaway for us was really that analytics stacks are increasingly being used for production machine learning, product teams have high confidence in analytics stacks, and folks are moving over to those stacks. So that's one lesson. Another conclusion you could draw, the next question you'd ask, is: is everyone just using the modern data stack for production machine learning?

Clearly everyone's using data warehouses, and you might assume that the modern data stack is also just being used for production machine learning. If you look at the survey results, it does seem to indicate that there's a massive move towards data warehouses and these kinds of self-serve, serverless storage and compute engines. But it's slightly more nuanced when you look at the tools that are being used with these storage technologies. So we also asked users: what tools are you using to transform your data? And if you look at this list of tools, Spark and Pandas still dominate, and those are not modern data stack technologies. So even if users are using data warehouses, they're extracting the data from the warehouse and using Spark outside of that, or Pandas outside of that, or a bunch of other tools outside of that. And if you look at the third item in the bar chart, DBT is there.

So there's pretty good representation from DBT. Some folks are using DBT with a data warehouse, but it's not really at the level of Spark and Pandas. And in fact, if you drill into it more, if you look at feature store users that are using DBT, what tools are they using with DBT? Is DBT enough on its own to transform your data for ML? This is what we found: 100% of DBT users are using DBT with another ETL tool in order to transform their data for ML usage, and almost 70% of them were using DBT with Spark. So this was an interesting insight for us. It means that DBT is important for these folks, but it's not sufficient. And so the key takeaway for us was really that the modern data stack is not yet enough for production machine learning.
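One concrete example of the kind of transformation that pushes teams past SQL-only tooling is the point-in-time join used to build training data sets: for each label, you need the latest feature value at or before the label's timestamp, per entity, to avoid feature leakage. Here's a minimal sketch in Pandas; the driver IDs, column names, and values are made up purely for illustration:

```python
import pandas as pd

# Label events: each row needs feature values "as of" its timestamp.
labels = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2022-05-01 10:00", "2022-05-01 12:00", "2022-05-01 11:00"]),
    "label": [0, 1, 1],
})

# Historic feature values, updated at irregular times per driver.
features = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2022-05-01 09:00", "2022-05-01 11:30", "2022-05-01 10:30"]),
    "trips_today": [3, 5, 7],
})

# Point-in-time join: for each label row, take the most recent feature
# value at or before the event timestamp, matched per driver.
training = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="driver_id",
    direction="backward",
)
print(training[["driver_id", "label", "trips_today"]])
```

Expressing the same thing in SQL takes window functions or correlated subqueries per feature table, which is part of why Spark and Pandas keep showing up next to the warehouse in these results.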

And when we speak to these users, there are two use cases that keep coming up for why they reach for ETL tools alongside the modern data stack. The first is streaming. Streaming is still the domain of Spark and Flink and a lot of these traditional data processing tools, and the modern data stack hasn't fully addressed that yet. There are vendors in the space, but they're still finding their footing. The other is on-demand, or read-time, feature engineering. Can you compute features on the fly at low latency? That's largely not addressed by the modern data stack. So this was a key lesson for us as well.

The third thing we asked our users was: how do you want to integrate Feast for real-time feature serving? If you really want features at low latency with Feast, how do you want to architect your system? How do you want Feast to slot in there? We gave them a bunch of options in terms of service-oriented versus library-based integration, and programming languages, and these were the results. What we found was that, surprisingly, Python really dominates as the language folks want to use. And this is surprising because, of course, Python is slower than the likes of Java or Go in terms of real-time serving latency. It speaks to teams wanting ease of understanding, or perhaps finding integration and extension easier in Python, and optimizing for delivery and getting into production faster, as opposed to optimizing for latency. Also interestingly, they want to use a library. They want to embed Feast as a Python library into existing components in their stack, perhaps a model server, as opposed to deploying a new service.
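The embedded-library pattern described above can be sketched as follows. This is a stand-in for illustration only: the `InProcessFeatureStore` class and the weighted-sum "model" are hypothetical, though Feast's Python SDK does expose a similar `get_online_features` call on its `FeatureStore` object:

```python
# Sketch of the "embed the feature store as a library" pattern.
# The store below is an in-memory stand-in; with Feast the lookup
# would go through its Python SDK instead of a dict.

class InProcessFeatureStore:
    """Hypothetical stand-in for an online store client embedded in a model server."""

    def __init__(self, table):
        self._table = table

    def get_online_features(self, entity_rows, features):
        # Return the requested feature values for each entity row.
        return [
            {f: self._table[row["driver_id"]][f] for f in features}
            for row in entity_rows
        ]

def predict(store, driver_id):
    # Feature retrieval happens in-process: no extra network hop to a
    # standalone feature service on the request path.
    feats = store.get_online_features(
        entity_rows=[{"driver_id": driver_id}],
        features=["conv_rate", "acc_rate"],
    )[0]
    # Hypothetical "model": a weighted sum standing in for real inference.
    return 0.7 * feats["conv_rate"] + 0.3 * feats["acc_rate"]

store = InProcessFeatureStore({1001: {"conv_rate": 0.56, "acc_rate": 0.92}})
score = predict(store, 1001)
print(round(score, 3))
```

The appeal of this shape is that feature retrieval is just a function call inside the model server's own process, so there is no new service to deploy or operate.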

And so, if you look at the second bar chart here, another interesting takeaway was that even in the case where they want to deploy Feast as a standalone feature server, as a service, they still prefer it to be written in Python. That was surprising, because it's a standalone service, and often the language of a standalone service doesn't matter. Our takeaway here was that teams often want to fork code and modify code, and I think they're just more comfortable with Python, and the ecosystem around it is easier to integrate into a Python service than it is for a Java or Go server. So we spoke to these teams and got a little more feedback on why they're adopting Python in production. The first reason was, obviously, they're optimizing for delivery over performance, and their timelines are so long already that they'll do anything they can to just get into production.

The second point was that Python isn't really that much slower these days. Often Python is really just wrapping another programming language. In fact, this is what we do with Feast today: our feature server wraps Go code, and it's pretty much on par with the Go servers in terms of serving features at low latency. But most importantly, the Python ML and data ecosystem is vast, and it's not just offline, it's also online. Tools like Ray Serve, MLflow, and BentoML are all being used in production, in an online way, at low latency, and being able to integrate and plug and play these really is valuable to our users. They just want to get to prod, and often in the zero-to-one use case they choose Python. So the key takeaway here was: teams are doubling down on Python, not just for offline workflows, but for online production workflows as well, and that was very interesting for us.

So if you want to get involved with the Feast project, head to feast.dev; you're probably on the Slack already. The links are on screen. And we're hiring across the board at Tecton, so if you want to work on open source, if you want to work on MLOps, from junior to leadership roles, engineering, product, marketing, developer relations, come and chat to us.

Willem Pienaar

Feast Committer and Tech Lead

Tecton

Willem Pienaar is a tech lead at Tecton, where he leads the Feast open source feature store. Willem created Feast while leading the Data Science Platform team at Gojek. His main focus areas are building data and ML platforms, allowing organizations to scale machine learning and drive decision making. In a previous life, Willem founded and sold a networking startup.

© Tecton, Inc. All rights reserved. Various trademarks held by their respective owners.
