Streaming is just an implementation detail

apply(conf) - May '22 - 10 minutes

Microservices are stream processing; whether you’re using Redis, Kafka, or gRPC, you continuously handle events and manage consistency. And given that these are some of the most challenging problems in databases, you’re probably not doing a very good job at it.

But that’s not your fault, these problems are hard! Just like you wouldn’t implement your own database for every web service, you shouldn’t be building your own stream processor for every new product feature.

Today’s stream processors have failed to gain widespread adoption outside niche use cases because they put themselves at the forefront. They force you to think about streaming when building your application and when deploying to production.

In my talk, I will argue that “streaming” is the right tool but the wrong abstraction. Of course, we all want the benefits – increased speed, stronger consistency – but they need to meet developers where they are. Only when streaming becomes an implementation detail can it gain widespread adoption and bring forth the benefits it has promised for so long.

Thanks so much for the intro. I’ve got a bit of a provocatively titled talk here about how streaming is just an implementation detail and how I’m hoping for a future where we actually stop talking about streaming. So before I get into that, let’s just start. Okay, what’s even streaming and why is it even useful? Right? So streaming is about taking action on data points as they appear without waiting for a batch process to happen once a day or every few hours at best. And modern streaming systems are mature and performant. You can process large amounts of data with millisecond level latencies. But stream processing isn’t just about going fast, right? It’s customers have been trained to expect immediate feedback and it’s considered antiquated when it takes more than a moment for say, money to transfer between two accounts or food delivery apps today require up to the minute tracking as it sort of, you watch your food get to your door as you’re hungrier and hungrier. Streaming is what makes all these products possible.

That sounds pretty great. Right? Let’s see how it actually turns out in practice. And I’m going to use a fun running example here. Thought it would be appropriate today. Let’s build a feature store together, right? Now, suppose our team is building a model serving application to detect credit card fraud, right? We’re going to keep the most simple basic example here. There’s a customer., They go to a store, they swipe their credit card, and we’re going to need to determine in real time whether to approve or deny that transaction. And our demo system, we just have a very toy schema. Each account is associated with a single owner, but each owner might have multiple accounts, right? And each transaction we’re going to give it a fraud score in real time. And when given an account ID, we’re simply going to count the number of verified fraudulent transactions against the account owner over the last 30 days.

So what does this look like in practice? So it’s an e-commerce store, so there’s probably a transactional database at the center of everything. And we’re going to put a little web server on top of it to act as our API. So far so good. We’re going to use a data processing framework to compute our transformations. The transformation from raw data into a feature is too costly to perform from scratch every time on demand. So we’re going to process the data as it comes in and write the results somewhere else. Now to write this transformation, you’re going to have to learn some specific framework or programming language.

We’re also going to need some external data sources to compute additional features. That computation might take some periodic rescheduling. So as we pull in the data throughout the days, we’re going to need to use a framework for that. And then finally, if our features are coming from multiple sources, we might still require online processing for things like joints. Now, some of these stream processing frameworks on notoriously poor at joining streams. So you’re going to need some additional services to do that and we’re going to use a state full store like Redis. Good news is a system like that is going to be very helpful for caching at answers as well for low latency rates. And when you sort of step back at all this, this is a lot of infrastructure, right? Let’s admit it. The stream processor was one tiny part of the solution. And to actually build what we needed, we walked ourselves into accidentally building an entire database around it.

Let’s look at three sort of classic database problems that we have to solve. First, we have to think about consistency. What happens in the failure scenario? One service is working. Another one doesn’t process a request. Second, efficiency. Right? Our queries change. Our schemas evolve over time. We build new features. We’re running multiple queries side by side. We have to do essentially query planning. Third, we have to build and maintain indexes and caches. Keeping our cache valid and updating it and invalidating it is one of the hardest problems in computer science. Don’t get me wrong, in our architecture, we got a lot of value out of the screen processor. It did the heavy lifting on the computation, but it wasn’t sufficient on its own. Stream processors are part of a solution, but they aren’t the entire solution. Ultimately, today’s streaming systems don’t meet users where they are. They required developers to let write a lot of complex code.

Put it another way. In other parts of your day, when you need to persist some data to a disc, you don’t implement a B-tree and consider the subtle behavior of fsig. You use a database. You pick a tool that takes your expressed intent. It takes care of the hard problems. I believe that streaming today is a great tool, but it’s more like a B-tree storage engine or a push-based query execution framework. These are all amazing things and I’m really grateful for them, but developers don’t think about them every day. Right? They get to harness their power in an easier to use familiar experience. That’s what databases give you.

So at materialize, we’ve been hard at work building a new database, one powered by a modern stream processor, its core. But one thing we strongly believe in is that none of our users need to know, and users shouldn’t have to care about the stream processor. Now, of course, some people might because it’s cool. And for those that care, the timely data flow stream processor at the heart of materialize is open source. But for the developer, it looks and feels like a fully featured database you can get your job done.

So as a result of this, users can think about computation the same way they have for decades and even migrate existing workloads from data warehouses because they express that computations in standard SQL. Materialize allows you to run these complex analytics workloads in real time. And it supports the full breadth of SQL, which means you can run those data warehouse grade queries on it. And the added bonus is it doesn’t really care if you’re doing batch or streaming, it’s all just one kind of computation so you can mix and match. For example, you might need to backfill a streaming pipeline with some historical data on S3. It’s Postgres wire-compatible so it looks and feels like a regular database. It incrementally materializes, hence the name, the result of your views. And they can be efficiently queried from memory or pushed out to your application using a standard Postgres driver. And most importantly, it fits in your data stacks. You can orchestrate your computations using a tool like dbt.

So let’s start all over. Right? What does building a feature store look like in a world with a streaming database system? Right? There’s no job scheduling. We stream our data directly into materialize. Materialize pulls the data in from a message broker like Apache Kafka for your raw events or directly from your relational database like Postgres. Instead of a zoo of tools, materialize just defines all the transformations in just SQL. Right? The same SQL that you’re probably using across your organization. So you have the added bonus of consistent business logic definition in your production tools as you do in your analytics tools. We’ll dig into exactly what this looks like just in a little bit. And finally, there’s no need for a separate caching layer, right? Materialize is continuously materializing the results and memory so your reads are fast and up to date.

So let’s look at the actual SQL. First things first, we going to connect to the data sources and get in the data. We’ll connect to Postgres using change data capture. There’s a table in our Postgres OLTP database correlating accounts to account owners so it makes sense to just… Anytime there’s any change, say someone gets issued a new card, we instantly replicate that change into materialize. We also have materialize consume these messages from the confirmed fraud topic from a message broker. And that could be sort of any Kafka API compatible message broker like Redpanda. Each message is a JSON Payload containing an account ID and timestamp for a transaction confirmed to be fraudulent.

Now we’re in place to define our features. We’re simply going to write SQL and not worry about streaming. We start by creating a view. We’ll market as materialized so the system will continuously consume eagerly from our sources and update the results of this view definition. Users of the view can then read the pre-computed answers from memory with minimal latency. And from here, it’s all SQL that you already know. You can do some projections to pull the right columns, perform some aggregations, join across the various streams, and even filter out rows when they become too old. Over here, as records become more than 30 days old, they’re automatically filtered out so that the in memory materialization is super efficient on read. And finally, from the client’s perspective, when it’s time to read the result for our production service, our feature is already available in memory ready to serve with a simple select query. Right?

Now I believe that streaming is the future of data processing, but I also believe that streaming will only become mainstream when it disappears into the background as a boring implementation detail. And developers experience the ease of just writing SQL the way they now write batch equal on data warehouses. Thanks so much. I’m a LaTeX on Slack if you have any questions, and if you go to materialize.com/apply, it’ll take you to a GitHub repo where we have the code for building a [inaudible] with just SQL. You can try it yourself on your laptop right now. Thank you very much.

Arjun Narayan

CEO / Co-Founder, Materialize

Materialize

Arjun Narayan is the co-founder and CEO of Materialize. Materialize is a streaming database for real-time applications and analytics, built on top of a next generation stream processor – Timely Dataflow. Arjun was previously an engineer at Cockroach Labs and holds a Ph.D in Computer Science from the University of Pennsylvania.

Streaming is just an implementation detail

Arjun Narayan

Let's keep in touch

Book a Demo

Contact Sales

Request a free trial