Lakehouse: A New Class of Platforms for Data and AI Workloads

apply(conf) - May '22 - 30 minutes

In this talk, Matei will present the role of the Lakehouse as an open data platform for operational ML use cases. He’ll discuss the ecosystem of data tooling that is commonly used to support ML use cases on the Lakehouse, including Delta Lake, Apache Hudi, and feature stores like Feast and Tecton.

Matei Zaharia:

Great. Yeah. Thanks everyone for coming to my talk. So I’m going to talk about something happening on the data infrastructure side. It’s a new kind of system called lakehouse that a bunch of companies are trying to build. So it’s, I think, an interesting component of the data stack that I think over time will become increasingly important. And this is something that we at Databricks started to design and build pretty early on, but many other companies are doing it as I’ll talk about. I think it’s actually a natural evolution of a lot of the data management systems we have today.

Matei Zaharia:

So basically I’m going to talk about a couple of things. So first of all, I want to argue that a lot of the problems that data and ML engineering faces today, not all of them, but definitely a bunch of them, stem from the complex data architectures that companies are forced to deploy, which involve lots of different systems that are better at managing different types of data for different workloads. And often moving and coordinating between these systems is a major problem. So that’s the first thing and because of this, people are looking to simplify the kind of systems you run and lakehouse systems are a new design that kind of tries to combine the benefits or use cases of what used to be separate systems before. So in particular, they support data engineering, SQL data warehousing, and machine learning on the same data store. The reason it’s called lakehouse is because they’re built on data lake storage, which is low cost and very scalable. So that’s basically the outline.

Matei Zaharia:

So let’s start with what are the problems that people face with data? At Databricks, we work with companies that do a wide range of different things from sort of simple SQL based reporting and warehousing to machine learning and, interestingly, we talk with them all the time. We ask them, “Hey, what do you wish your data platform did? Should it be faster? Should it be more scalable? Should it be cheaper,” whatever, but actually the top problems that users have with data today tend to be with the data itself. The data quality, first of all, and then timeliness. So if you don’t have high quality data going into your application, whether it’s simple analytics or machine learning, then you just can’t do anything with it. And same thing to some extent with timeliness. If it takes forever to get the new data for you to look at, it’s a big problem.

Matei Zaharia:

So we are not the only ones to see this. Actually, quite a few surveys of the industry showed this. For example, here’s a survey of data analysts from Fivetran. Basically they found that 60% of them reported data quality as a top challenge and 86% of them had to use stale data with a lot of it being more than two months old. And this is just because it takes a long time for data to make it through the system where they can actually do analysis. And also all of them basically had unreliable data sources. And the same thing is true with data scientists. If you look at where they spend their time, there’s a lot of work on understanding data and of course, on building the infrastructure to run their data and ML pipeline operationally, which you know, a lot of which is covered in this conference.

Matei Zaharia:

So I want to argue that even though getting high quality and timely data is intrinsically hard, some of it is a problem of our own making because of the way that data stacks look today and it’s basically for historical reasons. It doesn’t have to be that way for a technical reason today. So let’s look at the history of data management systems for analytical data. So this space kind of started in the 1980s with data warehouses and back at that time, companies had what were called operational databases. So they had these live applications that would do transactions, like say the booking system for an airline or something like that. But as more and more computer applications were deployed, they wanted to analyze what’s happening. So people designed this other type of database called the data warehouse that’s optimized for analytics over large amounts of historical data. And the idea was you would take your operational data stores, you would extract, transform, and load, or ETL, data into it regularly. And then in this warehouse, you would see data from lots of applications together and it’ll be optimized to do efficient reporting.

Matei Zaharia:

And this is a full SQL database with a lot of powerful management features, as well as performance optimizations. So things like schemas, indexes, ACID transactions, maybe multi-version control so you can look at an old version of your table and stuff like that. So it’s a very rich, easy to use environment. And this worked really well. It’s a huge industry, continues to this day. In the 2010s though, some new problems started to come up for this pattern, which are basically three problems. The first one was that the data warehouses were very SQL and table centric, mostly designed for historical, but sort of smaller scale data compared to what we have today. And they didn’t support unstructured and semi-structured data, which was becoming increasingly common. The second one is that the cost to store large data sets was quite high because again, they were just designed for a smaller data volume from these business events.

Matei Zaharia:

And then the third one was that because the only interface is SQL, they can’t easily support data science and machine learning. So as a result, we got a second type of system that probably almost everyone in the conference uses, which is called a data lake. So this is basically low cost storage that is designed to just store raw data in whatever format you want. For example, Amazon S3 or Hadoop File System for on-premise or like Azure Data Lake Storage or all these other cloud storage systems. And in this space, basically it’s very easy to just upload your data, whatever format and type it is, and just store it in there and then build applications downstream from it.

Matei Zaharia:

One of the other things that was really different with data lakes is that because they came out of open source, they were designed mostly around open data formats like Apache Parquet, which means that the data is actually directly accessible to lots of engines once you store it in there. And as a result, we have this huge ecosystem of tools on top that know how to read these data lake formats like Parquet and can work on top of it. In fact, even Apache Spark started that way. When I started it, there were a few systems like Hadoop that were widely used and there was all this data in open formats. And it allowed me to write an engine that is better at certain things and just have it work on people’s existing data, which is great for everyone involved. And with this data lake though, you have very limited management features. It’s just a file system basically, or actually even weaker semantics than a file system. So you still have to have jobs that load the data into other systems for more powerful management and performance features.

Matei Zaharia:

So for example, you would still run your data warehouse and you would just load a subset of your data into it regularly. And today these data lakes in most enterprises are where the vast majority of bytes are stored. It’s not necessarily where most users query the data, but that’s where most of their data lands and is stored. Okay. So what’s the problem? This looks like we’ve kind of solved everything. We have a low cost, cheap storage tier, the data lake, and we can still run our existing stuff on top. So yeah, it does let you run all these things, but the problem is that it’s complex to use. And in particular, it’s a lot more complex than the single warehouse you had in the 1980s, where as soon as an event happens in your operational system, it just gets loaded into there and analysts can immediately query it and can start analyzing whatever happened.

Matei Zaharia:

Okay. So a few different problems happen. So the first one is that data reliability suffers. You’ve got at least these two different storage systems. And in practice, you might have even more things here, like say a Kafka message bus or something like that. And they have slightly different semantics, maybe different dialects of SQL, different data types in their tables and so on. And you have all these jobs that are supposed to move data between them. So when an event gets copied, say from the data lake to the data warehouse, there is a chance that something goes wrong and actually it’s corrupted or that one of those jobs goes down or has a bug, okay?

Matei Zaharia:

Second problem is that timeliness suffers. You have all these extra ETL steps that need to get the data in there, or even worse, if there’s no pipeline for some new data set, you have to wait for someone to build a pipeline before you can analyze it. And then of course you also have high cost because you’re duplicating some of your data and you’ve got these extra jobs that are running that are basically just transcoding from one format to another and copying bytes from one system to another. They cost a lot and they’re not really doing something useful. You’re just spending sort of cloud credits on that. Okay. So that’s where lakehouse systems come in. The idea here is to implement the data warehouse management and performance features and in general, other features of what are today separate storage systems directly on top of these data lakes and open data formats. And the idea is, there’s very little sort of technical reason why you couldn’t have say a great data warehouse performance backed by S3 or a great kind of message bus system or things like that.

Matei Zaharia:

In fact, a lot of the modern systems already use cloud storage underneath. So we want to do that with these open formats and give people ideally a single tier in their data architecture that all these workloads can run against. So if that works out, it would look a little bit like this. You’d have your data lake at the bottom. It’s still something like Amazon S3. And on top of it, you’d have this interface that is a management and performance layer that gives callers sort of higher level abstractions on top, like say transactions or versioning. And then on top of this, you can have a few interfaces. You can have SQL for everything that talks SQL, and you can also have direct access to the files, because they’re in these open formats, for high throughput systems like the machine learning engines that I showed or things like Apache Spark or whatever the next great engine that someone designs is.

Matei Zaharia:

It can just directly read those files at high performance. So this looks nice. It’s always nice to have just a single layer, but the question is, can we actually do it? Can you get great performance, can you get governance and management features that are the reasons we have these different systems today? So I’m going to talk about three kind of areas of work that are enabling this to work. And these are metadata layers on data lakes, which add transactions, governance, and other management features. Lakehouse engine designs, so in particular for SQL, which is where there’s been the most investment in performance in kind of these standalone systems. Can you actually get great performance on today’s open formats? And then I’ll talk about how the ML stack interfaces with these, which is also pretty interesting because you can take advantage of all the optimizations that are done in the other layers.

Matei Zaharia:

So I’ll talk about each of these in order. Okay. So metadata layers, so this is kind of the key technology that’s enabling lakehouse in a sense to be possible. Basically, these are these layers on top of your raw files that try to provide richer semantics and quite a few companies have developed these at roughly the same time, basically around 2016, 2017. So at Databricks, we developed Delta Lake. There’s also Apache Iceberg, which started at Netflix and Apache Hudi from Uber. And now actually a lot of the cloud vendors are also offering kind of cloud services that are sort of proprietary, but try to do some of these same things. And the idea in these is actually quite simple. Basically, you’ve got a collection of files, that’s what the low level storage system is, but you want some higher level interface on top and these systems basically just track which files are part of a table version or a data set and use that to implement transactions and other functionality as I’ll talk about.

Matei Zaharia:

And basically when a client application wants to read, say a table, they can ask this layer, “What files are part of this table?” And they get a list of files and then they just directly read those. So it seems really simple, but this layer of indirection adds a lot of opportunities for great functionality. So let me just talk about one of the things which is transactions. So normally if you’re using a data lake, one of the problems with that, one of the reasons you can’t just run all your workloads against it is that there’s no isolation of concurrent workloads.

Matei Zaharia:

So for example, let’s say we’re storing a table, which is a bunch of events. It’s a bunch of files. It looks nice. If everyone just reads the table, that’s great. Then they can all read it in parallel and it’s very fast and everything’s great, but sometimes you’ll have workloads that try to modify the table. For example, you’ll have someone who tries to delete all the events about a specific customer. And when that happens, you have to modify or add and remove individual files. And there’s no way to do this atomically for the whole table. So for example, you might need to take file1 and rewrite it into file1b and then delete file1. And you might need to do the same thing for file3 as there are events in there. And of course, if you have a job that’s running and doing this, different things can go on.

Matei Zaharia:

So for example, if this is in the middle of running and it’s updated file1, but not file3, some job might go in and see a broken state of the table where some records have changed or, worse, it sees 1 and 1b at the same time and so it sees duplicate data. And of course, if your job that does the update crashes, that’s also a problem, because then you’ve left the table in that corrupt state forever. So how do lakehouse systems solve this? Basically, it’s this layer of indirection that keeps track of individual table versions as a concept and in Delta Lake, just as an example, it does that using a log of transactions or changes to the table and the log itself is stored in S3 or in your cloud storage system. So you don’t actually need to run a separate service to do this. In fact, this is as highly available as S3, as long as S3 is up and you can read and write to it, you can do ACID transactions on your table.

Matei Zaharia:

So this is a simplified version. Actually the log format’s quite a bit more optimized than this, but you can think of each entry in the log as telling you which files are part of that version of the table. So for example, version two of the table is file1, 2, and 3. And then when you do an update like this one with deleting stuff, you can have a job that’s running to do the update. It can rewrite a bunch of files and everything’s fine because readers look at the log and they only look at file1, 2, and 3. They ignore these extra files because they’re not part of any committed version yet. And then when you’re done with your update, you can atomically add a new log file. There’s various ways of doing this in each cloud that says, “Hey, version three of the table is now all these updated files.”
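
A minimal Python sketch of the log idea described above. This is not Delta Lake's actual log format or API: the `TableLog` class and its methods are hypothetical names, an in-memory list stands in for log files in cloud storage, and the version check in `commit` stands in for whatever atomic "add a new log file" primitive a given cloud offers. Versions here are numbered from 0 for simplicity.

```python
class TableLog:
    """Toy transaction log: entry i lists the files that make up table version i."""

    def __init__(self):
        self._versions = []  # committed versions, each a frozenset of file names

    def latest_version(self):
        """Return the newest committed version number, or -1 for an empty log."""
        return len(self._versions) - 1

    def snapshot(self, version=None):
        """Return the set of files for a committed version (latest by default)."""
        if version is None:
            version = self.latest_version()
        return self._versions[version]

    def commit(self, expected_version, files):
        """Publish a new version only if nobody else committed in the meantime."""
        if expected_version != self.latest_version():
            raise RuntimeError("conflict: table changed since the writer started")
        self._versions.append(frozenset(files))
        return self.latest_version()


log = TableLog()
log.commit(-1, {"file1", "file2", "file3"})  # first committed version

# A writer deletes one customer's events by rewriting file1 and file3.
base = log.latest_version()
staged = {"file1b", "file2", "file3b"}  # written, but not yet committed

# Readers that look at the log now still see only the committed files.
before = log.snapshot()

# One atomic step makes all of the rewritten files visible together.
new_version = log.commit(base, staged)
after = log.snapshot()
```

A reader either sees `before` or `after`, never a mix of `file1` and `file1b`. If the writer crashes before `commit`, the staged files are simply never referenced by any version.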

Matei Zaharia:

And so readers who look at this after you do this will see everything updated together. You’ll never get this inconsistent version. So very, very simple idea but now you can actually have concurrent users and workloads and do all kinds of management operations on your data directly in the data lake. You don’t need to move to a SQL warehouse to do that. And based on this simple sort of idea, we’ve also built quite a few other rich management features in Delta Lake. So for example, you automatically get multi-version sort of tracking with this design. So you can easily look at the log and say, “What did the table look like, say, a week ago?” And read that version of it. So that’s really nice. You can also easily do zero-copy clones of the table.
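
The "what did the table look like a week ago" lookup can be illustrated with a small Python sketch. It assumes each log entry carries a commit timestamp; the `history` list and the `files_as_of` function are hypothetical and are not part of any real library.

```python
from bisect import bisect_right

# Hypothetical commit history: (commit timestamp, files in that version), oldest first.
history = [
    (100, {"file1", "file2"}),
    (200, {"file1", "file2", "file3"}),
    (300, {"file1b", "file2", "file3b"}),
]


def files_as_of(history, timestamp):
    """Return the files of the newest version committed at or before `timestamp`."""
    commit_times = [t for t, _ in history]
    idx = bisect_right(commit_times, timestamp) - 1
    if idx < 0:
        raise LookupError("table did not exist yet at that time")
    return history[idx][1]


# Reading "as of" time 250 resolves to the version committed at time 200.
earlier_files = files_as_of(history, 250)
```

Because old versions are just older lists of files, reading one needs no copy of the data, only that the files it references have not been deleted yet.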

Matei Zaharia:

You can create a new table, say for development whose log starts by pointing to files in a previous table. And then you can just remember the changes. So it’s very cheap. It’s a little bit like forking something in Git where you can just quickly develop on a changed version of it. Of course, you see its whole history. So you can see who’s been working with it from the log. And then another cool thing that’s pretty widely used is, since we have this log, you can view each table as a stream of changes. And you can actually use a Delta table as a message bus, so you don’t need to run something like Kafka, if you just want in-order delivery of some events. Of course, since this is based on S3 the latency isn’t as good as Kafka.
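
Both ideas, the zero-copy clone and the table viewed as a stream of changes, fall out of having a list of versions. The Python sketch below is illustrative only: a plain list of file-name sets stands in for the log, and `clone` and `changes` are hypothetical helpers, not Delta Lake functions.

```python
def clone(source_log):
    """Start a new log from the source's latest file list.

    Only file names are copied; the data files themselves are shared.
    """
    return [set(source_log[-1])]


def changes(log, from_version):
    """Yield (version, added_files, removed_files) for versions after `from_version`."""
    for v in range(from_version + 1, len(log)):
        prev, cur = log[v - 1], log[v]
        yield v, cur - prev, prev - cur


prod = [{"f1", "f2"}, {"f1", "f2", "f3"}]

# Development work lands in the clone's own log and never touches prod's log.
dev = clone(prod)
dev.append(dev[-1] | {"dev_only"})

# A downstream consumer that already processed version 0 asks what happened since.
prod.append({"f1b", "f2", "f3"})
feed = list(changes(prod, 0))
```

The consumer receives the changes in commit order, which is the in-order delivery property mentioned in the talk.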

Matei Zaharia:

So definitely there are workloads where you would need a separate message bus, but for a lot of things, you can actually just have your table be a streaming input source for something else and get all the changes. Okay? And this is an idea that’s become quite widely used. So Delta Lake itself, I think, is the most widely used system in this space. It’s used at thousands of companies already to manage exabytes of data. One kind of really striking thing for me was that on Databricks, we went from 0% to more than 70% of our write workloads going into Delta Lake since we launched this in 2017. That’s very unusual for a new storage format. That’s usually one of the things it takes a long time for people to adopt.

Matei Zaharia:

And there’s also really broad industry support with all these tools that can integrate with it. So I think this type of layer is definitely here to stay for data lakes and it just kind of strictly makes them better. Okay, we made data lakes better in terms of management, but it’s still not enough to replace the separate systems we have today. And one of the core issues people worry about is performance because with data warehouses, you’ve got these basically 40 years of engineering going into building these bespoke systems with co-designed storage and compute and query planning that can do analytical queries. So can you get performant SQL on a data lake? And it might seem that it’s hard because you’re stuck with these open formats. Maybe it’s harder to change them.

Matei Zaharia:

So maybe you can’t implement all the optimizations in a warehouse, but it turns out that first of all, today’s open source columnar formats are pretty good and they are evolving to meet new needs over time. And second, you can also use the lakehouse design with the layers I showed before to implement a lot of optimizations on the side that improve performance. So for example, you can keep auxiliary data structures like statistics that are updated with each transaction so they’re always consistent. You can organize the data within each file. You can reorder the data to minimize I/O, you can do caching and you can also design engines like a data warehouse engine, but for today’s open formats that actually perform pretty well. And these things together help with both cold and hot data. So all the things you do on storage let you minimize the amount of bytes you have to scan.

Matei Zaharia:

And then once you do scan those bytes, you can do a lot of things in memory or in the CPU to make the execution faster. So a bunch of companies are building engines for these. I’ll talk about what we are building at Databricks with Photon, which is our engine. It’s basically a version of Spark that is faster, that uses modern kind of warehouse techniques. So I’ll just briefly show how some of these things work. So one of the optimizations is auxiliary data structures, right? So even if your base data is in an open format like Parquet, you can also build many data structures on the side that speed up queries. And with something like Delta Lake, you can maintain them with your transactions so they’re always up to date.

Matei Zaharia:

So for example, one of the things we do in Delta Lake is called a min/max zone map. So for each Parquet file, we have statistics about each column, the minimum and maximum in it. And normally Parquet actually has these inside the file itself. But the problem is you have to do a lot of I/Os to look at the footer of each file and figure out its statistics and decide whether it’s relevant to your query. So if you have a million files, you have to do a million I/Os to see what’s in them. With Delta Lake, we just take these and we basically store them in the Delta table log. And a million of these entries, especially once you compress them, will just be a few megabytes so it’s super cheap to store that stuff in there.

Matei Zaharia:

But the cool thing is that they’re now all together in one place. So when you read your snapshot of the table, you read all the statistics and you can immediately narrow down to which files are relevant for a specific query. And you only have to do one I/O instead of millions of them. It takes a few hundred milliseconds to do that with S3. So just to show you how you use these, basically when you get a query with conditions in it, you can eliminate a lot of files based on their min/max statistics to know they’re not relevant to the query. So in this one, actually only the last file could have records in the range we care about. And then when you run the query, you can just read that one file.
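
File skipping with min/max statistics can be sketched in a few lines of Python. The file names, the `year` column, and the `files_to_scan` function are made up for illustration; the only assumption taken from the talk is that per-file minimum and maximum values for each column are available together in one place.

```python
# Hypothetical per-file statistics: column name -> (min, max) within that file.
file_stats = {
    "part-0.parquet": {"year": (2015, 2017)},
    "part-1.parquet": {"year": (2017, 2019)},
    "part-2.parquet": {"year": (2020, 2022)},
}


def files_to_scan(stats, column, low, high):
    """Keep only files whose [min, max] range for `column` overlaps [low, high]."""
    keep = []
    for name, columns in stats.items():
        col_min, col_max = columns[column]
        if col_max >= low and col_min <= high:
            keep.append(name)
    return keep


# A query with the condition 2021 <= year <= 2022 only needs the last file.
candidates = files_to_scan(file_stats, "year", 2021, 2022)
```

The statistics can only rule files out. A file that overlaps the range still has to be read and filtered row by row.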

Matei Zaharia:

Okay. So that’s one optimization. Another thing that we do is optimizing the data layout within each file. So if you know which records are likely to be accessed together, you can sort your files and sort the records inside them to keep those close together to minimize I/O. And there are a lot of techniques here that you can use. We implemented a bunch of them, including space filling curves that let you cluster the records in a file along two dimensions and have locality in both. So this means if you query on a range of each of those dimensions, you have to read a lot less than the whole file. So that’s this one. And of course, on top of these systems, you can do caching. We implement caching on top of solid-state disks, and we have this special sort of half decompressed format that’s actually faster than just putting Parquet on an SSD.
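
The talk does not say which space-filling curve is used. As one common example, the Python sketch below uses a Z-order (Morton) curve, which interleaves the bits of two coordinates into a single sort key. The `z_order_key` function and the sample records are illustrative assumptions.

```python
def z_order_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return key


# Sort a small grid of (x, y) records by their position along the curve.
records = [(x, y) for x in range(4) for y in range(4)]
records.sort(key=lambda r: z_order_key(*r))
```

After sorting, records that are close in both `x` and `y` tend to sit next to each other, so a range query on both dimensions touches fewer contiguous chunks than it would with a sort on `x` alone.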

Matei Zaharia:

And then in the engine itself, you can implement all kinds of work, like for example, using vector instructions on the CPU. And we’ve done that in an engine that basically is compatible with Apache Spark so we’re running a lot faster than the open source Java based engines for SQL workloads. And it turns out at least so far, the Parquet format hasn’t really been a limitation. It’s a well designed format and we can match the performance of data warehouses with custom formats using this open one. So yeah, so if you put all this together, you actually get pretty competitive systems. We’ve been running benchmarks against a lot of the popular cloud data warehouses. We have a bunch of these. This is one we posted about last year, where you can see on a pure data warehousing benchmark, TPC-DS that ships with every data warehouse in the industry, we’re actually doing quite a bit faster than a dedicated system that does that.

Matei Zaharia:

So at least today, I don’t think there’s any technical evidence that you’d get better performance or that you shouldn’t use these open formats and use lakehouse if you want the best data warehousing performance. Okay. And then the final thing I’ll talk about is how this ties to machine learning and the interesting things you can do with lakehouse for that. So with machine learning today, as you saw in a lot of talks, you have the challenge of how to set up your data pipelines and make them reliable. And it’s pretty hard because you’ll probably want data from a lot of different systems in your company. You might have some data in your data lake, you might also have stuff in your data warehouse. And with the data warehouse specifically, it’s kind of painful to do machine learning, at least with large amounts of data, because ML workloads need to basically read entire tables or entire large data sets and data warehouses with the SQL interfaces are mainly designed for sort of smaller queries that extract a small subset of the data.

Matei Zaharia:

So just reading the data through SQL from these is often really slow and basically it has to be transcoded from the internal format to some open format that the ML engine understands and you have to have a bunch of nodes doing that. Now you could decide not to read directly from the warehouse. You could export the data to a data lake, but then this is yet another ETL step that could go wrong, that could break your ML pipeline, if this has gone wrong before you even get to look at the data and run ML on it. And you could also try to keep your production data sets in both the data warehouse and lake, but that’s even more complex and more things can go wrong as you’re then trying to run this application.

Matei Zaharia:

So with the lakehouse, first of all, just by having a single storage tier, it makes things quite a bit simpler. The ML frameworks, many of them, can already support reading Parquet. And as long as you have a small sort of shim layer that tells them which files to read, they can just integrate with the transaction management and other features in the lakehouse. And then they can just read those files in parallel, directly from cloud storage, without having some other node in between that’s translating them into a different format. And you can also use engines like Spark, which have declarative APIs that can do query optimization. So that’s pretty nice. But then the other cool thing is you get these built-in features like data versioning, streaming, and so on that help with the ML life cycle. So for example, in MLflow, we automatically track which version of a table you used in an experiment so you can get back that same data.

Matei Zaharia:

So a lot of these systems are now integrating with kind of leading ML platforms like Feast, Tecton, and MLflow. All of them can work with Delta Lake, for example. So it’s an exciting area and there’s a lot more to be done here, but I think having a powerful data management system is really important for doing ML. So that’s mostly what I wanted to talk about. I’ll skip this, just showing how things work, but yeah, if there’s one thing to take away, it’s that today’s world with many separate data systems, I think, doesn’t have to be that way for a technical reason. It’s that way for historical reasons because of how these evolved. And I think over time, we’ll converge to ways to do very high quality SQL workloads, data engineering, ML, and so on, on the same system that is based on low cost cloud storage. And that’s what we’re trying to do with lakehouse systems and I think it’s something you’ll start seeing throughout the industry. Thanks.

Demetrios:

Awesome stuff, Matei. Dude, legendary. I appreciate you coming on here and teaching us about this. Do you got a few minutes for some questions?

Matei Zaharia:

Sure. Yeah. If you’d like, yeah.

Demetrios:

All right. So first one up is about assessing data quality. How do you assess data quality and what makes it good or reliable data?

Matei Zaharia:

Yeah, that’s a great question. Yeah. I mean, it definitely depends on your setting. So there are different approaches to doing it. One approach is to have automated tests that run against it, like checking is a large fraction of the data null and so on, or are these values out of range, and with the systems I showed, you can run those tests and you can actually roll back to an old stable version, if you see a problem, or you can decide not to push a change into the production version of your table, but there are quite a few other problems because you might have records that look okay where there’s still some kind of bug. So I think you have to periodically have sort of end-to-end sanity checks that see, “Okay, are these tables consistent with each other? If I put in some records to my pipeline, what comes out at the other end?” And so on. I do think the biggest thing is to try to have a simple architecture with fewer moving pieces.
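
The kind of automated check described in this answer can be sketched in Python. Everything here is hypothetical: the `amount` column, the thresholds, and the function names are chosen only to illustrate a null-fraction test, a range test, and the decision not to publish a bad table version.

```python
def null_fraction(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def out_of_range(rows, column, low, high):
    """Rows whose non-null value for `column` falls outside [low, high]."""
    return [
        r for r in rows
        if r.get(column) is not None and not low <= r[column] <= high
    ]


def passes_checks(rows, max_null_fraction=0.1):
    """Accept a candidate version only if both checks on `amount` succeed."""
    too_many_nulls = null_fraction(rows, "amount") > max_null_fraction
    bad_values = out_of_range(rows, "amount", 0, 10_000)
    return not too_many_nulls and not bad_values


stable = [{"amount": 10}, {"amount": 25}]
candidate = [{"amount": 10}, {"amount": None}, {"amount": -5}]

# Keep serving the last stable version when the candidate fails the checks.
published = candidate if passes_checks(candidate) else stable
```

Checks like these only catch records that look wrong in isolation, which is why the answer also recommends end-to-end sanity checks across tables.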

Demetrios:

Excellent. Someone was asking about the differences between Snowpark and Databricks.

Matei Zaharia:

Yeah. I mean, I think that there’s a significant difference. So Snowpark is a proprietary API for running Python and Java workloads on Snowflake. And so it means you have to code up your application against that. You can’t use any existing application that was built on Spark, for example, despite its name. Databricks is mainly focused on running open source engines and supporting the open source APIs. So for example, the stuff I showed with Photon, we took great care to make sure that it’s just a version of Spark so people can take any job written against the open source Spark API and run it on there. And same thing for machine learning. You can run TensorFlow, PyTorch, Horovod for distributed learning, XGBoost distributed and so on. So we are just focused on supporting open APIs and designing platforms around them that make it better or faster to run those. Yeah.

Demetrios:

Makes sense. All right. So in case anyone is wondering where I’m getting these questions from, if you go to Slack, the apply() Conference channel, throw your question there and I will ask Matei or any of the future speakers too. So next up, how are aggregations and addition of business logic handled in the case of the lakehouse? Does that part look the same if it were a data warehouse?

Matei Zaharia:

Yeah, you definitely need to do work to create downstream tables from the raw data you get. So you can’t just use the raw data. So we’ve seen people use this pattern where they classify tables into bronze, silver, and gold. So bronze is the raw data that comes in. Silver, you have pipelines that use… They often use the streaming mechanism I talked about where you can listen to changes in a table and then update some downstream thing based on that. And then they have a final tier called gold, which is things that they’re pretty sure are vetted and very easy for everyone to consume.
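
A toy Python version of the three tiers, with plain lists and dicts standing in for tables. The field names and the `to_silver` and `to_gold` functions are invented for illustration, and the cleaning and aggregation rules are assumptions rather than anything prescribed in the talk.

```python
from collections import defaultdict

# Bronze: raw events exactly as they arrived, including a malformed one.
bronze = [
    {"user": "a", "amount": "10"},
    {"user": "b", "amount": "oops"},
    {"user": "a", "amount": "5"},
]


def to_silver(raw_rows):
    """Parse and type raw rows, dropping the ones that fail to parse."""
    clean = []
    for row in raw_rows:
        try:
            clean.append({"user": row["user"], "amount": int(row["amount"])})
        except (KeyError, ValueError):
            continue
    return clean


def to_gold(silver_rows):
    """Aggregate cleaned rows into a per-user total that is easy to consume."""
    totals = defaultdict(int)
    for row in silver_rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Since all three tiers live in the same system, someone who finds that `gold` lacks a field can still go back and query `silver` or `bronze` directly.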

Matei Zaharia:

But one of the really cool things is if you’re an analyst or a data scientist coming in there and 99% of what you need is in the gold tables, but there’s some stuff, some fields that aren’t in there, or some aggregations that aren’t, you can still go and query the other tables in the same system and bring in all that stuff. So you can have access to everything. Let’s say if you’re working in Snowflake all the time as your warehouse, but there’s some stuff that isn’t in there yet, you have to learn and get access to a whole different system to then query S3. So with the one tier you avoid that. Yeah.

Demetrios:

Okay. All right. So next up, how do you support Databricks for on-prem installations in hybrid cloud?

Matei Zaharia:

Yeah. Great question. Yeah, we don’t have a Databricks on-prem product, but since it’s only running these open source engines and formats, you can run Spark on-prem, you can run Delta Lake on-prem with HDFS. So quite a few people use that and then deploy workloads either in the cloud or on-prem in a similar way. So that’s one of the ways. Yeah.

Demetrios:

All right. I think we got time for one more, so if someone wants to drop it in last minute. While I’m waiting for it, because I see somebody’s typing in Slack, I really want to know because I know you’ve got the Data and AI Summit coming up and you’re always thinking ahead, I think a lot of us look at you as quite a visionary in the space, what are some big things that you think need to happen in the next year in this space?

Matei Zaharia:

Yeah, I mean there’s a lot of stuff, I think, to still figure out in the ML space and it’s really cool to see that ideas that used to be only at very advanced tech companies, such as feature platforms, are now gaining wide adoption. So I think there’s a lot to do just to figure out a good set of concepts that you can tie together that bridge between data teams and machine learning teams. And yeah, I think there’s still a lot to figure out there. I don’t know. Let’s see. Yeah. I think the other kind of interesting trend that I think we’re starting to see with these feature platforms too, is the need to do more stuff in real time. I think that’s also really cool and that’s one of the things that we’re investing in, we’re seeing a lot of growth in with the Spark Streaming engine, for example. So I think that’s going to happen more and more, but I think that’s a slower rate of evolution overall than the ML platform one.

Demetrios:

Excellent point. Bridging the gap between the data and ML, I like that. So last question from Ash, have you seen companies adopt data mesh architecture as opposed to data platforms?

Matei Zaharia:

Yeah. Great question. We have seen that. And one of the cool things with the cloud now is now that most of your storage is in systems like S3, you could have different teams managing the storage and still run computations efficiently over all of it, because it’s all in the same data center, right? Even if they have different AWS accounts or whatever, they’re in the same region, it’s in the same place. So it actually makes it easier to have this decentralized ownership of data and it’s one of the things we’ve been trying to support. I actually work a lot on the governance and data sharing products at Databricks and we definitely see this pattern of distributed ownership. So, yeah.

Demetrios:

Excellent, man. Well, I appreciate you coming on here. That’s all we’ve got for now with Matei live, but he will be in Slack. I know there’s a few more questions for you, so feel free to take the conversation over there. Matei, this was awesome, dude. I really appreciate it.

Matei Zaharia:

Great. Thanks a lot.

Matei Zaharia

Co-Founder and Chief Technologist

Databricks

Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley in 2009, and has worked broadly in datacenter systems, co-starting the Apache Mesos project and contributing as a committer on Apache Hadoop. Today, Matei tech-leads the MLflow development effort at Databricks in addition to other aspects of the platform. Matei’s research work was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).
