Declarative Machine Learning Systems: Ludwig & Predibase | Tecton

apply(conf) - May '22 - 10 minutes

Declarative Machine Learning Systems are a new trend that marries the flexibility of DIY machine learning infrastructure and the simplicity of AutoML solutions. In this talk we will discuss Ludwig, the open source declarative deep learning framework, and Predibase, an enterprise-grade solution based on it.

Piero Molino:

Thank you very much. And yes, timing is absolutely perfect. We are going to chat later this week, but right now Travis and I are going to tell you a little bit more about declarative machine learning systems, in particular Ludwig in the open source and Predibase, which is the company that we started and the product that we are building to bring this concept to market.

Piero Molino:

The basic concept is that we believe that in the machine learning space today there is a false dichotomy. On one hand, there is a do-it-yourself approach with low-level APIs, and on the other hand you have AutoML that automates the process. We believe that neither of these two solutions solves the underlying problem, and the underlying problem is the efficiency of the machine learning development process.

Piero Molino:

And this is my personal experience when I was working at Uber specifically, where I was working on a number of machine learning projects. The first one was an intent classification system that contained many lines of TensorFlow code, took five months to develop and seven months to deploy. And then I worked on a fraud prediction problem with 900 lines of PyTorch code, five months to develop and four to deploy. And then there was a product recommender system, and a few more, and the observation from this process is that machine learning projects have too long a time-to-value. Each of these projects took basically a year to deploy, and every single time you implement a new bespoke solution, and that creates technical debt and really low reproducibility. And finally, organizations have difficulty hiring machine learning and data science experts, and so they need a more efficient way to develop machine learning systems.

Piero Molino:

And so, about that dichotomy that we were talking about before: we believe that with a declarative machine learning system, you can get both the flexibility of a low-level, do-it-yourself API approach and the simplicity of an AutoML solution in the same place, because you have a higher abstraction that keeps the flexibility, the automation, and the ease of use. Also, non-experts can now be empowered by the declarative approach to actually use and do machine learning. We pioneered these concepts with Ludwig in the open source and Overton since 2017 really, but more recently with the 2019 release of the open source.

Piero Molino:

So how does Ludwig work? Ludwig is an open source declarative deep learning approach, and the idea behind it is that you can use a declarative configuration system instead of writing your own code for doing your machine learning projects. You can see on the left-hand side there’s an example of the configuration: only six lines of YAML are needed to basically replicate exactly the first intent classification model that I developed when I was at Uber, which took me more than a year, so this makes development much faster. Also, you don’t need to write low-level machine learning code for doing that, and it’s also readable and reproducible.
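
[Editor's illustration] The slide is not reproduced in this transcript, so the following is only a minimal sketch of what such a configuration might look like, written as a Python dictionary rather than a configuration file. The column names `utterance` and `intent` and the file name `intents.csv` are hypothetical. `LudwigModel` is Ludwig's Python entry point; the training call is kept inside a function so the snippet runs even where Ludwig is not installed.

```python
import json

# Declarative description of a text classifier: one text input and one
# category output. The column names are hypothetical, not from the talk.
config = {
    "input_features": [
        {"name": "utterance", "type": "text"},
    ],
    "output_features": [
        {"name": "intent", "type": "category"},
    ],
}


def train(dataset_path):
    """Train a model from the config (requires `pip install ludwig`)."""
    from ludwig.api import LudwigModel  # imported lazily on purpose

    model = LudwigModel(config)
    # Runs preprocessing and training on the given dataset file.
    return model.train(dataset=dataset_path)


if __name__ == "__main__":
    # JSON is used for display only, to avoid a YAML dependency.
    print(json.dumps(config, indent=2))
    # train("intents.csv")  # hypothetical CSV with the two columns above
```

Nothing in the dictionary says how to tokenize text or which network to build; those choices fall back to defaults unless the configuration overrides them.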

Piero Molino:

At the same time, it keeps the flexibility of an expert level of control. So for instance, you can change all the pieces and all the details of your model and your pipeline by changing parts of the configuration. For instance, you can change how the data is encoded or what the training parameters are, or the details of how many layers, or basically every single aspect of the model. This makes it really easy to iterate and improve the models, and it’s also extensible because you can add your own pieces of code that then are referenced through the configuration system, so you can extend it to whatever you want. And finally, it already contains advanced capabilities like hyperparameter optimization, state-of-the-art models already available for you, and also distributed training.
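
[Editor's illustration] Iterating by editing the configuration can be sketched as a small recursive merge over a base dictionary. The helper `with_overrides` is hypothetical and not part of Ludwig. The keys `trainer`, `epochs`, `learning_rate`, and `encoder` follow Ludwig's configuration conventions as far as I know, but the exact schema differs between releases, so treat them as assumptions.

```python
import copy

# Hypothetical base configuration (column names are made up).
base_config = {
    "input_features": [{"name": "utterance", "type": "text"}],
    "output_features": [{"name": "intent", "type": "category"}],
}


def with_overrides(config, overrides):
    """Return a copy of `config` with `overrides` merged in recursively."""
    merged = copy.deepcopy(config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: merge key by key.
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Variant 1: change only the training parameters.
longer_training = with_overrides(
    base_config, {"trainer": {"epochs": 50, "learning_rate": 0.001}}
)

# Variant 2: start from variant 1, shorten training, and swap how the
# text input is encoded (encoder name and key shape are assumptions).
rnn_variant = with_overrides(longer_training, {"trainer": {"epochs": 20}})
rnn_variant["input_features"][0]["encoder"] = "rnn"
```

Each variant is an independent copy, so several experiments can be kept side by side and compared without touching model code.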

Piero Molino:

The basic architecture of Ludwig is this encoder-combiner-decoder architecture, where you have the input part where you have many different data types that can represent the columns of your data, and they are preprocessed and then encoded to a common vector representation, which is combined by the combiner, and then you have a decoder that, depending again on the data type of the data that you are training on, predicts the output of the model. And so this is an end-to-end deep learning architecture that can be instantiated for different tasks, so it’s really flexible and can address many different tasks.
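
[Editor's illustration] The data flow through encoders, a combiner, and a decoder can be sketched in plain Python. This is a toy with fixed, untrained weights; in Ludwig the encoders, combiner, and decoders are learned neural network modules. All function names, feature names, and the vector size here are hypothetical.

```python
VECTOR_SIZE = 4  # size of the common representation (arbitrary choice)


def encode_number(value):
    """Toy encoder: repeat a scalar across a fixed-length vector."""
    return [float(value)] * VECTOR_SIZE


def encode_binary(value):
    """Toy encoder: all ones for True, all zeros for False."""
    return [1.0 if value else 0.0] * VECTOR_SIZE


def encode_category(value, vocabulary):
    """Toy encoder: one-hot over a small vocabulary, padded with zeros."""
    vector = [0.0] * VECTOR_SIZE
    vector[vocabulary.index(value)] = 1.0
    return vector


def encode(row, input_features):
    """Pick an encoder per feature type and encode every input column."""
    vectors = []
    for feature in input_features:
        value = row[feature["name"]]
        if feature["type"] == "category":
            vectors.append(encode_category(value, feature["vocabulary"]))
        elif feature["type"] == "binary":
            vectors.append(encode_binary(value))
        elif feature["type"] == "numerical":
            vectors.append(encode_number(value))
        else:
            raise ValueError(f"no encoder for type {feature['type']!r}")
    return vectors


def combine(vectors):
    """Toy combiner: concatenate the per-feature vectors."""
    return [x for vector in vectors for x in vector]


def decode_number(hidden, weights, bias):
    """Toy decoder for a numerical output: a single linear unit."""
    return sum(w * h for w, h in zip(weights, hidden)) + bias


inputs = [
    {"name": "city", "type": "category", "vocabulary": ["rome", "paris", "tokyo"]},
    {"name": "distance_km", "type": "numerical"},
    {"name": "is_weekend", "type": "binary"},
]
row = {"city": "paris", "distance_km": 2.5, "is_weekend": True}

hidden = combine(encode(row, inputs))
weights = [0.1] * len(hidden)  # placeholder weights, not trained
prediction = decode_number(hidden, weights, bias=0.0)
```

Swapping the decoder, or adding an encoder for another data type, changes the task without changing the overall pipeline, which is the property the talk highlights.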

Piero Molino:

For instance, if you want to train a regression model, you have to set it up so that one input is category, one input is numerical, one input is binary, and you are predicting a numerical output, and so you obtain a regression system. Or if you want to do text classification, you have one text feature as input and one category feature as output, and you have a text classification system. For image captioning, you have an image as input and text as output. The same is true for speaker verification, forecasting, and binary classification; you can imagine that by combining the different types of inputs with the different types of outputs, you can solve many, many machine learning problems with this architecture right out of the box.
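
[Editor's illustration] The three task setups spelled out above can be written as one small table of input and output types. The helper `make_config` and the generated feature names are hypothetical, and the type names follow the wording of the talk; the names a given Ludwig release accepts may differ.

```python
def make_config(input_types, output_type):
    """Build a Ludwig-style config from feature types (names are generated)."""
    return {
        "input_features": [
            {"name": f"in_{i}", "type": feature_type}
            for i, feature_type in enumerate(input_types)
        ],
        "output_features": [{"name": "target", "type": output_type}],
    }


# Only the combinations described explicitly in the talk are listed.
TASKS = {
    "regression": make_config(["category", "numerical", "binary"], "numerical"),
    "text_classification": make_config(["text"], "category"),
    "image_captioning": make_config(["image"], "text"),
}

if __name__ == "__main__":
    for task, config in TASKS.items():
        inputs = [f["type"] for f in config["input_features"]]
        output = config["output_features"][0]["type"]
        print(f"{task}: {inputs} -> {output}")
```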

Piero Molino:

And now Travis is going to tell you how we are scaling this concept up to work with bigger amounts of data and how we are bringing it to the market as an enterprise solution.

Travis Addair:

Thanks Piero. So one of the powerful things about declarative machine learning in our view is that because you’re specifying the what and not the how, we can do quite a lot of sophisticated things in terms of how we go about training the models that you specify in your configurations, and so we’ve done a lot of work over the past year or so on building out a very scalable backend for Ludwig built on top of Ray that allows you to scale up to arbitrarily large data sets and perform distributed data pre-processing using Dask, do data parallel distributed training of the model using Horovod, and orchestrate the whole thing end to end by using Ray.

Travis Addair:

The nice thing about this abstraction is that it doesn’t require you to provision various sophisticated heavyweight infrastructure, like a separate Spark cluster, a separate MPI cluster, et cetera. Everything runs in a common layer, so the same code that executes normally on your laptop can be made to run in a distributed manner by just running it using `ray submit`, and then your existing Ludwig script or your Ludwig command line tool.
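
[Editor's illustration] One way to picture "same code, different execution" is that the model description stays fixed while only a backend section changes. The helper `with_backend` is hypothetical; the `backend` key with a `type` of `local` or `ray` reflects my understanding of Ludwig's configuration and should be checked against the release in use.

```python
import copy

# Hypothetical model description (column names are made up).
base_config = {
    "input_features": [{"name": "utterance", "type": "text"}],
    "output_features": [{"name": "intent", "type": "category"}],
}


def with_backend(config, backend_type):
    """Return a copy of `config` that selects the given execution backend."""
    merged = copy.deepcopy(config)
    merged["backend"] = {"type": backend_type}
    return merged


# Same features and model, two execution environments.
local_config = with_backend(base_config, "local")  # single machine
ray_config = with_backend(base_config, "ray")      # distributed on a Ray cluster
```

The feature and model sections are identical in both variants, so moving from a laptop run to a cluster run does not require rewriting the training script.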

Travis Addair:

So this is what we’ve been doing on the open source to make Ludwig not just a tool for research, but a tool really for production as well. But of course there’s more to productionization of machine learning than the model training aspect in isolation, and that’s why we decided to put together the company Predibase that we’re now working on. What we see as the value proposition of Predibase on top of Ludwig is that we take care of not just the individual user training models problem, but we take a look at the end-to-end problem of how organizations and enterprises actually think about data flowing through machine learning models and getting into production and actually driving value. So that consists of primarily three distinct parts, which are connecting data from a database or data lake, using declarative model training with a Ludwig-like interface, and then productionizing and deploying that model for both batch and real-time prediction using a variety of interfaces, including REST and a SQL-like interface.

Travis Addair:

So what we like to say about Predibase is that it’s a low-code machine learning platform that provides high performance and high flexibility. One of the really unique things about the platform is that, because we have this declarative system that’s very tightly bound between the model and the data schema, we have built this layer on top of the models that come out of Ludwig that we call PQL, which is a predictive query language that allows you to do very scalable batch prediction using a very simple “predict the target variable, given some data from your database or data warehouse” form, which allows you to really bring the machine learning to the data and put this capability in the hands of people who wouldn’t traditionally be interacting with machine learning systems, like analysts and engineers.

Travis Addair:

It uses the same powerful and flexible Ludwig-like configuration that people in the open source community are familiar with, but with a lot of extra features built on top, and provides you with state-of-the-art machine learning infrastructure without the need to build out an entire advanced machine learning infrastructure team. This is a serverless layer that we’ve built on top of the Ludwig on Ray open source product that completely abstracts away the complexity of operationalizing and productionizing machine learning.

Travis Addair:

So the way that you can think about this coming together is that you take your structured or unstructured data, which could be in a data lake or any structured data source like Snowflake, or Feast and Tecton for a lot of the folks here I imagine, and you just connect to any particular data set or table-like structure that you have in those data sources. Predibase will allow you to build the models using our declarative interface, and then you operationalize those models using PQL on the analyst side for batch prediction, as well as supporting full REST deployments that allow you to do very low latency, real-time prediction. And then things circle back and allow you to iterate on the models over time and evolve them as things drift or as model performance degrades, so the whole platform provides an observability layer as well that gives you that MLOps experience end to end.

Travis Addair:

So thank you very much for coming to our talk today, please check us out on Twitter, our Medium posts as well, our website, and of course you can find the GitHub for the Ludwig project here as well. Thanks everyone.

Piero Molino:

Thank you very much.

Travis Addair

Co-Founder & CTO

Predibase

Travis Addair is co-founder and CTO of Predibase, a data-oriented low-code machine learning platform. Within the Linux Foundation, he serves as lead maintainer for the Horovod distributed deep learning framework and is a co-maintainer of the Ludwig automated deep learning framework. In the past, he led Uber’s deep learning training team as part of the Michelangelo machine learning platform.

Piero Molino

Co-Founder & CEO

Predibase

Piero Molino, PhD is CEO and co-founder of Predibase, a company redefining machine learning tooling. He previously worked as a research scientist exploring machine learning and natural language processing at Stanford University, Uber AI, Geometric Intelligence, IBM Watson and Yahoo!. He is the author of Ludwig, a Linux-Foundation-backed open source declarative deep learning framework.
