Machine Learning, Meet SQL: When ML Comes to the Database

apply(conf) - May '22 - 10 minutes

SQL has evolved beyond its relational origins to support non-relational abstractions like arrays, JSON, and geospatial data types so it shouldn’t surprise us that SQL is now being used to build and serve machine learning models. In this presentation, we’ll review how Google Cloud BigQuery supports regression, classification, forecasting, dimensionality reduction, and collaborative filtering. Feature processing, hyperparameter tuning, and evaluation functions are described as well. The talk concludes with a discussion of good practices for building and serving ML models in Google Cloud BigQuery.

I mentioned just a little bit about my background, because I’m going to be talking about machine learning and basically how databases are rising to the occasion of machine learning. And in particular, I want to talk about BigQuery, which is a Google Cloud platform. And I’m a big advocate for Google for working in any kind of data environment, analytics, ML. I’ve been using it, gosh, probably about eight years now. Started using it because I got frustrated with other clouds and how much networking I had to do and other things that weren’t related to the kind of work that I was actually focused on.

So, that’s when I became a Google Cloud convert. And since then, I’ve just gotten more and more into Google Cloud. And I’ve written a few books. I’ve written the official study guides for the cloud engineer, data engineer and professional architect Google certification exams. So, feel free to connect with me on LinkedIn if you have any interest in Google or want to know more about the platform.

But today, as I mentioned, I’m going to focus primarily on BigQuery. Now BigQuery is a serverless data warehouse. And one of the great things about BigQuery is that you can develop petabyte scale data warehouses and not have to worry about servers and hardware and balancing getting the right mix of CPU and storage and things like that. All of that’s managed for us.

Also, it’s highly performing. It uses SQL, but it’s not a relational database in the traditional sense. We don’t have traditional indexes or constraints or things like that. This is really an analytical database. So, it works really well when we’re doing queries across large volumes of data, but maybe we’re querying a subset of all the possible columns that we might be interested in.

Also, BigQuery has other features which make it even more interesting from a data analytics platform. There’s BigQuery ML, which I’m going to talk about today. There’s BigQuery BI Engine, which is an in-memory basically accelerator for interactive querying. There’s also support for geographic information systems. And BigQuery Omni is a relatively new offering, which basically allows you to run BigQuery outside of Google Cloud. It runs on top of Anthos Kubernetes, part of the Kubernetes ecosystem. So, if you really like BigQuery, but don’t want to be in Google, that’s an option.

So, BigQuery ML, basically it allows us to create machine learning models in SQL. So, this is great for people that maybe aren’t really super proficient in Java or Python. And what’s nice about the BigQuery ML is that it’s fairly rich in, at least now in terms of the kinds of models that it supports. So, of course, we can do the basic bread and butter kind of things like linear regression, logistic regression, and K-means clustering.

There’s also support for time series forecasting. And one of the nice things about the time series forecasting, there’s an ARIMA model, AutoRegressive Integrated Moving Average. It’s like a standard workhorse for time series forecasting. But you have to be able to take into account things like seasonality if that’s part of it. All of that is handled with the time series forecasting, the new sort of ARIMA that’s implemented in BigQuery. And so, things like hyperparameter tuning and things like that is much easier when working with BigQuery than it used to be when we were working with platforms where we would have to do all of that ourselves, do the hyperperimeter tuning ourselves.

And then there are other widely used algorithms like XGBoost. If you have your own TensorFlow model, you can import that as well. And then if you really want the machine to do as much as possible, there’s something called auto ML tables. And with auto ML tables, you basically present structured data in tabular format, identify your label columns, and let the algorithms and the hyperparameter search find the best algorithms and configuration for you for developing models with as good of a performance as you’re probably going to get, at least with automated exploration.

So, as I had mentioned, the big advantages of using BigQuery ML is that now you can be a SQL user. So, anybody who knows SQL can build ML models. And that’s a huge advance in terms of democratizing access to machine learning. So, the days where we’re working with Scikit learning, configuring all kinds of things, and you have to be fairly proficient in Python. It’s not like that anymore.

So, we don’t need to be working with another programming language. And also, we don’t need to export our data, which makes it a lot easier. Again, if you’re not really into writing a lot of ETL scripts or data export scripts and data transformations outside of SQL, this makes life a lot easier and it lowers the barrier to entry.

Now, the way machine learning is introduced is basically by augmenting the SQL data definition language, and DML or data manipulation language. And so, we can basically have a statement now that’s called create model. And when we create a model, we specify some parameters. And those parameters are things like what kind of model do we want to build. In this example, we’re looking at building a linear regression model. We also, as I mentioned, specify our label columns. And then we present a SQL query, basically a select statement, which is our training data.

Now, in this case, we’re looking at a natality data set, and we’re working with basically birth weight, trying to predict birth weight. And so, we’re simply going to query some number of rows from a public data set that’s available also in BigQuery. And then we’re going to do a little select and we’re going to select every thousand, or it’s going to select 1,000th of the total population to train this model.

So, this is basically all we need to do to build a model. And once this is done executing, we basically have access to a model that we can use with a predict statement. And a predict statement, basically it’s an additional function that’s been added to SQL. And with the predict function, we specify what model we want to work with, and then we pass in a row set or a result set of basically the same type of data that’s similar to what we used to produce the model minus the label. We’re going to select that. And then we’re going to make predictions on that.

So, that’s basically all this required at this point to build a model, build a machine learning model in BigQuery. It’s basically if you know SQL, if you can write select statements, and if you understand enough about machine learning to be able to pick a type of ML algorithm. So, if you want to use [inaudible] clusterings or XGBoost, if you’re familiar enough you understand what those are and you want to build a model, you can now build those in your database, in BigQuery anyway, using SQL.

And I’ve already seen that we’ve worked with clients who are already using this to pretty quickly leverage building tools which allow model builders to even abstract away from the SQL and use essentially custom buoys for choosing different combinations of columns in the select statement and rapidly building models and building whole suites of models. So, this really is an enabler. It really changes the speed at which people can build models.

So with that, I will wrap it up. And again, feel free to connect with me on LinkedIn. I’m Dan Sullivan PDX. And if you have any questions, I’d be happy to answer them there.

Dan Sullivan

Principal Data Architect

4 Mile Analytics

Dan Sullivan is a Principal Data Architect at 4 Mile Analytics where he specializes in cloud and data architect with extensive experience in data architecture, data science, machine learning, stream processing, and cloud architecture. He is capable of starting with vague initiatives and formulating precise objectives, strategies, and implementation plans. He regularly works with C-level and VP executives while also mentoring and coaching software engineers. He is the author of the official Google Cloud study guides for the Professional Architect, Professional Data Engineer, and Associate Cloud Engineer.

Machine Learning, Meet SQL: When ML Comes to the Database

Dan Sullivan

Let's keep in touch

Book a Demo

Contact Sales

Request a free trial