Why is Machine Learning Hard?

apply(conf) - May '22 - 15 minutes

Each of us has a different answer for “why is machine learning so hard.” And how long you have been working on ML will drastically influence your answer.

I’ll share what I learned over the past 20 years, implementing everything from scratch for 1 model in web search ranking, 100s of models for Sibyl and 1000s of models for TFX. You’ll see why I’m convinced that data and software engineering are critical for successful data science – more so than models. Regardless of your experience, I’ll share some tips that will help you overcome the hard parts of machine learning.

Hi everyone. I’m Tal Shaked, I work at Snowflake and I’m the architect, [inaudible 00:00:16] machine learning Snowflake. Before joining Snowflake, I was actually working at Google for close to 17 years, and I did a bunch of machine learning, ranging from research and applications of machine learning in various products to building machine learning platforms that were reused pretty much throughout all of Google. And today I kind of want to share some of my experiences in machine learning over that time and give my perspective on why I feel machine learning is really hard to do well. And so I’m going to kind of break this up into three chapters. The first is going to talk about some of the experiences I had doing machine learning for ranking in web search at Google. After that, I’m going to talk about building large scale logistic regression for machine learning in ads, and then later expanding that to the rest of Google. And then I’ll talk about what happened when deep learning became really popular and how a number of platforms adapted to that.

One thing I want to kind of look at is also how the interest in machine learning has evolved over time, at least according to Google Trends. So as you can see here, from 2004 to 2016 the interest in machine learning was relatively flat. Then deep learning became popular, with breakthroughs in image processing, text processing and so forth, and machine learning became much more popular during that time. However, one thing I actually found really interesting is that at many of the tech-first companies, people were doing machine learning from 2004 onward. In fact, some of the biggest wins at companies like Amazon, Google and Netflix were based on large scale machine learning for recommendation systems, ranking and really many other optimization problems.

Tal, just a heads up, we can’t see your screen.

Shoot. I’m sorry.

It’s all good, you’re very visual in your words, but we want to see the actual slide.

No, I got confused. Okay.

There it is.

Does that work?

Yep. Now we see it.

So this is the slide I had before, this is the visualization of machine learning interest over time. All right. I will jump back to here. All right. Okay, so machine learning for ranking in web search, 2004-2007. So at that time I was super excited, I had just graduated from grad school with a master’s in machine learning, joined Google, and got to work with some of the world’s best researchers in machine learning, like Yoram Singer. And my first task was, can you do something better in ranking using machine learning? Because at the time they had basically, I think, around a 5,000 or 6,000 line if-else function that kind of did ranking. And at the time, we had roughly 1,000 queries that had labels, and each of those queries had maybe 40 results out of 4 billion that were labeled as relevant or not relevant, it’s kind of an interesting skew there. And because it was a ranking problem, we looked at the relative ranking between different results, so it was around 1 million examples total.
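To make the "relative ranking between different results" arithmetic concrete, here is a minimal sketch of turning per-query relevance labels into pairwise preference examples. It is only an illustration: the function names, the data layout and the numeric relevance scores are assumptions, not the system described in the talk. With roughly 1,000 queries and about 40 labeled results each, the upper bound on pairs is 1,000 × (40 × 39 / 2) = 780,000, which is the same order of magnitude as the "around 1 million examples" mentioned above.

```python
from itertools import combinations


def pairwise_examples(labels_by_query):
    """Turn per-query relevance labels into (query, preferred, other) pairs.

    labels_by_query is a hypothetical layout: query -> list of
    (doc_id, relevance) tuples, where a higher relevance is better.
    """
    examples = []
    for query, results in labels_by_query.items():
        for (doc_a, rel_a), (doc_b, rel_b) in combinations(results, 2):
            if rel_a == rel_b:
                continue  # ties carry no ranking preference, so skip them
            if rel_a > rel_b:
                examples.append((query, doc_a, doc_b))
            else:
                examples.append((query, doc_b, doc_a))
    return examples


def max_pairs(num_queries, results_per_query):
    """Upper bound on pairs: every unordered pair of results per query."""
    return num_queries * results_per_query * (results_per_query - 1) // 2


toy_labels = {"credit cards": [("a", 2), ("b", 1), ("c", 1)]}
toy_pairs = pairwise_examples(toy_labels)
```

Here `toy_pairs` contains two examples, because the tie between `b` and `c` is dropped. In practice the number of usable pairs depends on how many ties the labels contain.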

And I don’t want to go into all the details of how the machine learning system worked, but I want to talk about what made this problem really interesting. So here’s an example from back in 2005, which was the query “miserable failure.” As you can see here, the top result was the Biography of George Bush. Now this, at first, doesn’t look really relevant to the query, but as it turns out, this was an example of what we called Google bombing. A number of users found a way to sort of reverse engineer the ranking function at Google and then create signals that would lead to this result being the top result for this query.

And so this kind of makes you wonder, is this a good result for the query “miserable failure” or not? I mean, as it turns out, Google later tried to address this by removing these kinds of results. On one hand, before the Google bombing happened, this doesn’t look like a particularly relevant result for the query. But since people manipulated the results for this to happen, and people were searching for this query because they expected this result to come to the top, one could debate whether or not this is actually the right behavior for the system. Now I want to talk about another example that I think is much more representative of what makes ranking really difficult for web search.

So consider the query “credit cards.” First of all, you’ll probably notice that nowadays half the page is ads, that wasn’t the case back then, but on the right side, you can see that there are actually some interesting results. The first result is a page from Bank of America for a specific credit card, and the second result is sort of an aggregation site that talks about the best credit cards out there. Which of these results do you think is better? So it turns out that there was a lot of debate here across the engineering teams, the product teams, the data science teams, even within those teams. Some people felt that the best result is a list of all the most popular credit cards, Discover, American Express, Visa and so forth. And other people said, “No, that’s actually not what our users want. If they have a generic query like credit cards, they want to understand how credit cards work. They want sites that talk about the differences between different credit cards and help those customers find which credit cards are best for them.”

It turns out this wasn’t a one-off use case. In fact, there were hundreds of cases like this where people couldn’t actually agree on what the best results were. And what this meant is that it was actually really difficult to provide guidelines so that people could rate which results were best for given queries. Furthermore, people changed their mind about which results were best, and so that meant that people liked having the control of manually shaping and evolving the ranking function for web search. And so even though we ended up building a machine learning system that did really well on the metrics we could measure, we didn’t actually really trust those metrics, and therefore people didn’t really trust machine learning.

And so the main takeaway I had from doing machine learning for ranking in web search is that data and quality problems are really hard. It was extremely hard to create good data, we couldn’t come up with the right labels, and we changed our idea about what the right labels were. I didn’t talk about it, but it was actually pretty difficult to generate training data at that time, we didn’t have anything like feature stores. When you have examples like the query “miserable failure,” where the top result is the Biography of George Bush, how do you understand what the system was doing? How can you debug it? And then how do you actually improve it? Do you have to change labels on the data? Do you have to change your signals? Do you have to provide different guidelines for data to be generated?

The other thing I want to call out here is that it was hard to build very accurate models, but in this case it didn’t really matter because we didn’t have the right objective. And so we probably would’ve been better off just building simpler models and iterating from there. So that was actually kind of a frustrating experience for me at the time, because I put a lot of effort into all the infrastructure to train the models. I was really proud of building these models that were high quality on offline metrics, and yet I didn’t feel we were able to get the full value because we didn’t have the right objective, and people weren’t comfortable handing over the ranking function to machine learning at that time at Google.

And so I kind of looked around to see, well, where else could machine learning be useful? So this is around 2007, and ended up focusing on machine learning for ads. And the reasoning was that this should be much easier, right? Everything is measurable, we can measure whether users click on ads, we can measure the amount of value advertisers get and we can also just measure the revenue to Google. There’s not much debate to say, “Is this revenue higher or lower than the other experience?”

And so that led to developing a system called Sibyl, and this was kind of focused on large scale logistic regression. Can we predict the click through rate for ads? We had millions of ads to select from, and so this system was designed to train on 1 trillion examples with roughly 100 features, for example, and support about 5 billion parameters. Again, this was large scale at the time, nowadays, this is I think more common and there was a lot of innovation in the algorithms and infrastructure, but I’ll save that maybe for another talk. Because I really want to kind of talk about what made this problem challenging and sort of surprisingly hard. So initially when we focused on ads, we got the data, we built a bunch of infrastructure, we built new algorithms and we did everything offline in a nice controlled environment and it was pretty easy to get roughly a 10% improvement, say, over three months.
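As a rough illustration of what "logistic regression to predict click-through rate" means mechanically, the sketch below trains a tiny model with plain stochastic gradient descent over hashed sparse features. Everything here is an assumption made for illustration: the talk does not describe Sibyl's algorithms or infrastructure, and the names `NUM_BUCKETS`, `bucket`, `predict` and `sgd_step`, the feature strings and the learning rate are all hypothetical.

```python
import math
import zlib

NUM_BUCKETS = 2 ** 20  # hypothetical parameter-space size, far below billions


def bucket(feature):
    """Map a sparse feature string to a parameter index via hashing."""
    return zlib.crc32(feature.encode("utf-8")) % NUM_BUCKETS


def predict(weights, features):
    """Predicted click probability: sigmoid of the summed feature weights."""
    z = sum(weights.get(bucket(f), 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights, features, clicked, lr=0.1):
    """One gradient step on the log loss for a single (features, label) example."""
    error = predict(weights, features) - clicked  # gradient w.r.t. the logit
    for f in features:
        b = bucket(f)
        weights[b] = weights.get(b, 0.0) - lr * error


# Toy data: the same ad is clicked for one query and ignored for another.
weights = {}
for _ in range(200):
    sgd_step(weights, ["ad:shoes", "query:running shoes"], clicked=1)
    sgd_step(weights, ["ad:shoes", "query:tax forms"], clicked=0)

p_relevant = predict(weights, ["ad:shoes", "query:running shoes"])
p_irrelevant = predict(weights, ["ad:shoes", "query:tax forms"])
```

After training, `p_relevant` ends up above 0.5 and `p_irrelevant` below it. The weights live in a dictionary keyed by hash bucket, so only features that actually occur take up memory, which is one common way to handle very sparse inputs.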

Then we were pretty excited, we had this really high performing model in offline metrics, and we hacked things up and we managed to get this model into production to run a live experiment. Surprising at the time, though in hindsight it wasn’t, this experiment was fairly negative, and we had all kinds of problems. And so we had to figure out, well, how does this model interact with the 10 or so other models in ads? What happens when the predictions go into the auction? And ultimately, why is it that we were getting fewer clicks and less revenue, and advertisers just weren’t happy with the results?

And so it took a lot of iteration, debugging, tuning and so forth. And basically, we also had to re-implement the learning algorithm and change the infrastructure because we discovered there were actually problems with the algorithms we had, but eventually we got a one and a half percent improvement and we thought this was great. We had a one and a half percent improvement on roughly a 20 billion revenue stream at the time, so I said, “That’s great, now let’s just talk to the rest of the ads folks and put this thing into production.” And then the ads team said, “Well, hold on a second, how are we going to have 50 to 100 engineers training hundreds of models, running multiple experiments, monitoring these models, checking for outages, rolling things back when they’re not working?” And we realized that there was a whole bunch of production work we had to do, and that kind of led to this diagram that probably all of you have seen before.

So we had kind of focused on that little black box down there that was the ML code, and we ended up having to build pretty much everything else around it to make the system usable, and that took about a year. And so after we did that, we realized that we actually had really powerful learning systems [inaudible 00:09:55], it turns out much more than was needed, but we had a nice kind of end to end platform. And although ads was doing well with machine learning, the rest of Google really hadn’t discovered machine learning or felt comfortable using it. And so we went around the rest of Google saying, “Hey, we’ve got this really great system.” We talked to YouTube, we talked to Android and other places, and what we discovered is that a lot of these teams, once they could easily develop models and put them into production really quickly, were starting to see really large wins, wins on the order of 10 to 100%.

And that led to Sibyl being one of the most widely deployed systems at Google, with maybe hundreds of models being trained at any given moment around 2015 or so. And so the main takeaway, or thing we learned from this, is that productionizing machine learning is hard, you can’t just build a model and then bolt it onto the software and products as an afterthought. Like we saw in web search, it turns out that pretty much every team we worked with struggled with getting good data and had quality problems. We discovered that Sibyl was a really large scale, fast training system focused on logistic regression, but it was kind of overkill for most teams. They didn’t care if they got 90% or 100% of what was possible, they were seeing 10 to 100% improvements, because they just wanted to get something out as quickly as possible.

So with that, I kind of want to switch and talk about what happened when deep learning became popular. From my perspective, I felt like I just saw history repeating itself. So one of the reasons deep learning became really popular is the amazing results it had in image processing and text processing, and as people started to develop more experience with deep learning, we realized that it actually works well on just about every problem. And with enough effort, it usually can work as well as or better than just about any other learning algorithm. So that was great, lots of teams got super excited, everyone started using deep learning, and then they got a model pretty quickly and then they ended up spending roughly one to two years trying to productionize it. And in fact, what they kind of did is they ended up building systems that looked like this.

By the way, I apologize, I know this is a somewhat retro looking kind of diagram, that’s because we copied it from maybe eight years ago, just to show what this was like back when we were developing this. So what we realized is that everyone was rebuilding the same kind of production pipelines, but now they were doing this for deep learning. So we said, “Well, can we take what we did with Sibyl, can we just pull out the Sibyl learner and replace it with deep learning?” Or TensorFlow in this case, and that’s how we built TensorFlow Extended. And so we did this, but it turns out that we kind of needed to build another transformation system, another serving system, another way to represent the data, and the way you analyze deep learning models could be different than the way you analyze Sibyl models. So it turns out we kind of had to rewrite almost everything from scratch, and that actually took a few years.

After we did that though, we had a system that addressed many of the use cases people had, and we ended up having thousands of different models being trained at any given moment using TensorFlow Extended. Where it had often taken a year or two for people to productionize, they were now able to use the systems we had for serving, continuous training and so forth, and that really accelerated the adoption of deep learning across all of Google. And one of the, I think, key things we did differently in TensorFlow Extended versus Sibyl is that Sibyl was actually pretty rigid. It was really great for large scale logistic regression, and if it solved your problem, that was great, but if you needed to modify it, to customize it a bit, it wasn’t that easy.

And so we wanted TensorFlow Extended to be much more modular, and so we kind of thought of it in terms of multiple layers. At the bottom layer, there were very common Google building blocks, Borg, Flume, and so forth. And what we did is we constantly improved those systems so that they were designed to run the ML workloads that we had, continuous training, complex data pipelines using Flume and so forth. Then in the middle, we had all the TensorFlow Extended components. Again, you’re probably all familiar with these, now it’s pretty common in the MLOps community. And on top of that, we built even easier-to-use systems, such as hosted services for serving, and certain teams, like in the ads and YouTube spaces, might have their own custom platforms built on top of TensorFlow Extended because they themselves had to train 10 to 100 models customized for their use cases.

And so we found that having these multiple layers, from basic building blocks all the way up to really easy to use services, was super useful, and people could kind of pick and choose which pieces they wanted. And since this was all done inside Google, every engineer knew how to pick and use these pieces. So what was the main takeaway from that experience? What I internalized is that ML engineering, as a discipline, is a superset of software engineering. And what I mean by that is when you build production systems, you really need to bring in all the best practices of software engineering. I think that’s really what MLOps is about, but you also have to deal with the peculiarities of data and quality problems that I highlighted in the web search use case. And I think this is really different than software engineering, and that’s what really makes ML engineering kind of unique and really hard to do well.

The other takeaway is that after doing Sibyl and TFX and seeing all these other ML platforms out there, I don’t think a single ML platform works well for all use cases. And that’s why I think you need a very modular system with multiple layers. So with that, I just want to summarize quickly what I learned. So the first is that data’s the most important part of ML, and that’s why I’m at Snowflake. Second, productionizing ML requires well integrated components, and that’s why I’m super excited by the partnership we have between Snowflake and Tecton and, more generally, many partnerships between these ML companies. Because the better integrated these components are, the easier it will be for our joint customers to do ML. And finally, ML engineering, as a discipline, is a superset of software engineering. I’m going to conclude by saying that it’s much better to quickly get good models into production than to get great models that never leave your laptop. So with that, I’ll just make one call out to the Snowflake Summit that is happening in the middle of next month. Thank you.

Tal Shaked

ML Architect

Snowflake

Tal Shaked is Snowflake’s Machine Learning (ML) Architect. Prior to Snowflake, Tal spent 16 years at Google, culminating in the role of Distinguished Engineer / Senior Director. He was responsible for a broad set of ML projects such as TensorFlow Extended and Sibyl — two of the most widely deployed ML platforms at Google, and specific applications of ML for Google Ads and Google Search. Tal completed his BS in Computer Science at the University of Arizona and his MS in Computer Science at the University of Washington. Tal is also a chess grandmaster and won the World Junior Chess Championship in 1997.
