Machine Learning Engineer
apply(conf) - May '22 - 10 minutes
At Snap, we train a large number of deep learning models every day to continuously improve the ad recommendation quality to Snapchatters and provide more value to the advertisers. These ad ranking models have hundreds of millions of parameters and are trained on billions of examples. Training an ad ranking model is a computation-intensive and memory-lookup-heavy task. It requires a state-of-the-art distributed system and performant hardware to complete the training reliably and in a timely manner. This session will describe how we leveraged Google’s Tensor Processing Units (TPU) for fast and efficient training.
Yeah, so my name is Aymeric Damien. I’m an engineer at Snap, where I work on inference and training pipeline optimization. Today I wanted to give a talk about our work on training large-scale recommendation models with TPUs.
First, let me give you an overview of ad ranking at Snap. What are we actually trying to do? Basically, we’re trying to predict the likelihood that a user engages with an ad, and to do that, we train different models. It’s pretty important, because the closer the prediction is, the better it is: first for the user, so that we have relevant ads to show, then for the advertiser, who gets more value, and then for Snap and our revenue.
This prediction is actually used by the ad auction system, which looks at many parameters together, such as the advertiser budget, the budget pacing, the max bid, et cetera. When we need to show an ad, this ad auction system basically chooses which ads to show. Then, once we show an ad, we check whether the user enjoyed it or not, and that’s how we generate labels. Our ranking system is pretty straightforward. We have a light ranker and a heavy ranker. First there is a retrieval phase, where we use cheap, less accurate models to cut down the total number of ads we have to rank. Then we have much heavier models that are much more accurate, where we really try to fine-tune the prediction and get the best result.
So our models have thousands of numerical features and hundreds of categorical features. Our architectures are also pretty straightforward in terms of recommendation systems, so the pretty popular DCN, DCN v2, and PLE. Our infrastructure is mostly based on Google Cloud, and we also have some parts on AWS. And a quick word on ML hardware: we historically trained on CPUs with asynchronous training, but now we’re migrating to TPUs. On the inference side, we have a mix of CPU and GPU.
Also, I wanted to highlight some things I believe are really challenging with recommendation systems. First, our dataset actually has a lot of categorical features, so we need pretty large embedding tables. If you’re not familiar with embeddings, they represent discrete ID features as dense vectors in a low-dimensional space. And the embedding lookups are actually pretty memory intensive.
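As a rough illustration of the embedding lookup described above, here is a minimal pure-Python sketch. It is not Snap's code: the table size, the vector length, and the mean-pooling step for multi-valued features are all assumptions chosen for illustration.

```python
import random

# Hypothetical sizes, chosen only for illustration.
VOCAB_SIZE = 1000    # number of distinct IDs for one categorical feature
EMBEDDING_DIM = 8    # length of each dense vector

random.seed(0)
# The embedding table: one row of EMBEDDING_DIM floats per ID.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def lookup(ids):
    """Return the dense vector for each ID (one table read per ID)."""
    return [embedding_table[i] for i in ids]


def lookup_and_pool(ids):
    """Average the vectors of a multi-valued feature into a single vector."""
    if not ids:
        # No IDs means no lookups; fall back to a zero vector.
        return [0.0] * EMBEDDING_DIM
    rows = lookup(ids)
    return [sum(column) / len(rows) for column in zip(*rows)]


pooled = lookup_and_pool([3, 17, 256])
print(len(pooled))  # 8
```

Each ID costs one read from a table that can be very large, while the arithmetic per read is tiny, which is why this step tends to be bound by memory rather than by compute.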
And I would argue that recommenders have actually been a little under-investigated by academia. Maybe one reason is that there is a lack of public datasets, unlike in computer vision or NLP. There is one dataset, for example, that is pretty famous, but its number of features is around 39, so it’s very different from the amount of data we deal with. I think that makes the entire research pretty hard to reproduce. As such, there is not too much focus on software integration that really targets recommendation systems. And as a further consequence, specialized ML hardware has really always focused on large matrix multiplications, but recommenders have additional requirements, such as fast embedding lookups, large memory, high memory bandwidth, and so on. So now I want to talk a bit more about our TPU integration and the benchmarks we ran.
So first, for those who don’t know, what is a TPU? The TPU is Google’s custom-developed ASIC for deep learning. It actually came pretty early on; it dates back to 2015, so seven years ago, and the v5 is supposed to come, I think, this year. And why use TPUs for recommendation systems? First, they have large memory: a TPU v3-8 has 128 gigabytes of RAM. That’s quite a lot, especially compared to one GPU, and they also have high bandwidth. Second, there is a fast chip interconnect. Third, they provide a really good TPU embedding API that has a native and optimized solution for embedding-based operations. So you get really fast and efficient embedding lookups, and that’s really what we need for recommendation systems. It also supports sharding, so those very large embedding tables can be sharded across many different TPU cores. Even if you exceed these 128 gigabytes, you can easily shard your embeddings across many different devices.
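To picture what sharding an embedding table means, here is a small pure-Python sketch. It assumes a simple "ID modulo number of shards" placement, which is only one possible scheme and is not claimed to be how the TPU embedding API places rows; `NUM_SHARDS` and all function names are hypothetical.

```python
NUM_SHARDS = 8  # hypothetical: one shard per core


def shard_for(id_):
    """Which shard owns this ID under modulo placement."""
    return id_ % NUM_SHARDS


def local_row(id_):
    """Row index of this ID inside its shard."""
    return id_ // NUM_SHARDS


def build_shards(table):
    """Split one big table (a list of rows) into NUM_SHARDS smaller tables."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for id_, row in enumerate(table):
        shards[shard_for(id_)].append(row)
    return shards


def sharded_lookup(shards, ids):
    """Fetch each row from whichever shard owns it."""
    return [shards[shard_for(i)][local_row(i)] for i in ids]


# Tiny table whose row i is just [float(i)], so lookups are easy to check.
table = [[float(i)] for i in range(20)]
shards = build_shards(table)
print(sharded_lookup(shards, [0, 7, 8, 19]))  # [[0.0], [7.0], [8.0], [19.0]]
```

Each shard holds roughly one eighth of the rows, so a table larger than the memory of a single device can still be served as long as lookups are routed to the device that owns each row.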
So if you deal with very large embeddings, that’s very useful. Another point that I think is pretty good with TPUs is that they have bf16 matmul acceleration, which is done under the hood. So unlike GPUs, where you need to specify your weights as bf16, all these bf16 matmuls are done under the hood. You can just define everything as fp32 and it works pretty well. And last, it’s easy to integrate with TensorFlow, and it also has good general availability. So that’s good if you want to train at scale.
So, our TPU integration. We needed to refactor some of our TF codebase and migrate to this TPU embedding API. We also faced some challenges around the IO pipeline. We used to train on CPU machines that had only eight cores, so the throughput per machine was actually pretty low.
And then we switched to TPUs, with a much higher throughput. So we really needed to optimize the IO pipeline so that we were not input bound. We also migrated from asynchronous to synchronous training, which is a default strategy with TensorFlow. Then there were some challenges. Running on TPUs is definitely not as straightforward as using GPUs. I would say that optimizing for GPUs is another topic, but at least just running on GPUs is much easier compared to running on TPUs. Some ops are not supported, and we also needed to tune our training because we migrated from asynchronous to synchronous, so there were some hyperparameters that needed to change. And last, there were also some scaling issues, which I will come back to.
So now I have the chance to share some results from our benchmark. First, we actually saw a really great performance boost by migrating to TPUs from our original CPU training.
So up to 250% throughput for a standard model, and the heavier the model gets, the better the performance boost. For a heavy model, we could see up to 600% higher throughput, and actually a huge cost reduction of 87%, taking standard Google Cloud pricing. Then we also compared with GPUs, and we also found really great results with TPUs. I just added a note there: we used an optimized image for the A100 GPU, but we didn’t have the best optimization. Recently, Nvidia released the HugeCTR TF plugin, which does something similar to what TensorFlow did with the TPU embedding API, that is, speeding up embedding lookups. We have yet to benchmark this, but so far, without this optimization, the TPU really beat the A100. And yeah, one last word on TPU scaling.
So we actually saw some scaling issues, and they are not necessarily linked only to TPUs. There is the famous MLPerf competition, where people try different ML hardware and compare the performance on different tasks. For the recommender system task, DLRM, both TPUs and GPUs were actually more heavily bottlenecked the higher the number of cores, and this may be due to synchronous training. Basically, a lot of systems were designed with computer vision in mind. In computer vision, every batch is exactly the same size, but in recommendation that may not be true, because we have a lot of variable-length features. For example, one user might not have seen any ads, so there are no embedding lookups to perform. And what if one user saw hundreds or thousands of ads? Then we have thousands of embedding lookups to perform for that user.
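The effect of uneven lookup counts under synchronous training can be sketched with a toy cost model. This is not a measurement and not Snap's training code: the linear cost model, the constants, and the function name are assumptions made only to show that a synchronous step lasts as long as its slowest worker.

```python
def sync_step_time(lookups_per_worker, cost_per_lookup=1.0, fixed_cost=10.0):
    """Toy model of one synchronous training step.

    Each worker spends a fixed cost plus a cost proportional to the number
    of embedding lookups in its batch. Because the step is synchronous,
    all workers wait for the slowest one.
    """
    worker_times = [fixed_cost + cost_per_lookup * n for n in lookups_per_worker]
    return max(worker_times), worker_times


# Eight workers with identical batches, as in a fixed-size vision workload.
balanced_step, _ = sync_step_time([100] * 8)

# Same setup, but one worker received a user with many more lookups.
skewed_step, skewed_times = sync_step_time([100] * 7 + [1000])

print(balanced_step)  # 110.0
print(skewed_step)    # 1010.0

# Fraction of the step during which the workers are busy rather than waiting.
utilization = sum(skewed_times) / (len(skewed_times) * skewed_step)
print(utilization < 0.25)  # True
```

In this toy setup, a single heavy batch makes the whole step roughly nine times longer while seven of the eight workers sit idle for most of it.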
So then suddenly you have many more computations to do, and this one worker might slow down the entire process.
And just a last word on future work and the conclusion. Even though TPUs work best for us right now, we actually really try to optimize for each type of hardware. I think it’s a really competitive and fast-moving environment. So the idea is that we really want to future-proof our training system and then just switch to whatever works best in the future. And there is a ton of exciting new ML hardware coming. I mentioned TPU v5, but there is also Intel Sapphire Rapids. I think it looks pretty good, because they are going to leverage the integrated graphics cores of the CPU to speed up matmuls. And they’re also going to have HBM, so high bandwidth memory.
So that should actually be pretty good for recommendation systems. Next, there is the new Nvidia H100, which has even better performance than the A100, so we’re really looking forward to benchmarking it. And last, there is also Habana, which is an Intel company. Similar to the TPU, they’re really focusing on deep learning, and they just released, I think two weeks ago, their Gaudi2, which shows pretty good improvement over the A100. But what’s even more important is that they are really targeting performance per dollar, and in terms of performance per dollar, they should be pretty competitive.
So, a ton of interesting things. If you want to learn more, this was a pretty brief presentation, and there are so many details I left out. I added a link there, so you can get more insight into how we train models with TPUs, the machine learning we do at Snap for ad ranking, and also how we apply GPUs at Snap for inference. Cool. Thanks, everyone. Glad to be here.