
PyTorch’s Next Generation of Data Tooling

apply(conf) - May '22 - 10 minutes

An overview and lookahead of our data efforts within PyTorch, including our new API extension points to support state-of-the-art ML data processing in both research and production. TorchData, an extensible library for constructing data loading graphs, and TorchArrow, a lightweight front-end for dispatchable data processing, will be introduced with examples.

Yeah. It’s great to be here and I’m taking the opposite approach, the slide-heavy approach. So I’m going to go really, really fast and I’ll share these afterwards if folks are interested in looking at the code samples in more detail.

So what I’m talking about today is the next generation of data tooling in PyTorch. And you can ask the question: should we even have data APIs in PyTorch? We did ask that question, and it was actually a very long discussion. What it basically came down to is that our mission is to accelerate AI with PyTorch, and as you might realize, data loading and processing has become a very large share of user time over the last five years, considerably larger than it once was. Another reason we decided to begin investing in data APIs in PyTorch was that our bread and butter is bridging AI silos, and open source maintainers were actually asking us to step in and introduce APIs that can bridge the loading and processing silos in open source.

And we are in a somewhat unique position in that we can invest a few years in trying to create a standard wide enough to span the entire field, across research and production, across computer vision, recommendations, NLP, and other subfields, and produce something which is very extensible but still performant. This is a very risky bet, and that’s essentially when we typically step in. We know that we can fail fast, but we can also spend three years attempting to build an API which is ultra, ultra generic. So let’s get into the actual nuts and bolts.

So, breaking this into data loading and data processing: what we were seeing in data loading when we embarked on this journey was that the PyTorch DataLoader, which had been the thing appeasing the community for a long time (it was built at the beginning of PyTorch), was outdated and inflexible. Largely because it was very feature-bundled: batching, shuffling, all this stuff was built into a single monolithic thing that could just allow you to get by in data loading. There was no multi-threading, and it was basically built around map-style, index-access datasets.

At the same time, there was huge innovation in data loading out in open source, lots of honorable mentions. But basically everybody who wanted to innovate in data loading had to fork the PyTorch DataLoader and introduce their feature, and those loading features were then just incompatible with one another, right? Each new innovation lived in a separate data loader. And then, similarly, you had constant duplication of the kind of table-stakes primitives that were not the innovation being introduced in that particular DataLoader. These table-stakes primitives were also rarely robustly implemented. There are probably a thousand HTTP handlers in AI scripts out there, and obviously they’re not all handling all the different errors properly, and everything like that.

So our approach is to introduce an ultra-wide-open, extensible loading standard to replace the DataLoader in PyTorch, and allow that to be the inception point for a new interoperable, modular community of data loading primitives. This is not a simple idea, and it’s still quite risky. But basically, where the DataLoader had all of these features built into it, we broke it out into a graph of loading primitives called DataPipes, plus a very thin graph handler to ingest that graph.

And the DataPipe standard is about the most flexible we could possibly make it. It basically takes an iterator over Python objects to an iterator over Python objects. That’s easy to say, but actually making it performant is a much harder problem. Shuffling, batching, collation, mapping, filtering, et cetera, are all basically just DataPipes now; they’re not things that happen inside the DataLoader. We then started a new project in PyTorch called TorchData, which is a new corpus of standard DataPipes to seed this ecosystem and introduce those table-stakes primitives, but implemented robustly. And all of this is extensible and replaceable by users.
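As a rough illustration of that iterator-to-iterator contract, here is a minimal custom DataPipe sketch. The UppercaseLines class and the sample strings are hypothetical examples for illustration, not part of the library:

```python
from torchdata.datapipes.iter import IterableWrapper, IterDataPipe

# A hypothetical custom pipe: consumes an iterator over Python objects
# (strings here) and yields transformed Python objects -- the whole
# DataPipe contract.
class UppercaseLines(IterDataPipe):
    def __init__(self, source_datapipe):
        self.source_datapipe = source_datapipe

    def __iter__(self):
        for line in self.source_datapipe:
            yield line.upper()

source = IterableWrapper(["hello", "world"])  # wrap any iterable as a DataPipe
pipe = UppercaseLines(source)
print(list(pipe))  # ['HELLO', 'WORLD']
```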

So we have 56 such DataPipes now: anything from IO, like typical HTTP requesting of a dataset (we have a Google Drive requester in there), extractors for CSV, Parquet, Xz, Zip, et cetera, and then all the kinds of normal munging you’d like to do to a data stream: shuffling, sampling, collation, mapping, et cetera. And this just continues to grow.

The exciting things are, one, that this whole system is streaming-first. Almost all of the above has been made streamable, including many of these extractors, so you can stream in JSON without having to download an entire JSON dataset at once. It’s non-BC-breaking, which frankly is shocking to me, but very exciting. It’s also extremely open and extensible, so if you decide you don’t like the shuffler in the default DataLoader graph, you can replace it with your own. We already have projects working on a TFRecord loader, because people love TFRecord. And AWS just released an S3 reader DataPipe, right? Standard kind of thing: if you have an HTTP reader, you probably want an S3 reader.

So this is all extremely early. We have a beta release coming out in late June, but I encourage you to take a look. Here’s just a quick code sample. You can see that it allows you to either build up the graph of DataPipes in an object-based, declarative style, or build it up in a functional format. And I’ll leave this again in the Slack.
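The slide itself isn’t reproduced in this transcript, but a rough sketch of the two styles with the TorchData API looks something like the following. The CSV file names are hypothetical placeholders, and the exact pipeline is just an example, not the one shown in the talk:

```python
from torchdata.datapipes.iter import (
    IterableWrapper, FileOpener, CSVParser, Shuffler, Batcher,
)

files = ["train_part1.csv", "train_part2.csv"]  # hypothetical local files

# Functional (chained) style: each call wraps the previous pipe.
datapipe = (
    IterableWrapper(files)
    .open_files()      # FileOpener
    .parse_csv()       # CSVParser
    .shuffle()         # Shuffler
    .batch(32)         # Batcher
)

# Equivalent object-based (declarative) style.
dp = IterableWrapper(files)
dp = FileOpener(dp)
dp = CSVParser(dp)
dp = Shuffler(dp)
dp = Batcher(dp, batch_size=32)

for batch in dp:
    ...  # each batch is a list of parsed CSV rows
```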

So now to talk about data processing. The problems in AI data processing were somewhat similar, but from a slightly different direction. We also have lots and lots of pre-processing functions, similar to the data loading primitives out in open source, but these are incompatible with one another in a different way: namely, they don’t have a common in-memory format. So you could be using a text normalizer from this place and a tokenizer from that place, and one of them could be working on Unicode while the other one is working on ASCII. There were just no standards whatsoever, and this created lots of problems and conversion overhead.

But there’s also no common API layer. One of them takes a tuple of dicts and the other takes a dict of tuples, and so on; it was just constant translation if you were trying to use data processing tooling from more than one place. Also, structured data exists. At the beginning of PyTorch, our problems fit the (image, label) bucket more cleanly, and now they simply don’t. They didn’t really then, and they really, really don’t now. NLP, multimodal, recommendations, these things don’t fit; tabular data just does not have a data API in PyTorch, and you get these messes of [inaudible] to do basic things.

We also recently open-sourced Meta’s recommendation systems infrastructure as a PyTorch library called TorchRec, and we just could not do that with a straight face without having structured data support in PyTorch. You cannot do recommendation systems without good structured data support. Also, people love TFRecord, but it’s sort of a walled garden because it doesn’t have an open in-memory format. And then again, we’ve seen some amazing innovation in data processing, but it’s mostly in silos. For example, NVIDIA has a library called cuDF, which allows pandas-like data processing on GPUs. Koalas is similar: pandas-like data processing on Spark. Slightly, slightly different pandas APIs, right? So you can’t just change the code over from one to the next to swap out the backend. The honorable mentions here are also awesome.

So, along a similar theme, we are working on an extensible data processing standard in PyTorch. If you think about what PyTorch is at its core, you don’t have to look at it as a quote-unquote AI framework. It’s really a standard front end and in-memory format for tensor processing. You can use the same API and the same in-memory format with all the different tensor-processing operators out in the ecosystem, and they all work together and chain together nicely. And then we have this registration and dispatch system that allows any third party to register their operators to be runnable easily with PyTorch.

And then we’ve made that batteries-included for AI usage, introducing things like autograd and lots of other functions and things like that. So we can do the same thing for data frames. TorchArrow is a very, very early alpha project in PyTorch; essentially, TorchArrow is to Pandas as PyTorch is to NumPy. We basically created a standard data frame front end which can be directed at different back-ends. The default back-end is a high-performance CPU engine called Velox, but you can do funny things with it. You can change over to graph mode and basically trace your data frame operations, which can then be optimized and lowered into a data warehouse. We are working on targeting GPU via cuDF, targeting Spark, and other runtimes.

TorchArrow is ultra-open. We chose Arrow as the in-memory format because essentially everybody asked us to; it seems to be the thing a lot of people like for open, in-memory, structured data processing. And there’s no dependency on PyTorch, which came out of a very interesting conversation. We were talking to Hugging Face about this project, and they basically said it would be really lovely if this didn’t depend directly on PyTorch, so that they could use it for processing across all of their recipes, not only the PyTorch recipes; they could use it for TensorFlow too and not have to implement these things separately in the PyTorch and TensorFlow silos. And we said that makes complete sense. The processing step is mostly orthogonal to the modeling step, other than collation, where you translate the data into tensors.

So that was actually a decision we made: to make TorchArrow independent of PyTorch, in a sense. It obviously works seamlessly with PyTorch, and it works seamlessly with TorchData. It’s also quite early; we’re targeting, I think, either an alpha or a beta in late June, but you can already look at it in open source. Here’s a code sample. It looks a lot like Pandas, so I won’t go through it, but you can send it to a different back-end.
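The slide isn’t reproduced here, but a rough sketch of the pandas-like TorchArrow front end looks something like the following. The column names and values are made up, and exact calls may shift since the project is early alpha:

```python
import torcharrow as ta

# Build a small dataframe on the default (Velox-backed) CPU device;
# dtypes are inferred from the Python values.
df = ta.dataframe({
    "user_id": [1, 2, 3],
    "score": [10.0, 20.0, 30.0],
})

# Pandas-like, columnar operations.
df["scaled"] = df["score"] * 2.0      # derive a new column
high = df[df["scaled"] > 25.0]        # filter rows with a boolean column

print(high)
```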

Thanks very much. If you have ideas or questions, if you want to integrate with TorchArrow, or if you have some project which would be nicely represented as a DataPipe, definitely feel free to email me, or reach out if you just want to chat about AI data.

Donny Greenberg

Product Management Lead, PyTorch

Meta

Donny Greenberg is the PyTorch PM lead at Meta, supporting the AI community across research, production, OSS, and enterprise. Recent projects include TorchRec, the open-sourcing of Meta's large-scale recommendations infra; TorchArrow and TorchData, PyTorch's next generation of data APIs; and unified training infrastructure at Meta. He previously worked at IBM as a Quantum Computing Researcher and Tech Lead for IBM's open-source Quantum Algorithms library, and at Google on its Ad Exchange real-time bidding platform. He holds 9 patents and graduated from Binghamton University with bachelor's degrees in Computer Science, Mathematics, and Finance.
