
Streamlining NLP Model Creation and Inference

apply(conf) - May '22 - 10 minutes

At Primer we deliver applications with cutting-edge NLP models to surface actionable information from vast stores of unstructured text. The size of these models and our applications’ latency requirements create an operational challenge of deploying a model as a service. Furthermore, creation/customization of these models for our customers is difficult as model training requires the procurement, setup, and use of specialized hardware and software. Primer’s ML Platform team solved both of these problems, model training and serving, by creating Kubernetes operators. In this talk we will discuss why we chose the Kubernetes operator pattern to solve these problems and how the operators are designed.

Cary Goltermann:

Hey everybody, I’m Cary, and this is Philip. We’re machine learning engineers at Primer AI. Today we’re going to be talking to you about how we streamline our process for model creation and inference to get NLP products out the door.

Philip North:

We’ll just start off with a little bit of context on some of the problems that Primer tries to solve, talk about some of the problems we face trying to deliver those solutions, and then talk about why declarative NLP infrastructure is really well suited to those problems. We’ll go over some of the benefits of that approach and recap with some of the takeaways.

Cary Goltermann:

What do we do at Primer? You might have heard of us, maybe you haven’t, so we’ll tell you a little bit about the products that we create. One of the first products we created is called Analyze. It’s a news summarization, aggregation, and exploration tool that helps personas like analysts collect information about really anything that could be happening in the news worldwide, so that they can make better decisions.

Philip North:

Another one of our products is about providing visibility into real-time streams of social media and classified data. A user in the intelligence space needs to monitor events that are developing quickly over time, and this product hits that use case for them.

Cary Goltermann:

And then one of the things that we’ve been getting into lately: we have customers that use the tools we just mentioned and want to add custom types of entity extraction to the information we pull out of news. Or there are customers that just have a bespoke use case in their workflow and want to create some sort of NLP solution for it.

Philip North:

We’re an NLP company, and along with that comes a whole suite of problems. Our products are mostly ingesting streaming data, and to handle those workloads we need a solution that hits all of the requirements: it has to be fast and highly reliable, it needs to autoscale, and it needs to be cost effective, because running on GPUs is inherently expensive. It also needs to be flexible enough to support any model framework that a data scientist would want to use. There’s really no simple way to satisfy all of those requirements; inherently, it’s going to require a complex solution.

Cary Goltermann:

The solution that we’ve come up with in the past for a lot of these problems is: we have a streaming problem, so we’re going to have Kafka queue the incoming messages; we’re going to use Redis as a cache; and we need to be able to autoscale if we get bursts of messages all coming in at once. Like Philip just said, this complexity can become overwhelming, especially when you’re a data scientist and you’re just thinking, “I created this model for this problem. You told me I needed to be able to identify science entities in data. How do I get this into our tooling?” The answer may not be obvious when you’re relying on a stack that’s fairly deep.

Philip North:

For streaming workloads we have that set of problems, and there’s another set for training custom models. We’re training large NLP models on big data sets, and we need to do that across distributed hardware. We want to hide the provisioning of those resources and all of the work it takes to train at scale, where you’re doing parallel cross-validation and large-scale evaluation, and you need all of that to integrate with your ultimate serving solution. There’s a lot of complexity that goes along with that.

Cary Goltermann:

The way we approached this is that we wanted to create declarative APIs for NLP infrastructure. We wanted to make it really easy for our data scientists to get and manage the kind of hardware they need to actually run their models, without having to worry about the types of problems I mentioned a couple of slides ago, like: How is this message coming off of a Kafka queue? What if I need to be able to access it asynchronously? How do I process these things in batches? We want to be able to hide a lot of that complexity.

Philip North:

So instead, if you’re a data scientist or model author and you’ve taken the time to train and develop some model, and now it’s time to embed it into your pipeline, or your application, or however you want to use it, at the end of the day you really just want to be able to call a production endpoint: an endpoint that you know is going to scale, that’s going to handle whatever you send to it, that’s going to be fast, and that’s going to be cheap. By leveraging the paradigm of declarative APIs, you can declare what you want and use it, instead of interacting with an API that’s more imperative, where you have to tell it how these things should be provisioned for you: how it should provision Kafka, Redis, and all the underlying pieces that are needed to give you what you ultimately want.

Cary Goltermann:

The way we did this was with Kubernetes Custom Resources. Kubernetes is declarative by design, and it offers APIs to extend its own APIs. That’s said in kind of a meta way, but in the same way that Kubernetes allows for the composition of different resources in, like we said, a declarative way, we extended it to allow our NLP infrastructure to be served on Kubernetes.
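
To make the “extending Kubernetes” idea concrete, here is a minimal sketch of what registering a custom resource type can look like, using the official Kubernetes Python client. The group, kind, and field names (nlp.example.com, ModelServer) are invented for illustration and are not Primer’s actual API.

```python
# A minimal sketch (not Primer's actual API): registering a hypothetical
# ModelServer custom resource type with the Kubernetes API server, using the
# official Python client. Once the CRD exists, an operator can watch for
# ModelServer objects and provision the underlying infrastructure.
from kubernetes import client, config

config.load_kube_config()

model_server_crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "modelservers.nlp.example.com"},
    "spec": {
        "group": "nlp.example.com",
        "scope": "Namespaced",
        "names": {"plural": "modelservers", "singular": "modelserver", "kind": "ModelServer"},
        "versions": [{
            "name": "v1alpha1",
            "served": True,
            "storage": True,
            # Keep the schema open-ended for the sketch; a real CRD would
            # spell out the spec fields it accepts.
            "schema": {"openAPIV3Schema": {
                "type": "object",
                "properties": {"spec": {"type": "object",
                                        "x-kubernetes-preserve-unknown-fields": True}},
            }},
        }],
    },
}

client.ApiextensionsV1Api().create_custom_resource_definition(model_server_crd)
```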

Philip North:

We can take a look at what the declarative API for model serving looks like. Again, instead of having to enumerate all of the things that you want and how your endpoint should be provisioned, we expose a higher-level API that lets the author simply state what they want, really just letting them focus on what’s needed to initialize the model. This CRD is basically specifying just the basics: do you want a GPU or not, maybe some other details about the hardware, and then ultimately how to initialize some model that is abstracted in a Python class. So the model author really just needs to come with an implementation of their model as a class. The system stays separate from that; it doesn’t need to be tightly coupled there, so we can get out of the data scientist’s way and be flexible for them.
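
As an illustration of that split between the author’s model class and the serving spec, here is a hedged sketch. The class interface (predict) and the ModelServer fields (gpu, replicas, modelClass, modelArgs) are assumptions standing in for the CRD shown on the slide, not Primer’s published schema.

```python
# Illustrative sketch only: the class interface and the ModelServer fields
# below are assumptions, not Primer's published schema.
from kubernetes import client, config


class ScienceEntityModel:
    """Author-supplied model wrapper; the serving system only needs to know
    how to construct it and call predict()."""

    def __init__(self, weights_uri: str):
        # e.g. an S3 path to trained weights; loading is stubbed out here
        self.weights_uri = weights_uri

    def predict(self, texts: list[str]) -> list[dict]:
        # A real implementation would run the NLP model over the batch.
        return [{"text": t, "entities": []} for t in texts]


# Declaring the endpoint: state what you want, and the operator provisions
# the deployment, autoscaling, queues, and so on behind the scenes.
serving_spec = {
    "apiVersion": "nlp.example.com/v1alpha1",
    "kind": "ModelServer",
    "metadata": {"name": "science-entities"},
    "spec": {
        "gpu": True,  # hardware hint; the operator picks an appropriate node pool
        "replicas": 2,
        "modelClass": "my_pkg.models.ScienceEntityModel",
        "modelArgs": {"weights_uri": "s3://models/science-entities/v3"},
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="nlp.example.com", version="v1alpha1",
    namespace="default", plural="modelservers", body=serving_spec,
)
```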

Cary Goltermann:

Similarly, on the training side, we have users training models who might not want to get into the nitty-gritty details of which hyperparameters actually went into a model. They might want to just look at the performance metrics that come out of a particular training request, but they don’t want to deal with getting GPUs or think too hard about what getting a model from training into production is actually going to look like. So we created a declarative AutoML solution. You can see, similarly, a custom resource that is implemented with Kubernetes. The user just needs to specify the data, making sure it conforms to the schema we expect for the given model type, and then they can get on with training.
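
A training request in this style might look roughly like the following sketch; the TrainingJob kind and its fields (task, dataset, evaluation, output) are hypothetical stand-ins for the AutoML custom resource described here.

```python
# Hypothetical AutoML training resource, again only a sketch: the user points
# at labeled data in the expected schema and the operator handles GPU
# provisioning, hyperparameter search, and evaluation.
from kubernetes import client, config

config.load_kube_config()

training_spec = {
    "apiVersion": "nlp.example.com/v1alpha1",
    "kind": "TrainingJob",
    "metadata": {"name": "science-entities-train"},
    "spec": {
        "task": "named-entity-recognition",
        "dataset": "s3://datasets/science-entities/train.jsonl",  # must match the expected schema
        "evaluation": {"holdout": 0.2},
        "output": "s3://models/science-entities/",  # where the trained artifact lands
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="nlp.example.com", version="v1alpha1",
    namespace="default", plural="trainingjobs", body=training_spec,
)
```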

Cary Goltermann:

And what comes out is a class and a set of schemas that allow the user to pass that model to our serving system via the same custom resource Philip talked about a moment ago. So we can easily allow a user who trained a model on our training system to move right along to deploying it, without ever having to worry about: Where did my training hardware come from? Where am I storing these models? Is everything going to integrate and be available in production? They just need to make sure they have an S3 location; they don’t need to think about those other things.
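
The handoff described here could look something like the sketch below: read the artifact location off the hypothetical training job’s status and feed it straight into a serving resource. The status field and resource shapes are assumptions carried over from the earlier sketches.

```python
# Sketch of the training-to-serving handoff; the status field read here and
# the resource shapes are assumptions, not Primer's actual schema.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Read the trained artifact location reported by the (hypothetical) training job.
job = api.get_namespaced_custom_object(
    group="nlp.example.com", version="v1alpha1",
    namespace="default", plural="trainingjobs", name="science-entities-train",
)
artifact_uri = job["status"]["modelArtifact"]  # e.g. "s3://models/science-entities/v4"

# Hand that artifact straight to a ModelServer resource; the operator takes
# care of provisioning, so nothing else changes for the user.
api.create_namespaced_custom_object(
    group="nlp.example.com", version="v1alpha1",
    namespace="default", plural="modelservers",
    body={
        "apiVersion": "nlp.example.com/v1alpha1",
        "kind": "ModelServer",
        "metadata": {"name": "science-entities"},
        "spec": {
            "gpu": True,
            "modelClass": "my_pkg.models.ScienceEntityModel",
            "modelArgs": {"weights_uri": artifact_uri},
        },
    },
)
```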

Philip North:

Yeah, so another requirement for these serving and training solutions is that they have to be able to run in pretty much any environment. A lot of Primer’s use cases involve confidential data, so it needs to be able to run in any cloud provider, in government clouds, even all the way to air-gapped environments. One of the side effects of going with this declarative approach is that it gives you a static view of all of the assets that you’re deploying to an environment, which allows you to do static security analysis and apply any extra config or changes to those manifests that you need in order to deploy them into a given environment. It also lets data scientists trust that if they deploy their model on this solution, it will just work in a generic on-prem environment, so they can get back to focusing on the quality of their model and how it’s implemented.

Cary Goltermann:

So the important things that we’ve learned from this journey are that declarative APIs allow our users to care about what their thing is going to do, not how it’s actually going to be implemented. This has allowed us to get a lot of models that were previously fairly slow to reach production both trained and served really quickly. And it allows us to serve in many environments; as Philip just mentioned, we have a very wide variety of environments we need to be able to deploy to. So that was it from us. Feel free to reach out to Primer; our LinkedIn and our blog are linked here. Feel free to reach out to Philip or me directly; we’re happy to chat over email.

Philip North

Machine Learning Engineer

Primer.ai

Philip is an engineer at Primer working on the ML Platform team. Prior to Primer, he worked as both an engineer and a data scientist at various small start-ups.

Cary Goltermann

Machine Learning Engineer

Primer.ai

Cary is a software engineer at Primer where he works on the ML Platform team. Prior to joining Primer he worked for KPMG as a data scientist creating machine learning models and applications for tax professionals.
