Feature Engineering at Scale with Dagger and Feast

apply(conf) - May '22 - 30 minutes

Dagger or Data Aggregator is an easy-to-use, configuration over code, cloud-native framework built on top of Apache Flink for stateful processing of real-time streaming data. With Dagger, you don’t need to write custom applications or manage resources to process data in real-time. Instead, you can write SQLs to do the processing and analysis on streaming data.

At Gojek, the Data Platform team uses Dagger for feature engineering on real-time features. Computed features are then ingested into Feast for model training and serving. Dagger powers more than 200 real-time features at Gojek. This talk will be about the end-to-end architecture and how Dagger and Feast work together to provide a cohesive feature engineering workflow.

Ravi Suhag:

Okay, so thank you so much for having me here. Let’s get started. So today, I’ll be talking about feature engineering at scale with Dagger and Feast. And before we dive deep into it, a little bit about me. My name is Ravi Suhag, and I’m working as VP of Engineering at Gojek, and I also work as a founder and principal architect at an organization called Open DataOps Foundation. So feature engineering, right? I think in simple terms, to me, it’s the process of transforming raw data into high-quality inputs that can be fed into models so that your underlying problem is better represented. And just as one example, as seen in the picture, you have raw data, and then you can apply feature engineering to it, and then you can turn it into a feature vector.

Ravi Suhag:

So where does feature engineering fit in the overall ML pipeline, right? So you have your raw data, and once you pull your raw data, you want to transform that and then store it into a feature store, and then the feature store is used for model serving and model training, right? So at a very high level, in a very simplistic view, this is how it looks. So what’s the need for feature engineering? The need for feature engineering is just to have better features. And why do we need better features? Because better features provide you more flexibility: having a good data structure for your features gives you more flexibility when you choose your models. And better features also result in simpler models, because they make it super easy for you to pick the right parameters and simpler models.

Ravi Suhag:

Overall, better features actually turn into better results, because your model results turn out to be of better quality. So that’s why we need feature engineering. Now, let’s talk about some of the challenges of feature engineering, right? So the first one is inconsistency between training and serving. And I’m not talking about just inconsistency between training and serving data access from your feature store. There can definitely be inconsistency beforehand also, when you are doing your feature engineering and your transformation, especially if you’re doing real time as well. So you might want to train your model and write the transformation logic against a batch source. But then when you are serving features online, you still want that same transformation to happen for your real-time features also. And then there can be a drift, right, because you have one transformation logic for the batch processing and storing your data for training, and a very different one in a different system.

Ravi Suhag:

The first one could be with BigQuery or Snowflake or even Spark. And the second one could be with KSQL or Flink, right? So there can be inconsistency between the two sets of logic. The second thing is that data scientists do not want to manage data pipelines, and they are the ones mostly responsible for feature engineering. So they don’t want to spend their time on that. They simply want to focus on more and better models. And even if they do, when there are so many feature engineering and feature transformation jobs running, it is super hard for data scientists to self-manage the entire data infrastructure, because that requires managing your Kubernetes clusters, Flink clusters, Spark clusters, and so on. The next challenge is lack of standardization, right? If you have hundreds of data scientists within your organization and each and every one is doing feature transformation in their own way, then you end up with no standardization.

Ravi Suhag:

And there are so many silos that can form within the organization from a features point of view. The next challenge is that real-time features actually require you to have skilled data engineers. It’s not simple, where anyone can come and write a small piece of logic. Working with something like Flink is not like that; you definitely need more skilled data engineers who can write more optimized jobs there. So these are some of the challenges that we face in feature engineering, and they translate directly into the feature engineering platform goals that we had. The first thing we wanted to achieve is unified processing for streaming and batch. If you are going after real-time features, the processing logic or transformation that you’re using for batch ingestion of your features is the same one you want to use when you are going online with streaming and real-time features. The second thing we wanted to achieve is a self-service platform, so that there is no [inaudible] between data scientists actually managing pipelines. It’s completely self-serve.

Ravi Suhag:

The whole infrastructure is abstracted out. The next goal is elastic infrastructure. So to tackle the problem of scaling infrastructure, you want to make sure that any time a data scientist comes and spins up a job, your infrastructure can scale up for that, and then it can scale down again when the job is not there anymore. Then there has to be a standard and reusable way. So if you have a feature transformation, just the same way you want to reuse your features, you want to reuse some of those transformations also. And there has to be a standard way for people to define that. And last but not least, we don’t want data scientists to have to learn any extra skill to actually create real-time features, and that’s where Dagger comes into the picture. So in general, Dagger is stream processing made easy. It’s an open source framework which allows you to transform, aggregate, and enrich data with ease of operation and reliability.

Ravi Suhag:

And Dagger is built on top of Apache Flink. At a high level, it takes certain data sources, batch as well as stream, and then it aggregates and transforms that data according to the logic that the user provides, and then sinks it to different destinations. One of them is Feast, and I think we all know that Feast is the feature store, which basically allows your models to pull features for training as well as for serving. Now, what does the ML pipeline look like with Dagger and Feast? So you have your raw data, and then the raw data is ingested into an event stream, something like Kafka. Then Dagger takes data from your Kafka, aggregates it, does the feature transformation, and sinks those features into the feature store. And then Merlin is our internal tool, which we will not talk about today, but it basically allows you to train, experiment with, and deploy models. And the final stage is your model serving, and from model serving you’re producing an inference log, which goes back again to Kafka, which you can utilize for monitoring the models or for various other auditing reasons if you want to.

Ravi Suhag:

So before we dive deep into some of the use cases and how Dagger makes it super easy to do feature transformation, just a very high-level architecture. So the first thing is a stream, and there is a consumer inside Dagger, which basically pulls data from these different sources, batch sources as well as stream sources. And then there is a deserializer, which makes it easy for you to deserialize the data, whatever the bytes or whatever XYZ format you have the data in. And then there is a pre-processor. It allows you to hook certain logic as a pre-processor into the whole Dagger ecosystem before the core SQL execution or transformation happens. And then there is a post-processor, which can also talk to external data sources. And then the last piece is the sink. We’ll go into the details of some of these parts in the upcoming slides. The sink basically allows you to write to multiple destinations. The sink also talks to your data serialization and deserialization, so that if you want to output data in a certain format, it allows you to do that. One example could be protobuf.
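
As an illustration of the stages just described (consumer, deserializer, pre-processor, core transformation, post-processor, sink), here is a minimal Python sketch. It is not Dagger's code or API: every function name is hypothetical, JSON stands in for whatever wire format the data uses, and a plain Python function stands in for the SQL step.

```python
import json
from typing import Callable, Iterable, Iterator, List

Event = dict
Step = Callable[[Event], Event]


def deserialize(raw_messages: Iterable[bytes]) -> Iterator[Event]:
    """Decode raw bytes into dict events (JSON is an assumption for this sketch)."""
    for raw in raw_messages:
        yield json.loads(raw.decode("utf-8"))


def run_pipeline(
    raw_messages: Iterable[bytes],
    pre_processors: List[Step],
    transform: Step,
    post_processors: List[Step],
    sink: Callable[[Event], None],
) -> None:
    """Push each event through the stages in the order described in the talk."""
    for event in deserialize(raw_messages):
        for step in pre_processors:
            event = step(event)
        event = transform(event)  # stands in for the core SQL transformation
        for step in post_processors:
            event = step(event)
        sink(event)


# Hypothetical usage: an in-memory list plays the role of the sink.
collected: List[Event] = []
run_pipeline(
    raw_messages=[b'{"order_id": 1, "amount": "12.5"}'],
    pre_processors=[lambda e: {**e, "amount": float(e["amount"])}],
    transform=lambda e: {"order_id": e["order_id"], "amount_cents": int(e["amount"] * 100)},
    post_processors=[lambda e: {**e, "source": "demo"}],
    sink=collected.append,
)
```

Keeping each stage as a separate callable is what makes the pre- and post-processing hooks pluggable without touching the core transformation.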

Ravi Suhag:

So let’s talk about certain key features. We will focus specifically on the ones which help feature transformation. So one is that Dagger is built completely SQL first. It makes it easy to write queries in SQL, with suggestions, formatting, templating of queries, and everything else. It is quite flexible. So even though it’s completely SQL first, there are still various ways to extend the functionality of Dagger by writing [inaudible] transformers, processors, post-processors. And then one is stream enrichment: instead of just pulling data from your core sources, Dagger can asynchronously talk to external data sources, do reference lookups against them, and enrich the data in between. And we’ll talk about how that can be used for various ML pipelines.

Ravi Suhag:

So the first one, right? Self-service. The goal we wanted to achieve was that the infrastructure has to be abstracted out. Data scientists should not be managing data pipelines. So Dagger gives you a complete self-service platform where data scientists can simply specify their data, specify their sink, specify their query, and right then and there, they can look at the logs of all the feature transformations. They can monitor the job that is running and where its output is going, and they can even set up alerts right then and there in that interface. So they don’t have to go anywhere else for any of these parts, do the same pointing, look at the different templates. So it’s a completely self-service platform for data scientists to simply come and write their logic without worrying about anything from an infrastructure point of view, or even about taking care of any framework. They don’t have to deal with Flink, they don’t have to deal with deploying a job or anything else.

Ravi Suhag:

Dagger also gives you a GitOps-based workflow, so people can specify their transformation jobs as a YAML specification, and then use the entire Git and GitOps ecosystem to manage their feature transformation jobs, which they can also tie up with the feature specification and then manage in a more central place. So why SQL? I think there are a couple of reasons, right? One is that SQL is not [inaudible]. I mean, it always terminates, most of the time. So what that allows us to do is make sure people will not write random custom logic anywhere and then monopolize all the compute power in the data center. Not in a hundred percent of cases, but I think in most of the cases, we were able to avoid that. So people write SQL, and it’s very clear that they are going to use limited power. Another thing is that they don’t have to learn a new language or a new framework to do their feature transformation. It remains a consistent way for people to define their transformations in a very specific way. And what we realized, I think, is that SQL serves 90 to 95% of the use cases. And the remaining use cases we can tackle by providing flexibility in different ways.

Ravi Suhag:

So for the people for whom SQL is not enough, another way we provide to do feature transformation is through UDFs. If data scientists see that the logic or the transformation they want to write can’t be achieved with SQL, what they can do is simply define a user-defined function, either in Python or Java, and then they can use that function within their SQL also. So for example, the one we have here is Geohash, which basically takes your latitude and longitude and converts them into a Geohash. And this function can be defined by any data scientist and can be reused by any of them, so it basically becomes your central pool of user-defined functions, which any of the data scientists can use. This also focuses a lot on the reusability of the transformation logic.
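
To make the Geohash example concrete, here is a minimal Python sketch of the standard public geohash encoding, the kind of logic such a UDF would wrap. It is not Dagger's implementation, and how a function like this gets registered as a UDF is not shown here; the function name and default precision are assumptions for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(latitude: float, longitude: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_longitude = True  # bits alternate, starting with longitude

    while len(chars) < precision:
        rng, value = (lon_range, longitude) if use_longitude else (lat_range, latitude)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:  # every 5 bits map to one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


example_hash = geohash_encode(42.6, -5.6, precision=5)
```

Nearby points share a common prefix, which is why a geohash works well as a grouping key for location-based features.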

Ravi Suhag:

Another challenge, and I’ve talked to a lot of people and they all face this, is that when I’m training my model, I want to use production-grade data, right? And I want to use the same production-grade data in my local system, as well as in my training pipelines, and even in my staging and integration systems. But I think there are a lot of compliance issues and security issues that come into the picture if you start to do that, because you don’t want people to just pull all the production data into their local systems. So Dagger provides a data masking feature. That’s where the transformer feature of Dagger comes into play: there is a transformer which is specifically for hashing, but you can write your own very sophisticated one also, if you want to.

Ravi Suhag:

The very simple use case is that people can define their transformation logic with this hash transformer in place. So when you’re pulling your data, it can encrypt all of the PII fields and sensitive fields. So what you have is still the production data, same throughput, same variance, everything else, but completely encrypted for the sensitive fields. This allows data scientists to stay in the zone and train their models on the same quality of data, while making sure security is still in place.
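
A minimal Python sketch of field-level masking by hashing, in the spirit of the hash transformer described above. This is an illustration only: the function name, the field names, and the choice of SHA-256 are assumptions, not Dagger's actual transformer or its configuration.

```python
import hashlib
from typing import Iterable


def mask_fields(event: dict, sensitive_fields: Iterable[str]) -> dict:
    """Return a copy of `event` with the listed fields replaced by SHA-256 hex digests."""
    masked = dict(event)  # shallow copy so the original event is left untouched
    for field in sensitive_fields:
        if masked.get(field) is not None:
            value = str(masked[field]).encode("utf-8")
            masked[field] = hashlib.sha256(value).hexdigest()
    return masked


# Hypothetical booking event; only the email is treated as sensitive here.
booking = {"customer_email": "a@example.com", "fare": 12000}
masked_booking = mask_fields(booking, ["customer_email"])
```

Because the same input always hashes to the same digest, joins and group-bys on the masked field still work. A plain unsalted hash of low-entropy values can be reversed by dictionary lookup, so a keyed hash such as HMAC is the safer choice where that matters.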

Ravi Suhag:

Another thing that we talked about is hybrid data sources. So let’s take an example: you have a real-time feature that you want to transform and sink into your feature store, but in real time you only have data retention, let’s say in Kafka, of seven days. And you want to train on, let’s say, the last one year of data. So what you end up doing is that you go to your data lake or your data warehouse, write the transformation logic there, and then ingest that data into your feature store, right? So now you have transformation A that is sitting there and taking care of training on the historical data. But when you actually want to deploy your job for the real-time features, you will go to your real-time engine, something like Flink, and write your transformation logic there.

Ravi Suhag:

So now it’s the same features, and you’re sinking into the same feature store, Feast, but your transformation logic is in two places. One is for your batch source, one is for your real-time source. And that creates multiple problems. You are managing logic in two places. People have to redo the work to turn that standard data lake query into a real-time query with tumble windows or grouping and whatnot. And that creates a lot of overhead for data scientists. So the way Dagger solves this problem is that we have this hybrid data source, and the hybrid data source allows you to consume data simultaneously from multiple sources. So you can consume data from three Kafka topics, as an example, but it also allows you to consume from a hybrid source of batch as well as stream.
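
For readers unfamiliar with the tumble windows mentioned above: a tumbling window groups events into fixed-size, non-overlapping time intervals. Here is a minimal Python sketch of a per-key count over such windows. In Dagger this would be expressed in SQL rather than Python, and the event field names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[dict], window_seconds: int
) -> Dict[Tuple[str, int], int]:
    """Count events per (key, window_start) over fixed, non-overlapping windows."""
    counts: Dict[Tuple[str, int], int] = defaultdict(int)
    for event in events:
        # Round the timestamp down to the start of the window it falls in.
        window_start = event["timestamp"] - (event["timestamp"] % window_seconds)
        counts[(event["key"], window_start)] += 1
    return dict(counts)


# Hypothetical events with integer epoch-second timestamps.
sample_events = [
    {"key": "driver_1", "timestamp": 0},
    {"key": "driver_1", "timestamp": 59},
    {"key": "driver_1", "timestamp": 60},
    {"key": "driver_1", "timestamp": 125},
]
window_counts = tumbling_window_counts(sample_events, window_seconds=60)
```

A real streaming engine also has to decide when a window is complete and how to treat late events, which this batch-style sketch ignores.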

Ravi Suhag:

So what I can do in that particular case is have two data sources. One is, let’s say, GCS. So my T minus one year to T minus 30 days will come from GCS. And as soon as that data from GCS is completely done, if I have retention of 30 days in Kafka, the T minus 30 days to T data will start to come from Kafka, and then in real time after that. So what that allows us to do is backfill your data with the hybrid source without actually writing two different sets of logic, or waiting for your Kafka data to accumulate for that whole duration. So it basically unifies the processing across batch and stream data sources, and it allows you to completely backfill the historical data. It also allows you to join multiple data sources, in case you have a complex use case where data is spread across three Kafka topics. So you can do joins across those streams and then write your transformation logic there. So it helps you achieve a lot of these complex use cases. And this configuration is a simple way for you to define source streams. So here I’m specifying the bounded source as my [inaudible] source and my unbounded source as a Kafka source. So that’s basically my hybrid source of historical plus real time.
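
The idea of reading a bounded historical source first and then continuing from an unbounded live source, with one shared transformation, can be sketched in a few lines of Python. This is a conceptual illustration only: the function names and event fields are hypothetical, the in-memory list and iterator stand in for GCS and Kafka, and Dagger's actual source configuration is not reproduced here.

```python
from itertools import chain
from typing import Iterable, Iterator


def hybrid_source(bounded: Iterable[dict], unbounded: Iterable[dict]) -> Iterator[dict]:
    """Yield every historical event first, then continue with live events."""
    return chain(bounded, unbounded)


def to_feature(event: dict) -> dict:
    """One transformation shared by the backfill and the real-time path."""
    return {
        "driver_id": event["driver_id"],
        "fare_per_km": event["fare"] / event["distance_km"],
    }


# Stand-ins: a list for the bounded (historical) source and an iterator
# for the unbounded (live) source.
historical = [{"driver_id": "d1", "fare": 100.0, "distance_km": 4.0}]
live = iter([{"driver_id": "d1", "fare": 90.0, "distance_km": 3.0}])

features = [to_feature(event) for event in hybrid_source(historical, live)]
```

Because `to_feature` is the only place the logic lives, the backfilled features and the real-time features cannot drift apart.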

Ravi Suhag:

Now, there is one feature I want to talk about before I talk about how stream enrichment is used. What Dagger allows you to do is not just ingest data from these Kafka sources: while the data is coming in, you can also do a reference lookup against your external endpoints. So you can tell Dagger, “Hey, when this event comes, talk to that particular API, return me this result, and then join it with this particular event within the processing itself.” Not just APIs, it can also talk to object stores and caches to enrich your streams as events are happening. So this is what a simple configuration looks like. You specify your external source. In this particular case, we are specifying Elasticsearch. You provide the host, you provide the endpoint, you provide the various timeouts, and then the overall mapping. So given the path of your customer profile, how it should map to your event, right? So you’re specifying that mapping.
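
A minimal Python sketch of this kind of asynchronous enrichment: each event triggers a lookup against an external store, subject to a timeout, and the result is mapped onto the event. Everything here is hypothetical, including the in-memory dictionary standing in for Elasticsearch, the field names, and the timeout value. It shows the pattern, not Dagger's post-processor configuration.

```python
import asyncio
from typing import List

# Stand-in for an external store such as Elasticsearch.
CUSTOMER_PROFILES = {"c1": {"segment": "frequent"}}


async def lookup_profile(customer_id: str) -> dict:
    """Pretend to call an external endpoint; returns {} when nothing is found."""
    await asyncio.sleep(0)  # placeholder for network latency
    return CUSTOMER_PROFILES.get(customer_id, {})


async def enrich(event: dict, timeout_seconds: float = 1.0) -> dict:
    """Attach the customer's segment to the event; fall back to None on timeout."""
    try:
        profile = await asyncio.wait_for(
            lookup_profile(event["customer_id"]), timeout_seconds
        )
    except asyncio.TimeoutError:
        profile = {}
    return {**event, "customer_segment": profile.get("segment")}


async def enrich_all(events: List[dict]) -> List[dict]:
    """Run the lookups concurrently; results come back in input order."""
    return list(await asyncio.gather(*(enrich(event) for event in events)))


enriched = asyncio.run(enrich_all([{"customer_id": "c1"}, {"customer_id": "c2"}]))
```

Running the lookups concurrently is what keeps a slow external call from stalling the events behind it.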

Ravi Suhag:

Now, how does stream enrichment actually help? It doesn’t help with feature transformation. It helps with a very different use case. So let’s say once you have your feature store deployed and you’re serving from there, what you want to do is stream inference. So for any event that comes in, you want the prediction output of that model attached to the same event, enriched, and sent back into a stream itself. So this is roughly what the architecture looks like. You have data coming from Kafka into Dagger, which is the input to your prediction model. Then Dagger talks to your model endpoint and fetches that prediction, joins it with that event, and then puts it back into Kafka, which is the enriched output of that particular prediction. If you go slightly into detail on how it looks, you have a source Kafka topic, you consume from there, and then you have this REST API.

Ravi Suhag:

And the transformer basically pulls your features from Feast, then talks to your model service and pulls back that prediction, enriches the event, and then sends it to the destination Kafka topic, right? So this is roughly how it looks. Now, these prediction logs… So what kind of use cases do we solve with this, right? Let’s say there is a booking event coming in and you want to attach a pricing prediction to it as soon as the event happens. So you can call your pricing model, attach the price to the event, and then send that output right away. What it allows is that multiple consumers can use the same output of that model, right? That’s one use case. The second one is that all these prediction logs are now in Kafka, and they can be stored in BigQuery or other warehouses, or used for monitoring, quality, audit, or whatever XYZ purpose you want to use them for. But you have all of these produced as events, which everyone can utilize in their own way.
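
The stream inference flow for the booking example (fetch features, call the model, attach the prediction to the event) can be sketched as follows. The feature store and model are plain Python stand-ins, and every name and number is made up for illustration. No Feast or model-serving API is used here.

```python
from typing import Callable, Dict


def infer(
    event: dict,
    feature_store: Dict[str, dict],
    model: Callable[[dict], float],
) -> dict:
    """Enrich a booking event with a price prediction, keeping the original fields."""
    features = feature_store.get(event["area_id"], {})  # stand-in for a feature lookup
    prediction = model({**event, **features})           # stand-in for a model service call
    return {**event, "predicted_price": prediction}


# Hypothetical feature values and a toy pricing model.
feature_store = {"area_1": {"surge_multiplier": 1.5}}


def toy_pricing_model(row: dict) -> float:
    return 2000 * row["distance_km"] * row["surge_multiplier"]


booking_event = {"booking_id": "b1", "area_id": "area_1", "distance_km": 4.0}
enriched_event = infer(booking_event, feature_store, toy_pricing_model)
```

Since the output is the original event plus the prediction, any number of downstream consumers can read it from the destination topic without calling the model again.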

Ravi Suhag:

And stream inference, from a data scientist’s point of view, is also very self-service. People specify the request field mapping, meaning where they’re pulling the data from, their filter query, and then the result type mapping, meaning how you want that particular prediction output to look. So if you want to change the names of fields or map them to something else, you simply define it in the UI and it allows you to do that. So what does Dagger adoption at Gojek look like? We have some 300 plus Dagger jobs running for feature engineering, and more than 50 data scientists are creating these Dagger jobs without any engineer’s help. So they do not come and talk to us about whether they can create a job or not, right? It’s completely self-serve; they do it on their own. In case there are very specific use cases, they reach out to the engineering team, and then we take care of providing that capability within Dagger or the [inaudible] system. And we have more than 10 terabytes of data processed per day just for feature transformation.

Ravi Suhag:

So now just going a bit wider than Dagger: Dagger is actually part of a larger ecosystem of products that we are building to power the entire DataOps ecosystem. And that’s something we call DataOps Foundation. It basically allows people to transform, analyze, and secure data faster and more efficiently. And as in some of the screenshots you are seeing, we completely abstract out the underlying framework or technology, and make it experience first. So it works more in terms of: people should be able to discover their data, understand it, understand its quality and lineage, and then operate on it. So when any data scientist comes into the picture, the first thing they want to do is find out what data exists.

Ravi Suhag:

And then they want to understand the quality of that. Like, the data I want to use for my models, is it of the right quality? Once they know that this data is of the right quality, then they operate. Operate is where Dagger comes into the picture, which allows you to transform that data and put it into your feature store. And apply is where your prediction comes into the picture and then they take it back in. But all of that, instead of being framework driven, is very experience driven; the whole journey works in a very self-service way. So ODPF is, in a way, this fully integrated suite of multiple products. And this is what these three categories look like. One is the whole life cycle of the data, where you ingest your data, stream process it, which is what we talked about today with Dagger, and then you send it to multiple destinations.

Ravi Suhag:

And then you have other tools to discover, manage, access control, and even identity manage your data operations. So Dagger is already powering feature engineering and the overall data platform at Gojek and within the Gojek ecosystem, at companies like Midtrans, Mapan, and Moka. And it’s been battle-tested for a couple of years now. All of this ecosystem is open source on GitHub, and we have an active community. We recently open sourced this entire thing and it’s growing. So we have more than 200 contributors, growing 80% year over year, and more than 2000 commits have happened in the last year, and across GitHub and Slack we have some thousand plus community members. We highly encourage people to get involved. So all the code, everything is out there on GitHub, and we also have a Slack community, so feel free to join. And that’s pretty much all from my side. Thank you so much for having me here. Happy to take questions.

Demetrios:

Amazing, Ravi, awesome stuff. There are a few questions that are coming through in Slack. And so let’s just start with this first one from Alex, how should these feature engineering and storage approaches be viewed when discussing the feasibility of auto ML solutions?

Ravi Suhag:

Right now, we’re actually not tackling that particular part. We have not given so much thought to it, but I think it can definitely be explored, where you can relook at the same prediction and then do the auto ML again. But right now we are not going into that mode and it’s not serving the needs of auto ML, but I very clearly see that it can definitely be enriched to tackle some of those challenges.

Demetrios:

Nice. Okay. Dagger seems to be in a similar area to Spark, what’s the pros and cons and why or why not Spark?

Ravi Suhag:

Yeah. So I think if you look at Dagger, under the hood it still uses Apache Flink, right? So it still uses a framework to process your data in stream as well as batch mode. But what these frameworks lack, whether it’s Flink or Spark, is that they leave you with the same challenges that we talked about. They require people to write their jobs in a very custom way. What we have solved with Dagger is simplifying a lot of those things. For example, if you want to talk to an external source for enrichment, doing that natively with Flink or Spark means either all your data scientists end up doing it in their own way, or they end up struggling a lot with it. What Dagger does is build a lot of abstraction on top of it. So it’s sort of a stream processing engine from the ground up. It’s built on top of Apache Flink, but it makes all of these use cases super easy for data scientists.

Demetrios:

Nice. Okay. That makes sense. How long ago did you open source this?

Ravi Suhag:

I think it’s been eight months, but it has been more of a process of open sourcing slowly. The fully blown version came out, I think, just a couple of months back, but we have been in the process of starting to open source it for a while now, almost a year, I would say.

Demetrios:

Okay. So there’s a great question coming through here from Frata. He’s asking: in his experience, whenever we, AKA the ML platform, came up with new ways of doing things, i.e. feature engineering, we always observed some friction when people started using it. Can you explain further how this process was of reaching 300 data scientists using Dagger?

Ravi Suhag:

Yeah, so I think the one thing that allowed us to do this is the fundamental goal that we wanted to achieve, that there is no new skill to learn. If we literally built a different DSL for people to write transformations in, it would definitely have a steep learning curve that people would be hesitant to climb. What we did is go completely SQL first. So we make sure that they don’t end up learning any new framework, right? So they simply come, and if theirs is a very simple use case, write it in SQL, nothing new to learn. The second thing is that we did not ask them to manage anything else. You don’t need to spin up a new job, you don’t need to spin up new infrastructure for that. All of that is completely abstracted. So the onboarding process and the turnaround time for data scientists to play around with their features became much shorter.

Ravi Suhag:

You basically have to give some incentive. So the incentive we give is this: in the traditional way, you are spinning up your framework, you’re writing your custom job, and then you are deploying that and doing some [inaudible] or the rest of the infrastructure, or maybe you’re deploying a job on Spark, and that takes time. Whereas in this case you come, simply write your SQL, and you have your feature transformation sinking data to the feature store in a couple of minutes. Right? So that’s the mode. And that simply allows people to start using it organically. Of course, you have to educate people, right. Where we had to struggle is giving people confidence about its reliability: if you deploy it, we will make sure it stays up with this uptime. And that’s why we provide a lot of things around monitoring and logging and those aspects, right? Because [inaudible] becomes important. So one aspect, if you saw in the self-service part, is that monitoring, logging, and all of these aspects were built in, and that gave data scientists a lot of confidence to try it out.

Demetrios:

So there’s something that you just talked about there, and it’s like you have the pillars that you were using. Do you feel like, maybe can you explain a few other strong pillars or vision when you went out and set out to build it?

Ravi Suhag:

Yeah. So I think the major pillars were that the way we wanted to look at the data experience overall is not from the tooling point of view. The same approach that we applied for data scientists, the framework I talked about, discover, understand, operate, and apply, is the philosophy we have even for other personas, analysts as an example. So even for analysts, the flow looks the same. The only place it differs is the operate part. Instead of writing transformation logic to create models or do feature engineering, analysts are doing their real jobs or doing their dashboarding. So whatever persona we are serving within the organization from a data point of view is going through this framework only. And all of our tools are basically targeted at solving this problem.

Ravi Suhag:

So I think that’s the whole reason we are talking about this being experience first. We did not go in saying, “Hey, we need a discovery tool. We need a stream processing tool. We need a data quality tool.” We did not approach it that way. We approached it with, “What is the user journey going to look like for a data scientist? What is the user journey going to look like for an analytics engineer? What is the user journey going to look like for a data engineer?” And every tool that evolved came from the need to serve that experience. I hope that answers the question.

Demetrios:

Yeah. Yeah. For sure. So does Dagger connect with cloud providers like GCP?

Ravi Suhag:

Yes. So Dagger is cloud native, so as long as you have a Kubernetes cluster, you can deploy it. You can also deploy it on [inaudible] and anything else also, right? So it’s basically a framework for any and all things: either you can run it standalone on your local system, or you can deploy it anywhere Kubernetes is running. So there is no boundary in terms of what you can do with it. It has different modes of deployment.

Ravi Suhag

VP of Engineering

Gojek

Ravi Suhag is an agile engineer with a vision to transform software chaos into seamless experiences. His inherent passion to solve problems involving analysis and synthesis shaped him to lead technology and product for organizations of all sizes, untangling the knots in software development, smoothing product delivery, and instilling best practices. With more than 10 years of experience and a proven track record, he has crafted both technology and product roadmaps with a distinct programming style and a dedication to transparency and open-source development. He currently works as VP Engineering at Gojek, Indonesia's largest hyper-local company, where he leads teams to build large-scale, self-service data platforms, allowing its workforce to make data-driven decisions. Before that, he worked as a tech consultant at the Center for International Development, Harvard University, to build tools that enabled the government to make data-driven policy decisions.
