
Weaver: CashApp’s Real Time ML Ranking System

apply(conf) - May '22 - 30 minutes

In this session, we will talk about Weaver, one of the core infrastructure systems used to personalize the experience on Cash App, and the work we did to scale it. Weaver is our real-time ML ranking system for ranking items in search and recommendation use cases. We provide plug-and-play feature store and model hosting backends to meet various needs. We will share our experience optimizing the service.

Meenal:

Hello, everyone. Super excited to be here. My name is Meenal and I lead personalization ML engineering for Cash App and Block. And here with me is Austin, who’s an engineer on my team. Today we will talk about the real-time ML ranking system that my team is building at Cash App. All right. So before we jump into the technical stuff, I just want to quickly introduce our company. Block is a FinTech company with multiple business units like Square and Cash App. I’ve listed all of them here. And Cash App in particular is focused on making it easy to send, spend, and save money, and also buy Bitcoin.

Meenal:

And actually I also want to give a shout-out to the previous speaker, Daniella, who used to work at Cash App a while ago. So next, jumping into the agenda. We’re going to talk about the problem, the requirements, the solution we proceeded with, and the challenges we faced. We will also share our testing results and the optimizations we did, and end with some learnings we had while scaling the system. So going into the problem: we have a lot of ranking problems in Cash App. For example, when a Cash App customer searches for another customer to send money to, we can rank the results in an order that’s more relevant to that user. Or when we’re trying to show the debit card rewards that are most relevant to the user to engage them in the product, or when we’re trying to show support articles that are more tailored to the user’s needs, which might also help reduce our customer support volume.

Meenal:

So for these kinds of problems, on a high level, the architecture works like this: we have a candidate generation phase, which takes the customer and the context. For example, for our rewards use case, we will find all the rewards that the customer is eligible for. Then for each of the candidates, we will score it using ML. That score is then used to sort items before displaying them to the customer. The intention here is that the model takes the customer and the context into account to generate these personalized scores, so the content is actually tailored to the customer’s preferences.

Meenal:

And in this presentation, we will particularly focus on the scoring part, which we internally call Weaver: how did we build it, and what is our story around it? So jumping into the requirements, I want to start by highlighting what I showed in the previous slide, which is basically a fanout. Each event translates into n items that need to be scored. For example, we want to show relevant rewards when the customer opens the app. In that case, we will get the relevant rewards and then score them with the ML model, which means we have to support a really high QPS per use case. Hence, scalability is an important requirement for us. Now, these experiences are customer facing, so we don’t want them to be slow or have downtime. We set an initial target before building the system: we would like to score 50 items within a hundred milliseconds.

Meenal:

The other thing is, as I highlighted in the previous slide, we have multiple use cases. So we wanted to build a system that would easily support the use cases we currently have and more that will come in the future, so we don’t have to make a significant investment in each of them. We also wanted basic ML-related things like A/B testing support, logging scores for analytics, and a shadow mode so we can run things in the background without impacting our customers. With these requirements, we started working toward a solution, and we considered three options. A library, where the business logic is embedded in the use case itself. A sidecar, where the business logic runs in a container alongside the product service in the same pod. Or a standalone service that the product service makes a call to. The advantage of a library or a sidecar is somewhat reduced latency and isolation, but they can be tricky to manage.

Meenal:

So we went with the service option to start with, because we felt it would help us bootstrap easily without much investment from our customers. But in the future, we might consider abstracting it as a library if there are special requirements around it. I do want to emphasize that building a library does not mean there will be no network calls at all, because you might still want to call your feature store; it’s more a question of reducing those calls. And to manage some of the other downsides of a service, like latency, we were careful with our network topology to make sure everything is in the same region and no cross-region calls are happening. We also implemented rate limiting to ensure that one caller cannot take the system down for others.

Meenal:

So here is our architecture. It’s pretty straightforward. We get a request with items, some information about the context, and a configuration key, since we allow multiple configurations to run in this system. We use this key to find the relevant configuration. Based on that, we gather all the features needed from the feature stores and then call the configured model to compute scores for those items, and we return the response. We also asynchronously log the results via a Kafka queue; as I mentioned, we need those for analytics later. You may notice that we are showing multiple model servers and feature stores here, and that is because we have access to multiple of these, built for different use cases. So we allow users of the system to configure a single feature store or multiple feature stores per their use case. We also cache a bunch of metadata at startup and refresh it regularly to avoid making those calls per request.
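As a rough illustration of the request flow just described, here is a minimal sketch of what the request and response shapes could look like; the field names are hypothetical and not Weaver’s actual API.

```kotlin
// Illustrative only: hypothetical request/response shapes for a scoring call.
data class ScoreRequest(
    val configKey: String,            // selects which ranking configuration to use
    val context: Map<String, String>, // information about the customer / request context
    val items: List<String>           // candidate item ids to score
)

data class ScoreResponse(
    val scores: Map<String, Double>   // item id -> model score, used to sort before display
)
```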

Meenal:

Now moving on to the challenges. The architecture we just talked about is pretty straightforward, but there are still some challenges. For example, we make calls to downstream services like the feature store and model server, and the latency of these backend services also impacts our scoring service. The latency of the feature store will be impacted by its backend: is it a [inaudible 00:06:48] store, is it DynamoDB or [inaudible 00:06:52], or an in-memory cache like Redis? Similarly, model serving will be impacted by the network cost, the model evaluation cost, which depends on the type of the model, and the associated overheads of the system. So a slow call to either of these systems results in high latency for us. The second problem we have is the fanout problem I highlighted: one call may translate to n calls to the feature stores and model server. This actually increases the probability of experiencing high latency, which I’ll illustrate in the following slides.

Meenal:

I also want to highlight the cited articles that helped us understand these concepts better. Now jumping into this fanout example. Let’s say we have a service A, which we call our root service, or a parent service, that fans out to another service B, say 10 times in parallel, before returning a response. We call service B a child or a leaf service. And we know that the p99 for service B is 100 milliseconds. Assuming all the latencies are independently and identically distributed, we want to know the probability of the root service latency being greater than 100 milliseconds, given that we are making the calls in parallel. Of course, if the calls were sequential, it would always be the case.

Meenal:

So before we compute it, I want to highlight one thing. If the p99 of a leaf service is 100 milliseconds, what it means is that for 99% of requests, the latency is less than 100 milliseconds. This implies that the probability that the leaf latency is less than 100 milliseconds is equal to 0.99, and this is something we will use in our later computations. So now we want to compute the probability that the root latency is greater than 100 milliseconds. This translates to the probability that at least one of the 10 calls we are making to service B has a latency greater than 100 milliseconds, which in turn equals one minus the probability that all the leaf calls have a latency less than 100 milliseconds, which is what we just computed. And what we find is that when we are fanning out to 10 leaves, roughly 10% of the requests will have slow latency, defined as greater than 100 milliseconds.
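Written out, the computation for this example is:

\[
P(\text{root} > 100\,\text{ms}) \;=\; 1 - P(\text{all 10 leaf calls} \le 100\,\text{ms}) \;=\; 1 - 0.99^{10} \;\approx\; 1 - 0.904 \;=\; 0.096 \;\approx\; 10\%.
\]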

Meenal:

So [inaudible 00:09:35] I’m also showing you a figure here, which plots this probability against the number of leaf calls, and as illustrated, it increases and becomes high pretty quickly. So now that we’ve figured out that this fanout leads to slower latency, we want to find out what guarantees we should get from these downstream services, when we are making a fanout, in order to guarantee a tighter latency [inaudible 00:10:12] on our root service. So now let’s say I want my root service to have a p99 of 100 milliseconds. What should my leaf latency distribution be? By rearranging the equation from the previous slide, what we find is that you have to look at the 99.9th-percentile latency of the leaf node, and that latency should be 100 milliseconds in order for the root node’s p99 latency to be 100 milliseconds. And this is for the case with a fanout degree of 10.
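Rearranged, the same equation for a fanout of 10 gives:

\[
P(\text{root} \le T) \;=\; P(\text{leaf} \le T)^{10} \;\ge\; 0.99
\quad\Longrightarrow\quad
P(\text{leaf} \le T) \;\ge\; 0.99^{1/10} \;\approx\; 0.999,
\]

so a target of T = 100 milliseconds has to hold at roughly the leaf’s 99.9th percentile, not its 99th.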

Meenal:

I will also illustrate that with the graph here, showing how this changes as the fanout increases. And as you can see, even from this example, it becomes very high very quickly. So the key takeaway is that you don’t want to look at the same percentile for the root and the leaf. You want tighter guarantees on a higher percentile of the leaf if you want the root node to be fast. And this can be challenging, because sometimes this information is not even available; most of the time you only see the p95 and p99. Now I will hand this over to Austin, who will share the details of the testing and the various optimizations that were done, and how they performed.

Austin:

All right. Hello, everyone. So first we’ll just go over the initial setup for the testing. In this case, our feature store was using a DynamoDB backend and we were [inaudible 00:11:52] the feature store with an OkHttp-based client. Our model server in this case was just a container holding a model, and we had it hosted with a cloud provider. We did keep them all within the same region in order to reduce latency as much as possible. For the testing, like Meenal mentioned before, we were targeting a batch size of 50, and the hardware in this case was about two CPUs and four gigabytes of RAM.

Austin:

So here’s how it started. We took the philosophy that we just wanted to build something quick and then make it fast later. So this is what we had initially, when we were code complete for the first time. At version zero here, you can see it’s very slow and unstable. Latencies were between 200 and 800 milliseconds, and throughput-wise we were only really able to do 500 items per second per pod. So when we wanted to scale this up to tens of thousands of items per second, this just wasn’t tractable at all. We’d have to have so many pods, and we were nowhere close to our latency requirement.

Austin:

So how’s it going? This is Weaver as of a couple weeks ago. As you can see, it’s now fast and stable. Our p99 latency is consistently under 50 milliseconds, which is really nice because that’s twice as fast as our initial requirement. And we’re doing three times the throughput, 1,500 items per second per pod, and we don’t really see any sign of degradation, so we do think we can push this even further. So what did we do? Each of the following examples falls into these broader categories. The first thing we looked at is the network path: how do we actually get to the data? Then where the data is stored, the underlying system involved. We looked at retry and hedging policies as a way to reduce the tail latencies. Another way to improve stability: we looked at connection warmup. And finally, just good old concurrency tuning, which is applicable to really any application [inaudible 00:14:08].

Austin:

So first things first, hardening the network path. In our initial load tests, we were seeing really bad latencies, and sometimes they would spike without an obvious cause. We would look at the rest of our metrics in Weaver and there just didn’t seem to be anything driving that latency; it really didn’t make sense. What it turned out to be was actually a bottleneck in another system. In this case, it was the egress gateway, which controls traffic flowing out of our cloud. That turned out to be a huge bottleneck. Just by avoiding it, we were able to bring our latency down to a stable 200 milliseconds. We didn’t really appreciate the fact that during our first load test, this system was actually doing the highest volume of outgoing requests in Cash App. So that team just wasn’t ready for it. We scared a bunch of people, but in the end it made our system much better.

Austin:

So the key takeaway from this is that you really want to understand the broader system. Sometimes you can get focused on your own application, but for high-throughput cases like this, you really want to understand the entire network path. The next biggest improvement was serving the features from memory instead of from disk. We were lucky that the feature store we use happens to support this. By switching from DynamoDB to Redis, we shifted the entire latency distribution for our calls to the feature store. In this case, our p99 dropped from about 30 milliseconds down to 10, and the average latency dropped from 10 to 5. This is really critical when you want to do a fanout. And as Meenal mentioned earlier, sometimes you can’t even see the long-tail latencies. In this case, our APM only goes up to p99, but we could assume that the p99 [inaudible 00:16:11] is brought down significantly as well, which then gives us a chance to complete the root request really quickly.

Austin:

So this was the end result. The previous graph was just the queries to our [inaudible 00:16:27] feature store. This graph here shows the latency of our actual endpoint, which is what we’re actually trying to optimize. And this translated to about a 50% reduction in our p99 latency for scoring. These numbers are a little low overall, but in this particular example our batch size was a bit small. So next up is this idea of hedging requests. This came from the paper called The Tail at Scale from Google, and it really is an effective strategy if you want to reduce the impact of tail latencies. The algorithm is quite simple, a bit hard to implement in practice, but the basic idea is this.

Austin:

First, you just send a request. If there’s no response within some time delay, send it again. Then wait for them both to complete, return the one that completes first, and cancel the other. The logic behind this is that oftentimes, when you do hit a long-tail latency, it’s because the system just happens to be overloaded for a really short duration, because of a garbage collection that’s happening, or a compaction if it’s a database, or any number of other things happening on the CPU; anything can make a certain replica slow for a brief amount of time. When you send that hedge request, the hope is that the second request will reach a replica that is not having any difficulty.
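A minimal sketch of this algorithm using Kotlin coroutines is below. It is illustrative only, not Weaver’s actual implementation: `fetch` is a hypothetical suspending call to the downstream service, and a production version would also need to propagate the cancellation down to the HTTP client (more on that later).

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.selects.select

// Send a request; if it hasn't completed within hedgeDelayMs, send a second,
// identical request and return whichever finishes first, cancelling the other.
suspend fun <T> hedged(hedgeDelayMs: Long, fetch: suspend () -> T): T = coroutineScope {
    val primary = async { fetch() }

    // If the primary comes back before the hedging delay expires, just use it.
    val early = withTimeoutOrNull(hedgeDelayMs) { primary.await() }
    if (early != null) return@coroutineScope early

    // Primary is slow: fire the hedge request.
    val secondary = async { fetch() }

    // Take the first result that arrives and cancel the loser.
    select<T> {
        primary.onAwait { result -> secondary.cancel(); result }
        secondary.onAwait { result -> primary.cancel(); result }
    }
}
```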

Austin:

So what’s the cost of doing this? Really, it’s a trade-off between latency and additional load on the system, and the way you control it is how you set that hedging delay. If we look at the chart below, it represents the distribution of the requests going to our feature store. Any request that falls to the left of that pink line is only going to be sent once. And if we set the delay to exactly 6.38 milliseconds, plus a bit of overhead, any request that falls to the right of that line is going to be sent twice. So it is pretty intuitive: if you set the delay to the p95 latency, you’re going to add 5% more load to your system.

Austin:

So here’s what it did for us. We actually found that diminishing returns kicked in really quickly for request hedging, and we topped out at hedging about 1% of our requests. Just by adding that 1% additional load, we were able to cut our p99 down by about 20%. As you can see, the purple line is the p99 latency and the blue line is the average, and as expected, this really only brings down the long tail; it’s not going to move your average latency much. So what’s the problem with this? When you do request hedging, there is a severe issue you have to worry about, which is that the additional load can actually affect the underlying distribution. That additional load shifts the distribution to the right, those slower requests then cause more additional load, and the cycle repeats.

Austin:

So you have more additional load, which slows down requests even more, which brings in more additional load, and the latency will just spiral out of control. So if you are going to use request hedging, it is similar to retries: you really need some safety mechanism in place to prevent this from happening. The approach we took was to compute the percentage of requests that get hedged, and we have an alert that tells us if it goes beyond a certain amount; then we can manually adjust the delay via dynamic configuration. But really, the ideal solution is if you can control this in [inaudible 00:20:15] itself and have it automated, just to reduce the operational burden.
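One possible shape for that kind of safety mechanism is sketched below, assuming the hedge rate is tracked in-process. The class name and thresholds are hypothetical; Weaver’s actual approach, as described above, is an alert plus manual adjustment via dynamic configuration.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Track what fraction of recent requests needed a hedge and stop hedging once
// that fraction runs away, since a runaway hedge rate is the signal that the
// extra load is shifting the latency distribution itself. A real version
// would use a sliding window rather than all-time counters.
class HedgeGuard(private val maxHedgeRatio: Double = 0.05) {
    private val total = AtomicLong()
    private val hedged = AtomicLong()

    fun recordRequest(wasHedged: Boolean) {
        total.incrementAndGet()
        if (wasHedged) hedged.incrementAndGet()
    }

    fun hedgingAllowed(): Boolean {
        val t = total.get()
        if (t < 1000) return true  // not enough data yet to judge
        return hedged.get().toDouble() / t <= maxHedgeRatio
    }
}
```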

Austin:

So, surprise problem: it might actually be expensive to cancel requests. Think back to the original algorithm. When you send a hedge request, you keep the fastest response, but you’re supposed to cancel the other one in order to save load on the service you’re calling, and to save load for yourself. But the library we’re using, OkHttp, was actually closing connections on cancellation, and this turns out to only be an issue with HTTP/1 connections, not HTTP/2. Because of those closed connections, follow-up requests had to re-establish them, and that increased our latency significantly. And again, that feedback loop kicked in and quickly led to hedging all of our requests. So oftentimes you do rely on the abstractions, but in this case, it was really important to understand the implementation details of our client.

Austin:

Next up is warming up the connection pool. This really applies when you’re deploying or restarting your service, or when a scaling event happens. The initial traffic that hits your service has to establish those connections, and that really is the slowest part of the request. For us, these slow requests were triggering hedging, which was significant at first, so it actually slowed everything down even more. So the thing we had to do was establish all these connections early in the process, particularly during startup, so that when the service receives actual traffic, we don’t have to do it then. The key takeaway is that warming up is really helpful for improving the stability of your latency. Last but not least is, of course, good old concurrency tuning. This is actually quite tricky to get right, but it really is key if you want to maximize your performance. There are a bunch of formulas out there, and what it really boils down to is the proportion of IO-bound versus CPU-bound work.
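A minimal sketch of the warmup idea, assuming an OkHttp client and a hypothetical lightweight health endpoint on the downstream service, might look like this. The requests are issued concurrently so that several distinct connections get established rather than one connection being reused.

```kotlin
import kotlinx.coroutines.*
import okhttp3.OkHttpClient
import okhttp3.Request

// Called once at startup, before the service starts receiving real traffic,
// so that TCP/TLS connection establishment doesn't land on customer requests.
fun warmUpConnections(client: OkHttpClient, baseUrl: String, connections: Int = 5) = runBlocking {
    val request = Request.Builder().url("$baseUrl/health").build()
    (1..connections).map {
        async(Dispatchers.IO) {
            // Failures here are fine; the point is just to open connections early.
            runCatching { client.newCall(request).execute().use { /* discard body */ } }
        }
    }.awaitAll()
}
```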

Austin:

Are you doing mostly IO or mostly CPU? You don’t have to think about it as binary; it really is a spectrum, and you want to figure out that proportion, that ratio, for your particular use case. You can gather data and apply some formula, or you can just do it through trial and error, which is what we did to find the right connection pool settings and thread pool setup. As a particular example for us, just by removing a thread pool, we decreased our CPU utilization by 5% and our latency by five milliseconds. In this particular case, [inaudible 00:23:01], we used coroutines to do our asynchronous work, and we actually had a thread pool that was spinning up a thread for each coroutine, which completely defeats the purpose of coroutines. That’s why it was hurting our performance, and just by recognizing that, we were able to get a quick win. A couple more enhancements that we haven’t done yet but are looking at: first, using more efficient data structures, and second, this idea of good enough results.
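A sketch of the kind of anti-pattern and fix described here is below; the exact setup in Weaver isn’t shown in the talk, so the function names and dispatcher choices are illustrative only.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.Executors

// Anti-pattern: backing coroutines with a pool that effectively hands each
// coroutine its own thread defeats the point of coroutines and burns CPU on
// thread creation and context switching.
val perCallPool = Executors.newCachedThreadPool().asCoroutineDispatcher()

suspend fun scoreItemsSlow(items: List<String>): List<Double> = coroutineScope {
    items.map { item -> async(perCallPool) { fetchFeaturesAndScore(item) } }.awaitAll()
}

// Better: drop the extra pool and let the work run on a shared dispatcher
// (or on whatever dispatcher the HTTP client already manages).
suspend fun scoreItemsFast(items: List<String>): List<Double> = coroutineScope {
    items.map { item -> async(Dispatchers.IO) { fetchFeaturesAndScore(item) } }.awaitAll()
}

// Hypothetical suspending call to the feature store + model server.
suspend fun fetchFeaturesAndScore(item: String): Double = TODO()
```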

Austin:

So first off, more efficient data structures. This one is fairly self-explanatory, but the idea is: can you minimize the amount of data flowing through the system? That means fewer bytes sent over the network and less stuff to marshal, unmarshal, serialize, and deserialize, which saves you CPU as well. The pictures here show a columnar versus a row-oriented data structure, and you can see that the columnar data structure on the right side is a lot more compact. The main reason in this case is that you’re only specifying feature names once. So this is a small win to make your system more efficient, and in particular use cases you can actually go much further than this. This is just columnar versus row, but you might be doing something that you could optimize even further. And the key thing that makes this viable is the fact that we are using a container for serving our model, and within that container we can apply whatever [inaudible 00:24:33] we want to unmarshal these requests.
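As a rough illustration of the row- versus column-oriented payloads being compared here, with hypothetical field names rather than Weaver’s actual schema:

```kotlin
// Row-oriented: every item repeats the feature names in its own map.
data class ItemRow(val itemId: String, val features: Map<String, Double>)
data class RowPayload(val items: List<ItemRow>)

// Column-oriented: feature names appear once and values are parallel arrays,
// so the serialized payload is smaller and cheaper to (de)serialize.
data class ColumnarPayload(
    val itemIds: List<String>,
    val featureNames: List<String>,
    // featureValues[i][j] is the value of featureNames[j] for itemIds[i].
    val featureValues: List<DoubleArray>
)
```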

Austin:

Last but not least is the idea of good enough results. This is another idea that came from The Tail at Scale, and it is simply to produce good enough results quickly, as opposed to the best results slowly. The way we would do it is to use some timeout and just drop the scores that don’t compute in time. The nice thing about this is that the more scores we have to compute, so as our fanout increases, the less likely it is that we’re going to drop the best score. So as the problem gets worse with fanout, this technique actually becomes more useful. The chart below is an example from a real service at Google and illustrates the problem they had. If we look at the rightmost column, we can see that waiting for 95% of the leaf requests to finish took only 70 milliseconds, but waiting for all of them took 140. So that slowest 5% of requests was dominating the p99 latency of the system.
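A minimal sketch of this idea with coroutines, assuming a hypothetical per-item suspending scoring call: anything that misses the deadline is dropped (or could be imputed) rather than holding up the whole response.

```kotlin
import kotlinx.coroutines.*

// Score all items concurrently, but give each one the same deadline. Late
// scores come back as null and are simply dropped from the result.
suspend fun scoreWithDeadline(
    items: List<String>,
    deadlineMs: Long,
    scoreItem: suspend (String) -> Double
): Map<String, Double> = coroutineScope {
    items.map { item ->
        async { item to withTimeoutOrNull(deadlineMs) { scoreItem(item) } }
    }
        .awaitAll()
        .mapNotNull { (item, score) -> score?.let { item to it } }
        .toMap()
}
```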

Austin:

So if your particular use case can get away with dropping some scores, and maybe you have some system to impute the missing ones, this can be a really viable technique to bring your latency down further. So just to wrap up, I want to hammer these points home. If you have a fanout problem, you really want to know the long-tail latency, and I’m not talking about just p99; you really want to understand the p99.9, p99.99, and so on. Next, make sure you understand your network and traffic setup; this is really important. If it’s possible in your use case, prefer fetching data from memory as opposed to from disk. Next, make sure your connections are warm before your system receives traffic. And last but not least, find the right amount of concurrency to fully optimize your system. And that’s all. Thank you for watching.

Demetrios:

Amazing. Super cool. Thank you both for that presentation. There’s a ton of questions coming through here. So I’m just going to start firing them off. Did the move to Redis lead to any drop in HA? Did you add Redis as another cache layer on top of DynamoDB? Or did you replace DynamoDB with Redis?

Austin:

In this case. I think, Meenal, you’re muted.

Meenal:

Oh yeah. So I was saying in this case, DynamoDB was replaced by Redis.

Demetrios:

Okay. Nice.

Austin:

Yeah, I was going to say that idea of adding a caching layer is also something that we’re going to think about.

Demetrios:

Oh, okay. Cool. Regarding request hedging, how do you know the request would take more than 6.38 milliseconds beforehand? You mentioned that requests that took more than 6.38 milliseconds would be sent again.

Austin:

Yeah. So how we determine that is we monitor the latency of the requests going to this system, and by doing that, we can chart that histogram. That actually comes from our application performance monitoring. By observing that distribution, that’s how we choose the delay. In that example, the p95 latency was about seven milliseconds, so that means 5% of requests are going to take longer than seven milliseconds, just from the underlying distribution. So if you set your hedging delay to that, it translates directly: 5% of requests are going to require a hedge request, just because you’re waiting for the p95 latency. That’s how it works. I hope that was right.

Demetrios:

Right. I see. I see. So another one coming through, not sure if people can read that. It might be a little small, but if you’ve got a big monitor, you can follow along with us. Was that even for a cached thread pool? Can you see what that says? I think it’s regarding the question above.

Meenal:

We can actually go and answer these questions in Slack also. I can’t read the screen because I have a small monitor, but I’m happy to-

Demetrios:

I can ask-

Austin:

[inaudible 00:29:28] having a hard time.

Demetrios:

I’m making it so hard. I’m just going to go back to the normal one to make you not want to try and read it. So how about this one? Can you repeat how the fanout analysis, computing the probability of latency greater than a hundred milliseconds, fits into the big picture of increasing throughput and performance tuning?

Meenal:

Yeah. So the point there was, as I mentioned in the earlier slides, it fed into our notion of what latency we wanted to provide to our users. When we were looking at our options, our first thought was, “Okay, we have these p99 latencies that our downstreams are providing, and it’s good enough.” But when we actually started doing the math, it was like, “No, there are some issues here that we need to dig deeper into,” and that helped us think more about how to optimize those calls. For example, switching from Dynamo to Redis, or now thinking that there should be some caching layer in front of it. It was more from the latency perspective.

Demetrios:

Ah, yeah. Okay. Last one and then-

Austin:

Yeah. Another interesting thing too.

Demetrios:

Go ahead.

Austin:

Just to add onto that, when we were using Dynamo, we actually had another interesting thing happen: our p99 latency was pretty flat, but our actual endpoint had a bit of a square-wave pattern happening. It wasn’t really clear where that was coming from, but it turned out, when we did look at the long tail, the p99.9, there was this square-wave pattern showing up in the latency to Dynamo. The thought was there might be some connections being restarted or something, but it’s another interesting example of how the p99 and beyond can creep up on you.

Demetrios:

So last one I’ve got for y’all, and then we will jump to the next talk. How did you handle change in operational load from moving to Redis, as opposed to a managed service like Dynamo?

Meenal:

So in this case, we were actually using a vendor, so it was mostly transparent to us. But you are right, that is actually one of the downsides of using Redis: you’ll have to build your own load balancing and other things, so you’ll have much more operational burden.

Meenal Chhabra

Machine Learning Engineering Manager

CashApp

Austin Mackillop

Machine Learning Engineer

CashApp
