
Challenges at the Intersection of ML & Real-Time Data: Lessons Learned Spam Fighting at Facebook

apply(conf) - May '23 - 30 minutes

Spam fighting at scale occupies a unique niche at the intersection between real-time data infrastructure and high-powered anomaly detection and machine learning. When these disciplines collide, a whole host of interesting new challenges are presented by each to the other.

This talk draws on my experience building spam-fighting infrastructure at Facebook and real-time data experience at Rockset. I’ll talk through some of these challenges and explore some of the mistakes engineers make when coming from one side into the other. Challenges to be discussed include:

  • Spam-fighting tends to require low-latency everything. Every aspect of the data system built to support ML needs to account for latency.
  • Large volumes of continuously arriving data need to be queryable quickly. This requires streaming data to be indexed to power ML features. Spammers act quickly, and their previous actions need to show up in the current classification.
  • Fast queries: Most spam is best stopped synchronously before it’s ever written to any system. Classifications must be quick. Features need to be generated quickly or pre-computed. This runs into the classic “materialized view” problems of traditional databases, except in an ML context.
  • Hybrid queries. The most valuable queries tend to combine ML or anomaly-detection techniques (e.g., vector search) with traditional SQL database techniques (e.g., where clauses).
  • Development loop. It’s always a good idea to make your development loop as tight as possible, but this is even more crucial in adversarial or time-critical situations. Every aspect of the orchestration and training of ML workflows becomes latency sensitive as well.

Louis Brandy:

Hi, everyone. My name is Louis Brandy and I want to talk about some of the challenges at the intersection of machine learning apps and realtime data. And here I do mean data in sort of the data infrastructure sense. I’m going to explain what I mean by this title. By the way, it’s slightly misleading, but we’ll get to that throughout the talk.

Louis Brandy:

So this talk is about the intersection of domains. Rockset, where I work today, and the things I’ve been working on recently, sit in this realtime database domain: realtime search, realtime analytics. There’s a huge category of things that this implies, and that’s one of the circles I have a foot in.

Louis Brandy:

Just to be clear, I think this is probably the domain that is least familiar to the people here. It’s certainly relevant, but it’s the least familiar, so I want to define it really quickly. To be a realtime data infrastructure, you need to care about one, and typically both, of these things: you need fast queries and you need low data latency. Data latency is the more subtle and often more important thing that you need in a realtime or analytical data system. The basic definition here is: when you generate and submit a new piece of data, how quickly is it queryable? If it needs to be queryable effectively instantaneously, then you’re in the realtime data space.
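
To make that definition concrete, here is a minimal sketch of what measuring data latency means. The `client`, its `write()` and `query_by_id()` methods, and the record shape are hypothetical stand-ins for whatever data system you use; the point is only the metric itself, the time from submitting a record to the first moment a query can see it.

```python
import time

def measure_data_latency(client, record, poll_interval=0.01, timeout=30.0):
    """Write a record, then poll until a query can see it.

    `client` is a hypothetical data-system client with write() and
    query_by_id() methods; the return value is the data latency in seconds.
    """
    written_at = time.monotonic()
    client.write(record)
    deadline = written_at + timeout
    while time.monotonic() < deadline:
        if client.query_by_id(record["id"]) is not None:
            return time.monotonic() - written_at
        time.sleep(poll_interval)
    raise TimeoutError("record never became queryable within the timeout")
```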

Louis Brandy:

Top secret, I’m going to give away all of Rockset’s internal trade secrets here, so luckily this isn’t being recorded and uploaded to the internet. If you want to query data quickly, you need to build indexes on that data. That’s how databases work, even analytical databases. And if you want low data latency, what that really means is you need a large streaming ingest system that can update those indexes in an incremental, online fashion. That’s how you get data to be queryable quickly: you update those indexes as data arrives, and you keep your queries fast. So that is the secret. If you build those, congratulations, you have a realtime data infrastructure.
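
As a toy illustration of that idea, and not Rockset’s implementation, here is an in-memory inverted index that is updated record by record as data streams in, so each record is queryable the moment it’s ingested. A real system does the same thing with many more index types and far more machinery around durability, sharding, and compaction.

```python
from collections import defaultdict

class StreamingInvertedIndex:
    """Toy inverted index updated incrementally as records arrive."""

    def __init__(self):
        self.postings = defaultdict(set)   # token -> set of record ids
        self.records = {}                  # record id -> original text

    def ingest(self, record_id, text):
        self.records[record_id] = text
        for token in text.lower().split():
            self.postings[token].add(record_id)

    def search(self, token):
        return [self.records[rid] for rid in self.postings.get(token.lower(), set())]

# Each ingest() makes the record immediately visible to search():
idx = StreamingInvertedIndex()
idx.ingest(1, "cheap watches click here")
print(idx.search("watches"))   # ['cheap watches click here']
```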

Louis Brandy:

So that’s us, Rockset. We’re in this larger space. Overlapping that space, more than you might appreciate, which is the premise of today’s talk, is the whole world of AI/ML infra. If you want to have an AI- or ML-powered app, there’s infra that helps you build that, train that, serve that, et cetera. I think the audience here is probably more familiar with this circle, and it’d be impossible to cover it in any detail, so I won’t. I’m just going to assume you know what that means.

Louis Brandy:

And then of course both of those are infra technologies. You need a domain to apply your infra to. My domain was spam fighting. I worked in spam fighting for a long time. To be clear, I worked in spam fighting infrastructure for a long time. So I was at the bottom part of this circle, more so than the top part of this circle.

Louis Brandy:

So I wanted to start today’s talk off with some observations from the middle of this Venn diagram that I’ve shown. I had a unique position here. I’ve spent the last few years building realtime infrastructure, so in some ways the current me is a bit of an outsider here. You all are in the risk world, and I’m in the realtime data world, and I’m here in part to build a bridge between us. The old me from a few years ago spent about five years working on spam fighting infrastructure at Facebook. That me is much more at home in the talks we’ve had today. And for a lot of the old work we did at Facebook for spam fighting, you can find the blog posts and papers that we wrote and lots of stuff. I say cool stuff; I’ll let you decide.

Louis Brandy:

But from this position, I’ve had a very interesting vantage point in the middle. I get to play these two personas and have them talk to each other from each other’s perspective. The infra person in me is always trying to explain to the person building a spam fighting app how to build a scalable thing that’s going to run on every request to facebook.com. And the ML persona knows what it’s like to run these streaming, online, realtime, ML-heavy workloads: what does my data infra need to do to support that, and how can we make the data infra better to support it?

Louis Brandy:

And so one of my go-to rules is you’ve got to get in trouble in the first five minutes of a talk. So let me get in trouble really quickly. One of the things I’ve seen over the years, and this goes back maybe 15 years to when I started in this area, is that these two large domains, these gigantic sections of computer science and technology, have done a poor job talking to one another over the last 15 years. I’ve seen data teams build really good data infra that’s not actually usable in real life for realtime machine learning, spam-fighting-type use cases. And I’ve seen ML teams who need certain kinds of infra try to build it themselves and accidentally wander into building their own database of some form or fashion. And I can assure you that accidentally building a database is about the worst thing you can find yourself doing. That’s a terrible position to be in.

Louis Brandy:

Now I will say that in the last five years, and I don’t even know if it goes back five years, but at least in the last few years, we’re starting to really have the conversation whose absence has bothered me for a long time: these two groups are actually talking to one another in the right ways and building the right stuff for each other.

Louis Brandy:

And so here are places you will see this discussion happening. We talk about feature stores; I know this is a popular topic for a lot of people here. Vector databases are another place; they’ve very recently become very hot, very much in the spotlight. And then of course there are these orchestration layers where you start to treat ML concepts as first-class citizens. Managing the life cycle of a feature, for example, is the kind of thing that ML people desperately need, and it’s not something that data infra has ever really dealt with properly.

Louis Brandy:

I originally had this agenda. If you go look at my abstract, it lists all this stuff. And as I was making slides I very quickly realized this is a two-hour talk, and frankly, if you look at some of the other abstracts, other people were covering some of the same stuff. So I decided to revamp this a bit and laser in a little more. I had this framing of “here are some challenges at the intersection of realtime and ML,” and so forth and so on.

Louis Brandy:

My new talk, the one I’ve come up with, is that I’m going to give you one challenge at the intersection of realtime and ML, and I chose vector search very specifically. One, it’s a very hot topic. Yes, it’s also one of the things we’re working on at the moment. But also, it’s super emblematic of everything else I want to talk about. There’s a whole host of places where these ML, sort of AI, hybrid workloads try to become realtime. And oh, by the way, spam fighting, we’ll get to this, but spam fighting is ruthlessly realtime, and the data people aren’t ready for this. The realtime part of it is just a whole series of curveballs that mess everything up. So that’s what I want to talk about, and I’m going to use vector search as the way to talk about it. We’re going to talk about vectors in the context of both spam fighting and data infra.

Louis Brandy:

So here’s something that hopefully everyone here is savvy to; this is more of a slide for the database conference. Spam fighting infrastructure is realtime infrastructure. Spam fighting, for those of you that know, is pathologically realtime. You are in an adversarial situation. You need to have data as fresh as possible. So this is the kind of query you can imagine a spam fighting person would want to make: how many comments has this user made in the last minute? If it takes me 30 seconds to know that the user is spamming me, because that’s how long the data latency is, I can guarantee you you’re going to get spam in chunks of 30 seconds. That’s exactly what’s going to happen, and they’re going to figure that out super quickly. So data latency in spam fighting is everything. The correct answer to how much data latency I want as a spam fighter is zero, zero milliseconds.
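
Here is a minimal sketch of that “comments in the last minute” query as a sliding-window counter. In practice this state lives in the realtime data system rather than application memory, and the names here are made up for illustration.

```python
import time
from collections import defaultdict, deque

class CommentRateTracker:
    """Toy sliding-window counter: comments per user in the last `window` seconds."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = defaultdict(deque)   # user_id -> comment timestamps

    def record_comment(self, user_id, ts=None):
        self.events[user_id].append(time.time() if ts is None else ts)

    def count_recent(self, user_id, now=None):
        now = time.time() if now is None else now
        q = self.events[user_id]
        while q and q[0] < now - self.window:   # evict events older than the window
            q.popleft()
        return len(q)

tracker = CommentRateTracker(window_seconds=60)
tracker.record_comment("user_42")
print(tracker.count_recent("user_42"))   # 1
```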

Louis Brandy:

On the flip side, I also need queries to be fast. One of the things we found at Facebook was that I need my queries to be synchronous, my classifications to be synchronous; that is fundamental to the product experience. I don’t want the write to go to the database unless I’ve determined it’s not evil. And the reason is that if I do let that write go to the database, I will then have to go back later and clean up that write from the database, effectively doing two writes, write-amplifying my spammers. And there is little I want to do less than write-amplify the spammers. This is bad. So ideally I’m stopping spam before it ever makes it into a transactional database.

Louis Brandy:

So I just described for you, and hopefully the slides come together, why spam fighting is a realtime data infra problem, point blank. I care about query latency, I care about data latency. That’s it.

Louis Brandy:

There’s another advantage that spam fighting has, and this is where data people start to misunderstand the domain that we’re operating in. Approximation is okay. In fact, it’s crucial in spam fighting. This is not a thing database people are used to thinking about. There are a lot of clever algorithms where you can trade exactness for speed, and in the spam fighting world, I will always choose speed over exactness. I’m not going to talk about this today, but there’s a whole super fascinating space here around all the different approximation algorithms you can run on a stream of data.

Louis Brandy:

So: median statistics, P99 statistics, cardinality estimation of a stream. There’s this famous algorithm, HyperLogLog, which, if you’ve never heard of it, go look it up; it’s super cool. Again, frequent elements, right? Show me the most common IP address in this log. It’s the kind of thing you want to know from a stream of data, the heavy hitters, so to speak. For all of these algorithms there are lots of papers written; there’s a lot of cool stuff that can be done in this space. We’re going to talk about vectors, and if you know anything about vector search, you know approximation is going to come back. Approximation is absolutely crucial to making vector search scale in realtime.
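
As one concrete example of trading exactness for speed, here is a small sketch of the textbook Misra-Gries heavy-hitters algorithm, which answers “show me the most common IP address in this log” with bounded memory; HyperLogLog does the analogous thing for cardinality. This is a standard algorithm, not anything Facebook- or Rockset-specific.

```python
def heavy_hitters(stream, k):
    """Misra-Gries: approximate frequent elements using at most k-1 counters.

    Any element occurring more than len(stream)/k times is guaranteed to
    survive; the counts are undercounts, so a second pass (or tolerance for
    false positives) is needed for exact answers."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            for key in list(counters):       # decrement everything, drop zeros
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

ips = ["10.0.0.1"] * 6 + ["10.0.0.2"] * 2 + ["10.0.0.3", "10.0.0.4"]
print(heavy_hitters(ips, k=3))   # {'10.0.0.1': 4} -- the heavy hitter survives
```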

Louis Brandy:

So let’s talk about vectors. There’s a classic question that we asked for years, they’re still asking, I’m sure they’ll ask for decades to come. Is this photo being uploaded to the website spam? It’s not a Facebook specific question. A lot of people are struggling with this exact question. It’s a very hard question to answer. Traditionally, you would try to train a classifier on the point classification. So in other words, a particular image has been uploaded. Here’s the pixels of that image. Here’s the user that uploaded that image. Here’s the metadata associated with this upload.

Louis Brandy:

Can I train a classifier to detect if this is spam? Historically speaking, these classifiers haven’t worked very well. It’s very difficult to write a classifier that looks at the pixels of an image and determines if it’s spam or not. If you can do that, you’re hired. Everyone will hire you. For whatever it’s worth, in the last, and I don’t even think I can say years, in the last months, it’s possible that AI has actually made non-trivial progress on this problem. In other words, we may be in the presence of building artificial intelligence that can look at an image and help determine if it’s spam. But that remains to be seen. Historically speaking, this hasn’t worked particularly well.

Louis Brandy:

So enter vectors. Now I’m going to mostly take for granted that most people out there understand how vectors help in this space. But just in case you don’t, I will take one slide to explain what’s happening here. We’re not talking about physics class here. When we say vectors, what we mean is: given any arbitrary unstructured data, I can project that data into a vector space. In this case I always think images, that’s my go-to example, but it could be anything, even Facebook users, movies, music, whatever.

Louis Brandy:

You project it into this vector space that you see here on the right. And the idea is that the distance metric in this vector space has preserved some semantic we care about. So in this case, points that are nearby are alike in some sense. And so we’ve gone from a very squishy notion of likeness, because “show me photos like this photo” is a very squishy notion, to a very precise one: show me points in this vector space near this query point. And here “near” is specifically defined by a distance metric. For our purposes it could be Euclidean, but there are other distance metrics that you can use.
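
To make “near in a vector space” concrete, here is the exact, brute-force version of that lookup with Euclidean distance and NumPy. The embeddings are random placeholders; in the photo example they would come from whatever embedding model you use.

```python
import numpy as np

def nearest_neighbors(query, vectors, k=5):
    """Exact k-nearest-neighbor lookup under Euclidean distance."""
    dists = np.linalg.norm(vectors - query, axis=1)   # distance to every point
    order = np.argsort(dists)[:k]                     # indices of the k closest
    return order, dists[order]

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 128)).astype(np.float32)   # placeholder "photos"
query = embeddings[42] + 0.01 * rng.normal(size=128).astype(np.float32)
idx, dist = nearest_neighbors(query, embeddings, k=5)
print(idx)   # index 42 should come back as the closest point
```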

Louis Brandy:

So when I add vectors back to my photo question, everything changes a lot. The idea is I can embed my photo. Once I have an embedding, I have a vector for my photo, and I can start to ask a whole new set of very powerful questions about this photo. For example, how many other photos like this have been uploaded? If a photo is relatively unique, it’s probably not spam. At least, it’s a good indication it might not be spam. Do users who have uploaded photos like this one look organic? This one is very powerful and is actually the beginning of an extensive amount of spam fighting machinery at these kinds of large companies.

Louis Brandy:

So for example, if a thousand people have uploaded photos that look like this one and they all were registered from the same email domain, something shady might be happening here. That’s a really good indication that something shady has happened. And now you get to start to separate your classification problems. For example, you can do very large-scale cluster classification offline, and online just determine: does this belong to a bad cluster? Hey, you’re uploading a photo that looks like a thousand other photos, and I’ve already determined that that’s spam, so this is spam. Get out of here.

Louis Brandy:

And now something very nice has happened, which is that the online classification, the synchronous classification, is extremely quick. It is one simple vector lookup. Now, the second half of this talk is going to explain in great detail how even that realtime vector lookup is nowhere near as simple as you think it is. But suffice it to say, I can do extensive amounts of machine learning offline to get us even this far. So that’s very powerful.
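
Here is a minimal sketch of that offline/online split, using scikit-learn’s KMeans as a stand-in for whatever offline clustering you run, and an assumed set of “bad” cluster ids coming from an offline classification pass. In reality the offline side is a much heavier pipeline and the online side hits a vector index rather than scanning centroids, but the shape is the same: the expensive learning happens offline, and the synchronous check is one cheap lookup.

```python
import numpy as np
from sklearn.cluster import KMeans

# ---- offline: cluster historical photo embeddings, label the bad clusters ----
rng = np.random.default_rng(1)
historical = rng.normal(size=(20_000, 128)).astype(np.float32)   # placeholder embeddings
kmeans = KMeans(n_clusters=200, n_init=3, random_state=1).fit(historical)
bad_clusters = {17, 63}   # assumed output of an offline spam-classification pass

# ---- online: the synchronous check is a single nearest-centroid lookup ----
def looks_like_known_spam(embedding):
    dists = np.linalg.norm(kmeans.cluster_centers_ - embedding, axis=1)
    return int(np.argmin(dists)) in bad_clusters

print(looks_like_known_spam(historical[0]))
```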

Louis Brandy:

So I said all this, but the basic idea is that vectors in this photo domain suddenly give me this fan-out notion. I can fan out to nearby, or quote unquote similar, kinds of things, and that lets me do not just classification but enforcement. So if you decide that this kind of image is obviously garbage, I can destroy not just that image but every other image in that neighborhood, in that radius of the vector space, for good.

Louis Brandy:

And so this becomes very powerful. This is extremely powerful, and it powers, well, at least it did power, a significant amount of spam fighting at Facebook. But of course this is a realtime problem, and there are actually a lot of aspects to realtime. I want to bring this towards realtime and explain how, even with this relatively straightforward problem, once we start to add realtime into the mix, everything gets really hard really fast. And that’s what I want to spend the next few minutes talking about.

Louis Brandy:

I want to do a PSA here, because I spent some time in the previous slides lecturing the data people on why the ML is harder than they think it is. But I want to spend a minute talking to the naive ML engineer and telling you that the vector database part is harder than you think it is. A vector database is a database: it stores vectors and it does vector search. But I will make a very controversial but important claim here, which is that a good vector database will have much more database technology than vector technology. And I don’t mean to imply that the vector part is easy. It’s certainly not easy. It’s super hard. We’re going to talk about it. But it joins a whole host of classic hard problems; the 5,000-page database textbook is full of hard problems that you still need to solve if you want to build a good database, vectors or no vectors.

Louis Brandy:

And so one of the comments I’ll see sometimes online in this space is this idea of, well, these vector search libraries exist, I can just download those and I’m off and running to build vector infra. I’m not saying you can’t do that. I’m just saying be very, very, very wary of that kind of claim. That is the kind of thing you say right before you rebuild a database by accident over several years. So again, I’m not saying don’t do it. Lots of people are trying, including us; we’re doing this. Just go in with your eyes wide open if you want to go down this particular road.

Louis Brandy:

I’m going to go through this slide very quickly, but in classic database terms, vectors aren’t indexable, at least not high-dimensional vectors. For low dimensions, databases have been doing this for a while. You have geo types, which is geospatial indexing. That works. Those space-partitioning algorithms will give you two-dimensional indexes for vector data. But for high-dimensional data there’s a thing called the curse of dimensionality, if you’re unfamiliar with it, and the punchline is that it’s unindexable. A high-dimensional index effectively looks at every [inaudible 00:16:59]; it’s the same thing as having no index at all. It does a linear scan. So the punchline is that in classic database terms you can’t really build a conventional database index on vectors.

Louis Brandy:

Now of course, enter, oh yeah, I don’t want to skip this slide. If you go look, OpenAI has a text model, a text embedding [inaudible 00:17:17]. You can use this to turn any blob of arbitrary text into a 1500-dimensional vector and do nearest neighbor search on those 1500-dimensional vectors. The point of this slide is very simply to claim that in the modern vector world, all vectors are high dimensional. I mean, with geospatial there will always be uses of low-dimensional vectors, but the whole AI revolution happening out there is built on high-dimensional vectors. So if you put 1500-dimensional vectors into a K-D tree, you are not going to get performance any better than linearly scanning the entire set of vectors.
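
A quick way to see the curse of dimensionality, under the simplifying assumption of uniformly random points: as the dimension grows, the nearest and farthest points end up almost equally far from the query, which is exactly why space-partitioning structures like K-D trees stop pruning anything and degrade to a linear scan.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1500):
    points = rng.uniform(size=(10_000, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"d={d:5d}  nearest/farthest ratio = {dists.min() / dists.max():.3f}")
# The ratio climbs toward 1.0 as d grows: in high dimensions, "nearest"
# barely distinguishes any point from any other.
```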

Louis Brandy:

And so this enters into a category of algorithms that people may be familiar with: approximate nearest neighbors. These produce approximate indexes, so approximation has returned, and this is not something that data people historically have dealt with particularly well. We don’t have approximate indexes in databases as a general rule, but that’s what these algorithms provide. And this is a huge and active area of research. There’s a lot of stuff here to Google; frankly, there are papers being written and PhDs being earned as we speak on this exact topic, effectively for the reason we’ve spoken about up to this point.
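
As a sketch of what using one of these ANN indexes looks like, here is the open-source FAISS library’s HNSW index (hnswlib, ScaNN, and Annoy are alternatives). The parameters and data are illustrative rather than tuned, and this is emphatically not a production setup, for all the reasons the rest of the talk goes into.

```python
import numpy as np
import faiss   # open-source ANN library; pip install faiss-cpu

d = 1536                                   # a typical modern text-embedding size
rng = np.random.default_rng(0)
corpus = rng.normal(size=(50_000, d)).astype(np.float32)

index = faiss.IndexHNSWFlat(d, 32)         # HNSW graph index, 32 links per node
index.add(corpus)                          # build once, offline

queries = (corpus[:5] + 0.01 * rng.normal(size=(5, d))).astype(np.float32)
distances, ids = index.search(queries, 10)    # approximate top-10 per query
print(ids[:, 0])                              # should mostly be [0, 1, 2, 3, 4]
```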

Louis Brandy:

But ANN is not a silver bullet. It did solve our first problem. Our first problem is that I need fast queries: I need to be able to do quick lookups on vectors, I need to find nearest neighbors in a fast fashion. It does solve that problem. Again, there are some scare quotes around “solving” even that problem, but let’s assume that it solves it. The problem now is that if you’re spam fighting, or you’re in any other realtime app, there’s a whole host of new problems that come, and they come really fast.

Louis Brandy:

And the premise of this talk, to a large degree, is that very, very quickly your ML app writer who wants to do realtime online vector search is going to run face first into brutal database problems. That’s the whole premise of the talk. So I want to go through it right now, just for fun.

Louis Brandy:

So these are the hard problems in vector search, again, especially when we’re dealing with these realtime data applications, like the spam fighting application that I’ve been alluding to up to this point. I want to stress, these are hard problems. I’m about to fly through some slides, and for each one of them, papers are being written, PhDs are being earned, we are working on making it better, other people are working on making it better. Don’t trust anyone who tells you any of these are solved problems. Nothing here is solved. And I will also make the slightly stronger claim that these are rarely avoidable in a realtime context. Every one of these matters almost always. They’re just not the first thing you think of when you want to build realtime vector search.

Louis Brandy:

So the first problem is incremental indexing. How do I add new vectors to this thing? I told you you can query vectors fast. I have an index, it’s an approximate index. I can query vectors super quick. I can get the five nearest neighbors super fast. But what I can’t do is add new vectors to it. Or rather, can I? This speaks not to the fast-queries part of realtime but to the data latency part of realtime, and data latency is absolutely fundamental, especially in spam fighting. I need new vectors to be queryable quickly. It can’t take me 30 seconds or a minute or five minutes or an hour to know that a new vector has been added to the vector space. That’s really bad in an adversarial situation.

Louis Brandy:

Now you might think, and again, some people may know a decent amount about these algorithms and others may not, you might think it wouldn’t be too hard to at least naively insert new vectors into the existing data structures. The truth is you can. You can naively insert them into these data structures, but that tends to deteriorate your index radically and rapidly. The right mental analogy, if you aren’t familiar, is a balanced binary search tree. You spend an enormous amount of effort to build a balanced binary search tree, and if you try to incrementally add to it, you will slowly unbalance it and lose both [inaudible 00:21:27]; well, in the case of these ANN algorithms, you will lose speed and accuracy very rapidly as you try to incrementally index into them.

Louis Brandy:

And so immediately the enterprising engineer starts to think of solutions to this problem. Okay, so if I incrementally index, that degrades my index. What do I do? Well, one option might be to embrace this. I can say, all right, I’m going to just pay the penalty. I’m going to incrementally index, and I know that every hour or so I’m going to need to rebuild my index and then [inaudible 00:21:56] swap my index. And a data person hears that and is like, okay, you’re going to build a database that rebuilds an index and [inaudible 00:22:03] swaps it behind the scenes. You immediately have a million hard problems to answer. How are you doing this across different shards? How are you going to shard this thing? What are the consistency implications of changing your index between queries? You have all these consistency questions that immediately crop up in a classic database scenario when you do something like that.

Louis Brandy:

An alternate approach is something like a batch approach: you embrace the batch style of indexing. The idea here is that you accrue new vectors as they come in and you index them as a group, and you kind of have a history of indexes. So you have these immutable indexes throughout time, and then periodically you take a bunch of older indexes from older epochs and compact them together. If you know anything about realtime database architecture, this is what’s known as a log-structured merge. It’s a very classic architecture in realtime databases. Systems like Rockset are deeply built around this; we have a ton of machinery to do exactly this kind of index compaction over time. This is how RocksDB works, if you’ve ever heard of RocksDB.
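
Here is a toy sketch of that batch/segment idea, with brute-force search standing in for the per-segment ANN structure a real system would build. New vectors land in a mutable buffer so they’re searchable immediately, sealed segments are immutable, and compaction merges old segments in the background. None of this is Rockset’s actual code, just the shape of the approach.

```python
import numpy as np

class SegmentedVectorIndex:
    """Toy LSM-flavored vector index: a mutable buffer plus immutable segments."""

    def __init__(self, dim, seal_threshold=10_000):
        self.dim = dim
        self.seal_threshold = seal_threshold
        self.buffer = []      # recently ingested vectors, searchable immediately
        self.segments = []    # sealed, immutable arrays (each would get an ANN index)

    def add(self, vector):
        self.buffer.append(np.asarray(vector, dtype=np.float32))
        if len(self.buffer) >= self.seal_threshold:
            self.segments.append(np.stack(self.buffer))   # seal the buffer as a segment
            self.buffer = []

    def compact(self):
        if len(self.segments) > 1:                        # merge older segments together
            self.segments = [np.concatenate(self.segments)]

    def search(self, query, k=5):
        parts = list(self.segments)
        if self.buffer:
            parts.append(np.stack(self.buffer))
        if not parts:
            return np.empty(0, dtype=np.float32)
        everything = np.concatenate(parts)                # toy: brute-force across all parts
        dists = np.linalg.norm(everything - np.asarray(query, dtype=np.float32), axis=1)
        return np.sort(dists)[:k]                         # distances of the k nearest
```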

Louis Brandy:

There’s of course a third option, which is not a data problem; I would say it’s more of an ML problem. You could just improve these algorithms. You could build me a better algorithm that is efficiently, incrementally indexable and doesn’t pay a huge cost in performance or speed. By all means, if you do that, publish it, get your PhD, send me the paper; I will happily read and implement whatever you’ve built. That would be great. I think we will make incremental progress on this last bullet point, but this is a very hard problem.

Louis Brandy:

Moving on, hard problem number two is what is known in the ML and AI world as metadata filtering. The other term, by the way, is hybrid search; you’ll sometimes see that, though it’s a bit of an overloaded term. But for the data people, metadata filtering in vector search, or hybrid search, is just the where clause of your database.

Louis Brandy:

So my example here is: show me all the images like this one that were uploaded in the last 10 minutes. The first part, “like this one,” is the vector search. The second part, “uploaded in the last 10 minutes,” is the metadata filter, or in database terms, the where clause. This problem is incredibly hard in vector search, and I would argue it’s worse than hard because it’s not optional. You might be able to get away with not having it sometimes, but I would bet that your application will always be made better by having it. This is a strong claim I’m making here, and in fact, to underline the point, I’ve given I don’t know how many examples of vector search queries in plain language throughout this talk, and every single one of them has had a metadata filter that you would want under the hood when doing the vector search.

Louis Brandy:

It may be obvious why this is difficult to do. I don’t know if I can do it justice in this time, but the basic problem is that your vector index is pre-computed and it’s global. You don’t really have an index of just the filtered data, because the filtered data is a runtime thing. Again, enterprising engineers have come up with solutions. There are three popular strategies here. One is what’s called post-filtering. Post-filtering is the idea that you over-fetch vectors: hey, go fetch me a hundred vectors, I’ll filter those hundred closest using my metadata filter, and hopefully I end up with the five or 10 or whatever I need to serve the purpose I had in mind.

Louis Brandy:

Another option is that you can pre-filter, so you filter first: find me all the vectors that match my metadata criteria, and then I’ll just scan them. This won’t really use the ANN index, because you don’t have an index of just that filtered data. This can also work. I mean, it throws the index out, so it obviously has concrete problems, but in certain situations this works great.
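
Here is a minimal sketch of both strategies, with a brute-force distance computation standing in for the ANN index and `upload_ts` as a hypothetical metadata column. The trade-off is visible in the code: post-filtering may come back with fewer than k results when the filter is selective, and pre-filtering throws the vector index away entirely.

```python
import numpy as np

def post_filter(query, vectors, upload_ts, cutoff_ts, k=5, overfetch=100):
    """Post-filter: over-fetch neighbors from the whole corpus, then apply the predicate."""
    dists = np.linalg.norm(vectors - query, axis=1)        # stand-in for an ANN lookup
    candidates = np.argsort(dists)[:overfetch]
    kept = [int(i) for i in candidates if upload_ts[i] >= cutoff_ts]
    return kept[:k]                                        # may come up short if the filter is selective

def pre_filter(query, vectors, upload_ts, cutoff_ts, k=5):
    """Pre-filter: apply the predicate first, then brute-force scan only the matches."""
    matching = np.flatnonzero(upload_ts >= cutoff_ts)      # no ANN index exists for this subset
    dists = np.linalg.norm(vectors[matching] - query, axis=1)
    return matching[np.argsort(dists)[:k]].tolist()
```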

Louis Brandy:

And then there’s this final option of trying to merge the filtering with the ANN search algorithm directly. And this gets us right back into paper-publishing territory. Again, if you come up with an algorithm that is able to do these two things jointly really well, congratulations, publish the paper. I’ll read that paper. By all means, send it to me.

Louis Brandy:

And then this leads us finally into hard problem number three, which was hidden in hard problem number two: what’s known as selectivity estimation. So if you say to me, hey, give me the five nearest neighbors where X, and again, X might be “uploaded in the last 10 minutes” or whatever other filter you can come up with, the best strategy actually depends on how selective X is. You might have intuited this from the previous slide.

Louis Brandy:

So if X is not very selective, then that post-filtering approach works really well. Just fetch 20, run the filter, and hopefully you end up with at least five that meet the criteria. If X is extremely selective, you pre-filter. So if there are only five vectors in the whole database that meet the criteria of the filter, I can just pull those five directly from the database and scan them. I don’t need to use any index or anything like that. It’ll just work. So if it’s extremely selective, pre-filtering works great.
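
A toy version of that decision, just to show its shape: a real cost-based optimizer would derive the selectivity estimate from statistics it maintains (histograms, samples) rather than taking it as a parameter, and the threshold here is made up.

```python
def choose_filter_strategy(estimated_selectivity, k=5, overfetch=100):
    """Pick pre- vs post-filtering from an estimated match fraction (0.0 to 1.0)."""
    if estimated_selectivity * overfetch >= k:
        # Enough of the over-fetched neighbors are likely to survive the filter.
        return "post_filter"
    # Few rows match: pulling just the matching rows and scanning them is cheaper.
    return "pre_filter"

print(choose_filter_strategy(0.50))     # post_filter
print(choose_filter_strategy(0.0001))   # pre_filter
```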

Louis Brandy:

I’ve led you down a path which is fun, because if you’ve not done a lot of database work you probably don’t know this, but this exact problem has a mountain of literature associated with it. Selectivity estimation is extremely well studied, and, I don’t know if it’s as hard as the other problems, but it’s a mountain of work in your favorite database. And this whole idea of tracking different predicates in a query, figuring out which ones are most and least selective, and then reordering the order in which you do things, is what’s known as a cost-based query optimizer. All your biggest, most favorite databases have one, and they invest enormous amounts of technology into making this good: tracking predicates, tracking their selectivity, reordering them to make queries run well.

Louis Brandy:

And in the case of nearest neighbor, it’s not so hard to add a nearest neighbor index as a participant in this massive optimization effort. But it’s the kind of thing where, if you come at it from the other direction: oops, I accidentally just had to spend 10 years building a CBO for the database I didn’t know I was building.

Louis Brandy:

So all right, that’s kind of the end of my hard problems in vector search. I gave you three: incremental indexing, metadata filtering, and optimizing the metadata queries. What was the point of all this? The point was that you can download a vector search library right now, but if you really want realtime vector search, you run face first into a whole lot of really hard problems. I gave this talk with vector search as the domain, but it’s not the only one. Feature stores have an equivalent talk, where if I want realtime features, guess what? As soon as I add that realtime component, a whole new universe, actually, I can go to the next slide, a whole new universe of hard problems creeps up.

Louis Brandy:

And my view on this is that in a lot of these realtime ML apps, the realtime and the ML intersect and collide in a way that places like this are great for, but in general I want to talk more about this. I don’t want ML people accidentally building a database, and I really want the realtime infra people to understand these workloads, what they need and what they’re looking for, and build better tools for them so that we can make progress.

Louis Brandy:

The fundamental theorem here is that understanding the other side of this gap, building this bridge, which is part of the reason why I’m here, is extremely valuable. And oh, by the way, for the computer scientists on the call: anywhere you wander in this space, you flip over a rock and you find a super hard and interesting problem to work on. Vectors present a bunch of them, but they’re certainly not the only ones.

Louis Brandy:

Okay, so that’s my time. Thank you. I don’t know if we have time for questions, but if not, I’m happy to sit in Slack for the next however long, answering any questions that people have. Thank you for your time.

Speaker 2:

There we go. Awesome, dude. I really appreciate this. This was great. Oh, so good. I mean, all I can say is vector stores, vector embedding is so hot right now. Yeah, I mean, I love how you pulled on all your experience in the past too, so you’ve obviously been through the trenches, you understand the conversations that happen, and I appreciate that.

Louis Brandy:

Awesome.

Speaker 2:

Right on. Well, I will let you jump into Slack, and we’re going to keep it moving.

 

Louis Brandy

VP Engineering

Rockset

Louis Brandy is the Vice President of Engineering at Rockset. Prior to Rockset, Louis was Director of Engineering at Facebook. During his time there, he was an early engineer and manager in Facebook’s Site Integrity organization where his team built much of the anti-abuse infrastructure that powers Facebook’s spam fighting, fraud detection, and other online, real-time classification systems. He also worked on Facebook’s RPC and service discovery ecosystem and built and supported the C++ infrastructure teams responsible for the overall health of the Facebook C++ codebase, working on compilers, sanitizers, linters, and core (and open-source) libraries like folly, jemalloc, and fbthrift.
