
Wild Wild Tests: Monitoring Recommender Systems in the Wild

apply(conf) - May '22 - 10 minutes

As with most Machine Learning systems, recommender systems are typically evaluated through performance metrics computed over held-out data points. However, real-world behavior is undoubtedly nuanced, and case-specific tests must be employed to ensure the desired quality. We introduce RecList, a behavioral-based testing methodology and open source package for RecSys, designed to scale up testing through sensible defaults, extensible abstractions and wrappers for popular datasets.

Jacopo Tagliabue:

We are Jacopo and Federico, and today we're going to present Wild Wild Tests: How to Monitor Recommender Systems in the Wild. This is a joint project with Bocconi University, Stanford University, KOSA AI, and Coveo, and today two of the authors of the original library are with you: me and Federico. We're super happy and thrilled to be here with all of you today.

Jacopo Tagliabue:

What is the problem? Well, if there's one thing we want you to take away from this talk, it is that evaluating recommender systems with the standard metrics that we use, both in research and in many industrial settings, is really not enough. And this is a problem, because evaluating recommender systems is crucial to their actual effectiveness in the real world. Take, for example, eCommerce, which is a field we know very well: 38% of users will stop shopping if they are shown non-relevant recommendations. But judging when a recommendation is not relevant is actually much trickier than you would think, if you apply the standard industry best practices to it.

Jacopo Tagliabue:

The problem with recommender systems, as with many learning systems, is that when they fail, they fail silently. It's not obvious at all that the model has failed. As you can see here, there's a job post for a CTO, and the suggestion is an executive assistant, which may or may not be a good suggestion. And the results may be even more dangerous for people on this side of the fence. For example, this is a famous case of Amazon's recommender system going completely astray after being misled by the behavioral data that was fed into it.

Jacopo Tagliabue:

Even worse, the feedback that recommender systems typically gather in the real world, even when that feedback is positive, may actually not be very telling about whether the recommender system is producing a good user experience for the final user.

Jacopo Tagliabue:

Take, for example, Federico, who really loves romance and watched The Big Sick in January. In the test set, a month later in February, Federico actually watched When Harry Met Sally. What we have here are two recommender systems, model A and model B, and we're asking these two models to predict, based on Federico's January viewing history, what Federico would watch in February. Both model A and model B guess correctly: When Harry Met Sally is in both sets of predictions. But as you can see from the other movies in the carousel, there's a big difference in the user experience that Federico receives. In particular, model A is much less relevant than model B, even though Federico will click on the same item in both. So if what we use to evaluate A and B is just a pure, quantitative, pointwise metric, like, for example, hit rate, we fail to understand the nuances of how the system behaves in the wild.

Jacopo Tagliabue:

So the question we ask ourselves, and we ask everybody in the field, is: do your tests capture these issues? Is the way we evaluate recommender systems, for research or for production, actually able to spot these patterns?

Jacopo Tagliabue:

A way to solve this problem is behavioral testing. In behavioral testing, what we try to do is not to evaluate, with one single quantitative metric, what happens on held-out data points in a test set, which is the standard quantitative way to evaluate recommender systems, but to enforce some input-output relationships that we deem important in the use case at hand, and to see if the model we're trying to evaluate actually respects what we expect from a model solving this problem. Take, for example, this very easy to understand behavioral principle: irrespective of what's in the test set, we want to create some cases in which we feed the model an item, and if the model is asked to produce similar items, we need to make sure that the returned items are substitutes for the first one.

Jacopo Tagliabue:

So if the input is a white t-shirt, we want to make sure that other white t-shirts are suggested as similar, and not, for example, pens. Even subtler is the case of complementary items, where we ask models to predict something that is complementary, let's say in an add-to-cart type of recommendation. While it's a very good idea to suggest an HDMI cable to somebody that bought a TV, it's a terrible idea to suggest a TV to somebody that is buying an HDMI cable.
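
To make this concrete, here is a minimal sketch of what such checks could look like, assuming item metadata with a category field and a model exposing similar-item and complementary-item predictions. All names here (`predict_similar`, `predict_complementary`, `catalog`) are hypothetical stand-ins, not part of the actual RecList API:

```python
# Hypothetical behavioral checks; `model` and `catalog` are illustrative stand-ins.

def substitute_share(model, catalog, query_items, k=10):
    """Share of 'similar item' suggestions in the same category as the query.

    A white t-shirt should bring up other white t-shirts, not pens.
    """
    hits, total = 0, 0
    for item_id in query_items:
        query_category = catalog[item_id]["category"]
        for rec_id in model.predict_similar(item_id, k=k):
            hits += int(catalog[rec_id]["category"] == query_category)
            total += 1
    return hits / total if total else 0.0

def complementary_is_asymmetric(model, tv_id, cable_id, k=10):
    """A TV should suggest an HDMI cable, but a cable should not suggest a TV."""
    tv_to_cable = cable_id in model.predict_complementary(tv_id, k=k)
    cable_to_tv = tv_id in model.predict_complementary(cable_id, k=k)
    return tv_to_cable and not cable_to_tv
```

The point of writing expectations this way is that they hold regardless of what happens to be in the held-out test set.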

Jacopo Tagliabue:

Another principle we want to enforce is to measure how bad a mistake is. If we use purely hit-or-miss metrics, like hit rate, for example, we don't really get a sense of how far we are from the actual truth. And of course, it makes a big difference for the user experience when we're trying to predict When Harry Met Sally and model A predicts Terminator while model B predicts You've Got Mail. They're both wrong, but they're not wrong in the same way: one is a reasonable mistake, the other is a terrible user experience.
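
One way to operationalize "how bad is the mistake" is to score the distance between the predicted and the actual item in some latent space, for example item embeddings learned from co-occurrence. A minimal sketch, assuming a hypothetical `item_vectors` dict mapping item ids to embeddings:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_mistake_distance(item_vectors, y_true, y_pred):
    """Average embedding distance between ground truth and top prediction:
    'You've Got Mail' should land far closer to 'When Harry Met Sally'
    than 'Terminator' does.
    """
    distances = [
        cosine_distance(item_vectors[t], item_vectors[p])
        for t, p in zip(y_true, y_pred)
        if t in item_vectors and p in item_vectors
    ]
    return sum(distances) / len(distances) if distances else float("nan")
```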

Jacopo Tagliabue:

And finally, we also recognize that not all data points are created equal. This is particularly important in the digital world, where consumption typically follows a power law. You can get very good aggregate metrics, like MRR or hit rate, just by optimizing for the most frequent items. But what happens on the long tail? What happens for a subgroup of users, or for items that are less frequent? We want to make sure that when we deploy something, all these trade-offs are actually made explicit.
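
As a sketch of how to make that trade-off explicit, you can compute the same metric separately per popularity slice instead of reporting one aggregate number. The bucketing scheme below is illustrative, not a RecList built-in:

```python
from collections import Counter

def hit_rate_by_popularity(y_true, y_pred, train_interactions, head_share=0.1):
    """Hit rate computed separately for 'head' (most popular) and 'tail' items,
    so optimizing for frequent items can't hide a collapse on the long tail.
    """
    counts = Counter(train_interactions)
    ranked = [item for item, _ in counts.most_common()]
    head = set(ranked[: max(1, int(len(ranked) * head_share))])
    buckets = {"head": [], "tail": []}
    for target, predictions in zip(y_true, y_pred):
        bucket = "head" if target in head else "tail"
        buckets[bucket].append(int(target in predictions))
    return {name: sum(hits) / len(hits) for name, hits in buckets.items() if hits}
```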

Jacopo Tagliabue:

To solve this problem at scale, we introduce RecList, a package for behavioral tests of recommender systems. RecList is open source, so everybody can use it for free. It is peer-reviewed, it has been presented at one of the top machine learning conferences in the world, and it is built by the community, for the community.

Jacopo Tagliabue:

RecList is based on a class-based abstraction, very similar, for example, to Metaflow, for those of you who use it. You can see here an example of how easy it is to create a test for a recommendation: you just create a class inheriting from a RecList class, and then any Python function, as long as it's properly decorated, can be used as a behavioral test. And of course, a huge point of RecList is that it comes pre-made with a lot of behavioral tests that you can use on a lot of public datasets you are probably already working on.
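
Roughly, a custom test in the alpha looked like the sketch below. This follows the alpha README, but import paths and signatures may have changed since, so treat it as illustrative and check the GitHub repo:

```python
from reclist.abstractions import RecList, rec_test

class MyRecList(RecList):

    @rec_test(test_type="stats")
    def basic_stats(self):
        """Basic statistics on training, test and prediction data."""
        from reclist.metrics.standard_metrics import statistics
        return statistics(
            self._x_train, self._y_train,
            self._x_test, self._y_test,
            self._y_preds,
        )
```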

Jacopo Tagliabue:

Here you can see how easy it is to go from a dataset to actually running a RecList: it's literally five lines of code. You just pick a dataset, pick a model, train it if you haven't done so before, and then run a RecList by supplying the model and the dataset you want to test.
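
For the prepackaged datasets and models, the flow looks something like this (paraphrased from the alpha README; exact module paths may differ in later versions):

```python
from reclist.datasets import CoveoDataset
from reclist.recommenders.prod2vec import CoveoP2VRecModel
from reclist.reclist import CoveoCartRecList

coveo_dataset = CoveoDataset()              # pick a dataset
model = CoveoP2VRecModel()                  # pick a model
model.train(coveo_dataset.x_train)          # train it, if you haven't already
rec_list = CoveoCartRecList(model=model, dataset=coveo_dataset)
rec_list(verbose=True)                      # run the behavioral test suite
```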

Jacopo Tagliabue:

RecList can be used for research, if you're running a new model and you want to see how this model is performing not just on pure quantitative metrics, but also on behavioral ones. Or of course you can use it in production systems: for example, in your CI/CD pipeline, before promoting a model to deployment, you may want to test not just its accuracy, as is standard practice, but also, for example, whether the behavioral principle about complementary items is actually respected by your model.
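
As a sketch of what such a CI/CD gate could look like: the scores below would come from your RecList run in the CI job, and everything in this snippet is hypothetical glue code, not a RecList feature:

```python
import sys

def behavioral_gate(results, thresholds):
    """Return True only if every behavioral score meets its minimum."""
    passed = True
    for name, minimum in thresholds.items():
        score = results.get(name, 0.0)
        if score < minimum:
            print(f"Behavioral gate failed: {name}={score:.2f} < {minimum}")
            passed = False
    return passed

if __name__ == "__main__":
    # Hard-coded here so the sketch runs on its own; in CI these would be
    # the scores produced by your RecList run on the candidate model.
    results = {"hit_rate": 0.41, "complementary_items": 0.76}
    thresholds = {"hit_rate": 0.35, "complementary_items": 0.80}
    sys.exit(0 if behavioral_gate(results, thresholds) else 1)  # non-zero exit blocks promotion
```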

Jacopo Tagliabue:

So what now? RecList was released as an alpha a couple of months ago, and it has been presented to teams at some of the best recommender system shops in the world to collect feedback. This is one of the first public talks we're giving to start discussing RecList and the problem of evaluating recommender systems with the broader public. We're actually planning right now a beta version of RecList, which will incorporate all the feedback that we've received so far.

Jacopo Tagliabue:

What can you do? Well, check out RecList on GitHub. And if you liked this talk, or even if you didn't, give it a star and start spreading the good news about how to use behavioral testing for recommender systems. And of course, if you try the project or read the paper, really like it, and want to contribute, please get in touch, as we will need all the help from the community to build a better and improved beta version of the package. Thanks so much for being with us today. We'd love to answer any questions you may have. Thank you. Ciao.

Demetrios:

Cool. So for those of you that have questions, we've got Federico here to answer live, and we have a few minutes, so I'll start fielding questions as they come in. Feel free to let us know what you've got. And Federico, while we're waiting for someone to ask a question, I should say what-

Demetrios:

Ooh, we’ve got one. So can you give practical examples for tests?

Federico Bianchi:

Yeah, sure. I think that one of the most important kinds of tests, the idea that we want to embed in RecList, is this kind of behavioral test. Jacopo gave an example of this: the example of suggesting an HDMI cable to someone that buys a television. That makes a lot of sense, because these are true complementary items. But if instead someone is buying an HDMI cable, it doesn't actually make sense to suggest a television to this person, because they probably already have a television. So what we are trying to do is work in a direction in which we can use the metadata inside our datasets to define these behavioral tests and provide a somewhat broader evaluation of our models.

Demetrios:

Yep. Makes sense. We’ve got the GitHub link. Someone is asking for it. It is right here. I’m going to throw that in the Slack. All right. So I really want to know, what’s the idea with RecList? What’s the end game here? Are you thinking that you want to create a company out of this, or is it just something that you see pain in the market and you would like to ease that pain?

Federico Bianchi:

I think it's closer to the second one. So we are actually working on this open source project, and we want to keep it open source so that everyone can use it, in production but also for research. So we have these two goals: we want to make it useful for research, for the evaluation of recommender systems in research, but also as a plug-and-play tool you can use in production to support your system and its evaluation.

Demetrios:

Yeah. Okay. Nice. Is RecList only for recommender systems, or could I use it for any predictive recommendations that I do?

Federico Bianchi:

Currently it's based mainly on recommender systems, and we have embedded some use cases like session-based recommender systems. We're working a little on developing the foundations further. This was the alpha, and we are trying to improve it: we've collected a lot of feedback that we're trying to implement and apply to actually improve the system.

Jacopo Tagliabue

Director of AI

Coveo

Educated in several acronyms across the globe (UNISR, SFI, MIT), Jacopo Tagliabue was co-founder of Tooso, an A.I. company acquired by Coveo in 2019. Jacopo is currently the Director of A.I. at Coveo, shipping models to hundreds of customers and millions of users. When not busy building products, he teaches MLSys at NYU and explores topics at the intersection of language, reasoning and learning (with research work presented at NAACL, RecSys, ACL, SIGIR). In previous lives, he managed to get a Ph.D., do sciency things for a pro basketball team, and simulate a pre-Columbian civilization.
Federico Bianchi

Postdoctoral Researcher

Bocconi University
