##### Dr. Waleed Kadous

Head of Engineering

Anyscale

apply(conf) - May '22 - 30 minutes

Reinforcement Learning has historically not been as widely adopted in production as other learning approaches (particularly supervised learning), despite being capable of addressing a broader set of problems. But we are now seeing an exponential growth in production RL applications: so much so that it looks like production RL is about to reach a tipping point to the mainstream. In this talk, we’ll talk about why this is happening; detail concrete examples of the areas where RL is adding value; and share some practical tips on deploying RL in your organization.

It’s an absolute pleasure to be with you folks again. I had such a blast last time with D at last year’s apply conference, and I’m really grateful for the introduction. I head engineering at Anyscale. And with that, let me just present my slides and then we’ll take it from there. Cool. So hopefully everybody can see that. And yeah.

So today I’m going to talk to you about something that I really feel passionately about, but is also kind of a controversial topic, which is: are we on the verge of production reinforcement learning getting to a tipping point where everybody wants to use it and everybody is using it? We all know that there’s a very standardized way of doing ML, which is training, modeling, monitoring. Reinforcement learning challenges that, but the results and the potential are huge.

And I specifically chose a controversial topic and I look forward to your questions during the discussion. So here’s the overall view of what I’m trying to say. Basically research is showing that reinforcement learning is doing amazingly well. All of the superhuman performance systems that we’ve seen, like AlphaGo, AlphaStar, Dota 2, all of those are basically reinforcement learning based. And even as recently as yesterday, with the release of Gato, which is this new system from Google that can handle 600 different tasks or more, at the heart of it is a little bit of reinforcement learning, especially for the robotics tasks. But yet we’ve very rarely seen production RL, reinforcement learning in production. And why is that? So as we try to explore this question, I think sometimes we get freaked out by RL a little bit, because it’s something new, but I want to show how it’s a very natural extension of things that probably many of us do already.

I’m going to come back to that question about whether RL is at a tipping point or not. And then based on our experiences, I want to share some tips for finding successful patterns of RL applications and some traps to watch out for. And so I’m going to give you my conclusion, which is basically it’s not yet off the shelf for a lot of problems, but there are little bits of problems and areas and patterns where it’s starting to become incredibly promising. And these stories are based on our experiences at Anyscale, working with people using RLlib. RLlib is an open source, distributed reinforcement learning system. There aren’t many of those around, especially ones that are production ready. So these are stories from our customers and ones that we’ve observed as people have used RLlib. So I want to go back to this thing of just explaining RL in terms of a complexity spectrum.

And really I’m hoping that by doing this, we can build a mental model of what is easy to do with RL and what isn’t, and also provide kind of a map. You know, we’re starting off with some things that all of us may have used, like bandits. And then hopefully this will give us an idea of how to escalate as we start to dip our feet in the water of reinforcement learning and take it further. And especially as we talk about deployment, we’re going to move on. So first we’re going to start with bandits. Now I’m curious, I was actually going to ask you folks, and maybe D can summarize what people’s comments are, but who has used bandits in production? I mean, they are a fairly straightforward, well understood technology that has been around for a while. But just to introduce the problem to you, if you haven’t seen it before: imagine you have four slot machines and each of the slot machines gives you a payout based on a probability that you don’t know yet.

The state is basically, there’s not a lot of state, but your choice is which of the levers do I pull? So one of the machines might pay out $10 if you pull the right one and it might charge you a dollar if you don’t. So a good practical example is UI treatments. Imagine you have five different ways of presenting your UI. You’re not sure which is the best. It’s kind of the same type of problem, right? One of them is better, but you don’t want to test them all uniformly. Right? So this is a great example of one of the problems. So I’m just going to ask a question and maybe we can get back to it, but who here has used bandits or experienced bandits and used them in production type scenarios? I’d love just a yes, no type answer just to see how folks use them.

Yeah. So I’m getting a lot of yeses and nos, maybe 50% yes. But to move on, the real challenge with bandits compared to conventional machine learning is the explore-exploit trade-off, right? You always have this problem at the heart of it, which is how do I balance between executing on what I already know to be true in terms of the probability distribution and learning what the ground truth is. So to give you an example, a very typical algorithm is what’s called the epsilon-greedy algorithm. And what you do there is you draw a random number, and if it’s less than epsilon, you do some random thing. So say epsilon is 5%. So 5% of the time you do something random just to see what happens. And 95% of the time you’re executing on the policy you already learnt to really maximize the rewards.
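The epsilon-greedy rule described above can be sketched in a few lines. This is a minimal illustration, not code from the talk or from RLlib: the function name `epsilon_greedy_bandit`, the payout probabilities, and the choice of 0/1 rewards are all assumptions made for the example.

```python
import random

def epsilon_greedy_bandit(payout_probs, steps=5000, epsilon=0.05, seed=0):
    """Play slot machines with hidden payout probabilities (assumed 0/1 rewards).

    Returns the learned value estimate and the pull count for each machine.
    """
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    estimates = [0.0] * n_arms  # running mean reward per machine
    pulls = [0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: with probability epsilon, pull a random lever.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pull the lever that currently looks best.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running mean for this machine.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls

# Four machines; the agent never sees these probabilities directly.
estimates, pulls = epsilon_greedy_bandit([0.2, 0.5, 0.3, 0.8])
print(estimates, pulls)
```

With epsilon at 5%, most pulls go to whichever machine currently has the highest estimate, and the occasional random pull is what lets the estimates for the other machines keep improving.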

So let’s take the next stage, right? So some of you may have used bandits. The next thing after bandits are what are called contextual bandits. And again, going back to our analogy of slot machines, contextual bandits are just like bandits, but there are these variables that affect the performance that may not be visible to you. So for example, you know they do some pretty dodgy things at casinos sometimes, so imagine that they wanted to change the payout depending on whether it was sunny outside or not. Then the sunniness state would be the contextual variable and that would affect the payouts from all of the different machines. And that’s a very natural extension of bandits. And actually that’s seen a huge amount of uptake in recommender systems. The usual configuration is the context is something about the user, the user’s profile.

And the actual state is like a set of TV shows that you’re presenting for them. So they watch TV show one. And now they’re trying to choose what to choose next. And you use contextual data to kind of say, “Well, based on this user profile and that they watched this episode last, I’m going to suggest that they watch this” or give them some options. And this is being very extensively used at places like Netflix and beyond. So that’s kind of stage two. Now, this is where we get to the really interesting part. This is what most people would think of as reinforcement learning, which is going back to our contextual bandits.
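To make the contextual-bandit idea concrete, here is a small sketch built on the sunny-or-not casino example from earlier. It simply keeps one epsilon-greedy table per context value. The class name `ContextualEpsilonGreedy`, the two-machine setup, and the payout numbers are hypothetical, and production recommender systems typically use learned models over user features rather than a lookup table.

```python
import random
from collections import defaultdict

class ContextualEpsilonGreedy:
    """Epsilon-greedy with a separate value table per context (illustrative only)."""

    def __init__(self, n_arms, epsilon=0.05, seed=0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.estimates = defaultdict(lambda: [0.0] * n_arms)
        self.pulls = defaultdict(lambda: [0] * n_arms)

    def choose(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)  # explore
        values = self.estimates[context]
        return max(range(self.n_arms), key=lambda a: values[a])  # exploit

    def update(self, context, arm, reward):
        self.pulls[context][arm] += 1
        n = self.pulls[context][arm]
        self.estimates[context][arm] += (reward - self.estimates[context][arm]) / n

# Assumed casino: which machine pays best depends on the weather.
payout = {"sunny": [0.8, 0.2], "rainy": [0.2, 0.8]}
agent = ContextualEpsilonGreedy(n_arms=2)
env_rng = random.Random(1)
for _ in range(2000):
    context = env_rng.choice(["sunny", "rainy"])
    arm = agent.choose(context)
    reward = 1.0 if env_rng.random() < payout[context][arm] else 0.0
    agent.update(context, arm, reward)
print(dict(agent.estimates))
```

The only change from the plain bandit is that the choice and the update are both keyed by the context, so the agent can learn a different best machine for each weather condition.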

You can think of reinforcement learning as nothing more than contextual bandits with state. In other words, imagine that I have to pull the levers in a certain particular order to get a payout. So say I have to pull three first and then four and then one, and only when I pull the sequence three, four, one do I get a $100 payout, right? And a very practical example of that is chess, right? Moves do not always lead to a payout in the short term, right? Like I move forward a pawn, nothing really happens. I don’t catch a piece. I don’t do anything like that. But those early moves can make the game. There are very long chains of connections of states within the game. And so that’s the new problem that reinforcement learning introduces, which is really this idea of temporal credit assignment. If I have to go through a sequence of steps to get a payout, how do I work out how to distribute the reward over the last few moves so I can work that out?

And of course, there are ways that you can unroll state, but you kind of have to do some kind of temporal credit assignment, and that gets hard. The state space gets very large. So there are approximately 10 to the 80 possible chess games. And then just as there’s assigning things to the past, there’s also a delay problem. So what happens if your reward is delayed? We gave the example before of 3, 4, 1 gives you a hundred dollar payout. What if it’s 3, 4, 1, and then two moves after that you get the $100 payout, but you don’t know that it was 3, 4, 1? So when there’s a delay in reward, that makes it even more complicated. And a good example is a sacrifice move in chess, right? Like early on in the game, you might give up a piece for advantage, which has actually got negative reward in the short term, but the reward is like 15 moves later.
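One common way to spread a late payout back over the moves that led to it is to compute discounted returns. The talk does not name a specific method, so treat this as an assumed illustration: `discounted_returns` and the discount factor `gamma` are introduced here only for the example.

```python
def discounted_returns(rewards, gamma=0.9):
    """Credit each step with its own reward plus the discounted future reward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Pull lever 3, then 4, then 1: only the last pull pays out $100.
print(discounted_returns([0.0, 0.0, 100.0], gamma=0.5))  # [25.0, 50.0, 100.0]
```

The earlier pulls get a smaller share of the credit than the final one, which is one simple answer to the question of how to distribute a reward over a sequence of moves.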

Right? And so this is the type of problem that occurs once you move into what many of us call reinforcement learning, right? All I’m saying is that you can see reinforcement learning as a very natural extension of contextual bandits. The next thing: so far we’ve gone bandits, contextual bandits, typical reinforcement learning. Now we start to expand what are called the state space (where am I in the game? what’s the state of the board?) and the action space (what are the moves that I can make?). So let’s go back to that contextual bandit. And now instead of four machines, I have 32 machines. Suddenly my space has expanded hugely. And imagine now, instead of having to pull one lever, I basically have to put a bet on the levers or I have to pull some of them hard and some of them soft. Suddenly my action space is now 32 dimensional, right?

And of course that dimensionality grows and it makes it really, really hard. So the first way that things get hard, once we get to those standard reinforcement learning problems, is through that state space growing very large. Now, another problem with reinforcement learning is that what we’ve talked about before is this very sequential process. I have a probability distribution. I make a choice. I observe the output and then I see what happens again. But for so many of our problems, especially in the real world, I don’t get that luxury. I just get a log. I just get, here’s what people clicked on yesterday. I don’t get a chance to iteratively build it. I don’t get a chance to do that. So again, going back to the bandit example, imagine that you go to the casino and they say, go play.

And here were the payouts yesterday. Here’s what people did. Go and try and learn a reinforcement learning policy from this. And a good practical example is learning from historical stock purchases. I can’t really test ideas out, unless I build a kind of stock market simulator, which you can do. But it is really starting to become a very common model. And there’s been a lot of research in this space recently on offline reinforcement learning. In other words, how do I build a reinforcement learning model when I can’t actually run the experiments that I want to run? And just like these other stages: when we did RL, it introduced temporal credit assignment. When we talked about large state spaces, we talked about the curse of dimensionality. When you move to offline RL, you’re kind of stuck with whatever experiences the robot or the agent has gone through.
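A tiny sketch of what being stuck with the log means in the bandit setting: you can estimate a payout only for levers that somebody actually pulled yesterday. The helper names `estimate_from_log` and `unseen_actions` and the log contents are made up for illustration. Real offline RL methods are considerably more involved than averaging logged rewards.

```python
from collections import defaultdict

def estimate_from_log(log):
    """Mean logged reward per action; log is a list of (action, reward) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for action, reward in log:
        totals[action] += reward
        counts[action] += 1
    return {action: totals[action] / counts[action] for action in counts}

def unseen_actions(all_actions, log):
    """Actions with no coverage in the log, so nothing can be learned about them."""
    seen = {action for action, _ in log}
    return sorted(set(all_actions) - seen)

# Yesterday's (lever, payout) records; nobody pulled lever 4.
log = [(1, 0.0), (1, 10.0), (2, -1.0), (3, 10.0)]
print(estimate_from_log(log))             # {1: 5.0, 2: -1.0, 3: 10.0}
print(unseen_actions([1, 2, 3, 4], log))  # [4]
```

Checking coverage like this before training is one cheap way to spot situations the logged data never exercised.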

So what if you trained your stock market agent and there was never a recession in your data, right? You might end up with the agent kind of going, “I’m just going to keep buying stuff. Because the prices are so cheap.” Or again, thinking about those contextual bandits, there might be like a bankrupt thing. So if you pull four, three times in a row, it just says, “Sorry, I’m going to take all of your money.” And if you haven’t seen that in the training data, if you haven’t seen that in the offline data, you can end up in some very weird situations where you shoot yourself in the foot. So we’ve talked about how we got from very basic types, from bandits, all the way through to these very complex offline RL applications. The last model of RL that is applicable in many cases is what’s called multi-agent RL.

And so far we’ve been talking about you as a single person pulling the levers. What if there’s two of you? How would you do then? How would you share probabilities? How would you share information? Do we set it up as a competitive situation where the goal is to earn the most money? Do we set it up as a cooperative situation where the goal, like whatever you guys get together, we cut it in half and share it between you? All of these particular types of things take us into a very, very complex and rich realm, which is the realm of multi-agent reinforcement learning. And one way to think about that is the stock market, there’s an assumption of a lot of very small players. But imagine if you’re, I don’t know, to think of an old joke, frozen concentrated orange juice, right?

There might only be five or 10 people that are involved in that. And then you can model what their motivations are and everything else. So the problem with multi-agent RL is it gets way more complicated. What happens to your reward function? How do you define your reward function? And how do you model the different agents? Do they each have a different model? Do they share models? All of those kinds of questions. So just to summarize this and bring us back to where we started, which is this discussion about whether RL is at a tipping point: what we see is that there is this set of problems, as we go from simple to complex, when you try to use reinforcement learning. At the same time, though, we have to come back and ask ourselves that question, is it at a tipping point?

And the thing that would suggest that it is at a tipping point is, again, this huge victory in competitive environments. Like I remember playing a Go machine at AAAI ’98, and everyone thought it was a joke. It was just so bad. And nobody thought we would ever be able to beat the game of Go using computers. And yet that happened three or four years ago. The same thing with StarCraft, right? This was thought to be an unachievable thing. And it’s seen this huge success. And there’s also a huge number of companies that are not just using RL, but using RL in production. All of these companies, we know, use RL for some kind of production application, for real, for suggestions or recommender systems or for modeling agents, all of these types of things. So we know that’s starting to cross the boundary, but why isn’t it becoming more popular?

I think just reflecting on those levels of complexity that we saw earlier, there are really four factors in my mind that have stopped it from becoming production ready. And we’ll go through each of these. And the good news is that for each of these problems, we are seeing really good progress on techniques to deal with them. So I’m going to go through each of them now and we can discuss each of them. So here’s the thing that they don’t say when they tell you that AlphaGo Zero beat human players. On one level, it’s amazing. On the other hand, you start to realize that AlphaGo Zero played 5 million games against itself. That’s way beyond what a human can do. But at the same time, it’s only a company like Google that knows how to play 5 million games of Go in an efficient way, right?

If each game takes a couple of hours, you can do the maths if you’d like. 10 million hours is a lot of hours, right? And the same thing for AlphaStar: each of the agents was trained for 200 years. You know, that’s way more than a human would train. So this is clearly beyond the capabilities of a simple machine. The good news is we’re starting to see progress. One is, again to plug my own company, we’re seeing a lot of improvement in algorithms for training, so you’re not stuck on a single computer. We’re starting to see transfer learning happen so you don’t have to start from scratch every time. As I mentioned earlier, Gato, which is this new system that Google announced just literally days ago, performs on 600 domains, not all of them reinforcement learning. So you don’t always have to start from scratch.

There are techniques called behavioral cloning or imitation learning where you can start with a human expert and copy them. And there are also ways you can do that by reducing parametrized state spaces, limiting actions, and a few other techniques. My point here is these are huge. And I see someone saying games have unlimited data, and RL agents winning those often need absolutely huge amounts of data and sampling. So you’re absolutely right. I think Jonathan, you asked an absolutely brilliant question and I a hundred percent agree with you. But let’s talk about that. Okay. So the second challenge. We talked about the first one, which was lots of training data. The second one is that naive implementations of RL are online. In other words, like I said, it’s designed to run live, and the idea of running anything live with a changing dynamic model in production should be pretty scary to most of you. Or if it isn’t scary, believe me, when you try to do things in production, it’s hard enough to get a model that’s locked in working, right?

Imagine a model that’s actually changing in real time. And the second thing is it’s really hard to get data, to train it on. You basically have to run all of those simulations that you ran before again. Right? And so there has been a lot of progress here, right? The first one is offline training algorithms that can learn to benefit from that data. And what you can also do is start to add counterfactuals. Things that you check that are not broken in the system and dealing with that. A question, yeah so we’ll come back to the questions at the end and we can discuss some more. The third problem is temporal credit assignment, right? Like, so we talked about this very, very complicated issue where you don’t know exactly which action is the one that led to you winning or getting a reward.

And of course that also makes things much harder. And the answer to that is you’d be surprised in how many problems the temporal aspects don’t make a big difference, but there have been a few advances in what are called deep Q-networks and a few other techniques that have helped a huge amount. And then the fourth one is large action and state spaces, right? And it’s obviously very, very hard to do this. This is kind of related to the first issue that we talked about, which is high training requirements. And the implication is there are just too many things to learn in real life. And the progress here is really on high fidelity simulators. And we’ll talk about that a little bit later, but for example, for robots, there are really good simulator systems now available like ROS and even Unity for simulating robots.

You can train many simulations at once and kind of hybridize the results. You can also use techniques like embeddings, and they’ve actually turned out to be critical. So you could do this in two steps, or if you continue with that offline learning path, maybe you don’t have to start from scratch every time, relearning your model. So as we’ve discussed, those things are real constraints. There is progress on them, but we have to realize that those constraints, the large state spaces, the high amount of training that you need to do, the offline nature, you know, those things are real. But are there types of machine learning problems where we’ve seen good success despite those constraints? And the answer is, in our work we’ve seen three of them. Really, one is if you have a good simulator, you’re halfway done, right?

And then the other situation is, we talked about temporal credit assignment. Do you really need temporal credit assignment is a good question. And then the final pattern is problems that people have solved before with optimization techniques. So some of you may remember the field of operations research; they were already working on optimization problems long ago. That shape of problem, things that people would previously use things like linear programming for, sometimes you can just take that out and plug in the reinforcement learning algorithm, right? So let’s go through each of these and I’ll give you some examples of where we have seen this approach work. So we talked about simulated environments and, like we said, it takes a lot of data to train them. But if your problem is virtual in its nature already, or your simulation is a faithful representation of reality, then rather than training your reinforcement learning in the real world, you just train it in a simulator and you can run all of those simulators at once and all of the different simulations update the model simultaneously.

And so you can batch models together. Another variant of this is getting close with a simulator and then doing the final stages and refinements in the real world. So let’s say you have a robot. What you do with the robot is you basically train it in the simulator for 90% of its journey. And the last 10% is done in real life. I think that there are a lot of really good opportunities there. This simulator thing is so important that some people are solving machine learning problems like the offline learning problem by building a simulator first and having the robot learn the policy from the simulator. So sometimes we’re actually seeing the problem being inverted and turned on its head. So a good example of this is a company called Riot Games. I think some of you may have heard of the game League of Legends.

They had a card game, this is the one where you have decks and the different cards have different properties. And so they turned that into a reinforcement learning problem. And the way they did that is the state is the state of the game: who’s winning, what are the remaining cards? The action is which card to play, and you get a reward for winning. Now the problem in games like this is game balance. So what this team did is that they basically created 10 typical decks. And so you have 45 combinations of games and you just play them over and over and over again in a simulator. And then you see if any of the decks has an unusually large victory percentage. And so this allowed Riot to fine tune a commercial game. And I think this is a very, very interesting application of reinforcement learning. But we’re starting to see people use reinforcement learning in other aspects of games, notably QA testing.
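The deck-balance check described above can be sketched as follows. Ten decks give 45 distinct pairings; each pairing is played many times and any deck with an unusually high win rate is flagged. This is only an illustration of the bookkeeping. The helper names, the 0.6 threshold, and the `toy_match` stand-in for the simulator are all hypothetical, and in the setup described in the talk the games would be played by trained RL agents rather than decided by a coin flip.

```python
import itertools
import random

def win_rates(decks, play_match, games_per_pair=200):
    """Play every pair of decks repeatedly and return each deck's win rate."""
    wins = {d: 0 for d in decks}
    played = {d: 0 for d in decks}
    for a, b in itertools.combinations(decks, 2):  # 10 decks -> 45 pairings
        for _ in range(games_per_pair):
            winner = play_match(a, b)
            wins[winner] += 1
            played[a] += 1
            played[b] += 1
    return {d: wins[d] / played[d] for d in decks}

def flag_outliers(rates, threshold=0.6):
    """Decks whose win rate exceeds an (assumed) balance threshold."""
    return sorted(d for d, r in rates.items() if r > threshold)

# Stand-in simulator: a weighted coin flip, with one deliberately strong deck.
rng = random.Random(0)
strength = {f"deck_{i}": 1.0 for i in range(10)}
strength["deck_3"] = 2.0

def toy_match(a, b):
    p_a = strength[a] / (strength[a] + strength[b])
    return a if rng.random() < p_a else b

rates = win_rates(list(strength), toy_match)
print(flag_outliers(rates))
```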

So you build a reinforcement learning agent that breaks your game, that plays your game trying to find bugs in it, and it doesn’t have to be multiple players. It can be just straight against the environment. A second customer example shows that the simulations don’t have to be perfect. One of the people who talked at an RL conference we had a few months ago is JP Morgan. And they were modeling foreign exchange transactions. And often it’s what I described earlier: there are a few big players in the market. And so the state is just the holdings of each party. And then the action is buying and selling. And then the reward is just the profit minus the leftover stock. And so you can now run a simulator of the market, which is basically a market where you model the flow of finances.

And then you can use that to test things like automated trading systems before you release them into production. So we talked about class one, which is if you have a simulated environment you can create simulations, but there’s another set of problems, right? Remember where we started this journey: we started with bandits, then contextual bandits, and then sequentiality. So the question is how many problems are there like that? One way to think about it is a contextual bandit is just a reinforcement learning problem where the reward only depends on the current state and the future reward only depends on the current state. Right? And it’s a gray area then: how much does it depend on the current state? So imagine that your state can be simplified or unrolled or any of these types of things to represent the state in a way that you can actually apply it.

And this has really taken off right now for recommender systems. So for example, one company that’s done this work is Wildlife Studios, another games company. And they basically look at the last few games that you’ve played and your user profile. So it feels like that contextual bandit. And then they basically present, “Here are the five games you might want to play next” and you click on one of them. So one problem that occurs when you think about this is “Wait a second, there’s got to be millions of users and thousands of games. How do they know which is which?” And this is where techniques to deal with large action and state spaces come in. And the big hero here is really embeddings. And that’s why there’s been a lot of success in this space. In fact, this particular problem is so successful that you can now go to Microsoft or Google and say, “I want a recommender system based on reinforcement learning,” and they will give it to you and you can pay dollars for it.
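As a rough picture of why embeddings help with millions of users and thousands of games: if every user and every game is mapped to a short vector, ranking candidate games for a user reduces to comparing vectors. The talk does not describe how Wildlife Studios or the cloud services implement this, so the two-dimensional vectors, the game names, and the `recommend` helper below are purely illustrative assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def recommend(user_vec, item_vecs, k=5):
    """Return the k items whose embeddings score highest against the user."""
    ranked = sorted(item_vecs, key=lambda item: dot(user_vec, item_vecs[item]), reverse=True)
    return ranked[:k]

# Hypothetical 2-d embeddings; real systems learn these from interaction data.
item_vecs = {"puzzle": [0.9, 0.1], "racing": [0.1, 0.9], "word": [0.8, 0.3]}
user_vec = [1.0, 0.0]
print(recommend(user_vec, item_vecs, k=2))  # ['puzzle', 'word']
```

The policy then only has to reason about a compact vector per user and per game instead of a separate entry for every user-game combination.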

Now, there is a little bit of an asterisk there, which is that you have to have these constraints. You have to have quite a lot of data before you can use it, but you can do it. And then finally, the last part is that one take on RL is that it’s a way of doing optimization problems, but in a data driven way. So you can take over from approaches that companies used to tackle this with, like in industrial and manufacturing settings, and just apply reinforcement learning as another way to solve the same problems. And the great thing is optimization problems have the same shape as reinforcement learning, so you’re just yanking one algorithm out and putting another algorithm in. And so one example is Dow Chemical. And so they were using reinforcement learning for scheduling different production lines. And what they could do is, you basically have to work out all the different systems and what you do when, and when you build it.

So the state is how you schedule it. And then the reward is just the money saved. Now, historically they did this using traditional operations research approaches like mixed integer linear programming. But now they could actually compare the performance of reinforcement learning agents against it. And so what you can see here is the improvement over time. And in the end, they were actually very competitive with the mixed integer linear programming. And in many ways they were more robust to things that were outside of the usual state space. So we’re almost at the end here. I just want to leave you with a couple of tips and we’ll take it from there. But the first tip is just to keep it simple. So you can start with stateless and then transition to contextual and then fully stateful. The same thing with on-policy to off-policy to offline.

So start simple and work your way up. And the second thing is we talk a lot about MLOps. No one’s really started to think about RLOps, and the reinforcement learning development workflow is in some ways different. So you need to make sure to give that room. So how do you validate the models? How do you update the models? Do you run with a live model in production? How do you monitor for situations where the reinforcement learning algorithm is well out of its comfort zone? How do you retrain, when really log data is actually a type of offline data? So it’s not the ideal way to train a reinforcement learning model. These are real problems that you have to deal with when you deploy RL in production. So, to just kind of conclude and leave you with some parting thoughts, I think we’ve seen a tipping point in some areas, but by no means universally. There are a few early adopters that are succeeding.

We see three patterns. So things that there are simulators for, things that are on the edge between contextual bandits and reinforcement learning, and optimization problems. This is where we’re really seeing things take off. And the two practical tips are: you don’t have to start with the most complicated multi-agent offline, blah, blah, blah type learning model. Start at the basics with bandits and work your way up to reinforcement learning. And then finally, really think about the deployment workflow. We really need to start to engineer RLOps essentially, and create this new field. And finally, we have a library that can help. So I’ll just leave it there. And here’s a link, if you want to find out more about RLlib. There’s my email address if you want to reach out to me, please don’t hesitate. And I’d especially like to thank the team behind RLlib. Thanks all.

Dr. Waleed Kadous leads engineering at Anyscale, the company behind the open source project Ray. Prior to Anyscale, Waleed worked at Uber, where he led overall system architecture, evangelized machine learning, and led the Location and Maps teams. He previously worked at Google, where he founded the Android Location and Sensing team, responsible for the "blue dot" as well as ML algorithms underlying products like Google Fit.

© Tecton, Inc. All rights reserved. Various trademarks held by their respective owners.
