Explained Simply: How A.I. Defeated World Champions in the Game of Dota 2

Aman Y. Agarwal
30 min read · Oct 10, 2023

In 2019, the world of esports changed forever. For the first time, a superhuman AI program learned to cooperate with copies of itself and defeated the reigning world champions in the #1 competitive video game on the planet, Dota 2.

In this essay, I will take that breakthrough research paper by OpenAI and explain it paragraph-by-paragraph, in simple English.

This will make it more approachable for those who don’t (yet) have a strong background in ML or neural networks, or who don’t use English as their first language.

In the process, you’ll learn several concepts at the very cutting edge of modern ML and Reinforcement Learning systems, such as Actor-Critic algorithms, Proximal Policy Optimization, and Generalized Advantage Estimation, and develop intuition that will help you keep up with future AI developments.

This paper is also a masterclass in designing and executing large-scale ML projects. Engineers and budding AI entrepreneurs: take notes.

If you have zero prior knowledge of neural networks, you could first read the “Deep Learning” section of this essay.

As for my introduction:

Hi I’m Aman, an engineer and writer. I love making hard topics simple — simplicity is underrated.

Let’s dive in. I will only discuss the first 50% of the main paper, which covers most of what you need to know. For the rest, you could join the discussion here.


Game-playing AI programs have been around for a long time, making global news headlines and defeating world champions in games like Chess and Go. But unlike these board games, an action video game like Dota2 has the following challenges:

  1. Long time horizons: while a game in chess could be over in 20–50 moves (each move is essentially one time-step), given the very nature of Dota 2, a game lasts thousands of little time-steps (whenever you hold down a key on the keyboard, you’re technically taking several actions per second). In essence, you still need to think in terms of moves (e.g.: move the player to a certain location, attack a certain target, etc.) but act in terms of tiny actions. Naturally, the way you train the model to play these games has to be different.
  2. Imperfect information: in board games like Chess and Go, and even some “single panel” video games such as Mortal Kombat or Street Fighter, there is no “hidden information” in the game — both players can fully observe each other’s positions and the moves available to them. The only variable is how the other player chooses to play. In games like Dota, you only know what’s within your field of vision. This “imperfect information” affects decision-making.
  3. Complex, continuous “action spaces”: In games like Chess or Go, you make one discrete move at a time (e.g.: moving a pawn one step, moving the bishop across the board x steps). In video games like Dota2, you make the character move for a few milliseconds at a time, by holding down the joystick or keyboard button, which makes them more continuous. Therefore, action planning has multiple levels — not just what you’re trying to do strategically (big picture), but also how you’ll execute individual moves using your controls.
  4. “High-dimensionality” of the observation and action spaces: This isn’t mentioned in the Abstract, but in the next paragraph. The environment of a game like Dota2 is way “richer” than a chess board, which means more information to process.

None of this is a diss on classic board games, or a claim that Dota 2 is simply more difficult; these are just different challenges that call for new approaches to making decisions in real time.

Now, the most important thing to understand in this paper is that while Dota2 is a challenging game for AI to master (for reasons we’ll discuss in more detail later), this paper is not really about “how AI learned to play Dota2.”

The OpenAI Five program relied on algorithms that already existed before. The key contribution of this project is not that they invented new ML techniques, but rather in how they implemented old techniques. As we’ll see later, this paper is really a masterclass in how to plan, organize, and implement a long-drawn ML experiment on a huge scale.

As they say, they developed a training system that allowed them to train OpenAI Five continuously for 10 months. Even though that represents several million dollars’ worth of investment, without the techniques they used it would have cost several times as much.

Consider this paper a gigantic ML engineering tutorial. I believe that some of the details around how they trained the AI will turn out to be more impactful than the superhuman result itself!

While they mostly talk about their training methods and infrastructure, I will also focus on the underlying ML algorithms, so you get a holistic picture.


They mention how for game-playing AI algorithms, “Reinforcement Learning” (RL) methods have become the go-to. (You don’t need an RL background to understand this essay.)

I’ve explained the Atari 2013 and AlphaGo 2016 papers in other essays (the latter also explains Monte Carlo tree search), which you may want to read too.

The rest is fairly self-explanatory.


As said before, this paper is about how, when faced with a more challenging game, OpenAI’s approach was to simply reuse old problem-solving techniques on a much bigger scale. (I say “simply” in a conceptual sense — in practice, it’s not that easy at all.)

OpenAI utilized thousands of GPUs (graphics processors) over months to train this mammoth model. For context, a single DGX-1 AI supercomputer, which integrates 8 GPUs, cost $129,000 at the time. (Graphics processors are essential for training neural networks because you need to perform a huge number of math operations in parallel, which can’t be done efficiently on a CPU. GPUs were originally developed for video games, to render the millions of pixels and polygons on the screen in real time at high frame rates.)

Another challenge while doing extended ML experiments is that you often realize that the way you’ve set up the problem (could be data pre-processing, model architecture, etc.) should be changed. Traditionally, these are considered huge changes, and the training has to start from scratch. The team was able to devise a means to greatly shorten this re-training time, by developing a set of tools they call surgery.

The fact that they performed a surgery every ~2 weeks (I kinda like how they put it) goes to show how, contrary to most disciplines within computer science and mathematics, ML is a more “science-like” discipline where you have to do lots of real experiments to move forward. I think surgery alone is a great contribution of this research project.


I explained most of this already, with the Abstract. But I want to point out a couple of interesting things:

First, their AI program, by design, selects an action every fourth frame — which means, it observes the game’s state, makes a decision, and executes the action around 7 times per second.

I’m not sure how humans compare in terms of our decision-making and action-taking / reflex speed, or how much this speed varies throughout a single game (given that human stamina is limited).

Second, they mention that “strong play requires… modeling the opponent’s behaviour.” Let me explain what that means with an analogy.

Let’s say you were a karate robot being trained to fight using ML. You’re equipped with a camera and microphone acting as your eyes, and a body with artificial limbs to which you can send different commands (kick type 1, kick type 2, punch A or B, etc.).

Your brain simply gets a set of images and other sensor readings as input — it doesn’t know what your “opponent” looks like, or how he “kicks” you. All you see is a video stream of pixels.

As you capture more data from fights, you start seeing patterns — you start to recognize differences in the video input when the opponent kicks you vs. punches you, or when he gets punched in return.

Over time, as you see certain patterns repeat over and over, you begin to distinguish the opponent from the rest of the environment — and also start to develop mental models of what a “kick” looks and feels like versus a punch, etc.

As you collect even more experience, you get even more insight into the opponent’s behaviour — combinations of moves, and complex strategies.

This is what modeling the opponent’s behaviour means.

The OpenAI Five model has to learn to model its opponents’ subtle behaviours and intentions, simply by collecting lots of experience interacting with the game.

More differences in the complexity of Dota 2 compared to strategy board games.


Okay — this is very game-specific. I’ve never played Dota2 (and have tended to stay away from gaming since college, as a recovering addict), so I have no idea how much these limitations affect the difficulty of training the AI (especially the second one). If you have thoughts on this, do leave a comment. :)

Alright — time to take a quick break, and review what you’ve learned so far.


This is an important section.

Now, we must understand that the goal of this research project was to see if the AI can learn to play at superhuman levels — not necessarily “play at superhuman levels while being subject to the same constraints as a human.”

But that being said, in order to have a holistic view of this AI’s capabilities, we need to be aware that the way this AI program was designed, it had certain advantages over humans that had nothing to do with sophisticated strategies learned through ML, but rather with human limitations in general. Let’s explore some of these advantages.

When humans play a video game, we have to do a lot of things in succession:

  1. Receive the visual input through our eyes
  2. Process it in our brain to understand what we’re seeing on the screen (different players, landscape, etc)
  3. Focus our attention on the information that matters most (depending on the situation and intent)
  4. Make a decision
  5. Convert the decision into a game command / action
  6. Execute the action through the keyboard and mouse, by sending signals to the muscles in our arms/hands

The way OpenAI Five was set up, it had somewhat of an advantage: at each timestep, instead of visual information (images), it directly received a set of observations telling it everything about the game state, in the right format, which it could use directly to decide which action to take.

Another factor here is that since the AI can process all the information it has at once, it doesn’t need to selectively focus its attention on a few parameters while ignoring others (as humans do).

Appendix E shows details of what the “snapshot” of information looked like:

Appendix E: OpenAI Five receives all this information on a platter, at each timestep.

The team mentions in the Appendix that this data is not available to human players unless they click around different menus to purposely access this information. They also say:

“Although these discrepancies are a limitation, we do not believe they meaningfully detract from our ability to benchmark against human players.”

Again, never played Dota, so I don’t know how valuable some of this information is; I’ll take their word for it.


Another important fact: the OpenAI Five program didn’t rely solely on ML. Some of the decisions it made were hand-coded and baked into the model.

The main reason for this is computational complexity — the more complex you leave the problem, the longer and harder the model has to train. You always have to decide how much you can simplify the problem for the AI to solve, while still keeping the integrity of the project.


Self-explanatory, but take note! Randomizing the training environment is a best practice that makes the AI agent more robust and able to adapt to different situations.


A quick note about the technical details of the ML model they used for Reinforcement Learning.

In simple terms, what we’re looking for at the end of the day is a model that observes the game, and then decides which actions to take.

Input: game observation; Output: action choice.

This is what we call a policy. The “observation” can include not just what’s happening in the game right now (e.g.: player A is at position P), but also what has happened previously (player A was initially at position O). This is fairly intuitive (even as humans, we don’t make decisions based on just the present moment, but also on what our opponent has been doing!).

Now, in reinforcement learning, we often use a stochastic policy, which means that instead of saying “In situation S, always do X” (which is called a deterministic policy), you say “In situation S, choose between the actions X, Y, and Z with probabilities P1, P2, and P3.” You introduce a certain degree of unpredictability into your actions.

The reason for this is that in many games, being highly predictable would make you fail miserably; rock-paper-scissors and Poker are obvious examples. Depending on the game, there may also be multiple optimal moves in a given situation.

This is especially applicable to games with continuous action spaces. For example, let’s say you’re driving a car and have to make a right turn. There’s often no “exact” angle that you must hold the steering wheel to. 45.2º may be functionally equivalent to 44.7º, and so on.

This is why it’s better to consider a pool of probable actions while training. Hence, what we want as an output of the policy is not an exact, singular action, but a probability distribution over all actions — with good actions having a higher probability of being sampled than others.

The policy model outputs a list of actions, each with a probability of being selected. The agent then randomly samples an action from this distribution — it’s like throwing a loaded die where, instead of all 6 sides being equally likely to land on top, certain numbers are more likely than others.

A good policy assigns higher probabilities to good actions, and a poor policy does the opposite. Therefore, even though the sampling is random, on average, the better the policy, the higher the reward the agent will earn in the game.
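Here’s a tiny sketch of what “sampling from a stochastic policy” means in code. The actions and probabilities are made up purely for illustration:

```python
import numpy as np

# Toy output of a stochastic policy for one game state:
# a probability distribution over three hypothetical actions.
action_probs = {"attack": 0.6, "retreat": 0.3, "cast_spell": 0.1}

actions = list(action_probs.keys())
probs = list(action_probs.values())

# Sample an action: good actions are more likely, but not guaranteed,
# to be chosen -- like rolling a loaded die.
chosen = np.random.choice(actions, p=probs)
print(chosen)
```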

I won’t explain RNNs / LSTMs here; you can read about them separately. They’re model architectures that have a “memory” aspect — when generating an output based on any data you give them, they also incorporate historical data that they’ve seen before. This makes them a sensible choice given the context above.


Figure 1 shows more details of what information was input to the model, and what its outputs were. First, the observations from the game environment include two kinds of information — about the hero (who is controlled by the model), and everything else (what’s happening in the game, nearby map and landscape, types of heroes and their abilities, etc.).

The former is unique to each AI clone as it plays (naturally, because each AI clone plays with a different hero), and the latter is the same (because the clones share the same data about the game). “Flattened Observation” basically means that the info is put into a simple matrix format. It’s a passing technical detail, and you don’t have to dissect it to understand the paper.

The policy model (referred to as “LSTM,” given the type of architecture it uses) takes in both these inputs (about the hero, and the game) and outputs two things:

  1. Action heads: a probability distribution that shows which actions the model considers “better” (I explained this earlier)
  2. Value Function: at each time step, the model makes its own estimate of “how well it’s doing in the game.” We call this the “value” of being in that state, at that timestep. The higher this value, the higher the total “reward” the agent expects to earn in the game from that point on till the end (and the more likely it expects to win the game).

In human terms, think of it this way: as you get better at playing a sport (say soccer or boxing), doesn’t it also usually mean that you’re getting better at evaluating games? Such as “Which side is doing better” at any given point in a game? The pro has a much better judgment of the game than the amateur.

The value function is essentially a way of quantifying the abstract judgment of “who’s doing better.”
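To make this less abstract, here’s a toy sketch of what a model with an “action head” and a “value head” could look like in PyTorch. Every size and name here is made up, and the real OpenAI Five network is vastly larger and more intricate:

```python
import torch
import torch.nn as nn

class ToyPolicy(nn.Module):
    """A tiny policy/value network: LSTM body, action head, value head."""
    def __init__(self, obs_size=32, hidden_size=64, num_actions=8):
        super().__init__()
        self.lstm = nn.LSTM(obs_size, hidden_size, batch_first=True)
        self.action_head = nn.Linear(hidden_size, num_actions)  # logits over actions
        self.value_head = nn.Linear(hidden_size, 1)              # scalar value estimate

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, obs_size) -- the flattened observations
        out, state = self.lstm(obs_seq, state)
        action_logits = self.action_head(out)       # -> probability distribution via softmax
        value = self.value_head(out).squeeze(-1)    # -> "how well am I doing?"
        return action_logits, value, state

# Usage: one game, 5 timesteps of (fake) observations
policy = ToyPolicy()
logits, value, _ = policy(torch.randn(1, 5, 32))
action_dist = torch.distributions.Categorical(logits=logits[:, -1])
action = action_dist.sample()   # stochastic policy: sample, don't just take the argmax
```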

Now, here’s how we train the model. We’ll start with a human example:

Even when you play a game like Dota2 for the first time, you don’t have to wait for the end of the game to start improving your skills. While playing, you already have enough of a feel for how well you’re doing to slowly improve the way you play. At the end of the game, you’re a better player than when you started, regardless of the actual score.

In a sense, you relied on your internal “value function” (i.e. your judgment) to update your own “policy” (i.e. strategy for choosing actions) instead of the final score.

Even before you’ve finished the first game (and seen the final result of “win or lose”), you may have already learned a fair bit about making better decisions and taking better actions, based on the experience and the rewards you collect along the way.

At any given time, we let the agent play the game as it normally would — i.e. by sampling actions from its current policy. We then let the agent use its own judgment to improve its gameplay — i.e. it updates its policy based on its value estimate. So in a sense, the agent is always trying to learn a policy that does the best by its own standards. It is both the ACTOR and the CRITIC.

The other important aspect of training is improving your judgment — that is, to improve the accuracy of your value function.

Over time, as you play more games to the finish and observe the actual reward values, you can compare them to the predicted values, and thus get better at estimating the value of any given state of the game. This is called the “value update.”

And since the policy is chasing after the value function, improving the value function over time means also improving the policy function — which is what we’re after. We just want the best possible policy model.

This training method, where you constantly improve your policy based on your value estimations (as opposed to always waiting for the end of the game to learn anything) is called the Actor-Critic method. This concept is central to understanding this paper. I will revisit the training methods a little further below. :)
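If it helps, here’s a bare-bones sketch of what a vanilla actor-critic update computes for a single sample. This is a generic textbook illustration in PyTorch, not the exact loss used in the paper (we’ll get to that below):

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(log_prob, value_estimate, observed_return):
    """One-step actor-critic losses for a single (state, action) sample.

    log_prob:        log-probability of the action the agent actually took (tensor)
    value_estimate:  the critic's prediction of the return from this state (tensor)
    observed_return: the (discounted) reward actually collected afterwards (tensor)
    """
    # How much better or worse did things go than the critic expected?
    advantage = observed_return - value_estimate

    # Actor: make actions that beat expectations more likely.
    # (detach so the policy gradient doesn't flow into the critic)
    policy_loss = -log_prob * advantage.detach()

    # Critic: pull the value estimate toward what actually happened.
    value_loss = F.mse_loss(value_estimate, observed_return)

    return policy_loss, value_loss
```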

Take another quick break, and do a quick review. Make sure to understand the intuition behind stochastic policies and actor-critic methods.

Btw, did you know that baby sharks are terrified of goldfish? Actually this is not true, but I wanted to put a little joke in here to force you to take a break.

Interesting — the AI program uses clones of the same model to control each hero on the team, each with access to the same information throughout the game. While human teams, by default, are made up of very different people, the AI team is not.

As for WHY this was done, my guess is that it makes the agents easier to train, and also makes the overall results more repeatable. If you trained 5 different models separately, it would take 5X the computing resources. It would also introduce more variation in the final results if each model played differently — you’d then have to work out which model is objectively stronger or weaker, which model plays better with which hero, etc.


This merely repeats what we talked about earlier, regarding the game information available to the AI model versus humans. I’m not sure why the authors tend to bounce around so much — makes the paper a little harder to follow.


Self-explanatory — when computing resources are constrained, you can only try so many things.


First, let’s discuss the reward function that the AI agent was trying to maximize. As I briefly explained earlier, the way you train an AI agent is to give it a “target” reward quantity that it strives to maximize while playing. It’s up to you to decide what this function will be. It could be simply a 1 or -1 based on victory or defeat, or it could be a very complex system of points and measures.

If you set up the reward function such that simply wandering around aimlessly leads to a higher value, then the AI agent will figure out how to wander around. If you set it up such that the agent only gets rewarded for winning the game in the end, that’s what it will do — no matter how many sacrifices it takes to get there. (AlphaGo, for example, was trained with this kind of binary reward function — it wouldn’t work for this project because an average Go game lasts only 200–300 discrete moves, while a Dota2 game lasts thousands.)

For Dota2, OpenAI decided to hand-craft a complicated reward function that included many smaller rewards beyond just winning (which creates a very high reward). They share more details of this in the appendix; for example, if the agents earn gold, improve their health, destroy buildings, etc., they earn small rewards.

The reason for this is “credit assignment,” which is a major challenge in training AI agents. Let me explain what that is.

As the agent plays the game, it takes several actions. Some of these are good, and some of these are bad. The whole goal of training is to take more good actions and fewer bad actions. But if it only gets a single, overall reward at the end of a winning game, it will associate that win (or loss) equally with ALL the actions taken, good or bad.

Now, this isn’t necessarily bad in theory — if the AI plays enough games, then on average, the good actions will be positively reinforced more often, and bad actions will be negatively reinforced.

But recall that Dota2 is a game with long time horizons, and thousands of possible moves at each time step — with limited access to computational resources, we don’t have the luxury of letting the AI train this way. Therefore, more frequent, smaller rewards collected throughout gameplay are extremely helpful signals. They give more “direction” to the training.

One could even argue that most RL research is aimed at coming up with new ways to better solve this credit assignment problem.

Interestingly, OpenAI decided that some accomplishments generate an equal or higher reward than even winning the game! Note that winning the game has a reward of 5, but destroying the enemy’s “Barracks” has a reward of 6. “Gaining Aegis” or “Ancient HP Change” also generate a reward of 5 each.
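As a toy illustration, a shaped reward function boils down to something like the sketch below. The three weights are just the values quoted above; the paper’s actual reward has many more components and extra scaling rules:

```python
# A handful of per-event reward weights (the values quoted above; the real
# reward function in the paper has many more components and extra scaling).
REWARD_WEIGHTS = {
    "win": 5.0,
    "enemy_barracks_destroyed": 6.0,
    "gained_aegis": 5.0,
    # ... plus many small rewards for gold, health, last hits, etc.
}

def shaped_reward(events):
    """Sum up the partial rewards for everything that happened this timestep."""
    return sum(REWARD_WEIGHTS.get(event, 0.0) for event in events)

# e.g. a timestep in which the team takes a barracks:
print(shaped_reward(["enemy_barracks_destroyed"]))  # 6.0
```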


This is the most technical, “loaded” part of the paper. Let’s start with the last line first.

Recall that we’ve already discussed the goal of our training: to achieve a policy model that takes in-game observations as input, and outputs two things: the action to be taken (in the form of a probability distribution over all possible actions), and the value function (the agent’s estimate of how well it’s doing, or rather, how much reward it expects to earn in the game). You also understand that we use an LSTM model for this, because it enables the model to make use of “memories” from past experiences.

We also just analyzed the reward function that the agent receives from the game after each move or so — this feedback from the game environment is used to improve the model’s outputs over time.

Now, the first half of this paragraph (mentioning PPO, GAE, advantage actor-critic, etc.) provides more intricate details of the algorithm. Each of these technical terms is a whole lecture in itself. I will explain things at a high level and focus on intuition, without getting into the math.

“Advantage Actor-Critic”

Recall that we discussed the intuition behind actor-critic methods earlier (i.e. the model makes its own estimate of the value of a given state in the game, and uses it to make constant policy improvements without waiting for the end of the game).

Actor-critic methods are a family of algorithms united by the same philosophy. In terms of implementation, the most popular variant is the “Advantage Actor-Critic” algorithm. Here’s how it works.

Recall that at any given state s in the game, our agent chooses an action a by sampling from a probability distribution (output of the policy).

Now, different action choices naturally have different values — i.e. some actions are more likely to lead to a high reward/victory than others.

Let’s say that at a given state s, the agent expects a total reward of 5.0 with action a1, 6.0 with a2, and 2.0 with a3.

In this example, if the agent picks the three actions with roughly equal probability, then on average it gets a reward of 4.33 by following its current policy. Therefore, the advantage of taking action a1 versus the average is 5 − 4.33 = 0.67. For a2, the advantage is 1.67, and for a3, it’s −2.33.

That’s it — the Advantage function, represented as A(s,a), is the extra reward expected by choosing a certain action a in state s compared to the “average.” In advantage actor-critic, when comparing actions, we rely on this advantage value.

Why is this helpful? Intuitively, this is done to force the model to strive to do better and better than average; to only take the most optimal actions, as opposed to being satisfied with picking actions as long as they lead to a positive reward.
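Here’s that little example worked out in code (the probabilities are made up just to keep the arithmetic simple):

```python
import numpy as np

# Expected total reward for each available action in state s,
# and the probability the current policy assigns to each action.
q_values = np.array([5.0, 6.0, 2.0])     # a1, a2, a3
probs    = np.array([1/3, 1/3, 1/3])     # assume equal probabilities for simplicity

# Value of the state = average return under the current policy.
state_value = np.dot(probs, q_values)     # ~4.33

# Advantage = how much better each action is than just "following the policy".
advantages = q_values - state_value
print(state_value, advantages)            # 4.33  [ 0.67  1.67 -2.33]
```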

“Proximal Policy Optimization” (PPO)

PPO falls under the advantage actor-critic family, with another small tweak that makes a good difference in practice!

While learning to play the game, your policy gets updated as you collect experience. But what if, after every couple of hours, your policy/playing style looked completely different from before? It would be as if you were overreacting to the most recent game experience, instead of gradually updating your policy over time based on a broader, richer base of experiences.

Learning gradually and smoothly has been found to work better overall than learning in a “spiky” fashion where the policy swings too much. This is done by mandating that after each policy update, the new policy stays proximal (close) to the previous policy. Thus we have the fancy name: “Proximal Policy Optimization.”

In technical terms, the way we achieve this is by “clipping the objective function” so that the ratio between the updated policy and the current policy always stays within a fixed range, (1−x, 1+x), with x being a value that we can set to our liking.
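For the mathematically inclined, here’s the standard PPO “clipped” surrogate objective in code. This is the generic textbook form (with a clipping value of 0.2 as an example); the paper’s full loss has additional terms:

```python
import torch

def ppo_clipped_objective(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """The standard PPO-clip surrogate objective (to be maximized)."""
    # Ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(new_log_prob - old_log_prob)

    # Clip the ratio to stay "proximal": within (1 - eps, 1 + eps).
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # Take the more pessimistic of the two, so large policy jumps get no extra credit.
    return torch.min(ratio * advantage, clipped_ratio * advantage).mean()
```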

Advantage-based “Variance Reduction” Technique

(Or: Generalized Advantage Estimation)

GAE is yet another little technique used to make training more stable, and to ensure that the model continues to improve over time — it helps with credit assignment, which we discussed earlier.

Recall that in actor-critic methods, we have a value function that enables the agent to use its own prior experience to judge how good its current position in the game is. We must update both the policy and the value function.

Now, if we wanted the most real, objective measure of the value of taking an action, there’s no other way but to actually play the game to finish and observe all the rewards, right? This way of calculation is the least biased by the agent’s prior experience. Let’s put this in numbers quickly.

We’re using policy π to sample actions and collect incremental rewards r in the game at each state s.

At a given state s, the total reward expected from taking a certain action a is the immediate reward r0, PLUS the sum of future rewards in the game:

R = r1 + r2 + r3 + r4 + ….

So far so good?

But we also know that future rewards are less and less dependent on our current action, and more dependent on future actions. Therefore, we discount the future rewards progressively using a discounting factor γ (gamma).

So, value of taking action a in state s = Q(s,a) = “Expectation of” [r0 + γ.r1 + γ².r2 + γ³.r3 + ….]

If γ=0, then only r0 gets counted, and the model doesn’t assign credit for any future rewards to this action. (But that wouldn’t work for any game that requires any semblance of a strategy.)

If γ=1, then every single future reward gets the same weight as the current reward.

This was about discounting rewards. (P.S. If you’re familiar with finance, and know how to use discounted cash flows to calculate the present value of an asset, it’s a very similar concept.)

But of course, here we completely ignored the value function, which would make training extremely slow — without using prior experience/self-judgment to improve while playing, the agent wastes a lot of time collecting gaming experience with dud policies that only get updated once in a while.

So, let’s say we decided to do a combination of both: to train the agent based on both real, observed rewards AND the value function.

Now, we can choose how many steps we want to look ahead at for real rewards, with the rest being approximated by the value function V.

This makes our training more biased (by the agent’s prior experience), but reduces the variance (i.e. makes the agent learn faster and do better in the short term). So we call this a “variance-reduction technique.” Ring a bell? :)

As we saw earlier,

Q(s,a) = E[r0 + γ.r1 + γ².r2 + γ³.r3 + …. ]

But we can choose to look ahead only for a few rewards, say until k=4 timesteps, and use V(s) to replace the rest:

Q(s,a) = E[r0 + γ.r1 + γ².r2 + γ³.r3 + γ⁴.V(s4)]

If we chose k=10:

Q(s,a) = E[r0 + γ.r1 + γ².r2 + γ³.r3 + γ⁴.r4 + … + γ⁹.r9 + γ¹⁰.V(s10)]

And if we chose k=1:

Q(s,a) = E[r0 + γ.V(s1)]

So you see, the higher our “lookahead” k, the more we rely on real rewards. If we only wanted to use real rewards, we’d let k go to infinity, i.e. wait until the end of the game.
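Here’s a tiny sketch of that k-step lookahead in code, using the same convention as above (k real rewards, then the value function fills in the rest). Toy code, not the paper’s implementation:

```python
def k_step_return(rewards, values, t, k, gamma=0.99):
    """Estimate the return from time t: use k real rewards, then bootstrap with V.

    rewards: observed rewards r_0, r_1, ... (must contain at least t + k entries)
    values:  the critic's value estimates V(s_0), V(s_1), ... (at least t + k + 1 entries)
    """
    ret = 0.0
    # Real, observed rewards for the first k steps (discounted).
    for i in range(k):
        ret += (gamma ** i) * rewards[t + i]
    # Let the value function approximate everything after that.
    ret += (gamma ** k) * values[t + k]
    return ret
```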

Now, someone came up with an idea. Instead of having to pick a single value of k, how about doing a weighted average of multiple k values?

We also subtract the average (the state’s value) from these estimates, because we only care about the advantage of choosing a particular move.

By doing that, our advantage of a particular move becomes:

A(s,a) = (1−λ) · [A(k=1) + λ.A(k=2) + λ².A(k=3) + λ³.A(k=4) + …]

where λ is another weighting factor (between 0 and 1) used to discount the longer lookaheads, and the (1−λ) in front simply normalizes the weights so they sum to 1.

This is called Generalized Advantage Estimation. It’s a way to combine past experience with actual observed rewards while training, to have the best balance between bias and variance.
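If you like code, here’s how GAE is usually computed in practice: an equivalent recursive form based on one-step “surprises” (deltas), rather than explicitly summing all the k-step advantages. This is a generic sketch, not OpenAI’s implementation:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation in its usual recursive form.

    rewards: r_0 ... r_{T-1} observed during a rollout
    values:  V(s_0) ... V(s_T) from the critic (one extra entry, for bootstrapping)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Work backwards: each step's advantage is its one-step surprise (delta)
    # plus a discounted share of the advantages that follow it.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```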

It’s yet another little technique to make sure the training is stable and gradual. It might feel like we’re splitting hairs here, doing all these little tweaks to algorithms, but here’s the thing — when you’re spending millions of dollars on an experiment, you want to squeeze as much value from your money as possible. Training time in ML isn’t free, so it behoves us to make use of as many tricks as possible to ensure that training is fruitful.

In this paper, they use k=10 for the max lookahead.

A quick footnote in the paper that I think is worth mentioning.

Take a short break. Review what you just learned.


Now, the rest of this section is structured a little weirdly in the paper (the authors explain the training system in the opposite order of what’s intuitive to follow, IMO). I’ll explain the overall system here first, and then dive into smaller details.

In training any model to play a computer game, we basically have 4 steps:

  1. Take a policy (whether pre-trained or un-trained)
  2. Play a game using the policy:
    Take the game state/observation and pass it into the network to get the output action, take the action in the game, and observe the new state and reward, etc.
  3. Use the state-action-reward data (generated in step 2) to update the weights of the policy model.
  4. Replace the policy in step 1 with the updated one, and repeat.

To implement the above steps, we need 4 major components, or “structural blocks” as I like to call them:

(i) First, we need a block for training/updating the policy model, as in step 3 above. This block has to have ample GPU resources, and a memory bank where it stores the state-action-reward data samples (called the experience buffer). This block is called the Optimizer (it optimizes the policy using gradient descent algorithms).

(ii) Now, in a project like this, we often want to make old and new policy models play against each other, to track improvements. So, in each training run, we need to select which versions of the policy will be used to play the game in a given experiment. For this, we have a block that gets the newest version of the policy model from the Optimizer and stores it along with several previous versions. This block is called the Controller — it orchestrates the entire experiment, not only choosing which versions of the policy play against which, but also stopping and starting training runs, etc. It is responsible, in a way, for steps 1 and 4.

(iii) Next, we need a block that receives the model(s) chosen by the controller for the current training run. This block performs step 2 — it receives the game state, passes it through the policy model, and outputs the recommended action. Therefore, this block is called the Forward-Pass GPU (we need a GPU to quickly pass data through a neural network).

(iv) Finally, we need a block that simulates/renders the game — it runs several instances of Dota2 on different player “machines” in parallel (i.e. 10 different heroes split into 2 teams). At each time step, a player machine captures the game state and sends it to the Forward-Pass GPU block (which acts as its “brain”), receives the action from it, and executes it. It then takes all this information (state, action, reward) and sends it to the Optimizer, to be stored in its experience buffer. These are called “Rollout” worker machines because they “roll out the policy” (i.e. see how it acts in the game for a given duration).

The key takeaway here is that they separated these 4 things into distinct programs that can be run in parallel! The game runs independently from the training. The forward-pass GPU and the Rollout machines work together to play Dota2 and generate training data, while the Optimizer stores that data in a buffer and uses it to train the neural network independently without “waiting” on the other blocks.
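As a (very) loose analogy, here’s the shape of such a decoupled system sketched with Python threads and a queue. The real system runs across thousands of machines with custom infrastructure; this toy only shows how rollouts and optimization can proceed without waiting on each other:

```python
import queue
import threading

experience_buffer = queue.Queue()   # the Optimizer's buffer of (state, action, reward) samples
latest_policy = {"version": 0}      # stand-in for the current policy published by the Controller

def rollout_worker(worker_id, num_steps=200):
    """Plays the game with the current policy and ships experience to the Optimizer."""
    for _ in range(num_steps):
        # In the real system: send the game state to a Forward-Pass GPU, get an
        # action back, execute it in Dota 2, and observe the resulting reward.
        sample = {
            "worker": worker_id,
            "policy_version": latest_policy["version"],
            "state": None, "action": None, "reward": 0.0,   # placeholders
        }
        experience_buffer.put(sample)

def optimizer(num_updates=10, batch_size=64):
    """Pulls experience from the buffer and updates the policy, independently."""
    for _ in range(num_updates):
        batch = [experience_buffer.get() for _ in range(batch_size)]
        # In the real system: compute GAE advantages, PPO losses, a gradient step...
        latest_policy["version"] += 1   # publish new weights for the workers to pick up

# Rollouts and optimization run in parallel; neither waits for the other to finish a "turn".
threads = [threading.Thread(target=rollout_worker, args=(i,)) for i in range(4)]
threads.append(threading.Thread(target=optimizer))
for t in threads:
    t.start()
for t in threads:
    t.join()
```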

This is good engineering — the more “modular” our system is, the more flexible it is to changes and updates. Since ML is still a highly experiment-oriented rather than a theoretical science — you really have to train the network and see what happens — this sort of flexibility is extra useful. (P.S. I do consider ML a science; I’m less sure about giving that distinction to theoretical physics, which feels like borderline philosophy.)

Now, study the figure above if you haven’t already, and connect the dots. The rest of this section will be easier to follow. :)


Much of this is self-explanatory, as we covered above. I’ll instead explain how gradients were “averaged across the pool using NCCL2 allreduce”:

When you’re trying to train a really large network by splitting the workload across several GPUs in parallel, you need the processors to share information with each other so that they can work together as a whole.

How do the GPUs exchange information amongst each other? We can’t just physically hook them up on a board together and have them figure out how to work together automatically — we need some specialized software and protocols to enable and orchestrate the process.

To make this communication even faster, you want to write really low-level code that bypasses the operating system altogether, helping the GPUs talk to each other directly. For exactly this purpose, NVIDIA (the maker and supplier of these expensive GPUs) provides a code library called NCCL (NVIDIA Collective Communications Library), pronounced “Nickel.”

One of the functions contained in this library is called “Allreduce.” It performs simple math operations (like summation, averaging, finding the min or max value, etc.) on data coming from the various devices, effectively “reducing” it all into a single result, and then makes that result available to all of the devices (hence the name allreduce). In the paper’s context, it’s used to average the gradients before they are applied to the model parameters.

This operation enables them to have a much larger “effective batch size”, by combining the individual batch sizes of each GPU into one. We’ll talk more about batch sizes a bit later.
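To give a flavour of what “averaging gradients with allreduce” means, here’s a sketch using PyTorch’s distributed API (which can use NCCL as its backend). It assumes a process group has already been initialized; the paper’s actual infrastructure is custom-built:

```python
import torch
import torch.distributed as dist

def average_gradients(model):
    """Average each parameter's gradient across all GPUs/processes.

    Assumes torch.distributed has already been initialized,
    e.g. dist.init_process_group(backend="nccl").
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # all_reduce sums the gradient tensors from every process, in place...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...and dividing by the number of processes turns the sum into an average.
            param.grad /= world_size
```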

I find it kind of heartening that OpenAI give a shout-out to Nvidia, their supplier, specifically mentioning a helpful feature of the product. :)


They’re basically sharing more technical details of steps 3 and 4 from the system we discussed earlier — such as the algorithm they used for updating model parameters (Adam), etc. They also share some other techniques they used to improve training:

“Truncated backpropagation through time” is simply a variant of the standard backprop algorithm, specialized for training sequential models over long sequences (remember that they used LSTMs in this project so the policy could keep a “memory” of sorts, letting past experience influence future actions). They also mention that gradients are clipped, etc.
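For reference, a single optimizer step with Adam and gradient clipping looks roughly like this in PyTorch. The learning rate and clipping threshold below are placeholders, not the paper’s values:

```python
import torch

def adam_update(policy, total_loss, optimizer, max_grad_norm=0.5):
    """One optimizer step with gradient clipping (threshold here is made up)."""
    optimizer.zero_grad()
    total_loss.backward()
    # Clip the overall gradient norm so one bad batch can't blow up the parameters.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=max_grad_norm)
    optimizer.step()

# Usage sketch: optimizer = torch.optim.Adam(policy.parameters(), lr=5e-5)
```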

Lili Jiang wrote an excellent article on these topics which I can heartily recommend for those inclined.

We’ve already talked about what the Controller is, previously.


Self-explanatory — we discussed this earlier when I described the overall training system. :)


When rollout machines (which play the game) send the state-action-reward data to the Optimizer to update the model, they do so asynchronously — the reason is discussed in the PPO algorithm and actor-critic methods above.

The GAE details should be self-explanatory based on our discussion earlier.

Finally, the “blocksparse” library is one that was released by OpenAI themselves. It’s essentially a toolkit of optimized GPU kernels that they developed to further speed up training and make it more efficient.

Take a short break again.


Finally, we talk about one of the neatest tricks used in this project — “surgery,” a way to modify a deep neural network to deal with a different format of input and output, without having to retrain it from scratch. Here’s some context (from their Appendix B, which I highly recommend reading in the original paper later) about what prompted them to perform surgeries in general:

Most surgeries were done because the team wanted to update the information available to the agent while playing the game. As we often say in the ML community, “garbage in, garbage out!” Having the right input data is 80% of the game (pun intended).

It’s mostly self-explanatory. In essence, they would do an offline operation (i.e. which doesn’t involve training), to initialize a new policy model that is compatible with the new environment, but functionally the same as the previous model (i.e. given the same input, both models would output the same result).

As to why this surgery involves “offline methods,” I’ll explain in a bit.


Net2Net transformations, as you can guess from the preceding paragraphs’ context, basically allow you to create a new network (called the “student”) that preserves the function of the original network (the “teacher”).

The key thing to realize is that there’s no “one method” for Net2Net — it’s a family of methods, and you can get from A to B in many different ways. In the original paper for Net2Net, they suggested 2 ways: you basically just take the teacher network, and either widen or deepen it:

  • To widen a network, you replace certain layers with larger layers (which contain the same weights as the layers being replaced, but with additional rows or columns of parameters).
  • To deepen a network, you replace a given layer with two layers of the same size.

You can also combine the two methods — both widening and deepening it at the same time.

Now, the interesting thing here is that this transformation does not involve training the student model at all. You could say that it’s a more direct, “mechanical” approach to transforming models. This is why it’s an offline method.
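Here’s a minimal numpy sketch of the “widen” transformation for a plain fully-connected layer, just to show why it preserves the network’s function. The real surgery on OpenAI Five’s LSTM is considerably more involved:

```python
import numpy as np

def widen_layer(W_in, b_in, W_out, unit):
    """Net2WiderNet-style widening: duplicate one hidden unit, function preserved.

    W_in:  (hidden, inputs)   incoming weights of the layer being widened
    b_in:  (hidden,)          its biases
    W_out: (outputs, hidden)  weights of the *next* layer
    unit:  index of the hidden unit to duplicate
    """
    # The new unit gets an exact copy of the chosen unit's incoming weights and bias.
    W_in_new = np.vstack([W_in, W_in[unit:unit + 1]])
    b_in_new = np.append(b_in, b_in[unit])

    # The next layer now receives that activation twice, so each copy's
    # outgoing weights are halved -- the overall output stays identical.
    W_out_new = np.hstack([W_out, W_out[:, unit:unit + 1]])
    W_out_new[:, unit] /= 2.0
    W_out_new[:, -1] /= 2.0
    return W_in_new, b_in_new, W_out_new
```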


Notice the unit they used to measure how much compute they used: PFlops/s-days. “PFlop” = petaflop.

One petaflop/s-day (pfs-day) consists of performing 10¹⁵ neural net operations per second for one day, or a total of about 10²⁰ operations. It’s quite a bit.

Now let’s talk about the 3 ways they “scaled up”:

  1. Batch size: In general, batch size dictates how much data the ML model has to learn from in each turn. A smaller batch means that the data is divided into more chunks, and thus the model gets updated more often — it processes one chunk in each turn, and doesn’t need to look at all the data before it starts learning.
    A large batch means that the model has to process much more data in each turn (which it may not be able to do, given computational / memory limitations), but it can learn more quickly that way (fewer turns = less training time).
  2. Size of the model: This is a classic way to increase model performance — larger models simply have much greater learning capacity.
  3. Training time: more training time = more collected experience.

We’re now half-way done with the paper, and we’ve covered the most important aspects.

If you’d like to stop here, feel free to do so — it was enough to give you a good fundamental understanding of what the project was about and how it worked.

The rest of the paper is for more serious ML/AI enthusiasts — with more unique tricks and details, such as the perils of surgery when you train for long periods, clever metrics like “staleness” and “sample reuse” to make training faster and better, etc.

To read and discuss these details alongside others like yourself, you can go to DenseLayers — a new “social network” for scholars interested in frontier AI research. :)

