You have probably heard about Google DeepMind’s AlphaGo program, which attracted significant news coverage when it beat a 2-dan professional Go player in 2015. Later, improved evolutions of AlphaGo went on to beat a 9-dan (the highest rank) professional Go player in 2016, and the #1-ranked Go player in the world in May 2017. A new generation of the software, AlphaZero, was significantly stronger than AlphaGo by late 2017, and learned not only Go but also chess and shogi (Japanese chess).
AlphaGo and AlphaZero both rely on reinforcement learning to train. They also use deep neural networks as part of the reinforcement learning system, to predict outcome probabilities.
In this article, I’ll explain a little about reinforcement learning, how it has been used, and how it works at a high level. I won’t dig into the math, or Markov Decision Processes, or the gory details of the algorithms used. Then I’ll get back to AlphaGo and AlphaZero.
What is reinforcement learning?
There are three kinds of machine learning: unsupervised learning, supervised learning, and reinforcement learning. Each of these is good at solving a different set of problems.
Unsupervised learning, which works on a complete data set without labels, is good at uncovering structures in the data. It is used for clustering, dimensionality reduction, feature learning, and density estimation, among other tasks.
Supervised learning, which works on a complete labeled data set, is good at creating classification models for discrete data and regression models for continuous data. The machine learning or neural network model produced by supervised learning is usually used for prediction, for example to answer “What is the probability that this borrower will default on his loan?” or “How many widgets should we stock next month?”
Reinforcement learning trains an actor or agent to respond to an environment in a way that maximizes some value. That’s easier to understand in more concrete terms.
For example, AlphaGo, in order to learn to play (the action) the game of Go (the environment), first learned to mimic human Go players from a large data set of historical games (apprenticeship learning). It then improved its play through trial and error (reinforcement learning), by playing large numbers of Go games against independent instances of itself.
Note that AlphaGo doesn’t try to maximize the size of the win, as dan-level human players (the Go equivalent of black belts) usually do. It also doesn’t try to optimize the immediate position, like a novice human player would. AlphaGo maximizes the estimated probability of an eventual win to determine its next move. It doesn’t care whether it wins by one stone or 50 stones.
Reinforcement learning applications
Learning to play board games such as Go, shogi, and chess is not the only area where reinforcement learning has been applied. Two other areas are playing video games and teaching robots to perform tasks independently.
In 2013, DeepMind published a paper about learning control policies directly from high-dimensional sensory input using reinforcement learning. The applications were seven Atari 2600 games from the Arcade Learning Environment. A convolutional neural network, trained with a variant of Q-learning (one common method for reinforcement learning training), outperformed all previous approaches on six of the games and surpassed a human expert on three of them.
The convolutional neural network’s input was raw pixels and its output was a value function estimating future rewards. The convolutional-neural-network-based value function worked better than more common linear value functions. The choice of a convolutional neural network when the input is an image is unsurprising, as convolutional neural networks were designed to mimic the visual cortex.
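To make that concrete, here is a rough sketch (in PyTorch, my own illustration rather than DeepMind’s published code) of a convolutional network that takes a stack of raw pixel frames and outputs one estimated future-reward value per possible action. The layer sizes and frame dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PixelQNetwork(nn.Module):
    """Sketch of a convolutional value function: raw pixel frames in,
    one estimated future-reward (Q) value per possible action out.
    Layer sizes here are illustrative, not DeepMind's exact architecture."""

    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),  # 9x9 feature map after the convs
            nn.Linear(256, num_actions),            # one Q-value per action
        )

    def forward(self, frames):
        return self.head(self.features(frames))

# A batch of one observation: four stacked 84x84 grayscale frames
net = PixelQNetwork(num_actions=6)
print(net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 6])
```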
DeepMind has since expanded this line of research to the real-time strategy game StarCraft II. The AlphaStar program learned StarCraft II by playing against itself to the point where it could almost always beat top players, at least for Protoss versus Protoss games. (Protoss is one of the alien races in StarCraft.)
Robotic control is another problem that has been attacked with deep reinforcement learning methods, meaning reinforcement learning plus deep neural networks, with the deep neural networks often being convolutional neural networks trained to extract features from video frames. Training with real robots is time-consuming, however. To reduce training time, many of the studies start off with simulations before trying out their algorithms on physical drones, robot dogs, humanoid robots, or robotic arms.
How reinforcement learning works
We’ve already discussed that reinforcement learning involves an agent interacting with an environment. The environment may have many state variables. The agent performs actions according to a policy, which may change the state of the environment. The environment or the training algorithm can send the agent rewards or penalties to implement the reinforcement. These may modify the policy, which constitutes learning.
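In code, that interaction loop looks roughly like the sketch below. The environment and policy objects here are hypothetical stand-ins, loosely modeled on the reset()/step() interface common in reinforcement learning toolkits:

```python
# A minimal sketch of the agent-environment loop described above.
# `environment`, `policy`, and `learn` are hypothetical stand-ins.

def run_episode(environment, policy, learn):
    """Run one episode: the agent acts, the environment responds with
    a new state and a reward, and the learning rule updates the policy."""
    state = environment.reset()          # initial state of the environment
    done = False
    total_reward = 0.0
    while not done:
        action = policy(state)                                # agent picks an action
        next_state, reward, done = environment.step(action)   # environment reacts
        learn(state, action, reward, next_state)              # reward/penalty updates the policy
        total_reward += reward
        state = next_state
    return total_reward
```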
For background, this is the scenario explored in the early 1950s by Richard Bellman, who developed dynamic programming to solve optimal control and Markov decision process problems. Dynamic programming is at the heart of many important algorithms for a variety of applications, and the Bellman equation is very much part of reinforcement learning.
A reward signifies what is good immediately. A value, on the other hand, specifies what is good in the long run. In general, the value of a state is the expected sum of future rewards, usually discounted so that near-term rewards count more than distant ones. Action choices—policies—need to be computed on the basis of long-term values, not immediate rewards.
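As a small, self-contained illustration (my own sketch, not from any particular paper), the discounted sum of rewards along one episode looks like this; averaging it over many episodes estimates the value of the starting state:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of the rewards observed in one episode.
    gamma < 1 makes near-term rewards count more than distant ones."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward of +1 only at the end of a 5-step episode:
print(discounted_return([0, 0, 0, 0, 1]))  # about 0.96, the long-run value of that outcome
```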
Effective policies for reinforcement learning need to balance greed or exploitation—going for the action that the current policy thinks will have the highest value—against exploration, randomly driven actions that may help improve the policy. There are many algorithms to control this trade-off, some exploring a small, random fraction ε of the time, and some starting with pure exploration and slowly converging to nearly pure greed as the learned policy becomes strong.
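The ε-greedy rule is the simplest version of that balance: explore with a small probability ε, otherwise exploit the current value estimates. A minimal sketch, assuming a list of estimated action values:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """Pick a random action with probability epsilon (exploration),
    otherwise pick the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(action_values))    # explore
    return action_values.index(max(action_values))     # exploit

# Example: three actions with current value estimates
print(epsilon_greedy([0.2, 0.5, 0.1]))  # usually 1, occasionally a random action
```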
There are many algorithms for reinforcement learning, both model-based (e.g. dynamic programming) and model-free (e.g. Monte Carlo). Model-free methods tend to be more useful in practice, because they learn directly from experience, and exact models of the environment are hard to create.
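To make “learning from experience” concrete, here is a rough sketch of Monte Carlo value estimation: the agent builds no model of the environment, it simply averages the discounted returns it actually observes after visiting each state. The episode format (a list of state and reward pairs) is my own assumption:

```python
from collections import defaultdict

def monte_carlo_values(episodes, gamma=0.99):
    """Model-free value estimation: average the observed returns per state.
    Each episode is assumed to be a list of (state, reward) pairs."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g accumulates the discounted future reward
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {state: sum(gs) / len(gs) for state, gs in returns.items()}

# Two tiny episodes over states "A" and "B"
print(monte_carlo_values([[("A", 0), ("B", 1)], [("A", 0), ("B", 0)]]))
```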
If you want to get into the weeds with reinforcement learning algorithms and theory, and you are comfortable with Markov decision processes, I’d recommend Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto. You want the 2nd edition, revised in 2018.
AlphaGo and AlphaZero
I mentioned earlier that AlphaGo started learning Go by training against a database of human Go games. That bootstrap got its deep-neural-network-based value function working at a reasonable strength.
For the next step in AlphaGo’s training, it played against itself—a lot—and used the game results to update the weights in its value and policy networks. That made the strength of the program rise above most human Go players.
At each move while playing a game, AlphaGo applies its value function to every legal move at that position, to rank them in terms of probability of leading to a win. Then it runs a Monte Carlo tree search algorithm from the board positions resulting from the highest-value moves, picking the move most likely to win based on those look-ahead searches. It uses the win probabilities to weight the amount of attention it gives to searching each move tree.
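A simplified version of that weighting can be sketched as follows. This is my own illustration of a PUCT-style selection rule in the spirit of AlphaGo’s search, not the program’s actual code; the constant and the data format are assumptions:

```python
import math

def select_move(moves, c_puct=1.5):
    """Simplified PUCT-style selection: favor moves with a high estimated
    win rate (q), but weight search attention by the network's win
    probability (p) relative to how often the move has been visited."""
    total_visits = sum(m["visits"] for m in moves)

    def score(m):
        exploration = c_puct * m["p"] * math.sqrt(total_visits) / (1 + m["visits"])
        return m["q"] + exploration

    return max(moves, key=score)

# Three candidate moves: the search balances win estimates against
# how much attention each move has already received
moves = [
    {"q": 0.55, "p": 0.40, "visits": 120},
    {"q": 0.52, "p": 0.35, "visits": 60},
    {"q": 0.30, "p": 0.25, "visits": 20},
]
print(select_move(moves))  # the second move: promising but still under-explored
```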
The later AlphaGo Zero and AlphaZero programs skipped training against the database of human games. They started with no baggage except for the rules of the game and reinforcement learning. At the beginning they played random moves, but after learning from millions of games against themselves they played very well indeed. AlphaGo Zero surpassed the strength of AlphaGo Lee in three days by winning 100 games to 0, reached the level of AlphaGo Master in 21 days, and exceeded all the old versions in 40 days.
AlphaZero, as I mentioned earlier, was generalized from AlphaGo Zero to learn chess and shogi as well as Go. According to DeepMind, the amount of reinforcement learning training the AlphaZero neural network needs depends on the style and complexity of the game, taking roughly nine hours for chess, 12 hours for shogi, and 13 days for Go, running on multiple TPUs. In chess, AlphaZero’s neural-network guidance is much more selective than that of conventional chess engines, sharply reducing the tree space it needs to search. AlphaZero needs to evaluate only tens of thousands of moves per decision, versus tens of millions of moves per decision for Stockfish, the strongest handcrafted chess engine.
These board games are not easy to master, and AlphaZero’s success says a lot about the power of reinforcement learning, neural network value and policy functions, and guided Monte Carlo tree search. It also says a lot about the skill of the researchers, and the power of TPUs.
Robotic control is a harder AI problem than playing board games or video games. As soon as you have to deal with the physical world, unexpected things happen. Nevertheless, there has been progress on this at a demonstration level, and the most powerful approaches currently seem to involve reinforcement learning and deep neural networks.