# Background: Machine Learning

Given that a number of users of the ML-Agents Toolkit might not have a formal
machine learning background, this page provides an overview to facilitate the
understanding of the ML-Agents Toolkit. However, we will not attempt to provide
a thorough treatment of machine learning as there are fantastic resources
online.

Machine learning, a branch of artificial intelligence, focuses on learning
patterns from data. The three main classes of machine learning algorithms
include: unsupervised learning, supervised learning and reinforcement learning.
Each class of algorithm learns from a different type of data. The following
paragraphs provide an overview for each of these classes of machine learning,
as well as introductory examples.

## Unsupervised Learning

The goal of
[unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning) is
to group or cluster similar items in a data set. For example, consider the
players of a game. We may want to group the players depending on how engaged
they are with the game. This would enable us to target different groups (e.g.
for highly-engaged players we might invite them to be beta testers for new
features, while for unengaged players we might email them helpful tutorials).

Say that we wish to split our players into two groups. We would first define
basic attributes of the players, such as the number of hours played, total
money spent on in-app purchases and number of levels completed. We can then
feed this data set (three attributes for every player) to an unsupervised
learning algorithm where we specify the number of groups to be two. The
algorithm would then split the data set of players into two groups where the
players within each group would be similar to each other. Given the attributes
we used to describe each player, in this case, one group would semantically
represent the engaged players and the other group would semantically represent
the unengaged players.
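
To make this concrete, here is a minimal sketch of the two-group clustering
described above. It assumes scikit-learn (which is not part of the ML-Agents
Toolkit), and the player values are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per player: [hours played, money spent, levels completed].
# These values are made up purely for illustration.
players = np.array([
    [120.0, 49.99, 35],
    [  2.5,  0.00,  1],
    [ 95.0, 19.99, 28],
    [  1.0,  0.00,  2],
])

# Ask the algorithm for exactly two groups, as in the example above.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(players)

# Which group each player was assigned to, e.g. [0 1 0 1]. Which group
# represents the "engaged" players is something we interpret afterwards.
print(kmeans.labels_)
```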

With unsupervised learning, we did not provide specific examples of which
players are considered engaged and which are considered unengaged. We just
defined the appropriate attributes and relied on the algorithm to uncover the
two groups on its own. This type of data set is typically called an unlabeled
data set as it is lacking these direct labels. Consequently, unsupervised
learning can be helpful in situations where these labels can be expensive or
hard to produce. In the next section, we overview supervised learning
algorithms which accept input labels in addition to attributes.

## Supervised Learning

In [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning), we
do not want to just group similar items but directly learn a mapping from each
item to the group (or class) that it belongs to. Returning to our earlier
example of clustering players, let's say we now wish to predict which of our
players are about to churn (that is, stop playing the game for the next 30
days). We can look into our historical records and create a data set that
contains attributes of our players in addition to a label indicating whether
they have churned or not. Note that the player attributes we use for this churn
prediction task may be different from the ones we used for our earlier
clustering task. We can then feed this data set (attributes **and** label for
each player) into a supervised learning algorithm which would learn a mapping
from the player attributes to a label indicating whether that player will churn
or not. The intuition is that the supervised learning algorithm will learn
which values of these attributes typically correspond to players who have
churned and not churned (for example, it may learn that players who spend very
little and play for very short periods will most likely churn). Now given this
learned model, we can provide it the attributes of a new player (one that
recently started playing the game) and it would output a _predicted_ label for
that player. This prediction is the algorithm's expectation of whether the
player will churn or not. We can now use these predictions to target the
players who are expected to churn and entice them to continue playing the game.
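
A minimal sketch of this churn predictor, again assuming scikit-learn; the
attributes, labels and values are invented, and a logistic regression is just
one of many possible model choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per player: [money spent, average session length in minutes].
X = np.array([[49.99, 45.0], [0.0, 3.0], [19.99, 30.0], [0.5, 5.0]])
y = np.array([0, 1, 0, 1])  # label for each player: 1 = churned, 0 = stayed

# Training: learn the mapping from attributes to label.
model = LogisticRegression().fit(X, y)

# Prediction for a new player who recently started playing.
new_player = np.array([[1.0, 4.0]])
print(model.predict(new_player))  # e.g. [1], i.e. expected to churn
```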

As you may have noticed, for both supervised and unsupervised learning, there
are two tasks that need to be performed: attribute selection and model
selection. Attribute selection (also called feature selection) pertains to
selecting how we wish to represent the entity of interest, in this case, the
player. Model selection, on the other hand, pertains to selecting the algorithm
(and its parameters) that performs the task well. Both of these tasks are
active areas of machine learning research and, in practice, require several
iterations to achieve good performance.

We now switch to reinforcement learning, the third class of machine learning
algorithms, and arguably the one most relevant for the ML-Agents Toolkit.

## Reinforcement Learning

[Reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning)
can be viewed as a form of learning for sequential decision making that is
commonly associated with controlling robots (but is, in fact, much more
general). Consider an autonomous firefighting robot that is tasked with
navigating into an area, finding the fire and neutralizing it. At any given
moment, the robot perceives the environment through its sensors (e.g. camera,
heat, touch), processes this information and produces an action (e.g. move to
the left, rotate the water hose, turn on the water). In other words, it is
continuously making decisions about how to interact in this environment given
its view of the world (i.e. sensor inputs) and objective (i.e. neutralizing
the fire). Teaching a robot to be a successful firefighting machine is
precisely what reinforcement learning is designed to do.

More specifically, the goal of reinforcement learning is to learn a **policy**,
which is essentially a mapping from **observations** to **actions**. An
observation is what the robot can measure from its **environment** (in this
case, all its sensory inputs) and an action, in its most raw form, is a change
to the configuration of the robot (e.g. position of its base, position of its
water hose and whether the hose is on or off).

The last remaining piece of the reinforcement learning task is the **reward
signal**. The robot is trained to learn a policy that maximizes its overall
rewards. When training a robot to be a mean firefighting machine, we provide it
with rewards (positive and negative) indicating how well it is doing on
completing the task. Note that the robot does not _know_ how to put out fires
before it is trained. It learns the objective because it receives a large
positive reward when it puts out the fire and a small negative reward for every
passing second. The fact that rewards are sparse (i.e. may not be provided at
every step, but only when a robot arrives at a success or failure situation) is
a defining characteristic of reinforcement learning and precisely why learning
good policies can be difficult (and/or time-consuming) for complex
environments.
| <div style="text-align: center"><img src="../images/rl_cycle.png" alt="The reinforcement learning lifecycle."></div> | |

[Learning a policy](https://blogs.unity3d.com/2017/08/22/unity-ai-reinforcement-learning-with-q-learning/)
usually requires many trials and iterative policy updates. More specifically,
the robot is placed in several fire situations and over time learns an optimal
policy which allows it to put out fires more effectively. Obviously, we cannot
expect to train a robot repeatedly in the real world, particularly when fires
are involved. This is precisely why the use of
[Unity as a simulator](https://blogs.unity3d.com/2018/01/23/designing-safer-cities-through-simulations/)
serves as the perfect training grounds for learning such behaviors. While our
discussion of reinforcement learning has centered around robots, there are
strong parallels between robots and characters in a game. In fact, in many
ways, one can view a non-playable character (NPC) as a virtual robot, with its
own observations about the environment, its own set of actions and a specific
objective. Thus it is natural to explore how we can train behaviors within
Unity using reinforcement learning. This is precisely what the ML-Agents
Toolkit offers. The video linked below includes a reinforcement learning demo
showcasing training character behaviors using the ML-Agents Toolkit.
| <p align="center"> | |
| <a href="http://www.youtube.com/watch?feature=player_embedded&v=fiQsmdwEGT8" target="_blank"> | |
| <img src="http://img.youtube.com/vi/fiQsmdwEGT8/0.jpg" alt="RL Demo" width="400" border="10" /> | |
| </a> | |
| </p> | |

Similar to both unsupervised and supervised learning, reinforcement learning
also involves two tasks: attribute selection and model selection. Attribute
selection is defining the set of observations for the robot that best help it
complete its objective, while model selection is defining the form of the
policy (mapping from observations to actions) and its parameters. In practice,
training behaviors is an iterative process that may require changing the
attribute and model choices.

## Training and Inference

One common aspect of all three branches of machine learning is that they all
involve a **training phase** and an **inference phase**. While the details of
the training and inference phases are different for each of the three, at a
high-level, the training phase involves building a model using the provided
data, while the inference phase involves applying this model to new, previously
unseen, data. More specifically:

- For our unsupervised learning example, the training phase learns the optimal
  two clusters based on the data describing existing players, while the
  inference phase assigns a new player to one of these two clusters.
- For our supervised learning example, the training phase learns the mapping
  from player attributes to player label (whether they churned or not), and the
  inference phase predicts whether a new player will churn or not based on that
  learned mapping.
- For our reinforcement learning example, the training phase learns the optimal
  policy through guided trials, and in the inference phase, the agent observes
  and takes actions in the wild using its learned policy.
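
To see the separation between the two phases in code, here is a sketch that
reuses the hypothetical churn model from the supervised learning example;
saving the model with `joblib` (a scikit-learn dependency) is an assumed
convention, not the only option:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training phase: build the model from historical (invented) data and save it.
X = np.array([[49.99, 45.0], [0.0, 3.0], [19.99, 30.0], [0.5, 5.0]])
y = np.array([0, 1, 0, 1])
joblib.dump(LogisticRegression().fit(X, y), "churn_model.joblib")

# Inference phase (possibly much later, on another machine): load the trained
# model and apply it to a new, previously unseen player.
model = joblib.load("churn_model.joblib")
print(model.predict(np.array([[1.0, 4.0]])))
```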

To briefly summarize: all three classes of algorithms involve training and
inference phases in addition to attribute and model selections. What ultimately
separates them is the type of data available to learn from. In unsupervised
learning our data set was a collection of attributes, in supervised learning
our data set was a collection of attribute-label pairs, and, lastly, in
reinforcement learning our data set was a collection of
observation-action-reward tuples.

## Deep Learning

[Deep learning](https://en.wikipedia.org/wiki/Deep_learning) is a family of
algorithms that can be used to address any of the problems introduced above.
More specifically, they can be used to solve both attribute and model selection
tasks. Deep learning has gained popularity in recent years due to its
outstanding performance on several challenging machine learning tasks. One
example is [AlphaGo](https://en.wikipedia.org/wiki/AlphaGo), a
[computer Go](https://en.wikipedia.org/wiki/Computer_Go) program that leverages
deep learning and was able to beat Lee Sedol (a Go world champion).

A key characteristic of deep learning algorithms is their ability to learn very
complex functions from large amounts of training data. This makes them a
natural choice for reinforcement learning tasks when a large amount of data can
be generated, say through the use of a simulator or engine such as Unity. By
generating hundreds of thousands of simulations of the environment within
Unity, we can learn policies for very complex environments (a complex
environment is one where the number of observations an agent perceives and the
number of actions it can take are large). Many of the algorithms we provide in
ML-Agents use some form of deep learning, built on top of the open-source
library, [PyTorch](Background-PyTorch.md).
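
As a small illustration of what such a model can look like, here is a minimal
sketch of a policy network in PyTorch: a stack of layers mapping an observation
vector to scores over a discrete set of actions. The layer sizes are arbitrary,
and real ML-Agents policies are considerably more involved:

```python
import torch
import torch.nn as nn

# A tiny feed-forward policy: 8 observation values in, scores for 3 actions out.
policy = nn.Sequential(
    nn.Linear(8, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 3),
)

observation = torch.randn(1, 8)          # a stand-in observation vector
action = policy(observation).argmax(-1)  # pick the highest-scoring action
```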