ppo-Pyramids-Training / docs /Using-Tensorboard.md

Second Push

05c9ac2 over 2 years ago

5.62 kB

	# Using TensorBoard to Observe Training

	The ML-Agents Toolkit saves statistics during learning session that you can view
	with a TensorFlow utility named,
	[TensorBoard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard).

	The `mlagents-learn` command saves training statistics to a folder named
	`results`, organized by the `run-id` value you assign to a training session.

	In order to observe the training process, either during training or afterward,
	start TensorBoard:

	1. Open a terminal or console window:
	1. Navigate to the directory where the ML-Agents Toolkit is installed.
	1. From the command line run: `tensorboard --logdir results --port 6006`
	1. Open a browser window and navigate to
	[localhost:6006](http://localhost:6006).

	Note: The default port TensorBoard uses is 6006. If there is an existing
	session running on port 6006 a new session can be launched on an open port using
	the --port option.

	Note: If you don't assign a `run-id` identifier, `mlagents-learn` uses the
	default string, "ppo". You can delete the folders under the `results` directory
	to clear out old statistics.

	On the left side of the TensorBoard window, you can select which of the training
	runs you want to display. You can select multiple run-ids to compare statistics.
	The TensorBoard window also provides options for how to display and smooth
	graphs.

	## The ML-Agents Toolkit training statistics

	The ML-Agents training program saves the following statistics:

	![Example TensorBoard Run](images/mlagents-TensorBoard.png)

	### Environment Statistics

	- `Environment/Lesson` - Plots the progress from lesson to lesson. Only
	interesting when performing curriculum training.

	- `Environment/Cumulative Reward` - The mean cumulative episode reward over all
	agents. Should increase during a successful training session.

	- `Environment/Episode Length` - The mean length of each episode in the
	environment for all agents.

	### Is Training

	- `Is Training` - A boolean indicating if the agent is updating its model.

	### Policy Statistics

	- `Policy/Entropy` (PPO; SAC) - How random the decisions of the model are.
	Should slowly decrease during a successful training process. If it decreases
	too quickly, the `beta` hyperparameter should be increased.

	- `Policy/Learning Rate` (PPO; SAC) - How large a step the training algorithm
	takes as it searches for the optimal policy. Should decrease over time.

	- `Policy/Entropy Coefficient` (SAC) - Determines the relative importance of the
	entropy term. This value is adjusted automatically so that the agent retains
	some amount of randomness during training.

	- `Policy/Extrinsic Reward` (PPO; SAC) - This corresponds to the mean cumulative
	reward received from the environment per-episode.

	- `Policy/Value Estimate` (PPO; SAC) - The mean value estimate for all states
	visited by the agent. Should increase during a successful training session.

	- `Policy/Curiosity Reward` (PPO/SAC+Curiosity) - This corresponds to the mean
	cumulative intrinsic reward generated per-episode.

	- `Policy/Curiosity Value Estimate` (PPO/SAC+Curiosity) - The agent's value
	estimate for the curiosity reward.

	- `Policy/GAIL Reward` (PPO/SAC+GAIL) - This corresponds to the mean cumulative
	discriminator-based reward generated per-episode.

	- `Policy/GAIL Value Estimate` (PPO/SAC+GAIL) - The agent's value estimate for
	the GAIL reward.

	- `Policy/GAIL Policy Estimate` (PPO/SAC+GAIL) - The discriminator's estimate
	for states and actions generated by the policy.

	- `Policy/GAIL Expert Estimate` (PPO/SAC+GAIL) - The discriminator's estimate
	for states and actions drawn from expert demonstrations.

	### Learning Loss Functions

	- `Losses/Policy Loss` (PPO; SAC) - The mean magnitude of policy loss function.
	Correlates to how much the policy (process for deciding actions) is changing.
	The magnitude of this should decrease during a successful training session.

	- `Losses/Value Loss` (PPO; SAC) - The mean loss of the value function update.
	Correlates to how well the model is able to predict the value of each state.
	This should increase while the agent is learning, and then decrease once the
	reward stabilizes.

	- `Losses/Forward Loss` (PPO/SAC+Curiosity) - The mean magnitude of the forward
	model loss function. Corresponds to how well the model is able to predict the
	new observation encoding.

	- `Losses/Inverse Loss` (PPO/SAC+Curiosity) - The mean magnitude of the inverse
	model loss function. Corresponds to how well the model is able to predict the
	action taken between two observations.

	- `Losses/Pretraining Loss` (BC) - The mean magnitude of the behavioral cloning
	loss. Corresponds to how well the model imitates the demonstration data.

	- `Losses/GAIL Loss` (GAIL) - The mean magnitude of the GAIL discriminator loss.
	Corresponds to how well the model imitates the demonstration data.

	### Self-Play

	- `Self-Play/ELO` (Self-Play) -
	[ELO](https://en.wikipedia.org/wiki/Elo_rating_system) measures the relative
	skill level between two players. In a proper training run, the ELO of the
	agent should steadily increase.

	## Exporting Data from TensorBoard
	To export timeseries data in CSV or JSON format, check the "Show data download
	links" in the upper left. This will enable download links below each chart.

	![Example TensorBoard Run](images/TensorBoard-download.png)

	## Custom Metrics from Unity

	To get custom metrics from a C# environment into TensorBoard, you can use the
	`StatsRecorder`:

	```csharp
	var statsRecorder = Academy.Instance.StatsRecorder;
	statsRecorder.Add("MyMetric", 1.0);
	```