# Table of Contents

* [mlagents.trainers.optimizer.torch\_optimizer](#mlagents.trainers.optimizer.torch_optimizer)
  * [TorchOptimizer](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer)
    * [create\_reward\_signals](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.create_reward_signals)
    * [get\_trajectory\_value\_estimates](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.get_trajectory_value_estimates)
* [mlagents.trainers.optimizer.optimizer](#mlagents.trainers.optimizer.optimizer)
  * [Optimizer](#mlagents.trainers.optimizer.optimizer.Optimizer)
    * [update](#mlagents.trainers.optimizer.optimizer.Optimizer.update)
<a name="mlagents.trainers.optimizer.torch_optimizer"></a>
# mlagents.trainers.optimizer.torch\_optimizer

<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer"></a>
## TorchOptimizer Objects

```python
class TorchOptimizer(Optimizer)
```
<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.create_reward_signals"></a>
#### create\_reward\_signals

```python
 | create_reward_signals(reward_signal_configs: Dict[RewardSignalType, RewardSignalSettings]) -> None
```

Create reward signals.

**Arguments**:

- `reward_signal_configs`: Reward signal config.
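The general pattern this method follows, instantiating one reward-signal object per entry in a config dict keyed by signal type, can be sketched as follows. Note that the `RewardSignalType` members, `RewardSignalSettings` fields, and `RewardSignal` class below are simplified stand-ins for illustration, not the real ml-agents definitions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict

class RewardSignalType(Enum):
    # Illustrative subset of signal types.
    EXTRINSIC = "extrinsic"
    CURIOSITY = "curiosity"

@dataclass
class RewardSignalSettings:
    gamma: float = 0.99
    strength: float = 1.0

class RewardSignal:
    """Stand-in for a reward-signal object built from its settings."""
    def __init__(self, settings: RewardSignalSettings):
        self.settings = settings

def create_reward_signals(
    configs: Dict[RewardSignalType, RewardSignalSettings]
) -> Dict[str, RewardSignal]:
    # One reward-signal object per configured type, keyed by name.
    return {t.value: RewardSignal(s) for t, s in configs.items()}

signals = create_reward_signals(
    {RewardSignalType.EXTRINSIC: RewardSignalSettings(gamma=0.99)}
)
print(sorted(signals))  # ['extrinsic']
```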
<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.get_trajectory_value_estimates"></a>
#### get\_trajectory\_value\_estimates

```python
 | get_trajectory_value_estimates(batch: AgentBuffer, next_obs: List[np.ndarray], done: bool, agent_id: str = "") -> Tuple[Dict[str, np.ndarray], Dict[str, float], Optional[AgentBufferField]]
```

Get value estimates and memories for a trajectory, in batch form.

**Arguments**:

- `batch`: An AgentBuffer that consists of a trajectory.
- `next_obs`: The next observation (after the trajectory). Used for bootstrapping
if this is not a terminal trajectory.
- `done`: Set true if this is a terminal trajectory.
- `agent_id`: Agent ID of the agent that this trajectory belongs to.

**Returns**:

A Tuple of the Value Estimates as a Dict of [name, np.ndarray(trajectory_len)],
the final value estimate as a Dict of [name, float], and optionally (if using memories)
an AgentBufferField of initial critic memories to be used during update.
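The bootstrapping behavior described above can be illustrated with a small, self-contained sketch (the `discounted_returns` helper and `gamma` parameter are illustrative, not part of the ml-agents API): when the trajectory is not terminal, the running return is seeded with a value estimate for the observation that follows the trajectory.

```python
import numpy as np

def discounted_returns(rewards: np.ndarray, gamma: float,
                       bootstrap_value: float, done: bool) -> np.ndarray:
    """Compute discounted returns for one trajectory.

    If the trajectory did not end in a terminal state, the running
    return is seeded with a value estimate for the next observation
    (bootstrapping); otherwise the tail return is zero.
    """
    running = 0.0 if done else bootstrap_value
    returns = np.zeros_like(rewards, dtype=np.float64)
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Terminal trajectory: no bootstrap term.
print(discounted_returns(np.array([1.0, 1.0]), 0.9, 5.0, done=True))   # [1.9 1. ]
# Non-terminal: the final value estimate propagates into every return.
print(discounted_returns(np.array([1.0, 1.0]), 0.9, 5.0, done=False))  # [5.95 5.5 ]
```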
<a name="mlagents.trainers.optimizer.optimizer"></a>
# mlagents.trainers.optimizer.optimizer

<a name="mlagents.trainers.optimizer.optimizer.Optimizer"></a>
## Optimizer Objects

```python
class Optimizer(abc.ABC)
```

Creates loss functions and auxiliary networks (e.g. Q or Value) needed for training.
Provides methods to update the Policy.
<a name="mlagents.trainers.optimizer.optimizer.Optimizer.update"></a>
#### update

```python
 | @abc.abstractmethod
 | update(batch: AgentBuffer, num_sequences: int) -> Dict[str, float]
```

Update the Policy based on the batch that was passed in.

**Arguments**:

- `batch`: AgentBuffer that contains the minibatch of data used for this update.
- `num_sequences`: Number of recurrent sequences found in the minibatch.

**Returns**:

A Dict containing statistics (name, value) from the update (e.g. loss).
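A minimal sketch of the contract above: a concrete subclass implements `update` and returns a dict of named statistics. The `ConstantOptimizer` class and its stat keys are invented for this illustration and are not part of ml-agents; a plain list stands in for the real `AgentBuffer`.

```python
import abc
from typing import Dict, List

class Optimizer(abc.ABC):
    """Minimal stand-in for the abstract base class described above."""
    @abc.abstractmethod
    def update(self, batch: List[float], num_sequences: int) -> Dict[str, float]:
        ...

class ConstantOptimizer(Optimizer):
    """Hypothetical subclass: reports the batch mean as its 'loss'."""
    def update(self, batch: List[float], num_sequences: int) -> Dict[str, float]:
        return {
            "Losses/Value Loss": sum(batch) / len(batch),
            "Policy/Sequences": float(num_sequences),
        }

stats = ConstantOptimizer().update([1.0, 2.0, 3.0], num_sequences=1)
print(stats["Losses/Value Loss"])  # 2.0
```

Because `update` is declared with `@abc.abstractmethod`, instantiating a subclass that does not implement it raises `TypeError`, which is how the base class enforces the contract.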