Entropy-TRPO Model Weights
PyTorch checkpoints for the comparative study in A Review of Entropy-Based Extensions to Trust Region Policy Optimization.
Repository layout
Each checkpoint directory contains:
| File | Description |
|---|---|
| policy.pt | Policy network state dict |
| value.pt | Value network state dict |
| config.json | Training hyperparameters |
| metadata.json | Paper source, variant flags, final metrics |
Hub path: {env_id}/{variant}/latest/ (e.g. CartPole-v1/entrpo/latest/).
The repo README is updated automatically during training with a Training progress table
(epoch n/N, eval return, best return, KL) from results/summary.md, plus a JSON index of
available checkpoints.
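As a rough illustration of the layout described above, the following sketch builds the Hub-relative paths for one checkpoint and loads the two state dicts. The helper names `checkpoint_files` and `load_state_dicts` are hypothetical and not part of the repository; the sketch assumes the files have already been copied under a local root directory and that PyTorch is installed.

```python
from pathlib import PurePosixPath

# File names taken from the table in "Repository layout".
CHECKPOINT_FILES = ("policy.pt", "value.pt", "config.json", "metadata.json")


def checkpoint_files(env_id: str, variant: str, tag: str = "latest") -> dict:
    """Return {file name: Hub-relative path} for one checkpoint directory."""
    base = PurePosixPath(env_id) / variant / tag
    return {name: str(base / name) for name in CHECKPOINT_FILES}


def load_state_dicts(local_root: str, env_id: str, variant: str, tag: str = "latest"):
    """Load policy and value state dicts from a local copy (hypothetical helper)."""
    import torch  # imported lazily so the path helper works without PyTorch

    paths = checkpoint_files(env_id, variant, tag)
    policy = torch.load(f"{local_root}/{paths['policy.pt']}", map_location="cpu")
    value = torch.load(f"{local_root}/{paths['value.pt']}", map_location="cpu")
    return policy, value


paths = checkpoint_files("CartPole-v1", "entrpo")
print(paths["policy.pt"])  # CartPole-v1/entrpo/latest/policy.pt
```

A state dict only holds tensors, so it has to be loaded into a network with the same architecture it was trained with; the hyperparameters in config.json are presumably what is needed to rebuild that network.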
Variant definitions
All trust-region variants share the TRPO surrogate $\max_\theta \mathbb{E}_t[\rho_t(\theta) A_t]$ with $\rho_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and GAE advantages $A_t$, unless noted below.
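The shared surrogate can be evaluated from stored log-probabilities. The sketch below is a minimal plain-Python version, not the repository's implementation; the function name `surrogate` and the toy numbers are illustrative assumptions.

```python
import math


def surrogate(logp_new, logp_old, advantages):
    """Mean of rho_t * A_t with rho_t = exp(log pi_theta - log pi_theta_old)."""
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(advantages)


# Toy batch of two transitions (illustrative values only).
logp_old = [math.log(0.5), math.log(0.25)]
advantages = [1.0, -2.0]

# At theta = theta_old every ratio is 1, so the surrogate is the mean advantage.
at_old = surrogate(logp_old, logp_old, advantages)

# Raising the probability of the positive-advantage action raises the surrogate.
logp_new = [math.log(0.75), math.log(0.25)]
improved = surrogate(logp_new, logp_old, advantages)
```

Working in log space avoids dividing two small probabilities directly, which is why the ratio is formed as an exponential of a difference.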
| Key | Paper name | Objective |
|---|---|---|
| trpo | TRPO | KL trust region $\bar D_{\mathrm{KL}} \le \delta$ (Schulman et al., 2015) |
| entrpo_entropy | EnTRPO-Entropy | $A_t \leftarrow A_t + \beta\,\mathcal{H}(\pi_\theta(\cdot \mid s_t))$ (Roostaie ablation) |
| ero_trpo | ERO-TRPO | Surrogate includes $\beta\,\mathcal{H}(\pi_\theta)$ (Xu et al., 2024) |
| erc_trpo | ERC-TRPO | Relaxed KL: $\bar D_{\mathrm{KL}} \le \delta + \alpha\,\Delta\mathcal{H}$ (Xu et al., 2024) |
| entrpo_buffer | EnTRPO-Buffer | Roostaie on-policy replay buffer only |
| entrpo | EnTRPO | Entropy in advantage + Roostaie buffer (full method) |
| ppo | PPO | Clipped surrogate + entropy (Schulman et al., 2017) |
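Two of the objectives in the table are simple enough to sketch directly: the entropy-in-advantage update (EnTRPO-Entropy) and the relaxed KL test (ERC-TRPO). The code below is an illustration under stated assumptions, not the training code: it assumes a discrete action space, the function names are hypothetical, the values of $\beta$ and $\alpha$ are arbitrary, and $\Delta\mathcal{H}$ is passed in as a number because this README does not say how it is measured.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy H(pi(.|s)) in nats for one discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_augmented_advantages(advantages, action_probs, beta):
    """EnTRPO-Entropy row: A_t <- A_t + beta * H(pi_theta(.|s_t))."""
    return [a + beta * categorical_entropy(p) for a, p in zip(advantages, action_probs)]


def within_relaxed_trust_region(mean_kl, delta, alpha, delta_entropy):
    """ERC-TRPO row: accept a step if mean KL <= delta + alpha * Delta H."""
    return mean_kl <= delta + alpha * delta_entropy


advantages = [0.5, -0.2]
action_probs = [[0.5, 0.5], [1.0, 0.0]]  # a uniform policy, then a deterministic one
augmented = entropy_augmented_advantages(advantages, action_probs, beta=0.01)

# With alpha = 0 the test reduces to the plain TRPO constraint.
plain = within_relaxed_trust_region(0.012, delta=0.01, alpha=0.0, delta_entropy=0.01)
relaxed = within_relaxed_trust_region(0.012, delta=0.01, alpha=0.5, delta_entropy=0.01)
```

The uniform state receives the largest bonus ($\beta \ln 2$ for two actions) and the deterministic state receives none. Continuous-control environments such as Humanoid-v5 would need the entropy of the policy's continuous action distribution instead of `categorical_entropy`.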
Older Hub folders (trpo_entropy, trpo_buffer, …) remain valid; training resumes from them automatically.
Variants and paper sources
| Variant | Paper |
|---|---|
| trpo | Schulman et al. (2015), Trust Region Policy Optimization, ICML |
| entrpo_entropy | Roostaie & Ebadzadeh (2021), EnTRPO (entropy-in-advantage ablation) |
| entrpo_buffer | Roostaie & Ebadzadeh (2021), EnTRPO (replay-buffer ablation) |
| entrpo | Roostaie & Ebadzadeh (2021), EnTRPO (full method) |
| ero_trpo | Xu et al. (2024), ERO-TRPO |
| erc_trpo | Xu et al. (2024), ERC-TRPO |
| ppo | Schulman et al. (2017), Proximal Policy Optimization |
See metadata.json in each folder for full author names and URLs.
Usage
Training and evaluation code: GitHub — entropy-trpo (update URL when published).
git clone https://github.com/your-username/entropy-trpo.git
cd entropy-trpo
make setup # install deps + create .env
# edit .env with HF_TOKEN and HF_REPO_ID
make download-weights
make eval-checkpoints
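What `make download-weights` runs internally is not shown here. For readers who only want a few checkpoints without the Makefile, a rough equivalent might look like the sketch below. It assumes the weights live in a Hub model repository named by `HF_REPO_ID`, uses `snapshot_download` from the `huggingface_hub` package, and introduces two hypothetical helpers, `variant_patterns` and `download_checkpoints`.

```python
import os


def variant_patterns(env_id, variants, tag="latest"):
    """Glob patterns matching the checkpoint files of the given variants."""
    return [f"{env_id}/{variant}/{tag}/*" for variant in variants]


def download_checkpoints(env_id, variants, local_dir="weights"):
    """Fetch only the selected checkpoint folders (assumed workflow, not the Makefile's)."""
    from huggingface_hub import snapshot_download  # needs `pip install huggingface_hub`

    return snapshot_download(
        repo_id=os.environ["HF_REPO_ID"],  # same variable the .env file defines
        token=os.environ.get("HF_TOKEN"),  # None falls back to any cached login
        allow_patterns=variant_patterns(env_id, variants),
        local_dir=local_dir,
    )


patterns = variant_patterns("CartPole-v1", ["trpo", "entrpo"])
print(patterns)  # ['CartPole-v1/trpo/latest/*', 'CartPole-v1/entrpo/latest/*']
```

Restricting the download with `allow_patterns` keeps the transfer small when only one environment or variant is needed.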
Citation
@article{entropytrpo2025,
title = {A Review of Entropy-Based Extensions to Trust Region Policy Optimization},
author = {Green, Simon},
journal = {IEEE Transactions},
year = {2025}
}
@article{roostaie2021entrpo,
title = {EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization},
author = {Roostaie, Sahar and Ebadzadeh, Mohammad Mehdi},
journal = {arXiv:2110.13373},
year = {2021}
}
Training progress
Last updated: 2026-06-25 06:29:52 UTC
- Device: cpu
- Config: configs/cpu.yaml
- Jobs complete: 62/63
- Running: 1
CartPole-v1 (1M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO (s0) | done | 10/10 | 50,000 | 252.6 ± 72.4 | 293.4 | 0.0049 |
| TRPO (s1) | done | 10/10 | 50,000 | 297.7 ± 69.6 | 297.8 | 0.0076 |
| TRPO (s2) | done | 10/10 | 50,000 | 390.4 ± 78.6 | 390.4 | 0.0057 |
| EnTRPO-Entropy (s0) | done | 10/10 | 50,000 | 267.6 ± 56.1 | 324.0 | 0.0056 |
| EnTRPO-Entropy (s1) | done | 10/10 | 50,000 | 277.1 ± 87.7 | 297.8 | 0.0056 |
| EnTRPO-Entropy (s2) | done | 10/10 | 50,000 | 373.8 ± 92.4 | 373.8 | 0.0027 |
| ERO-TRPO (s0) | done | 10/10 | 50,000 | 20.5 ± 11.4 | 28.3 | 0.0000 |
| ERO-TRPO (s1) | done | 10/10 | 50,000 | 25.8 ± 15.0 | 27.6 | 0.0000 |
| ERO-TRPO (s2) | done | 10/10 | 50,000 | 21.2 ± 10.5 | 27.9 | 0.0000 |
| ERC-TRPO (s0) | done | 10/10 | 50,000 | 18.5 ± 8.1 | 23.5 | 0.0000 |
| ERC-TRPO (s1) | done | 10/10 | 50,000 | 31.3 ± 22.2 | 31.3 | 0.0000 |
| ERC-TRPO (s2) | done | 10/10 | 50,000 | 28.7 ± 14.8 | 32.4 | 0.0000 |
| EnTRPO-Buffer (s0) | done | 10/10 | 50,000 | 216.5 ± 82.2 | 262.0 | 0.0050 |
| EnTRPO-Buffer (s1) | done | 10/10 | 50,000 | 321.5 ± 106.3 | 340.6 | 0.0049 |
| EnTRPO-Buffer (s2) | done | 10/10 | 50,000 | 80.5 ± 33.3 | 165.7 | 0.0086 |
| EnTRPO (s0) | done | 9/10 | 50,000 | 147.0 ± 66.1 | 174.6 | 0.0082 |
| EnTRPO (s1) | done | 10/10 | 50,000 | 224.6 ± 65.8 | 224.6 | 0.0083 |
| EnTRPO (s2) | done | 10/10 | 50,000 | 186.2 ± 56.1 | 243.5 | 0.0036 |
| PPO (s0) | done | 10/10 | 50,000 | 138.1 ± 82.2 | 138.1 | 0.0009 |
| PPO (s1) | done | 10/10 | 50,000 | 127.8 ± 57.1 | 127.8 | 0.0052 |
| PPO (s2) | done | 10/10 | 50,000 | 138.2 ± 57.5 | 138.2 | 0.0044 |
Humanoid-v5 (1M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO (s0) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| TRPO (s1) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| TRPO (s2) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| EnTRPO-Entropy (s0) | done | 488/488 | 999,424 | 256.2 ± 58.3 | 312.4 | -0.0000 |
| EnTRPO-Entropy (s1) | done | 488/488 | 999,424 | 273.6 ± 55.0 | 325.6 | 0.0072 |
| EnTRPO-Entropy (s2) | done | 488/488 | 999,424 | 256.1 ± 72.0 | 333.6 | -0.0000 |
| ERO-TRPO (s0) | done | 488/488 | 999,424 | 250.4 ± 54.5 | 342.4 | 0.0000 |
| ERO-TRPO (s1) | done | 488/488 | 999,424 | 250.6 ± 23.7 | 315.4 | 0.0071 |
| ERO-TRPO (s2) | done | 488/488 | 999,424 | 261.1 ± 65.2 | 329.1 | 0.0053 |
| ERC-TRPO (s0) | done | 488/488 | 999,424 | 206.3 ± 84.2 | 262.2 | 0.0023 |
| ERC-TRPO (s1) | done | 488/488 | 999,424 | 254.5 ± 43.4 | 258.7 | -0.0000 |
| ERC-TRPO (s2) | done | 488/488 | 999,424 | 217.7 ± 79.6 | 240.5 | 0.0000 |
| EnTRPO-Buffer (s0) | done | 488/488 | 999,424 | 267.4 ± 72.3 | 326.5 | 0.0000 |
| EnTRPO-Buffer (s1) | done | 488/488 | 999,424 | 252.4 ± 22.5 | 327.7 | 0.0000 |
| EnTRPO-Buffer (s2) | done | 488/488 | 999,424 | 249.7 ± 90.1 | 321.0 | 0.0055 |
| EnTRPO (s0) | done | 488/488 | 999,424 | 245.7 ± 32.2 | 332.4 | 0.0043 |
| EnTRPO (s1) | done | 488/488 | 999,424 | 289.4 ± 72.0 | 325.4 | 0.0074 |
| EnTRPO (s2) | done | 488/488 | 999,424 | 280.4 ± 83.2 | 316.8 | 0.0023 |
| PPO (s0) | done | 488/488 | 999,424 | 350.6 ± 97.2 | 374.9 | 0.1023 |
| PPO (s1) | done | 488/488 | 999,424 | 329.9 ± 86.0 | 406.4 | 0.1046 |
| PPO (s2) | done | 488/488 | 999,424 | 305.1 ± 97.9 | 383.9 | 0.1098 |
HumanoidStandup-v5 (1M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO (s0) | done | 488/488 | 999,424 | 61102.3 ± 10813.4 | 68649.5 | 0.0060 |
| TRPO (s1) | done | 488/488 | 999,424 | 67908.7 ± 12258.9 | 77610.9 | 0.0000 |
| TRPO (s2) | done | 488/488 | 999,424 | 71182.5 ± 7886.6 | 74013.4 | -0.0000 |
| EnTRPO-Entropy (s0) | done | 488/488 | 999,424 | 63970.3 ± 12674.7 | 69688.4 | 0.0054 |
| EnTRPO-Entropy (s1) | done | 488/488 | 999,424 | 74092.3 ± 6436.0 | 84192.4 | -0.0000 |
| EnTRPO-Entropy (s2) | done | 488/488 | 999,424 | 66211.8 ± 11884.0 | 73910.5 | -0.0000 |
| ERO-TRPO (s0) | done | 488/488 | 999,424 | 46220.5 ± 2917.4 | 47845.4 | -0.0000 |
| ERO-TRPO (s1) | done | 488/488 | 999,424 | 47466.5 ± 3243.4 | 52235.4 | 0.0010 |
| ERO-TRPO (s2) | done | 488/488 | 999,424 | 47797.9 ± 3607.9 | 50942.6 | 0.0000 |
| ERC-TRPO (s0) | done | 488/488 | 999,424 | 43753.6 ± 3107.6 | 44799.7 | 0.0000 |
| ERC-TRPO (s1) | done | 488/488 | 999,424 | 47285.0 ± 3076.0 | 51595.2 | 0.0003 |
| ERC-TRPO (s2) | done | 488/488 | 999,424 | 48414.2 ± 3424.0 | 51013.4 | 0.0009 |
| EnTRPO-Buffer (s0) | done | 488/488 | 999,424 | 37442.2 ± 2643.0 | 39323.6 | 0.0086 |
| EnTRPO-Buffer (s1) | done | 488/488 | 999,424 | 41544.7 ± 3903.9 | 45498.8 | -0.0001 |
| EnTRPO-Buffer (s2) | done | 488/488 | 999,424 | 37246.1 ± 1735.1 | 40451.8 | 0.0000 |
| EnTRPO (s0) | done | 488/488 | 999,424 | 35912.7 ± 2063.5 | 39687.9 | 0.0000 |
| EnTRPO (s1) | done | 488/488 | 999,424 | 43605.8 ± 3502.3 | 46649.5 | 0.0042 |
| EnTRPO (s2) | done | 488/488 | 999,424 | 37992.1 ± 4942.4 | 40168.8 | 0.0064 |
| PPO (s0) | done | 488/488 | 999,424 | 83486.9 ± 6256.6 | 100329.1 | 0.3202 |
| PPO (s1) | done | 488/488 | 999,424 | 88231.3 ± 14647.0 | 110834.0 | 0.5174 |
| PPO (s2) | running | 1/488 | 2,048 | 39294.2 ± 2509.3 | 39294.2 | 0.0830 |
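The tables above report one row per seed. To compare variants, the per-seed eval returns can be averaged; the sketch below parses rows in the format shown and skips runs that are not marked done. The function name `mean_eval_return` is hypothetical, and the sketch averages only the per-seed means, since the README does not state what the ± spread measures.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches a first cell such as "EnTRPO-Entropy (s2)".
ROW = re.compile(r"^(?P<variant>.+?) \(s(?P<seed>\d+)\)$")


def mean_eval_return(table_lines):
    """Average the 'Eval return' means over finished seeds, per variant."""
    per_variant = defaultdict(list)
    for line in table_lines:
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        match = ROW.match(cells[0])
        if match is None or len(cells) < 5 or cells[1] != "done":
            continue  # skip header, separator, and unfinished runs
        eval_mean = float(cells[4].split("±")[0])
        per_variant[match["variant"]].append(eval_mean)
    return {variant: mean(values) for variant, values in per_variant.items()}


rows = [
    "| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |",
    "|---|---|---|---|---|---|---|",
    "| TRPO (s0) | done | 10/10 | 50,000 | 252.6 ± 72.4 | 293.4 | 0.0049 |",
    "| TRPO (s1) | done | 10/10 | 50,000 | 297.7 ± 69.6 | 297.8 | 0.0076 |",
    "| TRPO (s2) | done | 10/10 | 50,000 | 390.4 ± 78.6 | 390.4 | 0.0057 |",
    "| PPO (s2) | running | 1/488 | 2,048 | 39294.2 ± 2509.3 | 39294.2 | 0.0830 |",
]
summary = mean_eval_return(rows)
```

Filtering on the Status column matters here because the one running job has completed only 1 of 488 epochs, so its eval return is not comparable to the finished seeds.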
Available checkpoints
{
"entrpo": [
"latest",
"seed_0",
"seed_1",
"seed_2"
],
"entrpo_buffer": [
"latest",
"seed_0",
"seed_1",
"seed_2"
],
"entrpo_entropy": [
"latest",
"seed_0",
"seed_1",
"seed_2"
],
"erc_trpo": [
"latest",
"seed_0",
"seed_1",
"seed_2"
],
"ero_trpo": [
"latest",
"seed_0",
"seed_1",
"seed_2"
],
"ppo": [
"latest",
"seed_0",
"seed_1",
"seed_2"
],
"trpo": [
"latest",
"seed_0",
"seed_1",
"seed_2"
],
"trpo_entropy": [
"latest"
]
}
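The index can be consumed programmatically, for example to find which variants have a checkpoint for every seed. The sketch below works on an abbreviated copy of the JSON above; `split_by_seed_coverage` is a hypothetical helper, and the seed numbers 0 to 2 are taken from the tag names in the index.

```python
import json

# Abbreviated copy of the index shown above.
index_json = """
{
  "entrpo": ["latest", "seed_0", "seed_1", "seed_2"],
  "trpo": ["latest", "seed_0", "seed_1", "seed_2"],
  "trpo_entropy": ["latest"]
}
"""


def split_by_seed_coverage(index, seeds=(0, 1, 2)):
    """Separate variants with every per-seed checkpoint from those without."""
    required = {f"seed_{s}" for s in seeds}
    complete = sorted(v for v, tags in index.items() if required <= set(tags))
    partial = sorted(v for v in index if v not in complete)
    return complete, partial


complete, partial = split_by_seed_coverage(json.loads(index_json))
print(complete)  # ['entrpo', 'trpo']
print(partial)   # ['trpo_entropy']
```

In the full index, `trpo_entropy` is the only entry with just a `latest` tag, which is consistent with it being one of the older Hub folders mentioned under "Variant definitions".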