Entropy-TRPO Model Weights

PyTorch checkpoints for the comparative study in "A Review of Entropy-Based Extensions to Trust Region Policy Optimization".

Repository layout

Each checkpoint directory contains:

| File | Description |
| --- | --- |
| policy.pt | Policy network state dict |
| value.pt | Value network state dict |
| config.json | Training hyperparameters |
| metadata.json | Paper source, variant flags, final metrics |

Hub path: {env_id}/{variant}/latest/ (e.g. CartPole-v1/entrpo/latest/).
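As an illustration of the path scheme above, the sketch below builds the in-repo paths of the four checkpoint files and, optionally, downloads the policy weights. The helper names `checkpoint_files` and `download_policy_state` are hypothetical and not part of the training code. The `tag` argument assumes that per-seed folders (`seed_0`, ...) sit next to `latest/`, which this card does not state explicitly. The download helper assumes `huggingface_hub` and `torch` are installed and that `repo_id` names the Hub repository holding these weights.

```python
from typing import Dict

# File names listed in the repository layout above.
CHECKPOINT_FILES = ("policy.pt", "value.pt", "config.json", "metadata.json")


def checkpoint_files(env_id: str, variant: str, tag: str = "latest") -> Dict[str, str]:
    """Map each file name to its path inside the repo: {env_id}/{variant}/{tag}/<file>."""
    prefix = f"{env_id}/{variant}/{tag}"
    return {name: f"{prefix}/{name}" for name in CHECKPOINT_FILES}


def download_policy_state(repo_id: str, env_id: str, variant: str):
    """Hypothetical helper: fetch policy.pt and load its state dict on CPU."""
    # Imported lazily so the path helper above works without these packages.
    from huggingface_hub import hf_hub_download
    import torch

    filename = checkpoint_files(env_id, variant)["policy.pt"]
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return torch.load(local_path, map_location="cpu")


if __name__ == "__main__":
    for name, path in checkpoint_files("CartPole-v1", "entrpo").items():
        print(f"{name:14s} {path}")
```

The returned state dict still has to be loaded into a network with the matching architecture, which is defined in the training code rather than in this repository.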

The repo README is updated automatically during training with a Training progress table (epoch n/N, eval return, best return, KL) from results/summary.md, plus a JSON index of available checkpoints.

Variant definitions

All trust-region variants share the TRPO surrogate $\max_\theta \mathbb{E}_t[\rho_t(\theta) A_t]$ with $\rho_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and GAE advantages $A_t$, unless noted below.

| Key | Paper name | Objective |
| --- | --- | --- |
| trpo | TRPO | KL trust region $\bar D_{\mathrm{KL}} \le \delta$ (Schulman et al., 2015) |
| entrpo_entropy | EnTRPO-Entropy | $A_t \leftarrow A_t + \beta\,\mathcal{H}(\pi_\theta(\cdot \mid s_t))$ (Roostaie ablation) |
| ero_trpo | ERO-TRPO | Surrogate includes $\beta\,\mathcal{H}(\pi_\theta)$ (Xu et al., 2024) |
| erc_trpo | ERC-TRPO | Relaxed KL: $\bar D_{\mathrm{KL}} \le \delta + \alpha\,\Delta\mathcal{H}$ (Xu et al., 2024) |
| entrpo_buffer | EnTRPO-Buffer | Roostaie on-policy replay buffer only |
| entrpo | EnTRPO | Entropy in advantage + Roostaie buffer (full method) |
| ppo | PPO | Clipped surrogate + entropy (Schulman et al., 2017) |
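The objectives in the table above can be made concrete with a small numeric sketch. It is plain Python rather than the PyTorch training code, assumes a discrete action space, and uses hypothetical function names. The values of $\beta$ and $\alpha$ are placeholders, and $\Delta\mathcal{H}$ is passed in as a number because its exact definition is not given in this card.

```python
import math
from typing import List, Sequence


def categorical_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy H(pi(.|s)) of a discrete action distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_augmented_advantages(
    advantages: Sequence[float],
    action_probs: Sequence[Sequence[float]],
    beta: float,
) -> List[float]:
    """entrpo_entropy row: A_t <- A_t + beta * H(pi_theta(.|s_t))."""
    return [a + beta * categorical_entropy(p) for a, p in zip(advantages, action_probs)]


def surrogate(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
) -> float:
    """Sample mean of rho_t * A_t, with rho_t = exp(log pi_theta - log pi_theta_old)."""
    terms = [math.exp(ln - lo) * a for ln, lo, a in zip(logp_new, logp_old, advantages)]
    return sum(terms) / len(terms)


def erc_kl_bound(delta: float, alpha: float, delta_entropy: float) -> float:
    """erc_trpo row: relaxed trust-region radius delta + alpha * Delta H."""
    return delta + alpha * delta_entropy


if __name__ == "__main__":
    adv = [1.0, -0.5]
    probs = [[0.5, 0.5], [0.9, 0.1]]  # pi_theta(.|s_t) for two timesteps
    print(entropy_augmented_advantages(adv, probs, beta=0.01))
    print(surrogate([-0.7, -0.1], [-0.7, -0.1], adv))  # rho_t = 1 at theta = theta_old
```

At $\theta = \theta_{\text{old}}$ every ratio equals 1, so the surrogate reduces to the mean advantage; the trust-region step itself (conjugate gradient and line search) is not shown here.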

Older Hub folders (trpo_entropy, trpo_buffer, …) remain valid; training resumes from them automatically.

Variants and paper sources

| Variant | Paper |
| --- | --- |
| trpo | Schulman et al. (2015), Trust Region Policy Optimization, ICML |
| entrpo_entropy | Roostaie & Ebadzadeh (2021), EnTRPO (entropy-in-advantage ablation) |
| entrpo_buffer | Roostaie & Ebadzadeh (2021), EnTRPO (replay-buffer ablation) |
| entrpo | Roostaie & Ebadzadeh (2021), EnTRPO (full method) |
| ero_trpo | Xu et al. (2024), ERO-TRPO |
| erc_trpo | Xu et al. (2024), ERC-TRPO |
| ppo | Schulman et al. (2017), Proximal Policy Optimization |

See metadata.json in each folder for full author names and URLs.

Usage

Training and evaluation code is in the entropy-trpo repository on GitHub (the URL below is a placeholder; update it when the code is published).

git clone https://github.com/your-username/entropy-trpo.git
cd entropy-trpo
make setup          # install deps + create .env
# edit .env with HF_TOKEN and HF_REPO_ID
make download-weights
make eval-checkpoints

Citation

@article{entropytrpo2025,
  title   = {A Review of Entropy-Based Extensions to Trust Region Policy Optimization},
  author  = {Green, Simon},
  journal = {IEEE Transactions},
  year    = {2025}
}
@article{roostaie2021entrpo,
  title   = {EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization},
  author  = {Roostaie, Sahar and Ebadzadeh, Mohammad Mehdi},
  journal = {arXiv:2110.13373},
  year    = {2021}
}

Training progress

Last updated: 2026-06-25 06:29:52 UTC

  • Device: cpu
  • Config: configs/cpu.yaml
  • Jobs complete: 62/63
  • Running: 1

CartPole-v1 (1M benchmark)

| Variant | Status | Epoch | Timesteps | Eval return | Best return | KL |
| --- | --- | --- | --- | --- | --- | --- |
| TRPO (s0) | done | 10/10 | 50,000 | 252.6 ± 72.4 | 293.4 | 0.0049 |
| TRPO (s1) | done | 10/10 | 50,000 | 297.7 ± 69.6 | 297.8 | 0.0076 |
| TRPO (s2) | done | 10/10 | 50,000 | 390.4 ± 78.6 | 390.4 | 0.0057 |
| EnTRPO-Entropy (s0) | done | 10/10 | 50,000 | 267.6 ± 56.1 | 324.0 | 0.0056 |
| EnTRPO-Entropy (s1) | done | 10/10 | 50,000 | 277.1 ± 87.7 | 297.8 | 0.0056 |
| EnTRPO-Entropy (s2) | done | 10/10 | 50,000 | 373.8 ± 92.4 | 373.8 | 0.0027 |
| ERO-TRPO (s0) | done | 10/10 | 50,000 | 20.5 ± 11.4 | 28.3 | 0.0000 |
| ERO-TRPO (s1) | done | 10/10 | 50,000 | 25.8 ± 15.0 | 27.6 | 0.0000 |
| ERO-TRPO (s2) | done | 10/10 | 50,000 | 21.2 ± 10.5 | 27.9 | 0.0000 |
| ERC-TRPO (s0) | done | 10/10 | 50,000 | 18.5 ± 8.1 | 23.5 | 0.0000 |
| ERC-TRPO (s1) | done | 10/10 | 50,000 | 31.3 ± 22.2 | 31.3 | 0.0000 |
| ERC-TRPO (s2) | done | 10/10 | 50,000 | 28.7 ± 14.8 | 32.4 | 0.0000 |
| EnTRPO-Buffer (s0) | done | 10/10 | 50,000 | 216.5 ± 82.2 | 262.0 | 0.0050 |
| EnTRPO-Buffer (s1) | done | 10/10 | 50,000 | 321.5 ± 106.3 | 340.6 | 0.0049 |
| EnTRPO-Buffer (s2) | done | 10/10 | 50,000 | 80.5 ± 33.3 | 165.7 | 0.0086 |
| EnTRPO (s0) | done | 9/10 | 50,000 | 147.0 ± 66.1 | 174.6 | 0.0082 |
| EnTRPO (s1) | done | 10/10 | 50,000 | 224.6 ± 65.8 | 224.6 | 0.0083 |
| EnTRPO (s2) | done | 10/10 | 50,000 | 186.2 ± 56.1 | 243.5 | 0.0036 |
| PPO (s0) | done | 10/10 | 50,000 | 138.1 ± 82.2 | 138.1 | 0.0009 |
| PPO (s1) | done | 10/10 | 50,000 | 127.8 ± 57.1 | 127.8 | 0.0052 |
| PPO (s2) | done | 10/10 | 50,000 | 138.2 ± 57.5 | 138.2 | 0.0044 |
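To compare variants rather than individual seeds, the per-seed eval returns in the table above can be averaged. The sketch below does this for three of the variants, with values copied from the CartPole-v1 rows. The function name is hypothetical, and the calculation averages the per-seed means only: it ignores the ± spread reported for each run and is not how the paper necessarily aggregates results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# (variant, seed, final eval return) copied from the CartPole-v1 table above.
ROWS = [
    ("TRPO", 0, 252.6), ("TRPO", 1, 297.7), ("TRPO", 2, 390.4),
    ("EnTRPO", 0, 147.0), ("EnTRPO", 1, 224.6), ("EnTRPO", 2, 186.2),
    ("PPO", 0, 138.1), ("PPO", 1, 127.8), ("PPO", 2, 138.2),
]


def mean_return_by_variant(rows: Iterable[Tuple[str, int, float]]) -> Dict[str, float]:
    """Average the final eval return over seeds for each variant."""
    by_variant = defaultdict(list)
    for variant, _seed, eval_return in rows:
        by_variant[variant].append(eval_return)
    return {variant: mean(values) for variant, values in by_variant.items()}


if __name__ == "__main__":
    for variant, avg in mean_return_by_variant(ROWS).items():
        print(f"{variant:8s} {avg:7.1f}")
```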

Humanoid-v5 (1M benchmark)

| Variant | Status | Epoch | Timesteps | Eval return | Best return | KL |
| --- | --- | --- | --- | --- | --- | --- |
| TRPO (s0) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| TRPO (s1) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| TRPO (s2) | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| EnTRPO-Entropy (s0) | done | 488/488 | 999,424 | 256.2 ± 58.3 | 312.4 | -0.0000 |
| EnTRPO-Entropy (s1) | done | 488/488 | 999,424 | 273.6 ± 55.0 | 325.6 | 0.0072 |
| EnTRPO-Entropy (s2) | done | 488/488 | 999,424 | 256.1 ± 72.0 | 333.6 | -0.0000 |
| ERO-TRPO (s0) | done | 488/488 | 999,424 | 250.4 ± 54.5 | 342.4 | 0.0000 |
| ERO-TRPO (s1) | done | 488/488 | 999,424 | 250.6 ± 23.7 | 315.4 | 0.0071 |
| ERO-TRPO (s2) | done | 488/488 | 999,424 | 261.1 ± 65.2 | 329.1 | 0.0053 |
| ERC-TRPO (s0) | done | 488/488 | 999,424 | 206.3 ± 84.2 | 262.2 | 0.0023 |
| ERC-TRPO (s1) | done | 488/488 | 999,424 | 254.5 ± 43.4 | 258.7 | -0.0000 |
| ERC-TRPO (s2) | done | 488/488 | 999,424 | 217.7 ± 79.6 | 240.5 | 0.0000 |
| EnTRPO-Buffer (s0) | done | 488/488 | 999,424 | 267.4 ± 72.3 | 326.5 | 0.0000 |
| EnTRPO-Buffer (s1) | done | 488/488 | 999,424 | 252.4 ± 22.5 | 327.7 | 0.0000 |
| EnTRPO-Buffer (s2) | done | 488/488 | 999,424 | 249.7 ± 90.1 | 321.0 | 0.0055 |
| EnTRPO (s0) | done | 488/488 | 999,424 | 245.7 ± 32.2 | 332.4 | 0.0043 |
| EnTRPO (s1) | done | 488/488 | 999,424 | 289.4 ± 72.0 | 325.4 | 0.0074 |
| EnTRPO (s2) | done | 488/488 | 999,424 | 280.4 ± 83.2 | 316.8 | 0.0023 |
| PPO (s0) | done | 488/488 | 999,424 | 350.6 ± 97.2 | 374.9 | 0.1023 |
| PPO (s1) | done | 488/488 | 999,424 | 329.9 ± 86.0 | 406.4 | 0.1046 |
| PPO (s2) | done | 488/488 | 999,424 | 305.1 ± 97.9 | 383.9 | 0.1098 |

HumanoidStandup-v5 (1M benchmark)

| Variant | Status | Epoch | Timesteps | Eval return | Best return | KL |
| --- | --- | --- | --- | --- | --- | --- |
| TRPO (s0) | done | 488/488 | 999,424 | 61102.3 ± 10813.4 | 68649.5 | 0.0060 |
| TRPO (s1) | done | 488/488 | 999,424 | 67908.7 ± 12258.9 | 77610.9 | 0.0000 |
| TRPO (s2) | done | 488/488 | 999,424 | 71182.5 ± 7886.6 | 74013.4 | -0.0000 |
| EnTRPO-Entropy (s0) | done | 488/488 | 999,424 | 63970.3 ± 12674.7 | 69688.4 | 0.0054 |
| EnTRPO-Entropy (s1) | done | 488/488 | 999,424 | 74092.3 ± 6436.0 | 84192.4 | -0.0000 |
| EnTRPO-Entropy (s2) | done | 488/488 | 999,424 | 66211.8 ± 11884.0 | 73910.5 | -0.0000 |
| ERO-TRPO (s0) | done | 488/488 | 999,424 | 46220.5 ± 2917.4 | 47845.4 | -0.0000 |
| ERO-TRPO (s1) | done | 488/488 | 999,424 | 47466.5 ± 3243.4 | 52235.4 | 0.0010 |
| ERO-TRPO (s2) | done | 488/488 | 999,424 | 47797.9 ± 3607.9 | 50942.6 | 0.0000 |
| ERC-TRPO (s0) | done | 488/488 | 999,424 | 43753.6 ± 3107.6 | 44799.7 | 0.0000 |
| ERC-TRPO (s1) | done | 488/488 | 999,424 | 47285.0 ± 3076.0 | 51595.2 | 0.0003 |
| ERC-TRPO (s2) | done | 488/488 | 999,424 | 48414.2 ± 3424.0 | 51013.4 | 0.0009 |
| EnTRPO-Buffer (s0) | done | 488/488 | 999,424 | 37442.2 ± 2643.0 | 39323.6 | 0.0086 |
| EnTRPO-Buffer (s1) | done | 488/488 | 999,424 | 41544.7 ± 3903.9 | 45498.8 | -0.0001 |
| EnTRPO-Buffer (s2) | done | 488/488 | 999,424 | 37246.1 ± 1735.1 | 40451.8 | 0.0000 |
| EnTRPO (s0) | done | 488/488 | 999,424 | 35912.7 ± 2063.5 | 39687.9 | 0.0000 |
| EnTRPO (s1) | done | 488/488 | 999,424 | 43605.8 ± 3502.3 | 46649.5 | 0.0042 |
| EnTRPO (s2) | done | 488/488 | 999,424 | 37992.1 ± 4942.4 | 40168.8 | 0.0064 |
| PPO (s0) | done | 488/488 | 999,424 | 83486.9 ± 6256.6 | 100329.1 | 0.3202 |
| PPO (s1) | done | 488/488 | 999,424 | 88231.3 ± 14647.0 | 110834.0 | 0.5174 |
| PPO (s2) | running | 1/488 | 2,048 | 39294.2 ± 2509.3 | 39294.2 | 0.0830 |

Available checkpoints

{
  "entrpo": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "entrpo_buffer": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "entrpo_entropy": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "erc_trpo": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "ero_trpo": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "ppo": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "trpo": [
    "latest",
    "seed_0",
    "seed_1",
    "seed_2"
  ],
  "trpo_entropy": [
    "latest"
  ]
}
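The index above can be consumed programmatically, for example to see which variants have per-seed checkpoints and which (such as the older `trpo_entropy` folder) only have `latest`. A minimal sketch, using an abbreviated copy of the index as an inline string; the function name is hypothetical, and how the real index is fetched from the Hub is left out.

```python
import json
from typing import Dict, List

# Abbreviated copy of the index above; the full index has more variants.
INDEX_JSON = """
{
  "entrpo": ["latest", "seed_0", "seed_1", "seed_2"],
  "trpo": ["latest", "seed_0", "seed_1", "seed_2"],
  "trpo_entropy": ["latest"]
}
"""


def seeds_available(index: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Return {variant: sorted seed numbers} parsed from 'seed_<n>' entries."""
    result = {}
    for variant, tags in index.items():
        seeds = sorted(int(t.split("_", 1)[1]) for t in tags if t.startswith("seed_"))
        result[variant] = seeds
    return result


index = json.loads(INDEX_JSON)
seeds = seeds_available(index)

if __name__ == "__main__":
    for variant, found in seeds.items():
        label = ", ".join(map(str, found)) if found else "latest only"
        print(f"{variant:14s} {label}")
```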