SuperMarioBros-NES Level 1

PPO policy checkpoint that completes Level 1-1 of SuperMarioBros-Nes-v0 (state Level1-1) under Stable Retro, trained with Stable Baselines3 in the rlab project.

At a Glance

| Item | Value |
| --- | --- |
| Task | Reinforcement learning policy for Super Mario Bros. Level 1-1 completion |
| Environment | SuperMarioBros-Nes-v0, state Level1-1 |
| Model | Stable Baselines3 PPO |
| Format | PyTorch checkpoint inside SB3 .zip |
| Input | 4 stacked grayscale 84 x 84 frames, channel-first |
| Output | Discrete action over the simple Mario action set |
| Eval completion rate | 100/100 episodes |
| Eval profile | mario_level1_v1, stochastic policy sampling |
| Uploaded checkpoint | B31 seed 23, checkpoint 4,500,000 timesteps |

Preview

Preview episode: stochastic policy sample from eval seed 10045; completed Level 1-1 with max_x_pos=6264, reward 6285.50, and no death.

Quick Start

This checkpoint requires a local Stable Retro setup with the game ROM imported. The ROM is not included.

hf download tsilva/SuperMarioBros-NES_Level1 \
  ppo_supermariobros-nes-v0_4500000_steps.zip \
  --local-dir models/SuperMarioBros-NES_Level1

git clone https://github.com/tsilva/rlab.git
cd rlab
UV_CACHE_DIR=.uv-cache uv sync --frozen

UV_CACHE_DIR=.uv-cache uv run python -m stable_retro_ppo.play \
  --model ../models/SuperMarioBros-NES_Level1/ppo_supermariobros-nes-v0_4500000_steps.zip \
  --game SuperMarioBros-Nes-v0 \
  --state Level1-1 \
  --episodes 3 \
  --seed 10007 \
  --max-steps 2500 \
  --frame-skip 4 \
  --fps 30 \
  --scale 4 \
  --stochastic \
  --reward-mode score \
  --action-set simple \
  --completion-x-threshold 3160 \
  --terminate-on-life-loss \
  --terminate-on-completion

Validate on the Eval Profile

The reported metric was computed with stochastic policy sampling over 100 episodes, seed start 10007, max 2500 policy steps per episode.

UV_CACHE_DIR=.uv-cache uv run python -m stable_retro_ppo.evaluate \
  --model ../models/SuperMarioBros-NES_Level1/ppo_supermariobros-nes-v0_4500000_steps.zip \
  --game SuperMarioBros-Nes-v0 \
  --state Level1-1 \
  --episodes 100 \
  --seed 10007 \
  --max-steps 2500 \
  --frame-skip 4 \
  --stochastic \
  --reward-mode score \
  --action-set simple \
  --completion-x-threshold 3160 \
  --terminate-on-life-loss \
  --terminate-on-completion

Expected summary fields:

completion_count: 100
completion_rate: 1.0
max_x_max: 6264
reward_mean: 3398.5530015126615
death_count: 0
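
As a rough illustration of how summary fields like these could be derived from per-episode results, here is a minimal sketch. The `EpisodeResult` record and the `summarize` helper are hypothetical names, not part of rlab, and the completion rule (furthest x position at or past the threshold, with no death) is an assumption about how the eval counts a completion.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class EpisodeResult:
    """Hypothetical per-episode record; field names are illustrative only."""
    max_x_pos: int
    reward: float
    died: bool


def summarize(episodes, completion_x_threshold=3160):
    """Aggregate per-episode results into the summary fields listed above."""
    # Assumption: an episode is "completed" when the furthest x position
    # reaches the threshold and the episode did not end in a death.
    completed = [
        e for e in episodes
        if e.max_x_pos >= completion_x_threshold and not e.died
    ]
    return {
        "completion_count": len(completed),
        "completion_rate": len(completed) / len(episodes),
        "max_x_max": max(e.max_x_pos for e in episodes),
        "reward_mean": fmean(e.reward for e in episodes),
        "death_count": sum(e.died for e in episodes),
    }


# Two made-up episodes, purely to show the shape of the output.
episodes = [
    EpisodeResult(max_x_pos=3161, reward=3100.0, died=False),
    EpisodeResult(max_x_pos=1500, reward=900.0, died=True),
]
summary = summarize(episodes)
print(summary)
```

With 100 completed episodes out of 100, the same arithmetic gives completion_count 100 and completion_rate 1.0, matching the expected fields.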

Results

| Eval profile | Episodes | Seed start | Completion rate | Max x | Mean reward |
| --- | --- | --- | --- | --- | --- |
| mario_level1_v1 | 100 | 10007 | 100/100 | 6264 | 3398.55 |

These numbers come from a separate, out-of-process evaluation of the saved checkpoint. They are not the same signal as the rolling-window completion rate used as the training stop metric; during this run, the maximum training rolling completion rate stayed below 100/100.
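
To make the distinction concrete, a rolling-window completion rate can be sketched as below. The class name and the default window of 100 episodes are assumptions for illustration, not taken from the rlab code. During training, a window like this mixes episodes collected under successive policy updates, whereas a checkpoint eval runs every episode with one fixed set of weights, so the two rates need not agree.

```python
from collections import deque


class RollingCompletionRate:
    """Tracks the completion rate over the most recent `window` episodes."""

    def __init__(self, window=100):
        # deque(maxlen=...) silently drops the oldest outcome once full.
        self._outcomes = deque(maxlen=window)

    def update(self, completed):
        """Record one finished episode and return the current rolling rate."""
        self._outcomes.append(bool(completed))
        return sum(self._outcomes) / len(self._outcomes)

    @property
    def full(self):
        """True once the window holds `window` episodes."""
        return len(self._outcomes) == self._outcomes.maxlen


# Small window so the effect of an early failure is easy to follow.
tracker = RollingCompletionRate(window=4)
rates = [tracker.update(done) for done in [True, False, True, True, True, True]]
print(rates)
```

In this toy run the single failure keeps the rate below 1.0 until it falls out of the window, which is one way a training-time rate can trail a later fixed-checkpoint eval.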

Input / Output

The policy receives observations produced by the project eval wrapper:

  • Game: SuperMarioBros-Nes-v0
  • State: Level1-1
  • Preprocessing: crop top 32 pixels, grayscale, resize to 84 x 84
  • Frame stack: 4
  • Frame skip: 4
  • Max-pool last two frames: enabled
  • Observation layout: channel-first stack, shape (4, 84, 84)
  • Action set: simple = noop, right, right_b, right_a, right_a_b, a, left
  • Policy mode for reported eval: stochastic sampling
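
The observation pipeline listed above can be approximated with NumPy alone. This is a sketch under stated assumptions, not the project's eval wrapper: the raw frame size (224 x 240 RGB), the luma weights, nearest-neighbour resizing, and max-pooling before grayscale conversion are all guesses made for illustration, and `preprocess`, `pooled_observation`, and `FrameStack` are hypothetical names.

```python
from collections import deque

import numpy as np

# ITU-R BT.601 luma weights; the real wrapper may convert differently.
LUMA = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def preprocess(frame_rgb, crop_top=32, size=84):
    """Crop the top rows, convert to grayscale, and resize to (size, size)."""
    cropped = frame_rgb[crop_top:]
    gray = cropped.astype(np.float32) @ LUMA
    # Nearest-neighbour resize (an assumption, chosen to avoid extra dependencies).
    rows = np.arange(size) * gray.shape[0] // size
    cols = np.arange(size) * gray.shape[1] // size
    resized = gray[np.ix_(rows, cols)]
    return np.clip(np.rint(resized), 0, 255).astype(np.uint8)


def pooled_observation(skipped_frames):
    """Max-pool the last two raw frames of a frame-skip window, then preprocess."""
    pooled = np.maximum(skipped_frames[-2], skipped_frames[-1])
    return preprocess(pooled)


class FrameStack:
    """Keeps the most recent n processed frames as a channel-first array."""

    def __init__(self, n=4):
        self.frames = deque(maxlen=n)

    def reset(self, first_obs):
        # Fill the stack with copies of the first processed frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_obs)
        return np.stack(self.frames)

    def push(self, obs):
        self.frames.append(obs)
        return np.stack(self.frames)


# Demo with random frames standing in for one frame-skip window of 4 frames.
rng = np.random.default_rng(0)
window = [rng.integers(0, 256, size=(224, 240, 3), dtype=np.uint8) for _ in range(4)]
stack = FrameStack(n=4)
obs = stack.reset(pooled_observation(window))
obs = stack.push(pooled_observation(window))
print(obs.shape, obs.dtype)  # (4, 84, 84) uint8
```

The resulting array has the channel-first layout (4, 84, 84) that the policy expects; pixel values will differ from the real wrapper wherever its interpolation or grayscale conversion differs.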

Architecture

  • Algorithm: PPO from Stable Baselines3.
  • Policy checkpoint: SB3 PyTorch .zip.
  • Training used native Stable Retro vector environments.
  • The uploaded file contains the policy, optimizer state, SB3 metadata, and system info.
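
Because the checkpoint is an ordinary zip archive, its parts can be listed with the standard library before loading anything. The sketch below uses a hypothetical helper, `list_checkpoint_members`, and demonstrates it on a stand-in archive with placeholder contents; the member names in the demo are those SB3 typically writes and are assumed here, not read from the uploaded file.

```python
import tempfile
import zipfile
from pathlib import Path


def list_checkpoint_members(path):
    """Return {member name: uncompressed size in bytes} for a .zip checkpoint."""
    with zipfile.ZipFile(path) as archive:
        return {info.filename: info.file_size for info in archive.infolist()}


# Demo on a stand-in archive. The names below are assumed, typical SB3
# member names; the contents are placeholders, not real model data.
with tempfile.TemporaryDirectory() as tmp:
    stand_in = Path(tmp) / "stand_in.zip"
    with zipfile.ZipFile(stand_in, "w") as archive:
        for name in ("data", "policy.pth", "policy.optimizer.pth", "system_info.txt"):
            archive.writestr(name, b"placeholder")
    members = list_checkpoint_members(stand_in)

for name, size in sorted(members.items()):
    print(f"{name}: {size} bytes")
```

Pointing the helper at the downloaded ppo_supermariobros-nes-v0_4500000_steps.zip should show which entries hold the policy weights, optimizer state, and metadata.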

Training Recipe

| Setting | Value |
| --- | --- |
| Seed | 23 |
| Environment count | 16 |
| n_steps | 512 |
| Batch size | 512 |
| Epochs | 10 |
| Learning rate | 1.5e-4 |
| Entropy coefficient | 0.01 -> 0.0003 over 2,000,000 timesteps |
| Gamma | 0.9 |
| GAE lambda | 1.0 |
| Clip range | 0.15 |
| Target KL | 0.12 |
| Reward mode | score |
| Completion threshold | 3160 |
| Termination | life loss and completion |
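
A few quantities follow directly from these settings. The sketch below works through the rollout arithmetic and shows one possible shape for the entropy-coefficient schedule. The linear anneal is an assumption, since the recipe only gives the two endpoints and the duration, and `entropy_coef` is a hypothetical helper rather than an SB3 or rlab function.

```python
N_ENVS = 16
N_STEPS = 512
BATCH_SIZE = 512
N_EPOCHS = 10
CHECKPOINT_TIMESTEPS = 4_500_000

# Samples collected per PPO update: one rollout of n_steps from every env.
rollout_size = N_ENVS * N_STEPS                       # 8192
# Minibatches per pass over the rollout, and gradient steps per update.
minibatches_per_epoch = rollout_size // BATCH_SIZE    # 16
gradient_steps_per_update = minibatches_per_epoch * N_EPOCHS  # 160
# Number of rollouts that fit into the checkpoint's timestep count.
rollouts_at_checkpoint = CHECKPOINT_TIMESTEPS / rollout_size  # about 549.3


def entropy_coef(timesteps, start=0.01, end=0.0003, anneal_timesteps=2_000_000):
    """Assumed linear anneal from start to end, then constant at end."""
    frac = min(max(timesteps / anneal_timesteps, 0.0), 1.0)
    return start + frac * (end - start)


print(rollout_size, minibatches_per_epoch, gradient_steps_per_update)
print(round(rollouts_at_checkpoint, 1))
print(entropy_coef(0), entropy_coef(1_000_000), entropy_coef(4_500_000))
```

Under the linear assumption, the entropy coefficient has been at its final value of 0.0003 for the last 2,500,000 timesteps before the uploaded checkpoint.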

Run description: B31 seed-23 screen combining clipped delta-x reward stabilization with target_kl=0.12.

Files

| File | Purpose |
| --- | --- |
| ppo_supermariobros-nes-v0_4500000_steps.zip | SB3 PPO checkpoint selected for upload |
| replay.mp4 | Hugging Face reinforcement-learning widget preview of a completed stochastic eval episode |
| preview_summary.json | Preview seed, video, and episode metadata |
| model_metadata.json | Provenance, eval profile, and checksum metadata |

Provenance

  • Source repo: tsilva/rlab
  • W&B run: https://wandb.ai/tsilva/SuperMarioBros-NES/runs/9j4r2h3g
  • W&B run id: 9j4r2h3g
  • W&B run name: b31_post12_loosekl_5m_stop100ep100_clip015_targetkl012_clippeddx_seed23_20260618_192135
  • W&B artifact: tsilva/SuperMarioBros-NES/b31_post12_loosekl_5m_stop100ep100_clip015_targetkl012_clippeddx_seed23_20260618_192135-checkpoint:v44
  • Artifact alias: step-4500000
  • Model SHA256: 75eb50015295f887c7faae7dbbb80b9a024052581c443fbc0ce5b72e0be47f11
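
The SHA256 above can be used to confirm that a downloaded checkpoint matches the uploaded one. A minimal sketch using only the standard library follows; `sha256_of_file` and `verify_checkpoint` are hypothetical helper names, and the path in the usage comment assumes the download location from Quick Start.

```python
import hashlib

EXPECTED_SHA256 = "75eb50015295f887c7faae7dbbb80b9a024052581c443fbc0ce5b72e0be47f11"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 matches the published checksum."""
    return sha256_of_file(path) == expected


# Usage (path assumed from the Quick Start download step):
# verify_checkpoint(
#     "models/SuperMarioBros-NES_Level1/ppo_supermariobros-nes-v0_4500000_steps.zip"
# )
```

A False result would suggest a truncated download or a different checkpoint file than the one described in this card.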

Limitations

  • No ROM is included; users must provide and import their own legally obtained ROM.
  • The reported result uses the current mario_level1_v1 eval profile with stochastic policy sampling; sticky actions were not used.
  • The checkpoint was selected by eval performance on this project-specific setup, not by a standardized public benchmark.
  • The model card reports one checkpoint evaluation, not a multi-seed aggregate.