SuperMarioBros-NES Level 1

PPO policy checkpoint that completes Level 1-1 of SuperMarioBros-Nes-v0 in Stable Retro, trained with the rlab Stable Baselines3 project.

At a Glance

Item	Value
Task	Reinforcement learning policy for Super Mario Bros. Level 1-1 completion
Environment	`SuperMarioBros-Nes-v0`, state `Level1-1`
Model	Stable Baselines3 PPO
Format	PyTorch checkpoint inside SB3 `.zip`
Input	4 stacked grayscale `84 x 84` frames, channel-first
Output	Discrete action over the `simple` Mario action set
Eval completion rate	`100/100` episodes
Eval profile	`mario_level1_v1`, stochastic policy sampling
Uploaded checkpoint	B31 seed 23, checkpoint `4,500,000` timesteps

Preview

Preview episode: stochastic policy sample from eval seed 10045; completed Level 1-1 with max_x_pos=6264, reward 6285.50, and no death.

Quick Start

This checkpoint requires a local Stable Retro setup with the game ROM imported. The ROM is not included.

hf download tsilva/SuperMarioBros-NES_Level1 \
  ppo_supermariobros-nes-v0_4500000_steps.zip \
  --local-dir models/SuperMarioBros-NES_Level1

git clone https://github.com/tsilva/rlab.git
cd rlab
UV_CACHE_DIR=.uv-cache uv sync --frozen

UV_CACHE_DIR=.uv-cache uv run python -m stable_retro_ppo.play \
  --model ../models/SuperMarioBros-NES_Level1/ppo_supermariobros-nes-v0_4500000_steps.zip \
  --game SuperMarioBros-Nes-v0 \
  --state Level1-1 \
  --episodes 3 \
  --seed 10007 \
  --max-steps 2500 \
  --frame-skip 4 \
  --fps 30 \
  --scale 4 \
  --stochastic \
  --reward-mode score \
  --action-set simple \
  --completion-x-threshold 3160 \
  --terminate-on-life-loss \
  --terminate-on-completion

Validate on the Eval Profile

The reported metric was computed with stochastic policy sampling over 100 episodes, starting from seed 10007, with at most 2500 policy steps per episode.

UV_CACHE_DIR=.uv-cache uv run python -m stable_retro_ppo.evaluate \
  --model ../models/SuperMarioBros-NES_Level1/ppo_supermariobros-nes-v0_4500000_steps.zip \
  --game SuperMarioBros-Nes-v0 \
  --state Level1-1 \
  --episodes 100 \
  --seed 10007 \
  --max-steps 2500 \
  --frame-skip 4 \
  --stochastic \
  --reward-mode score \
  --action-set simple \
  --completion-x-threshold 3160 \
  --terminate-on-life-loss \
  --terminate-on-completion

Expected summary fields:

completion_count: 100
completion_rate: 1.0
max_x_max: 6264
reward_mean: 3398.5530015126615
death_count: 0
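
The summary fields above can be read as simple aggregates over per-episode results. The sketch below shows one way such a summary could be computed; the per-episode record keys (`completed`, `max_x_pos`, `reward`, `died`) and the sample records are hypothetical and are not taken from the project's evaluate script.

```python
from statistics import mean


def summarize(episodes):
    """Aggregate per-episode records into the summary fields listed above."""
    completion_count = sum(1 for ep in episodes if ep["completed"])
    return {
        "completion_count": completion_count,
        "completion_rate": completion_count / len(episodes),
        "max_x_max": max(ep["max_x_pos"] for ep in episodes),
        "reward_mean": mean(ep["reward"] for ep in episodes),
        "death_count": sum(1 for ep in episodes if ep["died"]),
    }


# Hypothetical records for illustration only, not real eval output.
episodes = [
    {"completed": True, "max_x_pos": 3200, "reward": 3000.0, "died": False},
    {"completed": True, "max_x_pos": 3300, "reward": 3500.0, "died": False},
]
summary = summarize(episodes)
```

Under this reading, `completion_rate` is `completion_count` divided by the number of episodes, and `max_x_max` is the largest per-episode maximum x position.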

Results

Eval profile	Episodes	Seed start	Completion rate	Max x	Mean reward
`mario_level1_v1`	100	10007	`100/100`	6264	3398.55

These numbers come from an out-of-process evaluation of the saved checkpoint. They are not the same signal as the rolling-window completion metric used to stop training; this run's maximum rolling completion rate during training was below 100/100.
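
To illustrate why a rolling-window training metric and a fixed eval can disagree, here is a minimal sketch of a rolling completion rate. The class name and the window size are assumptions for illustration; the project's actual stop metric may be computed differently.

```python
from collections import deque


class RollingCompletionRate:
    """Track the completion rate over the most recent `window` episodes."""

    def __init__(self, window=100):
        # Older outcomes fall off the left once the window is full.
        self.outcomes = deque(maxlen=window)

    def update(self, completed):
        self.outcomes.append(bool(completed))
        return self.rate()

    def rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)


# Small window so the effect is easy to follow by hand.
tracker = RollingCompletionRate(window=4)
for outcome in [True, False, True, True, True]:
    latest = tracker.update(outcome)
```

A rolling window mixes episodes collected under slightly different policy weights during training, whereas a checkpoint eval runs every episode with one frozen policy, so the two rates need not match.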

Input / Output

The policy receives observations produced by the project eval wrapper:

Game: SuperMarioBros-Nes-v0
State: Level1-1
Preprocessing: crop top 32 pixels, grayscale, resize to 84 x 84
Frame stack: 4
Frame skip: 4
Max-pool last two frames: enabled
Observation layout: channel-first stack, shape (4, 84, 84)
Action set: simple = noop, right, right_b, right_a, right_a_b, a, left
Policy mode for reported eval: stochastic sampling
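
The observation pipeline listed above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the project's wrapper: the order of the max-pool relative to the other steps, the luma weights, the nearest-neighbor resize, and the raw frame shape `(224, 240, 3)` are all assumed here.

```python
from collections import deque

import numpy as np

CROP_TOP = 32   # rows removed from the top of the frame
OUT_SIZE = 84   # output height and width
STACK = 4       # number of stacked frames


def preprocess(frame_a, frame_b):
    """Max-pool two RGB frames, crop the top, grayscale, resize to 84 x 84."""
    pooled = np.maximum(frame_a, frame_b)
    cropped = pooled[CROP_TOP:]
    # Assumed grayscale conversion using BT.601 luma weights.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = cropped.astype(np.float32) @ weights
    # Nearest-neighbor resize as a dependency-free stand-in for the
    # interpolation the real wrapper uses (unknown here).
    rows = np.arange(OUT_SIZE) * gray.shape[0] // OUT_SIZE
    cols = np.arange(OUT_SIZE) * gray.shape[1] // OUT_SIZE
    small = gray[np.ix_(rows, cols)]
    return np.clip(np.rint(small), 0, 255).astype(np.uint8)


def stack_frames(history):
    """Return a channel-first stack with shape (STACK, 84, 84)."""
    return np.stack(list(history), axis=0)


# Dummy black frame with an assumed NES frame shape.
raw = np.zeros((224, 240, 3), dtype=np.uint8)
history = deque([preprocess(raw, raw)] * STACK, maxlen=STACK)
obs = stack_frames(history)
```

The resulting `obs` has the channel-first layout `(4, 84, 84)` that the policy expects; appending a new processed frame to `history` drops the oldest one.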

Architecture

Algorithm: PPO from Stable Baselines3.
Policy checkpoint: SB3 PyTorch .zip.
Training used native Stable Retro vector environments.
The uploaded file contains the policy, optimizer state, SB3 metadata, and system info.
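
Because the checkpoint is a regular zip archive, its contents can be inspected with the Python standard library before loading it. The helper below is a small sketch; the function name is hypothetical. SB3 archives typically contain entries such as `data`, `policy.pth`, `policy.optimizer.pth`, and `system_info.txt`, but the exact member names of this file have not been verified here.

```python
import zipfile


def list_checkpoint_members(path):
    """Return a mapping of archive member name to uncompressed size in bytes."""
    with zipfile.ZipFile(path) as archive:
        return {info.filename: info.file_size for info in archive.infolist()}


# Example usage (path taken from the Quick Start download step):
# members = list_checkpoint_members(
#     "models/SuperMarioBros-NES_Level1/ppo_supermariobros-nes-v0_4500000_steps.zip"
# )
# for name, size in members.items():
#     print(name, size)
```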

Training Recipe

Setting	Value
Seed	23
Environment count	16
`n_steps`	512
Batch size	512
Epochs	10
Learning rate	`1.5e-4`
Entropy coefficient	`0.01 -> 0.0003` over `2,000,000` timesteps
Gamma	`0.9`
GAE lambda	`1.0`
Clip range	`0.15`
Target KL	`0.12`
Reward mode	`score`
Completion threshold	`3160`
Termination	life loss and completion

Run description: B31 seed-23 screen combining clipped delta-x reward stabilization with target_kl=0.12.
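
The recipe values imply a few derived quantities that are worth spelling out. The sketch below assumes SB3's usual PPO semantics, where the rollout buffer holds `n_envs * n_steps` transitions and the batch size is the minibatch size. The entropy schedule is assumed to be linear; the source only gives its endpoints and duration, so the function name and shape are illustrative.

```python
N_ENVS = 16
N_STEPS = 512
BATCH_SIZE = 512
N_EPOCHS = 10

# Transitions collected before each policy update.
rollout_size = N_ENVS * N_STEPS                               # 8192
# Minibatches per pass over the rollout buffer.
minibatches_per_epoch = rollout_size // BATCH_SIZE            # 16
# Gradient steps taken per policy update.
gradient_steps_per_update = minibatches_per_epoch * N_EPOCHS  # 160


def entropy_coef(timestep, start=0.01, end=0.0003, decay_steps=2_000_000):
    """Assumed linear decay from `start` to `end`, then held constant."""
    fraction = min(max(timestep / decay_steps, 0.0), 1.0)
    return start + fraction * (end - start)
```

Under these assumptions the entropy coefficient has already reached its final value of `0.0003` well before the uploaded `4,500,000`-timestep checkpoint.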

Files

File	Purpose
`ppo_supermariobros-nes-v0_4500000_steps.zip`	SB3 PPO checkpoint selected for upload
`replay.mp4`	Hugging Face reinforcement-learning widget preview of a completed stochastic eval episode
`preview_summary.json`	Preview seed, video, and episode metadata
`model_metadata.json`	Provenance, eval profile, and checksum metadata

Provenance

Source repo: tsilva/rlab
W&B run: https://wandb.ai/tsilva/SuperMarioBros-NES/runs/9j4r2h3g
W&B run id: 9j4r2h3g
W&B run name: b31_post12_loosekl_5m_stop100ep100_clip015_targetkl012_clippeddx_seed23_20260618_192135
W&B artifact: tsilva/SuperMarioBros-NES/b31_post12_loosekl_5m_stop100ep100_clip015_targetkl012_clippeddx_seed23_20260618_192135-checkpoint:v44
Artifact alias: step-4500000
Model SHA256: 75eb50015295f887c7faae7dbbb80b9a024052581c443fbc0ce5b72e0be47f11
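
The SHA256 value above can be used to check a downloaded checkpoint. A minimal sketch using only the standard library follows; the function names are illustrative, and the commented path assumes the download location from the Quick Start.

```python
import hashlib

EXPECTED_SHA256 = "75eb50015295f887c7faae7dbbb80b9a024052581c443fbc0ce5b72e0be47f11"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest matches the published checksum."""
    return sha256_of_file(path) == expected


# Example usage:
# ok = verify_checkpoint(
#     "models/SuperMarioBros-NES_Level1/ppo_supermariobros-nes-v0_4500000_steps.zip"
# )
```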

Limitations

No ROM is included; users must provide and import their own legally obtained ROM.
The reported result uses the current mario_level1_v1 eval profile with stochastic policy sampling; it was not evaluated with sticky actions.
The checkpoint was selected by eval performance on this project-specific setup, not by a standardized public benchmark.
The model card reports one checkpoint evaluation, not a multi-seed aggregate.
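
Since the reported result depends on stochastic sampling, it may help to see how that differs from greedy action selection. The sketch below is generic and does not use the project's code: the logits are made up, and mapping action indices to names in the order listed for the `simple` action set is an assumption.

```python
import math
import random

# Order assumed to match the `simple` action set listing.
SIMPLE_ACTIONS = ["noop", "right", "right_b", "right_a", "right_a_b", "a", "left"]


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, stochastic, rng):
    """Sample from the action distribution, or take the most likely action."""
    probs = softmax(logits)
    if stochastic:
        index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    else:
        index = max(range(len(probs)), key=probs.__getitem__)
    return SIMPLE_ACTIONS[index]


# Made-up logits for illustration only.
logits = [0.1, 2.0, 1.5, 0.3, 1.8, -1.0, -2.0]
rng = random.Random(10007)
greedy = select_action(logits, stochastic=False, rng=rng)
sampled = select_action(logits, stochastic=True, rng=rng)
```

With sampling, lower-probability actions are still chosen occasionally, so results under a deterministic (greedy) policy mode may differ from the stochastic numbers reported here.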
