SuperMarioBros-NES Level 1
PPO policy checkpoint for completing SuperMarioBros-Nes-v0 Level1-1 with Stable Retro, trained in the rlab Stable Baselines3 project.
At a Glance
| Item | Value |
|---|---|
| Task | Reinforcement learning policy for Super Mario Bros. Level 1-1 completion |
| Environment | SuperMarioBros-Nes-v0, state Level1-1 |
| Model | Stable Baselines3 PPO |
| Format | PyTorch checkpoint inside SB3 .zip |
| Input | 4 stacked grayscale 84 x 84 frames, channel-first |
| Output | Discrete action over the simple Mario action set |
| Eval completion rate | 100/100 episodes |
| Eval profile | mario_level1_v1, stochastic policy sampling |
| Uploaded checkpoint | B31 seed 23, checkpoint 4,500,000 timesteps |
Preview
Preview episode: stochastic policy sample from eval seed 10045; completed Level 1-1 with max_x_pos=6264, reward 6285.50, and no death.
Quick Start
This checkpoint requires a local Stable Retro setup with the game ROM imported. The ROM is not included.
hf download tsilva/SuperMarioBros-NES_Level1 \
ppo_supermariobros-nes-v0_4500000_steps.zip \
--local-dir models/SuperMarioBros-NES_Level1
git clone https://github.com/tsilva/rlab.git
cd rlab
UV_CACHE_DIR=.uv-cache uv sync --frozen
UV_CACHE_DIR=.uv-cache uv run python -m stable_retro_ppo.play \
--model ../models/SuperMarioBros-NES_Level1/ppo_supermariobros-nes-v0_4500000_steps.zip \
--game SuperMarioBros-Nes-v0 \
--state Level1-1 \
--episodes 3 \
--seed 10007 \
--max-steps 2500 \
--frame-skip 4 \
--fps 30 \
--scale 4 \
--stochastic \
--reward-mode score \
--action-set simple \
--completion-x-threshold 3160 \
--terminate-on-life-loss \
--terminate-on-completion
Validate on the Eval Profile
The reported metric was computed with stochastic policy sampling over 100 episodes, seed start 10007, max 2500 policy steps per episode.
UV_CACHE_DIR=.uv-cache uv run python -m stable_retro_ppo.evaluate \
--model ../models/SuperMarioBros-NES_Level1/ppo_supermariobros-nes-v0_4500000_steps.zip \
--game SuperMarioBros-Nes-v0 \
--state Level1-1 \
--episodes 100 \
--seed 10007 \
--max-steps 2500 \
--frame-skip 4 \
--stochastic \
--reward-mode score \
--action-set simple \
--completion-x-threshold 3160 \
--terminate-on-life-loss \
--terminate-on-completion
Expected summary fields:
completion_count: 100
completion_rate: 1.0
max_x_max: 6264
reward_mean: 3398.5530015126615
death_count: 0
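To compare a fresh eval run against these fields, a small check like the one below may help. It is a sketch only: how `stable_retro_ppo.evaluate` emits its summary is not described here, so the code assumes you have already parsed the summary into a plain dict. The `diff_summary` helper and the relative tolerance are hypothetical, and whether results reproduce exactly across machines is not stated in this card.

```python
import math

# Expected values copied from the summary fields listed above.
EXPECTED = {
    "completion_count": 100,
    "completion_rate": 1.0,
    "max_x_max": 6264,
    "reward_mean": 3398.5530015126615,
    "death_count": 0,
}


def diff_summary(summary, expected=EXPECTED, rel_tol=1e-6):
    """Return {field: (got, want)} for every expected field that does not match.

    Floats are compared with a relative tolerance (an assumption, not a
    project rule); all other values must be equal.
    """
    mismatches = {}
    for key, want in expected.items():
        got = summary.get(key)
        if isinstance(want, float):
            ok = isinstance(got, (int, float)) and math.isclose(
                got, want, rel_tol=rel_tol
            )
        else:
            ok = got == want
        if not ok:
            mismatches[key] = (got, want)
    return mismatches
```

An empty result means every listed field matched; any other result names the fields that differ along with the observed and expected values.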
Results
| Eval profile | Episodes | Seed start | Completion rate | Max x | Mean reward |
|---|---|---|---|---|---|
| mario_level1_v1 | 100 | 10007 | 100/100 | 6264 | 3398.55 |
This was an out-of-process checkpoint eval. It is not the same signal as the training rolling-window stop metric; this run's maximum training rolling completion rate was below 100/100.
Input / Output
The policy receives observations produced by the project eval wrapper:
- Game: SuperMarioBros-Nes-v0
- State: Level1-1
- Preprocessing: crop top 32 pixels, grayscale, resize to 84 x 84
- Frame stack: 4
- Frame skip: 4
- Max-pool last two frames: enabled
- Observation layout: channel-first stack, shape (4, 84, 84)
- Action set: simple = noop, right, right_b, right_a, right_a_b, a, left
- Policy mode for reported eval: stochastic sampling
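The observation pipeline in the list above can be approximated with NumPy alone. This is an illustrative sketch, not the project's eval wrapper: the raw frame size (224 x 240 RGB), the order of the steps, the grayscale weights, and the nearest-neighbour resize are all assumptions, and the names `preprocess` and `FrameStack` are hypothetical.

```python
from collections import deque

import numpy as np

CROP_TOP = 32  # rows removed from the top of each frame
SIZE = 84      # output height and width
STACK = 4      # frames per observation

# Standard luma weights; the project wrapper may use a different conversion.
GRAY_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def preprocess(frame_a, frame_b):
    """Max-pool two RGB frames, crop the top, grayscale, resize to 84 x 84."""
    pooled = np.maximum(frame_a, frame_b)   # max-pool the last two frames
    cropped = pooled[CROP_TOP:]             # drop the top 32 pixel rows
    gray = cropped.astype(np.float32) @ GRAY_WEIGHTS
    # Nearest-neighbour resize by index sampling (assumed interpolation).
    rows = np.arange(SIZE) * gray.shape[0] // SIZE
    cols = np.arange(SIZE) * gray.shape[1] // SIZE
    small = gray[rows][:, cols]
    return np.clip(np.rint(small), 0, 255).astype(np.uint8)


class FrameStack:
    """Keeps the most recent 4 processed frames in channel-first layout."""

    def __init__(self):
        self.frames = deque(maxlen=STACK)

    def reset(self, first_frame):
        # Fill the whole stack with the first frame of the episode.
        for _ in range(STACK):
            self.frames.append(first_frame)
        return self.observation()

    def push(self, frame):
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        return np.stack(self.frames, axis=0)  # shape (4, 84, 84)
```

Feeding a policy observations that differ from the training-time preprocessing usually degrades it, so for real evaluation the project's own wrapper is the safer choice.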
Architecture
- Algorithm: PPO from Stable Baselines3.
- Policy checkpoint: SB3 PyTorch .zip.
- Training used native Stable Retro vector environments.
- The uploaded file contains the policy, optimizer state, SB3 metadata, and system info.
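One way to see what the archive holds without loading it into SB3 is to list its members with the standard library. The sketch below assumes only that the checkpoint is a regular zip file; the helper name `list_checkpoint_members` is hypothetical. SB3 archives typically hold entries such as `data`, `policy.pth`, `policy.optimizer.pth`, `_stable_baselines3_version`, and `system_info.txt`, but the exact member names in this upload have not been verified here.

```python
import zipfile


def list_checkpoint_members(path):
    """Return {member name: uncompressed size in bytes} for a .zip checkpoint."""
    with zipfile.ZipFile(path) as archive:
        return {info.filename: info.file_size for info in archive.infolist()}


def print_checkpoint_members(path):
    """Print one line per archive member, sorted by name."""
    for name, size in sorted(list_checkpoint_members(path).items()):
        print(f"{size:>12}  {name}")
```

Calling `print_checkpoint_members` on the downloaded checkpoint path shows whether the optimizer state and metadata entries are present before committing to a full load.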
Training Recipe
| Setting | Value |
|---|---|
| Seed | 23 |
| Environment count | 16 |
| n_steps | 512 |
| Batch size | 512 |
| Epochs | 10 |
| Learning rate | 1.5e-4 |
| Entropy coefficient | 0.01 -> 0.0003 over 2,000,000 timesteps |
| Gamma | 0.9 |
| GAE lambda | 1.0 |
| Clip range | 0.15 |
| Target KL | 0.12 |
| Reward mode | score |
| Completion threshold | 3160 |
| Termination | life loss and completion |
Run description: B31 seed-23 screen combining clipped delta-x reward stabilization with target_kl=0.12.
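The recipe implies a few derived quantities that are easy to work out from the table. The sketch below computes the rollout size and the gradient steps per update, and models the entropy coefficient schedule. The linear shape of the decay is an assumption (the table gives only the endpoints and the duration), and how rlab applies the schedule is not described here; the function names are hypothetical.

```python
# Values copied from the Training Recipe table.
N_ENVS = 16
N_STEPS = 512
BATCH_SIZE = 512
N_EPOCHS = 10

ENT_START = 0.01
ENT_END = 0.0003
ENT_DECAY_STEPS = 2_000_000


def rollout_size():
    """Transitions collected per PPO update: 16 envs x 512 steps = 8192."""
    return N_ENVS * N_STEPS


def gradient_steps_per_update():
    """Minibatches per epoch times epochs: (8192 / 512) x 10 = 160."""
    return (rollout_size() // BATCH_SIZE) * N_EPOCHS


def ent_coef_at(timestep):
    """Entropy coefficient at a timestep, assuming linear decay then a floor."""
    frac = min(max(timestep / ENT_DECAY_STEPS, 0.0), 1.0)
    return ENT_START + frac * (ENT_END - ENT_START)
```

Under this assumption the coefficient has already reached its floor of 0.0003 well before the uploaded 4,500,000-timestep checkpoint.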
Files
| File | Purpose |
|---|---|
| ppo_supermariobros-nes-v0_4500000_steps.zip | SB3 PPO checkpoint selected for upload |
| replay.mp4 | Hugging Face reinforcement-learning widget preview of a completed stochastic eval episode |
| preview_summary.json | Preview seed, video, and episode metadata |
| model_metadata.json | Provenance, eval profile, and checksum metadata |
Provenance
- Source repo: tsilva/rlab
- W&B run: https://wandb.ai/tsilva/SuperMarioBros-NES/runs/9j4r2h3g
- W&B run id: 9j4r2h3g
- W&B run name: b31_post12_loosekl_5m_stop100ep100_clip015_targetkl012_clippeddx_seed23_20260618_192135
- W&B artifact: tsilva/SuperMarioBros-NES/b31_post12_loosekl_5m_stop100ep100_clip015_targetkl012_clippeddx_seed23_20260618_192135-checkpoint:v44
- Artifact alias: step-4500000
- Model SHA256: 75eb50015295f887c7faae7dbbb80b9a024052581c443fbc0ce5b72e0be47f11
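The SHA256 above can be checked against a downloaded file with Python's standard library. This is a minimal sketch: the helper names are hypothetical, and the path you pass in depends on where you saved the checkpoint.

```python
import hashlib

# Digest copied from the Provenance list above.
EXPECTED_SHA256 = "75eb50015295f887c7faae7dbbb80b9a024052581c443fbc0ce5b72e0be47f11"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_model_card(path):
    """True if the file's digest equals the SHA256 listed in this card."""
    return sha256_of(path) == EXPECTED_SHA256
```

A mismatch most likely means a partial download or a different checkpoint file than the one described here.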
Limitations
- No ROM is included; users must provide and import their own legally obtained ROM.
- The reported result uses the current mario_level1_v1 stochastic eval profile, not sticky actions.
- The checkpoint was selected by eval performance on this project-specific setup, not by a standardized public benchmark.
- The model card reports one checkpoint evaluation, not a multi-seed aggregate.