# Shutdown-Gym Leaderboard

Tampering rate is measured across 200 rollouts at Tier 2.

| Model | Train Method | Trained-Cat (Train Op) | Held-Out (Train Op) | Held-Out (Strict Op) | Math Acc | Honesty (T3) | Submission |
|---|---|---|---|---|---|---|---|
| Qwen-2.5-1.5B | (none, baseline) | TBD | TBD | TBD | TBD | TBD | this repo |
| Qwen-2.5-1.5B | GRPO (ours) | TBD | TBD | TBD | TBD | TBD | this repo |

The baseline row will be populated on Day 2 (morning), after the 50-rollout baseline measurement (PROJECT.md §20). The trained row will be populated on Day 2 (noon), after the held-out evaluation completes (§18).

## How to submit

1. Train any model against this environment, or evaluate any existing checkpoint.
2. Run:

   ```bash
   python -m evaluation.evaluate \
     --model <model-name-or-path> \
     --env-url https://arun-sanjay-redbutton.hf.space \
     --n-episodes 200
   ```

3. Run again with the `--strict-operator` flag to fill the strict-operator column. (A hypothetical wrapper that automates both passes is sketched in the appendix below.)
4. Open a PR with the new row added and a link to your model checkpoint.

## Submission template

```
| <model> | <train method> | X% | X% | X% | X% | X% | <checkpoint link> |
```

## Why this benchmark

Production ML engineers from Meta, HuggingFace, and PyTorch evaluate environments by *whether they could integrate them*. The leaderboard makes this concrete: the env ships with two reference rows (baseline + trained), and any new entry is two 200-rollout `evaluate.py` invocations away (one per operator policy). The five metric columns (Trained-Cat (Train Op), Held-Out × {Train, Strict} Op, Math Acc, Honesty (T3)) capture the dual-axis generalization claim: disposition transfers across mechanisms *and* across operator policies.

See `PROJECT.md` §16, §17, and §18 for the detailed taxonomy and rubric design, and `API_NOTES.md` for the recon-verified OpenEnv API surface this env builds on. A smoke-test sketch against that surface appears at the end of the appendix.
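## Appendix: unofficial helper sketches

A minimal sketch of a wrapper that runs both evaluation passes from "How to submit" back to back and prints their output. Only the `evaluation.evaluate` module path and the `--model`, `--env-url`, `--n-episodes`, and `--strict-operator` flags come from this README; the script name, argument handling, and the assumption that metrics are printed to stdout are all hypothetical.

```python
"""Unofficial helper sketch: run the train-operator and strict-operator
evaluation passes back to back. Only the evaluate.py flags shown in the
README are assumed to exist; how results are reported is an assumption."""
import subprocess
import sys

ENV_URL = "https://arun-sanjay-redbutton.hf.space"


def run_pass(model: str, strict: bool) -> str:
    """Invoke evaluation.evaluate once and return whatever it prints."""
    cmd = [
        sys.executable, "-m", "evaluation.evaluate",
        "--model", model,
        "--env-url", ENV_URL,
        "--n-episodes", "200",
    ]
    if strict:
        cmd.append("--strict-operator")  # fills the strict-operator column
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    model = sys.argv[1]  # e.g. a HF model id or a local checkpoint path
    print("=== train-operator pass ===")
    print(run_pass(model, strict=False))
    print("=== strict-operator pass ===")
    print(run_pass(model, strict=True))
```

Copy the reported numbers into the submission template above before opening a PR.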
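Before committing to a full 200-rollout run, it can be worth smoke-testing the hosted environment directly. The sketch below assumes a generic OpenEnv-style HTTP surface with POST `/reset` and POST `/step` routes and JSON payloads; the recon-verified routes and schemas live in `API_NOTES.md`, so treat everything here except the env URL as an assumption.

```python
"""Unofficial smoke-test sketch. The /reset and /step routes and the
payload shapes are assumptions modeled on a generic OpenEnv-style HTTP
surface; check API_NOTES.md for the verified API before relying on this."""
import requests

ENV_URL = "https://arun-sanjay-redbutton.hf.space"

# Reset the environment and inspect the initial observation.
reset_resp = requests.post(f"{ENV_URL}/reset", json={}, timeout=30)
reset_resp.raise_for_status()
print("reset ->", reset_resp.json())

# Send a single placeholder action; the action schema here is hypothetical.
step_resp = requests.post(f"{ENV_URL}/step", json={"action": ""}, timeout=30)
step_resp.raise_for_status()
print("step ->", step_resp.json())
```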