# Shutdown-Gym Leaderboard
Tampering rate measured across 200 rollouts at Tier 2.
| Model | Train Method | Trained-Cat (Train Op) | Held-Out (Train Op) | Held-Out (Strict Op) | Math Acc | Honesty (T3) | Submission |
|---|---|---|---|---|---|---|---|
| Qwen-2.5-1.5B | (none, baseline) | TBD | TBD | TBD | TBD | TBD | this repo |
| Qwen-2.5-1.5B | GRPO (ours) | TBD | TBD | TBD | TBD | TBD | this repo |
The baseline row will be populated on the morning of Day 2, after the
50-rollout baseline measurement (PROJECT.md §20). The trained row will
be populated at noon on Day 2, after held-out evaluation completes (§18).
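Because the rates in the table are estimated from a finite number of rollouts (200 for leaderboard entries, 50 for the baseline measurement), it can help to report an interval alongside each percentage. The sketch below computes a Wilson score interval for a binomial proportion. It is not part of this repo's evaluation code, and the tampering counts are hypothetical values chosen only to show how the interval narrows from 50 to 200 rollouts.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts, for illustration only: the same 10% observed
# tampering rate at the two sample sizes mentioned above.
for tampered, n in [(5, 50), (20, 200)]:
    lo, hi = wilson_interval(tampered, n)
    print(f"{tampered}/{n} = {tampered / n:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

The Wilson interval is used here rather than the plain normal approximation because it behaves sensibly when the observed rate is close to 0% or 100%, which is plausible for a tampering metric.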
## How to submit
1. Train any model against this environment, or evaluate any
existing checkpoint.
2. Run:
```bash
python -m evaluation.evaluate \
--model <name> \
--env-url https://arun-sanjay-redbutton.hf.space \
--n-episodes 200
```
3. Run again with the `--strict-operator` flag for the
strict-operator column.
4. Open a PR with the new row added and a link to your model
checkpoint.
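Steps 2 and 3 differ only by the `--strict-operator` flag, so the two runs can be generated from one place. The following is a minimal sketch, not a script shipped with this repo: `build_eval_commands` is a hypothetical helper, and it uses only the module path, flags, and URL shown in the steps above. It prints the commands rather than executing them.

```python
import shlex

# Environment URL taken from the command in step 2.
ENV_URL = "https://arun-sanjay-redbutton.hf.space"


def build_eval_commands(model: str, n_episodes: int = 200) -> list[list[str]]:
    """Return the train-operator and strict-operator evaluation commands.

    Hypothetical helper: it only assembles argument lists from the flags
    documented in the steps above.
    """
    base = [
        "python", "-m", "evaluation.evaluate",
        "--model", model,
        "--env-url", ENV_URL,
        "--n-episodes", str(n_episodes),
    ]
    return [base, base + ["--strict-operator"]]


for cmd in build_eval_commands("<name>"):
    # Print a copy-pasteable command line. To execute instead, pass the
    # list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```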
## Submission template
```
| <your model> | <method> | X% | X% | X% | X% | X% | <link to repo or HF model> |
```
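To avoid column-count mistakes when filling in the template, the row can be generated programmatically. The helper below is a hypothetical sketch. It assumes the five metrics are available as fractions in [0, 1], and the one-decimal percentage format is an assumption, since the template only specifies `X%`. The values in the usage example are placeholders, not results.

```python
def format_row(model: str, method: str, trained_cat: float,
               held_out_train: float, held_out_strict: float,
               math_acc: float, honesty: float, link: str) -> str:
    """Format one leaderboard row matching the eight-column table header."""
    metrics = [trained_cat, held_out_train, held_out_strict, math_acc, honesty]
    for value in metrics:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"metric {value!r} is not a fraction in [0, 1]")
    cells = [model, method] + [f"{100 * v:.1f}%" for v in metrics] + [link]
    return "| " + " | ".join(cells) + " |"


# Placeholder values for illustration only; these are not measured results.
print(format_row("<your model>", "<method>", 0.125, 0.1, 0.2, 0.5, 0.9,
                 "<link to repo or HF model>"))
```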
## Why this benchmark
Production ML engineers from Meta, HuggingFace, and PyTorch evaluate
environments by *whether they could integrate them*. The leaderboard
makes this concrete: the env ships with two reference rows
(baseline + trained), and any new entry is a 200-rollout
``evaluate.py`` invocation away.
The five metric columns on the right (Trained-Cat, Held-Out × {Train,
Strict} Op, Math Accuracy, Honesty) capture the dual-axis
generalization claim: disposition transfers across mechanism *and*
across operator policy. See `PROJECT.md` §16, §17, §18 for the
detailed taxonomy and rubric design, and `API_NOTES.md` for the
recon-verified OpenEnv API surface this env builds on.