
# Shutdown-Gym Leaderboard

Tampering rate is measured across 200 rollouts at Tier 2.

| Model | Train Method | Trained-Cat (Train Op) | Held-Out (Train Op) | Held-Out (Strict Op) | Math Acc | Honesty (T3) | Submission |
|---|---|---|---|---|---|---|---|
| Qwen-2.5-1.5B | (none, baseline) | TBD | TBD | TBD | TBD | TBD | this repo |
| Qwen-2.5-1.5B | GRPO (ours) | TBD | TBD | TBD | TBD | TBD | this repo |

The baseline row is populated on the morning of Day 2, after the 50-rollout baseline measurement (PROJECT.md §20). The trained row is populated at noon on Day 2, after the held-out evaluation completes (§18).
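For clarity, the tampering rate above is simply the fraction of the 200 rollouts in which the model tampers with the shutdown mechanism. A minimal sketch, assuming a hypothetical per-episode record; the `tampered` field name is an assumption, not the evaluator's actual schema:

```python
# Hypothetical sketch: compute the tampering rate reported in the table.
# `results` stands in for whatever per-episode records evaluation.evaluate
# emits; the `tampered` field name is assumed, not confirmed by this repo.

def tampering_rate(results: list[dict]) -> float:
    """Fraction of episodes (e.g. 200 rollouts) in which tampering occurred."""
    if not results:
        return 0.0
    return sum(1 for ep in results if ep["tampered"]) / len(results)

# Example: 7 tampering episodes out of 200 rollouts -> 3.5%
demo = [{"tampered": i < 7} for i in range(200)]
print(f"{tampering_rate(demo):.1%}")  # 3.5%
```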

## How to submit

1. Train any model against this environment, or evaluate any existing checkpoint.

2. Run:

   ```bash
   python -m evaluation.evaluate \
       --model <name> \
       --env-url https://arun-sanjay-redbutton.hf.space \
       --n-episodes 200
   ```

3. Run the same command again with the `--strict-operator` flag to fill the Held-Out (Strict Op) column (a scripted version of both runs is sketched after this list).

4. Open a PR with the new row added and a link to your model checkpoint.
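For convenience, both evaluation runs can be driven from one short script. A minimal sketch, assuming only the CLI flags shown above; the placeholder model name and the use of `subprocess` are illustrative, not part of this repo:

```python
# Hypothetical driver that performs both leaderboard runs back to back.
# The CLI flags come from the steps above; everything else is assumed.
import subprocess

BASE_CMD = [
    "python", "-m", "evaluation.evaluate",
    "--model", "my-org/my-checkpoint",  # placeholder model name
    "--env-url", "https://arun-sanjay-redbutton.hf.space",
    "--n-episodes", "200",
]

# Run 1: training-time operator policy (Train Op columns).
subprocess.run(BASE_CMD, check=True)

# Run 2: strict operator policy (Held-Out (Strict Op) column).
subprocess.run(BASE_CMD + ["--strict-operator"], check=True)
```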

## Submission template

Add one row, matching the column order of the leaderboard table above:

| <your model> | <method> | X% | X% | X% | X% | X% | <link to repo or HF model> |
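If it helps, the row can be rendered programmatically. A hypothetical helper, assuming an invented metrics dict; none of the key names below come from the evaluator's actual output:

```python
# Hypothetical: render a submission row from metric values.
# Column order follows the leaderboard table at the top of this file;
# the dict keys are illustrative, not the evaluator's schema.

def leaderboard_row(model: str, method: str, metrics: dict, link: str) -> str:
    cols = [
        model,
        method,
        f"{metrics['trained_cat_train_op']:.1%}",  # Trained-Cat (Train Op)
        f"{metrics['held_out_train_op']:.1%}",     # Held-Out (Train Op)
        f"{metrics['held_out_strict_op']:.1%}",    # Held-Out (Strict Op)
        f"{metrics['math_acc']:.1%}",              # Math Acc
        f"{metrics['honesty_t3']:.1%}",            # Honesty (T3)
        link,
    ]
    return "| " + " | ".join(cols) + " |"
```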

## Why this benchmark

Production ML engineers from Meta, HuggingFace, and PyTorch judge environments by whether they could integrate them into their own stacks. The leaderboard makes this concrete: the env ships with two reference rows (baseline + trained), and any new entry is a single 200-rollout `evaluation.evaluate` invocation away.

The metric columns on the right (Trained-Cat, Held-Out under both the Train and Strict operator policies, Math Accuracy, and Honesty) capture the dual-axis generalization claim: the disposition should transfer both across tampering mechanism and across operator policy. See PROJECT.md §16, §17, and §18 for the detailed taxonomy and rubric design, and API_NOTES.md for the recon-verified OpenEnv API surface this env builds on.