Benchmark Plan
This file defines the benchmark protocol for OpenEnv-WolfeClick and records the results I will use in the hackathon submission.
Goal
I want to measure whether the environment and training loop improve actual policy behavior, not just JSON formatting.
The benchmark therefore compares:
- Baseline: the plain base model after the JSON warm-up SFT only
- Trained: the GRPO-trained model checkpoint produced from real Pokemon Showdown rollouts
This is the fairest comparison for this project because:
- the raw base model is not instruction-aligned enough for this exact action interface
- the warm-up SFT establishes the minimum viable policy that can participate in the environment
- the GRPO stage is the part that should improve behavior beyond formatting
Benchmark setup
Environment
- battle format: gen4randombattle
- environment code: /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py
- reward source: /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py
- state formatter: /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py
- action validation: /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py
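To make the action-validation metric concrete, here is a minimal sketch of the parse-and-fallback logic. The JSON schema, function name, and `FALLBACK_ACTION` are assumptions for illustration; the real logic lives in `action_space.py`.

```python
import json

# Hypothetical fallback used when the model's output fails validation.
FALLBACK_ACTION = {"type": "move", "slot": 1}

def parse_action(raw: str, legal_moves: set, legal_switches: set):
    """Parse a model reply into an action dict.

    Returns (action, used_fallback). The schema here is illustrative:
    {"type": "move"|"switch", "slot": <int>}.
    """
    try:
        action = json.loads(raw)
        if action.get("type") == "move" and action.get("slot") in legal_moves:
            return action, False
        if action.get("type") == "switch" and action.get("slot") in legal_switches:
            return action, False
    except (json.JSONDecodeError, AttributeError):
        pass
    # This turn counts as model_invalid and lowers the format hit rate.
    return FALLBACK_ACTION, True
```

A turn where `used_fallback` is True is exactly what the `model_invalid` metric below counts.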
Policy checkpoints
- Baseline: JSON warm-up only checkpoint
- Trained: latest GRPO checkpoint trained on real rollout data
Evaluation budget
Use the same evaluation budget for both checkpoints:
- 10 battles minimum
- same notebook generation settings
- same anti-stall environment settings
If time permits, repeat the benchmark for 20 battles to reduce variance.
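The variance argument can be made concrete: the standard error of the mean battle reward shrinks with the square root of the battle count, so 20 battles cut it by about a factor of 1.4 versus 10. A small helper (not part of the benchmark script) to check the spread after a run:

```python
import math

def stderr_of_mean(per_battle_rewards):
    """Standard error of the mean battle reward: s / sqrt(n)."""
    n = len(per_battle_rewards)
    mean = sum(per_battle_rewards) / n
    # Sample variance with Bessel's correction (n - 1).
    var = sum((r - mean) ** 2 for r in per_battle_rewards) / (n - 1)
    return math.sqrt(var / n)
```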
Metrics
These are the primary metrics I care about:
format hit rate
- percentage of turns where the model produced a valid JSON action without fallback
model_invalid
- number of turns where the model failed validation and the fallback action was used
env_illegal
- number of environment-illegal actions after parsing
avg reward / turn
- average shaped environment reward across all evaluated turns
avg battle reward
- mean total reward across battles
avg battle length
- average number of turns per battle
commentary spot checks
- at least 2 human-readable battle traces to inspect whether the policy is making plausible strategic choices
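The quantitative metrics above can all be derived from a list of per-turn log records. The record field names (`battle_id`, `valid_json`, `env_illegal`, `reward`) are assumed for illustration and may differ from what the benchmark script actually logs:

```python
from collections import defaultdict

def summarize(turns):
    """Aggregate per-turn logs into the benchmark metrics.

    Each turn record is assumed to look like:
    {"battle_id": int, "valid_json": bool, "env_illegal": bool, "reward": float}
    """
    n = len(turns)
    battles = defaultdict(list)
    for t in turns:
        battles[t["battle_id"]].append(t["reward"])
    battle_totals = [sum(rewards) for rewards in battles.values()]
    return {
        "format_hit_rate": sum(t["valid_json"] for t in turns) / n,
        "model_invalid": sum(not t["valid_json"] for t in turns),
        "env_illegal": sum(t["env_illegal"] for t in turns),
        "avg_reward_per_turn": sum(t["reward"] for t in turns) / n,
        "avg_battle_reward": sum(battle_totals) / len(battle_totals),
        "avg_battle_length": n / len(battles),
    }
```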
Benchmark procedure
Baseline run
- Load the base model
- Run only the JSON warm-up SFT stage
- Do not run GRPO
- Run the benchmark battles and record metrics
Trained run
- Load the GRPO-trained adapter checkpoint
- Run the same benchmark battles
- Record the same metrics
Consistency rules
- keep `NUM_GENERATIONS`, `MAX_NEW_TOKENS`, and `TEMPERATURE` fixed during evaluation
- do not change reward shaping between the baseline and trained runs
- use the same battle cap for both checkpoints
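One simple way to enforce these rules is to freeze the evaluation settings in a single read-only config that both runs share. The values below are placeholders, not the settings actually used:

```python
from types import MappingProxyType

# Read-only evaluation config shared by the baseline and trained runs.
# Numeric values are placeholders for illustration.
EVAL_CONFIG = MappingProxyType({
    "NUM_GENERATIONS": 1,
    "MAX_NEW_TOKENS": 128,
    "TEMPERATURE": 0.7,
    "NUM_BATTLES": 10,
})

def run_benchmark(checkpoint, config=EVAL_CONFIG):
    """Stub: both checkpoints are evaluated with the same config object."""
    return {"checkpoint": checkpoint, **config}
```

Because `MappingProxyType` rejects mutation, a stray `config["TEMPERATURE"] = ...` between the two runs raises `TypeError` instead of silently skewing the comparison.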
Benchmark script
Use /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/benchmark.py to run the benchmark.
Before running it, edit the CHECKPOINTS list in that script so it points to:
- the JSON-warmup-only baseline checkpoint
- the trained GRPO checkpoint
Then run:
python3 /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/benchmark.py
The script writes a Markdown table to /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/latest_results.md.
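A row of that Markdown table can be produced with a helper like the following sketch; the function name and metric dict keys are illustrative, and the column order matches the results table below:

```python
def to_markdown_row(name, m):
    """Format one checkpoint's metrics as a row of the results table."""
    cells = [
        name,
        str(m["battles"]),
        f"{m['format_hit_rate']:.1%}",   # e.g. 1.0 -> "100.0%"
        str(m["model_invalid"]),
        str(m["env_illegal"]),
        f"{m['avg_reward_per_turn']:.3f}",
        f"{m['avg_battle_reward']:.3f}",
        f"{m['avg_battle_length']:.1f}",
        m.get("notes", ""),
    ]
    return "| " + " | ".join(cells) + " |"
```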
Results table
Fill this table after each benchmark run.
| Checkpoint | Battles | Format Hit Rate | Model Invalid | Env Illegal | Avg Reward / Turn | Avg Battle Reward | Avg Battle Length | Notes |
|---|---|---|---|---|---|---|---|---|
| Baseline (JSON SFT only) | Pending | Pending | Pending | Pending | Pending | Pending | Pending | Pending |
| Trained (GRPO) | 1 | 0 | 0 | 1 | 100.0% | 0 | 0 | -0.031 |
Recommended interpretation
I should only claim strategic improvement if at least one of the following improves while legality remains strong:
- higher average reward per turn
- lower model invalid count
- better commentary examples on matchup-sensitive turns
- more sensible switches and move selection under pressure
I should not claim deep strategic learning from SFT alone. The meaningful claim comes from GRPO improving policy behavior on real rollout data.
Artifacts to link
When the benchmark is finished, add links here:
- Hugging Face model repo: https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1
- benchmark notebook / cell output: TBD
- demo video timestamp showing benchmark: TBD