# Benchmark Plan
This file defines the benchmark protocol for OpenEnv-WolfeClick and records the results I will use in the hackathon submission.
## Goal
I want to measure whether the environment and training loop improve actual policy behavior, not just JSON formatting.
The benchmark therefore compares:
1. `Baseline`: the plain base model after the JSON warm-up SFT only
2. `Trained`: the GRPO-trained model checkpoint produced from real Pokemon Showdown rollouts
This is the fairest comparison for this project because:
- the raw base model is not instruction-aligned enough for this exact action interface
- the warm-up SFT establishes the minimum viable policy that can participate in the environment
- the GRPO stage is the part that should improve behavior beyond formatting
## Benchmark setup
### Environment
- battle format: `gen4randombattle`
- environment code: [`src/smogon_rl/openenv_sync_env.py`](src/smogon_rl/openenv_sync_env.py)
- reward source: [`src/smogon_rl/reward.py`](src/smogon_rl/reward.py)
- state formatter: [`src/smogon_rl/state_formatter.py`](src/smogon_rl/state_formatter.py)
- action validation: [`src/smogon_rl/action_space.py`](src/smogon_rl/action_space.py)
### Policy checkpoints
- `Baseline`: JSON warm-up only checkpoint
- `Trained`: latest GRPO checkpoint trained on real rollout data
### Evaluation budget
Use the same evaluation budget for both checkpoints:
- `10` battles minimum
- same notebook generation settings
- same anti-stall environment settings
If time permits, repeat the benchmark for `20` battles to reduce variance.
## Metrics
These are the primary metrics I care about:
1. `format hit rate`
- percentage of turns where the model produced a valid JSON action without fallback
2. `model_invalid`
- number of turns where the model failed validation and the fallback action was used
3. `env_illegal`
- number of environment-illegal actions after parsing
4. `avg reward / turn`
- average shaped environment reward across all evaluated turns
5. `avg battle reward`
- mean total reward across battles
6. `avg battle length`
- average number of turns per battle
7. `commentary spot checks`
- at least 2 human-readable battle traces to inspect whether the policy is making plausible strategic choices
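As a rough sketch, the numeric metrics above can be aggregated from per-turn logs like this. The `TurnLog` fields and names here are assumptions for illustration, not the environment's real logging schema:

```python
from dataclasses import dataclass

@dataclass
class TurnLog:
    """Minimal per-turn record; field names are illustrative assumptions."""
    valid_json: bool    # model emitted a parseable JSON action (no fallback)
    env_illegal: bool   # action parsed but was illegal in the environment
    reward: float       # shaped environment reward for this turn
    battle_id: int      # which battle this turn belongs to

def summarize(turns: list[TurnLog]) -> dict:
    """Aggregate the benchmark metrics described above from per-turn logs."""
    n = len(turns)
    battles: dict[int, list[TurnLog]] = {}
    for t in turns:
        battles.setdefault(t.battle_id, []).append(t)
    return {
        "format_hit_rate": sum(t.valid_json for t in turns) / n,
        "model_invalid": sum(not t.valid_json for t in turns),
        "env_illegal": sum(t.env_illegal for t in turns),
        "avg_reward_per_turn": sum(t.reward for t in turns) / n,
        "avg_battle_reward": sum(sum(x.reward for x in b)
                                 for b in battles.values()) / len(battles),
        "avg_battle_length": n / len(battles),
    }
```

The commentary spot checks stay manual; only the counting metrics are automatable this way.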
## Benchmark procedure
### Baseline run
1. Load the base model
2. Run only the JSON warm-up SFT stage
3. Do not run GRPO
4. Run the benchmark battles and record metrics
### Trained run
1. Load the GRPO-trained adapter checkpoint
2. Run the same benchmark battles
3. Record the same metrics
### Consistency rules
- keep `NUM_GENERATIONS`, `MAX_NEW_TOKENS`, and `TEMPERATURE` fixed during evaluation
- do not change reward shaping between baseline and trained benchmark
- use the same battle cap for both checkpoints
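One simple way to enforce the consistency rules is to pin the generation settings in a single object and pass that same object to both evaluation runs. A minimal sketch, with placeholder values rather than the project's real settings:

```python
# Pin the settings once; any per-run override would fail the guard below.
# The concrete values are placeholders, not the project's actual settings.
GEN_SETTINGS = {"NUM_GENERATIONS": 1, "MAX_NEW_TOKENS": 256, "TEMPERATURE": 0.7}

def run_eval(checkpoint: str, settings: dict) -> str:
    # Guard: both checkpoints must be evaluated under the identical settings object.
    assert settings is GEN_SETTINGS, "evaluation settings drifted between runs"
    return f"evaluated {checkpoint} at temperature {settings['TEMPERATURE']}"

results = [run_eval(ckpt, GEN_SETTINGS)
           for ckpt in ("baseline-json-sft", "trained-grpo")]
```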
## Benchmark script
Use [`benckmarks/benchmark.py`](benckmarks/benchmark.py) to run the benchmark.
Before running it, edit the `CHECKPOINTS` list in that script so it points to:
- the JSON-warmup-only baseline checkpoint
- the trained GRPO checkpoint
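For illustration, the edited `CHECKPOINTS` list might look like the following. The entry shape and paths are hypothetical; the real script's structure may differ:

```python
# Hypothetical shape of the CHECKPOINTS list in benchmark.py.
# The field names and paths are placeholders, not the script's real schema.
CHECKPOINTS = [
    {"name": "Baseline (JSON SFT only)", "path": "checkpoints/json-warmup"},
    {"name": "Trained (GRPO)", "path": "checkpoints/grpo-latest"},
]
```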
Then run:
```bash
python3 benckmarks/benchmark.py
```
The script writes a Markdown table to [`benckmarks/latest_results.md`](benckmarks/latest_results.md).
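A sketch of how one row of that table can be rendered from a metrics dict. The metric keys are assumptions about the script's output, not its real schema:

```python
def results_row(name: str, m: dict) -> str:
    """Format one checkpoint's metrics as a row of the results table below.

    Metric keys are illustrative assumptions, not the benchmark script's schema.
    """
    return "| {} | {} | {:.1%} | {} | {} | {:.3f} | {:.3f} | {:.1f} | |".format(
        name, m["battles"], m["format_hit_rate"], m["model_invalid"],
        m["env_illegal"], m["avg_reward_per_turn"], m["avg_battle_reward"],
        m["avg_battle_length"],
    )
```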
## Results table
Fill this table after each benchmark run.
| Checkpoint | Battles | Format Hit Rate | Model Invalid | Env Illegal | Avg Reward / Turn | Avg Battle Reward | Avg Battle Length | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (JSON SFT only) | Pending | Pending | Pending | Pending | Pending | Pending | Pending | Pending |
| Trained (GRPO) | 1 | 100.0% | 0 | 0 | -0.031 | -0.919 | 30.0 | extra unlabeled values from the run log: 0, 0, 1, 602.2 |
## Recommended interpretation
I should only claim strategic improvement if at least one of the following improves while legality remains strong:
- higher average reward per turn
- lower model invalid count
- better commentary examples on matchup-sensitive turns
- more sensible switches and move selection under pressure
I should not claim deep strategic learning from SFT alone. The meaningful claim comes from GRPO improving policy behavior on real rollout data.
## Artifacts to link
When the benchmark is finished, add links here:
- Hugging Face model repo: [https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1](https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1)
- benchmark notebook / cell output: `TBD`
- demo video timestamp showing benchmark: `TBD`