| # Benchmark Plan | |
| This file defines the benchmark protocol for OpenEnv-WolfeClick and records the results I will use in the hackathon submission. | |
| ## Goal | |
| I want to measure whether the environment and training loop improve actual policy behavior, not just JSON formatting. | |
| The benchmark therefore compares: | |
| 1. `Baseline`: the base model after the JSON warm-up SFT stage only, with no GRPO | |
| 2. `Trained`: the GRPO-trained model checkpoint produced from real Pokemon Showdown rollouts | |
| This is the fairest comparison for this project because: | |
| - the raw base model is not instruction-aligned enough for this exact action interface | |
| - the warm-up SFT establishes the minimum viable policy that can participate in the environment | |
| - the GRPO stage is the part that should improve behavior beyond formatting | |
| ## Benchmark setup | |
| ### Environment | |
| - battle format: `gen4randombattle` | |
| - environment code: [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py) | |
| - reward source: [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py) | |
| - state formatter: [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py) | |
| - action validation: [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py) | |
| ### Policy checkpoints | |
| - `Baseline`: JSON warm-up only checkpoint | |
| - `Trained`: latest GRPO checkpoint trained on real rollout data | |
| ### Evaluation budget | |
| Use the same evaluation budget for both checkpoints: | |
| - `10` battles minimum | |
| - same notebook generation settings | |
| - same anti-stall environment settings | |
| If time permits, repeat the benchmark for `20` battles to reduce variance. | |
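To make the variance point concrete, here is a minimal sketch of how the uncertainty on `avg battle reward` could be estimated from per-battle totals. The helper name `mean_and_stderr` and the example numbers are hypothetical and are not measured results; the standard error shrinks roughly with the square root of the number of battles, which is why `20` battles give a tighter estimate than `10`.

```python
import math
import statistics


def mean_and_stderr(battle_rewards):
    """Return (mean, standard error of the mean) for per-battle total rewards."""
    n = len(battle_rewards)
    if n < 2:
        raise ValueError("need at least 2 battles to estimate variance")
    mean = statistics.fmean(battle_rewards)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    stderr = statistics.stdev(battle_rewards) / math.sqrt(n)
    return mean, stderr


# Hypothetical per-battle totals, for illustration only (not measured results).
example_rewards = [-1.0, 0.5, -0.5, 1.0, 0.0, -1.5, 0.5, 1.0, -0.5, 0.5]
mean, stderr = mean_and_stderr(example_rewards)
print(f"mean={mean:.3f} stderr={stderr:.3f}")
```

If the difference between the two checkpoints is smaller than about two standard errors, the comparison is probably too noisy to support a claim.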
| ## Metrics | |
| These are the primary metrics I care about: | |
| 1. `format hit rate` | |
| - percentage of turns where the model produced a valid JSON action without fallback | |
| 2. `model_invalid` | |
| - number of turns where the model failed validation and the fallback action was used | |
| 3. `env_illegal` | |
| - number of environment-illegal actions after parsing | |
| 4. `avg reward / turn` | |
| - average shaped environment reward across all evaluated turns | |
| 5. `avg battle reward` | |
| - mean total reward across battles | |
| 6. `avg battle length` | |
| - average number of turns per battle | |
| 7. `commentary spot checks` | |
| - at least 2 human-readable battle traces to inspect whether the policy is making plausible strategic choices | |
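The numeric metrics above can be derived from a flat log of per-turn records. The following is a minimal sketch under that assumption; `TurnRecord`, its fields, and `summarize` are hypothetical names for illustration and are not taken from the benchmark script.

```python
from dataclasses import dataclass


@dataclass
class TurnRecord:
    battle_id: int
    reward: float         # shaped environment reward for this turn
    model_invalid: bool   # model output failed validation, fallback action used
    env_illegal: bool     # action parsed but was illegal in the environment


def summarize(turns):
    """Aggregate per-turn records into the benchmark metrics."""
    if not turns:
        raise ValueError("no turns recorded")
    n_turns = len(turns)
    model_invalid = sum(t.model_invalid for t in turns)
    env_illegal = sum(t.env_illegal for t in turns)

    # Accumulate total reward per battle.
    battle_totals = {}
    for t in turns:
        battle_totals[t.battle_id] = battle_totals.get(t.battle_id, 0.0) + t.reward
    n_battles = len(battle_totals)

    return {
        "battles": n_battles,
        # Percentage of turns with a valid JSON action and no fallback.
        "format_hit_rate": 100.0 * (n_turns - model_invalid) / n_turns,
        "model_invalid": model_invalid,
        "env_illegal": env_illegal,
        "avg_reward_per_turn": sum(t.reward for t in turns) / n_turns,
        "avg_battle_reward": sum(battle_totals.values()) / n_battles,
        "avg_battle_length": n_turns / n_battles,
    }
```

Note that `avg reward / turn` multiplied by `avg battle length` should equal `avg battle reward`, which is a quick sanity check on any reported row.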
| ## Benchmark procedure | |
| ### Baseline run | |
| 1. Load the base model | |
| 2. Run only the JSON warm-up SFT stage | |
| 3. Do not run GRPO | |
| 4. Run the benchmark battles and record metrics | |
| ### Trained run | |
| 1. Load the GRPO-trained adapter checkpoint | |
| 2. Run the same benchmark battles | |
| 3. Record the same metrics | |
| ### Consistency rules | |
| - keep `NUM_GENERATIONS`, `MAX_NEW_TOKENS`, and `TEMPERATURE` fixed during evaluation | |
| - do not change reward shaping between baseline and trained benchmark | |
| - use the same battle caps for both checkpoints | |
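One way to enforce these rules is to keep every evaluation setting in a single immutable object and pass the same instance to both runs. This is only a sketch: `EvalConfig`, `assert_same_settings`, and all default values except the battle format are placeholders I made up for illustration, not the settings used in the notebook.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)  # frozen: settings cannot be mutated between runs
class EvalConfig:
    battle_format: str = "gen4randombattle"
    num_battles: int = 10
    num_generations: int = 1        # placeholder value
    max_new_tokens: int = 64        # placeholder value
    temperature: float = 0.7        # placeholder value
    max_turns_per_battle: int = 30  # placeholder battle cap


def assert_same_settings(a, b):
    """Raise if two evaluation configs differ in any field."""
    a_fields, b_fields = asdict(a), asdict(b)
    diffs = {k: (a_fields[k], b_fields[k]) for k in a_fields if a_fields[k] != b_fields[k]}
    if diffs:
        raise ValueError(f"evaluation settings differ: {diffs}")


shared = EvalConfig()
# Both checkpoints are evaluated with the same config object.
assert_same_settings(shared, shared)
```

Logging `asdict(shared)` next to each results row would also make it possible to verify later that both rows came from identical settings.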
| ## Benchmark script | |
| Use [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/benchmark.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/benchmark.py) to run the benchmark. | |
| Before running it, edit the `CHECKPOINTS` list in that script so it points to: | |
| - the JSON-warmup-only baseline checkpoint | |
| - the trained GRPO checkpoint | |
| Then run: | |
| ```bash | |
| python3 /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/benchmark.py | |
| ``` | |
| The script writes a Markdown table to [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/latest_results.md`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/latest_results.md). | |
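I have not reproduced the script here, so the following is only a sketch of how a results row could be rendered so that it always lines up with the table header in this file. `COLUMNS` and `render_results_table` are hypothetical names, and the dictionary keys are assumptions.

```python
# Column titles match the results table in this file; the dict keys are assumed.
COLUMNS = [
    ("Checkpoint", "checkpoint"),
    ("Battles", "battles"),
    ("Format Hit Rate", "format_hit_rate"),
    ("Model Invalid", "model_invalid"),
    ("Env Illegal", "env_illegal"),
    ("Avg Reward / Turn", "avg_reward_per_turn"),
    ("Avg Battle Reward", "avg_battle_reward"),
    ("Avg Battle Length", "avg_battle_length"),
    ("Notes", "notes"),
]


def render_results_table(rows):
    """Render a list of metric dicts as a Markdown table string."""
    header = "| " + " | ".join(title for title, _ in COLUMNS) + " |"
    divider = "| " + " | ".join("---" for _ in COLUMNS) + " |"
    body = [
        # Missing metrics are shown as "Pending" so every row has all cells.
        "| " + " | ".join(str(row.get(key, "Pending")) for _, key in COLUMNS) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


print(render_results_table([{"checkpoint": "Baseline (JSON SFT only)"}]))
```

Driving both the header and the rows from one column list keeps the cell count of every row equal to the header's, so any extra fields the script reports have to be added as named columns rather than appended silently.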
| ## Results table | |
| Fill this table after each benchmark run. | |
| | Checkpoint | Battles | Format Hit Rate | Model Invalid | Env Illegal | Avg Reward / Turn | Avg Battle Reward | Avg Battle Length | Notes | | |
| | --- | --- | --- | --- | --- | --- | --- | --- | --- | | |
| | Baseline (JSON SFT only) | Pending | Pending | Pending | Pending | Pending | Pending | Pending | Pending | | |
| | Trained (GRPO) | 1 | 100.0% | 0 | 0 | -0.031 | -0.919 | 30.0 | Raw script row had 12 cells; unlabeled extras: `0`, `0`, `1` (after Battles) and `602.2` (last cell) | |
| ## Recommended interpretation | |
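| I should only claim strategic improvement if at least one of the following holds for the trained checkpoint relative to the baseline, while legality remains strong (high format hit rate, few `model_invalid` and `env_illegal` turns): | |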
| - higher average reward per turn | |
| - lower model invalid count | |
| - better commentary examples on matchup-sensitive turns | |
| - more sensible switches and move selection under pressure | |
| I should not claim deep strategic learning from SFT alone. The meaningful claim comes from GRPO improving policy behavior on real rollout data. | |
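The two quantitative criteria can be checked mechanically before I look at the commentary traces. This is a rough sketch: `can_claim_improvement` and its thresholds (`min_format_hit_rate`, `max_env_illegal`) are assumptions I would need to tune, and the qualitative criteria still require human review.

```python
def can_claim_improvement(baseline, trained,
                          min_format_hit_rate=95.0,  # assumed threshold
                          max_env_illegal=0):        # assumed threshold
    """Return True if the trained metrics support a claim of improvement.

    Both arguments are dicts with the keys used below. Only the numeric
    criteria are covered; commentary spot checks are not.
    """
    legality_ok = (
        trained["format_hit_rate"] >= min_format_hit_rate
        and trained["env_illegal"] <= max_env_illegal
    )
    improved = (
        trained["avg_reward_per_turn"] > baseline["avg_reward_per_turn"]
        or trained["model_invalid"] < baseline["model_invalid"]
    )
    return legality_ok and improved
```

A `False` result does not rule out improvement on the qualitative criteria, but it does mean the numbers alone should not be cited as evidence.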
| ## Artifacts to link | |
| When the benchmark is finished, add links here: | |
| - Hugging Face model repo: [https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1](https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1) | |
| - benchmark notebook / cell output: `TBD` | |
| - demo video timestamp showing benchmark: `TBD` | |