# Benchmark Plan

This file defines the benchmark protocol for OpenEnv-WolfeClick and records the results I will use in the hackathon submission.

## Goal

I want to measure whether the environment and training loop improve actual policy behavior, not just JSON formatting.

The benchmark therefore compares:

1. `Baseline`: the base model after the JSON warm-up SFT only (no GRPO)
2. `Trained`: the GRPO-trained model checkpoint produced from real Pokemon Showdown rollouts

This is the fairest comparison for this project because:

- the raw base model is not instruction-aligned enough for this exact action interface
- the warm-up SFT establishes the minimum viable policy that can participate in the environment
- the GRPO stage is the part that should improve behavior beyond formatting

## Benchmark setup

### Environment

- battle format: `gen4randombattle`
- environment code: [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py)
- reward source: [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py)
- state formatter: [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py)
- action validation: [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py)

### Policy checkpoints

- `Baseline`: checkpoint after the JSON warm-up SFT only (no GRPO)
- `Trained`: latest GRPO checkpoint trained on real rollout data

### Evaluation budget

Use the same evaluation budget for both checkpoints:

- `10` battles minimum
- same notebook generation settings
- same anti-stall environment settings

If time permits, repeat the benchmark for `20` battles to reduce variance.
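To make the variance argument concrete, here is a minimal sketch of the standard error of the mean battle reward at 10 and 20 battles. The reward values are synthetic placeholders, not benchmark results, and `standard_error` is a hypothetical helper that is not part of the repo.

```python
import math
import statistics


def standard_error(battle_rewards):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(battle_rewards)
    if n < 2:
        raise ValueError("need at least 2 battles to estimate variance")
    return statistics.stdev(battle_rewards) / math.sqrt(n)


# Synthetic per-battle total rewards, for illustration only.
rewards_10 = [-0.9, 0.4, -0.2, 1.1, -0.5, 0.3, -1.0, 0.8, 0.0, -0.4]
# Same spread, twice as many battles (again synthetic).
rewards_20 = rewards_10 * 2

print(f"SE with 10 battles: {standard_error(rewards_10):.3f}")
print(f"SE with 20 battles: {standard_error(rewards_20):.3f}")
```

With a similar spread of rewards, doubling the number of battles shrinks the standard error by roughly a factor of `sqrt(2)`, which is why the 20-battle run is worth doing if time permits.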

## Metrics

These are the primary metrics I care about:

1. `format hit rate`
   - percentage of turns where the model produced a valid JSON action without fallback

2. `model_invalid`
   - number of turns where the model failed validation and the fallback action was used

3. `env_illegal`
   - number of environment-illegal actions after parsing

4. `avg reward / turn`
   - average shaped environment reward across all evaluated turns

5. `avg battle reward`
   - mean total reward across battles

6. `avg battle length`
   - average number of turns per battle

7. `commentary spot checks`
   - at least 2 human-readable battle traces to inspect whether the policy is making plausible strategic choices
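The quantitative metrics above can be sketched as a single aggregation over per-turn records. This is an illustration under stated assumptions, not the code in `benchmark.py`: `TurnRecord` and `summarize` are hypothetical names, and the format hit rate is assumed to be the share of turns that did not fall back (that is, turns not counted in `model_invalid`).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TurnRecord:
    """Hypothetical per-turn log entry; field names are assumptions."""
    battle_id: int
    reward: float
    model_invalid: bool  # model output failed validation, fallback was used
    env_illegal: bool    # action parsed but was illegal in the environment


def summarize(turns: List[TurnRecord]) -> Dict[str, float]:
    """Aggregate per-turn records into the benchmark metrics."""
    if not turns:
        raise ValueError("no turns recorded")
    n_turns = len(turns)

    # Total reward per battle, keyed by battle id.
    battle_totals: Dict[int, float] = {}
    for t in turns:
        battle_totals[t.battle_id] = battle_totals.get(t.battle_id, 0.0) + t.reward
    n_battles = len(battle_totals)

    model_invalid = sum(t.model_invalid for t in turns)
    return {
        "battles": n_battles,
        "format_hit_rate": 100.0 * (n_turns - model_invalid) / n_turns,
        "model_invalid": model_invalid,
        "env_illegal": sum(t.env_illegal for t in turns),
        "avg_reward_per_turn": sum(t.reward for t in turns) / n_turns,
        "avg_battle_reward": sum(battle_totals.values()) / n_battles,
        "avg_battle_length": n_turns / n_battles,
    }
```

Keeping all six numbers in one function makes it harder for the baseline and trained runs to drift apart in how a metric is defined.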

## Benchmark procedure

### Baseline run

1. Load the base model
2. Run only the JSON warm-up SFT stage
3. Do not run GRPO
4. Run the benchmark battles and record metrics

### Trained run

1. Load the GRPO-trained adapter checkpoint
2. Run the same benchmark battles
3. Record the same metrics

### Consistency rules

- keep `NUM_GENERATIONS`, `MAX_NEW_TOKENS`, and `TEMPERATURE` fixed during evaluation
- do not change reward shaping between the baseline and trained runs
- use the same battle cap for both checkpoints
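One way to enforce these rules is to build a single immutable settings object and check that both runs use it. The sketch below is a hypothetical helper, not something in the repo: the field names mirror the notebook constants, `max_battle_turns` is an assumed name for the battle cap, and all values are placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Evaluation settings that must be identical for both checkpoints."""
    num_generations: int
    max_new_tokens: int
    temperature: float
    max_battle_turns: int  # assumed name for the battle cap
    num_battles: int


# Placeholder values, not the project's real settings.
SHARED = EvalConfig(
    num_generations=4,
    max_new_tokens=64,
    temperature=0.7,
    max_battle_turns=30,
    num_battles=10,
)


def assert_same_config(baseline: EvalConfig, trained: EvalConfig) -> None:
    """Raise if any evaluation setting differs between the two runs."""
    base, train = asdict(baseline), asdict(trained)
    diffs = {k: (base[k], train[k]) for k in base if base[k] != train[k]}
    if diffs:
        raise ValueError(f"evaluation settings differ: {diffs}")


assert_same_config(SHARED, SHARED)  # passes silently
```

Because the dataclass is frozen, the settings cannot be mutated by accident between the baseline run and the trained run.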

## Benchmark script

Use [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/benchmark.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/benchmark.py) to run the benchmark.

Before running it, edit the `CHECKPOINTS` list in that script so it points to:

- the baseline checkpoint (JSON warm-up SFT only)
- the trained GRPO checkpoint

Then run:

```bash
python3 /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/benchmark.py
```

The script writes a Markdown table to [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/latest_results.md`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/latest_results.md).
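For reference, a minimal sketch of rendering such a table with the column set used in this file. This is not the actual code in `benchmark.py`; `COLUMNS` and `render_table` are hypothetical names. The point of the sketch is the cell-count check, which rejects any row that does not match the header.

```python
COLUMNS = [
    "Checkpoint", "Battles", "Format Hit Rate", "Model Invalid", "Env Illegal",
    "Avg Reward / Turn", "Avg Battle Reward", "Avg Battle Length", "Notes",
]


def render_table(rows):
    """Render rows as a Markdown table, rejecting rows that do not fit the header."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "| " + " | ".join("---" for _ in COLUMNS) + " |",
    ]
    for row in rows:
        if len(row) != len(COLUMNS):
            raise ValueError(
                f"row has {len(row)} cells, expected {len(COLUMNS)}: {row!r}"
            )
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines) + "\n"


# Placeholder row, for illustration only.
print(render_table([["Baseline (JSON SFT only)"] + ["Pending"] * 8]))
```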

## Results table

Fill this table after each benchmark run.

| Checkpoint | Battles | Format Hit Rate | Model Invalid | Env Illegal | Avg Reward / Turn | Avg Battle Reward | Avg Battle Length | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (JSON SFT only) | Pending | Pending | Pending | Pending | Pending | Pending | Pending | Pending |
| Trained (GRPO) | 1 | 100.0% | 0 | 0 | -0.031 | -0.919 | 30.0 | Raw script row also had three unlabeled values after Battles (`0`, `0`, `1`) and a trailing `602.2` |

## Recommended interpretation

I should only claim strategic improvement if legality remains strong (high format hit rate, `env_illegal` near zero) and the trained checkpoint shows at least one of the following:

- higher average reward per turn
- lower model invalid count
- better commentary examples on matchup-sensitive turns
- more sensible switches and move selection under pressure

I should not claim deep strategic learning from SFT alone. The meaningful claim comes from GRPO improving policy behavior on real rollout data.
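The quantitative half of this rule can be written as a small check. This is a sketch under assumptions: the function name, the metric keys, and both thresholds (`min_format_hit_rate`, `max_env_illegal`) are hypothetical, and the commentary-based criteria still need a human reading of the battle traces.

```python
def supports_improvement_claim(
    baseline,
    trained,
    min_format_hit_rate=95.0,  # assumed threshold for "legality remains strong"
    max_env_illegal=0,         # assumed threshold, not defined in this plan
):
    """Return True if the numeric metrics support a strategic-improvement claim."""
    legality_ok = (
        trained["format_hit_rate"] >= min_format_hit_rate
        and trained["env_illegal"] <= max_env_illegal
    )
    improved = (
        trained["avg_reward_per_turn"] > baseline["avg_reward_per_turn"]
        or trained["model_invalid"] < baseline["model_invalid"]
    )
    return legality_ok and improved


# Placeholder metrics, for illustration only (not benchmark results).
baseline = {"format_hit_rate": 98.0, "env_illegal": 0,
            "avg_reward_per_turn": -0.10, "model_invalid": 3}
trained = {"format_hit_rate": 100.0, "env_illegal": 0,
           "avg_reward_per_turn": -0.03, "model_invalid": 0}
print(supports_improvement_claim(baseline, trained))
```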

## Artifacts to link

When the benchmark is finished, add links here:

- Hugging Face model repo: [https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1](https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1)
- benchmark notebook / cell output: `TBD`
- demo video timestamp showing benchmark: `TBD`