Benchmark Plan

This file defines the benchmark protocol for OpenEnv-WolfeClick and records the results I will use in the hackathon submission.

Goal

I want to measure whether the environment and training loop improve actual policy behavior, not just JSON formatting.

The benchmark therefore compares:

Baseline: the base model after the JSON warm-up SFT only, with no GRPO training
Trained: the GRPO-trained model checkpoint produced from real Pokemon Showdown rollouts

This is the fairest comparison for this project because:

the raw base model is not instruction-aligned enough for this exact action interface
the warm-up SFT establishes the minimum viable policy that can participate in the environment
the GRPO stage is the part that should improve behavior beyond formatting

Benchmark setup

Environment

battle format: gen4randombattle
environment code: /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py
reward source: /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py
state formatter: /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py
action validation: /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py

Policy checkpoints

Baseline: JSON warm-up only checkpoint
Trained: latest GRPO checkpoint trained on real rollout data

Evaluation budget

Use the same evaluation budget for both checkpoints:

10 battles minimum
same notebook generation settings
same anti-stall environment settings

If time permits, repeat the benchmark for 20 battles to reduce variance.
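
To judge whether 20 battles is worth the extra time, a rough check is the standard error of the mean battle reward, which shrinks roughly with the square root of the number of battles. This is a minimal sketch; the function name and input list are hypothetical and not part of the benchmark script.

```python
from math import sqrt
from statistics import stdev


def standard_error(battle_rewards: list[float]) -> float:
    """Standard error of the mean total reward per battle."""
    if len(battle_rewards) < 2:
        raise ValueError("need at least two battles to estimate spread")
    # Sample standard deviation divided by sqrt(n).
    return stdev(battle_rewards) / sqrt(len(battle_rewards))
```

With a similar spread of rewards, doubling the number of battles from 10 to 20 should reduce this value by a factor of about 1.4.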

Metrics

These are the primary metrics I care about:

format hit rate: percentage of turns where the model produced a valid JSON action without fallback
model_invalid: number of turns where the model failed validation and the fallback action was used
env_illegal: number of environment-illegal actions after parsing
avg reward / turn: average shaped environment reward across all evaluated turns
avg battle reward: mean total reward across battles
avg battle length: average number of turns per battle
commentary spot checks: at least 2 human-readable battle traces to inspect whether the policy is making plausible strategic choices
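
A minimal sketch of how the numeric metrics above could be computed from per-turn logs. The TurnRecord fields and the summarize function are hypothetical names for illustration and are not taken from benchmark.py; the commentary spot checks are manual and are not covered here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TurnRecord:
    """One evaluated turn. All field names are hypothetical."""
    battle_id: int
    model_valid: bool   # model produced a valid JSON action (no fallback)
    env_illegal: bool   # parsed action was illegal in the environment
    reward: float       # shaped environment reward for this turn


def summarize(turns: list[TurnRecord]) -> dict[str, float]:
    """Aggregate per-turn records into the benchmark metrics."""
    if not turns:
        raise ValueError("no turns to summarize")
    # Group rewards by battle so per-battle totals and lengths can be derived.
    per_battle: dict[int, list[float]] = {}
    for t in turns:
        per_battle.setdefault(t.battle_id, []).append(t.reward)
    return {
        "battles": len(per_battle),
        "format_hit_rate": 100.0 * sum(t.model_valid for t in turns) / len(turns),
        "model_invalid": sum(not t.model_valid for t in turns),
        "env_illegal": sum(t.env_illegal for t in turns),
        "avg_reward_per_turn": mean(t.reward for t in turns),
        "avg_battle_reward": mean(sum(r) for r in per_battle.values()),
        "avg_battle_length": mean(len(r) for r in per_battle.values()),
    }
```

Note that format hit rate and model_invalid are two views of the same count here: one is a percentage of turns, the other is the raw number of fallback turns.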

Benchmark procedure

Baseline run

Load the base model
Run only the JSON warm-up SFT stage
Do not run GRPO
Run the benchmark battles and record metrics

Trained run

Load the GRPO-trained adapter checkpoint
Run the same benchmark battles
Record the same metrics

Consistency rules

keep NUM_GENERATIONS, MAX_NEW_TOKENS, and TEMPERATURE fixed during evaluation
do not change reward shaping between the baseline and trained runs
use the same battle cap for both checkpoints
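
One way to enforce these rules is to keep the evaluation settings in a single immutable object and compare it across runs. This is a sketch only: EvalConfig, its field names, and every default value below are placeholders I made up for illustration, not the settings used in the notebook.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Settings held fixed for both checkpoints. Values are placeholders."""
    num_generations: int = 1
    max_new_tokens: int = 64
    temperature: float = 0.7
    num_battles: int = 10
    max_turns_per_battle: int = 100  # hypothetical battle cap


def assert_same_config(a: EvalConfig, b: EvalConfig) -> None:
    """Raise if two runs were evaluated under different settings."""
    other = asdict(b)
    diff = {k: (v, other[k]) for k, v in asdict(a).items() if other[k] != v}
    if diff:
        raise ValueError(f"evaluation settings differ: {diff}")
```

Because the dataclass is frozen, a setting cannot be changed by accident between the baseline run and the trained run.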

Benchmark script

Use /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benchmarks/benchmark.py to run the benchmark.

Before running it, edit the CHECKPOINTS list in that script so it points to:

the JSON-warmup-only baseline checkpoint
the trained GRPO checkpoint

Then run:

python3 /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benchmarks/benchmark.py

The script writes a Markdown table to /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benchmarks/latest_results.md.
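
For reference, a Markdown table with the columns used in this file could be rendered roughly as follows. This is an illustrative sketch, not the actual code in benchmark.py; render_table and the row dictionaries are hypothetical.

```python
COLUMNS = [
    "Checkpoint", "Battles", "Format Hit Rate", "Model Invalid", "Env Illegal",
    "Avg Reward / Turn", "Avg Battle Reward", "Avg Battle Length", "Notes",
]


def render_table(rows: list[dict[str, str]]) -> str:
    """Render rows as a Markdown table, filling missing cells with 'Pending'."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "| " + " | ".join("---" for _ in COLUMNS) + " |"
    body = [
        # Look cells up by column name so values cannot drift out of order.
        "| " + " | ".join(row.get(col, "Pending") for col in COLUMNS) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body]) + "\n"
```

Keying each cell by column name, rather than relying on positional order, keeps every value under the header it belongs to.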

Results table

Fill this table after each benchmark run.

Checkpoint	Battles	Format Hit Rate	Model Invalid	Env Illegal	Avg Reward / Turn	Avg Battle Reward	Avg Battle Length	Notes
Baseline (JSON SFT only)	Pending	Pending	Pending	Pending	Pending	Pending	Pending	Pending
Trained (GRPO)	1	0	0	1	100.0%	0	0	-0.031

Recommended interpretation

I should only claim strategic improvement if I observe at least one of the following while legality remains strong:

higher average reward per turn
lower model invalid count
better commentary examples on matchup-sensitive turns
more sensible switches and move selection under pressure
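
The two quantitative criteria above can be written as a simple check. This is a sketch under assumptions: the dictionary keys are hypothetical, and the thresholds that define "legality remains strong" are placeholders I chose for illustration, since this plan does not fix them. The commentary and move-selection criteria still need manual review.

```python
def supports_improvement_claim(
    baseline: dict[str, float],
    trained: dict[str, float],
    max_env_illegal: int = 0,           # hypothetical legality threshold
    min_format_hit_rate: float = 95.0,  # hypothetical legality threshold
) -> bool:
    """Return True if the numeric results support a strategic-improvement claim."""
    legality_ok = (
        trained["env_illegal"] <= max_env_illegal
        and trained["format_hit_rate"] >= min_format_hit_rate
    )
    improved = (
        trained["avg_reward_per_turn"] > baseline["avg_reward_per_turn"]
        or trained["model_invalid"] < baseline["model_invalid"]
    )
    return legality_ok and improved
```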

I should not claim deep strategic learning from SFT alone. The meaningful claim comes from GRPO improving policy behavior on real rollout data.

Artifacts to link

When the benchmark is finished, add links here:

Hugging Face model repo: https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1
benchmark notebook / cell output: TBD
demo video timestamp showing benchmark: TBD