WolfeClick / benchmarks /benchmark.md
Atharva
Add HF Space battle viewer, polish docs, and deployment files
16c6b0b

Benchmark Plan

This file defines the benchmark protocol for OpenEnv-WolfeClick and records the results I will use in the hackathon submission.

Goal

I want to measure whether the environment and training loop improve actual policy behavior, not just JSON formatting.

The benchmark therefore compares:

  1. Baseline: the plain base model after the JSON warm-up SFT only
  2. Trained: the GRPO-trained model checkpoint produced from real Pokemon Showdown rollouts

This is the fairest comparison for this project because:

  • the raw base model is not instruction-aligned enough for this exact action interface
  • the warm-up SFT establishes the minimum viable policy that can participate in the environment
  • the GRPO stage is the part that should improve behavior beyond formatting

Benchmark setup

Environment

Policy checkpoints

  • Baseline: JSON warm-up only checkpoint
  • Trained: latest GRPO checkpoint trained on real rollout data

Evaluation budget

Use the same evaluation budget for both checkpoints:

  • 10 battles minimum
  • same notebook generation settings
  • same anti-stall environment settings

If time permits, repeat the benchmark for 20 battles to reduce variance.

Metrics

These are the primary metrics I care about:

  1. format hit rate
  • percentage of turns where the model produced a valid JSON action without fallback
  1. model_invalid
  • number of turns where the model failed validation and the fallback action was used
  1. env_illegal
  • number of environment-illegal actions after parsing
  1. avg reward / turn
  • average shaped environment reward across all evaluated turns
  1. avg battle reward
  • mean total reward across battles
  1. avg battle length
  • average number of turns per battle
  1. commentary spot checks
  • at least 2 human-readable battle traces to inspect whether the policy is making plausible strategic choices

Benchmark procedure

Baseline run

  1. Load the base model
  2. Run only the JSON warm-up SFT stage
  3. Do not run GRPO
  4. Run the benchmark battles and record metrics

Trained run

  1. Load the GRPO-trained adapter checkpoint
  2. Run the same benchmark battles
  3. Record the same metrics

Consistency rules

  • keep NUM_GENERATIONS, MAX_NEW_TOKENS, and TEMPERATURE fixed during evaluation
  • do not change reward shaping between baseline and trained benchmark
  • do not use different battle caps for one checkpoint and not the other

Benchmark script

Use /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/benchmark.py to run the benchmark.

Before running it, edit the CHECKPOINTS list in that script so it points to:

  • the JSON-warmup-only baseline checkpoint
  • the trained GRPO checkpoint

Then run:

python3 /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/benchmark.py

The script writes a Markdown table to /Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/benckmarks/latest_results.md.

Results table

Fill this table after each benchmark run.

Checkpoint Battles Format Hit Rate Model Invalid Env Illegal Avg Reward / Turn Avg Battle Reward Avg Battle Length Notes
Baseline (JSON SFT only) Pending Pending Pending Pending Pending Pending Pending Pending
Trained (GRPO) 1 0 0 1 100.0% 0 0 -0.031

Recommended interpretation

I should only claim strategic improvement if at least one of the following improves while legality remains strong:

  • higher average reward per turn
  • lower model invalid count
  • better commentary examples on matchup-sensitive turns
  • more sensible switches and move selection under pressure

I should not claim deep strategic learning from SFT alone. The meaningful claim comes from GRPO improving policy behavior on real rollout data.

Artifacts to link

When the benchmark is finished, add links here: