Training results

Live backend: real Freeciv Web on H100
Model: Qwen/Qwen3.5-0.8B + Unsloth LoRA + TRL GRPO
Run: 10 steps, 32 live states, batch size 8
Train runtime: not recorded
Observed reward improvement: 0.125 (step 1) → 1.000 (step 10)
Best visible point: step 10, reward 1.000

Reward curve

[Figure: reward curve]

Start vs end

[Figure: reward before vs. after training]

Per-step reward

step  reward  reward std
   1   0.125       0.250
   2   0.375       0.539
   3   0.250       0.500
   4   0.500       0.577
   5   0.625       0.539
   6   0.875       0.250
   7   0.750       0.500
   8   0.875       0.250
   9   0.750       0.500
  10   1.000       0.000
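
The headline numbers can be re-derived from the per-step values. The following is a minimal sketch, not part of the training code: the names `PER_STEP` and `summarize` are hypothetical, and the values are copied by hand from the per-step table above.

```python
# Per-step values copied from the per-step table: (step, reward, reward std).
PER_STEP = [
    (1, 0.125, 0.250),
    (2, 0.375, 0.539),
    (3, 0.250, 0.500),
    (4, 0.500, 0.577),
    (5, 0.625, 0.539),
    (6, 0.875, 0.250),
    (7, 0.750, 0.500),
    (8, 0.875, 0.250),
    (9, 0.750, 0.500),
    (10, 1.000, 0.000),
]


def summarize(rows):
    """Return start/end reward, the gain, the best step, and steps where reward fell."""
    start = rows[0][1]
    end = rows[-1][1]
    # max() keeps the earliest row on ties.
    best_step, best_reward, _ = max(rows, key=lambda row: row[1])
    # Steps whose reward is lower than the previous step's reward.
    drops = [cur[0] for prev, cur in zip(rows, rows[1:]) if cur[1] < prev[1]]
    return {
        "start": start,
        "end": end,
        "gain": end - start,
        "best_step": best_step,
        "best_reward": best_reward,
        "drops": drops,
    }


summary = summarize(PER_STEP)
print(f"reward {summary['start']:.3f} -> {summary['end']:.3f} "
      f"(gain {summary['gain']:.3f}), best at step {summary['best_step']}")
print("steps with a lower reward than the previous step:", summary["drops"])
```

The `drops` list shows that the curve is not monotonic: reward dips at a few steps before reaching its best value at the final step.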