Training results
Live backend: real Freeciv Web on H100
Model: Qwen/Qwen3.5-0.8B + Unsloth LoRA + TRL GRPO
Run: 10 steps, 32 live states, batch size 8
Train runtime: not recorded
Observed reward improvement: 0.125 → 1.000
Best visible point: step 10, reward 1.000
Reward curve

Start vs end

Per-step reward
| step | reward | reward std |
|---|---|---|
| 1 | 0.125 | 0.250 |
| 2 | 0.375 | 0.539 |
| 3 | 0.250 | 0.500 |
| 4 | 0.500 | 0.577 |
| 5 | 0.625 | 0.539 |
| 6 | 0.875 | 0.250 |
| 7 | 0.750 | 0.500 |
| 8 | 0.875 | 0.250 |
| 9 | 0.750 | 0.500 |
| 10 | 1.000 | 0.000 |
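
The headline figures above (start, end, best step) can be recomputed from the per-step table. The following is a minimal sketch, not part of the training code: the `summarize` helper and its output keys are hypothetical names introduced for illustration, and the reward values are copied by hand from the table.

```python
# Per-step rewards, copied by hand from the table above (steps 1..10).
rewards = [0.125, 0.375, 0.250, 0.500, 0.625, 0.875, 0.750, 0.875, 0.750, 1.000]


def summarize(rewards):
    """Hypothetical helper: summarize a list of per-step rewards."""
    start, end = rewards[0], rewards[-1]
    # 1-indexed step with the highest reward (first occurrence on ties).
    best_step = max(range(len(rewards)), key=lambda i: rewards[i]) + 1
    # Number of steps whose reward is lower than the previous step's.
    regressions = sum(1 for prev, cur in zip(rewards, rewards[1:]) if cur < prev)
    return {
        "start": start,
        "end": end,
        "gain": end - start,
        "best_step": best_step,
        "regressions": regressions,
    }


summary = summarize(rewards)
print(summary)
```

The reward does not rise monotonically: steps 3, 7, and 9 each fall below the preceding step, so the 0.125 → 1.000 improvement is best read together with the reward std column rather than as a smooth trend.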