halluci-mate v2c
DPO fine-tune of jspaulsen/halluci-mate-v2b
on Stockfish-vs-model preference pairs. Same Qwen3-0.6B architecture, same custom
~1,800-token UCI tokenizer; the policy was nudged to prefer the moves Stockfish
endorses over the moves v2b actually played in not-yet-lost positions.
Source: https://github.com/jspaulsen/halluci-mate
Training data
11,491 preference pairs derived from a 10,000-game v2b-vs-Stockfish run (skill 5, depth 12) using the halluci-mate eval harness:
scripts/eval.py export-dpo <run> --flavor quality \
--require-consequential --exclude-repetition
- Quality threshold: centipawn loss > 200.
- --require-consequential drops moves played from positions already evaluated as lost (eval-before < -800 cp). Blunders in already-lost endgames teach the model to chase swindle lines that don't generalize.
- --exclude-repetition drops moves that recur within the same game. Stockfish flags forced-repetition draws as blunders even when repetition is the only drawing line.
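The three filters above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the `MoveRecord` fields, the `select_pairs` helper, the output dict layout, and the reading of "recurs" as "the same move from the same position appears more than once in a game" are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MoveRecord:
    """Hypothetical per-move record; field names are illustrative only."""
    game_id: str
    position_key: str    # position identity (assumed to ignore move counters)
    played: str          # UCI move the model actually played
    best: str            # UCI move Stockfish preferred
    eval_before_cp: int  # evaluation before the move, from the mover's side
    cpl: int             # centipawn loss of the played move


def select_pairs(records, cpl_threshold=200, lost_cp=-800):
    """Apply the quality, consequential, and repetition filters in turn."""
    # Count how often each (game, position, move) triple occurs.
    seen = Counter((r.game_id, r.position_key, r.played) for r in records)
    pairs = []
    for r in records:
        if r.cpl <= cpl_threshold:
            continue  # quality threshold: keep only centipawn loss > 200
        if r.eval_before_cp < lost_cp:
            continue  # consequential filter: skip already-lost positions
        if seen[(r.game_id, r.position_key, r.played)] > 1:
            continue  # repetition filter (assumed interpretation)
        # Stockfish's move is preferred over the move the model played.
        pairs.append({"prompt": r.position_key, "chosen": r.best, "rejected": r.played})
    return pairs
```

Note that both comparisons are strict, matching the thresholds as stated: a move with exactly 200 cp loss is dropped, and a position at exactly -800 cp still counts as not yet lost.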
Training recipe
scripts/train_dpo.py via TRL's DPOTrainer:
| setting | value |
|---|---|
| Method | DPO (TRL) |
| Base / reference model | jspaulsen/halluci-mate-v2b |
| Pairs | 11,491 (98% train / 2% eval) |
| Learning rate | 1e-5 |
| LR schedule | cosine, 5% warmup |
| Beta | 0.1 |
| Epochs | 2 (trained), checkpoint-200 selected (≈1.14 ep) |
| Per-device batch × grad accum × GPUs | 16 × 2 × 2 = effective 64 |
| Optimizer | paged AdamW 8-bit |
| Precision | bf16 + flash-attention-2 |
| Total steps | 352 (full run); shipped checkpoint at step 200 |
| Hardware | RTX PRO 4500 Blackwell + RTX 3090 (DDP) |
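The step counts in the table follow from the pair count and the effective batch size. The short check below reproduces them; how the 98/2 split is rounded is an assumption, though rounding either way gives the same step count.

```python
import math

total_pairs = 11_491
train_pairs = round(total_pairs * 0.98)   # assumed rounding of the 98% split
effective_batch = 16 * 2 * 2              # per-device batch x grad accum x GPUs

# One optimizer step consumes one effective batch.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * 2         # 2 epochs trained

# Fraction of training seen by the shipped checkpoint.
epochs_at_checkpoint = 200 / steps_per_epoch

print(steps_per_epoch, total_steps, round(epochs_at_checkpoint, 2))
```

This gives 176 steps per epoch and 352 total steps, so step 200 sits at roughly 1.14 epochs, in agreement with the table.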
The 2-epoch run plateaus on eval_rewards/accuracies around step 200 while
margins keep growing through step 352. On vs-Stockfish, the late checkpoints
regress on blunder rate (especially endgame and lost positions) despite a small
further gain on tactical_oversight. Step 200 is the behavioral sweet spot;
it is the checkpoint published here.
Held-out eval at the shipped checkpoint (step 200)
| metric | value |
|---|---|
| eval_loss | 0.6634 |
| eval_rewards/accuracies | 0.773 |
| eval_rewards/margins | 0.0593 |
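For readers unfamiliar with these metrics, the sketch below shows how the standard sigmoid DPO loss relates them for a single preference pair. It assumes the default sigmoid loss with no label smoothing; the `dpo_pair` helper and its argument names are illustrative and not taken from the training script.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_pair(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, reward margin, pair-is-correct) for one pair."""
    # Implicit rewards: beta-scaled log-prob ratios against the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    loss = -log_sigmoid(margin)
    # The accuracy metric is the fraction of pairs with a positive margin.
    return loss, margin, margin > 0


# Before any training the policy equals the reference, so the margin is 0
# and the loss is ln(2), about 0.693.
print(dpo_pair(-1.0, -2.0, -1.0, -2.0))
```

With a zero margin the loss starts at ln 2 ≈ 0.693, so an eval_loss slightly below that alongside a small positive mean margin is what a lightly nudged policy would be expected to show. Plugging the mean margin of 0.0593 into the same formula gives roughly 0.664, which is only an approximate sanity check because the reported loss is a mean over pairs, not the loss at the mean margin.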
vs-Stockfish (skill 5, depth 12, alternating colors, --sf-analyze)
| metric | v2b (10,000 games) | v2c (100 games) | Δ |
|---|---|---|---|
| score_rate | 0.0897 | 0.085 | -0.5pp (within n=100 noise) |
| legal_rate | 0.9767 | 0.9755 | flat |
| tactical_oversight_rate | 0.1535 | 0.1437 | -0.98pp |
| blunder_rate (overall) | 0.0653 | 0.0616 | -0.37pp |
| blunder_rate (consequential) | 0.0505 | 0.0483 | -0.22pp |
| blunder_rate (lost positions) | 0.0849 | 0.0795 | -0.54pp |
| blunder_rate (endgame) | 0.0866 | 0.0811 | -0.55pp |
| CPL p95 (overall) | 263 | 253 | -10cp |
The v2b numbers come from 10,000 games, so its CIs are much tighter than v2c's 100-game eval. score_rate is within the n=100 noise band; per-move quality improvements on tactical_oversight, blunder rate, and CPL tail are above noise and consistent with the DPO objective.
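The noise-band remark for score_rate can be checked with a rough standard-error estimate. This is a back-of-the-envelope sketch: it treats each game as an independent draw and uses the binomial variance p(1-p), which is an upper bound for per-game scores in {0, 0.5, 1}. The `score_rate_se` helper is illustrative and not part of the eval harness.

```python
import math


def score_rate_se(p: float, n_games: int) -> float:
    """Upper-bound standard error of a mean game score over n_games."""
    return math.sqrt(p * (1 - p) / n_games)


v2b_rate, v2c_rate, v2c_games = 0.0897, 0.085, 100

se = score_rate_se(v2b_rate, v2c_games)   # about 0.029, i.e. ~2.9pp
delta = v2c_rate - v2b_rate               # about -0.005, i.e. ~-0.5pp

# A difference well inside ~2 standard errors is indistinguishable from noise.
within_noise = abs(delta) < 1.96 * se
print(f"se={se:.4f} delta={delta:+.4f} within_noise={within_noise}")
```

The observed -0.5pp shift is a small fraction of the roughly 2.9pp standard error at n=100. The per-move metrics are averaged over many more observations than 100, but the move counts are not given here, so the same calculation is not repeated for them.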
License
MIT, matching the v2b base.