halluci-mate v2c

DPO fine-tune of jspaulsen/halluci-mate-v2b on Stockfish-vs-model preference pairs. Same Qwen3-0.6B architecture, same custom ~1,800-token UCI tokenizer; the policy was nudged to prefer the moves Stockfish endorses over the moves v2b actually played in not-yet-lost positions.

Source: https://github.com/jspaulsen/halluci-mate

Training data

11,491 preference pairs derived from a 10,000-game v2b-vs-Stockfish run (skill 5, depth 12) using the halluci-mate eval harness:

scripts/eval.py export-dpo <run> --flavor quality \
  --require-consequential --exclude-repetition

Quality threshold: centipawn loss > 200.
--require-consequential drops moves played from positions already evaluated as lost (eval-before < -800 cp), because blunders in already-lost endgames teach the model to chase swindle lines that don't generalize.
--exclude-repetition drops moves that recur within the same game, because Stockfish flags forced-repetition draws as blunders even when repetition is the only drawing line.
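
The three filters above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the `MoveRecord` type, its field names, and `select_pairs` are hypothetical, and "recur within the same game" is read here as "the same move was already played earlier in that game".

```python
from dataclasses import dataclass

CPL_THRESHOLD = 200   # quality threshold from the text: centipawn loss > 200
LOST_EVAL_CP = -800   # eval-before below this counts as already lost


@dataclass(frozen=True)
class MoveRecord:
    """Hypothetical per-move record; field names are illustrative only."""
    game_id: int
    played_uci: str      # move the model actually played (rejected)
    best_uci: str        # move Stockfish preferred (chosen)
    eval_before_cp: int  # evaluation before the move, from the mover's side
    cpl: int             # centipawn loss of the played move


def select_pairs(moves):
    """Return (chosen, rejected) UCI pairs that pass all three filters."""
    seen = set()  # (game_id, played move) combinations already encountered
    pairs = []
    for m in moves:
        key = (m.game_id, m.played_uci)
        repeated = key in seen
        seen.add(key)
        if m.cpl <= CPL_THRESHOLD:
            continue  # not a large enough error to learn from
        if m.eval_before_cp < LOST_EVAL_CP:
            continue  # position was already lost: not consequential
        if repeated:
            continue  # move recurs within the same game
        pairs.append((m.best_uci, m.played_uci))
    return pairs
```

Each surviving pair treats Stockfish's move as the chosen completion and the model's own move as the rejected one.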

Training recipe

scripts/train_dpo.py via TRL's DPOTrainer:


setting	value
Method	DPO (TRL)
Base / reference model	`jspaulsen/halluci-mate-v2b`
Pairs	11,491 (98% train / 2% eval)
Learning rate	1e-5
LR schedule	cosine, 5% warmup
Beta	0.1
Epochs	2 (trained), checkpoint-200 selected (≈1.14 ep)
Per-device batch × grad accum × GPUs	16 × 2 × 2 = effective 64
Optimizer	paged AdamW 8-bit
Precision	bf16 + flash-attention-2
Total steps	352 (full run); shipped checkpoint at step 200
Hardware	RTX PRO 4500 Blackwell + RTX 3090 (DDP)
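
The step counts in the table can be cross-checked with a few lines of arithmetic. The sketch assumes the 98% train split is rounded to the nearest pair and that a partial final batch still counts as a step; neither detail is stated in the source.

```python
import math

pairs = 11_491
train_fraction = 0.98
per_device_batch, grad_accum, gpus = 16, 2, 2

effective_batch = per_device_batch * grad_accum * gpus      # 16 x 2 x 2
train_pairs = round(pairs * train_fraction)                 # rounding is an assumption
steps_per_epoch = math.ceil(train_pairs / effective_batch)  # partial batch counts as a step
total_steps = 2 * steps_per_epoch                           # 2 epochs trained
epochs_at_200 = 200 / steps_per_epoch                       # position of the shipped checkpoint

print(effective_batch, steps_per_epoch, total_steps, round(epochs_at_200, 2))
```

Under these assumptions the numbers agree with the table: an effective batch of 64, 352 total steps, and step 200 falling at roughly 1.14 epochs.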

The 2-epoch run plateaus on eval_rewards/accuracies around step 200, while margins keep growing through step 352. In vs-Stockfish play, the later checkpoints regress on blunder rate (especially in endgames and lost positions) despite a small further gain on tactical_oversight. Step 200 is therefore the behavioral sweet spot, and it is the checkpoint published here.

Held-out eval at the shipped checkpoint (step 200)

metric	value
eval_loss	0.6634
eval_rewards/accuracies	0.773
eval_rewards/margins	0.0593
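
To put eval_loss in context, here is a minimal sketch of the per-pair loss, assuming the standard sigmoid DPO objective (the source does not state which loss variant was used). Under that assumption, a policy identical to the reference has zero margin on every pair and a loss of ln 2 ≈ 0.6931, so the reported 0.6634 sits modestly below the untrained baseline.

```python
import math


def dpo_loss(margin: float) -> float:
    """Sigmoid DPO loss for one pair, given its reward margin.

    margin = beta * [(logp_policy(chosen) - logp_ref(chosen))
                     - (logp_policy(rejected) - logp_ref(rejected))]
    loss   = -log(sigmoid(margin)) = log(1 + exp(-margin))
    """
    return math.log1p(math.exp(-margin))


baseline = dpo_loss(0.0)  # policy == reference, so every margin is zero
print(round(baseline, 4))
```

The mean loss over a batch is not the loss at the mean margin, so this function is not expected to reproduce eval_loss exactly from eval_rewards/margins.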

vs-Stockfish (skill 5, depth 12, alternating colors, `--sf-analyze`)

metric	v2b (10,000 games)	v2c (100 games)	Δ
score_rate	0.0897	0.085	-0.47pp (within n=100 noise)
legal_rate	0.9767	0.9755	flat
tactical_oversight_rate	0.1535	0.1437	-0.98pp
blunder_rate (overall)	0.0653	0.0616	-0.37pp
blunder_rate (consequential)	0.0505	0.0483	-0.22pp
blunder_rate (lost positions)	0.0849	0.0795	-0.54pp
blunder_rate (endgame)	0.0866	0.0811	-0.55pp
CPL p95 (overall)	263	253	-10cp

The v2b numbers come from 10,000 games, so their confidence intervals are much tighter than those from v2c's 100-game eval. The score_rate difference is within the n=100 noise band; the per-move quality improvements on tactical_oversight, blunder rate, and the CPL tail are above noise and consistent with the DPO objective.
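
The size of the score_rate noise band can be estimated with a rough standard-error calculation. This sketch assumes independent games and uses the binomial variance p(1 - p), which is an upper bound for a per-game score in [0, 1] when draws count as 0.5; it is not taken from the eval harness.

```python
import math


def score_rate_se(p: float, n: int) -> float:
    """Approximate standard error of a mean score p over n independent games."""
    return math.sqrt(p * (1 - p) / n)


se_v2b = score_rate_se(0.0897, 10_000)
se_v2c = score_rate_se(0.085, 100)
half_width_v2c = 1.96 * se_v2c  # approximate 95% half-width for v2c
delta = 0.085 - 0.0897          # observed change in score_rate

print(f"v2b SE={se_v2b:.4f}  v2c SE={se_v2c:.4f}  "
      f"v2c 95% half-width={half_width_v2c:.4f}  delta={delta:+.4f}")
```

With these inputs the v2c half-width is several percentage points, roughly ten times the observed change, which is why the score_rate row is reported as within noise.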

License

MIT, matching the v2b base.

