halluci-mate v2c

DPO fine-tune of jspaulsen/halluci-mate-v2b on Stockfish-vs-model preference pairs. Same Qwen3-0.6B architecture, same custom ~1,800-token UCI tokenizer; the policy was nudged to prefer the moves Stockfish endorses over the moves v2b actually played in not-yet-lost positions.

Source: https://github.com/jspaulsen/halluci-mate

Training data

11,491 preference pairs derived from a 10,000-game v2b-vs-Stockfish run (skill 5, depth 12) using the halluci-mate eval harness:

scripts/eval.py export-dpo <run> --flavor quality \
  --require-consequential --exclude-repetition
  • Quality threshold: centipawn loss > 200.
  • --require-consequential drops moves played from positions already evaluated as lost (eval-before < -800 cp). Blunders in already-lost endgames teach the model to chase swindle lines that don't generalize.
  • --exclude-repetition drops moves that recur within the same game. Stockfish flags forced-repetition draws as blunders even when repetition is the only drawing line.
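
The three filters above can be sketched as a single pass over per-move records. This is a minimal illustration, not the harness's actual implementation: the `MoveRecord` fields, the `select_pairs` helper, and the choice to key repetition on (position, move) and drop only the later occurrences are all assumptions made for the example. Only the thresholds (200 cp, -800 cp) come from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the description above; everything else is hypothetical.
CPL_THRESHOLD = 200   # keep moves with centipawn loss > 200
LOST_EVAL_CP = -800   # eval-before below this counts as already lost


@dataclass(frozen=True)
class MoveRecord:
    game_id: str
    position: str        # position key before the move (assumed to omit move counters)
    played: str          # UCI move the model played (becomes "rejected")
    best: str            # UCI move Stockfish preferred (becomes "chosen")
    eval_before_cp: int  # evaluation from the mover's point of view
    cp_loss: int


def select_pairs(records):
    """Yield (prompt, chosen, rejected) tuples that pass all three filters."""
    seen = {}  # game_id -> set of (position, played) already encountered
    for r in records:
        game_seen = seen.setdefault(r.game_id, set())
        key = (r.position, r.played)
        repeated = key in game_seen
        game_seen.add(key)
        if r.cp_loss <= CPL_THRESHOLD:
            continue  # quality threshold: not a large enough mistake
        if r.eval_before_cp < LOST_EVAL_CP:
            continue  # --require-consequential: position was already lost
        if repeated:
            continue  # --exclude-repetition: move recurs within the same game
        yield (r.position, r.best, r.played)
```

Tracking repetition per game rather than globally means the same mistake in two different games still produces two pairs.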

Training recipe

scripts/train_dpo.py via TRL's DPOTrainer:

Method                                  DPO (TRL)
Base / reference model                  jspaulsen/halluci-mate-v2b
Pairs                                   11,491 (98% train / 2% eval)
Learning rate                           1e-5
LR schedule                             cosine, 5% warmup
Beta                                    0.1
Epochs                                  2 (trained), checkpoint-200 selected (≈1.14 ep)
Per-device batch × grad accum × GPUs    16 × 2 × 2 = effective 64
Optimizer                               paged AdamW 8-bit
Precision                               bf16 + flash-attention-2
Total steps                             352 (full run); shipped checkpoint at step 200
Hardware                                RTX PRO 4500 Blackwell + RTX 3090 (DDP)
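
The step counts in the recipe follow from the pair count and batch geometry. The arithmetic below is a quick check, assuming the 98% split is rounded to the nearest pair and the last partial batch of each epoch still counts as a step; both are assumptions about the training script, not confirmed details.

```python
import math

pairs = 11_491
train_fraction = 0.98
per_device_batch, grad_accum, gpus = 16, 2, 2
epochs = 2

effective_batch = per_device_batch * grad_accum * gpus
train_pairs = round(pairs * train_fraction)       # assumed rounding of the split
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = epochs * steps_per_epoch
epochs_at_200 = 200 / steps_per_epoch             # how far into training step 200 is

print(effective_batch, steps_per_epoch, total_steps, round(epochs_at_200, 2))
# prints: 64 176 352 1.14
```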

The 2-epoch run plateaus on eval_rewards/accuracies around step 200, while margins keep growing through step 352. In vs-Stockfish play, the later checkpoints regress on blunder rate (especially in endgames and lost positions) despite a small further gain on tactical_oversight. Step 200 is the behavioral sweet spot, and it is the checkpoint published here.

Held-out eval at the shipped checkpoint (step 200)

metric                     value
eval_loss                  0.6634
eval_rewards/accuracies    0.773
eval_rewards/margins       0.0593
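
For readers unfamiliar with these metric names, the sketch below shows how the three quantities are conventionally defined for the sigmoid DPO loss. It assumes the standard formulation (implicit reward = beta times the policy-minus-reference log-probability) and is not taken from scripts/train_dpo.py; the `dpo_metrics` helper and its input layout are hypothetical.

```python
import math


def dpo_metrics(batch, beta=0.1):
    """Compute DPO loss, reward accuracy, and reward margin.

    batch: list of (policy_chosen, policy_rejected, ref_chosen, ref_rejected)
    tuples of summed sequence log-probabilities.
    """
    losses, margins = [], []
    for pol_c, pol_r, ref_c, ref_r in batch:
        reward_chosen = beta * (pol_c - ref_c)
        reward_rejected = beta * (pol_r - ref_r)
        margin = reward_chosen - reward_rejected
        margins.append(margin)
        # -log(sigmoid(margin)), written as log(1 + exp(-margin))
        losses.append(math.log1p(math.exp(-margin)))
    n = len(batch)
    return {
        "loss": sum(losses) / n,
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
        "rewards/margins": sum(margins) / n,
    }
```

Under this definition, a policy identical to its reference has zero margin on every pair and a loss of ln 2 ≈ 0.693, which is the baseline against which the eval_loss above can be read.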

vs-Stockfish (skill 5, depth 12, alternating colors, --sf-analyze)

metric                          v2b (10,000 games)    v2c (100 games)    Δ
score_rate                      0.0897                0.085              -0.5pp (within n=100 noise)
legal_rate                      0.9767                0.9755             flat
tactical_oversight_rate         0.1535                0.1437             -0.98pp
blunder_rate (overall)          0.0653                0.0616             -0.37pp
blunder_rate (consequential)    0.0505                0.0483             -0.22pp
blunder_rate (lost positions)   0.0849                0.0795             -0.54pp
blunder_rate (endgame)          0.0866                0.0811             -0.55pp
CPL p95 (overall)               263                   253                -10cp

The v2b numbers come from 10,000 games, so its CIs are much tighter than v2c's 100-game eval. score_rate is within the n=100 noise band; per-move quality improvements on tactical_oversight, blunder rate, and CPL tail are above noise and consistent with the DPO objective.
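
To make the "n=100 noise band" concrete for score_rate, here is a rough estimate using a normal approximation to a proportion. Treating score_rate as a simple per-game proportion is an assumption (draws scored as half-points would change the variance somewhat), and the `approx_ci_halfwidth` helper is introduced only for this illustration.

```python
import math


def approx_ci_halfwidth(rate, n, z=1.96):
    """Normal-approximation 95% half-width for a proportion over n games."""
    return z * math.sqrt(rate * (1.0 - rate) / n)


v2b = approx_ci_halfwidth(0.0897, 10_000)
v2c = approx_ci_halfwidth(0.085, 100)
delta = 0.085 - 0.0897
# Half-width for the difference of two independent estimates.
delta_halfwidth = math.hypot(v2b, v2c)

print(f"v2b ±{v2b:.4f}, v2c ±{v2c:.4f}, delta {delta:+.4f} ±{delta_halfwidth:.4f}")
```

Under these assumptions the 100-game half-width is roughly 5 percentage points, about ten times the size of the observed score_rate difference, which is why that row is reported as within noise.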

License

MIT, matching the v2b base.

Model size: 0.4B params, BF16 (Safetensors)