halluci-mate v2d

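DPO fine-tune of jspaulsen/halluci-mate-v2b on a broadened Stockfish-vs-model preference set. It keeps the same Qwen3-0.6B architecture and the same custom ~1,800-token UCI tokenizer. v2d revisits the export filters used for v2c and improves on v2c on most vs-Stockfish metrics; the exceptions are score_rate, where the change is within noise, and the consequential blunder rate, which rises slightly.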

Source: https://github.com/jspaulsen/halluci-mate

What changed vs v2c

The training recipe and base model are identical to v2c. Only the preference dataset changed, through four settings of the export-dpo filter:

Setting                   v2c       v2d
--flavor                  quality   both (adds legality pairs)
--threshold               200 cp    300 cp (sharper blunder definition)
--require-consequential   on        off (keeps blunders from already-lost positions)
--exclude-repetition      on        off
Pairs                     11,491    25,717

The hypothesis going in had two parts. First, dropping --require-consequential keeps blunders made in already-lost positions in the training set, which should teach the model to avoid further blunders when it is already losing. Second, raising the threshold to 300 cp should reduce label noise by counting only genuinely sharp blunders. The vs-Stockfish results below are consistent with that hypothesis, especially in the endgame and lost-position breakdowns. The re-inclusion of legality pairs is an incidental change in the same direction that we did not isolate separately.
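As a rough illustration of how the threshold and the consequential filter interact, here is a minimal Python sketch. It is not the export-dpo implementation: the MoveRecord fields, the keep_quality_pair function, and the lost_cutoff_cp value used to define an "already-lost" position are all hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class MoveRecord:
    # Hypothetical fields; the real harness schema may differ.
    eval_before_cp: int  # engine eval before the move, from the model's side
    cp_loss: int         # centipawns lost versus the engine's best move


def keep_quality_pair(move: MoveRecord, threshold_cp: int,
                      require_consequential: bool,
                      lost_cutoff_cp: int = -500) -> bool:
    """Decide whether a played move becomes the rejected side of a pair."""
    if move.cp_loss < threshold_cp:
        return False  # not a large enough mistake to count as a blunder
    if require_consequential and move.eval_before_cp <= lost_cutoff_cp:
        return False  # position was already lost (assumed cutoff)
    return True


moves = [
    MoveRecord(eval_before_cp=30, cp_loss=250),    # medium mistake, level position
    MoveRecord(eval_before_cp=-700, cp_loss=400),  # large mistake, already lost
    MoveRecord(eval_before_cp=10, cp_loss=350),    # large mistake, level position
]

# v2c-like settings: threshold 200 cp, consequential blunders only.
v2c_like = [keep_quality_pair(m, 200, True) for m in moves]
# v2d-like settings: threshold 300 cp, lost-position blunders kept.
v2d_like = [keep_quality_pair(m, 300, False) for m in moves]

print(v2c_like)  # [True, False, True]
print(v2d_like)  # [False, True, True]
```

Under these assumed definitions, the v2d-like settings drop the medium mistake but keep the blunder from the lost position, which is the trade described above.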

Training data

25,717 preference pairs derived from the same 10,000-game v2b-vs-Stockfish run used for v2c (skill 5, depth 12), exported via the halluci-mate eval harness:

scripts/eval.py export-dpo <run> --flavor both --threshold 300

Training recipe

scripts/train_dpo.py via TRL's DPOTrainer, with the same hyperparameters as v2c:

Method                                  DPO (TRL)
Base / reference model                  jspaulsen/halluci-mate-v2b
Pairs                                   25,717 (98% train / 2% eval)
Learning rate                           1e-5
LR schedule                             cosine, 5% warmup
Beta                                    0.1
Epochs                                  2 (trained), checkpoint-500 selected (≈1.27 epochs)
Per-device batch × grad accum × GPUs    16 × 2 × 2 = effective 64
Optimizer                               paged AdamW 8-bit
Precision                               bf16 + flash-attention-2
Total steps                             788 (full run); shipped checkpoint at step 500
Hardware                                RTX PRO 4500 Blackwell + RTX 3090 (DDP)
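The step counts in the table follow from the batch settings. The short Python check below reproduces them. The rounding of the 98% split and the assumption that the last partial batch of each epoch is kept are mine, not taken from the training script.

```python
import math

total_pairs = 25_717
train_fraction = 0.98
per_device_batch, grad_accum, num_gpus = 16, 2, 2
epochs = 2

effective_batch = per_device_batch * grad_accum * num_gpus
# Assumed: the train split size is the rounded 98% share.
train_pairs = round(total_pairs * train_fraction)
# Assumed: a final partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * epochs
epochs_at_step_500 = 500 / steps_per_epoch

print(effective_batch, train_pairs, steps_per_epoch, total_steps)  # 64 25203 394 788
print(round(epochs_at_step_500, 2))  # 1.27
```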

Step 500 was selected post hoc as the first checkpoint to reach the maximum eval_rewards/accuracies on the held-out split. As with v2c, eval_loss keeps falling and eval_rewards/margins keeps growing through later steps while accuracy plateaus; empirically, the later checkpoints behave worse in vs-Stockfish play.
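The selection rule (earliest checkpoint that reaches the best held-out accuracy) can be sketched in a few lines of Python. The function name and the toy history are hypothetical; the numbers are made up for illustration and are not the run's actual logs.

```python
def first_best_checkpoint(history):
    """Return the earliest step whose accuracy equals the maximum accuracy."""
    best_acc = max(acc for _, acc in history)
    for step, acc in sorted(history):
        if acc == best_acc:
            return step


# Made-up (step, eval accuracy) pairs, for illustration only.
toy_history = [(100, 0.70), (200, 0.80), (300, 0.85), (400, 0.85), (500, 0.84)]

print(first_best_checkpoint(toy_history))  # 300
```

Preferring the earliest tie limits how far the policy drifts from the reference once accuracy has stopped improving, which matches the observation that later checkpoints play worse.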

Held-out eval at the shipped checkpoint (step 500)

Note: v2d's eval split contains both legality and quality pairs at threshold 300, so these numbers are not directly comparable to v2c's held-out numbers.

metric                     value
eval_loss                  0.6431
eval_rewards/accuracies    0.8438
eval_rewards/margins       0.1086
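As a sanity check on how these numbers relate, the Python sketch below assumes TRL's default sigmoid DPO loss without label smoothing. Under that assumption the per-example loss is -log(sigmoid(margin)), with the margin already scaled by beta. Because this function is convex, the mean loss should sit at or above the loss evaluated at the mean margin, and below ln 2, the loss at zero margin.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """-log(sigmoid(x)), written as log(1 + exp(-x))."""
    return math.log1p(math.exp(-x))


eval_margin = 0.1086  # reported eval_rewards/margins
eval_loss = 0.6431    # reported eval_loss

loss_at_mean_margin = neg_log_sigmoid(eval_margin)
loss_at_zero_margin = neg_log_sigmoid(0.0)  # equals ln 2

print(round(loss_at_mean_margin, 4))  # 0.6403
print(round(loss_at_zero_margin, 4))  # 0.6931
print(loss_at_mean_margin <= eval_loss < loss_at_zero_margin)  # True
```

The reported eval_loss falls between the two bounds, so the three metrics are mutually consistent under this assumed loss.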

vs-Stockfish (skill 5, depth 12, alternating colors, --sf-analyze, 100 games)

metric                          v2c       v2d       Δ
score_rate                      0.085     0.090     +0.5pp (within n=100 noise)
legal_rate                      0.9755    0.9826    +0.71pp
tactical_oversight_rate         0.1437    0.1363    -0.74pp
blunder_rate (overall)          0.0616    0.0570    -0.46pp
blunder_rate (consequential)    0.0483    0.0504    +0.21pp
blunder_rate (lost positions)   0.0795    0.0658    -1.37pp
blunder_rate (endgame)          0.0811    0.0714    -0.97pp
CPL p95 (overall)               253       233       -20cp

score_rate is within the n=100 noise band. The per-move quality improvements in tactical_oversight_rate, blunder rate (especially in endgames and lost positions), and the CPL tail are above noise and consistent with the broadened, sharper preference set. The one per-move regression is the consequential blunder rate, which rises by 0.21pp.
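To make the n=100 noise band on score_rate concrete, the Python sketch below computes a rough standard error. It treats each game's score as a bounded variable in [0, 1], so p(1 - p)/n is an upper bound on the variance of the mean, and it assumes the two 100-game runs are independent. The function name is hypothetical.

```python
import math


def score_rate_se_upper(p: float, n: int) -> float:
    """Upper bound on the standard error of a mean score in [0, 1]."""
    return math.sqrt(p * (1.0 - p) / n)


n_games = 100
v2c_score, v2d_score = 0.085, 0.090

se_v2c = score_rate_se_upper(v2c_score, n_games)
se_v2d = score_rate_se_upper(v2d_score, n_games)
# Assumed: the v2c and v2d runs are independent samples.
se_diff = math.sqrt(se_v2c ** 2 + se_v2d ** 2)
delta = v2d_score - v2c_score

print(f"delta = {delta:.3f}, SE bound on the difference = {se_diff:.3f}")
print(abs(delta) < se_diff)  # True
```

The 0.5pp difference is several times smaller than this bound of roughly 4pp, so on this rough estimate it cannot be distinguished from zero at 100 games.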

License

MIT, matching the v2b base.
