halluci-mate v2d

DPO fine-tune of jspaulsen/halluci-mate-v2b on a broadened Stockfish-vs-model preference set. Same Qwen3-0.6B architecture, same custom ~1,800-token UCI tokenizer. v2d revisits the export filters used for v2c and improves on v2c on most vs-Stockfish metrics; the exceptions are score_rate, which is within noise, and the consequential blunder rate, which is marginally worse.

Source: https://github.com/jspaulsen/halluci-mate

What changed vs v2c

The training recipe and base model are identical to v2c. Only the preference dataset changed, through these settings of the export-dpo filter:

Setting	v2c	v2d
`--flavor`	`quality`	`both` (adds legality pairs)
`--threshold`	200 cp	300 cp (sharper blunder definition)
`--require-consequential`	on	off (keeps blunders from already-lost positions)
`--exclude-repetition`	on	off
Pairs	11,491	25,717

The hypothesis going in: dropping --require-consequential should help the model learn to keep losing positions losing rather than swindle, and the threshold-300 cut should reduce label noise by only counting genuinely sharp blunders. The vs-Stockfish results below are consistent with that hypothesis, especially in endgame and lost-position phases, although a single 100-game run cannot confirm it. The legality-pair re-inclusion is an incidental change in the same direction that we did not isolate separately.
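To make the effect of the two quality filters concrete, here is a minimal sketch of how a threshold and a "consequential" requirement could interact. It is not the export-dpo implementation: the `MoveRecord` fields, the `is_blunder_pair` helper, the `lost_cp` cutoff of -500 and the use of `>=` for the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MoveRecord:
    cp_loss: int         # centipawns lost versus the engine's best move
    eval_before_cp: int  # engine eval before the move, from the mover's side


def is_blunder_pair(move, threshold_cp, require_consequential, lost_cp=-500):
    """Hypothetical filter: keep a move as a quality pair if it is a blunder."""
    if move.cp_loss < threshold_cp:
        return False  # not sharp enough to count as a blunder
    if require_consequential and move.eval_before_cp <= lost_cp:
        return False  # position was already lost (assumed definition)
    return True


# Invented example moves, not taken from the real run.
moves = [
    MoveRecord(cp_loss=250, eval_before_cp=30),    # mild blunder, level game
    MoveRecord(cp_loss=350, eval_before_cp=-900),  # sharp blunder, already lost
    MoveRecord(cp_loss=400, eval_before_cp=10),    # sharp blunder, level game
]

v2c_kept = [m for m in moves if is_blunder_pair(m, 200, True)]
v2d_kept = [m for m in moves if is_blunder_pair(m, 300, False)]
```

Under these assumptions the v2c settings keep the first and third moves, while the v2d settings drop the mild blunder and keep the one from the already-lost position.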

Training data

25,717 preference pairs derived from the same 10,000-game v2b-vs-Stockfish run used for v2c (skill 5, depth 12), exported via the halluci-mate eval harness:

scripts/eval.py export-dpo <run> --flavor both --threshold 300

Training recipe

scripts/train_dpo.py via TRL's DPOTrainer — same hyperparameters as v2c:


Method	DPO (TRL)
Base / reference model	`jspaulsen/halluci-mate-v2b`
Pairs	25,717 (98% train / 2% eval)
Learning rate	1e-5
LR schedule	cosine, 5% warmup
Beta	0.1
Epochs	2 (trained), checkpoint-500 selected (≈1.27 ep)
Per-device batch × grad accum × GPUs	16 × 2 × 2 = effective 64
Optimizer	paged AdamW 8-bit
Precision	bf16 + flash-attention-2
Total steps	788 (full run); shipped checkpoint at step 500
Hardware	RTX PRO 4500 Blackwell + RTX 3090 (DDP)
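
The step counts in the table follow from the pair count and batch settings. The short check below reproduces them; the exact rounding of the 98% / 2% split is an assumption, but either rounding gives the same step count.

```python
import math

pairs = 25_717
train_fraction = 0.98
per_device_batch, grad_accum, gpus = 16, 2, 2
epochs = 2

effective_batch = per_device_batch * grad_accum * gpus      # 64
train_pairs = round(pairs * train_fraction)                 # split rounding is assumed
steps_per_epoch = math.ceil(train_pairs / effective_batch)  # optimizer steps per epoch
total_steps = steps_per_epoch * epochs
epochs_at_500 = 500 / steps_per_epoch

print(effective_batch, steps_per_epoch, total_steps, round(epochs_at_500, 2))
```

This gives 394 steps per epoch, 788 total steps, and about 1.27 epochs at step 500, matching the table.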

Step 500 was selected post-hoc as the first checkpoint to hit the maximum eval_rewards/accuracies on the held-out split. As with v2c, eval_loss keeps falling and eval_rewards/margins keeps growing through later steps while accuracy plateaus — empirically the later checkpoints are behaviorally worse on vs-Stockfish.
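
The selection rule described above (earliest checkpoint that reaches the best held-out accuracy) can be sketched as follows. The `eval_log` values are invented toy numbers, not this run's logs, and the variable names are hypothetical.

```python
# Toy eval log of (step, eval_rewards/accuracies); values are invented.
eval_log = [(10, 0.70), (20, 0.80), (30, 0.85), (40, 0.85), (50, 0.85)]

# Best accuracy seen anywhere in the run.
best_accuracy = max(acc for _, acc in eval_log)

# Earliest step that attains it; later ties are ignored even if their
# loss or reward margin looks better.
selected_step = min(step for step, acc in eval_log if acc == best_accuracy)

print(selected_step)
```

Picking the earliest tie is a deliberate choice here: it favors the checkpoint with the least additional drift from the reference model once accuracy has stopped improving.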

Held-out eval at the shipped checkpoint (step 500)

Note: v2d's eval split contains both legality and quality pairs at threshold 300, so these numbers are not directly comparable to v2c's held-out numbers.

metric	value
eval_loss	0.6431
eval_rewards/accuracies	0.8438
eval_rewards/margins	0.1086
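
As a sanity check on how these numbers relate, the sketch below assumes TRL's default sigmoid DPO loss, where the per-pair loss is -log(sigmoid(margin)) and the reported reward margin already includes beta. Under that assumption, a margin of zero gives a loss of ln 2, and because the loss is convex in the margin, the mean loss cannot be lower than the loss evaluated at the mean margin.

```python
import math


def dpo_sigmoid_loss(margin):
    """Per-pair loss -log(sigmoid(margin)), written as log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))


loss_at_zero = dpo_sigmoid_loss(0.0)         # untrained policy: ln 2
loss_at_mean_margin = dpo_sigmoid_loss(0.1086)
reported_eval_loss = 0.6431

print(round(loss_at_zero, 4), round(loss_at_mean_margin, 4))
```

The loss at the mean margin comes out near 0.640, slightly below the reported 0.6431, which is what the convexity argument predicts.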

vs-Stockfish (skill 5, depth 12, alternating colors, `--sf-analyze`, 100 games)

metric	v2c	v2d	Δ
score_rate	0.085	0.090	+0.5pp (within n=100 noise)
legal_rate	0.9755	0.9826	+0.71pp
tactical_oversight_rate	0.1437	0.1363	-0.74pp
blunder_rate (overall)	0.0616	0.0570	-0.46pp
blunder_rate (consequential)	0.0483	0.0504	+0.21pp
blunder_rate (lost positions)	0.0795	0.0658	-1.37pp
blunder_rate (endgame)	0.0811	0.0714	-0.97pp
CPL p95 (overall)	253	233	-20cp

score_rate is within the n=100 noise band; per-move quality improvements on tactical_oversight, blunder rate (especially endgame and lost positions), and CPL tail are above noise and consistent with the broadened, sharper preference set.
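
A rough estimate of the n=100 noise band for score_rate is sketched below. It treats each game as an independent trial and uses the binomial variance p(1-p), which is an upper bound for a per-game score in [0, 1] (draws only reduce the variance). It also assumes the v2c and v2d runs are independent. The helper name `binomial_se` is ours.

```python
import math


def binomial_se(p, n):
    """Standard error of a mean of n independent scores, binomial bound."""
    return math.sqrt(p * (1 - p) / n)


n = 100
se_v2c = binomial_se(0.085, n)
se_v2d = binomial_se(0.090, n)

# Standard error of the difference between two independent runs.
se_diff = math.sqrt(se_v2c**2 + se_v2d**2)
z = (0.090 - 0.085) / se_diff

print(round(se_v2c, 4), round(se_diff, 4), round(z, 2))
```

Each score_rate carries a standard error of roughly 2.8pp, so a 0.5pp difference is well under one standard error of the difference.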

License

MIT, matching the v2b base.

Model details

Model size: 0.4B params (Safetensors)
Tensor type: BF16
Model tree: jspaulsen/halluci-mate-v2a (base model) → jspaulsen/halluci-mate-v2b (fine-tuned) → jspaulsen/halluci-mate-v2d