halluci-mate v2d
DPO fine-tune of jspaulsen/halluci-mate-v2b on a broadened Stockfish-vs-model preference set. Same Qwen3-0.6B architecture, same custom ~1,800-token UCI tokenizer. v2d revisits the export filters used for v2c and improves on it across nearly every vs-Stockfish axis.
Source: https://github.com/jspaulsen/halluci-mate
What changed vs v2c
The training recipe and base model are identical to v2c. Only the preference
dataset changed, through four settings of the export-dpo filter:
| | v2c | v2d |
|---|---|---|
| --flavor | quality | both (adds legality pairs) |
| --threshold | 200 cp | 300 cp (sharper blunder definition) |
| --require-consequential | on | off (keeps blunders from already-lost positions) |
| --exclude-repetition | on | off |
| Pairs | 11,491 | 25,717 |
The hypothesis going in: dropping --require-consequential should help the
model learn to keep losing positions losing rather than swindle, and the
threshold-300 cut should reduce label noise by only counting genuinely sharp
blunders. The vs-Stockfish results below validate that hypothesis, especially
in endgame and lost-position phases. The legality-pair re-inclusion is an
incidental change in the same direction that we did not isolate separately.
Training data
25,717 preference pairs derived from the same 10,000-game v2b-vs-Stockfish run used for v2c (skill 5, depth 12), exported via the halluci-mate eval harness:
scripts/eval.py export-dpo <run> --flavor both --threshold 300
Training recipe
scripts/train_dpo.py via TRL's DPOTrainer, with the same hyperparameters as v2c:
| Setting | Value |
|---|---|
| Method | DPO (TRL) |
| Base / reference model | jspaulsen/halluci-mate-v2b |
| Pairs | 25,717 (98% train / 2% eval) |
| Learning rate | 1e-5 |
| LR schedule | cosine, 5% warmup |
| Beta | 0.1 |
| Epochs | 2 (trained), checkpoint-500 selected (≈1.27 ep) |
| Per-device batch × grad accum × GPUs | 16 × 2 × 2 = effective 64 |
| Optimizer | paged AdamW 8-bit |
| Precision | bf16 + flash-attention-2 |
| Total steps | 788 (full run); shipped checkpoint at step 500 |
| Hardware | RTX PRO 4500 Blackwell + RTX 3090 (DDP) |
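
The step and epoch figures in the table can be cross-checked with a little arithmetic. The sketch below assumes the train split is the rounded 98% share of the pairs and that a partial final batch counts as one optimizer step; neither detail is stated here, so treat it as a sanity check rather than a description of the training script.

```python
import math

# Figures taken from the recipe table.
total_pairs = 25_717
train_fraction = 0.98
per_device_batch = 16
grad_accum = 2
num_gpus = 2
epochs = 2
shipped_step = 500

# Assumption: the train split is the rounded 98% share of all pairs.
train_pairs = round(total_pairs * train_fraction)

# Pairs consumed by one optimizer step across all devices.
effective_batch = per_device_batch * grad_accum * num_gpus

# Assumption: a partial final batch still counts as one step.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * epochs

# How far into training the shipped checkpoint is, measured in epochs.
epochs_at_shipped = shipped_step / steps_per_epoch

print(effective_batch, steps_per_epoch, total_steps, round(epochs_at_shipped, 2))
# prints: 64 394 788 1.27
```

Under these assumptions the numbers agree with the table: an effective batch of 64, 788 total steps, and step 500 landing at roughly 1.27 epochs.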
Step 500 was selected post-hoc as the first checkpoint to hit the maximum
eval_rewards/accuracies on the held-out split. As with v2c, eval_loss
keeps falling and eval_rewards/margins keeps growing through later steps
while accuracy plateaus; empirically, the later checkpoints are behaviorally
worse on vs-Stockfish.
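
The selection rule described above (the first checkpoint to reach the maximum held-out accuracy) can be sketched as a small helper. The function name and the `(step, accuracy)` log format are hypothetical, and the values in the usage lines are made up for illustration; they are not the real training log.

```python
def first_max_accuracy_step(eval_log):
    """Return the earliest step whose accuracy equals the best accuracy seen.

    eval_log: iterable of (step, accuracy) tuples, in any order.
    """
    entries = sorted(eval_log)  # sort by step so ties resolve to the earliest
    if not entries:
        raise ValueError("eval_log is empty")
    best = max(acc for _, acc in entries)
    for step, acc in entries:
        if acc == best:
            return step


# Made-up values for illustration only (not the real eval log).
toy_log = [(100, 0.70), (200, 0.80), (300, 0.80), (400, 0.75)]
selected = first_max_accuracy_step(toy_log)
print(selected)  # prints: 200
```

Picking the earliest tied checkpoint matches the observation that later checkpoints keep widening the reward margin without improving accuracy.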
Held-out eval at the shipped checkpoint (step 500)
Note: v2d's eval split contains both legality and quality pairs at threshold 300, so these numbers are not directly comparable to v2c's held-out numbers.
| metric | value |
|---|---|
| eval_loss | 0.6431 |
| eval_rewards/accuracies | 0.8438 |
| eval_rewards/margins | 0.1086 |
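
For readers unfamiliar with these metrics, the sketch below shows how loss, reward margin, and accuracy relate for a single preference pair under the standard sigmoid DPO objective. It assumes the default sigmoid loss was used, which this card does not state; the function name and argument names are illustrative, not TRL's API.

```python
import math


def dpo_pair_metrics(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss, reward margin, and correctness (sigmoid loss)."""
    # Implicit rewards: beta-scaled log-prob shift relative to the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)), written in a numerically stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

    # A pair counts as "accurate" when the chosen move gets the higher reward.
    correct = margin > 0
    return loss, margin, correct


# When the policy equals the reference, the margin is 0 and the loss is ln 2.
loss0, margin0, correct0 = dpo_pair_metrics(-5.0, -7.0, -5.0, -7.0)
print(round(loss0, 4), margin0, correct0)  # prints: 0.6931 0.0 False
```

The reported metrics are averages of these per-pair quantities over the eval split, so an eval_loss below ln 2 ≈ 0.6931 indicates the policy has moved away from the reference in the preferred direction on average.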
vs-Stockfish (skill 5, depth 12, alternating colors, --sf-analyze, 100 games)
| metric | v2c | v2d | Δ |
|---|---|---|---|
| score_rate | 0.085 | 0.090 | +0.5pp (within n=100 noise) |
| legal_rate | 0.9755 | 0.9826 | +0.71pp |
| tactical_oversight_rate | 0.1437 | 0.1363 | -0.74pp |
| blunder_rate (overall) | 0.0616 | 0.0570 | -0.46pp |
| blunder_rate (consequential) | 0.0483 | 0.0504 | +0.21pp |
| blunder_rate (lost positions) | 0.0795 | 0.0658 | -1.37pp |
| blunder_rate (endgame) | 0.0811 | 0.0714 | -0.97pp |
| CPL p95 (overall) | 253 | 233 | -20cp |
score_rate is within the n=100 noise band; per-move quality improvements on tactical_oversight, blunder rate (especially endgame and lost positions), and CPL tail are above noise and consistent with the broadened, sharper preference set.
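
To make the "within n=100 noise" statement concrete, here is a rough standard-error estimate for the score_rate difference. It assumes score_rate is the mean per-game score in [0, 1] and that games are independent; the binomial formula used is an upper bound on the variance when draws count as half points. This covers score_rate only, since the per-move metrics are computed over many more samples than 100.

```python
import math


def score_rate_standard_error(score_rate, n_games):
    """Binomial-style standard error of a mean per-game score in [0, 1]."""
    return math.sqrt(score_rate * (1.0 - score_rate) / n_games)


n_games = 100
v2c_score, v2d_score = 0.085, 0.090

se_v2c = score_rate_standard_error(v2c_score, n_games)
se_v2d = score_rate_standard_error(v2d_score, n_games)

# Standard error of the difference between two independent estimates.
se_diff = math.sqrt(se_v2c ** 2 + se_v2d ** 2)
delta = v2d_score - v2c_score

print(f"delta={delta:.3f}, se_diff={se_diff:.3f}")
```

With these inputs the standard error of the difference is about 0.04, roughly eight times the observed +0.005 change, which is why score_rate is treated as unchanged.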
License
MIT, matching the v2b base.