v0.5 Chat-Tune Postmortem (2026-05-03)

The canonical v0.5 chat is chat-v3 (MCQ-tuned) at 36.9% on CTIBench MCQ. This document records the recovery attempts that followed and what they actually changed about our understanding.

Result table

| Run | Recipe | Steps | LR | Val loss | CTIBench MCQ |
|---|---|---|---|---|---|
| chat-v2 | Cybersec Q&A only, no MCQ | 1500 | 5e-5 | – | 19.0% |
| chat-v3 (canonical) | Raw letter-only MCQ × 5 | 1500 | 5e-5 | – | 36.9% |
| chat-v4 (RAFT) | RAG-augmented chat-v3 mix | 1500 | 5e-5 | – | 25.0% |
| chat (v0.5 base re-tune) | chat-v3 recipe on v0.5 base | 1500 | 5e-5 | – | 32.5% |
| chat-long | chat-v3 mix, 4000 steps | 4000 | 5e-5 | – | 17.1% |
| chat-recovered | CoT MCQ × 1 + small-talk × 30 | 1500 | 3e-5 | 2.808 | 30.8% |
| chat-v4 (failed) | Hybrid + lr 2e-4 | 300 | 2e-4 | diverged | killed |
| chat-v5 (this run) | Hybrid: raw × 5 + CoT × 2 + small-talk × 8 | 2000 | 5e-5 | 2.990 | 34.8% |

What we learned

What chat-v3 actually does

The canonical 36.9% is a pattern-match shortcut, not reasoning. With raw letter-only MCQ at a × 5 multiplier, the model learns "after the prompt ends in 'Answer:', emit a single letter consistent with the surface features of the options." This is a known class of MCQ artifact (Answer Matching > MCQ, arXiv 2507.02856): sub-100M models can hit reasonable MCQ scores by exploiting the choice distribution without understanding the question.
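
For concreteness, a minimal sketch of what a raw letter-only MCQ record looks like under this framing; the prompt template, field names, and example question are illustrative, not the actual training pipeline.

```python
# Illustrative letter-only MCQ SFT record; the template and field names are
# assumptions, not the actual GhostLM data pipeline.
def format_letter_only(question: str, options: dict[str, str], answer: str) -> dict:
    """Build one SFT example whose target is just the answer letter."""
    choices = "\n".join(f"{letter}. {text}" for letter, text in sorted(options.items()))
    prompt = f"{question}\n{choices}\nAnswer:"
    # Only a single letter follows "Answer:", which is exactly the surface
    # shortcut the model learns to exploit at a x5 multiplier.
    return {"prompt": prompt, "completion": f" {answer}"}

example = format_letter_only(
    "Which MITRE ATT&CK tactic covers credential dumping?",
    {"A": "Persistence", "B": "Credential Access", "C": "Exfiltration", "D": "Impact"},
    "B",
)
```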

Why CoT-MCQ alone made it worse

chat-recovered (30.8%) replaced the letter-only MCQ × 5 with CoT MCQ × 1. The CoT records have the format "B. <1-2 sentence justification>", with the reasoning generated by Qwen-14B. The hypothesis, drawn from Phi-3.5-mini and OpenMath-Mini, was that reasoning supervision should outperform pattern-match supervision even at low multipliers.

It didn't: at 36M params, the model can't compress 1-2 sentences of cybersec reasoning into useful weight updates, and it loses the letter-shortcut signal in the process. This is a documented size effect: weaker students benefit from coarser supervision, and long rationales over-smooth gradients (Skip-Thinking, arXiv 2505.18642; Unveiling Key Factors for Distilling CoT, arXiv 2502.18001).

The 30 × small-talk multiplier compounded the damage by pushing the task-data share below 5% of the SFT mix, well outside the SmolLM2 reference of ≥ 20% task share.
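
A back-of-the-envelope version of that dilution arithmetic; the record counts below are placeholders, and only the multipliers (CoT MCQ × 1, small-talk × 30) come from the chat-recovered recipe.

```python
# Back-of-the-envelope mix-share arithmetic. Record counts are placeholders;
# only the multipliers come from the chat-recovered recipe.
def task_share(mix: dict[str, tuple[int, int]], task_keys: set[str]) -> float:
    """mix maps name -> (records, multiplier); returns the task fraction of the final SFT mix."""
    totals = {name: n * mult for name, (n, mult) in mix.items()}
    return sum(v for k, v in totals.items() if k in task_keys) / sum(totals.values())

recovered = {"cot_mcq": (3_000, 1), "small_talk": (5_000, 30)}  # placeholder record counts
share = task_share(recovered, {"cot_mcq"})
print(f"task share: {share:.1%}")  # ~2%, far below the >=20% task-share reference
```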

Why chat-v4 (lr 2e-4) diverged

Research said an undertrained backbone needs an aggressive SFT lr to escape a bad pretrain basin. SmolLM2 uses 3e-4 SFT lr at 135M params. Scaled down from that reference to 2e-4 for our 36M model with mean-init new tokens, it was still too hot: val loss climbed monotonically across three evals (3.175 → 3.285 → 3.403) before we killed the run at step 300.

Lesson: the SmolLM2 lr reference doesn't transfer linearly to 36M with new embedding rows. The safe range is closer to 5e-5.
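
A minimal sketch of an automated abort on rising validation loss, replayed against the chat-v4 trajectory; the patience value, eval steps, and class structure are assumptions, not the actual training loop.

```python
# Illustrative divergence guard: abort when val loss has risen for `patience`
# consecutive evals. Structure and thresholds are assumptions, not the real loop.
class DivergenceGuard:
    def __init__(self, patience: int = 2):
        self.patience = patience
        self.best = float("inf")
        self.rising = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best = val_loss
            self.rising = 0
        else:
            self.rising += 1
        return self.rising >= self.patience

guard = DivergenceGuard(patience=2)
# chat-v4 val-loss trajectory; the eval step numbers (100/200/300) are assumed.
for step, val_loss in [(100, 3.175), (200, 3.285), (300, 3.403)]:
    if guard.should_stop(val_loss):
        print(f"val loss rising, killing run at step {step}")
        break
```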

What chat-v5 got right (and didn't)

The hybrid recipe (raw × 5 + CoT × 2, small-talk × 8, lr 5e-5, mean-init embeddings) lifted the score from 30.8% to 34.8%, a real +4.0-point gain over the prior recovery attempt. But it still trails the canonical by 2.1 points.

The hybrid was directionally right: keeping the letter-shortcut anchor (raw × 5) preserved the discriminative signal, while CoT × 2 added some reasoning supervision without over-rotating. Mean-init for new tokens kept the residual stream stable.
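
For reference, one way a multiplier-based mix like this can be assembled by plain upsampling; the record stubs are placeholders, and only the multipliers match the chat-v5 recipe.

```python
import random

# Sketch of assembling a multiplier-based SFT mix by upsampling.
# The record stubs are placeholders; only the multipliers match chat-v5.
def build_mix(datasets: dict[str, tuple[list[dict], int]], seed: int = 0) -> list[dict]:
    """datasets maps name -> (records, multiplier); returns a shuffled, upsampled mix."""
    mix: list[dict] = []
    for _name, (records, mult) in datasets.items():
        mix.extend(records * mult)  # repeat the whole subset `mult` times
    random.Random(seed).shuffle(mix)
    return mix

raw_mcq    = [{"prompt": "...", "completion": " B"}]       # placeholder record
cot_mcq    = [{"prompt": "...", "completion": " B. ..."}]  # placeholder record
small_talk = [{"prompt": "...", "completion": "..."}]      # placeholder record

mix = build_mix({
    "raw_mcq":    (raw_mcq, 5),     # letter-shortcut anchor
    "cot_mcq":    (cot_mcq, 2),     # light reasoning supervision
    "small_talk": (small_talk, 8),  # chat style, far below the x30 of chat-recovered
})
```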

What it didn't fix: the letter shortcut at × 5 is still doing most of the work, and there's no mechanism in this recipe that actually transfers knowledge into the model, only better calibration on top of the shortcut. To beat 36.9% durably, the lever isn't another SFT recipe; it's one of the following:

  1. Bigger model (ghost-base ~350M) so reasoning supervision actually fits.
  2. Better pretrain coverage of the CTIBench knowledge domain (more cyber threat intel, MITRE corpus depth) so the shortcut isn't the only path to a correct answer.
  3. Proper retrieval at inference (RAG done right, not the chat-v4 RAFT attempt that conflated training-time and inference-time augmentation); see the sketch after this list.
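
To make the training-time vs inference-time distinction in (3) concrete, a minimal sketch of retrieval applied only at inference, with a hypothetical `retrieve` standing in for whatever index over the threat-intel corpus would actually back it.

```python
# Sketch of inference-time retrieval (option 3 above), as opposed to the
# RAFT-style training-time augmentation used in chat-v4. `retrieve` is a
# hypothetical stand-in for a real index over the threat-intel corpus.
def retrieve(question: str, k: int = 3) -> list[str]:
    """Placeholder retriever; a real one would query a BM25 or vector index."""
    return ["<retrieved passage 1>", "<retrieved passage 2>", "<retrieved passage 3>"][:k]

def build_rag_mcq_prompt(question: str, options: dict[str, str]) -> str:
    """Prepend retrieved context at inference only; the SFT recipe stays unchanged."""
    context = "\n".join(retrieve(question))
    choices = "\n".join(f"{letter}. {text}" for letter, text in sorted(options.items()))
    return f"Context:\n{context}\n\n{question}\n{choices}\nAnswer:"
```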

Decision

  • Canonical stays: v0.5 chat-v3 (MCQ-tuned) at 36.9% on the main HF repo.
  • Ship chat-v5 separately: push to Ghostgim/GhostLM-v0.5-experimental with this postmortem in the model card. Honest framing: "improved CoT hybrid recipe, still 2.1pt below canonical, primarily of research interest."
  • No more chat-tune iterations on v0.5. The 36.9% ceiling is a pretrain capacity ceiling, not a recipe ceiling. The next swing should be ghost-base or a corpus-side fix, not another SFT permutation.

Sources

  • Answer Matching > MCQ (arXiv 2507.02856)
  • Skip-Thinking (arXiv 2505.18642)
  • Unveiling Key Factors for Distilling CoT (arXiv 2502.18001)
