# v0.5 Chat-Tune Postmortem (2026-05-03)
The canonical v0.5 chat is chat-v3 (MCQ-tuned) at 36.9% on CTIBench MCQ.
This document records the recovery attempts that followed and what they
actually changed about our understanding.
## Result table
| Run | Recipe | Steps | LR | Val loss | CTIBench MCQ |
|---|---|---|---|---|---|
| chat-v2 | Cybersec Q&A only, no MCQ | 1500 | 5e-5 | n/a | 19.0% |
| chat-v3 (canonical) | Raw letter-only MCQ × 5 | 1500 | 5e-5 | n/a | 36.9% |
| chat-v4 (RAFT) | RAG-augmented chat-v3 mix | 1500 | 5e-5 | n/a | 25.0% |
| chat (v0.5 base re-tune) | chat-v3 recipe on v0.5 base | 1500 | 5e-5 | n/a | 32.5% |
| chat-long | chat-v3 mix, 4000 steps | 4000 | 5e-5 | n/a | 17.1% |
| chat-recovered | CoT MCQ × 1 + small-talk × 30 | 1500 | 3e-5 | 2.808 | 30.8% |
| chat-v4 (failed) | Hybrid + lr 2e-4 | 300 | 2e-4 | diverged | killed |
| chat-v5 (this run) | Hybrid raw × 5 + CoT × 2 + small-talk × 8 | 2000 | 5e-5 | 2.990 | 34.8% |
## What we learned
### What chat-v3 actually does
The 36.9% canonical is a pattern-match shortcut, not reasoning. With raw letter-only MCQ at a × 5 multiplier, the model learns "after the prompt ends in 'Answer:', emit a single letter consistent with the surface features of the options." This is a known class of MCQ artifact (Answer Matching > MCQ, arXiv 2507.02856): sub-100M models can hit reasonable MCQ scores by exploiting the choice distribution without understanding the question.
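For concreteness, here is roughly what the two supervision formats look like. The field names and the example question are illustrative, not taken from the actual dataset:

```python
# Illustrative shape of a raw letter-only MCQ record (the chat-v3 signal).
# The completion is a bare letter, so the only thing a 36M model has to
# learn is a surface-feature -> letter mapping after "Answer:".
raw_mcq_record = {
    "prompt": (
        "Which MITRE ATT&CK tactic covers credential dumping?\n"
        "A. Initial Access\n"
        "B. Credential Access\n"
        "C. Exfiltration\n"
        "D. Persistence\n"
        "Answer:"
    ),
    "completion": " B",
}

# The CoT variant (chat-recovered / chat-v5) keeps the letter but appends a
# short Qwen-14B-generated justification, per the "B. <1-2 sentence
# justification>" format described below.
cot_mcq_record = {
    "prompt": raw_mcq_record["prompt"],
    "completion": (
        " B. Credential dumping extracts stored password material, which "
        "falls under the Credential Access tactic."
    ),
}
```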
### Why CoT-MCQ alone made it worse
chat-recovered (30.8%) replaced the letter-only MCQ × 5 with CoT MCQ × 1.
The CoT records have the format "B. <1-2 sentence justification>"; Qwen-14B
generated the reasoning. The hypothesis, from Phi-3.5-mini and OpenMath-Mini,
was that reasoning supervision should outperform pattern-match supervision
even at low multipliers.
It didn't: at 36M params, the model can't compress 1-2 sentences of cybersec reasoning into useful weight updates, and it loses the letter-shortcut signal in the process. This is a documented size effect: weaker students benefit from coarser supervision, and long rationales over-smooth gradients (Skip-Thinking, arXiv 2505.18642; Unveiling Key Factors for Distilling CoT, arXiv 2502.18001).
The 30 × small-talk multiplier compounded the damage by pushing the task-data share below 5% of the SFT mix, well outside the SmolLM2 reference of ≥ 20% task share.
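The share arithmetic behind that claim, as a minimal sketch. The record counts below are made-up placeholders, since the actual dataset sizes aren't recorded in this postmortem:

```python
# Placeholder record counts; the real dataset sizes are not in this document.
chat_recovered_mix = {
    "cybersec_qa": {"records": 10_000, "multiplier": 1},
    "cot_mcq":     {"records": 2_000,  "multiplier": 1},   # CoT MCQ x 1
    "small_talk":  {"records": 20_000, "multiplier": 30},  # small-talk x 30
}

def task_share(mix: dict, task_keys: list[str]) -> float:
    """Fraction of the post-multiplier SFT mix that is task data."""
    weighted = {name: d["records"] * d["multiplier"] for name, d in mix.items()}
    return sum(weighted[k] for k in task_keys) / sum(weighted.values())

share = task_share(chat_recovered_mix, ["cybersec_qa", "cot_mcq"])
print(f"task share: {share:.1%}")  # ~2% with these counts, far below the 20% reference
```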
### Why chat-v4 (lr 2e-4) diverged
Research said an undertrained backbone needs aggressive SFT lr to escape a bad pretrain basin. SmolLM2 uses a 3e-4 SFT lr at 135M params. Scaled down to 36M with mean-init new tokens, 2e-4 was still too hot: val loss climbed monotonically across three evals (3.175 → 3.285 → 3.403) before we killed the run at step 300.
Lesson: the SmolLM2 lr reference doesn't transfer linearly to 36M with new embedding rows. The safe range is closer to 5e-5.
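One way to encode the abort rule applied to that run (kill when val loss rises monotonically across consecutive evals); the function name and patience value are illustrative, not the actual training harness:

```python
def should_kill(val_losses: list[float], patience: int = 3) -> bool:
    """True if validation loss has risen monotonically over the last `patience` evals."""
    if len(val_losses) < patience:
        return False
    recent = val_losses[-patience:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))

# chat-v4 (lr 2e-4): 3.175 -> 3.285 -> 3.403 over three evals, killed at step 300.
print(should_kill([3.175, 3.285, 3.403]))  # True
```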
### What chat-v5 got right (and didn't)
The hybrid recipe (raw × 5 + CoT × 2, small-talk × 8, lr 5e-5, mean-init embeddings) lifted the score from 30.8% to 34.8%, a real +4.0 point gain over the prior recovery attempt. But it still trails canonical by 2.1 points.
The hybrid was directionally right: keeping the letter-shortcut anchor (raw × 5) preserved the discriminative signal, while CoT × 2 added some reasoning supervision without over-rotating. Mean-init for new tokens kept the residual stream stable.
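Mean-init here means initializing each new embedding row to the mean of the existing vocabulary rows instead of random values. A minimal sketch using Hugging Face transformers-style APIs (assumed, not the actual training code):

```python
import torch

def add_tokens_with_mean_init(model, tokenizer, new_tokens):
    """Add new tokens and mean-initialize their embedding rows."""
    old_vocab_size = model.get_input_embeddings().weight.shape[0]
    tokenizer.add_tokens(new_tokens)
    model.resize_token_embeddings(len(tokenizer))

    with torch.no_grad():
        emb = model.get_input_embeddings().weight
        # New rows get the mean of the pre-existing rows, keeping the
        # residual stream in-distribution at step 0.
        emb[old_vocab_size:] = emb[:old_vocab_size].mean(dim=0)

        # If the LM head is untied from the input embeddings, mirror the init.
        head = model.get_output_embeddings()
        if head is not None and head.weight.data_ptr() != emb.data_ptr():
            head.weight[old_vocab_size:] = head.weight[:old_vocab_size].mean(dim=0)
```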
What it didn't fix: the letter-shortcut at × 5 is still doing most of the work, and there's no mechanism in this recipe that actually transfers knowledge into the model, only better calibration on top of the shortcut. To beat 36.9% durably, the lever isn't another SFT recipe. It's one of:
- Bigger model (ghost-base ~350M) so reasoning supervision actually fits.
- Better pretrain coverage of the CTIBench knowledge domain (more cyber threat intel, MITRE corpus depth) so the shortcut isn't the only path to a correct answer.
- Proper retrieval at inference (RAG done right, not the chat-v4 RAFT attempt that conflated training-time and inference-time augmentation); the distinction is sketched after this list.
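On that last point, a minimal sketch of what we mean by inference-time retrieval: context is fetched per query and only prepended to the prompt, and the SFT mix stays un-augmented. The `retrieve` and `generate` callables are placeholders:

```python
def answer_with_inference_time_rag(question, retrieve, generate, k=3):
    """Inference-time retrieval: fetch context at query time and prepend it to the prompt.
    Nothing here changes the SFT mix (unlike the chat-v4 RAFT attempt, which
    baked retrieved passages into the training examples themselves)."""
    passages = retrieve(question, k=k)   # placeholder retriever
    context = "\n\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)              # placeholder model call
```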
## Decision
- Canonical stays: `v0.5 chat-v3 (MCQ-tuned)` at 36.9% on the main HF repo.
- Ship chat-v5 separately: push to `Ghostgim/GhostLM-v0.5-experimental` with this postmortem in the model card. Honest framing: "improved CoT hybrid recipe, still 2.1pt below canonical, primarily of research interest."
- No more chat-tune iterations on v0.5. The 36.9% ceiling is a pretrain capacity ceiling, not a recipe ceiling. The next swing should be ghost-base or a corpus-side fix, not another SFT permutation.
## Sources

- Answer Matching > MCQ (arXiv 2507.02856)
- Skip-Thinking (arXiv 2505.18642)
- Unveiling Key Factors for Distilling CoT (arXiv 2502.18001)