v0.5 Chat-Tune Postmortem (2026-05-03)

The canonical v0.5 chat is chat-v3 (MCQ-tuned) at 36.9% on CTIBench MCQ. This document records the recovery attempts that followed and what they actually changed about our understanding.

Result table

| Run | Recipe | Steps | LR | Val loss | CTIBench MCQ |
|---|---|---|---|---|---|
| chat-v2 | Cybersec Q&A only, no MCQ | 1500 | 5e-5 | – | 19.0% |
| chat-v3 (canonical) | Raw letter-only MCQ × 5 | 1500 | 5e-5 | – | 36.9% |
| chat-v4 (RAFT) | RAG-augmented chat-v3 mix | 1500 | 5e-5 | – | 25.0% |
| chat (v0.5 base re-tune) | chat-v3 recipe on v0.5 base | 1500 | 5e-5 | – | 32.5% |
| chat-long | chat-v3 mix, 4000 steps | 4000 | 5e-5 | – | 17.1% |
| chat-recovered | CoT MCQ × 1 + small-talk × 30 | 1500 | 3e-5 | 2.808 | 30.8% |
| chat-v4 (failed) | Hybrid + lr 2e-4 | 300 | 2e-4 | diverged | killed |
| chat-v5 (this run) | Hybrid: raw × 5 + CoT × 2 + small-talk × 8 | 2000 | 5e-5 | 2.990 | 34.8% |

What we learned

What chat-v3 actually does

The canonical 36.9% is a pattern-match shortcut, not reasoning. With raw letter-only MCQ at a × 5 multiplier, the model learns "after the prompt ends in 'Answer:', emit a single letter consistent with the surface features of the options." This is a known class of MCQ artifact (Answer Matching > MCQ, arXiv 2507.02856): sub-100M models can hit reasonable MCQ scores by exploiting the choice distribution without understanding the question.
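
For concreteness, a minimal sketch of what a raw letter-only MCQ record looks like under this framing; the prompt template, field names, and example question are illustrative, not the actual training pipeline.

```python
# Illustrative letter-only MCQ SFT record; the template and field names are
# assumptions, not the actual GhostLM data pipeline.
def format_letter_only(question: str, options: dict[str, str], answer: str) -> dict:
    """Build one SFT example whose target is just the answer letter."""
    choices = "\n".join(f"{letter}. {text}" for letter, text in sorted(options.items()))
    prompt = f"{question}\n{choices}\nAnswer:"
    # Only a single letter follows "Answer:", which is exactly the surface
    # shortcut the model learns to exploit at a x5 multiplier.
    return {"prompt": prompt, "completion": f" {answer}"}

example = format_letter_only(
    "Which MITRE ATT&CK tactic covers credential dumping?",
    {"A": "Persistence", "B": "Credential Access", "C": "Exfiltration", "D": "Impact"},
    "B",
)
```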

Why CoT-MCQ alone made it worse

chat-recovered (30.8%) replaced the letter-only MCQ × 5 with CoT MCQ × 1. The CoT records have the format "B. <1-2 sentence justification>", with the reasoning generated by Qwen-14B. The hypothesis, drawn from Phi-3.5-mini and OpenMath-Mini, was that reasoning supervision should outperform pattern-match supervision even at low multipliers.

It didn't: at 36M params, the model can't compress 1-2 sentences of cybersec reasoning into useful weight updates, and it loses the letter-shortcut signal in the process. This is a documented size effect: weaker students benefit from coarser supervision, and long rationales over-smooth gradients (Skip-Thinking, arXiv 2505.18642; Unveiling Key Factors for Distilling CoT, arXiv 2502.18001).

The 30 × small-talk multiplier compounded the damage by pushing the task-data share below 5% of the SFT mix, well outside the SmolLM2 reference of ≥ 20% task share.
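
A back-of-the-envelope version of that dilution arithmetic; the record counts below are placeholders, and only the multipliers (CoT MCQ × 1, small-talk × 30) come from the chat-recovered recipe.

```python
# Back-of-the-envelope mix-share arithmetic. Record counts are placeholders;
# only the multipliers come from the chat-recovered recipe.
def task_share(mix: dict[str, tuple[int, int]], task_keys: set[str]) -> float:
    """mix maps name -> (records, multiplier); returns the task fraction of the final SFT mix."""
    totals = {name: n * mult for name, (n, mult) in mix.items()}
    return sum(v for k, v in totals.items() if k in task_keys) / sum(totals.values())

recovered = {"cot_mcq": (3_000, 1), "small_talk": (5_000, 30)}  # placeholder record counts
share = task_share(recovered, {"cot_mcq"})
print(f"task share: {share:.1%}")  # ~2%, far below the >=20% task-share reference
```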

Why chat-v4 (lr 2e-4) diverged

Research said an undertrained backbone needs an aggressive SFT lr to escape a bad pretrain basin. SmolLM2 uses 3e-4 SFT lr at 135M params. Scaled down from that reference to 2e-4 for our 36M model with mean-init new tokens, it was still too hot: val loss climbed monotonically across three evals (3.175 → 3.285 → 3.403) before we killed the run at step 300.

Lesson: the SmolLM2 lr reference doesn't transfer linearly to 36M with new embedding rows. The safe range is closer to 5e-5.
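
A minimal sketch of an automated abort on rising validation loss, replayed against the chat-v4 trajectory; the patience value, eval steps, and class structure are assumptions, not the actual training loop.

```python
# Illustrative divergence guard: abort when val loss has risen for `patience`
# consecutive evals. Structure and thresholds are assumptions, not the real loop.
class DivergenceGuard:
    def __init__(self, patience: int = 2):
        self.patience = patience
        self.best = float("inf")
        self.rising = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best = val_loss
            self.rising = 0
        else:
            self.rising += 1
        return self.rising >= self.patience

guard = DivergenceGuard(patience=2)
# chat-v4 val-loss trajectory; the eval step numbers (100/200/300) are assumed.
for step, val_loss in [(100, 3.175), (200, 3.285), (300, 3.403)]:
    if guard.should_stop(val_loss):
        print(f"val loss rising, killing run at step {step}")
        break
```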

What chat-v5 got right (and didn't)

The hybrid recipe (raw × 5 + CoT × 2, small-talk × 8, lr 5e-5, mean-init embeddings) lifted the score from 30.8% to 34.8%, a real +4.0-point gain over the prior recovery attempt. But it still trails the canonical by 2.1 points.

The hybrid was directionally right: keeping the letter-shortcut anchor (raw × 5) preserved the discriminative signal, while CoT × 2 added some reasoning supervision without over-rotating. Mean-init for new tokens kept the residual stream stable.
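
For reference, one way a multiplier-based mix like this can be assembled by plain upsampling; the record stubs are placeholders, and only the multipliers match the chat-v5 recipe.

```python
import random

# Sketch of assembling a multiplier-based SFT mix by upsampling.
# The record stubs are placeholders; only the multipliers match chat-v5.
def build_mix(datasets: dict[str, tuple[list[dict], int]], seed: int = 0) -> list[dict]:
    """datasets maps name -> (records, multiplier); returns a shuffled, upsampled mix."""
    mix: list[dict] = []
    for _name, (records, mult) in datasets.items():
        mix.extend(records * mult)  # repeat the whole subset `mult` times
    random.Random(seed).shuffle(mix)
    return mix

raw_mcq    = [{"prompt": "...", "completion": " B"}]       # placeholder record
cot_mcq    = [{"prompt": "...", "completion": " B. ..."}]  # placeholder record
small_talk = [{"prompt": "...", "completion": "..."}]      # placeholder record

mix = build_mix({
    "raw_mcq":    (raw_mcq, 5),     # letter-shortcut anchor
    "cot_mcq":    (cot_mcq, 2),     # light reasoning supervision
    "small_talk": (small_talk, 8),  # chat style, far below the x30 of chat-recovered
})
```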

What it didn't fix: the letter shortcut at × 5 is still doing most of the work, and there's no mechanism in this recipe that actually transfers knowledge into the model, only better calibration on top of the shortcut. To beat 36.9% durably, the lever isn't another SFT recipe; it's one of the following:

  1. Bigger model (ghost-base ~350M) so reasoning supervision actually fits.
  2. Better pretrain coverage of the CTIBench knowledge domain (more cyber threat intel, MITRE corpus depth) so the shortcut isn't the only path to a correct answer.
  3. Proper retrieval at inference (RAG done right, not the chat-v4 RAFT attempt that conflated training-time and inference-time augmentation); see the sketch after this list.
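
To make the training-time vs inference-time distinction in (3) concrete, a minimal sketch of retrieval applied only at inference, with a hypothetical `retrieve` standing in for whatever index over the threat-intel corpus would actually back it.

```python
# Sketch of inference-time retrieval (option 3 above), as opposed to the
# RAFT-style training-time augmentation used in chat-v4. `retrieve` is a
# hypothetical stand-in for a real index over the threat-intel corpus.
def retrieve(question: str, k: int = 3) -> list[str]:
    """Placeholder retriever; a real one would query a BM25 or vector index."""
    return ["<retrieved passage 1>", "<retrieved passage 2>", "<retrieved passage 3>"][:k]

def build_rag_mcq_prompt(question: str, options: dict[str, str]) -> str:
    """Prepend retrieved context at inference only; the SFT recipe stays unchanged."""
    context = "\n".join(retrieve(question))
    choices = "\n".join(f"{letter}. {text}" for letter, text in sorted(options.items()))
    return f"Context:\n{context}\n\n{question}\n{choices}\nAnswer:"
```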

Decision

  • Canonical stays: v0.5 chat-v3 (MCQ-tuned) at 36.9% on the main HF repo.
  • Ship chat-v5 separately: push to Ghostgim/GhostLM-v0.5-experimental with this postmortem in the model card. Honest framing: "improved CoT hybrid recipe, still 2.1pt below canonical, primarily of research interest."
  • No more chat-tune iterations on v0.5. The 36.9% ceiling is a pretrain capacity ceiling, not a recipe ceiling. The next swing should be ghost-base or a corpus-side fix, not another SFT permutation.

Sources

  • Answer Matching > MCQ (arXiv 2507.02856)
  • Skip-Thinking (arXiv 2505.18642)
  • Unveiling Key Factors for Distilling CoT (arXiv 2502.18001)
