---
language: en
tags:
  - model-clinic
  - model-recovery
  - case-study
  - diagnostics
license: mit
---

# Models Recovered by model-clinic

## The Diagnosis That Saved 645 Hours

One developer. One GPU. 645 hours spent training a custom 334M-parameter transformer with five novel architectural features. Every one of them was dead on arrival. The base model learned language; the memory, drives, adaptive computation, and growth system learned nothing.

Nobody knew until model-clinic found the bug.


## Case 1: ATLES 300M RecurrentTransformer

**The patient:** Custom 334M transformer with 5-tier hierarchical memory, a drive system, adaptive computation, and neural foam growth. Trained for 645 hours on an RTX 3060.

**Symptoms:** Model generates language, but all novel features appear inert. Memory banks empty or filled with noise. Drives producing zero output. Growth making things worse. Nobody could figure out why.

**model-clinic diagnosis:**

```
$ model-clinic exam merged_final.pt

Health Score: 56/100 (D)

[ERROR] identical_rows    - memory tiers 0, 2: keys identical to values
[ERROR] gradient_noise    - memory tier 2: condition number 1,400,000,000
[ERROR] gradient_noise    - layers.13.attention.k_proj: condition 30,321
[WARN]  stuck_gate_closed - read_gate: sigmoid(-4.93) = 0.72%
[WARN]  stuck_gate_closed - write_gate: sigmoid(-4.97) = 0.69%
[WARN]  heavy_tails       - 6 parameters with kurtosis > 50
[WARN]  model_aging       - embed_tokens showing distribution drift
[WARN]  dtype_mismatch    - mixed fp32/bf16 tensors
```
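The `gradient_noise` findings above can be reproduced on any 2-D weight tensor: the condition number is just the ratio of the largest to smallest singular value. A minimal sketch in plain PyTorch (generic linear algebra, not model-clinic's internal implementation):

```python
import torch

def condition_number(weight: torch.Tensor) -> float:
    """Ratio of largest to smallest singular value of a 2-D matrix."""
    s = torch.linalg.svdvals(weight.float())
    return (s.max() / s.min().clamp_min(1e-12)).item()

# A random Gaussian matrix is reasonably well conditioned.
w_healthy = torch.randn(256, 256)

# Collapsing all but one singular value simulates a degenerate
# memory tier: the condition number explodes.
u, s, vt = torch.linalg.svd(w_healthy)
s[1:] *= 1e-9
w_sick = u @ torch.diag(s) @ vt

print(f"healthy: {condition_number(w_healthy):,.0f}")
print(f"sick:    {condition_number(w_sick):.2e}")
```

A condition number in the billions, as on memory tier 2, means the matrix is numerically near rank-deficient: gradient updates along the small singular directions are dominated by noise.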

**Root cause:** stuck_gate_closed on read_gate and write_gate. Both gates were initialized to logit -5, i.e. sigmoid(-5) = 0.67% throughput. At 0.67% throughput, the gradient signal reaching everything behind the gates is also scaled to ~0.67%. The memory, drives, and persistent state (26M parameters sitting behind these gates) received essentially zero gradient for all 645 hours of training.

The optimizer saw near-zero gradients and made near-zero updates. After 645 hours, the read gate had moved from 0.67% to 0.72%, and the write gate to 0.69%. The gates were frozen by their own initialization.
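The freezing mechanism is just arithmetic. A gate g = sigmoid(w) scales both the forward signal and the gradient into the gated branch by g, and the gradient on the gate's own logit is scaled by sigmoid'(w) = g(1 - g), which is equally tiny near the closed end:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

g = sigmoid(-5.0)
print(f"gate throughput:          {g:.4%}")            # ~0.67%

# Gradient on the gate's own logit scales with g * (1 - g),
# so a nearly-closed gate also barely moves itself.
print(f"gate self-gradient scale: {g * (1 - g):.4%}")  # ~0.66%

# Where the read gate actually ended up after 645 hours:
print(f"read gate after training: {sigmoid(-4.93):.2%}")
```

Both the gated subnetwork and the gate itself learn at well under 1% of normal speed, which is why 645 hours moved the logit by only ~0.07.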

This bug was invisible to standard training metrics. Loss went down. Perplexity improved. The 308M transformer backbone was healthy and learning language. You had to look at the gates specifically to see they were stuck. model-clinic's stuck_gate_closed detector caught it.
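model-clinic's actual stuck_gate_closed heuristic isn't shown in this document; a plausible minimal version of such a check (the substring match on "gate" and the 5% threshold are assumptions for illustration) looks like:

```python
import math

def find_stuck_gates(named_logits: dict, threshold: float = 0.05) -> list:
    """Flag scalar gate logits whose sigmoid throughput is below threshold.

    Matching parameter names on the substring 'gate' is an illustrative
    convention, not model-clinic's actual heuristic.
    """
    findings = []
    for name, logit in named_logits.items():
        if "gate" not in name:
            continue
        throughput = 1.0 / (1.0 + math.exp(-logit))
        if throughput < threshold:
            findings.append((name, throughput))
    return findings

logits = {"read_gate": -4.93, "write_gate": -4.97, "bridge_gate": 0.0}
for name, t in find_stuck_gates(logits):
    print(f"[WARN] stuck_gate_closed - {name}: {t:.2%}")
```

The point is that the check inspects the gate parameters directly, which is exactly the signal that loss and perplexity curves never surface.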

**Treatment applied:**

| Stage | Tool Used | Score | What Changed |
|---|---|---|---|
| Original | (none) | 56/D | 16 findings |
| L1: Cosmetic | `apply_treatment()` | 58/D | Outliers clamped |
| L2: Spectral surgery | `spectral_denoise()` | 65/C | Condition 1.4B → 1K |
| L3: Distillation | `distill_repair()` | 71/C | Memory banks reset + retrained |
| Gate opening | Manual (informed by diagnosis) | 82/B | Gates set to sigmoid(0) = 50% |

**Health score:** 56/D → 82/B (+26 points)

The gate opening step, resetting the logits from -5 to 0 so that sigmoid(0) = 50%, was directly informed by model-clinic's stuck_gate_closed finding. Without the diagnosis, the fix wouldn't have been obvious: the gates "worked" in the sense that tensors flowed through them. They just throttled everything to 0.67%.
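The fix itself amounts to a small state-dict edit. A hedged sketch (the substring match on "gate" and the key names are illustrative; match your checkpoint's actual parameter names):

```python
import torch

def open_gates(state_dict: dict, new_logit: float = 0.0) -> dict:
    """Return a copy of state_dict with every scalar 'gate' logit reset
    to new_logit, i.e. sigmoid(0) = 50% throughput by default."""
    patched = dict(state_dict)
    for key, tensor in state_dict.items():
        if "gate" in key and tensor.numel() == 1:
            patched[key] = torch.full_like(tensor, new_logit)
    return patched

sd = {"read_gate": torch.tensor(-4.93), "write_gate": torch.tensor(-4.97)}
sd = open_gates(sd)
print(torch.sigmoid(sd["read_gate"]).item())  # 0.5
```

In practice you would load the checkpoint with `torch.load`, patch it, and save it back; only the scalar gate logits are touched, so the rest of the weights are untouched.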

**What the repair couldn't fix:** generation quality. The embeddings collapsed to an effective rank of 11/256 during the 645 hours of training, and every layer adapted to work with those broken input representations. model-clinic improved weight-level metrics but couldn't inject knowledge that was never learned. The model is structurally healthier but still can't form coherent sentences.
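The "effective rank 11/256" figure can be computed from the singular-value spectrum of the embedding matrix. One common definition is the exponential of the entropy of the normalized singular values (whether model-clinic uses exactly this definition is an assumption):

```python
import torch

def effective_rank(weight: torch.Tensor) -> float:
    """exp(entropy) of the normalized singular-value distribution."""
    s = torch.linalg.svdvals(weight.float())
    p = s / s.sum()
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum()
    return torch.exp(entropy).item()

healthy = torch.randn(5000, 256)                          # near full rank
collapsed = torch.randn(5000, 11) @ torch.randn(11, 256)  # true rank 11
print(f"healthy:   {effective_rank(healthy):.1f} / 256")
print(f"collapsed: {effective_rank(collapsed):.1f} / 256")
```

An embedding table with effective rank 11 out of 256 means every token is represented in an 11-dimensional subspace; no weight surgery on later layers can recover the lost distinctions.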

**Honest assessment:** model-clinic accurately diagnosed the problem and improved measurable health metrics. But post-hoc repair has limits: a model that trained for 645 hours with fundamentally broken features can't be fully recovered by weight surgery alone.


## Case 2: ATLESQwen (Saved Before It Broke)

**The patient:** Qwen2.5-0.5B-Instruct + ATLES wrapper (~47M trainable params). Same memory architecture as the 300M model. Same gate initialization. Same bug, waiting to happen.

**The near-miss:** ATLESQwen v1 and v2 were initialized with the exact same sigmoid(-5) gate values. Phase 1 v2 training ran to step 2,350 before the server went down for thermal maintenance. During those 2,350 steps:

  • Memory gate: 0.67% throughput
  • Wrapper gate: 0.67% throughput
  • All ATLES features: receiving ~0 gradient
  • Loss was going down (Qwen doing all the work)

If training had continued to 20,000 steps, the result would have been identical to the 300M model: a Qwen that talks fine with 47M parameters of dead weight bolted onto it. Another few hundred hours wasted.

model-clinic's diagnosis of the 300M model revealed the pattern. The stuck_gate_closed finding, the dead memory banks, the zero-gradient drives: all pointed to gate initialization as the root cause.

**The fix (applied before damage accumulated):**

```python
# BEFORE (ATLESQwen v1/v2) - the 300M model's bug, repeated
self.state_gate = nn.Parameter(torch.tensor(-5.0))    # sigmoid(-5) = 0.67%
self.pre_memory_gate = nn.Parameter(torch.tensor(-5.0))
self.wrapper_gate = nn.Parameter(torch.tensor(-5.0))
self.bridge_gate = nn.Parameter(torch.tensor(-4.0))   # sigmoid(-4) = 1.8%

# AFTER (ATLESQwen v3) - informed by model-clinic diagnosis
self.state_gate = nn.Parameter(torch.tensor(0.0))     # sigmoid(0) = 50%
self.pre_memory_gate = nn.Parameter(torch.tensor(0.0))
self.wrapper_gate = nn.Parameter(torch.tensor(0.0))
self.bridge_gate = nn.Parameter(torch.tensor(0.0))
```

**ATLESQwen v3 Phase 1 results (gate_init=0):**

| Metric | v2 (broken gates) | v3 (fixed gates) |
|---|---|---|
| Steps completed | 2,350 (server died) | 19,500 (full run) |
| Final PPL | ~34 (projected) | 46.13 |
| Memory gate | 0.67% (stuck) | 50% (stable) |
| Wrapper gate | 0.67% (stuck) | 48.1% (self-calibrated) |
| Memory occupancy | Empty | 128/128 short-term, 512/512 long-term |
| Training time | (n/a) | 6.2 hours |
| Generation quality | Qwen-level | Qwen-level (Phase 1 goal) |

The wrapper gate self-calibrated from 50% to 48.1%. It didn't collapse to zero (shut off) or stay locked at 50% (no learning). The model found its own operating point. Same behavior as the 300M model's read gate settling at 52% during the gate opening experiment.
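The prevention lesson generalizes: log gate throughput during training so a stuck gate is visible at step 100 instead of hour 645. A minimal monitoring sketch (generic PyTorch, with a toy module standing in for the ATLES wrapper; the name-matching convention is an assumption):

```python
import torch

def gate_report(model: torch.nn.Module) -> dict:
    """Map every scalar parameter with 'gate' in its name to its
    current sigmoid throughput."""
    return {
        name: torch.sigmoid(p.detach()).item()
        for name, p in model.named_parameters()
        if "gate" in name and p.numel() == 1
    }

class TinyGated(torch.nn.Module):
    """Toy stand-in with one healthy and one stuck gate."""
    def __init__(self):
        super().__init__()
        self.wrapper_gate = torch.nn.Parameter(torch.tensor(0.0))
        self.state_gate = torch.nn.Parameter(torch.tensor(-5.0))

for name, throughput in gate_report(TinyGated()).items():
    flag = "  <-- stuck_gate_closed?" if throughput < 0.05 else ""
    print(f"{name}: {throughput:.2%}{flag}")
```

Calling something like this every few hundred steps, alongside loss, would have surfaced the original bug almost immediately.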

Memory is full and active. Short-term: 128/128 slots occupied. Long-term: 512/512 slots occupied. For the first time in the project's history, the memory system is actually accumulating and retaining information during training. With broken gates, memory was always empty.

This is the recovery that matters. Not fixing a broken model after the fact, but catching the bug before it ruins another training run. model-clinic's diagnosis of the 300M corpse saved ATLESQwen from the same fate.


## Case 3: Batch Validation Across 72 Checkpoints

model-clinic scanned every checkpoint from the project's entire training history: 95 files, 72 scoreable models.

**Grade distribution:**

| Grade | Count | What's Here |
|---|---|---|
| A | 28 | ATLESQwen deltas, crypto models, small SFT |
| B | 15 | Pretrain backbones, repaired models |
| C | 14 | Fine-tuning runs, baselines without growth |
| D | 14 | Every model with growth enabled |
| F | 1 | Neural foam 1B (worst checkpoint ever) |

**The pattern model-clinic revealed:**

  1. Pretrain = healthy. Steps 16K-35K all score 84/B. The base transformer architecture works.

  2. Growth = damage. Every growth-enabled model scores strictly worse than its baseline. Chimera growth 58/D vs no-growth 66/C. Neural foam growth 54/D vs no-growth 66/C. No exceptions across 72 models.

  3. Fine-tuning can't fix growth damage. GRPO, Rho-1, and LoRA SFT all plateau at C-grade. The growth damage is permanent.

  4. The gate bug is consistent. The 300M model, ATLESQwen v1, ATLESQwen v2 β€” all had the same initialization. All had dead novel features.

Without batch scanning, these patterns would have taken months to notice. With model-clinic, the entire training history was diagnosed in 45 minutes.
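A batch scan of this kind reduces to looping the `model-clinic exam` command over every checkpoint. A sketch via `subprocess` (the report format, and whether the CLI has a built-in batch mode, are unknowns here; only the `exam` subcommand shown earlier is taken from the case study):

```python
import subprocess
from pathlib import Path

def batch_exam(checkpoint_dir: str) -> dict:
    """Run `model-clinic exam` on every .pt file under checkpoint_dir
    and return each file's raw report text, keyed by filename."""
    reports = {}
    for ckpt in sorted(Path(checkpoint_dir).glob("**/*.pt")):
        result = subprocess.run(
            ["model-clinic", "exam", str(ckpt)],
            capture_output=True, text=True,
        )
        reports[ckpt.name] = result.stdout
    return reports
```

From there, grepping the health-score line out of each report is enough to build the grade distribution table above.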


## What model-clinic Is and Isn't

**What it is:**

- An early warning system that catches training bugs before they waste hundreds of hours
- A forensic tool that tells you exactly what went wrong after training
- A batch analysis tool that reveals patterns across checkpoint histories
- Weight-level surgery that can improve health scores by 26+ points

**What it isn't:**

- A magic fix for fundamentally broken models
- A substitute for proper architecture validation before training
- Able to inject knowledge that was never learned

**The honest bottom line:** model-clinic's biggest value isn't repair; it's prevention. Diagnosing the 300M model's corpse and applying that lesson to ATLESQwen before it broke saved more compute time than any amount of post-hoc weight surgery could recover.


## Tools

All analysis performed with model-clinic v0.4.0.

Install: `pip install model-clinic`

Built by one developer and one AI on an RTX 3060.