Loss Is Not Enough: The Golden Window in Neural Network Training
Song Yue · Independent Researcher · July 2026
Abstract
Training and fine-tuning a neural network proceeds through three structurally distinct phases (Build, Collapse, and Rebuild) that loss, the standard optimization metric, cannot distinguish. During Collapse, structural order drops 49% while loss continues to improve. Phase Structure, one of four complementary diagnostics from FPP, locates the golden window: the point where capability peaks before it is sacrificed for marginal gains. Across 13 models spanning 7.5M–14B parameters and 5 architecture families, Green Symmetry (GS) varies 2× and Mutual Information (MI) varies 40× within the same family. Momentum injection produces three distinct response modes: mild (+19%, SwiGLU+MHA), strong (+32%, ReLU), and structurally dangerous (GeGLU, unsafe beyond β=0.02). All experiments were run on consumer hardware (GTX 1650 Ti 4GB + Intel Ultra 30GB). No cloud. No cluster.
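The idea that a structural score can fall while loss keeps improving can be illustrated with a small sketch. This is not the FPP Phase Structure diagnostic: the function name `find_collapse`, the per-checkpoint "order" score, and the numbers below are all hypothetical, chosen only to show the shape of the check.

```python
from typing import List, Tuple


def find_collapse(loss: List[float], order: List[float]) -> Tuple[int, List[int]]:
    """Return (peak_index, collapse_indices) for per-checkpoint series.

    peak_index: checkpoint where the structural-order score is highest.
    collapse_indices: checkpoints after the peak where the order score fell
    while loss still improved relative to the previous checkpoint.
    """
    if not loss or len(loss) != len(order):
        raise ValueError("loss and order must be non-empty and equally long")
    # First checkpoint with the maximum order score (hypothetical "golden window").
    peak = max(range(len(order)), key=order.__getitem__)
    collapse = [
        i
        for i in range(peak + 1, len(order))
        if order[i] < order[i - 1] and loss[i] < loss[i - 1]
    ]
    return peak, collapse


# Made-up illustrative values, not measurements from the paper.
loss = [2.0, 1.5, 1.2, 1.1, 1.05, 1.0]
order = [0.2, 0.5, 0.8, 0.6, 0.45, 0.55]

peak, collapse = find_collapse(loss, order)
print(peak, collapse)
```

In this toy series loss decreases at every checkpoint, so loss alone gives no signal that anything changed after the peak; only the second series does.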
Quick Start
pip install torch transformers numpy scipy scikit-learn sentencepiece accelerate
python fpp_health.py --model Qwen/Qwen2.5-0.5B-Instruct
Key Findings
- Three-phase lifecycle: Build → Collapse → Rebuild (loss sees one, FPP sees three)
- Green Symmetry (GS) is U-shaped in scale within Qwen2.5: peaks at 1.5B (0.89), dips at 7B (0.81), recovers at 14B (0.89)
- Mutual Information (MI) varies 40×: TinyLlama (0.771) has about 11× the information capacity of Qwen (0.069)
- Phase is architecture-stable: 0.18–0.47 across all models
- β safety is family-specific: SwiGLU+MHA [0.05,0.20], GeGLU [0.001,0.02], SmolLM2-1.7B NONE
- All on consumer hardware: GTX 1650 Ti (models ≤1.7B) + Intel Ultra (7–14B)
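The family-specific β ranges listed above could be encoded as a simple guard before any momentum-injection run. This is a minimal sketch: the `SAFE_BETA` dictionary and `is_beta_safe` helper are hypothetical names, and treating each range as a closed interval is an assumption, since the text does not say whether the endpoints are included.

```python
from typing import Dict, Optional, Tuple

# Ranges copied from the Key Findings list; None means no safe β was found.
SAFE_BETA: Dict[str, Optional[Tuple[float, float]]] = {
    "SwiGLU+MHA": (0.05, 0.20),
    "GeGLU": (0.001, 0.02),
    "SmolLM2-1.7B": None,
}


def is_beta_safe(key: str, beta: float) -> bool:
    """True if beta lies in the listed range (assumed closed) for this key."""
    bounds = SAFE_BETA[key]  # raises KeyError for unlisted families
    if bounds is None:
        return False
    low, high = bounds
    return low <= beta <= high


print(is_beta_safe("GeGLU", 0.02))        # upper endpoint, assumed inclusive
print(is_beta_safe("GeGLU", 0.05))        # beyond the listed GeGLU range
print(is_beta_safe("SmolLM2-1.7B", 0.01))  # no safe range listed
```

Raising `KeyError` for unlisted families is deliberate here: an unknown architecture should not silently pass a safety check.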
Models Evaluated
| Model | Family | GS | MI | Phase | Safe β |
|---|---|---|---|---|---|
| Qwen2.5-0.5B | SwiGLU+MHA | 0.81 | 0.09 | 0.42 | [0.05,0.20] |
| Qwen2.5-1.5B | SwiGLU+MHA | 0.89 | 0.10 | 0.32 | [0.05,0.20] |
| Qwen2.5-7B | SwiGLU+MHA | 0.81 | 0.06 | 0.40 | [0.01,0.50] |
| Qwen2.5-14B | SwiGLU+MHA | 0.89 | 0.08 | 0.29 | [0.01,0.20] |
| SmolLM2-360M | SwiGLU+GQA | 0.89 | 0.07 | 0.31 | [0.01,0.02] |
| SmolLM2-1.7B | SwiGLU+GQA | 0.43 | 0.22 | 0.47 | NONE |
| TinyLlama-1.1B | SwiGLU+GQA | 0.80 | 0.77 | 0.39 | [0.01,0.02] |
| Gemma-3-1B | GeGLU | 0.28 | 0.10 | 0.31 | [0.001,0.02] |
| Gemma-2-9B | GeGLU | 0.91 | 0.05 | 0.06 | [0.01,0.05] |
| OPT-1.3B | ReLU | 0.60 | 0.14 | 0.21 | [0.01,0.20] |
| Pythia-160M | GELU | 0.46 | 0.20 | 0.35 | [0.01,0.02] |
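To compare the columns of the table at a glance, a sketch like the following finds the minimum and maximum of each metric and the model that attains it. The values are the rounded figures shown in the table, and the helper name `column_extremes` is hypothetical; any summary computed this way reflects only the rows and rounding shown here.

```python
from typing import List, Tuple

# Rows copied from the table above: (model, GS, MI, Phase).
ROWS: List[Tuple[str, float, float, float]] = [
    ("Qwen2.5-0.5B", 0.81, 0.09, 0.42),
    ("Qwen2.5-1.5B", 0.89, 0.10, 0.32),
    ("Qwen2.5-7B", 0.81, 0.06, 0.40),
    ("Qwen2.5-14B", 0.89, 0.08, 0.29),
    ("SmolLM2-360M", 0.89, 0.07, 0.31),
    ("SmolLM2-1.7B", 0.43, 0.22, 0.47),
    ("TinyLlama-1.1B", 0.80, 0.77, 0.39),
    ("Gemma-3-1B", 0.28, 0.10, 0.31),
    ("Gemma-2-9B", 0.91, 0.05, 0.06),
    ("OPT-1.3B", 0.60, 0.14, 0.21),
    ("Pythia-160M", 0.46, 0.20, 0.35),
]
COLUMNS = {"GS": 1, "MI": 2, "Phase": 3}


def column_extremes(rows, col):
    """Return ((model, min_value), (model, max_value)) for one column index."""
    lowest = min(rows, key=lambda r: r[col])
    highest = max(rows, key=lambda r: r[col])
    return (lowest[0], lowest[col]), (highest[0], highest[col])


for name, idx in COLUMNS.items():
    (lo_model, lo), (hi_model, hi) = column_extremes(ROWS, idx)
    print(f"{name}: min {lo} ({lo_model}), max {hi} ({hi_model})")
```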
Citation
@article{yue2026loss,
title={Loss Is Not Enough: The Golden Window in Neural Network Training},
author={Yue, Song},
year={2026},
eprint={pending},
archivePrefix={arXiv}
}