FOAM V4: Forward Optimizing Adaptive Memory
Neural Growth Without Catastrophic Forgetting
Connor Spartan | January 2026
Related: spartan8806/ATLES-1.5B | Merge Paper
Abstract
We present FOAM V4 (Forward Optimizing Adaptive Memory), a neural growth system that enables continued learning without catastrophic forgetting. The key innovation is deceptively simple: neurons should only grow, never be replaced or pruned. Applied to Qwen2.5-0.5B-Instruct (494M parameters), V4 maintains baseline performance after domain-specific training (47.82% vs 46.68% baseline average across 5 benchmarks) while normal fine-tuning collapses to 41.60%. V4 demonstrates complete resistance to catastrophic forgetting during domain shift testing (0.00% degradation vs -8.34% for the previous version with neuron replacement). We document the full evolution from V1 through V4, showing that neuron replacement — despite seeming theoretically sound — is the primary cause of forgetting in growth-based systems.
1. The Problem: Catastrophic Forgetting
Catastrophic forgetting is one of the fundamental challenges in neural network training. When a model is fine-tuned on new data, it tends to overwrite the weights that encoded prior knowledge, resulting in degraded performance on previously learned tasks.
The Standard Fine-Tuning Failure Mode
In our experiments, we observed this phenomenon directly:
| Task | Baseline | After Fine-tune | Change |
|---|---|---|---|
| ARC-Easy | 64.6% | 47.2% | -17.4% |
| ARC-Challenge | 30.7% | 28.3% | -2.4% |
| HellaSwag | 40.3% | 35.5% | -4.8% |
The fine-tuned model performed worse on the task it was specifically trained on (ARC) than the untrained baseline. This is the catastrophic forgetting problem in action — the model destroyed its general reasoning capabilities while attempting to specialize.
Why This Matters
For practical AI systems, catastrophic forgetting means:
- Models cannot be continuously improved without full retraining
- Domain specialization comes at the cost of general capabilities
- Training is a one-shot process rather than incremental learning
FOAM V4 addresses this by fundamentally changing how the network adapts to new information.
2. The Journey: From V1 to V4
2.1 V1/V2: Initial Growth Experiments
The initial FOAM experiments focused on proving that neural growth was possible during training. Key developments included:
- GrowableLinear layers: Custom PyTorch modules that can expand their neuron count during training
- Gradient-based growth triggers: New neurons added when gradient magnitude exceeds thresholds
- Paired growth: Gate, up, and down projections in FFN layers grow together to maintain architectural consistency
V1/V2 demonstrated that growth was mechanically possible but lacked sophisticated controls for when and how growth should occur.
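For concreteness, a growable linear layer can be pictured as below. This is an illustrative PyTorch reconstruction, not the released GrowableLinear code; the near-zero initialization of new neurons and the attribute names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GrowableLinear(nn.Module):
    # Linear layer whose output width can expand during training (illustrative sketch).
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)

    def grow_neurons(self, count: int) -> None:
        # Append `count` new output neurons. New rows start near zero so the
        # layer's existing function is unchanged at the moment of growth.
        new_w = torch.randn(count, self.in_features, device=self.weight.device) * 1e-3
        new_b = torch.zeros(count, device=self.bias.device)
        self.weight = nn.Parameter(torch.cat([self.weight.data, new_w], dim=0))
        self.bias = nn.Parameter(torch.cat([self.bias.data, new_b], dim=0))
        self.out_features += count
        # Note: any optimizer holding the old Parameters must be refreshed after growth.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight, self.bias)

In the FFN, the gate and up projections grow along their output dimension while the down projection grows along its input dimension, which is how a single growth event keeps the three projections shape-compatible (the paired growth described above).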
2.2 V3: Smart Growth with Replacement
V3 introduced more sophisticated growth controls:
- Percentile-based thresholds: Only the top 10% of gradients trigger growth (prevents runaway expansion)
- Growth cooldown: Minimum 200 steps between growth events per layer
- Neuron replacement: Low-utility neurons could be replaced by new ones
- Apoptosis (pruning): Inactive neurons could be removed entirely
The replacement mechanism seemed theoretically sound — why keep neurons that aren't contributing? However, testing revealed a critical flaw.
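To make the difference concrete, the V3 replacement step can be pictured roughly as follows. This is a hypothetical sketch, not the V3 source: the utility measure (for example, mean activation magnitude), the replacement fraction, and the re-initialization scale are all assumptions.

import torch

def replace_low_utility_neurons(layer, utility: torch.Tensor, replace_frac: float = 0.05):
    # V3-style replacement (disabled in V4): re-initialize the lowest-utility
    # neurons so that gradients from the new domain can repurpose them.
    k = max(1, int(replace_frac * utility.numel()))
    _, victims = torch.topk(utility, k, largest=False)  # lowest-utility neuron indices
    with torch.no_grad():
        layer.weight[victims] = torch.randn_like(layer.weight[victims]) * 1e-3
        layer.bias[victims] = 0.0

Each re-initialized row overwrites whatever the old neuron encoded, which is exactly how ARC-specific neurons were lost in the domain shift test below.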
The V3 Failure: Domain Shift Test
We trained V3 on ARC data, then continued training on balanced general data:
| Model | After ARC Training | After Balanced Training | Change |
|---|---|---|---|
| V3 Growth | 38.67% | 30.33% | -8.34% |
V3 suffered catastrophic forgetting. The replacement mechanism was the culprit — neurons useful for ARC reasoning were being replaced by neurons for the new domain, destroying the specialized knowledge.
2.3 V4: The Grow-Only Breakthrough
The V4 hypothesis was simple: disable replacement and pruning entirely. Let neurons accumulate rather than compete.
Changes from V3 to V4:
# V3 (problematic)
enable_replacement = True # Replace low-utility neurons
enable_pruning = True # Remove inactive neurons
# V4 (solution)
enable_replacement = False # Never replace neurons
enable_pruning = False # Never remove neurons
The same domain shift test with V4:
| Model | After ARC Training | After Balanced Training | Change |
|---|---|---|---|
| V4 Grow-Only | 43.00% | 43.00% | 0.00% |
V4 maintained its performance perfectly. The accumulated neurons from ARC training coexisted peacefully with neurons grown during balanced training.
3. Technical Implementation
3.1 Architecture Overview
FOAM V4 wraps a standard transformer (Qwen2.5-0.5B-Instruct) with growable FFN layers:
Input -> Attention (frozen) -> GrowableFFN -> Output
GrowableFFN Structure:
|-- gate_proj: GrowableLinearV2 (can expand width)
|-- up_proj: GrowableLinearV2 (grows with gate)
+-- down_proj: GrowableLinearV2 (grows with gate/up)
Key architectural decisions:
- Attention layers frozen: Growth only in FFN layers (attention already captures relational patterns well)
- Embeddings frozen: Token representations remain stable
- LM head frozen: Output vocabulary mapping unchanged
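A minimal sketch of this selective freezing, assuming the Hugging Face Qwen2 module layout and a hypothetical to_growable() helper that swaps an nn.Linear for the GrowableLinear sketch from Section 2.1:

from transformers import AutoModelForCausalLM

def to_growable(linear):
    # Hypothetical conversion: copy existing weights into a growable layer.
    g = GrowableLinear(linear.in_features, linear.out_features)
    g.weight.data.copy_(linear.weight.data)
    if linear.bias is not None:
        g.bias.data.copy_(linear.bias.data)
    return g

def wrap_with_growable_ffn(model_name: str = "Qwen/Qwen2.5-0.5B-Instruct"):
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Freeze everything: attention, embeddings, and the LM head stay fixed.
    for p in model.parameters():
        p.requires_grad = False

    # Replace each FFN projection with a growable, trainable version.
    for block in model.model.layers:
        mlp = block.mlp
        mlp.gate_proj = to_growable(mlp.gate_proj)
        mlp.up_proj = to_growable(mlp.up_proj)
        mlp.down_proj = to_growable(mlp.down_proj)
        for p in mlp.parameters():
            p.requires_grad = True

    return model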
3.2 Growth Mechanism
Growth is triggered by gradient analysis during training:
def check_and_grow(self):
    for layer in self.growable_layers:
        # Respect the cooldown: at least 200 steps between growth events per layer
        if layer.steps_since_last_growth < 200:
            continue

        # Collect per-neuron gradient magnitudes (mean over the input dimension)
        grad_mag = layer.weight.grad.abs().mean(dim=1)

        # Only the top 10% of gradients are eligible to trigger growth
        threshold = torch.quantile(grad_mag, 0.90)

        # Grow when the strongest gradient clears both the percentile cutoff
        # and the absolute minimum magnitude (growth_threshold = 0.001)
        if grad_mag.max() > threshold and grad_mag.max() > 0.001:
            layer.grow_neurons(count=10)  # add 10 neurons per growth event
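Where this check runs matters: gradients must still be populated, so growth is evaluated after the backward pass and before the optimizer step. The loop below is a rough illustration, assuming a check_and_grow variant that reports whether any layer grew (new Parameters must be re-registered with the optimizer) and a placeholder learning rate.

import torch

for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()

    grew = model.check_and_grow()  # may append neurons while grads are available
    if grew:
        # New Parameters are unknown to the existing optimizer; rebuild it so the
        # grown neurons receive updates (note this resets optimizer state).
        optimizer = torch.optim.AdamW(
            (p for p in model.parameters() if p.requires_grad), lr=2e-5
        )

    optimizer.step()
    optimizer.zero_grad()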
3.3 Key Parameters
| Parameter | V4 Value | Purpose |
|---|---|---|
| enable_replacement | False | Never replace existing neurons |
| enable_pruning | False | Never remove neurons |
| gradient_percentile | 90.0 | Only top 10% of gradients trigger growth |
| growth_cooldown | 200 | Steps between growth events |
| max_growth_per_step | 10 | Neurons added per growth event |
| growth_threshold | 0.001 | Minimum gradient magnitude |
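Taken together, these settings read like a single growth configuration. The dataclass below is an illustrative container (the GrowthConfig name is not from the FOAM source); the values are copied from the table.

from dataclasses import dataclass

@dataclass
class GrowthConfig:
    enable_replacement: bool = False   # V4: never replace existing neurons
    enable_pruning: bool = False       # V4: never remove neurons
    gradient_percentile: float = 90.0  # only the top 10% of gradients trigger growth
    growth_cooldown: int = 200         # minimum steps between growth events per layer
    max_growth_per_step: int = 10      # neurons added per growth event
    growth_threshold: float = 0.001    # minimum gradient magnitude required to grow

Setting enable_replacement and enable_pruning back to True reproduces the V3 behavior that caused the -8.34% collapse in the domain shift test.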
4. Experimental Results
4.1 Benchmark Methodology
All benchmarks used consistent methodology:
- Sample size: 1000 examples per task (except where noted)
- Tasks: ARC-Easy, ARC-Challenge, HellaSwag, TruthfulQA, Winogrande
- Evaluation: Multiple-choice accuracy using model log-probabilities
- Hardware: NVIDIA RTX 4070 (12GB VRAM)
Standard errors for 1000-sample benchmarks are approximately +/-1.5%.
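The multiple-choice scoring amounts to a log-likelihood comparison: each answer option is scored by the total log-probability the model assigns to its tokens given the question, and the highest-scoring option is the prediction. The sketch below is illustrative only (it is not the project's run_standard_benchmarks.py and ignores details such as length normalization and prompt formatting).

import torch

@torch.no_grad()
def score_choice(model, tokenizer, question: str, choice: str) -> float:
    # Total log-probability of the choice tokens, conditioned on the question.
    # Assumes the question tokenizes identically as a prefix of the full string.
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
    logits = model(full_ids).logits                         # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position t predicts token t+1
    choice_start = prompt_ids.shape[1]
    total = 0.0
    for t in range(choice_start - 1, full_ids.shape[1] - 1):
        total += log_probs[t, full_ids[0, t + 1]].item()
    return total

def predict(model, tokenizer, question: str, choices: list[str]) -> int:
    # Index of the answer option with the highest total log-probability.
    scores = [score_choice(model, tokenizer, question, c) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])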
4.2 V4 vs Baseline vs Normal Fine-tune
Complete benchmark results (1000 samples each):
| Task | V4 Growth | Baseline Qwen | Normal Fine-tune |
|---|---|---|---|
| ARC-Easy | 64.20% | 64.60% | 47.20% |
| ARC-Challenge | 34.20% | 30.70% | 28.30% |
| HellaSwag | 40.20% | 40.30% | 35.50% |
| TruthfulQA | 46.20% | 41.80% | 44.68% |
| Winogrande | 54.20% | 56.00% | 52.30% |
| Average | 47.82% | 46.68% | 41.60% |
Statistical Analysis
- V4 vs Baseline: +1.14% (within the margin of error; not a statistically significant difference)
- V4 vs Fine-tune: +6.22% (statistically significant)
- Baseline vs Fine-tune: +5.08% (fine-tune is significantly worse)
The critical finding: normal fine-tuning made the model worse than doing nothing.
4.3 Domain Shift Testing
To test catastrophic forgetting resistance, we trained models on ARC data, then continued training on balanced general data:
| Model | After ARC | After Balanced | Change | Status |
|---|---|---|---|---|
| Normal Fine-tune | ~38% | 39.83% | +1.83% | Minor improvement |
| V3 (replacement) | 38.67% | 30.33% | -8.34% | Catastrophic forgetting |
| V4 (grow-only) | 43.00% | 43.00% | 0.00% | Stable |
V4 demonstrated complete resistance to catastrophic forgetting during domain shift.
5. Key Findings
Finding 1: Grow-Only Prevents Catastrophic Forgetting
The single most important change from V3 to V4 was disabling neuron replacement. When neurons can only be added (never removed or replaced), knowledge accumulates rather than competes.
Finding 2: V4's Value is Durability, Not Raw Performance
V4 does not significantly outperform baseline Qwen (~1% difference, within error margins). However, V4 enables continued training without collapse, while normal fine-tuning degrades performance. The value proposition is:
V4 makes training safe, not faster.
Finding 3: Normal Fine-Tuning Can Make Models Worse
Our normal fine-tune checkpoint performed worse than the untrained baseline on the exact task it was trained for (ARC). This is a stark demonstration of catastrophic forgetting's severity.
Finding 4: 4 Epochs is Optimal for V4
Extended training (5 epochs vs 4 epochs) showed no improvement:
| Epochs | Average Score | Neurons Grown |
|---|---|---|
| 4 | 43.00% | +20,160 |
| 5 | 43.00% | +24,960 |
Diminishing returns set in after 4 epochs. Additional neurons are grown but don't improve performance.
Finding 5: Neuron Growth is Substantial
V4 training on ARC (4 epochs) grew the model from 254,976 to 275,136 neurons (+20,160 neurons, +7.9% growth). These neurons coexist with existing neurons without interference.
6. Limitations and Future Work
Current Limitations
- VRAM Growth: Grow-only means models get larger over time. Extended training may hit memory limits.
- No Proven Scaling: V4 tested only on Qwen 0.5B. Behavior on larger models (7B, 70B) is unknown.
- Single Domain Testing: Extensive testing on ARC; other domains (code, math, multilingual) not fully explored.
- Marginal Raw Gains: V4 doesn't make models smarter, just more stable during training.
Future Work
- Memory-Bounded Growth: Implement maximum neuron limits with intelligent selection of which neurons to keep.
- Scaling Studies: Test V4 on larger base models to verify the approach generalizes.
- Multi-Domain Sequential Training: Train on 5+ domains sequentially to stress-test forgetting resistance.
- Inference Optimization: Explore pruning after training (grow during train, prune for deploy).
- Growth Pattern Analysis: Study which layers grow most and whether growth patterns indicate learning.
7. Conclusion
FOAM V4 demonstrates that a simple modification — disabling neuron replacement and pruning — transforms neural network training from a fragile, one-shot process into a robust, incremental one.
The key insight is counterintuitive: letting neurons accumulate freely is better than intelligently managing them. The brain doesn't delete neurons when learning new skills; perhaps neural networks shouldn't either.
While V4 doesn't produce smarter models (performance is statistically equivalent to baseline), it produces more trainable models. This distinction matters for real-world AI systems that need continuous improvement without catastrophic regression.
The path forward is clear: grow-only training enables safe, incremental learning. The question now is how to scale this approach to larger models and longer training runs while managing the inevitable growth in model size.
8. Reproducibility
Key Files
- src/model/growable_linear_v2.py - Core growable layer implementation
- src/model/growable_qwen_v2.py - Model wrapper with growth management
- scripts/train_v4_arc.py - V4 training script
- eval/run_standard_benchmarks.py - Benchmark evaluation
Hardware Requirements
- Training: 12GB VRAM minimum (RTX 4070 or better)
- Inference: 8GB VRAM sufficient for evaluation
- Storage: ~2GB per checkpoint