FOAM V4: Forward Optimizing Adaptive Memory
Neural Growth Without Catastrophic Forgetting
Connor Spartan | January 2026
Related: spartan8806/ATLES-1.5B | Merge Paper
Abstract
We present FOAM V4 (Forward Optimizing Adaptive Memory), a neural growth system that enables continued learning without catastrophic forgetting. The key innovation is deceptively simple: neurons should only grow, never be replaced or pruned. Applied to Qwen2.5-0.5B-Instruct (494M parameters), V4 maintains baseline performance after domain-specific training (47.82% vs 46.68% baseline average across 5 benchmarks) while normal fine-tuning collapses to 41.60%. V4 demonstrates complete resistance to catastrophic forgetting during domain shift testing (0.00% degradation vs -8.34% for the previous version with neuron replacement). We document the full evolution from V1 through V4, showing that neuron replacement — despite seeming theoretically sound — is the primary cause of forgetting in growth-based systems.
1. The Problem: Catastrophic Forgetting
Catastrophic forgetting is one of the fundamental challenges in neural network training. When a model is fine-tuned on new data, it tends to overwrite the weights that encoded prior knowledge, resulting in degraded performance on previously learned tasks.
The Standard Fine-Tuning Failure Mode
In our experiments, we observed this phenomenon directly:
| Task | Baseline | After Fine-tune | Change |
|---|---|---|---|
| ARC-Easy | 64.6% | 47.2% | -17.4% |
| ARC-Challenge | 30.7% | 28.3% | -2.4% |
| HellaSwag | 40.3% | 35.5% | -4.8% |
The fine-tuned model performed worse on the task it was specifically trained on (ARC) than the untrained baseline. This is the catastrophic forgetting problem in action — the model destroyed its general reasoning capabilities while attempting to specialize.
Why This Matters
For practical AI systems, catastrophic forgetting means:
- Models cannot be continuously improved without full retraining
- Domain specialization comes at the cost of general capabilities
- Training is a one-shot process rather than incremental learning
FOAM V4 addresses this by fundamentally changing how the network adapts to new information.
2. The Journey: From V1 to V4
2.1 V1/V2: Initial Growth Experiments
The initial FOAM experiments focused on proving that neural growth was possible during training. Key developments included:
- GrowableLinear layers: Custom PyTorch modules that can expand their neuron count during training
- Gradient-based growth triggers: New neurons added when gradient magnitude exceeds thresholds
- Paired growth: Gate, up, and down projections in FFN layers grow together to maintain architectural consistency
V1/V2 demonstrated that growth was mechanically possible but lacked sophisticated controls for when and how growth should occur.
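For concreteness, a growable linear layer can be pictured as below. This is an illustrative PyTorch reconstruction, not the released GrowableLinear code; the near-zero initialization of new neurons and the attribute names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GrowableLinear(nn.Module):
    # Linear layer whose output width can expand during training (illustrative sketch).
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)

    def grow_neurons(self, count: int) -> None:
        # Append `count` new output neurons. New rows start near zero so the
        # layer's existing function is unchanged at the moment of growth.
        new_w = torch.randn(count, self.in_features, device=self.weight.device) * 1e-3
        new_b = torch.zeros(count, device=self.bias.device)
        self.weight = nn.Parameter(torch.cat([self.weight.data, new_w], dim=0))
        self.bias = nn.Parameter(torch.cat([self.bias.data, new_b], dim=0))
        self.out_features += count
        # Note: any optimizer holding the old Parameters must be refreshed after growth.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight, self.bias)

In the FFN, the gate and up projections grow along their output dimension while the down projection grows along its input dimension, which is how a single growth event keeps the three projections shape-compatible (the paired growth described above).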
2.2 V3: Smart Growth with Replacement
V3 introduced more sophisticated growth controls:
- Percentile-based thresholds: Only the top 10% of gradients trigger growth (prevents runaway expansion)
- Growth cooldown: Minimum 200 steps between growth events per layer
- Neuron replacement: Low-utility neurons could be replaced by new ones
- Apoptosis (pruning): Inactive neurons could be removed entirely
The replacement mechanism seemed theoretically sound — why keep neurons that aren't contributing? However, testing revealed a critical flaw.
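To make the difference concrete, the V3 replacement step can be pictured roughly as follows. This is a hypothetical sketch, not the V3 source: the utility measure (for example, mean activation magnitude), the replacement fraction, and the re-initialization scale are all assumptions.

import torch

def replace_low_utility_neurons(layer, utility: torch.Tensor, replace_frac: float = 0.05):
    # V3-style replacement (disabled in V4): re-initialize the lowest-utility
    # neurons so that gradients from the new domain can repurpose them.
    k = max(1, int(replace_frac * utility.numel()))
    _, victims = torch.topk(utility, k, largest=False)  # lowest-utility neuron indices
    with torch.no_grad():
        layer.weight[victims] = torch.randn_like(layer.weight[victims]) * 1e-3
        layer.bias[victims] = 0.0

Each re-initialized row overwrites whatever the old neuron encoded, which is exactly how ARC-specific neurons were lost in the domain shift test below.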
The V3 Failure: Domain Shift Test
We trained V3 on ARC data, then continued training on balanced general data:
| Model | After ARC Training | After Balanced Training | Change |
|---|---|---|---|
| V3 Growth | 38.67% | 30.33% | -8.34% |
V3 suffered catastrophic forgetting. The replacement mechanism was the culprit — neurons useful for ARC reasoning were being replaced by neurons for the new domain, destroying the specialized knowledge.
2.3 V4: The Grow-Only Breakthrough
The V4 hypothesis was simple: disable replacement and pruning entirely. Let neurons accumulate rather than compete.
Changes from V3 to V4:
# V3 (problematic)
enable_replacement = True # Replace low-utility neurons
enable_pruning = True # Remove inactive neurons
# V4 (solution)
enable_replacement = False # Never replace neurons
enable_pruning = False # Never remove neurons
The same domain shift test with V4:
| Model | After ARC Training | After Balanced Training | Change |
|---|---|---|---|
| V4 Grow-Only | 43.00% | 43.00% | 0.00% |
V4 maintained its performance perfectly. The accumulated neurons from ARC training coexisted peacefully with neurons grown during balanced training.
3. Technical Implementation
3.1 Architecture Overview
FOAM V4 wraps a standard transformer (Qwen2.5-0.5B-Instruct) with growable FFN layers:
Input -> Attention (frozen) -> GrowableFFN -> Output
GrowableFFN Structure:
|-- gate_proj: GrowableLinearV2 (can expand width)
|-- up_proj: GrowableLinearV2 (grows with gate)
+-- down_proj: GrowableLinearV2 (grows with gate/up)
Key architectural decisions:
- Attention layers frozen: Growth only in FFN layers (attention already captures relational patterns well)
- Embeddings frozen: Token representations remain stable
- LM head frozen: Output vocabulary mapping unchanged
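A minimal sketch of this selective freezing, assuming the Hugging Face Qwen2 module layout and a hypothetical to_growable() helper that swaps an nn.Linear for the GrowableLinear sketch from Section 2.1:

from transformers import AutoModelForCausalLM

def to_growable(linear):
    # Hypothetical conversion: copy existing weights into a growable layer.
    g = GrowableLinear(linear.in_features, linear.out_features)
    g.weight.data.copy_(linear.weight.data)
    if linear.bias is not None:
        g.bias.data.copy_(linear.bias.data)
    return g

def wrap_with_growable_ffn(model_name: str = "Qwen/Qwen2.5-0.5B-Instruct"):
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Freeze everything: attention, embeddings, and the LM head stay fixed.
    for p in model.parameters():
        p.requires_grad = False

    # Replace each FFN projection with a growable, trainable version.
    for block in model.model.layers:
        mlp = block.mlp
        mlp.gate_proj = to_growable(mlp.gate_proj)
        mlp.up_proj = to_growable(mlp.up_proj)
        mlp.down_proj = to_growable(mlp.down_proj)
        for p in mlp.parameters():
            p.requires_grad = True

    return model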
3.2 Growth Mechanism
Growth is triggered by gradient analysis during training:
def check_and_grow(self):
    for layer in self.growable_layers:
        # Respect the cooldown: at least 200 steps between growth events per layer
        if layer.steps_since_last_growth < 200:
            continue

        # Collect per-neuron gradient magnitudes (mean over the input dimension)
        grad_mag = layer.weight.grad.abs().mean(dim=1)

        # Only the top 10% of gradients are eligible to trigger growth
        threshold = torch.quantile(grad_mag, 0.90)

        # Grow when the strongest gradient clears both the percentile cutoff
        # and the absolute minimum magnitude (growth_threshold = 0.001)
        if grad_mag.max() > threshold and grad_mag.max() > 0.001:
            layer.grow_neurons(count=10)  # add 10 neurons per growth event
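Where this check runs matters: gradients must still be populated, so growth is evaluated after the backward pass and before the optimizer step. The loop below is a rough illustration, assuming a check_and_grow variant that reports whether any layer grew (new Parameters must be re-registered with the optimizer) and a placeholder learning rate.

import torch

for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()

    grew = model.check_and_grow()  # may append neurons while grads are available
    if grew:
        # New Parameters are unknown to the existing optimizer; rebuild it so the
        # grown neurons receive updates (note this resets optimizer state).
        optimizer = torch.optim.AdamW(
            (p for p in model.parameters() if p.requires_grad), lr=2e-5
        )

    optimizer.step()
    optimizer.zero_grad()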
3.3 Key Parameters
| Parameter | V4 Value | Purpose |
|---|---|---|
| enable_replacement | False | Never replace existing neurons |
| enable_pruning | False | Never remove neurons |
| gradient_percentile | 90.0 | Only top 10% of gradients trigger growth |
| growth_cooldown | 200 | Steps between growth events |
| max_growth_per_step | 10 | Neurons added per growth event |
| growth_threshold | 0.001 | Minimum gradient magnitude |
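Taken together, these settings read like a single growth configuration. The dataclass below is an illustrative container (the GrowthConfig name is not from the FOAM source); the values are copied from the table.

from dataclasses import dataclass

@dataclass
class GrowthConfig:
    enable_replacement: bool = False   # V4: never replace existing neurons
    enable_pruning: bool = False       # V4: never remove neurons
    gradient_percentile: float = 90.0  # only the top 10% of gradients trigger growth
    growth_cooldown: int = 200         # minimum steps between growth events per layer
    max_growth_per_step: int = 10      # neurons added per growth event
    growth_threshold: float = 0.001    # minimum gradient magnitude required to grow

Setting enable_replacement and enable_pruning back to True reproduces the V3 behavior that caused the -8.34% collapse in the domain shift test.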
4. Experimental Results
4.1 Benchmark Methodology
All benchmarks used consistent methodology:
- Sample size: 1000 examples per task (except where noted)
- Tasks: ARC-Easy, ARC-Challenge, HellaSwag, TruthfulQA, Winogrande
- Evaluation: Multiple-choice accuracy using model log-probabilities
- Hardware: NVIDIA RTX 4070 (12GB VRAM)
Standard errors for 1000-sample benchmarks are approximately +/-1.5%.
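The multiple-choice scoring amounts to a log-likelihood comparison: each answer option is scored by the total log-probability the model assigns to its tokens given the question, and the highest-scoring option is the prediction. The sketch below is illustrative only (it is not the project's run_standard_benchmarks.py and ignores details such as length normalization and prompt formatting).

import torch

@torch.no_grad()
def score_choice(model, tokenizer, question: str, choice: str) -> float:
    # Total log-probability of the choice tokens, conditioned on the question.
    # Assumes the question tokenizes identically as a prefix of the full string.
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
    logits = model(full_ids).logits                         # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position t predicts token t+1
    choice_start = prompt_ids.shape[1]
    total = 0.0
    for t in range(choice_start - 1, full_ids.shape[1] - 1):
        total += log_probs[t, full_ids[0, t + 1]].item()
    return total

def predict(model, tokenizer, question: str, choices: list[str]) -> int:
    # Index of the answer option with the highest total log-probability.
    scores = [score_choice(model, tokenizer, question, c) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])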
4.2 V4 vs Baseline vs Normal Fine-tune
Complete benchmark results (1000 samples each):
| Task | V4 Growth | Baseline Qwen | Normal Fine-tune |
|---|---|---|---|
| ARC-Easy | 64.20% | 64.60% | 47.20% |
| ARC-Challenge | 34.20% | 30.70% | 28.30% |
| HellaSwag | 40.20% | 40.30% | 35.50% |
| TruthfulQA | 46.20% | 41.80% | 44.68% |
| Winogrande | 54.20% | 56.00% | 52.30% |
| Average | 47.82% | 46.68% | 41.60% |
Statistical Analysis
- V4 vs Baseline: +1.14% (within the margin of error; not a statistically significant difference)
- V4 vs Fine-tune: +6.22% (statistically significant)
- Baseline vs Fine-tune: +5.08% (fine-tune is significantly worse)
The critical finding: normal fine-tuning made the model worse than doing nothing.
4.3 Domain Shift Testing
To test catastrophic forgetting resistance, we trained models on ARC data, then continued training on balanced general data:
| Model | After ARC | After Balanced | Change | Status |
|---|---|---|---|---|
| Normal Fine-tune | ~38% | 39.83% | +1.83% | Minor improvement |
| V3 (replacement) | 38.67% | 30.33% | -8.34% | Catastrophic forgetting |
| V4 (grow-only) | 43.00% | 43.00% | 0.00% | Stable |
V4 demonstrated complete resistance to catastrophic forgetting during domain shift.
5. Key Findings
Finding 1: Grow-Only Prevents Catastrophic Forgetting
The single most important change from V3 to V4 was disabling neuron replacement. When neurons can only be added (never removed or replaced), knowledge accumulates rather than competes.
Finding 2: V4's Value is Durability, Not Raw Performance
V4 does not significantly outperform baseline Qwen (~1% difference, within error margins). However, V4 enables continued training without collapse, while normal fine-tuning degrades performance. The value proposition is:
V4 makes training safe, not faster.
Finding 3: Normal Fine-Tuning Can Make Models Worse
Our normal fine-tune checkpoint performed worse than the untrained baseline on the exact task it was trained for (ARC). This is a stark demonstration of catastrophic forgetting's severity.
Finding 4: 4 Epochs is Optimal for V4
Extended training (5 epochs vs 4 epochs) showed no improvement:
| Epochs | Average Score | Neurons Grown |
|---|---|---|
| 4 | 43.00% | +20,160 |
| 5 | 43.00% | +24,960 |
Diminishing returns set in after 4 epochs. Additional neurons are grown but don't improve performance.
Finding 5: Neuron Growth is Substantial
V4 training on ARC (4 epochs) grew the model from 254,976 to 275,136 neurons (+20,160 neurons, +7.9% growth). These neurons coexist with existing neurons without interference.
6. Limitations and Future Work
Current Limitations
- VRAM Growth: Grow-only means models get larger over time. Extended training may hit memory limits.
- No Proven Scaling: V4 tested only on Qwen 0.5B. Behavior on larger models (7B, 70B) is unknown.
- Single Domain Testing: Extensive testing on ARC; other domains (code, math, multilingual) not fully explored.
- Marginal Raw Gains: V4 doesn't make models smarter, just more stable during training.
Future Work
- Memory-Bounded Growth: Implement maximum neuron limits with intelligent selection of which neurons to keep.
- Scaling Studies: Test V4 on larger base models to verify the approach generalizes.
- Multi-Domain Sequential Training: Train on 5+ domains sequentially to stress-test forgetting resistance.
- Inference Optimization: Explore pruning after training (grow during train, prune for deploy).
- Growth Pattern Analysis: Study which layers grow most and whether growth patterns indicate learning.
7. Conclusion
FOAM V4 demonstrates that a simple modification — disabling neuron replacement and pruning — transforms neural network training from a fragile, one-shot process into a robust, incremental one.
The key insight is counterintuitive: letting neurons accumulate freely is better than intelligently managing them. The brain doesn't delete neurons when learning new skills; perhaps neural networks shouldn't either.
While V4 doesn't produce smarter models (performance is statistically equivalent to baseline), it produces more trainable models. This distinction matters for real-world AI systems that need continuous improvement without catastrophic regression.
The path forward is clear: grow-only training enables safe, incremental learning. The question now is how to scale this approach to larger models and longer training runs while managing the inevitable growth in model size.
8. Reproducibility
Key Files
- src/model/growable_linear_v2.py - Core growable layer implementation
- src/model/growable_qwen_v2.py - Model wrapper with growth management
- scripts/train_v4_arc.py - V4 training script
- eval/run_standard_benchmarks.py - Benchmark evaluation
Hardware Requirements
- Training: 12GB VRAM minimum (RTX 4070 or better)
- Inference: 8GB VRAM sufficient for evaluation
- Storage: ~2GB per checkpoint