---
license: mit
datasets:
- sedthh/gutenberg_english
- teknium/OpenHermes-2.5
pipeline_tag: text-generation
---
# MicroExperts/NG Architecture
### Split or die trying.
**Self-organizing 1-bit mixture-of-experts for continual learning without catastrophic forgetting.**
---
### Warning: this is not a paper; it is an architecture outline that is still not 100% concrete (I will probably scrap the BitNet part), plus a more-or-less dev diary at the bottom. I still have not done enough testing to prove that it fully works as intended.
### Warning 2: the text is partly AI-written (Claude Opus and Gemini Pro); I rewrote the most cringeworthy sentences and fixed most errors.
# Plan
## 01 · The Problem
Standard neural networks store knowledge in shared parameters. Training on new data overwrites weights encoding old knowledge: **catastrophic forgetting**.
This was first identified by McCloskey & Cohen, who showed that a backpropagation network trained on ones addition facts completely lost that knowledge when retrained on twos [1]. Ratcliff established that the root cause is representational overlap at hidden layers: when many shared weights change, prior knowledge cannot survive [2]. French's comprehensive review concluded that dual memory systems separating short-term and long-term storage were necessary to overcome it [3].
Existing solutions all have fundamental limits: **EWC/SI** accumulate penalty until the model can't learn. **Replay buffers** require storing data forever. **Progressive networks** grow linearly without sharing. **Masking** fixes capacity at init and can't grow. The structure itself never self-organizes; it has to be designed.
> **MicroExperts' response:** New knowledge gets new parameters via expert splits. Old knowledge lives in parameters that receive no gradients unless relevant. Protection is structural. And the system grows its own capacity to match data complexity.
---
## 02 · Architecture
A transformer where the FFN in each block is replaced by a **dynamic Mixture-of-Experts layer** with ultra-small 1.58-bit quantized experts.
```
Tokens → Embedding → Attention → Adaptive Router → Expert Pool → Weighted Sum → LM Head
```
The experts are small feedforward networks built from **BitLinear layers**, 1.58-bit quantized linear layers borrowed from Microsoft's BitNet research [15]. Weights are ternarized to {-1, 0, +1} using round() with mean-absolute-value scaling. Activations are quantized to 8-bit via absmax. The straight-through estimator enables gradient flow.
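A NumPy sketch of this quantization scheme, in its inference view (the real BitNet kernels pack weights differently, and the straight-through estimator only matters for the backward pass):

```python
import numpy as np

def quantize_weights(W):
    """Absmean 1.58-bit quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = np.mean(np.abs(W)) + 1e-8
    Wq = np.clip(np.round(W / scale), -1, 1)
    return Wq, scale

def quantize_activations(x, bits=8):
    """Per-tensor absmax quantization to the signed 8-bit range."""
    qmax = 2 ** (bits - 1) - 1          # 127
    scale = np.max(np.abs(x)) + 1e-8
    xq = np.clip(np.round(x / scale * qmax), -qmax, qmax)
    return xq, scale

def bitlinear_forward(x, W):
    """BitLinear forward: integer-domain matmul, then rescale back to float.
    During training, gradients would pass straight through both round() calls."""
    Wq, w_scale = quantize_weights(W)
    xq, x_scale = quantize_activations(x)
    return (xq @ Wq.T) * (x_scale / 127) * w_scale
```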
Kawata et al. proved theoretically that vanilla networks fail to detect latent organizational structure in data; they process the problem as a whole. MoE succeeds by dividing it into easier subproblems [9]. Li et al. provided the first theoretical proof that MoE can diversify experts and prevent forgetting in continual learning [8].
### Expert Size Tiers
| Tier | Hidden Dim | Params | Memory (1-bit) | Role |
|------|-----------|--------|----------------|------|
| 0 | 512 | ~1M | ~125 KB | Narrow specialists |
| 1 | 1,024 | ~4M | ~500 KB | Domain experts |
| 2 | 2,048 | ~16M | ~2 MB | Broad generalists |
| 3 | 4,096 | ~64M | ~8 MB | Monolith / max capacity |
Powers-of-4 sizing ensures clean merge arithmetic. Fixed tier sizes also eliminate GPU memory fragmentation: pre-allocated slab pools per tier recycle blocks on death and hand them out on birth. Expert tiers can be increased later.
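The tier arithmetic can be sanity-checked in a few lines. The ~4× params per tier implies params ≈ 4·hidden² (my reading of the table, e.g. two hidden × 2·hidden matrices; the exact expert shape is an assumption), and ~1 bit per ternary weight reproduces the memory column:

```python
def tier_params(tier):
    """Parameter count per tier, assuming params ~= 4 * hidden^2 (inferred
    from the table's ~4x-per-tier scaling, not stated explicitly above)."""
    hidden = 512 * (2 ** tier)          # 512, 1024, 2048, 4096
    return 4 * hidden * hidden

def tier_memory_kb(tier):
    """Memory at ~1 bit per weight (true 1.58-bit packing would be slightly more)."""
    return tier_params(tier) / 8 / 1024

# Growth arithmetic from the split and merge rules:
same_tier_split_gain = tier_params(2)                     # +16M: one extra tier-2
tier_up_merge_gain = tier_params(1) - 2 * tier_params(0)  # +2M: 2 x T0 -> 1 x T1
```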
> **Critical property:** Inactive experts receive **zero gradients**. Their weights are structurally untouched. Physical parameter isolation, not regularization or replay, is what forgetting resistance is built on.
---
## 03 · The Router
Routes inputs to experts, produces clean signals for lifecycle decisions, and adapts as experts appear and disappear.
### Embedding-Similarity (Not Linear)
Each expert has a learned embedding vector in routing space. Routing is cosine similarity between the input (projected through a routing head MLP) and expert embeddings. Adding an expert means adding one vector. Removing means deleting one. No weight matrix resizing.
### Adaptive Threshold (Not Top-K)
Fixed top-k is wrong for this architecture. The system aims for the best of both worlds: the interconnectivity of dense models and the domain-specific specialization of MoE models.
Simple inputs might activate 1 expert. Complex cross-domain inputs might activate 50. The density itself is a diagnostic signal; spikes indicate distribution shift.
Chen et al. demonstrated that partitioning features into task-specific and shared components processed by dedicated expert groups effectively mitigates gradient conflicts at their source [11]. The adaptive router achieves this dynamically.
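A minimal sketch of embedding-similarity routing with an adaptive threshold (the threshold value and the single-best fallback rule are my assumptions, not fixed design choices):

```python
import numpy as np

def route(h, expert_embs, threshold=0.3):
    """
    h: routed input vector (output of the routing-head MLP), shape (d,)
    expert_embs: (N, d) learned expert embeddings
    Returns (indices, weights): every expert whose cosine similarity with h
    exceeds the threshold, with similarities renormalized to sum to 1.
    The number of active experts (density) varies per input.
    """
    h = h / (np.linalg.norm(h) + 1e-8)
    E = expert_embs / (np.linalg.norm(expert_embs, axis=1, keepdims=True) + 1e-8)
    sims = E @ h                              # cosine similarities, shape (N,)
    active = np.where(sims > threshold)[0]
    if active.size == 0:                      # fall back to the single best expert
        active = np.array([int(np.argmax(sims))])
    w = sims[active]
    return active, w / w.sum()
```

Adding or removing an expert is just adding or removing a row of `expert_embs`; no weight matrix is resized.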
### Lifecycle Hooks
| Event | Embedding Action | Rationale |
|-------|-----------------|-----------|
| **Split** | Preserver: exact copy. Adapter: copy + noise. | Preserver handles same inputs; adapter diverges |
| **Merge** | Child: average of parents | Covers both parents' input space |
| **Death** | Embedding removed | Expert exits routing space |
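The table's embedding actions, sketched over a dict of expert_id → embedding (function and variable names are illustrative, not from the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def on_split(embs, parent_id, preserver_id, adapter_id, noise=0.05):
    """Preserver keeps the parent's exact embedding; adapter starts nearby and diverges."""
    e = embs.pop(parent_id)
    embs[preserver_id] = e.copy()
    embs[adapter_id] = e + noise * rng.standard_normal(e.shape)

def on_merge(embs, parent_a, parent_b, child_id):
    """Child's embedding is the parents' average, covering both input regions."""
    embs[child_id] = (embs.pop(parent_a) + embs.pop(parent_b)) / 2

def on_death(embs, expert_id):
    """The expert simply exits routing space."""
    del embs[expert_id]
```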
### Hierarchical Routing (100K Scale)
At 100K experts, two-stage routing: first route to ~316 cluster centroids (√100K), then to experts within selected clusters. Cost: O(√N) instead of O(N). The gradient conflict-driven topology pruning paper notes that memory overhead can be mitigated by sparse conflict sampling [12].
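A sketch of the two-stage lookup, assuming unit-norm centroids and embeddings (names and the `top_clusters` parameter are illustrative):

```python
import numpy as np

def two_stage_route(h, centroids, cluster_members, expert_embs, top_clusters=4):
    """
    Stage 1: score ~sqrt(N) cluster centroids. Stage 2: score only the experts
    inside the best clusters. centroids: (C, d); cluster_members: one array of
    expert indices per cluster; expert_embs: (N, d), all rows unit-norm.
    Cost is O(C + top_clusters * N / C) instead of O(N).
    """
    h = h / (np.linalg.norm(h) + 1e-8)
    c_sims = centroids @ h
    best_clusters = np.argsort(c_sims)[-top_clusters:]
    candidates = np.concatenate([cluster_members[c] for c in best_clusters])
    sims = expert_embs[candidates] @ h
    return candidates, sims
```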
---
## 04 · The Cannibalization Signal
The entire system is driven by a single signal: the interference between old and new knowledge.
Yu et al. (PCGrad) established that gradient conflict can be detected via negative cosine similarity between task gradients, and identified three destructive conditions: conflicting directions, high curvature, and large magnitude differences [4].
Yang et al. extended this to **per-expert, per-token** conflict within MoE, computing token-level gradients for each expert and identifying tokens whose gradients conflict with the expert's average optimization direction [5]. This is the closest existing work to MicroExperts' cannibalization signal. The difference: they reassign conflicting tokens (reactive routing); MicroExperts splits the expert (reactive structure).
GCond showed that raw per-batch conflict detection is too noisy; PCGrad gets stuck oscillating. The solution is EMA smoothing with tiered conflict zones [6]. MicroExperts uses the same approach: dual exponential moving averages (fast and slow) tracking each expert's loss. When fast diverges upward from slow, that expert is being cannibalized.
Borsani et al. further showed that magnitude similarity matters alongside directional conflict [7], suggesting the signal could be enriched further.
> **The measurement:** Per-expert interference = L2 distance between the expert's individual output and the combined mixture output, normalized by the expert's output magnitude. High interference = other experts are pulling the result away from what this expert "wants." This is the cannibalization score.
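Both halves of the signal as a minimal sketch: the dual-EMA divergence test (GCond-style smoothing) and the normalized interference score from the measurement above. The smoothing factors and margin are illustrative placeholders:

```python
import numpy as np

class CannibalizationTracker:
    """Dual-EMA loss tracking per expert: when the fast average rises above
    the slow one by more than a margin, the expert is flagged as cannibalized."""

    def __init__(self, fast=0.1, slow=0.01, margin=0.05):
        self.fast_a, self.slow_a, self.margin = fast, slow, margin
        self.fast = {}   # expert_id -> fast EMA of loss
        self.slow = {}   # expert_id -> slow EMA of loss

    def update(self, expert_id, loss):
        f = self.fast.get(expert_id, loss)   # first observation seeds both EMAs
        s = self.slow.get(expert_id, loss)
        self.fast[expert_id] = (1 - self.fast_a) * f + self.fast_a * loss
        self.slow[expert_id] = (1 - self.slow_a) * s + self.slow_a * loss

    def cannibalized(self, expert_id):
        return self.fast[expert_id] > self.slow[expert_id] + self.margin

def interference(expert_out, mixture_out):
    """L2 distance between the expert's own output and the mixture output,
    normalized by the expert's output magnitude."""
    return float(np.linalg.norm(expert_out - mixture_out)
                 / (np.linalg.norm(expert_out) + 1e-8))
```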
---
## 05 · Self-Organization: Split / Merge / Death
The system starts as a single monolith and differentiates through pure training pressure. Structure emerges from self-preservation dynamics.
### Split: Self-Preservation (Same Tier)
When cannibalization exceeds threshold, the expert splits into **two same-tier children**. The **preserver** inherits exact weights with gradients frozen, for a duration proportional to expert importance. The **adapter** inherits weights with perturbation and absorbs new gradient pressure. This maps to French's dual-memory insight: preserver = long-term memory (stable), adapter = short-term memory (plastic) [3].
SETA validated this shared/unique separation principle: overlapping parameters become shared experts (stabilized), non-overlapping become unique experts (frozen) [10]. MicroExperts achieves the same dynamically through the split mechanism.
> **Same-tier splits grow parameters:** One tier-2 (16M) becomes two tier-2s (32M total). The system genuinely grows capacity. Expert count increases by 1, total params increase by one expert's worth.
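A weight-level sketch of the same-tier split (the freeze-duration formula, noise scale, and field names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def split_expert(expert_weights, importance, noise=0.01, base_freeze=1000):
    """Split one expert into a preserver (exact copy, temporarily frozen in
    proportion to importance) and an adapter (perturbed copy, immediately plastic).
    expert_weights: dict of name -> weight array."""
    preserver = {
        "weights": {k: v.copy() for k, v in expert_weights.items()},
        "frozen_steps": int(base_freeze * importance),  # long-term memory
    }
    adapter = {
        "weights": {k: v + noise * rng.standard_normal(v.shape)
                    for k, v in expert_weights.items()},
        "frozen_steps": 0,                              # short-term memory
    }
    return preserver, adapter
```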
### Merge: Consolidation (Tier Up)
Three forces counterbalance splitting:
| Merge Force | Signal | Effect |
|-------------|--------|--------|
| **Fragment** | Two experts individually weak but co-route | Recombines over-split debris → tier+1 |
| **Capacity** | Pool approaching memory budget | Back-pressure against unbounded growth |
| **Tier Gravity** | Small same-tier experts co-activate | Consolidates upward: 2×T0 → 1×T1 |
> **Tier-up merges grow capacity:** Two tier-0s (2M total) merge into one tier-1 (4M). Net gain: 2M params. Every merge-up cycle adds parameters. The system grows through the split→specialize→merge cycle, self-regulated by data complexity.
### Death: Controlled Forgetting
Experts with near-zero routing weight for extended periods are removed. Not all knowledge needs to persist. Death frees capacity.
### No Spontaneous Birth
Experts are only born via splits. Every expert traces lineage to the original monolith. No random initialization ever. Novel data is absorbed by first cannibalizing the nearest existing expert, which then splits to protect itself.
The gradient conflict-driven topology pruning paper explicitly describes this concept as a future research direction: "We believe that grounding neural architecture search in physical gradient dynamics represents a promising step toward interpretable and self-organizing artificial intelligence." [12]
---
## 06 · Continual Learning
Like Conway's Game of Life [16], the system's complexity emerges from simple local rules. When new data arrives, the system protects itself automatically through a six-phase cycle.
| Phase | System State | Response |
|-------|-------------|----------|
| **1. Stability** | Equilibrium, low cannibalization | Normal operation |
| **2. New data** | Router sends to nearest experts; conflicting gradients build | Drift detector notices entropy spike; thresholds tighten |
| **3. Self-preservation** | Cannibalized experts split (same tier); preservers freeze | Expert count grows; old knowledge isolated |
| **4. Specialization** | Adapters learn new domain; router differentiates | Density may spike temporarily |
| **5. Consolidation** | Redundant experts merge (tier up); fragments recombine | Count decreases; total params increase |
| **6. New equilibrium** | System stable at higher capacity | Old + new knowledge coexist |
Flesch et al. showed that both humans and networks face the same fundamental tradeoff: "lumpers" who reuse representations get better transfer but worse interference, while "splitters" keep them separate to avoid interference at the cost of transfer [13]. MicroExperts navigates this dynamically: shared high-tier experts are lumpers, specialized low-tier experts are splitters.
The coordinated eligibility theory decomposes interference into receptive field and population response factors, showing that plasticity rules can protect against catastrophic interference without requiring gradient alignment with task objectives [14]. This maps to MicroExperts' design: experts are population responses, the router gates receptive fields.
> **Training implications:** Data can be trained sequentially (each transition drives natural splits). Small diverse datasets work well (diversity, not volume, drives differentiation).
---
## 07 · Long-Term Structural Evolution
Over extended training, the system should theoretically develop a knowledge hierarchy, without anyone designing it:
| Layer | Tier | Content | Formation |
|-------|------|---------|-----------|
| **Universal** | 2–3 | Punctuation, numbers, common patterns | Small experts merged upward via tier gravity |
| **Cross-domain** | 1–2 | Shared grammar, Latin roots | Domain experts found redundant; merged |
| **Domain-specific** | 0–1 | French conjugation, Python syntax | Split from shared experts when new data caused cannibalization |
| **Exceptions** | 0 | Irregular verbs, idioms, rare patterns | Tiny specialist |
The total parameter count self-regulates to match accumulated knowledge complexity: more diverse data → more splits and merges → more growth.
---
<!--
## 08 · Scaling: 96+96 GB Dual GPU
Implemented in Python/PyTorch. No custom CUDA required except BitLinear kernels (borrowed) [15]. At 1-bit quantization, 96 GB holds hundreds of thousands of tier-0 experts.
| Concern | Solution |
|---------|----------|
| **Dynamic expert-GPU mapping** | PlacementManager: dict of expert_id → device. Updates on lifecycle events. |
| **Migration cost** | Negligible: tier-0 = 125KB, tier-3 = 8MB. Microseconds over NVLink. |
| **Memory fragmentation** | Fixed tier sizes → slab allocator pools. Zero fragmentation by design. |
| **100K routing cost** | Hierarchical routing: O(√N). 316 clusters for 100K experts. |
| **Sparse activation** | At 5% density, 100K pool means ~5K active per input. |
---
## 08 · Comparison
| | Standard MoE (Mixtral, Switch) | MicroExperts |
|--|-------------------------------|--------------|
| **Expert count** | Fixed (8–64) | Dynamic (1 → 100K) |
| **Expert size** | Billions | 1M–64M, 1-bit |
| **Routing** | Fixed top-k | Adaptive threshold (variable density) |
| **Structure** | Static | Self-organizing via cannibalization |
| **Forgetting** | Same as dense model | Structurally resistant |
| **New knowledge** | Overwrites existing | Gets new params via split |
| **Param growth** | Fixed at init | Grows via split + merge-up |
| **Density** | Fixed k per input | Sparse to fully dense |
| **Starting state** | Fixed architecture | Monolith that differentiates |
| **Training data** | Must be shuffled | Can be sequential, aggressive |
---
-->
## 08 · Known Risks
| Risk | Mitigation |
|------|-----------|
| **Cannibalization signal too noisy** | Dual-EMA smoothing validated by GCond [6]; cooldown timer; min age |
| **Merge collapse** | Still no solution; I want to avoid a replay buffer |
| **Router instability** | Embedding continuity on split; cooldown between events |
| **Expert starvation at 100K** | Death mechanism; pressure system |
| **Split/merge oscillation** | Min age before merge; hysteresis; cooldown |
| **1-bit experts too small** | Scale the expert tiers up |
---
# Real Implementation
I wanted to split this, whatever it is, into a Plan and a Real Implementation.
The current implementation is on a Mac M4 Pro with 48 GB RAM. That's why I decided to implement it into MLX and not via Bitnet, as I see Bitnet implementation not as a priority and may completely remove it; I am still not sure.
## Day 1
Splitting does work. At first glance there is no weird behaviour like a cascade or a repeating pattern, so I think the implementation works. Don't get me wrong, it is still not proven, but I see it as a first success.
## Day 2
I currently have the problem of preserving the optimizer state after splitting. Currently, I just reset the optimizer state, but then it loses momentum. I am still thinking of a sophisticated solution. First death of an expert:
[step 4550][L5] DEATH 6adccc9b (T2, age=254, w=0.0010) RIP
The optimizer-state fix via copying the parent state over to both children doesn't work for now; I will wipe them instead.
## Day 3
I implemented checkpoint saving the wrong way, so it only saved the backbone, not the experts; I have to train from the beginning again.
## Day 4
After interference testing, I had to reduce the overall size again. I underestimated the time it would take with the compute graph recompiling so frequently.<!-- After some tests with the Checkpoint 5000, I have the feeling it learns far more quickly than normal model architecture.-->
## Day 5
Now, with reduced expert and hidden dimensions per layer, it doesn't split anymore.
## Day 6
Split, merge, and death are working. Small-scale models are working too, but the lifetime has to be lowered so it works as intended, so the lifetime and size should depend on each other. Maybe I'll try to find some sort of formula in the future.
## Day 7
For now I removed copying over the optimizer state because it caused a crash. I will probably reimplement it later; it is not that important, but momentum now has to be built up from scratch after each split, which sucks. Everything else seems to work. Analyzing the logs, there is some oscillation (basically split and merge back), but it is not the norm, and I expect oscillation to some degree. 80% of splits stick and don't merge back with their sibling.
## Day 8
A report from Claude based on the logs:
12 monoliths → ~50 experts, 160M params. All lifecycle events fire: 89 splits, 36 merges, 13 deaths, drift detection. No crashes.
Loss hit 4.26 at step 970, rose to ~5.0 during rapid growth (optimizer wipes), recovering to ~4.9 by step 10K. Of the 36 merges, 16 were sibling merge-backs (both children from the same parent reuniting) and 20 were non-sibling merges (unrelated weak experts consolidating). 73 out of 89 splits stuck, an 82% retention rate. A tier-gravity merge fired at step 10,920. L5 routes to multiple experts (density 1.5); other layers are mostly top-1. Throughput: 4K → 1.6K tok/s.
The current biggest problem is the optimizer-state wipe, which keeps the model from building up momentum: after every split the optimizer state is wiped, and copying it somehow corrupts the state, which is an annoying bug.
## Day 9
The optimizer is now working: the optimizer state is successfully copied over to the children, so the base architecture is now working.
These are the results of training on top of Gutenberg with teknium/OpenHermes-2.5:
Loss 7.8→2.5. 6 splits, 5 merges, 3 deaths, 0 crashes. 4/6 splits stuck with no merge-back.
L4: 3 splits, 2 merges, density 1.8. L5: 1 merge. L9: 2 splits, 2 deaths, density 1.5. L10: 2 splits, 2 merges, 1 death. L0–L3, L6–L8, L11: 0 events.
The result of this chat fine-tune is bad, but it has nothing to do with the model itself; it has more to do with the fact that I built myself a trash tokenizer that doesn't support special tokens. I will retry it at a later point; for now, it's for completion only.
Here is a short report with 10 test prompts:
4 monoliths (L4,5,8,9). 3 near-monoliths trending stable (L0,1,2). 4 dynamic with per-prompt routing shifts (L3,6,7,11). L10 borderline.
## References
**[1]** McCloskey, M. & Cohen, N.J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. *Psychology of Learning and Motivation*, 24, 109–165. https://www.andywills.info/hbab/mccloskeycohen.pdf
**[2]** Ratcliff, R. (1990). Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. *Psychological Review*, 97(2), 285–308. https://bpb-us-w2.wpmucdn.com/u.osu.edu/dist/6/60429/files/2018/07/psychrev90a-1jt2c34.pdf
**[3]** French, R.M. (1999). Catastrophic forgetting in connectionist networks. *Trends in Cognitive Sciences*, 3(4), 128–135. https://lead.ube.fr/wp-content/uploads/2023/09/000282-catastrophic-forgetting-in-connectionist-networks.pdf
**[4]** Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K. & Finn, C. (2020). Gradient Surgery for Multi-Task Learning. *NeurIPS 2020*. https://arxiv.org/pdf/2001.06782
**[5]** Yang, L. et al. (2025). Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model. *ICLR 2025*. https://arxiv.org/abs/2406.19905
**[6]** GCond. (2025). Gradient Conflict Resolution via Accumulation-based Stabilization for Large-Scale Multi-Task Learning. *arXiv:2509.07252*. https://arxiv.org/abs/2509.07252
**[7]** Borsani, T., Rosani, A., Nicosia, G. & Di Fatta, G. (2025). Gradient Similarity Surgery in Multi-Task Deep Learning. *arXiv:2506.06130*. Accepted at ECML PKDD 2025. https://arxiv.org/abs/2506.06130
**[8]** Li, H. et al. (2024). Theory on Mixture-of-Experts in Continual Learning. *arXiv:2406.16437*. https://arxiv.org/abs/2406.16437
**[9]** Kawata, R. et al. (2025). Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning. *arXiv:2506.01656*. https://arxiv.org/abs/2506.01656
**[10]** Siddika, F. et al. (2026). Split-on-Share: Mixture of Sparse Experts for Task-Agnostic Continual Learning (SETA). *arXiv:2601.17616*. https://arxiv.org/abs/2601.17616
**[11]** Chen, J. et al. (2024). Mitigating Gradient Conflicts via Expert Squads in Multi-Task Learning. *Neurocomputing*, 128832. https://github.com/chenjie04/Multi-Task-Learning-PyTorch, https://www.sciencedirect.com/science/article/abs/pii/S0925231224016035
**[12]** Anonymous. (2025). MoE with Gradient Conflict-Driven Subspace Topology Pruning for Emergent Modularity. *arXiv:2512.20291*. https://arxiv.org/abs/2512.20291
**[13]** Flesch, T. et al. (2025). Humans and neural networks show similar patterns of transfer and interference during continual learning. *Nature Human Behaviour*. https://www.nature.com/articles/s41562-025-02318-y
**[14]** eLife. (2024). Beyond Gradients: Factorized, Geometric Control of Interference and Generalization. *eLife* 103701. https://elifesciences.org/reviewed-preprints/103701
**[15]** Ma, S. et al. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58). *Microsoft Research, arXiv:2402.17764*. https://arxiv.org/abs/2402.17764
**[16]** Conway's Game of Life. https://noweyr.github.io, https://conwaylife.com/wiki/