---
license: mit
datasets:
- sedthh/gutenberg_english
- teknium/OpenHermes-2.5
pipeline_tag: text-generation
---
# MicroExperts/NG Architecture
### Split or die trying.
**Self-organizing 1-bit mixture-of-experts for continual learning without catastrophic forgetting.**
---
### Warning: this is not a paper; it is an architecture outline that is still not 100% concrete (I will probably scrap the BitNet part), plus a more-or-less dev diary at the bottom. I still have not done enough testing to prove that it fully works as intended.
### Warning 2: the text is partly AI-written (Claude Opus and Gemini Pro); I rewrote the most cringeworthy sentences and fixed most errors.
# Plan
## 01 · The Problem
Standard neural networks store knowledge in shared parameters. Training on new data overwrites weights encoding old knowledge: **catastrophic forgetting**.
This was first identified by McCloskey & Cohen, who showed that a backpropagation network trained on ones addition facts completely lost that knowledge when retrained on twos [1]. Ratcliff established that the root cause is representational overlap at hidden layers: when many shared weights change, prior knowledge cannot survive [2]. French's comprehensive review concluded that dual memory systems separating short-term and long-term storage were necessary to overcome it [3].
Existing solutions all have fundamental limits: **EWC/SI** accumulate penalty until the model can't learn. **Replay buffers** require storing data forever. **Progressive networks** grow linearly without sharing. **Masking** fixes capacity at init and can't grow. The structure itself never self-organizes; it has to be designed.
> **MicroExperts' response:** New knowledge gets new parameters via expert splits. Old knowledge lives in parameters that receive no gradients unless relevant. Protection is structural. And the system grows its own capacity to match data complexity.
---
## 02 · Architecture
A transformer where the FFN in each block is replaced by a **dynamic Mixture-of-Experts layer** with ultra-small 1.58-bit quantized experts.
```
Tokens → Embedding → Attention → Adaptive Router → Expert Pool → Weighted Sum → LM Head
```
The experts are small feedforward networks built from **BitLinear layers**, 1.58-bit quantized linear layers borrowed from Microsoft's BitNet research [15]. Weights are ternarized to {-1, 0, +1} using round() with mean-absolute-value scaling. Activations are quantized to 8-bit via absmax. The straight-through estimator enables gradient flow.
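A NumPy sketch of this quantization scheme, in its inference view (the real BitNet kernels pack weights differently, and the straight-through estimator only matters for the backward pass):

```python
import numpy as np

def quantize_weights(W):
    """Absmean 1.58-bit quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = np.mean(np.abs(W)) + 1e-8
    Wq = np.clip(np.round(W / scale), -1, 1)
    return Wq, scale

def quantize_activations(x, bits=8):
    """Per-tensor absmax quantization to the signed 8-bit range."""
    qmax = 2 ** (bits - 1) - 1          # 127
    scale = np.max(np.abs(x)) + 1e-8
    xq = np.clip(np.round(x / scale * qmax), -qmax, qmax)
    return xq, scale

def bitlinear_forward(x, W):
    """BitLinear forward: integer-domain matmul, then rescale back to float.
    During training, gradients would pass straight through both round() calls."""
    Wq, w_scale = quantize_weights(W)
    xq, x_scale = quantize_activations(x)
    return (xq @ Wq.T) * (x_scale / 127) * w_scale
```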
Kawata et al. proved theoretically that vanilla networks fail to detect latent organizational structure in data; they process the problem as a whole. MoE succeeds by dividing it into easier subproblems [9]. Li et al. provided the first theoretical proof that MoE can diversify experts and prevent forgetting in continual learning [8].
### Expert Size Tiers
| Tier | Hidden Dim | Params | Memory (1-bit) | Role |
|------|-----------|--------|----------------|------|
| 0 | 512 | ~1M | ~125 KB | Narrow specialists |
| 1 | 1,024 | ~4M | ~500 KB | Domain experts |
| 2 | 2,048 | ~16M | ~2 MB | Broad generalists |
| 3 | 4,096 | ~64M | ~8 MB | Monolith / max capacity |
Powers-of-4 sizing ensures clean merge arithmetic. Fixed tier sizes also eliminate GPU memory fragmentation: pre-allocated slab pools per tier recycle blocks on death and hand them out on birth. Expert tiers can be increased later.
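The tier arithmetic can be sanity-checked in a few lines. The ~4× params per tier implies params ≈ 4·hidden² (my reading of the table, e.g. two hidden × 2·hidden matrices; the exact expert shape is an assumption), and ~1 bit per ternary weight reproduces the memory column:

```python
def tier_params(tier):
    """Parameter count per tier, assuming params ~= 4 * hidden^2 (inferred
    from the table's ~4x-per-tier scaling, not stated explicitly above)."""
    hidden = 512 * (2 ** tier)          # 512, 1024, 2048, 4096
    return 4 * hidden * hidden

def tier_memory_kb(tier):
    """Memory at ~1 bit per weight (true 1.58-bit packing would be slightly more)."""
    return tier_params(tier) / 8 / 1024

# Growth arithmetic from the split and merge rules:
same_tier_split_gain = tier_params(2)                     # +16M: one extra tier-2
tier_up_merge_gain = tier_params(1) - 2 * tier_params(0)  # +2M: 2 x T0 -> 1 x T1
```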
> **Critical property:** Inactive experts receive **zero gradients**. Their weights are structurally untouched. Physical parameter isolation, not regularization or replay, is what forgetting resistance is built on.
---
## 03 · The Router
Routes inputs to experts, produces clean signals for lifecycle decisions, and adapts as experts appear and disappear.
### Embedding-Similarity (Not Linear)
Each expert has a learned embedding vector in routing space. Routing is cosine similarity between the input (projected through a routing head MLP) and expert embeddings. Adding an expert means adding one vector. Removing means deleting one. No weight matrix resizing.
### Adaptive Threshold (Not Top-K)
Fixed top-k is wrong for this architecture. The system aims for the best of both worlds: the interconnectivity of dense models and the domain-specific specialization of MoE models.
Simple inputs might activate 1 expert. Complex cross-domain inputs might activate 50. The density itself is a diagnostic signal; spikes indicate distribution shift.
Chen et al. demonstrated that partitioning features into task-specific and shared components processed by dedicated expert groups effectively mitigates gradient conflicts at their source [11]. The adaptive router achieves this dynamically.
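A minimal sketch of embedding-similarity routing with an adaptive threshold (the threshold value and the single-best fallback rule are my assumptions, not fixed design choices):

```python
import numpy as np

def route(h, expert_embs, threshold=0.3):
    """
    h: routed input vector (output of the routing-head MLP), shape (d,)
    expert_embs: (N, d) learned expert embeddings
    Returns (indices, weights): every expert whose cosine similarity with h
    exceeds the threshold, with similarities renormalized to sum to 1.
    The number of active experts (density) varies per input.
    """
    h = h / (np.linalg.norm(h) + 1e-8)
    E = expert_embs / (np.linalg.norm(expert_embs, axis=1, keepdims=True) + 1e-8)
    sims = E @ h                              # cosine similarities, shape (N,)
    active = np.where(sims > threshold)[0]
    if active.size == 0:                      # fall back to the single best expert
        active = np.array([int(np.argmax(sims))])
    w = sims[active]
    return active, w / w.sum()
```

Adding or removing an expert is just adding or removing a row of `expert_embs`; no weight matrix is resized.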
### Lifecycle Hooks
| Event | Embedding Action | Rationale |
|-------|-----------------|-----------|
| **Split** | Preserver: exact copy. Adapter: copy + noise. | Preserver handles same inputs; adapter diverges |
| **Merge** | Child: average of parents | Covers both parents' input space |
| **Death** | Embedding removed | Expert exits routing space |
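The table's embedding actions, sketched over a dict of expert_id → embedding (function and variable names are illustrative, not from the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def on_split(embs, parent_id, preserver_id, adapter_id, noise=0.05):
    """Preserver keeps the parent's exact embedding; adapter starts nearby and diverges."""
    e = embs.pop(parent_id)
    embs[preserver_id] = e.copy()
    embs[adapter_id] = e + noise * rng.standard_normal(e.shape)

def on_merge(embs, parent_a, parent_b, child_id):
    """Child's embedding is the parents' average, covering both input regions."""
    embs[child_id] = (embs.pop(parent_a) + embs.pop(parent_b)) / 2

def on_death(embs, expert_id):
    """The expert simply exits routing space."""
    del embs[expert_id]
```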
### Hierarchical Routing (100K Scale)
At 100K experts, two-stage routing: first route to ~316 cluster centroids (√100K), then to experts within selected clusters. Cost: O(√N) instead of O(N). The gradient conflict-driven topology pruning paper notes that memory overhead can be mitigated by sparse conflict sampling [12].
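A sketch of the two-stage lookup, assuming unit-norm centroids and embeddings (names and the `top_clusters` parameter are illustrative):

```python
import numpy as np

def two_stage_route(h, centroids, cluster_members, expert_embs, top_clusters=4):
    """
    Stage 1: score ~sqrt(N) cluster centroids. Stage 2: score only the experts
    inside the best clusters. centroids: (C, d); cluster_members: one array of
    expert indices per cluster; expert_embs: (N, d), all rows unit-norm.
    Cost is O(C + top_clusters * N / C) instead of O(N).
    """
    h = h / (np.linalg.norm(h) + 1e-8)
    c_sims = centroids @ h
    best_clusters = np.argsort(c_sims)[-top_clusters:]
    candidates = np.concatenate([cluster_members[c] for c in best_clusters])
    sims = expert_embs[candidates] @ h
    return candidates, sims
```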
---
## 04 · The Cannibalization Signal
The entire system is driven by a single signal: the interference between old and new knowledge.
Yu et al. (PCGrad) established that gradient conflict can be detected via negative cosine similarity between task gradients, and identified three destructive conditions: conflicting directions, high curvature, and large magnitude differences [4].
Yang et al. extended this to **per-expert, per-token** conflict within MoE, computing token-level gradients for each expert and identifying tokens whose gradients conflict with the expert's average optimization direction [5]. This is the closest existing work to MicroExperts' cannibalization signal. The difference: they reassign conflicting tokens (reactive routing); MicroExperts splits the expert (reactive structure).
GCond showed that raw per-batch conflict detection is too noisy; PCGrad gets stuck oscillating. The solution is EMA smoothing with tiered conflict zones [6]. MicroExperts uses the same approach: dual exponential moving averages (fast and slow) tracking each expert's loss. When fast diverges upward from slow, that expert is being cannibalized.
Borsani et al. further showed that magnitude similarity matters alongside directional conflict [7], suggesting the signal could be enriched further.
> **The measurement:** Per-expert interference = L2 distance between the expert's individual output and the combined mixture output, normalized by the expert's output magnitude. High interference = other experts are pulling the result away from what this expert "wants." This is the cannibalization score.
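Both halves of the signal as a minimal sketch: the dual-EMA divergence test (GCond-style smoothing) and the normalized interference score from the measurement above. The smoothing factors and margin are illustrative placeholders:

```python
import numpy as np

class CannibalizationTracker:
    """Dual-EMA loss tracking per expert: when the fast average rises above
    the slow one by more than a margin, the expert is flagged as cannibalized."""

    def __init__(self, fast=0.1, slow=0.01, margin=0.05):
        self.fast_a, self.slow_a, self.margin = fast, slow, margin
        self.fast = {}   # expert_id -> fast EMA of loss
        self.slow = {}   # expert_id -> slow EMA of loss

    def update(self, expert_id, loss):
        f = self.fast.get(expert_id, loss)   # first observation seeds both EMAs
        s = self.slow.get(expert_id, loss)
        self.fast[expert_id] = (1 - self.fast_a) * f + self.fast_a * loss
        self.slow[expert_id] = (1 - self.slow_a) * s + self.slow_a * loss

    def cannibalized(self, expert_id):
        return self.fast[expert_id] > self.slow[expert_id] + self.margin

def interference(expert_out, mixture_out):
    """L2 distance between the expert's own output and the mixture output,
    normalized by the expert's output magnitude."""
    return float(np.linalg.norm(expert_out - mixture_out)
                 / (np.linalg.norm(expert_out) + 1e-8))
```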
---
## 05 · Self-Organization: Split / Merge / Death
The system starts as a single monolith and differentiates through pure training pressure. Structure emerges from self-preservation dynamics.
### Split: Self-Preservation (Same Tier)
When cannibalization exceeds threshold, the expert splits into **two same-tier children**. The **preserver** inherits exact weights with gradients frozen, for a duration proportional to expert importance. The **adapter** inherits weights with perturbation and absorbs new gradient pressure. This maps to French's dual-memory insight: preserver = long-term memory (stable), adapter = short-term memory (plastic) [3].
SETA validated this shared/unique separation principle: overlapping parameters become shared experts (stabilized), non-overlapping become unique experts (frozen) [10]. MicroExperts achieves the same dynamically through the split mechanism.
> **Same-tier splits grow parameters:** One tier-2 (16M) becomes two tier-2s (32M total). The system genuinely grows capacity. Expert count increases by 1, total params increase by one expert's worth.
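A weight-level sketch of the same-tier split (the freeze-duration formula, noise scale, and field names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def split_expert(expert_weights, importance, noise=0.01, base_freeze=1000):
    """Split one expert into a preserver (exact copy, temporarily frozen in
    proportion to importance) and an adapter (perturbed copy, immediately plastic).
    expert_weights: dict of name -> weight array."""
    preserver = {
        "weights": {k: v.copy() for k, v in expert_weights.items()},
        "frozen_steps": int(base_freeze * importance),  # long-term memory
    }
    adapter = {
        "weights": {k: v + noise * rng.standard_normal(v.shape)
                    for k, v in expert_weights.items()},
        "frozen_steps": 0,                              # short-term memory
    }
    return preserver, adapter
```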
### Merge: Consolidation (Tier Up)
Three forces counterbalance splitting:
| Merge Force | Signal | Effect |
|-------------|--------|--------|
| **Fragment** | Two experts individually weak but co-route | Recombines over-split debris → tier+1 |
| **Capacity** | Pool approaching memory budget | Back-pressure against unbounded growth |
| **Tier Gravity** | Small same-tier experts co-activate | Consolidates upward: 2×T0 → 1×T1 |
> **Tier-up merges grow capacity:** Two tier-0s (2M total) merge into one tier-1 (4M). Net gain: 2M params. Every merge-up cycle adds parameters. The system grows through the split→specialize→merge cycle, self-regulated by data complexity.
### Death: Controlled Forgetting
Experts with near-zero routing weight for extended periods are removed. Not all knowledge needs to persist. Death frees capacity.
### No Spontaneous Birth
Experts are only born via splits. Every expert traces lineage to the original monolith. No random initialization ever. Novel data is absorbed by first cannibalizing the nearest existing expert, which then splits to protect itself.
The gradient conflict-driven topology pruning paper explicitly describes this concept as a future research direction: "We believe that grounding neural architecture search in physical gradient dynamics represents a promising step toward interpretable and self-organizing artificial intelligence." [12]
---
## 06 · Continual Learning
Like Conway's Game of Life [16], the system's complexity emerges from simple local rules. When new data arrives, the system protects itself automatically through a six-phase cycle.
| Phase | System State | Response |
|-------|-------------|----------|
| **1. Stability** | Equilibrium, low cannibalization | Normal operation |
| **2. New data** | Router sends to nearest experts; conflicting gradients build | Drift detector notices entropy spike; thresholds tighten |
| **3. Self-preservation** | Cannibalized experts split (same tier); preservers freeze | Expert count grows; old knowledge isolated |
| **4. Specialization** | Adapters learn new domain; router differentiates | Density may spike temporarily |
| **5. Consolidation** | Redundant experts merge (tier up); fragments recombine | Count decreases; total params increase |
| **6. New equilibrium** | System stable at higher capacity | Old + new knowledge coexist |
Flesch et al. showed that both humans and networks face the same fundamental tradeoff: "lumpers" who reuse representations get better transfer but worse interference, while "splitters" keep them separate to avoid interference at the cost of transfer [13]. MicroExperts navigates this dynamically: shared high-tier experts are lumpers, specialized low-tier experts are splitters.
The coordinated eligibility theory decomposes interference into receptive field and population response factors, showing that plasticity rules can protect against catastrophic interference without requiring gradient alignment with task objectives [14]. This maps to MicroExperts' design: experts are population responses, the router gates receptive fields.
> **Training implications:** Data can be trained sequentially (each transition drives natural splits). Small diverse datasets work well (diversity, not volume, drives differentiation).
---
## 07 · Long-Term Structural Evolution
Over extended training, the system should theoretically develop a knowledge hierarchy, without anyone designing it:
| Layer | Tier | Content | Formation |
|-------|------|---------|-----------|
| **Universal** | 2–3 | Punctuation, numbers, common patterns | Small experts merged upward via tier gravity |
| **Cross-domain** | 1–2 | Shared grammar, Latin roots | Domain experts found redundant; merged |
| **Domain-specific** | 0–1 | French conjugation, Python syntax | Split from shared experts when new data caused cannibalization |
| **Exceptions** | 0 | Irregular verbs, idioms, rare patterns | Tiny specialist |
The total parameter count self-regulates to match accumulated knowledge complexity: more diverse data → more splits and merges → more growth.
---
<!--
## 08 · Scaling: 96+96 GB Dual GPU
Implemented in Python/PyTorch. No custom CUDA required except BitLinear kernels (borrowed) [15]. At 1-bit quantization, 96 GB holds hundreds of thousands of tier-0 experts.
| Concern | Solution |
|---------|----------|
| **Dynamic expert-GPU mapping** | PlacementManager: dict of expert_id → device. Updates on lifecycle events. |
| **Migration cost** | Negligible: tier-0 = 125KB, tier-3 = 8MB. Microseconds over NVLink. |
| **Memory fragmentation** | Fixed tier sizes → slab allocator pools. Zero fragmentation by design. |
| **100K routing cost** | Hierarchical routing: O(√N). 316 clusters for 100K experts. |
| **Sparse activation** | At 5% density, 100K pool means ~5K active per input. |
---
## 08 · Comparison
| | Standard MoE (Mixtral, Switch) | MicroExperts |
|--|-------------------------------|--------------|
| **Expert count** | Fixed (8–64) | Dynamic (1 → 100K) |
| **Expert size** | Billions | 1M–64M, 1-bit |
| **Routing** | Fixed top-k | Adaptive threshold (variable density) |
| **Structure** | Static | Self-organizing via cannibalization |
| **Forgetting** | Same as dense model | Structurally resistant |
| **New knowledge** | Overwrites existing | Gets new params via split |
| **Param growth** | Fixed at init | Grows via split + merge-up |
| **Density** | Fixed k per input | Sparse to fully dense |
| **Starting state** | Fixed architecture | Monolith that differentiates |
| **Training data** | Must be shuffled | Can be sequential, aggressive |
---
-->
## 08 · Known Risks
| Risk | Mitigation |
|------|-----------|
| **Cannibalization signal too noisy** | Dual-EMA smoothing validated by GCond [6]; cooldown timer; min age |
| **Merge collapse** | Still no solution; I want to avoid a replay buffer |
| **Router instability** | Embedding continuity on split; cooldown between events |
| **Expert starvation at 100K** | Death mechanism; pressure system |
| **Split/merge oscillation** | Min age before merge; hysteresis; cooldown |
| **1-bit experts too small** | Scale the expert tiers up |
---
# Real Implementation
I wanted to split this, whatever it is, into a Plan and a Real Implementation.
The current implementation is on a Mac M4 Pro with 48 GB RAM. That's why I decided to implement it into MLX and not via Bitnet, as I see Bitnet implementation not as a priority and may completely remove it; I am still not sure.
## Day 1
Splitting does work. At first glance there is no weird behaviour like a cascade or a repeating pattern, so I think the implementation works. Don't get me wrong, it is still not proven, but I see it as a first success.
## Day 2
I currently have the problem of preserving the optimizer state after splitting. Currently, I just reset the optimizer state, but then it loses momentum. I am still thinking of a sophisticated solution. First death of an expert:
[step 4550][L5] DEATH 6adccc9b (T2, age=254, w=0.0010) RIP
The optimizer-state fix via copying the parent state over to both children doesn't work for now; I will wipe them instead.
## Day 3
I implemented checkpoint saving the wrong way, so it only saved the backbone, not the experts; I have to train from the beginning again.
## Day 4
After interference testing, I had to reduce the overall size again. I underestimated the time it would take with the compute graph recompiling so frequently.<!-- After some tests with the Checkpoint 5000, I have the feeling it learns far more quickly than normal model architecture.-->
## Day 5
Now, with reduced expert and hidden dimensions per layer, it doesn't split anymore.
## Day 6
Split, merge, and death are working. Small-scale models are working too, but the lifetime has to be lowered so it works as intended, so the lifetime and size should depend on each other. Maybe I'll try to find some sort of formula in the future.
## Day 7
For now I removed copying over the optimizer state because it caused a crash. I will probably reimplement it later; it is not that important, but momentum now has to be built up from scratch after each split, which sucks. Everything else seems to work. Analyzing the logs, there is some oscillation (basically split and merge back), but it is not the norm, and I expect oscillation to some degree. 80% of splits stick and don't merge back with their sibling.
## Day 8
A report from Claude based on the logs:
12 monoliths → ~50 experts, 160M params. All lifecycle events fire: 89 splits, 36 merges, 13 deaths, drift detection. No crashes.
Loss hit 4.26 at step 970, rose to ~5.0 during rapid growth (optimizer wipes), recovering to ~4.9 by step 10K. Of the 36 merges, 16 were sibling merge-backs (both children from the same parent reuniting) and 20 were non-sibling merges (unrelated weak experts consolidating). 73 out of 89 splits stuck, an 82% retention rate. A tier-gravity merge fired at step 10,920. L5 routes to multiple experts (density 1.5); other layers are mostly top-1. Throughput: 4K → 1.6K tok/s.
The current biggest problem is the optimizer-state wipe, which keeps the model from building up momentum: after every split the optimizer state is wiped, and copying it somehow corrupts the state, which is an annoying bug.
## Day 9
The optimizer is now working: the optimizer state is successfully copied over to the children, so the base architecture is now working.
These are the results of training on top of Gutenberg with teknium/OpenHermes-2.5:
Loss 7.8→2.5. 6 splits, 5 merges, 3 deaths, 0 crashes. 4/6 splits stuck with no merge-back.
L4: 3 splits, 2 merges, density 1.8. L5: 1 merge. L9: 2 splits, 2 deaths, density 1.5. L10: 2 splits, 2 merges, 1 death. L0–L3, L6–L8, L11: 0 events.
The result of this chat fine-tune is bad, but it has nothing to do with the model itself; it has more to do with the fact that I built myself a trash tokenizer that doesn't support special tokens. I will retry it at a later point; for now, it's for completion only.
Here is a short report with 10 test prompts:
4 monoliths (L4,5,8,9). 3 near-monoliths trending stable (L0,1,2). 4 dynamic with per-prompt routing shifts (L3,6,7,11). L10 borderline.
## References
**[1]** McCloskey, M. & Cohen, N.J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. *Psychology of Learning and Motivation*, 24, 109–165. https://www.andywills.info/hbab/mccloskeycohen.pdf
**[2]** Ratcliff, R. (1990). Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. *Psychological Review*, 97(2), 285–308. https://bpb-us-w2.wpmucdn.com/u.osu.edu/dist/6/60429/files/2018/07/psychrev90a-1jt2c34.pdf
**[3]** French, R.M. (1999). Catastrophic forgetting in connectionist networks. *Trends in Cognitive Sciences*, 3(4), 128–135. https://lead.ube.fr/wp-content/uploads/2023/09/000282-catastrophic-forgetting-in-connectionist-networks.pdf
**[4]** Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K. & Finn, C. (2020). Gradient Surgery for Multi-Task Learning. *NeurIPS 2020*. https://arxiv.org/pdf/2001.06782
**[5]** Yang, L. et al. (2025). Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model. *ICLR 2025*. https://arxiv.org/abs/2406.19905
**[6]** GCond. (2025). Gradient Conflict Resolution via Accumulation-based Stabilization for Large-Scale Multi-Task Learning. *arXiv:2509.07252*. https://arxiv.org/abs/2509.07252
**[7]** Borsani, T., Rosani, A., Nicosia, G. & Di Fatta, G. (2025). Gradient Similarity Surgery in Multi-Task Deep Learning. *arXiv:2506.06130*. Accepted at ECML PKDD 2025. https://arxiv.org/abs/2506.06130
**[8]** Li, H. et al. (2024). Theory on Mixture-of-Experts in Continual Learning. *arXiv:2406.16437*. https://arxiv.org/abs/2406.16437
**[9]** Kawata, R. et al. (2025). Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning. *arXiv:2506.01656*. https://arxiv.org/abs/2506.01656
**[10]** Siddika, F. et al. (2026). Split-on-Share: Mixture of Sparse Experts for Task-Agnostic Continual Learning (SETA). *arXiv:2601.17616*. https://arxiv.org/abs/2601.17616
**[11]** Chen, J. et al. (2024). Mitigating Gradient Conflicts via Expert Squads in Multi-Task Learning. *Neurocomputing*, 128832. https://github.com/chenjie04/Multi-Task-Learning-PyTorch, https://www.sciencedirect.com/science/article/abs/pii/S0925231224016035
**[12]** Anonymous. (2025). MoE with Gradient Conflict-Driven Subspace Topology Pruning for Emergent Modularity. *arXiv:2512.20291*. https://arxiv.org/abs/2512.20291
**[13]** Flesch, T. et al. (2025). Humans and neural networks show similar patterns of transfer and interference during continual learning. *Nature Human Behaviour*. https://www.nature.com/articles/s41562-025-02318-y
**[14]** eLife. (2024). Beyond Gradients: Factorized, Geometric Control of Interference and Generalization. *eLife* 103701. https://elifesciences.org/reviewed-preprints/103701
**[15]** Ma, S. et al. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58). *Microsoft Research, arXiv:2402.17764*. https://arxiv.org/abs/2402.17764
**[16]** Conway's Game of Life. https://noweyr.github.io, https://conwaylife.com/wiki/