---
license: apache-2.0
base_model: mahiatlinux/Phi-mini-MoE
tags:
- moe
- mixture-of-attention
- pruned
- specialized
- self-training
---

# Phi-mini-MoE + MoA + Pruning + Specialization

## What's Special

This model adds **Mixture of Attention (MoA)** routing to Phi-mini-MoE, then:

- ✂️ **Pruned 25% of attention heads** (kept only the most important ones)
- 🎯 **Forced expert specialization** (each expert focuses on specific tasks)
- ⚡ **~3x faster** than OLMoE-1B-7B
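The head-pruning step above amounts to ranking heads by an importance score and keeping the top 75%. A minimal PyTorch sketch, assuming a per-head importance vector is already available (the function name and random scores here are hypothetical, not this repo's actual implementation):

```python
import torch

def prune_heads(head_importance: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Return the sorted indices of the top `keep_ratio` fraction of heads."""
    n_heads = head_importance.numel()
    n_keep = int(n_heads * keep_ratio)
    # Keep the highest-importance heads; sort indices for stable layout
    return torch.topk(head_importance, n_keep).indices.sort().values

# 32 heads at keep_ratio=0.75 leaves 24, matching the 25% pruning above
kept = prune_heads(torch.rand(32))
print(len(kept))  # 24
```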

## Stats

- Base: Phi-mini-MoE (7.6B total parameters, 2.4B active)
- Attention heads: 32 → 24 (25% pruned)
- Training iterations: 10
- Expert specialization: 16.7%

## Files

- `moa_router.pt` - Trained and pruned MoA router weights
- `training_data.json` - Self-play training examples
- `expert_stats.json` - Expert specialization profiles
- `pruning_stats.json` - Record of which heads were pruned
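These artifacts can be read with standard `torch` and `json` calls. A minimal sketch, assuming `moa_router.pt` is a plain torch-serialized dict of tensors (the helper name is hypothetical; check the actual file contents before relying on this):

```python
import json
import torch

def load_artifacts(router_path, pruning_path, expert_path):
    # Router weights: assumed to be a torch-serialized dict of tensors
    router_state = torch.load(router_path, map_location="cpu")
    # JSON stats describing pruned heads and expert specialization
    with open(pruning_path) as f:
        pruning_stats = json.load(f)
    with open(expert_path) as f:
        expert_stats = json.load(f)
    return router_state, pruning_stats, expert_stats
```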

## Author

[maxie-12321](https://huggingface.co/maxie-12321)