---
license: mit
tags:
- moe
- deepseek
- nvidia-h200
- fineweb-edu
- pytorch
- text-generation
- nano-lm
- edge-ai
- rope
language:
- en
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
---

# Eve-2-MoE-272M

A custom 272M-parameter Mixture-of-Experts language model trained from scratch on **8× NVIDIA H200** GPUs. It implements a DeepSeek-V3-style architecture with a shared expert, top-k routed experts, RoPE positional encoding, and SwiGLU activations.

Eve-2 is a **base model for specialized fine-tuning**, not a chatbot. Fine-tune it in ~20 minutes on consumer hardware for narrow tasks like PII redaction, text classification, semantic compression cleanup, or lightweight routing in multi-agent pipelines. It runs on a Raspberry Pi.

**Author:** [Anthony Maio](https://making-minds.ai) / Making Minds AI (Independent)
https://www.github.com/anthony-maio
https://www.linkedin.com/in/anthony-maio

## Architecture

| | |
|---|---|
| **Total Parameters** | 272M |
| **Type** | Mixture of Experts (MoE) |
| **Routed Experts** | 8 |
| **Shared Experts** | 1 (always active) |
| **Active Params/Token** | ~80M (top-2 routing) |
| **Routing** | Top-2 gate with load-balancing aux loss |
| **Layers** | 12 transformer blocks |
| **Hidden Dim** | 512 |
| **Attention Heads** | 8 (64-dim each) |
| **Expert FFN Dim** | 1408 (SwiGLU) |
| **Position Encoding** | Rotary Position Embeddings (RoPE) |
| **Context Length** | 2048 tokens |
| **Vocab** | 50,304 (GPT-2 tokenizer, padded) |
| **Norm** | RMSNorm |
| **Precision** | BFloat16 (native) |
| **Weight Tying** | Embeddings tied with LM head |

### Design Rationale

MoE at this scale is a deliberate choice. With 8 experts but only 2 active per token, inference cost is roughly equivalent to an 80M dense model, while the total parameter budget gives each expert room to specialize. The shared expert handles common patterns across all tokens; the routed experts develop narrow competencies during fine-tuning. The sketch below shows the routing pattern.
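
A minimal sketch of such a layer, assuming SwiGLU experts and a softmax top-2 gate; class and parameter names here are illustrative, not the exact definitions in `modeling_eve.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert FFN: SwiGLU with the 1408-dim hidden size from the table."""
    def __init__(self, dim: int = 512, ffn_dim: int = 1408):
        super().__init__()
        self.gate = nn.Linear(dim, ffn_dim, bias=False)
        self.up   = nn.Linear(dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class MoELayer(nn.Module):
    """One shared expert (always active) plus top-2 of 8 routed experts."""
    def __init__(self, dim: int = 512, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.shared  = SwiGLUExpert(dim)
        self.experts = nn.ModuleList(SwiGLUExpert(dim) for _ in range(n_experts))
        self.router  = nn.Linear(dim, n_experts, bias=False)
        self.top_k   = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)         # (tokens, n_experts)
        weights, idx = probs.topk(self.top_k, dim=-1)     # both (tokens, top_k)
        out = self.shared(x)                              # shared expert sees every token
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                     # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

During training, the load-balancing auxiliary loss noted in the table (computed from the router probabilities, omitted above) keeps tokens spread evenly across the eight routed experts.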

This makes Eve-2 a natural base for **nano-LM swarms**: fine-tune copies for specific tasks, deploy at the edge, coordinate through lightweight protocols.

## Training

| | |
|---|---|
| **Hardware** | 8× NVIDIA H200 (141 GB VRAM each) |
| **Throughput** | ~1.26M tokens/sec |
| **Steps** | 40,000 |
| **Tokens** | ~10.5B |
| **Wall Time** | ~2.5 hours |
| **Data** | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (Sample-10BT) |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95, weight decay 0.1) |
| **Schedule** | Cosine decay with 200-step linear warmup |
| **Peak LR** | 5e-4, decaying to 5e-5 |
| **Batch** | 128 × 2048 tokens (16/GPU × 8 GPUs) |
| **Gradient Clipping** | 1.0 |
| **Distributed** | PyTorch DDP |
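
As a reading aid, the schedule rows above translate to the helper below; this is a sketch of the stated warmup-plus-cosine schedule, not code taken from `train.py`:

```python
import math

def lr_at(step: int, max_steps: int = 40_000, warmup: int = 200,
          peak: float = 5e-4, floor: float = 5e-5) -> float:
    """Linear warmup to the peak LR, then cosine decay down to the floor."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / (max_steps - warmup)  # 0 -> 1 over the decay phase
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```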

### Convergence

| Step | Tokens Seen | Train Loss | Val Loss (WikiText-2) |
|------|------------|-----------|----------------------|
| 500 | 131M | 4.82 | 6.35 |
| 1,000 | 262M | 4.09 | 4.84 |
| 1,500 | 393M | 3.95 | 4.36 |
| 5,000 | 1.3B | 3.47 | 3.89 |
| 13,000 | 3.4B | 3.05 | 3.61 |
| 25,000 | 6.6B | 2.90 | 3.51 |
| 37,000 | 9.7B | 2.80 | 3.42 |
| 40,000 | 10.5B | 2.78 | **3.40** |

**Final Perplexity (WikiText-2): ~30**
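
Perplexity here is the exponential of the final validation loss: exp(3.40) ≈ 30.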

Training logs: [Weights & Biases](https://wandb.ai/anthony-maio-making-minds/Eve-2-MoE)

## Quick Start

This is a custom architecture, so you need the model class to load it. Download `modeling_eve.py` from this repo.

```python
import torch
import tiktoken
from modeling_eve import ModelConfig, DeepSeekMoE
from huggingface_hub import hf_hub_download

# Build the model, then pull the pretrained weights from the Hub
device = "cuda" if torch.cuda.is_available() else "cpu"
config = ModelConfig()
model = DeepSeekMoE(config)

weights = hf_hub_download(repo_id="anthonym21/Eve-2-MoE-272M", filename="pytorch_model.bin")
model.load_state_dict(torch.load(weights, map_location=device))
model.to(device).eval()

# Generate from a prompt using the GPT-2 tokenizer
enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor(enc.encode("The future of artificial intelligence is"),
                      dtype=torch.long, device=device).unsqueeze(0)

output = model.generate(tokens, max_new_tokens=100, temperature=0.8, top_k=50)
print(enc.decode(output[0].tolist()))
```

### CPU / Raspberry Pi

The full 272M-parameter model runs on CPU. Inference is slower but functional; the memory footprint is under 1 GB.

```python
device = "cpu"
# Everything else stays the same
```

## Intended Use

Eve-2 is a **fine-tuning base**, not a finished product. Out of the box it produces coherent English but has no instruction-following capability. The workflow:

1. Take this base model
2. Fine-tune on a narrow task (~20 min on a consumer GPU; see the sketch below)
3. Deploy at the edge as part of a specialized nano-LM swarm

**Target applications:** Data cleaning, PII redaction, text classification, semantic compression repair, lightweight routing/triage in multi-agent pipelines.
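
A minimal sketch of step 2, assuming the `DeepSeekMoE` class from the Quick Start, a standard causal-LM (next-token) loss, and a hypothetical `task_batches` iterable of tokenized examples that you supply; hyperparameters are illustrative, and the model's forward pass is assumed to return logits of shape `(batch, seq, vocab)`:

```python
import torch
import torch.nn.functional as F
from modeling_eve import ModelConfig, DeepSeekMoE

device = "cuda" if torch.cuda.is_available() else "cpu"
model = DeepSeekMoE(ModelConfig()).to(device)
# ...load the pretrained weights as in the Quick Start, then:
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.1)

model.train()
for tokens in task_batches:  # hypothetical: yields (batch, seq) LongTensors for your task
    tokens = tokens.to(device)
    logits = model(tokens)   # assumed (batch, seq, vocab) output
    # Next-token prediction: positions 0..n-2 predict tokens 1..n-1
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
```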

## Limitations

This is a 272M model. It will not write essays, follow complex instructions, or compete with larger models on general benchmarks. That's by design: it's a small, fast, cheap-to-tune specialist base.

The train/val gap of ~0.62 at convergence suggests the model could benefit from additional data diversity beyond FineWeb-Edu for downstream generalization.

## Files

```
├── pytorch_model.bin    # Model weights
├── config.json          # Architecture config
├── modeling_eve.py      # Model class definitions (required to load)
├── generate.py          # Standalone inference script
├── train.py             # DDP training script
└── requirements.txt     # Dependencies
```

## Citation

```bibtex
@misc{maio2026eve2moe,
  author = {Maio, Anthony},
  title = {Eve-2-MoE-272M: A Nano-Scale Mixture of Experts Base Model for SLM Swarms \& Edge Deployment},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/anthonym21/Eve-2-MoE-272M}
}
```

## License

MIT: free for research and commercial use.