---
license: mit
tags:
- moe
- deepseek
- nvidia-h200
- fineweb-edu
- pytorch
- text-generation
- nano-lm
- edge-ai
- rope
language:
- en
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
---
# Eve-2-MoE-272M
A custom 272M-parameter Mixture-of-Experts language model trained from scratch on **8× NVIDIA H200** GPUs. It implements a DeepSeek-V3-style architecture with a shared expert, top-k routed experts, RoPE positional encoding, and SwiGLU activations.
Eve-2 is a **base model for specialized fine-tuning**, not a chatbot. Fine-tune it in ~20 minutes on consumer hardware for narrow tasks like PII redaction, text classification, semantic compression cleanup, or lightweight routing in multi-agent pipelines. It runs on a Raspberry Pi.
**Author:** [Anthony Maio](https://making-minds.ai) / Making Minds AI (Independent)
https://www.github.com/anthony-maio
https://www.linkedin.com/in/anthony-maio
## Architecture
| | |
|---|---|
| **Total Parameters** | 272M |
| **Type** | Mixture of Experts (MoE) |
| **Routed Experts** | 8 |
| **Shared Experts** | 1 (always active) |
| **Active Params/Token** | ~80M (top-2 routing) |
| **Routing** | Top-2 gate with load-balancing aux loss |
| **Layers** | 12 transformer blocks |
| **Hidden Dim** | 512 |
| **Attention Heads** | 8 (64-dim each) |
| **Expert FFN Dim** | 1408 (SwiGLU) |
| **Position Encoding** | Rotary Position Embeddings (RoPE) |
| **Context Length** | 2048 tokens |
| **Vocab** | 50,304 (GPT-2 tokenizer, padded) |
| **Norm** | RMSNorm |
| **Precision** | BFloat16 (native) |
| **Weight Tying** | Embeddings tied with LM head |
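
The routing rows above can be illustrated with a small, dependency-free sketch. It is not taken from `modeling_eve.py`: the softmax-then-renormalise gate and the helper names `top_k_gate` and `moe_output` are assumptions made for illustration, and the real gate may score or normalise experts differently.

```python
import math

def top_k_gate(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Assumption: softmax over all experts, then renormalise the selected
    weights so they sum to 1. The gate in modeling_eve.py may differ.
    """
    # Numerically stable softmax over the router logits.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k largest probabilities, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_output(x, shared_expert, routed_experts, router_logits, k=2):
    """Add the always-active shared expert to the k selected routed experts."""
    out = shared_expert(x)
    for idx, weight in top_k_gate(router_logits, k):
        out += weight * routed_experts[idx](x)
    return out

# Toy usage: 8 routed experts, scalar "activations" instead of tensors.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
gates = top_k_gate(logits)
experts = [lambda x, scale=i: scale * x for i in range(8)]
y = moe_output(1.0, lambda x: x, experts, logits)
```

Only the two selected experts are evaluated for a given token, which is where the gap between total and active parameters comes from.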
### Design Rationale
MoE at this scale is a deliberate choice. With 8 experts but only 2 active per token, inference cost is roughly equivalent to an 80M dense model while the total parameter budget gives each expert room to specialize. The shared expert handles common patterns across all tokens; the routed experts develop narrow competencies during fine-tuning.
This makes Eve-2 a natural base for **nano-LM swarms**: fine-tune copies for specific tasks, deploy at the edge, coordinate through lightweight protocols.
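
As a rough sanity check on the parameter budget, the figures in the architecture table can be multiplied out. This is a back-of-the-envelope sketch under stated assumptions, not a count taken from the checkpoint: it assumes every one of the 12 blocks uses an MoE feed-forward layer, attention has four bias-free 512×512 projections, each SwiGLU expert has three weight matrices, and norm weights are ignored.

```python
# Values from the architecture table.
hidden, ffn, layers = 512, 1408, 12
routed, shared, top_k = 8, 1, 2
vocab = 50_304

expert = 3 * hidden * ffn        # SwiGLU: gate, up and down projections (assumed)
attention = 4 * hidden * hidden  # Q, K, V and output projections (assumed, no biases)
router = hidden * routed         # one linear gate per layer (assumed)
embedding = vocab * hidden       # counted once because of weight tying

total = layers * ((routed + shared) * expert + attention + router) + embedding
active_experts = layers * (shared + top_k) * expert

print(f"per expert:     {expert / 1e6:.2f}M")
print(f"total:          {total / 1e6:.1f}M")
print(f"active experts: {active_experts / 1e6:.1f}M per token")
```

Under these assumptions the total lands at about 272M, and the expert weights touched per token come to about 78M. Attention and embedding weights are used on top of that; the card does not say which components the "~80M" figure includes.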
## Training
| | |
|---|---|
| **Hardware** | 8× NVIDIA H200 (141 GB VRAM each) |
| **Throughput** | ~1.26M tokens/sec |
| **Steps** | 40,000 |
| **Tokens** | ~10.5B |
| **Wall Time** | ~2.5 hours |
| **Data** | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (Sample-10BT) |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95, weight decay 0.1) |
| **Schedule** | Cosine decay with 200-step linear warmup |
| **Peak LR** | 5e-4, decaying to 5e-5 |
| **Batch** | 128 × 2048 tokens (16/GPU × 8 GPUs) |
| **Gradient Clipping** | 1.0 |
| **Distributed** | PyTorch DDP |
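
The schedule rows above (200-step linear warmup, cosine decay from 5e-4 to 5e-5 over 40,000 steps) can be written as a single function. This is a minimal sketch of that shape, not the code in `train.py`; the function name `lr_at` and the exact warmup indexing are assumptions.

```python
import math

PEAK_LR, MIN_LR = 5e-4, 5e-5
WARMUP_STEPS, TOTAL_STEPS = 200, 40_000

def lr_at(step):
    """Learning rate for a 0-indexed optimizer step (illustrative only)."""
    if step < WARMUP_STEPS:
        # Linear warmup: reaches the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from the peak down to the floor.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

for step in (0, 199, 20_000, 40_000):
    print(step, f"{lr_at(step):.2e}")
```

In PyTorch, a function like this could be divided by the peak rate and passed to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned factor.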
### Convergence
| Step | Tokens Seen | Train Loss | Val Loss (WikiText-2) |
|------|------------|-----------|----------------------|
| 500 | 131M | 4.82 | 6.35 |
| 1,000 | 262M | 4.09 | 4.84 |
| 1,500 | 393M | 3.95 | 4.36 |
| 5,000 | 1.3B | 3.47 | 3.89 |
| 13,000 | 3.4B | 3.05 | 3.61 |
| 25,000 | 6.6B | 2.90 | 3.51 |
| 37,000 | 9.7B | 2.80 | 3.42 |
| 40,000 | 10.5B | 2.78 | **3.40** |
**Final Perplexity (WikiText-2): ~30**
Training logs: [Weights & Biases](https://wandb.ai/anthony-maio-making-minds/Eve-2-MoE)
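
The token counts and the final perplexity in the table follow from simple arithmetic. The sketch below assumes the reported losses are mean cross-entropy in nats per token (the PyTorch default), which the card does not state explicitly.

```python
import math

TOKENS_PER_STEP = 128 * 2048  # sequences per step × context length

def tokens_seen(step):
    """Total training tokens processed after `step` optimizer steps."""
    return step * TOKENS_PER_STEP

def perplexity(loss_nats):
    """Perplexity from a mean cross-entropy loss in nats per token."""
    return math.exp(loss_nats)

print(f"{tokens_seen(500) / 1e6:.0f}M tokens at step 500")
print(f"{tokens_seen(40_000) / 1e9:.2f}B tokens at step 40,000")
print(f"perplexity at val loss 3.40: {perplexity(3.40):.1f}")
```

A validation loss of 3.40 corresponds to a perplexity of about 30, matching the figure above.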
## Quick Start
This is a custom architecture, so you need the model class to load it. Download `modeling_eve.py` from this repo.
```python
import torch
import tiktoken
from modeling_eve import ModelConfig, DeepSeekMoE
from huggingface_hub import hf_hub_download
# Load
device = "cuda" if torch.cuda.is_available() else "cpu"
config = ModelConfig()
model = DeepSeekMoE(config)
weights = hf_hub_download(repo_id="anthonym21/Eve-2-MoE-272M", filename="pytorch_model.bin")
model.load_state_dict(torch.load(weights, map_location=device))
model.to(device).eval()
# Generate
enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor(enc.encode("The future of artificial intelligence is"),
                      dtype=torch.long, device=device).unsqueeze(0)
output = model.generate(tokens, max_new_tokens=100, temperature=0.8, top_k=50)
print(enc.decode(output[0].tolist()))
```
### CPU / Raspberry Pi
At ~272M parameters, the model is small enough to run on CPU. Inference is slower but functional, and the memory footprint is under 1 GB.
```python
device = "cpu"
# Everything else stays the same
```
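
The memory figure can be checked with a quick estimate of weight storage. This sketch counts weights only and ignores activations and any runtime buffers; the helper name `weight_megabytes` is made up for illustration.

```python
PARAMS = 272_000_000
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}

def weight_megabytes(dtype):
    """Approximate size of the model weights in MB for a given dtype."""
    return PARAMS * BYTES_PER_PARAM[dtype] / 1e6

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_megabytes(dtype):.0f} MB")
```

In BFloat16 the weights take roughly 544 MB. Casting them to float32 would roughly double that and push the weights alone past 1 GB.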
## Intended Use
Eve-2 is a **fine-tuning base**, not a finished product. Out of the box it produces coherent English but has no instruction-following capability. The workflow:
1. Take this base model
2. Fine-tune on a narrow task (~20 min on a consumer GPU)
3. Deploy at the edge as part of a specialized nano-LM swarm
**Target applications:** Data cleaning, PII redaction, text classification, semantic compression repair, lightweight routing/triage in multi-agent pipelines.
## Limitations
This is a 272M model. It will not write essays, follow complex instructions, or compete with larger models on general benchmarks. That's by design: it's a small, fast, cheap-to-tune specialist base.
The ~0.62 gap at convergence between train loss (2.78) and WikiText-2 validation loss (3.40) suggests the model could benefit from additional data diversity beyond FineWeb-Edu for downstream generalization.
## Files
```
├── pytorch_model.bin    # Model weights
├── config.json          # Architecture config
├── modeling_eve.py      # Model class definitions (required to load)
├── generate.py          # Standalone inference script
├── train.py             # DDP training script
└── requirements.txt     # Dependencies
```
## Citation
```bibtex
@misc{anthony_maio_2026_eve2,
author = { Anthony Maio },
title = { Eve-2-MoE-272M (Revision ee90542) },
year = 2026,
url = { https://huggingface.co/anthonym21/Eve-2-MoE-272M },
doi = { 10.57967/hf/7731 },
publisher = { Hugging Face }
}
```
## License
MIT. Free for research and commercial use.