---
license: mit
tags:
- moe
- deepseek
- nvidia-h200
- fineweb-edu
- pytorch
- text-generation
- nano-lm
- edge-ai
- rope
language:
- en
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
---

# Eve-2-MoE-272M

A custom 272M-parameter Mixture-of-Experts language model trained from scratch on **8× NVIDIA H200** GPUs. It implements a DeepSeek-V3-style architecture with a shared expert, top-k routed experts, RoPE positional encoding, and SwiGLU activations.

Eve-2 is a **base model for specialized fine-tuning**, not a chatbot. Fine-tune it in ~20 minutes on consumer hardware for narrow tasks like PII redaction, text classification, semantic compression cleanup, or lightweight routing in multi-agent pipelines. It runs on a Raspberry Pi.

**Author:** [Anthony Maio](https://making-minds.ai) / Making Minds AI (Independent)
https://www.github.com/anthony-maio
https://www.linkedin.com/in/anthony-maio

## Architecture

| | |
|---|---|
| **Total Parameters** | 272M |
| **Type** | Mixture of Experts (MoE) |
| **Routed Experts** | 8 |
| **Shared Experts** | 1 (always active) |
| **Active Params/Token** | ~80M (top-2 routing) |
| **Routing** | Top-2 gate with load-balancing aux loss |
| **Layers** | 12 transformer blocks |
| **Hidden Dim** | 512 |
| **Attention Heads** | 8 (64-dim each) |
| **Expert FFN Dim** | 1408 (SwiGLU) |
| **Position Encoding** | Rotary Position Embeddings (RoPE) |
| **Context Length** | 2048 tokens |
| **Vocab** | 50,304 (GPT-2 tokenizer, padded) |
| **Norm** | RMSNorm |
| **Precision** | BFloat16 (native) |
| **Weight Tying** | Embeddings tied with LM head |
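
As a rough sanity check on the 272M figure, the table values can be multiplied out. The sketch below assumes things this card does not state: all 12 blocks are MoE blocks, attention uses four bias-free 512×512 projections, each SwiGLU expert holds three 512×1408 matrices, and norm weights are small enough to ignore.

```python
# Back-of-envelope parameter count from the architecture table.
# ASSUMPTIONS (not confirmed by this card): all 12 blocks are MoE blocks,
# attention is four bias-free d_model x d_model projections, each SwiGLU
# expert holds three d_model x d_ff matrices, norm weights are ignored.
VOCAB, D_MODEL, D_FF = 50_304, 512, 1_408
LAYERS, ROUTED, SHARED, TOP_K = 12, 8, 1, 2

embedding = VOCAB * D_MODEL                 # tied with the LM head, counted once
attention = 4 * D_MODEL * D_MODEL * LAYERS  # Q, K, V, and output projections
expert = 3 * D_MODEL * D_FF                 # gate, up, and down projections
router = D_MODEL * ROUTED * LAYERS          # one linear gate per block

total = embedding + attention + (ROUTED + SHARED) * expert * LAYERS + router
active_ffn = (TOP_K + SHARED) * expert * LAYERS  # expert weights touched per token

print(f"total      ~ {total / 1e6:.1f}M")
print(f"active FFN ~ {active_ffn / 1e6:.1f}M")
```

Under these assumptions the total comes out close to 272M, and the expert weights touched per token come to roughly 78M, which is in the neighborhood of the ~80M figure in the table. Exactly which components that figure counts is not stated here.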

### Design Rationale

MoE at this scale is a deliberate choice. With 8 experts but only 2 active per token, inference cost is roughly equivalent to an 80M dense model while the total parameter budget gives each expert room to specialize. The shared expert handles common patterns across all tokens; the routed experts develop narrow competencies during fine-tuning.
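
To make the routing concrete, here is a minimal pure-Python sketch of a shared expert plus top-2 gating for a single token. It is not the code in `modeling_eve.py`: the function names, the toy scalar "experts", and the choice to renormalize the two selected gate weights are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    ASSUMPTION: the selected weights are renormalized to sum to 1; the
    actual model may weight its experts differently.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, shared_expert, routed_experts, router_logits, k=2):
    """Shared expert always runs; only k of the routed experts run."""
    out = shared_expert(x)
    for idx, weight in route_top_k(router_logits, k):
        out += weight * routed_experts[idx](x)
    return out

# Toy example: 8 routed "experts" that just scale a scalar input.
routed = [lambda x, s=s: s * x for s in range(8)]
shared = lambda x: x
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

chosen = route_top_k(logits)
print([i for i, _ in chosen])  # indices of the two experts that run
print(moe_forward(1.0, shared, routed, logits))
```

Only two of the eight routed functions are ever called for this token, which is where the compute saving over a dense layer of the same total size comes from.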

This makes Eve-2 a natural base for **nano-LM swarms**: fine-tune copies for specific tasks, deploy them at the edge, and coordinate them through lightweight protocols.

## Training

| | |
|---|---|
| **Hardware** | 8× NVIDIA H200 (141 GB VRAM each) |
| **Throughput** | ~1.26M tokens/sec |
| **Steps** | 40,000 |
| **Tokens** | ~10.5B |
| **Wall Time** | ~2.5 hours |
| **Data** | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (Sample-10BT) |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95, weight decay 0.1) |
| **Schedule** | Cosine decay with 200-step linear warmup |
| **Peak LR** | 5e-4 → decays to 5e-5 |
| **Batch** | 128 × 2048 tokens (16/GPU × 8 GPUs) |
| **Gradient Clipping** | 1.0 |
| **Distributed** | PyTorch DDP |
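
The schedule in the table can be written as a small function. This is a sketch, not the logic in `train.py`: it assumes warmup rises linearly from 0 to the peak over the first 200 steps and that the cosine phase spans the remaining steps down to 5e-5. The name `lr_at` is made up for the example.

```python
import math

PEAK_LR, MIN_LR = 5e-4, 5e-5
WARMUP_STEPS, TOTAL_STEPS = 200, 40_000

def lr_at(step):
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # ASSUMPTION: linear ramp from 0 up to the peak
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from PEAK_LR at the end of warmup to MIN_LR at the last step
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for step in (0, 100, 200, 20_100, 40_000):
    print(step, f"{lr_at(step):.2e}")

# Token budget implied by the batch row: 128 sequences x 2048 tokens per step
tokens_per_step = 128 * 2048
print(tokens_per_step * TOTAL_STEPS)  # 10485760000, the ~10.5B in the table
```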

### Convergence

| Step | Tokens Seen | Train Loss | Val Loss (WikiText-2) |
|------|------------|-----------|----------------------|
| 500 | 131M | 4.82 | 6.35 |
| 1,000 | 262M | 4.09 | 4.84 |
| 1,500 | 393M | 3.95 | 4.36 |
| 5,000 | 1.3B | 3.47 | 3.89 |
| 13,000 | 3.4B | 3.05 | 3.61 |
| 25,000 | 6.6B | 2.90 | 3.51 |
| 37,000 | 9.7B | 2.80 | 3.42 |
| 40,000 | 10.5B | 2.78 | **3.40** |

**Final Perplexity (WikiText-2): ~30**
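
Assuming the losses in the table are mean per-token cross-entropy in nats (the usual convention for PyTorch's cross-entropy loss, though it is not stated here), perplexity is the exponential of the loss. A quick check of the final row:

```python
import math

final_val_loss = 3.40    # WikiText-2 validation loss at step 40,000
final_train_loss = 2.78  # training loss at the same step

perplexity = math.exp(final_val_loss)
print(f"perplexity ~ {perplexity:.1f}")                            # about 30.0
print(f"train/val gap = {final_val_loss - final_train_loss:.2f}")  # 0.62
```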

Training logs: [Weights & Biases](https://wandb.ai/anthony-maio-making-minds/Eve-2-MoE)

## Quick Start

This is a custom architecture, so you need the model class to load it. Download `modeling_eve.py` from this repo and keep it importable from your script.

```python
import torch
import tiktoken
from modeling_eve import ModelConfig, DeepSeekMoE
from huggingface_hub import hf_hub_download

# Build the model from the default config and load the released weights
device = "cuda" if torch.cuda.is_available() else "cpu"
config = ModelConfig()
model = DeepSeekMoE(config)

weights = hf_hub_download(repo_id="anthonym21/Eve-2-MoE-272M", filename="pytorch_model.bin")
model.load_state_dict(torch.load(weights, map_location=device))
model.to(device).eval()

# Encode a prompt with the GPT-2 tokenizer and add a batch dimension
enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor(enc.encode("The future of artificial intelligence is"),
                       dtype=torch.long, device=device).unsqueeze(0)

# Sample up to 100 new tokens without tracking gradients
with torch.no_grad():
    output = model.generate(tokens, max_new_tokens=100, temperature=0.8, top_k=50)
print(enc.decode(output[0].tolist()))
```

### CPU / Raspberry Pi

At ~272M parameters the model is small enough to run on CPU. Inference is slower but functional, and the memory footprint is under 1 GB.

```python
device = "cpu"
# Everything else stays the same
```
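
The sub-1 GB figure is consistent with storing the weights in BFloat16. Here is a rough estimate that counts weights only and ignores activations and framework overhead; the 272M count is taken from the architecture table.

```python
# Weight-only memory estimate; activations and runtime overhead are ignored.
PARAMS = 272_000_000
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}

for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: {PARAMS * nbytes / 1e9:.2f} GB")
```

If the weights are upcast to float32, the same estimate roughly doubles to about 1.1 GB, so the dtype the model is kept in matters on small devices.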

## Intended Use

Eve-2 is a **fine-tuning base**, not a finished product. Out of the box it produces coherent English but has no instruction-following capability. The workflow:

1. Take this base model
2. Fine-tune on a narrow task (~20 min on a consumer GPU)
3. Deploy at the edge as part of a specialized nano-LM swarm

**Target applications:** Data cleaning, PII redaction, text classification, semantic compression repair, lightweight routing/triage in multi-agent pipelines.

## Limitations

This is a 272M model. It will not write essays, follow complex instructions, or compete with larger models on general benchmarks. That's by design: it's a small, fast, cheap-to-tune specialist base.

The train/val gap of ~0.62 at convergence suggests the model could benefit from additional data diversity beyond FineWeb-Edu for downstream generalization.

## Files

```
├── pytorch_model.bin     # Model weights
├── config.json           # Architecture config
├── modeling_eve.py       # Model class definitions (required to load)
├── generate.py           # Standalone inference script
├── train.py              # DDP training script
└── requirements.txt      # Dependencies
```

## Citation

```bibtex
@misc{anthony_maio_2026_eve2,
	author       = { Anthony Maio },
	title        = { Eve-2-MoE-272M (Revision ee90542) },
	year         = 2026,
	url          = { https://huggingface.co/anthonym21/Eve-2-MoE-272M },
	doi          = { 10.57967/hf/7731 },
	publisher    = { Hugging Face }
}
```

## License

MIT: free for research and commercial use.