---
license: mit
tags:
- moe
- deepseek
- nvidia-h200
- fineweb-edu
- pytorch
- text-generation
- nano-lm
- edge-ai
- rope
language:
- en
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
---
# Eve-2-MoE-272M

A custom 272M-parameter Mixture-of-Experts language model trained from scratch on **8× NVIDIA H200** GPUs. Implements a DeepSeek-V3 style architecture with a shared expert, top-k routed experts, RoPE positional encoding, and SwiGLU activations.

Eve-2 is a **base model for specialized fine-tuning**, not a chatbot. Fine-tune it in ~20 minutes on consumer hardware for narrow tasks like PII redaction, text classification, semantic compression cleanup, or lightweight routing in multi-agent pipelines. Runs on a Raspberry Pi.

**Author:** [Anthony Maio](https://making-minds.ai) / Making Minds AI (Independent)

https://www.github.com/anthony-maio

https://www.linkedin.com/in/anthony-maio

## Architecture

| | |
|---|---|
| **Total Parameters** | 272M |
| **Type** | Mixture of Experts (MoE) |
| **Routed Experts** | 8 |
| **Shared Experts** | 1 (always active) |
| **Active Params/Token** | ~80M (top-2 routing) |
| **Routing** | Top-2 gate with load-balancing aux loss |
| **Layers** | 12 transformer blocks |
| **Hidden Dim** | 512 |
| **Attention Heads** | 8 (64-dim each) |
| **Expert FFN Dim** | 1408 (SwiGLU) |
| **Position Encoding** | Rotary Position Embeddings (RoPE) |
| **Context Length** | 2048 tokens |
| **Vocab** | 50,304 (GPT-2 tokenizer, padded) |
| **Norm** | RMSNorm |
| **Precision** | BFloat16 (native) |
| **Weight Tying** | Embeddings tied with LM head |

### Design Rationale

MoE at this scale is a deliberate choice. With 8 experts but only 2 active per token, inference cost is roughly equivalent to an 80M dense model, while the total parameter budget gives each expert room to specialize. The shared expert handles common patterns across all tokens; the routed experts develop narrow competencies during fine-tuning.

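The routing pattern described above is easy to sketch. The block below is a minimal illustration, not the implementation in `modeling_eve.py` (the load-balancing auxiliary loss is omitted): a shared SwiGLU expert that processes every token, plus a top-2 gate over 8 routed experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert FFN: SwiGLU projecting 512 -> 1408 -> 512."""
    def __init__(self, d_model=512, d_ff=1408):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class SharedPlusRoutedMoE(nn.Module):
    """Shared expert (always active) + top-2 of 8 routed experts."""
    def __init__(self, d_model=512, d_ff=1408, n_experts=8, top_k=2):
        super().__init__()
        self.shared = SwiGLUExpert(d_model, d_ff)
        self.experts = nn.ModuleList([SwiGLUExpert(d_model, d_ff) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: [tokens, d_model]
        out = self.shared(x)                           # shared expert sees every token
        gates = self.router(x).softmax(dim=-1)         # routing probabilities
        weights, idx = gates.topk(self.top_k, dim=-1)  # keep the top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# Only 1 shared + 2 routed experts run per token, so per-token compute stays
# close to a small dense FFN while total capacity is 9 experts per layer.
layer = SharedPlusRoutedMoE()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```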
This makes Eve-2 a natural base for **nano-LM swarms**: fine-tune copies for specific tasks, deploy at the edge, coordinate through lightweight protocols.

## Training

| | |
|---|---|
| **Hardware** | 8× NVIDIA H200 (141 GB VRAM each) |
| **Throughput** | ~1.26M tokens/sec |
| **Steps** | 40,000 |
| **Tokens** | ~10.5B |
| **Wall Time** | ~2.5 hours |
| **Data** | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (Sample-10BT) |
| **Optimizer** | AdamW (β₁=0.9, β₂=0.95, weight decay 0.1) |
| **Schedule** | Cosine decay with 200-step linear warmup |
| **Peak LR** | 5e-4 (decays to 5e-5) |
| **Batch** | 128 × 2048 tokens (16/GPU × 8 GPUs) |
| **Gradient Clipping** | 1.0 |
| **Distributed** | PyTorch DDP |

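For reference, the optimizer and schedule above correspond to roughly the following plain-PyTorch setup. This is an illustrative sketch, not the repo's `train.py`, and it assumes `model` is the instantiated `DeepSeekMoE` from the Quick Start below.

```python
import math
import torch

peak_lr, min_lr = 5e-4, 5e-5
warmup_steps, max_steps = 200, 40_000  # 40k steps x 128 seqs x 2048 tokens ≈ 10.5B tokens

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    # Linear warmup for 200 steps, then cosine decay from peak_lr down to min_lr.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training step, gradients are clipped to a max norm of 1.0:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```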
### Convergence

| Step | Tokens Seen | Train Loss | Val Loss (WikiText-2) |
|------|------------|-----------|----------------------|
| 500 | 131M | 4.82 | 6.35 |
| 1,000 | 262M | 4.09 | 4.84 |
| 1,500 | 393M | 3.95 | 4.36 |
| 5,000 | 1.3B | 3.47 | 3.89 |
| 13,000 | 3.4B | 3.05 | 3.61 |
| 25,000 | 6.6B | 2.90 | 3.51 |
| 37,000 | 9.7B | 2.80 | 3.42 |
| 40,000 | 10.5B | 2.78 | **3.40** |

**Final Perplexity (WikiText-2): ~30**

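The perplexity figure follows directly from the validation loss above, since perplexity is exp(cross-entropy):

```python
import math

print(math.exp(3.40))  # ≈ 29.96, i.e. WikiText-2 perplexity ≈ 30
```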
Training logs: [Weights & Biases](https://wandb.ai/anthony-maio-making-minds/Eve-2-MoE)

## Quick Start

This is a custom architecture, so you need the model class to load it. Download `modeling_eve.py` from this repo.

```python
import torch
import tiktoken
from modeling_eve import ModelConfig, DeepSeekMoE
from huggingface_hub import hf_hub_download

# Load
device = "cuda" if torch.cuda.is_available() else "cpu"
config = ModelConfig()
model = DeepSeekMoE(config)

weights = hf_hub_download(repo_id="anthonym21/Eve-2-MoE-272M", filename="pytorch_model.bin")
model.load_state_dict(torch.load(weights, map_location=device))
model.to(device).eval()

# Generate
enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor(enc.encode("The future of artificial intelligence is"),
                      dtype=torch.long, device=device).unsqueeze(0)

output = model.generate(tokens, max_new_tokens=100, temperature=0.8, top_k=50)
print(enc.decode(output[0].tolist()))
```
### CPU / Raspberry Pi

The model runs on CPU at its full ~272M parameters. Inference is slower but functional, and the memory footprint is under 1 GB.

```python
device = "cpu"
# Everything else stays the same
```

## Intended Use

Eve-2 is a **fine-tuning base**, not a finished product. Out of the box it produces coherent English but has no instruction-following capability. The workflow:

1. Take this base model
2. Fine-tune on a narrow task (~20 min on consumer GPU)
3. Deploy at the edge as part of a specialized nano-LM swarm

**Target applications:** Data cleaning, PII redaction, text classification, semantic compression repair, lightweight routing/triage in multi-agent pipelines.

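As a rough illustration of step 2, here is a minimal fine-tuning sketch. The data, learning rate, and loss call are illustrative assumptions; in particular, it assumes the forward pass takes token ids and returns `[batch, seq, vocab]` logits, so check `modeling_eve.py` (or `train.py`) for the actual interface.

```python
import torch
import torch.nn.functional as F
import tiktoken
from modeling_eve import ModelConfig, DeepSeekMoE

device = "cuda" if torch.cuda.is_available() else "cpu"
enc = tiktoken.get_encoding("gpt2")
model = DeepSeekMoE(ModelConfig()).to(device)
# ...load the pretrained weights here exactly as in the Quick Start snippet above.

# Hypothetical task data: (prompt, completion) pairs for a narrow task like PII redaction.
pairs = [("Redact: Call me at 555-0199.", " Call me at [PHONE].")]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for epoch in range(3):
    for prompt, completion in pairs:
        ids = torch.tensor([enc.encode(prompt + completion)], device=device)
        inputs, labels = ids[:, :-1], ids[:, 1:]   # next-token prediction targets
        logits = model(inputs)  # assumption: forward returns logits; if it also returns a
                                # load-balancing aux loss, add that term to `loss` below
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
```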
## Limitations

This is a 272M model. It will not write essays, follow complex instructions, or compete with larger models on general benchmarks. That's by design: it's a small, fast, cheap-to-tune specialist base.

The train/val gap of ~0.62 at convergence suggests the model could benefit from additional data diversity beyond FineWeb-Edu for downstream generalization.

## Files

```
├── pytorch_model.bin   # Model weights
├── config.json         # Architecture config
├── modeling_eve.py     # Model class definitions (required to load)
├── generate.py         # Standalone inference script
├── train.py            # DDP training script
└── requirements.txt    # Dependencies
```
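To pull the whole repository locally (weights, `modeling_eve.py`, and the scripts above) in one call, you can use `snapshot_download` from `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="anthonym21/Eve-2-MoE-272M")
print(local_dir)  # local path containing the files listed above
```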
## Citation

```bibtex
@misc{anthony_maio_2026_eve2,
  author    = {Anthony Maio},
  title     = {Eve-2-MoE-272M (Revision ee90542)},
  year      = 2026,
  url       = {https://huggingface.co/anthonym21/Eve-2-MoE-272M},
  doi       = {10.57967/hf/7731},
  publisher = {Hugging Face}
}
```
## License

MIT. Free for research and commercial use.