Eve-2-MoE-272M
A custom 272M-parameter Mixture-of-Experts language model trained from scratch on 8× NVIDIA H200 GPUs. Implements a DeepSeek-V3-style architecture with a shared expert, top-k routed experts, RoPE positional encoding, and SwiGLU activations.
Eve-2 is a base model for specialized fine-tuning, not a chatbot. Fine-tune it in ~20 minutes on consumer hardware for narrow tasks like PII redaction, text classification, semantic compression cleanup, or lightweight routing in multi-agent pipelines. It runs on a Raspberry Pi.
Author: Anthony Maio / Making Minds AI (Independent) | https://www.github.com/anthony-maio | https://www.linkedin.com/in/anthony-maio
Architecture
| Spec | Value |
|---|---|
| Total Parameters | 272M |
| Type | Mixture of Experts (MoE) |
| Routed Experts | 8 |
| Shared Experts | 1 (always active) |
| Active Params/Token | ~80M (top-2 routing) |
| Routing | Top-2 gate with load-balancing aux loss |
| Layers | 12 transformer blocks |
| Hidden Dim | 512 |
| Attention Heads | 8 (64-dim each) |
| Expert FFN Dim | 1408 (SwiGLU) |
| Position Encoding | Rotary Position Embeddings (RoPE) |
| Context Length | 2048 tokens |
| Vocab | 50,304 (GPT-2 tokenizer, padded) |
| Norm | RMSNorm |
| Precision | BFloat16 (native) |
| Weight Tying | Embeddings tied with LM head |
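For orientation, the table above maps roughly onto a config object like the sketch below. The field names here are illustrative assumptions; the actual ModelConfig in modeling_eve.py is authoritative.

```python
from dataclasses import dataclass

# Illustrative sketch only -- field names are assumptions, not the real ModelConfig.
@dataclass
class EveConfigSketch:
    vocab_size: int = 50_304     # GPT-2 tokenizer, padded
    hidden_dim: int = 512
    n_layers: int = 12
    n_heads: int = 8             # 64-dim heads
    n_routed_experts: int = 8
    n_shared_experts: int = 1    # always active
    top_k: int = 2               # routed experts active per token
    expert_ffn_dim: int = 1408   # SwiGLU expert FFN width
    max_seq_len: int = 2048
    tie_embeddings: bool = True  # embeddings tied with LM head
```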
Design Rationale
MoE at this scale is a deliberate choice. With 8 experts but only 2 active per token, inference cost is roughly that of an 80M dense model, while the total parameter budget gives each expert room to specialize. The shared expert handles common patterns across all tokens; the routed experts develop narrow competencies during fine-tuning.
This makes Eve-2 a natural base for nano-LM swarms: fine-tune copies for specific tasks, deploy at the edge, coordinate through lightweight protocols.
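For concreteness, here is a minimal sketch of that routing pattern: a top-2 softmax gate over 8 routed experts, an always-on shared expert, and a simplified load-balancing auxiliary loss. It is illustrative, not the exact code in modeling_eve.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block, as used by the experts."""
    def __init__(self, dim, ffn_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, ffn_dim, bias=False)
        self.w_up = nn.Linear(dim, ffn_dim, bias=False)
        self.w_down = nn.Linear(ffn_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class MoESketch(nn.Module):
    """Top-2 MoE with a shared expert -- an illustrative sketch, not Eve-2's exact code."""
    def __init__(self, dim=512, ffn_dim=1408, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLU(dim, ffn_dim) for _ in range(n_experts))
        self.shared = SwiGLU(dim, ffn_dim)  # shared expert: applied to every token

    def forward(self, x):  # x: (tokens, dim)
        probs = F.softmax(self.gate(x), dim=-1)            # (tokens, n_experts)
        weights, idx = probs.topk(self.top_k, dim=-1)      # top-2 experts per token
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize gate weights
        out = self.shared(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if tok.numel() > 0:
                out[tok] = out[tok] + weights[tok, slot].unsqueeze(-1) * expert(x[tok])
        # Simplified stand-in for the load-balancing aux loss: penalizes uneven
        # mean router probability across experts (uniform load minimizes it).
        load = probs.mean(dim=0)
        aux_loss = (load * load).sum() * probs.size(-1)
        return out, aux_loss
```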
Training
| Setting | Value |
|---|---|
| Hardware | 8× NVIDIA H200 (141 GB VRAM each) |
| Throughput | ~1.26M tokens/sec (aggregate across GPUs) |
| Steps | 40,000 |
| Tokens | ~10.5B |
| Wall Time | ~2.5 hours |
| Data | FineWeb-Edu (Sample-10BT) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight decay 0.1) |
| Schedule | Cosine decay with 200-step linear warmup |
| Peak LR | 5e-4, decaying to 5e-5 |
| Batch | 128 × 2048 tokens (16/GPU × 8 GPUs) |
| Gradient Clipping | 1.0 |
| Distributed | PyTorch DDP |
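For reference, the schedule row above (200-step linear warmup to a 5e-4 peak, then cosine decay to 5e-5 over the 40,000 steps) behaves like this small sketch:

```python
import math

def lr_at(step: int, warmup: int = 200, total: int = 40_000,
          peak: float = 5e-4, floor: float = 5e-5) -> float:
    """Linear warmup then cosine decay, matching the table above."""
    if step < warmup:
        return peak * (step + 1) / warmup          # linear ramp up to the peak
    progress = (step - warmup) / (total - warmup)  # 0 at peak, 1 at the final step
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

# lr_at(200) == 5e-4 (peak); lr_at(40_000) == 5e-5 (floor)
```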
Convergence
| Step | Tokens Seen | Train Loss | Val Loss (WikiText-2) |
|---|---|---|---|
| 500 | 131M | 4.82 | 6.35 |
| 1,000 | 262M | 4.09 | 4.84 |
| 1,500 | 393M | 3.95 | 4.36 |
| 5,000 | 1.3B | 3.47 | 3.89 |
| 13,000 | 3.4B | 3.05 | 3.61 |
| 25,000 | 6.6B | 2.90 | 3.51 |
| 37,000 | 9.7B | 2.80 | 3.42 |
| 40,000 | 10.5B | 2.78 | 3.40 |
Final Perplexity (WikiText-2): ~30 (exp of the 3.40 final val loss)
Training logs: Weights & Biases
Quick Start
This is a custom architecture, so you need the model class to load it. Download modeling_eve.py from this repo.
```python
import torch
import tiktoken
from huggingface_hub import hf_hub_download

from modeling_eve import ModelConfig, DeepSeekMoE

# Load the architecture and pretrained weights
device = "cuda" if torch.cuda.is_available() else "cpu"
config = ModelConfig()
model = DeepSeekMoE(config)
weights = hf_hub_download(repo_id="anthonym21/Eve-2-MoE-272M", filename="pytorch_model.bin")
model.load_state_dict(torch.load(weights, map_location=device))
model.to(device).eval()

# Generate from a prompt
enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor(
    enc.encode("The future of artificial intelligence is"),
    dtype=torch.long, device=device,
).unsqueeze(0)
output = model.generate(tokens, max_new_tokens=100, temperature=0.8, top_k=50)
print(enc.decode(output[0].tolist()))
```
CPU / Raspberry Pi
At ~272M parameters, the model runs on CPU: inference is slower but functional, and the memory footprint is under 1 GB.

```python
device = "cpu"
# Everything else in the Quick Start stays the same
```
Intended Use
Eve-2 is a fine-tuning base, not a finished product. Out of the box it produces coherent English but has no instruction-following capability. The workflow:
- Take this base model
- Fine-tune on a narrow task (~20 min on a consumer GPU; a minimal loop sketch follows below)
- Deploy at the edge as part of a specialized nano-LM swarm
Target applications: Data cleaning, PII redaction, text classification, semantic compression repair, lightweight routing/triage in multi-agent pipelines.
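As a rough template for the fine-tuning step, a minimal loop might look like the sketch below. `train_dataset`, the label format, and the model's forward signature (returning per-token logits) are assumptions; adapt them to your task and to modeling_eve.py.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

# Hypothetical dataset yielding (input_ids, labels) LongTensors; -100 marks
# positions excluded from the loss. Supply your own for your target task.
loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.1)
model.train()

for epoch in range(3):
    for input_ids, labels in loader:
        input_ids, labels = input_ids.to(device), labels.to(device)
        logits = model(input_ids)  # assumed shape: (batch, seq, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               labels.view(-1), ignore_index=-100)
        # If the model's forward also returns a routing aux loss, add it here.
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # matches pretraining
        optimizer.step()
        optimizer.zero_grad()
```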
Limitations
This is a 272M model. It will not write essays, follow complex instructions, or compete with larger models on general benchmarks. That's by design: it's a small, fast, cheap-to-tune specialist base.
The train/val gap of ~0.62 at convergence suggests the model could benefit from additional data diversity beyond FineWeb-Edu for downstream generalization.
Files
```
├── pytorch_model.bin   # Model weights
├── config.json         # Architecture config
├── modeling_eve.py     # Model class definitions (required to load)
├── generate.py         # Standalone inference script
├── train.py            # DDP training script
└── requirements.txt    # Dependencies
```
Citation
```bibtex
@misc{anthony_maio_2026_eve2,
  author    = {Anthony Maio},
  title     = {Eve-2-MoE-272M (Revision ee90542)},
  year      = 2026,
  url       = {https://huggingface.co/anthonym21/Eve-2-MoE-272M},
  doi       = {10.57967/hf/7731},
  publisher = {Hugging Face}
}
```
License
MIT: free for research and commercial use.