Eve-2-MoE-272M

A custom 272M-parameter Mixture-of-Experts language model trained from scratch on 8× NVIDIA H200 GPUs. Implements a DeepSeek-V3 style architecture with a shared expert, top-k routed experts, RoPE positional encoding, and SwiGLU activations.

Eve-2 is a base model for specialized fine-tuning, not a chatbot. Fine-tune it in ~20 minutes on consumer hardware for narrow tasks like PII redaction, text classification, semantic compression cleanup, or lightweight routing in multi-agent pipelines. Runs on a Raspberry Pi.

Author: Anthony Maio / Making Minds AI (Independent) https://www.github.com/anthony-maio https://www.linkedin.com/in/anthony-maio

Architecture

Total Parameters: 272M
Type: Mixture of Experts (MoE)
Routed Experts: 8
Shared Experts: 1 (always active)
Active Params/Token: ~80M (top-2 routing)
Routing: Top-2 gate with load-balancing aux loss
Layers: 12 transformer blocks
Hidden Dim: 512
Attention Heads: 8 (64-dim each)
Expert FFN Dim: 1408 (SwiGLU)
Position Encoding: Rotary Position Embeddings (RoPE)
Context Length: 2048 tokens
Vocab: 50,304 (GPT-2 tokenizer, padded)
Norm: RMSNorm
Precision: BFloat16 (native)
Weight Tying: Embeddings tied with LM head
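
The figures in the table above can be cross-checked with a little arithmetic. The sketch below is an estimate, not the author's accounting: it assumes bias-free linear layers, standard multi-head attention with Q, K, V and output projections, three weight matrices per SwiGLU expert, and two RMSNorm weight vectors per block plus a final one. None of these details are stated in the card.

```python
# Rough parameter count from the architecture table.
# ASSUMPTIONS (not confirmed by the model card): no biases, Q/K/V/O attention
# projections, three matrices per SwiGLU expert, two RMSNorms per block.

VOCAB, HIDDEN, LAYERS = 50_304, 512, 12
FFN_DIM, ROUTED, SHARED, TOP_K = 1408, 8, 1, 2

embedding = VOCAB * HIDDEN          # tied with the LM head, so counted once
attention = 4 * HIDDEN * HIDDEN     # Q, K, V, O projections per layer
expert = 3 * HIDDEN * FFN_DIM       # gate, up, down projections (SwiGLU)
router = HIDDEN * ROUTED            # one logit per routed expert
norms = 2 * HIDDEN                  # two RMSNorm weight vectors per block

per_layer = attention + (ROUTED + SHARED) * expert + router + norms
total = embedding + LAYERS * per_layer + HIDDEN   # + final RMSNorm

# Expert weights touched per token: top-2 routed experts plus the shared expert.
active_expert = LAYERS * (TOP_K + SHARED) * expert

print(f"total parameters:      {total / 1e6:.1f}M")          # 272.0M
print(f"active expert weights: {active_expert / 1e6:.1f}M")  # 77.9M
```

Under these assumptions the total matches the stated 272M, and the expert weights active per token come to roughly 78M, close to the ~80M figure. Counting attention and embedding weights as well would give a larger active number, so the exact value depends on what is included.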

Design Rationale

MoE at this scale is a deliberate choice. With 8 experts but only 2 active per token, inference cost is roughly equivalent to an 80M dense model while the total parameter budget gives each expert room to specialize. The shared expert handles common patterns across all tokens; the routed experts develop narrow competencies during fine-tuning.
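
To make the shared-plus-routed idea concrete, here is a minimal sketch in plain Python using scalar "experts". It is an illustration only, not the code in modeling_eve.py: the function names are hypothetical, and the gate is assumed to be a softmax over router logits with the top-2 weights renormalized to sum to 1, which the card does not specify.

```python
import math

def top_k_route(logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Softmax over all routed experts (max-subtracted for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Keep the k most probable experts and renormalize their weights to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]

def moe_output(x, shared_expert, routed_experts, router_logits, k=2):
    """Shared expert output plus the weighted outputs of the top-k routed experts."""
    out = shared_expert(x)                      # always active
    for i, w in top_k_route(router_logits, k):  # only k of the routed experts run
        out += w * routed_experts[i](x)
    return out

# Toy usage: 8 routed experts that scale their input, one identity shared expert.
routed = [lambda x, s=s: s * x for s in range(8)]
shared = lambda x: x
logits = [0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0]   # experts 3 and 5 score highest

chosen = top_k_route(logits)
result = moe_output(1.0, shared, routed, logits)
print(chosen, result)
```

Only two of the eight routed experts are evaluated for each input, which is where the reduced inference cost comes from; the other six contribute capacity but no compute for that token.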

This makes Eve-2 a natural base for nano-LM swarms: fine-tune copies for specific tasks, deploy at the edge, coordinate through lightweight protocols.

Training

Hardware: 8× NVIDIA H200 (141 GB VRAM each)
Throughput: ~1.26M tokens/sec
Steps: 40,000
Tokens: ~10.5B
Wall Time: ~2.5 hours
Data: FineWeb-Edu (Sample-10BT)
Optimizer: AdamW (β₁=0.9, β₂=0.95, weight decay 0.1)
Schedule: Cosine decay with 200-step linear warmup
Peak LR: 5e-4 → decays to 5e-5
Batch: 128 × 2048 tokens (16/GPU × 8 GPUs)
Gradient Clipping: 1.0
Distributed: PyTorch DDP
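
The schedule and token budget listed above can be sketched as follows. This is one common way to write linear warmup followed by cosine decay, using the listed peak and final learning rates; the exact formula in train.py may differ, and `lr_at` is a hypothetical name.

```python
import math

PEAK_LR, MIN_LR = 5e-4, 5e-5
WARMUP_STEPS, TOTAL_STEPS = 200, 40_000

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

# Token budget implied by the batch size and step count.
tokens_per_step = 128 * 2048                   # sequences per batch x context length
total_tokens = tokens_per_step * TOTAL_STEPS   # 10,485,760,000
compute_hours = total_tokens / 1.26e6 / 3600   # at the reported throughput

print(f"tokens: {total_tokens / 1e9:.2f}B, compute time: {compute_hours:.2f} h")
```

The product of batch size, context length, and steps gives about 10.5B tokens, and dividing by the reported throughput gives roughly 2.3 hours, which is consistent with the ~2.5 hour wall time.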

Convergence

Step      Tokens Seen   Train Loss   Val Loss (WikiText-2)
500       131M          4.82         6.35
1,000     262M          4.09         4.84
1,500     393M          3.95         4.36
5,000     1.3B          3.47         3.89
13,000    3.4B          3.05         3.61
25,000    6.6B          2.90         3.51
37,000    9.7B          2.80         3.42
40,000    10.5B         2.78         3.40

Final Perplexity (WikiText-2): ~30
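
The perplexity figure follows directly from the final validation loss, assuming the loss is the mean cross-entropy per token in nats (the usual convention, though the card does not say so explicitly):

```python
import math

train_loss = 2.78   # final train loss at step 40,000
val_loss = 3.40     # final WikiText-2 validation loss at step 40,000

perplexity = math.exp(val_loss)   # perplexity = exp(mean cross-entropy in nats)
gap = val_loss - train_loss       # train/val gap at convergence

print(f"perplexity: {perplexity:.1f}, train/val gap: {gap:.2f}")
```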

Training logs: Weights & Biases

Quick Start

This is a custom architecture, so you need the model class to load it. Download modeling_eve.py from this repo.

import torch
import tiktoken
from modeling_eve import ModelConfig, DeepSeekMoE
from huggingface_hub import hf_hub_download

# Load
device = "cuda" if torch.cuda.is_available() else "cpu"
config = ModelConfig()
model = DeepSeekMoE(config)

weights = hf_hub_download(repo_id="anthonym21/Eve-2-MoE-272M", filename="pytorch_model.bin")
model.load_state_dict(torch.load(weights, map_location=device))
model.to(device).eval()

# Generate
enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor(enc.encode("The future of artificial intelligence is"),
                       dtype=torch.long, device=device).unsqueeze(0)

output = model.generate(tokens, max_new_tokens=100, temperature=0.8, top_k=50)
print(enc.decode(output[0].tolist()))

CPU / Raspberry Pi

At ~272M parameters, the model is small enough to run on CPU. Inference is slower but functional, and the memory footprint is under 1 GB.

device = "cpu"
# Everything else stays the same
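
As a back-of-the-envelope check on the memory figure, the weight storage can be estimated from the parameter count and the bytes per parameter. This is a sketch that covers weights only; activations and any attention cache add to it, and the helper name is hypothetical.

```python
PARAMS = 272_000_000
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}

def weight_memory_gb(params, dtype):
    """Approximate weight storage in GB (1 GB = 1e9 bytes) for a given dtype."""
    return params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.2f} GB")
```

By this estimate the weights take about 0.54 GB in BFloat16, in line with the sub-1 GB figure. If the weights were upcast to float32 they would need about 1.09 GB, so the dtype used on CPU matters on memory-constrained boards.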

Intended Use

Eve-2 is a fine-tuning base, not a finished product. Out of the box it produces coherent English but has no instruction-following capability. The workflow:

  1. Take this base model
  2. Fine-tune on a narrow task (~20 min on a consumer GPU)
  3. Deploy at the edge as part of a specialized nano-LM swarm

Target applications: Data cleaning, PII redaction, text classification, semantic compression repair, lightweight routing/triage in multi-agent pipelines.

Limitations

This is a 272M model. It will not write essays, follow complex instructions, or compete with larger models on general benchmarks. That's by design: it's a small, fast, cheap-to-tune specialist base.

The train/val gap of ~0.62 at convergence suggests the model could benefit from additional data diversity beyond FineWeb-Edu for downstream generalization.

Files

├── pytorch_model.bin     # Model weights
├── config.json           # Architecture config
├── modeling_eve.py       # Model class definitions (required to load)
├── generate.py           # Standalone inference script
├── train.py              # DDP training script
└── requirements.txt      # Dependencies

Citation

@misc{anthony_maio_2026_eve2,
    author       = { Anthony Maio },
    title        = { Eve-2-MoE-272M (Revision ee90542) },
    year         = 2026,
    url          = { https://huggingface.co/anthonym21/Eve-2-MoE-272M },
    doi          = { 10.57967/hf/7731 },
    publisher    = { Hugging Face }
}

License

MIT. Free for research and commercial use.
