# Eve-2-MoE-IT-272M
**The Foundation for Nano-Scale Swarm Intelligence**
Eve-2-MoE-IT-272M is a 272M-parameter instruction-tuned model designed as the foundational base for the Eve Swarm: a collection of hyper-specialized, CPU-deployable adapters.

Unlike massive generalist LLMs, Eve is built for deterministic-ish transformations. She is designed to be "overfitted" into specialists that each perform one job perfectly (e.g., SQL generation, Git commits, JSON extraction) with negligible latency and cost.
**Author:** Anthony Maio / Public Outputs
## The Eve Swarm (Specialist Ecosystem)
This model serves as the parent for the following fully fine-tuned (FFT) specialists. All members were trained on an NVIDIA H200 SXM to ensure optimal embedding alignment.
| Specialist Model | Task | Dataset Source | Size | Samples | Final Loss |
|---|---|---|---|---|---|
| Eve-NanoFunction | Strict JSON Function Calling: produces valid JSON outputs from natural language. | glaive-function-calling-v2 | 272M | 35k | <0.4 |
| Eve-NanoSummary | Conversation Summarization: condenses dialogues into concise summaries. | knkarthick/dialogsum | 272M | 12.5k | <1.0 |
| Eve-NanoCommit | Git Diff → Commit Message: writes conventional commits from raw code diffs. | bigcode/commitpackft | 272M | 20k | <1.0 |
| Eve-NanoExtract | Text → Structured Data: extracts parameters/entities into strict JSON schemas. | Salesforce/xlam-function-calling | 272M | 20k | <0.4 |
| Eve-NanoSQL | Natural Language → SQL: converts questions to SQL using table context. | b-mc2/sql-create-context | 272M | 25k | <0.2 |
| Eve-NanoPrompt | Prompt Expansion: expands simple ideas into rich image-generation prompts. | Stable-Diffusion-Prompts | 272M | 15k | <1.0 |
| Eve-NanoRouter | Intent Classification: routes user queries to the correct swarm member. | bitext/customer-support | 272M | 25k | <0.3 |
| Eve-NanoPII | PII Redaction: identifies and masks sensitive entities. | ai4privacy/pii-masking-200k | 272M | 35k | <0.1 |
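The router-plus-specialists pattern can be sketched as a simple dispatch loop. Everything below is illustrative: the repo IDs and the keyword-based `route` stub stand in for actual Eve-NanoRouter inference and are not confirmed by this card.

```python
# Hypothetical swarm dispatcher: the router picks an intent label,
# then the matching specialist handles the request. Model calls are stubbed.
SPECIALISTS = {
    "sql": "anthonym21/Eve-NanoSQL",       # repo IDs are illustrative
    "commit": "anthonym21/Eve-NanoCommit",
    "extract": "anthonym21/Eve-NanoExtract",
}

def route(query: str) -> str:
    """Stand-in for Eve-NanoRouter inference: return an intent label."""
    if "SELECT" in query or "table" in query:
        return "sql"
    if "diff" in query:
        return "commit"
    return "extract"

def dispatch(query: str) -> str:
    label = route(query)
    return SPECIALISTS[label]  # in practice: load and call this specialist

print(dispatch("Which table holds user emails?"))  # → anthonym21/Eve-NanoSQL
```

In a real deployment the `route` stub would be replaced by a generate call against Eve-NanoRouter, and each specialist would be a separately loaded 272M checkpoint.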
## Technical Specifications
### Architecture: Nano-MoE
Eve uses a DeepSeek-style Mixture-of-Experts architecture scaled down to the "Nano" range.
- Total Parameters: 272M
- Active Parameters: ~80M (per token)
- Experts: 8 routed + 1 shared
- Top-K: 2
- Context Window: 2048 tokens
- Vocab: 50,304 (GPT-2 compatible)
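For readers unfamiliar with MoE routing, here is a minimal sketch of the scheme described above: each token activates its top-2 of 8 routed experts plus one always-on shared expert. The hidden size and `nn.Linear` experts are illustrative stand-ins, not Eve's actual implementation.

```python
import torch
import torch.nn as nn

# Toy dimensions; Eve's real config differs.
d, n_experts, top_k = 64, 8, 2
experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
shared = nn.Linear(d, d)                     # shared expert: always active
router = nn.Linear(d, n_experts)             # gating network

def moe_forward(x):                          # x: (tokens, d)
    probs = router(x).softmax(-1)            # (tokens, n_experts)
    w, idx = probs.topk(top_k, dim=-1)       # top-2 weights and indices
    w = w / w.sum(-1, keepdim=True)          # renormalize over the chosen 2
    out = shared(x)                          # start from the shared expert
    for i, expert in enumerate(experts):
        hit = (idx == i).any(-1)             # tokens that routed to expert i
        if hit.any():
            wi = w[hit][idx[hit] == i].unsqueeze(-1)
            out[hit] = out[hit] + wi * expert(x[hit])
    return out

y = moe_forward(torch.randn(5, d))
print(y.shape)  # torch.Size([5, 64])
```

This is why only ~80M of the 272M parameters are active per token: each token touches the router, the shared expert, and just 2 of the 8 routed experts.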
### Training Config (H200 SXM)
This model was trained using Full Fine-Tuning (FFT). We found that LoRA was insufficient for aligning the embeddings of such a small model; unfreezing all weights yielded significant performance gains. You don't need an H200; it's absurdly overkill. I love it.
- Hardware: NVIDIA H200 SXM (141 GB VRAM)
- Method: Full Fine-Tuning (no PEFT/LoRA)
- Precision: `bfloat16`
- Batch Size: 128 (global)
- Learning Rate: 5e-5 (cosine schedule)
- Collator: `DataCollatorForCompletionOnlyLM` (masked user prompts)
## How to Tune Eve 2
If you want to train your own Eve specialist, follow these rules derived from our H200 experiments:
- Abandon LoRA: For a 272M model, LoRA restricts the embedding space too much. You have the VRAM; use Full Fine-Tuning.
- Mask User Prompts: You must use a collator that masks the prompt (loss only on the `Assistant:` response). If the model calculates loss on the "User:" instructions, it wastes capacity learning English grammar instead of the task.
- Batch Size Matters: We saturated the H200 with `batch_size=128`. High batch sizes stabilize the gradients for these volatile small architectures.
- Dataset Quality > Quantity:
  - Bad: 100k rows of scraped web text.
  - Good: 10k rows of "Input → Ideal Output" pairs.
- Sweet Spot: 2 epochs. Do not over-train; these models memorize quickly.
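The prompt-masking rule can be illustrated in plain Python. This mirrors what TRL's `DataCollatorForCompletionOnlyLM` does with a response template: every label up to and including the template is set to `-100` so the loss only covers the assistant's answer. Token IDs here are toy values, not Eve's tokenizer output.

```python
# Sketch of completion-only loss masking. Labels set to -100 are
# ignored by the cross-entropy loss in transformers.
IGNORE_INDEX = -100

def mask_prompt(input_ids, response_ids):
    """Return labels with everything up to and including the
    response-template tokens masked out."""
    labels = list(input_ids)
    n = len(response_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == list(response_ids):
            for j in range(start + n):
                labels[j] = IGNORE_INDEX
            return labels
    return [IGNORE_INDEX] * len(labels)  # no template found: mask everything

# Toy example: ids 7, 8 stand for the "Assistant:" template tokens.
ids = [1, 2, 3, 7, 8, 4, 5]
print(mask_prompt(ids, [7, 8]))  # → [-100, -100, -100, -100, -100, 4, 5]
```

With this masking in place, all of the model's capacity during fine-tuning goes into producing the response rather than re-learning the prompt.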
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "anthonym21/Eve-2-MoE-IT-272M"

# Load with trust_remote_code=True for the custom MoE architecture
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Standard "User: ... Assistant:" formatting
prompt = "User: Explain the concept of Semantic Quantization.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.6)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
## Citation

```bibtex
@misc{maio2026eve2moeit,
  author    = {Maio, Anthony D.},
  title     = {Eve-2-MoE-IT-272M: A Nano-MoE Foundation for Swarm Intelligence},
  year      = {2026},
  publisher = {Maio, Anthony D.},
  url       = {https://huggingface.co/anthonym21/Eve-2-MoE-IT-272M}
}
```