NanoMoE v9 (322M) - SFT v7
NanoMoE v9 is a custom 322M-parameter Mixture of Experts (MoE) language model trained from scratch. It is designed to demonstrate that an MoE trained on well-curated data with modern optimization (Muon + AdamW) can outperform dense models of equivalent active compute (~100-150M active parameters) at the same token budget.
This revision adds Supervised Fine-Tuning (SFT v7), which resolves the formatting degradation and catastrophic forgetting seen in previous revisions by using an Epoch-First, NLA-Tiered dataset design.
Model Architecture
The model uses a custom nanoGPT-style MoE architecture. It deliberately omits RoPE, RMSNorm, and SwiGLU so that any performance gains can be attributed to data quality and the Muon optimizer rather than to architectural changes.
- Total Parameters: 322M
- Active Parameters: ~100-150M per token
- Layers: 12
- Attention Heads: 12
- Embedding Dimension: 768
- Experts: 8
- Routing: Top-K = 2
- Block Size: 4096
- Optimizers: Two-optimizer split (Muon for hidden weights, AdamW for router/embeddings)
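To make the routing settings above concrete, here is a minimal sketch of top-2 selection over 8 experts for a single token. The card states the expert count and Top-K only. The softmax gate and the renormalization of the two selected weights are assumptions for illustration, and `top2_route` is a hypothetical helper, not the model's actual routing code.

```python
import math

def top2_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # ASSUMPTION: renormalize so the selected weights sum to 1. Some MoE
    # implementations use the raw softmax probabilities instead.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Router logits for one token across 8 experts (made-up values).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
routes = top2_route(logits)
print(routes)
```

Only the two selected experts run for this token, which is why the active parameter count is well below the 322M total.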
Tokenizer and Format
You MUST use the exact ChatML format to prompt this model.
The model uses the r50k_base tiktoken encoding, extended to a vocabulary size of 50,264 (the 50,257 base tokens plus 7 special tokens). All proprietary tokens from older v1 builds have been stripped to prevent representation space corruption. The only special tokens added are these 7 ChatML and tool-calling tokens:
<|im_start|> <|im_end|> <|tool_call|> <|tool_call_end|> <|tool_result|> <|tool_result_end|> <|tool_error|>
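The vocabulary arithmetic can be checked with a short sketch. The base size of 50,257 is the standard r50k_base vocabulary. The ID assigned to each special token is an assumption here (appended after the base vocabulary, in the order listed above), so verify it against the shipped tokenizer before relying on it.

```python
# r50k_base defines 50,257 tokens (IDs 0..50256).
BASE_VOCAB_SIZE = 50257

SPECIAL_TOKENS = [
    "<|im_start|>",
    "<|im_end|>",
    "<|tool_call|>",
    "<|tool_call_end|>",
    "<|tool_result|>",
    "<|tool_result_end|>",
    "<|tool_error|>",
]

# ASSUMPTION: special tokens take the next free IDs in the order listed.
# The real mapping may differ; this only illustrates the size arithmetic.
special_token_ids = {
    token: BASE_VOCAB_SIZE + offset for offset, token in enumerate(SPECIAL_TOKENS)
}
vocab_size = BASE_VOCAB_SIZE + len(SPECIAL_TOKENS)

print(vocab_size)
for token, token_id in special_token_ids.items():
    print(token_id, token)
```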
Example Prompt Format
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 15% of 8244?<|im_end|>
<|im_start|>assistant
<|tool_call|>{"name": "python", "code": "print(round(8244 * 0.15, 2))"}<|tool_call_end|><|im_end|>
<|im_start|>tool
<|tool_result|>1236.6<|tool_result_end|><|im_end|>
<|im_start|>assistant
15% of 8,244 is 1,236.60.<|im_end|>
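A prompt in this format can be assembled programmatically. The following is a minimal sketch, not an official chat template: `render_chatml` is a hypothetical helper, and the newline after each `<|im_end|>` is inferred from the layout of the example above.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for msg in messages:
        # Each turn: opening token + role, newline, content, closing token.
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 15% of 8244?"},
]
prompt = render_chatml(messages)
print(prompt)
```

Generation would normally stop at the first `<|im_end|>` the model emits. If the reply contains a tool call, the harness appends a `tool` turn and calls the model again.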
Training Data
Pretraining Phase (~12.88B tokens)
- 40% LongForm: Cosmopedia (OpenStax, Stanford, KhanAcademy)
- 28% Chat: Hermes, Orca, and WizardLM (Filtered via a custom NLA proxy model)
- 25% Code: StarCoderData (Python subset)
- 5% Web: FineWeb 10BT sample
- 2% Synthetic: High-quality generated reasoning, multi-hop facts, and tool-call boundary cases.
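For a sense of scale, the percentages above can be converted into approximate per-source token counts. This is plain arithmetic on the figures listed in this section; the dictionary keys are just labels for this sketch.

```python
TOTAL_TOKENS = 12.88e9  # approximate pretraining budget stated above

MIX = {
    "LongForm": 0.40,
    "Chat": 0.28,
    "Code": 0.25,
    "Web": 0.05,
    "Synthetic": 0.02,
}

# The listed shares should account for the whole budget.
assert abs(sum(MIX.values()) - 1.0) < 1e-9

budgets = {name: share * TOTAL_TOKENS for name, share in MIX.items()}
for name, tokens in budgets.items():
    print(f"{name:<10} {tokens / 1e9:6.3f}B tokens")
```

Under these figures, LongForm contributes roughly 5.15B tokens, while the synthetic slice is only about 0.26B.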
SFT v7 Phase (~4 epochs)
- ~70% Chat: Hermes Tier 1 + Orca Tier 1 + Wizard Tier 1
- ~17% Code: Filtered Python code conversations
- ~7% Greetings & Factual QA: Grounding against hallucinations.
- ~4% Math: Diverse answer formats to prevent single-phrase attractor loops (e.g., the model blindly generating "The answer is X").
- ~2% Constraints: Impossibility detection and logical constraints.
Intended Use
This model is intended for research purposes, specifically for evaluating small-scale MoE routing health, dual-optimizer training efficiency, and deterministic tool-calling (via external Python execution/web search harnesses).
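As an illustration of the external Python execution harness mentioned above, here is a minimal sketch that extracts a tool call from the assistant's output, runs it in a child process, and formats the result as a `tool` turn. The JSON fields (`name`, `code`) follow the example prompt in this card. The function name `run_tool_call`, the timeout, and the layout of the `<|tool_error|>` turn are assumptions, since the card does not specify them.

```python
import json
import re
import subprocess
import sys

TOOL_CALL_RE = re.compile(r"<\|tool_call\|>(.*?)<\|tool_call_end\|>", re.DOTALL)

def run_tool_call(assistant_text, timeout_s=5):
    """Execute the first Python tool call in assistant_text.

    Returns a ChatML `tool` turn, or None if there is no tool call.
    """
    match = TOOL_CALL_RE.search(assistant_text)
    if match is None:
        return None  # plain answer, nothing to execute
    try:
        call = json.loads(match.group(1))
        if call.get("name") != "python":
            raise ValueError(f"unsupported tool: {call.get('name')!r}")
        # NOTE: this runs model-written code in a child process without any
        # sandboxing. A real harness should isolate it properly.
        proc = subprocess.run(
            [sys.executable, "-c", call["code"]],
            capture_output=True, text=True, timeout=timeout_s,
        )
        if proc.returncode != 0:
            raise RuntimeError(proc.stderr.strip())
        body = f"<|tool_result|>{proc.stdout.strip()}<|tool_result_end|>"
    except Exception as exc:
        # ASSUMPTION: errors are reported as <|tool_error|> plus a message.
        # The card lists the token but not the exact error-turn format.
        body = f"<|tool_error|>{exc}"
    return f"<|im_start|>tool\n{body}<|im_end|>\n"

reply = (
    '<|tool_call|>{"name": "python", "code": "print(round(8244 * 0.15, 2))"}'
    "<|tool_call_end|><|im_end|>"
)
print(run_tool_call(reply))
```

Because the arithmetic is done by the interpreter rather than by the model, the numeric result is reproducible across runs. The model only has to emit a well-formed call and restate the returned value.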