NanoMoE v9 (322M) - SFT v7

NanoMoE v9 is a custom 322M-parameter Mixture of Experts (MoE) language model trained from scratch. It is designed to demonstrate that a well-curated MoE trained with modern optimizers (Muon + AdamW) can outperform dense models of equivalent effective compute (~100-150M active parameters) at the same token budget.

This revision includes Supervised Fine-Tuning (SFT v7), which resolves the formatting degradation and catastrophic forgetting seen previously by using an Epoch-First, NLA-Tiered dataset design.

Model Architecture

The model uses a custom nanoGPT-style MoE architecture. It deliberately omits RoPE, RMSNorm, and SwiGLU so that any performance gains can be attributed to data quality and the Muon optimizer rather than to architectural changes.

Total Parameters: 322M
Active Parameters: ~100-150M per token
Layers: 12
Attention Heads: 12
Embedding Dimension: 768
Experts: 8
Routing: Top-K = 2
Block Size: 4096
Optimizers: Two-optimizer split (Muon for hidden weights, AdamW for router/embeddings)
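
To illustrate the "Experts: 8, Top-K = 2" entries above, here is a minimal pure-Python sketch of top-2 routing for a single token. It is not the model's actual router: the function name `top_k_route` is hypothetical, and renormalizing the softmax over only the selected experts is an assumption (some MoE implementations apply the softmax over all experts before selecting).

```python
import math

NUM_EXPERTS = 8  # from the architecture list
TOP_K = 2        # from the architecture list


def top_k_route(logits, k=TOP_K):
    """Pick the k highest-scoring experts for one token.

    Returns a list of (expert_index, weight) pairs, best expert first.
    Assumption: weights are a softmax over the selected logits only.
    """
    if len(logits) < k:
        raise ValueError("need at least k router logits")
    # Indices of the k largest logits, best first.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# One router logit per expert for a single token (made-up values).
example_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.3, 1.0]
route = top_k_route(example_logits)
```

Only the two selected experts run for that token, and their outputs are combined using the returned weights. This is why the active parameter count per token is well below the 322M total.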

Tokenizer and Format

You MUST use the exact ChatML format to prompt this model.

The model uses a custom tokenizer: the r50k_base Tiktoken encoding (50,257 tokens) extended to a vocab size of 50,264. All proprietary tokens from older v1 builds have been stripped to prevent representation space corruption. The only added tokens are these 7 ChatML and tool-calling tokens:

<|im_start|>
<|im_end|>
<|tool_call|>
<|tool_call_end|>
<|tool_result|>
<|tool_result_end|>
<|tool_error|>
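
The vocabulary arithmetic above (50,257 base tokens plus 7 special tokens gives 50,264) can be sketched as follows. The ID assignment is an assumption: the card does not state which IDs the special tokens occupy, so this sketch simply appends them after the base vocabulary in the order listed. `build_encoding` and the encoding name `r50k_chatml_sketch` are hypothetical; the function follows the extension pattern shown in the tiktoken README and needs `tiktoken` installed.

```python
BASE_VOCAB_SIZE = 50_257  # size of tiktoken's r50k_base, including <|endoftext|>

EXTRA_TOKENS = [
    "<|im_start|>",
    "<|im_end|>",
    "<|tool_call|>",
    "<|tool_call_end|>",
    "<|tool_result|>",
    "<|tool_result_end|>",
    "<|tool_error|>",
]

# Assumption: special tokens are appended directly after the base vocabulary,
# in the order listed in this card. The real ID mapping may differ.
SPECIAL_TOKEN_IDS = {tok: BASE_VOCAB_SIZE + i for i, tok in enumerate(EXTRA_TOKENS)}
VOCAB_SIZE = BASE_VOCAB_SIZE + len(EXTRA_TOKENS)


def build_encoding():
    """Build an extended tiktoken encoding (hypothetical helper)."""
    import tiktoken  # imported lazily so the constants above work without it

    base = tiktoken.get_encoding("r50k_base")
    return tiktoken.Encoding(
        name="r50k_chatml_sketch",
        pat_str=base._pat_str,
        mergeable_ranks=base._mergeable_ranks,
        special_tokens={**base._special_tokens, **SPECIAL_TOKEN_IDS},
    )
```

If the tokenizer files shipped with the model define a different mapping, use those instead of this sketch.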

Example Prompt Format

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 15% of 8244?<|im_end|>
<|im_start|>assistant
<|tool_call|>{"name": "python", "code": "print(round(8244 * 0.15, 2))"}<|tool_call_end|><|im_end|>
<|im_start|>tool
<|tool_result|>1236.6<|tool_result_end|><|im_end|>
<|im_start|>assistant
15% of 8,244 is 1,236.60.<|im_end|>
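
A prompt in the format shown above can be assembled programmatically. The helper below is a minimal sketch, not part of the model's release: `render_chatml` and its `add_generation_prompt` parameter are hypothetical names, and the single newline between turns is inferred from the example.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML string.

    Assumption: turns are separated by a single newline, as in the
    example prompt in this card.
    """
    turns = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    ]
    prompt = "\n".join(turns)
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        prompt += "\n<|im_start|>assistant\n"
    return prompt


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 15% of 8244?"},
]
prompt = render_chatml(messages, add_generation_prompt=True)
```

Tool results go in a `tool` turn whose content is wrapped in `<|tool_result|>` and `<|tool_result_end|>`, exactly as in the example.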

Training Data

Pretraining Phase (~12.88B tokens)

40% LongForm: Cosmopedia (OpenStax, Stanford, KhanAcademy)
28% Chat: Hermes, Orca, and WizardLM (Filtered via a custom NLA proxy model)
25% Code: StarCoderData (Python subset)
5% Web: FineWeb 10BT sample
2% Synthetic: High-quality generated reasoning, multi-hop facts, and tool-call boundary cases.
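
As a rough guide to what this mix means in absolute terms, the sketch below converts the percentages into token counts. It assumes the percentages are shares of the ~12.88B training tokens (rather than shares of documents or bytes) and treats the total as exactly 12.88B; the names `TOTAL_TOKENS`, `MIX_PERCENT`, and `token_budget` are illustrative.

```python
TOTAL_TOKENS = 12_880_000_000  # ~12.88B, treated as exact for this sketch

# Percentages as listed in this card.
MIX_PERCENT = {
    "longform": 40,
    "chat": 28,
    "code": 25,
    "web": 5,
    "synthetic": 2,
}


def token_budget(total, mix_percent):
    """Split a token total across sources by integer percentage."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


budget = token_budget(TOTAL_TOKENS, MIX_PERCENT)
for name, tokens in budget.items():
    print(f"{name:<10} {tokens / 1e9:.2f}B tokens")
```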

SFT v7 Phase (~4 epochs)

~70% Chat: Hermes Tier 1 + Orca Tier 1 + Wizard Tier 1
~17% Code: Filtered Python code conversations
~7% Greetings & Factual QA: Grounding against hallucinations.
~4% Math: Diverse formats to prevent single-phrase attractor loops (e.g., stopping the model from blindly generating "The answer is X").
~2% Constraints: Impossibility detection and logical constraints.

Intended Use

This model is intended for research purposes, specifically for evaluating small-scale MoE routing health, dual-optimizer training efficiency, and deterministic tool-calling (via external Python execution/web search harnesses).
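
An external Python execution harness of the kind mentioned above could look like the following minimal sketch. It is an illustration, not the harness used with this model. The function `run_tool_call` is hypothetical, the JSON payload shape (`name`, `code`) is taken from the example prompt, and the error format (`<|tool_error|>` followed by a message, with no closing token) is an assumption, since the card does not show an error turn. The code runs model output without sandboxing, so use it only in an isolated environment.

```python
import json
import re
import subprocess
import sys

TOOL_CALL_RE = re.compile(r"<\|tool_call\|>(.*?)<\|tool_call_end\|>", re.DOTALL)


def run_tool_call(assistant_text, timeout=5.0):
    """Execute the first tool call in an assistant turn.

    Returns a {"role": "tool", "content": ...} message, or None if the
    turn contains no tool call.
    """
    match = TOOL_CALL_RE.search(assistant_text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
        if not isinstance(call, dict) or call.get("name") != "python":
            raise ValueError("unsupported or malformed tool call")
        # WARNING: unsandboxed execution of model-generated code.
        proc = subprocess.run(
            [sys.executable, "-c", call["code"]],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if proc.returncode != 0:
            raise RuntimeError(proc.stderr.strip() or "non-zero exit status")
        content = f"<|tool_result|>{proc.stdout.strip()}<|tool_result_end|>"
    except (ValueError, KeyError, RuntimeError, subprocess.TimeoutExpired) as exc:
        # Assumed error format; the card only lists the token itself.
        content = f"<|tool_error|>{exc}"
    return {"role": "tool", "content": content}
```

The returned message can be appended to the conversation and rendered as a `tool` turn before asking the model for its final answer.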

Daxamite/NanoMoE_322_9R