NanoMoE v9 (322M) - SFT v7

NanoMoE v9 is a custom 322M-parameter Mixture-of-Experts (MoE) language model trained from scratch. It is designed to demonstrate that an MoE trained on well-curated data with modern optimization (Muon + AdamW) can outperform dense models of equivalent effective compute (~100-150M active parameters) at equivalent token budgets.

This revision includes Supervised Fine-Tuning (SFT v7), which resolves the formatting degradation and catastrophic forgetting seen in earlier revisions by using an Epoch-First, NLA-Tiered dataset design.

Model Architecture

The model uses a custom nanoGPT-style MoE architecture without RoPE, RMSNorm, or SwiGLU, so that performance gains can be attributed to data quality and the Muon optimizer rather than to architectural changes.

  • Total Parameters: 322M
  • Active Parameters: ~100-150M per token
  • Layers: 12
  • Attention Heads: 12
  • Embedding Dimension: 768
  • Experts: 8
  • Routing: Top-K = 2
  • Block Size: 4096
  • Optimizers: Two-optimizer split (Muon for hidden weights, AdamW for router/embeddings)
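
To make the "8 experts, Top-K = 2" setting concrete, here is a minimal sketch of top-k routing for a single token in plain Python. It is illustrative only: the function name `route_top_k` and the example logits are hypothetical, and the gate normalization (a softmax over the selected logits only) is an assumption, since this card does not describe how NanoMoE computes its gate weights.

```python
import math

NUM_EXPERTS = 8  # from the model card
TOP_K = 2        # from the model card


def route_top_k(router_logits, k=TOP_K):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts.

    Assumption: gate weights are a softmax over the selected logits only.
    """
    if len(router_logits) < k:
        raise ValueError("need at least k router logits")
    # Indices of the k largest logits (ties broken by lower index).
    top = sorted(range(len(router_logits)), key=lambda i: (-router_logits[i], i))[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router output for one token: one logit per expert.
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.2]
selected = route_top_k(logits)
```

Only the two selected experts would run for this token, which is why the active parameter count per token is well below the 322M total.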

Tokenizer and Format

You MUST use the exact ChatML format to prompt this model.

The tokenizer is a custom tiktoken r50k_base encoding extended to a vocabulary size of 50,264. All proprietary tokens from older v1 builds have been stripped to prevent representation-space corruption. It uses exactly 7 ChatML and tool-calling tokens:

  • <|im_start|>
  • <|im_end|>
  • <|tool_call|>
  • <|tool_call_end|>
  • <|tool_result|>
  • <|tool_result_end|>
  • <|tool_error|>
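
The vocabulary size follows from the stock r50k_base size (50,257 entries, including `<|endoftext|>`) plus the 7 tokens listed above. The sketch below checks that arithmetic and builds a token-to-ID mapping. The ID assignment is an assumption: it simply gives the new tokens the next free IDs in the order listed, and the real checkpoint may use a different order.

```python
BASE_VOCAB_SIZE = 50_257  # size of tiktoken's r50k_base, including <|endoftext|>

SPECIAL_TOKENS = [
    "<|im_start|>",
    "<|im_end|>",
    "<|tool_call|>",
    "<|tool_call_end|>",
    "<|tool_result|>",
    "<|tool_result_end|>",
    "<|tool_error|>",
]

# Assumption: new tokens take the next free IDs, in the order listed in the card.
special_token_ids = {
    token: BASE_VOCAB_SIZE + offset for offset, token in enumerate(SPECIAL_TOKENS)
}

vocab_size = BASE_VOCAB_SIZE + len(SPECIAL_TOKENS)  # 50,264, matching the card
```

A mapping of this shape is what one would typically pass as the `special_tokens` argument when constructing a custom `tiktoken.Encoding`, provided the IDs match those used in training.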

Example Prompt Format

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 15% of 8244?<|im_end|>
<|im_start|>assistant
<|tool_call|>{"name": "python", "code": "print(round(8244 * 0.15, 2))"}<|tool_call_end|><|im_end|>
<|im_start|>tool
<|tool_result|>1236.6<|tool_result_end|><|im_end|>
<|im_start|>assistant
15% of 8,244 is 1,236.60.<|im_end|>
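
The format above can be produced and consumed with a few lines of Python. This is a minimal sketch, not the author's harness: `build_prompt` and `parse_tool_call` are hypothetical helper names, and the exact whitespace (a newline after the role and after each `<|im_end|>`) is inferred from the example layout rather than confirmed against the training data.

```python
import json

IM_START, IM_END = "<|im_start|>", "<|im_end|>"
TOOL_CALL, TOOL_CALL_END = "<|tool_call|>", "<|tool_call_end|>"


def build_prompt(messages, add_generation_prompt=True):
    """Render [(role, content), ...] as ChatML, one turn per block."""
    turns = [f"{IM_START}{role}\n{content}{IM_END}\n" for role, content in messages]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        turns.append(f"{IM_START}assistant\n")
    return "".join(turns)


def parse_tool_call(text):
    """Return the decoded JSON payload of the first tool call, or None."""
    start = text.find(TOOL_CALL)
    if start == -1:
        return None
    start += len(TOOL_CALL)
    end = text.find(TOOL_CALL_END, start)
    if end == -1:
        return None  # unterminated tool call
    return json.loads(text[start:end])


prompt = build_prompt([
    ("system", "You are a helpful assistant."),
    ("user", "What is 15% of 8244?"),
])

# Assistant output copied from the example above.
call = parse_tool_call(
    '<|tool_call|>{"name": "python", "code": "print(round(8244 * 0.15, 2))"}'
    "<|tool_call_end|>"
)
```

A harness would run `call["code"]` in a sandbox, wrap the output in `<|tool_result|>` and `<|tool_result_end|>` inside a `tool` turn, and then ask the model to continue.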

Training Data

Pretraining Phase (~12.88B tokens)

  • 40% LongForm: Cosmopedia (OpenStax, Stanford, KhanAcademy)
  • 28% Chat: Hermes, Orca, and WizardLM (Filtered via a custom NLA proxy model)
  • 25% Code: StarCoderData (Python subset)
  • 5% Web: FineWeb 10BT sample
  • 2% Synthetic: High-quality generated reasoning, multi-hop facts, and tool-call boundary cases.
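
For a rough sense of scale, the mixture above can be converted into approximate per-source token counts. The sketch assumes the percentages are shares of tokens (the card does not say whether they are measured in tokens, documents, or bytes) and treats the "~12.88B" total as exact.

```python
TOTAL_TOKENS = 12_880_000_000  # "~12.88B" in the card, treated as exact here

# Shares from the list above; assumed to be measured in tokens.
MIXTURE_PERCENT = {
    "LongForm": 40,
    "Chat": 28,
    "Code": 25,
    "Web": 5,
    "Synthetic": 2,
}

if sum(MIXTURE_PERCENT.values()) != 100:
    raise ValueError("mixture shares must sum to 100%")

tokens_per_source = {
    name: TOTAL_TOKENS * pct // 100 for name, pct in MIXTURE_PERCENT.items()
}

for name, count in tokens_per_source.items():
    print(f"{name:<10}{count / 1e9:6.2f}B tokens")
```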

SFT v7 Phase (~4 epochs)

  • ~70% Chat: Hermes Tier 1 + Orca Tier 1 + Wizard Tier 1
  • ~17% Code: Filtered Python code conversations
  • ~7% Greetings & Factual QA: Grounding against hallucinations.
  • ~4% Math: Diverse formats to prevent single-phrase attractor loops (e.g., stopping the model from blindly generating "The answer is X").
  • ~2% Constraints: Impossibility detection and logical constraints.

Intended Use

This model is intended for research purposes, specifically for evaluating small-scale MoE routing health, dual-optimizer training efficiency, and deterministic tool-calling (via external Python execution/web search harnesses).

