MirrorAI V3: 236M Mixture-of-Experts Language Model

A 236M parameter Mixture-of-Experts (MoE) language model built from scratch using Apple's MLX framework. Designed as a personal AI assistant with built-in tool-calling capabilities.

πŸ—οΈ Architecture

Parameter Value
Total Parameters 236M
Active Parameters ~62M per token
Layers 8
Hidden Dim 512
Expert FFN Dim 1,365
Experts 16 (top-2 routing)
Shared Expert Yes (dim=1,365)
Vocab Size 32,002 (BPE + <call> / </call>)
Context Length 512 tokens
Framework MLX (Apple Silicon native)

MoE Design

  • Architecture: 236M-parameter MoE (16 experts, top-2 routing)
  • Efficiency profile: competitive MMLU for a sub-300M-parameter model
  • Training data: ~61M tokens
Each transformer layer uses a gated mixture of 16 experts with top-2 routing, plus a shared expert that always contributes. This gives the model a large parameter count (236M) while only activating ~62M parameters per token, enabling efficient inference on Apple Silicon.
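The routing described above can be sketched in a few lines. This is an illustrative NumPy sketch, not MirrorAI's actual implementation: `gate_w`, `experts`, and `shared_expert` are hypothetical stand-ins for the router weights, the expert FFNs, and the always-on shared expert.

```python
import numpy as np

def moe_layer(x, gate_w, experts, shared_expert, top_k=2):
    """Gated MoE with top-k routing plus a shared expert (sketch).
    x: (dim,) token hidden state; gate_w: (num_experts, dim) router weights;
    experts: list of callables; shared_expert: callable, always applied."""
    logits = gate_w @ x                        # one router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over selected experts only
    out = sum(w * experts[i](x) for w, i in zip(weights, top))
    return out + shared_expert(x)              # shared expert always contributes
```

Because only `top_k` of the 16 expert FFNs run per token (plus the shared expert), the per-token compute tracks the ~62M active parameters rather than the full 236M.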

📊 Benchmark Results

Standard Benchmarks

| Model | Params | HellaSwag | ARC-Easy | MMLU | Training Data |
|---|---|---|---|---|---|
| GPT-2 Small | 124M | 30.0% | 38.7% | 25.0% | ~10B tokens |
| OPT-125M | 125M | 31.5% | 22.9% | 24.0% | 300B tokens |
| SmolLM2-135M | 135M | 67.5% | 54.3% | 23.1% | 2T tokens |
| Pythia-160M | 160M | 29.4% | 43.5% | 24.0% | 300B tokens |
| MirrorAI V3 (ours) | 236M | 25.5% | 37.0% | 26.0% | ~61M tokens |

Note: MirrorAI V3 was trained on significantly less data (~61M tokens vs 300B+ for comparable models). Our training budget is ~5,000x smaller than OPT-125M's and ~33,000x smaller than SmolLM2-135M's.

MirrorAI Custom Capabilities

| Capability | Score | Description |
|---|---|---|
| Identity | 100% | Correctly identifies as MirrorAI by Dipesh Majithia |
| Tool Calling (RAG) | 80% | Uses `<call>search_knowledge("query")</call>` for factual questions |
| Tool Calling (Math) | 100% | Uses `<call>calculator("expression")</call>` for math |
| Conversation | 100% | Natural chitchat and greetings |
| Coherence | 100% | Generates coherent multi-sentence responses |

🛠️ Unique Features

Tool Calling

MirrorAI V3 has custom tool-calling capabilities built into its vocabulary:

  • `<call>search_knowledge("query")</call>` - for factual information retrieval
  • `<call>calculator("expression")</call>` - for mathematical calculations

These are atomic tokens (IDs 32000 and 32001) that are never split into sub-word pieces, which makes tool-call generation reliable.
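One way such atomic tokens can be kept intact is to split them out before BPE runs. This is a minimal sketch, not the actual tokenizer: the `SPECIAL` map mirrors the IDs above, and `bpe_encode` is a stand-in for the underlying BPE encoder.

```python
import re

SPECIAL = {"<call>": 32000, "</call>": 32001}

def encode_with_specials(text, bpe_encode):
    """Split out atomic special tokens before BPE, so <call>/</call>
    always map to single IDs rather than sub-word pieces."""
    ids = []
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL) + ")"
    for piece in re.split(pattern, text):
        if piece in SPECIAL:
            ids.append(SPECIAL[piece])   # emit the reserved ID directly
        elif piece:
            ids.extend(bpe_encode(piece))  # everything else goes through BPE
    return ids
```

With this scheme the model only ever sees `<call>` as a single token, so it cannot emit a half-formed tag.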

Example Usage

```
User: What is the capital of France?
MirrorAI: <call>search_knowledge("capital of France")</call>
[Context: Paris is the capital of France]
MirrorAI: The capital of France is Paris.

User: What is 125 + 372?
MirrorAI: <call>calculator("125 + 372")</call>
[Result: 497]
MirrorAI: The answer is 497.

User: Who created you?
MirrorAI: I was created by Dipesh Majithia.
```
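A host application would detect these calls in the model's output, execute them, and feed the result back. A minimal dispatch sketch, assuming the call format shown above; the `TOOLS` entries are stubs, not MirrorAI's actual backends:

```python
import re

TOOLS = {
    # Demo-only stubs: a real host would route these to safe implementations.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "search_knowledge": lambda q: f"[no results for {q!r}]",
}

CALL_RE = re.compile(r'<call>(\w+)\("(.*?)"\)</call>')

def run_tool_call(model_output):
    """Execute a tool call if the model emitted one; otherwise pass through."""
    m = CALL_RE.search(model_output)
    if not m:
        return model_output
    name, arg = m.group(1), m.group(2)
    return TOOLS[name](arg)
```

The tool result would then be appended to the context (as in the `[Result: 497]` line above) for a second forward pass.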

📚 Training Details

Data

  • OpenHermes 2.5: 100,000 instruction-following samples
  • SlimOrca: 50,000 instruction-following samples
  • MirrorAI Custom: 14,000 samples (identity, tool-calling, conversation)
  • Total: 164,000 samples (~61M tokens)
  • System Prompts: Diversified across 8 variants for robustness
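Prompt diversification can be applied at data-preparation time by sampling a system prompt per example. A sketch under stated assumptions: the prompt texts and the `<|system|>`/`<|user|>`/`<|assistant|>` markers below are hypothetical; the card only states that 8 variants were used.

```python
import random

SYSTEM_PROMPTS = [  # assumed wording; the card says 8 variants were used
    "You are MirrorAI, a helpful assistant created by Dipesh Majithia.",
    "You are MirrorAI. Answer concisely and use tools when needed.",
]

def format_sample(instruction, response, rng=random):
    """Attach a randomly chosen system prompt to one training sample."""
    system = rng.choice(SYSTEM_PROMPTS)
    return f"<|system|>{system}\n<|user|>{instruction}\n<|assistant|>{response}"
```

Sampling the prompt independently per example keeps the model from overfitting to any single system-prompt wording.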

Training Configuration

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Epoch 1 | Curriculum-ordered (easy → hard) |
| Epochs 2-3 | Shuffled with diversified system prompts |
| Peak LR | 5e-5 |
| Scheduler | Cosine with warmup |
| Warmup Steps | 2,000 |
| Weight Decay | 0.05 |
| Grad Clipping | 1.0 |
| Batch Size | 16 (gradient accumulation) |
| Precision | float16 (MLX) |
| Hardware | Apple Silicon (M-series) |
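The warmup-plus-cosine schedule can be reproduced as follows. The total step count is an assumption derived from the figures above (164,000 samples / batch 16 × 3 epochs ≈ 30,750), not stated directly in the card:

```python
import math

def lr_at(step, peak_lr=5e-5, warmup=2000, total=30750):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup:
        return peak_lr * step / warmup          # linear ramp over warmup steps
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```

For example, `lr_at(1000)` is half of peak (mid-warmup), `lr_at(2000)` is the 5e-5 peak, and the rate decays toward zero at the final step.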

Training Loss

  • Epoch 1 Start: ~3.7
  • Epoch 1 End: ~2.3
  • Epoch 3 End: ~1.6
  • Final Val Loss: ~1.73

🚀 Quick Start (MLX)

```python
import mlx.core as mx
from model.transformer import MirrorTransformer, ModelArgs
from tokenizer_wrapper import MirrorTokenizer

args = ModelArgs(
    dim=512, hidden_dim=1365, n_layers=8,
    vocab_size=32002, use_moe=True,
    num_experts=16, num_experts_per_tok=2,
    shared_expert_dim=1365,
)

model = MirrorTransformer(args)
model.load_weights("model.safetensors", strict=False)
model.eval()

tokenizer = MirrorTokenizer("custom_bpe_32k_v2.json")
```
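The snippet above loads the model but stops short of decoding. A minimal greedy-decoding sketch, framework-agnostic for illustration: `step_logits` is a stand-in for a forward pass that returns next-token logits for a token-ID sequence.

```python
import numpy as np

def generate_greedy(step_logits, prompt_ids, max_new=32, eos_id=2):
    """Greedy decoding loop (sketch). step_logits(ids) -> logits over the
    vocab for the next token; here a stand-in for the model forward pass."""
    ids = list(prompt_ids)
    for _ in range(max_new):
        next_id = int(np.argmax(step_logits(ids)))  # pick the most likely token
        ids.append(next_id)
        if next_id == eos_id:                       # stop at end-of-sequence
            break
    return ids
```

In practice you would wrap the MirrorTransformer forward pass (with a KV cache, if available) as `step_logits` and detokenize the result, watching for the `<call>` token IDs to trigger tool execution.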

⚠️ Limitations

  • Small model: a 236M parameter budget limits reasoning depth and factual recall
  • Limited training data: ~61M tokens vs billions for comparable models
  • English only: Trained exclusively on English data
  • Single-turn: No multi-turn conversation support
  • Tool queries: Sometimes garbles search queries for unfamiliar topics
  • Context window: Limited to 512 tokens

📄 License

MIT

👤 Author

Dipesh Majithia
