MirrorAI V3 β 236M Mixture-of-Experts Language Model
A 236M parameter Mixture-of-Experts (MoE) language model built from scratch using Apple's MLX framework. Designed as a personal AI assistant with built-in tool-calling capabilities.
ποΈ Architecture
| Parameter | Value |
|---|---|
| Total Parameters | 236M |
| Active Parameters | ~62M per token |
| Layers | 8 |
| Hidden Dim | 512 |
| Expert FFN Dim | 1,365 |
| Experts | 16 (top-2 routing) |
| Shared Expert | Yes (dim=1,365) |
| Vocab Size | 32,002 (BPE + <call> / </call>) |
| Context Length | 512 tokens |
| Framework | MLX (Apple Silicon native) |
MoE Design
Architecture: 236M MoE (16 experts, Top-2 Routing)
Efficiency Profile: High-performance sub-300M MMLU efficiency
Context: ~61M Training Tokens
Each transformer layer uses a gated mixture of 16 experts with top-2 routing, plus a shared expert that always contributes. This gives the model a large parameter count (236M) while only activating ~62M parameters per token, enabling efficient inference on Apple Silicon.
π Benchmark Results
Standard Benchmarks
| Model | Params | HellaSwag | ARC-Easy | MMLU | Training Data |
|---|---|---|---|---|---|
| GPT-2 Small | 124M | 30.0% | 38.7% | 25.0% | ~10B tokens |
| OPT-125M | 125M | 31.5% | 22.9% | 24.0% | 300B tokens |
| SmolLM2-135M | 135M | 67.5% | 54.3% | 23.1% | 2T tokens |
| Pythia-160M | 160M | 29.4% | 43.5% | 24.0% | 300B tokens |
| MirrorAI V3 (ours) | 236M | 25.5% | 37.0% | 26.0% | ~61M tokens |
Note: MirrorAI V3 was trained on significantly less data (~61M tokens vs 300B+ for comparable models). Our training budget is ~5,000x smaller than OPT-125M and ~33,000x smaller than SmolLM2-135M.
MirrorAI Custom Capabilities
| Capability | Score | Description |
|---|---|---|
| Identity | 100% | Correctly identifies as MirrorAI by Dipesh Majithia |
| Tool Calling (RAG) | 80% | Uses <call>search_knowledge("query")</call> for factual questions |
| Tool Calling (Math) | 100% | Uses <call>calculator("expression")</call> for math |
| Conversation | 100% | Natural chitchat and greetings |
| Coherence | 100% | Generates coherent multi-sentence responses |
π οΈ Unique Features
Tool Calling
MirrorAI V3 has custom tool-calling capabilities built into its vocabulary:
<call>search_knowledge("query")</call>β For factual information retrieval<call>calculator("expression")</call>β For mathematical calculations
These are atomic tokens (IDs 32000/32001), not sub-word split, enabling reliable tool call generation.
Example Usage
User: What is the capital of France?
MirrorAI: <call>search_knowledge("capital of France")</call>
[Context: Paris is the capital of France]
MirrorAI: The capital of France is Paris.
User: What is 125 + 372?
MirrorAI: <call>calculator("125 + 372")</call>
[Result: 497]
MirrorAI: The answer is 497.
User: Who created you?
MirrorAI: I was created by Dipesh Majithia.
π Training Details
Data
- OpenHermes 2.5: 100,000 instruction-following samples
- SlimOrca: 50,000 instruction-following samples
- MirrorAI Custom: 14,000 samples (identity, tool-calling, conversation)
- Total: 164,000 samples (~61M tokens)
- System Prompts: Diversified across 8 variants for robustness
Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Epoch 1 | Curriculum-ordered (easy β hard) |
| Epochs 2-3 | Shuffled with diversified system prompts |
| Peak LR | 5e-5 |
| Scheduler | Cosine with warmup |
| Warmup Steps | 2,000 |
| Weight Decay | 0.05 |
| Grad Clipping | 1.0 |
| Batch Size | 16 (gradient accumulation) |
| Precision | float16 (MLX) |
| Hardware | Apple Silicon (M-series) |
Training Loss
- Epoch 1 Start: ~3.7
- Epoch 1 End: ~2.3
- Epoch 3 End: ~1.6
- Final Val Loss: ~1.73
π Quick Start (MLX)
import mlx.core as mx
from model.transformer import MirrorTransformer, ModelArgs
from tokenizer_wrapper import MirrorTokenizer
args = ModelArgs(
dim=512, hidden_dim=1365, n_layers=8,
vocab_size=32002, use_moe=True,
num_experts=16, num_experts_per_tok=2,
shared_expert_dim=1365
)
model = MirrorTransformer(args)
model.load_weights("model.safetensors", strict=False)
model.eval()
tokenizer = MirrorTokenizer("custom_bpe_32k_v2.json")
β οΈ Limitations
- Small model: 236M parameters limits reasoning depth and factual recall
- Limited training data: ~61M tokens vs billions for comparable models
- English only: Trained exclusively on English data
- Single-turn: No multi-turn conversation support
- Tool queries: Sometimes garbles search queries for unfamiliar topics
- Context window: Limited to 512 tokens
π License
MIT
π€ Author
Dipesh Majithia
Quantized
Space using dipeshmajithia/Mirror-MoE-236M 1
Evaluation results
- accuracy on HellaSwagself-reported25.500
- accuracy on ARC-Easyself-reported37.000
- accuracy on MMLUself-reported26.000