license: mit
language:
  - en
tags:
  - mixture-of-experts
  - moe
  - mlx
  - apple-silicon
  - tool-calling
  - personal-assistant
  - small-language-model
pipeline_tag: text-generation
model-index:
  - name: MirrorAI-V3-236M-MoE
    results:
      - task:
          type: text-generation
          name: HellaSwag
        dataset:
          name: HellaSwag
          type: hellaswag
        metrics:
          - name: accuracy
            type: accuracy
            value: 25.5
      - task:
          type: text-generation
          name: ARC-Easy
        dataset:
          name: ARC-Easy
          type: ai2_arc
        metrics:
          - name: accuracy
            type: accuracy
            value: 37
      - task:
          type: text-generation
          name: MMLU
        dataset:
          name: MMLU
          type: cais/mmlu
        metrics:
          - name: accuracy
            type: accuracy
            value: 26

MirrorAI V3: 236M Mixture-of-Experts Language Model

A 236M-parameter Mixture-of-Experts (MoE) language model built from scratch with Apple's MLX framework. It is designed as a personal AI assistant with built-in tool-calling capabilities.

🏗️ Architecture

| Parameter | Value |
|---|---|
| Total Parameters | 236M |
| Active Parameters | ~62M per token |
| Layers | 8 |
| Hidden Dim | 512 |
| Expert FFN Dim | 1,365 |
| Experts | 16 (top-2 routing) |
| Shared Expert | Yes (dim=1,365) |
| Vocab Size | 32,002 (BPE + `<call>` / `</call>`) |
| Context Length | 512 tokens |
| Framework | MLX (Apple Silicon native) |

MoE Design

  • Architecture: 236M-parameter MoE (16 experts, top-2 routing)
  • Efficiency profile: highest MMLU score (26.0%) among the sub-300M models compared under Benchmark Results
  • Training data: ~61M tokens

Each transformer layer routes every token through the top 2 of 16 gated experts, plus a shared expert that always contributes. The model therefore stores 236M parameters but activates only ~62M of them per token, which keeps inference efficient on Apple Silicon.
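
The routing described above can be sketched in plain Python. This is a minimal illustration, not the model's actual MLX code: the function names, the toy experts, and the choice to renormalise the gate weights with a softmax over the two selected scores are all assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, shared_expert, top_k=2):
    """Route one token vector through its top-k experts plus a shared expert."""
    # Indices of the k experts with the highest router scores.
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Assumption: gate weights are a softmax over the selected scores only.
    gates = softmax([router_logits[i] for i in chosen])
    # The shared expert contributes to every token, regardless of routing.
    out = shared_expert(x)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](x)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen

# Toy stand-ins: expert i scales its input by (i + 1); the shared expert is the identity.
experts = [(lambda v, s=i + 1: [s * a for a in v]) for i in range(16)]
shared_expert = lambda v: list(v)

router_logits = [0.0] * 16
router_logits[3] = 5.0  # experts 3 and 7 win the top-2 selection
router_logits[7] = 5.0

out, chosen = moe_forward([1.0, 2.0], router_logits, experts, shared_expert)
print(chosen, out)
```

Only 2 of the 16 expert functions are evaluated for the token, which is why the active parameter count is much smaller than the total.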

📊 Benchmark Results

Standard Benchmarks

| Model | Params | HellaSwag | ARC-Easy | MMLU | Training Data |
|---|---|---|---|---|---|
| GPT-2 Small | 124M | 30.0% | 38.7% | 25.0% | ~10B tokens |
| OPT-125M | 125M | 31.5% | 22.9% | 24.0% | 300B tokens |
| SmolLM2-135M | 135M | 67.5% | 54.3% | 23.1% | 2T tokens |
| Pythia-160M | 160M | 29.4% | 43.5% | 24.0% | 300B tokens |
| MirrorAI V3 (ours) | 236M | 25.5% | 37.0% | 26.0% | ~61M tokens |

Note: MirrorAI V3 was trained on far less data than the baselines above: ~61M tokens, versus ~10B for GPT-2 Small and 300B or more for the others. That is roughly 5,000x fewer tokens than OPT-125M and roughly 33,000x fewer than SmolLM2-135M.
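
The ratios in the note follow directly from the token counts in the table. A quick check, using the approximate figures as listed (the variable names are illustrative only):

```python
# Approximate training-token counts taken from the benchmark table.
mirror_tokens = 61_000_000
baseline_tokens = {
    "GPT-2 Small": 10_000_000_000,
    "OPT-125M": 300_000_000_000,
    "SmolLM2-135M": 2_000_000_000_000,
    "Pythia-160M": 300_000_000_000,
}

# How many times more tokens each baseline saw than MirrorAI V3.
ratios = {name: tokens / mirror_tokens for name, tokens in baseline_tokens.items()}

for name, ratio in ratios.items():
    print(f"{name}: ~{ratio:,.0f}x more training tokens")
```

The OPT-125M and SmolLM2-135M ratios come out at about 4,900x and 32,800x, which the note rounds to 5,000x and 33,000x.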

MirrorAI Custom Capabilities

| Capability | Score | Description |
|---|---|---|
| Identity | 100% | Correctly identifies as MirrorAI by Dipesh Majithia |
| Tool Calling (RAG) | 80% | Uses `<call>search_knowledge("query")</call>` for factual questions |
| Tool Calling (Math) | 100% | Uses `<call>calculator("expression")</call>` for math |
| Conversation | 100% | Natural chitchat and greetings |
| Coherence | 100% | Generates coherent multi-sentence responses |

🛠️ Unique Features

Tool Calling

MirrorAI V3 has custom tool-calling capabilities built into its vocabulary:

  • <call>search_knowledge("query")</call>: for factual information retrieval
  • <call>calculator("expression")</call>: for mathematical calculations

The <call> and </call> tags are single atomic tokens (IDs 32000/32001) rather than sub-word sequences, which makes tool-call generation more reliable.

Example Usage

User: What is the capital of France?
MirrorAI: <call>search_knowledge("capital of France")</call>
[Context: Paris is the capital of France]
MirrorAI: The capital of France is Paris.

User: What is 125 + 372?
MirrorAI: <call>calculator("125 + 372")</call>
[Result: 497]
MirrorAI: The answer is 497.

User: Who created you?
MirrorAI: I was created by Dipesh Majithia.
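
On the application side, the generated text has to be scanned for a tool call and the result fed back to the model. The sketch below shows one way to do that with the call format from the examples above. The helper names (`extract_call`, `run_tool`), the regular expression, and the arithmetic-only `calculator` are assumptions for illustration, not code shipped with the model.

```python
import ast
import operator
import re

# Matches <call>tool_name("argument")</call> as shown in the examples.
CALL_PATTERN = re.compile(r'<call>(\w+)\("(.*?)"\)</call>')

def extract_call(text):
    """Return (tool_name, argument) for the first tool call, or None."""
    match = CALL_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Hypothetical calculator backend: evaluates + - * / on numeric literals only.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression):
    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

def run_tool(model_output, tools):
    """Dispatch the tool call found in model_output, if any."""
    call = extract_call(model_output)
    if call is None:
        return None  # plain answer, nothing to execute
    name, argument = call
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    return tools[name](argument)

tools = {"calculator": calculator}
result = run_tool('<call>calculator("125 + 372")</call>', tools)
print(f"[Result: {result}]")
```

A `search_knowledge` entry would be registered in `tools` the same way, backed by whatever retrieval system the application uses.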

📚 Training Details

Data

  • OpenHermes 2.5: 100,000 instruction-following samples
  • SlimOrca: 50,000 instruction-following samples
  • MirrorAI Custom: 14,000 samples (identity, tool-calling, conversation)
  • Total: 164,000 samples (~61M tokens)
  • System Prompts: Diversified across 8 variants for robustness

Training Configuration

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Epoch 1 | Curriculum-ordered (easy → hard) |
| Epochs 2-3 | Shuffled with diversified system prompts |
| Peak LR | 5e-5 |
| Scheduler | Cosine with warmup |
| Warmup Steps | 2,000 |
| Weight Decay | 0.05 |
| Grad Clipping | 1.0 |
| Batch Size | 16 (gradient accumulation) |
| Precision | float16 (MLX) |
| Hardware | Apple Silicon (M-series) |
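
For reference, a cosine schedule with warmup using the peak LR and warmup steps above might look like the sketch below. Several details are assumptions, because the configuration does not state them: the warmup is taken to be linear, the decay is taken to end at zero, and `total_steps` is estimated as 164,000 samples x 3 epochs / 16 = 30,750, treating 16 as the effective batch size.

```python
import math

def lr_at(step, peak_lr=5e-5, warmup_steps=2_000, total_steps=30_750, min_lr=0.0):
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine curve from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 1_000, 2_000, 16_375, 30_750):
    print(step, f"{lr_at(step):.2e}")
```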

Training Loss

  • Epoch 1 Start: ~3.7
  • Epoch 1 End: ~2.3
  • Epoch 3 End: ~1.6
  • Final Val Loss: ~1.73
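
If these losses are mean per-token cross-entropy values in nats (an assumption, since the unit is not stated here), they convert to perplexity as exp(loss):

```python
import math

# Loss values from the list above; assumed to be mean cross-entropy in nats.
losses = {
    "Epoch 1 start": 3.7,
    "Epoch 1 end": 2.3,
    "Epoch 3 end": 1.6,
    "Final validation": 1.73,
}

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexities = {stage: math.exp(loss) for stage, loss in losses.items()}

for stage, ppl in perplexities.items():
    print(f"{stage}: loss {losses[stage]:.2f} -> perplexity {ppl:.1f}")
```

Under that assumption, training takes the perplexity from about 40 at the start of epoch 1 to between 5 and 6 on the validation set.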

🚀 Quick Start (MLX)

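```python
import mlx.core as mx
from model.transformer import MirrorTransformer, ModelArgs
from tokenizer_wrapper import MirrorTokenizer

# Hyperparameters must match the Architecture table.
args = ModelArgs(
    dim=512, hidden_dim=1365, n_layers=8,
    vocab_size=32002, use_moe=True,
    num_experts=16, num_experts_per_tok=2,
    shared_expert_dim=1365
)

# Build the model and load the weights.
# strict=False skips the check that checkpoint keys exactly match the model's parameters.
model = MirrorTransformer(args)
model.load_weights("model.safetensors", strict=False)
model.eval()  # switch to inference mode

# Load the tokenizer from its JSON file.
tokenizer = MirrorTokenizer("custom_bpe_32k_v2.json")
```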
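
The snippet above only loads the model. A generation loop is not shown here, so the following is a framework-agnostic sketch of greedy decoding that respects the 512-token context limit. `next_token_logits` is a hypothetical callable that would wrap the loaded model (its actual call signature is an assumption), and the toy stand-in below exists only to make the example runnable.

```python
def greedy_generate(next_token_logits, prompt_ids, max_new_tokens=64,
                    stop_ids=(), context_len=512):
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Keep only the most recent tokens that fit in the context window.
        window = ids[-context_len:]
        logits = next_token_logits(window)
        # Pick the token ID with the largest logit.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id in stop_ids:
            break  # e.g. an end-of-sequence or closing call token
    return ids[len(prompt_ids):]

# Toy stand-in for a model: always prefers (last token + 1) modulo 10.
def toy_logits(window):
    logits = [0.0] * 10
    logits[(window[-1] + 1) % 10] = 1.0
    return logits

generated = greedy_generate(toy_logits, [1, 2], stop_ids={5})
print(generated)
```

For tool calling, the closing call token would be a natural entry in `stop_ids`, so that generation pauses while the application runs the tool.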

⚠️ Limitations

  • Small model: 236M parameters limit reasoning depth and factual recall
  • Limited training data: ~61M tokens vs billions for comparable models
  • English only: Trained exclusively on English data
  • Single-turn: No multi-turn conversation support
  • Tool queries: Sometimes garbles search queries for unfamiliar topics
  • Context window: Limited to 512 tokens

📄 License

MIT

👤 Author

Dipesh Majithia