Update README.md

3cbebb8 verified about 2 months ago

5.82 kB

license: mit
language:
  - en
tags:
  - mixture-of-experts
  - moe
  - mlx
  - apple-silicon
  - tool-calling
  - personal-assistant
  - small-language-model
pipeline_tag: text-generation
model-index:
  - name: MirrorAI-V3-236M-MoE
    results:
      - task:
          type: text-generation
          name: HellaSwag
        dataset:
          name: HellaSwag
          type: hellaswag
        metrics:
          - name: accuracy
            type: accuracy
            value: 25.5
      - task:
          type: text-generation
          name: ARC-Easy
        dataset:
          name: ARC-Easy
          type: ai2_arc
        metrics:
          - name: accuracy
            type: accuracy
            value: 37
      - task:
          type: text-generation
          name: MMLU
        dataset:
          name: MMLU
          type: cais/mmlu
        metrics:
          - name: accuracy
            type: accuracy
            value: 26

MirrorAI V3 — 236M Mixture-of-Experts Language Model

A 236M parameter Mixture-of-Experts (MoE) language model built from scratch using Apple's MLX framework. Designed as a personal AI assistant with built-in tool-calling capabilities.

🏗️ Architecture

Parameter	Value
Total Parameters	236M
Active Parameters	~62M per token
Layers	8
Hidden Dim	512
Expert FFN Dim	1,365
Experts	16 (top-2 routing)
Shared Expert	Yes (dim=1,365)
Vocab Size	32,002 (BPE + `<call>` / `</call>`)
Context Length	512 tokens
Framework	MLX (Apple Silicon native)

MoE Design

Architecture: 236M MoE (16 experts, Top-2 Routing)
Efficiency Profile: High-performance sub-300M MMLU efficiency
Context: ~61M Training Tokens
Each transformer layer uses a gated mixture of 16 experts with top-2 routing, plus a shared expert that always contributes. This gives the model a large parameter count (236M) while only activating ~62M parameters per token, enabling efficient inference on Apple Silicon.

📊 Benchmark Results

Standard Benchmarks

Model	Params	HellaSwag	ARC-Easy	MMLU	Training Data
GPT-2 Small	124M	30.0%	38.7%	25.0%	~10B tokens
OPT-125M	125M	31.5%	22.9%	24.0%	300B tokens
SmolLM2-135M	135M	67.5%	54.3%	23.1%	2T tokens
Pythia-160M	160M	29.4%	43.5%	24.0%	300B tokens
MirrorAI V3 (ours)	236M	25.5%	37.0%	26.0%	~61M tokens

Note: MirrorAI V3 was trained on significantly less data (~61M tokens vs 300B+ for comparable models). Our training budget is ~5,000x smaller than OPT-125M and ~33,000x smaller than SmolLM2-135M.

MirrorAI Custom Capabilities

Capability	Score	Description
Identity	100%	Correctly identifies as MirrorAI by Dipesh Majithia
Tool Calling (RAG)	80%	Uses `<call>search_knowledge("query")</call>` for factual questions
Tool Calling (Math)	100%	Uses `<call>calculator("expression")</call>` for math
Conversation	100%	Natural chitchat and greetings
Coherence	100%	Generates coherent multi-sentence responses

🛠️ Unique Features

Tool Calling

MirrorAI V3 has custom tool-calling capabilities built into its vocabulary:

<call>search_knowledge("query")</call> — For factual information retrieval
<call>calculator("expression")</call> — For mathematical calculations

These are atomic tokens (IDs 32000/32001), not sub-word split, enabling reliable tool call generation.

Example Usage

User: What is the capital of France?
MirrorAI: <call>search_knowledge("capital of France")</call>
[Context: Paris is the capital of France]
MirrorAI: The capital of France is Paris.

User: What is 125 + 372?
MirrorAI: <call>calculator("125 + 372")</call>
[Result: 497]
MirrorAI: The answer is 497.

User: Who created you?
MirrorAI: I was created by Dipesh Majithia.

📚 Training Details

Data

OpenHermes 2.5: 100,000 instruction-following samples
SlimOrca: 50,000 instruction-following samples
MirrorAI Custom: 14,000 samples (identity, tool-calling, conversation)
Total: 164,000 samples (~61M tokens)
System Prompts: Diversified across 8 variants for robustness

Training Configuration

Parameter	Value
Epochs	3
Epoch 1	Curriculum-ordered (easy → hard)
Epochs 2-3	Shuffled with diversified system prompts
Peak LR	5e-5
Scheduler	Cosine with warmup
Warmup Steps	2,000
Weight Decay	0.05
Grad Clipping	1.0
Batch Size	16 (gradient accumulation)
Precision	float16 (MLX)
Hardware	Apple Silicon (M-series)

Training Loss

Epoch 1 Start: ~3.7
Epoch 1 End: ~2.3
Epoch 3 End: ~1.6
Final Val Loss: ~1.73

🚀 Quick Start (MLX)

import mlx.core as mx
from model.transformer import MirrorTransformer, ModelArgs
from tokenizer_wrapper import MirrorTokenizer

args = ModelArgs(
    dim=512, hidden_dim=1365, n_layers=8,
    vocab_size=32002, use_moe=True,
    num_experts=16, num_experts_per_tok=2,
    shared_expert_dim=1365
)

model = MirrorTransformer(args)
model.load_weights("model.safetensors", strict=False)
model.eval()

tokenizer = MirrorTokenizer("custom_bpe_32k_v2.json")

⚠️ Limitations

Small model: 236M parameters limits reasoning depth and factual recall
Limited training data: ~61M tokens vs billions for comparable models
English only: Trained exclusively on English data
Single-turn: No multi-turn conversation support
Tool queries: Sometimes garbles search queries for unfamiliar topics
Context window: Limited to 512 tokens

📄 License

MIT

👤 Author

Dipesh Majithia