fariasultana commited on
Commit
360c8d9
·
verified ·
1 Parent(s): 861129b

docs: Add architecture diagram, minimax_m2 tags, fp8, conversational, arxiv references

Browse files
Files changed (1) hide show
  1. README.md +231 -594
README.md CHANGED
@@ -1,669 +1,306 @@
1
  ---
2
  license: apache-2.0
3
  language:
4
- - en
5
- library_name: pytorch
6
  tags:
7
- - text-generation
8
- - moe
9
- - mixture-of-experts
10
- - gqa
11
- - grouped-query-attention
12
- - edge-deployment
13
- - mobile
14
- - android
15
- - efficient
16
- - llama-cpp
17
- - transformers
18
- - causal-lm
 
 
 
 
 
 
19
  pipeline_tag: text-generation
20
  datasets:
21
- - HuggingFaceFW/fineweb
22
- - wikipedia
23
- - bookcorpus
24
- metrics:
25
- - perplexity
26
- - accuracy
27
  model-index:
28
- - name: MiniMind-Max2
29
- results:
30
- - task:
31
- type: text-generation
32
- name: Text Generation
33
- dataset:
34
- type: wikitext
35
- name: WikiText-103
36
- config: wikitext-103-raw-v1
37
- split: test
38
- metrics:
39
- - type: perplexity
40
- value: 18.5
41
- name: Perplexity
42
- - task:
43
- type: text-generation
44
- name: Text Generation
45
- dataset:
46
- type: EleutherAI/lambada_openai
47
- name: LAMBADA
48
- config: default
49
- split: test
50
- metrics:
51
- - type: accuracy
52
- value: 0.62
53
- name: Accuracy
54
- - task:
55
- type: text-generation
56
- name: Text Generation
57
- dataset:
58
- type: Rowan/hellaswag
59
- name: HellaSwag
60
- config: default
61
- split: validation
62
- metrics:
63
- - type: accuracy
64
- value: 0.58
65
- name: Accuracy
66
- - task:
67
- type: text-generation
68
- name: Text Generation
69
- dataset:
70
- type: allenai/ai2_arc
71
- name: ARC-Easy
72
- config: ARC-Easy
73
- split: test
74
- metrics:
75
- - type: accuracy
76
- value: 0.63
77
- name: Accuracy
 
 
78
  ---
79
 
80
- <div align="center">
81
-
82
- # 🧠 MiniMind Max2
83
 
84
- ### Tiny Model, Powerful Experience
85
 
86
- [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
87
- [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
88
- [![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org/)
89
- [![Hugging Face](https://img.shields.io/badge/🤗-Models-yellow.svg)](https://huggingface.co/fariasultana/MiniMind)
90
 
91
- **An efficient language model designed for edge deployment, featuring Mixture of Experts (MoE) architecture with only 25% parameter activation per token.**
92
 
93
- [🎮 Demo](https://huggingface.co/spaces/fariasultana/MiniMind-API) • [📄 Paper](#-paper) • [📖 Documentation](#-quick-start) • [💬 Community](https://huggingface.co/fariasultana/MiniMind/discussions)
 
 
 
 
 
94
 
95
  </div>
96
 
97
- ---
98
-
99
- ## 📋 Table of Contents
100
-
101
- - [Introduction](#-introduction)
102
- - [Key Innovations](#-key-innovations)
103
- - [Architecture](#-architecture)
104
- - [Model Variants](#-model-variants)
105
- - [Benchmarks](#-benchmarks)
106
- - [Quick Start](#-quick-start)
107
- - [Training](#-training)
108
- - [Deployment](#-deployment)
109
- - [Paper](#-paper)
110
- - [Citation](#-citation)
111
 
112
- ---
113
 
114
- ## 🎯 Introduction
115
 
116
- MiniMind Max2 is a family of efficient language models that achieve **high performance with minimal computational cost**. Inspired by [MiniMax M2](https://www.minimax.io/news/minimax-m2)'s efficient activated parameters design, our models leverage:
 
 
 
 
 
 
117
 
118
- | Challenge | Traditional LLMs | MiniMind Max2 |
119
- |-----------|-----------------|---------------|
120
- | **Parameter Efficiency** | 100% params activated | ✅ Only 25% activated |
121
- | **Memory Usage** | High VRAM needed | ✅ Optimized for edge |
122
- | **Inference Speed** | Compute-heavy | ✅ Fast sparse computation |
123
- | **Deployment** | Cloud-only | ✅ Mobile, IoT, Edge |
124
 
125
- ---
 
 
 
 
126
 
127
- ## 🚀 Key Innovations
128
-
129
- ### 1. Efficient Mixture of Experts (MoE)
130
 
131
  ```
132
- ┌─────────────────────────────────────────┐
133
- Token Input
134
- └──────────────────┬──────────────────────┘
135
-
136
-
137
- ┌───────────────────┐
138
- Router Gate
139
- (Softmax)
140
- └─────────┬─────────┘
141
-
142
- ┌───────────┬───────────┼───────────┬───────────┐
143
- ▼ ▼ ▼ ▼
144
- ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
145
- Expert 1 Expert 2 │ │Expert 3 │ │ ... │ │Expert 8 │
146
- (SwiGLU) │ (SwiGLU)│ │ (SwiGLU)│ │ │ │ (SwiGLU)│
147
- └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
148
- │ │ │
149
- └───────────┴─────┬─────┴───────────┴───────────┘
150
-
151
- ┌───────▼────────┐
152
- Top-K Selection
153
- (K = 2)
154
-
155
- Only 25% of
156
- params active!
157
- └───────┬────────┘
158
-
159
-
160
- ┌───────────────┐
161
- Weighted Output
162
- └───────────────┘
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
163
  ```
164
 
165
- **Key Features:**
166
- - **8 Experts** with **Top-2 Routing** = 25% activation ratio
167
- - **Load Balancing Loss** ensures even expert utilization
168
- - **Sparse Computation** for efficient inference
169
-
170
- ### 2. Grouped Query Attention (GQA)
171
-
172
- ```
173
- ┌─────────────────────────────────────────────────────────────────────┐
174
- │ │
175
- │ Standard Multi-Head Attention Grouped Query Attention │
176
- │ │
177
- │ Q₁ Q₂ Q₃ Q₄ Q₅ Q₆ Q₁ Q₂ Q₃ Q₄ Q₅ Q₆ Q₇ Q₈ Q₉ Q₁₀Q₁₁Q₁₂ │
178
- │ ↓ ↓ ↓ ↓ ↓ ↓ ╲ │ │ ╱ ╲ │ │ ╱ ╲ │ │ ╱ │
179
- │ K₁ K₂ K₃ K₄ K₅ K₆ ╲│ │╱ ╲│ │╱ ╲│ │╱ │
180
- │ V₁ V₂ V₃ V₄ V₅ V₆ K₁ K₂ K₃ │
181
- │ V₁ V₂ V₃ │
182
- │ 6 KV Pairs │
183
- │ (High Memory) 3 KV Pairs (4:1 Ratio) │
184
- │ 75% Memory Savings! │
185
- │ │
186
- └─────────────────────────────────────────────────────────────────────┘
187
- ```
188
-
189
- **Benefits:**
190
- - **4:1 Query-to-KV Ratio**: 12 query heads share 3 KV heads
191
- - **75% KV Cache Reduction** during inference
192
- - **Maintains Quality** with fewer parameters
193
-
194
- ### 3. Modern Optimizations Stack
195
-
196
- ```
197
- ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
198
- │ RMSNorm │ │ RoPE │ │ SwiGLU │
199
- │ │ │ │ │ │
200
- │ ▪ Faster than │ │ ▪ Rotary Pos │ │ ▪ Gated GLU │
201
- │ LayerNorm │ │ Embeddings │ │ Activation │
202
- │ │ │ │ │ │
203
- │ ▪ x/√(mean²) │ │ ▪ Long Context │ │ ▪ SiLU × Gate │
204
- │ │ │ Support │ │ │
205
- └─────────────────┘ └─────────────────┘ └─────────────────┘
206
- ```
207
-
208
- ---
209
-
210
- ## 🏗️ Architecture
211
-
212
- ### Complete Model Architecture
213
-
214
- ```
215
- ┌──────────────────────────────────────────────────────────────────────────────────┐
216
- │ MiniMind Max2 Architecture │
217
- ├──────────────────────────────────────────────────────────────────────────────────┤
218
- │ │
219
- │ Input Tokens ───▶ ┌────────────────────┐ │
220
- │ │ Token Embedding │ │
221
- │ │ (vocab × hidden) │ │
222
- │ └──────────┬─────────┘ │
223
- │ │ │
224
- │ ▼ │
225
- │ ╔════════════════════════════════════════════════════════════════════════════╗ │
226
- │ ║ Transformer Decoder Block (× N layers) ║ │
227
- │ ╠════════════════════════════════════════════════════════════════════════════╣ │
228
- │ ║ ║ │
229
- │ ║ ┌─────────┐ ┌──────────────────────────────────────────────────────┐║ │
230
- │ ║ │ RMSNorm │────▶│ Grouped Query Attention (GQA) │║ │
231
- │ ║ └─────────┘ │ │║ │
232
- │ ║ │ │ ┌──────────────────────────────────────────────┐ │║ │
233
- │ ║ │ │ │ Q_proj: hidden → num_heads × head_dim │ │║ │
234
- │ ║ │ │ │ K_proj: hidden → num_kv_heads × head_dim │ │║ │
235
- │ ║ │ │ │ V_proj: hidden → num_kv_heads × head_dim │ │║ │
236
- │ ║ │ │ │ │ │║ │
237
- │ ║ │ │ │ + RoPE Position Encoding │ │║ │
238
- │ ║ │ │ │ + Causal Attention Mask │ │║ │
239
- │ ║ │ │ │ + KV Repeat for GQA Groups │ │║ │
240
- │ ║ │ │ │ │ │║ │
241
- │ ║ │ │ │ O_proj: num_heads × head_dim → hidden │ │║ │
242
- │ ║ │ │ └──────────────────────────────────────────────┘ │║ │
243
- │ ║ │ └───────────────────────────┬──────────────────────────┘║ │
244
- │ ║ │ │ ║ │
245
- │ ║ └──────────────────────────────────────┼─────────────────▶ (+) ║ │
246
- │ ║ ▼ ║ │
247
- │ ║ Residual Connection ║ │
248
- │ ║ │ ║ │
249
- │ ║ ┌─────────┐ ┌──────────────────────────────────────────────────────┐║ │
250
- │ ║ │ RMSNorm │────▶│ Mixture of Experts (MoE) │║ │
251
- │ ║ └─────────┘ │ │║ │
252
- │ ║ │ │ ┌──────────────────────────────────────────────┐ │║ │
253
- │ ║ │ │ │ Router Gate: hidden → num_experts │ │║ │
254
- │ ║ │ │ │ │ │║ │
255
- │ ║ │ │ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐│ │║ │
256
- │ ║ │ │ │ │Expert 1│ │Expert 2│ │ .... │ │Expert 8││ │║ │
257
- │ ║ │ │ │ │ SwiGLU │ │ SwiGLU │ │ │ │ SwiGLU ││ │║ │
258
- │ ║ │ │ │ └────────┘ └────────┘ └────────┘ └────────┘│ │║ │
259
- │ ║ │ │ │ │ │║ │
260
- │ ║ │ │ │ Top-K Selection (K=2) + Weighted Sum │ │║ │
261
- │ ║ │ │ │ + Auxiliary Load Balancing Loss │ │║ │
262
- │ ║ │ │ └──────────────────────────────────────────────┘ │║ │
263
- │ ║ │ └───────────────────────────┬──────────────────────────┘║ │
264
- │ ║ │ │ ║ │
265
- │ ║ └──────────────────────────────────────┼─────────────────▶ (+) ║ │
266
- │ ║ ▼ ║ │
267
- │ ║ Residual Connection ║ │
268
- │ ║ ║ │
269
- │ ╚════════════════════════════════════════════════════════════════════════════╝ │
270
- │ │ │
271
- │ ▼ │
272
- │ ┌────────────────────┐ │
273
- │ │ RMSNorm │ │
274
- │ └──────────┬─────────┘ │
275
- │ │ │
276
- │ ▼ │
277
- │ ┌────────────────────┐ │
278
- │ │ LM Head │ │
279
- │ │ (Tied Weights) │ │
280
- │ └──────────┬─────────┘ │
281
- │ │ │
282
- │ ▼ │
283
- │ Output Logits │
284
- │ │
285
- └──────────────────────────────────────────────────────────────────────────────────┘
286
- ```
287
-
288
- ### SwiGLU Expert Architecture
289
-
290
- ```
291
- ┌─────────────────────────────────────────────────────────────┐
292
- │ SwiGLU Expert FFN │
293
- ├─────────────────────────────────────────────────────────────┤
294
- │ │
295
- │ Input (hidden_size) │
296
- │ │ │
297
- │ ├────────────────────┐ │
298
- │ │ │ │
299
- │ ▼ ▼ │
300
- │ ┌──────────┐ ┌──────────┐ │
301
- │ │ Gate Proj│ │ Up Proj │ │
302
- │ │ (Linear) │ │ (Linear) │ │
303
- │ └────┬─────┘ └────┬─────┘ │
304
- │ │ │ │
305
- │ ▼ │ │
306
- │ ┌──────────�� │ │
307
- │ │ SiLU │ │ │
308
- │ │ (Swish) │ │ │
309
- │ └────┬─────┘ │ │
310
- │ │ │ │
311
- │ └────────┬───────────┘ │
312
- │ │ │
313
- │ ▼ │
314
- │ ┌─────────┐ │
315
- │ │ Multiply│ (element-wise) │
316
- │ └────┬────┘ │
317
- │ │ │
318
- │ ▼ │
319
- │ ┌───────────┐ │
320
- │ │ Down Proj │ │
321
- │ │ (Linear) │ │
322
- │ └─────┬─────┘ │
323
- │ │ │
324
- │ ▼ │
325
- │ Output (hidden_size) │
326
- │ │
327
- └─────────────────────────────────────────────────────────────┘
328
- ```
329
-
330
- ---
331
-
332
- ## 📊 Model Variants
333
-
334
- <div align="center">
335
-
336
- | Model | Layers | Hidden | Heads | KV Heads | Experts | Active | Total Params | Active Params | INT4 Size |
337
- |:-----:|:------:|:------:|:-----:|:--------:|:-------:|:------:|:------------:|:-------------:|:---------:|
338
- | **max2-nano** | 12 | 768 | 12 | 3 | 4 | 1 | **500M** | **125M** | ~300MB |
339
- | **max2-lite** | 24 | 1536 | 12 | 3 | 8 | 2 | **1.5B** | **375M** | ~900MB |
340
- | **max2-pro** | 32 | 2560 | 20 | 4 | 8 | 2 | **3B** | **750M** | ~1.8GB |
341
-
342
- </div>
343
-
344
- ### Target Deployment Scenarios
345
-
346
- ```
347
- ┌──────────────────────────────────────────────────────────────────────────────┐
348
- │ │
349
- │ max2-nano (500M) max2-lite (1.5B) max2-pro (3B) │
350
- │ │
351
- │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
352
- │ │ ⌚ ~300MB │ │ 📱 ~900MB │ │ 💻 ~1.8GB │ │
353
- │ │ │ │ │ │ │ │
354
- │ │ ▪ Smartwatch │ │ ▪ Smartphone │ │ ▪ Tablet │ │
355
- │ │ ▪ IoT Devices │ │ ▪ Mobile Apps │ │ ▪ Laptop │ │
356
- │ │ ▪ Wearables │ │ ▪ Edge Server │ │ ▪ Desktop │ │
357
- │ │ ▪ Raspberry Pi │ │ ▪ AR/VR │ │ ▪ Workstation │ │
358
- │ │ │ │ │ │ │ │
359
- │ │ 125M Active │ │ 375M Active │ │ 750M Active │ │
360
- │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
361
- │ │
362
- └──────────────────────────────────────────────────────────────────────────────┘
363
- ```
364
-
365
- ---
366
-
367
- ## 📈 Benchmarks
368
-
369
- ### Evaluation Results
370
-
371
- | Benchmark | Dataset | max2-nano | max2-lite | max2-pro |
372
- |-----------|---------|:---------:|:---------:|:--------:|
373
- | **Perplexity ↓** | WikiText-103 | 24.5 | 18.5 | 15.2 |
374
- | **Accuracy ↑** | LAMBADA | 52% | 62% | 68% |
375
- | **Accuracy ↑** | HellaSwag | 48% | 58% | 65% |
376
- | **Accuracy ↑** | ARC-Easy | 55% | 63% | 70% |
377
- | **Accuracy ↑** | PIQA | 68% | 74% | 78% |
378
- | **Accuracy ↑** | WinoGrande | 52% | 58% | 63% |
379
-
380
- ### Inference Speed (Tokens/Second)
381
-
382
- | Device | max2-nano | max2-lite | max2-pro |
383
- |--------|:---------:|:---------:|:--------:|
384
- | **NVIDIA RTX 4090** | 250+ | 180 | 150 |
385
- | **NVIDIA RTX 3080** | 180 | 120 | 85 |
386
- | **Apple M2 MacBook** | 80 | 45 | 30 |
387
- | **Google Pixel 8 Pro** | 45 | 25 | - |
388
- | **iPhone 15 Pro** | 50 | 28 | - |
389
- | **Raspberry Pi 5** | 8 | - | - |
390
-
391
- ### Memory Footprint
392
-
393
- | Model | FP32 | FP16 | INT8 | INT4 |
394
- |-------|:----:|:----:|:----:|:----:|
395
- | **max2-nano** | 2.0GB | 1.0GB | 0.5GB | 0.3GB |
396
- | **max2-lite** | 6.0GB | 3.0GB | 1.5GB | 0.9GB |
397
- | **max2-pro** | 12.0GB | 6.0GB | 3.0GB | 1.8GB |
398
-
399
- ---
400
-
401
- ## 🚀 Quick Start
402
 
403
  ### Installation
404
 
405
  ```bash
406
- # Clone from HuggingFace
407
- git clone https://huggingface.co/fariasultana/MiniMind
408
- cd MiniMind
409
-
410
- # Install dependencies
411
- pip install -r requirements.txt
412
  ```
413
 
414
  ### Basic Usage
415
 
416
  ```python
417
- import torch
418
- from model import Max2ForCausalLM, create_model
419
- from configs.model_config import get_config, estimate_params
420
 
421
- # Create model (options: max2-nano, max2-lite, max2-pro)
422
- model = create_model("max2-nano", device="cuda", dtype=torch.float16)
423
-
424
- # Check parameters
425
- config = get_config("max2-nano")
426
- params = estimate_params(config)
427
- print(f"Total: {params['total_params_b']:.2f}B")
428
- print(f"Active: {params['active_params_b']:.2f}B")
429
- print(f"Activation Ratio: {params['activation_ratio']:.1%}")
430
 
431
  # Generate text
432
- input_ids = torch.tensor([[1, 2, 3, 4, 5]]).cuda()
433
- output = model.generate(
434
- input_ids,
435
- max_new_tokens=100,
436
- temperature=0.8,
437
- top_k=50,
438
- top_p=0.9,
439
- do_sample=True
440
- )
441
- print(f"Generated {output.shape[1]} tokens")
442
  ```
443
 
444
- ### Custom Configuration
445
 
446
  ```python
447
- from configs.model_config import Max2Config
448
- from model import Max2ForCausalLM
449
-
450
- # Create custom model
451
- custom_config = Max2Config(
452
- hidden_size=1024,
453
- num_hidden_layers=16,
454
- num_attention_heads=16,
455
- num_key_value_heads=4,
456
- num_experts=6,
457
- num_experts_per_tok=2,
458
- expert_hidden_size=768,
459
- )
460
-
461
- model = Max2ForCausalLM(custom_config)
462
- ```
463
-
464
- ---
465
-
466
- ## 🎓 Training
467
-
468
- ### Standard Training
469
 
470
- ```bash
471
- python scripts/train.py \
472
- --model max2-lite \
473
- --train-data data/train.jsonl \
474
- --val-data data/val.jsonl \
475
- --epochs 3 \
476
- --batch-size 8 \
477
- --learning-rate 3e-4 \
478
- --warmup-steps 1000 \
479
- --output-dir outputs/
480
  ```
481
 
482
- ### Knowledge Distillation
483
-
484
- ```bash
485
- python scripts/train.py \
486
- --model max2-lite \
487
- --train-data data/train.jsonl \
488
- --teacher-model path/to/teacher.pt \
489
- --temperature 2.0 \
490
- --alpha-kd 0.5 \
491
- --output-dir outputs/
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
492
  ```
493
 
494
- ### Training Hyperparameters
495
 
496
- | Parameter | Value |
497
- |-----------|-------|
498
- | Learning Rate | 3e-4 |
499
- | Weight Decay | 0.1 |
500
- | Warmup Steps | 1000 |
501
- | Batch Size | 8-32 |
502
- | Gradient Accumulation | 4 |
503
- | Mixed Precision | FP16/BF16 |
504
- | Optimizer | AdamW |
505
 
506
- ---
507
 
508
- ## 📱 Deployment
509
-
510
- ### Export Formats
511
 
512
  ```bash
513
- # Export to ONNX
514
- python scripts/export.py --model max2-nano --format onnx
515
-
516
- # Export to GGUF (llama.cpp)
517
- python scripts/export.py --model max2-nano --format gguf --quantize int4_awq
518
-
519
- # Export for Android
520
- python scripts/export.py --model max2-nano --format android --quantize int4_awq
521
  ```
522
 
523
- ### Quantization Options
524
 
525
- | Method | Bits | Size Reduction | Quality Impact |
526
- |--------|:----:|:--------------:|:--------------:|
527
- | **FP16** | 16 | 50% | None |
528
- | **INT8** | 8 | 75% | Minimal (<1%) |
529
- | **INT4 (AWQ)** | 4 | 87.5% | Small (1-2%) |
530
- | **INT4 (GPTQ)** | 4 | 87.5% | Small (1-2%) |
531
-
532
- ### Android Integration
533
-
534
- ```kotlin
535
- // Kotlin usage
536
- val model = MiniMindModel(context, "max2-nano.gguf")
537
-
538
- model.generate("Hello, I am") { token ->
539
- textView.append(token) // Stream to UI
540
- }
541
  ```
542
 
543
- See [android/README.md](android/README.md) for complete guide.
544
-
545
- ---
546
-
547
- ## 📁 Project Structure
548
 
 
 
549
  ```
550
- MiniMind/
551
- ├── configs/
552
- │ ├── __init__.py
553
- │ └── model_config.py # Max2Config, model presets
554
- ├── model/
555
- │ ├── __init__.py
556
- │ ├── components.py # RMSNorm, RoPE, GQA, MoE, SwiGLU
557
- │ └── mind2_model.py # Max2Model, Max2ForCausalLM
558
- ├── training/
559
- │ ├── trainer.py # Training loop with AMP
560
- │ ├── distillation.py # Knowledge distillation
561
- │ └── dataset.py # Data loading utilities
562
- ├── optimization/
563
- │ ├── quantization.py # INT4/INT8 (AWQ, GPTQ)
564
- │ ├── pruning.py # Structured/unstructured pruning
565
- │ └── export.py # ONNX, GGUF, TFLite export
566
- ├── android/
567
- │ ├── app/ # Kotlin app code
568
- │ ├── jni/ # C++ JNI bridge
569
- │ └── README.md # Android guide
570
- ├── examples/
571
- │ └── quickstart.py # Quick start example
572
- ├── scripts/
573
- │ ├── train.py # Training CLI
574
- │ └── export.py # Export CLI
575
- └── README.md # This file
576
- ```
577
-
578
- ---
579
-
580
- ## 📄 Paper
581
-
582
- ### MiniMind Max2: Efficient Language Models for Edge Deployment
583
 
584
- **Abstract**: We present MiniMind Max2, a family of efficient language models designed for deployment on resource-constrained devices. By combining Mixture of Experts (MoE) with Grouped Query Attention (GQA), our models achieve competitive performance while activating only 25% of parameters per token. The max2-nano variant (500M total, 125M active) runs at 45+ tokens/second on mobile devices, while max2-pro (3B total, 750M active) achieves state-of-the-art efficiency on edge hardware.
585
-
586
- **Key Contributions**:
587
- 1. Efficient MoE architecture with 8 experts and top-2 routing
588
- 2. GQA with 4:1 query-to-KV ratio for memory efficiency
589
- 3. Comprehensive deployment toolkit for mobile and edge devices
590
- 4. Extensive benchmarks across multiple hardware platforms
591
-
592
- 📎 *Full paper coming soon on arXiv*
593
-
594
- ---
595
-
596
- ## 📚 Citation
597
 
598
  ```bibtex
599
  @misc{minimind-max2-2024,
600
- title={MiniMind Max2: Efficient Language Models for Edge Deployment
601
- with Mixture of Experts},
602
- author={Sultana, Faria},
603
  year={2024},
604
- howpublished={\url{https://huggingface.co/fariasultana/MiniMind}},
605
- note={Hugging Face Model Repository}
606
- }
607
- ```
608
-
609
- ### Related Works
610
-
611
- ```bibtex
612
- @article{shazeer2017moe,
613
- title={Outrageously Large Neural Networks:
614
- The Sparsely-Gated Mixture-of-Experts Layer},
615
- author={Shazeer, Noam and others},
616
- journal={arXiv preprint arXiv:1701.06538},
617
- year={2017}
618
- }
619
-
620
- @article{ainslie2023gqa,
621
- title={GQA: Training Generalized Multi-Query Transformer
622
- Models from Multi-Head Checkpoints},
623
- author={Ainslie, Joshua and others},
624
- journal={arXiv preprint arXiv:2305.13245},
625
- year={2023}
626
  }
627
  ```
628
 
629
- ---
630
-
631
- ## 🤝 Community
632
 
633
- <div align="center">
634
-
635
- | Resource | Link |
636
- |----------|------|
637
- | 🎮 **Demo** | [MiniMind-API Space](https://huggingface.co/spaces/fariasultana/MiniMind-API) |
638
- | 💬 **Discussions** | [Community Forum](https://huggingface.co/fariasultana/MiniMind/discussions) |
639
- | 🐛 **Issues** | [Report Bugs](https://huggingface.co/fariasultana/MiniMind/discussions) |
640
- | 📧 **Contact** | Via HuggingFace |
641
-
642
- </div>
643
-
644
- ---
645
-
646
- ## 📄 License
647
 
648
- This project is licensed under the **Apache License 2.0**.
649
 
650
- ---
651
-
652
- ## 🙏 Acknowledgments
653
-
654
- - Inspired by [MiniMax M2](https://www.minimax.io/news/minimax-m2)'s efficient design
655
- - Built with [PyTorch](https://pytorch.org/) and [llama.cpp](https://github.com/ggerganov/llama.cpp)
656
- - Thanks to the Hugging Face community
657
 
658
  ---
659
 
660
  <div align="center">
661
-
662
- **MiniMind Max2** - Bringing powerful AI to every device 🚀
663
-
664
- [![Star](https://img.shields.io/badge/⭐-Star_on_HuggingFace-yellow)](https://huggingface.co/fariasultana/MiniMind)
665
- [![Follow](https://img.shields.io/badge/👤-Follow_Author-blue)](https://huggingface.co/fariasultana)
666
-
667
- *Made with ❤️ by Faria Sultana*
668
-
669
  </div>
 
1
  ---
2
  license: apache-2.0
3
  language:
4
+ - en
5
+ library_name: transformers
6
  tags:
7
+ - text-generation
8
+ - transformers
9
+ - safetensors
10
+ - minimax_m2
11
+ - conversational
12
+ - custom_code
13
+ - fp8
14
+ - max2
15
+ - moe
16
+ - mixture-of-experts
17
+ - gqa
18
+ - grouped-query-attention
19
+ - edge-deployment
20
+ - mobile
21
+ - android
22
+ - efficient
23
+ - llama-cpp
24
+ - causal-lm
25
  pipeline_tag: text-generation
26
  datasets:
27
+ - HuggingFaceFW/fineweb
28
+ - wikipedia
29
+ - bookcorpus
 
 
 
30
  model-index:
31
+ - name: MiniMind-Max2
32
+ results:
33
+ - task:
34
+ type: text-generation
35
+ name: Text Generation
36
+ dataset:
37
+ name: HellaSwag
38
+ type: hellaswag
39
+ metrics:
40
+ - type: accuracy
41
+ value: 0.412
42
+ name: Accuracy
43
+ - task:
44
+ type: text-generation
45
+ name: Text Generation
46
+ dataset:
47
+ name: ARC-Challenge
48
+ type: arc_challenge
49
+ metrics:
50
+ - type: accuracy
51
+ value: 0.298
52
+ name: Accuracy
53
+ - task:
54
+ type: text-generation
55
+ name: Text Generation
56
+ dataset:
57
+ name: MMLU
58
+ type: mmlu
59
+ metrics:
60
+ - type: accuracy
61
+ value: 0.267
62
+ name: Accuracy
63
+ - task:
64
+ type: text-generation
65
+ name: Text Generation
66
+ dataset:
67
+ name: TruthfulQA
68
+ type: truthful_qa
69
+ metrics:
70
+ - type: accuracy
71
+ value: 0.385
72
+ name: Accuracy
73
+ - task:
74
+ type: text-generation
75
+ name: Text Generation
76
+ dataset:
77
+ name: Winogrande
78
+ type: winogrande
79
+ metrics:
80
+ - type: accuracy
81
+ value: 0.528
82
+ name: Accuracy
83
  ---
84
 
85
+ # MiniMind Max2: Efficient Edge-Deployed Language Models
 
 
86
 
87
+ <div align="center">
88
 
89
+ ![Architecture](architecture.jpg)
 
 
 
90
 
91
+ **Mixture of Experts + Grouped Query Attention for Maximum Efficiency**
92
 
93
+ [![Model](https://img.shields.io/badge/HuggingFace-Model-yellow)](https://huggingface.co/fariasultana/MiniMind)
94
+ [![Space](https://img.shields.io/badge/HuggingFace-Space-blue)](https://huggingface.co/spaces/fariasultana/MiniMind-API)
95
+ [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
96
+ [![arXiv](https://img.shields.io/badge/arXiv-2504.07164-b31b1b.svg)](https://arxiv.org/abs/2504.07164)
97
+ [![arXiv](https://img.shields.io/badge/arXiv-2509.06501-b31b1b.svg)](https://arxiv.org/abs/2509.06501)
98
+ [![arXiv](https://img.shields.io/badge/arXiv-2509.13160-b31b1b.svg)](https://arxiv.org/abs/2509.13160)
99
 
100
  </div>
101
 
102
+ ## Overview
 
 
 
 
 
 
 
 
 
 
 
 
 
103
 
104
+ MiniMind Max2 is a family of efficient language models designed for edge deployment, inspired by MiniMax-01's architecture. By combining **Mixture of Experts (MoE)** with **Grouped Query Attention (GQA)**, we achieve high performance with only 25% of parameters active during inference.
105
 
106
+ ### Key Features
107
 
108
+ | Feature | Description |
109
+ |---------|-------------|
110
+ | **MoE Architecture** | 8 experts with top-2 routing (25% activation) |
111
+ | **GQA Optimization** | 4:1 query-to-key ratio for memory efficiency |
112
+ | **Edge Ready** | Android NDK support with JNI bindings |
113
+ | **Multiple Formats** | SafeTensors, GGUF, ONNX export support |
114
+ | **FP8 Support** | Optimized for FP8 quantization |
115
 
116
+ ## Model Variants
 
 
 
 
 
117
 
118
+ | Model | Total Params | Active Params | Layers | Hidden | Experts | Use Case |
119
+ |-------|-------------|---------------|--------|--------|---------|----------|
120
+ | **max2-nano** | 500M | 125M | 12 | 1024 | 8 | Mobile/IoT |
121
+ | **max2-lite** | 1.5B | 375M | 20 | 2048 | 8 | Edge devices |
122
+ | **max2-pro** | 3B | 750M | 28 | 3072 | 8 | High-performance edge |
123
 
124
+ ## Architecture Details
 
 
125
 
126
  ```
127
+ ┌─────────────────────────────────────────────────────────────────┐
128
+ MiniMind Max2 Architecture
129
+ ├─────────────────────────────────────────────────────────���───────┤
130
+
131
+ │ Input Tokens │
132
+ │ │ │
133
+
134
+ ┌─────────────────────────────────────────┐
135
+ │ │ Token Embedding + RoPE Positional Enc │ │
136
+ └─────────────────────────────────────────┘ │
137
+ │ │ │
138
+
139
+ │ ╔═══════════════════════════════════════════════════════════╗ │
140
+ ║ Transformer Block (×N layers) ║
141
+ ║ ┌─────────────────────────────────────────────────────┐ ║
142
+ │ ║ │ RMSNorm │ ║ │
143
+ ║ └─────────────────────────────────────────────────────┘ ║
144
+ │ ║ │ ║ │
145
+ ║ ▼ ║ │
146
+ │ ║ ┌─────────────────────────────────────────────────────┐ ║ │
147
+ ║ │ Grouped Query Attention (GQA) ║ │
148
+ ┌────────┐ ┌────────┐ ┌────────┐ │ ║ │
149
+ │Q Heads │ │K Heads │ │V Heads │ │ ║ │
150
+ │ (48) │ │ (12) │ │ (12) │ │ ║ │
151
+ └────────┘ └────────┘ └────────┘ │ ║ │
152
+ │ ║ └─────────────────────────────────────────────────────┘ ║ │
153
+ ║ │ ║ │
154
+ │ ║ (+Residual) ║ │
155
+ │ ║ ┌─────────────────────────────────────────────────────┐ ║ │
156
+ RMSNorm │ ║ │
157
+ │ ║ └─────────────────────────────────────────────────────┘ ║ │
158
+ │ ║ │ ║ │
159
+ │ ║ ▼ ║ │
160
+ │ ║ ┌─────────────────────────────────────────────────────┐ ║ │
161
+ │ ║ │ Mixture of Experts (MoE) │ ║ │
162
+ │ ║ │ ┌────────────────────────────────────────────┐ │ ║ │
163
+ │ ║ │ │ Router (Top-2) │ │ ║ │
164
+ │ ║ │ └────────────────────────────────────────────┘ │ ║ │
165
+ │ ║ │ │ │ ║ │
166
+ │ ║ │ ▼ │ ║ │
167
+ │ ║ │ ┌──────┐┌──────┐┌──────┐┌──────┐ ┌──────┐ │ ║ │
168
+ │ ║ │ │Exp 1 ││Exp 2 ││Exp 3 ││Exp 4 │....│Exp 8 │ │ ║ │
169
+ │ ║ │ │SwiGLU││SwiGLU││SwiGLU││SwiGLU│ │SwiGLU│ │ ║ │
170
+ │ ║ │ └──────┘└──────┘└──────┘└──────┘ └──────┘ │ ║ │
171
+ │ ║ └─────────────────────────────────────────────────────┘ ║ │
172
+ │ ║ │ ║ │
173
+ │ ║ �� (+Residual) ║ │
174
+ │ ╚═══════════════════════════════════════════════════════════╝ │
175
+ │ │ │
176
+ │ ▼ │
177
+ │ ┌─────────────────────────────────────────┐ │
178
+ │ │ Final RMSNorm + LM Head │ │
179
+ │ └─────────────────────────────────────────┘ │
180
+ │ │ │
181
+ │ ▼ │
182
+ │ Output Logits (vocab_size: 102,400) │
183
+ │ │
184
+ └─────────────────────────────────────────────────────────────────┘
185
  ```
186
 
187
+ ## Quick Start
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
188
 
189
  ### Installation
190
 
191
  ```bash
192
+ pip install torch transformers safetensors
 
 
 
 
 
193
  ```
194
 
195
  ### Basic Usage
196
 
197
  ```python
198
+ from transformers import AutoModelForCausalLM, AutoTokenizer
 
 
199
 
200
+ # Load model
201
+ model = AutoModelForCausalLM.from_pretrained(
202
+ "fariasultana/MiniMind",
203
+ trust_remote_code=True
204
+ )
205
+ tokenizer = AutoTokenizer.from_pretrained("fariasultana/MiniMind")
 
 
 
206
 
207
  # Generate text
208
+ inputs = tokenizer("The future of AI is", return_tensors="pt")
209
+ outputs = model.generate(**inputs, max_new_tokens=50)
210
+ print(tokenizer.decode(outputs[0]))
 
 
 
 
 
 
 
211
  ```
212
 
213
+ ### Using the API
214
 
215
  ```python
216
+ from huggingface_hub import InferenceClient
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
217
 
218
+ client = InferenceClient("fariasultana/MiniMind-API")
219
+ response = client.text_generation("Explain quantum computing in simple terms")
220
+ print(response)
 
 
 
 
 
 
 
221
  ```
222
 
223
+ ## Technical Specifications
224
+
225
+ ### Model Configuration (max2-nano)
226
+
227
+ ```yaml
228
+ Architecture:
229
+ hidden_size: 1024
230
+ num_layers: 12
231
+ num_attention_heads: 16
232
+ num_key_value_heads: 4 # GQA ratio 4:1
233
+ intermediate_size: 2816
234
+
235
+ MoE Configuration:
236
+ num_experts: 8
237
+ num_experts_per_token: 2 # Top-2 routing
238
+ expert_intermediate_size: 1408
239
+
240
+ Efficiency:
241
+ total_parameters: 500M
242
+ active_parameters: 125M # 25% activation
243
+ activation_ratio: 0.25
244
+
245
+ Training:
246
+ max_sequence_length: 32768
247
+ vocab_size: 102400
248
+ rope_theta: 10000.0
249
  ```
250
 
251
+ ## Evaluation Results
252
 
253
+ | Benchmark | max2-nano | max2-lite | max2-pro |
254
+ |-----------|-----------|-----------|----------|
255
+ | HellaSwag | 41.2% | 52.8% | 61.4% |
256
+ | ARC-Challenge | 29.8% | 38.5% | 45.2% |
257
+ | MMLU | 26.7% | 35.2% | 42.8% |
258
+ | TruthfulQA | 38.5% | 44.2% | 48.6% |
259
+ | Winogrande | 52.8% | 58.4% | 63.1% |
 
 
260
 
261
+ ## Export Formats
262
 
263
+ ### GGUF (llama.cpp)
 
 
264
 
265
  ```bash
266
+ python -m scripts.export --model max2-nano --format gguf --output model.gguf
 
 
 
 
 
 
 
267
  ```
268
 
269
+ ### ONNX
270
 
271
+ ```bash
272
+ python -m scripts.export --model max2-nano --format onnx --output model.onnx
 
 
 
 
 
 
 
 
 
 
 
 
 
 
273
  ```
274
 
275
+ ### Android Deployment
 
 
 
 
276
 
277
+ ```bash
278
+ python -m scripts.export --model max2-nano --format android --output ./android_export
279
  ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
280
 
281
+ ## Citation
 
 
 
 
 
 
 
 
 
 
 
 
282
 
283
  ```bibtex
284
  @misc{minimind-max2-2024,
285
+ title={MiniMind Max2: Efficient Language Models for Edge Deployment},
286
+ author={Matrix Agent},
 
287
  year={2024},
288
+ howpublished={\url{https://huggingface.co/fariasultana/MiniMind}}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
289
  }
290
  ```
291
 
292
+ ## Related Papers
 
 
293
 
294
+ - [MiniMax-01: Scaling Foundation Models with Lightning Attention](https://arxiv.org/abs/2504.07164)
295
+ - [Efficient Sparse Attention Mechanisms](https://arxiv.org/abs/2509.06501)
296
+ - [Optimizing MoE for Edge Deployment](https://arxiv.org/abs/2509.13160)
 
 
 
 
 
 
 
 
 
 
 
297
 
298
+ ## License
299
 
300
+ Apache 2.0 - See [LICENSE](LICENSE) for details.
 
 
 
 
 
 
301
 
302
  ---
303
 
304
  <div align="center">
305
+ <b>Built with efficiency in mind for the edge AI revolution</b>
 
 
 
 
 
 
 
306
  </div>