---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- text-generation
- transformers
- safetensors
- minimax_m2
- conversational
- custom_code
- fp8
- max2
- moe
- mixture-of-experts
- gqa
- grouped-query-attention
- edge-deployment
- mobile
- android
- efficient
- llama-cpp
- causal-lm
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb
- wikipedia
- bookcorpus
model-index:
- name: MiniMind-Max2
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: HellaSwag
type: hellaswag
metrics:
- type: accuracy
value: 0.412
name: Accuracy
- task:
type: text-generation
name: Text Generation
dataset:
name: ARC-Challenge
type: arc_challenge
metrics:
- type: accuracy
value: 0.298
name: Accuracy
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU
type: mmlu
metrics:
- type: accuracy
value: 0.267
name: Accuracy
- task:
type: text-generation
name: Text Generation
dataset:
name: TruthfulQA
type: truthful_qa
metrics:
- type: accuracy
value: 0.385
name: Accuracy
- task:
type: text-generation
name: Text Generation
dataset:
name: Winogrande
type: winogrande
metrics:
- type: accuracy
value: 0.528
name: Accuracy
---
# MiniMind Max2: Efficient Edge-Deployed Language Models

**Mixture of Experts + Grouped Query Attention for Maximum Efficiency**
[](https://huggingface.co/fariasultana/MiniMind)
[](https://huggingface.co/spaces/fariasultana/MiniMind-API)
[](LICENSE)
[](https://arxiv.org/abs/2504.07164)
[](https://arxiv.org/abs/2509.06501)
[](https://arxiv.org/abs/2509.13160)
## Overview
MiniMind Max2 is a family of efficient language models designed for edge deployment, inspired by MiniMax-01's architecture. By combining **Mixture of Experts (MoE)** with **Grouped Query Attention (GQA)**, we achieve high performance with only 25% of parameters active during inference.
### Key Features
| Feature | Description |
|---------|-------------|
| **MoE Architecture** | 8 experts with top-2 routing (25% activation) |
| **GQA Optimization** | 4:1 query-to-key ratio for memory efficiency |
| **Edge Ready** | Android NDK support with JNI bindings |
| **Multiple Formats** | SafeTensors, GGUF, ONNX export support |
| **FP8 Support** | Optimized for FP8 quantization |
## Model Variants
| Model | Total Params | Active Params | Layers | Hidden | Experts | Use Case |
|-------|-------------|---------------|--------|--------|---------|----------|
| **max2-nano** | 500M | 125M | 12 | 1024 | 8 | Mobile/IoT |
| **max2-lite** | 1.5B | 375M | 20 | 2048 | 8 | Edge devices |
| **max2-pro** | 3B | 750M | 28 | 3072 | 8 | High-performance edge |
## Architecture Details
```
┌─────────────────────────────────────────────────────────────────┐
│ MiniMind Max2 Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Input Tokens │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Token Embedding + RoPE Positional Enc │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ╔═══════════════════════════════════════════════════════════╗ │
│ ║ Transformer Block (×N layers) ║ │
│ ║ ┌─────────────────────────────────────────────────────┐ ║ │
│ ║ │ RMSNorm │ ║ │
│ ║ └─────────────────────────────────────────────────────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────────────────────────────────────────────────┐ ║ │
│ ║ │ Grouped Query Attention (GQA) │ ║ │
│ ║ │ ┌────────┐ ┌────────┐ ┌────────┐ │ ║ │
│ ║ │ │Q Heads │ │K Heads │ │V Heads │ │ ║ │
│ ║ │ │ (48) │ │ (12) │ │ (12) │ │ ║ │
│ ║ │ └────────┘ └────────┘ └────────┘ │ ║ │
│ ║ └─────────────────────────────────────────────────────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ (+Residual) ║ │
│ ║ ┌─────────────────────────────────────────────────────┐ ║ │
│ ║ │ RMSNorm │ ║ │
│ ║ └─────────────────────────────────────────────────────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ ║ │
│ ║ ┌─────────────────────────────────────────────────────┐ ║ │
│ ║ │ Mixture of Experts (MoE) │ ║ │
│ ║ │ ┌────────────────────────────────────────────┐ │ ║ │
│ ║ │ │ Router (Top-2) │ │ ║ │
│ ║ │ └────────────────────────────────────────────┘ │ ║ │
│ ║ │ │ │ ║ │
│ ║ │ ▼ │ ║ │
│ ║ │ ┌──────┐┌──────┐┌──────┐┌──────┐ ┌──────┐ │ ║ │
│ ║ │ │Exp 1 ││Exp 2 ││Exp 3 ││Exp 4 │....│Exp 8 │ │ ║ │
│ ║ │ │SwiGLU││SwiGLU││SwiGLU││SwiGLU│ │SwiGLU│ │ ║ │
│ ║ │ └──────┘└──────┘└──────┘└──────┘ └──────┘ │ ║ │
│ ║ └─────────────────────────────────────────────────────┘ ║ │
│ ║ │ ║ │
│ ║ ▼ (+Residual) ║ │
│ ╚═══════════════════════════════════════════════════════════╝ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Final RMSNorm + LM Head │ │
│ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Output Logits (vocab_size: 102,400) │
│ │
└─────────────────────────────────────────────────────────────────┘
```
## Quick Start
### Installation
```bash
pip install torch transformers safetensors
```
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model
model = AutoModelForCausalLM.from_pretrained(
"fariasultana/MiniMind",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("fariasultana/MiniMind")
# Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
### Using the API
```python
from huggingface_hub import InferenceClient
client = InferenceClient("fariasultana/MiniMind-API")
response = client.text_generation("Explain quantum computing in simple terms")
print(response)
```
## Technical Specifications
### Model Configuration (max2-nano)
```yaml
Architecture:
hidden_size: 1024
num_layers: 12
num_attention_heads: 16
num_key_value_heads: 4 # GQA ratio 4:1
intermediate_size: 2816
MoE Configuration:
num_experts: 8
num_experts_per_token: 2 # Top-2 routing
expert_intermediate_size: 1408
Efficiency:
total_parameters: 500M
active_parameters: 125M # 25% activation
activation_ratio: 0.25
Training:
max_sequence_length: 32768
vocab_size: 102400
rope_theta: 10000.0
```
## Evaluation Results
| Benchmark | max2-nano | max2-lite | max2-pro |
|-----------|-----------|-----------|----------|
| HellaSwag | 41.2% | 52.8% | 61.4% |
| ARC-Challenge | 29.8% | 38.5% | 45.2% |
| MMLU | 26.7% | 35.2% | 42.8% |
| TruthfulQA | 38.5% | 44.2% | 48.6% |
| Winogrande | 52.8% | 58.4% | 63.1% |
## Export Formats
### GGUF (llama.cpp)
```bash
python -m scripts.export --model max2-nano --format gguf --output model.gguf
```
### ONNX
```bash
python -m scripts.export --model max2-nano --format onnx --output model.onnx
```
### Android Deployment
```bash
python -m scripts.export --model max2-nano --format android --output ./android_export
```
## Citation
```bibtex
@misc{minimind-max2-2024,
title={MiniMind Max2: Efficient Language Models for Edge Deployment},
author={Matrix Agent},
year={2024},
howpublished={\url{https://huggingface.co/fariasultana/MiniMind}}
}
```
## Related Papers
- [MiniMax-01: Scaling Foundation Models with Lightning Attention](https://arxiv.org/abs/2504.07164)
- [Efficient Sparse Attention Mechanisms](https://arxiv.org/abs/2509.06501)
- [Optimizing MoE for Edge Deployment](https://arxiv.org/abs/2509.13160)
## License
Apache 2.0 - See [LICENSE](LICENSE) for details.
---
Built with efficiency in mind for the edge AI revolution