---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- text-generation
- transformers
- safetensors
- minimax_m2
- conversational
- custom_code
- fp8
- max2
- moe
- mixture-of-experts
- gqa
- grouped-query-attention
- edge-deployment
- mobile
- android
- efficient
- llama-cpp
- causal-lm
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb
- wikipedia
- bookcorpus
model-index:
- name: MiniMind-Max2
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - type: accuracy
      value: 0.412
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ARC-Challenge
      type: arc_challenge
    metrics:
    - type: accuracy
      value: 0.298
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - type: accuracy
      value: 0.267
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA
      type: truthful_qa
    metrics:
    - type: accuracy
      value: 0.385
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.528
      name: Accuracy
---

# MiniMind Max2: Efficient Edge-Deployed Language Models
![Architecture](architecture.jpg)

**Mixture of Experts + Grouped Query Attention for Maximum Efficiency**

[![Model](https://img.shields.io/badge/HuggingFace-Model-yellow)](https://huggingface.co/fariasultana/MiniMind)
[![Space](https://img.shields.io/badge/HuggingFace-Space-blue)](https://huggingface.co/spaces/fariasultana/MiniMind-API)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
[![arXiv](https://img.shields.io/badge/arXiv-2504.07164-b31b1b.svg)](https://arxiv.org/abs/2504.07164)
[![arXiv](https://img.shields.io/badge/arXiv-2509.06501-b31b1b.svg)](https://arxiv.org/abs/2509.06501)
[![arXiv](https://img.shields.io/badge/arXiv-2509.13160-b31b1b.svg)](https://arxiv.org/abs/2509.13160)
## Overview

MiniMind Max2 is a family of efficient language models designed for edge deployment, inspired by MiniMax-01's architecture. By combining **Mixture of Experts (MoE)** with **Grouped Query Attention (GQA)**, the models keep inference cost low: only 25% of each model's parameters are active for any given token.

### Key Features

| Feature | Description |
|---------|-------------|
| **MoE Architecture** | 8 experts with top-2 routing (25% activation) |
| **GQA Optimization** | 4:1 query-to-key/value head ratio for memory efficiency |
| **Edge Ready** | Android NDK support with JNI bindings |
| **Multiple Formats** | SafeTensors, GGUF, and ONNX export support |
| **FP8 Support** | Optimized for FP8 quantization |

## Model Variants

| Model | Total Params | Active Params | Layers | Hidden Size | Experts | Use Case |
|-------|--------------|---------------|--------|-------------|---------|----------|
| **max2-nano** | 500M | 125M | 12 | 1024 | 8 | Mobile/IoT |
| **max2-lite** | 1.5B | 375M | 20 | 2048 | 8 | Edge devices |
| **max2-pro** | 3B | 750M | 28 | 3072 | 8 | High-performance edge |

## Architecture Details

```
           MiniMind Max2 Architecture

                  Input Tokens
                       │
                       ▼
   ┌────────────────────────────────────────┐
   │ Token Embedding + RoPE Positional Enc  │
   └────────────────────────────────────────┘
                       │
                       ▼
╔══════════════════════════════════════════════╗
║         Transformer Block (×N layers)        ║
║  ┌────────────────────────────────────────┐  ║
║  │                RMSNorm                 │  ║
║  └────────────────────────────────────────┘  ║
║                      │                       ║
║                      ▼                       ║
║  ┌────────────────────────────────────────┐  ║
║  │     Grouped Query Attention (GQA)      │  ║
║  │  ┌────────┐  ┌────────┐  ┌────────┐    │  ║
║  │  │Q Heads │  │K Heads │  │V Heads │    │  ║
║  │  │  (48)  │  │  (12)  │  │  (12)  │    │  ║
║  │  └────────┘  └────────┘  └────────┘    │  ║
║  └────────────────────────────────────────┘  ║
║                      │                       ║
║                      ▼ (+Residual)           ║
║  ┌────────────────────────────────────────┐  ║
║  │                RMSNorm                 │  ║
║  └────────────────────────────────────────┘  ║
║                      │                       ║
║                      ▼                       ║
║  ┌────────────────────────────────────────┐  ║
║  │        Mixture of Experts (MoE)        │  ║
║  │    ┌──────────────────────────────┐    │  ║
║  │    │        Router (Top-2)        │    │  ║
║  │    └──────────────────────────────┘    │  ║
║  │                   │                    │  ║
║  │                   ▼                    │  ║
║  │  ┌──────┐┌──────┐┌──────┐    ┌──────┐  │  ║
║  │  │Exp 1 ││Exp 2 ││Exp 3 │....│Exp 8 │  │  ║
║  │  │SwiGLU││SwiGLU││SwiGLU│    │SwiGLU│  │  ║
║  │  └──────┘└──────┘└──────┘    └──────┘  │  ║
║  └────────────────────────────────────────┘  ║
║                      │                       ║
║                      ▼ (+Residual)           ║
╚══════════════════════════════════════════════╝
                       │
                       ▼
   ┌────────────────────────────────────────┐
   │         Final RMSNorm + LM Head        │
   └────────────────────────────────────────┘
                       │
                       ▼
      Output Logits (vocab_size: 102,400)
```

The head counts shown (48 query, 12 key/value) illustrate the 4:1 GQA grouping; exact counts vary by variant (max2-nano uses 16 query and 4 key/value heads, as listed under Technical Specifications below).
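To make the two mechanisms concrete, the sketch below shows top-2 expert routing and the K/V-head sharing behind GQA in plain PyTorch. It is illustrative only: `router`, `experts`, and the tensor shapes are assumptions made for the sketch, not the model's actual implementation.

```python
import torch
import torch.nn.functional as F

def top2_moe(x, router, experts):
    # x: (tokens, hidden); router: nn.Linear(hidden, 8);
    # experts: list of 8 SwiGLU FFN modules. Illustrative shapes only.
    logits = router(x)                         # (tokens, 8)
    weights, idx = logits.topk(2, dim=-1)      # choose 2 of the 8 experts per token
    weights = F.softmax(weights, dim=-1)       # renormalize over the chosen 2
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(2):                  # top-1 and top-2 slots
            mask = idx[:, slot] == e           # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out                                 # only 2 of 8 expert FFNs ran per token

def expand_kv(kv, num_q_heads):
    # kv: (batch, num_kv_heads, seq, head_dim). With 16 query heads and
    # 4 K/V heads (max2-nano), each K/V head is shared by 4 query heads,
    # shrinking the KV cache 4x relative to full multi-head attention.
    group = num_q_heads // kv.shape[1]
    return kv.repeat_interleave(group, dim=1)  # (batch, num_q_heads, seq, head_dim)
```

The 2-of-8 expert selection is also where the headline 25% activation ratio comes from: roughly a quarter of the expert FFN weights, which dominate the parameter count, run for any given token.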
## Quick Start

### Installation

```bash
pip install torch transformers safetensors
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model (trust_remote_code is required for the custom architecture)
model = AutoModelForCausalLM.from_pretrained(
    "fariasultana/MiniMind",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("fariasultana/MiniMind")

# Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Using the API

```python
from huggingface_hub import InferenceClient

client = InferenceClient("fariasultana/MiniMind-API")
response = client.text_generation("Explain quantum computing in simple terms")
print(response)
```

## Technical Specifications

### Model Configuration (max2-nano)

```yaml
Architecture:
  hidden_size: 1024
  num_layers: 12
  num_attention_heads: 16
  num_key_value_heads: 4        # GQA ratio 4:1
  intermediate_size: 2816

MoE Configuration:
  num_experts: 8
  num_experts_per_token: 2      # Top-2 routing
  expert_intermediate_size: 1408

Efficiency:
  total_parameters: 500M
  active_parameters: 125M       # 25% activation
  activation_ratio: 0.25

Training:
  max_sequence_length: 32768
  vocab_size: 102400
  rope_theta: 10000.0
```

## Evaluation Results

| Benchmark | max2-nano | max2-lite | max2-pro |
|-----------|-----------|-----------|----------|
| HellaSwag | 41.2% | 52.8% | 61.4% |
| ARC-Challenge | 29.8% | 38.5% | 45.2% |
| MMLU | 26.7% | 35.2% | 42.8% |
| TruthfulQA | 38.5% | 44.2% | 48.6% |
| Winogrande | 52.8% | 58.4% | 63.1% |

## Export Formats

### GGUF (llama.cpp)

```bash
python -m scripts.export --model max2-nano --format gguf --output model.gguf
```

### ONNX

```bash
python -m scripts.export --model max2-nano --format onnx --output model.onnx
```

### Android Deployment

```bash
python -m scripts.export --model max2-nano --format android --output ./android_export
```
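Whatever the target format, it is worth smoke-testing the exported artifact before shipping it. Below is a minimal sketch for the ONNX case, assuming the exported graph exposes a single `input_ids` input and logits as its first output; inspect `sess.get_inputs()` and `sess.get_outputs()` to confirm what your export actually produced.

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
print([i.name for i in sess.get_inputs()])   # verify the real input names first

# Toy batch of token ids; "input_ids" is an assumed input name.
input_ids = np.array([[1, 2, 3, 4]], dtype=np.int64)
outputs = sess.run(None, {"input_ids": input_ids})

# If the export matches the card's config, the first output should be
# logits shaped (1, 4, 102400).
print(outputs[0].shape)
```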
## Citation

```bibtex
@misc{minimind-max2-2024,
  title={MiniMind Max2: Efficient Language Models for Edge Deployment},
  author={Matrix Agent},
  year={2024},
  howpublished={\url{https://huggingface.co/fariasultana/MiniMind}}
}
```

## Related Papers

- [MiniMax-01: Scaling Foundation Models with Lightning Attention](https://arxiv.org/abs/2504.07164)
- [Efficient Sparse Attention Mechanisms](https://arxiv.org/abs/2509.06501)
- [Optimizing MoE for Edge Deployment](https://arxiv.org/abs/2509.13160)

## License

Apache 2.0 - See [LICENSE](LICENSE) for details.

---

Built with efficiency in mind for the edge AI revolution