---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- text-generation
- transformers
- safetensors
- minimax_m2
- conversational
- custom_code
- fp8
- max2
- moe
- mixture-of-experts
- gqa
- grouped-query-attention
- edge-deployment
- mobile
- android
- efficient
- llama-cpp
- causal-lm
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb
- wikipedia
- bookcorpus
model-index:
- name: MiniMind-Max2
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - type: accuracy
      value: 0.412
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ARC-Challenge
      type: arc_challenge
    metrics:
    - type: accuracy
      value: 0.298
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - type: accuracy
      value: 0.267
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA
      type: truthful_qa
    metrics:
    - type: accuracy
      value: 0.385
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.528
      name: Accuracy
---

# MiniMind Max2: Efficient Edge-Deployed Language Models
![Architecture](architecture.jpg)

**Mixture of Experts + Grouped Query Attention for Maximum Efficiency**

[![Model](https://img.shields.io/badge/HuggingFace-Model-yellow)](https://huggingface.co/fariasultana/MiniMind)
[![Space](https://img.shields.io/badge/HuggingFace-Space-blue)](https://huggingface.co/spaces/fariasultana/MiniMind-API)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
[![arXiv](https://img.shields.io/badge/arXiv-2504.07164-b31b1b.svg)](https://arxiv.org/abs/2504.07164)
[![arXiv](https://img.shields.io/badge/arXiv-2509.06501-b31b1b.svg)](https://arxiv.org/abs/2509.06501)
[![arXiv](https://img.shields.io/badge/arXiv-2509.13160-b31b1b.svg)](https://arxiv.org/abs/2509.13160)
## Overview

MiniMind Max2 is a family of efficient language models designed for edge deployment, inspired by MiniMax-01's architecture. By combining **Mixture of Experts (MoE)** with **Grouped Query Attention (GQA)**, the models keep inference cost low: only 25% of each model's parameters are active for any given token.

### Key Features

| Feature | Description |
|---------|-------------|
| **MoE Architecture** | 8 experts with top-2 routing (25% activation) |
| **GQA Optimization** | 4:1 query-to-key/value head ratio for memory efficiency |
| **Edge Ready** | Android NDK support with JNI bindings |
| **Multiple Formats** | SafeTensors, GGUF, and ONNX export support |
| **FP8 Support** | Optimized for FP8 quantization |

## Model Variants

| Model | Total Params | Active Params | Layers | Hidden Size | Experts | Use Case |
|-------|--------------|---------------|--------|-------------|---------|----------|
| **max2-nano** | 500M | 125M | 12 | 1024 | 8 | Mobile/IoT |
| **max2-lite** | 1.5B | 375M | 20 | 2048 | 8 | Edge devices |
| **max2-pro** | 3B | 750M | 28 | 3072 | 8 | High-performance edge |

## Architecture Details

```
           MiniMind Max2 Architecture

                  Input Tokens
                       │
                       ▼
   ┌────────────────────────────────────────┐
   │ Token Embedding + RoPE Positional Enc  │
   └────────────────────────────────────────┘
                       │
                       ▼
╔══════════════════════════════════════════════╗
║         Transformer Block (×N layers)        ║
║  ┌────────────────────────────────────────┐  ║
║  │                RMSNorm                 │  ║
║  └────────────────────────────────────────┘  ║
║                      │                       ║
║                      ▼                       ║
║  ┌────────────────────────────────────────┐  ║
║  │     Grouped Query Attention (GQA)      │  ║
║  │  ┌────────┐  ┌────────┐  ┌────────┐    │  ║
║  │  │Q Heads │  │K Heads │  │V Heads │    │  ║
║  │  │  (48)  │  │  (12)  │  │  (12)  │    │  ║
║  │  └────────┘  └────────┘  └────────┘    │  ║
║  └────────────────────────────────────────┘  ║
║                      │                       ║
║                      ▼ (+Residual)           ║
║  ┌────────────────────────────────────────┐  ║
║  │                RMSNorm                 │  ║
║  └────────────────────────────────────────┘  ║
║                      │                       ║
║                      ▼                       ║
║  ┌────────────────────────────────────────┐  ║
║  │        Mixture of Experts (MoE)        │  ║
║  │    ┌──────────────────────────────┐    │  ║
║  │    │        Router (Top-2)        │    │  ║
║  │    └──────────────────────────────┘    │  ║
║  │                   │                    │  ║
║  │                   ▼                    │  ║
║  │  ┌──────┐┌──────┐┌──────┐    ┌──────┐  │  ║
║  │  │Exp 1 ││Exp 2 ││Exp 3 │....│Exp 8 │  │  ║
║  │  │SwiGLU││SwiGLU││SwiGLU│    │SwiGLU│  │  ║
║  │  └──────┘└──────┘└──────┘    └──────┘  │  ║
║  └────────────────────────────────────────┘  ║
║                      │                       ║
║                      ▼ (+Residual)           ║
╚══════════════════════════════════════════════╝
                       │
                       ▼
   ┌────────────────────────────────────────┐
   │         Final RMSNorm + LM Head        │
   └────────────────────────────────────────┘
                       │
                       ▼
      Output Logits (vocab_size: 102,400)
```

The head counts shown (48 query, 12 key/value) illustrate the 4:1 GQA grouping; exact counts vary by variant (max2-nano uses 16 query and 4 key/value heads, as listed under Technical Specifications below).
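To make the two mechanisms concrete, the sketch below shows top-2 expert routing and the K/V-head sharing behind GQA in plain PyTorch. It is illustrative only: `router`, `experts`, and the tensor shapes are assumptions made for the sketch, not the model's actual implementation.

```python
import torch
import torch.nn.functional as F

def top2_moe(x, router, experts):
    # x: (tokens, hidden); router: nn.Linear(hidden, 8);
    # experts: list of 8 SwiGLU FFN modules. Illustrative shapes only.
    logits = router(x)                         # (tokens, 8)
    weights, idx = logits.topk(2, dim=-1)      # choose 2 of the 8 experts per token
    weights = F.softmax(weights, dim=-1)       # renormalize over the chosen 2
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(2):                  # top-1 and top-2 slots
            mask = idx[:, slot] == e           # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out                                 # only 2 of 8 expert FFNs ran per token

def expand_kv(kv, num_q_heads):
    # kv: (batch, num_kv_heads, seq, head_dim). With 16 query heads and
    # 4 K/V heads (max2-nano), each K/V head is shared by 4 query heads,
    # shrinking the KV cache 4x relative to full multi-head attention.
    group = num_q_heads // kv.shape[1]
    return kv.repeat_interleave(group, dim=1)  # (batch, num_q_heads, seq, head_dim)
```

The 2-of-8 expert selection is also where the headline 25% activation ratio comes from: roughly a quarter of the expert FFN weights, which dominate the parameter count, run for any given token.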
## Quick Start

### Installation

```bash
pip install torch transformers safetensors
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model (trust_remote_code is required for the custom architecture)
model = AutoModelForCausalLM.from_pretrained(
    "fariasultana/MiniMind",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("fariasultana/MiniMind")

# Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Using the API

```python
from huggingface_hub import InferenceClient

client = InferenceClient("fariasultana/MiniMind-API")
response = client.text_generation("Explain quantum computing in simple terms")
print(response)
```

## Technical Specifications

### Model Configuration (max2-nano)

```yaml
Architecture:
  hidden_size: 1024
  num_layers: 12
  num_attention_heads: 16
  num_key_value_heads: 4        # GQA ratio 4:1
  intermediate_size: 2816

MoE Configuration:
  num_experts: 8
  num_experts_per_token: 2      # Top-2 routing
  expert_intermediate_size: 1408

Efficiency:
  total_parameters: 500M
  active_parameters: 125M       # 25% activation
  activation_ratio: 0.25

Training:
  max_sequence_length: 32768
  vocab_size: 102400
  rope_theta: 10000.0
```

## Evaluation Results

| Benchmark | max2-nano | max2-lite | max2-pro |
|-----------|-----------|-----------|----------|
| HellaSwag | 41.2% | 52.8% | 61.4% |
| ARC-Challenge | 29.8% | 38.5% | 45.2% |
| MMLU | 26.7% | 35.2% | 42.8% |
| TruthfulQA | 38.5% | 44.2% | 48.6% |
| Winogrande | 52.8% | 58.4% | 63.1% |

## Export Formats

### GGUF (llama.cpp)

```bash
python -m scripts.export --model max2-nano --format gguf --output model.gguf
```

### ONNX

```bash
python -m scripts.export --model max2-nano --format onnx --output model.onnx
```

### Android Deployment

```bash
python -m scripts.export --model max2-nano --format android --output ./android_export
```
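Whatever the target format, it is worth smoke-testing the exported artifact before shipping it. Below is a minimal sketch for the ONNX case, assuming the exported graph exposes a single `input_ids` input and logits as its first output; inspect `sess.get_inputs()` and `sess.get_outputs()` to confirm what your export actually produced.

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
print([i.name for i in sess.get_inputs()])   # verify the real input names first

# Toy batch of token ids; "input_ids" is an assumed input name.
input_ids = np.array([[1, 2, 3, 4]], dtype=np.int64)
outputs = sess.run(None, {"input_ids": input_ids})

# If the export matches the card's config, the first output should be
# logits shaped (1, 4, 102400).
print(outputs[0].shape)
```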
## Citation

```bibtex
@misc{minimind-max2-2024,
  title={MiniMind Max2: Efficient Language Models for Edge Deployment},
  author={Matrix Agent},
  year={2024},
  howpublished={\url{https://huggingface.co/fariasultana/MiniMind}}
}
```

## Related Papers

- [MiniMax-01: Scaling Foundation Models with Lightning Attention](https://arxiv.org/abs/2504.07164)
- [Efficient Sparse Attention Mechanisms](https://arxiv.org/abs/2509.06501)
- [Optimizing MoE for Edge Deployment](https://arxiv.org/abs/2509.13160)

## License

Apache 2.0 - See [LICENSE](LICENSE) for details.

---

Built with efficiency in mind for the edge AI revolution