---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- text-generation
- transformers
- safetensors
- minimax_m2
- conversational
- custom_code
- fp8
- max2
- moe
- mixture-of-experts
- gqa
- grouped-query-attention
- edge-deployment
- mobile
- android
- efficient
- llama-cpp
- causal-lm
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb
- wikipedia
- bookcorpus
model-index:
- name: MiniMind-Max2
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - type: accuracy
      value: 0.412
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ARC-Challenge
      type: arc_challenge
    metrics:
    - type: accuracy
      value: 0.298
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - type: accuracy
      value: 0.267
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA
      type: truthful_qa
    metrics:
    - type: accuracy
      value: 0.385
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.528
      name: Accuracy
---
# MiniMind Max2: Efficient Edge-Deployed Language Models

<div align="center">



**Mixture of Experts + Grouped Query Attention for Maximum Efficiency**

[](https://huggingface.co/fariasultana/MiniMind)
[](https://huggingface.co/spaces/fariasultana/MiniMind-API)
[](LICENSE)
[](https://arxiv.org/abs/2504.07164)
[](https://arxiv.org/abs/2509.06501)
[](https://arxiv.org/abs/2509.13160)

</div>
## Overview

MiniMind Max2 is a family of efficient language models designed for edge deployment, inspired by MiniMax-01's architecture. By combining **Mixture of Experts (MoE)** with **Grouped Query Attention (GQA)**, the models deliver strong quality for their size while keeping only 25% of their parameters active per token during inference.
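To make the 25% figure concrete: with top-2 routing over 8 experts, only a quarter of the expert weights run for any given token. A back-of-envelope sketch in Python (it assumes expert FFN weights dominate the parameter budget, which the card does not state explicitly):

```python
# Back-of-envelope check of the 25% activation claim for max2-nano.
# Assumption: expert FFN weights dominate the 500M total; the card only
# publishes the total/active split, not per-component parameter counts.
total_params = 500e6
num_experts, experts_per_token = 8, 2

active_fraction = experts_per_token / num_experts   # 2/8 = 0.25
active_params = total_params * active_fraction      # 125M
print(f"~{active_params/1e6:.0f}M of {total_params/1e6:.0f}M params active "
      f"({active_fraction:.0%})")
```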
### Key Features

| Feature | Description |
|---------|-------------|
| **MoE Architecture** | 8 experts with top-2 routing (25% activation) |
| **GQA Optimization** | 4:1 query-to-KV-head ratio for memory efficiency (see the sketch below the table) |
| **Edge Ready** | Android NDK support with JNI bindings |
| **Multiple Formats** | SafeTensors, GGUF, ONNX export support |
| **FP8 Support** | Optimized for FP8 quantization |
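The 4:1 GQA ratio means four query heads share each key/value head. A minimal PyTorch sketch of the pattern, using the max2-nano head counts (16 query heads, 4 KV heads); the tensor setup is illustrative, not code from this repository:

```python
import torch
import torch.nn.functional as F

# Grouped-query attention at the card's 4:1 ratio (max2-nano head counts).
batch, seq_len, head_dim = 1, 8, 64
num_q_heads, num_kv_heads = 16, 4
group = num_q_heads // num_kv_heads   # 4 query heads per KV head

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Expand K/V so every group of 4 query heads reads the same KV head;
# only the 4-head K/V tensors ever need to be cached.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 8, 64])
```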
## Model Variants

| Model | Total Params | Active Params | Layers | Hidden | Experts | Use Case |
|-------|-------------|---------------|--------|--------|---------|----------|
| **max2-nano** | 500M | 125M | 12 | 1024 | 8 | Mobile/IoT |
| **max2-lite** | 1.5B | 375M | 20 | 2048 | 8 | Edge devices |
| **max2-pro** | 3B | 750M | 28 | 3072 | 8 | High-performance edge |
## Architecture Details

```
┌─────────────────────────────────────────────────────────────────┐
│                   MiniMind Max2 Architecture                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Input Tokens                                                   │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────┐                      │
│  │ Token Embedding + RoPE Positional Enc │                      │
│  └───────────────────────────────────────┘                      │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────────────────────┐      │
│  │             Transformer Block (×N layers)             │      │
│  │  ┌─────────────────────────────────────────────────┐  │      │
│  │  │                     RMSNorm                     │  │      │
│  │  └─────────────────────────────────────────────────┘  │      │
│  │      │                                                │      │
│  │      ▼                                                │      │
│  │  ┌─────────────────────────────────────────────────┐  │      │
│  │  │          Grouped Query Attention (GQA)          │  │      │
│  │  │  ┌────────┐ ┌────────┐ ┌────────┐               │  │      │
│  │  │  │Q Heads │ │K Heads │ │V Heads │               │  │      │
│  │  │  │  (48)  │ │  (12)  │ │  (12)  │               │  │      │
│  │  │  └────────┘ └────────┘ └────────┘               │  │      │
│  │  └─────────────────────────────────────────────────┘  │      │
│  │      │                                                │      │
│  │      ▼ (+Residual)                                    │      │
│  │  ┌─────────────────────────────────────────────────┐  │      │
│  │  │                     RMSNorm                     │  │      │
│  │  └─────────────────────────────────────────────────┘  │      │
│  │      │                                                │      │
│  │      ▼                                                │      │
│  │  ┌─────────────────────────────────────────────────┐  │      │
│  │  │             Mixture of Experts (MoE)            │  │      │
│  │  │  ┌───────────────────────────────────────────┐  │  │      │
│  │  │  │              Router (Top-2)               │  │  │      │
│  │  │  └───────────────────────────────────────────┘  │  │      │
│  │  │      │                                          │  │      │
│  │  │      ▼                                          │  │      │
│  │  │  ┌──────┐┌──────┐┌──────┐┌──────┐    ┌──────┐   │  │      │
│  │  │  │Exp 1 ││Exp 2 ││Exp 3 ││Exp 4 │....│Exp 8 │   │  │      │
│  │  │  │SwiGLU││SwiGLU││SwiGLU││SwiGLU│    │SwiGLU│   │  │      │
│  │  │  └──────┘└──────┘└──────┘└──────┘    └──────┘   │  │      │
│  │  └─────────────────────────────────────────────────┘  │      │
│  │      │                                                │      │
│  │      ▼ (+Residual)                                    │      │
│  └───────────────────────────────────────────────────────┘      │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────┐                      │
│  │        Final RMSNorm + LM Head        │                      │
│  └───────────────────────────────────────┘                      │
│      │                                                          │
│      ▼                                                          │
│  Output Logits (vocab_size: 102,400)                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
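The MoE block in the diagram boils down to a router picking 2 of 8 SwiGLU experts per token. Here is a minimal PyTorch sketch using the max2-nano dimensions; the class and its internals are illustrative, not taken from the repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Minimal top-2 MoE layer matching the diagram: a linear router
    picks 2 of 8 SwiGLU experts per token. A sketch of the pattern,
    not this repo's implementation."""

    def __init__(self, hidden=1024, expert_inner=1408, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.top_k = top_k
        # One SwiGLU expert = gate, up and down projections.
        self.gate = nn.ModuleList([nn.Linear(hidden, expert_inner, bias=False) for _ in range(num_experts)])
        self.up = nn.ModuleList([nn.Linear(hidden, expert_inner, bias=False) for _ in range(num_experts)])
        self.down = nn.ModuleList([nn.Linear(expert_inner, hidden, bias=False) for _ in range(num_experts)])

    def forward(self, x):  # x: (tokens, hidden)
        logits = self.router(x)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for e in range(len(self.gate)):
            rows, slots = torch.where(idx == e)  # tokens routed to expert e
            if rows.numel() == 0:
                continue
            h = x[rows]
            h = self.down[e](F.silu(self.gate[e](h)) * self.up[e](h))  # SwiGLU
            out[rows] += weights[rows, slots, None] * h
        return out

moe = Top2MoE()
print(moe(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```

Each token's output is a convex combination of exactly two expert outputs, which is where the 25% activation figure comes from.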
## Quick Start

### Installation

```bash
pip install torch transformers safetensors
```
### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "fariasultana/MiniMind",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("fariasultana/MiniMind")

# Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
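The `conversational` tag suggests the tokenizer ships a chat template; if it does, multi-turn prompts go through the standard transformers `apply_chat_template` API. Continuing from the snippet above (whether this repo actually includes a template is an assumption on our part):

```python
# Multi-turn usage, assuming the tokenizer defines a chat template
# (implied by the "conversational" tag but not verified here).
messages = [
    {"role": "user", "content": "Explain MoE routing in one sentence."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```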
### Using the API

```python
from huggingface_hub import InferenceClient

client = InferenceClient("fariasultana/MiniMind-API")
response = client.text_generation("Explain quantum computing in simple terms")
print(response)
```
## Technical Specifications

### Model Configuration (max2-nano)

```yaml
Architecture:
  hidden_size: 1024
  num_layers: 12
  num_attention_heads: 16
  num_key_value_heads: 4        # GQA ratio 4:1
  intermediate_size: 2816

MoE Configuration:
  num_experts: 8
  num_experts_per_token: 2      # Top-2 routing
  expert_intermediate_size: 1408

Efficiency:
  total_parameters: 500M
  active_parameters: 125M       # 25% activation
  activation_ratio: 0.25

Training:
  max_sequence_length: 32768
  vocab_size: 102400
  rope_theta: 10000.0
```
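A practical payoff of the 4:1 GQA ratio in this config is a 4× smaller KV cache. Quick arithmetic from the numbers above (an fp16 cache is assumed; head_dim is derived as hidden_size / num_attention_heads):

```python
# KV-cache footprint per token for max2-nano, from the config above.
# Assumes an fp16 cache (2 bytes per element).
hidden_size, num_layers = 1024, 12
num_heads, num_kv_heads = 16, 4
head_dim = hidden_size // num_heads   # 64
bytes_per_elem = 2

gqa = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem   # K and V
mha = 2 * num_layers * num_heads * head_dim * bytes_per_elem
print(f"GQA: {gqa/1024:.0f} KiB/token vs full MHA: {mha/1024:.0f} KiB/token")
# GQA: 12 KiB/token vs full MHA: 48 KiB/token
```

At the full 32,768-token context that is roughly 384 MiB of cache instead of 1.5 GiB.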
## Evaluation Results

| Benchmark | max2-nano | max2-lite | max2-pro |
|-----------|-----------|-----------|----------|
| HellaSwag | 41.2% | 52.8% | 61.4% |
| ARC-Challenge | 29.8% | 38.5% | 45.2% |
| MMLU | 26.7% | 35.2% | 42.8% |
| TruthfulQA | 38.5% | 44.2% | 48.6% |
| Winogrande | 52.8% | 58.4% | 63.1% |
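The card does not state the evaluation harness or few-shot settings. Scores like these are commonly produced with EleutherAI's lm-evaluation-harness; a sketch of how one might try to reproduce them (the task variants and default settings here are our guesses, not the card's):

```python
# Sketch of re-running the benchmarks with lm-evaluation-harness
# (pip install lm-eval). Whether the card's numbers used this harness,
# or these exact task variants and few-shot counts, is an assumption.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=fariasultana/MiniMind,trust_remote_code=True",
    tasks=["hellaswag", "arc_challenge", "mmlu", "truthfulqa_mc2", "winogrande"],
)
print(results["results"])
```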
## Export Formats

### GGUF (llama.cpp)

```bash
python -m scripts.export --model max2-nano --format gguf --output model.gguf
```

### ONNX

```bash
python -m scripts.export --model max2-nano --format onnx --output model.onnx
```

### Android Deployment

```bash
python -m scripts.export --model max2-nano --format android --output ./android_export
```
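After an ONNX export it is worth confirming the artifact loads and runs outside PyTorch. A minimal onnxruntime smoke test (the exporter's input names, and whether it also requires an attention mask, are unknown here, so the input name is read from the session):

```python
# Smoke-test the exported graph with onnxruntime (pip install onnxruntime).
# If the export requires more inputs (e.g. an attention mask), feed each
# entry reported by session.get_inputs().
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
dummy_ids = np.array([[1, 2, 3, 4]], dtype=np.int64)  # placeholder token ids
outputs = session.run(None, {input_name: dummy_ids})
print(outputs[0].shape)  # expect (1, 4, 102400) given the card's vocab size
```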
## Citation

```bibtex
@misc{minimind-max2-2024,
  title={MiniMind Max2: Efficient Language Models for Edge Deployment},
  author={Matrix Agent},
  year={2024},
  howpublished={\url{https://huggingface.co/fariasultana/MiniMind}}
}
```
## Related Papers

- [MiniMax-01: Scaling Foundation Models with Lightning Attention](https://arxiv.org/abs/2504.07164)
- [Efficient Sparse Attention Mechanisms](https://arxiv.org/abs/2509.06501)
- [Optimizing MoE for Edge Deployment](https://arxiv.org/abs/2509.13160)
## License

Apache 2.0 - See [LICENSE](LICENSE) for details.

---

<div align="center">
<b>Built with efficiency in mind for the edge AI revolution</b>
</div>