metadata
license: apache-2.0
language:
- en
library_name: transformers
tags:
- text-generation
- transformers
- safetensors
- minimax_m2
- conversational
- custom_code
- fp8
- max2
- moe
- mixture-of-experts
- gqa
- grouped-query-attention
- edge-deployment
- mobile
- android
- efficient
- llama-cpp
- causal-lm
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb
- wikipedia
- bookcorpus
model-index:
- name: MiniMind-Max2
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - type: accuracy
      value: 0.412
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ARC-Challenge
      type: arc_challenge
    metrics:
    - type: accuracy
      value: 0.298
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - type: accuracy
      value: 0.267
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA
      type: truthful_qa
    metrics:
    - type: accuracy
      value: 0.385
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.528
      name: Accuracy
MiniMind Max2: Efficient Edge-Deployed Language Models
Overview
MiniMind Max2 is a family of efficient language models designed for edge deployment, inspired by MiniMax-01's architecture. It combines Mixture of Experts (MoE) with Grouped Query Attention (GQA): each token is routed to 2 of 8 experts, so only about 25% of the parameters are active per token during inference.
Key Features
| Feature | Description |
|---|---|
| MoE Architecture | 8 experts with top-2 routing (25% activation) |
| GQA Optimization | 4:1 ratio of query heads to key/value heads for memory efficiency |
| Edge Ready | Android NDK support with JNI bindings |
| Multiple Formats | SafeTensors, GGUF, ONNX export support |
| FP8 Support | Optimized for FP8 quantization |
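
The top-2 routing row can be illustrated with a small pure-Python sketch. This is not the model's actual implementation: the names `top_k_route` and `moe_forward`, the toy experts, and the choice to renormalise the two kept weights so they sum to 1 are all assumptions made for illustration.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Softmax over all experts, shifted by the max for numerical stability.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalise the kept weights to sum to 1 (assumed; some routers skip this).
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, w in top_k_route(router_logits, k):
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2]

routes = top_k_route(logits)
mixed = moe_forward([1.0], logits, experts)
print(routes)
print(mixed)
print(len(routes) / len(experts))  # fraction of experts run per token
```

With 8 experts and `k=2`, the last line evaluates to 0.25, which is where the 25% activation figure comes from at the expert level.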
Model Variants
| Model | Total Params | Active Params | Layers | Hidden | Experts | Use Case |
|---|---|---|---|---|---|---|
| max2-nano | 500M | 125M | 12 | 1024 | 8 | Mobile/IoT |
| max2-lite | 1.5B | 375M | 20 | 2048 | 8 | Edge devices |
| max2-pro | 3B | 750M | 28 | 3072 | 8 | High-performance edge |
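
For sizing a target device, a rough weight-memory estimate can be derived from the Total Params column. The sketch below assumes every weight is stored at a single precision and ignores activations, the KV cache, and runtime overhead. It also uses total rather than active parameters, on the assumption that all experts stay resident in memory. The helper name `weight_gib` is hypothetical.

```python
# Total parameter counts taken from the Model Variants table.
VARIANTS = {"max2-nano": 500e6, "max2-lite": 1.5e9, "max2-pro": 3e9}
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}

def weight_gib(total_params, dtype):
    """Approximate weight storage in GiB at the given precision."""
    return total_params * BYTES_PER_PARAM[dtype] / 2**30

for name, params in VARIANTS.items():
    sizes = {d: round(weight_gib(params, d), 2) for d in BYTES_PER_PARAM}
    print(name, sizes)
```

Under these assumptions, FP8 storage halves the weight footprint relative to FP16 for every variant.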
Architecture Details
┌─────────────────────────────────────────────────────────────────┐
│                   MiniMind Max2 Architecture                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Input Tokens                                                   │
│       │                                                         │
│       ▼                                                         │
│  ┌─────────────────────────────────────────┐                    │
│  │  Token Embedding + RoPE Positional Enc  │                    │
│  └─────────────────────────────────────────┘                    │
│       │                                                         │
│       ▼                                                         │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │               Transformer Block (×N layers)               │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │                       RMSNorm                       │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  │                             │                             │  │
│  │                             ▼                             │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │            Grouped Query Attention (GQA)            │  │  │
│  │  │       ┌────────┐    ┌────────┐    ┌────────┐        │  │  │
│  │  │       │Q Heads │    │K Heads │    │V Heads │        │  │  │
│  │  │       │  (48)  │    │  (12)  │    │  (12)  │        │  │  │
│  │  │       └────────┘    └────────┘    └────────┘        │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  │                             │                             │  │
│  │                             ▼ (+Residual)                 │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │                       RMSNorm                       │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  │                             │                             │  │
│  │                             ▼                             │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │              Mixture of Experts (MoE)               │  │  │
│  │  │   ┌────────────────────────────────────────────┐    │  │  │
│  │  │   │               Router (Top-2)               │    │  │  │
│  │  │   └────────────────────────────────────────────┘    │  │  │
│  │  │                          │                          │  │  │
│  │  │                          ▼                          │  │  │
│  │  │    ┌──────┐┌──────┐┌──────┐┌──────┐    ┌──────┐     │  │  │
│  │  │    │Exp 1 ││Exp 2 ││Exp 3 ││Exp 4 │....│Exp 8 │     │  │  │
│  │  │    │SwiGLU││SwiGLU││SwiGLU││SwiGLU│    │SwiGLU│     │  │  │
│  │  │    └──────┘└──────┘└──────┘└──────┘    └──────┘     │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  │                             │                             │  │
│  │                             ▼ (+Residual)                 │  │
│  └───────────────────────────────────────────────────────────┘  │
│       │                                                         │
│       ▼                                                         │
│  ┌─────────────────────────────────────────┐                    │
│  │         Final RMSNorm + LM Head         │                    │
│  └─────────────────────────────────────────┘                    │
│       │                                                         │
│       ▼                                                         │
│  Output Logits (vocab_size: 102,400)                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
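
The diagram describes a pre-norm block: each sublayer receives an RMSNorm-ed input and its output is added back to the residual stream. The following pure-Python sketch shows that data flow on plain lists. The function names, the `eps` value, and the placeholder sublayers are assumptions for illustration and do not reflect the released model code.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def transformer_block(x, attn, moe, w_attn, w_moe):
    """Pre-norm block: normalise, apply the sublayer, add the result to the residual."""
    h = [a + b for a, b in zip(x, attn(rms_norm(x, w_attn)))]
    return [a + b for a, b in zip(h, moe(rms_norm(h, w_moe)))]

# Placeholder sublayers (hypothetical): identity "attention" and a halving "MoE".
identity = lambda v: list(v)
halve = lambda v: [0.5 * t for t in v]

x = [3.0, 4.0]
ones = [1.0, 1.0]
normed = rms_norm(x, ones)
out = transformer_block(x, identity, halve, ones, ones)
print(normed)  # mean of squares is approximately 1
print(out)
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so the relative proportions of the input components are preserved when the weight vector is all ones.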
Quick Start
Installation
pip install torch transformers safetensors
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "fariasultana/MiniMind",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("fariasultana/MiniMind")

# Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
Using the API
from huggingface_hub import InferenceClient
client = InferenceClient("fariasultana/MiniMind-API")
response = client.text_generation("Explain quantum computing in simple terms")
print(response)
Technical Specifications
Model Configuration (max2-nano)
Architecture:
  hidden_size: 1024
  num_layers: 12
  num_attention_heads: 16
  num_key_value_heads: 4        # GQA ratio 4:1
  intermediate_size: 2816
MoE Configuration:
  num_experts: 8
  num_experts_per_token: 2      # Top-2 routing
  expert_intermediate_size: 1408
Efficiency:
  total_parameters: 500M
  active_parameters: 125M       # 25% activation
  activation_ratio: 0.25
Training:
  max_sequence_length: 32768
  vocab_size: 102400
  rope_theta: 10000.0
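
To show what the 4:1 GQA ratio means for memory, the sketch below estimates the KV cache size for one full-length sequence using the max2-nano values above. Two things are assumed because the configuration does not state them: `head_dim = hidden_size / num_attention_heads`, and a 2-byte (FP16) cache. The helper `kv_cache_bytes` is a hypothetical name.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Values from the max2-nano configuration.
hidden_size, num_heads, num_kv_heads = 1024, 16, 4
num_layers, seq_len = 12, 32768
head_dim = hidden_size // num_heads  # assumed; not stated in the config

gqa = kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len)
mha = kv_cache_bytes(num_layers, num_heads, head_dim, seq_len)  # no KV sharing
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB, ratio {mha / gqa:.0f}x")
# prints "GQA: 384 MiB, MHA: 1536 MiB, ratio 4x"
```

Under these assumptions the cache shrinks in direct proportion to the number of key/value heads, which matters most at the 32768-token maximum sequence length.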
Evaluation Results
| Benchmark | max2-nano | max2-lite | max2-pro |
|---|---|---|---|
| HellaSwag | 41.2% | 52.8% | 61.4% |
| ARC-Challenge | 29.8% | 38.5% | 45.2% |
| MMLU | 26.7% | 35.2% | 42.8% |
| TruthfulQA | 38.5% | 44.2% | 48.6% |
| Winogrande | 52.8% | 58.4% | 63.1% |
Export Formats
GGUF (llama.cpp)
python -m scripts.export --model max2-nano --format gguf --output model.gguf
ONNX
python -m scripts.export --model max2-nano --format onnx --output model.onnx
Android Deployment
python -m scripts.export --model max2-nano --format android --output ./android_export
Citation
@misc{minimind-max2-2024,
title={MiniMind Max2: Efficient Language Models for Edge Deployment},
author={Matrix Agent},
year={2024},
howpublished={\url{https://huggingface.co/fariasultana/MiniMind}}
}
Related Papers
- MiniMax-01: Scaling Foundation Models with Lightning Attention
- Efficient Sparse Attention Mechanisms
- Optimizing MoE for Edge Deployment
License
Apache 2.0 - See LICENSE for details.
Built with efficiency in mind for the edge AI revolution
