---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- text-generation
- transformers
- safetensors
- minimax_m2
- conversational
- custom_code
- fp8
- max2
- moe
- mixture-of-experts
- gqa
- grouped-query-attention
- edge-deployment
- mobile
- android
- efficient
- llama-cpp
- causal-lm
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb
- wikipedia
- bookcorpus
model-index:
- name: MiniMind-Max2
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - type: accuracy
      value: 0.412
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ARC-Challenge
      type: arc_challenge
    metrics:
    - type: accuracy
      value: 0.298
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - type: accuracy
      value: 0.267
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA
      type: truthful_qa
    metrics:
    - type: accuracy
      value: 0.385
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande
      type: winogrande
    metrics:
    - type: accuracy
      value: 0.528
      name: Accuracy
---
# MiniMind Max2: Efficient Edge-Deployed Language Models

<div align="center">



**Mixture of Experts + Grouped Query Attention for Maximum Efficiency**

[](https://huggingface.co/fariasultana/MiniMind)
[](https://huggingface.co/spaces/fariasultana/MiniMind-API)
[](LICENSE)
[](https://arxiv.org/abs/2504.07164)
[](https://arxiv.org/abs/2509.06501)
[](https://arxiv.org/abs/2509.13160)

</div>
## Overview

MiniMind Max2 is a family of efficient language models designed for edge deployment, inspired by MiniMax-01's architecture. By combining **Mixture of Experts (MoE)** with **Grouped Query Attention (GQA)**, the models deliver strong quality for their size while keeping only 25% of their parameters active per token during inference.
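To make the 25% figure concrete: with top-2 routing over 8 experts, only a quarter of the expert weights run for any given token. A back-of-envelope sketch in Python (it assumes expert FFN weights dominate the parameter budget, which the card does not state explicitly):

```python
# Back-of-envelope check of the 25% activation claim for max2-nano.
# Assumption: expert FFN weights dominate the 500M total; the card only
# publishes the total/active split, not per-component parameter counts.
total_params = 500e6
num_experts, experts_per_token = 8, 2

active_fraction = experts_per_token / num_experts   # 2/8 = 0.25
active_params = total_params * active_fraction      # 125M
print(f"~{active_params/1e6:.0f}M of {total_params/1e6:.0f}M params active "
      f"({active_fraction:.0%})")
```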
### Key Features

| Feature | Description |
|---------|-------------|
| **MoE Architecture** | 8 experts with top-2 routing (25% activation) |
| **GQA Optimization** | 4:1 query-to-KV-head ratio for memory efficiency (see the sketch below the table) |
| **Edge Ready** | Android NDK support with JNI bindings |
| **Multiple Formats** | SafeTensors, GGUF, ONNX export support |
| **FP8 Support** | Optimized for FP8 quantization |
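The 4:1 GQA ratio means four query heads share each key/value head. A minimal PyTorch sketch of the pattern, using the max2-nano head counts (16 query heads, 4 KV heads); the tensor setup is illustrative, not code from this repository:

```python
import torch
import torch.nn.functional as F

# Grouped-query attention at the card's 4:1 ratio (max2-nano head counts).
batch, seq_len, head_dim = 1, 8, 64
num_q_heads, num_kv_heads = 16, 4
group = num_q_heads // num_kv_heads   # 4 query heads per KV head

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Expand K/V so every group of 4 query heads reads the same KV head;
# only the 4-head K/V tensors ever need to be cached.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 8, 64])
```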
## Model Variants

| Model | Total Params | Active Params | Layers | Hidden | Experts | Use Case |
|-------|-------------|---------------|--------|--------|---------|----------|
| **max2-nano** | 500M | 125M | 12 | 1024 | 8 | Mobile/IoT |
| **max2-lite** | 1.5B | 375M | 20 | 2048 | 8 | Edge devices |
| **max2-pro** | 3B | 750M | 28 | 3072 | 8 | High-performance edge |
## Architecture Details

```
┌─────────────────────────────────────────────────────────────────┐
│                   MiniMind Max2 Architecture                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Input Tokens                                                   │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────┐                      │
│  │ Token Embedding + RoPE Positional Enc │                      │
│  └───────────────────────────────────────┘                      │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────────────────────┐      │
│  │             Transformer Block (×N layers)             │      │
│  │  ┌─────────────────────────────────────────────────┐  │      │
│  │  │                     RMSNorm                     │  │      │
│  │  └─────────────────────────────────────────────────┘  │      │
│  │      │                                                │      │
│  │      ▼                                                │      │
│  │  ┌─────────────────────────────────────────────────┐  │      │
│  │  │          Grouped Query Attention (GQA)          │  │      │
│  │  │  ┌────────┐ ┌────────┐ ┌────────┐               │  │      │
│  │  │  │Q Heads │ │K Heads │ │V Heads │               │  │      │
│  │  │  │  (48)  │ │  (12)  │ │  (12)  │               │  │      │
│  │  │  └────────┘ └────────┘ └────────┘               │  │      │
│  │  └─────────────────────────────────────────────────┘  │      │
│  │      │                                                │      │
│  │      ▼ (+Residual)                                    │      │
│  │  ┌─────────────────────────────────────────────────┐  │      │
│  │  │                     RMSNorm                     │  │      │
│  │  └─────────────────────────────────────────────────┘  │      │
│  │      │                                                │      │
│  │      ▼                                                │      │
│  │  ┌─────────────────────────────────────────────────┐  │      │
│  │  │             Mixture of Experts (MoE)            │  │      │
│  │  │  ┌───────────────────────────────────────────┐  │  │      │
│  │  │  │              Router (Top-2)               │  │  │      │
│  │  │  └───────────────────────────────────────────┘  │  │      │
│  │  │      │                                          │  │      │
│  │  │      ▼                                          │  │      │
│  │  │  ┌──────┐┌──────┐┌──────┐┌──────┐    ┌──────┐   │  │      │
│  │  │  │Exp 1 ││Exp 2 ││Exp 3 ││Exp 4 │....│Exp 8 │   │  │      │
│  │  │  │SwiGLU││SwiGLU││SwiGLU││SwiGLU│    │SwiGLU│   │  │      │
│  │  │  └──────┘└──────┘└──────┘└──────┘    └──────┘   │  │      │
│  │  └─────────────────────────────────────────────────┘  │      │
│  │      │                                                │      │
│  │      ▼ (+Residual)                                    │      │
│  └───────────────────────────────────────────────────────┘      │
│      │                                                          │
│      ▼                                                          │
│  ┌───────────────────────────────────────┐                      │
│  │        Final RMSNorm + LM Head        │                      │
│  └───────────────────────────────────────┘                      │
│      │                                                          │
│      ▼                                                          │
│  Output Logits (vocab_size: 102,400)                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
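The MoE block in the diagram boils down to a router picking 2 of 8 SwiGLU experts per token. Here is a minimal PyTorch sketch using the max2-nano dimensions; the class and its internals are illustrative, not taken from the repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Minimal top-2 MoE layer matching the diagram: a linear router
    picks 2 of 8 SwiGLU experts per token. A sketch of the pattern,
    not this repo's implementation."""

    def __init__(self, hidden=1024, expert_inner=1408, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.top_k = top_k
        # One SwiGLU expert = gate, up and down projections.
        self.gate = nn.ModuleList([nn.Linear(hidden, expert_inner, bias=False) for _ in range(num_experts)])
        self.up = nn.ModuleList([nn.Linear(hidden, expert_inner, bias=False) for _ in range(num_experts)])
        self.down = nn.ModuleList([nn.Linear(expert_inner, hidden, bias=False) for _ in range(num_experts)])

    def forward(self, x):  # x: (tokens, hidden)
        logits = self.router(x)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for e in range(len(self.gate)):
            rows, slots = torch.where(idx == e)  # tokens routed to expert e
            if rows.numel() == 0:
                continue
            h = x[rows]
            h = self.down[e](F.silu(self.gate[e](h)) * self.up[e](h))  # SwiGLU
            out[rows] += weights[rows, slots, None] * h
        return out

moe = Top2MoE()
print(moe(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```

Each token's output is a convex combination of exactly two expert outputs, which is where the 25% activation figure comes from.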
## Quick Start

### Installation

```bash
pip install torch transformers safetensors
```
### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "fariasultana/MiniMind",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("fariasultana/MiniMind")

# Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
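The `conversational` tag suggests the tokenizer ships a chat template; if it does, multi-turn prompts go through the standard transformers `apply_chat_template` API. Continuing from the snippet above (whether this repo actually includes a template is an assumption on our part):

```python
# Multi-turn usage, assuming the tokenizer defines a chat template
# (implied by the "conversational" tag but not verified here).
messages = [
    {"role": "user", "content": "Explain MoE routing in one sentence."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```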
### Using the API

```python
from huggingface_hub import InferenceClient

client = InferenceClient("fariasultana/MiniMind-API")
response = client.text_generation("Explain quantum computing in simple terms")
print(response)
```
## Technical Specifications

### Model Configuration (max2-nano)

```yaml
Architecture:
  hidden_size: 1024
  num_layers: 12
  num_attention_heads: 16
  num_key_value_heads: 4        # GQA ratio 4:1
  intermediate_size: 2816

MoE Configuration:
  num_experts: 8
  num_experts_per_token: 2      # Top-2 routing
  expert_intermediate_size: 1408

Efficiency:
  total_parameters: 500M
  active_parameters: 125M       # 25% activation
  activation_ratio: 0.25

Training:
  max_sequence_length: 32768
  vocab_size: 102400
  rope_theta: 10000.0
```
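A practical payoff of the 4:1 GQA ratio in this config is a 4× smaller KV cache. Quick arithmetic from the numbers above (an fp16 cache is assumed; head_dim is derived as hidden_size / num_attention_heads):

```python
# KV-cache footprint per token for max2-nano, from the config above.
# Assumes an fp16 cache (2 bytes per element).
hidden_size, num_layers = 1024, 12
num_heads, num_kv_heads = 16, 4
head_dim = hidden_size // num_heads   # 64
bytes_per_elem = 2

gqa = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem   # K and V
mha = 2 * num_layers * num_heads * head_dim * bytes_per_elem
print(f"GQA: {gqa/1024:.0f} KiB/token vs full MHA: {mha/1024:.0f} KiB/token")
# GQA: 12 KiB/token vs full MHA: 48 KiB/token
```

At the full 32,768-token context that is roughly 384 MiB of cache instead of 1.5 GiB.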
## Evaluation Results

| Benchmark | max2-nano | max2-lite | max2-pro |
|-----------|-----------|-----------|----------|
| HellaSwag | 41.2% | 52.8% | 61.4% |
| ARC-Challenge | 29.8% | 38.5% | 45.2% |
| MMLU | 26.7% | 35.2% | 42.8% |
| TruthfulQA | 38.5% | 44.2% | 48.6% |
| Winogrande | 52.8% | 58.4% | 63.1% |
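The card does not state the evaluation harness or few-shot settings. Scores like these are commonly produced with EleutherAI's lm-evaluation-harness; a sketch of how one might try to reproduce them (the task variants and default settings here are our guesses, not the card's):

```python
# Sketch of re-running the benchmarks with lm-evaluation-harness
# (pip install lm-eval). Whether the card's numbers used this harness,
# or these exact task variants and few-shot counts, is an assumption.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=fariasultana/MiniMind,trust_remote_code=True",
    tasks=["hellaswag", "arc_challenge", "mmlu", "truthfulqa_mc2", "winogrande"],
)
print(results["results"])
```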
## Export Formats

### GGUF (llama.cpp)

```bash
python -m scripts.export --model max2-nano --format gguf --output model.gguf
```

### ONNX

```bash
python -m scripts.export --model max2-nano --format onnx --output model.onnx
```

### Android Deployment

```bash
python -m scripts.export --model max2-nano --format android --output ./android_export
```
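After an ONNX export it is worth confirming the artifact loads and runs outside PyTorch. A minimal onnxruntime smoke test (the exporter's input names, and whether it also requires an attention mask, are unknown here, so the input name is read from the session):

```python
# Smoke-test the exported graph with onnxruntime (pip install onnxruntime).
# If the export requires more inputs (e.g. an attention mask), feed each
# entry reported by session.get_inputs().
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
dummy_ids = np.array([[1, 2, 3, 4]], dtype=np.int64)  # placeholder token ids
outputs = session.run(None, {input_name: dummy_ids})
print(outputs[0].shape)  # expect (1, 4, 102400) given the card's vocab size
```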
## Citation

```bibtex
@misc{minimind-max2-2024,
  title={MiniMind Max2: Efficient Language Models for Edge Deployment},
  author={Matrix Agent},
  year={2024},
  howpublished={\url{https://huggingface.co/fariasultana/MiniMind}}
}
```
## Related Papers

- [MiniMax-01: Scaling Foundation Models with Lightning Attention](https://arxiv.org/abs/2504.07164)
- [Efficient Sparse Attention Mechanisms](https://arxiv.org/abs/2509.06501)
- [Optimizing MoE for Edge Deployment](https://arxiv.org/abs/2509.13160)
## License

Apache 2.0 - See [LICENSE](LICENSE) for details.

---

<div align="center">
<b>Built with efficiency in mind for the edge AI revolution</b>
</div>