# MiniMind Android Deployment Guide
Deploy MiniMind (Mind2) models on Android devices using multiple runtime options.
## Deployment Options
Ratings are relative; more stars are better in every column.

| Runtime | Size | Speed | Ease of Use |
|---------|------|-------|-------------|
| **llama.cpp** | β˜…β˜…β˜…β˜…β˜… | β˜…β˜…β˜…β˜…β˜† | β˜…β˜…β˜…β˜…β˜† |
| **ONNX Runtime** | β˜…β˜…β˜…β˜…β˜† | β˜…β˜…β˜…β˜†β˜† | β˜…β˜…β˜…β˜…β˜… |
| **MLC-LLM** | β˜…β˜…β˜…β˜…β˜† | β˜…β˜…β˜…β˜…β˜… | β˜…β˜…β˜…β˜†β˜† |
| **TensorFlow Lite** | β˜…β˜…β˜…β˜…β˜… | β˜…β˜…β˜…β˜†β˜† | β˜…β˜…β˜…β˜…β˜† |
## Quick Start
### Option 1: llama.cpp (Recommended)
```bash
# 1. Export the model to GGUF format
python scripts/export_gguf.py --model mind2-lite --output models/mind2-lite.gguf

# 2. Build llama.cpp for Android (requires the Android NDK; set $NDK to its root)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build-android && cd build-android
cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-28
cmake --build . -j

# 3. Copy the shared library into the Android project
# (the library's output location may vary between llama.cpp versions)
cp libllama.so ../android/app/src/main/jniLibs/arm64-v8a/
```
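Once `libllama.so` is in `jniLibs/`, the app talks to it through a thin Java wrapper. The sketch below shows one plausible shape for that wrapper; the class name matches `Mind2Model.java` from the project structure, but the native method signatures and the chat template are assumptions, not the repo's actual API.

```java
// Hypothetical Java side of the JNI bridge. The native methods would be
// implemented in mind2_jni.cpp and resolve against libllama.so.
class Mind2Model {
    // Call once before any native method; loads libllama.so from jniLibs/.
    static void init() {
        System.loadLibrary("llama");
    }

    // Assumed signatures -- the real mind2_jni.cpp may differ.
    static native long loadModel(String ggufPath);
    static native String generate(long handle, String prompt, int maxTokens);
    static native void free(long handle);

    // Pure-Java helper: wrap a user message in a simple chat template.
    // This template is illustrative, not the model's actual prompt format.
    static String formatPrompt(String userMessage) {
        return "<|user|>\n" + userMessage + "\n<|assistant|>\n";
    }
}
```

A typical call sequence would be `init()`, `loadModel(...)`, one or more `generate(...)` calls, then `free(...)` when the activity is destroyed.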
### Option 2: ONNX Runtime
```bash
# 1. Export model to ONNX
python scripts/export_onnx.py --model mind2-lite --output models/mind2-lite.onnx
```

```groovy
// 2. Add ONNX Runtime to the Android project, in app/build.gradle:
dependencies {
    implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.16.0'
}
```
### Option 3: MLC-LLM
```bash
# 1. Install MLC-LLM
pip install mlc-llm

# 2. Compile model for Android
mlc_llm compile mind2-lite --target android

# 3. Package for deployment
mlc_llm package mind2-lite --target android --output ./android/app/src/main/assets/
```
## Project Structure
```
android/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ src/main/
β”‚   β”‚   β”œβ”€β”€ java/com/minimind/
β”‚   β”‚   β”‚   β”œβ”€β”€ Mind2Model.java      # Model wrapper
β”‚   β”‚   β”‚   β”œβ”€β”€ Mind2Tokenizer.java  # Tokenizer
β”‚   β”‚   β”‚   └── Mind2Chat.java       # Chat interface
β”‚   β”‚   β”œβ”€β”€ jniLibs/
β”‚   β”‚   β”‚   └── arm64-v8a/
β”‚   β”‚   β”‚       └── libllama.so
β”‚   β”‚   └── assets/
β”‚   β”‚       β”œβ”€β”€ mind2-lite.gguf
β”‚   β”‚       └── tokenizer.json
β”‚   └── build.gradle
β”œβ”€β”€ jni/
β”‚   β”œβ”€β”€ mind2_jni.cpp    # JNI bridge
β”‚   └── CMakeLists.txt
└── README.md
```
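`Mind2Chat.java` is listed as the chat interface; on mobile it should stream tokens to the UI as they are generated rather than block until the full reply is ready. The sketch below shows that pattern with an assumed `Backend` interface standing in for the real model binding.

```java
import java.util.function.Consumer;

// Hypothetical streaming chat loop. Backend stands in for whatever
// token-by-token generation API the real Mind2Model exposes.
class Mind2Chat {
    interface Backend {
        String nextToken(); // returns null when generation is finished
    }

    // Forward each token to the UI callback as soon as it arrives,
    // and return the assembled reply at the end.
    static String stream(Backend model, Consumer<String> onToken) {
        StringBuilder reply = new StringBuilder();
        String tok;
        while ((tok = model.nextToken()) != null) {
            onToken.accept(tok); // e.g. append to a TextView on the UI thread
            reply.append(tok);
        }
        return reply.toString();
    }
}
```

In an Android app, `onToken` would post to the main thread (e.g. via `Handler` or a coroutine dispatcher) so the UI updates per token.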
## Memory Requirements
| Model | RAM (INT4) | RAM (FP16) | Storage |
|-------|-----------|-----------|---------|
| mind2-nano | ~400MB | ~800MB | ~300MB |
| mind2-lite | ~1.2GB | ~2.4GB | ~900MB |
| mind2-pro | ~2.4GB | ~4.8GB | ~1.8GB |
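A rough way to sanity-check figures like these against your own model variant: weight memory is roughly `parameters Γ— bits_per_weight / 8`, plus runtime overhead for the KV cache and buffers. The 1.2Γ— overhead factor below is an assumption for illustration, not a measured value.

```java
// Rule-of-thumb RAM estimate: weights = params * bits / 8,
// plus ~20% overhead (KV cache, activations, runtime buffers -- assumed).
class MemoryEstimator {
    static long estimateBytes(long params, int bitsPerWeight) {
        long weights = params * bitsPerWeight / 8;
        return weights + weights / 5; // exact integer 1.2x
    }

    public static void main(String[] args) {
        // e.g. a hypothetical 1B-parameter model quantized to INT4:
        long bytes = estimateBytes(1_000_000_000L, 4);
        System.out.println(bytes / (1024 * 1024) + " MB");
    }
}
```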
## Performance Benchmarks
Tested on common Android devices:
| Device | Model | Tokens/sec |
|--------|-------|-----------|
| Pixel 8 Pro | mind2-nano | 45 |
| Pixel 8 Pro | mind2-lite | 22 |
| Samsung S24 | mind2-nano | 52 |
| Samsung S24 | mind2-lite | 28 |
## Best Practices
1. **Use INT4 quantization** for best size/performance balance
2. **Limit context length** to 512-1024 tokens on mobile
3. **Enable KV-cache** for faster generation
4. **Use streaming** for responsive UI
5. **Handle memory pressure** gracefully (e.g. release the model from `onTrimMemory()` when the app is backgrounded)
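Practice 2 (limiting context length) is usually implemented as a sliding window over the token history, keeping only the most recent tokens so the prompt always fits the mobile context budget. A minimal sketch:

```java
import java.util.List;

// Sliding-window truncation: keep only the last maxContext tokens,
// dropping the oldest history when the budget (e.g. 512-1024) is exceeded.
class ContextWindow {
    static List<Integer> truncate(List<Integer> tokens, int maxContext) {
        if (tokens.size() <= maxContext) {
            return tokens;
        }
        return tokens.subList(tokens.size() - maxContext, tokens.size());
    }
}
```

A fancier variant keeps the system prompt pinned at the front and truncates only the middle of the conversation, but the window above is the simplest correct behavior.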
## Troubleshooting
### Out of Memory
- Switch to a smaller model (mind2-nano instead of mind2-lite)
- Reduce context length
- Enable swap if available
### Slow Inference
- Check CPU governor (set to performance)
- Verify the build enables NEON/ARM64 optimizations
- Consider GPU acceleration (MLC-LLM)
### Model Loading Failed
- Verify GGUF file integrity
- Check storage permissions
- Ensure enough free space
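"Verify GGUF file integrity" can start with a cheap header check: every valid GGUF file begins with the 4-byte ASCII magic `GGUF`, so a truncated download or a file corrupted at the start fails immediately. This only catches header damage, not corruption deeper in the file.

```java
// Quick GGUF sanity check on the first four bytes of the file.
class GgufCheck {
    static boolean looksLikeGguf(byte[] header) {
        return header.length >= 4
                && header[0] == 'G' && header[1] == 'G'
                && header[2] == 'U' && header[3] == 'F';
    }
}
```

In the app, read the first four bytes of the asset with a `FileInputStream` (or `AssetManager.open`) and pass them to `looksLikeGguf` before attempting a full model load.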