# MiniMind Android Deployment Guide
Deploy MiniMind (Mind2) models on Android devices using multiple runtime options.
## Deployment Options
Ratings are relative; more stars are better in every column.

| Runtime | Size | Speed | Ease of Use |
|---------|------|-------|-------------|
| **llama.cpp** | β˜…β˜…β˜…β˜…β˜… | β˜…β˜…β˜…β˜…β˜† | β˜…β˜…β˜…β˜…β˜† |
| **ONNX Runtime** | β˜…β˜…β˜…β˜…β˜† | β˜…β˜…β˜…β˜†β˜† | β˜…β˜…β˜…β˜…β˜… |
| **MLC-LLM** | β˜…β˜…β˜…β˜…β˜† | β˜…β˜…β˜…β˜…β˜… | β˜…β˜…β˜…β˜†β˜† |
| **TensorFlow Lite** | β˜…β˜…β˜…β˜…β˜… | β˜…β˜…β˜…β˜†β˜† | β˜…β˜…β˜…β˜…β˜† |
## Quick Start
### Option 1: llama.cpp (Recommended)
```bash
# 1. Export the model to GGUF format
python scripts/export_gguf.py --model mind2-lite --output models/mind2-lite.gguf

# 2. Build llama.cpp for Android (requires the Android NDK; set $NDK to its root)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build-android && cd build-android
cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-28
cmake --build . -j

# 3. Copy the shared library into the Android project
# (the library's output location may vary between llama.cpp versions)
cp libllama.so ../android/app/src/main/jniLibs/arm64-v8a/
```
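Once `libllama.so` is in `jniLibs/`, the app talks to it through a thin Java wrapper. The sketch below shows one plausible shape for that wrapper; the class name matches `Mind2Model.java` from the project structure, but the native method signatures and the chat template are assumptions, not the repo's actual API.

```java
// Hypothetical Java side of the JNI bridge. The native methods would be
// implemented in mind2_jni.cpp and resolve against libllama.so.
class Mind2Model {
    // Call once before any native method; loads libllama.so from jniLibs/.
    static void init() {
        System.loadLibrary("llama");
    }

    // Assumed signatures -- the real mind2_jni.cpp may differ.
    static native long loadModel(String ggufPath);
    static native String generate(long handle, String prompt, int maxTokens);
    static native void free(long handle);

    // Pure-Java helper: wrap a user message in a simple chat template.
    // This template is illustrative, not the model's actual prompt format.
    static String formatPrompt(String userMessage) {
        return "<|user|>\n" + userMessage + "\n<|assistant|>\n";
    }
}
```

A typical call sequence would be `init()`, `loadModel(...)`, one or more `generate(...)` calls, then `free(...)` when the activity is destroyed.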
### Option 2: ONNX Runtime
```bash
# 1. Export model to ONNX
python scripts/export_onnx.py --model mind2-lite --output models/mind2-lite.onnx
```

```groovy
// 2. Add ONNX Runtime to the Android project, in app/build.gradle:
dependencies {
    implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.16.0'
}
```
### Option 3: MLC-LLM
```bash
# 1. Install MLC-LLM
pip install mlc-llm

# 2. Compile model for Android
mlc_llm compile mind2-lite --target android

# 3. Package for deployment
mlc_llm package mind2-lite --target android --output ./android/app/src/main/assets/
```
## Project Structure
```
android/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ src/main/
β”‚   β”‚   β”œβ”€β”€ java/com/minimind/
β”‚   β”‚   β”‚   β”œβ”€β”€ Mind2Model.java      # Model wrapper
β”‚   β”‚   β”‚   β”œβ”€β”€ Mind2Tokenizer.java  # Tokenizer
β”‚   β”‚   β”‚   └── Mind2Chat.java       # Chat interface
β”‚   β”‚   β”œβ”€β”€ jniLibs/
β”‚   β”‚   β”‚   └── arm64-v8a/
β”‚   β”‚   β”‚       └── libllama.so
β”‚   β”‚   └── assets/
β”‚   β”‚       β”œβ”€β”€ mind2-lite.gguf
β”‚   β”‚       └── tokenizer.json
β”‚   └── build.gradle
β”œβ”€β”€ jni/
β”‚   β”œβ”€β”€ mind2_jni.cpp    # JNI bridge
β”‚   └── CMakeLists.txt
└── README.md
```
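`Mind2Chat.java` is listed as the chat interface; on mobile it should stream tokens to the UI as they are generated rather than block until the full reply is ready. The sketch below shows that pattern with an assumed `Backend` interface standing in for the real model binding.

```java
import java.util.function.Consumer;

// Hypothetical streaming chat loop. Backend stands in for whatever
// token-by-token generation API the real Mind2Model exposes.
class Mind2Chat {
    interface Backend {
        String nextToken(); // returns null when generation is finished
    }

    // Forward each token to the UI callback as soon as it arrives,
    // and return the assembled reply at the end.
    static String stream(Backend model, Consumer<String> onToken) {
        StringBuilder reply = new StringBuilder();
        String tok;
        while ((tok = model.nextToken()) != null) {
            onToken.accept(tok); // e.g. append to a TextView on the UI thread
            reply.append(tok);
        }
        return reply.toString();
    }
}
```

In an Android app, `onToken` would post to the main thread (e.g. via `Handler` or a coroutine dispatcher) so the UI updates per token.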
## Memory Requirements
| Model | RAM (INT4) | RAM (FP16) | Storage |
|-------|-----------|-----------|---------|
| mind2-nano | ~400MB | ~800MB | ~300MB |
| mind2-lite | ~1.2GB | ~2.4GB | ~900MB |
| mind2-pro | ~2.4GB | ~4.8GB | ~1.8GB |
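A rough way to sanity-check figures like these against your own model variant: weight memory is roughly `parameters Γ— bits_per_weight / 8`, plus runtime overhead for the KV cache and buffers. The 1.2Γ— overhead factor below is an assumption for illustration, not a measured value.

```java
// Rule-of-thumb RAM estimate: weights = params * bits / 8,
// plus ~20% overhead (KV cache, activations, runtime buffers -- assumed).
class MemoryEstimator {
    static long estimateBytes(long params, int bitsPerWeight) {
        long weights = params * bitsPerWeight / 8;
        return weights + weights / 5; // exact integer 1.2x
    }

    public static void main(String[] args) {
        // e.g. a hypothetical 1B-parameter model quantized to INT4:
        long bytes = estimateBytes(1_000_000_000L, 4);
        System.out.println(bytes / (1024 * 1024) + " MB");
    }
}
```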
## Performance Benchmarks
Tested on common Android devices:
| Device | Model | Tokens/sec |
|--------|-------|-----------|
| Pixel 8 Pro | mind2-nano | 45 |
| Pixel 8 Pro | mind2-lite | 22 |
| Samsung S24 | mind2-nano | 52 |
| Samsung S24 | mind2-lite | 28 |
## Best Practices
1. **Use INT4 quantization** for best size/performance balance
2. **Limit context length** to 512-1024 tokens on mobile
3. **Enable KV-cache** for faster generation
4. **Use streaming** for responsive UI
5. **Handle memory pressure** gracefully (e.g. release the model from `onTrimMemory()` when the app is backgrounded)
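Practice 2 (limiting context length) is usually implemented as a sliding window over the token history, keeping only the most recent tokens so the prompt always fits the mobile context budget. A minimal sketch:

```java
import java.util.List;

// Sliding-window truncation: keep only the last maxContext tokens,
// dropping the oldest history when the budget (e.g. 512-1024) is exceeded.
class ContextWindow {
    static List<Integer> truncate(List<Integer> tokens, int maxContext) {
        if (tokens.size() <= maxContext) {
            return tokens;
        }
        return tokens.subList(tokens.size() - maxContext, tokens.size());
    }
}
```

A fancier variant keeps the system prompt pinned at the front and truncates only the middle of the conversation, but the window above is the simplest correct behavior.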
## Troubleshooting
### Out of Memory
- Switch to a smaller model (mind2-nano instead of mind2-lite)
- Reduce context length
- Enable swap if available
### Slow Inference
- Check CPU governor (set to performance)
- Verify the build enables NEON/ARM64 optimizations
- Consider GPU acceleration (MLC-LLM)
### Model Loading Failed
- Verify GGUF file integrity
- Check storage permissions
- Ensure enough free space
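"Verify GGUF file integrity" can start with a cheap header check: every valid GGUF file begins with the 4-byte ASCII magic `GGUF`, so a truncated download or a file corrupted at the start fails immediately. This only catches header damage, not corruption deeper in the file.

```java
// Quick GGUF sanity check on the first four bytes of the file.
class GgufCheck {
    static boolean looksLikeGguf(byte[] header) {
        return header.length >= 4
                && header[0] == 'G' && header[1] == 'G'
                && header[2] == 'U' && header[3] == 'F';
    }
}
```

In the app, read the first four bytes of the asset with a `FileInputStream` (or `AssetManager.open`) and pass them to `looksLikeGguf` before attempting a full model load.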