# MiniMind Android Deployment Guide

Deploy MiniMind (Mind2) models on Android devices using multiple runtime options.
## Deployment Options

| Runtime | Size | Speed | Ease of Use |
|---------|------|-------|-------------|
| llama.cpp | ★★★★★ | ★★★★★ | ★★★★★ |
| ONNX Runtime | ★★★★★ | ★★★☆☆ | ★★★★★ |
| MLC-LLM | ★★★★★ | ★★★★★ | ★★★☆☆ |
| TensorFlow Lite | ★★★★★ | ★★★☆☆ | ★★★★★ |
## Quick Start

### Option 1: llama.cpp (Recommended)
```bash
# Export the model to GGUF
python scripts/export_gguf.py --model mind2-lite --output models/mind2-lite.gguf

# Build llama.cpp for Android with the NDK toolchain
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build-android && cd build-android
cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-28
make -j

# Copy the shared library into the app's JNI directory
cp libllama.so ../android/app/src/main/jniLibs/arm64-v8a/
```
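With `libllama.so` in place, the Java side talks to it through the JNI bridge (`mind2_jni.cpp` in the project structure below). A minimal sketch of what the `Mind2Model` wrapper could look like; the native method names here are illustrative and must match whatever you export from the bridge:

```java
package com.minimind;

// Minimal wrapper around the native llama.cpp bridge.
// The native method signatures are hypothetical; they must match
// the functions exported from mind2_jni.cpp.
public class Mind2Model {
    static {
        System.loadLibrary("llama"); // loads jniLibs/arm64-v8a/libllama.so
    }

    private long ctxHandle; // opaque pointer to the native llama context

    public native long nativeLoad(String modelPath, int nCtx);
    public native String nativeGenerate(long ctx, String prompt, int maxTokens);
    public native void nativeFree(long ctx);

    public void load(String modelPath) {
        ctxHandle = nativeLoad(modelPath, 1024);
    }

    public String generate(String prompt) {
        return nativeGenerate(ctxHandle, prompt, 256);
    }

    public void close() {
        nativeFree(ctxHandle);
    }
}
```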
### Option 2: ONNX Runtime
```bash
python scripts/export_onnx.py --model mind2-lite --output models/mind2-lite.onnx
```

Add the Android package in `app/build.gradle`:

```gradle
dependencies {
    implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.16.0'
}
```
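With the dependency in place, inference goes through the `ai.onnxruntime` API. A rough sketch, assuming the exported graph exposes an `input_ids` input and a logits output (check the names `export_onnx.py` actually produces):

```java
import ai.onnxruntime.*;
import java.util.Collections;

public class OnnxDemo {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // On Android, copy the model out of assets first; a plain path is used here.
        OrtSession session = env.createSession("mind2-lite.onnx",
                new OrtSession.SessionOptions());

        // "input_ids" and the token values are placeholders; check the
        // exported model's actual input names and your tokenizer's output.
        long[][] inputIds = {{1, 15043, 3186}};
        try (OnnxTensor input = OnnxTensor.createTensor(env, inputIds);
             OrtSession.Result result =
                     session.run(Collections.singletonMap("input_ids", input))) {
            // Assumes a [batch, seq, vocab] float logits output.
            float[][][] logits = (float[][][]) result.get(0).getValue();
            System.out.println("Vocab size: " + logits[0][0].length);
        }
    }
}
```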
### Option 3: MLC-LLM
```bash
pip install mlc-llm
mlc_llm compile mind2-lite --target android
mlc_llm package mind2-lite --target android --output ./android/app/src/main/assets/
```
## Project Structure
```
android/
├── app/
│   ├── src/main/
│   │   ├── java/com/minimind/
│   │   │   ├── Mind2Model.java       # Model wrapper
│   │   │   ├── Mind2Tokenizer.java   # Tokenizer
│   │   │   └── Mind2Chat.java        # Chat interface
│   │   ├── jniLibs/
│   │   │   └── arm64-v8a/
│   │   │       └── libllama.so
│   │   └── assets/
│   │       ├── mind2-lite.gguf
│   │       └── tokenizer.json
│   └── build.gradle
├── jni/
│   ├── mind2_jni.cpp                 # JNI bridge
│   └── CMakeLists.txt
└── README.md
```
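The model ships in `assets/`, but native loaders need a real file path, so copy it to internal storage on first launch. A minimal sketch using standard Android APIs; the `AssetUtil` helper is just an example name:

```java
import android.content.Context;
import java.io.*;

public final class AssetUtil {
    // Copies an asset (e.g. "mind2-lite.gguf") to the app's files dir
    // and returns its absolute path for the native loader.
    public static String copyAssetToFiles(Context context, String assetName)
            throws IOException {
        File outFile = new File(context.getFilesDir(), assetName);
        if (!outFile.exists()) { // skip if already copied on a previous launch
            try (InputStream in = context.getAssets().open(assetName);
                 OutputStream out = new FileOutputStream(outFile)) {
                byte[] buf = new byte[1 << 16];
                int n;
                while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
            }
        }
        return outFile.getAbsolutePath();
    }
}
```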
## Memory Requirements

| Model | RAM (INT4) | RAM (FP16) | Storage |
|-------|------------|------------|---------|
| mind2-nano | ~400MB | ~800MB | ~300MB |
| mind2-lite | ~1.2GB | ~2.4GB | ~900MB |
| mind2-pro | ~2.4GB | ~4.8GB | ~1.8GB |
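Available RAM varies widely across devices, so it can pay to pick the model tier at runtime. A sketch using Android's `ActivityManager`, with thresholds derived from the INT4 column above plus headroom:

```java
import android.app.ActivityManager;
import android.content.Context;

public final class ModelSelector {
    // Picks a model tier based on currently available RAM, using the
    // INT4 figures from the table above plus working headroom.
    public static String pickModel(Context context) {
        ActivityManager am =
                (ActivityManager) context.getSystemService(Context.ACTIVITY_SERVICE);
        ActivityManager.MemoryInfo info = new ActivityManager.MemoryInfo();
        am.getMemoryInfo(info);
        long availMb = info.availMem / (1024 * 1024);
        if (availMb > 4000) return "mind2-pro";   // ~2.4GB INT4
        if (availMb > 2000) return "mind2-lite";  // ~1.2GB INT4
        return "mind2-nano";                      // ~400MB INT4
    }
}
```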
## Performance Benchmarks

Tested on common Android devices:

| Device | Model | Tokens/sec |
|--------|-------|------------|
| Pixel 8 Pro | mind2-nano | 45 |
| Pixel 8 Pro | mind2-lite | 22 |
| Samsung S24 | mind2-nano | 52 |
| Samsung S24 | mind2-lite | 28 |
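To reproduce numbers like these on your own hardware, time a fixed generation and divide by the token count. A sketch, assuming a hypothetical `generateTokens` call on the wrapper from Option 1:

```java
// Rough tokens/sec measurement; generateTokens() is a stand-in for
// whatever generation call your wrapper actually exposes.
static double tokensPerSecond(Mind2Model model, String prompt, int maxTokens) {
    long start = System.nanoTime();
    int produced = model.generateTokens(prompt, maxTokens); // hypothetical API
    double seconds = (System.nanoTime() - start) / 1e9;
    return produced / seconds;
}
```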
## Best Practices

- Use INT4 quantization for the best size/performance balance
- Limit context length to 512-1024 tokens on mobile
- Enable the KV-cache for faster generation
- Use streaming for a responsive UI (see the sketch after this list)
- Handle memory pressure gracefully
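For the streaming point, the usual pattern is to run generation on a background thread and post each token to the main thread. A minimal sketch; `generateStreaming` is a hypothetical callback-based method on the model wrapper:

```java
import android.os.Handler;
import android.os.Looper;
import android.widget.TextView;

// Generates off the main thread and appends tokens to the UI as they
// arrive, keeping the app responsive during long generations.
// generateStreaming(String, java.util.function.Consumer<String>) is
// a hypothetical API on the Mind2Model wrapper.
void streamInto(Mind2Model model, String prompt, TextView chatView) {
    Handler mainHandler = new Handler(Looper.getMainLooper());
    new Thread(() ->
        model.generateStreaming(prompt, token ->
            mainHandler.post(() -> chatView.append(token)))
    ).start();
}
```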
## Troubleshooting

### Out of Memory

- Use a smaller model (nano instead of lite)
- Reduce the context length
- Enable swap if available
### Slow Inference

- Check the CPU governor (set it to `performance`)
- Ensure NEON/ARM optimizations are enabled
- Consider GPU acceleration (MLC-LLM)
### Model Loading Failed

- Verify GGUF file integrity (see the check below)
- Check storage permissions
- Ensure there is enough free space
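For the integrity check: every valid GGUF file begins with the 4-byte magic `GGUF`, so a quick header test catches truncated or corrupted files before the native loader crashes:

```java
import java.io.FileInputStream;
import java.io.IOException;

public final class GgufCheck {
    // Returns true if the file starts with the GGUF magic bytes.
    public static boolean looksLikeGguf(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] magic = new byte[4];
            return in.read(magic) == 4
                && magic[0] == 'G' && magic[1] == 'G'
                && magic[2] == 'U' && magic[3] == 'F';
        }
    }
}
```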