# MiniMind Android Deployment Guide

Deploy MiniMind (Mind2) models on Android devices using multiple runtime options.
## Deployment Options

| Runtime | Size | Speed | Ease of Use |
|---------|------|-------|-------------|
| llama.cpp | ★★★★★ | ★★★★★ | ★★★★★ |
| ONNX Runtime | ★★★★★ | ★★★☆☆ | ★★★★★ |
| MLC-LLM | ★★★★★ | ★★★★★ | ★★★☆☆ |
| TensorFlow Lite | ★★★★★ | ★★★☆☆ | ★★★★★ |
## Quick Start

### Option 1: llama.cpp (Recommended)
```bash
# Export the model to GGUF
python scripts/export_gguf.py --model mind2-lite --output models/mind2-lite.gguf

# Build llama.cpp for Android with the NDK toolchain
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build-android && cd build-android
cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-28
make -j

# Copy the shared library into the app's JNI directory
cp libllama.so ../android/app/src/main/jniLibs/arm64-v8a/
```
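With `libllama.so` in place, the Java side talks to it through the JNI bridge (`mind2_jni.cpp` in the project structure below). A minimal sketch of what the `Mind2Model` wrapper could look like; the native method names here are illustrative and must match whatever you export from the bridge:

```java
package com.minimind;

// Minimal wrapper around the native llama.cpp bridge.
// The native method signatures are hypothetical; they must match
// the functions exported from mind2_jni.cpp.
public class Mind2Model {
    static {
        System.loadLibrary("llama"); // loads jniLibs/arm64-v8a/libllama.so
    }

    private long ctxHandle; // opaque pointer to the native llama context

    public native long nativeLoad(String modelPath, int nCtx);
    public native String nativeGenerate(long ctx, String prompt, int maxTokens);
    public native void nativeFree(long ctx);

    public void load(String modelPath) {
        ctxHandle = nativeLoad(modelPath, 1024);
    }

    public String generate(String prompt) {
        return nativeGenerate(ctxHandle, prompt, 256);
    }

    public void close() {
        nativeFree(ctxHandle);
    }
}
```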
### Option 2: ONNX Runtime
```bash
python scripts/export_onnx.py --model mind2-lite --output models/mind2-lite.onnx
```

Add the Android package in `app/build.gradle`:

```gradle
dependencies {
    implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.16.0'
}
```
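With the dependency in place, inference goes through the `ai.onnxruntime` API. A rough sketch, assuming the exported graph exposes an `input_ids` input and a logits output (check the names `export_onnx.py` actually produces):

```java
import ai.onnxruntime.*;
import java.util.Collections;

public class OnnxDemo {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // On Android, copy the model out of assets first; a plain path is used here.
        OrtSession session = env.createSession("mind2-lite.onnx",
                new OrtSession.SessionOptions());

        // "input_ids" and the token values are placeholders; check the
        // exported model's actual input names and your tokenizer's output.
        long[][] inputIds = {{1, 15043, 3186}};
        try (OnnxTensor input = OnnxTensor.createTensor(env, inputIds);
             OrtSession.Result result =
                     session.run(Collections.singletonMap("input_ids", input))) {
            // Assumes a [batch, seq, vocab] float logits output.
            float[][][] logits = (float[][][]) result.get(0).getValue();
            System.out.println("Vocab size: " + logits[0][0].length);
        }
    }
}
```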
### Option 3: MLC-LLM
```bash
pip install mlc-llm
mlc_llm compile mind2-lite --target android
mlc_llm package mind2-lite --target android --output ./android/app/src/main/assets/
```
## Project Structure
```
android/
├── app/
│   ├── src/main/
│   │   ├── java/com/minimind/
│   │   │   ├── Mind2Model.java       # Model wrapper
│   │   │   ├── Mind2Tokenizer.java   # Tokenizer
│   │   │   └── Mind2Chat.java        # Chat interface
│   │   ├── jniLibs/
│   │   │   └── arm64-v8a/
│   │   │       └── libllama.so
│   │   └── assets/
│   │       ├── mind2-lite.gguf
│   │       └── tokenizer.json
│   └── build.gradle
├── jni/
│   ├── mind2_jni.cpp                 # JNI bridge
│   └── CMakeLists.txt
└── README.md
```
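The model ships in `assets/`, but native loaders need a real file path, so copy it to internal storage on first launch. A minimal sketch using standard Android APIs; the `AssetUtil` helper is just an example name:

```java
import android.content.Context;
import java.io.*;

public final class AssetUtil {
    // Copies an asset (e.g. "mind2-lite.gguf") to the app's files dir
    // and returns its absolute path for the native loader.
    public static String copyAssetToFiles(Context context, String assetName)
            throws IOException {
        File outFile = new File(context.getFilesDir(), assetName);
        if (!outFile.exists()) { // skip if already copied on a previous launch
            try (InputStream in = context.getAssets().open(assetName);
                 OutputStream out = new FileOutputStream(outFile)) {
                byte[] buf = new byte[1 << 16];
                int n;
                while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
            }
        }
        return outFile.getAbsolutePath();
    }
}
```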
## Memory Requirements

| Model | RAM (INT4) | RAM (FP16) | Storage |
|-------|------------|------------|---------|
| mind2-nano | ~400MB | ~800MB | ~300MB |
| mind2-lite | ~1.2GB | ~2.4GB | ~900MB |
| mind2-pro | ~2.4GB | ~4.8GB | ~1.8GB |
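Available RAM varies widely across devices, so it can pay to pick the model tier at runtime. A sketch using Android's `ActivityManager`, with thresholds derived from the INT4 column above plus headroom:

```java
import android.app.ActivityManager;
import android.content.Context;

public final class ModelSelector {
    // Picks a model tier based on currently available RAM, using the
    // INT4 figures from the table above plus working headroom.
    public static String pickModel(Context context) {
        ActivityManager am =
                (ActivityManager) context.getSystemService(Context.ACTIVITY_SERVICE);
        ActivityManager.MemoryInfo info = new ActivityManager.MemoryInfo();
        am.getMemoryInfo(info);
        long availMb = info.availMem / (1024 * 1024);
        if (availMb > 4000) return "mind2-pro";   // ~2.4GB INT4
        if (availMb > 2000) return "mind2-lite";  // ~1.2GB INT4
        return "mind2-nano";                      // ~400MB INT4
    }
}
```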
## Performance Benchmarks

Tested on common Android devices:

| Device | Model | Tokens/sec |
|--------|-------|------------|
| Pixel 8 Pro | mind2-nano | 45 |
| Pixel 8 Pro | mind2-lite | 22 |
| Samsung S24 | mind2-nano | 52 |
| Samsung S24 | mind2-lite | 28 |
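To reproduce numbers like these on your own hardware, time a fixed generation and divide by the token count. A sketch, assuming a hypothetical `generateTokens` call on the wrapper from Option 1:

```java
// Rough tokens/sec measurement; generateTokens() is a stand-in for
// whatever generation call your wrapper actually exposes.
static double tokensPerSecond(Mind2Model model, String prompt, int maxTokens) {
    long start = System.nanoTime();
    int produced = model.generateTokens(prompt, maxTokens); // hypothetical API
    double seconds = (System.nanoTime() - start) / 1e9;
    return produced / seconds;
}
```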
## Best Practices

- Use INT4 quantization for the best size/performance balance
- Limit context length to 512-1024 tokens on mobile
- Enable the KV-cache for faster generation
- Use streaming for a responsive UI (see the sketch after this list)
- Handle memory pressure gracefully
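For the streaming point, the usual pattern is to run generation on a background thread and post each token to the main thread. A minimal sketch; `generateStreaming` is a hypothetical callback-based method on the model wrapper:

```java
import android.os.Handler;
import android.os.Looper;
import android.widget.TextView;

// Generates off the main thread and appends tokens to the UI as they
// arrive, keeping the app responsive during long generations.
// generateStreaming(String, java.util.function.Consumer<String>) is
// a hypothetical API on the Mind2Model wrapper.
void streamInto(Mind2Model model, String prompt, TextView chatView) {
    Handler mainHandler = new Handler(Looper.getMainLooper());
    new Thread(() ->
        model.generateStreaming(prompt, token ->
            mainHandler.post(() -> chatView.append(token)))
    ).start();
}
```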
## Troubleshooting

### Out of Memory

- Use a smaller model (nano instead of lite)
- Reduce the context length
- Enable swap if available
### Slow Inference

- Check the CPU governor (set it to `performance`)
- Ensure NEON/ARM optimizations are enabled
- Consider GPU acceleration (MLC-LLM)
### Model Loading Failed

- Verify GGUF file integrity (see the check below)
- Check storage permissions
- Ensure there is enough free space
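For the integrity check: every valid GGUF file begins with the 4-byte magic `GGUF`, so a quick header test catches truncated or corrupted files before the native loader crashes:

```java
import java.io.FileInputStream;
import java.io.IOException;

public final class GgufCheck {
    // Returns true if the file starts with the GGUF magic bytes.
    public static boolean looksLikeGguf(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] magic = new byte[4];
            return in.read(magic) == 4
                && magic[0] == 'G' && magic[1] == 'G'
                && magic[2] == 'U' && magic[3] == 'F';
        }
    }
}
```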