# MiniMind Android Deployment Guide

Deploy MiniMind (Mind2) models on Android devices using one of several runtime options.

## Deployment Options

| Runtime | Size | Speed | Ease of Use |
|---|---|---|---|
| llama.cpp | ★★★★★ | ★★★★☆ | ★★★★☆ |
| ONNX Runtime | ★★★★☆ | ★★★☆☆ | ★★★★★ |
| MLC-LLM | ★★★★☆ | ★★★★★ | ★★★☆☆ |
| TensorFlow Lite | ★★★★★ | ★★★☆☆ | ★★★★☆ |

## Quick Start

### Option 1: llama.cpp (Recommended)

```bash
# 1. Export the model to GGUF format
python scripts/export_gguf.py --model mind2-lite --output models/mind2-lite.gguf

# 2. Build llama.cpp for Android with the NDK toolchain
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build-android && cd build-android
cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-28 \
    -DBUILD_SHARED_LIBS=ON   # build libllama.so rather than a static library
make -j

# 3. Copy the shared library into the Android project
cp libllama.so ../android/app/src/main/jniLibs/arm64-v8a/
```
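
With `libllama.so` in `jniLibs/`, a thin Java wrapper (the `Mind2Model.java` from the project structure below) can load the library and expose generation to the app. This is a minimal sketch; the native method names are hypothetical and must match whatever `mind2_jni.cpp` actually exports.

```java
package com.minimind;

public class Mind2Model {
    static {
        // Load libllama.so (and the JNI bridge) from jniLibs/arm64-v8a/
        System.loadLibrary("llama");
    }

    // Hypothetical native signatures, implemented in mind2_jni.cpp
    private native long nativeLoadModel(String ggufPath);
    private native String nativeGenerate(long handle, String prompt, int maxTokens);
    private native void nativeFree(long handle);

    private long handle;

    public void load(String ggufPath) {
        handle = nativeLoadModel(ggufPath);
        if (handle == 0) {
            throw new IllegalStateException("Failed to load model: " + ggufPath);
        }
    }

    public String generate(String prompt, int maxTokens) {
        return nativeGenerate(handle, prompt, maxTokens);
    }

    public void close() {
        if (handle != 0) {
            nativeFree(handle);
            handle = 0;
        }
    }
}
```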

### Option 2: ONNX Runtime

```bash
# 1. Export the model to ONNX
python scripts/export_onnx.py --model mind2-lite --output models/mind2-lite.onnx
```

Then add the ONNX Runtime dependency in `app/build.gradle`:

```groovy
dependencies {
    implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.16.0'
}
```
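
For the ONNX path, inference runs through ONNX Runtime's Java API. A minimal sketch of a single forward pass follows; the tensor names `input_ids` and `logits` are assumptions about the export, so verify them against the actual model (e.g. with Netron or the session's metadata).

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

import java.util.Collections;

public class Mind2Onnx {
    // Runs one forward pass and returns per-position logits.
    public static float[][] runStep(byte[] modelBytes, long[] inputIds) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession session = env.createSession(modelBytes, new OrtSession.SessionOptions());
             // Shape [1, seq_len]: a single batched sequence of token ids
             OnnxTensor input = OnnxTensor.createTensor(env, new long[][]{inputIds})) {
            try (OrtSession.Result result =
                         session.run(Collections.singletonMap("input_ids", input))) {
                // Assumed output: "logits" with shape [1, seq_len, vocab_size]
                float[][][] logits = (float[][][]) result.get(0).getValue();
                return logits[0];
            }
        }
    }
}
```

On Android, read the model bytes from `assets/` (e.g. via `AssetManager`) before creating the session, since assets are not directly addressable as file paths.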

### Option 3: MLC-LLM

```bash
# 1. Install MLC-LLM
pip install mlc-llm

# 2. Compile the model for Android
mlc_llm compile mind2-lite --target android

# 3. Package for deployment
mlc_llm package mind2-lite --target android --output ./android/app/src/main/assets/
```

## Project Structure

```
android/
├── app/
│   ├── src/main/
│   │   ├── java/com/minimind/
│   │   │   ├── Mind2Model.java      # Model wrapper
│   │   │   ├── Mind2Tokenizer.java  # Tokenizer
│   │   │   └── Mind2Chat.java       # Chat interface
│   │   ├── jniLibs/
│   │   │   └── arm64-v8a/
│   │   │       └── libllama.so
│   │   └── assets/
│   │       ├── mind2-lite.gguf
│   │       └── tokenizer.json
│   └── build.gradle
├── jni/
│   ├── mind2_jni.cpp               # JNI bridge
│   └── CMakeLists.txt
└── README.md
```

## Memory Requirements

| Model | RAM (INT4) | RAM (FP16) | Storage |
|---|---|---|---|
| mind2-nano | ~400MB | ~800MB | ~300MB |
| mind2-lite | ~1.2GB | ~2.4GB | ~900MB |
| mind2-pro | ~2.4GB | ~4.8GB | ~1.8GB |
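
These footprints make it practical to pick a variant at runtime from the device's available RAM. A minimal sketch, with thresholds taken from the INT4 column above plus working headroom:

```java
import android.app.ActivityManager;
import android.content.Context;

public final class ModelSelector {
    // Picks the largest variant that comfortably fits in available RAM.
    public static String pickModel(Context context) {
        ActivityManager am =
                (ActivityManager) context.getSystemService(Context.ACTIVITY_SERVICE);
        ActivityManager.MemoryInfo info = new ActivityManager.MemoryInfo();
        am.getMemoryInfo(info);
        long availMb = info.availMem / (1024 * 1024);

        if (availMb > 4096) return "mind2-pro";  // ~2.4GB INT4 + KV-cache headroom
        if (availMb > 2048) return "mind2-lite"; // ~1.2GB INT4 + headroom
        return "mind2-nano";                     // ~400MB INT4
    }
}
```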

## Performance Benchmarks

Tested on common Android devices:

| Device | Model | Tokens/sec |
|---|---|---|
| Pixel 8 Pro | mind2-nano | 45 |
| Pixel 8 Pro | mind2-lite | 22 |
| Samsung S24 | mind2-nano | 52 |
| Samsung S24 | mind2-lite | 28 |

## Best Practices

1. Use INT4 quantization for the best size/performance balance.
2. Limit the context length to 512-1024 tokens on mobile.
3. Enable the KV-cache for faster generation.
4. Use streaming for a responsive UI.
5. Handle memory pressure gracefully (see the sketch below).
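
For point 5, Android reports memory pressure through `ComponentCallbacks2`. A minimal sketch that frees the model under critical pressure and lets the app reload it lazily (`Mind2Model` is the hypothetical wrapper from the llama.cpp section):

```java
import android.content.ComponentCallbacks2;
import android.content.res.Configuration;

public class ModelHolder implements ComponentCallbacks2 {
    private Mind2Model model; // re-loaded lazily on the next request

    @Override
    public void onTrimMemory(int level) {
        // Free the multi-hundred-MB model when the system is under pressure.
        if (level >= ComponentCallbacks2.TRIM_MEMORY_RUNNING_CRITICAL && model != null) {
            model.close();
            model = null;
        }
    }

    @Override public void onConfigurationChanged(Configuration newConfig) {}
    @Override public void onLowMemory() {}
}
```

Register the holder with `context.registerComponentCallbacks(holder)` so `onTrimMemory` is actually delivered.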

## Troubleshooting

### Out of Memory

- Use a smaller model (nano instead of lite)
- Reduce the context length
- Enable swap if available

### Slow Inference

- Check the CPU governor (set it to `performance`)
- Ensure the build uses NEON/Arm optimizations
- Consider GPU acceleration (MLC-LLM)

### Model Loading Failed

- Verify the GGUF file's integrity (a pre-flight check is sketched below)
- Check storage permissions
- Ensure there is enough free space
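
A simple pre-flight check covers the first and last points. This is a sketch: the empty-file test is only a rough integrity proxy, not a real GGUF validation, and the path and size arguments are illustrative.

```java
import android.os.StatFs;
import java.io.File;

public final class PreflightCheck {
    // Throws with a specific message instead of failing opaquely inside the runtime.
    public static void check(String modelPath, long bytesNeeded) {
        File model = new File(modelPath);
        if (!model.exists() || model.length() == 0) {
            throw new IllegalStateException("Model file missing or empty: " + modelPath);
        }
        StatFs stat = new StatFs(model.getParent());
        if (stat.getAvailableBytes() < bytesNeeded) {
            throw new IllegalStateException("Not enough free space for model buffers");
        }
    }
}
```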