Production-grade multimodal embedding system that runs on edge devices.
Handles text, images, video, audio, PDFs, code, and 40+ file formats, all under 4GB VRAM.
```
┌──────────────────────────────────────────────────────────────┐
│                    EdgeMultimodalEmbedder                    │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │   Text   │  │  Vision  │  │  Audio   │  │  Video   │      │
│  │ (Nomic)  │  │ (Nomic)  │  │  (CLAP)  │  │(VideoMAE)│      │
│  │   768d   │  │   768d   │  │   512d   │  │   384d   │      │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘      │
│       │             │             │             │            │
│       └─────────────┴──────┬──────┴─────────────┘            │
│                            │                                 │
│             ┌──────────────┴──────────────┐                  │
│             │   Projection Heads → 512d   │                  │
│             └──────────────┬──────────────┘                  │
│                            │                                 │
│                   L2-Normalized 512d                         │
└──────────────────────────────────────────────────────────────┘
```
| Feature | Details |
|---|---|
| 4 Modalities | Text, images, video, audio → one unified 512d space |
| 40+ File Formats | PDF, DOCX, HTML, Markdown, code (Python/JS/Rust/...), CSV, JSON, images, video, audio |
| Edge-First | ~390MB quantized total; fits on phones, RPi, Jetson, 4GB GPUs |
| 5 Hardware Profiles | phone, edge_gpu, workstation, raspberry_pi, jetson |
| Smart Memory | LRU model swapping, adaptive batching, OOM recovery |
| Quantization | INT8 dynamic, INT8 static, INT4, FP16; selected automatically per hardware profile |
| Export | ONNX (ARM64/x86), OpenVINO (Intel), with projection heads |
| Minimal Runtime | ONNX-only inference needs just onnxruntime + numpy (~15MB vs ~2GB PyTorch) |
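The Smart Memory row refers to loading encoders on demand and evicting idle ones when the budget is tight. A rough illustration of the LRU-swapping idea (class and method names are illustrative, not the package's actual internals):

```python
from collections import OrderedDict

class LRUModelPool:
    """Illustrative LRU pool: evicts the least recently used encoder
    when loading another one would exceed the memory budget."""

    def __init__(self, budget_mb: int):
        self.budget_mb = budget_mb
        self.loaded: "OrderedDict[str, object]" = OrderedDict()
        self.sizes_mb: dict = {}

    def get(self, name: str, loader, size_mb: int):
        if name in self.loaded:
            self.loaded.move_to_end(name)   # mark as most recently used
            return self.loaded[name]
        # Evict least recently used encoders until the new one fits.
        while self.loaded and sum(self.sizes_mb.values()) + size_mb > self.budget_mb:
            evicted, _ = self.loaded.popitem(last=False)
            self.sizes_mb.pop(evicted)
        model = loader()                    # e.g. load weights from disk
        self.loaded[name] = model
        self.sizes_mb[name] = size_mb
        return model
```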
| Modality | Model | FP32 Size | Quantized | Dim | Quality |
|---|---|---|---|---|---|
| Text | nomic-embed-text-v1.5 | 522MB | 106MB (INT4) | 768 | MTEB 62.28 |
| Image | nomic-embed-vision-v1.5 | 355MB | 92MB (INT8) | 768 | IN-1k 71.0% |
| Audio | CLAP HTSAT-unfused | 586MB | ~150MB (INT8) | 512 | SOTA audio retrieval |
| Video | VideoMAE-Small | 84MB | 42MB (FP16) | 384 | K400 79.0% top-1 |
| Projections | Linear + LayerNorm | ~4MB | ~4MB | → 512 | – |
| Total | – | 1,551MB | ~394MB | 512 | – |
Text ↔ Image: both Nomic encoders share the same 768d space, so no projection is needed for text-image retrieval!
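Per the model table, each projection head is a Linear layer plus LayerNorm, with L2 normalization at the end so cosine similarity reduces to a dot product. A minimal sketch of that mapping (class and variable names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps an encoder's native dimension (e.g. CLAP's 512d or
    VideoMAE's 384d) into the shared unified space."""

    def __init__(self, in_dim: int, unified_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, unified_dim)
        self.norm = nn.LayerNorm(unified_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(self.proj(x))
        return F.normalize(x, p=2, dim=-1)  # L2-normalize for cosine similarity

# e.g. project a VideoMAE embedding (384d) into the 512d unified space
video_head = ProjectionHead(in_dim=384, unified_dim=512)
unified = video_head(torch.randn(1, 384))   # shape: (1, 512), unit norm
```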
```python
from edge_multimodal_embeddings import EdgeMultimodalEmbedder

# Quick start: load a preset hardware profile
embedder = EdgeMultimodalEmbedder.from_profile("edge_gpu")

# Profiles:
#   "phone"        → 384d, INT4, 1.5GB budget, model swapping, MiniLM text
#   "edge_gpu"     → 512d, INT8, 3.5GB budget, all models loaded
#   "workstation"  → 768d, FP16, 8GB budget, no quantization
#   "raspberry_pi" → 256d, INT4, 1GB budget, model swapping, no audio
#   "jetson"       → 512d, FP16, 3GB budget
```
```bash
pip install torch transformers Pillow numpy

# Optional: pip install onnxruntime onnx librosa soundfile opencv-python pymupdf
```
```python
from edge_multimodal_embeddings import EdgeMultimodalEmbedder

embedder = EdgeMultimodalEmbedder.from_profile("edge_gpu")

# Text
text_emb = embedder.embed("The quick brown fox")

# Image (auto-detected from file extension)
img_emb = embedder.embed("photo.jpg")

# Video
vid_emb = embedder.embed("clip.mp4")

# Audio
audio_emb = embedder.embed("song.wav")

# PDF (extracts text + images, embeds both)
pdf_emb = embedder.embed("paper.pdf")

# Code
code_emb = embedder.embed("main.py", modality="document")

# Cross-modal similarity
similarity = embedder.similarity(text_emb, img_emb)
print(f"Text-Image similarity: {similarity:.4f}")

# Mixed modality batch
inputs = ["Hello world", "photo.jpg", "clip.mp4"]
embeddings = embedder.embed_batch(inputs)
# → shape: (3, 512)
```
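Because every modality lands in the same L2-normalized 512d space, retrieval across a mixed corpus is a single matrix product. A small follow-on sketch (file names are placeholders; it assumes `embed` returns a 1-D vector and `embed_batch` an `(N, 512)` array):

```python
import numpy as np

# Placeholder corpus mixing text, image, video, and audio inputs
corpus = ["a cat sleeping on a couch", "dog.jpg", "lecture.mp4", "rain.wav"]
corpus_emb = np.asarray(embedder.embed_batch(corpus))            # (4, 512)

query_emb = np.asarray(embedder.embed("pets relaxing indoors"))  # (512,)

# Embeddings are L2-normalized, so a dot product is cosine similarity
scores = corpus_emb @ query_emb
for rank, idx in enumerate(np.argsort(-scores), start=1):
    print(f"{rank}. {corpus[idx]}  (score={scores[idx]:.3f})")
```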
```python
# Step 1: Export on a workstation
embedder = EdgeMultimodalEmbedder.from_profile("phone")
embedder.export_onnx("./phone_models")

# Step 2: Run on the phone (only needs onnxruntime + numpy)
from edge_multimodal_embeddings.runtime import EdgeRuntime

runtime = EdgeRuntime("./phone_models", num_threads=4)
emb = runtime.embed_text("Find photos of cats")
img_emb = runtime.embed_image("cat.jpg")
sim = runtime.similarity(emb, img_emb)
```
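What EdgeRuntime wraps is plain onnxruntime inference, which is why the minimal deployment needs only onnxruntime + numpy. A stripped-down sketch at that level (the model file name and input tensor names are assumptions; EdgeRuntime also handles tokenization):

```python
import numpy as np
import onnxruntime as ort

# Assumed file name for the exported text encoder
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
session = ort.InferenceSession("./phone_models/text_encoder.onnx",
                               sess_options=opts,
                               providers=["CPUExecutionProvider"])

# Pre-tokenized input; EdgeRuntime would run the tokenizer for you
input_ids = np.array([[101, 7592, 2088, 102]], dtype=np.int64)
feeds = {"input_ids": input_ids,                 # assumed input names
         "attention_mask": np.ones_like(input_ids)}
(embedding,) = session.run(None, feeds)
embedding /= np.linalg.norm(embedding, axis=-1, keepdims=True)  # L2-normalize
```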
```python
from edge_multimodal_embeddings.core.config import (
    EmbedderConfig, ModelConfig, QuantizationMode
)

config = EmbedderConfig(
    unified_dim=512,
    device="cuda",
    quantization=QuantizationMode.INT8_DYNAMIC,
    max_memory_mb=3500,
    lazy_load=True,
    model_swapping=False,
    max_batch_size=16,
    # Swap in different models
    text_model=ModelConfig(
        model_id="sentence-transformers/all-MiniLM-L6-v2",
        embedding_dim=384,
        max_input_size=512,
    ),
    # Disable modalities you don't need
    audio_model=ModelConfig("laion/clap-htsat-unfused", 512, enabled=False),
    video_model=ModelConfig("MCG-NJU/videomae-small-finetuned-kinetics", 384, enabled=False),
)

embedder = EdgeMultimodalEmbedder(config)
```
| Category | Formats |
|---|---|
| Text | .txt, .md, .rst, .csv, .json, .jsonl, .xml, .yaml, .yml |
| Code | .py, .js, .ts, .java, .cpp, .c, .rs, .go, .rb, .sql, .sh |
| Documents | .pdf (text + images), .docx, .html, .htm |
| Images | .jpg, .jpeg, .png, .bmp, .gif, .webp, .tiff |
| Video | .mp4, .avi, .mov, .mkv, .webm, .flv, .wmv |
| Audio | .wav, .mp3, .flac, .ogg, .m4a, .wma, .aac, .opus |
Plus: PIL Images, numpy arrays, torch tensors, base64 strings, URLs, raw bytes.
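A quick illustration of the non-path inputs (the URL and array contents are placeholders):

```python
import numpy as np
from PIL import Image

# PIL image
emb_pil = embedder.embed(Image.open("photo.jpg"))

# numpy array: an (H, W, 3) uint8 image, here just placeholder zeros
emb_arr = embedder.embed(np.zeros((224, 224, 3), dtype=np.uint8))

# URL (placeholder address)
emb_url = embedder.embed("https://example.com/photo.jpg")
```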
The quantization and edge-deployment approach builds on MobileQuant (arXiv:2408.13933) and EdgeVL (arXiv:2403.04908).
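For context on the INT8 dynamic mode from the feature table, this is what generic weight-only dynamic quantization looks like in plain PyTorch; a sketch only, not MobileQuant's full method:

```python
import torch
from transformers import AutoModel

# Generic INT8 dynamic quantization of an encoder's Linear layers
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5",
                                  trust_remote_code=True)
model.eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # only Linear weights are stored as INT8
    dtype=torch.qint8,
)
# Weights are dequantized on the fly; activations stay in float
# (hence "dynamic"). Static INT8, by contrast, calibrates
# activation ranges ahead of time on sample data.
```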
```python
embedder.get_status()
```

On Android, run the exported models with ONNX Runtime Mobile:

```groovy
// build.gradle
implementation 'com.microsoft.onnxruntime:onnxruntime-mobile:1.17.0'
```

```kotlin
// Load the exported model with NNAPI acceleration
val env = OrtEnvironment.getEnvironment()
val session = env.createSession(modelBytes, OrtSession.SessionOptions().apply {
    setIntraOpNumThreads(4)
    addNnapi()  // use the Android Neural Networks API
})
```
```swift
// Use ONNX → Core ML conversion, or ONNX Runtime's CoreML execution provider
let config = MLModelConfiguration()
config.computeUnits = .all  // CPU + GPU + Apple Neural Engine
```
```bash
pip install onnxruntime  # ARM64 wheel available
python -c "from edge_multimodal_embeddings.runtime import EdgeRuntime; ..."
```
| Hardware | Text (ms) | Image (ms) | Memory |
|---|---|---|---|
| Intel i7 (INT8) | ~6 | ~15 | ~400MB |
| Ryzen 7 (INT8) | ~5 | ~12 | ~400MB |
| RTX 3050 (FP16) | ~2 | ~5 | ~800MB |
| Raspberry Pi 4 (INT8) | ~50 | ~100 | ~200MB |
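To reproduce numbers like these on your own hardware, a simple timing loop is enough (or use the `edge-embed benchmark` command below); a minimal sketch:

```python
import time

def avg_latency_ms(fn, arg, n=100, warmup=10):
    """Mean latency in ms over n runs, after warmup to stabilize caches."""
    for _ in range(warmup):
        fn(arg)
    start = time.perf_counter()
    for _ in range(n):
        fn(arg)
    return (time.perf_counter() - start) / n * 1000

print(f"text:  {avg_latency_ms(embedder.embed, 'The quick brown fox'):.1f} ms")
print(f"image: {avg_latency_ms(embedder.embed, 'photo.jpg'):.1f} ms")
```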
```bash
# Embed
edge-embed embed "Hello world"
edge-embed embed photo.jpg
edge-embed embed --modality video clip.mp4

# Compare
edge-embed compare "a cat" "a kitten"

# Benchmark
edge-embed benchmark -n 100

# Export
edge-embed export --format onnx -o ./exported/

# System info
edge-embed info
```
Apache-2.0. All component models are Apache-2.0 or MIT licensed.