πŸš€ Edge Multimodal Embeddings

Production-grade multimodal embedding system that runs on edge devices.
Handles text, images, video, audio, PDFs, code, and 40+ file formats β€” all under 4GB VRAM.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    EdgeMultimodalEmbedder                           β”‚
β”‚                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚
β”‚  β”‚   Text   β”‚  β”‚  Vision  β”‚  β”‚  Audio   β”‚  β”‚  Video   β”‚           β”‚
β”‚  β”‚ (Nomic)  β”‚  β”‚ (Nomic)  β”‚  β”‚  (CLAP)  β”‚  β”‚(VideoMAE)β”‚           β”‚
β”‚  β”‚  768d    β”‚  β”‚  768d    β”‚  β”‚  512d    β”‚  β”‚  384d    β”‚           β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜           β”‚
β”‚       β”‚              β”‚             β”‚              β”‚                 β”‚
β”‚       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚                             β”‚                                       β”‚
β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        β”‚
β”‚              β”‚   Projection Heads β†’ 512d   β”‚                        β”‚
β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β”‚
β”‚                             β”‚                                       β”‚
β”‚                   L2-Normalized 512d                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

✨ Key Features

| Feature | Details |
|---|---|
| 4 Modalities | Text, images, video, audio — unified 512d space |
| 40+ File Formats | PDF, DOCX, HTML, Markdown, code (Python/JS/Rust/...), CSV, JSON, images, video, audio |
| Edge-First | ~390MB quantized total — fits on phones, RPi, Jetson, 4GB GPUs |
| 5 Hardware Profiles | phone, edge_gpu, workstation, raspberry_pi, jetson |
| Smart Memory | LRU model swapping, adaptive batching, OOM recovery |
| Quantization | INT8 dynamic, INT8 static, INT4, FP16 — automatic per hardware |
| Export | ONNX (ARM64/x86), OpenVINO (Intel), with projection heads |
| Minimal Runtime | ONNX-only inference needs just onnxruntime + numpy (~15MB vs ~2GB PyTorch) |

πŸ“Š Model Stack

| Modality | Model | FP32 Size | Quantized | Dim | Quality |
|---|---|---|---|---|---|
| Text | nomic-embed-text-v1.5 | 522MB | 106MB (INT4) | 768 | MTEB 62.28 |
| Image | nomic-embed-vision-v1.5 | 355MB | 92MB (INT8) | 768 | IN-1k 71.0% |
| Audio | CLAP HTSAT-unfused | 586MB | ~150MB (INT8) | 512 | SOTA audio retrieval |
| Video | VideoMAE-Small | 84MB | 42MB (FP16) | 384 | K400 79.0% top-1 |
| Projections | Linear + LayerNorm | ~4MB | ~4MB | → 512 | — |
| Total | — | 1,551MB | ~394MB | 512 | — |

Text ↔ Image share the same 768d space via Nomic. No projection needed for text-image retrieval!
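Concretely, cross-modal scoring in that shared space is plain cosine similarity, which is presumably what similarity() computes under the hood. A minimal numpy sketch:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # On L2-normalized vectors the dot product equals the cosine similarity.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))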

πŸ—οΈ Hardware Profiles

from edge_multimodal_embeddings import EdgeMultimodalEmbedder

# Quick start: pick a hardware profile by name
embedder = EdgeMultimodalEmbedder.from_profile("edge_gpu")

# Profiles:
# "phone"        β†’ 384d, INT4, 1.5GB budget, model swapping, MiniLM text
# "edge_gpu"     β†’ 512d, INT8, 3.5GB budget, all models loaded
# "workstation"  β†’ 768d, FP16, 8GB budget, no quantization
# "raspberry_pi" β†’ 256d, INT4, 1GB budget, model swapping, no audio
# "jetson"       β†’ 512d, FP16, 3GB budget

πŸš€ Quick Start

Install

pip install torch transformers Pillow numpy
# Optional: pip install onnxruntime onnx librosa soundfile opencv-python pymupdf

Embed Anything

from edge_multimodal_embeddings import EdgeMultimodalEmbedder

embedder = EdgeMultimodalEmbedder.from_profile("edge_gpu")

# Text
text_emb = embedder.embed("The quick brown fox")

# Image (auto-detected from file extension)
img_emb = embedder.embed("photo.jpg")

# Video
vid_emb = embedder.embed("clip.mp4")

# Audio
audio_emb = embedder.embed("song.wav")

# PDF (extracts text + images, embeds both)
pdf_emb = embedder.embed("paper.pdf")

# Code
code_emb = embedder.embed("main.py", modality="document")

# Cross-modal similarity
similarity = embedder.similarity(text_emb, img_emb)
print(f"Text-Image similarity: {similarity:.4f}")

Batch Embedding

# Mixed modality batch
inputs = ["Hello world", "photo.jpg", "clip.mp4"]
embeddings = embedder.embed_batch(inputs)
# β†’ shape: (3, 512)

Phone Deployment

# Step 1: Export on workstation
embedder = EdgeMultimodalEmbedder.from_profile("phone")
embedder.export_onnx("./phone_models")

# Step 2: Run on phone (only needs onnxruntime + numpy)
from edge_multimodal_embeddings.runtime import EdgeRuntime

runtime = EdgeRuntime("./phone_models", num_threads=4)
emb = runtime.embed_text("Find photos of cats")
img_emb = runtime.embed_image("cat.jpg")
sim = runtime.similarity(emb, img_emb)

πŸ”§ Custom Configuration

from edge_multimodal_embeddings.core.config import (
    EmbedderConfig, ModelConfig, QuantizationMode
)

config = EmbedderConfig(
    unified_dim=512,
    device="cuda",
    quantization=QuantizationMode.INT8_DYNAMIC,
    max_memory_mb=3500,
    lazy_load=True,
    model_swapping=False,
    max_batch_size=16,
    
    # Swap in different models
    text_model=ModelConfig(
        model_id="sentence-transformers/all-MiniLM-L6-v2",
        embedding_dim=384,
        max_input_size=512,
    ),
    # Disable modalities you don't need
    audio_model=ModelConfig("laion/clap-htsat-unfused", 512, enabled=False),
    video_model=ModelConfig("MCG-NJU/videomae-small-finetuned-kinetics", 384, enabled=False),
)

embedder = EdgeMultimodalEmbedder(config)

πŸ“ Supported File Formats

| Category | Formats |
|---|---|
| Text | .txt, .md, .rst, .csv, .json, .jsonl, .xml, .yaml, .yml |
| Code | .py, .js, .ts, .java, .cpp, .c, .rs, .go, .rb, .sql, .sh |
| Documents | .pdf (text + images), .docx, .html, .htm |
| Images | .jpg, .jpeg, .png, .bmp, .gif, .webp, .tiff |
| Video | .mp4, .avi, .mov, .mkv, .webm, .flv, .wmv |
| Audio | .wav, .mp3, .flac, .ogg, .m4a, .wma, .aac, .opus |

Plus: PIL Images, numpy arrays, torch tensors, base64 strings, URLs, raw bytes.
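A short sketch of the non-path input types, assuming embed() dispatches on the Python type of its argument (the modality hint for raw arrays is an assumption, mirroring the modality= kwarg shown in Quick Start):

from PIL import Image
import numpy as np

img = Image.open("photo.jpg")                    # PIL Image, no temp file needed
img_emb = embedder.embed(img)

frame = np.zeros((224, 224, 3), dtype=np.uint8)  # raw HxWxC image array
frame_emb = embedder.embed(frame, modality="image")

url_emb = embedder.embed("https://example.com/cat.jpg")  # fetched, then embedded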

🧠 Architecture Details

Why This Model Stack?

  1. Nomic text + vision share a native 768d embedding space β€” no projection needed for text↔image retrieval, highest accuracy per byte.
  2. CLAP (HTSAT-unfused) is Apache-2.0 licensed and offers strong zero-shot audio-text retrieval.
  3. VideoMAE-Small is remarkably tiny (84MB) while achieving 79% top-1 on Kinetics-400.
  4. Projection heads are single Linear + LayerNorm layers (~0.5MB each) that unify all spaces to 512d, as sketched below.
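A minimal PyTorch sketch of such a head (layer names here are illustrative; the repo's implementation may differ):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Project a modality-specific embedding into the unified 512d space."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(self.proj(x))
        return F.normalize(x, dim=-1)  # final L2 normalization, per the diagram

video_head = ProjectionHead(in_dim=384)  # VideoMAE 384d -> 512d
audio_head = ProjectionHead(in_dim=512)  # CLAP 512d -> 512d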

Quantization Strategy

Based on MobileQuant (arXiv:2408.13933) and EdgeVL (arXiv:2403.04908):

  • Dynamic INT8 (default for edge): weights quantized, activations computed in FP32. <1% cosine-similarity loss. No calibration needed (sketched after this list).
  • Static INT8: Both weights and activations quantized. Needs calibration data. Best throughput.
  • INT4: 8Γ— memory reduction. 1-3% quality loss. For extreme edge (phones, RPi).
  • Per-tensor quantization for mobile NPU compatibility (not per-token).
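A minimal sketch of the dynamic-INT8 path using PyTorch's built-in dynamic quantization (the repo's pipeline may wrap this differently or go through ONNX instead):

import torch
from torch.ao.quantization import quantize_dynamic

# Toy stand-in for an encoder: only Linear weights are converted to INT8;
# activations stay FP32 and are quantized on the fly, so no calibration data.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.GELU(),
    torch.nn.Linear(768, 512),
)
quantized = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
print(quantized(x).shape)  # torch.Size([1, 512])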

Memory Management

  • Lazy loading: Models loaded on first use, not at init
  • LRU swapping: when the memory budget is exceeded, the least-recently-used model is evicted (sketched after this list)
  • Adaptive batching: Automatically reduces batch size on OOM and retries
  • Real-time monitoring: Track memory usage via embedder.get_status()
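A minimal sketch of the LRU-swapping idea using an OrderedDict; the real manager presumably tracks measured model sizes and device memory rather than fixed estimates:

from collections import OrderedDict

class LRUModelCache:
    """Evict the least-recently-used model when a memory budget is exceeded."""

    def __init__(self, budget_mb: int):
        self.budget_mb = budget_mb
        self.models = OrderedDict()  # name -> (model, size_mb), oldest first
        self.used_mb = 0

    def get(self, name, loader, size_mb):
        if name in self.models:
            self.models.move_to_end(name)  # mark as recently used
            return self.models[name][0]
        while self.models and self.used_mb + size_mb > self.budget_mb:
            _, (_, freed_mb) = self.models.popitem(last=False)  # evict LRU
            self.used_mb -= freed_mb
        model = loader()  # lazy load on first use
        self.models[name] = (model, size_mb)
        self.used_mb += size_mb
        return model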

πŸ“± Mobile/Edge Deployment

Android (ONNX Runtime Mobile)

// build.gradle
implementation 'com.microsoft.onnxruntime:onnxruntime-mobile:1.17.0'

// Load model
val session = env.createSession(modelBytes, SessionOptions().apply {
    setIntraOpNumThreads(4)
    addNnapi()  // Use Android Neural Networks API
})

iOS (CoreML)

// Use ONNX β†’ CoreML conversion or CoreML EP
let config = MLModelConfiguration()
config.computeUnits = .all  // CPU + GPU + ANE

Raspberry Pi / ARM Linux

pip install onnxruntime  # ARM64 wheel available
python -c "from edge_multimodal_embeddings.runtime import EdgeRuntime; ..."
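The EdgeRuntime wrapper shown earlier needs nothing beyond onnxruntime and numpy. As a hedged sketch of what raw inference looks like (the model filename and input tensor names here are assumptions that depend on the exporter):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("./phone_models/text_model.onnx",
                            providers=["CPUExecutionProvider"])

# Hypothetical pre-tokenized input; a real pipeline runs the tokenizer first.
feeds = {
    "input_ids": np.array([[101, 7592, 2088, 102]], dtype=np.int64),
    "attention_mask": np.ones((1, 4), dtype=np.int64),
}
outputs = sess.run(None, feeds)  # assumes a single pooled-embedding output
emb = outputs[0]
emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)  # L2-normalize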

πŸ“Š Benchmarks

| Hardware | Text (ms) | Image (ms) | Memory |
|---|---|---|---|
| Intel i7 (INT8) | ~6 | ~15 | ~400MB |
| Ryzen 7 (INT8) | ~5 | ~12 | ~400MB |
| RTX 3050 (FP16) | ~2 | ~5 | ~800MB |
| Raspberry Pi 4 (INT8) | ~50 | ~100 | ~200MB |
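These are per-item latencies on a warm model. A rough sketch of how such numbers can be reproduced (the warmup call excludes lazy model-load cost; exact figures vary with hardware and thread settings):

import time
from edge_multimodal_embeddings import EdgeMultimodalEmbedder

embedder = EdgeMultimodalEmbedder.from_profile("edge_gpu")
embedder.embed("warmup")  # first call pays the lazy model-load cost

n = 100
start = time.perf_counter()
for _ in range(n):
    embedder.embed("The quick brown fox")
print(f"text latency: {(time.perf_counter() - start) / n * 1000:.1f} ms/item")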

πŸ”— CLI

# Embed
edge-embed embed "Hello world"
edge-embed embed photo.jpg
edge-embed embed --modality video clip.mp4

# Compare
edge-embed compare "a cat" "a kitten"

# Benchmark
edge-embed benchmark -n 100

# Export
edge-embed export --format onnx -o ./exported/

# System info
edge-embed info

πŸ“„ License

Apache-2.0. All component models are Apache-2.0 or MIT licensed.

πŸ™ Credits

  • Nomic AI β€” nomic-embed-text/vision-v1.5
  • LAION β€” CLAP audio embeddings
  • MCG-NJU β€” VideoMAE
  • MobileQuant β€” Mobile quantization research
  • EdgeVL β€” Edge visual-language models
  • MobileCLIP β€” Architecture inspiration