πŸš€ Edge Multimodal Embeddings

Production-grade multimodal embedding system that runs on edge devices.
Handles text, images, video, audio, PDFs, code, and 40+ file formats β€” all under 4GB VRAM.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    EdgeMultimodalEmbedder                           β”‚
β”‚                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚
β”‚  β”‚   Text   β”‚  β”‚  Vision  β”‚  β”‚  Audio   β”‚  β”‚  Video   β”‚           β”‚
β”‚  β”‚ (Nomic)  β”‚  β”‚ (Nomic)  β”‚  β”‚  (CLAP)  β”‚  β”‚(VideoMAE)β”‚           β”‚
β”‚  β”‚  768d    β”‚  β”‚  768d    β”‚  β”‚  512d    β”‚  β”‚  384d    β”‚           β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜           β”‚
β”‚       β”‚              β”‚             β”‚              β”‚                 β”‚
β”‚       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚                             β”‚                                       β”‚
β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        β”‚
β”‚              β”‚   Projection Heads β†’ 512d   β”‚                        β”‚
β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β”‚
β”‚                             β”‚                                       β”‚
β”‚                   L2-Normalized 512d                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

✨ Key Features

| Feature | Details |
|---|---|
| 4 Modalities | Text, images, video, audio — unified 512d space |
| 40+ File Formats | PDF, DOCX, HTML, Markdown, code (Python/JS/Rust/...), CSV, JSON, images, video, audio |
| Edge-First | ~390MB quantized total — fits on phones, RPi, Jetson, 4GB GPUs |
| 5 Hardware Profiles | phone, edge_gpu, workstation, raspberry_pi, jetson |
| Smart Memory | LRU model swapping, adaptive batching, OOM recovery |
| Quantization | INT8 dynamic, INT8 static, INT4, FP16 — automatic per hardware |
| Export | ONNX (ARM64/x86), OpenVINO (Intel), with projection heads |
| Minimal Runtime | ONNX-only inference needs just onnxruntime + numpy (~15MB vs ~2GB PyTorch) |

πŸ“Š Model Stack

| Modality | Model | FP32 Size | Quantized | Dim | Quality |
|---|---|---|---|---|---|
| Text | nomic-embed-text-v1.5 | 522MB | 106MB (INT4) | 768 | MTEB 62.28 |
| Image | nomic-embed-vision-v1.5 | 355MB | 92MB (INT8) | 768 | IN-1k 71.0% |
| Audio | CLAP HTSAT-unfused | 586MB | ~150MB (INT8) | 512 | SOTA audio retrieval |
| Video | VideoMAE-Small | 84MB | 42MB (FP16) | 384 | K400 79.0% top-1 |
| Projections | Linear + LayerNorm | ~4MB | ~4MB | → 512 | — |
| Total | — | 1,551MB | ~394MB | 512 | — |

Text ↔ Image share the same 768d space via Nomic. No projection needed for text-image retrieval!
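Concretely, cross-modal scoring in that shared space is plain cosine similarity, which is presumably what similarity() computes under the hood. A minimal numpy sketch:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # On L2-normalized vectors the dot product equals the cosine similarity.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))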

πŸ—οΈ Hardware Profiles

from edge_multimodal_embeddings import EdgeMultimodalEmbedder

# Quick start: pick a hardware profile by name
embedder = EdgeMultimodalEmbedder.from_profile("edge_gpu")

# Profiles:
# "phone"        β†’ 384d, INT4, 1.5GB budget, model swapping, MiniLM text
# "edge_gpu"     β†’ 512d, INT8, 3.5GB budget, all models loaded
# "workstation"  β†’ 768d, FP16, 8GB budget, no quantization
# "raspberry_pi" β†’ 256d, INT4, 1GB budget, model swapping, no audio
# "jetson"       β†’ 512d, FP16, 3GB budget

πŸš€ Quick Start

Install

pip install torch transformers Pillow numpy
# Optional: pip install onnxruntime onnx librosa soundfile opencv-python pymupdf

Embed Anything

from edge_multimodal_embeddings import EdgeMultimodalEmbedder

embedder = EdgeMultimodalEmbedder.from_profile("edge_gpu")

# Text
text_emb = embedder.embed("The quick brown fox")

# Image (auto-detected from file extension)
img_emb = embedder.embed("photo.jpg")

# Video
vid_emb = embedder.embed("clip.mp4")

# Audio
audio_emb = embedder.embed("song.wav")

# PDF (extracts text + images, embeds both)
pdf_emb = embedder.embed("paper.pdf")

# Code
code_emb = embedder.embed("main.py", modality="document")

# Cross-modal similarity
similarity = embedder.similarity(text_emb, img_emb)
print(f"Text-Image similarity: {similarity:.4f}")

Batch Embedding

# Mixed modality batch
inputs = ["Hello world", "photo.jpg", "clip.mp4"]
embeddings = embedder.embed_batch(inputs)
# β†’ shape: (3, 512)

Phone Deployment

# Step 1: Export on workstation
embedder = EdgeMultimodalEmbedder.from_profile("phone")
embedder.export_onnx("./phone_models")

# Step 2: Run on phone (only needs onnxruntime + numpy)
from edge_multimodal_embeddings.runtime import EdgeRuntime

runtime = EdgeRuntime("./phone_models", num_threads=4)
emb = runtime.embed_text("Find photos of cats")
img_emb = runtime.embed_image("cat.jpg")
sim = runtime.similarity(emb, img_emb)

πŸ”§ Custom Configuration

from edge_multimodal_embeddings.core.config import (
    EmbedderConfig, ModelConfig, QuantizationMode
)

config = EmbedderConfig(
    unified_dim=512,
    device="cuda",
    quantization=QuantizationMode.INT8_DYNAMIC,
    max_memory_mb=3500,
    lazy_load=True,
    model_swapping=False,
    max_batch_size=16,
    
    # Swap in different models
    text_model=ModelConfig(
        model_id="sentence-transformers/all-MiniLM-L6-v2",
        embedding_dim=384,
        max_input_size=512,
    ),
    # Disable modalities you don't need
    audio_model=ModelConfig("laion/clap-htsat-unfused", 512, enabled=False),
    video_model=ModelConfig("MCG-NJU/videomae-small-finetuned-kinetics", 384, enabled=False),
)

embedder = EdgeMultimodalEmbedder(config)

πŸ“ Supported File Formats

| Category | Formats |
|---|---|
| Text | .txt, .md, .rst, .csv, .json, .jsonl, .xml, .yaml, .yml |
| Code | .py, .js, .ts, .java, .cpp, .c, .rs, .go, .rb, .sql, .sh |
| Documents | .pdf (text + images), .docx, .html, .htm |
| Images | .jpg, .jpeg, .png, .bmp, .gif, .webp, .tiff |
| Video | .mp4, .avi, .mov, .mkv, .webm, .flv, .wmv |
| Audio | .wav, .mp3, .flac, .ogg, .m4a, .wma, .aac, .opus |

Plus: PIL Images, numpy arrays, torch tensors, base64 strings, URLs, raw bytes.
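A short sketch of the non-path input types, assuming embed() dispatches on the Python type of its argument (the modality hint for raw arrays is an assumption, mirroring the modality= kwarg shown in Quick Start):

from PIL import Image
import numpy as np

img = Image.open("photo.jpg")                    # PIL Image, no temp file needed
img_emb = embedder.embed(img)

frame = np.zeros((224, 224, 3), dtype=np.uint8)  # raw HxWxC image array
frame_emb = embedder.embed(frame, modality="image")

url_emb = embedder.embed("https://example.com/cat.jpg")  # fetched, then embedded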

🧠 Architecture Details

Why This Model Stack?

  1. Nomic text + vision share a native 768d embedding space β€” no projection needed for text↔image retrieval, highest accuracy per byte.
  2. CLAP (HTSAT-unfused) is Apache-2.0 licensed and offers strong zero-shot audio-text retrieval.
  3. VideoMAE-Small is remarkably tiny (84MB) while achieving 79% top-1 on Kinetics-400.
  4. Projection heads are single Linear + LayerNorm layers (~0.5MB each) that unify all spaces to 512d, as sketched below.
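A minimal PyTorch sketch of such a head (layer names here are illustrative; the repo's implementation may differ):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Project a modality-specific embedding into the unified 512d space."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(self.proj(x))
        return F.normalize(x, dim=-1)  # final L2 normalization, per the diagram

video_head = ProjectionHead(in_dim=384)  # VideoMAE 384d -> 512d
audio_head = ProjectionHead(in_dim=512)  # CLAP 512d -> 512d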

Quantization Strategy

Based on MobileQuant (arXiv:2408.13933) and EdgeVL (arXiv:2403.04908):

  • Dynamic INT8 (default for edge): weights quantized, activations computed in FP32. <1% cosine-similarity loss. No calibration needed (sketched after this list).
  • Static INT8: Both weights and activations quantized. Needs calibration data. Best throughput.
  • INT4: 8Γ— memory reduction. 1-3% quality loss. For extreme edge (phones, RPi).
  • Per-tensor quantization for mobile NPU compatibility (not per-token).
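A minimal sketch of the dynamic-INT8 path using PyTorch's built-in dynamic quantization (the repo's pipeline may wrap this differently or go through ONNX instead):

import torch
from torch.ao.quantization import quantize_dynamic

# Toy stand-in for an encoder: only Linear weights are converted to INT8;
# activations stay FP32 and are quantized on the fly, so no calibration data.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.GELU(),
    torch.nn.Linear(768, 512),
)
quantized = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
print(quantized(x).shape)  # torch.Size([1, 512])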

Memory Management

  • Lazy loading: Models loaded on first use, not at init
  • LRU swapping: when the memory budget is exceeded, the least-recently-used model is evicted (sketched after this list)
  • Adaptive batching: Automatically reduces batch size on OOM and retries
  • Real-time monitoring: Track memory usage via embedder.get_status()
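A minimal sketch of the LRU-swapping idea using an OrderedDict; the real manager presumably tracks measured model sizes and device memory rather than fixed estimates:

from collections import OrderedDict

class LRUModelCache:
    """Evict the least-recently-used model when a memory budget is exceeded."""

    def __init__(self, budget_mb: int):
        self.budget_mb = budget_mb
        self.models = OrderedDict()  # name -> (model, size_mb), oldest first
        self.used_mb = 0

    def get(self, name, loader, size_mb):
        if name in self.models:
            self.models.move_to_end(name)  # mark as recently used
            return self.models[name][0]
        while self.models and self.used_mb + size_mb > self.budget_mb:
            _, (_, freed_mb) = self.models.popitem(last=False)  # evict LRU
            self.used_mb -= freed_mb
        model = loader()  # lazy load on first use
        self.models[name] = (model, size_mb)
        self.used_mb += size_mb
        return model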

πŸ“± Mobile/Edge Deployment

Android (ONNX Runtime Mobile)

// build.gradle
implementation 'com.microsoft.onnxruntime:onnxruntime-mobile:1.17.0'

// Load model
val session = env.createSession(modelBytes, SessionOptions().apply {
    setIntraOpNumThreads(4)
    addNnapi()  // Use Android Neural Networks API
})

iOS (CoreML)

// Use ONNX β†’ CoreML conversion or CoreML EP
let config = MLModelConfiguration()
config.computeUnits = .all  // CPU + GPU + ANE

Raspberry Pi / ARM Linux

pip install onnxruntime  # ARM64 wheel available
python -c "from edge_multimodal_embeddings.runtime import EdgeRuntime; ..."
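The EdgeRuntime wrapper shown earlier needs nothing beyond onnxruntime and numpy. As a hedged sketch of what raw inference looks like (the model filename and input tensor names here are assumptions that depend on the exporter):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("./phone_models/text_model.onnx",
                            providers=["CPUExecutionProvider"])

# Hypothetical pre-tokenized input; a real pipeline runs the tokenizer first.
feeds = {
    "input_ids": np.array([[101, 7592, 2088, 102]], dtype=np.int64),
    "attention_mask": np.ones((1, 4), dtype=np.int64),
}
outputs = sess.run(None, feeds)  # assumes a single pooled-embedding output
emb = outputs[0]
emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)  # L2-normalize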

πŸ“Š Benchmarks

| Hardware | Text (ms) | Image (ms) | Memory |
|---|---|---|---|
| Intel i7 (INT8) | ~6 | ~15 | ~400MB |
| Ryzen 7 (INT8) | ~5 | ~12 | ~400MB |
| RTX 3050 (FP16) | ~2 | ~5 | ~800MB |
| Raspberry Pi 4 (INT8) | ~50 | ~100 | ~200MB |
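These are per-item latencies on a warm model. A rough sketch of how such numbers can be reproduced (the warmup call excludes lazy model-load cost; exact figures vary with hardware and thread settings):

import time
from edge_multimodal_embeddings import EdgeMultimodalEmbedder

embedder = EdgeMultimodalEmbedder.from_profile("edge_gpu")
embedder.embed("warmup")  # first call pays the lazy model-load cost

n = 100
start = time.perf_counter()
for _ in range(n):
    embedder.embed("The quick brown fox")
print(f"text latency: {(time.perf_counter() - start) / n * 1000:.1f} ms/item")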

πŸ”— CLI

# Embed
edge-embed embed "Hello world"
edge-embed embed photo.jpg
edge-embed embed --modality video clip.mp4

# Compare
edge-embed compare "a cat" "a kitten"

# Benchmark
edge-embed benchmark -n 100

# Export
edge-embed export --format onnx -o ./exported/

# System info
edge-embed info

πŸ“„ License

Apache-2.0. All component models are Apache-2.0 or MIT licensed.

πŸ™ Credits

  • Nomic AI β€” nomic-embed-text/vision-v1.5
  • LAION β€” CLAP audio embeddings
  • MCG-NJU β€” VideoMAE
  • MobileQuant β€” Mobile quantization research
  • EdgeVL β€” Edge visual-language models
  • MobileCLIP β€” Architecture inspiration