llm-semantic-router/multi-modal-embed-small

@@ -2,13 +2,13 @@
 license: apache-2.0
 language:
 - en
-- multilingual
 library_name: transformers
 tags:
 - sentence-transformers
 - multimodal
 - embeddings
 - image-text
 - retrieval
 - 2DMSE
 - matryoshka
@@ -32,30 +32,34 @@ model-index:
       type: recall_at_10
       value: 82.16
   - task:
-      type: sentence-similarity
     dataset:
-      name: Real-world evaluation
-      type: custom
     metrics:
-    - name: Text Similarity Separation
-      type: custom
-      value: 0.783
-    - name: Cross-modal Separation
-      type: custom
-      value: 0.504
 ---
 # multi-modal-embed-small
-A compact multimodal embedding model that unifies text and image representations in a shared semantic space. Part of the [MoM (Mixture of Models)](https://huggingface.co/llm-semantic-router) family powering vLLM Semantic Router.
 ## Model Description
-**multi-modal-embed-small** is a lightweight (~85M parameters) multimodal encoder supporting:
-- **Text encoding** via MiniLM-L6-v2 backbone
-- **Image encoding** via SigLIP-base-patch16-512
-- **Cross-modal fusion** via transformer attention
 - **2DMSE**: Two-Dimensional Matryoshka Sentence Embeddings for adaptive compute
 - **MRL**: Matryoshka Representation Learning for flexible embedding dimensions
@@ -64,69 +68,64 @@ A compact multimodal embedding model that unifies text and image representations
 | Feature | Description |
 |---------|-------------|
 | **Embedding Dimension** | 384 (supports MRL truncation to 32, 64, 128, 256) |
-| **Image Resolution** | 512x512 |
-| **Modalities** | Text, Image, Multimodal fusion |
 | **2DMSE Support** | Early exit at any encoder layer |
-| **Languages** | English (primary), multilingual transfer |
-## Usage
-### Installation
 ```bash
 pip install torch transformers pillow safetensors
 ```
-### Basic Usage
 ```python
 import torch
-from PIL import Image
-import requests
-from io import BytesIO
-# Load model
-from transformers import AutoModel, AutoProcessor
-# Or load from local checkpoint
 import sys
 sys.path.append("path/to/2DMSE-Multimodal-Embedder")
-from src.models import MultimodalEmbedder
-model = MultimodalEmbedder(
     text_encoder_name="sentence-transformers/all-MiniLM-L6-v2",
     image_encoder_name="google/siglip-base-patch16-512",
     output_dim=384,
-    use_mobile_optimizations=True,
 )
-model.load_state_dict(torch.load("model.pt", map_location="cpu"))
 model.eval()
 ```
 ### Text Embedding
 ```python
 # Single text
-text = "A photo of a cat sitting on a couch"
-text_embedding = model.encode_text(text)  # Shape: [1, 384]
 # Batch of texts
-texts = [
-    "A fluffy orange cat",
-    "A golden retriever dog",
-    "A red sports car",
-]
 text_embeddings = model.encode_text(texts)  # Shape: [3, 384]
 # Compute similarity
-import torch.nn.functional as F
-similarities = F.cosine_similarity(
-    text_embeddings[0:1],
-    text_embeddings[1:],
-    dim=-1
-)
-print(f"Cat vs Dog similarity: {similarities[0]:.3f}")
-print(f"Cat vs Car similarity: {similarities[1]:.3f}")
 ```
 ### Image Embedding
@@ -136,17 +135,26 @@ from PIL import Image
 import requests
 from io import BytesIO
-# Load image from URL
 url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
-response = requests.get(url)
-image = Image.open(BytesIO(response.content)).convert('RGB')
 # Get embedding
 image_embedding = model.encode_image(image)  # Shape: [1, 384]
-# Or from file
-image = Image.open("my_image.jpg").convert('RGB')
-image_embedding = model.encode_image(image)
 ```
 ### Cross-Modal Retrieval
@@ -158,98 +166,73 @@ image_emb = model.encode_image(image)
 captions = [
     "A cat sleeping on a bed",
-    "A dog playing in the park",
     "A car driving on the highway",
-    "A fluffy feline resting",
 ]
 text_embs = model.encode_text(captions)
-# Find most similar caption
 similarities = F.cosine_similarity(image_emb, text_embs)
-best_match_idx = similarities.argmax().item()
-print(f"Best match: {captions[best_match_idx]}")
-print(f"Similarity: {similarities[best_match_idx]:.3f}")
 ```
-### Matryoshka Dimension Reduction (MRL)
 ```python
-# Get full 384-dim embedding
 full_emb = model.encode_text("Hello world")  # [1, 384]
-# Truncate to smaller dimensions (MRL)
-emb_256 = full_emb[:, :256]  # 256-dim, ~1.5x faster retrieval
-emb_128 = full_emb[:, :128]  # 128-dim, ~3x faster retrieval
-emb_64 = full_emb[:, :64]    # 64-dim, ~6x faster retrieval
-# Normalize after truncation
-emb_128_norm = F.normalize(emb_128, p=2, dim=-1)
 ```
-### 2DMSE Adaptive Layer Exit
-```python
-# Full model (all layers) - highest quality
-full_emb = model.encode_text("Complex query", target_layer=None)
-# Early exit at layer 3 (~50% compute) - faster
-early_emb = model.encode_text("Simple query", target_layer=3)
-# Even earlier exit (layer 1) - fastest
-fastest_emb = model.encode_text("Quick lookup", target_layer=1)
 ```
-### Multimodal Fusion
-```python
-# Combine text and image for richer representation
-image = Image.open("cat.jpg").convert('RGB')
-text = "A cute pet"
-fused_embedding = model.encode_multimodal(
-    texts=text,
-    images=image
-)  # Shape: [1, 384]
 ```
 ## Training
-### Architecture
-```
-┌─────────────────────────────────────────────────────────────┐
-│                   multi-modal-embed-small                        │
-├─────────────────────────────────────────────────────────────┤
-│  Text Encoder: MiniLM-L6-v2 (22M params)                   │
-│  Image Encoder: SigLIP-base-patch16-512 (86M params)       │
-│  Fusion: 2-layer Transformer                                │
-│  Output: 384-dim normalized embeddings                      │
-├─────────────────────────────────────────────────────────────┤
-│  2DMSE: Layer 0-5 early exit support                       │
-│  MRL: 32, 64, 128, 256, 384 dim truncation                 │
-└─────────────────────────────────────────────────────────────┘
-```
-### Training Data
-- **LLaVA-CC3M**: 595K image-caption pairs
-- **COCO Captions**: Validation on 25K pairs
 ### Training Configuration
-- **Hardware**: 8x AMD MI300X GPUs
 - **Precision**: BF16 mixed precision
-- **Batch Size**: 256 per GPU (2048 effective)
 - **Optimizer**: AdamW
-- **Learning Rate**: 1e-4 with cosine decay
 - **Loss**: InfoNCE contrastive + Matryoshka loss
-### Training Stages
-1. **Stage 1** (Frozen encoders): Align image-text space, 6 epochs
-2. **Stage 2** (Partial unfreeze): Fine-tune fusion + top encoder layers
-3. **Stage 4** (Full unfreeze): End-to-end fine-tuning
 ## Evaluation
 ### Image-Text Retrieval (COCO Validation)
@@ -260,61 +243,36 @@ fused_embedding = model.encode_multimodal(
 | R@5    | 71.64%     | 69.15%     |
 | R@10   | 82.16%     | 80.02%     |
-### Text Semantic Similarity
-| Pair Type | Similarity |
-|-----------|------------|
-| Positive (similar) | 0.805 |
-| Negative (different) | 0.022 |
-| **Separation** | **0.783** |
-### Cross-Modal Retrieval (Real-world test)
-| Direction | R@1 | R@5 | MRR |
-|-----------|-----|-----|-----|
-| Image→Text | 87.5% | 100% | 0.94 |
-| Text→Image | 87.5% | 100% | 0.94 |
-### MRL Quality Retention (Matryoshka)
-| Dimension | Compression | Separation |
-|-----------|-------------|------------|
-| 384 (full)| 1x          | 1.024      |
-| 256       | 1.5x        | 1.038      |
-| 128       | 3x          | 0.889      |
-| 64        | 6x          | 0.839      |
-| 32        | 12x         | 0.889      |
 ## Limitations
-- Optimized for English; multilingual performance may vary
-- Image resolution fixed at 512x512
-- Audio encoder included but not yet trained (see Roadmap)
-- Best for semantic similarity, not generative tasks
-## Roadmap
-### Audio Modality Training (Planned)
-The model architecture includes a Whisper audio encoder, but this release only trained on image-text data. Future releases will add audio-text alignment using:
-| Dataset | Size | Source | Paper |
-|---------|------|--------|-------|
-| [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) | 403K clips | HuggingFace (CVSSP, University of Surrey) | [arXiv:2303.17395](https://arxiv.org/abs/2303.17395) |
-| [AudioCaps](https://github.com/cdjkim/audiocaps) | 46K clips | GitHub (Seoul National University) | [NAACL-HLT 2019](https://aclanthology.org/N19-1011/) |
-| [Clotho](https://zenodo.org/records/3490684) | 6K clips | Zenodo (Tampere University) | [ICASSP 2020](https://ieeexplore.ieee.org/document/9052990) |
-This will enable:
-- Audio-to-text retrieval
-- Text-to-audio retrieval
-- Audio-image-text multimodal fusion
 ## Citation
 ```bibtex
-@misc{multi-modal-embed-small,
   title={multi-modal-embed-small: Compact Multimodal Embeddings with 2DMSE},
-  author={vLLM Semantic Router Team},
   year={2026},
   url={https://huggingface.co/llm-semantic-router/multi-modal-embed-small}
 }
@@ -323,9 +281,3 @@ This will enable:
 ## License
 Apache 2.0
-## Related Models
-- [mmbert-embed-32k-2d-matryoshka](https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka) - Long context variant
-- [mmbert-embed-finance](https://huggingface.co/llm-semantic-router/mmbert-embed-finance) - Finance domain
-- [mmbert-embed-medical](https://huggingface.co/llm-semantic-router/mmbert-embed-medical) - Medical domain

 license: apache-2.0
 language:
 - en
 library_name: transformers
 tags:
 - sentence-transformers
 - multimodal
 - embeddings
 - image-text
+- audio-text
 - retrieval
 - 2DMSE
 - matryoshka
       type: recall_at_10
       value: 82.16
   - task:
+      type: audio-text-retrieval
     dataset:
+      name: LibriSpeech
+      type: librispeech
     metrics:
+    - name: Audio-to-Text R@1
+      type: recall_at_1
+      value: 36.38
+    - name: Audio-to-Text R@5
+      type: recall_at_5
+      value: 68.22
+    - name: Audio-to-Text R@10
+      type: recall_at_10
+      value: 79.52
 ---
 # multi-modal-embed-small
+A compact multimodal embedding model that unifies text, image, and audio representations in a shared semantic space. Part of the [MoM (Mixture of Models)](https://huggingface.co/llm-semantic-router) family.
 ## Model Description
+**multi-modal-embed-small** is a lightweight multimodal encoder (~250M parameters) supporting:
+- **Text encoding** via MiniLM-L6-v2 (22M params)
+- **Image encoding** via SigLIP-base-patch16-512 (86M params)
+- **Audio encoding** via Whisper-tiny encoder (39M params)
+- **Cross-modal fusion** via 2-layer transformer attention
 - **2DMSE**: Two-Dimensional Matryoshka Sentence Embeddings for adaptive compute
 - **MRL**: Matryoshka Representation Learning for flexible embedding dimensions
 | Feature | Description |
 |---------|-------------|
 | **Embedding Dimension** | 384 (supports MRL truncation to 32, 64, 128, 256) |
+| **Image Resolution** | 512×512 |
+| **Audio Input** | Up to 30s, 16kHz (Whisper Mel spectrogram) |
+| **Modalities** | Text, Image, Audio, Multimodal fusion |
 | **2DMSE Support** | Early exit at any encoder layer |
+| **Languages** | English |
+## Installation
 ```bash
 pip install torch transformers pillow safetensors
 ```
+## Usage
+### Load Model
 ```python
 import torch
+from huggingface_hub import hf_hub_download
+# Download checkpoint
+checkpoint_path = hf_hub_download(
+    repo_id="llm-semantic-router/multi-modal-embed-small",
+    filename="model.pt"
+)
+# Load model
 import sys
 sys.path.append("path/to/2DMSE-Multimodal-Embedder")
+from src.models import create_multimodal_model
+model = create_multimodal_model(
     text_encoder_name="sentence-transformers/all-MiniLM-L6-v2",
     image_encoder_name="google/siglip-base-patch16-512",
+    audio_encoder_name="openai/whisper-tiny",
     output_dim=384,
 )
+state_dict = torch.load(checkpoint_path, map_location="cpu")
+model.load_state_dict(state_dict["model_state_dict"])
 model.eval()
 ```
 ### Text Embedding
 ```python
+import torch.nn.functional as F
 # Single text
+text_embedding = model.encode_text("A photo of a cat")  # Shape: [1, 384]
 # Batch of texts
+texts = ["A fluffy orange cat", "A golden retriever dog", "A red sports car"]
 text_embeddings = model.encode_text(texts)  # Shape: [3, 384]
 # Compute similarity
+similarities = F.cosine_similarity(text_embeddings[0:1], text_embeddings[1:], dim=-1)
+print(f"Cat vs Dog: {similarities[0]:.3f}")
+print(f"Cat vs Car: {similarities[1]:.3f}")
 ```
 ### Image Embedding
 import requests
 from io import BytesIO
+# Load image
 url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
+image = Image.open(BytesIO(requests.get(url).content)).convert('RGB')
 # Get embedding
 image_embedding = model.encode_image(image)  # Shape: [1, 384]
+```
+### Audio Embedding
+```python
+import torchaudio
+# Load audio (16kHz)
+waveform, sample_rate = torchaudio.load("speech.wav")
+if sample_rate != 16000:
+    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
+# Get embedding
+audio_embedding = model.encode_audio(waveform)  # Shape: [1, 384]
 ```
 ### Cross-Modal Retrieval
 captions = [
     "A cat sleeping on a bed",
+    "A dog playing in the park",
     "A car driving on the highway",
 ]
 text_embs = model.encode_text(captions)
 similarities = F.cosine_similarity(image_emb, text_embs)
+best_idx = similarities.argmax().item()
+print(f"Best match: {captions[best_idx]} ({similarities[best_idx]:.3f})")
 ```
+### Matryoshka Dimension Reduction
 ```python
+# Full 384-dim embedding
 full_emb = model.encode_text("Hello world")  # [1, 384]
+# Truncate to smaller dimensions
+emb_256 = F.normalize(full_emb[:, :256], p=2, dim=-1)  # 1.5x faster retrieval
+emb_128 = F.normalize(full_emb[:, :128], p=2, dim=-1)  # 3x faster retrieval
+emb_64 = F.normalize(full_emb[:, :64], p=2, dim=-1)    # 6x faster retrieval
 ```
+## Architecture
 ```
+┌──────────────────────────────────────────────────────────────┐
+│                  multi-modal-embed-small                     │
+├──────────────────────────────────────────────────────────────┤
+│  Text Encoder:  MiniLM-L6-v2           (22M params, 6 layers)│
+│  Image Encoder: SigLIP-base-patch16-512 (86M params)         │
+│  Audio Encoder: Whisper-tiny encoder    (39M params, 4 layers)│
+│  Fusion:        2-layer Transformer                          │
+├──────────────────────────────────────────────────────────────┤
+│  Output: 384-dim normalized embeddings                       │
+│  2DMSE:  Layer 0-5 early exit support                        │
+│  MRL:    32, 64, 128, 256, 384 dim truncation                │
+└──────────────────────────────────────────────────────────────┘
 ```
 ## Training
+### Training Data
+| Modality | Dataset | Samples | Purpose |
+|----------|---------|---------|---------|
+| Image-Text | LLaVA-CC3M | 595K | Image-text alignment |
+| Image-Text | COCO Captions | 25K | Validation |
+| Audio-Text | LibriSpeech | 105K | Audio-text alignment |
+### Training Stages
+| Stage | Description | Trainable | Epochs |
+|-------|-------------|-----------|--------|
+| 1 | Initial alignment | Projection layers only | 6 |
+| 2 | Partial unfreeze | Top encoder layers + projections | 3 |
+| 4 | Full image-text | All image/text parameters | 3 |
+| 5 | Audio alignment | Audio encoder (text/image frozen) | 5 |
 ### Training Configuration
+- **Hardware**: 8× AMD MI300X GPUs
 - **Precision**: BF16 mixed precision
+- **Batch Size**: 64 per GPU (512 effective)
 - **Optimizer**: AdamW
+- **Learning Rate**: 1e-4 → 5e-5 (stage dependent)
 - **Loss**: InfoNCE contrastive + Matryoshka loss
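
The card names the loss as "InfoNCE contrastive + Matryoshka loss" without giving its form. A minimal, dependency-free sketch of one common way to combine the two is shown here. The temperature of 0.07, the symmetric averaging over both retrieval directions, the equal weighting of every truncation width, and all function names are assumptions for illustration, not details taken from the training code.

```python
import math

def l2_normalize(vec):
    # Guard against a zero vector so the division is always defined.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching pair i against every key in the batch."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)

def matryoshka_info_nce(a_embs, b_embs, dims=(32, 64, 128, 256, 384), temperature=0.07):
    """Average a symmetric InfoNCE over truncated, re-normalized prefixes."""
    losses = []
    for d in dims:
        a = [l2_normalize(v[:d]) for v in a_embs]
        b = [l2_normalize(v[:d]) for v in b_embs]
        losses.append(0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature)))
    return sum(losses) / len(losses)

# Toy 4-dim embeddings with prefix widths (2, 4) standing in for (32, ..., 384).
a = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
b_aligned = [[0.9, 0.1, 0.2, 0.0], [0.1, 0.9, 0.0, 0.2]]
b_shuffled = list(reversed(b_aligned))

aligned_loss = matryoshka_info_nce(a, b_aligned, dims=(2, 4))
shuffled_loss = matryoshka_info_nce(a, b_shuffled, dims=(2, 4))
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

Correctly paired batches should score a lower loss than shuffled ones at every prefix width, which is what pushes the leading dimensions to carry most of the signal. A real training loop would compute the same quantity on batched tensors.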
 ## Evaluation
 ### Image-Text Retrieval (COCO Validation)
 | R@5    | 71.64%     | 69.15%     |
 | R@10   | 82.16%     | 80.02%     |
+### Audio-Text Retrieval (LibriSpeech)
+| Metric | Audio→Text |
+|--------|------------|
+| R@1    | 36.38%     |
+| R@5    | 68.22%     |
+| R@10   | 79.52%     |
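
For readers unfamiliar with the R@K columns, the sketch below shows how recall at K is typically computed from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located at the same index as the query; the card does not state the evaluation protocol, so this pairing and the function name `recall_at_k` are illustrative assumptions.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item (same index) ranks in the top k."""
    hits = 0
    for query_idx, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy 3x3 matrix: rows are audio queries, columns are text candidates.
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct candidate ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct candidate ranked 2nd
    [0.4, 0.6, 0.1],  # query 2: correct candidate ranked 3rd
]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(sim, k):.2%}")
```

In practice the matrix would hold cosine similarities between `encode_audio` and `encode_text` outputs over the whole evaluation set.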
+### MRL Quality Retention
+| Dimension | Compression | Quality |
+|-----------|-------------|---------|
+| 384 (full)| 1×          | 100%    |
+| 256       | 1.5×        | ~98%    |
+| 128       | 3×          | ~95%    |
+| 64        | 6×          | ~90%    |
 ## Limitations
+- English language only
+- Image resolution fixed at 512×512
+- Audio limited to 30 seconds
+- Best for retrieval/similarity, not generation
 ## Citation
 ```bibtex
+@misc{multi-modal-embed-small-2026,
   title={multi-modal-embed-small: Compact Multimodal Embeddings with 2DMSE},
+  author={Semantic Router Team},
   year={2026},
   url={https://huggingface.co/llm-semantic-router/multi-modal-embed-small}
 }
 ## License
 Apache 2.0

config.json CHANGED

@@ -1,27 +1,9 @@
 {
-  "_name_or_path": "llm-semantic-router/multi-modal-embed-small",
-  "architectures": [
-    "MultimodalEmbedder"
-  ],
-  "model_type": "mmbert",
   "output_dim": 384,
   "text_encoder_name": "sentence-transformers/all-MiniLM-L6-v2",
   "image_encoder_name": "google/siglip-base-patch16-512",
   "audio_encoder_name": "openai/whisper-tiny",
   "fusion_type": "transformer",
   "num_fusion_layers": 2,
-  "enable_layer_outputs": true,
-  "use_mobile_optimizations": true,
-  "matryoshka_dims": [
-    32,
-    64,
-    128,
-    256,
-    384
-  ],
-  "supported_modalities": [
-    "text",
-    "image",
-    "multimodal"
-  ]
 }

 {
   "output_dim": 384,
   "text_encoder_name": "sentence-transformers/all-MiniLM-L6-v2",
   "image_encoder_name": "google/siglip-base-patch16-512",
   "audio_encoder_name": "openai/whisper-tiny",
   "fusion_type": "transformer",
   "num_fusion_layers": 2,
+  "enable_layer_outputs": true
 }

model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a4e280a185550651d299dfcd10df7e2cd02629c2f0c0b0964122daabe723ef4b
+size 976407151
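
The three added lines are a Git LFS pointer: the spec version, the SHA-256 object ID of the real file, and its size in bytes. As a hedged sketch, a downloaded `model.pt` could be checked against this pointer as follows. The helper names `parse_lfs_pointer` and `verify_against_pointer` are hypothetical and use only the Python standard library.

```python
import hashlib
import os

def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }

def verify_against_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file's size and hash both match the pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as f:
        # Read in chunks so a ~1 GB checkpoint is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]

POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:a4e280a185550651d299dfcd10df7e2cd02629c2f0c0b0964122daabe723ef4b
size 976407151
"""

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Calling `verify_against_pointer(checkpoint_path, pointer)` on the path returned by `hf_hub_download` would then confirm that the checkpoint arrived intact before it is passed to `torch.load`.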