llm-semantic-router/multi-modal-embed-small

+---
+license: apache-2.0
+language:
+- en
+- multilingual
+library_name: transformers
+tags:
+- sentence-transformers
+- multimodal
+- embeddings
+- image-text
+- retrieval
+- 2DMSE
+- matryoshka
+pipeline_tag: sentence-similarity
+model-index:
+- name: multi-modal-embed-small
+  results:
+  - task:
+      type: image-text-retrieval
+    dataset:
+      name: COCO
+      type: coco
+    metrics:
+    - name: Image-to-Text R@1
+      type: recall_at_1
+      value: 41.88
+    - name: Image-to-Text R@5
+      type: recall_at_5
+      value: 71.64
+    - name: Image-to-Text R@10
+      type: recall_at_10
+      value: 82.16
+  - task:
+      type: sentence-similarity
+    dataset:
+      name: Real-world evaluation
+      type: custom
+    metrics:
+    - name: Text Similarity Separation
+      type: custom
+      value: 0.783
+    - name: Cross-modal Separation
+      type: custom
+      value: 0.504
+---
+# multi-modal-embed-small
+A compact multimodal embedding model that unifies text and image representations in a shared semantic space. Part of the [MoM (Mixture of Models)](https://huggingface.co/llm-semantic-router) family powering vLLM Semantic Router.
+## Model Description
+**multi-modal-embed-small** is a lightweight (~85M parameters) multimodal encoder supporting:
+- **Text encoding** via MiniLM-L6-v2 backbone
+- **Image encoding** via SigLIP-base-patch16-512
+- **Cross-modal fusion** via transformer attention
+- **2DMSE**: Two-Dimensional Matryoshka Sentence Embeddings for adaptive compute
+- **MRL**: Matryoshka Representation Learning for flexible embedding dimensions
+### Key Features
+| Feature | Description |
+|---------|-------------|
+| **Embedding Dimension** | 384 (supports MRL truncation to 32, 64, 128, 256) |
+| **Image Resolution** | 512x512 |
+| **Modalities** | Text, Image, Multimodal fusion |
+| **2DMSE Support** | Early exit at any encoder layer |
+| **Languages** | English (primary), multilingual transfer |
+## Usage
+### Installation
+```bash
+pip install torch transformers pillow safetensors requests
+```
+### Basic Usage
+```python
+import sys
+from io import BytesIO
+
+import requests
+import torch
+from PIL import Image
+
+# MultimodalEmbedder lives in the project's own source tree, so add that
+# checkout to the import path before importing it.
+sys.path.append("path/to/2DMSE-Multimodal-Embedder")
+from src.models import MultimodalEmbedder
+
+# Build the architecture, then load weights from a local checkpoint.
+model = MultimodalEmbedder(
+    text_encoder_name="sentence-transformers/all-MiniLM-L6-v2",
+    image_encoder_name="google/siglip-base-patch16-512",
+    output_dim=384,
+    use_mobile_optimizations=True,
+)
+model.load_state_dict(torch.load("model.pt", map_location="cpu"))
+model.eval()
+```
+### Text Embedding
+```python
+# Single text
+text = "A photo of a cat sitting on a couch"
+text_embedding = model.encode_text(text)  # Shape: [1, 384]
+# Batch of texts
+texts = [
+    "A fluffy orange cat",
+    "A golden retriever dog",
+    "A red sports car",
+]
+text_embeddings = model.encode_text(texts)  # Shape: [3, 384]
+# Compute similarity
+import torch.nn.functional as F
+similarities = F.cosine_similarity(
+    text_embeddings[0:1],
+    text_embeddings[1:],
+    dim=-1
+)
+print(f"Cat vs Dog similarity: {similarities[0]:.3f}")
+print(f"Cat vs Car similarity: {similarities[1]:.3f}")
+```
+### Image Embedding
+```python
+from PIL import Image
+import requests
+from io import BytesIO
+# Load image from URL
+url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
+response = requests.get(url, timeout=30)
+response.raise_for_status()  # fail early on HTTP errors instead of decoding an error page
+image = Image.open(BytesIO(response.content)).convert('RGB')
+# Get embedding
+image_embedding = model.encode_image(image)  # Shape: [1, 384]
+# Or from file
+image = Image.open("my_image.jpg").convert('RGB')
+image_embedding = model.encode_image(image)
+```
+### Cross-Modal Retrieval
+```python
+# Image-to-text retrieval
+image = Image.open("cat.jpg").convert('RGB')
+image_emb = model.encode_image(image)
+captions = [
+    "A cat sleeping on a bed",
+    "A dog playing in the park",
+    "A car driving on the highway",
+    "A fluffy feline resting",
+]
+text_embs = model.encode_text(captions)
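+# Find the most similar caption (F is torch.nn.functional, imported in the text example).
+# image_emb [1, 384] broadcasts against text_embs [4, 384], giving one score per caption.
+similarities = F.cosine_similarity(image_emb, text_embs, dim=-1)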
+best_match_idx = similarities.argmax().item()
+print(f"Best match: {captions[best_match_idx]}")
+print(f"Similarity: {similarities[best_match_idx]:.3f}")
+```
+### Matryoshka Dimension Reduction (MRL)
+```python
+# Get full 384-dim embedding
+full_emb = model.encode_text("Hello world")  # [1, 384]
+# Truncate to smaller dimensions (MRL)
+emb_256 = full_emb[:, :256]  # 256-dim, ~1.5x faster retrieval
+emb_128 = full_emb[:, :128]  # 128-dim, ~3x faster retrieval
+emb_64 = full_emb[:, :64]    # 64-dim, ~6x faster retrieval
+# Normalize after truncation
+emb_128_norm = F.normalize(emb_128, p=2, dim=-1)
+```
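
Truncation keeps only the leading coordinates, so the shortened vector is no longer unit length; that is why the snippet above re-normalizes. A minimal sketch of a helper that does both steps, using random tensors in place of real model output. The name `truncate_embedding` is hypothetical and not part of the model's API.

```python
import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = (32, 64, 128, 256, 384)  # dimensions listed in the model card


def truncate_embedding(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` coordinates and rescale each row to unit L2 norm.

    Hypothetical helper for illustration; not part of the model's API.
    """
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    return F.normalize(emb[..., :dim], p=2, dim=-1)


# Random stand-ins for model output: 4 unit-norm embeddings of size 384.
torch.manual_seed(0)
full = F.normalize(torch.randn(4, 384), dim=-1)

small = truncate_embedding(full, 128)
print(small.shape)         # torch.Size([4, 128])
print(small.norm(dim=-1))  # every entry is approximately 1.0
```

Restricting `dim` to the trained sizes is a design choice of this sketch: slicing at an arbitrary width works mechanically, but only the listed dimensions are the ones the card reports results for.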
+### 2DMSE Adaptive Layer Exit
+```python
+# Full model (all layers) - highest quality
+full_emb = model.encode_text("Complex query", target_layer=None)
+# Early exit at layer 3 (~50% compute) - faster
+early_emb = model.encode_text("Simple query", target_layer=3)
+# Even earlier exit (layer 1) - fastest
+fastest_emb = model.encode_text("Quick lookup", target_layer=1)
+```
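
For intuition, early exit amounts to running only the first few encoder layers and pooling whatever hidden state comes out. The sketch below uses a small, randomly initialised PyTorch encoder rather than the released model. `ToyEarlyExitEncoder`, its mean pooling, and its 1-based `target_layer` convention are assumptions made for this example; the model's actual layer indexing and pooling may differ.

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEarlyExitEncoder(nn.Module):
    """Randomly initialised stand-in encoder; NOT the released model."""

    def __init__(self, hidden: int = 384, num_layers: int = 6, heads: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            [
                nn.TransformerEncoderLayer(
                    d_model=hidden,
                    nhead=heads,
                    dim_feedforward=4 * hidden,
                    batch_first=True,
                )
                for _ in range(num_layers)
            ]
        )

    def forward(self, x: torch.Tensor, target_layer: Optional[int] = None) -> torch.Tensor:
        # Assumption: target_layer counts layers from 1; None runs every layer.
        depth = len(self.layers) if target_layer is None else target_layer
        if not 1 <= depth <= len(self.layers):
            raise ValueError(f"target_layer must be in 1..{len(self.layers)}")
        for layer in self.layers[:depth]:
            x = layer(x)
        # Mean-pool over the token axis, then L2-normalise the result.
        return F.normalize(x.mean(dim=1), dim=-1)


torch.manual_seed(0)
encoder = ToyEarlyExitEncoder().eval()
tokens = torch.randn(2, 10, 384)  # (batch, tokens, hidden)
with torch.no_grad():
    full = encoder(tokens)                   # all 6 layers
    early = encoder(tokens, target_layer=3)  # first 3 layers only
print(full.shape, early.shape)
```

Both calls return embeddings of the same width, so downstream retrieval code does not need to know which depth produced them.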
+### Multimodal Fusion
+```python
+# Combine text and image for richer representation
+image = Image.open("cat.jpg").convert('RGB')
+text = "A cute pet"
+fused_embedding = model.encode_multimodal(
+    texts=text,
+    images=image
+)  # Shape: [1, 384]
+```
+## Training
+### Architecture
+```
+┌─────────────────────────────────────────────────────────────┐
+│                   multi-modal-embed-small                   │
+├─────────────────────────────────────────────────────────────┤
+│  Text Encoder: MiniLM-L6-v2 (22M params)                    │
+│  Image Encoder: SigLIP-base-patch16-512 (86M params)        │
+│  Fusion: 2-layer Transformer                                │
+│  Output: 384-dim normalized embeddings                      │
+├─────────────────────────────────────────────────────────────┤
+│  2DMSE: Layer 0-5 early exit support                        │
+│  MRL: 32, 64, 128, 256, 384 dim truncation                  │
+└─────────────────────────────────────────────────────────────┘
+```
+### Training Data
+- **LLaVA-CC3M**: 595K image-caption pairs
+- **COCO Captions**: Validation on 25K pairs
+### Training Configuration
+- **Hardware**: 8x AMD MI300X GPUs
+- **Precision**: BF16 mixed precision
+- **Batch Size**: 256 per GPU (2048 effective)
+- **Optimizer**: AdamW
+- **Learning Rate**: 1e-4 with cosine decay
+- **Loss**: InfoNCE contrastive + Matryoshka loss
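
The loss entry above names two ingredients. One common way to combine them is to compute a symmetric InfoNCE loss at each Matryoshka dimension and average the results. The sketch below shows that combination on random tensors. The temperature of 0.07, the equal weighting across dimensions, and the function names are assumptions, not the training code used for this model.

```python
import torch
import torch.nn.functional as F


def info_nce(img: torch.Tensor, txt: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch where img[i] and txt[i] are a matched pair."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.T / temperature  # (batch, batch) scaled cosine similarities
    targets = torch.arange(img.size(0), device=img.device)  # matches sit on the diagonal
    image_to_text = F.cross_entropy(logits, targets)
    text_to_image = F.cross_entropy(logits.T, targets)
    return 0.5 * (image_to_text + text_to_image)


def matryoshka_info_nce(
    img: torch.Tensor,
    txt: torch.Tensor,
    dims=(32, 64, 128, 256, 384),
    temperature: float = 0.07,
) -> torch.Tensor:
    """Average the InfoNCE loss over each truncated prefix (assumed equal weights)."""
    losses = [info_nce(img[:, :d], txt[:, :d], temperature) for d in dims]
    return torch.stack(losses).mean()


# Random stand-ins for a batch of 8 image and text embeddings.
torch.manual_seed(0)
img_emb = torch.randn(8, 384, requires_grad=True)
txt_emb = torch.randn(8, 384, requires_grad=True)

loss = matryoshka_info_nce(img_emb, txt_emb)
loss.backward()  # gradients reach every prefix of the embedding
print(loss.item())
```

Because every prefix contributes its own term, the leading coordinates are pushed to be useful on their own, which is what makes later truncation viable.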
+### Training Stages
+1. **Stage 1** (Frozen encoders): Align image-text space, 6 epochs
+2. **Stage 2** (Partial unfreeze): Fine-tune fusion + top encoder layers
+3. **Stage 3** (Full unfreeze): End-to-end fine-tuning
+## Evaluation
+### Image-Text Retrieval (COCO Validation)
+| Metric | Image→Text | Text→Image |
+|--------|------------|------------|
+| R@1    | 41.88%     | 39.21%     |
+| R@5    | 71.64%     | 69.15%     |
+| R@10   | 82.16%     | 80.02%     |
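
Recall@K in the table is the share of queries whose correct match appears among the K highest-scoring candidates. A minimal sketch on a hand-made similarity matrix, assuming exactly one correct candidate per query at the same index. COCO has several captions per image, which a real evaluation has to account for, and `recall_at_k` is an illustrative name.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose matching candidate (same index) ranks in the top k."""
    hits = 0
    for query_idx, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Rows are image queries, columns are caption candidates (toy values).
similarity = [
    [0.9, 0.1, 0.3],  # correct caption ranked 1st
    [0.2, 0.4, 0.7],  # correct caption ranked 2nd
    [0.6, 0.5, 0.1],  # correct caption ranked 3rd
]
for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(similarity, k):.3f}")
```

Transposing the matrix and calling the same function gives the Text→Image direction.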
+### Text Semantic Similarity
+| Pair Type | Similarity |
+|-----------|------------|
+| Positive (similar) | 0.805 |
+| Negative (different) | 0.022 |
+| **Separation** | **0.783** |
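
Separation in this table is the mean similarity of positive pairs minus the mean similarity of negative pairs: 0.805 - 0.022 = 0.783. A small sketch of that arithmetic, assuming cosine similarity as in the usage examples. The per-pair scores are made-up placeholders, not the evaluation data.

```python
from statistics import mean
from typing import Sequence


def separation(positive_sims: Sequence[float], negative_sims: Sequence[float]) -> float:
    """Mean similarity of matching pairs minus mean similarity of non-matching pairs."""
    return mean(positive_sims) - mean(negative_sims)


# Placeholder cosine similarities, NOT the evaluation data.
positive = [0.9, 0.8, 0.7]
negative = [0.1, 0.0, -0.1]
gap = separation(positive, negative)
print(f"separation = {gap:.3f}")

# The table's aggregate row follows the same arithmetic.
print(f"{0.805 - 0.022:.3f}")  # 0.783
```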
+### Cross-Modal Retrieval (Real-world test)
+| Direction | R@1 | R@5 | MRR |
+|-----------|-----|-----|-----|
+| Image→Text | 87.5% | 100% | 0.94 |
+| Text→Image | 87.5% | 100% | 0.94 |
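
MRR averages the reciprocal of the rank at which the correct item appears for each query. The sketch below computes it from a list of ranks. One rank distribution that would reproduce the table's figures is eight queries with seven ranked first and one ranked second; the actual test-set size is not stated, so treat the `ranks` list as hypothetical.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """ranks holds the 1-based rank of the correct item for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k_from_ranks(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct item is ranked k or better."""
    return sum(r <= k for r in ranks) / len(ranks)


# Hypothetical ranks: seven queries answered at rank 1, one at rank 2.
ranks = [1, 1, 1, 1, 1, 1, 1, 2]
print(recall_at_k_from_ranks(ranks, 1))  # 0.875
print(recall_at_k_from_ranks(ranks, 5))  # 1.0
print(mean_reciprocal_rank(ranks))       # 0.9375
```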
+### MRL Quality Retention (Matryoshka)
+| Dimension | Compression | Separation |
+|-----------|-------------|------------|
+| 384 (full)| 1x          | 1.024      |
+| 256       | 1.5x        | 1.038      |
+| 128       | 3x          | 0.889      |
+| 64        | 6x          | 0.839      |
+| 32        | 12x         | 0.889      |
+## Limitations
+- Optimized for English; multilingual performance may vary
+- Image resolution fixed at 512x512
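+- An audio encoder (`openai/whisper-tiny`) is referenced in `config.json`, but audio was not trained in this release and is not listed among the supported modalities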
+- Best for semantic similarity, not generative tasks
+## Citation
+```bibtex
+@misc{multi-modal-embed-small,
+  title={multi-modal-embed-small: Compact Multimodal Embeddings with 2DMSE},
+  author={vLLM Semantic Router Team},
+  year={2026},
+  url={https://huggingface.co/llm-semantic-router/multi-modal-embed-small}
+}
+```
+## License
+Apache 2.0
+## Related Models
+- [mmbert-embed-32k-2d-matryoshka](https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka) - Long context variant
+- [mmbert-embed-finance](https://huggingface.co/llm-semantic-router/mmbert-embed-finance) - Finance domain
+- [mmbert-embed-medical](https://huggingface.co/llm-semantic-router/mmbert-embed-medical) - Medical domain

config.json ADDED

	@@ -0,0 +1,27 @@

+{
+  "_name_or_path": "llm-semantic-router/multi-modal-embed-small",
+  "architectures": [
+    "MultimodalEmbedder"
+  ],
+  "model_type": "mmbert",
+  "output_dim": 384,
+  "text_encoder_name": "sentence-transformers/all-MiniLM-L6-v2",
+  "image_encoder_name": "google/siglip-base-patch16-512",
+  "audio_encoder_name": "openai/whisper-tiny",
+  "fusion_type": "transformer",
+  "num_fusion_layers": 2,
+  "enable_layer_outputs": true,
+  "use_mobile_optimizations": true,
+  "matryoshka_dims": [
+    32,
+    64,
+    128,
+    256,
+    384
+  ],
+  "supported_modalities": [
+    "text",
+    "image",
+    "multimodal"
+  ]
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:aced484d5e4736120dcb9f41fe33e9751fc77a076572311d86f691b87a64c394
+size 1350323576

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:609c166182116db34188892e1930c30bf7cd31d2b679369dfa61694c21e299c3
+size 976407151