HuaminChen committed on
Commit a82f367 · verified · 1 Parent(s): 2a321ff

Upload multi-modal-embed-small Stage 2 model

Files changed (4)
  1. README.md +314 -0
  2. config.json +27 -0
  3. model.safetensors +3 -0
  4. pytorch_model.bin +3 -0
README.md ADDED
@@ -0,0 +1,314 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - multilingual
+ library_name: transformers
+ tags:
+ - sentence-transformers
+ - multimodal
+ - embeddings
+ - image-text
+ - retrieval
+ - 2DMSE
+ - matryoshka
+ pipeline_tag: sentence-similarity
+ model-index:
+ - name: multi-modal-embed-small
+   results:
+   - task:
+       type: image-text-retrieval
+     dataset:
+       name: COCO
+       type: coco
+     metrics:
+     - name: Image-to-Text R@1
+       type: recall_at_1
+       value: 41.88
+     - name: Image-to-Text R@5
+       type: recall_at_5
+       value: 71.64
+     - name: Image-to-Text R@10
+       type: recall_at_10
+       value: 82.16
+   - task:
+       type: sentence-similarity
+     dataset:
+       name: Real-world evaluation
+       type: custom
+     metrics:
+     - name: Text Similarity Separation
+       type: custom
+       value: 0.783
+     - name: Cross-modal Separation
+       type: custom
+       value: 0.504
+ ---
+
+ # multi-modal-embed-small
+
+ A compact multimodal embedding model that unifies text and image representations in a shared semantic space. Part of the [MoM (Mixture of Models)](https://huggingface.co/llm-semantic-router) family powering vLLM Semantic Router.
+
+ ## Model Description
+
+ **multi-modal-embed-small** is a lightweight multimodal encoder (over 100M parameters: a 22M text encoder plus an 86M image encoder and a 2-layer fusion transformer) supporting:
+
+ - **Text encoding** via MiniLM-L6-v2 backbone
+ - **Image encoding** via SigLIP-base-patch16-512
+ - **Cross-modal fusion** via transformer attention
+ - **2DMSE**: Two-Dimensional Matryoshka Sentence Embeddings for adaptive compute
+ - **MRL**: Matryoshka Representation Learning for flexible embedding dimensions
+
+ ### Key Features
+
+ | Feature | Description |
+ |---------|-------------|
+ | **Embedding Dimension** | 384 (supports MRL truncation to 32, 64, 128, 256) |
+ | **Image Resolution** | 512x512 |
+ | **Modalities** | Text, Image, Multimodal fusion |
+ | **2DMSE Support** | Early exit at any encoder layer |
+ | **Languages** | English (primary), multilingual transfer |
+
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install torch transformers pillow safetensors
+ ```
+
+ ### Basic Usage
+
+ ```python
+ import torch
+ from PIL import Image
+ import requests
+ from io import BytesIO
+
+ # Load model
+ from transformers import AutoModel, AutoProcessor
+
+ # Or load from local checkpoint
+ import sys
+ sys.path.append("path/to/2DMSE-Multimodal-Embedder")
+ from src.models import MultimodalEmbedder
+
+ model = MultimodalEmbedder(
+     text_encoder_name="sentence-transformers/all-MiniLM-L6-v2",
+     image_encoder_name="google/siglip-base-patch16-512",
+     output_dim=384,
+     use_mobile_optimizations=True,
+ )
+ model.load_state_dict(torch.load("model.pt", map_location="cpu"))
+ model.eval()
+ ```
+
+ ### Text Embedding
+
+ ```python
+ # Single text
+ text = "A photo of a cat sitting on a couch"
+ text_embedding = model.encode_text(text)  # Shape: [1, 384]
+
+ # Batch of texts
+ texts = [
+     "A fluffy orange cat",
+     "A golden retriever dog",
+     "A red sports car",
+ ]
+ text_embeddings = model.encode_text(texts)  # Shape: [3, 384]
+
+ # Compute similarity
+ import torch.nn.functional as F
+ similarities = F.cosine_similarity(
+     text_embeddings[0:1],
+     text_embeddings[1:],
+     dim=-1
+ )
+ print(f"Cat vs Dog similarity: {similarities[0]:.3f}")
+ print(f"Cat vs Car similarity: {similarities[1]:.3f}")
+ ```
+
+ ### Image Embedding
+
+ ```python
+ from PIL import Image
+ import requests
+ from io import BytesIO
+
+ # Load image from URL
+ url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
+ response = requests.get(url)
+ image = Image.open(BytesIO(response.content)).convert('RGB')
+
+ # Get embedding
+ image_embedding = model.encode_image(image)  # Shape: [1, 384]
+
+ # Or from file
+ image = Image.open("my_image.jpg").convert('RGB')
+ image_embedding = model.encode_image(image)
+ ```
+
+ ### Cross-Modal Retrieval
+
+ ```python
+ # Image-to-text retrieval
+ image = Image.open("cat.jpg").convert('RGB')
+ image_emb = model.encode_image(image)
+
+ captions = [
+     "A cat sleeping on a bed",
+     "A dog playing in the park",
+     "A car driving on the highway",
+     "A fluffy feline resting",
+ ]
+ text_embs = model.encode_text(captions)
+
+ # Find most similar caption
+ similarities = F.cosine_similarity(image_emb, text_embs)
+ best_match_idx = similarities.argmax().item()
+ print(f"Best match: {captions[best_match_idx]}")
+ print(f"Similarity: {similarities[best_match_idx]:.3f}")
+ ```
+
+ ### Matryoshka Dimension Reduction (MRL)
+
+ ```python
+ # Get full 384-dim embedding
+ full_emb = model.encode_text("Hello world")  # [1, 384]
+
+ # Truncate to smaller dimensions (MRL)
+ emb_256 = full_emb[:, :256]  # 256-dim, ~1.5x faster retrieval
+ emb_128 = full_emb[:, :128]  # 128-dim, ~3x faster retrieval
+ emb_64 = full_emb[:, :64]    # 64-dim, ~6x faster retrieval
+
+ # Normalize after truncation
+ emb_128_norm = F.normalize(emb_128, p=2, dim=-1)
+ ```
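Note, not part of the committed README: a framework-free sketch of the truncate-then-renormalize step described in the MRL section. It assumes embeddings are plain Python lists, and the helper names `truncate_and_normalize` and `cosine` are hypothetical, not part of the model's API. The toy vectors have 8 dimensions and only stand in for the 384-dim outputs.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # avoid division by zero for an all-zero prefix
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 8-dim vectors standing in for 384-dim model outputs (illustrative values).
query = [0.5, 0.5, 0.5, 0.5, 0.1, -0.1, 0.0, 0.2]
doc = [0.4, 0.6, 0.5, 0.5, -0.3, 0.2, 0.1, 0.0]

q4 = truncate_and_normalize(query, 4)
d4 = truncate_and_normalize(doc, 4)
print(len(q4), round(cosine(q4, d4), 3))
```

The speed-up factors quoted in the README comments match the ratio of dimensions (384/256 = 1.5, 384/128 = 3, 384/64 = 6), since the cost of a dot product grows linearly with vector length.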
+
+ ### 2DMSE Adaptive Layer Exit
+
+ ```python
+ # Full model (all layers) - highest quality
+ full_emb = model.encode_text("Complex query", target_layer=None)
+
+ # Early exit at layer 3 (~50% compute) - faster
+ early_emb = model.encode_text("Simple query", target_layer=3)
+
+ # Even earlier exit (layer 1) - fastest
+ fastest_emb = model.encode_text("Quick lookup", target_layer=1)
+ ```
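Note, not part of the committed README: a toy sketch of what an early layer exit amounts to, assuming it means "run only the first k encoder layers, then pool". The layers here are simple stand-in functions, and `make_layer`, `mean_pool` and `encode` are hypothetical names. How the real `target_layer` argument is indexed is not stated in the README, so the sketch uses an explicit layer count instead.

```python
def make_layer(scale):
    """Return a toy 'layer' that rescales every token vector (stand-in for a transformer block)."""
    def layer(tokens):
        return [[scale * x for x in tok] for tok in tokens]
    return layer

def mean_pool(tokens):
    """Average token vectors into one sentence vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def encode(tokens, layers, num_layers=None):
    """Run only the first `num_layers` layers (all if None), then pool.

    Also returns the fraction of layers executed as a rough compute estimate.
    """
    active = layers if num_layers is None else layers[:num_layers]
    for layer in active:
        tokens = layer(tokens)
    return mean_pool(tokens), len(active) / len(layers)

layers = [make_layer(2.0) for _ in range(6)]  # six layers, as in MiniLM-L6
tokens = [[1.0, 0.0], [0.0, 1.0]]

full_vec, full_cost = encode(tokens, layers)
early_vec, early_cost = encode(tokens, layers, num_layers=3)
print(full_cost, early_cost)
```

Running three of six layers gives the "~50% compute" figure quoted in the README comment, under the assumption that every layer costs the same.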
+
+ ### Multimodal Fusion
+
+ ```python
+ # Combine text and image for richer representation
+ image = Image.open("cat.jpg").convert('RGB')
+ text = "A cute pet"
+
+ fused_embedding = model.encode_multimodal(
+     texts=text,
+     images=image
+ )  # Shape: [1, 384]
+ ```
+
+ ## Training
+
+ ### Architecture
+
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │                   multi-modal-embed-small                   │
+ ├─────────────────────────────────────────────────────────────┤
+ │  Text Encoder:  MiniLM-L6-v2 (22M params)                   │
+ │  Image Encoder: SigLIP-base-patch16-512 (86M params)        │
+ │  Fusion:        2-layer Transformer                         │
+ │  Output:        384-dim normalized embeddings               │
+ ├─────────────────────────────────────────────────────────────┤
+ │  2DMSE: Layer 0-5 early exit support                        │
+ │  MRL:   32, 64, 128, 256, 384 dim truncation                │
+ └─────────────────────────────────────────────────────────────┘
+ ```
+
+ ### Training Data
+
+ - **LLaVA-CC3M**: 595K image-caption pairs
+ - **COCO Captions**: Validation on 25K pairs
+
+ ### Training Configuration
+
+ - **Hardware**: 8x AMD MI300X GPUs
+ - **Precision**: BF16 mixed precision
+ - **Batch Size**: 256 per GPU (2048 effective)
+ - **Optimizer**: AdamW
+ - **Learning Rate**: 1e-4 with cosine decay
+ - **Loss**: InfoNCE contrastive + Matryoshka loss
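Note, not part of the committed README: a pure-Python sketch of how an InfoNCE contrastive loss can be combined with a Matryoshka objective. The README names the loss but gives no formula, so everything here is an assumption: the symmetric image/text form, the temperature of 0.07, the equal weighting across dimensions, and the function names `info_nce` and `matryoshka_info_nce`.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def info_nce(img, txt, temperature=0.07):
    """Symmetric InfoNCE: matching (image, text) pairs share the same index."""
    img = [l2_normalize(v) for v in img]
    txt = [l2_normalize(v) for v in txt]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[diagonal entry], via log-sum-exp for stability.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += lse - row[k]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(cols))

def matryoshka_info_nce(img, txt, dims):
    """Average InfoNCE over prefix truncations (equal weights are an assumption)."""
    losses = [info_nce([v[:d] for v in img], [v[:d] for v in txt]) for d in dims]
    return sum(losses) / len(losses)

# Toy 4-dim batch of two aligned pairs (illustrative values).
img = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
txt = [[0.9, 0.1, 0.1, 0.0], [0.1, 0.9, 0.0, 0.1]]
print(matryoshka_info_nce(img, txt, dims=[2, 4]))
print(8 * 256)  # GPUs x per-GPU batch = effective batch size
```

Swapping the order of the captions should raise the loss, because the diagonal of the similarity matrix then holds the mismatched pairs.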
+
+ ### Training Stages
+
+ 1. **Stage 1** (Frozen encoders): Align image-text space, 6 epochs
+ 2. **Stage 2** (Partial unfreeze): Fine-tune fusion + top encoder layers
+ 3. **Stage 3** (Full unfreeze): End-to-end fine-tuning
+
+ ## Evaluation
+
+ ### Image-Text Retrieval (COCO Validation)
+
+ | Metric | Image→Text | Text→Image |
+ |--------|------------|------------|
+ | R@1 | 41.88% | 39.21% |
+ | R@5 | 71.64% | 69.15% |
+ | R@10 | 82.16% | 80.02% |
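Note, not part of the committed README: a minimal sketch of how Recall@K can be computed from a query-by-candidate similarity matrix. It assumes one ground-truth candidate per query, stored at the same index, which is a simplification (COCO images have several captions each). The function name `recall_at_k` and the toy matrix are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item (same index) is in the top-k."""
    hits = 0
    for query_idx, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy 3x3 image-to-text matrix; entry [i][j] scores image i against caption j.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],  # correct caption (index 1) is ranked second
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, 1), recall_at_k(sim, 2))
```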
+
+ ### Text Semantic Similarity
+
+ | Pair Type | Similarity |
+ |-----------|------------|
+ | Positive (similar) | 0.805 |
+ | Negative (different) | 0.022 |
+ | **Separation** | **0.783** |
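Note, not part of the committed README: the separation figure in the table equals the positive similarity minus the negative one (0.805 - 0.022 = 0.783). The sketch below assumes the two inputs are mean cosine similarities over positive and negative pairs. The README does not spell out the definition, and `separation` is a hypothetical helper name.

```python
def separation(positive_sims, negative_sims):
    """Mean similarity of positive pairs minus mean similarity of negative pairs."""
    pos = sum(positive_sims) / len(positive_sims)
    neg = sum(negative_sims) / len(negative_sims)
    return pos - neg

# Plugging in the two averages reported in the table.
print(round(separation([0.805], [0.022]), 3))
```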
+
+ ### Cross-Modal Retrieval (Real-world test)
+
+ | Direction | R@1 | R@5 | MRR |
+ |-----------|-----|-----|-----|
+ | Image→Text | 87.5% | 100% | 0.94 |
+ | Text→Image | 87.5% | 100% | 0.94 |
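Note, not part of the committed README: a sketch of how R@K and MRR follow from the rank of the correct item for each query. The README does not state the size of the test set. The rank list below is a hypothetical example (eight queries, seven answered at rank 1 and one at rank 2) that happens to reproduce the reported figures. `recall_from_ranks` and `mrr_from_ranks` are illustrative names.

```python
def recall_from_ranks(ranks, k):
    """Fraction of queries whose correct item is ranked at position k or better."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mrr_from_ranks(ranks):
    """Mean reciprocal rank: average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks (1 = best), chosen only to illustrate the arithmetic.
ranks = [1] * 7 + [2]
print(recall_from_ranks(ranks, 1), recall_from_ranks(ranks, 5), mrr_from_ranks(ranks))
```

With these ranks, R@1 is 7/8 = 87.5%, R@5 is 100%, and MRR is (7 + 0.5)/8 = 0.9375, which rounds to 0.94.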
+
+ ### MRL Quality Retention (Matryoshka)
+
+ | Dimension | Compression | Separation |
+ |-----------|-------------|------------|
+ | 384 (full)| 1x | 1.024 |
+ | 256 | 1.5x | 1.038 |
+ | 128 | 3x | 0.889 |
+ | 64 | 6x | 0.839 |
+ | 32 | 12x | 0.889 |
+
+ ## Limitations
+
+ - Optimized for English; multilingual performance may vary
+ - Image resolution fixed at 512x512
+ - Audio modality available but not trained in this release
+ - Best for semantic similarity, not generative tasks
+
+ ## Citation
+
+ ```bibtex
+ @misc{multi-modal-embed-small,
+   title={multi-modal-embed-small: Compact Multimodal Embeddings with 2DMSE},
+   author={vLLM Semantic Router Team},
+   year={2026},
+   url={https://huggingface.co/llm-semantic-router/multi-modal-embed-small}
+ }
+ ```
+
+ ## License
+
+ Apache 2.0
+
+ ## Related Models
+
+ - [mmbert-embed-32k-2d-matryoshka](https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka) - Long context variant
+ - [mmbert-embed-finance](https://huggingface.co/llm-semantic-router/mmbert-embed-finance) - Finance domain
+ - [mmbert-embed-medical](https://huggingface.co/llm-semantic-router/mmbert-embed-medical) - Medical domain
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "_name_or_path": "llm-semantic-router/multi-modal-embed-small",
+   "architectures": [
+     "MultimodalEmbedder"
+   ],
+   "model_type": "mmbert",
+   "output_dim": 384,
+   "text_encoder_name": "sentence-transformers/all-MiniLM-L6-v2",
+   "image_encoder_name": "google/siglip-base-patch16-512",
+   "audio_encoder_name": "openai/whisper-tiny",
+   "fusion_type": "transformer",
+   "num_fusion_layers": 2,
+   "enable_layer_outputs": true,
+   "use_mobile_optimizations": true,
+   "matryoshka_dims": [
+     32,
+     64,
+     128,
+     256,
+     384
+   ],
+   "supported_modalities": [
+     "text",
+     "image",
+     "multimodal"
+   ]
+ }
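Note, not part of the commit: a small sketch of sanity checks a loader could run on this config before building the model. The two rules (strictly increasing `matryoshka_dims`, largest value equal to `output_dim`) are assumptions drawn from the values above rather than documented requirements, and `check_config` is a hypothetical helper. Only a subset of the config keys is reproduced.

```python
import json

# Subset of the committed config.json, inlined so the sketch is self-contained.
CONFIG_TEXT = """
{
  "output_dim": 384,
  "matryoshka_dims": [32, 64, 128, 256, 384],
  "supported_modalities": ["text", "image", "multimodal"]
}
"""

def check_config(cfg):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    dims = cfg["matryoshka_dims"]
    if dims != sorted(set(dims)):
        problems.append("matryoshka_dims must be strictly increasing")
    if max(dims) != cfg["output_dim"]:
        problems.append("largest matryoshka dim should equal output_dim")
    return problems

cfg = json.loads(CONFIG_TEXT)
print(check_config(cfg))
```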
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aced484d5e4736120dcb9f41fe33e9751fc77a076572311d86f691b87a64c394
+ size 1350323576
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:609c166182116db34188892e1930c30bf7cd31d2b679369dfa61694c21e299c3
+ size 976407151
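Note, not part of the commit: the two weight files are stored as Git LFS pointers, which are short text files of `key value` lines. The sketch below parses one into a dictionary. `parse_lfs_pointer` is a hypothetical helper, and the pointer text is copied from the `pytorch_model.bin` hunk above.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:609c166182116db34188892e1930c30bf7cd31d2b679369dfa61694c21e299c3
size 976407151
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], len(info["digest"]), info["size"])
```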