HuaminChen committed
Commit e6b2249 · verified · 1 Parent(s): f241f4b

Update model with audio-text alignment (Stage 5: R@1=36.38%)

Files changed (3)
  1. README.md +118 -166
  2. config.json +1 -19
  3. model.pt +3 -0
README.md CHANGED
@@ -2,13 +2,13 @@
2
  license: apache-2.0
3
  language:
4
  - en
5
- - multilingual
6
  library_name: transformers
7
  tags:
8
  - sentence-transformers
9
  - multimodal
10
  - embeddings
11
  - image-text
 
12
  - retrieval
13
  - 2DMSE
14
  - matryoshka
@@ -32,30 +32,34 @@ model-index:
32
  type: recall_at_10
33
  value: 82.16
34
  - task:
35
- type: sentence-similarity
36
  dataset:
37
- name: Real-world evaluation
38
- type: custom
39
  metrics:
40
- - name: Text Similarity Separation
41
- type: custom
42
- value: 0.783
43
- - name: Cross-modal Separation
44
- type: custom
45
- value: 0.504
 
 
 
46
  ---
47
 
48
  # multi-modal-embed-small
49
 
50
- A compact multimodal embedding model that unifies text and image representations in a shared semantic space. Part of the [MoM (Mixture of Models)](https://huggingface.co/llm-semantic-router) family powering vLLM Semantic Router.
51
 
52
  ## Model Description
53
 
54
- **multi-modal-embed-small** is a lightweight (~85M parameters) multimodal encoder supporting:
55
 
56
- - **Text encoding** via MiniLM-L6-v2 backbone
57
- - **Image encoding** via SigLIP-base-patch16-512
58
- - **Cross-modal fusion** via transformer attention
 
59
  - **2DMSE**: Two-Dimensional Matryoshka Sentence Embeddings for adaptive compute
60
  - **MRL**: Matryoshka Representation Learning for flexible embedding dimensions
61
 
@@ -64,69 +68,64 @@ A compact multimodal embedding model that unifies text and image representations
64
  | Feature | Description |
65
  |---------|-------------|
66
  | **Embedding Dimension** | 384 (supports MRL truncation to 32, 64, 128, 256) |
67
- | **Image Resolution** | 512x512 |
68
- | **Modalities** | Text, Image, Multimodal fusion |
 
69
  | **2DMSE Support** | Early exit at any encoder layer |
70
- | **Languages** | English (primary), multilingual transfer |
71
 
72
- ## Usage
73
-
74
- ### Installation
75
 
76
  ```bash
77
  pip install torch transformers pillow safetensors
78
  ```
79
 
80
- ### Basic Usage
 
 
81
 
82
  ```python
83
  import torch
84
- from PIL import Image
85
- import requests
86
- from io import BytesIO
87
 
88
- # Load model
89
- from transformers import AutoModel, AutoProcessor
 
 
 
90
 
91
- # Or load from local checkpoint
92
  import sys
93
  sys.path.append("path/to/2DMSE-Multimodal-Embedder")
94
- from src.models import MultimodalEmbedder
95
 
96
- model = MultimodalEmbedder(
97
  text_encoder_name="sentence-transformers/all-MiniLM-L6-v2",
98
  image_encoder_name="google/siglip-base-patch16-512",
 
99
  output_dim=384,
100
- use_mobile_optimizations=True,
101
  )
102
- model.load_state_dict(torch.load("model.pt", map_location="cpu"))
 
103
  model.eval()
104
  ```
105
 
106
  ### Text Embedding
107
 
108
  ```python
 
 
109
  # Single text
110
- text = "A photo of a cat sitting on a couch"
111
- text_embedding = model.encode_text(text) # Shape: [1, 384]
112
 
113
  # Batch of texts
114
- texts = [
115
- "A fluffy orange cat",
116
- "A golden retriever dog",
117
- "A red sports car",
118
- ]
119
  text_embeddings = model.encode_text(texts) # Shape: [3, 384]
120
 
121
  # Compute similarity
122
- import torch.nn.functional as F
123
- similarities = F.cosine_similarity(
124
- text_embeddings[0:1],
125
- text_embeddings[1:],
126
- dim=-1
127
- )
128
- print(f"Cat vs Dog similarity: {similarities[0]:.3f}")
129
- print(f"Cat vs Car similarity: {similarities[1]:.3f}")
130
  ```
131
 
132
  ### Image Embedding
@@ -136,17 +135,26 @@ from PIL import Image
136
  import requests
137
  from io import BytesIO
138
 
139
- # Load image from URL
140
  url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
141
- response = requests.get(url)
142
- image = Image.open(BytesIO(response.content)).convert('RGB')
143
 
144
  # Get embedding
145
  image_embedding = model.encode_image(image) # Shape: [1, 384]
 
146
 
147
- # Or from file
148
- image = Image.open("my_image.jpg").convert('RGB')
149
- image_embedding = model.encode_image(image)
150
  ```
151
 
152
  ### Cross-Modal Retrieval
@@ -158,98 +166,73 @@ image_emb = model.encode_image(image)
158
 
159
  captions = [
160
  "A cat sleeping on a bed",
161
- "A dog playing in the park",
162
  "A car driving on the highway",
163
- "A fluffy feline resting",
164
  ]
165
  text_embs = model.encode_text(captions)
166
 
167
- # Find most similar caption
168
  similarities = F.cosine_similarity(image_emb, text_embs)
169
- best_match_idx = similarities.argmax().item()
170
- print(f"Best match: {captions[best_match_idx]}")
171
- print(f"Similarity: {similarities[best_match_idx]:.3f}")
172
  ```
173
 
174
- ### Matryoshka Dimension Reduction (MRL)
175
 
176
  ```python
177
- # Get full 384-dim embedding
178
  full_emb = model.encode_text("Hello world") # [1, 384]
179
 
180
- # Truncate to smaller dimensions (MRL)
181
- emb_256 = full_emb[:, :256] # 256-dim, ~1.5x faster retrieval
182
- emb_128 = full_emb[:, :128] # 128-dim, ~3x faster retrieval
183
- emb_64 = full_emb[:, :64] # 64-dim, ~6x faster retrieval
184
-
185
- # Normalize after truncation
186
- emb_128_norm = F.normalize(emb_128, p=2, dim=-1)
187
  ```
188
 
189
- ### 2DMSE Adaptive Layer Exit
190
 
191
- ```python
192
- # Full model (all layers) - highest quality
193
- full_emb = model.encode_text("Complex query", target_layer=None)
194
-
195
- # Early exit at layer 3 (~50% compute) - faster
196
- early_emb = model.encode_text("Simple query", target_layer=3)
197
-
198
- # Even earlier exit (layer 1) - fastest
199
- fastest_emb = model.encode_text("Quick lookup", target_layer=1)
200
  ```
201
-
202
- ### Multimodal Fusion
203
-
204
- ```python
205
- # Combine text and image for richer representation
206
- image = Image.open("cat.jpg").convert('RGB')
207
- text = "A cute pet"
208
-
209
- fused_embedding = model.encode_multimodal(
210
- texts=text,
211
- images=image
212
- ) # Shape: [1, 384]
213
  ```
214
 
215
  ## Training
216
 
217
- ### Architecture
218
 
219
- ```
220
- ┌─────────────────────────────────────────────────────────────┐
221
- │ multi-modal-embed-small │
222
- ├─────────────────────────────────────────────────────────────┤
223
- │ Text Encoder: MiniLM-L6-v2 (22M params) │
224
- │ Image Encoder: SigLIP-base-patch16-512 (86M params) │
225
- │ Fusion: 2-layer Transformer │
226
- │ Output: 384-dim normalized embeddings │
227
- ├─────────────────────────────────────────────────────────────┤
228
- │ 2DMSE: Layer 0-5 early exit support │
229
- │ MRL: 32, 64, 128, 256, 384 dim truncation │
230
- └─────────────────────────────────────────────────────────────┘
231
- ```
232
 
233
- ### Training Data
234
 
235
- - **LLaVA-CC3M**: 595K image-caption pairs
236
- - **COCO Captions**: Validation on 25K pairs
 
 
 
 
237
 
238
  ### Training Configuration
239
 
240
- - **Hardware**: 8x AMD MI300X GPUs
241
  - **Precision**: BF16 mixed precision
242
- - **Batch Size**: 256 per GPU (2048 effective)
243
  - **Optimizer**: AdamW
244
- - **Learning Rate**: 1e-4 with cosine decay
245
  - **Loss**: InfoNCE contrastive + Matryoshka loss
246
 
247
- ### Training Stages
248
-
249
- 1. **Stage 1** (Frozen encoders): Align image-text space, 6 epochs
250
- 2. **Stage 2** (Partial unfreeze): Fine-tune fusion + top encoder layers
251
- 3. **Stage 4** (Full unfreeze): End-to-end fine-tuning
252
-
253
  ## Evaluation
254
 
255
  ### Image-Text Retrieval (COCO Validation)
@@ -260,61 +243,36 @@ fused_embedding = model.encode_multimodal(
260
  | R@5 | 71.64% | 69.15% |
261
  | R@10 | 82.16% | 80.02% |
262
 
263
- ### Text Semantic Similarity
264
 
265
- | Pair Type | Similarity |
266
- |-----------|------------|
267
- | Positive (similar) | 0.805 |
268
- | Negative (different) | 0.022 |
269
- | **Separation** | **0.783** |
270
 
271
- ### Cross-Modal Retrieval (Real-world test)
272
 
273
- | Direction | R@1 | R@5 | MRR |
274
- |-----------|-----|-----|-----|
275
- | Image→Text | 87.5% | 100% | 0.94 |
276
- | Text→Image | 87.5% | 100% | 0.94 |
277
-
278
- ### MRL Quality Retention (Matryoshka)
279
-
280
- | Dimension | Compression | Separation |
281
- |-----------|-------------|------------|
282
- | 384 (full)| 1x | 1.024 |
283
- | 256 | 1.5x | 1.038 |
284
- | 128 | 3x | 0.889 |
285
- | 64 | 6x | 0.839 |
286
- | 32 | 12x | 0.889 |
287
 
288
  ## Limitations
289
 
290
- - Optimized for English; multilingual performance may vary
291
- - Image resolution fixed at 512x512
292
- - Audio encoder included but not yet trained (see Roadmap)
293
- - Best for semantic similarity, not generative tasks
294
-
295
- ## Roadmap
296
-
297
- ### Audio Modality Training (Planned)
298
-
299
- The model architecture includes a Whisper audio encoder, but this release only trained on image-text data. Future releases will add audio-text alignment using:
300
-
301
- | Dataset | Size | Source | Paper |
302
- |---------|------|--------|-------|
303
- | [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) | 403K clips | HuggingFace (CVSSP, University of Surrey) | [arXiv:2303.17395](https://arxiv.org/abs/2303.17395) |
304
- | [AudioCaps](https://github.com/cdjkim/audiocaps) | 46K clips | GitHub (Seoul National University) | [NAACL-HLT 2019](https://aclanthology.org/N19-1011/) |
305
- | [Clotho](https://zenodo.org/records/3490684) | 6K clips | Zenodo (Tampere University) | [ICASSP 2020](https://ieeexplore.ieee.org/document/9052990) |
306
-
307
- This will enable:
308
- - Audio-to-text retrieval
309
- - Text-to-audio retrieval
310
- - Audio-image-text multimodal fusion
311
 
312
  ## Citation
313
 
314
  ```bibtex
315
- @misc{multi-modal-embed-small,
316
  title={multi-modal-embed-small: Compact Multimodal Embeddings with 2DMSE},
317
- author={vLLM Semantic Router Team},
318
  year={2026},
319
  url={https://huggingface.co/llm-semantic-router/multi-modal-embed-small}
320
  }
@@ -323,9 +281,3 @@ This will enable:
323
  ## License
324
 
325
  Apache 2.0
326
-
327
- ## Related Models
328
-
329
- - [mmbert-embed-32k-2d-matryoshka](https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka) - Long context variant
330
- - [mmbert-embed-finance](https://huggingface.co/llm-semantic-router/mmbert-embed-finance) - Finance domain
331
- - [mmbert-embed-medical](https://huggingface.co/llm-semantic-router/mmbert-embed-medical) - Medical domain
 
2
  license: apache-2.0
3
  language:
4
  - en
 
5
  library_name: transformers
6
  tags:
7
  - sentence-transformers
8
  - multimodal
9
  - embeddings
10
  - image-text
11
+ - audio-text
12
  - retrieval
13
  - 2DMSE
14
  - matryoshka
 
32
  type: recall_at_10
33
  value: 82.16
34
  - task:
35
+ type: audio-text-retrieval
36
  dataset:
37
+ name: LibriSpeech
38
+ type: librispeech
39
  metrics:
40
+ - name: Audio-to-Text R@1
41
+ type: recall_at_1
42
+ value: 36.38
43
+ - name: Audio-to-Text R@5
44
+ type: recall_at_5
45
+ value: 68.22
46
+ - name: Audio-to-Text R@10
47
+ type: recall_at_10
48
+ value: 79.52
49
  ---
50
 
51
  # multi-modal-embed-small
52
 
53
+ A compact multimodal embedding model that unifies text, image, and audio representations in a shared semantic space. Part of the [MoM (Mixture of Models)](https://huggingface.co/llm-semantic-router) family.
54
 
55
  ## Model Description
56
 
57
+ **multi-modal-embed-small** is a lightweight multimodal encoder (~250M parameters) supporting:
58
 
59
+ - **Text encoding** via MiniLM-L6-v2 (22M params)
60
+ - **Image encoding** via SigLIP-base-patch16-512 (86M params)
61
+ - **Audio encoding** via Whisper-tiny encoder (39M params)
62
+ - **Cross-modal fusion** via 2-layer transformer attention
63
  - **2DMSE**: Two-Dimensional Matryoshka Sentence Embeddings for adaptive compute
64
  - **MRL**: Matryoshka Representation Learning for flexible embedding dimensions
65
 
 
68
  | Feature | Description |
69
  |---------|-------------|
70
  | **Embedding Dimension** | 384 (supports MRL truncation to 32, 64, 128, 256) |
71
+ | **Image Resolution** | 512×512 |
72
+ | **Audio Input** | Up to 30s, 16kHz (Whisper Mel spectrogram) |
73
+ | **Modalities** | Text, Image, Audio, Multimodal fusion |
74
  | **2DMSE Support** | Early exit at any encoder layer |
75
+ | **Languages** | English |
76
 
77
+ ## Installation
 
 
78
 
79
  ```bash
80
  pip install torch transformers pillow safetensors
81
  ```
82
 
83
+ ## Usage
84
+
85
+ ### Load Model
86
 
87
  ```python
  import torch
+ from huggingface_hub import hf_hub_download

+ # Download checkpoint
+ checkpoint_path = hf_hub_download(
+     repo_id="llm-semantic-router/multi-modal-embed-small",
+     filename="model.pt"
+ )

+ # Load model
  import sys
  sys.path.append("path/to/2DMSE-Multimodal-Embedder")
+ from src.models import create_multimodal_model

+ model = create_multimodal_model(
      text_encoder_name="sentence-transformers/all-MiniLM-L6-v2",
      image_encoder_name="google/siglip-base-patch16-512",
+     audio_encoder_name="openai/whisper-tiny",
      output_dim=384,
  )
+ state_dict = torch.load(checkpoint_path, map_location="cpu")
+ model.load_state_dict(state_dict["model_state_dict"])
  model.eval()
  ```
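
The loading snippet above indexes the checkpoint with `"model_state_dict"`, whereas the previous revision of this README passed the loaded object straight to `load_state_dict`. A small helper that accepts either layout may help when switching between checkpoints. This is only a sketch: `extract_state_dict` is a hypothetical name, and the `epoch` key in the toy data is illustrative, not a documented field of `model.pt`.

```python
from collections.abc import Mapping


def extract_state_dict(checkpoint):
    """Return the weights mapping from a loaded checkpoint object.

    Assumption: the checkpoint is either a raw state dict or a dict that
    wraps the weights under the "model_state_dict" key.
    """
    if not isinstance(checkpoint, Mapping):
        raise TypeError(f"expected a mapping, got {type(checkpoint).__name__}")
    if "model_state_dict" in checkpoint:
        return checkpoint["model_state_dict"]
    return checkpoint


# Toy stand-ins for what torch.load(...) might return; real values are tensors.
wrapped = {"model_state_dict": {"proj.weight": [0.1, 0.2]}, "epoch": 5}
raw = {"proj.weight": [0.1, 0.2]}

print(sorted(extract_state_dict(wrapped)))  # ['proj.weight']
print(sorted(extract_state_dict(raw)))      # ['proj.weight']
```

With a real checkpoint the call would look like `model.load_state_dict(extract_state_dict(torch.load(checkpoint_path, map_location="cpu")))`.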
112
 
113
  ### Text Embedding
114
 
115
  ```python
116
+ import torch.nn.functional as F
117
+
118
  # Single text
119
+ text_embedding = model.encode_text("A photo of a cat") # Shape: [1, 384]
 
120
 
121
  # Batch of texts
122
+ texts = ["A fluffy orange cat", "A golden retriever dog", "A red sports car"]
 
 
 
 
123
  text_embeddings = model.encode_text(texts) # Shape: [3, 384]
124
 
125
  # Compute similarity
126
+ similarities = F.cosine_similarity(text_embeddings[0:1], text_embeddings[1:], dim=-1)
127
+ print(f"Cat vs Dog: {similarities[0]:.3f}")
128
+ print(f"Cat vs Car: {similarities[1]:.3f}")
 
 
 
 
 
129
  ```
130
 
131
  ### Image Embedding
 
135
  import requests
136
  from io import BytesIO
137
 
138
+ # Load image
139
  url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
140
+ image = Image.open(BytesIO(requests.get(url).content)).convert('RGB')
 
141
 
142
  # Get embedding
143
  image_embedding = model.encode_image(image) # Shape: [1, 384]
144
+ ```
145
 
146
+ ### Audio Embedding
147
+
148
+ ```python
+ import torchaudio
+
+ # Load audio (16kHz)
+ waveform, sample_rate = torchaudio.load("speech.wav")
+ if sample_rate != 16000:
+     waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
+
+ # Get embedding
+ audio_embedding = model.encode_audio(waveform) # Shape: [1, 384]
  ```
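
The feature table lists audio input as up to 30 s at 16 kHz. Whether `encode_audio` trims or down-mixes internally is not stated in this README, so the following is a minimal sketch of doing both up front. `prepare_waveform` is a hypothetical helper, not part of the repository.

```python
import torch

SAMPLE_RATE = 16_000                      # from the "Audio Input" table row
MAX_SECONDS = 30                          # from the "Audio Input" table row
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS   # 480,000 samples


def prepare_waveform(waveform: torch.Tensor) -> torch.Tensor:
    """Mix a [channels, samples] waveform down to mono and trim it to 30 s.

    Assumes the input has already been resampled to 16 kHz.
    """
    if waveform.dim() == 1:               # accept a bare [samples] tensor too
        waveform = waveform.unsqueeze(0)
    mono = waveform.mean(dim=0, keepdim=True)   # [1, samples]
    return mono[:, :MAX_SAMPLES]


# A synthetic 40-second stereo clip stands in for a real recording.
stereo_40s = torch.randn(2, SAMPLE_RATE * 40)
clip = prepare_waveform(stereo_40s)
print(clip.shape)  # torch.Size([1, 480000])
```

Clips shorter than 30 s pass through unchanged; padding, if the encoder needs it, is left to the model's own preprocessing.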
159
 
160
  ### Cross-Modal Retrieval
 
166
 
167
  captions = [
168
  "A cat sleeping on a bed",
169
+ "A dog playing in the park",
170
  "A car driving on the highway",
 
171
  ]
172
  text_embs = model.encode_text(captions)
173
 
 
174
  similarities = F.cosine_similarity(image_emb, text_embs)
175
+ best_idx = similarities.argmax().item()
176
+ print(f"Best match: {captions[best_idx]} ({similarities[best_idx]:.3f})")
 
177
  ```
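
When more than the single best caption is needed, the same cosine scores can be ranked with `torch.topk`. The sketch below uses hand-written 3-dimensional vectors in place of real model outputs so that it runs without the checkpoint; `rank_candidates` is an illustrative name, not a function from the repository.

```python
import torch
import torch.nn.functional as F


def rank_candidates(query: torch.Tensor, candidates: torch.Tensor, k: int = 2):
    """Return (indices, scores) of the k candidates most similar to the query.

    query: [1, D], candidates: [N, D]. Both are L2-normalized here so the
    dot product equals cosine similarity.
    """
    q = F.normalize(query, dim=-1)
    c = F.normalize(candidates, dim=-1)
    scores = (q @ c.T).squeeze(0)                        # [N]
    top = torch.topk(scores, k=min(k, scores.numel()))
    return top.indices.tolist(), top.values.tolist()


# Toy embeddings: candidate 0 is nearly parallel to the query, 1 is orthogonal.
query = torch.tensor([[1.0, 0.0, 0.0]])
candidates = torch.tensor([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
])
indices, scores = rank_candidates(query, candidates, k=2)
print(indices)  # [0, 2]
```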
178
 
179
+ ### Matryoshka Dimension Reduction
180
 
181
  ```python
182
+ # Full 384-dim embedding
183
  full_emb = model.encode_text("Hello world") # [1, 384]
184
 
185
+ # Truncate to smaller dimensions
186
+ emb_256 = F.normalize(full_emb[:, :256], p=2, dim=-1) # 1.5x faster retrieval
187
+ emb_128 = F.normalize(full_emb[:, :128], p=2, dim=-1) # 3x faster retrieval
188
+ emb_64 = F.normalize(full_emb[:, :64], p=2, dim=-1) # 6x faster retrieval
 
 
 
189
  ```
190
 
191
+ ## Architecture
192
 
193
  ```
194
+ ┌──────────────────────────────────────────────────────────────┐
195
+ │ multi-modal-embed-small │
196
+ ├──────────────────────────────────────────────────────────────┤
197
+ │ Text Encoder: MiniLM-L6-v2 (22M params, 6 layers)│
198
+ │ Image Encoder: SigLIP-base-patch16-512 (86M params) │
199
+ │ Audio Encoder: Whisper-tiny encoder (39M params, 4 layers)│
200
+ │ Fusion: 2-layer Transformer │
201
+ ├──────────────────────────────────────────────────────────────┤
202
+ │ Output: 384-dim normalized embeddings │
203
+ │ 2DMSE: Layer 0-5 early exit support │
204
+ │ MRL: 32, 64, 128, 256, 384 dim truncation │
205
+ └──────────────────────────────────────────────────────────────┘
206
  ```
207
 
208
  ## Training
209
 
210
+ ### Training Data
211
 
212
+ | Modality | Dataset | Samples | Purpose |
213
+ |----------|---------|---------|---------|
214
+ | Image-Text | LLaVA-CC3M | 595K | Image-text alignment |
215
+ | Image-Text | COCO Captions | 25K | Validation |
216
+ | Audio-Text | LibriSpeech | 105K | Audio-text alignment |
217
 
218
+ ### Training Stages
219
 
220
+ | Stage | Description | Trainable | Epochs |
221
+ |-------|-------------|-----------|--------|
222
+ | 1 | Initial alignment | Projection layers only | 6 |
223
+ | 2 | Partial unfreeze | Top encoder layers + projections | 3 |
224
+ | 4 | Full image-text | All image/text parameters | 3 |
225
+ | 5 | Audio alignment | Audio encoder (text/image frozen) | 5 |
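
In PyTorch, the "Trainable" column amounts to toggling `requires_grad` per submodule. The training code is not part of this README, so the following is only a sketch on a toy module; the attribute names `text_encoder`, `image_encoder`, and `audio_encoder` are assumptions made for illustration.

```python
import torch.nn as nn


class ToyEmbedder(nn.Module):
    """Stand-in with one tiny layer per modality (hypothetical layout)."""

    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Linear(4, 4)
        self.image_encoder = nn.Linear(4, 4)
        self.audio_encoder = nn.Linear(4, 4)


def set_trainable(model: nn.Module, trainable_prefixes):
    """Enable gradients only for parameters whose name starts with a prefix."""
    prefixes = tuple(trainable_prefixes)
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)


toy = ToyEmbedder()

# Stage 5 as described in the table: audio encoder trains, text/image frozen.
set_trainable(toy, ["audio_encoder."])
trainable = sorted(n for n, p in toy.named_parameters() if p.requires_grad)
print(trainable)  # ['audio_encoder.bias', 'audio_encoder.weight']
```

Only the parameters left with `requires_grad=True` would then be handed to the optimizer.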
226
 
227
  ### Training Configuration
228
 
229
+ - **Hardware**: AMD MI300X GPUs
230
  - **Precision**: BF16 mixed precision
231
+ - **Batch Size**: 64 per GPU (512 effective)
232
  - **Optimizer**: AdamW
233
+ - **Learning Rate**: 1e-4 to 5e-5 (stage dependent)
234
  - **Loss**: InfoNCE contrastive + Matryoshka loss
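
The loss is named above but not spelled out. One common way to combine the two ideas is to compute a symmetric InfoNCE loss on each nested prefix of the embedding and average the results. The sketch below follows that recipe under stated assumptions: the temperature of 0.07, the symmetric form, and the equal weighting across dimensions are illustrative choices, not values taken from this model's training run.

```python
import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = (32, 64, 128, 256, 384)  # dims listed in this README


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE for paired embeddings a[i] <-> b[i], shape [B, D]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.T / temperature                  # [B, B] similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)
    # Cross-entropy in both directions: a -> b and b -> a.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))


def matryoshka_info_nce(a, b, dims=MATRYOSHKA_DIMS, temperature=0.07):
    """Average InfoNCE over nested prefixes of the embedding (assumed weighting)."""
    losses = [info_nce(a[:, :d], b[:, :d], temperature) for d in dims]
    return torch.stack(losses).mean()


torch.manual_seed(0)
text_emb = torch.randn(8, 384)
audio_emb = text_emb + 0.1 * torch.randn(8, 384)   # nearly aligned pairs
unrelated = torch.randn(8, 384)                    # no pairing structure

aligned_loss = matryoshka_info_nce(audio_emb, text_emb)
random_loss = matryoshka_info_nce(unrelated, text_emb)
print(f"aligned: {aligned_loss.item():.4f}, random: {random_loss.item():.4f}")
```

Because every prefix contributes to the loss, the leading dimensions are pushed to carry most of the signal, which is what makes truncation usable at inference time.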
235
 
 
 
 
 
 
 
236
  ## Evaluation
237
 
238
  ### Image-Text Retrieval (COCO Validation)
 
243
  | R@5 | 71.64% | 69.15% |
244
  | R@10 | 82.16% | 80.02% |
245
 
246
+ ### Audio-Text Retrieval (LibriSpeech)
247
 
248
+ | Metric | Audio→Text |
249
+ |--------|------------|
250
+ | R@1 | 36.38% |
251
+ | R@5 | 68.22% |
252
+ | R@10 | 79.52% |
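
For reference, Recall@K on a paired test set can be computed from the query-by-candidate similarity matrix. The evaluation protocol behind the numbers above (candidate pool size, one transcript per clip) is not described in this README, so the sketch assumes the simplest setting: query `i` has exactly one correct candidate, at index `i`.

```python
import torch


def recall_at_k(similarity: torch.Tensor, ks=(1, 5, 10)):
    """Recall@K in percent for a square similarity matrix.

    similarity[i, j] is the score of query i against candidate j, and the
    ground-truth match for query i is assumed to be candidate i.
    """
    correct = similarity.diag().unsqueeze(1)          # [N, 1]
    # Rank of the correct item = number of candidates scoring strictly higher.
    ranks = (similarity > correct).sum(dim=1)         # [N]
    return {k: (ranks < k).float().mean().item() * 100 for k in ks}


# Toy 3x3 example: queries 0 and 2 rank their match first, query 1 second.
sim = torch.tensor([
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.1],
    [0.3, 0.2, 0.6],
])
recalls = recall_at_k(sim, ks=(1, 2))
print({k: round(v, 2) for k, v in recalls.items()})  # {1: 66.67, 2: 100.0}
```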
253
 
254
+ ### MRL Quality Retention
255
 
256
+ | Dimension | Compression | Quality |
257
+ |-----------|-------------|---------|
258
+ | 384 (full)| 1× | 100% |
259
+ | 256 | 1.5× | ~98% |
260
+ | 128 | 3× | ~95% |
261
+ | 64 | 6× | ~90% |
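
The Compression column is simply the ratio of the full dimension to the truncated one. The short calculation below reproduces it and adds the per-vector storage cost, assuming embeddings are stored as float32 (an assumption; the README does not specify a storage format). The Quality column is an empirical measurement and is not derived here.

```python
FULL_DIM = 384
BYTES_PER_VALUE = 4  # assumption: float32 storage


def compression_table(dims):
    """Map each truncated dim to (compression ratio, bytes per vector)."""
    return {d: (FULL_DIM / d, d * BYTES_PER_VALUE) for d in dims}


table = compression_table((384, 256, 128, 64, 32))
for dim, (ratio, nbytes) in table.items():
    print(f"{dim:>3} dims: {ratio:g}x compression, {nbytes} bytes per vector")
```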
262
 
263
  ## Limitations
264
 
265
+ - English language only
266
+ - Image resolution fixed at 512×512
267
+ - Audio limited to 30 seconds
268
+ - Best for retrieval/similarity, not generation
269
 
270
  ## Citation
271
 
272
  ```bibtex
273
+ @misc{multi-modal-embed-small-2026,
274
  title={multi-modal-embed-small: Compact Multimodal Embeddings with 2DMSE},
275
+ author={Semantic Router Team},
276
  year={2026},
277
  url={https://huggingface.co/llm-semantic-router/multi-modal-embed-small}
278
  }
 
281
  ## License
282
 
283
  Apache 2.0
config.json CHANGED
@@ -1,27 +1,9 @@
1
  {
2
- "_name_or_path": "llm-semantic-router/multi-modal-embed-small",
3
- "architectures": [
4
- "MultimodalEmbedder"
5
- ],
6
- "model_type": "mmbert",
7
  "output_dim": 384,
8
  "text_encoder_name": "sentence-transformers/all-MiniLM-L6-v2",
9
  "image_encoder_name": "google/siglip-base-patch16-512",
10
  "audio_encoder_name": "openai/whisper-tiny",
11
  "fusion_type": "transformer",
12
  "num_fusion_layers": 2,
13
- "enable_layer_outputs": true,
14
- "use_mobile_optimizations": true,
15
- "matryoshka_dims": [
16
- 32,
17
- 64,
18
- 128,
19
- 256,
20
- 384
21
- ],
22
- "supported_modalities": [
23
- "text",
24
- "image",
25
- "multimodal"
26
- ]
27
  }
 
1
  {
 
 
 
 
 
2
  "output_dim": 384,
3
  "text_encoder_name": "sentence-transformers/all-MiniLM-L6-v2",
4
  "image_encoder_name": "google/siglip-base-patch16-512",
5
  "audio_encoder_name": "openai/whisper-tiny",
6
  "fusion_type": "transformer",
7
  "num_fusion_layers": 2,
8
+ "enable_layer_outputs": true
9
  }
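
With the trimmed config, the four keyword arguments used in the README's loading example can be read straight from `config.json`. The sketch below inlines the new file contents so it runs standalone. Which of the remaining keys the factory function accepts is not documented, so they are listed separately rather than passed on; `FACTORY_KEYS` is an illustrative name.

```python
import json

# Contents of config.json after this commit, inlined for a standalone example.
CONFIG_JSON = """
{
  "output_dim": 384,
  "text_encoder_name": "sentence-transformers/all-MiniLM-L6-v2",
  "image_encoder_name": "google/siglip-base-patch16-512",
  "audio_encoder_name": "openai/whisper-tiny",
  "fusion_type": "transformer",
  "num_fusion_layers": 2,
  "enable_layer_outputs": true
}
"""

# Keyword arguments that appear in the README's loading example.
FACTORY_KEYS = (
    "text_encoder_name",
    "image_encoder_name",
    "audio_encoder_name",
    "output_dim",
)

config = json.loads(CONFIG_JSON)
factory_kwargs = {key: config[key] for key in FACTORY_KEYS}
other_keys = sorted(set(config) - set(FACTORY_KEYS))

print(factory_kwargs["audio_encoder_name"])  # openai/whisper-tiny
print(other_keys)  # ['enable_layer_outputs', 'fusion_type', 'num_fusion_layers']
```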
model.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a4e280a185550651d299dfcd10df7e2cd02629c2f0c0b0964122daabe723ef4b
3
+ size 976407151
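
The three lines above are a Git LFS pointer, not the weights themselves. Parsing it gives the download size and a rough upper bound on the parameter count. The estimate assumes float32 weights and a checkpoint that holds nothing but weights; since the README loads the weights from a `"model_state_dict"` key, the file may contain other entries, so treat the figure as approximate.

```python
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:a4e280a185550651d299dfcd10df7e2cd02629c2f0c0b0964122daabe723ef4b
size 976407151
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


info = parse_lfs_pointer(POINTER)
size_mb = info["size"] / 1e6
approx_params_m = info["size"] / 4 / 1e6   # assumption: 4 bytes per parameter

print(f"{size_mb:.1f} MB, roughly {approx_params_m:.0f}M float32 values")
# 976.4 MB, roughly 244M float32 values
```

Under those assumptions the file size is in line with the "~250M parameters" figure quoted in the model description.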