---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: sentence-transformers
tags:
  - transformers
  - sentence-transformers
  - feature-extraction
  - multimodal-embedding
---

# LCO-Embedding: Scaling Language-Centric Omnimodal Representation Learning

We are thrilled to release LCO-Embedding, a language-centric omnimodal representation learning framework, together with the LCO-Embedding model families!

This model implements the framework presented in the paper [Scaling Language-Centric Omnimodal Representation Learning](https://huggingface.co/papers/2510.11693), accepted by NeurIPS 2025.

**Project Page:** https://huggingface.co/LCO-Embedding

**Github Repository:** https://github.com/LCO-Embedding/LCO-Embedding


## Quick Start

Note: We use only the `thinker` component of Qwen2.5-Omni and drop the `talker` component.

### Using Sentence Transformers

Install Sentence Transformers:
```bash
pip install "sentence_transformers[image]"
```

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "LCO-Embedding/LCO-Embedding-Omni-7B",
    trust_remote_code=True,
    model_kwargs={"dtype": torch.bfloat16},
)

# The same "Summarize the above <modality> in one word:" instruction used in
# the paper is baked into the chat template, so encode() takes plain text or
# multimodal dicts directly.
texts = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
    "The Eiffel Tower is located in Paris.",
    "Berlin is the capital of Germany.",
]
text_embeddings = model.encode(texts)
print(text_embeddings.shape)
# (4, 3584)

text_similarities = model.similarity(text_embeddings, text_embeddings)
print(text_similarities)
# tensor([[1.0000, 0.9453, 0.6885, 0.5223],
#         [0.9453, 1.0000, 0.7283, 0.5434],
#         [0.6885, 0.7283, 1.0000, 0.3772],
#         [0.5223, 0.5434, 0.3772, 1.0000]])

# Encoding images (text, audio, and video also work, individually or combined using a dict input):
image_embeddings = model.encode([
    "path/to/image_1.png",
    "path/to/image_2.png",
])
print(image_embeddings.shape)
# (2, 3584)

# Multimodal inputs can mix modalities via dicts (text + image + audio + video):
queries = ["A diagram of the Qwen2.5-Omni architecture"]
documents = [
    {"image": "path/to/qwen_diagram.png"},
    {"text": "Llama 4 architecture overview", "image": "path/to/llama_diagram.png"},
]
query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities.shape)
# torch.Size([1, 2])
```

### Using Transformers

```python
import torch
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Optionally pass `max_pixels=1280*28*28` to the processor for more efficient encoding.
processor = Qwen2_5OmniProcessor.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "LCO-Embedding/LCO-Embedding-Omni-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

#### Text Batch Encoding:
  
```python
from tqdm import tqdm

texts = ["some random text", "a second random text", "a third random text"] * 30
batch_size = 8
text_prompt = "{}\nSummarize the above text in one word:"

all_text_embeddings = []

with torch.no_grad():
    for i in tqdm(range(0, len(texts), batch_size)):
        batch_texts = texts[i : i + batch_size]
        batch_texts = [text_prompt.format(text) for text in batch_texts]
        messages = [[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                ],
            }
        ] for text in batch_texts]
        text_inputs = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        text_inputs = processor(
            text=text_inputs,
            padding=True,
            return_tensors="pt",
        )
        text_inputs = text_inputs.to("cuda")
        # Use the last hidden state of the final token as the embedding.
        text_outputs = model(
            **text_inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_text_embeddings.append(text_outputs.to(torch.float16).cpu())

all_text_embeddings = torch.cat(all_text_embeddings, dim=0)
```
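Once the embeddings are collected, similarity search reduces to a normalized matrix product. A minimal, self-contained sketch (random placeholder tensors stand in for the `all_text_embeddings` produced by the loop above; 3584 is the hidden size of the 7B model):

```python
import torch
import torch.nn.functional as F

# Placeholder for the (90, 3584) tensor produced by the encoding loop.
all_text_embeddings = torch.randn(90, 3584, dtype=torch.float16)

# L2-normalize in float32 so dot products equal cosine similarities.
normalized = F.normalize(all_text_embeddings.float(), p=2, dim=1)
similarities = normalized @ normalized.T

print(similarities.shape)  # torch.Size([90, 90])
# Diagonal entries are 1.0: each text is maximally similar to itself.
```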

#### Image Batch Encoding:

```python
images = [some random PIL.Image] * 100  # better to load these with a DataLoader; see the MIEB evaluation pipeline
image_prompt = "\nSummarize the above image in one word:"
batch_size = 8

all_image_embeddings = []

with torch.no_grad():
    for i in tqdm(range(0, len(images), batch_size)):
        batch_images = images[i : i + batch_size]
        messages = [[
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": image_prompt},
                ],
            }
        ] for image in batch_images]
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(messages, use_audio_in_video=True)
        inputs = processor(
            text=text, 
            audio=audio_inputs, 
            images=image_inputs, 
            videos=video_inputs, 
            return_tensors="pt", 
            padding=True
        )
        inputs = inputs.to("cuda")
        image_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_image_embeddings.append(image_outputs.to(torch.float16).cpu())

all_image_embeddings = torch.cat(all_image_embeddings, dim=0)
```
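Because text and image embeddings share the same space, cross-modal retrieval is again a cosine-similarity lookup. A hedged sketch with random placeholder tensors standing in for the real query/document embeddings from the loops above:

```python
import torch
import torch.nn.functional as F

# Placeholders for embeddings from the text and image loops above.
text_embeddings = torch.randn(5, 3584)     # 5 text queries
image_embeddings = torch.randn(100, 3584)  # 100 candidate images

# Cosine similarity between every query and every candidate image.
scores = F.normalize(text_embeddings, dim=1) @ F.normalize(image_embeddings, dim=1).T

# Top-3 images per query, ranked by similarity.
top_scores, top_indices = scores.topk(k=3, dim=1)
print(top_indices.shape)  # torch.Size([5, 3])
```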

#### Audio Batch Encoding:

```python
import logging
logging.getLogger("root").setLevel(logging.ERROR)
# set this to prevent getting the Qwen Omni system prompt mismatch warning.

batch_size = 4
audio_prompt = "\nSummarize the above audio in one word:"
audios = [some audios] * 1000

all_audio_embeddings = []

with torch.no_grad():
    for i in tqdm(range(0, len(audios), batch_size)):
        torch.cuda.empty_cache()

        batch_audios = audios[i : i + batch_size]
        messages = [[
            {
                "role": "user",
                "content": [
                    {"type": "audio", "audio": audio},
                    {"type": "text", "text": audio_prompt},
                ],
            }
        ] for audio in batch_audios]

        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(
            messages, use_audio_in_video=False
        )
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True
        )
        inputs = inputs.to("cuda")
        audio_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_audio_embeddings.append(audio_outputs.to(torch.float16).cpu())
        del inputs, audio_outputs
        torch.cuda.empty_cache()

all_audio_embeddings = torch.cat(all_audio_embeddings, dim=0)

```

#### Video Batch Encoding:


```python
videos = [some videos] * 1000
video_prompt = "\nSummarize the above video in one word:"
batch_size = 4

# Set to True to apply the example hyperparameters below, which save RAM
# on long videos. Not optimal; tune case by case.
long_video = False

all_video_embeddings = []
with torch.no_grad():
    for i in tqdm(range(0, len(videos), batch_size)):
        torch.cuda.empty_cache()

        batch_videos = videos[i : i + batch_size]
        if long_video:
            messages = [[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "video",
                            "video": video,
                            "max_pixels": 224 * 224,
                            "fps": 1,
                            "max_frames": 10,
                        },
                        {"type": "text", "text": video_prompt},
                    ],
                }
            ] for video in batch_videos]
        else:
            messages = [[
                {
                    "role": "user",
                    "content": [
                        {"type": "video", "video": video},
                        {"type": "text", "text": video_prompt},
                    ],
                }
            ] for video in batch_videos]

        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(
            messages, use_audio_in_video=False
        )
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True
        )
        inputs = inputs.to("cuda")
        video_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_video_embeddings.append(video_outputs.to(torch.float16).cpu())

        del inputs, video_outputs
        torch.cuda.empty_cache()

all_video_embeddings = torch.cat(all_video_embeddings, dim=0)
```

## Overview

We introduce **LCO-Embedding**, a language-centric omnimodal representation learning method, and the LCO-Embedding model families, setting a new state-of-the-art on [MIEB](https://huggingface.co/blog/isaacchung/introducing-mieb) (Massive Image Embedding Benchmark) while also supporting audio and video.

This work also introduces the **Generation-Representation Scaling Law**, connecting models' generative capabilities and their representation upper bound. Furthermore, we introduce **SeaDoc**, a challenging visual document retrieval task in Southeast Asian languages, and show that continual generative pretraining before contrastive learning raises the representation upper bound.

<div align='center'><img src="https://cdn-uploads.huggingface.co/production/uploads/604f67ef0fe8ff3ec13d71ef/4Wd8fDFBdT6GxqN6-KzZN.png" alt="overview" width="100%"/></div>

## Evaluation Results

We evaluate LCO-Embedding against state-of-the-art embedding models, including E5-V, Voyage Multimodal 3, mmE5, and GME, on the MIEB-Lite benchmark (51 tasks), broken down by task category.

<div align='center'><img src="https://cdn-uploads.huggingface.co/production/uploads/63108cc834c7d77420b0fd68/63WBsKh57HbNwwe3bZ-oZ.png" alt="mieb_lite" width="100%"/></div>

LCO-Embedding is also state-of-the-art on MAEB (Massive Audio Embedding Benchmark), despite never being trained on audio. The screenshot below is from the MAEB paper.

![image](https://cdn-uploads.huggingface.co/production/uploads/63108cc834c7d77420b0fd68/cp5hfBmm51AlyO4sDnTrN.png)

Performance and efficiency comparisons of different training strategies using 3B and 7B variants of Qwen2.5-VL backbones.

<div align='center'><img src="https://github.com/LCO-Embedding/LCO-Embedding/raw/main/assets/lora_ablation.png" alt="lora_ablation" width="100%"/></div>

Scaling relationship between generation benchmark performance (X-axis) and representation benchmark performance after language-centric contrastive learning (Y-axis).

<div align='center'><img src="https://github.com/LCO-Embedding/LCO-Embedding/raw/main/assets/scaling.png" alt="scaling_law" width="100%"/></div>

## Citation

If you find LCO-Embedding useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{xiao2025scaling,
  title={Scaling Language-Centric Omnimodal Representation Learning},
  author={Xiao, Chenghao and Chan, Hou Pong and Zhang, Hao and Xu, Weiwen and Aljunied, Mahani and Rong, Yu},
  journal={arXiv preprint arXiv:2510.11693},
  year={2025}
}
```