---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: sentence-transformers
tags:
- transformers
- sentence-transformers
- feature-extraction
- multimodal-embedding
---
# LCO-Embedding: Scaling Language-Centric Omnimodal Representation Learning
We are thrilled to release LCO-Embedding, a language-centric omnimodal representation learning framework, together with the LCO-Embedding model family!
This model implements the framework presented in the paper [Scaling Language-Centric Omnimodal Representation Learning](https://huggingface.co/papers/2510.11693), accepted to NeurIPS 2025.
**Project Page:** https://huggingface.co/LCO-Embedding
**Github Repository:** https://github.com/LCO-Embedding/LCO-Embedding
## Quick Start
Note: we use only the `thinker` component of Qwen2.5-Omni and drop the `talker` component.
### Using Sentence Transformers
Install Sentence Transformers:
```bash
pip install "sentence_transformers[image]"
```
```python
import torch
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    "LCO-Embedding/LCO-Embedding-Omni-7B",
    trust_remote_code=True,
    model_kwargs={"dtype": torch.bfloat16},
)
# The same "Summarize the above <modality> in one word:" instruction used in
# the paper is baked into the chat template, so encode() takes plain text or
# multimodal dicts directly.
texts = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
    "The Eiffel Tower is located in Paris.",
    "Berlin is the capital of Germany.",
]
text_embeddings = model.encode(texts)
print(text_embeddings.shape)
# (4, 3584)
text_similarities = model.similarity(text_embeddings, text_embeddings)
print(text_similarities)
# tensor([[1.0000, 0.9453, 0.6885, 0.5223],
# [0.9453, 1.0000, 0.7283, 0.5434],
# [0.6885, 0.7283, 1.0000, 0.3772],
# [0.5223, 0.5434, 0.3772, 1.0000]])
# Encoding images (text, audio, and video also work, individually or combined using a dict input):
image_embeddings = model.encode([
    "path/to/image_1.png",
    "path/to/image_2.png",
])
print(image_embeddings.shape)
# (2, 3584)
# Multimodal inputs can mix modalities via dicts (text + image + audio + video):
queries = ["A diagram of the Qwen2.5-Omni architecture"]
documents = [
    {"image": "path/to/qwen_diagram.png"},
    {"text": "Llama 4 architecture overview", "image": "path/to/llama_diagram.png"},
]
query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities.shape)
# torch.Size([1, 2])
```
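Audio and video inputs use the same dict interface as images; a minimal sketch, with hypothetical file paths:
```python
# Hypothetical local files; audio and video follow the same dict pattern as images.
audio_embeddings = model.encode([
    {"audio": "path/to/speech_1.wav"},
    {"audio": "path/to/speech_2.wav"},
])
video_embeddings = model.encode([{"video": "path/to/clip.mp4"}])
print(audio_embeddings.shape)
# (2, 3584)
```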
### Using Transformers
```python
import torch
from tqdm import tqdm
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# For more memory-efficient image encoding, `max_pixels` can be capped (see the sketch below).
processor = Qwen2_5OmniProcessor.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "LCO-Embedding/LCO-Embedding-Omni-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
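If memory is a concern, capping `max_pixels` reduces the number of image tokens per input; a minimal sketch of that variant:
```python
# Optional variant: cap image resolution for cheaper encoding.
processor = Qwen2_5OmniProcessor.from_pretrained(
    "LCO-Embedding/LCO-Embedding-Omni-7B",
    max_pixels=1280 * 28 * 28,
)
```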
#### Text Batch Encoding:
```python
texts = ["some random text", "a second random text", "a third random text"] * 30
batch_size = 8
text_prompt = "{}\nSummarize the above text in one word:"
all_text_embeddings = []
with torch.no_grad():
for i in tqdm(range(0, len(texts), batch_size)):
batch_texts = texts[i : i + batch_size]
batch_texts = [text_prompt.format(text) for text in batch_texts]
messages = [[
{
"role": "user",
"content": [
{"type": "text", "text":text},
],
}
] for text in batch_texts]
text_inputs = processor.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)
text_inputs = processor(
text = text_inputs,
padding = True,
return_tensors = "pt",
)
text_inputs = text_inputs.to("cuda")
text_outputs = model(
**text_inputs, output_hidden_states=True, return_dict=True
).hidden_states[-1][:, -1, :]
all_text_embeddings.append(text_outputs.to(torch.float16).cpu())
all_text_embeddings = torch.cat(all_text_embeddings, dim=0)
```
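The pooled last-token states are raw hidden states; for scoring, L2-normalize them and take dot products, matching the cosine similarity used in the Sentence Transformers example above. A minimal sketch:
```python
import torch.nn.functional as F

# L2-normalize, then one matrix product yields all pairwise cosine similarities.
normed = F.normalize(all_text_embeddings.float(), p=2, dim=-1)
text_similarities = normed @ normed.T  # (90, 90) for the 90 texts above
```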
#### Image Batch Encoding:
```python
from PIL import Image

# Placeholder list of PIL images (hypothetical path); for large corpora,
# consider loading them with a DataLoader as in the MIEB evaluation pipeline.
images = [Image.open("path/to/image.png")] * 100
image_prompt = "\nSummarize the above image in one word:"
batch_size = 8

all_image_embeddings = []
with torch.no_grad():
    for i in tqdm(range(0, len(images), batch_size)):
        batch_images = images[i : i + batch_size]
        messages = [[
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": image_prompt},
                ],
            }
        ] for image in batch_images]
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(messages, use_audio_in_video=True)
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True,
        )
        inputs = inputs.to("cuda")
        image_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_image_embeddings.append(image_outputs.to(torch.float16).cpu())
all_image_embeddings = torch.cat(all_image_embeddings, dim=0)
```
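With text and image embeddings from the two loops above, cross-modal retrieval reduces to the same cosine-similarity lookup; a minimal sketch:
```python
import torch.nn.functional as F

# Score every text query against every image; take the top-1 image per query.
text_normed = F.normalize(all_text_embeddings.float(), p=2, dim=-1)
image_normed = F.normalize(all_image_embeddings.float(), p=2, dim=-1)
scores = text_normed @ image_normed.T  # (num_texts, num_images)
top_image_per_text = scores.argmax(dim=-1)
```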
#### Audio Batch Encoding:
```python
import logging
# Suppress the Qwen Omni system-prompt mismatch warning.
logging.getLogger("root").setLevel(logging.ERROR)

batch_size = 4
audio_prompt = "\nSummarize the above audio in one word:"
audios = ["path/to/audio.wav"] * 1000  # placeholder; see the sketch below

all_audio_embeddings = []
with torch.no_grad():
    for i in tqdm(range(0, len(audios), batch_size)):
        torch.cuda.empty_cache()
        batch_audios = audios[i : i + batch_size]
        messages = [[
            {
                "role": "user",
                "content": [
                    {"type": "audio", "audio": audio},
                    {"type": "text", "text": audio_prompt},
                ],
            }
        ] for audio in batch_audios]
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(
            messages, use_audio_in_video=False
        )
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True,
        )
        inputs = inputs.to("cuda")
        audio_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_audio_embeddings.append(audio_outputs.to(torch.float16).cpu())
        del inputs, audio_outputs
        torch.cuda.empty_cache()
all_audio_embeddings = torch.cat(all_audio_embeddings, dim=0)
```
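The `audios` placeholder can hold any inputs `process_mm_info` resolves; assuming it accepts local file paths (as in the Qwen2.5-Omni cookbooks), a minimal sketch of building the list from a hypothetical directory:
```python
from pathlib import Path

# Hypothetical directory of .wav files; process_mm_info loads them
# when the messages are converted into processor inputs.
audios = sorted(str(p) for p in Path("data/audio").glob("*.wav"))
```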
#### Video Batch Encoding:
```python
videos = ["path/to/video.mp4"] * 1000  # placeholder, analogous to the audio list above
video_prompt = "\nSummarize the above video in one word:"
batch_size = 4
# Example hyperparameters for long videos to save RAM; not optimal, tune case by case.
long_video = False

all_video_embeddings = []
with torch.no_grad():
    for i in tqdm(range(0, len(videos), batch_size)):
        torch.cuda.empty_cache()
        batch_videos = videos[i : i + batch_size]
        if long_video:
            messages = [[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "video",
                            "video": video,
                            "max_pixels": 224 * 224,
                            "fps": 1,
                            "max_frames": 10,
                        },
                        {"type": "text", "text": video_prompt},
                    ],
                }
            ] for video in batch_videos]
        else:
            messages = [[
                {
                    "role": "user",
                    "content": [
                        {"type": "video", "video": video},
                        {"type": "text", "text": video_prompt},
                    ],
                }
            ] for video in batch_videos]
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(
            messages, use_audio_in_video=False
        )
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True,
        )
        inputs = inputs.to("cuda")
        video_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_video_embeddings.append(video_outputs.to(torch.float16).cpu())
        del inputs, video_outputs
        torch.cuda.empty_cache()
all_video_embeddings = torch.cat(all_video_embeddings, dim=0)
```
## Overview
We introduce **LCO-Embedding**, a language-centric omnimodal representation learning method, and the LCO-Embedding model family, setting a new state-of-the-art on [MIEB](https://huggingface.co/blog/isaacchung/introducing-mieb) (Massive Image Embedding Benchmark) while also supporting audio and video.
This work also introduces the **Generation-Representation Scaling Law**, connecting models' generative capabilities and their representation upper bound. Furthermore, we introduce **SeaDoc**, a challenging visual document retrieval task in Southeast Asian languages, and show that continual generative pretraining before contrastive learning raises the representation upper bound.
<div align='center'><img src="https://cdn-uploads.huggingface.co/production/uploads/604f67ef0fe8ff3ec13d71ef/4Wd8fDFBdT6GxqN6-KzZN.png" alt="overview" width="100%"/></div>
## Evaluation Results
We compare LCO-Embedding against state-of-the-art embedding models, including E5-V, Voyage Multimodal 3, mmE5, and GME, on MIEB-Lite (a 51-task subset of MIEB), broken down by task category.
<div align='center'><img src="https://cdn-uploads.huggingface.co/production/uploads/63108cc834c7d77420b0fd68/63WBsKh57HbNwwe3bZ-oZ.png" alt="mieb_lite" width="100%"/></div>
LCO-Embedding is also state-of-the-art on MAEB (Massive Audio Embedding Benchmark) without training on any audio data. The screenshot below is from the MAEB paper.

Performance and efficiency comparisons of different training strategies using 3B and 7B variants of Qwen2.5-VL backbones.
<div align='center'><img src="https://github.com/LCO-Embedding/LCO-Embedding/raw/main/assets/lora_ablation.png" alt="lora_ablation" width="100%"/></div>
Scaling relationship between generation benchmark performance (X-axis) and representation benchmark performance after language-centric contrastive learning (Y-axis).
<div align='center'><img src="https://github.com/LCO-Embedding/LCO-Embedding/raw/main/assets/scaling.png" alt="scaling_law" width="100%"/></div>
## Citation
If you find LCO-Embedding useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{xiao2025scaling,
  title={Scaling Language-Centric Omnimodal Representation Learning},
  author={Xiao, Chenghao and Chan, Hou Pong and Zhang, Hao and Xu, Weiwen and Aljunied, Mahani and Rong, Yu},
  journal={arXiv preprint arXiv:2510.11693},
  year={2025}
}
``` |