---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: transformers
---

# LCO-Embedding: Scaling Language-Centric Omnimodal Representation Learning
We are thrilled to release LCO-Embedding, a language-centric omnimodal representation learning framework, and the LCO-Embedding model family!

This model implements the framework presented in the paper [Scaling Language-Centric Omnimodal Representation Learning](https://huggingface.co/papers/2510.11693), accepted at NeurIPS 2025.

**Project Page:** https://huggingface.co/LCO-Embedding

**Github Repository:** https://github.com/LCO-Embedding/LCO-Embedding
## Quick Start

Note: we use only the `thinker` component of Qwen2.5-Omni and drop the `talker` component. The snippets below also rely on `torch`, `tqdm`, and the `qwen_omni_utils` helper package.
```python
import torch
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Optionally pass `max_pixels=1280*28*28` to the processor for more efficient encoding
processor = Qwen2_5OmniProcessor.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "LCO-Embedding/LCO-Embedding-Omni-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

#### Text Batch Encodings:

```python
from tqdm import tqdm

texts = ["some random text", "a second random text", "a third random text"] * 30
batch_size = 8
text_prompt = "{}\nSummarize the above text in one word:"

all_text_embeddings = []

with torch.no_grad():
    for i in tqdm(range(0, len(texts), batch_size)):
        batch_texts = texts[i : i + batch_size]
        batch_texts = [text_prompt.format(text) for text in batch_texts]
        messages = [
            [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": text},
                    ],
                }
            ]
            for text in batch_texts
        ]
        text_inputs = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        text_inputs = processor(
            text=text_inputs,
            padding=True,
            return_tensors="pt",
        )
        text_inputs = text_inputs.to("cuda")
        # Last-token pooling: take the final layer's hidden state at the last position
        text_outputs = model(
            **text_inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_text_embeddings.append(text_outputs.to(torch.float16).cpu())

all_text_embeddings = torch.cat(all_text_embeddings, dim=0)
```
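
The encoder yields one vector per input. For similarity or retrieval use, a common convention (our suggestion, not prescribed by the original card) is to L2-normalize the embeddings and score with cosine similarity. A minimal sketch, reusing the `all_text_embeddings` tensor from above:

```python
import torch.nn.functional as F

# L2-normalize so that dot products equal cosine similarities
text_emb = F.normalize(all_text_embeddings.float(), p=2, dim=-1)

# Pairwise text-to-text similarity matrix
similarity = text_emb @ text_emb.T
print(similarity.shape)  # (len(texts), len(texts))
```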

#### Image Batch Encodings:

```python
from PIL import Image

# Load your own images here; "example.jpg" is a placeholder path.
# Loading them with a DataLoader is recommended (see the MIEB evaluation pipeline).
images = [Image.open("example.jpg")] * 100
image_prompt = "\nSummarize the above image in one word:"
batch_size = 8

all_image_embeddings = []

with torch.no_grad():
    for i in tqdm(range(0, len(images), batch_size)):
        batch_images = images[i : i + batch_size]
        messages = [
            [
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": image},
                        {"type": "text", "text": image_prompt},
                    ],
                }
            ]
            for image in batch_images
        ]
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(messages, use_audio_in_video=True)
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True,
        )
        inputs = inputs.to("cuda")
        # Same last-token pooling as for text
        image_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_image_embeddings.append(image_outputs.to(torch.float16).cpu())

all_image_embeddings = torch.cat(all_image_embeddings, dim=0)
```
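
With both modalities encoded in the same space, text-to-image retrieval reduces to a similarity search. The following is a minimal illustrative sketch (our addition, not part of the original card), assuming the `all_text_embeddings` and `all_image_embeddings` tensors computed above:

```python
import torch.nn.functional as F

# Normalize both sides so the matrix product yields cosine similarities
text_emb = F.normalize(all_text_embeddings.float(), p=2, dim=-1)
image_emb = F.normalize(all_image_embeddings.float(), p=2, dim=-1)

# similarity[i, j] = cosine similarity between text i and image j
similarity = text_emb @ image_emb.T

# Retrieve the top-5 images for each text query
top_scores, top_indices = similarity.topk(k=5, dim=-1)
```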

## Overview

We introduce **LCO-Embedding**, a language-centric omnimodal representation learning method, and the LCO-Embedding model family, setting a new state of the art on [MIEB](https://huggingface.co/blog/isaacchung/introducing-mieb) (Massive Image Embedding Benchmark) while also supporting audio and video.

This work also introduces the **Generation-Representation Scaling Law**, which connects a model's generative capabilities to its representation upper bound. Furthermore, we introduce **SeaDoc**, a challenging visual document retrieval task in Southeast Asian languages, and show that continual generative pretraining before contrastive learning raises the representation upper bound.

<div align='center'><img src="https://cdn-uploads.huggingface.co/production/uploads/604f67ef0fe8ff3ec13d71ef/4Wd8fDFBdT6GxqN6-KzZN.png" alt="overview" width="100%"/></div>

## Evaluation Results

We compare LCO-Embedding against state-of-the-art embedding models, including E5-V, Voyage Multimodal 3, mmE5, and GME, on the MIEB-Lite benchmark (51 tasks), broken down by task category.

<div align='center'><img src="https://cdn-uploads.huggingface.co/production/uploads/63108cc834c7d77420b0fd68/63WBsKh57HbNwwe3bZ-oZ.png" alt="mieb_lite" width="100%"/></div>

The figure below compares the performance and efficiency of different training strategies on the 3B and 7B variants of the Qwen2.5-VL backbone.

<div align='center'><img src="https://github.com/LCO-Embedding/LCO-Embedding/raw/main/assets/lora_ablation.png" alt="lora_ablation" width="100%"/></div>

The figure below shows the scaling relationship between generation benchmark performance (x-axis) and representation benchmark performance after language-centric contrastive learning (y-axis).

<div align='center'><img src="https://github.com/LCO-Embedding/LCO-Embedding/raw/main/assets/scaling.png" alt="scaling_law" width="100%"/></div>

## Citation

If you find LCO-Embedding useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{xiao2025scaling,
  title={Scaling Language-Centric Omnimodal Representation Learning},
  author={Xiao, Chenghao and Chan, Hou Pong and Zhang, Hao and Xu, Weiwen and Aljunied, Mahani and Rong, Yu},
  journal={arXiv preprint arXiv:2510.11693},
  year={2025}
}
```