---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: transformers
---

# LCO-Embedding: Scaling Language-Centric Omnimodal Representation Learning
We are thrilled to release LCO-Embedding, a language-centric omnimodal representation learning framework, and the LCO-Embedding model family!

This model implements the framework presented in the paper [Scaling Language-Centric Omnimodal Representation Learning](https://huggingface.co/papers/2510.11693), accepted at NeurIPS 2025.

**Project Page:** https://huggingface.co/LCO-Embedding

**Github Repository:** https://github.com/LCO-Embedding/LCO-Embedding
## Quick Start

Note: we use only the `thinker` component of Qwen2.5-Omni and drop the `talker` component. The snippets below also rely on `torch`, `tqdm`, and the `qwen_omni_utils` helper package.
```python
import torch
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Optionally pass `max_pixels=1280*28*28` to the processor for more efficient encoding
processor = Qwen2_5OmniProcessor.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "LCO-Embedding/LCO-Embedding-Omni-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

#### Text Batch Encodings:

```python
from tqdm import tqdm

texts = ["some random text", "a second random text", "a third random text"] * 30
batch_size = 8
text_prompt = "{}\nSummarize the above text in one word:"

all_text_embeddings = []

with torch.no_grad():
    for i in tqdm(range(0, len(texts), batch_size)):
        batch_texts = texts[i : i + batch_size]
        batch_texts = [text_prompt.format(text) for text in batch_texts]
        messages = [
            [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": text},
                    ],
                }
            ]
            for text in batch_texts
        ]
        text_inputs = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        text_inputs = processor(
            text=text_inputs,
            padding=True,
            return_tensors="pt",
        )
        text_inputs = text_inputs.to("cuda")
        # Last-token pooling: take the final layer's hidden state at the last position
        text_outputs = model(
            **text_inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_text_embeddings.append(text_outputs.to(torch.float16).cpu())

all_text_embeddings = torch.cat(all_text_embeddings, dim=0)
```
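
The encoder yields one vector per input. For similarity or retrieval use, a common convention (our suggestion, not prescribed by the original card) is to L2-normalize the embeddings and score with cosine similarity. A minimal sketch, reusing the `all_text_embeddings` tensor from above:

```python
import torch.nn.functional as F

# L2-normalize so that dot products equal cosine similarities
text_emb = F.normalize(all_text_embeddings.float(), p=2, dim=-1)

# Pairwise text-to-text similarity matrix
similarity = text_emb @ text_emb.T
print(similarity.shape)  # (len(texts), len(texts))
```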

#### Image Batch Encodings:

```python
from PIL import Image

# Load your own images here; "example.jpg" is a placeholder path.
# Loading them with a DataLoader is recommended (see the MIEB evaluation pipeline).
images = [Image.open("example.jpg")] * 100
image_prompt = "\nSummarize the above image in one word:"
batch_size = 8

all_image_embeddings = []

with torch.no_grad():
    for i in tqdm(range(0, len(images), batch_size)):
        batch_images = images[i : i + batch_size]
        messages = [
            [
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": image},
                        {"type": "text", "text": image_prompt},
                    ],
                }
            ]
            for image in batch_images
        ]
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(messages, use_audio_in_video=True)
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True,
        )
        inputs = inputs.to("cuda")
        # Same last-token pooling as for text
        image_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_image_embeddings.append(image_outputs.to(torch.float16).cpu())

all_image_embeddings = torch.cat(all_image_embeddings, dim=0)
```
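
With both modalities encoded in the same space, text-to-image retrieval reduces to a similarity search. The following is a minimal illustrative sketch (our addition, not part of the original card), assuming the `all_text_embeddings` and `all_image_embeddings` tensors computed above:

```python
import torch.nn.functional as F

# Normalize both sides so the matrix product yields cosine similarities
text_emb = F.normalize(all_text_embeddings.float(), p=2, dim=-1)
image_emb = F.normalize(all_image_embeddings.float(), p=2, dim=-1)

# similarity[i, j] = cosine similarity between text i and image j
similarity = text_emb @ image_emb.T

# Retrieve the top-5 images for each text query
top_scores, top_indices = similarity.topk(k=5, dim=-1)
```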

## Overview

We introduce **LCO-Embedding**, a language-centric omnimodal representation learning method, and the LCO-Embedding model family, setting a new state of the art on [MIEB](https://huggingface.co/blog/isaacchung/introducing-mieb) (Massive Image Embedding Benchmark) while also supporting audio and video.

This work also introduces the **Generation-Representation Scaling Law**, which connects a model's generative capabilities to its representation upper bound. Furthermore, we introduce **SeaDoc**, a challenging visual document retrieval task in Southeast Asian languages, and show that continual generative pretraining before contrastive learning raises the representation upper bound.

<div align='center'><img src="https://cdn-uploads.huggingface.co/production/uploads/604f67ef0fe8ff3ec13d71ef/4Wd8fDFBdT6GxqN6-KzZN.png" alt="overview" width="100%"/></div>

## Evaluation Results

We compare LCO-Embedding against state-of-the-art embedding models, including E5-V, Voyage Multimodal 3, mmE5, and GME, on the MIEB-Lite benchmark (51 tasks), broken down by task category.

<div align='center'><img src="https://cdn-uploads.huggingface.co/production/uploads/63108cc834c7d77420b0fd68/63WBsKh57HbNwwe3bZ-oZ.png" alt="mieb_lite" width="100%"/></div>

The figure below compares the performance and efficiency of different training strategies on the 3B and 7B variants of the Qwen2.5-VL backbone.

<div align='center'><img src="https://github.com/LCO-Embedding/LCO-Embedding/raw/main/assets/lora_ablation.png" alt="lora_ablation" width="100%"/></div>

The figure below shows the scaling relationship between generation benchmark performance (x-axis) and representation benchmark performance after language-centric contrastive learning (y-axis).

<div align='center'><img src="https://github.com/LCO-Embedding/LCO-Embedding/raw/main/assets/scaling.png" alt="scaling_law" width="100%"/></div>

## Citation

If you find LCO-Embedding useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{xiao2025scaling,
  title={Scaling Language-Centric Omnimodal Representation Learning},
  author={Xiao, Chenghao and Chan, Hou Pong and Zhang, Hao and Xu, Weiwen and Aljunied, Mahani and Rong, Yu},
  journal={arXiv preprint arXiv:2510.11693},
  year={2025}
}
```