	---
	license: apache-2.0
	language:
	- zh
	pipeline_tag: sentence-similarity
	library_name: transformers
	tags:
	- transformers
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- text-embeddings-inference
	---
	<p align="center">
	<img src="images/youtu_embedding.png" width="400"/>
	</p>

	<p align="center">
	🤗 <a href="https://huggingface.co/tencent/Youtu-Embedding"><b>Hugging Face</b></a> |
	🖥️ <a href="https://github.com/TencentCloudADP/youtu-embedding"><b>GitHub</b></a> |
	🌎 <a href="https://arxiv.org/abs/2508.11442"><b>Technical Report</b></a>
	</p>
	<p align="center">
	💬 <a href="https://huggingface.co/tencent/Youtu-Embedding/blob/main/images/wechat_qr.png"><b>WeChat</b></a> |
	🤖 <a href="https://discord.gg/QjqhkHQVVM"><b>Discord</b></a>
	</p>

	## 🎯 Introduction

	Youtu-Embedding is a state-of-the-art, general-purpose text embedding model developed by Tencent Youtu Lab. It delivers exceptional performance across a wide range of natural language processing tasks, including Information Retrieval (IR), Semantic Textual Similarity (STS), Clustering, Reranking, and Classification.

	- Top-Ranked Performance: Ranked first on CMTEB (Chinese Massive Text Embedding Benchmark) with a mean task score of 77.58 as of September 2025.

	- Innovative Training Framework: Features a Collaborative-Discriminative Fine-tuning Framework designed to resolve the "negative transfer" problem in multi-task learning. This is accomplished through a unified data format, task-differentiated loss functions, and a dynamic single-task sampling mechanism.


	> Note: You can easily adapt and fine-tune the model on your own datasets for domain-specific tasks. For implementation details, please refer to the [training code](https://github.com/TencentCloudADP/youtu-embedding).
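
The single-task sampling idea from the list above can be illustrated with a toy batcher. This is a minimal sketch, not the authors' training code: the `task` field, the record layout, and the `single_task_batches` helper are all hypothetical, and the plain shuffling here is a static stand-in for the dynamic mechanism described in the technical report. It only shows that each batch holds examples from one task, so a task-specific loss could be applied per batch.

```python
import random
from collections import defaultdict


def single_task_batches(examples, batch_size, seed=0):
    """Return a shuffled list of (task, batch) pairs; each batch holds one task only."""
    rng = random.Random(seed)

    # Group records by their (hypothetical) "task" field.
    by_task = defaultdict(list)
    for example in examples:
        by_task[example["task"]].append(example)

    # Cut each task's records into batches, then interleave tasks by shuffling batches.
    batches = []
    for task, items in by_task.items():
        rng.shuffle(items)
        for start in range(0, len(items), batch_size):
            batches.append((task, items[start:start + batch_size]))
    rng.shuffle(batches)
    return batches


# Hypothetical records sharing one unified format across tasks.
examples = (
    [{"task": "retrieval", "query": f"q{i}", "pos": f"d{i}"} for i in range(5)]
    + [{"task": "sts", "query": f"a{i}", "pos": f"b{i}", "score": 0.5} for i in range(3)]
)
batches = single_task_batches(examples, batch_size=2)
for task, batch in batches:
    print(task, len(batch))
```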


	## 🤗 Model Download

	| Model Name | Parameters | Dimensions | Sequence Length | Download |
	| :-------------- | :--------: | :--------: | :-------------: | :------------------------------------------------------ |
	| Youtu-Embedding | 2B | 2048 | 8K | [Model](https://huggingface.co/tencent/Youtu-Embedding) |
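
For capacity planning, the 2048-dimensional output translates directly into index size. The sketch below is a back-of-the-envelope calculation that assumes vectors are stored uncompressed as float32 in a flat index; `index_size_gib` is a hypothetical helper, and real vector stores add their own overhead.

```python
EMBEDDING_DIM = 2048   # "Dimensions" column in the table above
BYTES_PER_VALUE = 4    # assumption: float32 storage


def index_size_gib(num_vectors, dim=EMBEDDING_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Raw size in GiB of a flat, uncompressed vector index."""
    return num_vectors * dim * bytes_per_value / 2**30


# One million passages: 1_000_000 * 2048 * 4 bytes
print(f"{index_size_gib(1_000_000):.2f} GiB")  # 7.63 GiB
```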


	## 🚀 Usage
	#### 1. Using `transformers`
	📦 Installation
	```bash
	pip install transformers==4.51.3
	```
	⚙️ Usage
	```python
	import torch
	import numpy as np
	from transformers import AutoModel, AutoTokenizer


	class LLMEmbeddingModel():

	    def __init__(self,
	                 model_name_or_path,
	                 batch_size=128,
	                 max_length=1024,
	                 gpu_id=0):
	        self.model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)
	        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="right")

	        # Use the requested GPU when available; otherwise fall back to CPU.
	        self.device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")
	        self.model.to(self.device).eval()

	        self.max_length = max_length
	        self.batch_size = batch_size  # stored for callers; not used in this snippet

	        # Queries carry a task instruction; passages are encoded without a prefix.
	        query_instruction = "Given a search query, retrieve passages that answer the question"
	        if query_instruction:
	            self.query_instruction = f"Instruction: {query_instruction} \nQuery:"
	        else:
	            self.query_instruction = "Query:"

	        self.doc_instruction = ""
	        print(f"query instruction: {[self.query_instruction]}\ndoc instruction: {[self.doc_instruction]}")

	    def mean_pooling(self, hidden_state, attention_mask):
	        # Average the hidden states over tokens whose mask value is 1.
	        s = torch.sum(hidden_state * attention_mask.unsqueeze(-1).float(), dim=1)
	        d = attention_mask.sum(dim=1, keepdim=True).float()
	        embedding = s / d
	        return embedding

	    @torch.no_grad()
	    def encode(self, sentences_batch, instruction):
	        inputs = self.tokenizer(
	            sentences_batch,
	            padding=True,
	            truncation=True,
	            return_tensors="pt",
	            max_length=self.max_length,
	            add_special_tokens=True,
	        ).to(self.device)

	        outputs = self.model(**inputs)
	        last_hidden_state = outputs[0]

	        # Tokenize the instruction alone to find how many leading tokens it occupies.
	        instruction_tokens = self.tokenizer(
	            instruction,
	            padding=False,
	            truncation=True,
	            max_length=self.max_length,
	            add_special_tokens=True,
	        )["input_ids"]

	        # Zero the mask over the instruction prefix (after the forward pass) so that
	        # pooling averages only the query or passage tokens.
	        if len(np.shape(np.array(instruction_tokens))) == 1:
	            # A single instruction shared by the whole batch.
	            inputs["attention_mask"][:, :len(instruction_tokens)] = 0
	        else:
	            # One instruction per sentence.
	            instruction_length = [len(item) for item in instruction_tokens]
	            assert len(instruction) == len(sentences_batch)
	            for idx in range(len(instruction_length)):
	                inputs["attention_mask"][idx, :instruction_length[idx]] = 0

	        embeddings = self.mean_pooling(last_hidden_state, inputs["attention_mask"])
	        embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
	        return embeddings

	    def encode_queries(self, queries):
	        queries = queries if isinstance(queries, list) else [queries]
	        queries = [f"{self.query_instruction}{query}" for query in queries]
	        return self.encode(queries, self.query_instruction)

	    def encode_passages(self, passages):
	        passages = passages if isinstance(passages, list) else [passages]
	        passages = [f"{self.doc_instruction}{passage}" for passage in passages]
	        return self.encode(passages, self.doc_instruction)

	    def compute_similarity_for_vectors(self, q_reps, p_reps):
	        # Embeddings are L2-normalized, so the dot product is the cosine similarity.
	        if len(p_reps.size()) == 2:
	            return torch.matmul(q_reps, p_reps.transpose(0, 1))
	        return torch.matmul(q_reps, p_reps.transpose(-2, -1))

	    def compute_similarity(self, queries, passages):
	        q_reps = self.encode_queries(queries)
	        p_reps = self.encode_passages(passages)
	        scores = self.compute_similarity_for_vectors(q_reps, p_reps)
	        scores = scores.detach().cpu().tolist()
	        return scores


	queries = ["What's the weather like?"]
	passages = [
	    'The weather is lovely today.',
	    "It's so sunny outside!",
	    'He drove to the stadium.'
	]

	model_name_or_path = "tencent/Youtu-Embedding"
	model = LLMEmbeddingModel(model_name_or_path)
	scores = model.compute_similarity(queries, passages)
	print(f"scores: {scores}")
	```
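
The pooling step in the class above is easier to follow on a tiny hand-made input. The following dependency-free sketch mirrors what `mean_pooling` does once the attention mask has been zeroed over the instruction prefix and the padding. The vectors are made-up numbers, not model outputs, and `masked_mean_pool` is a hypothetical helper name.

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors whose mask value is 1; masked tokens are ignored."""
    dim = len(hidden[0])
    kept = sum(mask)
    return [
        sum(vector[j] * m for vector, m in zip(hidden, mask)) / kept
        for j in range(dim)
    ]


# Four token vectors: an instruction token, two content tokens, one padding token.
hidden = [[9.0, 9.0], [1.0, 0.0], [3.0, 4.0], [7.0, 7.0]]
mask = [0, 1, 1, 0]  # instruction and padding positions are zeroed

pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [2.0, 2.0]
```

Only the two content tokens contribute, so the large values at the masked positions have no effect on the result. In the class above the pooled vector is then L2-normalized.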

	#### 2. Using `sentence-transformers`
	📦 Installation
	```bash
	pip install sentence-transformers==5.1.0
	```
	⚙️ Usage
	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("tencent/Youtu-Embedding", trust_remote_code=True)
	queries = ["What's the weather like?"]
	passages = [
	    'The weather is lovely today.',
	    "It's so sunny outside!",
	    'He drove to the stadium.'
	]
	queries_embeddings = model.encode_query(queries)
	passages_embeddings = model.encode_document(passages)

	similarities = model.similarity(queries_embeddings, passages_embeddings)
	print(similarities)
	```
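
To turn one row of the similarity matrix into a ranked list, a small helper such as the one below can be used. It is a sketch with placeholder scores chosen for illustration, not real model output; in practice the row would come from something like `similarities[0].tolist()`. The `rank_passages` name is hypothetical.

```python
def rank_passages(scores, passages, top_k=3):
    """Return up to top_k (passage, score) pairs, highest score first."""
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


passages = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.'
]
scores = [0.42, 0.35, 0.08]  # placeholder values, not produced by the model

top = rank_passages(scores, passages, top_k=2)
for passage, score in top:
    print(f"{score:.2f}  {passage}")
```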

	#### 3. Using `LangChain` 🦜
	Easily integrate the model into your LangChain applications, such as RAG pipelines.

	📦 Installation

	```bash
	pip install langchain==0.3.27 langchain-community==0.3.29 langchain-huggingface==0.3.1 sentence-transformers==5.1.0 faiss-cpu==1.11.0
	```

	⚙️ Usage
	```python
	import torch
	from langchain.docstore.document import Document
	from langchain_community.vectorstores import FAISS
	from langchain_huggingface.embeddings import HuggingFaceEmbeddings

	model_name_or_path = "tencent/Youtu-Embedding"
	device = "cuda" if torch.cuda.is_available() else "cpu"

	model_kwargs = {
	    'trust_remote_code': True,
	    'device': device
	}

	embedder = HuggingFaceEmbeddings(
	    model_name=model_name_or_path,
	    model_kwargs=model_kwargs,
	)

	query_instruction = "Instruction: Given a search query, retrieve passages that answer the question \nQuery:"
	doc_instruction = ""

	data = [
	    "Venus is often called Earth's twin because of its similar size and proximity.",
	    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
	    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
	    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
	]

	documents = [Document(page_content=text, metadata={"id": i}) for i, text in enumerate(data)]
	vector_store = FAISS.from_documents(documents, embedder, distance_strategy="MAX_INNER_PRODUCT")

	query = "Which planet is known as the Red Planet?"
	instructed_query = query_instruction + query
	results = vector_store.similarity_search_with_score(instructed_query, k=3)

	print(f"Original Query: {query}\n")
	print("Results:")
	for doc, score in results:
	    print(f"- Text: {doc.page_content} (Score: {score:.4f})")

	```
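
The query prefix used in the examples in this README follows one pattern: `Instruction: <task description> \nQuery:` for queries and an empty prefix for documents. The helper below is a small sketch of that pattern for reuse across integrations. The function names are hypothetical, and the format is copied from the strings shown in this README rather than from a published specification.

```python
DEFAULT_TASK = "Given a search query, retrieve passages that answer the question"


def build_query_prefix(task_description=DEFAULT_TASK):
    """Build the query-side prefix; fall back to a bare 'Query:' with no task."""
    if task_description:
        return f"Instruction: {task_description} \nQuery:"
    return "Query:"


def instruct_query(query, task_description=DEFAULT_TASK):
    """Prepend the instruction prefix to a raw query string."""
    return build_query_prefix(task_description) + query


print(repr(instruct_query("Which planet is known as the Red Planet?")))
```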

	#### 4. Using `LlamaIndex` 🦙
	Use the model as the embedding backend for LlamaIndex search and retrieval systems.

	📦 Installation

	```bash
	pip install llama-index==0.14.2 llama-index-embeddings-huggingface==0.6.1 sentence-transformers==5.1.0 llama-index-vector-stores-faiss==0.5.1
	```

	⚙️ Usage
	```python
	import faiss
	import torch
	from llama_index.core.schema import TextNode
	from llama_index.core.vector_stores import VectorStoreQuery
	from llama_index.vector_stores.faiss import FaissVectorStore
	from llama_index.embeddings.huggingface import HuggingFaceEmbedding

	model_name_or_path = "tencent/Youtu-Embedding"
	device = "cuda" if torch.cuda.is_available() else "cpu"

	embeddings = HuggingFaceEmbedding(
	    model_name=model_name_or_path,
	    trust_remote_code=True,
	    device=device,
	    query_instruction="Instruction: Given a search query, retrieve passages that answer the question \nQuery:",
	    text_instruction=""
	)

	data = [
	    "Venus is often called Earth's twin because of its similar size and proximity.",
	    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
	    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
	    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
	]

	nodes = [TextNode(id_=str(i), text=text) for i, text in enumerate(data)]

	for node in nodes:
	    node.embedding = embeddings.get_text_embedding(node.get_content())

	embed_dim = len(nodes[0].embedding)
	store = FaissVectorStore(faiss_index=faiss.IndexFlatIP(embed_dim))
	store.add(nodes)

	query = "Which planet is known as the Red Planet?"
	query_embedding = embeddings.get_query_embedding(query)

	results = store.query(
	    VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=3)
	)

	print(f"Query: {query}\n")
	print("Results:")
	for idx, score in zip(results.ids, results.similarities):
	    print(f"- Text: {data[int(idx)]} (Score: {score:.4f})")

	```
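
The inner-product index (`IndexFlatIP`) is a natural fit when embeddings are L2-normalized, as they are in the `transformers` example above: for unit vectors, the inner product equals the cosine similarity. The sketch below checks this on two made-up vectors in plain Python; it assumes nothing about the model itself.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

inner_product_of_units = dot(l2_normalize(a), l2_normalize(b))
cosine_of_raw = cosine(a, b)

# Both are 24 / 25 up to floating-point rounding.
print(f"{inner_product_of_units:.4f} {cosine_of_raw:.4f}")
```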


	## 📊 CMTEB
	| Model | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
	| :---------------------- | :----- | :--------- | :--------- | :----: | :----: | :---------: | :-----: | :---: | :---: |
	| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 79.30 | 68.28 | 73.73 | 55.19 |
	| ritrieve_zh_v1 | 326M | 72.71 | 73.85 | 76.88 | 66.50 | 85.98 | 72.86 | 76.97 | 63.92 |
	| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
	| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |
	| Conan-embedding-v2 | 1.4B | 74.24 | 75.99 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 |
	| Seed1.6-embedding | - | 75.63 | 76.68 | 77.98 | 73.11 | 88.71 | 71.65 | 79.69 | 68.94 |
	| QZhou-Embedding | 7B | 76.99 | 78.58 | 79.99 | 70.91 | 95.07 | 74.85 | 78.80 | 71.89 |
	| Youtu-Embedding | 2B | 77.58 | 78.86 | 78.65 | 84.27 | 86.12 | 75.10 | 80.21 | 68.82 |

	> Note: Comparative scores are from the MTEB [leaderboard](https://huggingface.co/spaces/mteb/leaderboard), recorded on September 28, 2025.
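
The Mean(Type) column appears to be the unweighted average of the six per-type columns. The short check below reproduces the reported value for two rows of the table. It is a sanity check on the published numbers, not a re-evaluation, and rounding in the published per-type scores can shift the last digit for some rows. Mean(Task) averages over individual tasks and cannot be recomputed from this table.

```python
# Per-type scores copied from the table: Class., Clust., Pair Class., Rerank., Retr., STS
TYPE_SCORES = {
    "Youtu-Embedding": [78.65, 84.27, 86.12, 75.10, 80.21, 68.82],
    "bge-multilingual-gemma2": [75.31, 59.30, 79.30, 68.28, 73.73, 55.19],
}


def mean_type(scores):
    """Unweighted mean over task types."""
    return sum(scores) / len(scores)


for name, scores in TYPE_SCORES.items():
    print(f"{name}: {mean_type(scores):.2f}")
# Youtu-Embedding: 78.86
# bge-multilingual-gemma2: 68.52
```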


	## 🎉 Citation
	```bibtex
	@misc{zhang2025codiemb,
	  title={CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity},
	  author={Zhang, Bowen and Song, Zixin and Chen, Chunquan and Zhang, Qian-Wen and Yin, Di and Sun, Xing},
	  year={2025},
	  eprint={2508.11442},
	  archivePrefix={arXiv},
	  url={https://arxiv.org/abs/2508.11442},
	}
	```