---
license: apache-2.0
language:
- zh
pipeline_tag: sentence-similarity
library_name: transformers
tags:
- transformers
- sentence-transformers
- sentence-similarity
- feature-extraction
- text-embeddings-inference
---

<p align="center">
<img src="images/youtu_embedding.png" width="400"/>
</p>

<p align="center">
🤗 <a href="https://huggingface.co/tencent/Youtu-Embedding"><b>Hugging Face</b></a> |
🖥️ <a href="https://github.com/TencentCloudADP/youtu-embedding"><b>GitHub</b></a> |
🌎 <a href="https://arxiv.org/abs/2508.11442"><b>Technical Report</b></a>
</p>
<p align="center">
💬 <a href="https://huggingface.co/tencent/Youtu-Embedding/blob/main/images/wechat_qr.png"><b>WeChat</b></a> |
🤖 <a href="https://discord.gg/QjqhkHQVVM"><b>Discord</b></a>
</p>

## 🎯 Introduction

**Youtu-Embedding** is a general-purpose text embedding model developed by Tencent Youtu Lab. It performs strongly across a wide range of natural language processing tasks, including Information Retrieval (IR), Semantic Textual Similarity (STS), Clustering, Reranking, and Classification.

- **Top-Ranked Performance**: Ranked #1 on CMTEB (Chinese Massive Text Embedding Benchmark) with a mean task score of **77.58** as of September 2025.

- **Innovative Training Framework**: Uses a Collaborative-Discriminative Fine-tuning Framework designed to resolve the "negative transfer" problem in multi-task learning. It combines a unified data format, task-differentiated loss functions, and a dynamic single-task sampling mechanism.

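
As a rough illustration of single-task sampling with task-specific losses, the sketch below draws every batch from exactly one task and looks up the loss for that task. It is not the authors' implementation: the record fields, the loss names, and the size-proportional sampling rule are all assumptions made for the example, and the actual objectives and sampling schedule are described in the technical report.

```python
import random
from collections import defaultdict

# Hypothetical unified record format: every example, whatever its task,
# uses the same core keys so a single loader can serve all tasks.
DATA = [
    {"task": "ir", "query": "q1", "positive": "p1", "negatives": ["n1", "n2"]},
    {"task": "ir", "query": "q2", "positive": "p2", "negatives": ["n3"]},
    {"task": "sts", "query": "s1", "positive": "s2", "score": 0.8},
    {"task": "sts", "query": "s3", "positive": "s4", "score": 0.1},
]

# Placeholder loss names, one per task (assumed, not taken from the model card).
LOSS_BY_TASK = {"ir": "contrastive_loss", "sts": "ranking_loss"}


def single_task_batches(records, batch_size, num_batches, seed=0):
    """Yield (task, batch) pairs where each batch contains one task only."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for record in records:
        by_task[record["task"]].append(record)
    tasks = sorted(by_task)
    # Assumption: pick a task in proportion to how much data it has.
    weights = [len(by_task[t]) for t in tasks]
    for _ in range(num_batches):
        task = rng.choices(tasks, weights=weights, k=1)[0]
        pool = by_task[task]
        yield task, rng.sample(pool, k=min(batch_size, len(pool)))


for task, batch in single_task_batches(DATA, batch_size=2, num_batches=4):
    # All examples in the batch share a task, so one loss applies to the whole batch.
    print(task, LOSS_BY_TASK[task], [r["query"] for r in batch])
```

Keeping each batch homogeneous means the gradients of one step come from a single objective, which is the property the single-task sampling idea relies on.
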

> **Note**: You can adapt and fine-tune the model on your own datasets for domain-specific tasks. For implementation details, please refer to the [training code](https://github.com/TencentCloudADP/youtu-embedding).

## 🤗 Model Download

| Model Name | Parameters | Dimensions | Sequence Length | Download |
| :--- | :---: | :---: | :---: | :--- |
| Youtu-Embedding | 2B | 2048 | 8K | [Model](https://huggingface.co/tencent/Youtu-Embedding) |

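
The 2048-dimensional output has a direct storage cost when indexing a corpus. As a rough worked example, the sketch below assumes embeddings are stored as uncompressed float32 values (4 bytes each) and ignores any index overhead. The table does not state a storage dtype, so treat both as assumptions.

```python
def embedding_storage_bytes(num_vectors: int, dim: int = 2048, bytes_per_value: int = 4) -> int:
    """Raw size of a dense embedding matrix, excluding any index overhead."""
    return num_vectors * dim * bytes_per_value


per_vector = embedding_storage_bytes(1)
per_million = embedding_storage_bytes(1_000_000)
print(f"{per_vector} bytes per vector")                        # 8192 bytes per vector
print(f"{per_million / 1024**3:.2f} GiB per million vectors")  # 7.63 GiB per million vectors
```
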

## 🚀 Usage

#### 1. Using `transformers`

**📦 Installation**

```bash
pip install transformers==4.51.3
```

**⚙️ Usage**

```python
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer


class LLMEmbeddingModel():

    def __init__(self,
                 model_name_or_path,
                 batch_size=128,
                 max_length=1024,
                 gpu_id=0):
        self.model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="right")

        # Use the requested GPU when available; otherwise fall back to CPU.
        self.device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device).eval()

        self.max_length = max_length
        self.batch_size = batch_size

        query_instruction = "Given a search query, retrieve passages that answer the question"
        if query_instruction:
            self.query_instruction = f"Instruction: {query_instruction} \nQuery:"
        else:
            self.query_instruction = "Query:"

        self.doc_instruction = ""
        print(f"query instruction: {[self.query_instruction]}\ndoc instruction: {[self.doc_instruction]}")

    def mean_pooling(self, hidden_state, attention_mask):
        # Average the hidden states over the positions where the mask is 1.
        s = torch.sum(hidden_state * attention_mask.unsqueeze(-1).float(), dim=1)
        d = attention_mask.sum(dim=1, keepdim=True).float()
        embedding = s / d
        return embedding

    @torch.no_grad()
    def encode(self, sentences_batch, instruction):
        inputs = self.tokenizer(
            sentences_batch,
            padding=True,
            truncation=True,
            return_tensors="pt",
            max_length=self.max_length,
            add_special_tokens=True,
        ).to(self.device)

        outputs = self.model(**inputs)
        last_hidden_state = outputs[0]

        # Tokenize the instruction alone to learn how many leading tokens it occupies.
        instruction_tokens = self.tokenizer(
            instruction,
            padding=False,
            truncation=True,
            max_length=self.max_length,
            add_special_tokens=True,
        )["input_ids"]
        # Zero the mask over the instruction prefix so pooling covers only the text itself.
        if len(np.shape(np.array(instruction_tokens))) == 1:
            # A single instruction string shared by the whole batch.
            inputs["attention_mask"][:, :len(instruction_tokens)] = 0
        else:
            # One instruction per sentence.
            instruction_length = [len(item) for item in instruction_tokens]
            assert len(instruction) == len(sentences_batch)
            for idx in range(len(instruction_length)):
                inputs["attention_mask"][idx, :instruction_length[idx]] = 0

        embeddings = self.mean_pooling(last_hidden_state, inputs["attention_mask"])
        embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
        return embeddings

    def encode_queries(self, queries):
        queries = queries if isinstance(queries, list) else [queries]
        queries = [f"{self.query_instruction}{query}" for query in queries]
        return self.encode(queries, self.query_instruction)

    def encode_passages(self, passages):
        passages = passages if isinstance(passages, list) else [passages]
        passages = [f"{self.doc_instruction}{passage}" for passage in passages]
        return self.encode(passages, self.doc_instruction)

    def compute_similarity_for_vectors(self, q_reps, p_reps):
        if len(p_reps.size()) == 2:
            return torch.matmul(q_reps, p_reps.transpose(0, 1))
        return torch.matmul(q_reps, p_reps.transpose(-2, -1))

    def compute_similarity(self, queries, passages):
        q_reps = self.encode_queries(queries)
        p_reps = self.encode_passages(passages)
        scores = self.compute_similarity_for_vectors(q_reps, p_reps)
        scores = scores.detach().cpu().tolist()
        return scores


queries = ["What's the weather like?"]
passages = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.'
]

model_name_or_path = "tencent/Youtu-Embedding"
model = LLMEmbeddingModel(model_name_or_path)
scores = model.compute_similarity(queries, passages)
print(f"scores: {scores}")
```

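
The `encode` method zeroes the attention mask over the instruction prefix before pooling, so the embedding averages only the tokens of the query or passage itself. The NumPy sketch below reproduces that step on toy arrays. The function name, shapes, and values are made up for illustration, and no model is loaded.

```python
import numpy as np


def masked_mean_pool(hidden_state, attention_mask, instruction_len=0):
    """Mean-pool token vectors, skipping padding and the first instruction_len tokens."""
    mask = attention_mask.astype(np.float64).copy()
    mask[:, :instruction_len] = 0.0            # drop the instruction prefix
    summed = (hidden_state * mask[:, :, None]).sum(axis=1)
    counts = mask.sum(axis=1, keepdims=True)   # number of tokens that were kept
    pooled = summed / counts
    # L2-normalize so a dot product between two embeddings is a cosine similarity.
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)


# One sequence of 4 tokens with 2-dim hidden states; the last token is padding.
hidden = np.array([[[9.0, 9.0], [3.0, 0.0], [0.0, 4.0], [7.0, 7.0]]])
mask = np.array([[1, 1, 1, 0]])

# With a 1-token instruction, only tokens 1 and 2 contribute: mean = [1.5, 2.0].
print(masked_mean_pool(hidden, mask, instruction_len=1))  # [[0.6 0.8]]
```

The instruction token (index 0) and the padding token (index 3) carry large values here on purpose: neither affects the result, which shows that the mask is doing the filtering.
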

#### 2. Using `sentence-transformers`

**📦 Installation**

```bash
pip install sentence-transformers==5.1.0
```

**⚙️ Usage**

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tencent/Youtu-Embedding", trust_remote_code=True)
queries = ["What's the weather like?"]
passages = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.'
]
queries_embeddings = model.encode_query(queries)
passages_embeddings = model.encode_document(passages)

similarities = model.similarity(queries_embeddings, passages_embeddings)
print(similarities)
```


#### 3. Using `LangChain` 🦜

Easily integrate the model into your **LangChain** applications, such as RAG pipelines.

**📦 Installation**

```bash
pip install langchain==0.3.27 langchain-community==0.3.29 langchain-huggingface==0.3.1 sentence-transformers==5.1.0 faiss-cpu==1.11.0
```

**⚙️ Usage**

```python
import torch
from langchain.docstore.document import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

model_name_or_path = "tencent/Youtu-Embedding"
device = "cuda" if torch.cuda.is_available() else "cpu"

model_kwargs = {
    'trust_remote_code': True,
    'device': device
}

embedder = HuggingFaceEmbeddings(
    model_name=model_name_or_path,
    model_kwargs=model_kwargs,
)

# Queries get an instruction prefix; documents are embedded as-is.
query_instruction = "Instruction: Given a search query, retrieve passages that answer the question \nQuery:"
doc_instruction = ""

data = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

documents = [Document(page_content=text, metadata={"id": i}) for i, text in enumerate(data)]
vector_store = FAISS.from_documents(documents, embedder, distance_strategy="MAX_INNER_PRODUCT")

query = "Which planet is known as the Red Planet?"
# Prepend the instruction manually before searching.
instructed_query = query_instruction + query
results = vector_store.similarity_search_with_score(instructed_query, k=3)

print(f"Original Query: {query}\n")
print("Results:")
for doc, score in results:
    print(f"- Text: {doc.page_content} (Score: {score:.4f})")
```

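
In the LangChain example the instruction is prepended to the query by hand, while documents are embedded without a prefix. A small helper can keep that formatting in one place. `build_instructed_query` and `DEFAULT_TASK` are hypothetical names, not part of LangChain or the model's API. The template string is the one used in the usage examples in this README.

```python
DEFAULT_TASK = "Given a search query, retrieve passages that answer the question"


def build_instructed_query(query: str, task: str = DEFAULT_TASK) -> str:
    """Prefix a query with the instruction template used in this README."""
    # The space before the newline and the missing space after "Query:" are
    # copied from the examples so the prompt matches them exactly.
    return f"Instruction: {task} \nQuery:{query}"


print(repr(build_instructed_query("Which planet is known as the Red Planet?")))
```

Passages should be passed to the vector store unchanged, matching the empty document instruction in the examples.
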

#### 4. Using `LlamaIndex` 🦙

Integrate the model into your **LlamaIndex** search and retrieval systems.

**📦 Installation**

```bash
pip install llama-index==0.14.2 llama-index-embeddings-huggingface==0.6.1 sentence-transformers==5.1.0 llama-index-vector-stores-faiss==0.5.1
```

**⚙️ Usage**

```python
import faiss
import torch
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import VectorStoreQuery
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model_name_or_path = "tencent/Youtu-Embedding"
device = "cuda" if torch.cuda.is_available() else "cpu"

embeddings = HuggingFaceEmbedding(
    model_name=model_name_or_path,
    trust_remote_code=True,
    device=device,
    query_instruction="Instruction: Given a search query, retrieve passages that answer the question \nQuery:",
    text_instruction=""
)

data = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

nodes = [TextNode(id_=str(i), text=text) for i, text in enumerate(data)]

# Embed each passage and attach the vector to its node.
for node in nodes:
    node.embedding = embeddings.get_text_embedding(node.get_content())

# Build an inner-product FAISS index sized to the embedding dimension.
embed_dim = len(nodes[0].embedding)
store = FaissVectorStore(faiss_index=faiss.IndexFlatIP(embed_dim))
store.add(nodes)

query = "Which planet is known as the Red Planet?"
query_embedding = embeddings.get_query_embedding(query)

results = store.query(
    VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=3)
)

print(f"Query: {query}\n")
print("Results:")
for idx, score in zip(results.ids, results.similarities):
    print(f"- Text: {data[int(idx)]} (Score: {score:.4f})")
```

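
Both FAISS examples search by inner product (`MAX_INNER_PRODUCT` and `IndexFlatIP`). Inner product matches cosine similarity only when the vectors have unit length. The `transformers` example normalizes explicitly; whether the other integrations return normalized vectors depends on the model's packaged configuration, which is not shown here. The plain-Python sketch below uses toy vectors to show the relationship.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0]
passage = [4.0, 3.0]

# On raw vectors the inner product depends on magnitude...
print(dot(query, passage))                                         # 24.0
# ...but after L2 normalization it equals the cosine similarity.
print(round(dot(l2_normalize(query), l2_normalize(passage)), 6))   # 0.96
print(round(cosine(query, passage), 6))                            # 0.96
```

If an integration returns unnormalized vectors, normalizing them before indexing keeps inner-product scores comparable to cosine scores.
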

## 📊 CMTEB

| Model | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
| :--- | :--- | :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 79.30 | 68.28 | 73.73 | 55.19 |
| ritrieve\_zh\_v1 | 326M | 72.71 | 73.85 | 76.88 | 66.50 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |
| Conan-embedding-v2 | 1.4B | 74.24 | 75.99 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 |
| Seed1.6-embedding | - | 75.63 | 76.68 | 77.98 | 73.11 | 88.71 | 71.65 | 79.69 | 68.94 |
| QZhou-Embedding | 7B | 76.99 | 78.58 | 79.99 | 70.91 | 95.07 | 74.85 | 78.80 | 71.89 |
| **Youtu-Embedding** | 2B | **77.58** | **78.86** | 78.65 | 84.27 | 86.12 | 75.10 | 80.21 | 68.82 |

> **Note**: Comparative scores are from the MTEB [leaderboard](https://huggingface.co/spaces/mteb/leaderboard), recorded on September 28, 2025.

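
The Mean(Type) column can be checked against the table itself: it is the unweighted average of the six per-type columns. The sketch below recomputes it for three rows. Because the per-type values are already rounded, a recomputed mean can differ from the leaderboard in the last digit for some rows. Mean(Task) averages over individual tasks and cannot be recomputed from this table.

```python
# Per-type scores copied from the table, in column order:
# Class., Clust., Pair Class., Rerank., Retr., STS.
TYPE_SCORES = {
    "Youtu-Embedding": [78.65, 84.27, 86.12, 75.10, 80.21, 68.82],
    "Conan-embedding-v2": [76.47, 68.84, 92.44, 74.41, 78.31, 65.48],
    "Qwen3-Embedding-8B": [76.97, 80.08, 84.23, 66.99, 78.21, 63.53],
}


def mean_type(scores):
    """Unweighted average of the six task-type scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for model, scores in TYPE_SCORES.items():
    print(f"{model}: {mean_type(scores):.2f}")
# Youtu-Embedding: 78.86
# Conan-embedding-v2: 75.99
# Qwen3-Embedding-8B: 75.00
```
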

## 🎉 Citation

```bibtex
@misc{zhang2025codiemb,
  title={CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity},
  author={Zhang, Bowen and Song, Zixin and Chen, Chunquan and Zhang, Qian-Wen and Yin, Di and Sun, Xing},
  year={2025},
  eprint={2508.11442},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2508.11442},
}
```