# QZhou-Embedding
<div align="center">
<img src="image-1.png" width="800" height="300"></img>
</div>

## Introduction

We release <a href="https://huggingface.co/Kingsoft-LLM/QZhou-Embedding">QZhou-Embedding</a> (轻舟Embedding😈😈😈), a general-purpose text-representation LLM that excels at a wide range of text embedding tasks (retrieval, reranking, sentence-pair similarity, classification). Thanks to the general language ability the base model acquired from pretraining on massive text corpora, QZhou-Embedding produces stronger text embeddings. It is continually trained on millions of high-quality open-source retrieval examples plus 5M+ high-quality synthetic examples built with two synthesis techniques: rewriting and expansion. A first retrieval-training stage gives the model its foundation in query-document semantic matching; a second stage covering STS, clustering, and other capabilities pushes it further across scenarios. QZhou-Embedding has 7B parameters and embeds texts of up to 8k tokens. It ranks first on the MTEB/CMTEB leaderboards by overall mean score, and first by mean score on the clustering, pair-classification, reranking, and STS tasks.

## Key Features of QZhou-Embedding

- Strong text embedding capability
- Long context: up to 8k tokens
- 7B parameters

## Technical Overview

### Unified Task-Modeling Framework

We unify text embedding objectives into three problem formulations and propose a matching data-structuring scheme with a corresponding training mechanism. Most open-source datasets can be folded into the retrieval training set this way. Data can be structured as follows:
- Retrieval
  - title-body
  - title-abstract
  - question-answering data
  - reading comprehension
  - ...

- STS
  - text pair + a {true, false} / {yes, no} label
  - text pair + a score (e.g., 0.2, 3.1, 4.8)
  - NLI data: text pair + an {'entailment', 'neutral', 'contradiction'} label

- CLS
  - sentence + class label

<div align="center"><img src="image-18.png" width="1000" height="600"></img></div>
<div align="center"><img src="image-16.png" width="1000" height="550"></img></div>
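As a rough illustration, folding heterogeneous records into the unified retrieval format can be sketched as a small normalization step. The field names and the helper `to_retrieval_pair` below are illustrative assumptions, not the project's actual preprocessing code:

```python
# Hypothetical normalization step: map raw records with different schemas
# onto the unified (query, positive document) retrieval format.
def to_retrieval_pair(record: dict) -> tuple:
    if "title" in record and "body" in record:        # title-body data
        return (record["title"], record["body"])
    if "title" in record and "abstract" in record:    # title-abstract data
        return (record["title"], record["abstract"])
    if "question" in record and "answer" in record:   # QA / reading comprehension
        return (record["question"], record["answer"])
    raise ValueError("unsupported record schema: %s" % sorted(record))

pairs = [to_retrieval_pair(r) for r in [
    {"title": "QZhou-Embedding", "body": "A 7B general-purpose embedding model."},
    {"question": "What is photosynthesis?", "answer": "A process in green plants."},
]]
```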

### Training Objectives

- Retrieval: InfoNCE contrastive loss, with an additional query-query negative penalty following the improvement used by gte/qwen3-embedding:<br>
$$
L_{ret}=-\frac{1}{n}\sum_{i} \log{\frac{e^{sim(q_i,d_i^+)/\tau}}{e^{sim(q_i,d_i^+)/\tau}+\sum_{j}e^{sim(q_i,d_j^-)/\tau}+\sum_{j\neq i}e^{sim(q_i,q_j)/\tau}}}
$$
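A minimal PyTorch sketch of this loss, assuming cosine similarity and in-batch query-query negatives (names and shapes are illustrative, not the released training code):

```python
import torch
import torch.nn.functional as F

def retrieval_infonce(q, d_pos, d_neg, tau=0.05):
    """InfoNCE with the extra query-query negative term from the formula above.
    q: (B, H) queries; d_pos: (B, H) positives; d_neg: (B, N, H) hard negatives."""
    q, d_pos, d_neg = (F.normalize(t, dim=-1) for t in (q, d_pos, d_neg))
    pos = (q * d_pos).sum(-1, keepdim=True) / tau       # sim(q_i, d_i^+) term
    neg = torch.einsum("bh,bnh->bn", q, d_neg) / tau    # document negatives
    qq = (q @ q.T) / tau                                # query-query penalty
    qq = qq.masked_fill(
        torch.eye(q.shape[0], dtype=torch.bool, device=q.device), float("-inf")
    )                                                   # exclude the j == i term
    logits = torch.cat([pos, neg, qq], dim=1)           # positive is column 0
    return -F.log_softmax(logits, dim=1)[:, 0].mean()
```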

- STS: CoSENT loss:
$$
L_{cosent}=\log \bigg(1+\sum_{sim(i,j)>sim(k,l)}\exp\Big(\frac{sim(x_k, x_l)-sim(x_i,x_j)}{\tau}\Big)\bigg)
$$
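A simplified sketch of CoSENT operating on per-pair similarity scores (an assumed illustration, not the project's actual training code):

```python
import torch

def cosent_loss(sims, labels, tau=0.05):
    """log(1 + sum exp((s_j - s_i)/tau)) over ordered pairs where the gold
    labels say pair i should be more similar than pair j, as in the formula above.
    sims: (B,) predicted similarities; labels: (B,) gold similarity scores."""
    diff = (sims[None, :] - sims[:, None]) / tau    # diff[i, j] = (s_j - s_i) / tau
    diff = diff[labels[:, None] > labels[None, :]]  # keep pairs ranked i above j
    # log(1 + sum(exp(diff))) computed stably as logsumexp with an extra 0 logit
    return torch.logsumexp(torch.cat([torch.zeros(1), diff]), dim=0)
```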

- CLS: InfoNCE loss as in retrieval, but because in-batch negatives have a high chance of colliding with same-class samples, a mask suppresses same-class samples among the negatives shared across examples:
$$
L_{cls}=-\frac{1}{n}\sum_{i} \log{\frac{e^{sim(t_i,t_i^+)/\tau}}{e^{sim(t_i,t_i^+)/\tau}+\sum_{n}MASK(t_i,t_{i,n}^-)\cdot e^{sim(t_i,t_{i,n}^-)/\tau}+\sum_{j\neq i}MASK(t_i,t_j)\cdot e^{sim(t_i,t_j)/\tau}+\sum_{j\neq i}\sum_{n}MASK(t_i,t_{j,n}^-)\cdot e^{sim(t_i,t_{j,n}^-)/\tau}}}
$$

with $C_{t_i}=C_{t_i^+}$ and

$$
MASK(t_i, t_j)=
\begin{cases}
0 & \quad \text{if } C_{t_i}=C_{t_j}, \\
1 & \quad \text{otherwise}
\end{cases}
$$

where ${C_{t_i}}$ is the class label of sample ${t_i}$ and n is the number of negatives per example.
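The masking mechanism can be sketched like this (an illustrative helper, assuming integer class labels per batch element):

```python
import torch

def class_mask(labels):
    """MASK(t_i, t_j): 0 when samples i and j share a class label, 1 otherwise,
    so same-class in-batch samples are never treated as negatives."""
    labels = torch.as_tensor(labels)
    return (labels[:, None] != labels[None, :]).float()

m = class_mask([0, 1, 0, 2])   # samples 0 and 2 share class 0
```

In practice the mask would typically be applied to the negative logits before the softmax, e.g. via `logits.masked_fill(mask == 0, float("-inf"))`.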

### Feature-Enhanced Data Synthesis

Given how capable today's LLMs are at language understanding and generation, we make full use of LLM APIs to design data-synthesis techniques. To address scarce data and narrow topic coverage in the training set, we propose rewriting and expansion synthesis; and to raise the difficulty of negatives during training, we add LLM-based hard-negative synthesis on top of the existing embedding-based hard-negative mining. The techniques are illustrated below:

<div align="center"><img src="image-9.png" width="930" height="290"></img></div>
<div align="center"><img src="image-10.png" width="880" height="220"></img></div>
<div align="center"><img src="image-11.png" width="880" height="210"></img></div>

For more details (evaluation scripts, instruction formats, etc.), visit our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>.

## Evaluation Results

### MTEB Leaderboard Details
<div align="center"><img src="image-7.png" width="1100" height="260"></img></div>

### CMTEB Leaderboard Details
<div align="center"><img src="image-8.png" width="1000" height="260"></img></div>

## Usage

### Fully Reproducing the Leaderboard Results

We provide detailed parameters and environment configuration (dependency versions, model-loading settings, etc.) so that you can reproduce the leaderboard results exactly on your own machine.
#### Dependency Versions

- Python: 3.10.12
- Sentence Transformers: 3.4.1
- Transformers: 4.51.1
- PyTorch: 2.7.1
- Accelerate: 1.3.0
- Datasets: 3.2.0
- Tokenizers: 0.21.2

#### Model Loading Parameters

torch_dtype=torch.bfloat16<br>
attn_implementation='sdpa'<br>
**Note:** The leaderboard results were produced with the sdpa attention implementation. Other modes ('eager', 'flash_attention_2') give slightly different numbers, but overall performance is unaffected.

#### Instruction Rules

The rules for adding instructions can be found on our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>.

#### Running the Evaluation Code

Our evaluation code is on <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>; the MTEB evaluation script is **run_mteb_all_v2.py** and the CMTEB script is **run_cmteb_all.py**. Run:
```shell
POOLING_MODE=mean
normalize=true
use_instruction=true
export TOKENIZERS_PARALLELISM=true

model_name_or_path=<path-to-model-directory>

python3 ./run_cmteb_all.py \
    --model_name_or_path ${model_name_or_path} \
    --pooling_mode ${POOLING_MODE} \
    --normalize ${normalize} \
    --use_instruction ${use_instruction} \
    --output_dir <output-directory>

python3 ./run_mteb_all_v2.py \
    --model_name_or_path ${model_name_or_path} \
    --pooling_mode ${POOLING_MODE} \
    --normalize ${normalize} \
    --use_instruction ${use_instruction} \
    --output_dir <output-directory>
```
These are general-purpose scripts and can be used to evaluate other Hugging Face embedding models as well, provided the pooling and related settings are configured correctly.
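For reference, the `POOLING_MODE=mean` setting above corresponds to averaging token embeddings over non-padding positions, which can be sketched as (an illustrative helper, not taken from the evaluation scripts):

```python
import torch

def mean_pool(last_hidden, attention_mask):
    """Average token embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)   # (B, T, 1)
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# 3 tokens, last one is padding: only the first two contribute
out = mean_pool(torch.tensor([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]]),
                torch.tensor([[1, 1, 0]]))
```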

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Minimal load
model = SentenceTransformer("QZhou-Embedding")

# Or with explicit device mapping and left padding
model = SentenceTransformer(
    "QZhou-Embedding",
    model_kwargs={"device_map": "auto", "trust_remote_code": True},
    tokenizer_kwargs={"padding_side": "left", "trust_remote_code": True},
    trust_remote_code=True,
)

queries = [
    "What is photosynthesis?",
    "Who invented the telephone?",
]
documents = [
    "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
    "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device.",
]

query_embeddings = model.encode(queries, prompt_name="query", normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)

similarity = model.similarity(query_embeddings, document_embeddings)
```

### Huggingface Transformers

```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'


task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is photosynthesis?'),
    get_detailed_instruct(task, 'Who invented the telephone?')
]

documents = [
    "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
    "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device."
]

input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('QZhou-Embedding', padding_side='left', trust_remote_code=True)
model = AutoModel.from_pretrained('QZhou-Embedding', trust_remote_code=True, device_map='auto')

batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=8192,
    return_tensors="pt",
)
batch_dict = batch_dict.to(model.device)  # move inputs to the model's device
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)  # query-document similarity matrix
```