YuPeng0214 committed
Commit 7a2deeb · verified · 1 parent: 6e67eb1

Delete files README_zh.md with huggingface_hub

README_zh.md DELETED
@@ -1,207 +0,0 @@
# QZhou-Embedding
<div align="center">
<img src="image-1.png" width="800" height="300"></img>
</div>

## Introduction
We release <a href="https://huggingface.co/Kingsoft-LLM/QZhou-Embedding">QZhou-Embedding</a> (轻舟 Embedding 😈😈😈), a general-purpose text embedding model that excels at a wide range of embedding tasks: retrieval, reranking, sentence-pair similarity, and classification. Benefiting from the general language ability its base model acquired through pretraining on massive text corpora, QZhou-Embedding produces stronger text representations. It is continually trained on millions of high-quality open-source retrieval examples plus 5M+ high-quality synthetic examples, built with two synthesis techniques: rewriting and expansion. A first training stage on retrieval gives the model a foundation of query-document semantic matching; a second stage covering STS, clustering, and other capabilities pushes performance further across scenarios. QZhou-Embedding has 7B parameters and embeds texts up to 8k tokens long. It achieves the highest overall average on the MTEB/CMTEB leaderboards, as well as the highest per-task averages on the clustering, pair classification, reranking, and STS tasks.

## Key features of QZhou-Embedding

- Strong text embedding capability;
- Long context: up to 8k tokens;
- 7B parameters

## Technical overview
### Unified task-modeling framework
We unify the text-embedding objective into three problem formulations and propose a unified training-data structuring scheme with a matching training mechanism, so that most open-source datasets can be folded in as retrieval training data. Data can be structured as follows:
- Retrieval
  - title-body
  - title-abstract
  - question-answering data
  - reading comprehension
  - ...

- STS
  - text pairs with {true, false} / {yes, no} labels
  - text pairs with scores (e.g. 0.2, 3.1, 4.8)
  - NLI data: text pairs with {'entailment', 'neutral', 'contradiction'} labels

- CLS
  - sentence + class label

<div align="center"><img src="image-18.png" width="1000" height="600"></img></div>
<div align="center"><img src="image-16.png" width="1000" height="550"></img></div>

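To make the structuring scheme above concrete, here is a hypothetical sketch of how records from the three task families could be laid out after conversion. The field names are illustrative only, not the project's actual training format:

```python
# Hypothetical record layouts for the unified training-data scheme above.
# Field names ("query", "pos", "neg", ...) are illustrative assumptions.
retrieval_record = {
    "query": "paper title",
    "pos": ["paper abstract"],
    "neg": ["hard negative passage"],
}
sts_record = {
    "text_a": "A man is eating food.",
    "text_b": "Someone is eating.",
    "score": 4.8,
}
cls_record = {
    "text": "Great product, works exactly as advertised.",
    "label": "positive",
}
```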
### Training objectives

- Retrieval: InfoNCE contrastive loss, with an additional query-query negative penalty following the improvement used by gte/qwen3-embedding:<br>
$$
L_{ret}=-\frac{1}{n}\sum_{i} \log{\frac{e^{sim(q_i,d_i^+)/\tau}}{e^{sim(q_i,d_i^+)/\tau}+\sum_{j}e^{sim(q_i,d_j^-)/\tau}+\sum_{j\neq i}e^{sim(q_i,q_j)/\tau}}}
$$
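A minimal PyTorch sketch of this loss, assuming cosine similarity and illustrative tensor shapes; this is not the project's training code:

```python
import torch
import torch.nn.functional as F

def retrieval_infonce(q, d_pos, d_neg, tau=0.05):
    """InfoNCE with an extra query-query negative term, as in the formula above.

    q:     (n, h) query embeddings
    d_pos: (n, h) positive document embeddings
    d_neg: (n, m, h) hard-negative document embeddings per query
    Shapes and names are illustrative assumptions.
    """
    q = F.normalize(q, dim=-1)
    d_pos = F.normalize(d_pos, dim=-1)
    d_neg = F.normalize(d_neg, dim=-1)

    pos = (q * d_pos).sum(-1) / tau                    # sim(q_i, d_i^+), shape (n,)
    neg = torch.einsum('nh,nmh->nm', q, d_neg) / tau   # sim(q_i, d_j^-), shape (n, m)
    qq = (q @ q.T) / tau                               # sim(q_i, q_j), shape (n, n)
    qq.fill_diagonal_(float('-inf'))                   # exclude j == i from the sum

    # denominator = e^pos + sum e^neg + sum_{j != i} e^qq
    logits = torch.cat([pos.unsqueeze(1), neg, qq], dim=1)
    return -(pos - torch.logsumexp(logits, dim=1)).mean()
```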

- STS: CoSENT loss:
$$
L_{cosent}=\log \bigg(1+\sum_{sim(i,j)>sim(k,l)}\exp\Big(\frac{sim(x_k, x_l)-sim(x_i,x_j)}{\tau}\Big)\bigg)
$$
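A sketch of CoSENT in PyTorch, assuming `sim` holds the model's predicted cosine similarities for a batch of text pairs and `labels` their gold scores (names and shapes are illustrative):

```python
import torch

def cosent_loss(sim, labels, tau=0.05):
    """CoSENT loss from the formula above: for every ordered pair of examples
    (a, b) whose gold scores satisfy labels[a] > labels[b], penalize
    exp((sim[b] - sim[a]) / tau). Illustrative sketch, not the training code."""
    sim = sim / tau
    # diff[a, b] = sim[b] - sim[a]; keep entries where labels[a] > labels[b]
    diff = sim[None, :] - sim[:, None]
    mask = labels[:, None] > labels[None, :]
    diff = diff[mask]
    # log(1 + sum exp(diff)) == logsumexp over [0, diff...]
    zero = torch.zeros(1, device=sim.device, dtype=sim.dtype)
    return torch.logsumexp(torch.cat([zero, diff]), dim=0)
```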

- CLS: the same InfoNCE loss as retrieval, but because in-batch negatives are likely to collide with same-class samples, a masking mechanism hides same-class samples among the negatives shared across examples:
$$
L_{cls}=-\frac{1}{n}\sum_{i} \log{\frac{e^{sim(t_i,t_i^+)/\tau}}{e^{sim(t_i,t_i^+)/\tau}+\sum_{n}MASK(t_i,t_{i,n}^-)\cdot e^{sim(t_i,t_{i,n}^-)/\tau}+\sum_{j\neq i}MASK(t_i,t_j)\cdot e^{sim(t_i,t_j)/\tau}+\sum_{j\neq i}\sum_{n}MASK(t_i,t_{j,n}^-)\cdot e^{sim(t_i,t_{j,n}^-)/\tau}}}
$$
with $C_{t_i}=C_{t_i^+}$ and
$$
MASK(t_i, t_j)=
\begin{cases}
0 & \quad \text{if } C_{t_i}=C_{t_j}, \\
1 & \quad \text{otherwise}
\end{cases}
$$
where ${C_{t_i}}$ is the class label of sample ${t_i}$ and $n$ is the number of negatives per example.
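A sketch of the same-class mask applied to in-batch logits; shapes and names are illustrative, and the per-sample hard-negative terms of the formula are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def cls_infonce(t, t_pos, labels, tau=0.05):
    """In-batch InfoNCE for classification data with the same-class mask
    described above: other samples' positives serve as negatives, but any
    candidate sharing t_i's class label is masked out of the denominator.
    Illustrative sketch; hard negatives per sample are not included."""
    t = F.normalize(t, dim=-1)
    t_pos = F.normalize(t_pos, dim=-1)
    logits = (t @ t_pos.T) / tau                 # (n, n); diagonal = positives
    # MASK(t_i, t_j) = 0 when class labels match: drop those candidates
    same_class = labels[:, None] == labels[None, :]
    same_class.fill_diagonal_(False)             # keep each sample's own positive
    logits = logits.masked_fill(same_class, float('-inf'))
    pos = logits.diagonal()
    return -(pos - torch.logsumexp(logits, dim=1)).mean()
```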

### Feature-enhancing data synthesis
Given how capable today's large language models are at language understanding and generation, we make full use of LLM APIs in designing our data-synthesis techniques. To address problems in the training set such as data scarcity and narrow topic coverage, we propose rewriting and expansion synthesis; and to raise the difficulty of negatives during training, we augment the usual embedding-based hard-negative mining with LLM-based hard-negative synthesis. The techniques are illustrated below:
<div align="center"><img src="image-9.png" width="930" height="290"></img></div>
<div align="center"><img src="image-10.png" width="880" height="220"></img></div>
<div align="center"><img src="image-11.png" width="880" height="210"></img></div>

For more information (evaluation scripts, instruction formats, etc.), visit our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>.

## Evaluation results
### MTEB leaderboard details
<div align="center"><img src="image-7.png" width="1100" height="260"></img></div>

### CMTEB leaderboard details
<div align="center"><img src="image-8.png" width="1000" height="260"></img></div>

## Usage guide
### Fully reproducing the leaderboard results
We provide detailed parameter and environment configurations (dependency versions, model-loading parameters, etc.) so that you can reproduce the leaderboard results exactly on your own machine.
#### Dependency versions
- Python: 3.10.12
- Sentence Transformers: 3.4.1
- Transformers: 4.51.1
- PyTorch: 2.7.1
- Accelerate: 1.3.0
- Datasets: 3.2.0
- Tokenizers: 0.21.2
#### Model-loading parameters
torch_dtype=torch.bfloat16<br>
attn_implementation='sdpa'<br>
**Note:** the leaderboard results were obtained with the sdpa attention mode; the other modes ('eager', 'flash_attention_2') give slightly different scores, but overall performance is unaffected.
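As a sketch, these parameters map onto `from_pretrained` keyword arguments as follows (the commented-out load is illustrative; actually loading the 7B checkpoint requires the model files):

```python
import torch

# Loading kwargs matching the leaderboard configuration above.
# 'sdpa' is the attention implementation used for the reported scores.
load_kwargs = {
    "torch_dtype": torch.bfloat16,
    "attn_implementation": "sdpa",
    "trust_remote_code": True,
}

# from transformers import AutoModel
# model = AutoModel.from_pretrained("QZhou-Embedding", **load_kwargs)
```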
#### Instruction-prepending rules
These can be found on our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>.
#### Running the evaluation code
The evaluation code is also on our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>; the MTEB evaluation script is **run_mteb_all_v2.py** and the CMTEB script is **run_cmteb_all.py**. Run the following commands:
```bash
POOLING_MODE=mean
normalize=true
use_instruction=true
export TOKENIZERS_PARALLELISM=true

model_name_or_path=<model directory path>

python3 ./run_cmteb_all.py \
    --model_name_or_path ${model_name_or_path} \
    --pooling_mode ${POOLING_MODE} \
    --normalize ${normalize} \
    --use_instruction ${use_instruction} \
    --output_dir <results output path>

python3 ./run_mteb_all_v2.py \
    --model_name_or_path ${model_name_or_path} \
    --pooling_mode ${POOLING_MODE} \
    --normalize ${normalize} \
    --use_instruction ${use_instruction} \
    --output_dir <results output path>
```
These scripts are generic and can evaluate other Hugging Face embedding models as well, as long as the pooling and related settings are configured correctly.

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Simple loading
model = SentenceTransformer("QZhou-Embedding")

# Loading with explicit model/tokenizer kwargs
model = SentenceTransformer(
    "QZhou-Embedding",
    model_kwargs={"device_map": "auto", "trust_remote_code": True},
    tokenizer_kwargs={"padding_side": "left", "trust_remote_code": True},
    trust_remote_code=True,
)

queries = [
    "What is photosynthesis?",
    "Who invented the telephone?",
]
documents = [
    "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
    "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device."
]

query_embeddings = model.encode(queries, prompt_name="query", normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)

similarity = model.similarity(query_embeddings, document_embeddings)
```

### Hugging Face Transformers

```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, the last position holds each sequence's final token;
    # otherwise gather the hidden state at each sequence's true last token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is photosynthesis?'),
    get_detailed_instruct(task, 'Who invented the telephone?')
]

documents = [
    "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
    "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device."
]

input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('QZhou-Embedding', padding_side='left', trust_remote_code=True)
model = AutoModel.from_pretrained('QZhou-Embedding', trust_remote_code=True, device_map='auto')

batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=8192,
    return_tensors="pt",
)
batch_dict = batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# L2-normalize, then score queries against documents by cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
```