YuPeng0214 committed on commit 6e67eb1 (verified) · 1 parent: b3d6208

Delete files README.md with huggingface_hub

# QZhou-Embedding
<div align="center">
<img src="image-1.png" width="800" height="300"></img>
</div>

## Introduction
We have released <a href="https://huggingface.co/Kingsoft-LLM/QZhou-Embedding">QZhou-Embedding</a> ("Qingzhou Embedding"), a general-purpose large-scale text embedding model that excels at a wide range of text embedding tasks (retrieval, reranking, sentence similarity, and classification). Building on the general language capabilities of its base model, which was pre-trained on massive amounts of text, QZhou-Embedding achieves even more powerful text representations. It is trained on millions of high-quality open-source embedding examples and over 5 million high-quality synthetic examples (produced with two synthesis techniques: rewriting and expansion). An initial retrieval stage gives the model a foundation in query-document semantic matching; subsequent multi-dimensional training on tasks such as STS and clustering drives continuous improvements across tasks. QZhou-Embedding has 7B parameters and supports inputs of up to 8k tokens. It achieved the highest average score on the MTEB and CMTEB benchmarks, as well as the highest average scores on the clustering, pair classification, reranking, and STS tasks.
## Basic Features

- Powerful text embedding capabilities;
- Long context: up to 8k context length;
- 7B parameter size.

## Technical Introduction
### Unified Task Modeling Framework
We unify text embedding objectives into three modeling-and-optimization problems and propose a unified structured format for training data, together with a corresponding training mechanism. This approach lets us integrate most open-source data as retrieval-style training sets. The structured data looks as follows:
- Retrieval
  - title-body
  - title-abstract
  - question answering datasets
  - reading comprehension
  - ...

- STS
  - text pair + label in {true, false} or {yes, no}
  - text pair + score (such as 0.2, 3.1, 4.8, etc.)
  - NLI datasets: text pair + label in {entailment, neutral, contradiction}

- CLS
  - text + CLS label

<div align="center"><img src="image-18.png" width="1000" height="600"></img></div>
<div align="center"><img src="image-16.png" width="1000" height="550"></img></div>
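To make the unified structuring concrete, here is a minimal Python sketch (field and function names are illustrative, not the actual training format) that reduces each task type to a single (query, positive, negatives) record:

```python
# Sketch: mapping the three task types onto one retrieval-style structure.
# All field names here are hypothetical, chosen only for illustration.

def structure_retrieval(title, body):
    # Retrieval-style pairs (title-body, QA, reading comprehension, ...) map directly.
    return {"query": title, "positive": body, "negatives": []}

def structure_sts(text_a, text_b, label):
    # Binary STS/NLI labels decide whether the pair is a positive or a negative.
    if label in {"true", "yes", "entailment"}:
        return {"query": text_a, "positive": text_b, "negatives": []}
    return {"query": text_a, "positive": None, "negatives": [text_b]}

def structure_cls(text, cls_label, pool_by_label):
    # Classification: a same-class sample serves as the positive,
    # samples from other classes serve as negatives.
    positive = next(t for t in pool_by_label[cls_label] if t != text)
    negatives = [t for lbl, pool in pool_by_label.items()
                 if lbl != cls_label for t in pool]
    return {"query": text, "positive": positive, "negatives": negatives}
```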

### Training Objectives

- Retrieval: We apply the InfoNCE contrastive loss and, following gte/qwen3-embedding, add query-query negatives to the denominator:

$$
L_{ret}=-\frac{1}{n}\sum_{i} \log{\frac{e^{sim(q_i,d_i^+)/\tau}}{e^{sim(q_i,d_i^+)/\tau}+\sum_{j}e^{sim(q_i,d_j^-)/\tau}+\sum_{j\neq i}e^{sim(q_i,q_j)/\tau}}}
$$
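As a reference, the retrieval loss above can be sketched in PyTorch (a minimal illustration assuming cosine similarity on normalized embeddings; function and variable names are ours, not the training code's):

```python
import torch
import torch.nn.functional as F

def infonce_with_query_negatives(q, d_pos, d_neg, tau=0.05):
    """InfoNCE with hard document negatives plus in-batch query-query negatives.

    q:     (B, H) query embeddings
    d_pos: (B, H) positive document embeddings
    d_neg: (B, K, H) hard negative document embeddings
    """
    q = F.normalize(q, dim=-1)
    d_pos = F.normalize(d_pos, dim=-1)
    d_neg = F.normalize(d_neg, dim=-1)

    pos = (q * d_pos).sum(-1, keepdim=True) / tau         # (B, 1) sim(q_i, d_i+)
    neg_doc = torch.einsum("bh,bkh->bk", q, d_neg) / tau  # (B, K) sim(q_i, d_j-)
    qq = (q @ q.T) / tau                                  # (B, B) sim(q_i, q_j)
    # Exclude sim(q_i, q_i) from the query-query negative term (j != i).
    qq = qq.masked_fill(torch.eye(len(q), dtype=torch.bool), float("-inf"))

    logits = torch.cat([pos, neg_doc, qq], dim=1)
    # The positive sits at index 0 of every row.
    targets = torch.zeros(len(q), dtype=torch.long)
    return F.cross_entropy(logits, targets)
```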

- STS: We apply the CoSENT loss:

$$
L_{cosent}=\log \bigg(1+\sum_{sim(i,j)>sim(k,l)}\exp\Big(\frac{sim(x_k, x_l)-sim(x_i,x_j)}{\tau}\Big)\bigg)
$$
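A minimal PyTorch sketch of the CoSENT loss (names are illustrative; it uses the identity log(1 + Σ exp(d)) = logsumexp([0, d])):

```python
import torch

def cosent_loss(cos_sim, labels, tau=0.05):
    """CoSENT: wherever gold labels say pair i ranks above pair j,
    penalize the predicted similarity of j exceeding that of i.

    cos_sim: (N,) predicted cosine similarities of N text pairs
    labels:  (N,) gold similarity scores (or binary labels)
    """
    # order[i, j] is True where pair i should rank above pair j.
    order = labels[:, None] > labels[None, :]
    # diffs[i, j] = (sim_j - sim_i) / tau for the selected (i, j).
    diffs = ((cos_sim[None, :] - cos_sim[:, None]) / tau)[order]
    # Prepend a zero so logsumexp computes log(1 + sum(exp(diffs))).
    return torch.logsumexp(torch.cat([torch.zeros(1), diffs]), dim=0)
```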

- CLS: We apply the same InfoNCE loss as for retrieval, but because in-batch negatives have a high probability of same-class collisions, a mask mechanism removes same-class samples from the negatives shared across examples:

$$
L_{cls}=-\frac{1}{n}\sum_{i} \log{\frac{e^{sim(t_i,t_i^+)/\tau}}{e^{sim(t_i,t_i^+)/\tau}+\sum_{n}MASK(t_i,t_{i,n}^-)\cdot e^{sim(t_i,t_{i,n}^-)/\tau}+\sum_{j\neq i}MASK(t_i,t_j)\cdot e^{sim(t_i,t_j)/\tau}+\sum_{j\neq i}\sum_{n}MASK(t_i,t_{j,n}^-)\cdot e^{sim(t_i,t_{j,n}^-)/\tau}}}
$$

$$
\text{where}\:\:C_{t_i}=C_{t_i^+}
$$

$$
MASK(t_i, t_j)=
\begin{cases}
0 & \quad \text{if } C_{t_i}=C_{t_j}, \\
1 & \quad \text{otherwise}
\end{cases}
$$

where $C_{t_i}$ denotes the class label of sample $t_i$, and $n$ indexes the negative samples of a single example.
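The mask itself is straightforward to construct; a sketch (illustrative, not the actual training code):

```python
import torch

def class_conflict_mask(labels):
    """MASK(t_i, t_j): 0 where two in-batch samples share a class label
    (so a same-class in-batch 'negative' is excluded), 1 otherwise.

    labels: (B,) integer class labels of the in-batch samples
    """
    same_class = labels[:, None] == labels[None, :]
    # Multiply this elementwise into the exp(sim/tau) negative terms
    # of the InfoNCE denominator.
    return (~same_class).float()
```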
### Feature-Enhancement Data Synthesis Technology
Given the strong language and writing capabilities of today's LLMs, we leverage LLM APIs for data synthesis. To address limited data volume and narrow topic/feature coverage in the training sets, we propose rewriting and expansion synthesis techniques. Furthermore, to raise the difficulty of negatives during training, we design an LLM-based hard-negative synthesis technique, combined with conventional hard-negative sampling based on a strong retriever. These techniques are illustrated below:
<div align="center"><img src="image-9.png" width="930" height="290"></img></div>
<div align="center"><img src="image-10.png" width="880" height="220"></img></div>
<div align="center"><img src="image-11.png" width="880" height="210"></img></div>

For more details, including how to reproduce the evaluation results and the instruction content and how it is added, please refer to our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a> repo. Thanks!

## Evaluation Results
### MTEB details
<div align="center"><img src="image-7.png" width="1100" height="260"></img></div>

### CMTEB details
<div align="center"><img src="image-8.png" width="1000" height="260"></img></div>

## Usage
### Fully reproducing the benchmark results
We provide the exact parameters and environment configuration (dependency versions, model arguments, etc.) so that you can reproduce the MTEB leaderboard results on your own machine.
#### Requirements
- Python: 3.10.12
- Sentence Transformers: 3.4.1
- Transformers: 4.51.1
- PyTorch: 2.7.1
- Accelerate: 1.3.0
- Datasets: 3.2.0
- Tokenizers: 0.21.2
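For convenience, the version pins above can be captured in a requirements file (an illustrative `requirements.txt`; Python 3.10.12 itself must come from your environment):

```
sentence-transformers==3.4.1
transformers==4.51.1
torch==2.7.1
accelerate==1.3.0
datasets==3.2.0
tokenizers==0.21.2
```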
#### Transformers model load arguments
`torch_dtype=torch.bfloat16`<br>
`attn_implementation='sdpa'`<br>
**NOTE:** The leaderboard results were produced with the sdpa attention implementation. Other modes ('eager', 'flash_attention_2') may yield slightly different numbers, but overall performance remains consistent.
#### Instruction Adding Rules
Details can be found on our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>.
#### Evaluation code usage
Our benchmark evaluation code is on <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>. The MTEB benchmark script is **run_mteb_all_v2.py**, and the CMTEB benchmark script is **run_cmteb_all.py**. Run the following commands:
```bash
POOLING_MODE=mean
normalize=true
use_instruction=true
export TOKENIZERS_PARALLELISM=true

model_name_or_path=<model dir>

python3 ./run_cmteb_all.py \
    --model_name_or_path ${model_name_or_path} \
    --pooling_mode ${POOLING_MODE} \
    --normalize ${normalize} \
    --use_instruction ${use_instruction} \
    --output_dir <output dir>

python3 ./run_mteb_all_v2.py \
    --model_name_or_path ${model_name_or_path} \
    --pooling_mode ${POOLING_MODE} \
    --normalize ${normalize} \
    --use_instruction ${use_instruction} \
    --output_dir <output dir>
```
Replace each `<>` placeholder with your actual setting.<br>
This is a general script that can also evaluate other Hugging Face embedding models, provided the pooling and other configurations are set correctly.

### Sentence-transformers

```python
from sentence_transformers import SentenceTransformer

# Simple load:
model = SentenceTransformer("QZhou-Embedding")

# Or load with explicit device and tokenizer settings:
model = SentenceTransformer(
    "QZhou-Embedding",
    model_kwargs={"device_map": "auto", "trust_remote_code": True},
    tokenizer_kwargs={"padding_side": "left", "trust_remote_code": True},
    trust_remote_code=True,
)

queries = [
    "What is photosynthesis?",
    "Who invented the telephone?",
]
documents = [
    "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
    "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device."
]

query_embeddings = model.encode(queries, prompt_name="query", normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)

similarity = model.similarity(query_embeddings, document_embeddings)
```

### Hugging Face Transformers

```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, the last position holds every sequence's final token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is photosynthesis?'),
    get_detailed_instruct(task, 'Who invented the telephone?')
]

documents = [
    "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
    "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device."
]

input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('QZhou-Embedding', padding_side='left', trust_remote_code=True)
model = AutoModel.from_pretrained('QZhou-Embedding', trust_remote_code=True, device_map='auto')

batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=8192,
    return_tensors="pt",
)
batch_dict = batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
```