Update README.md
README.md
CHANGED
|
@@ -12,67 +12,18 @@ tags:
|
|
| 12 |
<img src="assets/image-1.png" width="800" height="300"></img>
|
| 13 |
</div>
|
| 14 |
|
| 15 |
-
**<span style="color:red">Important Notice:</span>**
|
| 16 |
-
1. Our model parameters were **<span style="color:red">updated on August 24</span>**. If you downloaded the files before this date, please update to the latest version as soon as possible!<br>
|
| 17 |
-
2. Our technical report will be released shortly. Stay tuned!
|
| 18 |
|
| 19 |
## Introduction
|
| 20 |
-
We
|
| 21 |
## Basic Features
|
| 22 |
|
| 23 |
- Powerful text embedding capabilities;
|
| 24 |
- Long context: up to 8k context length;
|
| 25 |
- 7B parameter size
|
| 26 |
|
| 27 |
-
|
| 28 |
-
## Technical Introduction
|
| 29 |
-
### Unified Task Modeling Framework
|
| 30 |
-
We cast text embedding objectives as three modeling and optimization problems (retrieval, STS, and CLS) and propose a unified structured format for training data, together with a corresponding training mechanism. This approach lets most open-source data be integrated as retrieval training sets. The data is structured as follows:
|
| 31 |
-
- Retrieval
|
| 32 |
-
- title-body
|
| 33 |
-
- title-abstract
|
| 34 |
-
- Question Answering Dataset
|
| 35 |
-
- Reading comprehension
|
| 36 |
-
- ...
|
| 37 |
-
|
| 38 |
-
- STS
|
| 39 |
-
- text pair + label in {true, false} or {yes, no}
|
| 40 |
-
- text pair + score (such as 0.2, 3.1, 4.8, etc.)
|
| 41 |
-
- NLI dataset: text pair + label in {'entailment', 'neutral', 'contradiction'}
|
| 42 |
-
|
| 43 |
-
- CLS
|
| 44 |
-
- text+CLS label
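
As a rough illustration of the structure above, the three task types could be represented as records like the following. This is only a sketch: the class names, field names, and the `title_body_to_retrieval` helper are hypothetical and are not taken from the QZhou-Embedding code.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class RetrievalExample:
    # Hypothetical record for title-body, title-abstract, QA, and similar pairs.
    query: str
    positive: str
    negatives: List[str] = field(default_factory=list)


@dataclass
class STSExample:
    # The label may be a boolean class, a similarity score, or an NLI label string.
    text_a: str
    text_b: str
    label: Union[bool, float, str]


@dataclass
class CLSExample:
    # A single text with its class label.
    text: str
    label: str


def title_body_to_retrieval(title: str, body: str) -> RetrievalExample:
    """Treat a (title, body) pair as a (query, positive document) pair."""
    return RetrievalExample(query=title, positive=body)
```

Under these assumptions, a title-body corpus becomes retrieval data with one call per document, and hard negatives can be appended to `negatives` later.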
|
| 45 |
-
|
| 46 |
-
<div align="center"><img src="assets/image-18.png" width="1000" height="600"></img></div>
|
| 47 |
-
<div align="center"><img src="assets/image-16.png" width="1000" height="550"></img></div>
|
| 48 |
-
|
| 49 |
-
### Training Objectives
|
| 50 |
-
|
| 51 |
-
- Retrieval: Apply the InfoNCE contrastive loss and, following gte and qwen3-embedding, add query-query negatives to the denominator.<br>
|
| 52 |
-
<div align="center"><img src="assets/formula1.png" width="700" height="110"></img></div>
|
| 53 |
-
|
| 54 |
-
- STS: Apply the CoSENT loss:
|
| 55 |
-
<div align="center"><img src="assets/formula2.png" width="700" height="110"></img></div>
|
| 56 |
-
|
| 57 |
-
- CLS: Apply the same InfoNCE loss as retrieval. Because in-batch negatives are likely to contain samples of the same class, a mask excludes same-class samples from the negatives shared across the batch.
|
| 58 |
-
<div align="center"><img src="assets/formula3.png" width="1100" height="180"></img></div>
|
| 59 |
-
Where $C_{t_i}$ represents the class label of sample $t_i$, and $n$ is the number of negative samples for a single data point.
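
The exact formulas are given only as images, so the following is a minimal pure-Python sketch of the three objectives as described in the text, not the authors' implementation. Cosine similarity, the temperature `tau`, the CoSENT `scale`, and all function names are assumptions, and embeddings are assumed to be non-zero vectors.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries: List[Vector], docs: List[Vector], tau: float = 0.05,
             labels: Optional[List[str]] = None) -> float:
    """Mean InfoNCE loss where docs[i] is the positive for queries[i].

    The denominator holds the positive, the in-batch documents, and the
    other queries (query-query negatives). If `labels` is given, negatives
    that share the anchor's class are masked out (the CLS variant).
    """
    total = 0.0
    for i, q in enumerate(queries):
        pos = math.exp(cosine(q, docs[i]) / tau)
        denom = pos
        for j in range(len(queries)):
            if j == i:
                continue
            if labels is not None and labels[j] == labels[i]:
                continue  # same class: not a valid negative
            denom += math.exp(cosine(q, docs[j]) / tau)     # query-doc negative
            denom += math.exp(cosine(q, queries[j]) / tau)  # query-query negative
        total += -math.log(pos / denom)
    return total / len(queries)


def cosent(pairs: List[Tuple[Vector, Vector]], scores: List[float],
           scale: float = 20.0) -> float:
    """CoSENT loss: penalize pairs whose cosine order disagrees with the scores."""
    cos = [cosine(a, b) for a, b in pairs]
    total = 1.0
    for i, s_i in enumerate(scores):
        for k, s_k in enumerate(scores):
            if s_i > s_k:  # pair i should be more similar than pair k
                total += math.exp(scale * (cos[k] - cos[i]))
    return math.log(total)
```

In this sketch, passing `labels` to `info_nce` switches from the retrieval objective to the masked CLS objective, so the two tasks share one code path.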
|
| 60 |
-
|
| 61 |
-
### Feature Enhancement Data Synthesis Technology
|
| 62 |
-
Building on the strong language and writing capabilities of LLMs, we use LLM APIs for data synthesis. To address limited data and narrow topics/features in training sets, we propose rewriting-based and expansion-based synthesis techniques. To make negative examples harder during training, we also designed an LLM-based hard negative synthesis technique and combined it with hard negative sampling from existing strong retrievers. Several of these techniques are illustrated below:
|
| 63 |
-
<div align="center"><img src="assets/image-9.png" width="930" height="290"></img></div>
|
| 64 |
-
<div align="center"><img src="assets/image-10.png" width="880" height="220"></img></div>
|
| 65 |
-
<div align="center"><img src="assets/image-11.png" width="880" height="210"></img></div>
|
| 66 |
-
|
| 67 |
-
For more details, including how to reproduce the evaluation results, the instruction content, and how instructions are added, please refer to our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a> repo. Thanks!
|
| 68 |
-
|
| 69 |
-
## Evaluation Results
|
| 70 |
-
### MTEB details
|
| 71 |
-
<div align="center"><img src="assets/image-7.png" width="1100" height="260"></img></div>
|
| 72 |
-
|
| 73 |
-
### CMTEB details
|
| 74 |
-
<div align="center"><img src="assets/image-8.png" width="1000" height="260"></img></div>
|
| 75 |
-
|
| 76 |
## Usage
|
| 77 |
### Completely reproduce the benchmark results
|
| 78 |
We provide detailed parameters and environment configurations, including environment dependencies and model arguments, so that you can reproduce results on your own machine that match the MTEB leaderboard exactly.
|
|
|
|
| 12 |
<img src="assets/image-1.png" width="800" height="300"></img>
|
| 13 |
</div>
|
| 14 |
|
| 15 |
|
| 16 |
## Introduction
|
| 17 |
+
We present <a href="https://huggingface.co/Kingsoft-LLM/QZhou-Embedding">QZhou-Embedding</a> (called "Qingzhou Embedding"), a general-purpose contextual text embedding model with exceptional text representation capabilities. Building upon the <a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct">Qwen2.5-7B-Instruct</a> foundation model, we designed a unified multi-task framework and developed a data synthesis pipeline leveraging LLM APIs. Our two-stage training strategy employs initial retrieval-focused pretraining followed by full-task fine-tuning, enabling the embedding model to extend its capabilities on top of robust retrieval performance. The model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (as of August 27, 2025).
|
| 18 |
+
|
| 19 |
+
**<span style="color:red">We will release our technical report soon. Stay tuned!</span>**
|
| 20 |
+
|
| 21 |
## Basic Features
|
| 22 |
|
| 23 |
- Powerful text embedding capabilities;
|
| 24 |
- Long context: up to 8k context length;
|
| 25 |
- 7B parameter size
|
| 26 |
|
| 27 |
## Usage
|
| 28 |
### Completely reproduce the benchmark results
|
| 29 |
We provide detailed parameters and environment configurations, including environment dependencies and model arguments, so that you can reproduce results on your own machine that match the MTEB leaderboard exactly.
|