Update README.md
README.md CHANGED
@@ -43,8 +43,10 @@ Total trained query and document pair is 100,000.<br>
 ### Training Details
 The overall training process was conducted with reference to **snowflake-arctic-2.0**.<br>
 Training was divided into two stages: Pre-training and Post-training.<br>
-
-In the
+
+* In the pre-training stage, the model was trained using in-batch negatives.
+* In the post-training stage, we utilized the multilingual-e5-large model to identify hard negatives—specifically, the top 4 samples with a similarity score below a **99% threshold**.
+
 Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br>
 The types of data augmentation applied are as follows:<br>
 | Augmentation* | Description |
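
The two stages described in the added lines can be illustrated with a small sketch. This is not the authors' training code. The function names, the temperature value, and the reading of the "99% threshold" as "keep candidates scoring below 0.99 times the query-positive similarity" are all assumptions made for illustration. In the setup the README describes, the embeddings would come from multilingual-e5-large; here they are plain NumPy arrays.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def in_batch_negatives_loss(queries: np.ndarray, docs: np.ndarray,
                            temperature: float = 0.05) -> float:
    """Contrastive (InfoNCE) loss where docs[i] is the positive for queries[i]
    and every other document in the batch serves as a negative.
    The temperature default is an assumed value, not taken from the README."""
    logits = l2_normalize(queries) @ l2_normalize(docs).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def mine_hard_negatives(query: np.ndarray, positive: np.ndarray,
                        candidates: np.ndarray, top_k: int = 4,
                        ratio: float = 0.99) -> list[int]:
    """Return indices of the top_k candidates most similar to the query,
    keeping only those that score below ratio * (query-positive similarity).
    Treating the threshold as relative to the positive score is an assumption."""
    q = query / np.linalg.norm(query)
    pos_score = float(q @ (positive / np.linalg.norm(positive)))
    scores = l2_normalize(candidates) @ q
    allowed = np.flatnonzero(scores < ratio * pos_score)
    ranked = allowed[np.argsort(-scores[allowed])]  # most similar first
    return ranked[:top_k].tolist()


if __name__ == "__main__":
    query = np.array([1.0, 0.0])
    positive = np.array([1.0, 0.1])
    candidates = np.array([
        [1.0, 0.05],   # scores above the cutoff, so it is skipped
        [1.0, 0.5],
        [1.0, 1.0],
        [0.0, 1.0],
        [1.0, 0.3],
        [-1.0, 0.0],
        [1.0, 2.0],
    ])
    print(mine_hard_negatives(query, positive, candidates))
```

Candidates that score at or above the cutoff are skipped because they are likely unlabeled positives. If the threshold is instead meant as an absolute similarity of 0.99, the comparison would use `ratio` directly rather than `ratio * pos_score`.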