This model was fine-tuned on the same dataset used in dragonkue/snowflake-arctic-embed-l-v2.0-ko, which consists of Korean query-passage pairs.

The training objective was to improve retrieval performance specifically for Korean-language tasks.

### Training Methods

Following the training approach used in dragonkue/snowflake-arctic-embed-l-v2.0-ko, this model constructs in-batch negatives based on clustered passages. In addition, we introduce GISTEmbedLoss with a configurable margin.

**📈 Margin-based Training Results**

- Using the standard MNR (Multiple Negatives Ranking) loss alone resulted in decreased performance.
- The original GISTEmbedLoss (without margin) yielded modest improvements of around +0.8 NDCG@10.
- Applying a margin led to performance gains of up to +1.5 NDCG@10.
- This indicates that simply tuning the margin value can roughly double the gain (from +0.8 to +1.5 NDCG@10), showing the strong sensitivity and effectiveness of margin scaling.

This margin-based approach extends the idea proposed in the NV-Retriever paper, which originally filtered false negatives during hard negative sampling. We adapt this to in-batch negatives, treating false negatives as dynamic samples guided by margin-based filtering.

<img src="https://cdn-uploads.huggingface.co/production/uploads/642b0c2fecec03b4464a1d9b/KrtD8Mdmz-ziozXCVz9Zr.png" width="800"/>
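
As a rough illustration of the filtering idea above (not the actual training code), the sketch below flags an in-batch negative as a likely false negative when its similarity to the query comes within a margin of the positive pair's similarity. The function name and the absolute-margin rule are assumptions for illustration only.

```python
import numpy as np

def false_negative_mask(sim: np.ndarray, margin: float) -> np.ndarray:
    """Flag in-batch negatives that look like false negatives.

    sim[i, j] is the similarity between query i and passage j, with
    sim[i, i] being the positive pair. A negative j is flagged when
    sim[i, j] >= sim[i, i] - margin (absolute-margin rule, assumed here).
    """
    pos = np.diag(sim)[:, None]      # positive score for each query
    mask = sim >= (pos - margin)     # negatives scoring too close to the positive
    np.fill_diagonal(mask, False)    # the positive itself is never a negative
    return mask

sim = np.array([
    [0.9, 0.85, 0.2],  # query 0: passage 1 is suspiciously close to the positive
    [0.3, 0.8,  0.1],
    [0.2, 0.4,  0.7],
])
print(false_negative_mask(sim, margin=0.1))
# Only (0, 1) is flagged: 0.85 >= 0.9 - 0.1
```

Flagged entries can then be excluded from the denominator of the contrastive loss rather than being treated as true negatives.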

The sentence-transformers library now supports GISTEmbedLoss with margin configuration, making it easy to integrate into any training pipeline.

You can install the latest version with:

```bash
pip install -U sentence-transformers
```

### Training Hyperparameters

#### Non-Default Hyperparameters

year={2024}
}
```

#### NV-Retriever: Improving text embedding models with effective hard-negative mining

```bibtex
@article{moreira2024nvretriever,
  title = {NV-Retriever: Improving text embedding models with effective hard-negative mining},
  author = {Moreira, Gabriel de Souza P. and Osmulski, Radek and Xu, Mengyao and Ak, Ronay and Schifferer, Benedikt and Oldridge, Even},
  journal = {arXiv preprint arXiv:2407.15831},
  year = {2024},
  url = {https://arxiv.org/abs/2407.15831},
  doi = {10.48550/arXiv.2407.15831}
}
```

## Limitations