This model was fine-tuned on the same dataset used in dragonkue/snowflake-arctic-embed-l-v2.0-ko, which consists of Korean query-passage pairs.

The training objective was to improve retrieval performance specifically for Korean-language tasks.

### Training Methods

Following the training approach used in dragonkue/snowflake-arctic-embed-l-v2.0-ko, this model constructs in-batch negatives based on clustered passages. In addition, we introduce GISTEmbedLoss with a configurable margin.

**📈 Margin-based Training Results**

- Using the standard MNR (Multiple Negatives Ranking) loss alone resulted in decreased performance.
- The original GISTEmbedLoss (without margin) yielded modest improvements of around +0.8 NDCG@10.
- Applying a margin led to performance gains of up to +1.5 NDCG@10.
- This indicates that simply tuning the margin value can roughly double the gain (from +0.8 to +1.5 NDCG@10), showing the strong sensitivity and effectiveness of margin scaling.

This margin-based approach extends the idea proposed in the NV-Retriever paper, which originally filtered false negatives during hard negative sampling. We adapt this to in-batch negatives, treating false negatives as dynamic samples guided by margin-based filtering.

<img src="https://cdn-uploads.huggingface.co/production/uploads/642b0c2fecec03b4464a1d9b/KrtD8Mdmz-ziozXCVz9Zr.png" width="800"/>
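
As a rough illustration of the filtering idea above (not the actual training code), the sketch below flags an in-batch negative as a likely false negative when its similarity to the query comes within a margin of the positive pair's similarity. The function name and the absolute-margin rule are assumptions for illustration only.

```python
import numpy as np

def false_negative_mask(sim: np.ndarray, margin: float) -> np.ndarray:
    """Flag in-batch negatives that look like false negatives.

    sim[i, j] is the similarity between query i and passage j, with
    sim[i, i] being the positive pair. A negative j is flagged when
    sim[i, j] >= sim[i, i] - margin (absolute-margin rule, assumed here).
    """
    pos = np.diag(sim)[:, None]      # positive score for each query
    mask = sim >= (pos - margin)     # negatives scoring too close to the positive
    np.fill_diagonal(mask, False)    # the positive itself is never a negative
    return mask

sim = np.array([
    [0.9, 0.85, 0.2],  # query 0: passage 1 is suspiciously close to the positive
    [0.3, 0.8,  0.1],
    [0.2, 0.4,  0.7],
])
print(false_negative_mask(sim, margin=0.1))
# Only (0, 1) is flagged: 0.85 >= 0.9 - 0.1
```

Flagged entries can then be excluded from the denominator of the contrastive loss rather than being treated as true negatives.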

The sentence-transformers library now supports GISTEmbedLoss with margin configuration, making it easy to integrate into any training pipeline.

You can install the latest version with:

```bash
pip install -U sentence-transformers
```

### Training Hyperparameters

#### Non-Default Hyperparameters

year={2024}
}
```

#### NV-Retriever: Improving text embedding models with effective hard-negative mining

```bibtex
@article{moreira2024nvretriever,
  title = {NV-Retriever: Improving text embedding models with effective hard-negative mining},
  author = {Moreira, Gabriel de Souza P. and Osmulski, Radek and Xu, Mengyao and Ak, Ronay and Schifferer, Benedikt and Oldridge, Even},
  journal = {arXiv preprint arXiv:2407.15831},
  year = {2024},
  url = {https://arxiv.org/abs/2407.15831},
  doi = {10.48550/arXiv.2407.15831}
}
```

## Limitations