license: cc-by-sa-4.0
---

# Welcome to RLHN (EMNLP 2025 Findings)

RLHN (ReLabeling Hard Negatives) uses a cascading LLM framework to identify and relabel *false negatives* in IR training datasets.

This repository contains training datasets curated by RLHN & models fine-tuned on these curated datasets.
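To make the relabeling idea concrete, here is a minimal illustrative sketch (not the actual RLHN pipeline): `cheap_judge` and `strong_judge` are hypothetical stand-ins for the cascade's LLM relevance calls, where a cheap model screens every hard negative and only flagged passages are escalated to a stronger model before being relabeled as positives.

```python
# Illustrative sketch of a cascading relabeling pass (hypothetical, not the
# actual RLHN implementation). Each "judge" stands in for an LLM relevance
# call: a cheap model screens all hard negatives, and only passages it flags
# are escalated to a stronger model for confirmation.

def cascade_relabel(query, hard_negatives, cheap_judge, strong_judge):
    """Split hard negatives into confirmed false negatives (to be
    relabeled as positives) and passages kept as true negatives."""
    relabeled, kept = [], []
    for passage in hard_negatives:
        # Stage 1: cheap screen over every candidate.
        if cheap_judge(query, passage):
            # Stage 2: stronger model must confirm before relabeling.
            if strong_judge(query, passage):
                relabeled.append(passage)
                continue
        kept.append(passage)
    return relabeled, kept


# Toy judges: keyword overlap as a stand-in for LLM relevance judgments.
def cheap(q, p):
    return any(w in p.lower() for w in q.lower().split())

def strong(q, p):
    return "retrieval" in p.lower()

query = "training data for retrieval models"
negatives = [
    "A passage about retrieval training data quality",  # false negative
    "A passage about cooking pasta",                    # true negative
]
relabeled, kept = cascade_relabel(query, negatives, cheap, strong)
```

The two-stage design keeps cost low: the expensive judge only sees the small subset of passages the cheap judge already considers plausibly relevant.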

List of Contributors:

- Xueguang Ma
- Jimmy Lin

Paper URL: https://aclanthology.org/2025.findings-emnlp.481/

# Citation

```
@inproceedings{thakur-etal-2025-hard,
    title = "Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with {LLM}s",
    author = "Thakur, Nandan and
      Zhang, Crystina and
      Ma, Xueguang and
      Lin, Jimmy",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.481/",
    doi = "10.18653/v1/2025.findings-emnlp.481",
    pages = "9064--9083",
    ISBN = "979-8-89176-335-7",
    abstract = "Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness {---} pruning 8 out of 15 datasets from the BGE collection, reduces the training set size by 2.35{\texttimes}, surprisingly increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on ``false negatives'', where relevant passages are incorrectly labeled as irrelevant. We utilize LLMs as a simple, cost-effective approach to \textit{identify} and \textit{relabel} false negatives in training datasets. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 points on BEIR and by 1.7-1.8 points at nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of LLMs to identify false negatives is supported by human annotation results. Our training dataset and code are publicly available."
}
```