nthakur committed · verified
Commit 27ca3ad · 1 Parent(s): b1d11a3

updated README

Files changed (1): README.md (+22 −10)
README.md CHANGED

@@ -8,7 +8,7 @@ pinned: false
 license: cc-by-sa-4.0
 ---
 
-# Welcome to RLHN
+# Welcome to RLHN (EMNLP 2025 Findings)
 RLHN (ReLabeling Hard Negatives) uses a cascading LLM framework to identify and relabel *false negatives* in IR training datasets.
 
 This repository contains training datasets curated by RLHN & models fine-tuned on these curated datasets.

@@ -19,18 +19,30 @@ List of Contributors:
 - Xueguang Ma
 - Jimmy Lin
 
-Preprint URL: https://huggingface.co/papers/2505.16967
+Paper URL: https://aclanthology.org/2025.findings-emnlp.481/
 
 # Citation
 
 ```
-@misc{thakur2025rlhn,
-    title={Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval},
-    author={Nandan Thakur and Crystina Zhang and Xueguang Ma and Jimmy Lin},
-    year={2025},
-    eprint={2505.16967},
-    archivePrefix={arXiv},
-    primaryClass={cs.IR},
-    url={https://arxiv.org/abs/2505.16967},
+@inproceedings{thakur-etal-2025-hard,
+    title = "Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with {LLM}s",
+    author = "Thakur, Nandan  and
+      Zhang, Crystina  and
+      Ma, Xueguang  and
+      Lin, Jimmy",
+    editor = "Christodoulopoulos, Christos  and
+      Chakraborty, Tanmoy  and
+      Rose, Carolyn  and
+      Peng, Violet",
+    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
+    month = nov,
+    year = "2025",
+    address = "Suzhou, China",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2025.findings-emnlp.481/",
+    doi = "10.18653/v1/2025.findings-emnlp.481",
+    pages = "9064--9083",
+    ISBN = "979-8-89176-335-7",
+    abstract = "Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness {---} pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35{\texttimes} and surprisingly increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on ``false negatives'', where relevant passages are incorrectly labeled as irrelevant. We utilize LLMs as a simple, cost-effective approach to \textit{identify} and \textit{relabel} false negatives in training datasets. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7--1.4 points on BEIR and by 1.7--1.8 points at nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of LLMs to identify false negatives is supported by human annotation results. Our training dataset and code are publicly available."
 }
 ```
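
The abstract above describes a cascade of LLM judges that promotes mislabeled "hard negatives" to positives. A minimal sketch of that cascade pattern, assuming hypothetical judge callables in place of real LLM calls (the function names, signatures, and keyword-matching stand-in judges below are illustrative assumptions, not the paper's actual API):

```python
# Hedged sketch of a cascading relabeling pass over hard negatives.
# Each "judge" stands in for an LLM relevance call; cheaper judges run
# first and short-circuit, so expensive judges only see survivors.
from typing import Callable, List, Tuple

Judge = Callable[[str, str], bool]  # (query, passage) -> judged relevant?

def relabel_hard_negatives(
    query: str,
    hard_negatives: List[str],
    judges: List[Judge],
) -> Tuple[List[str], List[str]]:
    """Split hard negatives into kept negatives and relabeled positives.

    A passage is promoted to positive only if every judge in the cascade
    agrees it is relevant; the first negative verdict stops the cascade.
    """
    negatives, positives = [], []
    for passage in hard_negatives:
        if all(judge(query, passage) for judge in judges):
            positives.append(passage)  # false negative -> relabel as positive
        else:
            negatives.append(passage)  # keep as a true hard negative
    return negatives, positives

# Toy usage with keyword-matching stand-ins for the LLM judges.
cheap_judge = lambda q, p: q.split()[0] in p
strong_judge = lambda q, p: all(w in p for w in q.split())

negs, new_pos = relabel_hard_negatives(
    "capital of France",
    ["Paris is the capital of France.", "Berlin is in Germany."],
    [cheap_judge, strong_judge],
)
```

The short-circuiting `all(...)` is what makes the cascade cost-effective: most candidates are rejected by the cheap first-stage judge before any stronger (more expensive) model is consulted.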