---
title: README
emoji: 🐠
colorFrom: pink
colorTo: blue
sdk: static
pinned: false
license: cc-by-sa-4.0
---

# Welcome to RLHN (EMNLP 2025 Findings)

RLHN (ReLabeling Hard Negatives) uses a cascading LLM framework to identify and relabel *false negatives* in information retrieval (IR) training datasets.

This repository contains training datasets curated by RLHN and models fine-tuned on these curated datasets.

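The cascading idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes "cascading" means a cheaper judge screens every hard negative and a stronger judge is consulted only for the passages the first one flags. The names `TrainingExample`, `Judge`, and `relabel_false_negatives` are hypothetical, and the toy keyword judges merely stand in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical judge signature: (query, passage) -> True if the passage looks relevant.
# In practice each judge would wrap an LLM call; that part is not shown here.
Judge = Callable[[str, str], bool]


@dataclass
class TrainingExample:
    query: str
    positives: List[str]
    hard_negatives: List[str]


def relabel_false_negatives(
    example: TrainingExample, cheap_judge: Judge, strong_judge: Judge
) -> TrainingExample:
    """Return a copy where hard negatives accepted by both judges become positives."""
    positives = list(example.positives)
    negatives: List[str] = []
    for passage in example.hard_negatives:
        # Stage 1: the cheap judge screens every hard negative.
        # Stage 2: the strong judge runs only if stage 1 flagged the passage
        # (short-circuit `and`), which keeps the expensive calls to a minimum.
        if cheap_judge(example.query, passage) and strong_judge(example.query, passage):
            positives.append(passage)
        else:
            negatives.append(passage)
    return TrainingExample(example.query, positives, negatives)


if __name__ == "__main__":
    # Toy stand-ins for LLM judges: keyword overlap (any word vs. all words).
    def cheap(q: str, p: str) -> bool:
        return any(w in p.lower() for w in q.lower().split())

    def strong(q: str, p: str) -> bool:
        return all(w in p.lower() for w in q.lower().split())

    example = TrainingExample(
        query="capital france",
        positives=["Paris is the capital of France."],
        hard_negatives=[
            "The capital of France is Paris, on the Seine.",
            "France borders Spain.",
            "Berlin is in Germany.",
        ],
    )
    relabeled = relabel_false_negatives(example, cheap, strong)
    print(relabeled.positives)
    print(relabeled.hard_negatives)
```

The function returns a new example rather than mutating its input, so the original labels remain available for comparison.
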
List of Contributors:

- Nandan Thakur*
- Crystina Zhang*
- Xueguang Ma
- Jimmy Lin

Paper URL: https://aclanthology.org/2025.findings-emnlp.481/

# Citation

```
@inproceedings{thakur-etal-2025-hard,
    title = "Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with {LLM}s",
    author = "Thakur, Nandan and
      Zhang, Crystina and
      Ma, Xueguang and
      Lin, Jimmy",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.481/",
    doi = "10.18653/v1/2025.findings-emnlp.481",
    pages = "9064--9083",
    ISBN = "979-8-89176-335-7",
    abstract = "Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness {---} pruning 8 out of 15 datasets from the BGE collection, reduces the training set size by 2.35{\texttimes}, surprisingly increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on ``false negatives'', where relevant passages are incorrectly labeled as irrelevant. We utilize LLMs as a simple, cost-effective approach to \textit{identify} and \textit{relabel} false negatives in training datasets. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 points on BEIR and by 1.7-1.8 points at nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of LLMs to identify false negatives is supported by human annotation results. Our training dataset and code are publicly available."
}
```