---
language:
- ara
- dan
- deu
- eng
- fas
- fra
- hin
- ind
- ita
- jpn
- kor
- nld
- pol
- por
- rus
- spa
- swe
- tur
- vie
- zho
multilingual: true
tags:
- dense-retrieval
- hard-negatives
- knowledge-distillation
- webfaq
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
---
# WebFAQ 2.0: Multilingual Hard Negatives
This dataset contains **mined hard negatives** derived from the **WebFAQ 2.0** corpus. It comprises roughly **1.3 million** samples across **20 languages**.
The dataset is designed to support robust training of dense retrieval models, specifically enabling:
1. **Contrastive Learning:** Using strict hard negatives to improve discrimination.
2. **Knowledge Distillation:** Using the provided cross-encoder scores to train with soft labels (e.g., MarginMSE; see the training sketch below).
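As a hedged illustration of the distillation use case, here is a minimal MarginMSE training sketch with `sentence-transformers`. The dataset id, the bi-encoder checkpoint, and the cross-encoder used to score the positives are assumptions on our part: this dataset only ships scores for the negatives, so the positive side of the margin must be computed by you, on the same score scale.
```python
# Minimal MarginMSE distillation sketch (sentence-transformers legacy fit API).
# Assumptions, not from this card: the dataset id, the bi-encoder checkpoint,
# and the cross-encoder used to score (query, positive) -- this dataset only
# ships scores for (query, negative), and both scores must share one scale.
from datasets import load_dataset
from sentence_transformers import CrossEncoder, InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("distilbert-base-multilingual-cased")
ce = CrossEncoder("BAAI/bge-reranker-v2-m3")  # assumed scorer for the positives
ds = load_dataset("IrvinTopi/marginmse-webfaq", split="train")  # id is an assumption

examples = []
for row in ds:
    pos_score = float(ce.predict([(row["query"], row["positive"])])[0])
    margin = pos_score - row["score"]  # soft label: CE(q, pos) - CE(q, neg)
    examples.append(InputExample(texts=[row["query"], row["positive"], row["negative"]],
                                 label=margin))

loader = DataLoader(examples, shuffle=True, batch_size=32)
model.fit(train_objectives=[(loader, losses.MarginMSELoss(model))],
          epochs=1, warmup_steps=1000)
```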
## Dataset Creation & Mining Process
To ensure high-quality training signals, we employed a **two-stage mining pipeline** (lexical retrieval followed by semantic reranking), topped off with a filtering step that balances difficulty with correctness.
### 1. Lexical Retrieval (Recall)
For every query in WebFAQ, we first retrieved the **top-200 candidate answers** from the monolingual corpus using **BM25**.
* **Goal:** Identify candidates with high lexical overlap (shared keywords) that are likely to be "hard" for a dense retriever to distinguish (see the sketch below).
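The card does not name a particular BM25 implementation; below is an illustrative sketch of this step using the `rank_bm25` package (an assumption on our part), with naive whitespace tokenization standing in for a proper per-language tokenizer.
```python
# Illustrative top-200 lexical candidate mining; the rank_bm25 package and the
# whitespace tokenizer are stand-ins, not the pipeline's actual implementation.
import numpy as np
from rank_bm25 import BM25Okapi

corpus = ["To reset your password, open ...", "Shipping usually takes 3-5 days ..."]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def top_k(query: str, k: int = 200) -> list[int]:
    """Return indices of the k highest-scoring candidate answers."""
    scores = bm25.get_scores(query.lower().split())
    return np.argsort(scores)[::-1][:k].tolist()

candidates = top_k("how do i reset my password")
```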
### 2. Semantic Reranking (Precision)
We reranked the top-200 candidates using the state-of-the-art cross-encoder model: **[BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)**.
* **Goal:** Assess the true semantic relevance of each candidate (see the reranking sketch below).
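For a runnable sketch of this step we use the `CrossEncoder` wrapper from `sentence-transformers`, substituting **BAAI/bge-reranker-v2-m3** (the reranker companion to BGE-M3) so the `predict` call works out of the box; the checkpoint choice is ours, not a statement of the exact configuration used for this dataset.
```python
# Illustrative cross-encoder reranking of BM25 candidates. The checkpoint is a
# stand-in (this card itself points to BAAI/bge-m3).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

query = "how do i reset my password"
candidates = ["To reset your password, open ...", "Shipping usually takes 3-5 days ..."]
scores = reranker.predict([(query, cand) for cand in candidates])

# Highest semantic relevance first.
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
```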
### 3. Filtering & Scoring
We applied a rigorous filtering strategy to curate the final dataset (a code sketch follows the list):
* **False Negative Removal:** Candidates with extremely high cross-encoder scores (semantic matches) were discarded to prevent "poisoning" the training data with valid answers labeled as negatives.
* **Easy Negative Removal:** Candidates with very low scores were discarded, since such easy negatives contribute little useful training signal.
* **Score Retention:** We retained the BGE-M3 relevance scores for every negative, enabling knowledge distillation workflows.
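Taken together, the first two rules amount to a score-band filter; a minimal sketch follows, with placeholder thresholds since the exact cut-offs are not stated in this card.
```python
# Score-band filter implied by the rules above; both thresholds are
# placeholders, as the actual cut-offs are not documented in this card.
FALSE_NEGATIVE_CUTOFF = 0.9  # above this, candidate is likely a valid answer
EASY_NEGATIVE_CUTOFF = 0.1   # below this, candidate is too easy to be useful

def keep_negative(ce_score: float) -> bool:
    """Retain only candidates in the 'hard but wrong' band."""
    return EASY_NEGATIVE_CUTOFF <= ce_score <= FALSE_NEGATIVE_CUTOFF
```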
## Dataset Structure
Each sample in the dataset contains the following fields (a loading example follows the table):
| Field | Description |
| :--- | :--- |
| `query` | The user question. |
| `positive` | The ground-truth correct answer. |
| `negative` | The mined hard negative (non-relevant but similar). |
| `score` | The **BGE-M3 cross-encoder score** for the `(query, negative)` pair. |
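To peek at these fields without downloading the full corpus, the dataset can be streamed with 🤗 `datasets` (the dataset id is an assumption based on this repository's name):
```python
# Inspect one record; the dataset id is assumed from this repo's name.
from datasets import load_dataset

ds = load_dataset("IrvinTopi/marginmse-webfaq", split="train", streaming=True)
row = next(iter(ds))
print(row["query"], row["positive"], row["negative"], row["score"], sep="\n---\n")
```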
### Code & Reproduction
The code used for mining, filtering, and training is available in the official repository:
* **GitHub Repository:** [Link to your GitHub Repo]
* **WebFAQ Project:** [OpenWebSearch.EU](https://openwebsearch.eu)
## Citation
If you use this dataset, please cite the WebFAQ paper:
```bibtex
@inproceedings{dinzinger2025webfaq,
  title={WebFAQ: A Multilingual Collection of Natural QA Datasets for Dense Retrieval},
  author={Dinzinger, Michael and Caspari, Laura and Dastidar, Kanishka Ghosh and Mitrović, Jelena and Granitzer, Michael},
  booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2025}
}
```