---
language:
- ara
- dan
- deu
- eng
- fas
- fra
- hin
- ind
- ita
- jpn
- kor
- nld
- pol
- por
- rus
- spa
- swe
- tur
- vie
- zho
multilingual: true
tags:
- dense-retrieval
- hard-negatives
- knowledge-distillation
- webfaq
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
---
# WebFAQ 2.0: Multilingual Hard Negatives

This dataset contains **mined hard negatives** derived from the **WebFAQ 2.0** corpus. It covers roughly **1.3 million** samples across **20 languages**.

The dataset is designed to support robust training of dense retrieval models, specifically enabling:

1. **Contrastive Learning:** Using strict hard negatives to improve discrimination.
2. **Knowledge Distillation:** Using the provided cross-encoder scores to train with soft labels (e.g., MarginMSE; see the sketch after this list).

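As a rough illustration of the distillation use case, the following sketch trains a bi-encoder with `sentence-transformers` and MarginMSE. Note the assumptions: the dataset id is a placeholder, and because this card only ships a score for the `(query, negative)` pair, the `(query, positive)` score is approximated by a constant; in practice you would score the positive pairs with the same cross-encoder.

```python
# Hedged MarginMSE distillation sketch with sentence-transformers.
# Assumptions: "<this-dataset-id>" is a placeholder, and POS_SCORE is a
# stand-in for the (query, positive) cross-encoder score, which this
# dataset does not ship.
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

ds = load_dataset("<this-dataset-id>", split="train")  # placeholder id

POS_SCORE = 1.0  # assumption: constant proxy for CE(query, positive)

train_examples = [
    InputExample(
        texts=[row["query"], row["positive"], row["negative"]],
        # MarginMSE regresses the margin CE(q, pos) - CE(q, neg).
        label=POS_SCORE - row["score"],
    )
    for row in ds
]

model = SentenceTransformer("distilbert-base-multilingual-cased")
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MarginMSELoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)
```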
## Dataset Creation & Mining Process

To ensure high-quality training signals, we employed a **two-stage mining pipeline** that balances difficulty with correctness.

### 1. Lexical Retrieval (Recall)

For every query in WebFAQ, we first retrieved the **top-200 candidate answers** from the monolingual corpus using **BM25** (see the sketch below).

* **Goal:** Identify candidates with high lexical overlap (shared keywords) that are likely to be "hard" for a dense retriever to distinguish.

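A minimal sketch of this first stage using the `rank_bm25` package, assuming simple whitespace tokenization; the actual pipeline presumably uses language-appropriate tokenization for each of the 20 corpora.

```python
# First-stage candidate retrieval sketch with rank_bm25 (BM25Okapi).
# Assumption: whitespace tokenization; a real multilingual pipeline
# would use language-specific tokenizers.
from rank_bm25 import BM25Okapi

answers = [
    "You can reset your password from the account settings page.",
    "Shipping usually takes 3-5 business days.",
    "Contact support if your password reset email never arrives.",
]
tokenized_answers = [a.lower().split() for a in answers]
bm25 = BM25Okapi(tokenized_answers)

query = "how do I reset my password"
# Retrieve the top-k candidates by lexical overlap (top-200 in this dataset's setup).
candidates = bm25.get_top_n(query.lower().split(), answers, n=2)
print(candidates)
```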
### 2. Semantic Reranking (Precision)

We reranked the top-200 candidates using the state-of-the-art cross-encoder model **[BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)**.

* **Goal:** Assess the true semantic relevance of each candidate (see the sketch below).

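A hedged sketch of candidate scoring with BGE-M3 via the `FlagEmbedding` package; the mode weights below are illustrative, not necessarily the configuration used to build this dataset.

```python
# Candidate rescoring sketch using BGE-M3 through FlagEmbedding.
# Assumption: the dense/sparse/multi-vector weights are illustrative
# defaults, not the exact setup behind this dataset.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

query = "how do I reset my password"
candidates = [
    "You can reset your password from the account settings page.",
    "Shipping usually takes 3-5 business days.",
]

pairs = [[query, c] for c in candidates]
scores = model.compute_score(
    pairs,
    weights_for_different_modes=[0.4, 0.2, 0.4],  # dense, sparse, multi-vector
)
# The fused signal combines all three scoring modes.
print(scores["colbert+sparse+dense"])
```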
### 3. Filtering & Scoring

We applied a rigorous filtering strategy to curate the final dataset (a score-band sketch follows this list):

* **False Negative Removal:** Candidates with extremely high cross-encoder scores (likely true semantic matches) were discarded to prevent "poisoning" the training data with valid answers labeled as negatives.
* **Easy Negative Removal:** Candidates with very low scores were discarded to ensure training efficiency.
* **Score Retention:** We retained the BGE-M3 relevance score for every negative, enabling knowledge distillation workflows.

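The effect of this filter can be pictured as keeping only a band of scores. The thresholds below are hypothetical placeholders; the exact cut-offs are not specified in this card.

```python
# Score-band filtering sketch. UPPER and LOWER are hypothetical
# placeholders, not the actual cut-offs used for this dataset.
UPPER = 0.9  # above: likely a false negative (a valid answer) -> drop
LOWER = 0.3  # below: an easy negative with little training signal -> drop

def keep_negative(score: float) -> bool:
    """Keep candidates that are hard but (probably) not false negatives."""
    return LOWER <= score <= UPPER

candidates = [
    {"text": "valid paraphrase of the answer", "score": 0.95},   # dropped
    {"text": "on-topic but non-answering text", "score": 0.55},  # kept
    {"text": "unrelated text", "score": 0.10},                   # dropped
]
hard_negatives = [c for c in candidates if keep_negative(c["score"])]
print(hard_negatives)
```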
## Dataset Structure

Each sample in the dataset contains the following fields:

| Field | Description |
| :--- | :--- |
| `query` | The user question. |
| `positive` | The ground-truth correct answer. |
| `negative` | The mined hard negative (non-relevant but similar). |
| `score` | The **BGE-M3 cross-encoder score** for the `(query, negative)` pair. |

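Loading the data with the Hugging Face `datasets` library might look as follows; the dataset id is a placeholder, and the split or configuration names (e.g., per-language configs) may differ.

```python
# Loading sketch with the Hugging Face datasets library.
# Assumption: "<this-dataset-id>" is a placeholder for this repository;
# split/config names may differ in practice.
from datasets import load_dataset

ds = load_dataset("<this-dataset-id>", split="train")
row = ds[0]
print(row["query"], row["positive"], row["negative"], row["score"])
```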
### Code & Reproduction

The code used for mining, filtering, and training is available in the official repository:

* **GitHub Repository:** [Link to your GitHub Repo]
* **WebFAQ Project:** [OpenWebSearch.EU](https://openwebsearch.eu)
## Citation

If you use this dataset, please cite the WebFAQ 2.0 paper:

```bibtex
@inproceedings{dinzinger2025webfaq,
  title={WebFAQ: A Multilingual Collection of Natural QA Datasets for Dense Retrieval},
  author={Dinzinger, Michael and Caspari, Laura and Dastidar, Kanishka Ghosh and Mitrović, Jelena and Granitzer, Michael},
  booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2025}
}
```