---
language:
- ara
- dan
- deu
- eng
- fas
- fra
- hin
- ind
- ita
- jpn
- kor
- nld
- pol
- por
- rus
- spa
- swe
- tur
- vie
- zho
multilingual: true
tags:
- dense-retrieval
- hard-negatives
- knowledge-distillation
- webfaq
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
---

# WebFAQ 2.0: Multilingual Hard Negatives

This dataset contains **mined hard negatives** derived from the **WebFAQ 2.0** corpus. It covers roughly **1.3 million** samples across **20 languages**.

The dataset is designed to support robust training of dense retrieval models, specifically enabling:

1. **Contrastive Learning:** Using strict hard negatives to improve discrimination.
2. **Knowledge Distillation:** Using the provided cross-encoder scores to train with soft labels (e.g., MarginMSE).

## Dataset Creation & Mining Process

To ensure high-quality training signals, we employed a **three-stage mining pipeline** that balances difficulty with correctness.

### 1. Lexical Retrieval (Recall)

For every query in WebFAQ, we first retrieved the **top-200 candidate answers** from the monolingual corpus using **BM25**.

* **Goal:** Identify candidates with high lexical overlap (shared keywords) that are likely to be "hard" for a dense retriever to distinguish.

### 2. Semantic Reranking (Precision)

We reranked the top-200 candidates using the state-of-the-art cross-encoder model **[BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)**.

* **Goal:** Assess the true semantic relevance of each candidate.

### 3. Filtering & Scoring

We applied a rigorous filtering strategy to curate the final dataset:

* **False Negative Removal:** Candidates with extremely high cross-encoder scores (semantic matches) were discarded to prevent "poisoning" the training data with valid answers labeled as negatives.
* **Easy Negative Removal:** Candidates with very low scores were discarded to ensure training efficiency.
* **Score Retention:** We retained the BGE-M3 relevance scores for every negative, enabling knowledge distillation workflows.

## Dataset Structure

Each sample in the dataset contains the following fields:

| Field | Description |
| :--- | :--- |
| `query` | The user question. |
| `positive` | The ground-truth correct answer. |
| `negative` | The mined hard negative (non-relevant but similar). |
| `score` | The **BGE-M3 cross-encoder score** for the `(query, negative)` pair. |

### Code & Reproduction

The code used for mining, filtering, and training is available in the official repository:

* **GitHub Repository:** [Link to your GitHub Repo]
* **WebFAQ Project:** [OpenWebSearch.EU](https://openwebsearch.eu)

Illustrative loading, distillation, and mining sketches are also provided under *Example Usage* below.

## Citation

If you use this dataset, please cite the WebFAQ 2.0 paper:

```bibtex
@inproceedings{dinzinger2025webfaq,
  title={WebFAQ: A Multilingual Collection of Natural QA Datasets for Dense Retrieval},
  author={Dinzinger, Michael and Caspari, Laura and Dastidar, Kanishka Ghosh and Mitrović, Jelena and Granitzer, Michael},
  booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2025}
}
```
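
## Example Usage

The snippet below sketches how the fields described in the table above can be loaded and inspected with the `datasets` library. The repository id is a placeholder, as this card does not pin it down; substitute the actual Hub id of this dataset (a per-language configuration name may also be required).

```python
from datasets import load_dataset

# NOTE: "<org>/webfaq-hard-negatives" is a placeholder repository id;
# replace it with the actual Hub id of this dataset.
ds = load_dataset("<org>/webfaq-hard-negatives", split="train")

sample = ds[0]
print(sample["query"])     # the user question
print(sample["positive"])  # the ground-truth correct answer
print(sample["negative"])  # the mined hard negative
print(sample["score"])     # BGE-M3 cross-encoder score for (query, negative)
```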
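
### Knowledge Distillation with MarginMSE

A minimal sketch of how the retained `score` field can drive MarginMSE training with `sentence-transformers`. Several things here are assumptions, flagged in the comments: the student checkpoint is arbitrary, the teacher shown is a stand-in reranker (the card's actual teacher is BGE-M3), and because the dataset ships only the `(query, negative)` score, the `(query, positive)` side is scored on the fly to form the margin.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, InputExample, losses
from torch.utils.data import DataLoader

# Student bi-encoder; the checkpoint name is illustrative.
student = SentenceTransformer("distilbert-base-multilingual-cased")

# Teacher for scoring the positive side. This reranker is a stand-in;
# the scores shipped with the dataset come from BGE-M3.
teacher = CrossEncoder("BAAI/bge-reranker-v2-m3")

# MarginMSE learns the teacher margin CE(q, pos) - CE(q, neg).
examples = []
for row in ds:  # `ds` as loaded in the snippet above
    pos_score = teacher.predict([(row["query"], row["positive"])])[0]
    margin = float(pos_score - row["score"])
    examples.append(
        InputExample(texts=[row["query"], row["positive"], row["negative"]],
                     label=margin)
    )

loader = DataLoader(examples, shuffle=True, batch_size=32)
student.fit(train_objectives=[(loader, losses.MarginMSELoss(student))],
            epochs=1, warmup_steps=1000)
```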
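
### Reproducing the Mining Pipeline

For orientation, the three mining stages can be approximated as follows. This is a sketch under stated assumptions, not the authors' code (see the GitHub repository for that): `rank_bm25` with a whitespace tokenizer stands in for the production BM25 index, the reranker checkpoint again stands in for BGE-M3, and the thresholds `T_LOW`/`T_HIGH` are invented for illustration.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = ["answer one ...", "answer two ..."]      # monolingual answer corpus
bm25 = BM25Okapi([doc.split() for doc in corpus])  # toy whitespace tokenizer

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3") # stand-in for BGE-M3
T_LOW, T_HIGH = 0.1, 0.9                           # illustrative thresholds

def mine_hard_negatives(query: str, positive: str, k: int = 200):
    # Stage 1: lexical recall -- top-k candidates by BM25 score.
    scores = bm25.get_scores(query.split())
    top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]
    candidates = [corpus[i] for i in top_k if corpus[i] != positive]

    # Stage 2: semantic precision -- cross-encoder relevance for each pair.
    ce_scores = reranker.predict([(query, c) for c in candidates])

    # Stage 3: filtering -- drop likely false negatives (score too high)
    # and easy negatives (score too low); keep the score for distillation.
    return [(c, float(s)) for c, s in zip(candidates, ce_scores)
            if T_LOW <= s <= T_HIGH]
```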