---
language:
- ara
- dan
- deu
- eng
- fas
- fra
- hin
- ind
- ita
- jpn
- kor
- nld
- pol
- por
- rus
- spa
- swe
- tur
- vie
- zho
multilingual: true
tags:
- dense-retrieval
- hard-negatives
- knowledge-distillation
- webfaq
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
---

# WebFAQ 2.0: Multilingual Hard Negatives

This dataset contains **mined hard negatives** derived from the **WebFAQ 2.0** corpus. It covers roughly **1.3 million** samples across **20 languages**.

The dataset is designed to support robust training of dense retrieval models, specifically enabling:

1. **Contrastive Learning:** Using strict hard negatives to improve discrimination.
2. **Knowledge Distillation:** Using the provided cross-encoder scores to train with soft labels (e.g., MarginMSE).

## Dataset Creation & Mining Process

To ensure high-quality training signals, we employed a **three-stage mining pipeline** that balances difficulty with correctness.

### 1. Lexical Retrieval (Recall)

For every query in WebFAQ, we first retrieved the **top-200 candidate answers** from the monolingual corpus using **BM25**.

* **Goal:** Identify candidates with high lexical overlap (shared keywords) that are likely to be "hard" for a dense retriever to distinguish.

### 2. Semantic Reranking (Precision)

We reranked the top-200 candidates using the state-of-the-art cross-encoder model **[BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)**.

* **Goal:** Assess the true semantic relevance of each candidate.

### 3. Filtering & Scoring

We applied a rigorous filtering strategy to curate the final dataset:

* **False Negative Removal:** Candidates with extremely high cross-encoder scores (semantic matches) were discarded to prevent "poisoning" the training data with valid answers labeled as negatives.
* **Easy Negative Removal:** Candidates with very low scores were discarded to ensure training efficiency.
* **Score Retention:** We retained the BGE-M3 relevance scores for every negative, enabling knowledge distillation workflows.

## Dataset Structure

Each sample in the dataset contains the following fields:

| Field | Description |
| :--- | :--- |
| `query` | The user question. |
| `positive` | The ground-truth correct answer. |
| `negative` | The mined hard negative (non-relevant but similar). |
| `score` | The **BGE-M3 cross-encoder score** for the `(query, negative)` pair. |

### Code & Reproduction

The code used for mining, filtering, and training is available in the official repository:

* **GitHub Repository:** [Link to your GitHub Repo]
* **WebFAQ Project:** [OpenWebSearch.EU](https://openwebsearch.eu)

Illustrative loading, distillation, and mining sketches are also provided under *Example Usage* below.

## Citation

If you use this dataset, please cite the WebFAQ 2.0 paper:

```bibtex
@inproceedings{dinzinger2025webfaq,
  title={WebFAQ: A Multilingual Collection of Natural QA Datasets for Dense Retrieval},
  author={Dinzinger, Michael and Caspari, Laura and Dastidar, Kanishka Ghosh and Mitrović, Jelena and Granitzer, Michael},
  booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2025}
}
```
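
## Example Usage

The snippet below sketches how the fields described in the table above can be loaded and inspected with the `datasets` library. The repository id is a placeholder, as this card does not pin it down; substitute the actual Hub id of this dataset (a per-language configuration name may also be required).

```python
from datasets import load_dataset

# NOTE: "<org>/webfaq-hard-negatives" is a placeholder repository id;
# replace it with the actual Hub id of this dataset.
ds = load_dataset("<org>/webfaq-hard-negatives", split="train")

sample = ds[0]
print(sample["query"])     # the user question
print(sample["positive"])  # the ground-truth correct answer
print(sample["negative"])  # the mined hard negative
print(sample["score"])     # BGE-M3 cross-encoder score for (query, negative)
```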
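
### Knowledge Distillation with MarginMSE

A minimal sketch of how the retained `score` field can drive MarginMSE training with `sentence-transformers`. Several things here are assumptions, flagged in the comments: the student checkpoint is arbitrary, the teacher shown is a stand-in reranker (the card's actual teacher is BGE-M3), and because the dataset ships only the `(query, negative)` score, the `(query, positive)` side is scored on the fly to form the margin.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, InputExample, losses
from torch.utils.data import DataLoader

# Student bi-encoder; the checkpoint name is illustrative.
student = SentenceTransformer("distilbert-base-multilingual-cased")

# Teacher for scoring the positive side. This reranker is a stand-in;
# the scores shipped with the dataset come from BGE-M3.
teacher = CrossEncoder("BAAI/bge-reranker-v2-m3")

# MarginMSE learns the teacher margin CE(q, pos) - CE(q, neg).
examples = []
for row in ds:  # `ds` as loaded in the snippet above
    pos_score = teacher.predict([(row["query"], row["positive"])])[0]
    margin = float(pos_score - row["score"])
    examples.append(
        InputExample(texts=[row["query"], row["positive"], row["negative"]],
                     label=margin)
    )

loader = DataLoader(examples, shuffle=True, batch_size=32)
student.fit(train_objectives=[(loader, losses.MarginMSELoss(student))],
            epochs=1, warmup_steps=1000)
```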
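
### Reproducing the Mining Pipeline

For orientation, the three mining stages can be approximated as follows. This is a sketch under stated assumptions, not the authors' code (see the GitHub repository for that): `rank_bm25` with a whitespace tokenizer stands in for the production BM25 index, the reranker checkpoint again stands in for BGE-M3, and the thresholds `T_LOW`/`T_HIGH` are invented for illustration.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = ["answer one ...", "answer two ..."]      # monolingual answer corpus
bm25 = BM25Okapi([doc.split() for doc in corpus])  # toy whitespace tokenizer

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3") # stand-in for BGE-M3
T_LOW, T_HIGH = 0.1, 0.9                           # illustrative thresholds

def mine_hard_negatives(query: str, positive: str, k: int = 200):
    # Stage 1: lexical recall -- top-k candidates by BM25 score.
    scores = bm25.get_scores(query.split())
    top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]
    candidates = [corpus[i] for i in top_k if corpus[i] != positive]

    # Stage 2: semantic precision -- cross-encoder relevance for each pair.
    ce_scores = reranker.predict([(query, c) for c in candidates])

    # Stage 3: filtering -- drop likely false negatives (score too high)
    # and easy negatives (score too low); keep the score for distillation.
    return [(c, float(s)) for c, s in zip(candidates, ce_scores)
            if T_LOW <= s <= T_HIGH]
```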