# WebFAQ 2.0: Multilingual Hard Negatives
This dataset contains hard negatives mined from the WebFAQ 2.0 corpus, covering roughly 1.3 million samples across 20 languages.
The dataset is designed to support robust training of dense retrieval models, specifically enabling:
- Contrastive Learning: Using strict hard negatives to improve discrimination.
- Knowledge Distillation: Using the provided cross-encoder scores to train with soft labels (e.g., MarginMSE).
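As a plain-Python sketch (not the authors' training code), MarginMSE distillation penalizes the squared difference between the student's score margin (positive minus negative) and the teacher's margin. Note that the dataset ships only the teacher score for the negative, so the teacher's positive score (`teacher_pos` below) is assumed to be computed separately:

```python
def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """MarginMSE: mean squared error between the student's margin and the
    teacher's (cross-encoder) margin over a batch of (query, pos, neg) triplets.

    Each argument is a list of scalar relevance scores, one per triplet.
    """
    n = len(student_pos)
    total = 0.0
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        student_margin = sp - sn  # student score(q, pos) - score(q, neg)
        teacher_margin = tp - tn  # teacher margin acts as the soft label
        total += (student_margin - teacher_margin) ** 2
    return total / n

# One triplet: the student separates pos/neg more strongly than the teacher.
loss = margin_mse_loss([0.9], [0.1], [0.8], [0.3])  # margins 0.8 vs 0.5
```

In practice a library loss (e.g., Sentence-Transformers' `MarginMSELoss`) would be used; the sketch only illustrates the objective.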
## Dataset Creation & Mining Process
To ensure high-quality training signals, we employed a three-stage pipeline that balances difficulty with correctness.
### 1. Lexical Retrieval (Recall)
For every query in WebFAQ, we first retrieved the top-200 candidate answers from the monolingual corpus using BM25.
- Goal: Identify candidates with high lexical overlap (shared keywords) that are likely to be "hard" for a dense retriever to distinguish.
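The actual pipeline presumably relied on an off-the-shelf BM25 implementation; purely for illustration, the Okapi BM25 scoring it performs can be sketched in plain Python as:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for d in docs_tokens:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Candidate selection: indices of the top-200 documents by BM25 score.
def top_k(scores, k=200):
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
```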
### 2. Semantic Reranking (Precision)
We reranked the top-200 candidates using the state-of-the-art cross-encoder model: BAAI/bge-m3.
- Goal: Assess the true semantic relevance of each candidate.
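Schematically, reranking just re-orders the lexical candidates by a semantic relevance score. The sketch below uses a toy word-overlap scorer as a stand-in for the real cross-encoder (the `score_fn` signature is an assumption, not the project's API):

```python
def rerank(query, candidates, score_fn):
    """Re-order lexical candidates by semantic relevance, highest first.
    `score_fn(query, candidate) -> float` stands in for the cross-encoder."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)

# Toy stand-in scorer: fraction of query words that appear in the candidate.
def overlap(query, candidate):
    qw = set(query.lower().split())
    return len(qw & set(candidate.lower().split())) / len(qw)
```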
### 3. Filtering & Scoring
We applied a rigorous filtering strategy to curate the final dataset:
- False Negative Removal: Candidates with extremely high cross-encoder scores (semantic matches) were discarded to prevent "poisoning" the training data with valid answers labeled as negatives.
- Easy Negative Removal: Candidates with very low scores were discarded to ensure training efficiency.
- Score Retention: We retained the BGE-M3 relevance scores for every negative, enabling knowledge distillation workflows.
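The three filtering rules above amount to keeping only negatives whose cross-encoder score falls inside a "hard" band, and carrying the score along with each survivor. A minimal sketch (the threshold values here are illustrative placeholders, not the dataset's actual cutoffs):

```python
def filter_negatives(scored_negatives, low=0.1, high=0.8):
    """Keep only negatives in the 'hard' band between two score thresholds.

    - score >= high: likely a false negative (a valid answer) -> drop
    - score <  low : too easy to teach the model anything     -> drop
    The retained score travels with each negative for distillation.
    `low`/`high` are illustrative, not the published values.
    """
    return [(neg, s) for neg, s in scored_negatives if low <= s < high]

candidates = [
    ("a valid paraphrased answer", 0.95),  # false negative -> removed
    ("a topically close distractor", 0.55),  # hard negative -> kept
    ("unrelated boilerplate text", 0.02),  # easy negative -> removed
]
```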
## Dataset Structure
Each sample in the dataset contains the following fields:
| Field | Description |
|---|---|
| `query` | The user question. |
| `positive` | The ground-truth correct answer. |
| `negative` | The mined hard negative (non-relevant but similar). |
| `score` | The BGE-M3 cross-encoder score for the (query, negative) pair. |
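To make the schema concrete, here is a hypothetical row (toy values, shaped per the fields above) unpacked into the (anchor, positive, negative) form that contrastive trainers typically expect, with the teacher score riding along for MarginMSE:

```python
# A hypothetical row matching the dataset schema (values are invented).
sample = {
    "query": "How do I reset my password?",
    "positive": "Click 'Forgot password' on the login page and follow the email link.",
    "negative": "You can change your username in the account settings.",
    "score": 0.12,
}

def to_triplet(row):
    """Unpack one row into ((anchor, positive, negative), teacher_score)."""
    return (row["query"], row["positive"], row["negative"]), row["score"]
```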
## Code & Reproduction
The code used for mining, filtering, and training is available in the official repository:
- GitHub Repository: [Link to your GitHub Repo]
- WebFAQ Project: OpenWebSearch.EU
## Citation
If you use this dataset, please cite the WebFAQ 2.0 paper:
```bibtex
@inproceedings{dinzinger2025webfaq,
  title={WebFAQ: A Multilingual Collection of Natural QA Datasets for Dense Retrieval},
  author={Dinzinger, Michael and Caspari, Laura and Dastidar, Kanishka Ghosh and Mitrović, Jelena and Granitzer, Michael},
  booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2025}
}
```