---
language:
- ara
- dan
- deu
- eng
- fas
- fra
- hin
- ind
- ita
- jpn
- kor
- nld
- pol
- por
- rus
- spa
- swe
- tur
- vie
- zho
multilingual: true
tags:
- dense-retrieval
- hard-negatives
- knowledge-distillation
- webfaq
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
---
# WebFAQ 2.0: Multilingual Hard Negatives

This dataset contains **mined hard negatives** derived from the **WebFAQ 2.0** corpus. It covers roughly **1.3 million** samples across **20 languages**.

The dataset is designed to support robust training of dense retrieval models, specifically enabling:

1. **Contrastive Learning:** Using strict hard negatives to improve discrimination.
2. **Knowledge Distillation:** Using the provided cross-encoder scores to train with soft labels (e.g., MarginMSE; see the sketch after this list).

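As a rough illustration of the distillation use case, the following sketch trains a bi-encoder with `sentence-transformers` and MarginMSE. Note the assumptions: the dataset id is a placeholder, and because this card only ships a score for the `(query, negative)` pair, the `(query, positive)` score is approximated by a constant; in practice you would score the positive pairs with the same cross-encoder.

```python
# Hedged MarginMSE distillation sketch with sentence-transformers.
# Assumptions: "<this-dataset-id>" is a placeholder, and POS_SCORE is a
# stand-in for the (query, positive) cross-encoder score, which this
# dataset does not ship.
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

ds = load_dataset("<this-dataset-id>", split="train")  # placeholder id

POS_SCORE = 1.0  # assumption: constant proxy for CE(query, positive)

train_examples = [
    InputExample(
        texts=[row["query"], row["positive"], row["negative"]],
        # MarginMSE regresses the margin CE(q, pos) - CE(q, neg).
        label=POS_SCORE - row["score"],
    )
    for row in ds
]

model = SentenceTransformer("distilbert-base-multilingual-cased")
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MarginMSELoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)
```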
## Dataset Creation & Mining Process

To ensure high-quality training signals, we employed a **two-stage mining pipeline** that balances difficulty with correctness.

### 1. Lexical Retrieval (Recall)

For every query in WebFAQ, we first retrieved the **top-200 candidate answers** from the monolingual corpus using **BM25** (see the sketch below).

* **Goal:** Identify candidates with high lexical overlap (shared keywords) that are likely to be "hard" for a dense retriever to distinguish.

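A minimal sketch of this first stage using the `rank_bm25` package, assuming simple whitespace tokenization; the actual pipeline presumably uses language-appropriate tokenization for each of the 20 corpora.

```python
# First-stage candidate retrieval sketch with rank_bm25 (BM25Okapi).
# Assumption: whitespace tokenization; a real multilingual pipeline
# would use language-specific tokenizers.
from rank_bm25 import BM25Okapi

answers = [
    "You can reset your password from the account settings page.",
    "Shipping usually takes 3-5 business days.",
    "Contact support if your password reset email never arrives.",
]
tokenized_answers = [a.lower().split() for a in answers]
bm25 = BM25Okapi(tokenized_answers)

query = "how do I reset my password"
# Retrieve the top-k candidates by lexical overlap (top-200 in this dataset's setup).
candidates = bm25.get_top_n(query.lower().split(), answers, n=2)
print(candidates)
```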
### 2. Semantic Reranking (Precision)

We reranked the top-200 candidates using the state-of-the-art cross-encoder model **[BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)**.

* **Goal:** Assess the true semantic relevance of each candidate (see the sketch below).

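A hedged sketch of candidate scoring with BGE-M3 via the `FlagEmbedding` package; the mode weights below are illustrative, not necessarily the configuration used to build this dataset.

```python
# Candidate rescoring sketch using BGE-M3 through FlagEmbedding.
# Assumption: the dense/sparse/multi-vector weights are illustrative
# defaults, not the exact setup behind this dataset.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

query = "how do I reset my password"
candidates = [
    "You can reset your password from the account settings page.",
    "Shipping usually takes 3-5 business days.",
]

pairs = [[query, c] for c in candidates]
scores = model.compute_score(
    pairs,
    weights_for_different_modes=[0.4, 0.2, 0.4],  # dense, sparse, multi-vector
)
# The fused signal combines all three scoring modes.
print(scores["colbert+sparse+dense"])
```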
### 3. Filtering & Scoring

We applied a rigorous filtering strategy to curate the final dataset (a score-band sketch follows this list):

* **False Negative Removal:** Candidates with extremely high cross-encoder scores (likely true semantic matches) were discarded to prevent "poisoning" the training data with valid answers labeled as negatives.
* **Easy Negative Removal:** Candidates with very low scores were discarded to ensure training efficiency.
* **Score Retention:** We retained the BGE-M3 relevance score for every negative, enabling knowledge distillation workflows.

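The effect of this filter can be pictured as keeping only a band of scores. The thresholds below are hypothetical placeholders; the exact cut-offs are not specified in this card.

```python
# Score-band filtering sketch. UPPER and LOWER are hypothetical
# placeholders, not the actual cut-offs used for this dataset.
UPPER = 0.9  # above: likely a false negative (a valid answer) -> drop
LOWER = 0.3  # below: an easy negative with little training signal -> drop

def keep_negative(score: float) -> bool:
    """Keep candidates that are hard but (probably) not false negatives."""
    return LOWER <= score <= UPPER

candidates = [
    {"text": "valid paraphrase of the answer", "score": 0.95},   # dropped
    {"text": "on-topic but non-answering text", "score": 0.55},  # kept
    {"text": "unrelated text", "score": 0.10},                   # dropped
]
hard_negatives = [c for c in candidates if keep_negative(c["score"])]
print(hard_negatives)
```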
## Dataset Structure

Each sample in the dataset contains the following fields:

| Field | Description |
| :--- | :--- |
| `query` | The user question. |
| `positive` | The ground-truth correct answer. |
| `negative` | The mined hard negative (non-relevant but similar). |
| `score` | The **BGE-M3 cross-encoder score** for the `(query, negative)` pair. |

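Loading the data with the Hugging Face `datasets` library might look as follows; the dataset id is a placeholder, and the split or configuration names (e.g., per-language configs) may differ.

```python
# Loading sketch with the Hugging Face datasets library.
# Assumption: "<this-dataset-id>" is a placeholder for this repository;
# split/config names may differ in practice.
from datasets import load_dataset

ds = load_dataset("<this-dataset-id>", split="train")
row = ds[0]
print(row["query"], row["positive"], row["negative"], row["score"])
```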
### Code & Reproduction

The code used for mining, filtering, and training is available in the official repository:

* **GitHub Repository:** [Link to your GitHub Repo]
* **WebFAQ Project:** [OpenWebSearch.EU](https://openwebsearch.eu)
## Citation

If you use this dataset, please cite the WebFAQ 2.0 paper:

```bibtex
@inproceedings{dinzinger2025webfaq,
  title={WebFAQ: A Multilingual Collection of Natural QA Datasets for Dense Retrieval},
  author={Dinzinger, Michael and Caspari, Laura and Dastidar, Kanishka Ghosh and Mitrović, Jelena and Granitzer, Michael},
  booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2025}
}
```