	---
	language:
	- he
	tags:
	- hebrew
	- semantic-retrieval
	- information-retrieval
	- dense-retrieval
	- reranking
	- rrf
	- sentence-transformers
	- competition
	pipeline_tag: sentence-similarity
	license: other
	---

	# Hebrew Semantic Retrieval — 2nd Place Solution

	Competition: Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the Israel National NLP Program

	Result: 🥈 2nd place — nDCG@20 = 0.656792 (private test set) · 0.460408 (public test set)

	Author: itk77

	---

	## Overview

	This repository contains the complete inference code and fine-tuned models for the 2nd-place solution to the Hebrew Semantic Retrieval Challenge. The challenge tasked participants with ranking Hebrew paragraphs from a 127,731-passage corpus in response to natural-language Hebrew queries, evaluated by NDCG@20.

	Hebrew is a morphologically rich Semitic language written in an almost consonant-only script, creating significant lexical ambiguity and making retrieval substantially harder than for high-resource languages. The solution addresses this with a carefully engineered three-stage pipeline: sparse + dual-dense retrieval fused via Weighted Reciprocal Rank Fusion (WRRF), followed by a BGE cross-encoder reranker fine-tuned specifically on the challenge corpus, and a final conditional score blending step.

	---

	## The Challenge

	| Property | Detail |
	|---|---|
	| Organizer | MAFAT DDR&D + Israel National NLP Program |
	| Corpus size | 127,731 Hebrew paragraphs |
	| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
	| Evaluation metric | NDCG@20 |
	| Phase I | Public leaderboard (Codabench) |
	| Phase II | Private test set with additional human annotation of previously unseen retrievals |
	| Relevance scale | 0–4 (human annotated) |

	---

	## Solution Architecture

	The solution is a three-stage pipeline: sparse + dual-dense retrieval fused with Weighted RRF, cross-encoder reranking, and conditional score blending.

	```
	Query
	  │
	  ├─► [BM25 (k1=1.3, b=0.7, w=1.0)]   ──┐
	  ├─► [E5-large fine-tuned (w=1.2)]     ├─► WRRF Fusion (k=35)
	  └─► [multilingual-E5-large (w=1.4)] ──┘
	                                                    │
	                                                    ▼
	                                            Top-190 Candidates
	                                                    │
	                                                    ▼
	                                      [BGE Cross-Encoder Reranker] (fine-tuned, max_len=640)
	                                                    │
	                                                    ▼
	                                       Conditional Score Blending
	                                                    │
	                                                    ▼
	                                          Final Top-20 Results
	```

	### Stage 1 — Weighted Reciprocal Rank Fusion (WRRF)

	Three independent rankers each produce a ranked list of up to 190 candidates. Their lists are fused using Weighted Reciprocal Rank Fusion:

	$$\text{WRRF}(d) = \frac{w_\text{BM25}}{k + r_\text{BM25}(d) + 1} + \frac{w_\text{E5-ft}}{k + r_\text{E5-ft}(d) + 1} + \frac{w_\text{E5-base}}{k + r_\text{E5-base}(d) + 1}$$

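	with $k = 35$ (the RRF smoothing constant), where $r_x(d)$ is the rank of document $d$ in ranker $x$'s list. The $+1$ in each denominator implies that ranks are counted from 0.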
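The fusion formula can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from `model.py`: the function name `wrrf_fuse`, the dictionary layout, and the choice to give a document no contribution from a ranker that did not return it are all assumptions made for the example.

```python
from collections import defaultdict


def wrrf_fuse(rankings, weights, k=35, top_n=190):
    """Fuse ranked lists of doc ids with Weighted Reciprocal Rank Fusion.

    rankings: dict mapping ranker name -> list of doc ids, best first
    weights:  dict mapping ranker name -> fusion weight
    Returns a list of (doc_id, fused_score), best first, cut to top_n.
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights[name]
        # enumerate() yields 0-based ranks, matching the "+ 1" in the formula
        for rank, doc_id in enumerate(ranked_ids):
            scores[doc_id] += w / (k + rank + 1)
    # Assumption: a document missing from a list simply gets nothing from it.
    # Ties are broken by doc id so the output is deterministic.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:top_n]


# Toy example with the weights from this README
rankings = {
    "bm25": ["d1", "d2", "d3"],
    "e5_ft": ["d2", "d1", "d4"],
    "e5_base": ["d2", "d4", "d1"],
}
weights = {"bm25": 1.0, "e5_ft": 1.2, "e5_base": 1.4}
fused = wrrf_fuse(rankings, weights)
print([doc_id for doc_id, _ in fused])
```

In the toy example `d2` ends up first because the two dense rankers, which carry the larger weights, both place it at rank 0.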

	| Ranker | Model | Weight | Max Length | Notes |
	|---|---|---|---|---|
	| BM25 | Custom Hebrew BM25 (bm25s backend) | 1.0 | n/a | Strip nikkud, NFKC norm, prefix stripping |
	| E5 (fine-tuned) | `e5-large-ft_v6` | 1.2 | 512 tokens | Mean pooling + L2 norm, `query:` / `passage:` prefixes |
	| E5 (base) | `multilingual-e5-large` | 1.4 | 512 tokens | Via SentenceTransformers, BF16; labeled `GemmaEmbedder` in code but loads E5 |
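The "mean pooling + L2 norm" step listed for the fine-tuned E5 ranker can be illustrated with NumPy. The sketch below assumes the encoder has already produced token embeddings and an attention mask; `mean_pool_l2` and `add_prefix` are hypothetical helper names, not functions from this repository.

```python
import numpy as np


def add_prefix(texts, kind):
    """Prepend the E5 role prefix; kind is 'query' or 'passage'."""
    return [f"{kind}: {t}" for t in texts]


def mean_pool_l2(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions, then L2-normalize.

    token_embeddings: float array of shape (batch, seq_len, dim)
    attention_mask:   0/1 array of shape (batch, seq_len)
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)


# One sequence with one real token and one padding token
tokens = np.array([[[3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 0]])
print(mean_pool_l2(tokens, mask))
print(add_prefix(["example text"], "query"))
```

Because the vectors are unit length after normalization, the dot product between a query vector and a passage vector equals their cosine similarity.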

	Hebrew-specific tokenization (BM25): Unicode NFKC normalization, nikkud stripping (`\u0591–\u05C7`), Hebrew prefix removal (`ו`,`ה`,`ב`,`ל`,`כ`,`מ`,`ש`) with both the stripped and original form indexed, and a custom Hebrew stopword list.
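The normalization steps above could look roughly like the following. This is a sketch under stated assumptions rather than the contents of `text_utils.py`: the three-word stopword set is a placeholder, only one leading prefix letter is stripped, and the minimum stem length of 2 is invented for the example.

```python
import re
import unicodedata

NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")           # range quoted in this README
TOKEN_RE = re.compile(r"[\u05D0-\u05EA]+|[a-z0-9]+")  # Hebrew letters or ASCII alnum
PREFIXES = frozenset("והבלכמש")                        # single-letter Hebrew prefixes
STOPWORDS = {"של", "על", "את"}                         # hypothetical placeholder list


def tokenize_he(text, min_stem_len=2):
    """Normalize Hebrew text and return index tokens.

    A token that starts with a prefix letter is emitted twice:
    once in its original form and once with the prefix removed.
    """
    text = unicodedata.normalize("NFKC", text)
    text = NIKKUD_RE.sub("", text)
    tokens = []
    for tok in TOKEN_RE.findall(text.lower()):
        if tok in STOPWORDS:
            continue
        tokens.append(tok)
        # Assumption: strip at most one prefix letter and keep stems of
        # at least min_stem_len letters.
        if tok[0] in PREFIXES and len(tok) - 1 >= min_stem_len:
            tokens.append(tok[1:])
    return tokens


print(tokenize_he("הבית של דני"))
```

Indexing both forms trades a larger vocabulary for recall: a query for the bare noun can match a passage that only contains the prefixed form, at the cost of occasional false stems when the first letter is part of the root.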

	### Stage 2 — BGE Cross-Encoder Reranking

	The top-190 WRRF candidates are reranked by `bge-reranker-hsrc-pairwise-rrf-V1.4`, a BGE cross-encoder fine-tuned on the challenge corpus using pairwise training with RRF-mined triples. Pairs are scored with a max sequence length of 640 tokens.

	### Stage 3 — Conditional Score Blending

	The final score adds a non-linear, conditional WRRF boost that grows as the normalized reranker score drops:

	$$\text{score}_\text{final} = \hat{s}_\text{BGE} + (1 - w_\text{BGE}) \cdot \hat{s}_\text{WRRF} \cdot (1 - \hat{s}_\text{BGE})$$

	where $w_\text{BGE} = 0.07$ and both scores are min-max normalized to $[0, 1]$ over the candidate pool. When the reranker's normalized score is high ($\hat{s}_\text{BGE} \approx 1$), the WRRF boost vanishes; when it is low ($\hat{s}_\text{BGE} \approx 0$), the final score is approximately $0.93 \cdot \hat{s}_\text{WRRF}$, so the WRRF signal takes over.
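A small worked example makes the blending behavior concrete. The helper names `min_max` and `blend` are hypothetical, and mapping a pool with identical scores to all zeros is an assumption; the formula itself is the one given above.

```python
def min_max(values):
    """Scale values to [0, 1] over the candidate pool."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a degenerate pool (all scores equal) maps to zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def blend(bge_scores, wrrf_scores, w_bge=0.07):
    """final = s_bge + (1 - w_bge) * s_wrrf * (1 - s_bge), on normalized scores."""
    s_bge = min_max(bge_scores)
    s_wrrf = min_max(wrrf_scores)
    return [b + (1.0 - w_bge) * r * (1.0 - b) for b, r in zip(s_bge, s_wrrf)]


# Three candidates: raw reranker logits and raw WRRF scores
final = blend([2.0, 0.0, -2.0], [0.02, 0.10, 0.06])
print([round(s, 3) for s in final])  # [1.0, 0.965, 0.465]
```

The top reranker candidate keeps a score of 1.0 regardless of its WRRF score, while the middle candidate, which WRRF ranks first, is lifted from 0.5 to 0.965. The boost can never push a candidate above 1.0, since $\hat{s}_\text{BGE} + 0.93 \cdot (1 - \hat{s}_\text{BGE}) \le 1$.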

	---

	## Included Models (fine-tuned)

	| Path in repo | Base model | Fine-tuning |
	|---|---|---|
	| `models/e5-large-ft_v6/` | `intfloat/multilingual-e5-large` | Fine-tuned on the challenge corpus (v6 checkpoint) |
	| `models/bge-reranker-hsrc-pairwise-rrf-V1.4/` | `BAAI/bge-reranker-v2-m3` | Fine-tuned on RRF-mined pairwise triples from the challenge corpus |
	| `models/multilingual-e5-large/` | `intfloat/multilingual-e5-large` | Off-the-shelf (no fine-tuning) |

	---

	## Repository Structure

	```
	model.py                                  ← Full inference pipeline (preprocess + predict)
	bm25_backends.py                          ← Pluggable BM25 backends (bm25s / pure-Python fallback)
	text_utils.py                             ← Hebrew normalization & tokenization utilities
	models/
	  e5-large-ft_v6/                         ← Fine-tuned E5 embedder ✨
	  bge-reranker-hsrc-pairwise-rrf-V1.4/    ← Fine-tuned BGE reranker ✨
	  multilingual-e5-large/                  ← Off-the-shelf secondary embedder
	```

	---

	## Usage

	The pipeline exposes two functions matching the competition API:

	```python
	from model import preprocess, predict

	# Build corpus index (run once)
	# corpus_dict: {doc_id: {"passage": "..."}, ...}
	preprocessed = preprocess(corpus_dict)

	# Query at inference time
	results = predict({"query": "מה הזכויות של שוכרי דירה?"}, preprocessed)
	# Returns: [{"paragraph_uuid": "...", "score": 0.87}, ...] (top-20)
	```

	Requirements:
	```
	torch
	transformers
	sentence-transformers
	bm25s
	scikit-learn
	numpy
	```

	A CUDA-capable GPU is strongly recommended (two large encoder models + one cross-encoder are loaded simultaneously, all in BF16/FP16).

	---

	## Training Pipeline

	The full training pipeline is located in `repro/documentation/complete_pipeline/` and orchestrated by `pipeline.py`. It automates four sequential stages:

	| Stage | Script | Description |
	|---|---|---|
	| 1 | `finetune_e5_large.py` | Fine-tunes E5 on the challenge corpus (12 runs, 2 epochs, lr=2e-6, batch=4) |
	| 2 | `stage1_weight_sweep.py` | Offline grid sweep of WRRF weights (BM25, E5, Gemma; "Gemma" is the code's label for the off-the-shelf `multilingual-e5-large` embedder) |
	| 3 | `train_bge_ce_pairwise_rrf.py` | Trains the BGE cross-encoder reranker (lr=2e-5, max_len=640, batch=4×accum=8) |
	| 4 | `sweep_final2_from_components.py` | Offline sweep for the final blending weight |
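An offline weight sweep of the kind described in stage 2 can be sketched as an exhaustive search over weight triples. Everything here is illustrative: `frange`, `sweep_weights`, the grid bounds, and the toy `evaluate` callback are assumptions, and the real `stage1_weight_sweep.py` may work differently. In practice `evaluate` would fuse cached rankings with the given weights and return the mean NDCG@20 on the holdout queries.

```python
from itertools import product


def frange(start, stop, step):
    """Inclusive float range, rounded to avoid accumulated float error."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


def sweep_weights(grid_bm25, grid_e5, grid_gm, evaluate):
    """Try every weight triple and return (best_score, best_weights)."""
    best = None
    for w in product(grid_bm25, grid_e5, grid_gm):
        score = evaluate(*w)
        if best is None or score > best[0]:
            best = (score, w)
    return best


# Toy objective that peaks at the weights reported in this README
def evaluate(w_bm25, w_e5, w_gm):
    return -((w_bm25 - 1.0) ** 2 + (w_e5 - 1.2) ** 2 + (w_gm - 1.4) ** 2)


grid = frange(1.0, 2.0, 0.1)  # illustrative grid: 11 values per weight
best_score, best_weights = sweep_weights(grid, grid, grid, evaluate)
print(best_weights)
```

With 11 values per weight the sweep covers 1,331 combinations, which stays cheap as long as the per-ranker candidate lists are computed once and only the fusion is repeated.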

	### Reranker Training Modes

	The pipeline supports two parallel reranker training paths:

	- Deterministic mode (`--rr_det_runs`): trains from pinned triples (`repro/documentation/triples/triples.jsonl`), enabling reproducible results.
	- Non-deterministic mining mode (`--rr_nd_runs`): the first run mines fresh triples from the best E5 checkpoint; subsequent runs reuse them. Roughly 1 in 7 of these runs matches the quality of the submitted model.

	### Example Full Run Command

	```bash
	python3 repro/documentation/complete_pipeline/pipeline.py \
	  --e5_runs 12 --e5_seed0 45 --e5_seed_stride 0 \
	  --e5_epochs 2 --e5_batch 4 --e5_lr 2e-6 \
	  --stage1_w_bm25 1.0,2.0,0.1 \
	  --stage1_w_e5 1.0,2.0,0.1 \
	  --stage1_w_gm 1.0,2.0,0.1 \
	  --rr_det_runs 1 --rr_det_seed0 42 --rr_det_seed_stride 0 \
	  --rr_det_triples_in repro/documentation/triples/triples.jsonl \
	  --rr_nd_runs 15 --rr_nd_seed0 42 --rr_nd_seed_stride 0 \
	  --rr_bsz 4 --rr_accum 8 --rr_lr 2e-5 --rr_max_len 640 \
	  --rr_sweep_rounds 2000
	```

	Hardware: Original model trained on RTX 3080 Ti; reproducibility runs executed on L40S (~24 hours for the full pipeline with 12 E5 + 15 reranker runs).

	---

	## Evaluation Protocol

	- Holdout set: First 100 queries of the provided training file (fixed split, never changed during development).
	- Local evaluation script: `scripts/eval_std_final.py` — runs silently when `EVAL_STD_MODE=1`.
	- Score discrepancy: 7 of the 100 holdout queries have no labels > 0 (empty relevance). The local script does not skip these by default, which yields a local nDCG of ~0.615 that is not directly comparable to the public leaderboard score. When empty-label queries are excluded, local scores align with the official leaderboard.
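The effect of empty-label queries on the averaged metric can be reproduced with a small NDCG@k sketch. This is not `scripts/eval_std_final.py`: the function names are hypothetical, the linear gain is an assumption (some setups use `2**rel - 1`), and counting an empty-label query as 0 is assumed to be what "not ignoring" means.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, labels, k=20):
    """labels maps doc id -> graded relevance (0-4).

    Returns None when the query has no label > 0, because the ideal DCG
    is zero and nDCG is undefined.
    """
    ideal = sorted(labels.values(), reverse=True)[:k]
    idcg = dcg(ideal)
    if idcg == 0:
        return None
    gains = [labels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    return dcg(gains) / idcg


def mean_ndcg(per_query, skip_empty):
    """Average per-query nDCG, either skipping undefined queries or scoring them 0."""
    if skip_empty:
        vals = [v for v in per_query if v is not None]
    else:
        vals = [0.0 if v is None else v for v in per_query]
    return sum(vals) / len(vals) if vals else 0.0


per_query = [
    ndcg_at_k(["a", "b"], {"a": 4, "b": 2}),  # perfect ranking
    ndcg_at_k(["x"], {"x": 0}),               # no positive labels
]
print(mean_ndcg(per_query, skip_empty=False), mean_ndcg(per_query, skip_empty=True))
```

Under that assumption, excluding 7 zero-scoring queries out of 100 rescales the mean by a factor of 100/93, or about 1.075.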

	---

	## Technical Notes

	- All models are loaded in BF16 (both E5 embedders; the off-the-shelf one is labeled "Gemma" in the code) or FP16 (BGE reranker) to reduce GPU memory usage.
	- Corpus embedding caching: corpus embeddings from both E5 embedders can be cached to disk (keyed by SHA-1 of document IDs + model path + corpus size) to skip re-encoding on repeated runs.
	- BM25 backend fallback chain: `bm25_backends.py` → direct `bm25s` → pure-Python deterministic BM25 (guaranteed to work without external dependencies).
	- Dominant source of non-determinism: GPU FP16/SDPA kernel behavior. Deterministic kernels are available but increase runtime ~3.6× and may exceed GPU memory limits.
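The embedding cache key mentioned above could be derived along these lines. The function name, the field order, and the NUL separators are assumptions for illustration; only the ingredients (document IDs, model path, corpus size, SHA-1) come from this README.

```python
import hashlib


def embedding_cache_key(doc_ids, model_path):
    """Return a hex SHA-1 digest that changes when the corpus or model changes."""
    h = hashlib.sha1()
    h.update(str(model_path).encode("utf-8"))
    h.update(b"\x00")
    h.update(str(len(doc_ids)).encode("utf-8"))  # corpus size
    for doc_id in doc_ids:
        # Separator keeps ["ab", "c"] distinct from ["a", "bc"]
        h.update(b"\x00")
        h.update(str(doc_id).encode("utf-8"))
    return h.hexdigest()


key = embedding_cache_key(["doc-1", "doc-2"], "models/e5-large-ft_v6")
print(key)
```

Because the ids are hashed in order, reordering the corpus produces a different key, which is the safe behavior when cached embeddings are stored as a positional matrix.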

	---

	## Results

	| Phase | NDCG@20 | Rank |
	|---|---|---|
	| Public (Phase I) | 0.460408 | 🥈 2nd |
	| Private (Phase II) | 0.656792 | 🥈 2nd |

	> The large gap between public and private scores is expected: the private phase incorporated additional human annotation of previously un-annotated retrieved documents, significantly impacting NDCG for systems that retrieved relevant but un-annotated paragraphs.

	---

	## Citation

	If you use this solution or the models in this repository, please acknowledge the Hebrew Semantic Retrieval Challenge by MAFAT DDR&D and the Israel National NLP Program, and credit itk77 as the solution author.

	---

	## Acknowledgements

	- MAFAT DDR&D and the Israel National NLP Program for organizing the challenge and providing the annotated Hebrew corpus.
	- The authors of `intfloat/multilingual-e5-large` and `BAAI/bge-reranker-v2-m3`.