Hebrew Semantic Retrieval β€” 2nd Place Solution

Competition: Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the Israel National NLP Program

Result: πŸ₯ˆ 2nd place β€” nDCG@20 = 0.656792 (private test set) Β· 0.460408 (public test set)

Author: itk77


Overview

This repository contains the complete inference code and fine-tuned models for the 2nd-place solution to the Hebrew Semantic Retrieval Challenge. The challenge tasked participants with ranking Hebrew paragraphs from a 127,731-passage corpus in response to natural-language Hebrew queries, evaluated by NDCG@20.

Hebrew is a morphologically rich Semitic language written in an almost consonant-only script, creating significant lexical ambiguity and making retrieval substantially harder than for high-resource languages. The solution addresses this with a carefully engineered three-stage pipeline: sparse + dual-dense retrieval fused via Weighted Reciprocal Rank Fusion (WRRF), followed by a BGE cross-encoder reranker fine-tuned specifically on the challenge corpus, and a final conditional score blending step.


The Challenge

| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | nDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0–4 (human annotated) |

Solution Architecture

The solution is a three-stage pipeline: sparse + dual-dense retrieval fused with Weighted RRF, cross-encoder reranking, and conditional score blending.

Query
  β”‚
  β”œβ”€β–Ί [BM25 (k1=1.3, b=0.7, w=1.0)]  ───┐
  β”œβ”€β–Ί [E5-large fine-tuned (w=1.2)]  ───┼─► WRRF Fusion (k=35)
  └─► [multilingual-E5-large (w=1.4)] ──┘
            β”‚
            β–Ό
     Top-190 Candidates
            β”‚
            β–Ό
     [BGE Cross-Encoder Reranker]  (fine-tuned, max_len=640)
            β”‚
            β–Ό
     Conditional Score Blending
            β”‚
            β–Ό
     Final Top-20 Results

Stage 1 β€” Weighted Reciprocal Rank Fusion (WRRF)

Three independent rankers each produce a ranked list of up to 190 candidates. Their lists are fused using Weighted Reciprocal Rank Fusion:

$$\text{WRRF}(d) = \frac{w_\text{BM25}}{k + r_\text{BM25}(d) + 1} + \frac{w_\text{E5-ft}}{k + r_\text{E5-ft}(d) + 1} + \frac{w_\text{E5-base}}{k + r_\text{E5-base}(d) + 1}$$

with $k = 35$ (RRF smoothing constant).
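The fusion step can be sketched in plain Python. This is a minimal illustration, not the repo's actual code: `wrrf_fuse` is a hypothetical name, and zero-based ranks are assumed (so `rank + 1` matches the `r(d) + 1` term in the formula):

```python
def wrrf_fuse(ranked_lists, weights, k=35):
    """ranked_lists: {ranker_name: [doc_id, ...] best-first}; weights: {ranker_name: w}."""
    scores = {}
    for name, docs in ranked_lists.items():
        w = weights[name]
        for rank, doc in enumerate(docs):  # rank is zero-based
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank + 1)
    # Return fused candidates sorted by descending WRRF score
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with the three rankers and the weights from the table below
fused = wrrf_fuse(
    {"bm25": ["a", "b", "c"], "e5_ft": ["b", "a", "d"], "e5_base": ["b", "c", "a"]},
    {"bm25": 1.0, "e5_ft": 1.2, "e5_base": 1.4},
)
```

A document ranked first by both E5 rankers ("b" above) outranks one ranked first only by BM25, reflecting the higher dense weights.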

| Ranker | Model | Weight | Max Length | Notes |
|---|---|---|---|---|
| BM25 | Custom Hebrew BM25 (`bm25s` backend) | 1.0 | β€” | Strip nikkud, NFKC norm, prefix stripping |
| E5 (fine-tuned) | `e5-large-ft_v6` | 1.2 | 512 tokens | Mean pooling + L2 norm, `query:` / `passage:` prefixes |
| E5 (base) | `multilingual-e5-large` | 1.4 | 512 tokens | Via SentenceTransformers, BF16; labeled `GemmaEmbedder` in code but loads E5 |

Hebrew-specific tokenization (BM25): Unicode NFKC normalization, nikkud stripping (\u0591–\u05C7), Hebrew prefix removal (Χ•,Χ”,Χ‘,ל,Χ›,מ,Χ©) with both the stripped and original form indexed, and a custom Hebrew stopword list.

Stage 2 β€” BGE Cross-Encoder Reranking

The top-190 WRRF candidates are reranked by bge-reranker-hsrc-pairwise-rrf-V1.4, a BGE cross-encoder fine-tuned on the challenge corpus using pairwise training with RRF-mined triples. Pairs are scored with a max sequence length of 640 tokens.

Stage 3 β€” Conditional Score Blending

The final score uses a non-linear conditional boost that amplifies the WRRF signal where the reranker is uncertain:

$$\text{score}_\text{final} = \hat{s}_\text{BGE} + (1 - w_\text{BGE}) \cdot \hat{s}_\text{WRRF} \cdot (1 - \hat{s}_\text{BGE})$$

where $w_\text{BGE} = 0.07$, and both scores are min-max normalized to $[0, 1]$ over the candidate pool. When the reranker assigns a high score ($\hat{s}_\text{BGE} \approx 1$), the WRRF boost vanishes; when it is uncertain ($\hat{s}_\text{BGE} \approx 0$), the WRRF signal takes over.
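The blending formula is a few lines of Python. This sketch uses hypothetical helper names (`minmax`, `blend`) to show the behavior described above:

```python
def minmax(xs):
    """Min-max normalize a list of scores to [0, 1] over the candidate pool."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def blend(bge_scores, wrrf_scores, w_bge=0.07):
    s_bge = minmax(bge_scores)
    s_wrrf = minmax(wrrf_scores)
    # High reranker confidence (s_bge ~ 1) suppresses the WRRF boost;
    # low confidence (s_bge ~ 0) lets the WRRF signal take over.
    return [b + (1 - w_bge) * w * (1 - b) for b, w in zip(s_bge, s_wrrf)]

final = blend([2.0, 0.0, 1.0], [10.0, 30.0, 20.0])
```

In the toy call, the candidate the reranker scores highest keeps its top score untouched, while the reranker's lowest-scored candidate inherits most of its WRRF score.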


Included Models (fine-tuned)

| Path in repo | Base model | Fine-tuning |
|---|---|---|
| `models/e5-large-ft_v6/` | `intfloat/multilingual-e5-large` | Fine-tuned on the challenge corpus (v6 checkpoint) |
| `models/bge-reranker-hsrc-pairwise-rrf-V1.4/` | `BAAI/bge-reranker-v2-m3` | Fine-tuned on RRF-mined pairwise triples from the challenge corpus |
| `models/multilingual-e5-large/` | `intfloat/multilingual-e5-large` | Off-the-shelf (no fine-tuning) |

Repository Structure

model.py              ← Full inference pipeline (preprocess + predict)
bm25_backends.py      ← Pluggable BM25 backends (bm25s / pure-Python fallback)
text_utils.py         ← Hebrew normalization & tokenization utilities
models/
  e5-large-ft_v6/                          ← Fine-tuned E5 embedder ✨
  bge-reranker-hsrc-pairwise-rrf-V1.4/     ← Fine-tuned BGE reranker ✨
  multilingual-e5-large/                   ← Off-the-shelf secondary embedder

Usage

The pipeline exposes two functions matching the competition API:

from model import preprocess, predict

# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)

# Query at inference time
results = predict({"query": "ΧžΧ” Χ”Χ–Χ›Χ•Χ™Χ•Χͺ של Χ©Χ•Χ›Χ¨Χ™ Χ“Χ™Χ¨Χ”?"}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 0.87}, ...]  (top-20)

Requirements:

torch
transformers
sentence-transformers
bm25s
scikit-learn
numpy

A CUDA-capable GPU is strongly recommended (two large encoder models + one cross-encoder are loaded simultaneously, all in BF16/FP16).


Training Pipeline

The full training pipeline is located in repro/documentation/complete_pipeline/ and orchestrated by pipeline.py. It automates four sequential stages:

| Stage | Script | Description |
|---|---|---|
| 1 | `finetune_e5_large.py` | Fine-tunes E5 on the challenge corpus (12 runs, 2 epochs, lr=2e-6, batch=4) |
| 2 | `stage1_weight_sweep.py` | Offline grid sweep of WRRF weights (BM25, E5, Gemma) |
| 3 | `train_bge_ce_pairwise_rrf.py` | Trains the BGE cross-encoder reranker (lr=2e-5, max_len=640, batch=4Γ—accum=8) |
| 4 | `sweep_final2_from_components.py` | Offline sweep for the final blending weight |

Reranker Training Modes

The pipeline supports two parallel reranker training paths:

  • Deterministic mode (--rr_det_runs): trains from pinned triples (repro/documentation/triples/triples.jsonl), enabling reproducible results.
  • Non-deterministic mining mode (--rr_nd_runs): the first run mines fresh triples from the best E5 checkpoint; subsequent runs reuse them. ~1 in 7 runs matches submitted model quality.

Example Full Run Command

python3 repro/documentation/complete_pipeline/pipeline.py \
  --e5_runs 12 --e5_seed0 45 --e5_seed_stride 0 \
  --e5_epochs 2 --e5_batch 4 --e5_lr 2e-6 \
  --stage1_w_bm25 1.0,2.0,0.1 \
  --stage1_w_e5 1.0,2.0,0.1 \
  --stage1_w_gm 1.0,2.0,0.1 \
  --rr_det_runs 1 --rr_det_seed0 42 --rr_det_seed_stride 0 \
  --rr_det_triples_in repro/documentation/triples/triples.jsonl \
  --rr_nd_runs 15 --rr_nd_seed0 42 --rr_nd_seed_stride 0 \
  --rr_bsz 4 --rr_accum 8 --rr_lr 2e-5 --rr_max_len 640 \
  --rr_sweep_rounds 2000

Hardware: Original model trained on RTX 3080 Ti; reproducibility runs executed on L40S (~24 hours for the full pipeline with 12 E5 + 15 reranker runs).


Evaluation Protocol

  • Holdout set: First 100 queries of the provided training file (fixed split, never changed during development).
  • Local evaluation script: scripts/eval_std_final.py β€” runs silently when EVAL_STD_MODE=1.
  • Score discrepancy: 7 of the 100 holdout queries have no labels > 0 (empty relevance). The local script does not ignore these by default, resulting in a local nDCG ~0.615 vs. the public leaderboard score. When empty-label queries are excluded, local scores align with the official leaderboard.

Technical Notes

  • All models are loaded in BF16 (E5, Gemma) or FP16 (BGE reranker) to reduce GPU memory usage.
  • Corpus embedding caching: E5 and Gemma corpus embeddings can be cached to disk (keyed by SHA-1 of document IDs + model path + corpus size) to skip re-encoding on repeated runs.
  • BM25 backend fallback chain: bm25_backends.py β†’ direct bm25s β†’ pure-Python deterministic BM25 (guaranteed to work without external dependencies).
  • Dominant source of non-determinism: GPU FP16/SDPA kernel behavior. Deterministic kernels are available but increase runtime ~3.6Γ— and may exceed GPU memory limits.

Results

| Phase | nDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | 0.460408 | πŸ₯ˆ 2nd |
| Private (Phase II) | 0.656792 | πŸ₯ˆ 2nd |

The large gap between public and private scores is expected: the private phase added human annotation for previously un-annotated retrieved documents, which substantially raised nDCG for systems that had retrieved relevant but unlabeled paragraphs.


Citation

If you use this solution or the models in this repository, please acknowledge the Hebrew Semantic Retrieval Challenge by MAFAT DDR&D and the Israel National NLP Program, and credit itk77 as the solution author.


Acknowledgements

  • MAFAT DDR&D and the Israel National NLP Program for organizing the challenge and providing the annotated Hebrew corpus.
  • The authors of intfloat/multilingual-e5-large and BAAI/bge-reranker-v2-m3.