Hebrew Semantic Retrieval β€” 2nd Place Solution

Competition: Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the Israel National NLP Program

Result: πŸ₯ˆ 2nd place β€” nDCG@20 = 0.656792 (private test set) Β· 0.460408 (public test set)

Author: itk77


Overview

This repository contains the complete inference code and fine-tuned models for the 2nd-place solution to the Hebrew Semantic Retrieval Challenge. The challenge tasked participants with ranking Hebrew paragraphs from a 127,731-passage corpus in response to natural-language Hebrew queries, evaluated by NDCG@20.

Hebrew is a morphologically rich Semitic language written in an almost consonant-only script, creating significant lexical ambiguity and making retrieval substantially harder than for high-resource languages. The solution addresses this with a carefully engineered three-stage pipeline: sparse + dual-dense retrieval fused via Weighted Reciprocal Rank Fusion (WRRF), followed by a BGE cross-encoder reranker fine-tuned specifically on the challenge corpus, and a final conditional score blending step.


The Challenge

| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | nDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0–4 (human annotated) |

Solution Architecture

The solution is a three-stage pipeline: sparse + dual-dense retrieval fused with Weighted RRF, cross-encoder reranking, and conditional score blending.

Query
  β”‚
  β”œβ”€β–Ί [BM25 (k1=1.3, b=0.7, w=1.0)]  ───┐
  β”œβ”€β–Ί [E5-large fine-tuned (w=1.2)]  ───┼─► WRRF Fusion (k=35)
  └─► [multilingual-E5-large (w=1.4)] ──┘
            β”‚
            β–Ό
     Top-190 Candidates
            β”‚
            β–Ό
     [BGE Cross-Encoder Reranker]  (fine-tuned, max_len=640)
            β”‚
            β–Ό
     Conditional Score Blending
            β”‚
            β–Ό
     Final Top-20 Results

Stage 1 β€” Weighted Reciprocal Rank Fusion (WRRF)

Three independent rankers each produce a ranked list of up to 190 candidates. Their lists are fused using Weighted Reciprocal Rank Fusion:

$$\text{WRRF}(d) = \frac{w_\text{BM25}}{k + r_\text{BM25}(d) + 1} + \frac{w_\text{E5-ft}}{k + r_\text{E5-ft}(d) + 1} + \frac{w_\text{E5-base}}{k + r_\text{E5-base}(d) + 1}$$

with $k = 35$ (RRF smoothing constant).
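The fusion step can be sketched in plain Python. This is a minimal illustration, not the repo's actual code: `wrrf_fuse` is a hypothetical name, and zero-based ranks are assumed (so `rank + 1` matches the `r(d) + 1` term in the formula):

```python
def wrrf_fuse(ranked_lists, weights, k=35):
    """ranked_lists: {ranker_name: [doc_id, ...] best-first}; weights: {ranker_name: w}."""
    scores = {}
    for name, docs in ranked_lists.items():
        w = weights[name]
        for rank, doc in enumerate(docs):  # rank is zero-based
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank + 1)
    # Return fused candidates sorted by descending WRRF score
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with the three rankers and the weights from the table below
fused = wrrf_fuse(
    {"bm25": ["a", "b", "c"], "e5_ft": ["b", "a", "d"], "e5_base": ["b", "c", "a"]},
    {"bm25": 1.0, "e5_ft": 1.2, "e5_base": 1.4},
)
```

A document ranked first by both E5 rankers ("b" above) outranks one ranked first only by BM25, reflecting the higher dense weights.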

| Ranker | Model | Weight | Max Length | Notes |
|---|---|---|---|---|
| BM25 | Custom Hebrew BM25 (`bm25s` backend) | 1.0 | β€” | Strip nikkud, NFKC norm, prefix stripping |
| E5 (fine-tuned) | `e5-large-ft_v6` | 1.2 | 512 tokens | Mean pooling + L2 norm, `query:` / `passage:` prefixes |
| E5 (base) | `multilingual-e5-large` | 1.4 | 512 tokens | Via SentenceTransformers, BF16; labeled `GemmaEmbedder` in code but loads E5 |

Hebrew-specific tokenization (BM25): Unicode NFKC normalization, nikkud stripping (\u0591–\u05C7), Hebrew prefix removal (Χ•,Χ”,Χ‘,ל,Χ›,מ,Χ©) with both the stripped and original form indexed, and a custom Hebrew stopword list.

Stage 2 β€” BGE Cross-Encoder Reranking

The top-190 WRRF candidates are reranked by bge-reranker-hsrc-pairwise-rrf-V1.4, a BGE cross-encoder fine-tuned on the challenge corpus using pairwise training with RRF-mined triples. Pairs are scored with a max sequence length of 640 tokens.

Stage 3 β€” Conditional Score Blending

The final score uses a non-linear conditional boost that amplifies the WRRF signal where the reranker is uncertain:

$$\text{score}_\text{final} = \hat{s}_\text{BGE} + (1 - w_\text{BGE}) \cdot \hat{s}_\text{WRRF} \cdot (1 - \hat{s}_\text{BGE})$$

where $w_\text{BGE} = 0.07$, and both scores are min-max normalized to $[0, 1]$ over the candidate pool. When the reranker assigns a high score ($\hat{s}_\text{BGE} \approx 1$), the WRRF boost vanishes; when it is uncertain ($\hat{s}_\text{BGE} \approx 0$), the WRRF signal takes over.
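The blending formula is a few lines of Python. This sketch uses hypothetical helper names (`minmax`, `blend`) to show the behavior described above:

```python
def minmax(xs):
    """Min-max normalize a list of scores to [0, 1] over the candidate pool."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def blend(bge_scores, wrrf_scores, w_bge=0.07):
    s_bge = minmax(bge_scores)
    s_wrrf = minmax(wrrf_scores)
    # High reranker confidence (s_bge ~ 1) suppresses the WRRF boost;
    # low confidence (s_bge ~ 0) lets the WRRF signal take over.
    return [b + (1 - w_bge) * w * (1 - b) for b, w in zip(s_bge, s_wrrf)]

final = blend([2.0, 0.0, 1.0], [10.0, 30.0, 20.0])
```

In the toy call, the candidate the reranker scores highest keeps its top score untouched, while the reranker's lowest-scored candidate inherits most of its WRRF score.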


Included Models (fine-tuned)

| Path in repo | Base model | Fine-tuning |
|---|---|---|
| `models/e5-large-ft_v6/` | `intfloat/multilingual-e5-large` | Fine-tuned on the challenge corpus (v6 checkpoint) |
| `models/bge-reranker-hsrc-pairwise-rrf-V1.4/` | `BAAI/bge-reranker-v2-m3` | Fine-tuned on RRF-mined pairwise triples from the challenge corpus |
| `models/multilingual-e5-large/` | `intfloat/multilingual-e5-large` | Off-the-shelf (no fine-tuning) |

Repository Structure

model.py              ← Full inference pipeline (preprocess + predict)
bm25_backends.py      ← Pluggable BM25 backends (bm25s / pure-Python fallback)
text_utils.py         ← Hebrew normalization & tokenization utilities
models/
  e5-large-ft_v6/                          ← Fine-tuned E5 embedder ✨
  bge-reranker-hsrc-pairwise-rrf-V1.4/     ← Fine-tuned BGE reranker ✨
  multilingual-e5-large/                   ← Off-the-shelf secondary embedder

Usage

The pipeline exposes two functions matching the competition API:

from model import preprocess, predict

# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)

# Query at inference time
results = predict({"query": "ΧžΧ” Χ”Χ–Χ›Χ•Χ™Χ•Χͺ של Χ©Χ•Χ›Χ¨Χ™ Χ“Χ™Χ¨Χ”?"}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 0.87}, ...]  (top-20)

Requirements:

torch
transformers
sentence-transformers
bm25s
scikit-learn
numpy

A CUDA-capable GPU is strongly recommended (two large encoder models + one cross-encoder are loaded simultaneously, all in BF16/FP16).


Training Pipeline

The full training pipeline is located in repro/documentation/complete_pipeline/ and orchestrated by pipeline.py. It automates four sequential stages:

| Stage | Script | Description |
|---|---|---|
| 1 | `finetune_e5_large.py` | Fine-tunes E5 on the challenge corpus (12 runs, 2 epochs, lr=2e-6, batch=4) |
| 2 | `stage1_weight_sweep.py` | Offline grid sweep of WRRF weights (BM25, E5, Gemma) |
| 3 | `train_bge_ce_pairwise_rrf.py` | Trains the BGE cross-encoder reranker (lr=2e-5, max_len=640, batch=4Γ—accum=8) |
| 4 | `sweep_final2_from_components.py` | Offline sweep for the final blending weight |

Reranker Training Modes

The pipeline supports two parallel reranker training paths:

  • Deterministic mode (--rr_det_runs): trains from pinned triples (repro/documentation/triples/triples.jsonl), enabling reproducible results.
  • Non-deterministic mining mode (--rr_nd_runs): the first run mines fresh triples from the best E5 checkpoint; subsequent runs reuse them. ~1 in 7 runs matches submitted model quality.

Example Full Run Command

python3 repro/documentation/complete_pipeline/pipeline.py \
  --e5_runs 12 --e5_seed0 45 --e5_seed_stride 0 \
  --e5_epochs 2 --e5_batch 4 --e5_lr 2e-6 \
  --stage1_w_bm25 1.0,2.0,0.1 \
  --stage1_w_e5 1.0,2.0,0.1 \
  --stage1_w_gm 1.0,2.0,0.1 \
  --rr_det_runs 1 --rr_det_seed0 42 --rr_det_seed_stride 0 \
  --rr_det_triples_in repro/documentation/triples/triples.jsonl \
  --rr_nd_runs 15 --rr_nd_seed0 42 --rr_nd_seed_stride 0 \
  --rr_bsz 4 --rr_accum 8 --rr_lr 2e-5 --rr_max_len 640 \
  --rr_sweep_rounds 2000

Hardware: Original model trained on RTX 3080 Ti; reproducibility runs executed on L40S (~24 hours for the full pipeline with 12 E5 + 15 reranker runs).


Evaluation Protocol

  • Holdout set: First 100 queries of the provided training file (fixed split, never changed during development).
  • Local evaluation script: scripts/eval_std_final.py β€” runs silently when EVAL_STD_MODE=1.
  • Score discrepancy: 7 of the 100 holdout queries have no labels > 0 (empty relevance). The local script does not ignore these by default, resulting in a local nDCG ~0.615 vs. the public leaderboard score. When empty-label queries are excluded, local scores align with the official leaderboard.

Technical Notes

  • All models are loaded in BF16 (E5, Gemma) or FP16 (BGE reranker) to reduce GPU memory usage.
  • Corpus embedding caching: E5 and Gemma corpus embeddings can be cached to disk (keyed by SHA-1 of document IDs + model path + corpus size) to skip re-encoding on repeated runs.
  • BM25 backend fallback chain: bm25_backends.py β†’ direct bm25s β†’ pure-Python deterministic BM25 (guaranteed to work without external dependencies).
  • Dominant source of non-determinism: GPU FP16/SDPA kernel behavior. Deterministic kernels are available but increase runtime ~3.6Γ— and may exceed GPU memory limits.

Results

| Phase | nDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | 0.460408 | πŸ₯ˆ 2nd |
| Private (Phase II) | 0.656792 | πŸ₯ˆ 2nd |

The large gap between public and private scores is expected: the private phase added human annotation for previously un-annotated retrieved documents, which substantially raised nDCG for systems that had retrieved relevant but unlabeled paragraphs.


Citation

If you use this solution or the models in this repository, please acknowledge the Hebrew Semantic Retrieval Challenge by MAFAT DDR&D and the Israel National NLP Program, and credit itk77 as the solution author.


Acknowledgements

  • MAFAT DDR&D and the Israel National NLP Program for organizing the challenge and providing the annotated Hebrew corpus.
  • The authors of intfloat/multilingual-e5-large and BAAI/bge-reranker-v2-m3.