---
language:
- he
tags:
- hebrew
- semantic-retrieval
- information-retrieval
- dense-retrieval
- reranking
- rrf
- sentence-transformers
- competition
pipeline_tag: sentence-similarity
license: other
---
# Hebrew Semantic Retrieval: 2nd Place Solution
**Competition:** Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the **Israel National NLP Program**
**Result:** 🥈 **2nd place**, NDCG@20 = **0.656792** (private test set) · **0.460408** (public test set)
**Author:** itk77
---
## Overview
This repository contains the complete inference code and fine-tuned models for the 2nd-place solution to the **Hebrew Semantic Retrieval Challenge**. The challenge tasked participants with ranking Hebrew paragraphs from a 127,731-passage corpus in response to natural-language Hebrew queries, evaluated by **NDCG@20**.
Hebrew is a morphologically rich Semitic language written in an almost consonant-only script, creating significant lexical ambiguity and making retrieval substantially harder than for high-resource languages. The solution addresses this with a carefully engineered three-stage pipeline: sparse + dual-dense retrieval fused via Weighted Reciprocal Rank Fusion (WRRF), followed by a BGE cross-encoder reranker fine-tuned specifically on the challenge corpus, and a final conditional score blending step.
---
## The Challenge
| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | NDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0–4 (human annotated) |
---
## Solution Architecture
The solution is a **three-stage pipeline**: sparse + dual-dense retrieval fused with Weighted RRF, cross-encoder reranking, and conditional score blending.
```
Query
  │
  ├──► [BM25 (k1=1.3, b=0.7, w=1.0)]    ──┐
  ├──► [E5-large fine-tuned (w=1.2)]    ──┼──► WRRF Fusion (k=35)
  └──► [multilingual-E5-large (w=1.4)]  ──┘
                                                      │
                                                      ▼
                                              Top-190 Candidates
                                                      │
                                                      ▼
                                        [BGE Cross-Encoder Reranker] (fine-tuned, max_len=640)
                                                      │
                                                      ▼
                                         Conditional Score Blending
                                                      │
                                                      ▼
                                            Final Top-20 Results
```
### Stage 1: Weighted Reciprocal Rank Fusion (WRRF)
Three independent rankers each produce a ranked list of up to 190 candidates. Their lists are fused using **Weighted Reciprocal Rank Fusion**:
$$\text{WRRF}(d) = \frac{w_\text{BM25}}{k + r_\text{BM25}(d) + 1} + \frac{w_\text{E5-ft}}{k + r_\text{E5-ft}(d) + 1} + \frac{w_\text{E5-base}}{k + r_\text{E5-base}(d) + 1}$$
with $k = 35$ (RRF smoothing constant).
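As an illustration of the formula above, here is a minimal sketch of WRRF in plain Python. It is not the code from `model.py`: the function name `wrrf_fuse` and the ranker keys are hypothetical, ranks are assumed to be zero-based (which is one reading of the `+ 1` in the denominator), and a document missing from a ranker's list is assumed to contribute nothing for that ranker.

```python
from collections import defaultdict


def wrrf_fuse(rankings, weights, k=35, top_n=190):
    """Fuse best-first lists of doc IDs with Weighted Reciprocal Rank Fusion.

    rankings: {ranker_name: [doc_id, ...]}, each list ordered best-first.
    weights:  {ranker_name: weight}.
    Ranks are zero-based here (an assumption), so a ranker's top document
    contributes weight / (k + 1).
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights[name]
        for rank, doc_id in enumerate(ranked_ids):
            scores[doc_id] += w / (k + rank + 1)
    # Sort by fused score (descending); break ties by doc ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Weights and k taken from the text; the rankings are toy data.
weights = {"bm25": 1.0, "e5_ft": 1.2, "e5_base": 1.4}
rankings = {
    "bm25": ["d1", "d2", "d3"],
    "e5_ft": ["d2", "d1"],
    "e5_base": ["d2", "d3"],
}
fused = wrrf_fuse(rankings, weights)
```

In this toy example `d2` ranks first because both dense rankers place it at the top, and `d3` edges out `d1` because the highest-weighted ranker includes it.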
| Ranker | Model | Weight | Max Length | Notes |
|---|---|---|---|---|
| BM25 | Custom Hebrew BM25 (bm25s backend) | 1.0 | n/a | Strip nikkud, NFKC norm, prefix stripping |
| E5 (fine-tuned) | `e5-large-ft_v6` | 1.2 | 512 tokens | Mean pooling + L2 norm, `query:` / `passage:` prefixes |
| E5 (base) | `multilingual-e5-large` | 1.4 | 512 tokens | Via SentenceTransformers, BF16; labeled `GemmaEmbedder` in code but loads E5 |
**Hebrew-specific tokenization (BM25):** Unicode NFKC normalization, nikkud stripping (`\u0591–\u05C7`), removal of seven single-letter Hebrew prefixes (including `ש`) with both the stripped and original form indexed, and a custom Hebrew stopword list.
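The tokenization steps listed above can be sketched roughly as follows. This is an illustrative approximation, not the contents of `text_utils.py`: the prefix set, the two-word stopword list, the minimum stem length, and the name `tokenize_hebrew` are all assumptions made for the example.

```python
import re
import unicodedata

# Code-point range for nikkud and cantillation marks, as quoted in the text.
NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")
WORD_RE = re.compile(r"\w+")

# Assumed for illustration: common one-letter Hebrew prefixes (ו ה ב כ ל מ ש).
PREFIXES = frozenset("\u05d5\u05d4\u05d1\u05db\u05dc\u05de\u05e9")
# Assumed for illustration: a tiny stopword list (של, על).
STOPWORDS = frozenset({"\u05e9\u05dc", "\u05e2\u05dc"})
# Hypothetical guard so very short words are not reduced to a single letter.
MIN_STEM_LEN = 2


def tokenize_hebrew(text):
    """Normalize text and emit each token plus its prefix-stripped variant."""
    text = unicodedata.normalize("NFKC", text)
    text = NIKKUD_RE.sub("", text)
    tokens = []
    for tok in WORD_RE.findall(text.lower()):  # lower() only affects non-Hebrew tokens
        if tok in STOPWORDS:
            continue
        tokens.append(tok)  # original form
        if tok[0] in PREFIXES and len(tok) - 1 >= MIN_STEM_LEN:
            tokens.append(tok[1:])  # stripped form, indexed alongside the original
    return tokens
```

Under these assumptions, the pointed word `הַבַּיִת` yields both `הבית` and `בית`, so a query can match the passage with or without the definite-article prefix.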
### Stage 2: BGE Cross-Encoder Reranking
The top-190 WRRF candidates are reranked by `bge-reranker-hsrc-pairwise-rrf-V1.4`, a BGE cross-encoder fine-tuned on the challenge corpus using **pairwise training with RRF-mined triples**. Pairs are scored with a max sequence length of 640 tokens.
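The reranking step can be pictured as scoring each (query, passage) pair and sorting by score. The sketch below keeps the scorer pluggable so that it runs without any model: `rerank` and `overlap_scorer` are hypothetical names, and the token-overlap scorer is only a stand-in. In practice the `score_pairs` callable could wrap a cross-encoder, for example `CrossEncoder.predict` from sentence-transformers, which is not necessarily how `model.py` does it.

```python
def rerank(query, candidates, score_pairs, batch_size=32):
    """Score (query, passage) pairs in batches and sort candidates by score.

    candidates:  list of (doc_id, passage) tuples, e.g. the fused top-190.
    score_pairs: callable taking a list of (query, passage) pairs and
                 returning one float per pair (higher = more relevant).
    """
    scores = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        pairs = [(query, passage) for _, passage in batch]
        scores.extend(score_pairs(pairs))
    ranked = sorted(zip(candidates, scores), key=lambda item: -item[1])
    return [(doc_id, score) for (doc_id, _), score in ranked]


def overlap_scorer(pairs):
    """Stand-in scorer: fraction of query tokens that appear in the passage."""
    out = []
    for query, passage in pairs:
        q_tokens = set(query.split())
        out.append(len(q_tokens & set(passage.split())) / max(len(q_tokens), 1))
    return out


candidates = [("a", "red car"), ("b", "blue sky today"), ("c", "sky")]
reranked = rerank("blue sky", candidates, overlap_scorer, batch_size=2)
```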
### Stage 3: Conditional Score Blending
The final score adds a non-linear, conditional boost from the WRRF signal that grows as the normalized reranker score falls:
$$\text{score}_\text{final} = \hat{s}_\text{BGE} + (1 - w_\text{BGE}) \cdot \hat{s}_\text{WRRF} \cdot (1 - \hat{s}_\text{BGE})$$
where $w_\text{BGE} = 0.07$, and both scores are **min-max normalized** to $[0, 1]$ over the candidate pool. When the reranker assigns a high score ($\hat{s}_\text{BGE} \approx 1$), the WRRF boost vanishes; when the reranker score is low ($\hat{s}_\text{BGE} \approx 0$), the final score is approximately $(1 - w_\text{BGE}) \cdot \hat{s}_\text{WRRF} = 0.93\,\hat{s}_\text{WRRF}$, so the WRRF signal takes over.
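A minimal sketch of the blending formula, written in plain Python for clarity. The function names are hypothetical, and mapping a constant score list to zeros is an assumption, since the text does not say how a zero-range pool is handled.

```python
def min_max(values):
    """Min-max normalize a list of scores to [0, 1] over the candidate pool."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant pool carries no ranking signal, so map it to zeros.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def blend(bge_scores, wrrf_scores, w_bge=0.07):
    """score_final = s_bge + (1 - w_bge) * s_wrrf * (1 - s_bge), on normalized scores."""
    s_bge = min_max(bge_scores)
    s_wrrf = min_max(wrrf_scores)
    return [b + (1.0 - w_bge) * r * (1.0 - b) for b, r in zip(s_bge, s_wrrf)]


# Toy pool of three candidates: raw reranker scores and fused WRRF scores.
final_scores = blend([2.0, -1.0, 0.5], [0.02, 0.09, 0.05])
```

The candidate with the highest reranker score keeps a final score of 1.0 regardless of its WRRF score, while the candidate with the lowest reranker score is ranked purely by 0.93 times its normalized WRRF score.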
---
## Included Models (fine-tuned)
| Path in repo | Base model | Fine-tuning |
|---|---|---|
| `models/e5-large-ft_v6/` | `intfloat/multilingual-e5-large` | Fine-tuned on the challenge corpus (v6 checkpoint) |
| `models/bge-reranker-hsrc-pairwise-rrf-V1.4/` | `BAAI/bge-reranker-v2-m3` | Fine-tuned on RRF-mined pairwise triples from the challenge corpus |
| `models/multilingual-e5-large/` | `intfloat/multilingual-e5-large` | Off-the-shelf (no fine-tuning) |
---
## Repository Structure
```
model.py                                  ← Full inference pipeline (preprocess + predict)
bm25_backends.py                          ← Pluggable BM25 backends (bm25s / pure-Python fallback)
text_utils.py                             ← Hebrew normalization & tokenization utilities
models/
  e5-large-ft_v6/                         ← Fine-tuned E5 embedder ✨
  bge-reranker-hsrc-pairwise-rrf-V1.4/    ← Fine-tuned BGE reranker ✨
  multilingual-e5-large/                  ← Off-the-shelf secondary embedder
```
---
## Usage
The pipeline exposes two functions matching the competition API:
```python
from model import preprocess, predict
# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)
# Query at inference time
results = predict({"query": "מה הזכויות של שוכרי דירה?"}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 0.87}, ...] (top-20)
```
**Requirements:**
```
torch
transformers
sentence-transformers
bm25s
scikit-learn
numpy
```
A CUDA-capable GPU is strongly recommended (two large encoder models + one cross-encoder are loaded simultaneously, all in BF16/FP16).
---
## Training Pipeline
The full training pipeline is located in `repro/documentation/complete_pipeline/` and orchestrated by `pipeline.py`. It automates four sequential stages:
| Stage | Script | Description |
|---|---|---|
| 1 | `finetune_e5_large.py` | Fine-tunes E5 on the challenge corpus (12 runs, 2 epochs, lr=2e-6, batch=4) |
| 2 | `stage1_weight_sweep.py` | Offline grid sweep of WRRF weights (BM25, fine-tuned E5, and the `Gemma`-labeled slot that loads `multilingual-e5-large`) |
| 3 | `train_bge_ce_pairwise_rrf.py` | Trains the BGE cross-encoder reranker (lr=2e-5, max_len=640, batch=4×accum=8) |
| 4 | `sweep_final2_from_components.py` | Offline sweep for the final blending weight |
### Reranker Training Modes
The pipeline supports two parallel reranker training paths:
- **Deterministic mode** (`--rr_det_runs`): trains from **pinned triples** (`repro/documentation/triples/triples.jsonl`), enabling reproducible results.
- **Non-deterministic mining mode** (`--rr_nd_runs`): the first run mines fresh triples from the best E5 checkpoint; subsequent runs reuse them. Roughly 1 in 7 runs matches the quality of the submitted model.
### Example Full Run Command
```bash
python3 repro/documentation/complete_pipeline/pipeline.py \
--e5_runs 12 --e5_seed0 45 --e5_seed_stride 0 \
--e5_epochs 2 --e5_batch 4 --e5_lr 2e-6 \
--stage1_w_bm25 1.0,2.0,0.1 \
--stage1_w_e5 1.0,2.0,0.1 \
--stage1_w_gm 1.0,2.0,0.1 \
--rr_det_runs 1 --rr_det_seed0 42 --rr_det_seed_stride 0 \
--rr_det_triples_in repro/documentation/triples/triples.jsonl \
--rr_nd_runs 15 --rr_nd_seed0 42 --rr_nd_seed_stride 0 \
--rr_bsz 4 --rr_accum 8 --rr_lr 2e-5 --rr_max_len 640 \
--rr_sweep_rounds 2000
```
**Hardware:** The original model was trained on an RTX 3080 Ti; reproducibility runs were executed on an L40S (~24 hours for the full pipeline with 12 E5 + 15 reranker runs).
---
## Evaluation Protocol
- **Holdout set:** First 100 queries of the provided training file (fixed split, never changed during development).
- **Local evaluation script:** `scripts/eval_std_final.py`, which runs silently when `EVAL_STD_MODE=1`.
- **Score discrepancy:** 7 of the 100 holdout queries have no labels > 0 (empty relevance). The local script does not exclude these by default, which yields a local NDCG of ~0.615 and a mismatch with the public leaderboard score. When empty-label queries are excluded, local scores align with the official leaderboard.
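To make the effect of empty-label queries concrete, here is a small NDCG@20 sketch. It is not `scripts/eval_std_final.py`: it assumes linear gains and a log2 discount (the official scorer's gain function is not stated here), and it assumes that a query with no positive labels is counted as zero when it is not skipped. All function names are hypothetical.

```python
import math


def dcg(rels):
    """Discounted cumulative gain with linear gains (an assumption)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))


def ndcg_at_k(ranked_rels, judged_rels, k=20):
    """NDCG@k for one query, or None if the query has no positive labels.

    ranked_rels: labels of the returned documents, in rank order.
    judged_rels: all labels known for the query (used for the ideal ranking).
    """
    ideal = dcg(sorted(judged_rels, reverse=True)[:k])
    if ideal == 0:
        return None
    return dcg(ranked_rels[:k]) / ideal


def mean_ndcg(per_query, skip_empty):
    """Average NDCG@20; empty-label queries are skipped or counted as 0."""
    scores = []
    for ranked_rels, judged_rels in per_query:
        score = ndcg_at_k(ranked_rels, judged_rels)
        if score is None:
            if skip_empty:
                continue
            score = 0.0
        scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


# One perfectly ranked query and one query with no positive labels.
per_query = [([4, 2, 0], [4, 2, 0]), ([0, 0], [0, 0])]
with_empty = mean_ndcg(per_query, skip_empty=False)
without_empty = mean_ndcg(per_query, skip_empty=True)
```

Under these assumptions, including 7 empty queries out of 100 scales the mean by 93/100 relative to the mean over the 93 labeled queries.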
---
## Technical Notes
- All models are loaded in **BF16** (E5, Gemma) or **FP16** (BGE reranker) to reduce GPU memory usage.
- **Corpus embedding caching:** E5 and Gemma corpus embeddings can be cached to disk (keyed by SHA-1 of document IDs + model path + corpus size) to skip re-encoding on repeated runs.
- **BM25 backend fallback chain:** `bm25_backends.py` → direct `bm25s` → pure-Python deterministic BM25 (guaranteed to work without external dependencies).
- **Dominant source of non-determinism:** GPU FP16/SDPA kernel behavior. Deterministic kernels are available but increase runtime ~3.6× and may exceed GPU memory limits.
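The cache key mentioned in the corpus embedding caching note could be derived along the following lines. This is a sketch under assumptions: the exact byte layout used by the repository is not documented here, and the function name and NUL separator are hypothetical choices.

```python
import hashlib


def embedding_cache_key(doc_ids, model_path):
    """SHA-1 over document IDs, model path, and corpus size (layout assumed)."""
    h = hashlib.sha1()
    for doc_id in doc_ids:
        h.update(str(doc_id).encode("utf-8"))
        h.update(b"\x00")  # separator so ["ab", "c"] and ["a", "bc"] hash differently
    h.update(model_path.encode("utf-8"))
    h.update(b"\x00")
    h.update(str(len(doc_ids)).encode("utf-8"))
    return h.hexdigest()


key = embedding_cache_key(["doc-1", "doc-2"], "models/e5-large-ft_v6")
```

Because the key depends on the document IDs and the model path, changing the corpus or swapping the embedder produces a different key, and stale embeddings are not reused.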
---
## Results
| Phase | NDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | **0.460408** | 🥈 2nd |
| Private (Phase II) | **0.656792** | 🥈 2nd |
> The large gap between public and private scores is expected: the private phase incorporated additional human annotation of previously un-annotated retrieved documents, significantly impacting NDCG for systems that retrieved relevant but un-annotated paragraphs.
---
## Citation
If you use this solution or the models in this repository, please acknowledge the **Hebrew Semantic Retrieval Challenge** by MAFAT DDR&D and the Israel National NLP Program, and credit **itk77** as the solution author.
---
## Acknowledgements
- MAFAT DDR&D and the **Israel National NLP Program** for organizing the challenge and providing the annotated Hebrew corpus.
- The authors of `intfloat/multilingual-e5-large` and `BAAI/bge-reranker-v2-m3`.