---
language:
- he
tags:
- hebrew
- semantic-retrieval
- information-retrieval
- dense-retrieval
- reranking
- rrf
- sentence-transformers
- competition
pipeline_tag: sentence-similarity
license: other
---

# Hebrew Semantic Retrieval: 2nd Place Solution

**Competition:** Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the **Israel National NLP Program**

**Result:** 🥈 **2nd place**, nDCG@20 = **0.656792** (private test set) · **0.460408** (public test set)

**Author:** itk77

---

## Overview

This repository contains the complete inference code and fine-tuned models for the 2nd-place solution to the **Hebrew Semantic Retrieval Challenge**. The challenge tasked participants with ranking Hebrew paragraphs from a 127,731-passage corpus in response to natural-language Hebrew queries, evaluated by **NDCG@20**.

Hebrew is a morphologically rich Semitic language written in an almost consonant-only script, creating significant lexical ambiguity and making retrieval substantially harder than for high-resource languages. The solution addresses this with a carefully engineered three-stage pipeline: sparse + dual-dense retrieval fused via Weighted Reciprocal Rank Fusion (WRRF), followed by a BGE cross-encoder reranker fine-tuned specifically on the challenge corpus, and a final conditional score blending step.

---

## The Challenge

| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | NDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0–4 (human annotated) |

---

## Solution Architecture

The solution is a **three-stage pipeline**: sparse + dual-dense retrieval fused with Weighted RRF, cross-encoder reranking, and conditional score blending.

```
Query
  │
  ├─► [BM25  (k1=1.3, b=0.7, w=1.0)]   ──┐
  ├─► [E5-large fine-tuned (w=1.2)]      ├─► WRRF Fusion (k=35)
  └─► [multilingual-E5-large (w=1.4)]    ┘
            │
            ▼
     Top-190 Candidates
            │
            ▼
     [BGE Cross-Encoder Reranker]  (fine-tuned, max_len=640)
            │
            ▼
     Conditional Score Blending
            │
            ▼
     Final Top-20 Results
```

### Stage 1: Weighted Reciprocal Rank Fusion (WRRF)

Three independent rankers each produce a ranked list of up to 190 candidates. Their lists are fused using **Weighted Reciprocal Rank Fusion**:

$$\text{WRRF}(d) = \frac{w_\text{BM25}}{k + r_\text{BM25}(d) + 1} + \frac{w_\text{E5-ft}}{k + r_\text{E5-ft}(d) + 1} + \frac{w_\text{E5-base}}{k + r_\text{E5-base}(d) + 1}$$

Here $r_x(d)$ is the rank of document $d$ in ranker $x$'s list, and $k = 35$ is the RRF smoothing constant.
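
The fusion formula above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the function name `wrrf_fuse`, the 0-based rank convention (suggested by the `+ 1` in the denominator), and the choice to let a document missing from a ranker's list contribute nothing for that ranker are all assumptions.

```python
from collections import defaultdict


def wrrf_fuse(rankings, weights, k=35, top_n=190):
    """Fuse ranked lists of doc IDs with Weighted Reciprocal Rank Fusion.

    rankings: dict mapping ranker name -> list of doc IDs, best first.
    weights:  dict mapping ranker name -> fusion weight.
    Ranks are assumed 0-based, so the top document adds w / (k + 1).
    A document absent from a ranker's list adds nothing for that ranker
    (an assumption; the source does not specify this case).
    """
    scores = defaultdict(float)
    for name, doc_ids in rankings.items():
        w = weights[name]
        for rank, doc_id in enumerate(doc_ids):
            scores[doc_id] += w / (k + rank + 1)
    # Highest fused score first; ties broken by doc ID for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:top_n]


# Toy example with the weights listed in this document.
rankings = {
    "bm25": ["d1", "d2", "d3"],
    "e5_ft": ["d2", "d1", "d4"],
    "e5_base": ["d2", "d3", "d1"],
}
weights = {"bm25": 1.0, "e5_ft": 1.2, "e5_base": 1.4}
fused = wrrf_fuse(rankings, weights)
```

In the toy example `d2` ends up first because the two more heavily weighted dense rankers both place it at the top, even though BM25 prefers `d1`.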

| Ranker | Model | Weight | Max Length | Notes |
|---|---|---|---|---|
| BM25 | Custom Hebrew BM25 (bm25s backend) | 1.0 | n/a | Strip nikkud, NFKC norm, prefix stripping |
| E5 (fine-tuned) | `e5-large-ft_v6` | 1.2 | 512 tokens | Mean pooling + L2 norm, `query:` / `passage:` prefixes |
| E5 (base) | `multilingual-e5-large` | 1.4 | 512 tokens | Via SentenceTransformers, BF16; labeled `GemmaEmbedder` in code but loads E5 |
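
The "mean pooling + L2 norm" and `query:` / `passage:` prefix steps from the table can be illustrated with NumPy on toy token embeddings. This is a sketch under stated assumptions: the helper names `mean_pool_l2` and `with_prefix` are hypothetical, and a real run would feed the encoder's last hidden states instead of the hand-written array.

```python
import numpy as np


def mean_pool_l2(token_embeddings, attention_mask):
    """Masked mean pooling followed by L2 normalization.

    token_embeddings: float array of shape (batch, seq_len, dim).
    attention_mask:   0/1 array of shape (batch, seq_len); 0 marks padding.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)


def with_prefix(texts, kind):
    """Prepend the E5 role prefix; kind is 'query' or 'passage'."""
    return [f"{kind}: {t}" for t in texts]


# One sequence of three tokens; the last token is padding and is ignored.
tokens = np.array([[[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
vec = mean_pool_l2(tokens, mask)
```

Because the vectors are unit length after normalization, a plain dot product between a query vector and a passage vector equals their cosine similarity.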

**Hebrew-specific tokenization (BM25):** Unicode NFKC normalization, nikkud stripping (`\u0591–\u05C7`), Hebrew prefix removal (`ו`,`ה`,`ב`,`ל`,`כ`,`מ`,`ש`) with both the stripped and original form indexed, and a custom Hebrew stopword list.
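
A rough sketch of that tokenization, using only the standard library, might look as follows. It is not the code in `text_utils.py`: the function name `tokenize_he`, the three-word stopword set, the single-letter stripping rule, and the `min_stem_len` guard are illustrative assumptions.

```python
import re
import unicodedata

NIKKUD = re.compile(r"[\u0591-\u05C7]")  # cantillation marks and vowel points
TOKEN = re.compile(r"\w+")
PREFIXES = "והבלכמש"                      # one-letter Hebrew prefixes
STOPWORDS = {"של", "על", "את"}            # hypothetical, tiny stopword list


def tokenize_he(text, min_stem_len=2):
    """Normalize, strip nikkud, and emit each token plus its prefix-stripped form."""
    text = unicodedata.normalize("NFKC", text)
    text = NIKKUD.sub("", text)
    tokens = []
    for tok in TOKEN.findall(text):
        if tok in STOPWORDS:
            continue
        tokens.append(tok)  # original surface form
        # Also index the form without one leading prefix letter,
        # provided enough of the word remains (assumed guard).
        if tok[0] in PREFIXES and len(tok) - 1 >= min_stem_len:
            tokens.append(tok[1:])
    return tokens


# "ha-bayit shel David" (the house of David), written with nikkud.
sample = "\u05d4\u05b7\u05d1\u05bc\u05b7\u05d9\u05b4\u05ea \u05e9\u05dc \u05d3\u05d5\u05d3"
tokens = tokenize_he(sample)
```

Indexing both forms lets a query for "בית" (house) match a passage containing "הבית" (the house) without losing exact matches on the original form.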

### Stage 2: BGE Cross-Encoder Reranking

The top-190 WRRF candidates are reranked by `bge-reranker-hsrc-pairwise-rrf-V1.4`, a BGE cross-encoder fine-tuned on the challenge corpus using **pairwise training with RRF-mined triples**. Pairs are scored with a max sequence length of 640 tokens.

### Stage 3: Conditional Score Blending

The final score uses a non-linear conditional boost that amplifies the WRRF signal where the reranker is uncertain:

$$\text{score}_\text{final} = \hat{s}_\text{BGE} + (1 - w_\text{BGE}) \cdot \hat{s}_\text{WRRF} \cdot (1 - \hat{s}_\text{BGE})$$

where $w_\text{BGE} = 0.07$, and both scores are **min-max normalized** to $[0, 1]$ over the candidate pool. When the reranker assigns a high score ($\hat{s}_\text{BGE} \approx 1$), the WRRF boost vanishes; when it is uncertain ($\hat{s}_\text{BGE} \approx 0$), the WRRF signal takes over.
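
The blending rule can be checked numerically with a short NumPy sketch. The function names `minmax` and `blend` are hypothetical, and mapping a constant score vector to all zeros is an assumption made here to avoid division by zero.

```python
import numpy as np


def minmax(x):
    """Scale scores to [0, 1] over the pool; a constant vector maps to zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)


def blend(bge_scores, wrrf_scores, w_bge=0.07):
    """final = s_bge + (1 - w_bge) * s_wrrf * (1 - s_bge), on normalized scores."""
    s_bge = minmax(bge_scores)
    s_wrrf = minmax(wrrf_scores)
    return s_bge + (1.0 - w_bge) * s_wrrf * (1.0 - s_bge)


# Three candidates: raw reranker scores and WRRF scores.
bge = [2.0, 0.0, -2.0]
wrrf = [0.08, 0.10, 0.06]
final = blend(bge, wrrf)
```

With these inputs the normalized reranker scores are 1.0, 0.5, and 0.0. The top reranker candidate receives no WRRF boost, while the middle candidate, which has the best WRRF score, rises from 0.5 to 0.5 + 0.93 × 1.0 × 0.5 = 0.965.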

---

## Included Models (fine-tuned)

| Path in repo | Base model | Fine-tuning |
|---|---|---|
| `models/e5-large-ft_v6/` | `intfloat/multilingual-e5-large` | Fine-tuned on the challenge corpus (v6 checkpoint) |
| `models/bge-reranker-hsrc-pairwise-rrf-V1.4/` | `BAAI/bge-reranker-v2-m3` | Fine-tuned on RRF-mined pairwise triples from the challenge corpus |
| `models/multilingual-e5-large/` | `intfloat/multilingual-e5-large` | Off-the-shelf (no fine-tuning) |

---

## Repository Structure

```
model.py              ← Full inference pipeline (preprocess + predict)
bm25_backends.py      ← Pluggable BM25 backends (bm25s / pure-Python fallback)
text_utils.py         ← Hebrew normalization & tokenization utilities
models/
  e5-large-ft_v6/                          ← Fine-tuned E5 embedder ✨
  bge-reranker-hsrc-pairwise-rrf-V1.4/     ← Fine-tuned BGE reranker ✨
  multilingual-e5-large/                   ← Off-the-shelf secondary embedder
```

---

## Usage

The pipeline exposes two functions matching the competition API:

```python
from model import preprocess, predict

# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)

# Query at inference time (the query asks: "What are the rights of apartment renters?")
results = predict({"query": "מה הזכויות של שוכרי דירה?"}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 0.87}, ...]  (top-20)
```

**Requirements:**
```
torch
transformers
sentence-transformers
bm25s
scikit-learn
numpy
```

A CUDA-capable GPU is strongly recommended (two large encoder models + one cross-encoder are loaded simultaneously, all in BF16/FP16).

---

## Training Pipeline

The full training pipeline is located in `repro/documentation/complete_pipeline/` and orchestrated by `pipeline.py`. It automates four sequential stages:

| Stage | Script | Description |
|---|---|---|
| 1 | `finetune_e5_large.py` | Fine-tunes E5 on the challenge corpus (12 runs, 2 epochs, lr=2e-6, batch=4) |
| 2 | `stage1_weight_sweep.py` | Offline grid sweep of WRRF weights (BM25, fine-tuned E5, and the off-the-shelf E5 that the code calls "Gemma") |
| 3 | `train_bge_ce_pairwise_rrf.py` | Trains the BGE cross-encoder reranker (lr=2e-5, max_len=640, batch=4×accum=8) |
| 4 | `sweep_final2_from_components.py` | Offline sweep for the final blending weight |

### Reranker Training Modes

The pipeline supports two parallel reranker training paths:

- **Deterministic mode** (`--rr_det_runs`): trains from **pinned triples** (`repro/documentation/triples/triples.jsonl`), enabling reproducible results.
- **Non-deterministic mining mode** (`--rr_nd_runs`): the first run mines fresh triples from the best E5 checkpoint; subsequent runs reuse them. Roughly 1 in 7 such runs matches the quality of the submitted model.

### Example Full Run Command

```bash
python3 repro/documentation/complete_pipeline/pipeline.py \
  --e5_runs 12 --e5_seed0 45 --e5_seed_stride 0 \
  --e5_epochs 2 --e5_batch 4 --e5_lr 2e-6 \
  --stage1_w_bm25 1.0,2.0,0.1 \
  --stage1_w_e5 1.0,2.0,0.1 \
  --stage1_w_gm 1.0,2.0,0.1 \
  --rr_det_runs 1 --rr_det_seed0 42 --rr_det_seed_stride 0 \
  --rr_det_triples_in repro/documentation/triples/triples.jsonl \
  --rr_nd_runs 15 --rr_nd_seed0 42 --rr_nd_seed_stride 0 \
  --rr_bsz 4 --rr_accum 8 --rr_lr 2e-5 --rr_max_len 640 \
  --rr_sweep_rounds 2000
```
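
To get a feel for the size of the Stage 1 sweep, the sketch below expands the three `--stage1_w_*` arguments into a grid. It assumes each value is a `start,stop,step` triple with an inclusive stop; that reading is a guess from the command line, not something the pipeline documents here, and `expand_range` is a hypothetical helper.

```python
from itertools import product


def expand_range(spec):
    """Expand a 'start,stop,step' string into an inclusive list of floats."""
    start, stop, step = (float(x) for x in spec.split(","))
    n = int(round((stop - start) / step)) + 1
    # Rounding keeps values such as 1.3 free of floating-point noise.
    return [round(start + i * step, 10) for i in range(n)]


w_values = expand_range("1.0,2.0,0.1")
# Same range for BM25, fine-tuned E5, and the off-the-shelf embedder.
grid = list(product(w_values, repeat=3))
```

Under that assumption each weight takes 11 values, so the full grid has 11³ = 1,331 combinations. The weights listed in this document (1.0, 1.2, 1.4) all lie on this grid.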

**Hardware:** Original model trained on RTX 3080 Ti; reproducibility runs executed on L40S (~24 hours for the full pipeline with 12 E5 + 15 reranker runs).

---

## Evaluation Protocol

- **Holdout set:** First 100 queries of the provided training file (fixed split, never changed during development).
- **Local evaluation script:** `scripts/eval_std_final.py`, which runs silently when `EVAL_STD_MODE=1`.
- **Score discrepancy:** 7 of the 100 holdout queries have no labels > 0 (empty relevance). The local script does not skip these by default, which yields a local nDCG of ~0.615 and a mismatch with the public leaderboard score. When empty-label queries are excluded, local scores align with the official leaderboard.
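
The effect of empty-label queries on the average can be reproduced with a small NDCG@20 sketch. It is not `scripts/eval_std_final.py`: the linear gain, the treatment of unlabeled documents as relevance 0, and scoring an empty-label query as 0 when it is not skipped are assumptions made for illustration.

```python
import math


def dcg(rels, k=20):
    """Discounted cumulative gain with linear gains (an assumed gain function)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(ranked_ids, labels, k=20):
    """labels: dict doc_id -> relevance (0-4). Returns None if no label > 0."""
    idcg = dcg(sorted(labels.values(), reverse=True), k)
    if idcg == 0:
        return None
    gains = [labels.get(d, 0) for d in ranked_ids]  # unlabeled docs count as 0
    return dcg(gains, k) / idcg


def mean_ndcg(runs, skip_empty):
    """runs: list of (ranked_ids, labels). Empty-label queries score 0 unless skipped."""
    scores = []
    for ranked_ids, labels in runs:
        s = ndcg_at_k(ranked_ids, labels)
        if s is None:
            if skip_empty:
                continue
            s = 0.0
        scores.append(s)
    return sum(scores) / len(scores) if scores else 0.0


runs = [
    (["a", "b"], {"a": 4, "b": 2}),  # perfectly ranked query
    (["c", "d"], {"c": 0, "d": 0}),  # query with no relevant labels
]
with_empty = mean_ndcg(runs, skip_empty=False)
without_empty = mean_ndcg(runs, skip_empty=True)
```

In this toy case the mean drops from 1.0 to 0.5 when the empty-label query is counted, which shows how a handful of such queries can pull a holdout average down.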

---

## Technical Notes

- All models are loaded in **BF16** (both E5 embedders; the off-the-shelf one is named "Gemma" in the code) or **FP16** (BGE reranker) to reduce GPU memory usage.
- **Corpus embedding caching:** corpus embeddings from both E5 embedders can be cached to disk (keyed by SHA-1 of document IDs + model path + corpus size) to skip re-encoding on repeated runs.
- **BM25 backend fallback chain:** `bm25_backends.py` → direct `bm25s` → pure-Python deterministic BM25 (guaranteed to work without external dependencies).
- **Dominant source of non-determinism:** GPU FP16/SDPA kernel behavior. Deterministic kernels are available but increase runtime ~3.6× and may exceed GPU memory limits.
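
The embedding cache key described in the caching note could be derived along these lines. This is a sketch only: the function name `embedding_cache_key`, the field order, and the NUL separator are assumptions, and the actual key layout in `model.py` may differ.

```python
import hashlib


def embedding_cache_key(doc_ids, model_path):
    """SHA-1 over the model path, the corpus size, and the ordered document IDs."""
    h = hashlib.sha1()
    for part in [model_path, str(len(doc_ids)), *map(str, doc_ids)]:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return h.hexdigest()


key_a = embedding_cache_key(["p1", "p2"], "models/e5-large-ft_v6")
key_b = embedding_cache_key(["p1", "p2"], "models/multilingual-e5-large")
key_c = embedding_cache_key(["p2", "p1"], "models/e5-large-ft_v6")
```

Hashing the model path keeps the two embedders' caches apart, and hashing the ordered IDs invalidates the cache whenever the corpus or its row order changes.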

---

## Results

| Phase | NDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | **0.460408** | 🥈 2nd |
| Private (Phase II) | **0.656792** | 🥈 2nd |

> The large gap between public and private scores is expected: the private phase incorporated additional human annotation of previously un-annotated retrieved documents, significantly impacting NDCG for systems that retrieved relevant but un-annotated paragraphs.

---

## Citation

If you use this solution or the models in this repository, please acknowledge the **Hebrew Semantic Retrieval Challenge** by MAFAT DDR&D and the Israel National NLP Program, and credit **itk77** as the solution author.

---

## Acknowledgements

- MAFAT DDR&D and the **Israel National NLP Program** for organizing the challenge and providing the annotated Hebrew corpus.
- The authors of `intfloat/multilingual-e5-large` and `BAAI/bge-reranker-v2-m3`.