---
language:
- he
tags:
- hebrew
- semantic-retrieval
- information-retrieval
- dense-retrieval
- reranking
- bge-m3
- sentence-transformers
- competition
pipeline_tag: sentence-similarity
license: other
---

# Hebrew Semantic Retrieval — 3rd Place Solution

**Competition:** Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the **Israel National NLP Program**

**Result:** 🥉 **3rd place** — nDCG@20 = **0.652538** (private test set) · **0.432286** (public test set)

**Author:** kdbrodt

---

## Overview

This repository contains the complete inference code and fine-tuned models for the 3rd-place solution to the **Hebrew Semantic Retrieval Challenge**. The challenge tasked participants with ranking Hebrew paragraphs from a 127,731-passage corpus in response to natural-language Hebrew queries, evaluated by **NDCG@20**.

The solution is a clean, end-to-end two-stage retrieve-then-rerank pipeline built entirely on the [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding) (`BAAI/bge-m3`) family. Both the dense embedder and the cross-encoder reranker were fine-tuned directly on the competition's annotated Hebrew data.

---

## The Challenge

| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | NDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0–4 (human annotated) |
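
The metric in the table can be illustrated with a short sketch. It assumes linear gain (the raw 0–4 relevance grade) and the standard `log2(rank + 1)` discount; the organizers' exact gain variant is not stated here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 20) -> float:
    """Discounted cumulative gain over the first k positions (linear gain assumed)."""
    # enumerate() is 0-based, so position 1 gets a discount of log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    all_relevances: Sequence[float],
    k: int = 20,
) -> float:
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant passage exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Grades of the passages a system returned, in ranked order,
# versus all judged grades for the same query.
judged = [4, 2, 0]
print(ndcg_at_k([4, 2, 0], judged))  # ideal ordering
print(ndcg_at_k([0, 2, 4], judged))  # worst ordering of the same passages
```

Under these assumptions, a ranking that places the highest-graded passages first scores 1.0, and any other ordering of the same passages scores lower.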

---

## Solution Architecture

A straightforward two-stage pipeline: dense retrieval followed by cross-encoder reranking.

```
Query
  │
  ▼
[BGE-M3 Dense Retriever]  (fine-tuned, CLS pooling, FP16)
  │  cosine similarity over 127k passages
  ▼
Top-100 Candidates
  │
  ▼
[BGE-Reranker-v2-M3]  (fine-tuned binary classifier, FP16)
  │  query-passage pairs scored, max_length=512
  ▼
Final Top-20 Results
```

### Stage 1 — Dense Retrieval

The fine-tuned `bge-m3` encoder produces **CLS-token embeddings** (L2-normalized, FP16) for all corpus passages at preprocessing time. At query time, a single query embedding is computed and scored against all corpus embeddings via **dot-product similarity** (equivalent to cosine similarity on normalized vectors). The top-100 passages are selected for reranking.

| Property | Value |
|---|---|
| Model | `test_encoder_only_base_bge_m3_new1` (fine-tuned `BAAI/bge-m3`) |
| Pooling | CLS token |
| Normalization | L2 |
| Precision | FP16 |
| Max length | 512 tokens |
| Batch size (corpus) | 64 |
| Retrieval pool | Top-100 candidates |
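
The scoring step of Stage 1 can be sketched with NumPy. This is a minimal illustration, not the code in `model.py`: the function names are hypothetical, random vectors stand in for the `bge-m3` CLS embeddings, and float32 is used so the example runs on CPU.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each vector to unit length so dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def top_k_by_dot_product(query_emb: np.ndarray, corpus_emb: np.ndarray, k: int = 100):
    """Return (indices, scores) of the k highest-scoring corpus rows, best first."""
    scores = corpus_emb @ query_emb  # shape: (num_passages,)
    k = min(k, scores.shape[0])
    if k == 0:
        return np.empty(0, dtype=np.int64), scores[:0]
    # argpartition selects the k best without sorting the whole corpus;
    # only those k are then sorted by descending score.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]


# Stand-in data: 1,000 random "passage embeddings" and one "query embedding".
rng = np.random.default_rng(0)
corpus = l2_normalize(rng.normal(size=(1000, 32)).astype(np.float32))
query = l2_normalize(rng.normal(size=32).astype(np.float32))

indices, scores = top_k_by_dot_product(query, corpus, k=100)
print(indices[:5], scores[:5])
```

Because the corpus embeddings are computed once at preprocessing time, each query costs a single matrix-vector product plus a partial sort.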

### Stage 2 — Cross-Encoder Reranking

The top-100 candidates are re-scored by the fine-tuned `bge-reranker-v2-m3`, a sequence classification model that takes concatenated `[query, passage]` pairs as input and outputs a relevance logit. Passages are sorted by length before scoring to minimize padding overhead. The top-20 by reranker score are returned.

| Property | Value |
|---|---|
| Model | `test_encoder_only_base_bge_reranker_v2_m3_new1` (fine-tuned `BAAI/bge-reranker-v2-m3`) |
| Max length | 512 tokens |
| Batch size | 16 |
| Output | Top-20 by reranker logit |
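
The batching logic of Stage 2 might look like the sketch below. The `score_batch` callable is a hypothetical stand-in for the fine-tuned cross-encoder (here a toy word-overlap scorer), and sorting by character length is an assumption; the actual pipeline may sort by token count.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer signature: (query, passages) -> one relevance logit per passage.
ScoreFn = Callable[[str, Sequence[str]], List[float]]


def rerank(
    query: str,
    candidates: Sequence[Tuple[str, str]],  # (doc_id, passage) pairs from Stage 1
    score_batch: ScoreFn,
    batch_size: int = 16,
    top_k: int = 20,
) -> List[Tuple[str, float]]:
    """Score candidates in length-sorted batches and return the top_k by logit."""
    # Similar-length passages share a batch, so less padding is needed.
    ordered = sorted(candidates, key=lambda c: len(c[1]))
    scored: List[Tuple[str, float]] = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        logits = score_batch(query, [passage for _, passage in batch])
        scored.extend((doc_id, float(s)) for (doc_id, _), s in zip(batch, logits))
    # Document IDs travel with their scores, so the length ordering
    # does not affect the final ranking.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def overlap_scorer(query: str, passages: Sequence[str]) -> List[float]:
    """Toy stand-in for the cross-encoder: counts distinct shared words."""
    query_words = set(query.split())
    return [float(len(query_words & set(p.split()))) for p in passages]


candidates = [("a", "tenant rights in rental law"), ("b", "history of the city")]
print(rerank("tenant rights", candidates, overlap_scorer))
```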

---

## Fine-Tuning

Both models were fine-tuned on the competition's annotated Hebrew training set using the [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding) framework.

**Training data construction:**
- Every query–document pair with a **positive relevance score (> 0)** was treated as a positive example.
- Every pair with a **score of 0** was treated as a negative example.
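
The rule above can be sketched as a small conversion step. The `query`/`pos`/`neg` record layout follows the JSONL fine-tuning format described in the FlagEmbedding documentation; whether `prepare.py` produces exactly this structure, and whether it skips queries without positives, are assumptions. All names are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_training_records(
    judgments: Iterable[Tuple[str, str, int]],  # (query_id, doc_id, relevance 0..4)
    queries: Dict[str, str],
    passages: Dict[str, str],
) -> List[dict]:
    """Split judged pairs into positives (score > 0) and negatives (score == 0)."""
    pos: Dict[str, List[str]] = defaultdict(list)
    neg: Dict[str, List[str]] = defaultdict(list)
    for query_id, doc_id, relevance in judgments:
        target = pos if relevance > 0 else neg
        target[query_id].append(passages[doc_id])

    records = []
    for query_id, text in queries.items():
        # Assumption: a query with no positive passage yields no training record.
        if pos.get(query_id):
            records.append(
                {"query": text, "pos": pos[query_id], "neg": neg.get(query_id, [])}
            )
    return records


queries = {"q1": "מה הזכויות של שוכרי דירה?"}
passages = {"p1": "relevant passage", "p2": "irrelevant passage"}
judgments = [("q1", "p1", 3), ("q1", "p2", 0)]

for record in build_training_records(judgments, queries, passages):
    # ensure_ascii=False keeps the Hebrew text readable in the JSONL file.
    print(json.dumps(record, ensure_ascii=False))
```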

**Embedder (`bge-m3`):** Trained with **KL-divergence loss** to produce embeddings that better separate relevant from irrelevant documents.
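
As a rough illustration of a KL-divergence objective over one query's candidates, the sketch below compares a target distribution with the softmax of similarity scores. How the targets were actually derived is not stated here; spreading uniform mass over the positive passages is an assumption, as are the temperature value and all names.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def target_from_labels(labels: Sequence[int]) -> List[float]:
    """Assumed target: uniform probability mass over the positive passages."""
    positives = sum(labels)
    return [label / positives for label in labels]


def kl_divergence(target: Sequence[float], predicted: Sequence[float]) -> float:
    """KL(target || predicted) = sum_i t_i * log(t_i / p_i), skipping t_i = 0."""
    return sum(
        t * math.log(t / max(p, 1e-12)) for t, p in zip(target, predicted) if t > 0
    )


labels = [1, 1, 0, 0]                      # two positives, two negatives
target = target_from_labels(labels)
well_separated = softmax([0.9, 0.8, 0.1, 0.0], temperature=0.1)
poorly_separated = softmax([0.1, 0.0, 0.9, 0.8], temperature=0.1)

print(kl_divergence(target, well_separated))    # smaller loss
print(kl_divergence(target, poorly_separated))  # larger loss
```

The loss shrinks as the embedder assigns higher similarity to positives than to negatives, which is the separation effect described above.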

**Reranker (`bge-reranker-v2-m3`):** Trained as a **binary classifier** on the same positive/negative pairs, learning to predict relevance probability directly.

| Hyperparameter | Value |
|---|---|
| Epochs | 2 |
| Batch size per device | 2 |
| Learning rate | 5e-6 |
| Hardware | 2 × Nvidia Tesla V100-SXM2-32GB |
| Training time | ~1 hour |

---

## Included Models (fine-tuned)

| Path in repo | Base model | Fine-tuning |
|---|---|---|
| `models/test_encoder_only_base_bge_m3_new1/` | `BAAI/bge-m3` | KL-divergence loss on competition data ✨ |
| `models/test_encoder_only_base_bge_reranker_v2_m3_new1/` | `BAAI/bge-reranker-v2-m3` | Binary classification on competition data ✨ |

---

## Repository Structure

```
model.py      ← Full inference pipeline (preprocess + predict)
prepare.py    ← Data preparation script
train.sh      ← Training script
models/
  test_encoder_only_base_bge_m3_new1/                  ← Fine-tuned BGE-M3 embedder ✨
  test_encoder_only_base_bge_reranker_v2_m3_new1/      ← Fine-tuned BGE reranker ✨
```

---

## Usage

The pipeline exposes two functions matching the competition API:

```python
from model import preprocess, predict

# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)

# Query at inference time
results = predict({"query": "מה הזכויות של שוכרי דירה?"}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 1.23}, ...]  (top-20)
```

**Requirements:**
```
torch
transformers
numpy
```

**Hardware:** A CUDA-capable GPU is required. Inference takes less than 1.5 hours on a `g5.xlarge` instance.

---

## Reproducing the Models

**1. Prepare data:**
```bash
# Download competition data and unzip into `hsrc/` folder
python prepare.py
```

**2. Train:**
```bash
sh ./train.sh
```
Training takes ~1 hour on 2 × V100-SXM2-32GB GPUs.

---

## Technical Notes

- Both models are loaded in **FP16** (`torch_dtype=torch.float16`), with `device_map` handling automatic GPU placement.
- Corpus passages are **sorted by length** before embedding to reduce padding overhead during batch encoding.
- The reranker also sorts candidates by passage length before scoring batches.
- Fallback: if reranking fails, the pipeline falls back to returning the top-20 by dense retrieval score.
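
The fallback in the last note could be structured as in the sketch below. The function name, the `rerank_fn` callable, and the choice to catch every `Exception` are assumptions; the output fields follow the `paragraph_uuid`/`score` format shown in the Usage section.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Hit = Tuple[str, float]  # (paragraph_uuid, score)


def predict_with_fallback(
    query: str,
    dense_hits: Sequence[Hit],  # Stage 1 output, sorted by descending dense score
    rerank_fn: Callable[[str, Sequence[Hit]], List[Hit]],
    top_k: int = 20,
) -> List[Dict[str, object]]:
    """Return reranked results, or the dense ranking if the reranker raises."""
    try:
        ranked = rerank_fn(query, dense_hits)
    except Exception:
        # Assumption: any reranker failure (for example, running out of GPU
        # memory) falls back to the dense retrieval order.
        ranked = list(dense_hits)
    return [
        {"paragraph_uuid": doc_id, "score": float(score)}
        for doc_id, score in ranked[:top_k]
    ]


def failing_reranker(query: str, hits: Sequence[Hit]) -> List[Hit]:
    """Toy reranker that always fails, to exercise the fallback path."""
    raise RuntimeError("simulated reranker failure")


dense_hits = [(f"doc-{i}", 1.0 - 0.01 * i) for i in range(100)]
results = predict_with_fallback("example query", dense_hits, failing_reranker)
print(len(results), results[0])
```

Note that scores returned on the fallback path are dense similarities rather than reranker logits, so they are not comparable across the two paths.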

---

## Results

| Phase | NDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | **0.432286** | 🥉 3rd |
| Private (Phase II) | **0.652538** | 🥉 3rd |

> The large gap between public and private scores reflects the additional human annotation performed in the private phase: retrieved paragraphs that previously had no relevance label were judged, which significantly raised NDCG for systems that had retrieved relevant but unannotated paragraphs.

---

## Citation

If you use this solution or the models in this repository, please acknowledge the **Hebrew Semantic Retrieval Challenge** by MAFAT DDR&D and the Israel National NLP Program, and credit **kdbrodt** as the solution author.

---

## Acknowledgements

- MAFAT DDR&D and the **Israel National NLP Program** for organizing the challenge and providing the annotated Hebrew corpus.
- The authors of `BAAI/bge-m3` and `BAAI/bge-reranker-v2-m3`.
- The [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding) team for the training framework.