---
language:
- he
tags:
- hebrew
- semantic-retrieval
- information-retrieval
- dense-retrieval
- reranking
- ensemble
- sentence-transformers
- competition
pipeline_tag: sentence-similarity
license: other
---

# Hebrew Semantic Retrieval: 1st Place Solution

**Competition:** Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the **Israel National NLP Program**

**Result:** 🥇 **1st place**, NDCG@20 = **0.6736** (private test set)

**Author:** victord

---

## Overview

This repository contains the complete inference code and fine-tuned models for the winning solution to the **Hebrew Semantic Retrieval Challenge**. The challenge tasked participants with building a semantic retrieval system capable of ranking Hebrew paragraphs from a large-scale corpus (127,731 paragraphs) in response to natural-language Hebrew queries, evaluated by **NDCG@20**.

Hebrew is a morphologically rich, Semitic language written in an almost consonant-only script, which creates high lexical ambiguity and makes retrieval significantly harder than in English or other high-resource languages. The challenge was designed to close this gap and advance Hebrew NLP for domains such as government services, law, academia, and the public sector.

---

## The Challenge

| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | NDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0–4 (human annotated) |

Ground-truth labels were produced in two stages: a semantic retrieval model first retrieved the top-20 candidates per query, then human annotators rated them on a 0–4 relevance scale.
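
For reference, NDCG@k over graded labels can be sketched as below. This is a minimal illustration that assumes linear gains (the raw 0 to 4 grade) and a log2 position discount; the organizers' exact gain function is not stated here, and the function names `dcg_at_k` and `ndcg_at_k` are hypothetical.

```python
import math


def dcg_at_k(relevances, k=20):
    """Discounted cumulative gain of graded labels given in rank order."""
    # Rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances=None, k=20):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: grades of the returned documents, in rank order.
    all_relevances: grades of every judged document for the query
                    (assumption: defaults to the returned ones).
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists documents in descending grade order scores 1.0; pushing a highly rated paragraph down the list lowers the score.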

---

## Solution Architecture

The solution is a classic **two-stage retrieve-then-rerank pipeline**, built on top of a large ensemble of multilingual and Hebrew-specialized embedding models, combined with a sparse BM25 stage.

```
Query
  │
  ├─► [Dense Retriever ×6]  ───┐
  │                            ├─► Score Fusion (weighted, z-normalized)
  └─► [BM25s Sparse]  ─────────┘
            │
            ▼
     Top-250 Candidates
            │
            ▼
     [BGE Cross-Encoder Reranker]  (fine-tuned)
            │
            ▼
     Final Top-20 Results (ranked by fused score)
```

### Stage 1: Ensemble Dense + Sparse Retrieval

Six dense embedding models run in parallel. Each produces per-document cosine-similarity scores, which are **z-score normalized** (using pre-computed corpus statistics) and **linearly fused** with learned weights. BM25s contributes a 15 % weight in the final fusion.

| Model | Role | Pooling | Max Length |
|---|---|---|---|
| `multilingual-e5-large` (pseudo-fine-tuned) | Primary dense retriever | Mean pooling + L2 norm | 512 |
| `multilingual-e5-large-instruct` | Instruct-style dense retriever | Mean pooling + L2 norm | 512 |
| `BAAI/bge-m3` | Multilingual dense retriever | CLS token + L2 norm | 512 |
| `Snowflake/snowflake-arctic-embed-l-v2.0` | Multilingual dense retriever | CLS token + L2 norm | 1024 |
| `OrdalieTech/Solon-embeddings-large-0.1` | Multilingual dense retriever | Mean pooling + L2 norm | 512 |
| `Webiks/Hebrew-RAGbot-KolZchut-QA-Embedder-v1.0` | Hebrew-specialized retriever | Mean pooling + L2 norm | 512 |
| **BM25s** | Sparse lexical retriever | n/a | n/a |

**Retriever fusion weights (normalized):**

| Retriever | Weight |
|---|---|
| E5-large (pseudo-tuned) | 1.10 |
| E5-large-instruct | 0.25 |
| BGE-M3 | 0.20 |
| Snowflake Arctic | 0.30 |
| Solon | 0.30 |
| Hebrew RAGbot | 0.30 |
| BM25s | 15 % blended into final fusion |
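
The fusion step might look roughly like the sketch below. It is not taken from `model.py`, and it rests on several assumptions: the dictionary keys are hypothetical labels for the six retrievers, the dense weights are rescaled to sum to 1, BM25 scores are z-normalized too and blended as 85% dense / 15% sparse, and the normalization statistics are computed from the scores passed in unless a pre-computed `mean` and `std` are supplied.

```python
import statistics

# Dense weights from the table above; the keys are hypothetical labels.
DENSE_WEIGHTS = {
    "e5_pseudo": 1.10,
    "e5_instruct": 0.25,
    "bge_m3": 0.20,
    "arctic": 0.30,
    "solon": 0.30,
    "hebrew_ragbot": 0.30,
}
BM25_SHARE = 0.15  # stated BM25s share of the final fusion


def z_normalize(scores, mean=None, std=None):
    """Standardize scores; pass pre-computed statistics to skip estimating them."""
    mean = statistics.fmean(scores) if mean is None else mean
    std = statistics.pstdev(scores) if std is None else std
    std = std or 1.0  # guard against zero variance
    return [(s - mean) / std for s in scores]


def fuse(dense_scores, bm25_scores, weights=DENSE_WEIGHTS, bm25_share=BM25_SHARE):
    """dense_scores: {retriever name: [score per doc]}; bm25_scores: [score per doc]."""
    total = sum(weights[name] for name in dense_scores)
    dense = [0.0] * len(bm25_scores)
    for name, scores in dense_scores.items():
        w = weights[name] / total  # assumption: weights rescaled to sum to 1
        for i, z in enumerate(z_normalize(scores)):
            dense[i] += w * z
    sparse = z_normalize(bm25_scores)
    # Assumption: the 15% BM25 share is a convex blend with the dense fusion.
    return [(1 - bm25_share) * d + bm25_share * s for d, s in zip(dense, sparse)]
```

Standardizing before weighting keeps one retriever's score scale (for example, unbounded BM25 values versus cosine similarities) from dominating the sum.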

**Long-document handling:** For passages exceeding the model's max context length, a sliding-window chunking strategy with 50 % overlap is applied at the token level, and the maximum chunk score is used to represent the document.
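
A minimal sketch of the windowing and max-score aggregation described above, assuming the passage is already tokenized into a list of ids. The helper names `sliding_windows` and `document_score` are hypothetical, and a real pipeline would also leave room in `max_len` for special tokens.

```python
def sliding_windows(token_ids, max_len=512, overlap=0.5):
    """Split token ids into windows of at most max_len tokens with fractional overlap."""
    if len(token_ids) <= max_len:
        return [token_ids]
    stride = max(1, int(max_len * (1 - overlap)))  # 50% overlap -> stride of max_len / 2
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the passage
    return windows


def document_score(chunk_scores):
    """Represent a document by its best-matching chunk."""
    return max(chunk_scores)
```

Taking the maximum means a long passage is judged by its most relevant window, so a single on-topic section is not diluted by the rest of the text.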

### Stage 2: Cross-Encoder Reranking

The top-250 candidates from Stage 1 are reranked by a **fine-tuned BGE cross-encoder** (`bge-reranker-v2-m3`, pseudo-fine-tuned on the challenge corpus). The reranker operates with a max sequence length of 2048 tokens using the same sliding-window + max-score strategy for long documents.

The final score is a blend of the reranker score and the Stage 1 fusion score:

$$\text{score}_\text{final} = 0.35 \cdot \hat{s}_\text{reranker} + 0.65 \cdot s_\text{fusion}$$

where $\hat{s}_\text{reranker}$ is z-score normalized. The top-20 documents by this blended score are returned.
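
The blend can be illustrated with the short sketch below. It assumes the reranker scores are standardized over the candidate set of the current query (the source does not say which statistics are used), and `blend_and_rank` is a hypothetical helper name; the output mirrors the `paragraph_uuid` / `score` format returned by `predict`.

```python
import statistics


def blend_and_rank(doc_ids, reranker_scores, fusion_scores, w_rerank=0.35, top_k=20):
    """Blend z-normalized reranker scores with Stage 1 fusion scores and keep the top_k."""
    mean = statistics.fmean(reranker_scores)
    std = statistics.pstdev(reranker_scores) or 1.0  # guard against zero variance
    blended = [
        w_rerank * (r - mean) / std + (1 - w_rerank) * f
        for r, f in zip(reranker_scores, fusion_scores)
    ]
    ranked = sorted(zip(doc_ids, blended), key=lambda pair: pair[1], reverse=True)
    return [{"paragraph_uuid": d, "score": s} for d, s in ranked[:top_k]]
```

With a 0.65 weight on the Stage 1 score, the reranker adjusts the ensemble ordering rather than replacing it.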

---

## Included Models (fine-tuned)

| Path in repo | Base model | Fine-tuning |
|---|---|---|
| `models/multilingual-e5-large_pseudo_full/` | `intfloat/multilingual-e5-large` | Pseudo-label fine-tuning on the challenge corpus |
| `models/bge-reranker-v2-m3_pseudo_tune_full/` | `BAAI/bge-reranker-v2-m3` | Pseudo-label fine-tuning on the challenge corpus |

The remaining models (`bge-m3`, `multilingual-e5-large-instruct`, `snowflake-arctic-embed-l-v2.0`, `Solon-embeddings-large-0.1`, `Webiks_Hebrew_RAGbot_KolZchut_QA_Embedder_v1.0`) are used as-is (no additional fine-tuning).

---

## Repository Structure

```
model.py              ← Full inference pipeline (preprocess + predict)
models/
  bge-m3/
  bge-reranker-v2-m3_pseudo_tune_full/   ← Fine-tuned reranker ✨
  multilingual-e5-large_pseudo_full/     ← Fine-tuned embedder ✨
  multilingual-e5-large-instruct/
  snowflake-arctic-embed-l-v2.0/
  Solon-embeddings-large-0.1/
  Webiks_Hebrew_RAGbot_KolZchut_QA_Embedder_v1.0/
```

---

## Usage

The pipeline exposes two functions that match the competition API:

```python
from model import preprocess, predict

# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)

# Query at inference time (Hebrew for "What are the rights of apartment renters?")
results = predict({"query": "מה הזכויות של שוכרי דירה?"}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 0.92}, ...]  (top-20)
```

**Requirements:**
```
torch
transformers
sentence-transformers
bm25s
scikit-learn
numpy
```

A CUDA-capable GPU is strongly recommended: the pipeline keeps the six dense retrievers and the cross-encoder reranker loaded at the same time.

---

## Technical Notes

- All models are loaded in **bfloat16** precision to reduce GPU memory footprint.
- **Offline mode** is enforced at runtime (`HF_HUB_OFFLINE=1`), so all model weights must be present locally.
- BM25s tokenization uses the default `bm25s` tokenizer with no additional Hebrew-specific pre-processing.
- The pipeline is time-budgeted: the reranker respects a ~1.85 s per-query wall-clock limit and will skip remaining batches if the budget is exceeded, gracefully falling back to Stage 1 scores.
- CUDA memory is proactively freed between batches; OOM errors trigger single-sample fallback processing.
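
The time-budget behaviour could be sketched as follows. This is an illustration under assumptions, not the code in `model.py`: `rerank_with_budget`, its parameters, and the injectable `clock` are hypothetical, and `score_batch` stands in for the cross-encoder forward pass.

```python
import time


def rerank_with_budget(batches, score_batch, budget_s=1.85, clock=time.monotonic):
    """Score candidate batches until the wall-clock budget runs out.

    batches: list of lists of document ids.
    score_batch: callable returning one score per id in the batch.
    Returns {doc_id: reranker score} for the batches that were processed.
    """
    deadline = clock() + budget_s
    reranked = {}
    for batch in batches:
        if clock() >= deadline:
            break  # budget exhausted: skip the remaining batches
        for doc_id, score in zip(batch, score_batch(batch)):
            reranked[doc_id] = score
    return reranked
```

Candidates missing from the returned dictionary would keep their Stage 1 score, which is the fallback described in the notes above.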

---

## Results

| Phase | NDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | **0.456235** | 🥇 1st |
| Private (Phase II) | **0.6736** | 🥇 1st |

---

## Citation

If you use this solution or the models in this repository, please acknowledge the **Hebrew Semantic Retrieval Challenge** by MAFAT DDR&D and the Israel National NLP Program, and credit **victord** as the solution author.

---

## Acknowledgements

- MAFAT DDR&D and the **Israel National NLP Program** for organizing the challenge and providing the annotated Hebrew corpus.
- [Webiks](https://www.webiks.com/) for the `Hebrew-RAGbot-KolZchut-QA-Embedder-v1.0` model.
- The authors of `multilingual-e5-large`, `bge-m3`, `bge-reranker-v2-m3`, `snowflake-arctic-embed-l-v2.0`, and `Solon-embeddings-large-0.1`.