---
language:
- he
tags:
- hebrew
- semantic-retrieval
- information-retrieval
- dense-retrieval
- reranking
- bge-m3
- sentence-transformers
- competition
pipeline_tag: sentence-similarity
license: other
---
# Hebrew Semantic Retrieval — 3rd Place Solution
**Competition:** Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the **Israel National NLP Program**
**Result:** 🥉 **3rd place**, NDCG@20 = **0.652538** (private test set) · **0.432286** (public test set)
**Author:** kdbrodt
---
## Overview
This repository contains the complete inference code and fine-tuned models for the 3rd-place solution to the **Hebrew Semantic Retrieval Challenge**. The challenge tasked participants with ranking Hebrew paragraphs from a 127,731-passage corpus in response to natural-language Hebrew queries, evaluated by **NDCG@20**.
The solution is an end-to-end, two-stage retrieve-then-rerank pipeline built on the BGE-M3 model family (`BAAI/bge-m3` and `BAAI/bge-reranker-v2-m3`) and the [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding) framework. Both the dense embedder and the cross-encoder reranker were fine-tuned directly on the competition's annotated Hebrew data.
---
## The Challenge
| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | NDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0–4 (human annotated) |
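
For reference, NDCG@k is commonly defined as below, with $rel_i$ the annotated relevance (0 to 4) of the passage at rank $i$ and $\mathrm{IDCG@}k$ the DCG of the ideal ordering of the judged passages. The linear gain shown here is an assumption; the challenge's exact gain function (linear or exponential) is not stated in this card.

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k},
\qquad k = 20
```
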
---
## Solution Architecture
A straightforward two-stage pipeline: dense retrieval followed by cross-encoder reranking.
```
Query
│
▼
[BGE-M3 Dense Retriever] (fine-tuned, CLS pooling, FP16)
│ cosine similarity over 127k passages
▼
Top-100 Candidates
│
▼
[BGE-Reranker-v2-M3] (fine-tuned binary classifier, FP16)
│ query-passage pairs scored, max_length=512
▼
Final Top-20 Results
```
### Stage 1 — Dense Retrieval
The fine-tuned `bge-m3` encoder produces **CLS-token embeddings** (L2-normalized, FP16) for all corpus passages at preprocessing time. At query time, a single query embedding is computed and scored against all corpus embeddings via **dot-product similarity** (equivalent to cosine similarity on normalized vectors). The top-100 passages are selected for reranking.
| Property | Value |
|---|---|
| Model | `test_encoder_only_base_bge_m3_new1` (fine-tuned `BAAI/bge-m3`) |
| Pooling | CLS token |
| Normalization | L2 |
| Precision | FP16 |
| Max length | 512 tokens |
| Batch size (corpus) | 64 |
| Retrieval pool | Top-100 candidates |
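
The scoring step described above can be illustrated with a dependency-free sketch. It is not the code in `model.py`, which runs on GPU tensors; the function names and the toy 3-dimensional vectors are hypothetical. It shows that once vectors are L2-normalized, ranking by dot product is ranking by cosine similarity.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(query_vec, corpus_vecs, k=100):
    """Return (doc_id, score) pairs for the k highest dot-product scores.

    corpus_vecs maps doc_id -> already L2-normalized embedding.
    """
    q = l2_normalize(query_vec)
    scored = ((doc_id, dot(q, vec)) for doc_id, vec in corpus_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy "embeddings" (hypothetical values, not model output).
corpus = {
    "doc_a": l2_normalize([1.0, 0.0, 0.0]),
    "doc_b": l2_normalize([1.0, 1.0, 0.0]),
    "doc_c": l2_normalize([0.0, 0.0, 1.0]),
}
top = retrieve_top_k([2.0, 0.0, 0.0], corpus, k=2)
print(top)
```

With 127k passages the same computation would normally be a single matrix-vector product followed by a top-k selection; the loop here only keeps the example readable.
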
### Stage 2 — Cross-Encoder Reranking
The top-100 candidates are re-scored by the fine-tuned `bge-reranker-v2-m3`, a sequence classification model that takes concatenated `[query, passage]` pairs as input and outputs a relevance logit. Passages are sorted by length before scoring to minimize padding overhead. The top-20 by reranker score are returned.
| Property | Value |
|---|---|
| Model | `test_encoder_only_base_bge_reranker_v2_m3_new1` (fine-tuned `BAAI/bge-reranker-v2-m3`) |
| Max length | 512 tokens |
| Batch size | 16 |
| Output | Top-20 by reranker logit |
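
A minimal sketch of the length-sorted batching and final cut described above. The `score_batch` callable is a hypothetical stand-in for the fine-tuned reranker, and `toy_scorer` is only a word-overlap placeholder so the example runs without a model. Sorting by length changes how candidates are grouped into batches, not the final ranking, which depends only on the scores.

```python
def rerank(query, candidates, score_batch, batch_size=16, top_n=20):
    """Re-score candidates with a cross-encoder-style scorer.

    candidates: list of (doc_id, passage) pairs from stage 1.
    score_batch: callable taking a list of (query, passage) pairs and
        returning one relevance score per pair (hypothetical interface).
    """
    # Sort by passage length so each batch holds similarly sized inputs,
    # which keeps padding small when a real model tokenizes the batch.
    ordered = sorted(candidates, key=lambda c: len(c[1]))

    scored = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        scores = score_batch([(query, passage) for _, passage in batch])
        scored.extend((doc_id, s) for (doc_id, _), s in zip(batch, scores))

    # Highest score first, then keep the top_n.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_scorer(pairs):
    """Stand-in scorer: counts query words that appear in the passage."""
    return [float(sum(w in p.split() for w in q.split())) for q, p in pairs]


candidates = [
    ("d1", "tenant rights when renting an apartment"),
    ("d2", "weather"),
    ("d3", "rights of a tenant"),
]
ranked = rerank("tenant rights", candidates, toy_scorer, batch_size=2, top_n=2)
print(ranked)
```
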
---
## Fine-Tuning
Both models were fine-tuned on the competition's annotated Hebrew training set using the [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding) framework.
**Training data construction:**
- Every query–document pair with a **positive relevance score (> 0)** was treated as a positive example.
- Every pair with a **score of 0** was treated as a negative example.
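
The labeling rule above could be sketched as follows. The `query`/`pos`/`neg` keys follow the JSONL layout used in FlagEmbedding's fine-tuning examples; whether `prepare.py` emits exactly this layout is an assumption, as are the function name, the input tuple format, and the choice to drop queries that have no positive passage.

```python
import json
from collections import defaultdict


def build_training_records(annotations, passages):
    """Group annotated pairs into one record per query.

    annotations: iterable of (query, doc_id, relevance), relevance in 0..4.
    passages: dict mapping doc_id -> passage text.
    """
    grouped = defaultdict(lambda: {"pos": [], "neg": []})
    for query, doc_id, relevance in annotations:
        # Score > 0 is a positive, score == 0 is a negative.
        bucket = "pos" if relevance > 0 else "neg"
        grouped[query][bucket].append(passages[doc_id])

    # Assumption of this sketch: queries without any positive are skipped.
    return [
        {"query": q, "pos": g["pos"], "neg": g["neg"]}
        for q, g in grouped.items()
        if g["pos"]
    ]


# Toy annotations (hypothetical, not competition data).
passages = {"a": "passage A", "b": "passage B", "c": "passage C"}
annotations = [("q1", "a", 3), ("q1", "b", 0), ("q1", "c", 1), ("q2", "b", 0)]

records = build_training_records(annotations, passages)
for record in records:
    # ensure_ascii=False keeps Hebrew text readable in the output file.
    print(json.dumps(record, ensure_ascii=False))
```
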
**Embedder (`bge-m3`):** Trained with **KL-divergence loss** to produce embeddings that better separate relevant from irrelevant documents.
**Reranker (`bge-reranker-v2-m3`):** Trained as a **binary classifier** on the same positive/negative pairs, learning to predict relevance probability directly.
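
To make the two objectives concrete, here is a small numeric sketch of a KL-divergence loss over a candidate list and a binary cross-entropy loss on a single logit. It is illustrative only: this card does not state which target distribution the KL loss was computed against, so the one-hot target below is an assumption, and `train.sh` may configure temperature, grouping, and other details differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target_probs, predicted_probs, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target_probs, predicted_probs)
        if t > 0
    )


def bce_with_logit(logit, label):
    """Binary cross-entropy for one example, given a raw logit and a 0/1 label."""
    # Overflow-safe form: max(x, 0) - x * y + log(1 + exp(-|x|)).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Embedder view: one positive and two negatives for a query.
target = [1.0, 0.0, 0.0]  # assumed target distribution
weak = kl_divergence(target, softmax([0.9, 0.2, 0.1]))
strong = kl_divergence(target, softmax([5.0, 0.0, 0.0]))
print(f"KL loss: weak separation {weak:.3f}, strong separation {strong:.3f}")

# Reranker view: a positive pair scored with a high or a low logit.
print(f"BCE: logit +4 -> {bce_with_logit(4.0, 1):.3f}, "
      f"logit -4 -> {bce_with_logit(-4.0, 1):.3f}")
```
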
| Hyperparameter | Value |
|---|---|
| Epochs | 2 |
| Batch size per device | 2 |
| Learning rate | 5e-6 |
| Hardware | 2 × Nvidia Tesla V100-SXM2-32GB |
| Training time | ~1 hour |
---
## Included Models (fine-tuned)
| Path in repo | Base model | Fine-tuning |
|---|---|---|
| `models/test_encoder_only_base_bge_m3_new1/` | `BAAI/bge-m3` | KL-divergence loss on competition data ✨ |
| `models/test_encoder_only_base_bge_reranker_v2_m3_new1/` | `BAAI/bge-reranker-v2-m3` | Binary classification on competition data ✨ |
---
## Repository Structure
```
model.py ← Full inference pipeline (preprocess + predict)
prepare.py ← Data preparation script
train.sh ← Training script
models/
test_encoder_only_base_bge_m3_new1/ ← Fine-tuned BGE-M3 embedder ✨
test_encoder_only_base_bge_reranker_v2_m3_new1/ ← Fine-tuned BGE reranker ✨
```
---
## Usage
The pipeline exposes two functions matching the competition API:
```python
from model import preprocess, predict
# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)
# Query at inference time
results = predict({"query": "מה הזכויות של שוכרי דירה?"}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 1.23}, ...] (top-20)
```
**Requirements:**
```
torch
transformers
numpy
```
**Hardware:** A CUDA-capable GPU is required. Inference takes less than 1.5 hours on a `g5.xlarge` instance.
---
## Reproducing the Models
**1. Prepare data:**
```bash
# Download competition data and unzip into `hsrc/` folder
python prepare.py
```
**2. Train:**
```bash
sh ./train.sh
```
Training takes ~1 hour on 2 × V100-SXM2-32GB GPUs.
---
## Technical Notes
- Both models are loaded in **FP16** via `torch_dtype=torch.float16` and `device_map` for automatic GPU placement.
- Corpus passages are **sorted by length** before embedding to reduce padding overhead during batch encoding.
- The reranker also sorts candidates by passage length before scoring batches.
- Fallback: if reranking fails, the pipeline falls back to returning the top-20 by dense retrieval score.
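
The fallback behaviour in the last note might look roughly like this. The function name, the `reranker` callable interface, and the broad `except` are assumptions of the sketch rather than a description of `model.py`; only the output keys `paragraph_uuid` and `score` are taken from the usage example in this card.

```python
def rank_with_fallback(query, dense_hits, reranker, top_n=20):
    """Return top_n results, preferring reranker scores when available.

    dense_hits: list of (doc_id, passage, dense_score), best first.
    reranker: callable (query, passages) -> list of scores; may raise.
    """
    try:
        scores = reranker(query, [passage for _, passage, _ in dense_hits])
        if len(scores) != len(dense_hits):
            raise ValueError("reranker returned the wrong number of scores")
        ranked = sorted(
            zip((doc_id for doc_id, _, _ in dense_hits), scores),
            key=lambda pair: pair[1],
            reverse=True,
        )
    except Exception:
        # Any reranker failure (for example out of memory): keep stage-1 order.
        ranked = [(doc_id, score) for doc_id, _, score in dense_hits]
    return [
        {"paragraph_uuid": doc_id, "score": float(score)}
        for doc_id, score in ranked[:top_n]
    ]


def broken_reranker(query, passages):
    """Simulates a reranker that fails at inference time."""
    raise RuntimeError("simulated failure")


def reversing_reranker(query, passages):
    """Toy reranker that prefers later candidates."""
    return [float(i) for i in range(len(passages))]


hits = [("d1", "first passage", 0.9), ("d2", "second passage", 0.8)]
fallback = rank_with_fallback("q", hits, broken_reranker)
reranked = rank_with_fallback("q", hits, reversing_reranker)
print(fallback)
print(reranked)
```
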
---
## Results
| Phase | NDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | **0.432286** | 🥉 3rd |
| Private (Phase II) | **0.652538** | 🥉 3rd |
> The large gap between public and private scores reflects the private phase's additional human annotation of retrieved documents that had no annotation in the public phase. This raised NDCG substantially for systems that retrieved relevant but previously unannotated paragraphs.
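
A toy calculation of this effect, under two assumptions that are not confirmed by the organizers: linear gain in the DCG sum, and unannotated passages counted as relevance 0. The document IDs and judgments are invented. The same ranking is scored twice, once before and once after a retrieved passage receives a judgment.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain (an assumed variant)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg_at_k(ranked_ids, judgments, k=20):
    """NDCG@k where passages missing from `judgments` count as relevance 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


ranking = ["d1", "d2", "d3"]

# Public-phase style: d2 was never annotated, so it is scored as irrelevant.
public_judgments = {"d1": 2, "d3": 1, "d9": 4}
# Private-phase style: annotators judged d2 after seeing it retrieved.
private_judgments = {**public_judgments, "d2": 3}

public_score = ndcg_at_k(ranking, public_judgments)
private_score = ndcg_at_k(ranking, private_judgments)
print(f"public-style NDCG: {public_score:.3f}")
print(f"private-style NDCG: {private_score:.3f}")
```
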
---
## Citation
If you use this solution or the models in this repository, please acknowledge the **Hebrew Semantic Retrieval Challenge** by MAFAT DDR&D and the Israel National NLP Program, and credit **kdbrodt** as the solution author.
---
## Acknowledgements
- MAFAT DDR&D and the **Israel National NLP Program** for organizing the challenge and providing the annotated Hebrew corpus.
- The authors of `BAAI/bge-m3` and `BAAI/bge-reranker-v2-m3`.
- The [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding) team for the training framework.