---
language:
- he
tags:
- hebrew
- semantic-retrieval
- information-retrieval
- dense-retrieval
- reranking
- ensemble
- sentence-transformers
- competition
pipeline_tag: sentence-similarity
license: other
---
# Hebrew Semantic Retrieval: 1st Place Solution
**Competition:** Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the **Israel National NLP Program**
**Result:** 🥇 **1st place**, NDCG@20 = **0.6736** (private test set)
**Author:** victord
---
## Overview
This repository contains the complete inference code and fine-tuned models for the winning solution to the **Hebrew Semantic Retrieval Challenge**. The challenge tasked participants with building a semantic retrieval system capable of ranking Hebrew paragraphs from a large-scale corpus (127,731 paragraphs) in response to natural-language Hebrew queries, evaluated by **NDCG@20**.
Hebrew is a morphologically rich, Semitic language written in an almost consonant-only script, which creates high lexical ambiguity and makes retrieval significantly harder than in English or other high-resource languages. The challenge was designed to close this gap and advance Hebrew NLP for domains such as government services, law, academia, and the public sector.
---
## The Challenge
| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | NDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0–4 (human annotated) |
Ground-truth labels were produced in two stages: a semantic retrieval model first retrieved the top-20 candidates per query, then human annotators rated them on a 0–4 relevance scale.
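For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k over graded relevance labels. The source does not state which gain function the organizers used, so the linear gain below (grade divided by a log2 rank discount) is an assumption, and the example grades are made up.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k grades (linear gain, assumed)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=20):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents were judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical 0-4 grades of the documents a system returned, in ranked order.
system_ranking = [4, 0, 3, 2]
# All judged grades for the same query, in any order.
judged = [4, 3, 2, 0]
print(round(ndcg_at_k(system_ranking, judged), 4))
```

A perfect ordering scores 1.0; placing an irrelevant document at rank 2, as in the example, lowers the score because every later relevant document is discounted more heavily.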
---
## Solution Architecture
The solution is a classic **two-stage retrieve-then-rerank pipeline**, built on top of a large ensemble of multilingual and Hebrew-specialized embedding models, combined with a sparse BM25 stage.
```
Query
  │
  ├──► [Dense Retriever ×6] ──┐
  │                           ├──► Score Fusion (weighted, z-normalized)
  └──► [BM25s Sparse] ────────┘
                                       │
                                       ▼
                               Top-250 Candidates
                                       │
                                       ▼
                    [BGE Cross-Encoder Reranker]  (fine-tuned)
                                       │
                                       ▼
                    Final Top-20 Results  (ranked by fused score)
```
### Stage 1: Ensemble Dense + Sparse Retrieval
Six dense embedding models run in parallel. Each produces per-document cosine-similarity scores, which are **z-score normalized** (using pre-computed corpus statistics) and **linearly fused** with learned weights. BM25s contributes a 15% weight in the final fusion.
| Model | Role | Pooling | Max Length |
|---|---|---|---|
| `multilingual-e5-large` (pseudo-fine-tuned) | Primary dense retriever | Mean pooling + L2 norm | 512 |
| `multilingual-e5-large-instruct` | Instruct-style dense retriever | Mean pooling + L2 norm | 512 |
| `BAAI/bge-m3` | Multilingual dense retriever | CLS token + L2 norm | 512 |
| `Snowflake/snowflake-arctic-embed-l-v2.0` | Multilingual dense retriever | CLS token + L2 norm | 1024 |
| `OrdalieTech/Solon-embeddings-large-0.1` | Multilingual dense retriever | Mean pooling + L2 norm | 512 |
| `Webiks/Hebrew-RAGbot-KolZchut-QA-Embedder-v1.0` | Hebrew-specialized retriever | Mean pooling + L2 norm | 512 |
| **BM25s** | Sparse lexical retriever | n/a | n/a |
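The two pooling strategies in the table can be illustrated on plain Python lists. This is only a sketch of the arithmetic: the real pipeline works on model output tensors, and the function names here are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of real tokens, ignoring padding (mask value 0)."""
    kept = [emb for emb, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(emb[d] for emb in kept) / len(kept) for d in range(dim)]


def cls_pool(token_embeddings):
    """Use the embedding of the first ([CLS]) token as the sequence embedding."""
    return list(token_embeddings[0])


# Toy sequence: two real tokens followed by one padding token.
tokens = [[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(l2_normalize(mean_pool(tokens, mask)))
print(l2_normalize(cls_pool(tokens)))
```

Masking matters for mean pooling: without it, the padding vector in the toy example would dominate the average.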
**Retriever fusion weights (normalized):**
| Retriever | Weight |
|---|---|
| E5-large (pseudo-tuned) | 1.10 |
| E5-large-instruct | 0.25 |
| BGE-M3 | 0.20 |
| Snowflake Arctic | 0.30 |
| Solon | 0.30 |
| Hebrew RAGbot | 0.30 |
| BM25s | 15% blended into final fusion |
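The fusion step might look roughly like the sketch below. The weights come from the table above, but several details are assumptions because the source does not spell them out: the dictionary keys are hypothetical names, the dense weights are renormalized to sum to 1, and the 15% BM25s share is applied as a convex combination with the dense score.

```python
# Dense weights from the table; the key names are hypothetical.
DENSE_WEIGHTS = {
    "e5_large_pseudo": 1.10,
    "e5_large_instruct": 0.25,
    "bge_m3": 0.20,
    "snowflake_arctic": 0.30,
    "solon": 0.30,
    "hebrew_ragbot": 0.30,
}
BM25_SHARE = 0.15  # assumed to be a convex blend with the dense score


def z_normalize(scores, mean, std):
    """Standardize scores with pre-computed statistics; guard against std == 0."""
    std = std if std > 0 else 1.0
    return [(s - mean) / std for s in scores]


def fuse(dense_scores, dense_stats, bm25_scores, bm25_stats):
    """Weighted sum of z-normalized dense scores, then a 15% BM25 blend."""
    total_weight = sum(DENSE_WEIGHTS[name] for name in dense_scores)
    dense_fused = [0.0] * len(bm25_scores)
    for name, scores in dense_scores.items():
        weight = DENSE_WEIGHTS[name] / total_weight
        for i, z in enumerate(z_normalize(scores, *dense_stats[name])):
            dense_fused[i] += weight * z
    bm25_z = z_normalize(bm25_scores, *bm25_stats)
    return [(1 - BM25_SHARE) * d + BM25_SHARE * b for d, b in zip(dense_fused, bm25_z)]


# Made-up scores for three documents from two of the retrievers.
dense_scores = {"e5_large_pseudo": [0.82, 0.75, 0.60], "bge_m3": [0.55, 0.70, 0.40]}
dense_stats = {"e5_large_pseudo": (0.70, 0.10), "bge_m3": (0.50, 0.10)}  # (mean, std)
fused = fuse(dense_scores, dense_stats, bm25_scores=[3.0, 1.0, 0.0], bm25_stats=(1.0, 1.0))
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
print(ranking)
```

Z-normalizing before the weighted sum puts retrievers with different score ranges (cosine similarities versus unbounded BM25 scores) on a comparable scale, so the weights express relative trust rather than compensating for scale.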
**Long-document handling:** For passages exceeding the model's max context length, a sliding-window chunking strategy with 50% overlap is applied at the token level, and the maximum chunk score is used to represent the document.
### Stage 2: Cross-Encoder Reranking
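A minimal sketch of token-level sliding windows with max-score aggregation follows. The function names and the `score_chunk` callback are hypothetical, and a real implementation would also reserve room in each window for special tokens and any query prefix.

```python
def sliding_windows(token_ids, max_len, overlap=0.5):
    """Split token ids into windows of at most max_len with fractional overlap."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    stride = max(1, int(max_len * (1 - overlap)))
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the passage
        start += stride
    return windows


def document_score(token_ids, score_chunk, max_len, overlap=0.5):
    """Represent a long document by its best-scoring chunk."""
    return max(score_chunk(chunk) for chunk in sliding_windows(token_ids, max_len, overlap))


# Ten tokens, windows of four, 50% overlap: each window starts two tokens later.
for window in sliding_windows(list(range(10)), max_len=4):
    print(window)
```

Taking the maximum rather than the mean means one highly relevant chunk is enough for a long passage to rank well, at the cost of ignoring how much of the passage is off-topic.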
The top-250 candidates from Stage 1 are reranked by a **fine-tuned BGE cross-encoder** (`bge-reranker-v2-m3`, pseudo-fine-tuned on the challenge corpus). The reranker operates with a max sequence length of 2048 tokens using the same sliding-window + max-score strategy for long documents.
The final score is a blend of the reranker score and the Stage 1 fusion score:
$$\text{score}_\text{final} = 0.35 \cdot \hat{s}_\text{reranker} + 0.65 \cdot s_\text{fusion}$$
where $\hat{s}_\text{reranker}$ is z-score normalized. The top-20 documents by this blended score are returned.
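The blend can be sketched as follows. The 0.35 and 0.65 coefficients come from the formula above. It is an assumption that the reranker z-score statistics are computed over the candidate list of the current query, and the function name is hypothetical.

```python
import statistics


def blend_and_rank(doc_ids, reranker_scores, fusion_scores,
                   top_k=20, w_rerank=0.35, w_fusion=0.65):
    """Blend z-normalized reranker scores with Stage 1 fusion scores; return the top_k."""
    if not doc_ids:
        return []
    mean = statistics.fmean(reranker_scores)
    std = statistics.pstdev(reranker_scores) or 1.0  # avoid dividing by zero
    blended = [
        w_rerank * (r - mean) / std + w_fusion * f
        for r, f in zip(reranker_scores, fusion_scores)
    ]
    order = sorted(range(len(doc_ids)), key=lambda i: blended[i], reverse=True)
    return [(doc_ids[i], blended[i]) for i in order[:top_k]]


# Made-up candidates: "c" has a weak reranker score but a strong Stage 1 score.
top = blend_and_rank(["a", "b", "c"], [2.0, 0.0, -2.0], [0.0, 0.0, 3.0], top_k=2)
print([doc_id for doc_id, _ in top])
```

Because the fusion score carries the larger coefficient, a document the reranker dislikes can still stay in the top results when the Stage 1 ensemble ranks it strongly.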
---
## Included Models (fine-tuned)
| Path in repo | Base model | Fine-tuning |
|---|---|---|
| `models/multilingual-e5-large_pseudo_full/` | `intfloat/multilingual-e5-large` | Pseudo-label fine-tuning on the challenge corpus |
| `models/bge-reranker-v2-m3_pseudo_tune_full/` | `BAAI/bge-reranker-v2-m3` | Pseudo-label fine-tuning on the challenge corpus |
The remaining models (`bge-m3`, `multilingual-e5-large-instruct`, `snowflake-arctic-embed-l-v2.0`, `Solon-embeddings-large-0.1`, `Webiks_Hebrew_RAGbot_KolZchut_QA_Embedder_v1.0`) are used as-is (no additional fine-tuning).
---
## Repository Structure
```
model.py                                           ← Full inference pipeline (preprocess + predict)
models/
  bge-m3/
  bge-reranker-v2-m3_pseudo_tune_full/             ← Fine-tuned reranker ✨
  multilingual-e5-large_pseudo_full/               ← Fine-tuned embedder ✨
  multilingual-e5-large-instruct/
  snowflake-arctic-embed-l-v2.0/
  Solon-embeddings-large-0.1/
  Webiks_Hebrew_RAGbot_KolZchut_QA_Embedder_v1.0/
```
---
## Usage
The pipeline exposes two functions that match the competition API:
```python
from model import preprocess, predict
# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)
# Query at inference time (Hebrew for "What are the rights of apartment renters?")
results = predict({"query": "מה הזכויות של שוכרי דירה?"}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 0.92}, ...] (top-20)
```
**Requirements:**
```
torch
transformers
sentence-transformers
bm25s
scikit-learn
numpy
```
A CUDA-capable GPU is strongly recommended: the pipeline keeps six dense retrievers and the cross-encoder reranker loaded at the same time.
---
## Technical Notes
- All models are loaded in **bfloat16** precision to reduce GPU memory footprint.
- **Offline mode** is enforced at runtime (`HF_HUB_OFFLINE=1`), so all model weights must be present locally.
- BM25s tokenization uses the default `bm25s` tokenizer with no additional Hebrew-specific pre-processing.
- The pipeline is time-budgeted: the reranker respects a ~1.85 s per-query wall-clock limit and will skip remaining batches if the budget is exceeded, gracefully falling back to Stage 1 scores.
- CUDA memory is proactively freed between batches; OOM errors trigger single-sample fallback processing.
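The time-budget behaviour described above could be structured along these lines. This is a sketch, not the code in `model.py`: the function name, the `score_batch` callback, and the injectable `clock` are hypothetical, and the OOM fallback is left out.

```python
import time


def rerank_with_budget(batches, score_batch, budget_s=1.85, clock=time.monotonic):
    """Score batches until the wall-clock budget is spent.

    Returns (reranked, skipped): reranker scores for the documents that were
    processed, and the ids of documents that must keep their Stage 1 scores.
    """
    start = clock()
    reranked = {}
    skipped = []
    for index, batch in enumerate(batches):
        if clock() - start > budget_s:
            # Budget exceeded: everything not yet scored falls back to Stage 1.
            skipped = [doc_id for rest in batches[index:] for doc_id in rest]
            break
        for doc_id, score in zip(batch, score_batch(batch)):
            reranked[doc_id] = score
    return reranked, skipped


# Deterministic demo with a fake clock: the budget runs out before batch 2.
ticks = iter([0.0, 0.5, 2.0])
reranked, skipped = rerank_with_budget(
    [["a"], ["b"], ["c"]],
    score_batch=lambda batch: [1.0 for _ in batch],
    clock=lambda: next(ticks),
)
print(reranked, skipped)
```

Checking the clock before each batch rather than interrupting a batch in flight keeps the loop simple, but the actual latency can overshoot the budget by up to one batch's processing time.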
---
## Results
| Phase | NDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | **0.456235** | 🥇 1st |
| Private (Phase II) | **0.6736** | 🥇 1st |
---
## Citation
If you use this solution or the models in this repository, please acknowledge the **Hebrew Semantic Retrieval Challenge** by MAFAT DDR&D and the Israel National NLP Program, and credit **victord** as the solution author.
---
## Acknowledgements
- MAFAT DDR&D and the **Israel National NLP Program** for organizing the challenge and providing the annotated Hebrew corpus.
- [Webiks](https://www.webiks.com/) for the `Hebrew-RAGbot-KolZchut-QA-Embedder-v1.0` model.
- The authors of `multilingual-e5-large`, `bge-m3`, `bge-reranker-v2-m3`, `snowflake-arctic-embed-l-v2.0`, and `Solon-embeddings-large-0.1`.