# Hebrew Semantic Retrieval – 1st Place Solution
**Competition:** Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the Israel National NLP Program

**Result:** 🥇 1st place – NDCG@20 = 0.6736 (private test set)

**Author:** victord
## Overview
This repository contains the complete inference code and fine-tuned models for the winning solution to the Hebrew Semantic Retrieval Challenge. The challenge tasked participants with building a semantic retrieval system capable of ranking Hebrew paragraphs from a large-scale corpus (127,731 paragraphs) in response to natural-language Hebrew queries, evaluated by NDCG@20.
Hebrew is a morphologically rich Semitic language written in an almost consonant-only script, which creates high lexical ambiguity and makes retrieval significantly harder than in English or other high-resource languages. The challenge was designed to close this gap and advance Hebrew NLP for domains such as government services, law, academia, and the public sector.
## The Challenge
| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | NDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0β4 (human annotated) |
Ground-truth labels were produced in two stages: a semantic retrieval model first retrieved the top-20 candidates per query, then human annotators rated them on a 0β4 relevance scale.
## Solution Architecture
The solution is a classic two-stage retrieve-then-rerank pipeline, built on top of a large ensemble of multilingual and Hebrew-specialized embedding models, combined with a sparse BM25 stage.
```
Query
  │
  ├──► [Dense Retriever ×6] ──┐
  │                           ├──► Score Fusion (weighted, z-normalized)
  └──► [BM25s Sparse] ────────┘
                  │
                  ▼
        Top-250 Candidates
                  │
                  ▼
  [BGE Cross-Encoder Reranker] (fine-tuned)
                  │
                  ▼
  Final Top-20 Results (ranked by fused score)
```
### Stage 1 – Ensemble Dense + Sparse Retrieval
Six dense embedding models run in parallel. Each produces per-document cosine-similarity scores, which are z-score normalized (using pre-computed corpus statistics) and linearly fused with learned weights; BM25s contributes a 15% weight to the final fusion. A sketch of the fusion follows the weight table below.
| Model | Role | Pooling | Max Length |
|---|---|---|---|
| `multilingual-e5-large` (pseudo-fine-tuned) | Primary dense retriever | Mean pooling + L2 norm | 512 |
| `multilingual-e5-large-instruct` | Instruct-style dense retriever | Mean pooling + L2 norm | 512 |
| `BAAI/bge-m3` | Multilingual dense retriever | CLS token + L2 norm | 512 |
| `Snowflake/snowflake-arctic-embed-l-v2.0` | Multilingual dense retriever | CLS token + L2 norm | 1024 |
| `OrdalieTech/Solon-embeddings-large-0.1` | Multilingual dense retriever | Mean pooling + L2 norm | 512 |
| `Webiks/Hebrew-RAGbot-KolZchut-QA-Embedder-v1.0` | Hebrew-specialized retriever | Mean pooling + L2 norm | 512 |
| BM25s | Sparse lexical retriever | – | – |
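The sparse member of the ensemble is the `bm25s` library. A minimal sketch of how the BM25s index might be built and queried; variable names are illustrative, `corpus_dict` follows the format in the Usage section, and `k=250` is an assumption mirroring the Stage 1 candidate pool:

```python
import bm25s

# Flatten the corpus into a list of passages (corpus_dict as in the Usage section).
corpus = [entry["passage"] for entry in corpus_dict.values()]

# Default bm25s tokenizer, with no Hebrew-specific pre-processing (see Technical Notes).
corpus_tokens = bm25s.tokenize(corpus)

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Sparse scores for one query; indices refer back to the corpus list.
doc_indices, scores = retriever.retrieve(bm25s.tokenize("..."), k=250)
```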
Retriever fusion weights (normalized):
| Retriever | Weight |
|---|---|
| E5-large (pseudo-tuned) | 1.10 |
| E5-large-instruct | 0.25 |
| BGE-M3 | 0.20 |
| Snowflake Arctic | 0.30 |
| Solon | 0.30 |
| Hebrew RAGbot | 0.30 |
| BM25s | 0.15 (15% of the final fusion) |
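A minimal sketch of the weighted z-score fusion, under one plausible reading of the 15% BM25s blend; the `stats` dictionary stands in for the pre-computed corpus statistics, and the exact arrangement in `model.py` may differ:

```python
import numpy as np

# Normalized fusion weights for the six dense retrievers (table above).
DENSE_WEIGHTS = {
    "e5_pseudo": 1.10, "e5_instruct": 0.25, "bge_m3": 0.20,
    "arctic": 0.30, "solon": 0.30, "ragbot": 0.30,
}
BM25_WEIGHT = 0.15  # BM25s share of the final fusion (one plausible reading)

def fuse(dense_scores, bm25_scores, stats):
    """dense_scores: {name: np.ndarray of cosine similarities over the corpus}
    stats: {name: (mean, std)} pre-computed corpus statistics per retriever."""
    fused = np.zeros_like(bm25_scores, dtype=np.float32)
    for name, scores in dense_scores.items():
        mean, std = stats[name]
        fused += DENSE_WEIGHTS[name] * (scores - mean) / std  # z-normalize, then weight
    bm25_mean, bm25_std = stats["bm25"]
    bm25_z = (bm25_scores - bm25_mean) / bm25_std
    return (1 - BM25_WEIGHT) * fused + BM25_WEIGHT * bm25_z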
**Long-document handling:** For passages exceeding a model's max context length, a sliding-window chunking strategy with 50% overlap is applied at the token level, and the maximum chunk score is used to represent the document.
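A sketch of that strategy, assuming a Hugging Face tokenizer and an `embed(text)` helper that returns an L2-normalized embedding; both are hypothetical stand-ins for the pipeline's internals:

```python
import numpy as np

def score_long_document(query_emb, doc_text, tokenizer, embed, max_len=512):
    """Chunk a long passage with 50% token overlap; keep the best chunk score."""
    token_ids = tokenizer(doc_text, add_special_tokens=False)["input_ids"]
    stride = max_len // 2  # 50% overlap between consecutive windows
    best = -np.inf
    for start in range(0, max(1, len(token_ids) - stride), stride):
        chunk_text = tokenizer.decode(token_ids[start:start + max_len])
        chunk_emb = embed(chunk_text)  # hypothetical helper: L2-normalized vector
        best = max(best, float(query_emb @ chunk_emb))  # cosine sim of normalized vectors
    return best
```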
### Stage 2 – Cross-Encoder Reranking
The top-250 candidates from Stage 1 are reranked by a fine-tuned BGE cross-encoder (`bge-reranker-v2-m3`, pseudo-fine-tuned on the challenge corpus). The reranker operates with a max sequence length of 2048 tokens, using the same sliding-window + max-score strategy for long documents.
The final score blends the reranker score with the Stage 1 fusion score:

$$s_{\text{final}} = \alpha \cdot \hat{s}_{\text{reranker}} + (1 - \alpha) \cdot s_{\text{fusion}}$$

where $\hat{s}_{\text{reranker}}$ is the z-score-normalized reranker score and $\alpha$ is the blend weight. The top-20 documents by this blended score are returned.
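A sketch of Stage 2, using `sentence_transformers.CrossEncoder` to load the fine-tuned reranker; the blend weight `alpha` and the use of per-query batch statistics for z-normalization are assumptions, not the repo's exact values:

```python
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("models/bge-reranker-v2-m3_pseudo_tune_full", max_length=2048)

def rerank_top20(query, candidates, stage1_scores, alpha=0.7):
    """candidates: top-250 passages from Stage 1; alpha is an illustrative blend weight."""
    raw = reranker.predict([(query, passage) for passage in candidates])
    z = (raw - raw.mean()) / (raw.std() + 1e-8)  # z-normalized reranker scores
    blended = alpha * z + (1 - alpha) * np.asarray(stage1_scores)
    order = np.argsort(-blended)[:20]            # final top-20 by blended score
    return [(candidates[i], float(blended[i])) for i in order]
```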
## Included Models (fine-tuned)
| Path in repo | Base model | Fine-tuning |
|---|---|---|
| `models/multilingual-e5-large_pseudo_full/` | `intfloat/multilingual-e5-large` | Pseudo-label fine-tuning on the challenge corpus |
| `models/bge-reranker-v2-m3_pseudo_tune_full/` | `BAAI/bge-reranker-v2-m3` | Pseudo-label fine-tuning on the challenge corpus |

The remaining models (`bge-m3`, `multilingual-e5-large-instruct`, `snowflake-arctic-embed-l-v2.0`, `Solon-embeddings-large-0.1`, `Webiks_Hebrew_RAGbot_KolZchut_QA_Embedder_v1.0`) are used as-is, with no additional fine-tuning.
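The repository ships inference code only, but for context, here is a minimal sketch of what pseudo-label fine-tuning of the embedder could look like with sentence-transformers; the training pairs, loss, and hyperparameters are all assumptions, not the actual recipe:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Pseudo-labeled (query, passage) pairs, e.g. high-confidence ensemble retrievals
# treated as positives. The "query: " / "passage: " prefixes follow E5 conventions.
train_examples = [
    InputExample(texts=["query: ...", "passage: ..."]),
    # ... more mined pairs ...
]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("models/multilingual-e5-large_pseudo_full")
```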
## Repository Structure
```
model.py                                           # Full inference pipeline (preprocess + predict)
models/
    bge-m3/
    bge-reranker-v2-m3_pseudo_tune_full/           # Fine-tuned reranker ✨
    multilingual-e5-large_pseudo_full/             # Fine-tuned embedder ✨
    multilingual-e5-large-instruct/
    snowflake-arctic-embed-l-v2.0/
    Solon-embeddings-large-0.1/
    Webiks_Hebrew_RAGbot_KolZchut_QA_Embedder_v1.0/
```
## Usage
The pipeline exposes two functions that match the competition API:
```python
from model import preprocess, predict

# Build the corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)

# Query at inference time with a natural-language Hebrew query
results = predict({"query": "..."}, preprocessed)
# Returns: [{"paragraph_uuid": "...", "score": 0.92}, ...] (top-20)
```
**Requirements:**

```
torch
transformers
sentence-transformers
bm25s
scikit-learn
numpy
```
A CUDA-capable GPU is strongly recommended (the pipeline loads ~6 large models simultaneously).
## Technical Notes
- All models are loaded in `bfloat16` precision to reduce the GPU memory footprint.
- Offline mode is enforced at runtime (`HF_HUB_OFFLINE=1`); all model weights must be present locally.
- BM25s tokenization uses the default `bm25s` tokenizer with no additional Hebrew-specific pre-processing.
- The pipeline is time-budgeted: the reranker respects a ~1.85 s per-query wall-clock limit and will skip remaining batches if the budget is exceeded, gracefully falling back to Stage 1 scores (a sketch follows this list).
- CUDA memory is proactively freed between batches; OOM errors trigger single-sample fallback processing.
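A sketch of the time-budgeted reranking loop with OOM fallback; the batch structure and `reranker.predict` interface are illustrative, and only the ~1.85 s budget comes from the notes above:

```python
import time
import torch

RERANK_BUDGET_S = 1.85  # approximate per-query wall-clock limit

def rerank_with_budget(pairs, reranker, batch_size=16):
    """Score (query, passage) pairs batch by batch until the budget runs out.
    Candidates left unscored fall back to their Stage 1 scores (caller's job)."""
    scores, start = [], time.monotonic()
    for i in range(0, len(pairs), batch_size):
        if time.monotonic() - start > RERANK_BUDGET_S:
            break  # budget exceeded: skip remaining batches
        batch = pairs[i:i + batch_size]
        try:
            scores.extend(reranker.predict(batch))
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # free memory, then retry one sample at a time
            for pair in batch:
                scores.extend(reranker.predict([pair]))
    return scores  # may be shorter than len(pairs)
```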
## Results
| Phase | NDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | 0.456235 | 🥇 1st |
| Private (Phase II) | 0.6736 | 🥇 1st |
## Citation
If you use this solution or the models in this repository, please acknowledge the Hebrew Semantic Retrieval Challenge by MAFAT DDR&D and the Israel National NLP Program, and credit victord as the solution author.
## Acknowledgements
- MAFAT DDR&D and the Israel National NLP Program for organizing the challenge and providing the annotated Hebrew corpus.
- Webiks for the `Hebrew-RAGbot-KolZchut-QA-Embedder-v1.0` model.
- The authors of `multilingual-e5-large`, `bge-m3`, `bge-reranker-v2-m3`, `snowflake-arctic-embed-l-v2.0`, and `Solon-embeddings-large-0.1`.