Hebrew Semantic Retrieval β€” 1st Place Solution

Competition: Hebrew Semantic Retrieval Challenge by MAFAT DDR&D (Directorate of Defense Research & Development) in partnership with the Israel National NLP Program

Result: 🥇 1st place — nDCG@20 = 0.6736 (private test set)

Author: victord


Overview

This repository contains the complete inference code and fine-tuned models for the winning solution to the Hebrew Semantic Retrieval Challenge. The challenge tasked participants with building a semantic retrieval system capable of ranking Hebrew paragraphs from a large-scale corpus (127,731 paragraphs) in response to natural-language Hebrew queries, evaluated by nDCG@20.

Hebrew is a morphologically rich Semitic language written in a largely consonantal script, which creates high lexical ambiguity and makes retrieval significantly harder than in English and other high-resource languages. The challenge was designed to close this gap and advance Hebrew NLP for domains such as government services, law, academia, and the public sector.


The Challenge

| Property | Detail |
|---|---|
| Organizer | MAFAT DDR&D + Israel National NLP Program |
| Corpus size | 127,731 Hebrew paragraphs |
| Data sources | Hebrew Wikipedia, Kol-Zchut (legal/civil-rights), Knesset committee protocols |
| Evaluation metric | nDCG@20 |
| Phase I | Public leaderboard (Codabench) |
| Phase II | Private test set with additional human annotation of previously unseen retrievals |
| Relevance scale | 0–4 (human annotated) |

Ground-truth labels were produced in two stages: a semantic retrieval model first retrieved the top-20 candidates per query, then human annotators rated them on a 0–4 relevance scale.
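
Since nDCG@20 drives both leaderboards, here is a minimal reference implementation for a single query. It assumes the common exponential-gain formulation; the organizers' exact gain/discount variant is not specified in this README.

```python
import numpy as np

def ndcg_at_k(relevances, k=20):
    """nDCG@k for one query, given the graded relevance (0-4) of the
    returned documents in ranked order. Exponential-gain variant."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum((2.0 ** rel - 1.0) * discounts)
    # Ideal DCG: the same relevance grades in the best possible order.
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = np.sum((2.0 ** ideal - 1.0) * discounts[: ideal.size])
    return dcg / idcg if idcg > 0 else 0.0

# Example: a highly relevant document ranked first scores well.
print(ndcg_at_k([4, 0, 2, 3, 0, 1]))
```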


Solution Architecture

The solution is a classic two-stage retrieve-then-rerank pipeline, built on top of a large ensemble of multilingual and Hebrew-specialized embedding models, combined with a sparse BM25 stage.

Query
  │
  ├─► [Dense Retrievers ×6]  ──┐
  │                            ├─► Score Fusion (weighted, z-normalized)
  └─► [BM25s Sparse]  ─────────┘
            │
            ▼
     Top-250 Candidates
            │
            ▼
     [BGE Cross-Encoder Reranker]  (fine-tuned)
            │
            ▼
     Final Top-20 Results (ranked by blended score)

Stage 1 — Ensemble Dense + Sparse Retrieval

Six dense embedding models run in parallel. Each produces per-document cosine-similarity scores, which are z-score normalized (using pre-computed corpus statistics) and linearly fused with learned weights. BM25s contributes a 15% weight in the final fusion.

| Model | Role | Pooling | Max Length |
|---|---|---|---|
| multilingual-e5-large (pseudo-fine-tuned) | Primary dense retriever | Mean pooling + L2 norm | 512 |
| multilingual-e5-large-instruct | Instruct-style dense retriever | Mean pooling + L2 norm | 512 |
| BAAI/bge-m3 | Multilingual dense retriever | CLS token + L2 norm | 512 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | Multilingual dense retriever | CLS token + L2 norm | 1024 |
| OrdalieTech/Solon-embeddings-large-0.1 | Multilingual dense retriever | Mean pooling + L2 norm | 512 |
| Webiks/Hebrew-RAGbot-KolZchut-QA-Embedder-v1.0 | Hebrew-specialized retriever | Mean pooling + L2 norm | 512 |
| BM25s | Sparse lexical retriever | n/a | n/a |

Retriever fusion weights (normalized):

| Retriever | Weight |
|---|---|
| E5-large (pseudo-tuned) | 1.10 |
| E5-large-instruct | 0.25 |
| BGE-M3 | 0.20 |
| Snowflake Arctic | 0.30 |
| Solon | 0.30 |
| Hebrew RAGbot | 0.30 |
| BM25s | 15% blended into final fusion |
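
A minimal sketch of the fusion step under stated assumptions: each dense retriever's cosine similarities arrive as a NumPy array over the whole corpus, and per-retriever mean/std statistics were pre-computed offline. The names, dictionary layout, and weight normalization below are illustrative, not the repository's actual API.

```python
import numpy as np

# Fusion weights from the table above; key names are illustrative.
DENSE_WEIGHTS = {
    "e5_pseudo": 1.10, "e5_instruct": 0.25, "bge_m3": 0.20,
    "arctic": 0.30, "solon": 0.30, "ragbot": 0.30,
}
BM25_SHARE = 0.15  # BM25s share of the final blend

def fuse_scores(dense_scores, bm25_scores, corpus_stats):
    """dense_scores: {name: np.ndarray of cosine similarities, one per doc}
    corpus_stats:   {name: (mean, std)} pre-computed offline."""
    fused_dense = np.zeros_like(bm25_scores, dtype=float)
    for name, scores in dense_scores.items():
        mu, sigma = corpus_stats[name]
        fused_dense += DENSE_WEIGHTS[name] * (scores - mu) / sigma  # z-normalize, then weight
    fused_dense /= sum(DENSE_WEIGHTS.values())

    bm25_z = (bm25_scores - bm25_scores.mean()) / (bm25_scores.std() + 1e-9)
    return (1.0 - BM25_SHARE) * fused_dense + BM25_SHARE * bm25_z

# Stage 1 output: indices of the top-250 candidates for the reranker.
# candidates = np.argsort(-fused_scores)[:250]
```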

Long-document handling: For passages exceeding the model's max context length, a sliding-window chunking strategy with 50 % overlap is applied at the token level, and the maximum chunk score is used to represent the document.
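
A minimal sketch of this chunk-and-max strategy, assuming the document is already tokenized to IDs and given a hypothetical `encode_window` helper that embeds one window and returns an L2-normalized vector:

```python
import numpy as np

def max_chunk_score(query_emb, doc_token_ids, encode_window, max_len=512):
    """Score a long document as the maximum query-chunk cosine similarity
    over sliding windows with 50% overlap (stride = max_len // 2).
    `encode_window` is a hypothetical embedding helper, not the repo's API."""
    stride = max_len // 2
    best = -np.inf
    for start in range(0, max(1, len(doc_token_ids) - stride), stride):
        window = doc_token_ids[start : start + max_len]
        chunk_emb = encode_window(window)
        best = max(best, float(np.dot(query_emb, chunk_emb)))  # cosine for unit vectors
    return best
```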

Stage 2 — Cross-Encoder Reranking

The top-250 candidates from Stage 1 are reranked by a fine-tuned BGE cross-encoder (bge-reranker-v2-m3, pseudo-fine-tuned on the challenge corpus). The reranker operates with a max sequence length of 2048 tokens using the same sliding-window + max-score strategy for long documents.

The final score is a blend of the reranker score and the Stage 1 fusion score:

$$\text{score}_\text{final} = 0.35 \cdot \hat{s}_\text{reranker} + 0.65 \cdot s_\text{fusion}$$

where $\hat{s}_\text{reranker}$ is z-score normalized. The top-20 documents by this blended score are returned.
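
A minimal sketch of this blend, assuming the fine-tuned checkpoint loads with sentence-transformers' CrossEncoder (the repository's model.py may wire this differently, and the time-budget handling from the Technical Notes is omitted here):

```python
import numpy as np
from sentence_transformers import CrossEncoder

# Assumed local checkpoint path; see "Included Models" below.
reranker = CrossEncoder("models/bge-reranker-v2-m3_pseudo_tune_full", max_length=2048)

def rerank(query, candidates, fusion_scores):
    """candidates: list of passage strings (top-250 from Stage 1);
    fusion_scores: np.ndarray of their Stage 1 fused scores."""
    raw = reranker.predict([(query, passage) for passage in candidates])
    z = (raw - raw.mean()) / (raw.std() + 1e-9)   # z-normalize reranker scores
    blended = 0.35 * z + 0.65 * fusion_scores
    return np.argsort(-blended)[:20]              # indices of the final top-20
```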


Included Models (fine-tuned)

| Path in repo | Base model | Fine-tuning |
|---|---|---|
| models/multilingual-e5-large_pseudo_full/ | intfloat/multilingual-e5-large | Pseudo-label fine-tuning on the challenge corpus |
| models/bge-reranker-v2-m3_pseudo_tune_full/ | BAAI/bge-reranker-v2-m3 | Pseudo-label fine-tuning on the challenge corpus |

The remaining models (bge-m3, multilingual-e5-large-instruct, snowflake-arctic-embed-l-v2.0, Solon-embeddings-large-0.1, Webiks_Hebrew_RAGbot_KolZchut_QA_Embedder_v1.0) are used as-is (no additional fine-tuning).


Repository Structure

model.py              ← Full inference pipeline (preprocess + predict)
models/
  bge-m3/
  bge-reranker-v2-m3_pseudo_tune_full/   ← Fine-tuned reranker ✨
  multilingual-e5-large_pseudo_full/     ← Fine-tuned embedder ✨
  multilingual-e5-large-instruct/
  snowflake-arctic-embed-l-v2.0/
  Solon-embeddings-large-0.1/
  Webiks_Hebrew_RAGbot_KolZchut_QA_Embedder_v1.0/

Usage

The pipeline exposes two functions that match the competition API:

from model import preprocess, predict

# Build corpus index (run once)
# corpus_dict: {doc_id: {"passage": "..."}, ...}
preprocessed = preprocess(corpus_dict)

# Query at inference time
results = predict({"query": "מה הזכויות של שוכרי דירה?"}, preprocessed)  # "What are the rights of apartment tenants?"
# Returns: [{"paragraph_uuid": "...", "score": 0.92}, ...]  (top-20)

Requirements:

torch
transformers
sentence-transformers
bm25s
scikit-learn
numpy

A CUDA-capable GPU is strongly recommended: the pipeline loads six embedding models plus a cross-encoder reranker simultaneously.


Technical Notes

  • All models are loaded in bfloat16 precision to reduce GPU memory footprint.
  • Offline mode is enforced at runtime (HF_HUB_OFFLINE=1) β€” all model weights must be present locally.
  • BM25s tokenization uses the default bm25s tokenizer with no additional Hebrew-specific pre-processing.
  • The pipeline is time-budgeted: the reranker respects a ~1.85 s per-query wall-clock limit and will skip remaining batches if the budget is exceeded, gracefully falling back to Stage 1 scores.
  • CUDA memory is proactively freed between batches; OOM errors trigger single-sample fallback processing.

Results

| Phase | nDCG@20 | Rank |
|---|---|---|
| Public (Phase I) | 0.456235 | 🥇 1st |
| Private (Phase II) | 0.6736 | 🥇 1st |

Citation

If you use this solution or the models in this repository, please acknowledge the Hebrew Semantic Retrieval Challenge by MAFAT DDR&D and the Israel National NLP Program, and credit victord as the solution author.


Acknowledgements

  • MAFAT DDR&D and the Israel National NLP Program for organizing the challenge and providing the annotated Hebrew corpus.
  • Webiks for the Hebrew-RAGbot-KolZchut-QA-Embedder-v1.0 model.
  • The authors of multilingual-e5-large, bge-m3, bge-reranker-v2-m3, snowflake-arctic-embed-l-v2.0, and Solon-embeddings-large-0.1.