| --- |
| language: |
| - az |
| license: apache-2.0 |
| tags: |
| - sentence-transformers |
| - feature-extraction |
| - sentence-similarity |
| - retrieval |
| - azerbaijani |
| - embedding |
| library_name: sentence-transformers |
| pipeline_tag: sentence-similarity |
| datasets: |
| - LocalDoc/msmarco-az-reranked |
| - LocalDoc/azerbaijani_retriever_corpus-reranked |
| - LocalDoc/ldquad_v2_retrieval-reranked |
| - LocalDoc/azerbaijani_books_retriever_corpus-reranked |
| base_model: intfloat/multilingual-e5-small |
| model-index: |
| - name: LocRet-small |
|   results: |
|   - task: |
|       type: retrieval |
|     dataset: |
|       name: AZ-MIRAGE |
|       type: custom |
|     metrics: |
|     - type: mrr@10 |
|       value: 0.5250 |
|     - type: ndcg@10 |
|       value: 0.6162 |
|     - type: recall@10 |
|       value: 0.8948 |
| --- |
| |
| # LocRet-small — Azerbaijani Retrieval Embedding Model |
|
|
| **LocRet-small** is a compact retrieval embedding model specialized for the Azerbaijani language. At 118M parameters it is about **4.8× smaller** than BAAI/bge-m3 (568M), yet it outperforms that model on the AZ-MIRAGE benchmark (MRR@10 of 0.5250 vs. 0.4204). |
|
|
| ## Key Results |
|
|
| ### AZ-MIRAGE Benchmark (Native Azerbaijani Retrieval) |
|
|
| | Rank | Model | Parameters | MRR@10 | P@1 | R@5 | R@10 | NDCG@5 | NDCG@10 | |
| |:----:|:------|:---------:|:------:|:---:|:---:|:----:|:------:|:-------:| |
| | **#1** | **LocRet-small** | **118M** | **0.5250** | **0.3132** | **0.8267** | **0.8948** | **0.5938** | **0.6162** | |
| | #2 | BAAI/bge-m3 | 568M | 0.4204 | 0.2310 | 0.6905 | 0.7787 | 0.4791 | 0.5079 | |
| | #3 | perplexity-ai/pplx-embed-v1-0.6b | 600M | 0.4117 | 0.2276 | 0.6715 | 0.7605 | 0.4677 | 0.4968 | |
| | #4 | intfloat/multilingual-e5-large | 560M | 0.4043 | 0.2264 | 0.6571 | 0.7454 | 0.4584 | 0.4875 | |
| | #5 | intfloat/multilingual-e5-base | 278M | 0.3852 | 0.2116 | 0.6353 | 0.7216 | 0.4390 | 0.4672 | |
| | #6 | Snowflake/snowflake-arctic-embed-l-v2.0 | 568M | 0.3746 | 0.2135 | 0.6006 | 0.6916 | 0.4218 | 0.4516 | |
| | #7 | Qwen/Qwen3-Embedding-4B | 4B | 0.3602 | 0.1869 | 0.6067 | 0.7036 | 0.4119 | 0.4437 | |
| | #8 | intfloat/multilingual-e5-small (base model) | 118M | 0.3586 | 0.1958 | 0.5927 | 0.6834 | 0.4079 | 0.4375 | |
| | #9 | Qwen/Qwen3-Embedding-0.6B | 600M | 0.2951 | 0.1516 | 0.4926 | 0.5956 | 0.3339 | 0.3676 | |
|
|
|
|
| ## Usage |
|
|
| ```python |
| from sentence_transformers import SentenceTransformer |
| |
| model = SentenceTransformer("LocalDoc/LocRet-small") |
| |
| queries = ["Azərbaycanın paytaxtı hansı şəhərdir?"] |
| passages = [ |
|     "Bakı Azərbaycan Respublikasının paytaxtı və ən böyük şəhəridir.", |
|     "Gəncə Azərbaycanın ikinci böyük şəhəridir.", |
| ] |
| |
| query_embeddings = model.encode_query(queries) |
| passage_embeddings = model.encode_document(passages) |
| |
| similarities = model.similarity(query_embeddings, passage_embeddings) |
| print(similarities) |
| ``` |
| The `"query: "` and `"passage: "` prefixes are applied automatically by `encode_query` and `encode_document`. If you call `model.encode` directly, the `"passage: "` prefix is added by default, so queries encoded that way will not receive the `"query: "` prefix. |
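
To turn the similarity scores into a ranked result list, something like the following can be used. This is a minimal sketch rather than part of the model's API: `cosine_similarity` and `rank_passages` are hypothetical helper names, and the toy 3-dimensional vectors stand in for the 384-dimensional embeddings returned by `encode_query` and `encode_document`.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_passages(query_embedding, passage_embeddings, top_k=10):
    """Return (passage_index, score) pairs, best match first."""
    scored = [
        (idx, cosine_similarity(query_embedding, emb))
        for idx, emb in enumerate(passage_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real inputs would be the model's embeddings.
toy_query = [1.0, 0.0, 0.0]
toy_passages = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

ranking = rank_passages(toy_query, toy_passages, top_k=2)
for passage_index, score in ranking:
    print(passage_index, round(score, 4))
```

With real embeddings, `query_embeddings[0]` and `passage_embeddings` from the snippet above can be passed in place of the toy vectors.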
|
|
|
|
| ## Training |
|
|
| ### Method |
|
|
| LocRet-small is fine-tuned from [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) using **listwise KL distillation** combined with a contrastive loss: |
|
|
| $$\mathcal{L} = \mathcal{L}_{\text{KL}} + 0.1 \cdot \mathcal{L}_{\text{InfoNCE}}$$ |
|
|
| - **Listwise KL divergence**: Distills the ranking distribution from a cross-encoder teacher ([bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)) over candidate lists of 1 positive + up to 10 hard negatives per query. Teacher and student softmax distributions use asymmetric temperatures (τ_teacher = 0.3, τ_student = 0.05). |
| - **In-batch contrastive loss (InfoNCE)**: Adds diversity by treating the positive passages of the other queries in the batch as additional negatives. |
|
|
| This approach preserves the full teacher ranking signal rather than reducing it to binary relevance labels, which is critical for training on top of already strong pre-trained retrievers. |
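
The combined objective can be sketched in plain Python for a single batch. This is an illustration under stated assumptions, not the training code: the KL direction (teacher to student), the use of the student temperature inside InfoNCE, averaging over the batch, and all function names are assumptions; only the two temperatures and the 0.1 weight come from the description above.

```python
import math

TAU_TEACHER = 0.3     # teacher temperature (from the text above)
TAU_STUDENT = 0.05    # student temperature (from the text above)
INFONCE_WEIGHT = 0.1  # weight of the contrastive term (from the formula above)


def log_softmax(scores, temperature):
    """Numerically stable log-softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_z for s in scaled]


def listwise_kl(teacher_scores, student_scores):
    """KL(teacher || student) over one query's candidate list (direction assumed)."""
    log_p = log_softmax(teacher_scores, TAU_TEACHER)
    log_q = log_softmax(student_scores, TAU_STUDENT)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def info_nce(similarity_matrix):
    """In-batch contrastive loss: the positive for query i is passage i."""
    losses = [
        -log_softmax(row, TAU_STUDENT)[i]  # temperature choice is an assumption
        for i, row in enumerate(similarity_matrix)
    ]
    return sum(losses) / len(losses)


def combined_loss(teacher_lists, student_lists, similarity_matrix):
    """L = L_KL + 0.1 * L_InfoNCE, averaged over the batch."""
    kl = sum(
        listwise_kl(t, s) for t, s in zip(teacher_lists, student_lists)
    ) / len(teacher_lists)
    return kl + INFONCE_WEIGHT * info_nce(similarity_matrix)


# Toy batch of two queries, each with 1 positive + 2 hard negatives.
teacher_lists = [[0.92, 0.40, 0.10], [0.85, 0.60, 0.05]]
student_lists = [[0.80, 0.65, 0.30], [0.78, 0.70, 0.20]]
similarity_matrix = [[0.80, 0.20], [0.25, 0.75]]
print(combined_loss(teacher_lists, student_lists, similarity_matrix))
```

The lower student temperature sharpens the student distribution, so small differences in cosine similarity translate into large differences in predicted ranking probability.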
|
|
| ### Data |
|
|
| The model was trained on approximately **3.5 million** Azerbaijani query-passage pairs from four datasets: |
|
|
| | Dataset | Pairs | Domain | Type | |
| |:--------|------:|:-------|:-----| |
| | [msmarco-az-reranked](https://huggingface.co/datasets/LocalDoc/msmarco-az-reranked) | ~1.4M | General web QA | Translated EN→AZ | |
| | [azerbaijani_books_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked) | ~1.6M | Books, politics, history | Native AZ | |
| | [azerbaijani_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_retriever_corpus-reranked) | ~189K | News, culture | Native AZ | |
| | [ldquad_v2_retrieval-reranked](https://huggingface.co/datasets/LocalDoc/ldquad_v2_retrieval-reranked) | ~330K | Wikipedia QA | Native AZ | |
|
|
| All datasets include hard negatives scored by a cross-encoder reranker, which serve as the teacher signal for listwise distillation. False negatives were filtered using normalized score thresholds. |
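
The false-negative filter might look roughly like the sketch below. The exact rule and thresholds used for these datasets are not documented here, so the relative cutoff `max_ratio=0.95`, the function name, and the `(passage_id, score)` layout are all hypothetical; only the cap of 10 hard negatives per query comes from the training description.

```python
def filter_false_negatives(positive_score, negatives, max_ratio=0.95, max_negatives=10):
    """Drop candidates that score too close to (or above) the positive.

    positive_score: normalized teacher score of the positive passage, in [0, 1].
    negatives: list of (passage_id, normalized_teacher_score) pairs.
    max_ratio: hypothetical threshold; a negative is kept only if its score
        is below positive_score * max_ratio.
    """
    cutoff = positive_score * max_ratio
    kept = [(pid, score) for pid, score in negatives if score < cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # hardest negatives first
    return kept[:max_negatives]


candidates = [("a", 0.95), ("b", 0.50), ("c", 0.88), ("d", 0.10)]
hard_negatives = filter_false_negatives(0.90, candidates)
print(hard_negatives)
```

Candidates `a` and `c` are discarded because the teacher rates them almost as relevant as the positive, which suggests they may be unlabeled positives rather than true negatives.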
|
|
| ### Hyperparameters |
|
|
| | Parameter | Value | |
| |:----------|:------| |
| | Base model | intfloat/multilingual-e5-small | |
| | Max sequence length | 512 | |
| | Effective batch size | 256 | |
| | Learning rate | 5e-5 | |
| | Schedule | Linear warmup (5%) + cosine decay | |
| | Precision | FP16 | |
| | Epochs | 1 | |
| | Training time | ~25 hours | |
| | Hardware | 4× NVIDIA RTX 5090 (32GB) | |
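
The learning-rate schedule in the table can be sketched as a function of the optimizer step. This is an illustrative reconstruction: decaying all the way to zero, the function name, and the step count are assumptions. With roughly 3.5 million pairs and an effective batch size of 256, one epoch comes to about 13.7K steps if every pair is used once.

```python
import math

PEAK_LR = 5e-5          # from the table above
WARMUP_FRACTION = 0.05  # from the table above


def learning_rate(step, total_steps, peak_lr=PEAK_LR, warmup_fraction=WARMUP_FRACTION):
    """Linear warmup to peak_lr, then cosine decay to zero (final value assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Approximate step count for one epoch (assumes no pairs are dropped).
total_steps = math.ceil(3_500_000 / 256)
for step in (0, total_steps // 20, total_steps // 2, total_steps):
    print(step, learning_rate(step, total_steps))
```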
|
|
| ### Training Insights |
|
|
| - **Listwise KL distillation outperforms standard contrastive training** (MultipleNegativesRankingLoss) for fine-tuning pre-trained retrievers, consistent with findings from [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and [cadet-embed](https://arxiv.org/abs/2505.19274). |
| - **Retrieval pre-training matters more than language-specific pre-training** for retrieval tasks: multilingual-e5-small (with retrieval pre-training) significantly outperforms XLM-RoBERTa and other BERT variants (without retrieval pre-training) as a base model. |
| - **A mix of translated and native data** prevents catastrophic forgetting while enabling language specialization. |
|
|
| ## Benchmark |
|
|
| ### AZ-MIRAGE |
|
|
| [AZ-MIRAGE](https://github.com/LocalDoc-Azerbaijan/AZ-MIRAGE) is a native Azerbaijani retrieval benchmark with 7,373 queries and 40,448 document chunks covering diverse topics. It evaluates retrieval quality on naturally written Azerbaijani text. |
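
For readers unfamiliar with the reported metrics, the per-query computations can be sketched as follows. This assumes binary relevance labels and is not the benchmark's own evaluation script; the function names are illustrative. The reported numbers would be the mean of these per-query values over all queries.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents retrieved within the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Normalized discounted cumulative gain with binary relevance."""
    if not relevant_ids:
        return 0.0
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


# Toy query: the single relevant chunk is retrieved at rank 2.
ranked = ["d3", "d1", "d2"]
relevant = {"d1"}
print(mrr_at_k(ranked, relevant), recall_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```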
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |:---------|:------| |
| | Architecture | BERT encoder with XLM-RoBERTa tokenizer | |
| | Parameters | 118M | |
| | Embedding dimension | 384 | |
| | Max tokens | 512 | |
| | Vocabulary | SentencePiece (250K) | |
| | Similarity function | Cosine similarity | |
| | Language | Azerbaijani (az) | |
| | License | Apache 2.0 | |
|
|
| ## Limitations |
|
|
| - Optimized for Azerbaijani text retrieval. Performance on other languages may be lower than the base multilingual-e5-small model. |
| - Maximum input length is 512 tokens. Longer documents should be chunked. |
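
One simple way to chunk longer documents is a sliding window over tokens, sketched below. The overlap of 64 tokens and the function name are hypothetical choices, and whitespace splitting is only a stand-in for the model's SentencePiece tokenizer. In practice the window should be somewhat smaller than 512, since special tokens and the `"passage: "` prefix also count toward the limit.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=64):
    """Split a token list into overlapping windows of at most max_tokens."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    stride = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Whitespace tokens are an approximation (assumption); subword counts will be higher.
document = " ".join(f"söz{i}" for i in range(1000))
chunks = chunk_tokens(document.split(), max_tokens=480, overlap=64)
passages = [" ".join(chunk) for chunk in chunks]
print(len(passages), [len(chunk) for chunk in chunks])
```

Each resulting passage can then be embedded separately with `encode_document`.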
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{locret-small-2026, |
| title={LocRet-small: A Compact Azerbaijani Retrieval Embedding Model}, |
| author={LocalDoc}, |
| year={2026}, |
| url={https://huggingface.co/LocalDoc/LocRet-small} |
| } |
| ``` |
|
|
| ## Acknowledgments |
|
|
| - Base model: [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) |
| - Teacher reranker: [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) |
| - Training methodology inspired by [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and cross-encoder listwise distillation research. |