---
language:
- az
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- retrieval
- azerbaijani
- embedding
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
- LocalDoc/msmarco-az-reranked
- LocalDoc/azerbaijani_retriever_corpus-reranked
- LocalDoc/ldquad_v2_retrieval-reranked
- LocalDoc/azerbaijani_books_retriever_corpus-reranked
base_model: intfloat/multilingual-e5-small
model-index:
- name: LocRet-small
results:
- task:
type: retrieval
dataset:
name: AZ-MIRAGE
type: custom
metrics:
- type: mrr@10
value: 0.5250
- type: ndcg@10
value: 0.6162
- type: recall@10
value: 0.8948
---
# LocRet-small — Azerbaijani Retrieval Embedding Model
**LocRet-small** is a compact, high-performance retrieval embedding model specialized for the Azerbaijani language. Despite being **4.8× smaller** than BGE-m3, it significantly outperforms it on Azerbaijani retrieval benchmarks.
## Key Results
### AZ-MIRAGE Benchmark (Native Azerbaijani Retrieval)
| Rank | Model | Parameters | MRR@10 | P@1 | R@5 | R@10 | NDCG@5 | NDCG@10 |
|:----:|:------|:---------:|:------:|:---:|:---:|:----:|:------:|:-------:|
| **#1** | **LocRet-small** | **118M** | **0.5250** | **0.3132** | **0.8267** | **0.8948** | **0.5938** | **0.6162** |
| #2 | BAAI/bge-m3 | 568M | 0.4204 | 0.2310 | 0.6905 | 0.7787 | 0.4791 | 0.5079 |
| #3 | perplexity-ai/pplx-embed-v1-0.6b | 600M | 0.4117 | 0.2276 | 0.6715 | 0.7605 | 0.4677 | 0.4968 |
| #4 | intfloat/multilingual-e5-large | 560M | 0.4043 | 0.2264 | 0.6571 | 0.7454 | 0.4584 | 0.4875 |
| #5 | intfloat/multilingual-e5-base | 278M | 0.3852 | 0.2116 | 0.6353 | 0.7216 | 0.4390 | 0.4672 |
| #6 | Snowflake/snowflake-arctic-embed-l-v2.0 | 568M | 0.3746 | 0.2135 | 0.6006 | 0.6916 | 0.4218 | 0.4516 |
| #7 | Qwen/Qwen3-Embedding-4B | 4B | 0.3602 | 0.1869 | 0.6067 | 0.7036 | 0.4119 | 0.4437 |
| #8 | intfloat/multilingual-e5-small (base) | 118M | 0.3586 | 0.1958 | 0.5927 | 0.6834 | 0.4079 | 0.4375 |
| #9 | Qwen/Qwen3-Embedding-0.6B | 600M | 0.2951 | 0.1516 | 0.4926 | 0.5956 | 0.3339 | 0.3676 |
## Usage
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("LocalDoc/LocRet-small")
queries = ["Azərbaycanın paytaxtı hansı şəhərdir?"]
passages = [
"Bakı Azərbaycan Respublikasının paytaxtı və ən böyük şəhəridir.",
"Gəncə Azərbaycanın ikinci böyük şəhəridir.",
]
query_embeddings = model.encode_query(queries)
passage_embeddings = model.encode_document(passages)
similarities = model.similarity(query_embeddings, passage_embeddings)
print(similarities)
```
The `"query: "` and `"passage: "` prefixes are applied automatically by `encode_query` and `encode_document`. If you call `model.encode` directly, the `"passage: "` prefix is added by default.
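If you embed text outside of sentence-transformers (e.g. with a raw transformers pipeline or a serving framework), you can prepend the E5-style prefixes yourself. A minimal sketch; the helper name is ours, not part of any library:

```python
def with_e5_prefix(texts, kind):
    """Prepend the E5-style role prefix ("query: " or "passage: ")
    to each text before embedding. Illustrative helper, not library API."""
    assert kind in ("query", "passage")
    return [f"{kind}: {t}" for t in texts]

queries = with_e5_prefix(["Azərbaycanın paytaxtı hansı şəhərdir?"], "query")
```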
## Training
### Method
LocRet-small is fine-tuned from [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) using **listwise KL distillation** combined with a contrastive loss:
$$\mathcal{L} = \mathcal{L}_{\text{KL}} + 0.1 \cdot \mathcal{L}_{\text{InfoNCE}}$$
- **Listwise KL divergence**: Distills the ranking distribution from a cross-encoder teacher ([bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)) over candidate lists of 1 positive + up to 10 hard negatives per query. Teacher and student softmax distributions use asymmetric temperatures (τ_teacher = 0.3, τ_student = 0.05).
- **In-batch contrastive loss (InfoNCE)**: Provides additional diversity through in-batch negatives on positive passages.
This approach preserves the full teacher ranking signal rather than reducing it to binary relevance labels, which is critical for training on top of already strong pre-trained retrievers.
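The combined objective described above can be sketched in PyTorch as follows. This is our reading of the description, not the released training code; tensor shapes, the exact InfoNCE temperature, and the in-batch negative construction are assumptions:

```python
import torch
import torch.nn.functional as F

def locret_loss(student_scores, teacher_scores, query_emb, pos_emb,
                tau_student=0.05, tau_teacher=0.3, alpha=0.1, tau_ce=0.05):
    """Listwise KL distillation + in-batch InfoNCE (illustrative sketch).

    student_scores, teacher_scores: (batch, n_candidates) scores over
        1 positive + up to 10 hard negatives per query.
    query_emb, pos_emb: (batch, dim) L2-normalized embeddings.
    """
    # Listwise KL: match the student's softmax ranking distribution to the
    # teacher's, with asymmetric temperatures (sharper student distribution).
    teacher_probs = F.softmax(teacher_scores / tau_teacher, dim=-1)
    student_logp = F.log_softmax(student_scores / tau_student, dim=-1)
    kl = F.kl_div(student_logp, teacher_probs, reduction="batchmean")

    # In-batch InfoNCE: each query's own positive is the target; every other
    # positive in the batch serves as a negative.
    logits = query_emb @ pos_emb.T / tau_ce
    labels = torch.arange(logits.size(0))
    infonce = F.cross_entropy(logits, labels)

    return kl + alpha * infonce
```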
### Data
The model was trained on approximately **3.5 million** Azerbaijani query-passage pairs from four datasets:
| Dataset | Pairs | Domain | Type |
|:--------|------:|:-------|:-----|
| [msmarco-az-reranked](https://huggingface.co/datasets/LocalDoc/msmarco-az-reranked) | ~1.4M | General web QA | Translated EN→AZ |
| [azerbaijani_books_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked) | ~1.6M | Books, politics, history | Native AZ |
| [azerbaijani_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_retriever_corpus-reranked) | ~189K | News, culture | Native AZ |
| [ldquad_v2_retrieval-reranked](https://huggingface.co/datasets/LocalDoc/ldquad_v2_retrieval-reranked) | ~330K | Wikipedia QA | Native AZ |
All datasets include hard negatives scored by a cross-encoder reranker, which serve as the teacher signal for listwise distillation. False negatives were filtered using normalized score thresholds.
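The exact filtering thresholds are not published; the sketch below illustrates the general idea (min-max normalize the teacher scores within each candidate list, then drop negatives whose normalized score approaches the positive's). The `threshold` value is a placeholder, not the setting used in training:

```python
def drop_false_negatives(pos_score, neg_scores, threshold=0.8):
    """Keep only negatives whose min-max-normalized teacher score stays
    below `threshold` (placeholder value; a negative scoring near the
    positive is likely an unlabeled relevant passage)."""
    scores = [pos_score] + list(neg_scores)
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero on constant lists
    return [s for s in neg_scores if (s - lo) / span < threshold]
```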
### Hyperparameters
| Parameter | Value |
|:----------|:------|
| Base model | intfloat/multilingual-e5-small |
| Max sequence length | 512 |
| Effective batch size | 256 |
| Learning rate | 5e-5 |
| Schedule | Linear warmup (5%) + cosine decay |
| Precision | FP16 |
| Epochs | 1 |
| Training time | ~25 hours |
| Hardware | 4× NVIDIA RTX 5090 (32GB) |
### Training Insights
- **Listwise KL distillation outperforms standard contrastive training** (MultipleNegativesRankingLoss) for fine-tuning pre-trained retrievers, consistent with findings from [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and [cadet-embed](https://arxiv.org/abs/2505.19274).
- **Retrieval pre-training matters more than language-specific pre-training** for retrieval tasks: multilingual-e5-small (with retrieval pre-training) significantly outperforms XLM-RoBERTa and other BERT variants (without retrieval pre-training) as a base model.
- **A mix of translated and native data** prevents catastrophic forgetting while enabling language specialization.
## Benchmark
### AZ-MIRAGE
A native Azerbaijani retrieval benchmark ([LocalDoc-Azerbaijan/AZ-MIRAGE](https://github.com/LocalDoc-Azerbaijan/AZ-MIRAGE)) with 7,373 queries and 40,448 document chunks covering diverse topics. Evaluates retrieval quality on naturally written Azerbaijani text.
## Model Details
| Property | Value |
|:---------|:------|
| Architecture | BERT (XLM-RoBERTa) |
| Parameters | 118M |
| Embedding dimension | 384 |
| Max tokens | 512 |
| Vocabulary | SentencePiece (250K) |
| Similarity function | Cosine similarity |
| Language | Azerbaijani (az) |
| License | Apache 2.0 |
## Limitations
- Optimized for Azerbaijani text retrieval. Performance on other languages may be lower than the base multilingual-e5-small model.
- Maximum input length is 512 tokens. Longer documents should be chunked.
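A simple overlapping-window chunker, sketched with whitespace tokens for brevity; a production pipeline should count tokens with the model's actual SentencePiece tokenizer, since whitespace words and subword tokens differ:

```python
def chunk_text(text, max_tokens=512, overlap=64):
    """Split text into overlapping word windows (whitespace words stand in
    for tokens here; use the real tokenizer for exact 512-token budgets)."""
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks
```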
## Citation
```bibtex
@misc{locret-small-2026,
  title={LocRet-small: A Compact Azerbaijani Retrieval Embedding Model},
  author={LocalDoc},
  year={2026},
  url={https://huggingface.co/LocalDoc/LocRet-small}
}
```
## Acknowledgments
- Base model: [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)
- Teacher reranker: [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)
- Training methodology inspired by [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and cross-encoder listwise distillation research. |