---
language:
- az
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- retrieval
- azerbaijani
- embedding
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
- LocalDoc/msmarco-az-reranked
- LocalDoc/azerbaijani_retriever_corpus-reranked
- LocalDoc/ldquad_v2_retrieval-reranked
- LocalDoc/azerbaijani_books_retriever_corpus-reranked
base_model: intfloat/multilingual-e5-small
model-index:
- name: LocRet-small
  results:
  - task:
      type: retrieval
    dataset:
      name: AZ-MIRAGE
      type: custom
    metrics:
    - type: mrr@10
      value: 0.5250
    - type: ndcg@10
      value: 0.6162
    - type: recall@10
      value: 0.8948
---

# LocRet-small — Azerbaijani Retrieval Embedding Model

**LocRet-small** is a compact, high-performance retrieval embedding model specialized for the Azerbaijani language. Despite being **4.8× smaller** than BGE-m3, it significantly outperforms it on Azerbaijani retrieval benchmarks.

## Key Results

### AZ-MIRAGE Benchmark (Native Azerbaijani Retrieval)

| Rank | Model | Parameters | MRR@10 | P@1 | R@5 | R@10 | NDCG@5 | NDCG@10 |
|:----:|:------|:---------:|:------:|:---:|:---:|:----:|:------:|:-------:|
| **#1** | **LocRet-small** | **118M** | **0.5250** | **0.3132** | **0.8267** | **0.8948** | **0.5938** | **0.6162** |
| #2 | BAAI/bge-m3 | 568M | 0.4204 | 0.2310 | 0.6905 | 0.7787 | 0.4791 | 0.5079 |
| #3 | perplexity-ai/pplx-embed-v1-0.6b | 600M | 0.4117 | 0.2276 | 0.6715 | 0.7605 | 0.4677 | 0.4968 |
| #4 | intfloat/multilingual-e5-large | 560M | 0.4043 | 0.2264 | 0.6571 | 0.7454 | 0.4584 | 0.4875 |
| #5 | intfloat/multilingual-e5-base | 278M | 0.3852 | 0.2116 | 0.6353 | 0.7216 | 0.4390 | 0.4672 |
| #6 | Snowflake/snowflake-arctic-embed-l-v2.0 | 568M | 0.3746 | 0.2135 | 0.6006 | 0.6916 | 0.4218 | 0.4516 |
| #7 | Qwen/Qwen3-Embedding-4B | 4B | 0.3602 | 0.1869 | 0.6067 | 0.7036 | 0.4119 | 0.4437 |
| #8 | intfloat/multilingual-e5-small (base) | 118M | 0.3586 | 0.1958 | 0.5927 | 0.6834 | 0.4079 | 0.4375 |
| #9 | Qwen/Qwen3-Embedding-0.6B | 600M | 0.2951 | 0.1516 | 0.4926 | 0.5956 | 0.3339 | 0.3676 |


## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("LocalDoc/LocRet-small")

queries = ["Azərbaycanın paytaxtı hansı şəhərdir?"]
passages = [
    "Bakı Azərbaycan Respublikasının paytaxtı və ən böyük şəhəridir.",
    "Gəncə Azərbaycanın ikinci böyük şəhəridir.",
]

query_embeddings = model.encode_query(queries)
passage_embeddings = model.encode_document(passages)

similarities = model.similarity(query_embeddings, passage_embeddings)
print(similarities)
```
The prefixes `"query: "` and `"passage: "` are applied automatically by `encode_query` and `encode_document`. If you call `model.encode` directly, the `"passage: "` prefix is added by default, so queries must be prefixed explicitly.
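
If you embed texts through another client (e.g. an inference server) where the automatic prefixing is unavailable, you can reproduce the convention yourself. A minimal sketch; the helper names are illustrative, only the `"query: "`/`"passage: "` strings come from the model card:

```python
def with_query_prefix(texts):
    """Prepend the E5-style query prefix expected by the model."""
    return [f"query: {t}" for t in texts]

def with_passage_prefix(texts):
    """Prepend the E5-style passage prefix expected by the model."""
    return [f"passage: {t}" for t in texts]
```

Pass the prefixed strings to whatever encoder endpoint you use; the embeddings are then comparable with those produced by `encode_query`/`encode_document`.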


## Training

### Method

LocRet-small is fine-tuned from [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) using **listwise KL distillation** combined with a contrastive loss:

$$\mathcal{L} = \mathcal{L}_{\text{KL}} + 0.1 \cdot \mathcal{L}_{\text{InfoNCE}}$$

- **Listwise KL divergence**: Distills the ranking distribution from a cross-encoder teacher ([bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)) over candidate lists of 1 positive + up to 10 hard negatives per query. Teacher and student softmax distributions use asymmetric temperatures (τ_teacher = 0.3, τ_student = 0.05).
- **In-batch contrastive loss (InfoNCE)**: Provides additional diversity through in-batch negatives on positive passages.

This approach preserves the full teacher ranking signal rather than reducing it to binary relevance labels, which is critical for training on top of already strong pre-trained retrievers.
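
The combined objective can be sketched in PyTorch as follows. The temperatures τ_teacher = 0.3 and τ_student = 0.05 are from this card; the InfoNCE temperature and the tensor layout (positive in column 0) are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def listwise_kl(student_sims, teacher_scores, tau_s=0.05, tau_t=0.3):
    """KL divergence between teacher and student ranking distributions.

    student_sims, teacher_scores: [batch, 1 + n_neg], positive assumed
    in column 0. Asymmetric temperatures sharpen the student relative
    to the teacher, as described above.
    """
    p_teacher = F.softmax(teacher_scores / tau_t, dim=-1)
    log_p_student = F.log_softmax(student_sims / tau_s, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def in_batch_infonce(q_emb, pos_emb, tau=0.05):
    """InfoNCE over in-batch negatives on L2-normalized embeddings.

    The temperature here is an assumption; the card does not state it.
    """
    logits = q_emb @ pos_emb.T / tau
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)

def total_loss(student_sims, teacher_scores, q_emb, pos_emb):
    # L = L_KL + 0.1 * L_InfoNCE, matching the formula above.
    return listwise_kl(student_sims, teacher_scores) + 0.1 * in_batch_infonce(q_emb, pos_emb)
```

This is a sketch of the stated objective, not the training code; in practice the student similarities come from the bi-encoder and the teacher scores from the cross-encoder reranker over the same candidate list.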

### Data

The model was trained on approximately **3.5 million** Azerbaijani query-passage pairs from four datasets:

| Dataset | Pairs | Domain | Type |
|:--------|------:|:-------|:-----|
| [msmarco-az-reranked](https://huggingface.co/datasets/LocalDoc/msmarco-az-reranked) | ~1.4M | General web QA | Translated EN→AZ |
| [azerbaijani_books_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked) | ~1.6M | Books, politics, history | Native AZ |
| [azerbaijani_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_retriever_corpus-reranked) | ~189K | News, culture | Native AZ |
| [ldquad_v2_retrieval-reranked](https://huggingface.co/datasets/LocalDoc/ldquad_v2_retrieval-reranked) | ~330K | Wikipedia QA | Native AZ |

All datasets include hard negatives scored by a cross-encoder reranker, which serve as the teacher signal for listwise distillation. False negatives were filtered using normalized score thresholds.
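
The exact filtering scheme is not published; one plausible reading of "normalized score thresholds" is to min-max normalize each query's candidate scores and drop negatives that score nearly as high as the positive. A hypothetical sketch (the normalization and the 0.9 threshold are assumptions):

```python
def filter_false_negatives(pos_score, neg_scores, threshold=0.9):
    """Drop negatives that are likely unlabeled positives.

    Min-max normalizes all candidate scores for one query to [0, 1],
    then removes negatives whose normalized score exceeds `threshold`.
    Scheme and threshold are illustrative, not the published recipe.
    """
    all_scores = [pos_score] + list(neg_scores)
    lo, hi = min(all_scores), max(all_scores)
    span = (hi - lo) or 1.0  # avoid division by zero on constant lists
    return [s for s in neg_scores if (s - lo) / span <= threshold]
```

A negative scoring almost as high as the positive would otherwise be pushed away during training despite probably being relevant, which is the failure mode this filtering targets.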

### Hyperparameters

| Parameter | Value |
|:----------|:------|
| Base model | intfloat/multilingual-e5-small |
| Max sequence length | 512 |
| Effective batch size | 256 |
| Learning rate | 5e-5 |
| Schedule | Linear warmup (5%) + cosine decay |
| Precision | FP16 |
| Epochs | 1 |
| Training time | ~25 hours |
| Hardware | 4× NVIDIA RTX 5090 (32GB) |

### Training Insights

- **Listwise KL distillation outperforms standard contrastive training** (MultipleNegativesRankingLoss) for fine-tuning pre-trained retrievers, consistent with findings from [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and [cadet-embed](https://arxiv.org/abs/2505.19274).
- **Retrieval pre-training matters more than language-specific pre-training** for retrieval tasks: multilingual-e5-small (with retrieval pre-training) significantly outperforms XLM-RoBERTa and other BERT variants (without retrieval pre-training) as a base model.
- **A mix of translated and native data** prevents catastrophic forgetting while enabling language specialization.

## Benchmark

### AZ-MIRAGE

A native Azerbaijani retrieval benchmark ([AZ-MIRAGE](https://github.com/LocalDoc-Azerbaijan/AZ-MIRAGE)) with 7,373 queries and 40,448 document chunks covering diverse topics. It evaluates retrieval quality on naturally written Azerbaijani text.
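
The reported metrics follow their standard definitions with binary relevance. A minimal reference implementation for one query (average over all queries to reproduce table-level numbers):

```python
import math

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant document in the top k."""
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant documents retrieved in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """NDCG with binary gains: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```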

## Model Details

| Property | Value |
|:---------|:------|
| Architecture | BERT encoder (XLM-RoBERTa tokenizer) |
| Parameters | 118M |
| Embedding dimension | 384 |
| Max tokens | 512 |
| Vocabulary | SentencePiece (250K) |
| Similarity function | Cosine similarity |
| Language | Azerbaijani (az) |
| License | Apache 2.0 |

## Limitations

- Optimized for Azerbaijani text retrieval. Performance on other languages may be lower than the base multilingual-e5-small model.
- Maximum input length is 512 tokens. Longer documents should be chunked.
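
A minimal sliding-window chunker, as one way to handle the 512-token limit. It splits on words, which only approximates SentencePiece token counts, so the window size is kept conservative; the parameters are illustrative:

```python
def chunk_words(text, max_words=180, overlap=30):
    """Split text into overlapping word windows for embedding.

    Word counts understate SentencePiece token counts, so max_words
    is kept well below the model's 512-token limit. Overlap preserves
    context that would otherwise be cut at chunk boundaries.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk is then embedded with `encode_document` and indexed separately; at query time the best-scoring chunk represents its document.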

## Citation

```bibtex
@misc{locret-small-2026,
  title={LocRet-small: A Compact Azerbaijani Retrieval Embedding Model},
  author={LocalDoc},
  year={2026},
  url={https://huggingface.co/LocalDoc/LocRet-small}
}
```

## Acknowledgments

- Base model: [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)
- Teacher reranker: [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)
- Training methodology inspired by [Arctic-Embed 2.0](https://arxiv.org/abs/2412.04506) and cross-encoder listwise distillation research.