---
language:
- fa
---
# 🧠 Hakim-small
[![arXiv](https://img.shields.io/badge/arXiv-2505.08435-b31b1b.svg)](https://arxiv.org/abs/2505.08435)

**Hakim-small** is a compact and efficient version of the state-of-the-art **Hakim** text embedding model, specifically designed for the Persian language. While the main Hakim model sets the SOTA on the **FaMTEB** benchmark, Hakim-small offers a strong balance of performance and efficiency, making it ideal for applications with resource constraints. It leverages the same advanced training methodologies and datasets as its larger counterpart.

Hakim-small is optimized for applications such as semantic search, dense retrieval, RAG (retrieval-augmented generation), and instruction-based NLP tasks like classification and QA.

---

## 📌 Model Highlights

- 🔍 **Strong FaMTEB Performance**: Achieves excellent results on the FaMTEB benchmark, especially for its size.
- 🧾 **Instruction-Tuned**: Capable of handling tasks like classification, STS, retrieval, and QA, benefiting from the same instruction-tuning paradigm as Hakim.
- 🗣️ **Chatbot-Ready**: Fine-tuned with chat history-aware data from the Hakim project.
- ⚙️ **Highly Compact & Fast**: With only **~38M parameters** (compared to Hakim's ~124M), Hakim-small is significantly smaller and faster, making it highly effective for real-world inference where efficiency is key.

---

## 🏗️ Training Datasets

Hakim-small benefits from the comprehensive and high-quality datasets developed for the Hakim project. These include:

### 📚 Pretraining
- **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs**: 6.8B tokens from ~20M Persian blog posts.
- **Queries**: 8.5M anonymized search queries.

### 🔄 Unsupervised Stage (Pairsia-unsup)
- 5M high-quality Persian text pairs from diverse sources including document–title, FAQ, QA, paper title–abstract, and machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).

### 🧠 Supervised Stage (Pairsia-sup)
- 1.3M labeled pairs with multiple negatives per query.
- Instruction-based fine-tuning across tasks: Classification, Retrieval, STS, QA, NLI.

For more detailed information on the dataset creation and curation process, please refer to the [Hakim paper](https://arxiv.org/abs/2505.08435).

---

## 🧪 Benchmark Results (FaMTEB)

| Model                   | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS   | Summarization |
|------------------------|------------|----------------|------------|------------|-----------|-----------|-------|----------------|
| **Hakim**              | **73.81**  | **84.56**      | **70.46**  | **89.75**  | 69.46     | 40.43     | 76.62 | **85.41**      |
| Hakim-small            | 70.45      | 80.19          | 66.31      | 87.41      | 67.30     | 38.05     | 75.53 | 78.40          |
| Hakim-unsup            | 64.56      | 60.65          | 58.89      | 86.41      | 67.56     | 37.71     | 79.36 | 61.34          |
| BGE-m3                 | 65.29      | 58.75          | 57.73      | 85.21      | **74.56** | 43.38     | 76.35 | 61.07          |
| Jina-embeddings-v3     | 64.53      | 59.93          | 59.15      | 83.71      | 61.26     | **43.51** | **78.65** | 65.50      |
| multilingual-e5-large  | 64.40      | 59.86          | 57.19      | 84.42      | 74.34     | 42.98     | 75.38 | 56.61          |
| GTE-multilingual-base  | 63.64      | 56.07          | 57.28      | 84.58      | 69.72     | 41.22     | 75.75 | 60.88          |
| multilingual-e5-base   | 62.93      | 57.62          | 56.52      | 84.04      | 72.07     | 41.20     | 74.45 | 54.58          |
| Tooka-SBERT            | 60.65      | 59.40          | 56.45      | 87.04      | 58.29     | 27.86     | 76.42 | 59.06          |

---
## Model Usage

You can interact with the `Hakim_small` model through our API. Below are examples using `curl` and Python.

### Inference with `curl`

Here's how to send a request to the model using a `curl` command in your terminal.

**Important:** Replace `your_api_key` with your actual API key.

> **Note:** For quick testing, you can use the value `mcinext` as your API key. This will allow you to use the API with some limitations.

```bash
curl -X POST 'https://mcinext.ai/api/hakim-small' \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization": "Bearer your_api_key" \
-d '{
    "model": "Hakim_small",
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": true
}'
```
### Inference with Python

```python
import requests
import json

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your key or "mcinext" for testing
API_URL = "https://mcinext.ai/api/hakim-small"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

data = {
    "model": "Hakim_small", 
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": True
}

# --- Send Request ---
try:
    response = requests.post(API_URL, headers=headers, data=json.dumps(data))
    response.raise_for_status()  

    print("Request successful!")
    print("Response JSON:")
    print(response.json())

except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"An other error occurred: {err}")
```
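Once embeddings come back from the API, a typical next step for semantic search or RAG is ranking documents by cosine similarity against a query embedding. The sketch below uses toy vectors in place of real API output (the exact response layout is an assumption; parse it from `response.json()` according to what your deployment actually returns) and shows only the similarity-and-ranking step:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings returned by the API;
# in practice these would be extracted from response.json().
query_vec = [0.1, 0.3, 0.5]
doc_vecs = {
    "doc1": [0.1, 0.3, 0.5],
    "doc2": [0.5, 0.1, -0.2],
}

# Rank documents by similarity to the query, highest first.
ranked = sorted(doc_vecs.items(),
                key=lambda kv: cosine_similarity(query_vec, kv[1]),
                reverse=True)
print(ranked[0][0])  # the most similar document
```

Embedding models like Hakim-small are typically paired with a vector index for large corpora, but for small document sets this exhaustive comparison is often sufficient.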

## Citation
```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```