---
language:
- fa
---
# 🧠 Hakim-small
[![arXiv](https://img.shields.io/badge/arXiv-2505.08435-b31b1b.svg)](https://arxiv.org/abs/2505.08435)

**Hakim-small** is a compact and efficient version of the state-of-the-art **Hakim** text embedding model, specifically designed for the Persian language. While the main Hakim model sets the SOTA on the **FaMTEB** benchmark, Hakim-small offers a strong balance of performance and efficiency, making it ideal for applications with resource constraints. It leverages the same advanced training methodologies and datasets as its larger counterpart.

Hakim-small is optimized for applications such as semantic search, dense retrieval, RAG (retrieval-augmented generation), and instruction-based NLP tasks like classification and QA.

---

## 📌 Model Highlights

- 🔍 **Strong FaMTEB Performance**: Achieves excellent results on the FaMTEB benchmark, especially for its size.
- 🧾 **Instruction-Tuned**: Capable of handling tasks like classification, STS, retrieval, and QA, benefiting from the same instruction-tuning paradigm as Hakim.
- 🗣️ **Chatbot-Ready**: Fine-tuned with chat history-aware data from the Hakim project.
- ⚙️ **Highly Compact & Fast**: With only **~38M parameters** (compared to Hakim's ~124M), Hakim-small is significantly smaller and faster, making it highly effective for real-world inference where efficiency is key.

---

## 🏗️ Training Datasets

Hakim-small benefits from the comprehensive and high-quality datasets developed for the Hakim project. These include:

### 📚 Pretraining
- **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs**: 6.8B tokens from ~20M Persian blog posts.
- **Queries**: 8.5M anonymized search queries.

### 🔄 Unsupervised Stage (Pairsia-unsup)
- 5M high-quality Persian text pairs from diverse sources including document–title, FAQ, QA, paper title–abstract, and machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).

### 🧠 Supervised Stage (Pairsia-sup)
- 1.3M labeled pairs with multiple negatives per query.
- Instruction-based fine-tuning across tasks: Classification, Retrieval, STS, QA, NLI.

For more detailed information on the dataset creation and curation process, please refer to the [Hakim paper](https://arxiv.org/abs/2505.08435).

---

## 🧪 Benchmark Results (FaMTEB)

| Model                   | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS   | Summarization |
|------------------------|------------|----------------|------------|------------|-----------|-----------|-------|----------------|
| **Hakim**              | **73.81**  | **84.56**      | **70.46**  | **89.75**  | 69.46     | 40.43     | 76.62 | **85.41**      |
| Hakim-small            | 70.45      | 80.19          | 66.31      | 87.41      | 67.30     | 38.05     | 75.53 | 78.40          |
| Hakim-unsup            | 64.56      | 60.65          | 58.89      | 86.41      | 67.56     | 37.71     | 79.36 | 61.34          |
| BGE-m3                 | 65.29      | 58.75          | 57.73      | 85.21      | **74.56** | 43.38     | 76.35 | 61.07          |
| Jina-embeddings-v3     | 64.53      | 59.93          | 59.15      | 83.71      | 61.26     | **43.51** | **78.65** | 65.50      |
| multilingual-e5-large  | 64.40      | 59.86          | 57.19      | 84.42      | 74.34     | 42.98     | 75.38 | 56.61          |
| GTE-multilingual-base  | 63.64      | 56.07          | 57.28      | 84.58      | 69.72     | 41.22     | 75.75 | 60.88          |
| multilingual-e5-base   | 62.93      | 57.62          | 56.52      | 84.04      | 72.07     | 41.20     | 74.45 | 54.58          |
| Tooka-SBERT            | 60.65      | 59.40          | 56.45      | 87.04      | 58.29     | 27.86     | 76.42 | 59.06          |

---
## Model Usage

You can interact with the `Hakim_small` model through our API. Below are examples using `curl` and Python.

### Inference with `curl`

Here's how to send a request to the model using a `curl` command in your terminal.

**Important:** Replace `your_api_key` with your actual API key.

> **Note:** For quick testing, you can use the value `mcinext` as your API key. This will allow you to use the API with some limitations.

```bash
curl -X POST 'https://mcinext.ai/api/hakim-small' \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Authorization": "Bearer your_api_key" \
-d '{
    "model": "Hakim_small",
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": true
}'
```
### Inference with Python

```python
import requests
import json

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your key or "mcinext" for testing
API_URL = "https://mcinext.ai/api/hakim-small"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

data = {
    "model": "Hakim_small", 
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": True
}

# --- Send Request ---
try:
    response = requests.post(API_URL, headers=headers, data=json.dumps(data))
    response.raise_for_status()  

    print("Request successful!")
    print("Response JSON:")
    print(response.json())

except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"An other error occurred: {err}")
```
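Once embeddings come back from the API, a typical next step for semantic search or RAG is ranking documents by cosine similarity against a query embedding. The sketch below uses toy vectors in place of real API output (the exact response layout is an assumption; parse it from `response.json()` according to what your deployment actually returns) and shows only the similarity-and-ranking step:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings returned by the API;
# in practice these would be extracted from response.json().
query_vec = [0.1, 0.3, 0.5]
doc_vecs = {
    "doc1": [0.1, 0.3, 0.5],
    "doc2": [0.5, 0.1, -0.2],
}

# Rank documents by similarity to the query, highest first.
ranked = sorted(doc_vecs.items(),
                key=lambda kv: cosine_similarity(query_vec, kv[1]),
                reverse=True)
print(ranked[0][0])  # the most similar document
```

Embedding models like Hakim-small are typically paired with a vector index for large corpora, but for small document sets this exhaustive comparison is often sufficient.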

## Citation
```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```