---
language:
- fa
---
# 🧠 Hakim-small

[📄 Paper](https://arxiv.org/abs/2505.08435)
**Hakim-small** is a compact and efficient version of the state-of-the-art **Hakim** text embedding model, designed specifically for the Persian language. While the main Hakim model sets the state of the art on the **FaMTEB** benchmark, Hakim-small offers a strong balance of performance and efficiency, making it ideal for applications with resource constraints. It leverages the same advanced training methodologies and datasets as its larger counterpart.
Hakim-small is optimized for applications such as semantic search, dense retrieval, retrieval-augmented generation (RAG), and instruction-based NLP tasks like classification and QA.
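As a minimal sketch of the semantic-search use case, the snippet below ranks candidate documents by cosine similarity between their embedding vectors. The `embed` helper is a hypothetical stand-in for whatever produces Hakim-small vectors (for example, the HTTP API shown under Model Usage below); NumPy is used only for the similarity math.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical helper: return one embedding vector per input text.
    Wire this to the Hakim-small endpoint shown in the Model Usage section."""
    raise NotImplementedError

def search(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Rank `documents` by cosine similarity to `query`."""
    doc_vecs = embed(documents)        # shape: (n_docs, dim)
    query_vec = embed([query])[0]      # shape: (dim,)
    # Cosine similarity is the dot product of L2-normalized vectors.
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ query_vec
    top = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in top]
```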
---
## 📌 Model Highlights
- 🔍 **Strong FaMTEB Performance**: Averages 70.45 on the FaMTEB benchmark, within ~3.4 points of the full Hakim model (see the results table below).
- 🧾 **Instruction-Tuned**: Capable of handling tasks like classification, STS, retrieval, and QA, benefiting from the same instruction-tuning paradigm as Hakim.
- 🗣️ **Chatbot-Ready**: Fine-tuned with chat-history-aware data from the Hakim project.
- ⚙️ **Highly Compact & Fast**: With only **~38M parameters** (compared to Hakim's ~124M), Hakim-small is significantly smaller and faster, making it highly effective for real-world inference where efficiency is key.
---
## 🏗️ Training Datasets
Hakim-small benefits from the comprehensive, high-quality datasets developed for the Hakim project. These include:

### 📚 Pretraining
- **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs**: 6.8B tokens from ~20M Persian blog posts.
- **Queries**: 8.5M anonymized search queries.

### 🔄 Unsupervised Stage (Pairsia-unsup)
- 5M high-quality Persian text pairs from diverse sources, including document–title, FAQ, QA, paper title–abstract, and machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).

### 🧠 Supervised Stage (Pairsia-sup)
- 1.3M labeled pairs with multiple negatives per query.
- Instruction-based fine-tuning across tasks: Classification, Retrieval, STS, QA, NLI.

For more detailed information on the dataset creation and curation process, please refer to the [Hakim paper](https://arxiv.org/abs/2505.08435).
---
## 🧪 Benchmark Results (FaMTEB)
| Model                  | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS       | Summarization |
|------------------------|------------|----------------|------------|------------|-----------|-----------|-----------|---------------|
| **Hakim**              | **73.81**  | **84.56**      | **70.46**  | **89.75**  | 69.46     | 40.43     | 76.62     | **85.41**     |
| Hakim-small            | 70.45      | 80.19          | 66.31      | 87.41      | 67.30     | 38.05     | 75.53     | 78.40         |
| Hakim-unsup            | 64.56      | 60.65          | 58.89      | 86.41      | 67.56     | 37.71     | **79.36** | 61.34         |
| BGE-m3                 | 65.29      | 58.75          | 57.73      | 85.21      | **74.56** | 43.38     | 76.35     | 61.07         |
| Jina-embeddings-v3     | 64.53      | 59.93          | 59.15      | 83.71      | 61.26     | **43.51** | 78.65     | 65.50         |
| multilingual-e5-large  | 64.40      | 59.86          | 57.19      | 84.42      | 74.34     | 42.98     | 75.38     | 56.61         |
| GTE-multilingual-base  | 63.64      | 56.07          | 57.28      | 84.58      | 69.72     | 41.22     | 75.75     | 60.88         |
| multilingual-e5-base   | 62.93      | 57.62          | 56.52      | 84.04      | 72.07     | 41.20     | 74.45     | 54.58         |
| Tooka-SBERT            | 60.65      | 59.40          | 56.45      | 87.04      | 58.29     | 27.86     | 76.42     | 59.06         |
---

## Model Usage
You can interact with the `Hakim_small` model through our API. Below are examples using `curl` and Python.
### Inference with `curl`

Here's how to send a request to the model using a `curl` command in your terminal.
**Important:** Replace `your_api_key` with your actual API key.

> **Note:** For quick testing, you can use the value `mcinext` as your API key. This will allow you to use the API with some limitations.
```bash
curl -X POST 'https://mcinext.ai/api/hakim-small' \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer your_api_key" \
  -d '{
        "model": "Hakim_small",
        "input": [
          "The text of the first document.",
          "The text of the second document.",
          "And so on..."
        ],
        "encoding_format": "float",
        "add_special_tokens": true
      }'
```
### Inference with Python

```python
import requests
import json

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your key, or use "mcinext" for testing
API_URL = "https://mcinext.ai/api/hakim-small"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

data = {
    "model": "Hakim_small",
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": True
}

# --- Send Request ---
try:
    response = requests.post(API_URL, headers=headers, data=json.dumps(data))
    response.raise_for_status()

    print("Request successful!")
    print("Response JSON:")
    print(response.json())

except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"Another error occurred: {err}")
```
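As a follow-up, the sketch below turns the returned vectors into a similarity score. The response schema here is an assumption: since the request body mirrors the common OpenAI-style embeddings API, the sketch assumes vectors come back under `data[i]["embedding"]`; inspect `response.json()` and adjust the key names to whatever the Hakim API actually returns.

```python
import math

# ASSUMPTION: the response follows the common OpenAI-style embeddings
# schema, i.e. {"data": [{"embedding": [...]}, ...]}. Adjust the keys
# below if the Hakim API returns a different structure.
embeddings = [item["embedding"] for item in response.json()["data"]]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Similarity between the first two input documents.
print(cosine_similarity(embeddings[0], embeddings[1]))
```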
## Citation

```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```