# 🧠 Hakim-unsup

[📄 arXiv:2505.08435](https://arxiv.org/abs/2505.08435)

**Hakim-unsup** is an intermediate stage of the state-of-the-art **Hakim** text embedding project for the Persian language. The model is the result of pretraining on large Persian corpora followed by an extensive unsupervised contrastive learning phase on millions of text pairs.

While the fully supervised **Hakim** model achieves top performance on the **FaMTEB** benchmark, Hakim-unsup provides strong general-purpose semantic representations. It serves as a powerful foundation for further fine-tuning and is particularly useful when large labeled datasets are unavailable but learning semantic similarity from unlabeled pairs is crucial.
---

## 📌 Model Highlights

- 🧱 **Strong Foundational Embeddings**: Provides robust general-purpose Persian text embeddings learned from large-scale unsupervised data.
- 🔄 **Trained on Diverse Unlabeled Pairs**: Benefits from the `Pairsia-unsup` dataset, capturing a wide array of semantic relationships.
- ⚙️ **Standard Size**: ~124M parameters, the same as the base Hakim model.
- 🌱 **Basis for Supervised Models**: This is the model checkpoint *before* the supervised instruction-tuning phase that produces the final Hakim and Hakim-small models.

---
## 🏗️ Training Datasets

Hakim-unsup is trained in two main phases:

### 📚 Pretraining

- **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs**: 6.8B tokens from ~20M Persian blog posts.
- **Queries**: 8.5M anonymized search queries.
### 🔄 Unsupervised Stage (Pairsia-unsup)

- **Pairsia-unsup**: 5M high-quality Persian text pairs from diverse sources, including:
  - Document–title, FAQ, QA, and paper title–abstract pairs.
  - Machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).
- The model is trained with a contrastive learning objective on these pairs to learn general semantic representations.

Hakim-unsup does *not* undergo the subsequent supervised fine-tuning stage on the `Pairsia-sup` dataset or instruction tuning. For more details on the dataset creation and curation process, please refer to the [Hakim paper](https://arxiv.org/abs/2505.08435).
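The contrastive objective used in this stage can be sketched as an in-batch InfoNCE-style loss: each text is pulled toward its paired partner while the other pairs in the batch act as negatives. The snippet below is a minimal toy illustration of that idea, not the actual Hakim training code; the vectors and temperature value are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(queries, docs, temperature=0.05):
    """In-batch contrastive loss: queries[i] pairs with docs[i];
    every other doc in the batch serves as a negative."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        log_sum_exp = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_sum_exp)  # -log softmax of the true pair
    return loss / len(queries)

# Toy batch of 2 pairs with 3-dimensional "embeddings"
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
docs    = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
print(info_nce_loss(queries, docs))  # small value: pairs are well aligned
```

Driving this loss down pushes matched pairs together and mismatched pairs apart, which is how general semantic structure is learned from unlabeled pairs.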
---

## 🧪 Benchmark Results (FaMTEB)
| Model                  | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS       | Summarization |
|------------------------|------------|----------------|------------|------------|-----------|-----------|-----------|---------------|
| **Hakim**              | **73.81**  | **84.56**      | **70.46**  | **89.75**  | 69.46     | 40.43     | 76.62     | **85.41**     |
| Hakim-small            | 70.45      | 80.19          | 66.31      | 87.41      | 67.30     | 38.05     | 75.53     | 78.40         |
| Hakim-unsup            | 64.56      | 60.65          | 58.89      | 86.41      | 67.56     | 37.71     | **79.36** | 61.34         |
| BGE-m3                 | 65.29      | 58.75          | 57.73      | 85.21      | **74.56** | 43.38     | 76.35     | 61.07         |
| Jina-embeddings-v3     | 64.53      | 59.93          | 59.15      | 83.71      | 61.26     | **43.51** | 78.65     | 65.50         |
| multilingual-e5-large  | 64.40      | 59.86          | 57.19      | 84.42      | 74.34     | 42.98     | 75.38     | 56.61         |
| GTE-multilingual-base  | 63.64      | 56.07          | 57.28      | 84.58      | 69.72     | 41.22     | 75.75     | 60.88         |
| multilingual-e5-base   | 62.93      | 57.62          | 56.52      | 84.04      | 72.07     | 41.20     | 74.45     | 54.58         |
| Tooka-SBERT            | 60.65      | 59.40          | 56.45      | 87.04      | 58.29     | 27.86     | 76.42     | 59.06         |
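As a quick sanity check, the Avg. Score column appears to be the unweighted mean of the seven per-task scores. Taking Hakim's row from the table above:

```python
# Per-task FaMTEB scores for Hakim, copied from the table above:
# Classification, Clustering, PairClass., Reranking, Retrieval, STS, Summarization
hakim_scores = [84.56, 70.46, 89.75, 69.46, 40.43, 76.62, 85.41]

avg = round(sum(hakim_scores) / len(hakim_scores), 2)
print(avg)  # matches the 73.81 reported in the Avg. Score column
```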
---

## Model Usage

You can interact with the `Hakim_unsup` model through our API. Below are examples using `curl` and Python.

### Inference with `curl`

Here is how to send a request to the model using a `curl` command in your terminal.

**Important:** Replace `your_api_key` with your actual API key.

> **Note:** For quick testing, you can use the value `mcinext` as your API key. This allows you to use the API with some limitations.
```bash
curl -X POST 'https://mcinext.ai/api/hakim-unsup' \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer your_api_key" \
  -d '{
    "model": "Hakim_unsup",
    "input": [
      "The text of the first document.",
      "The text of the second document.",
      "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": true
  }'
```
### Inference with `python`

```python
import requests

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your key, or use "mcinext" for testing
API_URL = "https://mcinext.ai/api/hakim-unsup"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}
data = {
    "model": "Hakim_unsup",
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on...",
    ],
    "encoding_format": "float",
    "add_special_tokens": True,
}

# --- Send Request ---
try:
    # requests serializes the payload itself when passed via `json=`
    response = requests.post(API_URL, headers=headers, json=data)
    response.raise_for_status()
    print("Request successful!")
    print("Response JSON:")
    print(response.json())
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"Another error occurred: {err}")
```
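A common next step with the returned embeddings is to compare documents by cosine similarity. The helper below is a minimal sketch; the commented extraction line assumes an OpenAI-style response body (`{"data": [{"embedding": [...]}, ...]}`), which is an assumption about this API, and the toy vectors stand in for real model output.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assuming an OpenAI-style response schema, vectors would be extracted as:
# embeddings = [item["embedding"] for item in response.json()["data"]]

# Demonstration with toy vectors in place of real API output:
doc_a = [0.12, 0.85, -0.33]
doc_b = [0.10, 0.80, -0.30]
print(f"similarity: {cosine_similarity(doc_a, doc_b):.4f}")
```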
## Citation

```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```