---
language:
- fa
---
# 🧠 Hakim-small

[📄 Paper](https://arxiv.org/abs/2505.08435)
**Hakim-small** is a compact and efficient version of the state-of-the-art **Hakim** text embedding model, designed specifically for the Persian language. While the main Hakim model sets the state of the art on the **FaMTEB** benchmark, Hakim-small offers a strong balance of performance and efficiency, making it ideal for applications with resource constraints. It leverages the same advanced training methodologies and datasets as its larger counterpart.
Hakim-small is optimized for applications such as semantic search, dense retrieval, retrieval-augmented generation (RAG), and instruction-based NLP tasks like classification and QA.
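As a minimal sketch of the semantic-search use case, the snippet below ranks candidate documents by cosine similarity between their embedding vectors. The `embed` helper is a hypothetical stand-in for whatever produces Hakim-small vectors (for example, the HTTP API shown under Model Usage below); NumPy is used only for the similarity math.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical helper: return one embedding vector per input text.
    Wire this to the Hakim-small endpoint shown in the Model Usage section."""
    raise NotImplementedError

def search(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Rank `documents` by cosine similarity to `query`."""
    doc_vecs = embed(documents)        # shape: (n_docs, dim)
    query_vec = embed([query])[0]      # shape: (dim,)
    # Cosine similarity is the dot product of L2-normalized vectors.
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ query_vec
    top = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in top]
```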
---
## 📌 Model Highlights
- 🔍 **Strong FaMTEB Performance**: Averages 70.45 on the FaMTEB benchmark, within ~3.4 points of the full Hakim model (see the results table below).
- 🧾 **Instruction-Tuned**: Capable of handling tasks like classification, STS, retrieval, and QA, benefiting from the same instruction-tuning paradigm as Hakim.
- 🗣️ **Chatbot-Ready**: Fine-tuned with chat-history-aware data from the Hakim project.
- ⚙️ **Highly Compact & Fast**: With only **~38M parameters** (compared to Hakim's ~124M), Hakim-small is significantly smaller and faster, making it highly effective for real-world inference where efficiency is key.
---
## 🏗️ Training Datasets
Hakim-small benefits from the comprehensive, high-quality datasets developed for the Hakim project. These include:

### 📚 Pretraining
- **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs**: 6.8B tokens from ~20M Persian blog posts.
- **Queries**: 8.5M anonymized search queries.

### 🔄 Unsupervised Stage (Pairsia-unsup)
- 5M high-quality Persian text pairs from diverse sources, including document–title, FAQ, QA, paper title–abstract, and machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).

### 🧠 Supervised Stage (Pairsia-sup)
- 1.3M labeled pairs with multiple negatives per query.
- Instruction-based fine-tuning across tasks: Classification, Retrieval, STS, QA, NLI.

For more detailed information on the dataset creation and curation process, please refer to the [Hakim paper](https://arxiv.org/abs/2505.08435).
---
## 🧪 Benchmark Results (FaMTEB)
| Model                  | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS       | Summarization |
|------------------------|------------|----------------|------------|------------|-----------|-----------|-----------|---------------|
| **Hakim**              | **73.81**  | **84.56**      | **70.46**  | **89.75**  | 69.46     | 40.43     | 76.62     | **85.41**     |
| Hakim-small            | 70.45      | 80.19          | 66.31      | 87.41      | 67.30     | 38.05     | 75.53     | 78.40         |
| Hakim-unsup            | 64.56      | 60.65          | 58.89      | 86.41      | 67.56     | 37.71     | **79.36** | 61.34         |
| BGE-m3                 | 65.29      | 58.75          | 57.73      | 85.21      | **74.56** | 43.38     | 76.35     | 61.07         |
| Jina-embeddings-v3     | 64.53      | 59.93          | 59.15      | 83.71      | 61.26     | **43.51** | 78.65     | 65.50         |
| multilingual-e5-large  | 64.40      | 59.86          | 57.19      | 84.42      | 74.34     | 42.98     | 75.38     | 56.61         |
| GTE-multilingual-base  | 63.64      | 56.07          | 57.28      | 84.58      | 69.72     | 41.22     | 75.75     | 60.88         |
| multilingual-e5-base   | 62.93      | 57.62          | 56.52      | 84.04      | 72.07     | 41.20     | 74.45     | 54.58         |
| Tooka-SBERT            | 60.65      | 59.40          | 56.45      | 87.04      | 58.29     | 27.86     | 76.42     | 59.06         |
---

## Model Usage
You can interact with the `Hakim_small` model through our API. Below are examples using `curl` and Python.
### Inference with `curl`

Here's how to send a request to the model using a `curl` command in your terminal.
**Important:** Replace `your_api_key` with your actual API key.

> **Note:** For quick testing, you can use the value `mcinext` as your API key. This will allow you to use the API with some limitations.
```bash
curl -X POST 'https://mcinext.ai/api/hakim-small' \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer your_api_key" \
  -d '{
        "model": "Hakim_small",
        "input": [
          "The text of the first document.",
          "The text of the second document.",
          "And so on..."
        ],
        "encoding_format": "float",
        "add_special_tokens": true
      }'
```
### Inference with Python

```python
import requests
import json

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your key, or use "mcinext" for testing
API_URL = "https://mcinext.ai/api/hakim-small"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

data = {
    "model": "Hakim_small",
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": True
}

# --- Send Request ---
try:
    response = requests.post(API_URL, headers=headers, data=json.dumps(data))
    response.raise_for_status()

    print("Request successful!")
    print("Response JSON:")
    print(response.json())

except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"Another error occurred: {err}")
```
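As a follow-up, the sketch below turns the returned vectors into a similarity score. The response schema here is an assumption: since the request body mirrors the common OpenAI-style embeddings API, the sketch assumes vectors come back under `data[i]["embedding"]`; inspect `response.json()` and adjust the key names to whatever the Hakim API actually returns.

```python
import math

# ASSUMPTION: the response follows the common OpenAI-style embeddings
# schema, i.e. {"data": [{"embedding": [...]}, ...]}. Adjust the keys
# below if the Hakim API returns a different structure.
embeddings = [item["embedding"] for item in response.json()["data"]]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Similarity between the first two input documents.
print(cosine_similarity(embeddings[0], embeddings[1]))
```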
## Citation

```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```