# 🧠 Hakim-unsup

[📄 arXiv:2505.08435](https://arxiv.org/abs/2505.08435)

**Hakim-unsup** is an intermediate stage of the state-of-the-art **Hakim** text embedding project for the Persian language. The model is the result of pretraining on large Persian corpora followed by an extensive unsupervised contrastive learning phase on millions of text pairs.

While the fully supervised **Hakim** model achieves top performance on the **FaMTEB** benchmark, Hakim-unsup provides strong general-purpose semantic representations. It serves as a powerful foundation for further fine-tuning and is particularly useful when large labeled datasets are unavailable but learning semantic similarity from unlabeled pairs is crucial.
---

## 📌 Model Highlights

- 🧱 **Strong Foundational Embeddings**: Provides robust general-purpose Persian text embeddings learned from large-scale unsupervised data.
- 🔄 **Trained on Diverse Unlabeled Pairs**: Benefits from the `Pairsia-unsup` dataset, capturing a wide array of semantic relationships.
- ⚙️ **Standard Size**: ~124M parameters, the same as the base Hakim model.
- 🌱 **Basis for Supervised Models**: This is the model checkpoint *before* the supervised instruction-tuning phase that produces the final Hakim and Hakim-small models.

---
## 🏗️ Training Datasets

Hakim-unsup is trained in two main phases:

### 📚 Pretraining

- **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs**: 6.8B tokens from ~20M Persian blog posts.
- **Queries**: 8.5M anonymized search queries.
### 🔄 Unsupervised Stage (Pairsia-unsup)

- **Pairsia-unsup**: 5M high-quality Persian text pairs from diverse sources, including:
  - Document–title, FAQ, QA, and paper title–abstract pairs.
  - Machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).
- The model is trained with a contrastive learning objective on these pairs to learn general semantic representations.

Hakim-unsup does *not* undergo the subsequent supervised fine-tuning stage on the `Pairsia-sup` dataset or instruction tuning. For more details on the dataset creation and curation process, please refer to the [Hakim paper](https://arxiv.org/abs/2505.08435).
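The contrastive objective used in this stage can be sketched as an in-batch InfoNCE-style loss: each text is pulled toward its paired partner while the other pairs in the batch act as negatives. The snippet below is a minimal toy illustration of that idea, not the actual Hakim training code; the vectors and temperature value are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(queries, docs, temperature=0.05):
    """In-batch contrastive loss: queries[i] pairs with docs[i];
    every other doc in the batch serves as a negative."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        log_sum_exp = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_sum_exp)  # -log softmax of the true pair
    return loss / len(queries)

# Toy batch of 2 pairs with 3-dimensional "embeddings"
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
docs    = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
print(info_nce_loss(queries, docs))  # small value: pairs are well aligned
```

Driving this loss down pushes matched pairs together and mismatched pairs apart, which is how general semantic structure is learned from unlabeled pairs.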
---

## 🧪 Benchmark Results (FaMTEB)
| Model                  | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS       | Summarization |
|------------------------|------------|----------------|------------|------------|-----------|-----------|-----------|---------------|
| **Hakim**              | **73.81**  | **84.56**      | **70.46**  | **89.75**  | 69.46     | 40.43     | 76.62     | **85.41**     |
| Hakim-small            | 70.45      | 80.19          | 66.31      | 87.41      | 67.30     | 38.05     | 75.53     | 78.40         |
| Hakim-unsup            | 64.56      | 60.65          | 58.89      | 86.41      | 67.56     | 37.71     | **79.36** | 61.34         |
| BGE-m3                 | 65.29      | 58.75          | 57.73      | 85.21      | **74.56** | 43.38     | 76.35     | 61.07         |
| Jina-embeddings-v3     | 64.53      | 59.93          | 59.15      | 83.71      | 61.26     | **43.51** | 78.65     | 65.50         |
| multilingual-e5-large  | 64.40      | 59.86          | 57.19      | 84.42      | 74.34     | 42.98     | 75.38     | 56.61         |
| GTE-multilingual-base  | 63.64      | 56.07          | 57.28      | 84.58      | 69.72     | 41.22     | 75.75     | 60.88         |
| multilingual-e5-base   | 62.93      | 57.62          | 56.52      | 84.04      | 72.07     | 41.20     | 74.45     | 54.58         |
| Tooka-SBERT            | 60.65      | 59.40          | 56.45      | 87.04      | 58.29     | 27.86     | 76.42     | 59.06         |
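As a quick sanity check, the Avg. Score column appears to be the unweighted mean of the seven per-task scores. Taking Hakim's row from the table above:

```python
# Per-task FaMTEB scores for Hakim, copied from the table above:
# Classification, Clustering, PairClass., Reranking, Retrieval, STS, Summarization
hakim_scores = [84.56, 70.46, 89.75, 69.46, 40.43, 76.62, 85.41]

avg = round(sum(hakim_scores) / len(hakim_scores), 2)
print(avg)  # matches the 73.81 reported in the Avg. Score column
```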
---

## Model Usage

You can interact with the `Hakim_unsup` model through our API. Below are examples using `curl` and Python.

### Inference with `curl`

Here is how to send a request to the model using a `curl` command in your terminal.

**Important:** Replace `your_api_key` with your actual API key.

> **Note:** For quick testing, you can use the value `mcinext` as your API key. This allows you to use the API with some limitations.
```bash
curl -X POST 'https://mcinext.ai/api/hakim-unsup' \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer your_api_key" \
  -d '{
    "model": "Hakim_unsup",
    "input": [
      "The text of the first document.",
      "The text of the second document.",
      "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": true
  }'
```
### Inference with `python`

```python
import requests

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your key, or use "mcinext" for testing
API_URL = "https://mcinext.ai/api/hakim-unsup"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}
data = {
    "model": "Hakim_unsup",
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on...",
    ],
    "encoding_format": "float",
    "add_special_tokens": True,
}

# --- Send Request ---
try:
    # requests serializes the payload itself when passed via `json=`
    response = requests.post(API_URL, headers=headers, json=data)
    response.raise_for_status()
    print("Request successful!")
    print("Response JSON:")
    print(response.json())
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"Another error occurred: {err}")
```
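A common next step with the returned embeddings is to compare documents by cosine similarity. The helper below is a minimal sketch; the commented extraction line assumes an OpenAI-style response body (`{"data": [{"embedding": [...]}, ...]}`), which is an assumption about this API, and the toy vectors stand in for real model output.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assuming an OpenAI-style response schema, vectors would be extracted as:
# embeddings = [item["embedding"] for item in response.json()["data"]]

# Demonstration with toy vectors in place of real API output:
doc_a = [0.12, 0.85, -0.33]
doc_b = [0.10, 0.80, -0.30]
print(f"similarity: {cosine_similarity(doc_a, doc_b):.4f}")
```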
## Citation

```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```