---
language:
  - fa
---

# 🧠 Hakim-small


Hakim-small is a compact, efficient version of the state-of-the-art Hakim text embedding model, designed specifically for the Persian language. While the main Hakim model sets the state of the art on the FaMTEB benchmark, Hakim-small offers a strong balance of performance and efficiency, making it well suited to resource-constrained applications. It leverages the same training methodologies and datasets as its larger counterpart.

Hakim-small is optimized for applications such as semantic search, dense retrieval, RAG (retrieval-augmented generation), and instruction-based NLP tasks like classification and QA.


## 📌 Model Highlights

- 🔍 **Strong FaMTEB Performance:** Achieves excellent results on the FaMTEB benchmark, especially for its size.
- 🧾 **Instruction-Tuned:** Capable of handling tasks such as classification, STS, retrieval, and QA, benefiting from the same instruction-tuning paradigm as Hakim.
- 🗣️ **Chatbot-Ready:** Fine-tuned with chat-history-aware data from the Hakim project.
- ⚙️ **Highly Compact & Fast:** With only ~38M parameters (versus Hakim's ~124M), Hakim-small is significantly smaller and faster, making it highly effective for real-world inference where efficiency is key.

## 🏗️ Training Datasets

Hakim-small benefits from the comprehensive and high-quality datasets developed for the Hakim project. These include:

### 📚 Pretraining

- **Corpesia:** 11B tokens from 46 Persian websites across 21 domains (e.g., news, health, religion, tech).
- **hmBlogs:** 6.8B tokens from ~20M Persian blog posts.
- **Queries:** 8.5M anonymized search queries.

### 🔄 Unsupervised Stage (Pairsia-unsup)

- 5M high-quality Persian text pairs from diverse sources, including document–title, FAQ, QA, and paper title–abstract pairs, as well as machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.).

### 🧠 Supervised Stage (Pairsia-sup)

- 1.3M labeled pairs with multiple negatives per query.
- Instruction-based fine-tuning across tasks: classification, retrieval, STS, QA, and NLI.

For more detailed information on the dataset creation and curation process, please refer to the Hakim paper.


## 🧪 Benchmark Results (FaMTEB)

| Model | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS | Summarization |
|---|---|---|---|---|---|---|---|---|
| Hakim | 73.81 | 84.56 | 70.46 | 89.75 | 69.46 | 40.43 | 76.62 | 85.41 |
| Hakim-small | 70.45 | 80.19 | 66.31 | 87.41 | 67.30 | 38.05 | 75.53 | 78.40 |
| Hakim-unsup | 64.56 | 60.65 | 58.89 | 86.41 | 67.56 | 37.71 | 79.36 | 61.34 |
| BGE-m3 | 65.29 | 58.75 | 57.73 | 85.21 | 74.56 | 43.38 | 76.35 | 61.07 |
| Jina-embeddings-v3 | 64.53 | 59.93 | 59.15 | 83.71 | 61.26 | 43.51 | 78.65 | 65.50 |
| multilingual-e5-large | 64.40 | 59.86 | 57.19 | 84.42 | 74.34 | 42.98 | 75.38 | 56.61 |
| GTE-multilingual-base | 63.64 | 56.07 | 57.28 | 84.58 | 69.72 | 41.22 | 75.75 | 60.88 |
| multilingual-e5-base | 62.93 | 57.62 | 56.52 | 84.04 | 72.07 | 41.20 | 74.45 | 54.58 |
| Tooka-SBERT | 60.65 | 59.40 | 56.45 | 87.04 | 58.29 | 27.86 | 76.42 | 59.06 |

## Model Usage

You can interact with the `Hakim_small` model through our API. Below are examples using `curl` and Python.

### Inference with curl

Here's how to send a request to the model using a curl command in your terminal.

**Important:** Replace `your_api_key` with your actual API key.

**Note:** For quick testing, you can use the value `mcinext` as your API key. This allows you to use the API with some limitations.

```bash
curl -X POST 'https://mcinext.ai/api/hakim-small' \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer your_api_key" \
  -d '{
      "model": "Hakim_small",
      "input": [
          "The text of the first document.",
          "The text of the second document.",
          "And so on..."
      ],
      "encoding_format": "float",
      "add_special_tokens": true
  }'
```

### Inference with Python

```python
import requests

# --- Configuration ---
API_KEY = "your_api_key"  # Replace with your key, or use "mcinext" for testing
API_URL = "https://mcinext.ai/api/hakim-small"

# --- Request Details ---
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

data = {
    "model": "Hakim_small",
    "input": [
        "The text of the first document.",
        "The text of the second document.",
        "And so on..."
    ],
    "encoding_format": "float",
    "add_special_tokens": True
}

# --- Send Request ---
try:
    # json= serializes the payload and sets the body in one step
    response = requests.post(API_URL, headers=headers, json=data)
    response.raise_for_status()

    print("Request successful!")
    print("Response JSON:")
    print(response.json())

except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
    print(f"Response content: {response.text}")
except Exception as err:
    print(f"Another error occurred: {err}")
```
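For semantic search or RAG, a common next step is to rank documents by cosine similarity between the query embedding and each document embedding returned by the API. The sketch below is a minimal, dependency-free illustration; the short vectors are hypothetical placeholders (real Hakim-small embeddings are much longer), since the exact response schema is not documented here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors standing in for embeddings extracted
# from the API response for a query and one document.
query_embedding = [0.10, -0.30, 0.60, 0.05]
doc_embedding = [0.12, -0.34, 0.56, 0.08]

score = cosine_similarity(query_embedding, doc_embedding)
print(f"similarity: {score:.4f}")
```

To search over a corpus, compute this score between the query embedding and every document embedding, then sort documents by score in descending order.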

## Citation

```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```