MCINext/Hakim

mehran-sarmadi committed on May 31, 2025

Commit

57a4c62

1 Parent(s): 598ef64

Create README.md

README.md +67 -0

	@@ -0,0 +1,67 @@

+# 🧠 Hakim: Farsi Text Embedding Model
+[![arXiv](https://img.shields.io/badge/arXiv-2505.08435-b31b1b.svg)](https://arxiv.org/abs/2505.08435)
+**Hakim** is a state-of-the-art text embedding model designed for the Persian language. On the **FaMTEB** benchmark it outperforms previous models, with an **8.5-point gain** in average score over the strongest baseline listed below (73.81 vs. 65.29 for BGE-m3). Hakim is optimized for applications such as semantic search, dense retrieval, RAG (retrieval-augmented generation), and instruction-based NLP tasks like classification and QA.
+---
+## 📌 Model Highlights
+- 🔍 **FaMTEB SOTA**: Ranked #1 by average score across 63 Persian NLP datasets
+- 🧾 **Instruction-Tuned**: Handles tasks like classification, STS, retrieval, QA, and cross-task reasoning
+- 🗣️ **Chatbot-Ready**: Fine-tuned with chat history-aware data
+- ⚙️ **Compact & Fast**: ~124M parameters, effective for real-world inference
+---
+## 🏗️ Training Datasets
+### 📚 Pretraining
+- **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g. news, health, religion, tech)
+- **hmBlogs**: 6.8B tokens from ~20M Persian blog posts
+- **Queries**: 8.5M anonymized search queries
+### 🔄 Unsupervised Stage (Pairsia-unsup)
+- 5M high-quality Persian text pairs from:
+  - Document–title, FAQ, QA, and paper title–abstract
+  - Machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.)
+### 🧠 Supervised Stage (Pairsia-sup)
+- 1.3M labeled pairs with 9 negatives per query
+- Instruction-based fine-tuning across:
+  - Classification, Retrieval, STS, QA, NLI
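
To make the "9 negatives per query" setup concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss over one positive and nine negatives. The README does not say which objective Hakim is trained with, so the loss form, the `temperature` value, and the toy 2-dimensional vectors are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Cross-entropy over similarity logits, with the positive at index 0.

    Assumption: an InfoNCE-style objective; not confirmed by the source.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[0]


# Toy embeddings (hypothetical): the positive is close to the query,
# the nine negatives are orthogonal to it.
query = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0]] * 9

good_loss = info_nce_loss(query, positive, negatives)
# Swapping roles (a far "positive", near "negatives") should give a larger loss.
bad_loss = info_nce_loss(query, [0.0, 1.0], [[1.0, 0.0]] * 9)
print(good_loss, bad_loss)
```

The loss is near zero when the query is much closer to its positive than to any negative, and grows when a negative outranks the positive.
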
+---
+## 🧪 Benchmark Results (FaMTEB)
+| Model                   | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS   | Summarization |
+|------------------------|------------|----------------|------------|------------|-----------|-----------|-------|----------------|
+| **Hakim**              | **73.81**  | **84.56**      | **70.46**  | **89.75**  | 69.46     | 40.43     | 76.62 | **85.41**      |
+| Hakim-small            | 70.45      | 80.19          | 66.31      | 87.41      | 67.30     | 38.05     | 75.53 | 78.40          |
+| Hakim-unsup            | 64.56      | 60.65          | 58.89      | 86.41      | 67.56     | 37.71     | **79.36** | 61.34      |
+| BGE-m3                 | 65.29      | 58.75          | 57.73      | 85.21      | **74.56** | 43.38     | 76.35 | 61.07          |
+| Jina-embeddings-v3     | 64.53      | 59.93          | 59.15      | 83.71      | 61.26     | **43.51** | 78.65 | 65.50          |
+| multilingual-e5-large  | 64.40      | 59.86          | 57.19      | 84.42      | 74.34     | 42.98     | 75.38 | 56.61          |
+| GTE-multilingual-base  | 63.64      | 56.07          | 57.28      | 84.58      | 69.72     | 41.22     | 75.75 | 60.88          |
+| multilingual-e5-base   | 62.93      | 57.62          | 56.52      | 84.04      | 72.07     | 41.20     | 74.45 | 54.58          |
+| Tooka-SBERT            | 60.65      | 59.40          | 56.45      | 87.04      | 58.29     | 27.86     | 76.42 | 59.06          |
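
The Avg. Score column appears to be the unweighted mean of the seven task-category columns. The README does not state this; it is an inference from the numbers, which the short check below reproduces for three rows copied from the table.

```python
# Per-category scores copied from the table, in column order:
# Classification, Clustering, PairClass., Reranking, Retrieval, STS, Summarization
scores = {
    "Hakim": [84.56, 70.46, 89.75, 69.46, 40.43, 76.62, 85.41],
    "Hakim-unsup": [60.65, 58.89, 86.41, 67.56, 37.71, 79.36, 61.34],
    "BGE-m3": [58.75, 57.73, 85.21, 74.56, 43.38, 76.35, 61.07],
}

# Avg. Score values as reported in the table.
reported = {"Hakim": 73.81, "Hakim-unsup": 64.56, "BGE-m3": 65.29}


def mean_score(values):
    """Unweighted mean of the category scores."""
    return sum(values) / len(values)


computed = {model: round(mean_score(vals), 2) for model, vals in scores.items()}

for model in scores:
    print(f"{model}: computed={computed[model]:.2f} reported={reported[model]:.2f}")
```
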
+---
+## 🔧 Usage Example
+Access to the Hakim model will be available through an API. This section will be updated with usage instructions and examples once the API is ready.
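
Until the API is published, the sketch below shows only the model-agnostic part of a semantic search pipeline: ranking documents by cosine similarity to a query embedding. It makes no calls to Hakim. The vectors are hand-written placeholders, and `rank_documents`, `doc_vecs`, and `query_vec` are hypothetical names introduced for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder embeddings; in practice these would come from an embedding model.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.0],
    "doc_c": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

ranking = rank_documents(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```
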
+## Citation
+```bibtex
+@article{sarmadi2025hakim,
+  title={Hakim: Farsi Text Embedding Model},
+  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
+  journal={arXiv preprint arXiv:2505.08435},
+  year={2025}
+}
+```