Create README.md
README.md
ADDED
@@ -0,0 +1,67 @@
# 🧠 Hakim: Farsi Text Embedding Model

[arXiv:2505.08435](https://arxiv.org/abs/2505.08435)

**Hakim** is a state-of-the-art text embedding model for the Persian language. It outperforms previous models on the **FaMTEB** benchmark, with an **8.5% absolute gain** in average score (73.81 versus 65.29 for BGE-m3, the strongest baseline in the results table). Hakim is optimized for semantic search, dense retrieval, RAG (retrieval-augmented generation), and instruction-based NLP tasks such as classification and QA.

---
## 📌 Model Highlights

- 🔍 **FaMTEB SOTA**: Ranked #1 by average score across 63 Persian NLP datasets
- 🧾 **Instruction-Tuned**: Handles tasks like classification, STS, retrieval, QA, and cross-task reasoning
- 🗣️ **Chatbot-Ready**: Fine-tuned on chat-history-aware data
- ⚙️ **Compact & Fast**: ~124M parameters, efficient for real-world inference

---
## 🏗️ Training Datasets

### 📚 Pretraining
- **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g. news, health, religion, tech)
- **hmBlogs**: 6.8B tokens from ~20M Persian blog posts
- **Queries**: 8.5M anonymized search queries

### 🔄 Unsupervised Stage (Pairsia-unsup)
- 5M high-quality Persian text pairs from:
  - Document–title, FAQ, QA, and paper title–abstract
  - Machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.)

### 🧠 Supervised Stage (Pairsia-sup)
- 1.3M labeled pairs with 9 negatives per query
- Instruction-based fine-tuning across:
  - Classification, Retrieval, STS, QA, NLI
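
This README does not say which loss the supervised stage uses. As an illustration of how one positive and 9 negatives per query are commonly combined, here is a minimal sketch of an InfoNCE-style contrastive loss in plain Python. The function names, the temperature of 0.05, and the toy 2-dimensional vectors are all assumptions made for the example, not details taken from Hakim. The loss is near zero when the positive is the closest candidate to the query and grows when negatives score higher.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query (illustrative, not Hakim's confirmed objective).

    `negatives` is a list of k embeddings (k = 9 in Pairsia-sup).
    `temperature` is an assumed hyperparameter.
    """
    # Candidate 0 is the positive; the rest are the negatives.
    logits = [cosine(query, c) / temperature for c in [positive, *negatives]]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]

# Toy case 1: the positive matches the query, the 9 negatives are orthogonal.
query = [1.0, 0.0]
easy = info_nce_loss(query, [1.0, 0.0], [[0.0, 1.0]] * 9)
# Toy case 2: the positive is orthogonal and every negative matches the query.
hard = info_nce_loss(query, [0.0, 1.0], [[1.0, 0.0]] * 9)
print(f"easy case loss: {easy:.4f}")  # close to zero
print(f"hard case loss: {hard:.4f}")  # much larger
```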

---
## 🧪 Benchmark Results (FaMTEB)

| Model | Avg. Score | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
|------------------------|------------|----------------|------------|---------------------|-----------|-----------|-------|----------------|
| **Hakim** | **73.81** | **84.56** | **70.46** | **89.75** | 69.46 | 40.43 | 76.62 | **85.41** |
| Hakim-small | 70.45 | 80.19 | 66.31 | 87.41 | 67.30 | 38.05 | 75.53 | 78.40 |
| Hakim-unsup | 64.56 | 60.65 | 58.89 | 86.41 | 67.56 | 37.71 | 79.36 | 61.34 |
| BGE-m3 | 65.29 | 58.75 | 57.73 | 85.21 | **74.56** | 43.38 | 76.35 | 61.07 |
| Jina-embeddings-v3 | 64.53 | 59.93 | 59.15 | 83.71 | 61.26 | **43.51** | **78.65** | 65.50 |
| multilingual-e5-large | 64.40 | 59.86 | 57.19 | 84.42 | 74.34 | 42.98 | 75.38 | 56.61 |
| GTE-multilingual-base | 63.64 | 56.07 | 57.28 | 84.58 | 69.72 | 41.22 | 75.75 | 60.88 |
| multilingual-e5-base | 62.93 | 57.62 | 56.52 | 84.04 | 72.07 | 41.20 | 74.45 | 54.58 |
| Tooka-SBERT | 60.65 | 59.40 | 56.45 | 87.04 | 58.29 | 27.86 | 76.42 | 59.06 |
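
The Avg. Score column appears consistent with an unweighted mean over the seven task categories, up to rounding of the per-category values. The short check below recomputes it for two rows copied from the table and prints the gap between Hakim and BGE-m3 in average score. The variable and function names are illustrative only.

```python
# Category scores copied from the table, in column order:
# Classification, Clustering, Pair Classification, Reranking, Retrieval, STS, Summarization
scores = {
    "Hakim": [84.56, 70.46, 89.75, 69.46, 40.43, 76.62, 85.41],
    "BGE-m3": [58.75, 57.73, 85.21, 74.56, 43.38, 76.35, 61.07],
}
reported_avg = {"Hakim": 73.81, "BGE-m3": 65.29}

def macro_average(category_scores):
    """Unweighted mean over task categories."""
    return sum(category_scores) / len(category_scores)

for model, row in scores.items():
    print(f"{model}: computed {macro_average(row):.2f}, reported {reported_avg[model]:.2f}")

# Difference in average score, in absolute points.
gap = reported_avg["Hakim"] - reported_avg["BGE-m3"]
print(f"Gap in average score: {gap:.2f} points")
```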

---
## 🔧 Usage Example

Access to the Hakim model will be available through an API. This section will be updated with usage instructions and examples once the API is ready.
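
Until the API is published, the step that follows embedding in a semantic search or RAG pipeline can still be sketched: candidates are typically ranked by cosine similarity to the query embedding. The vectors below are made-up 3-dimensional placeholders, not Hakim outputs, and `rank_by_cosine` is a hypothetical helper name; nothing here reflects the eventual API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Placeholder embeddings; a real system would obtain these from the embedding model.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.5, 0.0],
}

ranking = rank_by_cosine(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In a RAG setup, the top-ranked documents would then be passed to the generator as context.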
## Citation

```bibtex
@article{sarmadi2025hakim,
  title={Hakim: Farsi Text Embedding Model},
  author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
  journal={arXiv preprint arXiv:2505.08435},
  year={2025}
}
```