Commit 57a4c62 by mehran-sarmadi · verified · 1 parent: 598ef64

Create README.md

README.md ADDED
@@ -0,0 +1,67 @@
+ # 🧠 Hakim: Farsi Text Embedding Model
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2505.08435-b31b1b.svg)](https://arxiv.org/abs/2505.08435)
+
+ **Hakim** is a state-of-the-art text embedding model for the Persian language. On the **FaMTEB** benchmark it outperforms previous models with an **8.5-point gain** in average score (73.81 vs. 65.29 for BGE-m3, the strongest non-Hakim model in the results table). Hakim is optimized for semantic search, dense retrieval, retrieval-augmented generation (RAG), and instruction-based NLP tasks such as classification and QA.
+
+ ---
+
+ ## 📌 Model Highlights
+
+ - 🔍 **FaMTEB SOTA**: Ranked #1 across 63 Persian NLP datasets
+ - 🧾 **Instruction-Tuned**: Handles tasks like classification, STS, retrieval, QA, and cross-task reasoning
+ - 🗣️ **Chatbot-Ready**: Fine-tuned with chat history-aware data
+ - ⚙️ **Compact & Fast**: ~124M parameters, effective for real-world inference
+
+ ---
+
+ ## 🏗️ Training Datasets
+
+ ### 📚 Pretraining
+ - **Corpesia**: 11B tokens from 46 Persian websites across 21 domains (e.g. news, health, religion, tech)
+ - **hmBlogs**: 6.8B tokens from ~20M Persian blog posts
+ - **Queries**: 8.5M anonymized search queries
+
+ ### 🔄 Unsupervised Stage (Pairsia-unsup)
+ - 5M high-quality Persian text pairs from:
+   - Document–title, FAQ, QA, and paper title–abstract
+   - Machine-translated datasets (MS MARCO, SAMSum, AdversarialQA, etc.)
+
+ ### 🧠 Supervised Stage (Pairsia-sup)
+ - 1.3M labeled pairs with 9 negatives per query
+ - Instruction-based fine-tuning across:
+   - Classification, Retrieval, STS, QA, NLI
+
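A setup with one positive and 9 negatives per query, as described for Pairsia-sup, is commonly trained with a contrastive (InfoNCE-style) objective. The README does not state which loss Hakim uses, so the following is only a minimal sketch under that assumption. The function name `info_nce_loss`, the `temperature` value, and the toy vectors are hypothetical and are not taken from the Hakim code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query with one positive and several negatives.

    Assumption: a generic contrastive objective, not necessarily the loss
    used to train Hakim. `temperature` is a hypothetical value.
    """
    # Temperature-scaled similarities; the positive sits at index 0.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable negative log-softmax of the positive logit.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[0]


# Toy 3-dimensional embeddings, made up for illustration only.
query = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]
# Nine negatives per query, matching the count stated for Pairsia-sup.
negatives = [[0.0, 1.0, 0.1 * i] for i in range(9)]

loss = info_nce_loss(query, positive, negatives)
# Swap roles: treat a dissimilar vector as the "positive" to see the loss grow.
swapped_loss = info_nce_loss(query, negatives[0], [positive] + negatives[1:])
print(f"well-separated positive: {loss:.6f}")
print(f"mislabeled positive:     {swapped_loss:.6f}")
```

The loss is close to zero when the positive is much more similar to the query than every negative, and it grows when a negative outscores the positive, which is what pushes hard negatives away during fine-tuning.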
+ ---
+
+ ## 🧪 Benchmark Results (FaMTEB)
+
+ | Model | Avg. Score | Classification | Clustering | PairClass. | Reranking | Retrieval | STS | Summarization |
+ |------------------------|------------|----------------|------------|------------|-----------|-----------|-------|----------------|
+ | **Hakim** | **73.81** | **84.56** | **70.46** | **89.75** | 69.46 | 40.43 | 76.62 | **85.41** |
+ | Hakim-small | 70.45 | 80.19 | 66.31 | 87.41 | 67.30 | 38.05 | 75.53 | 78.40 |
+ | Hakim-unsup | 64.56 | 60.65 | 58.89 | 86.41 | 67.56 | 37.71 | 79.36 | 61.34 |
+ | BGE-m3 | 65.29 | 58.75 | 57.73 | 85.21 | **74.56** | 43.38 | 76.35 | 61.07 |
+ | Jina-embeddings-v3 | 64.53 | 59.93 | 59.15 | 83.71 | 61.26 | **43.51** | **78.65** | 65.50 |
+ | multilingual-e5-large | 64.40 | 59.86 | 57.19 | 84.42 | 74.34 | 42.98 | 75.38 | 56.61 |
+ | GTE-multilingual-base | 63.64 | 56.07 | 57.28 | 84.58 | 69.72 | 41.22 | 75.75 | 60.88 |
+ | multilingual-e5-base | 62.93 | 57.62 | 56.52 | 84.04 | 72.07 | 41.20 | 74.45 | 54.58 |
+ | Tooka-SBERT | 60.65 | 59.40 | 56.45 | 87.04 | 58.29 | 27.86 | 76.42 | 59.06 |
+
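To make the size of the gap in the table concrete, the short script below compares Hakim's average score with the best non-Hakim model. The numbers are copied from the "Avg. Score" column; the variable names are illustrative only.

```python
# Average FaMTEB scores copied from the "Avg. Score" column of the table.
avg_scores = {
    "Hakim": 73.81,
    "Hakim-small": 70.45,
    "Hakim-unsup": 64.56,
    "BGE-m3": 65.29,
    "Jina-embeddings-v3": 64.53,
    "multilingual-e5-large": 64.40,
    "GTE-multilingual-base": 63.64,
    "multilingual-e5-base": 62.93,
    "Tooka-SBERT": 60.65,
}

# Baselines are the models that are not Hakim variants.
baselines = {name: s for name, s in avg_scores.items() if not name.startswith("Hakim")}
best_name = max(baselines, key=baselines.get)

abs_gain = avg_scores["Hakim"] - baselines[best_name]   # in score points
rel_gain = 100 * abs_gain / baselines[best_name]        # in percent

print(f"best baseline: {best_name}")
print(f"absolute gain: {abs_gain:.2f} points")
print(f"relative gain: {rel_gain:.2f}%")
```

The absolute difference comes out at roughly 8.5 points, while the relative improvement over the best baseline is about 13%.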
+ ---
+
+ ## 🔧 Usage Example
+
+ Access to the Hakim model will be available through an API. This section will be updated with usage instructions and examples once the API is ready.
+
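Because the API is not yet documented, no real Hakim call can be shown here. As a purely illustrative sketch of the downstream step in semantic search, the snippet below ranks documents by cosine similarity, assuming embedding vectors have already been obtained from some model. The vectors, document IDs, and function names are placeholders and do not reflect Hakim's actual output or interface.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder embeddings: real vectors would come from an embedding model.
query_vec = [0.2, 0.9, 0.1]
doc_vecs = {
    "doc_a": [0.1, 0.8, 0.2],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.5, 0.9],
}

ranking = rank_documents(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

For large collections, an approximate nearest-neighbor index would usually replace this exhaustive loop, but the scoring idea is the same.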
+ ## Citation
+ ```bibtex
+ @article{sarmadi2025hakim,
+ title={Hakim: Farsi Text Embedding Model},
+ author={Sarmadi, Mehran and Alikhani, Morteza and Zinvandi, Erfan and Pourbahman, Zahra},
+ journal={arXiv preprint arXiv:2505.08435},
+ year={2025}
+ }
+ ```