---
language:
- fa
base_model:
- PartAI/TookaBERT-Large
library_name: sentence-transformers
---

# SentenceTransformer

This is a Sentence Transformers model trained for semantic textual similarity and embedding tasks. It maps sentences and paragraphs to a dense vector space in which semantically similar texts lie close together.

The model is trained in two sizes: **Base** and **Large**.

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install sentence-transformers==3.4.1
```

Then you can load the model and run inference:

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("PartAI/Tooka-SBERT")

# Run inference (English glosses in comments)
sentences = [
    'درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.',  # "The crane is a migratory bird with long legs and a long neck."
    'درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار می‌روند.',  # "With their tall stature and broad wings, cranes are among the most beautiful migratory birds."
    'درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.'  # "Cranes are small birds with short legs that do not migrate."
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings (cosine similarity by default)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

## 🛠️ Training Details
Training is performed in two stages:

1. **Pretraining** on the *Targoman News* dataset
2. **Fine-tuning** on multiple synthetic datasets

### Stage 1: Pretraining
- We use an **asymmetric** setup.
- Input formatting:
  - Titles are prepended with `"سوال: "` ("Question: ")
  - Texts are prepended with `"متن: "` ("Text: ")
- Loss function: `CachedMultipleNegativesRankingLoss` (see the sketch below)
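
The card does not include training code, so the following is only a minimal sketch of what this stage could look like with the Sentence Transformers v3 trainer. The (title, text) pairs, the base-model pooling, and all hyperparameters are illustrative assumptions, not the actual Targoman News data or configuration.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Start from the base encoder; a mean-pooling head is attached automatically
# when a plain transformers checkpoint is loaded this way.
model = SentenceTransformer("PartAI/TookaBERT-Large")

# Hypothetical (title, text) pairs standing in for Targoman News,
# with the asymmetric prefixes described above.
train_dataset = Dataset.from_dict({
    "anchor": [
        "سوال: خشکسالی در فلات مرکزی ایران",  # "Question: Drought in Iran's central plateau"
        "سوال: افتتاح خط جدید مترو",  # "Question: Opening of a new metro line"
    ],
    "positive": [
        "متن: بارش‌ها در فلات مرکزی امسال کاهش یافته است.",  # "Text: Rainfall in the central plateau has declined this year."
        "متن: خط جدید مترو امروز به بهره‌برداری رسید.",  # "Text: The new metro line opened today."
    ],
})

# In-batch negatives with gradient caching, so the effective batch size
# can be large even on limited GPU memory.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```
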
### Stage 2: Fine-tuning
- Loss functions:
  - `CachedMultipleNegativesRankingLoss`
  - `CoSENTLoss`
- Used across multiple synthetic datasets (a sketch follows below)
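
The fine-tuning datasets and schedules are not detailed here; as a rough illustration only, `CoSENTLoss` consumes sentence pairs with graded similarity labels, so a fine-tuning step under that assumption might look like this (the checkpoint choice, pairs, and scores are made up):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CoSENTLoss

# Hypothetically resume from the Stage 1 checkpoint.
model = SentenceTransformer("PartAI/Tooka-SBERT")

# Hypothetical scored pairs; CoSENTLoss optimizes the ranking of cosine
# similarities so higher-labeled pairs score above lower-labeled ones.
train_dataset = Dataset.from_dict({
    "sentence1": [
        "درنا پرنده‌ای مهاجر است.",  # "The crane is a migratory bird."
        "درنا پرنده‌ای مهاجر است.",
    ],
    "sentence2": [
        "درناها مهاجرت می‌کنند.",  # "Cranes migrate."
        "خودرو یک وسیله نقلیه است.",  # "A car is a vehicle."
    ],
    "label": [0.9, 0.1],
})

loss = CoSENTLoss(model)
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```
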
## 📊 Evaluation
We evaluate our models on the [**PTEB Benchmark**](https://huggingface.co/spaces/PartAI/pteb-leaderboard). Our model **outperforms mE5-Base on average across PTEB tasks**.

For *Retrieval* and *Reranking* tasks, we follow the same asymmetric structure (illustrated in the snippet below), prepending:
- `"سوال: "` ("Question: ") to queries
- `"متن: "` ("Text: ") to documents
| Model | Pair-Classification-Avg | Classification-Avg | Retrieval-Avg | Reranking-Avg | Overall-Avg |
|---|---|---|---|---|---|
| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 70.76 | 69.71 | 63.90 | 76.01 | 69.33 |
| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 72.55 | 72.18 | **65.36** | **78.52** | **71.44** |
| [jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 71.88 | **79.27** | 65.18 | 64.62 | 71.37 |
| tooka-sbert-large-v1 | **81.52** | 71.54 | 45.61 | 60.44 | 62.54 |
| tooka-sbert-base-v2 | 75.69 | 72.16 | 61.24 | 73.40 | 69.49 |
| tooka-sbert-large-v2 | 80.24 | 74.73 | 59.80 | 73.44 | 70.54 |

### Task-Specific Datasets in PTEB

- **Pair-Classification**:
  - FarsTail
- **Classification**:
  - MassiveIntentClassification
  - MassiveScenarioClassification
  - MultilingualSentimentClassification
  - PersianFoodSentimentClassification
- **Retrieval**:
  - MIRACLRetrieval
  - NeuCLIR2023Retrieval
  - WikipediaRetrievalMultilingual
- **Reranking**:
  - MIRACLReranking
  - WikipediaRerankingMultilingual

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### CachedMultipleNegativesRankingLoss
```bibtex
@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```