gaurishhs committed
Commit 49eb33a · 1 Parent(s): 294887b

Update README.md

  1. README.md +87 -17
README.md CHANGED
@@ -5,16 +5,61 @@ tags:
5
  - sentence-transformers
6
  - text-classification
7
  pipeline_tag: text-classification
8
  ---
9
 
10
- # germla/satoken
11
 
12
- This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:
13
 
14
  1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
15
  2. Training a classification head with features from the fine-tuned Sentence Transformer.
16
 
17
- ## Usage
18
 
19
  To use this model for inference, first install the SetFit library:
20
 
@@ -33,17 +78,42 @@ model = SetFitModel.from_pretrained("germla/satoken")
33
  preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
34
  ```
35
 
36
- ## BibTeX entry and citation info
37
-
38
- ```bibtex
39
- @article{https://doi.org/10.48550/arxiv.2209.11055,
40
- doi = {10.48550/ARXIV.2209.11055},
41
- url = {https://arxiv.org/abs/2209.11055},
42
- author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
43
- keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
44
- title = {Efficient Few-Shot Learning Without Prompts},
45
- publisher = {arXiv},
46
- year = {2022},
47
- copyright = {Creative Commons Attribution 4.0 International}
48
- }
49
- ```
5
  - sentence-transformers
6
  - text-classification
7
  pipeline_tag: text-classification
8
+ library_name: sentence-transformers
9
+ metrics:
10
+ - accuracy
11
+ - f1
12
+ - precision
13
+ - recall
14
+ language:
15
+ - en
16
+ - fr
17
+ - ko
18
+ - zh
19
+ - ja
20
+ - pt
21
+ - ru
22
+ datasets:
23
+ - imdb
24
+ model-index:
25
+ - name: germla/satoken
26
+   results:
27
+   - task:
28
+       type: text-classification
29
+       name: Sentiment Classification
30
+     dataset:
31
+       type: imdb
32
+       name: IMDB
33
+       split: test
34
+     metrics:
35
+     - type: accuracy
36
+       value: 73.976
37
+       name: Accuracy
38
+     - type: f1
39
+       value: 73.1667079105832
40
+       name: F1
41
+     - type: precision
42
+       value: 75.51506895964584
43
+       name: Precision
44
+     - type: recall
45
+       value: 70.96
46
+       name: Recall
47
  ---
48
 
49
+ # Satoken
50
 
51
+ This is a [SetFit model](https://github.com/huggingface/setfit) trained on the multilingual datasets listed below for sentiment classification.
52
+
53
+ The model has been trained using an efficient few-shot learning technique that involves:
54
 
55
  1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
56
  2. Training a classification head with features from the fine-tuned Sentence Transformer.
57
 
58
+ It is used by [Germla](https://github.com/germla) for its feedback analysis tool, specifically the sentiment analysis feature.
59
+
60
+ For language-specific models, see the [list of available models](https://github.com/germla/satoken#available-models).
61
+
62
+ # Usage
63
 
64
  To use this model for inference, first install the SetFit library:
65
 
 
78
  preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
79
  ```
80
 
81
+ # Training Details
82
+
83
+ ## Training Data
84
+
85
+ - [IMDB](https://huggingface.co/datasets/imdb)
86
+ - [RuReviews](https://github.com/sismetanin/rureviews)
87
+ - [chABSA](https://github.com/chakki-works/chABSA-dataset)
88
+ - [Glyph](https://github.com/zhangxiangxiao/glyph)
89
+ - [nsmc](https://github.com/e9t/nsmc)
90
+ - [Allocine](https://huggingface.co/datasets/allocine)
91
+ - [Portuguese Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis)
92
+
93
+ ## Training Procedure
94
+
95
+ We made sure to have a balanced dataset.
96
+ The model was trained on only 35% (50% for Chinese) of each dataset's train split.
97
+
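The subsampling described above can be sketched roughly as follows. This is a minimal illustration, not the authors' training code: it assumes "balanced" means an equal number of examples per label, which the README does not spell out, and the function name `balanced_subsample`, its parameters, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_subsample(examples, fraction, seed=0):
    """Draw roughly `fraction` of `examples` with the same count per label.

    `examples` is a list of (text, label) pairs. Equal per-label counts are
    an assumption about what "balanced" means in the README.
    """
    if not examples:
        return []

    # Group the examples by label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    # Split the target size evenly across labels, capped by the rarest label.
    target_total = round(len(examples) * fraction)
    per_label = min(
        target_total // len(by_label),
        min(len(group) for group in by_label.values()),
    )

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], per_label))
    rng.shuffle(sample)
    return sample
```

Under these assumptions, `balanced_subsample(train_examples, 0.35)` would correspond to the 35% setting and `0.5` to the Chinese data, where `train_examples` stands in for one dataset's train split.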
98
+ ### Preprocessing
99
+
100
+ - Basic cleaning (removal of duplicates, links, mentions, hashtags, etc.)
101
+ - Removal of stopwords using [nltk](https://www.nltk.org/)
102
+
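The two preprocessing steps above might look something like the sketch below. The regular expressions, the function names `clean_text` and `clean_corpus`, and the tiny `DEMO_STOPWORDS` set are assumptions made for illustration; the README only names the categories of text that were removed and states that the stopword lists came from nltk.

```python
import re

# Hypothetical patterns for the categories named in the README.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

# Tiny illustrative stopword set. With nltk one would instead use e.g.
# set(nltk.corpus.stopwords.words("english")) after nltk.download("stopwords").
DEMO_STOPWORDS = frozenset({"a", "an", "the", "is", "on", "of", "and", "to"})


def clean_text(text, stopwords=DEMO_STOPWORDS):
    """Strip links, mentions and hashtags, then drop stopword tokens."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok.lower() not in stopwords]
    return " ".join(tokens)


def clean_corpus(texts, stopwords=DEMO_STOPWORDS):
    """Clean every text, dropping empty results and exact duplicates."""
    seen = set()
    cleaned = []
    for text in texts:
        result = clean_text(text, stopwords)
        if result and result not in seen:
            seen.add(result)
            cleaned.append(result)
    return cleaned
```

Passing the stopword set as a parameter keeps the sketch usable for each language in the training data, since nltk ships separate lists per language.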
103
+ ### Speeds, Sizes, Times
104
+
105
+ Training took 6 hours on an NVIDIA T4 GPU.
106
+
107
+ ## Evaluation
108
+
109
+ ### Testing Data, Factors & Metrics
110
+
111
+ - [IMDB test split](https://huggingface.co/datasets/imdb)
112
+
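For reference, the four metrics reported in the model-index metadata relate to each other as in the sketch below. The confusion-matrix counts in the usage note are not taken from the authors' evaluation output; they are back-derived from the reported percentages, assuming the standard IMDB test split of 25,000 reviews with 12,500 per class. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts back-derived from the reported metrics (an assumption, see above).
metrics = classification_metrics(tp=8870, fp=2876, fn=3630, tn=9624)
```

With those counts, the results agree with the reported accuracy (73.976), precision (about 75.515), recall (70.96) and F1 (about 73.167) when expressed as percentages, which suggests the reported figures are internally consistent.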
113
+ # Environmental Impact
114
+
115
+ - Hardware Type: NVIDIA T4 GPU
116
+ - Hours used: 6
117
+ - Cloud Provider: Amazon Web Services
118
+ - Compute Region: ap-south-1 (Mumbai)
119
+ - Carbon Emitted: 0.39 [kg CO2 eq.](https://mlco2.github.io/impact/#co2eq)