Update README.md
README.md
---
tags:
- sentence-transformers
- text-classification
pipeline_tag: text-classification
library_name: sentence-transformers
metrics:
- accuracy
- f1
- precision
- recall
language:
- en
- fr
- ko
- zh
- ja
- pt
- ru
datasets:
- imdb
model-index:
- name: germla/satoken
  results:
  - task:
      type: text-classification
      name: Sentiment Classification
    dataset:
      type: imdb
      name: IMDB
      split: test
    metrics:
    - type: accuracy
      value: 73.976
      name: Accuracy
    - type: f1
      value: 73.1667079105832
      name: F1
    - type: precision
      value: 75.51506895964584
      name: Precision
    - type: recall
      value: 70.96
      name: Recall
---

# Satoken

This is a [SetFit model](https://github.com/huggingface/setfit) trained on the multilingual datasets listed below for sentiment classification.

The model has been trained using an efficient few-shot learning technique that involves:

1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
2. Training a classification head with features from the fine-tuned Sentence Transformer.
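
As a rough illustration, the two steps look like the following with the `SetFitTrainer` API from the pre-1.0 `setfit` releases; the base checkpoint, dataset sample, and hyperparameters here are placeholders, not the settings actually used for this model:

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Illustrative base checkpoint; the card does not state which one satoken starts from.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# A small labeled sample: SetFit targets few-shot regimes.
train_ds = load_dataset("imdb", split="train").shuffle(seed=42).select(range(64))

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,  # step 1: contrastive fine-tuning of the embedding body
    num_iterations=20,                # contrastive pairs generated per labeled example
)
trainer.train()  # runs step 1, then fits the classification head on the embeddings (step 2)
```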

It is used by [Germla](https://github.com/germla) for its feedback analysis tool, specifically the sentiment analysis feature.

For language-specific models, see [here](https://github.com/germla/satoken#available-models).

# Usage

To use this model for inference, first install the SetFit library:

```bash
pip install setfit
```

You can then load the model and run inference:

```python
from setfit import SetFitModel

model = SetFitModel.from_pretrained("germla/satoken")
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
```

# Training Details

## Training Data

- [IMDB](https://huggingface.co/datasets/imdb)
- [RuReviews](https://github.com/sismetanin/rureviews)
- [chABSA](https://github.com/chakki-works/chABSA-dataset)
- [Glyph](https://github.com/zhangxiangxiao/glyph)
- [nsmc](https://github.com/e9t/nsmc)
- [Allocine](https://huggingface.co/datasets/allocine)
- [Portuguese Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis)

## Training Procedure

We made sure the training data was balanced across classes. The model was trained on only 35% (50% for Chinese) of the train split of each dataset.
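
As a hypothetical sketch of that subsampling (assuming binary labels and the `datasets` library; the exact procedure is not published on this card):

```python
from datasets import concatenate_datasets, load_dataset

def balanced_subset(ds, fraction, label_col="label", seed=42):
    """Take an equal number of examples per class, `fraction` of the split in total."""
    per_class = int(len(ds) * fraction / 2)  # two sentiment classes
    pos = ds.filter(lambda ex: ex[label_col] == 1).shuffle(seed=seed).select(range(per_class))
    neg = ds.filter(lambda ex: ex[label_col] == 0).shuffle(seed=seed).select(range(per_class))
    return concatenate_datasets([pos, neg]).shuffle(seed=seed)

# e.g. 35% of the IMDB train split, half positive and half negative
imdb_train = balanced_subset(load_dataset("imdb", split="train"), fraction=0.35)
```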

### Preprocessing

- Basic cleaning (removal of duplicates, links, mentions, hashtags, etc.)
- Removal of stopwords using [nltk](https://www.nltk.org/)
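
A minimal sketch of this cleaning step, assuming English text (per-language stopword lists would be swapped in for the other six languages):

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def clean(text: str) -> str:
    """Strip links, mentions, and hashtags, then drop stopwords."""
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    return " ".join(w for w in text.split() if w.lower() not in STOP)

raw = ["Loved it! https://t.co/x #movies", "Loved it! https://t.co/x #movies"]
texts = [clean(t) for t in dict.fromkeys(raw)]  # de-duplicate, keep order
```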

### Speeds, Sizes, Times

Training took 6 hours on an NVIDIA T4 GPU.

## Evaluation

### Testing Data, Factors & Metrics

- [IMDB test split](https://huggingface.co/datasets/imdb)
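
The accuracy, F1, precision, and recall reported in the model card metadata can be reproduced along these lines (a sketch assuming binary averaging and scikit-learn; the exact evaluation script is not published):

```python
from datasets import load_dataset
from setfit import SetFitModel
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

test = load_dataset("imdb", split="test")
model = SetFitModel.from_pretrained("germla/satoken")

preds = model(test["text"])  # one predicted label per review
for name, metric in [("accuracy", accuracy_score), ("f1", f1_score),
                     ("precision", precision_score), ("recall", recall_score)]:
    print(f"{name}: {100 * metric(test['label'], preds):.2f}")
```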

# Environmental Impact

- Hardware Type: NVIDIA T4 GPU
- Hours used: 6
- Cloud Provider: Amazon Web Services
- Compute Region: ap-south-1 (Mumbai)
- Carbon Emitted: 0.39 [kg CO₂ eq.](https://mlco2.github.io/impact/#co2eq)