---
license: apache-2.0
tags:
- setfit
- sentence-transformers
- text-classification
pipeline_tag: text-classification
library_name: sentence-transformers
metrics:
- accuracy
- f1
- precision
- recall
language:
- en
- fr
- ko
- zh
- ja
- pt
- ru
datasets:
- imdb
model-index:
- name: germla/satoken
  results:
  - task:
      type: text-classification
      name: sentiment-analysis
    dataset:
      type: imdb
      name: imdb
      split: test
    metrics:
    - type: accuracy
      value: 73.976
      name: Accuracy
    - type: f1
      value: 73.1667079105832
      name: F1
    - type: precision
      value: 75.51506895964584
      name: Precision
    - type: recall
      value: 70.96
      name: Recall
  - task:
      type: text-classification
      name: sentiment-analysis
    dataset:
      type: sepidmnorozy/Russian_sentiment
      name: sepidmnorozy/Russian_sentiment
      split: train
    metrics:
    - type: accuracy
      value: 75.66371681415929
      name: Accuracy
    - type: f1
      value: 83.64218714253031
      name: F1
    - type: precision
      value: 75.25730753396459
      name: Precision
    - type: recall
      value: 94.129763130793
      name: Recall
---

# Satoken

This is a [SetFit model](https://github.com/huggingface/setfit) trained on the multilingual datasets listed below for sentiment classification. The model was trained with an efficient few-shot learning technique that involves:

1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
2. Training a classification head with features from the fine-tuned Sentence Transformer.

It is used by [Germla](https://github.com/germla) for its feedback analysis tool (specifically the sentiment analysis feature). For other, language-specific models, see [here](https://github.com/germla/satoken#available-models).

# Usage

To use this model for inference, first install the SetFit library:

```bash
python -m pip install setfit
```

You can then run inference as follows:

```python
from setfit import SetFitModel

# Download from Hub and run inference
model = SetFitModel.from_pretrained("germla/satoken")
# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
```

# Training Details

## Training Data

- [IMDB](https://huggingface.co/datasets/imdb)
- [RuReviews](https://github.com/sismetanin/rureviews)
- [chABSA](https://github.com/chakki-works/chABSA-dataset)
- [Glyph](https://github.com/zhangxiangxiao/glyph)
- [nsmc](https://github.com/e9t/nsmc)
- [Allocine](https://huggingface.co/datasets/allocine)
- [Portuguese Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis)

## Training Procedure

We made sure the training data was balanced across classes. The model was trained on only 35% of the train split of each dataset (50% for Chinese). Illustrative sketches of the preprocessing, training, and evaluation steps are provided at the end of this card.

### Preprocessing

- Basic cleaning (removal of duplicates, links, mentions, hashtags, etc.)
- Removal of stopwords using [nltk](https://www.nltk.org/)

### Speeds, Sizes, Times

Training took 6 hours on an NVIDIA T4 GPU.

## Evaluation

### Testing Data, Factors & Metrics

- [IMDB test split](https://huggingface.co/datasets/imdb)

# Environmental Impact

- Hardware Type: NVIDIA T4 GPU
- Hours used: 6
- Cloud Provider: Amazon Web Services
- Compute Region: ap-south-1 (Mumbai)
- Carbon Emitted: 0.39 [kg CO2 eq.](https://mlco2.github.io/impact/#co2eq)
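# Appendix: Illustrative Sketches

## Preprocessing sketch

The cleaning and stopword-removal steps described under Training Details can be approximated as follows. This is a minimal sketch, not the exact script used for training; the regular expressions and the `clean_text` helper are assumptions, and the stopword language would be switched per dataset.

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))  # swap in the relevant language per dataset


def clean_text(text: str) -> str:
    """Remove links, mentions and hashtags, then drop stopwords."""
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)


# Deduplicate and clean a list of raw examples
texts = [
    "i loved the spiderman movie! https://example.com",
    "i loved the spiderman movie! https://example.com",
]
cleaned = sorted({clean_text(t) for t in texts})
print(cleaned)
```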
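## Training sketch

The two-step SetFit procedure described above (contrastive fine-tuning of the Sentence Transformer, then fitting a classification head) can be reproduced roughly as below, using the pre-1.0 `SetFitTrainer` API. The base checkpoint, the 35% sampling shown on IMDB, and the hyperparameters are assumptions for illustration, not the exact values used for this model.

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# 35% of the IMDB train split, shuffled first so both labels are represented
train_ds = load_dataset("imdb", split="train").shuffle(seed=42)
train_ds = train_ds.select(range(int(0.35 * len(train_ds))))

# Base checkpoint is an assumption; any multilingual Sentence Transformer could be used
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,  # contrastive fine-tuning objective
    batch_size=16,
    num_iterations=20,                # contrastive pairs generated per example
    column_mapping={"text": "text", "label": "label"},
)
trainer.train()  # step 1: contrastive fine-tuning; step 2: fit the classification head
```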
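## Evaluation sketch

The reported IMDB metrics (accuracy, F1, precision, recall) can be recomputed with scikit-learn as below. This is a sketch: it assumes the IMDB test split and the default binary label convention (1 = positive), and casts predictions to plain integers before scoring.

```python
from datasets import load_dataset
from setfit import SetFitModel
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

model = SetFitModel.from_pretrained("germla/satoken")

test_ds = load_dataset("imdb", split="test")
preds = [int(p) for p in model(test_ds["text"])]

print("accuracy :", accuracy_score(test_ds["label"], preds))
print("f1       :", f1_score(test_ds["label"], preds))
print("precision:", precision_score(test_ds["label"], preds))
print("recall   :", recall_score(test_ds["label"], preds))
```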