---
license: apache-2.0
tags:
- setfit
- sentence-transformers
- text-classification
pipeline_tag: text-classification
library_name: sentence-transformers
metrics:
- accuracy
- f1
- precision
- recall
language:
- en
- fr
- ko
- zh
- ja
- pt
- ru
datasets:
- imdb
model-index:
- name: germla/satoken
  results:
  - task:
      type: text-classification
      name: sentiment-analysis
    dataset:
      type: imdb
      name: imdb
      split: test
    metrics:
    - type: accuracy
      value: 73.976
      name: Accuracy
    - type: f1
      value: 73.1667079105832
      name: F1
    - type: precision
      value: 75.51506895964584
      name: Precision
    - type: recall
      value: 70.96
      name: Recall
  - task:
      type: text-classification
      name: sentiment-analysis
    dataset:
      type: sepidmnorozy/Russian_sentiment
      name: sepidmnorozy/Russian_sentiment
      split: train
    metrics:
    - type: accuracy
      value: 75.66371681415929
      name: Accuracy
    - type: f1
      value: 83.64218714253031
      name: F1
    - type: precision
      value: 75.25730753396459
      name: Precision
    - type: recall
      value: 94.129763130793
      name: Recall
---

# Satoken

This is a [SetFit model](https://github.com/huggingface/setfit) trained on the multilingual datasets listed below for sentiment classification. The model was trained with an efficient few-shot learning technique that involves:

1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
2. Training a classification head with features from the fine-tuned Sentence Transformer.

It is used by [Germla](https://github.com/germla) for its feedback analysis tool (specifically the sentiment analysis feature). For other, language-specific models, see [here](https://github.com/germla/satoken#available-models).

# Usage

To use this model for inference, first install the SetFit library:

```bash
python -m pip install setfit
```

You can then run inference as follows:

```python
from setfit import SetFitModel

# Download from Hub and run inference
model = SetFitModel.from_pretrained("germla/satoken")
# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
```

# Training Details

## Training Data

- [IMDB](https://huggingface.co/datasets/imdb)
- [RuReviews](https://github.com/sismetanin/rureviews)
- [chABSA](https://github.com/chakki-works/chABSA-dataset)
- [Glyph](https://github.com/zhangxiangxiao/glyph)
- [nsmc](https://github.com/e9t/nsmc)
- [Allocine](https://huggingface.co/datasets/allocine)
- [Portuguese Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis)

## Training Procedure

We made sure the training data was balanced across classes. The model was trained on only 35% of the train split of each dataset (50% for Chinese). Illustrative sketches of the preprocessing, training, and evaluation steps are provided at the end of this card.

### Preprocessing

- Basic cleaning (removal of duplicates, links, mentions, hashtags, etc.)
- Removal of stopwords using [nltk](https://www.nltk.org/)

### Speeds, Sizes, Times

Training took 6 hours on an NVIDIA T4 GPU.

## Evaluation

### Testing Data, Factors & Metrics

- [IMDB test split](https://huggingface.co/datasets/imdb)

# Environmental Impact

- Hardware Type: NVIDIA T4 GPU
- Hours used: 6
- Cloud Provider: Amazon Web Services
- Compute Region: ap-south-1 (Mumbai)
- Carbon Emitted: 0.39 [kg CO2 eq.](https://mlco2.github.io/impact/#co2eq)
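# Appendix: Illustrative Sketches

## Preprocessing sketch

The cleaning and stopword-removal steps described under Training Details can be approximated as follows. This is a minimal sketch, not the exact script used for training; the regular expressions and the `clean_text` helper are assumptions, and the stopword language would be switched per dataset.

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))  # swap in the relevant language per dataset


def clean_text(text: str) -> str:
    """Remove links, mentions and hashtags, then drop stopwords."""
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)


# Deduplicate and clean a list of raw examples
texts = [
    "i loved the spiderman movie! https://example.com",
    "i loved the spiderman movie! https://example.com",
]
cleaned = sorted({clean_text(t) for t in texts})
print(cleaned)
```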
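## Training sketch

The two-step SetFit procedure described above (contrastive fine-tuning of the Sentence Transformer, then fitting a classification head) can be reproduced roughly as below, using the pre-1.0 `SetFitTrainer` API. The base checkpoint, the 35% sampling shown on IMDB, and the hyperparameters are assumptions for illustration, not the exact values used for this model.

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# 35% of the IMDB train split, shuffled first so both labels are represented
train_ds = load_dataset("imdb", split="train").shuffle(seed=42)
train_ds = train_ds.select(range(int(0.35 * len(train_ds))))

# Base checkpoint is an assumption; any multilingual Sentence Transformer could be used
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,  # contrastive fine-tuning objective
    batch_size=16,
    num_iterations=20,                # contrastive pairs generated per example
    column_mapping={"text": "text", "label": "label"},
)
trainer.train()  # step 1: contrastive fine-tuning; step 2: fit the classification head
```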
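## Evaluation sketch

The reported IMDB metrics (accuracy, F1, precision, recall) can be recomputed with scikit-learn as below. This is a sketch: it assumes the IMDB test split and the default binary label convention (1 = positive), and casts predictions to plain integers before scoring.

```python
from datasets import load_dataset
from setfit import SetFitModel
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

model = SetFitModel.from_pretrained("germla/satoken")

test_ds = load_dataset("imdb", split="test")
preds = [int(p) for p in model(test_ds["text"])]

print("accuracy :", accuracy_score(test_ds["label"], preds))
print("f1       :", f1_score(test_ds["label"], preds))
print("precision:", precision_score(test_ds["label"], preds))
print("recall   :", recall_score(test_ds["label"], preds))
```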