oxygeneDev
/

sentiment-multilingual

@@ -25,6 +25,7 @@
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text

 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,144 +1,249 @@
 ---
-license: apache-2.0
-tags:
-- sentiment-analysis
-- text-classification
-- zero-shot-distillation
-- distillation
-- zero-shot-classification
-- debarta-v3
-model-index:
-- name: distilbert-base-multilingual-cased-sentiments-student
-  results: []
-datasets:
-- tyqiangz/multilingual-sentiments
 language:
 - en
 - ar
 - de
-- es
 - fr
-- ja
-- zh
-- id
-- hi
 - it
-- ms
-- pt
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# distilbert-base-multilingual-cased-sentiments-student
-This model is distilled from the zero-shot classification pipeline on the Multilingual Sentiment
-dataset using this [script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/zero-shot-distillation).
-In reality the multilingual-sentiment dataset is annotated of course,
-but we'll pretend and ignore the annotations for the sake of example.
-    Teacher model: MoritzLaurer/mDeBERTa-v3-base-mnli-xnli
-    Teacher hypothesis template: "The sentiment of this text is {}."
-    Student model: distilbert-base-multilingual-cased
-## Inference example
-```python
-from transformers import pipeline
-distilled_student_sentiment_classifier = pipeline(
-    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student",
-    return_all_scores=True
-)
-# english
-distilled_student_sentiment_classifier ("I love this movie and i would watch it again and again!")
->> [[{'label': 'positive', 'score': 0.9731044769287109},
-  {'label': 'neutral', 'score': 0.016910076141357422},
-  {'label': 'negative', 'score': 0.009985478594899178}]]
-# malay
-distilled_student_sentiment_classifier("Saya suka filem ini dan saya akan menontonnya lagi dan lagi!")
-[[{'label': 'positive', 'score': 0.9760093688964844},
-  {'label': 'neutral', 'score': 0.01804516464471817},
-  {'label': 'negative', 'score': 0.005945465061813593}]]
-# japanese
-distilled_student_sentiment_classifier("私はこの映画が大好きで、何度も見ます！")
->> [[{'label': 'positive', 'score': 0.9342429041862488},
-  {'label': 'neutral', 'score': 0.040193185210227966},
-  {'label': 'negative', 'score': 0.025563929229974747}]]
 ```
-## Training procedure
-Notebook link: [here](https://github.com/LxYuan0420/nlp/blob/main/notebooks/Distilling_Zero_Shot_multilingual_distilbert_sentiments_student.ipynb)
-### Training hyperparameters
-Result can be reproduce using the following commands:
-```bash
-python transformers/examples/research_projects/zero-shot-distillation/distill_classifier.py \
---data_file ./multilingual-sentiments/train_unlabeled.txt \
---class_names_file ./multilingual-sentiments/class_names.txt \
---hypothesis_template "The sentiment of this text is {}." \
---teacher_name_or_path MoritzLaurer/mDeBERTa-v3-base-mnli-xnli \
---teacher_batch_size 32 \
---student_name_or_path distilbert-base-multilingual-cased \
---output_dir ./distilbert-base-multilingual-cased-sentiments-student \
---per_device_train_batch_size 16 \
---fp16
-```
-If you are training this model on Colab, make the following code changes to avoid Out-of-memory error message:
-```bash
-###### modify L78 to disable fast tokenizer
-default=False,
-###### update dataset map part at L313
-dataset = dataset.map(tokenizer, input_columns="text", fn_kwargs={"padding": "max_length", "truncation": True, "max_length": 512})
-###### add following lines to L213
-del model
-print(f"Manually deleted Teacher model, free some memory for student model.")
-###### add following lines to L337
-trainer.push_to_hub()
-tokenizer.push_to_hub("distilbert-base-multilingual-cased-sentiments-student")
 ```
-### Training log
-```bash
-Training completed. Do not forget to share your model on huggingface.co/models =)
-{'train_runtime': 2009.8864, 'train_samples_per_second': 73.0, 'train_steps_per_second': 4.563, 'train_loss': 0.6473459283913797, 'epoch': 1.0}
-100%|███████████████████████████████████████| 9171/9171 [33:29<00:00,  4.56it/s]
-[INFO|trainer.py:762] 2023-05-06 10:56:18,555 >> The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
-[INFO|trainer.py:3129] 2023-05-06 10:56:18,557 >> ***** Running Evaluation *****
-[INFO|trainer.py:3131] 2023-05-06 10:56:18,557 >>   Num examples = 146721
-[INFO|trainer.py:3134] 2023-05-06 10:56:18,557 >>   Batch size = 128
-100%|███████████████████████████████████████| 1147/1147 [08:59<00:00,  2.13it/s]
-05/06/2023 11:05:18 - INFO - __main__ - Agreement of student and teacher predictions: 88.29%
-[INFO|trainer.py:2868] 2023-05-06 11:05:18,251 >> Saving model checkpoint to ./distilbert-base-multilingual-cased-sentiments-student
-[INFO|configuration_utils.py:457] 2023-05-06 11:05:18,251 >> Configuration saved in ./distilbert-base-multilingual-cased-sentiments-student/config.json
-[INFO|modeling_utils.py:1847] 2023-05-06 11:05:18,905 >> Model weights saved in ./distilbert-base-multilingual-cased-sentiments-student/pytorch_model.bin
-[INFO|tokenization_utils_base.py:2171] 2023-05-06 11:05:18,905 >> tokenizer config file saved in ./distilbert-base-multilingual-cased-sentiments-student/tokenizer_config.json
-[INFO|tokenization_utils_base.py:2178] 2023-05-06 11:05:18,905 >> Special tokens file saved in ./distilbert-base-multilingual-cased-sentiments-student/special_tokens_map.json
 ```
-### Framework versions
-- Transformers 4.28.1
-- Pytorch 2.0.0+cu118
-- Datasets 2.11.0
-- Tokenizers 0.13.3

 ---
+base_model: distilbert/distilbert-base-multilingual-cased
 language:
 - en
+- zh
+- es
+- hi
 - ar
+- bn
+- pt
+- ru
+- ja
 - de
+- ms
+- te
+- vi
+- ko
 - fr
+- tr
 - it
+- pl
+- uk
+- tl
+- nl
+- gsw
+- sw
+library_name: transformers
+license: cc-by-nc-4.0
+pipeline_tag: text-classification
+tags:
+- text-classification
+- sentiment-analysis
+- sentiment
+- synthetic data
+- multi-class
+- social-media-analysis
+- customer-feedback
+- product-reviews
+- brand-monitoring
+- multilingual
+- 🇪🇺
+- region:eu
+datasets:
+- tabularisai/swahili_sentiment_dataset
 ---
+# 🚀 Multilingual Sentiment Classification Model (23 Languages)
+<!-- TRY IT HERE: `coming soon`
+ -->
+[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/Discord%20button.png" width="200"/>](https://discord.gg/sznxwdqBXj)
+# NEWS!
+- 2025/8: Major model update adding one new language: **Swahili**! Also, general improvements across all languages.
+- 2025/8: Free API for our model! Please see below!
+- 2025/7: We’ve just released ModernFinBERT, a model we’ve been working on for a while. It’s built on the ModernBERT architecture and trained on a mix of real and synthetic data, with LLM-based label correction applied to public datasets to fix human annotation errors.
+It performs well across a range of benchmarks, in some cases improving accuracy by up to 48% over existing models like FinBERT.
+You can check it out here on Hugging Face:
+👉 https://huggingface.co/tabularisai/ModernFinBERT
+- 2024/12: We are excited to introduce a multilingual sentiment model! Now you can analyze sentiment across multiple languages, enhancing your global reach.
+## 🔌 Hosted API
+We provide a hosted inference API:
+**Example request:**
+```bash
+curl -X POST https://api.tabularis.ai/ \
+     -H "Content-Type: application/json" \
+     -d '{"text":"I love the design","return_all_scores":false}'
 ```
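The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_request` and `classify` are hypothetical, and the response schema is not documented in this card, so the parsed JSON is returned without interpretation.

```python
import json
import urllib.request

API_URL = "https://api.tabularis.ai/"  # endpoint taken from the curl example


def build_request(text, return_all_scores=False):
    """Build the POST request that mirrors the curl example (hypothetical helper)."""
    payload = {"text": text, "return_all_scores": return_all_scores}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text, return_all_scores=False, timeout=10):
    """Send the request and return the decoded JSON body as-is.

    The response format is not specified in the card, so nothing is assumed about it.
    """
    request = build_request(text, return_all_scores)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `classify("I love the design")` performs the network request; `build_request` can be inspected offline to confirm the payload before anything is sent.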
+## Model Details
+- `Model Name:` tabularisai/multilingual-sentiment-analysis
+- `Base Model:` distilbert/distilbert-base-multilingual-cased
+- `Task:` Text Classification (Sentiment Analysis)
+- `Languages:` Supports English plus Chinese (中文), Spanish (Español), Hindi (हिन्दी), Arabic (العربية), Bengali (বাংলা), Portuguese (Português), Russian (Русский), Japanese (日本語), German (Deutsch), Malay (Bahasa Melayu), Telugu (తెలుగు), Vietnamese (Tiếng Việt), Korean (한국어), French (Français), Turkish (Türkçe), Italian (Italiano), Polish (Polski), Ukrainian (Українська), Tagalog, Dutch (Nederlands), Swiss German (Schweizerdeutsch), and Swahili.
+- `Number of Classes:` 5 (*Very Negative, Negative, Neutral, Positive, Very Positive*)
+- `Usage:`
+  - Social media analysis
+  - Customer feedback analysis
+  - Product reviews classification
+  - Brand monitoring
+  - Market research
+  - Customer service optimization
+  - Competitive intelligence
+> If you wish to use this model for commercial purposes, please obtain a license by contacting: info@tabularis.ai
+## Model Description
+This model is a fine-tuned version of `distilbert/distilbert-base-multilingual-cased` for multilingual sentiment analysis. It leverages synthetic data from multiple sources to achieve robust performance across different languages and cultural contexts.
+### Training Data
+Trained exclusively on synthetic multilingual data generated by advanced LLMs, ensuring wide coverage of sentiment expressions from various languages.
+### Training Procedure
+- Fine-tuned for 3.5 epochs.
+- Achieved a train_acc_off_by_one of approximately 0.93 on the validation dataset.
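The metric name suggests an off-by-one accuracy, where a prediction counts as correct if it lands on the true class or on an adjacent one on the five-point scale. The card does not define the metric, so the following is a sketch under that assumption; `off_by_one_accuracy` is a hypothetical helper, not code from the training run.

```python
def off_by_one_accuracy(y_true, y_pred):
    """Fraction of predictions within one class index of the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    hits = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Class indices follow the card: 0 = Very Negative ... 4 = Very Positive.
y_true = [0, 1, 2, 3, 4]
y_pred = [1, 1, 4, 3, 3]  # only the third prediction is off by more than one
print(off_by_one_accuracy(y_true, y_pred))  # 0.8
```

Under this reading, confusing "Positive" with "Very Positive" is not penalized, while confusing "Neutral" with "Very Positive" is.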
+## Intended Use
+Ideal for:
+- Multilingual social media monitoring
+- International customer feedback analysis
+- Global product review sentiment classification
+- Worldwide brand sentiment tracking
+## How to Use
+Using pipelines, it takes only 4 lines:
+```python
+from transformers import pipeline
+# Load the classification pipeline with the specified model
+pipe = pipeline("text-classification", model="tabularisai/multilingual-sentiment-analysis")
+# Classify a new sentence
+sentence = "I love this product! It's amazing and works perfectly."
+result = pipe(sentence)
+# Print the result
+print(result)
 ```
+Below is a Python example of how to use the multilingual sentiment model without pipelines:
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+model_name = "tabularisai/multilingual-sentiment-analysis"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+def predict_sentiment(texts):
+    inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
+    with torch.no_grad():
+        outputs = model(**inputs)
+    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}
+    return [sentiment_map[p] for p in torch.argmax(probabilities, dim=-1).tolist()]
+texts = [
+    # English
+    "I absolutely love the new design of this app!", "The customer service was disappointing.", "The weather is fine, nothing special.",
+    # Chinese
+    "这家餐厅的菜味道非常棒！", "我对他的回答很失望。", "天气今天一般。",
+    # Spanish
+    "¡Me encanta cómo quedó la decoración!", "El servicio fue terrible y muy lento.", "El libro estuvo más o menos.",
+    # Arabic
+    "الخدمة في هذا الفندق رائعة جدًا!", "لم يعجبني الطعام في هذا المطعم.", "كانت الرحلة عادية.",
+    # Ukrainian
+    "Мені дуже сподобалася ця вистава!", "Обслуговування було жахливим.", "Книга була посередньою.",
+    # Hindi
+    "यह जगह सच में अद्भुत है!", "यह अनुभव बहुत खराब था।", "फिल्म ठीक-ठाक थी।",
+    # Bengali
+    "এখানকার পরিবেশ অসাধারণ!", "সেবার মান একেবারেই খারাপ।", "খাবারটা মোটামুটি ছিল।",
+    # Portuguese
+    "Este livro é fantástico! Eu aprendi muitas coisas novas e inspiradoras.",
+    "Não gostei do produto, veio quebrado.", "O filme foi ok, nada de especial.",
+    # Japanese
+    "このレストランの料理は本当に美味しいです！", "このホテルのサービスはがっかりしました。", "天気はまあまあです。",
+    # Russian
+    "Я в восторге от этого нового гаджета!", "Этот сервис оставил у меня только разочарование.", "Встреча была обычной, ничего особенного.",
+    # French
+    "J'adore ce restaurant, c'est excellent !", "L'attente était trop longue et frustrante.", "Le film était moyen, sans plus.",
+    # Turkish
+    "Bu otelin manzarasına bayıldım!", "Ürün tam bir hayal kırıklığıydı.", "Konser fena değildi, ortalamaydı.",
+    # Italian
+    "Adoro questo posto, è fantastico!", "Il servizio clienti è stato pessimo.", "La cena era nella media.",
+    # Polish
+    "Uwielbiam tę restaurację, jedzenie jest świetne!", "Obsługa klienta była rozczarowująca.", "Pogoda jest w porządku, nic szczególnego.",
+    # Tagalog
+    "Ang ganda ng lugar na ito, sobrang aliwalas!", "Hindi maganda ang serbisyo nila dito.", "Maayos lang ang palabas, walang espesyal.",
+    # Dutch
+    "Ik ben echt blij met mijn nieuwe aankoop!", "De klantenservice was echt slecht.", "De presentatie was gewoon oké, niet bijzonder.",
+    # Malay
+    "Saya suka makanan di sini, sangat sedap!", "Pengalaman ini sangat mengecewakan.", "Hari ini cuacanya biasa sahaja.",
+    # Korean
+    "이 가게의 케이크는 정말 맛있어요!", "서비스가 너무 별로였어요.", "날씨가 그저 그렇네요.",
+    # Swiss German
+    "Ich find dä Service i de Beiz mega guet!", "Däs Esä het mir nöd gfalle.", "D Wätter hüt isch so naja."
+]
+for text, sentiment in zip(texts, predict_sentiment(texts)):
+    print(f"Text: {text}\nSentiment: {sentiment}\n")
 ```
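The post-processing inside `predict_sentiment` is a softmax followed by an argmax over the five logits. The sketch below reproduces that step in plain Python on made-up logits (not real model outputs), so it runs without downloading the model; `softmax` and `label_from_logits` are illustrative names.

```python
import math

# Same index-to-label mapping as in the README example.
SENTIMENT_MAP = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return SENTIMENT_MAP[best], probs[best]


# Made-up logits for illustration only.
label, prob = label_from_logits([-2.0, -1.0, 0.5, 3.0, 1.5])
print(label)  # Positive
```

Because softmax preserves ordering, the argmax of the probabilities equals the argmax of the raw logits; the softmax only matters when the scores themselves are reported.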
+## Ethical Considerations
+Training on synthetic data can reduce some annotation biases, but it does not remove bias altogether; validate the model on real-world data before relying on it.
+## Citation
+```bib
+@misc{tabularisai_2025,
+	author       = { tabularisai and Samuel Gyamfi and Vadim Borisov and Richard H. Schreiber },
+	title        = { multilingual-sentiment-analysis (Revision 69afb83) },
+	year         = 2025,
+	url          = { https://huggingface.co/tabularisai/multilingual-sentiment-analysis },
+	doi          = { 10.57967/hf/5968 },
+	publisher    = { Hugging Face }
+}
+```
+## Contact
+For inquiries about data, private APIs, or better models, contact info@tabularis.ai
+tabularis.ai
+<table align="center">
+  <tr>
+    <td align="center">
+      <a href="https://www.linkedin.com/company/tabularis-ai/">
+        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/linkedin.svg" alt="LinkedIn" width="30" height="30">
+      </a>
+    </td>
+    <td align="center">
+      <a href="https://x.com/tabularis_ai">
+        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/x.svg" alt="X" width="30" height="30">
+      </a>
+    </td>
+    <td align="center">
+      <a href="https://github.com/tabularis-ai">
+        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/github.svg" alt="GitHub" width="30" height="30">
+      </a>
+    </td>
+    <td align="center">
+      <a href="https://tabularis.ai">
+        <img src="https://cdn.jsdelivr.net/gh/simple-icons/simple-icons/icons/internetarchive.svg" alt="Website" width="30" height="30">
+      </a>
+    </td>
+  </tr>
+</table>

config.json CHANGED

@@ -1,5 +1,4 @@
 {
-  "_name_or_path": "distilbert-base-multilingual-cased",
   "activation": "gelu",
   "architectures": [
     "DistilBertForSequenceClassification"
@@ -9,15 +8,19 @@
   "dropout": 0.1,
   "hidden_dim": 3072,
   "id2label": {
-    "0": "positive",
-    "1": "neutral",
-    "2": "negative"
   },
   "initializer_range": 0.02,
   "label2id": {
-    "negative": 2,
-    "neutral": 1,
-    "positive": 0
   },
   "max_position_embeddings": 512,
   "model_type": "distilbert",
@@ -25,11 +28,12 @@
   "n_layers": 6,
   "output_past": true,
   "pad_token_id": 0,
   "qa_dropout": 0.1,
   "seq_classif_dropout": 0.2,
   "sinusoidal_pos_embds": false,
   "tie_weights_": true,
   "torch_dtype": "float32",
-  "transformers_version": "4.28.1",
   "vocab_size": 119547
 }

 {
   "activation": "gelu",
   "architectures": [
     "DistilBertForSequenceClassification"
   "dropout": 0.1,
   "hidden_dim": 3072,
   "id2label": {
+    "0": "Very Negative",
+    "1": "Negative",
+    "2": "Neutral",
+    "3": "Positive",
+    "4": "Very Positive"
   },
   "initializer_range": 0.02,
   "label2id": {
+    "Negative": 1,
+    "Neutral": 2,
+    "Positive": 3,
+    "Very Negative": 0,
+    "Very Positive": 4
   },
   "max_position_embeddings": 512,
   "model_type": "distilbert",
   "n_layers": 6,
   "output_past": true,
   "pad_token_id": 0,
+  "problem_type": "single_label_classification",
   "qa_dropout": 0.1,
   "seq_classif_dropout": 0.2,
   "sinusoidal_pos_embds": false,
   "tie_weights_": true,
   "torch_dtype": "float32",
+  "transformers_version": "4.55.0",
   "vocab_size": 119547
 }
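Since the label set changed from three classes to five, it is worth checking that `id2label` and `label2id` remain exact inverses. A minimal sketch, with the two mappings copied from the new config.json and a hypothetical helper name `check_label_maps`:

```python
import json

# Mappings copied from the new config.json; JSON object keys are strings.
CONFIG_FRAGMENT = json.loads("""
{
  "id2label": {"0": "Very Negative", "1": "Negative", "2": "Neutral",
               "3": "Positive", "4": "Very Positive"},
  "label2id": {"Negative": 1, "Neutral": 2, "Positive": 3,
               "Very Negative": 0, "Very Positive": 4}
}
""")


def check_label_maps(config):
    """Return the number of labels, or raise if the two maps disagree."""
    id2label = {int(k): v for k, v in config["id2label"].items()}
    label2id = config["label2id"]
    if {v: k for k, v in id2label.items()} != label2id:
        raise ValueError("id2label and label2id are not inverses")
    if sorted(id2label) != list(range(len(id2label))):
        raise ValueError("class ids are not contiguous from 0")
    return len(id2label)


print(check_label_maps(CONFIG_FRAGMENT))  # 5
```

Note that the class order is also reversed relative to the old config: index 0 used to mean "positive" and now means "Very Negative", so downstream code that hard-codes indices needs updating.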

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0ab095b69033944004bb7a6ffcbfc2d77a240dd0f71c27820aa73efa71f664fe
-size 541320452

 version https://git-lfs.github.com/spec/v1
+oid sha256:3ab3cecb8605da0a240e5b4e18d969704d44e27c6ea48533ef6693d31dbb926a
+size 541326604
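The growth of model.safetensors is consistent with the classifier head widening from 3 to 5 classes. The arithmetic below assumes a hidden size of 768 (the usual value for distilbert-base-multilingual-cased; the `dim` field is not visible in the config.json hunks shown here), 4 bytes per float32 parameter, and an unchanged safetensors header length.

```python
OLD_SIZE = 541_320_452  # bytes, from the removed LFS pointer
NEW_SIZE = 541_326_604  # bytes, from the added LFS pointer

HIDDEN_SIZE = 768       # assumption: not shown in the diff
BYTES_PER_PARAM = 4     # float32, per "torch_dtype" in config.json
OLD_CLASSES, NEW_CLASSES = 3, 5

# Each extra class adds one weight row plus one bias term to the final layer.
extra_params = (NEW_CLASSES - OLD_CLASSES) * (HIDDEN_SIZE + 1)
expected_growth = extra_params * BYTES_PER_PARAM

print(NEW_SIZE - OLD_SIZE)  # 6152
print(expected_growth)      # 6152
```

The two numbers agree, which suggests the rest of the architecture is unchanged and only the output layer was resized.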

tokenizer.json CHANGED

@@ -1,19 +1,7 @@
 {
   "version": "1.0",
-  "truncation": {
-    "direction": "Right",
-    "max_length": 512,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
-  "padding": {
-    "strategy": "BatchLongest",
-    "direction": "Right",
-    "pad_to_multiple_of": null,
-    "pad_id": 0,
-    "pad_type_id": 0,
-    "pad_token": "[PAD]"
-  },
   "added_tokens": [
     {
       "id": 0,

 {
   "version": "1.0",
+  "truncation": null,
+  "padding": null,
   "added_tokens": [
     {
       "id": 0,

tokenizer_config.json CHANGED

@@ -1,11 +1,51 @@
 {
-  "clean_up_tokenization_spaces": true,
   "cls_token": "[CLS]",
-  "do_basic_tokenize": true,
   "do_lower_case": false,
   "mask_token": "[MASK]",
   "model_max_length": 512,
-  "never_split": null,
   "pad_token": "[PAD]",
   "sep_token": "[SEP]",
   "strip_accents": null,

 {
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
   "cls_token": "[CLS]",
   "do_lower_case": false,
   "mask_token": "[MASK]",
   "model_max_length": 512,
   "pad_token": "[PAD]",
   "sep_token": "[SEP]",
   "strip_accents": null,