---
license: cc-by-nc-nd-4.0
language:
  - th
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - he
  - hi
  - hr
  - hu
  - hy
  - id
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - su
  - sv
  - sw
  - ta
  - te
  - tl
  - tr
  - ug
  - uk
  - ur
  - uz
  - vi
  - xh
  - yi
  - zh
base_model: intfloat/multilingual-e5-large
library_name: transformers
pipeline_tag: text-classification
tags:
  - sentiment-analysis
  - thai
  - multilingual
  - fine-tuned
  - transformers
  - southeast-asian
metrics:
  - accuracy
  - f1
  - precision
  - recall
widget:
  - text: ผลิตภัณฑ์นี้ดีมาก ใช้งานง่าย
    example_title: Thai Positive
  - text: บริการแย่มาก ไม่ประทับใจเลย
    example_title: Thai Negative
  - text: อาหารรสชาติธรรมดา
    example_title: Thai Neutral
  - text: ราคาเท่าไหร่ครับ?
    example_title: Thai Question

---

# 🎯 MultiSent-E5-Pro: Advanced Thai Sentiment Classifier


> 🇹🇭 State-of-the-art Thai sentiment analysis with multilingual capabilities

## 📋 Quick Overview

MultiSent-E5-Pro is a fine-tuned sentiment analysis model based on [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large), optimized for Thai with support for multilingual contexts. The model classifies text into four categories: **Positive**, **Negative**, **Neutral**, and **Question**.

## 🎯 Key Features

- Handles Thai-specific expressions, colloquialisms, and sarcasm effectively
- Performs well on real-world social media and review data
- Multilingual support for Southeast and East Asian languages

## 🏆 Benchmark Summary

| Rank | Model | Accuracy | F1-Macro | Notes |
|------|-------|----------|----------|-------|
| 🥇 1st | MultiSent-E5-Pro | 84.61% | 84.61% | Best overall |
| 2nd | MultiSent-E5 | 80.62% | 80.62% | Baseline model |
| 3rd | sentiment-103 | 57.40% | 49.87% | Moderate baseline |

## 📊 Detailed Metrics (2,183 samples)

| Metric | Score |
|--------|-------|
| Accuracy | 84.61% |
| F1-Macro | 84.61% |
| F1-Weighted | 84.75% |
| Avg. Confidence | 98.53% |
| Low-Confidence Rate (<60%) | 0.96% |

### Per-Class Performance

| Class | Precision | Recall | F1 | Notes |
|-------|-----------|--------|----|-------|
| Negative | 91.0% | 84.6% | 87.7% | Excellent |
| Positive | 83.0% | 94.3% | 88.3% | Excellent |
| Neutral | 71.9% | 81.6% | 76.4% | Moderate |
| Question | 94.4% | 79.0% | 86.0% | Good |
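The per-class numbers above follow the standard precision/recall/F1 definitions. As a minimal illustration of how such figures are derived (using an illustrative 2-class confusion matrix, not the model's actual evaluation counts):

```python
def per_class_metrics(confusion, labels):
    """Compute precision, recall, and F1 per class from a confusion matrix.

    confusion[i][j] = number of samples with true class i predicted as class j.
    """
    metrics = {}
    n = len(labels)
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true != k
        fn = sum(confusion[k][j] for j in range(n)) - tp  # true k, predicted != k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

# Toy example (made-up counts):
cm = [[80, 20],   # true Negative: 80 correct, 20 predicted Positive
      [10, 90]]   # true Positive: 10 predicted Negative, 90 correct
print(per_class_metrics(cm, ["Negative", "Positive"]))
```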

## ⚡ Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ZombitX64/MultiSent-E5-Pro"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "ผลิตภัณฑ์นี้ดีมาก ใช้งานง่าย"  # "This product is great, easy to use"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted = torch.argmax(probs, dim=-1)

labels = ["Question", "Negative", "Neutral", "Positive"]
print(f"Sentiment: {labels[predicted.item()]} (Confidence: {probs[0][predicted.item()].item():.2%})")
```
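In production it can help to flag low-confidence predictions for human review, mirroring the <60% confidence bucket reported in the metrics above. A minimal post-processing sketch over raw logits (pure Python; `route_prediction` and the 0.60 threshold are illustrative choices, not part of the model):

```python
import math

# Index order as documented in the Quick Start above.
LABELS = ["Question", "Negative", "Neutral", "Positive"]

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_prediction(logits, threshold=0.60):
    """Return (label, confidence, needs_review) for one example's logits."""
    probs = softmax(logits)
    conf = max(probs)
    label = LABELS[probs.index(conf)]
    return label, conf, conf < threshold

# Made-up logits: one confident case, one ambiguous case.
print(route_prediction([0.1, 0.2, 0.5, 4.0]))   # high-confidence Positive
print(route_prediction([1.0, 1.1, 1.2, 1.3]))   # near-uniform -> flagged for review
```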

## 🌟 Use Cases

| Application | Suitability |
|-------------|-------------|
| Product Reviews | 🟢 Excellent |
| Social Media | 🟢 Excellent |
| Customer Support | 🟢 Excellent |
| Content Moderation | 🟡 Good |
| Research Analysis | 🟡 Good |

## ⚠️ Known Limitations

- **Sarcasm misclassification**, especially in Chinese text
- **Mixed sentiments** tend to be classified as Neutral
- **Low recall for the Question class** due to limited training data
- **Bias toward Positive** caused by class imbalance
- **Overconfidence** in some multilingual predictions

## 🛠️ Technical Info

| Config | Value |
|--------|-------|
| Base Model | intfloat/multilingual-e5-large |
| Parameters | ~1.02B |
| Classes | 4 |
| Max Length | 512 tokens |
| Training Time | ~27 min |
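Inputs longer than the 512-token limit are simply truncated by the Quick Start snippet; for long documents, one common workaround is to score overlapping windows and average the probabilities. A generic sliding-window sketch (operating on an already-tokenized list; the window and stride values are illustrative):

```python
def sliding_windows(tokens, max_len=512, stride=384):
    """Split a token sequence into overlapping windows of at most max_len tokens.

    Each window starts `stride` tokens after the previous one, so consecutive
    windows overlap by (max_len - stride) tokens. Each window can then be
    scored independently and the class probabilities averaged.
    """
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows

chunks = sliding_windows(list(range(1200)), max_len=512, stride=384)
print([len(c) for c in chunks])  # -> [512, 512, 432]
```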

**Data Summary:**

- Training: 2,456 samples
- Validation: 273 samples
- Evaluation: 2,183 samples

## 📄 Citation

```bibtex
@misc{MultiSent-E5-Pro-2024,
  title={MultiSent-E5-Pro: Advanced Thai Sentiment Analysis},
  author={ZombitX64 and Janutsaha, Krittanut and Saengwichain, Chanyut},
  year={2024},
  url={https://huggingface.co/ZombitX64/MultiSent-E5-Pro},
  note={Hugging Face Model Card}
}

@article{wang2024multilingual,
  title={Multilingual E5 Text Embeddings: A Technical Report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}
```

## 👨‍💼 Authors

| Role | Name |
|------|------|
| Lead Dev | ZombitX64 |
| Data Scientist | Krittanut Janutsaha |
| Engineer | Chanyut Saengwichain |

## 😊 Feedback & Contributions

*Last Updated: December 2024 | Version: 1.1 | Docs: v2.0*