🎯 MultiSent-E5-Pro: Advanced Thai Sentiment Classifier
📋 Quick Overview
MultiSent-E5-Pro is a sentiment analysis model fine-tuned from intfloat/multilingual-e5-large. It is optimized for Thai and also supports multilingual input. The model classifies text into four categories: Positive, Negative, Neutral, and Question.
🎯 Key Features
- Handles Thai-specific expressions, colloquialisms, and sarcasm effectively
- Performs well on real-world social media & review data
- Multilingual support for Southeast and East Asian languages
🏆 Benchmark Summary
| Rank | Model | Accuracy | F1-Macro | Notes |
|---|---|---|---|---|
| 🥇 1st | MultiSent-E5-Pro | 84.61% | 84.61% | Best overall |
| 2nd | MultiSent-E5 | 80.62% | 80.62% | Baseline model |
| 3rd | sentiment-103 | 57.40% | 49.87% | Moderate baseline |
📊 Detailed Metrics (2,183 samples)
| Metric | Score |
|---|---|
| Accuracy | 84.61% |
| F1-Macro | 84.61% |
| F1-Weighted | 84.75% |
| Avg Confidence | 98.53% |
| Low Confidence Rate (<60%) | 0.96% |
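The last two rows summarize the model's top-class probability across the evaluation set. A minimal sketch of how such figures could be computed, assuming `confidences` holds one top-class probability per sample. The function name `confidence_summary` and the sample values are hypothetical; only the 60% threshold comes from the table.

```python
def confidence_summary(confidences, threshold=0.60):
    """Return (average confidence, share of predictions below threshold)."""
    if not confidences:
        raise ValueError("confidences must not be empty")
    avg = sum(confidences) / len(confidences)
    low_rate = sum(1 for c in confidences if c < threshold) / len(confidences)
    return avg, low_rate


# Illustrative values only, not taken from the evaluation run.
confidences = [0.99, 0.97, 0.55, 0.98, 0.91]
avg, low_rate = confidence_summary(confidences)
print(f"Avg confidence: {avg:.2%}, low-confidence rate: {low_rate:.2%}")
```

For scale, a 0.96% rate on 2,183 samples corresponds to roughly 21 predictions below the 60% threshold.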
Per-Class Performance
| Class | Precision | Recall | F1 | Notes |
|---|---|---|---|---|
| Negative | 91.0% | 84.6% | 87.7% | Excellent |
| Positive | 83.0% | 94.3% | 88.3% | Excellent |
| Neutral | 71.9% | 81.6% | 76.4% | Moderate |
| Question | 94.4% | 79.0% | 86.0% | Good |
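The F1 column follows from precision and recall as their harmonic mean, and the macro F1 is the unweighted mean of the four per-class scores. The sketch below recomputes both from the rounded table values, so results can differ from the reported figures in the last decimal place. The variable and function names are illustrative.

```python
# Precision and recall (in percent) copied from the per-class table.
per_class = {
    "Negative": (91.0, 84.6),
    "Positive": (83.0, 94.3),
    "Neutral": (71.9, 81.6),
    "Question": (94.4, 79.0),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_by_class = {name: f1_score(p, r) for name, (p, r) in per_class.items()}

# Macro F1 weights every class equally, regardless of its sample count.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

for name, f1 in f1_by_class.items():
    print(f"{name}: {f1:.1f}%")
print(f"Macro F1: {macro_f1:.2f}%")
```

Because the macro average treats all classes equally, the weaker Neutral score (76.4%) pulls it below the Negative, Positive, and Question scores.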
⚡ Quick Start
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub
model_id = "ZombitX64/MultiSent-E5-Pro"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "ผลิตภัณฑ์นี้ดีมาก ใช้งานง่าย"  # "This product is very good and easy to use"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted = torch.argmax(probs, dim=-1).item()

labels = ["Question", "Negative", "Neutral", "Positive"]
print(f"Sentiment: {labels[predicted]} (Confidence: {probs[0][predicted].item():.2%})")
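The post-processing step above (softmax, argmax, label lookup) can also be written without PyTorch, which may be convenient when logits are logged or served as plain lists. This is a minimal sketch: the `decode` helper, the `needs_review` flag, and the example logits are assumptions for illustration, not part of the model's API. The label order is copied from the Quick Start snippet and the 60% cutoff from the metrics table.

```python
import math

# Label order copied from the Quick Start snippet.
LABELS = ["Question", "Negative", "Neutral", "Positive"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, threshold=0.60):
    """Map raw logits to (label, confidence, needs_review)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best], probs[best] < threshold


# Hypothetical logits for one input, e.g. outputs.logits[0].tolist()
label, confidence, needs_review = decode([-1.2, -0.8, 0.3, 4.1])
print(label, f"{confidence:.2%}", needs_review)
```

Flagging predictions below the threshold gives downstream code a simple hook for routing uncertain inputs to manual review.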
🌟 Use Cases
| Application | Suitability |
|---|---|
| Product Reviews | 🟢 Excellent |
| Social Media | 🟢 Excellent |
| Customer Support | 🟢 Excellent |
| Content Moderation | 🟡 Good |
| Research Analysis | 🟡 Good |
⚠ Known Limitations
- Sarcasm can be misclassified, especially in Chinese
- Mixed-sentiment text tends to be classified as Neutral
- Lower recall for the Question class (79.0%) due to limited data
- Bias toward Positive due to class imbalance
- Overconfidence in some multilingual predictions
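One common post-hoc way to soften overconfident probabilities is temperature scaling, where logits are divided by a constant before the softmax. The card does not say the authors use it. The sketch below only illustrates the effect, and the temperature of 2.0 and the logits are hypothetical. In practice the temperature would be fit on held-out data.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; temperature > 1 flattens the output."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a single input.
logits = [-1.2, -0.8, 0.3, 4.1]
raw = softmax_with_temperature(logits)
softened = softmax_with_temperature(logits, temperature=2.0)
print(f"Top probability: {max(raw):.2%} -> {max(softened):.2%}")
```

Scaling changes only the confidence values. The predicted class stays the same because dividing by a positive constant preserves the ordering of the logits.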
🛠 Technical Info
| Config | Value |
|---|---|
| Base Model | multilingual-e5-large |
| Params | ~1.02B |
| Classes | 4 |
| Max Length | 512 |
| Training Time | ~27 min |
Data Summary:
- Training: 2,456 samples
- Validation: 273 samples
- Evaluation: 2,183 samples
📄 Citation
@misc{MultiSent-E5-Pro-2024,
  title={MultiSent-E5-Pro: Advanced Thai Sentiment Analysis},
  author={ZombitX64 and Janutsaha, Krittanut and Saengwichain, Chanyut},
  year={2024},
  url={https://huggingface.co/ZombitX64/MultiSent-E5-Pro},
  note={Hugging Face Model Card}
}
@article{wang2024multilingual,
  title={Multilingual E5 Text Embeddings: A Technical Report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}
👨💼 Authors
| Role | Name |
|---|---|
| Lead Dev | ZombitX64 |
| Data Scientist | Krittanut Janutsaha |
| Engineer | Chanyut Saengwichain |
😊 Feedback & Contributions
Last Updated: Dec 2024 | Version: 1.1 | Docs: v2.0