Update README.md
README.md
CHANGED

@@ -1,4 +1,5 @@
---
license: cc-by-nc-nd-4.0
language:
- th
@@ -94,525 +95,197 @@ language:
- xh
- yi
- zh
base_model:
- intfloat/multilingual-e5-large
library_name: transformers
pipeline_tag: text-classification
metrics:
- accuracy
- f1
- bertscore
tags:
- sentiment-analysis
- thai
- fine-tuned
datasets:
- ZombitX64/SEACrowdWongnaiReviews
- ZombitX64/Sentiment-Benchmark
---

### Model Description

This model is a fine-tuned version of intfloat/multilingual-e5-large trained specifically for Thai sentiment analysis. It classifies Thai text into four sentiment categories: positive, negative, neutral, and question. The model performs strongly on Thai sentiment classification and handles Thai linguistic nuances such as sarcasm and implicit sentiment.

The model is particularly effective at:

- **Sarcasm Detection**: Understanding when positive words are used in a negative context
- **Cultural Context**: Recognizing Thai-specific expressions and cultural references
- **Implicit Sentiment**: Detecting sentiment even when not explicitly stated
- **Colloquial Language**: Processing informal Thai text from social media and conversations

* **Developed by:** ZombitX64, Krittanut Janutsaha, Chanyut Saengwichain
* **Model type:** Sequence Classification (Sentiment Analysis)
* **Language(s) (NLP):** Thai (th) - primary, with limited multilingual capability
* **License:** Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
* **Finetuned from model:** [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
* **Base Model:** [https://huggingface.co/intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)

* **Customer Feedback Analysis**: Processing reviews and feedback in Thai for e-commerce and services
* **Product Review Classification**: Automatically categorizing product reviews by sentiment
* **Opinion Mining**: Extracting sentiment from Thai news articles, blogs, and forums
* **Customer Service**: Categorizing customer inquiries and complaints by sentiment and intent

The model can be integrated into:

* **Social Media Analytics Platforms**: Real-time sentiment monitoring dashboards
* **E-commerce Review Systems**: Automated review scoring and categorization
* **Content Moderation Systems**: Identifying potentially problematic content
* **Market Research Tools**: Analyzing consumer sentiment towards brands or products
* **News Analysis Systems**: Tracking public opinion on political or social issues

* **Mixed Sentiment Analysis**: Complex texts with both positive and negative elements may be misclassified or receive low confidence scores. Consider aspect-based sentiment analysis for such cases.
* **Non-Thai Languages**: While the model has some multilingual capability, accuracy is significantly lower for languages other than Thai
* **Fine-grained Emotion Detection**: The model classifies only into 4 broad categories, not specific emotions such as anger, joy, or fear
* **Clinical Applications**: Should not be used for mental health diagnosis or psychological assessment without proper validation
* **High-stakes Decision Making**: Avoid using for critical decisions affecting individuals without human oversight, especially for predictions with confidence < 60%
* **Legal or Financial Decisions**: The model's predictions should not be the sole basis for legal or financial determinations

### Multilingual Performance

| Language | Performance | Recommendation |
|----------|-------------|----------------|
| Other Languages | Poor (40-60% accuracy) | Not recommended |

* **Secondary Use**: For other languages, consider language-specific models for maximum accuracy
* **Validation Required**: Always validate results when using with non-Thai languages
* **Experimental Use**: Multilingual capability can be useful for initial exploration or when Thai-specific models are unavailable

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ZombitX64/MultiSent-E5-Pro"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Thai text
text = "ผลิตภัณฑ์นี้ดีมาก ใช้งานง่าย"  # "This product is very good, easy to use"

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1)

# Label mapping: 0=Question, 1=Negative, 2=Neutral, 3=Positive
labels = ["Question", "Negative", "Neutral", "Positive"]
predicted_label = labels[predicted_class.item()]
confidence = predictions[0][predicted_class.item()].item()

print(f"Text: {text}")
print(f"Predicted sentiment: {predicted_label} ({confidence:.2%})")
```

### Batch Processing

```python
# List of texts to analyze (multilingual examples)
texts = [
    "ผลิตภัณฑ์นี้ดีมาก ใช้งานง่าย",  # Thai: "This product is very good, easy to use"
    "The service was terrible and disappointing",  # English
    "商品质量还可以",  # Chinese: "Product quality is okay"
    "บริการแย่มาก ไม่ประทับใจเลย",  # Thai: "Service is terrible, not impressed at all"
    "Ce produit est excellent",  # French: "This product is excellent"
]

print("Predicting sentiment for multiple texts:")
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1)

    predicted_label = labels[predicted_class.item()]
    confidence = predictions[0][predicted_class.item()].item()

    print(f"\nText: \"{text}\"")
    print(f"Predicted sentiment: {predicted_label} ({confidence:.2%})")
```

### Pipeline Usage

```python
from transformers import pipeline

# Create a sentiment analysis pipeline
classifier = pipeline("text-classification",
                      model="ZombitX64/MultiSent-E5-Pro",
                      tokenizer="ZombitX64/MultiSent-E5-Pro")

# Analyze sentiment
texts = [
    "วันนี้อากาศดีจังเลย",  # "The weather is so nice today"
    "แย่ที่สุดเท่าที่เคยเจอมา"  # "The worst I've ever encountered"
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']} (Score: {result['score']:.4f})")
```

### Training Data

The model was trained on a curated Thai sentiment dataset with the following characteristics:

* **Total samples:** 2,730 (2,729 after data cleaning and filtering)
* **Data Distribution:**
  - **Question samples:** Minimal representation (specific count not provided)
  - **Negative samples:** 102 (3.7% of dataset)
  - **Neutral samples:** 317 (11.6% of dataset)
  - **Positive samples:** 2,310 (84.7% of dataset)

**Data Split Strategy:**
* **Training set:** 2,456 samples (90% of total data)
* **Validation set:** 273 samples (10% of total data)

**Data Quality and Preprocessing:**
* Data was manually reviewed and cleaned to ensure quality
* Duplicate entries were removed
* Text was normalized for consistent formatting
* Class imbalance was noted but preserved to reflect the real-world distribution
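The 90/10 split stated above is consistent with the post-cleaning total; a quick check in plain Python (no model required):

```python
total = 2729  # samples after cleaning and filtering

# 90/10 train/validation split, as reported on this card
train = round(total * 0.9)
val = total - train

print(train, val)  # 2456 273
```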
### Training Procedure

The model was fine-tuned with careful hyperparameter selection:

#### Training Hyperparameters

* **Base Model:** intfloat/multilingual-e5-large (~560M parameters)
* **Model Architecture:** XLMRobertaForSequenceClassification
* **Training Epochs:** 5 (with early stopping monitoring)
* **Total Training Steps:** 770
* **Batch Size:** 8 (effective batch size with gradient accumulation)
* **Learning Rate:** 2e-5 with linear warmup and decay
* **Weight Decay:** 0.01
* **Warmup Steps:** 77 (10% of total steps)
* **Max Sequence Length:** 512 tokens
* **Optimizer:** AdamW
* **Training Runtime:** 1,633.3 seconds (~27 minutes)
* **Training Samples per Second:** 7.519
* **Training Steps per Second:** 0.471

#### Training Infrastructure

* **Hardware:** GPU-accelerated training (specific GPU not specified)
* **Framework:** Hugging Face Transformers 4.x
* **Distributed Training:** Single GPU setup
* **Memory Optimization:** Gradient checkpointing enabled

#### Training Results

The model converged quickly with minimal overfitting:

| Epoch | Training Loss | Validation Loss | Accuracy | Notes |
|-------|---------------|-----------------|----------|-------|
| 1 | 0.0812 | 0.0699 | 98.53% | Strong initial performance |
| 2 | 0.0053 | 0.0527 | 99.27% | Rapid improvement |
| 3 | 0.0041 | 0.0350 | 99.63% | Near-optimal performance |
| 4 | 0.0002 | 0.0384 | 99.63% | Slight validation loss increase |
| 5 | 0.0002 | 0.0410 | 99.63% | Stable performance |

**Training Observations:**
- Very low training loss achieved by epoch 3
- Validation loss remained stable, indicating minimal overfitting
- Accuracy plateaued at 99.63% from epoch 3 onwards
- Early convergence suggests effective transfer learning from the base model

#### Benchmark Evaluation

Evaluating ZombitX64/MultiSent-E5-Pro on 2,183 benchmark samples:

* Accuracy: 0.846
* F1-Macro: 0.846
* F1-Weighted: 0.847
* Avg Confidence: 0.985
* Low Confidence %: 1.0%
* Error Rate: 0.154

Sample errors (sarcastic or implicit sentiment, all misclassified with full confidence):

* '今天的表现无可挑剔' ("Today's performance is impeccable") -> neutral (conf: 1.00) [True: positive]
* '这真是个天才的想法,我简直佩服得五体投地' ("What a genius idea, I'm utterly in awe") -> positive (conf: 1.00) [True: negative]
* '你真是太能干了,把事情搞成这样' ("You're so capable, look at the mess you've made") -> positive (conf: 1.00) [True: negative]
* '这个项目真是太成功了,成功到一塌糊涂' ("This project is such a success, successful to the point of a shambles") -> positive (conf: 1.00) [True: negative]
* '这饭菜做得真是太好吃了,我一点都吃不下' ("This food is so delicious I can't eat a single bite") -> positive (conf: 1.00) [True: negative]

Best performing model on the benchmark: ZombitX64/MultiSent-E5-Pro.

**Per-Class Performance:**

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| negative | 0.910 | 0.846 | 0.877 | 661 |
| neutral | 0.719 | 0.816 | 0.764 | 517 |
| positive | 0.830 | 0.943 | 0.883 | 471 |
| question | 0.944 | 0.790 | 0.860 | 534 |

**Model Comparison on ZombitX64/Sentiment-Benchmark (ranked by F1-Macro):**

| Model | Accuracy | F1-Macro | F1-Weighted | Avg Confidence | Low Conf % | Error Rate |
|-------|----------|----------|-------------|----------------|------------|------------|
| ZombitX64/MultiSent-E5-Pro | 0.8461 | 0.8461 | 0.8475 | 0.9853 | 0.9620 | 0.1539 |
| ZombitX64/MultiSent-E5 | 0.8062 | 0.8062 | 0.8072 | 0.9708 | 1.6033 | 0.1938 |
| ZombitX64/sentiment-103 | 0.5740 | 0.4987 | 0.5020 | 0.9647 | 2.2446 | 0.4260 |
| ZombitX64/Sentiment-03 | 0.4828 | 0.4906 | 0.4856 | 0.9609 | 2.7485 | 0.5172 |
| ZombitX64/Sentiment-02 | 0.4137 | 0.3884 | 0.3910 | 0.8151 | 10.0779 | 0.5863 |
| ZombitX64/Thai-sentiment-e5 | 0.4961 | 0.3713 | 0.3704 | 0.9874 | 0.8246 | 0.5039 |
| nlptown/bert-base-multilingual-uncased-sentiment | 0.3587 | 0.2870 | 0.2896 | 0.4103 | 87.9066 | 0.6413 |
| ZombitX64/Sentiment-01 | 0.2712 | 0.1928 | 0.1894 | 0.5085 | 94.5946 | 0.7288 |
| SandboxBhh/sentiment-thai-text-model | 0.2620 | 0.1807 | 0.1982 | 0.8610 | 20.2016 | 0.7380 |
| Thaweewat/wangchanberta-hyperopt-sentiment-01 | 0.2336 | 0.1501 | 0.1655 | 0.9128 | 2.9776 | 0.7664 |
| phoner45/wangchan-sentiment-thai-text-model | 0.2203 | 0.1073 | 0.1270 | 0.7123 | 41.7316 | 0.7797 |
| poom-sci/WangchanBERTa-finetuned-sentiment | 0.2093 | 0.1061 | 0.1246 | 0.7889 | 14.7045 | 0.7907 |
| cardiffnlp/twitter-xlm-roberta-base-sentiment | 0.0944 | 0.0848 | 0.0841 | 0.6897 | 32.2492 | 0.9056 |
### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on a validation set with the following characteristics:

* **Total Samples:** 2,183
* **Selection Method:** Stratified random sampling to maintain class distribution
* **Data Quality:** Manually verified and cleaned validation samples
* **Evaluation Period:** Final model checkpoint from epoch 5

#### Evaluation Metrics

The model was evaluated using multiple metrics:

* **Primary Metrics:**
  - **Accuracy:** Overall classification accuracy across all classes
  - **F1-Score:** Both macro and weighted averages
* **Secondary Metrics:**
  - **Precision:** Per-class and overall precision scores
  - **Recall:** Per-class and overall recall scores
  - **Support:** Number of samples per class in the validation set
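The macro and weighted F1 averages differ only in class weighting; a minimal sketch in plain Python, using the per-class F1 scores and supports from the per-class performance table above:

```python
# Per-class F1 and support from the benchmark evaluation (2,183 samples)
f1 = {"negative": 0.877, "neutral": 0.764, "positive": 0.883, "question": 0.860}
support = {"negative": 661, "neutral": 517, "positive": 471, "question": 534}

# Macro F1: unweighted mean over classes
macro = sum(f1.values()) / len(f1)

# Weighted F1: mean weighted by class support
n = sum(support.values())
weighted = sum(f1[c] * support[c] for c in f1) / n

print(f"macro={macro:.3f} weighted={weighted:.3f}")  # macro=0.846 weighted=0.847
```

Both values match the benchmark summary, which is why the two scores nearly coincide here despite the class imbalance.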
#### Known Limitations

**1. Question Class Performance Issues:**
- **Insufficient Training Data**: The question class has minimal representation in the training dataset
- **Low Confidence Predictions**: Question classification often yields confidence scores below 60%
- **Misclassification**: Questions are frequently classified as positive, negative, or neutral instead
- **Example Issue**: "ลำไยอร่อยดีสดมากและลูกใหญ่ด้วยแต่เน่าไปครึ่งนึ..." (Longans are delicious and fresh, big fruits too, but half are rotten...) → classified as neutral (97.7% confidence) instead of recognizing the mixed sentiment

**2. Mixed Sentiment Challenges:**
- **Complex Sentiment**: Texts with both positive and negative aspects may be misclassified
- **Moderate Confidence**: Mixed sentiment often results in lower confidence scores (50-60%)
- **Example**: Product reviews mentioning both good and bad aspects tend toward neutral classification

**3. Class Imbalance Effects:**
- The model may be biased toward positive classifications due to training data imbalance (84.7% positive samples)
- Neutral class performance is slightly lower due to limited training examples (11.6% of data)
- The negative class accounts for only 3.7% of the training data

**4. Low Confidence Predictions:**
- Predictions with confidence < 60% should be treated with caution
- Common in mixed sentiment, ambiguous language, or question-like texts
- Implementing confidence thresholding is recommended for production use
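The thresholding recommendation above can be sketched as a small post-processing helper. This is plain Python over a softmax distribution; the 60% cutoff follows this card's guidance, and the function and field names are illustrative, not part of any API:

```python
def apply_confidence_threshold(probs, labels, threshold=0.60):
    """Pick the top class from a softmax distribution and flag low-confidence results.

    probs: list of per-class probabilities (same order as labels).
    """
    confidence = max(probs)
    label = labels[probs.index(confidence)]
    return {
        "label": label,
        "confidence": confidence,
        "needs_review": confidence < threshold,  # route to a human below the cutoff
    }

labels = ["Question", "Negative", "Neutral", "Positive"]
print(apply_confidence_threshold([0.02, 0.05, 0.03, 0.90], labels))
# {'label': 'Positive', 'confidence': 0.9, 'needs_review': False}
print(apply_confidence_threshold([0.30, 0.15, 0.45, 0.10], labels))
# {'label': 'Neutral', 'confidence': 0.45, 'needs_review': True}
```

In a pipeline, flagged items would be queued for human review rather than acted on automatically.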
## Environmental Impact

### Carbon Footprint Considerations

* **Training Emissions:** Specific carbon emission data not available
* **Efficiency Benefits:** The model was fine-tuned from a pre-trained multilingual model, significantly reducing computational cost compared to training from scratch
* **Resource Usage:** Relatively efficient training, requiring only ~27 minutes of GPU time
* **Deployment Efficiency:** The model can be deployed for inference on standard hardware

### Sustainable AI Practices

* **Transfer Learning:** Leveraged an existing multilingual model to reduce training requirements
* **Efficient Architecture:** Uses a proven transformer architecture optimized for efficiency
* **Reusability:** A single model can handle multiple languages, reducing the need for separate models
## Technical Specifications

### Model Architecture and Objective

* **Architecture:** XLMRobertaForSequenceClassification
* **Base Model:** intfloat/multilingual-e5-large
* **Model Parameters:** ~560 million
* **Classification Head:** Linear layer with 4 output classes
* **Task:** Multi-class text classification (Question, Negative, Neutral, Positive)
* **Objective Function:** Cross-entropy loss minimization
* **Activation Function:** Softmax for final predictions
* **Input Processing:** Tokenization with the XLM-RoBERTa tokenizer
* **Maximum Input Length:** 512 tokens

### Performance Characteristics

* **Inference Speed:** Fast inference suitable for real-time applications
* **Memory Requirements:** Standard transformer model memory usage
* **Scalability:** Handles batch processing efficiently
* **Hardware Requirements:** Compatible with CPU and GPU inference
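The softmax activation named above maps the classification head's 4 logits to the class probabilities that the confidence scores are read from; a minimal illustration with made-up logits:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for [Question, Negative, Neutral, Positive]
logits = [-1.2, 0.3, 0.8, 3.5]
probs = softmax(logits)

assert abs(sum(probs) - 1.0) < 1e-9  # probabilities sum to 1
print(max(probs))  # the winning class's probability = the reported confidence
```

The argmax over these probabilities gives the predicted label, exactly as in the usage examples earlier on this card.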
### Integration Specifications

* **Framework Compatibility:**
  - Hugging Face Transformers
  - PyTorch
  - ONNX (convertible)
  - TensorFlow (via conversion)
* **API Support:** Compatible with the Hugging Face Inference API
* **Deployment Options:**
  - Cloud deployment (AWS, GCP, Azure)
  - Edge deployment (with optimization)
  - Local deployment

## Hardware Requirements

**Training:**
* **GPU:** Modern NVIDIA GPU with sufficient VRAM (16GB+ recommended)
* **Memory:** 32GB+ RAM recommended for training
* **Storage:** SSD storage for fast data loading

**Inference:**
* **Minimum Requirements:**
  - CPU: Modern multi-core processor
  - RAM: 8GB+ for batch processing
  - Storage: 2GB for model files
* **Recommended for Production:**
  - GPU: NVIDIA T4 or better
  - RAM: 16GB+
  - Multiple instances for load balancing

**Software Requirements:**
* **Python:** 3.8+
* **PyTorch:** 1.9+
* **Transformers:** 4.15+
* **NumPy:** 1.21+
* **Tokenizers:** 0.11+

**Optional Dependencies:**
* **ONNX:** For model conversion and optimization
* **TensorRT:** For NVIDIA GPU optimization
* **Gradio/Streamlit:** For web interface development

## Citation

**BibTeX:**
```bibtex
@misc{MultiSent-E5-Pro,
  title={MultiSent-E5-Pro: Advanced Thai Sentiment Analysis},
  author={ZombitX64},
  year={2024},
  url={https://huggingface.co/ZombitX64/MultiSent-E5-Pro},
  note={Hugging Face Model Card}
}
```

If you use this model in your research or applications, please cite both this model and the base model:

```bibtex
@article{wang2024multilingual,
  title={Multilingual E5 Text Embeddings: A Technical Report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}
```
## Model Card Authors

**Primary Contributors:**
- **ZombitX64** - Lead developer and model architect
- **Krittanut Janutsaha** - Data curation and evaluation
- **Chanyut Saengwichain** - Model optimization and documentation

## Model Card Contact

### Support and Issues

For questions, issues, or contributions regarding this model, please use the following channels:

* **Repository:** [https://huggingface.co/ZombitX64/MultiSent-E5-Pro](https://huggingface.co/ZombitX64/MultiSent-E5-Pro)
* **Community:** Hugging Face community forums for general questions

We welcome contributions aimed at:
- Improving the model's performance
- Expanding to other Southeast Asian languages
- Creating domain-specific variants
- Integration into larger NLP systems

We also welcome reports of:
- Use cases where the model performs unexpectedly
- Ideas for model enhancements

---
---
license: cc-by-nc-nd-4.0
language:
- th
- xh
- yi
- zh
base_model: intfloat/multilingual-e5-large
library_name: transformers
pipeline_tag: text-classification
tags:
- sentiment-analysis
- thai
- multilingual
- fine-tuned
- transformers
- southeast-asian
datasets:
- ZombitX64/SEACrowdWongnaiReviews
- ZombitX64/Sentiment-Benchmark
metrics:
- accuracy
- f1
- precision
- recall
widget:
- text: "ผลิตภัณฑ์นี้ดีมาก ใช้งานง่าย"
  example_title: "Thai Positive"
- text: "บริการแย่มาก ไม่ประทับใจเลย"
  example_title: "Thai Negative"
- text: "อาหารรสชาติธรรมดา"
  example_title: "Thai Neutral"
- text: "ราคาเท่าไหร่ครับ?"
  example_title: "Thai Question"
---

# 🎯 MultiSent-E5-Pro: Advanced Thai Sentiment Classifier

<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/673eef9c4edfc6d3b58ba3aa/Gl94xasTswsG1cOjR_076.png" width="300" alt="MultiSent-E5-Pro Logo">

<strong>🇹🇭 State-of-the-art Thai sentiment analysis with multilingual capabilities</strong>

<a href="https://creativecommons.org/licenses/by-nc-nd/4.0/"><img src="https://img.shields.io/badge/License-CC_BY--NC--ND_4.0-lightgrey.svg"></a> <a href="https://huggingface.co/ZombitX64/MultiSent-E5-Pro"><img src="https://img.shields.io/badge/🤗%20HF-Model-yellow"></a> <a href="https://huggingface.co/ZombitX64/MultiSent-E5-Pro"><img src="https://img.shields.io/badge/Downloads-1K+-green"></a>
</div>

## 📋 Quick Overview

**MultiSent-E5-Pro** is a fine-tuned sentiment analysis model based on `intfloat/multilingual-e5-large`, optimized for Thai with support for multilingual contexts. The model classifies text into four categories: **Positive**, **Negative**, **Neutral**, and **Question**.

### 🎯 Key Features

* Handles **Thai-specific expressions**, **colloquialisms**, and **sarcasm** effectively
* Performs well on **real-world social media & review data**
* **Multilingual support** for Southeast and East Asian languages

---

## 🏆 Benchmark Summary

| Rank | Model | Accuracy | F1-Macro | Notes |
|------|-------|----------|----------|-------|
| 🥇 1st | MultiSent-E5-Pro | **84.61%** | **84.61%** | Best overall |
| 2nd | MultiSent-E5 | 80.62% | 80.62% | Baseline model |
| 3rd | sentiment-103 | 57.40% | 49.87% | Moderate baseline |

---

## 📊 Detailed Metrics (2,183 samples)

| Metric | Score |
|--------|-------|
| Accuracy | 84.61% |
| F1-Macro | 84.61% |
| F1-Weighted | 84.75% |
| Avg Confidence | 98.53% |
| Low Confidence Rate (<60%) | 0.96% |

### Per-Class Performance

| Class | Precision | Recall | F1 | Notes |
|-------|-----------|--------|----|-------|
| Negative | 91.0% | 84.6% | 87.7% | Excellent |
| Positive | 83.0% | 94.3% | 88.3% | Excellent |
| Neutral | 71.9% | 81.6% | 76.4% | Moderate |
| Question | 94.4% | 79.0% | 86.0% | Good |

---

## 🌍 Language Support

| Region | Languages | Performance |
|--------|-----------|-------------|
| Thai | Thai | 🟢 Excellent |
| SEA | ID, VI, MS | 🟡 Good |
| East Asia | ZH, JA, KO | 🟠 Moderate |
| Europe | EN, ES, FR | 🔴 Low |

---

## ⚡ Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ZombitX64/MultiSent-E5-Pro"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "ผลิตภัณฑ์นี้ดีมาก ใช้งานง่าย"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted = torch.argmax(probs, dim=-1)

labels = ["Question", "Negative", "Neutral", "Positive"]
print(f"Sentiment: {labels[predicted.item()]} (Confidence: {probs[0][predicted.item()].item():.2%})")
```

---

## 🌟 Use Cases

| Application | Suitability |
|-------------|-------------|
| Product Reviews | 🟢 Excellent |
| Social Media | 🟢 Excellent |
| Customer Support | 🟢 Excellent |
| Content Moderation | 🟡 Good |
| Research Analysis | 🟡 Good |

---

## ⚠ Known Limitations

* **Sarcasm misclassification**, especially in Chinese text
* **Mixed sentiments** lean toward Neutral
* **Low recall** for the **Question** class due to limited data
* **Bias toward Positive** due to class imbalance
* **Overconfidence** in some multilingual predictions

---

## 🛠 Technical Info

| Config | Value |
|--------|-------|
| Base Model | multilingual-e5-large |
| Params | ~560M |
| Classes | 4 |
| Max Length | 512 |
| Training Time | ~27 min |

**Data Summary:**

* Training: 2,456 samples
* Validation: 273 samples
* Evaluation: 2,183 samples

---


## 📄 Citation

```bibtex
@misc{MultiSent-E5-Pro-2024,
  title={MultiSent-E5-Pro: Advanced Thai Sentiment Analysis},
  author={ZombitX64 and Janutsaha, Krittanut and Saengwichain, Chanyut},
  year={2024},
  url={https://huggingface.co/ZombitX64/MultiSent-E5-Pro},
  note={Hugging Face Model Card}
}
```

---

## 👨‍💼 Authors

| Role | Name |
|------|------|
| Lead Dev | ZombitX64 |
| Data Scientist | Krittanut Janutsaha |
| Engineer | Chanyut Saengwichain |

---


## 😊 Feedback & Contributions

* 💬 [Open Discussion](https://huggingface.co/ZombitX64/MultiSent-E5-Pro/discussions)
* 🐛 [Report Issue](https://huggingface.co/ZombitX64/MultiSent-E5-Pro/issues)
* 🌟 Star the repo if useful!

---

<div align="center">
Last Updated: Dec 2024 | Version: 1.1 | Docs: v2.0
</div>