metadata
language: vi
tags:
- text-classification
- emotion-recognition
- vietnamese
license: mit
datasets:
- custom
metrics:
- accuracy
Vietnamese Emotion Recognition (PhoBERT-based)
This repository provides a full pipeline for Vietnamese emotion recognition, including:
- Processed datasets
- Training scripts for multiple models
- Trained model checkpoints
- Evaluation results
Overview
This project focuses on emotion classification in Vietnamese using both traditional and deep learning models, with a strong emphasis on PhoBERT-base-v2. Key contributions:
- Build a high-quality Vietnamese emotion dataset
- Handle class imbalance via oversampling
- Compare multiple models (SVM → RNN → BiLSTM → CNN-LSTM → PhoBERT)
- Achieve 94.22% accuracy with PhoBERT-base-v2
Repository Structure
.
├── bilstm_emotion_model/ # Saved BiLSTM model
├── cnn_lstm_emotion_model/ # Saved CNN-LSTM model
├── phobert_emotion_model/ # Saved PhoBERT model
├── rnn_emotion_model/ # Saved RNN model
├── svm_emotion_model/ # Saved SVM model
├── flagged/ # Flagged or filtered samples
├── bilstm_best.keras # Best BiLSTM checkpoint
├── cnn_lstm_best.keras # Best CNN-LSTM checkpoint
├── main_BILSTM.py # Train BiLSTM
├── main_RNN_CNN-LSTM.py # Train RNN & CNN-LSTM
├── main_lstm.py # LSTM training script
├── main_phobert.py # Train PhoBERT
├── main_svm.py # Train SVM
├── main_v1.py # Legacy / combined script
├── run.py # Main runner script
├── processed.xlsx # Main processed dataset
├── processed_phobert.xlsx # Dataset for PhoBERT
├── processed_svm.xlsx # Dataset for SVM
├── train.xlsx # Training data
├── abbreviations.json # Text normalization rules
├── word2vec_vi_syllables_100dims.txt # Word embeddings
├── requirements.txt
└── README.md
Dataset
Sources
- Social media
- Product reviews
- Conversations
Format
sentence,emotion
"Tôi rất vui hôm nay",enjoyment
Labels
- enjoyment
- anger
- sadness
- disgust
- fear
- surprise
- other
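A minimal sketch of reading records in the sentence,emotion layout and checking each label against the seven classes listed above. It uses only Python's standard csv module. The in-memory sample and the names LABELS and load_rows are illustrative assumptions, and the repository itself ships its data as .xlsx files.

```python
import csv
import io

# The seven emotion classes listed in this README.
LABELS = {"enjoyment", "anger", "sadness", "disgust", "fear", "surprise", "other"}


def load_rows(text):
    """Parse sentence,emotion CSV text and reject rows with unknown labels."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["emotion"] not in LABELS:
            raise ValueError(f"unknown emotion label: {row['emotion']!r}")
        rows.append((row["sentence"], row["emotion"]))
    return rows


# Illustrative sample; the real data lives in the .xlsx files.
sample = 'sentence,emotion\n"Tôi rất vui hôm nay",enjoyment\n'
print(load_rows(sample))
```

Rejecting unknown labels at load time catches typos in the emotion column before they turn into an extra, unintended class during training.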
Preprocessing
- Text cleaning and normalization
- Abbreviation expansion (abbreviations.json)
- Tokenization (Vietnamese-specific)
- Oversampling for class balance
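The abbreviation expansion and oversampling steps above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the two abbreviation entries, the helper names expand_abbreviations and oversample, the flat mapping assumed for abbreviations.json, and the choice of random duplication are all assumptions.

```python
import random
from collections import Counter, defaultdict

# Hypothetical rules; the real ones are in abbreviations.json (schema assumed
# here to be a flat {abbreviation: expansion} mapping).
ABBREVIATIONS = {"ko": "không", "dc": "được"}


def expand_abbreviations(sentence, rules):
    """Replace each whitespace-separated token that matches a rule."""
    return " ".join(rules.get(tok.lower(), tok) for tok in sentence.split())


def oversample(samples, seed=0):
    """Randomly duplicate minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sentence, label in samples:
        by_label[label].append((sentence, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # k is 0 for the largest class, so nothing is added there.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Toy data for illustration only.
data = [
    ("Tôi ko thích", "anger"),
    ("Tôi rất vui", "enjoyment"),
    ("Hôm nay vui quá", "enjoyment"),
    ("Dc quà nên vui", "enjoyment"),
]
cleaned = [(expand_abbreviations(s, ABBREVIATIONS), y) for s, y in data]
balanced = oversample(cleaned)
print(Counter(label for _, label in balanced))
```

Oversampling is normally applied to the training split only, so that duplicated sentences do not leak into the evaluation data.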
Models
| Model | Script | Output folder |
|---|---|---|
| SVM | main_svm.py | svm_emotion_model/ |
| RNN | main_RNN_CNN-LSTM.py | rnn_emotion_model/ |
| BiLSTM | main_BILSTM.py | bilstm_emotion_model/ |
| CNN-LSTM | main_RNN_CNN-LSTM.py | cnn_lstm_emotion_model/ |
| PhoBERT | main_phobert.py | phobert_emotion_model/ |
PhoBERT performs best, owing to its strong contextual understanding of the Vietnamese language.
Training
Install dependencies
pip install -r requirements.txt
Run models
PhoBERT
python main_phobert.py
SVM
python main_svm.py
BiLSTM
python main_BILSTM.py
RNN / CNN-LSTM
python main_RNN_CNN-LSTM.py
Run all (if configured)
python run.py
Results
| Model | Accuracy |
|---|---|
| PhoBERT | 94.22% |
| SVM | 78.69% |
| CNN-LSTM | 62.47% |
| BiLSTM | 59.56% |
| RNN | 30.02% |
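For reference, accuracy is the share of evaluated sentences whose predicted label matches the gold label. A minimal sketch with made-up labels follows; the function name and the toy data are assumptions, not the repository's evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels for illustration only.
gold = ["enjoyment", "anger", "sadness", "fear"]
pred = ["enjoyment", "anger", "other", "fear"]
print(f"{accuracy(gold, pred):.2%}")  # 3 of 4 correct -> 75.00%
```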
Checkpoints
Trained models are stored in:
*_emotion_model/
Example load:
from tensorflow.keras.models import load_model
model = load_model("bilstm_best.keras")
Example
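text = "Hôm nay tôi rất vui"  # "Today I am very happy"
# predict() needs a preprocessed batch, not a raw string. `encode` is a placeholder
# for the training-time cleaning, tokenization, and padding; it is not defined here.
batch = encode([text])
prediction = model.predict(batch)
print(prediction)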
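A Keras classifier's predict call typically returns one row of class scores per input rather than a label name. A minimal sketch of mapping such a row back to an emotion follows. It assumes the output order matches the label list in this README (the actual order depends on how labels were encoded during training) and uses made-up scores; the decode helper is hypothetical.

```python
# Assumed label order; verify against the encoder used in the training script.
LABELS = ["enjoyment", "anger", "sadness", "disgust", "fear", "surprise", "other"]


def decode(scores, labels=LABELS):
    """Return the label with the highest score, together with that score."""
    if len(scores) != len(labels):
        raise ValueError("expected one score per label")
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores[best]


# Made-up scores standing in for one row of model.predict(...) output.
row = [0.81, 0.02, 0.05, 0.01, 0.03, 0.06, 0.02]
label, score = decode(row)
print(label, score)
```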
Publication
If you use this work, please cite:
Advancing Emotion Recognition in Vietnamese: A PhoBERT-Based Approach for Enhanced Interaction
DOI: https://doi.org/10.34238/tnu-jst.12889
Future Work
- Multimodal emotion recognition (text + speech)
- Larger and more diverse datasets
- Optimization for real-time inference
- Deployment in production systems
License
MIT License