---
language: vi
tags:
- text-classification
- emotion-recognition
- vietnamese
license: mit
datasets:
- custom
metrics:
- accuracy
---
# 🇻🇳 Vietnamese Emotion Recognition (PhoBERT-based)
This repository provides a full pipeline for **Vietnamese emotion recognition**, including:
* 📊 Processed datasets
* 🧠 Training scripts for multiple models
* 💾 Trained model checkpoints
* 📈 Evaluation results
## 📌 Overview
This project focuses on emotion classification in Vietnamese using both traditional and deep learning models, with a strong emphasis on **PhoBERT-base-v2**.
Key contributions:
* Build a **high-quality Vietnamese emotion dataset**
* Handle **class imbalance via oversampling**
* Compare multiple models (SVM → RNN → BiLSTM → CNN-LSTM → PhoBERT)
* Achieve **94.22% accuracy** with PhoBERT-base-v2
## 📂 Repository Structure
```
.
├── bilstm_emotion_model/ # Saved BiLSTM model
├── cnn_lstm_emotion_model/ # Saved CNN-LSTM model
├── phobert_emotion_model/ # Saved PhoBERT model
├── rnn_emotion_model/ # Saved RNN model
├── svm_emotion_model/ # Saved SVM model
├── flagged/ # Flagged or filtered samples
├── bilstm_best.keras # Best BiLSTM checkpoint
├── cnn_lstm_best.keras # Best CNN-LSTM checkpoint
├── main_BILSTM.py # Train BiLSTM
├── main_RNN_CNN-LSTM.py # Train RNN & CNN-LSTM
├── main_lstm.py # LSTM training script
├── main_phobert.py # Train PhoBERT
├── main_svm.py # Train SVM
├── main_v1.py # Legacy / combined script
├── run.py # Main runner script
├── processed.xlsx # Main processed dataset
├── processed_phobert.xlsx # Dataset for PhoBERT
├── processed_svm.xlsx # Dataset for SVM
├── train.xlsx # Training data
├── abbreviations.json # Text normalization rules
├── word2vec_vi_syllables_100dims.txt # Word embeddings
├── requirements.txt
└── README.md
```
## 📊 Dataset
### 🔹 Sources
* Social media
* Product reviews
* Conversations
### 🔹 Format
```csv
sentence,emotion
"Tôi rất vui hôm nay",enjoyment
```
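As a minimal sketch of how data in this layout can be read, the snippet below parses `sentence,emotion` rows with Python's standard `csv` module and counts the labels. The in-memory sample and the helper name `read_samples` are illustrative assumptions; the repository itself ships `.xlsx` files, which would need a spreadsheet reader instead.

```python
import csv
import io
from collections import Counter

# Hypothetical in-memory sample in the documented sentence,emotion layout.
SAMPLE_CSV = '''sentence,emotion
"Tôi rất vui hôm nay",enjoyment
"Tôi sợ quá",fear
"Hôm nay thật tuyệt",enjoyment
'''


def read_samples(handle):
    """Return (sentence, emotion) pairs from a CSV file-like object."""
    reader = csv.DictReader(handle)
    return [(row["sentence"].strip(), row["emotion"].strip()) for row in reader]


samples = read_samples(io.StringIO(SAMPLE_CSV))

# Counting labels is a quick way to spot class imbalance before training.
label_counts = Counter(emotion for _, emotion in samples)
print(label_counts)
```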
### 🔹 Labels
* enjoyment
* anger
* sadness
* disgust
* fear
* surprise
* other
### 🔹 Preprocessing
* Text cleaning and normalization
* Abbreviation expansion (`abbreviations.json`)
* Tokenization (Vietnamese-specific)
* Oversampling for class balance
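The abbreviation-expansion and oversampling steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the `ABBREVIATIONS` entries, the function names, and the choice of random duplication as the oversampling strategy are all hypothetical, and the real schema of `abbreviations.json` may differ.

```python
import random
import re
from collections import defaultdict

# Hypothetical abbreviation map; the real rules live in abbreviations.json,
# whose exact schema is not shown in this README.
ABBREVIATIONS = {"ko": "không", "dc": "được", "bt": "bình thường"}


def normalize(text, abbreviations):
    """Lowercase, collapse whitespace, and expand whole-token abbreviations."""
    tokens = re.sub(r"\s+", " ", text.strip().lower()).split(" ")
    return " ".join(abbreviations.get(tok, tok) for tok in tokens)


def oversample(samples, seed=0):
    """Duplicate random minority-class samples until every class matches the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sentence, label in samples:
        by_label[label].append((sentence, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Made-up examples: one "sadness" sample against two "enjoyment" samples.
data = [("Tôi  ko vui", "sadness"), ("vui quá", "enjoyment"), ("thích lắm", "enjoyment")]
cleaned = [(normalize(s, ABBREVIATIONS), y) for s, y in data]
balanced = oversample(cleaned)
```

If oversampling is done this way, it is usually applied to the training split only, so that duplicated sentences do not leak into the evaluation data.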
## 🧠 Models
| Model | Script | Output folder |
| -------- | ---------------------- | ------------------------- |
| SVM | `main_svm.py` | `svm_emotion_model/` |
| RNN | `main_RNN_CNN-LSTM.py` | `rnn_emotion_model/` |
| BiLSTM | `main_BILSTM.py` | `bilstm_emotion_model/` |
| CNN-LSTM | `main_RNN_CNN-LSTM.py` | `cnn_lstm_emotion_model/` |
| PhoBERT | `main_phobert.py` | `phobert_emotion_model/` |
👉 PhoBERT performs best, most likely because its pretrained contextual representations capture Vietnamese far better than models trained from scratch on this dataset.
## 🚀 Training
### 🔧 Install dependencies
```bash
pip install -r requirements.txt
```
### ▶️ Run models
#### PhoBERT
```bash
python main_phobert.py
```
#### SVM
```bash
python main_svm.py
```
#### BiLSTM
```bash
python main_BILSTM.py
```
#### RNN / CNN-LSTM
```bash
python main_RNN_CNN-LSTM.py
```
#### Run all (if configured)
```bash
python run.py
```
## 📈 Results
| Model | Accuracy |
| -------- | ---------- |
| PhoBERT | **94.22%** |
| SVM | 78.69% |
| CNN-LSTM | 62.47% |
| BiLSTM | 59.56% |
| RNN | 30.02% |
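For reference, accuracy here is the fraction of samples whose predicted label matches the true label. The helper below is a minimal sketch of that calculation; the function name and the toy labels are illustrative and are not taken from the evaluation scripts.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy example: three of the four predictions are correct.
y_true = ["enjoyment", "anger", "sadness", "other"]
y_pred = ["enjoyment", "anger", "fear", "other"]
acc = accuracy(y_true, y_pred)
print(f"{acc:.2%}")  # prints "75.00%"
```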
## 💾 Checkpoints
Trained model checkpoints are stored in the per-model folders:
```
*_emotion_model/
```
Example: loading the best BiLSTM checkpoint with Keras:
```python
from tensorflow.keras.models import load_model
model = load_model("bilstm_best.keras")
```
## 🧪 Example
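The snippet below assumes `phobert_emotion_model/` was saved in Hugging Face format (model and tokenizer). Adjust it if the checkpoint is stored differently. The Keras checkpoints cannot take a raw string; they need the same tokenization and padding used at training time.
```python
from transformers import pipeline
# Assumption: the folder holds a Hugging Face checkpoint plus its tokenizer.
classifier = pipeline("text-classification", model="phobert_emotion_model")
text = "Hôm nay tôi rất vui"
prediction = classifier(text)  # list of {"label": ..., "score": ...} dicts
print(prediction)
```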
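A Keras classifier such as the BiLSTM checkpoint returns one probability per class rather than a label name, so the output still has to be mapped back to an emotion. The sketch below shows that step in plain Python. The label order in `LABELS` is an assumption (the index-to-label mapping used during training is not documented here), and the probability vector is made up for illustration.

```python
# Assumed label order; replace it with the mapping used at training time.
LABELS = ["enjoyment", "anger", "sadness", "disgust", "fear", "surprise", "other"]


def decode(probabilities, labels=LABELS):
    """Return the (label, probability) pair with the highest probability."""
    if len(probabilities) != len(labels):
        raise ValueError("expected one probability per label")
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]


# Made-up model output for a single sentence (one value per label).
probs = [0.81, 0.02, 0.03, 0.01, 0.02, 0.06, 0.05]
label, score = decode(probs)
print(label, score)
```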
## 📚 Publication
If you use this work, please cite:
**Advancing Emotion Recognition in Vietnamese: A PhoBERT-Based Approach for Enhanced Interaction**
📄 DOI: [https://doi.org/10.34238/tnu-jst.12889](https://doi.org/10.34238/tnu-jst.12889)
## 🔮 Future Work
* Multimodal emotion recognition (text + speech)
* Larger and more diverse datasets
* Optimization for real-time inference
* Deployment in production systems
## 📄 License
MIT License