ducdatit2002

language: vi
tags:
  - text-classification
  - emotion-recognition
  - vietnamese
license: mit
datasets:
  - custom
metrics:
  - accuracy

🇻🇳 Vietnamese Emotion Recognition (PhoBERT-based)

This repository provides a full pipeline for Vietnamese emotion recognition, including:

📊 Processed datasets
🧠 Training scripts for multiple models
💾 Trained model checkpoints
📈 Evaluation results

📌 Overview

This project focuses on emotion classification in Vietnamese using both traditional and deep learning models, with a strong emphasis on PhoBERT-base-v2. Key contributions:

Build a high-quality Vietnamese emotion dataset
Handle class imbalance via oversampling
Compare multiple models (SVM → RNN → BiLSTM → CNN-LSTM → PhoBERT)
Achieve 94.22% accuracy with PhoBERT-base-v2

📂 Repository Structure

.
├── bilstm_emotion_model/        # Saved BiLSTM model
├── cnn_lstm_emotion_model/      # Saved CNN-LSTM model
├── phobert_emotion_model/       # Saved PhoBERT model
├── rnn_emotion_model/           # Saved RNN model
├── svm_emotion_model/           # Saved SVM model
├── flagged/                     # Flagged or filtered samples

├── bilstm_best.keras            # Best BiLSTM checkpoint
├── cnn_lstm_best.keras          # Best CNN-LSTM checkpoint

├── main_BILSTM.py               # Train BiLSTM
├── main_RNN_CNN-LSTM.py         # Train RNN & CNN-LSTM
├── main_lstm.py                 # LSTM training script
├── main_phobert.py              # Train PhoBERT
├── main_svm.py                  # Train SVM
├── main_v1.py                   # Legacy / combined script
├── run.py                       # Main runner script

├── processed.xlsx               # Main processed dataset
├── processed_phobert.xlsx       # Dataset for PhoBERT
├── processed_svm.xlsx           # Dataset for SVM
├── train.xlsx                   # Training data

├── abbreviations.json           # Text normalization rules
├── word2vec_vi_syllables_100dims.txt   # Word embeddings

├── requirements.txt
└── README.md

📊 Dataset

🔹 Sources

Social media
Product reviews
Conversations

🔹 Format

sentence,emotion
"Tôi rất vui hôm nay",enjoyment

🔹 Labels

enjoyment
anger
sadness
disgust
fear
surprise
other
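As a minimal sketch of how the `sentence,emotion` format and the seven labels above could be read and encoded, the snippet below parses a CSV stream and maps each label to an integer id. The label order, the `read_samples` helper, and the in-memory sample are assumptions for illustration; the training scripts may load the `.xlsx` files differently and use another label order.

```python
import csv
import io

# Assumed label order, for illustration only; the training scripts may differ.
LABELS = ["enjoyment", "anger", "sadness", "disgust", "fear", "surprise", "other"]
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}


def read_samples(handle):
    """Yield (sentence, label_id) pairs from a `sentence,emotion` CSV stream."""
    reader = csv.DictReader(handle)
    for row in reader:
        label = row["emotion"].strip().lower()
        if label not in LABEL_TO_ID:
            # Fail loudly on labels outside the seven documented classes.
            raise ValueError(f"unknown emotion label: {label!r}")
        yield row["sentence"].strip(), LABEL_TO_ID[label]


# In-memory stand-in for a data file, using the example row from this README.
sample = io.StringIO('sentence,emotion\n"Tôi rất vui hôm nay",enjoyment\n')
pairs = list(read_samples(sample))
print(pairs)  # [('Tôi rất vui hôm nay', 0)]
```

Rejecting unknown labels at load time keeps typos in the label column from silently becoming an extra class.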

🔹 Preprocessing

Text cleaning and normalization
Abbreviation expansion (abbreviations.json)
Tokenization (Vietnamese-specific)
Oversampling for class balance
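The abbreviation expansion and oversampling steps above can be sketched roughly as follows. This is not the repository's implementation: the `ABBREVIATIONS` entries are hypothetical stand-ins for the contents of `abbreviations.json`, whitespace splitting stands in for the Vietnamese-specific tokenizer, and random duplication is one common way to oversample that the project may or may not use.

```python
import random
import unicodedata
from collections import Counter

# Hypothetical entries; the real rules live in abbreviations.json.
ABBREVIATIONS = {"ko": "không", "dc": "được"}


def expand_abbreviations(text, table):
    """Normalize to NFC, lowercase, and replace whitespace-separated abbreviations."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return " ".join(table.get(token, token) for token in normalized.split())


def oversample(samples, seed=0):
    """Randomly duplicate minority-class samples until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


print(expand_abbreviations("Tôi ko đi dc", ABBREVIATIONS))  # tôi không đi được

data = [("a", "anger")] + [("b", "enjoyment")] * 3
print(Counter(label for _, label in oversample(data)))
```

Oversampling is usually applied to the training split only, so that duplicated sentences do not leak into the evaluation data.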

🧠 Models

| Model | Script | Output folder |
| --- | --- | --- |
| SVM | `main_svm.py` | `svm_emotion_model/` |
| RNN | `main_RNN_CNN-LSTM.py` | `rnn_emotion_model/` |
| BiLSTM | `main_BILSTM.py` | `bilstm_emotion_model/` |
| CNN-LSTM | `main_RNN_CNN-LSTM.py` | `cnn_lstm_emotion_model/` |
| PhoBERT | `main_phobert.py` | `phobert_emotion_model/` |

👉 PhoBERT achieves the highest accuracy of the five models, most likely because its pretrained contextual representations capture Vietnamese better than the other approaches.

🚀 Training

🔧 Install dependencies

pip install -r requirements.txt

▶️ Run models

PhoBERT

python main_phobert.py

SVM

python main_svm.py

BiLSTM

python main_BILSTM.py

RNN / CNN-LSTM

python main_RNN_CNN-LSTM.py

Run all (if configured)

python run.py

📈 Results

| Model | Accuracy |
| --- | --- |
| PhoBERT | 94.22% |
| SVM | 78.69% |
| CNN-LSTM | 62.47% |
| BiLSTM | 59.56% |
| RNN | 30.02% |
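For reference, accuracy as reported above is the share of test sentences whose predicted label matches the true label. The sketch below computes it together with per-class recall, which is often worth checking on imbalanced emotion data. The helper names and the toy labels are illustrative assumptions and are not taken from the project's evaluation code or results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each label that occurs in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels for illustration only.
y_true = ["enjoyment", "anger", "sadness", "anger"]
y_pred = ["enjoyment", "anger", "anger", "anger"]

print(accuracy(y_true, y_pred))          # 0.75
print(per_class_recall(y_true, y_pred))  # sadness has recall 0.0 despite 75% accuracy
```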

💾 Checkpoints

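Trained models are saved in the per-model output folders listed below. The best BiLSTM and CNN-LSTM checkpoints are also saved at the repository root as bilstm_best.keras and cnn_lstm_best.keras.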

*_emotion_model/

Example load:

from tensorflow.keras.models import load_model

model = load_model("bilstm_best.keras")

🧪 Example

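text = "Hôm nay tôi rất vui"
# A Keras model expects numeric inputs, not a raw string. `preprocess` is a
# placeholder for the cleaning, tokenization, and padding used during training.
inputs = preprocess([text])
prediction = model.predict(inputs)  # assumed: one row of class scores per input
print(prediction.argmax(axis=-1))   # index of the highest-scoring class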
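To turn a model output into a readable emotion name, something like the following sketch could be used. It assumes the model returns one score per class and that classes follow the label order listed in this README; both are assumptions, and the `decode` helper and the scores are made up for illustration.

```python
# Assumed class order; check the label encoding used by the training script.
LABELS = ["enjoyment", "anger", "sadness", "disgust", "fear", "surprise", "other"]


def decode(scores, labels=LABELS):
    """Return (label, score) for the highest-scoring class in one output row."""
    if len(scores) != len(labels):
        raise ValueError("expected one score per label")
    best = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best], float(scores[best])


# Made-up scores standing in for one row of model output.
row = [0.81, 0.02, 0.05, 0.01, 0.03, 0.06, 0.02]
print(decode(row))  # ('enjoyment', 0.81)
```

With a Keras model, each row of the array returned by `model.predict` could be passed to `decode` in the same way.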

📚 Publication

If you use this work, please cite:

Advancing Emotion Recognition in Vietnamese: A PhoBERT-Based Approach for Enhanced Interaction
📄 DOI: https://doi.org/10.34238/tnu-jst.12889

🔮 Future Work

Multimodal emotion recognition (text + speech)
Larger and more diverse datasets
Optimization for real-time inference
Deployment in production systems

📄 License

MIT License