metadata
language: vi
tags:
  - text-classification
  - emotion-recognition
  - vietnamese
license: mit
datasets:
  - custom
metrics:
  - accuracy

🇻🇳 Vietnamese Emotion Recognition (PhoBERT-based)

This repository provides a full pipeline for Vietnamese emotion recognition, including:

  • 📊 Processed datasets
  • 🧠 Training scripts for multiple models
  • 💾 Trained model checkpoints
  • 📈 Evaluation results

📌 Overview

This project classifies emotions in Vietnamese text using both a traditional model (SVM) and deep learning models, with the main emphasis on PhoBERT-base-v2. Key contributions:

  • Build a high-quality Vietnamese emotion dataset
  • Handle class imbalance via oversampling
  • Compare multiple models (SVM → RNN → BiLSTM → CNN-LSTM → PhoBERT)
  • Achieve 94.22% accuracy with PhoBERT-base-v2

📂 Repository Structure

.
├── bilstm_emotion_model/        # Saved BiLSTM model
├── cnn_lstm_emotion_model/      # Saved CNN-LSTM model
├── phobert_emotion_model/       # Saved PhoBERT model
├── rnn_emotion_model/           # Saved RNN model
├── svm_emotion_model/           # Saved SVM model
├── flagged/                     # Flagged or filtered samples

├── bilstm_best.keras            # Best BiLSTM checkpoint
├── cnn_lstm_best.keras          # Best CNN-LSTM checkpoint

├── main_BILSTM.py               # Train BiLSTM
├── main_RNN_CNN-LSTM.py         # Train RNN & CNN-LSTM
├── main_lstm.py                 # LSTM training script
├── main_phobert.py              # Train PhoBERT
├── main_svm.py                  # Train SVM
├── main_v1.py                   # Legacy / combined script
├── run.py                       # Main runner script

├── processed.xlsx               # Main processed dataset
├── processed_phobert.xlsx       # Dataset for PhoBERT
├── processed_svm.xlsx           # Dataset for SVM
├── train.xlsx                   # Training data

├── abbreviations.json           # Text normalization rules
├── word2vec_vi_syllables_100dims.txt   # Word embeddings

├── requirements.txt
└── README.md

📊 Dataset

🔹 Sources

  • Social media
  • Product reviews
  • Conversations

🔹 Format

sentence,emotion
"Tôi rất vui hôm nay",enjoyment

🔹 Labels

  • enjoyment
  • anger
  • sadness
  • disgust
  • fear
  • surprise
  • other
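As a minimal sketch of how the format and label set above fit together, the snippet below parses `sentence,emotion` rows and rejects any label outside the seven listed. It reads CSV text from an in-memory stream purely for illustration; the repository ships `.xlsx` files, so the actual loading code may differ, and the function name `read_samples` is hypothetical.

```python
import csv
import io
from collections import Counter

# The seven labels listed in this README.
LABELS = {"enjoyment", "anger", "sadness", "disgust", "fear", "surprise", "other"}


def read_samples(handle):
    """Yield (sentence, emotion) pairs from a CSV stream with a `sentence,emotion` header."""
    reader = csv.DictReader(handle)
    for row in reader:
        sentence = (row["sentence"] or "").strip()
        emotion = (row["emotion"] or "").strip().lower()
        if not sentence:
            continue  # skip rows without text
        if emotion not in LABELS:
            raise ValueError(f"line {reader.line_num}: unknown label {emotion!r}")
        yield sentence, emotion


# Illustrative input only: the single example row shown in the Format section.
demo = io.StringIO('sentence,emotion\n"Tôi rất vui hôm nay",enjoyment\n')
samples = list(read_samples(demo))
label_counts = Counter(emotion for _, emotion in samples)
print(samples)
print(label_counts)
```

Counting labels right after loading is a cheap way to see how imbalanced the classes are before any resampling.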

🔹 Preprocessing

  • Text cleaning and normalization
  • Abbreviation expansion (abbreviations.json)
  • Tokenization (Vietnamese-specific)
  • Oversampling for class balance
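Two of the steps above can be sketched in a few lines: abbreviation expansion and random oversampling. This is an illustration under stated assumptions, not the code used in the training scripts. It assumes `abbreviations.json` is a flat mapping from a lowercase abbreviation to its expansion (the entries below are hypothetical), and it assumes oversampling means duplicating minority-class samples at random until every class matches the largest one.

```python
import random
import re
from collections import defaultdict


def expand_abbreviations(text, abbreviations):
    """Replace whole tokens found in `abbreviations` (lowercase keys) with their expansions."""
    def replace(match):
        token = match.group(0)
        return abbreviations.get(token.lower(), token)

    # \w is Unicode-aware for str patterns, so precomposed (NFC) Vietnamese
    # letters are matched as part of the token.
    return re.sub(r"\w+", replace, text)


def oversample(samples, seed=0):
    """Duplicate random minority-class samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sentence, label in samples:
        by_label[label].append((sentence, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical entries; the real rules live in abbreviations.json.
abbreviations = {"ko": "không", "dc": "được"}
print(expand_abbreviations("Tôi ko vui", abbreviations))  # Tôi không vui

# Toy data: three "enjoyment" samples and one "fear" sample.
train = [("a", "enjoyment"), ("b", "enjoyment"), ("c", "enjoyment"), ("d", "fear")]
balanced = oversample(train)
print(len(balanced))  # 6
```

If oversampling is applied, it is usually safest to apply it to the training split only, after the train/test split, so that duplicated sentences cannot appear on both sides.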

🧠 Models

| Model    | Script               | Output folder           |
|----------|----------------------|-------------------------|
| SVM      | main_svm.py          | svm_emotion_model/      |
| RNN      | main_RNN_CNN-LSTM.py | rnn_emotion_model/      |
| BiLSTM   | main_BILSTM.py       | bilstm_emotion_model/   |
| CNN-LSTM | main_RNN_CNN-LSTM.py | cnn_lstm_emotion_model/ |
| PhoBERT  | main_phobert.py      | phobert_emotion_model/  |

👉 PhoBERT performs best, which we attribute to its contextual representations pretrained on Vietnamese text.

🚀 Training

🔧 Install dependencies

pip install -r requirements.txt

▶️ Run models

PhoBERT

python main_phobert.py

SVM

python main_svm.py

BiLSTM

python main_BILSTM.py

RNN / CNN-LSTM

python main_RNN_CNN-LSTM.py

Run all (if configured)

python run.py

📈 Results

| Model    | Accuracy |
|----------|----------|
| PhoBERT  | 94.22%   |
| SVM      | 78.69%   |
| CNN-LSTM | 62.47%   |
| BiLSTM   | 59.56%   |
| RNN      | 30.02%   |
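For reference, accuracy here is the fraction of test sentences whose predicted label equals the true label. The sketch below computes it, along with per-class recall, which can be a useful companion on imbalanced label sets. The labels in the example are made up to show the arithmetic and are not results from this project.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true label, the fraction of its samples that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels: a classifier that always answers "enjoyment".
y_true = ["enjoyment", "enjoyment", "enjoyment", "fear"]
y_pred = ["enjoyment", "enjoyment", "enjoyment", "enjoyment"]
print(accuracy(y_true, y_pred))          # 0.75
print(per_class_recall(y_true, y_pred))  # {'enjoyment': 1.0, 'fear': 0.0}
```

The toy case shows why a single accuracy figure can hide a class that is never predicted.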

💾 Checkpoints

Trained models are stored in the per-model folders:

*_emotion_model/

Example load:

from tensorflow.keras.models import load_model

model = load_model("bilstm_best.keras")
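The Keras call above covers the `.keras` checkpoints. For the PhoBERT folder, a minimal sketch follows. It assumes `phobert_emotion_model/` was written with the Hugging Face `save_pretrained` convention and that label names were stored in the model config; neither is confirmed by this README. The function name `predict_emotion` and the `max_length=256` value are illustrative choices.

```python
def predict_emotion(text, model_dir="phobert_emotion_model"):
    """Return (label, probability) for one sentence from a Hugging Face checkpoint folder."""
    # Imported lazily so the heavy dependencies are only needed when the function runs.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumption: model_dir contains both tokenizer files and model weights.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()

    # PhoBERT expects word-segmented input; apply the same segmentation
    # and normalization that were used when the model was trained.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    probabilities = torch.softmax(logits, dim=-1)[0]
    index = int(probabilities.argmax())

    # id2label is only meaningful if label names were saved with the config;
    # otherwise it falls back to generic names such as LABEL_0.
    return model.config.id2label[index], float(probabilities[index])


# Usage, once the checkpoint folder is available locally:
# label, probability = predict_emotion("Hôm nay tôi rất vui")
```

If the folder was saved in a different format, `main_phobert.py` is the place to check how the model is written to disk.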

🧪 Example

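text = "Hôm nay tôi rất vui"  # "Today I am very happy"
# predict() needs a numeric batch, not a raw string. `preprocess` is a hypothetical
# placeholder for the cleaning, tokenization, and padding used in the training script.
inputs = preprocess([text])
prediction = model.predict(inputs)
print(prediction)  # one row of per-class scores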
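The printed prediction is a row of per-class scores rather than a label name. A minimal sketch of the decoding step is shown below. The label order is an assumption and must match the label encoding used during training; the scores are made up for illustration.

```python
# Assumed label order; it must match the label encoding used during training.
LABELS = ["enjoyment", "anger", "sadness", "disgust", "fear", "surprise", "other"]


def decode_predictions(scores, labels=LABELS):
    """Map each row of per-class scores to (label, score) for the highest-scoring class."""
    decoded = []
    for row in scores:
        if len(row) != len(labels):
            raise ValueError(f"expected {len(labels)} scores per row, got {len(row)}")
        best = max(range(len(row)), key=lambda i: row[i])
        decoded.append((labels[best], float(row[best])))
    return decoded


# Made-up scores standing in for the model output on one sentence.
example_scores = [[0.91, 0.01, 0.02, 0.01, 0.01, 0.03, 0.01]]
print(decode_predictions(example_scores))  # [('enjoyment', 0.91)]
```

The function only indexes and measures each row, so it accepts plain lists as well as the NumPy arrays that Keras returns.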

📚 Publication

If you use this work, please cite:

Advancing Emotion Recognition in Vietnamese: A PhoBERT-Based Approach for Enhanced Interaction
📄 DOI: https://doi.org/10.34238/tnu-jst.12889

🔮 Future Work

  • Multimodal emotion recognition (text + speech)
  • Larger and more diverse datasets
  • Optimization for real-time inference
  • Deployment in production systems

📄 License

MIT License