	---
	language: vi
	tags:
	- text-classification
	- emotion-recognition
	- vietnamese
	license: mit
	datasets:
	- custom
	metrics:
	- accuracy
	---

	# 🇻🇳 Vietnamese Emotion Recognition (PhoBERT-based)
	This repository provides a full pipeline for Vietnamese emotion recognition, including:
	* 📊 Processed datasets
	* 🧠 Training scripts for multiple models
	* 💾 Trained model checkpoints
	* 📈 Evaluation results

	## 📌 Overview
	This project classifies emotions in Vietnamese text using both a traditional machine learning model (SVM) and deep learning models, with a strong emphasis on PhoBERT-base-v2.
	Key contributions:
	* Build a high-quality Vietnamese emotion dataset
	* Handle class imbalance via oversampling
	* Compare multiple models (SVM → RNN → BiLSTM → CNN-LSTM → PhoBERT)
	* Achieve 94.22% accuracy with PhoBERT-base-v2

	## 📂 Repository Structure
	```
	.
	├── bilstm_emotion_model/ # Saved BiLSTM model
	├── cnn_lstm_emotion_model/ # Saved CNN-LSTM model
	├── phobert_emotion_model/ # Saved PhoBERT model
	├── rnn_emotion_model/ # Saved RNN model
	├── svm_emotion_model/ # Saved SVM model
	├── flagged/ # Flagged or filtered samples

	├── bilstm_best.keras # Best BiLSTM checkpoint
	├── cnn_lstm_best.keras # Best CNN-LSTM checkpoint

	├── main_BILSTM.py # Train BiLSTM
	├── main_RNN_CNN-LSTM.py # Train RNN & CNN-LSTM
	├── main_lstm.py # LSTM training script
	├── main_phobert.py # Train PhoBERT
	├── main_svm.py # Train SVM
	├── main_v1.py # Legacy / combined script
	├── run.py # Main runner script

	├── processed.xlsx # Main processed dataset
	├── processed_phobert.xlsx # Dataset for PhoBERT
	├── processed_svm.xlsx # Dataset for SVM
	├── train.xlsx # Training data

	├── abbreviations.json # Text normalization rules
	├── word2vec_vi_syllables_100dims.txt # Word embeddings

	├── requirements.txt
	└── README.md
	```
	## 📊 Dataset

	### 🔹 Sources
	* Social media
	* Product reviews
	* Conversations

	### 🔹 Format
	```csv
	sentence,emotion
	"Tôi rất vui hôm nay",enjoyment
	```

	### 🔹 Labels
	* enjoyment
	* anger
	* sadness
	* disgust
	* fear
	* surprise
	* other
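A minimal sketch of reading data in the `sentence,emotion` format above and checking each label against the seven classes. It assumes a CSV file with that header (the repository's own data files are `.xlsx`); `load_rows`, `LABELS`, and the label-to-index order are illustrative names and choices, not taken from the project's scripts.

```python
import csv
import io

# The seven emotion labels listed above; the index order is an assumption.
LABELS = ["enjoyment", "anger", "sadness", "disgust", "fear", "surprise", "other"]
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}


def load_rows(csv_file):
    """Read (sentence, label_id) pairs from a file with a sentence,emotion header."""
    rows = []
    for record in csv.DictReader(csv_file):
        sentence = record["sentence"].strip()
        emotion = record["emotion"].strip().lower()
        if emotion not in LABEL_TO_ID:
            raise ValueError(f"unknown emotion label: {emotion!r}")
        rows.append((sentence, LABEL_TO_ID[emotion]))
    return rows


# In-memory sample that mirrors the format shown above.
sample = io.StringIO('sentence,emotion\n"Tôi rất vui hôm nay",enjoyment\n')
print(load_rows(sample))
```

Rejecting unknown labels early keeps typos in the `emotion` column from silently becoming an extra class.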

	### 🔹 Preprocessing
	* Text cleaning and normalization
	* Abbreviation expansion (`abbreviations.json`)
	* Tokenization (Vietnamese-specific)
	* Oversampling for class balance
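Two of the steps above, abbreviation expansion and oversampling, can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's implementation: the `ABBREVIATIONS` entries are made up (the real `abbreviations.json` may use a different schema), and random duplication is only one common form of oversampling.

```python
import random
import re
from collections import defaultdict

# Hypothetical rules; the real abbreviations.json may use a different schema.
ABBREVIATIONS = {"ko": "không", "dc": "được", "mk": "mình"}


def expand_abbreviations(text, rules):
    """Lowercase, collapse whitespace, and replace whole-token abbreviations."""
    tokens = re.sub(r"\s+", " ", text.strip().lower()).split(" ")
    return " ".join(rules.get(token, token) for token in tokens)


def oversample(rows, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sentence, label in rows:
        by_label[label].append((sentence, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


rows = [("tôi vui", "enjoyment")] * 3 + [("tôi sợ", "fear")]
balanced = oversample(rows)
print(expand_abbreviations("Mk  ko thích", ABBREVIATIONS))
print(len(balanced))
```

In a setup like this, oversampling is normally applied to the training split only, so that duplicated rows do not end up in the evaluation data.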

	## 🧠 Models
	| Model | Script | Output folder |
	| -------- | ---------------------- | ------------------------- |
	| SVM | `main_svm.py` | `svm_emotion_model/` |
	| RNN | `main_RNN_CNN-LSTM.py` | `rnn_emotion_model/` |
	| BiLSTM | `main_BILSTM.py` | `bilstm_emotion_model/` |
	| CNN-LSTM | `main_RNN_CNN-LSTM.py` | `cnn_lstm_emotion_model/` |
	| PhoBERT | `main_phobert.py` | `phobert_emotion_model/` |

	👉 PhoBERT performs best (see Results), likely because its pretrained contextual representations capture the Vietnamese language better than the other models.

	## 🚀 Training

	### 🔧 Install dependencies
	```bash
	pip install -r requirements.txt
	```

	### ▶️ Run models

	#### PhoBERT
	```bash
	python main_phobert.py
	```

	#### SVM
	```bash
	python main_svm.py
	```

	#### BiLSTM
	```bash
	python main_BILSTM.py
	```

	#### RNN / CNN-LSTM
	```bash
	python main_RNN_CNN-LSTM.py
	```

	#### Run all (if configured)
	```bash
	python run.py
	```
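For readers who want to see what running everything in sequence might look like, here is a hedged sketch. It is not the contents of `run.py`, which this README does not describe; `build_commands` and `run_all` are hypothetical helpers that only reuse the script names listed above.

```python
import subprocess
import sys

# Training scripts named in this README (each listed once).
SCRIPTS = ("main_svm.py", "main_RNN_CNN-LSTM.py", "main_BILSTM.py", "main_phobert.py")


def build_commands(scripts=SCRIPTS):
    """Return one command per script, using the current Python interpreter."""
    return [[sys.executable, script] for script in scripts]


def run_all(scripts=SCRIPTS):
    """Run each script in turn and stop at the first failure."""
    for command in build_commands(scripts):
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)


# Show the commands without executing them; call run_all() to actually train.
for command in build_commands():
    print(" ".join(command))
```

Using `sys.executable` keeps every script in the same virtual environment as the caller.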

	## 📈 Results
	| Model | Accuracy |
	| -------- | ---------- |
	| PhoBERT | 94.22% |
	| SVM | 78.69% |
	| CNN-LSTM | 62.47% |
	| BiLSTM | 59.56% |
	| RNN | 30.02% |
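Accuracy, as reported above, is the share of samples whose predicted label matches the reference label. The sketch below illustrates that computation; it is not the repository's evaluation code. `per_class_recall` is an additional, hypothetical helper showing how each label fares individually, and the labels in the example are made up.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall per label: correct predictions for a label / occurrences of that label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels for illustration only.
y_true = ["anger", "anger", "fear", "other"]
y_pred = ["anger", "fear", "fear", "other"]
print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```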


	## 💾 Checkpoints

	Pretrained models are stored in:

	```
	*_emotion_model/
	```

	To load a saved Keras checkpoint:

	```python
	from tensorflow.keras.models import load_model

	model = load_model("bilstm_best.keras")
	```

	## 🧪 Example
	```python
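	text = "Hôm nay tôi rất vui"
	# `model` is loaded as shown under Checkpoints. Unless the saved model includes
	# its own text-vectorization layer, the raw string must first be encoded;
	# `encode` is a placeholder for the training-time tokenization and padding.
	prediction = model.predict(encode([text]))
	print(prediction)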
	```
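The value returned by `model.predict` is numeric, so it still has to be mapped back to an emotion name. A minimal sketch, assuming the model outputs one score per label and that the label order below matches the encoding used during training (both are assumptions, and `decode_prediction` is a hypothetical helper):

```python
# Assumed label order; it must match the label encoding used during training.
LABELS = ["enjoyment", "anger", "sadness", "disgust", "fear", "surprise", "other"]


def decode_prediction(scores):
    """Return (label, score) for the highest-scoring class."""
    if len(scores) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} scores, got {len(scores)}")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best], scores[best]


# Made-up scores for illustration, not real model output.
scores = [0.91, 0.02, 0.03, 0.01, 0.01, 0.01, 0.01]
print(decode_prediction(scores))  # ('enjoyment', 0.91)
```

For a batch prediction, apply `decode_prediction` to each row of the output.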

	## 📚 Publication
	If you use this work, please cite:
	Advancing Emotion Recognition in Vietnamese: A PhoBERT-Based Approach for Enhanced Interaction
	📄 DOI: [https://doi.org/10.34238/tnu-jst.12889](https://doi.org/10.34238/tnu-jst.12889)

	## 🔮 Future Work
	* Multimodal emotion recognition (text + speech)
	* Larger and more diverse datasets
	* Optimization for real-time inference
	* Deployment in production systems

	## 📄 License
	MIT License