---
license: mit
---

# LF-PhoBERT: Leakage-Free Robust Vietnamese Emotion Classification

LF-PhoBERT is a robust Vietnamese emotion classification model fine-tuned from **PhoBERT-base** using a leakage-free and reproducible training recipe.

The model is designed for noisy social media text and imbalanced emotion distributions, with a focus on stability and deployment-oriented evaluation.

## 📄 Paper

**A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification**

**Authors:**
Trung Quang Nguyen⁴, Duc Dat Pham¹˒³, Ngoc Tram Huynh Thi²˒³,
Nguyen Thi Bich Ngoc²˒³, and Tan Duy Le\*²˒³

¹ University of Science, Ho Chi Minh City, Vietnam
² International University, VNU-HCM, Vietnam
³ Vietnam National University, Ho Chi Minh City, Vietnam
⁴ Ho Chi Minh City University of Economics and Finance, Vietnam

## 🧠 Model Description

- **Backbone:** PhoBERT-base (`vinai/phobert-base`)
- **Task:** Single-label, multi-class emotion classification
- **Language:** Vietnamese
- **Domain:** Social media text
- **Number of classes:** 7 *(Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise)*

LF-PhoBERT is trained using a unified objective that combines:

- Class-Balanced Focal Loss
- R-Drop consistency regularization
- Supervised Contrastive Learning
- FGM-based adversarial training

All class statistics are computed **only on the training split** to prevent information leakage.
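
Below is a minimal PyTorch sketch of how these components can fit together. The hyperparameters (`beta`, `gamma`, `epsilon`) and the embedding parameter name are illustrative assumptions rather than the paper's published settings, and the supervised contrastive term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def class_balanced_weights(train_class_counts, beta=0.9999):
    # Effective-number class weights (Cui et al., 2019). The counts must
    # come from the TRAINING split only, matching the leakage-free recipe.
    effective_num = 1.0 - torch.pow(beta, train_class_counts.float())
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(weights)

def class_balanced_focal_loss(logits, labels, alpha, gamma=2.0):
    # Focal loss down-weights easy examples via (1 - p_t)^gamma and
    # re-weights classes through the class-balanced `alpha`.
    ce = F.cross_entropy(logits, labels, weight=alpha, reduction="none")
    pt = torch.exp(-F.cross_entropy(logits, labels, reduction="none"))
    return ((1.0 - pt) ** gamma * ce).mean()

def rdrop_kl(logits_a, logits_b):
    # R-Drop: symmetric KL between two dropout-perturbed forward passes
    # of the same batch.
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    return 0.5 * (
        F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
        + F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    )

class FGM:
    # Fast Gradient Method: a one-step adversarial perturbation applied
    # to the word-embedding matrix, undone after the adversarial backward.
    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        for name, p in self.model.named_parameters():
            if p.requires_grad and self.emb_name in name and p.grad is not None:
                self.backup[name] = p.data.clone()
                norm = torch.norm(p.grad)
                if norm != 0 and not torch.isnan(norm):
                    p.data.add_(self.epsilon * p.grad / norm)

    def restore(self):
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data = self.backup[name]
        self.backup = {}
```

A typical step would run two stochastic forward passes, combine the focal and R-Drop terms, call `attack()`, add the loss from a forward pass on the perturbed embeddings, then `restore()` before the optimizer step; the exact weighting of the terms is defined in the paper.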

## 📊 Performance (SentiV)

Evaluated on the SentiV dataset using a stratified 80/20 split, with results averaged over 3 random seeds:

- **Macro-F1:** 0.8040 ± 0.0003
- **Accuracy:** 0.8144 ± 0.0004

The model outperforms standard PhoBERT fine-tuning with cross-entropy loss and demonstrates stable behavior across random seeds.
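
For context, the evaluation protocol can be sketched as below, assuming scikit-learn; the corpus and seed here are hypothetical stand-ins, and whether each seed re-draws the split or only re-initializes training is not specified in this card.

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in corpus; replace with the real SentiV texts/labels.
texts = [f"câu ví dụ {i}" for i in range(70)]   # "example sentence"
labels = [i % 7 for i in range(70)]             # seven emotion classes

# Stratified 80/20 split: every class keeps its overall proportion in
# both the training and the held-out split.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Macro-F1 averages per-class F1 with equal weight, the appropriate
# headline metric under class imbalance.
dummy_predictions = test_labels  # stand-in for real model predictions
print(f1_score(test_labels, dummy_predictions, average="macro"))
```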

## 📦 Files in This Repository

- `model.safetensors` – fine-tuned model weights
- `config.json` – model configuration
- `tokenizer_config.json`, `vocab.txt`, `bpe.codes` – tokenizer files
- `id2label.json` – label mapping
- `special_tokens_map.json`, `added_tokens.json` – tokenizer metadata
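
To inspect the label mapping without loading the model, a small sketch, assuming `id2label.json` maps stringified class indices to emotion names (the usual convention):

```python
import json

# JSON object keys are strings, so convert them back to integer class ids.
with open("id2label.json", encoding="utf-8") as f:
    id2label = {int(idx): name for idx, name in json.load(f).items()}

print(id2label)  # e.g. {0: "Anger", 1: "Disgust", ...} (assumed ordering)
```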

## 🚀 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ducdatit2002/LF-PhoBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# "This campaign really disappointed me 😡"
text = "Chiến dịch này làm tôi rất thất vọng 😡"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# `config.id2label` uses integer keys once loaded, so index with the int id.
predicted_label_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_label_id]
print(label)
```
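
Continuing from the snippet above, a softmax over the logits yields a per-class confidence score; this is a standard extension rather than something prescribed by the recipe:

```python
import torch.nn.functional as F

probs = F.softmax(outputs.logits, dim=-1).squeeze(0)
for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```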

## 🔁 Reproducibility

- Training performed on a single NVIDIA A100 (80 GB)
- PyTorch 2.9.1, CUDA 12.8
- Results reported as mean ± std over 3 random seeds
- Identical preprocessing and optimization settings across runs; per-run seeding is sketched below
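
A common way to pin per-run randomness for this kind of mean ± std protocol is shown below; the actual seed values and any additional determinism flags used in the paper are not published here, so treat this as an assumption:

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed every RNG that affects shuffling, dropout, and initialization.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for seed in (0, 1, 2):  # assumed seed values
    set_seed(seed)
    # ... run one fine-tuning + evaluation pass with this seed ...
```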

This checkpoint is released to support reproducibility and practical deployment.

## ⚠️ Limitations

- Single-label classification cannot fully capture mixed or ambiguous emotions
- Sarcasm and context-dependent expressions remain challenging
- Performance is evaluated on SentiV; cross-domain generalization is not guaranteed

## 📚 Citation

If you use this model, please cite:

```
Nguyen, T.Q., Pham, D.D., Huynh Thi, N.T., Nguyen, N.T.B., & Le, T.D. (2026).
A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification.
```

## 📜 License

This model is released for research and educational purposes. Please refer to the PhoBERT license and the SentiV dataset terms for downstream usage.