---
license: mit
---
# LF-PhoBERT: Leakage-Free Robust Vietnamese Emotion Classification
LF-PhoBERT is a robust Vietnamese emotion classification model fine-tuned from **PhoBERT-base** using a leakage-free and reproducible training recipe.
The model is designed for noisy social media text and imbalanced emotion distributions, with a focus on stability and deployment-oriented evaluation.
## 📄 Paper
**A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification**
**Authors:**
Trung Quang Nguyen⁴, Duc Dat Pham¹,³, Ngoc Tram Huynh Thi²,³,
Nguyen Thi Bich Ngoc²,³, and Tan Duy Le*²,³
¹ University of Science, Ho Chi Minh City, Vietnam
² International University, VNU-HCM, Vietnam
³ Vietnam National University, Ho Chi Minh City, Vietnam
⁴ Ho Chi Minh City University of Economics and Finance, Vietnam
## 🧠 Model Description
- **Backbone:** PhoBERT-base (`vinai/phobert-base`)
- **Task:** Single-label, multi-class emotion classification
- **Language:** Vietnamese
- **Domain:** Social media text
- **Number of classes:** 7
*(Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise)*
LF-PhoBERT is trained using a unified objective that combines:
- Class-Balanced Focal Loss
- R-Drop consistency regularization
- Supervised Contrastive Learning
- FGM-based adversarial training
All class statistics are computed **only on the training split** to prevent information leakage.
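As one illustrative piece of this objective, the Class-Balanced Focal Loss term can be sketched in PyTorch. This is a minimal sketch based on the standard formulation (effective number of samples, Cui et al.); the function name and the `beta`/`gamma` defaults are assumptions, not the paper's exact settings. In keeping with the leakage-free recipe, `samples_per_class` must be counted on the training split only.

```python
import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, samples_per_class,
                              beta=0.9999, gamma=2.0):
    """Sketch of Class-Balanced Focal Loss.

    Per-class weights use the 'effective number of samples':
        w_c = (1 - beta) / (1 - beta ** n_c),
    normalized so the weights sum to the number of classes.
    `samples_per_class` must come from the TRAINING split only.
    """
    n = torch.as_tensor(samples_per_class, dtype=torch.float32)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, n))
    weights = weights / weights.sum() * len(n)

    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                      # probability of the true class
    focal = (1.0 - pt) ** gamma * ce         # down-weight easy examples
    return (weights[targets] * focal).mean()
```

The per-class re-weighting addresses the imbalanced emotion distribution, while the focal term focuses training on hard examples.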
## 📊 Performance (SentiV)
Evaluated on the SentiV dataset using a stratified 80/20 split, averaged over 3 random seeds.
- **Macro-F1:** 0.8040 ± 0.0003
- **Accuracy:** 0.8144 ± 0.0004
The model outperforms standard PhoBERT fine-tuning with cross-entropy loss and demonstrates stable behavior across random seeds.
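Macro-F1 weights each of the 7 classes equally, which is why it is the headline metric under class imbalance. A short scikit-learn sketch of the reporting convention (the label arrays and seed scores below are illustrative placeholders, not the paper's data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy predictions over the 7 emotion classes (illustrative only).
y_true = [0, 1, 2, 2, 3, 4, 5, 6]
y_pred = [0, 1, 2, 1, 3, 4, 5, 6]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # each class weighted equally
acc = accuracy_score(y_true, y_pred)

# Reporting convention used above: mean ± std over 3 seeds.
seed_scores = np.array([0.80, 0.81, 0.80])  # placeholder values
print(f"macro-F1 {macro_f1:.4f}, acc {acc:.4f}")
print(f"{seed_scores.mean():.4f} ± {seed_scores.std():.4f}")
```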
## 📦 Files in This Repository
- `model.safetensors` – fine-tuned model weights
- `config.json` – model configuration
- `tokenizer_config.json`, `vocab.txt`, `bpe.codes` – tokenizer files
- `id2label.json` – label mapping
- `special_tokens_map.json`, `added_tokens.json` – tokenizer metadata
## 🚀 Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "ducdatit2002/LF-PhoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
text = "Chiến dịch này làm tôi rất thất vọng 😡"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)

predicted_label_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_label_id]  # id2label keys are ints after loading
print(label)
```
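If you need ranked predictions with confidence scores rather than a single label, a small helper (hypothetical, not part of this repository) can post-process the `logits` from the snippet above:

```python
import torch

def topk_emotions(logits, id2label, k=3):
    """Return the k most probable labels with their softmax scores (sketch)."""
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    scores, ids = probs.topk(k)
    return [(id2label[int(i)], float(s)) for i, s in zip(ids, scores)]
```

For example, `topk_emotions(outputs.logits, model.config.id2label)` returns a list of `(label, probability)` pairs in descending order, which is useful when downstream logic needs a confidence threshold.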
## 🔁 Reproducibility
* Training performed on a single NVIDIA A100 (80GB)
* PyTorch 2.9.1, CUDA 12.8
* Results reported as mean ± std over 3 random seeds
* Identical preprocessing and optimization settings across runs
This checkpoint is released to support reproducibility and practical deployment.
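A seed-fixing helper consistent with this protocol might look as follows. This is a generic sketch; the exact seeds and utilities used in training are not published here.

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix the RNGs that affect fine-tuning (Python, NumPy, PyTorch CPU/CUDA)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op when CUDA is unavailable
```

Calling `set_seed(s)` before each run, with identical preprocessing and optimizer settings, is what makes the mean ± std over 3 seeds meaningful.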
## ⚠️ Limitations
* Single-label classification cannot fully capture mixed or ambiguous emotions
* Sarcasm and context-dependent expressions remain challenging
* Performance is evaluated on SentiV; cross-domain generalization is not guaranteed
## 📚 Citation
If you use this model, please cite:
```
Nguyen, T.Q., Pham, D.D., Huynh Thi, N.T., Nguyen, N.T.B., & Le, T.D. (2026).
A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification.
```
## 📜 License
This model is released for research and educational purposes.
Please refer to the PhoBERT license and the SentiV dataset terms for downstream usage.