---
license: mit
---

LF-PhoBERT: Leakage-Free Robust Vietnamese Emotion Classification

LF-PhoBERT is a robust Vietnamese emotion classification model fine-tuned from PhoBERT-base using a leakage-free and reproducible training recipe.
The model is designed for noisy social media text and imbalanced emotion distributions, with a focus on stability and deployment-oriented evaluation.

📄 Paper

A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification

Authors:
Trung Quang Nguyen⁴, Duc Dat Pham¹˒³, Ngoc Tram Huynh Thi²˒³,
Nguyen Thi Bich Ngoc²˒³, and Tan Duy Le*²˒³

¹ University of Science, Ho Chi Minh City, Vietnam
² International University, VNU-HCM, Vietnam
³ Vietnam National University, Ho Chi Minh City, Vietnam
⁴ Ho Chi Minh City University of Economics and Finance, Vietnam

🧠 Model Description

  • Backbone: PhoBERT-base (vinai/phobert-base)
  • Task: Single-label, multi-class emotion classification
  • Language: Vietnamese
  • Domain: Social media text
  • Number of classes: 7
    (Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise)

LF-PhoBERT is trained using a unified objective that combines:

  • Class-Balanced Focal Loss
  • R-Drop consistency regularization
  • Supervised Contrastive Learning
  • FGM-based adversarial training

All class statistics are computed only on the training split to prevent information leakage.
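
For illustration, below is a minimal PyTorch sketch of a class-balanced focal loss in the standard effective-number formulation; it is not the exact training code, the beta and gamma values are placeholders, and the R-Drop, supervised contrastive, and FGM components of the full objective are not shown. The class counts are taken from the training split only, in line with the leakage-free setup described above.

import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, train_class_counts, beta=0.9999, gamma=2.0):
    # Effective-number class weights, computed from TRAIN-split label counts only
    counts = torch.as_tensor(train_class_counts, dtype=torch.float, device=logits.device)
    effective_num = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective_num
    weights = weights / weights.sum() * len(train_class_counts)  # normalize to num_classes

    log_probs = F.log_softmax(logits, dim=-1)
    target_log_probs = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    target_probs = target_log_probs.exp()

    # Focal term down-weights easy examples; class weights re-balance rare classes
    focal_term = (1.0 - target_probs) ** gamma
    loss = -weights[targets] * focal_term * target_log_probs
    return loss.mean()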

📊 Performance (SentiV)

Evaluated on the SentiV dataset using a stratified 80/20 train/test split; results are reported as mean ± standard deviation over 3 random seeds.

  • Macro-F1: 0.8040 ± 0.0003
  • Accuracy: 0.8144 ± 0.0004

The model outperforms standard PhoBERT fine-tuning with cross-entropy loss and demonstrates stable behavior across random seeds.
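
For readers who want to reproduce the protocol, here is a hedged sketch of the evaluation loop: a stratified 80/20 split with macro-F1 and accuracy reported as mean ± std over seeds. The train_and_predict function is a placeholder for the actual fine-tuning and inference code, and this card does not state whether the split itself is re-seeded per run; the sketch re-seeds both.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

def evaluate_over_seeds(texts, labels, seeds=(1, 2, 3)):
    f1s, accs = [], []
    for seed in seeds:
        # Stratified 80/20 split preserves the class distribution in both parts
        X_tr, X_te, y_tr, y_te = train_test_split(
            texts, labels, test_size=0.2, stratify=labels, random_state=seed
        )
        y_pred = train_and_predict(X_tr, y_tr, X_te, seed=seed)  # placeholder, not a real API
        f1s.append(f1_score(y_te, y_pred, average="macro"))
        accs.append(accuracy_score(y_te, y_pred))
    return (np.mean(f1s), np.std(f1s)), (np.mean(accs), np.std(accs))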

📦 Files in This Repository

  • model.safetensors – fine-tuned model weights
  • config.json – model configuration
  • tokenizer_config.json, vocab.txt, bpe.codes – tokenizer files
  • id2label.json – label mapping
  • special_tokens_map.json, added_tokens.json – tokenizer metadata
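
If the label mapping is needed outside of transformers, the snippet below loads it from id2label.json; this assumes the file maps string class ids to label names, which should be verified against the actual file.

import json

with open("id2label.json", encoding="utf-8") as f:
    id2label = {int(k): v for k, v in json.load(f).items()}  # assumed format: {"0": "<label>", ...}
print(id2label)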

🚀 Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ducdatit2002/LF-PhoBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example input: "This campaign really disappointed me 😡"
text = "Chiến dịch này làm tôi rất thất vọng 😡"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# config.id2label uses integer keys once loaded through transformers
predicted_label_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_label_id]
print(label)
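
If class probabilities are needed rather than a single argmax label, a softmax over the logits can be added; this assumes integer keys in model.config.id2label, which is how transformers exposes the mapping after loading.

# Per-class probabilities for the same input
probs = torch.softmax(outputs.logits, dim=-1).squeeze(0)
for label_id, p in enumerate(probs.tolist()):
    print(model.config.id2label[label_id], round(p, 4))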

🔁 Reproducibility

  • Training performed on a single NVIDIA A100 (80GB)
  • PyTorch 2.9.1, CUDA 12.8
  • Results reported as mean ± std over 3 random seeds
  • Identical preprocessing and optimization settings across runs

This checkpoint is released to support reproducibility and practical deployment.
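
As a hedged sketch (not the authors' exact script), per-run seeding along these lines is what "3 random seeds" typically refers to:

import random
import numpy as np
import torch

def set_seed(seed: int):
    # Seed Python, NumPy, and PyTorch (CPU and all GPUs) for a reproducible run
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)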

⚠️ Limitations

  • Single-label classification cannot fully capture mixed or ambiguous emotions
  • Sarcasm and context-dependent expressions remain challenging
  • Performance is evaluated on SentiV; cross-domain generalization is not guaranteed

📚 Citation

If you use this model, please cite:

Nguyen, T.Q., Pham, D.D., Huynh Thi, N.T., Nguyen, N.T.B., & Le, T.D. (2026).
A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification.

📜 License

This model is released for research and educational purposes. Please refer to the PhoBERT license and the SentiV dataset terms for downstream usage.