---
license: mit
---

LF-PhoBERT: Leakage-Free Robust Vietnamese Emotion Classification

LF-PhoBERT is a robust Vietnamese emotion classification model fine-tuned from PhoBERT-base using a leakage-free and reproducible training recipe.
The model is designed for noisy social media text and imbalanced emotion distributions, with a focus on stability and deployment-oriented evaluation.

📄 Paper

A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification

Authors:
Trung Quang Nguyen⁴, Duc Dat Pham¹˒³, Ngoc Tram Huynh Thi²˒³,
Nguyen Thi Bich Ngoc²˒³, and Tan Duy Le*²˒³

¹ University of Science, Ho Chi Minh City, Vietnam
² International University, VNU-HCM, Vietnam
³ Vietnam National University, Ho Chi Minh City, Vietnam
⁴ Ho Chi Minh City University of Economics and Finance, Vietnam

🧠 Model Description

  • Backbone: PhoBERT-base (vinai/phobert-base)
  • Task: Single-label, multi-class emotion classification
  • Language: Vietnamese
  • Domain: Social media text
  • Number of classes: 7
    (Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise)

LF-PhoBERT is trained using a unified objective that combines:

  • Class-Balanced Focal Loss
  • R-Drop consistency regularization
  • Supervised Contrastive Learning
  • FGM-based adversarial training

All class statistics are computed only on the training split to prevent information leakage.
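
For illustration, below is a minimal PyTorch sketch of a class-balanced focal loss in the standard effective-number formulation; it is not the exact training code, the beta and gamma values are placeholders, and the R-Drop, supervised contrastive, and FGM components of the full objective are not shown. The class counts are taken from the training split only, in line with the leakage-free setup described above.

import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, train_class_counts, beta=0.9999, gamma=2.0):
    # Effective-number class weights, computed from TRAIN-split label counts only
    counts = torch.as_tensor(train_class_counts, dtype=torch.float, device=logits.device)
    effective_num = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective_num
    weights = weights / weights.sum() * len(train_class_counts)  # normalize to num_classes

    log_probs = F.log_softmax(logits, dim=-1)
    target_log_probs = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    target_probs = target_log_probs.exp()

    # Focal term down-weights easy examples; class weights re-balance rare classes
    focal_term = (1.0 - target_probs) ** gamma
    loss = -weights[targets] * focal_term * target_log_probs
    return loss.mean()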

📊 Performance (SentiV)

Evaluated on the SentiV dataset using a stratified 80/20 train/test split; results are reported as mean ± standard deviation over 3 random seeds.

  • Macro-F1: 0.8040 ± 0.0003
  • Accuracy: 0.8144 ± 0.0004

The model outperforms standard PhoBERT fine-tuning with cross-entropy loss and demonstrates stable behavior across random seeds.
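
For readers who want to reproduce the protocol, here is a hedged sketch of the evaluation loop: a stratified 80/20 split with macro-F1 and accuracy reported as mean ± std over seeds. The train_and_predict function is a placeholder for the actual fine-tuning and inference code, and this card does not state whether the split itself is re-seeded per run; the sketch re-seeds both.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

def evaluate_over_seeds(texts, labels, seeds=(1, 2, 3)):
    f1s, accs = [], []
    for seed in seeds:
        # Stratified 80/20 split preserves the class distribution in both parts
        X_tr, X_te, y_tr, y_te = train_test_split(
            texts, labels, test_size=0.2, stratify=labels, random_state=seed
        )
        y_pred = train_and_predict(X_tr, y_tr, X_te, seed=seed)  # placeholder, not a real API
        f1s.append(f1_score(y_te, y_pred, average="macro"))
        accs.append(accuracy_score(y_te, y_pred))
    return (np.mean(f1s), np.std(f1s)), (np.mean(accs), np.std(accs))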

📦 Files in This Repository

  • model.safetensors – fine-tuned model weights
  • config.json – model configuration
  • tokenizer_config.json, vocab.txt, bpe.codes – tokenizer files
  • id2label.json – label mapping
  • special_tokens_map.json, added_tokens.json – tokenizer metadata
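
If the label mapping is needed outside of transformers, the snippet below loads it from id2label.json; this assumes the file maps string class ids to label names, which should be verified against the actual file.

import json

with open("id2label.json", encoding="utf-8") as f:
    id2label = {int(k): v for k, v in json.load(f).items()}  # assumed format: {"0": "<label>", ...}
print(id2label)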

🚀 Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ducdatit2002/LF-PhoBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example input: "This campaign really disappointed me 😡"
text = "Chiến dịch này làm tôi rất thất vọng 😡"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# config.id2label uses integer keys once loaded through transformers
predicted_label_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_label_id]
print(label)
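
If class probabilities are needed rather than a single argmax label, a softmax over the logits can be added; this assumes integer keys in model.config.id2label, which is how transformers exposes the mapping after loading.

# Per-class probabilities for the same input
probs = torch.softmax(outputs.logits, dim=-1).squeeze(0)
for label_id, p in enumerate(probs.tolist()):
    print(model.config.id2label[label_id], round(p, 4))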

🔁 Reproducibility

  • Training performed on a single NVIDIA A100 (80GB)
  • PyTorch 2.9.1, CUDA 12.8
  • Results reported as mean ± std over 3 random seeds
  • Identical preprocessing and optimization settings across runs

This checkpoint is released to support reproducibility and practical deployment.
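
As a hedged sketch (not the authors' exact script), per-run seeding along these lines is what "3 random seeds" typically refers to:

import random
import numpy as np
import torch

def set_seed(seed: int):
    # Seed Python, NumPy, and PyTorch (CPU and all GPUs) for a reproducible run
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)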

⚠️ Limitations

  • Single-label classification cannot fully capture mixed or ambiguous emotions
  • Sarcasm and context-dependent expressions remain challenging
  • Performance is evaluated on SentiV; cross-domain generalization is not guaranteed

📚 Citation

If you use this model, please cite:

Nguyen, T.Q., Pham, D.D., Huynh Thi, N.T., Nguyen, N.T.B., & Le, T.D. (2026).
A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification.

📜 License

This model is released for research and educational purposes. Please refer to the PhoBERT license and the SentiV dataset terms for downstream usage.