---
license: mit
---

# LF-PhoBERT: Leakage-Free Robust Vietnamese Emotion Classification

LF-PhoBERT is a robust Vietnamese emotion classification model fine-tuned from **PhoBERT-base** using a leakage-free and reproducible training recipe.

The model is designed for noisy social media text and imbalanced emotion distributions, with a focus on stability and deployment-oriented evaluation.

## 📄 Paper

**A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification**

**Authors:**
Trung Quang Nguyen⁴, Duc Dat Pham¹˒³, Ngoc Tram Huynh Thi²˒³,
Nguyen Thi Bich Ngoc²˒³, and Tan Duy Le\*²˒³

¹ University of Science, Ho Chi Minh City, Vietnam
² International University, VNU-HCM, Vietnam
³ Vietnam National University, Ho Chi Minh City, Vietnam
⁴ Ho Chi Minh City University of Economics and Finance, Vietnam

## 🧠 Model Description

- **Backbone:** PhoBERT-base (`vinai/phobert-base`)
- **Task:** Single-label, multi-class emotion classification
- **Language:** Vietnamese
- **Domain:** Social media text
- **Number of classes:** 7 *(Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise)*

LF-PhoBERT is trained using a unified objective that combines:

- Class-Balanced Focal Loss
- R-Drop consistency regularization
- Supervised Contrastive Learning
- FGM-based adversarial training

All class statistics are computed **only on the training split** to prevent information leakage.
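
Below is a minimal PyTorch sketch of how these components can fit together. The hyperparameters (`beta`, `gamma`, `epsilon`) and the embedding parameter name are illustrative assumptions rather than the paper's published settings, and the supervised contrastive term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def class_balanced_weights(train_class_counts, beta=0.9999):
    # Effective-number class weights (Cui et al., 2019). The counts must
    # come from the TRAINING split only, matching the leakage-free recipe.
    effective_num = 1.0 - torch.pow(beta, train_class_counts.float())
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(weights)

def class_balanced_focal_loss(logits, labels, alpha, gamma=2.0):
    # Focal loss down-weights easy examples via (1 - p_t)^gamma and
    # re-weights classes through the class-balanced `alpha`.
    ce = F.cross_entropy(logits, labels, weight=alpha, reduction="none")
    pt = torch.exp(-F.cross_entropy(logits, labels, reduction="none"))
    return ((1.0 - pt) ** gamma * ce).mean()

def rdrop_kl(logits_a, logits_b):
    # R-Drop: symmetric KL between two dropout-perturbed forward passes
    # of the same batch.
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    return 0.5 * (
        F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
        + F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    )

class FGM:
    # Fast Gradient Method: a one-step adversarial perturbation applied
    # to the word-embedding matrix, undone after the adversarial backward.
    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        for name, p in self.model.named_parameters():
            if p.requires_grad and self.emb_name in name and p.grad is not None:
                self.backup[name] = p.data.clone()
                norm = torch.norm(p.grad)
                if norm != 0 and not torch.isnan(norm):
                    p.data.add_(self.epsilon * p.grad / norm)

    def restore(self):
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data = self.backup[name]
        self.backup = {}
```

A typical step would run two stochastic forward passes, combine the focal and R-Drop terms, call `attack()`, add the loss from a forward pass on the perturbed embeddings, then `restore()` before the optimizer step; the exact weighting of the terms is defined in the paper.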

## 📊 Performance (SentiV)

Evaluated on the SentiV dataset using a stratified 80/20 split, with results averaged over 3 random seeds:

- **Macro-F1:** 0.8040 ± 0.0003
- **Accuracy:** 0.8144 ± 0.0004

The model outperforms standard PhoBERT fine-tuning with cross-entropy loss and demonstrates stable behavior across random seeds.
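
For context, the evaluation protocol can be sketched as below, assuming scikit-learn; the corpus and seed here are hypothetical stand-ins, and whether each seed re-draws the split or only re-initializes training is not specified in this card.

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in corpus; replace with the real SentiV texts/labels.
texts = [f"câu ví dụ {i}" for i in range(70)]   # "example sentence"
labels = [i % 7 for i in range(70)]             # seven emotion classes

# Stratified 80/20 split: every class keeps its overall proportion in
# both the training and the held-out split.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Macro-F1 averages per-class F1 with equal weight, the appropriate
# headline metric under class imbalance.
dummy_predictions = test_labels  # stand-in for real model predictions
print(f1_score(test_labels, dummy_predictions, average="macro"))
```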

## 📦 Files in This Repository

- `model.safetensors` – fine-tuned model weights
- `config.json` – model configuration
- `tokenizer_config.json`, `vocab.txt`, `bpe.codes` – tokenizer files
- `id2label.json` – label mapping
- `special_tokens_map.json`, `added_tokens.json` – tokenizer metadata
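
To inspect the label mapping without loading the model, a small sketch, assuming `id2label.json` maps stringified class indices to emotion names (the usual convention):

```python
import json

# JSON object keys are strings, so convert them back to integer class ids.
with open("id2label.json", encoding="utf-8") as f:
    id2label = {int(idx): name for idx, name in json.load(f).items()}

print(id2label)  # e.g. {0: "Anger", 1: "Disgust", ...} (assumed ordering)
```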

## 🚀 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ducdatit2002/LF-PhoBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# "This campaign really disappointed me 😡"
text = "Chiến dịch này làm tôi rất thất vọng 😡"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# `config.id2label` uses integer keys once loaded, so index with the int id.
predicted_label_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_label_id]
print(label)
```
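
Continuing from the snippet above, a softmax over the logits yields a per-class confidence score; this is a standard extension rather than something prescribed by the recipe:

```python
import torch.nn.functional as F

probs = F.softmax(outputs.logits, dim=-1).squeeze(0)
for idx, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```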

## 🔁 Reproducibility

- Training performed on a single NVIDIA A100 (80 GB)
- PyTorch 2.9.1, CUDA 12.8
- Results reported as mean ± std over 3 random seeds
- Identical preprocessing and optimization settings across runs; per-run seeding is sketched below
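
A common way to pin per-run randomness for this kind of mean ± std protocol is shown below; the actual seed values and any additional determinism flags used in the paper are not published here, so treat this as an assumption:

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed every RNG that affects shuffling, dropout, and initialization.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for seed in (0, 1, 2):  # assumed seed values
    set_seed(seed)
    # ... run one fine-tuning + evaluation pass with this seed ...
```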

This checkpoint is released to support reproducibility and practical deployment.

## ⚠️ Limitations

- Single-label classification cannot fully capture mixed or ambiguous emotions
- Sarcasm and context-dependent expressions remain challenging
- Performance is evaluated on SentiV; cross-domain generalization is not guaranteed

## 📚 Citation

If you use this model, please cite:

```
Nguyen, T.Q., Pham, D.D., Huynh Thi, N.T., Nguyen, N.T.B., & Le, T.D. (2026).
A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification.
```

## 📜 License

This model is released for research and educational purposes. Please refer to the PhoBERT license and the SentiV dataset terms for downstream usage.