Paper Recommendations

Based on the Sen-1 Technical Report and Research Plan, these are the papers to fetch first.

High Priority - Vietnamese Text Classification

Paper	Year	Citations	arXiv ID / Venue	Why
A Comparative Study on Vietnamese Text Classification Methods	2007	~100	IEEE RIVF	VNTC dataset paper, Sen-1 baseline
PhoBERT: Pre-trained language models for Vietnamese	2020	~400	2003.00744	Primary comparison target (Sen-2)
ViSoBERT: Pre-Trained Language Model for Vietnamese Social Media	2023	~15	EMNLP 2023	Social media Vietnamese BERT

High Priority - Benchmarks & Datasets

Paper	Year	Citations	arXiv ID / Venue	Why
SMTCE Vietnamese Text Classification Benchmark	2022	-	2209.10482	Multi-dataset benchmark
Emotion Recognition for Vietnamese Social Media Text (UIT-VSMEC)	2020	-	CSoNet	Phase 2E dataset
UIT-VSFC: Vietnamese Students' Feedback Corpus	2019	-	-	Phase 2E dataset

High Priority - Pretrained Models

Paper	Year	Citations	arXiv ID / Venue	Why
RoBERTa: A Robustly Optimized BERT Pretraining Approach	2019	~28K	1907.11692	PhoBERT is based on RoBERTa
XLM-RoBERTa: Unsupervised Cross-lingual Representation Learning at Scale	2019	~7.7K	1911.02116	Multilingual baseline

Medium Priority - Methods

Paper	Year	Citations	arXiv ID / Venue	Why
Scikit-learn: Machine Learning in Python	2011	~100K	JMLR	ML framework
A Survey on Text Classification: From Traditional to Deep Learning	2022	-	2008.00364	Survey paper
SMOTE: Synthetic Minority Over-sampling Technique	2002	~20K	JAIR	Class imbalance (Phase 2D)

Medium Priority - Vietnamese NLP General

Paper	Year	Citations	arXiv ID / Venue	Why
PhoNLP: Joint multi-task Vietnamese NLP	2021	-	NAACL	Vietnamese NLP toolkit
VnCoreNLP: Vietnamese NLP Toolkit	2018	-	NAACL	Word segmentation
PhoGPT: Generative Pre-training for Vietnamese	2023	-	arXiv	Recent Vietnamese LLM

Quick Fetch Commands

# High priority
uv run references/paper_db.py fetch 2003.00744   # PhoBERT
uv run references/paper_db.py fetch 1907.11692   # RoBERTa
uv run references/paper_db.py fetch 1911.02116   # XLM-RoBERTa
uv run references/paper_db.py fetch 2209.10482   # SMTCE

# Medium priority
uv run references/paper_db.py fetch 2008.00364   # Text classification survey
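
The individual fetch commands can also be wrapped in a small loop. The following is only a sketch: it assumes `references/paper_db.py fetch` accepts a single arXiv ID per call, as in the commands above, and the `fetch_all` function and `DRY_RUN` variable are hypothetical names introduced here for illustration, not part of the project.

```shell
#!/usr/bin/env bash
# Sketch: fetch several arXiv IDs in one go.
# fetch_all and DRY_RUN are illustrative names, not part of paper_db.py.

fetch_all() {
  # Each argument is treated as one arXiv ID.
  for id in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Dry run: print the command instead of executing it.
      echo "uv run references/paper_db.py fetch $id"
    else
      # Keep going if one fetch fails, but report it on stderr.
      uv run references/paper_db.py fetch "$id" || echo "fetch failed: $id" >&2
    fi
  done
}

# Run only when IDs are passed on the command line.
if [ "$#" -gt 0 ]; then
  fetch_all "$@"
fi
```

Saved under a name of your choice and run with `DRY_RUN=1` plus a list of IDs (for example `1907.11692 1911.02116`), the script prints the commands it would execute, which makes it easy to check the list before fetching.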

Research Clusters

Vietnamese Text Classification: Vu et al. (2007), SMTCE (2022)
Vietnamese Pretrained Models: PhoBERT, ViSoBERT, PhoGPT
Base and Multilingual Models: RoBERTa (English-only, the basis for PhoBERT), XLM-RoBERTa (multilingual)
Datasets: VNTC, UTS2017_Bank, UIT-VSMEC, UIT-VSFC, PhoATIS
Class Imbalance: SMOTE, class weighting methods