Paper Recommendations
Based on the Sen-1 Technical Report and Research Plan, these are the priority papers to fetch.
High Priority - Vietnamese Text Classification
| Paper | Year | Citations | arXiv ID / Venue | Why |
|---|---|---|---|---|
| A Comparative Study on Vietnamese Text Classification Methods | 2007 | ~100 | IEEE RIVF | VNTC dataset paper, Sen-1 baseline |
| PhoBERT: Pre-trained language models for Vietnamese | 2020 | ~400 | 2003.00744 | Primary comparison target (Sen-2) |
| ViSoBERT: Pre-Trained Language Model for Vietnamese Social Media | 2023 | ~15 | EMNLP 2023 | Social media Vietnamese BERT |
High Priority - Benchmarks & Datasets
| Paper | Year | Citations | arXiv ID / Venue | Why |
|---|---|---|---|---|
| SMTCE Vietnamese Text Classification Benchmark | 2022 | - | 2209.10482 | Multi-dataset benchmark |
| Emotion Recognition for Vietnamese Social Media Text (UIT-VSMEC) | 2020 | - | CSoNet | Phase 2E dataset |
| UIT-VSFC: Vietnamese Students' Feedback Corpus | 2019 | - | - | Phase 2E dataset |
High Priority - Pretrained Models
| Paper | Year | Citations | arXiv ID / Venue | Why |
|---|---|---|---|---|
| RoBERTa: A Robustly Optimized BERT Pretraining Approach | 2019 | ~28K | 1907.11692 | PhoBERT is based on RoBERTa |
| Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa) | 2019 | ~7.7K | 1911.02116 | Multilingual baseline |
Medium Priority - Methods
| Paper | Year | Citations | arXiv ID / Venue | Why |
|---|---|---|---|---|
| Scikit-learn: Machine Learning in Python | 2011 | ~100K | JMLR | ML framework |
| A Survey on Text Classification: From Traditional to Deep Learning | 2022 | - | 2008.00364 | Survey paper |
| SMOTE: Synthetic Minority Over-sampling Technique | 2002 | ~20K | JAIR | Class imbalance (Phase 2D) |
Medium Priority - Vietnamese NLP General
| Paper | Year | Citations | arXiv ID / Venue | Why |
|---|---|---|---|---|
| PhoNLP: Joint multi-task Vietnamese NLP | 2021 | ~5.6K | NAACL | Vietnamese NLP toolkit |
| VnCoreNLP: Vietnamese NLP Toolkit | 2018 | ~4.4K | NAACL | Word segmentation |
| PhoGPT: Generative Pre-training for Vietnamese | 2023 | - | arXiv | Recent Vietnamese LLM |
Quick Fetch Commands
# High priority
uv run references/paper_db.py fetch 2003.00744 # PhoBERT
uv run references/paper_db.py fetch 1907.11692 # RoBERTa
uv run references/paper_db.py fetch 1911.02116 # XLM-RoBERTa
uv run references/paper_db.py fetch 2209.10482 # SMTCE
# Medium priority
uv run references/paper_db.py fetch 2008.00364 # Text classification survey
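The individual fetches can also be batched. The following is a minimal sketch, not part of the project tooling: `ARXIV_IDS`, `fetch_all`, and `DRY_RUN` are hypothetical names introduced here, and the only assumption about `paper_db.py` is that `fetch` takes one arXiv ID per call, as in the commands listed here. The PhoBERT entry uses its arXiv ID, 2003.00744. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical batch wrapper around the per-paper fetch command.
# IDs: PhoBERT, RoBERTa, XLM-RoBERTa, SMTCE, text classification survey.
ARXIV_IDS="2003.00744 1907.11692 1911.02116 2209.10482 2008.00364"

fetch_all() {
  for id in $ARXIV_IDS; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry run (default): show the command instead of executing it.
      echo "uv run references/paper_db.py fetch $id"
    else
      # Report a failed fetch on stderr and continue with the next ID.
      uv run references/paper_db.py fetch "$id" || echo "fetch failed: $id" >&2
    fi
  done
}

fetch_all
```

Setting `DRY_RUN=0` before calling `fetch_all` runs the fetches for real. A failure on one ID is reported and does not stop the remaining ones.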
Research Clusters
- Vietnamese Text Classification: Vu et al. (2007), SMTCE (2022)
- Vietnamese Pretrained Models: PhoBERT, ViSoBERT, PhoGPT
- Base and Multilingual Models: RoBERTa (English, the basis of PhoBERT), XLM-RoBERTa (multilingual)
- Datasets: VNTC, UTS2017_Bank, UIT-VSMEC, UIT-VSFC, PhoATIS
- Class Imbalance: SMOTE, class weighting methods