# Paper Recommendations

Based on the Sen-1 Technical Report and Research Plan, these are the papers to fetch first.
## High Priority - Vietnamese Text Classification

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| A Comparative Study on Vietnamese Text Classification Methods | 2007 | ~100 | IEEE RIVF | VNTC dataset paper, Sen-1 baseline |
| PhoBERT: Pre-trained language models for Vietnamese | 2020 | ~400 | 2003.00744 | Primary comparison target (Sen-2) |
| ViSoBERT: Pre-Trained Language Model for Vietnamese Social Media | 2023 | ~15 | EMNLP 2023 | Social media Vietnamese BERT |
## High Priority - Benchmarks & Datasets

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| SMTCE Vietnamese Text Classification Benchmark | 2022 | - | 2209.10482 | Multi-dataset benchmark |
| Emotion Recognition for Vietnamese Social Media Text (UIT-VSMEC) | 2020 | - | CSoNet | Phase 2E dataset |
| UIT-VSFC: Vietnamese Students' Feedback Corpus | 2019 | - | - | Phase 2E dataset |
## High Priority - Pretrained Models

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| RoBERTa: A Robustly Optimized BERT | 2019 | ~28K | 1907.11692 | PhoBERT is based on RoBERTa |
| XLM-RoBERTa: Cross-lingual Representation Learning | 2019 | ~7.7K | 1911.02116 | Multilingual baseline |
## Medium Priority - Methods

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| Scikit-learn: Machine Learning in Python | 2011 | ~100K | JMLR | ML framework |
| A Survey on Text Classification: From Traditional to Deep Learning | 2022 | - | 2008.00364 | Survey paper |
| SMOTE: Synthetic Minority Over-sampling Technique | 2002 | ~20K | JAIR | Class imbalance (Phase 2D) |
## Medium Priority - Vietnamese NLP General

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| PhoNLP: Joint multi-task Vietnamese NLP | 2021 | ~5.6K | NAACL | Vietnamese NLP toolkit |
| VnCoreNLP: Vietnamese NLP Toolkit | 2018 | ~4.4K | NAACL | Word segmentation |
| PhoGPT: Generative Pre-training for Vietnamese | 2023 | - | arXiv | Recent Vietnamese LLM |
## Quick Fetch Commands

```bash
# High priority
uv run references/paper_db.py fetch 2003.00744  # PhoBERT
uv run references/paper_db.py fetch 1907.11692  # RoBERTa
uv run references/paper_db.py fetch 1911.02116  # XLM-RoBERTa
uv run references/paper_db.py fetch 2209.10482  # SMTCE

# Medium priority
uv run references/paper_db.py fetch 2008.00364  # Text classification survey
```
## Research Clusters

1. **Vietnamese Text Classification**: Vu et al. (2007), SMTCE (2022)
2. **Vietnamese Pretrained**: PhoBERT, ViSoBERT, PhoGPT
3. **Multilingual Models**: RoBERTa, XLM-RoBERTa
4. **Datasets**: VNTC, UTS2017_Bank, UIT-VSMEC, UIT-VSFC, PhoATIS
5. **Class Imbalance**: SMOTE, class weighting methods