# Paper Recommendations

Based on the Sen-1 Technical Report and Research Plan, these are the papers to fetch first.

## High Priority - Vietnamese Text Classification

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| A Comparative Study on Vietnamese Text Classification Methods | 2007 | ~100 | IEEE RIVF | VNTC dataset paper, Sen-1 baseline |
| PhoBERT: Pre-trained language models for Vietnamese | 2020 | ~400 | 2003.00744 | Primary comparison target (Sen-2) |
| ViSoBERT: Pre-Trained Language Model for Vietnamese Social Media | 2023 | ~15 | EMNLP 2023 | Social media Vietnamese BERT |

## High Priority - Benchmarks & Datasets

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| SMTCE Vietnamese Text Classification Benchmark | 2022 | - | 2209.10482 | Multi-dataset benchmark |
| Emotion Recognition for Vietnamese Social Media Text (UIT-VSMEC) | 2020 | - | CSoNet | Phase 2E dataset |
| UIT-VSFC: Vietnamese Students' Feedback Corpus | 2019 | - | - | Phase 2E dataset |

## High Priority - Pretrained Models

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| RoBERTa: A Robustly Optimized BERT Pretraining Approach | 2019 | ~28K | 1907.11692 | PhoBERT is based on RoBERTa |
| XLM-RoBERTa: Unsupervised Cross-lingual Representation Learning at Scale | 2019 | ~7.7K | 1911.02116 | Multilingual baseline |

## Medium Priority - Methods

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| Scikit-learn: Machine Learning in Python | 2011 | ~100K | JMLR | ML framework |
| A Survey on Text Classification: From Traditional to Deep Learning | 2022 | - | 2008.00364 | Survey paper |
| SMOTE: Synthetic Minority Over-sampling Technique | 2002 | ~20K | JAIR | Class imbalance (Phase 2D) |

## Medium Priority - Vietnamese NLP General

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| PhoNLP: Joint multi-task Vietnamese NLP | 2021 | - | NAACL | Vietnamese NLP toolkit |
| VnCoreNLP: Vietnamese NLP Toolkit | 2018 | - | NAACL | Word segmentation |
| PhoGPT: Generative Pre-training for Vietnamese | 2023 | - | arXiv | Recent Vietnamese LLM |

## Quick Fetch Commands

```bash
# High priority
uv run references/paper_db.py fetch 2003.00744  # PhoBERT
uv run references/paper_db.py fetch 1907.11692  # RoBERTa
uv run references/paper_db.py fetch 1911.02116  # XLM-RoBERTa
uv run references/paper_db.py fetch 2209.10482  # SMTCE

# Medium priority
uv run references/paper_db.py fetch 2008.00364  # Text classification survey
```

## Research Clusters

1. **Vietnamese Text Classification**: Vu et al. (2007), SMTCE (2022)
2. **Vietnamese Pretrained**: PhoBERT, ViSoBERT, PhoGPT
3. **English and Multilingual Models**: RoBERTa (English), XLM-RoBERTa (multilingual)
4. **Datasets**: VNTC, UTS2017_Bank, UIT-VSMEC, UIT-VSFC, PhoATIS
5. **Class Imbalance**: SMOTE, class weighting methods
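
## Batch Fetch (Sketch)

The quick fetch commands can be batched in a small script, which may help when the list grows. This is only a sketch under stated assumptions: it uses the `fetch <id>` form shown in this document, one ID per call, because nothing here confirms that `paper_db.py` accepts several IDs at once. The `DRY_RUN` variable and the `fetch_all` function are hypothetical names introduced for illustration. By default the script only prints the commands it would run.

```bash
#!/usr/bin/env bash
set -euo pipefail

# arXiv IDs from the tables in this document, in priority order.
ids=(
  2003.00744  # PhoBERT
  1907.11692  # RoBERTa
  1911.02116  # XLM-RoBERTa
  2209.10482  # SMTCE
  2008.00364  # Text classification survey
)

# fetch_all is a hypothetical helper, not part of paper_db.py.
# DRY_RUN is also an assumption of this sketch: 1 (the default) prints each
# command, 0 actually runs it.
fetch_all() {
  local id
  for id in "$@"; do
    if [[ "${DRY_RUN:-1}" == "1" ]]; then
      echo "uv run references/paper_db.py fetch $id"
    else
      # Keep going if one paper fails, but report it on stderr.
      uv run references/paper_db.py fetch "$id" || echo "fetch failed: $id" >&2
    fi
  done
}

fetch_all "${ids[@]}"
```

Run it with `DRY_RUN=0` to perform the fetches for real. Papers listed with a venue instead of an arXiv ID are left out, since this document shows no fetch form for them.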