# Paper Recommendations

Based on the Sen-1 Technical Report and Research Plan, these are the papers to fetch first.

## High Priority - Vietnamese Text Classification

| Paper | Year | Citations | arXiv/ID | Why |
|-------|------|-----------|----------|-----|
| A Comparative Study on Vietnamese Text Classification Methods | 2007 | ~100 | IEEE RIVF | VNTC dataset paper, Sen-1 baseline |
| PhoBERT: Pre-trained language models for Vietnamese | 2020 | ~400 | 2003.00744 | Primary comparison target (Sen-2) |
| ViSoBERT: Pre-Trained Language Model for Vietnamese Social Media | 2023 | ~15 | EMNLP 2023 | Social media Vietnamese BERT |

## High Priority - Benchmarks & Datasets

| Paper | Year | Citations | arXiv/ID | Why |
|-------|------|-----------|----------|-----|
| SMTCE Vietnamese Text Classification Benchmark | 2022 | - | 2209.10482 | Multi-dataset benchmark |
| Emotion Recognition for Vietnamese Social Media Text (UIT-VSMEC) | 2020 | - | CSoNet | Phase 2E dataset |
| UIT-VSFC: Vietnamese Students' Feedback Corpus | 2018 | - | KSE | Phase 2E dataset |

## High Priority - Pretrained Models

| Paper | Year | Citations | arXiv/ID | Why |
|-------|------|-----------|----------|-----|
| RoBERTa: A Robustly Optimized BERT | 2019 | ~28K | 1907.11692 | PhoBERT is based on RoBERTa |
| XLM-RoBERTa: Cross-lingual Representation Learning | 2019 | ~7.7K | 1911.02116 | Multilingual baseline |

## Medium Priority - Methods

| Paper | Year | Citations | arXiv/ID | Why |
|-------|------|-----------|----------|-----|
| Scikit-learn: Machine Learning in Python | 2011 | ~100K | JMLR | ML framework |
| A Survey on Text Classification: From Traditional to Deep Learning | 2022 | - | 2008.00364 | Survey paper |
| SMOTE: Synthetic Minority Over-sampling Technique | 2002 | ~20K | JAIR | Class imbalance (Phase 2D) |

## Medium Priority - Vietnamese NLP General

| Paper | Year | Citations | arXiv/ID | Why |
|-------|------|-----------|----------|-----|
| PhoNLP: Joint multi-task Vietnamese NLP | 2021 | - | NAACL | Vietnamese NLP toolkit |
| VnCoreNLP: Vietnamese NLP Toolkit | 2018 | - | NAACL | Word segmentation |
| PhoGPT: Generative Pre-training for Vietnamese | 2023 | - | arXiv | Recent Vietnamese LLM |

## Quick Fetch Commands

```bash
# High priority
uv run references/paper_db.py fetch 2003.00744   # PhoBERT
uv run references/paper_db.py fetch 1907.11692   # RoBERTa
uv run references/paper_db.py fetch 1911.02116   # XLM-RoBERTa
uv run references/paper_db.py fetch 2209.10482   # SMTCE

# Medium priority
uv run references/paper_db.py fetch 2008.00364   # Text classification survey
```
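
To avoid repeating the command by hand, the fetch calls above could be wrapped in a small helper. This is only a sketch: `fetch_papers`, `is_arxiv_id`, and the `DRY_RUN` variable are hypothetical names introduced here, and it assumes `paper_db.py fetch` takes one new-style arXiv ID (`YYMM.NNNN` or `YYMM.NNNNN`) per call, as in the commands above. Entries listed only by venue (for example `IEEE RIVF` or `JMLR`) are skipped, since they have no arXiv ID to fetch.

```bash
# Return 0 if the argument looks like a new-style arXiv ID (YYMM.NNNN or YYMM.NNNNN).
is_arxiv_id() {
  case "$1" in
    [0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9]) return 0 ;;
    [0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch every arXiv ID passed as an argument; skip anything else.
# DRY_RUN (hypothetical, defaults to 1) prints the commands instead of running them.
fetch_papers() {
  for id in "$@"; do
    if ! is_arxiv_id "$id"; then
      echo "skip (not an arXiv ID): $id" >&2
      continue
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "uv run references/paper_db.py fetch $id"
    else
      uv run references/paper_db.py fetch "$id"
    fi
  done
}

# Dry run: prints two fetch commands and reports the venue-only entry as skipped.
fetch_papers 1907.11692 1911.02116 "IEEE RIVF"
```

Run with `DRY_RUN=0` once the printed commands look right. Keeping the dry run as the default means a mistyped ID shows up in the output before anything is downloaded.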

## Research Clusters

1. **Vietnamese Text Classification**: Vu et al. (2007), SMTCE (2022)
2. **Vietnamese Pretrained**: PhoBERT, ViSoBERT, PhoGPT
3. **Base & Multilingual Models**: RoBERTa (English-only, basis of PhoBERT), XLM-RoBERTa (multilingual)
4. **Datasets**: VNTC, UTS2017_Bank, UIT-VSMEC, UIT-VSFC, PhoATIS
5. **Class Imbalance**: SMOTE, class weighting methods