Paper Recommendations

Based on the Sen-1 Technical Report and Research Plan, these are the papers to fetch first.

High Priority - Vietnamese Text Classification

| Paper | Year | Citations | arXiv/ID | Why |
|---|---|---|---|---|
| A Comparative Study on Vietnamese Text Classification Methods | 2007 | ~100 | IEEE RIVF | VNTC dataset paper, Sen-1 baseline |
| PhoBERT: Pre-trained language models for Vietnamese | 2020 | ~400 | 2003.00744 | Primary comparison target (Sen-2) |
| ViSoBERT: Pre-Trained Language Model for Vietnamese Social Media | 2023 | ~15 | EMNLP 2023 | Social media Vietnamese BERT |

High Priority - Benchmarks & Datasets

| Paper | Year | Citations | arXiv/ID | Why |
|---|---|---|---|---|
| SMTCE Vietnamese Text Classification Benchmark | 2022 | - | 2209.10482 | Multi-dataset benchmark |
| Emotion Recognition for Vietnamese Social Media Text (UIT-VSMEC) | 2020 | - | CSoNet | Phase 2E dataset |
| UIT-VSFC: Vietnamese Students' Feedback Corpus | 2018 | - | - | Phase 2E dataset |

High Priority - Pretrained Models

| Paper | Year | Citations | arXiv/ID | Why |
|---|---|---|---|---|
| RoBERTa: A Robustly Optimized BERT | 2019 | ~28K | 1907.11692 | PhoBERT is based on RoBERTa |
| XLM-RoBERTa: Cross-lingual Representation Learning | 2019 | ~7.7K | 1911.02116 | Multilingual baseline |

Medium Priority - Methods

| Paper | Year | Citations | arXiv/ID | Why |
|---|---|---|---|---|
| Scikit-learn: Machine Learning in Python | 2011 | ~100K | JMLR | ML framework |
| A Survey on Text Classification: From Traditional to Deep Learning | 2022 | - | 2008.00364 | Survey paper |
| SMOTE: Synthetic Minority Over-sampling Technique | 2002 | ~20K | JAIR | Class imbalance (Phase 2D) |

Medium Priority - Vietnamese NLP General

| Paper | Year | Citations | arXiv/ID | Why |
|---|---|---|---|---|
| PhoNLP: Joint multi-task Vietnamese NLP | 2021 | ~5.6K | NAACL | Vietnamese NLP toolkit |
| VnCoreNLP: Vietnamese NLP Toolkit | 2018 | ~4.4K | NAACL | Word segmentation |
| PhoGPT: Generative Pre-training for Vietnamese | 2023 | - | arXiv | Recent Vietnamese LLM |

Quick Fetch Commands

# High priority
uv run references/paper_db.py fetch 2003.00744   # PhoBERT
uv run references/paper_db.py fetch 1907.11692   # RoBERTa
uv run references/paper_db.py fetch 1911.02116   # XLM-RoBERTa
uv run references/paper_db.py fetch 2209.10482   # SMTCE

# Medium priority
uv run references/paper_db.py fetch 2008.00364   # Text classification survey
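
The commands above can also be generated from a list of IDs rather than typed one by one. The following is a minimal sketch, assuming only the `uv run references/paper_db.py fetch <id>` form shown above; the helper names `build_fetch_commands` and `fetch_all` and the `dry_run` flag are hypothetical and not part of `paper_db.py`.

```python
import subprocess
from typing import Iterable, List


def build_fetch_commands(arxiv_ids: Iterable[str]) -> List[List[str]]:
    """Build one fetch command per unique, non-empty ID, preserving order."""
    seen = set()
    commands = []
    for raw in arxiv_ids:
        paper_id = raw.strip()
        if not paper_id or paper_id in seen:
            continue  # skip blanks and duplicates
        seen.add(paper_id)
        # Same command shape as the Quick Fetch Commands list.
        commands.append(["uv", "run", "references/paper_db.py", "fetch", paper_id])
    return commands


def fetch_all(arxiv_ids: Iterable[str], dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them one after another."""
    for cmd in build_fetch_commands(arxiv_ids):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing fetch.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # RoBERTa, XLM-RoBERTa, SMTCE, text classification survey
    fetch_all(["1907.11692", "1911.02116", "2209.10482", "2008.00364"])
```

With `dry_run=True` (the default here) the script only prints the commands, which makes it easy to review the list before anything is downloaded.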

Research Clusters

  1. Vietnamese Text Classification: Vu et al. (2007), SMTCE (2022)
  2. Vietnamese Pretrained: PhoBERT, ViSoBERT, PhoGPT
  3. Multilingual Models: RoBERTa, XLM-RoBERTa
  4. Datasets: VNTC, UTS2017_Bank, UIT-VSMEC, UIT-VSFC, PhoATIS
  5. Class Imbalance: SMOTE, class weighting methods
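
As a pointer for cluster 5, one common class weighting heuristic gives each class the weight `n_samples / (n_classes * class_count)`, which is the formula behind scikit-learn's `class_weight="balanced"` option. The sketch below is illustrative only: the function name and the toy labels are hypothetical, and nothing here is taken from the Sen-1 code.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical imbalanced label set: 8 "news" documents, 2 "sports" documents.
    toy_labels = ["news"] * 8 + ["sports"] * 2
    print(balanced_class_weights(toy_labels))
```

Rare classes receive weights above 1 and frequent classes weights below 1, so errors on the minority class cost more during training. SMOTE takes the other route and synthesizes new minority samples instead of reweighting.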