# Paper Recommendations

Based on the Sen-1 Technical Report and Research Plan, these are the papers to fetch first.

## High Priority - Vietnamese Text Classification

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| A Comparative Study on Vietnamese Text Classification Methods | 2007 | ~100 | IEEE RIVF | VNTC dataset paper, Sen-1 baseline |
| PhoBERT: Pre-trained language models for Vietnamese | 2020 | ~400 | 2003.00744 | Primary comparison target (Sen-2) |
| ViSoBERT: Pre-Trained Language Model for Vietnamese Social Media | 2023 | ~15 | EMNLP 2023 | Social media Vietnamese BERT |

## High Priority - Benchmarks & Datasets

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| SMTCE Vietnamese Text Classification Benchmark | 2022 | - | 2209.10482 | Multi-dataset benchmark |
| Emotion Recognition for Vietnamese Social Media Text (UIT-VSMEC) | 2020 | - | CSoNet | Phase 2E dataset |
| UIT-VSFC: Vietnamese Students' Feedback Corpus | 2019 | - | - | Phase 2E dataset |

## High Priority - Pretrained Models

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| RoBERTa: A Robustly Optimized BERT | 2019 | ~28K | 1907.11692 | PhoBERT is based on RoBERTa |
| XLM-RoBERTa: Cross-lingual Representation Learning | 2019 | ~7.7K | 1911.02116 | Multilingual baseline |

## Medium Priority - Methods

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| Scikit-learn: Machine Learning in Python | 2011 | ~100K | JMLR | ML framework |
| A Survey on Text Classification: From Traditional to Deep Learning | 2022 | - | 2008.00364 | Survey paper |
| SMOTE: Synthetic Minority Over-sampling Technique | 2002 | ~20K | JAIR | Class imbalance (Phase 2D) |

## Medium Priority - Vietnamese NLP General

| Paper | Year | Citations | arXiv ID / Venue | Why |
|-------|------|-----------|------------------|-----|
| PhoNLP: Joint multi-task Vietnamese NLP | 2021 | ~5.6K | NAACL | Vietnamese NLP toolkit |
| VnCoreNLP: Vietnamese NLP Toolkit | 2018 | ~4.4K | NAACL | Word segmentation |
| PhoGPT: Generative Pre-training for Vietnamese | 2023 | - | arXiv | Recent Vietnamese LLM |

## Quick Fetch Commands

```bash
# High priority
uv run references/paper_db.py fetch 2003.00744 # PhoBERT
uv run references/paper_db.py fetch 1907.11692 # RoBERTa
uv run references/paper_db.py fetch 1911.02116 # XLM-RoBERTa
uv run references/paper_db.py fetch 2209.10482 # SMTCE

# Medium priority
uv run references/paper_db.py fetch 2008.00364 # Text classification survey
```
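The individual commands above can also be wrapped in a small loop. The following is a minimal sketch, assuming only that `paper_db.py fetch` accepts a single arXiv ID per call, as in the commands above. The function name `fetch_all` and the `DRY_RUN` variable are hypothetical helpers introduced for illustration; they are not part of `paper_db.py`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# fetch_all: hypothetical helper that calls the fetch command once per ID.
# DRY_RUN=1 (also hypothetical) prints each command instead of running it.
fetch_all() {
  local id
  for id in "$@"; do
    if [[ "${DRY_RUN:-0}" == "1" ]]; then
      echo "uv run references/paper_db.py fetch ${id}"
    else
      uv run references/paper_db.py fetch "${id}"
    fi
  done
}

# Preview the commands for two of the IDs listed in the tables.
DRY_RUN=1 fetch_all 1907.11692 1911.02116
```

Dropping the `DRY_RUN=1` prefix would run the real fetches; the preview mode is only there so the ID list can be checked before anything is downloaded.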

## Research Clusters

1. Vietnamese Text Classification: Vu et al. (2007), SMTCE (2022)
2. Vietnamese Pretrained: PhoBERT, ViSoBERT, PhoGPT
3. Multilingual Models: RoBERTa, XLM-RoBERTa
4. Datasets: VNTC, UTS2017_Bank, UIT-VSMEC, UIT-VSFC, PhoATIS
5. Class Imbalance: SMOTE, class weighting methods
