12 papers · 10 PDFs · 7 LaTeX
Paper Database
Research papers related to Vietnamese text classification, fetched from arXiv and ACL Anthology.
2020 · Findings of EMNLP · Dat Quoc Nguyen, Anh Tuan Nguyen · "PhoBERT: Pre-trained language models for Vietnamese"
We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference.
2020.arxiv.nguyen/ · 2020.findings.anh/
2023 · EMNLP · Quoc-Nam Nguyen, Thang Chau Phan, Duc-Vu Nguyen, Kiet Van Nguyen · "ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing"
We present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT, which is pre-trained on a large-scale corpus of high-quality and diverse Vietnamese social media texts using XLM-R architecture. ViSoBERT surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks with far fewer parameters.
2023.arxiv.nguyen/ · 2023.emnlp.kiet/
2020 · arXiv · Viet Bui The, Oanh Tran Thi, Phuong Le-Hong
Introduces viBERT (trained on 10GB) and vELECTRA (trained on 60GB) Vietnamese pretrained models. Strong performance on sequence tagging and text classification tasks. vELECTRA achieves 95.26% on ViOCD complaint classification in the SMTCE benchmark.
2020.arxiv.the/
2022 · PACLIC · Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
GLUE-inspired benchmark for Vietnamese social media text classification. Compares multilingual (mBERT, XLM-R, DistilmBERT) and monolingual (PhoBERT, viBERT, vELECTRA, viBERT4news) BERT models. Monolingual models consistently outperform multilingual for Vietnamese.
2022.arxiv.nguyen/ · 2022.paclic.ngan/
2007 · IEEE RIVF · Cong Duy Vu Hoang, Dien Dinh, Le Nguyen Nguyen, Quoc Hung Ngo
Seminal paper introducing VNTC corpus and comparing BOW and N-gram language model approaches for Vietnamese text classification. N-gram LM achieves 97.1% accuracy, SVM Multi achieves 93.4% on 10-topic news classification. The VNTC dataset remains the standard benchmark.
2007.rivf.hoang/
2019 · arXiv · Yinhan Liu, Myle Ott, Naman Goyal, et al. · "RoBERTa: A Robustly Optimized BERT Pretraining Approach"
PhoBERT is based on the RoBERTa architecture. Key optimizations over BERT: dynamic masking, larger batches, more training data, removal of Next Sentence Prediction (NSP). Foundation for most Vietnamese pretrained models.
2019.arxiv.liu/
2019 · arXiv (ACL 2020) · Alexis Conneau, Kartikay Khandelwal, Naman Goyal, et al. · "Unsupervised Cross-lingual Representation Learning at Scale"
Multilingual pretrained model trained on 100 languages (2.5TB CC-100). Strong multilingual baseline for Vietnamese, but consistently outperformed by monolingual PhoBERT on Vietnamese-specific tasks.
2019.arxiv.conneau/
2019 · arXiv (CSoNet 2020) · Vong Anh Ho, Duong Huynh-Cong Nguyen, Danh Hoang Nguyen, et al.
Introduces UIT-VSMEC corpus: 6,927 emotion-annotated Vietnamese social media sentences with 7 labels (sadness, enjoyment, anger, disgust, fear, surprise, other). CNN baseline achieves 59.74% weighted F1.
2019.arxiv.ho/
2018 · KSE · Kiet Van Nguyen, Vu Duc Nguyen, Phu Xuan-Vinh Nguyen, et al.
16,175 Vietnamese student feedback sentences annotated for sentiment (3 classes: positive, negative, neutral) and topic classification. Inter-annotator agreement: 91.20% for sentiment. MaxEnt baseline: 88% sentiment F1.
2018.kse.nguyen/
Benchmark Comparison
Vietnamese text classification results across datasets and models.

VNTC Dataset (10-topic News Classification)

| Model | Year | Accuracy | F1 (weighted) | Training | Inference | Size |
|---|---|---|---|---|---|---|
| N-gram LM (Vu et al.) | 2007 | 97.1% | - | ~79 min | - | - |
| SVM Multi (Vu et al.) | 2007 | 93.4% | - | ~79 min | - | - |
| sonar_core_1 (SVC) | - | 92.80% | 92.0% | ~54.6 min | - | ~75 MB |
| Sen-1 (LinearSVC) | 2026 | 92.49% | 92.40% | 37.6s | 66K/sec | 2.4 MB |
| PhoBERT-base* | 2020 | ~95-97% | ~95% | Hours (GPU) | ~20/sec | ~400 MB |

*PhoBERT not directly evaluated on VNTC; estimates from similar tasks.

UTS2017_Bank Dataset (14-category Banking)

| Model | Accuracy | F1 (weighted) | F1 (macro) | Training |
|---|---|---|---|---|
| Sen-1 | 75.76% | 72.70% | 36.18% | 0.13s |
| sonar_core_1 | 72.47% | 66.0% | - | ~5.3s |
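The wide gap between weighted F1 (72.70%) and macro F1 (36.18%) signals heavy class imbalance: the weighted average rewards getting frequent categories right, while the macro average penalizes every missed rare class equally. A minimal scikit-learn sketch on toy data (not the actual benchmark) shows the effect:

```python
# Toy demonstration: weighted vs. macro F1 diverge sharply under class imbalance.
from sklearn.metrics import f1_score

# 90% of samples belong to class 0; classes 1 and 2 are rare.
y_true = [0] * 90 + [1] * 5 + [2] * 5
y_pred = [0] * 100  # a degenerate majority-class predictor

weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"weighted F1 = {weighted:.2f}")  # high: dominated by the majority class
print(f"macro F1    = {macro:.2f}")     # low: the rare classes score 0
```

Reporting both averages, as the table above does, exposes models that coast on the frequent classes.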

Vietnamese Pretrained Models

| Model | Architecture | Pre-training Data | Languages | Vietnamese Tasks |
|---|---|---|---|---|
| PhoBERT | RoBERTa | 20GB Vietnamese | 1 (vi) | SOTA: POS, NER, NLI |
| ViSoBERT | XLM-R | Social media corpus | 1 (vi) | SOTA: social media tasks |
| vELECTRA | ELECTRA | 60GB Vietnamese | 1 (vi) | Strong on classification |
| viBERT | BERT | 10GB Vietnamese | 1 (vi) | Baseline |
| XLM-R | RoBERTa | CC-100 (2.5TB) | 100 | Strong multilingual |
| mBERT | BERT | Wikipedia | 104 | Weakest on Vietnamese |

SMTCE Benchmark (Best model per task)

| Task | Best Model | Score | Runner-up |
|---|---|---|---|
| UIT-VSMEC (Emotion) | PhoBERT | 65.44% F1 | viBERT4news |
| ViOCD (Complaint) | vELECTRA | 95.26% F1 | PhoBERT |
| ViHSD (Hate Speech) | PhoBERT | - | XLM-R |
| ViCTSD (Constructive) | PhoBERT | - | vELECTRA |
| UIT-VSFC (Sentiment) | PhoBERT | - | viBERT |

Model Efficiency

| Model | Size | VNTC Accuracy | Efficiency (Acc/MB) |
|---|---|---|---|
| Sen-1 | 2.4 MB | 92.49% | 38.5 |
| PhoBERT-base | ~400 MB | ~95% | 0.24 |
| XLM-R-base | ~1.1 GB | ~93% | 0.08 |

Sen-1 is ~160x more efficient in accuracy-per-MB than PhoBERT.
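The efficiency column follows directly from its definition (VNTC accuracy divided by model size in MB); a quick sanity check of the table's figures:

```python
# Recomputing the accuracy-per-MB efficiency figures from the table above.
models = {
    "Sen-1":        (92.49, 2.4),    # (VNTC accuracy %, size in MB)
    "PhoBERT-base": (95.0,  400.0),  # ~estimated accuracy and size
    "XLM-R-base":   (93.0,  1100.0), # ~estimated accuracy and size
}

efficiency = {name: acc / size_mb for name, (acc, size_mb) in models.items()}
for name, eff in sorted(efficiency.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {eff:5.2f} acc/MB")

ratio = efficiency["Sen-1"] / efficiency["PhoBERT-base"]
print(f"Sen-1 / PhoBERT-base efficiency ratio: ~{ratio:.0f}x")
```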

State-of-the-Art
Current SOTA for Vietnamese text classification tasks (as of 2026).
| Task | Dataset | SOTA Model | Score | Paper |
|---|---|---|---|---|
| News Classification | VNTC | N-gram LM | 97.1% Acc | Vu et al. 2007 |
| Emotion Recognition | UIT-VSMEC | ViSoBERT | SOTA F1 | Nguyen et al. 2023 |
| Sentiment Analysis | UIT-VSFC | PhoBERT | SOTA F1 | SMTCE 2022 |
| Hate Speech | ViHSD | PhoBERT/ViSoBERT | SOTA F1 | SMTCE/ViSoBERT |
| Complaint Detection | ViOCD | vELECTRA | 95.26% F1 | SMTCE 2022 |
| Spam Reviews | ViSpamReviews | ViSoBERT | SOTA F1 | Nguyen et al. 2023 |

Key Trends

Monolingual > Multilingual

PhoBERT, ViSoBERT, and vELECTRA consistently outperform XLM-R and mBERT on Vietnamese tasks.

Domain-specific Pretraining

ViSoBERT (social media) outperforms PhoBERT (general) on social media tasks.

Traditional ML Still Competitive

TF-IDF + SVM achieves 92%+ accuracy on news classification while using roughly 160x fewer resources (accuracy per MB) than transformer models.
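The TF-IDF + SVM recipe is a few lines in scikit-learn; a minimal sketch with toy Vietnamese snippets standing in for a VNTC-style corpus (the hyperparameters are illustrative, not Sen-1's actual configuration):

```python
# Minimal TF-IDF + LinearSVC text classifier: the classic recipe that stays
# competitive with transformers on Vietnamese news classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for a labeled corpus (syllable-level, no word segmentation).
texts = [
    "đội tuyển bóng đá thắng trận chung kết",    # sports
    "trận đấu bóng đá kết thúc với tỷ số hòa",   # sports
    "giá cổ phiếu ngân hàng tăng mạnh",          # business
    "thị trường chứng khoán giảm điểm",          # business
]
labels = ["the_thao", "the_thao", "kinh_doanh", "kinh_doanh"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),  # word 1-2 grams
    LinearSVC(C=1.0),
)
clf.fit(texts, labels)
print(clf.predict(["cổ phiếu tăng giá"]))
```

Swapping in character n-grams (`analyzer="char_wb"`) is a common variant when word segmentation is unavailable.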

Word Segmentation Matters

~5% accuracy gap between syllable-level (Sen-1) and word-level approaches.
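Vietnamese is written with spaces between syllables, not words, so a naive whitespace tokenizer splits multi-syllable words apart; word segmenters rejoin them, conventionally with underscores (as PhoBERT's preprocessing does via VnCoreNLP). A toy greedy dictionary-based sketch (the tiny lexicon here is illustrative, not a real segmenter):

```python
# Syllable-level vs. word-level tokenization of Vietnamese.
# Whitespace splitting yields syllables; a (toy) greedy longest-match against a
# word lexicon joins multi-syllable words with underscores, PhoBERT-style.
LEXICON = {"học sinh", "trung học", "phổ thông"}  # toy word list
MAX_WORD_SYLLABLES = 2

def segment(sentence: str) -> list[str]:
    syllables = sentence.split()  # syllable level: what Sen-1 operates on
    out, i = [], 0
    while i < len(syllables):
        for n in range(MAX_WORD_SYLLABLES, 0, -1):  # greedy longest match
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in LEXICON:
                out.append(cand.replace(" ", "_"))
                i += n
                break
    return out

sent = "học sinh trung học phổ thông"
print(sent.split())   # syllable tokens
print(segment(sent))  # word tokens, underscore-joined
```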

Sen-1 Position

[Figure: quadrant chart of accuracy (vertical) vs. inference speed (horizontal). PhoBERT and ViSoBERT occupy the high-accuracy, slow-inference region; the 2007 N-gram model is high-accuracy; Sen-1 trades a few accuracy points for much faster inference.]

Sen-1 occupies the fast, lightweight quadrant: edge deployment, real-time batch processing, resource-constrained environments.

Open Questions

RQ1

Can word segmentation close the gap between Sen-1 and PhoBERT?

RQ2

How does Sen-1 perform on social media/informal text?

RQ3

Can an ensemble (Sen-1 + a lightweight transformer) deliver both speed and accuracy?

RQ4

What is the minimum dataset size at which PhoBERT outperforms TF-IDF + SVM?

Citation Network
How the papers in this collection relate to each other and to Sen-1.
    Vu et al. 2007 (VNTC dataset)
      |
      +---> Vietnamese text classification research
      |       RoBERTa (2019) ---> PhoBERT (2020) ---> ViSoBERT (2023)
      |       XLM-R (2019) -----> vELECTRA (2020)
      |       SMTCE benchmark (2022) ---> UIT-VSMEC, UIT-VSFC
      |       Sen-2 (future)
      |
      +---> Sen-1 (TF-IDF + SVM baseline)
              |
              +---> 92.49% VNTC · 75.76% UTS2017_Bank
              |
              +---> Phase 2: word segmentation, PhoBERT comparison
              |
              +---> Phase 3: Sen-2 (PhoBERT-based)

Available Datasets

| Dataset | Task | Samples | Classes | Domain | Source |
|---|---|---|---|---|---|
| VNTC | Topic | 84,132 | 10 | News | GitHub |
| UTS2017_Bank | Intent | 1,977 | 14 | Banking | HuggingFace |
| UIT-VSMEC | Emotion | 6,927 | 7 | Social media | UIT NLP |
| UIT-VSFC | Sentiment | 16,175 | 3 | Education | HuggingFace |
| SMTCE | Multi-task | Multiple | Various | Social media | arXiv |

Research Gaps

Gap 1

No comprehensive TF-IDF vs PhoBERT comparison on same Vietnamese benchmarks with controlled experiments.

Gap 2

Limited edge/resource-constrained deployment studies. Most work focuses on accuracy, not efficiency.

Gap 3

Class imbalance handling for Vietnamese datasets is under-explored.

Gap 4

Cross-domain evaluation and ablation studies for Vietnamese features are rare.

Vietnamese Text Classification Leaderboard

Comprehensive comparison of models across Vietnamese NLP benchmarks, speed, and efficiency.

Updated: February 2026 · Inspired by Vellum LLM Leaderboard
Quality Benchmarks
Top models per dataset, ranked by primary metric.

News Classification

VNTC (10 topics, 84K samples)
  1. N-gram LM · 97.1%
  2. PhoBERT-base* · ~95%
  3. SVM Multi · 93.4%
  4. sonar_core_1 · 92.80%
  5. Sen-1 · 92.49%

Banking Classification

UTS2017_Bank (14 categories, 1.9K samples)
  1. Sen-1 · 75.76%
  2. sonar_core_1 · 72.47%

Emotion Recognition

UIT-VSMEC (7 classes, 6.9K samples)
  1. ViSoBERT · SOTA
  2. PhoBERT · 65.44%
  3. viBERT4news · -
  4. CNN baseline · 59.74%

Sentiment Analysis

UIT-VSFC (3 classes, 16K samples)
  1. PhoBERT · SOTA
  2. viBERT · -
  3. MaxEnt baseline · 88%

Complaint Detection

ViOCD (SMTCE benchmark)
  1. vELECTRA · 95.26%
  2. PhoBERT · -
  3. XLM-R · -

Hate Speech Detection

ViHSD (SMTCE benchmark)
  1. PhoBERT · SOTA
  2. ViSoBERT · -
  3. XLM-R · -
Performance Metrics
Speed, latency, and efficiency rankings.

Fastest Inference

Batch throughput (samples/sec)
  1. Sen-1 · 66,678/s
  2. TF-IDF + SVM (sklearn) · ~50K/s
  3. PhoBERT (GPU) · ~20/s
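Throughput figures like these can be measured with a simple wall-clock loop; a sketch, with a toy classifier standing in for a real model's batch `predict`:

```python
# Measuring batch inference throughput (samples/sec) with a wall-clock timer.
import time

def throughput(predict, samples, repeats=5):
    """Return samples/sec for predict() over `samples`, best of `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        predict(samples)
        best = min(best, time.perf_counter() - t0)
    return len(samples) / best

# Toy stand-in for a trained classifier's batch predict().
fake_predict = lambda batch: [len(s) % 2 for s in batch]

docs = ["một văn bản tiếng Việt"] * 10_000
print(f"{throughput(fake_predict, docs):,.0f} samples/sec")
```

Taking the best of several repeats reduces noise from warm-up and OS scheduling.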

Smallest Model

Model file size
  1. Sen-1 · 2.4 MB
  2. sonar_core_1 · ~75 MB
  3. PhoBERT-base · ~400 MB
  4. XLM-R-base · ~1.1 GB

Most Efficient

Accuracy per MB (VNTC)
  1. Sen-1 · 38.5
  2. sonar_core_1 · 1.24
  3. PhoBERT-base · 0.24
  4. XLM-R-base · 0.08

Fastest Training

VNTC full training time
  1. Sen-1 (Rust) · 37.6s
  2. TF-IDF+SVM (sklearn) · ~2 min
  3. sonar_core_1 · 54.6 min
  4. N-gram LM · ~79 min
  5. PhoBERT fine-tune · Hours
Comprehensive Comparison
All models with operational and benchmark metrics. Click column headers to sort.
Model categories: Traditional ML · Vietnamese Transformer · Multilingual
| # | Model | Type | Architecture | Size | VNTC Acc % | UTS2017 Acc % | UIT-VSMEC F1 % | ViOCD F1 % | Training | Inference /sec | Eff. (Acc/MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | N-gram LM | Traditional | N-gram Language Model | n/a | 97.1 | n/a | n/a | n/a | ~79 min | n/a | n/a |
| 2 | PhoBERT-base | Transformer | RoBERTa (20GB vi) | ~400 MB | ~95 | n/a | 65.44 | n/a | Hours (GPU) | ~20 | 0.24 |
| 3 | ViSoBERT | Transformer | XLM-R (social media) | ~400 MB | n/a | n/a | SOTA | n/a | Hours (GPU) | n/a | n/a |
| 4 | vELECTRA | Transformer | ELECTRA (60GB vi) | ~400 MB | n/a | n/a | n/a | 95.26 | Hours (GPU) | n/a | n/a |
| 5 | SVM Multi | Traditional | SVM + BOW | n/a | 93.4 | n/a | n/a | n/a | ~79 min | n/a | n/a |
| 6 | sonar_core_1 | Traditional | TF-IDF + SVC (RBF) | ~75 MB | 92.80 | 72.47 | n/a | n/a | 54.6 min | n/a | 1.24 |
| 7 | Sen-1 | Traditional | TF-IDF + LinearSVC (Rust) | 2.4 MB | 92.49 | 75.76 | n/a | n/a | 37.6s | 66,678 | 38.5 |
| 8 | XLM-R-base | Multilingual | RoBERTa (100 langs) | ~1.1 GB | ~93 | n/a | n/a | n/a | Hours (GPU) | n/a | 0.08 |
| 9 | mBERT | Multilingual | BERT (104 langs) | ~700 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |
| 10 | viBERT | Transformer | BERT (10GB vi) | ~400 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |
| 11 | viBERT4news | Transformer | BERT (news domain) | ~400 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |
| 12 | MaxEnt baseline | Traditional | Maximum Entropy | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| 13 | CNN baseline | Traditional | Convolutional NN | n/a | n/a | n/a | 59.74 | n/a | n/a | n/a | n/a |
| 14 | DistilmBERT | Multilingual | DistilBERT (multilingual) | ~260 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |

* PhoBERT VNTC estimate based on similar Vietnamese classification tasks; n/a cells indicate the benchmark was not evaluated.
Efficiency = VNTC accuracy / model size in MB; higher is better.

How to Do Science
Research methodology guide compiled from Hamming, Schulman, Marek Rei, and Microsoft Research Asia.

1. Choosing Important Problems

"If you do not work on an important problem, it's unlikely you'll do important work." -- Richard Hamming

Hamming's Principles

  • Maintain a list of 10-20 important problems in your field
  • A problem becomes important when you have a reasonable attack
  • Dedicate deep thinking time (Friday "Great Thoughts Time")
  • Have courage to pursue unconventional ideas

Schulman's Framework (OpenAI)

  • Work on the right problems
  • Make continual progress
  • Achieve continual personal growth

Develop "research taste" by reading broadly, collaborating widely, and asking: "If this succeeds, how big is the impact?"

Microsoft Research Asia (Dr. Ming Zhou)

  • Read recent ACL proceedings to find your field
  • Target "blue ocean" areas - new fields with less competition
  • Verify 3 prerequisites: math/ML framework, standard datasets, active research teams
  • Find gaps: what can be improved, combined, or inverted

2. Reading Papers

Effective Reading Strategy

  • Read broadly: Not just NLP - also cognitive science, neuroscience, linguistics, vision
  • Read deeply: Become the "world-leading expert" on your narrow question
  • Read textbooks: More knowledge-dense than papers
  • Follow citation chains via Google Scholar, Semantic Scholar
  • Use PRISMA methodology for systematic reviews

3. Running Experiments

Step 1: Reproduce baselines first

"Reimplement existing state-of-the-art work first to validate your setup." -- Marek Rei

  • Choose open source project, compile, run demo, match results
  • Understand the algorithm deeply, then reimplement
  • Test on standard test set until results align

Step 2: Simple baseline (1-2 weeks)

Implement the simplest approach before building complex architectures. Verify your setup works.

Step 3: Rigorous experimentation

  • Debug: Don't assume bug-free code. Test with toy examples. Add assertions.
  • Evaluate: Separate train/dev/test. Run 10+ times. Report mean + std.
  • Ablate: Significance tests and ablation studies for every novel component.
  • Avoid: Single-run results, weak-only baselines, blind trend-following.
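The mean-and-std recommendation above is two lines with the standard library; a sketch with placeholder scores (one per random seed, not real results):

```python
# Report mean ± std over multiple runs (e.g. 10 random seeds), as recommended above.
from statistics import mean, stdev

scores = [92.1, 92.5, 91.8, 92.3, 92.0, 92.4, 91.9, 92.2, 92.6, 92.1]  # one per seed

print(f"accuracy: {mean(scores):.2f} ± {stdev(scores):.2f} (n={len(scores)})")
```

A single-run result hides exactly the variance this summary exposes.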

4. Writing Papers

ACL Paper Structure (Dr. Ming Zhou)

  • Title: Specific, no generic words
  • Abstract: Problem + Method + Advantage + Achievement
  • Introduction: Background → existing work → limitations → contribution (≤3 points)
  • Related Work: Organized by theme, not chronology
  • Methodology: Problem definition → notation → formulas
  • Experiments: Purpose → data → parameters → reproducibility
  • Limitations: Required by ACL - honest assessment

Revision: 3 passes - self review → team review → outsider review.

5. Mindset & Habits

| Principle | Lesson |
|---|---|
| Open doors | Stay connected to the community; know emerging problems |
| Preparation | "Luck favors the prepared mind" (Pasteur) |
| Constraints | Difficult conditions often lead to breakthroughs |
| Commitment | Deep immersion activates subconscious problem-solving |
| Selling work | Presentation matters; great work needs effective communication |

Essential Reading

Richard Hamming · Choosing important problems, mindset
John Schulman (OpenAI) · Problem selection, progress, growth
Marek Rei · Practical experiment workflow
Dr. Ming Zhou (MSRA) · NLP research methodology