VNTC Dataset (10-topic News Classification)
| Model | Year | Accuracy | F1 (weighted) | Training | Inference | Size |
|---|---|---|---|---|---|---|
| N-gram LM (Vu et al.) | 2007 | 97.1% | - | ~79 min | - | - |
| SVM Multi (Vu et al.) | 2007 | 93.4% | - | ~79 min | - | - |
| sonar_core_1 (SVC) | - | 92.80% | 92.0% | ~54.6 min | - | ~75MB |
| Sen-1 (LinearSVC) | 2026 | 92.49% | 92.40% | 37.6s | 66K/sec | 2.4MB |
| PhoBERT-base* | 2020 | ~95-97% | ~95% | Hours (GPU) | ~20/sec | ~400MB |
*PhoBERT not directly evaluated on VNTC; estimates from similar tasks.
UTS2017_Bank Dataset (14-category Banking)
| Model | Accuracy | F1 (weighted) | F1 (macro) | Training |
|---|---|---|---|---|
| Sen-1 | 75.76% | 72.70% | 36.18% | 0.13s |
| sonar_core_1 | 72.47% | 66.0% | - | ~5.3s |
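The gap between Sen-1's weighted F1 (72.70%) and macro F1 (36.18%) reflects heavy class imbalance in UTS2017_Bank: rare categories drag the macro average down while barely affecting the weighted one. A minimal pure-Python sketch (toy labels, not the actual dataset) shows how the two averages diverge:

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1, plus macro (unweighted) and weighted (support-weighted) averages."""
    classes = sorted(set(y_true))
    support = Counter(y_true)
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, weighted

# Toy imbalanced setting: the classifier nails the majority class, misses the rare one.
y_true = ["balance"] * 9 + ["loan"]
y_pred = ["balance"] * 10
per_class, macro, weighted = f1_scores(y_true, y_pred)
# Majority-class F1 is high, rare-class F1 is 0, so macro falls far below weighted.
```

The same mechanism, spread over 14 banking categories with very uneven support, produces the 36-point spread in the table.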
Vietnamese Pretrained Models
| Model | Architecture | Pre-training Data | Languages | Vietnamese Tasks |
|---|---|---|---|---|
| PhoBERT | RoBERTa | 20GB Vietnamese | 1 (vi) | SOTA: POS, NER, NLI |
| ViSoBERT | XLM-R | Social media corpus | 1 (vi) | SOTA: social media tasks |
| vELECTRA | ELECTRA | 60GB Vietnamese | 1 (vi) | Strong on classification |
| viBERT | BERT | 10GB Vietnamese | 1 (vi) | Baseline |
| XLM-R | RoBERTa | CC-100 (2.5TB) | 100 | Strong multilingual |
| mBERT | BERT | Wikipedia | 104 | Weakest on Vietnamese |
SMTCE Benchmark (Best model per task)
| Task | Best Model | Score | Runner-up |
|---|---|---|---|
| UIT-VSMEC (Emotion) | PhoBERT | 65.44% F1 | viBERT4news |
| ViOCD (Complaint) | vELECTRA | 95.26% F1 | PhoBERT |
| ViHSD (Hate Speech) | PhoBERT | - | XLM-R |
| ViCTSD (Constructive) | PhoBERT | - | vELECTRA |
| UIT-VSFC (Sentiment) | PhoBERT | - | viBERT |
Model Efficiency
| Model | Size | VNTC Accuracy | Efficiency (Acc/MB) |
|---|---|---|---|
| Sen-1 | 2.4 MB | 92.49% | 38.5 |
| PhoBERT-base | ~400 MB | ~95% | 0.24 |
| XLM-R-base | ~1.1 GB | ~93% | 0.08 |
Sen-1 is ~160x more efficient in accuracy-per-MB than PhoBERT.
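The ~160x figure follows directly from the Acc/MB column; a quick sanity check using the table's approximate numbers:

```python
# Efficiency = accuracy (%) / model size (MB); inputs are the table's approximations,
# and the PhoBERT VNTC accuracy is an estimate, not a measured score.
models = {
    "Sen-1": (92.49, 2.4),
    "PhoBERT-base": (95.0, 400.0),
    "XLM-R-base": (93.0, 1100.0),
}
efficiency = {name: acc / mb for name, (acc, mb) in models.items()}
ratio = efficiency["Sen-1"] / efficiency["PhoBERT-base"]
# Sen-1 ~38.5 Acc/MB vs PhoBERT ~0.24 Acc/MB, a ratio of roughly 160x.
```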
State of the Art by Task
| Task | Dataset | SOTA Model | Score | Paper |
|---|---|---|---|---|
| News Classification | VNTC | N-gram LM | 97.1% Acc | Vu et al. 2007 |
| Emotion Recognition | UIT-VSMEC | ViSoBERT | SOTA F1 | Nguyen et al. 2023 |
| Sentiment Analysis | UIT-VSFC | PhoBERT | SOTA F1 | SMTCE 2022 |
| Hate Speech | ViHSD | PhoBERT/ViSoBERT | SOTA F1 | SMTCE/ViSoBERT |
| Complaint Detection | ViOCD | vELECTRA | 95.26% F1 | SMTCE 2022 |
| Spam Reviews | ViSpamReviews | ViSoBERT | SOTA F1 | Nguyen et al. 2023 |
Key Trends
Monolingual > Multilingual
PhoBERT, ViSoBERT, and vELECTRA consistently outperform XLM-R and mBERT on Vietnamese tasks.
Domain-specific Pretraining
ViSoBERT (social media) outperforms PhoBERT (general) on social media tasks.
Traditional ML Still Competitive
TF-IDF + SVM achieves 92%+ accuracy on news classification with ~160x smaller models than transformer baselines.
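A minimal sketch of the TF-IDF weighting behind pipelines like Sen-1 and sonar_core_1 (the toy corpus and tokenization are hypothetical; production pipelines use scikit-learn's TfidfVectorizer or an equivalent Rust implementation):

```python
import math
from collections import Counter

def tfidf(docs):
    """Map each tokenized doc to {term: tf * idf}, using scikit-learn's
    default smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: c * idf[t] for t, c in tf.items()})
    return out

# Toy syllable-level Vietnamese corpus (hypothetical)
docs = [["ngân", "hàng", "tăng", "lãi", "suất"],
        ["đội", "tuyển", "thắng", "trận"],
        ["ngân", "hàng", "giảm", "phí"]]
vectors = tfidf(docs)
# "ngân" appears in 2 of 3 docs, so its idf is lower than that of "đội" (1 of 3):
# common terms are down-weighted, distinctive terms up-weighted.
```

A linear SVM trained on these sparse vectors is what keeps models like Sen-1 in the single-megabyte range.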
Word Segmentation Matters
~5% accuracy gap between syllable-level (Sen-1) and word-level approaches.
Sen-1 Position
Sen-1 occupies the fast + lightweight quadrant: edge deployment, real-time batch processing, resource-constrained environments.
Open Questions
RQ1
Can word segmentation close the gap between Sen-1 and PhoBERT?
RQ2
How does Sen-1 perform on social media/informal text?
RQ3
Can an ensemble of Sen-1 and a lightweight transformer deliver both speed and accuracy?
RQ4
At what minimum dataset size does PhoBERT begin to outperform TF-IDF + SVM?
Available Datasets
| Dataset | Task | Samples | Classes | Domain | Source |
|---|---|---|---|---|---|
| VNTC | Topic | 84,132 | 10 | News | GitHub |
| UTS2017_Bank | Intent | 1,977 | 14 | Banking | HuggingFace |
| UIT-VSMEC | Emotion | 6,927 | 7 | Social media | UIT NLP |
| UIT-VSFC | Sentiment | 16,175 | 3 | Education | HuggingFace |
| SMTCE | Multi-task | Multiple | Various | Social media | arXiv |
Research Gaps
Gap 1
No comprehensive TF-IDF vs PhoBERT comparison on same Vietnamese benchmarks with controlled experiments.
Gap 2
Limited edge/resource-constrained deployment studies. Most work focuses on accuracy, not efficiency.
Gap 3
Class imbalance handling for Vietnamese datasets is under-explored.
Gap 4
Cross-domain evaluation and ablation studies for Vietnamese features are rare.
Vietnamese Text Classification Leaderboard
Comprehensive comparison of models across Vietnamese NLP benchmarks, speed, and efficiency.
News Classification
- 1. N-gram LM: 97.1%
- 2. PhoBERT-base*: ~95%
- 3. SVM Multi: 93.4%
- 4. sonar_core_1: 92.80%
- 5. Sen-1: 92.49%
Banking Classification
- 1. Sen-1: 75.76%
- 2. sonar_core_1: 72.47%
Emotion Recognition
- 1. ViSoBERT: SOTA
- 2. PhoBERT: 65.44%
- 3. viBERT4news: -
- 4. CNN baseline: 59.74%
Sentiment Analysis
- 1. PhoBERT: SOTA
- 2. viBERT: -
- 3. MaxEnt baseline: 88%
Complaint Detection
- 1. vELECTRA: 95.26%
- 2. PhoBERT: -
- 3. XLM-R: -
Hate Speech Detection
- 1. PhoBERT: SOTA
- 2. ViSoBERT: -
- 3. XLM-R: -
Fastest Inference
- 1. Sen-1: 66,678/s
- 2. TF-IDF + SVM (sklearn): ~50K/s
- 3. PhoBERT (GPU): ~20/s
Smallest Model
- 1. Sen-1: 2.4 MB
- 2. sonar_core_1: ~75 MB
- 3. PhoBERT-base: ~400 MB
- 4. XLM-R-base: ~1.1 GB
Most Efficient
- 1. Sen-1: 38.5
- 2. sonar_core_1: 1.24
- 3. PhoBERT-base: 0.24
- 4. XLM-R-base: 0.08
Fastest Training
- 1. Sen-1 (Rust): 37.6s
- 2. TF-IDF + SVM (sklearn): ~2 min
- 3. sonar_core_1: 54.6 min
- 4. N-gram LM: ~79 min
- 5. PhoBERT fine-tune: Hours
| # | Model | Type | Architecture | Size | VNTC Acc % | UTS2017 Acc % | UIT-VSMEC F1 % | ViOCD F1 % | Training | Inference /sec | Eff. Acc/MB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | N-gram LM | Traditional | N-gram Language Model | n/a | 97.1 | n/a | n/a | n/a | ~79 min | n/a | n/a |
| 2 | PhoBERT-base | Transformer | RoBERTa (20GB vi) | ~400 MB | ~95 | n/a | 65.44 | n/a | Hours (GPU) | ~20 | 0.24 |
| 3 | ViSoBERT | Transformer | XLM-R (social media) | ~400 MB | n/a | n/a | SOTA | n/a | Hours (GPU) | n/a | n/a |
| 4 | vELECTRA | Transformer | ELECTRA (60GB vi) | ~400 MB | n/a | n/a | n/a | 95.26 | Hours (GPU) | n/a | n/a |
| 5 | SVM Multi | Traditional | SVM + BOW | n/a | 93.4 | n/a | n/a | n/a | ~79 min | n/a | n/a |
| 6 | sonar_core_1 | Traditional | TF-IDF + SVC (RBF) | ~75 MB | 92.80 | 72.47 | n/a | n/a | 54.6 min | n/a | 1.24 |
| 7 | Sen-1 | Traditional | TF-IDF + LinearSVC (Rust) | 2.4 MB | 92.49 | 75.76 | n/a | n/a | 37.6s | 66,678 | 38.5 |
| 8 | XLM-R-base | Multilingual | RoBERTa (100 langs) | ~1.1 GB | ~93 | n/a | n/a | n/a | Hours (GPU) | n/a | 0.08 |
| 9 | mBERT | Multilingual | BERT (104 langs) | ~700 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |
| 10 | viBERT | Transformer | BERT (10GB vi) | ~400 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |
| 11 | viBERT4news | Transformer | BERT (news domain) | ~400 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |
| 12 | MaxEnt baseline | Traditional | Maximum Entropy | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| 13 | CNN baseline | Traditional | Convolutional NN | n/a | n/a | n/a | 59.74 | n/a | n/a | n/a | n/a |
| 14 | DistilmBERT | Multilingual | DistilBERT (multilingual) | ~260 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |
* PhoBERT VNTC estimate based on similar Vietnamese classification tasks. Blank cells indicate benchmark not evaluated.
Efficiency = VNTC Accuracy / Model Size in MB. Higher is better.
1. Choosing Important Problems
"If you do not work on an important problem, it's unlikely you'll do important work." -- Richard Hamming
Hamming's Principles
- Maintain a list of 10-20 important problems in your field
- A problem becomes important when you have a reasonable attack
- Dedicate deep thinking time (Friday "Great Thoughts Time")
- Have courage to pursue unconventional ideas
Schulman's Framework (OpenAI)
- Work on the right problems
- Make continual progress
- Achieve continual personal growth
Develop "research taste" by reading broadly, collaborating widely, and asking: "If this succeeds, how big is the impact?"
Microsoft Research Asia (Dr. Ming Zhou)
- Read recent ACL proceedings to find your field
- Target "blue ocean" areas - new fields with less competition
- Verify 3 prerequisites: math/ML framework, standard datasets, active research teams
- Find gaps: what can be improved, combined, or inverted
2. Reading Papers
Effective Reading Strategy
- Read broadly: Not just NLP - also cognitive science, neuroscience, linguistics, vision
- Read deeply: Become the "world-leading expert" on your narrow question
- Read textbooks: More knowledge-dense than papers
- Follow citation chains via Google Scholar, Semantic Scholar
- Use PRISMA methodology for systematic reviews
3. Running Experiments
Step 1: Reproduce baselines first
"Reimplement existing state-of-the-art work first to validate your setup." -- Marek Rei
- Choose an open-source project, compile it, run the demo, and match the published results
- Understand the algorithm deeply, then reimplement it
- Test on the standard test set until your results align with the original
Step 2: Simple baseline (1-2 weeks)
Implement the simplest approach before building complex architectures. Verify your setup works.
Step 3: Rigorous experimentation
- Debug: Don't assume bug-free code. Test with toy examples. Add assertions.
- Evaluate: Separate train/dev/test. Run 10+ times. Report mean + std.
- Ablate: Significance tests and ablation studies for every novel component.
- Avoid: Single-run results, weak-only baselines, blind trend-following.
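The evaluation step above (run 10+ times, report mean + std) can be sketched with the standard library; the run scores below are hypothetical:

```python
import statistics

# Hypothetical dev-set accuracies from 10 runs with different random seeds
runs = [0.921, 0.925, 0.919, 0.923, 0.927, 0.920, 0.924, 0.922, 0.926, 0.918]

mean = statistics.mean(runs)
std = statistics.stdev(runs)  # sample standard deviation (n-1 denominator)
print(f"accuracy: {mean:.3f} +/- {std:.3f} over {len(runs)} runs")
```

Reporting the spread alongside the mean is what makes a claimed improvement checkable: a 0.3-point gain is meaningless if the run-to-run std is also 0.3.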
4. Writing Papers
ACL Paper Structure (Dr. Ming Zhou)
- Title: Specific, no generic words
- Abstract: Problem + Method + Advantage + Achievement
- Introduction: Background → existing approaches → their limitations → your contribution (≤3 points)
- Related Work: Organized by theme, not chronology
- Methodology: Problem definition → notation → formulas
- Experiments: Purpose → data → parameters → reproducibility
- Limitations: Required by ACL - honest assessment
Revision: 3 passes - self review → team review → outsider review.
5. Mindset & Habits
| Principle | Lesson |
|---|---|
| Open doors | Stay connected to the community; know emerging problems |
| Preparation | "Luck favors the prepared mind" (Pasteur) |
| Constraints | Difficult conditions often lead to breakthroughs |
| Commitment | Deep immersion activates subconscious problem-solving |
| Selling work | Presentation matters - great work needs effective communication |