VNTC Dataset (10-topic News Classification)
| Model | Year | Accuracy | F1 (weighted) | Training | Inference | Size |
|---|---|---|---|---|---|---|
| N-gram LM (Vu et al.) | 2007 | 97.1% | - | ~79 min | - | - |
| SVM Multi (Vu et al.) | 2007 | 93.4% | - | ~79 min | - | - |
| sonar_core_1 (SVC) | - | 92.80% | 92.0% | ~54.6 min | - | ~75MB |
| Sen-1 (LinearSVC) | 2026 | 92.49% | 92.40% | 37.6s | 66K/sec | 2.4MB |
| PhoBERT-base* | 2020 | ~95-97% | ~95% | Hours (GPU) | ~20/sec | ~400MB |
*PhoBERT not directly evaluated on VNTC; estimates from similar tasks.
UTS2017_Bank Dataset (14-category Banking)
| Model | Accuracy | F1 (weighted) | F1 (macro) | Training |
|---|---|---|---|---|
| Sen-1 | 75.76% | 72.70% | 36.18% | 0.13s |
| sonar_core_1 | 72.47% | 66.0% | - | ~5.3s |
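The gap between Sen-1's weighted F1 (72.70%) and macro F1 (36.18%) reflects heavy class imbalance in UTS2017_Bank: rare categories drag the macro average down while barely affecting the weighted one. A minimal pure-Python sketch (toy labels, not the actual dataset) shows how the two averages diverge:

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1, plus macro (unweighted) and weighted (support-weighted) averages."""
    classes = sorted(set(y_true))
    support = Counter(y_true)
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, weighted

# Toy imbalanced setting: the classifier nails the majority class, misses the rare one.
y_true = ["balance"] * 9 + ["loan"]
y_pred = ["balance"] * 10
per_class, macro, weighted = f1_scores(y_true, y_pred)
# Majority-class F1 is high, rare-class F1 is 0, so macro falls far below weighted.
```

The same mechanism, spread over 14 banking categories with very uneven support, produces the 36-point spread in the table.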
Vietnamese Pretrained Models
| Model | Architecture | Pre-training Data | Languages | Vietnamese Tasks |
|---|---|---|---|---|
| PhoBERT | RoBERTa | 20GB Vietnamese | 1 (vi) | SOTA: POS, NER, NLI |
| ViSoBERT | XLM-R | Social media corpus | 1 (vi) | SOTA: social media tasks |
| vELECTRA | ELECTRA | 60GB Vietnamese | 1 (vi) | Strong on classification |
| viBERT | BERT | 10GB Vietnamese | 1 (vi) | Baseline |
| XLM-R | RoBERTa | CC-100 (2.5TB) | 100 | Strong multilingual |
| mBERT | BERT | Wikipedia | 104 | Weakest on Vietnamese |
SMTCE Benchmark (Best model per task)
| Task | Best Model | Score | Runner-up |
|---|---|---|---|
| UIT-VSMEC (Emotion) | PhoBERT | 65.44% F1 | viBERT4news |
| ViOCD (Complaint) | vELECTRA | 95.26% F1 | PhoBERT |
| ViHSD (Hate Speech) | PhoBERT | - | XLM-R |
| ViCTSD (Constructive) | PhoBERT | - | vELECTRA |
| UIT-VSFC (Sentiment) | PhoBERT | - | viBERT |
Model Efficiency
| Model | Size | VNTC Accuracy | Efficiency (Acc/MB) |
|---|---|---|---|
| Sen-1 | 2.4 MB | 92.49% | 38.5 |
| PhoBERT-base | ~400 MB | ~95% | 0.24 |
| XLM-R-base | ~1.1 GB | ~93% | 0.08 |
Sen-1 is ~160x more efficient in accuracy-per-MB than PhoBERT.
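The ~160x figure follows directly from the Acc/MB column; a quick sanity check using the table's approximate numbers:

```python
# Efficiency = accuracy (%) / model size (MB); inputs are the table's approximations,
# and the PhoBERT VNTC accuracy is an estimate, not a measured score.
models = {
    "Sen-1": (92.49, 2.4),
    "PhoBERT-base": (95.0, 400.0),
    "XLM-R-base": (93.0, 1100.0),
}
efficiency = {name: acc / mb for name, (acc, mb) in models.items()}
ratio = efficiency["Sen-1"] / efficiency["PhoBERT-base"]
# Sen-1 ~38.5 Acc/MB vs PhoBERT ~0.24 Acc/MB, a ratio of roughly 160x.
```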
State of the Art by Task
| Task | Dataset | SOTA Model | Score | Paper |
|---|---|---|---|---|
| News Classification | VNTC | N-gram LM | 97.1% Acc | Vu et al. 2007 |
| Emotion Recognition | UIT-VSMEC | ViSoBERT | SOTA F1 | Nguyen et al. 2023 |
| Sentiment Analysis | UIT-VSFC | PhoBERT | SOTA F1 | SMTCE 2022 |
| Hate Speech | ViHSD | PhoBERT/ViSoBERT | SOTA F1 | SMTCE/ViSoBERT |
| Complaint Detection | ViOCD | vELECTRA | 95.26% F1 | SMTCE 2022 |
| Spam Reviews | ViSpamReviews | ViSoBERT | SOTA F1 | Nguyen et al. 2023 |
Key Trends
Monolingual > Multilingual
PhoBERT, ViSoBERT, and vELECTRA consistently outperform XLM-R and mBERT on Vietnamese tasks.
Domain-specific Pretraining
ViSoBERT (social media) outperforms PhoBERT (general) on social media tasks.
Traditional ML Still Competitive
TF-IDF + SVM achieves 92%+ accuracy on news classification with ~160x smaller models than transformer baselines.
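A minimal sketch of the TF-IDF weighting behind pipelines like Sen-1 and sonar_core_1 (the toy corpus and tokenization are hypothetical; production pipelines use scikit-learn's TfidfVectorizer or an equivalent Rust implementation):

```python
import math
from collections import Counter

def tfidf(docs):
    """Map each tokenized doc to {term: tf * idf}, using scikit-learn's
    default smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: c * idf[t] for t, c in tf.items()})
    return out

# Toy syllable-level Vietnamese corpus (hypothetical)
docs = [["ngân", "hàng", "tăng", "lãi", "suất"],
        ["đội", "tuyển", "thắng", "trận"],
        ["ngân", "hàng", "giảm", "phí"]]
vectors = tfidf(docs)
# "ngân" appears in 2 of 3 docs, so its idf is lower than that of "đội" (1 of 3):
# common terms are down-weighted, distinctive terms up-weighted.
```

A linear SVM trained on these sparse vectors is what keeps models like Sen-1 in the single-megabyte range.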
Word Segmentation Matters
~5% accuracy gap between syllable-level (Sen-1) and word-level approaches.
Sen-1 Position
Sen-1 occupies the fast + lightweight quadrant: edge deployment, real-time batch processing, resource-constrained environments.
Open Questions
RQ1
Can word segmentation close the gap between Sen-1 and PhoBERT?
RQ2
How does Sen-1 perform on social media/informal text?
RQ3
Can an ensemble of Sen-1 and a lightweight transformer deliver both speed and accuracy?
RQ4
At what minimum dataset size does PhoBERT begin to outperform TF-IDF + SVM?
Available Datasets
| Dataset | Task | Samples | Classes | Domain | Source |
|---|---|---|---|---|---|
| VNTC | Topic | 84,132 | 10 | News | GitHub |
| UTS2017_Bank | Intent | 1,977 | 14 | Banking | HuggingFace |
| UIT-VSMEC | Emotion | 6,927 | 7 | Social media | UIT NLP |
| UIT-VSFC | Sentiment | 16,175 | 3 | Education | HuggingFace |
| SMTCE | Multi-task | Multiple | Various | Social media | arXiv |
Research Gaps
Gap 1
No comprehensive TF-IDF vs PhoBERT comparison on same Vietnamese benchmarks with controlled experiments.
Gap 2
Limited edge/resource-constrained deployment studies. Most work focuses on accuracy, not efficiency.
Gap 3
Class imbalance handling for Vietnamese datasets is under-explored.
Gap 4
Cross-domain evaluation and ablation studies for Vietnamese features are rare.
Vietnamese Text Classification Leaderboard
Comprehensive comparison of models across Vietnamese NLP benchmarks, speed, and efficiency.
News Classification
- 1. N-gram LM: 97.1%
- 2. PhoBERT-base*: ~95%
- 3. SVM Multi: 93.4%
- 4. sonar_core_1: 92.80%
- 5. Sen-1: 92.49%
Banking Classification
- 1. Sen-1: 75.76%
- 2. sonar_core_1: 72.47%
Emotion Recognition
- 1. ViSoBERT: SOTA
- 2. PhoBERT: 65.44%
- 3. viBERT4news: -
- 4. CNN baseline: 59.74%
Sentiment Analysis
- 1. PhoBERT: SOTA
- 2. viBERT: -
- 3. MaxEnt baseline: 88%
Complaint Detection
- 1. vELECTRA: 95.26%
- 2. PhoBERT: -
- 3. XLM-R: -
Hate Speech Detection
- 1. PhoBERT: SOTA
- 2. ViSoBERT: -
- 3. XLM-R: -
Fastest Inference
- 1. Sen-1: 66,678/s
- 2. TF-IDF + SVM (sklearn): ~50K/s
- 3. PhoBERT (GPU): ~20/s
Smallest Model
- 1. Sen-1: 2.4 MB
- 2. sonar_core_1: ~75 MB
- 3. PhoBERT-base: ~400 MB
- 4. XLM-R-base: ~1.1 GB
Most Efficient
- 1. Sen-1: 38.5
- 2. sonar_core_1: 1.24
- 3. PhoBERT-base: 0.24
- 4. XLM-R-base: 0.08
Fastest Training
- 1. Sen-1 (Rust): 37.6s
- 2. TF-IDF + SVM (sklearn): ~2 min
- 3. sonar_core_1: 54.6 min
- 4. N-gram LM: ~79 min
- 5. PhoBERT fine-tune: Hours
| # | Model | Type | Architecture | Size | VNTC Acc % | UTS2017 Acc % | UIT-VSMEC F1 % | ViOCD F1 % | Training | Inference /sec | Eff. Acc/MB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | N-gram LM | Traditional | N-gram Language Model | n/a | 97.1 | n/a | n/a | n/a | ~79 min | n/a | n/a |
| 2 | PhoBERT-base | Transformer | RoBERTa (20GB vi) | ~400 MB | ~95 | n/a | 65.44 | n/a | Hours (GPU) | ~20 | 0.24 |
| 3 | ViSoBERT | Transformer | XLM-R (social media) | ~400 MB | n/a | n/a | SOTA | n/a | Hours (GPU) | n/a | n/a |
| 4 | vELECTRA | Transformer | ELECTRA (60GB vi) | ~400 MB | n/a | n/a | n/a | 95.26 | Hours (GPU) | n/a | n/a |
| 5 | SVM Multi | Traditional | SVM + BOW | n/a | 93.4 | n/a | n/a | n/a | ~79 min | n/a | n/a |
| 6 | sonar_core_1 | Traditional | TF-IDF + SVC (RBF) | ~75 MB | 92.80 | 72.47 | n/a | n/a | 54.6 min | n/a | 1.24 |
| 7 | Sen-1 | Traditional | TF-IDF + LinearSVC (Rust) | 2.4 MB | 92.49 | 75.76 | n/a | n/a | 37.6s | 66,678 | 38.5 |
| 8 | XLM-R-base | Multilingual | RoBERTa (100 langs) | ~1.1 GB | ~93 | n/a | n/a | n/a | Hours (GPU) | n/a | 0.08 |
| 9 | mBERT | Multilingual | BERT (104 langs) | ~700 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |
| 10 | viBERT | Transformer | BERT (10GB vi) | ~400 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |
| 11 | viBERT4news | Transformer | BERT (news domain) | ~400 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |
| 12 | MaxEnt baseline | Traditional | Maximum Entropy | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| 13 | CNN baseline | Traditional | Convolutional NN | n/a | n/a | n/a | 59.74 | n/a | n/a | n/a | n/a |
| 14 | DistilmBERT | Multilingual | DistilBERT (multilingual) | ~260 MB | n/a | n/a | n/a | n/a | Hours (GPU) | n/a | n/a |
* PhoBERT VNTC estimate based on similar Vietnamese classification tasks. Blank cells indicate benchmark not evaluated.
Efficiency = VNTC Accuracy / Model Size in MB. Higher is better.
1. Choosing Important Problems
"If you do not work on an important problem, it's unlikely you'll do important work." -- Richard Hamming
Hamming's Principles
- Maintain a list of 10-20 important problems in your field
- A problem becomes important when you have a reasonable attack
- Dedicate deep thinking time (Friday "Great Thoughts Time")
- Have courage to pursue unconventional ideas
Schulman's Framework (OpenAI)
- Work on the right problems
- Make continual progress
- Achieve continual personal growth
Develop "research taste" by reading broadly, collaborating widely, and asking: "If this succeeds, how big is the impact?"
Microsoft Research Asia (Dr. Ming Zhou)
- Read recent ACL proceedings to find your field
- Target "blue ocean" areas - new fields with less competition
- Verify 3 prerequisites: math/ML framework, standard datasets, active research teams
- Find gaps: what can be improved, combined, or inverted
2. Reading Papers
Effective Reading Strategy
- Read broadly: Not just NLP - also cognitive science, neuroscience, linguistics, vision
- Read deeply: Become the "world-leading expert" on your narrow question
- Read textbooks: More knowledge-dense than papers
- Follow citation chains via Google Scholar, Semantic Scholar
- Use PRISMA methodology for systematic reviews
3. Running Experiments
Step 1: Reproduce baselines first
"Reimplement existing state-of-the-art work first to validate your setup." -- Marek Rei
- Choose an open-source project, compile it, run the demo, and match the published results
- Understand the algorithm deeply, then reimplement it
- Test on the standard test set until your results align with the original
Step 2: Simple baseline (1-2 weeks)
Implement the simplest approach before building complex architectures. Verify your setup works.
Step 3: Rigorous experimentation
- Debug: Don't assume bug-free code. Test with toy examples. Add assertions.
- Evaluate: Separate train/dev/test. Run 10+ times. Report mean + std.
- Ablate: Significance tests and ablation studies for every novel component.
- Avoid: Single-run results, weak-only baselines, blind trend-following.
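The evaluation step above (run 10+ times, report mean + std) can be sketched with the standard library; the run scores below are hypothetical:

```python
import statistics

# Hypothetical dev-set accuracies from 10 runs with different random seeds
runs = [0.921, 0.925, 0.919, 0.923, 0.927, 0.920, 0.924, 0.922, 0.926, 0.918]

mean = statistics.mean(runs)
std = statistics.stdev(runs)  # sample standard deviation (n-1 denominator)
print(f"accuracy: {mean:.3f} +/- {std:.3f} over {len(runs)} runs")
```

Reporting the spread alongside the mean is what makes a claimed improvement checkable: a 0.3-point gain is meaningless if the run-to-run std is also 0.3.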
4. Writing Papers
ACL Paper Structure (Dr. Ming Zhou)
- Title: Specific, no generic words
- Abstract: Problem + Method + Advantage + Achievement
- Introduction: Background → existing approaches → their limitations → your contribution (≤3 points)
- Related Work: Organized by theme, not chronology
- Methodology: Problem definition → notation → formulas
- Experiments: Purpose → data → parameters → reproducibility
- Limitations: Required by ACL - honest assessment
Revision: 3 passes - self review → team review → outsider review.
5. Mindset & Habits
| Principle | Lesson |
|---|---|
| Open doors | Stay connected to the community; know emerging problems |
| Preparation | "Luck favors the prepared mind" (Pasteur) |
| Constraints | Difficult conditions often lead to breakthroughs |
| Commitment | Deep immersion activates subconscious problem-solving |
| Selling work | Presentation matters - great work needs effective communication |