# Comparison Tables: Vietnamese NLP Methods (4 Tasks, CPU-First)

Last Updated: 2026-02-07
Scope: Word Segmentation, POS Tagging, Chunking, Dependency Parsing
## 1. Vietnamese POS Tagging: Method Comparison

### 1.1 Results on VLSP 2013 Benchmark

| Model | Year | Method | Accuracy | Params | GPU | Inference Speed | Paper |
|---|---|---|---|---|---|---|---|
| ViDeBERTa-large | 2023 | DeBERTaV3 | 97.2% | 304M | Yes | Slow | Tran et al. |
| PhoBERT-large | 2020 | RoBERTa | 96.8% | 370M | Yes | Slow | Nguyen & Nguyen |
| vELECTRA | 2020 | ELECTRA | 96.77% | 110M | Yes | Medium | Bui et al. |
| PhoNLP | 2021 | PhoBERT+MTL | 96.76% | 135M+ | Yes | Slow | Nguyen & Nguyen |
| PhoBERT-base | 2020 | RoBERTa | 96.7% | 135M | Yes | Medium | Nguyen & Nguyen |
| VnMarMoT | 2018 | CRF (MarMoT) | 95.88% | ~1MB | No | 90K w/s | Vu et al. |
| TRE-1 | 2026 | CRF (python-crfsuite) | 95.89%\* | 2.3MB | No | Fast | This work |
| BiLSTM-CRF+CNN | 2018 | Neural+CRF | 95.40% | ~10M | Yes | Medium | Du et al. |
| RDRPOSTagger | 2014 | Ripple Down Rules | 95.11% | ~5MB | No | 8K w/s | Nguyen et al. |
| JointWPD | 2018 | Joint neural | 94.03% | ~10M | Yes | Slow | Nguyen |
| BiLSTM-CRF | 2018 | Neural+CRF | 93.52% | ~10M | Yes | Medium | Du et al. |

\*Evaluated on UDD-1, not VLSP 2013.
### 1.2 Method Categories

| Category | Best Model | Accuracy | Pros | Cons |
|---|---|---|---|---|
| Pre-trained Transformer | ViDeBERTa-large | 97.2% | Highest accuracy, transfer learning | GPU required, slow, large model |
| Multi-task Transformer | PhoNLP | 96.76% | Shared representations, joint tasks | Complex training, GPU required |
| ELECTRA | vELECTRA | 96.77% | More sample-efficient pre-training | GPU required |
| CRF | VnMarMoT / TRE-1 | 95.88-95.89% | No GPU, fast, interpretable, small model | Manual features, limited context |
| BiLSTM-CRF | BiLSTM-CRF+CNN | 95.40% | Automatic features, CRF constraints | GPU needed, no pre-training |
| Rule-based | RDRPOSTagger | 95.11% | Very fast, interpretable rules | Lower accuracy |
## 2. Vietnamese Word Segmentation: Method Comparison

### 2.1 Results on VLSP 2013 Benchmark

| Model | Year | Method | F1 Score | GPU | Paper |
|---|---|---|---|---|---|
| UITws-v1 | 2019 | SVM + ambiguity reduction | 98.06% | No | Nguyen et al. |
| RDRsegmenter | 2018 | Rule-based decision trees | 97.90% | No | Vu et al. |
| jPTDP-v2 | 2018 | Joint neural | 97.90% | Yes | Nguyen |
| UETsegmenter | 2016 | ML-based | 97.87% | No | -- |
| JointWPD | 2018 | Multi-task learning | 97.78% | Yes | Nguyen |
| vnTokenizer | 2008 | Rule-based | 97.33% | No | -- |
| JVnSegmenter | 2006 | CRF + SVM | 97.06% | No | Nguyen et al. |
| DongDu | -- | Dictionary-based | 96.90% | No | -- |
| LSTM+CNN | 2022 | Deep neural | 96.3%\*\* | Yes | Zheng et al. |

\*\*Evaluated on a different dataset.
### 2.2 TRE-1 WS Results on UDD-1

| Level | Precision | Recall | F1 |
|---|---|---|---|
| Syllable | -- | -- | 98.90% (acc) |
| Word | 98.02% | 98.01% | 98.01% |
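The word-level scores above are span-based: BIO tags over syllables are decoded into word spans, and a word counts as correct only if both of its boundaries match the gold segmentation. A minimal sketch of that evaluation (the helper names are illustrative, not the TRE-1 API):

```python
def bio_to_spans(tags):
    """Convert BIO tags over syllables into (start, end) word spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))  # close the previous word
            start = i
        elif tag == "I" and start is not None:
            continue                      # extend the current word
        else:                             # "O" or stray "I": one-syllable word
            if start is not None:
                spans.append((start, i))
                start = None
            spans.append((i, i + 1))
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def word_f1(gold_tags, pred_tags):
    """Word-level precision/recall/F1 over exactly matching spans."""
    gold = set(bio_to_spans(gold_tags))
    pred = set(bio_to_spans(pred_tags))
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```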
## 3. CRF vs Neural Methods: Cross-Language Evidence

### 3.1 Feature-based CRF vs BiLSTM-CRF (Without Pre-trained LM)

| Language | CRF Accuracy | BiLSTM-CRF Accuracy | Winner | Paper |
|---|---|---|---|---|
| Vietnamese | 95.88% | 93.52% | CRF | VnMarMoT vs Du et al. |
| Burmese | 98.18%\* | 97.85%\* | CRF | Thant et al. 2025 |
| Assamese | 85.0%\*\* | 74.6%\*\* | CRF/Rules | Pathak et al. 2023 |
| Uzbek | 88% | -- | CRF (vs HMM 82%) | Jamoldinova et al. 2025 |
| Amazigh | Best | Lower | CRF | Amri et al. 2024 |
| Khasi | -- | 96.98% | BiLSTM-CRF | Warjri et al. 2021 |
| English (PTB) | ~97.0% | 97.55%\*\*\* | Neural | Ma & Hovy 2016 |

\*With fastText features. \*\*F1 score. \*\*\*BiLSTM-CNN-CRF.
Key insight: For low-resource and morphologically rich languages, CRF with proper feature engineering often matches or outperforms neural methods without pre-trained representations.
### 3.2 CRF vs Transformer (With Pre-trained LM)

| Language | CRF Accuracy | Transformer Accuracy | Gap | Notes |
|---|---|---|---|---|
| Vietnamese | 95.89% | 97.2% | -1.31% | TRE-1 vs ViDeBERTa-large |
| English | ~97.0% | 97.85%+ | -0.85% | CRF vs BERT fine-tuned |
| Arabic | SVM-Rank SOTA | Competitive | ~0% | Darwish et al. 2017 |
Key insight: Pre-trained transformers consistently outperform CRF by 0.8-1.3%, but the gap is narrower than might be expected given the vast difference in model complexity.
## 4. Feature Template Comparison

### 4.1 POS Tagging Feature Templates

| Feature Category | TRE-1 (27) | VnMarMoT | sklearn-crfsuite tutorial | Ratnaparkhi 1996 |
|---|---|---|---|---|
| Word form | Yes | Yes | Yes | Yes |
| Lowercase | Yes | Yes | Yes | -- |
| Case type (title/upper) | Yes | -- | Yes | Yes |
| Is digit | Yes | -- | Yes | Yes |
| Is alpha | Yes | -- | -- | -- |
| Prefix (2-3 char) | Yes | Yes | Yes | Yes (1-4) |
| Suffix (2-3 char) | Yes | Yes | Yes | Yes (1-4) |
| Context T[-1] | Yes | Yes | Yes | Yes |
| Context T[-2] | Yes | Yes | Yes | Yes |
| Context T[+1] | Yes | Yes | Yes | Yes |
| Context T[+2] | Yes | Yes | Yes | -- |
| Bigrams T[-1,0] | Yes | -- | -- | -- |
| Bigrams T[0,1] | Yes | -- | -- | -- |
| Dictionary lookup | Yes | -- | -- | -- |
| BOS/EOS markers | Yes | Yes | Yes | -- |
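As a concrete illustration, the TRE-1 column could be realised as a python-crfsuite-style feature function along these lines. This is a hedged sketch, not the actual TRE-1 code: the feature names are invented here, the dictionary-lookup template is omitted, and the bigram templates are shown as word bigrams.

```python
def pos_features(words, i):
    """Feature dict for token i, covering a subset of the templates above."""
    w = words[i]
    feats = {
        "w": w,
        "w.lower": w.lower(),
        "w.istitle": w.istitle(),   # case type
        "w.isupper": w.isupper(),
        "w.isdigit": w.isdigit(),
        "w.isalpha": w.isalpha(),
        "prefix2": w[:2], "prefix3": w[:3],
        "suffix2": w[-2:], "suffix3": w[-3:],
    }
    # Context window T[-2..+2], with BOS/EOS markers at sentence edges
    for off in (-2, -1, 1, 2):
        j = i + off
        feats[f"w[{off}]"] = words[j] if 0 <= j < len(words) else ("BOS" if j < 0 else "EOS")
    # Bigram templates around the current position
    feats["w[-1]|w"] = feats["w[-1]"] + "|" + w
    feats["w|w[1]"] = w + "|" + feats["w[1]"]
    return feats
```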
### 4.2 Word Segmentation Feature Templates

| Feature Category | TRE-1 (21) | JVnSegmenter (2006) | UITws-v1 (2019) |
|---|---|---|---|
| Syllable form | Yes | Yes | Yes |
| Lowercase | Yes | Yes | Yes |
| Case type | Yes | -- | -- |
| Is digit | Yes | Yes | -- |
| Is punctuation | Yes | -- | -- |
| Syllable length | Yes | -- | -- |
| Prefix (2 char) | Yes | Yes | Yes |
| Suffix (2 char) | Yes | Yes | Yes |
| Context S[-1] | Yes | Yes | Yes |
| Context S[-2] | Yes | Yes | Yes |
| Context S[+1] | Yes | Yes | Yes |
| Context S[+2] | Yes | Yes | Yes |
| Bigrams S[-1,0] | Yes | Yes | Yes |
| Bigrams S[0,1] | Yes | Yes | Yes |
| Trigrams S[-1,0,1] | Yes | -- | -- |
| Syllable type n-grams | -- | -- | Yes |
| Ambiguity reduction | -- | -- | Yes |
| Dictionary conjunction | -- | -- | Yes |
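The TRE-1 segmentation column can likewise be sketched as a feature function over syllables (again with invented feature names; the real 21 templates may differ in detail):

```python
import string

def ws_features(syllables, i):
    """Feature dict for syllable i, in the spirit of the TRE-1 column above."""
    def at(j):  # context accessor with BOS/EOS markers
        return syllables[j] if 0 <= j < len(syllables) else ("BOS" if j < 0 else "EOS")
    s = syllables[i]
    feats = {
        "s": s,
        "s.lower": s.lower(),
        "s.istitle": s.istitle(),  # case type
        "s.isdigit": s.isdigit(),
        "s.ispunct": all(c in string.punctuation for c in s),
        "s.len": str(len(s)),
        "prefix2": s[:2],
        "suffix2": s[-2:],
    }
    for off in (-2, -1, 1, 2):                                   # context S[-2..+2]
        feats[f"s[{off}]"] = at(i + off)
    feats["s[-1]|s"] = at(i - 1) + "|" + s                       # bigram S[-1,0]
    feats["s|s[1]"] = s + "|" + at(i + 1)                        # bigram S[0,1]
    feats["s[-1]|s|s[1]"] = at(i - 1) + "|" + s + "|" + at(i + 1)  # trigram
    return feats
```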
## 5. Training Configuration Comparison

| Parameter | TRE-1 | sklearn-crfsuite default | Literature range |
|---|---|---|---|
| Algorithm | L-BFGS | L-BFGS | L-BFGS, SGD, AROW |
| L1 (c1) | 1.0 | 0.0 | 0.01 - 1.0 |
| L2 (c2) | 0.001 | 1.0 | 0.001 - 0.1 |
| Max iterations | 100 | 1000 | 50 - 500 |
| All transitions | True | False | True recommended |
Note: TRE-1 uses high L1 (1.0) for strong feature selection/sparsity and low L2 (0.001). The sklearn-crfsuite tutorial suggests c1=0.1, c2=0.1 as alternative starting points.
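In python-crfsuite terms, the TRE-1 column corresponds to a parameter dict like the one below (a sketch; the variable name and model path are illustrative, and the actual training call is shown commented out so the snippet stands alone):

```python
# TRE-1 training configuration from the table above:
# L-BFGS with strong L1, weak L2, 100 iterations, all transitions enabled.
TRE1_CRF_PARAMS = {
    "c1": 1.0,                             # L1 coefficient: strong feature selection
    "c2": 0.001,                           # L2 coefficient
    "max_iterations": 100,
    "feature.possible_transitions": True,  # score all label transitions
}

# Applied with python-crfsuite roughly as:
#   import pycrfsuite
#   trainer = pycrfsuite.Trainer(algorithm="lbfgs")
#   trainer.set_params(TRE1_CRF_PARAMS)
#   # ... trainer.append(features, labels) per sentence ...
#   trainer.train("tre1-pos.model")
```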
## 6. Dataset Comparison

| Dataset | Language | Sentences | Tokens | Domain | Annotation | Access | POS Tags |
|---|---|---|---|---|---|---|---|
| VLSP 2013 | Vietnamese | 27,870 | ~650K | News | Manual | Request | Vietnamese tagset |
| VietTreeBank | Vietnamese | 10,000+ | ~200K | Mixed | Manual | Research | Vietnamese |
| UDD-1 | Vietnamese | 20,000 | ~453K | Legal+News | Machine | HuggingFace | 15 UD tags |
| VnDT v1.1 | Vietnamese | 10,197 | ~220K | News | Manual+Auto | GitHub | UD |
| UD Vietnamese-VTB | Vietnamese | 1,400 | ~39K | Wiki | Manual | GitHub | 17 UD tags |
| Penn Treebank | English | 49,208 | ~1.2M | WSJ | Manual | LDC | 45 PTB tags |
## 7. Vietnamese Pre-trained Language Model Comparison

| Model | Year | Architecture | Params | Pre-train Data | POS (VLSP) | NER (VLSP) | Venue |
|---|---|---|---|---|---|---|---|
| PhoBERT-base | 2020 | RoBERTa | 135M | 20GB | 96.7% | 94.0% | EMNLP |
| PhoBERT-large | 2020 | RoBERTa | 370M | 20GB | 96.8% | 94.5% | EMNLP |
| vELECTRA | 2020 | ELECTRA | 110M | 60GB | 96.77% | -- | PACLIC |
| BARTpho | 2021 | BART | Large | 20GB | -- | -- | INTERSPEECH |
| PhoNLP | 2021 | PhoBERT+MTL | 135M+ | 20GB | 96.76% | 94.41% | NAACL |
| ViT5 | 2022 | T5 | 310M/866M | CC100 | -- | -- | NAACL SRW |
| ViDeBERTa-xsmall | 2023 | DeBERTaV3 | 22M | 138GB | -- | -- | EACL |
| ViDeBERTa-base | 2023 | DeBERTaV3 | 86M | 138GB | 96.8% | 94.7% | EACL |
| ViDeBERTa-large | 2023 | DeBERTaV3 | 304M | 138GB | 97.2% | 95.3% | EACL |
| ViSoBERT | 2023 | XLM-R | 278M | Social media | -- | -- | EMNLP |
| PhoGPT | 2023 | GPT | 3.7B | 482GB | -- | -- | arXiv |
## 8. Chunking: Method Comparison

### 8.1 English CoNLL-2000 (Reference Benchmark)

| Model | Year | Method | F1 | Neural? | CPU? |
|---|---|---|---|---|---|
| SS05 (HMM+voting) | 2005 | Specialized HMM | 95.23% | No | Yes |
| BiLSTM-CRF | 2015 | Neural+CRF | 94.46% | Yes | Optional |
| S08 (Latent-Dynamic CRF) | 2008 | CRF variant | 94.34% | No | Yes |
| SP03 (CRF) | 2003 | CRF | 94.30% | No | Yes |
| M05 (2nd-order CRF) | 2005 | CRF | 94.29% | No | Yes |
| KM01 (SVM ensemble) | 2001 | SVM | 94.22% | No | Yes |
| C00 (Charniak) | 2000 | Parser-based | 94.20% | No | Yes |
| KM00 (SVM) | 2000 | SVM | 93.79% | No | Yes |
Key insight: Non-neural methods still dominate CoNLL-2000 chunking; plain CRF (94.30%) comes within 0.2 points of BiLSTM-CRF (94.46%).
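CoNLL-2000 F1 is exact-match over typed chunk spans decoded from BIO tags. A minimal sketch of that scoring (not the official conlleval script):

```python
def chunks(tags):
    """Decode BIO tags (B-NP, I-NP, O, ...) into (type, start, end) spans."""
    out, start, ctype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last chunk
        # A new chunk starts on B-, on O, or on a type-mismatched I- tag.
        if tag.startswith("B-") or tag == "O" or tag[2:] != ctype:
            if start is not None:
                out.append((ctype, start, i))
            start, ctype = (i, tag[2:]) if tag != "O" else (None, None)
    return out

def chunk_f1(gold_tags, pred_tags):
    """Exact-match F1 over chunk spans, as in CoNLL-2000 evaluation."""
    gold, pred = set(chunks(gold_tags)), set(chunks(pred_tags))
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```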
### 8.2 Vietnamese Chunking

| Model | Year | Method | F1 | Neural? |
|---|---|---|---|---|
| Lai et al. (NP) | 2019 | BiLSTM-CRF + rules | 88.40% | Yes |
| Nguyen et al. (NP) | 2009 | CRF/MaxEnt/SVM | -- | No |
| underthesea | ongoing | CRF | -- | No |
Vietnamese chunking remains under-researched, but the CRF approach has already been shown viable.
### 8.3 Chunking Feature Templates

| Feature | CoNLL-2000 Standard | TRE-1 Planned |
|---|---|---|
| POS tag (current) | Essential | Yes (from TRE-1 POS model) |
| POS tag (context ±2) | Essential | Yes |
| Word form | Standard | Yes |
| Lowercase | Standard | Yes |
| Prefix/suffix | Sometimes | Yes |
| Previous BIO tag | Standard | Yes |
| Bigrams (POS) | Helpful | Yes |
| Case type | Sometimes | Yes |
## 9. Dependency Parsing: Method Comparison

### 9.1 Vietnamese Dependency Parsing (VnDT)

| Model | Year | Dataset | Method | UAS | LAS | Neural? | CPU? |
|---|---|---|---|---|---|---|---|
| PhoNLP | 2021 | v1.1 | PhoBERT+MTL | 85.47 | 79.11 | Yes | No |
| PhoBERT-base | 2020 | v1.1 | Fine-tuned biaffine | 85.22 | 78.77 | Yes | No |
| PhoBERT-large | 2020 | v1.1 | Fine-tuned biaffine | 84.32 | 77.85 | Yes | No |
| Biaffine | 2017 | v1.1 | Neural biaffine | 81.19 | 74.99 | Yes | No |
| jointWPD | 2018 | v1.1 | Neural joint | 80.12 | 73.90 | Yes | No |
| jPTDP-v2 | 2018 | v1.1 | Neural joint | 79.63 | 73.12 | Yes | No |
| VnCoreNLP | 2018 | v1.1 | Hybrid | 77.35 | 71.38 | Partial | Yes |
| MSTParser | 2015 | v1.0 | Graph-based | 76.58 | 70.10 | No | Yes |
| MaltParser | 2015 | v1.0 | Transition-based | 76.08 | 69.88 | No | Yes |
| TRE-1 (stacklazy) | 2026 | v1.1 | MaltParser default | 72.58 | 65.87 | No | Yes |
Note: Nguyen & Nguyen (2015) results were on VnDT v1.0. VnDT v1.1 (Dec 2018) fixed annotation errors. TRE-1 uses default MaltParser features (no MaltOptimizer tuning).
#### 9.1.1 TRE-1 Algorithm Comparison (VnDT v1.1, Predicted POS)

| Algorithm | UAS | LAS | Train Time |
|---|---|---|---|
| stacklazy | 72.58% | 65.87% | 34s |
| stackproj | 72.48% | 65.86% | 34s |
| nivrestandard | 72.42% | 65.84% | 40s |
| stackeager | 72.24% | 65.61% | 36s |
| nivreeager | 72.07% | 65.40% | 37s |
| covproj | 71.27% | 64.74% | 42s |
| covnonproj | 71.07% | 64.53% | 86s |
Gold POS (nivreeager): 74.36% UAS / 69.04% LAS (+2.3% UAS over predicted POS).
### 9.2 English Penn Treebank (Non-Neural Reference)

| Model | Year | Method | UAS | Neural? |
|---|---|---|---|---|
| Zhang & Nivre | 2011 | Transition + 72 features | 92.9% | No |
| MSTParser 2nd-order | 2006 | Graph-based | 91.5% | No |
| MaltParser | 2006 | Transition + SVM | 90.1% | No |
### 9.3 Parsing Paradigms Comparison

| Paradigm | Algorithm | Complexity | Approach | CPU | Example |
|---|---|---|---|---|---|
| Transition-based | Arc-standard/eager | O(n) | Greedy, local | Yes | MaltParser |
| Graph-based | MST (Chu-Liu-Edmonds) | O(n^2) | Global, exact | Yes | MSTParser |
| Neural biaffine | Biaffine attention | O(n^2) | Neural, global | No | Dozat & Manning |
| Pre-trained | PhoBERT + biaffine | O(n^2) | Transfer learning | No | PhoNLP |
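To make the transition-based row concrete, here is a minimal arc-standard system driven by a static oracle against a gold (projective) tree. In a real parser such as MaltParser, the oracle is replaced at test time by a classifier over feature templates; this sketch shows only the transition mechanics:

```python
def arc_standard_oracle(gold_heads):
    """Recover a projective tree via SHIFT / LEFT-ARC / RIGHT-ARC.

    gold_heads maps token index (1..n) to its head index (0 = ROOT).
    Returns the heads reconstructed by following the static oracle.
    """
    n = len(gold_heads)
    heads = {}
    stack, buffer = [0], list(range(1, n + 1))
    pending = {i: 0 for i in range(n + 1)}  # dependents not yet attached
    for h in gold_heads.values():
        pending[h] += 1
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s0, s1 = stack[-1], stack[-2]
            if s1 != 0 and gold_heads[s1] == s0:               # LEFT-ARC
                heads[s1] = s0
                pending[s0] -= 1
                stack.pop(-2)
                continue
            if gold_heads.get(s0) == s1 and pending[s0] == 0:  # RIGHT-ARC
                heads[s0] = s1
                pending[s1] -= 1
                stack.pop()
                continue
        if not buffer:
            break  # oracle stuck (non-projective input)
        stack.append(buffer.pop(0))                            # SHIFT
    return heads
```

Note that RIGHT-ARC is only taken once the candidate dependent has collected all of its own dependents, which is what makes arc-standard a bottom-up strategy.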
### 9.4 Dependency Parsing Feature Templates (Zhang & Nivre 2011)

| Feature Category | Count | Description |
|---|---|---|
| Stack/buffer word | ~10 | Word forms at stack top, buffer front |
| Stack/buffer POS | ~10 | POS tags at stack/buffer positions |
| Distance | ~3 | Number of tokens between head/dependent |
| Direction | ~3 | Left/right arc features |
| Valency | ~6 | Count of left/right dependents |
| Grandparent | ~8 | Features of head's head |
| Sibling | ~8 | Features of adjacent dependents |
| Trigram | ~12 | Head + dependent + sibling/grandparent combos |
| Total | ~72 | Rich non-local feature templates |
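A few of these templates, sketched as a feature function over a parser configuration (stack, buffer, partial arcs). The feature names and the distance bucketing are illustrative choices, not Zhang & Nivre's exact templates:

```python
def parser_features(stack, buffer, heads, words, pos):
    """Small subset of Zhang & Nivre-style templates for one configuration."""
    feats = {}
    s0 = stack[-1] if stack else None   # stack top
    b0 = buffer[0] if buffer else None  # buffer front
    if s0 is not None:
        feats["s0.w"], feats["s0.p"] = words[s0], pos[s0]
        # valency: number of dependents already attached to s0
        feats["s0.valency"] = sum(1 for h in heads.values() if h == s0)
    if b0 is not None:
        feats["b0.w"], feats["b0.p"] = words[b0], pos[b0]
    if s0 is not None and b0 is not None:
        feats["dist"] = min(b0 - s0, 5)               # bucketed linear distance
        feats["s0.p|b0.p"] = pos[s0] + "|" + pos[b0]  # POS-pair template
    return feats
```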
## 10. Cross-Task Summary: Non-Neural vs Neural Gap

| Task | Best Non-Neural | Best Neural | Gap | CPU Priority |
|---|---|---|---|---|
| Word Segmentation | 98.06% F1 (SVM) | 97.90% F1 (jPTDP) | +0.16% | Excellent |
| POS Tagging | 95.88% acc (CRF) | 97.2% acc (DeBERTa) | -1.32% | Good |
| Chunking (English) | 95.23% F1 (HMM) | 94.46% F1 (BiLSTM) | +0.77% | Excellent |
| Dep Parsing (Viet) | 76.58% UAS (MST) | 85.47% UAS (PhoBERT) | -8.89% | Moderate |
## 11. TRE-1 Pipeline Architecture

```
Input text
     │
     ▼
┌───────────────────────┐
│ Word Segmentation     │  CRF (BIO tagging)  98.01% F1
│ (syllable → word)     │  Model: ~1.1 MB
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ POS Tagging           │  CRF (27 features)  95.89% acc
│ (word → POS tag)      │  Model: ~2.3 MB
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ Chunking              │  CRF (BIO tagging)  [Planned]
│ (word+POS → chunk)    │  Features: POS + word
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ Dependency Parsing    │  Transition/Graph-based  [Planned]
│ (word+POS → tree)     │  Features: ~72 templates
└───────────────────────┘
```

Total pipeline: < 20 MB, CPU-only, fast inference.
The pipeline design follows Nguyen et al. (2017), who found that a pipeline approach outperforms joint modeling for Vietnamese.
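End to end, the pipeline is just function composition: each stage consumes the previous stage's output. The sketch below uses trivial stub implementations to show the data flow; the function names, the toy lexicon, and the tagging heuristic are all hypothetical, not the TRE-1 API:

```python
LEXICON = {("Hà", "Nội")}  # toy multi-syllable word list (hypothetical)

def segment_words(syllables):
    """Stub standing in for the CRF BIO segmenter: greedy bigram lookup."""
    out, i = [], 0
    while i < len(syllables):
        pair = tuple(syllables[i:i + 2])
        if len(pair) == 2 and pair in LEXICON:
            out.append("_".join(pair))  # join syllables into one word
            i += 2
        else:
            out.append(syllables[i])
            i += 1
    return out

def tag_pos(words):
    """Stub standing in for the CRF POS tagger: capitalisation heuristic."""
    return [(w, "Np" if w[:1].isupper() else "X") for w in words]

def pipeline(text):
    """Segment, then tag -- each stage feeds the next."""
    return tag_pos(segment_words(text.split()))
```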