
# Comparison Tables: Vietnamese NLP Methods (4 Tasks, CPU-First)

Last Updated: 2026-02-07
Scope: Word Segmentation, POS Tagging, Chunking, Dependency Parsing

## 1. Vietnamese POS Tagging: Method Comparison

### 1.1 Results on VLSP 2013 Benchmark

| Model | Year | Method | Accuracy | Params | GPU | Inference Speed | Paper |
|---|---|---|---|---|---|---|---|
| ViDeBERTa-large | 2023 | DeBERTaV3 | 97.2% | 304M | Yes | Slow | Tran et al. |
| PhoBERT-large | 2020 | RoBERTa | 96.8% | 370M | Yes | Slow | Nguyen & Nguyen |
| vELECTRA | 2020 | ELECTRA | 96.77% | 110M | Yes | Medium | Bui et al. |
| PhoNLP | 2021 | PhoBERT+MTL | 96.76% | 135M+ | Yes | Slow | Nguyen & Nguyen |
| PhoBERT-base | 2020 | RoBERTa | 96.7% | 135M | Yes | Medium | Nguyen & Nguyen |
| VnMarMoT | 2018 | CRF (MarMoT) | 95.88% | ~1MB | No | 90K w/s | Vu et al. |
| TRE-1 | 2026 | CRF (python-crfsuite) | 95.89%* | 2.3MB | No | Fast | This work |
| BiLSTM-CRF+CNN | 2018 | Neural+CRF | 95.40% | ~10M | Yes | Medium | DU et al. |
| RDRPOSTagger | 2014 | Ripple Down Rules | 95.11% | ~5MB | No | 8K w/s | Nguyen et al. |
| JointWPD | 2018 | Joint neural | 94.03% | ~10M | Yes | Slow | Nguyen |
| BiLSTM-CRF | 2018 | Neural+CRF | 93.52% | ~10M | Yes | Medium | DU et al. |

*Evaluated on UDD-1, not VLSP 2013.

### 1.2 Method Categories

| Category | Best Model | Accuracy | Pros | Cons |
|---|---|---|---|---|
| Pre-trained Transformer | ViDeBERTa-large | 97.2% | Highest accuracy, transfer learning | GPU required, slow, large model |
| Multi-task Transformer | PhoNLP | 96.76% | Shared representations, joint tasks | Complex training, GPU required |
| ELECTRA | vELECTRA | 96.77% | More sample-efficient pre-training | GPU required |
| CRF | VnMarMoT / TRE-1 | 95.88–95.89% | No GPU, fast, interpretable, small model | Manual features, limited context |
| BiLSTM-CRF | BiLSTM-CRF+CNN | 95.40% | Automatic features, CRF constraints | GPU needed, no pre-training |
| Rule-based | RDRPOSTagger | 95.11% | Very fast, interpretable rules | Lower accuracy |

## 2. Vietnamese Word Segmentation: Method Comparison

### 2.1 Results on VLSP 2013 Benchmark

| Model | Year | Method | F1 Score | GPU | Paper |
|---|---|---|---|---|---|
| UITws-v1 | 2019 | SVM + ambiguity reduction | 98.06% | No | Nguyen et al. |
| RDRsegmenter | 2018 | Rule-based decision trees | 97.90% | No | Vu et al. |
| jPTDP-v2 | 2018 | Joint neural | 97.90% | Yes | Nguyen |
| UETsegmenter | 2016 | ML-based | 97.87% | No | -- |
| JointWPD | 2018 | Multi-task learning | 97.78% | Yes | Nguyen |
| vnTokenizer | 2008 | Rule-based | 97.33% | No | -- |
| JVnSegmenter | 2006 | CRF + SVM | 97.06% | No | Nguyen et al. |
| DongDu | -- | Dictionary-based | 96.90% | No | -- |
| LSTM+CNN | 2022 | Deep neural | 96.3%** | Yes | Zheng et al. |

**Evaluated on a different dataset.

### 2.2 TRE-1 WS Results on UDD-1

| Level | Precision | Recall | F1 |
|---|---|---|---|
| Syllable | -- | -- | 98.90% (accuracy) |
| Word | 98.02% | 98.01% | 98.01% |
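
The word-level scores come from casting segmentation as BIO tagging over syllables: the CRF marks each syllable as beginning (B) or continuing (I) a word, and labeled spans are merged into underscore-joined words. A minimal decoding sketch (the exact label inventory in TRE-1 is an assumption; the standard B/I scheme is shown):

```python
def bio_to_words(syllables, labels):
    """Merge syllables into underscore-joined words from BIO labels."""
    words = []
    for syllable, label in zip(syllables, labels):
        if label == "I" and words:
            words[-1] += "_" + syllable   # continue the current word
        else:
            words.append(syllable)        # "B" starts a new word
    return words

# "học sinh" (student) is a single two-syllable word:
print(bio_to_words(["học", "sinh", "giỏi"], ["B", "I", "B"]))
# -> ['học_sinh', 'giỏi']
```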

## 3. CRF vs Neural Methods: Cross-Language Evidence

### 3.1 Feature-based CRF vs BiLSTM-CRF (Without Pre-trained LM)

| Language | CRF Accuracy | BiLSTM-CRF Accuracy | Winner | Paper |
|---|---|---|---|---|
| Vietnamese | 95.88% | 93.52% | CRF | VnMarMoT vs DU et al. |
| Burmese | 98.18%* | 97.85%* | CRF | Thant et al. 2025 |
| Assamese | 85.0%** | 74.6%** | CRF/Rules | Pathak et al. 2023 |
| Uzbek | 88% | -- | CRF (vs HMM 82%) | Jamoldinova et al. 2025 |
| Amazigh | Best | Lower | CRF | Amri et al. 2024 |
| Khasi | -- | 96.98% | BiLSTM-CRF | Warjri et al. 2021 |
| English (PTB) | ~97.0% | 97.55%*** | Neural | Ma & Hovy 2016 |

*With fastText features. **F1 score. ***BiLSTM-CNN-CRF.

Key insight: For low-resource and morphologically rich languages, CRF with proper feature engineering often matches or outperforms neural methods without pre-trained representations.

### 3.2 CRF vs Transformer (With Pre-trained LM)

| Language | CRF Accuracy | Transformer Accuracy | Gap | Notes |
|---|---|---|---|---|
| Vietnamese | 95.89% | 97.2% | -1.31% | TRE-1 vs ViDeBERTa-large |
| English | ~97.0% | 97.85%+ | -0.85% | CRF vs BERT fine-tuned |
| Arabic | SVM-Rank SOTA | Competitive | ~0% | Darwish et al. 2017 |

Key insight: Pre-trained transformers consistently outperform CRF by 0.8-1.3%, but the gap is narrower than might be expected given the vast difference in model complexity.


## 4. Feature Template Comparison

### 4.1 POS Tagging Feature Templates

| Feature Category | TRE-1 (27) | VnMarMoT | sklearn-crfsuite tutorial | Ratnaparkhi 1996 |
|---|---|---|---|---|
| Word form | Yes | Yes | Yes | Yes |
| Lowercase | Yes | Yes | Yes | -- |
| Case type (title/upper) | Yes | -- | Yes | Yes |
| Is digit | Yes | -- | Yes | Yes |
| Is alpha | Yes | -- | -- | -- |
| Prefix (2-3 char) | Yes | Yes | Yes | Yes (1-4) |
| Suffix (2-3 char) | Yes | Yes | Yes | Yes (1-4) |
| Context T[-1] | Yes | Yes | Yes | Yes |
| Context T[-2] | Yes | Yes | Yes | Yes |
| Context T[+1] | Yes | Yes | Yes | Yes |
| Context T[+2] | Yes | Yes | Yes | -- |
| Bigrams T[-1,0] | Yes | -- | -- | -- |
| Bigrams T[0,1] | Yes | -- | -- | -- |
| Dictionary lookup | Yes | -- | -- | -- |
| BOS/EOS markers | Yes | Yes | Yes | -- |
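
For concreteness, here is how these categories map onto a python-crfsuite feature dict. This is a sketch only: the names are illustrative rather than TRE-1's exact 27 templates, and the dictionary-lookup feature is omitted:

```python
def word2features(sent, i):
    """Feature dict for token i of a word-segmented sentence (sketch)."""
    w = sent[i]
    feats = {
        "word": w,
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),   # case type
        "word.isupper": w.isupper(),
        "word.isdigit": w.isdigit(),
        "word.isalpha": w.isalpha(),
        "prefix2": w[:2], "prefix3": w[:3],
        "suffix2": w[-2:], "suffix3": w[-3:],
    }
    # Context window T[-2..+2] plus word bigrams around position i.
    if i > 0:
        feats["-1:word"] = sent[i - 1]
        feats["bigram[-1,0]"] = sent[i - 1] + "||" + w
    else:
        feats["BOS"] = True
    if i > 1:
        feats["-2:word"] = sent[i - 2]
    if i < len(sent) - 1:
        feats["+1:word"] = sent[i + 1]
        feats["bigram[0,1]"] = w + "||" + sent[i + 1]
    else:
        feats["EOS"] = True
    if i < len(sent) - 2:
        feats["+2:word"] = sent[i + 2]
    return feats
```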

### 4.2 Word Segmentation Feature Templates

| Feature Category | TRE-1 (21) | JVnSegmenter (2006) | UITws-v1 (2019) |
|---|---|---|---|
| Syllable form | Yes | Yes | Yes |
| Lowercase | Yes | Yes | Yes |
| Case type | Yes | -- | -- |
| Is digit | Yes | Yes | -- |
| Is punctuation | Yes | -- | -- |
| Syllable length | Yes | -- | -- |
| Prefix (2 char) | Yes | Yes | Yes |
| Suffix (2 char) | Yes | Yes | Yes |
| Context S[-1] | Yes | Yes | Yes |
| Context S[-2] | Yes | Yes | Yes |
| Context S[+1] | Yes | Yes | Yes |
| Context S[+2] | Yes | Yes | Yes |
| Bigrams S[-1,0] | Yes | Yes | Yes |
| Bigrams S[0,1] | Yes | Yes | Yes |
| Trigrams S[-1,0,1] | Yes | -- | -- |
| Syllable type n-grams | -- | -- | Yes |
| Ambiguity reduction | -- | -- | Yes |
| Dictionary conjunction | -- | -- | Yes |
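
The segmentation templates largely overlap with the POS set in 4.1; the segmentation-specific rows can be sketched as extra features on top of a word2features-style base (names are again illustrative, not TRE-1's exact templates):

```python
import string

def syll2features(syls, i):
    """Segmentation-specific feature sketch (illustrative names)."""
    s = syls[i]
    feats = {
        "syl.lower": s.lower(),
        "syl.len": len(s),     # syllable length
        "syl.ispunct": all(c in string.punctuation for c in s),
    }
    # Trigram S[-1,0,1]: a template the POS tagger does not use.
    if 0 < i < len(syls) - 1:
        feats["trigram"] = "||".join([syls[i - 1], s, syls[i + 1]])
    return feats
```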

## 5. Training Configuration Comparison

| Parameter | TRE-1 | sklearn-crfsuite default | Literature range |
|---|---|---|---|
| Algorithm | L-BFGS | L-BFGS | L-BFGS, SGD, AROW |
| L1 (c1) | 1.0 | 0.0 | 0.01–1.0 |
| L2 (c2) | 0.001 | 1.0 | 0.001–0.1 |
| Max iterations | 100 | 1000 | 50–500 |
| All transitions | True | False | True recommended |

Note: TRE-1 uses a high L1 penalty (c1 = 1.0) for strong feature selection/sparsity and a low L2 penalty (c2 = 0.001). The sklearn-crfsuite tutorial suggests c1 = 0.1, c2 = 0.1 as an alternative starting point.
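
A minimal sketch of these settings applied through python-crfsuite (the toy data and model filename are illustrative; real training appends one feature/label sequence per sentence, built from the templates in Section 4):

```python
import pycrfsuite

trainer = pycrfsuite.Trainer(verbose=False)   # default algorithm: L-BFGS

# Toy sequence; real training appends every sentence in the corpus.
xseq = [{"word": "Tôi"}, {"word": "học"}]     # feature dicts per token
yseq = ["P", "V"]                             # gold tags (illustrative)
trainer.append(xseq, yseq)

trainer.set_params({
    "c1": 1.0,                                # strong L1: sparse model
    "c2": 0.001,                              # weak L2
    "max_iterations": 100,
    "feature.possible_transitions": True,     # score unseen transitions
})
trainer.train("pos.crfsuite")                 # writes the model file
```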


## 6. Dataset Comparison

| Dataset | Language | Sentences | Tokens | Domain | Annotation | Access | POS Tags |
|---|---|---|---|---|---|---|---|
| VLSP 2013 | Vietnamese | 27,870 | ~650K | News | Manual | Request | Vietnamese tagset |
| VietTreeBank | Vietnamese | 10,000+ | ~200K | Mixed | Manual | Research | Vietnamese |
| UDD-1 | Vietnamese | 20,000 | ~453K | Legal+News | Machine | HuggingFace | 15 UD tags |
| VnDT v1.1 | Vietnamese | 10,197 | ~220K | News | Manual+Auto | GitHub | UD |
| UD Vietnamese-VTB | Vietnamese | 1,400 | ~39K | Wiki | Manual | GitHub | 17 UD tags |
| Penn Treebank | English | 49,208 | ~1.2M | WSJ | Manual | LDC | 45 PTB tags |

## 7. Vietnamese Pre-trained Language Model Comparison

| Model | Year | Architecture | Params | Pre-train Data | POS (VLSP) | NER (VLSP) | Venue |
|---|---|---|---|---|---|---|---|
| PhoBERT-base | 2020 | RoBERTa | 135M | 20GB | 96.7% | 94.0% | EMNLP |
| PhoBERT-large | 2020 | RoBERTa | 370M | 20GB | 96.8% | 94.5% | EMNLP |
| vELECTRA | 2020 | ELECTRA | 110M | 60GB | 96.77% | -- | PACLIC |
| BARTpho | 2021 | BART | Large | 20GB | -- | -- | INTERSPEECH |
| PhoNLP | 2021 | PhoBERT+MTL | 135M+ | 20GB | 96.76% | 94.41% | NAACL |
| ViT5 | 2022 | T5 | 310M/866M | CC100 | -- | -- | NAACL SRW |
| ViDeBERTa-xsmall | 2023 | DeBERTaV3 | 22M | 138GB | -- | -- | EACL |
| ViDeBERTa-base | 2023 | DeBERTaV3 | 86M | 138GB | 96.8% | 94.7% | EACL |
| ViDeBERTa-large | 2023 | DeBERTaV3 | 304M | 138GB | 97.2% | 95.3% | EACL |
| ViSoBERT | 2023 | XLM-R | 278M | Social media | -- | -- | EMNLP |
| PhoGPT | 2023 | GPT | 3.7B | 482GB | -- | -- | arXiv |
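
The PhoBERT checkpoints are published on HuggingFace under the vinai organization; a task head (e.g., for POS) would be fine-tuned on top of the contextual vectors. A minimal loading sketch, assuming the transformers and torch packages; note that PhoBERT expects word-segmented input, so a segmenter (such as TRE-1's WS stage) must run first:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# PhoBERT expects word-segmented input (multi-syllable words joined
# with underscores) before subword tokenization.
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModel.from_pretrained("vinai/phobert-base")

inputs = tokenizer("Chúng_tôi là nghiên_cứu_viên .", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```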

## 8. Chunking: Method Comparison

### 8.1 English CoNLL-2000 (Reference Benchmark)

| Model | Year | Method | F1 | Neural? | CPU? |
|---|---|---|---|---|---|
| SS05 (HMM+voting) | 2005 | Specialized HMM | 95.23% | No | Yes |
| BiLSTM-CRF | 2015 | Neural+CRF | 94.46% | Yes | Optional |
| S08 (Latent-Dynamic CRF) | 2008 | CRF variant | 94.34% | No | Yes |
| SP03 (CRF) | 2003 | CRF | 94.30% | No | Yes |
| M05 (2nd-order CRF) | 2005 | CRF | 94.29% | No | Yes |
| KM01 (SVM ensemble) | 2001 | SVM | 94.22% | No | Yes |
| C00 (Charniak) | 2000 | Parser-based | 94.20% | No | Yes |
| KM00 (SVM) | 2000 | SVM | 93.79% | No | Yes |

Key insight: Non-neural methods dominate chunking. CRF (94.30%) comes within 0.16% of BiLSTM-CRF (94.46%), and a specialized HMM holds the top score.

### 8.2 Vietnamese Chunking

| Model | Year | Method | F1 | Neural? |
|---|---|---|---|---|
| Lai et al. (NP) | 2019 | BiLSTM-CRF + rules | 88.40% | Yes |
| Nguyen et al. (NP) | 2009 | CRF/MaxEnt/SVM | -- | No |
| underthesea | ongoing | CRF | -- | No |

Vietnamese chunking is under-researched, but the CRF approach has been shown viable for the task.

### 8.3 Chunking Feature Templates

| Feature | CoNLL-2000 Standard | TRE-1 Planned |
|---|---|---|
| POS tag (current) | Essential | Yes (from TRE-1 POS model) |
| POS tag (context ±2) | Essential | Yes |
| Word form | Standard | Yes |
| Lowercase | Standard | Yes |
| Prefix/suffix | Sometimes | Yes |
| Previous BIO tag | Standard | Yes |
| Bigrams (POS) | Helpful | Yes |
| Case type | Sometimes | Yes |
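
A sketch of how this planned template set could look in python-crfsuite feature-dict form; the feature names are hypothetical, since the TRE-1 chunker is not yet released:

```python
def chunk_features(words, pos_tags, i):
    """CoNLL-2000-style chunking features (illustrative sketch)."""
    feats = {
        "word": words[i],
        "word.lower": words[i].lower(),
        "word.istitle": words[i].istitle(),   # case type
        "pos": pos_tags[i],
    }
    # POS context window +-2: the core CoNLL-2000 signal.
    for off in (-2, -1, 1, 2):
        j = i + off
        if 0 <= j < len(words):
            feats[f"pos[{off:+d}]"] = pos_tags[j]
    # POS bigram over the previous and current position.
    if i > 0:
        feats["pos_bigram"] = pos_tags[i - 1] + "||" + pos_tags[i]
    return feats
```

The "previous BIO tag" row needs no explicit feature here: in a linear-chain CRF, dependence on the previous label is captured by the transition weights.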

## 9. Dependency Parsing: Method Comparison

### 9.1 Vietnamese Dependency Parsing (VnDT)

| Model | Year | Dataset | Method | UAS | LAS | Neural? | CPU? |
|---|---|---|---|---|---|---|---|
| PhoNLP | 2021 | v1.1 | PhoBERT+MTL | 85.47 | 79.11 | Yes | No |
| PhoBERT-base | 2020 | v1.1 | Fine-tuned biaffine | 85.22 | 78.77 | Yes | No |
| PhoBERT-large | 2020 | v1.1 | Fine-tuned biaffine | 84.32 | 77.85 | Yes | No |
| Biaffine | 2017 | v1.1 | Neural biaffine | 81.19 | 74.99 | Yes | No |
| jointWPD | 2018 | v1.1 | Neural joint | 80.12 | 73.90 | Yes | No |
| jPTDP-v2 | 2018 | v1.1 | Neural joint | 79.63 | 73.12 | Yes | No |
| VnCoreNLP | 2018 | v1.1 | Hybrid | 77.35 | 71.38 | Partial | Yes |
| MSTParser | 2015 | v1.0 | Graph-based | 76.58 | 70.10 | No | Yes |
| MaltParser | 2015 | v1.0 | Transition-based | 76.08 | 69.88 | No | Yes |
| TRE-1 (stacklazy) | 2026 | v1.1 | MaltParser default | 72.58 | 65.87 | No | Yes |

Note: Nguyen & Nguyen (2015) results were on VnDT v1.0. VnDT v1.1 (Dec 2018) fixed annotation errors. TRE-1 uses default MaltParser features (no MaltOptimizer tuning).

#### 9.1.1 TRE-1 Algorithm Comparison (VnDT v1.1, Predicted POS)

| Algorithm | UAS | LAS | Train Time |
|---|---|---|---|
| stacklazy | 72.58% | 65.87% | 34s |
| stackproj | 72.48% | 65.86% | 34s |
| nivrestandard | 72.42% | 65.84% | 40s |
| stackeager | 72.24% | 65.61% | 36s |
| nivreeager | 72.07% | 65.40% | 37s |
| covproj | 71.27% | 64.74% | 42s |
| covnonproj | 71.07% | 64.53% | 86s |

Gold POS (nivreeager): 74.36% UAS / 69.04% LAS (+2.3% UAS over predicted POS).
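
The algorithms above are selected via MaltParser's -a option. A hedged sketch of the train/parse invocations, driven from Python for consistency with the rest of this document; the jar version and file paths are illustrative:

```python
import subprocess

MALT_JAR = "maltparser-1.9.2.jar"   # version/path are illustrative

# Train: -m learn builds a model (named by -c) from CoNLL training data.
subprocess.run([
    "java", "-jar", MALT_JAR,
    "-c", "vndt_stacklazy",          # config/model name (illustrative)
    "-i", "vndt_train.conll",        # training file (illustrative)
    "-m", "learn",
    "-a", "stacklazy",               # parsing algorithm from the table
], check=True)

# Parse: -m parse applies the trained model to held-out sentences.
subprocess.run([
    "java", "-jar", MALT_JAR,
    "-c", "vndt_stacklazy",
    "-i", "vndt_test.conll",
    "-o", "parsed.conll",
    "-m", "parse",
], check=True)
```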

### 9.2 English Penn Treebank (Non-Neural Reference)

| Model | Year | Method | UAS | Neural? |
|---|---|---|---|---|
| Zhang & Nivre | 2011 | Transition + 72 features | 92.9% | No |
| MSTParser 2nd-order | 2006 | Graph-based | 91.5% | No |
| MaltParser | 2006 | Transition + SVM | 90.1% | No |

### 9.3 Parsing Paradigms Comparison

| Paradigm | Algorithm | Complexity | Approach | CPU | Example |
|---|---|---|---|---|---|
| Transition-based | Arc-standard/eager | O(n) | Greedy, local | Yes | MaltParser |
| Graph-based | MST (Chu-Liu-Edmonds) | O(n²) | Global, exact | Yes | MSTParser |
| Neural biaffine | Biaffine attention | O(n²) | Neural, global | No | Dozat & Manning |
| Pre-trained | PhoBERT + biaffine | O(n²) | Transfer learning | No | PhoNLP |
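
The O(n) cost of transition-based parsing follows from each word being shifted onto the stack once and popped once. A minimal arc-standard sketch of the paradigm (not MaltParser's implementation; in a real parser, next_action is a trained classifier over stack/buffer features):

```python
def arc_standard(words, next_action):
    """Skeleton of the arc-standard transition system (paradigm sketch)."""
    stack = [0]                            # 0 = artificial root
    buffer = list(range(1, len(words) + 1))
    arcs = []                              # (head, dependent) pairs
    while buffer or len(stack) > 1:
        action = next_action(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":         # top heads the item below it
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC":        # item below heads the top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# Scripted actions for "He eats fish" (an oracle would predict these):
actions = iter(["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT",
                "RIGHT-ARC", "RIGHT-ARC"])
print(arc_standard(["He", "eats", "fish"], lambda s, b: next(actions)))
# -> [(2, 1), (2, 3), (0, 2)]: eats->He, eats->fish, root->eats
```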

### 9.4 Dependency Parsing Feature Templates (Zhang & Nivre 2011)

| Feature Category | Count | Description |
|---|---|---|
| Stack/buffer word | ~10 | Word forms at stack top, buffer front |
| Stack/buffer POS | ~10 | POS tags at stack/buffer positions |
| Distance | ~3 | Number of tokens between head/dependent |
| Direction | ~3 | Left/right arc features |
| Valency | ~6 | Count of left/right dependents |
| Grandparent | ~8 | Features of head's head |
| Sibling | ~8 | Features of adjacent dependents |
| Trigram | ~12 | Head + dependent + sibling/grandparent combos |
| Total | ~72 | Rich non-local feature templates |

## 10. Cross-Task Summary: Non-Neural vs Neural Gap

| Task | Best Non-Neural | Best Neural | Gap | CPU Priority |
|---|---|---|---|---|
| Word Segmentation | 98.06% F1 (SVM) | 97.90% F1 (jPTDP) | +0.16% | Excellent |
| POS Tagging | 95.88% acc (CRF) | 97.2% acc (DeBERTa) | -1.32% | Good |
| Chunking (English) | 95.23% F1 (HMM) | 94.46% F1 (BiLSTM) | +0.77% | Excellent |
| Dep Parsing (Viet) | 76.58% UAS (MST) | 85.47% UAS (PhoBERT) | -8.89% | Moderate |

## 11. TRE-1 Pipeline Architecture

```
Input text
    │
    ▼
┌──────────────────────┐
│  Word Segmentation   │  CRF (BIO tagging)       98.01% F1
│  (syllable → word)   │  Model: ~1.1 MB
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  POS Tagging         │  CRF (27 features)       95.89% acc
│  (word → POS tag)    │  Model: ~2.3 MB
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Chunking            │  CRF (BIO tagging)       [Planned]
│  (word+POS → chunk)  │  Features: POS + word
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Dependency Parsing  │  Transition/Graph-based  [Planned]
│  (word+POS → tree)   │  Features: ~72 templates
└──────────────────────┘
```

Total pipeline: < 20 MB, CPU-only, fast inference

The pipeline design follows Nguyen et al. (2017), who found that a pipeline approach outperforms joint modeling for Vietnamese.
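
A sketch of how the two trained stages chain at inference time with python-crfsuite, reusing bio_to_words, syll2features, and word2features from the sketches above (model filenames are illustrative):

```python
import pycrfsuite

ws_tagger = pycrfsuite.Tagger()
ws_tagger.open("ws.crfsuite")    # segmentation model (illustrative name)
pos_tagger = pycrfsuite.Tagger()
pos_tagger.open("pos.crfsuite")  # POS model (illustrative name)

def tag_text(text):
    """Stages 1-2: syllables -> underscore-joined words -> POS tags."""
    syllables = text.split()
    bio = ws_tagger.tag([syll2features(syllables, i)
                         for i in range(len(syllables))])
    words = bio_to_words(syllables, bio)
    pos = pos_tagger.tag([word2features(words, i)
                          for i in range(len(words))])
    return list(zip(words, pos))
```

The planned chunking and parsing stages would consume these (word, POS) pairs, matching the diagram above.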