
# Comparison Tables: Vietnamese NLP Methods (4 Tasks, CPU-First)

Last Updated: 2026-02-07
Scope: Word Segmentation, POS Tagging, Chunking, Dependency Parsing

## 1. Vietnamese POS Tagging: Method Comparison

### 1.1 Results on VLSP 2013 Benchmark

| Model | Year | Method | Accuracy | Params | GPU | Inference Speed | Paper |
|---|---|---|---|---|---|---|---|
| ViDeBERTa-large | 2023 | DeBERTaV3 | 97.2% | 304M | Yes | Slow | Tran et al. |
| PhoBERT-large | 2020 | RoBERTa | 96.8% | 370M | Yes | Slow | Nguyen & Nguyen |
| vELECTRA | 2020 | ELECTRA | 96.77% | 110M | Yes | Medium | Bui et al. |
| PhoNLP | 2021 | PhoBERT+MTL | 96.76% | 135M+ | Yes | Slow | Nguyen & Nguyen |
| PhoBERT-base | 2020 | RoBERTa | 96.7% | 135M | Yes | Medium | Nguyen & Nguyen |
| VnMarMoT | 2018 | CRF (MarMoT) | 95.88% | ~1MB | No | 90K w/s | Vu et al. |
| TRE-1 | 2026 | CRF (python-crfsuite) | 95.89%* | 2.3MB | No | Fast | This work |
| BiLSTM-CRF+CNN | 2018 | Neural+CRF | 95.40% | ~10M | Yes | Medium | DU et al. |
| RDRPOSTagger | 2014 | Ripple Down Rules | 95.11% | ~5MB | No | 8K w/s | Nguyen et al. |
| JointWPD | 2018 | Joint neural | 94.03% | ~10M | Yes | Slow | Nguyen |
| BiLSTM-CRF | 2018 | Neural+CRF | 93.52% | ~10M | Yes | Medium | DU et al. |

*Evaluated on UDD-1, not VLSP 2013.

### 1.2 Method Categories

| Category | Best Model | Accuracy | Pros | Cons |
|---|---|---|---|---|
| Pre-trained Transformer | ViDeBERTa-large | 97.2% | Highest accuracy, transfer learning | GPU required, slow, large model |
| Multi-task Transformer | PhoNLP | 96.76% | Shared representations, joint tasks | Complex training, GPU required |
| ELECTRA | vELECTRA | 96.77% | More sample-efficient pre-training | GPU required |
| CRF | VnMarMoT / TRE-1 | 95.88–95.89% | No GPU, fast, interpretable, small model | Manual features, limited context |
| BiLSTM-CRF | BiLSTM-CRF+CNN | 95.40% | Automatic features, CRF constraints | GPU needed, no pre-training |
| Rule-based | RDRPOSTagger | 95.11% | Very fast, interpretable rules | Lower accuracy |

## 2. Vietnamese Word Segmentation: Method Comparison

### 2.1 Results on VLSP 2013 Benchmark

| Model | Year | Method | F1 Score | GPU | Paper |
|---|---|---|---|---|---|
| UITws-v1 | 2019 | SVM + ambiguity reduction | 98.06% | No | Nguyen et al. |
| RDRsegmenter | 2018 | Rule-based decision trees | 97.90% | No | Vu et al. |
| jPTDP-v2 | 2018 | Joint neural | 97.90% | Yes | Nguyen |
| UETsegmenter | 2016 | ML-based | 97.87% | No | -- |
| JointWPD | 2018 | Multi-task learning | 97.78% | Yes | Nguyen |
| vnTokenizer | 2008 | Rule-based | 97.33% | No | -- |
| JVnSegmenter | 2006 | CRF + SVM | 97.06% | No | Nguyen et al. |
| DongDu | -- | Dictionary-based | 96.90% | No | -- |
| LSTM+CNN | 2022 | Deep neural | 96.3%** | Yes | Zheng et al. |

**Evaluated on a different dataset.

### 2.2 TRE-1 WS Results on UDD-1

| Level | Precision | Recall | F1 |
|---|---|---|---|
| Syllable | -- | -- | 98.90% (accuracy) |
| Word | 98.02% | 98.01% | 98.01% |
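
The word-level scores come from casting segmentation as BIO tagging over syllables: the CRF marks each syllable as beginning (B) or continuing (I) a word, and labeled spans are merged into underscore-joined words. A minimal decoding sketch (the exact label inventory in TRE-1 is an assumption; the standard B/I scheme is shown):

```python
def bio_to_words(syllables, labels):
    """Merge syllables into underscore-joined words from BIO labels."""
    words = []
    for syllable, label in zip(syllables, labels):
        if label == "I" and words:
            words[-1] += "_" + syllable   # continue the current word
        else:
            words.append(syllable)        # "B" starts a new word
    return words

# "học sinh" (student) is a single two-syllable word:
print(bio_to_words(["học", "sinh", "giỏi"], ["B", "I", "B"]))
# -> ['học_sinh', 'giỏi']
```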

## 3. CRF vs Neural Methods: Cross-Language Evidence

### 3.1 Feature-based CRF vs BiLSTM-CRF (Without Pre-trained LM)

| Language | CRF Accuracy | BiLSTM-CRF Accuracy | Winner | Paper |
|---|---|---|---|---|
| Vietnamese | 95.88% | 93.52% | CRF | VnMarMoT vs DU et al. |
| Burmese | 98.18%* | 97.85%* | CRF | Thant et al. 2025 |
| Assamese | 85.0%** | 74.6%** | CRF/Rules | Pathak et al. 2023 |
| Uzbek | 88% | -- | CRF (vs HMM 82%) | Jamoldinova et al. 2025 |
| Amazigh | Best | Lower | CRF | Amri et al. 2024 |
| Khasi | -- | 96.98% | BiLSTM-CRF | Warjri et al. 2021 |
| English (PTB) | ~97.0% | 97.55%*** | Neural | Ma & Hovy 2016 |

*With fastText features. **F1 score. ***BiLSTM-CNN-CRF.

Key insight: For low-resource and morphologically rich languages, CRF with proper feature engineering often matches or outperforms neural methods without pre-trained representations.

### 3.2 CRF vs Transformer (With Pre-trained LM)

| Language | CRF Accuracy | Transformer Accuracy | Gap | Notes |
|---|---|---|---|---|
| Vietnamese | 95.89% | 97.2% | -1.31% | TRE-1 vs ViDeBERTa-large |
| English | ~97.0% | 97.85%+ | -0.85% | CRF vs BERT fine-tuned |
| Arabic | SVM-Rank SOTA | Competitive | ~0% | Darwish et al. 2017 |

Key insight: Pre-trained transformers consistently outperform CRF by 0.8-1.3%, but the gap is narrower than might be expected given the vast difference in model complexity.


## 4. Feature Template Comparison

### 4.1 POS Tagging Feature Templates

| Feature Category | TRE-1 (27) | VnMarMoT | sklearn-crfsuite tutorial | Ratnaparkhi 1996 |
|---|---|---|---|---|
| Word form | Yes | Yes | Yes | Yes |
| Lowercase | Yes | Yes | Yes | -- |
| Case type (title/upper) | Yes | -- | Yes | Yes |
| Is digit | Yes | -- | Yes | Yes |
| Is alpha | Yes | -- | -- | -- |
| Prefix (2-3 char) | Yes | Yes | Yes | Yes (1-4) |
| Suffix (2-3 char) | Yes | Yes | Yes | Yes (1-4) |
| Context T[-1] | Yes | Yes | Yes | Yes |
| Context T[-2] | Yes | Yes | Yes | Yes |
| Context T[+1] | Yes | Yes | Yes | Yes |
| Context T[+2] | Yes | Yes | Yes | -- |
| Bigrams T[-1,0] | Yes | -- | -- | -- |
| Bigrams T[0,1] | Yes | -- | -- | -- |
| Dictionary lookup | Yes | -- | -- | -- |
| BOS/EOS markers | Yes | Yes | Yes | -- |
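
For concreteness, here is how these categories map onto a python-crfsuite feature dict. This is a sketch only: the names are illustrative rather than TRE-1's exact 27 templates, and the dictionary-lookup feature is omitted:

```python
def word2features(sent, i):
    """Feature dict for token i of a word-segmented sentence (sketch)."""
    w = sent[i]
    feats = {
        "word": w,
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),   # case type
        "word.isupper": w.isupper(),
        "word.isdigit": w.isdigit(),
        "word.isalpha": w.isalpha(),
        "prefix2": w[:2], "prefix3": w[:3],
        "suffix2": w[-2:], "suffix3": w[-3:],
    }
    # Context window T[-2..+2] plus word bigrams around position i.
    if i > 0:
        feats["-1:word"] = sent[i - 1]
        feats["bigram[-1,0]"] = sent[i - 1] + "||" + w
    else:
        feats["BOS"] = True
    if i > 1:
        feats["-2:word"] = sent[i - 2]
    if i < len(sent) - 1:
        feats["+1:word"] = sent[i + 1]
        feats["bigram[0,1]"] = w + "||" + sent[i + 1]
    else:
        feats["EOS"] = True
    if i < len(sent) - 2:
        feats["+2:word"] = sent[i + 2]
    return feats
```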

### 4.2 Word Segmentation Feature Templates

| Feature Category | TRE-1 (21) | JVnSegmenter (2006) | UITws-v1 (2019) |
|---|---|---|---|
| Syllable form | Yes | Yes | Yes |
| Lowercase | Yes | Yes | Yes |
| Case type | Yes | -- | -- |
| Is digit | Yes | Yes | -- |
| Is punctuation | Yes | -- | -- |
| Syllable length | Yes | -- | -- |
| Prefix (2 char) | Yes | Yes | Yes |
| Suffix (2 char) | Yes | Yes | Yes |
| Context S[-1] | Yes | Yes | Yes |
| Context S[-2] | Yes | Yes | Yes |
| Context S[+1] | Yes | Yes | Yes |
| Context S[+2] | Yes | Yes | Yes |
| Bigrams S[-1,0] | Yes | Yes | Yes |
| Bigrams S[0,1] | Yes | Yes | Yes |
| Trigrams S[-1,0,1] | Yes | -- | -- |
| Syllable type n-grams | -- | -- | Yes |
| Ambiguity reduction | -- | -- | Yes |
| Dictionary conjunction | -- | -- | Yes |
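
The segmentation templates largely overlap with the POS set in 4.1; the segmentation-specific rows can be sketched as extra features on top of a word2features-style base (names are again illustrative, not TRE-1's exact templates):

```python
import string

def syll2features(syls, i):
    """Segmentation-specific feature sketch (illustrative names)."""
    s = syls[i]
    feats = {
        "syl.lower": s.lower(),
        "syl.len": len(s),     # syllable length
        "syl.ispunct": all(c in string.punctuation for c in s),
    }
    # Trigram S[-1,0,1]: a template the POS tagger does not use.
    if 0 < i < len(syls) - 1:
        feats["trigram"] = "||".join([syls[i - 1], s, syls[i + 1]])
    return feats
```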

## 5. Training Configuration Comparison

| Parameter | TRE-1 | sklearn-crfsuite default | Literature range |
|---|---|---|---|
| Algorithm | L-BFGS | L-BFGS | L-BFGS, SGD, AROW |
| L1 (c1) | 1.0 | 0.0 | 0.01–1.0 |
| L2 (c2) | 0.001 | 1.0 | 0.001–0.1 |
| Max iterations | 100 | 1000 | 50–500 |
| All transitions | True | False | True recommended |

Note: TRE-1 uses a high L1 penalty (c1 = 1.0) for strong feature selection/sparsity and a low L2 penalty (c2 = 0.001). The sklearn-crfsuite tutorial suggests c1 = 0.1, c2 = 0.1 as an alternative starting point.
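
A minimal sketch of these settings applied through python-crfsuite (the toy data and model filename are illustrative; real training appends one feature/label sequence per sentence, built from the templates in Section 4):

```python
import pycrfsuite

trainer = pycrfsuite.Trainer(verbose=False)   # default algorithm: L-BFGS

# Toy sequence; real training appends every sentence in the corpus.
xseq = [{"word": "Tôi"}, {"word": "học"}]     # feature dicts per token
yseq = ["P", "V"]                             # gold tags (illustrative)
trainer.append(xseq, yseq)

trainer.set_params({
    "c1": 1.0,                                # strong L1: sparse model
    "c2": 0.001,                              # weak L2
    "max_iterations": 100,
    "feature.possible_transitions": True,     # score unseen transitions
})
trainer.train("pos.crfsuite")                 # writes the model file
```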


## 6. Dataset Comparison

| Dataset | Language | Sentences | Tokens | Domain | Annotation | Access | POS Tags |
|---|---|---|---|---|---|---|---|
| VLSP 2013 | Vietnamese | 27,870 | ~650K | News | Manual | Request | Vietnamese tagset |
| VietTreeBank | Vietnamese | 10,000+ | ~200K | Mixed | Manual | Research | Vietnamese |
| UDD-1 | Vietnamese | 20,000 | ~453K | Legal+News | Machine | HuggingFace | 15 UD tags |
| VnDT v1.1 | Vietnamese | 10,197 | ~220K | News | Manual+Auto | GitHub | UD |
| UD Vietnamese-VTB | Vietnamese | 1,400 | ~39K | Wiki | Manual | GitHub | 17 UD tags |
| Penn Treebank | English | 49,208 | ~1.2M | WSJ | Manual | LDC | 45 PTB tags |

## 7. Vietnamese Pre-trained Language Model Comparison

| Model | Year | Architecture | Params | Pre-train Data | POS (VLSP) | NER (VLSP) | Venue |
|---|---|---|---|---|---|---|---|
| PhoBERT-base | 2020 | RoBERTa | 135M | 20GB | 96.7% | 94.0% | EMNLP |
| PhoBERT-large | 2020 | RoBERTa | 370M | 20GB | 96.8% | 94.5% | EMNLP |
| vELECTRA | 2020 | ELECTRA | 110M | 60GB | 96.77% | -- | PACLIC |
| BARTpho | 2021 | BART | Large | 20GB | -- | -- | INTERSPEECH |
| PhoNLP | 2021 | PhoBERT+MTL | 135M+ | 20GB | 96.76% | 94.41% | NAACL |
| ViT5 | 2022 | T5 | 310M/866M | CC100 | -- | -- | NAACL SRW |
| ViDeBERTa-xsmall | 2023 | DeBERTaV3 | 22M | 138GB | -- | -- | EACL |
| ViDeBERTa-base | 2023 | DeBERTaV3 | 86M | 138GB | 96.8% | 94.7% | EACL |
| ViDeBERTa-large | 2023 | DeBERTaV3 | 304M | 138GB | 97.2% | 95.3% | EACL |
| ViSoBERT | 2023 | XLM-R | 278M | Social media | -- | -- | EMNLP |
| PhoGPT | 2023 | GPT | 3.7B | 482GB | -- | -- | arXiv |
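
The PhoBERT checkpoints are published on HuggingFace under the vinai organization; a task head (e.g., for POS) would be fine-tuned on top of the contextual vectors. A minimal loading sketch, assuming the transformers and torch packages; note that PhoBERT expects word-segmented input, so a segmenter (such as TRE-1's WS stage) must run first:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# PhoBERT expects word-segmented input (multi-syllable words joined
# with underscores) before subword tokenization.
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModel.from_pretrained("vinai/phobert-base")

inputs = tokenizer("Chúng_tôi là nghiên_cứu_viên .", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```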

## 8. Chunking: Method Comparison

### 8.1 English CoNLL-2000 (Reference Benchmark)

| Model | Year | Method | F1 | Neural? | CPU? |
|---|---|---|---|---|---|
| SS05 (HMM+voting) | 2005 | Specialized HMM | 95.23% | No | Yes |
| BiLSTM-CRF | 2015 | Neural+CRF | 94.46% | Yes | Optional |
| S08 (Latent-Dynamic CRF) | 2008 | CRF variant | 94.34% | No | Yes |
| SP03 (CRF) | 2003 | CRF | 94.30% | No | Yes |
| M05 (2nd-order CRF) | 2005 | CRF | 94.29% | No | Yes |
| KM01 (SVM ensemble) | 2001 | SVM | 94.22% | No | Yes |
| C00 (Charniak) | 2000 | Parser-based | 94.20% | No | Yes |
| KM00 (SVM) | 2000 | SVM | 93.79% | No | Yes |

Key insight: Non-neural methods dominate chunking. CRF (94.30%) comes within 0.16% of BiLSTM-CRF (94.46%), and a specialized HMM holds the top score.

### 8.2 Vietnamese Chunking

| Model | Year | Method | F1 | Neural? |
|---|---|---|---|---|
| Lai et al. (NP) | 2019 | BiLSTM-CRF + rules | 88.40% | Yes |
| Nguyen et al. (NP) | 2009 | CRF/MaxEnt/SVM | -- | No |
| underthesea | ongoing | CRF | -- | No |

Vietnamese chunking is under-researched, but the CRF approach has been shown viable for the task.

### 8.3 Chunking Feature Templates

| Feature | CoNLL-2000 Standard | TRE-1 Planned |
|---|---|---|
| POS tag (current) | Essential | Yes (from TRE-1 POS model) |
| POS tag (context ±2) | Essential | Yes |
| Word form | Standard | Yes |
| Lowercase | Standard | Yes |
| Prefix/suffix | Sometimes | Yes |
| Previous BIO tag | Standard | Yes |
| Bigrams (POS) | Helpful | Yes |
| Case type | Sometimes | Yes |
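
A sketch of how this planned template set could look in python-crfsuite feature-dict form; the feature names are hypothetical, since the TRE-1 chunker is not yet released:

```python
def chunk_features(words, pos_tags, i):
    """CoNLL-2000-style chunking features (illustrative sketch)."""
    feats = {
        "word": words[i],
        "word.lower": words[i].lower(),
        "word.istitle": words[i].istitle(),   # case type
        "pos": pos_tags[i],
    }
    # POS context window +-2: the core CoNLL-2000 signal.
    for off in (-2, -1, 1, 2):
        j = i + off
        if 0 <= j < len(words):
            feats[f"pos[{off:+d}]"] = pos_tags[j]
    # POS bigram over the previous and current position.
    if i > 0:
        feats["pos_bigram"] = pos_tags[i - 1] + "||" + pos_tags[i]
    return feats
```

The "previous BIO tag" row needs no explicit feature here: in a linear-chain CRF, dependence on the previous label is captured by the transition weights.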

## 9. Dependency Parsing: Method Comparison

### 9.1 Vietnamese Dependency Parsing (VnDT)

| Model | Year | Dataset | Method | UAS | LAS | Neural? | CPU? |
|---|---|---|---|---|---|---|---|
| PhoNLP | 2021 | v1.1 | PhoBERT+MTL | 85.47 | 79.11 | Yes | No |
| PhoBERT-base | 2020 | v1.1 | Fine-tuned biaffine | 85.22 | 78.77 | Yes | No |
| PhoBERT-large | 2020 | v1.1 | Fine-tuned biaffine | 84.32 | 77.85 | Yes | No |
| Biaffine | 2017 | v1.1 | Neural biaffine | 81.19 | 74.99 | Yes | No |
| jointWPD | 2018 | v1.1 | Neural joint | 80.12 | 73.90 | Yes | No |
| jPTDP-v2 | 2018 | v1.1 | Neural joint | 79.63 | 73.12 | Yes | No |
| VnCoreNLP | 2018 | v1.1 | Hybrid | 77.35 | 71.38 | Partial | Yes |
| MSTParser | 2015 | v1.0 | Graph-based | 76.58 | 70.10 | No | Yes |
| MaltParser | 2015 | v1.0 | Transition-based | 76.08 | 69.88 | No | Yes |
| TRE-1 (stacklazy) | 2026 | v1.1 | MaltParser default | 72.58 | 65.87 | No | Yes |

Note: Nguyen & Nguyen (2015) results were on VnDT v1.0. VnDT v1.1 (Dec 2018) fixed annotation errors. TRE-1 uses default MaltParser features (no MaltOptimizer tuning).

#### 9.1.1 TRE-1 Algorithm Comparison (VnDT v1.1, Predicted POS)

| Algorithm | UAS | LAS | Train Time |
|---|---|---|---|
| stacklazy | 72.58% | 65.87% | 34s |
| stackproj | 72.48% | 65.86% | 34s |
| nivrestandard | 72.42% | 65.84% | 40s |
| stackeager | 72.24% | 65.61% | 36s |
| nivreeager | 72.07% | 65.40% | 37s |
| covproj | 71.27% | 64.74% | 42s |
| covnonproj | 71.07% | 64.53% | 86s |

Gold POS (nivreeager): 74.36% UAS / 69.04% LAS (+2.3% UAS over predicted POS).
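
The algorithms above are selected via MaltParser's -a option. A hedged sketch of the train/parse invocations, driven from Python for consistency with the rest of this document; the jar version and file paths are illustrative:

```python
import subprocess

MALT_JAR = "maltparser-1.9.2.jar"   # version/path are illustrative

# Train: -m learn builds a model (named by -c) from CoNLL training data.
subprocess.run([
    "java", "-jar", MALT_JAR,
    "-c", "vndt_stacklazy",          # config/model name (illustrative)
    "-i", "vndt_train.conll",        # training file (illustrative)
    "-m", "learn",
    "-a", "stacklazy",               # parsing algorithm from the table
], check=True)

# Parse: -m parse applies the trained model to held-out sentences.
subprocess.run([
    "java", "-jar", MALT_JAR,
    "-c", "vndt_stacklazy",
    "-i", "vndt_test.conll",
    "-o", "parsed.conll",
    "-m", "parse",
], check=True)
```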

### 9.2 English Penn Treebank (Non-Neural Reference)

| Model | Year | Method | UAS | Neural? |
|---|---|---|---|---|
| Zhang & Nivre | 2011 | Transition + 72 features | 92.9% | No |
| MSTParser 2nd-order | 2006 | Graph-based | 91.5% | No |
| MaltParser | 2006 | Transition + SVM | 90.1% | No |

### 9.3 Parsing Paradigms Comparison

| Paradigm | Algorithm | Complexity | Approach | CPU | Example |
|---|---|---|---|---|---|
| Transition-based | Arc-standard/eager | O(n) | Greedy, local | Yes | MaltParser |
| Graph-based | MST (Chu-Liu-Edmonds) | O(n²) | Global, exact | Yes | MSTParser |
| Neural biaffine | Biaffine attention | O(n²) | Neural, global | No | Dozat & Manning |
| Pre-trained | PhoBERT + biaffine | O(n²) | Transfer learning | No | PhoNLP |
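
The O(n) cost of transition-based parsing follows from each word being shifted onto the stack once and popped once. A minimal arc-standard sketch of the paradigm (not MaltParser's implementation; in a real parser, next_action is a trained classifier over stack/buffer features):

```python
def arc_standard(words, next_action):
    """Skeleton of the arc-standard transition system (paradigm sketch)."""
    stack = [0]                            # 0 = artificial root
    buffer = list(range(1, len(words) + 1))
    arcs = []                              # (head, dependent) pairs
    while buffer or len(stack) > 1:
        action = next_action(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":         # top heads the item below it
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC":        # item below heads the top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# Scripted actions for "He eats fish" (an oracle would predict these):
actions = iter(["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT",
                "RIGHT-ARC", "RIGHT-ARC"])
print(arc_standard(["He", "eats", "fish"], lambda s, b: next(actions)))
# -> [(2, 1), (2, 3), (0, 2)]: eats->He, eats->fish, root->eats
```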

### 9.4 Dependency Parsing Feature Templates (Zhang & Nivre 2011)

| Feature Category | Count | Description |
|---|---|---|
| Stack/buffer word | ~10 | Word forms at stack top, buffer front |
| Stack/buffer POS | ~10 | POS tags at stack/buffer positions |
| Distance | ~3 | Number of tokens between head/dependent |
| Direction | ~3 | Left/right arc features |
| Valency | ~6 | Count of left/right dependents |
| Grandparent | ~8 | Features of head's head |
| Sibling | ~8 | Features of adjacent dependents |
| Trigram | ~12 | Head + dependent + sibling/grandparent combos |
| Total | ~72 | Rich non-local feature templates |

## 10. Cross-Task Summary: Non-Neural vs Neural Gap

| Task | Best Non-Neural | Best Neural | Gap | CPU Priority |
|---|---|---|---|---|
| Word Segmentation | 98.06% F1 (SVM) | 97.90% F1 (jPTDP) | +0.16% | Excellent |
| POS Tagging | 95.88% acc (CRF) | 97.2% acc (DeBERTa) | -1.32% | Good |
| Chunking (English) | 95.23% F1 (HMM) | 94.46% F1 (BiLSTM) | +0.77% | Excellent |
| Dep Parsing (Viet) | 76.58% UAS (MST) | 85.47% UAS (PhoBERT) | -8.89% | Moderate |

## 11. TRE-1 Pipeline Architecture

```
Input text
    │
    ▼
┌──────────────────────┐
│  Word Segmentation   │  CRF (BIO tagging)       98.01% F1
│  (syllable → word)   │  Model: ~1.1 MB
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  POS Tagging         │  CRF (27 features)       95.89% acc
│  (word → POS tag)    │  Model: ~2.3 MB
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Chunking            │  CRF (BIO tagging)       [Planned]
│  (word+POS → chunk)  │  Features: POS + word
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Dependency Parsing  │  Transition/Graph-based  [Planned]
│  (word+POS → tree)   │  Features: ~72 templates
└──────────────────────┘
```

Total pipeline: < 20 MB, CPU-only, fast inference

The pipeline design follows Nguyen et al. (2017), who found that a pipeline approach outperforms joint modeling for Vietnamese.
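
A sketch of how the two trained stages chain at inference time with python-crfsuite, reusing bio_to_words, syll2features, and word2features from the sketches above (model filenames are illustrative):

```python
import pycrfsuite

ws_tagger = pycrfsuite.Tagger()
ws_tagger.open("ws.crfsuite")    # segmentation model (illustrative name)
pos_tagger = pycrfsuite.Tagger()
pos_tagger.open("pos.crfsuite")  # POS model (illustrative name)

def tag_text(text):
    """Stages 1-2: syllables -> underscore-joined words -> POS tags."""
    syllables = text.split()
    bio = ws_tagger.tag([syll2features(syllables, i)
                         for i in range(len(syllables))])
    words = bio_to_words(syllables, bio)
    pos = pos_tagger.tag([word2features(words, i)
                          for i in range(len(words))])
    return list(zip(words, pos))
```

The planned chunking and parsing stages would consume these (word, POS) pairs, matching the diagram above.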