
References

Reference materials for TRE-1 Vietnamese NLP Models (CRF/non-neural, CPU-first).

Scope

TRE-1 covers four tasks: word segmentation, POS tagging, chunking, and dependency parsing. All four use CRFs and other non-deep-learning methods optimized for CPU inference.
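CRF-based word segmentation is commonly framed as labeling each syllable with a boundary tag. The sketch below illustrates that framing with a two-tag B/I scheme; the scheme and the helper names `words_to_bi_tags` and `bi_tags_to_words` are assumptions for illustration, since this README does not state which tag set TRE-1 uses.

```python
def words_to_bi_tags(words):
    """Convert segmented words into (syllable, tag) pairs.

    Each word is a string whose syllables are separated by spaces,
    e.g. "học sinh". The first syllable gets "B", the rest get "I".
    """
    pairs = []
    for word in words:
        for i, syllable in enumerate(word.split()):
            pairs.append((syllable, "B" if i == 0 else "I"))
    return pairs


def bi_tags_to_words(pairs):
    """Rebuild words from (syllable, tag) pairs (inverse of the above)."""
    words = []
    for syllable, tag in pairs:
        if tag == "B" or not words:
            words.append(syllable)          # start a new word
        else:
            words[-1] += " " + syllable     # extend the current word
    return words


sentence = ["học sinh", "đi", "học"]
tagged = words_to_bi_tags(sentence)
restored = bi_tags_to_words(tagged)
```

A sequence labeler such as a CRF would predict the tag column from syllable features, and `bi_tags_to_words` would turn its predictions back into words.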

Papers

| Folder | Title | Authors | Year | Task |
|---|---|---|---|---|
| 2001.icml.lafferty | Conditional Random Fields | Lafferty, McCallum, Pereira | 2001 | Foundation |
| 2003.naacl.sha | Shallow Parsing with CRFs | Sha, Pereira | 2003 | Chunking |
| 2005.emnlp.mcdonald | MST Dependency Parsing | McDonald, Pereira et al. | 2005 | Dep Parsing |
| 2006.eacl.mcdonald | 2nd-Order MST Parsing | McDonald, Pereira | 2006 | Dep Parsing |
| 2006.paclic.nguyen | Vietnamese CRF Word Segmentation | Nguyen et al. | 2006 | Word Seg |
| 2009.alr.nguyen | Vietnamese Chunking with CRF | Nguyen et al. | 2009 | Chunking |
| 2011.acl.zhang | Rich Features for Parsing (72 templates) | Zhang, Nivre | 2011 | Dep Parsing |
| 2014.eacl.nguyen | RDRPOSTagger | Nguyen et al. | 2014 | POS |
| 2015.kse.nguyen | Vietnamese DP Error Analysis | Nguyen, Nguyen | 2015 | Dep Parsing |
| 2016.acl.ma | BiLSTM-CNNs-CRF | Ma, Hovy | 2016 | Seq Labeling |
| 2016.naacl.lample | Neural NER (BiLSTM-CRF) | Lample et al. | 2016 | Seq Labeling |
| 2017.alta.nguyen | From WS to POS for Vietnamese | Nguyen et al. | 2017 | Pipeline |
| 2017.conll.qian | FBAML Non-DNN Pipeline | Qian, Liu | 2017 | Full Pipeline |
| 2017.wanlp.darwish | Arabic POS: Feature Engineering | Darwish et al. | 2017 | POS |
| 2018.naacl.vu | VnCoreNLP | Vu, Nguyen et al. | 2018 | Toolkit |
| 2019.alta.nguyen | Joint Vietnamese WS/POS/DP | Nguyen | 2019 | Joint Model |
| 2019.pacling.nguyen | UITws-v1 (WS SOTA) | Nguyen et al. | 2019 | Word Seg |
| 2020.emnlp.nguyen | PhoBERT | Nguyen & Nguyen | 2020 | Pre-trained LM |
| 2021.naacl.nguyen | PhoNLP | Nguyen & Nguyen | 2021 | Multi-task |
| 2021.naacl.wei | Masked CRF | Wei et al. | 2021 | Seq Labeling |
| 2023.eacl.tran | ViDeBERTa (POS SOTA) | Tran et al. | 2023 | Pre-trained LM |

Each paper folder contains:

  • paper.md - Markdown with YAML front matter (for LLM/RAG)
  • paper.tex - LaTeX source (original from arXiv or generated)
  • paper.pdf - PDF file
  • source/ - Full arXiv source (if available)
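The folder layout above can be checked with a short script. This is a minimal sketch, assuming folder names follow the `year.venue.author` pattern seen in the Papers table and treating `paper.md`, `paper.tex`, and `paper.pdf` as required (with `source/` optional). The names `FOLDER_PATTERN`, `REQUIRED_FILES`, `parse_folder_name`, and `missing_files` are hypothetical and not part of the repository.

```python
import re
from pathlib import Path

# Assumed naming convention: <4-digit year>.<venue>.<first author>
FOLDER_PATTERN = re.compile(r"^(\d{4})\.([a-z0-9]+)\.([a-z]+)$")

# Files listed for every paper folder; source/ is optional.
REQUIRED_FILES = ("paper.md", "paper.tex", "paper.pdf")


def parse_folder_name(name):
    """Split a folder name into year, venue, and author, or return None."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        return None
    year, venue, author = match.groups()
    return {"year": int(year), "venue": venue, "author": author}


def missing_files(folder):
    """Return the required files that are absent from a paper folder."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


info = parse_folder_name("2001.icml.lafferty")
```

Running `missing_files` over every subfolder that `parse_folder_name` accepts gives a quick report of incomplete paper entries.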

Resources

| File | Title | Type |
|---|---|---|
| universal_dependencies.md | Universal Dependencies | Annotation Framework |
| underthesea.md | Underthesea | Vietnamese NLP Toolkit |
| python_crfsuite.md | python-crfsuite | CRF Library |

Research Notes

| Folder | Description |
|---|---|
| research | Literature review: 4 Vietnamese NLP tasks with CRF/non-neural focus (40+ papers) |

Benchmarks

Word Segmentation

| Model | Dataset | F1 | Year | Neural? |
|---|---|---|---|---|
| UITws-v1 | VLSP 2013 | 98.06% | 2019 | No (SVM) |
| RDRsegmenter | VLSP 2013 | 97.90% | 2018 | No (Rules) |
| TRE-1 WS | UDD-1 | 98.01% | 2026 | No (CRF) |
| JVnSegmenter | VLSP 2013 | 97.06% | 2006 | No (CRF+SVM) |
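For readers unfamiliar with the F1 column, the sketch below shows one common way to compute word-level F1 for segmentation: a predicted word counts as correct only if both of its boundaries match the gold word. This is an illustrative assumption. The cited papers may use different evaluation scripts, and the function names here are hypothetical.

```python
def word_spans(words):
    """Map a segmented sentence to a set of (start, end) syllable spans."""
    spans = set()
    start = 0
    for word in words:
        length = len(word.split())
        spans.add((start, start + length))
        start += length
    return spans


def segmentation_f1(gold_sentences, pred_sentences):
    """Micro-averaged word-level F1 over parallel lists of sentences."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = word_spans(gold), word_spans(pred)
        correct += len(gold_spans & pred_spans)
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The prediction wrongly splits "học sinh" into two words.
gold = [["học sinh", "đi", "học"]]
pred = [["học", "sinh", "đi", "học"]]
f1 = segmentation_f1(gold, pred)
```

In this example two of the four predicted words match the three gold words, so precision is 2/4, recall is 2/3, and F1 is 4/7.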

POS Tagging

| Model | Dataset | Accuracy | Year | Neural? |
|---|---|---|---|---|
| ViDeBERTa-large | VLSP 2013 | 97.2% | 2023 | Yes |
| PhoBERT-large | VLSP 2013 | 96.8% | 2020 | Yes |
| PhoNLP | VLSP 2013 | 96.76% | 2021 | Yes |
| VnMarMoT | VLSP 2013 | 95.88% | 2018 | No (CRF) |
| TRE-1 | UDD-1 | 95.89% | 2026 | No (CRF) |
| RDRPOSTagger | VLSP 2013 | 95.11% | 2014 | No (Rules) |

Chunking

| Model | Dataset | F1 | Year | Neural? |
|---|---|---|---|---|
| SS05 (HMM) | CoNLL-2000 | 95.23% | 2005 | No |
| CRF (Sha & Pereira) | CoNLL-2000 | 94.30% | 2003 | No |
| BiLSTM-CRF | CoNLL-2000 | 94.46% | 2015 | Yes |
| Lai et al. | Vietnamese NP | 88.40% | 2019 | Yes |

Dependency Parsing (Vietnamese)

| Model | Dataset | UAS (%) | LAS (%) | Neural? |
|---|---|---|---|---|
| PhoNLP | VnDT v1.1 | 85.47 | 79.11 | Yes |
| VnCoreNLP | VnDT v1.1 | 77.35 | 71.38 | Partial |
| MSTParser | VnDT v1.1 | 76.58 | 70.10 | No |
| MaltParser | VnDT v1.1 | 76.08 | 69.88 | No |
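
UAS (unlabeled attachment score) is the percentage of tokens whose predicted head is correct, and LAS (labeled attachment score) additionally requires the correct relation label. The sketch below computes both for a single sentence. It is a simplified illustration: the function name `attachment_scores` is hypothetical, and details such as punctuation handling vary between the evaluations cited above.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    gold and pred are equal-length lists of (head_index, relation_label)
    tuples, one per token, with head index 0 denoting the root.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    total = len(gold)
    if total == 0:
        return 0.0, 0.0
    # UAS: head matches. LAS: head and label both match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
```

Here three of four heads are correct and two of four head-label pairs are correct, so the sentence scores 75.0 UAS and 50.0 LAS.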