References
Reference materials for TRE-1 Vietnamese NLP Models (CRF/non-neural, CPU-first).
Scope
TRE-1 covers four tasks (word segmentation, POS tagging, chunking, and dependency parsing) using CRF and other non-deep-learning methods optimized for CPU inference.
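
As a rough illustration of how word segmentation can be cast as a CRF sequence-labeling problem, the sketch below converts segmented words into per-syllable B/I tags and back. It assumes the common convention of joining the syllables of a word with `_`; the function names are hypothetical and this is not necessarily the tag scheme TRE-1 uses.

```python
def words_to_tags(words):
    """Turn segmented words into (syllable, tag) pairs.

    Assumption: syllables inside one word are joined with "_".
    "B" marks the first syllable of a word, "I" any following syllable.
    """
    pairs = []
    for word in words:
        for i, syllable in enumerate(word.split("_")):
            pairs.append((syllable, "B" if i == 0 else "I"))
    return pairs


def tags_to_words(pairs):
    """Rebuild segmented words from (syllable, tag) pairs."""
    words = []
    for syllable, tag in pairs:
        if tag == "B" or not words:
            words.append(syllable)          # start a new word
        else:
            words[-1] += "_" + syllable     # extend the current word
    return words


sentence = ["Học_sinh", "đi", "học"]
tagged = words_to_tags(sentence)
print(tagged)
print(tags_to_words(tagged))
```

A sequence labeler such as a CRF would then predict the tag column from features of the syllable column, and `tags_to_words` would recover the segmentation.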
Papers
| Folder | Title | Authors | Year | Task |
|---|---|---|---|---|
| 2001.icml.lafferty | Conditional Random Fields | Lafferty, McCallum, Pereira | 2001 | Foundation |
| 2003.naacl.sha | Shallow Parsing with CRFs | Sha, Pereira | 2003 | Chunking |
| 2005.emnlp.mcdonald | MST Dependency Parsing | McDonald, Pereira et al. | 2005 | Dep Parsing |
| 2006.eacl.mcdonald | 2nd-Order MST Parsing | McDonald, Pereira | 2006 | Dep Parsing |
| 2006.paclic.nguyen | Vietnamese CRF Word Segmentation | Nguyen et al. | 2006 | Word Seg |
| 2009.alr.nguyen | Vietnamese Chunking with CRF | Nguyen et al. | 2009 | Chunking |
| 2011.acl.zhang | Rich Features for Parsing (72 templates) | Zhang, Nivre | 2011 | Dep Parsing |
| 2014.eacl.nguyen | RDRPOSTagger | Nguyen et al. | 2014 | POS |
| 2015.kse.nguyen | Vietnamese DP Error Analysis | Nguyen, Nguyen | 2015 | Dep Parsing |
| 2016.acl.ma | BiLSTM-CNNs-CRF | Ma, Hovy | 2016 | Seq Labeling |
| 2016.naacl.lample | Neural NER (BiLSTM-CRF) | Lample et al. | 2016 | Seq Labeling |
| 2017.alta.nguyen | From WS to POS for Vietnamese | Nguyen et al. | 2017 | Pipeline |
| 2017.conll.qian | FBAML Non-DNN Pipeline | Qian, Liu | 2017 | Full Pipeline |
| 2017.wanlp.darwish | Arabic POS: Feature Engineering | Darwish et al. | 2017 | POS |
| 2018.naacl.vu | VnCoreNLP | Vu, Nguyen et al. | 2018 | Toolkit |
| 2019.alta.nguyen | Joint Vietnamese WS/POS/DP | Nguyen | 2019 | Joint Model |
| 2019.pacling.nguyen | UITws-v1 (WS SOTA) | Nguyen et al. | 2019 | Word Seg |
| 2020.emnlp.nguyen | PhoBERT | Nguyen, Nguyen | 2020 | Pre-trained LM |
| 2021.naacl.nguyen | PhoNLP | Nguyen, Nguyen | 2021 | Multi-task |
| 2021.naacl.wei | Masked CRF | Wei et al. | 2021 | Seq Labeling |
| 2023.eacl.tran | ViDeBERTa (POS SOTA) | Tran et al. | 2023 | Pre-trained LM |
Each paper folder contains:
- paper.md - Markdown with YAML front matter (for LLM/RAG)
- paper.tex - LaTeX source (original from arXiv or generated)
- paper.pdf - PDF file
- source/ - Full arXiv source (if available)
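
The front matter in a paper.md file could be read with a few lines of standard-library Python, as sketched below. This assumes flat `key: value` pairs between two `---` delimiter lines; the field names in the sample (`title`, `year`) are hypothetical, and nested YAML would need a real YAML parser.

```python
def read_front_matter(text):
    """Split Markdown text into (metadata dict, body).

    Assumption: front matter is a block of flat "key: value" lines
    enclosed by "---" lines at the very top of the file.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text                      # no front matter present
    meta = {}
    for idx, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":            # closing delimiter found
            return meta, "\n".join(lines[idx + 1:])
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip().strip('"')
    return {}, text                          # unterminated block: treat as body


# Hypothetical sample content, not copied from an actual paper.md.
sample = "---\ntitle: Conditional Random Fields\nyear: 2001\n---\n# Abstract\n"
meta, body = read_front_matter(sample)
print(meta)
print(body)
```

All values are returned as strings, so a caller that needs the year as a number would convert it explicitly.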
Resources
| File | Title | Type |
|---|---|---|
| universal_dependencies.md | Universal Dependencies | Annotation Framework |
| underthesea.md | Underthesea | Vietnamese NLP Toolkit |
| python_crfsuite.md | python-crfsuite | CRF Library |
Research Notes
| Folder | Description |
|---|---|
| research | Literature review: 4 Vietnamese NLP tasks with CRF/non-neural focus (40+ papers) |
Benchmarks
Word Segmentation
| Model | Dataset | F1 | Year | Neural? |
|---|---|---|---|---|
| UITws-v1 | VLSP 2013 | 98.06% | 2019 | No (SVM) |
| TRE-1 WS | UDD-1 | 98.01% | 2026 | No (CRF) |
| RDRsegmenter | VLSP 2013 | 97.90% | 2018 | No (Rules) |
| JVnSegmenter | VLSP 2013 | 97.06% | 2006 | No (CRF+SVM) |
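
For readers unfamiliar with how a word segmentation F1 is typically computed, here is a minimal sketch of the usual word-level definition: a predicted word counts as correct only if both of its boundaries match a gold word. The exact evaluation scripts behind the figures above are not specified here, so treat this as an assumption; the function names are hypothetical.

```python
def word_spans(words):
    """Map segmented words to a set of (start, end) syllable offsets.

    Assumption: syllables inside one word are joined with "_".
    """
    spans, start = set(), 0
    for word in words:
        length = len(word.split("_"))
        spans.add((start, start + length))
        start += length
    return spans


def segmentation_f1(gold_words, pred_words):
    """Word-level F1 between a gold and a predicted segmentation."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ["Học_sinh", "đi", "học"]
pred = ["Học", "sinh", "đi", "học"]   # wrongly splits the first word
print(round(segmentation_f1(gold, pred), 4))
```

In this toy case two of the four predicted words match two of the three gold words, giving precision 0.5 and recall 2/3.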
POS Tagging
| Model | Dataset | Accuracy | Year | Neural? |
|---|---|---|---|---|
| ViDeBERTa-large | VLSP 2013 | 97.2% | 2023 | Yes |
| PhoBERT-large | VLSP 2013 | 96.8% | 2020 | Yes |
| PhoNLP | VLSP 2013 | 96.76% | 2021 | Yes |
| TRE-1 | UDD-1 | 95.89% | 2026 | No (CRF) |
| VnMarMoT | VLSP 2013 | 95.88% | 2018 | No (CRF) |
| RDRPOSTagger | VLSP 2013 | 95.11% | 2014 | No (Rules) |
Chunking
| Model | Dataset | F1 | Year | Neural? |
|---|---|---|---|---|
| SS05 (HMM) | CoNLL-2000 | 95.23% | 2005 | No |
| BiLSTM-CRF | CoNLL-2000 | 94.46% | 2015 | Yes |
| CRF (Sha & Pereira) | CoNLL-2000 | 94.30% | 2003 | No |
| Lai et al. | Vietnamese NP | 88.40% | 2019 | Yes |
Dependency Parsing (Vietnamese)
| Model | Dataset | UAS (%) | LAS (%) | Neural? |
|---|---|---|---|---|
| PhoNLP | VnDT v1.1 | 85.47 | 79.11 | Yes |
| VnCoreNLP | VnDT v1.1 | 77.35 | 71.38 | Partial |
| MSTParser | VnDT v1.1 | 76.58 | 70.10 | No |
| MaltParser | VnDT v1.1 | 76.08 | 69.88 | No |
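
UAS and LAS in the table above are attachment scores: UAS is the percentage of tokens whose predicted head is correct, and LAS additionally requires the dependency label to match. The sketch below shows that computation under simple assumptions (one `(head, label)` pair per token, no punctuation filtering); the function name and the example labels are hypothetical and do not reflect the VnDT label set.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    gold, pred: lists of (head_index, label) tuples, one per token,
    where head_index 0 denotes the artificial root.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    if not gold:
        return 0.0, 0.0
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * full_hits / total


# Toy four-token sentence with illustrative labels.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "amod")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

Here three of four heads are correct and two of four tokens have both head and label correct. Published evaluations often differ in details such as whether punctuation is scored, so numbers from this sketch are not directly comparable to the table.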