# References

Reference materials for the TRE-1 Vietnamese NLP models (CRF/non-neural, CPU-first).

## Scope

TRE-1 covers four tasks: **word segmentation**, **POS tagging**, **chunking**, and **dependency parsing**. All four use CRFs and other non-deep-learning methods optimized for CPU inference.

## Papers

| Folder | Title | Authors | Year | Task |
|--------|-------|---------|------|------|
| [2001.icml.lafferty](2001.icml.lafferty/) | Conditional Random Fields | Lafferty, McCallum, Pereira | 2001 | Foundation |
| [2003.naacl.sha](2003.naacl.sha/) | Shallow Parsing with CRFs | Sha, Pereira | 2003 | Chunking |
| [2005.emnlp.mcdonald](2005.emnlp.mcdonald/) | MST Dependency Parsing | McDonald, Pereira et al. | 2005 | Dep Parsing |
| [2006.eacl.mcdonald](2006.eacl.mcdonald/) | 2nd-Order MST Parsing | McDonald, Pereira | 2006 | Dep Parsing |
| [2006.paclic.nguyen](2006.paclic.nguyen/) | Vietnamese CRF Word Segmentation | Nguyen et al. | 2006 | Word Seg |
| [2009.alr.nguyen](2009.alr.nguyen/) | Vietnamese Chunking with CRF | Nguyen et al. | 2009 | Chunking |
| [2011.acl.zhang](2011.acl.zhang/) | Rich Features for Parsing (72 templates) | Zhang, Nivre | 2011 | Dep Parsing |
| [2014.eacl.nguyen](2014.eacl.nguyen/) | RDRPOSTagger | Nguyen et al. | 2014 | POS |
| [2015.kse.nguyen](2015.kse.nguyen/) | Vietnamese DP Error Analysis | Nguyen, Nguyen | 2015 | Dep Parsing |
| [2016.acl.ma](2016.acl.ma/) | BiLSTM-CNNs-CRF | Ma, Hovy | 2016 | Seq Labeling |
| [2016.naacl.lample](2016.naacl.lample/) | Neural NER (BiLSTM-CRF) | Lample et al. | 2016 | Seq Labeling |
| [2017.alta.nguyen](2017.alta.nguyen/) | From WS to POS for Vietnamese | Nguyen et al. | 2017 | Pipeline |
| [2017.conll.qian](2017.conll.qian/) | FBAML Non-DNN Pipeline | Qian, Liu | 2017 | Full Pipeline |
| [2017.wanlp.darwish](2017.wanlp.darwish/) | Arabic POS: Feature Engineering | Darwish et al. | 2017 | POS |
| [2018.naacl.vu](2018.naacl.vu/) | VnCoreNLP | Vu, Nguyen et al. | 2018 | Toolkit |
| [2019.alta.nguyen](2019.alta.nguyen/) | Joint Vietnamese WS/POS/DP | Nguyen | 2019 | Joint Model |
| [2019.pacling.nguyen](2019.pacling.nguyen/) | UITws-v1 (WS SOTA) | Nguyen et al. | 2019 | Word Seg |
| [2020.emnlp.nguyen](2020.emnlp.nguyen/) | PhoBERT | Nguyen, Nguyen | 2020 | Pre-trained LM |
| [2021.naacl.nguyen](2021.naacl.nguyen/) | PhoNLP | Nguyen, Nguyen | 2021 | Multi-task |
| [2021.naacl.wei](2021.naacl.wei/) | Masked CRF | Wei et al. | 2021 | Seq Labeling |
| [2023.eacl.tran](2023.eacl.tran/) | ViDeBERTa (POS SOTA) | Tran et al. | 2023 | Pre-trained LM |

Each paper folder contains:

- `paper.md` - Markdown with YAML front matter (for LLM/RAG)
- `paper.tex` - LaTeX source (original from arXiv or generated)
- `paper.pdf` - PDF file
- `source/` - Full arXiv source (if available)
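
Because each `paper.md` carries YAML front matter, its metadata can be separated from the body before indexing. The following is a minimal sketch, assuming the front matter is a block of flat `key: value` lines enclosed by `---` lines at the top of the file. The helper names `split_front_matter` and `load_paper` are hypothetical, and nested YAML would need a real YAML parser instead.

```python
from pathlib import Path


def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into (front matter dict, body).

    Assumption: the front matter is flat `key: value` lines enclosed by
    `---` lines at the very top of the file. Values are kept as strings.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter present

    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return meta, "\n".join(lines[i + 1:])
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # opening delimiter was never closed


def load_paper(folder: str) -> tuple[dict[str, str], str]:
    """Read `paper.md` from one paper folder (hypothetical helper)."""
    text = Path(folder, "paper.md").read_text(encoding="utf-8")
    return split_front_matter(text)
```

For example, `load_paper("2001.icml.lafferty")` would return the metadata dictionary and the Markdown body for that folder, provided the file follows the layout assumed above.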

## Resources

| File | Title | Type |
|------|-------|------|
| [universal_dependencies.md](universal_dependencies.md) | Universal Dependencies | Annotation Framework |
| [underthesea.md](underthesea.md) | Underthesea | Vietnamese NLP Toolkit |
| [python_crfsuite.md](python_crfsuite.md) | python-crfsuite | CRF Library |

## Research Notes

| Folder | Description |
|--------|-------------|
| [research](../research/) | Literature review: 4 Vietnamese NLP tasks with CRF/non-neural focus (40+ papers) |
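
To make the CRF setting concrete, the sketch below builds the kind of input a CRF library such as python-crfsuite consumes: one feature dictionary per token, paired with one label per token. It frames word segmentation as B/I tagging over syllables. The feature names, the B/I scheme, and the convention that multi-syllable words are joined with `_` are illustrative assumptions, not TRE-1's actual feature templates.

```python
def words_to_bi_tags(words: list[str]) -> tuple[list[str], list[str]]:
    """Convert segmented words to syllables and B/I tags.

    Assumption: syllables of one word are joined by '_'.
    """
    syllables: list[str] = []
    tags: list[str] = []
    for word in words:
        parts = word.split("_")
        syllables.extend(parts)
        # First syllable begins a word, the rest continue it.
        tags.extend(["B"] + ["I"] * (len(parts) - 1))
    return syllables, tags


def syllable_features(syllables: list[str], i: int) -> dict[str, object]:
    """Features for syllable i: its own form plus a +/-1 context window."""
    s = syllables[i]
    feats: dict[str, object] = {
        "bias": 1.0,
        "lower": s.lower(),
        "is_title": s.istitle(),
        "is_digit": s.isdigit(),
    }
    if i > 0:
        prev = syllables[i - 1]
        feats["prev.lower"] = prev.lower()
        feats["bigram.prev"] = f"{prev.lower()}_{s.lower()}"
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(syllables) - 1:
        nxt = syllables[i + 1]
        feats["next.lower"] = nxt.lower()
        feats["bigram.next"] = f"{s.lower()}_{nxt.lower()}"
    else:
        feats["EOS"] = True  # end of sentence
    return feats


syllables, yseq = words_to_bi_tags(["Học_sinh", "học", "sinh_học"])
xseq = [syllable_features(syllables, i) for i in range(len(syllables))]
```

With python-crfsuite installed, a pair like `(xseq, yseq)` can be added to a `pycrfsuite.Trainer` through its `append` method. Keeping feature extraction as plain dictionaries keeps it independent of the training library.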

## Benchmarks

### Word Segmentation

| Model | Dataset | F1 | Year | Neural? |
|-------|---------|-----|------|---------|
| UITws-v1 | VLSP 2013 | 98.06% | 2019 | No (SVM) |
| RDRsegmenter | VLSP 2013 | 97.90% | 2018 | No (Rules) |
| **TRE-1 WS** | **UDD-1** | **98.01%** | **2026** | **No (CRF)** |
| JVnSegmenter | VLSP 2013 | 97.06% | 2006 | No (CRF+SVM) |
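
The table does not say how each F1 was computed. As a point of reference, the usual word-level definition counts a predicted word as correct only when both of its boundaries match the gold segmentation. The sketch below assumes that definition and the `_` convention for multi-syllable words; the function names are hypothetical.

```python
def word_spans(words: list[str]) -> set[tuple[int, int]]:
    """Map segmented words to (start, end) syllable offsets."""
    spans: set[tuple[int, int]] = set()
    start = 0
    for word in words:
        length = len(word.split("_"))
        spans.add((start, start + length))
        start += length
    return spans


def segmentation_f1(gold: list[str], pred: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching word spans."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["Học_sinh", "học", "sinh_học"]
pred = ["Học_sinh", "học", "sinh", "học"]
precision, recall, f1 = segmentation_f1(gold, pred)
```

Here two of the four predicted words match the three gold words, so precision is 2/4, recall is 2/3, and F1 is 4/7.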

### POS Tagging

| Model | Dataset | Accuracy | Year | Neural? |
|-------|---------|----------|------|---------|
| ViDeBERTa-large | VLSP 2013 | 97.2% | 2023 | Yes |
| PhoBERT-large | VLSP 2013 | 96.8% | 2020 | Yes |
| PhoNLP | VLSP 2013 | 96.76% | 2021 | Yes |
| VnMarMoT | VLSP 2013 | 95.88% | 2018 | **No (CRF)** |
| **TRE-1** | **UDD-1** | **95.89%** | **2026** | **No (CRF)** |
| RDRPOSTagger | VLSP 2013 | 95.11% | 2014 | No (Rules) |

### Chunking

| Model | Dataset | F1 | Year | Neural? |
|-------|---------|-----|------|---------|
| SS05 (HMM) | CoNLL-2000 | 95.23% | 2005 | No |
| CRF (Sha & Pereira) | CoNLL-2000 | 94.30% | 2003 | No |
| BiLSTM-CRF | CoNLL-2000 | 94.46% | 2015 | Yes |
| Lai et al. | Vietnamese NP | 88.40% | 2019 | Yes |

### Dependency Parsing (Vietnamese)

| Model | Dataset | UAS (%) | LAS (%) | Neural? |
|-------|---------|---------|---------|---------|
| PhoNLP | VnDT v1.1 | 85.47 | 79.11 | Yes |
| VnCoreNLP | VnDT v1.1 | 77.35 | 71.38 | Partial |
| MSTParser | VnDT v1.1 | 76.58 | 70.10 | **No** |
| MaltParser | VnDT v1.1 | 76.08 | 69.88 | **No** |
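
UAS is the percentage of tokens whose predicted head is correct, and LAS additionally requires the dependency label to match. The sketch below computes both for one sentence, assuming each token is represented as a `(head_index, label)` pair with `0` for the root. It counts every token, whereas published evaluations may apply their own conventions, for example around punctuation.

```python
Arc = tuple[int, str]  # (head index, dependency label); 0 denotes the root


def attachment_scores(gold: list[Arc], pred: list[Arc]) -> tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    # UAS: head matches. LAS: head and label both match.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    both_ok = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * head_ok / total, 100.0 * both_ok / total


gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
```

In this example three of four heads are correct and two of four arcs also carry the right label, giving a UAS of 75.0 and an LAS of 50.0.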