# References
Reference materials for the TRE-1 Vietnamese NLP models (CRF/non-neural, CPU-first).
## Scope
TRE-1 covers four tasks: **word segmentation**, **POS tagging**, **chunking**, and **dependency parsing**. All four use CRF and other non-deep-learning methods optimized for CPU inference.
## Papers
| Folder | Title | Authors | Year | Task |
|--------|-------|---------|------|------|
| [2001.icml.lafferty](2001.icml.lafferty/) | Conditional Random Fields | Lafferty, McCallum, Pereira | 2001 | Foundation |
| [2003.naacl.sha](2003.naacl.sha/) | Shallow Parsing with CRFs | Sha, Pereira | 2003 | Chunking |
| [2005.emnlp.mcdonald](2005.emnlp.mcdonald/) | MST Dependency Parsing | McDonald, Pereira et al. | 2005 | Dep Parsing |
| [2006.eacl.mcdonald](2006.eacl.mcdonald/) | 2nd-Order MST Parsing | McDonald, Pereira | 2006 | Dep Parsing |
| [2006.paclic.nguyen](2006.paclic.nguyen/) | Vietnamese CRF Word Segmentation | Nguyen et al. | 2006 | Word Seg |
| [2009.alr.nguyen](2009.alr.nguyen/) | Vietnamese Chunking with CRF | Nguyen et al. | 2009 | Chunking |
| [2011.acl.zhang](2011.acl.zhang/) | Rich Features for Parsing (72 templates) | Zhang, Nivre | 2011 | Dep Parsing |
| [2014.eacl.nguyen](2014.eacl.nguyen/) | RDRPOSTagger | Nguyen et al. | 2014 | POS |
| [2015.kse.nguyen](2015.kse.nguyen/) | Vietnamese DP Error Analysis | Nguyen, Nguyen | 2015 | Dep Parsing |
| [2016.acl.ma](2016.acl.ma/) | BiLSTM-CNNs-CRF | Ma, Hovy | 2016 | Seq Labeling |
| [2016.naacl.lample](2016.naacl.lample/) | Neural NER (BiLSTM-CRF) | Lample et al. | 2016 | Seq Labeling |
| [2017.alta.nguyen](2017.alta.nguyen/) | From WS to POS for Vietnamese | Nguyen et al. | 2017 | Pipeline |
| [2017.conll.qian](2017.conll.qian/) | FBAML Non-DNN Pipeline | Qian, Liu | 2017 | Full Pipeline |
| [2017.wanlp.darwish](2017.wanlp.darwish/) | Arabic POS: Feature Engineering | Darwish et al. | 2017 | POS |
| [2018.naacl.vu](2018.naacl.vu/) | VnCoreNLP | Vu, Nguyen et al. | 2018 | Toolkit |
| [2019.alta.nguyen](2019.alta.nguyen/) | Joint Vietnamese WS/POS/DP | Nguyen | 2019 | Joint Model |
| [2019.pacling.nguyen](2019.pacling.nguyen/) | UITws-v1 (WS SOTA) | Nguyen et al. | 2019 | Word Seg |
| [2020.emnlp.nguyen](2020.emnlp.nguyen/) | PhoBERT | Nguyen, Nguyen | 2020 | Pre-trained LM |
| [2021.naacl.nguyen](2021.naacl.nguyen/) | PhoNLP | Nguyen, Nguyen | 2021 | Multi-task |
| [2021.naacl.wei](2021.naacl.wei/) | Masked CRF | Wei et al. | 2021 | Seq Labeling |
| [2023.eacl.tran](2023.eacl.tran/) | ViDeBERTa (POS SOTA) | Tran et al. | 2023 | Pre-trained LM |

Each paper folder contains:
- `paper.md` - Markdown with YAML front matter (for LLM/RAG)
- `paper.tex` - LaTeX source (original from arXiv or generated)
- `paper.pdf` - PDF file
- `source/` - Full arXiv source (if available)
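
Since `paper.md` carries YAML front matter, a loader for LLM/RAG use could separate the metadata from the body as sketched below. This is a minimal sketch, not the repository's own tooling: the function name `split_front_matter` and the `title`/`year` fields are hypothetical, and only flat `key: value` lines are handled (nested YAML would need a real YAML parser).

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into (front matter, body).

    Only flat `key: value` lines are supported; values are kept as strings.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter present
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter found
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # unterminated front matter: treat as plain Markdown


# Hypothetical sample; real paper.md files may use different fields.
sample = """---
title: "Conditional Random Fields"
year: 2001
---
# Abstract
Body text goes here."""

meta, body = split_front_matter(sample)
print(meta["title"], meta["year"])
```

The metadata dictionary can then be attached to each retrieved chunk, while `body` is what gets split and indexed.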
## Resources
| File | Title | Type |
|------|-------|------|
| [universal_dependencies.md](universal_dependencies.md) | Universal Dependencies | Annotation Framework |
| [underthesea.md](underthesea.md) | Underthesea | Vietnamese NLP Toolkit |
| [python_crfsuite.md](python_crfsuite.md) | python-crfsuite | CRF Library |
## Research Notes
| Folder | Description |
|--------|-------------|
| [research](../research/) | Literature review: 4 Vietnamese NLP tasks with CRF/non-neural focus (40+ papers) |
## Benchmarks
### Word Segmentation
| Model | Dataset | F1 | Year | Neural? |
|-------|---------|-----|------|---------|
| UITws-v1 | VLSP 2013 | 98.06% | 2019 | No (SVM) |
| RDRsegmenter | VLSP 2013 | 97.90% | 2018 | No (Rules) |
| **TRE-1 WS** | **UDD-1** | **98.01%** | **2026** | **No (CRF)** |
| JVnSegmenter | VLSP 2013 | 97.06% | 2006 | No (CRF+SVM) |
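
For reference, word segmentation F1 is usually computed over word spans: a predicted word counts as correct only if both of its boundaries match the gold segmentation. The sketch below assumes the common Vietnamese convention of joining the syllables of one word with `_`; the function names are illustrative and this is not necessarily the exact scorer behind the numbers above.

```python
def word_spans(sentence: str) -> set[tuple[int, int]]:
    """Map a segmented sentence to (start, end) syllable spans, one per word."""
    spans, pos = set(), 0
    for word in sentence.split():
        n = len(word.split("_"))  # syllables of one word are joined by "_"
        spans.add((pos, pos + n))
        pos += n
    return spans


def segmentation_f1(gold: list[str], pred: list[str]) -> float:
    """Micro-averaged word-level F1 over parallel lists of sentences."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gs, ps = word_spans(g), word_spans(p)
        tp += len(gs & ps)  # words whose boundaries both match
        n_gold += len(gs)
        n_pred += len(ps)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = ["học_sinh học sinh_học"]  # 3 gold words
pred = ["học_sinh học sinh học"]  # 4 predicted words, 2 of them correct
print(f"{segmentation_f1(gold, pred):.4f}")  # prints 0.5714
```

Here precision is 2/4 and recall is 2/3, giving F1 = 4/7.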
### POS Tagging
| Model | Dataset | Accuracy | Year | Neural? |
|-------|---------|----------|------|---------|
| ViDeBERTa-large | VLSP 2013 | 97.2% | 2023 | Yes |
| PhoBERT-large | VLSP 2013 | 96.8% | 2020 | Yes |
| PhoNLP | VLSP 2013 | 96.76% | 2021 | Yes |
| VnMarMoT | VLSP 2013 | 95.88% | 2018 | **No (CRF)** |
| **TRE-1** | **UDD-1** | **95.89%** | **2026** | **No (CRF)** |
| RDRPOSTagger | VLSP 2013 | 95.11% | 2014 | No (Rules) |
### Chunking
| Model | Dataset | F1 | Year | Neural? |
|-------|---------|-----|------|---------|
| SS05 (HMM) | CoNLL-2000 | 95.23% | 2005 | No |
| CRF (Sha & Pereira) | CoNLL-2000 | 94.30% | 2003 | No |
| BiLSTM-CRF | CoNLL-2000 | 94.46% | 2015 | Yes |
| Lai et al. | Vietnamese NP | 88.40% | 2019 | Yes |
### Dependency Parsing (Vietnamese)
| Model | Dataset | UAS (%) | LAS (%) | Neural? |
|-------|---------|-----|-----|---------|
| PhoNLP | VnDT v1.1 | 85.47 | 79.11 | Yes |
| VnCoreNLP | VnDT v1.1 | 77.35 | 71.38 | Partial |
| MSTParser | VnDT v1.1 | 76.58 | 70.10 | **No** |
| MaltParser | VnDT v1.1 | 76.08 | 69.88 | **No** |
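
UAS is the share of tokens whose predicted head is correct; LAS additionally requires the correct relation label. The following is a minimal sketch of both metrics. It assumes one `(head_index, label)` pair per token, uses illustrative labels rather than the VnDT label set, and ignores details such as punctuation handling, which differ between evaluation setups.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    gold and pred are parallel lists of (head_index, relation_label) pairs,
    one pair per token.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned token by token")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head only
    full_hits = sum(g == p for g, p in zip(gold, pred))        # head and label
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * full_hits / total


# Illustrative four-token sentence; head index 0 marks the root.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "nmod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # prints UAS=75.00 LAS=50.00
```

Because every LAS hit is also a UAS hit, LAS can never exceed UAS, which matches every row of the table above.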