Paper Review: Technical Report -- Vietnamese NLP Pipeline (TRE-1)
Document: TECHNICAL_REPORT.md
Reviewer: Claude Code (paper-review skill)
Date: February 7, 2026
Review Standard: ACL Rolling Review (ARR) Guidelines
Paper Summary
This technical report presents TRE-1, a Vietnamese NLP pipeline using CRF and non-neural methods optimized for CPU inference. The system covers four tasks -- word segmentation (98.01% word F1), POS tagging (95.89% accuracy), chunking (planned), and dependency parsing (planned) -- and is deployed on Hugging Face. The report argues that CRF-based methods remain competitive with neural approaches for Vietnamese NLP, particularly for resource-constrained deployment.
Summary of Strengths
Clear practical motivation and well-scoped contribution (Sections 1.2, 10): The CPU-first design philosophy is well-justified. The cross-task analysis in Section 10 effectively demonstrates that non-neural methods come within 1.3 percentage points of neural SOTA (or better) for 3 of 4 tasks, making a compelling case for lightweight deployment. The total pipeline footprint (<20 MB) vs PhoBERT (500 MB+) is a meaningful engineering contribution.
Competitive empirical results (Sections 4.3, 5.3): Word segmentation achieves 98.01% word F1 (near non-neural SOTA of 98.06%) and POS tagging at 95.89% matches VnMarMoT (95.88%), the best CRF system. These results validate that well-engineered feature-based CRF remains practical for Vietnamese. The detailed per-tag breakdown (Table 5.3) and error analysis (Section 9) provide useful diagnostic information.
Thorough literature coverage and honest benchmarking (Sections 5.6, 12): The report surveys 28 references spanning CRF foundations through modern transformers. Comparisons are fair -- the authors correctly note that TRE-1 is evaluated on UDD-1 rather than VLSP 2013 (Section 5.6 note), and they do not overstate claims. The cross-language CRF evidence in Section 12.8 strengthens the methodological choice.
Reproducibility (Section 16): Training commands, hyperparameters, dependencies, and PEP 723 inline metadata are provided. The three interchangeable trainer backends (python-crfsuite, crfsuite-rs, underthesea-core) demonstrate engineering maturity. Code and models are publicly available.
Summary of Weaknesses
No evaluation on standard benchmarks (Sections 4.4, 5.6): Both WS and POS are evaluated only on UDD-1, a machine-annotated dataset created by the authors' own toolkit (Underthesea). This creates a circular evaluation concern: the model is trained on data annotated by the same framework. No results on VLSP 2013 (the standard Vietnamese NLP benchmark) are provided, making all cross-system comparisons in Tables 4.4 and 5.6 apples-to-oranges. This is the report's most significant limitation and is acknowledged (Limitation 5) but not addressed experimentally.
Half the pipeline is unimplemented (Sections 6, 7): Chunking and dependency parsing are described as "Planned" with no experimental results. Sections 6-7 are literature surveys rather than system descriptions. The report title ("Vietnamese NLP Pipeline") and pipeline diagram (Section 3) create an impression of a complete system that does not yet exist. The contribution would be more accurately scoped as "Vietnamese WS and POS Tagging with CRF."
Missing ablation studies and analysis: There is no feature ablation showing the contribution of individual feature groups (e.g., dictionary features, n-grams, context window size). For 27 POS features and 21 WS features, understanding which features drive performance is essential. There is no comparison of different CRF regularization settings beyond the single configuration (c1=1.0, c2=0.001). There is no learning curve analysis showing how performance scales with training data size.
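The requested leave-one-group-out ablation could be carried out with a small harness along these lines. This is a sketch: the feature-group names and the train_and_eval callback are illustrative assumptions, not names from the TRE-1 codebase.

```python
# Hypothetical leave-one-group-out feature ablation harness.
# FEATURE_GROUPS and train_and_eval are illustrative assumptions.

FEATURE_GROUPS = {
    "unigram":    ["T[0]", "T[-1]", "T[1]"],
    "bigram":     ["T[-1,0]", "T[0,1]"],
    "dictionary": ["T[0].is_in_dict", "T[-1,0].is_in_dict"],
    "shape":      ["T[0].istitle", "T[0].isdigit"],
}

def ablate(groups, train_and_eval):
    """Retrain with one feature group removed at a time; return the
    full-feature baseline score and each group's drop from baseline."""
    baseline = train_and_eval([t for ts in groups.values() for t in ts])
    drops = {}
    for name in groups:
        kept = [t for g, ts in groups.items() if g != name for t in ts]
        drops[name] = baseline - train_and_eval(kept)
    return baseline, drops
```

Reporting the per-group drops alongside Table 5.3 would directly answer whether the dictionary features are essential or marginal.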
Machine-annotated training data quality unquantified (Section 2.1): UDD-1 is described as "machine-generated using Underthesea toolkit" but no inter-annotator agreement, annotation quality assessment, or error rate estimation is provided. The 95.89% POS accuracy may partially reflect overfitting to Underthesea's annotation patterns rather than linguistic ground truth.
Important recent work not cited: The AAAI 2025 paper "Vietnamese Words Are Not Constructed from Syllables: Rethinking the Role of Word Segmentation in NLP for Vietnamese Texts" (Nguyen et al., 2025) directly challenges TRE-1's syllable-based BIO word segmentation approach. This paper argues that word segmentation may not be appropriate for Vietnamese and proposes ViWordFormer. This is a significant omission given TRE-1's word segmentation is the first pipeline stage.
Scores
Soundness: 3/5
The implemented components (WS, POS) are technically sound with clearly described methodology. However, the lack of standard benchmark evaluation and the circular evaluation on machine-annotated data from the same toolkit reduce confidence in the claimed results.
Excitement: 3/5
The work addresses a genuine practical need (CPU-first Vietnamese NLP) and achieves competitive results. However, CRF-based Vietnamese NLP is well-explored territory (VnMarMoT, VnCoreNLP), and the contribution is primarily engineering integration rather than methodological novelty. The incomplete pipeline reduces impact.
Overall Assessment: 3/5
Solid engineering work with clear practical value, appropriate for a Findings track or systems paper venue. The incomplete pipeline and lack of standard benchmark evaluation prevent a higher rating. Addressing VLSP 2013 evaluation would significantly strengthen the work.
Reproducibility: 4/5
Training commands, hyperparameters, code, and models are provided. Minor concerns: exact dataset version/preprocessing details could be more precise, and the underthesea-core backend requires building from source with a hardcoded wheel path.
Confidence: 4/5
Familiar with Vietnamese NLP landscape, CRF methods, and the benchmark ecosystem. Verified SOTA claims against published results.
Detailed Comments
Technical Soundness
The CRF formulation (Section 5.1) and feature engineering (Sections 4.2, 5.2) are standard and correctly described. The L-BFGS training with c1=1.0, c2=0.001 uses aggressive L1 regularization (1000x the L2 penalty), which promotes feature sparsity. This is a reasonable choice, but sensitivity to these hyperparameters is not explored.
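For concreteness, the elastic-net regularizer these settings configure can be written out; this is the standard CRFsuite penalty shown as a plain-Python sketch, not code from the report.

```python
def elastic_net_penalty(weights, c1=1.0, c2=0.001):
    """Regularization term added to the negative log-likelihood by
    CRFsuite's L-BFGS trainer: c1 scales the L1 (sparsity-inducing)
    term, c2 scales the squared-L2 term. With the report's settings,
    the L1 term dominates by three orders of magnitude."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return c1 * l1 + c2 * l2
```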
The BIO tagging scheme for word segmentation (Section 4.1) omits the "O" tag -- using only B and I. This is correct for Vietnamese where every syllable belongs to a word, but should be explicitly noted. The scheme is actually BI rather than BIO.
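A minimal sketch of this B/I encoding, assuming words are syllables joined by spaces as in common Vietnamese segmentation formats:

```python
def words_to_bi(words):
    """Encode a segmented sentence as per-syllable B/I tags.
    Every syllable belongs to some word, so no O tag is needed."""
    tags = []
    for word in words:
        n = len(word.split())  # syllables in this word
        tags.extend(["B"] + ["I"] * (n - 1))
    return tags
```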
The pipeline error propagation analysis (Section 9.3) is qualitative. A quantitative analysis (e.g., oracle POS tagging performance given gold vs predicted word segmentation) would strengthen the argument.
Novelty and Contribution
The primary contribution is an engineering integration of known techniques (CRF for WS/POS) into a deployable pipeline with modern tooling (HuggingFace, PEP 723, multiple backends). This is valuable practical work but limited in novelty -- VnCoreNLP (Vu et al., 2018) already provides a similar CRF-based pipeline with comparable performance.
The feature template designs (21 WS, 27 POS) closely follow established patterns from VnMarMoT and underthesea. The dictionary features (T[0].is_in_dict, etc.) are a distinguishing element but their contribution is not isolated through ablation.
Clarity and Presentation
The report is well-structured and clearly written. The pipeline diagram (Section 3) effectively communicates the architecture. Tables are informative and consistently formatted. The cross-task analysis (Section 10) is a particularly useful contribution for practitioners deciding between neural and non-neural approaches.
However, the mixed scope (2 implemented + 2 planned tasks) creates structural issues. Sections 6-7 read as literature review rather than system description, and their placement between implemented results and error analysis disrupts the narrative flow.
Reproducibility Assessment
Strong on most dimensions: exact hyperparameters (Section 8.1), training commands (Section 16), public code and models. The PEP 723 inline metadata is a nice modern touch. Weaknesses: (1) the underthesea-core trainer references a hardcoded local wheel path (line 12 of train_word_segmentation.py); (2) exact preprocessing steps for converting UDD-1 to syllable-level sequences should be more explicit; (3) no random seeds specified.
Limitations and Ethics
The limitations section (Section 13) is comprehensive and honest -- covering tokenization requirements, machine annotation quality, domain coverage, rare tags, benchmark comparability, and the DP gap. The machine annotation concern (Limitation 2) deserves more attention given its fundamental impact on result interpretation.
No ethical concerns specific to this work. The CC-BY-SA-4.0 licensing is appropriate.
Related Work Research
Papers Found
| Paper | Year | Method | Results | Relevance |
|---|---|---|---|---|
| ViWordFormer (Nguyen et al.) | AAAI 2025 | Transformer-based word formation | Challenges syllable-based WS | Directly challenges TRE-1's WS approach |
| ViDeBERTa (Tran et al.) | EACL 2023 | DeBERTaV3 | 97.2% POS acc | SOTA POS -- cited correctly |
| PhoNLP (Nguyen & Nguyen) | NAACL 2021 | PhoBERT+MTL | 84.95-85.18% UAS | SOTA DP -- cited, UAS figure slightly varies |
| Conversational POS Dataset (Pham et al.) | 2022 | New dataset | N/A | Domain extension for Vietnamese POS |
Missing Citations
- Nguyen, Nguyen, & Nguyen (AAAI 2025): "Vietnamese Words Are Not Constructed from Syllables" -- directly challenges the syllable-based BIO tagging used in TRE-1's word segmentation. Should be discussed.
- Pham et al. (2022): "Introducing a Large-Scale Dataset for Vietnamese POS Tagging on Conversational Texts" -- relevant to Limitation 3 (domain coverage).
SOTA Verification
- POS tagging SOTA: Claimed ViDeBERTa-large 97.2% -- Confirmed (Table 2 of arXiv:2301.10439)
- WS non-neural SOTA: Claimed UITws-v1 98.06% -- Confirmed, still best non-neural as of 2026
- DP SOTA: Claimed PhoNLP 85.47% UAS -- Approximately confirmed (84.95-85.18% in some reports; exact figure may depend on evaluation split)
- VnMarMoT best CRF: Claimed 95.88% -- Confirmed
Questions for Authors
Have you evaluated TRE-1 on VLSP 2013 for either WS or POS? If not, what prevents this evaluation? The comparison tables would be significantly stronger with same-benchmark results.
What is the estimated annotation quality of UDD-1? Even a small manual verification sample (e.g., 100 sentences) would help assess whether the 95.89% accuracy reflects linguistic correctness or toolkit agreement.
How do the 3 dictionary features contribute to POS tagging performance? An ablation removing these features would clarify whether dictionary lookup is essential or marginal.
The AAAI 2025 paper by Nguyen et al. argues that syllable-based word segmentation is not appropriate for Vietnamese NLP. How would you position TRE-1's approach in light of this critique?
PhoNLP (2021) reports POS tagging accuracy of 96.76% on VLSP 2013, but the research notes reference 96.91% elsewhere. Which is the correct figure?
Minor Issues
- Section 5.6 table header: the Accuracy column has inconsistent alignment markup.
- Section 11.3 example output: "Nam/PUNCT" appears incorrect -- "Nam" in "Viet Nam" should likely be tagged PROPN, not PUNCT. This may indicate a model error worth noting.
- Appendix B: Vietnamese examples lack diacritics (e.g., "dep" should be "đẹp", "rat" should be "rất") -- likely intentional for ASCII compatibility but worth noting.
- Section 4.1: The tagging scheme is described as "BIO" but only B and I tags are used (no O tag). This is technically a BI scheme.
- Reference 13 (Huang et al., 2015) is an arXiv preprint that was never formally published -- consider noting this.
Suggestions for Improvement
Priority 1: Evaluate on VLSP 2013
Evaluate both WS and POS on VLSP 2013. This single addition would transform the comparison tables from suggestive to definitive and address the most significant weakness.
Priority 2: Add feature ablation studies
Remove feature groups (dictionary, n-grams, context windows) one at a time to show their individual contributions. This would strengthen the feature engineering narrative.
Priority 3: Quantify UDD-1 annotation quality
Manually annotate 100-200 sentences and compute agreement with the machine annotations. Report the estimated annotation accuracy.
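One way to carry this out, sketched under the assumption of token-aligned manual and machine tag sequences, is token-level agreement with a Wilson score interval to reflect the small sample size:

```python
import math

def agreement(manual_tags, machine_tags, z=1.96):
    """Token-level agreement between a manual sample and the machine
    annotations, with a Wilson score confidence interval for the rate.
    Assumes the two sequences are aligned token by token."""
    n = len(manual_tags)
    p = sum(m == a for m, a in zip(manual_tags, machine_tags)) / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half, center + half
```

Reporting the interval, not just the point estimate, matters at 100-200 sentences.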
Priority 4: Restructure to separate implemented vs planned
Consider moving Sections 6-7 (planned work) to an extended "Future Work" section. This would prevent the impression that the pipeline is complete.
Priority 5: Add inference speed benchmarks
The report claims "fast inference" (Sections 1.2, 3) but provides no latency measurements. Report sentences/second or tokens/second on a reference CPU to substantiate the efficiency claims.
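A benchmark of the kind suggested here could look like the following sketch; tag_fn stands in for the pipeline's predict call and is an assumption, not an API from the report.

```python
import time

def sentences_per_second(tag_fn, sentences, warmup=10):
    """Measure tagging throughput on CPU. tag_fn is any callable that
    tags one sentence; a short warmup pass avoids counting one-time
    costs (model load, caches) against the steady-state rate."""
    for s in sentences[:warmup]:
        tag_fn(s)
    start = time.perf_counter()
    for s in sentences:
        tag_fn(s)
    elapsed = time.perf_counter() - start
    return len(sentences) / elapsed
```

Reporting this figure on a named reference CPU would substantiate the "fast inference" claim.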
Priority 6: Discuss ViWordFormer (AAAI 2025)
Address the recent challenge to syllable-based word segmentation to show awareness of evolving perspectives on Vietnamese word formation.
Priority 7: Hyperparameter sensitivity
Show performance across 3-5 different (c1, c2) configurations to demonstrate robustness or motivate the chosen values.
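Such a sweep amounts to a small grid over the two penalties; train_and_eval is a hypothetical callback wrapping the report's existing training command.

```python
from itertools import product

def regularization_sweep(train_and_eval,
                         c1_values=(0.1, 0.5, 1.0, 2.0),
                         c2_values=(0.0001, 0.001, 0.01)):
    """Evaluate every (c1, c2) pair and return dev-set scores keyed by
    the pair. train_and_eval is a hypothetical callback that trains a
    CRF with the given penalties and returns a score."""
    return {(c1, c2): train_and_eval(c1, c2)
            for c1, c2 in product(c1_values, c2_values)}
```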
Priority Actions
| Priority | Section | Action |
|---|---|---|
| High | 4.4, 5.6 | Evaluate on VLSP 2013 for direct comparison |
| High | New | Add feature ablation studies |
| High | 2.1 | Quantify UDD-1 annotation quality |
| Medium | 6, 7 | Restructure: move planned tasks to Future Work |
| Medium | 3, 10 | Add inference speed benchmarks (sentences/sec) |
| Medium | 12 | Cite and discuss ViWordFormer (AAAI 2025) |
| Medium | 8 | Add hyperparameter sensitivity analysis |
| Low | 4.1 | Clarify BI vs BIO tagging scheme |
| Low | 11.3 | Verify example predictions (Nam/PUNCT) |
| Low | App B | Add diacritics to Vietnamese examples |
Evaluation Checklist
Methodology
- Research questions clearly stated
- Methods appropriate for research questions
- Baselines appropriate and fairly compared (different datasets)
- Statistical significance properly addressed (not reported)
- Limitations of approach acknowledged
Experiments
- Datasets properly described (source, size, splits, preprocessing)
- Evaluation metrics appropriate for the task
- Training details sufficient for reproduction
- Ablation studies or analysis provided
- Results support the claims made
Presentation
- Abstract accurately summarizes contributions
- Introduction motivates the problem
- Related work comprehensive and fair
- Figures/tables readable and informative
- Conclusion matches actual contributions (overstates pipeline completeness)
Related Work Verification
- Key prior work on same task is cited
- Baseline comparisons use current methods
- SOTA claims are accurate and up-to-date
- No significant missing references (AAAI 2025 missing)
- Fair characterization of competing approaches
Responsible NLP
- Limitations section present and substantive
- Potential negative impacts discussed (not addressed)
- Data collection ethics addressed (machine annotation)
- Bias considerations mentioned (domain bias not discussed)
Comparison with Previous Review (January 31, 2026)
The report has been substantially improved since the previous review:
| Issue from Previous Review | Status |
|---|---|
| Missing referenced visualizations (confusion_matrix.png) | Fixed -- references now present |
| Incorrect example predictions (Toi -> PROPN) | Partially fixed -- examples updated but Nam/PUNCT may still be incorrect |
| Limited baseline comparison | Fixed -- comprehensive comparison tables added (Sections 4.4, 5.6) |
| Missing PhoBERT, VnCoreNLP citations | Fixed -- both now cited with detailed discussion |
| Dataset not standard benchmark | Unchanged -- still UDD-1 only, VLSP 2013 evaluation remains the top priority |
| Machine-annotated data concern | Acknowledged -- listed in Limitations (Section 13.2) but not quantified |
| Scope limited to POS only | Expanded -- now covers WS (implemented) + chunking/DP (planned) |
New issues identified in this review:
- Missing AAAI 2025 citation (ViWordFormer)
- No feature ablation studies
- No inference speed measurements
- Planned tasks create misleading impression of completeness
Reference Links
- ViDeBERTa (EACL 2023) -- Verified 97.2% POS accuracy
- ViDeBERTa arXiv -- Table 2 confirms exact figures
- ViWordFormer (AAAI 2025) -- Challenges syllable-based WS
- NLP-Vietnamese-progress -- Benchmark tracking
- VnCoreNLP -- VnMarMoT 95.88%
- PhoBERT -- 96.8% POS accuracy
- UITws-v1 -- 98.06% WS F1
- PhoNLP (NAACL 2021) -- 85.47% UAS
- UDD-1 Dataset -- Training data
- VLSP 2013 Resources -- Standard benchmark
Review conducted following ACL Rolling Review guidelines. SOTA claims verified against published results (arXiv:2301.10439, NAACL 2021, PACLING 2019).