
Paper Review: Sen-1: Vietnamese Text Classification Model

Review Date: February 2, 2026
Reviewer Expertise: Vietnamese NLP, Text Classification
Review Type: ACL Rolling Review (ARR) Format


Paper Summary

This technical report describes Sen-1, a Vietnamese text classification model using TF-IDF vectorization combined with Linear SVM. The system is designed as a lightweight baseline compatible with the underthesea Vietnamese NLP toolkit API. The report documents the methodology, implementation details, and provides a demonstration release trained on a small sample dataset (60 training samples).
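For concreteness, the described approach corresponds to a standard scikit-learn pipeline. The sketch below is illustrative only: the hyperparameters and the toy data are placeholders, not the exact values from the report.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the Vietnamese news samples; labels are illustrative.
texts = ["bóng đá việt nam", "trận đấu tối nay",
         "giá vàng tăng", "thị trường chứng khoán giảm"]
labels = ["the_thao", "the_thao", "kinh_te", "kinh_te"]

# TF-IDF vectorization feeding a linear SVM, as in the report;
# ngram_range and C here are illustrative settings.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC(C=1.0)),
])
model.fit(texts, labels)
print(model.predict(["bóng đá việt nam"]))
```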


Summary of Strengths

  1. Clear documentation and reproducibility (Sections 3, 5, 7): The report provides comprehensive implementation details including exact hyperparameters (Table in Section 3.2-3.3), model file structure (Section 5.3), and complete usage examples with code snippets. This level of documentation enables straightforward reproduction.

  2. Practical resource contribution: The model is released on HuggingFace with Apache 2.0 license, providing the community with an accessible baseline. The lightweight design (~103 KB) makes it suitable for resource-constrained environments where transformer-based models are impractical.

  3. Honest limitations disclosure (Section 8): The authors transparently acknowledge the current release's limitations including the minimal training data (60 samples), lack of word segmentation, and single-label constraint. This intellectual honesty helps users set appropriate expectations.

  4. API design consideration (Section 5.2): The integration with the established underthesea ecosystem provides continuity for existing users and lowers the barrier to adoption.


Summary of Weaknesses

  1. Insufficient experimental evaluation (Section 6.2): The reported results (40% validation accuracy, 34.67% F1) are based on only 60 training and 20 validation samples. While the authors note this limitation, the paper would benefit from at least one experiment on the full VNTC dataset or a meaningful subset to demonstrate the method's actual effectiveness. The "Expected Sen-1" results in Table 6.4 (~93-95%) are speculative projections, not empirical measurements.

  2. Incomplete comparison with modern baselines: The related work (Section 2.2) mentions PhoBERT but provides no actual comparison. Given that PhoBERT-based models achieve 65.44% F1 on UIT-VSMEC and 87.35% on extended versions, and TextGraphFuseGAT achieves +4% over PhoBERT baselines, the paper lacks context for where TF-IDF+SVM fits in the current landscape.

  3. Missing word segmentation integration: Vietnamese text classification typically benefits significantly from proper word segmentation (as noted in PhoBERT's use of VnCoreNLP's RDRSegmenter). The paper acknowledges this gap but doesn't provide quantitative analysis of the performance impact, despite underthesea having word segmentation capabilities.

  4. Statistical rigor concerns: No confidence intervals, standard deviations, or statistical significance tests are reported. For the sample predictions (Table 6.3), only 5 cherry-picked examples are shown, all correct. This does not provide a representative view of model behavior.
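On weakness 3: the missing preprocessing step is conceptually simple. In practice one would call underthesea's own word_tokenize (which joins multi-syllable words with underscores, e.g. "học sinh" becomes "học_sinh") before TF-IDF. The sketch below uses a toy dictionary-based stub in place of the real segmenter so it is self-contained; the compound lexicon is purely illustrative.

```python
# Toy lexicon of two-syllable compounds; a real segmenter
# (e.g. underthesea.word_tokenize) would replace this stub.
COMPOUNDS = {("học", "sinh"), ("việt", "nam")}

def segment(text):
    """Greedily join known two-syllable compounds with underscores."""
    syllables = text.lower().split()
    out, i = [], 0
    while i < len(syllables):
        if i + 1 < len(syllables) and (syllables[i], syllables[i + 1]) in COMPOUNDS:
            out.append(syllables[i] + "_" + syllables[i + 1])
            i += 2
        else:
            out.append(syllables[i])
            i += 1
    return " ".join(out)

print(segment("Học sinh Việt Nam"))  # segmented, lowercased text for TF-IDF
```

Feeding segmented text to the vectorizer lets whitespace tokenization recover word-level (rather than syllable-level) features, which is the gain the review asks the authors to quantify.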


Scores

Soundness: 2/5 (Poor)

The methodology is technically correct but the experimental evaluation is insufficient. The core claims about model effectiveness cannot be verified with the provided 60-sample experiments. The "expected" benchmark results are projections without empirical backing.

Excitement: 2/5 (Somewhat Boring)

TF-IDF + SVM for text classification is a well-established baseline from nearly two decades ago. While having a documented Vietnamese baseline is useful, the contribution is incremental. The paper does not introduce novel techniques or provide new insights about Vietnamese text classification.

Overall Assessment: 2/5 (Resubmit next cycle)

The paper provides a useful resource but requires substantial revisions: (1) actual benchmark results on VNTC or other standard datasets, (2) comparison with at least one modern baseline (even a simple fine-tuned PhoBERT), and (3) ablation studies on design choices (e.g., with/without word segmentation, n-gram ranges).

Reproducibility: 4/5 (Could mostly reproduce)

The paper provides sufficient detail for reproduction: hyperparameters, dependencies, model files, and code examples. Minor variations may occur due to scikit-learn version differences. Code is publicly available on HuggingFace.

Confidence: 4/5 (High)

I am familiar with Vietnamese NLP and text classification methods. I have reviewed the current state-of-the-art and can assess this work's positioning.


Detailed Comments

Technical Soundness

The TF-IDF + Linear SVM pipeline is correctly implemented and the mathematical formulations (Sections 3.2-3.4) are accurate. However:

  • Confidence score calculation (Section 3.4): Using the absolute value of the decision function, |f(x)|, as a confidence score discards the sign, i.e., which side of the separating hyperplane the input lies on. This is acceptable as a confidence proxy, but the convention should be stated explicitly.

  • No cross-validation: With such limited data (60 samples), k-fold cross-validation would provide more robust estimates than a single train/validation split.

  • Class imbalance not addressed: The VNTC dataset has significant class imbalance (5,219 vs 1,820 samples across categories). The report doesn't discuss handling strategies.
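The three points above have off-the-shelf remedies in scikit-learn. The sketch below is illustrative, not the authors' code: toy data stands in for VNTC, class_weight="balanced" addresses imbalance, stratified k-fold replaces the single split, and confidence is computed as |f(x)| per Section 3.4's scheme.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

texts = ["bóng đá hay", "trận bóng đá", "đá bóng tối nay", "bóng đá việt nam",
         "giá vàng tăng", "giá vàng giảm", "vàng và chứng khoán", "giá cổ phiếu vàng"]
labels = ["the_thao"] * 4 + ["kinh_te"] * 4

# class_weight="balanced" reweights classes inversely to their frequency.
clf = make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced"))

# Stratified k-fold gives a mean and spread instead of one point estimate.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(clf, texts, labels, cv=cv, scoring="f1_macro")
print(f"macro-F1: {scores.mean():.2f} +/- {scores.std():.2f}")

# Confidence as |f(x)|, the scheme described in Section 3.4.
clf.fit(texts, labels)
confidence = abs(clf.decision_function(["bóng đá đêm nay"]))[0]
```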

Novelty and Contribution

This work is positioned as a "baseline" and "resource paper" rather than a methodological contribution. As such, the novelty expectations are lower. However:

  • The TF-IDF + SVM approach has been used for Vietnamese text classification since Vu et al. (2007)
  • The primary contribution is the packaged, documented implementation
  • No new insights about Vietnamese text classification are provided

Clarity and Presentation

The report is well-organized and clearly written:

  • Good use of tables and diagrams (Section 3.1 architecture diagram)
  • Comprehensive appendices with label mappings and model card
  • Code examples are helpful for practitioners

Minor issues:

  • The Abstract could be more specific about intended use cases
  • Section 9 (Future Work) is essentially a TODO list rather than research directions

Reproducibility Assessment

Positive aspects:

  • All hyperparameters documented
  • Model files available on HuggingFace
  • Dependencies listed with version constraints
  • Clear API documentation

Missing:

  • Random seeds for reproducibility
  • Training/validation split procedure details
  • Exact preprocessing steps applied to text
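The missing split details amount to a few lines the authors could add. A minimal sketch, with placeholder data and an illustrative seed value:

```python
from sklearn.model_selection import train_test_split

texts = [f"văn bản {i}" for i in range(20)]  # placeholder documents
labels = ["the_thao", "kinh_te"] * 10        # placeholder labels

SEED = 42  # the report should state this value
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=SEED)
print(len(X_train), len(X_val))
```

Reporting the seed, the test_size, and whether the split was stratified would let readers regenerate the exact 60/20 partition.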

Limitations and Ethics

The limitations section (Section 8) is present and covers major issues. Could be expanded:

  • No discussion of potential biases in news classification
  • No analysis of error types or failure modes
  • No discussion of environmental impact (though minimal for this model)

Related Work Research

Papers Found

| Paper | Year | Method | Results | Relevance |
| --- | --- | --- | --- | --- |
| PhoBERT | 2020 | RoBERTa for Vietnamese | SOTA on multiple tasks | Primary modern baseline missing |
| TextGraphFuseGAT | 2025 | PhoBERT + GAT | +4% F1 over PhoBERT | Shows current SOTA direction |
| SMTCE Benchmark | 2022 | Multiple BERTs | 65.44% F1 (VSMEC) | Comprehensive Vietnamese benchmark |
| ViSoBERT | 2023 | Social media BERT | Improved social text | Specialized Vietnamese model |
| Vu et al. | 2007 | SVM, N-gram | 97.1% (VNTC-10) | Original baseline cited |

Missing Citations

The following relevant works should be considered:

  1. Nguyen & Nguyen (2020) - PhoBERT paper: The primary Vietnamese pre-trained model that sets modern baselines
  2. Ho et al. (2020) - UIT-VSMEC: Standard emotion recognition corpus mentioned in future work but not cited
  3. Nguyen et al. (2022) - SMTCE benchmark: Comprehensive evaluation of Vietnamese text classification models
  4. VnCoreNLP - Word segmentation tool used by most Vietnamese NLP systems

SOTA Verification

  • Claimed: SVM Multi achieves 93.4% on VNTC-10 (Vu et al. 2007)
  • Actual: This result is from 2007. Modern PhoBERT-based approaches likely exceed this, though direct VNTC comparisons with transformers are limited in literature.
  • Assessment: The cited baseline is accurate but dated. The "expected" 93-95% projection is reasonable but unverified.

Questions for Authors

  1. Can you provide results on the full VNTC dataset (33,759 training samples) to validate the "expected" performance claims?

  2. What is the rationale for not incorporating underthesea's word segmentation, given it's already a dependency of the ecosystem?

  3. How does the model perform on out-of-domain text (e.g., social media vs. news articles)?

  4. Were alternative TF-IDF configurations (e.g., different n-gram ranges, vocabulary sizes) explored? If so, what guided the final hyperparameter choices?


Minor Issues

  1. Section 3.3: The SVM formulation shows soft-margin SVM but loss: hinge is listed as default. LinearSVC uses squared hinge by default; clarify if hinge loss is explicitly set.

  2. Table in Section 4.1: Train/test split ratios vary significantly across categories (e.g., Doi song: 3,159/2,036 vs Khoa hoc: 1,820/2,096). Is this the original VNTC split?

  3. Section 6.4: The table header says "Benchmark Comparison" but Sen-1 results are "Expected" not measured.

  4. Typo: Appendix B model size "~103 KB" - verify this is the actual size of the joblib files.

  5. Reference 2: GitHub repository is not a citable publication; consider citing the RIVF'07 paper for the dataset.
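On minor issue 1, the discrepancy is easy to check: scikit-learn's LinearSVC defaults to squared hinge, so hinge loss must be requested explicitly if that is what the report intends.

```python
from sklearn.svm import LinearSVC

clf_default = LinearSVC()            # defaults to loss="squared_hinge"
clf_hinge = LinearSVC(loss="hinge")  # the setting the report lists
print(clf_default.loss, clf_hinge.loss)
```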


Suggestions for Improvement

Essential (for resubmission)

  1. Train and evaluate on full VNTC: Report actual accuracy/F1 on the 33,759/50,373 train/test split to validate baseline claims.

  2. Add at least one modern baseline: Fine-tune PhoBERT-base on VNTC and compare. This contextualizes the resource contribution.

  3. Include ablation study: Test impact of (a) word segmentation, (b) n-gram range, (c) vocabulary size on performance.
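The n-gram and vocabulary-size parts of the suggested ablation map directly onto a grid search. A sketch with toy data and an illustrative parameter grid:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

texts = ["bóng đá hay", "trận bóng đá", "đá bóng tối nay", "bóng đá việt nam",
         "giá vàng tăng", "giá vàng giảm", "vàng và chứng khoán", "giá cổ phiếu vàng"]
labels = ["the_thao"] * 4 + ["kinh_te"] * 4

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
# Grid values are illustrative; on full VNTC one would use a wider grid
# and more folds. The segmentation ablation would toggle preprocessing.
search = GridSearchCV(
    pipe,
    {"tfidf__ngram_range": [(1, 1), (1, 2)],
     "tfidf__max_features": [1000, 5000]},
    cv=2, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_)
```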

Recommended

  1. Add error analysis: Examine which categories are confused, analyze failure cases to provide insights.

  2. Report statistical measures: Use k-fold cross-validation with standard deviations.

  3. Expand related work: Include post-2020 Vietnamese NLP advances (PhoBERT, ViSoBERT, SMTCE benchmark).

  4. Test on additional datasets: UIT-VSMEC or UIT-VSFC would show generalization beyond news classification.

Optional

  1. Add inference speed benchmarks: Quantify the speed advantage over transformer models.

  2. Provide pre-trained model on full VNTC: This would make the resource immediately useful.

  3. Include model interpretability analysis: Show top features per category to leverage SVM's interpretability advantage.


Final Recommendation

Decision: Major Revision Required

The paper provides a useful baseline resource for Vietnamese text classification but currently lacks sufficient experimental validation. The 60-sample demonstration is inadequate for assessing model quality. With experiments on the full VNTC dataset and comparison to at least one modern baseline, this could become a valuable resource paper for the Vietnamese NLP community.

The authors should:

  1. Train on full VNTC and report actual benchmark results
  2. Compare against PhoBERT baseline
  3. Include ablation studies
  4. Expand related work coverage

After these revisions, the paper would be suitable for a resource/system demonstration track at an *ACL venue.


Review completed following ACL Rolling Review guidelines