---
license: apache-2.0
language:
- vi
tags:
- text-classification
- vietnamese
- sklearn
- tfidf
- svm
library_name: sklearn
pipeline_tag: text-classification
metrics:
- accuracy
- f1
datasets:
- VNTC
---

# Sen-1

Sen-1 is a Vietnamese text classification model developed by UnderTheSea NLP.

## Model Description

- **Model Type:** CountVectorizer + TfidfTransformer + LinearSVC (sklearn pipeline)
- **Base Architecture:** sonar_core_1 reproduction
- **Language:** Vietnamese
- **License:** Apache 2.0
- **Accuracy:** 92.49% on VNTC benchmark
- **F1 Score:** 92.40% (weighted)
- **Training Time:** 37.6 seconds

## VNTC Benchmark Results

Evaluated on the Vietnamese News Text Classification (VNTC) dataset:

| Metric | Value |
|--------|-------|
| **Accuracy** | **92.49%** |
| **F1 (weighted)** | **92.40%** |
| F1 (macro) | 90.44% |
| Training samples | 33,759 |
| Test samples | 50,373 |
| Categories | 10 |
| **Training time** | **37.6s** |

### Per-Category Performance (VNTC)

| Category | F1-Score |
|----------|----------|
| the_thao (Sports) | 0.98 |
| the_gioi (World) | 0.95 |
| vi_tinh (Technology) | 0.95 |
| suc_khoe (Health) | 0.94 |
| van_hoa (Culture) | 0.94 |
| kinh_doanh (Business) | 0.92 |
| phap_luat (Law) | 0.92 |
| chinh_tri_xa_hoi (Politics) | 0.89 |
| khoa_hoc (Science) | 0.85 |
| doi_song (Lifestyle) | 0.72 |

## Reference

Based on: **"A Comparative Study on Vietnamese Text Classification Methods"**

- Authors: Cong Duy Vu Hoang, Dien Dinh, Le Nguyen Nguyen, Quoc Hung Ngo
- Published: IEEE RIVF 2007
- Paper: [IEEE Xplore](https://ieeexplore.ieee.org/document/4223084/)
- Dataset: [VNTC GitHub](https://github.com/duyvuleo/VNTC)

## Installation

```bash
pip install scikit-learn joblib huggingface_hub
```

## Usage (Pre-trained Model)

```python
from huggingface_hub import snapshot_download
from sen import SenTextClassifier, Sentence

# Download pre-trained model (VNTC benchmark)
model_path = snapshot_download(
    'undertheseanlp/sen-1',
    allow_patterns=['models/sen-general-1.0.0-20260202/*']
)

# Load model
classifier = SenTextClassifier.load(f'{model_path}/models/sen-general-1.0.0-20260202')

# Predict
sentence = Sentence("Đội tuyển Việt Nam thắng 3-0")
classifier.predict(sentence)
print(sentence.labels)  # [the_thao (0.89)]
```

## Train Your Own Model

```python
from sen import SenTextClassifier, Sentence

# Initialize classifier (sonar_core_1 config)
classifier = SenTextClassifier(
    max_features=20000,
    ngram_range=(1, 2),
    C=1.0,
)

# Train
train_texts = ["Sản phẩm rất tốt", "Hàng tệ quá"]
train_labels = ["positive", "negative"]
classifier.train(train_texts, train_labels)

# Predict
sentence = Sentence("Chất lượng tuyệt vời!")
classifier.predict(sentence)
print(sentence.labels)  # [positive (0.85)]

# Save/Load
classifier.save("./my-model")
loaded = SenTextClassifier.load("./my-model")
```

## API (compatible with underthesea)

```python
from sen import Sentence, Label, SenTextClassifier

# Sentence class
# `classifier` is a trained or loaded SenTextClassifier (see the sections above)
sentence = Sentence("Sản phẩm rất tốt")
classifier.predict(sentence)
print(sentence.labels)  # List[Label]

# Label class
label = Label("positive", 0.95)
print(label.value)  # "positive"
print(label.score)  # 0.95
```

## Model Versions

| Version | Dataset | Classes | Accuracy | Training Time | Notes |
|---------|---------|---------|----------|---------------|-------|
| models/sen-general-1.0.0-20260202 | VNTC (33,759) | 10 | 92.49% | 37.6s | News classification |
| sen-bank-1.0.0-20260202 | UTS2017_Bank (1,581) | 14 | **75.76%** | 0.13s | Banking domain |

### Comparison with sonar_core_1

Differences are in percentage points (Sen-1 minus sonar_core_1).

| Dataset | sonar_core_1 | Sen-1 | Difference |
|---------|--------------|-------|------------|
| VNTC (News) | 92.80% | 92.49% | -0.31 pp |
| UTS2017_Bank | 72.47% | **75.76%** | **+3.29 pp** |

### Inference Speed Benchmark

Comparison with Underthesea 9.2.8:

| Model | Single Inference | Throughput |
|-------|------------------|------------|
| **Sen-1** | **0.465 ms** | **66,678 samples/sec** |
| Underthesea 9.2.8 | 0.615 ms | 1,617 samples/sec |

**Speedup:** 1.3x (single) / **41x** (batch throughput)

Sen-1 supports batch processing, which makes it about 41x faster than Underthesea 9.2.8 for bulk classification in this benchmark.

## Citation

```bibtex
@inproceedings{vu2007comparative,
  title={A Comparative Study on Vietnamese Text Classification Methods},
  author={Hoang, Cong Duy Vu and Dien, Dinh and Nguyen, Le Nguyen and Ngo, Quoc Hung},
  booktitle={IEEE International Conference on Research, Innovation and Vision for the Future},
  pages={267--273},
  year={2007},
  organization={IEEE}
}
```

## Technical Report

See [TECHNICAL_REPORT.md](TECHNICAL_REPORT.md) for detailed methodology and evaluation.
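
## Pipeline Sketch (Illustrative)

The model type listed in the Model Description (CountVectorizer + TfidfTransformer + LinearSVC) can be approximated in plain scikit-learn. The following is a minimal sketch, not the `sen` package's actual implementation: the `build_pipeline` helper and the step names are hypothetical, the hyperparameters are copied from the training example in this card, and the four toy sentences are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(max_features=20000, ngram_range=(1, 2), C=1.0):
    """Hypothetical helper: bag-of-n-grams counts -> TF-IDF weights -> linear SVM."""
    return Pipeline([
        # Count unigrams and bigrams, keeping at most `max_features` terms
        ("count", CountVectorizer(max_features=max_features, ngram_range=ngram_range)),
        # Re-weight raw counts by inverse document frequency
        ("tfidf", TfidfTransformer()),
        # Linear support vector classifier with regularization strength C
        ("svm", LinearSVC(C=C)),
    ])


# Toy data for illustration only (labels borrowed from the VNTC category names)
train_texts = [
    "Đội tuyển Việt Nam thắng trận bóng đá",
    "Cầu thủ ghi bàn trong trận chung kết",
    "Ngân hàng tăng lãi suất tiền gửi",
    "Thị trường chứng khoán giảm điểm",
]
train_labels = ["the_thao", "the_thao", "kinh_doanh", "kinh_doanh"]

pipeline = build_pipeline()
pipeline.fit(train_texts, train_labels)

# `predict` accepts a list of raw strings, so batches are classified in one call
predicted = pipeline.predict(["Đội tuyển Việt Nam thắng 3-0"])
print(predicted[0])
```

With only four training sentences the prediction itself is not meaningful; the point is the shape of the pipeline and the fact that `predict` takes a whole batch of texts at once.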