Sen-1

Sen-1 is a Vietnamese text classification model developed by UnderTheSea NLP.

Model Description

  • Model Type: CountVectorizer + TfidfTransformer + LinearSVC (sklearn pipeline)
  • Base Architecture: sonar_core_1 reproduction
  • Language: Vietnamese
  • License: Apache 2.0
  • Accuracy: 92.49% on VNTC benchmark
  • F1 Score: 92.40% (weighted)
  • Training Time: 37.6 seconds
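
The model type listed above can be sketched directly with scikit-learn. This is a minimal illustration, not the Sen-1 source: the hyperparameter values are taken from the "Train Your Own Model" section, while the helper name `build_pipeline`, the step names, and the toy training data are assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(max_features=20000, ngram_range=(1, 2), C=1.0):
    """Hypothetical helper: counts -> TF-IDF weighting -> linear SVM."""
    return Pipeline([
        # Count unigrams and bigrams, keeping the most frequent features
        ("counts", CountVectorizer(max_features=max_features,
                                   ngram_range=ngram_range)),
        # Re-weight raw counts by inverse document frequency
        ("tfidf", TfidfTransformer()),
        # Linear classifier over the sparse TF-IDF vectors
        ("svm", LinearSVC(C=C)),
    ])


# Toy data for illustration only (not from VNTC)
texts = [
    "Đội tuyển thắng trận bóng đá",
    "Cầu thủ ghi bàn đẹp",
    "Giá cổ phiếu tăng mạnh",
    "Ngân hàng giảm lãi suất",
]
labels = ["the_thao", "the_thao", "kinh_doanh", "kinh_doanh"]

pipeline = build_pipeline()
pipeline.fit(texts, labels)
predictions = pipeline.predict(["Cầu thủ ghi bàn"])
print(predictions)
```

Keeping all three stages in one `Pipeline` means the vocabulary and IDF weights learned at training time are reused unchanged at prediction time.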

VNTC Benchmark Results

Evaluated on the Vietnamese News Text Classification (VNTC) dataset:

Metric             Value
-----------------  -------
Accuracy           92.49%
F1 (weighted)      92.40%
F1 (macro)         90.44%
Training samples   33,759
Test samples       50,373
Categories         10
Training time      37.6s
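
The table reports both weighted and macro F1. The sketch below shows how the two averages differ, using the standard definitions in plain Python. The function names and the six-item toy data are assumptions for illustration, not part of the Sen-1 evaluation code.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean: every class counts equally."""
    return sum(scores.values()) / len(scores)


def weighted_f1(scores, y_true):
    """Mean weighted by the number of true samples per class."""
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy data: class "a" is larger and easier than class "b"
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "b", "b", "a"]

scores = f1_per_class(y_true, y_pred)
macro = macro_f1(scores)
weighted = weighted_f1(scores, y_true)
print(scores, macro, weighted)
```

In the toy data the larger class scores higher, so the weighted average exceeds the macro average, which is the same direction as the gap between 92.40% and 90.44% above.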

Per-Category Performance (VNTC)

Category                      F1-Score
----------------------------  --------
the_thao (Sports)             0.98
the_gioi (World)              0.95
vi_tinh (Technology)          0.95
suc_khoe (Health)             0.94
van_hoa (Culture)             0.94
kinh_doanh (Business)         0.92
phap_luat (Law)               0.92
chinh_tri_xa_hoi (Politics)   0.89
khoa_hoc (Science)            0.85
doi_song (Lifestyle)          0.72
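
As a quick consistency check, the per-category scores above can be averaged and compared with the reported macro F1 of 90.44%. The values here are the two-decimal figures from the table, so a small rounding gap from the reported figure is expected. The variable names are chosen for this example.

```python
# Per-category F1 scores copied from the table above (rounded to 2 decimals)
per_category_f1 = {
    "the_thao": 0.98,
    "the_gioi": 0.95,
    "vi_tinh": 0.95,
    "suc_khoe": 0.94,
    "van_hoa": 0.94,
    "kinh_doanh": 0.92,
    "phap_luat": 0.92,
    "chinh_tri_xa_hoi": 0.89,
    "khoa_hoc": 0.85,
    "doi_song": 0.72,
}

# Macro F1 is the unweighted mean over categories
macro = sum(per_category_f1.values()) / len(per_category_f1)

# The weakest category pulls the macro average down the most
lowest = min(per_category_f1, key=per_category_f1.get)

print(f"macro F1 from rounded scores: {macro:.3f}")  # 0.906
print(f"lowest category: {lowest}")                  # doi_song
```

The result, 0.906, agrees with the reported 90.44% to within the rounding of the table entries.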

Reference

Based on: "A Comparative Study on Vietnamese Text Classification Methods"

  • Authors: Cong Duy Vu Hoang, Dien Dinh, Le Nguyen Nguyen, Quoc Hung Ngo
  • Published: IEEE RIVF 2007
  • Paper: IEEE Xplore
  • Dataset: VNTC GitHub

Installation

pip install scikit-learn joblib huggingface_hub

Usage (Pre-trained Model)

from huggingface_hub import snapshot_download
from sen import SenTextClassifier, Sentence

# Download pre-trained model (VNTC benchmark)
model_path = snapshot_download(
    'undertheseanlp/sen-1',
    allow_patterns=['models/sen-general-1.0.0-20260202/*']
)

# Load model
classifier = SenTextClassifier.load(f'{model_path}/models/sen-general-1.0.0-20260202')

# Predict
sentence = Sentence("Đội tuyển Việt Nam thắng 3-0")
classifier.predict(sentence)
print(sentence.labels)  # [the_thao (0.89)]

Train Your Own Model

from sen import SenTextClassifier, Sentence

# Initialize classifier (sonar_core_1 config)
classifier = SenTextClassifier(
    max_features=20000,
    ngram_range=(1, 2),
    C=1.0,
)

# Train
train_texts = ["Sản phẩm rất tốt", "Hàng tệ quá"]
train_labels = ["positive", "negative"]
classifier.train(train_texts, train_labels)

# Predict
sentence = Sentence("Chất lượng tuyệt vời!")
classifier.predict(sentence)
print(sentence.labels)  # [positive (0.85)]

# Save/Load
classifier.save("./my-model")
loaded = SenTextClassifier.load("./my-model")

API (compatible with underthesea)

from sen import Sentence, Label, SenTextClassifier

# Sentence class
# (`classifier` is a loaded or trained SenTextClassifier, as in the sections above)
sentence = Sentence("Sản phẩm rất tốt")
classifier.predict(sentence)
print(sentence.labels)  # List[Label]

# Label class
label = Label("positive", 0.95)
print(label.value)  # "positive"
print(label.score)  # 0.95

Model Versions

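Version                             Dataset                Classes  Accuracy  Training Time  Notes
----------------------------------  ---------------------  -------  --------  -------------  -------------------
models/sen-general-1.0.0-20260202   VNTC (33,759)          10       92.49%    37.6s          News classification
sen-bank-1.0.0-20260202             UTS2017_Bank (1,581)   14       75.76%    0.13s          Banking domain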

Comparison with sonar_core_1

Dataset        sonar_core_1  Sen-1    Difference (percentage points)
-------------  ------------  -------  ------------------------------
VNTC (News)    92.80%        92.49%   -0.31
UTS2017_Bank   72.47%        75.76%   +3.29

Inference Speed Benchmark

Comparison with Underthesea 9.2.8:

Model               Single Inference  Throughput
------------------  ----------------  ------------------
Sen-1               0.465 ms          66,678 samples/sec
Underthesea 9.2.8   0.615 ms          1,617 samples/sec

Speedup: 1.3x (single) / 41x (batch throughput)
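
The speedup figures follow directly from the table. The short calculation below reproduces them. The numbers are copied from the table, and the variable names are chosen for this example.

```python
# Figures copied from the benchmark table above
sen_latency_ms = 0.465
underthesea_latency_ms = 0.615
sen_throughput = 66_678          # samples/sec
underthesea_throughput = 1_617   # samples/sec

# Lower latency is better, so divide the slower by the faster
single_speedup = underthesea_latency_ms / sen_latency_ms

# Higher throughput is better, so divide the faster by the slower
batch_speedup = sen_throughput / underthesea_throughput

print(f"single: {single_speedup:.1f}x, batch: {batch_speedup:.0f}x")
# single: 1.3x, batch: 41x
```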

Sen-1 supports batch processing, which gives it about 41x the throughput of Underthesea 9.2.8 for bulk classification in this benchmark.
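
A benchmark of this kind can be sketched with the standard library alone. The harness below is an assumption about how such numbers might be measured, not the script used for the table. `predict_one` and `predict_batch` are hypothetical callables standing in for whatever single-item and batch prediction functions a model exposes.

```python
import time


def benchmark(predict_one, predict_batch, texts, repeats=5):
    """Return (mean single-call latency in ms, batch throughput in samples/sec)."""
    # Single-item latency: call the model once per text
    start = time.perf_counter()
    for _ in range(repeats):
        for text in texts:
            predict_one(text)
    single_ms = (time.perf_counter() - start) * 1000 / (repeats * len(texts))

    # Batch throughput: hand the whole list to the model in one call
    start = time.perf_counter()
    for _ in range(repeats):
        predict_batch(texts)
    elapsed = time.perf_counter() - start
    throughput = repeats * len(texts) / elapsed

    return single_ms, throughput


# Stand-in "model" for illustration: counts whitespace-separated tokens
def fake_predict_one(text):
    return len(text.split())


def fake_predict_batch(batch):
    return [len(text.split()) for text in batch]


sample_texts = ["Đội tuyển Việt Nam thắng 3-0"] * 1000
single_ms, throughput = benchmark(fake_predict_one, fake_predict_batch, sample_texts)
print(f"{single_ms:.4f} ms per call, {throughput:,.0f} samples/sec")
```

Measuring both modes separately matters because per-call overhead dominates single-item latency, while batch calls amortize it across many samples.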

Citation

@inproceedings{vu2007comparative,
  title={A Comparative Study on Vietnamese Text Classification Methods},
  author={Hoang, Cong Duy Vu and Dinh, Dien and Nguyen, Le Nguyen and Ngo, Quoc Hung},
  booktitle={IEEE International Conference on Research, Innovation and Vision for the Future},
  pages={267--273},
  year={2007},
  organization={IEEE}
}

Technical Report

See TECHNICAL_REPORT.md for detailed methodology and evaluation.
