---
license: apache-2.0
language:
- vi
tags:
- text-classification
- vietnamese
- sklearn
- tfidf
- svm
library_name: sklearn
pipeline_tag: text-classification
metrics:
- accuracy
- f1
datasets:
- VNTC
---

# Sen-1

Sen-1 is a Vietnamese text classification model developed by UnderTheSea NLP.

## Model Description

- **Model Type:** CountVectorizer + TfidfTransformer + LinearSVC (sklearn pipeline)
- **Base Architecture:** sonar_core_1 reproduction
- **Language:** Vietnamese
- **License:** Apache 2.0
- **Accuracy:** 92.49% on VNTC benchmark
- **F1 Score:** 92.40% (weighted)
- **Training Time:** 37.6 seconds

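The model type listed above corresponds to a three-step scikit-learn pipeline. The following is a minimal sketch of such a pipeline, not the `sen` package's actual source: the helper name `build_pipeline`, the step names, and the toy training data are assumptions made for illustration, and the default hyperparameters mirror the values shown in the training example of this card.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(max_features=20000, ngram_range=(1, 2), C=1.0):
    """Hypothetical helper: bag-of-n-grams counts -> TF-IDF weights -> linear SVM."""
    return Pipeline([
        ("vect", CountVectorizer(max_features=max_features, ngram_range=ngram_range)),
        ("tfidf", TfidfTransformer()),
        ("clf", LinearSVC(C=C)),
    ])


# Toy data for illustration only (not taken from VNTC)
texts = [
    "Đội tuyển bóng đá thắng trận",
    "Cầu thủ ghi bàn trong trận đấu",
    "Ngân hàng tăng lãi suất",
    "Giá cổ phiếu giảm mạnh",
]
labels = ["the_thao", "the_thao", "kinh_doanh", "kinh_doanh"]

pipeline = build_pipeline()
pipeline.fit(texts, labels)
predicted = pipeline.predict(["Trận đấu bóng đá"])
print(predicted)
```

Keeping the vectorizer, the TF-IDF transform, and the classifier in one `Pipeline` object means the same preprocessing is applied at training and prediction time, and the whole model can be serialized as a single artifact.
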
## VNTC Benchmark Results

Evaluated on the Vietnamese News Text Classification (VNTC) dataset:

| Metric | Value |
|--------|-------|
| **Accuracy** | **92.49%** |
| **F1 (weighted)** | **92.40%** |
| F1 (macro) | 90.44% |
| Training samples | 33,759 |
| Test samples | 50,373 |
| Categories | 10 |
| **Training time** | **37.6s** |

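The table reports accuracy together with weighted and macro F1. As a rough illustration of how these three numbers relate, the sketch below computes them with scikit-learn on a tiny made-up label set (the labels and predictions are assumptions, not VNTC data). Weighted F1 averages per-class F1 by class frequency, while macro F1 gives every class equal weight, so a weak score on a small class lowers macro F1 more than weighted F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up ground truth and predictions: 4 samples of a large class, 2 of a small one
y_true = ["the_thao"] * 4 + ["doi_song"] * 2
y_pred = ["the_thao"] * 4 + ["the_thao", "doi_song"]

accuracy = accuracy_score(y_true, y_pred)
f1_weighted = f1_score(y_true, y_pred, average="weighted")
f1_macro = f1_score(y_true, y_pred, average="macro")

print(f"accuracy={accuracy:.4f} weighted={f1_weighted:.4f} macro={f1_macro:.4f}")
# accuracy=0.8333 weighted=0.8148 macro=0.7778
```

The same ordering (macro below weighted) appears in the reported results, which would be consistent with the weaker categories being smaller ones, although this card does not list per-category sample counts.
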
### Per-Category Performance (VNTC)

| Category | F1-Score |
|----------|----------|
| the_thao (Sports) | 0.98 |
| the_gioi (World) | 0.95 |
| vi_tinh (Technology) | 0.95 |
| suc_khoe (Health) | 0.94 |
| van_hoa (Culture) | 0.94 |
| kinh_doanh (Business) | 0.92 |
| phap_luat (Law) | 0.92 |
| chinh_tri_xa_hoi (Politics) | 0.89 |
| khoa_hoc (Science) | 0.85 |
| doi_song (Lifestyle) | 0.72 |

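Macro F1 is the unweighted mean of the per-category F1 scores, so the table above can be sanity-checked against the reported macro F1 of 90.44%. The short check below averages the ten rounded values; because each table entry is rounded to two decimals, the result is only expected to match within about 0.005.

```python
# Per-category F1 scores copied from the table above (rounded to two decimals)
per_category_f1 = {
    "the_thao": 0.98,
    "the_gioi": 0.95,
    "vi_tinh": 0.95,
    "suc_khoe": 0.94,
    "van_hoa": 0.94,
    "kinh_doanh": 0.92,
    "phap_luat": 0.92,
    "chinh_tri_xa_hoi": 0.89,
    "khoa_hoc": 0.85,
    "doi_song": 0.72,
}

macro_from_table = round(sum(per_category_f1.values()) / len(per_category_f1), 4)
reported_macro = 0.9044

print(macro_from_table)  # 0.906
print(abs(macro_from_table - reported_macro) <= 0.005)  # True
```
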
## Reference

Based on: **"A Comparative Study on Vietnamese Text Classification Methods"**
- Authors: Cong Duy Vu Hoang, Dien Dinh, Le Nguyen Nguyen, Quoc Hung Ngo
- Published: IEEE RIVF 2007
- Paper: [IEEE Xplore](https://ieeexplore.ieee.org/document/4223084/)
- Dataset: [VNTC GitHub](https://github.com/duyvuleo/VNTC)

## Installation

```bash
pip install scikit-learn joblib huggingface_hub
```

## Usage (Pre-trained Model)

```python
from huggingface_hub import snapshot_download
from sen import SenTextClassifier, Sentence

# Download pre-trained model (VNTC benchmark)
model_path = snapshot_download(
    'undertheseanlp/sen-1',
    allow_patterns=['models/sen-general-1.0.0-20260202/*']
)

# Load model
classifier = SenTextClassifier.load(f'{model_path}/models/sen-general-1.0.0-20260202')

# Predict
sentence = Sentence("Đội tuyển Việt Nam thắng 3-0")
classifier.predict(sentence)
print(sentence.labels)  # [the_thao (0.89)]
```

## Train Your Own Model

```python
from sen import SenTextClassifier, Sentence

# Initialize classifier (sonar_core_1 config)
classifier = SenTextClassifier(
    max_features=20000,
    ngram_range=(1, 2),
    C=1.0,
)

# Train
train_texts = ["Sản phẩm rất tốt", "Hàng tệ quá"]
train_labels = ["positive", "negative"]
classifier.train(train_texts, train_labels)

# Predict
sentence = Sentence("Chất lượng tuyệt vời!")
classifier.predict(sentence)
print(sentence.labels)  # [positive (0.85)]

# Save/Load
classifier.save("./my-model")
loaded = SenTextClassifier.load("./my-model")
```

## API (compatible with underthesea)

```python
from sen import Sentence, Label, SenTextClassifier

# Any trained or loaded classifier works here; this one is the model
# saved in the training example
classifier = SenTextClassifier.load("./my-model")

# Sentence class
sentence = Sentence("Sản phẩm rất tốt")
classifier.predict(sentence)
print(sentence.labels)  # List[Label]

# Label class
label = Label("positive", 0.95)
print(label.value)  # "positive"
print(label.score)  # 0.95
```

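To make the shape of these objects concrete, here is a hypothetical sketch of two containers that behave like the `Label` and `Sentence` classes used above. It is not the `sen` package's implementation: the class names `SketchLabel` and `SketchSentence` are invented for illustration, and the `value (score)` display format is inferred from the output comments in this card's examples.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(repr=False)
class SketchLabel:
    """Hypothetical stand-in for a label: a class name plus a confidence score."""
    value: str
    score: float

    def __repr__(self) -> str:
        # Assumed display format, e.g. "the_thao (0.89)"
        return f"{self.value} ({self.score:.2f})"


@dataclass
class SketchSentence:
    """Hypothetical stand-in for a sentence: raw text plus predicted labels."""
    text: str
    labels: List[SketchLabel] = field(default_factory=list)


sentence = SketchSentence("Đội tuyển Việt Nam thắng 3-0")
# A classifier's predict() would attach labels in place, roughly like this:
sentence.labels.append(SketchLabel("the_thao", 0.89))
print(sentence.labels)  # [the_thao (0.89)]
```

Attaching predictions to the sentence object, rather than returning them, matches the call pattern shown in the examples, where `classifier.predict(sentence)` is followed by reading `sentence.labels`.
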
## Model Versions

| Version | Dataset | Classes | Accuracy | Training Time | Notes |
|---------|---------|---------|----------|---------------|-------|
| models/sen-general-1.0.0-20260202 | VNTC (33,759) | 10 | 92.49% | 37.6s | News classification |
| sen-bank-1.0.0-20260202 | UTS2017_Bank (1,581) | 14 | **75.76%** | 0.13s | Banking domain |

### Comparison with sonar_core_1

| Dataset | sonar_core_1 | Sen-1 | Difference (percentage points) |
|---------|--------------|-------|--------------------------------|
| VNTC (News) | 92.80% | 92.49% | -0.31 |
| UTS2017_Bank | 72.47% | **75.76%** | **+3.29** |

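The difference column is the absolute gap between two accuracy percentages, that is, a difference in percentage points rather than a relative change. The snippet below recomputes both views from the accuracies in the table; the relative figures are derived here and are not reported in the original comparison.

```python
# (sonar_core_1 accuracy, Sen-1 accuracy) in percent, copied from the table above
results = {
    "VNTC (News)": (92.80, 92.49),
    "UTS2017_Bank": (72.47, 75.76),
}

summary = {}
for dataset, (sonar, sen) in results.items():
    diff_pp = round(sen - sonar, 2)                    # absolute gap, percentage points
    relative = round((sen - sonar) / sonar * 100, 2)   # change relative to sonar_core_1
    summary[dataset] = (diff_pp, relative)
    print(f"{dataset}: {diff_pp:+.2f} pp ({relative:+.2f}% relative)")

# VNTC (News): -0.31 pp (-0.33% relative)
# UTS2017_Bank: +3.29 pp (+4.54% relative)
```
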
### Inference Speed Benchmark

Comparison with Underthesea 9.2.8:

| Model | Single Inference | Throughput |
|-------|------------------|------------|
| **Sen-1** | **0.465 ms** | **66,678 samples/sec** |
| Underthesea 9.2.8 | 0.615 ms | 1,617 samples/sec |

**Speedup:** 1.3x (single) / **41x** (batch throughput)

Sen-1 supports batch processing, which makes it substantially faster for bulk classification: about 41x the throughput of Underthesea 9.2.8 in this benchmark.

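This card does not describe how the timings were measured, so the following is only a sketch of one way to compare per-sample latency with batch throughput. It uses a plain scikit-learn pipeline as a stand-in for Sen-1 (an assumption), with toy training data, and it will produce different numbers from the table on any given machine. The last two lines recompute the reported speedups from the table values.

```python
import time

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Stand-in model trained on toy data (illustration only)
pipeline = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC()),
]).fit(
    ["Đội tuyển thắng trận", "Cầu thủ ghi bàn", "Ngân hàng tăng lãi suất", "Cổ phiếu giảm"],
    ["the_thao", "the_thao", "kinh_doanh", "kinh_doanh"],
)

samples = ["Đội tuyển Việt Nam thắng 3-0"] * 2000

# Single inference: one predict() call per sample
start = time.perf_counter()
for text in samples:
    pipeline.predict([text])
single_ms = (time.perf_counter() - start) / len(samples) * 1000

# Batch inference: one predict() call for the whole list
start = time.perf_counter()
pipeline.predict(samples)
batch_throughput = len(samples) / (time.perf_counter() - start)

print(f"single: {single_ms:.3f} ms/sample, batch: {batch_throughput:,.0f} samples/sec")

# Speedups implied by the reported table values
reported_single_speedup = 0.615 / 0.465   # about 1.3x
reported_batch_speedup = 66678 / 1617     # about 41x
```

Batching helps because vectorization and the linear decision function operate on one sparse matrix for all samples, which amortizes the per-call overhead that dominates single-sample prediction.
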
## Citation

```bibtex
@inproceedings{vu2007comparative,
  title={A Comparative Study on Vietnamese Text Classification Methods},
  author={Hoang, Cong Duy Vu and Dien, Dinh and Nguyen, Le Nguyen and Ngo, Quoc Hung},
  booktitle={IEEE International Conference on Research, Innovation and Vision for the Future},
  pages={267--273},
  year={2007},
  organization={IEEE}
}
```

## Technical Report

See [TECHNICAL_REPORT.md](TECHNICAL_REPORT.md) for detailed methodology and evaluation.