---
license: apache-2.0
language:
- vi
tags:
- text-classification
- vietnamese
- sklearn
- tfidf
- svm
library_name: sklearn
pipeline_tag: text-classification
metrics:
- accuracy
- f1
datasets:
- VNTC
---
# Sen-1
Sen-1 is a Vietnamese text classification model developed by UnderTheSea NLP.
## Model Description
- **Model Type:** CountVectorizer + TfidfTransformer + LinearSVC (sklearn pipeline)
- **Base Architecture:** sonar_core_1 reproduction
- **Language:** Vietnamese
- **License:** Apache 2.0
- **Accuracy:** 92.49% on VNTC benchmark
- **F1 Score:** 92.40% (weighted)
- **Training Time:** 37.6 seconds
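The three stages named under **Model Type** map directly onto scikit-learn classes. The following is a minimal sketch, assuming the hyperparameter values shown in the training example later in this README (`max_features=20000`, `ngram_range=(1, 2)`, `C=1.0`). The helper name `build_pipeline`, the step names, and the toy sentences are illustrative assumptions, not the actual Sen-1 implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(max_features=20000, ngram_range=(1, 2), C=1.0):
    """Illustrative CountVectorizer -> TfidfTransformer -> LinearSVC pipeline.

    The function name and step names are hypothetical; only the three
    scikit-learn classes come from the model description.
    """
    return Pipeline([
        ("count", CountVectorizer(max_features=max_features, ngram_range=ngram_range)),
        ("tfidf", TfidfTransformer()),
        ("svm", LinearSVC(C=C)),
    ])


# Toy data for illustration only; these sentences are not drawn from VNTC.
texts = [
    "đội tuyển thắng trận bóng đá",
    "cầu thủ ghi bàn trong trận đấu",
    "ngân hàng tăng lãi suất tiền gửi",
    "thị trường chứng khoán giảm điểm",
]
labels = ["the_thao", "the_thao", "kinh_doanh", "kinh_doanh"]

pipeline = build_pipeline()
pipeline.fit(texts, labels)

# Classify a new sentence that shares vocabulary with the sports examples.
prediction = pipeline.predict(["trận bóng đá tối nay"])
print(prediction[0])
```

Splitting counting and TF-IDF weighting into separate steps, rather than using `TfidfVectorizer`, mirrors the model description and leaves the raw term counts available for inspection.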
## VNTC Benchmark Results
Evaluated on the Vietnamese News Text Classification (VNTC) dataset:
| Metric | Value |
|--------|-------|
| **Accuracy** | **92.49%** |
| **F1 (weighted)** | **92.40%** |
| F1 (macro) | 90.44% |
| Training samples | 33,759 |
| Test samples | 50,373 |
| Categories | 10 |
| **Training time** | **37.6s** |
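The table reports both weighted and macro F1, and the two can diverge when a small class scores worse than a large one. The sketch below shows how the three metrics could be computed with `sklearn.metrics`. The label lists are hypothetical and hand-built so the arithmetic is easy to follow; they are not VNTC predictions.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth and predictions: four "the_thao" samples and
# two "doi_song" samples, with one "the_thao" sample misclassified.
y_true = ["the_thao", "the_thao", "the_thao", "the_thao", "doi_song", "doi_song"]
y_pred = ["the_thao", "the_thao", "the_thao", "doi_song", "doi_song", "doi_song"]

accuracy = accuracy_score(y_true, y_pred)

# Weighted F1 averages per-class F1 scores in proportion to class support.
f1_weighted = f1_score(y_true, y_pred, average="weighted")

# Macro F1 averages per-class F1 scores with equal weight per class.
f1_macro = f1_score(y_true, y_pred, average="macro")

print(f"accuracy={accuracy:.4f} weighted={f1_weighted:.4f} macro={f1_macro:.4f}")
```

In this toy case the smaller class has the lower F1, so the macro average falls below the weighted one, the same ordering seen in the table above.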
### Per-Category Performance (VNTC)
| Category | F1-Score |
|----------|----------|
| the_thao (Sports) | 0.98 |
| the_gioi (World) | 0.95 |
| vi_tinh (Technology) | 0.95 |
| suc_khoe (Health) | 0.94 |
| van_hoa (Culture) | 0.94 |
| kinh_doanh (Business) | 0.92 |
| phap_luat (Law) | 0.92 |
| chinh_tri_xa_hoi (Politics) | 0.89 |
| khoa_hoc (Science) | 0.85 |
| doi_song (Lifestyle) | 0.72 |
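As a quick consistency check, macro F1 is the unweighted mean of the per-category scores, so the ten values above should average to roughly the reported 90.44%. The sketch below does that arithmetic. The variable names are illustrative, and the tolerance assumes each table value is rounded to two decimals.

```python
# Per-category F1 scores copied from the table above.
per_category_f1 = {
    "the_thao": 0.98,
    "the_gioi": 0.95,
    "vi_tinh": 0.95,
    "suc_khoe": 0.94,
    "van_hoa": 0.94,
    "kinh_doanh": 0.92,
    "phap_luat": 0.92,
    "chinh_tri_xa_hoi": 0.89,
    "khoa_hoc": 0.85,
    "doi_song": 0.72,
}

# Macro F1 gives every category equal weight.
macro_from_table = sum(per_category_f1.values()) / len(per_category_f1)

# Reported macro F1 from the benchmark results table.
reported_macro = 0.9044

# Each table entry is rounded to two decimals, so the mean can be off
# by at most 0.005 from the unrounded value.
difference = abs(macro_from_table - reported_macro)
print(f"macro from table: {macro_from_table:.4f}, difference: {difference:.4f}")
```

The rounded values average to about 0.906, which agrees with the reported macro F1 within rounding error. The low `doi_song` score is the main reason macro F1 sits below weighted F1.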
## Reference
Based on: **"A Comparative Study on Vietnamese Text Classification Methods"**
- Authors: Cong Duy Vu Hoang, Dien Dinh, Le Nguyen Nguyen, Quoc Hung Ngo
- Published: IEEE RIVF 2007
- Paper: [IEEE Xplore](https://ieeexplore.ieee.org/document/4223084/)
- Dataset: [VNTC GitHub](https://github.com/duyvuleo/VNTC)
## Installation
```bash
pip install scikit-learn joblib huggingface_hub
```
## Usage (Pre-trained Model)
```python
from huggingface_hub import snapshot_download
from sen import SenTextClassifier, Sentence

# Download pre-trained model (VNTC benchmark)
model_path = snapshot_download(
    'undertheseanlp/sen-1',
    allow_patterns=['models/sen-general-1.0.0-20260202/*']
)

# Load model
classifier = SenTextClassifier.load(f'{model_path}/models/sen-general-1.0.0-20260202')

# Predict
sentence = Sentence("Đội tuyển Việt Nam thắng 3-0")
classifier.predict(sentence)
print(sentence.labels)  # [the_thao (0.89)]
```
## Train Your Own Model
```python
from sen import SenTextClassifier, Sentence

# Initialize classifier (sonar_core_1 config)
classifier = SenTextClassifier(
    max_features=20000,
    ngram_range=(1, 2),
    C=1.0,
)

# Train
train_texts = ["Sản phẩm rất tốt", "Hàng tệ quá"]
train_labels = ["positive", "negative"]
classifier.train(train_texts, train_labels)

# Predict
sentence = Sentence("Chất lượng tuyệt vời!")
classifier.predict(sentence)
print(sentence.labels)  # [positive (0.85)]

# Save/Load
classifier.save("./my-model")
loaded = SenTextClassifier.load("./my-model")
```
## API (compatible with underthesea)
```python
from sen import Sentence, Label, SenTextClassifier
# Sentence class
# (`classifier` is a trained or loaded SenTextClassifier, as in the sections above)
sentence = Sentence("Sản phẩm rất tốt")
classifier.predict(sentence)
print(sentence.labels)  # List[Label]
# Label class
label = Label("positive", 0.95)
print(label.value) # "positive"
print(label.score) # 0.95
```
## Model Versions
| Version | Dataset | Classes | Accuracy | Training Time | Notes |
|---------|---------|---------|----------|---------------|-------|
| sen-general-1.0.0-20260202 | VNTC (33,759) | 10 | 92.49% | 37.6s | News classification |
| sen-bank-1.0.0-20260202 | UTS2017_Bank (1,581) | 14 | **75.76%** | 0.13s | Banking domain |
### Comparison with sonar_core_1
| Dataset | sonar_core_1 | Sen-1 | Difference (percentage points) |
|---------|--------------|-------|------------|
| VNTC (News) | 92.80% | 92.49% | -0.31 |
| UTS2017_Bank | 72.47% | **75.76%** | **+3.29** |
### Inference Speed Benchmark
Comparison with Underthesea 9.2.8:
| Model | Single Inference | Throughput |
|-------|------------------|------------|
| **Sen-1** | **0.465 ms** | **66,678 samples/sec** |
| Underthesea 9.2.8 | 0.615 ms | 1,617 samples/sec |
**Speedup:** 1.3x (single) / **41x** (batch throughput)
Sen-1 supports batch processing, making it significantly faster for bulk classification tasks.
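
The README does not describe how the benchmark was run. The sketch below shows one plausible way to measure per-call latency and batch throughput for a scikit-learn pipeline. The stand-in model, toy sentences, and sample count are all assumptions, so the numbers it prints will not match the table above.

```python
import time

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stand-in model trained on toy data; this is not the released Sen-1 model.
model = make_pipeline(CountVectorizer(), TfidfTransformer(), LinearSVC())
model.fit(
    ["đội tuyển thắng trận bóng đá", "ngân hàng tăng lãi suất tiền gửi"],
    ["the_thao", "kinh_doanh"],
)

# Hypothetical workload: the same sentence repeated 200 times.
samples = ["đội tuyển Việt Nam thắng trận"] * 200

# Single inference: one predict() call per sample.
start = time.perf_counter()
for text in samples:
    model.predict([text])
single_seconds = time.perf_counter() - start

# Batch inference: one predict() call for all samples.
start = time.perf_counter()
batch_predictions = model.predict(samples)
batch_seconds = time.perf_counter() - start

single_ms = single_seconds / len(samples) * 1000
batch_throughput = len(samples) / batch_seconds

print(f"single inference: {single_ms:.3f} ms per sample")
print(f"batch throughput: {batch_throughput:,.0f} samples/sec")
```

The speedup figures quoted above follow from the table: 0.615 / 0.465 ≈ 1.3 for single inference and 66,678 / 1,617 ≈ 41 for throughput.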
## Citation
```bibtex
@inproceedings{vu2007comparative,
  title={A Comparative Study on Vietnamese Text Classification Methods},
  author={Hoang, Cong Duy Vu and Dien, Dinh and Nguyen, Le Nguyen and Ngo, Quoc Hung},
  booktitle={IEEE International Conference on Research, Innovation and Vision for the Future},
  pages={267--273},
  year={2007},
  organization={IEEE}
}
```
## Technical Report
See [TECHNICAL_REPORT.md](TECHNICAL_REPORT.md) for detailed methodology and evaluation.