	---
	license: apache-2.0
	language:
	- vi
	tags:
	- text-classification
	- vietnamese
	- sklearn
	- tfidf
	- svm
	library_name: sklearn
	pipeline_tag: text-classification
	metrics:
	- accuracy
	- f1
	datasets:
	- VNTC
	---

	# Sen-1

	Sen-1 is a Vietnamese text classification model developed by UnderTheSea NLP.

	## Model Description

	- Model Type: CountVectorizer + TfidfTransformer + LinearSVC (sklearn pipeline)
	- Base Architecture: sonar_core_1 reproduction
	- Language: Vietnamese
	- License: Apache 2.0
	- Accuracy: 92.49% on VNTC benchmark
	- F1 Score: 92.40% (weighted)
	- Training Time: 37.6 seconds
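
The model type listed above corresponds to a standard three-stage scikit-learn pipeline. The following is a minimal sketch of that structure, not the project's actual training code. The function name `build_pipeline`, the step names, and the way `max_features`, `ngram_range`, and `C` are wired into the stages are assumptions (the values are borrowed from the training example in this README), and the four toy sentences are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(max_features=20000, ngram_range=(1, 2), C=1.0):
    """Hypothetical helper: bag-of-n-grams counts -> TF-IDF weights -> linear SVM."""
    return Pipeline([
        # Count word unigrams and bigrams, keeping the most frequent features.
        ("vect", CountVectorizer(max_features=max_features, ngram_range=ngram_range)),
        # Re-weight raw counts by inverse document frequency.
        ("tfidf", TfidfTransformer()),
        # Linear SVM; with more than two classes it trains one-vs-rest.
        ("clf", LinearSVC(C=C)),
    ])


# Toy data for illustration only (not taken from VNTC).
texts = [
    "Đội tuyển thắng trận bóng đá",
    "Cầu thủ ghi bàn",
    "Ngân hàng tăng lãi suất",
    "Cổ phiếu giảm giá mạnh",
]
labels = ["the_thao", "the_thao", "kinh_doanh", "kinh_doanh"]

pipeline = build_pipeline()
pipeline.fit(texts, labels)
prediction = pipeline.predict(["Cầu thủ ghi bàn"])[0]
print(prediction)
```

Keeping the vectorizer and the classifier in one `Pipeline` object means the same preprocessing is applied at training and prediction time, and the whole model can be serialized as a single artifact.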

	## VNTC Benchmark Results

	Evaluated on the Vietnamese News Text Classification (VNTC) dataset:

	| Metric | Value |
	|--------|-------|
	| Accuracy | 92.49% |
	| F1 (weighted) | 92.40% |
	| F1 (macro) | 90.44% |
	| Training samples | 33,759 |
	| Test samples | 50,373 |
	| Categories | 10 |
	| Training time | 37.6s |

	### Per-Category Performance (VNTC)

	| Category | F1-Score |
	|----------|----------|
	| the_thao (Sports) | 0.98 |
	| the_gioi (World) | 0.95 |
	| vi_tinh (Technology) | 0.95 |
	| suc_khoe (Health) | 0.94 |
	| van_hoa (Culture) | 0.94 |
	| kinh_doanh (Business) | 0.92 |
	| phap_luat (Law) | 0.92 |
	| chinh_tri_xa_hoi (Politics) | 0.89 |
	| khoa_hoc (Science) | 0.85 |
	| doi_song (Lifestyle) | 0.72 |
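
Macro F1 is the unweighted mean of the per-category F1 scores, so the reported 90.44% can be sanity-checked from the table above. The sketch below uses only the rounded two-decimal values from the table, so it reproduces the reported figure only up to rounding. Weighted F1 cannot be recomputed this way because the per-category test counts are not listed here.

```python
from statistics import mean

# Per-category F1 scores as printed in the table (rounded to two decimals).
per_category_f1 = {
    "the_thao": 0.98,
    "the_gioi": 0.95,
    "vi_tinh": 0.95,
    "suc_khoe": 0.94,
    "van_hoa": 0.94,
    "kinh_doanh": 0.92,
    "phap_luat": 0.92,
    "chinh_tri_xa_hoi": 0.89,
    "khoa_hoc": 0.85,
    "doi_song": 0.72,
}

# Macro F1: every category counts equally, regardless of its size.
macro_f1 = mean(per_category_f1.values())
print(f"{macro_f1:.3f}")  # 0.906

# The weakest category pulls the macro average down the most.
weakest = min(per_category_f1, key=per_category_f1.get)
print(weakest)  # doi_song
```

The gap between macro F1 (90.44%) and weighted F1 (92.40%) is consistent with the lowest-scoring categories, such as doi_song, having less weight in the weighted average.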

	## Reference

	Based on: "A Comparative Study on Vietnamese Text Classification Methods"
	- Authors: Cong Duy Vu Hoang, Dien Dinh, Le Nguyen Nguyen, Quoc Hung Ngo
	- Published: IEEE RIVF 2007
	- Paper: [IEEE Xplore](https://ieeexplore.ieee.org/document/4223084/)
	- Dataset: [VNTC GitHub](https://github.com/duyvuleo/VNTC)

	## Installation

	```bash
	pip install scikit-learn joblib huggingface_hub
	```

	## Usage (Pre-trained Model)

	```python
	from huggingface_hub import snapshot_download
	from sen import SenTextClassifier, Sentence

	# Download pre-trained model (VNTC benchmark)
	model_path = snapshot_download(
	    'undertheseanlp/sen-1',
	    allow_patterns=['models/sen-general-1.0.0-20260202/*']
	)

	# Load model
	classifier = SenTextClassifier.load(f'{model_path}/models/sen-general-1.0.0-20260202')

	# Predict
	sentence = Sentence("Đội tuyển Việt Nam thắng 3-0")
	classifier.predict(sentence)
	print(sentence.labels) # [the_thao (0.89)]
	```

	## Train Your Own Model

	```python
	from sen import SenTextClassifier, Sentence

	# Initialize classifier (sonar_core_1 config)
	classifier = SenTextClassifier(
	    max_features=20000,
	    ngram_range=(1, 2),
	    C=1.0,
	)

	# Train
	train_texts = ["Sản phẩm rất tốt", "Hàng tệ quá"]
	train_labels = ["positive", "negative"]
	classifier.train(train_texts, train_labels)

	# Predict
	sentence = Sentence("Chất lượng tuyệt vời!")
	classifier.predict(sentence)
	print(sentence.labels) # [positive (0.85)]

	# Save/Load
	classifier.save("./my-model")
	loaded = SenTextClassifier.load("./my-model")
	```

	## API (compatible with underthesea)

	```python
	from sen import Sentence, Label, SenTextClassifier

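	# Load a previously saved classifier (path from the training example)
	classifier = SenTextClassifier.load("./my-model")

	# Sentence class
	sentence = Sentence("Sản phẩm rất tốt")
	classifier.predict(sentence)
	print(sentence.labels) # List[Label]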

	# Label class
	label = Label("positive", 0.95)
	print(label.value) # "positive"
	print(label.score) # 0.95
	```

	## Model Versions

	| Version | Dataset | Classes | Accuracy | Training Time | Notes |
	|---------|---------|---------|----------|---------------|-------|
	| models/sen-general-1.0.0-20260202 | VNTC (33,759) | 10 | 92.49% | 37.6s | News classification |
	| sen-bank-1.0.0-20260202 | UTS2017_Bank (1,581) | 14 | 75.76% | 0.13s | Banking domain |

	### Comparison with sonar_core_1

	| Dataset | sonar_core_1 | Sen-1 | Difference (percentage points) |
	|---------|--------------|-------|--------------------------------|
	| VNTC (News) | 92.80% | 92.49% | -0.31 |
	| UTS2017_Bank | 72.47% | 75.76% | +3.29 |

	### Inference Speed Benchmark

	Comparison with Underthesea 9.2.8:

	| Model | Single Inference | Throughput |
	|-------|------------------|------------|
	| Sen-1 | 0.465 ms | 66,678 samples/sec |
	| Underthesea 9.2.8 | 0.615 ms | 1,617 samples/sec |

	Speedup: 1.3x (single) / 41x (batch throughput)

	Sen-1 supports batch processing, which makes it substantially faster for bulk classification: about 41x the throughput of Underthesea 9.2.8 in this benchmark.
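
The speedup figures follow directly from the benchmark table. As a small worked check, the sketch below recomputes both ratios from the published numbers. The variable names are illustrative, and nothing here re-runs the benchmark itself.

```python
# Values copied from the inference speed table.
single_ms = {"Sen-1": 0.465, "Underthesea 9.2.8": 0.615}
throughput = {"Sen-1": 66_678, "Underthesea 9.2.8": 1_617}

# Latency: lower is better, so divide the baseline by Sen-1.
single_speedup = single_ms["Underthesea 9.2.8"] / single_ms["Sen-1"]

# Throughput: higher is better, so divide Sen-1 by the baseline.
batch_speedup = throughput["Sen-1"] / throughput["Underthesea 9.2.8"]

print(f"{single_speedup:.1f}x single, {batch_speedup:.0f}x batch")
# 1.3x single, 41x batch
```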

	## Citation

	```bibtex
	@inproceedings{vu2007comparative,
	title={A Comparative Study on Vietnamese Text Classification Methods},
	author={Hoang, Cong Duy Vu and Dien, Dinh and Nguyen, Le Nguyen and Ngo, Quoc Hung},
	booktitle={IEEE International Conference on Research, Innovation and Vision for the Future},
	pages={267--273},
	year={2007},
	organization={IEEE}
	}
	```

	## Technical Report

	See [TECHNICAL_REPORT.md](TECHNICAL_REPORT.md) for detailed methodology and evaluation.