Add sentiment models v1.2.0 with Vietnamese preprocessing
Browse files- Train sentiment-general (VLSP2016+UTS2017): 92.11% UTS2017, 70.86% VLSP2016
- Train sentiment-bank (UTS2017): 70.65% accuracy
- Add preprocessing: lowercase, teencode expansion, negation marking, repeated char normalization
- Update TECHNICAL_REPORT.md to v1.2.0 with full experiment results
- Track .bin files with LFS/Xet storage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- .gitattributes +1 -0
- TECHNICAL_REPORT.md +368 -378
- models/sen-sentiment-bank-1.0.0-20260206.bin +3 -0
- models/sen-sentiment-general-1.0.0-20260206.bin +3 -0
- src/train.py +278 -0
.gitattributes
CHANGED
|
@@ -4,3 +4,4 @@
|
|
| 4 |
*.jpeg filter=lfs diff=lfs merge=lfs -text
|
| 5 |
*.gif filter=lfs diff=lfs merge=lfs -text
|
| 6 |
*.synctex filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
| 4 |
*.jpeg filter=lfs diff=lfs merge=lfs -text
|
| 5 |
*.gif filter=lfs diff=lfs merge=lfs -text
|
| 6 |
*.synctex filter=lfs diff=lfs merge=lfs -text
|
| 7 |
+
*.bin filter=lfs diff=lfs merge=lfs -text
|
TECHNICAL_REPORT.md
CHANGED
|
@@ -1,21 +1,23 @@
|
|
| 1 |
# Sen-1: Vietnamese Text Classification Model
|
| 2 |
|
| 3 |
-
**Technical Report v1.
|
| 4 |
|
| 5 |
Authors: UnderTheSea NLP
|
| 6 |
-
Date: February
|
| 7 |
Model: `undertheseanlp/sen-1`
|
| 8 |
|
| 9 |
---
|
| 10 |
|
| 11 |
## Abstract
|
| 12 |
|
| 13 |
-
Sen-1 is a Vietnamese text classification model based on
|
| 14 |
|
| 15 |
- **VNTC (News)**: 92.49% accuracy on 10-topic news classification
|
| 16 |
- **UTS2017_Bank (Banking)**: 75.76% accuracy on 14-category banking text classification
|
|
|
|
|
|
|
| 17 |
|
| 18 |
-
The
|
| 19 |
|
| 20 |
---
|
| 21 |
|
|
@@ -25,9 +27,10 @@ Text classification is a fundamental task in Natural Language Processing (NLP) t
|
|
| 25 |
|
| 26 |
- **Word segmentation**: Vietnamese words can consist of multiple syllables
|
| 27 |
- **Diacritics**: Vietnamese uses Latin script with additional diacritical marks
|
|
|
|
| 28 |
- **Limited resources**: Fewer labeled datasets compared to English
|
| 29 |
|
| 30 |
-
Sen-1 addresses these challenges by implementing a robust TF-IDF + SVM pipeline
|
| 31 |
|
| 32 |
---
|
| 33 |
|
|
@@ -41,7 +44,20 @@ The seminal work on Vietnamese text classification was presented by Vu et al. (2
|
|
| 41 |
- **Baseline methods**: Bag-of-Words (BOW), N-gram, and SVM approaches
|
| 42 |
- **Benchmark results**: Achieving >95% accuracy on 10-topic classification
|
| 43 |
|
| 44 |
-
### 2.2
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 45 |
|
| 46 |
| Approach | Pros | Cons |
|
| 47 |
|----------|------|------|
|
|
@@ -56,34 +72,37 @@ Sen-1 adopts the traditional approach for its simplicity, speed, and effectivene
|
|
| 56 |
|
| 57 |
### 3.1 Architecture Overview
|
| 58 |
|
| 59 |
-
Sen-1
|
| 60 |
|
| 61 |
```
|
| 62 |
-
|
| 63 |
-
│
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
│
|
| 67 |
-
│
|
| 68 |
-
│
|
| 69 |
-
│ │
|
| 70 |
-
│ │ -
|
| 71 |
-
│ │ -
|
| 72 |
-
│
|
| 73 |
-
│
|
| 74 |
-
│
|
| 75 |
-
│
|
| 76 |
-
│ │ -
|
| 77 |
-
│
|
| 78 |
-
│
|
| 79 |
-
│
|
| 80 |
-
│
|
| 81 |
-
│
|
| 82 |
-
│
|
| 83 |
-
│
|
| 84 |
-
│
|
| 85 |
-
│
|
| 86 |
-
|
|
|
|
|
|
|
|
|
|
| 87 |
```
|
| 88 |
|
| 89 |
### 3.2 TF-IDF Vectorization
|
|
@@ -97,13 +116,14 @@ Where:
|
|
| 97 |
- $\text{IDF}(t) = \log\frac{N}{|\{d \in D : t \in d\}|}$
|
| 98 |
- $N$ = total number of documents
|
| 99 |
|
| 100 |
-
**Hyperparameters
|
| 101 |
|
| 102 |
-
| Parameter |
|
| 103 |
-
|
| 104 |
-
| `max_features` | 20,000 |
|
| 105 |
-
| `ngram_range` | (1, 2) |
|
| 106 |
-
| `
|
|
|
|
| 107 |
|
| 108 |
### 3.3 Support Vector Machine
|
| 109 |
|
|
@@ -111,15 +131,55 @@ Linear SVM is used for classification due to its effectiveness on high-dimension
|
|
| 111 |
|
| 112 |
$$\min_{w, b} \frac{1}{2}||w||^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i(w^T x_i + b))$$
|
| 113 |
|
| 114 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 115 |
|
| 116 |
-
|
| 117 |
-
|-----------|-------|-------------|
|
| 118 |
-
| `C` | 1.0 | Regularization parameter |
|
| 119 |
-
| `max_iter` | 2000 | Maximum iterations |
|
| 120 |
-
| `loss` | squared_hinge | Squared hinge loss (LinearSVC default) |
|
| 121 |
|
| 122 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 123 |
|
| 124 |
Confidence scores are computed from the SVM decision function using sigmoid transformation:
|
| 125 |
|
|
@@ -129,7 +189,7 @@ Where $f(x)$ is the decision function value.
|
|
| 129 |
|
| 130 |
---
|
| 131 |
|
| 132 |
-
## 4.
|
| 133 |
|
| 134 |
### 4.1 VNTC Dataset
|
| 135 |
|
|
@@ -155,33 +215,45 @@ The Vietnamese News Text Classification (VNTC) corpus is the standard benchmark
|
|
| 155 |
|
| 156 |
### 4.2 UTS2017_Bank Dataset
|
| 157 |
|
| 158 |
-
The UTS2017_Bank dataset is a Vietnamese banking domain text classification corpus:
|
| 159 |
-
|
| 160 |
-
**14 Categories:**
|
| 161 |
-
|
| 162 |
-
| Category | English |
|
| 163 |
-
|
| 164 |
-
|
|
| 165 |
-
|
|
| 166 |
-
|
|
| 167 |
-
|
|
| 168 |
-
|
|
| 169 |
-
|
|
| 170 |
-
|
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
|
| 175 |
-
|
| 176 |
-
|
|
| 177 |
-
|
|
| 178 |
-
|
|
| 179 |
-
|
| 180 |
-
|
|
|
|
| 181 |
|
| 182 |
**Source:** https://huggingface.co/datasets/undertheseanlp/UTS2017_Bank
|
| 183 |
|
| 184 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 185 |
|
| 186 |
---
|
| 187 |
|
|
@@ -189,33 +261,43 @@ The UTS2017_Bank dataset is a Vietnamese banking domain text classification corp
|
|
| 189 |
|
| 190 |
### 5.1 Dependencies
|
| 191 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 192 |
```
|
| 193 |
-
|
| 194 |
-
joblib>=1.0.0
|
| 195 |
-
numpy>=1.20.0
|
| 196 |
```
|
| 197 |
|
| 198 |
-
### 5.2
|
| 199 |
|
| 200 |
-
|
| 201 |
|
| 202 |
```python
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
|
| 217 |
-
|
| 218 |
-
|
|
|
|
|
|
|
|
|
|
| 219 |
```
|
| 220 |
|
| 221 |
### 5.3 Model Files
|
|
@@ -223,64 +305,30 @@ class SenTextClassifier:
|
|
| 223 |
```
|
| 224 |
undertheseanlp/sen-1/
|
| 225 |
└── models/
|
| 226 |
-
├── sen-general-1.0.0-
|
| 227 |
-
|
| 228 |
-
|
| 229 |
-
|
| 230 |
-
│
|
| 231 |
-
└── sen-bank-1.0.0-20260202/ # Banking classification (UTS2017_Bank)
|
| 232 |
-
├── pipeline.joblib # TF-IDF + SVM pipeline
|
| 233 |
-
├── label_encoder.joblib # Label encoder
|
| 234 |
-
└── metadata.json # Model configuration
|
| 235 |
```
|
| 236 |
|
| 237 |
-
|
| 238 |
-
```json
|
| 239 |
-
{
|
| 240 |
-
"model_type": "sonar_core_1_reproduction",
|
| 241 |
-
"architecture": "CountVectorizer + TfidfTransformer + LinearSVC",
|
| 242 |
-
"max_features": 20000,
|
| 243 |
-
"ngram_range": [1, 2],
|
| 244 |
-
"test_accuracy": 0.9249,
|
| 245 |
-
"test_f1_weighted": 0.924,
|
| 246 |
-
"labels": ["Chinh tri Xa hoi", "Doi song", ...]
|
| 247 |
-
}
|
| 248 |
-
```
|
| 249 |
|
| 250 |
---
|
| 251 |
|
| 252 |
## 6. Experiments
|
| 253 |
|
| 254 |
-
### 6.1
|
| 255 |
-
|
| 256 |
-
```python
|
| 257 |
-
# sonar_core_1 configuration
|
| 258 |
-
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
|
| 259 |
-
from sklearn.svm import LinearSVC
|
| 260 |
-
from sklearn.pipeline import Pipeline
|
| 261 |
-
|
| 262 |
-
pipeline = Pipeline([
|
| 263 |
-
('vect', CountVectorizer(max_features=20000, ngram_range=(1, 2))),
|
| 264 |
-
('tfidf', TfidfTransformer(use_idf=True)),
|
| 265 |
-
('clf', LinearSVC(C=1.0, max_iter=2000, random_state=42)),
|
| 266 |
-
])
|
| 267 |
-
```
|
| 268 |
-
|
| 269 |
-
### 6.2 VNTC Benchmark Results
|
| 270 |
|
| 271 |
-
**
|
| 272 |
|
| 273 |
| Metric | Value |
|
| 274 |
|--------|-------|
|
| 275 |
| **Accuracy** | **92.49%** |
|
| 276 |
| **F1 (weighted)** | **92.40%** |
|
| 277 |
| F1 (macro) | 90.44% |
|
| 278 |
-
| Precision (weighted) | 92.00% |
|
| 279 |
-
| Recall (weighted) | 92.00% |
|
| 280 |
| **Training time** | **37.6s** |
|
| 281 |
-
| Test samples | 50,373 |
|
| 282 |
|
| 283 |
-
|
| 284 |
|
| 285 |
| Category | Precision | Recall | F1-Score | Support |
|
| 286 |
|----------|-----------|--------|----------|---------|
|
|
@@ -295,183 +343,130 @@ pipeline = Pipeline([
|
|
| 295 |
| Van hoa | 0.93 | 0.96 | 0.94 | 6,250 |
|
| 296 |
| Vi tinh | 0.94 | 0.96 | 0.95 | 4,560 |
|
| 297 |
|
| 298 |
-
|
| 299 |
-
**Most challenging category:** Lifestyle (Doi song) with 72% F1-score
|
| 300 |
-
|
| 301 |
-
### 6.4 UTS2017_Bank Benchmark Results
|
| 302 |
|
| 303 |
-
**
|
| 304 |
|
| 305 |
| Metric | Value |
|
| 306 |
|--------|-------|
|
| 307 |
| **Accuracy** | **75.76%** |
|
| 308 |
| **F1 (weighted)** | **72.70%** |
|
| 309 |
| F1 (macro) | 36.18% |
|
| 310 |
-
| Precision (weighted) | 74.00% |
|
| 311 |
-
| Recall (weighted) | 76.00% |
|
| 312 |
| **Training time** | **0.13s** |
|
| 313 |
-
| Train samples | 1,581 |
|
| 314 |
-
| Test samples | 396 |
|
| 315 |
|
| 316 |
-
### 6.
|
| 317 |
|
| 318 |
-
|
| 319 |
-
|----------|-----------|--------|----------|---------|
|
| 320 |
-
| ACCOUNT | 0.00 | 0.00 | 0.00 | 1 |
|
| 321 |
-
| CARD | 0.36 | 0.31 | 0.33 | 13 |
|
| 322 |
-
| **CUSTOMER_SUPPORT** | **0.73** | **0.93** | **0.82** | 155 |
|
| 323 |
-
| DISCOUNT | 0.67 | 0.25 | 0.36 | 8 |
|
| 324 |
-
| INTEREST_RATE | 0.40 | 0.33 | 0.36 | 12 |
|
| 325 |
-
| INTERNET_BANKING | 0.80 | 0.29 | 0.42 | 14 |
|
| 326 |
-
| LOAN | 0.73 | 0.73 | 0.73 | 15 |
|
| 327 |
-
| MONEY_TRANSFER | 1.00 | 0.14 | 0.25 | 7 |
|
| 328 |
-
| OTHER | 0.25 | 0.07 | 0.11 | 14 |
|
| 329 |
-
| PAYMENT | 0.50 | 0.33 | 0.40 | 3 |
|
| 330 |
-
| PROMOTION | 0.75 | 0.27 | 0.40 | 11 |
|
| 331 |
-
| SAVING | 0.00 | 0.00 | 0.00 | 2 |
|
| 332 |
-
| SECURITY | 0.00 | 0.00 | 0.00 | 1 |
|
| 333 |
-
| **TRADEMARK** | **0.87** | **0.89** | **0.88** | 140 |
|
| 334 |
-
|
| 335 |
-
**Best performing categories:** TRADEMARK (88% F1), CUSTOMER_SUPPORT (82% F1)
|
| 336 |
-
**Zero-shot categories:** ACCOUNT, SAVING, SECURITY (insufficient training samples)
|
| 337 |
-
|
| 338 |
-
**Analysis:** The low macro F1 (36.18%) vs high weighted F1 (72.70%) indicates severe class imbalance. The model performs well on majority classes but fails on minority classes with <10 training samples.
|
| 339 |
-
|
| 340 |
-
### 6.6 Comparison with sonar_core_1 and VNTC Paper
|
| 341 |
-
|
| 342 |
-
#### Overall Comparison with sonar_core_1
|
| 343 |
-
|
| 344 |
-
| Dataset | sonar_core_1 | Sen-1 | Difference |
|
| 345 |
-
|---------|--------------|-------|------------|
|
| 346 |
-
| VNTC (News) | 92.80% | 92.49% | -0.31% |
|
| 347 |
-
| **UTS2017_Bank** | 72.47% | **75.76%** | **+3.29%** |
|
| 348 |
-
|
| 349 |
-
Sen-1 outperforms sonar_core_1 on the banking dataset while using significantly less training time.
|
| 350 |
-
|
| 351 |
-
#### VNTC Benchmark Results
|
| 352 |
|
| 353 |
-
|
| 354 |
-
|-------|----------|---------------|---------------|--------|
|
| 355 |
-
| **N-gram** (Vu et al. 2007) | **97.1%** | - | - | RIVF 2007 |
|
| 356 |
-
| SVM Multi (Vu et al. 2007) | 93.4% | - | - | RIVF 2007 |
|
| 357 |
-
| **sonar_core_1** (SVC) | **92.80%** | 92.0% | ~54.6 min | HuggingFace |
|
| 358 |
-
| **Sen-1 (Ours)** | 92.49% | 92.40% | **37.6s** | This report |
|
| 359 |
|
| 360 |
-
|
|
|
|
|
|
|
|
|
|
| 361 |
|
| 362 |
-
|
| 363 |
-
|-------|----------|---------------|---------------|--------|
|
| 364 |
-
| **Sen-1 (Ours)** | **75.76%** | **72.70%** | **0.13s** | This report |
|
| 365 |
-
| sonar_core_1 (SVC) | 72.47% | 66.0% | ~5.3s | HuggingFace |
|
| 366 |
|
| 367 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 368 |
|
| 369 |
-
|
| 370 |
|
| 371 |
-
|
|
| 372 |
-
|
| 373 |
-
|
|
| 374 |
-
|
|
| 375 |
-
|
|
| 376 |
-
| max_features | 20,000 | 20,000 |
|
| 377 |
-
| ngram_range | (1, 2) | (1, 2) |
|
| 378 |
-
| Test Accuracy | 92.80% | 92.49% |
|
| 379 |
-
| Training Time | ~54.6 min | 37.6s |
|
| 380 |
|
| 381 |
-
|
| 382 |
-
- sonar_core_1 uses SVC with `kernel='linear'` and `probability=True`
|
| 383 |
-
- Sen-1 uses LinearSVC which is faster but slightly different optimization
|
| 384 |
-
- Data source may differ (sonar_core_1 uses preprocessed data from underthesea releases)
|
| 385 |
|
| 386 |
-
|
| 387 |
|
| 388 |
-
**
|
| 389 |
|
| 390 |
-
|
| 391 |
-
|
| 392 |
-
|
| 393 |
-
|
|
|
|
| 394 |
|
| 395 |
-
|
| 396 |
-
- N-gram approach in original paper used character/word n-grams with language modeling
|
| 397 |
-
- Sen-1 uses TF-IDF with unigrams and bigrams only
|
| 398 |
-
- Original SVM Multi may have used different kernel or feature selection
|
| 399 |
|
| 400 |
-
|
| 401 |
-
|
| 402 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 403 |
|
| 404 |
-
|
| 405 |
-
- Sen-1 uses scikit-learn's LinearSVC with default squared hinge loss
|
| 406 |
-
- Original implementation details are not publicly available
|
| 407 |
|
| 408 |
-
####
|
| 409 |
|
| 410 |
-
|
|
| 411 |
-
|
| 412 |
-
|
|
| 413 |
-
|
|
| 414 |
-
|
|
| 415 |
-
| Vocabulary | Not specified | 10,000 features |
|
| 416 |
-
| Preprocessing | Vietnamese tokenizer | None (raw text) |
|
| 417 |
|
| 418 |
-
|
| 419 |
|
| 420 |
-
|
| 421 |
-
- **Sequential patterns matter** for Vietnamese text classification
|
| 422 |
-
- **Word segmentation** likely contributes to the performance gap
|
| 423 |
-
- Future versions of Sen should incorporate word segmentation (underthesea) to close this gap
|
| 424 |
|
| 425 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 426 |
|
| 427 |
-
|
| 428 |
-
|-------|-----------|------------|
|
| 429 |
-
| "Đội tuyển Việt Nam thắng đậm 3-0 trước Indonesia" | the_thao | 0.89 |
|
| 430 |
-
| "Giá vàng tăng mạnh trong phiên giao dịch hôm nay" | kinh_doanh | 0.85 |
|
| 431 |
-
| "Apple ra mắt iPhone mới với nhiều tính năng hấp dẫn" | vi_tinh | 0.82 |
|
| 432 |
-
| "Bộ Y tế cảnh báo về dịch cúm mùa đông" | suc_khoe | 0.91 |
|
| 433 |
-
| "Quốc hội thông qua nghị quyết phát triển kinh tế" | chinh_tri_xa_hoi | 0.78 |
|
| 434 |
|
| 435 |
-
|
|
|
|
|
|
|
|
|
|
| 436 |
|
| 437 |
-
|
| 438 |
|
| 439 |
-
|
| 440 |
|
| 441 |
-
|
|
| 442 |
-
|
| 443 |
-
| **
|
| 444 |
-
|
|
|
|
|
|
|
|
|
|
|
| 445 |
|
| 446 |
-
|
| 447 |
|
| 448 |
-
|
| 449 |
-
|
| 450 |
-
|
| 451 |
-
|
| 452 |
|
| 453 |
-
|
| 454 |
|
| 455 |
-
|
| 456 |
-
|
| 457 |
-
|
|
|
|
|
|
|
| 458 |
|
| 459 |
-
|
| 460 |
-
- Sen-1 supports **batch processing** (vectorize + predict entire batch)
|
| 461 |
-
- Underthesea processes samples **sequentially** (loop)
|
| 462 |
-
- Batch processing eliminates per-sample overhead
|
| 463 |
|
| 464 |
-
|
| 465 |
-
- Sen-1: ~2.4 MB
|
| 466 |
-
- Underthesea (sonar_core_1): ~75 MB (compressed)
|
| 467 |
|
| 468 |
-
|
|
|
|
|
|
|
|
|
|
| 469 |
|
| 470 |
-
-
|
| 471 |
-
- scikit-learn: 1.7.2
|
| 472 |
-
- underthesea: 9.2.8
|
| 473 |
-
- underthesea-core: 3.1.6
|
| 474 |
-
- OS: Ubuntu 20.04 LTS
|
| 475 |
|
| 476 |
---
|
| 477 |
|
|
@@ -480,53 +475,62 @@ Comparison of inference speed between Sen-1 and Underthesea 9.2.8 (which uses so
|
|
| 480 |
### 7.1 Installation
|
| 481 |
|
| 482 |
```bash
|
| 483 |
-
pip install
|
| 484 |
```
|
| 485 |
|
| 486 |
### 7.2 Load Pre-trained Model
|
| 487 |
|
| 488 |
```python
|
| 489 |
-
from
|
| 490 |
-
from sen import SenTextClassifier, Sentence
|
| 491 |
|
| 492 |
-
#
|
| 493 |
-
|
| 494 |
-
'undertheseanlp/sen-1',
|
| 495 |
-
allow_patterns=['sen-general-1.0.0-20260202/*']
|
| 496 |
-
)
|
| 497 |
-
|
| 498 |
-
# Load
|
| 499 |
-
classifier = SenTextClassifier.load(f'{model_path}/sen-general-1.0.0-20260202')
|
| 500 |
|
| 501 |
# Predict
|
| 502 |
-
|
| 503 |
-
classifier.predict(sentence)
|
| 504 |
-
print(sentence.labels) # [the_thao (0.89)]
|
| 505 |
```
|
| 506 |
|
| 507 |
-
### 7.3
|
| 508 |
|
| 509 |
```python
|
| 510 |
-
from
|
| 511 |
|
| 512 |
-
|
| 513 |
-
|
| 514 |
-
|
| 515 |
-
|
| 516 |
|
| 517 |
-
|
| 518 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 519 |
```
|
| 520 |
|
| 521 |
---
|
| 522 |
|
| 523 |
## 8. Limitations
|
| 524 |
|
| 525 |
-
1. **No word segmentation**:
|
| 526 |
2. **No pre-trained embeddings**: Uses TF-IDF instead of word vectors or contextual embeddings
|
| 527 |
3. **Single-label only**: Does not support multi-label classification
|
| 528 |
-
4. **
|
| 529 |
-
5. **Class imbalance sensitivity**: Lower performance on underrepresented categories
|
|
|
|
| 530 |
|
| 531 |
---
|
| 532 |
|
|
@@ -534,140 +538,126 @@ classifier.save("./my-model")
|
|
| 534 |
|
| 535 |
- [x] ~~Train on full VNTC dataset (33,759 samples)~~ **Done**
|
| 536 |
- [x] ~~Train on UTS2017_Bank dataset (1,977 samples)~~ **Done** (+3.29% vs sonar_core_1)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 537 |
- [ ] Add Vietnamese word segmentation (using underthesea)
|
| 538 |
- [ ] Implement multi-label classification
|
| 539 |
- [ ] Add PhoBERT-based variant (sen-2)
|
| 540 |
- [ ] Benchmark on additional datasets (UIT-VSMEC, UIT-VSFC)
|
| 541 |
-
- [ ]
|
| 542 |
-
- [ ]
|
| 543 |
|
| 544 |
---
|
| 545 |
|
| 546 |
## 10. Conclusion
|
| 547 |
|
| 548 |
-
Sen-1
|
| 549 |
|
| 550 |
-
|
|
| 551 |
-
|
| 552 |
-
|
|
| 553 |
-
|
|
|
|
|
|
|
|
|
|
|
| 554 |
|
| 555 |
Key achievements:
|
| 556 |
|
| 557 |
-
- **Fast
|
| 558 |
-
- **
|
| 559 |
-
- **
|
| 560 |
-
- **Multi-domain**:
|
| 561 |
-
|
| 562 |
-
While deep learning approaches (PhoBERT, etc.) may achieve higher accuracy, Sen-1 serves as a strong baseline and practical solution for resource-constrained environments.
|
| 563 |
|
| 564 |
---
|
| 565 |
|
| 566 |
## References
|
| 567 |
|
| 568 |
-
1. Vu, C. D. H., Dien, D., Nguyen, L. N., & Ngo, Q. H. (2007). **A Comparative Study on Vietnamese Text Classification Methods**. IEEE
|
| 569 |
|
| 570 |
2. duyvuleo. (2007). **VNTC: A Large-scale Vietnamese News Text Classification Corpus**. GitHub. https://github.com/duyvuleo/VNTC
|
| 571 |
|
| 572 |
-
3.
|
| 573 |
|
| 574 |
-
4.
|
| 575 |
|
| 576 |
5. Nguyen, D. Q., & Nguyen, A. T. (2020). **PhoBERT: Pre-trained language models for Vietnamese**. Findings of EMNLP 2020. https://aclanthology.org/2020.findings-emnlp.92/
|
| 577 |
|
| 578 |
-
|
| 579 |
|
| 580 |
-
|
| 581 |
-
|
| 582 |
-
### VNTC (News) - 10 Categories
|
| 583 |
-
|
| 584 |
-
| ID | Label | Vietnamese | English |
|
| 585 |
-
|----|-------|------------|---------|
|
| 586 |
-
| 0 | Chinh tri Xa hoi | Chính trị Xã hội | Politics/Society |
|
| 587 |
-
| 1 | Doi song | Đời sống | Lifestyle |
|
| 588 |
-
| 2 | Khoa hoc | Khoa học | Science |
|
| 589 |
-
| 3 | Kinh doanh | Kinh doanh | Business |
|
| 590 |
-
| 4 | Phap luat | Pháp luật | Law |
|
| 591 |
-
| 5 | Suc khoe | Sức khỏe | Health |
|
| 592 |
-
| 6 | The gioi | Thế giới | World |
|
| 593 |
-
| 7 | The thao | Thể thao | Sports |
|
| 594 |
-
| 8 | Van hoa | Văn hóa | Culture |
|
| 595 |
-
| 9 | Vi tinh | Vi tính | Technology |
|
| 596 |
-
|
| 597 |
-
### UTS2017_Bank (Banking) - 14 Categories
|
| 598 |
-
|
| 599 |
-
| ID | Label | English | Train Samples |
|
| 600 |
-
|----|-------|---------|---------------|
|
| 601 |
-
| 0 | ACCOUNT | Account services | 4 |
|
| 602 |
-
| 1 | CARD | Card services | 53 |
|
| 603 |
-
| 2 | CUSTOMER_SUPPORT | Customer support | 619 |
|
| 604 |
-
| 3 | DISCOUNT | Discounts | 32 |
|
| 605 |
-
| 4 | INTEREST_RATE | Interest rates | 46 |
|
| 606 |
-
| 5 | INTERNET_BANKING | Internet banking | 55 |
|
| 607 |
-
| 6 | LOAN | Loan services | 58 |
|
| 608 |
-
| 7 | MONEY_TRANSFER | Money transfer | 30 |
|
| 609 |
-
| 8 | OTHER | Other | 56 |
|
| 610 |
-
| 9 | PAYMENT | Payment services | 14 |
|
| 611 |
-
| 10 | PROMOTION | Promotions | 45 |
|
| 612 |
-
| 11 | SAVING | Savings | 10 |
|
| 613 |
-
| 12 | SECURITY | Security | 2 |
|
| 614 |
-
| 13 | TRADEMARK | Trademark/Brand | 557 |
|
| 615 |
|
| 616 |
---
|
| 617 |
|
| 618 |
-
## Appendix
|
| 619 |
|
| 620 |
-
### sen-general-1.0.0-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 621 |
|
| 622 |
| Field | Value |
|
| 623 |
|-------|-------|
|
| 624 |
-
| Model Name | sen-
|
| 625 |
-
| Architecture |
|
| 626 |
-
| Base Model | sonar_core_1 reproduction |
|
| 627 |
| Language | Vietnamese |
|
| 628 |
| License | Apache 2.0 |
|
| 629 |
| Repository | https://huggingface.co/undertheseanlp/sen-1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 630 |
| Training Data | VNTC (33,759 samples) |
|
| 631 |
-
| Test Data | VNTC (50,373 samples) |
|
| 632 |
| Categories | 10 |
|
| 633 |
| max_features | 20,000 |
|
| 634 |
| ngram_range | (1, 2) |
|
| 635 |
| Accuracy | 92.49% |
|
| 636 |
-
| F1 (weighted) | 92.40% |
|
| 637 |
-
| Training Time | 37.6s |
|
| 638 |
|
| 639 |
-
### sen-bank-1.0.0-
|
| 640 |
|
| 641 |
| Field | Value |
|
| 642 |
|-------|-------|
|
| 643 |
-
| Model Name | sen-bank-1.0.0-
|
| 644 |
-
| Architecture |
|
| 645 |
-
| Base Model | sonar_core_1 reproduction |
|
| 646 |
| Language | Vietnamese |
|
| 647 |
-
|
|
| 648 |
-
| Repository | https://huggingface.co/undertheseanlp/sen-1 |
|
| 649 |
-
| Training Data | UTS2017_Bank (1,581 samples) |
|
| 650 |
-
| Test Data | UTS2017_Bank (396 samples) |
|
| 651 |
| Categories | 14 |
|
| 652 |
-
| max_features |
|
| 653 |
| ngram_range | (1, 2) |
|
| 654 |
| Accuracy | 75.76% |
|
| 655 |
-
| F1 (weighted) | 72.70% |
|
| 656 |
-
| Training Time | 0.13s |
|
| 657 |
-
|
| 658 |
-
---
|
| 659 |
-
|
| 660 |
-
## Appendix C: Confusion Matrix Analysis
|
| 661 |
-
|
| 662 |
-
Categories with highest confusion:
|
| 663 |
-
- **Lifestyle (doi_song)** often confused with Culture (van_hoa) and Health (suc_khoe)
|
| 664 |
-
- **Politics (chinh_tri_xa_hoi)** sometimes confused with World (the_gioi) and Law (phap_luat)
|
| 665 |
-
|
| 666 |
-
Categories with clearest separation:
|
| 667 |
-
- **Sports (the_thao)**: Very distinctive vocabulary (team names, scores, competitions)
|
| 668 |
-
- **Technology (vi_tinh)**: Distinctive technical terms (software, hardware brands)
|
| 669 |
|
| 670 |
---
|
| 671 |
|
| 672 |
-
*Report generated: February
|
| 673 |
*UnderTheSea NLP - https://github.com/undertheseanlp*
|
|
|
|
| 1 |
# Sen-1: Vietnamese Text Classification Model
|
| 2 |
|
| 3 |
+
**Technical Report v1.2.0**
|
| 4 |
|
| 5 |
Authors: UnderTheSea NLP
|
| 6 |
+
Date: February 6, 2026
|
| 7 |
Model: `undertheseanlp/sen-1`
|
| 8 |
|
| 9 |
---
|
| 10 |
|
| 11 |
## Abstract
|
| 12 |
|
| 13 |
+
Sen-1 is a Vietnamese text classification model based on TF-IDF vectorization combined with Linear SVM, implemented entirely in Rust via `underthesea_core` for fast training and inference. This report describes the methodology, implementation, and evaluation on four benchmark tasks:
|
| 14 |
|
| 15 |
- **VNTC (News)**: 92.49% accuracy on 10-topic news classification
|
| 16 |
- **UTS2017_Bank (Banking)**: 75.76% accuracy on 14-category banking text classification
|
| 17 |
+
- **Sentiment General**: 92.11% (UTS2017_Bank) / 70.86% (VLSP2016) on 3-class sentiment
|
| 18 |
+
- **Sentiment Bank**: 70.65% accuracy on 36-class aspect-sentiment classification
|
| 19 |
|
| 20 |
+
The sentiment models include a Vietnamese-specific preprocessing pipeline (teencode expansion, negation marking, character normalization) that yields +4.1% improvement over the previous flair-based SVM on VLSP2016, while removing the scikit-learn dependency from the inference path.
|
| 21 |
|
| 22 |
---
|
| 23 |
|
|
|
|
| 27 |
|
| 28 |
- **Word segmentation**: Vietnamese words can consist of multiple syllables
|
| 29 |
- **Diacritics**: Vietnamese uses Latin script with additional diacritical marks
|
| 30 |
+
- **Informal text**: Social media text contains extensive teencode and abbreviations
|
| 31 |
- **Limited resources**: Fewer labeled datasets compared to English
|
| 32 |
|
| 33 |
+
Sen-1 addresses these challenges by implementing a robust TF-IDF + SVM pipeline with Vietnamese-specific preprocessing, operating at syllable-level for speed while achieving competitive performance with word-level approaches.
|
| 34 |
|
| 35 |
---
|
| 36 |
|
|
|
|
| 44 |
- **Baseline methods**: Bag-of-Words (BOW), N-gram, and SVM approaches
|
| 45 |
- **Benchmark results**: Achieving >95% accuracy on 10-topic classification
|
| 46 |
|
| 47 |
+
### 2.2 VLSP2016 Sentiment Analysis Shared Task
|
| 48 |
+
|
| 49 |
+
The VLSP 2016 Sentiment Analysis shared task was the first Vietnamese sentiment analysis campaign, focusing on polarity classification of electronic product reviews into 3 classes (positive, negative, neutral). Top results from the shared task:
|
| 50 |
+
|
| 51 |
+
| System | Approach | F1 |
|
| 52 |
+
|--------|----------|-----|
|
| 53 |
+
| Pham et al. | Perceptron / SVM / MaxEnt ensemble | **80.05** |
|
| 54 |
+
| Nguyen et al. | SVM / MLNN / LSTM ensemble | 71.44 |
|
| 55 |
+
| Pham et al. | Random Forest + SVM + Naive Bayes | 71.22 |
|
| 56 |
+
| Ngo et al. | SVM | 67.54 |
|
| 57 |
+
|
| 58 |
+
All top systems used word segmentation. However, recent research (Arxiv 2301.00418) demonstrates that for traditional classifiers like SVM, word segmentation may not be necessary for Vietnamese sentiment classification on social domain text.
|
| 59 |
+
|
| 60 |
+
### 2.3 Traditional ML vs Deep Learning
|
| 61 |
|
| 62 |
| Approach | Pros | Cons |
|
| 63 |
|----------|------|------|
|
|
|
|
| 72 |
|
| 73 |
### 3.1 Architecture Overview
|
| 74 |
|
| 75 |
+
Sen-1 uses a 3-stage pipeline implemented in Rust via `underthesea_core`:
|
| 76 |
|
| 77 |
```
|
| 78 |
+
┌──────────────────────────────────────────────────────────┐
|
| 79 |
+
│ Sen-1 Pipeline │
|
| 80 |
+
├──────────────────────────────────────────────────────────┤
|
| 81 |
+
│ Input Text │
|
| 82 |
+
│ ↓ │
|
| 83 |
+
│ ┌──────────────────────────────────────────────────┐ │
|
| 84 |
+
│ │ [Optional] Sentiment Preprocessing │ │
|
| 85 |
+
│ │ - Lowercase + Unicode NFC │ │
|
| 86 |
+
│ │ - Teencode expansion │ │
|
| 87 |
+
│ │ - Negation marking (2-word window) │ │
|
| 88 |
+
│ │ - Repeated character normalization │ │
|
| 89 |
+
│ └──────────────────────────────────────────────────┘ │
|
| 90 |
+
│ ↓ │
|
| 91 |
+
│ ┌──────────────────────────────────────────────────┐ │
|
| 92 |
+
│ │ TF-IDF Vectorizer (Rust) │ │
|
| 93 |
+
│ │ - max_features: 20k-200k │ │
|
| 94 |
+
│ │ - ngram_range: (1,2) or (1,3) │ │
|
| 95 |
+
│ │ - max_df: 0.8-1.0 │ │
|
| 96 |
+
│ └──────────────────────────────────────────────────┘ │
|
| 97 |
+
│ ↓ │
|
| 98 |
+
│ ┌──────────────────────────────────────────────────┐ │
|
| 99 |
+
│ │ Linear SVM Classifier (Rust) │ │
|
| 100 |
+
│ │ - C: 0.7-1.0 │ │
|
| 101 |
+
│ │ - max_iter: 1000 │ │
|
| 102 |
+
│ └──────────────────────────────────────────────────┘ │
|
| 103 |
+
│ ↓ │
|
| 104 |
+
│ Output: Predicted Label │
|
| 105 |
+
└──────────────────────────────────────────────────────────┘
|
| 106 |
```
|
| 107 |
|
| 108 |
### 3.2 TF-IDF Vectorization
|
|
|
|
| 116 |
- $\text{IDF}(t) = \log\frac{N}{|\{d \in D : t \in d\}|}$
|
| 117 |
- $N$ = total number of documents
|
| 118 |
|
| 119 |
+
**Hyperparameters vary by task:**
|
| 120 |
|
| 121 |
+
| Parameter | Classification | Sentiment |
|
| 122 |
+
|-----------|---------------|-----------|
|
| 123 |
+
| `max_features` | 20,000 | 200,000 |
|
| 124 |
+
| `ngram_range` | (1, 2) | (1, 3) |
|
| 125 |
+
| `max_df` | 1.0 | 0.9 |
|
| 126 |
+
| `min_df` | 2 | 1 |
|
| 127 |
|
| 128 |
### 3.3 Support Vector Machine
|
| 129 |
|
|
|
|
| 131 |
|
| 132 |
$$\min_{w, b} \frac{1}{2}||w||^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i(w^T x_i + b))$$
|
| 133 |
|
| 134 |
+
### 3.4 Sentiment Preprocessing Pipeline
|
| 135 |
+
|
| 136 |
+
For sentiment models, a Vietnamese-specific preprocessing pipeline is applied before TF-IDF vectorization:
|
| 137 |
+
|
| 138 |
+
**Step 1: Text Normalization**
|
| 139 |
+
- Unicode NFC normalization (standardizes diacritics)
|
| 140 |
+
- Lowercase conversion
|
| 141 |
+
- URL removal
|
| 142 |
+
- Repeated character collapse: `quáááá` -> `quáá`
|
| 143 |
+
- Punctuation normalization: `!!!` -> `!`, `????` -> `?`
|
| 144 |
+
|
| 145 |
+
**Step 2: Teencode Expansion**
|
| 146 |
+
|
| 147 |
+
Vietnamese social media text contains extensive abbreviations. We expand 25+ common teencode mappings:
|
| 148 |
+
|
| 149 |
+
| Teencode | Standard | Meaning |
|
| 150 |
+
|----------|----------|---------|
|
| 151 |
+
| ko, k, hok, hem | không | not/no |
|
| 152 |
+
| dc, đc, dk | được | can/ok |
|
| 153 |
+
| cx, cg | cũng | also |
|
| 154 |
+
| bt, bth | bình thường | normal |
|
| 155 |
+
| sp | sản phẩm | product |
|
| 156 |
+
| j | gì | what |
|
| 157 |
+
| z, v | vậy | so |
|
| 158 |
+
| tks, thanks | cảm ơn | thanks |
|
| 159 |
+
| ... | ... | ... |
|
| 160 |
+
|
| 161 |
+
**Step 3: Negation Marking**
|
| 162 |
+
|
| 163 |
+
Negation words (`không`, `chẳng`, `chả`, `chưa`, `đừng`) modify the sentiment of following words. We mark the next 2 words with a `NEG_` prefix:
|
| 164 |
+
|
| 165 |
+
```
|
| 166 |
+
"không tốt lắm" -> "không NEG_tốt NEG_lắm"
|
| 167 |
+
```
|
| 168 |
+
|
| 169 |
+
This allows the TF-IDF features to distinguish "tốt" (good) from "NEG_tốt" (not good).
|
| 170 |
|
| 171 |
+
**Impact of preprocessing (VLSP2016):**
|
|
|
|
|
|
|
|
|
|
|
|
|
| 172 |
|
| 173 |
+
| Preprocessing Step | Accuracy | Delta |
|
| 174 |
+
|-------------------|----------|-------|
|
| 175 |
+
| None (baseline) | 64.76% | - |
|
| 176 |
+
| + Lowercase | 67.62% | +2.86% |
|
| 177 |
+
| + Repeated char normalization | 68.29% | +0.67% |
|
| 178 |
+
| + Teencode expansion | +1.14% | +1.14% |
|
| 179 |
+
| + Negation marking | +1.24% | +1.24% |
|
| 180 |
+
| **All combined** | **70.86%** | **+6.10%** |
|
| 181 |
+
|
| 182 |
+
### 3.5 Confidence Score
|
| 183 |
|
| 184 |
Confidence scores are computed from the SVM decision function using sigmoid transformation:
|
| 185 |
|
|
|
|
| 189 |
|
| 190 |
---
|
| 191 |
|
| 192 |
+
## 4. Datasets
|
| 193 |
|
| 194 |
### 4.1 VNTC Dataset
|
| 195 |
|
|
|
|
| 215 |
|
| 216 |
### 4.2 UTS2017_Bank Dataset
|
| 217 |
|
| 218 |
+
The UTS2017_Bank dataset is a Vietnamese banking domain text classification corpus with two configurations:
|
| 219 |
+
|
| 220 |
+
**Classification (14 Categories):**
|
| 221 |
+
|
| 222 |
+
| Category | English | Train | Test |
|
| 223 |
+
|----------|---------|-------|------|
|
| 224 |
+
| CUSTOMER_SUPPORT | Customer support | 619 | 155 |
|
| 225 |
+
| TRADEMARK | Trademark/Brand | 557 | 140 |
|
| 226 |
+
| LOAN | Loan services | 58 | 15 |
|
| 227 |
+
| INTERNET_BANKING | Internet banking | 55 | 14 |
|
| 228 |
+
| CARD | Card services | 53 | 13 |
|
| 229 |
+
| ... | ... | ... | ... |
|
| 230 |
+
| **Total** | | **1,977** | **494** |
|
| 231 |
+
|
| 232 |
+
**Sentiment (3 Classes):**
|
| 233 |
+
|
| 234 |
+
| Label | Train | Test |
|
| 235 |
+
|-------|-------|------|
|
| 236 |
+
| negative | 1,189 | 301 |
|
| 237 |
+
| positive | 765 | 185 |
|
| 238 |
+
| neutral | 23 | 8 |
|
| 239 |
+
| **Total** | **1,977** | **494** |
|
| 240 |
+
|
| 241 |
+
**Combined (36 Aspect-Sentiment Labels):** Merging classification + sentiment configs produces labels like `CUSTOMER_SUPPORT#negative`, `CARD#positive`, etc.
|
| 242 |
|
| 243 |
**Source:** https://huggingface.co/datasets/undertheseanlp/UTS2017_Bank
|
| 244 |
|
| 245 |
+
### 4.3 VLSP2016 Sentiment Analysis Dataset
|
| 246 |
+
|
| 247 |
+
The VLSP 2016 Sentiment Analysis dataset contains electronic product reviews labeled for sentiment:
|
| 248 |
+
|
| 249 |
+
| Split | POS | NEG | NEU | Total |
|
| 250 |
+
|-------|-----|-----|-----|-------|
|
| 251 |
+
| Train | 1,700 | 1,700 | 1,700 | 5,100 |
|
| 252 |
+
| Test | 350 | 350 | 350 | 1,050 |
|
| 253 |
+
|
| 254 |
+
The dataset is perfectly balanced across all three sentiment classes.
|
| 255 |
+
|
| 256 |
+
**Source:** VLSP 2016 Shared Task (https://vlsp.org.vn/vlsp2016/eval/sa)
|
| 257 |
|
| 258 |
---
|
| 259 |
|
|
|
|
| 261 |
|
| 262 |
### 5.1 Dependencies
|
| 263 |
|
| 264 |
+
**Training:**
|
| 265 |
+
```
|
| 266 |
+
underthesea_core>=3.1.7 # Rust TF-IDF + SVM backend
|
| 267 |
+
scikit-learn>=1.0.0 # Metrics only (accuracy, F1, classification_report)
|
| 268 |
+
click>=8.0.0 # CLI
|
| 269 |
+
datasets>=2.0.0 # HuggingFace dataset loading
|
| 270 |
+
```
|
| 271 |
+
|
| 272 |
+
**Inference (underthesea pipeline):**
|
| 273 |
```
|
| 274 |
+
underthesea_core>=3.1.7 # Only dependency (no sklearn needed)
|
|
|
|
|
|
|
| 275 |
```
|
| 276 |
|
| 277 |
+
### 5.2 Rust Backend
|
| 278 |
|
| 279 |
+
All vectorization and classification is performed by `underthesea_core.TextClassifier`, a Rust implementation exposed via PyO3:
|
| 280 |
|
| 281 |
```python
|
| 282 |
+
from underthesea_core import TextClassifier
|
| 283 |
+
|
| 284 |
+
# Constructor parameters
|
| 285 |
+
clf = TextClassifier(
|
| 286 |
+
max_features=200000, # Maximum vocabulary size
|
| 287 |
+
ngram_range=(1, 3), # N-gram range
|
| 288 |
+
min_df=1, # Minimum document frequency
|
| 289 |
+
max_df=0.9, # Maximum document frequency
|
| 290 |
+
c=0.7, # SVM regularization parameter
|
| 291 |
+
max_iter=1000, # Maximum iterations
|
| 292 |
+
tol=0.0001, # Convergence tolerance
|
| 293 |
+
)
|
| 294 |
+
|
| 295 |
+
# Training and inference
|
| 296 |
+
clf.fit(texts, labels)
|
| 297 |
+
label = clf.predict(text)
|
| 298 |
+
labels = clf.predict_batch(texts)
|
| 299 |
+
clf.save("model.bin")
|
| 300 |
+
clf = TextClassifier.load("model.bin")
|
| 301 |
```
|
| 302 |
|
| 303 |
### 5.3 Model Files
|
|
|
|
| 305 |
```
|
| 306 |
undertheseanlp/sen-1/
|
| 307 |
└── models/
|
| 308 |
+
├── sen-general-1.0.0-20260203.bin # News classification (VNTC)
|
| 309 |
+
├── sen-bank-1.0.0-20260203.bin # Banking classification (UTS2017)
|
| 310 |
+
├── sen-sentiment-general-1.0.0-20260206.bin # Sentiment (VLSP2016+UTS2017)
|
| 311 |
+
└── sen-sentiment-bank-1.0.0-20260206.bin # Aspect-sentiment (UTS2017)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 312 |
```
|
| 313 |
|
| 314 |
+
All models are serialized in Rust binary format (bincode).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 315 |
|
| 316 |
---
|
| 317 |
|
| 318 |
## 6. Experiments
|
| 319 |
|
| 320 |
+
### 6.1 VNTC Benchmark Results
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 321 |
|
| 322 |
+
**Configuration:** max_features=20000, ngram=(1,2), min_df=2, C=1.0
|
| 323 |
|
| 324 |
| Metric | Value |
|
| 325 |
|--------|-------|
|
| 326 |
| **Accuracy** | **92.49%** |
|
| 327 |
| **F1 (weighted)** | **92.40%** |
|
| 328 |
| F1 (macro) | 90.44% |
|
|
|
|
|
|
|
| 329 |
| **Training time** | **37.6s** |
|
|
|
|
| 330 |
|
| 331 |
+
**Per-Category Results:**
|
| 332 |
|
| 333 |
| Category | Precision | Recall | F1-Score | Support |
|
| 334 |
|----------|-----------|--------|----------|---------|
|
|
|
|
| 343 |
| Van hoa | 0.93 | 0.96 | 0.94 | 6,250 |
|
| 344 |
| Vi tinh | 0.94 | 0.96 | 0.95 | 4,560 |
|
| 345 |
|
| 346 |
+
### 6.2 UTS2017_Bank Classification Results
|
|
|
|
|
|
|
|
|
|
| 347 |
|
| 348 |
+
**Configuration:** max_features=10000, ngram=(1,2), min_df=1, C=1.0
|
| 349 |
|
| 350 |
| Metric | Value |
|
| 351 |
|--------|-------|
|
| 352 |
| **Accuracy** | **75.76%** |
|
| 353 |
| **F1 (weighted)** | **72.70%** |
|
| 354 |
| F1 (macro) | 36.18% |
|
|
|
|
|
|
|
| 355 |
| **Training time** | **0.13s** |
|
|
|
|
|
|
|
| 356 |
|
| 357 |
+
### 6.3 Sentiment General Results
|
| 358 |
|
| 359 |
+
**Configuration:** max_features=200000, ngram=(1,3), max_df=0.9, C=0.7, with preprocessing
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 360 |
|
| 361 |
+
**Training data:** UTS2017_Bank sentiment (1,977) + VLSP2016 (5,100) = 7,077 samples
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 362 |
|
| 363 |
+
| Test Set | Accuracy | F1 (weighted) | F1 (macro) |
|
| 364 |
+
|----------|----------|---------------|------------|
|
| 365 |
+
| **UTS2017_Bank** | **92.11%** | **0.9163** | 0.6196 |
|
| 366 |
+
| **VLSP2016** | **70.86%** | **0.7081** | 0.7081 |
|
| 367 |
|
| 368 |
+
**Per-Class Results (UTS2017_Bank):**
|
|
|
|
|
|
|
|
|
|
| 369 |
|
| 370 |
+
| Class | Precision | Recall | F1-Score | Support |
|
| 371 |
+
|-------|-----------|--------|----------|---------|
|
| 372 |
+
| negative | 0.93 | 0.95 | 0.94 | 301 |
|
| 373 |
+
| neutral | 0.00 | 0.00 | 0.00 | 8 |
|
| 374 |
+
| positive | 0.93 | 0.91 | 0.92 | 185 |
|
| 375 |
|
| 376 |
+
**Per-Class Results (VLSP2016):**
|
| 377 |
|
| 378 |
+
| Class | Precision | Recall | F1-Score | Support |
|
| 379 |
+
|-------|-----------|--------|----------|---------|
|
| 380 |
+
| negative | 0.68 | 0.74 | 0.71 | 350 |
|
| 381 |
+
| neutral | 0.69 | 0.64 | 0.66 | 350 |
|
| 382 |
+
| positive | 0.76 | 0.75 | 0.75 | 350 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 383 |
|
| 384 |
+
### 6.4 Sentiment Bank Results
|
|
|
|
|
|
|
|
|
|
| 385 |
|
| 386 |
+
**Configuration:** max_features=10000, ngram=(1,2), max_df=1.0, C=1.0, with preprocessing
|
| 387 |
|
| 388 |
+
**Training data:** UTS2017_Bank classification + sentiment merged (1,977 samples, 36 labels)
|
| 389 |
|
| 390 |
+
| Metric | Value |
|
| 391 |
+
|--------|-------|
|
| 392 |
+
| **Accuracy** | **70.65%** |
|
| 393 |
+
| F1 (weighted) | 0.6693 |
|
| 394 |
+
| F1 (macro) | 0.2153 |
|
| 395 |
|
| 396 |
+
**Top-Performing Categories:**
|
|
|
|
|
|
|
|
|
|
| 397 |
|
| 398 |
+
| Category | Precision | Recall | F1-Score | Support |
|
| 399 |
+
|----------|-----------|--------|----------|---------|
|
| 400 |
+
| LOAN#negative | 0.60 | 1.00 | 0.75 | 3 |
|
| 401 |
+
| CUSTOMER_SUPPORT#negative | 0.75 | 0.93 | 0.83 | 214 |
|
| 402 |
+
| CUSTOMER_SUPPORT#positive | 0.80 | 0.82 | 0.81 | 122 |
|
| 403 |
+
| MONEY_TRANSFER#negative | 1.00 | 0.50 | 0.67 | 2 |
|
| 404 |
+
| TRADEMARK#positive | 0.54 | 0.63 | 0.58 | 35 |
|
| 405 |
|
| 406 |
+
### 6.5 Comparison with Previous Models
|
|
|
|
|
|
|
| 407 |
|
| 408 |
+
#### Sentiment General
|
| 409 |
|
| 410 |
+
| Model | Architecture | VLSP2016 | UTS2017_Bank |
|
| 411 |
+
|-------|-------------|----------|-------------|
|
| 412 |
+
| SA_GENERAL_V131 (old) | flair SVM + word segmentation | 69.14% | 47.17% |
|
| 413 |
+
| **Sen-1 (new)** | **underthesea_core + preprocessing** | **70.86%** | **92.11%** |
|
| 414 |
+
| Delta | | **+1.72%** | **+44.94%** |
|
|
|
|
|
|
|
| 415 |
|
| 416 |
+
The old model was trained only on VLSP2016 and could not predict the "neutral" class, resulting in poor generalization to UTS2017_Bank. The new model is trained on both datasets and includes preprocessing.
|
| 417 |
|
| 418 |
+
#### Sentiment Bank
|
|
|
|
|
|
|
|
|
|
| 419 |
|
| 420 |
+
| Model | Architecture | UTS2017_Bank |
|
| 421 |
+
|-------|-------------|-------------|
|
| 422 |
+
| pulse_core_1 (old) | sklearn Pipeline + joblib | 69.03% |
|
| 423 |
+
| **Sen-1 (new)** | **underthesea_core + preprocessing** | **70.65%** |
|
| 424 |
+
| Delta | | **+1.62%** |
|
| 425 |
|
| 426 |
+
#### Classification
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 427 |
|
| 428 |
+
| Dataset | sonar_core_1 | Sen-1 | Difference |
|
| 429 |
+
|---------|--------------|-------|------------|
|
| 430 |
+
| VNTC (News) | 92.80% | 92.49% | -0.31% |
|
| 431 |
+
| **UTS2017_Bank** | 72.47% | **75.76%** | **+3.29%** |
|
| 432 |
|
| 433 |
+
### 6.6 Hyperparameter Sensitivity (VLSP2016)
|
| 434 |
|
| 435 |
+
Key findings from hyperparameter search on VLSP2016:
|
| 436 |
|
| 437 |
+
| Factor | Finding |
|
| 438 |
+
|--------|---------|
|
| 439 |
+
| **max_features** | 200k >> 20k (+3% accuracy); larger vocabulary captures more discriminative patterns |
|
| 440 |
+
| **ngram_range** | (1,3) slightly better than (1,2) with large vocabulary |
|
| 441 |
+
| **max_df** | 0.8-0.9 helps filter very common terms that add noise |
|
| 442 |
+
| **C** | 0.7 optimal; lower C (more regularization) prevents overfitting on small datasets |
|
| 443 |
+
| **Preprocessing** | Most impactful factor: +6.1% total (lowercase +2.9%, teencode +1.1%, negation +1.2%) |
|
| 444 |
|
| 445 |
+
### 6.7 Error Analysis (VLSP2016)
|
| 446 |
|
| 447 |
+
**Confusion patterns:**
|
| 448 |
+
- NEU (neutral) is the most confused class, acting as an "attractor" for both POS and NEG
|
| 449 |
+
- NEU<->NEG confusion accounts for 38% of all errors
|
| 450 |
+
- No single error pattern (text length, teencode, negation) dominates
|
| 451 |
|
| 452 |
+
**Confidence calibration:**
|
| 453 |
|
| 454 |
+
| Confidence | Samples | Accuracy |
|
| 455 |
+
|------------|---------|----------|
|
| 456 |
+
| >= 0.7 | 129 | 94.0% |
|
| 457 |
+
| >= 0.6 | 365 | 84.4% |
|
| 458 |
+
| < 0.5 | 224 | 45.5% |
|
| 459 |
|
| 460 |
+
Predictions with confidence >= 0.7 are 94% accurate, suggesting confidence thresholds can be effective for production use.
|
|
|
|
|
|
|
|
|
|
| 461 |
|
| 462 |
+
### 6.8 Inference Speed Benchmark
|
|
|
|
|
|
|
| 463 |
|
| 464 |
+
| Model | Single Inference | Throughput |
|
| 465 |
+
|-------|------------------|------------|
|
| 466 |
+
| **Sen-1 1.0.0** | **0.465 ms** | **66,678 samples/sec** |
|
| 467 |
+
| Underthesea 9.2.8 | 0.615 ms | 1,617 samples/sec |
|
| 468 |
|
| 469 |
+
Sen-1 achieves **41x** faster throughput via batch processing and the Rust backend.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 470 |
|
| 471 |
---
|
| 472 |
|
|
|
|
| 475 |
### 7.1 Installation
|
| 476 |
|
| 477 |
```bash
|
| 478 |
+
pip install underthesea_core
|
| 479 |
```
|
| 480 |
|
| 481 |
### 7.2 Load Pre-trained Model
|
| 482 |
|
| 483 |
```python
|
| 484 |
+
from underthesea_core import TextClassifier
|
|
|
|
| 485 |
|
| 486 |
+
# Load model
|
| 487 |
+
clf = TextClassifier.load("models/sen-sentiment-general-1.0.0-20260206.bin")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 488 |
|
| 489 |
# Predict
|
| 490 |
+
label = clf.predict("Sản phẩm rất tốt") # "positive"
|
|
|
|
|
|
|
| 491 |
```
|
| 492 |
|
| 493 |
+
### 7.3 With underthesea API
|
| 494 |
|
| 495 |
```python
|
| 496 |
+
from underthesea import sentiment
|
| 497 |
|
| 498 |
+
# General sentiment
|
| 499 |
+
sentiment("Sản phẩm rất tốt") # "positive"
|
| 500 |
+
sentiment("hàng kém chất lg") # "negative"
|
| 501 |
+
sentiment.labels # ['positive', 'negative', 'neutral']
|
| 502 |
|
| 503 |
+
# Bank aspect-sentiment
|
| 504 |
+
sentiment("nhân viên hỗ trợ quá lâu", domain="bank") # ['CUSTOMER_SUPPORT#negative']
|
| 505 |
+
sentiment.bank.labels # ['CARD#negative', 'CARD#positive', ...]
|
| 506 |
+
```
|
| 507 |
+
|
| 508 |
+
### 7.4 Train Custom Model
|
| 509 |
+
|
| 510 |
+
```bash
|
| 511 |
+
# Train sentiment-general (with VLSP2016 data)
|
| 512 |
+
python src/train.py sentiment-general --vlsp2016-dir /path/to/VLSP2016_SA
|
| 513 |
+
|
| 514 |
+
# Train sentiment-bank
|
| 515 |
+
python src/train.py sentiment-bank
|
| 516 |
+
|
| 517 |
+
# Train news classifier
|
| 518 |
+
python src/train.py vntc --data-dir /path/to/VNTC
|
| 519 |
+
|
| 520 |
+
# Train banking classifier
|
| 521 |
+
python src/train.py bank
|
| 522 |
```
|
| 523 |
|
| 524 |
---
|
| 525 |
|
| 526 |
## 8. Limitations
|
| 527 |
|
| 528 |
+
1. **No word segmentation**: Operates at syllable-level (~4.6% gap vs word-level on VNTC)
|
| 529 |
2. **No pre-trained embeddings**: Uses TF-IDF instead of word vectors or contextual embeddings
|
| 530 |
3. **Single-label only**: Does not support multi-label classification
|
| 531 |
+
4. **Neutral class weakness**: NEU class has lowest precision in sentiment tasks due to inherent ambiguity
|
| 532 |
+
5. **Class imbalance sensitivity**: Lower performance on underrepresented categories
|
| 533 |
+
6. **Preprocessing dependency**: Sentiment models require `preprocess_sentiment()` at inference time (preprocessing must match training)
|
| 534 |
|
| 535 |
---
|
| 536 |
|
|
|
|
| 538 |
|
| 539 |
- [x] ~~Train on full VNTC dataset (33,759 samples)~~ **Done**
|
| 540 |
- [x] ~~Train on UTS2017_Bank dataset (1,977 samples)~~ **Done** (+3.29% vs sonar_core_1)
|
| 541 |
+
- [x] ~~Sentiment general model (VLSP2016 + UTS2017)~~ **Done** (+1.72% vs old flair SVM)
|
| 542 |
+
- [x] ~~Sentiment bank model (aspect-sentiment)~~ **Done** (+1.62% vs old sklearn)
|
| 543 |
+
- [x] ~~Remove sklearn from inference path~~ **Done** (pure Rust via underthesea_core)
|
| 544 |
+
- [x] ~~Vietnamese preprocessing pipeline~~ **Done** (teencode, negation, normalization)
|
| 545 |
- [ ] Add Vietnamese word segmentation (using underthesea)
|
| 546 |
- [ ] Implement multi-label classification
|
| 547 |
- [ ] Add PhoBERT-based variant (sen-2)
|
| 548 |
- [ ] Benchmark on additional datasets (UIT-VSMEC, UIT-VSFC)
|
| 549 |
+
- [ ] CHI-square feature selection for further improvement
|
| 550 |
+
- [ ] Ensemble methods (SVM + Perceptron + MaxEnt)
|
| 551 |
|
| 552 |
---
|
| 553 |
|
| 554 |
## 10. Conclusion
|
| 555 |
|
| 556 |
+
Sen-1 provides a suite of Vietnamese text classification and sentiment analysis models, all running on a pure Rust backend for fast inference:
|
| 557 |
|
| 558 |
+
| Task | Model | Accuracy | vs Previous |
|
| 559 |
+
|------|-------|----------|-------------|
|
| 560 |
+
| News Classification | sen-general | 92.49% | -0.31% vs sonar_core_1 |
|
| 561 |
+
| Banking Classification | sen-bank | 75.76% | +3.29% vs sonar_core_1 |
|
| 562 |
+
| Sentiment General (UTS2017) | sen-sentiment-general | 92.11% | +44.94% vs old flair |
|
| 563 |
+
| Sentiment General (VLSP2016) | sen-sentiment-general | 70.86% | +1.72% vs old flair |
|
| 564 |
+
| Sentiment Bank | sen-sentiment-bank | 70.65% | +1.62% vs old sklearn |
|
| 565 |
|
| 566 |
Key achievements:
|
| 567 |
|
| 568 |
+
- **Fast inference**: 66,678 samples/sec batch throughput (41x vs underthesea 9.2.8)
|
| 569 |
+
- **No sklearn dependency**: Pure Rust inference via underthesea_core
|
| 570 |
+
- **Vietnamese preprocessing**: Teencode expansion + negation marking yields +6.1% on VLSP2016
|
| 571 |
+
- **Multi-domain sentiment**: Single model handles both product reviews and banking text
|
| 572 |
+
- **Small footprint**: Models range from 1.6 MB to 8 MB
|
|
|
|
| 573 |
|
| 574 |
---
|
| 575 |
|
| 576 |
## References
|
| 577 |
|
| 578 |
+
1. Vu, C. D. H., Dien, D., Nguyen, L. N., & Ngo, Q. H. (2007). **A Comparative Study on Vietnamese Text Classification Methods**. IEEE RIVF 2007, 267-273.
|
| 579 |
|
| 580 |
2. duyvuleo. (2007). **VNTC: A Large-scale Vietnamese News Text Classification Corpus**. GitHub. https://github.com/duyvuleo/VNTC
|
| 581 |
|
| 582 |
+
3. VLSP. (2016). **VLSP 2016 Shared Task: Sentiment Analysis**. https://vlsp.org.vn/vlsp2016/eval/sa
|
| 583 |
|
| 584 |
+
4. Nguyen, L. T., et al. (2023). **Is Word Segmentation Necessary for Vietnamese Sentiment Classification?** arXiv:2301.00418. https://arxiv.org/abs/2301.00418
|
| 585 |
|
| 586 |
5. Nguyen, D. Q., & Nguyen, A. T. (2020). **PhoBERT: Pre-trained language models for Vietnamese**. Findings of EMNLP 2020. https://aclanthology.org/2020.findings-emnlp.92/
|
| 587 |
|
| 588 |
+
6. Pedregosa, F., et al. (2011). **Scikit-learn: Machine Learning in Python**. JMLR, 12, 2825-2830.
|
| 589 |
|
| 590 |
+
7. UnderTheSea NLP. (2017). **Underthesea: Vietnamese NLP Toolkit**. https://github.com/undertheseanlp/underthesea
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 591 |
|
| 592 |
---
|
| 593 |
|
| 594 |
+
## Appendix A: Model Cards
|
| 595 |
|
| 596 |
+
### sen-sentiment-general-1.0.0-20260206
|
| 597 |
+
|
| 598 |
+
| Field | Value |
|
| 599 |
+
|-------|-------|
|
| 600 |
+
| Model Name | sen-sentiment-general-1.0.0-20260206 |
|
| 601 |
+
| Architecture | TF-IDF + Linear SVM (Rust/underthesea_core) |
|
| 602 |
+
| Language | Vietnamese |
|
| 603 |
+
| License | Apache 2.0 |
|
| 604 |
+
| Repository | https://huggingface.co/undertheseanlp/sen-1 |
|
| 605 |
+
| Training Data | VLSP2016 (5,100) + UTS2017_Bank sentiment (1,977) = 7,077 |
|
| 606 |
+
| Labels | positive, negative, neutral |
|
| 607 |
+
| Preprocessing | preprocess_sentiment() required |
|
| 608 |
+
| max_features | 200,000 |
|
| 609 |
+
| ngram_range | (1, 3) |
|
| 610 |
+
| max_df | 0.9 |
|
| 611 |
+
| C | 0.7 |
|
| 612 |
+
| Accuracy (UTS2017) | 92.11% |
|
| 613 |
+
| Accuracy (VLSP2016) | 70.86% |
|
| 614 |
+
| Model Size | 7.95 MB |
|
| 615 |
+
|
| 616 |
+
### sen-sentiment-bank-1.0.0-20260206
|
| 617 |
|
| 618 |
| Field | Value |
|
| 619 |
|-------|-------|
|
| 620 |
+
| Model Name | sen-sentiment-bank-1.0.0-20260206 |
|
| 621 |
+
| Architecture | TF-IDF + Linear SVM (Rust/underthesea_core) |
|
|
|
|
| 622 |
| Language | Vietnamese |
|
| 623 |
| License | Apache 2.0 |
|
| 624 |
| Repository | https://huggingface.co/undertheseanlp/sen-1 |
|
| 625 |
+
| Training Data | UTS2017_Bank merged (1,977 samples) |
|
| 626 |
+
| Labels | 36 (e.g., CUSTOMER_SUPPORT#negative, CARD#positive) |
|
| 627 |
+
| Preprocessing | preprocess_sentiment() required |
|
| 628 |
+
| max_features | 10,000 |
|
| 629 |
+
| ngram_range | (1, 2) |
|
| 630 |
+
| C | 1.0 |
|
| 631 |
+
| Accuracy | 70.65% |
|
| 632 |
+
| Model Size | 1.61 MB |
|
| 633 |
+
|
| 634 |
+
### sen-general-1.0.0-20260203 (News Classification)
|
| 635 |
+
|
| 636 |
+
| Field | Value |
|
| 637 |
+
|-------|-------|
|
| 638 |
+
| Model Name | sen-general-1.0.0-20260203 |
|
| 639 |
+
| Architecture | TF-IDF + Linear SVM (Rust/underthesea_core) |
|
| 640 |
+
| Language | Vietnamese |
|
| 641 |
| Training Data | VNTC (33,759 samples) |
|
|
|
|
| 642 |
| Categories | 10 |
|
| 643 |
| max_features | 20,000 |
|
| 644 |
| ngram_range | (1, 2) |
|
| 645 |
| Accuracy | 92.49% |
|
|
|
|
|
|
|
| 646 |
|
| 647 |
+
### sen-bank-1.0.0-20260203 (Banking Classification)
|
| 648 |
|
| 649 |
| Field | Value |
|
| 650 |
|-------|-------|
|
| 651 |
+
| Model Name | sen-bank-1.0.0-20260203 |
|
| 652 |
+
| Architecture | TF-IDF + Linear SVM (Rust/underthesea_core) |
|
|
|
|
| 653 |
| Language | Vietnamese |
|
| 654 |
+
| Training Data | UTS2017_Bank (1,977 samples) |
|
|
|
|
|
|
|
|
|
|
| 655 |
| Categories | 14 |
|
| 656 |
+
| max_features | 10,000 |
|
| 657 |
| ngram_range | (1, 2) |
|
| 658 |
| Accuracy | 75.76% |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 659 |
|
| 660 |
---
|
| 661 |
|
| 662 |
+
*Report generated: February 6, 2026*
|
| 663 |
*UnderTheSea NLP - https://github.com/undertheseanlp*
|
models/sen-sentiment-bank-1.0.0-20260206.bin
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:15f92315a43b3d131322402bc2f44b3bd5ee2ec58584a9b7ec3eec596d3eab8b
|
| 3 |
+
size 1693351
|
models/sen-sentiment-general-1.0.0-20260206.bin
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1d2c23599cff870ecee535c1faab56f4e30a0f02950490a79e4ec713d69394a6
|
| 3 |
+
size 8335929
|
src/train.py
CHANGED
|
@@ -7,7 +7,9 @@ Usage:
|
|
| 7 |
"""
|
| 8 |
|
| 9 |
import os
|
|
|
|
| 10 |
import time
|
|
|
|
| 11 |
from pathlib import Path
|
| 12 |
|
| 13 |
import click
|
|
@@ -15,6 +17,57 @@ from sklearn.metrics import accuracy_score, f1_score, classification_report
|
|
| 15 |
|
| 16 |
from underthesea_core import TextClassifier
|
| 17 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 18 |
|
| 19 |
def read_file(filepath):
|
| 20 |
"""Read text file with multiple encoding attempts."""
|
|
@@ -209,5 +262,230 @@ def bank(output, max_features, ngram_min, ngram_max, min_df, c, max_iter, tol):
|
|
| 209 |
click.echo(f"\nModel saved to {model_path} ({size_mb:.2f} MB)")
|
| 210 |
|
| 211 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 212 |
if __name__ == "__main__":
|
| 213 |
cli()
|
|
|
|
| 7 |
"""
|
| 8 |
|
| 9 |
import os
|
| 10 |
+
import re
|
| 11 |
import time
|
| 12 |
+
import unicodedata
|
| 13 |
from pathlib import Path
|
| 14 |
|
| 15 |
import click
|
|
|
|
| 17 |
|
| 18 |
from underthesea_core import TextClassifier
|
| 19 |
|
| 20 |
+
# Vietnamese teencode dictionary
|
| 21 |
+
_TEENCODE = {
|
| 22 |
+
'ko': 'không', 'k': 'không', 'hok': 'không', 'hem': 'không',
|
| 23 |
+
'dc': 'được', 'đc': 'được', 'dk': 'được',
|
| 24 |
+
'ntn': 'như thế nào',
|
| 25 |
+
'nc': 'nói chuyện', 'nt': 'nhắn tin',
|
| 26 |
+
'cx': 'cũng', 'cg': 'cũng',
|
| 27 |
+
'vs': 'với', 'vl': 'vãi',
|
| 28 |
+
'bt': 'bình thường', 'bth': 'bình thường',
|
| 29 |
+
'lg': 'lượng', 'tl': 'trả lời',
|
| 30 |
+
'ms': 'mới', 'r': 'rồi',
|
| 31 |
+
'mn': 'mọi người', 'mk': 'mình',
|
| 32 |
+
'ok': 'tốt', 'oke': 'tốt',
|
| 33 |
+
'sp': 'sản phẩm',
|
| 34 |
+
'hqua': 'hôm qua', 'hnay': 'hôm nay',
|
| 35 |
+
'tks': 'cảm ơn', 'thanks': 'cảm ơn', 'thank': 'cảm ơn',
|
| 36 |
+
'j': 'gì', 'z': 'vậy', 'v': 'vậy',
|
| 37 |
+
'đt': 'điện thoại', 'dt': 'điện thoại',
|
| 38 |
+
'lm': 'làm', 'ns': 'nói',
|
| 39 |
+
}
|
| 40 |
+
|
| 41 |
+
_NEG_WORDS = {'không', 'chẳng', 'chả', 'chưa', 'đừng', 'ko', 'hok', 'hem', 'chăng'}
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def preprocess_sentiment(text):
|
| 45 |
+
"""Preprocess Vietnamese text for sentiment analysis."""
|
| 46 |
+
text = unicodedata.normalize('NFC', text)
|
| 47 |
+
text = text.lower()
|
| 48 |
+
text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
|
| 49 |
+
text = re.sub(r'(.)\1{2,}', r'\1\1', text)
|
| 50 |
+
text = re.sub(r'!{2,}', '!', text)
|
| 51 |
+
text = re.sub(r'\?{2,}', '?', text)
|
| 52 |
+
text = re.sub(r'\.{4,}', '...', text)
|
| 53 |
+
# Teencode expansion
|
| 54 |
+
words = text.split()
|
| 55 |
+
expanded = []
|
| 56 |
+
for w in words:
|
| 57 |
+
wl = w.strip('.,!?;:')
|
| 58 |
+
if wl in _TEENCODE:
|
| 59 |
+
expanded.append(_TEENCODE[wl])
|
| 60 |
+
else:
|
| 61 |
+
expanded.append(w)
|
| 62 |
+
# Negation marking (2-word window)
|
| 63 |
+
new_words = list(expanded)
|
| 64 |
+
for i, w in enumerate(expanded):
|
| 65 |
+
wl = w.strip('.,!?;:')
|
| 66 |
+
if wl in _NEG_WORDS:
|
| 67 |
+
for j in range(i + 1, min(i + 3, len(expanded))):
|
| 68 |
+
new_words[j] = 'NEG_' + expanded[j]
|
| 69 |
+
return ' '.join(new_words)
|
| 70 |
+
|
| 71 |
|
| 72 |
def read_file(filepath):
|
| 73 |
"""Read text file with multiple encoding attempts."""
|
|
|
|
| 262 |
click.echo(f"\nModel saved to {model_path} ({size_mb:.2f} MB)")
|
| 263 |
|
| 264 |
|
| 265 |
+
def _load_vlsp2016(data_dir):
|
| 266 |
+
"""Load VLSP2016 sentiment data from directory."""
|
| 267 |
+
label_map = {'POS': 'positive', 'NEG': 'negative', 'NEU': 'neutral'}
|
| 268 |
+
texts, labels = [], []
|
| 269 |
+
for split in ['train.txt', 'test.txt']:
|
| 270 |
+
split_texts, split_labels = [], []
|
| 271 |
+
filepath = os.path.join(data_dir, split)
|
| 272 |
+
with open(filepath, 'r', encoding='utf-8') as f:
|
| 273 |
+
for line in f:
|
| 274 |
+
line = line.strip()
|
| 275 |
+
if line.startswith('__label__'):
|
| 276 |
+
parts = line.split(' ', 1)
|
| 277 |
+
label = label_map[parts[0].replace('__label__', '')]
|
| 278 |
+
text = parts[1] if len(parts) > 1 else ''
|
| 279 |
+
split_texts.append(text)
|
| 280 |
+
split_labels.append(label)
|
| 281 |
+
texts.append(split_texts)
|
| 282 |
+
labels.append(split_labels)
|
| 283 |
+
return texts[0], labels[0], texts[1], labels[1]
|
| 284 |
+
|
| 285 |
+
|
| 286 |
+
@cli.command('sentiment-general')
|
| 287 |
+
@click.option('--output', '-o', default=None, help='Output model path')
|
| 288 |
+
@click.option('--vlsp2016-dir', default=None, help='Path to VLSP2016_SA directory (adds to training data)')
|
| 289 |
+
@click.option('--max-features', default=200000, help='Maximum vocabulary size')
|
| 290 |
+
@click.option('--ngram-min', default=1, help='Minimum n-gram')
|
| 291 |
+
@click.option('--ngram-max', default=3, help='Maximum n-gram')
|
| 292 |
+
@click.option('--min-df', default=1, help='Minimum document frequency')
|
| 293 |
+
@click.option('--max-df', default=0.9, help='Maximum document frequency')
|
| 294 |
+
@click.option('--c', default=0.7, help='SVM regularization parameter')
|
| 295 |
+
@click.option('--max-iter', default=1000, help='Maximum iterations')
|
| 296 |
+
@click.option('--tol', default=0.0001, help='Convergence tolerance')
|
| 297 |
+
def sentiment_general(output, vlsp2016_dir, max_features, ngram_min, ngram_max, min_df, max_df, c, max_iter, tol):
|
| 298 |
+
"""Train sentiment-general model (3 classes: positive/negative/neutral).
|
| 299 |
+
|
| 300 |
+
Uses UTS2017_Bank sentiment data by default. Optionally adds VLSP2016 data
|
| 301 |
+
with --vlsp2016-dir for improved general-domain coverage.
|
| 302 |
+
"""
|
| 303 |
+
from datetime import datetime
|
| 304 |
+
from datasets import load_dataset
|
| 305 |
+
|
| 306 |
+
if output is None:
|
| 307 |
+
date_str = datetime.now().strftime('%Y%m%d')
|
| 308 |
+
output = f'models/sen-sentiment-general-1.0.0-{date_str}.bin'
|
| 309 |
+
|
| 310 |
+
click.echo("=" * 70)
|
| 311 |
+
click.echo("Sentiment General Training (positive/negative/neutral)")
|
| 312 |
+
click.echo("=" * 70)
|
| 313 |
+
|
| 314 |
+
# Load UTS2017_Bank sentiment data
|
| 315 |
+
click.echo("\nLoading UTS2017_Bank sentiment dataset from HuggingFace...")
|
| 316 |
+
dataset = load_dataset("undertheseanlp/UTS2017_Bank", "sentiment")
|
| 317 |
+
|
| 318 |
+
train_texts = list(dataset["train"]["text"])
|
| 319 |
+
train_labels = list(dataset["train"]["sentiment"])
|
| 320 |
+
test_texts = list(dataset["test"]["text"])
|
| 321 |
+
test_labels = list(dataset["test"]["sentiment"])
|
| 322 |
+
|
| 323 |
+
vlsp_test_texts, vlsp_test_labels = None, None
|
| 324 |
+
|
| 325 |
+
# Optionally add VLSP2016 data
|
| 326 |
+
if vlsp2016_dir:
|
| 327 |
+
click.echo(f"\nLoading VLSP2016 data from {vlsp2016_dir}...")
|
| 328 |
+
vlsp_train_texts, vlsp_train_labels, vlsp_test_texts, vlsp_test_labels = _load_vlsp2016(vlsp2016_dir)
|
| 329 |
+
train_texts.extend(vlsp_train_texts)
|
| 330 |
+
train_labels.extend(vlsp_train_labels)
|
| 331 |
+
click.echo(f" VLSP2016 train: {len(vlsp_train_texts)}, test: {len(vlsp_test_texts)}")
|
| 332 |
+
|
| 333 |
+
click.echo(f" Total train samples: {len(train_texts)}")
|
| 334 |
+
click.echo(f" UTS2017 test samples: {len(test_texts)}")
|
| 335 |
+
click.echo(f" Labels: {sorted(set(train_labels))}")
|
| 336 |
+
|
| 337 |
+
# Preprocess
|
| 338 |
+
click.echo("\nPreprocessing...")
|
| 339 |
+
proc_train = [preprocess_sentiment(t) for t in train_texts]
|
| 340 |
+
proc_test = [preprocess_sentiment(t) for t in test_texts]
|
| 341 |
+
|
| 342 |
+
# Train
|
| 343 |
+
click.echo("\nTraining Rust TextClassifier...")
|
| 344 |
+
clf = TextClassifier(
|
| 345 |
+
max_features=max_features,
|
| 346 |
+
ngram_range=(ngram_min, ngram_max),
|
| 347 |
+
min_df=min_df,
|
| 348 |
+
max_df=max_df,
|
| 349 |
+
c=c,
|
| 350 |
+
max_iter=max_iter,
|
| 351 |
+
tol=tol,
|
| 352 |
+
)
|
| 353 |
+
|
| 354 |
+
t0 = time.perf_counter()
|
| 355 |
+
clf.fit(proc_train, train_labels)
|
| 356 |
+
train_time = time.perf_counter() - t0
|
| 357 |
+
click.echo(f" Training time: {train_time:.3f}s")
|
| 358 |
+
click.echo(f" Vocabulary size: {clf.n_features}")
|
| 359 |
+
|
| 360 |
+
# Evaluate on UTS2017
|
| 361 |
+
click.echo("\nEvaluating on UTS2017_Bank test set...")
|
| 362 |
+
preds = clf.predict_batch(proc_test)
|
| 363 |
+
|
| 364 |
+
acc = accuracy_score(test_labels, preds)
|
| 365 |
+
f1_w = f1_score(test_labels, preds, average='weighted', zero_division=0)
|
| 366 |
+
f1_m = f1_score(test_labels, preds, average='macro', zero_division=0)
|
| 367 |
+
|
| 368 |
+
click.echo("\n" + "=" * 70)
|
| 369 |
+
click.echo("RESULTS (UTS2017_Bank)")
|
| 370 |
+
click.echo("=" * 70)
|
| 371 |
+
click.echo(f" Accuracy: {acc:.4f} ({acc*100:.2f}%)")
|
| 372 |
+
click.echo(f" F1 (weighted): {f1_w:.4f}")
|
| 373 |
+
click.echo(f" F1 (macro): {f1_m:.4f}")
|
| 374 |
+
click.echo("\nClassification Report:")
|
| 375 |
+
click.echo(classification_report(test_labels, preds, zero_division=0))
|
| 376 |
+
|
| 377 |
+
# Evaluate on VLSP2016 if available
|
| 378 |
+
if vlsp_test_texts:
|
| 379 |
+
proc_vlsp_test = [preprocess_sentiment(t) for t in vlsp_test_texts]
|
| 380 |
+
vlsp_preds = clf.predict_batch(proc_vlsp_test)
|
| 381 |
+
vlsp_acc = accuracy_score(vlsp_test_labels, vlsp_preds)
|
| 382 |
+
vlsp_f1w = f1_score(vlsp_test_labels, vlsp_preds, average='weighted', zero_division=0)
|
| 383 |
+
vlsp_f1m = f1_score(vlsp_test_labels, vlsp_preds, average='macro', zero_division=0)
|
| 384 |
+
|
| 385 |
+
click.echo("=" * 70)
|
| 386 |
+
click.echo("RESULTS (VLSP2016)")
|
| 387 |
+
click.echo("=" * 70)
|
| 388 |
+
click.echo(f" Accuracy: {vlsp_acc:.4f} ({vlsp_acc*100:.2f}%)")
|
| 389 |
+
click.echo(f" F1 (weighted): {vlsp_f1w:.4f}")
|
| 390 |
+
click.echo(f" F1 (macro): {vlsp_f1m:.4f}")
|
| 391 |
+
click.echo("\nClassification Report:")
|
| 392 |
+
click.echo(classification_report(vlsp_test_labels, vlsp_preds, zero_division=0))
|
| 393 |
+
|
| 394 |
+
# Save model
|
| 395 |
+
model_path = Path(output)
|
| 396 |
+
model_path.parent.mkdir(parents=True, exist_ok=True)
|
| 397 |
+
clf.save(str(model_path))
|
| 398 |
+
|
| 399 |
+
size_mb = model_path.stat().st_size / (1024 * 1024)
|
| 400 |
+
click.echo(f"\nModel saved to {model_path} ({size_mb:.2f} MB)")
|
| 401 |
+
|
| 402 |
+
|
| 403 |
+
@cli.command('sentiment-bank')
|
| 404 |
+
@click.option('--output', '-o', default=None, help='Output model path')
|
| 405 |
+
@click.option('--max-features', default=200000, help='Maximum vocabulary size')
|
| 406 |
+
@click.option('--ngram-min', default=1, help='Minimum n-gram')
|
| 407 |
+
@click.option('--ngram-max', default=3, help='Maximum n-gram')
|
| 408 |
+
@click.option('--min-df', default=1, help='Minimum document frequency')
|
| 409 |
+
@click.option('--max-df', default=0.9, help='Maximum document frequency')
|
| 410 |
+
@click.option('--c', default=0.7, help='SVM regularization parameter')
|
| 411 |
+
@click.option('--max-iter', default=1000, help='Maximum iterations')
|
| 412 |
+
@click.option('--tol', default=0.0001, help='Convergence tolerance')
|
| 413 |
+
def sentiment_bank(output, max_features, ngram_min, ngram_max, min_df, max_df, c, max_iter, tol):
|
| 414 |
+
"""Train sentiment-bank model on UTS2017_Bank (36 combined category#sentiment labels)."""
|
| 415 |
+
from datetime import datetime
|
| 416 |
+
from datasets import load_dataset
|
| 417 |
+
|
| 418 |
+
if output is None:
|
| 419 |
+
date_str = datetime.now().strftime('%Y%m%d')
|
| 420 |
+
output = f'models/sen-sentiment-bank-1.0.0-{date_str}.bin'
|
| 421 |
+
|
| 422 |
+
click.echo("=" * 70)
|
| 423 |
+
click.echo("Sentiment Bank Training (category#sentiment, 36 labels)")
|
| 424 |
+
click.echo("=" * 70)
|
| 425 |
+
|
| 426 |
+
# Load and merge classification + sentiment configs
|
| 427 |
+
click.echo("\nLoading UTS2017_Bank dataset from HuggingFace...")
|
| 428 |
+
ds_class = load_dataset("undertheseanlp/UTS2017_Bank", "classification")
|
| 429 |
+
ds_sent = load_dataset("undertheseanlp/UTS2017_Bank", "sentiment")
|
| 430 |
+
|
| 431 |
+
train_texts = list(ds_class["train"]["text"])
|
| 432 |
+
train_labels = [f'{c}#{s}' for c, s in zip(ds_class["train"]["label"], ds_sent["train"]["sentiment"])]
|
| 433 |
+
test_texts = list(ds_class["test"]["text"])
|
| 434 |
+
test_labels = [f'{c}#{s}' for c, s in zip(ds_class["test"]["label"], ds_sent["test"]["sentiment"])]
|
| 435 |
+
|
| 436 |
+
click.echo(f" Train samples: {len(train_texts)}")
|
| 437 |
+
click.echo(f" Test samples: {len(test_texts)}")
|
| 438 |
+
click.echo(f" Labels: {len(set(train_labels))}")
|
| 439 |
+
|
| 440 |
+
# Preprocess
|
| 441 |
+
click.echo("\nPreprocessing...")
|
| 442 |
+
proc_train = [preprocess_sentiment(t) for t in train_texts]
|
| 443 |
+
proc_test = [preprocess_sentiment(t) for t in test_texts]
|
| 444 |
+
|
| 445 |
+
# Train
|
| 446 |
+
click.echo("\nTraining Rust TextClassifier...")
|
| 447 |
+
clf = TextClassifier(
|
| 448 |
+
max_features=max_features,
|
| 449 |
+
ngram_range=(ngram_min, ngram_max),
|
| 450 |
+
min_df=min_df,
|
| 451 |
+
max_df=max_df,
|
| 452 |
+
c=c,
|
| 453 |
+
max_iter=max_iter,
|
| 454 |
+
tol=tol,
|
| 455 |
+
)
|
| 456 |
+
|
| 457 |
+
t0 = time.perf_counter()
|
| 458 |
+
clf.fit(proc_train, train_labels)
|
| 459 |
+
train_time = time.perf_counter() - t0
|
| 460 |
+
click.echo(f" Training time: {train_time:.3f}s")
|
| 461 |
+
click.echo(f" Vocabulary size: {clf.n_features}")
|
| 462 |
+
|
| 463 |
+
# Evaluate
|
| 464 |
+
click.echo("\nEvaluating...")
|
| 465 |
+
preds = clf.predict_batch(proc_test)
|
| 466 |
+
|
| 467 |
+
acc = accuracy_score(test_labels, preds)
|
| 468 |
+
f1_w = f1_score(test_labels, preds, average='weighted', zero_division=0)
|
| 469 |
+
f1_m = f1_score(test_labels, preds, average='macro', zero_division=0)
|
| 470 |
+
|
| 471 |
+
click.echo("\n" + "=" * 70)
|
| 472 |
+
click.echo("RESULTS")
|
| 473 |
+
click.echo("=" * 70)
|
| 474 |
+
click.echo(f" Accuracy: {acc:.4f} ({acc*100:.2f}%)")
|
| 475 |
+
click.echo(f" F1 (weighted): {f1_w:.4f}")
|
| 476 |
+
click.echo(f" F1 (macro): {f1_m:.4f}")
|
| 477 |
+
|
| 478 |
+
click.echo("\nClassification Report:")
|
| 479 |
+
click.echo(classification_report(test_labels, preds, zero_division=0))
|
| 480 |
+
|
| 481 |
+
# Save model
|
| 482 |
+
model_path = Path(output)
|
| 483 |
+
model_path.parent.mkdir(parents=True, exist_ok=True)
|
| 484 |
+
clf.save(str(model_path))
|
| 485 |
+
|
| 486 |
+
size_mb = model_path.stat().st_size / (1024 * 1024)
|
| 487 |
+
click.echo(f"\nModel saved to {model_path} ({size_mb:.2f} MB)")
|
| 488 |
+
|
| 489 |
+
|
| 490 |
if __name__ == "__main__":
|
| 491 |
cli()
|