Tiep Claude Opus 4.6 commited on
Commit
b5fd35d
·
1 Parent(s): ef06968

Add sentiment models v1.2.0 with Vietnamese preprocessing

Browse files

- Train sentiment-general (VLSP2016+UTS2017): 92.11% UTS2017, 70.86% VLSP2016
- Train sentiment-bank (UTS2017): 70.65% accuracy
- Add preprocessing: lowercase, teencode expansion, negation marking, repeated char normalization
- Update TECHNICAL_REPORT.md to v1.2.0 with full experiment results
- Track .bin files with LFS/Xet storage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

.gitattributes CHANGED
@@ -4,3 +4,4 @@
4
  *.jpeg filter=lfs diff=lfs merge=lfs -text
5
  *.gif filter=lfs diff=lfs merge=lfs -text
6
  *.synctex filter=lfs diff=lfs merge=lfs -text
 
 
4
  *.jpeg filter=lfs diff=lfs merge=lfs -text
5
  *.gif filter=lfs diff=lfs merge=lfs -text
6
  *.synctex filter=lfs diff=lfs merge=lfs -text
7
+ *.bin filter=lfs diff=lfs merge=lfs -text
TECHNICAL_REPORT.md CHANGED
@@ -1,21 +1,23 @@
1
  # Sen-1: Vietnamese Text Classification Model
2
 
3
- **Technical Report v1.1.0**
4
 
5
  Authors: UnderTheSea NLP
6
- Date: February 2, 2026
7
  Model: `undertheseanlp/sen-1`
8
 
9
  ---
10
 
11
  ## Abstract
12
 
13
- Sen-1 is a Vietnamese text classification model based on the traditional machine learning approach using TF-IDF vectorization combined with Support Vector Machine (SVM) classifier. This report describes the methodology, implementation, and evaluation of the model on two benchmark datasets:
14
 
15
  - **VNTC (News)**: 92.49% accuracy on 10-topic news classification
16
  - **UTS2017_Bank (Banking)**: 75.76% accuracy on 14-category banking text classification
 
 
17
 
18
- The model reproduces the sonar_core_1 architecture and is designed to be compatible with the underthesea Vietnamese NLP toolkit API.
19
 
20
  ---
21
 
@@ -25,9 +27,10 @@ Text classification is a fundamental task in Natural Language Processing (NLP) t
25
 
26
  - **Word segmentation**: Vietnamese words can consist of multiple syllables
27
  - **Diacritics**: Vietnamese uses Latin script with additional diacritical marks
 
28
  - **Limited resources**: Fewer labeled datasets compared to English
29
 
30
- Sen-1 addresses these challenges by implementing a robust TF-IDF + SVM pipeline that has proven effective for Vietnamese text classification tasks.
31
 
32
  ---
33
 
@@ -41,7 +44,20 @@ The seminal work on Vietnamese text classification was presented by Vu et al. (2
41
  - **Baseline methods**: Bag-of-Words (BOW), N-gram, and SVM approaches
42
  - **Benchmark results**: Achieving >95% accuracy on 10-topic classification
43
 
44
- ### 2.2 Traditional ML vs Deep Learning
 
 
 
 
 
 
 
 
 
 
 
 
 
45
 
46
  | Approach | Pros | Cons |
47
  |----------|------|------|
@@ -56,34 +72,37 @@ Sen-1 adopts the traditional approach for its simplicity, speed, and effectivene
56
 
57
  ### 3.1 Architecture Overview
58
 
59
- Sen-1 reproduces the **sonar_core_1** architecture using a 3-stage pipeline:
60
 
61
  ```
62
- ┌─────────────────────────────────────────────────────────┐
63
- Sen-1 Pipeline │
64
- │ (sonar_core_1 reproduction) │
65
- ├─────────────────────────────────────────────────────────┤
66
- Input Text
67
-
68
- ┌─────────────────────────────────────────────────┐
69
- │ │ CountVectorizer │ │
70
- │ │ - max_features: 20,000 │ │
71
- │ │ - ngram_range: (1, 2) │ │
72
- └─────────────────────────────────────────────────┘
73
-
74
- ┌─────────────────────────────────────────────────┐
75
- │ TfidfTransformer │
76
- │ │ - use_idf: True │ │
77
- └─────────────────────────────────────────────────┘
78
-
79
- ┌─────────────────────────────────────────────────┐
80
- │ LinearSVC Classifier │
81
- - C: 1.0 │ │
82
- │ - max_iter: 2000 │
83
- └─────────────────────────────────────────────────┘
84
-
85
- Output: Predicted Label
86
- └─────────────────────────────────────────────────────────┘
 
 
 
87
  ```
88
 
89
  ### 3.2 TF-IDF Vectorization
@@ -97,13 +116,14 @@ Where:
97
  - $\text{IDF}(t) = \log\frac{N}{|\{d \in D : t \in d\}|}$
98
  - $N$ = total number of documents
99
 
100
- **Hyperparameters (sonar_core_1 config):**
101
 
102
- | Parameter | Value | Description |
103
- |-----------|-------|-------------|
104
- | `max_features` | 20,000 | Maximum vocabulary size |
105
- | `ngram_range` | (1, 2) | Unigrams and bigrams |
106
- | `use_idf` | True | Use inverse document frequency |
 
107
 
108
  ### 3.3 Support Vector Machine
109
 
@@ -111,15 +131,55 @@ Linear SVM is used for classification due to its effectiveness on high-dimension
111
 
112
  $$\min_{w, b} \frac{1}{2}||w||^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i(w^T x_i + b))$$
113
 
114
- **Hyperparameters:**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
115
 
116
- | Parameter | Value | Description |
117
- |-----------|-------|-------------|
118
- | `C` | 1.0 | Regularization parameter |
119
- | `max_iter` | 2000 | Maximum iterations |
120
- | `loss` | squared_hinge | Squared hinge loss (LinearSVC default) |
121
 
122
- ### 3.4 Confidence Score
 
 
 
 
 
 
 
 
 
123
 
124
  Confidence scores are computed from the SVM decision function using sigmoid transformation:
125
 
@@ -129,7 +189,7 @@ Where $f(x)$ is the decision function value.
129
 
130
  ---
131
 
132
- ## 4. Dataset
133
 
134
  ### 4.1 VNTC Dataset
135
 
@@ -155,33 +215,45 @@ The Vietnamese News Text Classification (VNTC) corpus is the standard benchmark
155
 
156
  ### 4.2 UTS2017_Bank Dataset
157
 
158
- The UTS2017_Bank dataset is a Vietnamese banking domain text classification corpus:
159
-
160
- **14 Categories:**
161
-
162
- | Category | English | Samples |
163
- |----------|---------|---------|
164
- | ACCOUNT | Account services | 5 |
165
- | CARD | Card services | 66 |
166
- | CUSTOMER_SUPPORT | Customer support | 774 |
167
- | DISCOUNT | Discounts | 40 |
168
- | INTEREST_RATE | Interest rates | 58 |
169
- | INTERNET_BANKING | Internet banking | 69 |
170
- | LOAN | Loan services | 73 |
171
- | MONEY_TRANSFER | Money transfer | 37 |
172
- | OTHER | Other | 70 |
173
- | PAYMENT | Payment services | 17 |
174
- | PROMOTION | Promotions | 56 |
175
- | SAVING | Savings | 12 |
176
- | SECURITY | Security | 3 |
177
- | TRADEMARK | Trademark/Brand | 697 |
178
- | **Total** | | **1,977** |
179
-
180
- **Train/Test Split:** 80/20 stratified (1,581 train / 396 test)
 
181
 
182
  **Source:** https://huggingface.co/datasets/undertheseanlp/UTS2017_Bank
183
 
184
- **Class Imbalance:** The dataset is highly imbalanced, with CUSTOMER_SUPPORT (39%) and TRADEMARK (35%) dominating, while ACCOUNT (0.3%) and SECURITY (0.2%) have very few samples.
 
 
 
 
 
 
 
 
 
 
 
185
 
186
  ---
187
 
@@ -189,33 +261,43 @@ The UTS2017_Bank dataset is a Vietnamese banking domain text classification corp
189
 
190
  ### 5.1 Dependencies
191
 
 
 
 
 
 
 
 
 
 
192
  ```
193
- scikit-learn>=1.0.0
194
- joblib>=1.0.0
195
- numpy>=1.20.0
196
  ```
197
 
198
- ### 5.2 API Design
199
 
200
- Sen-1 is designed to be compatible with underthesea API:
201
 
202
  ```python
203
- # Core classes
204
- class Label:
205
- value: str # Label name
206
- score: float # Confidence (0-1)
207
-
208
- class Sentence:
209
- text: str # Input text
210
- labels: List[Label] # Predicted labels
211
-
212
- class SenTextClassifier:
213
- def train(train_texts, train_labels, val_texts=None, val_labels=None) -> dict
214
- def predict(sentence: Sentence) -> None
215
- def predict_batch(texts: List[str]) -> List[Label]
216
- def evaluate(texts, labels) -> dict
217
- def save(path: str) -> None
218
- def load(path: str) -> SenTextClassifier
 
 
 
219
  ```
220
 
221
  ### 5.3 Model Files
@@ -223,64 +305,30 @@ class SenTextClassifier:
223
  ```
224
  undertheseanlp/sen-1/
225
  └── models/
226
- ├── sen-general-1.0.0-20260202/ # News classification (VNTC)
227
- ├── pipeline.joblib # TF-IDF + SVM pipeline
228
- ├── label_encoder.joblib # Label encoder
229
- └── metadata.json # Model configuration
230
-
231
- └── sen-bank-1.0.0-20260202/ # Banking classification (UTS2017_Bank)
232
- ├── pipeline.joblib # TF-IDF + SVM pipeline
233
- ├── label_encoder.joblib # Label encoder
234
- └── metadata.json # Model configuration
235
  ```
236
 
237
- **metadata.json:**
238
- ```json
239
- {
240
- "model_type": "sonar_core_1_reproduction",
241
- "architecture": "CountVectorizer + TfidfTransformer + LinearSVC",
242
- "max_features": 20000,
243
- "ngram_range": [1, 2],
244
- "test_accuracy": 0.9249,
245
- "test_f1_weighted": 0.924,
246
- "labels": ["Chinh tri Xa hoi", "Doi song", ...]
247
- }
248
- ```
249
 
250
  ---
251
 
252
  ## 6. Experiments
253
 
254
- ### 6.1 Training Configuration
255
-
256
- ```python
257
- # sonar_core_1 configuration
258
- from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
259
- from sklearn.svm import LinearSVC
260
- from sklearn.pipeline import Pipeline
261
-
262
- pipeline = Pipeline([
263
- ('vect', CountVectorizer(max_features=20000, ngram_range=(1, 2))),
264
- ('tfidf', TfidfTransformer(use_idf=True)),
265
- ('clf', LinearSVC(C=1.0, max_iter=2000, random_state=42)),
266
- ])
267
- ```
268
-
269
- ### 6.2 VNTC Benchmark Results
270
 
271
- **Overall Performance:**
272
 
273
  | Metric | Value |
274
  |--------|-------|
275
  | **Accuracy** | **92.49%** |
276
  | **F1 (weighted)** | **92.40%** |
277
  | F1 (macro) | 90.44% |
278
- | Precision (weighted) | 92.00% |
279
- | Recall (weighted) | 92.00% |
280
  | **Training time** | **37.6s** |
281
- | Test samples | 50,373 |
282
 
283
- ### 6.3 Per-Category Results
284
 
285
  | Category | Precision | Recall | F1-Score | Support |
286
  |----------|-----------|--------|----------|---------|
@@ -295,183 +343,130 @@ pipeline = Pipeline([
295
  | Van hoa | 0.93 | 0.96 | 0.94 | 6,250 |
296
  | Vi tinh | 0.94 | 0.96 | 0.95 | 4,560 |
297
 
298
- **Best performing category:** Sports (The thao) with 98% F1-score
299
- **Most challenging category:** Lifestyle (Doi song) with 72% F1-score
300
-
301
- ### 6.4 UTS2017_Bank Benchmark Results
302
 
303
- **Overall Performance:**
304
 
305
  | Metric | Value |
306
  |--------|-------|
307
  | **Accuracy** | **75.76%** |
308
  | **F1 (weighted)** | **72.70%** |
309
  | F1 (macro) | 36.18% |
310
- | Precision (weighted) | 74.00% |
311
- | Recall (weighted) | 76.00% |
312
  | **Training time** | **0.13s** |
313
- | Train samples | 1,581 |
314
- | Test samples | 396 |
315
 
316
- ### 6.5 UTS2017_Bank Per-Category Results
317
 
318
- | Category | Precision | Recall | F1-Score | Support |
319
- |----------|-----------|--------|----------|---------|
320
- | ACCOUNT | 0.00 | 0.00 | 0.00 | 1 |
321
- | CARD | 0.36 | 0.31 | 0.33 | 13 |
322
- | **CUSTOMER_SUPPORT** | **0.73** | **0.93** | **0.82** | 155 |
323
- | DISCOUNT | 0.67 | 0.25 | 0.36 | 8 |
324
- | INTEREST_RATE | 0.40 | 0.33 | 0.36 | 12 |
325
- | INTERNET_BANKING | 0.80 | 0.29 | 0.42 | 14 |
326
- | LOAN | 0.73 | 0.73 | 0.73 | 15 |
327
- | MONEY_TRANSFER | 1.00 | 0.14 | 0.25 | 7 |
328
- | OTHER | 0.25 | 0.07 | 0.11 | 14 |
329
- | PAYMENT | 0.50 | 0.33 | 0.40 | 3 |
330
- | PROMOTION | 0.75 | 0.27 | 0.40 | 11 |
331
- | SAVING | 0.00 | 0.00 | 0.00 | 2 |
332
- | SECURITY | 0.00 | 0.00 | 0.00 | 1 |
333
- | **TRADEMARK** | **0.87** | **0.89** | **0.88** | 140 |
334
-
335
- **Best performing categories:** TRADEMARK (88% F1), CUSTOMER_SUPPORT (82% F1)
336
- **Zero-shot categories:** ACCOUNT, SAVING, SECURITY (insufficient training samples)
337
-
338
- **Analysis:** The low macro F1 (36.18%) vs high weighted F1 (72.70%) indicates severe class imbalance. The model performs well on majority classes but fails on minority classes with <10 training samples.
339
-
340
- ### 6.6 Comparison with sonar_core_1 and VNTC Paper
341
-
342
- #### Overall Comparison with sonar_core_1
343
-
344
- | Dataset | sonar_core_1 | Sen-1 | Difference |
345
- |---------|--------------|-------|------------|
346
- | VNTC (News) | 92.80% | 92.49% | -0.31% |
347
- | **UTS2017_Bank** | 72.47% | **75.76%** | **+3.29%** |
348
-
349
- Sen-1 outperforms sonar_core_1 on the banking dataset while using significantly less training time.
350
-
351
- #### VNTC Benchmark Results
352
 
353
- | Model | Accuracy | F1 (weighted) | Training Time | Source |
354
- |-------|----------|---------------|---------------|--------|
355
- | **N-gram** (Vu et al. 2007) | **97.1%** | - | - | RIVF 2007 |
356
- | SVM Multi (Vu et al. 2007) | 93.4% | - | - | RIVF 2007 |
357
- | **sonar_core_1** (SVC) | **92.80%** | 92.0% | ~54.6 min | HuggingFace |
358
- | **Sen-1 (Ours)** | 92.49% | 92.40% | **37.6s** | This report |
359
 
360
- #### UTS2017_Bank Benchmark Results
 
 
 
361
 
362
- | Model | Accuracy | F1 (weighted) | Training Time | Source |
363
- |-------|----------|---------------|---------------|--------|
364
- | **Sen-1 (Ours)** | **75.76%** | **72.70%** | **0.13s** | This report |
365
- | sonar_core_1 (SVC) | 72.47% | 66.0% | ~5.3s | HuggingFace |
366
 
367
- #### Comparison with sonar_core_1
 
 
 
 
368
 
369
- Sen-1 reproduces the sonar_core_1 architecture with identical hyperparameters:
370
 
371
- | Aspect | sonar_core_1 | Sen-1 |
372
- |--------|--------------|-------|
373
- | Vectorizer | CountVectorizer | CountVectorizer |
374
- | TF-IDF | TfidfTransformer | TfidfTransformer |
375
- | Classifier | SVC (kernel=linear) | LinearSVC |
376
- | max_features | 20,000 | 20,000 |
377
- | ngram_range | (1, 2) | (1, 2) |
378
- | Test Accuracy | 92.80% | 92.49% |
379
- | Training Time | ~54.6 min | 37.6s |
380
 
381
- **Performance Gap Analysis (-0.31%):**
382
- - sonar_core_1 uses SVC with `kernel='linear'` and `probability=True`
383
- - Sen-1 uses LinearSVC which is faster but slightly different optimization
384
- - Data source may differ (sonar_core_1 uses preprocessed data from underthesea releases)
385
 
386
- #### Analysis
387
 
388
- **Performance Gap:** Sen-1 achieves 92.51% accuracy compared to 97.1% (N-gram) and 93.4% (SVM Multi) reported by Vu et al. (2007). The 4.6% gap with N-gram and 0.9% gap with SVM Multi can be attributed to several factors:
389
 
390
- 1. **Preprocessing Differences**
391
- - Vu et al. (2007) likely used Vietnamese word segmentation
392
- - Sen-1 operates at syllable-level (no word segmentation)
393
- - Word-level features typically improve classification accuracy
 
394
 
395
- 2. **Feature Engineering**
396
- - N-gram approach in original paper used character/word n-grams with language modeling
397
- - Sen-1 uses TF-IDF with unigrams and bigrams only
398
- - Original SVM Multi may have used different kernel or feature selection
399
 
400
- 3. **Train/Test Split**
401
- - We use the exact VNTC Ver1.1 split (33,759 train / 50,373 test)
402
- - Original paper split details are not fully documented
 
 
 
 
403
 
404
- 4. **Implementation Details**
405
- - Sen-1 uses scikit-learn's LinearSVC with default squared hinge loss
406
- - Original implementation details are not publicly available
407
 
408
- #### Methodology Comparison
409
 
410
- | Aspect | Vu et al. (2007) | Sen-1 |
411
- |--------|------------------|-------|
412
- | Text Unit | Word-level (segmented) | Syllable-level |
413
- | Features | BOW, N-gram LM | TF-IDF (1,2)-grams |
414
- | Classifier | SVM (libsvm) | LinearSVC (sklearn) |
415
- | Vocabulary | Not specified | 10,000 features |
416
- | Preprocessing | Vietnamese tokenizer | None (raw text) |
417
 
418
- #### Key Insight
419
 
420
- The N-gram language modeling approach (97.1%) significantly outperforms bag-of-words methods. This suggests that:
421
- - **Sequential patterns matter** for Vietnamese text classification
422
- - **Word segmentation** likely contributes to the performance gap
423
- - Future versions of Sen should incorporate word segmentation (underthesea) to close this gap
424
 
425
- ### 6.5 Sample Predictions
 
 
 
 
426
 
427
- | Input | Predicted | Confidence |
428
- |-------|-----------|------------|
429
- | "Đội tuyển Việt Nam thắng đậm 3-0 trước Indonesia" | the_thao | 0.89 |
430
- | "Giá vàng tăng mạnh trong phiên giao dịch hôm nay" | kinh_doanh | 0.85 |
431
- | "Apple ra mắt iPhone mới với nhiều tính năng hấp dẫn" | vi_tinh | 0.82 |
432
- | "Bộ Y tế cảnh báo về dịch cúm mùa đông" | suc_khoe | 0.91 |
433
- | "Quốc hội thông qua nghị quyết phát triển kinh tế" | chinh_tri_xa_hoi | 0.78 |
434
 
435
- ### 6.7 Inference Speed Benchmark
 
 
 
436
 
437
- Comparison of inference speed between Sen-1 and Underthesea 9.2.8 (which uses sonar_core_1 internally):
438
 
439
- #### Benchmark Results
440
 
441
- | Model | Single Inference | Throughput |
442
- |-------|------------------|------------|
443
- | **Sen-1 1.0.0** | **0.465 ms** | **66,678 samples/sec** |
444
- | Underthesea 9.2.8 | 0.615 ms | 1,617 samples/sec |
 
 
 
445
 
446
- #### Speedup
447
 
448
- | Metric | Speedup |
449
- |--------|---------|
450
- | Single inference | **1.3x** faster |
451
- | Throughput (batch) | **41x** faster |
452
 
453
- #### Analysis
454
 
455
- 1. **Single Inference**: Sen-1 is 1.3x faster (0.465ms vs 0.615ms)
456
- - Both use similar architecture (sonar_core_1)
457
- - Difference due to API overhead in underthesea
 
 
458
 
459
- 2. **Throughput**: Sen-1 is 41x faster (66,678 vs 1,617 samples/sec)
460
- - Sen-1 supports **batch processing** (vectorize + predict entire batch)
461
- - Underthesea processes samples **sequentially** (loop)
462
- - Batch processing eliminates per-sample overhead
463
 
464
- 3. **Model Size**:
465
- - Sen-1: ~2.4 MB
466
- - Underthesea (sonar_core_1): ~75 MB (compressed)
467
 
468
- #### Benchmark Environment
 
 
 
469
 
470
- - Python: 3.10.19
471
- - scikit-learn: 1.7.2
472
- - underthesea: 9.2.8
473
- - underthesea-core: 3.1.6
474
- - OS: Ubuntu 20.04 LTS
475
 
476
  ---
477
 
@@ -480,53 +475,62 @@ Comparison of inference speed between Sen-1 and Underthesea 9.2.8 (which uses so
480
  ### 7.1 Installation
481
 
482
  ```bash
483
- pip install scikit-learn joblib huggingface_hub
484
  ```
485
 
486
  ### 7.2 Load Pre-trained Model
487
 
488
  ```python
489
- from huggingface_hub import snapshot_download
490
- from sen import SenTextClassifier, Sentence
491
 
492
- # Download model
493
- model_path = snapshot_download(
494
- 'undertheseanlp/sen-1',
495
- allow_patterns=['sen-general-1.0.0-20260202/*']
496
- )
497
-
498
- # Load
499
- classifier = SenTextClassifier.load(f'{model_path}/sen-general-1.0.0-20260202')
500
 
501
  # Predict
502
- sentence = Sentence("Đội tuyển Việt Nam thắng 3-0")
503
- classifier.predict(sentence)
504
- print(sentence.labels) # [the_thao (0.89)]
505
  ```
506
 
507
- ### 7.3 Train Custom Model
508
 
509
  ```python
510
- from sen import SenTextClassifier
511
 
512
- classifier = SenTextClassifier(
513
- max_features=10000,
514
- ngram_range=(1, 2),
515
- )
516
 
517
- classifier.train(train_texts, train_labels, val_texts, val_labels)
518
- classifier.save("./my-model")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
519
  ```
520
 
521
  ---
522
 
523
  ## 8. Limitations
524
 
525
- 1. **No word segmentation**: Does not use Vietnamese word segmentation (operates on syllable-level)
526
  2. **No pre-trained embeddings**: Uses TF-IDF instead of word vectors or contextual embeddings
527
  3. **Single-label only**: Does not support multi-label classification
528
- 4. **Domain-specific**: Trained on news articles, may not generalize to other domains (social media, reviews)
529
- 5. **Class imbalance sensitivity**: Lower performance on underrepresented categories (e.g., Lifestyle)
 
530
 
531
  ---
532
 
@@ -534,140 +538,126 @@ classifier.save("./my-model")
534
 
535
  - [x] ~~Train on full VNTC dataset (33,759 samples)~~ **Done**
536
  - [x] ~~Train on UTS2017_Bank dataset (1,977 samples)~~ **Done** (+3.29% vs sonar_core_1)
 
 
 
 
537
  - [ ] Add Vietnamese word segmentation (using underthesea)
538
  - [ ] Implement multi-label classification
539
  - [ ] Add PhoBERT-based variant (sen-2)
540
  - [ ] Benchmark on additional datasets (UIT-VSMEC, UIT-VSFC)
541
- - [ ] Add error analysis and confusion matrix visualization
542
- - [ ] Address class imbalance in UTS2017_Bank (oversampling, class weights)
543
 
544
  ---
545
 
546
  ## 10. Conclusion
547
 
548
- Sen-1 successfully reproduces the sonar_core_1 architecture and achieves competitive results on two Vietnamese text classification benchmarks:
549
 
550
- | Dataset | Accuracy | vs sonar_core_1 |
551
- |---------|----------|-----------------|
552
- | VNTC (News) | 92.49% | -0.31% |
553
- | UTS2017_Bank | **75.76%** | **+3.29%** |
 
 
 
554
 
555
  Key achievements:
556
 
557
- - **Fast training**: 37.6s for VNTC (vs 54.6 min for sonar_core_1 SVC), 0.13s for UTS2017_Bank
558
- - **Better banking accuracy**: Outperforms sonar_core_1 by 3.29% on UTS2017_Bank
559
- - **Small footprint**: Lightweight models (~2-3 MB each) suitable for deployment
560
- - **Multi-domain**: Supports both news and banking text classification
561
-
562
- While deep learning approaches (PhoBERT, etc.) may achieve higher accuracy, Sen-1 serves as a strong baseline and practical solution for resource-constrained environments.
563
 
564
  ---
565
 
566
  ## References
567
 
568
- 1. Vu, C. D. H., Dien, D., Nguyen, L. N., & Ngo, Q. H. (2007). **A Comparative Study on Vietnamese Text Classification Methods**. IEEE International Conference on Research, Innovation and Vision for the Future (RIVF), 267-273. https://ieeexplore.ieee.org/document/4223084/
569
 
570
  2. duyvuleo. (2007). **VNTC: A Large-scale Vietnamese News Text Classification Corpus**. GitHub. https://github.com/duyvuleo/VNTC
571
 
572
- 3. Pedregosa, F., et al. (2011). **Scikit-learn: Machine Learning in Python**. Journal of Machine Learning Research, 12, 2825-2830.
573
 
574
- 4. UnderTheSea NLP. (2017). **Underthesea: Vietnamese NLP Toolkit**. GitHub. https://github.com/undertheseanlp/underthesea
575
 
576
  5. Nguyen, D. Q., & Nguyen, A. T. (2020). **PhoBERT: Pre-trained language models for Vietnamese**. Findings of EMNLP 2020. https://aclanthology.org/2020.findings-emnlp.92/
577
 
578
- ---
579
 
580
- ## Appendix A: Category Labels
581
-
582
- ### VNTC (News) - 10 Categories
583
-
584
- | ID | Label | Vietnamese | English |
585
- |----|-------|------------|---------|
586
- | 0 | Chinh tri Xa hoi | Chính trị Xã hội | Politics/Society |
587
- | 1 | Doi song | Đời sống | Lifestyle |
588
- | 2 | Khoa hoc | Khoa học | Science |
589
- | 3 | Kinh doanh | Kinh doanh | Business |
590
- | 4 | Phap luat | Pháp luật | Law |
591
- | 5 | Suc khoe | Sức khỏe | Health |
592
- | 6 | The gioi | Thế giới | World |
593
- | 7 | The thao | Thể thao | Sports |
594
- | 8 | Van hoa | Văn hóa | Culture |
595
- | 9 | Vi tinh | Vi tính | Technology |
596
-
597
- ### UTS2017_Bank (Banking) - 14 Categories
598
-
599
- | ID | Label | English | Train Samples |
600
- |----|-------|---------|---------------|
601
- | 0 | ACCOUNT | Account services | 4 |
602
- | 1 | CARD | Card services | 53 |
603
- | 2 | CUSTOMER_SUPPORT | Customer support | 619 |
604
- | 3 | DISCOUNT | Discounts | 32 |
605
- | 4 | INTEREST_RATE | Interest rates | 46 |
606
- | 5 | INTERNET_BANKING | Internet banking | 55 |
607
- | 6 | LOAN | Loan services | 58 |
608
- | 7 | MONEY_TRANSFER | Money transfer | 30 |
609
- | 8 | OTHER | Other | 56 |
610
- | 9 | PAYMENT | Payment services | 14 |
611
- | 10 | PROMOTION | Promotions | 45 |
612
- | 11 | SAVING | Savings | 10 |
613
- | 12 | SECURITY | Security | 2 |
614
- | 13 | TRADEMARK | Trademark/Brand | 557 |
615
 
616
  ---
617
 
618
- ## Appendix B: Model Card
619
 
620
- ### sen-general-1.0.0-20260202 (News Classification)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
621
 
622
  | Field | Value |
623
  |-------|-------|
624
- | Model Name | sen-general-1.0.0-20260202 |
625
- | Architecture | CountVectorizer + TfidfTransformer + LinearSVC |
626
- | Base Model | sonar_core_1 reproduction |
627
  | Language | Vietnamese |
628
  | License | Apache 2.0 |
629
  | Repository | https://huggingface.co/undertheseanlp/sen-1 |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
630
  | Training Data | VNTC (33,759 samples) |
631
- | Test Data | VNTC (50,373 samples) |
632
  | Categories | 10 |
633
  | max_features | 20,000 |
634
  | ngram_range | (1, 2) |
635
  | Accuracy | 92.49% |
636
- | F1 (weighted) | 92.40% |
637
- | Training Time | 37.6s |
638
 
639
- ### sen-bank-1.0.0-20260202 (Banking Classification)
640
 
641
  | Field | Value |
642
  |-------|-------|
643
- | Model Name | sen-bank-1.0.0-20260202 |
644
- | Architecture | CountVectorizer + TfidfTransformer + LinearSVC |
645
- | Base Model | sonar_core_1 reproduction |
646
  | Language | Vietnamese |
647
- | License | Apache 2.0 |
648
- | Repository | https://huggingface.co/undertheseanlp/sen-1 |
649
- | Training Data | UTS2017_Bank (1,581 samples) |
650
- | Test Data | UTS2017_Bank (396 samples) |
651
  | Categories | 14 |
652
- | max_features | 20,000 |
653
  | ngram_range | (1, 2) |
654
  | Accuracy | 75.76% |
655
- | F1 (weighted) | 72.70% |
656
- | Training Time | 0.13s |
657
-
658
- ---
659
-
660
- ## Appendix C: Confusion Matrix Analysis
661
-
662
- Categories with highest confusion:
663
- - **Lifestyle (doi_song)** often confused with Culture (van_hoa) and Health (suc_khoe)
664
- - **Politics (chinh_tri_xa_hoi)** sometimes confused with World (the_gioi) and Law (phap_luat)
665
-
666
- Categories with clearest separation:
667
- - **Sports (the_thao)**: Very distinctive vocabulary (team names, scores, competitions)
668
- - **Technology (vi_tinh)**: Distinctive technical terms (software, hardware brands)
669
 
670
  ---
671
 
672
- *Report generated: February 2, 2026*
673
  *UnderTheSea NLP - https://github.com/undertheseanlp*
 
1
  # Sen-1: Vietnamese Text Classification Model
2
 
3
+ **Technical Report v1.2.0**
4
 
5
  Authors: UnderTheSea NLP
6
+ Date: February 6, 2026
7
  Model: `undertheseanlp/sen-1`
8
 
9
  ---
10
 
11
  ## Abstract
12
 
13
+ Sen-1 is a Vietnamese text classification model based on TF-IDF vectorization combined with Linear SVM, implemented entirely in Rust via `underthesea_core` for fast training and inference. This report describes the methodology, implementation, and evaluation on four benchmark tasks:
14
 
15
  - **VNTC (News)**: 92.49% accuracy on 10-topic news classification
16
  - **UTS2017_Bank (Banking)**: 75.76% accuracy on 14-category banking text classification
17
+ - **Sentiment General**: 92.11% (UTS2017_Bank) / 70.86% (VLSP2016) on 3-class sentiment
18
+ - **Sentiment Bank**: 70.65% accuracy on 36-class aspect-sentiment classification
19
 
20
+ The sentiment models include a Vietnamese-specific preprocessing pipeline (teencode expansion, negation marking, character normalization) that yields +4.1% improvement over the previous flair-based SVM on VLSP2016, while removing the scikit-learn dependency from the inference path.
21
 
22
  ---
23
 
 
27
 
28
  - **Word segmentation**: Vietnamese words can consist of multiple syllables
29
  - **Diacritics**: Vietnamese uses Latin script with additional diacritical marks
30
+ - **Informal text**: Social media text contains extensive teencode and abbreviations
31
  - **Limited resources**: Fewer labeled datasets compared to English
32
 
33
+ Sen-1 addresses these challenges by implementing a robust TF-IDF + SVM pipeline with Vietnamese-specific preprocessing, operating at syllable-level for speed while achieving competitive performance with word-level approaches.
34
 
35
  ---
36
 
 
44
  - **Baseline methods**: Bag-of-Words (BOW), N-gram, and SVM approaches
45
  - **Benchmark results**: Achieving >95% accuracy on 10-topic classification
46
 
47
+ ### 2.2 VLSP2016 Sentiment Analysis Shared Task
48
+
49
+ The VLSP 2016 Sentiment Analysis shared task was the first Vietnamese sentiment analysis campaign, focusing on polarity classification of electronic product reviews into 3 classes (positive, negative, neutral). Top results from the shared task:
50
+
51
+ | System | Approach | F1 |
52
+ |--------|----------|-----|
53
+ | Pham et al. | Perceptron / SVM / MaxEnt ensemble | **80.05** |
54
+ | Nguyen et al. | SVM / MLNN / LSTM ensemble | 71.44 |
55
+ | Pham et al. | Random Forest + SVM + Naive Bayes | 71.22 |
56
+ | Ngo et al. | SVM | 67.54 |
57
+
58
+ All top systems used word segmentation. However, recent research (Arxiv 2301.00418) demonstrates that for traditional classifiers like SVM, word segmentation may not be necessary for Vietnamese sentiment classification on social domain text.
59
+
60
+ ### 2.3 Traditional ML vs Deep Learning
61
 
62
  | Approach | Pros | Cons |
63
  |----------|------|------|
 
72
 
73
  ### 3.1 Architecture Overview
74
 
75
+ Sen-1 uses a 3-stage pipeline implemented in Rust via `underthesea_core`:
76
 
77
  ```
78
+ ┌──────────────────────────────────────────────────────────┐
79
+ Sen-1 Pipeline │
80
+ ├──────────────────────────────────────────────────────────┤
81
+ │ Input Text │
82
+
83
+ ┌──────────────────────────────────────────────────┐
84
+ │ [Optional] Sentiment Preprocessing │
85
+ │ │ - Lowercase + Unicode NFC │ │
86
+ │ │ - Teencode expansion │ │
87
+ │ │ - Negation marking (2-word window) │ │
88
+ │ - Repeated character normalization │
89
+ └──────────────────────────────────────────────────┘
90
+
91
+ ┌──────────────────────────────────────────────────┐
92
+ │ │ TF-IDF Vectorizer (Rust) │ │
93
+ │ - max_features: 20k-200k │
94
+ - ngram_range: (1,2) or (1,3) │ │
95
+ │ - max_df: 0.8-1.0 │
96
+ └──────────────────────────────────────────────────┘
97
+
98
+ ┌──────────────────────────────────────────────────┐
99
+ │ Linear SVM Classifier (Rust) │
100
+ - C: 0.7-1.0 │ │
101
+ │ - max_iter: 1000
102
+ │ └──────────────────────────────────────────────────┘ │
103
+ │ ↓ │
104
+ │ Output: Predicted Label │
105
+ └──────────────────────────────────────────────────────────┘
106
  ```
107
 
108
  ### 3.2 TF-IDF Vectorization
 
116
  - $\text{IDF}(t) = \log\frac{N}{|\{d \in D : t \in d\}|}$
117
  - $N$ = total number of documents
118
 
119
+ **Hyperparameters vary by task:**
120
 
121
+ | Parameter | Classification | Sentiment |
122
+ |-----------|---------------|-----------|
123
+ | `max_features` | 20,000 | 200,000 |
124
+ | `ngram_range` | (1, 2) | (1, 3) |
125
+ | `max_df` | 1.0 | 0.9 |
126
+ | `min_df` | 2 | 1 |
127
 
128
  ### 3.3 Support Vector Machine
129
 
 
131
 
132
  $$\min_{w, b} \frac{1}{2}||w||^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i(w^T x_i + b))$$
133
 
134
+ ### 3.4 Sentiment Preprocessing Pipeline
135
+
136
+ For sentiment models, a Vietnamese-specific preprocessing pipeline is applied before TF-IDF vectorization:
137
+
138
+ **Step 1: Text Normalization**
139
+ - Unicode NFC normalization (standardizes diacritics)
140
+ - Lowercase conversion
141
+ - URL removal
142
+ - Repeated character collapse: `quáááá` -> `quáá`
143
+ - Punctuation normalization: `!!!` -> `!`, `????` -> `?`
144
+
145
+ **Step 2: Teencode Expansion**
146
+
147
+ Vietnamese social media text contains extensive abbreviations. We expand 25+ common teencode mappings:
148
+
149
+ | Teencode | Standard | Meaning |
150
+ |----------|----------|---------|
151
+ | ko, k, hok, hem | không | not/no |
152
+ | dc, đc, dk | được | can/ok |
153
+ | cx, cg | cũng | also |
154
+ | bt, bth | bình thường | normal |
155
+ | sp | sản phẩm | product |
156
+ | j | gì | what |
157
+ | z, v | vậy | so |
158
+ | tks, thanks | cảm ơn | thanks |
159
+ | ... | ... | ... |
160
+
161
+ **Step 3: Negation Marking**
162
+
163
+ Negation words (`không`, `chẳng`, `chả`, `chưa`, `đừng`) modify the sentiment of following words. We mark the next 2 words with a `NEG_` prefix:
164
+
165
+ ```
166
+ "không tốt lắm" -> "không NEG_tốt NEG_lắm"
167
+ ```
168
+
169
+ This allows the TF-IDF features to distinguish "tốt" (good) from "NEG_tốt" (not good).
170
 
171
+ **Impact of preprocessing (VLSP2016):**
 
 
 
 
172
 
173
+ | Preprocessing Step | Accuracy | Delta |
174
+ |-------------------|----------|-------|
175
+ | None (baseline) | 64.76% | - |
176
+ | + Lowercase | 67.62% | +2.86% |
177
+ | + Repeated char normalization | 68.29% | +0.67% |
178
+ | + Teencode expansion | +1.14% | +1.14% |
179
+ | + Negation marking | +1.24% | +1.24% |
180
+ | **All combined** | **70.86%** | **+6.10%** |
181
+
182
+ ### 3.5 Confidence Score
183
 
184
  Confidence scores are computed from the SVM decision function using sigmoid transformation:
185
 
 
189
 
190
  ---
191
 
192
+ ## 4. Datasets
193
 
194
  ### 4.1 VNTC Dataset
195
 
 
215
 
216
  ### 4.2 UTS2017_Bank Dataset
217
 
218
+ The UTS2017_Bank dataset is a Vietnamese banking domain text classification corpus with two configurations:
219
+
220
+ **Classification (14 Categories):**
221
+
222
+ | Category | English | Train | Test |
223
+ |----------|---------|-------|------|
224
+ | CUSTOMER_SUPPORT | Customer support | 619 | 155 |
225
+ | TRADEMARK | Trademark/Brand | 557 | 140 |
226
+ | LOAN | Loan services | 58 | 15 |
227
+ | INTERNET_BANKING | Internet banking | 55 | 14 |
228
+ | CARD | Card services | 53 | 13 |
229
+ | ... | ... | ... | ... |
230
+ | **Total** | | **1,977** | **494** |
231
+
232
+ **Sentiment (3 Classes):**
233
+
234
+ | Label | Train | Test |
235
+ |-------|-------|------|
236
+ | negative | 1,189 | 301 |
237
+ | positive | 765 | 185 |
238
+ | neutral | 23 | 8 |
239
+ | **Total** | **1,977** | **494** |
240
+
241
+ **Combined (36 Aspect-Sentiment Labels):** Merging classification + sentiment configs produces labels like `CUSTOMER_SUPPORT#negative`, `CARD#positive`, etc.
242
 
243
  **Source:** https://huggingface.co/datasets/undertheseanlp/UTS2017_Bank
244
 
245
+ ### 4.3 VLSP2016 Sentiment Analysis Dataset
246
+
247
+ The VLSP 2016 Sentiment Analysis dataset contains electronic product reviews labeled for sentiment:
248
+
249
+ | Split | POS | NEG | NEU | Total |
250
+ |-------|-----|-----|-----|-------|
251
+ | Train | 1,700 | 1,700 | 1,700 | 5,100 |
252
+ | Test | 350 | 350 | 350 | 1,050 |
253
+
254
+ The dataset is perfectly balanced across all three sentiment classes.
255
+
256
+ **Source:** VLSP 2016 Shared Task (https://vlsp.org.vn/vlsp2016/eval/sa)
257
 
258
  ---
259
 
 
261
 
262
  ### 5.1 Dependencies
263
 
264
+ **Training:**
265
+ ```
266
+ underthesea_core>=3.1.7 # Rust TF-IDF + SVM backend
267
+ scikit-learn>=1.0.0 # Metrics only (accuracy, F1, classification_report)
268
+ click>=8.0.0 # CLI
269
+ datasets>=2.0.0 # HuggingFace dataset loading
270
+ ```
271
+
272
+ **Inference (underthesea pipeline):**
273
  ```
274
+ underthesea_core>=3.1.7 # Only dependency (no sklearn needed)
 
 
275
  ```
276
 
277
+ ### 5.2 Rust Backend
278
 
279
+ All vectorization and classification is performed by `underthesea_core.TextClassifier`, a Rust implementation exposed via PyO3:
280
 
281
  ```python
282
+ from underthesea_core import TextClassifier
283
+
284
+ # Constructor parameters
285
+ clf = TextClassifier(
286
+ max_features=200000, # Maximum vocabulary size
287
+ ngram_range=(1, 3), # N-gram range
288
+ min_df=1, # Minimum document frequency
289
+ max_df=0.9, # Maximum document frequency
290
+ c=0.7, # SVM regularization parameter
291
+ max_iter=1000, # Maximum iterations
292
+ tol=0.0001, # Convergence tolerance
293
+ )
294
+
295
+ # Training and inference
296
+ clf.fit(texts, labels)
297
+ label = clf.predict(text)
298
+ labels = clf.predict_batch(texts)
299
+ clf.save("model.bin")
300
+ clf = TextClassifier.load("model.bin")
301
  ```
302
 
303
  ### 5.3 Model Files
 
305
  ```
306
  undertheseanlp/sen-1/
307
  └── models/
308
+ ├── sen-general-1.0.0-20260203.bin # News classification (VNTC)
309
+ ├── sen-bank-1.0.0-20260203.bin # Banking classification (UTS2017)
310
+ ├── sen-sentiment-general-1.0.0-20260206.bin # Sentiment (VLSP2016+UTS2017)
311
+ └── sen-sentiment-bank-1.0.0-20260206.bin # Aspect-sentiment (UTS2017)
 
 
 
 
 
312
  ```
313
 
314
+ All models are serialized in Rust binary format (bincode).
 
 
 
 
 
 
 
 
 
 
 
315
 
316
  ---
317
 
318
  ## 6. Experiments
319
 
320
+ ### 6.1 VNTC Benchmark Results
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
321
 
322
+ **Configuration:** max_features=20000, ngram=(1,2), min_df=2, C=1.0
323
 
324
  | Metric | Value |
325
  |--------|-------|
326
  | **Accuracy** | **92.49%** |
327
  | **F1 (weighted)** | **92.40%** |
328
  | F1 (macro) | 90.44% |
 
 
329
  | **Training time** | **37.6s** |
 
330
 
331
+ **Per-Category Results:**
332
 
333
  | Category | Precision | Recall | F1-Score | Support |
334
  |----------|-----------|--------|----------|---------|
 
343
  | Van hoa | 0.93 | 0.96 | 0.94 | 6,250 |
344
  | Vi tinh | 0.94 | 0.96 | 0.95 | 4,560 |
345
 
346
+ ### 6.2 UTS2017_Bank Classification Results
 
 
 
347
 
348
+ **Configuration:** max_features=10000, ngram=(1,2), min_df=1, C=1.0
349
 
350
  | Metric | Value |
351
  |--------|-------|
352
  | **Accuracy** | **75.76%** |
353
  | **F1 (weighted)** | **72.70%** |
354
  | F1 (macro) | 36.18% |
 
 
355
  | **Training time** | **0.13s** |
 
 
356
 
357
+ ### 6.3 Sentiment General Results
358
 
359
+ **Configuration:** max_features=200000, ngram=(1,3), max_df=0.9, C=0.7, with preprocessing
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
360
 
361
+ **Training data:** UTS2017_Bank sentiment (1,977) + VLSP2016 (5,100) = 7,077 samples
 
 
 
 
 
362
 
363
+ | Test Set | Accuracy | F1 (weighted) | F1 (macro) |
364
+ |----------|----------|---------------|------------|
365
+ | **UTS2017_Bank** | **92.11%** | **0.9163** | 0.6196 |
366
+ | **VLSP2016** | **70.86%** | **0.7081** | 0.7081 |
367
 
368
+ **Per-Class Results (UTS2017_Bank):**
 
 
 
369
 
370
+ | Class | Precision | Recall | F1-Score | Support |
371
+ |-------|-----------|--------|----------|---------|
372
+ | negative | 0.93 | 0.95 | 0.94 | 301 |
373
+ | neutral | 0.00 | 0.00 | 0.00 | 8 |
374
+ | positive | 0.93 | 0.91 | 0.92 | 185 |
375
 
376
+ **Per-Class Results (VLSP2016):**
377
 
378
+ | Class | Precision | Recall | F1-Score | Support |
379
+ |-------|-----------|--------|----------|---------|
380
+ | negative | 0.68 | 0.74 | 0.71 | 350 |
381
+ | neutral | 0.69 | 0.64 | 0.66 | 350 |
382
+ | positive | 0.76 | 0.75 | 0.75 | 350 |
 
 
 
 
383
 
384
+ ### 6.4 Sentiment Bank Results
 
 
 
385
 
386
+ **Configuration:** max_features=10000, ngram=(1,2), max_df=1.0, C=1.0, with preprocessing
387
 
388
+ **Training data:** UTS2017_Bank classification + sentiment merged (1,977 samples, 36 labels)
389
 
390
+ | Metric | Value |
391
+ |--------|-------|
392
+ | **Accuracy** | **70.65%** |
393
+ | F1 (weighted) | 0.6693 |
394
+ | F1 (macro) | 0.2153 |
395
 
396
+ **Top-Performing Categories:**
 
 
 
397
 
398
+ | Category | Precision | Recall | F1-Score | Support |
399
+ |----------|-----------|--------|----------|---------|
400
+ | LOAN#negative | 0.60 | 1.00 | 0.75 | 3 |
401
+ | CUSTOMER_SUPPORT#negative | 0.75 | 0.93 | 0.83 | 214 |
402
+ | CUSTOMER_SUPPORT#positive | 0.80 | 0.82 | 0.81 | 122 |
403
+ | MONEY_TRANSFER#negative | 1.00 | 0.50 | 0.67 | 2 |
404
+ | TRADEMARK#positive | 0.54 | 0.63 | 0.58 | 35 |
405
 
406
+ ### 6.5 Comparison with Previous Models
 
 
407
 
408
+ #### Sentiment General
409
 
410
+ | Model | Architecture | VLSP2016 | UTS2017_Bank |
411
+ |-------|-------------|----------|-------------|
412
+ | SA_GENERAL_V131 (old) | flair SVM + word segmentation | 69.14% | 47.17% |
413
+ | **Sen-1 (new)** | **underthesea_core + preprocessing** | **70.86%** | **92.11%** |
414
+ | Delta | | **+1.72%** | **+44.94%** |
 
 
415
 
416
+ The old model was trained only on VLSP2016 and could not predict the "neutral" class, resulting in poor generalization to UTS2017_Bank. The new model is trained on both datasets and includes preprocessing.
417
 
418
+ #### Sentiment Bank
 
 
 
419
 
420
+ | Model | Architecture | UTS2017_Bank |
421
+ |-------|-------------|-------------|
422
+ | pulse_core_1 (old) | sklearn Pipeline + joblib | 69.03% |
423
+ | **Sen-1 (new)** | **underthesea_core + preprocessing** | **70.65%** |
424
+ | Delta | | **+1.62%** |
425
 
426
+ #### Classification
 
 
 
 
 
 
427
 
428
+ | Dataset | sonar_core_1 | Sen-1 | Difference |
429
+ |---------|--------------|-------|------------|
430
+ | VNTC (News) | 92.80% | 92.49% | -0.31% |
431
+ | **UTS2017_Bank** | 72.47% | **75.76%** | **+3.29%** |
432
 
433
+ ### 6.6 Hyperparameter Sensitivity (VLSP2016)
434
 
435
+ Key findings from hyperparameter search on VLSP2016:
436
 
437
+ | Factor | Finding |
438
+ |--------|---------|
439
+ | **max_features** | 200k >> 20k (+3% accuracy); larger vocabulary captures more discriminative patterns |
440
+ | **ngram_range** | (1,3) slightly better than (1,2) with large vocabulary |
441
+ | **max_df** | 0.8-0.9 helps filter very common terms that add noise |
442
+ | **C** | 0.7 optimal; lower C (more regularization) prevents overfitting on small datasets |
443
+ | **Preprocessing** | Most impactful factor: +6.1% total (lowercase +2.9%, teencode +1.1%, negation +1.2%) |
444
 
445
+ ### 6.7 Error Analysis (VLSP2016)
446
 
447
+ **Confusion patterns:**
448
+ - NEU (neutral) is the most confused class, acting as an "attractor" for both POS and NEG
449
+ - NEU<->NEG confusion accounts for 38% of all errors
450
+ - No single error pattern (text length, teencode, negation) dominates
451
 
452
+ **Confidence calibration:**
453
 
454
+ | Confidence | Samples | Accuracy |
455
+ |------------|---------|----------|
456
+ | >= 0.7 | 129 | 94.0% |
457
+ | >= 0.6 | 365 | 84.4% |
458
+ | < 0.5 | 224 | 45.5% |
459
 
460
+ Predictions with confidence >= 0.7 are 94% accurate, suggesting confidence thresholds can be effective for production use.
 
 
 
461
 
462
+ ### 6.8 Inference Speed Benchmark
 
 
463
 
464
+ | Model | Single Inference | Throughput |
465
+ |-------|------------------|------------|
466
+ | **Sen-1 1.0.0** | **0.465 ms** | **66,678 samples/sec** |
467
+ | Underthesea 9.2.8 | 0.615 ms | 1,617 samples/sec |
468
 
469
+ Sen-1 achieves **41x** faster throughput via batch processing and the Rust backend.
 
 
 
 
470
 
471
  ---
472
 
 
475
  ### 7.1 Installation
476
 
477
  ```bash
478
+ pip install underthesea_core
479
  ```
480
 
481
  ### 7.2 Load Pre-trained Model
482
 
483
  ```python
484
+ from underthesea_core import TextClassifier
 
485
 
486
+ # Load model
487
+ clf = TextClassifier.load("models/sen-sentiment-general-1.0.0-20260206.bin")
 
 
 
 
 
 
488
 
489
  # Predict
490
+ label = clf.predict("Sản phẩm rất tốt") # "positive"
 
 
491
  ```
492
 
493
+ ### 7.3 With underthesea API
494
 
495
  ```python
496
+ from underthesea import sentiment
497
 
498
+ # General sentiment
499
+ sentiment("Sản phẩm rất tốt") # "positive"
500
+ sentiment("hàng kém chất lg") # "negative"
501
+ sentiment.labels # ['positive', 'negative', 'neutral']
502
 
503
+ # Bank aspect-sentiment
504
+ sentiment("nhân viên hỗ trợ quá lâu", domain="bank") # ['CUSTOMER_SUPPORT#negative']
505
+ sentiment.bank.labels # ['CARD#negative', 'CARD#positive', ...]
506
+ ```
507
+
508
+ ### 7.4 Train Custom Model
509
+
510
+ ```bash
511
+ # Train sentiment-general (with VLSP2016 data)
512
+ python src/train.py sentiment-general --vlsp2016-dir /path/to/VLSP2016_SA
513
+
514
+ # Train sentiment-bank
515
+ python src/train.py sentiment-bank
516
+
517
+ # Train news classifier
518
+ python src/train.py vntc --data-dir /path/to/VNTC
519
+
520
+ # Train banking classifier
521
+ python src/train.py bank
522
  ```
523
 
524
  ---
525
 
526
  ## 8. Limitations
527
 
528
+ 1. **No word segmentation**: Operates at syllable-level (~4.6% gap vs word-level on VNTC)
529
  2. **No pre-trained embeddings**: Uses TF-IDF instead of word vectors or contextual embeddings
530
  3. **Single-label only**: Does not support multi-label classification
531
+ 4. **Neutral class weakness**: NEU class has lowest precision in sentiment tasks due to inherent ambiguity
532
+ 5. **Class imbalance sensitivity**: Lower performance on underrepresented categories
533
+ 6. **Preprocessing dependency**: Sentiment models require `preprocess_sentiment()` at inference time (preprocessing must match training)
534
 
535
  ---
536
 
 
538
 
539
  - [x] ~~Train on full VNTC dataset (33,759 samples)~~ **Done**
540
  - [x] ~~Train on UTS2017_Bank dataset (1,977 samples)~~ **Done** (+3.29% vs sonar_core_1)
541
+ - [x] ~~Sentiment general model (VLSP2016 + UTS2017)~~ **Done** (+1.72% vs old flair SVM)
542
+ - [x] ~~Sentiment bank model (aspect-sentiment)~~ **Done** (+1.62% vs old sklearn)
543
+ - [x] ~~Remove sklearn from inference path~~ **Done** (pure Rust via underthesea_core)
544
+ - [x] ~~Vietnamese preprocessing pipeline~~ **Done** (teencode, negation, normalization)
545
  - [ ] Add Vietnamese word segmentation (using underthesea)
546
  - [ ] Implement multi-label classification
547
  - [ ] Add PhoBERT-based variant (sen-2)
548
  - [ ] Benchmark on additional datasets (UIT-VSMEC, UIT-VSFC)
549
+ - [ ] CHI-square feature selection for further improvement
550
+ - [ ] Ensemble methods (SVM + Perceptron + MaxEnt)
551
 
552
  ---
553
 
554
  ## 10. Conclusion
555
 
556
+ Sen-1 provides a suite of Vietnamese text classification and sentiment analysis models, all running on a pure Rust backend for fast inference:
557
 
558
+ | Task | Model | Accuracy | vs Previous |
559
+ |------|-------|----------|-------------|
560
+ | News Classification | sen-general | 92.49% | -0.31% vs sonar_core_1 |
561
+ | Banking Classification | sen-bank | 75.76% | +3.29% vs sonar_core_1 |
562
+ | Sentiment General (UTS2017) | sen-sentiment-general | 92.11% | +44.94% vs old flair |
563
+ | Sentiment General (VLSP2016) | sen-sentiment-general | 70.86% | +1.72% vs old flair |
564
+ | Sentiment Bank | sen-sentiment-bank | 70.65% | +1.62% vs old sklearn |
565
 
566
  Key achievements:
567
 
568
+ - **Fast inference**: 66,678 samples/sec batch throughput (41x vs underthesea 9.2.8)
569
+ - **No sklearn dependency**: Pure Rust inference via underthesea_core
570
+ - **Vietnamese preprocessing**: Teencode expansion + negation marking yields +6.1% on VLSP2016
571
+ - **Multi-domain sentiment**: Single model handles both product reviews and banking text
572
+ - **Small footprint**: Models range from 1.6 MB to 8 MB
 
573
 
574
  ---
575
 
576
  ## References
577
 
578
+ 1. Vu, C. D. H., Dien, D., Nguyen, L. N., & Ngo, Q. H. (2007). **A Comparative Study on Vietnamese Text Classification Methods**. IEEE RIVF 2007, 267-273.
579
 
580
  2. duyvuleo. (2007). **VNTC: A Large-scale Vietnamese News Text Classification Corpus**. GitHub. https://github.com/duyvuleo/VNTC
581
 
582
+ 3. VLSP. (2016). **VLSP 2016 Shared Task: Sentiment Analysis**. https://vlsp.org.vn/vlsp2016/eval/sa
583
 
584
+ 4. Nguyen, L. T., et al. (2023). **Is Word Segmentation Necessary for Vietnamese Sentiment Classification?** arXiv:2301.00418. https://arxiv.org/abs/2301.00418
585
 
586
  5. Nguyen, D. Q., & Nguyen, A. T. (2020). **PhoBERT: Pre-trained language models for Vietnamese**. Findings of EMNLP 2020. https://aclanthology.org/2020.findings-emnlp.92/
587
 
588
+ 6. Pedregosa, F., et al. (2011). **Scikit-learn: Machine Learning in Python**. JMLR, 12, 2825-2830.
589
 
590
+ 7. UnderTheSea NLP. (2017). **Underthesea: Vietnamese NLP Toolkit**. https://github.com/undertheseanlp/underthesea
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
591
 
592
  ---
593
 
594
+ ## Appendix A: Model Cards
595
 
596
+ ### sen-sentiment-general-1.0.0-20260206
597
+
598
+ | Field | Value |
599
+ |-------|-------|
600
+ | Model Name | sen-sentiment-general-1.0.0-20260206 |
601
+ | Architecture | TF-IDF + Linear SVM (Rust/underthesea_core) |
602
+ | Language | Vietnamese |
603
+ | License | Apache 2.0 |
604
+ | Repository | https://huggingface.co/undertheseanlp/sen-1 |
605
+ | Training Data | VLSP2016 (5,100) + UTS2017_Bank sentiment (1,977) = 7,077 |
606
+ | Labels | positive, negative, neutral |
607
+ | Preprocessing | preprocess_sentiment() required |
608
+ | max_features | 200,000 |
609
+ | ngram_range | (1, 3) |
610
+ | max_df | 0.9 |
611
+ | C | 0.7 |
612
+ | Accuracy (UTS2017) | 92.11% |
613
+ | Accuracy (VLSP2016) | 70.86% |
614
+ | Model Size | 7.95 MB |
615
+
616
+ ### sen-sentiment-bank-1.0.0-20260206
617
 
618
  | Field | Value |
619
  |-------|-------|
620
+ | Model Name | sen-sentiment-bank-1.0.0-20260206 |
621
+ | Architecture | TF-IDF + Linear SVM (Rust/underthesea_core) |
 
622
  | Language | Vietnamese |
623
  | License | Apache 2.0 |
624
  | Repository | https://huggingface.co/undertheseanlp/sen-1 |
625
+ | Training Data | UTS2017_Bank merged (1,977 samples) |
626
+ | Labels | 36 (e.g., CUSTOMER_SUPPORT#negative, CARD#positive) |
627
+ | Preprocessing | preprocess_sentiment() required |
628
+ | max_features | 10,000 |
629
+ | ngram_range | (1, 2) |
630
+ | C | 1.0 |
631
+ | Accuracy | 70.65% |
632
+ | Model Size | 1.61 MB |
633
+
634
+ ### sen-general-1.0.0-20260203 (News Classification)
635
+
636
+ | Field | Value |
637
+ |-------|-------|
638
+ | Model Name | sen-general-1.0.0-20260203 |
639
+ | Architecture | TF-IDF + Linear SVM (Rust/underthesea_core) |
640
+ | Language | Vietnamese |
641
  | Training Data | VNTC (33,759 samples) |
 
642
  | Categories | 10 |
643
  | max_features | 20,000 |
644
  | ngram_range | (1, 2) |
645
  | Accuracy | 92.49% |
 
 
646
 
647
+ ### sen-bank-1.0.0-20260203 (Banking Classification)
648
 
649
  | Field | Value |
650
  |-------|-------|
651
+ | Model Name | sen-bank-1.0.0-20260203 |
652
+ | Architecture | TF-IDF + Linear SVM (Rust/underthesea_core) |
 
653
  | Language | Vietnamese |
654
+ | Training Data | UTS2017_Bank (1,977 samples) |
 
 
 
655
  | Categories | 14 |
656
+ | max_features | 10,000 |
657
  | ngram_range | (1, 2) |
658
  | Accuracy | 75.76% |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
659
 
660
  ---
661
 
662
+ *Report generated: February 6, 2026*
663
  *UnderTheSea NLP - https://github.com/undertheseanlp*
models/sen-sentiment-bank-1.0.0-20260206.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:15f92315a43b3d131322402bc2f44b3bd5ee2ec58584a9b7ec3eec596d3eab8b
3
+ size 1693351
models/sen-sentiment-general-1.0.0-20260206.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1d2c23599cff870ecee535c1faab56f4e30a0f02950490a79e4ec713d69394a6
3
+ size 8335929
src/train.py CHANGED
@@ -7,7 +7,9 @@ Usage:
7
  """
8
 
9
  import os
 
10
  import time
 
11
  from pathlib import Path
12
 
13
  import click
@@ -15,6 +17,57 @@ from sklearn.metrics import accuracy_score, f1_score, classification_report
15
 
16
  from underthesea_core import TextClassifier
17
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
18
 
19
  def read_file(filepath):
20
  """Read text file with multiple encoding attempts."""
@@ -209,5 +262,230 @@ def bank(output, max_features, ngram_min, ngram_max, min_df, c, max_iter, tol):
209
  click.echo(f"\nModel saved to {model_path} ({size_mb:.2f} MB)")
210
 
211
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
212
  if __name__ == "__main__":
213
  cli()
 
7
  """
8
 
9
  import os
10
+ import re
11
  import time
12
+ import unicodedata
13
  from pathlib import Path
14
 
15
  import click
 
17
 
18
  from underthesea_core import TextClassifier
19
 
20
+ # Vietnamese teencode dictionary
21
+ _TEENCODE = {
22
+ 'ko': 'không', 'k': 'không', 'hok': 'không', 'hem': 'không',
23
+ 'dc': 'được', 'đc': 'được', 'dk': 'được',
24
+ 'ntn': 'như thế nào',
25
+ 'nc': 'nói chuyện', 'nt': 'nhắn tin',
26
+ 'cx': 'cũng', 'cg': 'cũng',
27
+ 'vs': 'với', 'vl': 'vãi',
28
+ 'bt': 'bình thường', 'bth': 'bình thường',
29
+ 'lg': 'lượng', 'tl': 'trả lời',
30
+ 'ms': 'mới', 'r': 'rồi',
31
+ 'mn': 'mọi người', 'mk': 'mình',
32
+ 'ok': 'tốt', 'oke': 'tốt',
33
+ 'sp': 'sản phẩm',
34
+ 'hqua': 'hôm qua', 'hnay': 'hôm nay',
35
+ 'tks': 'cảm ơn', 'thanks': 'cảm ơn', 'thank': 'cảm ơn',
36
+ 'j': 'gì', 'z': 'vậy', 'v': 'vậy',
37
+ 'đt': 'điện thoại', 'dt': 'điện thoại',
38
+ 'lm': 'làm', 'ns': 'nói',
39
+ }
40
+
41
+ _NEG_WORDS = {'không', 'chẳng', 'chả', 'chưa', 'đừng', 'ko', 'hok', 'hem', 'chăng'}
42
+
43
+
44
+ def preprocess_sentiment(text):
45
+ """Preprocess Vietnamese text for sentiment analysis."""
46
+ text = unicodedata.normalize('NFC', text)
47
+ text = text.lower()
48
+ text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
49
+ text = re.sub(r'(.)\1{2,}', r'\1\1', text)
50
+ text = re.sub(r'!{2,}', '!', text)
51
+ text = re.sub(r'\?{2,}', '?', text)
52
+ text = re.sub(r'\.{4,}', '...', text)
53
+ # Teencode expansion
54
+ words = text.split()
55
+ expanded = []
56
+ for w in words:
57
+ wl = w.strip('.,!?;:')
58
+ if wl in _TEENCODE:
59
+ expanded.append(_TEENCODE[wl])
60
+ else:
61
+ expanded.append(w)
62
+ # Negation marking (2-word window)
63
+ new_words = list(expanded)
64
+ for i, w in enumerate(expanded):
65
+ wl = w.strip('.,!?;:')
66
+ if wl in _NEG_WORDS:
67
+ for j in range(i + 1, min(i + 3, len(expanded))):
68
+ new_words[j] = 'NEG_' + expanded[j]
69
+ return ' '.join(new_words)
70
+
71
 
72
  def read_file(filepath):
73
  """Read text file with multiple encoding attempts."""
 
262
  click.echo(f"\nModel saved to {model_path} ({size_mb:.2f} MB)")
263
 
264
 
265
+ def _load_vlsp2016(data_dir):
266
+ """Load VLSP2016 sentiment data from directory."""
267
+ label_map = {'POS': 'positive', 'NEG': 'negative', 'NEU': 'neutral'}
268
+ texts, labels = [], []
269
+ for split in ['train.txt', 'test.txt']:
270
+ split_texts, split_labels = [], []
271
+ filepath = os.path.join(data_dir, split)
272
+ with open(filepath, 'r', encoding='utf-8') as f:
273
+ for line in f:
274
+ line = line.strip()
275
+ if line.startswith('__label__'):
276
+ parts = line.split(' ', 1)
277
+ label = label_map[parts[0].replace('__label__', '')]
278
+ text = parts[1] if len(parts) > 1 else ''
279
+ split_texts.append(text)
280
+ split_labels.append(label)
281
+ texts.append(split_texts)
282
+ labels.append(split_labels)
283
+ return texts[0], labels[0], texts[1], labels[1]
284
+
285
+
286
+ @cli.command('sentiment-general')
287
+ @click.option('--output', '-o', default=None, help='Output model path')
288
+ @click.option('--vlsp2016-dir', default=None, help='Path to VLSP2016_SA directory (adds to training data)')
289
+ @click.option('--max-features', default=200000, help='Maximum vocabulary size')
290
+ @click.option('--ngram-min', default=1, help='Minimum n-gram')
291
+ @click.option('--ngram-max', default=3, help='Maximum n-gram')
292
+ @click.option('--min-df', default=1, help='Minimum document frequency')
293
+ @click.option('--max-df', default=0.9, help='Maximum document frequency')
294
+ @click.option('--c', default=0.7, help='SVM regularization parameter')
295
+ @click.option('--max-iter', default=1000, help='Maximum iterations')
296
+ @click.option('--tol', default=0.0001, help='Convergence tolerance')
297
+ def sentiment_general(output, vlsp2016_dir, max_features, ngram_min, ngram_max, min_df, max_df, c, max_iter, tol):
298
+ """Train sentiment-general model (3 classes: positive/negative/neutral).
299
+
300
+ Uses UTS2017_Bank sentiment data by default. Optionally adds VLSP2016 data
301
+ with --vlsp2016-dir for improved general-domain coverage.
302
+ """
303
+ from datetime import datetime
304
+ from datasets import load_dataset
305
+
306
+ if output is None:
307
+ date_str = datetime.now().strftime('%Y%m%d')
308
+ output = f'models/sen-sentiment-general-1.0.0-{date_str}.bin'
309
+
310
+ click.echo("=" * 70)
311
+ click.echo("Sentiment General Training (positive/negative/neutral)")
312
+ click.echo("=" * 70)
313
+
314
+ # Load UTS2017_Bank sentiment data
315
+ click.echo("\nLoading UTS2017_Bank sentiment dataset from HuggingFace...")
316
+ dataset = load_dataset("undertheseanlp/UTS2017_Bank", "sentiment")
317
+
318
+ train_texts = list(dataset["train"]["text"])
319
+ train_labels = list(dataset["train"]["sentiment"])
320
+ test_texts = list(dataset["test"]["text"])
321
+ test_labels = list(dataset["test"]["sentiment"])
322
+
323
+ vlsp_test_texts, vlsp_test_labels = None, None
324
+
325
+ # Optionally add VLSP2016 data
326
+ if vlsp2016_dir:
327
+ click.echo(f"\nLoading VLSP2016 data from {vlsp2016_dir}...")
328
+ vlsp_train_texts, vlsp_train_labels, vlsp_test_texts, vlsp_test_labels = _load_vlsp2016(vlsp2016_dir)
329
+ train_texts.extend(vlsp_train_texts)
330
+ train_labels.extend(vlsp_train_labels)
331
+ click.echo(f" VLSP2016 train: {len(vlsp_train_texts)}, test: {len(vlsp_test_texts)}")
332
+
333
+ click.echo(f" Total train samples: {len(train_texts)}")
334
+ click.echo(f" UTS2017 test samples: {len(test_texts)}")
335
+ click.echo(f" Labels: {sorted(set(train_labels))}")
336
+
337
+ # Preprocess
338
+ click.echo("\nPreprocessing...")
339
+ proc_train = [preprocess_sentiment(t) for t in train_texts]
340
+ proc_test = [preprocess_sentiment(t) for t in test_texts]
341
+
342
+ # Train
343
+ click.echo("\nTraining Rust TextClassifier...")
344
+ clf = TextClassifier(
345
+ max_features=max_features,
346
+ ngram_range=(ngram_min, ngram_max),
347
+ min_df=min_df,
348
+ max_df=max_df,
349
+ c=c,
350
+ max_iter=max_iter,
351
+ tol=tol,
352
+ )
353
+
354
+ t0 = time.perf_counter()
355
+ clf.fit(proc_train, train_labels)
356
+ train_time = time.perf_counter() - t0
357
+ click.echo(f" Training time: {train_time:.3f}s")
358
+ click.echo(f" Vocabulary size: {clf.n_features}")
359
+
360
+ # Evaluate on UTS2017
361
+ click.echo("\nEvaluating on UTS2017_Bank test set...")
362
+ preds = clf.predict_batch(proc_test)
363
+
364
+ acc = accuracy_score(test_labels, preds)
365
+ f1_w = f1_score(test_labels, preds, average='weighted', zero_division=0)
366
+ f1_m = f1_score(test_labels, preds, average='macro', zero_division=0)
367
+
368
+ click.echo("\n" + "=" * 70)
369
+ click.echo("RESULTS (UTS2017_Bank)")
370
+ click.echo("=" * 70)
371
+ click.echo(f" Accuracy: {acc:.4f} ({acc*100:.2f}%)")
372
+ click.echo(f" F1 (weighted): {f1_w:.4f}")
373
+ click.echo(f" F1 (macro): {f1_m:.4f}")
374
+ click.echo("\nClassification Report:")
375
+ click.echo(classification_report(test_labels, preds, zero_division=0))
376
+
377
+ # Evaluate on VLSP2016 if available
378
+ if vlsp_test_texts:
379
+ proc_vlsp_test = [preprocess_sentiment(t) for t in vlsp_test_texts]
380
+ vlsp_preds = clf.predict_batch(proc_vlsp_test)
381
+ vlsp_acc = accuracy_score(vlsp_test_labels, vlsp_preds)
382
+ vlsp_f1w = f1_score(vlsp_test_labels, vlsp_preds, average='weighted', zero_division=0)
383
+ vlsp_f1m = f1_score(vlsp_test_labels, vlsp_preds, average='macro', zero_division=0)
384
+
385
+ click.echo("=" * 70)
386
+ click.echo("RESULTS (VLSP2016)")
387
+ click.echo("=" * 70)
388
+ click.echo(f" Accuracy: {vlsp_acc:.4f} ({vlsp_acc*100:.2f}%)")
389
+ click.echo(f" F1 (weighted): {vlsp_f1w:.4f}")
390
+ click.echo(f" F1 (macro): {vlsp_f1m:.4f}")
391
+ click.echo("\nClassification Report:")
392
+ click.echo(classification_report(vlsp_test_labels, vlsp_preds, zero_division=0))
393
+
394
+ # Save model
395
+ model_path = Path(output)
396
+ model_path.parent.mkdir(parents=True, exist_ok=True)
397
+ clf.save(str(model_path))
398
+
399
+ size_mb = model_path.stat().st_size / (1024 * 1024)
400
+ click.echo(f"\nModel saved to {model_path} ({size_mb:.2f} MB)")
401
+
402
+
403
+ @cli.command('sentiment-bank')
404
+ @click.option('--output', '-o', default=None, help='Output model path')
405
+ @click.option('--max-features', default=200000, help='Maximum vocabulary size')
406
+ @click.option('--ngram-min', default=1, help='Minimum n-gram')
407
+ @click.option('--ngram-max', default=3, help='Maximum n-gram')
408
+ @click.option('--min-df', default=1, help='Minimum document frequency')
409
+ @click.option('--max-df', default=0.9, help='Maximum document frequency')
410
+ @click.option('--c', default=0.7, help='SVM regularization parameter')
411
+ @click.option('--max-iter', default=1000, help='Maximum iterations')
412
+ @click.option('--tol', default=0.0001, help='Convergence tolerance')
413
+ def sentiment_bank(output, max_features, ngram_min, ngram_max, min_df, max_df, c, max_iter, tol):
414
+ """Train sentiment-bank model on UTS2017_Bank (36 combined category#sentiment labels)."""
415
+ from datetime import datetime
416
+ from datasets import load_dataset
417
+
418
+ if output is None:
419
+ date_str = datetime.now().strftime('%Y%m%d')
420
+ output = f'models/sen-sentiment-bank-1.0.0-{date_str}.bin'
421
+
422
+ click.echo("=" * 70)
423
+ click.echo("Sentiment Bank Training (category#sentiment, 36 labels)")
424
+ click.echo("=" * 70)
425
+
426
+ # Load and merge classification + sentiment configs
427
+ click.echo("\nLoading UTS2017_Bank dataset from HuggingFace...")
428
+ ds_class = load_dataset("undertheseanlp/UTS2017_Bank", "classification")
429
+ ds_sent = load_dataset("undertheseanlp/UTS2017_Bank", "sentiment")
430
+
431
+ train_texts = list(ds_class["train"]["text"])
432
+ train_labels = [f'{c}#{s}' for c, s in zip(ds_class["train"]["label"], ds_sent["train"]["sentiment"])]
433
+ test_texts = list(ds_class["test"]["text"])
434
+ test_labels = [f'{c}#{s}' for c, s in zip(ds_class["test"]["label"], ds_sent["test"]["sentiment"])]
435
+
436
+ click.echo(f" Train samples: {len(train_texts)}")
437
+ click.echo(f" Test samples: {len(test_texts)}")
438
+ click.echo(f" Labels: {len(set(train_labels))}")
439
+
440
+ # Preprocess
441
+ click.echo("\nPreprocessing...")
442
+ proc_train = [preprocess_sentiment(t) for t in train_texts]
443
+ proc_test = [preprocess_sentiment(t) for t in test_texts]
444
+
445
+ # Train
446
+ click.echo("\nTraining Rust TextClassifier...")
447
+ clf = TextClassifier(
448
+ max_features=max_features,
449
+ ngram_range=(ngram_min, ngram_max),
450
+ min_df=min_df,
451
+ max_df=max_df,
452
+ c=c,
453
+ max_iter=max_iter,
454
+ tol=tol,
455
+ )
456
+
457
+ t0 = time.perf_counter()
458
+ clf.fit(proc_train, train_labels)
459
+ train_time = time.perf_counter() - t0
460
+ click.echo(f" Training time: {train_time:.3f}s")
461
+ click.echo(f" Vocabulary size: {clf.n_features}")
462
+
463
+ # Evaluate
464
+ click.echo("\nEvaluating...")
465
+ preds = clf.predict_batch(proc_test)
466
+
467
+ acc = accuracy_score(test_labels, preds)
468
+ f1_w = f1_score(test_labels, preds, average='weighted', zero_division=0)
469
+ f1_m = f1_score(test_labels, preds, average='macro', zero_division=0)
470
+
471
+ click.echo("\n" + "=" * 70)
472
+ click.echo("RESULTS")
473
+ click.echo("=" * 70)
474
+ click.echo(f" Accuracy: {acc:.4f} ({acc*100:.2f}%)")
475
+ click.echo(f" F1 (weighted): {f1_w:.4f}")
476
+ click.echo(f" F1 (macro): {f1_m:.4f}")
477
+
478
+ click.echo("\nClassification Report:")
479
+ click.echo(classification_report(test_labels, preds, zero_division=0))
480
+
481
+ # Save model
482
+ model_path = Path(output)
483
+ model_path.parent.mkdir(parents=True, exist_ok=True)
484
+ clf.save(str(model_path))
485
+
486
+ size_mb = model_path.stat().st_size / (1024 * 1024)
487
+ click.echo(f"\nModel saved to {model_path} ({size_mb:.2f} MB)")
488
+
489
+
490
  if __name__ == "__main__":
491
  cli()