Vu Anh Claude committed 8be6e55 (1 parent: 020a0e3)

Update README.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (1): README.md (+410 -3)
---
license: apache-2.0
library_name: scikit-learn
tags:
- scikit-learn
- sklearn
- text-classification
- vietnamese
- nlp
- sonar
- tf-idf
- logistic-regression
- svc
- support-vector-classification
datasets:
- vntc
- undertheseanlp/UTS2017_Bank
metrics:
- accuracy
- precision
- recall
- f1-score
model-index:
- name: sonar-core-1
  results:
  - task:
      type: text-classification
      name: Vietnamese News Classification
    dataset:
      name: VNTC
      type: vntc
    metrics:
    - type: accuracy
      value: 0.9280
      name: Test Accuracy (SVC)
    - type: precision
      value: 0.92
      name: Weighted Precision
    - type: recall
      value: 0.92
      name: Weighted Recall
    - type: f1-score
      value: 0.92
      name: Weighted F1-Score
  - task:
      type: text-classification
      name: Vietnamese Banking Text Classification
    dataset:
      name: UTS2017_Bank
      type: undertheseanlp/UTS2017_Bank
    metrics:
    - type: accuracy
      value: 0.7247
      name: Test Accuracy (SVC)
    - type: precision
      value: 0.65
      name: Weighted Precision (SVC)
    - type: recall
      value: 0.72
      name: Weighted Recall (SVC)
    - type: f1-score
      value: 0.66
      name: Weighted F1-Score (SVC)
language:
- vi
pipeline_tag: text-classification
---

# Sonar Core 1 - Vietnamese Text Classification Model

A machine learning-based text classification model for Vietnamese. It combines a TF-IDF feature extraction pipeline with Support Vector Classification (SVC) or Logistic Regression, and with SVC achieves **92.80% accuracy** on VNTC (news) and **72.47% accuracy** on UTS2017_Bank (banking).

📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md)** for comprehensive model documentation, performance analysis, and limitations.

## Model Description

**Sonar Core 1** is a Vietnamese text classification model that supports multiple domains, including news categorization and banking text classification. Typical applications are Vietnamese news article classification, banking text categorization, content categorization, and document organization and tagging.

### Model Architecture

- **Algorithm**: TF-IDF + SVC/Logistic Regression pipeline
- **Feature Extraction**: CountVectorizer capped at 20,000 features
- **N-gram Support**: Unigrams and bigrams (1-2)
- **TF-IDF**: TfidfTransformer with IDF weighting
- **Classifier**: Support Vector Classification (SVC) / Logistic Regression with tuned parameters
- **Framework**: scikit-learn ≥1.6
- **Caching System**: Hash-based caching for efficient processing
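
In scikit-learn terms, the architecture above corresponds to a three-stage pipeline. The sketch below is illustrative only: the toy texts, labels, and `max_iter` value are placeholders, not the released training configuration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Minimal sketch of the TF-IDF + classifier pipeline described above.
# Parameter values mirror the documented defaults; the actual train.py
# may configure additional options.
pipeline = Pipeline([
    ("counts", CountVectorizer(max_features=20000, ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy fit on placeholder data, just to show the API shape
texts = [
    "bóng đá việt nam",
    "lãi suất ngân hàng",
    "bóng đá thế giới",
    "vay tiền ngân hàng",
]
labels = ["the_thao", "banking", "the_thao", "banking"]
pipeline.fit(texts, labels)
print(pipeline.predict(["đội tuyển bóng đá"]))
```

Swapping `LogisticRegression` for an SVC estimator changes only the final pipeline step.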

## Supported Datasets & Categories

### VNTC Dataset - News Categories (10 classes)
1. **chinh_tri_xa_hoi** - Politics and Society
2. **doi_song** - Lifestyle
3. **khoa_hoc** - Science
4. **kinh_doanh** - Business
5. **phap_luat** - Law
6. **suc_khoe** - Health
7. **the_gioi** - World News
8. **the_thao** - Sports
9. **van_hoa** - Culture
10. **vi_tinh** - Information Technology

### UTS2017_Bank Dataset - Banking Categories (14 classes)
1. **ACCOUNT** - Account services
2. **CARD** - Card services
3. **CUSTOMER_SUPPORT** - Customer support
4. **DISCOUNT** - Discount offers
5. **INTEREST_RATE** - Interest rate information
6. **INTERNET_BANKING** - Internet banking services
7. **LOAN** - Loan services
8. **MONEY_TRANSFER** - Money transfer services
9. **OTHER** - Other services
10. **PAYMENT** - Payment services
11. **PROMOTION** - Promotional offers
12. **SAVING** - Savings accounts
13. **SECURITY** - Security features
14. **TRADEMARK** - Trademark/branding
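
For display purposes, the VNTC slugs above can be mapped to English names with a plain dict. This is a convenience helper suggested here, not part of the released model files:

```python
# English display names for the VNTC category slugs listed above
# (a hypothetical convenience mapping, not shipped with the model).
VNTC_LABELS = {
    "chinh_tri_xa_hoi": "Politics and Society",
    "doi_song": "Lifestyle",
    "khoa_hoc": "Science",
    "kinh_doanh": "Business",
    "phap_luat": "Law",
    "suc_khoe": "Health",
    "the_gioi": "World News",
    "the_thao": "Sports",
    "van_hoa": "Culture",
    "vi_tinh": "Information Technology",
}

print(VNTC_LABELS.get("the_thao", "Unknown"))  # → Sports
```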

## Installation

```bash
pip install "scikit-learn>=1.6" joblib
```

## Usage

### Training the Model

#### VNTC Dataset (News Classification)
```bash
# Default training with the VNTC dataset
python train.py --dataset vntc --model logistic

# With explicit parameters
python train.py --dataset vntc --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
```

#### UTS2017_Bank Dataset (Banking Text Classification)
```bash
# Train with the UTS2017_Bank dataset (SVC recommended)
python train.py --dataset uts2017 --model svc_linear

# Train with Logistic Regression
python train.py --dataset uts2017 --model logistic

# With explicit parameters (SVC)
python train.py --dataset uts2017 --model svc_linear --max-features 20000 --ngram-min 1 --ngram-max 2

# Compare multiple configurations
python train.py --dataset uts2017 --compare
```

### Training from Scratch

```python
from train import train_notebook

# Train the VNTC model
vntc_results = train_notebook(
    dataset="vntc",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2
)

# Train the UTS2017_Bank model
bank_results = train_notebook(
    dataset="uts2017",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2
)
```

## Performance Metrics

### VNTC Dataset Performance
- **Training Accuracy**: 95.39%
- **Test Accuracy (SVC)**: 92.80%
- **Test Accuracy (Logistic Regression)**: 92.33%
- **Training Samples**: 33,759
- **Test Samples**: 50,373
- **Training Time (SVC)**: ~54.6 minutes
- **Training Time (Logistic Regression)**: ~31.40 seconds
- **Best Performing**: Sports (98% F1-score)
- **Challenging Category**: Lifestyle (76% F1-score)

### UTS2017_Bank Dataset Performance
- **Training Accuracy (SVC)**: 95.07%
- **Test Accuracy (SVC)**: 72.47%
- **Test Accuracy (Logistic Regression)**: 70.96%
- **Training Samples**: 1,581
- **Test Samples**: 396
- **Training Time (SVC)**: ~5.3 seconds
- **Training Time (Logistic Regression)**: ~0.78 seconds
- **Best Performing**: TRADEMARK (89% F1-score with SVC), CUSTOMER_SUPPORT (77% F1-score with SVC)
- **SVC Improvements**: LOAN (+0.50 F1), DISCOUNT (+0.22 F1), INTEREST_RATE (+0.18 F1)
- **Challenges**: Many minority classes have too little training data
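
The weighted precision/recall/F1 figures above are the standard scikit-learn report. Given a fitted model and a held-out split they can be reproduced along these lines; the labels below are placeholders standing in for a real test set:

```python
from sklearn.metrics import accuracy_score, classification_report

# Placeholder labels standing in for real test data and model predictions.
y_test = ["the_thao", "phap_luat", "the_thao", "suc_khoe"]
y_pred = ["the_thao", "phap_luat", "van_hoa", "suc_khoe"]

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# digits=2 matches the two-decimal weighted metrics quoted above
print(classification_report(y_test, y_pred, digits=2, zero_division=0))
```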

## Using the Pre-trained Models

### VNTC Model (Vietnamese News Classification)

```python
from huggingface_hub import hf_hub_download
import joblib

# Download and load the VNTC model
vntc_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
)

# Prediction helper returning the top category plus the top-3 candidates
def predict_text(model, text):
    probabilities = model.predict_proba([text])[0]

    # Top 3 predictions sorted by descending probability
    top_indices = probabilities.argsort()[-3:][::-1]
    top_predictions = []
    for idx in top_indices:
        category = model.classes_[idx]
        prob = probabilities[idx]
        top_predictions.append((category, prob))

    # The highest-probability category is the prediction
    prediction = top_predictions[0][0]
    confidence = top_predictions[0][1]

    return prediction, confidence, top_predictions

# Predict on a news text
news_text = "Đội tuyển bóng đá Việt Nam giành chiến thắng"
prediction, confidence, top_predictions = predict_text(vntc_model, news_text)

print(f"News category: {prediction}")
print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f"  {i}. {category}: {prob:.3f}")
```

### UTS2017_Bank Model (Vietnamese Banking Text Classification)

```python
from huggingface_hub import hf_hub_download
import joblib

# Download and load the UTS2017_Bank model (latest SVC model)
bank_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
)

# Prediction helper (same as above)
def predict_text(model, text):
    probabilities = model.predict_proba([text])[0]

    # Top 3 predictions sorted by descending probability
    top_indices = probabilities.argsort()[-3:][::-1]
    top_predictions = []
    for idx in top_indices:
        category = model.classes_[idx]
        prob = probabilities[idx]
        top_predictions.append((category, prob))

    # The highest-probability category is the prediction
    prediction = top_predictions[0][0]
    confidence = top_predictions[0][1]

    return prediction, confidence, top_predictions

# Predict on a banking text
bank_text = "Tôi muốn mở tài khoản tiết kiệm"
prediction, confidence, top_predictions = predict_text(bank_model, bank_text)

print(f"Banking category: {prediction}")
print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f"  {i}. {category}: {prob:.3f}")
```

### Using Both Models

```python
from huggingface_hub import hf_hub_download
import joblib

# Load both models
vntc_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
)
bank_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
)

# Prediction helper shared by both models
def predict_text(model, text):
    probabilities = model.predict_proba([text])[0]

    # Top 3 predictions sorted by descending probability
    top_indices = probabilities.argsort()[-3:][::-1]
    top_predictions = []
    for idx in top_indices:
        category = model.classes_[idx]
        prob = probabilities[idx]
        top_predictions.append((category, prob))

    # The highest-probability category is the prediction
    prediction = top_predictions[0][0]
    confidence = top_predictions[0][1]

    return prediction, confidence, top_predictions

# Classify any Vietnamese text with the appropriate model
def classify_vietnamese_text(text, domain="auto"):
    """
    Classify Vietnamese text with the appropriate model and return detailed predictions.

    Args:
        text: Vietnamese text to classify
        domain: "news", "banking", or "auto" to pick the domain automatically

    Returns:
        tuple: (prediction, confidence, top_predictions, domain_used)
    """
    if domain == "news":
        prediction, confidence, top_predictions = predict_text(vntc_model, text)
        return prediction, confidence, top_predictions, "news"
    elif domain == "banking":
        prediction, confidence, top_predictions = predict_text(bank_model, text)
        return prediction, confidence, top_predictions, "banking"
    else:
        # Try both models and keep the higher-confidence result
        news_pred, news_conf, news_top = predict_text(vntc_model, text)
        bank_pred, bank_conf, bank_top = predict_text(bank_model, text)

        if news_conf > bank_conf:
            return f"NEWS: {news_pred}", news_conf, news_top, "news"
        else:
            return f"BANKING: {bank_pred}", bank_conf, bank_top, "banking"

# Examples
examples = [
    "Đội tuyển bóng đá Việt Nam thắng 2-0",
    "Tôi muốn vay tiền mua nhà",
    "Chính phủ thông qua luật mới"
]

for text in examples:
    category, confidence, top_predictions, domain = classify_vietnamese_text(text)
    print(f"Text: {text}")
    print(f"Category: {category}")
    print(f"Confidence: {confidence:.3f}")
    print(f"Domain: {domain}")
    print("Top 3 predictions:")
    for i, (cat, prob) in enumerate(top_predictions, 1):
        print(f"  {i}. {cat}: {prob:.3f}")
    print()
```

## Model Parameters

- `dataset`: Dataset to use ("vntc" or "uts2017")
- `model`: Model type ("logistic" or "svc_linear"; SVC is recommended for best accuracy)
- `max_features`: Maximum number of TF-IDF features (default: 20000)
- `ngram_min`/`ngram_max`: N-gram range (default: 1-2)
- `split_ratio`: Train/test split ratio for UTS2017 (default: 0.2)
- `n_samples`: Optional sample limit for quick testing
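
Wired through `argparse`, the parameters above map to the CLI flags shown under Usage. The snippet below is a sketch of that flag handling, not the actual `train.py` (the `parse_args` list is only there to make the example self-contained):

```python
import argparse

# Hypothetical flag parsing mirroring the parameters documented above.
parser = argparse.ArgumentParser(description="Train a Sonar Core 1 classifier")
parser.add_argument("--dataset", choices=["vntc", "uts2017"], default="vntc")
parser.add_argument("--model", choices=["logistic", "svc_linear"], default="logistic")
parser.add_argument("--max-features", type=int, default=20000)
parser.add_argument("--ngram-min", type=int, default=1)
parser.add_argument("--ngram-max", type=int, default=2)
parser.add_argument("--split-ratio", type=float, default=0.2)
parser.add_argument("--n-samples", type=int, default=None)

# Simulate: python train.py --dataset uts2017 --model svc_linear
args = parser.parse_args(["--dataset", "uts2017", "--model", "svc_linear"])
print(args.dataset, args.model, args.max_features)
```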

## Limitations

1. **Language Specificity**: Works only with Vietnamese text
2. **Domain Specificity**: Optimized for specific domains (news and banking)
3. **Feature Limitations**: Limited to the 20,000 most frequent features
4. **Class Imbalance Sensitivity**: Performance degrades on imbalanced datasets
5. **Specific Weaknesses**:
   - VNTC: Lower performance on the lifestyle category (71% recall)
   - UTS2017_Bank: Poor performance on minority classes despite SVC improvements
   - SVC requires much longer training time than Logistic Regression
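
One common mitigation for the class-imbalance weakness noted above is class weighting. It is not part of the released training recipe, but in scikit-learn it is a one-line change, shown here on toy one-feature data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# class_weight="balanced" reweights each class inversely to its frequency,
# which can lift recall on rare labels such as UTS2017_Bank's minority classes.
# (An illustration of the option, not the configuration the released models used.)
logistic = LogisticRegression(class_weight="balanced", max_iter=1000)
svc = LinearSVC(class_weight="balanced")

# Toy imbalanced data: 3 samples of class 0, 5 of class 1
X = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [1.2], [1.3]]
y = [0, 0, 0, 1, 1, 1, 1, 1]
logistic.fit(X, y)
print(logistic.predict([[0.05], [1.25]]))
```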

## Ethical Considerations

- The model reflects biases present in its training datasets
- Performance varies significantly across categories
- Validate on the target domain before deployment
- Consider class imbalance when interpreting results

## Additional Information

- **Repository**: https://huggingface.co/undertheseanlp/sonar_core_1
- **Framework Version**: scikit-learn ≥1.6
- **Python Version**: 3.10+
- **System Card**: See [Sonar Core 1 - System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md) for detailed documentation

## Citation

If you use this model, please cite:

```bibtex
@misc{undertheseanlp_2025,
  author    = {undertheseanlp},
  title     = {Sonar Core 1 - Vietnamese Text Classification Model},
  year      = {2025},
  url       = {https://huggingface.co/undertheseanlp/sonar_core_1},
  doi       = {10.57967/hf/6599},
  publisher = {Hugging Face}
}
```