Vu Anh Claude committed on
Commit ·
0712d08
1
Parent(s): 9cf063d
Update README.md and remove old model files
- Update README.md with latest information
- Remove old model files with generic names
- New descriptive model files already committed with timestamps
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- README.md +191 -78
- sklearn_model.joblib +0 -3
- sklearn_model_uts2017_bank.joblib +0 -3
README.md
CHANGED
|
@@ -4,139 +4,252 @@ library_name: scikit-learn
|
|
| 4 |
tags:
|
| 5 |
- scikit-learn
|
| 6 |
- sklearn
|
| 7 |
-
- classification
|
| 8 |
-
-
|
|
|
|
| 9 |
- sonar
|
| 10 |
-
-
|
|
|
|
| 11 |
datasets:
|
| 12 |
-
-
|
|
|
|
| 13 |
metrics:
|
| 14 |
- accuracy
|
| 15 |
model-index:
|
| 16 |
- name: sonar-core-1
|
| 17 |
results:
|
| 18 |
- task:
|
| 19 |
-
type:
|
| 20 |
-
name:
|
| 21 |
dataset:
|
| 22 |
-
name:
|
| 23 |
-
type:
|
| 24 |
metrics:
|
| 25 |
- type: accuracy
|
| 26 |
-
value: 0.
|
| 27 |
name: Test Accuracy
|
| 28 |
language:
|
| 29 |
-
-
|
| 30 |
-
pipeline_tag:
|
| 31 |
---
|
| 32 |
|
| 33 |
-
# Sonar Core Model
|
| 34 |
|
| 35 |
-
A
|
| 36 |
|
| 37 |
## Model Description
|
| 38 |
|
| 39 |
-
|
| 40 |
|
| 41 |
### Model Architecture
|
| 42 |
|
| 43 |
-
- **Algorithm**:
|
| 44 |
-
- **
|
| 45 |
-
- **
|
| 46 |
-
- **
|
| 47 |
-
- **
|
| 48 |
-
- **
|
| 49 |
|
| 50 |
## Installation
|
| 51 |
|
| 52 |
-
Using uv:
|
| 53 |
```bash
|
| 54 |
-
|
| 55 |
```
|
| 56 |
|
| 57 |
## Usage
|
| 58 |
|
| 59 |
-
### Training the
|
| 60 |
```bash
|
| 61 |
-
|
| 62 |
```
|
| 63 |
|
| 64 |
-
|
| 65 |
-
```
|
| 66 |
-
|
| 67 |
-
|
| 68 |
|
| 69 |
-
#
|
| 70 |
-
model
|
| 71 |
|
| 72 |
-
#
|
| 73 |
-
|
| 74 |
-
prediction = model.predict(X_new)
|
| 75 |
-
probabilities = model.predict_proba(X_new)
|
| 76 |
```
|
| 77 |
|
| 78 |
-
###
|
| 79 |
-
```python
|
| 80 |
-
from model import SonarModel
|
| 81 |
-
from sklearn.model_selection import train_test_split
|
| 82 |
|
| 83 |
-
|
| 84 |
-
|
| 85 |
|
| 86 |
-
#
|
| 87 |
-
model.
|
| 88 |
|
| 89 |
-
#
|
| 90 |
-
|
| 91 |
|
| 92 |
-
|
| 93 |
-
|
| 94 |
```
|
| 95 |
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
- `n_estimators`: Number of trees in the forest (default: 100)
|
| 99 |
-
- `max_depth`: Maximum depth of trees (default: 10)
|
| 100 |
-
- `random_state`: Random seed for reproducibility (default: 42)
|
| 101 |
-
|
| 102 |
-
## Training
|
| 103 |
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
-
|
| 117 |
-
-
|
| 118 |
|
| 119 |
-
|
| 120 |
|
| 121 |
-
|
| 122 |
-
-
|
| 123 |
|
| 124 |
## Limitations
|
| 125 |
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
|
| 131 |
## Ethical Considerations
|
| 132 |
|
| 133 |
-
|
| 134 |
-
-
|
| 135 |
-
-
|
| 136 |
-
-
|
| 137 |
|
| 138 |
## Additional Information
|
| 139 |
|
| 140 |
- **Repository**: https://huggingface.co/undertheseanlp/sonar_core_1
|
| 141 |
-
- **Framework Version**: scikit-learn 1.
|
| 142 |
- **Python Version**: 3.10+
|
| 4 |
tags:
|
| 5 |
- scikit-learn
|
| 6 |
- sklearn
|
| 7 |
+
- text-classification
|
| 8 |
+
- vietnamese
|
| 9 |
+
- nlp
|
| 10 |
- sonar
|
| 11 |
+
- tf-idf
|
| 12 |
+
- logistic-regression
|
| 13 |
datasets:
|
| 14 |
+
- vntc
|
| 15 |
+
- uts2017_bank
|
| 16 |
metrics:
|
| 17 |
- accuracy
|
| 18 |
+
- precision
|
| 19 |
+
- recall
|
| 20 |
+
- f1-score
|
| 21 |
model-index:
|
| 22 |
- name: sonar-core-1
|
| 23 |
results:
|
| 24 |
- task:
|
| 25 |
+
type: text-classification
|
| 26 |
+
name: Vietnamese News Classification
|
| 27 |
+
dataset:
|
| 28 |
+
name: VNTC
|
| 29 |
+
type: vntc
|
| 30 |
+
metrics:
|
| 31 |
+
- type: accuracy
|
| 32 |
+
value: 0.9233
|
| 33 |
+
name: Test Accuracy
|
| 34 |
+
- type: precision
|
| 35 |
+
value: 0.92
|
| 36 |
+
name: Weighted Precision
|
| 37 |
+
- type: recall
|
| 38 |
+
value: 0.92
|
| 39 |
+
name: Weighted Recall
|
| 40 |
+
- type: f1-score
|
| 41 |
+
value: 0.92
|
| 42 |
+
name: Weighted F1-Score
|
| 43 |
+
- task:
|
| 44 |
+
type: text-classification
|
| 45 |
+
name: Vietnamese Banking Text Classification
|
| 46 |
dataset:
|
| 47 |
+
name: UTS2017_Bank
|
| 48 |
+
type: uts2017_bank
|
| 49 |
metrics:
|
| 50 |
- type: accuracy
|
| 51 |
+
value: 0.7096
|
| 52 |
name: Test Accuracy
|
| 53 |
+
- type: precision
|
| 54 |
+
value: 0.64
|
| 55 |
+
name: Weighted Precision
|
| 56 |
+
- type: recall
|
| 57 |
+
value: 0.71
|
| 58 |
+
name: Weighted Recall
|
| 59 |
+
- type: f1-score
|
| 60 |
+
value: 0.63
|
| 61 |
+
name: Weighted F1-Score
|
| 62 |
language:
|
| 63 |
+
- vi
|
| 64 |
+
pipeline_tag: text-classification
|
| 65 |
---
|
| 66 |
|
| 67 |
+
# Sonar Core 1 - Vietnamese Text Classification Model
|
| 68 |
|
| 69 |
+
A machine learning-based text classification model for Vietnamese. It combines a TF-IDF feature extraction pipeline with Logistic Regression and reaches **92.33% test accuracy** on the VNTC (news) dataset and **70.96% test accuracy** on the UTS2017_Bank (banking) dataset.
|
| 70 |
|
| 71 |
## Model Description
|
| 72 |
|
| 73 |
+
**Sonar Core 1** is a Vietnamese text classification model that covers two domains: news categorization and banking text classification. Intended uses include Vietnamese news article classification, banking text categorization, general content categorization for Vietnamese text, and document organization and tagging.
|
| 74 |
|
| 75 |
### Model Architecture
|
| 76 |
|
| 77 |
+
- **Algorithm**: TF-IDF + Logistic Regression Pipeline
|
| 78 |
+
- **Feature Extraction**: CountVectorizer with 20,000 max features
|
| 79 |
+
- **N-gram Support**: Unigram and bigram (1-2)
|
| 80 |
+
- **TF-IDF**: Transformation with IDF weighting
|
| 81 |
+
- **Classifier**: Logistic Regression with 1,000 max iterations
|
| 82 |
+
- **Framework**: scikit-learn ≥1.6
|
| 83 |
+
- **Caching System**: Hash-based caching for efficient processing
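
The architecture listed above can be sketched as a scikit-learn `Pipeline`. This is a minimal reconstruction from the bullet points only, not the repository's actual training code: the `build_pipeline` helper, the step names (`vect`, `tfidf`, `clf`), and the toy sentences are assumptions made for illustration, and the caching layer is left out because its design is not described here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(max_features=20000, ngram_min=1, ngram_max=2):
    """Hypothetical helper mirroring the listed architecture."""
    return Pipeline([
        # Bag-of-words counts over unigrams and bigrams, capped at max_features.
        ("vect", CountVectorizer(max_features=max_features,
                                 ngram_range=(ngram_min, ngram_max))),
        # Re-weight raw counts with inverse document frequency.
        ("tfidf", TfidfTransformer(use_idf=True)),
        # Linear classifier with the iteration cap given in the list.
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy data only, to show that the pipeline fits and predicts end to end.
texts = [
    "đội tuyển thắng trận bóng đá",
    "cầu thủ ghi bàn trong trận đấu",
    "ngân hàng tăng lãi suất tiết kiệm",
    "lãi suất vay ngân hàng giảm",
]
labels = ["the_thao", "the_thao", "kinh_doanh", "kinh_doanh"]

pipeline = build_pipeline()
pipeline.fit(texts, labels)
predicted = pipeline.predict(["trận bóng đá tối nay"])[0]
print(predicted)
```

Keeping the vectorizer, the TF-IDF transform, and the classifier in one `Pipeline` object means a single `joblib` file carries both the vocabulary and the weights, which is why the prediction example can pass raw strings straight to `predict`.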
|
| 84 |
+
|
| 85 |
+
## Supported Datasets & Categories
|
| 86 |
+
|
| 87 |
+
### VNTC Dataset - News Categories (10 classes)
|
| 88 |
+
1. **chinh_tri_xa_hoi** - Politics and Society
|
| 89 |
+
2. **doi_song** - Lifestyle
|
| 90 |
+
3. **khoa_hoc** - Science
|
| 91 |
+
4. **kinh_doanh** - Business
|
| 92 |
+
5. **phap_luat** - Law
|
| 93 |
+
6. **suc_khoe** - Health
|
| 94 |
+
7. **the_gioi** - World News
|
| 95 |
+
8. **the_thao** - Sports
|
| 96 |
+
9. **van_hoa** - Culture
|
| 97 |
+
10. **vi_tinh** - Information Technology
|
| 98 |
+
|
| 99 |
+
### UTS2017_Bank Dataset - Banking Categories (14 classes)
|
| 100 |
+
1. **ACCOUNT** - Account services
|
| 101 |
+
2. **CARD** - Card services
|
| 102 |
+
3. **CUSTOMER_SUPPORT** - Customer support
|
| 103 |
+
4. **DISCOUNT** - Discount offers
|
| 104 |
+
5. **INTEREST_RATE** - Interest rate information
|
| 105 |
+
6. **INTERNET_BANKING** - Internet banking services
|
| 106 |
+
7. **LOAN** - Loan services
|
| 107 |
+
8. **MONEY_TRANSFER** - Money transfer services
|
| 108 |
+
9. **OTHER** - Other services
|
| 109 |
+
10. **PAYMENT** - Payment services
|
| 110 |
+
11. **PROMOTION** - Promotional offers
|
| 111 |
+
12. **SAVING** - Savings accounts
|
| 112 |
+
13. **SECURITY** - Security features
|
| 113 |
+
14. **TRADEMARK** - Trademark/branding
|
| 114 |
|
| 115 |
## Installation
|
| 116 |
|
|
|
|
| 117 |
```bash
|
| 118 |
+
# Quote the requirement so the shell does not treat ">" as a redirect
pip install "scikit-learn>=1.6" joblib
|
| 119 |
```
|
| 120 |
|
| 121 |
## Usage
|
| 122 |
|
| 123 |
+
### Training the Model
|
| 124 |
+
|
| 125 |
+
#### VNTC Dataset (News Classification)
|
| 126 |
```bash
|
| 127 |
+
# Default training with VNTC dataset
|
| 128 |
+
python train.py --dataset vntc --model logistic
|
| 129 |
+
|
| 130 |
+
# With specific parameters
|
| 131 |
+
python train.py --dataset vntc --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
|
| 132 |
```
|
| 133 |
|
| 134 |
+
#### UTS2017_Bank Dataset (Banking Text Classification)
|
| 135 |
+
```bash
|
| 136 |
+
# Train with UTS2017_Bank dataset
|
| 137 |
+
python train.py --dataset uts2017 --model logistic
|
| 138 |
|
| 139 |
+
# With specific parameters
|
| 140 |
+
python train.py --dataset uts2017 --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
|
| 141 |
|
| 142 |
+
# Compare multiple configurations
|
| 143 |
+
python train.py --dataset uts2017 --compare
|
| 144 |
```
|
| 145 |
|
| 146 |
+
### Using the Model for Prediction
|
| 147 |
|
| 148 |
+
```python
|
| 149 |
+
import joblib
|
| 150 |
|
| 151 |
+
# Load model
|
| 152 |
+
model = joblib.load('path/to/model.pkl')
|
| 153 |
|
| 154 |
+
# Make prediction
|
| 155 |
+
text = "Việt Nam giành chiến thắng trong trận bán kết"
|
| 156 |
+
prediction = model.predict([text])[0]
|
| 157 |
+
probabilities = model.predict_proba([text])[0]
|
| 158 |
|
| 159 |
+
print(f"Predicted category: {prediction}")
|
| 160 |
+
print(f"Confidence scores: {probabilities}")
|
| 161 |
```
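
The array returned by `predict_proba` has one column per class, in the order of the fitted model's `classes_` attribute, so the scores are easier to read when paired with their labels. The sketch below assumes the loaded object is a scikit-learn pipeline ending in a classifier; to stay self-contained it trains a tiny stand-in model on made-up sentences instead of loading the real file, and it uses `TfidfVectorizer` (equivalent to `CountVectorizer` followed by `TfidfTransformer`) for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in for joblib.load(...): a toy model so the sketch runs on its own.
train_texts = [
    "đội tuyển giành chiến thắng trận chung kết",
    "cầu thủ ghi bàn trong trận bán kết",
    "ngân hàng điều chỉnh lãi suất cho vay",
    "doanh nghiệp công bố lợi nhuận quý",
    "bác sĩ khuyến cáo tiêm vắc xin phòng bệnh",
    "bệnh viện tiếp nhận bệnh nhân sốt xuất huyết",
]
train_labels = ["the_thao", "the_thao", "kinh_doanh",
                "kinh_doanh", "suc_khoe", "suc_khoe"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

text = "Việt Nam giành chiến thắng trong trận bán kết"
probabilities = model.predict_proba([text])[0]

# Pair each probability with its class label, highest score first.
ranked = sorted(zip(model.classes_, probabilities),
                key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

The top-ranked label is the same one `predict` returns; the remaining scores show how close the runner-up categories were.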
|
| 162 |
|
| 163 |
+
### Training from Scratch
|
| 164 |
|
| 165 |
+
```python
|
| 166 |
+
from train import train_notebook
|
| 167 |
+
|
| 168 |
+
# Train VNTC model
|
| 169 |
+
vntc_results = train_notebook(
|
| 170 |
+
dataset="vntc",
|
| 171 |
+
model_name="logistic",
|
| 172 |
+
max_features=20000,
|
| 173 |
+
ngram_min=1,
|
| 174 |
+
ngram_max=2
|
| 175 |
+
)
|
| 176 |
+
|
| 177 |
+
# Train UTS2017_Bank model
|
| 178 |
+
bank_results = train_notebook(
|
| 179 |
+
dataset="uts2017",
|
| 180 |
+
model_name="logistic",
|
| 181 |
+
max_features=20000,
|
| 182 |
+
ngram_min=1,
|
| 183 |
+
ngram_max=2
|
| 184 |
+
)
|
| 185 |
+
```
|
| 186 |
|
| 187 |
+
## Performance Metrics
|
| 188 |
+
|
| 189 |
+
### VNTC Dataset Performance
|
| 190 |
+
- **Training Accuracy**: 95.39%
|
| 191 |
+
- **Test Accuracy**: 92.33%
|
| 192 |
+
- **Training Samples**: 33,759
|
| 193 |
+
- **Test Samples**: 50,373
|
| 194 |
+
- **Training Time**: ~31.40 seconds
|
| 195 |
+
- **Best Performing**: Sports (98% F1-score)
|
| 196 |
+
- **Challenging Category**: Lifestyle (76% F1-score)
|
| 197 |
+
|
| 198 |
+
### UTS2017_Bank Dataset Performance
|
| 199 |
+
- **Training Accuracy**: 76.22%
|
| 200 |
+
- **Test Accuracy**: 70.96%
|
| 201 |
+
- **Training Samples**: 1,581
|
| 202 |
+
- **Test Samples**: 396
|
| 203 |
+
- **Training Time**: ~0.78 seconds
|
| 204 |
+
- **Best Performing**: TRADEMARK (88% F1-score), CUSTOMER_SUPPORT (76% F1-score)
|
| 205 |
+
- **Challenges**: Many minority classes with insufficient training data
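
The model card metadata reports weighted precision, recall, and F1 alongside accuracy. A short sketch of how such figures can be computed with scikit-learn follows; it assumes the weighted values come from `average="weighted"` (the card does not say which function was used), and the labels are invented toy data, not samples from either dataset. It also shows why weighted recall matches accuracy (0.71 against 70.96% for UTS2017_Bank): weighting per-class recall by class support is arithmetically the same as counting correct predictions overall.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical, imbalanced toy labels: 6 CARD, 3 LOAN, 1 SAVING.
y_true = ["CARD"] * 6 + ["LOAN"] * 3 + ["SAVING"] * 1
# The classifier over-predicts the majority class and never predicts SAVING.
y_pred = ["CARD"] * 6 + ["CARD", "LOAN", "LOAN"] + ["CARD"]

accuracy = accuracy_score(y_true, y_pred)

# zero_division=0 scores a never-predicted class as 0 instead of warning.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy case weighted precision (0.75) falls below weighted recall (0.80) because the minority class is never predicted, the same pattern as the UTS2017_Bank figures (0.64 precision against 0.71 recall).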
|
| 206 |
|
| 207 |
+
## Model Parameters
|
| 208 |
|
| 209 |
+
- `dataset`: Dataset to use ("vntc" or "uts2017")
|
| 210 |
+
- `model`: Model type ("logistic" or "svc")
|
| 211 |
+
- `max_features`: Maximum number of TF-IDF features (default: 20000)
|
| 212 |
+
- `ngram_min` / `ngram_max`: Lower and upper bounds of the n-gram range (default: 1 and 2)
|
| 213 |
+
- `split_ratio`: Fraction of the UTS2017 data held out for testing (default: 0.2)
|
| 214 |
+
- `n_samples`: Optional sample limit for quick testing
|
| 215 |
|
| 216 |
## Limitations
|
| 217 |
|
| 218 |
+
1. **Language Specificity**: Only works with Vietnamese text
|
| 219 |
+
2. **Domain Specificity**: Optimized for specific domains (news and banking)
|
| 220 |
+
3. **Feature Limitations**: Vocabulary is capped at the 20,000 most frequent features
|
| 221 |
+
4. **Class Imbalance Sensitivity**: Performance degrades with imbalanced datasets
|
| 222 |
+
5. **Specific Weaknesses**:
|
| 223 |
+
- VNTC: Lower performance on lifestyle category (71% recall)
|
| 224 |
+
- UTS2017_Bank: Poor performance on minority classes
|
| 225 |
|
| 226 |
## Ethical Considerations
|
| 227 |
|
| 228 |
+
- Model reflects biases present in training datasets
|
| 229 |
+
- Performance varies significantly across categories
|
| 230 |
+
- Should be validated on target domain before deployment
|
| 231 |
+
- Consider class imbalance when interpreting results
|
| 232 |
|
| 233 |
## Additional Information
|
| 234 |
|
| 235 |
- **Repository**: https://huggingface.co/undertheseanlp/sonar_core_1
|
| 236 |
+
- **Framework Version**: scikit-learn ≥1.6
|
| 237 |
- **Python Version**: 3.10+
|
| 238 |
+
- **System Card**: See "Sonar Core 1 - System Card.md" for detailed documentation
|
| 239 |
+
|
| 240 |
+
## Citation
|
| 241 |
+
|
| 242 |
+
If you use this model, please cite:
|
| 243 |
+
|
| 244 |
+
```bibtex
|
| 245 |
+
@techreport{underthesea2025sonarcore1,
|
| 246 |
+
title = {Sonar Core 1: A Vietnamese Text Classification Model using Machine Learning},
|
| 247 |
+
author = {Vu Anh},
|
| 248 |
+
year = {2025},
|
| 249 |
+
month = {September},
|
| 250 |
+
institution = {Underthesea},
|
| 251 |
+
version = {1.0},
|
| 252 |
+
url = {https://github.com/undertheseanlp/underthesea/},
|
| 253 |
+
keywords = {text classification, vietnamese nlp, machine learning, tf-idf, logistic regression}
|
| 254 |
+
}
|
| 255 |
+
```
|
sklearn_model.joblib
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:b25b914bfacc590165e0ce35e944815cf1fda52d9d2fadf79334c5bc2754b360
|
| 3 |
-
size 2393144
|
sklearn_model_uts2017_bank.joblib
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:729e6e6d7b34dc1057275d15ce0d8475ffc1b614a13fbc0f174155e9dec4795d
|
| 3 |
-
size 3029656
|