Vu Anh Claude committed on
Commit
0712d08
·
1 Parent(s): 9cf063d

Update README.md and remove old model files

- Update README.md with latest information
- Remove old model files with generic names
- New descriptive model files already committed with timestamps

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

README.md CHANGED
@@ -4,139 +4,252 @@ library_name: scikit-learn
4
  tags:
5
  - scikit-learn
6
  - sklearn
7
- - classification
8
- - tabular-classification
 
9
  - sonar
10
- - random-forest
 
11
  datasets:
12
- - sonar
 
13
  metrics:
14
  - accuracy
 
 
 
15
  model-index:
16
  - name: sonar-core-1
17
  results:
18
  - task:
19
- type: tabular-classification
20
- name: Tabular Classification
21
  dataset:
22
- name: Sonar Dataset
23
- type: sonar
24
  metrics:
25
  - type: accuracy
26
- value: 0.86
27
  name: Test Accuracy
28
  language:
29
- - en
30
- pipeline_tag: tabular-classification
31
  ---
32
 
33
- # Sonar Core Model
34
 
35
- A simple scikit-learn Random Forest classifier for the Sonar dataset (Rocks vs Mines classification).
36
 
37
  ## Model Description
38
 
39
- This is a Random Forest classifier trained for binary classification on sonar signal data. The model distinguishes between sonar signals bounced off metal cylinders (mines) and those bounced off rocks.
40
 
41
  ### Model Architecture
42
 
43
- - **Algorithm**: Random Forest Classifier
44
- - **Preprocessing**: StandardScaler normalization
45
- - **Framework**: scikit-learn
46
- - **Task**: Binary classification
47
- - **Input**: 60 numeric features (sonar signal frequencies)
48
- - **Output**: Binary classification (Rock=0, Mine=1)
49
 
50
  ## Installation
51
 
52
- Using uv:
53
  ```bash
54
- uv sync
55
  ```
56
 
57
  ## Usage
58
 
59
- ### Training the model
 
 
60
  ```bash
61
- uv run python train.py
 
 
 
 
62
  ```
63
 
64
- ### Using the model in your code
65
- ```python
66
- from model import SonarModel
67
- import numpy as np
68
 
69
- # Load a pre-trained model
70
- model = SonarModel.load("sonar_model.pkl")
71
 
72
- # Make predictions
73
- X_new = np.random.randn(1, 60) # 60 features for Sonar dataset
74
- prediction = model.predict(X_new)
75
- probabilities = model.predict_proba(X_new)
76
  ```
77
 
78
- ### Training from scratch
79
- ```python
80
- from model import SonarModel
81
- from sklearn.model_selection import train_test_split
82
 
83
- # Initialize model
84
- model = SonarModel(n_estimators=100, max_depth=10)
85
 
86
- # Train
87
- model.fit(X_train, y_train)
88
 
89
- # Evaluate
90
- accuracy = model.score(X_test, y_test)
 
 
91
 
92
- # Save
93
- model.save("my_model.pkl")
94
  ```
95
 
96
- ## Model Parameters
97
-
98
- - `n_estimators`: Number of trees in the forest (default: 100)
99
- - `max_depth`: Maximum depth of trees (default: 10)
100
- - `random_state`: Random seed for reproducibility (default: 42)
101
-
102
- ## Training
103
 
104
- ### Training Data
105
-
106
- The model is designed for the Sonar dataset which contains:
107
- - 60 numeric features representing sonar signal frequencies (ranging from 0.0 to 1.0)
108
- - Binary target: Rock (R) or Mine (M)
109
- - Balanced classes with approximately 50% distribution
110
-
111
- ### Training Procedure
112
 
113
- The model was trained using:
114
- - Train/test split: 80/20
115
- - Random state: 42 for reproducibility
116
- - StandardScaler preprocessing for feature normalization
117
- - Random Forest with 100 trees and max depth of 10
118
 
119
- ### Evaluation
120
 
121
- **Test Set Performance:**
122
- - Accuracy: 86.0%
 
 
 
 
123
 
124
  ## Limitations
125
 
126
- - The model is trained on synthetic data for demonstration purposes
127
- - Actual sonar data may have different characteristics
128
- - Performance may vary on real-world sonar signals
129
- - Limited to binary classification (rock vs mine)
 
 
 
130
 
131
  ## Ethical Considerations
132
 
133
- This model is intended for educational and research purposes. When deploying for real-world applications:
134
- - Consider the consequences of false positives/negatives in mine detection
135
- - Ensure proper validation with actual sonar data
136
- - Use as part of a broader decision-making system, not as the sole detector
137
 
138
  ## Additional Information
139
 
140
  - **Repository**: https://huggingface.co/undertheseanlp/sonar_core_1
141
- - **Framework Version**: scikit-learn 1.7.2
142
  - **Python Version**: 3.10+

4
  tags:
5
  - scikit-learn
6
  - sklearn
7
+ - text-classification
8
+ - vietnamese
9
+ - nlp
10
  - sonar
11
+ - tf-idf
12
+ - logistic-regression
13
  datasets:
14
+ - vntc
15
+ - uts2017_bank
16
  metrics:
17
  - accuracy
18
+ - precision
19
+ - recall
20
+ - f1-score
21
  model-index:
22
  - name: sonar-core-1
23
  results:
24
  - task:
25
+ type: text-classification
26
+ name: Vietnamese News Classification
27
+ dataset:
28
+ name: VNTC
29
+ type: vntc
30
+ metrics:
31
+ - type: accuracy
32
+ value: 0.9233
33
+ name: Test Accuracy
34
+ - type: precision
35
+ value: 0.92
36
+ name: Weighted Precision
37
+ - type: recall
38
+ value: 0.92
39
+ name: Weighted Recall
40
+ - type: f1-score
41
+ value: 0.92
42
+ name: Weighted F1-Score
43
+ - task:
44
+ type: text-classification
45
+ name: Vietnamese Banking Text Classification
46
  dataset:
47
+ name: UTS2017_Bank
48
+ type: uts2017_bank
49
  metrics:
50
  - type: accuracy
51
+ value: 0.7096
52
  name: Test Accuracy
53
+ - type: precision
54
+ value: 0.64
55
+ name: Weighted Precision
56
+ - type: recall
57
+ value: 0.71
58
+ name: Weighted Recall
59
+ - type: f1-score
60
+ value: 0.63
61
+ name: Weighted F1-Score
62
  language:
63
+ - vi
64
+ pipeline_tag: text-classification
65
  ---
66
 
67
+ # Sonar Core 1 - Vietnamese Text Classification Model
68
 
69
+ A machine learning-based text classification model for Vietnamese. It combines a TF-IDF feature extraction pipeline with Logistic Regression and reaches **92.33% accuracy** on the VNTC (news) dataset and **70.96% accuracy** on the UTS2017_Bank (banking) dataset.
70
 
71
  ## Model Description
72
 
73
+ **Sonar Core 1** is a Vietnamese text classification model covering two domains: news categorization and banking text classification. Intended uses include Vietnamese news article classification, banking text categorization, general content categorization for Vietnamese text, and document organization and tagging.
74
 
75
  ### Model Architecture
76
 
77
+ - **Algorithm**: TF-IDF + Logistic Regression Pipeline
78
+ - **Feature Extraction**: CountVectorizer with 20,000 max features
79
+ - **N-gram Support**: Unigram and bigram (1-2)
80
+ - **TF-IDF**: Transformation with IDF weighting
81
+ - **Classifier**: Logistic Regression with 1,000 max iterations
82
+ - **Framework**: scikit-learn 1.6
83
+ - **Caching System**: Hash-based caching for efficient processing
84
+
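The architecture listed above maps naturally onto a scikit-learn `Pipeline`. The following is a minimal sketch under that assumption, not the repository's actual training code: the function name `build_pipeline`, the step names, and the toy sentences are illustrative, while the hyperparameters (20,000 features, 1-2 n-grams, 1,000 iterations) come from the list.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(max_features=20000, ngram_range=(1, 2), max_iter=1000):
    """Assemble count features, TF-IDF weighting, and a linear classifier."""
    return Pipeline([
        # Unigram and bigram counts, capped at the most frequent features
        ("vect", CountVectorizer(max_features=max_features, ngram_range=ngram_range)),
        # Re-weight raw counts with inverse document frequency
        ("tfidf", TfidfTransformer(use_idf=True)),
        # Logistic Regression classifier
        ("clf", LogisticRegression(max_iter=max_iter)),
    ])


# Toy data only, to show the fit/predict flow; not drawn from VNTC.
texts = [
    "đội tuyển thắng trận bóng đá",
    "ngân hàng tăng lãi suất",
    "cầu thủ ghi bàn",
    "lãi suất vay giảm",
]
labels = ["the_thao", "kinh_doanh", "the_thao", "kinh_doanh"]

pipeline = build_pipeline()
pipeline.fit(texts, labels)
prediction = pipeline.predict(["trận bóng đá hôm nay"])[0]
print(prediction)
```

Keeping vectorization and classification in one pipeline object means a single `joblib` file carries both the vocabulary and the classifier weights, which is what lets `model.predict([text])` accept raw strings.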
85
+ ## Supported Datasets & Categories
86
+
87
+ ### VNTC Dataset - News Categories (10 classes)
88
+ 1. **chinh_tri_xa_hoi** - Politics and Society
89
+ 2. **doi_song** - Lifestyle
90
+ 3. **khoa_hoc** - Science
91
+ 4. **kinh_doanh** - Business
92
+ 5. **phap_luat** - Law
93
+ 6. **suc_khoe** - Health
94
+ 7. **the_gioi** - World News
95
+ 8. **the_thao** - Sports
96
+ 9. **van_hoa** - Culture
97
+ 10. **vi_tinh** - Information Technology
98
+
99
+ ### UTS2017_Bank Dataset - Banking Categories (14 classes)
100
+ 1. **ACCOUNT** - Account services
101
+ 2. **CARD** - Card services
102
+ 3. **CUSTOMER_SUPPORT** - Customer support
103
+ 4. **DISCOUNT** - Discount offers
104
+ 5. **INTEREST_RATE** - Interest rate information
105
+ 6. **INTERNET_BANKING** - Internet banking services
106
+ 7. **LOAN** - Loan services
107
+ 8. **MONEY_TRANSFER** - Money transfer services
108
+ 9. **OTHER** - Other services
109
+ 10. **PAYMENT** - Payment services
110
+ 11. **PROMOTION** - Promotional offers
111
+ 12. **SAVING** - Savings accounts
112
+ 13. **SECURITY** - Security features
113
+ 14. **TRADEMARK** - Trademark/branding
114
 
115
  ## Installation
116
 
 
117
  ```bash
118
+ pip install "scikit-learn>=1.6" joblib
119
  ```
120
 
121
  ## Usage
122
 
123
+ ### Training the Model
124
+
125
+ #### VNTC Dataset (News Classification)
126
  ```bash
127
+ # Default training with VNTC dataset
128
+ python train.py --dataset vntc --model logistic
129
+
130
+ # With specific parameters
131
+ python train.py --dataset vntc --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
132
  ```
133
 
134
+ #### UTS2017_Bank Dataset (Banking Text Classification)
135
+ ```bash
136
+ # Train with UTS2017_Bank dataset
137
+ python train.py --dataset uts2017 --model logistic
138
 
139
+ # With specific parameters
140
+ python train.py --dataset uts2017 --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
141
 
142
+ # Compare multiple configurations
143
+ python train.py --dataset uts2017 --compare
 
 
144
  ```
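For readers who want to see how these flags could be wired up, here is a hypothetical `argparse` sketch. The flag names are taken from the commands above; the parser itself, the default values, and the `choices` lists are assumptions based on the README's parameter descriptions, not the contents of `train.py`.

```python
import argparse


def build_parser():
    """Hypothetical CLI mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Train a Sonar Core 1 classifier")
    parser.add_argument("--dataset", choices=["vntc", "uts2017"], default="vntc")
    parser.add_argument("--model", choices=["logistic", "svc"], default="logistic")
    parser.add_argument("--max-features", type=int, default=20000)
    parser.add_argument("--ngram-min", type=int, default=1)
    parser.add_argument("--ngram-max", type=int, default=2)
    # Assumed to be a boolean switch, since it is passed without a value
    parser.add_argument("--compare", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(["--dataset", "uts2017", "--compare"])
print(args.dataset, args.model, args.max_features, args.ngram_min, args.ngram_max, args.compare)
```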
145
 
146
+ ### Using the Model for Prediction
 
 
 
147
 
148
+ ```python
149
+ import joblib
150
 
151
+ # Load model
152
+ model = joblib.load('path/to/model.pkl')
153
 
154
+ # Make prediction
155
+ text = "Việt Nam giành chiến thắng trong trận bán kết"
156
+ prediction = model.predict([text])[0]
157
+ probabilities = model.predict_proba([text])[0]
158
 
159
+ print(f"Predicted category: {prediction}")
160
+ print(f"Confidence scores: {probabilities}")
161
  ```
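`predict_proba` returns a bare array, so the scores are easier to read once they are paired with `model.classes_`. The helper below is a sketch of that pairing; `top_k` is a hypothetical name, and the stand-in model is trained on three toy sentences only so the example runs without the real model file.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def top_k(model, text, k=3):
    """Return the k most probable (label, probability) pairs for one text."""
    probs = model.predict_proba([text])[0]
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(str(model.classes_[i]), float(probs[i])) for i in order]


# Stand-in model on toy data; replace with joblib.load(...) for the real one.
toy = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
toy.fit(
    [
        "đội tuyển thắng trận bán kết",
        "ngân hàng giảm lãi suất vay",
        "bệnh viện khám sức khỏe miễn phí",
    ],
    ["the_thao", "kinh_doanh", "suc_khoe"],
)

ranked = top_k(toy, "Việt Nam giành chiến thắng trong trận bán kết", k=3)
for label, prob in ranked:
    print(f"{label}: {prob:.3f}")
```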
162
 
163
+ ### Training from Scratch
 
 
 
 
 
 
164
 
165
+ ```python
166
+ from train import train_notebook
167
+
168
+ # Train VNTC model
169
+ vntc_results = train_notebook(
170
+     dataset="vntc",
171
+     model_name="logistic",
172
+     max_features=20000,
173
+     ngram_min=1,
174
+     ngram_max=2
175
+ )
176
+
177
+ # Train UTS2017_Bank model
178
+ bank_results = train_notebook(
179
+     dataset="uts2017",
180
+     model_name="logistic",
181
+     max_features=20000,
182
+     ngram_min=1,
183
+     ngram_max=2
184
+ )
185
+ ```
186
 
187
+ ## Performance Metrics
188
+
189
+ ### VNTC Dataset Performance
190
+ - **Training Accuracy**: 95.39%
191
+ - **Test Accuracy**: 92.33%
192
+ - **Training Samples**: 33,759
193
+ - **Test Samples**: 50,373
194
+ - **Training Time**: ~31.40 seconds
195
+ - **Best Performing**: Sports (98% F1-score)
196
+ - **Challenging Category**: Lifestyle (76% F1-score)
197
+
198
+ ### UTS2017_Bank Dataset Performance
199
+ - **Training Accuracy**: 76.22%
200
+ - **Test Accuracy**: 70.96%
201
+ - **Training Samples**: 1,581
202
+ - **Test Samples**: 396
203
+ - **Training Time**: ~0.78 seconds
204
+ - **Best Performing**: TRADEMARK (88% F1-score), CUSTOMER_SUPPORT (76% F1-score)
205
+ - **Challenges**: Many minority classes with insufficient training data
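The weighted precision, recall, and F1 values in the model-index metadata can be computed with `sklearn.metrics`. The sketch below uses made-up labels (not UTS2017_Bank predictions) to show two things: weighted recall always equals accuracy, and macro averaging exposes weak minority classes that weighted averaging hides.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy, imbalanced labels for illustration only
y_true = ["CARD", "CARD", "CARD", "CARD", "LOAN", "LOAN", "OTHER", "OTHER"]
y_pred = ["CARD", "CARD", "CARD", "CARD", "LOAN", "CARD", "CARD", "OTHER"]

accuracy = accuracy_score(y_true, y_pred)

# Weighted: per-class scores averaged by class support
w_precision, w_recall, w_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

# Macro: per-class scores averaged equally, so small classes count as much as large ones
m_precision, m_recall, m_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

print(f"accuracy={accuracy:.4f} weighted_recall={w_recall:.4f} macro_recall={m_recall:.4f}")
```

This is consistent with the figures reported for UTS2017_Bank, where the weighted recall (0.71) matches the test accuracy (70.96%) after rounding.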
206
 
207
+ ## Model Parameters
208
 
209
+ - `dataset`: Dataset to use ("vntc" or "uts2017")
210
+ - `model`: Model type ("logistic" or "svc")
211
+ - `max_features`: Maximum number of TF-IDF features (default: 20000)
212
+ - `ngram_min/max`: N-gram range (default: 1-2)
213
+ - `split_ratio`: Fraction of the UTS2017 data held out for testing (default: 0.2)
214
+ - `n_samples`: Optional sample limit for quick testing
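As a sanity check on `split_ratio`, the sample counts reported for UTS2017_Bank (1,581 training and 396 test samples) are what a 0.2 hold-out of 1,977 examples produces. The sketch below assumes scikit-learn's `train_test_split` is used for the split, which the README does not state; the `random_state` value is likewise an assumption and does not affect the sizes.

```python
from sklearn.model_selection import train_test_split

# Total size taken from the reported training + test sample counts
n_total = 1581 + 396
indices = list(range(n_total))

# test_size=0.2 mirrors the documented split_ratio default
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

print(len(train_idx), len(test_idx))
```

scikit-learn rounds the test portion up, so 0.2 × 1,977 = 395.4 becomes 396 test samples and the remaining 1,581 go to training.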
215
 
216
  ## Limitations
217
 
218
+ 1. **Language Specificity**: Only works with Vietnamese text
219
+ 2. **Domain Specificity**: Optimized for specific domains (news and banking)
220
+ 3. **Feature Limitations**: Vocabulary is capped at the 20,000 most frequent unigram and bigram features
221
+ 4. **Class Imbalance Sensitivity**: Performance degrades with imbalanced datasets
222
+ 5. **Specific Weaknesses**:
223
+ - VNTC: Lower performance on lifestyle category (71% recall)
224
+ - UTS2017_Bank: Poor performance on minority classes
225
 
226
  ## Ethical Considerations
227
 
228
+ - Model reflects biases present in training datasets
229
+ - Performance varies significantly across categories
230
+ - Should be validated on target domain before deployment
231
+ - Consider class imbalance when interpreting results
232
 
233
  ## Additional Information
234
 
235
  - **Repository**: https://huggingface.co/undertheseanlp/sonar_core_1
236
+ - **Framework Version**: scikit-learn 1.6
237
  - **Python Version**: 3.10+
238
+ - **System Card**: See "Sonar Core 1 - System Card.md" for detailed documentation
239
+
240
+ ## Citation
241
+
242
+ If you use this model, please cite:
243
+
244
+ ```bibtex
245
+ @techreport{underthesea2025sonarcore1,
246
+ title = {Sonar Core 1: A Vietnamese Text Classification Model using Machine Learning},
247
+ author = {Vu Anh},
248
+ year = {2025},
249
+ month = {September},
250
+ institution = {Underthesea},
251
+ version = {1.0},
252
+ url = {https://github.com/undertheseanlp/underthesea/},
253
+ keywords = {text classification, vietnamese nlp, machine learning, tf-idf, logistic regression}
254
+ }
255
+ ```
sklearn_model.joblib DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:b25b914bfacc590165e0ce35e944815cf1fda52d9d2fadf79334c5bc2754b360
3
- size 2393144
 
 
 
 
sklearn_model_uts2017_bank.joblib DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:729e6e6d7b34dc1057275d15ce0d8475ffc1b614a13fbc0f174155e9dec4795d
3
- size 3029656