Update README.md and remove old model files

- Update README.md with latest information
- Remove old model files with generic names
- Descriptively named, timestamped model files were already committed

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (3)

README.md +191 -78
sklearn_model.joblib +0 -3
sklearn_model_uts2017_bank.joblib +0 -3

README.md CHANGED

@@ -4,139 +4,252 @@ library_name: scikit-learn
 tags:
   - scikit-learn
   - sklearn
-  - classification
-  - tabular-classification
   - sonar
-  - random-forest
 datasets:
-  - sonar
 metrics:
   - accuracy
 model-index:
   - name: sonar-core-1
     results:
       - task:
-          type: tabular-classification
-          name: Tabular Classification
         dataset:
-          name: Sonar Dataset
-          type: sonar
         metrics:
           - type: accuracy
-            value: 0.86
             name: Test Accuracy
 language:
-  - en
-pipeline_tag: tabular-classification
 ---
-# Sonar Core Model
-A simple scikit-learn Random Forest classifier for the Sonar dataset (Rocks vs Mines classification).
 ## Model Description
-This is a Random Forest classifier trained for binary classification on sonar signal data. The model distinguishes between sonar signals bounced off metal cylinders (mines) and those bounced off rocks.
 ### Model Architecture
-- **Algorithm**: Random Forest Classifier
-- **Preprocessing**: StandardScaler normalization
-- **Framework**: scikit-learn
-- **Task**: Binary classification
-- **Input**: 60 numeric features (sonar signal frequencies)
-- **Output**: Binary classification (Rock=0, Mine=1)
 ## Installation
-Using uv:
 ```bash
-uv sync
 ```
 ## Usage
-### Training the model
 ```bash
-uv run python train.py
 ```
-### Using the model in your code
-```python
-from model import SonarModel
-import numpy as np
-# Load a pre-trained model
-model = SonarModel.load("sonar_model.pkl")
-# Make predictions
-X_new = np.random.randn(1, 60)  # 60 features for Sonar dataset
-prediction = model.predict(X_new)
-probabilities = model.predict_proba(X_new)
 ```
-### Training from scratch
-```python
-from model import SonarModel
-from sklearn.model_selection import train_test_split
-# Initialize model
-model = SonarModel(n_estimators=100, max_depth=10)
-# Train
-model.fit(X_train, y_train)
-# Evaluate
-accuracy = model.score(X_test, y_test)
-# Save
-model.save("my_model.pkl")
 ```
-## Model Parameters
-- `n_estimators`: Number of trees in the forest (default: 100)
-- `max_depth`: Maximum depth of trees (default: 10)
-- `random_state`: Random seed for reproducibility (default: 42)
-## Training
-### Training Data
-The model is designed for the Sonar dataset which contains:
-- 60 numeric features representing sonar signal frequencies (ranging from 0.0 to 1.0)
-- Binary target: Rock (R) or Mine (M)
-- Balanced classes with approximately 50% distribution
-### Training Procedure
-The model was trained using:
-- Train/test split: 80/20
-- Random state: 42 for reproducibility
-- StandardScaler preprocessing for feature normalization
-- Random Forest with 100 trees and max depth of 10
-### Evaluation
-**Test Set Performance:**
-- Accuracy: 86.0%
 ## Limitations
-- The model is trained on synthetic data for demonstration purposes
-- Actual sonar data may have different characteristics
-- Performance may vary on real-world sonar signals
-- Limited to binary classification (rock vs mine)
 ## Ethical Considerations
-This model is intended for educational and research purposes. When deploying for real-world applications:
-- Consider the consequences of false positives/negatives in mine detection
-- Ensure proper validation with actual sonar data
-- Use as part of a broader decision-making system, not as the sole detector
 ## Additional Information
 - **Repository**: https://huggingface.co/undertheseanlp/sonar_core_1
-- **Framework Version**: scikit-learn 1.7.2
 - **Python Version**: 3.10+

 tags:
   - scikit-learn
   - sklearn
+  - text-classification
+  - vietnamese
+  - nlp
   - sonar
+  - tf-idf
+  - logistic-regression
 datasets:
+  - vntc
+  - uts2017_bank
 metrics:
   - accuracy
+  - precision
+  - recall
+  - f1-score
 model-index:
   - name: sonar-core-1
     results:
       - task:
+          type: text-classification
+          name: Vietnamese News Classification
+        dataset:
+          name: VNTC
+          type: vntc
+        metrics:
+          - type: accuracy
+            value: 0.9233
+            name: Test Accuracy
+          - type: precision
+            value: 0.92
+            name: Weighted Precision
+          - type: recall
+            value: 0.92
+            name: Weighted Recall
+          - type: f1-score
+            value: 0.92
+            name: Weighted F1-Score
+      - task:
+          type: text-classification
+          name: Vietnamese Banking Text Classification
         dataset:
+          name: UTS2017_Bank
+          type: uts2017_bank
         metrics:
           - type: accuracy
+            value: 0.7096
             name: Test Accuracy
+          - type: precision
+            value: 0.64
+            name: Weighted Precision
+          - type: recall
+            value: 0.71
+            name: Weighted Recall
+          - type: f1-score
+            value: 0.63
+            name: Weighted F1-Score
 language:
+  - vi
+pipeline_tag: text-classification
 ---
+# Sonar Core 1 - Vietnamese Text Classification Model
+A machine-learning text classification model for Vietnamese. It is built on a TF-IDF feature-extraction pipeline combined with Logistic Regression and achieves **92.33% accuracy** on the VNTC (news) dataset and **70.96% accuracy** on the UTS2017_Bank (banking) dataset.
 ## Model Description
+**Sonar Core 1** is a Vietnamese text classification model covering two domains: news categorization and banking text classification. Intended uses include classifying Vietnamese news articles, categorizing banking text, general content categorization for Vietnamese text, and document organization and tagging.
 ### Model Architecture
+- **Algorithm**: TF-IDF + Logistic Regression Pipeline
+- **Feature Extraction**: CountVectorizer with 20,000 max features
+- **N-gram Support**: Unigram and bigram (1-2)
+- **TF-IDF**: Transformation with IDF weighting
+- **Classifier**: Logistic Regression with 1,000 max iterations
+- **Framework**: scikit-learn ≥1.6
+- **Caching System**: Hash-based caching for efficient processing
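The architecture bullets above can be read as a three-stage scikit-learn pipeline. The following is a minimal sketch under that reading, not the repository's actual `train.py`: the function name `build_pipeline`, the step names, and the toy corpus are assumptions made for illustration, and the hash-based caching layer is omitted because the README does not describe how it works.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(max_features=20000, ngram_min=1, ngram_max=2, max_iter=1000):
    """Assemble counts -> TF-IDF -> logistic regression (hypothetical helper)."""
    return Pipeline([
        # Raw term counts over unigrams and bigrams, capped at max_features
        ("vect", CountVectorizer(max_features=max_features,
                                 ngram_range=(ngram_min, ngram_max))),
        # Reweight the counts with inverse document frequency
        ("tfidf", TfidfTransformer(use_idf=True)),
        # Linear classifier over the TF-IDF vectors
        ("clf", LogisticRegression(max_iter=max_iter)),
    ])


# Toy corpus invented for this sketch; it is not taken from VNTC or UTS2017_Bank.
texts = [
    "đội tuyển thắng trận bóng đá",
    "cầu thủ ghi bàn trong trận đấu",
    "ngân hàng tăng lãi suất tiết kiệm",
    "lãi suất vay ngân hàng giảm",
]
labels = ["the_thao", "the_thao", "kinh_doanh", "kinh_doanh"]

pipeline = build_pipeline()
pipeline.fit(texts, labels)
predicted = pipeline.predict(["trận bóng đá tối nay"])[0]
print(predicted)
```

Keeping the vectorizer, the TF-IDF step, and the classifier in one `Pipeline` object means a single `joblib.dump` call persists everything needed to classify raw text.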
+## Supported Datasets & Categories
+### VNTC Dataset - News Categories (10 classes)
+1. **chinh_tri_xa_hoi** - Politics and Society
+2. **doi_song** - Lifestyle
+3. **khoa_hoc** - Science
+4. **kinh_doanh** - Business
+5. **phap_luat** - Law
+6. **suc_khoe** - Health
+7. **the_gioi** - World News
+8. **the_thao** - Sports
+9. **van_hoa** - Culture
+10. **vi_tinh** - Information Technology
+### UTS2017_Bank Dataset - Banking Categories (14 classes)
+1. **ACCOUNT** - Account services
+2. **CARD** - Card services
+3. **CUSTOMER_SUPPORT** - Customer support
+4. **DISCOUNT** - Discount offers
+5. **INTEREST_RATE** - Interest rate information
+6. **INTERNET_BANKING** - Internet banking services
+7. **LOAN** - Loan services
+8. **MONEY_TRANSFER** - Money transfer services
+9. **OTHER** - Other services
+10. **PAYMENT** - Payment services
+11. **PROMOTION** - Promotional offers
+12. **SAVING** - Savings accounts
+13. **SECURITY** - Security features
+14. **TRADEMARK** - Trademark/branding
 ## Installation
 ```bash
+pip install "scikit-learn>=1.6" joblib
 ```
 ## Usage
+### Training the Model
+#### VNTC Dataset (News Classification)
 ```bash
+# Default training with VNTC dataset
+python train.py --dataset vntc --model logistic
+# With specific parameters
+python train.py --dataset vntc --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
 ```
+#### UTS2017_Bank Dataset (Banking Text Classification)
+```bash
+# Train with UTS2017_Bank dataset
+python train.py --dataset uts2017 --model logistic
+# With specific parameters
+python train.py --dataset uts2017 --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
+# Compare multiple configurations
+python train.py --dataset uts2017 --compare
 ```
+### Using the Model for Prediction
+```python
+import joblib
+# Load model
+model = joblib.load('path/to/model.pkl')
+# Make prediction
+text = "Việt Nam giành chiến thắng trong trận bán kết"
+prediction = model.predict([text])[0]
+probabilities = model.predict_proba([text])[0]
+print(f"Predicted category: {prediction}")
+print(f"Confidence scores: {probabilities}")
 ```
+### Training from Scratch
+```python
+from train import train_notebook
+# Train VNTC model
+vntc_results = train_notebook(
+    dataset="vntc",
+    model_name="logistic",
+    max_features=20000,
+    ngram_min=1,
+    ngram_max=2
+)
+# Train UTS2017_Bank model
+bank_results = train_notebook(
+    dataset="uts2017",
+    model_name="logistic",
+    max_features=20000,
+    ngram_min=1,
+    ngram_max=2
+)
+```
+## Performance Metrics
+### VNTC Dataset Performance
+- **Training Accuracy**: 95.39%
+- **Test Accuracy**: 92.33%
+- **Training Samples**: 33,759
+- **Test Samples**: 50,373
+- **Training Time**: ~31.40 seconds
+- **Best Performing**: Sports (98% F1-score)
+- **Challenging Category**: Lifestyle (76% F1-score)
+### UTS2017_Bank Dataset Performance
+- **Training Accuracy**: 76.22%
+- **Test Accuracy**: 70.96%
+- **Training Samples**: 1,581
+- **Test Samples**: 396
+- **Training Time**: ~0.78 seconds
+- **Best Performing**: TRADEMARK (88% F1-score), CUSTOMER_SUPPORT (76% F1-score)
+- **Challenges**: Many minority classes with insufficient training data
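The UTS2017_Bank metadata reports a Weighted Recall (0.71) that matches Test Accuracy (0.7096), while Weighted Precision (0.64) is lower. The first match is expected: support-weighted recall is algebraically equal to accuracy. The sketch below illustrates this in plain Python. The function name `weighted_scores` and the toy labels are invented for the example, and classes that are never predicted are given a precision of 0, which is an assumed convention.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted recall, weighted F1)."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        p_c = true_pos / n_pred if n_pred else 0.0  # assumed zero-division rule
        r_c = true_pos / n_true
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n_true / total  # each class counts in proportion to its support
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return accuracy, precision, recall, f1


# Imbalanced toy data: the single SAVING example is misclassified as CARD.
y_true = ["CARD"] * 6 + ["LOAN"] * 3 + ["SAVING"] * 1
y_pred = ["CARD"] * 6 + ["CARD", "LOAN", "LOAN"] + ["CARD"]

accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

In this toy case the minority class contributes zero precision and zero recall, which pulls weighted precision and F1 below accuracy. That is the same pattern the banking results show.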
+## Model Parameters
+- `dataset`: Dataset to use ("vntc" or "uts2017")
+- `model`: Model type ("logistic" or "svc")
+- `max_features`: Maximum number of TF-IDF features (default: 20000)
+- `ngram_min` / `ngram_max`: Lower and upper bounds of the n-gram range (default: 1 and 2)
+- `split_ratio`: Fraction of the UTS2017 data held out for testing (default: 0.2)
+- `n_samples`: Optional sample limit for quick testing
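As a quick consistency check, the reported UTS2017_Bank sample counts (1,581 training, 396 test) agree with a `split_ratio` of 0.2 applied to 1,977 examples. The sketch below assumes the test size is rounded up, as scikit-learn's `train_test_split` does for a fractional `test_size`. Whether `train.py` actually splits this way is not stated in the README, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test share up (assumed rule)."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test


# Total taken from the reported training and test sample counts
n_train, n_test = split_sizes(1581 + 396, 0.2)
print(n_train, n_test)
```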
 ## Limitations
+1. **Language Specificity**: Only works with Vietnamese text
+2. **Domain Specificity**: Optimized for specific domains (news and banking)
+3. **Feature Limitations**: Limited to 20,000 most frequent features
+4. **Class Imbalance Sensitivity**: Performance degrades with imbalanced datasets
+5. **Specific Weaknesses**:
+   - VNTC: Lower performance on lifestyle category (71% recall)
+   - UTS2017_Bank: Poor performance on minority classes
 ## Ethical Considerations
+- Model reflects biases present in training datasets
+- Performance varies significantly across categories
+- Should be validated on target domain before deployment
+- Consider class imbalance when interpreting results
 ## Additional Information
 - **Repository**: https://huggingface.co/undertheseanlp/sonar_core_1
+- **Framework Version**: scikit-learn ≥1.6
 - **Python Version**: 3.10+
+- **System Card**: See "Sonar Core 1 - System Card.md" for detailed documentation
+## Citation
+If you use this model, please cite:
+```bibtex
+@techreport{underthesea2025sonarcore1,
+  title = {Sonar Core 1: A Vietnamese Text Classification Model using Machine Learning},
+  author = {Vu Anh},
+  year = {2025},
+  month = {September},
+  institution = {Underthesea},
+  version = {1.0},
+  url = {https://github.com/undertheseanlp/underthesea/},
+  keywords = {text classification, vietnamese nlp, machine learning, tf-idf, logistic regression}
+}
+```

sklearn_model.joblib DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:b25b914bfacc590165e0ce35e944815cf1fda52d9d2fadf79334c5bc2754b360
-size 2393144

sklearn_model_uts2017_bank.joblib DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:729e6e6d7b34dc1057275d15ce0d8475ffc1b614a13fbc0f174155e9dec4795d
-size 3029656
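The three removed lines in each `.joblib` diff are Git LFS pointer files, not the model binaries themselves: each records the spec version, the SHA-256 object ID, and the size in bytes. A minimal sketch of reading such a pointer follows. The function name `parse_lfs_pointer` is hypothetical, and the sample text is copied from the `sklearn_model.joblib` hunk above.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    # Each line is "<key> <value>", separated by the first space
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real object in bytes
    }


pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b25b914bfacc590165e0ce35e944815cf1fda52d9d2fadf79334c5bc2754b360
size 2393144
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], pointer["size"])
```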