---
license: apache-2.0
library_name: scikit-learn
tags:
  - scikit-learn
  - sklearn
  - text-classification
  - vietnamese
  - nlp
  - pulse
  - tf-idf
  - logistic-regression
  - svc
  - support-vector-classification
  - aspect-sentiment-analysis
  - banking
  - financial-nlp
datasets:
  - undertheseanlp/UTS2017_Bank
  - ura-hcmut/vlsp2016
metrics:
  - accuracy
  - precision
  - recall
  - f1-score
model-index:
  - name: pulse-core-1
    results:
      - task:
          type: text-classification
          name: Vietnamese General Sentiment Analysis
        dataset:
          name: VLSP2016
          type: ura-hcmut/vlsp2016
        metrics:
          - type: accuracy
            value: 0.7114
            name: Test Accuracy (SVC Linear)
          - type: accuracy
            value: 0.7019
            name: Test Accuracy (Logistic Regression)
          - type: f1-score
            value: 0.713
            name: Weighted F1-Score (SVC)
          - type: f1-score
            value: 0.703
            name: Weighted F1-Score (Logistic Regression)
      - task:
          type: text-classification
          name: Vietnamese Banking Aspect Sentiment Analysis
        dataset:
          name: UTS2017_Bank
          type: undertheseanlp/UTS2017_Bank
        metrics:
          - type: accuracy
            value: 0.7172
            name: Test Accuracy (SVC)
          - type: accuracy
            value: 0.6818
            name: Test Accuracy (Logistic Regression)
          - type: precision
            value: 0.65
            name: Weighted Precision (SVC)
          - type: recall
            value: 0.72
            name: Weighted Recall (SVC)
          - type: f1-score
            value: 0.66
            name: Weighted F1-Score (SVC)
          - type: f1-score
            value: 0.66
            name: Weighted F1-Score (Logistic Regression)
language:
  - vi
pipeline_tag: text-classification
---

# Pulse Core 1 - Vietnamese Sentiment Analysis System

A machine-learning sentiment analysis system for Vietnamese text. Built on a TF-IDF feature-extraction pipeline combined with classical machine learning classifiers, it achieves **71.14% accuracy** on the VLSP2016 general sentiment dataset and **71.72% accuracy** on the UTS2017_Bank banking aspect sentiment dataset using Support Vector Classification (SVC).

📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/pulse_core_1/blob/main/paper/pulse_core_1_technical_report.tex)** for comprehensive model documentation, performance analysis, and limitations.

## Model Description

**Pulse Core 1** is a versatile Vietnamese sentiment analysis system that supports both general sentiment classification (positive/negative/neutral) and banking aspect sentiment analysis, which combines banking aspects with sentiment polarities. It is designed for Vietnamese text analysis across multiple domains, with specialized support for banking customer-feedback analysis and financial-service categorization.

### Model Architecture

- **Algorithm**: TF-IDF + SVC/Logistic Regression pipeline
- **Feature Extraction**: CountVectorizer with a 20,000-feature vocabulary cap
- **N-gram Support**: unigrams and bigrams (1-2)
- **Weighting**: TF-IDF transformation with IDF weighting on the raw counts
- **Classifier**: Support Vector Classification (SVC) or Logistic Regression with tuned parameters
- **Framework**: scikit-learn ≥1.6
- **Caching System**: hash-based caching for efficient processing
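
The architecture above can be sketched as a scikit-learn pipeline. This is an illustrative sketch, not the released training code: the step names and the `LinearSVC` estimator choice are assumptions.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Sketch of the TF-IDF + SVC pipeline described above.
# Step names and the LinearSVC estimator are illustrative assumptions.
pipeline = Pipeline([
    ("counts", CountVectorizer(max_features=20000, ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),  # IDF weighting on raw counts
    ("clf", LinearSVC()),           # linear support vector classifier
])

# Toy fit/predict to show the interface (real training uses VLSP2016 / UTS2017_Bank).
texts = ["dịch vụ rất tốt", "lãi suất quá cao", "bình thường"]
labels = ["positive", "negative", "neutral"]
pipeline.fit(texts, labels)
print(pipeline.predict(["sản phẩm tốt"]))
```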

## Supported Datasets & Categories

### VLSP2016 Dataset - General Sentiment Analysis (3 classes)

**Sentiment Categories:**
- **positive** - Positive sentiment towards products/services
- **negative** - Negative sentiment towards products/services
- **neutral** - Neutral or mixed sentiment

**Dataset Statistics:**
- Training samples: 5,100 (1,700 per class)
- Test samples: 1,050 (350 per class)
- Balanced distribution across all sentiment classes
- Domain: General product and service reviews

### UTS2017_Bank Dataset - Banking Aspect Sentiment (35 combined classes)

**Banking Aspects:**
1. **ACCOUNT** - Account services
2. **CARD** - Card services
3. **CUSTOMER_SUPPORT** - Customer support
4. **DISCOUNT** - Discount offers
5. **INTEREST_RATE** - Interest rate information
6. **INTERNET_BANKING** - Internet banking services
7. **LOAN** - Loan services
8. **MONEY_TRANSFER** - Money transfer services
9. **OTHER** - Other services
10. **PAYMENT** - Payment services
11. **PROMOTION** - Promotional offers
12. **SAVING** - Savings accounts
13. **SECURITY** - Security features
14. **TRADEMARK** - Trademark/branding

**Sentiments:**
- **positive** - Positive sentiment
- **negative** - Negative sentiment
- **neutral** - Neutral sentiment

**Combined Labels:** The model predicts a single combined aspect-sentiment label in the format `<aspect>#<sentiment>` (35 of the 42 possible aspect-sentiment combinations occur in the data), such as:
- `CUSTOMER_SUPPORT#negative` - Negative feedback about customer support
- `LOAN#positive` - Positive opinion about loan services
- `TRADEMARK#positive` - Positive brand perception
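
Because `#` separates the two parts, downstream code can split a predicted label back into its aspect and sentiment. A minimal helper (hypothetical, not part of the released scripts):

```python
def split_label(label: str) -> tuple[str, str]:
    """Split a combined '<aspect>#<sentiment>' label into its two parts."""
    aspect, sentiment = label.split("#", 1)
    return aspect, sentiment

print(split_label("CUSTOMER_SUPPORT#negative"))  # ('CUSTOMER_SUPPORT', 'negative')
```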

## Installation

```bash
pip install "scikit-learn>=1.6" joblib
```

## Usage

### Training the Model

#### Dataset Selection and Training

**VLSP2016 Dataset (General Sentiment Analysis):**
```bash
# Train on VLSP2016 with Logistic Regression
python train.py --dataset vlsp2016 --model logistic

# Train with SVC for better performance
python train.py --dataset vlsp2016 --model svc_linear

# Compare n-gram ranges
python train.py --dataset vlsp2016 --model svc_linear --ngram-min 1 --ngram-max 2
python train.py --dataset vlsp2016 --model svc_linear --ngram-min 1 --ngram-max 3

# Export model for deployment
python train.py --dataset vlsp2016 --model svc_linear --export-model
```

**UTS2017_Bank Dataset (Banking Aspect Sentiment Analysis):**
```bash
# Train on UTS2017_Bank (default dataset)
python train.py --dataset uts2017 --model logistic

# Train with SVC for better performance
python train.py --dataset uts2017 --model svc_linear

# With specific parameters
python train.py --dataset uts2017 --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2

# Export model for deployment
python train.py --dataset uts2017 --model logistic --export-model

# Compare multiple models on specific dataset
python train.py --dataset vlsp2016 --compare-models logistic svc_linear
```

### Training from Scratch

```python
from train import train_notebook

# Train VLSP2016 general sentiment model
results = train_notebook(
    dataset="vlsp2016",
    model_name="svc_linear",
    max_features=20000,
    ngram_min=1,
    ngram_max=2,
    export_model=True
)

# Train UTS2017_Bank aspect sentiment model
results = train_notebook(
    dataset="uts2017",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2,
    export_model=True
)

# Compare multiple models on VLSP2016
comparison_results = train_notebook(
    dataset="vlsp2016",
    compare=True
)
```

## Performance Metrics

### VLSP2016 General Sentiment Analysis Performance
- **Training Accuracy**: 94.57% (SVC Linear)
- **Test Accuracy**: 71.14% (SVC Linear, 1-2 ngram) / 70.67% (SVC Linear, 1-3 ngram) / 70.19% (Logistic Regression)
- **Training Samples**: 5,100 (balanced: 1,700 per class)
- **Test Samples**: 1,050 (balanced: 350 per class)
- **Number of Classes**: 3 sentiment polarities
- **Training Time**: ~24.95 seconds (SVC) / 0.75 seconds (LR)
- **Per-Class Performance (SVC Linear)**:
  - **Positive**: 80% precision, 72% recall, 76% F1-score
  - **Negative**: 70% precision, 72% recall, 71% F1-score
  - **Neutral**: 65% precision, 69% recall, 67% F1-score
- **Key Insights**: Consistent performance across all sentiment classes due to balanced dataset
- **Optimal N-gram**: Bigrams (1-2) outperform trigrams (1-3) by 0.47 percentage points

### UTS2017_Bank Aspect Sentiment Analysis Performance
- **Training Accuracy**: 94.57% (SVC)
- **Test Accuracy**: 71.72% (SVC) / 68.18% (Logistic Regression)
- **Training Samples**: 1,581
- **Test Samples**: 396
- **Number of Classes**: 35 aspect-sentiment combinations
- **Training Time**: ~5.3 seconds (SVC) / 2.13 seconds (LR)
- **Best Performing Classes**:
  - `TRADEMARK#positive`: 90% F1-score
  - `CUSTOMER_SUPPORT#positive`: 88% F1-score
  - `LOAN#negative`: 67% F1-score (SVC improvement over LR)
  - `CUSTOMER_SUPPORT#negative`: 65% F1-score
- **Challenges**: Class imbalance affects minority aspect-sentiment combinations
- **Key Finding**: SVC shows superior category diversity compared to Logistic Regression

### Cross-Dataset Performance Analysis
- **Consistent SVC Performance**: ~71% accuracy on both 3-class (VLSP2016) and 35-class (UTS2017_Bank) tasks
- **Balance Impact**: Balanced datasets (VLSP2016) yield consistent per-class results while imbalanced datasets create performance variations
- **Training Efficiency**: Larger balanced datasets require more training time but provide stable results
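
The weighted precision/recall/F1 figures above are scikit-learn's support-weighted averages. Assuming access to the test labels and model predictions, they can be reproduced with `classification_report` (the data below is a toy stand-in, not the actual test set):

```python
from sklearn.metrics import classification_report

# Toy ground truth and predictions standing in for the real test split.
y_true = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "positive", "positive", "neutral", "neutral"]

report = classification_report(y_true, y_pred, output_dict=True)
# The weighted average accounts for per-class support, which matters for
# UTS2017_Bank, where aspect-sentiment classes are heavily imbalanced.
print(f"weighted F1: {report['weighted avg']['f1-score']:.3f}")
```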

## Using the Pre-trained Models

### Local Models (General and Banking Aspect Sentiment)

```python
import joblib

# Load VLSP2016 general sentiment model
general_model = joblib.load("vlsp2016_sentiment_20250929_075529.joblib")

# Load UTS2017_Bank aspect sentiment model
banking_model = joblib.load("uts2017_sentiment_20250928_131716.joblib")

# Or use inference script directly
from inference import predict_text

# General sentiment analysis
general_text = "Sản phẩm này rất tốt, tôi rất hài lòng"
prediction, confidence, top_predictions = predict_text(general_model, general_text)
print(f"General Sentiment: {prediction}")  # Expected: positive

# Banking aspect sentiment analysis
bank_text = "Lãi suất vay mua nhà hiện tại quá cao"
prediction, confidence, top_predictions = predict_text(banking_model, bank_text)
print(f"Banking Aspect-Sentiment: {prediction}")  # Expected: INTEREST_RATE#negative

print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f"  {i}. {category}: {prob:.3f}")

# Example output for banking text:
# Banking Aspect-Sentiment: INTEREST_RATE#negative
# Confidence: 0.509
# Top 3 predictions:
#   1. INTEREST_RATE#negative: 0.509
#   2. LOAN#negative: 0.218
#   3. CUSTOMER_SUPPORT#negative: 0.095
```

### Using the Inference Script

```bash
# Interactive mode
python inference.py

# Single prediction
python inference.py --text "Lãi suất vay mua nhà hiện tại quá cao"

# Test with examples
python inference.py --test-examples

# List available models
python inference.py --list-models
```


## Model Parameters

- `dataset`: Dataset selection ("vlsp2016" for general sentiment, "uts2017" for banking aspect sentiment)
- `model`: Model type ("logistic", "svc_linear", "svc_rbf", "naive_bayes", "decision_tree", "random_forest", etc.)
- `max_features`: Maximum number of TF-IDF features (default: 20000)
- `ngram_min/max`: N-gram range (default: 1-2, optimal for Vietnamese)
- `split_ratio`: Train/test split ratio (default: 0.2, only used for uts2017)
- `n_samples`: Optional sample limit for quick testing
- `export_model`: Export model for deployment (creates `<dataset>_sentiment_<timestamp>.joblib`)
- `compare`: Compare multiple model configurations
- `compare_models`: Specify models to compare

## Project Management

### Cleanup Utility

The project includes a cleanup script to manage training runs:

```bash
# Preview runs that will be deleted (without exported models)
uv run python clean.py --dry-run --verbose

# Clean up runs without exported models
uv run python clean.py --yes

# Interactive cleanup with confirmation
uv run python clean.py
```

**Features:**
- Automatically identifies runs without exported model files
- Shows space that will be freed
- Dry-run mode for safe previewing
- Detailed information about each run
- Preserves runs with exported models

## Limitations

1. **Language Specificity**: Only works with Vietnamese text
2. **Domain Coverage**: Two specialized domains (general sentiment + banking aspect sentiment)
3. **Feature Limitations**: Limited to 20,000 most frequent features
4. **Class Imbalance Sensitivity**: Performance degrades significantly with imbalanced datasets (evident in UTS2017_Bank)
5. **Specific Weaknesses**:
   - **VLSP2016**: Minor performance variation between sentiment classes
   - **UTS2017_Bank**: Poor performance on minority aspect-sentiment classes due to insufficient training data
   - **N-gram Limitation**: Trigrams provide minimal improvement over bigrams while increasing computational cost
   - Banking domain aspects limited to predefined categories (account, loan, card, etc.)

## Ethical Considerations

- **Dataset Bias**: Models reflect biases present in training datasets (VLSP2016 general reviews, UTS2017_Bank banking feedback)
- **Performance Variation**: Significant performance differences between balanced (VLSP2016) and imbalanced (UTS2017_Bank) datasets
- **Domain Validation**: Should be validated on target domain before deployment
- **Class Imbalance**: Consider dataset balance when interpreting results, especially for banking aspect sentiment
- **Representation**: VLSP2016 provides more equitable performance across sentiment classes due to balanced training data

## Citation

If you use this model, please cite:

```bibtex
@misc{undertheseanlp_2025,
    author       = {Vu Anh},
    organization = {UnderTheSea NLP},
    title        = {Pulse Core 1 - Vietnamese Sentiment Analysis System},
    year         = {2025},
    url          = {https://huggingface.co/undertheseanlp/pulse_core_1},
    doi          = {10.57967/hf/6605},
    publisher    = {Hugging Face}
}
```