Vu Anh Claude committed on
Commit b0fc171 · 1 Parent(s): f5d9660

Update class count to exact number in README.md and technical report

- Change all instances of "35+" to "35" for precise class information
- Updated README.md: UTS2017_Bank Dataset section header
- Updated technical report: Abstract, introduction, datasets section, tables, and conclusion
- Provides accurate information about model capabilities (exactly 35 aspect-sentiment combinations)
- Verified from labels.txt file showing 35 unique classes
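
The class count can be re-checked with a short script. This is a minimal sketch, assuming `labels.txt` holds one `<aspect>#<sentiment>` label per line; the file layout, its path, and the helper names `unique_labels` and `summarize` are assumptions for illustration, not part of the repository. Note that 14 aspects times 3 sentiments allows up to 42 combinations, so 35 is the number of combinations reported as present in the labels file, not the theoretical maximum.

```python
def unique_labels(lines):
    """Return the sorted set of non-empty, whitespace-stripped labels."""
    return sorted({line.strip() for line in lines if line.strip()})


def summarize(labels):
    """Count classes, distinct aspects, and distinct sentiments."""
    # Assumed label format: ASPECT#sentiment (split on the first '#').
    pairs = [label.split("#", 1) for label in labels]
    aspects = {pair[0] for pair in pairs}
    sentiments = {pair[1] for pair in pairs if len(pair) == 2}
    return {
        "classes": len(labels),
        "aspects": len(aspects),
        "sentiments": len(sentiments),
    }


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever labels.txt lives in the repo.
    with open("labels.txt", encoding="utf-8") as handle:
        print(summarize(unique_labels(handle)))
```

If the assumptions hold, the `classes` value printed for the repository's file should match the count stated in this commit.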

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

README.md CHANGED
@@ -59,7 +59,7 @@ pipeline_tag: text-classification
 
  A machine learning-based aspect sentiment analysis model designed for Vietnamese banking text processing. Built on TF-IDF feature extraction pipeline combined with various machine learning algorithms, achieving **68.18% accuracy** on UTS2017_Bank aspect sentiment dataset with Logistic Regression.
 
- 📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/pulse_core_1/blob/main/Pulse%20Core%201%20-%20System%20Card.md)** for comprehensive model documentation, performance analysis, and limitations.
+ 📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/pulse_core_1/blob/main/paper/pulse_core_1_technical_report.tex)** for comprehensive model documentation, performance analysis, and limitations.
 
  ## Model Description
 
@@ -77,7 +77,7 @@ A machine learning-based aspect sentiment analysis model designed for Vietnamese
 
  ## Supported Dataset & Categories
 
- ### UTS2017_Bank Dataset - Banking Aspect Sentiment (35+ combined classes)
+ ### UTS2017_Bank Dataset - Banking Aspect Sentiment (35 combined classes)
 
  **Banking Aspects:**
  1. **ACCOUNT** - Account services
paper/pulse_core_1_technical_report.tex CHANGED
@@ -23,7 +23,7 @@
  \maketitle
 
  \begin{abstract}
- This paper presents Pulse Core 1, a Vietnamese banking aspect sentiment analysis system employing Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction combined with machine learning classification algorithms. The system is evaluated on the UTS2017\_Bank aspect sentiment dataset containing 35+ combined aspect-sentiment categories, achieving 68.18\% accuracy with Logistic Regression and 72.47\% accuracy with Support Vector Classification (SVC). The implementation utilizes a 20,000-dimensional TF-IDF feature space with n-gram analysis and incorporates hash-based caching for computational optimization. The model predicts combined aspect-sentiment labels in the format \texttt{<aspect>\#<sentiment>}, enabling fine-grained analysis of Vietnamese banking customer feedback across 14 banking aspects (ACCOUNT, CARD, CUSTOMER\_SUPPORT, etc.) and 3 sentiment polarities (positive, negative, neutral). These results establish baseline performance metrics for Vietnamese banking aspect sentiment analysis and demonstrate the efficacy of traditional machine learning approaches for Vietnamese financial domain natural language processing tasks.
+ This paper presents Pulse Core 1, a Vietnamese banking aspect sentiment analysis system employing Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction combined with machine learning classification algorithms. The system is evaluated on the UTS2017\_Bank aspect sentiment dataset containing 35 combined aspect-sentiment categories, achieving 68.18\% accuracy with Logistic Regression and 72.47\% accuracy with Support Vector Classification (SVC). The implementation utilizes a 20,000-dimensional TF-IDF feature space with n-gram analysis and incorporates hash-based caching for computational optimization. The model predicts combined aspect-sentiment labels in the format \texttt{<aspect>\#<sentiment>}, enabling fine-grained analysis of Vietnamese banking customer feedback across 14 banking aspects (ACCOUNT, CARD, CUSTOMER\_SUPPORT, etc.) and 3 sentiment polarities (positive, negative, neutral). These results establish baseline performance metrics for Vietnamese banking aspect sentiment analysis and demonstrate the efficacy of traditional machine learning approaches for Vietnamese financial domain natural language processing tasks.
  \end{abstract}
 
  \section{Introduction}
@@ -34,7 +34,7 @@ Vietnamese, spoken by approximately 95 million speakers globally, exhibits disti
 
  Traditional machine learning approaches utilizing Term Frequency-Inverse Document Frequency (TF-IDF) vectorization with logistic regression maintain practical relevance for text classification tasks, particularly in resource-constrained computational environments \citep{pedregosa2011scikit}. These methodologies provide advantages in training efficiency, memory utilization, and model interpretability.
 
- This paper presents Pulse Core 1, a Vietnamese banking aspect sentiment analysis system implementing TF-IDF feature extraction with machine learning classification algorithms. The system is evaluated on the UTS2017\_Bank aspect sentiment dataset containing 1,977 Vietnamese banking documents across 35+ combined aspect-sentiment categories, achieving competitive performance with traditional machine learning approaches. The system addresses the challenge of simultaneous aspect detection and sentiment classification for Vietnamese banking customer feedback, providing a computationally efficient solution for production deployment scenarios.
+ This paper presents Pulse Core 1, a Vietnamese banking aspect sentiment analysis system implementing TF-IDF feature extraction with machine learning classification algorithms. The system is evaluated on the UTS2017\_Bank aspect sentiment dataset containing 1,977 Vietnamese banking documents across 35 combined aspect-sentiment categories, achieving competitive performance with traditional machine learning approaches. The system addresses the challenge of simultaneous aspect detection and sentiment classification for Vietnamese banking customer feedback, providing a computationally efficient solution for production deployment scenarios.
 
  \section{Related Work}
 
@@ -48,7 +48,7 @@ Initial research in Vietnamese text classification employed rule-based methodolo
 
  Contemporary Vietnamese aspect sentiment analysis research employs domain-specific datasets for banking applications:
 
- \textbf{UTS2017\_Bank Dataset}: A specialized corpus developed by the Underthesea NLP Team for Vietnamese banking aspect sentiment analysis. This dataset encompasses 14 banking aspects (ACCOUNT, CARD, CUSTOMER\_SUPPORT, DISCOUNT, INTEREST\_RATE, INTERNET\_BANKING, LOAN, MONEY\_TRANSFER, OTHER, PAYMENT, PROMOTION, SAVING, SECURITY, TRADEMARK) combined with 3 sentiment polarities (positive, negative, neutral), creating 35+ unique aspect-sentiment combinations. The dataset represents specialized Vietnamese text classification challenges in the financial domain, focusing on customer feedback analysis and banking service categorization.
+ \textbf{UTS2017\_Bank Dataset}: A specialized corpus developed by the Underthesea NLP Team for Vietnamese banking aspect sentiment analysis. This dataset encompasses 14 banking aspects (ACCOUNT, CARD, CUSTOMER\_SUPPORT, DISCOUNT, INTEREST\_RATE, INTERNET\_BANKING, LOAN, MONEY\_TRANSFER, OTHER, PAYMENT, PROMOTION, SAVING, SECURITY, TRADEMARK) combined with 3 sentiment polarities (positive, negative, neutral), creating 35 unique aspect-sentiment combinations. The dataset represents specialized Vietnamese text classification challenges in the financial domain, focusing on customer feedback analysis and banking service categorization.
 
  \subsection{Text Preprocessing Methodologies}
 
@@ -115,7 +115,7 @@ The system incorporates several optimization mechanisms to enhance computational
 
  This study evaluates performance on the UTS2017\_Bank dataset, a specialized Vietnamese banking aspect sentiment analysis corpus representing the financial domain.
 
- The UTS2017\_Bank dataset contains 1,977 Vietnamese banking documents spanning 14 banking aspects combined with sentiment analysis, creating 35+ unique aspect-sentiment categories. The dataset includes: account services, card services, customer support, discount offers, interest rates, internet banking, loans, money transfers, payments, promotions, savings, security features, trademark information, and miscellaneous services, each labeled with positive, negative, or neutral sentiment. The dataset exhibits significant class imbalance, with CUSTOMER\_SUPPORT\#negative (39\%) and TRADEMARK\#positive (35\%) categories dominating the distribution, while minority aspect-sentiment combinations have limited training examples.
+ The UTS2017\_Bank dataset contains 1,977 Vietnamese banking documents spanning 14 banking aspects combined with sentiment analysis, creating 35 unique aspect-sentiment categories. The dataset includes: account services, card services, customer support, discount offers, interest rates, internet banking, loans, money transfers, payments, promotions, savings, security features, trademark information, and miscellaneous services, each labeled with positive, negative, or neutral sentiment. The dataset exhibits significant class imbalance, with CUSTOMER\_SUPPORT\#negative (39\%) and TRADEMARK\#positive (35\%) categories dominating the distribution, while minority aspect-sentiment combinations have limited training examples.
 
  \begin{table}[h]
  \centering
@@ -123,7 +123,7 @@ The UTS2017\_Bank dataset contains 1,977 Vietnamese banking documents spanning 1
  \toprule
  \textbf{Dataset} & \textbf{Classes} & \textbf{Training} & \textbf{Test} & \textbf{Domain} \\
  \midrule
- UTS2017\_Bank & 35+ & 1,581 & 396 & Banking Aspect Sentiment \\
+ UTS2017\_Bank & 35 & 1,581 & 396 & Banking Aspect Sentiment \\
  \bottomrule
  \end{tabular}
  \caption{Dataset characteristics for Vietnamese banking aspect sentiment analysis evaluation.}
@@ -168,8 +168,8 @@ This work establishes the first comprehensive baseline for Vietnamese banking as
  \hline
  \textbf{Dataset} & \textbf{Method} & \textbf{Accuracy} \\
  \hline
- UTS2017\_Bank (35+ aspect-sentiment) & \textbf{Pulse Core 1 - SVC with TF-IDF} & \textbf{72.47\%} \\
- UTS2017\_Bank (35+ aspect-sentiment) & \textbf{Pulse Core 1 - Logistic Regression with TF-IDF} & \textbf{68.18\%} \\
+ UTS2017\_Bank (35 aspect-sentiment) & \textbf{Pulse Core 1 - SVC with TF-IDF} & \textbf{72.47\%} \\
+ UTS2017\_Bank (35 aspect-sentiment) & \textbf{Pulse Core 1 - Logistic Regression with TF-IDF} & \textbf{68.18\%} \\
  \hline
  \end{tabular}
  \caption{Performance results for Vietnamese banking aspect sentiment analysis using TF-IDF-based traditional machine learning approaches.}
@@ -194,7 +194,7 @@ The system exhibits competitive performance on the banking aspect sentiment anal
  \item \textbf{Inference Latency}: 0.1 seconds for 396 test samples (0.025 ms per sample)
  \item \textbf{Training Samples}: 1,581 aspect-sentiment combinations
  \item \textbf{Test Samples}: 396 aspect-sentiment combinations
- \item \textbf{Number of Classes}: 35+ combined aspect-sentiment categories
+ \item \textbf{Number of Classes}: 35 combined aspect-sentiment categories
  \item \textbf{Weighted Average F1-Score (Logistic Regression)}: 0.66
  \end{itemize}
 
@@ -417,7 +417,7 @@ This paper presents Pulse Core 1, a Vietnamese banking aspect sentiment analysis
 
  \begin{enumerate}
  \item Traditional machine learning approaches achieve competitive performance on Vietnamese banking aspect sentiment analysis tasks (72.47\% accuracy with SVC) while maintaining substantial computational efficiency advantages (5.3s training time).
- \item Feature engineering methodologies retain critical importance for Vietnamese banking applications, with the implemented 20,000-dimensional TF-IDF representation effectively capturing aspect-sentiment relationships across 35+ combined categories.
+ \item Feature engineering methodologies retain critical importance for Vietnamese banking applications, with the implemented 20,000-dimensional TF-IDF representation effectively capturing aspect-sentiment relationships across 35 combined categories.
  \item Class distribution imbalance constitutes the primary performance limitation for aspect sentiment analysis, with minority aspect-sentiment combinations achieving zero performance due to insufficient training data.
  \item The fundamental trade-off between algorithmic complexity and model interpretability substantially favors TF-IDF approaches for banking applications requiring transparency and regulatory compliance.
  \end{enumerate}