Vu Anh and Claude committed
Commit b0fc171 · 1 Parent(s): f5d9660
Update class count to exact number in README.md and technical report
- Change all instances of "35+" to "35" for precise class information
- Updated README.md: UTS2017_Bank Dataset section header
- Updated technical report: Abstract, introduction, datasets section, tables, and conclusion
- Provides accurate information about model capabilities (exactly 35 aspect-sentiment combinations)
- Verified from labels.txt file showing 35 unique classes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
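The commit message says the class count was verified against `labels.txt`. A minimal sketch of such a check follows, assuming the file holds one `<aspect>#<sentiment>` label per line; the layout of `labels.txt` is not shown in this commit, and the helper names `read_labels` and `summarize_labels` are hypothetical.

```python
from collections import Counter


def read_labels(lines):
    """Return the non-empty, whitespace-stripped labels from an iterable of lines."""
    return [line.strip() for line in lines if line.strip()]


def summarize_labels(lines):
    """Count unique labels and flag duplicates or labels that are not
    in the assumed <aspect>#<sentiment> shape (exactly one '#')."""
    counts = Counter(read_labels(lines))
    return {
        "unique": len(counts),
        "duplicates": sorted(label for label, n in counts.items() if n > 1),
        "malformed": sorted(label for label in counts if label.count("#") != 1),
    }


# Small in-memory demo; for the real file, pass an open file handle instead,
# e.g. summarize_labels(open("labels.txt", encoding="utf-8")).
demo = ["ACCOUNT#negative\n", "CARD#positive\n", "\n", "CARD#positive\n"]
print(summarize_labels(demo))
```

For the repository's file, the `unique` value would be expected to equal 35 if the commit message is accurate.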
- README.md +2 -2
- paper/pulse_core_1_technical_report.tex +9 -9
README.md
CHANGED
@@ -59,7 +59,7 @@ pipeline_tag: text-classification

 A machine learning-based aspect sentiment analysis model designed for Vietnamese banking text processing. Built on TF-IDF feature extraction pipeline combined with various machine learning algorithms, achieving **68.18% accuracy** on UTS2017_Bank aspect sentiment dataset with Logistic Regression.

-📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/pulse_core_1/blob/main/
+📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/pulse_core_1/blob/main/paper/pulse_core_1_technical_report.tex)** for comprehensive model documentation, performance analysis, and limitations.

 ## Model Description

@@ -77,7 +77,7 @@ A machine learning-based aspect sentiment analysis model designed for Vietnamese

 ## Supported Dataset & Categories

-### UTS2017_Bank Dataset - Banking Aspect Sentiment (35
+### UTS2017_Bank Dataset - Banking Aspect Sentiment (35 combined classes)

 **Banking Aspects:**
 1. **ACCOUNT** - Account services
paper/pulse_core_1_technical_report.tex
CHANGED
@@ -23,7 +23,7 @@
 \maketitle

 \begin{abstract}
-This paper presents Pulse Core 1, a Vietnamese banking aspect sentiment analysis system employing Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction combined with machine learning classification algorithms. The system is evaluated on the UTS2017\_Bank aspect sentiment dataset containing 35
+This paper presents Pulse Core 1, a Vietnamese banking aspect sentiment analysis system employing Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction combined with machine learning classification algorithms. The system is evaluated on the UTS2017\_Bank aspect sentiment dataset containing 35 combined aspect-sentiment categories, achieving 68.18\% accuracy with Logistic Regression and 72.47\% accuracy with Support Vector Classification (SVC). The implementation utilizes a 20,000-dimensional TF-IDF feature space with n-gram analysis and incorporates hash-based caching for computational optimization. The model predicts combined aspect-sentiment labels in the format \texttt{<aspect>\#<sentiment>}, enabling fine-grained analysis of Vietnamese banking customer feedback across 14 banking aspects (ACCOUNT, CARD, CUSTOMER\_SUPPORT, etc.) and 3 sentiment polarities (positive, negative, neutral). These results establish baseline performance metrics for Vietnamese banking aspect sentiment analysis and demonstrate the efficacy of traditional machine learning approaches for Vietnamese financial domain natural language processing tasks.
 \end{abstract}

 \section{Introduction}
@@ -34,7 +34,7 @@ Vietnamese, spoken by approximately 95 million speakers globally, exhibits disti

 Traditional machine learning approaches utilizing Term Frequency-Inverse Document Frequency (TF-IDF) vectorization with logistic regression maintain practical relevance for text classification tasks, particularly in resource-constrained computational environments \citep{pedregosa2011scikit}. These methodologies provide advantages in training efficiency, memory utilization, and model interpretability.

-This paper presents Pulse Core 1, a Vietnamese banking aspect sentiment analysis system implementing TF-IDF feature extraction with machine learning classification algorithms. The system is evaluated on the UTS2017\_Bank aspect sentiment dataset containing 1,977 Vietnamese banking documents across 35
+This paper presents Pulse Core 1, a Vietnamese banking aspect sentiment analysis system implementing TF-IDF feature extraction with machine learning classification algorithms. The system is evaluated on the UTS2017\_Bank aspect sentiment dataset containing 1,977 Vietnamese banking documents across 35 combined aspect-sentiment categories, achieving competitive performance with traditional machine learning approaches. The system addresses the challenge of simultaneous aspect detection and sentiment classification for Vietnamese banking customer feedback, providing a computationally efficient solution for production deployment scenarios.

 \section{Related Work}

@@ -48,7 +48,7 @@ Initial research in Vietnamese text classification employed rule-based methodolo

 Contemporary Vietnamese aspect sentiment analysis research employs domain-specific datasets for banking applications:

-\textbf{UTS2017\_Bank Dataset}: A specialized corpus developed by the Underthesea NLP Team for Vietnamese banking aspect sentiment analysis. This dataset encompasses 14 banking aspects (ACCOUNT, CARD, CUSTOMER\_SUPPORT, DISCOUNT, INTEREST\_RATE, INTERNET\_BANKING, LOAN, MONEY\_TRANSFER, OTHER, PAYMENT, PROMOTION, SAVING, SECURITY, TRADEMARK) combined with 3 sentiment polarities (positive, negative, neutral), creating 35
+\textbf{UTS2017\_Bank Dataset}: A specialized corpus developed by the Underthesea NLP Team for Vietnamese banking aspect sentiment analysis. This dataset encompasses 14 banking aspects (ACCOUNT, CARD, CUSTOMER\_SUPPORT, DISCOUNT, INTEREST\_RATE, INTERNET\_BANKING, LOAN, MONEY\_TRANSFER, OTHER, PAYMENT, PROMOTION, SAVING, SECURITY, TRADEMARK) combined with 3 sentiment polarities (positive, negative, neutral), creating 35 unique aspect-sentiment combinations. The dataset represents specialized Vietnamese text classification challenges in the financial domain, focusing on customer feedback analysis and banking service categorization.

 \subsection{Text Preprocessing Methodologies}

@@ -115,7 +115,7 @@ The system incorporates several optimization mechanisms to enhance computational

 This study evaluates performance on the UTS2017\_Bank dataset, a specialized Vietnamese banking aspect sentiment analysis corpus representing the financial domain.

-The UTS2017\_Bank dataset contains 1,977 Vietnamese banking documents spanning 14 banking aspects combined with sentiment analysis, creating 35
+The UTS2017\_Bank dataset contains 1,977 Vietnamese banking documents spanning 14 banking aspects combined with sentiment analysis, creating 35 unique aspect-sentiment categories. The dataset includes: account services, card services, customer support, discount offers, interest rates, internet banking, loans, money transfers, payments, promotions, savings, security features, trademark information, and miscellaneous services, each labeled with positive, negative, or neutral sentiment. The dataset exhibits significant class imbalance, with CUSTOMER\_SUPPORT\#negative (39\%) and TRADEMARK\#positive (35\%) categories dominating the distribution, while minority aspect-sentiment combinations have limited training examples.

 \begin{table}[h]
 \centering
@@ -123,7 +123,7 @@ The UTS2017\_Bank dataset contains 1,977 Vietnamese banking documents spanning 1
 \toprule
 \textbf{Dataset} & \textbf{Classes} & \textbf{Training} & \textbf{Test} & \textbf{Domain} \\
 \midrule
-UTS2017\_Bank & 35
+UTS2017\_Bank & 35 & 1,581 & 396 & Banking Aspect Sentiment \\
 \bottomrule
 \end{tabular}
 \caption{Dataset characteristics for Vietnamese banking aspect sentiment analysis evaluation.}
@@ -168,8 +168,8 @@ This work establishes the first comprehensive baseline for Vietnamese banking as
 \hline
 \textbf{Dataset} & \textbf{Method} & \textbf{Accuracy} \\
 \hline
-UTS2017\_Bank (35
-UTS2017\_Bank (35
+UTS2017\_Bank (35 aspect-sentiment) & \textbf{Pulse Core 1 - SVC with TF-IDF} & \textbf{72.47\%} \\
+UTS2017\_Bank (35 aspect-sentiment) & \textbf{Pulse Core 1 - Logistic Regression with TF-IDF} & \textbf{68.18\%} \\
 \hline
 \end{tabular}
 \caption{Performance results for Vietnamese banking aspect sentiment analysis using TF-IDF-based traditional machine learning approaches.}
@@ -194,7 +194,7 @@ The system exhibits competitive performance on the banking aspect sentiment anal
 \item \textbf{Inference Latency}: 0.1 seconds for 396 test samples (0.025 ms per sample)
 \item \textbf{Training Samples}: 1,581 aspect-sentiment combinations
 \item \textbf{Test Samples}: 396 aspect-sentiment combinations
-\item \textbf{Number of Classes}: 35
+\item \textbf{Number of Classes}: 35 combined aspect-sentiment categories
 \item \textbf{Weighted Average F1-Score (Logistic Regression)}: 0.66
 \end{itemize}

@@ -417,7 +417,7 @@ This paper presents Pulse Core 1, a Vietnamese banking aspect sentiment analysis

 \begin{enumerate}
 \item Traditional machine learning approaches achieve competitive performance on Vietnamese banking aspect sentiment analysis tasks (72.47\% accuracy with SVC) while maintaining substantial computational efficiency advantages (5.3s training time).
-\item Feature engineering methodologies retain critical importance for Vietnamese banking applications, with the implemented 20,000-dimensional TF-IDF representation effectively capturing aspect-sentiment relationships across 35
+\item Feature engineering methodologies retain critical importance for Vietnamese banking applications, with the implemented 20,000-dimensional TF-IDF representation effectively capturing aspect-sentiment relationships across 35 combined categories.
 \item Class distribution imbalance constitutes the primary performance limitation for aspect sentiment analysis, with minority aspect-sentiment combinations achieving zero performance due to insufficient training data.
 \item The fundamental trade-off between algorithmic complexity and model interpretability substantially favors TF-IDF approaches for banking applications requiring transparency and regulatory compliance.
 \end{enumerate}
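Several figures quoted in the hunks above can be cross-checked with simple arithmetic. The sketch below is illustrative only: the aspect and polarity names come from the report text, while the helper names (`all_possible_labels`, `split_label`, `per_sample_ms`) are hypothetical and not part of the repository. It shows that 14 aspects and 3 polarities allow at most 42 `<aspect>#<sentiment>` labels, so a count of 35 would mean seven combinations never occur in the data. It also shows that 0.1 s over 396 samples comes to roughly 0.25 ms per sample rather than the quoted 0.025 ms, which suggests one of the two latency figures is rounded or mistyped.

```python
from itertools import product

# Aspect and polarity names as listed in the technical report.
ASPECTS = [
    "ACCOUNT", "CARD", "CUSTOMER_SUPPORT", "DISCOUNT", "INTEREST_RATE",
    "INTERNET_BANKING", "LOAN", "MONEY_TRANSFER", "OTHER", "PAYMENT",
    "PROMOTION", "SAVING", "SECURITY", "TRADEMARK",
]
POLARITIES = ["positive", "negative", "neutral"]


def all_possible_labels():
    """Enumerate every <aspect>#<sentiment> label the scheme permits."""
    return [f"{aspect}#{polarity}" for aspect, polarity in product(ASPECTS, POLARITIES)]


def split_label(label):
    """Split a combined label back into (aspect, sentiment)."""
    aspect, sentiment = label.rsplit("#", 1)
    return aspect, sentiment


def per_sample_ms(total_seconds, n_samples):
    """Average latency per sample in milliseconds."""
    return total_seconds * 1000.0 / n_samples


possible = all_possible_labels()
print("possible labels:", len(possible))
print("unobserved if 35 classes occur:", len(possible) - 35)
print("train + test documents:", 1581 + 396)
print("per-sample latency (ms):", round(per_sample_ms(0.1, 396), 4))
print(split_label("CUSTOMER_SUPPORT#negative"))
```

The 1,581 training and 396 test samples sum to the 1,977 documents reported, which corresponds to roughly an 80/20 split.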