shreyanshknayak committed on
Commit 96405ba · verified · 1 Parent(s): ea27fc4

Update Docs.txt

Files changed (1):
  1. Docs.txt +53 -26

Docs.txt CHANGED
@@ -1,49 +1,76 @@
- 7. Conclusion and Future Work

- Conclusion

- This project aimed to develop a robust and scalable system for analyzing unstructured customer complaints in the UPI domain using state-of-the-art topic modeling techniques. Through extensive experimentation and evaluation of both TF-IDF based methods (LSA, NMF, LDA) and transformer-based techniques (BERTopic), it was demonstrated that BERTopic provides superior performance in terms of coherence, semantic understanding, and flexibility.

- By extracting latent themes from complaints and visualizing their evolution over time through an interactive Streamlit dashboard, we enabled deeper insights into the voice of the customer. The dashboard's modules (similarity heatmaps, topic size visualizer, and content drift tracker) help identify which complaints persist over time, evolve in context, or diminish due to policy interventions.

- Key topics such as refund delays, failed transactions, and unauthorized transfers were found to dominate the dataset across months. These insights can guide SBI's customer service, product development, and fraud detection teams toward faster and more effective resolutions.
- Future Work

- 1. Online Topic Modeling System

- To further scale this solution and operationalize its benefits, the next step is to implement an online topic modeling system that can process and classify incoming complaints in near real time. This would involve deploying a pipeline that continuously:
- • Preprocesses new complaints as they are received,
- • Embeds them using a transformer model,
- • Assigns them to the most semantically similar existing topic,
- • Or, if the complaint does not align well with existing topics, forms a new topic cluster dynamically.

- This would eliminate the need to periodically retrain models from scratch and allow continuous learning and adaptation, ensuring the system remains up to date with emerging customer issues.
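The assign-or-create step of the pipeline described above could be sketched as follows. This is a minimal pure-Python illustration, assuming complaint embeddings are already computed; the function names, centroid representation, and similarity threshold are all hypothetical.

```python
import math

SIM_THRESHOLD = 0.6  # hypothetical cutoff; would be tuned on real complaint data

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def assign_or_create(embedding, centroids):
    """Assign a new complaint embedding to the closest topic centroid,
    or start a new topic cluster if nothing is similar enough.

    centroids: dict mapping integer topic id -> centroid vector (mutated in place).
    """
    best_topic, best_sim = None, -1.0
    for topic_id, centroid in centroids.items():
        sim = cosine(embedding, centroid)
        if sim > best_sim:
            best_topic, best_sim = topic_id, sim
    if best_sim >= SIM_THRESHOLD:
        return best_topic
    new_id = max(centroids, default=-1) + 1
    centroids[new_id] = list(embedding)  # seed the new cluster with this complaint
    return new_id
```

In a production setting, centroids would be updated as clusters grow (e.g. running means), and the threshold would be validated against held-out labeled complaints.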
 
- 2. Use Case: Phishing Complaint Surge

- Consider a real-world scenario: a new phishing scam targeting UPI users spreads through fake apps. Initially, this complaint is rare and does not fit into any existing manual category. In the current system, it might be manually labeled as "unauthorized transaction", losing the nuance of how it occurred.

- With an online topic modeling system:
- • As new complaints mentioning phrases like "fake app," "OTP stolen," or "link sent on SMS" come in, the system will cluster them into a new, distinct topic.
- • Within a few hours or days, this cluster grows in size and is detected as an emerging issue on the dashboard.
- • Management can then quickly investigate the trend, issue customer advisories, and initiate app takedown procedures, mitigating customer losses early.
- 3. Integration with Internal Complaint Resolution Systems

- The online topic modeling system can be integrated with existing internal tools for:
- • Automatic triaging of complaints to relevant teams,
- • Alert generation for anomalous topic spikes (potential fraud),
- • Resolution status tracking per topic cluster to analyze policy impact.
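The alert-generation idea can be sketched with a simple threshold rule: flag any topic whose latest weekly complaint count exceeds its historical baseline. This is hypothetical logic for illustration, not part of the current system; the multiplier and the standard-deviation floor would need tuning.

```python
import statistics

def spiking_topics(weekly_counts, k=3.0):
    """Flag topics whose latest weekly complaint count exceeds the
    historical mean by more than k standard deviations.

    weekly_counts: dict mapping topic name -> list of weekly counts,
    oldest first, with the most recent week last.
    """
    alerts = []
    for topic, counts in weekly_counts.items():
        history, latest = counts[:-1], counts[-1]
        if len(history) < 2:
            continue  # not enough history to estimate a baseline
        mean = statistics.mean(history)
        std = statistics.stdev(history)
        # Floor the std at 1.0 so a flat history does not trigger on tiny noise.
        if latest > mean + k * max(std, 1.0):
            alerts.append(topic)
    return alerts
```

A dashboard job could run this weekly and route flagged topics to the fraud team.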
 
- Final Remarks

- By transitioning from a static, manual classification process to an automated, intelligent, and adaptive complaint analysis pipeline, SBI can significantly improve its responsiveness to customer grievances, reduce resolution times, and proactively safeguard users against evolving threats. The online topic modeling system represents not only a technological enhancement but also a strategic investment in SBI's commitment to customer-centric banking.
 
+
+ 5. Benchmarking, Evaluation, and Model Selection
+
+ To determine the most effective topic modeling technique for the SBI UPI complaint corpus, four different models were benchmarked on the July 2023 complaint dataset: Latent Semantic Analysis (LSA), Non-Negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and BERTopic. These models were evaluated using three standard metrics:
+ • Topic Coherence: Measures the semantic similarity between the top keywords of a topic.
+ • Topic Diversity: Measures the uniqueness of the top words across all topics.
+ • Topic Exclusivity: Measures how uniquely a word belongs to a single topic (low overlap across topics).
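The diversity and exclusivity metrics can be illustrated with simplified pure-Python proxies computed over each topic's top-word list. These are sketches of the idea only; the project's actual metric implementations (e.g. gensim-style coherence) may differ.

```python
def topic_diversity(topics):
    """Unique top words across all topics divided by the total
    number of top-word slots. topics: list of top-word lists."""
    all_words = [w for words in topics for w in words]
    return len(set(all_words)) / len(all_words)

def topic_exclusivity(topics):
    """For each topic, the share of its top words appearing in no
    other topic's list, averaged across topics (a simplified proxy)."""
    scores = []
    for i, words in enumerate(topics):
        others = {w for j, t in enumerate(topics) if j != i for w in t}
        scores.append(sum(1 for w in words if w not in others) / len(words))
    return sum(scores) / len(scores)
```

Shared words such as "delay" appearing in several topics' top lists pull both scores down, which is exactly the overlap these metrics penalize.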
+
+ The following summarizes the performance of each model:
+
+ 5.1 Latent Semantic Analysis (LSA)
+
+ LSA was evaluated over a topic range of 1 to 100 to identify an optimal number of topics. Based on the inflection point in the coherence curve, 15 topics were selected. The evaluation scores were:
+ • Coherence Score: 0.4616
+ • Diversity Score: 0.3000
+ • Exclusivity Score: 0.0530
+
+ While LSA was computationally efficient, the low exclusivity and diversity scores indicated substantial overlap in topic content, making the results less interpretable for business insights.
+
+ 5.2 Non-Negative Matrix Factorization (NMF)
+
+ NMF outperformed LSA in both interpretability and distinctiveness of topic clusters:
+ • Coherence Score: 0.6693
+ • Diversity Score: 0.5100
+ • Exclusivity Score: 0.3100
+
+ The higher scores indicated that NMF produced more meaningful topics with less word overlap, although its reliance on bag-of-words representations still limited its ability to capture semantic similarity between complaints.
 
+
+ 5.3 Latent Dirichlet Allocation (LDA)
+
+ Multiple hyperparameter configurations were tested for LDA, including unigram, bigram, and trigram models. After tuning, the unigram model performed best:
+ • Coherence Score: 0.5639
+ • Diversity Score: 0.5800
+ • Exclusivity Score: 0.5800
+
+ Although LDA maintained a good balance between coherence and diversity, topic quality was still constrained by the model's inability to capture contextual semantics, especially in short or repetitive complaint data.
 
 
 
+ 5.4 BERTopic
+
+ The BERTopic model was configured with a minimum topic size of 100 and top_n_words = 10. It produced the best results across all three metrics:
+ • Coherence Score: 0.7300
+ • Diversity Score: 0.9930
+ • Exclusivity Score: 0.9924
+
+ Beyond its high evaluation scores, BERTopic's use of transformer-based document embeddings, followed by density-based clustering (HDBSCAN) and class-based TF-IDF (c-TF-IDF) scoring, allowed semantically similar complaints to be discovered and grouped automatically. This yielded a qualitatively richer and more interpretable topic structure than the TF-IDF-based models.
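The class-based TF-IDF scoring step can be sketched in simplified form. This is a pure-Python illustration of the idea (term frequency within a class, weighted by rarity across classes); BERTopic's actual c-TF-IDF implementation differs in detail.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """Simplified class-based TF-IDF: score each word per topic class.

    class_docs: dict mapping topic id -> list of tokens pooled from
    all documents assigned to that class.
    """
    counts = {c: Counter(toks) for c, toks in class_docs.items()}
    total_freq = Counter()  # word frequency across all classes
    for cnt in counts.values():
        total_freq.update(cnt)
    avg_words = sum(len(t) for t in class_docs.values()) / len(class_docs)
    scores = {}
    for c, cnt in counts.items():
        n = sum(cnt.values())
        # tf within the class, scaled by how rare the word is overall
        scores[c] = {
            w: (f / n) * math.log(1 + avg_words / total_freq[w])
            for w, f in cnt.items()
        }
    return scores
```

Words concentrated in one class (e.g. "refund" in a refunds topic) score higher there than words spread across classes, which is how each topic's top keywords are ranked.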
 
 
 
+ 5.5 Model Selection Rationale
+
+ Based on both the quantitative evaluation and a qualitative assessment of topic coherence and semantic similarity, BERTopic was selected as the final topic modeling technique for processing and analyzing the full SBI UPI complaint dataset.
+
+ In summary:
+
+ Model           Coherence   Diversity   Exclusivity
+ LSA             0.4616      0.3000      0.0530
+ NMF             0.6693      0.5100      0.3100
+ LDA (Unigram)   0.5639      0.5800      0.5800
+ BERTopic        0.7300      0.9930      0.9924
+
+ BERTopic's superior scores, along with its ability to model topics without predefining their number and to adapt dynamically to new data, make it a highly suitable choice for customer complaint analysis in fast-moving domains such as digital payments.