Update Docs.txt
Docs.txt CHANGED
@@ -1,49 +1,76 @@
- • Or, if the complaint does not align well with existing topics, forms a new topic cluster dynamically.
- ⸻
- • As new complaints mentioning phrases like “fake app,” “OTP stolen,” or “link sent on SMS” come in, the system will cluster them into a new, distinct topic.
- • Within a few hours or days, this cluster grows in size and is detected as an emerging issue on the dashboard.
- • Management can then quickly investigate the trend, issue customer advisories, and initiate app takedown procedures, mitigating customer losses early.
- ⸻
- • Automatic triaging of complaints to relevant teams,
- • Alert generation for anomalous topic spikes (potential fraud),
- • Resolution status tracking per topic cluster to analyze policy impact.

⸻

5. Benchmarking, Evaluation, and Model Selection

To determine the most effective topic modeling technique for the SBI UPI complaint corpus, four models were benchmarked on the July 2023 complaint dataset: Latent Semantic Analysis (LSA), Non-Negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and BERTopic. These models were evaluated using three standard metrics:

• Topic Coherence: Measures the semantic similarity between the top keywords of a topic.
• Topic Diversity: Measures the uniqueness of the top words across all topics (sketched in code after this list).
• Topic Exclusivity: Measures how uniquely a word belongs to a single topic (low overlap across topics).
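
To ground these definitions, the snippet below sketches the diversity computation; the helper function and the toy topics are illustrative assumptions, not the project's evaluation code.

# Illustrative helper (not from the report): topic diversity as the
# fraction of unique words among the top words of every topic.
# A score of 1.0 means no keyword is shared between topics.
def topic_diversity(topics):
    top_words = [word for topic in topics for word in topic]
    return len(set(top_words)) / len(top_words)

# Toy topics: "upi" appears in both, so 5 of the 6 slots are unique.
toy_topics = [["upi", "failed", "refund"], ["upi", "fraud", "otp"]]
print(topic_diversity(toy_topics))  # 5 / 6 ≈ 0.83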

The following summarizes the performance of each model:

⸻

5.1 Latent Semantic Analysis (LSA)

LSA was evaluated over a topic range of 1 to 100 to identify an optimal number of topics. Based on the inflection point in the coherence curve, 15 topics were selected. The evaluation scores were:
• Coherence Score: 0.4616
• Diversity Score: 0.3000
• Exclusivity Score: 0.0530

While LSA was computationally efficient, the low exclusivity and diversity scores indicated substantial overlap in topic content, making the results less interpretable for business insights.
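
The sweep described above can be sketched as follows, assuming a TF-IDF plus TruncatedSVD pipeline and gensim's c_v coherence; the placeholder complaints merely stand in for the July 2023 corpus, and none of this code comes from the report itself.

# Hedged sketch of the LSA coherence sweep (assumed tooling, toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

docs = ["upi payment failed but amount debited",
        "refund not received for failed transaction",
        "fraudulent payment link received over sms"]
tokenized = [d.split() for d in docs]
dictionary = Dictionary(tokenized)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

coherence_by_k = {}
for k in range(1, min(101, X.shape[1])):   # the report sweeps k = 1..100
    svd = TruncatedSVD(n_components=k, random_state=42).fit(X)
    # Treat the 10 highest-weighted terms of each component as a topic.
    topics = [[terms[i] for i in comp.argsort()[::-1][:10]]
              for comp in svd.components_]
    cm = CoherenceModel(topics=topics, texts=tokenized,
                        dictionary=dictionary, coherence="c_v")
    coherence_by_k[k] = cm.get_coherence()
# The inflection point of coherence_by_k selects the topic count (15 here).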

⸻

5.2 Non-Negative Matrix Factorization (NMF)

NMF outperformed LSA in both interpretability and distinctiveness of topic clusters:
• Coherence Score: 0.6693
• Diversity Score: 0.5100
• Exclusivity Score: 0.3100

The higher scores indicated that NMF provided more meaningful topics with reduced word overlap, although semantic similarity between topics remained a limitation due to its reliance on bag-of-words representations.
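
The exact NMF pipeline is not specified in this section; a minimal sklearn sketch under the same assumptions (TF-IDF features, toy corpus, and a deliberately small topic count in place of the tuned one) might look like this:

# Hedged sketch of the NMF variant (assumed tooling, toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["upi payment failed but amount debited",
        "refund not received for failed transaction",
        "fraudulent payment link received over sms",
        "cashback offer not applied on upi payment"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Two topics only because the toy corpus is tiny; the report tunes this.
nmf = NMF(n_components=2, init="nndsvd", random_state=42).fit(X)
for t, comp in enumerate(nmf.components_):
    top = [terms[i] for i in comp.argsort()[::-1][:10]]
    print(f"topic {t}:", ", ".join(top))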

⸻

5.3 Latent Dirichlet Allocation (LDA)

Multiple hyperparameter configurations were tested for LDA, including unigram, bigram, and trigram models. After tuning, the unigram model performed best:
• Coherence Score: 0.5639
• Diversity Score: 0.5800
• Exclusivity Score: 0.5800

Although LDA maintained a good balance between coherence and diversity, topic quality was still constrained by its inability to capture contextual semantics, especially in short or repetitive complaint data.
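
A hedged sketch of the unigram-versus-bigram comparison follows, using gensim's Phrases to build the bigram variant; the tool choice and the toy corpus are assumptions, not the report's stated setup.

# Sketch of the n-gram comparison for LDA (assumed tooling, toy corpus).
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel
from gensim.models.phrases import Phraser, Phrases

texts = [["upi", "payment", "failed", "amount", "debited"],
         ["refund", "not", "received", "failed", "transaction"],
         ["fraudulent", "payment", "link", "received", "sms"]]

bigram = Phraser(Phrases(texts, min_count=1, threshold=1))
variants = {"unigram": texts, "bigram": [bigram[t] for t in texts]}

for name, tokens in variants.items():
    dictionary = Dictionary(tokens)
    corpus = [dictionary.doc2bow(t) for t in tokens]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=2, passes=10, random_state=42)
    score = CoherenceModel(model=lda, texts=tokens, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    print(name, round(score, 4))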

⸻

5.4 BERTopic

The BERTopic model was configured with a minimum topic size of 100 and top_n_words = 10. It produced the best results across all three metrics:
• Coherence Score: 0.7300
• Diversity Score: 0.9930
• Exclusivity Score: 0.9924

In addition to its high evaluation scores, BERTopic's use of transformer-based document embeddings, followed by density-based clustering (HDBSCAN) and class-based TF-IDF (c-TF-IDF) scoring, allowed semantically similar complaints to be discovered and grouped into topics automatically. This provided a qualitatively richer and more interpretable topic structure than the bag-of-words-based models.
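
A minimal sketch of this configuration is shown below. Here load_complaints is a hypothetical loader for the complaint corpus, and everything beyond the two stated parameters falls back to BERTopic's defaults (sentence-transformer embeddings, UMAP, HDBSCAN).

# Hedged sketch of the stated BERTopic configuration.
from bertopic import BERTopic

docs = load_complaints()  # hypothetical loader for the July 2023 corpus

# Only the two parameters named in the report are set explicitly.
topic_model = BERTopic(min_topic_size=100, top_n_words=10)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())

# Incoming complaints can later be assigned to the learned topics:
new_topics, new_probs = topic_model.transform(["fake app link sent on sms"])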

⸻

5.5 Model Selection Rationale

Based on both quantitative evaluation and qualitative assessment of topic coherence and semantic similarity, BERTopic was selected as the final topic modeling technique for processing and analyzing the full SBI UPI complaint dataset.

In summary:

Model            Coherence   Diversity   Exclusivity
LSA              0.4616      0.3000      0.0530
NMF              0.6693      0.5100      0.3100
LDA (Unigram)    0.5639      0.5800      0.5800
BERTopic         0.7300      0.9930      0.9924

The BERTopic model's superior scores, along with its ability to model topics without predefining their number and to adapt dynamically to new data, make it a highly suitable choice for customer complaint analysis in dynamic domains like digital payments.
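
For a visual version of this comparison, the reported scores can be re-plotted as a grouped bar chart; the matplotlib sketch below uses only the numbers from the table above.

# Grouped bar chart of the benchmark scores reported above.
import matplotlib.pyplot as plt
import numpy as np

models = ["LSA", "NMF", "LDA (Unigram)", "BERTopic"]
scores = {
    "Coherence":   [0.4616, 0.6693, 0.5639, 0.7300],
    "Diversity":   [0.3000, 0.5100, 0.5800, 0.9930],
    "Exclusivity": [0.0530, 0.3100, 0.5800, 0.9924],
}

x = np.arange(len(models))
width = 0.25
for i, (metric, values) in enumerate(scores.items()):
    plt.bar(x + (i - 1) * width, values, width, label=metric)
plt.xticks(x, models)
plt.ylabel("Score")
plt.title("Topic model benchmark, July 2023 complaint dataset")
plt.legend()
plt.show()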

⸻