Customer-Segmentation


Mahmoud Adel committed on Apr 14

Commit

5e2aaa0

0 Parent(s):

Clean Hugging Face deployment


Files changed (21)

.streamlit/config.toml +7 -0
.streamlit/config.toml +13 -0
README.md +423 -0
data/Mall_Customers.csv +201 -0
requirements.txt +8 -0
run_app.py +51 -0
src/__init__.py +1 -0
src/__pycache__/__init__.cpython-39.pyc +0 -0
src/__pycache__/clustering.cpython-312.pyc +0 -0
src/__pycache__/clustering.cpython-39.pyc +0 -0
src/__pycache__/data_loader.cpython-312.pyc +0 -0
src/__pycache__/data_loader.cpython-39.pyc +0 -0
src/__pycache__/visualizations.cpython-312.pyc +0 -0
src/__pycache__/visualizations.cpython-39.pyc +0 -0
src/clustering.py +260 -0
src/data_loader.py +151 -0
src/visualizations.py +780 -0
streamlit_app/main.py +1112 -0
utils/__init__.py +1 -0
utils/__pycache__/data_generator.cpython-311.pyc +0 -0
utils/data_generator.py +73 -0

.streamlit/config.toml ADDED

	@@ -0,0 +1,7 @@

+[theme]
+base = "dark"
+primaryColor = "#818CF8"
+backgroundColor = "#0F172A"
+secondaryBackgroundColor = "#111827"
+textColor = "#E5E7EB"
+font = "sans serif"

.streamlit/config.toml ADDED

	@@ -0,0 +1,13 @@

+[theme]
+primaryColor = "#3498db"
+backgroundColor = "#0e1117"
+secondaryBackgroundColor = "#262730"
+textColor = "#ffffff"
+[server]
+headless = true
+enableCORS = false
+enableXsrfProtection = false
+[browser]
+gatherUsageStats = false

README.md ADDED

	@@ -0,0 +1,423 @@

+# 🛍️ Customer Segmentation Analysis
+[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://customer-segmentation-mqnhet38emja8xtgffpzjt.streamlit.app/)
+[![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://www.python.org/downloads/)
+[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
+[![Streamlit](https://img.shields.io/badge/Streamlit-1.28+-red.svg)](https://streamlit.io/)
+> **🎯 Live Application**: [Customer Segmentation Analysis](https://customer-segmentation-mqnhet38emja8xtgffpzjt.streamlit.app/)
+A comprehensive, interactive web application for customer segmentation analysis using machine learning clustering algorithms. This project provides an end-to-end solution for identifying distinct customer groups based on purchasing behavior and demographic characteristics.
+## 🌟 Live Demo
+**🚀 Try the application now:** [Customer Segmentation Analysis](https://customer-segmentation-mqnhet38emja8xtgffpzjt.streamlit.app/)
+The live application features:
+- ✨ **Interactive Data Exploration** with real-time visualizations
+- 🎯 **K-Means & DBSCAN Clustering** with optimal parameter selection
+- 📊 **Beautiful Visualizations** with dark theme and modern UI
+- 💡 **Business Insights** and actionable recommendations
+- 📱 **Responsive Design** that works on all devices
+---
+## 📋 Table of Contents
+- [🎯 Project Overview](#-project-overview)
+- [✨ Key Features](#-key-features)
+- [📊 Dataset Information](#-dataset-information)
+- [🛠️ Technology Stack](#️-technology-stack)
+- [🚀 Quick Start](#-quick-start)
+- [📁 Project Structure](#-project-structure)
+- [🔍 Analysis Workflow](#-analysis-workflow)
+- [📈 Results & Insights](#-results--insights)
+- [🎨 Screenshots](#-screenshots)
+- [⚙️ Configuration](#️-configuration)
+- [🤝 Contributing](#-contributing)
+- [📝 License](#-license)
+---
+## 🎯 Project Overview
+This project implements advanced customer segmentation using unsupervised machine learning techniques. It provides a complete solution for businesses to understand their customer base through data-driven insights and actionable recommendations.
+### 🎯 Business Value
+- **Customer Understanding**: Identify distinct customer segments based on behavior patterns
+- **Targeted Marketing**: Develop personalized marketing strategies for each segment
+- **Resource Optimization**: Allocate marketing budgets more effectively
+- **Product Development**: Tailor products and services to specific customer needs
+- **Customer Retention**: Implement segment-specific retention strategies
+---
+## ✨ Key Features
+### 🎨 **Modern User Interface**
+- **Dark Theme**: Beautiful, modern dark interface with gradient accents
+- **Responsive Design**: Works seamlessly on desktop, tablet, and mobile
+- **Interactive Elements**: Hover effects, animations, and smooth transitions
+- **Real-time Updates**: Dynamic visualizations that update instantly
+### 📊 **Comprehensive Data Analysis**
+- **Data Exploration**: Interactive histograms, scatter plots, and correlation matrices
+- **Statistical Summary**: Detailed descriptive statistics and data quality checks
+- **Feature Relationships**: Visual analysis of correlations between variables
+- **Missing Value Detection**: Automatic identification and handling of data issues
+### 🎯 **Advanced Clustering Algorithms**
+- **K-Means Clustering**: With optimal cluster determination using multiple metrics
+- **DBSCAN Clustering**: Density-based clustering for comparison
+- **Parameter Optimization**: Automatic selection of optimal clustering parameters
+- **Performance Metrics**: Silhouette score, Calinski-Harabasz score, and inertia
+### 📈 **Rich Visualizations**
+- **2D Cluster Plots**: Interactive scatter plots with cluster assignments
+- **Distribution Analysis**: Box plots and histograms for each segment
+- **Comparative Analysis**: Side-by-side comparison of different algorithms
+- **Business Metrics**: Spending analysis and customer profile visualizations
+### 💡 **Business Intelligence**
+- **Customer Profiles**: Detailed characteristics of each segment
+- **Spending Analysis**: Average spending patterns and trends
+- **Actionable Recommendations**: Specific strategies for each customer segment
+- **Download Results**: Export analysis results for further processing
+---
+## 📊 Dataset Information
+The application uses the **Mall Customer Segmentation** dataset, which simulates real-world customer data with the following features:
+| Feature | Description | Type | Range |
+|---------|-------------|------|-------|
+| **CustomerID** | Unique customer identifier | Integer | 1-200 |
+| **Gender** | Customer gender | Categorical | Male/Female |
+| **Age** | Customer age in years | Integer | 18-70 |
+| **Annual Income (k$)** | Annual income in thousands | Integer | 15-137 |
+| **Spending Score (1-100)** | Mall-assigned spending score | Integer | 1-100 |
+### 📈 **Dataset Characteristics**
+- **Size**: 200 customers
+- **Features**: 5 variables (4 numeric, 1 categorical)
+- **Quality**: Clean data with no missing values
+- **Realism**: Simulates realistic customer behavior patterns
+---
+## 🛠️ Technology Stack
+### **Core Technologies**
+- **Python 3.8+**: Primary programming language
+- **Streamlit 1.28+**: Interactive web application framework
+- **Pandas**: Data manipulation and analysis
+- **NumPy**: Numerical computing and array operations
+### **Machine Learning**
+- **Scikit-learn**: Clustering algorithms (K-Means, DBSCAN)
+- **Silhouette Analysis**: Cluster quality evaluation
+- **StandardScaler**: Feature normalization
+### **Visualization**
+- **Plotly**: Interactive charts and graphs
+- **Custom CSS**: Modern dark theme styling
+- **Responsive Design**: Mobile-friendly interface
+### **Development Tools**
+- **YAML**: Configuration management
+- **Git**: Version control
+- **Streamlit Cloud**: Deployment platform
+---
+## 🚀 Quick Start
+### **Option 1: Use the Live Application**
+1. Visit [Customer Segmentation Analysis](https://customer-segmentation-mqnhet38emja8xtgffpzjt.streamlit.app/)
+2. Start exploring the data immediately
+3. No installation required!
+### **Option 2: Run Locally**
+#### **Prerequisites**
+```bash
+# Ensure you have Python 3.8+ installed
+python --version
+# Install Git (if not already installed)
+git --version
+```
+#### **Installation Steps**
+1. **Clone the repository**
+   ```bash
+   git clone https://github.com/yourusername/customer-segmentation.git
+   cd customer-segmentation
+   ```
+2. **Install dependencies**
+   ```bash
+   pip install -r requirements.txt
+   ```
+3. **Launch the application**
+   ```bash
+   python run_app.py
+   ```
+   Or directly with Streamlit:
+   ```bash
+   streamlit run streamlit_app/main.py
+   ```
+4. **Access the application**
+   - Open your browser and navigate to `http://localhost:8501`
+   - The application will automatically load the sample dataset
+   - Start exploring the different analysis sections
+---
+## 📁 Project Structure
+```
+Customer segmentation/
+├── 📁 streamlit_app/
+│   └── 🐍 main.py                    # Main Streamlit application
+├── 📁 src/
+│   ├── 🐍 __init__.py                # Package initialization
+│   ├── 🐍 data_loader.py             # Data loading and preprocessing
+│   ├── 🐍 clustering.py              # Clustering algorithms
+│   └── 🐍 visualizations.py          # Visualization components
+├── 📁 utils/
+│   ├── 🐍 __init__.py                # Utilities package
+│   └── 🐍 data_generator.py          # Sample data generation
+├── 📁 config/
+│   └── ⚙️ config.yaml                # Configuration settings
+├── 📁 data/
+│   └── 📊 Mall_Customers.csv         # Main dataset
+├── 📁 .streamlit/
+│   └── ⚙️ config.toml                # Streamlit configuration
+├── 📋 requirements.txt               # Python dependencies
+├── 🚀 run_app.py                     # Application launcher
+└── 📖 README.md                      # Project documentation
+```
+---
+## 🔍 Analysis Workflow
+### **1. Data Exploration** 📊
+- **Dataset Overview**: Basic statistics and data quality assessment
+- **Distribution Analysis**: Histograms and density plots for all features
+- **Correlation Analysis**: Heatmaps showing feature relationships
+- **Visual Exploration**: Interactive scatter plots and box plots
+### **2. Data Preprocessing** ⚙️
+- **Feature Selection**: Choose relevant variables for clustering
+- **Data Scaling**: Normalize features using StandardScaler
+- **Missing Value Handling**: Automatic detection and treatment
+- **Data Validation**: Ensure data quality and consistency
+### **3. Optimal Cluster Determination** 🎯
+- **Elbow Method**: Find optimal number of clusters using inertia
+- **Silhouette Analysis**: Evaluate cluster quality and separation
+- **Calinski-Harabasz Score**: Alternative cluster evaluation metric
+- **Visual Assessment**: Interactive plots for parameter selection
+### **4. K-Means Clustering** 🔵
+- **Algorithm Application**: Apply K-Means with optimal parameters
+- **Cluster Assignment**: Generate labels for each customer
+- **Performance Metrics**: Calculate silhouette and Calinski scores
+- **Center Visualization**: Plot cluster centroids
+### **5. DBSCAN Clustering** 🌟
+- **Density-Based Clustering**: Apply DBSCAN algorithm
+- **Parameter Tuning**: Adjust epsilon and min_samples
+- **Noise Detection**: Identify outlier points
+- **Comparison Analysis**: Compare with K-Means results
+### **6. Visualization & Analysis** 📈
+- **2D Cluster Plots**: Interactive scatter plots with cluster assignments
+- **Distribution Analysis**: Box plots showing feature distributions per cluster
+- **Spending Analysis**: Detailed spending patterns for each segment
+- **Comparative Visualizations**: Side-by-side algorithm comparison
+### **7. Business Intelligence** 💡
+- **Customer Profiling**: Detailed characteristics of each segment
+- **Spending Patterns**: Average spending and variance analysis
+- **Actionable Insights**: Specific recommendations for each segment
+- **Export Results**: Download analysis results for further use
+---
+## 📈 Results & Insights
+### **Typical Customer Segments Identified**
+| Segment | Characteristics | Business Strategy |
+|---------|----------------|-------------------|
+| **💎 High Value** | High income, high spending | Premium products, VIP services |
+| **💼 Conservative** | High income, low spending | Upselling, value propositions |
+| **🎯 Budget Spenders** | Low income, high spending | Value-based offerings, loyalty programs |
+| **📉 Low Engagement** | Low income, low spending | Retention strategies, engagement campaigns |
+| **⚖️ Balanced** | Moderate income and spending | Personalized marketing, core offerings |
+### **Performance Metrics**
+The analysis provides comprehensive evaluation metrics:
+- **Silhouette Score**: Measures cluster cohesion and separation (ranges from -1 to 1, higher is better)
+- **Calinski-Harabasz Score**: Evaluates cluster definition quality
+- **Inertia**: Within-cluster sum of squares for K-Means
+- **Number of Clusters**: Optimal cluster count determined automatically
+- **Noise Points**: Outlier detection in DBSCAN
+### **Business Recommendations**
+Based on clustering results, the application provides:
+- **Marketing Strategies**: Segment-specific campaign recommendations
+- **Product Positioning**: Align products with cluster preferences
+- **Pricing Strategies**: Dynamic pricing based on segment characteristics
+- **Customer Retention**: Targeted programs for each segment
+- **Growth Opportunities**: Cross-selling and upselling strategies
+---
+## 🎨 Screenshots
+### **Main Dashboard**
+![Dashboard](https://via.placeholder.com/800x400/0F172A/E5E7EB?text=Main+Dashboard)
+### **Data Exploration**
+![Data Exploration](https://via.placeholder.com/800x400/0F172A/E5E7EB?text=Data+Exploration)
+### **Clustering Results**
+![Clustering](https://via.placeholder.com/800x400/0F172A/E5E7EB?text=Clustering+Results)
+### **Business Insights**
+![Insights](https://via.placeholder.com/800x400/0F172A/E5E7EB?text=Business+Insights)
+---
+## ⚙️ Configuration
+### **Customizing Clustering Parameters**
+#### **K-Means Parameters**
+```python
+# In the application interface
+n_clusters = 5  # Number of clusters
+random_state = 42  # For reproducible results
+```
+#### **DBSCAN Parameters**
+```python
+eps = 0.5  # Neighborhood distance
+min_samples = 5  # Minimum neighbors (including the point itself) for a core point
+```
+### **Feature Selection**
+```python
+# Default features for clustering
+features = ['Annual Income (k$)', 'Spending Score (1-100)']
+# Custom feature selection
+features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
+```
+### **Visualization Settings**
+```python
+# Color schemes
+colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7']
+# Chart dimensions
+height = 450
+width = '100%'
+```
+---
+## 🤝 Contributing
+We welcome contributions! Here's how you can help:
+### **How to Contribute**
+1. **Fork the repository**
+2. **Create a feature branch**
+   ```bash
+   git checkout -b feature/amazing-feature
+   ```
+3. **Make your changes**
+4. **Test thoroughly**
+5. **Commit your changes**
+   ```bash
+   git commit -m 'Add amazing feature'
+   ```
+6. **Push to the branch**
+   ```bash
+   git push origin feature/amazing-feature
+   ```
+7. **Open a Pull Request**
+### **Areas for Improvement**
+- **Additional Algorithms**: Hierarchical clustering, Gaussian Mixture Models
+- **Enhanced Visualizations**: 3D plots, interactive dashboards
+- **Advanced Analytics**: Customer lifetime value, churn prediction
+- **Performance Optimization**: Faster processing for large datasets
+- **Mobile Experience**: Improved mobile interface
+- **API Integration**: REST API for programmatic access
+### **Bug Reports**
+Please use the [GitHub Issues](https://github.com/yourusername/customer-segmentation/issues) page to report bugs or request features.
+---
+## 📝 License
+This project is licensed under the **MIT License** - see the [LICENSE](LICENSE) file for details.
+### **MIT License Summary**
+- ✅ **Commercial Use**: Allowed
+- ✅ **Modification**: Allowed
+- ✅ **Distribution**: Allowed
+- ✅ **Private Use**: Allowed
+- ❌ **Liability**: Limited
+- ❌ **Warranty**: None
+---
+## 🙏 Acknowledgments
+- **Dataset Source**: [Kaggle Mall Customer Segmentation](https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python)
+- **Streamlit**: For the amazing web application framework
+- **Scikit-learn**: For robust machine learning algorithms
+- **Plotly**: For beautiful interactive visualizations
+- **Open Source Community**: For inspiration and support
+---
+## 📞 Support & Contact
+- **Live Application**: [Customer Segmentation Analysis](https://customer-segmentation-mqnhet38emja8xtgffpzjt.streamlit.app/)
+- **GitHub Repository**: [Customer Segmentation](https://github.com/yourusername/customer-segmentation)
+- **Issues**: [GitHub Issues](https://github.com/yourusername/customer-segmentation/issues)
+- **Email**: your.email@example.com
+---
+<div align="center">
+**🎯 Happy Clustering! 📊**
+[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://customer-segmentation-mqnhet38emja8xtgffpzjt.streamlit.app/)
+*Made with ❤️ using Streamlit and Python*
+</div>
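The README's preprocessing step normalizes features with StandardScaler before clustering. As a rough, dependency-free sketch of what that scaling computes (subtract the column mean, divide by the population standard deviation), the snippet below uses a hypothetical helper named `standardize` and the first five rows of the dataset. It is an illustration only, not code from this commit.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return z-scores for one feature column (hypothetical helper).

    Uses the population standard deviation, the same convention that
    scikit-learn's StandardScaler follows.
    """
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


# Annual income (k$) and spending score for the first five customers.
income = [15, 15, 16, 16, 17]
spending = [39, 81, 6, 77, 40]

scaled_income = standardize(income)
scaled_spending = standardize(spending)
```

After scaling, each column has mean 0 and unit variance, so income (15-137) and spending score (1-100) contribute on a comparable scale to the Euclidean distances that K-Means and DBSCAN rely on.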

data/Mall_Customers.csv ADDED

	@@ -0,0 +1,201 @@

+CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
+1,Male,19,15,39
+2,Male,21,15,81
+3,Female,20,16,6
+4,Female,23,16,77
+5,Female,31,17,40
+6,Female,22,17,76
+7,Female,35,18,6
+8,Female,23,18,94
+9,Male,64,19,3
+10,Female,30,19,72
+11,Male,67,19,14
+12,Female,35,19,99
+13,Female,58,20,15
+14,Female,24,20,77
+15,Male,37,20,13
+16,Male,22,20,79
+17,Female,35,21,35
+18,Male,20,21,66
+19,Male,52,23,29
+20,Female,35,23,98
+21,Male,35,24,35
+22,Male,25,24,73
+23,Female,46,25,5
+24,Male,31,25,73
+25,Female,54,28,14
+26,Male,29,28,82
+27,Female,45,28,32
+28,Male,35,28,61
+29,Female,40,29,31
+30,Female,23,29,87
+31,Male,60,30,4
+32,Female,21,30,73
+33,Male,53,33,4
+34,Male,18,33,92
+35,Female,49,33,14
+36,Female,21,33,81
+37,Female,42,34,17
+38,Female,30,34,73
+39,Female,36,37,26
+40,Female,20,37,75
+41,Female,65,38,35
+42,Male,24,38,92
+43,Male,48,39,36
+44,Female,31,39,61
+45,Female,49,39,28
+46,Female,24,39,65
+47,Female,50,40,55
+48,Female,27,40,47
+49,Female,29,40,42
+50,Female,31,40,42
+51,Female,49,42,52
+52,Male,33,42,60
+53,Female,31,43,54
+54,Male,59,43,60
+55,Female,50,43,45
+56,Male,47,43,41
+57,Female,51,44,50
+58,Male,69,44,46
+59,Female,27,46,51
+60,Male,53,46,46
+61,Male,70,46,56
+62,Male,19,46,55
+63,Female,67,47,52
+64,Female,54,47,59
+65,Male,63,48,51
+66,Male,18,48,59
+67,Female,43,48,50
+68,Female,68,48,48
+69,Male,19,48,59
+70,Female,32,48,47
+71,Male,70,49,55
+72,Female,47,49,42
+73,Female,60,50,49
+74,Female,60,50,56
+75,Male,59,54,47
+76,Male,26,54,54
+77,Female,45,54,53
+78,Male,40,54,48
+79,Female,23,54,52
+80,Female,49,54,42
+81,Male,57,54,51
+82,Male,38,54,55
+83,Male,67,54,41
+84,Female,46,54,44
+85,Female,21,54,57
+86,Male,48,54,46
+87,Female,55,57,58
+88,Female,22,57,55
+89,Female,34,58,60
+90,Female,50,58,46
+91,Female,68,59,55
+92,Male,18,59,41
+93,Male,48,60,49
+94,Female,40,60,40
+95,Female,32,60,42
+96,Male,24,60,52
+97,Female,47,60,47
+98,Female,27,60,50
+99,Male,48,61,42
+100,Male,20,61,49
+101,Female,23,62,41
+102,Female,49,62,48
+103,Male,67,62,59
+104,Male,26,62,55
+105,Male,49,62,56
+106,Female,21,62,42
+107,Female,66,63,50
+108,Male,54,63,46
+109,Male,68,63,43
+110,Male,66,63,48
+111,Male,65,63,52
+112,Female,19,63,54
+113,Female,38,64,42
+114,Male,19,64,46
+115,Female,18,65,48
+116,Female,19,65,50
+117,Female,63,65,43
+118,Female,49,65,59
+119,Female,51,67,43
+120,Female,50,67,57
+121,Male,27,67,56
+122,Female,38,67,40
+123,Female,40,69,58
+124,Male,39,69,91
+125,Female,23,70,29
+126,Female,31,70,77
+127,Male,43,71,35
+128,Male,40,71,95
+129,Male,59,71,11
+130,Male,38,71,75
+131,Male,47,71,9
+132,Male,39,71,75
+133,Female,25,72,34
+134,Female,31,72,71
+135,Male,20,73,5
+136,Female,29,73,88
+137,Female,44,73,7
+138,Male,32,73,73
+139,Male,19,74,10
+140,Female,35,74,72
+141,Female,57,75,5
+142,Male,32,75,93
+143,Female,28,76,40
+144,Female,32,76,87
+145,Male,25,77,12
+146,Male,28,77,97
+147,Male,48,77,36
+148,Female,32,77,74
+149,Female,34,78,22
+150,Male,34,78,90
+151,Male,43,78,17
+152,Male,39,78,88
+153,Female,44,78,20
+154,Female,38,78,76
+155,Female,47,78,16
+156,Female,27,78,89
+157,Male,37,78,1
+158,Female,30,78,78
+159,Male,34,78,1
+160,Female,30,78,73
+161,Female,56,79,35
+162,Female,29,79,83
+163,Male,19,81,5
+164,Female,31,81,93
+165,Male,50,85,26
+166,Female,36,85,75
+167,Male,42,86,20
+168,Female,33,86,95
+169,Female,36,87,27
+170,Male,32,87,63
+171,Male,40,87,13
+172,Male,28,87,75
+173,Male,36,87,10
+174,Male,36,87,92
+175,Female,52,88,13
+176,Female,30,88,86
+177,Male,58,88,15
+178,Male,27,88,69
+179,Male,59,93,14
+180,Male,35,93,90
+181,Female,37,97,32
+182,Female,32,97,86
+183,Male,46,98,15
+184,Female,29,98,88
+185,Female,41,99,39
+186,Male,30,99,97
+187,Female,54,101,24
+188,Male,28,101,68
+189,Female,41,103,17
+190,Female,36,103,85
+191,Female,34,103,23
+192,Female,32,103,69
+193,Male,33,113,8
+194,Female,38,113,91
+195,Female,47,120,16
+196,Female,35,120,79
+197,Female,45,126,28
+198,Male,32,126,74
+199,Male,32,137,18
+200,Male,30,137,83
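A quick sanity check of the file's schema can be sketched with the standard library alone. The snippet below embeds the header and the first three rows shown above instead of reading `data/Mall_Customers.csv`, so it runs anywhere. `validate_rows` and `EXPECTED_COLUMNS` are hypothetical names, not part of the repository.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "CustomerID", "Gender", "Age",
    "Annual Income (k$)", "Spending Score (1-100)",
]

# Header plus the first three data rows of the CSV above.
SAMPLE = """CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
"""


def validate_rows(text):
    """Parse CSV text and return rows with numeric fields cast to int.

    Raises ValueError on an unexpected header, an empty cell, or a
    spending score outside 1-100.
    """
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for record in reader:
        if any(value is None or value == "" for value in record.values()):
            raise ValueError(f"missing value in row: {record}")
        row = {
            "CustomerID": int(record["CustomerID"]),
            "Gender": record["Gender"],
            "Age": int(record["Age"]),
            "Annual Income (k$)": int(record["Annual Income (k$)"]),
            "Spending Score (1-100)": int(record["Spending Score (1-100)"]),
        }
        if not 1 <= row["Spending Score (1-100)"] <= 100:
            raise ValueError(f"spending score out of range: {row}")
        rows.append(row)
    return rows


rows = validate_rows(SAMPLE)
```

To check the real file, the same function could be given the text of `data/Mall_Customers.csv`; per the README it should yield 200 rows with no missing values.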

requirements.txt ADDED

	@@ -0,0 +1,8 @@

+streamlit>=1.28.0
+pandas>=2.0.0
+numpy>=1.24.0
+matplotlib>=3.7.0
+seaborn>=0.12.0
+scikit-learn>=1.3.0
+plotly>=5.15.0
+pyyaml>=6.0

run_app.py ADDED

	@@ -0,0 +1,51 @@

+#!/usr/bin/env python3
+"""
+Customer Segmentation App Launcher
+=================================
+Launch script for the Customer Segmentation Streamlit application.
+"""
+import subprocess
+import sys
+import os
+def main():
+    """Launch the Streamlit application."""
+    # Change to the project directory
+    project_dir = os.path.dirname(os.path.abspath(__file__))
+    os.chdir(project_dir)
+    # Path to the main Streamlit app
+    app_path = os.path.join("streamlit_app", "main.py")
+    # Check if the app file exists
+    if not os.path.exists(app_path):
+        print(f"❌ Error: Streamlit app not found at {app_path}")
+        sys.exit(1)
+    print("🚀 Launching Customer Segmentation App...")
+    print(f"📂 Project directory: {project_dir}")
+    print(f"🎯 App path: {app_path}")
+    print("-" * 50)
+    try:
+        # Launch Streamlit
+        subprocess.run([
+            sys.executable, "-m", "streamlit", "run", app_path,
+            "--server.address", "localhost",
+            "--server.port", "8501",
+            "--browser.gatherUsageStats", "false"
+        ], check=True)
+    except subprocess.CalledProcessError as e:
+        print(f"❌ Error launching Streamlit: {e}")
+        sys.exit(1)
+    except KeyboardInterrupt:
+        print("\n👋 Application stopped by user.")
+    except Exception as e:
+        print(f"❌ Unexpected error: {e}")
+        sys.exit(1)
+if __name__ == "__main__":
+    main()

src/__init__.py ADDED

	@@ -0,0 +1 @@


+# Customer Segmentation Package

src/__pycache__/__init__.cpython-39.pyc ADDED

Binary file (174 Bytes)

src/__pycache__/clustering.cpython-312.pyc ADDED

Binary file (10.3 kB)

src/__pycache__/clustering.cpython-39.pyc ADDED

Binary file (6.58 kB)

src/__pycache__/data_loader.cpython-312.pyc ADDED

Binary file (6.97 kB)

src/__pycache__/data_loader.cpython-39.pyc ADDED

Binary file (4.58 kB)

src/__pycache__/visualizations.cpython-312.pyc ADDED

Binary file (15.8 kB)

src/__pycache__/visualizations.cpython-39.pyc ADDED

Binary file (17.2 kB)

src/clustering.py ADDED

	@@ -0,0 +1,260 @@

+"""
+Clustering Analysis Module
+=========================
+This module implements various clustering algorithms for customer segmentation.
+"""
+import numpy as np
+import pandas as pd
+from sklearn.cluster import KMeans, DBSCAN
+from sklearn.metrics import silhouette_score, calinski_harabasz_score
+import streamlit as st
+class ClusteringAnalyzer:
+    """
+    Handles clustering analysis for customer segmentation.
+    """
+    def __init__(self):
+        self.kmeans_model = None
+        self.dbscan_model = None
+        self.optimal_clusters = None
+        self.optimization_results = None  # Populated by find_optimal_clusters()
+        self.cluster_labels = {}
+    def find_optimal_clusters(self, scaled_data, max_clusters=10):
+        """Find optimal number of clusters using multiple methods."""
+        if scaled_data is None:
+            st.error("No scaled data available. Please preprocess data first.")
+            return None
+        cluster_range = range(2, max_clusters + 1)
+        inertias = []
+        silhouette_scores = []
+        calinski_scores = []
+        progress_bar = st.progress(0)
+        status_text = st.empty()
+        for i, k in enumerate(cluster_range):
+            status_text.text(f'Evaluating {k} clusters...')
+            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
+            cluster_labels = kmeans.fit_predict(scaled_data)
+            inertias.append(kmeans.inertia_)
+            silhouette_scores.append(silhouette_score(scaled_data, cluster_labels))
+            calinski_scores.append(calinski_harabasz_score(scaled_data, cluster_labels))
+            progress_bar.progress((i + 1) / len(cluster_range))
+        status_text.text('Optimization complete!')
+        # Pick the best k under each metric (higher is better for both scores)
+        optimal_silhouette = cluster_range[np.argmax(silhouette_scores)]
+        optimal_calinski = cluster_range[np.argmax(calinski_scores)]
+        # Store results
+        self.optimization_results = {
+            'cluster_range': list(cluster_range),
+            'inertias': inertias,
+            'silhouette_scores': silhouette_scores,
+            'calinski_scores': calinski_scores,
+            'optimal_silhouette': optimal_silhouette,
+            'optimal_calinski': optimal_calinski
+        }
+        self.optimal_clusters = optimal_silhouette
+        st.success(f"✅ Optimal clusters found: {self.optimal_clusters} (based on Silhouette Score)")
+        return self.optimization_results
+    def apply_kmeans(self, scaled_data, n_clusters=None):
+        """Apply K-Means clustering."""
+        if scaled_data is None:
+            st.error("No scaled data available. Please preprocess data first.")
+            return None
+        if n_clusters is None:
+            n_clusters = self.optimal_clusters or 5
+        with st.spinner(f'Applying K-Means clustering with {n_clusters} clusters...'):
+            self.kmeans_model = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
+            kmeans_labels = self.kmeans_model.fit_predict(scaled_data)
+        # Calculate metrics
+        silhouette_avg = silhouette_score(scaled_data, kmeans_labels)
+        calinski_score = calinski_harabasz_score(scaled_data, kmeans_labels)
+        self.cluster_labels['kmeans'] = kmeans_labels
+        results = {
+            'labels': kmeans_labels,
+            'n_clusters': n_clusters,
+            'silhouette_score': silhouette_avg,
+            'calinski_score': calinski_score,
+            'inertia': self.kmeans_model.inertia_,
+            'centers': self.kmeans_model.cluster_centers_
+        }
+        st.success("✅ K-Means clustering completed!")
+        st.info(f"Silhouette Score: {silhouette_avg:.3f} | Calinski-Harabasz Score: {calinski_score:.3f}")
+        return results
+    def apply_dbscan(self, scaled_data, eps=0.5, min_samples=5):
+        """Apply DBSCAN clustering."""
+        if scaled_data is None:
+            st.error("No scaled data available. Please preprocess data first.")
+            return None
+        with st.spinner(f'Applying DBSCAN clustering (eps={eps}, min_samples={min_samples})...'):
+            self.dbscan_model = DBSCAN(eps=eps, min_samples=min_samples)
+            dbscan_labels = self.dbscan_model.fit_predict(scaled_data)
+        # Calculate metrics
+        n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
+        n_noise = list(dbscan_labels).count(-1)
+        self.cluster_labels['dbscan'] = dbscan_labels
+        results = {
+            'labels': dbscan_labels,
+            'n_clusters': n_clusters,
+            'n_noise': n_noise,
+            'eps': eps,
+            'min_samples': min_samples
+        }
+        # Calculate silhouette score only if we have more than 1 cluster and non-noise points
+        if n_clusters > 1:
+            non_noise_mask = dbscan_labels != -1
+            if np.sum(non_noise_mask) > 1:
+                silhouette_avg = silhouette_score(scaled_data[non_noise_mask],
+                                                dbscan_labels[non_noise_mask])
+                results['silhouette_score'] = silhouette_avg
+        st.success("✅ DBSCAN clustering completed!")
+        st.info(f"Clusters: {n_clusters} | Noise points: {n_noise}")
+        return results
+    def analyze_clusters(self, data, algorithm='kmeans'):
+        """Analyze cluster characteristics."""
+        # Normalize algorithm name
+        algo_key = algorithm.lower().replace('-', '').replace(' ', '')
+        if algo_key not in self.cluster_labels:
+            st.error(f"No {algorithm} clustering results found. Please run clustering first.")
+            return None
+        cluster_labels = self.cluster_labels[algo_key]
+        # Create consistent column name (use the format that actually gets created)
+        if algo_key == 'kmeans':
+            cluster_col = 'Kmeans_Cluster'  # Capitalized as 'Kmeans', not 'KMeans'
+        elif algo_key == 'dbscan':
+            cluster_col = 'DBSCAN_Cluster'
+        else:
+            cluster_col = f'{algorithm}_Cluster'
+        # Add cluster labels to data
+        analysis_data = data.copy()
+        analysis_data[cluster_col] = cluster_labels
+        # Calculate cluster statistics
+        numeric_cols = analysis_data.select_dtypes(include=[np.number]).columns
+        numeric_cols = [col for col in numeric_cols if not col.endswith('_Cluster')]
+        cluster_stats = analysis_data.groupby(cluster_col)[numeric_cols].agg(['mean', 'std', 'count'])
+        # Calculate spending analysis if available
+        spending_analysis = None
+        if 'Spending Score (1-100)' in analysis_data.columns:
+            spending_analysis = analysis_data.groupby(cluster_col)['Spending Score (1-100)'].agg(['mean', 'std', 'min', 'max', 'count'])
+        results = {
+            'data_with_clusters': analysis_data,
+            'cluster_stats': cluster_stats,
+            'spending_analysis': spending_analysis,
+            'cluster_distribution': analysis_data[cluster_col].value_counts().sort_index()
+        }
+        return results
+    def get_cluster_profiles(self, data, algorithm='kmeans'):
+        """Generate customer profiles for each cluster."""
+        # Normalize algorithm name
+        algo_key = algorithm.lower().replace('-', '').replace(' ', '')
+        if algo_key not in self.cluster_labels:
+            return None
+        cluster_labels = self.cluster_labels[algo_key]
+        # Create consistent column name (use the format that actually gets created)
+        if algo_key == 'kmeans':
+            cluster_col = 'Kmeans_Cluster'  # Capitalized as 'Kmeans', not 'KMeans'
+        elif algo_key == 'dbscan':
+            cluster_col = 'DBSCAN_Cluster'
+        else:
+            cluster_col = f'{algorithm}_Cluster'
+        analysis_data = data.copy()
+        analysis_data[cluster_col] = cluster_labels
+        profiles = []
+        for cluster in sorted(analysis_data[cluster_col].unique()):
+            if cluster == -1:  # Skip noise points in DBSCAN
+                continue
+            cluster_data = analysis_data[analysis_data[cluster_col] == cluster]
+            profile = {
+                'cluster': cluster,
+                'size': len(cluster_data),
+                'percentage': len(cluster_data) / len(analysis_data) * 100
+            }
+            # Add feature statistics
+            if 'Age' in cluster_data.columns:
+                profile['avg_age'] = cluster_data['Age'].mean()
+                profile['age_std'] = cluster_data['Age'].std()
+            if 'Annual Income (k$)' in cluster_data.columns:
+                profile['avg_income'] = cluster_data['Annual Income (k$)'].mean()
+                profile['income_std'] = cluster_data['Annual Income (k$)'].std()
+            if 'Spending Score (1-100)' in cluster_data.columns:
+                profile['avg_spending'] = cluster_data['Spending Score (1-100)'].mean()
+                profile['spending_std'] = cluster_data['Spending Score (1-100)'].std()
+            if 'Gender' in cluster_data.columns:
+                profile['gender_dist'] = cluster_data['Gender'].value_counts().to_dict()
+            # Generate profile characterization
+            if 'avg_income' in profile and 'avg_spending' in profile:
+                avg_income = profile['avg_income']
+                avg_spending = profile['avg_spending']
+                if avg_income > 70 and avg_spending > 70:
+                    profile['type'] = "💎 HIGH VALUE"
+                    profile['description'] = "High income, high spending - Premium customers"
+                elif avg_income > 70 and avg_spending < 40:
+                    profile['type'] = "💼 CONSERVATIVE"
+                    profile['description'] = "High income, low spending - Potential for upselling"
+                elif avg_income < 40 and avg_spending > 70:
+                    profile['type'] = "🎯 BUDGET SPENDERS"
+                    profile['description'] = "Low income, high spending - Price-sensitive loyal customers"
+                elif avg_income < 40 and avg_spending < 40:
+                    profile['type'] = "📉 LOW ENGAGEMENT"
+                    profile['description'] = "Low income, low spending - Need retention strategies"
+                else:
+                    profile['type'] = "⚖️ BALANCED"
+                    profile['description'] = "Moderate income and spending - Core customer base"
+            profiles.append(profile)
+        return profiles
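Reviewer note: the income/spending rules in `get_cluster_profiles` are easier to reason about when pulled out of the loop. The following is a minimal sketch, not code from this commit: `classify_profile` is a hypothetical helper, its `high`/`low` parameters are assumed names for the 70 and 40 thresholds used above, and the emoji prefixes on the labels are omitted for brevity.

```python
def classify_profile(avg_income, avg_spending, high=70, low=40):
    """Return a segment label for a cluster's mean income (k$) and spending score.

    Hypothetical helper that mirrors the if/elif chain in get_cluster_profiles.
    """
    if avg_income > high and avg_spending > high:
        return "HIGH VALUE"
    if avg_income > high and avg_spending < low:
        return "CONSERVATIVE"
    if avg_income < low and avg_spending > high:
        return "BUDGET SPENDERS"
    if avg_income < low and avg_spending < low:
        return "LOW ENGAGEMENT"
    # Anything between (or exactly on) the thresholds falls through to here
    return "BALANCED"


labels = [
    classify_profile(86.5, 82.1),  # well above both thresholds
    classify_profile(70, 80),      # income sits exactly on the threshold
    classify_profile(25, 80),      # low income, high spending
]
print(labels)
```

Because the comparisons are strict, a cluster whose mean lands exactly on 70 or 40 is labeled BALANCED rather than being assigned to one of the four corner segments.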

src/data_loader.py ADDED

	@@ -0,0 +1,151 @@

+"""
+Data Loading and Preprocessing Module
+====================================
+This module handles data loading, preprocessing, and validation for customer segmentation.
+"""
+import pandas as pd
+import numpy as np
+import os
+from sklearn.preprocessing import StandardScaler
+import streamlit as st
+class DataLoader:
+    """
+    Handles data loading and preprocessing for customer segmentation analysis.
+    """
+    def __init__(self):
+        self.data = None
+        self.scaled_data = None
+        self.scaler = StandardScaler()
+        self.feature_names = None
+    def create_sample_dataset(self, n_customers=200):
+        """Create a realistic sample Mall Customers dataset."""
+        np.random.seed(42)
+        customer_ids = range(1, n_customers + 1)
+        # Gender distribution (approximately 56% Female, 44% Male)
+        genders = np.random.choice(['Male', 'Female'], n_customers, p=[0.44, 0.56])
+        # Age distribution (mean ~39, std ~14)
+        ages = np.random.normal(38.85, 13.97, n_customers).astype(int)
+        ages = np.clip(ages, 18, 70)
+        # Create realistic income distribution (mean ~61k, std ~26k)
+        annual_incomes = np.random.normal(60.56, 26.26, n_customers)
+        annual_incomes = np.clip(annual_incomes, 15, 137)
+        # Create spending scores with realistic patterns
+        base_spending = np.random.normal(50, 25, n_customers)
+        # Add some income correlation
+        income_normalized = (annual_incomes - annual_incomes.min()) / (annual_incomes.max() - annual_incomes.min())
+        income_effect = (income_normalized - 0.5) * 30
+        # Add age effect
+        age_normalized = (ages - ages.min()) / (ages.max() - ages.min())
+        age_effect = np.where(age_normalized < 0.3, 10,
+                             np.where(age_normalized > 0.7, -5, 0))
+        spending_scores = base_spending + income_effect * 0.6 + age_effect + np.random.normal(0, 10, n_customers)
+        spending_scores = np.clip(spending_scores, 1, 100)
+        # Create DataFrame
+        sample_data = pd.DataFrame({
+            'CustomerID': customer_ids,
+            'Gender': genders,
+            'Age': ages,
+            'Annual Income (k$)': annual_incomes.round().astype(int),
+            'Spending Score (1-100)': spending_scores.round().astype(int)
+        })
+        return sample_data
+    def load_data(self, file_path=None):
+        """Load customer data from file or create sample data."""
+        # Fallback location of the bundled dataset, used only when no valid file_path is given
+        default_path = os.path.join("data", "Mall_Customers.csv")
+        if file_path and os.path.exists(file_path):
+            try:
+                self.data = pd.read_csv(file_path)
+                st.success(f"✅ Data loaded successfully from {file_path}")
+                return self.data
+            except Exception as e:
+                st.error(f"Error loading data: {e}")
+                return None
+        elif os.path.exists(default_path):
+            try:
+                self.data = pd.read_csv(default_path)
+                st.success(f"✅ Mall Customers dataset loaded from {default_path}")
+                return self.data
+            except Exception as e:
+                st.error(f"Error loading default dataset: {e}")
+                return None
+        else:
+            # Create sample data
+            self.data = self.create_sample_dataset()
+            st.info("📊 Using generated sample dataset (Mall Customer simulation)")
+            # Save the sample data for future use
+            try:
+                os.makedirs("data", exist_ok=True)
+                self.data.to_csv(default_path, index=False)
+                st.info(f"💾 Sample dataset saved to {default_path}")
+            except Exception as e:
+                st.warning(f"Could not save sample dataset: {e}")
+            return self.data
+    def get_data_info(self):
+        """Get comprehensive data information."""
+        if self.data is None:
+            return None
+        info = {
+            'shape': self.data.shape,
+            'columns': list(self.data.columns),
+            'dtypes': self.data.dtypes.to_dict(),
+            'missing_values': self.data.isnull().sum().to_dict(),
+            'statistics': self.data.describe().to_dict()
+        }
+        return info
+    def preprocess_data(self, features=None):
+        """Preprocess and scale data for clustering."""
+        if self.data is None:
+            st.error("No data loaded. Please load data first.")
+            return None
+        # Default features for clustering
+        if features is None:
+            features = ['Annual Income (k$)', 'Spending Score (1-100)']
+        # Check if features exist in data
+        available_features = [f for f in features if f in self.data.columns]
+        if not available_features:
+            st.error(f"None of the specified features {features} found in data.")
+            return None
+        # Extract features for clustering
+        X = self.data[available_features].copy()
+        # Handle missing values if any
+        if X.isnull().sum().sum() > 0:
+            X = X.fillna(X.mean())
+            st.warning("Missing values filled with mean values.")
+        # Scale the features
+        self.scaled_data = self.scaler.fit_transform(X)
+        self.feature_names = available_features
+        st.success(f"✅ Data preprocessed successfully using features: {available_features}")
+        return self.scaled_data
+    def get_feature_data(self):
+        """Get the original feature data."""
+        if self.data is None or self.feature_names is None:
+            return None
+        return self.data[self.feature_names]
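Reviewer note: `preprocess_data` fits a `StandardScaler` and returns the scaled array, so anything computed in that space (such as cluster centers) has to be mapped back with `inverse_transform` before it is shown in original units. The sketch below illustrates that round trip on a small made-up DataFrame; the values are illustrative only and are not taken from `Mall_Customers.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

features = ['Annual Income (k$)', 'Spending Score (1-100)']

# Made-up rows, used only to demonstrate the transform
df = pd.DataFrame({
    'Annual Income (k$)': [15, 40, 60, 90, 120],
    'Spending Score (1-100)': [80, 20, 50, 10, 90],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df[features])

# After scaling, each column has mean 0 and (population) standard deviation 1
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)

# inverse_transform maps scaled points back to k$ and score units
restored = scaler.inverse_transform(scaled)

print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
print(np.allclose(restored, df[features].to_numpy()))
```

The same scaler instance must be used for both directions, which is why `DataLoader` keeps it on `self.scaler` rather than creating a new one per call.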

src/visualizations.py ADDED

	@@ -0,0 +1,780 @@

+"""
+Visualization Module
+===================
+This module handles all visualization components for the customer segmentation analysis.
+"""
+# Matplotlib and Seaborn removed to avoid extra dependency
+# All charts use Plotly for interactive visualization
+import plotly.express as px
+import plotly.graph_objects as go
+from plotly.subplots import make_subplots
+import plotly.io as pio
+import pandas as pd
+import numpy as np
+import streamlit as st
+# Global Plotly template: dark backgrounds to match app theme
+pio.templates.default = "plotly_dark"
+pio.templates["plotly_dark"].layout.update(
+    paper_bgcolor="#0F172A",
+    plot_bgcolor="#0F172A",
+    font=dict(color="#E5E7EB")
+)
+# Plot styling handled via Plotly theme settings per figure
+class Visualizer:
+    """
+    Handles all visualizations for customer segmentation analysis.
+    """
+    def __init__(self):
+        # Enhanced color palettes for better visual appeal
+        self.colors = px.colors.qualitative.Set1  # More vibrant colors
+        self.gradient_colors = [
+            '#FF6B6B',  # Coral Red
+            '#4ECDC4',  # Turquoise
+            '#45B7D1',  # Sky Blue
+            '#96CEB4',  # Mint Green
+            '#FFEAA7',  # Warm Yellow
+            '#DDA0DD',  # Plum
+            '#98D8C8',  # Seafoam
+            '#F7DC6F',  # Golden Yellow
+            '#BB8FCE',  # Lavender
+            '#85C1E9'   # Light Blue
+        ]
+        self.modern_colors = [
+            '#6C5CE7',  # Purple
+            '#00B894',  # Green
+            '#E17055',  # Orange
+            '#0984E3',  # Blue
+            '#FDCB6E',  # Yellow
+            '#E84393',  # Pink
+            '#00CEC9',  # Cyan
+            '#A29BFE',  # Light Purple
+            '#FD79A8',  # Light Pink
+            '#81ECEC'   # Light Cyan
+        ]
+    def plot_data_exploration(self, data):
+        """Create comprehensive data exploration plots with enhanced styling."""
+        if data is None:
+            st.error("❌ No data available for visualization.")
+            return
+        # Debug: Show data info
+        st.info(f"🔍 **Data shape:** {data.shape}")
+        st.info(f"🔍 **Data columns:** {list(data.columns)}")
+        st.subheader("📊 Data Distribution Analysis")
+        # Create subplots for different visualizations
+        col1, col2 = st.columns(2)
+        with col1:
+            # Age distribution with enhanced styling
+            if 'Age' in data.columns:
+                st.write("📊 Creating Age distribution plot...")
+                fig_age = px.histogram(
+                    data, x='Age', nbins=20,
+                    title='👥 Age Distribution',
+                    color_discrete_sequence=[self.gradient_colors[0]]
+                )
+                fig_age.update_layout(
+                    height=450,
+                    title=dict(font=dict(size=18, color='#E5E7EB'), x=0.5),
+                    plot_bgcolor='#0F172A',
+                    paper_bgcolor='#0F172A',
+                    xaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB')),
+                    yaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB'))
+                )
+                fig_age.update_traces(marker=dict(line=dict(width=1, color='white')))
+                st.plotly_chart(fig_age, use_container_width=True, theme=None)
+                st.success("✅ Age distribution plot created!")
+            # Income distribution with enhanced styling
+            if 'Annual Income (k$)' in data.columns:
+                st.write("💰 Creating Income distribution plot...")
+                fig_income = px.histogram(
+                    data, x='Annual Income (k$)', nbins=20,
+                    title='💰 Annual Income Distribution',
+                    color_discrete_sequence=[self.gradient_colors[1]]
+                )
+                fig_income.update_layout(
+                    height=450,
+                    title=dict(font=dict(size=18, color='#E5E7EB'), x=0.5),
+                    plot_bgcolor='#0F172A',
+                    paper_bgcolor='#0F172A',
+                    xaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB')),
+                    yaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB'))
+                )
+                fig_income.update_traces(marker=dict(line=dict(width=1, color='white')))
+                st.plotly_chart(fig_income, use_container_width=True, theme=None)
+                st.success("✅ Income distribution plot created!")
+        with col2:
+            # Spending Score distribution with enhanced styling
+            if 'Spending Score (1-100)' in data.columns:
+                st.write("🛍️ Creating Spending Score distribution plot...")
+                fig_spending = px.histogram(
+                    data, x='Spending Score (1-100)', nbins=20,
+                    title='🛍️ Spending Score Distribution',
+                    color_discrete_sequence=[self.gradient_colors[2]]
+                )
+                fig_spending.update_layout(
+                    height=450,
+                    title=dict(font=dict(size=18, color='#E5E7EB'), x=0.5),
+                    plot_bgcolor='#0F172A',
+                    paper_bgcolor='#0F172A',
+                    xaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB')),
+                    yaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB'))
+                )
+                fig_spending.update_traces(marker=dict(line=dict(width=1, color='white')))
+                st.plotly_chart(fig_spending, use_container_width=True, theme=None)
+                st.success("✅ Spending Score distribution plot created!")
+            # Gender distribution with enhanced styling
+            if 'Gender' in data.columns:
+                gender_counts = data['Gender'].value_counts()
+                fig_gender = px.pie(
+                    values=gender_counts.values,
+                    names=gender_counts.index,
+                    title='👫 Gender Distribution',
+                    color_discrete_sequence=self.modern_colors[:len(gender_counts)]
+                )
+                fig_gender.update_layout(
+                    height=450,
+                    title=dict(font=dict(size=18, color='#E5E7EB'), x=0.5),
+                    plot_bgcolor='#0F172A',
+                    paper_bgcolor='#0F172A'
+                )
+                fig_gender.update_traces(
+                    textposition='inside',
+                    textinfo='percent+label',
+                    textfont_size=14,
+                    marker=dict(line=dict(color='white', width=2))
+                )
+                st.plotly_chart(fig_gender, theme=None, use_container_width=True)
+        # Enhanced correlation analysis
+        st.subheader("🔗 Feature Correlations")
+        numeric_cols = data.select_dtypes(include=[np.number]).columns
+        if len(numeric_cols) > 1:
+            corr_matrix = data[numeric_cols].corr()
+            fig_corr = px.imshow(
+                corr_matrix,
+                text_auto=True,
+                title='🔗 Feature Correlation Matrix',
+                color_continuous_scale='RdYlBu',
+                aspect='auto'
+            )
+            fig_corr.update_layout(
+                height=500,
+                title=dict(font=dict(size=18, color='#E5E7EB'), x=0.5),
+                plot_bgcolor='#0F172A',
+                paper_bgcolor='#0F172A',
+                font=dict(size=12, color='#E5E7EB')
+            )
+            fig_corr.update_traces(
+                textfont=dict(size=12, color='#E5E7EB'),
+                hoverongaps=False
+            )
+            st.plotly_chart(fig_corr, theme=None, use_container_width=True)
+        # Enhanced scatter plots
+        st.subheader("🔍 Feature Relationships")
+        col1, col2 = st.columns(2)
+        with col1:
+            if 'Annual Income (k$)' in data.columns and 'Spending Score (1-100)' in data.columns:
+                fig_scatter1 = px.scatter(
+                    data, x='Annual Income (k$)', y='Spending Score (1-100)',
+                    title='💰 Income vs Spending Score',
+                    hover_data=['Age'] if 'Age' in data.columns else None,
+                    color_discrete_sequence=[self.modern_colors[3]]
+                )
+                fig_scatter1.update_layout(
+                    height=450,
+                    title=dict(font=dict(size=18, color='#E5E7EB'), x=0.5),
+                    plot_bgcolor='#0F172A',
+                    paper_bgcolor='#0F172A',
+                    xaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB')),
+                    yaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB'))
+                )
+                fig_scatter1.update_traces(
+                    marker=dict(size=8, opacity=0.7, line=dict(width=1, color='white'))
+                )
+                st.plotly_chart(fig_scatter1, theme=None, use_container_width=True)
+        with col2:
+            if 'Age' in data.columns and 'Spending Score (1-100)' in data.columns:
+                fig_scatter2 = px.scatter(
+                    data, x='Age', y='Spending Score (1-100)',
+                    title='👥 Age vs Spending Score',
+                    hover_data=['Annual Income (k$)'] if 'Annual Income (k$)' in data.columns else None,
+                    color_discrete_sequence=[self.modern_colors[4]]
+                )
+                fig_scatter2.update_layout(
+                    height=450,
+                    title=dict(font=dict(size=18, color='#E5E7EB'), x=0.5),
+                    plot_bgcolor='#0F172A',
+                    paper_bgcolor='#0F172A',
+                    xaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB')),
+                    yaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB'))
+                )
+                fig_scatter2.update_traces(
+                    marker=dict(size=8, opacity=0.7, line=dict(width=1, color='white'))
+                )
+                st.plotly_chart(fig_scatter2, theme=None, use_container_width=True)
+    def plot_optimization_results(self, results):
+        """Plot cluster optimization results."""
+        if results is None:
+            st.error("No optimization results available.")
+            return
+        # Create subplots
+        fig = make_subplots(
+            rows=1, cols=3,
+            subplot_titles=('Elbow Method', 'Silhouette Score', 'Calinski-Harabasz Score'),
+            specs=[[{"secondary_y": False}, {"secondary_y": False}, {"secondary_y": False}]]
+        )
+        cluster_range = results['cluster_range']
+        # Elbow method
+        fig.add_trace(
+            go.Scatter(x=cluster_range, y=results['inertias'],
+                      mode='lines+markers', name='Inertia',
+                      line=dict(color='blue')),
+            row=1, col=1
+        )
+        # Silhouette score
+        fig.add_trace(
+            go.Scatter(x=cluster_range, y=results['silhouette_scores'],
+                      mode='lines+markers', name='Silhouette Score',
+                      line=dict(color='red')),
+            row=1, col=2
+        )
+        # Calinski-Harabasz score
+        fig.add_trace(
+            go.Scatter(x=cluster_range, y=results['calinski_scores'],
+                      mode='lines+markers', name='Calinski-Harabasz Score',
+                      line=dict(color='green')),
+            row=1, col=3
+        )
+        # Update layout
+        fig.update_layout(
+            title_text="Cluster Optimization Results",
+            height=400,
+            showlegend=False,
+            paper_bgcolor="#0F172A",
+            plot_bgcolor="#0F172A",
+            font=dict(color="#E5E7EB")
+        )
+        fig.update_xaxes(title_text="Number of Clusters")
+        fig.update_yaxes(title_text="Inertia", row=1, col=1)
+        fig.update_yaxes(title_text="Silhouette Score", row=1, col=2)
+        fig.update_yaxes(title_text="Calinski-Harabasz Score", row=1, col=3)
+        st.plotly_chart(fig, theme=None, use_container_width=True)
+        # Display optimal results
+        col1, col2, col3 = st.columns(3)
+        with col1:
+            st.metric("Optimal Clusters (Silhouette)", results['optimal_silhouette'])
+        with col2:
+            st.metric("Optimal Clusters (Calinski-Harabasz)", results['optimal_calinski'])
+        with col3:
+            st.metric("Recommended", results['optimal_silhouette'])
+    def plot_clusters(self, data, cluster_labels, algorithm='K-Means', scaler=None, centers=None):
+        """Plot cluster visualizations."""
+        if data is None or cluster_labels is None:
+            st.error("No data or cluster labels available for visualization.")
+            return
+        # Prepare data with clusters
+        plot_data = data.copy()
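+        # Cast labels to str so Plotly treats them as categories and applies color_discrete_sequence
+        plot_data['Cluster'] = pd.Series(cluster_labels, index=plot_data.index).astype(str)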
+        # Main clustering visualization
+        st.subheader(f"🎯 {algorithm} Clustering Results")
+        col1, col2 = st.columns(2)
+        with col1:
+            if 'Annual Income (k$)' in data.columns and 'Spending Score (1-100)' in data.columns:
+                fig_main = px.scatter(plot_data,
+                                    x='Annual Income (k$)',
+                                    y='Spending Score (1-100)',
+                                    color='Cluster',
+                                    title=f'{algorithm}: Income vs Spending Score',
+                                    hover_data=['Age'] if 'Age' in data.columns else None,
+                                    color_discrete_sequence=self.colors)
+                # Add cluster centers if available
+                if centers is not None and scaler is not None:
+                    centers_original = scaler.inverse_transform(centers)
+                    centers_df = pd.DataFrame(centers_original,
+                                            columns=['Annual Income (k$)', 'Spending Score (1-100)'])
+                    centers_df['Cluster'] = range(len(centers_df))
+                    fig_main.add_scatter(x=centers_df['Annual Income (k$)'],
+                                       y=centers_df['Spending Score (1-100)'],
+                                       mode='markers',
+                                       marker=dict(symbol='x', size=15, color='red', line=dict(width=2)),
+                                       name='Centers',
+                                       showlegend=True)
+                fig_main.update_layout(
+                    height=500,
+                    paper_bgcolor="#0F172A",
+                    plot_bgcolor="#0F172A",
+                    font=dict(color="#E5E7EB"),
+                    xaxis=dict(gridcolor="rgba(229,231,235,0.12)"),
+                    yaxis=dict(gridcolor="rgba(229,231,235,0.12)")
+                )
+                st.plotly_chart(fig_main, theme=None, use_container_width=True)
+        with col2:
+            if 'Age' in data.columns and 'Spending Score (1-100)' in data.columns:
+                fig_age = px.scatter(plot_data,
+                                   x='Age',
+                                   y='Spending Score (1-100)',
+                                   color='Cluster',
+                                   title=f'{algorithm}: Age vs Spending Score',
+                                   color_discrete_sequence=self.colors)
+                fig_age.update_layout(
+                    height=500,
+                    paper_bgcolor="#0F172A",
+                    plot_bgcolor="#0F172A",
+                    font=dict(color="#E5E7EB"),
+                    xaxis=dict(gridcolor="rgba(229,231,235,0.12)"),
+                    yaxis=dict(gridcolor="rgba(229,231,235,0.12)")
+                )
+                st.plotly_chart(fig_age, theme=None, use_container_width=True)
+        # Enhanced cluster distribution
+        st.subheader("📊 Cluster Distribution")
+        cluster_counts = pd.Series(cluster_labels).value_counts().sort_index()
+        fig_dist = px.bar(
+            x=cluster_counts.index, y=cluster_counts.values,
+            title='📊 Number of Customers per Cluster',
+            labels={'x': 'Cluster', 'y': 'Number of Customers'},
+            color=cluster_counts.values,
+            color_continuous_scale='Turbo'
+        )
+        fig_dist.update_layout(
+            height=450,
+            title=dict(font=dict(size=18, color='#E5E7EB'), x=0.5),
+            plot_bgcolor='#0F172A',
+            paper_bgcolor='#0F172A',
+            xaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB')),
+            yaxis=dict(gridcolor='rgba(229,231,235,0.12)', title_font=dict(size=14, color='#E5E7EB'))
+        )
+        fig_dist.update_traces(
+            marker=dict(line=dict(width=1, color='white'))
+        )
+        st.plotly_chart(fig_dist, theme=None, use_container_width=True)
+    def plot_cluster_analysis(self, analysis_results, algorithm='K-Means'):
+        """Plot detailed cluster analysis with enhanced visualizations."""
+        if analysis_results is None:
+            st.error("❌ No analysis results available.")
+            return
+        try:
+            data_with_clusters = analysis_results['data_with_clusters']
+            spending_analysis = analysis_results['spending_analysis']
+            # Locate the cluster column by name instead of assuming a fixed label
+            available_columns = list(data_with_clusters.columns)
+            st.info(f"🔍 **Available columns in data:** {available_columns}")
+            # Find ANY column that contains 'cluster' (case insensitive)
+            cluster_columns = [col for col in available_columns if 'cluster' in col.lower()]
+            st.info(f"🎯 **Found cluster columns:** {cluster_columns}")
+            if not cluster_columns:
+                st.error("❌ No cluster column found in the data!")
+                st.write("Available columns:", available_columns)
+                st.write("Please ensure clustering has been performed first.")
+                return
+            # Use the first cluster column found
+            cluster_col = cluster_columns[0]
+            st.success(f"✅ **Using cluster column:** `{cluster_col}`")
+            # Defensive check: cluster_col came from the column list above, so this should always pass
+            if cluster_col not in data_with_clusters.columns:
+                st.error(f"❌ Column `{cluster_col}` not found in data!")
+                st.write("This should not happen. Please report this bug.")
+                return
+            # Create a beautiful header with metrics
+            st.markdown(f"""
+            <div style="
+                background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+                padding: 2rem;
+                border-radius: 15px;
+                color: white;
+                text-align: center;
+                margin: 2rem 0;
+                box-shadow: 0 10px 25px rgba(0,0,0,0.1);
+            ">
+                <h2 style="margin: 0; font-size: 2.5rem; font-weight: 700;">📈 {algorithm} Cluster Analysis</h2>
+                <p style="margin: 0.5rem 0 0 0; font-size: 1.2rem; opacity: 0.9;">Interactive Cluster Visualization & Analysis</p>
+            </div>
+            """, unsafe_allow_html=True)
+            # Quick stats
+            num_clusters = len(data_with_clusters[cluster_col].unique())
+            total_customers = len(data_with_clusters)
+            metric_col1, metric_col2, metric_col3, metric_col4 = st.columns(4)
+            with metric_col1:
+                st.metric("🎯 Total Clusters", num_clusters)
+            with metric_col2:
+                st.metric("👥 Total Customers", total_customers)
+            with metric_col3:
+                avg_cluster_size = total_customers / num_clusters
+                st.metric("📊 Avg Cluster Size", f"{avg_cluster_size:.0f}")
+            with metric_col4:
+                if 'Spending Score (1-100)' in data_with_clusters.columns:
+                    avg_spending = data_with_clusters['Spending Score (1-100)'].mean()
+                    st.metric("💰 Avg Spending", f"{avg_spending:.1f}")
+            st.markdown("---")
+            # Enhanced Box plots with better styling
+            st.subheader("📊 Distribution Analysis")
+            col1, col2 = st.columns(2)
+            with col1:
+                if 'Spending Score (1-100)' in data_with_clusters.columns:
+                    # Convert cluster column to string to ensure proper categorical handling
+                    plot_data = data_with_clusters.copy()
+                    plot_data[cluster_col] = plot_data[cluster_col].astype(str)
+                    # DEBUG: Show exactly what we're passing to plotly
+                    st.write("🔍 **DEBUG - About to create box plot with:**")
+                    st.write(f"- x column: `{cluster_col}`")
+                    st.write(f"- Columns in plot_data: {list(plot_data.columns)}")
+                    st.write("- First few rows of plot_data:")
+                    st.dataframe(plot_data.head(3))
+                    fig_spending_box = px.box(
+                        plot_data,
+                        x=cluster_col,
+                        y='Spending Score (1-100)',
+                        title='💰 Spending Score Distribution by Cluster',
+                        color=cluster_col,
+                        color_discrete_sequence=self.modern_colors
+                    )
+                    # Enhanced styling for maximum visibility
+                    fig_spending_box.update_layout(
+                        height=600,
+                        title=dict(
+                            text='💰 Spending Score Distribution by Cluster',
+                            font=dict(size=20, color='#E5E7EB'),
+                            x=0.5,
+                            y=0.95
+                        ),
+                        plot_bgcolor='#0F172A',
+                        paper_bgcolor='#0F172A',
+                        font=dict(size=14, family="Arial, sans-serif", color='#E5E7EB'),
+                        xaxis=dict(
+                            title=dict(text='Cluster', font=dict(size=16, color='#E5E7EB')),
+                            tickfont=dict(size=14, color='#E5E7EB'),
+                            gridcolor='rgba(229,231,235,0.12)',
+                            gridwidth=1,
+                            showgrid=True
+                        ),
+                        yaxis=dict(
+                            title=dict(text='Spending Score', font=dict(size=16, color='#E5E7EB')),
+                            tickfont=dict(size=14, color='#E5E7EB'),
+                            gridcolor='rgba(229,231,235,0.12)',
+                            gridwidth=1,
+                            showgrid=True
+                        ),
+                        showlegend=False,
+                        margin=dict(t=80, b=60, l=60, r=40)
+                    )
+                    fig_spending_box.update_traces(
+                        marker=dict(size=6, opacity=0.8),
+                        line=dict(width=3),
+                        fillcolor='rgba(0,0,0,0)',
+                        boxpoints='outliers'
+                    )
+                    st.plotly_chart(fig_spending_box, theme=None, use_container_width=True)
+            with col2:
+                if 'Annual Income (k$)' in data_with_clusters.columns:
+                    # Convert cluster column to string to ensure proper categorical handling
+                    plot_data = data_with_clusters.copy()
+                    plot_data[cluster_col] = plot_data[cluster_col].astype(str)
+                    fig_income_box = px.box(
+                        plot_data,
+                        x=cluster_col,
+                        y='Annual Income (k$)',
+                        title='💵 Annual Income Distribution by Cluster',
+                        color=cluster_col,
+                        color_discrete_sequence=self.modern_colors
+                    )
+                    # Enhanced styling for maximum visibility
+                    fig_income_box.update_layout(
+                        height=600,
+                        title=dict(
+                            text='💵 Annual Income Distribution by Cluster',
+                            font=dict(size=20, color='#E5E7EB'),
+                            x=0.5,
+                            y=0.95
+                        ),
+                        plot_bgcolor='#0F172A',
+                        paper_bgcolor='#0F172A',
+                        font=dict(size=14, family="Arial, sans-serif", color='#E5E7EB'),
+                        xaxis=dict(
+                            title=dict(text='Cluster', font=dict(size=16, color='#E5E7EB')),
+                            tickfont=dict(size=14, color='#E5E7EB'),
+                            gridcolor='rgba(229,231,235,0.12)',
+                            gridwidth=1,
+                            showgrid=True
+                        ),
+                        yaxis=dict(
+                            title=dict(text='Annual Income (k$)', font=dict(size=16, color='#E5E7EB')),
+                            tickfont=dict(size=14, color='#E5E7EB'),
+                            gridcolor='rgba(229,231,235,0.12)',
+                            gridwidth=1,
+                            showgrid=True
+                        ),
+                        showlegend=False,
+                        margin=dict(t=80, b=60, l=60, r=40)
+                    )
+                    fig_income_box.update_traces(
+                        marker=dict(size=6, opacity=0.8),
+                        line=dict(width=3),
+                        fillcolor='rgba(0,0,0,0)',
+                        boxpoints='outliers'
+                    )
+                    st.plotly_chart(fig_income_box, theme=None, use_container_width=True)
+            # Average spending per cluster with stunning visualization
+            if spending_analysis is not None:
+                st.markdown("---")
+                # Beautiful section header
+                st.markdown("""
+                <div style="
+                    background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%);
+                    padding: 1.5rem;
+                    border-radius: 15px;
+                    color: white;
+                    text-align: center;
+                    margin: 2rem 0 1rem 0;
+                    box-shadow: 0 8px 20px rgba(240, 147, 251, 0.3);
+                ">
+                    <h3 style="margin: 0; font-size: 1.8rem; font-weight: 600;">💰 Average Spending Analysis</h3>
+                </div>
+                """, unsafe_allow_html=True)
+                # Create stunning bar chart with enhanced colors
+                fig_avg_spending = px.bar(
+                    x=spending_analysis.index.astype(str),
+                    y=spending_analysis['mean'],
+                    title='📊 Average Spending Score by Cluster',
+                    labels={'x': 'Cluster', 'y': 'Average Spending Score'},
+                    error_y=spending_analysis['std'],
+                    color=spending_analysis['mean'],
+                    color_continuous_scale='Viridis'
+                )
+                # Ultra-enhanced styling
+                fig_avg_spending.update_layout(
+                     height=650,
+                     title=dict(
+                         text='📊 Average Spending Score by Cluster',
+                         font=dict(size=24, color='#E5E7EB', family="Arial Black"),
+                         x=0.5,
+                         y=0.95
+                     ),
+                     plot_bgcolor='#0F172A',
+                     paper_bgcolor='#0F172A',
+                     font=dict(size=16, family="Arial, sans-serif", color='#E5E7EB'),
+                     xaxis=dict(
+                         title=dict(text='Cluster', font=dict(size=18, color='#E5E7EB')),
+                         tickfont=dict(size=16, color='#E5E7EB'),
+                         gridcolor='rgba(229,231,235,0.12)',
+                         gridwidth=1,
+                         showgrid=True,
+                         zeroline=False
+                     ),
+                     yaxis=dict(
+                         title=dict(text='Average Spending Score', font=dict(size=18, color='#E5E7EB')),
+                         tickfont=dict(size=16, color='#E5E7EB'),
+                         gridcolor='rgba(229,231,235,0.12)',
+                         gridwidth=1,
+                         showgrid=True,
+                         zeroline=False
+                     ),
+                     showlegend=False,
+                     margin=dict(t=100, b=80, l=80, r=80)
+                 )
+                # Add stylish value labels on bars
+                for cluster, value in zip(spending_analysis.index, spending_analysis['mean']):
+                    fig_avg_spending.add_annotation(
+                        x=str(cluster),
+                        y=value + spending_analysis.loc[cluster, 'std'] + 5,
+                        text=f'<b>{value:.1f}</b>',
+                        showarrow=False,
+                        font=dict(size=16, color='white', family="Arial Black"),
+                        bgcolor='rgba(44, 62, 80, 0.9)',
+                        bordercolor='rgba(44, 62, 80, 1)',
+                        borderwidth=2,
+                        borderpad=8
+                    )
+                # Enhance the bars themselves
+                fig_avg_spending.update_traces(
+                    marker=dict(
+                        line=dict(width=2, color='rgba(44, 62, 80, 0.8)'),
+                        opacity=0.9
+                    ),
+                    width=0.6
+                )
+                st.plotly_chart(fig_avg_spending, theme=None, use_container_width=True)
+                # Beautiful cluster insights table
+                st.markdown("""
+                <div style="
+                    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+                    padding: 1.5rem;
+                    border-radius: 15px;
+                    color: white;
+                    text-align: center;
+                    margin: 2rem 0 1rem 0;
+                    box-shadow: 0 8px 20px rgba(102, 126, 234, 0.3);
+                ">
+                    <h3 style="margin: 0; font-size: 1.8rem; font-weight: 600;">📋 Detailed Cluster Statistics</h3>
+                </div>
+                """, unsafe_allow_html=True)
+                summary_df = spending_analysis.round(2)
+                summary_df.columns = ['🎯 Avg Spending', '📊 Std Dev', '📉 Min', '📈 Max', '👥 Count']
+                # Create a Plotly table instead of using background_gradient
+                fig_table = go.Figure(data=[go.Table(
+                    header=dict(
+                        values=list(summary_df.columns),
+                        fill_color='#1F2937',
+                        font=dict(color='#E5E7EB', size=14, family='Inter'),
+                        align='center',
+                        height=40
+                    ),
+                    cells=dict(
+                        values=[summary_df[col] for col in summary_df.columns],
+                        fill_color='#0F172A',
+                        font=dict(color='#E5E7EB', size=12, family='Inter'),
+                        align='center',
+                        height=35,
+                        # One format per displayed column: four statistics, then the integer count
+                        format=['.2f', '.2f', '.2f', '.2f', '.0f']
+                    )
+                )])
+                fig_table.update_layout(
+                     height=300,
+                     title=dict(
+                         text='📊 Cluster Spending Analysis',
+                         font=dict(size=18, color='#E5E7EB', family='Inter'),
+                         x=0.5
+                     ),
+                     plot_bgcolor='#0F172A',
+                     paper_bgcolor='#0F172A',
+                     margin=dict(t=60, b=20, l=20, r=20)
+                 )
+                st.plotly_chart(fig_table, use_container_width=True, theme=None)
+        except Exception as e:
+            st.error(f"❌ Error in cluster analysis visualization: {str(e)}")
+            st.write("Please try the 'Clear Session' button in the sidebar and run clustering again.")
+    def plot_comparison(self, data, kmeans_labels, dbscan_labels):
+        """Plot comparison between K-Means and DBSCAN."""
+        st.subheader("🔄 Algorithm Comparison")
+        col1, col2 = st.columns(2)
+        with col1:
+            # K-Means
+            plot_data_kmeans = data.copy()
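+            plot_data_kmeans['Cluster'] = kmeans_labels
+            # Cast labels to strings so Plotly colors clusters as discrete categories
+            plot_data_kmeans['Cluster'] = plot_data_kmeans['Cluster'].astype(str)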
+            fig_kmeans = px.scatter(plot_data_kmeans,
+                                  x='Annual Income (k$)',
+                                  y='Spending Score (1-100)',
+                                  color='Cluster',
+                                  title='K-Means Clustering',
+                                  color_discrete_sequence=self.colors)
+            fig_kmeans.update_layout(
+                height=400,
+                paper_bgcolor="#0F172A",
+                plot_bgcolor="#0F172A",
+                font=dict(color="#E5E7EB")
+            )
+            st.plotly_chart(fig_kmeans, theme=None, use_container_width=True)
+        with col2:
+            # DBSCAN
+            plot_data_dbscan = data.copy()
+            plot_data_dbscan['Cluster'] = dbscan_labels
+            plot_data_dbscan['Cluster'] = plot_data_dbscan['Cluster'].astype(str)
+            plot_data_dbscan.loc[plot_data_dbscan['Cluster'] == '-1', 'Cluster'] = 'Noise'
+            fig_dbscan = px.scatter(plot_data_dbscan,
+                                  x='Annual Income (k$)',
+                                  y='Spending Score (1-100)',
+                                  color='Cluster',
+                                  title='DBSCAN Clustering',
+                                  color_discrete_sequence=self.colors)
+            fig_dbscan.update_layout(
+                height=400,
+                paper_bgcolor="#0F172A",
+                plot_bgcolor="#0F172A",
+                font=dict(color="#E5E7EB")
+            )
+            st.plotly_chart(fig_dbscan, theme=None, use_container_width=True)
+        # Comparison metrics
+        col1, col2, col3, col4 = st.columns(4)
+        with col1:
+            kmeans_clusters = len(set(kmeans_labels))
+            st.metric("K-Means Clusters", kmeans_clusters)
+        with col2:
+            dbscan_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
+            st.metric("DBSCAN Clusters", dbscan_clusters)
+        with col3:
+            noise_points = list(dbscan_labels).count(-1)
+            st.metric("DBSCAN Noise Points", noise_points)
+        with col4:
+            noise_percentage = (noise_points / len(dbscan_labels)) * 100
+            st.metric("Noise Percentage", f"{noise_percentage:.1f}%")
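
The comparison metrics at the end of `plot_comparison` reduce the DBSCAN labels to three numbers: the cluster count (excluding the noise label `-1`), the number of noise points, and the noise share. As a minimal standalone sketch of that arithmetic, the helper below can be run without Streamlit. The name `summarize_dbscan_labels` and the empty-input guard are assumptions for illustration and are not part of the repository.

```python
from collections import Counter


def summarize_dbscan_labels(labels):
    """Summarize DBSCAN-style labels, where -1 marks noise points.

    Hypothetical helper: mirrors the metric arithmetic in plot_comparison,
    plus a guard against an empty label list (an added assumption).
    """
    labels = list(labels)
    counts = Counter(labels)

    # Noise points carry the label -1 and do not count as a cluster.
    noise_points = counts.get(-1, 0)
    n_clusters = len(counts) - (1 if -1 in counts else 0)

    # Avoid dividing by zero when no labels are supplied.
    noise_percentage = (noise_points / len(labels) * 100) if labels else 0.0

    return {
        "n_clusters": n_clusters,
        "noise_points": noise_points,
        "noise_percentage": noise_percentage,
    }


summary = summarize_dbscan_labels([0, 0, 1, -1, 1, -1, 2, 0])
print(summary)  # 3 clusters, 2 noise points, 25.0% noise
```

Working on a plain list first also sidesteps the question of whether `-1 in dbscan_labels` is evaluated against a list or a NumPy array.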

streamlit_app/main.py ADDED Viewed

	@@ -0,0 +1,1112 @@

+"""
+Customer Segmentation Streamlit App
+==================================
+A comprehensive web application for customer segmentation analysis using
+K-Means and DBSCAN clustering algorithms.
+"""
+import streamlit as st
+import pandas as pd
+import numpy as np
+import sys
+import os
+# Add the project root to sys.path so the `src` package can be imported
+sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
+from src.data_loader import DataLoader
+from src.clustering import ClusteringAnalyzer
+from src.visualizations import Visualizer
+# Page configuration
+st.set_page_config(
+    page_title="Customer Segmentation Analysis",
+    page_icon="🛍️",
+    layout="wide",
+    initial_sidebar_state="expanded"
+)
+import plotly.io as pio
+pio.templates.default = "plotly_dark"
+# Modern Dark Mode Compatible CSS
+st.markdown("""
+<style>
+    /* Import Google Fonts */
+    @import url('https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;500&family=Inter:wght@300;400;500;600;700&display=swap');
+    /* CSS variables for the dark palette (disabled intentionally; the rules below use literal colors)
+    :root {
+        --bg-primary: #0F172A;       (slate-900)
+        --bg-secondary: #111827;     (gray-900)
+        --bg-tertiary: #1F2937;      (gray-800)
+        --text-primary: #E5E7EB;     (gray-200)
+        --text-secondary: #CBD5E1;   (slate-300)
+        --text-tertiary: #94A3B8;    (slate-400)
+        --border-color: #374151;     (gray-700)
+        --accent-primary: #818CF8;   (indigo-400)
+        --accent-secondary: #A78BFA; (violet-400)
+        --shadow-sm: 0 1px 2px 0 rgba(0, 0, 0, 0.4);
+        --shadow-md: 0 4px 6px -1px rgba(0, 0, 0, 0.5);
+        --shadow-lg: 0 10px 15px -3px rgba(0, 0, 0, 0.6);
+    } */
+    /* Base styling */
+    .main .block-container {
+        padding: 2rem 1rem;
+        max-width: 1200px;
+    }
+    /* App background and text colors */
+    .stApp { background-color: #0F172A; color: #E5E7EB; }
+    /* Headers */
+    .main-header {
+        font-family: 'Inter', sans-serif;
+        font-size: clamp(2.5rem, 5vw, 4rem);
+        font-weight: 800;
+        text-align: center;
+        margin-bottom: 3rem;
+        background: linear-gradient(135deg, #818CF8 0%, #A78BFA 100%);
+        -webkit-background-clip: text;
+        -webkit-text-fill-color: transparent;
+        background-clip: text;
+        letter-spacing: -0.02em;
+    }
+    .sub-header {
+        font-family: 'Inter', sans-serif;
+        font-size: 1.75rem;
+        font-weight: 600;
+        color: #E5E7EB;
+        margin: 2rem 0 1rem 0;
+        padding-bottom: 0.75rem;
+        border-bottom: 2px solid #374151;
+        position: relative;
+    }
+    .sub-header::after {
+        content: '';
+        position: absolute;
+        bottom: -2px;
+        left: 0;
+        width: 60px;
+        height: 2px;
+        background: linear-gradient(135deg, #818CF8, #A78BFA);
+    }
+    /* Enhanced Tab Styling */
+    .stTabs [data-baseweb="tab-list"] {
+        gap: 4px;
+        background: #111827;
+        padding: 8px;
+        border-radius: 16px;
+        border: 1px solid #374151;
+        box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.4);
+        margin-bottom: 2rem;
+    }
+    .stTabs [data-baseweb="tab"] {
+        height: 48px;
+        padding: 0 20px;
+        background: transparent;
+        border-radius: 12px;
+        color: #CBD5E1;
+        font-weight: 500;
+        font-family: 'Inter', sans-serif;
+        font-size: 0.875rem;
+        border: none;
+        transition: all 0.2s cubic-bezier(0.4, 0, 0.2, 1);
+        position: relative;
+        overflow: hidden;
+    }
+    .stTabs [data-baseweb="tab"]:hover {
+        background: #1F2937;
+        color: #E5E7EB;
+        transform: translateY(-1px);
+    }
+    .stTabs [aria-selected="true"] {
+        background: linear-gradient(135deg, #818CF8 0%, #A78BFA 100%);
+        color: white !important;
+        font-weight: 600;
+        box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.5);
+        transform: translateY(-1px);
+    }
+    /* Cards and containers */
+    .metric-card {
+        background: #0F172A;
+        border: 1px solid #374151;
+        border-radius: 16px;
+        padding: 1.5rem;
+        margin: 1rem 0;
+        box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.4);
+        transition: all 0.3s cubic-bezier(0.4, 0, 0.2, 1);
+        position: relative;
+        overflow: hidden;
+    }
+    .metric-card::before {
+        content: '';
+        position: absolute;
+        top: 0;
+        left: 0;
+        right: 0;
+        height: 3px;
+        background: linear-gradient(135deg, #818CF8, #A78BFA);
+    }
+    .metric-card:hover {
+        transform: translateY(-4px);
+        box-shadow: 0 10px 15px -3px rgba(0, 0, 0, 0.6);
+        border-color: #818CF8;
+    }
+    .insight-box {
+        background: #111827;
+        border: 1px solid #818CF8;
+        border-radius: 16px;
+        padding: 1.5rem;
+        margin: 1.5rem 0;
+        box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.4);
+        position: relative;
+    }
+    .insight-box::before {
+        content: '';
+        position: absolute;
+        top: 0;
+        left: 0;
+        right: 0;
+        height: 3px;
+        background: linear-gradient(135deg, #818CF8, #A78BFA);
+    }
+    /* Sidebar */
+    [data-testid="stSidebar"], .css-1d391kg {
+        background: #111827;
+        border-right: 1px solid #374151;
+    }
+    /* Text styling with proper contrast */
+    .stMarkdown, .stText, p, div, span, label {
+        color: #E5E7EB !important;
+        font-family: 'Inter', sans-serif;
+    }
+    [data-testid="stMarkdownContainer"] {
+        color: #E5E7EB !important;
+    }
+    /* Enhanced message styling */
+    .stSuccess {
+        background: rgba(34, 197, 94, 0.1) !important;
+        border: 1px solid #22c55e !important;
+        border-radius: 12px !important;
+        color: #166534 !important;
+    }
+    .stInfo {
+        background: rgba(59, 130, 246, 0.1) !important;
+        border: 1px solid #3b82f6 !important;
+        border-radius: 12px !important;
+        color: #1e40af !important;
+    }
+    .stWarning {
+        background: rgba(245, 158, 11, 0.1) !important;
+        border: 1px solid #f59e0b !important;
+        border-radius: 12px !important;
+        color: #92400e !important;
+    }
+    .stError {
+        background: rgba(239, 68, 68, 0.1) !important;
+        border: 1px solid #ef4444 !important;
+        border-radius: 12px !important;
+        color: #dc2626 !important;
+    }
+    /* Enhanced Modern Button Styling */
+    .stButton > button {
+        background: linear-gradient(135deg, #818CF8 0%, #A78BFA 100%);
+        color: white !important;
+        border: none;
+        border-radius: 16px;
+        padding: 1rem 2.5rem;
+        font-weight: 700;
+        font-family: 'Inter', sans-serif;
+        font-size: 1rem;
+        letter-spacing: 0.025em;
+        transition: all 0.4s cubic-bezier(0.4, 0, 0.2, 1);
+        box-shadow: 0 8px 25px rgba(129, 140, 248, 0.3);
+        position: relative;
+        overflow: hidden;
+        text-transform: uppercase;
+        min-height: 48px;
+    }
+    .stButton > button::before {
+        content: '';
+        position: absolute;
+        top: 0;
+        left: -100%;
+        width: 100%;
+        height: 100%;
+        background: linear-gradient(90deg, transparent, rgba(255, 255, 255, 0.2), transparent);
+        transition: left 0.5s;
+    }
+    .stButton > button:hover {
+        transform: translateY(-3px) scale(1.02);
+        box-shadow: 0 15px 35px rgba(129, 140, 248, 0.4);
+        filter: brightness(1.15);
+        background: linear-gradient(135deg, #A78BFA 0%, #818CF8 100%);
+    }
+    .stButton > button:hover::before {
+        left: 100%;
+    }
+    .stButton > button:active {
+        transform: translateY(-1px) scale(0.98);
+        box-shadow: 0 5px 15px rgba(129, 140, 248, 0.3);
+    }
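+    /* Special styling for primary action buttons.
+       Note: :contains() is not a standard CSS selector, so browsers ignore the :contains() rules in this section. */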
+    .stButton > button:contains("Apply") {
+        background: linear-gradient(135deg, #10B981 0%, #059669 100%);
+        box-shadow: 0 8px 25px rgba(16, 185, 129, 0.3);
+    }
+    .stButton > button:contains("Apply"):hover {
+        background: linear-gradient(135deg, #059669 0%, #10B981 100%);
+        box-shadow: 0 15px 35px rgba(16, 185, 129, 0.4);
+    }
+    /* Special styling for find/analyze buttons */
+    .stButton > button:contains("Find") {
+        background: linear-gradient(135deg, #F59E0B 0%, #D97706 100%);
+        box-shadow: 0 8px 25px rgba(245, 158, 11, 0.3);
+    }
+    .stButton > button:contains("Find"):hover {
+        background: linear-gradient(135deg, #D97706 0%, #F59E0B 100%);
+        box-shadow: 0 15px 35px rgba(245, 158, 11, 0.4);
+    }
+    /* Special styling for reload/clear buttons */
+    .stButton > button:contains("Reload"), .stButton > button:contains("Clear") {
+        background: linear-gradient(135deg, #EF4444 0%, #DC2626 100%);
+        box-shadow: 0 8px 25px rgba(239, 68, 68, 0.3);
+    }
+    .stButton > button:contains("Reload"):hover, .stButton > button:contains("Clear"):hover {
+        background: linear-gradient(135deg, #DC2626 0%, #EF4444 100%);
+        box-shadow: 0 15px 35px rgba(239, 68, 68, 0.4);
+    }
+    /* Form elements */
+    .stSelectbox > div > div,
+    .stTextInput > div > div > input,
+    .stNumberInput > div > div > input {
+        background: #0F172A !important;
+        border: 1px solid #374151 !important;
+        border-radius: 12px !important;
+        color: #E5E7EB !important;
+        font-family: 'Inter', sans-serif !important;
+        transition: all 0.2s ease;
+    }
+    .stSelectbox > div > div:focus-within,
+    .stTextInput > div > div:focus-within,
+    .stNumberInput > div > div:focus-within {
+        border-color: #818CF8 !important;
+        box-shadow: 0 0 0 3px rgba(99, 102, 241, 0.1) !important;
+    }
+    /* Slider styling */
+    .stSlider > div > div > div > div {
+        background: linear-gradient(135deg, #818CF8, #A78BFA) !important;
+    }
+    .stSlider > div > div > div > div > div {
+        background: white !important;
+        border: 2px solid #818CF8 !important;
+        box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.5) !important;
+    }
+    /* Plotly charts: trace styling (box fill, marker size, color palette) is set in the
+       Python plotting code; only the chart container is styled here */
+    .element-container .stPlotlyChart {
+        background: #0F172A !important;
+    }
+    /* DataFrames */
+    .stDataFrame {
+        border: 1px solid #374151;
+        border-radius: 12px;
+        overflow: hidden;
+        box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.4);
+    }
+    .stDataFrame > div {
+        background: #0F172A;
+    }
+    /* Progress bars */
+    .stProgress > div > div > div {
+        background: linear-gradient(135deg, #818CF8, #A78BFA) !important;
+        border-radius: 8px !important;
+    }
+    /* Expanders */
+    .streamlit-expanderHeader {
+        background: #111827 !important;
+        border: 1px solid #374151 !important;
+        border-radius: 12px !important;
+        color: #E5E7EB !important;
+        font-weight: 500 !important;
+        font-family: 'Inter', sans-serif !important;
+        transition: all 0.2s ease;
+    }
+    .streamlit-expanderHeader:hover {
+        background: #1F2937 !important;
+        border-color: #818CF8 !important;
+    }
+    .streamlit-expanderContent {
+        background: #0F172A !important;
+        border: 1px solid #374151 !important;
+        border-top: none !important;
+        color: #E5E7EB !important;
+        border-radius: 0 0 12px 12px !important;
+    }
+    /* Metrics */
+    [data-testid="metric-container"] {
+        background: #111827;
+        border: 1px solid #374151;
+        border-radius: 12px;
+        padding: 1rem;
+        box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.4);
+        transition: all 0.2s ease;
+    }
+    [data-testid="metric-container"]:hover {
+        transform: translateY(-2px);
+        box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.5);
+    }
+    [data-testid="metric-container"] > div {
+        color: #E5E7EB !important;
+    }
+    /* Code blocks */
+    .stCode {
+        background: #111827 !important;
+        border: 1px solid #374151 !important;
+        border-radius: 12px !important;
+        font-family: 'JetBrains Mono', monospace !important;
+    }
+    /* Headings */
+    h1, h2, h3, h4, h5, h6 {
+        color: #E5E7EB !important;
+        font-family: 'Inter', sans-serif !important;
+        font-weight: 600 !important;
+        letter-spacing: -0.01em;
+    }
+    /* File uploader */
+    .stFileUploader > div {
+        background: #111827 !important;
+        border: 2px dashed #374151 !important;
+        border-radius: 12px !important;
+        transition: all 0.2s ease;
+    }
+    .stFileUploader > div:hover {
+        border-color: #818CF8 !important;
+        background: #1F2937 !important;
+    }
+    /* Scrollbars */
+    ::-webkit-scrollbar {
+        width: 8px;
+        height: 8px;
+    }
+    ::-webkit-scrollbar-track {
+        background: #111827;
+        border-radius: 4px;
+    }
+    ::-webkit-scrollbar-thumb {
+        background: #94A3B8;
+        border-radius: 4px;
+    }
+    ::-webkit-scrollbar-thumb:hover {
+        background: #CBD5E1;
+    }
+    /* Animation keyframes */
+    @keyframes fadeIn {
+        from { opacity: 0; transform: translateY(20px); }
+        to { opacity: 1; transform: translateY(0); }
+    }
+    .stTabs [data-baseweb="tabpanel"] {
+        animation: fadeIn 0.5s ease-out;
+    }
+</style>
+""", unsafe_allow_html=True)
+def initialize_session_state():
+    """Initialize session state variables."""
+    if 'data_loader' not in st.session_state:
+        st.session_state.data_loader = DataLoader()
+    if 'clustering_analyzer' not in st.session_state:
+        st.session_state.clustering_analyzer = ClusteringAnalyzer()
+    if 'visualizer' not in st.session_state:
+        st.session_state.visualizer = Visualizer()
+    if 'data_loaded' not in st.session_state:
+        st.session_state.data_loaded = False
+    if 'data_preprocessed' not in st.session_state:
+        st.session_state.data_preprocessed = False
+    if 'clustering_done' not in st.session_state:
+        st.session_state.clustering_done = {'kmeans': False, 'dbscan': False}
+def main():
+    """Main application function."""
+    initialize_session_state()
+    # Main header
+    st.markdown('<h1 class="main-header">🛍️ Customer Segmentation Analysis</h1>', unsafe_allow_html=True)
+    st.markdown("---")
+    # Tab navigation
+    tab1, tab2, tab3, tab4, tab5, tab6, tab7, tab8 = st.tabs([
+        "🏠 Home", "📊 Data Overview", "🔍 Data Exploration", "⚙️ Preprocessing",
+        "🎯 K-Means", "🌟 DBSCAN", "📈 Comparison", "📋 Insights"
+    ])
+    # Data loading section in sidebar
+    st.sidebar.markdown("---")
+    st.sidebar.subheader("📂 Data Management")
+    # Auto-load dataset on first run
+    if not st.session_state.data_loaded:
+        st.session_state.data_loader.load_data()
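+        # Only mark the data as loaded if the loader actually produced a DataFrame
+        st.session_state.data_loaded = st.session_state.data_loader.data is not None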
+    # Show current dataset status
+    if st.session_state.data_loaded and st.session_state.data_loader.data is not None:
+        data_info = st.session_state.data_loader.get_data_info()
+        st.sidebar.success("📊 Dataset Loaded")
+        # Two trailing spaces before the newline force a Markdown line break
+        st.sidebar.info(f"**Rows:** {data_info['shape'][0]}  \n**Columns:** {data_info['shape'][1]}")
+        # Show basic info about the dataset
+        if 'Annual Income (k$)' in st.session_state.data_loader.data.columns:
+            st.sidebar.write("**Dataset Type:** Mall Customers")
+    # File upload option
+    st.sidebar.markdown("### 📁 Upload Different Dataset")
+    uploaded_file = st.sidebar.file_uploader("Choose a CSV file", type=['csv'])
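+    if uploaded_file is not None:
+        # The uploader returns the same file on every rerun, so process each upload only once;
+        # otherwise st.rerun() below would loop and the upload would override "Reload Default Dataset".
+        upload_key = (uploaded_file.name, uploaded_file.size)
+        if st.session_state.get('last_upload_key') != upload_key:
+            try:
+                data = pd.read_csv(uploaded_file)
+            except Exception as e:
+                st.sidebar.error(f"Error loading file: {e}")
+            else:
+                st.session_state.data_loader.data = data
+                st.session_state.data_loaded = True
+                st.session_state.data_preprocessed = False  # Reset preprocessing
+                st.session_state.clustering_done = {'kmeans': False, 'dbscan': False}  # Reset clustering
+                st.session_state.last_upload_key = upload_key
+                st.sidebar.success("✅ New file uploaded!")
+                st.rerun()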
+    # Reload default dataset button
+    if st.sidebar.button("🔄 Reload Default Dataset"):
+        st.session_state.data_loader.load_data()
+        st.session_state.data_loaded = True
+        st.session_state.data_preprocessed = False
+        st.session_state.clustering_done = {'kmeans': False, 'dbscan': False}
+        # Clear any cached clustering results
+        st.session_state.clustering_analyzer = ClusteringAnalyzer()
+        st.rerun()
+    # Debug: Clear session state button (remove this after fixing)
+    if st.sidebar.button("🧪 Clear Session (Debug)"):
+        for key in list(st.session_state.keys()):
+            del st.session_state[key]
+        st.rerun()
+    # Tab content
+    with tab1:
+        show_home_page()
+    with tab2:
+        show_data_overview()
+    with tab3:
+        show_data_exploration()
+    with tab4:
+        show_preprocessing()
+    with tab5:
+        show_kmeans_clustering()
+    with tab6:
+        show_dbscan_clustering()
+    with tab7:
+        show_results_comparison()
+    with tab8:
+        show_business_insights()
+def show_home_page():
+    """Display the home page."""
+    st.markdown('<h2 class="sub-header">Welcome to Customer Segmentation Analysis</h2>', unsafe_allow_html=True)
+    col1, col2, col3 = st.columns([1, 2, 1])
+    with col2:
+        st.markdown("""
+        <div class="insight-box">
+        <h3>🎯 Project Overview</h3>
+        <p>This application provides a comprehensive customer segmentation analysis using machine learning clustering algorithms.</p>
+        </div>
+        """, unsafe_allow_html=True)
+    # Feature overview
+    st.markdown("### 🚀 Features")
+    col1, col2, col3 = st.columns(3)
+    with col1:
+        st.markdown("""
+        **📊 Data Analysis**
+        - Interactive data exploration
+        - Statistical summaries
+        - Correlation analysis
+        - Missing value detection
+        """)
+    with col2:
+        st.markdown("""
+        **🎯 Clustering Algorithms**
+        - K-Means clustering
+        - DBSCAN clustering
+        - Optimal cluster determination
+        - Performance metrics
+        """)
+    with col3:
+        st.markdown("""
+        **📈 Visualizations**
+        - 2D cluster plots
+        - Distribution analysis
+        - Comparative visualizations
+        - Interactive charts
+        """)
+    # Getting started
+    st.markdown("### 🏁 Getting Started")
+    st.markdown("""
+    1. **📊 Data Overview**: Check your dataset information and statistics (automatically loaded from `data/Mall_Customers.csv`)
+    2. **🔍 Data Exploration**: Explore distributions, correlations, and relationships
+    3. **⚙️ Preprocessing**: Select features and scale your data for clustering
+    4. **🎯 K-Means**: Apply K-Means clustering with optimal cluster determination
+    5. **🌟 DBSCAN**: Try density-based clustering for comparison
+    6. **📈 Comparison**: Compare results from both algorithms
+    7. **📋 Insights**: Get business recommendations for each customer segment
+    """)
+    # Quick start note
+    st.info("""
+    💡 **Quick Start**: Your dataset is automatically loaded from the `data/` folder.
+    Just click on the tabs above to start exploring and clustering your customer data!
+    """)
+    # Sample data info
+    st.markdown("### 📋 Sample Dataset")
+    st.info("""
+    The sample dataset simulates mall customer data with the following features:
+    - **CustomerID**: Unique identifier
+    - **Gender**: Customer gender (Male/Female)
+    - **Age**: Customer age (18-70 years)
+    - **Annual Income (k$)**: Annual income in thousands
+    - **Spending Score (1-100)**: Mall-assigned spending score
+    """)
+def show_data_overview():
+    """Display data overview page."""
+    st.markdown('<h2 class="sub-header">📊 Data Overview</h2>', unsafe_allow_html=True)
+    if not st.session_state.data_loaded:
+        st.warning("⚠️ Please load data first using the sidebar.")
+        return
+    data = st.session_state.data_loader.data
+    data_info = st.session_state.data_loader.get_data_info()
+    # Basic information
+    col1, col2, col3, col4 = st.columns(4)
+    with col1:
+        st.metric("Total Customers", data_info['shape'][0])
+    with col2:
+        st.metric("Features", data_info['shape'][1])
+    with col3:
+        missing_values = sum(data_info['missing_values'].values())
+        st.metric("Missing Values", missing_values)
+    with col4:
+        numeric_cols = len([col for col, dtype in data_info['dtypes'].items() if dtype in ['int64', 'float64']])
+        st.metric("Numeric Features", numeric_cols)
+    # Data preview
+    st.subheader("📋 Data Preview")
+    st.dataframe(data.head(10), use_container_width=True)
+    # Data types and missing values
+    col1, col2 = st.columns(2)
+    with col1:
+        st.subheader("🔧 Data Types")
+        dtypes_df = pd.DataFrame(list(data_info['dtypes'].items()), columns=['Column', 'Data Type'])
+        st.dataframe(dtypes_df, use_container_width=True)
+    with col2:
+        st.subheader("❓ Missing Values")
+        missing_df = pd.DataFrame(list(data_info['missing_values'].items()), columns=['Column', 'Missing Count'])
+        missing_df['Missing %'] = (missing_df['Missing Count'] / data_info['shape'][0] * 100).round(2)
+        st.dataframe(missing_df, use_container_width=True)
+    # Statistical summary
+    st.subheader("📈 Statistical Summary")
+    st.dataframe(data.describe(), use_container_width=True)
+def show_data_exploration():
+    """Display data exploration page."""
+    st.markdown('<h2 class="sub-header">🔍 Data Exploration</h2>', unsafe_allow_html=True)
+    if not st.session_state.data_loaded:
+        st.warning("⚠️ Please load data first using the sidebar.")
+        return
+    data = st.session_state.data_loader.data
+    visualizer = st.session_state.visualizer
+    # Generate exploration visualizations
+    visualizer.plot_data_exploration(data)
+def show_preprocessing():
+    """Display preprocessing page."""
+    st.markdown('<h2 class="sub-header">⚙️ Data Preprocessing</h2>', unsafe_allow_html=True)
+    if not st.session_state.data_loaded:
+        st.warning("⚠️ Please load data first using the sidebar.")
+        return
+    data = st.session_state.data_loader.data
+    # Feature selection
+    st.subheader("🎯 Feature Selection")
+    numeric_columns = data.select_dtypes(include=[np.number]).columns.tolist()
+    if 'CustomerID' in numeric_columns:
+        numeric_columns.remove('CustomerID')
+    selected_features = st.multiselect(
+        "Select features for clustering:",
+        numeric_columns,
+        default=['Annual Income (k$)', 'Spending Score (1-100)'] if all(col in numeric_columns for col in ['Annual Income (k$)', 'Spending Score (1-100)']) else numeric_columns[:2]
+    )
+    if len(selected_features) < 2:
+        st.error("⚠️ Please select at least 2 features for clustering.")
+        return
+    # Preprocessing options
+    st.subheader("🔧 Preprocessing Options")
+    col1, col2 = st.columns(2)
+    with col1:
+        handle_missing = st.selectbox("Handle missing values:", ["Fill with mean", "Drop rows", "No action"])
+    with col2:
+        scaling_method = st.selectbox("Scaling method:", ["StandardScaler", "MinMaxScaler", "No scaling"])
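+    # Apply preprocessing
+    # NOTE: handle_missing and scaling_method are collected above but are not passed on;
+    # preprocess_data() is called with the selected features only.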
+    if st.button("🚀 Apply Preprocessing"):
+        scaled_data = st.session_state.data_loader.preprocess_data(selected_features)
+        if scaled_data is not None:
+            st.session_state.data_preprocessed = True
+            # Show preprocessing results
+            st.success("✅ Data preprocessing completed!")
+            col1, col2 = st.columns(2)
+            with col1:
+                st.subheader("📊 Original Data")
+                st.dataframe(data[selected_features].head(), use_container_width=True)
+            with col2:
+                st.subheader("🔄 Scaled Data")
+                scaled_df = pd.DataFrame(scaled_data, columns=selected_features)
+                st.dataframe(scaled_df.head(), use_container_width=True)
+            # Feature statistics
+            st.subheader("📈 Feature Statistics")
+            col1, col2 = st.columns(2)
+            with col1:
+                st.write("**Original Data Statistics:**")
+                st.dataframe(data[selected_features].describe(), use_container_width=True)
+            with col2:
+                st.write("**Scaled Data Statistics:**")
+                st.dataframe(scaled_df.describe(), use_container_width=True)
+def show_kmeans_clustering():
+    """Display K-Means clustering page."""
+    st.markdown('<h2 class="sub-header">🎯 K-Means Clustering</h2>', unsafe_allow_html=True)
+    if not st.session_state.data_preprocessed:
+        st.warning("⚠️ Please preprocess data first.")
+        return
+    data_loader = st.session_state.data_loader
+    clustering_analyzer = st.session_state.clustering_analyzer
+    visualizer = st.session_state.visualizer
+    # Optimal cluster determination
+    st.subheader("🔍 Optimal Cluster Determination")
+    col1, col2 = st.columns([1, 1])
+    with col1:
+        max_clusters = st.slider("Maximum clusters to test:", 2, 15, 10)
+    with col2:
+        if st.button("🔍 Find Optimal Clusters"):
+            with st.spinner("Finding optimal number of clusters..."):
+                optimization_results = clustering_analyzer.find_optimal_clusters(data_loader.scaled_data, max_clusters)
+                if optimization_results:
+                    visualizer.plot_optimization_results(optimization_results)
+    # K-Means clustering
+    st.subheader("🎯 K-Means Clustering")
+    col1, col2 = st.columns([1, 1])
+    with col1:
+        n_clusters = st.slider("Number of clusters:", 2, 10, clustering_analyzer.optimal_clusters or 5)
+    with col2:
+        if st.button("🚀 Apply K-Means"):
+            # Clear any existing clustering results first to avoid column naming issues
+            clustering_analyzer.cluster_labels = {}
+            st.session_state.clustering_done = {'kmeans': False, 'dbscan': False}
+            # Clear any cached data
+            if hasattr(st.session_state, 'cluster_analysis_cache'):
+                del st.session_state.cluster_analysis_cache
+            with st.spinner("🔄 Applying K-Means clustering..."):
+                kmeans_results = clustering_analyzer.apply_kmeans(data_loader.scaled_data, n_clusters)
+            if kmeans_results:
+                st.session_state.clustering_done['kmeans'] = True
+                # Display metrics
+                col1, col2, col3 = st.columns(3)
+                with col1:
+                    st.metric("Silhouette Score", f"{kmeans_results['silhouette_score']:.3f}")
+                with col2:
+                    st.metric("Calinski-Harabasz Score", f"{kmeans_results['calinski_score']:.1f}")
+                with col3:
+                    st.metric("Inertia", f"{kmeans_results['inertia']:.1f}")
+    # Visualizations
+    if st.session_state.clustering_done['kmeans']:
+        feature_data = data_loader.get_feature_data()
+        kmeans_labels = clustering_analyzer.cluster_labels['kmeans']
+        visualizer.plot_clusters(
+            feature_data,
+            kmeans_labels,
+            'K-Means',
+            data_loader.scaler,
+            clustering_analyzer.kmeans_model.cluster_centers_
+        )
+        # Cluster analysis
+        analysis_results = clustering_analyzer.analyze_clusters(feature_data, 'kmeans')
+        if analysis_results:
+            visualizer.plot_cluster_analysis(analysis_results, 'K-Means')
+def show_dbscan_clustering():
+    """Display DBSCAN clustering page."""
+    st.markdown('<h2 class="sub-header">🌟 DBSCAN Clustering</h2>', unsafe_allow_html=True)
+    if not st.session_state.data_preprocessed:
+        st.warning("⚠️ Please preprocess data first.")
+        return
+    data_loader = st.session_state.data_loader
+    clustering_analyzer = st.session_state.clustering_analyzer
+    visualizer = st.session_state.visualizer
+    # DBSCAN parameters
+    st.subheader("⚙️ DBSCAN Parameters")
+    col1, col2 = st.columns(2)
+    with col1:
+        eps = st.slider("Epsilon (neighborhood distance):", 0.1, 2.0, 0.5, 0.1)
+    with col2:
+        min_samples = st.slider("Minimum samples (core-point neighborhood size):", 2, 20, 5)
+    # Parameter guidance
+    st.info("""
+    **Parameter Guidance:**
+    - **Epsilon**: Maximum distance at which two points count as neighbors. Smaller values tend to produce more, smaller clusters and more noise points.
+    - **Min Samples**: Minimum number of points within Epsilon of a point (including the point itself) for it to count as a core point. Higher values yield fewer, denser clusters.
+    """)
+    # Apply DBSCAN
+    if st.button("🚀 Apply DBSCAN"):
+        dbscan_results = clustering_analyzer.apply_dbscan(data_loader.scaled_data, eps, min_samples)
+        if dbscan_results:
+            st.session_state.clustering_done['dbscan'] = True
+            # Display metrics
+            col1, col2, col3 = st.columns(3)
+            with col1:
+                st.metric("Number of Clusters", dbscan_results['n_clusters'])
+            with col2:
+                st.metric("Noise Points", dbscan_results['n_noise'])
+            with col3:
+                if 'silhouette_score' in dbscan_results:
+                    st.metric("Silhouette Score", f"{dbscan_results['silhouette_score']:.3f}")
+                else:
+                    st.metric("Silhouette Score", "N/A")
+    # Visualizations
+    if st.session_state.clustering_done['dbscan']:
+        feature_data = data_loader.get_feature_data()
+        dbscan_labels = clustering_analyzer.cluster_labels['dbscan']
+        visualizer.plot_clusters(feature_data, dbscan_labels, 'DBSCAN')
+        # Cluster analysis
+        analysis_results = clustering_analyzer.analyze_clusters(feature_data, 'dbscan')
+        if analysis_results:
+            visualizer.plot_cluster_analysis(analysis_results, 'DBSCAN')
+def show_results_comparison():
+    """Display results comparison page."""
+    st.markdown('<h2 class="sub-header">📈 Results Comparison</h2>', unsafe_allow_html=True)
+    if not (st.session_state.clustering_done['kmeans'] and st.session_state.clustering_done['dbscan']):
+        st.warning("⚠️ Please complete both K-Means and DBSCAN clustering first.")
+        return
+    data_loader = st.session_state.data_loader
+    clustering_analyzer = st.session_state.clustering_analyzer
+    visualizer = st.session_state.visualizer
+    feature_data = data_loader.get_feature_data()
+    kmeans_labels = clustering_analyzer.cluster_labels['kmeans']
+    dbscan_labels = clustering_analyzer.cluster_labels['dbscan']
+    # Comparison visualization
+    visualizer.plot_comparison(feature_data, kmeans_labels, dbscan_labels)
+    # Performance comparison
+    st.subheader("📊 Performance Metrics Comparison")
+    # Calculate metrics for both algorithms
+    kmeans_analysis = clustering_analyzer.analyze_clusters(feature_data, 'kmeans')
+    dbscan_analysis = clustering_analyzer.analyze_clusters(feature_data, 'dbscan')
+    comparison_data = {
+        'Metric': ['Number of Clusters', 'Silhouette Score', 'Noise Points', 'Largest Cluster Size'],
+        'K-Means': [],
+        'DBSCAN': []
+    }
+    # Number of clusters
+    comparison_data['K-Means'].append(len(set(kmeans_labels)))
+    comparison_data['DBSCAN'].append(len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0))
+    # Silhouette scores, computed per algorithm so a failure in one cannot misalign the columns
+    from sklearn.metrics import silhouette_score
+    try:
+        kmeans_silhouette = silhouette_score(data_loader.scaled_data, kmeans_labels)
+        comparison_data['K-Means'].append(f"{kmeans_silhouette:.3f}")
+    except ValueError:
+        comparison_data['K-Means'].append("N/A")
+    # DBSCAN silhouette (excluding noise); raises ValueError with fewer than 2 clusters
+    non_noise_mask = np.asarray(dbscan_labels) != -1
+    try:
+        dbscan_silhouette = silhouette_score(data_loader.scaled_data[non_noise_mask],
+                                             np.asarray(dbscan_labels)[non_noise_mask])
+        comparison_data['DBSCAN'].append(f"{dbscan_silhouette:.3f}")
+    except ValueError:
+        comparison_data['DBSCAN'].append("N/A")
+    # Noise points
+    comparison_data['K-Means'].append("0")
+    comparison_data['DBSCAN'].append(str(list(dbscan_labels).count(-1)))
+    # Largest cluster size
+    kmeans_counts = pd.Series(kmeans_labels).value_counts()
+    dbscan_counts = pd.Series(dbscan_labels).value_counts()
+    comparison_data['K-Means'].append(str(kmeans_counts.max()))
+    if -1 in dbscan_counts.index:
+        dbscan_counts = dbscan_counts.drop(-1)  # Exclude noise
+    comparison_data['DBSCAN'].append(str(dbscan_counts.max()) if len(dbscan_counts) > 0 else "0")
+    comparison_df = pd.DataFrame(comparison_data)
+    st.dataframe(comparison_df, use_container_width=True)
+def show_business_insights():
+    """Display business insights page."""
+    st.markdown('<h2 class="sub-header">📋 Business Insights</h2>', unsafe_allow_html=True)
+    if not st.session_state.clustering_done['kmeans']:
+        st.warning("⚠️ Please complete K-Means clustering first to generate insights.")
+        return
+    data_loader = st.session_state.data_loader
+    clustering_analyzer = st.session_state.clustering_analyzer
+    feature_data = data_loader.get_feature_data()
+    # Generate customer profiles
+    profiles = clustering_analyzer.get_cluster_profiles(feature_data, 'kmeans')
+    if profiles:
+        st.subheader("👥 Customer Segment Profiles")
+        for profile in profiles:
+            with st.expander(f"🏷️ Cluster {profile['cluster']} - {profile.get('type', 'Unknown Type')}"):
+                col1, col2 = st.columns(2)
+                with col1:
+                    st.markdown("**📊 Segment Overview**")
+                    st.write(f"- **Size**: {profile['size']} customers ({profile['percentage']:.1f}%)")
+                    if 'description' in profile:
+                        st.write(f"- **Profile**: {profile['description']}")
+                    if 'avg_age' in profile:
+                        st.write(f"- **Average Age**: {profile['avg_age']:.1f} ± {profile['age_std']:.1f} years")
+                    if 'gender_dist' in profile:
+                        st.write(f"- **Gender Distribution**: {profile['gender_dist']}")
+                with col2:
+                    st.markdown("**💰 Financial Profile**")
+                    if 'avg_income' in profile:
+                        st.write(f"- **Average Income**: ${profile['avg_income']:.1f}k ± ${profile['income_std']:.1f}k")
+                    if 'avg_spending' in profile:
+                        st.write(f"- **Average Spending Score**: {profile['avg_spending']:.1f} ± {profile['spending_std']:.1f}")
+                    # Business recommendations
+                    st.markdown("**📈 Recommendations**")
+                    if 'avg_income' in profile and 'avg_spending' in profile:
+                        avg_income = profile['avg_income']
+                        avg_spending = profile['avg_spending']
+                        if avg_income > 70 and avg_spending > 70:
+                            st.write("- Focus on premium products and exclusive services")
+                            st.write("- Implement VIP loyalty programs")
+                            st.write("- Offer personalized shopping experiences")
+                        elif avg_income > 70 and avg_spending < 40:
+                            st.write("- Develop targeted upselling strategies")
+                            st.write("- Showcase value propositions")
+                            st.write("- Create incentive programs to increase spending")
+                        elif avg_income < 40 and avg_spending > 70:
+                            st.write("- Offer value-based products and promotions")
+                            st.write("- Focus on customer retention programs")
+                            st.write("- Provide flexible payment options")
+                        elif avg_income < 40 and avg_spending < 40:
+                            st.write("- Implement engagement and retention strategies")
+                            st.write("- Offer budget-friendly options")
+                            st.write("- Focus on building brand loyalty")
+                        else:
+                            st.write("- Balanced marketing approach")
+                            st.write("- Personalized offers based on preferences")
+                            st.write("- Regular engagement campaigns")
+        # Overall business strategy
+        st.subheader("🎯 Overall Business Strategy")
+        col1, col2 = st.columns(2)
+        with col1:
+            st.markdown("""
+            **🎯 Marketing Strategies**
+            - **Segment-specific campaigns**: Tailor marketing messages to each cluster
+            - **Product positioning**: Align products with cluster preferences
+            - **Channel optimization**: Use preferred communication channels per segment
+            - **Pricing strategies**: Implement dynamic pricing based on segment characteristics
+            """)
+        with col2:
+            st.markdown("""
+            **💡 Growth Opportunities**
+            - **Cross-selling**: Identify products popular in high-spending segments
+            - **Retention programs**: Focus on segments with declining engagement
+            - **New product development**: Create offerings for underserved segments
+            - **Customer lifetime value**: Invest more in high-value segments
+            """)
+        # Download results
+        st.subheader("💾 Download Results")
+        # Prepare data for download
+        result_data = feature_data.copy()
+        result_data['KMeans_Cluster'] = clustering_analyzer.cluster_labels['kmeans']
+        csv = result_data.to_csv(index=False)
+        st.download_button(
+            label="📥 Download Customer Segments (CSV)",
+            data=csv,
+            file_name="customer_segments_results.csv",
+            mime="text/csv"
+        )
+if __name__ == "__main__":
+    main()
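
The recommendation branches in `show_business_insights()` depend only on a cluster's average income and average spending score, so they can be expressed as a pure function and checked without running Streamlit. The following is a minimal sketch under that assumption; the helper name `recommend_for_segment` and the constants `HIGH` and `LOW` are hypothetical and not part of this commit, while the thresholds and recommendation strings are copied from the code above.

```python
from typing import List

# Thresholds copied from show_business_insights():
# income is in k$, spending is the 1-100 score.
HIGH = 70
LOW = 40


def recommend_for_segment(avg_income: float, avg_spending: float) -> List[str]:
    """Return the recommendation bullets for one cluster (hypothetical helper)."""
    if avg_income > HIGH and avg_spending > HIGH:
        return [
            "Focus on premium products and exclusive services",
            "Implement VIP loyalty programs",
            "Offer personalized shopping experiences",
        ]
    if avg_income > HIGH and avg_spending < LOW:
        return [
            "Develop targeted upselling strategies",
            "Showcase value propositions",
            "Create incentive programs to increase spending",
        ]
    if avg_income < LOW and avg_spending > HIGH:
        return [
            "Offer value-based products and promotions",
            "Focus on customer retention programs",
            "Provide flexible payment options",
        ]
    if avg_income < LOW and avg_spending < LOW:
        return [
            "Implement engagement and retention strategies",
            "Offer budget-friendly options",
            "Focus on building brand loyalty",
        ]
    # Everything else, including values exactly on a threshold, gets the generic advice.
    return [
        "Balanced marketing approach",
        "Personalized offers based on preferences",
        "Regular engagement campaigns",
    ]


if __name__ == "__main__":
    for line in recommend_for_segment(avg_income=85.0, avg_spending=82.0):
        print(f"- {line}")
```

Because the comparisons are strict, a cluster sitting exactly at 70 or 40 on either axis falls through to the generic branch, as does any cluster in the 40 to 70 band.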

utils/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


+# Utilities Package

utils/__pycache__/data_generator.cpython-311.pyc ADDED Viewed

Binary file (3.34 kB).

utils/data_generator.py ADDED Viewed

	@@ -0,0 +1,73 @@

+"""
+Data Generation Utilities
+========================
+Utility functions for generating sample datasets.
+"""
+import pandas as pd
+import numpy as np
+def create_sample_mall_customers(n_customers=200, random_seed=42):
+    """
+    Create a realistic sample Mall Customers dataset.
+    Parameters:
+    -----------
+    n_customers : int, default=200
+        Number of customers to generate
+    random_seed : int, default=42
+        Random seed for reproducibility
+    Returns:
+    --------
+    pd.DataFrame
+        Generated customer dataset
+    """
+    np.random.seed(random_seed)
+    customer_ids = range(1, n_customers + 1)
+    # Gender distribution (approximately 56% Female, 44% Male)
+    genders = np.random.choice(['Male', 'Female'], n_customers, p=[0.44, 0.56])
+    # Age distribution (mean ~39, std ~14)
+    ages = np.random.normal(38.85, 13.97, n_customers).astype(int)
+    ages = np.clip(ages, 18, 70)
+    # Create realistic income distribution (mean ~61k, std ~26k)
+    annual_incomes = np.random.normal(60.56, 26.26, n_customers)
+    annual_incomes = np.clip(annual_incomes, 15, 137)
+    # Create spending scores with realistic patterns
+    base_spending = np.random.normal(50, 25, n_customers)
+    # Add some income correlation
+    income_normalized = (annual_incomes - annual_incomes.min()) / (annual_incomes.max() - annual_incomes.min())
+    income_effect = (income_normalized - 0.5) * 30
+    # Add age effect (younger people might spend more)
+    age_normalized = (ages - ages.min()) / (ages.max() - ages.min())
+    age_effect = np.where(age_normalized < 0.3, 10,
+                         np.where(age_normalized > 0.7, -5, 0))
+    # Gender effect (slight difference in spending patterns)
+    gender_effect = np.where(genders == 'Female', 3, -3)
+    spending_scores = (base_spending +
+                      income_effect * 0.6 +
+                      age_effect +
+                      gender_effect +
+                      np.random.normal(0, 10, n_customers))
+    spending_scores = np.clip(spending_scores, 1, 100)
+    # Create DataFrame
+    sample_data = pd.DataFrame({
+        'CustomerID': customer_ids,
+        'Gender': genders,
+        'Age': ages,
+        'Annual Income (k$)': annual_incomes.round().astype(int),
+        'Spending Score (1-100)': spending_scores.round().astype(int)
+    })
+    return sample_data
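
To see how the effects in `create_sample_mall_customers` combine, it can help to look at the spending score with both Gaussian noise terms removed. The sketch below is a plain-Python illustration only, not part of the commit: the function name `spending_score_center` is hypothetical, and it takes the already min-max-normalized income and age values as inputs instead of NumPy arrays.

```python
def spending_score_center(income_norm: float, age_norm: float, gender: str) -> float:
    """Noise-free center of the generated spending score (hypothetical helper).

    income_norm and age_norm are the min-max-normalized values in [0, 1],
    as computed inside create_sample_mall_customers.
    """
    base = 50.0  # mean of base_spending
    # income_effect spans -15..+15 and is then scaled by 0.6, giving -9..+9
    income_effect = (income_norm - 0.5) * 30 * 0.6
    # Youngest ~30% of the age range get +10, oldest ~30% get -5
    if age_norm < 0.3:
        age_effect = 10.0
    elif age_norm > 0.7:
        age_effect = -5.0
    else:
        age_effect = 0.0
    gender_effect = 3.0 if gender == "Female" else -3.0
    return base + income_effect + age_effect + gender_effect


if __name__ == "__main__":
    # A mid-income, mid-age customer of each gender
    print(spending_score_center(0.5, 0.5, "Female"))
    print(spending_score_center(0.5, 0.5, "Male"))
```

Under these rules the center always lies between 33 and 72, so scores near 1 or 100 in the generated data come from the two noise terms (standard deviations 25 and 10) before the final clip.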