Lohith Venkat Chamakura committed
Commit 48909ac · 1 Parent(s): 599f1a9

Initial commit
Files changed (9)
  1. .DS_Store +0 -0
  2. README.md +264 -7
  3. app.py +817 -0
  4. constants.py +41 -0
  5. data_processor.py +314 -0
  6. insights.py +204 -0
  7. requirements.txt +10 -0
  8. utils.py +111 -0
  9. visualizations.py +327 -0
.DS_Store ADDED
Binary file (6.15 kB)
README.md CHANGED
@@ -1,13 +1,270 @@
- ---
- title: BI Dashboard
- emoji: 🏃
- colorFrom: red
- short_description: Business Intelligence Dashboard
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
---
title: Business Intelligence Dashboard
emoji: 📊
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.0.2
app_file: app.py
pinned: false
---

# Business Intelligence Dashboard

An interactive Business Intelligence dashboard built with Gradio that enables users to explore and analyze business data through an intuitive, Tableau-like web interface.

## Features

### 📁 Data Upload & Validation
- Upload CSV or Excel files through the web interface
- Display basic dataset information (shape, columns, data types)
- Show a data preview (first 10 rows)
- Graceful error handling with informative messages
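Under the hood, loading reduces to dispatching on the file extension. A minimal sketch of the idea (the `load_table` helper below is illustrative, not the project's actual `DataLoader` API):

```python
import pandas as pd

def load_table(path: str):
    """Load a CSV or Excel file, returning (DataFrame, error_message)."""
    try:
        if path.lower().endswith(".csv"):
            df = pd.read_csv(path)
        elif path.lower().endswith((".xlsx", ".xls")):
            df = pd.read_excel(path)
        else:
            return None, "Unsupported file type"
        return df, ""
    except Exception as exc:
        # Graceful failure: report the problem instead of raising
        return None, f"Failed to load {path}: {exc}"
```

Returning an `(object, error)` pair rather than raising keeps the web handlers simple: every callback can show the error string directly in the UI.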

### 📈 Data Exploration & Summary Statistics
- **Automated Data Profiling:**
  - Numerical columns: mean, median, std, min, max, quartiles
  - Categorical columns: unique values, value counts, mode
- Missing value report
- Correlation matrix for numerical features
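Each of these statistics maps onto a pandas built-in; a sketch of the underlying calls (not the actual `DataProfiler` implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 12.5, None, 9.0],
    "region": ["east", "west", "east", "east"],
})

numeric_stats = df.describe()                # mean, std, min, quartiles, max
missing = df.isna().sum()                    # missing-value report
mode_region = df["region"].mode()[0]         # categorical mode
corr = df.select_dtypes("number").corr()     # correlation matrix
```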

### 🔍 Interactive Filtering
- Dynamic filtering interface based on column types:
  - **Numerical:** Range sliders with min/max inputs
  - **Categorical:** Multi-select checkboxes
  - **Date:** Date range pickers (when applicable)
- Real-time row count updates as filters are applied
- Display filtered data preview
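Both filter types come down to boolean masks in pandas; a small sketch of how a range filter and a multi-select filter combine:

```python
import pandas as pd

df = pd.DataFrame({
    "close": [10, 15, 22, 8],
    "ticker": ["AAA", "BBB", "AAA", "CCC"],
})

# Numerical range filter AND categorical membership filter
mask = df["close"].between(9, 20) & df["ticker"].isin(["AAA", "BBB"])
filtered = df[mask]
row_count = len(filtered)  # the count the UI updates in real time
```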

### 📊 Visualizations
Implements five visualization types:
1. **Time Series Plot:** Trends over time with aggregation options
2. **Distribution Plot:** Histogram or box plot for numerical data
3. **Category Analysis:** Bar chart or pie chart for categorical data
4. **Scatter Plot:** Show relationships between variables
5. **Correlation Heatmap:** Visualize correlations between numerical features

**Features:**
- User selects which columns to visualize
- Clear titles, labels, and legends
- Multiple aggregation methods (sum, mean, count, median)
- Professional Plotly visualizations

### 💡 Insights Generation
Automatically generates insights:
- **Top/Bottom Performers:** Identify highest/lowest values
- **Basic Trends:** Detect patterns in time series data
- **Summary Statistics:** High-level dataset overview
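The top/bottom-performer insight reduces to pandas `nlargest`/`nsmallest`; a sketch (not the `InsightGenerator` code itself):

```python
import pandas as pd

df = pd.DataFrame({"store": list("abcde"), "revenue": [5, 9, 1, 7, 3]})

top2 = df.nlargest(2, "revenue")      # highest values
bottom2 = df.nsmallest(2, "revenue")  # lowest values
```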

### 💾 Export Functionality
- Export filtered data as CSV
- Export visualizations as PNG images
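The CSV half of this is a one-liner around `DataFrame.to_csv`; a sketch of the round trip (the path name here is illustrative):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
path = os.path.join(tempfile.gettempdir(), "filtered_data_export.csv")
df.to_csv(path, index=False)        # CSV export
roundtrip = pd.read_csv(path)       # verify the file is readable

# PNG export of a Plotly figure goes through fig.write_image(...),
# which requires the kaleido package listed in requirements.txt.
```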

## High-Level Architecture

```
User Interface (Gradio web interface)
  Tabs: Data Upload & Preview · Statistics & Profiling · Filter & Explore
        · Visualizations · Insights · Export
        │
        ▼
Application Layer (app.py)
  • Orchestrates user interactions
  • Manages global state (current_df, filters, figures)
  • Routes requests to the appropriate modules
        │
        ├─► Data Processing Layer (data_processor.py)
        │     CSV/Excel loading · data cleaning · filtering
        │     · statistics generation
        ├─► Visualizations Layer (visualizations.py)
        │     time series · distribution · category analysis
        │     · scatter plot · correlation heatmap
        └─► Insights Layer (insights.py)
              top/bottom performers · trend analysis · summary stats
        │
        ▼
Utilities Layer (utils.py)
  • Column type detection (numerical, categorical, date)
  • Missing value analysis
  • Data validation helpers
        │
        ▼
Data Sources (stocks.csv · sales_train.csv · Online Retail.xlsx)
  • CSV files (pandas.read_csv)
  • Excel files (pandas.read_excel)
  • User-uploaded datasets

External Libraries
  • pandas: data manipulation and analysis
  • plotly: interactive visualizations
  • gradio: web interface framework
  • numpy: numerical computations
```
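The column type detection the utilities layer performs can be sketched with `select_dtypes`; this is an illustration of the idea, not the project's `utils.py` implementation:

```python
import pandas as pd

def detect_column_types(df):
    """Split columns into numerical, categorical, and datetime lists (sketch)."""
    numerical = df.select_dtypes("number").columns.tolist()
    dates = df.select_dtypes(["datetime", "datetimetz"]).columns.tolist()
    # Everything that is neither numeric nor datetime is treated as categorical
    categorical = [c for c in df.columns if c not in numerical and c not in dates]
    return numerical, categorical, dates

df = pd.DataFrame({
    "qty": [1, 2],
    "city": ["NY", "LA"],
    "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
```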

## Project Structure

```
project/
├── app.py               # Main Gradio application
├── data_processor.py    # Data loading, cleaning, filtering
├── visualizations.py    # Chart creation functions
├── insights.py          # Automated insight generation
├── utils.py             # Helper functions
├── constants.py         # Shared configuration constants
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── data/                # Sample datasets
    ├── sales_train.csv
    ├── stocks.csv
    └── Online Retail.xlsx
```

## Setup Instructions

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

**Note:** This project pins Gradio 6.0.2, which includes improved performance and updated APIs. Make sure you have Python 3.8 or higher installed.

### 2. Run the Application

```bash
python app.py
```

The application will launch and be accessible at `http://localhost:7860` in your web browser.

## Usage

1. **Upload Data:** Navigate to the "Data Upload & Preview" tab and upload a CSV or Excel file
2. **View Statistics:** Go to "Statistics & Profiling" to see comprehensive data statistics
3. **Apply Filters:** Use "Filter & Explore" to filter your data by column values
4. **Create Visualizations:** Visit "Visualizations" to create interactive charts
5. **Generate Insights:** Check "Insights" for automated data insights
6. **Export Data:** Use "Export" to download filtered data or visualizations

## Aggregation Methods

The dashboard supports multiple aggregation methods for visualizations:
- **Sum**: Adds all values together (useful for totals, volumes)
- **Mean**: Calculates the average value (useful for prices, rates)
- **Count**: Counts the number of data points (useful for frequency)
- **Median**: Finds the middle value (robust to outliers)
- **None**: No aggregation (shows raw data points)
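All four aggregations map onto a single pandas `groupby` call; a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "close": [10.0, 20.0, 30.0],
})

# One groupby produces every aggregation the dashboard offers
agg = df.groupby("date")["close"].agg(["sum", "mean", "count", "median"])
```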

## Step-by-Step Tutorial: Monthly Average Closing Price

Let's walk through a complete example:

### Step 1: Load the Data
1. Open the dashboard
2. Go to the **📁 Data Upload & Preview** tab
3. Click **Upload Dataset**
4. Select `data/stocks.csv`
5. Click **Load Data**
6. Verify the data preview shows the stock data

### Step 2: Create the Visualization
1. Navigate to the **📊 Visualizations** tab
2. Configure the chart:
   - **Chart Type**: `Time Series`
   - **X-Axis Column**: `Date`
   - **Y-Axis Column**: `Close`
   - **Aggregation Method**: `Mean`
3. Click **Generate Visualization**

### Step 3: Interpret the Results
- The chart shows a line graph with dates on the X-axis and average closing prices on the Y-axis
- Each point represents the mean closing price for that date
- You can see trends, patterns, and changes over time

### Step 4: Compare Different Aggregations
Try generating the same chart with different aggregation methods:
- **Mean**: Average closing price (smooth trend)
- **Sum**: Total closing price (not meaningful for prices, but shows the concept)
- **Median**: Middle closing price (robust to outliers)
- **None**: All individual closing prices (may be cluttered)

## Technical Details

### Design Patterns

The application uses the **Strategy Pattern** for:
- **Data Loading:** Different strategies for CSV vs Excel files
- **Data Filtering:** Different strategies for numerical, categorical, and date filters
- **Visualizations:** Different strategies for each chart type
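In Python this pattern is often just a dispatch table mapping a name to a strategy callable; a minimal sketch (the `make_chart` helpers are illustrative, not the `VisualizationFactory` code):

```python
# Each strategy shares one signature; the dispatcher picks one at runtime.
def make_time_series(df):
    return f"time series over {len(df)} rows"

def make_scatter(df):
    return f"scatter over {len(df)} rows"

STRATEGIES = {"time_series": make_time_series, "scatter": make_scatter}

def make_chart(chart_type, df):
    try:
        return STRATEGIES[chart_type](df)
    except KeyError:
        raise ValueError(f"Unknown chart type: {chart_type}")
```

Adding a new chart type then means registering one more entry in the table, with no changes to the dispatch logic.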

### Code Quality

- Follows PEP 8 style guidelines
- Comprehensive docstrings for all functions
- Proper error handling with try/except blocks
- Modular design with clear separation of concerns
- No hardcoded values (uses constants and configuration)

### Libraries

- **pandas 2.2.0+:** All data manipulation and analysis
- **Gradio 6.0.2:** Web interface framework
- **Plotly 5.22.0+:** Interactive visualizations
- **matplotlib 3.8.0+ / seaborn 0.13.0+:** Additional visualization support
- **Python 3.8+:** Following best practices

## Sample Datasets

The `data/` folder includes sample datasets:
- `sales_train.csv`: Sales transaction data
- `stocks.csv`: Stock market data
- `Online Retail.xlsx`: E-commerce retail data

## Requirements

- Python 3.8 or higher
- All dependencies listed in `requirements.txt`:
  - pandas >= 2.2.0
  - numpy >= 1.26.0
  - gradio == 6.0.2
  - matplotlib >= 3.8.0
  - seaborn >= 0.13.0
  - plotly >= 5.22.0
  - kaleido >= 0.2.1
  - openpyxl >= 3.1.5
  - Pillow >= 10.4.0

## License

This project is created for educational purposes as part of CS5130 coursework.
app.py ADDED
@@ -0,0 +1,817 @@
"""
Main Gradio application for the Business Intelligence Dashboard.

This module creates a Tableau-like interactive dashboard interface
for data exploration and analysis.
"""

import gradio as gr
import pandas as pd
import numpy as np
from typing import Optional, Dict, List, Tuple, Any
import io
import base64
from PIL import Image
import plotly.graph_objects as go

from data_processor import DataLoader, DataFilter, DataProfiler
from visualizations import VisualizationFactory
from insights import InsightGenerator
from utils import detect_column_types, get_missing_value_summary
from constants import (
    PREVIEW_ROWS,
    FILTERED_PREVIEW_ROWS,
    MAX_COLUMNS_DISPLAY,
    MAX_UNIQUE_VALUES_DISPLAY,
    EXPORT_IMAGE_WIDTH,
    EXPORT_IMAGE_HEIGHT,
    EXPORT_IMAGE_SCALE,
    EXPORT_IMAGE_FILENAME,
    EXPORT_HTML_FILENAME,
    DEFAULT_TOP_N,
    KB_CONVERSION,
    TEXTBOX_LINES_DEFAULT,
    TEXTBOX_LINES_INSIGHTS
)


# Global state
current_df: Optional[pd.DataFrame] = None
current_filters: Dict[str, Any] = {}
current_figure: Optional[go.Figure] = None


def load_and_preview_data(file) -> Tuple[str, pd.DataFrame, str]:
    """
    Load data file and return preview information.

    Args:
        file: Uploaded file object (can be string path or file object in Gradio 6.0.2)

    Returns:
        Tuple of (info_text, preview_df, error_message)
    """
    global current_df, current_filters

    if file is None:
        return "No file uploaded", None, ""

    try:
        loader = DataLoader()
        # Handle both string paths and file objects (Gradio 6.0.2 compatibility)
        file_path = file if isinstance(file, str) else file.name
        df, error = loader.load_data(file_path)

        if error:
            return f"Error: {error}", None, error

        current_df = df
        current_filters = {}

        # Get basic info
        profiler = DataProfiler()
        info = profiler.get_basic_info(df)

        info_text = f"""
**Dataset Information:**
- **Shape:** {info['shape'][0]:,} rows × {info['shape'][1]} columns
- **Memory Usage:** {info['memory_usage'] / KB_CONVERSION:.2f} KB
- **Columns:** {', '.join(info['columns'][:MAX_COLUMNS_DISPLAY])}{'...' if len(info['columns']) > MAX_COLUMNS_DISPLAY else ''}
"""

        # Preview first rows
        preview_df = df.head(PREVIEW_ROWS)

        return info_text, preview_df, ""

    except Exception as e:
        return f"Error loading file: {str(e)}", None, str(e)


def get_statistics() -> Tuple[str, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Generate comprehensive statistics for the loaded dataset.

    Returns:
        Tuple of (missing_values_text, numerical_stats, categorical_stats, correlation_matrix)
    """
    global current_df

    if current_df is None or current_df.empty:
        return "No data loaded", pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

    try:
        profiler = DataProfiler()

        # Missing values
        missing_df = get_missing_value_summary(current_df)
        if missing_df.empty:
            missing_text = "✅ No missing values found in the dataset."
        else:
            missing_text = "**Missing Values Summary:**\n\n"
            missing_text += missing_df.to_string(index=False)

        # Numerical statistics
        numerical_stats = profiler.get_numerical_stats(current_df)

        # Categorical statistics
        categorical_stats = profiler.get_categorical_stats(current_df)

        # Correlation matrix
        correlation_matrix = profiler.get_correlation_matrix(current_df)

        return missing_text, numerical_stats, categorical_stats, correlation_matrix

    except Exception as e:
        return f"Error generating statistics: {str(e)}", pd.DataFrame(), pd.DataFrame(), pd.DataFrame()


def update_column_dropdowns():
    """
    Update column dropdown choices based on loaded data.

    Returns:
        Tuple of update dictionaries for x_column and y_column dropdowns
    """
    global current_df

    if current_df is None or current_df.empty:
        return gr.update(choices=[]), gr.update(choices=[])

    all_columns = list(current_df.columns)
    return gr.update(choices=all_columns), gr.update(choices=all_columns)


def apply_simple_filters(
    filter_column: Optional[str],
    filter_type: str,
    min_val: Optional[float],
    max_val: Optional[float],
    selected_values: List[str]
) -> Tuple[str, pd.DataFrame, int]:
    """
    Apply a single filter to the dataset.

    Args:
        filter_column: Column to filter on
        filter_type: Type of filter (numerical/categorical)
        min_val: Minimum value for numerical filter
        max_val: Maximum value for numerical filter
        selected_values: Selected values for categorical filter

    Returns:
        Tuple of (info_text, filtered_df, row_count)
    """
    global current_df, current_filters

    if current_df is None or current_df.empty:
        return "No data loaded", pd.DataFrame(), 0

    if filter_column is None or filter_column == "":
        # No filter applied, return original data
        current_filters = {}
        row_count = len(current_df)
        info_text = f"**Dataset:** {row_count:,} rows (no filters applied)"
        return info_text, current_df.head(FILTERED_PREVIEW_ROWS), row_count

    try:
        filters = {}
        numerical, categorical, date_columns = detect_column_types(current_df)

        if filter_type == "numerical" and filter_column in numerical:
            if min_val is not None and max_val is not None:
                original_min = float(current_df[filter_column].min())
                original_max = float(current_df[filter_column].max())
                if min_val != original_min or max_val != original_max:
                    filters[filter_column] = (min_val, max_val)
        elif filter_type == "categorical" and filter_column in categorical:
            if selected_values:
                all_vals = sorted(current_df[filter_column].dropna().unique().tolist())
                if set(selected_values) != set(all_vals):
                    filters[filter_column] = selected_values

        # Apply filters
        data_filter = DataFilter()
        filtered_df = data_filter.apply_filters(current_df, filters)
        current_filters = filters

        row_count = len(filtered_df)
        info_text = f"**Filtered Dataset:** {row_count:,} rows (from {len(current_df):,} original rows)"

        return info_text, filtered_df.head(FILTERED_PREVIEW_ROWS), row_count

    except Exception as e:
        return f"Error applying filters: {str(e)}", pd.DataFrame(), 0


def get_filter_options() -> Tuple[List[str], str, Dict]:
    """
    Get filter options based on current data.

    Returns:
        Tuple of (column_choices, default_type, filter_component_updates)
    """
    global current_df

    if current_df is None or current_df.empty:
        return [], "numerical", {}

    numerical, categorical, date_columns = detect_column_types(current_df)
    all_columns = list(current_df.columns)

    # Determine default filter type
    default_type = "numerical" if numerical else "categorical" if categorical else "numerical"

    return all_columns, default_type, {}


def create_visualization(
    chart_type: str,
    x_column: Optional[str],
    y_column: Optional[str],
    aggregation: str,
    category_chart_type: str = 'bar'
) -> go.Figure:
    """
    Create visualization based on user selections.

    Args:
        chart_type: Type of chart to create
        x_column: X-axis column
        y_column: Y-axis column
        aggregation: Aggregation method
        category_chart_type: Type for category charts (bar/pie)

    Returns:
        Plotly figure object
    """
    global current_df, current_filters, current_figure

    if current_df is None or current_df.empty:
        current_figure = None
        return None

    try:
        # Apply current filters
        if current_filters:
            data_filter = DataFilter()
            df = data_filter.apply_filters(current_df, current_filters)
        else:
            df = current_df.copy()

        if df.empty:
            current_figure = None
            return None

        # Validate required columns for specific chart types
        if chart_type in ['time_series', 'scatter']:
            if not x_column or not y_column:
                # Return a simple error message plot
                fig = go.Figure()
                fig.add_annotation(
                    text="Please select both X and Y columns for this chart type",
                    xref="paper", yref="paper",
                    x=0.5, y=0.5, showarrow=False,
                    font=dict(size=16)
                )
                fig.update_layout(title="Missing Required Columns")
                current_figure = fig
                return fig

        factory = VisualizationFactory()

        # Handle category chart type and distribution chart type
        # Pass sub-type (bar/pie for category, histogram/box for distribution) in kwargs
        # Use 'sub_chart_type' key to avoid conflict with factory's 'chart_type' parameter
        kwargs = {}
        if chart_type == 'category':
            kwargs['sub_chart_type'] = category_chart_type
        elif chart_type == 'distribution':
            kwargs['sub_chart_type'] = 'histogram'

        fig = factory.create_visualization(
            chart_type=chart_type,
            df=df,
            x_column=x_column,
            y_column=y_column,
            aggregation=aggregation,
            **kwargs
        )

        # Store the figure globally for export
        current_figure = fig

        return fig

    except Exception as e:
        print(f"Error creating visualization: {e}")
        # Return a simple error message plot
        fig = go.Figure()
        fig.add_annotation(
            text=f"Error creating visualization: {str(e)}",
            xref="paper", yref="paper",
            x=0.5, y=0.5, showarrow=False,
            font=dict(size=14)
        )
        fig.update_layout(title="Visualization Error")
        current_figure = fig
        return fig


def generate_insights() -> Tuple[str, str, str]:
    """
    Generate automated insights from the data.

    Returns:
        Tuple of (summary_insights, top_performers, trend_analysis)
    """
    global current_df, current_filters

    if current_df is None or current_df.empty:
        return "No data loaded", "", ""

    try:
        # Apply filters if any
        if current_filters:
            data_filter = DataFilter()
            df = data_filter.apply_filters(current_df, current_filters)
        else:
            df = current_df.copy()

        generator = InsightGenerator()

        # Summary insights
        summary = generator.generate_summary_insights(df)
        summary_text = "\n".join([f"• {insight}" for insight in summary])

        # Top/Bottom performers
        numerical, _, _ = detect_column_types(df)
        top_bottom_text = ""
        if numerical:
            # Use first numerical column
            col = numerical[0]
            performers = generator.get_top_bottom_performers(df, col, top_n=DEFAULT_TOP_N)

            top_bottom_text = f"**Top {DEFAULT_TOP_N} Performers for '{col}':**\n"
            for idx, val in performers['top']:
                top_bottom_text += f"  • Row {idx}: {val:,.2f}\n"

            top_bottom_text += f"\n**Bottom {DEFAULT_TOP_N} Performers for '{col}':**\n"
            for idx, val in performers['bottom']:
                top_bottom_text += f"  • Row {idx}: {val:,.2f}\n"

        # Trend analysis
        date_cols = [col for col in df.columns if 'date' in col.lower() or 'time' in col.lower()]
        trend_text = ""
        if date_cols and numerical:
            date_col = date_cols[0]
            value_col = numerical[0]
            trend = generator.detect_trends(df, date_col, value_col)
            trend_text = f"**Trend Analysis ({value_col} over {date_col}):**\n"
            trend_text += f"  • {trend.get('message', 'No trend detected')}\n"

        return summary_text, top_bottom_text, trend_text

    except Exception as e:
        return f"Error generating insights: {str(e)}", "", ""


def export_data() -> Optional[str]:
    """
    Export filtered data as CSV.

    Returns:
        Path to exported CSV file, or None on failure
    """
    global current_df, current_filters

    if current_df is None or current_df.empty:
        return None

    try:
        # Apply filters
        if current_filters:
            data_filter = DataFilter()
            df = data_filter.apply_filters(current_df, current_filters)
        else:
            df = current_df.copy()

        # Save to temporary file
        output_path = "filtered_data_export.csv"
        df.to_csv(output_path, index=False)

        return output_path

    except Exception as e:
        print(f"Error exporting data: {e}")
        return None


def export_visualization(fig) -> Optional[str]:
    """
    Export visualization as PNG or HTML.

    Args:
        fig: Plotly figure object or PlotData from Gradio (can be None)

    Returns:
        Path to exported file, or None if no figure
    """
    global current_figure

    # Use the stored figure instead of the PlotData object from Gradio
    plotly_fig = current_figure

    if plotly_fig is None:
        return None

    try:
        output_path = EXPORT_IMAGE_FILENAME
        # Try to export as PNG, fall back to HTML if kaleido is not available
        try:
            plotly_fig.write_image(
                output_path,
                width=EXPORT_IMAGE_WIDTH,
                height=EXPORT_IMAGE_HEIGHT,
                scale=EXPORT_IMAGE_SCALE
            )
        except Exception as img_error:
            # If image export fails, save as HTML instead
            try:
                output_path = EXPORT_HTML_FILENAME
                plotly_fig.write_html(output_path)
            except Exception as html_error:
                print(f"Error exporting visualization: {html_error}")
                return None
        return output_path

    except Exception as e:
        print(f"Error exporting visualization: {e}")
        return None

453
+ def create_dashboard():
454
+ """Create and configure the Gradio dashboard interface."""
455
+
456
+ with gr.Blocks(title="Business Intelligence Dashboard") as demo:
457
+ gr.Markdown(
458
+ """
459
+ # 📊 Business Intelligence Dashboard
460
+ **Interactive Data Analysis and Visualization Platform**
461
+
462
+ Upload your dataset and explore insights through an intuitive, Tableau-like interface.
463
+ """
464
+ )
465
+
466
+ # State to store current dataframe
467
+ df_state = gr.State(value=None)
468
+
469
+ # Tab 1: Data Upload
470
+ with gr.Tab("📁 Data Upload & Preview"):
471
+ with gr.Row():
472
+ with gr.Column(scale=1):
473
+ file_input = gr.File(
474
+ label="Upload Dataset",
475
+ file_types=[".csv", ".xlsx", ".xls"],
476
+ type="filepath"
477
+ )
478
+ upload_btn = gr.Button("Load Data", variant="primary", size="lg")
479
+
480
+ with gr.Column(scale=2):
481
+ info_output = gr.Markdown("Upload a CSV or Excel file to begin.")
482
+ preview_output = gr.Dataframe(
483
+ label=f"Data Preview (First {PREVIEW_ROWS} Rows)",
484
+ interactive=False,
485
+ wrap=True
486
+ )
487
+
488
+ upload_btn.click(
489
+ fn=load_and_preview_data,
490
+ inputs=[file_input],
491
+ outputs=[info_output, preview_output, df_state]
492
+ )
493
+
494
+ # Tab 2: Statistics
495
+ with gr.Tab("📈 Statistics & Profiling"):
496
+ with gr.Row():
497
+ with gr.Column():
498
+ stats_btn = gr.Button("Generate Statistics", variant="primary")
499
+ missing_output = gr.Textbox(
500
+ label="Missing Values Report",
501
+ lines=TEXTBOX_LINES_DEFAULT,
502
+ interactive=False
503
+ )
504
+
505
+ with gr.Column():
506
+ numerical_stats_output = gr.Dataframe(
507
+ label="Numerical Statistics",
508
+ interactive=False,
509
+ wrap=True
510
+ )
511
+
512
+ with gr.Row():
513
+ categorical_stats_output = gr.Dataframe(
514
+ label="Categorical Statistics",
515
+ interactive=False,
516
+ wrap=True
517
+ )
518
+ correlation_output = gr.Dataframe(
519
+ label="Correlation Matrix",
520
+ interactive=False,
521
+ wrap=True
522
+ )
523
+
524
+ stats_btn.click(
525
+ fn=get_statistics,
526
+ inputs=[],
527
+ outputs=[missing_output, numerical_stats_output, categorical_stats_output, correlation_output]
528
+ )
529
+
530
+ # Tab 3: Filter & Explore
531
+ with gr.Tab("🔍 Filter & Explore"):
532
+ with gr.Row():
533
+ with gr.Column(scale=1):
534
+ filter_info = gr.Markdown("**Apply filters to explore your data:**")
535
+ filter_column = gr.Dropdown(
536
+ choices=[],
537
+ label="Select Column to Filter",
538
+ interactive=True
539
+ )
540
+ filter_type = gr.Radio(
541
+ choices=["numerical", "categorical"],
542
+ label="Filter Type",
543
+ value="numerical",
544
+ interactive=True
545
+ )
546
+
547
+ with gr.Group(visible=True) as numerical_filter_group:
548
+ min_val_input = gr.Number(label="Minimum Value", interactive=True)
549
+ max_val_input = gr.Number(label="Maximum Value", interactive=True)
550
+
551
+ with gr.Group(visible=False) as categorical_filter_group:
552
+ selected_values = gr.CheckboxGroup(
553
+ choices=[],
554
+ label="Select Values",
555
+ interactive=True
556
+ )
557
+
558
+ filter_btn = gr.Button("Apply Filter", variant="primary")
559
+ clear_filter_btn = gr.Button("Clear Filters", variant="secondary")
560
+
561
+ with gr.Column(scale=2):
562
+ filter_result_info = gr.Markdown("")
563
+ filtered_data_output = gr.Dataframe(
564
+ label=f"Filtered Data Preview (First {FILTERED_PREVIEW_ROWS} Rows)",
565
+ interactive=False,
566
+ wrap=True
567
+ )
568
+ row_count_output = gr.Number(
569
+ label="Filtered Row Count",
570
+ interactive=False
571
+ )
572
+
573
+ def update_filter_ui(column, filter_type_val):
574
+ """Update filter UI based on column and type selection."""
575
+ global current_df
576
+
577
+ if current_df is None or current_df.empty or not column:
578
+ return (
579
+ gr.update(visible=False),
580
+ gr.update(visible=False),
581
+ gr.update(value=None),
582
+ gr.update(value=None),
583
+ gr.update(choices=[])
584
+ )
585
+
586
+ numerical, categorical, _ = detect_column_types(current_df)
587
+
588
+ if filter_type_val == "numerical" and column in numerical:
589
+ min_val = float(current_df[column].min())
590
+ max_val = float(current_df[column].max())
591
+ return (
592
+ gr.update(visible=True),
593
+ gr.update(visible=False),
594
+ gr.update(value=min_val, label=f"Min {column}"),
595
+ gr.update(value=max_val, label=f"Max {column}"),
596
+ gr.update(choices=[])
597
+ )
598
+ elif filter_type_val == "categorical" and column in categorical:
599
+ # key=str keeps sorted() from raising TypeError on mixed-type columns
+ unique_vals = sorted(
+ current_df[column].dropna().unique().tolist(), key=str
+ )[:MAX_UNIQUE_VALUES_DISPLAY]
602
+ return (
603
+ gr.update(visible=False),
604
+ gr.update(visible=True),
605
+ gr.update(value=None),
606
+ gr.update(value=None),
607
+ gr.update(choices=unique_vals, value=unique_vals)
608
+ )
609
+ else:
610
+ return (
611
+ gr.update(visible=False),
612
+ gr.update(visible=False),
613
+ gr.update(value=None),
614
+ gr.update(value=None),
615
+ gr.update(choices=[])
616
+ )
617
+
618
+ filter_column.change(
619
+ fn=update_filter_ui,
620
+ inputs=[filter_column, filter_type],
621
+ outputs=[numerical_filter_group, categorical_filter_group,
622
+ min_val_input, max_val_input, selected_values]
623
+ )
624
+
625
+ filter_type.change(
626
+ fn=update_filter_ui,
627
+ inputs=[filter_column, filter_type],
628
+ outputs=[numerical_filter_group, categorical_filter_group,
629
+ min_val_input, max_val_input, selected_values]
630
+ )
631
+
632
+ filter_btn.click(
633
+ fn=apply_simple_filters,
634
+ inputs=[filter_column, filter_type, min_val_input, max_val_input, selected_values],
635
+ outputs=[filter_result_info, filtered_data_output, row_count_output]
636
+ )
637
+
638
+ def clear_filters():
639
+ """Clear all filters."""
640
+ global current_filters
641
+ current_filters = {}
642
+ if current_df is not None:
643
+ row_count = len(current_df)
644
+ info_text = f"**Dataset:** {row_count:,} rows (filters cleared)"
645
+ return info_text, current_df.head(FILTERED_PREVIEW_ROWS), row_count
646
+ return "No data loaded", pd.DataFrame(), 0
647
+
648
+ clear_filter_btn.click(
649
+ fn=clear_filters,
650
+ inputs=[],
651
+ outputs=[filter_result_info, filtered_data_output, row_count_output]
652
+ )
653
+
654
+ def update_filter_column_choices():
655
+ """Update filter column dropdown when data is loaded."""
656
+ global current_df
657
+ if current_df is not None and not current_df.empty:
658
+ return gr.update(choices=list(current_df.columns))
659
+ return gr.update(choices=[])
660
+
661
+ # Update filter column choices when data is loaded
662
+ upload_btn.click(
663
+ fn=update_filter_column_choices,
664
+ inputs=[],
665
+ outputs=[filter_column],
666
+ queue=False
667
+ )
668
+
669
+ # Tab 4: Visualizations
670
+ with gr.Tab("📊 Visualizations"):
671
+ with gr.Row():
672
+ with gr.Column(scale=1):
673
+ chart_type = gr.Dropdown(
674
+ choices=[
675
+ ("Time Series", "time_series"),
676
+ ("Distribution (Histogram)", "distribution"),
677
+ ("Category Analysis", "category"),
678
+ ("Scatter Plot", "scatter"),
679
+ ("Correlation Heatmap", "correlation")
680
+ ],
681
+ label="Chart Type",
682
+ value="time_series"
683
+ )
684
+
685
+ x_column = gr.Dropdown(
686
+ choices=[],
687
+ label="X-Axis Column",
688
+ interactive=True
689
+ )
690
+
691
+ y_column = gr.Dropdown(
692
+ choices=[],
693
+ label="Y-Axis Column (Optional)",
694
+ interactive=True
695
+ )
696
+
697
+ aggregation = gr.Dropdown(
698
+ choices=["sum", "mean", "count", "median", "none"],
699
+ label="Aggregation Method",
700
+ value="sum"
701
+ )
702
+
703
+ category_chart_type = gr.Radio(
704
+ choices=["bar", "pie"],
705
+ label="Category Chart Type",
706
+ value="bar",
707
+ visible=False
708
+ )
709
+
710
+ viz_btn = gr.Button("Generate Visualization", variant="primary")
711
+
712
+ export_viz_btn = gr.Button("Export Visualization", variant="secondary")
713
+ export_viz_file = gr.File(label="Download Visualization (PNG or HTML)")
714
+
715
+ with gr.Column(scale=2):
716
+ visualization_output = gr.Plot(
717
+ label="Visualization",
718
+ container=True
719
+ )
720
+
721
+ def toggle_category_type(chart_type_val):
722
+ """Show/hide category chart type based on selection."""
723
+ return gr.update(visible=(chart_type_val == "category"))
724
+
725
+ def update_viz_column_choices():
726
+ """Update column dropdowns based on loaded data."""
727
+ global current_df
728
+ if current_df is not None and not current_df.empty:
729
+ all_columns = list(current_df.columns)
730
+ return gr.update(choices=all_columns), gr.update(choices=all_columns)
731
+ return gr.update(choices=[]), gr.update(choices=[])
732
+
733
+ chart_type.change(
734
+ fn=toggle_category_type,
735
+ inputs=[chart_type],
736
+ outputs=[category_chart_type]
737
+ )
738
+
739
+ # Update visualization column choices when data is loaded
740
+ upload_btn.click(
741
+ fn=update_viz_column_choices,
742
+ inputs=[],
743
+ outputs=[x_column, y_column],
744
+ queue=False
745
+ )
746
+
747
+ viz_btn.click(
748
+ fn=create_visualization,
749
+ inputs=[chart_type, x_column, y_column, aggregation, category_chart_type],
750
+ outputs=[visualization_output]
751
+ )
752
+
753
+ export_viz_btn.click(
754
+ fn=export_visualization,
755
+ inputs=[visualization_output],
756
+ outputs=[export_viz_file]
757
+ )
758
+
759
+ # Tab 5: Insights
760
+ with gr.Tab("💡 Insights"):
761
+ with gr.Row():
762
+ insights_btn = gr.Button("Generate Insights", variant="primary", size="lg")
763
+
764
+ with gr.Row():
765
+ with gr.Column():
766
+ summary_insights = gr.Markdown("### Summary Insights")
767
+ summary_output = gr.Textbox(
768
+ label="",
769
+ lines=TEXTBOX_LINES_DEFAULT,
770
+ interactive=False
771
+ )
772
+
773
+ with gr.Column():
774
+ top_bottom_output = gr.Textbox(
775
+ label="Top/Bottom Performers",
776
+ lines=TEXTBOX_LINES_DEFAULT,
777
+ interactive=False
778
+ )
779
+
780
+ trend_output = gr.Textbox(
781
+ label="Trend Analysis",
782
+ lines=TEXTBOX_LINES_INSIGHTS,
783
+ interactive=False
784
+ )
785
+
786
+ insights_btn.click(
787
+ fn=generate_insights,
788
+ inputs=[],
789
+ outputs=[summary_output, top_bottom_output, trend_output]
790
+ )
791
+
792
+ # Tab 6: Export
793
+ with gr.Tab("💾 Export"):
794
+ with gr.Row():
795
+ with gr.Column():
796
+ gr.Markdown("### Export Filtered Data")
797
+ export_data_btn = gr.Button("Export as CSV", variant="primary")
798
+ export_data_file = gr.File(label="Download CSV")
799
+
800
+ export_data_btn.click(
801
+ fn=export_data,
802
+ inputs=[],
803
+ outputs=[export_data_file]
804
+ )
805
+
806
+ return demo
807
+
808
+
809
+ if __name__ == "__main__":
810
+ demo = create_dashboard()
811
+ # Theming belongs on gr.Blocks(theme=...) inside create_dashboard();
+ # launch() has no theme parameter.
+ demo.launch(
+ share=False,
+ server_name="0.0.0.0",
+ server_port=7860
+ )
817
+
constants.py ADDED
@@ -0,0 +1,41 @@
1
+ """
2
+ Constants for the Business Intelligence Dashboard.
3
+
4
+ This module contains all configuration constants to avoid hardcoded values
5
+ throughout the codebase.
6
+ """
7
+
8
+ # Preview and Display Constants
9
+ PREVIEW_ROWS = 10
10
+ FILTERED_PREVIEW_ROWS = 100
11
+ MAX_CATEGORY_DISPLAY = 20
12
+ MAX_UNIQUE_VALUES_DISPLAY = 100
13
+ MAX_COLUMNS_DISPLAY = 10
14
+
15
+ # Export Constants
16
+ EXPORT_IMAGE_WIDTH = 1200
17
+ EXPORT_IMAGE_HEIGHT = 800
18
+ EXPORT_IMAGE_SCALE = 2
19
+ EXPORT_IMAGE_FILENAME = "visualization_export.png"
20
+ EXPORT_HTML_FILENAME = "visualization_export.html"
21
+
22
+ # Statistical Constants
23
+ Q1_QUANTILE = 0.25
24
+ Q3_QUANTILE = 0.75
25
+ IQR_MULTIPLIER = 1.5
26
+ TREND_THRESHOLD_PERCENT = 5
27
+
28
+ # Analysis Constants
29
+ DEFAULT_TOP_N = 5
30
+ HISTOGRAM_BINS = 30
31
+ MIN_DATA_POINTS_FOR_TREND = 2
32
+ MIN_NUMERICAL_COLUMNS_FOR_CORRELATION = 2
33
+
34
+ # Data Conversion Constants
35
+ KB_CONVERSION = 1024
36
+ BYTES_TO_KB_DIVISOR = 1024
37
+
38
+ # UI Constants
39
+ TEXTBOX_LINES_DEFAULT = 10
40
+ TEXTBOX_LINES_INSIGHTS = 5
41
+
data_processor.py ADDED
@@ -0,0 +1,314 @@
1
+ """
2
+ Data processing module for the Business Intelligence Dashboard.
3
+
4
+ This module handles data loading, cleaning, filtering, and profiling
5
+ using the Strategy Pattern for different data operations.
6
+ """
7
+
8
+ from abc import ABC, abstractmethod
9
+ from typing import Dict, List, Optional, Tuple, Any
10
+ import pandas as pd
11
+ import numpy as np
12
+ from utils import detect_column_types, validate_dataframe, get_missing_value_summary
13
+ from constants import MIN_NUMERICAL_COLUMNS_FOR_CORRELATION
14
+
15
+
16
+ class DataLoadStrategy(ABC):
17
+ """Abstract base class for data loading strategies."""
18
+
19
+ @abstractmethod
20
+ def load(self, file_path: str) -> pd.DataFrame:
21
+ """
22
+ Load data from file.
23
+
24
+ Args:
25
+ file_path: Path to the data file
26
+
27
+ Returns:
28
+ Loaded DataFrame
29
+ """
30
+ pass
31
+
32
+
33
+ class CSVLoadStrategy(DataLoadStrategy):
34
+ """Strategy for loading CSV files."""
35
+
36
+ def load(self, file_path: str) -> pd.DataFrame:
37
+ """Load CSV file."""
38
+ return pd.read_csv(file_path)
39
+
40
+
41
+ class ExcelLoadStrategy(DataLoadStrategy):
42
+ """Strategy for loading Excel files."""
43
+
44
+ def load(self, file_path: str) -> pd.DataFrame:
45
+ """Load Excel file."""
46
+ return pd.read_excel(file_path)
47
+
48
+
49
+ class DataLoader:
50
+ """Context class for data loading using Strategy Pattern."""
51
+
52
+ def __init__(self):
53
+ """Initialize with default strategies."""
54
+ self._strategies = {
55
+ '.csv': CSVLoadStrategy(),
56
+ '.xlsx': ExcelLoadStrategy(),
57
+ '.xls': ExcelLoadStrategy()
58
+ }
59
+
60
+ def load_data(self, file_path: str) -> Tuple[Optional[pd.DataFrame], Optional[str]]:
61
+ """
62
+ Load data file using appropriate strategy.
63
+
64
+ Args:
65
+ file_path: Path to the data file
66
+
67
+ Returns:
68
+ Tuple of (DataFrame, error_message)
69
+ """
70
+ try:
71
+ import os
72
+ _, ext = os.path.splitext(file_path.lower())
73
+
74
+ if ext not in self._strategies:
75
+ return None, f"Unsupported file format: {ext}"
76
+
77
+ strategy = self._strategies[ext]
78
+ df = strategy.load(file_path)
79
+
80
+ # Validate loaded data
81
+ is_valid, error = validate_dataframe(df)
82
+ if not is_valid:
83
+ return None, error
84
+
85
+ return df, None
86
+
87
+ except Exception as e:
88
+ return None, f"Error loading file: {str(e)}"
89
+
90
+
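The `DataLoader` above dispatches on the file extension via a strategy map. A standalone sketch of that dispatch (the loader values here are placeholder strings, not the repo's strategy classes):

```python
import os

# Toy extension -> loader dispatch, mirroring DataLoader._strategies:
# normalize to a lowercase extension, fail cleanly on unknown formats.
def pick_loader(path, loaders):
    _, ext = os.path.splitext(path.lower())
    if ext not in loaders:
        return None, f"Unsupported file format: {ext}"
    return loaders[ext], None

loaders = {".csv": "csv-loader", ".xlsx": "excel-loader", ".xls": "excel-loader"}
print(pick_loader("Sales_Q3.XLSX", loaders))  # → ('excel-loader', None)
print(pick_loader("report.pdf", loaders))     # → (None, 'Unsupported file format: .pdf')
```

Lowercasing before `splitext` is what lets `Sales_Q3.XLSX` match the `.xlsx` entry.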
91
+ class FilterStrategy(ABC):
92
+ """Abstract base class for filtering strategies."""
93
+
94
+ @abstractmethod
95
+ def apply_filter(
96
+ self,
97
+ df: pd.DataFrame,
98
+ column: str,
99
+ filter_value: Any
100
+ ) -> pd.DataFrame:
101
+ """
102
+ Apply filter to DataFrame.
103
+
104
+ Args:
105
+ df: Input DataFrame
106
+ column: Column to filter on
107
+ filter_value: Filter value/range
108
+
109
+ Returns:
110
+ Filtered DataFrame
111
+ """
112
+ pass
113
+
114
+
115
+ class NumericalFilterStrategy(FilterStrategy):
116
+ """Strategy for filtering numerical columns."""
117
+
118
+ def apply_filter(
119
+ self,
120
+ df: pd.DataFrame,
121
+ column: str,
122
+ filter_value: Tuple[float, float]
123
+ ) -> pd.DataFrame:
124
+ """Apply range filter to numerical column."""
125
+ min_val, max_val = filter_value
126
+ return df[(df[column] >= min_val) & (df[column] <= max_val)]
127
+
128
+
129
+ class CategoricalFilterStrategy(FilterStrategy):
130
+ """Strategy for filtering categorical columns."""
131
+
132
+ def apply_filter(
133
+ self,
134
+ df: pd.DataFrame,
135
+ column: str,
136
+ filter_value: List[str]
137
+ ) -> pd.DataFrame:
138
+ """Apply multi-select filter to categorical column."""
139
+ if not filter_value:
140
+ return df
141
+ return df[df[column].isin(filter_value)]
142
+
143
+
144
+ class DateFilterStrategy(FilterStrategy):
145
+ """Strategy for filtering date columns."""
146
+
147
+ def apply_filter(
148
+ self,
149
+ df: pd.DataFrame,
150
+ column: str,
151
+ filter_value: Tuple[str, str]
152
+ ) -> pd.DataFrame:
153
+ """Apply date range filter."""
154
+ start_date, end_date = filter_value
+ if start_date and end_date:
+ # Copy first so the dtype conversion does not mutate the caller's DataFrame.
+ df = df.copy()
+ df[column] = pd.to_datetime(df[column], errors='coerce')
+ return df[(df[column] >= start_date) & (df[column] <= end_date)]
+ return df
159
+
160
+
161
+ class DataFilter:
162
+ """Context class for data filtering using Strategy Pattern."""
163
+
164
+ def __init__(self):
165
+ """Initialize with filter strategies."""
166
+ self._strategies = {
167
+ 'numerical': NumericalFilterStrategy(),
168
+ 'categorical': CategoricalFilterStrategy(),
169
+ 'date': DateFilterStrategy()
170
+ }
171
+
172
+ def apply_filters(
173
+ self,
174
+ df: pd.DataFrame,
175
+ filters: Dict[str, Any]
176
+ ) -> pd.DataFrame:
177
+ """
178
+ Apply multiple filters to DataFrame.
179
+
180
+ Args:
181
+ df: Input DataFrame
182
+ filters: Dictionary of {column: filter_value}
183
+
184
+ Returns:
185
+ Filtered DataFrame
186
+ """
187
+ filtered_df = df.copy()
188
+ numerical, categorical, date_columns = detect_column_types(df)
189
+
190
+ for column, filter_value in filters.items():
191
+ if filter_value is None:
192
+ continue
193
+
194
+ if column in numerical:
195
+ strategy = self._strategies['numerical']
196
+ elif column in categorical:
197
+ strategy = self._strategies['categorical']
198
+ elif column in date_columns:
199
+ strategy = self._strategies['date']
200
+ else:
201
+ continue
202
+
203
+ try:
204
+ filtered_df = strategy.apply_filter(filtered_df, column, filter_value)
205
+ except Exception as e:
206
+ print(f"Error applying filter to {column}: {e}")
207
+ continue
208
+
209
+ return filtered_df
210
+
211
+
212
+ class DataProfiler:
213
+ """Class for generating data profiling and statistics."""
214
+
215
+ @staticmethod
216
+ def get_basic_info(df: pd.DataFrame) -> Dict[str, Any]:
217
+ """
218
+ Get basic dataset information.
219
+
220
+ Args:
221
+ df: Input DataFrame
222
+
223
+ Returns:
224
+ Dictionary with basic info
225
+ """
226
+ return {
227
+ 'shape': df.shape,
228
+ 'columns': list(df.columns),
229
+ 'dtypes': df.dtypes.to_dict(),
230
+ 'memory_usage': df.memory_usage(deep=True).sum()
231
+ }
232
+
233
+ @staticmethod
234
+ def get_numerical_stats(df: pd.DataFrame) -> pd.DataFrame:
235
+ """
236
+ Get statistics for numerical columns.
237
+
238
+ Args:
239
+ df: Input DataFrame
240
+
241
+ Returns:
242
+ DataFrame with numerical statistics, with column names as a column
243
+ """
244
+ numerical, _, _ = detect_column_types(df)
245
+ if not numerical:
246
+ return pd.DataFrame()
247
+
248
+ stats = df[numerical].describe()
249
+ # describe() already reports std; only the explicit 'median' row is new
+ stats.loc['median'] = df[numerical].median()
251
+
252
+ # Transpose so column names become rows (index)
253
+ stats_transposed = stats.T
254
+
255
+ # Reset index to make column names a regular column for display
256
+ stats_transposed = stats_transposed.reset_index()
257
+ stats_transposed.rename(columns={'index': 'Column'}, inplace=True)
258
+
259
+ # Reorder columns for better readability (Column first, then statistics)
260
+ column_order = ['Column', 'count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max', 'median']
261
+ # Only include columns that exist
262
+ available_columns = [col for col in column_order if col in stats_transposed.columns]
263
+ stats_transposed = stats_transposed[available_columns]
264
+
265
+ return stats_transposed
266
+
267
+ @staticmethod
268
+ def get_categorical_stats(df: pd.DataFrame) -> pd.DataFrame:
269
+ """
270
+ Get statistics for categorical columns.
271
+
272
+ Args:
273
+ df: Input DataFrame
274
+
275
+ Returns:
276
+ DataFrame with categorical statistics
277
+ """
278
+ _, categorical, _ = detect_column_types(df)
279
+ if not categorical:
280
+ return pd.DataFrame()
281
+
282
+ stats = []
283
+ for col in categorical:
284
+ unique_count = df[col].nunique()
285
+ mode_value = df[col].mode().iloc[0] if not df[col].mode().empty else None
286
+ # value_counts() drops NaN, so an all-NaN column yields an empty Series
+ value_counts = df[col].value_counts()
+ mode_count = int(value_counts.iloc[0]) if not value_counts.empty else 0
287
+
288
+ stats.append({
289
+ 'Column': col,
290
+ 'Unique_Values': unique_count,
291
+ 'Mode': mode_value,
292
+ 'Mode_Count': mode_count,
293
+ 'Total_Count': len(df)
294
+ })
295
+
296
+ return pd.DataFrame(stats)
297
+
298
+ @staticmethod
299
+ def get_correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
300
+ """
301
+ Get correlation matrix for numerical columns.
302
+
303
+ Args:
304
+ df: Input DataFrame
305
+
306
+ Returns:
307
+ Correlation matrix DataFrame
308
+ """
309
+ numerical, _, _ = detect_column_types(df)
310
+ if len(numerical) < MIN_NUMERICAL_COLUMNS_FOR_CORRELATION:
311
+ return pd.DataFrame()
312
+
313
+ return df[numerical].corr()
314
+
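The numerical and categorical filter strategies above reduce to boolean masks. A minimal pandas sketch of how they compose (column names and values are invented for illustration):

```python
import pandas as pd

# Range mask for a numerical column, then isin() for a categorical one,
# matching NumericalFilterStrategy and CategoricalFilterStrategy above.
df = pd.DataFrame({
    "sales": [100, 250, 400, 90],
    "region": ["EU", "US", "EU", "APAC"],
})

in_range = df[(df["sales"] >= 100) & (df["sales"] <= 300)]
filtered = in_range[in_range["region"].isin(["EU", "US"])]
print(filtered["sales"].tolist())  # → [100, 250]
```

Applying each mask to the result of the previous one is why filter order does not change the final rows.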
insights.py ADDED
@@ -0,0 +1,204 @@
1
+ """
2
+ Insights generation module for the Business Intelligence Dashboard.
3
+
4
+ This module automatically generates insights and identifies patterns
5
+ in the data.
6
+ """
7
+
8
+ from typing import Dict, List, Tuple, Optional, Any
9
+ import pandas as pd
10
+ import numpy as np
11
+ from utils import detect_column_types
12
+ from constants import (
13
+ Q1_QUANTILE,
14
+ Q3_QUANTILE,
15
+ IQR_MULTIPLIER,
16
+ TREND_THRESHOLD_PERCENT,
17
+ MIN_DATA_POINTS_FOR_TREND
18
+ )
19
+
20
+
21
+ class InsightGenerator:
22
+ """Class for generating automated insights from data."""
23
+
24
+ @staticmethod
25
+ def get_top_bottom_performers(
26
+ df: pd.DataFrame,
27
+ column: str,
28
+ top_n: int = 5
29
+ ) -> Dict[str, List[Tuple[str, float]]]:
30
+ """
31
+ Identify top and bottom performers for a column.
32
+
33
+ Args:
34
+ df: Input DataFrame
35
+ column: Column to analyze
36
+ top_n: Number of top/bottom items to return
37
+
38
+ Returns:
39
+ Dictionary with 'top' and 'bottom' lists
40
+ """
41
+ if column not in df.columns:
42
+ return {'top': [], 'bottom': []}
43
+
44
+ df_clean = df.dropna(subset=[column])
45
+ if df_clean.empty:
46
+ return {'top': [], 'bottom': []}
47
+
48
+ # Get top performers
49
+ top = df_clean.nlargest(top_n, column)[[column]]
50
+ top_list = [(idx, float(val)) for idx, val in top[column].items()]
51
+
52
+ # Get bottom performers
53
+ bottom = df_clean.nsmallest(top_n, column)[[column]]
54
+ bottom_list = [(idx, float(val)) for idx, val in bottom[column].items()]
55
+
56
+ return {
57
+ 'top': top_list,
58
+ 'bottom': bottom_list
59
+ }
60
+
61
+ @staticmethod
62
+ def detect_trends(df: pd.DataFrame, date_column: str, value_column: str) -> Dict[str, Any]:
63
+ """
64
+ Detect trends in time series data.
65
+
66
+ Args:
67
+ df: Input DataFrame
68
+ date_column: Date column name
69
+ value_column: Value column name
70
+
71
+ Returns:
72
+ Dictionary with trend information
73
+ """
74
+ if date_column not in df.columns or value_column not in df.columns:
75
+ return {'trend': 'insufficient_data', 'message': 'Required columns not found'}
76
+
77
+ df_clean = df[[date_column, value_column]].copy()
78
+ df_clean[date_column] = pd.to_datetime(df_clean[date_column], errors='coerce')
79
+ df_clean = df_clean.dropna()
80
+
81
+ if len(df_clean) < MIN_DATA_POINTS_FOR_TREND:
82
+ return {
83
+ 'trend': 'insufficient_data',
84
+ 'message': f'Not enough data points (need at least {MIN_DATA_POINTS_FOR_TREND})'
85
+ }
86
+
87
+ df_clean = df_clean.sort_values(date_column)
88
+
89
+ # Calculate trend
90
+ first_half = df_clean[:len(df_clean)//2][value_column].mean()
91
+ second_half = df_clean[len(df_clean)//2:][value_column].mean()
92
+
93
+ change = ((second_half - first_half) / first_half * 100) if first_half != 0 else 0
94
+
95
+ if change > TREND_THRESHOLD_PERCENT:
96
+ trend = 'increasing'
97
+ message = f'Strong upward trend: {change:.2f}% increase'
98
+ elif change < -TREND_THRESHOLD_PERCENT:
99
+ trend = 'decreasing'
100
+ message = f'Downward trend: {change:.2f}% decrease'
101
+ else:
102
+ trend = 'stable'
103
+ message = f'Relatively stable: {change:.2f}% change'
104
+
105
+ return {
106
+ 'trend': trend,
107
+ 'message': message,
108
+ 'change_percentage': change,
109
+ 'first_half_avg': float(first_half),
110
+ 'second_half_avg': float(second_half)
111
+ }
112
+
113
+ @staticmethod
114
+ def detect_anomalies(df: pd.DataFrame, column: str) -> List[Dict[str, Any]]:
115
+ """
116
+ Detect anomalies in numerical data using IQR method.
117
+
118
+ Args:
119
+ df: Input DataFrame
120
+ column: Column to analyze
121
+
122
+ Returns:
123
+ List of anomaly dictionaries
124
+ """
125
+ if column not in df.columns:
126
+ return []
127
+
128
+ df_clean = df.dropna(subset=[column])
129
+ if df_clean.empty:
130
+ return []
131
+
132
+ Q1 = df_clean[column].quantile(Q1_QUANTILE)
133
+ Q3 = df_clean[column].quantile(Q3_QUANTILE)
134
+ IQR = Q3 - Q1
135
+
136
+ lower_bound = Q1 - IQR_MULTIPLIER * IQR
137
+ upper_bound = Q3 + IQR_MULTIPLIER * IQR
138
+
139
+ anomalies = df_clean[
140
+ (df_clean[column] < lower_bound) | (df_clean[column] > upper_bound)
141
+ ]
142
+
143
+ result = []
144
+ for idx, row in anomalies.iterrows():
145
+ result.append({
146
+ 'index': int(idx),
147
+ 'value': float(row[column]),
148
+ 'type': 'high' if row[column] > upper_bound else 'low'
149
+ })
150
+
151
+ return result
152
+
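The IQR fence used by `detect_anomalies` can be sketched standalone (synthetic values; the 0.25/0.75 quantiles and 1.5 multiplier mirror the constants in `constants.py`):

```python
import pandas as pd

# IQR outlier rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged,
# tagged 'high' or 'low' depending on which fence it crosses.
def iqr_anomalies(values):
    s = pd.Series(values).dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [(i, float(v), "high" if v > upper else "low")
            for i, v in s.items() if v < lower or v > upper]

print(iqr_anomalies([10, 11, 12, 11, 10, 12, 100]))  # → [(6, 100.0, 'high')]
```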
153
+ @staticmethod
154
+ def generate_summary_insights(df: pd.DataFrame) -> List[str]:
155
+ """
156
+ Generate high-level summary insights.
157
+
158
+ Args:
159
+ df: Input DataFrame
160
+
161
+ Returns:
162
+ List of insight strings
163
+ """
164
+ insights = []
165
+
166
+ # Basic stats
167
+ insights.append(f"Dataset contains {len(df):,} rows and {len(df.columns)} columns")
168
+
169
+ # Missing values
170
+ missing = df.isnull().sum().sum()
171
+ if missing > 0:
172
+ missing_pct = (missing / (len(df) * len(df.columns))) * 100
173
+ insights.append(
174
+ f"Found {missing:,} missing values ({missing_pct:.1f}% of data)"
175
+ )
176
+
177
+ # Numerical columns insights
178
+ numerical, categorical, date_columns = detect_column_types(df)
179
+
180
+ if numerical:
181
+ insights.append(f"Dataset has {len(numerical)} numerical columns")
182
+ # Find column with highest variance
183
+ variances = df[numerical].var()
184
+ if not variances.empty:
185
+ max_var_col = variances.idxmax()
186
+ insights.append(
187
+ f"'{max_var_col}' shows the highest variability"
188
+ )
189
+
190
+ if categorical:
191
+ insights.append(f"Dataset has {len(categorical)} categorical columns")
192
+ # Find most diverse category
193
+ unique_counts = {col: df[col].nunique() for col in categorical}
194
+ if unique_counts:
195
+ max_unique_col = max(unique_counts, key=unique_counts.get)
196
+ insights.append(
197
+ f"'{max_unique_col}' has the most unique values ({unique_counts[max_unique_col]})"
198
+ )
199
+
200
+ if date_columns:
201
+ insights.append(f"Dataset has {len(date_columns)} date columns")
202
+
203
+ return insights
204
+
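`detect_trends` compares the mean of the first half of the sorted series against the second half and classifies the percentage change against `TREND_THRESHOLD_PERCENT`. A self-contained sketch of that rule on synthetic numbers:

```python
import pandas as pd

# First-half vs second-half comparison, as in InsightGenerator.detect_trends;
# the 5% threshold mirrors TREND_THRESHOLD_PERCENT in constants.py.
def classify_trend(values, threshold_pct=5):
    s = pd.Series(values)
    first = s[: len(s) // 2].mean()
    second = s[len(s) // 2 :].mean()
    change = (second - first) / first * 100 if first != 0 else 0
    if change > threshold_pct:
        return "increasing", change
    if change < -threshold_pct:
        return "decreasing", change
    return "stable", change

print(classify_trend([100, 102, 98, 130, 140, 150]))  # → ('increasing', 40.0)
```

Splitting by halves keeps the rule robust to a single noisy point, at the cost of missing trends that reverse mid-series.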
requirements.txt ADDED
@@ -0,0 +1,10 @@
1
+ pandas>=2.2.0
2
+ numpy>=1.26.0
3
+ gradio==6.0.2
4
+ matplotlib>=3.8.0
5
+ seaborn>=0.13.0
6
+ plotly>=5.22.0
7
+ kaleido>=0.2.1
8
+ openpyxl>=3.1.5
9
+ Pillow>=10.4.0
10
+
utils.py ADDED
@@ -0,0 +1,111 @@
1
+ """
2
+ Utility functions for the Business Intelligence Dashboard.
3
+
4
+ This module contains helper functions for data type detection,
5
+ validation, and common operations.
6
+ """
7
+
8
+ from typing import List, Optional, Tuple
9
+ import pandas as pd
10
+ import numpy as np
11
+
12
+
13
+ def detect_column_types(df: pd.DataFrame) -> Tuple[List[str], List[str], List[str]]:
14
+ """
15
+ Detect column types in a DataFrame.
16
+
17
+ Args:
18
+ df: Input DataFrame
19
+
20
+ Returns:
21
+ Tuple of (numerical_columns, categorical_columns, date_columns)
22
+ """
23
+ numerical = []
24
+ categorical = []
25
+ date_columns = []
26
+
27
+ for col in df.columns:
28
+ if pd.api.types.is_datetime64_any_dtype(df[col]):
29
+ date_columns.append(col)
30
+ elif pd.api.types.is_numeric_dtype(df[col]):
31
+ numerical.append(col)
32
+ else:
33
+ categorical.append(col)
34
+
35
+ return numerical, categorical, date_columns
36
+
37
+
38
+ def validate_dataframe(df: pd.DataFrame) -> Tuple[bool, Optional[str]]:
39
+ """
40
+ Validate that DataFrame is not empty and has valid structure.
41
+
42
+ Args:
43
+ df: DataFrame to validate
44
+
45
+ Returns:
46
+ Tuple of (is_valid, error_message)
47
+ """
48
+ if df is None or df.empty:
49
+ return False, "DataFrame is empty or None"
50
+
51
+ if len(df.columns) == 0:
52
+ return False, "DataFrame has no columns"
53
+
54
+ return True, None
55
+
56
+
57
+ def format_number(value: float, decimals: int = 2) -> str:
58
+ """
59
+ Format a number with specified decimal places.
60
+
61
+ Args:
62
+ value: Number to format
63
+ decimals: Number of decimal places
64
+
65
+ Returns:
66
+ Formatted string
67
+ """
68
+ if pd.isna(value):
69
+ return "N/A"
70
+ return f"{value:,.{decimals}f}"
71
+
72
+
73
+ def safe_divide(numerator: float, denominator: float) -> float:
74
+ """
75
+ Safely divide two numbers, returning 0 if denominator is 0.
76
+
77
+ Args:
78
+ numerator: Numerator value
79
+ denominator: Denominator value
80
+
81
+ Returns:
82
+ Division result or 0
83
+ """
84
+ if denominator == 0 or pd.isna(denominator):
85
+ return 0.0
86
+ return numerator / denominator
87
+
88
+
89
+ def get_missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
90
+ """
91
+ Get summary of missing values in DataFrame.
92
+
93
+ Args:
94
+ df: Input DataFrame
95
+
96
+ Returns:
97
+ DataFrame with missing value statistics
98
+ """
99
+ missing = df.isnull().sum()
100
+ missing_pct = (missing / len(df)) * 100
101
+
102
+ summary = pd.DataFrame({
103
+ 'Column': missing.index,
104
+ 'Missing_Count': missing.values,
105
+ 'Missing_Percentage': missing_pct.values
106
+ })
107
+
108
+ return summary[summary['Missing_Count'] > 0].sort_values(
109
+ 'Missing_Count', ascending=False
110
+ )
111
+
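The dtype-based split in `detect_column_types` drives everything downstream (filters, stats, chart choices). The same logic, sketched on a tiny example frame (column names are invented):

```python
import pandas as pd

# Same split as utils.detect_column_types: datetime dtypes first,
# then numerics, everything else treated as categorical.
def split_columns(df):
    numerical, categorical, dates = [], [], []
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            dates.append(col)
        elif pd.api.types.is_numeric_dtype(df[col]):
            numerical.append(col)
        else:
            categorical.append(col)
    return numerical, categorical, dates

df = pd.DataFrame({
    "revenue": [1.5, 2.0],
    "region": ["EU", "US"],
    "day": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
print(split_columns(df))  # → (['revenue'], ['region'], ['day'])
```

Note the order of checks matters: datetime columns also pass some numeric checks in older pandas, so they are tested first.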
visualizations.py ADDED
@@ -0,0 +1,327 @@
1
+ """
2
+ Visualization module for the Business Intelligence Dashboard.
3
+
4
+ This module creates various types of charts and visualizations
5
+ using the Strategy Pattern for different chart types.
6
+ """
7
+
8
+ from abc import ABC, abstractmethod
9
+ from typing import Dict, List, Optional, Tuple, Any
10
+ import pandas as pd
11
+ import numpy as np
12
+ import matplotlib.pyplot as plt
13
+ import seaborn as sns
14
+ import plotly.express as px
15
+ import plotly.graph_objects as go
16
+ from plotly.subplots import make_subplots
17
+ from utils import detect_column_types
18
+ from constants import (
19
+ HISTOGRAM_BINS,
20
+ MAX_CATEGORY_DISPLAY,
21
+ MIN_NUMERICAL_COLUMNS_FOR_CORRELATION
22
+ )
23
+
24
+
25
+ class VisualizationStrategy(ABC):
26
+ """Abstract base class for visualization strategies."""
27
+
28
+ @abstractmethod
29
+ def create_chart(
30
+ self,
31
+ df: pd.DataFrame,
32
+ x_column: Optional[str] = None,
33
+ y_column: Optional[str] = None,
34
+ aggregation: str = 'sum',
35
+ **kwargs
36
+ ) -> go.Figure:
37
+ """
38
+ Create a visualization.
39
+
40
+ Args:
41
+ df: Input DataFrame
42
+ x_column: X-axis column
43
+ y_column: Y-axis column
44
+ aggregation: Aggregation method (sum, mean, count, median)
45
+ **kwargs: Additional parameters
46
+
47
+ Returns:
48
+ Plotly figure object
49
+ """
50
+ pass
51
+
52
+
53
+ class TimeSeriesStrategy(VisualizationStrategy):
54
+ """Strategy for creating time series plots."""
55
+
56
+ def create_chart(
57
+ self,
58
+ df: pd.DataFrame,
59
+ x_column: Optional[str] = None,
60
+ y_column: Optional[str] = None,
61
+ aggregation: str = 'sum',
62
+ **kwargs
63
+ ) -> go.Figure:
64
+ """Create time series plot."""
65
+ if x_column is None or y_column is None:
66
+ raise ValueError("Both x_column and y_column required for time series")
67
+
68
+ # Convert date column
69
+ df = df.copy()
70
+ df[x_column] = pd.to_datetime(df[x_column], errors='coerce')
71
+ df = df.dropna(subset=[x_column, y_column])
72
+
73
+ # Aggregate if needed
74
+ if aggregation != 'none':
75
+ df = df.groupby(x_column)[y_column].agg(aggregation).reset_index()
76
+
77
+ fig = px.line(
78
+ df,
79
+ x=x_column,
80
+ y=y_column,
81
+ title=f'Time Series: {y_column} over {x_column}',
82
+ labels={x_column: x_column, y_column: y_column}
83
+ )
84
+
85
+ fig.update_layout(
86
+ xaxis_title=x_column,
87
+ yaxis_title=y_column,
88
+ hovermode='x unified',
89
+ template='plotly_white'
90
+ )
91
+
92
+ return fig
93
+
94
+
95
+ class DistributionStrategy(VisualizationStrategy):
96
+ """Strategy for creating distribution plots."""
97
+
98
+ def create_chart(
99
+ self,
100
+ df: pd.DataFrame,
101
+ x_column: Optional[str] = None,
102
+ y_column: Optional[str] = None,
103
+ aggregation: str = 'sum',
104
+ sub_chart_type: str = 'histogram',
105
+ **kwargs
106
+ ) -> go.Figure:
107
+ """Create distribution plot (histogram or box plot)."""
108
+ if x_column is None:
109
+ raise ValueError("x_column required for distribution plot")
110
+
111
+ # Get sub_chart_type from kwargs if provided, otherwise use parameter
112
+ # Check both 'sub_chart_type' (new) and 'chart_type' (legacy) for compatibility
113
+ sub_chart_type = kwargs.pop('sub_chart_type', kwargs.pop('chart_type', sub_chart_type))
114
+
115
+ df = df.copy()
116
+ df = df.dropna(subset=[x_column])
117
+
118
+ if sub_chart_type == 'histogram':
119
+ fig = px.histogram(
120
+ df,
121
+ x=x_column,
122
+ title=f'Distribution of {x_column}',
123
+ labels={x_column: x_column, 'count': 'Frequency'},
124
+ nbins=HISTOGRAM_BINS
125
+ )
126
+ else: # box plot
127
+ fig = px.box(
128
+ df,
129
+ y=x_column,
130
+ title=f'Box Plot of {x_column}',
131
+ labels={x_column: x_column}
132
+ )
133
+
134
+ fig.update_layout(
135
+ template='plotly_white',
136
+ showlegend=False
137
+ )
138
+
139
+ return fig
140
+
141
+
142
+class CategoryAnalysisStrategy(VisualizationStrategy):
+    """Strategy for creating category analysis charts."""
+
+    def create_chart(
+        self,
+        df: pd.DataFrame,
+        x_column: Optional[str] = None,
+        y_column: Optional[str] = None,
+        aggregation: str = 'sum',
+        sub_chart_type: str = 'bar',
+        **kwargs
+    ) -> go.Figure:
+        """Create category analysis (bar chart or pie chart)."""
+        if x_column is None:
+            raise ValueError("x_column required for category analysis")
+
+        # Get sub_chart_type from kwargs if provided, otherwise use parameter
+        # Check both 'sub_chart_type' (new) and 'chart_type' (legacy) for compatibility
+        sub_chart_type = kwargs.pop('sub_chart_type', kwargs.pop('chart_type', sub_chart_type))
+
+        df = df.copy()
+        df = df.dropna(subset=[x_column])
+
+        if y_column:
+            # Aggregate by category
+            if aggregation != 'none':
+                df_agg = df.groupby(x_column)[y_column].agg(aggregation).reset_index()
+                df_agg.columns = [x_column, y_column]
+            else:
+                df_agg = df[[x_column, y_column]]
+
+            # Sort by value
+            df_agg = df_agg.sort_values(y_column, ascending=False).head(MAX_CATEGORY_DISPLAY)
+
+            if sub_chart_type == 'bar':
+                fig = px.bar(
+                    df_agg,
+                    x=x_column,
+                    y=y_column,
+                    title=f'{y_column} by {x_column}',
+                    labels={x_column: x_column, y_column: y_column}
+                )
+            else:  # pie
+                fig = px.pie(
+                    df_agg,
+                    names=x_column,
+                    values=y_column,
+                    title=f'{y_column} Distribution by {x_column}'
+                )
+        else:
+            # Count by category
+            value_counts = df[x_column].value_counts().head(MAX_CATEGORY_DISPLAY)
+
+            if sub_chart_type == 'bar':
+                fig = px.bar(
+                    x=value_counts.index,
+                    y=value_counts.values,
+                    title=f'Count by {x_column}',
+                    labels={'x': x_column, 'y': 'Count'}
+                )
+            else:  # pie
+                fig = px.pie(
+                    values=value_counts.values,
+                    names=value_counts.index,
+                    title=f'Distribution of {x_column}'
+                )
+
+        fig.update_layout(template='plotly_white')
+        return fig
+
+
+class ScatterStrategy(VisualizationStrategy):
+    """Strategy for creating scatter plots."""
+
+    def create_chart(
+        self,
+        df: pd.DataFrame,
+        x_column: Optional[str] = None,
+        y_column: Optional[str] = None,
+        aggregation: str = 'sum',
+        color_column: Optional[str] = None,
+        **kwargs
+    ) -> go.Figure:
+        """Create scatter plot."""
+        if x_column is None or y_column is None:
+            raise ValueError("Both x_column and y_column required for scatter plot")
+
+        df = df.copy()
+        df = df.dropna(subset=[x_column, y_column])
+
+        fig = px.scatter(
+            df,
+            x=x_column,
+            y=y_column,
+            color=color_column,
+            title=f'Scatter Plot: {y_column} vs {x_column}',
+            labels={x_column: x_column, y_column: y_column},
+            hover_data=df.columns.tolist()
+        )
+
+        fig.update_layout(template='plotly_white')
+        return fig
+
+
+class CorrelationHeatmapStrategy(VisualizationStrategy):
+    """Strategy for creating correlation heatmaps."""
+
+    def create_chart(
+        self,
+        df: pd.DataFrame,
+        x_column: Optional[str] = None,
+        y_column: Optional[str] = None,
+        aggregation: str = 'sum',
+        **kwargs
+    ) -> go.Figure:
+        """Create correlation heatmap."""
+        numerical, _, _ = detect_column_types(df)
+
+        if len(numerical) < MIN_NUMERICAL_COLUMNS_FOR_CORRELATION:
+            raise ValueError(
+                f"Need at least {MIN_NUMERICAL_COLUMNS_FOR_CORRELATION} "
+                "numerical columns for correlation"
+            )
+
+        corr_matrix = df[numerical].corr()
+
+        fig = px.imshow(
+            corr_matrix,
+            title='Correlation Heatmap',
+            labels=dict(x="Column", y="Column", color="Correlation"),
+            color_continuous_scale='RdBu',
+            aspect="auto"
+        )
+
+        fig.update_layout(template='plotly_white')
+        return fig
+
+
+class VisualizationFactory:
+    """Factory class for creating visualizations using Strategy Pattern."""
+
+    def __init__(self):
+        """Initialize with visualization strategies."""
+        self._strategies = {
+            'time_series': TimeSeriesStrategy(),
+            'distribution': DistributionStrategy(),
+            'category': CategoryAnalysisStrategy(),
+            'scatter': ScatterStrategy(),
+            'correlation': CorrelationHeatmapStrategy()
+        }
+
+    def create_visualization(
+        self,
+        chart_type: str,
+        df: pd.DataFrame,
+        x_column: Optional[str] = None,
+        y_column: Optional[str] = None,
+        aggregation: str = 'sum',
+        **kwargs
+    ) -> go.Figure:
+        """
+        Create visualization using appropriate strategy.
+
+        Args:
+            chart_type: Type of chart to create
+            df: Input DataFrame
+            x_column: X-axis column
+            y_column: Y-axis column
+            aggregation: Aggregation method
+            **kwargs: Additional parameters
+
+        Returns:
+            Plotly figure object
+        """
+        if chart_type not in self._strategies:
+            raise ValueError(f"Unknown chart type: {chart_type}")
+        strategy = self._strategies[chart_type]
+        return strategy.create_chart(
+            df,
+            x_column=x_column,
+            y_column=y_column,
+            aggregation=aggregation,
+            **kwargs
+        )
+
+
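The dispatch added in `VisualizationFactory.create_visualization` can be illustrated with a minimal, dependency-free sketch. The strategies below are hypothetical string-returning stand-ins (the real ones build Plotly figures from a DataFrame), so only the chart-type lookup and the `ValueError` path mirror the code in the diff:

```python
from typing import Callable, Dict


class MiniVisualizationFactory:
    """Stripped-down sketch of the factory/strategy dispatch above."""

    def __init__(self) -> None:
        # chart_type -> strategy callable, mirroring the _strategies dict
        self._strategies: Dict[str, Callable[..., str]] = {
            "scatter": lambda x, y: f"scatter of {y} vs {x}",
            "distribution": lambda x: f"histogram of {x}",
        }

    def create_visualization(self, chart_type: str, *args: str) -> str:
        # Unknown chart types fail fast, as in the real factory
        if chart_type not in self._strategies:
            raise ValueError(f"Unknown chart type: {chart_type}")
        return self._strategies[chart_type](*args)


factory = MiniVisualizationFactory()
print(factory.create_visualization("scatter", "price", "sales"))
# -> scatter of sales vs price
```

Registering strategies in a dict keyed by chart type means adding a new chart is a one-line change to the registry plus a new strategy class, which is the main payoff of the pattern here.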