damndeepesh committed
Commit d98380b · verified · Parent(s): f32488c

Uploaded 3 files

Files changed (3)
  1. README.md +94 -16
  2. app.py +722 -0
  3. requirements.txt +8 -2
README.md CHANGED
@@ -1,19 +1,97 @@
- ---
- title: AutoML
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
- - streamlit
- pinned: false
- short_description: Streamlit template space
- ---
-
- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community forums](https://discuss.streamlit.io).
# AutoML & Explainability Web Application

This Streamlit web application empowers users to perform end-to-end machine learning tasks with ease. Upload your data, automatically train and compare various models, understand their predictions through SHAP explainability, and export the best model for your needs.

## 🎯 Core Objectives

* **Accessibility**: Enable users of all technical backgrounds to leverage machine learning.
* **Automation**: Streamline the ML pipeline from data ingestion to model evaluation.
* **Transparency**: Provide clear insights into model behavior using SHAP.
* **Efficiency**: Quickly identify the best-performing model for a given dataset.

## ✨ Key Features

* **Flexible Data Upload**:
  * Supports `.csv` and `.xlsx` files.
  * Option to upload a single file (for automatic train/test splitting) or separate training and testing files.
* **Data Preprocessing**:
  * Automatic handling of missing values (imputation).
  * Encoding of categorical features.
  * Optional scaling of numeric features.
* **Target Column & Problem Type Detection**:
  * Easy selection of the target variable.
  * Automatic detection of problem type (Classification/Regression).
  * Auto-detection of common target column names.
* **Automated Model Training & Comparison**:
  * Trains a suite of models tailored to the problem type:
    * **Classification**: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, SVM, K-Nearest Neighbors, Gaussian Naive Bayes.
    * **Regression**: Linear Regression, Ridge Regression, ElasticNet, Decision Tree Regressor, Random Forest Regressor, Gradient Boosting Regressor, SVR, K-Nearest Neighbors Regressor.
  * Displays a leaderboard with key performance metrics (Accuracy, F1, AUC for classification; R², MSE for regression).
* **Model Explainability (XAI)**:
  * Utilizes SHAP (SHapley Additive exPlanations) for the best model.
  * Global feature importance plots.
  * Detailed SHAP summary plots (e.g., beeswarm) and individual prediction explanations (waterfall plots coming soon).
* **Model Export**: Download the trained best model (including preprocessing steps) as a `.joblib` file for deployment or further use.

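As a minimal sketch of reusing an exported model (the model, file name, and feature columns here are illustrative, not tied to any particular dataset):

```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Train a tiny stand-in model (the app exports its best model the same way)
X = pd.DataFrame({"feature_a": [0, 1, 2, 3], "feature_b": [1, 0, 1, 0]})
y = [0, 0, 1, 1]
model = RandomForestClassifier(random_state=42).fit(X, y)
joblib.dump(model, "best_model.joblib")

# Later (or in another process): load the exported model and predict.
# New data must use the same feature columns and preprocessing as training.
loaded = joblib.load("best_model.joblib")
preds = loaded.predict(X)
print(list(preds))
```

Note that if scaling or encoding was applied during training, the same fitted transformers must be applied to new data before calling `predict`.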
## ⚙️ Setup & Installation

1. **Prerequisites**: Python 3.7+ installed.
2. **Clone the Repository (Optional)**:
   ```bash
   # git clone <your_repository_url>  # If you have it on Git
   # cd AutoML-WebApp
   ```
   Alternatively, ensure `app.py` and `requirements.txt` are in your project directory.
3. **Create and Activate a Virtual Environment (Recommended)**:
   ```bash
   python3 -m venv venv
   source venv/bin/activate  # macOS/Linux
   # venv\Scripts\activate   # Windows
   ```
4. **Install Dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

## 🚀 Running the Application

1. Navigate to your project directory in the terminal.
2. Run the Streamlit app:
   ```bash
   streamlit run app.py
   ```
3. Open your browser and go to the URL provided (usually `http://localhost:8501`).

## 🔮 Upcoming Features & Enhancements

We are continuously working to improve this AutoML application. Here are some features on our roadmap:

* **Advanced Preprocessing Options**:
  * User control over imputation strategies (mean, median, mode, constant).
  * More encoding techniques (e.g., One-Hot Encoding, Target Encoding).
  * Feature selection techniques.
* **Hyperparameter Tuning**:
  * Integration of GridSearchCV or RandomizedSearchCV for optimizing model hyperparameters.
  * A user interface to define search spaces.
* **Expanded Model Support**:
  * LightGBM, XGBoost, CatBoost for both classification and regression.
  * Basic time-series forecasting models (e.g., ARIMA, Prophet) when applicable data is provided.
* **Enhanced Evaluation & Visualization**:
  * Interactive confusion matrix, ROC/AUC curves, and precision-recall curves for classification.
  * Residual plots and actual-vs-predicted plots for regression.
  * Cross-validation score details.
* **Deployment & Integration**:
  * Option to generate a simple Flask API endpoint for the exported model.
  * Dockerization support for easier deployment.
* **User Experience & Robustness**:
  * More detailed error handling and user guidance.
  * Saving and loading of experiment configurations.
  * Support for larger datasets (optimizations for memory and speed).
* **Advanced Explainability**:
  * Individual prediction explanations (waterfall plots).
  * Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) plots.
* **Data Insights**:
  * Automated exploratory data analysis (EDA) report generation.

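The hyperparameter tuning planned above could look roughly like this (model and parameter grid are illustrative placeholders, not the app's implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for a user-uploaded dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Illustrative search space; the UI would let users define this
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_)
```

`RandomizedSearchCV` has an almost identical interface and trades exhaustiveness for speed on larger search spaces.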
---
_This application is actively developed, with assistance from AI pair programming._
app.py ADDED
@@ -0,0 +1,722 @@
```python
import streamlit as st
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              RandomForestRegressor, GradientBoostingRegressor)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC, SVR
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, ElasticNet
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, f1_score
import shap
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import io
import base64
from datetime import datetime
import warnings

warnings.filterwarnings('ignore')

# Page configuration
st.set_page_config(
    page_title="AutoML + Explainability Platform",
    page_icon="🤖",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom CSS for better styling
st.markdown("""
<style>
    .main-header {
        font-size: 2.5rem;
        color: #1f77b4;
        text-align: center;
        margin-bottom: 2rem;
    }
    .metric-card {
        background-color: #f0f2f6;
        padding: 1rem;
        border-radius: 0.5rem;
        margin: 0.5rem 0;
        box-shadow: 0 2px 4px rgba(0,0,0,0.1);
    }
    .success-message {
        background-color: #d4edda;
        color: #155724;
        padding: 1rem;
        border-radius: 0.5rem;
        border: 1px solid #c3e6cb;
    }
    .stButton>button {
        width: 100%;
        border-radius: 0.5rem;
    }
</style>
""", unsafe_allow_html=True)
```
```python
# --- Helper Functions ---
def get_model_metrics(y_true, y_pred, y_proba=None, problem_type='Classification'):
    # Keys are prefixed 'Test ' to match the leaderboard columns used on the
    # comparison page and the best-model selection on the training page.
    metrics = {}
    if problem_type == "Classification":
        metrics['Test Accuracy'] = accuracy_score(y_true, y_pred)
        metrics['Test F1-score'] = f1_score(y_true, y_pred, average='weighted')
        if y_proba is not None and len(np.unique(y_true)) == 2:  # AUC for binary classification
            try:
                metrics['Test AUC'] = roc_auc_score(y_true, y_proba[:, 1])
            except ValueError:
                metrics['Test AUC'] = None  # Handle cases where AUC cannot be computed
        else:
            metrics['Test AUC'] = None
    elif problem_type == "Regression":
        from sklearn.metrics import r2_score, mean_squared_error
        metrics['R2'] = r2_score(y_true, y_pred)
        metrics['MSE'] = mean_squared_error(y_true, y_pred)
        # Add other regression metrics if desired, e.g., MAE
    return metrics

# --- Session State Initialization ---
def init_session_state():
    defaults = {
        'data': None, 'train_data': None, 'test_data': None, 'source_data_type': None,
        'target_column': None, 'problem_type': None,
        'models': {}, 'model_scores': {}, 'best_model_info': None,
        'X_train': None, 'X_test': None, 'y_train': None, 'y_test': None,
        'le_dict': {}, 'scaler': None, 'trained_pipeline': None
    }
    for key, value in defaults.items():
        if key not in st.session_state:
            st.session_state[key] = value
```
```python
# --- Page Functions ---
def data_upload_page():
    st.header("📁 Data Upload & Preview")

    upload_option = st.radio(
        "Select data upload method:",
        ('Single File (auto-split train/test)', 'Separate Train and Test Files'),
        key='upload_option'
    )

    uploaded_file = None
    uploaded_train_file = None
    uploaded_test_file = None

    if upload_option == 'Single File (auto-split train/test)':
        uploaded_file = st.file_uploader(
            "Choose a CSV or Excel file",
            type=['csv', 'xlsx', 'xls'],
            help="Upload your dataset. It will be split into training and testing sets.",
            key='single_file_uploader'
        )
    else:
        uploaded_train_file = st.file_uploader(
            "Choose a Training CSV or Excel file",
            type=['csv', 'xlsx', 'xls'],
            help="Upload your training dataset.",
            key='train_file_uploader'
        )
        uploaded_test_file = st.file_uploader(
            "Choose a Testing CSV or Excel file (Optional)",
            type=['csv', 'xlsx', 'xls'],
            help="Upload your testing dataset. If not provided, the training data will be split.",
            key='test_file_uploader'
        )

    df = None

    if uploaded_file:
        try:
            df = pd.read_csv(uploaded_file) if uploaded_file.name.endswith('.csv') else pd.read_excel(uploaded_file)
            st.session_state.data = df
            st.session_state.train_data = None  # Clear separate train/test if a single file is uploaded
            st.session_state.test_data = None
            st.session_state.target_column = None
            st.session_state.problem_type = None
            st.session_state.source_data_type = 'single'
        except Exception as e:
            st.error(f"Error reading single file: {e}")
            return
    elif uploaded_train_file:
        try:
            df_train = pd.read_csv(uploaded_train_file) if uploaded_train_file.name.endswith('.csv') else pd.read_excel(uploaded_train_file)
            st.session_state.train_data = df_train
            st.session_state.data = df_train  # Use train_data as primary for column selection initially
            df = df_train  # for common processing below
            st.session_state.target_column = None
            st.session_state.problem_type = None
            st.session_state.source_data_type = 'separate'
            if uploaded_test_file:
                df_test = pd.read_csv(uploaded_test_file) if uploaded_test_file.name.endswith('.csv') else pd.read_excel(uploaded_test_file)
                st.session_state.test_data = df_test
            else:
                st.session_state.test_data = None  # Explicitly set to None
        except Exception as e:
            st.error(f"Error reading train/test files: {e}")
            return

    if df is not None:
        try:
            # Common processing for df (either single file or training data)
            st.subheader("Data Overview" + (" (Training Data)" if st.session_state.get('source_data_type') == 'separate' else ""))
            col1, col2, col3 = st.columns(3)
            col1.metric("Rows", df.shape[0])
            col2.metric("Columns", df.shape[1])
            col3.metric("Missing Values", df.isnull().sum().sum())

            st.subheader("Data Preview (First 10 rows)")
            st.dataframe(df.head(10), use_container_width=True)

            st.subheader("Column Information")
            info_df = pd.DataFrame({
                'Column': df.columns,
                'Data Type': df.dtypes.astype(str),
                'Non-Null Count': df.count(),
                'Null Count': df.isnull().sum(),
                'Unique Values': df.nunique()
            }).reset_index(drop=True)
            st.dataframe(info_df, use_container_width=True)

            st.subheader("🎯 Target Column Selection")
            common_target_names = ['target', 'Target', 'label', 'Label', 'class', 'Class', 'Output', 'output', 'result', 'Result']
            detected_target = None
            df_columns = df.columns.tolist()
            for col_name in common_target_names:
                if col_name in df_columns:
                    detected_target = col_name
                    break

            target_options = [None] + df_columns
            target_index = target_options.index(detected_target) if detected_target else 0

            target_column = st.selectbox(
                "Select the target column (what you want to predict):",
                options=target_options,
                index=target_index,
                help="Choose the dependent variable. Common names are auto-detected."
            )

            auto_run_training = st.checkbox("Automatically start training when target is selected/detected?", value=False, key='auto_run_cb')

            if target_column:
                st.session_state.target_column = target_column
                target_series = df[target_column]

                # Determine problem type
                if target_series.nunique() <= 2 or (target_series.dtype == 'object' and target_series.nunique() <= 10):
                    st.session_state.problem_type = "Classification"
                    if target_series.dtype == 'object':
                        le = LabelEncoder()
                        df[target_column] = le.fit_transform(target_series)
                        st.session_state.le_dict[target_column] = le  # Store encoder for target
                elif pd.api.types.is_numeric_dtype(target_series):
                    st.session_state.problem_type = "Regression"
                else:
                    st.session_state.problem_type = "Unsupported Target Type"
                    st.error("Target column type is not suitable for classification or regression.")
                    return

                st.success(f"Target column '{target_column}' selected. Problem Type: {st.session_state.problem_type}")

                if st.session_state.get('source_data_type') == 'separate' and st.session_state.test_data is not None:
                    st.subheader("Test Data Overview")
                    col1_test, col2_test, col3_test = st.columns(3)
                    col1_test.metric("Test Rows", st.session_state.test_data.shape[0])
                    col2_test.metric("Test Columns", st.session_state.test_data.shape[1])
                    col3_test.metric("Test Missing Values", st.session_state.test_data.isnull().sum().sum())
                    st.dataframe(st.session_state.test_data.head(5), use_container_width=True)
                    if target_column not in st.session_state.test_data.columns:
                        st.error(f"Target column '{target_column}' not found in the uploaded test data. Please ensure column names match.")
                        return  # Stop further processing if target is missing in test data

                st.subheader(f"Target Column Distribution (in {'Training Data' if st.session_state.get('source_data_type') == 'separate' else 'Uploaded Data'}): {target_column}")
                if st.session_state.problem_type == "Classification":
                    fig, ax = plt.subplots()
                    sns.countplot(x=target_series, ax=ax)
                    st.pyplot(fig)
                else:
                    fig, ax = plt.subplots()
                    sns.histplot(target_series, kde=True, ax=ax)
                    st.pyplot(fig)

                if auto_run_training:
                    # Flag picked up by the training page on the next run
                    st.session_state.auto_run_triggered_for_training = True
                    st.rerun()

        except Exception as e:
            st.error(f"Error processing data: {e}")
            import traceback
            st.error(traceback.format_exc())
    else:
        st.info("👆 Please upload a CSV or Excel file (or separate train/test files) to get started.")
```
```python
def preprocess_data(df, target_column):
    X = df.drop(columns=[target_column])
    y = df[target_column].copy()  # Use .copy() to avoid SettingWithCopyWarning

    # Impute missing values in the target variable y
    if y.isnull().any():
        if st.session_state.problem_type == "Classification":
            # Mode imputation keeps class labels valid regardless of dtype
            y_imputer = SimpleImputer(strategy='most_frequent')
        else:  # Regression
            y_imputer = SimpleImputer(strategy='mean')
        y[:] = y_imputer.fit_transform(y.values.reshape(-1, 1)).ravel()
        st.warning(f"NaN values found and imputed in the target column '{target_column}'.")

    # Impute missing values in features X
    num_imputer = SimpleImputer(strategy='mean')
    cat_imputer = SimpleImputer(strategy='most_frequent')

    num_cols = X.select_dtypes(include=np.number).columns
    cat_cols = X.select_dtypes(include='object').columns

    if len(num_cols) > 0:
        X[num_cols] = num_imputer.fit_transform(X[num_cols])
    if len(cat_cols) > 0:
        X[cat_cols] = cat_imputer.fit_transform(X[cat_cols])

    # Encode categorical features
    le_dict_features = {}
    for col in cat_cols:
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col].astype(str))
        le_dict_features[col] = le
    st.session_state.le_dict.update(le_dict_features)

    # If the target was label encoded on the upload page, it is already
    # integer-typed; 'most_frequent' imputation preserves the encoded classes.
    return X, y
```
```python
def model_training_page():
    st.header("🚀 Model Training")
    # Check if data is available from either single upload or separate train/test upload
    data_available = (st.session_state.get('data') is not None) or \
                     (st.session_state.get('train_data') is not None)
    if not data_available or st.session_state.target_column is None:
        st.warning("⚠️ Please upload data (single or train/test) and select a target column first.")
        return
    if st.session_state.problem_type == "Unsupported Target Type":
        st.error("Cannot train models with the current target column type.")
        return

    target = st.session_state.target_column

    st.subheader("Training Configuration")
    col1, col2 = st.columns(2)
    # Disable the test_size slider if separate test data is provided
    disable_test_size = st.session_state.get('source_data_type') == 'separate' and st.session_state.get('test_data') is not None
    test_size = col1.slider("Test Size (if splitting single file)", 0.1, 0.5, 0.2, 0.05, disabled=disable_test_size)
    random_state = col1.number_input("Random State", value=42, min_value=0)
    cv_folds = col2.slider("Cross-Validation Folds", 3, 10, 5)
    scale_features = col2.checkbox("Scale Numeric Features", value=True)

    # Auto-start training if triggered from the upload page
    start_button_pressed = st.button("🎯 Start Training", type="primary", key='manual_start_train_button')
    if st.session_state.get('auto_run_triggered_for_training') and not start_button_pressed:
        st.session_state.auto_run_triggered_for_training = False  # Reset trigger
        start_button_pressed = True  # Simulate button press
        st.info("🤖 Auto-training initiated...")

    if start_button_pressed:
        with st.spinner("Preprocessing data and training models..."):
            try:
                X_train, X_test, y_train, y_test = None, None, None, None

                if st.session_state.get('source_data_type') == 'separate' and st.session_state.get('train_data') is not None:
                    df_train_processed = st.session_state.train_data.copy()
                    X_train, y_train = preprocess_data(df_train_processed, target)

                    if st.session_state.get('test_data') is not None:
                        df_test_processed = st.session_state.test_data.copy()
                        if target not in df_test_processed.columns:
                            st.error(f"Target column '{target}' not found in test data during preprocessing. Aborting.")
                            return
                        # NOTE: the test data is preprocessed separately, so imputers and
                        # encoders are refit on it. If one-hot encoding is added later,
                        # fit on X_train, transform X_test, and align columns instead.
                        X_test, y_test = preprocess_data(df_test_processed, target)
                    else:  # No test file: split the training data
                        X_train, X_test, y_train, y_test = train_test_split(
                            X_train, y_train, test_size=test_size, random_state=random_state,
                            stratify=(y_train if st.session_state.problem_type == "Classification" else None)
                        )
                else:  # Single file upload
                    df_processed = st.session_state.data.copy()
                    X, y = preprocess_data(df_processed, target)
                    X_train, X_test, y_train, y_test = train_test_split(
                        X, y, test_size=test_size, random_state=random_state,
                        stratify=(y if st.session_state.problem_type == "Classification" else None)
                    )

                if X_train is None or y_train is None:
                    st.error("Training data (X_train, y_train) could not be prepared. Please check your data and selections.")
                    return

                # Scaling is fit on X_train and only applied (transform) to X_test
                if scale_features:
                    num_cols_train = X_train.select_dtypes(include=np.number).columns
                    if len(num_cols_train) > 0:
                        scaler = StandardScaler()
                        X_train[num_cols_train] = scaler.fit_transform(X_train[num_cols_train])
                        st.session_state.scaler = scaler  # Save the fitted scaler
                        if X_test is not None:
                            # Scale the same numeric columns, in the same order as the train set
                            cols_to_scale_in_test = [col for col in num_cols_train if col in X_test.columns]
                            if len(cols_to_scale_in_test) > 0:
                                X_test[cols_to_scale_in_test] = scaler.transform(X_test[cols_to_scale_in_test])

                st.session_state.update({'X_train': X_train, 'X_test': X_test, 'y_train': y_train, 'y_test': y_test})

                # Define models based on problem type
                if st.session_state.problem_type == "Classification":
                    models_to_train = {
                        "Logistic Regression": LogisticRegression(random_state=random_state, max_iter=1000),
                        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
                        "Random Forest": RandomForestClassifier(random_state=random_state),
                        "Gradient Boosting": GradientBoostingClassifier(random_state=random_state),
                        "Support Vector Machine": SVC(random_state=random_state, probability=True),
                        "K-Nearest Neighbors": KNeighborsClassifier(),
                        "Gaussian Naive Bayes": GaussianNB()
                    }
                    scoring = 'accuracy'
                else:  # Regression (models are imported at the top of the file)
                    models_to_train = {
                        "Linear Regression": LinearRegression(),
                        "Ridge Regression": Ridge(random_state=random_state),
                        "ElasticNet Regression": ElasticNet(random_state=random_state),
                        "Random Forest Regressor": RandomForestRegressor(random_state=random_state),
                        "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=random_state),
                        "Decision Tree Regressor": DecisionTreeRegressor(random_state=random_state),
                        "Support Vector Regressor": SVR(),
                        "K-Nearest Neighbors Regressor": KNeighborsRegressor()
                    }
                    scoring = 'r2'

                trained_models = {}
                model_scores_dict = {}
                progress_bar = st.progress(0)
                status_text = st.empty()

                for i, (name, model) in enumerate(models_to_train.items()):
                    status_text.text(f"Training {name}...")
                    model.fit(X_train, y_train)
                    trained_models[name] = model

                    y_pred_test = model.predict(X_test)
                    y_proba_test = model.predict_proba(X_test) if hasattr(model, 'predict_proba') and st.session_state.problem_type == "Classification" else None

                    metrics = get_model_metrics(y_test, y_pred_test, y_proba_test, problem_type=st.session_state.problem_type)
                    cv_score = cross_val_score(model, X_train, y_train, cv=cv_folds, scoring=scoring).mean()

                    current_model_scores = {'CV Mean Score': cv_score}
                    current_model_scores.update(metrics)  # Add all relevant metrics
                    model_scores_dict[name] = current_model_scores

                    progress_bar.progress((i + 1) / len(models_to_train))

                st.session_state.models = trained_models
                st.session_state.model_scores = model_scores_dict

                # Determine the best model
                if st.session_state.problem_type == "Classification":
                    best_model_name = max(model_scores_dict, key=lambda k: (model_scores_dict[k].get('Test Accuracy') or 0, model_scores_dict[k].get('Test AUC') or 0))
                else:  # Regression: default to -inf if R2 is missing for a model
                    best_model_name = max(model_scores_dict, key=lambda k: model_scores_dict[k].get('R2', -float('inf')))

                st.session_state.best_model_info = {
                    'name': best_model_name,
                    'model': trained_models[best_model_name],
                    'metrics': model_scores_dict[best_model_name]
                }
                status_text.text("Training completed!")
                st.success(f"✅ Training completed! Best model: {best_model_name}")

            except Exception as e:
                st.error(f"Error during training: {e}")
                import traceback
                st.error(traceback.format_exc())
```
```python
def model_comparison_page():
    st.header("📊 Model Comparison")
    if not st.session_state.model_scores:
        st.warning("⚠️ Please train models first.")
        return

    scores_df = pd.DataFrame(st.session_state.model_scores).T.fillna(0)  # Fill NaN with 0 for display
    scores_df = scores_df.round(4)

    st.subheader("🏆 Model Leaderboard")
    if st.session_state.problem_type == "Classification":
        sort_by = 'Test Accuracy'
        display_cols = ['CV Mean Score', 'Test Accuracy', 'Test F1-score', 'Test AUC']
    else:  # Regression
        sort_by = 'R2'
        display_cols = ['CV Mean Score', 'R2', 'MSE']  # Add other regression metrics if needed

    leaderboard = scores_df[display_cols].sort_values(by=sort_by, ascending=False)
    leaderboard['Rank'] = range(1, len(leaderboard) + 1)
    leaderboard = leaderboard[['Rank'] + display_cols]
    st.dataframe(leaderboard.style.background_gradient(subset=[sort_by], cmap='RdYlGn'), use_container_width=True)

    best_model_name = st.session_state.best_model_info['name']
    best_metric_val = st.session_state.best_model_info['metrics'].get(sort_by)
    # Guard against a missing metric so the f-string format does not crash
    best_metric_str = f"{best_metric_val:.4f}" if isinstance(best_metric_val, (int, float)) else "N/A"
    st.markdown(f"<div class='success-message'><h4>🥇 Best Model: {best_model_name} ({sort_by}: {best_metric_str})</h4></div>", unsafe_allow_html=True)

    st.subheader("📈 Performance Visualization")
    fig, ax = plt.subplots(figsize=(10, 6))
    plot_data = scores_df[sort_by].sort_values(ascending=True)
    ax.barh(plot_data.index, plot_data.values, color=['#ff6b6b' if idx == best_model_name else '#4ecdc4' for idx in plot_data.index])
    ax.set_xlabel(sort_by)
    ax.set_title('Model Performance Comparison')
    st.pyplot(fig)

    if st.session_state.problem_type == "Classification" and st.session_state.X_test is not None:
        st.subheader(f"📋 Detailed Metrics for Best Model: {best_model_name}")
        best_model = st.session_state.best_model_info['model']
        y_pred = best_model.predict(st.session_state.X_test)

        col1, col2 = st.columns(2)
        with col1:
            st.text("Classification Report:")
            report_df = pd.DataFrame(classification_report(st.session_state.y_test, y_pred, output_dict=True)).transpose()
            st.dataframe(report_df.round(3), use_container_width=True)
        with col2:
            st.text("Confusion Matrix:")
            cm = confusion_matrix(st.session_state.y_test, y_pred)
            fig_cm, ax_cm = plt.subplots()
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax_cm)
            ax_cm.set_xlabel('Predicted')
            ax_cm.set_ylabel('Actual')
            st.pyplot(fig_cm)
```
537
+ def explainability_page():
+     st.header("🔍 Model Explainability (SHAP)")
+     if not st.session_state.best_model_info or st.session_state.X_test is None:
+         st.warning("⚠️ Please train a model and ensure test data is available.")
+         return
+
+     best_model = st.session_state.best_model_info['model']
+     best_model_name = st.session_state.best_model_info['name']
+     X_test_df = pd.DataFrame(st.session_state.X_test, columns=st.session_state.X_train.columns)
+
+     st.write(f"**Explaining model:** {best_model_name}")
+     with st.spinner("Generating SHAP explanations..."):
+         try:
+             # Pick a SHAP explainer suited to the model family
+             if isinstance(best_model, (RandomForestClassifier, GradientBoostingClassifier, DecisionTreeClassifier,
+                                        RandomForestRegressor, GradientBoostingRegressor, DecisionTreeRegressor)):
+                 explainer = shap.TreeExplainer(best_model)
+             elif isinstance(best_model, (LogisticRegression, LinearRegression, Ridge, ElasticNet)):
+                 explainer = shap.LinearExplainer(best_model, X_test_df)  # LinearExplainer requires background data
+             elif isinstance(best_model, (SVC, SVR, KNeighborsClassifier, KNeighborsRegressor, GaussianNB)):
+                 # KernelExplainer is model-agnostic but slow; use a small background sample from X_train.
+                 # For KNN and Naive Bayes it is a common fallback when Tree/LinearExplainer do not apply.
+                 background_data = shap.sample(st.session_state.X_train, min(100, len(st.session_state.X_train)))
+                 if isinstance(background_data, np.ndarray):
+                     background_data = pd.DataFrame(background_data, columns=X_test_df.columns)
+                 predict_fn = best_model.predict_proba if hasattr(best_model, 'predict_proba') else best_model.predict
+                 explainer = shap.KernelExplainer(predict_fn, background_data)
+             else:
+                 st.error(f"SHAP explanations are not supported for {best_model_name} with the current setup.")
+                 return
+
+             shap_values = explainer.shap_values(X_test_df)
+
+             # For binary classification, shap_values may be a list of two arrays (one per class);
+             # plot the values for the positive class (class 1).
+             if isinstance(shap_values, list) and len(shap_values) == 2 and st.session_state.problem_type == "Classification":
+                 shap_values_plot = shap_values[1]
+             else:
+                 shap_values_plot = shap_values
+
+             st.subheader("📊 Global Feature Importance (SHAP Summary Plot)")
+             fig_summary, ax_summary = plt.subplots()
+             shap.summary_plot(shap_values_plot, X_test_df, plot_type="bar", show=False, max_display=15)
+             st.pyplot(fig_summary)
+
+             st.subheader("🎯 SHAP Beeswarm Plot")
+             fig_beeswarm, ax_beeswarm = plt.subplots()
+             shap.summary_plot(shap_values_plot, X_test_df, show=False, max_display=15)
+             st.pyplot(fig_beeswarm)
+
+             st.subheader("💧 Individual Prediction Explanation (Waterfall Plot)")
+             sample_idx = st.selectbox("Select a sample from the test set to explain:", range(min(20, len(X_test_df))))
+             if st.button("Explain Sample"):
+                 fig_waterfall, ax_waterfall = plt.subplots()
+                 # Resolve the expected (base) value for the SHAP Explanation object
+                 if isinstance(explainer, shap.TreeExplainer):
+                     expected_value = explainer.expected_value
+                     if isinstance(expected_value, list):  # Multi-output case for TreeExplainer
+                         expected_value = expected_value[1] if len(expected_value) > 1 else expected_value[0]
+                 elif isinstance(explainer, (shap.LinearExplainer, shap.KernelExplainer)):
+                     expected_value = explainer.expected_value
+                     if isinstance(expected_value, np.ndarray) and expected_value.ndim > 0:
+                         expected_value = expected_value[0]  # Take the first element if it is an array
+                 else:
+                     expected_value = 0  # Fallback; may need adjustment
+
+                 shap_explanation_obj = shap.Explanation(
+                     values=shap_values_plot[sample_idx],
+                     base_values=expected_value,
+                     data=X_test_df.iloc[sample_idx].values,
+                     feature_names=list(X_test_df.columns)
+                 )
+                 shap.waterfall_plot(shap_explanation_obj, show=False, max_display=15)
+                 st.pyplot(fig_waterfall)
+
+                 actual = st.session_state.y_test.iloc[sample_idx]
+                 predicted = best_model.predict(X_test_df.iloc[[sample_idx]])[0]
+                 # str() so this works for regression values and string class labels alike
+                 st.metric("Actual Value", str(actual))
+                 st.metric("Predicted Value", str(predicted))
+
+         except Exception as e:
+             st.error(f"Error generating SHAP explanations: {e}")
+             import traceback
+             st.error(traceback.format_exc())
+
+ def model_export_page():
+     st.header("💾 Model Export")
+     if not st.session_state.best_model_info:
+         st.warning("⚠️ Please train a model first.")
+         return
+
+     best_model_info = st.session_state.best_model_info
+     best_model = best_model_info['model']
+     best_model_name = best_model_info['name']
+
+     st.write(f"**Best Model:** {best_model_name}")
+     st.write("**Metrics:**")
+     st.json(best_model_info['metrics'])
+
+     # Build a pipeline for export (model plus scaler, if one was used)
+     from sklearn.pipeline import Pipeline
+     steps = []
+     if st.session_state.scaler:
+         steps.append(('scaler', st.session_state.scaler))
+     steps.append(('model', best_model))
+     pipeline_to_export = Pipeline(steps)
+     st.session_state.trained_pipeline = pipeline_to_export
+
+     export_format = st.selectbox("Choose export format:", ["Joblib (.joblib)", "Pickle (.pkl)"])
+     file_name_suggestion = f"{best_model_name.lower().replace(' ', '_')}_pipeline"
+     file_name = st.text_input("Enter filename for export:", value=file_name_suggestion)
+
+     if st.button("📥 Download Model Pipeline", type="primary"):
+         try:
+             buffer = io.BytesIO()
+             ext = ".joblib" if "Joblib" in export_format else ".pkl"
+             if ext == ".joblib":
+                 joblib.dump(pipeline_to_export, buffer)
+             else:
+                 import pickle
+                 pickle.dump(pipeline_to_export, buffer)
+
+             buffer.seek(0)
+             st.download_button(
+                 label=f"Download {file_name}{ext}",
+                 data=buffer,
+                 file_name=f"{file_name}{ext}",
+                 mime="application/octet-stream"
+             )
+             st.success("Model pipeline ready for download!")
+         except Exception as e:
+             st.error(f"Error exporting model: {e}")
+
+     st.subheader("📖 How to use the exported pipeline:")
+     st.code(f"""
+ import joblib  # or import pickle
+ import pandas as pd
+
+ # Load the pipeline
+ pipeline = joblib.load('{file_name}{'.joblib' if 'Joblib' in export_format else '.pkl'}')
+
+ # Example new data (must have the same columns as training, BEFORE scaling)
+ # new_data = pd.DataFrame(...)
+
+ # Preprocess new_data as during training (handle categoricals, ensure column order)
+ # Required columns: {list(st.session_state.X_train.columns) if st.session_state.X_train is not None else 'X_train_columns'}
+
+ # Make predictions
+ # predictions = pipeline.predict(new_data)
+ # print(predictions)
+ """, language='python')
+
+ # --- Main Application ---
+ def main():
+     init_session_state()
+     st.markdown('<h1 class="main-header">🤖 AutoML & Explainability Platform</h1>', unsafe_allow_html=True)
+
+     st.sidebar.title("⚙️ Workflow")
+     page_options = ["Data Upload & Preview", "Model Training", "Model Comparison", "Explainability", "Model Export"]
+
+     # Handle auto-run navigation
+     if st.session_state.get('auto_run_triggered') and st.session_state.target_column:
+         st.session_state.auto_run_triggered = False  # Reset trigger
+         st.session_state.current_page = "Model Training"
+         st.session_state.auto_run_triggered_for_training = True  # Signal model_training_page to auto-start
+
+     if 'current_page' not in st.session_state:
+         st.session_state.current_page = "Data Upload & Preview"
+
+     page = st.sidebar.radio("Navigate", page_options, key='navigation_radio', index=page_options.index(st.session_state.current_page))
+     st.session_state.current_page = page  # Update current page based on user selection
+
+     if page == "Data Upload & Preview":
+         data_upload_page()
+     elif page == "Model Training":
+         model_training_page()
+     elif page == "Model Comparison":
+         model_comparison_page()
+     elif page == "Explainability":
+         explainability_page()
+     elif page == "Model Export":
+         model_export_page()
+
+     st.sidebar.markdown("---")
+     st.sidebar.markdown("_Developed with Trae AI_")
+
+ if __name__ == "__main__":
+     main()
requirements.txt CHANGED
@@ -1,3 +1,9 @@
- altair
+ streamlit
  pandas
- streamlit
+ numpy
+ scikit-learn
+ shap
+ matplotlib
+ seaborn
+ joblib
+ openpyxl # For .xlsx file support