Spaces:
Sleeping
Sleeping
| import os | |
| import streamlit as st | |
| import pandas as pd | |
| import numpy as np | |
| from sklearn.preprocessing import LabelEncoder, StandardScaler | |
| from utils.data_loader import DATA_PATH, CATEGORICAL_COLS, NUMERIC_COLS, load_raw_data, get_encoded_data, get_train_test, get_onehot_train_test | |
| st.set_page_config(page_title="Preprocessing", page_icon="π§", layout="wide") | |
| st.title("Data Preprocessing Pipeline") | |
| st.markdown( | |
| "Before feeding data to a machine learning model, we need to **clean**, " | |
| "**transform**, and **split** it. This page walks through every step we applied, " | |
| "shows the data before and after, and explains *why*." | |
| ) | |
| st.markdown("---") | |
| tab_missing, tab_encoding, tab_scaling, tab_split, tab_reference = st.tabs([ | |
| "Missing Data", | |
| "Categorical Encoding", | |
| "Feature Scaling", | |
| "Train / Test Split", | |
| "Imputation Reference Guide", | |
| ]) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # Tab 1 β Missing Data | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| with tab_missing: | |
| st.subheader("Step 1: Handling Missing Data") | |
| st.markdown("#### Why this matters") | |
| st.markdown( | |
| "Most ML algorithms cannot handle missing values. If we leave blanks in, " | |
| "the model will either crash or silently ignore those rows, losing valuable data. " | |
| "Proper imputation preserves sample size and avoids bias." | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### What we found") | |
| raw_csv = pd.read_csv(DATA_PATH) | |
| total_charges_blanks = raw_csv[raw_csv["TotalCharges"].str.strip() == ""] | |
| n_blanks = len(total_charges_blanks) | |
| col_info, col_sample = st.columns([1, 2]) | |
| with col_info: | |
| st.metric("Rows with missing TotalCharges", n_blanks) | |
| st.metric("Total rows in dataset", f"{len(raw_csv):,}") | |
| st.metric("% missing", f"{n_blanks / len(raw_csv):.2%}") | |
| with col_sample: | |
| st.markdown("**Rows where `TotalCharges` is blank:**") | |
| display_cols = ["customerID", "tenure", "MonthlyCharges", "TotalCharges", "Churn"] | |
| st.dataframe(total_charges_blanks[display_cols].head(10), use_container_width=True) | |
| st.caption( | |
| "Notice: all these customers have tenure = 0 β they just signed up " | |
| "and haven't been billed yet, so TotalCharges is an empty string." | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### What we did") | |
| st.markdown( | |
| "We used **Median Imputation** β replacing blanks with the median of `TotalCharges`." | |
| ) | |
| before_col, arrow_col, after_col = st.columns([5, 1, 5]) | |
| with before_col: | |
| st.markdown("**Before** (raw CSV)") | |
| sample_before = total_charges_blanks[display_cols].head(5).copy() | |
| st.dataframe(sample_before, use_container_width=True) | |
| with arrow_col: | |
| st.markdown("") | |
| st.markdown("") | |
| st.markdown("### β") | |
| with after_col: | |
| st.markdown("**After** (median filled)") | |
| cleaned = load_raw_data() | |
| sample_after = cleaned.loc[total_charges_blanks.index[:5], ["customerID", "tenure", "MonthlyCharges", "TotalCharges", "Churn"]] | |
| st.dataframe(sample_after, use_container_width=True) | |
| median_val = pd.to_numeric(raw_csv["TotalCharges"], errors="coerce").median() | |
| st.info(f"Median value used for filling: **${median_val:,.2f}**") | |
| st.markdown("---") | |
| st.markdown("#### Why Median instead of Mean?") | |
| st.markdown( | |
| """ | |
| `TotalCharges` is **right-skewed** β a few long-tenure customers have very high | |
| totals (> $8,000) while most are much lower. The mean gets pulled up by those outliers. | |
| The median is robust to skew and gives a more representative "typical" value. | |
| | Statistic | Value | | |
| |---|---| | |
| | Mean | ${:,.2f} | | |
| | Median | ${:,.2f} | | |
| | Min | ${:,.2f} | | |
| | Max | ${:,.2f} | | |
| """.format( | |
| pd.to_numeric(raw_csv["TotalCharges"], errors="coerce").mean(), | |
| median_val, | |
| pd.to_numeric(raw_csv["TotalCharges"], errors="coerce").min(), | |
| pd.to_numeric(raw_csv["TotalCharges"], errors="coerce").max(), | |
| ) | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### What else could we have done?") | |
| st.markdown( | |
| """ | |
| - **Fill with 0:** Since these are brand-new customers (tenure=0), filling with 0 | |
| is arguably the most logical choice for this specific column. | |
| - **Drop the rows:** Only 11 rows out of 7,043 (0.16%) β dropping them would have | |
| minimal impact, but we'd lose data unnecessarily. | |
| - **KNN Imputation:** Use similar customers' TotalCharges to estimate the missing value. | |
| Overkill for 11 rows but powerful when missingness is higher. | |
| - **Regression Imputation:** Predict TotalCharges from tenure and MonthlyCharges | |
| (they're highly correlated). Again, overkill here but useful in general. | |
| """ | |
| ) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # Tab 2 β Categorical Encoding | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| with tab_encoding: | |
| st.subheader("Step 2: Encoding Categorical Features") | |
| st.markdown("#### Why this matters") | |
| st.markdown( | |
| "ML models work with numbers, not text. A column like `Contract` with values " | |
| '"Month-to-month", "One year", "Two year" needs to be converted to numbers. ' | |
| "The encoding method matters because it affects how the model interprets relationships." | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### What we did β two strategies for two model types") | |
| st.markdown( | |
| """ | |
| We use **different encodings for different models**, because each model type | |
| interprets numbers differently: | |
| | Model | Encoding Used | Why | | |
| |---|---|---| | |
| | **Random Forest, XGBoost, Naive Bayes** | Label Encoding (integer per category) | Tree models split on thresholds, Naive Bayes uses probability distributions β neither assumes ordinal relationships | | |
| | **Logistic Regression** | One-Hot Encoding (binary column per category) | Linear models treat numbers as having magnitude β One-Hot avoids false ordinal relationships | | |
| """ | |
| ) | |
| st.markdown("##### Label Encoding (for tree-based models and Naive Bayes)") | |
| st.markdown( | |
| "Each unique text value gets a unique integer." | |
| ) | |
| raw_df = load_raw_data() | |
| enc_df, encoders = get_encoded_data() | |
| selected_cat = st.selectbox( | |
| "Pick a categorical column to see before/after", | |
| CATEGORICAL_COLS, | |
| index=CATEGORICAL_COLS.index("Contract"), | |
| ) | |
| before_enc, arrow_enc, after_enc = st.columns([5, 1, 5]) | |
| with before_enc: | |
| st.markdown(f"**Before** β `{selected_cat}` (text)") | |
| sample_idx = raw_df.head(10).index | |
| before_sample = raw_df.loc[sample_idx, ["customerID", selected_cat]].copy() | |
| st.dataframe(before_sample, use_container_width=True) | |
| with arrow_enc: | |
| st.markdown("") | |
| st.markdown("") | |
| st.markdown("### β") | |
| with after_enc: | |
| st.markdown(f"**After** β `{selected_cat}` (integer)") | |
| after_sample = enc_df.loc[sample_idx, [selected_cat]].copy() | |
| after_sample.insert(0, "customerID", raw_df.loc[sample_idx, "customerID"].values) | |
| st.dataframe(after_sample, use_container_width=True) | |
| le = encoders[selected_cat] | |
| mapping_df = pd.DataFrame({ | |
| "Original Value": le.classes_, | |
| "Encoded Value": range(len(le.classes_)), | |
| }) | |
| st.markdown(f"**Mapping for `{selected_cat}`:**") | |
| st.dataframe(mapping_df, use_container_width=True, hide_index=True) | |
| st.markdown("---") | |
| st.markdown("#### All 15 categorical columns encoded") | |
| summary_data = [] | |
| for col in CATEGORICAL_COLS: | |
| le_col = encoders[col] | |
| summary_data.append({ | |
| "Column": col, | |
| "Unique Values": len(le_col.classes_), | |
| "Classes": ", ".join(le_col.classes_), | |
| }) | |
| st.dataframe(pd.DataFrame(summary_data), use_container_width=True, hide_index=True) | |
| st.markdown("---") | |
| st.markdown("##### One-Hot Encoding (for Logistic Regression)") | |
| st.markdown( | |
| "For Logistic Regression, we use **One-Hot Encoding** via `pd.get_dummies(drop_first=True)`. " | |
| "Each category becomes its own binary (0/1) column, avoiding false ordinal relationships. " | |
| "We drop the first category per feature to prevent multicollinearity (the \"dummy variable trap\")." | |
| ) | |
| X_train_oh, _, _, _, feature_cols_oh = get_onehot_train_test() | |
| oh_before, oh_arrow, oh_after = st.columns([5, 1, 5]) | |
| with oh_before: | |
| st.markdown(f"**Before** β `Contract` (text)") | |
| oh_sample = raw_df.head(6)[["customerID", "Contract"]].copy() | |
| st.dataframe(oh_sample, use_container_width=True) | |
| with oh_arrow: | |
| st.markdown("") | |
| st.markdown("") | |
| st.markdown("### β") | |
| with oh_after: | |
| st.markdown("**After** β One-Hot columns") | |
| oh_cols = [c for c in feature_cols_oh if c.startswith("Contract_")] | |
| oh_display = X_train_oh.head(6)[oh_cols].copy() | |
| oh_display.index = range(1, len(oh_display) + 1) | |
| st.dataframe(oh_display, use_container_width=True) | |
| st.markdown( | |
| f"One-Hot Encoding expands our {len(CATEGORICAL_COLS)} categorical columns " | |
| f"into **{len([c for c in feature_cols_oh if c not in NUMERIC_COLS])} binary columns** " | |
| f"(total features: **{len(feature_cols_oh)}** vs **{len(CATEGORICAL_COLS) + len(NUMERIC_COLS)}** " | |
| f"with Label Encoding)." | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### Why two encodings?") | |
| st.markdown( | |
| """ | |
| We chose **Label Encoding for tree-based models and Naive Bayes** (Random Forest, | |
| XGBoost, Naive Bayes) because they split on thresholds or use probability distributions β | |
| the integer mapping works perfectly and they don't assume ordering between numbers. | |
| We chose **One-Hot Encoding for Logistic Regression** because it's a linear model | |
| that interprets feature values as having magnitude. With Label Encoding, the model | |
| might treat "Month-to-month = 2" as "twice" something compared to "DSL = 0", which | |
| is meaningless. One-Hot Encoding eliminates this problem entirely. | |
| **This dual-encoding approach gives each model the representation it works best with.** | |
| """ | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### Other encoding alternatives") | |
| st.markdown( | |
| """ | |
| | Technique | How it works | Best for | | |
| |---|---|---| | |
| | **Ordinal Encoding** | Like Label Encoding but you manually set the order (e.g., Month-to-month=0, One year=1, Two year=2) | Features with a natural order | | |
| | **Target Encoding** | Replace each category with the mean of the target (churn rate) for that category | High-cardinality features (many unique values) | | |
| | **Binary Encoding** | Converts integers to binary then splits into columns | Reduces dimensionality vs One-Hot for high-cardinality | | |
| """ | |
| ) | |
| st.success( | |
| "**Our approach:** Using the optimal encoding per model type is a best practice " | |
| "in production ML pipelines. It adds a small amount of complexity but ensures each " | |
| "model sees data in the format it handles best." | |
| ) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # Tab 3 β Feature Scaling | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| with tab_scaling: | |
| st.subheader("Step 3: Feature Scaling (StandardScaler)") | |
| st.markdown("#### Why this matters") | |
| st.markdown( | |
| "Some algorithms (Logistic Regression, Naive Bayes, SGDClassifier, KNN, SVM) are sensitive to " | |
| "the **magnitude** of features. If `tenure` ranges 0β72 but `TotalCharges` ranges " | |
| "18β8,685, the model will overweight TotalCharges simply because its numbers are bigger. " | |
| "Scaling puts all features on the same playing field." | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### Standardization vs Normalization β they are NOT the same") | |
| st.markdown( | |
| """ | |
| These two terms are often confused. Here's the difference: | |
| | | **Standardization** (what we did) | **Normalization** | | |
| |---|---|---| | |
| | **Goal** | Center data around 0 with unit variance | Squeeze data into a fixed range (usually 0β1) | | |
| | **Formula** | (x - mean) / std | (x - min) / (max - min) | | |
| | **Result** | Mean = 0, Std = 1. Values can be negative or > 1 | All values between 0 and 1 | | |
| | **Sensitive to outliers?** | Less sensitive (uses mean/std) | Very sensitive (one outlier stretches the whole scale) | | |
| | **Best for** | Algorithms that assume normally distributed features (Logistic Regression, Naive Bayes, SVM) | Neural networks, image pixel values, algorithms that need bounded input | | |
| We used **Standardization (StandardScaler)** because our models (SGDClassifier, Logistic Regression, | |
| Naive Bayes) either use gradient descent or assume Gaussian distributions, which work best when | |
| features are centered around zero with comparable spread. | |
| """ | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### What we did") | |
| st.markdown( | |
| "**StandardScaler** transforms each feature to have **mean = 0** and **standard deviation = 1**." | |
| ) | |
| st.latex(r"x_{\text{scaled}} = \frac{x - \mu}{\sigma}") | |
| enc_df_full, _ = get_encoded_data() | |
| feature_cols = CATEGORICAL_COLS + NUMERIC_COLS | |
| sample_rows = enc_df_full[feature_cols].head(8) | |
| scaler = StandardScaler() | |
| all_scaled = scaler.fit_transform(enc_df_full[feature_cols]) | |
| scaled_sample = pd.DataFrame( | |
| all_scaled[:8], | |
| columns=feature_cols, | |
| index=sample_rows.index, | |
| ) | |
| show_cols = st.multiselect( | |
| "Select columns to compare", | |
| feature_cols, | |
| default=["tenure", "MonthlyCharges", "TotalCharges", "Contract"], | |
| ) | |
| if show_cols: | |
| before_sc, arrow_sc, after_sc = st.columns([5, 1, 5]) | |
| with before_sc: | |
| st.markdown("**Before scaling**") | |
| st.dataframe(sample_rows[show_cols], use_container_width=True) | |
| with arrow_sc: | |
| st.markdown("") | |
| st.markdown("") | |
| st.markdown("### β") | |
| with after_sc: | |
| st.markdown("**After scaling** (mean=0, std=1)") | |
| st.dataframe(scaled_sample[show_cols].round(3), use_container_width=True) | |
| st.markdown("---") | |
| st.markdown("#### Feature statistics before scaling") | |
| stats = enc_df_full[NUMERIC_COLS].describe().T[["mean", "std", "min", "max"]] | |
| stats.columns = ["Mean", "Std Dev", "Min", "Max"] | |
| st.dataframe(stats.round(2), use_container_width=True) | |
| st.markdown( | |
| "Notice how `TotalCharges` has a mean of ~2,280 and range of 0β8,685 while `SeniorCitizen` " | |
| "is just 0 or 1. After scaling, they all center around 0 with comparable spread." | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### Before vs After β Distribution Graphs") | |
| st.markdown( | |
| "The tables above show numbers, but graphs make the difference dramatic. " | |
| "Pick a feature to see how its distribution shifts after StandardScaler." | |
| ) | |
| import plotly.graph_objects as go_fig | |
| from plotly.subplots import make_subplots as mk_sub | |
| scale_feature = st.selectbox( | |
| "Feature to visualize", | |
| ["tenure", "MonthlyCharges", "TotalCharges"], | |
| key="scale_viz_feature", | |
| ) | |
| before_vals = enc_df_full[scale_feature].values | |
| col_idx = feature_cols.index(scale_feature) | |
| after_vals = all_scaled[:, col_idx] | |
| fig_scale = mk_sub(rows=1, cols=2, subplot_titles=["Before Scaling", "After Scaling (StandardScaler)"]) | |
| fig_scale.add_trace( | |
| go_fig.Histogram(x=before_vals, nbinsx=40, marker_color="#636EFA", name="Before", showlegend=False), | |
| row=1, col=1, | |
| ) | |
| fig_scale.add_trace( | |
| go_fig.Histogram(x=after_vals, nbinsx=40, marker_color="#00CC96", name="After", showlegend=False), | |
| row=1, col=2, | |
| ) | |
| before_mean = float(np.mean(before_vals)) | |
| after_mean = float(np.mean(after_vals)) | |
| fig_scale.add_vline(x=before_mean, line_dash="dash", line_color="red", annotation_text=f"mean={before_mean:.1f}", row=1, col=1) | |
| fig_scale.add_vline(x=0, line_dash="dash", line_color="red", annotation_text="mean=0", row=1, col=2) | |
| fig_scale.update_layout( | |
| title_text=f"Effect of StandardScaler on `{scale_feature}`", | |
| height=350, | |
| ) | |
| fig_scale.update_xaxes(title_text=scale_feature, row=1, col=1) | |
| fig_scale.update_xaxes(title_text=f"{scale_feature} (scaled)", row=1, col=2) | |
| fig_scale.update_yaxes(title_text="Count", row=1, col=1) | |
| fig_scale.update_yaxes(title_text="Count", row=1, col=2) | |
| st.plotly_chart(fig_scale, use_container_width=True) | |
| before_std = float(np.std(before_vals)) | |
| after_std = float(np.std(after_vals)) | |
| sc1, sc2 = st.columns(2) | |
| sc1.metric("Before", f"mean = {before_mean:,.1f}, std = {before_std:,.1f}") | |
| sc2.metric("After", f"mean β 0, std β 1 β") | |
| st.caption( | |
| "After scaling, every feature has mean=0 and std=1 β that's exactly what StandardScaler " | |
| "is designed to do. This puts all features on equal footing so no single feature " | |
| "dominates just because its raw numbers are bigger." | |
| ) | |
| st.markdown( | |
| "**What to notice:** The **shape** of the distribution stays exactly the same β " | |
| "StandardScaler doesn't change the data's shape, it just shifts and rescales it " | |
| "so the mean lands at 0 and the spread (std) becomes 1. " | |
| "This is why it's a safe transformation β no information is lost." | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### Which models need scaling?") | |
| st.markdown( | |
| """ | |
| | Model | Needs scaling? | Why | | |
| |---|---|---| | |
| | **Logistic Regression** | Yes | Uses gradient descent β large features dominate the gradient | | |
| | **Naive Bayes** | Yes (recommended) | Assumes Gaussian distributions β scaling helps features be comparable and improves probability estimates | | |
| | **SGDClassifier** | Yes | Same as Logistic Regression β SGD is very sensitive to scale | | |
| | **SVM / KNN** | Yes | Distance-based β unscaled features distort distances | | |
| | **Random Forest** | No | Splits on thresholds β scale doesn't matter | | |
| | **XGBoost** | No | Same as Random Forest β tree-based | | |
| In our demo, we apply scaling **only** on the Live Updates page (SGDClassifier). | |
| The Churn Models page uses unscaled data for tree models and Naive Bayes (NB is robust to scale but performs better with it). | |
| """ | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### Should we scale categorical columns too?") | |
| st.markdown( | |
| """ | |
| In our Live Updates page, we scale **all** columns (including label-encoded | |
| categoricals like `gender` and `Contract`) through StandardScaler. Here's why | |
| that's acceptable but not ideal: | |
| | Column Type | Example | Effect of Scaling | Verdict | | |
| |---|---|---|---| | |
| | **Continuous numeric** | tenure (0β72), TotalCharges (0β8,685) | Centered at 0, comparable spread | **Essential** β this is what scaling is for | | |
| | **Binary (0/1)** | gender, Partner | Shifts to ~-0.5 / +0.5 | **Harmless** β doesn't distort meaning | | |
| | **Label-encoded (0/1/2)** | Contract, InternetService | Shifts and rescales integers | **Acceptable for SGD** β the false ordinal relationship from label encoding is already there; scaling doesn't make it worse | | |
| **Best practice in production:** Scale only numeric columns, leave binary/one-hot | |
| columns as-is. In our demo, scaling everything is a pragmatic shortcut that | |
| doesn't affect results meaningfully because SGDClassifier treats all inputs as | |
| numeric regardless. | |
| """ | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### What else could we have done?") | |
| st.markdown( | |
| """ | |
| | Scaler | Formula | Best for | | |
| |---|---|---| | |
| | **StandardScaler** (ours) | (x - mean) / std | Features that are roughly normally distributed | | |
| | **MinMaxScaler** | (x - min) / (max - min), scales to [0, 1] | When you need bounded values (e.g., neural networks) | | |
| | **RobustScaler** | (x - median) / IQR | When you have outliers (uses median instead of mean) | | |
| | **MaxAbsScaler** | x / max(abs(x)) | Sparse data (doesn't shift the center) | | |
| """ | |
| ) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # Tab 4 β Train / Test Split | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| with tab_split: | |
| st.subheader("Step 4: Train / Test Split") | |
| st.markdown("#### Why this matters") | |
| st.markdown( | |
| "We need to know how well our model performs on **data it has never seen**. " | |
| "If we train and evaluate on the same data, the model could simply memorize " | |
| "the answers (overfitting) and we'd have no idea it would fail on new customers." | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### What we did") | |
| X_train, X_test, y_train, y_test, _, _ = get_train_test() | |
| split_col1, split_col2 = st.columns(2) | |
| with split_col1: | |
| st.metric("Training set", f"{len(X_train):,} rows (80%)") | |
| train_churn_rate = y_train.mean() | |
| st.metric("Training churn rate", f"{train_churn_rate:.1%}") | |
| with split_col2: | |
| st.metric("Test set", f"{len(X_test):,} rows (20%)") | |
| test_churn_rate = y_test.mean() | |
| st.metric("Test churn rate", f"{test_churn_rate:.1%}") | |
| st.success( | |
| f"Churn rate is **{train_churn_rate:.1%}** in training and **{test_churn_rate:.1%}** in test β " | |
| "nearly identical. This is because we used **stratified splitting**, which preserves the " | |
| "class distribution in both sets." | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### Stratified vs Random split") | |
| st.markdown( | |
| """ | |
| | Aspect | Random Split | Stratified Split (ours) | | |
| |---|---|---| | |
| | **How** | Randomly assigns 80/20 | Ensures each set has the same churn ratio | | |
| | **Risk** | Training might get 30% churn, test gets 20% β model learns a skewed world | No imbalance risk | | |
| | **When to use** | Balanced datasets (50/50 classes) | Imbalanced datasets like ours (73.5% / 26.5%) | | |
| """ | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### Why 80/20?") | |
| st.markdown( | |
| """ | |
| This is the most common split ratio and a good default: | |
| - **80% training** gives the model enough data to learn patterns | |
| - **20% test** (~1,400 rows) is large enough for reliable metric estimates | |
| Other common ratios: | |
| - **70/30** β more test data, useful when you want higher confidence in metrics | |
| - **90/10** β when data is scarce and you need every row for training | |
| - **60/20/20** (train/validation/test) β when you need a separate validation | |
| set for hyperparameter tuning | |
| """ | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### What about cross-validation?") | |
| st.markdown( | |
| """ | |
| Instead of a single 80/20 split, **k-fold cross-validation** splits the data into k | |
| parts, trains on k-1, tests on the remaining 1, and rotates. This gives k different | |
| accuracy estimates and is more robust. | |
| We didn't use it here because: | |
| - The demo focuses on visual clarity (one train set, one test set is easier to explain) | |
| - With 7,043 rows, a single stratified split is already reliable | |
| - Cross-validation would multiply training time (important for SHAP computation) | |
| """ | |
| ) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # Tab 5 β Imputation Reference Guide | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| with tab_reference: | |
| st.subheader("Missing Data Imputation β Reference Guide") | |
| st.markdown( | |
| "Choosing the right imputation technique depends on the type of data, " | |
| "the amount of missingness, and whether the missingness itself carries information." | |
| ) | |
| st.markdown("---") | |
| ref_data = pd.DataFrame([ | |
| { | |
| "Situation": "Skewed continuous data", | |
| "Best Technique": "Median Imputation", | |
| "How it works": "Replace missing values with the median of the column", | |
| "Why": "Median is robust to outliers and skew β not pulled by extreme values", | |
| "Example": "TotalCharges in our dataset (right-skewed)", | |
| }, | |
| { | |
| "Situation": "Normal (bell-curve) continuous data", | |
| "Best Technique": "Mean Imputation", | |
| "How it works": "Replace missing values with the average", | |
| "Why": "When data is symmetric, mean = median, and mean is the best estimate", | |
| "Example": "Height, temperature, standardized test scores", | |
| }, | |
| { | |
| "Situation": "Categorical data", | |
| "Best Technique": "Mode Imputation", | |
| "How it works": "Replace missing values with the most frequent category", | |
| "Why": "You can't average text β the most common value is the safest guess", | |
| "Example": "Missing gender β fill with 'Male' if that's most common", | |
| }, | |
| { | |
| "Situation": "Time-series / sequential data", | |
| "Best Technique": "Interpolation or ffill/bfill", | |
| "How it works": "Use neighboring time points to estimate the gap (linear, spline, etc.)", | |
| "Why": "Adjacent timestamps are highly correlated β interpolation preserves trends", | |
| "Example": "Missing stock price on a holiday β average of Friday and Monday", | |
| }, | |
| { | |
| "Situation": "Features are correlated", | |
| "Best Technique": "KNN or Regression Imputation", | |
| "How it works": "Use other features to predict the missing value (KNN finds similar rows; regression fits a model)", | |
| "Why": "Leverages relationships between features for a smarter estimate", | |
| "Example": "Missing TotalCharges β predict from tenure Γ MonthlyCharges", | |
| }, | |
| { | |
| "Situation": "Research / statistical rigor", | |
| "Best Technique": "MICE (Multiple Imputation by Chained Equations)", | |
| "How it works": "Iteratively imputes each feature using all others, multiple times, averaging results", | |
| "Why": "Captures uncertainty β produces multiple complete datasets and pools results", | |
| "Example": "Clinical trial data where imputation accuracy is critical", | |
| }, | |
| { | |
| "Situation": "Missingness is informative", | |
| "Best Technique": "Indicator + Imputation", | |
| "How it works": "Add a binary column 'was_missing' (0/1) THEN impute the original", | |
| "Why": "Sometimes the fact that data is missing IS the signal (e.g., customer refused to answer = higher churn risk)", | |
| "Example": "Missing income field in a loan application β might indicate risk", | |
| }, | |
| { | |
| "Situation": "Very little missing data (< 1%)", | |
| "Best Technique": "Drop rows", | |
| "How it works": "Simply remove rows with missing values", | |
| "Why": "Losing a handful of rows has negligible impact and is the simplest approach", | |
| "Example": "Our dataset: 11 out of 7,043 rows (0.16%) β dropping would be fine too", | |
| }, | |
| ]) | |
| st.dataframe( | |
| ref_data, | |
| use_container_width=True, | |
| hide_index=True, | |
| column_config={ | |
| "Situation": st.column_config.TextColumn(width="medium"), | |
| "Best Technique": st.column_config.TextColumn(width="medium"), | |
| "How it works": st.column_config.TextColumn(width="large"), | |
| "Why": st.column_config.TextColumn(width="large"), | |
| "Example": st.column_config.TextColumn(width="large"), | |
| }, | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### Decision flowchart") | |
| st.markdown( | |
| """ | |
| 1. **How much data is missing?** | |
| - Less than 1%? β **Drop rows** (simplest) | |
| - More than 1%? β Continue below | |
| 2. **What type of data?** | |
| - Continuous + skewed β **Median** | |
| - Continuous + normal β **Mean** | |
| - Categorical β **Mode** | |
| - Time-series β **Interpolation / ffill** | |
| 3. **Are features correlated?** | |
| - Yes β Consider **KNN** or **Regression** imputation for better estimates | |
| 4. **Is the missingness itself meaningful?** | |
| - Yes β Add a **missing indicator column** before imputing | |
| 5. **Do you need statistical rigor?** | |
| - Yes β Use **MICE** for multiple imputation | |
| """ | |
| ) | |
| st.markdown("---") | |
| st.markdown("#### What we chose and why") | |
| st.info( | |
| "For our Telco dataset, only `TotalCharges` had 11 missing values (0.16%). " | |
| "We used **Median Imputation** because the column is right-skewed. " | |
| "Dropping rows would also have been perfectly valid given the tiny percentage." | |
| ) | |