import streamlit as st
from utils.data_loader import load_raw_data
from utils.rec_data_loader import load_clean_transactions, build_interaction_matrix
st.set_page_config(
    page_title="Data Science Demo",
    page_icon="π",
    layout="wide",
    initial_sidebar_state="expanded",
)
st.title("Data Science Project Demo")
st.markdown(
    "Demonstration of two core data science objectives: **predicting customer churn** "
    "and **building a product recommendation system**, from raw data to deployed models."
)
st.markdown("---")
# ───────────────────────────────────────────────────────────────────────────────
# KPI Dashboard: Both Objectives
# ───────────────────────────────────────────────────────────────────────────────
churn_col, rec_col = st.columns(2)
with churn_col:
    st.markdown("### Customer Churn Prediction")
    df_churn = load_raw_data()
    total_customers = len(df_churn)
    churn_rate = df_churn["Churn"].mean()
    churned_count = int(df_churn["Churn"].sum())
    # Annualized monthly charges of customers who churned.
    revenue_at_risk = df_churn.loc[df_churn["Churn"] == 1, "MonthlyCharges"].sum() * 12
    c1, c2 = st.columns(2)
    c1.metric("Total Customers", f"{total_customers:,}")
    c2.metric("Churn Rate", f"{churn_rate:.1%}", delta=f"-{churned_count:,} lost", delta_color="inverse")
    c3, c4 = st.columns(2)
    c3.metric("Avg Monthly Charge", f"${df_churn['MonthlyCharges'].mean():,.2f}")
    c4.metric("Annual Revenue at Risk", f"${revenue_at_risk:,.0f}")
    st.caption("Dataset: Telco Customer Churn (7,043 customers, 21 features)")
with rec_col:
    st.markdown("### Product Recommendation")
    df_rec = load_clean_transactions()
    interactions, *_ = build_interaction_matrix()
    n_users, n_items = interactions.shape
    n_transactions = len(df_rec)
    # Sparsity: share of user-item cells with no recorded interaction.
    sparsity = 1 - ((interactions > 0).sum().sum() / (n_users * n_items))
    total_revenue = df_rec["TotalPrice"].sum()
    r1, r2 = st.columns(2)
    r1.metric("Unique Customers", f"{n_users:,}")
    r2.metric("Unique Products", f"{n_items:,}")
    r3, r4 = st.columns(2)
    r3.metric("Transactions", f"{n_transactions:,}")
    r4.metric("Matrix Sparsity", f"{sparsity:.1%}")
    st.caption("Dataset: UCI Online Retail (541K transactions, UK e-commerce 2010–2011)")
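# Illustrative sketch, not part of the original app: the sparsity metric above
# is the fraction of user-item cells with no interaction. The helper below
# shows the same arithmetic on a plain list-of-rows matrix. The name
# `sparsity_of` and the list-of-rows input are assumptions made for this
# example only; the app itself works on whatever build_interaction_matrix()
# returns.
def sparsity_of(rows):
    """Return the fraction of cells that are zero (or negative) in `rows`."""
    n_cells = sum(len(row) for row in rows)
    if n_cells == 0:
        return 0.0  # an empty matrix has no cells to be empty
    n_filled = sum(1 for row in rows for value in row if value > 0)
    return 1 - n_filled / n_cells
# Example: two users, three products, two recorded purchases in total, so
# sparsity_of([[1, 0, 0], [0, 0, 2]]) leaves four of six cells empty.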
st.markdown("---")
# ───────────────────────────────────────────────────────────────────────────────
# Two Objectives Overview
# ───────────────────────────────────────────────────────────────────────────────
st.subheader("Two Objectives: Full Pipeline for Each")
st.markdown(
    """
| Objective | Problem Type | Key Algorithms | Pages |
|---|---|---|---|
| **Customer Churn** | Binary Classification | Logistic Regression, Naive Bayes, Random Forest, XGBoost, SGDClassifier | EDA → Preprocessing → Churn Models → Live Updates |
| **Product Recommendation** | Matrix Factorization | SVD, ALS, SGD (Funk SVD), NMF, Item-Based CF | Rec EDA → Rec Preprocessing → Rec Models → Rec Live |
"""
)
st.markdown("---")
# ───────────────────────────────────────────────────────────────────────────────
# The ML Journey: step-by-step pipeline
# ───────────────────────────────────────────────────────────────────────────────
st.subheader("The Machine Learning Journey: What We're Doing and Why")
st.markdown(
    "Both objectives follow the same well-defined pipeline. "
    "Each step has a purpose and builds on the previous one:"
)
st.graphviz_chart("""
digraph pipeline {
rankdir=LR
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=11, margin="0.3,0.15"]
edge [color="#888888", penwidth=1.5]
data [label="1. Data\\nCollection", fillcolor="#dbeafe", color="#3b82f6"]
eda [label="2. Exploratory\\nData Analysis", fillcolor="#e0e7ff", color="#6366f1"]
preproc [label="3. Data\\nPreprocessing", fillcolor="#fce7f3", color="#ec4899"]
train [label="4. Model\\nTraining", fillcolor="#d1fae5", color="#10b981"]
eval [label="5. Evaluation\\n& Comparison", fillcolor="#fef3c7", color="#f59e0b"]
deploy [label="6. Live Updates\\n& Deployment", fillcolor="#fee2e2", color="#ef4444"]
data -> eda -> preproc -> train -> eval -> deploy
}
""")
# One entry per pipeline step: what it means for each objective, and why it matters.
steps = [
    {
        "num": "1",
        "title": "Data Collection",
        "churn": "**Telco Customer Churn**: 7,043 customers with demographics, services, billing, and churn labels.",
        "rec": "**UCI Online Retail**: 541K purchase transactions from a UK e-commerce store.",
        "why": "You can't build a model without data. The quality and relevance of the data determine everything that follows.",
    },
    {
        "num": "2",
        "title": "Exploratory Data Analysis",
        "churn": "Visualize churn distribution, feature histograms, category-based churn rates, correlation heatmaps, and UMAP customer segments.",
        "rec": "Analyze purchase distributions, product popularity (long tail), user-item matrix sparsity, revenue trends, and geography.",
        "why": "Before building models, we need to understand the problem. EDA reveals patterns, catches data issues, and guides modeling choices.",
    },
    {
        "num": "3",
        "title": "Data Preprocessing",
        "churn": "Handle missing values (median imputation), encode categories (Label Encoding for trees, One-Hot for Logistic Regression), scale features (StandardScaler).",
        "rec": "Clean transactions (remove cancellations, missing IDs), aggregate into a user-item matrix, handle implicit feedback.",
        "why": "ML algorithms need clean, structured input. Raw data has missing values, text, and different scales, all of which confuse models.",
    },
    {
        "num": "4",
        "title": "Model Training",
        "churn": "Train **Logistic Regression**, **Naive Bayes**, **Random Forest**, and **XGBoost**, each with the encoding best suited to it.",
        "rec": "Train **SVD**, **ALS**, **SGD (Funk SVD)**, **NMF**, and **Item-Based CF**: five approaches to matrix factorization and collaborative filtering.",
        "why": "Different algorithms have different strengths. Training multiple models lets us find the best approach for each problem.",
    },
    {
        "num": "5",
        "title": "Evaluation & Comparison",
        "churn": "Compare using Accuracy, Precision, Recall, F1, AUC. Explain predictions with **SHAP**.",
        "rec": "Compare using Precision@K, Recall@K, Hit Rate. Inspect individual recommendations for quality.",
        "why": "A model is only useful if it's accurate and we can measure how accurate. Different metrics capture different aspects of quality.",
    },
    {
        "num": "6",
        "title": "Live Updates & Deployment",
        "churn": "Incremental learning with **SGDClassifier.partial_fit()**: update the model as new customer data streams in.",
        "rec": "Monthly batch retraining of **ALS**: model quality improves as purchase history accumulates over time.",
        "why": "In the real world, data doesn't stop arriving. Deployed models must adapt to new patterns without expensive full retraining.",
    },
]
# Render each step as a collapsed expander with the two objectives side by side.
for step in steps:
    with st.expander(f"Step {step['num']}: {step['title']}", expanded=False):
        st.markdown(f"**Why:** {step['why']}")
        col_churn, col_rec = st.columns(2)
        with col_churn:
            st.markdown(f"**Churn Prediction:**  \n{step['churn']}")
        with col_rec:
            st.markdown(f"**Recommendation:**  \n{step['rec']}")
st.markdown("---")
# ───────────────────────────────────────────────────────────────────────────────
# How a model actually "learns": quick primer
# ───────────────────────────────────────────────────────────────────────────────
st.subheader("How Does a Machine Learning Model Actually \"Learn\"?")
st.markdown(
    """
Most ML models are trained with the same basic loop:
"""
)
st.graphviz_chart("""
digraph learning {
rankdir=LR
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=11, margin="0.3,0.15"]
edge [color="#888888", penwidth=1.5]
input [label="Input Data\\n(features)", fillcolor="#dbeafe", color="#3b82f6"]
guess [label="Model Makes\\na Prediction", fillcolor="#d1fae5", color="#10b981"]
check [label="Compare to\\nActual Answer", fillcolor="#fef3c7", color="#f59e0b"]
adjust [label="Adjust Model\\n(reduce error)", fillcolor="#fee2e2", color="#ef4444"]
input -> guess -> check -> adjust
adjust -> guess [style=dashed, label="repeat", fontsize=9]
}
""")
st.markdown(
    """
1. **Input**: The model receives data (customer features for churn, purchase history for recommendations)
2. **Predict**: It makes a guess (will they churn? what would they buy?)
3. **Compare**: We check the guess against reality
4. **Adjust**: The model tweaks its internal parameters to reduce the error
5. **Repeat**: This cycle runs until the error stops improving

Different algorithms differ in *how* they make predictions and *how* they adjust.
The algorithm explanation pages cover each one with diagrams.
"""
)
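# Illustrative sketch, not part of the original app: the predict / compare /
# adjust / repeat loop described above, written out for the simplest possible
# case. It assumes a one-parameter model y = w * x trained by stochastic
# gradient descent on half the squared error. `fit_slope`, its defaults, and
# the toy data in the example are hypothetical; none of the app's models are
# trained with this function.
def fit_slope(xs, ys, learning_rate=0.01, n_epochs=500):
    """Estimate w in y = w * x by repeatedly correcting prediction errors."""
    w = 0.0  # initial guess for the single model parameter
    for _ in range(n_epochs):  # repeat
        for x, y in zip(xs, ys):  # input
            prediction = w * x  # predict
            error = prediction - y  # compare with the actual answer
            # adjust: d/dw of 0.5 * error**2 is error * x, so step against it
            w -= learning_rate * error * x
    return w
# Example: fit_slope([1, 2, 3, 4], [2, 4, 6, 8]) moves w from 0.0 toward the
# slope of 2 that generated the data.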
st.markdown("---")
# ───────────────────────────────────────────────────────────────────────────────
# Business Impact
# ───────────────────────────────────────────────────────────────────────────────
st.subheader("Business Impact")
impact_churn, impact_rec = st.columns(2)
with impact_churn:
    st.markdown("#### Churn Reduction")
    reduction_pct = st.slider(
        "If we reduce churn by this %",
        min_value=1, max_value=50, value=10, step=1, format="%d%%",
    )
    # Savings scale linearly with the share of at-risk revenue retained.
    savings = revenue_at_risk * (reduction_pct / 100)
    st.success(f"**{reduction_pct}% churn reduction** → **${savings:,.0f}/year** saved")
with impact_rec:
    st.markdown("#### Recommendation Revenue")
    avg_order = df_rec["TotalPrice"].mean()
    conversion_pct = st.slider(
        "If recommendations lift conversion by this %",
        min_value=1, max_value=30, value=5, step=1, format="%d%%",
    )
    # Treats a conversion lift as a proportional lift in total revenue.
    additional = total_revenue * (conversion_pct / 100)
    st.success(f"**{conversion_pct}% conversion lift** → **£{additional:,.0f}/year** additional revenue")
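# Illustrative sketch, not part of the original app: the churn savings figure
# above is plain proportional arithmetic. The hypothetical helper below
# reproduces it from a list of monthly charges for churned customers, assuming
# (as the dashboard does) that each churned customer would otherwise have paid
# the same monthly charge for 12 months.
def annual_savings(churned_monthly_charges, reduction_pct):
    """Return the yearly revenue retained if churn falls by `reduction_pct` percent."""
    revenue_at_risk = sum(churned_monthly_charges) * 12
    return revenue_at_risk * (reduction_pct / 100)
# Example: two churned customers paying 50 and 70 per month put 1,440 per year
# at risk, so annual_savings([50.0, 70.0], 10) retains a tenth of that.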
st.markdown("---")
# ───────────────────────────────────────────────────────────────────────────────
# Page Directory
# ───────────────────────────────────────────────────────────────────────────────
st.subheader("Page Directory")
st.markdown(
    """
**Customer Churn Prediction:**

| Page | What You'll See |
|---|---|
| **EDA** | Interactive charts exploring customer demographics, services, and churn patterns |
| **Preprocessing** | Step-by-step data cleaning, encoding (Label + One-Hot), scaling with before/after visuals |
| **Churn Models** | Algorithm explainers with diagrams, model comparison, SHAP explanations, What-If predictor |
| **Live Updates** | All 4 churn models compete: SGD (partial_fit) vs LR, RF, XGBoost (full retrain) |

**Product Recommendation:**

| Page | What You'll See |
|---|---|
| **Rec EDA** | Purchase distributions, product popularity (long tail), sparsity analysis, user-item matrix |
| **Rec Preprocessing** | Transaction-to-matrix pipeline, implicit vs explicit feedback, train/test split |
| **Rec Models** | Algorithm explainers for SVD/ALS/SGD/NMF/Item-CF, model comparison, interactive recommendations |
| **Rec Live** | Monthly data streaming demo: all 5 models compete as purchase history grows |
"""
)
st.markdown("---")
st.caption(
    "Data: Telco Customer Churn Dataset + UCI Online Retail Dataset • "
    "Built with Streamlit, scikit-learn, XGBoost, SHAP, implicit, scipy"
)