
DIVYANSHI SINGH

Initial commit: SegmentX Behavioral Intelligence Portal

72d0706 2 months ago

PROJECT INSTRUCTION FILE

Customer Segmentation

Machine Learning Project | Clustering (Unsupervised) | E-Commerce Domain

Dataset: UK Online Retail II (UCI, via Kaggle)
Rows: 541K transactions
Difficulty: Easy
Target Metric: Silhouette Score 0.45–0.65

1. Project Overview

Customer segmentation groups customers into distinct clusters based on their purchasing behavior, enabling businesses to tailor marketing strategies, personalize offers, and allocate resources efficiently. Unlike classification, this is an unsupervised problem — there is no predefined label; the model discovers natural groupings in data.

Real-World Use Case
E-commerce companies like Amazon, Flipkart, and Myntra use RFM (Recency, Frequency, Monetary) segmentation to identify high-value customers, at-risk customers, and dormant buyers. Each segment receives targeted campaigns — loyalty rewards for champions, re-engagement emails for at-risk, and win-back offers for lost customers.

2. Dataset Details

Source

Name: UK Online Retail II Dataset (UCI)
Platform: Kaggle — https://www.kaggle.com/datasets/mashlyn/online-retail-ii-uci
Format: CSV — single file (Year 2009-2010 sheet or 2010-2011 sheet)
License: Public / Open Use

Dataset Statistics

Property	Value
Total Rows	~541,000 transactions
Total Columns	8 features
Target Column	None (unsupervised)
Key ID Column	CustomerID (may appear as "Customer ID" in some copies of the file; check df.columns)
Missing Values	~25% of rows have no CustomerID (these rows must be dropped)
Data Types	Mix of string, numeric, datetime

Key Features

InvoiceNo — transaction ID (invoices starting with 'C' are cancellations — remove)
StockCode — product code
Description — product name
Quantity — units purchased (negative = returns, remove)
InvoiceDate — date and time of transaction (used to compute Recency)
UnitPrice — price per unit (remove zero or negative prices)
CustomerID — unique customer identifier (drop rows where null)
Country — country of purchase (optionally filter to UK only for cleaner data)

3. Step-by-Step Workflow

Step 1 — Environment Setup

Install the required Python libraries before starting:

pip install pandas numpy scikit-learn matplotlib seaborn joblib streamlit

Step 2 — Load & Explore Data (EDA)

Load CSV: df = pd.read_csv('online_retail_II.csv', encoding='ISO-8859-1')
Check shape, dtypes, nulls: df.info(), df.isnull().sum()
Check for cancelled invoices (InvoiceNo starting with 'C') — remove them
Check for negative Quantity and zero/negative UnitPrice — remove them
Drop rows with missing CustomerID
Plot transaction count by Country — confirm UK dominates (~90%)
Plot monthly revenue trend using InvoiceDate

Key EDA Finding
After cleaning, you'll have ~400K transactions from ~4,300 customers. UK customers account for 90%+ of transactions. Cancelled invoices (prefix 'C') represent ~2% of all records and must be removed: they carry negative quantities, which would otherwise understate Monetary and inflate Frequency.
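The cleaning rules in this step can be sketched as a single function. This is a minimal illustration, not the project's actual 01_data_cleaning.py: the function name `clean_transactions` and the five-row sample are made up, and the column names are assumed to match the Key Features list above (rename them if your copy of the file differs).

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the Step 2 cleaning rules to a raw transactions frame."""
    out = df.dropna(subset=["CustomerID"])  # rows without a customer are unusable
    # Cancelled invoices have an InvoiceNo that starts with 'C'.
    is_cancelled = out["InvoiceNo"].astype(str).str.startswith("C")
    out = out[~is_cancelled]
    # Keep only positive quantities and positive unit prices.
    out = out[(out["Quantity"] > 0) & (out["UnitPrice"] > 0)]
    # Parse timestamps so that date arithmetic works in later steps.
    out = out.assign(InvoiceDate=pd.to_datetime(out["InvoiceDate"]))
    return out.reset_index(drop=True)


# Made-up sample: one valid row, one cancellation, one zero price,
# one negative quantity, and one missing CustomerID.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536366", "536367", "536368"],
    "Quantity": [6, -1, 2, -3, 4],
    "UnitPrice": [2.55, 27.50, 0.0, 1.25, 3.39],
    "InvoiceDate": ["2010-12-01 08:26:00"] * 5,
    "CustomerID": [17850.0, 14527.0, 17850.0, 13047.0, None],
})
cleaned = clean_transactions(sample)
print(len(cleaned))  # only the first row survives
```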

Step 3 — RFM Feature Engineering

RFM stands for Recency, Frequency, and Monetary — the three most powerful behavioral signals for segmentation:

Set a snapshot date: snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1) (convert InvoiceDate with pd.to_datetime first if it was loaded as a string)
Compute per-customer aggregates:
- Recency = (snapshot_date - max(InvoiceDate)).days — lower is better (more recent)
- Frequency = count of unique InvoiceNo per CustomerID — higher is better
- Monetary = sum of (Quantity × UnitPrice) per CustomerID — higher is better
Result: rfm_df with one row per customer and three columns: Recency, Frequency, Monetary
Check for outliers — Monetary can have extreme high values (log-transform recommended)
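The aggregation described in this step could look like the sketch below. It assumes an already cleaned frame with a datetime InvoiceDate; `build_rfm` and the four-row sample are hypothetical names and data used only for illustration.

```python
import pandas as pd


def build_rfm(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate cleaned transactions into one RFM row per customer."""
    df = df.assign(Revenue=df["Quantity"] * df["UnitPrice"])
    # The snapshot is one day after the last transaction, so Recency >= 1.
    snapshot_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)
    return df.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda s: (snapshot_date - s.max()).days),
        Frequency=("InvoiceNo", "nunique"),  # distinct invoices, not line items
        Monetary=("Revenue", "sum"),
    )


# Made-up transactions: customer 1 has two invoices, customer 2 has one.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-01-10", "2011-01-05"]
    ),
    "Quantity": [2, 1, 3, 10],
    "UnitPrice": [5.0, 10.0, 2.0, 1.5],
})
rfm_df = build_rfm(tx)
print(rfm_df)
```

Counting unique invoices rather than rows matters here: a single invoice with many line items should count as one purchase.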

Step 4 — Data Preprocessing

Handle outliers: Use IQR method or clip at 99th percentile for Monetary and Frequency
Log-transform skewed features: np.log1p(rfm_df[['Recency', 'Frequency', 'Monetary']])
Scale features: StandardScaler(). Clustering algorithms are distance-based, so scaling is mandatory
Confirm all three features have similar scale after transformation

Common Mistake
Never cluster on raw RFM values. A Monetary range of 0–100,000 will completely dominate Recency (0–365 days) in Euclidean distance calculations. Always scale after log-transformation.
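One way to chain the preprocessing steps (clip, log-transform, scale) is sketched below. The function name `preprocess_rfm`, the `clip_quantile` parameter, and the randomly generated RFM table are assumptions for illustration; the 99th-percentile clip follows the option mentioned in this step.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess_rfm(rfm: pd.DataFrame, clip_quantile: float = 0.99):
    """Clip extreme values, log-transform, then standardize the RFM columns."""
    features = rfm[["Recency", "Frequency", "Monetary"]].astype(float).copy()
    # Cap the heavy right tail of Frequency and Monetary.
    for col in ["Frequency", "Monetary"]:
        features[col] = features[col].clip(upper=features[col].quantile(clip_quantile))
    logged = np.log1p(features)  # log(1 + x) is safe for zero values
    scaler = StandardScaler()
    scaled = pd.DataFrame(
        scaler.fit_transform(logged), index=rfm.index, columns=logged.columns
    )
    return scaled, scaler


# Synthetic RFM table standing in for the real rfm_df.
rng = np.random.default_rng(0)
rfm_demo = pd.DataFrame({
    "Recency": rng.integers(1, 366, size=200),
    "Frequency": rng.integers(1, 50, size=200),
    "Monetary": rng.lognormal(mean=6.0, sigma=1.5, size=200),
})
scaled_demo, fitted_scaler = preprocess_rfm(rfm_demo)
print(scaled_demo.describe().loc[["mean", "std"]])
```

Returning the fitted scaler alongside the scaled data makes it possible to apply the identical transformation to new customers later.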

Step 5 — Model Building

Model	When to Use	Silhouette Score
K-Means	Best for compact, spherical clusters — recommended baseline	0.45 – 0.60
K-Means++ (init)	Smarter centroid seeding than random initialization; scikit-learn's KMeans uses init='k-means++' by default	0.48 – 0.62
DBSCAN	Detects noise and irregular shapes — good for outlier detection	Variable
Hierarchical (Agglomerative)	Dendrogram shows natural cluster structure	0.40 – 0.55

Recommended order: Start with K-Means → use Elbow + Silhouette to find optimal k → compare with Hierarchical.
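As a rough illustration of comparing the models in the table, the sketch below fits K-Means and Agglomerative clustering on synthetic blobs and reports their silhouette scores, then runs DBSCAN only to count noise points. The data, k=4, and the DBSCAN settings (eps=0.5, min_samples=5) are arbitrary assumptions, not tuned values for this dataset.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 3-feature data standing in for the scaled RFM matrix.
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=42)
X = StandardScaler().fit_transform(X)

models = {
    "kmeans": KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42),
    "agglomerative": AgglomerativeClustering(n_clusters=4, linkage="ward"),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    scores[name] = silhouette_score(X, labels)

# DBSCAN marks points it cannot assign to a dense region with the label -1.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_noise = int((db_labels == -1).sum())

print(scores)
print("DBSCAN noise points:", n_noise)
```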

Step 6 — Finding Optimal Number of Clusters

Elbow Method: Plot inertia (within-cluster sum of squares) for k = 2 to 10 — look for the "elbow" where inertia stops dropping sharply
Silhouette Score: Compute silhouette_score for each k; values range from -1 to 1, and higher is better
Typical optimal k for RFM data: k = 4 to 6
Visualize clusters in 2D using PCA (reduce 3D RFM to 2D for plotting)
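The scan over k and the PCA projection could be sketched as follows. `scan_k` is a hypothetical helper, the input is synthetic, and plotting is left out so the example stays short; the resulting table holds the values you would draw as the elbow and silhouette curves.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def scan_k(X, k_values=range(2, 11), random_state=42) -> pd.DataFrame:
    """Fit K-Means for each k and record inertia and silhouette score."""
    rows = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        rows.append({
            "k": k,
            "inertia": km.inertia_,
            "silhouette": silhouette_score(X, km.labels_),
        })
    return pd.DataFrame(rows).set_index("k")


# Synthetic stand-in for the scaled RFM matrix.
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=42)
X = StandardScaler().fit_transform(X)

results = scan_k(X)
best_k = int(results["silhouette"].idxmax())

# Project the three features onto two principal components for a scatter plot.
coords = PCA(n_components=2).fit_transform(X)

print(results)
print("k with the highest silhouette:", best_k)
```

The k with the highest silhouette is only a starting point; compare it with the elbow in the inertia column and with how interpretable the resulting segments are.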

Step 7 — Evaluate & Interpret Clusters

Clustering evaluation is both quantitative and qualitative:

Metric	What it Measures	Target Value
Silhouette Score	How well-separated and dense clusters are	0.45 – 0.65
Inertia (WCSS)	Within-cluster compactness — lower is better	Minimize
Davies-Bouldin Index	Ratio of within-cluster to between-cluster distance — lower is better	< 1.0
Cluster Size Distribution	Ensure no cluster has < 5% or > 60% of customers	Balanced
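The metrics in the table can be gathered in one place, as in this sketch. `evaluate_clustering` is a hypothetical helper and the data is synthetic; the 5% and 60% thresholds come from the table above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def evaluate_clustering(X, labels) -> dict:
    """Collect the quantitative checks listed in the evaluation table."""
    shares = np.bincount(labels) / len(labels)  # fraction of points per cluster
    return {
        "silhouette": silhouette_score(X, labels),
        "davies_bouldin": davies_bouldin_score(X, labels),
        "cluster_shares": shares,
        # Balanced means no cluster below 5% or above 60% of customers.
        "balanced": bool(shares.min() >= 0.05 and shares.max() <= 0.60),
    }


# Synthetic stand-in for the scaled RFM matrix.
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

report = evaluate_clustering(X, km.labels_)
print(report)
print("inertia:", km.inertia_)
```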

After assigning cluster labels, compute mean RFM per cluster and assign business names:

Segment	Recency	Frequency	Monetary	Recommended Action
Champions	Low (recent)	High	High	Best customers — reward them
Loyal Customers	Low-Medium	High	Medium	Upsell opportunity
At-Risk	High (not recent)	Medium	Medium	Send re-engagement campaign
Lost/Hibernating	Very High	Low	Low	Win-back offer or deprioritize
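A possible way to build the per-cluster profile and attach names is sketched below. The naming rule (rank clusters by mean Recency, most recent first) is a simplifying assumption that only works when there are exactly four clusters; in practice, inspect all three means before naming. `profile_clusters`, `name_segments`, and the eight-customer sample are made up.

```python
import pandas as pd

SEGMENT_NAMES = ("Champions", "Loyal Customers", "At-Risk", "Lost/Hibernating")


def profile_clusters(rfm: pd.DataFrame, labels) -> pd.DataFrame:
    """Mean RFM values and customer count per cluster."""
    labelled = rfm.assign(Cluster=labels)
    profile = labelled.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()
    profile["Customers"] = labelled.groupby("Cluster").size()
    return profile


def name_segments(profile: pd.DataFrame, names=SEGMENT_NAMES) -> dict:
    """Heuristic (an assumption): the lower the mean Recency, the better the segment."""
    ordered = profile.sort_values("Recency").index
    if len(ordered) != len(names):
        raise ValueError("this heuristic needs one name per cluster")
    return {int(cluster): name for cluster, name in zip(ordered, names)}


# Made-up RFM values for eight customers with pre-assigned cluster labels.
rfm_demo = pd.DataFrame({
    "Recency": [200, 220, 5, 7, 400, 380, 40, 50],
    "Frequency": [3, 4, 20, 22, 1, 1, 12, 10],
    "Monetary": [300, 350, 5000, 5200, 50, 60, 1500, 1400],
})
labels_demo = [0, 0, 1, 1, 2, 2, 3, 3]

profile = profile_clusters(rfm_demo, labels_demo)
segment_names = name_segments(profile)
print(profile)
print(segment_names)
```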

4. Project Structure

02_customer_segmentation/
├── data/
│   ├── raw/online_retail_II.csv
│   └── processed/rfm_features.csv
├── models/
│   └── kmeans_model.pkl
├── pipeline/
│   ├── 01_data_cleaning.py
│   ├── 02_rfm_engineering.py
│   ├── 03_preprocessing.py
│   ├── 04_clustering.py
│   └── 05_evaluation_visualization.py
├── outputs/
│   ├── elbow_curve.png
│   ├── silhouette_scores.png
│   ├── cluster_pca_plot.png
│   ├── rfm_cluster_heatmap.png
│   └── customer_segments.csv
├── app.py
├── path_utils.py
└── README.md

Pipeline File Descriptions:

File	Purpose
01_data_cleaning.py	Load raw CSV, remove cancellations, nulls, negative values, save cleaned data
02_rfm_engineering.py	Compute Recency, Frequency, Monetary per CustomerID, save rfm_features.csv
03_preprocessing.py	Log-transform, scale with StandardScaler, save scaled array
04_clustering.py	Run Elbow + Silhouette analysis, train final K-Means, save kmeans_model.pkl
05_evaluation_visualization.py	Generate all plots, assign segment labels, save customer_segments.csv
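Saving and reloading the trained model with joblib might look like this sketch. It writes to a temporary directory instead of models/ so it can run anywhere, and it fits on random data purely for illustration. In a real pipeline you would probably persist the fitted StandardScaler as well, since new data must be scaled the same way; that is a suggestion, not something this file specifies.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans

# Random data standing in for the scaled RFM matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "kmeans_model.pkl"
    joblib.dump(model, model_path)     # serialize the fitted estimator
    restored = joblib.load(model_path)  # load it back from disk

# The restored model should reproduce the original cluster assignments.
same_predictions = bool(np.array_equal(model.predict(X), restored.predict(X)))
print(same_predictions)
```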

5. Expected Results Summary

Metric	K-Means (k=4)	K-Means (k=5–6)
Silhouette Score	0.48 – 0.55	0.45 – 0.62
Davies-Bouldin Index	0.70 – 0.90	0.65 – 0.85
Inertia Reduction vs k=2	Significant drop	Marginal after k=5
Cluster Balance	3–4 large, 1 small outlier group	More evenly distributed

6. Common Mistakes to Avoid

Clustering on raw transaction data without aggregating to CustomerID level first — always build RFM table
Forgetting to remove cancelled invoices (InvoiceNo starts with 'C') — causes negative Monetary
Not scaling features before clustering — distance-based algorithms are scale-sensitive
Applying SMOTE or train/test split — this is unsupervised; no labels, no split needed
Choosing k based on elbow alone — always validate with Silhouette Score
Not log-transforming Monetary — extreme outliers (large B2B buyers) will distort clusters
Interpreting cluster numbers as ordered (Cluster 0 is not "better" than Cluster 3 — label them semantically)

7. Recommended Tools & Libraries

Library	Purpose
pandas	Data loading, cleaning, RFM aggregation
numpy	Log-transform, numerical operations
scikit-learn	KMeans, DBSCAN, AgglomerativeClustering, StandardScaler, PCA, metrics
matplotlib / seaborn	Elbow curve, silhouette plot, heatmap, PCA scatter
joblib	Save and load trained KMeans model

8. Project Deliverables Checklist

pipeline/ folder with 5 modular .py files (cleaning → RFM → preprocessing → clustering → evaluation)
Trained K-Means model saved as .pkl using joblib
Elbow curve plot + Silhouette score plot
PCA 2D scatter plot with cluster color labels
RFM heatmap per cluster (mean values per segment)
customer_segments.csv with CustomerID + Cluster label + Segment name
Streamlit app (app.py) allowing user to upload new transaction data and see predicted segments with RFM breakdown, interactive cluster visualizations, and a segment summary table
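The scoring logic behind such an app could be sketched as below, with the Streamlit UI itself left out. Everything here is illustrative: `assign_segments` is a hypothetical helper, the training data is random, and the placeholder names "Segment 0" to "Segment 3" stand in for the business labels you would assign after profiling.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["Recency", "Frequency", "Monetary"]


def assign_segments(rfm_new, scaler, model, segment_names) -> pd.DataFrame:
    """Apply the training-time transform to new RFM rows and predict segments."""
    X_new = scaler.transform(np.log1p(rfm_new[FEATURES].astype(float)))
    out = rfm_new.copy()
    out["Cluster"] = model.predict(X_new)
    out["Segment"] = out["Cluster"].map(segment_names)
    return out


# Random training data standing in for the real RFM table.
rng = np.random.default_rng(7)
train = pd.DataFrame({
    "Recency": rng.integers(1, 366, size=200),
    "Frequency": rng.integers(1, 40, size=200),
    "Monetary": rng.lognormal(mean=6.0, sigma=1.2, size=200),
})
logged_train = np.log1p(train[FEATURES].astype(float))
scaler = StandardScaler().fit(logged_train)
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(
    scaler.transform(logged_train)
)
segment_names = {c: f"Segment {c}" for c in range(4)}  # placeholder names

new_customers = pd.DataFrame(
    {"Recency": [3, 300], "Frequency": [25, 1], "Monetary": [4000.0, 35.0]},
    index=pd.Index([90001, 90002], name="CustomerID"),
)
scored = assign_segments(new_customers, scaler, model, segment_names)
print(scored)
```

Uploaded transaction data would first need the same cleaning and RFM aggregation as the training data before it reaches a function like this.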

Customer Segmentation | ML Project Instruction File | Clustering Project #2