DIVYANSHI SINGH

PROJECT INSTRUCTION FILE

Customer Segmentation

Machine Learning Project | Clustering (Unsupervised) | E-Commerce Domain

Dataset: UK Online Retail II, UCI (Kaggle)

Rows: 541K transactions

Difficulty: Easy

Target Metric: Silhouette Score 0.45–0.65

1. Project Overview

Customer segmentation groups customers into distinct clusters based on their purchasing behavior, enabling businesses to tailor marketing strategies, personalize offers, and allocate resources efficiently. Unlike classification, this is an unsupervised problem: there is no predefined label, and the model discovers natural groupings in the data.

Real-World Use Case
E-commerce companies like Amazon, Flipkart, and Myntra use RFM (Recency, Frequency, Monetary) segmentation to identify high-value customers, at-risk customers, and dormant buyers. Each segment receives targeted campaigns: loyalty rewards for champions, re-engagement emails for at-risk customers, and win-back offers for lost customers.

2. Dataset Details

Source

Dataset Statistics

| Property | Value |
|---|---|
| Total Rows | ~541,000 transactions |
| Total Columns | 8 features |
| Target Column | None (unsupervised) |
| Key ID Column | CustomerID |
| Missing Values | ~25% of rows are missing CustomerID (must be dropped) |
| Data Types | Mix of string, numeric, datetime |

Key Features

  • InvoiceNo: transaction ID (invoices starting with 'C' are cancellations; remove them)
  • StockCode: product code
  • Description: product name
  • Quantity: units purchased (negative values are returns; remove them)
  • InvoiceDate: date and time of transaction (used to compute Recency)
  • UnitPrice: price per unit (remove zero or negative prices)
  • CustomerID: unique customer identifier (drop rows where null)
  • Country: country of purchase (optionally filter to UK only for cleaner data)

3. Step-by-Step Workflow

Step 1: Environment Setup

Install the required Python libraries before starting:

pip install pandas numpy scikit-learn matplotlib seaborn joblib streamlit

Step 2: Load & Explore Data (EDA)

  1. Load CSV and parse dates: df = pd.read_csv('online_retail_II.csv', encoding='ISO-8859-1', parse_dates=['InvoiceDate'])
  2. Check shape, dtypes, nulls: df.shape, df.info(), df.isnull().sum()
  3. Check for cancelled invoices (InvoiceNo starting with 'C') and remove them
  4. Check for negative Quantity and zero/negative UnitPrice and remove those rows
  5. Drop rows with missing CustomerID
  6. Plot transaction count by Country to confirm that the UK dominates (~90%)
  7. Plot monthly revenue trend using InvoiceDate
Key EDA Finding
After cleaning, you'll have ~400K transactions from ~4,300 customers. UK customers account for 90%+ of transactions. Cancelled invoices (prefix 'C') represent ~2% of all records and must be removed; otherwise they produce negative Monetary values and distort Frequency and Recency.
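
The cleaning rules above can be sketched as a single filter. This is a minimal illustration, not the project's actual 01_data_cleaning.py: the function name clean_transactions, the TotalPrice column, and the sample rows are made up for the example, and the column names follow this document (some releases of the dataset name them Invoice, Price, and Customer ID, so adjust as needed).

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the Step 2 cleaning rules to a raw transactions frame."""
    out = df.copy()
    # Invoice numbers may be read as integers, so compare them as strings.
    invoice = out["InvoiceNo"].astype(str)
    keep = (
        ~invoice.str.startswith("C")   # drop cancellations
        & (out["Quantity"] > 0)        # drop returns
        & (out["UnitPrice"] > 0)       # drop zero or negative prices
        & out["CustomerID"].notna()    # drop rows without a customer
    )
    out = out.loc[keep].copy()
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    # Hypothetical helper column: revenue per line item.
    out["TotalPrice"] = out["Quantity"] * out["UnitPrice"]
    return out


# Tiny made-up sample: only the first row survives every rule.
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "536381", "536382"],
    "Quantity": [6, -1, 2, 3, -4],
    "InvoiceDate": ["2010-12-01 08:26", "2010-12-01 09:41",
                    "2010-12-01 09:45", "2010-12-02 10:00",
                    "2010-12-02 11:00"],
    "UnitPrice": [2.55, 27.50, 0.0, 1.25, 3.00],
    "CustomerID": [17850.0, 14527.0, 13047.0, None, 12583.0],
})
clean = clean_transactions(raw)
print(clean[["InvoiceNo", "Quantity", "UnitPrice", "TotalPrice"]])
```

Combining the four conditions into one boolean mask keeps the rules readable and makes it easy to count how many rows each rule removes if you evaluate the conditions separately.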

Step 3: RFM Feature Engineering

RFM stands for Recency, Frequency, and Monetary, three widely used behavioral signals for segmentation:

  1. Set a snapshot date one day after the last transaction: snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1) (InvoiceDate must be a datetime column)
  2. Compute per-customer aggregates:
    • Recency = (snapshot_date - max(InvoiceDate)).days; lower is better (more recent)
    • Frequency = count of unique InvoiceNo per CustomerID; higher is better
    • Monetary = sum of (Quantity × UnitPrice) per CustomerID; higher is better
  3. Result: rfm_df with one row per customer and three columns: Recency, Frequency, Monetary
  4. Check for outliers: Monetary can have extremely high values (log-transform recommended)
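
The aggregation above might look like the following sketch. It assumes a cleaned frame with the column names used in this document; build_rfm, LineTotal, and the sample transactions are illustrative names and data, not part of the project.

```python
import pandas as pd


def build_rfm(clean: pd.DataFrame) -> pd.DataFrame:
    """Aggregate cleaned transactions to one RFM row per customer."""
    # Snapshot is one day after the last transaction, so Recency >= 1.
    snapshot_date = clean["InvoiceDate"].max() + pd.Timedelta(days=1)
    line_total = clean["Quantity"] * clean["UnitPrice"]
    return (
        clean.assign(LineTotal=line_total)
        .groupby("CustomerID")
        .agg(
            Recency=("InvoiceDate", lambda d: (snapshot_date - d.max()).days),
            Frequency=("InvoiceNo", "nunique"),  # distinct invoices
            Monetary=("LineTotal", "sum"),       # total spend
        )
    )


# Made-up transactions for two customers.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-08", "2011-11-09"]
    ),
    "Quantity": [2, 1, 5, 10],
    "UnitPrice": [3.0, 4.0, 1.0, 2.5],
})
rfm_df = build_rfm(tx)
print(rfm_df)
```

Frequency counts distinct invoices rather than rows, so an invoice with many line items still counts as one purchase.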

Step 4: Data Preprocessing

  1. Handle outliers: Use IQR method or clip at 99th percentile for Monetary and Frequency
  2. Log-transform skewed features: np.log1p(rfm_df[['Recency', 'Frequency', 'Monetary']])
  3. Scale features with StandardScaler(): clustering algorithms are distance-based, so scaling is mandatory
  4. Confirm all three features have similar scale after transformation
Common Mistake
Never cluster on raw RFM values. A Monetary range of 0–100,000 will completely dominate Recency (0–365 days) in Euclidean distance calculations. Always scale after log-transformation.
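
One way to chain the preprocessing steps is sketched below. The function name preprocess_rfm and the randomly generated rfm_demo table are assumptions for illustration; the 99th-percentile cap is one of the two outlier options listed above, chosen here only because it is the shorter to write.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess_rfm(rfm: pd.DataFrame):
    """Cap outliers, log-transform, then standardize the RFM columns."""
    cols = ["Recency", "Frequency", "Monetary"]
    capped = rfm[cols].astype(float)
    # Clip the heavy right tails at the 99th percentile.
    for col in ["Frequency", "Monetary"]:
        capped[col] = capped[col].clip(upper=capped[col].quantile(0.99))
    logged = np.log1p(capped)            # log(1 + x) handles small values safely
    scaler = StandardScaler()
    scaled = scaler.fit_transform(logged)  # zero mean, unit variance per column
    return scaled, scaler


# Made-up RFM table with a skewed Monetary column.
rng = np.random.default_rng(0)
rfm_demo = pd.DataFrame({
    "Recency": rng.integers(1, 366, size=200),
    "Frequency": rng.integers(1, 50, size=200),
    "Monetary": rng.lognormal(mean=6.0, sigma=1.2, size=200),
})
scaled, scaler = preprocess_rfm(rfm_demo)
print(scaled.mean(axis=0).round(3), scaled.std(axis=0).round(3))
```

Returning the fitted scaler alongside the array matters later: new customers must be transformed with the same scaler before their cluster can be predicted.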

Step 5: Model Building

| Model | When to Use | Expected Silhouette Score |
|---|---|---|
| K-Means | Best for compact, spherical clusters; recommended baseline | 0.45 – 0.60 |
| K-Means++ (init) | Better initialization than random K-Means | 0.48 – 0.62 |
| DBSCAN | Detects noise and irregular shapes; good for outlier detection | Variable |
| Hierarchical (Agglomerative) | Dendrogram shows natural cluster structure | 0.40 – 0.55 |

Recommended order: Start with K-Means → use Elbow + Silhouette to find optimal k → compare with Hierarchical.
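
As a rough sketch of how the candidate models can be compared on the same feature matrix, the example below fits three of them with scikit-learn. The data is synthetic (four well-separated blobs standing in for the scaled RFM matrix), and the DBSCAN eps and min_samples values are arbitrary assumptions, so the scores printed here say nothing about what the real dataset will produce.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix (assumption: 4 clear groups).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0, 0], [6, 6, 0], [0, 6, 6], [6, 0, 6]],
    cluster_std=0.7,
    random_state=42,
)

# K-Means baseline with k-means++ initialization.
km_labels = KMeans(
    n_clusters=4, init="k-means++", n_init=10, random_state=42
).fit_predict(X)

# Hierarchical clustering with Ward linkage for comparison.
agg_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)

# DBSCAN labels noise points as -1 instead of forcing them into a cluster.
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

scores = {
    "kmeans": silhouette_score(X, km_labels),
    "agglomerative": silhouette_score(X, agg_labels),
}
n_noise = int(np.sum(db_labels == -1))
print(scores, "DBSCAN noise points:", n_noise)
```

On real RFM data, DBSCAN usually needs its eps tuned before the result is meaningful, which is why the table lists its silhouette score as variable.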

Step 6: Finding Optimal Number of Clusters

  1. Elbow Method: Plot inertia (within-cluster sum of squares) for k = 2 to 10 and look for the "elbow" where inertia stops dropping sharply
  2. Silhouette Score: Compute silhouette_score for each k; higher is better (max = 1.0)
  3. Typical optimal k for RFM data: k = 4 to 6
  4. Visualize clusters in 2D using PCA (reduce 3D RFM to 2D for plotting)
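
The search over k can be sketched as a simple loop. As before, the blobs below are a synthetic stand-in for the scaled RFM matrix, so the selected k reflects the toy data, not the retail dataset; plotting is left out to keep the example short.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix (assumption: 4 clear groups).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0, 0], [6, 6, 0], [0, 6, 6], [6, 0, 6]],
    cluster_std=0.7,
    random_state=42,
)

inertias, silhouettes = {}, {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_                          # for the elbow curve
    silhouettes[k] = silhouette_score(X, model.labels_)   # for validation

# Pick the k with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)

# Refit with the chosen k and project to 2D for a scatter plot.
final = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)
coords_2d = PCA(n_components=2).fit_transform(X)
print("best k:", best_k, "| 2D shape:", coords_2d.shape)
```

The inertias and silhouettes dictionaries hold exactly the values needed for the elbow and silhouette plots, and coords_2d with final.labels_ as the color gives the PCA scatter.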

Step 7: Evaluate & Interpret Clusters

Clustering evaluation is both quantitative and qualitative:

| Metric | What it Measures | Target Value |
|---|---|---|
| Silhouette Score | How well-separated and dense clusters are | 0.45 – 0.65 |
| Inertia (WCSS) | Within-cluster compactness; lower is better | Minimize |
| Davies-Bouldin Index | Ratio of within-cluster to between-cluster distance; lower is better | < 1.0 |
| Cluster Size Distribution | Ensure no cluster has < 5% or > 60% of customers | Balanced |

After assigning cluster labels, compute mean RFM per cluster and assign business names:

| Cluster Profile | Recency | Frequency | Monetary | Business Label |
|---|---|---|---|---|
| Champions | Low (recent) | High | High | Best customers; reward them |
| Loyal Customers | Low-Medium | High | Medium | Upsell opportunity |
| At-Risk | High (not recent) | Medium | Medium | Send re-engagement campaign |
| Lost/Hibernating | Very High | Low | Low | Win-back offer or deprioritize |
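
A possible way to produce the per-cluster profile and the metrics from the table above is sketched here. The helper summarize_clusters, the six-customer table, and the segment_names mapping are all hypothetical: in practice the mapping is written by hand after reading the profile, since cluster numbers carry no meaning of their own.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler


def summarize_clusters(rfm: pd.DataFrame, scaled: np.ndarray, labels: np.ndarray):
    """Return mean RFM and size share per cluster, plus quality metrics."""
    cols = ["Recency", "Frequency", "Monetary"]
    profile = rfm[cols].groupby(labels).mean()
    # Share of customers in each cluster, ordered by cluster label.
    sizes = pd.Series(labels).value_counts(normalize=True).sort_index()
    profile["Share"] = sizes.to_numpy()
    metrics = {
        "silhouette": silhouette_score(scaled, labels),
        "davies_bouldin": davies_bouldin_score(scaled, labels),
    }
    return profile, metrics


# Made-up customers: three recent big spenders, three lapsed small spenders.
rfm_demo = pd.DataFrame({
    "Recency": [5, 8, 300, 320, 6, 310],
    "Frequency": [20, 25, 1, 2, 22, 1],
    "Monetary": [5000.0, 6000.0, 50.0, 80.0, 5500.0, 60.0],
})
labels_demo = np.array([0, 0, 1, 1, 0, 1])
scaled_demo = StandardScaler().fit_transform(np.log1p(rfm_demo))

profile, metrics = summarize_clusters(rfm_demo, scaled_demo, labels_demo)

# Hand-written mapping chosen after inspecting the profile (assumption).
segment_names = {0: "Champions", 1: "Lost/Hibernating"}
profile["Segment"] = profile.index.map(segment_names)
print(profile)
print(metrics)
```

The profile is computed on the original RFM values rather than the scaled ones so that the means stay interpretable in days, invoices, and currency.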

4. Project Structure

02_customer_segmentation/
├── data/
│   ├── raw/online_retail_II.csv
│   └── processed/rfm_features.csv
├── models/
│   └── kmeans_model.pkl
├── pipeline/
│   ├── 01_data_cleaning.py
│   ├── 02_rfm_engineering.py
│   ├── 03_preprocessing.py
│   ├── 04_clustering.py
│   └── 05_evaluation_visualization.py
├── outputs/
│   ├── elbow_curve.png
│   ├── silhouette_scores.png
│   ├── cluster_pca_plot.png
│   ├── rfm_cluster_heatmap.png
│   └── customer_segments.csv
├── app.py
├── path_utils.py
└── README.md

Pipeline File Descriptions:

| File | Purpose |
|---|---|
| 01_data_cleaning.py | Load raw CSV, remove cancellations, nulls, and negative values, save cleaned data |
| 02_rfm_engineering.py | Compute Recency, Frequency, Monetary per CustomerID, save rfm_features.csv |
| 03_preprocessing.py | Log-transform, scale with StandardScaler, save scaled array |
| 04_clustering.py | Run Elbow + Silhouette analysis, train final K-Means, save kmeans_model.pkl |
| 05_evaluation_visualization.py | Generate all plots, assign segment labels, save customer_segments.csv |
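
Saving and reloading the trained model with joblib could look like the sketch below. Bundling the fitted scaler in the same file is an assumption made here for convenience (the file list above only mentions the model), the training data is random, and a temporary directory stands in for the models/ folder.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random stand-in for log-transformed RFM features (assumption).
rng = np.random.default_rng(0)
X = np.log1p(rng.gamma(2.0, 50.0, size=(100, 3)))

scaler = StandardScaler().fit(X)
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaler.transform(X))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "kmeans_model.pkl"
    # Persist scaler and model together so they cannot drift apart.
    joblib.dump({"scaler": scaler, "model": model}, path)
    bundle = joblib.load(path)

# The reloaded pair assigns segments to (new) customers.
new_labels = bundle["model"].predict(bundle["scaler"].transform(X))
print(np.bincount(new_labels))
```

Whichever layout you choose, new data must pass through the same log-transform and the same fitted scaler before predict is called, or the assigned clusters will be meaningless.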

5. Expected Results Summary

| Metric | K-Means (k=4) | K-Means (k=5–6) |
|---|---|---|
| Silhouette Score | 0.48 – 0.55 | 0.45 – 0.62 |
| Davies-Bouldin Index | 0.70 – 0.90 | 0.65 – 0.85 |
| Inertia Reduction vs k=2 | Significant drop | Marginal after k=5 |
| Cluster Balance | 3–4 large, 1 small outlier group | More evenly distributed |

6. Common Mistakes to Avoid

  • Clustering on raw transaction data without aggregating to CustomerID level first: always build the RFM table
  • Forgetting to remove cancelled invoices (InvoiceNo starts with 'C'): causes negative Monetary
  • Not scaling features before clustering: distance-based algorithms are scale-sensitive
  • Applying SMOTE or a train/test split: this is unsupervised, so there are no labels and no split is needed
  • Choosing k based on the elbow alone: always validate with Silhouette Score
  • Not log-transforming Monetary: extreme outliers (large B2B buyers) will distort clusters
  • Interpreting cluster numbers as ordered (Cluster 0 is not "better" than Cluster 3; label them semantically)

7. Recommended Tools & Libraries

| Library | Purpose |
|---|---|
| pandas | Data loading, cleaning, RFM aggregation |
| numpy | Log-transform, numerical operations |
| scikit-learn | KMeans, DBSCAN, AgglomerativeClustering, StandardScaler, PCA, metrics |
| matplotlib / seaborn | Elbow curve, silhouette plot, heatmap, PCA scatter |
| joblib | Save and load trained KMeans model |

8. Project Deliverables Checklist

  • pipeline/ folder with 5 modular .py files (cleaning → RFM → preprocessing → clustering → evaluation)
  • Trained K-Means model saved as .pkl using joblib
  • Elbow curve plot + Silhouette score plot
  • PCA 2D scatter plot with cluster color labels
  • RFM heatmap per cluster (mean values per segment)
  • customer_segments.csv with CustomerID + Cluster label + Segment name
  • Streamlit app (app.py) allowing user to upload new transaction data and see predicted segments with RFM breakdown, interactive cluster visualizations, and a segment summary table

Customer Segmentation | ML Project Instruction File | Clustering Project #2