**PROJECT INSTRUCTION FILE**

**Customer Segmentation**

Machine Learning Project  |  Clustering (Unsupervised)  |  E-Commerce Domain

|Dataset|Rows|Difficulty|Target Metric|
| :-: | :-: | :-: | :-: |
|**UK Online Retail II (UCI, via Kaggle)**|**541K transactions**|**Easy**|**Silhouette Score 0.45–0.65**|


# **1. Project Overview**
Customer segmentation groups customers into distinct clusters based on their purchasing behavior, enabling businesses to tailor marketing strategies, personalize offers, and allocate resources efficiently. Unlike classification, this is an unsupervised problem — there is no predefined label; the model discovers natural groupings in data.

|**Real-World Use Case**|
| :- |
|E-commerce companies like Amazon, Flipkart, and Myntra use RFM (Recency, Frequency, Monetary) segmentation to identify high-value customers, at-risk customers, and dormant buyers. Each segment receives targeted campaigns — loyalty rewards for champions, re-engagement emails for at-risk, and win-back offers for lost customers.|

# **2. Dataset Details**
**Source**

- Name: UK Online Retail II Dataset (UCI)
- Platform: Kaggle — https://www.kaggle.com/datasets/mashlyn/online-retail-ii-uci
- Format: CSV — single file (Year 2009-2010 sheet or 2010-2011 sheet)
- License: Public / Open Use

**Dataset Statistics**

|**Property**|**Value**|
| :- | :- |
|Total Rows|~541,000 transactions|
|Total Columns|8 features|
|Target Column|None (unsupervised)|
|Key ID Column|CustomerID|
|Missing Values|~25% of rows are missing CustomerID (these rows must be dropped)|
|Data Types|Mix of string, numeric, datetime|

**Key Features**

- InvoiceNo — transaction ID (invoices starting with 'C' are cancellations — remove)
- StockCode — product code
- Description — product name
- Quantity — units purchased (negative = returns, remove)
- InvoiceDate — date and time of transaction (used to compute Recency)
- UnitPrice — price per unit (remove zero or negative prices)
- CustomerID — unique customer identifier (drop rows where null)
- Country — country of purchase (optionally filter to UK only for cleaner data)

# **3. Step-by-Step Workflow**
## **Step 1 — Environment Setup**
Install the required Python libraries before starting:

|pip install pandas numpy scikit-learn matplotlib seaborn joblib streamlit|
| :- |

## **Step 2 — Load & Explore Data (EDA)**
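1. Load CSV: df = pd.read\_csv('online\_retail\_II.csv', encoding='ISO-8859-1'), after import pandas as pd. Some releases of this dataset name the columns Invoice, Price, and Customer ID; if yours does, rename them to InvoiceNo, UnitPrice, and CustomerID so the steps below apply unchanged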
2. Check shape, dtypes, nulls: df.info(), df.isnull().sum()
3. Check for cancelled invoices (InvoiceNo starting with 'C') — remove them
4. Check for negative Quantity and zero/negative UnitPrice — remove them
5. Drop rows with missing CustomerID
6. Plot transaction count by Country — confirm UK dominates (~90%)
7. Plot monthly revenue trend using InvoiceDate

|**Key EDA Finding**|
| :- |
|After cleaning, you'll have ~400K transactions from ~4,300 customers. UK customers account for 90%+ of transactions. Cancelled invoices (prefix 'C') represent ~2% of all records and must be removed: their negative quantities would otherwise understate Monetary and inflate the invoice counts behind Frequency.|
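
The cleaning rules above can be sketched as a single function. This is a minimal sketch, not a definitive implementation: the function name clean\_transactions and the tiny hand-made sample frame are illustrative assumptions, and the column names follow this document (InvoiceNo, UnitPrice, CustomerID), which may differ from the names in your copy of the file.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the Step 2 cleaning rules to a raw transactions frame."""
    # Drop rows without a customer identifier
    df = df.dropna(subset=["CustomerID"])
    # Remove cancellations: invoice numbers that start with 'C'
    is_cancelled = df["InvoiceNo"].astype(str).str.startswith("C")
    df = df[~is_cancelled]
    # Keep only positive quantities and positive prices
    df = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)].copy()
    # Parse timestamps so Recency can be computed later
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    return df.reset_index(drop=True)


# Hand-made sample standing in for the real CSV (illustrative values only)
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "C536379", "536366", "536367", "536368"],
    "Quantity": [6, 8, -1, 6, 32, -2],
    "InvoiceDate": ["2010-12-01 08:26:00"] * 6,
    "UnitPrice": [2.55, 3.39, 27.50, 1.85, 0.0, 4.25],
    "CustomerID": [17850.0, 17850.0, 14527.0, None, 13047.0, 13047.0],
})

clean = clean_transactions(raw)
print(f"{len(raw)} raw rows -> {len(clean)} clean rows")
```

Dropping missing CustomerID values first keeps the later filters from operating on rows that would be discarded anyway; the order of the remaining filters does not change the result.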

## **Step 3 — RFM Feature Engineering**
RFM stands for Recency, Frequency, and Monetary — the three most powerful behavioral signals for segmentation:

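1. Set a snapshot date one day after the last transaction: snapshot\_date = df['InvoiceDate'].max() + pd.Timedelta(days=1). InvoiceDate must already be parsed as datetime, e.g. with pd.to\_datetime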
2. Compute per-customer aggregates:
   - **Recency** = (snapshot\_date - max(InvoiceDate)).days — lower is better (more recent)
   - **Frequency** = count of unique InvoiceNo per CustomerID — higher is better
   - **Monetary** = sum of (Quantity × UnitPrice) per CustomerID — higher is better
3. Result: rfm\_df with one row per customer and three columns: Recency, Frequency, Monetary
4. Check for outliers — Monetary can have extreme high values (log-transform recommended)
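
The aggregation above can be sketched with a pandas groupby. This is one possible layout under stated assumptions: the helper name build\_rfm, the intermediate columns LineTotal and LastPurchase, and the four sample rows are all illustrative, and the input is assumed to be already cleaned.

```python
import pandas as pd


def build_rfm(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate cleaned transactions into one RFM row per customer."""
    df = df.copy()
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    # Revenue of each invoice line
    df["LineTotal"] = df["Quantity"] * df["UnitPrice"]
    # Snapshot date: one day after the most recent transaction
    snapshot_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)

    rfm = df.groupby("CustomerID").agg(
        LastPurchase=("InvoiceDate", "max"),
        Frequency=("InvoiceNo", "nunique"),  # distinct invoices, not lines
        Monetary=("LineTotal", "sum"),
    )
    # Days between the snapshot and the customer's last purchase
    rfm["Recency"] = (snapshot_date - rfm["LastPurchase"]).dt.days
    return rfm[["Recency", "Frequency", "Monetary"]]


# Illustrative sample: customer 1 has two invoices, customer 2 has one
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": ["2011-12-01", "2011-12-01", "2011-12-09", "2011-11-09"],
    "Quantity": [2, 1, 4, 10],
    "UnitPrice": [5.0, 3.0, 2.5, 1.0],
})

rfm_df = build_rfm(tx)
print(rfm_df)
```

Counting unique InvoiceNo values rather than rows matters here: customer 1 has three invoice lines but only two purchases.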

## **Step 4 — Data Preprocessing**
1. Handle outliers: Use IQR method or clip at 99th percentile for Monetary and Frequency
2. Log-transform skewed features: np.log1p(rfm\_df[['Recency', 'Frequency', 'Monetary']])
3. Scale features with StandardScaler(). Clustering algorithms are distance-based, so scaling is mandatory
4. Confirm all three features have similar scale after transformation

|**Common Mistake**|
| :- |
|Never cluster on raw RFM values. A Monetary range of 0–100,000 will completely dominate Recency (0–365 days) in Euclidean distance calculations. Always scale after log-transformation.|
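
The preprocessing order (clip, then log-transform, then scale) might look like the sketch below. The helper name preprocess\_rfm, the 99th-percentile default, and the randomly generated RFM table are assumptions made for illustration; the real input would be the rfm\_df built in Step 3.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

RFM_COLS = ["Recency", "Frequency", "Monetary"]


def preprocess_rfm(rfm: pd.DataFrame, clip_quantile: float = 0.99):
    """Clip extreme values, log-transform, then standardize the RFM columns."""
    features = rfm[RFM_COLS].astype(float)
    # 1. Cap Frequency and Monetary at the chosen upper percentile
    for col in ["Frequency", "Monetary"]:
        features[col] = features[col].clip(upper=features[col].quantile(clip_quantile))
    # 2. log1p compresses the long right tail (requires non-negative inputs)
    logged = np.log1p(features)
    # 3. Standardize to zero mean and unit variance
    scaler = StandardScaler()
    scaled = scaler.fit_transform(logged)
    scaled_df = pd.DataFrame(scaled, columns=RFM_COLS, index=rfm.index)
    return scaled_df, scaler


# Synthetic RFM table standing in for real data (illustrative only)
rng = np.random.default_rng(0)
rfm = pd.DataFrame({
    "Recency": rng.integers(1, 366, size=200),
    "Frequency": rng.poisson(3, size=200) + 1,
    "Monetary": rng.lognormal(mean=6, sigma=1.2, size=200),
})

rfm_scaled, scaler = preprocess_rfm(rfm)
print(rfm_scaled.describe().loc[["mean", "std"]].round(3))
```

Returning the fitted scaler alongside the scaled frame lets the same transformation be reapplied to new customers later.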

## **Step 5 — Model Building**

|**Model**|**When to Use**|**Silhouette Score**|
| :- | :- | :- |
|K-Means|Best for compact, spherical clusters — recommended baseline|0.45 – 0.60|
|K-Means++ (init)|Better initialization than random seeding; this is the default init in scikit-learn's KMeans|0.48 – 0.62|
|DBSCAN|Detects noise and irregular shapes — good for outlier detection|Variable|
|Hierarchical (Agglomerative)|Dendrogram shows natural cluster structure|0.40 – 0.55|

Recommended order: Start with K-Means → use Elbow + Silhouette to find optimal k → compare with Hierarchical.
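
A rough way to compare the models in the table is to fit each one on the same scaled matrix and print its silhouette score. The sketch below uses make\_blobs as a stand-in for the scaled RFM features, and k=4, eps=0.8, and min\_samples=5 are arbitrary assumptions, so the printed scores say nothing about the real dataset.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 3-feature data standing in for scaled RFM (illustrative only)
X, _ = make_blobs(n_samples=600, centers=4, n_features=3,
                  cluster_std=1.0, random_state=42)

models = {
    "K-Means (k-means++ init)": KMeans(n_clusters=4, init="k-means++",
                                       n_init=10, random_state=42),
    "K-Means (random init)": KMeans(n_clusters=4, init="random",
                                    n_init=10, random_state=42),
    "Agglomerative (Ward)": AgglomerativeClustering(n_clusters=4, linkage="ward"),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    scores[name] = silhouette_score(X, labels)
    print(f"{name:28s} silhouette = {scores[name]:.3f}")

# DBSCAN chooses the number of clusters itself and labels noise points as -1
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
n_noise = int((db_labels == -1).sum())
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print(f"DBSCAN: {n_db_clusters} clusters, {n_noise} noise points")
```

DBSCAN is reported by cluster and noise counts rather than a silhouette score, since the score is undefined when fewer than two clusters are found.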

## **Step 6 — Finding Optimal Number of Clusters**
1. **Elbow Method**: Plot inertia (within-cluster sum of squares) for k = 2 to 10 — look for the "elbow" where inertia stops dropping sharply
2. **Silhouette Score**: Compute silhouette\_score for each k. Scores range from -1 to 1, and higher is better
3. Typical optimal k for RFM data: **k = 4 to 6**
4. Visualize clusters in 2D using PCA (reduce 3D RFM to 2D for plotting)
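
The search over k can be sketched as a loop that records inertia and silhouette for each candidate, followed by a PCA projection for plotting. The data here comes from make\_blobs purely as a placeholder for the scaled RFM matrix, and picking best\_k by maximum silhouette is one simple assumption; in practice you would also read the elbow plot.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Placeholder for the scaled RFM matrix (illustrative only)
X, _ = make_blobs(n_samples=500, centers=4, n_features=3, random_state=0)

rows = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    rows.append({
        "k": k,
        "inertia": km.inertia_,                         # elbow curve input
        "silhouette": silhouette_score(X, km.labels_),  # separation quality
    })

results = pd.DataFrame(rows).set_index("k")
best_k = int(results["silhouette"].idxmax())
print(results.round(3))
print("k with the highest silhouette:", best_k)

# Project the 3 features onto 2 principal components for a scatter plot
coords = PCA(n_components=2).fit_transform(X)
```

Plotting results["inertia"] and results["silhouette"] against k gives the elbow curve and silhouette plot listed in the deliverables, and coords colored by cluster label gives the PCA scatter.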

## **Step 7 — Evaluate & Interpret Clusters**
Clustering evaluation is both quantitative and qualitative:

|**Metric**|**What it Measures**|**Target Value**|
| :- | :- | :- |
|Silhouette Score|How well-separated and dense clusters are|0.45 – 0.65|
|Inertia (WCSS)|Within-cluster compactness — lower is better|Minimize|
|Davies-Bouldin Index|Ratio of within-cluster to between-cluster distance — lower is better|< 1.0|
|Cluster Size Distribution|Share of customers in each cluster; no cluster should hold < 5% or > 60%|Balanced|
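
The quantitative checks in the table can be gathered into one report. This is a sketch under assumptions: the helper name evaluate\_clustering and the dictionary keys are illustrative, the 5% and 60% limits come from the table above, and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def evaluate_clustering(X, labels) -> dict:
    """Summarize separation, compactness ratio, and cluster balance."""
    _, counts = np.unique(labels, return_counts=True)
    shares = counts / counts.sum()
    return {
        "silhouette": float(silhouette_score(X, labels)),
        "davies_bouldin": float(davies_bouldin_score(X, labels)),
        "min_share": float(shares.min()),
        "max_share": float(shares.max()),
        # Balance rule from the table: every cluster between 5% and 60%
        "balanced": bool(shares.min() >= 0.05 and shares.max() <= 0.60),
    }


# Placeholder for the scaled RFM matrix (illustrative only)
X, _ = make_blobs(n_samples=400, centers=4, n_features=3, random_state=7)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

report = evaluate_clustering(X, km.labels_)
report["inertia"] = float(km.inertia_)
for key, value in report.items():
    print(f"{key:15s} {value}")
```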

After assigning cluster labels, compute mean RFM per cluster and assign business names:

|**Segment**|**Recency**|**Frequency**|**Monetary**|**Recommended Action**|
| :- | :- | :- | :- | :- |
|Champions|Low (recent)|High|High|Best customers — reward them|
|Loyal Customers|Low-Medium|High|Medium|Upsell opportunity|
|At-Risk|High (not recent)|Medium|Medium|Send re-engagement campaign|
|Lost/Hibernating|Very High|Low|Low|Win-back offer or deprioritize|
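
Profiling and naming might be done as in the sketch below. The eight customers, their cluster ids, and the segment\_names mapping are all made up for illustration; in a real run the mapping is written by hand after reading the profile table, because K-Means cluster ids carry no order or meaning.

```python
import pandas as pd

# Illustrative RFM table with cluster labels already assigned
rfm = pd.DataFrame(
    {
        "Recency": [5, 8, 30, 40, 150, 170, 300, 340],
        "Frequency": [20, 18, 15, 12, 6, 5, 1, 2],
        "Monetary": [5000.0, 4500.0, 1500.0, 1200.0, 800.0, 700.0, 100.0, 150.0],
        "Cluster": [2, 2, 0, 0, 3, 3, 1, 1],
    },
    index=pd.Index(range(1001, 1009), name="CustomerID"),
)

# Mean RFM per cluster plus the number of customers in each
profile = rfm.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()
profile["Customers"] = rfm.groupby("Cluster").size()
print(profile)

# Hypothetical mapping chosen after inspecting the profile above
segment_names = {
    2: "Champions",         # recent, frequent, high spend
    0: "Loyal Customers",   # fairly recent, frequent, medium spend
    3: "At-Risk",           # not recent, medium activity
    1: "Lost/Hibernating",  # long inactive, low activity
}
rfm["Segment"] = rfm["Cluster"].map(segment_names)
print(rfm[["Cluster", "Segment"]])
```

The profile table is also the input for the RFM heatmap, and rfm with its Cluster and Segment columns matches the layout of customer\_segments.csv.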

# **4. Project Structure**

```
02_customer_segmentation/
├── data/
│   ├── raw/online_retail_II.csv
│   └── processed/rfm_features.csv
├── models/
│   └── kmeans_model.pkl
├── pipeline/
│   ├── 01_data_cleaning.py
│   ├── 02_rfm_engineering.py
│   ├── 03_preprocessing.py
│   ├── 04_clustering.py
│   └── 05_evaluation_visualization.py
├── outputs/
│   ├── elbow_curve.png
│   ├── silhouette_scores.png
│   ├── cluster_pca_plot.png
│   ├── rfm_cluster_heatmap.png
│   └── customer_segments.csv
├── app.py
├── path_utils.py
└── README.md
```

**Pipeline File Descriptions:**

| File | Purpose |
| :- | :- |
| 01\_data\_cleaning.py | Load raw CSV, remove cancellations, nulls, negative values, save cleaned data |
| 02\_rfm\_engineering.py | Compute Recency, Frequency, Monetary per CustomerID, save rfm\_features.csv |
| 03\_preprocessing.py | Log-transform, scale with StandardScaler, save scaled array |
| 04\_clustering.py | Run Elbow + Silhouette analysis, train final K-Means, save kmeans\_model.pkl |
| 05\_evaluation\_visualization.py | Generate all plots, assign segment labels, save customer\_segments.csv |

# **5. Expected Results Summary**

|**Metric**|**K-Means (k=4)**|**K-Means (k=5–6)**|
| :- | :- | :- |
|Silhouette Score|0.48 – 0.55|0.45 – 0.62|
|Davies-Bouldin Index|0.70 – 0.90|0.65 – 0.85|
|Inertia Reduction vs k=2|Significant drop|Marginal after k=5|
|Cluster Balance|3–4 large, 1 small outlier group|More evenly distributed|

# **6. Common Mistakes to Avoid**
- Clustering on raw transaction data without aggregating to CustomerID level first — always build RFM table
- Forgetting to remove cancelled invoices (InvoiceNo starts with 'C') — causes negative Monetary
- Not scaling features before clustering — distance-based algorithms are scale-sensitive
- Applying SMOTE or train/test split — this is unsupervised; no labels, no split needed
- Choosing k based on elbow alone — always validate with Silhouette Score
- Not log-transforming Monetary — extreme outliers (large B2B buyers) will distort clusters
- Interpreting cluster numbers as ordered (Cluster 0 is not "better" than Cluster 3 — label them semantically)

# **7. Recommended Tools & Libraries**

|**Library**|**Purpose**|
| :- | :- |
|pandas|Data loading, cleaning, RFM aggregation|
|numpy|Log-transform, numerical operations|
|scikit-learn|KMeans, DBSCAN, AgglomerativeClustering, StandardScaler, PCA, metrics|
|matplotlib / seaborn|Elbow curve, silhouette plot, heatmap, PCA scatter|
|joblib|Save and load trained KMeans model|
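
Saving and reloading the model with joblib could look like this. It is a minimal sketch: the data is synthetic, and a temporary directory stands in for the project's models/ folder so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Placeholder for the scaled RFM matrix (illustrative only)
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=1)
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "kmeans_model.pkl"
    joblib.dump(model, model_path)      # persist the fitted model
    restored = joblib.load(model_path)  # load it back, e.g. inside app.py

# The restored model should assign every point to the same cluster
same_labels = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions identical after reload:", same_labels)
```

If the app scores new customers, the fitted StandardScaler needs to be saved the same way, since the model only makes sense on features scaled with the original parameters.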

# **8. Project Deliverables Checklist**
- pipeline/ folder with 5 modular .py files (cleaning → RFM → preprocessing → clustering → evaluation)
- Trained K-Means model saved as .pkl using joblib
- Elbow curve plot + Silhouette score plot
- PCA 2D scatter plot with cluster color labels
- RFM heatmap per cluster (mean values per segment)
- customer\_segments.csv with CustomerID + Cluster label + Segment name
- Streamlit app (app.py) allowing user to upload new transaction data and see predicted segments with RFM breakdown, interactive cluster visualizations, and a segment summary table

Customer Segmentation  |  ML Project Instruction File  |  Clustering Project #2