Olist E-Commerce Delivery Prediction
Project Video Presentation
[Watch the presentation on YouTube](https://youtu.be/Gdw33Av6rCk)
Project Overview
This project focuses on predicting delivery times for e-commerce orders in Brazil, using the famous Olist Public Dataset. The goal is to improve customer experience by providing accurate delivery estimates.
The project is divided into two main tasks:
- Regression: Predicting the exact number of days for delivery.
- Classification: Categorizing delivery speed into Fast, Medium, or Slow.
The Dataset
- Source: Brazilian E-Commerce Public Dataset by Olist.
- Size: ~100,000 orders after cleaning.
- Key Features:
  - `customer_state`: The destination state (critical for logistics).
  - `product_weight_g`: Weight of the package.
  - `freight_value`: Shipping cost.
  - `price`: Product value.
Feature Engineering & Clustering
To improve the model's performance beyond the baseline, I applied several techniques:
- Data Cleaning: Removed outliers (negative days or deliveries > 60 days).
- Geolocation Encoding: Converted `customer_state` into one-hot encoded features to capture geographical patterns (e.g., separating Sao Paulo from Amazonas).
- K-Means Clustering: Applied unsupervised learning to group orders into 4 distinct clusters based on weight, price, and shipping cost. These cluster labels were added as a new feature to the model.
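
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the target column name `delivery_days`, the helper `engineer_features`, the standardization step before K-Means, and the synthetic demo data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def engineer_features(df: pd.DataFrame, n_clusters: int = 4, seed: int = 42) -> pd.DataFrame:
    """Hypothetical helper: clean, one-hot encode, and add a cluster label."""
    # 1. Data cleaning: drop negative durations and deliveries over 60 days.
    #    "delivery_days" is an assumed column name for the target.
    df = df[(df["delivery_days"] >= 0) & (df["delivery_days"] <= 60)].copy()

    # 2. Geolocation encoding: one indicator column per destination state.
    df = pd.get_dummies(df, columns=["customer_state"], prefix="state")

    # 3. K-Means on weight, price, and shipping cost.
    #    Scaling first is an assumption; the README does not say whether it was done.
    cluster_cols = ["product_weight_g", "price", "freight_value"]
    scaled = StandardScaler().fit_transform(df[cluster_cols])
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    df["cluster"] = kmeans.fit_predict(scaled)
    return df


# Synthetic stand-in for the Olist data, only to make the sketch runnable.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "customer_state": rng.choice(["SP", "RJ", "AM"], size=n),
    "product_weight_g": rng.uniform(100, 10_000, size=n),
    "price": rng.uniform(10, 500, size=n),
    "freight_value": rng.uniform(5, 80, size=n),
    "delivery_days": rng.integers(-2, 70, size=n),
})
features = engineer_features(demo)
```

Scaling matters here because `product_weight_g` is measured in grams and would otherwise dominate the Euclidean distances that K-Means relies on.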
Model Performance
Part 1: Regression (Predicting Days)
I compared a Baseline Linear Regression against advanced models trained on the engineered dataset.
| Model | R2 Score | Improvement |
|---|---|---|
| Baseline (Linear Regression) | 0.057 | - |
| Improved Linear Regression | 0.216 | +278% |
| XGBoost / Random Forest (Winner) | 0.267 | Best Performance |
Insight: The jump in R2 from 0.057 to 0.216 for the same linear model suggests that location data and clustering carry much of the predictive signal.
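
A comparison of this kind can be reproduced in outline as shown below. This is a sketch on synthetic data, not the project's evaluation code: the helper `holdout_r2`, the 80/20 split, and the simulated per-state delays are assumptions chosen only to show how adding one-hot state features can raise the R2 of a linear model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def holdout_r2(X: pd.DataFrame, y: pd.Series, seed: int = 42) -> float:
    """Hypothetical helper: fit a linear model and score it on a held-out 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# Synthetic data in which the destination state drives most of the delay.
rng = np.random.default_rng(0)
n = 2000
state = pd.Series(rng.choice(["SP", "RJ", "AM"], size=n))
numeric = pd.DataFrame({
    "product_weight_g": rng.uniform(100, 10_000, size=n),
    "price": rng.uniform(10, 500, size=n),
    "freight_value": rng.uniform(5, 80, size=n),
})
state_delay = state.map({"SP": 0.0, "RJ": 3.0, "AM": 15.0})  # made-up values
y = 5 + 0.0005 * numeric["product_weight_g"] + state_delay + rng.normal(0, 2, size=n)

# Baseline uses numeric features only; the second model adds state indicators.
with_state = pd.concat(
    [numeric, pd.get_dummies(state, prefix="state", dtype=float)], axis=1
)
r2_baseline = holdout_r2(numeric, y)
r2_with_state = holdout_r2(with_state, y)
print(f"baseline R2: {r2_baseline:.3f}, with state features: {r2_with_state:.3f}")
```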
Part 2: Classification (Fast / Medium / Slow)
I converted the target variable into 3 balanced classes using Quantile Binning:
- Fast: Bottom 33%
- Medium: Middle 33%
- Slow: Top 33% (Critical to identify)
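
Quantile binning into three classes can be done with `pandas.qcut`. The snippet below is a minimal sketch; the helper name `bin_delivery_speed` and the synthetic delivery times are assumptions, and the notebook may implement the split differently.

```python
import numpy as np
import pandas as pd


def bin_delivery_speed(delivery_days: pd.Series) -> pd.Series:
    """Hypothetical helper: split delivery times into three quantile-based classes."""
    # q=3 places the bin edges at the 33rd and 67th percentiles.
    return pd.qcut(delivery_days, q=3, labels=["Fast", "Medium", "Slow"])


# Synthetic continuous delivery times, only to make the sketch runnable.
rng = np.random.default_rng(0)
days = pd.Series(rng.uniform(1, 60, size=300))
speed = bin_delivery_speed(days)
print(speed.value_counts())
```

With real data, delivery times recorded as whole days contain many ties, so the three classes may come out only approximately equal in size.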
Best Model: XGBoost, with ~57% accuracy (vs. ~33% for random guessing across three balanced classes).
Repository Structure
This repository contains the following files:
- `Assignment_2.ipynb`: The complete Python notebook with all code, EDA, and visualizations.
- `regression_model.pkl`: The trained Random Forest model for regression.
- `classification_model.pkl`: The trained XGBoost model for classification.
- `README.md`: Project documentation.
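
The saved models can presumably be loaded along the following lines. This sketch assumes the `.pkl` files were written with Python's standard `pickle` module; if they were saved with `joblib` instead, `joblib.load` would be the matching call. The helper `load_model` is hypothetical.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Hypothetical helper: load a pickled model from disk."""
    # Only unpickle files from a source you trust; pickle can execute code on load.
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Example usage (requires the files from this repository, plus scikit-learn
# and xgboost installed in versions compatible with the ones used for training):
# regressor = load_model("regression_model.pkl")
# classifier = load_model("classification_model.pkl")
```

The input passed to `predict` must have the same feature columns, in the same order, as the engineered dataset the models were trained on.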