
πŸ“¦ Olist E-Commerce Delivery Prediction

πŸŽ₯ Project Video Presentation

https://youtu.be/Gdw33Av6rCk


πŸ“„ Project Overview

This project focuses on predicting delivery times for e-commerce orders in Brazil, using the famous Olist Public Dataset. The goal is to improve customer experience by providing accurate delivery estimates.

The project is divided into two main tasks:

  1. Regression: Predicting the exact number of days for delivery.
  2. Classification: Categorizing delivery speed into Fast, Medium, or Slow.

πŸ“Š The Dataset

  • Source: Brazilian E-Commerce Public Dataset by Olist.
  • Size: ~100,000 orders after cleaning.
  • Key Features:
    • customer_state: The destination state (critical for logistics).
    • product_weight_g: Weight of the package.
    • freight_value: Shipping cost.
    • price: Product value.
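The regression target (delivery time in days) is not a raw column in the dataset; it has to be derived from order timestamps. A minimal sketch of that derivation, using a toy DataFrame whose column names are assumed to follow the Olist orders schema (`order_purchase_timestamp`, `order_delivered_customer_date`):

```python
import pandas as pd

# Toy stand-in for the Olist orders table (column names assumed from the public schema)
orders = pd.DataFrame({
    "order_purchase_timestamp": ["2018-01-01 10:00", "2018-01-05 14:30"],
    "order_delivered_customer_date": ["2018-01-09 16:00", "2018-01-25 09:00"],
})

# Parse timestamps, then compute the target: delivery time in whole days
for col in orders.columns:
    orders[col] = pd.to_datetime(orders[col])

orders["delivery_days"] = (
    orders["order_delivered_customer_date"] - orders["order_purchase_timestamp"]
).dt.days
```

On the real dataset, rows with missing delivery dates would also need to be dropped before computing the difference.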

πŸ› οΈ Feature Engineering & Clustering

To improve the model's performance beyond the baseline, I applied several techniques:

  • Data Cleaning: Removed outliers (negative days or deliveries > 60 days).
  • Geolocation Encoding: Converted customer_state into One-Hot Encoded features to capture geographical patterns (e.g., separating Sao Paulo from Amazonas).
  • K-Means Clustering: Applied unsupervised learning to group orders into 4 distinct clusters based on weight, price, and shipping cost. These cluster labels were added as a new feature to the model.
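The encoding and clustering steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's exact code; the feature names match the dataset, but the scaling choice (`StandardScaler` before K-Means) is an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned order table
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "customer_state": rng.choice(["SP", "AM", "RJ"], size=200),
    "product_weight_g": rng.uniform(100, 5000, size=200),
    "price": rng.uniform(10, 500, size=200),
    "freight_value": rng.uniform(5, 80, size=200),
})

# Geolocation encoding: one-hot the destination state
df = pd.get_dummies(df, columns=["customer_state"], prefix="state")

# K-Means on scaled weight / price / freight; cluster labels become a new feature
num_cols = ["product_weight_g", "price", "freight_value"]
scaled = StandardScaler().fit_transform(df[num_cols])
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scaled)
```

Scaling matters here because `product_weight_g` spans a much larger numeric range than `freight_value`; without it, K-Means distances would be dominated by weight alone.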

πŸ€– Model Performance

Part 1: Regression (Predicting Days)

I compared a Baseline Linear Regression against advanced models trained on the engineered dataset.

| Model | R² Score | Improvement |
| --- | --- | --- |
| Baseline (Linear Regression) | 0.057 | – |
| Improved Linear Regression | 0.216 | +278% |
| XGBoost / Random Forest (Winner) | 0.267 | Best performance |

Insight: The significant jump in R² shows that the location features and cluster labels were essential for prediction.
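The baseline-vs-ensemble comparison can be reproduced in miniature. The sketch below uses synthetic data with a nonlinear target (so the scores will not match the table above); the model choices mirror the ones compared in the project:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix; the target has a
# nonlinear component that a linear model cannot capture
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 5))
y = 10 + 5 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(0, 0.5, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for name, model in [("Linear Regression", LinearRegression()),
                    ("Random Forest", RandomForestRegressor(random_state=0))]:
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))
    print(name, round(scores[name], 3))
```

The tree ensemble wins here for the same reason it wins on the real data: delivery time depends on interactions (e.g. heavy package *and* remote state) that a single linear fit cannot represent.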

Part 2: Classification (Fast / Medium / Slow)

I converted the target variable into 3 balanced classes using Quantile Binning:

  • Fast: Bottom 33%
  • Medium: Middle 33%
  • Slow: Top 33% (Critical to identify)

  • Best Model: XGBoost
  • Accuracy: ~57% (vs. 33% random chance)
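Quantile binning guarantees the three classes stay balanced even though real delivery times are heavily right-skewed. A minimal sketch with `pd.qcut` on a synthetic skewed distribution (the gamma parameters are illustrative, not fitted to the real data):

```python
import numpy as np
import pandas as pd

# Skewed synthetic delivery times, mimicking a long right tail
rng = np.random.default_rng(7)
delivery_days = pd.Series(rng.gamma(shape=2.0, scale=6.0, size=900))

# Quantile binning: three equally populated classes regardless of skew
speed = pd.qcut(delivery_days, q=3, labels=["Fast", "Medium", "Slow"])
print(speed.value_counts())
```

Equal-width binning on the same data would lump almost everything into the first bin and leave the "Slow" class nearly empty, which is why quantile cuts are the right choice here.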


πŸ“‚ Repository Structure

This repository contains the following files:

  • Assignment_2.ipynb: The complete Python notebook with all code, EDA, and visualizations.
  • regression_model.pkl: The trained Random Forest model for regression.
  • classification_model.pkl: The trained XGBoost model for classification.
  • README.md: Project documentation.