# 🚚 Amazon Delivery Time Prediction
![Python](https://img.shields.io/badge/Python-3.9%2B-blue?logo=python)
![Jupyter](https://img.shields.io/badge/Notebook-Jupyter-orange?logo=jupyter)
![scikit-learn](https://img.shields.io/badge/ML-scikit--learn-%23F7931E?logo=scikitlearn)
![XGBoost](https://img.shields.io/badge/Boosting-XGBoost-76B900)
![LightGBM](https://img.shields.io/badge/Boosting-LightGBM-3EA868)
![CatBoost](https://img.shields.io/badge/Boosting-CatBoost-FF9900)
![Optuna](https://img.shields.io/badge/Optimization-Optuna-6E49CB)
![MLflow](https://img.shields.io/badge/Tracking-MLflow-0194E2)
![Platform](https://img.shields.io/badge/Platform-Colab%20%7C%20Local-lightgrey)
> End‑to‑end regression pipeline to predict e‑commerce delivery time with robust feature engineering, PCA, multiple tree-based models, Optuna tuning, and MLflow tracking.
---
## Table of Contents
- [About the Project](#about-the-project)
- [Key Features](#key-features)
- [Tech Stack](#tech-stack)
- [Installation](#installation)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Dataset](#dataset)
- [Methodology](#methodology)
- [Results](#results)
- [Notebook API (Functions)](#notebook-api-functions)
- [Prerequisites](#prerequisites)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)
- [Acknowledgments](#acknowledgments)
---
## About the Project
This repository builds a supervised learning pipeline to **predict delivery time (minutes)** for e‑commerce orders. It includes:
- Exploratory analysis and rich visualizations.
- Distance computation via **haversine**.
- Categorical encoding and time-based feature extraction.
- **Standardization + PCA (retain ≈95% variance)**.
- Baseline & advanced regressors (**RandomForest, HistGradientBoosting, LightGBM, XGBoost, CatBoost**).
- **Hyperparameter optimization with Optuna** and experiment tracking via **MLflow**.
## Key Features
- 🧭 **Feature engineering**: distance (haversine), weather×traffic interactions, time parts, and efficiency features.
- 🧼 **Data cleaning**: KNN imputation for ratings; strict dtype conversions for dates & times.
- 🧰 **Model zoo**: RandomForest, Histogram GB, LightGBM, XGBoost, CatBoost.
- 🧰 **Dimensionality reduction**: PCA to 95% variance (17 components from 21 numeric features).
- 📈 **Experiment tracking**: MLflow runs, residual plots, PCA scree plots.
- 🎯 **Hyperparameter tuning**: Optuna with TPE sampler; results exported as CSV and visualized.
- 💾 **Model artifacts**: serialized `.pkl` models (Git LFS recommended for >100 MB).
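The artifact step above can be sketched with `joblib`: train a model, serialize it, and reload it for inference. This is a minimal sketch, not the notebook's exact code; the synthetic dataset and the temp-file path are stand-ins (the repo stores real artifacts under `Models/`).

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in data and a small model (the notebook trains on the real features)
X, y = make_regression(n_samples=50, n_features=4, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Serialize, then reload; in the repo such files live under Models/
path = os.path.join(tempfile.gettempdir(), "RandomForest_PCA.pkl")
joblib.dump(model, path)
restored = joblib.load(path)
```

Reloaded models predict identically to the originals, which is why `.pkl` files (tracked via Git LFS when large) are enough to reproduce inference without retraining.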
## Tech Stack
**Core:** Python, Jupyter Notebook
**Libraries:** `pandas`, `numpy`, `scikit-learn`, `xgboost`, `lightgbm`, `catboost`, `optuna`, `mlflow`, `matplotlib`, `seaborn`, `plotly`, `statsmodels`, `haversine`, `matplotlib-venn`, `joblib`, `IPython.display`.
## Installation
> Works on Windows/macOS/Linux. For heavy training, Colab is supported.
```bash
# 1) Create and activate a virtual environment
python -m venv .venv
# Windows
.\.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
# 2) Upgrade pip
python -m pip install --upgrade pip
# 3) Install dependencies
pip install pandas numpy scikit-learn xgboost lightgbm catboost optuna mlflow matplotlib seaborn plotly statsmodels haversine matplotlib-venn joblib ipython
# (Optional) For large model files in GitHub
git lfs install
git lfs track "*.pkl"
```
## Usage
### Run the Notebook
1. Clone the repository and open `Amazon_Delivery_Time (13).ipynb`.
2. Ensure `amazon_delivery.csv` is present in the repo root.
3. In **Colab**, mount Drive and update paths if needed. In **local Jupyter**, replace any `/content/drive/...` paths with local paths.
### Minimal Example (in Python)
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Load
df = pd.read_csv("amazon_delivery.csv")
# (Example) Build numeric matrix; drop target and IDs
X = df.drop(columns=["Delivery_Time", "Order_ID"], errors="ignore").select_dtypes(include="number")
y = df["Delivery_Time"]
# Scale + PCA (≈95% variance)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)  # fit on the training split only to avoid leakage
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)       # reuse the same fitted scaler, as in the notebook
pca = PCA(n_components=0.95, random_state=42)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# Train a quick XGBoost model
model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=6, random_state=42, n_jobs=-1)
model.fit(X_train_pca, y_train)
pred = model.predict(X_test_pca)
rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(rmse, mae, r2)
```
## Project Structure
```
.
├─ Amazon_Delivery_Time (13).ipynb # Main end‑to‑end notebook
├─ amazon_delivery.csv # Dataset (tabular)
├─ Models/
│ └─ RandomForest_PCA.pkl # Example saved model (use Git LFS)
├─ optuna_results-20251004T162645Z-1-001.zip # Optuna exports (trials CSV, plots)
├─ .gitignore
└─ README.md
```
## Dataset
Target: `Delivery_Time` (minutes)
Notable fields used/engineered in the notebook:
- **Dates/Times**: `Order_Date`, `Order_Time`, `Pickup_Time`, derived `Order_Year/Month/Day/Hour/Minute`, `Pickup_Hour/Minute`, `Is_Weekend`, `Pickup_Delay`.
- **Geo**: `Store_Latitude`, `Store_Longitude`, `Drop_Latitude`, `Drop_Longitude`, computed **`Distance`** (km) via haversine; optional `Area_Based_Distance`.
- **Categoricals**: `Weather`, `Traffic`, `Area`, `Category`, `Vehicle`, interaction **`Traffic_Weather`**.
- **Quality/Agent**: `Agent_Rating`, `Agent_Age`.
- **Derived**: frequency/target encodings for `Category`, rolling stats (e.g., `Category_RollingMean`), and **`Efficiency_TimePerKm`**.
> The notebook handles missing values (e.g., KNN imputation for `Agent_Rating`) and applies one‑hot encoding for categorical features (with `drop='first'` to avoid multicollinearity).
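The imputation and encoding steps above can be sketched on a toy frame. Column names follow the dataset; the `n_neighbors` value is illustrative (the notebook's exact setting is not shown here), and `pd.get_dummies(..., drop_first=True)` stands in for one-hot encoding with `drop='first'`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one missing Agent_Rating
df = pd.DataFrame({
    "Agent_Age": [25, 32, 41, 29],
    "Agent_Rating": [4.5, np.nan, 4.8, 4.2],
    "Weather": ["Sunny", "Stormy", "Sunny", "Fog"],
})

# KNN imputation fills Agent_Rating from the nearest rows in numeric space
num_cols = ["Agent_Age", "Agent_Rating"]
df[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

# One-hot encode, dropping the first level to avoid multicollinearity
dummies = pd.get_dummies(df["Weather"], prefix="Weather", drop_first=True)
df = df.join(dummies)
```

Dropping the first dummy level keeps the design matrix full rank, which matters for linear diagnostics even though tree models are insensitive to it.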
## Methodology
1. **Load & Inspect**: read CSV, check dtypes, missingness visualization.
2. **Cleaning & Imputation**: KNN imputation for `Agent_Rating`; coercive parsing for times (`errors='coerce'`); safe date parsing with format strings.
3. **Feature Engineering**:
- Geographic **Distance** using `haversine((lat1, lon1), (lat2, lon2))` (km).
- Time features: order/pickup hour, minute, day‑of‑week, weekend flag; pickup delays.
- Categorical encodings: one‑hot for `Weather`, `Traffic_Weather`; frequency/target encodings for `Category`.
4. **Scaling & PCA**: standardize numeric features, then apply PCA with `n_components=0.95`, which keeps **17 components** retaining **≈95.04%** of the variance from the **21** numeric features.
5. **Modeling**:
- Baselines & tree models: **RandomForest**, **HistGradientBoosting**, **LightGBM**, **XGBoost**, **CatBoost**.
- Evaluation metrics: **RMSE**, **MAE**, **R²** on test set; residual analysis.
6. **Optimization**:
- **Optuna** TPE sampler explores XGBoost hyperparameters (`n_estimators`, `learning_rate`, `max_depth`, `min_child_weight`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`).
- Studies run for 30–100 trials; best params reused to retrain final model.
7. **Experiment Tracking**:
- **MLflow** logs metrics, parameters, residual plots, and PCA scree plots.
8. **Artifacts**: models serialized with `joblib.dump` to `Models/` (recommend Git LFS).
## Results
### Model Comparison (test set)
| Model | Dataset | RMSE | MAE | R² |
|--------------------------|---------|------:|------:|:-----:|
| RandomForest | raw | 0.010 | 0.000 | 1.000 |
| RandomForest | scaled | 0.009 | 0.000 | 1.000 |
| RandomForest | PCA | 7.121 | 5.238 | 0.981 |
| HistGradientBoosting | PCA | 5.206 | 3.919 | 0.990 |
| LightGBM | PCA | 5.405 | 3.981 | 0.989 |
| **XGBoost** | PCA | **5.033** | **3.800** | **0.991** |
| CatBoost | PCA | 5.646 | 4.141 | 0.988 |
**Optuna (validation)**
- Best validation RMSE improved from **≈0.0766** to **≈0.0616** over the course of tuning (measured on the validation split).
- The best trial’s parameters were plugged back into the final XGBoost.
**Final Tuned XGBoost (test set)**
- **RMSE:** `4.8012`
- **MAE:** `3.5394`
- **R²:** `0.9915`
> PCA diagnostics: Original shape `(43648, 21)` → PCA `(43648, 17)`, total variance retained **0.9504**.
## Notebook API (Functions)
- `calculate_distance(row) -> float`
Computes great‑circle distance (km) between store and drop coordinates using **haversine**.
- `objective(trial) -> float`
Optuna objective for **XGBRegressor**. Suggests `n_estimators`, `learning_rate`, `max_depth`, `min_child_weight`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`. Trains on train/val split and returns validation **RMSE**.
- `setup_mlflow_drive(experiment_name, base_dir, nested_dir) -> str`
Configures **MLflow** to log runs under a given folder (e.g., Google Drive), sets tracking URI, and creates/returns the path.
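A minimal sketch of `calculate_distance` is shown below. The notebook calls `haversine((lat1, lon1), (lat2, lon2))` from the `haversine` package; this version writes the same great-circle formula out with the standard library so the behavior is visible, and the docstring/column names follow the dataset.

```python
import math

def calculate_distance(row):
    """Great-circle distance (km) between store and drop coordinates.

    Equivalent to haversine((lat1, lon1), (lat2, lon2)) from the
    `haversine` package, expanded with stdlib math for illustration.
    """
    lat1 = math.radians(row["Store_Latitude"])
    lon1 = math.radians(row["Store_Longitude"])
    lat2 = math.radians(row["Drop_Latitude"])
    lon2 = math.radians(row["Drop_Longitude"])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))  # mean Earth radius ≈ 6371 km
```

In the notebook this is applied row-wise, e.g. `df["Distance"] = df.apply(calculate_distance, axis=1)`.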
## Prerequisites
- Python **3.9+**
- Jupyter Notebook or Google Colab
- (Optional) **Git LFS** for large `.pkl` artifacts
## Contributing
Contributions are welcome!
1. Fork the repo
2. Create a feature branch (`git checkout -b feature/awesome`)
3. Commit changes (`git commit -m "feat: add X"`)
4. Push to branch (`git push origin feature/awesome`)
5. Open a Pull Request
## License
No license file is present yet. If you intend the work to be open source, consider adding an **MIT License**.
## Contact
Author: **thedynasty**
GitHub: [@thedynasty23](https://github.com/thedynasty23)
## Acknowledgments
- Libraries: scikit‑learn, XGBoost, LightGBM, CatBoost, Optuna, MLflow, pandas, numpy, plotly, seaborn, statsmodels, haversine.
- Inspiration: common delivery‑time prediction use‑cases in logistics.