🚚 Amazon Delivery Time Prediction
End‑to‑end regression pipeline to predict e‑commerce delivery time with robust feature engineering, PCA, multiple tree-based models, Optuna tuning, and MLflow tracking.
Table of Contents
- About the Project
- Key Features
- Tech Stack
- Installation
- Usage
- Project Structure
- Dataset
- Methodology
- Results
- Notebook API (Functions)
- Prerequisites
- Contributing
- License
- Contact
- Acknowledgments
About the Project
This repository builds a supervised learning pipeline to predict delivery time (minutes) for e‑commerce orders. It includes:
- Exploratory analysis and rich visualizations.
- Distance computation via haversine.
- Categorical encoding and time-based feature extraction.
- Standardization + PCA (retain ≈95% variance).
- Baseline & advanced regressors (RandomForest, HistGradientBoosting, LightGBM, XGBoost, CatBoost).
- Hyperparameter optimization with Optuna and experiment tracking via MLflow.
Key Features
- 🧭 Feature engineering: distance (haversine), weather×traffic interactions, time parts, and efficiency features.
- 🧼 Data cleaning: KNN imputation for ratings; strict dtype conversions for dates & times.
- 🧰 Model zoo: RandomForest, Histogram GB, LightGBM, XGBoost, CatBoost.
- 🧰 Dimensionality reduction: PCA to 95% variance (17 components from 21 numeric features).
- 📈 Experiment tracking: MLflow runs, residual plots, PCA scree plots.
- 🎯 Hyperparameter tuning: Optuna with TPE sampler; results exported as CSV and visualized.
- 💾 Model artifacts: serialized `.pkl` models (Git LFS recommended for files >100 MB).
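The KNN-based rating imputation can be sketched with scikit-learn's `KNNImputer`. The column names match the dataset, but the tiny frame below is purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Tiny illustrative frame: one missing Agent_Rating to fill
df = pd.DataFrame({
    "Agent_Age":    [22, 25, 31, 28, 40],
    "Agent_Rating": [4.5, 4.7, np.nan, 4.6, 4.2],
})

# Fill the missing rating from the 2 nearest rows (by the numeric columns)
imputer = KNNImputer(n_neighbors=2)
df[["Agent_Age", "Agent_Rating"]] = imputer.fit_transform(
    df[["Agent_Age", "Agent_Rating"]]
)

print(df["Agent_Rating"].isna().sum())  # 0 — no missing ratings remain
```

`KNNImputer` measures row similarity with a NaN-aware Euclidean distance, so the missing rating is replaced by the mean rating of the most similar agents rather than a global mean.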
Tech Stack
Core: Python, Jupyter Notebook
Libraries: pandas, numpy, scikit-learn, xgboost, lightgbm, catboost, optuna, mlflow, matplotlib, seaborn, plotly, statsmodels, haversine, matplotlib-venn, joblib, IPython.display.
Installation
Works on Windows/macOS/Linux. For heavy training, Colab is supported.
# 1) Create and activate a virtual environment
python -m venv .venv
# Windows
.\.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
# 2) Upgrade pip
python -m pip install --upgrade pip
# 3) Install dependencies
pip install pandas numpy scikit-learn xgboost lightgbm catboost optuna mlflow matplotlib seaborn plotly statsmodels haversine matplotlib-venn joblib ipython
# (Optional) For large model files in GitHub
git lfs install
git lfs track "*.pkl"
Usage
Run the Notebook
- Clone the repository and open `Amazon_Delivery_Time (13).ipynb`.
- Ensure `amazon_delivery.csv` is present in the repo root.
- In Colab, mount Drive and update paths if needed. In local Jupyter, replace any `/content/drive/...` paths with local paths.
Minimal Example (in Python)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Load
df = pd.read_csv("amazon_delivery.csv")
# (Example) Build numeric matrix; drop target and IDs
X = df.drop(columns=["Delivery_Time", "Order_ID"], errors="ignore").select_dtypes(include="number")
y = df["Delivery_Time"]
# Scale + PCA (≈95% variance)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # reuse the scaler fitted on the training split
pca = PCA(n_components=0.95, random_state=42)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# Train a quick XGBoost model
model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=6, random_state=42, n_jobs=-1)
model.fit(X_train_pca, y_train)
pred = model.predict(X_test_pca)
rmse = mean_squared_error(y_test, pred) ** 0.5
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(rmse, mae, r2)
Project Structure
.
├─ Amazon_Delivery_Time (13).ipynb # Main end‑to‑end notebook
├─ amazon_delivery.csv # Dataset (tabular)
├─ Models/
│ └─ RandomForest_PCA.pkl # Example saved model (use Git LFS)
├─ optuna_results-20251004T162645Z-1-001.zip # Optuna exports (trials CSV, plots)
├─ .gitignore
└─ README.md
Dataset
Target: Delivery_Time (minutes)
Notable fields used/engineered in the notebook:
- Dates/Times: `Order_Date`, `Order_Time`, `Pickup_Time`; derived `Order_Year/Month/Day/Hour/Minute`, `Pickup_Hour/Minute`, `Is_Weekend`, `Pickup_Delay`.
- Geo: `Store_Latitude`, `Store_Longitude`, `Drop_Latitude`, `Drop_Longitude`; computed `Distance` (km) via haversine; optional `Area_Based_Distance`.
- Categoricals: `Weather`, `Traffic`, `Area`, `Category`, `Vehicle`; interaction `Traffic_Weather`.
- Quality/Agent: `Agent_Rating`, `Agent_Age`.
- Derived: frequency/target encodings for `Category`, rolling stats (e.g., `Category_RollingMean`), and `Efficiency_TimePerKm`.

The notebook handles missing values (e.g., KNN imputation for `Agent_Rating`) and applies one-hot encoding for categorical features (with `drop='first'` to avoid multicollinearity).
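A minimal sketch of the `Distance` and time-based features, using the plain haversine formula (equivalent to the `haversine` library call used in the notebook) and pandas datetime accessors. The coordinates and timestamp below are made up for illustration:

```python
import math
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

df = pd.DataFrame({
    "Store_Latitude": [12.97], "Store_Longitude": [77.59],  # made-up points
    "Drop_Latitude":  [13.03], "Drop_Longitude":  [77.62],
    "Order_Date": ["2022-03-19"], "Order_Time": ["11:30:00"],
})

# Distance (km) between store and drop coordinates
df["Distance"] = df.apply(lambda r: haversine_km(
    r["Store_Latitude"], r["Store_Longitude"],
    r["Drop_Latitude"], r["Drop_Longitude"]), axis=1)

# Time-part features from the combined order timestamp
ts = pd.to_datetime(df["Order_Date"] + " " + df["Order_Time"], errors="coerce")
df["Order_Hour"] = ts.dt.hour
df["Is_Weekend"] = (ts.dt.dayofweek >= 5).astype(int)

print(df[["Distance", "Order_Hour", "Is_Weekend"]])
```

The same per-row pattern is what the notebook's `calculate_distance(row)` helper implements, applied over the full dataset.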
Methodology
- Load & Inspect: read the CSV, check dtypes, visualize missingness.
- Cleaning & Imputation: KNN imputation for `Agent_Rating`; coercive time parsing (`errors='coerce'`); safe date parsing with explicit format strings.
- Feature Engineering:
  - Geographic distance via `haversine((lat1, lon1), (lat2, lon2))` in km.
  - Time features: order/pickup hour, minute, day-of-week, weekend flag, pickup delay.
  - Categorical encodings: one-hot for `Weather` and `Traffic_Weather`; frequency/target encodings for `Category`.
- Scaling & PCA: standardize numeric features, then PCA with `n_components=0.95` → 17 components retaining ≈95.04% of the variance from 21 numeric features.
- Modeling: baselines & tree models (RandomForest, HistGradientBoosting, LightGBM, XGBoost, CatBoost); evaluation on the test set with RMSE, MAE, R²; residual analysis.
- Optimization: Optuna's TPE sampler explores XGBoost hyperparameters (`n_estimators`, `learning_rate`, `max_depth`, `min_child_weight`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`); studies run for 30–100 trials, and the best parameters are reused to retrain the final model.
- Experiment Tracking: MLflow logs metrics, parameters, residual plots, and PCA scree plots.
- Artifacts: models serialized with `joblib.dump` to `Models/` (Git LFS recommended).
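The Scaling & PCA step can be sketched as follows. The random matrix stands in for the 21 standardized numeric features, so the exact component count it yields will differ from the notebook's 17:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Stand-in feature matrix: 8 independent columns plus 13 correlated
# (near-linear-combination) columns, so PCA has redundancy to compress
base = rng.normal(size=(500, 8))
X = np.hstack([base,
               base @ rng.normal(size=(8, 13)) * 0.5
               + rng.normal(size=(500, 13)) * 0.1])

X_std = StandardScaler().fit_transform(X)

# n_components=0.95 keeps the smallest number of components whose
# cumulative explained variance exceeds 95%
pca = PCA(n_components=0.95, random_state=42)
X_pca = pca.fit_transform(X_std)

print(X.shape, "->", X_pca.shape)
print(round(pca.explained_variance_ratio_.sum(), 4))
```

On the real data this is where `(43648, 21)` becomes `(43648, 17)`: standardization first, so no single large-scale feature dominates the principal components.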
Results
Model Comparison (test set)
| Model | Dataset | RMSE | MAE | R² |
|---|---|---|---|---|
| RandomForest | raw | 0.010 | 0.000 | 1.000 |
| RandomForest | scaled | 0.009 | 0.000 | 1.000 |
| RandomForest | PCA | 7.121 | 5.238 | 0.981 |
| HistGradientBoosting | PCA | 5.206 | 3.919 | 0.990 |
| LightGBM | PCA | 5.405 | 3.981 | 0.989 |
| XGBoost | PCA | 5.033 | 3.800 | 0.991 |
| CatBoost | PCA | 5.646 | 4.141 | 0.988 |
Optuna (validation)
- Best validation RMSE improved from ≈0.0766 to ≈0.0616 across trials (measured on the validation split during tuning).
- The best trial’s parameters were plugged back into the final XGBoost.
Final Tuned XGBoost (test set)
- RMSE: `4.8012`
- MAE: `3.5394`
- R²: `0.9915`
PCA diagnostics: original shape `(43648, 21)` → after PCA `(43648, 17)`; total variance retained ≈0.9504.
Notebook API (Functions)
`calculate_distance(row) -> float`
Computes great-circle distance (km) between store and drop coordinates using haversine.

`objective(trial) -> float`
Optuna objective for `XGBRegressor`. Suggests `n_estimators`, `learning_rate`, `max_depth`, `min_child_weight`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`. Trains on a train/validation split and returns validation RMSE.

`setup_mlflow_drive(experiment_name, base_dir, nested_dir) -> str`
Configures MLflow to log runs under a given folder (e.g., Google Drive), sets the tracking URI, and creates/returns the path.
Prerequisites
- Python 3.9+
- Jupyter Notebook or Google Colab
- (Optional) Git LFS for large `.pkl` artifacts
Contributing
Contributions are welcome!
- Fork the repo
- Create a feature branch (`git checkout -b feature/awesome`)
- Commit your changes (`git commit -m "feat: add X"`)
- Push the branch (`git push origin feature/awesome`)
- Open a Pull Request
License
No license file is present yet. If you intend the work to be open source, consider adding an MIT License.
Contact
Author: thedynasty
GitHub: @thedynasty23
Acknowledgments
- Libraries: scikit‑learn, XGBoost, LightGBM, CatBoost, Optuna, MLflow, pandas, numpy, plotly, seaborn, statsmodels, haversine.
- Inspiration: common delivery‑time prediction use‑cases in logistics.