# 🚚 Amazon Delivery Time Prediction

> End‑to‑end regression pipeline to predict e‑commerce delivery time with robust feature engineering, PCA, multiple tree‑based models, Optuna tuning, and MLflow tracking.
---

## Table of Contents

- [About the Project](#about-the-project)
- [Key Features](#key-features)
- [Tech Stack](#tech-stack)
- [Installation](#installation)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Dataset](#dataset)
- [Methodology](#methodology)
- [Results](#results)
- [Notebook API (Functions)](#notebook-api-functions)
- [Prerequisites](#prerequisites)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)
- [Acknowledgments](#acknowledgments)

---
## About the Project

This repository builds a supervised learning pipeline to **predict delivery time (minutes)** for e‑commerce orders. It includes:

- Exploratory analysis and rich visualizations.
- Distance computation via **haversine**.
- Categorical encoding and time‑based feature extraction.
- **Standardization + PCA (retaining ≈95% of variance)**.
- Baseline and advanced regressors (**RandomForest, HistGradientBoosting, LightGBM, XGBoost, CatBoost**).
- **Hyperparameter optimization with Optuna** and experiment tracking via **MLflow**.
## Key Features

- 🧭 **Feature engineering**: distance (haversine), weather × traffic interactions, time parts, and efficiency features.
- 🧼 **Data cleaning**: KNN imputation for ratings; strict dtype conversions for dates and times.
- 🧰 **Model zoo**: RandomForest, HistGradientBoosting, LightGBM, XGBoost, CatBoost.
- 📉 **Dimensionality reduction**: PCA to 95% variance (17 components from 21 numeric features).
- 📈 **Experiment tracking**: MLflow runs, residual plots, PCA scree plots.
- 🎯 **Hyperparameter tuning**: Optuna with the TPE sampler; results exported as CSV and visualized.
- 💾 **Model artifacts**: serialized `.pkl` models (Git LFS recommended for files >100 MB).
## Tech Stack

**Core:** Python, Jupyter Notebook

**Libraries:** `pandas`, `numpy`, `scikit-learn`, `xgboost`, `lightgbm`, `catboost`, `optuna`, `mlflow`, `matplotlib`, `seaborn`, `plotly`, `statsmodels`, `haversine`, `matplotlib-venn`, `joblib`, `IPython` (for `IPython.display`).
## Installation

> Works on Windows/macOS/Linux. For heavy training, Colab is supported.

```bash
# 1) Create and activate a virtual environment
python -m venv .venv
# Windows
.\.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

# 2) Upgrade pip
python -m pip install --upgrade pip

# 3) Install dependencies
pip install pandas numpy scikit-learn xgboost lightgbm catboost optuna mlflow matplotlib seaborn plotly statsmodels haversine matplotlib-venn joblib ipython

# (Optional) For large model files on GitHub
git lfs install
git lfs track "*.pkl"
```
## Usage

### Run the Notebook

1. Clone the repository and open `Amazon_Delivery_Time (13).ipynb`.
2. Ensure `amazon_delivery.csv` is present in the repo root.
3. In **Colab**, mount Drive and update paths if needed. In **local Jupyter**, replace any `/content/drive/...` paths with local paths.

### Minimal Example (in Python)
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor

# Load
df = pd.read_csv("amazon_delivery.csv")

# (Example) Build numeric matrix; drop target and IDs
X = df.drop(columns=["Delivery_Time", "Order_ID"], errors="ignore").select_dtypes(include="number")
y = df["Delivery_Time"]

# Split, then scale with a single scaler fit on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # reuse the training-set scaler; never refit on test data

# PCA retaining ≈95% of variance
pca = PCA(n_components=0.95, random_state=42)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train a quick XGBoost model
model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=6, random_state=42, n_jobs=-1)
model.fit(X_train_pca, y_train)
pred = model.predict(X_test_pca)

rmse = mean_squared_error(y_test, pred) ** 0.5
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}")
```
## Project Structure

```
.
├─ Amazon_Delivery_Time (13).ipynb            # Main end‑to‑end notebook
├─ amazon_delivery.csv                        # Dataset (tabular)
├─ Models/
│  └─ RandomForest_PCA.pkl                    # Example saved model (use Git LFS)
├─ optuna_results-20251004T162645Z-1-001.zip  # Optuna exports (trials CSV, plots)
├─ .gitignore
└─ README.md
```
## Dataset

Target: `Delivery_Time` (minutes)

Notable fields used or engineered in the notebook:

- **Dates/Times**: `Order_Date`, `Order_Time`, `Pickup_Time`; derived `Order_Year/Month/Day/Hour/Minute`, `Pickup_Hour/Minute`, `Is_Weekend`, `Pickup_Delay`.
- **Geo**: `Store_Latitude`, `Store_Longitude`, `Drop_Latitude`, `Drop_Longitude`; computed **`Distance`** (km) via haversine; optional `Area_Based_Distance`.
- **Categoricals**: `Weather`, `Traffic`, `Area`, `Category`, `Vehicle`; interaction **`Traffic_Weather`**.
- **Quality/Agent**: `Agent_Rating`, `Agent_Age`.
- **Derived**: frequency/target encodings for `Category`, rolling stats (e.g., `Category_RollingMean`), and **`Efficiency_TimePerKm`**.

> The notebook handles missing values (e.g., KNN imputation for `Agent_Rating`) and applies one‑hot encoding for categorical features (with `drop='first'` to avoid multicollinearity).
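As a rough, self‑contained sketch of the engineered features above (the `haversine` package is replaced here by an inline formula, and the timestamp formats are assumptions — the notebook's exact parsing may differ):

```python
from math import radians, sin, cos, asin, sqrt

import pandas as pd

def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Illustrative single-row frame using the column names described above
df = pd.DataFrame({
    "Order_Date": ["2022-03-01"], "Order_Time": ["11:30:00"], "Pickup_Time": ["11:45:00"],
    "Store_Latitude": [12.9716], "Store_Longitude": [77.5946],
    "Drop_Latitude": [13.0358], "Drop_Longitude": [77.5970],
})

# Time parts and weekend flag
df["Order_Date"] = pd.to_datetime(df["Order_Date"], errors="coerce")
order_t = pd.to_datetime(df["Order_Time"], format="%H:%M:%S", errors="coerce")
pickup_t = pd.to_datetime(df["Pickup_Time"], format="%H:%M:%S", errors="coerce")
df["Order_Hour"] = order_t.dt.hour
df["Is_Weekend"] = (df["Order_Date"].dt.dayofweek >= 5).astype(int)

# Pickup delay in minutes (pickup time minus order time)
df["Pickup_Delay"] = (pickup_t - order_t).dt.total_seconds() / 60

# Store-to-drop distance in km
df["Distance"] = df.apply(
    lambda r: haversine_km((r["Store_Latitude"], r["Store_Longitude"]),
                           (r["Drop_Latitude"], r["Drop_Longitude"])),
    axis=1,
)
```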
## Methodology

1. **Load & Inspect**: read the CSV, check dtypes, visualize missingness.
2. **Cleaning & Imputation**: KNN imputation for `Agent_Rating`; coercive parsing for times (`errors='coerce'`); safe date parsing with format strings.
3. **Feature Engineering**:
   - Geographic **Distance** using `haversine((lat1, lon1), (lat2, lon2))` (km).
   - Time features: order/pickup hour, minute, day of week, weekend flag; pickup delays.
   - Categorical encodings: one‑hot for `Weather` and `Traffic_Weather`; frequency/target encodings for `Category`.
4. **Scaling & PCA**: standardize numeric features, then apply PCA with `n_components=0.95` → **17 components** retaining **≈95.04%** of the variance across **21** numeric features.
5. **Modeling**:
   - Baselines and tree models: **RandomForest**, **HistGradientBoosting**, **LightGBM**, **XGBoost**, **CatBoost**.
   - Evaluation metrics: **RMSE**, **MAE**, **R²** on the test set; residual analysis.
6. **Optimization**:
   - An **Optuna** TPE sampler explores XGBoost hyperparameters (`n_estimators`, `learning_rate`, `max_depth`, `min_child_weight`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`).
   - Studies run for 30–100 trials; the best parameters are reused to retrain the final model.
7. **Experiment Tracking**: **MLflow** logs metrics, parameters, residual plots, and PCA scree plots.
8. **Artifacts**: models serialized with `joblib.dump` to `Models/` (Git LFS recommended).
## Results

### Model Comparison (test set)

| Model                 | Dataset | RMSE      | MAE       | R²        |
|-----------------------|---------|----------:|----------:|:---------:|
| RandomForest          | raw     | 0.010     | 0.000     | 1.000     |
| RandomForest          | scaled  | 0.009     | 0.000     | 1.000     |
| RandomForest          | PCA     | 7.121     | 5.238     | 0.981     |
| HistGradientBoosting  | PCA     | 5.206     | 3.919     | 0.990     |
| LightGBM              | PCA     | 5.405     | 3.981     | 0.989     |
| **XGBoost**           | PCA     | **5.033** | **3.800** | **0.991** |
| CatBoost              | PCA     | 5.646     | 4.141     | 0.988     |

**Optuna (validation)**

- The best validation RMSE improved from **≈0.0766** to **≈0.0616** over the course of tuning (measured on the validation split).
- The best trial's parameters were plugged back into the final XGBoost model.

**Final Tuned XGBoost (test set)**

- **RMSE:** `4.8012`
- **MAE:** `3.5394`
- **R²:** `0.9915`

> PCA diagnostics: original shape `(43648, 21)` → PCA `(43648, 17)`, total variance retained **0.9504**.
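The PCA diagnostics above can be sanity‑checked in miniature. This sketch uses synthetic correlated data, so the component count will not match the notebook's 17, but the mechanics are the same: `n_components=0.95` keeps just enough components to retain 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 6 independent features plus 4 noisy linear combinations of them,
# standing in for the 21 partially correlated numeric columns
base = rng.normal(size=(1000, 6))
derived = base @ rng.normal(size=(6, 4)) * 0.5 + rng.normal(scale=0.05, size=(1000, 4))
X = np.hstack([base, derived])

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95, random_state=0)
X_pca = pca.fit_transform(X_std)

print(X.shape, "->", X_pca.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())
```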
## Notebook API (Functions)

- `calculate_distance(row) -> float`
  Computes the great‑circle distance (km) between store and drop coordinates using **haversine**.
- `objective(trial) -> float`
  Optuna objective for **XGBRegressor**. Suggests `n_estimators`, `learning_rate`, `max_depth`, `min_child_weight`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`; trains on a train/validation split and returns the validation **RMSE**.
- `setup_mlflow_drive(experiment_name, base_dir, nested_dir) -> str`
  Configures **MLflow** to log runs under a given folder (e.g., Google Drive), sets the tracking URI, and creates/returns the path.
## Prerequisites

- Python **3.9+**
- Jupyter Notebook or Google Colab
- (Optional) **Git LFS** for large `.pkl` artifacts

## Contributing

Contributions are welcome!

1. Fork the repo
2. Create a feature branch (`git checkout -b feature/awesome`)
3. Commit your changes (`git commit -m "feat: add X"`)
4. Push the branch (`git push origin feature/awesome`)
5. Open a Pull Request

## License

No license file is present yet. If you intend the work to be open source, consider adding an **MIT License**.

## Contact

Author: **thedynasty**
GitHub: [@thedynasty23](https://github.com/thedynasty23)

## Acknowledgments

- Libraries: scikit‑learn, XGBoost, LightGBM, CatBoost, Optuna, MLflow, pandas, numpy, plotly, seaborn, statsmodels, haversine.
- Inspiration: common delivery‑time prediction use cases in logistics.