
🚚 Amazon Delivery Time Prediction


End‑to‑end regression pipeline to predict e‑commerce delivery time with robust feature engineering, PCA, multiple tree-based models, Optuna tuning, and MLflow tracking.


Table of Contents

  • About the Project
  • Key Features
  • Tech Stack
  • Installation
  • Usage
  • Project Structure
  • Dataset
  • Methodology
  • Results
  • Notebook API (Functions)
  • Prerequisites
  • Contributing
  • License
  • Contact
  • Acknowledgments

About the Project

This repository builds a supervised learning pipeline to predict delivery time (minutes) for e‑commerce orders. It includes:

  • Exploratory analysis and rich visualizations.
  • Distance computation via haversine.
  • Categorical encoding and time-based feature extraction.
  • Standardization + PCA (retain ≈95% variance).
  • Baseline & advanced regressors (RandomForest, HistGradientBoosting, LightGBM, XGBoost, CatBoost).
  • Hyperparameter optimization with Optuna and experiment tracking via MLflow.

Key Features

  • 🧭 Feature engineering: distance (haversine), weather×traffic interactions, time parts, and efficiency features.
  • 🧼 Data cleaning: KNN imputation for ratings; strict dtype conversions for dates & times.
  • 🧰 Model zoo: RandomForest, Histogram GB, LightGBM, XGBoost, CatBoost.
  • 🧰 Dimensionality reduction: PCA to 95% variance (17 components from 21 numeric features).
  • 📈 Experiment tracking: MLflow runs, residual plots, PCA scree plots.
  • 🎯 Hyperparameter tuning: Optuna with TPE sampler; results exported as CSV and visualized.
  • 💾 Model artifacts: serialized .pkl models (Git LFS recommended for >100 MB).
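
The interaction and efficiency features listed above can be sketched in a few lines of pandas. The toy rows below stand in for amazon_delivery.csv; the column names follow the Dataset section of this README:

```python
import pandas as pd

# Toy rows standing in for amazon_delivery.csv (column names from the Dataset section)
df = pd.DataFrame({
    "Traffic": ["High", "Low"],
    "Weather": ["Sunny", "Stormy"],
    "Delivery_Time": [120.0, 45.0],   # minutes
    "Distance": [14.2, 3.5],          # km
})

# Interaction feature: one category per traffic/weather combination
df["Traffic_Weather"] = df["Traffic"].str.strip() + "_" + df["Weather"].str.strip()

# Efficiency feature: minutes needed per kilometre travelled
df["Efficiency_TimePerKm"] = df["Delivery_Time"] / df["Distance"]
print(df[["Traffic_Weather", "Efficiency_TimePerKm"]])
```

The same pattern extends to any pair of categoricals; the notebook's exact column handling may differ.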

Tech Stack

Core: Python, Jupyter Notebook
Libraries: pandas, numpy, scikit-learn, xgboost, lightgbm, catboost, optuna, mlflow, matplotlib, seaborn, plotly, statsmodels, haversine, matplotlib-venn, joblib, IPython.display.

Installation

Works on Windows/macOS/Linux. For heavy training, Colab is supported.

```bash
# 1) Create and activate a virtual environment
python -m venv .venv
# Windows
.\.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

# 2) Upgrade pip
python -m pip install --upgrade pip

# 3) Install dependencies
pip install pandas numpy scikit-learn xgboost lightgbm catboost optuna mlflow matplotlib seaborn plotly statsmodels haversine matplotlib-venn joblib ipython

# (Optional) For large model files in GitHub
git lfs install
git lfs track "*.pkl"
```

Usage

Run the Notebook

  1. Clone the repository and open Amazon_Delivery_Time (13).ipynb.
  2. Ensure amazon_delivery.csv is present in the repo root.
  3. In Colab, mount Drive and update paths if needed. In local Jupyter, replace any /content/drive/... paths with local paths.

Minimal Example (in Python)

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

# Load
df = pd.read_csv("amazon_delivery.csv")

# (Example) Build numeric matrix; drop target and IDs
X = df.drop(columns=["Delivery_Time", "Order_ID"], errors="ignore").select_dtypes(include="number")
y = df["Delivery_Time"]

# Split first, then fit a single scaler on the training set only and reuse it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# PCA retaining ≈95% of the variance
pca = PCA(n_components=0.95, random_state=42)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train a quick XGBoost model
model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=6, random_state=42, n_jobs=-1)
model.fit(X_train_pca, y_train)
pred = model.predict(X_test_pca)

rmse = mean_squared_error(y_test, pred) ** 0.5
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}")
```

Project Structure

```text
.
├─ Amazon_Delivery_Time (13).ipynb        # Main end‑to‑end notebook
├─ amazon_delivery.csv                    # Dataset (tabular)
├─ Models/
│  └─ RandomForest_PCA.pkl                # Example saved model (use Git LFS)
├─ optuna_results-20251004T162645Z-1-001.zip  # Optuna exports (trials CSV, plots)
├─ .gitignore
└─ README.md
```

Dataset

Target: Delivery_Time (minutes)

Notable fields used/engineered in the notebook:

  • Dates/Times: Order_Date, Order_Time, Pickup_Time, derived Order_Year/Month/Day/Hour/Minute, Pickup_Hour/Minute, Is_Weekend, Pickup_Delay.
  • Geo: Store_Latitude, Store_Longitude, Drop_Latitude, Drop_Longitude, computed Distance (km) via haversine; optional Area_Based_Distance.
  • Categoricals: Weather, Traffic, Area, Category, Vehicle, interaction Traffic_Weather.
  • Quality/Agent: Agent_Rating, Agent_Age.
  • Derived: frequency/target encodings for Category, rolling stats (e.g., Category_RollingMean), and Efficiency_TimePerKm.

The notebook handles missing values (e.g., KNN imputation for Agent_Rating) and applies one‑hot encoding for categorical features (with drop='first' to avoid multicollinearity).
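
A minimal sketch of those two cleaning steps, using sklearn's KNNImputer and pandas get_dummies on a toy frame (column names follow the notebook; the real data has many more columns):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame: Agent_Rating has a gap, Weather is a categorical (names from the README)
df = pd.DataFrame({
    "Agent_Rating": [4.5, np.nan, 4.0, 4.8],
    "Agent_Age": [25, 31, 28, 40],
    "Weather": ["Sunny", "Stormy", "Sunny", "Fog"],
})

# KNN imputation on the numeric columns (the notebook imputes Agent_Rating this way)
num_cols = ["Agent_Rating", "Agent_Age"]
df[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

# One-hot encode categoricals, dropping the first level to avoid multicollinearity
df = pd.get_dummies(df, columns=["Weather"], drop_first=True)
print(df.columns.tolist())
```

Note that get_dummies with drop_first=True drops the first category alphabetically; the reference level is therefore implicit in the remaining dummy columns.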

Methodology

  1. Load & Inspect: read CSV, check dtypes, missingness visualization.
  2. Cleaning & Imputation: KNN imputation for Agent_Rating; coercive parsing for times (errors='coerce'); safe date parsing with format strings.
  3. Feature Engineering:
    • Geographic Distance using haversine((lat1, lon1), (lat2, lon2)) (km).
    • Time features: order/pickup hour, minute, day‑of‑week, weekend flag; pickup delays.
    • Categorical encodings: one‑hot for Weather, Traffic_Weather; frequency/target encodings for Category.
  4. Scaling & PCA: standardize numeric features, then PCA with n_components=0.95, yielding 17 components that retain ≈95.04% of the variance from the 21 numeric features.
  5. Modeling:
    • Baselines & tree models: RandomForest, HistGradientBoosting, LightGBM, XGBoost, CatBoost.
    • Evaluation metrics: RMSE, MAE, and R² on the test set; residual analysis.
  6. Optimization:
    • Optuna TPE sampler explores XGBoost hyperparameters (n_estimators, learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, gamma, reg_alpha, reg_lambda).
    • Studies run for 30–100 trials; best params reused to retrain final model.
  7. Experiment Tracking:
    • MLflow logs metrics, parameters, residual plots, and PCA scree plots.
  8. Artifacts: models serialized with joblib.dump to Models/ (recommend Git LFS).
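
Steps 2–3 above can be illustrated with a self-contained sketch. The haversine formula is written out with the math module so the snippet runs without the haversine package (it computes the same great-circle distance); the coordinates and timestamp are made up for illustration:

```python
from math import asin, cos, radians, sin, sqrt

import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (same formula the haversine package uses)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Toy order with README column names (made-up values)
df = pd.DataFrame({
    "Store_Latitude": [12.97], "Store_Longitude": [77.59],
    "Drop_Latitude": [13.03], "Drop_Longitude": [77.62],
    "Order_Date": ["2022-03-19"], "Order_Time": ["11:30:00"],
})

# Geographic distance, row by row
df["Distance"] = df.apply(lambda r: haversine_km(
    r["Store_Latitude"], r["Store_Longitude"],
    r["Drop_Latitude"], r["Drop_Longitude"]), axis=1)

# Time parts from coercively parsed datetimes
dt = pd.to_datetime(df["Order_Date"] + " " + df["Order_Time"], errors="coerce")
df["Order_Hour"] = dt.dt.hour
df["Is_Weekend"] = (dt.dt.dayofweek >= 5).astype(int)
print(df[["Distance", "Order_Hour", "Is_Weekend"]])
```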

Results

Model Comparison (test set)

| Model                | Dataset | RMSE  | MAE   | R²    |
|----------------------|---------|-------|-------|-------|
| RandomForest         | raw     | 0.010 | 0.000 | 1.000 |
| RandomForest         | scaled  | 0.009 | 0.000 | 1.000 |
| RandomForest         | PCA     | 7.121 | 5.238 | 0.981 |
| HistGradientBoosting | PCA     | 5.206 | 3.919 | 0.990 |
| LightGBM             | PCA     | 5.405 | 3.981 | 0.989 |
| XGBoost              | PCA     | 5.033 | 3.800 | 0.991 |
| CatBoost             | PCA     | 5.646 | 4.141 | 0.988 |

Note: the near-perfect raw/scaled RandomForest scores likely reflect target leakage (e.g., Efficiency_TimePerKm is derived from Delivery_Time), so the PCA rows are the fairer comparison.

Optuna (validation)

  • Best validation RMSE improved from ≈0.0766 to ≈0.0616 over the course of tuning (on the validation split).
  • The best trial's parameters were then used to retrain the final XGBoost model.

Final Tuned XGBoost (test set)

  • RMSE: 4.8012
  • MAE: 3.5394
  • R²: 0.9915

PCA diagnostics: Original shape (43648, 21) → PCA (43648, 17), total variance retained 0.9504.
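
The PCA diagnostics can be reproduced in miniature. The correlated 21-column matrix below is synthetic, an assumption for illustration only, not the notebook's real feature matrix; it shows how a float n_components picks the component count that clears the variance threshold:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 21 numeric columns built from 8 latent factors, so heavily correlated
rng = np.random.default_rng(42)
base = rng.normal(size=(1000, 8))
X = np.hstack([base, base @ rng.normal(size=(8, 13))])  # shape (1000, 21)

# Standardize, then keep the fewest components reaching ≥95% cumulative variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95, random_state=42).fit(X_scaled)
print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 4))
```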

Notebook API (Functions)

  • calculate_distance(row) -> float
    Computes great‑circle distance (km) between store and drop coordinates using haversine.

  • objective(trial) -> float
    Optuna objective for XGBRegressor. Suggests n_estimators, learning_rate, max_depth, min_child_weight, subsample, colsample_bytree, gamma, reg_alpha, reg_lambda. Trains on train/val split and returns validation RMSE.

  • setup_mlflow_drive(experiment_name, base_dir, nested_dir) -> str
    Configures MLflow to log runs under a given folder (e.g., Google Drive), sets tracking URI, and creates/returns the path.
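
A hedged sketch of what objective(trial) looks like. RandomForestRegressor stands in for XGBRegressor here so the snippet runs without xgboost, and only a subset of hyperparameters is suggested; the notebook's real objective suggests the full XGBoost parameter list given above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real train/val split
X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

def objective(trial):
    """Suggest hyperparameters, fit on train, return validation RMSE."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "max_features": trial.suggest_float("max_features", 0.3, 1.0),
    }
    model = RandomForestRegressor(random_state=42, n_jobs=-1, **params)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_val)
    return mean_squared_error(y_val, pred) ** 0.5

# With Optuna installed you would then run:
# study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler())
# study.optimize(objective, n_trials=30)
```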

Prerequisites

  • Python 3.9+
  • Jupyter Notebook or Google Colab
  • (Optional) Git LFS for large .pkl artifacts

Contributing

Contributions are welcome!

  1. Fork the repo
  2. Create a feature branch (git checkout -b feature/awesome)
  3. Commit changes (git commit -m "feat: add X")
  4. Push to branch (git push origin feature/awesome)
  5. Open a Pull Request

License

No license file is present yet. If you intend the work to be open source, consider adding an MIT License.

Contact

Author: thedynasty
GitHub: @thedynasty23

Acknowledgments

  • Libraries: scikit‑learn, XGBoost, LightGBM, CatBoost, Optuna, MLflow, pandas, numpy, plotly, seaborn, statsmodels, haversine.
  • Inspiration: common delivery‑time prediction use‑cases in logistics.