# 🚚 Amazon Delivery Time Prediction
![Python](https://img.shields.io/badge/Python-3.9%2B-blue?logo=python)
![Jupyter](https://img.shields.io/badge/Notebook-Jupyter-orange?logo=jupyter)
![scikit-learn](https://img.shields.io/badge/ML-scikit--learn-%23F7931E?logo=scikitlearn)
![XGBoost](https://img.shields.io/badge/Boosting-XGBoost-76B900)
![LightGBM](https://img.shields.io/badge/Boosting-LightGBM-3EA868)
![CatBoost](https://img.shields.io/badge/Boosting-CatBoost-FF9900)
![Optuna](https://img.shields.io/badge/Optimization-Optuna-6E49CB)
![MLflow](https://img.shields.io/badge/Tracking-MLflow-0194E2)
![Platform](https://img.shields.io/badge/Platform-Colab%20%7C%20Local-lightgrey)
> End‑to‑end regression pipeline to predict e‑commerce delivery time with robust feature engineering, PCA, multiple tree-based models, Optuna tuning, and MLflow tracking.
---
## Table of Contents
- [About the Project](#about-the-project)
- [Key Features](#key-features)
- [Tech Stack](#tech-stack)
- [Installation](#installation)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Dataset](#dataset)
- [Methodology](#methodology)
- [Results](#results)
- [Notebook API (Functions)](#notebook-api-functions)
- [Prerequisites](#prerequisites)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)
- [Acknowledgments](#acknowledgments)
---
## About the Project
This repository builds a supervised learning pipeline to **predict delivery time (minutes)** for e‑commerce orders. It includes:
- Exploratory analysis and rich visualizations.
- Distance computation via **haversine**.
- Categorical encoding and time-based feature extraction.
- **Standardization + PCA (retain ≈95% variance)**.
- Baseline & advanced regressors (**RandomForest, HistGradientBoosting, LightGBM, XGBoost, CatBoost**).
- **Hyperparameter optimization with Optuna** and experiment tracking via **MLflow**.
## Key Features
- 🧭 **Feature engineering**: distance (haversine), weather×traffic interactions, time parts, and efficiency features.
- 🧼 **Data cleaning**: KNN imputation for ratings; strict dtype conversions for dates & times.
- 🧰 **Model zoo**: RandomForest, Histogram GB, LightGBM, XGBoost, CatBoost.
- 🧰 **Dimensionality reduction**: PCA to 95% variance (17 components from 21 numeric features).
- 📈 **Experiment tracking**: MLflow runs, residual plots, PCA scree plots.
- 🎯 **Hyperparameter tuning**: Optuna with TPE sampler; results exported as CSV and visualized.
- 💾 **Model artifacts**: serialized `.pkl` models (Git LFS recommended for >100 MB).
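The artifact step above can be sketched with `joblib`: train a model, serialize it, and reload it for inference. This is a minimal sketch, not the notebook's exact code; the synthetic dataset and the temp-file path are stand-ins (the repo stores real artifacts under `Models/`).

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in data and a small model (the notebook trains on the real features)
X, y = make_regression(n_samples=50, n_features=4, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Serialize, then reload; in the repo such files live under Models/
path = os.path.join(tempfile.gettempdir(), "RandomForest_PCA.pkl")
joblib.dump(model, path)
restored = joblib.load(path)
```

Reloaded models predict identically to the originals, which is why `.pkl` files (tracked via Git LFS when large) are enough to reproduce inference without retraining.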
## Tech Stack
**Core:** Python, Jupyter Notebook
**Libraries:** `pandas`, `numpy`, `scikit-learn`, `xgboost`, `lightgbm`, `catboost`, `optuna`, `mlflow`, `matplotlib`, `seaborn`, `plotly`, `statsmodels`, `haversine`, `matplotlib-venn`, `joblib`, `IPython.display`.
## Installation
> Works on Windows/macOS/Linux. For heavy training, Colab is supported.
```bash
# 1) Create and activate a virtual environment
python -m venv .venv
# Windows
.\.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
# 2) Upgrade pip
python -m pip install --upgrade pip
# 3) Install dependencies
pip install pandas numpy scikit-learn xgboost lightgbm catboost optuna mlflow matplotlib seaborn plotly statsmodels haversine matplotlib-venn joblib ipython
# (Optional) For large model files in GitHub
git lfs install
git lfs track "*.pkl"
```
## Usage
### Run the Notebook
1. Clone the repository and open `Amazon_Delivery_Time (13).ipynb`.
2. Ensure `amazon_delivery.csv` is present in the repo root.
3. In **Colab**, mount Drive and update paths if needed. In **local Jupyter**, replace any `/content/drive/...` paths with local paths.
### Minimal Example (in Python)
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Load
df = pd.read_csv("amazon_delivery.csv")
# (Example) Build numeric matrix; drop target and IDs
X = df.drop(columns=["Delivery_Time", "Order_ID"], errors="ignore").select_dtypes(include="number")
y = df["Delivery_Time"]
# Scale + PCA (≈95% variance)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)  # fit on the training split only to avoid leakage
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)       # reuse the same fitted scaler, as in the notebook
pca = PCA(n_components=0.95, random_state=42)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# Train a quick XGBoost model
model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=6, random_state=42, n_jobs=-1)
model.fit(X_train_pca, y_train)
pred = model.predict(X_test_pca)
rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(rmse, mae, r2)
```
## Project Structure
```
.
├─ Amazon_Delivery_Time (13).ipynb # Main end‑to‑end notebook
├─ amazon_delivery.csv # Dataset (tabular)
├─ Models/
│ └─ RandomForest_PCA.pkl # Example saved model (use Git LFS)
├─ optuna_results-20251004T162645Z-1-001.zip # Optuna exports (trials CSV, plots)
├─ .gitignore
└─ README.md
```
## Dataset
Target: `Delivery_Time` (minutes)
Notable fields used/engineered in the notebook:
- **Dates/Times**: `Order_Date`, `Order_Time`, `Pickup_Time`, derived `Order_Year/Month/Day/Hour/Minute`, `Pickup_Hour/Minute`, `Is_Weekend`, `Pickup_Delay`.
- **Geo**: `Store_Latitude`, `Store_Longitude`, `Drop_Latitude`, `Drop_Longitude`, computed **`Distance`** (km) via haversine; optional `Area_Based_Distance`.
- **Categoricals**: `Weather`, `Traffic`, `Area`, `Category`, `Vehicle`, interaction **`Traffic_Weather`**.
- **Quality/Agent**: `Agent_Rating`, `Agent_Age`.
- **Derived**: frequency/target encodings for `Category`, rolling stats (e.g., `Category_RollingMean`), and **`Efficiency_TimePerKm`**.
> The notebook handles missing values (e.g., KNN imputation for `Agent_Rating`) and applies one‑hot encoding for categorical features (with `drop='first'` to avoid multicollinearity).
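The imputation and encoding steps above can be sketched on a toy frame. Column names follow the dataset; the `n_neighbors` value is illustrative (the notebook's exact setting is not shown here), and `pd.get_dummies(..., drop_first=True)` stands in for one-hot encoding with `drop='first'`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one missing Agent_Rating
df = pd.DataFrame({
    "Agent_Age": [25, 32, 41, 29],
    "Agent_Rating": [4.5, np.nan, 4.8, 4.2],
    "Weather": ["Sunny", "Stormy", "Sunny", "Fog"],
})

# KNN imputation fills Agent_Rating from the nearest rows in numeric space
num_cols = ["Agent_Age", "Agent_Rating"]
df[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

# One-hot encode, dropping the first level to avoid multicollinearity
dummies = pd.get_dummies(df["Weather"], prefix="Weather", drop_first=True)
df = df.join(dummies)
```

Dropping the first dummy level keeps the design matrix full rank, which matters for linear diagnostics even though tree models are insensitive to it.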
## Methodology
1. **Load & Inspect**: read CSV, check dtypes, missingness visualization.
2. **Cleaning & Imputation**: KNN imputation for `Agent_Rating`; coercive parsing for times (`errors='coerce'`); safe date parsing with format strings.
3. **Feature Engineering**:
- Geographic **Distance** using `haversine((lat1, lon1), (lat2, lon2))` (km).
- Time features: order/pickup hour, minute, day‑of‑week, weekend flag; pickup delays.
- Categorical encodings: one‑hot for `Weather`, `Traffic_Weather`; frequency/target encodings for `Category`.
4. **Scaling & PCA**: standardize numeric features, then apply PCA with `n_components=0.95`, which keeps **17 components** retaining **≈95.04%** of the variance from the **21** numeric features.
5. **Modeling**:
- Baselines & tree models: **RandomForest**, **HistGradientBoosting**, **LightGBM**, **XGBoost**, **CatBoost**.
- Evaluation metrics: **RMSE**, **MAE**, **R²** on test set; residual analysis.
6. **Optimization**:
- **Optuna** TPE sampler explores XGBoost hyperparameters (`n_estimators`, `learning_rate`, `max_depth`, `min_child_weight`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`).
- Studies run for 30–100 trials; best params reused to retrain final model.
7. **Experiment Tracking**:
- **MLflow** logs metrics, parameters, residual plots, and PCA scree plots.
8. **Artifacts**: models serialized with `joblib.dump` to `Models/` (recommend Git LFS).
## Results
### Model Comparison (test set)
| Model | Dataset | RMSE | MAE | R² |
|--------------------------|---------|------:|------:|:-----:|
| RandomForest | raw | 0.010 | 0.000 | 1.000 |
| RandomForest | scaled | 0.009 | 0.000 | 1.000 |
| RandomForest | PCA | 7.121 | 5.238 | 0.981 |
| HistGradientBoosting | PCA | 5.206 | 3.919 | 0.990 |
| LightGBM | PCA | 5.405 | 3.981 | 0.989 |
| **XGBoost** | PCA | **5.033** | **3.800** | **0.991** |
| CatBoost | PCA | 5.646 | 4.141 | 0.988 |
**Optuna (validation)**
- Best validation RMSE improved from **≈0.0766** to **≈0.0616** over the course of tuning (measured on the validation split).
- The best trial’s parameters were plugged back into the final XGBoost.
**Final Tuned XGBoost (test set)**
- **RMSE:** `4.8012`
- **MAE:** `3.5394`
- **R²:** `0.9915`
> PCA diagnostics: Original shape `(43648, 21)` → PCA `(43648, 17)`, total variance retained **0.9504**.
## Notebook API (Functions)
- `calculate_distance(row) -> float`
Computes great‑circle distance (km) between store and drop coordinates using **haversine**.
- `objective(trial) -> float`
Optuna objective for **XGBRegressor**. Suggests `n_estimators`, `learning_rate`, `max_depth`, `min_child_weight`, `subsample`, `colsample_bytree`, `gamma`, `reg_alpha`, `reg_lambda`. Trains on train/val split and returns validation **RMSE**.
- `setup_mlflow_drive(experiment_name, base_dir, nested_dir) -> str`
Configures **MLflow** to log runs under a given folder (e.g., Google Drive), sets tracking URI, and creates/returns the path.
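A minimal sketch of `calculate_distance` is shown below. The notebook calls `haversine((lat1, lon1), (lat2, lon2))` from the `haversine` package; this version writes the same great-circle formula out with the standard library so the behavior is visible, and the docstring/column names follow the dataset.

```python
import math

def calculate_distance(row):
    """Great-circle distance (km) between store and drop coordinates.

    Equivalent to haversine((lat1, lon1), (lat2, lon2)) from the
    `haversine` package, expanded with stdlib math for illustration.
    """
    lat1 = math.radians(row["Store_Latitude"])
    lon1 = math.radians(row["Store_Longitude"])
    lat2 = math.radians(row["Drop_Latitude"])
    lon2 = math.radians(row["Drop_Longitude"])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))  # mean Earth radius ≈ 6371 km
```

In the notebook this is applied row-wise, e.g. `df["Distance"] = df.apply(calculate_distance, axis=1)`.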
## Prerequisites
- Python **3.9+**
- Jupyter Notebook or Google Colab
- (Optional) **Git LFS** for large `.pkl` artifacts
## Contributing
Contributions are welcome!
1. Fork the repo
2. Create a feature branch (`git checkout -b feature/awesome`)
3. Commit changes (`git commit -m "feat: add X"`)
4. Push to branch (`git push origin feature/awesome`)
5. Open a Pull Request
## License
No license file is present yet. If you intend the work to be open source, consider adding an **MIT License**.
## Contact
Author: **thedynasty**
GitHub: [@thedynasty23](https://github.com/thedynasty23)
## Acknowledgments
- Libraries: scikit‑learn, XGBoost, LightGBM, CatBoost, Optuna, MLflow, pandas, numpy, plotly, seaborn, statsmodels, haversine.
- Inspiration: common delivery‑time prediction use‑cases in logistics.