	# Demand Prediction System for E-commerce

	A complete machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised regression models with time-series models (ARIMA, Prophet) and selects the best-performing approach.

	## 📋 Table of Contents

	- [Overview](#overview)
	- [Features](#features)
	- [Project Structure](#project-structure)
	- [Installation](#installation)
	- [Dataset](#dataset)
	- [Usage](#usage)
	- [Model Details](#model-details)
	- [Evaluation Metrics](#evaluation-metrics)
	- [Visualizations](#visualizations)
	- [Example Predictions](#example-predictions)
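	- [Technical Details](#technical-details)
	- [Learning Points](#learning-points)
	- [Troubleshooting](#troubleshooting)
	- [Notes](#notes)
	- [License](#license)
	- [Author](#author)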

	## 🎯 Overview

	This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:

	1. Machine Learning Models: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
	2. Time-Series Models: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)

	The system automatically selects the best performing model across both approaches.

	Key Capabilities:
	- Predicts sales quantity for products on future dates (ML models)
	- Predicts overall daily demand (Time-series models)
	- Handles temporal patterns and seasonality
	- Considers price, discount, category, and date features (ML models)
	- Captures time-series patterns and trends (Time-series models)
	- Automatically selects the best model from multiple candidates
	- Provides comprehensive evaluation metrics
	- Compares ML vs Time-Series approaches

	## ✨ Features

	- Data Preprocessing: Automatic handling of missing values, date feature extraction
	- Feature Engineering:
	  - Date features (day, month, day_of_week, weekend, year, quarter)
	  - Categorical encoding (product_id, category)
	  - Feature scaling
	- Multiple Models:
	  - Machine Learning Models:
	    - Linear Regression
	    - Random Forest Regressor
	    - XGBoost Regressor (optional)
	  - Time-Series Models:
	    - ARIMA (AutoRegressive Integrated Moving Average)
	    - Prophet (Facebook's time-series forecasting tool)
	- Model Selection: Automatic best model selection based on R2 score
	- Evaluation Metrics: MAE, RMSE, and R2 Score
	- Visualizations:
	  - Demand trends over time
	  - Monthly average demand
	  - Feature importance
	  - Model comparison
	- Model Persistence: Save and load trained models using joblib
	- Future Predictions: Predict demand for any product on any future date

	## 📁 Project Structure

	```
	demand_prediction/
	│
	├── data/
	│   └── sales.csv                      # Sales dataset
	│
	├── models/                            # Generated during training
	│   ├── best_model.joblib              # Best ML model (if ML is best)
	│   ├── best_timeseries_model.joblib   # Best time-series model (if TS is best)
	│   ├── preprocessing.joblib           # Encoders and scaler (for ML models)
	│   ├── model_metadata.json            # Model metadata (legacy)
	│   └── all_models_metadata.json       # All models comparison metadata
	│
	├── plots/                             # Generated during training
	│   ├── demand_trends.png              # Time series plot
	│   ├── monthly_demand.png             # Monthly averages
	│   ├── feature_importance.png         # Feature importance (ML models)
	│   ├── model_comparison.png           # Model metrics comparison (all models)
	│   └── timeseries_predictions.png     # Time-series model predictions
	│
	├── generate_dataset.py                # Script to generate synthetic dataset
	├── train_model.py                     # Main training script
	├── predict.py                         # Prediction script
	├── app.py                             # Streamlit dashboard (interactive web app)
	├── requirements.txt                   # Python dependencies
	└── README.md                          # This file
	```

	## 🚀 Installation

	### Prerequisites

	- Python 3.8 or higher
	- pip (Python package manager)

	### Step 1: Navigate to Project Directory

	```bash
	cd demand_prediction
	```

	### Step 2: Create Virtual Environment (Recommended)

	Why use a virtual environment?
	- Keeps project dependencies isolated from your system Python
	- Prevents conflicts with other projects
	- Makes it easier to manage package versions
	- Best practice for Python projects

	Quick Setup (Recommended):

	Windows:
	```bash
	setup_env.bat
	```

	Linux/Mac:
	```bash
	chmod +x setup_env.sh
	./setup_env.sh
	```

	Manual Setup:

	Windows:
	```bash
	python -m venv venv
	venv\Scripts\activate
	```

	Linux/Mac:
	```bash
	python3 -m venv venv
	source venv/bin/activate
	```

	After activation, you should see `(venv)` in your terminal prompt.

	To deactivate later:
	```bash
	deactivate
	```

	### Step 3: Install Dependencies

	```bash
	pip install -r requirements.txt
	```

	Note: XGBoost is optional. If you remove it from `requirements.txt`, training skips the XGBoost model and proceeds with the remaining models.

	Alternative (without virtual environment):
	If you prefer not to use a virtual environment, you can install directly:
	```bash
	pip install -r requirements.txt
	```
	However, this is not recommended as it may cause conflicts with other Python projects.

	### Step 4: Generate Dataset

	If you don't have a dataset, generate a synthetic one:

	```bash
	python generate_dataset.py
	```

	This will create `data/sales.csv` with realistic e-commerce sales data.

	## 📊 Dataset

	The dataset should contain the following columns:

	- product_id: Unique identifier for each product (integer)
	- date: Date of sale (YYYY-MM-DD format)
	- price: Product price (float)
	- discount: Discount percentage (0-100, float)
	- category: Product category (string)
	- sales_quantity: Target variable - number of units sold (integer)

	### Dataset Format Example

	```csv
	product_id,date,price,discount,category,sales_quantity
	1,2020-01-01,499.99,10,Electronics,45
	2,2020-01-01,29.99,0,Clothing,120
	...
	```
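Before training, it can help to check that a file matches the columns listed above. The following is a minimal sketch using only the standard library; `REQUIRED_COLUMNS` and `validate_rows` are hypothetical names for illustration and are not taken from the project's scripts.

```python
import csv
import io
from datetime import date

# Column names as documented in this README.
REQUIRED_COLUMNS = ["product_id", "date", "price", "discount", "category", "sales_quantity"]


def validate_rows(csv_text):
    """Parse CSV text and return typed rows, raising ValueError on schema problems."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        row = {
            "product_id": int(raw["product_id"]),
            "date": date.fromisoformat(raw["date"]),  # expects YYYY-MM-DD
            "price": float(raw["price"]),
            "discount": float(raw["discount"]),
            "category": raw["category"],
            "sales_quantity": int(raw["sales_quantity"]),
        }
        if not 0 <= row["discount"] <= 100:
            raise ValueError(f"line {line_no}: discount out of range 0-100")
        rows.append(row)
    return rows


sample = """product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
"""
rows = validate_rows(sample)
```

Failing early on a missing column or an out-of-range discount is usually easier to debug than a shape error raised later during training.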

	## 💻 Usage

	### Step 1: Train the Model

	Train the model using the sales dataset:

	```bash
	python train_model.py
	```

	This will:
	1. Load and preprocess the data
	2. Extract features from dates
	3. Encode categorical variables
	4. Train multiple ML models (Linear Regression, Random Forest, and XGBoost if installed)
	5. Prepare time-series data (aggregate daily sales)
	6. Train time-series models (ARIMA, Prophet)
	7. Evaluate each model using MAE, RMSE, and R2 Score
	8. Compare ML vs Time-Series models
	9. Select the best model automatically (across all model types)
	10. Save the model and preprocessing objects
	11. Generate visualizations

	Output:
	- Best model saved to `models/best_model.joblib` (ML) or `models/best_timeseries_model.joblib` (TS)
	- Preprocessing objects saved to `models/preprocessing.joblib` (for ML models)
	- Visualizations saved to `plots/` directory
	- All models metadata saved to `models/all_models_metadata.json`

	### Step 2: Make Predictions

	For ML Models (product-specific predictions):

	Predict demand for a specific product on a date:

	```bash
	python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
	```

	Parameters for ML Models:
	- `--product_id`: Product ID (integer, required)
	- `--date`: Date in YYYY-MM-DD format (required)
	- `--price`: Product price (float, required)
	- `--discount`: Discount percentage 0-100 (float, default: 0)
	- `--category`: Product category (string, required)
	- `--model_type`: Model type - `auto` (default), `ml`, or `timeseries`

	For Time-Series Models (overall daily demand):

	Predict total daily demand across all products:

	```bash
	python predict.py --date 2024-01-15 --model_type timeseries
	```

	Parameters for Time-Series Models:
	- `--date`: Date in YYYY-MM-DD format (required)
	- `--model_type`: Set to `timeseries` to use time-series models

	Example Predictions:

	```bash
	# ML Model - Electronics product with discount
	python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics

	# ML Model - Clothing product without discount
	python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing

	# Time-Series Model - Overall daily demand
	python predict.py --date 2024-07-06 --model_type timeseries

	# Auto-detect best model (default)
	python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports
	```
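The flags above could be declared with `argparse` roughly as follows. This is a sketch, not the actual contents of `predict.py`: `build_parser` and `parse_args` are hypothetical names, and the sketch assumes that `auto` mode still needs the product flags because it may resolve to an ML model.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Predict demand")
    parser.add_argument("--date", required=True, help="Date in YYYY-MM-DD format")
    parser.add_argument("--product_id", type=int)
    parser.add_argument("--price", type=float)
    parser.add_argument("--discount", type=float, default=0.0)
    parser.add_argument("--category")
    parser.add_argument("--model_type", choices=["auto", "ml", "timeseries"], default="auto")
    return parser


def parse_args(argv):
    """Parse argv and enforce the product flags unless time-series mode is requested."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Time-series mode only needs --date, so the product flags cannot be
    # marked required=True; they are checked here instead.
    if args.model_type != "timeseries":
        missing = [n for n in ("product_id", "price", "category") if getattr(args, n) is None]
        if missing:
            parser.error("missing for ML prediction: " + ", ".join("--" + n for n in missing))
    return args


ts_args = parse_args(["--date", "2024-01-15", "--model_type", "timeseries"])
ml_args = parse_args(
    ["--product_id", "1", "--date", "2024-01-15", "--price", "100",
     "--discount", "10", "--category", "Electronics"]
)
```

Checking the product flags after parsing keeps a single parser for both modes while still rejecting an ML request that lacks `--product_id`, `--price`, or `--category`.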

	### Step 3: Launch Interactive Dashboard (Optional)

	Launch the Streamlit dashboard for interactive visualization and predictions:

	```bash
	streamlit run app.py
	```

	The dashboard will open in your default web browser (usually at `http://localhost:8501`).

	Dashboard Features:

	1. 📈 Sales Trends Page
	   - Interactive filters (category, product, date range)
	   - Daily sales trends visualization
	   - Monthly sales trends
	   - Category-wise analysis
	   - Price vs demand relationship
	   - Real-time statistics and metrics

	2. 🔮 Demand Prediction Page
	   - Interactive prediction interface
	   - Select model type (Auto/ML/Time-Series)
	   - For ML models:
	     - Product selection dropdown
	     - Category selection
	     - Price and discount sliders
	     - Date picker
	     - Product statistics display
	   - For Time-Series models:
	     - Date picker for future predictions
	     - Overall daily demand forecast
	   - Prediction insights and recommendations

	3. 📊 Model Comparison Page
	   - Side-by-side model performance comparison
	   - MAE, RMSE, and R2 Score metrics
	   - Visual charts comparing all models
	   - Best model highlighting
	   - Model type indicators (ML vs Time-Series)

	Dashboard Highlights:
	- Interactive widgets for easy data exploration
	- Real-time predictions with visual feedback
	- Comprehensive model comparison charts

	## 🤖 Model Details

	### Models Trained

	1. Linear Regression
	   - Simple linear model
	   - Fast training and prediction
	   - Good baseline model

	2. Random Forest Regressor
	   - Ensemble of decision trees
	   - Handles non-linear relationships
	   - Provides feature importance
	   - Hyperparameters:
	     - n_estimators: 100
	     - max_depth: 15
	     - min_samples_split: 5
	     - min_samples_leaf: 2

	3. XGBoost Regressor (Optional)
	   - Gradient boosting algorithm
	   - Often provides best performance
	   - Handles complex patterns
	   - Hyperparameters:
	     - n_estimators: 100
	     - max_depth: 6
	     - learning_rate: 0.1

	4. ARIMA (AutoRegressive Integrated Moving Average)
	   - Classic time-series forecasting model
	   - Captures trend (through differencing) and autocorrelation; plain ARIMA has no seasonal terms, so seasonal patterns need a seasonal variant such as SARIMA
	   - Automatically selects best order (p, d, q)
	   - Works on aggregated daily sales data
	   - Uses chronological train/validation split

	5. Prophet (Facebook's Time-Series Forecasting)
	   - Designed for business time series
	   - Handles seasonality (weekly, yearly)
	   - Robust to missing data and outliers
	   - Works on aggregated daily sales data
	   - Uses chronological train/validation split

	### Model Comparison: ML vs Time-Series

	Machine Learning Models:
	- ✅ Predict per-product demand
	- ✅ Use product features (price, discount, category)
	- ✅ Can handle new products with similar features
	- ❌ May not capture long-term temporal patterns as well

	Time-Series Models:
	- ✅ Capture temporal patterns and trends
	- ✅ Handle seasonality automatically (Prophet; plain ARIMA models trend and autocorrelation only)
	- ✅ Good for overall demand forecasting
	- ❌ Predict aggregate demand, not per-product
	- ❌ Don't use product-specific features

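	The system automatically selects the best model based on R2 score across all model types. Because ML models predict per-product demand while time-series models predict aggregated daily totals, their R2 scores are computed on different targets; treat the cross-type ranking as a rough guide rather than a like-for-like comparison.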

	### Feature Engineering

	For ML Models:

	The system extracts the following features from the input data:

	Date Features:
	- `day`: Day of month (1-31)
	- `month`: Month (1-12)
	- `day_of_week`: Day of week (0=Monday, 6=Sunday)
	- `weekend`: Binary indicator (1 if weekend, 0 otherwise)
	- `year`: Year
	- `quarter`: Quarter of year (1-4)

	Original Features:
	- `product_id`: Encoded as categorical
	- `price`: Numerical (scaled)
	- `discount`: Numerical (scaled)
	- `category`: Encoded as categorical

	Total Features: 10 features after encoding and scaling
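The six date features can be derived from a single date as sketched below. This is a dependency-free illustration with a hypothetical `date_features` helper; the project itself may compute them differently (for example with pandas), and the sketch assumes that "weekend" means Saturday or Sunday.

```python
from datetime import date


def date_features(d):
    """Return the six date-derived features described above for one date."""
    day_of_week = d.weekday()  # 0 = Monday, 6 = Sunday
    return {
        "day": d.day,
        "month": d.month,
        "day_of_week": day_of_week,
        "weekend": 1 if day_of_week >= 5 else 0,  # assumption: Saturday/Sunday
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
    }


features = date_features(date(2024, 7, 6))  # a Saturday in Q3
```

Together with `product_id`, `price`, `discount`, and `category`, these six values make up the 10 input features.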

	For Time-Series Models:

	- Data is aggregated by date (total daily sales)
	- Uses chronological split (80% train, 20% validation)
	- Prophet automatically handles:
	  - Weekly seasonality
	  - Yearly seasonality
	  - Trend components
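The aggregation and chronological split described above might look like the following sketch. `daily_totals` and `chronological_split` are hypothetical helpers written with the standard library only; the sample rows are made up for the example.

```python
from collections import defaultdict


def daily_totals(rows):
    """Sum sales_quantity per date and return (date, total) pairs in chronological order."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["date"]] += row["sales_quantity"]
    return sorted(totals.items())  # ISO date strings sort chronologically


def chronological_split(series, train_fraction=0.8):
    """Split without shuffling so validation data is strictly later than training data."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Ten days, two illustrative product rows per day.
rows = [
    {"date": f"2020-01-{day:02d}", "sales_quantity": qty}
    for day in range(1, 11)
    for qty in (10, 5)
]
series = daily_totals(rows)
train, valid = chronological_split(series)
```

A random split would leak future observations into training, which is why the cut is made by position in the sorted series.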

	## 📈 Evaluation Metrics

	The system evaluates models using three metrics:

	1. MAE (Mean Absolute Error)
	   - Average absolute difference between predicted and actual values
	   - Lower is better
	   - Units: same as target variable (sales quantity)

	2. RMSE (Root Mean Squared Error)
	   - Square root of average squared differences
	   - Penalizes large errors more than MAE
	   - Lower is better
	   - Units: same as target variable (sales quantity)

	3. R2 Score (Coefficient of Determination)
	   - Proportion of variance explained by the model
	   - Range: -∞ to 1 (1 is perfect prediction)
	   - Higher is better
	   - Used for model selection

	Model Selection: The model with the highest R2 score is selected as the best model.
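The three metrics and the selection rule can be written out directly. The sketch below uses plain Python rather than scikit-learn so the formulas stay visible; `regression_metrics`, `select_best`, and the two model names are hypothetical and the numbers are illustrative, not results from this project.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R2 = 1 - SS_res / SS_tot; undefined when the target is constant.
    r2 = 1 - (mse * n) / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": r2}


def select_best(results):
    """Return the name of the model with the highest R2."""
    return max(results, key=lambda name: results[name]["r2"])


y_true = [10, 20, 30, 40]
results = {
    "baseline_mean": regression_metrics(y_true, [25, 25, 25, 25]),
    "close_fit": regression_metrics(y_true, [12, 18, 33, 39]),
}
best = select_best(results)
```

Predicting the mean of the target for every row gives an R2 of exactly 0, which is why a negative R2 means a model does worse than that trivial baseline.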

	## 📊 Visualizations

	The training script generates several visualizations:

	1. Demand Trends Over Time (`plots/demand_trends.png`)
	   - Shows total daily sales quantity over the entire time period
	   - Helps identify overall trends and patterns

	2. Monthly Average Demand (`plots/monthly_demand.png`)
	   - Bar chart showing average sales by month
	   - Reveals seasonal patterns (e.g., holiday season spikes)

	3. Feature Importance (`plots/feature_importance.png`)
	   - Shows which features are most important for predictions
	   - Only available for tree-based models (Random Forest, XGBoost)

	4. Model Comparison (`plots/model_comparison.png`)
	   - Side-by-side comparison of all models (ML and Time-Series)
	   - Color-coded: ML models (blue) vs Time-Series models (orange/red)
	   - Shows MAE, RMSE, and R2 Score for each model

	5. Time-Series Predictions (`plots/timeseries_predictions.png`)
	   - Actual vs predicted plots for ARIMA and Prophet models
	   - Shows how well time-series models capture temporal patterns
	   - Only generated if time-series models are available

	## 🔮 Example Predictions

	Here are some example predictions to demonstrate the system:

	```bash
	# Example 1: Electronics on a weekday
	python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
	# Expected: Moderate demand (weekday, some discount)

	# Example 2: Clothing on weekend
	python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
	# Expected: Higher demand (weekend, good discount)

	# Example 3: Holiday season prediction
	python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
	# Expected: High demand (holiday season, good discount)
	```

	## 🔧 Technical Details

	### Data Preprocessing Pipeline

	1. Date Conversion: Convert date strings to datetime objects
	2. Feature Extraction: Extract temporal features from dates
	3. Missing Value Handling: Fill missing values with median (if any)
	4. Categorical Encoding: Label encode product_id and category
	5. Feature Scaling: Standardize numerical features using StandardScaler
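Steps 3 and 5 above can be illustrated on a single numeric column. This is a standard-library sketch with hypothetical helper names, not the project's code. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def fit_standardizer(values):
    """Return (mean, std) using the population standard deviation."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Apply z-score scaling; a zero std maps every value to 0.0."""
    return [(v - mean) / std if std else 0.0 for v in values]


prices = [10.0, None, 30.0, 20.0]  # illustrative column with one missing value
filled = fill_missing_with_median(prices)
mean, std = fit_standardizer(filled)
scaled = standardize(filled, mean, std)
```

The mean and standard deviation should be fitted on the training data only and then reused for validation and prediction inputs, which is why the scaler is saved alongside the model.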

	### Model Training Pipeline

	1. Data Splitting: 80% training, 20% validation
	2. Model Training: Train all available models
	3. Evaluation: Calculate MAE, RMSE, and R2 for each model
	4. Selection: Choose model with highest R2 score
	5. Persistence: Save model, encoders, and scaler

	### Prediction Pipeline

	1. Load Model: Load trained model and preprocessing objects
	2. Feature Preparation: Extract features from input parameters
	3. Encoding: Encode categorical variables using saved encoders
	4. Scaling: Scale features using saved scaler
	5. Prediction: Make prediction using loaded model
	6. Post-processing: Ensure non-negative predictions

	### Handling Unseen Data

	The prediction script handles cases where:
	- Product ID was not seen during training (uses default encoding)
	- Category was not seen during training (uses default encoding)

	Warnings are displayed in such cases.
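One way to implement this behaviour, together with the non-negative post-processing step, is sketched below. `FallbackLabelEncoder`, its default code of `-1`, and `clip_non_negative` are assumptions made for illustration; this README does not specify which default encoding the prediction script actually uses.

```python
import warnings


class FallbackLabelEncoder:
    """Map labels to integer codes, sending unseen labels to a reserved default code."""

    def __init__(self, default_code=-1):
        self.default_code = default_code  # assumption: the real default is not documented
        self.codes = {}

    def fit(self, labels):
        # Sort by string form so codes are deterministic across runs.
        self.codes = {label: i for i, label in enumerate(sorted(set(labels), key=str))}
        return self

    def transform_one(self, label):
        if label not in self.codes:
            warnings.warn(f"{label!r} not seen during training; using default encoding")
            return self.default_code
        return self.codes[label]


def clip_non_negative(prediction):
    """Sales quantities cannot be negative, so floor raw model output at zero."""
    return max(0.0, prediction)


encoder = FallbackLabelEncoder().fit(["Electronics", "Clothing", "Toys"])
known = encoder.transform_one("Toys")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unknown = encoder.transform_one("Garden")
```

Predictions for unseen products or categories rest on a placeholder code rather than learned behaviour, so they deserve less trust than predictions for values seen in training.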

	## 🎓 Learning Points

	This project demonstrates:

	1. Supervised Learning: Regression problem solving
	2. Feature Engineering: Creating meaningful features from raw data
	3. Model Comparison: Training and evaluating multiple models
	4. Model Selection: Automatic best model selection
	5. Model Persistence: Saving and loading trained models
	6. Production-Ready Code: Clean, modular, well-documented code
	7. Time Series Features: Extracting temporal patterns
	8. Categorical Encoding: Handling categorical variables
	9. Feature Scaling: Standardizing numerical features so they share a common scale
	10. Evaluation Metrics: Understanding different regression metrics

	## 🐛 Troubleshooting

	### Issue: "Model not found"
	Solution: Run `python train_model.py` first to train and save the model.

	### Issue: "XGBoost not available"
	Solution: Install XGBoost with `pip install xgboost`. If you leave it uninstalled, the system still works and simply skips the XGBoost model.

	### Issue: "Category not seen during training"
	Solution: This is handled automatically with a warning. The system uses a default encoding.

	### Issue: Poor prediction accuracy
	Solutions:
	- Ensure you have sufficient training data
	- Check that input features are in the same range as training data
	- Try retraining with different hyperparameters
	- Consider adding more features or more training data

	## 📝 Notes

	- The synthetic dataset generator creates realistic patterns including:
	  - Weekend effects (higher sales on weekends)
	  - Seasonal patterns (holiday season spikes)
	  - Price and discount effects
	  - Category-specific base prices

	- For production use, consider:
	  - Using real historical data
	  - Retraining models periodically
	  - Adding more features (promotions, weather, etc.)
	  - Implementing model versioning
	  - Adding prediction confidence intervals

	## 📄 License

	This project is provided as-is for educational purposes.

	## 👤 Author

	Created as a complete machine learning project demonstrating demand prediction for e-commerce.

	---

	Happy Predicting! 🚀