# Demand Prediction System for E-commerce
A complete machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised regression models with time-series models (ARIMA, Prophet) and selects the best-performing approach.
## 📋 Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Dataset](#dataset)
- [Usage](#usage)
- [Model Details](#model-details)
- [Evaluation Metrics](#evaluation-metrics)
- [Visualizations](#visualizations)
- [Example Predictions](#example-predictions)
- [Technical Details](#technical-details)
## 🎯 Overview
This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:
1. **Machine Learning Models**: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
2. **Time-Series Models**: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)
The system automatically selects the best performing model across both approaches.
**Key Capabilities:**
- Predicts sales quantity for products on future dates (ML models)
- Predicts overall daily demand (Time-series models)
- Handles temporal patterns and seasonality
- Considers price, discount, category, and date features (ML models)
- Captures time-series patterns and trends (Time-series models)
- Automatically selects the best model from multiple candidates
- Provides comprehensive evaluation metrics
- Compares ML vs Time-Series approaches
## ✨ Features
- **Data Preprocessing**: Automatic handling of missing values, date feature extraction
- **Feature Engineering**:
  - Date features (day, month, day_of_week, weekend, year, quarter)
  - Categorical encoding (product_id, category)
  - Feature scaling
- **Multiple Models**:
  - **Machine Learning Models:**
    - Linear Regression
    - Random Forest Regressor
    - XGBoost Regressor (optional)
  - **Time-Series Models:**
    - ARIMA (AutoRegressive Integrated Moving Average)
    - Prophet (Facebook's time-series forecasting tool)
- **Model Selection**: Automatic best model selection based on R2 score
- **Evaluation Metrics**: MAE, RMSE, and R2 Score
- **Visualizations**:
  - Demand trends over time
  - Monthly average demand
  - Feature importance
  - Model comparison
- **Model Persistence**: Save and load trained models using joblib
- **Future Predictions**: Predict demand for any product on any future date
## ๐Ÿ“ Project Structure
```
demand_prediction/
โ”‚
โ”œโ”€โ”€ data/
โ”‚ โ””โ”€โ”€ sales.csv # Sales dataset
โ”‚
โ”œโ”€โ”€ models/ # Generated during training
โ”‚ โ”œโ”€โ”€ best_model.joblib # Best ML model (if ML is best)
โ”‚ โ”œโ”€โ”€ best_timeseries_model.joblib # Best time-series model (if TS is best)
โ”‚ โ”œโ”€โ”€ preprocessing.joblib # Encoders and scaler (for ML models)
โ”‚ โ”œโ”€โ”€ model_metadata.json # Model metadata (legacy)
โ”‚ โ””โ”€โ”€ all_models_metadata.json # All models comparison metadata
โ”‚
โ”œโ”€โ”€ plots/ # Generated during training
โ”‚ โ”œโ”€โ”€ demand_trends.png # Time series plot
โ”‚ โ”œโ”€โ”€ monthly_demand.png # Monthly averages
โ”‚ โ”œโ”€โ”€ feature_importance.png # Feature importance (ML models)
โ”‚ โ”œโ”€โ”€ model_comparison.png # Model metrics comparison (all models)
โ”‚ โ””โ”€โ”€ timeseries_predictions.png # Time-series model predictions
โ”‚
โ”œโ”€โ”€ generate_dataset.py # Script to generate synthetic dataset
โ”œโ”€โ”€ train_model.py # Main training script
โ”œโ”€โ”€ predict.py # Prediction script
โ”œโ”€โ”€ app.py # Streamlit dashboard (interactive web app)
โ”œโ”€โ”€ requirements.txt # Python dependencies
โ””โ”€โ”€ README.md # This file
```
## 🚀 Installation
### Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
### Step 1: Navigate to Project Directory
```bash
cd demand_prediction
```
### Step 2: Create Virtual Environment (Recommended)
**Why use a virtual environment?**
- Keeps project dependencies isolated from your system Python
- Prevents conflicts with other projects
- Makes it easier to manage package versions
- Best practice for Python projects
**Quick Setup (Recommended):**
**Windows:**
```bash
setup_env.bat
```
**Linux/Mac:**
```bash
chmod +x setup_env.sh
./setup_env.sh
```
**Manual Setup:**
**Windows:**
```bash
python -m venv venv
venv\Scripts\activate
```
**Linux/Mac:**
```bash
python3 -m venv venv
source venv/bin/activate
```
After activation, you should see `(venv)` in your terminal prompt.
**To deactivate later:**
```bash
deactivate
```
### Step 3: Install Dependencies
```bash
pip install -r requirements.txt
```
**Note**: XGBoost is optional. If you don't want to use it, remove it from `requirements.txt`; training then skips the XGBoost model and everything else works as before.
**Alternative (without virtual environment):**
If you prefer not to use a virtual environment, you can install directly:
```bash
pip install -r requirements.txt
```
However, this is **not recommended** as it may cause conflicts with other Python projects.
### Step 4: Generate Dataset
If you don't have a dataset, generate a synthetic one:
```bash
python generate_dataset.py
```
This will create `data/sales.csv` with realistic e-commerce sales data.
## 📊 Dataset
The dataset should contain the following columns:
- **product_id**: Unique identifier for each product (integer)
- **date**: Date of sale (YYYY-MM-DD format)
- **price**: Product price (float)
- **discount**: Discount percentage (0-100, float)
- **category**: Product category (string)
- **sales_quantity**: Target variable - number of units sold (integer)
### Dataset Format Example
```csv
product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
...
```
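Before training, it can help to confirm that a file matches this schema. The following is a minimal sketch using only the standard library; `validate_rows` and `REQUIRED_COLUMNS` are hypothetical names for illustration and are not part of the project's scripts.

```python
import csv
import io
from datetime import datetime

# Column names taken from the dataset description above.
REQUIRED_COLUMNS = [
    "product_id", "date", "price", "discount", "category", "sales_quantity",
]


def validate_rows(csv_text):
    """Parse CSV text into typed rows, raising ValueError on bad input."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        row = {
            "product_id": int(raw["product_id"]),
            "date": datetime.strptime(raw["date"], "%Y-%m-%d").date(),
            "price": float(raw["price"]),
            "discount": float(raw["discount"]),
            "category": raw["category"],
            "sales_quantity": int(raw["sales_quantity"]),
        }
        if not 0 <= row["discount"] <= 100:
            raise ValueError(f"line {line_no}: discount out of range")
        rows.append(row)
    return rows


sample = """product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
"""
rows = validate_rows(sample)
print(rows[0])
```

To check a real file, pass its contents, for example `validate_rows(open("data/sales.csv").read())`.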
## 💻 Usage
### Step 1: Train the Model
Train the model using the sales dataset:
```bash
python train_model.py
```
This will:
1. Load and preprocess the data
2. Extract features from dates
3. Encode categorical variables
4. Train multiple ML models (Linear Regression, Random Forest, XGBoost)
5. Prepare time-series data (aggregate daily sales)
6. Train time-series models (ARIMA, Prophet)
7. Evaluate each model using MAE, RMSE, and R2 Score
8. Compare ML vs Time-Series models
9. Select the best model automatically (across all model types)
10. Save the model and preprocessing objects
11. Generate visualizations
**Output:**
- Best model saved to `models/best_model.joblib` (ML) or `models/best_timeseries_model.joblib` (TS)
- Preprocessing objects saved to `models/preprocessing.joblib` (for ML models)
- Visualizations saved to `plots/` directory
- All models metadata saved to `models/all_models_metadata.json`
### Step 2: Make Predictions
**For ML Models (product-specific predictions):**
Predict demand for a specific product on a date:
```bash
python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
```
**Parameters for ML Models:**
- `--product_id`: Product ID (integer, required)
- `--date`: Date in YYYY-MM-DD format (required)
- `--price`: Product price (float, required)
- `--discount`: Discount percentage 0-100 (float, default: 0)
- `--category`: Product category (string, required)
- `--model_type`: Model type - `auto` (default), `ml`, or `timeseries`
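For readers who want to see how such a command line could be wired up, here is a rough `argparse` sketch that mirrors the flags listed above. It is not the actual parser in `predict.py`, which is not shown here; `build_parser`, `parse_date`, and the rule that product arguments are only enforced when `--model_type` is not `timeseries` are assumptions made for illustration.

```python
import argparse
from datetime import datetime


def parse_date(value):
    """Accept only YYYY-MM-DD, the format the CLI documents."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Predict product demand")
    parser.add_argument("--product_id", type=int)
    parser.add_argument("--date", type=parse_date, required=True)
    parser.add_argument("--price", type=float)
    parser.add_argument("--discount", type=float, default=0.0)
    parser.add_argument("--category")
    parser.add_argument(
        "--model_type", choices=["auto", "ml", "timeseries"], default="auto"
    )
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: product arguments are needed unless a time-series
    # forecast was explicitly requested.
    if args.model_type != "timeseries":
        missing = [
            name for name in ("product_id", "price", "category")
            if getattr(args, name) is None
        ]
        if missing:
            parser.error("missing arguments: " + ", ".join(missing))
    return args


args = parse_args([
    "--product_id", "1", "--date", "2024-01-15",
    "--price", "100", "--discount", "10", "--category", "Electronics",
])
print(args)
```

Validating the date in a `type=` callback means a malformed date is rejected with a usage message before any model is loaded.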
**For Time-Series Models (overall daily demand):**
Predict total daily demand across all products:
```bash
python predict.py --date 2024-01-15 --model_type timeseries
```
**Parameters for Time-Series Models:**
- `--date`: Date in YYYY-MM-DD format (required)
- `--model_type`: Set to `timeseries` to use time-series models
**Example Predictions:**
```bash
# ML Model - Electronics product with discount
python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics
# ML Model - Clothing product without discount
python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing
# Time-Series Model - Overall daily demand
python predict.py --date 2024-07-06 --model_type timeseries
# Auto-detect best model (default)
python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports
```
### Step 3: Launch Interactive Dashboard (Optional)
Launch the Streamlit dashboard for interactive visualization and predictions:
```bash
streamlit run app.py
```
The dashboard will open in your default web browser (usually at `http://localhost:8501`).
**Dashboard Features:**
1. **📈 Sales Trends Page**
   - Interactive filters (category, product, date range)
   - Daily sales trends visualization
   - Monthly sales trends
   - Category-wise analysis
   - Price vs demand relationship
   - Real-time statistics and metrics
2. **🔮 Demand Prediction Page**
   - Interactive prediction interface
   - Select model type (Auto/ML/Time-Series)
   - For ML models:
     - Product selection dropdown
     - Category selection
     - Price and discount sliders
     - Date picker
     - Product statistics display
   - For Time-Series models:
     - Date picker for future predictions
     - Overall daily demand forecast
   - Prediction insights and recommendations
3. **📊 Model Comparison Page**
   - Side-by-side model performance comparison
   - MAE, RMSE, and R2 Score metrics
   - Visual charts comparing all models
   - Best model highlighting
   - Model type indicators (ML vs Time-Series)
**Dashboard Screenshots:**
- Interactive widgets for easy data exploration
- Real-time predictions with visual feedback
- Comprehensive model comparison charts
## 🤖 Model Details
### Models Trained
1. **Linear Regression**
   - Simple linear model
   - Fast training and prediction
   - Good baseline model
2. **Random Forest Regressor**
   - Ensemble of decision trees
   - Handles non-linear relationships
   - Provides feature importance
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 15
     - min_samples_split: 5
     - min_samples_leaf: 2
3. **XGBoost Regressor** (Optional)
   - Gradient boosting algorithm
   - Often the strongest performer on tabular data
   - Handles complex patterns
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 6
     - learning_rate: 0.1
4. **ARIMA** (AutoRegressive Integrated Moving Average)
   - Classic time-series forecasting model
   - Captures trends and short-term autocorrelation (explicit seasonality requires a seasonal variant such as SARIMA)
   - Automatically selects best order (p, d, q)
   - Works on aggregated daily sales data
   - Uses chronological train/validation split
5. **Prophet** (Facebook's Time-Series Forecasting)
   - Designed for business time series
   - Handles seasonality (weekly, yearly)
   - Robust to missing data and outliers
   - Works on aggregated daily sales data
   - Uses chronological train/validation split
### Model Comparison: ML vs Time-Series
**Machine Learning Models:**
- ✅ Predict per-product demand
- ✅ Use product features (price, discount, category)
- ✅ Can handle new products with similar features
- ❌ May not capture long-term temporal patterns as well

**Time-Series Models:**
- ✅ Capture temporal patterns and trends
- ✅ Handle seasonality automatically
- ✅ Good for overall demand forecasting
- ❌ Predict aggregate demand, not per-product
- ❌ Don't use product-specific features

**The system automatically selects the best model based on R2 score across all model types.** Because the ML models are scored on per-product rows and the time-series models on aggregated daily totals, the two groups' R2 values describe different targets and should be read with that in mind.
### Feature Engineering
**For ML Models:**
The system extracts the following features from the input data:
**Date Features:**
- `day`: Day of month (1-31)
- `month`: Month (1-12)
- `day_of_week`: Day of week (0=Monday, 6=Sunday)
- `weekend`: Binary indicator (1 if weekend, 0 otherwise)
- `year`: Year
- `quarter`: Quarter of year (1-4)
**Original Features:**
- `product_id`: Encoded as categorical
- `price`: Numerical (scaled)
- `discount`: Numerical (scaled)
- `category`: Encoded as categorical
**Total Features**: 10 features after encoding and scaling
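The six date features listed above can be derived from a single date value. The sketch below uses only the standard library; `extract_date_features` is a hypothetical helper name, and the project's own implementation may differ (for example, it may use pandas).

```python
from datetime import date


def extract_date_features(d):
    """Return the six date features described above for a datetime.date."""
    day_of_week = d.weekday()  # 0 = Monday, 6 = Sunday
    return {
        "day": d.day,
        "month": d.month,
        "day_of_week": day_of_week,
        "weekend": 1 if day_of_week >= 5 else 0,  # Saturday or Sunday
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
    }


features = extract_date_features(date(2024, 7, 6))  # a Saturday
print(features)
```

Together with `product_id`, `price`, `discount`, and `category`, these six values make up the 10 model inputs.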
**For Time-Series Models:**
- Data is aggregated by date (total daily sales)
- Uses chronological split (80% train, 20% validation)
- Prophet automatically handles:
  - Weekly seasonality
  - Yearly seasonality
  - Trend components
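The aggregation and chronological split described above can be sketched in a few lines. This is an illustration under assumed names (`aggregate_daily`, `chronological_split`) with made-up rows, not the code in `train_model.py`.

```python
from collections import defaultdict


def aggregate_daily(rows):
    """Sum sales_quantity per date; return (date, total) pairs in date order."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["date"]] += row["sales_quantity"]
    return sorted(totals.items())  # ISO date strings sort chronologically


def chronological_split(series, train_fraction=0.8):
    """Split without shuffling, so validation data is later than training data."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Made-up rows for illustration only.
rows = [
    {"date": "2020-01-01", "sales_quantity": 45},
    {"date": "2020-01-01", "sales_quantity": 120},
    {"date": "2020-01-02", "sales_quantity": 30},
    {"date": "2020-01-03", "sales_quantity": 60},
    {"date": "2020-01-04", "sales_quantity": 75},
    {"date": "2020-01-05", "sales_quantity": 90},
]
daily = aggregate_daily(rows)
train, valid = chronological_split(daily)
print(len(train), len(valid))
```

A random split would leak future observations into training, which is why time-series evaluation keeps the original order.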
## 📈 Evaluation Metrics
The system evaluates models using three metrics:
1. **MAE (Mean Absolute Error)**
   - Average absolute difference between predicted and actual values
   - Lower is better
   - Units: same as target variable (sales quantity)
2. **RMSE (Root Mean Squared Error)**
   - Square root of average squared differences
   - Penalizes large errors more than MAE
   - Lower is better
   - Units: same as target variable (sales quantity)
3. **R2 Score (Coefficient of Determination)**
   - Proportion of variance explained by the model
   - Range: -∞ to 1 (1 is perfect prediction)
   - Higher is better
   - Used for model selection

**Model Selection**: The model with the highest R2 score is selected as the best model.
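To make the three definitions concrete, here is a small pure-Python sketch that computes them from two lists. `regression_metrics` is a hypothetical helper and the numbers are made up; the project itself may rely on a library such as scikit-learn for these calculations.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot  # undefined if y_true is constant
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up values for illustration only.
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(metrics)
```

Here the errors are -2, 2, -3, and 3, so MAE is 2.5, RMSE is the square root of 6.5, and R2 is 1 - 26/500 = 0.948.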
## 📊 Visualizations
The training script generates several visualizations:
1. **Demand Trends Over Time** (`plots/demand_trends.png`)
   - Shows total daily sales quantity over the entire time period
   - Helps identify overall trends and patterns
2. **Monthly Average Demand** (`plots/monthly_demand.png`)
   - Bar chart showing average sales by month
   - Reveals seasonal patterns (e.g., holiday season spikes)
3. **Feature Importance** (`plots/feature_importance.png`)
   - Shows which features are most important for predictions
   - Only available for tree-based models (Random Forest, XGBoost)
4. **Model Comparison** (`plots/model_comparison.png`)
   - Side-by-side comparison of all models (ML and Time-Series)
   - Color-coded: ML models (blue) vs Time-Series models (orange/red)
   - Shows MAE, RMSE, and R2 Score for each model
5. **Time-Series Predictions** (`plots/timeseries_predictions.png`)
   - Actual vs predicted plots for ARIMA and Prophet models
   - Shows how well time-series models capture temporal patterns
   - Only generated if time-series models are available
## 🔮 Example Predictions
Here are some example predictions to demonstrate the system:
```bash
# Example 1: Electronics on a weekday
python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
# Expected: Moderate demand (weekday, some discount)
# Example 2: Clothing on weekend
python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
# Expected: Higher demand (weekend, good discount)
# Example 3: Holiday season prediction
python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
# Expected: High demand (holiday season, good discount)
```
## 🔧 Technical Details
### Data Preprocessing Pipeline
1. **Date Conversion**: Convert date strings to datetime objects
2. **Feature Extraction**: Extract temporal features from dates
3. **Missing Value Handling**: Fill missing values with median (if any)
4. **Categorical Encoding**: Label encode product_id and category
5. **Feature Scaling**: Standardize numerical features using StandardScaler
### Model Training Pipeline
1. **Data Splitting**: 80% training, 20% validation
2. **Model Training**: Train all available models
3. **Evaluation**: Calculate MAE, RMSE, and R2 for each model
4. **Selection**: Choose model with highest R2 score
5. **Persistence**: Save model, encoders, and scaler
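The selection step in this pipeline amounts to taking the maximum over R2 scores. A minimal sketch follows; the metric values are placeholders rather than results from this project, and the JSON layout is an assumption, not the actual format of `all_models_metadata.json`.

```python
import json


def select_best_model(results):
    """Return the name of the model with the highest R2 score."""
    return max(results, key=lambda name: results[name]["r2"])


# Placeholder numbers for illustration only, not measured results.
results = {
    "Linear Regression": {"type": "ml", "mae": 9.1, "rmse": 12.4, "r2": 0.61},
    "Random Forest": {"type": "ml", "mae": 5.3, "rmse": 7.8, "r2": 0.85},
    "ARIMA": {"type": "timeseries", "mae": 40.2, "rmse": 55.0, "r2": 0.42},
}
best = select_best_model(results)

# Hypothetical metadata layout; the project's real file may differ.
metadata = json.dumps({"best_model": best, "models": results}, indent=2)
print(best)
```

Storing every model's metrics alongside the winner keeps the comparison reproducible after training.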
### Prediction Pipeline
1. **Load Model**: Load trained model and preprocessing objects
2. **Feature Preparation**: Extract features from input parameters
3. **Encoding**: Encode categorical variables using saved encoders
4. **Scaling**: Scale features using saved scaler
5. **Prediction**: Make prediction using loaded model
6. **Post-processing**: Ensure non-negative predictions
### Handling Unseen Data
The prediction script handles cases where:
- Product ID was not seen during training (uses default encoding)
- Category was not seen during training (uses default encoding)
Warnings are displayed in such cases.
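One way to implement this kind of fallback is a label encoder that returns a default code and emits a warning for unknown values. The class below is a hypothetical sketch: `SafeLabelEncoder` and its default code of 0 are assumptions, since the document does not state which default the project uses.

```python
import warnings


class SafeLabelEncoder:
    """Map labels to integer codes, falling back to a default for unseen labels."""

    def __init__(self, default_code=0):
        self.default_code = default_code  # assumed default; not from the project
        self.codes = {}

    def fit(self, labels):
        # Sort so the mapping is stable regardless of input order.
        self.codes = {label: i for i, label in enumerate(sorted(set(labels)))}
        return self

    def transform_one(self, label):
        if label not in self.codes:
            warnings.warn(
                f"{label!r} was not seen during training; using default encoding"
            )
            return self.default_code
        return self.codes[label]


encoder = SafeLabelEncoder().fit(["Electronics", "Clothing", "Sports"])
known = encoder.transform_one("Sports")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unseen = encoder.transform_one("Garden")

print(known, unseen, len(caught))
```

In this sketch the default code coincides with an existing class, so an unseen category is predicted as if it were the first known one. Predictions made after such a warning are best treated as rough estimates.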
## 🎓 Learning Points
This project demonstrates:
1. **Supervised Learning**: Regression problem solving
2. **Feature Engineering**: Creating meaningful features from raw data
3. **Model Comparison**: Training and evaluating multiple models
4. **Model Selection**: Automatic best model selection
5. **Model Persistence**: Saving and loading trained models
6. **Production-Ready Code**: Clean, modular, well-documented code
7. **Time Series Features**: Extracting temporal patterns
8. **Categorical Encoding**: Handling categorical variables
9. **Feature Scaling**: Normalizing features for better performance
10. **Evaluation Metrics**: Understanding different regression metrics
## ๐Ÿ› Troubleshooting
### Issue: "Model not found"
**Solution**: Run `python train_model.py` first to train and save the model.
### Issue: "XGBoost not available"
**Solution**: Install XGBoost with `pip install xgboost`. If you leave it uninstalled, the system still works and simply skips the XGBoost model.
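A quick way to check whether XGBoost is importable in the active environment is shown below. This is a generic sketch, not the detection logic used by `train_model.py`; `available_ml_models` is a hypothetical helper name.

```python
import importlib.util


def available_ml_models():
    """List ML model names, adding XGBoost only if the package is importable."""
    names = ["Linear Regression", "Random Forest"]
    if importlib.util.find_spec("xgboost") is not None:
        names.append("XGBoost")
    return names


models = available_ml_models()
print(models)
```

If `XGBoost` is missing from the printed list while the virtual environment is active, the package is not installed in that environment.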
### Issue: "Category not seen during training"
**Solution**: This is handled automatically with a warning. The system uses a default encoding.
### Issue: Poor prediction accuracy
**Solutions**:
- Ensure you have sufficient training data
- Check that input features are in the same range as training data
- Try retraining with different hyperparameters
- Consider adding more features or more training data
## ๐Ÿ“ Notes
- The synthetic dataset generator creates realistic patterns including:
- Weekend effects (higher sales on weekends)
- Seasonal patterns (holiday season spikes)
- Price and discount effects
- Category-specific base prices
- For production use, consider:
- Using real historical data
- Retraining models periodically
- Adding more features (promotions, weather, etc.)
- Implementing model versioning
- Adding prediction confidence intervals
## 📄 License
This project is provided as-is for educational purposes.
## 👤 Author
Created as a complete machine learning project demonstrating demand prediction for e-commerce.
---
**Happy Predicting! 🚀**