# Demand Prediction System for E-commerce

A complete machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised regression models with time-series models (ARIMA, Prophet) to find the best approach.

## 📋 Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Dataset](#dataset)
- [Usage](#usage)
- [Model Details](#model-details)
- [Evaluation Metrics](#evaluation-metrics)
- [Visualizations](#visualizations)
- [Example Predictions](#example-predictions)
- [Technical Details](#technical-details)

## 🎯 Overview

This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:

1. **Machine Learning Models**: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
2. **Time-Series Models**: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)

The system automatically selects the best performing model across both approaches.

**Key Capabilities:**
- Predicts sales quantity for products on future dates (ML models)
- Predicts overall daily demand (Time-series models)
- Handles temporal patterns and seasonality
- Considers price, discount, category, and date features (ML models)
- Captures time-series patterns and trends (Time-series models)
- Automatically selects the best model from multiple candidates
- Provides comprehensive evaluation metrics
- Compares ML vs Time-Series approaches

## ✨ Features

- **Data Preprocessing**: Automatic handling of missing values, date feature extraction
- **Feature Engineering**:
  - Date features (day, month, day_of_week, weekend, year, quarter)
  - Categorical encoding (product_id, category)
  - Feature scaling
- **Multiple Models**:
  - **Machine Learning Models:**
    - Linear Regression
    - Random Forest Regressor
    - XGBoost Regressor (optional)
  - **Time-Series Models:**
    - ARIMA (AutoRegressive Integrated Moving Average)
    - Prophet (Facebook's time-series forecasting tool)
- **Model Selection**: Automatic best model selection based on R2 score
- **Evaluation Metrics**: MAE, RMSE, and R2 Score
- **Visualizations**:
  - Demand trends over time
  - Monthly average demand
  - Feature importance
  - Model comparison
- **Model Persistence**: Save and load trained models using joblib
- **Future Predictions**: Predict demand for any product on any future date



## 📁 Project Structure



```
demand_prediction/
│
├── data/
│   └── sales.csv                    # Sales dataset
│
├── models/                          # Generated during training
│   ├── best_model.joblib            # Best ML model (if ML is best)
│   ├── best_timeseries_model.joblib # Best time-series model (if TS is best)
│   ├── preprocessing.joblib         # Encoders and scaler (for ML models)
│   ├── model_metadata.json          # Model metadata (legacy)
│   └── all_models_metadata.json     # All models comparison metadata
│
├── plots/                           # Generated during training
│   ├── demand_trends.png            # Time series plot
│   ├── monthly_demand.png           # Monthly averages
│   ├── feature_importance.png       # Feature importance (ML models)
│   ├── model_comparison.png         # Model metrics comparison (all models)
│   └── timeseries_predictions.png   # Time-series model predictions
│
├── generate_dataset.py              # Script to generate synthetic dataset
├── train_model.py                   # Main training script
├── predict.py                       # Prediction script
├── app.py                           # Streamlit dashboard (interactive web app)
├── requirements.txt                 # Python dependencies
└── README.md                        # This file
```



## 🚀 Installation



### Prerequisites



- Python 3.8 or higher
- pip (Python package manager)



### Step 1: Navigate to Project Directory



```bash
cd demand_prediction
```



### Step 2: Create Virtual Environment (Recommended)



**Why use a virtual environment?**

- Keeps project dependencies isolated from your system Python
- Prevents conflicts with other projects
- Makes it easier to manage package versions
- Best practice for Python projects



**Quick Setup (Recommended):**



**Windows:**
```bash
setup_env.bat
```

**Linux/Mac:**
```bash
chmod +x setup_env.sh
./setup_env.sh
```

**Manual Setup:**

**Windows:**
```bash
python -m venv venv
venv\Scripts\activate
```

**Linux/Mac:**
```bash
python3 -m venv venv
source venv/bin/activate
```

After activation, you should see `(venv)` in your terminal prompt.

**To deactivate later:**
```bash
deactivate
```

### Step 3: Install Dependencies

```bash
pip install -r requirements.txt
```

**Note**: XGBoost is optional. If you don't want it, remove it from `requirements.txt`; the system still works and simply skips XGBoost training.

**Alternative (without virtual environment):**
If you prefer not to use a virtual environment, you can install directly:
```bash
pip install -r requirements.txt
```
However, this is **not recommended** as it may cause conflicts with other Python projects.

### Step 4: Generate Dataset

If you don't have a dataset, generate a synthetic one:

```bash
python generate_dataset.py
```

This will create `data/sales.csv` with realistic e-commerce sales data.

## 📊 Dataset

The dataset should contain the following columns:

- **product_id**: Unique identifier for each product (integer)
- **date**: Date of sale (YYYY-MM-DD format)
- **price**: Product price (float)
- **discount**: Discount percentage (0-100, float)
- **category**: Product category (string)
- **sales_quantity**: Target variable - number of units sold (integer)

### Dataset Format Example

```csv
product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
...
```
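
A file can be checked against this schema before training. The following is a minimal, standard-library sketch; the `validate_rows` helper and the inline sample are illustrative assumptions and are not part of the project's scripts.

```python
import csv
import io
from datetime import datetime

# Column names as listed in the Dataset section.
EXPECTED_COLUMNS = ["product_id", "date", "price", "discount", "category", "sales_quantity"]


def validate_rows(handle):
    """Parse CSV rows and raise ValueError on schema problems (hypothetical helper)."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        row = {
            "product_id": int(raw["product_id"]),
            "date": datetime.strptime(raw["date"], "%Y-%m-%d").date(),
            "price": float(raw["price"]),
            "discount": float(raw["discount"]),
            "category": raw["category"].strip(),
            "sales_quantity": int(raw["sales_quantity"]),
        }
        if not 0 <= row["discount"] <= 100:
            raise ValueError(f"line {line_no}: discount must be within 0-100")
        if row["sales_quantity"] < 0:
            raise ValueError(f"line {line_no}: sales_quantity must be non-negative")
        rows.append(row)
    return rows


sample = io.StringIO(
    "product_id,date,price,discount,category,sales_quantity\n"
    "1,2020-01-01,499.99,10,Electronics,45\n"
    "2,2020-01-01,29.99,0,Clothing,120\n"
)
rows = validate_rows(sample)
print(len(rows), rows[0]["category"])
```

To check a real file, pass an open file handle instead of the in-memory sample, for example `validate_rows(open("data/sales.csv", newline=""))`.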

## 💻 Usage

### Step 1: Train the Model

Train the model using the sales dataset:

```bash
python train_model.py
```

This will:
1. Load and preprocess the data
2. Extract features from dates
3. Encode categorical variables
4. Train multiple ML models (Linear Regression, Random Forest, XGBoost)
5. Prepare time-series data (aggregate daily sales)
6. Train time-series models (ARIMA, Prophet)
7. Evaluate each model using MAE, RMSE, and R2 Score
8. Compare ML vs Time-Series models
9. Select the best model automatically (across all model types)
10. Save the model and preprocessing objects
11. Generate visualizations

**Output:**
- Best model saved to `models/best_model.joblib` (ML) or `models/best_timeseries_model.joblib` (TS)
- Preprocessing objects saved to `models/preprocessing.joblib` (for ML models)
- Visualizations saved to `plots/` directory
- All models metadata saved to `models/all_models_metadata.json`

### Step 2: Make Predictions

**For ML Models (product-specific predictions):**

Predict demand for a specific product on a date:

```bash
python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
```

**Parameters for ML Models:**
- `--product_id`: Product ID (integer, required)
- `--date`: Date in YYYY-MM-DD format (required)
- `--price`: Product price (float, required)
- `--discount`: Discount percentage 0-100 (float, default: 0)
- `--category`: Product category (string, required)
- `--model_type`: Model type - `auto` (default), `ml`, or `timeseries`

**For Time-Series Models (overall daily demand):**

Predict total daily demand across all products:

```bash
python predict.py --date 2024-01-15 --model_type timeseries
```

**Parameters for Time-Series Models:**
- `--date`: Date in YYYY-MM-DD format (required)
- `--model_type`: Set to `timeseries` to use time-series models
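
The two parameter sets above could be handled by a single `argparse` parser. This is a hypothetical sketch that mirrors the documented flags; the real `predict.py` may be structured differently, and requiring the product fields in `auto` mode is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags; the real predict.py may differ."""
    parser = argparse.ArgumentParser(description="Predict product demand")
    parser.add_argument("--product_id", type=int, help="Product ID (ML models)")
    parser.add_argument("--date", required=True, help="Date in YYYY-MM-DD format")
    parser.add_argument("--price", type=float, help="Product price (ML models)")
    parser.add_argument("--discount", type=float, default=0.0, help="Discount percentage 0-100")
    parser.add_argument("--category", help="Product category (ML models)")
    parser.add_argument("--model_type", choices=["auto", "ml", "timeseries"], default="auto")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: product fields are needed whenever an ML model may be used.
    if args.model_type != "timeseries":
        missing = [n for n in ("product_id", "price", "category") if getattr(args, n) is None]
        if missing:
            parser.error("ML predictions require: " + ", ".join("--" + m for m in missing))
    return args


ts_args = parse_args(["--date", "2024-01-15", "--model_type", "timeseries"])
ml_args = parse_args(
    ["--product_id", "1", "--date", "2024-01-15", "--price", "100", "--category", "Electronics"]
)
print(ts_args.model_type, ml_args.model_type, ml_args.discount)
```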

**Example Predictions:**

```bash
# ML Model - Electronics product with discount
python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics

# ML Model - Clothing product without discount
python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing

# Time-Series Model - Overall daily demand
python predict.py --date 2024-07-06 --model_type timeseries

# Auto-detect best model (default)
python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports
```

### Step 3: Launch Interactive Dashboard (Optional)

Launch the Streamlit dashboard for interactive visualization and predictions:

```bash
streamlit run app.py
```

The dashboard will open in your default web browser (usually at `http://localhost:8501`).

**Dashboard Features:**

1. **📈 Sales Trends Page**
   - Interactive filters (category, product, date range)
   - Daily sales trends visualization
   - Monthly sales trends
   - Category-wise analysis
   - Price vs demand relationship
   - Real-time statistics and metrics

2. **🔮 Demand Prediction Page**
   - Interactive prediction interface
   - Select model type (Auto/ML/Time-Series)
   - For ML models:
     - Product selection dropdown
     - Category selection
     - Price and discount sliders
     - Date picker
     - Product statistics display
   - For Time-Series models:
     - Date picker for future predictions
     - Overall daily demand forecast
   - Prediction insights and recommendations

3. **📊 Model Comparison Page**
   - Side-by-side model performance comparison
   - MAE, RMSE, and R2 Score metrics
   - Visual charts comparing all models
   - Best model highlighting
   - Model type indicators (ML vs Time-Series)

**Dashboard Highlights:**
- Interactive widgets for easy data exploration
- Real-time predictions with visual feedback
- Comprehensive model comparison charts

## 🤖 Model Details

### Models Trained

1. **Linear Regression**
   - Simple linear model
   - Fast training and prediction
   - Good baseline model

2. **Random Forest Regressor**
   - Ensemble of decision trees
   - Handles non-linear relationships
   - Provides feature importance
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 15
     - min_samples_split: 5
     - min_samples_leaf: 2

3. **XGBoost Regressor** (Optional)
   - Gradient boosting algorithm
   - Often provides best performance
   - Handles complex patterns
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 6
     - learning_rate: 0.1



4. **ARIMA** (AutoRegressive Integrated Moving Average)
   - Classic time-series forecasting model
   - Captures trend and autocorrelation; seasonality is modelled only when a seasonal variant (SARIMA) is used
   - Automatically selects best order (p, d, q)
   - Works on aggregated daily sales data
   - Uses chronological train/validation split

5. **Prophet** (Facebook's Time-Series Forecasting)
   - Designed for business time series
   - Handles seasonality (weekly, yearly)
   - Robust to missing data and outliers
   - Works on aggregated daily sales data
   - Uses chronological train/validation split



### Model Comparison: ML vs Time-Series



**Machine Learning Models:**
- ✅ Predict per-product demand
- ✅ Use product features (price, discount, category)
- ✅ Can handle new products with similar features
- ❌ May not capture long-term temporal patterns as well

**Time-Series Models:**
- ✅ Capture temporal patterns and trends
- ✅ Handle seasonality automatically
- ✅ Good for overall demand forecasting
- ❌ Predict aggregate demand, not per-product
- ❌ Don't use product-specific features



**The system automatically selects the best model based on R2 score across all model types.**



### Feature Engineering



**For ML Models:**

The system extracts the following features from the input data:

**Date Features:**
- `day`: Day of month (1-31)
- `month`: Month (1-12)
- `day_of_week`: Day of week (0=Monday, 6=Sunday)
- `weekend`: Binary indicator (1 if weekend, 0 otherwise)
- `year`: Year
- `quarter`: Quarter of year (1-4)



**Original Features:**

- `product_id`: Encoded as categorical
- `price`: Numerical (scaled)
- `discount`: Numerical (scaled)
- `category`: Encoded as categorical

**Total Features**: 10 features after encoding and scaling
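
The six date features can be derived from a single date with the standard library. This is a minimal sketch, assuming that "weekend" means Saturday or Sunday; the function name `extract_date_features` is hypothetical.

```python
from datetime import date


def extract_date_features(d):
    """Derive the six date features listed above from a datetime.date."""
    day_of_week = d.weekday()  # 0=Monday, 6=Sunday
    return {
        "day": d.day,
        "month": d.month,
        "day_of_week": day_of_week,
        "weekend": 1 if day_of_week >= 5 else 0,  # assumption: Saturday and Sunday
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
    }


features = extract_date_features(date(2024, 7, 6))  # a Saturday
print(features)
```

Together with `product_id`, `price`, `discount`, and `category`, these six values make up the 10 model inputs.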

**For Time-Series Models:**

- Data is aggregated by date (total daily sales)
- Uses chronological split (80% train, 20% validation)
- Prophet automatically handles:
  - Weekly seasonality
  - Yearly seasonality
  - Trend components
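
The aggregation and chronological split can be sketched without any dependencies. The helper names and the toy rows below are illustrative assumptions; the training script may implement these steps differently.

```python
from collections import defaultdict


def aggregate_daily(rows):
    """Sum sales_quantity per date and return (date, total) pairs sorted by date."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["date"]] += row["sales_quantity"]
    return sorted(totals.items())  # ISO dates sort chronologically as strings


def chronological_split(series, train_fraction=0.8):
    """Split without shuffling so validation data is strictly later than training data."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Toy data: ten days, two products per day.
rows = [
    {"date": f"2020-01-{day:02d}", "sales_quantity": qty}
    for day in range(1, 11)
    for qty in (10, 5)
]
daily = aggregate_daily(rows)
train, valid = chronological_split(daily)
print(len(train), len(valid), train[-1], valid[0])
```

Keeping the split in time order avoids leaking future observations into the training set, which a random split would do.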

## 📈 Evaluation Metrics

The system evaluates models using three metrics:

1. **MAE (Mean Absolute Error)**
   - Average absolute difference between predicted and actual values
   - Lower is better
   - Units: same as target variable (sales quantity)

2. **RMSE (Root Mean Squared Error)**
   - Square root of average squared differences
   - Penalizes large errors more than MAE
   - Lower is better
   - Units: same as target variable (sales quantity)

3. **R2 Score (Coefficient of Determination)**
   - Proportion of variance explained by the model
   - Range: -∞ to 1 (1 is perfect prediction)
   - Higher is better
   - Used for model selection

**Model Selection**: The model with the highest R2 score is selected as the best model.
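
The three metrics and the selection rule can be written out in plain Python to make the definitions concrete. This is an illustrative sketch with made-up numbers; scikit-learn's `sklearn.metrics` module provides equivalent functions.

```python
import math


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2_score(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot  # undefined when all actual values are equal


# Made-up validation targets and predictions from two hypothetical models.
actual = [10, 20, 30, 40]
candidates = {
    "model_a": [12, 18, 33, 37],
    "model_b": [25, 25, 25, 25],  # always predicts the mean, so R2 is 0
}
scores = {name: r2_score(actual, preds) for name, preds in candidates.items()}
best = max(scores, key=scores.get)  # highest R2 wins
print(best, scores)
```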

## 📊 Visualizations

The training script generates several visualizations:

1. **Demand Trends Over Time** (`plots/demand_trends.png`)
   - Shows total daily sales quantity over the entire time period
   - Helps identify overall trends and patterns

2. **Monthly Average Demand** (`plots/monthly_demand.png`)
   - Bar chart showing average sales by month
   - Reveals seasonal patterns (e.g., holiday season spikes)

3. **Feature Importance** (`plots/feature_importance.png`)
   - Shows which features are most important for predictions
   - Only available for tree-based models (Random Forest, XGBoost)

4. **Model Comparison** (`plots/model_comparison.png`)
   - Side-by-side comparison of all models (ML and Time-Series)
   - Color-coded: ML models (blue) vs Time-Series models (orange/red)
   - Shows MAE, RMSE, and R2 Score for each model

5. **Time-Series Predictions** (`plots/timeseries_predictions.png`)
   - Actual vs predicted plots for ARIMA and Prophet models
   - Shows how well time-series models capture temporal patterns
   - Only generated if time-series models are available

## 🔮 Example Predictions

Here are some example predictions to demonstrate the system:

```bash
# Example 1: Electronics on a weekday
python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
# Expected: Moderate demand (weekday, some discount)

# Example 2: Clothing on weekend
python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
# Expected: Higher demand (weekend, good discount)

# Example 3: Holiday season prediction
python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
# Expected: High demand (holiday season, good discount)
```

## 🔧 Technical Details

### Data Preprocessing Pipeline

1. **Date Conversion**: Convert date strings to datetime objects
2. **Feature Extraction**: Extract temporal features from dates
3. **Missing Value Handling**: Fill missing values with median (if any)
4. **Categorical Encoding**: Label encode product_id and category
5. **Feature Scaling**: Standardize numerical features using StandardScaler



### Model Training Pipeline



1. **Data Splitting**: 80% training, 20% validation
2. **Model Training**: Train all available models
3. **Evaluation**: Calculate MAE, RMSE, and R2 for each model
4. **Selection**: Choose model with highest R2 score
5. **Persistence**: Save model, encoders, and scaler



### Prediction Pipeline



1. **Load Model**: Load trained model and preprocessing objects
2. **Feature Preparation**: Extract features from input parameters
3. **Encoding**: Encode categorical variables using saved encoders
4. **Scaling**: Scale features using saved scaler
5. **Prediction**: Make prediction using loaded model
6. **Post-processing**: Ensure non-negative predictions
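
The scaling and post-processing steps can be illustrated with a toy example. The saved statistics, weights, and the linear model below are invented for illustration; the project itself loads a fitted scaler and model from `models/`.

```python
def standardize(value, mean, std):
    """Apply z-score scaling with statistics saved at training time."""
    return (value - mean) / std


def postprocess(raw_prediction):
    """Clamp at zero, since a sales quantity cannot be negative."""
    return max(0.0, raw_prediction)


# Hypothetical saved (mean, std) pairs and a toy linear model, for illustration only.
saved_stats = {"price": (100.0, 50.0), "discount": (10.0, 5.0)}
weights = {"price": -8.0, "discount": 6.0}
intercept = 20.0

inputs = {"price": 500.0, "discount": 0.0}
scaled = {name: standardize(inputs[name], *saved_stats[name]) for name in inputs}
raw = intercept + sum(weights[name] * scaled[name] for name in scaled)
print(raw, postprocess(raw))
```

An unusually high price drives this toy model below zero, which is the case the post-processing step guards against.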



### Handling Unseen Data



The prediction script handles cases where:
- Product ID was not seen during training (uses default encoding)
- Category was not seen during training (uses default encoding)

Warnings are displayed in such cases.
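
One way to implement such a fallback is a small encoder wrapper that warns and returns a default code for labels it has not seen. The class name `SafeLabelEncoder` and the default code of 0 are assumptions; the README does not say which default the project uses.

```python
import warnings


class SafeLabelEncoder:
    """Map labels to integer codes, falling back to a default code for unseen labels."""

    def __init__(self, default_code=0):
        self.default_code = default_code  # assumption: the project's default may differ
        self.codes = {}

    def fit(self, labels):
        # Sorted unique labels get consecutive codes, as scikit-learn's LabelEncoder does.
        self.codes = {label: i for i, label in enumerate(sorted(set(labels)))}
        return self

    def transform_one(self, label):
        if label not in self.codes:
            warnings.warn(f"{label!r} not seen during training; using default encoding")
            return self.default_code
        return self.codes[label]


encoder = SafeLabelEncoder().fit(["Electronics", "Clothing", "Sports"])
known = encoder.transform_one("Sports")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = encoder.transform_one("Garden")  # unseen category
print(known, fallback, len(caught))
```

Note that a default code collides with a real label's code, so predictions for unseen products or categories should be treated with caution.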



## 🎓 Learning Points



This project demonstrates:

1. **Supervised Learning**: Regression problem solving
2. **Feature Engineering**: Creating meaningful features from raw data
3. **Model Comparison**: Training and evaluating multiple models
4. **Model Selection**: Automatic best model selection
5. **Model Persistence**: Saving and loading trained models
6. **Production-Ready Code**: Clean, modular, well-documented code
7. **Time Series Features**: Extracting temporal patterns
8. **Categorical Encoding**: Handling categorical variables
9. **Feature Scaling**: Normalizing features for better performance
10. **Evaluation Metrics**: Understanding different regression metrics



## 🐛 Troubleshooting



### Issue: "Model not found"
**Solution**: Run `python train_model.py` first to train and save the model.

### Issue: "XGBoost not available"
**Solution**: Install XGBoost with `pip install xgboost`. Alternatively, ignore the message: the system still runs and simply skips the XGBoost model.

### Issue: "Category not seen during training"
**Solution**: This is handled automatically with a warning. The system uses a default encoding.

### Issue: Poor prediction accuracy
**Solutions**:
- Ensure you have sufficient training data
- Check that input features are in the same range as training data
- Try retraining with different hyperparameters
- Consider adding more features or more training data

## 📝 Notes

- The synthetic dataset generator creates realistic patterns including:
  - Weekend effects (higher sales on weekends)
  - Seasonal patterns (holiday season spikes)
  - Price and discount effects
  - Category-specific base prices

- For production use, consider:
  - Using real historical data
  - Retraining models periodically
  - Adding more features (promotions, weather, etc.)
  - Implementing model versioning
  - Adding prediction confidence intervals

## 📄 License

This project is provided as-is for educational purposes.

## 👤 Author

Created as a complete machine learning project demonstrating demand prediction for e-commerce.

---

**Happy Predicting! 🚀**