# Demand Prediction System for E-commerce

A complete machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised regression models with time-series models (ARIMA, Prophet) to find the best approach.

## 📋 Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Dataset](#dataset)
- [Usage](#usage)
- [Model Details](#model-details)
- [Evaluation Metrics](#evaluation-metrics)
- [Visualizations](#visualizations)
- [Example Predictions](#example-predictions)
- [Technical Details](#technical-details)

## 🎯 Overview

This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:

1. **Machine Learning Models**: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
2. **Time-Series Models**: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)

The system automatically selects the best-performing model across both approaches.
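The selection logic described above can be sketched in a few lines: train several regressors on the same split, score each on held-out data, and keep the one with the highest R2. This is a minimal illustration using synthetic stand-in data, not the project's real feature set:

```python
# Minimal sketch of best-model selection by R2 score.
# The features and target here are synthetic stand-ins for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 4))  # e.g. price, discount, month, day_of_week
y = 50 + 30 * X[:, 1] - 20 * X[:, 0] + rng.normal(0, 2, 500)  # synthetic demand

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_val, model.predict(X_val))

# Keep the candidate with the highest validation R2
best_name = max(scores, key=scores.get)
print(f"Best model: {best_name} (R2 = {scores[best_name]:.3f})")
```

The real `train_model.py` applies the same idea across both ML and time-series candidates.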
**Key Capabilities:**

- Predicts sales quantity for products on future dates (ML models)
- Predicts overall daily demand (time-series models)
- Handles temporal patterns and seasonality
- Considers price, discount, category, and date features (ML models)
- Captures time-series patterns and trends (time-series models)
- Automatically selects the best model from multiple candidates
- Provides comprehensive evaluation metrics
- Compares ML vs time-series approaches

## ✨ Features

- **Data Preprocessing**: Automatic handling of missing values, date feature extraction
- **Feature Engineering**:
  - Date features (day, month, day_of_week, weekend, year, quarter)
  - Categorical encoding (product_id, category)
  - Feature scaling
- **Multiple Models**:
  - **Machine Learning Models:**
    - Linear Regression
    - Random Forest Regressor
    - XGBoost Regressor (optional)
  - **Time-Series Models:**
    - ARIMA (AutoRegressive Integrated Moving Average)
    - Prophet (Facebook's time-series forecasting tool)
- **Model Selection**: Automatic best-model selection based on R2 score
- **Evaluation Metrics**: MAE, RMSE, and R2 Score
- **Visualizations**:
  - Demand trends over time
  - Monthly average demand
  - Feature importance
  - Model comparison
- **Model Persistence**: Save and load trained models using joblib
- **Future Predictions**: Predict demand for any product on any future date

## 📁 Project Structure

```
demand_prediction/
│
├── data/
│   └── sales.csv                     # Sales dataset
│
├── models/                           # Generated during training
│   ├── best_model.joblib             # Best ML model (if ML is best)
│   ├── best_timeseries_model.joblib  # Best time-series model (if TS is best)
│   ├── preprocessing.joblib          # Encoders and scaler (for ML models)
│   ├── model_metadata.json           # Model metadata (legacy)
│   └── all_models_metadata.json      # All-models comparison metadata
│
├── plots/                            # Generated during training
│   ├── demand_trends.png             # Time-series plot
│   ├── monthly_demand.png            # Monthly averages
│   ├── feature_importance.png        # Feature importance (ML models)
│   ├── model_comparison.png          # Model metrics comparison (all models)
│   └── timeseries_predictions.png    # Time-series model predictions
│
├── generate_dataset.py               # Script to generate a synthetic dataset
├── train_model.py                    # Main training script
├── predict.py                        # Prediction script
├── app.py                            # Streamlit dashboard (interactive web app)
├── requirements.txt                  # Python dependencies
└── README.md                         # This file
```

## 🚀 Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)

### Step 1: Navigate to the Project Directory

```bash
cd demand_prediction
```

### Step 2: Create a Virtual Environment (Recommended)

**Why use a virtual environment?**

- Keeps project dependencies isolated from your system Python
- Prevents conflicts with other projects
- Makes it easier to manage package versions
- Best practice for Python projects

**Quick Setup (Recommended):**

**Windows:**

```bash
setup_env.bat
```

**Linux/Mac:**

```bash
chmod +x setup_env.sh
./setup_env.sh
```

**Manual Setup:**

**Windows:**

```bash
python -m venv venv
venv\Scripts\activate
```

**Linux/Mac:**

```bash
python3 -m venv venv
source venv/bin/activate
```

After activation, you should see `(venv)` in your terminal prompt.

**To deactivate later:**

```bash
deactivate
```

### Step 3: Install Dependencies

```bash
pip install -r requirements.txt
```

**Note**: If you don't want to use XGBoost, you can remove it from `requirements.txt`. The system works fine without it, simply skipping XGBoost model training.

**Alternative (without a virtual environment):** You can run the same `pip install -r requirements.txt` directly against your system Python, but this is **not recommended**, as it may cause conflicts with other Python projects.

### Step 4: Generate the Dataset

If you don't have a dataset, generate a synthetic one:

```bash
python generate_dataset.py
```

This will create `data/sales.csv` with realistic e-commerce sales data.
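`generate_dataset.py` is the authoritative generator; as a rough illustration of the kind of patterns it bakes in (weekend boost, holiday spike, price and discount effects, category-specific base prices), a generator could look like the sketch below. The exact multipliers and base values here are assumptions, not the script's real parameters:

```python
# Hedged sketch of a synthetic sales-data generator with the
# product_id,date,price,discount,category,sales_quantity schema.
# All multipliers and base prices below are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
categories = {"Electronics": 500.0, "Clothing": 50.0, "Toys": 30.0}  # base prices
dates = pd.date_range("2020-01-01", "2023-12-31", freq="D")

rows = []
for product_id, (category, base_price) in enumerate(categories.items(), start=1):
    price = round(base_price * rng.uniform(0.9, 1.1), 2)
    for date in dates:
        discount = rng.choice([0, 10, 20, 25])
        demand = 40.0
        demand *= 1.3 if date.dayofweek >= 5 else 1.0  # weekend effect
        demand *= 1.5 if date.month == 12 else 1.0     # holiday-season spike
        demand *= 1 + discount / 100                   # discount lifts demand
        demand *= 50.0 / (price + 50.0)                # higher price, lower demand
        rows.append((product_id, date.date(), price, discount, category,
                     max(0, int(rng.normal(demand, 5)))))

df = pd.DataFrame(rows, columns=["product_id", "date", "price", "discount",
                                 "category", "sales_quantity"])
# df.to_csv("data/sales.csv", index=False)  # uncomment to write the file
```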
## 📊 Dataset

The dataset should contain the following columns:

- **product_id**: Unique identifier for each product (integer)
- **date**: Date of sale (YYYY-MM-DD format)
- **price**: Product price (float)
- **discount**: Discount percentage (0-100, float)
- **category**: Product category (string)
- **sales_quantity**: Target variable - number of units sold (integer)

### Dataset Format Example

```csv
product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
...
```

## 💻 Usage

### Step 1: Train the Model

Train the models on the sales dataset:

```bash
python train_model.py
```

This will:

1. Load and preprocess the data
2. Extract features from dates
3. Encode categorical variables
4. Train multiple ML models (Linear Regression, Random Forest, XGBoost)
5. Prepare time-series data (aggregate daily sales)
6. Train time-series models (ARIMA, Prophet)
7. Evaluate each model using MAE, RMSE, and R2 Score
8. Compare ML vs time-series models
9. Select the best model automatically (across all model types)
10. Save the model and preprocessing objects
11.
Generate visualizations

**Output:**

- Best model saved to `models/best_model.joblib` (ML) or `models/best_timeseries_model.joblib` (TS)
- Preprocessing objects saved to `models/preprocessing.joblib` (for ML models)
- Visualizations saved to the `plots/` directory
- All-models metadata saved to `models/all_models_metadata.json`

### Step 2: Make Predictions

**For ML Models (product-specific predictions):**

Predict demand for a specific product on a date:

```bash
python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
```

**Parameters for ML Models:**

- `--product_id`: Product ID (integer, required)
- `--date`: Date in YYYY-MM-DD format (required)
- `--price`: Product price (float, required)
- `--discount`: Discount percentage 0-100 (float, default: 0)
- `--category`: Product category (string, required)
- `--model_type`: Model type - `auto` (default), `ml`, or `timeseries`

**For Time-Series Models (overall daily demand):**

Predict total daily demand across all products:

```bash
python predict.py --date 2024-01-15 --model_type timeseries
```

**Parameters for Time-Series Models:**

- `--date`: Date in YYYY-MM-DD format (required)
- `--model_type`: Set to `timeseries` to use time-series models

**Example Predictions:**

```bash
# ML model - Electronics product with discount
python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics

# ML model - Clothing product without discount
python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing

# Time-series model - overall daily demand
python predict.py --date 2024-07-06 --model_type timeseries

# Auto-detect best model (default)
python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports
```

### Step 3: Launch Interactive Dashboard (Optional)

Launch the Streamlit dashboard for interactive visualization and predictions:

```bash
streamlit run app.py
```

The dashboard will open in your
default web browser (usually at `http://localhost:8501`).

**Dashboard Features:**

1. **📈 Sales Trends Page**
   - Interactive filters (category, product, date range)
   - Daily sales trends visualization
   - Monthly sales trends
   - Category-wise analysis
   - Price vs demand relationship
   - Real-time statistics and metrics
2. **🔮 Demand Prediction Page**
   - Interactive prediction interface
   - Select model type (Auto/ML/Time-Series)
   - For ML models:
     - Product selection dropdown
     - Category selection
     - Price and discount sliders
     - Date picker
     - Product statistics display
   - For time-series models:
     - Date picker for future predictions
     - Overall daily demand forecast
   - Prediction insights and recommendations
3. **📊 Model Comparison Page**
   - Side-by-side model performance comparison
   - MAE, RMSE, and R2 Score metrics
   - Visual charts comparing all models
   - Best model highlighting
   - Model type indicators (ML vs Time-Series)

**Dashboard Screenshots:**

- Interactive widgets for easy data exploration
- Real-time predictions with visual feedback
- Comprehensive model comparison charts

## 🤖 Model Details

### Models Trained

1. **Linear Regression**
   - Simple linear model
   - Fast training and prediction
   - Good baseline model
2. **Random Forest Regressor**
   - Ensemble of decision trees
   - Handles non-linear relationships
   - Provides feature importance
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 15
     - min_samples_split: 5
     - min_samples_leaf: 2
3. **XGBoost Regressor** (Optional)
   - Gradient boosting algorithm
   - Often provides the best performance
   - Handles complex patterns
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 6
     - learning_rate: 0.1
4. **ARIMA** (AutoRegressive Integrated Moving Average)
   - Classic time-series forecasting model
   - Captures trends and seasonality
   - Automatically selects the best order (p, d, q)
   - Works on aggregated daily sales data
   - Uses a chronological train/validation split
5.
**Prophet** (Facebook's time-series forecasting tool)
   - Designed for business time series
   - Handles seasonality (weekly, yearly)
   - Robust to missing data and outliers
   - Works on aggregated daily sales data
   - Uses a chronological train/validation split

### Model Comparison: ML vs Time-Series

**Machine Learning Models:**

- ✅ Predict per-product demand
- ✅ Use product features (price, discount, category)
- ✅ Can handle new products with similar features
- ❌ May not capture long-term temporal patterns as well

**Time-Series Models:**

- ✅ Capture temporal patterns and trends
- ✅ Handle seasonality automatically
- ✅ Good for overall demand forecasting
- ❌ Predict aggregate demand, not per-product demand
- ❌ Don't use product-specific features

**The system automatically selects the best model based on R2 score across all model types.**

### Feature Engineering

**For ML Models:**

The system extracts the following features from the input data:

**Date Features:**

- `day`: Day of month (1-31)
- `month`: Month (1-12)
- `day_of_week`: Day of week (0=Monday, 6=Sunday)
- `weekend`: Binary indicator (1 if weekend, 0 otherwise)
- `year`: Year
- `quarter`: Quarter of year (1-4)

**Original Features:**

- `product_id`: Encoded as categorical
- `price`: Numerical (scaled)
- `discount`: Numerical (scaled)
- `category`: Encoded as categorical

**Total Features**: 10 features after encoding and scaling

**For Time-Series Models:**

- Data is aggregated by date (total daily sales)
- Uses a chronological split (80% train, 20% validation)
- Prophet automatically handles:
  - Weekly seasonality
  - Yearly seasonality
  - Trend components

## 📈 Evaluation Metrics

The system evaluates models using three metrics:

1. **MAE (Mean Absolute Error)**
   - Average absolute difference between predicted and actual values
   - Lower is better
   - Units: same as the target variable (sales quantity)
2.
**RMSE (Root Mean Squared Error)**
   - Square root of the average squared differences
   - Penalizes large errors more than MAE
   - Lower is better
   - Units: same as the target variable (sales quantity)
3. **R2 Score (Coefficient of Determination)**
   - Proportion of variance explained by the model
   - Range: -∞ to 1 (1 is a perfect prediction)
   - Higher is better
   - Used for model selection

**Model Selection**: The model with the highest R2 score is selected as the best model.

## 📊 Visualizations

The training script generates several visualizations:

1. **Demand Trends Over Time** (`plots/demand_trends.png`)
   - Shows total daily sales quantity over the entire time period
   - Helps identify overall trends and patterns
2. **Monthly Average Demand** (`plots/monthly_demand.png`)
   - Bar chart showing average sales by month
   - Reveals seasonal patterns (e.g., holiday-season spikes)
3. **Feature Importance** (`plots/feature_importance.png`)
   - Shows which features are most important for predictions
   - Only available for tree-based models (Random Forest, XGBoost)
4. **Model Comparison** (`plots/model_comparison.png`)
   - Side-by-side comparison of all models (ML and time-series)
   - Color-coded: ML models (blue) vs time-series models (orange/red)
   - Shows MAE, RMSE, and R2 Score for each model
5.
**Time-Series Predictions** (`plots/timeseries_predictions.png`)
   - Actual-vs-predicted plots for the ARIMA and Prophet models
   - Shows how well the time-series models capture temporal patterns
   - Only generated if time-series models are available

## 🔮 Example Predictions

Here are some example predictions to demonstrate the system:

```bash
# Example 1: Electronics on a weekday
python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
# Expected: Moderate demand (weekday, some discount)

# Example 2: Clothing on a weekend
python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
# Expected: Higher demand (weekend, good discount)

# Example 3: Holiday-season prediction
python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
# Expected: High demand (holiday season, good discount)
```

## 🔧 Technical Details

### Data Preprocessing Pipeline

1. **Date Conversion**: Convert date strings to datetime objects
2. **Feature Extraction**: Extract temporal features from dates
3. **Missing Value Handling**: Fill missing values with the median (if any)
4. **Categorical Encoding**: Label-encode product_id and category
5. **Feature Scaling**: Standardize numerical features using StandardScaler

### Model Training Pipeline

1. **Data Splitting**: 80% training, 20% validation
2. **Model Training**: Train all available models
3. **Evaluation**: Calculate MAE, RMSE, and R2 for each model
4. **Selection**: Choose the model with the highest R2 score
5. **Persistence**: Save the model, encoders, and scaler

### Prediction Pipeline

1. **Load Model**: Load the trained model and preprocessing objects
2. **Feature Preparation**: Extract features from the input parameters
3. **Encoding**: Encode categorical variables using the saved encoders
4. **Scaling**: Scale features using the saved scaler
5. **Prediction**: Make a prediction using the loaded model
6.
**Post-processing**: Ensure non-negative predictions

### Handling Unseen Data

The prediction script handles cases where:

- The product ID was not seen during training (uses a default encoding)
- The category was not seen during training (uses a default encoding)

Warnings are displayed in such cases.

## 🎓 Learning Points

This project demonstrates:

1. **Supervised Learning**: Solving a regression problem
2. **Feature Engineering**: Creating meaningful features from raw data
3. **Model Comparison**: Training and evaluating multiple models
4. **Model Selection**: Automatic best-model selection
5. **Model Persistence**: Saving and loading trained models
6. **Production-Ready Code**: Clean, modular, well-documented code
7. **Time-Series Features**: Extracting temporal patterns
8. **Categorical Encoding**: Handling categorical variables
9. **Feature Scaling**: Normalizing features for better performance
10. **Evaluation Metrics**: Understanding different regression metrics

## 🐛 Troubleshooting

### Issue: "Model not found"

**Solution**: Run `python train_model.py` first to train and save the model.

### Issue: "XGBoost not available"

**Solution**: Install XGBoost with `pip install xgboost`; otherwise the system will work without it (skipping the XGBoost model).

### Issue: "Category not seen during training"

**Solution**: This is handled automatically with a warning. The system uses a default encoding.
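The default-encoding fallback described above can be sketched as follows. This is a simplified stand-in for `predict.py`'s actual handling; the `encode_with_default` helper and the default code of 0 are illustrative assumptions:

```python
# Sketch of falling back to a default code for categories unseen at training
# time. encode_with_default and the default code 0 are illustrative, not the
# project's exact implementation.
import warnings
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Electronics", "Clothing", "Toys"])  # categories seen at training time

def encode_with_default(value: str, default_code: int = 0) -> int:
    """Encode a category, warning and using default_code for unseen values."""
    if value in encoder.classes_:
        return int(encoder.transform([value])[0])
    warnings.warn(f"Category '{value}' not seen during training; using default encoding.")
    return default_code

print(encode_with_default("Toys"))    # known category -> its learned code (2)
print(encode_with_default("Sports"))  # unseen -> default code (0), with a warning
```

The same pattern applies to unseen product IDs.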
### Issue: Poor prediction accuracy

**Solutions**:

- Ensure you have sufficient training data
- Check that input features are in the same range as the training data
- Try retraining with different hyperparameters
- Consider adding more features or more training data

## 📝 Notes

- The synthetic dataset generator creates realistic patterns, including:
  - Weekend effects (higher sales on weekends)
  - Seasonal patterns (holiday-season spikes)
  - Price and discount effects
  - Category-specific base prices
- For production use, consider:
  - Using real historical data
  - Retraining models periodically
  - Adding more features (promotions, weather, etc.)
  - Implementing model versioning
  - Adding prediction confidence intervals

## 📄 License

This project is provided as-is for educational purposes.

## 👤 Author

Created as a complete machine learning project demonstrating demand prediction for e-commerce.

---

**Happy Predicting! 🚀**