# Demand Prediction System for E-commerce

A complete machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised regression models with time-series models (ARIMA, Prophet) to find the best approach.

## 📋 Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Dataset](#dataset)
- [Usage](#usage)
- [Model Details](#model-details)
- [Evaluation Metrics](#evaluation-metrics)
- [Visualizations](#visualizations)
- [Example Predictions](#example-predictions)
- [Technical Details](#technical-details)

## 🎯 Overview

This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:

1. **Machine Learning Models**: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
2. **Time-Series Models**: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)

The system automatically selects the best-performing model across both approaches.
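The selection logic described above can be sketched in a few lines: train several regressors on the same split, score each on held-out data, and keep the one with the highest R2. This is a minimal illustration using synthetic stand-in data, not the project's real feature set:

```python
# Minimal sketch of best-model selection by R2 score.
# The features and target here are synthetic stand-ins for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 4))  # e.g. price, discount, month, day_of_week
y = 50 + 30 * X[:, 1] - 20 * X[:, 0] + rng.normal(0, 2, 500)  # synthetic demand

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_val, model.predict(X_val))

# Keep the candidate with the highest validation R2
best_name = max(scores, key=scores.get)
print(f"Best model: {best_name} (R2 = {scores[best_name]:.3f})")
```

The real `train_model.py` applies the same idea across both ML and time-series candidates.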
**Key Capabilities:**

- Predicts sales quantity for products on future dates (ML models)
- Predicts overall daily demand (time-series models)
- Handles temporal patterns and seasonality
- Considers price, discount, category, and date features (ML models)
- Captures time-series patterns and trends (time-series models)
- Automatically selects the best model from multiple candidates
- Provides comprehensive evaluation metrics
- Compares ML vs time-series approaches

## ✨ Features

- **Data Preprocessing**: Automatic handling of missing values, date feature extraction
- **Feature Engineering**:
  - Date features (day, month, day_of_week, weekend, year, quarter)
  - Categorical encoding (product_id, category)
  - Feature scaling
- **Multiple Models**:
  - **Machine Learning Models:**
    - Linear Regression
    - Random Forest Regressor
    - XGBoost Regressor (optional)
  - **Time-Series Models:**
    - ARIMA (AutoRegressive Integrated Moving Average)
    - Prophet (Facebook's time-series forecasting tool)
- **Model Selection**: Automatic best-model selection based on R2 score
- **Evaluation Metrics**: MAE, RMSE, and R2 Score
- **Visualizations**:
  - Demand trends over time
  - Monthly average demand
  - Feature importance
  - Model comparison
- **Model Persistence**: Save and load trained models using joblib
- **Future Predictions**: Predict demand for any product on any future date

## 📁 Project Structure

```
demand_prediction/
│
├── data/
│   └── sales.csv                     # Sales dataset
│
├── models/                           # Generated during training
│   ├── best_model.joblib             # Best ML model (if ML is best)
│   ├── best_timeseries_model.joblib  # Best time-series model (if TS is best)
│   ├── preprocessing.joblib          # Encoders and scaler (for ML models)
│   ├── model_metadata.json           # Model metadata (legacy)
│   └── all_models_metadata.json      # All-models comparison metadata
│
├── plots/                            # Generated during training
│   ├── demand_trends.png             # Time-series plot
│   ├── monthly_demand.png            # Monthly averages
│   ├── feature_importance.png        # Feature importance (ML models)
│   ├── model_comparison.png          # Model metrics comparison (all models)
│   └── timeseries_predictions.png    # Time-series model predictions
│
├── generate_dataset.py               # Script to generate a synthetic dataset
├── train_model.py                    # Main training script
├── predict.py                        # Prediction script
├── app.py                            # Streamlit dashboard (interactive web app)
├── requirements.txt                  # Python dependencies
└── README.md                         # This file
```

## 🚀 Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)

### Step 1: Navigate to the Project Directory

```bash
cd demand_prediction
```

### Step 2: Create a Virtual Environment (Recommended)

**Why use a virtual environment?**

- Keeps project dependencies isolated from your system Python
- Prevents conflicts with other projects
- Makes it easier to manage package versions
- Best practice for Python projects

**Quick Setup (Recommended):**

**Windows:**

```bash
setup_env.bat
```

**Linux/Mac:**

```bash
chmod +x setup_env.sh
./setup_env.sh
```

**Manual Setup:**

**Windows:**

```bash
python -m venv venv
venv\Scripts\activate
```

**Linux/Mac:**

```bash
python3 -m venv venv
source venv/bin/activate
```

After activation, you should see `(venv)` in your terminal prompt.

**To deactivate later:**

```bash
deactivate
```

### Step 3: Install Dependencies

```bash
pip install -r requirements.txt
```

**Note**: If you don't want to use XGBoost, you can remove it from `requirements.txt`. The system works fine without it, simply skipping XGBoost model training.

**Alternative (without a virtual environment):** You can run the same `pip install -r requirements.txt` directly against your system Python, but this is **not recommended**, as it may cause conflicts with other Python projects.

### Step 4: Generate the Dataset

If you don't have a dataset, generate a synthetic one:

```bash
python generate_dataset.py
```

This will create `data/sales.csv` with realistic e-commerce sales data.
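`generate_dataset.py` is the authoritative generator; as a rough illustration of the kind of patterns it bakes in (weekend boost, holiday spike, price and discount effects, category-specific base prices), a generator could look like the sketch below. The exact multipliers and base values here are assumptions, not the script's real parameters:

```python
# Hedged sketch of a synthetic sales-data generator with the
# product_id,date,price,discount,category,sales_quantity schema.
# All multipliers and base prices below are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
categories = {"Electronics": 500.0, "Clothing": 50.0, "Toys": 30.0}  # base prices
dates = pd.date_range("2020-01-01", "2023-12-31", freq="D")

rows = []
for product_id, (category, base_price) in enumerate(categories.items(), start=1):
    price = round(base_price * rng.uniform(0.9, 1.1), 2)
    for date in dates:
        discount = rng.choice([0, 10, 20, 25])
        demand = 40.0
        demand *= 1.3 if date.dayofweek >= 5 else 1.0  # weekend effect
        demand *= 1.5 if date.month == 12 else 1.0     # holiday-season spike
        demand *= 1 + discount / 100                   # discount lifts demand
        demand *= 50.0 / (price + 50.0)                # higher price, lower demand
        rows.append((product_id, date.date(), price, discount, category,
                     max(0, int(rng.normal(demand, 5)))))

df = pd.DataFrame(rows, columns=["product_id", "date", "price", "discount",
                                 "category", "sales_quantity"])
# df.to_csv("data/sales.csv", index=False)  # uncomment to write the file
```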
## 📊 Dataset

The dataset should contain the following columns:

- **product_id**: Unique identifier for each product (integer)
- **date**: Date of sale (YYYY-MM-DD format)
- **price**: Product price (float)
- **discount**: Discount percentage (0-100, float)
- **category**: Product category (string)
- **sales_quantity**: Target variable - number of units sold (integer)

### Dataset Format Example

```csv
product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
...
```

## 💻 Usage

### Step 1: Train the Model

Train the models on the sales dataset:

```bash
python train_model.py
```

This will:

1. Load and preprocess the data
2. Extract features from dates
3. Encode categorical variables
4. Train multiple ML models (Linear Regression, Random Forest, XGBoost)
5. Prepare time-series data (aggregate daily sales)
6. Train time-series models (ARIMA, Prophet)
7. Evaluate each model using MAE, RMSE, and R2 Score
8. Compare ML vs time-series models
9. Select the best model automatically (across all model types)
10. Save the model and preprocessing objects
11.
Generate visualizations

**Output:**

- Best model saved to `models/best_model.joblib` (ML) or `models/best_timeseries_model.joblib` (TS)
- Preprocessing objects saved to `models/preprocessing.joblib` (for ML models)
- Visualizations saved to the `plots/` directory
- All-models metadata saved to `models/all_models_metadata.json`

### Step 2: Make Predictions

**For ML Models (product-specific predictions):**

Predict demand for a specific product on a date:

```bash
python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
```

**Parameters for ML Models:**

- `--product_id`: Product ID (integer, required)
- `--date`: Date in YYYY-MM-DD format (required)
- `--price`: Product price (float, required)
- `--discount`: Discount percentage 0-100 (float, default: 0)
- `--category`: Product category (string, required)
- `--model_type`: Model type - `auto` (default), `ml`, or `timeseries`

**For Time-Series Models (overall daily demand):**

Predict total daily demand across all products:

```bash
python predict.py --date 2024-01-15 --model_type timeseries
```

**Parameters for Time-Series Models:**

- `--date`: Date in YYYY-MM-DD format (required)
- `--model_type`: Set to `timeseries` to use time-series models

**Example Predictions:**

```bash
# ML model - Electronics product with discount
python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics

# ML model - Clothing product without discount
python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing

# Time-series model - overall daily demand
python predict.py --date 2024-07-06 --model_type timeseries

# Auto-detect best model (default)
python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports
```

### Step 3: Launch Interactive Dashboard (Optional)

Launch the Streamlit dashboard for interactive visualization and predictions:

```bash
streamlit run app.py
```

The dashboard will open in your
default web browser (usually at `http://localhost:8501`).

**Dashboard Features:**

1. **📈 Sales Trends Page**
   - Interactive filters (category, product, date range)
   - Daily sales trends visualization
   - Monthly sales trends
   - Category-wise analysis
   - Price vs demand relationship
   - Real-time statistics and metrics
2. **🔮 Demand Prediction Page**
   - Interactive prediction interface
   - Select model type (Auto/ML/Time-Series)
   - For ML models:
     - Product selection dropdown
     - Category selection
     - Price and discount sliders
     - Date picker
     - Product statistics display
   - For time-series models:
     - Date picker for future predictions
     - Overall daily demand forecast
   - Prediction insights and recommendations
3. **📊 Model Comparison Page**
   - Side-by-side model performance comparison
   - MAE, RMSE, and R2 Score metrics
   - Visual charts comparing all models
   - Best model highlighting
   - Model type indicators (ML vs Time-Series)

**Dashboard Screenshots:**

- Interactive widgets for easy data exploration
- Real-time predictions with visual feedback
- Comprehensive model comparison charts

## 🤖 Model Details

### Models Trained

1. **Linear Regression**
   - Simple linear model
   - Fast training and prediction
   - Good baseline model
2. **Random Forest Regressor**
   - Ensemble of decision trees
   - Handles non-linear relationships
   - Provides feature importance
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 15
     - min_samples_split: 5
     - min_samples_leaf: 2
3. **XGBoost Regressor** (Optional)
   - Gradient boosting algorithm
   - Often provides the best performance
   - Handles complex patterns
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 6
     - learning_rate: 0.1
4. **ARIMA** (AutoRegressive Integrated Moving Average)
   - Classic time-series forecasting model
   - Captures trends and seasonality
   - Automatically selects the best order (p, d, q)
   - Works on aggregated daily sales data
   - Uses a chronological train/validation split
5.
**Prophet** (Facebook's time-series forecasting tool)
   - Designed for business time series
   - Handles seasonality (weekly, yearly)
   - Robust to missing data and outliers
   - Works on aggregated daily sales data
   - Uses a chronological train/validation split

### Model Comparison: ML vs Time-Series

**Machine Learning Models:**

- ✅ Predict per-product demand
- ✅ Use product features (price, discount, category)
- ✅ Can handle new products with similar features
- ❌ May not capture long-term temporal patterns as well

**Time-Series Models:**

- ✅ Capture temporal patterns and trends
- ✅ Handle seasonality automatically
- ✅ Good for overall demand forecasting
- ❌ Predict aggregate demand, not per-product demand
- ❌ Don't use product-specific features

**The system automatically selects the best model based on R2 score across all model types.**

### Feature Engineering

**For ML Models:**

The system extracts the following features from the input data:

**Date Features:**

- `day`: Day of month (1-31)
- `month`: Month (1-12)
- `day_of_week`: Day of week (0=Monday, 6=Sunday)
- `weekend`: Binary indicator (1 if weekend, 0 otherwise)
- `year`: Year
- `quarter`: Quarter of year (1-4)

**Original Features:**

- `product_id`: Encoded as categorical
- `price`: Numerical (scaled)
- `discount`: Numerical (scaled)
- `category`: Encoded as categorical

**Total Features**: 10 features after encoding and scaling

**For Time-Series Models:**

- Data is aggregated by date (total daily sales)
- Uses a chronological split (80% train, 20% validation)
- Prophet automatically handles:
  - Weekly seasonality
  - Yearly seasonality
  - Trend components

## 📈 Evaluation Metrics

The system evaluates models using three metrics:

1. **MAE (Mean Absolute Error)**
   - Average absolute difference between predicted and actual values
   - Lower is better
   - Units: same as the target variable (sales quantity)
2.
**RMSE (Root Mean Squared Error)**
   - Square root of the average squared differences
   - Penalizes large errors more than MAE
   - Lower is better
   - Units: same as the target variable (sales quantity)
3. **R2 Score (Coefficient of Determination)**
   - Proportion of variance explained by the model
   - Range: -∞ to 1 (1 is a perfect prediction)
   - Higher is better
   - Used for model selection

**Model Selection**: The model with the highest R2 score is selected as the best model.

## 📊 Visualizations

The training script generates several visualizations:

1. **Demand Trends Over Time** (`plots/demand_trends.png`)
   - Shows total daily sales quantity over the entire time period
   - Helps identify overall trends and patterns
2. **Monthly Average Demand** (`plots/monthly_demand.png`)
   - Bar chart showing average sales by month
   - Reveals seasonal patterns (e.g., holiday-season spikes)
3. **Feature Importance** (`plots/feature_importance.png`)
   - Shows which features are most important for predictions
   - Only available for tree-based models (Random Forest, XGBoost)
4. **Model Comparison** (`plots/model_comparison.png`)
   - Side-by-side comparison of all models (ML and time-series)
   - Color-coded: ML models (blue) vs time-series models (orange/red)
   - Shows MAE, RMSE, and R2 Score for each model
5.
**Time-Series Predictions** (`plots/timeseries_predictions.png`)
   - Actual-vs-predicted plots for the ARIMA and Prophet models
   - Shows how well the time-series models capture temporal patterns
   - Only generated if time-series models are available

## 🔮 Example Predictions

Here are some example predictions to demonstrate the system:

```bash
# Example 1: Electronics on a weekday
python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
# Expected: Moderate demand (weekday, some discount)

# Example 2: Clothing on a weekend
python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
# Expected: Higher demand (weekend, good discount)

# Example 3: Holiday-season prediction
python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
# Expected: High demand (holiday season, good discount)
```

## 🔧 Technical Details

### Data Preprocessing Pipeline

1. **Date Conversion**: Convert date strings to datetime objects
2. **Feature Extraction**: Extract temporal features from dates
3. **Missing Value Handling**: Fill missing values with the median (if any)
4. **Categorical Encoding**: Label-encode product_id and category
5. **Feature Scaling**: Standardize numerical features using StandardScaler

### Model Training Pipeline

1. **Data Splitting**: 80% training, 20% validation
2. **Model Training**: Train all available models
3. **Evaluation**: Calculate MAE, RMSE, and R2 for each model
4. **Selection**: Choose the model with the highest R2 score
5. **Persistence**: Save the model, encoders, and scaler

### Prediction Pipeline

1. **Load Model**: Load the trained model and preprocessing objects
2. **Feature Preparation**: Extract features from the input parameters
3. **Encoding**: Encode categorical variables using the saved encoders
4. **Scaling**: Scale features using the saved scaler
5. **Prediction**: Make a prediction using the loaded model
6.
**Post-processing**: Ensure non-negative predictions

### Handling Unseen Data

The prediction script handles cases where:

- The product ID was not seen during training (uses a default encoding)
- The category was not seen during training (uses a default encoding)

Warnings are displayed in such cases.

## 🎓 Learning Points

This project demonstrates:

1. **Supervised Learning**: Solving a regression problem
2. **Feature Engineering**: Creating meaningful features from raw data
3. **Model Comparison**: Training and evaluating multiple models
4. **Model Selection**: Automatic best-model selection
5. **Model Persistence**: Saving and loading trained models
6. **Production-Ready Code**: Clean, modular, well-documented code
7. **Time-Series Features**: Extracting temporal patterns
8. **Categorical Encoding**: Handling categorical variables
9. **Feature Scaling**: Normalizing features for better performance
10. **Evaluation Metrics**: Understanding different regression metrics

## 🐛 Troubleshooting

### Issue: "Model not found"

**Solution**: Run `python train_model.py` first to train and save the model.

### Issue: "XGBoost not available"

**Solution**: Install XGBoost with `pip install xgboost`; otherwise the system will work without it (skipping the XGBoost model).

### Issue: "Category not seen during training"

**Solution**: This is handled automatically with a warning. The system uses a default encoding.
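The default-encoding fallback described above can be sketched as follows. This is a simplified stand-in for `predict.py`'s actual handling; the `encode_with_default` helper and the default code of 0 are illustrative assumptions:

```python
# Sketch of falling back to a default code for categories unseen at training
# time. encode_with_default and the default code 0 are illustrative, not the
# project's exact implementation.
import warnings
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Electronics", "Clothing", "Toys"])  # categories seen at training time

def encode_with_default(value: str, default_code: int = 0) -> int:
    """Encode a category, warning and using default_code for unseen values."""
    if value in encoder.classes_:
        return int(encoder.transform([value])[0])
    warnings.warn(f"Category '{value}' not seen during training; using default encoding.")
    return default_code

print(encode_with_default("Toys"))    # known category -> its learned code (2)
print(encode_with_default("Sports"))  # unseen -> default code (0), with a warning
```

The same pattern applies to unseen product IDs.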
### Issue: Poor prediction accuracy

**Solutions**:

- Ensure you have sufficient training data
- Check that input features are in the same range as the training data
- Try retraining with different hyperparameters
- Consider adding more features or more training data

## 📝 Notes

- The synthetic dataset generator creates realistic patterns, including:
  - Weekend effects (higher sales on weekends)
  - Seasonal patterns (holiday-season spikes)
  - Price and discount effects
  - Category-specific base prices
- For production use, consider:
  - Using real historical data
  - Retraining models periodically
  - Adding more features (promotions, weather, etc.)
  - Implementing model versioning
  - Adding prediction confidence intervals

## 📄 License

This project is provided as-is for educational purposes.

## 👤 Author

Created as a complete machine learning project demonstrating demand prediction for e-commerce.

---

**Happy Predicting! 🚀**