# Demand Prediction System for E-commerce

A machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised regression models with time-series models (ARIMA, Prophet) and selects the best-performing approach.

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Dataset](#dataset)
- [Usage](#usage)
- [Model Details](#model-details)
- [Evaluation Metrics](#evaluation-metrics)
- [Visualizations](#visualizations)
- [Example Predictions](#example-predictions)
- [Technical Details](#technical-details)

## Overview

This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:

1. **Machine Learning Models**: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
2. **Time-Series Models**: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)

The system automatically selects the best performing model across both approaches.

**Key Capabilities:**

- Predicts sales quantity for products on future dates (ML models)
- Predicts overall daily demand (Time-series models)
- Handles temporal patterns and seasonality
- Considers price, discount, category, and date features (ML models)
- Captures time-series patterns and trends (Time-series models)
- Automatically selects the best model from multiple candidates
- Reports MAE, RMSE, and R2 for every model
- Compares ML vs Time-Series approaches

## Features

- **Data Preprocessing**: Automatic handling of missing values, date feature extraction
- **Feature Engineering**:
  - Date features (day, month, day_of_week, weekend, year, quarter)
  - Categorical encoding (product_id, category)
  - Feature scaling
- **Multiple Models**:
  - **Machine Learning Models:**
    - Linear Regression
    - Random Forest Regressor
    - XGBoost Regressor (optional)
  - **Time-Series Models:**
    - ARIMA (AutoRegressive Integrated Moving Average)
    - Prophet (Facebook's time-series forecasting tool)
- **Model Selection**: Automatic best model selection based on R2 score
- **Evaluation Metrics**: MAE, RMSE, and R2 Score
- **Visualizations**:
  - Demand trends over time
  - Monthly average demand
  - Feature importance
  - Model comparison
- **Model Persistence**: Save and load trained models using joblib
- **Future Predictions**: Predict demand for any product on any future date

## Project Structure

```
demand_prediction/
│
├── data/
│   └── sales.csv                      # Sales dataset
│
├── models/                            # Generated during training
│   ├── best_model.joblib              # Best ML model (if ML is best)
│   ├── best_timeseries_model.joblib   # Best time-series model (if TS is best)
│   ├── preprocessing.joblib           # Encoders and scaler (for ML models)
│   ├── model_metadata.json            # Model metadata (legacy)
│   └── all_models_metadata.json       # All models comparison metadata
│
├── plots/                             # Generated during training
│   ├── demand_trends.png              # Time series plot
│   ├── monthly_demand.png             # Monthly averages
│   ├── feature_importance.png         # Feature importance (ML models)
│   ├── model_comparison.png           # Model metrics comparison (all models)
│   └── timeseries_predictions.png     # Time-series model predictions
│
├── generate_dataset.py                # Script to generate synthetic dataset
├── train_model.py                     # Main training script
├── predict.py                         # Prediction script
├── app.py                             # Streamlit dashboard (interactive web app)
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```

## Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)

### Step 1: Navigate to Project Directory

```bash
cd demand_prediction
```

### Step 2: Create Virtual Environment (Recommended)

**Why use a virtual environment?**

- Keeps project dependencies isolated from your system Python
- Prevents conflicts with other projects
- Makes it easier to manage package versions
- Best practice for Python projects

**Quick Setup (Recommended):**

**Windows:**

```bash
setup_env.bat
```

**Linux/Mac:**

```bash
chmod +x setup_env.sh
./setup_env.sh
```

**Manual Setup:**

**Windows:**

```bash
python -m venv venv
venv\Scripts\activate
```

**Linux/Mac:**

```bash
python3 -m venv venv
source venv/bin/activate
```

After activation, you should see `(venv)` in your terminal prompt.

**To deactivate later:**

```bash
deactivate
```

### Step 3: Install Dependencies

```bash
pip install -r requirements.txt
```

**Note**: XGBoost is optional. If you remove it from `requirements.txt`, the system still works and simply skips XGBoost model training.

**Alternative (without virtual environment):**

If you prefer not to use a virtual environment, you can run the same command against your system Python:

```bash
pip install -r requirements.txt
```

However, this is **not recommended** because it may cause conflicts with other Python projects.

### Step 4: Generate Dataset

If you don't have a dataset, generate a synthetic one:

```bash
python generate_dataset.py
```

This creates `data/sales.csv` with synthetic e-commerce sales data.

## Dataset

The dataset should contain the following columns:

- **product_id**: Unique identifier for each product (integer)
- **date**: Date of sale (YYYY-MM-DD format)
- **price**: Product price (float)
- **discount**: Discount percentage (0-100, float)
- **category**: Product category (string)
- **sales_quantity**: Target variable, the number of units sold (integer)

### Dataset Format Example

```csv
product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
...
```
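
To check a file against this schema before training, something like the following could be used. It is a minimal sketch using only the standard library; `parse_sales_csv` and `REQUIRED_COLUMNS` are hypothetical names for illustration, and the project's own loading code may differ.

```python
import csv
import io
from datetime import date

# Column names taken from the dataset description above.
REQUIRED_COLUMNS = ["product_id", "date", "price", "discount", "category", "sales_quantity"]


def parse_sales_csv(text):
    """Parse CSV text into typed rows, raising ValueError on schema problems."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for line_no, raw in enumerate(reader, start=2):  # the header is line 1
        discount = float(raw["discount"])
        if not 0 <= discount <= 100:
            raise ValueError(f"line {line_no}: discount {discount} is outside 0-100")
        rows.append({
            "product_id": int(raw["product_id"]),
            "date": date.fromisoformat(raw["date"]),  # expects YYYY-MM-DD
            "price": float(raw["price"]),
            "discount": discount,
            "category": raw["category"],
            "sales_quantity": int(raw["sales_quantity"]),
        })
    return rows


SAMPLE = """product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
"""

rows = parse_sales_csv(SAMPLE)
print(len(rows), rows[0]["category"], rows[1]["sales_quantity"])
```

For a file on disk, the same function can be called with `open("data/sales.csv").read()`.
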
## Usage

### Step 1: Train the Model

Train the model using the sales dataset:

```bash
python train_model.py
```

This will:

1. Load and preprocess the data
2. Extract features from dates
3. Encode categorical variables
4. Train multiple ML models (Linear Regression, Random Forest, XGBoost)
5. Prepare time-series data (aggregate daily sales)
6. Train time-series models (ARIMA, Prophet)
7. Evaluate each model using MAE, RMSE, and R2 Score
8. Compare ML vs Time-Series models
9. Select the best model automatically (across all model types)
10. Save the model and preprocessing objects
11. Generate visualizations

**Output:**

- Best model saved to `models/best_model.joblib` (ML) or `models/best_timeseries_model.joblib` (TS)
- Preprocessing objects saved to `models/preprocessing.joblib` (for ML models)
- Visualizations saved to `plots/` directory
- All models metadata saved to `models/all_models_metadata.json`

### Step 2: Make Predictions

**For ML Models (product-specific predictions):**

Predict demand for a specific product on a date:

```bash
python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
```

**Parameters for ML Models:**

- `--product_id`: Product ID (integer, required)
- `--date`: Date in YYYY-MM-DD format (required)
- `--price`: Product price (float, required)
- `--discount`: Discount percentage 0-100 (float, default: 0)
- `--category`: Product category (string, required)
- `--model_type`: Model type - `auto` (default), `ml`, or `timeseries`

**For Time-Series Models (overall daily demand):**

Predict total daily demand across all products:

```bash
python predict.py --date 2024-01-15 --model_type timeseries
```

**Parameters for Time-Series Models:**

- `--date`: Date in YYYY-MM-DD format (required)
- `--model_type`: Set to `timeseries` to use time-series models
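
The two parameter sets above could be declared with `argparse` roughly as follows. This is only a sketch of one way to express the interface, not the actual contents of `predict.py`; in particular, treating `--product_id`, `--price`, and `--category` as required for every mode except `timeseries` is an assumption.

```python
import argparse


def build_parser():
    """Declare the flags documented above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Predict product demand")
    parser.add_argument("--product_id", type=int)
    parser.add_argument("--date", required=True, help="YYYY-MM-DD")
    parser.add_argument("--price", type=float)
    parser.add_argument("--discount", type=float, default=0.0)
    parser.add_argument("--category")
    parser.add_argument("--model_type", choices=["auto", "ml", "timeseries"], default="auto")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: product-level flags are only optional in time-series mode.
    if args.model_type != "timeseries":
        missing = [n for n in ("product_id", "price", "category") if getattr(args, n) is None]
        if missing:
            parser.error("missing for ML prediction: " + ", ".join("--" + n for n in missing))
    if not 0 <= args.discount <= 100:
        parser.error("--discount must be between 0 and 100")
    return args


ml_args = parse_args(["--product_id", "1", "--date", "2024-01-15",
                      "--price", "100", "--discount", "10", "--category", "Electronics"])
ts_args = parse_args(["--date", "2024-01-15", "--model_type", "timeseries"])
print(ml_args.model_type, ts_args.model_type)
```

Validating after parsing, rather than marking flags `required=True`, lets one parser serve both the per-product and the aggregate mode.
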

**Example Predictions:**

```bash
# ML Model - Electronics product with discount
python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics

# ML Model - Clothing product without discount
python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing

# Time-Series Model - Overall daily demand
python predict.py --date 2024-07-06 --model_type timeseries

# Auto-detect best model (default)
python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports
```

### Step 3: Launch Interactive Dashboard (Optional)

Launch the Streamlit dashboard for interactive visualization and predictions:

```bash
streamlit run app.py
```

The dashboard will open in your default web browser (usually at `http://localhost:8501`).

**Dashboard Features:**

1. **Sales Trends Page**
   - Interactive filters (category, product, date range)
   - Daily sales trends visualization
   - Monthly sales trends
   - Category-wise analysis
   - Price vs demand relationship
   - Real-time statistics and metrics
2. **Demand Prediction Page**
   - Interactive prediction interface
   - Select model type (Auto/ML/Time-Series)
   - For ML models:
     - Product selection dropdown
     - Category selection
     - Price and discount sliders
     - Date picker
     - Product statistics display
   - For Time-Series models:
     - Date picker for future predictions
     - Overall daily demand forecast
   - Prediction insights and recommendations
3. **Model Comparison Page**
   - Side-by-side model performance comparison
   - MAE, RMSE, and R2 Score metrics
   - Visual charts comparing all models
   - Best model highlighting
   - Model type indicators (ML vs Time-Series)

**Dashboard Screenshots:**

- Interactive widgets for easy data exploration
- Real-time predictions with visual feedback
- Comprehensive model comparison charts

## Model Details

### Models Trained

1. **Linear Regression**
   - Simple linear model
   - Fast training and prediction
   - Good baseline model
2. **Random Forest Regressor**
   - Ensemble of decision trees
   - Handles non-linear relationships
   - Provides feature importance
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 15
     - min_samples_split: 5
     - min_samples_leaf: 2
3. **XGBoost Regressor** (Optional)
   - Gradient boosting algorithm
   - Often provides best performance
   - Handles complex patterns
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 6
     - learning_rate: 0.1
4. **ARIMA** (AutoRegressive Integrated Moving Average)
   - Classic time-series forecasting model
   - Captures trends and autocorrelation (a plain (p, d, q) model has no seasonal terms; seasonality needs a seasonal variant such as SARIMA)
   - Automatically selects best order (p, d, q)
   - Works on aggregated daily sales data
   - Uses chronological train/validation split
5. **Prophet** (Facebook's Time-Series Forecasting)
   - Designed for business time series
   - Handles seasonality (weekly, yearly)
   - Robust to missing data and outliers
   - Works on aggregated daily sales data
   - Uses chronological train/validation split

### Model Comparison: ML vs Time-Series

**Machine Learning Models:**

- ✅ Predict per-product demand
- ✅ Use product features (price, discount, category)
- ✅ Can handle new products with similar features
- ❌ May not capture long-term temporal patterns as well

**Time-Series Models:**

- ✅ Capture temporal patterns and trends
- ✅ Handle seasonality automatically
- ✅ Good for overall demand forecasting
- ❌ Predict aggregate demand, not per-product
- ❌ Don't use product-specific features

**The system automatically selects the best model based on R2 score across all model types.** Keep in mind that the ML models are scored on per-product rows while the time-series models are scored on aggregated daily totals, so the two groups' R2 values are measured on different targets.

### Feature Engineering

**For ML Models:**

The system extracts the following features from the input data:

**Date Features:**

- `day`: Day of month (1-31)
- `month`: Month (1-12)
- `day_of_week`: Day of week (0=Monday, 6=Sunday)
- `weekend`: Binary indicator (1 if weekend, 0 otherwise)
- `year`: Year
- `quarter`: Quarter of year (1-4)

**Original Features:**

- `product_id`: Encoded as categorical
- `price`: Numerical (scaled)
- `discount`: Numerical (scaled)
- `category`: Encoded as categorical

**Total Features**: 10 features after encoding and scaling
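
The six date features listed above can be derived from a single date as sketched below. `extract_date_features` is a hypothetical helper written with the standard library; the training script may compute the same values with pandas instead.

```python
from datetime import date


def extract_date_features(day):
    """Return the six date features described above for one date."""
    day_of_week = day.weekday()  # 0=Monday ... 6=Sunday
    return {
        "day": day.day,
        "month": day.month,
        "day_of_week": day_of_week,
        "weekend": 1 if day_of_week >= 5 else 0,
        "year": day.year,
        "quarter": (day.month - 1) // 3 + 1,
    }


features = extract_date_features(date.fromisoformat("2024-07-06"))
print(features)
# {'day': 6, 'month': 7, 'day_of_week': 5, 'weekend': 1, 'year': 2024, 'quarter': 3}
```
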

**For Time-Series Models:**

- Data is aggregated by date (total daily sales)
- Uses chronological split (80% train, 20% validation)
- Prophet automatically handles:
  - Weekly seasonality
  - Yearly seasonality
  - Trend components
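
The aggregation and chronological split can be illustrated with a few lines of plain Python. The helper names `daily_totals` and `chronological_split` are assumptions for this sketch, and the toy rows stand in for the real dataset.

```python
from collections import defaultdict
from datetime import date


def daily_totals(rows):
    """Sum sales_quantity per date and return (date, total) pairs, oldest first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["date"]] += row["sales_quantity"]
    return sorted(totals.items())


def chronological_split(series, train_fraction=0.8):
    """Split without shuffling so the validation part is strictly later in time."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Toy input: ten days, two hypothetical products per day.
rows = [
    {"date": date(2020, 1, d), "sales_quantity": q}
    for d in range(1, 11)
    for q in (10, 5)
]

series = daily_totals(rows)
train, valid = chronological_split(series)
print(len(train), len(valid), train[-1][0], valid[0][0])
```

Unlike a random split, this keeps every validation date after every training date, which avoids leaking future information into the fitted model.
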

## Evaluation Metrics

The system evaluates models using three metrics:

1. **MAE (Mean Absolute Error)**
   - Average absolute difference between predicted and actual values
   - Lower is better
   - Units: same as target variable (sales quantity)
2. **RMSE (Root Mean Squared Error)**
   - Square root of average squared differences
   - Penalizes large errors more than MAE
   - Lower is better
   - Units: same as target variable (sales quantity)
3. **R2 Score (Coefficient of Determination)**
   - Proportion of variance explained by the model
   - Range: -∞ to 1 (1 is perfect prediction)
   - Higher is better
   - Used for model selection

**Model Selection**: The model with the highest R2 score is selected as the best model.
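
For reference, the three metrics can be written out directly from their definitions. This is a dependency-free sketch; the training script presumably relies on a library implementation, and the sample values are invented for the demonstration.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2_score(actual, predicted):
    """1 - SS_res / SS_tot; not defined when all actual values are identical."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented values, only to exercise the functions.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

print(mae(actual, predicted))       # 2.5
print(rmse(actual, predicted))      # sqrt(6.5), about 2.55
print(r2_score(actual, predicted))  # 1 - 26/500, about 0.948
```

Because R2 compares the model against always predicting the mean, a model that does worse than that baseline gets a negative score, which is why the range has no lower bound.
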

## Visualizations

The training script generates several visualizations:

1. **Demand Trends Over Time** (`plots/demand_trends.png`)
   - Shows total daily sales quantity over the entire time period
   - Helps identify overall trends and patterns
2. **Monthly Average Demand** (`plots/monthly_demand.png`)
   - Bar chart showing average sales by month
   - Reveals seasonal patterns (e.g., holiday season spikes)
3. **Feature Importance** (`plots/feature_importance.png`)
   - Shows which features are most important for predictions
   - Only available for tree-based models (Random Forest, XGBoost)
4. **Model Comparison** (`plots/model_comparison.png`)
   - Side-by-side comparison of all models (ML and Time-Series)
   - Color-coded: ML models (blue) vs Time-Series models (orange/red)
   - Shows MAE, RMSE, and R2 Score for each model
5. **Time-Series Predictions** (`plots/timeseries_predictions.png`)
   - Actual vs predicted plots for ARIMA and Prophet models
   - Shows how well time-series models capture temporal patterns
   - Only generated if time-series models are available

## Example Predictions

Here are some example predictions to demonstrate the system:

```bash
# Example 1: Electronics on a weekday
python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
# Expected: Moderate demand (weekday, some discount)

# Example 2: Clothing on weekend
python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
# Expected: Higher demand (weekend, good discount)

# Example 3: Holiday season prediction
python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
# Expected: High demand (holiday season, good discount)
```

## Technical Details

### Data Preprocessing Pipeline

1. **Date Conversion**: Convert date strings to datetime objects
2. **Feature Extraction**: Extract temporal features from dates
3. **Missing Value Handling**: Fill missing values with median (if any)
4. **Categorical Encoding**: Label encode product_id and category
5. **Feature Scaling**: Standardize numerical features using StandardScaler
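
Steps 3 and 5 can be shown on a single numeric column. The sketch below uses the standard library rather than scikit-learn, and the helper names are hypothetical; it divides by the population standard deviation, which is also what `StandardScaler` does.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    return [median if v is None else v for v in values]


def fit_standard_scaler(values):
    """Return (mean, std) learned from training data; a zero std falls back to 1.0."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Shift to zero mean and scale to unit variance using the fitted parameters."""
    return [(v - mean) / std for v in values]


prices = [10.0, None, 30.0, 20.0]  # toy column with one missing entry
filled = fill_missing_with_median(prices)
mean, std = fit_standard_scaler(filled)
scaled = standardize(filled, mean, std)
print(filled, mean, scaled)
```

The fitted `mean` and `std` have to be stored with the model so that prediction-time inputs are scaled the same way as the training data.
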

### Model Training Pipeline

1. **Data Splitting**: 80% training, 20% validation
2. **Model Training**: Train all available models
3. **Evaluation**: Calculate MAE, RMSE, and R2 for each model
4. **Selection**: Choose model with highest R2 score
5. **Persistence**: Save model, encoders, and scaler
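
The selection step reduces to picking the entry with the largest R2 from a table of per-model metrics. In this sketch, `select_best_model`, the model names, and all numbers are placeholders; they are not results produced by this project.

```python
def select_best_model(results):
    """Return (name, metrics) for the entry with the highest R2 score."""
    if not results:
        raise ValueError("no models were evaluated")
    return max(results.items(), key=lambda item: item[1]["r2"])


# Placeholder metrics for illustration only.
results = {
    "model_a": {"mae": 9.0, "rmse": 12.0, "r2": 0.60},
    "model_b": {"mae": 6.0, "rmse": 8.0, "r2": 0.82},
    "model_c": {"mae": 7.0, "rmse": 9.5, "r2": 0.75},
}

best_name, best_metrics = select_best_model(results)
print(best_name, best_metrics["r2"])  # model_b 0.82
```
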

### Prediction Pipeline

1. **Load Model**: Load trained model and preprocessing objects
2. **Feature Preparation**: Extract features from input parameters
3. **Encoding**: Encode categorical variables using saved encoders
4. **Scaling**: Scale features using saved scaler
5. **Prediction**: Make prediction using loaded model
6. **Post-processing**: Ensure non-negative predictions
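
The post-processing step matters because a regression model can output a negative number, which makes no sense as a sales quantity. A minimal sketch follows, with a made-up linear function standing in for the loaded model; `predict_demand` and `toy_model` are hypothetical names.

```python
def predict_demand(model, features):
    """Run a fitted model on one feature row and clamp the result at zero."""
    raw = model(features)
    return max(0.0, raw)


def toy_model(features):
    """Stand-in for a trained model; the coefficients are invented."""
    return 50.0 - 0.5 * features["price"] + 1.5 * features["discount"]


print(predict_demand(toy_model, {"price": 50.0, "discount": 20.0}))   # 55.0
print(predict_demand(toy_model, {"price": 500.0, "discount": 10.0}))  # 0.0 (raw value is -185.0)
```
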

### Handling Unseen Data

The prediction script handles cases where:

- Product ID was not seen during training (uses default encoding)
- Category was not seen during training (uses default encoding)

Warnings are displayed in such cases.
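
One way to get this behavior is a label encoder that falls back to a fixed code for unknown values. The class below is a hypothetical sketch: the source does not say which default the script uses, so the `-1` here is an assumption.

```python
import warnings


class SafeLabelEncoder:
    """Label encoder that maps unseen labels to a default code and warns."""

    def __init__(self, default_code=-1):  # -1 is an assumed default
        self.default_code = default_code
        self.codes = {}

    def fit(self, labels):
        # Assign consecutive integer codes in sorted order.
        self.codes = {label: i for i, label in enumerate(sorted(set(labels)))}
        return self

    def transform_one(self, label):
        if label not in self.codes:
            warnings.warn(
                f"{label!r} was not seen during training; "
                f"using default code {self.default_code}"
            )
            return self.default_code
        return self.codes[label]


encoder = SafeLabelEncoder().fit(["Electronics", "Clothing", "Sports"])
print(encoder.transform_one("Electronics"))  # 1
print(encoder.transform_one("Toys"))         # -1, with a UserWarning
```

Predictions made with a fallback code should be treated with caution, since the model never saw that code during training.
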

## Learning Points

This project demonstrates:

1. **Supervised Learning**: Regression problem solving
2. **Feature Engineering**: Creating meaningful features from raw data
3. **Model Comparison**: Training and evaluating multiple models
4. **Model Selection**: Automatic best model selection
5. **Model Persistence**: Saving and loading trained models
6. **Production-Ready Code**: Clean, modular, well-documented code
7. **Time Series Features**: Extracting temporal patterns
8. **Categorical Encoding**: Handling categorical variables
9. **Feature Scaling**: Normalizing features for better performance
10. **Evaluation Metrics**: Understanding different regression metrics

## Troubleshooting

### Issue: "Model not found"

**Solution**: Run `python train_model.py` first to train and save the model.

### Issue: "XGBoost not available"

**Solution**: Install XGBoost with `pip install xgboost`. If you leave it uninstalled, the system still works and skips the XGBoost model.

### Issue: "Category not seen during training"

**Solution**: This is handled automatically with a warning. The system uses a default encoding.

### Issue: Poor prediction accuracy

**Solutions**:

- Ensure you have sufficient training data
- Check that input features are in the same range as training data
- Try retraining with different hyperparameters
- Consider adding more features or more training data

## Notes

- The synthetic dataset generator creates realistic patterns including:
  - Weekend effects (higher sales on weekends)
  - Seasonal patterns (holiday season spikes)
  - Price and discount effects
  - Category-specific base prices
- For production use, consider:
  - Using real historical data
  - Retraining models periodically
  - Adding more features (promotions, weather, etc.)
  - Implementing model versioning
  - Adding prediction confidence intervals
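
To make the generator patterns listed above concrete (weekend, seasonal, price, and discount effects), here is a sketch of how such effects could be combined into one expected demand value. Every multiplier, the holiday months, and the function itself are invented for illustration; `generate_dataset.py` may use different rules.

```python
from datetime import date


def expected_demand(base, day, price, reference_price, discount):
    """Combine hypothetical multiplicative effects into an expected demand."""
    weekend = 1.3 if day.weekday() >= 5 else 1.0      # assumed weekend uplift
    holiday = 1.5 if day.month in (11, 12) else 1.0   # assumed holiday-season spike
    price_effect = reference_price / price            # pricier than usual lowers demand
    discount_effect = 1.0 + discount / 100.0          # discount raises demand
    return base * weekend * holiday * price_effect * discount_effect


weekday = expected_demand(100, date(2024, 3, 15), 50.0, 50.0, 0)    # a Friday
weekend = expected_demand(100, date(2024, 7, 6), 50.0, 50.0, 0)     # a Saturday
holiday = expected_demand(100, date(2024, 12, 20), 50.0, 50.0, 25)  # December, discounted
print(weekday, weekend, holiday)
```

A real generator would typically add random noise on top of such an expected value and round to whole units.
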

## License

This project is provided as-is for educational purposes.

## Author

Created as a complete machine learning project demonstrating demand prediction for e-commerce.

---

**Happy Predicting!**