vikaswebdev committed on
Commit 7f90ea0 · verified · 1 Parent(s): 7637a10

Upload 17 files
.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+plots/demand_trends.png filter=lfs diff=lfs merge=lfs -text
+plots/feature_importance.png filter=lfs diff=lfs merge=lfs -text
+plots/model_comparison.png filter=lfs diff=lfs merge=lfs -text
+plots/monthly_demand.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
# Demand Prediction System for E-commerce

A complete machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised regression models with time-series models (ARIMA, Prophet) to find the best approach.

## 📋 Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Dataset](#dataset)
- [Usage](#usage)
- [Model Details](#model-details)
- [Evaluation Metrics](#evaluation-metrics)
- [Visualizations](#visualizations)
- [Example Predictions](#example-predictions)
- [Technical Details](#technical-details)

## 🎯 Overview

This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:

1. **Machine Learning Models**: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
2. **Time-Series Models**: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)

The system automatically selects the best-performing model across both approaches.

**Key Capabilities:**
- Predicts sales quantity for products on future dates (ML models)
- Predicts overall daily demand (time-series models)
- Handles temporal patterns and seasonality
- Considers price, discount, category, and date features (ML models)
- Captures time-series patterns and trends (time-series models)
- Automatically selects the best model from multiple candidates
- Provides comprehensive evaluation metrics
- Compares ML vs. time-series approaches

## ✨ Features

- **Data Preprocessing**: Automatic handling of missing values, date feature extraction
- **Feature Engineering**:
  - Date features (day, month, day_of_week, weekend, year, quarter)
  - Categorical encoding (product_id, category)
  - Feature scaling
- **Multiple Models**:
  - **Machine Learning Models:**
    - Linear Regression
    - Random Forest Regressor
    - XGBoost Regressor (optional)
  - **Time-Series Models:**
    - ARIMA (AutoRegressive Integrated Moving Average)
    - Prophet (Facebook's time-series forecasting tool)
- **Model Selection**: Automatic best-model selection based on R2 score
- **Evaluation Metrics**: MAE, RMSE, and R2 score
- **Visualizations**:
  - Demand trends over time
  - Monthly average demand
  - Feature importance
  - Model comparison
- **Model Persistence**: Save and load trained models using joblib
- **Future Predictions**: Predict demand for any product on any future date

## 📁 Project Structure

```
demand_prediction/
├── data/
│   └── sales.csv                      # Sales dataset
├── models/                            # Generated during training
│   ├── best_model.joblib              # Best ML model (if ML is best)
│   ├── best_timeseries_model.joblib   # Best time-series model (if TS is best)
│   ├── preprocessing.joblib           # Encoders and scaler (for ML models)
│   ├── model_metadata.json            # Model metadata (legacy)
│   └── all_models_metadata.json       # All-models comparison metadata
├── plots/                             # Generated during training
│   ├── demand_trends.png              # Time-series plot
│   ├── monthly_demand.png             # Monthly averages
│   ├── feature_importance.png         # Feature importance (ML models)
│   ├── model_comparison.png           # Model metrics comparison (all models)
│   └── timeseries_predictions.png     # Time-series model predictions
├── generate_dataset.py                # Script to generate a synthetic dataset
├── train_model.py                     # Main training script
├── predict.py                         # Prediction script
├── app.py                             # Streamlit dashboard (interactive web app)
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```

## 🚀 Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)

### Step 1: Navigate to the Project Directory

```bash
cd demand_prediction
```

### Step 2: Create a Virtual Environment (Recommended)

**Why use a virtual environment?**
- Keeps project dependencies isolated from your system Python
- Prevents conflicts with other projects
- Makes it easier to manage package versions
- Is standard practice for Python projects

**Quick Setup (Recommended):**

**Windows:**
```bash
setup_env.bat
```

**Linux/Mac:**
```bash
chmod +x setup_env.sh
./setup_env.sh
```

**Manual Setup:**

**Windows:**
```bash
python -m venv venv
venv\Scripts\activate
```

**Linux/Mac:**
```bash
python3 -m venv venv
source venv/bin/activate
```

After activation, you should see `(venv)` in your terminal prompt.

**To deactivate later:**
```bash
deactivate
```

### Step 3: Install Dependencies

```bash
pip install -r requirements.txt
```

**Note**: If you don't want to use XGBoost, you can remove it from `requirements.txt`. The system works fine without it and simply skips XGBoost model training.

**Alternative (without a virtual environment):** you can run the same `pip install -r requirements.txt` directly against your system Python, but this is **not recommended** because it may cause conflicts with other Python projects.

### Step 4: Generate the Dataset

If you don't have a dataset, generate a synthetic one:

```bash
python generate_dataset.py
```

This creates `data/sales.csv` with realistic e-commerce sales data.

## 📊 Dataset

The dataset should contain the following columns:

- **product_id**: Unique identifier for each product (integer)
- **date**: Date of sale (YYYY-MM-DD format)
- **price**: Product price (float)
- **discount**: Discount percentage (0-100, float)
- **category**: Product category (string)
- **sales_quantity**: Target variable - number of units sold (integer)

### Dataset Format Example

```csv
product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
...
```

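Before training, it can help to confirm that a CSV actually matches this schema. A minimal sketch (the rows below are illustrative, not taken from the real dataset):

```python
import pandas as pd

# Example rows matching the expected schema (values are illustrative)
df = pd.DataFrame({
    "product_id": [1, 2],
    "date": ["2020-01-01", "2020-01-01"],
    "price": [499.99, 29.99],
    "discount": [10.0, 0.0],
    "category": ["Electronics", "Clothing"],
    "sales_quantity": [45, 120],
})

# Parse dates and check for the required columns before training
df["date"] = pd.to_datetime(df["date"])
required = {"product_id", "date", "price", "discount", "category", "sales_quantity"}
missing = required - set(df.columns)
assert not missing, f"Dataset is missing columns: {missing}"
```

The same check works on the real file by replacing the inline frame with `pd.read_csv("data/sales.csv")`.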
## 💻 Usage

### Step 1: Train the Model

Train the model using the sales dataset:

```bash
python train_model.py
```

This will:
1. Load and preprocess the data
2. Extract features from dates
3. Encode categorical variables
4. Train multiple ML models (Linear Regression, Random Forest, XGBoost)
5. Prepare time-series data (aggregate daily sales)
6. Train time-series models (ARIMA, Prophet)
7. Evaluate each model using MAE, RMSE, and R2 score
8. Compare ML vs. time-series models
9. Select the best model automatically (across all model types)
10. Save the model and preprocessing objects
11. Generate visualizations

**Output:**
- Best model saved to `models/best_model.joblib` (ML) or `models/best_timeseries_model.joblib` (TS)
- Preprocessing objects saved to `models/preprocessing.joblib` (for ML models)
- Visualizations saved to the `plots/` directory
- All-models metadata saved to `models/all_models_metadata.json`

### Step 2: Make Predictions

**For ML Models (product-specific predictions):**

Predict demand for a specific product on a date:

```bash
python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
```

**Parameters for ML Models:**
- `--product_id`: Product ID (integer, required)
- `--date`: Date in YYYY-MM-DD format (required)
- `--price`: Product price (float, required)
- `--discount`: Discount percentage 0-100 (float, default: 0)
- `--category`: Product category (string, required)
- `--model_type`: Model type - `auto` (default), `ml`, or `timeseries`

**For Time-Series Models (overall daily demand):**

Predict total daily demand across all products:

```bash
python predict.py --date 2024-01-15 --model_type timeseries
```

**Parameters for Time-Series Models:**
- `--date`: Date in YYYY-MM-DD format (required)
- `--model_type`: Set to `timeseries` to use time-series models

**Example Predictions:**

```bash
# ML model - Electronics product with discount
python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics

# ML model - Clothing product without discount
python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing

# Time-series model - overall daily demand
python predict.py --date 2024-07-06 --model_type timeseries

# Auto-detect best model (default)
python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports
```

### Step 3: Launch the Interactive Dashboard (Optional)

Launch the Streamlit dashboard for interactive visualization and predictions:

```bash
streamlit run app.py
```

The dashboard opens in your default web browser (usually at `http://localhost:8501`).

**Dashboard Features:**

1. **📈 Sales Trends Page**
   - Interactive filters (category, product, date range)
   - Daily sales trends visualization
   - Monthly sales trends
   - Category-wise analysis
   - Price vs. demand relationship
   - Real-time statistics and metrics

2. **🔮 Demand Prediction Page**
   - Interactive prediction interface
   - Select model type (Auto/ML/Time-Series)
   - For ML models:
     - Product selection dropdown
     - Category selection
     - Price and discount sliders
     - Date picker
     - Product statistics display
   - For time-series models:
     - Date picker for future predictions
     - Overall daily demand forecast
   - Prediction insights and recommendations

3. **📊 Model Comparison Page**
   - Side-by-side model performance comparison
   - MAE, RMSE, and R2 score metrics
   - Visual charts comparing all models
   - Best-model highlighting
   - Model type indicators (ML vs. Time-Series)

**Dashboard Highlights:**
- Interactive widgets for easy data exploration
- Real-time predictions with visual feedback
- Comprehensive model comparison charts

## 🤖 Model Details

### Models Trained

1. **Linear Regression**
   - Simple linear model
   - Fast training and prediction
   - Good baseline model

2. **Random Forest Regressor**
   - Ensemble of decision trees
   - Handles non-linear relationships
   - Provides feature importance
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 15
     - min_samples_split: 5
     - min_samples_leaf: 2

3. **XGBoost Regressor** (Optional)
   - Gradient boosting algorithm
   - Often provides the best performance
   - Handles complex patterns
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 6
     - learning_rate: 0.1

4. **ARIMA** (AutoRegressive Integrated Moving Average)
   - Classic time-series forecasting model
   - Captures trends and seasonality
   - Automatically selects the best order (p, d, q)
   - Works on aggregated daily sales data
   - Uses a chronological train/validation split

5. **Prophet** (Facebook's Time-Series Forecasting)
   - Designed for business time series
   - Handles seasonality (weekly, yearly)
   - Robust to missing data and outliers
   - Works on aggregated daily sales data
   - Uses a chronological train/validation split

### Model Comparison: ML vs. Time-Series

**Machine Learning Models:**
- ✅ Predict per-product demand
- ✅ Use product features (price, discount, category)
- ✅ Can handle new products with similar features
- ❌ May not capture long-term temporal patterns as well

**Time-Series Models:**
- ✅ Capture temporal patterns and trends
- ✅ Handle seasonality automatically
- ✅ Good for overall demand forecasting
- ❌ Predict aggregate demand, not per-product
- ❌ Don't use product-specific features

**The system automatically selects the best model based on R2 score across all model types.**

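The selection rule can be sketched in a few lines. The dictionary below mirrors the shape of `all_models_metadata.json`, but the model names and metric values are made up for illustration:

```python
# Hypothetical per-model metrics in the shape produced by train_model.py's
# evaluation step (the numbers here are invented for illustration).
all_models = {
    "LinearRegression": {"mae": 12.4, "rmse": 16.9, "r2": 0.61},
    "RandomForest":     {"mae": 8.1,  "rmse": 11.2, "r2": 0.83},
    "ARIMA":            {"mae": 9.7,  "rmse": 13.0, "r2": 0.74},
}

# Pick the model (ML or time-series) with the highest R2 score
best_name = max(all_models, key=lambda name: all_models[name]["r2"])
print(best_name)
```

In this example `RandomForest` wins because it has the highest `r2` value.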
### Feature Engineering

**For ML Models:**

The system extracts the following features from the input data:

**Date Features:**
- `day`: Day of month (1-31)
- `month`: Month (1-12)
- `day_of_week`: Day of week (0=Monday, 6=Sunday)
- `weekend`: Binary indicator (1 if weekend, 0 otherwise)
- `year`: Year
- `quarter`: Quarter of year (1-4)

**Original Features:**
- `product_id`: Encoded as categorical
- `price`: Numerical (scaled)
- `discount`: Numerical (scaled)
- `category`: Encoded as categorical

**Total Features**: 10 features after encoding and scaling

**For Time-Series Models:**

- Data is aggregated by date (total daily sales)
- Uses a chronological split (80% train, 20% validation)
- Prophet automatically handles:
  - Weekly seasonality
  - Yearly seasonality
  - Trend components

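The date features listed above can be derived with pandas. A minimal sketch (not the actual `train_model.py` code):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2024-07-06", "2024-12-25"])})

# Extract the temporal features described above
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek          # 0=Monday, 6=Sunday
df["weekend"] = (df["day_of_week"] >= 5).astype(int)  # Saturday/Sunday flag
df["year"] = df["date"].dt.year
df["quarter"] = df["date"].dt.quarter
```

For example, 2024-07-06 is a Saturday, so its `weekend` flag is 1 and its `quarter` is 3.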
## 📈 Evaluation Metrics

The system evaluates models using three metrics:

1. **MAE (Mean Absolute Error)**
   - Average absolute difference between predicted and actual values
   - Lower is better
   - Units: same as the target variable (sales quantity)

2. **RMSE (Root Mean Squared Error)**
   - Square root of the average squared differences
   - Penalizes large errors more than MAE
   - Lower is better
   - Units: same as the target variable (sales quantity)

3. **R2 Score (Coefficient of Determination)**
   - Proportion of variance explained by the model
   - Range: -∞ to 1 (1 is perfect prediction)
   - Higher is better
   - Used for model selection

**Model Selection**: The model with the highest R2 score is selected as the best model.

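All three metrics are available in scikit-learn. A small self-contained example with made-up demand values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented actual vs. predicted sales quantities for illustration
y_true = np.array([40, 55, 30, 70])
y_pred = np.array([42, 50, 33, 65])

mae = mean_absolute_error(y_true, y_pred)            # average |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # sqrt of average squared error
r2 = r2_score(y_true, y_pred)                        # variance explained
```

Here `mae` is 3.75, `rmse` is about 3.97, and `r2` is about 0.93.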
## 📊 Visualizations

The training script generates several visualizations:

1. **Demand Trends Over Time** (`plots/demand_trends.png`)
   - Shows total daily sales quantity over the entire time period
   - Helps identify overall trends and patterns

2. **Monthly Average Demand** (`plots/monthly_demand.png`)
   - Bar chart showing average sales by month
   - Reveals seasonal patterns (e.g., holiday-season spikes)

3. **Feature Importance** (`plots/feature_importance.png`)
   - Shows which features are most important for predictions
   - Only available for tree-based models (Random Forest, XGBoost)

4. **Model Comparison** (`plots/model_comparison.png`)
   - Side-by-side comparison of all models (ML and time-series)
   - Color-coded: ML models (blue) vs. time-series models (orange/red)
   - Shows MAE, RMSE, and R2 score for each model

5. **Time-Series Predictions** (`plots/timeseries_predictions.png`)
   - Actual vs. predicted plots for the ARIMA and Prophet models
   - Shows how well the time-series models capture temporal patterns
   - Only generated if time-series models are available

## 🔮 Example Predictions

Here are some example predictions to demonstrate the system:

```bash
# Example 1: Electronics on a weekday
python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
# Expected: moderate demand (weekday, some discount)

# Example 2: Clothing on a weekend
python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
# Expected: higher demand (weekend, good discount)

# Example 3: Holiday-season prediction
python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
# Expected: high demand (holiday season, good discount)
```

## 🔧 Technical Details

### Data Preprocessing Pipeline

1. **Date Conversion**: Convert date strings to datetime objects
2. **Feature Extraction**: Extract temporal features from dates
3. **Missing Value Handling**: Fill missing values with the median (if any)
4. **Categorical Encoding**: Label-encode product_id and category
5. **Feature Scaling**: Standardize numerical features using StandardScaler

### Model Training Pipeline

1. **Data Splitting**: 80% training, 20% validation
2. **Model Training**: Train all available models
3. **Evaluation**: Calculate MAE, RMSE, and R2 for each model
4. **Selection**: Choose the model with the highest R2 score
5. **Persistence**: Save the model, encoders, and scaler

### Prediction Pipeline

1. **Load Model**: Load the trained model and preprocessing objects
2. **Feature Preparation**: Extract features from the input parameters
3. **Encoding**: Encode categorical variables using the saved encoders
4. **Scaling**: Scale features using the saved scaler
5. **Prediction**: Make a prediction using the loaded model
6. **Post-processing**: Ensure non-negative predictions

### Handling Unseen Data

The prediction script handles cases where:
- The product ID was not seen during training (uses a default encoding)
- The category was not seen during training (uses a default encoding)

Warnings are displayed in such cases.

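One way to implement the fallback described above is to wrap the encoder's `transform` call. This is a sketch only; the category names and default code are illustrative, not necessarily what `predict.py` does:

```python
from sklearn.preprocessing import LabelEncoder

# Encoder fitted on the categories seen during training (illustrative values)
encoder = LabelEncoder().fit(["Clothing", "Electronics", "Toys"])

def encode_category(value, encoder, default=0):
    """Encode a category, falling back to a default code for unseen labels."""
    try:
        return int(encoder.transform([value])[0])
    except ValueError:
        # LabelEncoder raises ValueError for labels it has never seen
        print(f"Warning: category '{value}' not seen during training; "
              f"using default encoding {default}")
        return default

known = encode_category("Toys", encoder)    # a label seen during training
unseen = encode_category("Books", encoder)  # falls back to the default code
```

Because `LabelEncoder` sorts its classes alphabetically, `"Toys"` encodes to 2 here, while the unseen `"Books"` falls back to 0.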
## 🎓 Learning Points

This project demonstrates:

1. **Supervised Learning**: Regression problem solving
2. **Feature Engineering**: Creating meaningful features from raw data
3. **Model Comparison**: Training and evaluating multiple models
4. **Model Selection**: Automatic best-model selection
5. **Model Persistence**: Saving and loading trained models
6. **Production-Ready Code**: Clean, modular, well-documented code
7. **Time-Series Features**: Extracting temporal patterns
8. **Categorical Encoding**: Handling categorical variables
9. **Feature Scaling**: Normalizing features for better performance
10. **Evaluation Metrics**: Understanding different regression metrics

## 🐛 Troubleshooting

### Issue: "Model not found"
**Solution**: Run `python train_model.py` first to train and save the model.

### Issue: "XGBoost not available"
**Solution**: Install XGBoost with `pip install xgboost`; otherwise the system works without it (skipping the XGBoost model).

### Issue: "Category not seen during training"
**Solution**: This is handled automatically with a warning. The system uses a default encoding.

### Issue: Poor prediction accuracy
**Solutions**:
- Ensure you have sufficient training data
- Check that input features are in the same range as the training data
- Try retraining with different hyperparameters
- Consider adding more features or more training data

## 📝 Notes

- The synthetic dataset generator creates realistic patterns including:
  - Weekend effects (higher sales on weekends)
  - Seasonal patterns (holiday-season spikes)
  - Price and discount effects
  - Category-specific base prices

- For production use, consider:
  - Using real historical data
  - Retraining models periodically
  - Adding more features (promotions, weather, etc.)
  - Implementing model versioning
  - Adding prediction confidence intervals

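A generator with the patterns listed above can be sketched as follows. This is not the actual `generate_dataset.py`; the effect magnitudes are invented to illustrate the weekend and holiday lifts:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2023-01-01", periods=365, freq="D")

# Base demand plus a weekend lift, a December holiday spike, and noise
base = 100.0
weekend_lift = np.where(dates.dayofweek >= 5, 25.0, 0.0)
holiday_lift = np.where(dates.month == 12, 40.0, 0.0)
noise = rng.normal(0, 10, size=len(dates))

sales = np.clip(base + weekend_lift + holiday_lift + noise, 0, None).round()
df = pd.DataFrame({"date": dates, "sales_quantity": sales.astype(int)})
```

Aggregating the result by weekday shows clearly higher average demand on Saturdays and Sundays, matching the weekend effect described above.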
## 📄 License

This project is provided as-is for educational purposes.

## 👤 Author

Created as a complete machine learning project demonstrating demand prediction for e-commerce.

---

**Happy Predicting! 🚀**

app.py ADDED
"""
Demand Prediction System - Streamlit Dashboard

Interactive dashboard for visualizing sales trends and making demand predictions.
"""

import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import json
from datetime import datetime, timedelta, date as dt_date
import os
import warnings

warnings.filterwarnings('ignore')

# Page configuration
st.set_page_config(
    page_title="Demand Prediction Dashboard",
    page_icon="📊",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom CSS for better styling
st.markdown("""
<style>
    .main-header {
        font-size: 2.5rem;
        font-weight: bold;
        color: #1f77b4;
        text-align: center;
        margin-bottom: 2rem;
    }
    .metric-card {
        background-color: #f0f2f6;
        padding: 1rem;
        border-radius: 0.5rem;
        margin: 0.5rem 0;
    }
    .stButton>button {
        width: 100%;
        background-color: #1f77b4;
        color: white;
    }
</style>
""", unsafe_allow_html=True)

# Configuration
DATA_PATH = 'data/sales.csv'
MODEL_DIR = 'models'
MODEL_PATH = f'{MODEL_DIR}/best_model.joblib'
TS_MODEL_PATH = f'{MODEL_DIR}/best_timeseries_model.joblib'
PREPROCESSING_PATH = f'{MODEL_DIR}/preprocessing.joblib'
ALL_MODELS_METADATA_PATH = f'{MODEL_DIR}/all_models_metadata.json'


@st.cache_data
def load_data():
    """Load sales data with caching."""
    if os.path.exists(DATA_PATH):
        df = pd.read_csv(DATA_PATH)
        df['date'] = pd.to_datetime(df['date'])
        return df
    return None


@st.cache_resource
def load_models():
    """Load trained models with caching."""
    models = {
        'ml_model': None,
        'ts_model': None,
        'preprocessing': None,
        'model_name': None,
        'is_timeseries': False,
        'metadata': None
    }

    # Load metadata
    if os.path.exists(ALL_MODELS_METADATA_PATH):
        with open(ALL_MODELS_METADATA_PATH, 'r') as f:
            models['metadata'] = json.load(f)
        models['model_name'] = models['metadata'].get('best_model', 'Unknown')
        models['is_timeseries'] = models['model_name'] in ['ARIMA', 'Prophet']

    # Load ML model
    if os.path.exists(MODEL_PATH):
        try:
            models['ml_model'] = joblib.load(MODEL_PATH)
        except Exception:
            pass

    # Load time-series model
    if os.path.exists(TS_MODEL_PATH):
        try:
            models['ts_model'] = joblib.load(TS_MODEL_PATH)
        except Exception:
            pass

    # Load preprocessing objects
    if os.path.exists(PREPROCESSING_PATH):
        try:
            models['preprocessing'] = joblib.load(PREPROCESSING_PATH)
        except Exception:
            pass

    return models


def prepare_features_ml(product_id, date, price, discount, category, preprocessing_data):
    """Prepare features for ML model prediction."""
    if preprocessing_data is None:
        return None

    # Convert date to a pandas Timestamp (handles date, datetime, and string)
    if isinstance(date, dt_date):
        date = pd.Timestamp(date)
    elif not isinstance(date, pd.Timestamp):
        date = pd.to_datetime(date)

    # Extract date features
    day = date.day
    month = date.month
    day_of_week = date.weekday()
    weekend = 1 if day_of_week >= 5 else 0
    year = date.year
    quarter = date.quarter

    # Encode categorical variables, falling back to defaults for unseen values
    category_encoder = preprocessing_data['encoders']['category']
    product_encoder = preprocessing_data['encoders']['product_id']

    try:
        category_encoded = category_encoder.transform([category])[0]
    except ValueError:
        category_encoded = 0

    try:
        product_id_encoded = product_encoder.transform([product_id])[0]
    except ValueError:
        product_id_encoded = product_encoder.transform([product_encoder.classes_[0]])[0]

    # Create feature dictionary
    feature_dict = {
        'price': price,
        'discount': discount,
        'day': day,
        'month': month,
        'day_of_week': day_of_week,
        'weekend': weekend,
        'year': year,
        'quarter': quarter,
        'category_encoded': category_encoded,
        'product_id_encoded': product_id_encoded
    }

    # Create the feature array in the same order as training
    feature_names = preprocessing_data['feature_names']
    features = np.array([[feature_dict[name] for name in feature_names]])

    # Scale features
    scaler = preprocessing_data['scaler']
    features_scaled = scaler.transform(features)

    return features_scaled


def predict_ml(product_id, date, price, discount, category, model, preprocessing_data):
    """Make a prediction using the ML model."""
    features = prepare_features_ml(product_id, date, price, discount, category, preprocessing_data)
    if features is None:
        return None
    prediction = model.predict(features)[0]
    return max(0, prediction)


def predict_timeseries(date, model, model_name):
    """Make a prediction using the time-series model."""
    # Convert date to a pandas Timestamp (handles date, datetime, and string)
    if isinstance(date, dt_date):
        date = pd.Timestamp(date)
    elif not isinstance(date, pd.Timestamp):
        date = pd.to_datetime(date)

    if model_name == 'ARIMA':
        try:
            # ARIMA forecasts the next step after the training window;
            # the requested date is not used directly here.
            forecast = model.forecast(steps=1)
            prediction = forecast.iloc[0] if hasattr(forecast, 'iloc') else float(forecast)
            return max(0, prediction)
        except Exception:
            return None

    elif model_name == 'Prophet':
        try:
            future = pd.DataFrame({'ds': [date]})
            forecast = model.predict(future)
            prediction = forecast['yhat'].iloc[0]
            return max(0, prediction)
        except Exception:
            return None

    return None


def main():
    """Main dashboard function."""

    # Header
    st.markdown('<h1 class="main-header">📊 Demand Prediction Dashboard</h1>', unsafe_allow_html=True)

    # Load data
    df = load_data()
    if df is None:
        st.error("❌ Sales data not found. Please run generate_dataset.py first.")
        return

    # Load models
    models = load_models()

    # Sidebar
    with st.sidebar:
        st.header("⚙️ Navigation")
        page = st.radio(
            "Select Page",
            ["📈 Sales Trends", "🔮 Demand Prediction", "📊 Model Comparison"],
            index=0
        )

        st.markdown("---")
        st.header("ℹ️ Information")
        if models['metadata']:
            best_model = models['metadata'].get('best_model', 'Unknown')
            st.info(f"**Best Model:** {best_model}")
            if best_model in models['metadata'].get('all_models', {}):
                metrics = models['metadata']['all_models'][best_model]
                st.metric("R2 Score", f"{metrics.get('r2', 0):.4f}")

    # Main content based on the selected page
    if page == "📈 Sales Trends":
        show_sales_trends(df)
    elif page == "🔮 Demand Prediction":
        show_prediction_interface(df, models)
    elif page == "📊 Model Comparison":
        show_model_comparison(models)


def show_sales_trends(df):
    """Display sales trends visualizations."""
    st.header("📈 Sales Trends Analysis")

    # Filters
    col1, col2, col3 = st.columns(3)

    with col1:
        categories = ['All'] + sorted(df['category'].unique().tolist())
        selected_category = st.selectbox("Select Category", categories)

    with col2:
        products = ['All'] + sorted(df['product_id'].unique().tolist())
        selected_product = st.selectbox("Select Product", products)

    with col3:
        date_range = st.date_input(
            "Select Date Range",
            value=(df['date'].min(), df['date'].max()),
            min_value=df['date'].min(),
            max_value=df['date'].max()
        )

    # Filter data
    filtered_df = df.copy()

    if selected_category != 'All':
        filtered_df = filtered_df[filtered_df['category'] == selected_category]

    if selected_product != 'All':
        filtered_df = filtered_df[filtered_df['product_id'] == int(selected_product)]

    if isinstance(date_range, tuple) and len(date_range) == 2:
        filtered_df = filtered_df[
            (filtered_df['date'] >= pd.to_datetime(date_range[0])) &
            (filtered_df['date'] <= pd.to_datetime(date_range[1]))
        ]

    if len(filtered_df) == 0:
        st.warning("No data available for selected filters.")
        return

    # Visualizations
    tab1, tab2, tab3, tab4 = st.tabs(["📅 Daily Trends", "📆 Monthly Trends", "📦 Category Analysis", "💰 Price vs Demand"])

    with tab1:
        st.subheader("Daily Sales Trends")
        daily_sales = filtered_df.groupby('date')['sales_quantity'].sum().reset_index()

        fig, ax = plt.subplots(figsize=(14, 6))
        ax.plot(daily_sales['date'], daily_sales['sales_quantity'], linewidth=2, alpha=0.7)
        ax.set_title('Total Daily Sales Quantity', fontsize=16, fontweight='bold')
        ax.set_xlabel('Date', fontsize=12)
        ax.set_ylabel('Sales Quantity', fontsize=12)
        ax.grid(True, alpha=0.3)
        plt.xticks(rotation=45)
        plt.tight_layout()
        st.pyplot(fig)

        # Statistics
        col1, col2, col3, col4 = st.columns(4)
        with col1:
            st.metric("Total Sales", f"{daily_sales['sales_quantity'].sum():,.0f}")
        with col2:
            st.metric("Average Daily", f"{daily_sales['sales_quantity'].mean():.1f}")
        with col3:
            st.metric("Max Daily", f"{daily_sales['sales_quantity'].max():,.0f}")
        with col4:
            st.metric("Min Daily", f"{daily_sales['sales_quantity'].min():,.0f}")

    with tab2:
        st.subheader("Monthly Sales Trends")
        filtered_df['month_year'] = filtered_df['date'].dt.to_period('M')
        monthly_sales = filtered_df.groupby('month_year')['sales_quantity'].sum().reset_index()
325
+ monthly_sales['month_year'] = monthly_sales['month_year'].astype(str)
326
+
327
+ fig, ax = plt.subplots(figsize=(14, 6))
328
+ ax.bar(range(len(monthly_sales)), monthly_sales['sales_quantity'], alpha=0.7, color='steelblue')
329
+ ax.set_title('Monthly Sales Quantity', fontsize=16, fontweight='bold')
330
+ ax.set_xlabel('Month', fontsize=12)
331
+ ax.set_ylabel('Sales Quantity', fontsize=12)
332
+ ax.set_xticks(range(len(monthly_sales)))
333
+ ax.set_xticklabels(monthly_sales['month_year'], rotation=45, ha='right')
334
+ ax.grid(True, alpha=0.3, axis='y')
335
+ plt.tight_layout()
336
+ st.pyplot(fig)
337
+
338
+ with tab3:
339
+ st.subheader("Sales by Category")
340
+ category_sales = filtered_df.groupby('category')['sales_quantity'].sum().sort_values(ascending=False)
341
+
342
+ fig, ax = plt.subplots(figsize=(12, 6))
343
+ category_sales.plot(kind='barh', ax=ax, color='coral', alpha=0.7)
344
+ ax.set_title('Total Sales by Category', fontsize=16, fontweight='bold')
345
+ ax.set_xlabel('Total Sales Quantity', fontsize=12)
346
+ ax.set_ylabel('Category', fontsize=12)
347
+ ax.grid(True, alpha=0.3, axis='x')
348
+ plt.tight_layout()
349
+ st.pyplot(fig)
350
+
351
+ # Category statistics table
352
+ category_stats = filtered_df.groupby('category').agg({
353
+ 'sales_quantity': ['sum', 'mean', 'std', 'min', 'max']
354
+ }).round(2)
355
+ category_stats.columns = ['Total', 'Average', 'Std Dev', 'Min', 'Max']
356
+ st.dataframe(category_stats, use_container_width=True)
357
+
358
+ with tab4:
359
+ st.subheader("Price vs Demand Relationship")
360
+
361
+ # Scatter plot
362
+ fig, ax = plt.subplots(figsize=(12, 6))
363
+ scatter = ax.scatter(filtered_df['price'], filtered_df['sales_quantity'],
364
+ c=filtered_df['discount'], cmap='viridis', alpha=0.6, s=50)
365
+ ax.set_title('Price vs Sales Quantity (colored by discount)', fontsize=16, fontweight='bold')
366
+ ax.set_xlabel('Price', fontsize=12)
367
+ ax.set_ylabel('Sales Quantity', fontsize=12)
368
+ ax.grid(True, alpha=0.3)
369
+ plt.colorbar(scatter, ax=ax, label='Discount %')
370
+ plt.tight_layout()
371
+ st.pyplot(fig)
372
+
373
+ # Correlation
374
+ correlation = filtered_df['price'].corr(filtered_df['sales_quantity'])
375
+ st.metric("Price-Demand Correlation", f"{correlation:.3f}")
376
+
377
+
378
+ def show_prediction_interface(df, models):
379
+ """Display interactive prediction interface."""
380
+ st.header("🔮 Demand Prediction")
381
+
382
+ # Check if models are available
383
+ if models['ml_model'] is None and models['ts_model'] is None:
384
+ st.error("❌ No trained models found. Please run train_model.py first.")
385
+ return
386
+
387
+ # Model selection
388
+ model_type = st.radio(
389
+ "Select Model Type",
390
+ ["Auto (Best Model)", "Machine Learning", "Time-Series"],
391
+ horizontal=True
392
+ )
393
+
394
+ st.markdown("---")
395
+
396
+ if model_type == "Time-Series" or (model_type == "Auto (Best Model)" and models['is_timeseries']):
397
+ # Time-series prediction
398
+ st.subheader("Overall Daily Demand Prediction")
399
+
400
+ col1, col2 = st.columns(2)
401
+ with col1:
402
+ prediction_date = st.date_input(
403
+ "Select Date for Prediction",
404
+ value=datetime.now().date() + timedelta(days=30),
405
+ min_value=df['date'].max().date() + timedelta(days=1)
406
+ )
407
+
408
+ with col2:
409
+ st.write("") # Spacing
410
+ st.write("") # Spacing
411
+
412
+ if st.button("🔮 Predict Demand", type="primary"):
413
+ if models['ts_model'] is None:
414
+ st.error("Time-series model not available.")
415
+ else:
416
+ with st.spinner("Making prediction..."):
417
+ prediction = predict_timeseries(
418
+ prediction_date,
419
+ models['ts_model'],
420
+ models['model_name']
421
+ )
422
+
423
+ if prediction is not None:
424
+ st.success("✅ Prediction Complete!")
425
+
426
+ col1, col2, col3 = st.columns(3)
427
+ with col1:
428
+ st.metric("Predicted Daily Demand", f"{prediction:,.0f} units")
429
+ with col2:
430
+ day_name = pd.to_datetime(prediction_date).strftime('%A')
431
+ st.metric("Day of Week", day_name)
432
+ with col3:
433
+ is_weekend = "Yes" if pd.to_datetime(prediction_date).weekday() >= 5 else "No"
434
+ st.metric("Weekend", is_weekend)
435
+
436
+ st.info("💡 This prediction represents the total daily demand across all products.")
437
+ else:
438
+ st.error("Failed to make prediction.")
439
+
440
+ else:
441
+ # ML model prediction
442
+ st.subheader("Product-Specific Demand Prediction")
443
+
444
+ # Get unique values for dropdowns
445
+ categories = sorted(df['category'].unique().tolist())
446
+ products = sorted(df['product_id'].unique().tolist())
447
+
448
+ col1, col2 = st.columns(2)
449
+
450
+ with col1:
451
+ selected_category = st.selectbox("Select Category", categories)
452
+ selected_product = st.selectbox("Select Product ID", products)
453
+ prediction_date = st.date_input(
454
+ "Select Date for Prediction",
455
+ value=datetime.now().date() + timedelta(days=30),
456
+ min_value=df['date'].max().date() + timedelta(days=1)
457
+ )
458
+
459
+ with col2:
460
+ price = st.number_input(
461
+ "Product Price ($)",
462
+ min_value=0.01,
463
+ value=100.0,
464
+ step=1.0,
465
+ format="%.2f"
466
+ )
467
+ discount = st.slider(
468
+ "Discount (%)",
469
+ min_value=0,
470
+ max_value=100,
471
+ value=0,
472
+ step=5
473
+ )
474
+
475
+ # Show product statistics
476
+ product_data = df[df['product_id'] == selected_product]
477
+ if len(product_data) > 0:
478
+ with st.expander("📊 Product Statistics"):
479
+ col1, col2, col3, col4 = st.columns(4)
480
+ with col1:
481
+ st.metric("Avg Price", f"${product_data['price'].mean():.2f}")
482
+ with col2:
483
+ st.metric("Avg Sales", f"{product_data['sales_quantity'].mean():.1f}")
484
+ with col3:
485
+ st.metric("Total Sales", f"{product_data['sales_quantity'].sum():,.0f}")
486
+ with col4:
487
+ st.metric("Category", selected_category)
488
+
489
+ if st.button("🔮 Predict Demand", type="primary"):
490
+ if models['ml_model'] is None or models['preprocessing'] is None:
491
+ st.error("ML model or preprocessing not available.")
492
+ else:
493
+ with st.spinner("Making prediction..."):
494
+ prediction = predict_ml(
495
+ selected_product,
496
+ prediction_date,
497
+ price,
498
+ discount,
499
+ selected_category,
500
+ models['ml_model'],
501
+ models['preprocessing']
502
+ )
503
+
504
+ if prediction is not None:
505
+ st.success("✅ Prediction Complete!")
506
+
507
+ col1, col2, col3, col4 = st.columns(4)
508
+ with col1:
509
+ st.metric("Predicted Demand", f"{prediction:,.0f} units")
510
+ with col2:
511
+ st.metric("Price", f"${price:.2f}")
512
+ with col3:
513
+ st.metric("Discount", f"{discount}%")
514
+ with col4:
515
+ day_name = pd.to_datetime(prediction_date).strftime('%A')
516
+ st.metric("Day", day_name)
517
+
518
+ # Additional insights
519
+ st.markdown("### 📈 Prediction Insights")
520
+ date_obj = pd.to_datetime(prediction_date)
521
+ is_weekend = date_obj.weekday() >= 5
522
+ month = date_obj.month
523
+
524
+ insights = []
525
+ if is_weekend:
526
+ insights.append("📅 Weekend - typically higher demand")
527
+ if month in [11, 12]:
528
+ insights.append("🎄 Holiday season - peak sales period")
529
+ if discount > 0:
530
+ insights.append(f"💰 {discount}% discount - may increase demand")
531
+
532
+ if insights:
533
+ for insight in insights:
534
+ st.info(insight)
535
+ else:
536
+ st.error("Failed to make prediction.")
537
+
538
+
539
+ def show_model_comparison(models):
540
+ """Display model comparison."""
541
+ st.header("📊 Model Comparison")
542
+
543
+ if models['metadata'] is None:
544
+ st.warning("Model metadata not available. Please run train_model.py first.")
545
+ return
546
+
547
+ metadata = models['metadata']
548
+ all_models = metadata.get('all_models', {})
549
+ best_model = metadata.get('best_model', 'Unknown')
550
+
551
+ if not all_models:
552
+ st.warning("No model comparison data available.")
553
+ return
554
+
555
+ # Model metrics table
556
+ st.subheader("Model Performance Metrics")
557
+
558
+ comparison_data = []
559
+ for model_name, metrics in all_models.items():
560
+ comparison_data.append({
561
+ 'Model': model_name,
562
+ 'Type': 'Time-Series' if model_name in ['ARIMA', 'Prophet'] else 'Machine Learning',
563
+ 'MAE': metrics.get('mae', 0),
564
+ 'RMSE': metrics.get('rmse', 0),
565
+ 'R2 Score': metrics.get('r2', 0)
566
+ })
567
+
568
+ comparison_df = pd.DataFrame(comparison_data)
569
+
570
+ # Highlight best model
571
+ def highlight_best(row):
572
+ if row['Model'] == best_model:
573
+ return ['background-color: #90EE90'] * len(row)
574
+ return [''] * len(row)
575
+
576
+ st.dataframe(
577
+ comparison_df.style.apply(highlight_best, axis=1),
578
+ use_container_width=True
579
+ )
580
+
581
+ # Visualizations
582
+ st.subheader("Performance Comparison Charts")
583
+
584
+ col1, col2 = st.columns(2)
585
+
586
+ with col1:
587
+ fig, ax = plt.subplots(figsize=(10, 6))
588
+ model_names = comparison_df['Model'].tolist()
589
+ mae_scores = comparison_df['MAE'].tolist()
590
+
591
+ colors = ['coral' if name in ['ARIMA', 'Prophet'] else 'skyblue' for name in model_names]
592
+ ax.bar(model_names, mae_scores, color=colors, alpha=0.7)
593
+ ax.set_title('MAE Comparison (Lower is Better)', fontsize=14, fontweight='bold')
594
+ ax.set_ylabel('MAE', fontsize=12)
595
+ ax.tick_params(axis='x', rotation=45)
596
+ ax.grid(True, alpha=0.3, axis='y')
597
+ plt.tight_layout()
598
+ st.pyplot(fig)
599
+
600
+ with col2:
601
+ fig, ax = plt.subplots(figsize=(10, 6))
602
+ r2_scores = comparison_df['R2 Score'].tolist()
603
+
604
+ colors = ['coral' if name in ['ARIMA', 'Prophet'] else 'skyblue' for name in model_names]
605
+ ax.bar(model_names, r2_scores, color=colors, alpha=0.7)
606
+ ax.set_title('R2 Score Comparison (Higher is Better)', fontsize=14, fontweight='bold')
607
+ ax.set_ylabel('R2 Score', fontsize=12)
608
+ ax.tick_params(axis='x', rotation=45)
609
+ ax.grid(True, alpha=0.3, axis='y')
610
+ plt.tight_layout()
611
+ st.pyplot(fig)
612
+
613
+ # Best model info
614
+ st.markdown("---")
615
+ st.success(f"🏆 **Best Model: {best_model}**")
616
+ if best_model in all_models:
617
+ best_metrics = all_models[best_model]
618
+ col1, col2, col3 = st.columns(3)
619
+ with col1:
620
+ st.metric("MAE", f"{best_metrics.get('mae', 0):.2f}")
621
+ with col2:
622
+ st.metric("RMSE", f"{best_metrics.get('rmse', 0):.2f}")
623
+ with col3:
624
+ st.metric("R2 Score", f"{best_metrics.get('r2', 0):.4f}")
625
+
626
+
627
+ if __name__ == "__main__":
628
+ main()
data/sales.csv ADDED
The diff for this file is too large to render. See raw diff
 
generate_dataset.py ADDED
@@ -0,0 +1,128 @@
1
+ """
2
+ Generate Synthetic E-commerce Sales Dataset
3
+
4
+ This script creates a realistic synthetic dataset for demand prediction.
5
+ The dataset includes temporal patterns, seasonality, and realistic relationships
6
+ between features and sales quantity.
7
+ """
8
+
9
+ import pandas as pd
10
+ import numpy as np
11
+ from datetime import datetime, timedelta
12
+
13
+ # Set random seed for reproducibility
14
+ np.random.seed(42)
15
+
16
+ # Configuration
17
+ NUM_PRODUCTS = 50
18
+ START_DATE = datetime(2020, 1, 1)
19
+ END_DATE = datetime(2023, 12, 31)
20
+ CATEGORIES = ['Electronics', 'Clothing', 'Home & Garden', 'Sports', 'Books',
21
+ 'Toys', 'Beauty', 'Automotive', 'Food & Beverages', 'Health']
22
+
23
+ # Generate date range
24
+ date_range = pd.date_range(start=START_DATE, end=END_DATE, freq='D')
25
+ num_days = len(date_range)
26
+
27
+ # Initialize lists to store data
28
+ data = []
29
+
30
+ # Generate data for each product
31
+ for product_id in range(1, NUM_PRODUCTS + 1):
32
+ # Assign category randomly
33
+ category = np.random.choice(CATEGORIES)
34
+
35
+ # Base price varies by category
36
+ category_base_prices = {
37
+ 'Electronics': 500,
38
+ 'Clothing': 50,
39
+ 'Home & Garden': 100,
40
+ 'Sports': 150,
41
+ 'Books': 20,
42
+ 'Toys': 30,
43
+ 'Beauty': 40,
44
+ 'Automotive': 300,
45
+ 'Food & Beverages': 25,
46
+ 'Health': 60
47
+ }
48
+
49
+ base_price = category_base_prices[category] * (0.8 + np.random.random() * 0.4)
50
+
51
+ # Generate daily records
52
+ for date in date_range:
53
+ # Day of week effect (weekends have higher sales)
54
+ day_of_week = date.weekday()
55
+ weekend_multiplier = 1.3 if day_of_week >= 5 else 1.0
56
+
57
+ # Monthly seasonality (higher sales in Nov-Dec, lower in Jan-Feb)
58
+ month = date.month
59
+ if month in [11, 12]: # Holiday season
60
+ seasonality = 1.5
61
+ elif month in [1, 2]: # Post-holiday slump
62
+ seasonality = 0.7
63
+ elif month in [6, 7, 8]: # Summer
64
+ seasonality = 1.2
65
+ else:
66
+ seasonality = 1.0
67
+
68
+ # Random discount (0-30%)
69
+ discount = np.random.choice([0, 5, 10, 15, 20, 25, 30], p=[0.4, 0.2, 0.15, 0.1, 0.08, 0.05, 0.02])
70
+
71
+ # Price with discount
72
+ price = base_price * (1 - discount / 100)
73
+
74
+ # Base demand is re-drawn for each product-day record (it varies across days as well as products)
75
+ base_demand = np.random.randint(10, 100)
76
+
77
+ # Calculate sales quantity with multiple factors
78
+ # Higher discount -> higher sales
79
+ discount_effect = 1 + (discount / 100) * 0.5
80
+
81
+ # Lower price -> higher sales (inverse relationship)
82
+ price_effect = 1 / (1 + (price / 1000) * 0.1)
83
+
84
+ # Add some randomness
85
+ noise = np.random.normal(1, 0.15)
86
+
87
+ # Calculate final sales quantity
88
+ sales_quantity = int(
89
+ base_demand *
90
+ weekend_multiplier *
91
+ seasonality *
92
+ discount_effect *
93
+ price_effect *
94
+ noise
95
+ )
96
+
97
+ # Ensure non-negative
98
+ sales_quantity = max(0, sales_quantity)
99
+
100
+ data.append({
101
+ 'product_id': product_id,
102
+ 'date': date.strftime('%Y-%m-%d'),
103
+ 'price': round(price, 2),
104
+ 'discount': discount,
105
+ 'category': category,
106
+ 'sales_quantity': sales_quantity
107
+ })
108
+
109
+ # Create DataFrame
110
+ df = pd.DataFrame(data)
111
+
112
+ # Shuffle the data
113
+ df = df.sample(frac=1, random_state=42).reset_index(drop=True)
114
+
115
+ # Save to CSV
116
+ output_path = 'data/sales.csv'
117
+ df.to_csv(output_path, index=False)
118
+
119
+ print("Dataset generated successfully!")
120
+ print(f"Total records: {len(df)}")
121
+ print(f"Date range: {df['date'].min()} to {df['date'].max()}")
122
+ print(f"Number of products: {df['product_id'].nunique()}")
123
+ print(f"Categories: {df['category'].nunique()}")
124
+ print(f"\nDataset saved to: {output_path}")
125
+ print(f"\nFirst few rows:")
126
+ print(df.head(10))
127
+ print(f"\nDataset statistics:")
128
+ print(df.describe())
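The script above multiplies a base demand by a weekend multiplier, a monthly seasonality factor, a discount effect, and an inverse price effect. A worked sketch with illustrative values (the script draws `base_demand` and `noise` randomly; here they are fixed so the arithmetic is visible):

```python
# Worked example of the synthetic demand formula from generate_dataset.py.
base_demand = 50          # script draws this from randint(10, 100)
weekend_multiplier = 1.3  # Saturday/Sunday boost
seasonality = 1.5         # November/December holiday peak
discount = 20             # percent
price = 400.0             # discounted price in dollars

discount_effect = 1 + (discount / 100) * 0.5   # 20% off -> 1.10
price_effect = 1 / (1 + (price / 1000) * 0.1)  # $400 -> ~0.962
noise = 1.0                                    # script uses N(1, 0.15); fixed here

sales_quantity = int(base_demand * weekend_multiplier * seasonality
                     * discount_effect * price_effect * noise)
print(sales_quantity)  # 103
```

A holiday-season weekend day with a 20% discount roughly doubles the base demand of 50 units in this setup.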
models/all_models_metadata.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "best_model": "XGBoost",
3
+ "best_metrics": {
4
+ "mae": 28.266036987304688,
5
+ "rmse": 34.58608132593768,
6
+ "r2": 0.19758522510528564
7
+ },
8
+ "all_models": {
9
+ "Linear Regression": {
10
+ "mae": 28.94336905682285,
11
+ "rmse": 35.499695024759994,
12
+ "r2": 0.15463271181982996
13
+ },
14
+ "Random Forest": {
15
+ "mae": 28.52530939054232,
16
+ "rmse": 34.98799112718141,
17
+ "r2": 0.17882785374399268
18
+ },
19
+ "XGBoost": {
20
+ "mae": 28.266036987304688,
21
+ "rmse": 34.58608132593768,
22
+ "r2": 0.19758522510528564
23
+ }
24
+ },
25
+ "saved_at": "2026-02-06 17:29:16"
26
+ }
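The metadata above ranks three regressors by held-out metrics; `best_model` is the one with the highest R². A minimal sketch of that selection, with the metric values copied (rounded) from the JSON:

```python
# Selecting the best model by R^2 from metadata shaped like
# models/all_models_metadata.json (values copied from the file above).
all_models = {
    "Linear Regression": {"mae": 28.943, "rmse": 35.500, "r2": 0.1546},
    "Random Forest":     {"mae": 28.525, "rmse": 34.988, "r2": 0.1788},
    "XGBoost":           {"mae": 28.266, "rmse": 34.586, "r2": 0.1976},
}

best_name = max(all_models, key=lambda name: all_models[name]["r2"])
print(best_name)  # XGBoost, matching "best_model" in the JSON
```

MAE and RMSE point the same way here (lower is better for both), so any of the three metrics would pick XGBoost.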
models/best_model.joblib ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:815655a55967c29f9302ce2d52cd6707ae584cbcb25532722d4a2415acd246a1
3
+ size 495387
models/model_metadata.json ADDED
@@ -0,0 +1,21 @@
1
+ {
2
+ "model_name": "XGBoost",
3
+ "metrics": {
4
+ "mae": 28.266036987304688,
5
+ "rmse": 34.58608132593768,
6
+ "r2": 0.19758522510528564
7
+ },
8
+ "feature_names": [
9
+ "price",
10
+ "discount",
11
+ "day",
12
+ "month",
13
+ "day_of_week",
14
+ "weekend",
15
+ "year",
16
+ "quarter",
17
+ "category_encoded",
18
+ "product_id_encoded"
19
+ ],
20
+ "saved_at": "2026-02-06 17:29:16"
21
+ }
models/preprocessing.joblib ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8e093b6491538c4afa8a7dfcf6bc7e82118d4f5fc72fa024e2079f98a5b57cc5
3
+ size 2252
plots/demand_trends.png ADDED

Git LFS Details

  • SHA256: 4e025111de45c39ee92e9d92c676e7f815e6adce2d5766c99f9a89462e3d75a8
  • Pointer size: 131 Bytes
  • Size of remote file: 663 kB
plots/feature_importance.png ADDED

Git LFS Details

  • SHA256: d635f5cdaf028e35844fe657f83509ff0696a4dbaed6bf8d216c9bd59d1791e6
  • Pointer size: 131 Bytes
  • Size of remote file: 102 kB
plots/model_comparison.png ADDED

Git LFS Details

  • SHA256: 35e5f71b23c88502e8cb873477bf74eeca319cf7376ddea3f123660a7e5a4c4e
  • Pointer size: 131 Bytes
  • Size of remote file: 204 kB
plots/monthly_demand.png ADDED

Git LFS Details

  • SHA256: 23bb6eb77daf4871727835e5006b23a6751f9ac1a9b25faec70915e1fd665d86
  • Pointer size: 131 Bytes
  • Size of remote file: 116 kB
predict.py ADDED
@@ -0,0 +1,403 @@
1
+ """
2
+ Demand Prediction System - Prediction Script
3
+
4
+ This script loads a trained model and makes demand predictions for products
5
+ on future dates. Supports both ML models and time-series models (ARIMA, Prophet).
6
+
7
+ Usage (ML Models):
8
+ python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
9
+
10
+ Usage (Time-Series Models - overall demand):
11
+ python predict.py --date 2024-01-15 --model_type timeseries
12
+ """
13
+
14
+ import pandas as pd
15
+ import numpy as np
16
+ import joblib
17
+ import json
18
+ import argparse
19
+ from datetime import datetime
20
+ import os
21
+ import warnings
22
+ warnings.filterwarnings('ignore')
23
+
24
+ # Configuration
25
+ MODEL_DIR = 'models'
26
+ MODEL_PATH = f'{MODEL_DIR}/best_model.joblib'
27
+ TS_MODEL_PATH = f'{MODEL_DIR}/best_timeseries_model.joblib'
28
+ PREPROCESSING_PATH = f'{MODEL_DIR}/preprocessing.joblib'
29
+ METADATA_PATH = f'{MODEL_DIR}/model_metadata.json'
30
+ ALL_MODELS_METADATA_PATH = f'{MODEL_DIR}/all_models_metadata.json'
31
+
32
+
33
+ def load_model_and_preprocessing(model_type='auto'):
34
+ """
35
+ Load the trained model and preprocessing objects.
36
+
37
+ Args:
38
+ model_type: 'ml', 'timeseries', or 'auto' (auto-detect best model)
39
+
40
+ Returns:
41
+ tuple: (model, preprocessing_data, model_name, is_timeseries)
42
+ """
43
+ # Load metadata to determine best model
44
+ if os.path.exists(ALL_MODELS_METADATA_PATH):
45
+ with open(ALL_MODELS_METADATA_PATH, 'r') as f:
46
+ all_metadata = json.load(f)
47
+ best_model_name = all_metadata.get('best_model', 'Unknown')
48
+ else:
49
+ best_model_name = None
50
+
51
+ # Determine which model to use
52
+ if model_type == 'auto':
53
+ if best_model_name in ['ARIMA', 'Prophet']:
54
+ model_type = 'timeseries'
55
+ else:
56
+ model_type = 'ml'
57
+
58
+ is_timeseries = (model_type == 'timeseries')
59
+
60
+ if is_timeseries:
61
+ # Load time-series model
62
+ if not os.path.exists(TS_MODEL_PATH):
63
+ raise FileNotFoundError(
64
+ f"Time-series model not found at {TS_MODEL_PATH}. Please run train_model.py first."
65
+ )
66
+
67
+ print("Loading time-series model...")
68
+ model = joblib.load(TS_MODEL_PATH)
69
+ preprocessing_data = None
70
+
71
+ if best_model_name:
72
+ print(f"Model: {best_model_name}")
73
+ if best_model_name in all_metadata.get('all_models', {}):
74
+ metrics = all_metadata['all_models'][best_model_name]
75
+ print(f"R2 Score: {metrics.get('r2', 'N/A'):.4f}")
76
+
77
+ return model, preprocessing_data, best_model_name or 'Time-Series', True
78
+ else:
79
+ # Load ML model
80
+ if not os.path.exists(MODEL_PATH):
81
+ raise FileNotFoundError(
82
+ f"ML model not found at {MODEL_PATH}. Please run train_model.py first."
83
+ )
84
+
85
+ if not os.path.exists(PREPROCESSING_PATH):
86
+ raise FileNotFoundError(
87
+ f"Preprocessing objects not found at {PREPROCESSING_PATH}. Please run train_model.py first."
88
+ )
89
+
90
+ print("Loading ML model and preprocessing objects...")
91
+ model = joblib.load(MODEL_PATH)
92
+ preprocessing_data = joblib.load(PREPROCESSING_PATH)
93
+
94
+ # Load metadata if available
95
+ if os.path.exists(METADATA_PATH):
96
+ with open(METADATA_PATH, 'r') as f:
97
+ metadata = json.load(f)
98
+ model_name = metadata.get('model_name', 'ML Model')
99
+ print(f"Model: {model_name}")
100
+ print(f"R2 Score: {metadata.get('metrics', {}).get('r2', 'N/A'):.4f}")
101
+ else:
102
+ model_name = best_model_name or 'ML Model'
103
+
104
+ return model, preprocessing_data, model_name, False
105
+
106
+
107
+ def prepare_features(product_id, date, price, discount, category, preprocessing_data):
108
+ """
109
+ Prepare features for prediction using the same preprocessing pipeline.
110
+
111
+ Args:
112
+ product_id: Product ID
113
+ date: Date string (YYYY-MM-DD) or datetime object
114
+ price: Product price
115
+ discount: Discount percentage (0-100)
116
+ category: Product category
117
+ preprocessing_data: Dictionary containing encoders and scaler
118
+
119
+ Returns:
120
+ numpy array: Prepared features for prediction
121
+ """
122
+ # Convert date to datetime if string
123
+ if isinstance(date, str):
124
+ date = pd.to_datetime(date)
125
+
126
+ # Extract date features (same as in training)
127
+ day = date.day
128
+ month = date.month
129
+ day_of_week = date.weekday() # 0=Monday, 6=Sunday
130
+ weekend = 1 if day_of_week >= 5 else 0
131
+ year = date.year
132
+ quarter = date.quarter
133
+
134
+ # Encode categorical variables
135
+ category_encoder = preprocessing_data['encoders']['category']
136
+ product_encoder = preprocessing_data['encoders']['product_id']
137
+
138
+ # Handle unseen categories/products
139
+ try:
140
+ category_encoded = category_encoder.transform([category])[0]
141
+ except ValueError:
142
+ # If category not seen during training, use most common category
143
+ print(f"Warning: Category '{category}' not seen during training. Using default encoding.")
144
+ category_encoded = 0
145
+
146
+ try:
147
+ product_id_encoded = product_encoder.transform([product_id])[0]
148
+ except ValueError:
149
+ # If product_id not seen during training, use mean encoding
150
+ print(f"Warning: Product ID '{product_id}' not seen during training. Using default encoding.")
151
+ product_id_encoded = product_encoder.transform([product_encoder.classes_[0]])[0]
152
+
153
+ # Create feature dictionary
154
+ feature_dict = {
155
+ 'price': price,
156
+ 'discount': discount,
157
+ 'day': day,
158
+ 'month': month,
159
+ 'day_of_week': day_of_week,
160
+ 'weekend': weekend,
161
+ 'year': year,
162
+ 'quarter': quarter,
163
+ 'category_encoded': category_encoded,
164
+ 'product_id_encoded': product_id_encoded
165
+ }
166
+
167
+ # Create feature array in the same order as training
168
+ feature_names = preprocessing_data['feature_names']
169
+ features = np.array([[feature_dict[name] for name in feature_names]])
170
+
171
+ # Scale features
172
+ scaler = preprocessing_data['scaler']
173
+ features_scaled = scaler.transform(features)
174
+
175
+ return features_scaled
176
+
177
+
178
+ def predict_demand_ml(product_id, date, price, discount, category, model, preprocessing_data):
179
+ """
180
+ Predict demand for a product on a given date using ML model.
181
+
182
+ Args:
183
+ product_id: Product ID
184
+ date: Date string (YYYY-MM-DD) or datetime object
185
+ price: Product price
186
+ discount: Discount percentage (0-100)
187
+ category: Product category
188
+ model: Trained ML model
189
+ preprocessing_data: Dictionary containing encoders and scaler
190
+
191
+ Returns:
192
+ float: Predicted sales quantity
193
+ """
194
+ # Prepare features
195
+ features = prepare_features(product_id, date, price, discount, category, preprocessing_data)
196
+
197
+ # Make prediction
198
+ prediction = model.predict(features)[0]
199
+
200
+ # Ensure non-negative prediction
201
+ prediction = max(0, prediction)
202
+
203
+ return prediction
204
+
205
+
206
+ def predict_demand_timeseries(date, model, model_name):
207
+ """
208
+ Predict overall daily demand using time-series model.
209
+
210
+ Args:
211
+ date: Date string (YYYY-MM-DD) or datetime object
212
+ model: Trained time-series model (ARIMA or Prophet)
213
+ model_name: Name of the model ('ARIMA' or 'Prophet')
214
+
215
+ Returns:
216
+ float: Predicted total daily sales quantity
217
+ """
218
+ # Convert date to datetime if string
219
+ if isinstance(date, str):
220
+ date = pd.to_datetime(date)
221
+
222
+ if model_name == 'ARIMA':
223
+ # For ARIMA, we need to calculate how many steps ahead
224
+ # This is a simplified approach - in practice, you'd need the training end date
225
+ # For now, predict 1 step ahead
226
+ try:
227
+ forecast = model.forecast(steps=1)
228
+ prediction = forecast[0] if hasattr(forecast, '__iter__') else forecast
229
+ prediction = max(0, prediction)
230
+ return prediction
231
+ except Exception as e:
232
+ print(f"Error in ARIMA prediction: {e}")
233
+ return None
234
+
235
+ elif model_name == 'Prophet':
236
+ # For Prophet, create a future dataframe
237
+ try:
238
+ future = pd.DataFrame({'ds': [date]})
239
+ forecast = model.predict(future)
240
+ prediction = forecast['yhat'].iloc[0]
241
+ prediction = max(0, prediction)
242
+ return prediction
243
+ except Exception as e:
244
+ print(f"Error in Prophet prediction: {e}")
245
+ return None
246
+
247
+ else:
248
+ print(f"Unknown time-series model: {model_name}")
249
+ return None
250
+
251
+
252
+ def predict_batch(predictions_data, model, preprocessing_data):
253
+ """
254
+ Predict demand for multiple products/dates at once.
255
+
256
+ Args:
257
+ predictions_data: List of dictionaries, each containing:
258
+ - product_id
259
+ - date
260
+ - price
261
+ - discount
262
+ - category
263
+ model: Trained model
264
+ preprocessing_data: Dictionary containing encoders and scaler
265
+
266
+ Returns:
267
+ list: List of predicted sales quantities
268
+ """
269
+ predictions = []
270
+
271
+ for data in predictions_data:
272
+ pred = predict_demand(
273
+ data['product_id'],
274
+ data['date'],
275
+ data['price'],
276
+ data['discount'],
277
+ data['category'],
278
+ model,
279
+ preprocessing_data
280
+ )
281
+ predictions.append(pred)
282
+
283
+ return predictions
284
+
285
+
286
+ def main():
287
+ """
288
+ Main function for command-line interface.
289
+ """
290
+ parser = argparse.ArgumentParser(
291
+ description='Predict product demand for a given date and product details',
292
+ formatter_class=argparse.RawDescriptionHelpFormatter,
293
+ epilog="""
294
+ Examples (ML Models):
295
+ python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
296
+ python predict.py --product_id 5 --date 2024-06-20 --price 50 --discount 0 --category Clothing
297
+
298
+ Examples (Time-Series Models - overall daily demand):
299
+ python predict.py --date 2024-01-15 --model_type timeseries
300
+ """
301
+ )
302
+
303
+     parser.add_argument('--product_id', type=int, default=None,
+                         help='Product ID (required for ML models)')
+     parser.add_argument('--date', type=str, required=True,
+                         help='Date in YYYY-MM-DD format')
+     parser.add_argument('--price', type=float, default=None,
+                         help='Product price (required for ML models)')
+     parser.add_argument('--discount', type=float, default=0,
+                         help='Discount percentage (0-100), default: 0 (for ML models)')
+     parser.add_argument('--category', type=str, default=None,
+                         help='Product category (required for ML models)')
+     parser.add_argument('--model_type', type=str, default='auto',
+                         choices=['auto', 'ml', 'timeseries'],
+                         help='Model type to use: auto (best model), ml, or timeseries')
+
+     args = parser.parse_args()
+
+     # Validate date format
+     try:
+         date_obj = pd.to_datetime(args.date)
+     except ValueError:
+         print(f"Error: Invalid date format '{args.date}'. Please use YYYY-MM-DD format.")
+         return
+
+     # Load model and preprocessing
+     try:
+         model, preprocessing_data, model_name, is_timeseries = load_model_and_preprocessing(args.model_type)
+     except FileNotFoundError as e:
+         print(f"Error: {e}")
+         return
+
+     # Validate arguments based on model type
+     if not is_timeseries:
+         # ML model requires product details
+         if args.product_id is None or args.price is None or args.category is None:
+             print("Error: ML models require --product_id, --price, and --category arguments.")
+             return
+
+         # Validate discount range
+         if args.discount < 0 or args.discount > 100:
+             print(f"Warning: Discount {args.discount}% is outside 0-100 range. Clamping to valid range.")
+             args.discount = max(0, min(100, args.discount))
+
+     # Make prediction
+     print("\n" + "="*60)
+     print("MAKING PREDICTION")
+     print("="*60)
+     print(f"Model: {model_name}")
+     print(f"Model Type: {'Time-Series' if is_timeseries else 'Machine Learning'}")
+     print(f"Date: {args.date}")
+
+     if not is_timeseries:
+         print(f"Product ID: {args.product_id}")
+         print(f"Price: ${args.price:.2f}")
+         print(f"Discount: {args.discount}%")
+         print(f"Category: {args.category}")
+
+     print("-"*60)
+
+     if is_timeseries:
+         predicted_demand = predict_demand_timeseries(
+             args.date,
+             model,
+             model_name
+         )
+
+         if predicted_demand is None:
+             print("Error: Failed to make prediction.")
+             return
+
+         print(f"\nPredicted Total Daily Sales Quantity: {predicted_demand:.0f} units")
+         print("(This is the predicted total demand across all products for this date)")
+     else:
+         predicted_demand = predict_demand_ml(
+             args.product_id,
+             args.date,
+             args.price,
+             args.discount,
+             args.category,
+             model,
+             preprocessing_data
+         )
+
+         print(f"\nPredicted Sales Quantity: {predicted_demand:.0f} units")
+         print("(This is the predicted demand for this specific product)")
+
+     print("="*60)
+
+     # Additional information
+     date_obj = pd.to_datetime(args.date)
+     day_name = date_obj.strftime('%A')
+     is_weekend = "Yes" if date_obj.weekday() >= 5 else "No"
+
+     print(f"\nDate Information:")
+     print(f"  Day of week: {day_name}")
+     print(f"  Weekend: {is_weekend}")
+     print(f"  Month: {date_obj.strftime('%B')}")
+     print(f"  Quarter: Q{date_obj.quarter}")
+
+
+ if __name__ == "__main__":
+     main()
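The CLI above validates `--date` and clamps `--discount` before predicting. A minimal stand-alone sketch of those two checks (plain `datetime` instead of pandas; function names are illustrative, not part of the script):

```python
from datetime import datetime

def clamp_discount(discount: float) -> float:
    """Clamp a discount percentage into 0-100, mirroring the
    warning-and-clamp behaviour in main()."""
    return max(0.0, min(100.0, discount))

def is_valid_date(text: str) -> bool:
    """Return True if text parses as YYYY-MM-DD, the format
    the --date argument expects."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False

print(clamp_discount(150))           # 100.0
print(is_valid_date("2024-02-30"))   # False (no Feb 30)
```

Note that `strptime` also rejects impossible calendar dates, not just malformed strings, which is why a simple regex would be a weaker check here.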
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ pandas>=1.5.0
+ numpy>=1.23.0
+ scikit-learn>=1.2.0
+ matplotlib>=3.6.0
+ seaborn>=0.12.0
+ joblib>=1.2.0
+ xgboost>=1.7.0
+ statsmodels>=0.14.0
+ prophet>=1.1.0
+ streamlit>=1.28.0
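The requirements file pins minimum versions (`>=` floors). The comparison pip performs is numeric per component, not lexicographic; a simplified sketch of that rule (a toy helper, not pip's actual resolver, which also handles pre-releases and epochs):

```python
def meets_minimum(installed: str, required: str) -> bool:
    """Compare dotted version strings component-by-component,
    e.g. '1.10.2' >= '1.5.0'. String comparison would get this
    wrong because '10' < '5' lexicographically."""
    to_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return to_tuple(installed) >= to_tuple(required)

print(meets_minimum("1.10.2", "1.5.0"))  # True
print("1.10.2" >= "1.5.0")              # False -- the lexicographic trap
```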
setup_env.bat ADDED
@@ -0,0 +1,26 @@
+ @echo off
+ echo Creating virtual environment...
+ python -m venv venv
+
+ echo.
+ echo Activating virtual environment...
+ call venv\Scripts\activate.bat
+
+ echo.
+ echo Installing dependencies...
+ pip install --upgrade pip
+ pip install -r requirements.txt
+
+ echo.
+ echo ========================================
+ echo Setup complete!
+ echo ========================================
+ echo.
+ echo To activate the virtual environment in the future, run:
+ echo   venv\Scripts\activate
+ echo.
+ echo To deactivate, run:
+ echo   deactivate
+ echo.
+
+ pause
setup_env.sh ADDED
@@ -0,0 +1,25 @@
+ #!/bin/bash
+
+ echo "Creating virtual environment..."
+ python3 -m venv venv
+
+ echo ""
+ echo "Activating virtual environment..."
+ source venv/bin/activate
+
+ echo ""
+ echo "Installing dependencies..."
+ pip install --upgrade pip
+ pip install -r requirements.txt
+
+ echo ""
+ echo "========================================"
+ echo "Setup complete!"
+ echo "========================================"
+ echo ""
+ echo "To activate the virtual environment in the future, run:"
+ echo "  source venv/bin/activate"
+ echo ""
+ echo "To deactivate, run:"
+ echo "  deactivate"
+ echo ""
train_model.py ADDED
@@ -0,0 +1,877 @@
+ """
+ Demand Prediction System - Model Training Script
+
+ This script trains multiple machine learning and time-series models to predict
+ product demand (sales quantity) for an e-commerce platform.
+
+ Features:
+ - Data preprocessing and feature engineering
+ - Date feature extraction (day, month, day_of_week, weekend)
+ - Categorical encoding
+ - Feature scaling
+ - Multiple ML models (Linear Regression, Random Forest, XGBoost)
+ - Time-series models (ARIMA, Prophet)
+ - Model evaluation (MAE, RMSE, R2 Score)
+ - Automatic best model selection
+ - Model persistence using joblib
+ - Visualization of results
+ - Comparison between ML and time-series approaches
+ """
+
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ from datetime import datetime
+ import joblib
+ import os
+ import warnings
+ warnings.filterwarnings('ignore')
+
+ # Machine Learning imports
+ from sklearn.model_selection import train_test_split
+ from sklearn.preprocessing import StandardScaler, LabelEncoder
+ from sklearn.linear_model import LinearRegression
+ from sklearn.ensemble import RandomForestRegressor
+ from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
+
+ # Try to import XGBoost (optional)
+ try:
+     import xgboost as xgb
+     XGBOOST_AVAILABLE = True
+ except ImportError:
+     XGBOOST_AVAILABLE = False
+     print("XGBoost not available. Install with: pip install xgboost")
+
+ # Try to import time-series libraries
+ try:
+     from statsmodels.tsa.arima.model import ARIMA
+     from statsmodels.tsa.stattools import adfuller
+     ARIMA_AVAILABLE = True
+ except ImportError:
+     ARIMA_AVAILABLE = False
+     print("statsmodels not available. Install with: pip install statsmodels")
+
+ try:
+     from prophet import Prophet
+     PROPHET_AVAILABLE = True
+ except ImportError:
+     PROPHET_AVAILABLE = False
+     print("Prophet not available. Install with: pip install prophet")
+
+ # Set random seeds for reproducibility
+ np.random.seed(42)
+
+ # Configuration
+ DATA_PATH = 'data/sales.csv'
+ MODEL_DIR = 'models'
+ PLOTS_DIR = 'plots'
+
+ # Create directories if they don't exist
+ os.makedirs(MODEL_DIR, exist_ok=True)
+ os.makedirs(PLOTS_DIR, exist_ok=True)
+
+
+ def load_data(file_path):
+     """
+     Load the sales dataset from CSV file.
+
+     Args:
+         file_path: Path to the CSV file
+
+     Returns:
+         DataFrame: Loaded dataset
+     """
+     print(f"Loading data from {file_path}...")
+     df = pd.read_csv(file_path)
+     print(f"Data loaded successfully! Shape: {df.shape}")
+     return df
+
+
+ def preprocess_data(df):
+     """
+     Preprocess the data: convert date, extract features, handle missing values.
+
+     Args:
+         df: Raw DataFrame
+
+     Returns:
+         DataFrame: Preprocessed DataFrame
+     """
+     print("\n" + "="*60)
+     print("PREPROCESSING DATA")
+     print("="*60)
+
+     # Create a copy to avoid modifying original
+     df = df.copy()
+
+     # Convert date column to datetime
+     df['date'] = pd.to_datetime(df['date'])
+
+     # Extract date features
+     print("Extracting date features...")
+     df['day'] = df['date'].dt.day
+     df['month'] = df['date'].dt.month
+     df['day_of_week'] = df['date'].dt.dayofweek  # 0=Monday, 6=Sunday
+     df['weekend'] = (df['day_of_week'] >= 5).astype(int)  # 1 if weekend, 0 otherwise
+     df['year'] = df['date'].dt.year
+     df['quarter'] = df['date'].dt.quarter
+
+     # Check for missing values
+     print("\nMissing values:")
+     missing = df.isnull().sum()
+     print(missing[missing > 0])
+
+     if missing.sum() > 0:
+         print("Filling missing values...")
+         df = df.fillna(df.median(numeric_only=True))
+
+     # Display basic statistics
+     print("\nDataset Info:")
+     print(f"Shape: {df.shape}")
+     print(f"\nColumns: {df.columns.tolist()}")
+     print(f"\nData types:\n{df.dtypes}")
+     print(f"\nBasic statistics:\n{df.describe()}")
+
+     return df
+
+
+ def feature_engineering(df):
+     """
+     Perform feature engineering: encode categorical variables, scale features.
+
+     Args:
+         df: Preprocessed DataFrame
+
+     Returns:
+         tuple: (X_features, y_target, feature_names, encoders, scaler)
+     """
+     print("\n" + "="*60)
+     print("FEATURE ENGINEERING")
+     print("="*60)
+
+     # Separate features and target
+     # Drop original date column (we have extracted features from it)
+     # Keep product_id for now (we'll encode it)
+     feature_columns = ['product_id', 'price', 'discount', 'category',
+                        'day', 'month', 'day_of_week', 'weekend', 'year', 'quarter']
+
+     X = df[feature_columns].copy()
+     y = df['sales_quantity'].copy()
+
+     # Encode categorical variables
+     print("Encoding categorical variables...")
+
+     # Label encode category
+     category_encoder = LabelEncoder()
+     X['category_encoded'] = category_encoder.fit_transform(X['category'])
+
+     # Label encode product_id (treating it as categorical)
+     product_encoder = LabelEncoder()
+     X['product_id_encoded'] = product_encoder.fit_transform(X['product_id'])
+
+     # Drop original categorical columns
+     X = X.drop(['category', 'product_id'], axis=1)
+
+     # Get feature names
+     feature_names = X.columns.tolist()
+
+     print(f"Features after encoding: {feature_names}")
+     print(f"Number of features: {len(feature_names)}")
+
+     # Scale numerical features
+     print("\nScaling numerical features...")
+     scaler = StandardScaler()
+     X_scaled = scaler.fit_transform(X)
+     X_scaled = pd.DataFrame(X_scaled, columns=feature_names)
+
+     # Store encoders and scaler for later use
+     encoders = {
+         'category': category_encoder,
+         'product_id': product_encoder,
+         'scaler': scaler
+     }
+
+     return X_scaled, y, feature_names, encoders, scaler
+
+
+ def train_models(X_train, y_train, X_val, y_val):
+     """
+     Train multiple models and return their performance metrics.
+
+     Args:
+         X_train: Training features
+         y_train: Training target
+         X_val: Validation features
+         y_val: Validation target
+
+     Returns:
+         dict: Dictionary containing models and their metrics
+     """
+     print("\n" + "="*60)
+     print("TRAINING MODELS")
+     print("="*60)
+
+     models = {}
+     results = {}
+
+     # 1. Linear Regression
+     print("\n1. Training Linear Regression...")
+     lr_model = LinearRegression()
+     lr_model.fit(X_train, y_train)
+     lr_pred = lr_model.predict(X_val)
+
+     lr_mae = mean_absolute_error(y_val, lr_pred)
+     lr_rmse = np.sqrt(mean_squared_error(y_val, lr_pred))
+     lr_r2 = r2_score(y_val, lr_pred)
+
+     models['Linear Regression'] = lr_model
+     results['Linear Regression'] = {
+         'model': lr_model,
+         'mae': lr_mae,
+         'rmse': lr_rmse,
+         'r2': lr_r2,
+         'predictions': lr_pred
+     }
+
+     print(f"   MAE: {lr_mae:.2f}, RMSE: {lr_rmse:.2f}, R2: {lr_r2:.4f}")
+
+     # 2. Random Forest Regressor
+     print("\n2. Training Random Forest Regressor...")
+     rf_model = RandomForestRegressor(
+         n_estimators=100,
+         max_depth=15,
+         min_samples_split=5,
+         min_samples_leaf=2,
+         random_state=42,
+         n_jobs=-1
+     )
+     rf_model.fit(X_train, y_train)
+     rf_pred = rf_model.predict(X_val)
+
+     rf_mae = mean_absolute_error(y_val, rf_pred)
+     rf_rmse = np.sqrt(mean_squared_error(y_val, rf_pred))
+     rf_r2 = r2_score(y_val, rf_pred)
+
+     models['Random Forest'] = rf_model
+     results['Random Forest'] = {
+         'model': rf_model,
+         'mae': rf_mae,
+         'rmse': rf_rmse,
+         'r2': rf_r2,
+         'predictions': rf_pred
+     }
+
+     print(f"   MAE: {rf_mae:.2f}, RMSE: {rf_rmse:.2f}, R2: {rf_r2:.4f}")
+
+     # 3. XGBoost (if available)
+     if XGBOOST_AVAILABLE:
+         print("\n3. Training XGBoost Regressor...")
+         xgb_model = xgb.XGBRegressor(
+             n_estimators=100,
+             max_depth=6,
+             learning_rate=0.1,
+             random_state=42,
+             n_jobs=-1
+         )
+         xgb_model.fit(X_train, y_train)
+         xgb_pred = xgb_model.predict(X_val)
+
+         xgb_mae = mean_absolute_error(y_val, xgb_pred)
+         xgb_rmse = np.sqrt(mean_squared_error(y_val, xgb_pred))
+         xgb_r2 = r2_score(y_val, xgb_pred)
+
+         models['XGBoost'] = xgb_model
+         results['XGBoost'] = {
+             'model': xgb_model,
+             'mae': xgb_mae,
+             'rmse': xgb_rmse,
+             'r2': xgb_r2,
+             'predictions': xgb_pred
+         }
+
+         print(f"   MAE: {xgb_mae:.2f}, RMSE: {xgb_rmse:.2f}, R2: {xgb_r2:.4f}")
+     else:
+         print("\n3. XGBoost skipped (not available)")
+
+     return results
+
+
+ def prepare_time_series_data(df):
+     """
+     Prepare time-series data by aggregating daily sales.
+
+     Args:
+         df: DataFrame with date and sales_quantity columns
+
+     Returns:
+         tuple: (ts_data, train_size) - time series data and training size
+     """
+     print("\n" + "="*60)
+     print("PREPARING TIME-SERIES DATA")
+     print("="*60)
+
+     # Aggregate by date
+     df['date'] = pd.to_datetime(df['date'])
+     ts_data = df.groupby('date')['sales_quantity'].sum().reset_index()
+     ts_data = ts_data.sort_values('date').reset_index(drop=True)
+     ts_data.columns = ['ds', 'y']  # Prophet expects 'ds' and 'y'
+
+     print(f"Time-series data shape: {ts_data.shape}")
+     print(f"Date range: {ts_data['ds'].min()} to {ts_data['ds'].max()}")
+     print(f"Total days: {len(ts_data)}")
+
+     # Use 80% for training (chronological split for time-series)
+     train_size = int(len(ts_data) * 0.8)
+
+     return ts_data, train_size
+
+
+ def train_arima(ts_data, train_size):
+     """
+     Train ARIMA model on time-series data.
+
+     Args:
+         ts_data: Time-series DataFrame with 'ds' and 'y' columns
+         train_size: Number of samples for training
+
+     Returns:
+         dict: Model results dictionary
+     """
+     if not ARIMA_AVAILABLE:
+         return None
+
+     print("\n" + "="*60)
+     print("TRAINING ARIMA MODEL")
+     print("="*60)
+
+     try:
+         # Split data chronologically
+         train_data = ts_data['y'].iloc[:train_size].values
+         val_data = ts_data['y'].iloc[train_size:].values
+         val_dates = ts_data['ds'].iloc[train_size:].values
+
+         print(f"Training on {len(train_data)} samples")
+         print(f"Validating on {len(val_data)} samples")
+
+         # Try different ARIMA orders (p, d, q)
+         # Start with auto_arima-like approach - try common orders
+         best_aic = np.inf
+         best_order = None
+         best_model = None
+
+         # Common ARIMA orders to try
+         orders_to_try = [
+             (1, 1, 1),  # Standard ARIMA(1,1,1)
+             (2, 1, 2),  # ARIMA(2,1,2)
+             (1, 1, 0),  # ARIMA(1,1,0) - AR model
+             (0, 1, 1),  # ARIMA(0,1,1) - MA model
+             (2, 1, 1),  # ARIMA(2,1,1)
+             (1, 1, 2),  # ARIMA(1,1,2)
+         ]
+
+         print("Trying different ARIMA orders...")
+         for order in orders_to_try:
+             try:
+                 model = ARIMA(train_data, order=order)
+                 fitted_model = model.fit()
+                 aic = fitted_model.aic
+
+                 if aic < best_aic:
+                     best_aic = aic
+                     best_order = order
+                     best_model = fitted_model
+                     print(f"   Order {order}: AIC = {aic:.2f} (best so far)")
+                 else:
+                     print(f"   Order {order}: AIC = {aic:.2f}")
+             except Exception as e:
+                 print(f"   Order {order}: Failed - {str(e)[:50]}")
+                 continue
+
+         if best_model is None:
+             print("Failed to fit ARIMA model with any order")
+             return None
+
+         print(f"\nBest ARIMA order: {best_order} (AIC: {best_aic:.2f})")
+
+         # Make predictions
+         forecast_steps = len(val_data)
+         forecast = best_model.forecast(steps=forecast_steps)
+
+         # Ensure predictions are non-negative
+         forecast = np.maximum(forecast, 0)
+
+         # Calculate metrics
+         mae = mean_absolute_error(val_data, forecast)
+         rmse = np.sqrt(mean_squared_error(val_data, forecast))
+         r2 = r2_score(val_data, forecast)
+
+         print(f"   MAE: {mae:.2f}, RMSE: {rmse:.2f}, R2: {r2:.4f}")
+
+         return {
+             'model': best_model,
+             'order': best_order,
+             'mae': mae,
+             'rmse': rmse,
+             'r2': r2,
+             'predictions': forecast,
+             'actual': val_data,
+             'dates': val_dates
+         }
+
+     except Exception as e:
+         print(f"Error training ARIMA: {str(e)}")
+         return None
+
+
+ def train_prophet(ts_data, train_size):
+     """
+     Train Prophet model on time-series data.
+
+     Args:
+         ts_data: Time-series DataFrame with 'ds' and 'y' columns
+         train_size: Number of samples for training
+
+     Returns:
+         dict: Model results dictionary
+     """
+     if not PROPHET_AVAILABLE:
+         return None
+
+     print("\n" + "="*60)
+     print("TRAINING PROPHET MODEL")
+     print("="*60)
+
+     try:
+         # Split data chronologically
+         train_data = ts_data.iloc[:train_size].copy()
+         val_data = ts_data.iloc[train_size:].copy()
+
+         print(f"Training on {len(train_data)} samples")
+         print(f"Validating on {len(val_data)} samples")
+
+         # Initialize and fit Prophet model
+         # Enable weekly and yearly seasonality; daily seasonality stays off
+         # because the series is aggregated to one observation per day
+         model = Prophet(
+             daily_seasonality=False,
+             weekly_seasonality=True,
+             yearly_seasonality=True,
+             seasonality_mode='multiplicative',
+             changepoint_prior_scale=0.05
+         )
+
+         print("Fitting Prophet model...")
+         model.fit(train_data)
+
+         # Create future dataframe for validation period
+         future = model.make_future_dataframe(periods=len(val_data), freq='D')
+
+         # Make predictions
+         forecast = model.predict(future)
+
+         # Get predictions for validation period
+         val_forecast = forecast.iloc[train_size:]['yhat'].values
+         val_actual = val_data['y'].values
+
+         # Ensure predictions are non-negative
+         val_forecast = np.maximum(val_forecast, 0)
+
+         # Calculate metrics
+         mae = mean_absolute_error(val_actual, val_forecast)
+         rmse = np.sqrt(mean_squared_error(val_actual, val_forecast))
+         r2 = r2_score(val_actual, val_forecast)
+
+         print(f"   MAE: {mae:.2f}, RMSE: {rmse:.2f}, R2: {r2:.4f}")
+
+         return {
+             'model': model,
+             'mae': mae,
+             'rmse': rmse,
+             'r2': r2,
+             'predictions': val_forecast,
+             'actual': val_actual,
+             'dates': val_data['ds'].values,
+             'full_forecast': forecast
+         }
+
+     except Exception as e:
+         print(f"Error training Prophet: {str(e)}")
+         import traceback
+         traceback.print_exc()
+         return None
+
+
+ def select_best_model(results):
+     """
+     Select the best model based on R2 score (higher is better).
+
+     Args:
+         results: Dictionary containing model results
+
+     Returns:
+         tuple: (best_model_name, best_model, best_metrics)
+     """
+     print("\n" + "="*60)
+     print("MODEL COMPARISON")
+     print("="*60)
+
+     # Create comparison DataFrame
+     comparison_data = []
+     for model_name, metrics in results.items():
+         comparison_data.append({
+             'Model': model_name,
+             'MAE': metrics['mae'],
+             'RMSE': metrics['rmse'],
+             'R2 Score': metrics['r2']
+         })
+
+     comparison_df = pd.DataFrame(comparison_data)
+     print("\nModel Performance Comparison:")
+     print(comparison_df.to_string(index=False))
+
+     # Select best model based on R2 score
+     best_model_name = max(results.keys(), key=lambda x: results[x]['r2'])
+     best_model = results[best_model_name]['model']
+     best_metrics = {
+         'mae': results[best_model_name]['mae'],
+         'rmse': results[best_model_name]['rmse'],
+         'r2': results[best_model_name]['r2']
+     }
+
+     print(f"\n{'='*60}")
+     print(f"BEST MODEL: {best_model_name}")
+     print(f"MAE: {best_metrics['mae']:.2f}")
+     print(f"RMSE: {best_metrics['rmse']:.2f}")
+     print(f"R2 Score: {best_metrics['r2']:.4f}")
+     print(f"{'='*60}")
+
+     return best_model_name, best_model, best_metrics
+
+
+ def visualize_results(df, results, best_model_name, feature_names):
+     """
+     Create visualizations: demand trends, feature importance, model comparison.
+
+     Args:
+         df: Original DataFrame
+         results: Model results dictionary
+         best_model_name: Name of the best model
+         feature_names: List of feature names
+     """
+     print("\n" + "="*60)
+     print("GENERATING VISUALIZATIONS")
+     print("="*60)
+
+     # Set style
+     sns.set_style("whitegrid")
+     plt.rcParams['figure.figsize'] = (12, 6)
+
+     # 1. Demand trends over time
+     print("1. Plotting demand trends over time...")
+     df['date'] = pd.to_datetime(df['date'])
+     daily_demand = df.groupby('date')['sales_quantity'].sum().reset_index()
+
+     plt.figure(figsize=(14, 6))
+     plt.plot(daily_demand['date'], daily_demand['sales_quantity'], linewidth=1, alpha=0.7)
+     plt.title('Total Daily Sales Quantity Over Time', fontsize=16, fontweight='bold')
+     plt.xlabel('Date', fontsize=12)
+     plt.ylabel('Total Sales Quantity', fontsize=12)
+     plt.grid(True, alpha=0.3)
+     plt.tight_layout()
+     plt.savefig(f'{PLOTS_DIR}/demand_trends.png', dpi=300, bbox_inches='tight')
+     print(f"   Saved: {PLOTS_DIR}/demand_trends.png")
+     plt.close()
+
+     # 2. Monthly average demand
+     print("2. Plotting monthly average demand...")
+     df['month_name'] = pd.to_datetime(df['date']).dt.strftime('%B')
+     monthly_avg = df.groupby('month')['sales_quantity'].mean().reset_index()
+     month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
+                    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
+     monthly_avg['month_name'] = monthly_avg['month'].apply(lambda x: month_names[x-1])
+
+     plt.figure(figsize=(12, 6))
+     plt.bar(monthly_avg['month_name'], monthly_avg['sales_quantity'], color='steelblue', alpha=0.7)
+     plt.title('Average Sales Quantity by Month', fontsize=16, fontweight='bold')
+     plt.xlabel('Month', fontsize=12)
+     plt.ylabel('Average Sales Quantity', fontsize=12)
+     plt.xticks(rotation=45)
+     plt.grid(True, alpha=0.3, axis='y')
+     plt.tight_layout()
+     plt.savefig(f'{PLOTS_DIR}/monthly_demand.png', dpi=300, bbox_inches='tight')
+     print(f"   Saved: {PLOTS_DIR}/monthly_demand.png")
+     plt.close()
+
+     # 3. Feature importance (for tree-based models)
+     print("3. Plotting feature importance...")
+     best_model = results[best_model_name]['model']
+
+     if hasattr(best_model, 'feature_importances_'):
+         importances = best_model.feature_importances_
+         feature_importance_df = pd.DataFrame({
+             'feature': feature_names,
+             'importance': importances
+         }).sort_values('importance', ascending=False)
+
+         plt.figure(figsize=(10, 6))
+         plt.barh(feature_importance_df['feature'], feature_importance_df['importance'], color='coral', alpha=0.7)
+         plt.title(f'Feature Importance - {best_model_name}', fontsize=16, fontweight='bold')
+         plt.xlabel('Importance', fontsize=12)
+         plt.ylabel('Feature', fontsize=12)
+         plt.gca().invert_yaxis()
+         plt.grid(True, alpha=0.3, axis='x')
+         plt.tight_layout()
+         plt.savefig(f'{PLOTS_DIR}/feature_importance.png', dpi=300, bbox_inches='tight')
+         print(f"   Saved: {PLOTS_DIR}/feature_importance.png")
+         plt.close()
+     else:
+         print("   Feature importance not available for this model type")
+
+     # 4. Model comparison
+     print("4. Plotting model comparison...")
+     model_names = list(results.keys())
+     mae_scores = [results[m]['mae'] for m in model_names]
+     rmse_scores = [results[m]['rmse'] for m in model_names]
+     r2_scores = [results[m]['r2'] for m in model_names]
+
+     # Separate ML and time-series models for visualization
+     ml_models = [m for m in model_names if m not in ['ARIMA', 'Prophet']]
+     ts_models = [m for m in model_names if m in ['ARIMA', 'Prophet']]
+
+     fig, axes = plt.subplots(1, 3, figsize=(18, 5))
+
+     # Color code: ML models in blue tones, TS models in orange/red tones
+     colors = []
+     for m in model_names:
+         if m in ts_models:
+             colors.append('coral' if m == 'ARIMA' else 'salmon')
+         else:
+             colors.append('skyblue')
+
+     # MAE comparison
+     axes[0].bar(model_names, mae_scores, color=colors, alpha=0.7)
+     axes[0].set_title('MAE Comparison (Lower is Better)', fontsize=14, fontweight='bold')
+     axes[0].set_ylabel('MAE', fontsize=12)
+     axes[0].tick_params(axis='x', rotation=45)
+     axes[0].grid(True, alpha=0.3, axis='y')
+     # Add legend
+     from matplotlib.patches import Patch
+     legend_elements = [
+         Patch(facecolor='skyblue', alpha=0.7, label='ML Models'),
+         Patch(facecolor='coral', alpha=0.7, label='Time-Series Models')
+     ]
+     axes[0].legend(handles=legend_elements, loc='upper right')
+
+     # RMSE comparison
+     axes[1].bar(model_names, rmse_scores, color=colors, alpha=0.7)
+     axes[1].set_title('RMSE Comparison (Lower is Better)', fontsize=14, fontweight='bold')
+     axes[1].set_ylabel('RMSE', fontsize=12)
+     axes[1].tick_params(axis='x', rotation=45)
+     axes[1].grid(True, alpha=0.3, axis='y')
+
+     # R2 comparison
+     axes[2].bar(model_names, r2_scores, color=colors, alpha=0.7)
+     axes[2].set_title('R2 Score Comparison (Higher is Better)', fontsize=14, fontweight='bold')
+     axes[2].set_ylabel('R2 Score', fontsize=12)
+     axes[2].tick_params(axis='x', rotation=45)
+     axes[2].grid(True, alpha=0.3, axis='y')
+
+     plt.tight_layout()
+     plt.savefig(f'{PLOTS_DIR}/model_comparison.png', dpi=300, bbox_inches='tight')
+     print(f"   Saved: {PLOTS_DIR}/model_comparison.png")
+     plt.close()
+
+     # 5. Time-series predictions plot (if time-series models available)
+     if ts_models:
+         print("5. Plotting time-series model predictions...")
+         fig, axes = plt.subplots(len(ts_models), 1, figsize=(14, 6*len(ts_models)))
+         if len(ts_models) == 1:
+             axes = [axes]
+
+         for idx, model_name in enumerate(ts_models):
+             if model_name in results and 'dates' in results[model_name]:
+                 dates = pd.to_datetime(results[model_name]['dates'])
+                 actual = results[model_name]['actual']
+                 predictions = results[model_name]['predictions']
+
+                 axes[idx].plot(dates, actual, label='Actual', linewidth=2, alpha=0.7)
+                 axes[idx].plot(dates, predictions, label='Predicted', linewidth=2, alpha=0.7, linestyle='--')
+                 axes[idx].set_title(f'{model_name} - Actual vs Predicted', fontsize=14, fontweight='bold')
+                 axes[idx].set_xlabel('Date', fontsize=12)
+                 axes[idx].set_ylabel('Sales Quantity', fontsize=12)
+                 axes[idx].legend()
+                 axes[idx].grid(True, alpha=0.3)
+
+         plt.tight_layout()
+         plt.savefig(f'{PLOTS_DIR}/timeseries_predictions.png', dpi=300, bbox_inches='tight')
+         print(f"   Saved: {PLOTS_DIR}/timeseries_predictions.png")
+         plt.close()
+
+     print("   Visualization complete!")
+
+
+ def save_model(model, encoders, scaler, feature_names, best_model_name, best_metrics):
+     """
+     Save the trained model and preprocessing objects.
+
+     Args:
+         model: Trained model
+         encoders: Dictionary of encoders
+         scaler: Fitted scaler
+         feature_names: List of feature names
+         best_model_name: Name of the best model
+         best_metrics: Dictionary of metrics
+     """
+     print("\n" + "="*60)
+     print("SAVING MODEL")
+     print("="*60)
+
+     # Save model
+     model_path = f'{MODEL_DIR}/best_model.joblib'
+     joblib.dump(model, model_path)
+     print(f"Model saved to: {model_path}")
+
+     # Save encoders and scaler
+     preprocessing_path = f'{MODEL_DIR}/preprocessing.joblib'
+     preprocessing_data = {
+         'encoders': encoders,
+         'scaler': scaler,
+         'feature_names': feature_names
+     }
+     joblib.dump(preprocessing_data, preprocessing_path)
+     print(f"Preprocessing objects saved to: {preprocessing_path}")
+
+     # Save model metadata
+     metadata = {
+         'model_name': best_model_name,
+         'metrics': best_metrics,
+         'feature_names': feature_names,
+         'saved_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
+     }
+
+     import json
+     metadata_path = f'{MODEL_DIR}/model_metadata.json'
+     with open(metadata_path, 'w') as f:
+         json.dump(metadata, f, indent=4)
+     print(f"Model metadata saved to: {metadata_path}")
+
+
+ def main():
+     """
+     Main function to orchestrate the training pipeline.
+     """
+     print("\n" + "="*60)
+     print("DEMAND PREDICTION SYSTEM - MODEL TRAINING")
+     print("ML Models vs Time-Series Models Comparison")
+     print("="*60)
+
+     # Step 1: Load data
+     df = load_data(DATA_PATH)
+
+     # Step 2: Preprocess data
+     df_processed = preprocess_data(df)
+
+     # Step 3: Feature engineering for ML models
+     X, y, feature_names, encoders, scaler = feature_engineering(df_processed)
+
+     # Step 4: Split data for ML models (random split)
+     # Note: a random split can leak temporal information into validation;
+     # the time-series models below use a chronological split instead.
+     print("\n" + "="*60)
+     print("SPLITTING DATA FOR ML MODELS")
+     print("="*60)
+     X_train, X_val, y_train, y_val = train_test_split(
+         X, y, test_size=0.2, random_state=42
+     )
+     print(f"Training set: {X_train.shape[0]} samples")
+     print(f"Validation set: {X_val.shape[0]} samples")
+
+     # Step 5: Train ML models
+     print("\n" + "="*70)
+     print("TRAINING MACHINE LEARNING MODELS")
+     print("="*70)
+     results = train_models(X_train, y_train, X_val, y_val)
+
+     # Step 6: Prepare time-series data
+     ts_data, train_size = prepare_time_series_data(df_processed)
+
+     # Step 7: Train time-series models
+     print("\n" + "="*70)
+     print("TRAINING TIME-SERIES MODELS")
+     print("="*70)
+
+     # Train ARIMA
+     if ARIMA_AVAILABLE:
+         arima_results = train_arima(ts_data, train_size)
+         if arima_results:
+             results['ARIMA'] = arima_results
+     else:
+         print("\nARIMA skipped (statsmodels not available)")
+
+     # Train Prophet
+     if PROPHET_AVAILABLE:
+         prophet_results = train_prophet(ts_data, train_size)
+         if prophet_results:
+             results['Prophet'] = prophet_results
+     else:
+         print("\nProphet skipped (prophet not available)")
+
+     # Step 8: Select best model (across all model types)
+     best_model_name, best_model, best_metrics = select_best_model(results)
+
+     # Step 9: Visualize results
+     visualize_results(df_processed, results, best_model_name, feature_names)
+
+     # Step 10: Save model (only ML models can be saved with preprocessing)
+     # For time-series models, save separately
+     if best_model_name not in ['ARIMA', 'Prophet']:
+         save_model(best_model, encoders, scaler, feature_names, best_model_name, best_metrics)
+     else:
+         # Save time-series model separately
+         print("\n" + "="*60)
+         print("SAVING TIME-SERIES MODEL")
+         print("="*60)
+         ts_model_path = f'{MODEL_DIR}/best_timeseries_model.joblib'
+         joblib.dump(best_model, ts_model_path)
+         print(f"Time-series model saved to: {ts_model_path}")
+
+         # Also save preprocessing for ML models (in case user wants to use them)
+         preprocessing_path = f'{MODEL_DIR}/preprocessing.joblib'
+         preprocessing_data = {
+             'encoders': encoders,
+             'scaler': scaler,
+             'feature_names': feature_names
+         }
+         joblib.dump(preprocessing_data, preprocessing_path)
+         print(f"ML preprocessing objects saved to: {preprocessing_path}")
+
+     # Save all results metadata
+     import json
+     all_models_metadata = {
+         'best_model': best_model_name,
+         'best_metrics': best_metrics,
+         'all_models': {}
+     }
+     for model_name, model_results in results.items():
+         all_models_metadata['all_models'][model_name] = {
+             'mae': model_results['mae'],
+             'rmse': model_results['rmse'],
+             'r2': model_results['r2']
+         }
+     all_models_metadata['saved_at'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
+
+     metadata_path = f'{MODEL_DIR}/all_models_metadata.json'
+     with open(metadata_path, 'w') as f:
+         json.dump(all_models_metadata, f, indent=4)
+     print(f"All models metadata saved to: {metadata_path}")
+
+     print("\n" + "="*60)
+     print("TRAINING COMPLETE!")
+     print("="*60)
+     print(f"\nBest model: {best_model_name}")
+     print(f"Model type: {'Time-Series' if best_model_name in ['ARIMA', 'Prophet'] else 'Machine Learning'}")
871
+ print(f"Model saved to: {MODEL_DIR}/")
872
+ print(f"Visualizations saved to: {PLOTS_DIR}/")
873
+ print("\nYou can now use predict.py to make predictions!")
874
+
875
+
876
+ if __name__ == "__main__":
877
+ main()
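
Downstream code does not have to unpickle any model just to find out which one won: the `all_models_metadata.json` file written by `main()` already records the best model name and per-model metrics. The sketch below is not part of the uploaded script; `best_model_from_metadata` is a hypothetical helper, and the sample file only mimics the schema `main()` produces.

```python
import json

def best_model_from_metadata(path):
    """Return (best model name, its metrics) from all_models_metadata.json."""
    with open(path) as f:
        meta = json.load(f)
    return meta['best_model'], meta['all_models'][meta['best_model']]

# Minimal demo: write a file matching the schema main() produces, then read it back.
sample = {
    'best_model': 'ARIMA',
    'best_metrics': {'mae': 1.2, 'rmse': 1.8, 'r2': 0.91},
    'all_models': {'ARIMA': {'mae': 1.2, 'rmse': 1.8, 'r2': 0.91}},
    'saved_at': '2024-01-01 00:00:00',
}
with open('all_models_metadata_demo.json', 'w') as f:
    json.dump(sample, f, indent=4)

name, metrics = best_model_from_metadata('all_models_metadata_demo.json')
print(name, metrics['r2'])  # → ARIMA 0.91
```

If the recorded best model is an ML model, load `best_timeseries_model.joblib` is not needed; instead the saved model plus `preprocessing.joblib` (encoders, scaler, feature names) together reproduce the training-time pipeline.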