# Demand Prediction System for E-commerce
A complete machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised regression models with time-series models (ARIMA, Prophet) and selects the best-performing approach.
## Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Dataset](#dataset)
- [Usage](#usage)
- [Model Details](#model-details)
- [Evaluation Metrics](#evaluation-metrics)
- [Visualizations](#visualizations)
- [Example Predictions](#example-predictions)
- [Technical Details](#technical-details)
## Overview
This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:
1. **Machine Learning Models**: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
2. **Time-Series Models**: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)
The system automatically selects the best performing model across both approaches.
**Key Capabilities:**
- Predicts sales quantity for products on future dates (ML models)
- Predicts overall daily demand (Time-series models)
- Handles temporal patterns and seasonality
- Considers price, discount, category, and date features (ML models)
- Captures time-series patterns and trends (Time-series models)
- Automatically selects the best model from multiple candidates
- Provides comprehensive evaluation metrics
- Compares ML vs Time-Series approaches
## Features
- **Data Preprocessing**: Automatic handling of missing values, date feature extraction
- **Feature Engineering**:
  - Date features (day, month, day_of_week, weekend, year, quarter)
  - Categorical encoding (product_id, category)
  - Feature scaling
- **Multiple Models**:
  - **Machine Learning Models:**
    - Linear Regression
    - Random Forest Regressor
    - XGBoost Regressor (optional)
  - **Time-Series Models:**
    - ARIMA (AutoRegressive Integrated Moving Average)
    - Prophet (Facebook's time-series forecasting tool)
- **Model Selection**: Automatic best model selection based on R2 score
- **Evaluation Metrics**: MAE, RMSE, and R2 Score
- **Visualizations**:
  - Demand trends over time
  - Monthly average demand
  - Feature importance
  - Model comparison
- **Model Persistence**: Save and load trained models using joblib
- **Future Predictions**: Predict demand for any product on any future date
## Project Structure
```
demand_prediction/
│
├── data/
│   └── sales.csv                      # Sales dataset
│
├── models/                            # Generated during training
│   ├── best_model.joblib              # Best ML model (if ML is best)
│   ├── best_timeseries_model.joblib   # Best time-series model (if TS is best)
│   ├── preprocessing.joblib           # Encoders and scaler (for ML models)
│   ├── model_metadata.json            # Model metadata (legacy)
│   └── all_models_metadata.json       # All models comparison metadata
│
├── plots/                             # Generated during training
│   ├── demand_trends.png              # Time series plot
│   ├── monthly_demand.png             # Monthly averages
│   ├── feature_importance.png         # Feature importance (ML models)
│   ├── model_comparison.png           # Model metrics comparison (all models)
│   └── timeseries_predictions.png     # Time-series model predictions
│
├── generate_dataset.py                # Script to generate synthetic dataset
├── train_model.py                     # Main training script
├── predict.py                         # Prediction script
├── app.py                             # Streamlit dashboard (interactive web app)
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```
## Installation
### Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
### Step 1: Navigate to Project Directory
```bash
cd demand_prediction
```
### Step 2: Create Virtual Environment (Recommended)
**Why use a virtual environment?**
- Keeps project dependencies isolated from your system Python
- Prevents conflicts with other projects
- Makes it easier to manage package versions
- Best practice for Python projects
**Quick Setup (Recommended):**
**Windows:**
```bash
setup_env.bat
```
**Linux/Mac:**
```bash
chmod +x setup_env.sh
./setup_env.sh
```
**Manual Setup:**
**Windows:**
```bash
python -m venv venv
venv\Scripts\activate
```
**Linux/Mac:**
```bash
python3 -m venv venv
source venv/bin/activate
```
After activation, you should see `(venv)` in your terminal prompt.
**To deactivate later:**
```bash
deactivate
```
### Step 3: Install Dependencies
```bash
pip install -r requirements.txt
```
**Note**: If you don't want to use XGBoost, you can remove it from `requirements.txt`. The system will work fine without it, just skipping XGBoost model training.
**Alternative (without virtual environment):**
If you prefer not to use a virtual environment, you can install directly:
```bash
pip install -r requirements.txt
```
However, this is **not recommended** as it may cause conflicts with other Python projects.
### Step 4: Generate Dataset
If you don't have a dataset, generate a synthetic one:
```bash
python generate_dataset.py
```
This will create `data/sales.csv` with realistic e-commerce sales data.
## Dataset
The dataset should contain the following columns:
- **product_id**: Unique identifier for each product (integer)
- **date**: Date of sale (YYYY-MM-DD format)
- **price**: Product price (float)
- **discount**: Discount percentage (0-100, float)
- **category**: Product category (string)
- **sales_quantity**: Target variable - number of units sold (integer)
### Dataset Format Example
```csv
product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
...
```
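Before training, it can help to confirm that a file matches the schema above. The following is a minimal sketch that uses only the standard library; the function name `validate_sales_rows` and the specific checks are illustrative assumptions and are not taken from the project's scripts.

```python
import csv
import io
from datetime import datetime

EXPECTED_COLUMNS = ["product_id", "date", "price", "discount", "category", "sales_quantity"]


def validate_sales_rows(csv_text):
    """Parse CSV text into typed rows, raising ValueError on schema problems.

    Hypothetical helper for illustration; not part of train_model.py.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        row = {
            "product_id": int(raw["product_id"]),
            "date": datetime.strptime(raw["date"], "%Y-%m-%d").date(),
            "price": float(raw["price"]),
            "discount": float(raw["discount"]),
            "category": raw["category"],
            "sales_quantity": int(raw["sales_quantity"]),
        }
        if not 0 <= row["discount"] <= 100:
            raise ValueError(f"line {line_no}: discount must be within 0-100")
        rows.append(row)
    return rows


sample = """product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
"""
rows = validate_sales_rows(sample)
print(len(rows), rows[0]["category"])
```

Converting each field up front means a malformed date or a non-numeric price fails at load time rather than partway through training.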
## Usage
### Step 1: Train the Model
Train the model using the sales dataset:
```bash
python train_model.py
```
This will:
1. Load and preprocess the data
2. Extract features from dates
3. Encode categorical variables
4. Train multiple ML models (Linear Regression, Random Forest, XGBoost)
5. Prepare time-series data (aggregate daily sales)
6. Train time-series models (ARIMA, Prophet)
7. Evaluate each model using MAE, RMSE, and R2 Score
8. Compare ML vs Time-Series models
9. Select the best model automatically (across all model types)
10. Save the model and preprocessing objects
11. Generate visualizations
**Output:**
- Best model saved to `models/best_model.joblib` (ML) or `models/best_timeseries_model.joblib` (TS)
- Preprocessing objects saved to `models/preprocessing.joblib` (for ML models)
- Visualizations saved to `plots/` directory
- All models metadata saved to `models/all_models_metadata.json`
### Step 2: Make Predictions
**For ML Models (product-specific predictions):**
Predict demand for a specific product on a date:
```bash
python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
```
**Parameters for ML Models:**
- `--product_id`: Product ID (integer, required)
- `--date`: Date in YYYY-MM-DD format (required)
- `--price`: Product price (float, required)
- `--discount`: Discount percentage 0-100 (float, default: 0)
- `--category`: Product category (string, required)
- `--model_type`: Model type - `auto` (default), `ml`, or `timeseries`
**For Time-Series Models (overall daily demand):**
Predict total daily demand across all products:
```bash
python predict.py --date 2024-01-15 --model_type timeseries
```
**Parameters for Time-Series Models:**
- `--date`: Date in YYYY-MM-DD format (required)
- `--model_type`: Set to `timeseries` to use time-series models
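The two parameter sets above could be parsed along the following lines. This is a hedged sketch with `argparse`, not the actual contents of `predict.py`: the helper names `build_parser` and `missing_ml_arguments` are hypothetical, and treating `auto` like `ml` when checking for required arguments is an assumption.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Predict product demand (sketch)")
    parser.add_argument("--product_id", type=int)
    parser.add_argument("--date", required=True, help="YYYY-MM-DD")
    parser.add_argument("--price", type=float)
    parser.add_argument("--discount", type=float, default=0.0)
    parser.add_argument("--category")
    parser.add_argument("--model_type", choices=["auto", "ml", "timeseries"], default="auto")
    return parser


def missing_ml_arguments(args):
    """Return the ML-only arguments that were not supplied.

    Time-series predictions need only --date, so nothing is reported for them.
    Assumption: "auto" is validated like "ml".
    """
    if args.model_type == "timeseries":
        return []
    required = ("product_id", "price", "category")
    return [name for name in required if getattr(args, name) is None]


ts_args = build_parser().parse_args(["--date", "2024-01-15", "--model_type", "timeseries"])
ml_args = build_parser().parse_args(["--date", "2024-01-15", "--product_id", "1"])
print(missing_ml_arguments(ts_args))
print(missing_ml_arguments(ml_args))
```

The ML-only flags are left optional at the parser level because a time-series request does not need them; the check afterwards reports what is missing for an ML request.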
**Example Predictions:**
```bash
# ML Model - Electronics product with discount
python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics
# ML Model - Clothing product without discount
python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing
# Time-Series Model - Overall daily demand
python predict.py --date 2024-07-06 --model_type timeseries
# Auto-detect best model (default)
python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports
```
### Step 3: Launch Interactive Dashboard (Optional)
Launch the Streamlit dashboard for interactive visualization and predictions:
```bash
streamlit run app.py
```
The dashboard will open in your default web browser (usually at `http://localhost:8501`).
**Dashboard Features:**
1. **Sales Trends Page**
   - Interactive filters (category, product, date range)
   - Daily sales trends visualization
   - Monthly sales trends
   - Category-wise analysis
   - Price vs demand relationship
   - Real-time statistics and metrics
2. **Demand Prediction Page**
   - Interactive prediction interface
   - Select model type (Auto/ML/Time-Series)
   - For ML models:
     - Product selection dropdown
     - Category selection
     - Price and discount sliders
     - Date picker
     - Product statistics display
   - For Time-Series models:
     - Date picker for future predictions
     - Overall daily demand forecast
   - Prediction insights and recommendations
3. **Model Comparison Page**
   - Side-by-side model performance comparison
   - MAE, RMSE, and R2 Score metrics
   - Visual charts comparing all models
   - Best model highlighting
   - Model type indicators (ML vs Time-Series)
**Dashboard Screenshots:**
- Interactive widgets for easy data exploration
- Real-time predictions with visual feedback
- Comprehensive model comparison charts
## Model Details
### Models Trained
1. **Linear Regression**
   - Simple linear model
   - Fast training and prediction
   - Good baseline model
2. **Random Forest Regressor**
   - Ensemble of decision trees
   - Handles non-linear relationships
   - Provides feature importance
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 15
     - min_samples_split: 5
     - min_samples_leaf: 2
3. **XGBoost Regressor** (Optional)
   - Gradient boosting algorithm
   - Often provides best performance
   - Handles complex patterns
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 6
     - learning_rate: 0.1
4. **ARIMA** (AutoRegressive Integrated Moving Average)
   - Classic time-series forecasting model
   - Captures trends and short-term autocorrelation (seasonality is modeled only if a seasonal order is configured)
   - Automatically selects best order (p, d, q)
   - Works on aggregated daily sales data
   - Uses chronological train/validation split
5. **Prophet** (Facebook's Time-Series Forecasting)
   - Designed for business time series
   - Handles seasonality (weekly, yearly)
   - Robust to missing data and outliers
   - Works on aggregated daily sales data
   - Uses chronological train/validation split
### Model Comparison: ML vs Time-Series
**Machine Learning Models:**
- ✅ Predict per-product demand
- ✅ Use product features (price, discount, category)
- ✅ Can handle new products with similar features
- ❌ May not capture long-term temporal patterns as well

**Time-Series Models:**
- ✅ Capture temporal patterns and trends
- ✅ Handle seasonality automatically
- ✅ Good for overall demand forecasting
- ❌ Predict aggregate demand, not per-product
- ❌ Don't use product-specific features
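**The system automatically selects the best model based on R2 score across all model types.** Keep in mind that the ML models are scored on per-product rows while the time-series models are scored on aggregated daily totals, so their R2 values describe different targets.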
### Feature Engineering
**For ML Models:**
The system extracts the following features from the input data:
**Date Features:**
- `day`: Day of month (1-31)
- `month`: Month (1-12)
- `day_of_week`: Day of week (0=Monday, 6=Sunday)
- `weekend`: Binary indicator (1 if weekend, 0 otherwise)
- `year`: Year
- `quarter`: Quarter of year (1-4)
**Original Features:**
- `product_id`: Encoded as categorical
- `price`: Numerical (scaled)
- `discount`: Numerical (scaled)
- `category`: Encoded as categorical
**Total Features**: 10 features after encoding and scaling
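The six date features listed above can be derived from a single calendar date. Here is a minimal sketch using only the standard library; the function name `extract_date_features` is an assumption, and the training script may compute the same values differently (for example with pandas).

```python
from datetime import date


def extract_date_features(day):
    """Derive the six calendar features from a datetime.date (illustrative helper)."""
    day_of_week = day.weekday()  # 0=Monday ... 6=Sunday, matching the list above
    return {
        "day": day.day,
        "month": day.month,
        "day_of_week": day_of_week,
        "weekend": 1 if day_of_week >= 5 else 0,  # Saturday or Sunday
        "year": day.year,
        "quarter": (day.month - 1) // 3 + 1,
    }


features = extract_date_features(date.fromisoformat("2024-07-06"))
print(features)
# {'day': 6, 'month': 7, 'day_of_week': 5, 'weekend': 1, 'year': 2024, 'quarter': 3}
```

2024-07-06 falls on a Saturday, so `weekend` is 1. Together with `product_id`, `price`, `discount`, and `category`, these six values make up the 10 features.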
**For Time-Series Models:**
- Data is aggregated by date (total daily sales)
- Uses chronological split (80% train, 20% validation)
- Prophet automatically handles:
  - Weekly seasonality
  - Yearly seasonality
  - Trend components
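The aggregation and chronological split described for the time-series models can be sketched as follows. The helper names are hypothetical and the code is pure Python for clarity; the training script may well use pandas for the same steps.

```python
from collections import defaultdict


def aggregate_daily(rows):
    """Sum sales_quantity per date and return (date, total) pairs in date order."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["date"]] += row["sales_quantity"]
    return sorted(totals.items())  # ISO date strings sort chronologically


def chronological_split(series, train_fraction=0.8):
    """Split an ordered series without shuffling: earliest part trains, latest validates."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Ten days with two products per day (quantities reuse the sample rows from the Dataset section).
rows = [
    {"date": f"2020-01-{day:02d}", "sales_quantity": quantity}
    for day in range(1, 11)
    for quantity in (45, 120)
]
daily = aggregate_daily(rows)
train, valid = chronological_split(daily)
print(len(train), len(valid), valid[0])
```

Unlike a random split, this keeps every validation date later than every training date, so the model is never evaluated on days that precede its training data.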
## Evaluation Metrics
The system evaluates models using three metrics:
1. **MAE (Mean Absolute Error)**
   - Average absolute difference between predicted and actual values
   - Lower is better
   - Units: same as target variable (sales quantity)
2. **RMSE (Root Mean Squared Error)**
   - Square root of average squared differences
   - Penalizes large errors more than MAE
   - Lower is better
   - Units: same as target variable (sales quantity)
3. **R2 Score (Coefficient of Determination)**
   - Proportion of variance explained by the model
   - Range: -∞ to 1 (1 is perfect prediction)
   - Higher is better
   - Used for model selection
**Model Selection**: The model with the highest R2 score is selected as the best model.
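To make the three metrics concrete, here is a small pure-Python sketch of how they are computed. The function name `regression_metrics` is an assumption; the project itself presumably relies on a library implementation.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot  # undefined when all true values are identical
    return {"mae": mae, "rmse": rmse, "r2": r2}


metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 10])
print(metrics)
```

For these toy values the MAE is 0.75, the RMSE is about 0.866, and R2 is 0.85. R2 turns negative whenever a model's squared error exceeds that of always predicting the mean, which is why its range has no lower bound.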
## Visualizations
The training script generates several visualizations:
1. **Demand Trends Over Time** (`plots/demand_trends.png`)
   - Shows total daily sales quantity over the entire time period
   - Helps identify overall trends and patterns
2. **Monthly Average Demand** (`plots/monthly_demand.png`)
   - Bar chart showing average sales by month
   - Reveals seasonal patterns (e.g., holiday season spikes)
3. **Feature Importance** (`plots/feature_importance.png`)
   - Shows which features are most important for predictions
   - Only available for tree-based models (Random Forest, XGBoost)
4. **Model Comparison** (`plots/model_comparison.png`)
   - Side-by-side comparison of all models (ML and Time-Series)
   - Color-coded: ML models (blue) vs Time-Series models (orange/red)
   - Shows MAE, RMSE, and R2 Score for each model
5. **Time-Series Predictions** (`plots/timeseries_predictions.png`)
   - Actual vs predicted plots for ARIMA and Prophet models
   - Shows how well time-series models capture temporal patterns
   - Only generated if time-series models are available
## Example Predictions
Here are some example predictions to demonstrate the system:
```bash
# Example 1: Electronics on a weekday
python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
# Expected: Moderate demand (weekday, some discount)
# Example 2: Clothing on weekend
python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
# Expected: Higher demand (weekend, good discount)
# Example 3: Holiday season prediction
python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
# Expected: High demand (holiday season, good discount)
```
## Technical Details
### Data Preprocessing Pipeline
1. **Date Conversion**: Convert date strings to datetime objects
2. **Feature Extraction**: Extract temporal features from dates
3. **Missing Value Handling**: Fill missing values with median (if any)
4. **Categorical Encoding**: Label encode product_id and category
5. **Feature Scaling**: Standardize numerical features using StandardScaler
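Steps 3 and 5 of this pipeline can be illustrated without any dependencies. The sketch below is an assumption-laden stand-in for what `StandardScaler` does (it uses the population standard deviation, as `StandardScaler` does); the helper names are hypothetical.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def fit_standard_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)


def apply_standard_scaler(values, mean, std):
    """Scale values with statistics learned earlier (zero mean, unit variance)."""
    return [(v - mean) / std for v in values]


prices = fill_missing_with_median([10.0, None, 30.0])
mean, std = fit_standard_scaler(prices)
scaled = apply_standard_scaler(prices, mean, std)
print(prices, scaled)
```

Keeping fitting and applying separate mirrors why the scaler is saved alongside the model: at prediction time, new inputs are scaled with the training mean and standard deviation rather than with statistics of their own.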
### Model Training Pipeline
1. **Data Splitting**: 80% training, 20% validation
2. **Model Training**: Train all available models
3. **Evaluation**: Calculate MAE, RMSE, and R2 for each model
4. **Selection**: Choose model with highest R2 score
5. **Persistence**: Save model, encoders, and scaler
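The selection step reduces to taking the maximum over R2 scores and choosing the matching output file. A minimal sketch follows; the `results` dictionary layout and the function name are assumptions, and the scores are made-up placeholders, not results from this project.

```python
def select_best_model(results):
    """Pick the entry with the highest R2 and the path it would be saved to."""
    best_name = max(results, key=lambda name: results[name]["r2"])
    if results[best_name]["type"] == "ml":
        filename = "best_model.joblib"
    else:
        filename = "best_timeseries_model.joblib"
    return best_name, f"models/{filename}"


# Placeholder scores for illustration only.
results = {
    "Linear Regression": {"type": "ml", "r2": 0.61},
    "Random Forest": {"type": "ml", "r2": 0.88},
    "ARIMA": {"type": "timeseries", "r2": 0.47},
    "Prophet": {"type": "timeseries", "r2": 0.73},
}
print(select_best_model(results))
# ('Random Forest', 'models/best_model.joblib')
```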
### Prediction Pipeline
1. **Load Model**: Load trained model and preprocessing objects
2. **Feature Preparation**: Extract features from input parameters
3. **Encoding**: Encode categorical variables using saved encoders
4. **Scaling**: Scale features using saved scaler
5. **Prediction**: Make prediction using loaded model
6. **Post-processing**: Ensure non-negative predictions
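The post-processing step matters because a regression model, particularly a linear one, can output a negative number for inputs far from the training data. A minimal sketch of the clamp, assuming a hypothetical helper name and no rounding:

```python
def postprocess_predictions(raw_predictions):
    """Clamp raw model outputs so reported demand is never negative."""
    return [max(0.0, float(value)) for value in raw_predictions]


print(postprocess_predictions([12.4, -3.1, 0.0]))
# [12.4, 0.0, 0.0]
```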
### Handling Unseen Data
The prediction script handles cases where:
- Product ID was not seen during training (uses default encoding)
- Category was not seen during training (uses default encoding)
Warnings are displayed in such cases.
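One way such a fallback could work is sketched below. The README does not say which default encoding is used, so the `default_code=0` value, the helper names, and the warning text are all assumptions.

```python
import warnings


def fit_label_encoder(values):
    """Map each distinct value to an integer code, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode_with_fallback(value, mapping, name, default_code=0):
    """Return the learned code, or default_code with a warning for unseen values."""
    if value in mapping:
        return mapping[value]
    warnings.warn(f"{name} {value!r} was not seen during training; using default encoding")
    return default_code


category_codes = fit_label_encoder(["Electronics", "Clothing", "Sports", "Clothing"])
print(category_codes)
print(encode_with_fallback("Sports", category_codes, "Category"))
```

A prediction made with a fallback code borrows the behavior of whichever known product or category shares that code, so it is best treated as a rough estimate.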
## Learning Points
This project demonstrates:
1. **Supervised Learning**: Regression problem solving
2. **Feature Engineering**: Creating meaningful features from raw data
3. **Model Comparison**: Training and evaluating multiple models
4. **Model Selection**: Automatic best model selection
5. **Model Persistence**: Saving and loading trained models
6. **Production-Ready Code**: Clean, modular, well-documented code
7. **Time Series Features**: Extracting temporal patterns
8. **Categorical Encoding**: Handling categorical variables
9. **Feature Scaling**: Normalizing features for better performance
10. **Evaluation Metrics**: Understanding different regression metrics
## Troubleshooting
### Issue: "Model not found"
**Solution**: Run `python train_model.py` first to train and save the model.
### Issue: "XGBoost not available"
**Solution**: Install XGBoost with `pip install xgboost`, or ignore the message: the system skips the XGBoost model and trains the remaining models.
### Issue: "Category not seen during training"
**Solution**: This is handled automatically with a warning. The system uses a default encoding.
### Issue: Poor prediction accuracy
**Solutions**:
- Ensure you have sufficient training data
- Check that input features are in the same range as training data
- Try retraining with different hyperparameters
- Consider adding more features or more training data
## Notes
- The synthetic dataset generator creates realistic patterns including:
  - Weekend effects (higher sales on weekends)
  - Seasonal patterns (holiday season spikes)
  - Price and discount effects
  - Category-specific base prices
- For production use, consider:
  - Using real historical data
  - Retraining models periodically
  - Adding more features (promotions, weather, etc.)
  - Implementing model versioning
  - Adding prediction confidence intervals
## License
This project is provided as-is for educational purposes.
## Author
Created as a complete machine learning project demonstrating demand prediction for e-commerce.
---
**Happy Predicting!**