# Demand Prediction System for E-commerce

A complete machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised regression models with time-series models (ARIMA, Prophet) and selects the best-performing approach.

## 📋 Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Dataset](#dataset)
- [Usage](#usage)
- [Model Details](#model-details)
- [Evaluation Metrics](#evaluation-metrics)
- [Visualizations](#visualizations)
- [Example Predictions](#example-predictions)
- [Technical Details](#technical-details)

## 🎯 Overview

This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:

1. **Machine Learning Models**: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
2. **Time-Series Models**: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)

The system automatically selects the best performing model across both approaches.

**Key Capabilities:**
- Predicts sales quantity for products on future dates (ML models)
- Predicts overall daily demand (Time-series models)
- Handles temporal patterns and seasonality
- Considers price, discount, category, and date features (ML models)
- Captures time-series patterns and trends (Time-series models)
- Automatically selects the best model from multiple candidates
- Provides comprehensive evaluation metrics
- Compares ML vs Time-Series approaches

## ✨ Features

- **Data Preprocessing**: Automatic handling of missing values, date feature extraction
- **Feature Engineering**:
  - Date features (day, month, day_of_week, weekend, year, quarter)
  - Categorical encoding (product_id, category)
  - Feature scaling
- **Multiple Models**:
  - **Machine Learning Models:**
    - Linear Regression
    - Random Forest Regressor
    - XGBoost Regressor (optional)
  - **Time-Series Models:**
    - ARIMA (AutoRegressive Integrated Moving Average)
    - Prophet (Facebook's time-series forecasting tool)
- **Model Selection**: Automatic best model selection based on R2 score
- **Evaluation Metrics**: MAE, RMSE, and R2 Score
- **Visualizations**:
  - Demand trends over time
  - Monthly average demand
  - Feature importance
  - Model comparison
- **Model Persistence**: Save and load trained models using joblib
- **Future Predictions**: Predict demand for any product on any future date



## 📁 Project Structure

```
demand_prediction/
│
├── data/
│   └── sales.csv                     # Sales dataset
│
├── models/                           # Generated during training
│   ├── best_model.joblib             # Best ML model (if ML is best)
│   ├── best_timeseries_model.joblib  # Best time-series model (if TS is best)
│   ├── preprocessing.joblib          # Encoders and scaler (for ML models)
│   ├── model_metadata.json           # Model metadata (legacy)
│   └── all_models_metadata.json      # All models comparison metadata
│
├── plots/                            # Generated during training
│   ├── demand_trends.png             # Time series plot
│   ├── monthly_demand.png            # Monthly averages
│   ├── feature_importance.png        # Feature importance (ML models)
│   ├── model_comparison.png          # Model metrics comparison (all models)
│   └── timeseries_predictions.png    # Time-series model predictions
│
├── generate_dataset.py               # Script to generate synthetic dataset
├── train_model.py                    # Main training script
├── predict.py                        # Prediction script
├── app.py                            # Streamlit dashboard (interactive web app)
├── requirements.txt                  # Python dependencies
└── README.md                         # This file
```



## 🚀 Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)

### Step 1: Navigate to Project Directory

```bash
cd demand_prediction
```



### Step 2: Create Virtual Environment (Recommended)

**Why use a virtual environment?**

- Keeps project dependencies isolated from your system Python
- Prevents conflicts with other projects
- Makes it easier to manage package versions
- Best practice for Python projects

**Quick Setup (Recommended):**

**Windows:**

```bash
setup_env.bat
```

**Linux/Mac:**

```bash
chmod +x setup_env.sh
./setup_env.sh
```

**Manual Setup:**

**Windows:**

```bash
python -m venv venv
venv\Scripts\activate
```

**Linux/Mac:**

```bash
python3 -m venv venv
source venv/bin/activate
```

After activation, you should see `(venv)` in your terminal prompt.

**To deactivate later:**

```bash
deactivate
```

### Step 3: Install Dependencies

```bash
pip install -r requirements.txt
```

**Note**: XGBoost is optional. If you don't want it, remove it from `requirements.txt`; training then skips the XGBoost model and everything else works as usual.

**Alternative (without virtual environment):**
If you prefer not to use a virtual environment, you can install directly:

```bash
pip install -r requirements.txt
```

However, this is **not recommended** as it may cause conflicts with other Python projects.

### Step 4: Generate Dataset

If you don't have a dataset, generate a synthetic one:

```bash
python generate_dataset.py
```

This will create `data/sales.csv` with realistic e-commerce sales data.

## 📊 Dataset

The dataset should contain the following columns:

- **product_id**: Unique identifier for each product (integer)
- **date**: Date of sale (YYYY-MM-DD format)
- **price**: Product price (float)
- **discount**: Discount percentage (0-100, float)
- **category**: Product category (string)
- **sales_quantity**: Target variable - number of units sold (integer)

### Dataset Format Example

```csv
product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
...
```
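
Before training, it can help to confirm that a file matches this schema. The following is a minimal, dependency-free sketch; `parse_sales_csv` and `REQUIRED_COLUMNS` are hypothetical names for illustration and are not part of the project's scripts, which may well load the data differently (for example with pandas).

```python
import csv
import io
from datetime import date

# Column names taken from the dataset description above.
REQUIRED_COLUMNS = ["product_id", "date", "price", "discount", "category", "sales_quantity"]


def parse_sales_csv(text):
    """Parse CSV text into typed row dicts; raise ValueError on schema problems."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        row = {
            "product_id": int(raw["product_id"]),
            "date": date.fromisoformat(raw["date"]),  # expects YYYY-MM-DD
            "price": float(raw["price"]),
            "discount": float(raw["discount"]),
            "category": raw["category"],
            "sales_quantity": int(raw["sales_quantity"]),
        }
        if not 0 <= row["discount"] <= 100:
            raise ValueError(f"line {line_no}: discount must be between 0 and 100")
        rows.append(row)
    return rows


SAMPLE = """product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
"""

rows = parse_sales_csv(SAMPLE)
print(len(rows), rows[0]["category"], rows[1]["sales_quantity"])
```

Failing early on a missing column or an out-of-range discount tends to be easier to debug than a model that trains on silently malformed rows.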

## 💻 Usage

### Step 1: Train the Model

Train the model using the sales dataset:

```bash
python train_model.py
```

This will:
1. Load and preprocess the data
2. Extract features from dates
3. Encode categorical variables
4. Train multiple ML models (Linear Regression, Random Forest, XGBoost)
5. Prepare time-series data (aggregate daily sales)
6. Train time-series models (ARIMA, Prophet)
7. Evaluate each model using MAE, RMSE, and R2 Score
8. Compare ML vs Time-Series models
9. Select the best model automatically (across all model types)
10. Save the model and preprocessing objects
11. Generate visualizations

**Output:**
- Best model saved to `models/best_model.joblib` (ML) or `models/best_timeseries_model.joblib` (TS)
- Preprocessing objects saved to `models/preprocessing.joblib` (for ML models)
- Visualizations saved to `plots/` directory
- All models metadata saved to `models/all_models_metadata.json`

### Step 2: Make Predictions

**For ML Models (product-specific predictions):**

Predict demand for a specific product on a date:

```bash
python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
```

**Parameters for ML Models:**
- `--product_id`: Product ID (integer, required)
- `--date`: Date in YYYY-MM-DD format (required)
- `--price`: Product price (float, required)
- `--discount`: Discount percentage 0-100 (float, default: 0)
- `--category`: Product category (string, required)
- `--model_type`: Model type - `auto` (default), `ml`, or `timeseries`

**For Time-Series Models (overall daily demand):**

Predict total daily demand across all products:

```bash
python predict.py --date 2024-01-15 --model_type timeseries
```

**Parameters for Time-Series Models:**
- `--date`: Date in YYYY-MM-DD format (required)
- `--model_type`: Set to `timeseries` to use time-series models
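
The flags documented above could be wired up with `argparse` roughly as follows. This is a sketch under assumptions, not the actual contents of `predict.py`: the helper names `build_parser` and `parse_args` are hypothetical, and the rule that `--product_id`, `--price`, and `--category` are required whenever `--model_type` is not `timeseries` is inferred from the parameter lists rather than confirmed by the source.

```python
import argparse
from datetime import datetime


def build_parser():
    """Build a parser mirroring the documented command-line flags."""
    parser = argparse.ArgumentParser(description="Predict product demand")
    parser.add_argument("--product_id", type=int)
    parser.add_argument(
        "--date",
        required=True,
        # A malformed date raises ValueError, which argparse reports as a usage error.
        type=lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
    )
    parser.add_argument("--price", type=float)
    parser.add_argument("--discount", type=float, default=0.0)
    parser.add_argument("--category")
    parser.add_argument("--model_type", choices=["auto", "ml", "timeseries"], default="auto")
    return parser


def parse_args(argv):
    """Parse argv and enforce the product flags unless a time-series forecast is requested."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.model_type != "timeseries":  # assumption: auto and ml both need product details
        missing = [n for n in ("product_id", "price", "category") if getattr(args, n) is None]
        if missing:
            parser.error("ML prediction requires: " + ", ".join("--" + m for m in missing))
    return args


ts_args = parse_args(["--date", "2024-01-15", "--model_type", "timeseries"])
ml_args = parse_args(
    ["--product_id", "1", "--date", "2024-01-15", "--price", "100", "--category", "Electronics"]
)
print(ts_args.model_type, ml_args.discount)
```

Validating the product flags after parsing, rather than marking them `required=True`, keeps the time-series invocation short while still rejecting incomplete ML requests.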

**Example Predictions:**

```bash
# ML Model - Electronics product with discount
python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics

# ML Model - Clothing product without discount
python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing

# Time-Series Model - Overall daily demand
python predict.py --date 2024-07-06 --model_type timeseries

# Auto-detect best model (default)
python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports
```

### Step 3: Launch Interactive Dashboard (Optional)

Launch the Streamlit dashboard for interactive visualization and predictions:

```bash
streamlit run app.py
```

The dashboard will open in your default web browser (usually at `http://localhost:8501`).

**Dashboard Features:**

1. **📈 Sales Trends Page**
   - Interactive filters (category, product, date range)
   - Daily sales trends visualization
   - Monthly sales trends
   - Category-wise analysis
   - Price vs demand relationship
   - Real-time statistics and metrics

2. **🔮 Demand Prediction Page**
   - Interactive prediction interface
   - Select model type (Auto/ML/Time-Series)
   - For ML models:
     - Product selection dropdown
     - Category selection
     - Price and discount sliders
     - Date picker
     - Product statistics display
   - For Time-Series models:
     - Date picker for future predictions
     - Overall daily demand forecast
   - Prediction insights and recommendations

3. **📊 Model Comparison Page**
   - Side-by-side model performance comparison
   - MAE, RMSE, and R2 Score metrics
   - Visual charts comparing all models
   - Best model highlighting
   - Model type indicators (ML vs Time-Series)

**Dashboard Highlights:**
- Interactive widgets for easy data exploration
- Real-time predictions with visual feedback
- Comprehensive model comparison charts

## 🤖 Model Details

### Models Trained

1. **Linear Regression**
   - Simple linear model
   - Fast training and prediction
   - Good baseline model

2. **Random Forest Regressor**
   - Ensemble of decision trees
   - Handles non-linear relationships
   - Provides feature importance
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 15
     - min_samples_split: 5
     - min_samples_leaf: 2

3. **XGBoost Regressor** (Optional)
   - Gradient boosting algorithm
   - Often provides best performance
   - Handles complex patterns
   - Hyperparameters:
     - n_estimators: 100
     - max_depth: 6
     - learning_rate: 0.1

4. **ARIMA** (AutoRegressive Integrated Moving Average)
   - Classic time-series forecasting model
   - Captures trends and short-term autocorrelation; seasonality is modeled only when a seasonal variant (SARIMA) is used
   - Automatically selects best order (p, d, q)
   - Works on aggregated daily sales data
   - Uses chronological train/validation split

5. **Prophet** (Facebook's Time-Series Forecasting)
   - Designed for business time series
   - Handles seasonality (weekly, yearly)
   - Robust to missing data and outliers
   - Works on aggregated daily sales data
   - Uses chronological train/validation split


### Model Comparison: ML vs Time-Series

**Machine Learning Models:**

- ✅ Predict per-product demand
- ✅ Use product features (price, discount, category)
- ✅ Can handle new products with similar features
- ❌ May not capture long-term temporal patterns as well

**Time-Series Models:**

- ✅ Capture temporal patterns and trends
- ✅ Handle seasonality automatically
- ✅ Good for overall demand forecasting
- ❌ Predict aggregate demand, not per-product
- ❌ Don't use product-specific features

**The system automatically selects the best model based on R2 score across all model types.**



### Feature Engineering

**For ML Models:**

The system extracts the following features from the input data:

**Date Features:**

- `day`: Day of month (1-31)
- `month`: Month (1-12)
- `day_of_week`: Day of week (0=Monday, 6=Sunday)
- `weekend`: Binary indicator (1 if weekend, 0 otherwise)
- `year`: Year
- `quarter`: Quarter of year (1-4)


**Original Features:**

- `product_id`: Encoded as categorical
- `price`: Numerical (scaled)
- `discount`: Numerical (scaled)
- `category`: Encoded as categorical

**Total Features**: 10 features after encoding and scaling
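
The six date features can be derived from a single date with the standard library alone. The sketch below is illustrative; `extract_date_features` is a hypothetical helper, and the training script may compute the same values differently (for example through pandas datetime accessors).

```python
from datetime import date


def extract_date_features(d):
    """Return the six calendar features listed above for a datetime.date."""
    day_of_week = d.weekday()  # 0=Monday ... 6=Sunday, matching the convention above
    return {
        "day": d.day,
        "month": d.month,
        "day_of_week": day_of_week,
        "weekend": 1 if day_of_week >= 5 else 0,  # Saturday or Sunday
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
    }


features = extract_date_features(date(2024, 7, 6))  # a Saturday
print(features)
```

Together with `product_id`, `price`, `discount`, and `category`, these six values account for the 10 input features.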

**For Time-Series Models:**

- Data is aggregated by date (total daily sales)
- Uses chronological split (80% train, 20% validation)
- Prophet automatically handles:
  - Weekly seasonality
  - Yearly seasonality
  - Trend components
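
The aggregation and chronological split described above can be sketched as follows. The function names `aggregate_daily` and `chronological_split` are hypothetical, and the toy rows are made up for illustration; the point is that the validation set is always the most recent 20% of days, never a random sample.

```python
from collections import defaultdict


def aggregate_daily(rows):
    """Sum sales_quantity per date and return (date, total) pairs in date order."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["date"]] += row["sales_quantity"]
    return sorted(totals.items())  # ISO date strings sort chronologically


def chronological_split(series, train_fraction=0.8):
    """Split an ordered series so that validation data is strictly later than training data."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Toy rows for illustration only (not taken from the real dataset).
toy_rows = [
    {"date": "2020-01-02", "sales_quantity": 30},
    {"date": "2020-01-01", "sales_quantity": 45},
    {"date": "2020-01-01", "sales_quantity": 120},
    {"date": "2020-01-03", "sales_quantity": 10},
    {"date": "2020-01-04", "sales_quantity": 20},
    {"date": "2020-01-05", "sales_quantity": 25},
]

daily = aggregate_daily(toy_rows)
train, valid = chronological_split(daily)
print(len(train), len(valid))
```

Shuffling before splitting would leak future observations into training, which is why time-series models are validated on the tail of the series.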

## 📈 Evaluation Metrics

The system evaluates models using three metrics:

1. **MAE (Mean Absolute Error)**
   - Average absolute difference between predicted and actual values
   - Lower is better
   - Units: same as target variable (sales quantity)

2. **RMSE (Root Mean Squared Error)**
   - Square root of average squared differences
   - Penalizes large errors more than MAE
   - Lower is better
   - Units: same as target variable (sales quantity)

3. **R2 Score (Coefficient of Determination)**
   - Proportion of variance explained by the model
   - Range: -∞ to 1 (1 is perfect prediction)
   - Higher is better
   - Used for model selection

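**Model Selection**: The model with the highest R2 score is selected as the best model. ML models are scored on per-product rows and time-series models on aggregated daily totals, so their R2 values are computed on different targets; treat comparisons across the two families as indicative rather than exact.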
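
To make the three metrics and the selection rule concrete, here is a small pure-Python sketch. The function names and the toy predictions for `model_a` and `model_b` are assumptions for illustration only; they are not results from this project, whose training script presumably relies on library implementations of the same formulas.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def select_best(y_true, predictions_by_model):
    """Return the name of the model with the highest R2, plus all scores."""
    scores = {name: r2_score(y_true, preds) for name, preds in predictions_by_model.items()}
    return max(scores, key=scores.get), scores


# Toy validation targets and predictions, for illustration only.
y_true = [10, 20, 30, 40]
candidates = {"model_a": [12, 18, 33, 37], "model_b": [10, 25, 30, 45]}

best_name, scores = select_best(y_true, candidates)
print(best_name, mae(y_true, candidates[best_name]), rmse(y_true, candidates[best_name]))
```

Note that R2 becomes negative whenever a model's squared error exceeds that of always predicting the mean, which is why the range has no lower bound.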

## 📊 Visualizations

The training script generates several visualizations:

1. **Demand Trends Over Time** (`plots/demand_trends.png`)
   - Shows total daily sales quantity over the entire time period
   - Helps identify overall trends and patterns

2. **Monthly Average Demand** (`plots/monthly_demand.png`)
   - Bar chart showing average sales by month
   - Reveals seasonal patterns (e.g., holiday season spikes)

3. **Feature Importance** (`plots/feature_importance.png`)
   - Shows which features are most important for predictions
   - Only available for tree-based models (Random Forest, XGBoost)

4. **Model Comparison** (`plots/model_comparison.png`)
   - Side-by-side comparison of all models (ML and Time-Series)
   - Color-coded: ML models (blue) vs Time-Series models (orange/red)
   - Shows MAE, RMSE, and R2 Score for each model

5. **Time-Series Predictions** (`plots/timeseries_predictions.png`)
   - Actual vs predicted plots for ARIMA and Prophet models
   - Shows how well time-series models capture temporal patterns
   - Only generated if time-series models are available

## 🔮 Example Predictions

Here are some example predictions to demonstrate the system:

```bash
# Example 1: Electronics on a weekday
python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
# Expected: Moderate demand (weekday, some discount)

# Example 2: Clothing on weekend
python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
# Expected: Higher demand (weekend, good discount)

# Example 3: Holiday season prediction
python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
# Expected: High demand (holiday season, good discount)
```

## 🔧 Technical Details

### Data Preprocessing Pipeline

1. **Date Conversion**: Convert date strings to datetime objects
2. **Feature Extraction**: Extract temporal features from dates
3. **Missing Value Handling**: Fill missing values with median (if any)
4. **Categorical Encoding**: Label encode product_id and category
5. **Feature Scaling**: Standardize numerical features using StandardScaler
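
Steps 3 and 5 of this pipeline can be illustrated without any dependencies. The helpers below are hypothetical stand-ins, not the project's code: `standardize` reproduces what scikit-learn's `StandardScaler` computes (zero mean, unit variance using the population standard deviation), and the toy price column is made up.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def standardize(values):
    """Scale to zero mean and unit variance (population standard deviation)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


# Toy price column with one missing value, for illustration only.
prices = [10.0, None, 30.0, 20.0]
filled = fill_missing_with_median(prices)
scaled = standardize(filled)
print(filled, scaled)
```

In a real pipeline the mean and standard deviation would be computed on the training split only and then reused at prediction time, which is why the scaler is saved alongside the model.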



### Model Training Pipeline

1. **Data Splitting**: 80% training, 20% validation
2. **Model Training**: Train all available models
3. **Evaluation**: Calculate MAE, RMSE, and R2 for each model
4. **Selection**: Choose model with highest R2 score
5. **Persistence**: Save model, encoders, and scaler

### Prediction Pipeline

1. **Load Model**: Load trained model and preprocessing objects
2. **Feature Preparation**: Extract features from input parameters
3. **Encoding**: Encode categorical variables using saved encoders
4. **Scaling**: Scale features using saved scaler
5. **Prediction**: Make prediction using loaded model
6. **Post-processing**: Ensure non-negative predictions

### Handling Unseen Data

The prediction script handles cases where:

- Product ID was not seen during training (uses default encoding)
- Category was not seen during training (uses default encoding)

Warnings are displayed in such cases.
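
One way to implement the unseen-value fallback and the non-negative post-processing step is sketched below. Everything here is an assumption for illustration: `FallbackLabelEncoder` and `clamp_non_negative` are hypothetical names, and the choice of `0` as the default code is a guess, since the source does not say which default encoding the prediction script uses.

```python
import warnings


class FallbackLabelEncoder:
    """Label encoder that maps unseen labels to a default code instead of failing."""

    def __init__(self, default_code=0):  # assumption: the real default is unspecified
        self.default_code = default_code
        self.codes_ = {}

    def fit(self, labels):
        # Codes follow sorted order of the unique labels.
        self.codes_ = {label: i for i, label in enumerate(sorted(set(labels)))}
        return self

    def transform_one(self, label):
        if label not in self.codes_:
            warnings.warn(f"{label!r} was not seen during training; using default encoding")
            return self.default_code
        return self.codes_[label]


def clamp_non_negative(prediction):
    """Post-processing: a sales quantity cannot be negative."""
    return max(0.0, prediction)


encoder = FallbackLabelEncoder().fit(["Electronics", "Clothing", "Toys"])
print(encoder.transform_one("Toys"), clamp_non_negative(-3.2))
```

A fallback code lets the prediction proceed, but it silently borrows the behaviour of whichever known label shares that code, so predictions for unseen products or categories deserve extra caution.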



## 🎓 Learning Points

This project demonstrates:

1. **Supervised Learning**: Regression problem solving
2. **Feature Engineering**: Creating meaningful features from raw data
3. **Model Comparison**: Training and evaluating multiple models
4. **Model Selection**: Automatic best model selection
5. **Model Persistence**: Saving and loading trained models
6. **Production-Ready Code**: Clean, modular, well-documented code
7. **Time Series Features**: Extracting temporal patterns
8. **Categorical Encoding**: Handling categorical variables
9. **Feature Scaling**: Normalizing features for better performance
10. **Evaluation Metrics**: Understanding different regression metrics

## 🐛 Troubleshooting

### Issue: "Model not found"

**Solution**: Run `python train_model.py` first to train and save the model.

### Issue: "XGBoost not available"
**Solution**: Install XGBoost with `pip install xgboost`. Alternatively, take no action: the system runs without it and simply skips the XGBoost model.

### Issue: "Category not seen during training"
**Solution**: This is handled automatically with a warning. The system uses a default encoding.

### Issue: Poor prediction accuracy
**Solutions**:
- Ensure you have sufficient training data
- Check that input features are in the same range as training data
- Try retraining with different hyperparameters
- Consider adding more features or more training data

## 📝 Notes

- The synthetic dataset generator creates realistic patterns including:
  - Weekend effects (higher sales on weekends)
  - Seasonal patterns (holiday season spikes)
  - Price and discount effects
  - Category-specific base prices

- For production use, consider:
  - Using real historical data
  - Retraining models periodically
  - Adding more features (promotions, weather, etc.)
  - Implementing model versioning
  - Adding prediction confidence intervals

## 📄 License

This project is provided as-is for educational purposes.

## 👤 Author

Created as a complete machine learning project demonstrating demand prediction for e-commerce.

---

**Happy Predicting! 🚀**