OJ Sales Revenue Prediction (Gradient Boosting)
A Gradient Boosting Regressor trained on the OJ Sales dataset from Azure Open Datasets to predict orange juice Revenue based on sales features.
Built and deployed on Microsoft Fabric during METC Online Training #5, "Mempersiapkan Data untuk Model AI di Microsoft Fabric" (Preparing Data for AI Models in Microsoft Fabric), organized by Microsoft Elevate and Dicoding.
Model Details
| Property | Value |
| --- | --- |
| Model Type | Gradient Boosting Regressor |
| Framework | scikit-learn |
| Task | Tabular Regression |
| Target Variable | Revenue (continuous) |
| Training Platform | Microsoft Fabric + MLflow |
| Dataset | OJ Sales (Azure Open Datasets) |
| Sample Size | 500 rows (sampled with random_state=1) |
| Train/Test Split | 80/20 (random_state=42) |
Features (9)
| Feature | Type | Description |
| --- | --- | --- |
| `Quantity` | int | Units sold |
| `Advert` | int | Advertisement flag (0=No, 1=Yes) |
| `Price` | float | Unit price ($) |
| `Brand_encoded` | int | Brand (0=Dominicks, 1=Minute Maid, 2=Tropicana) |
| `Store_encoded` | int | Store ID (label encoded) |
| `Year` | int | Year extracted from WeekStarting |
| `Month` | int | Month (1-12) |
| `WeekOfYear` | int | Week number (1-52) |
| `Quarter` | int | Quarter (1-4) |
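
The calendar features and encodings in the table can be derived along the following lines. This is a minimal, dependency-free sketch rather than the original notebook code: the brand codes and the `WeekStarting` name come from the table, while the raw `Brand` and `Store` keys, the helper names, the example store IDs, and the use of ISO week numbering are assumptions made for illustration.

```python
from datetime import date

# Brand codes as listed in the Features table.
BRAND_CODES = {"Dominicks": 0, "Minute Maid": 1, "Tropicana": 2}


def fit_label_codes(values):
    """Map each distinct value to an integer in sorted order (like a label encoder)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def calendar_features(week_starting):
    """Derive Year, Month, WeekOfYear and Quarter from an ISO date string."""
    day = date.fromisoformat(week_starting)
    return {
        "Year": day.year,
        "Month": day.month,
        # Assumption: ISO-8601 week number; the original pipeline may differ.
        "WeekOfYear": day.isocalendar()[1],
        "Quarter": (day.month - 1) // 3 + 1,
    }


def encode_row(row, store_codes):
    """Turn one raw record into the nine model features."""
    features = {
        "Quantity": row["Quantity"],
        "Advert": row["Advert"],
        "Price": row["Price"],
        "Brand_encoded": BRAND_CODES[row["Brand"]],
        "Store_encoded": store_codes[row["Store"]],
    }
    features.update(calendar_features(row["WeekStarting"]))
    return features


# Hypothetical raw record; the store IDs are made up for illustration.
store_codes = fit_label_codes([1002, 1005, 1010])
raw = {"Quantity": 15000, "Advert": 1, "Price": 2.50, "Brand": "Minute Maid",
       "Store": 1005, "WeekStarting": "1992-06-11"}
encoded = encode_row(raw, store_codes)
print(encoded)
```

At inference time the store mapping has to be the one fitted during training; refitting it on new data would assign different integers to the same stores.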
Performance
Best Model: Gradient Boosting
| Metric | Score |
| --- | --- |
| R² | 0.9965 |
| MAE | 358.00 |
| RMSE | 454.92 |
| CV R² (5-fold) | 0.9964 ± 0.0008 |
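
For reference, the three headline metrics follow the standard definitions sketched below. The example uses plain Python and made-up numbers purely to show the arithmetic. The helper name `regression_metrics` is hypothetical, and the reported scores were presumably computed with the scikit-learn equivalents.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MAE and RMSE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "rmse": rmse}


# Illustrative values only, not taken from the OJ Sales data.
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
print(metrics)
```

MAE and RMSE are expressed in the same units as Revenue, while R² is unitless.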
All Models Compared
| Model | R² | MAE | RMSE |
| --- | --- | --- | --- |
| Gradient Boosting | 0.9965 | 358 | 455 |
| XGBoost | 0.9960 | 380 | 489 |
| Random Forest | 0.9952 | 412 | 533 |
| Linear Regression | 0.9474 | 1,835 | 2,450 |
Usage
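import pickle
import numpy as np

# Load the trained model. Only unpickle files from a source you trust.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# One row, in the order of the Features table:
# Quantity, Advert, Price, Brand_encoded, Store_encoded, Year, Month, WeekOfYear, Quarter
sample = np.array([[15000, 1, 2.50, 1, 5, 1992, 6, 24, 2]])
prediction = model.predict(sample)
print(f"Predicted Revenue: ${prediction[0]:,.2f}")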
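
Because the model takes a positional array, a swapped column silently produces a wrong prediction. One way to guard against that is to build the row from named values. In this sketch `FEATURE_ORDER` mirrors the Features table (assumed to match the training column order, as the sample above suggests) and `build_row` is a hypothetical helper, not part of the published model.

```python
# Feature order as listed in the Features table (assumed to match training).
FEATURE_ORDER = [
    "Quantity", "Advert", "Price", "Brand_encoded", "Store_encoded",
    "Year", "Month", "WeekOfYear", "Quarter",
]


def build_row(record):
    """Return the values of `record` arranged in FEATURE_ORDER."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [record[name] for name in FEATURE_ORDER]


# Keys may be given in any order; build_row puts them in place.
record = {
    "Price": 2.50, "Quantity": 15000, "Advert": 1,
    "Brand_encoded": 1, "Store_encoded": 5,
    "Year": 1992, "Month": 6, "WeekOfYear": 24, "Quarter": 2,
}
row = build_row(record)
print(row)
# With the model loaded as in the snippet above: model.predict([row])
```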
Key Insights
- Quantity and Price are the dominant predictors of Revenue
- Advertising (`Advert=1`) shows a positive impact on sales quantity
- The tree-based ensemble models (R² of 0.9952 or higher) clearly outperform Linear Regression (R² = 0.9474)
- The model is stable across folds: the 5-fold CV R² has a standard deviation of only 0.0008
Training Details
- Hyperparameters: `n_estimators=200, learning_rate=0.1, max_depth=5, random_state=42`
- Cross-Validation: 5-fold with R² scoring
- Experiment Tracking: MLflow on Microsoft Fabric
- Deployment: Real-time API endpoint on Microsoft Fabric
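
The configuration listed above corresponds roughly to the following scikit-learn setup. This is a sketch on synthetic data, not the original notebook: the generated `quantity`, `advert` and `price` columns and the Revenue formula are placeholders standing in for the 500-row OJ Sales sample, only three of the nine features are simulated, and MLflow logging and deployment are omitted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the 500-row sample (assumption, not the real data).
rng = np.random.default_rng(0)
n_rows = 500
quantity = rng.integers(5_000, 20_000, size=n_rows)
advert = rng.integers(0, 2, size=n_rows)
price = rng.uniform(1.5, 3.5, size=n_rows)
X = np.column_stack([quantity, advert, price])
y = quantity * price + rng.normal(0, 500, size=n_rows)

# 80/20 split with the seed given in Model Details.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters as listed under Training Details.
model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=5, random_state=42
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_mae = mean_absolute_error(y_test, y_pred)

# 5-fold cross-validation with R² scoring.
# Assumption: run on the full sample; the card does not say which subset was used.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"R2={test_r2:.4f}  MAE={test_mae:.2f}  "
      f"CV R2={cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```

The printed numbers describe the synthetic data only and are not comparable to the scores reported in the Performance section.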
Acknowledgments
- Microsoft Elevate and Dicoding for organizing METC Online Training #5
- Azure Open Datasets for providing the OJ Sales dataset