OJ Sales 🍊 Revenue Prediction (Gradient Boosting)

A Gradient Boosting Regressor trained on the OJ Sales dataset from Azure Open Datasets to predict orange juice Revenue based on sales features.

Built and deployed on Microsoft Fabric during METC Online Training #5, "Mempersiapkan Data untuk Model AI di Microsoft Fabric" ("Preparing Data for AI Models in Microsoft Fabric"), organized by Microsoft Elevate and Dicoding.

Model Details

Property	Value
Model Type	Gradient Boosting Regressor
Framework	scikit-learn
Task	Tabular Regression
Target Variable	Revenue (continuous)
Training Platform	Microsoft Fabric + MLflow
Dataset	OJ Sales (Azure Open Datasets)
Sample Size	500 rows (sampled with `random_state=1`)
Train/Test Split	80/20 (`random_state=42`)

Features (9)

Feature	Type	Description
`Quantity`	int	Units sold
`Advert`	int	Advertisement flag (0=No, 1=Yes)
`Price`	float	Unit price ($)
`Brand_encoded`	int	Brand (0=Dominicks, 1=Minute Maid, 2=Tropicana)
`Store_encoded`	int	Store ID (label encoded)
`Year`	int	Year extracted from WeekStarting
`Month`	int	Month (1-12)
`WeekOfYear`	int	Week number (1-52)
`Quarter`	int	Quarter (1-4)
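The calendar and encoded columns in the table above could be derived roughly as follows. This is a minimal standard-library sketch, not the notebook's actual preprocessing: the helper names `date_features` and `label_encode` are hypothetical, the ISO week convention is an assumption (ISO numbering can yield week 53 in some years), and the store IDs are made-up values for illustration.

```python
from datetime import date

# Brand codes as listed in the feature table.
BRAND_CODES = {"Dominicks": 0, "Minute Maid": 1, "Tropicana": 2}


def date_features(week_starting: date) -> dict:
    """Derive the four calendar features from a WeekStarting date."""
    return {
        "Year": week_starting.year,
        "Month": week_starting.month,
        # Assumption: ISO week numbering; the original pipeline may differ.
        "WeekOfYear": week_starting.isocalendar()[1],
        "Quarter": (week_starting.month - 1) // 3 + 1,
    }


def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


features = date_features(date(1992, 6, 11))
features["Brand_encoded"] = BRAND_CODES["Minute Maid"]

# Hypothetical raw store IDs, encoded to 0..n-1.
store_codes, store_mapping = label_encode([1002, 1000, 1001, 1000])
print(features, store_codes)
```

Assigning codes in sorted order matches how scikit-learn's `LabelEncoder` orders its classes, which is consistent with the alphabetical brand codes listed above.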

Performance

Best Model: Gradient Boosting

Metric	Score
R²	0.9965
MAE	358.00
RMSE	454.92
CV R² (5-fold)	0.9964 ± 0.0008
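For readers unfamiliar with the metrics in this table, the sketch below shows how R², MAE, and RMSE are defined, using plain Python and a tiny made-up example. The function name `regression_metrics` and the sample values are illustrative only; in practice `sklearn.metrics` offers `r2_score` and `mean_absolute_error` for the same quantities.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MAE, and RMSE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "RMSE": rmse}


# Made-up revenues and predictions, for illustration only.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(metrics)
```

MAE and RMSE are in the same units as Revenue (dollars), while R² is unitless.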

All Models Compared

Model	R²	MAE	RMSE
Gradient Boosting	0.9965	358	455
XGBoost	0.9960	380	489
Random Forest	0.9952	412	533
Linear Regression	0.9474	1,835	2,450

Usage

import pickle
import numpy as np

# Load model
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Input: [Quantity, Advert, Price, Brand_encoded, Store_encoded, Year, Month, WeekOfYear, Quarter]
sample = np.array([[15000, 1, 2.50, 1, 5, 1992, 6, 24, 2]])
prediction = model.predict(sample)
print(f"Predicted Revenue: ${prediction[0]:,.2f}")
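Because the model receives a bare array, the column order in the comment above matters: a swapped column would not raise an error but would give a wrong prediction. One way to guard against that, sketched here with a hypothetical `build_row` helper, is to build the row from named values.

```python
# Feature order as documented in the usage comment.
FEATURE_ORDER = [
    "Quantity", "Advert", "Price", "Brand_encoded", "Store_encoded",
    "Year", "Month", "WeekOfYear", "Quarter",
]


def build_row(record: dict) -> list:
    """Return the feature values in the order the model expects."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [record[name] for name in FEATURE_ORDER]


record = {
    "Price": 2.50, "Quantity": 15000, "Advert": 1,
    "Brand_encoded": 1, "Store_encoded": 5,
    "Year": 1992, "Month": 6, "WeekOfYear": 24, "Quarter": 2,
}
row = build_row(record)
print(row)
```

The resulting list can be wrapped as `[row]` and passed to `model.predict` in place of the hand-written array.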

Key Insights

Quantity and Price are the dominant predictors of Revenue
Advertising (Advert=1) shows a positive impact on sales quantity
The tree ensembles (R² of 0.9952 to 0.9965) clearly outperform Linear Regression (R² of 0.9474)
The model is stable across folds: the 5-fold CV R² has a standard deviation of only 0.0008
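A claim such as "Quantity and Price are the dominant predictors" is typically checked through the fitted model's `feature_importances_` attribute. The sketch below illustrates the inspection on synthetic data only: the random columns and the rule Revenue = Quantity × Price are assumptions made for the demo, not the OJ Sales data, and the regressor uses default hyperparameters.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

FEATURES = [
    "Quantity", "Advert", "Price", "Brand_encoded", "Store_encoded",
    "Year", "Month", "WeekOfYear", "Quarter",
]

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in columns; value ranges are illustrative guesses.
X = np.column_stack([
    rng.integers(1000, 20000, n),  # Quantity
    rng.integers(0, 2, n),         # Advert
    rng.uniform(1.0, 4.0, n),      # Price
    rng.integers(0, 3, n),         # Brand_encoded
    rng.integers(0, 50, n),        # Store_encoded
    rng.integers(1990, 1993, n),   # Year
    rng.integers(1, 13, n),        # Month
    rng.integers(1, 53, n),        # WeekOfYear
    rng.integers(1, 5, n),         # Quarter
])
y = X[:, 0] * X[:, 2]  # assumption: synthetic Revenue = Quantity * Price

model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Rank features by impurity-based importance, highest first.
ranked = sorted(
    zip(FEATURES, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:15s} {importance:.4f}")
```

On the real model, the same loop over `model.feature_importances_` would show whether the ranking holds for the actual data.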

Training Details

Hyperparameters: n_estimators=200, learning_rate=0.1, max_depth=5, random_state=42
Cross-Validation: 5-fold with R² scoring
Experiment Tracking: MLflow on Microsoft Fabric
Deployment: Real-time API endpoint on Microsoft Fabric
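The configuration above can be sketched with scikit-learn as follows. The hyperparameters, the 80/20 split, and the 5-fold R² scoring come from this card; everything else is an assumption: the data is synthetic (500 rows, Revenue generated as Quantity × Price plus noise), the column order does not match the real feature table, and running cross-validation on the training split only is a guess at the original procedure. MLflow logging and Fabric deployment are not shown.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(1)
n_rows = 500

# Synthetic stand-in for the 9 features (not the real OJ Sales columns).
quantity = rng.integers(1000, 20000, n_rows)
price = rng.uniform(1.0, 4.0, n_rows)
filler = rng.integers(0, 10, size=(n_rows, 7))
X = np.column_stack([quantity, price, filler])
y = quantity * price + rng.normal(0.0, 500.0, n_rows)

# 80/20 split with the seed listed in the card.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters as listed under Training Details.
model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=5, random_state=42
)

# 5-fold cross-validation with R² scoring (assumed to run on the training split).
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")

model.fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))

print(f"CV R2: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
print(f"Test R2: {test_r2:.4f}")
```

The printed "mean ± std" has the same form as the CV R² row in the Performance table, though the numbers will differ because the data here is synthetic.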

Acknowledgments

Microsoft Elevate and Dicoding for organizing METC Online Training #5
Azure Open Datasets for providing the OJ Sales dataset
