OJ Sales 🍊 Revenue Prediction (Gradient Boosting)

A Gradient Boosting Regressor trained on the OJ Sales dataset from Azure Open Datasets to predict orange juice Revenue based on sales features.

Built and deployed on Microsoft Fabric during METC Online Training #5, "Mempersiapkan Data untuk Model AI di Microsoft Fabric" ("Preparing Data for AI Models in Microsoft Fabric"), organized by Microsoft Elevate and Dicoding.

Model Details

| Property | Value |
| --- | --- |
| Model Type | Gradient Boosting Regressor |
| Framework | scikit-learn |
| Task | Tabular Regression |
| Target Variable | Revenue (continuous) |
| Training Platform | Microsoft Fabric + MLflow |
| Dataset | OJ Sales (Azure Open Datasets) |
| Sample Size | 500 rows (sampled with random_state=1) |
| Train/Test Split | 80/20 (random_state=42) |

Features (9)

| Feature | Type | Description |
| --- | --- | --- |
| Quantity | int | Units sold |
| Advert | int | Advertisement flag (0=No, 1=Yes) |
| Price | float | Unit price ($) |
| Brand_encoded | int | Brand (0=Dominicks, 1=Minute Maid, 2=Tropicana) |
| Store_encoded | int | Store ID (label encoded) |
| Year | int | Year extracted from WeekStarting |
| Month | int | Month (1-12) |
| WeekOfYear | int | Week number (1-52) |
| Quarter | int | Quarter (1-4) |
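
The feature table above can be read as a small preprocessing recipe. The sketch below shows one way to build a model input row from a raw record, using only the Python standard library. It is not the original training code: the helper names (`label_encode`, `date_features`, `build_row`), the store IDs, and the choice of ISO week numbering for WeekOfYear are assumptions made for illustration.

```python
from datetime import date

# Column order taken from the feature table above.
FEATURE_ORDER = [
    "Quantity", "Advert", "Price", "Brand_encoded", "Store_encoded",
    "Year", "Month", "WeekOfYear", "Quarter",
]


def label_encode(values):
    """Map each distinct value to an integer code, in sorted order.

    scikit-learn's LabelEncoder also assigns codes in sorted order.
    """
    return {value: code for code, value in enumerate(sorted(set(values)))}


def date_features(week_starting):
    """Derive the four calendar features from a WeekStarting date."""
    return {
        "Year": week_starting.year,
        "Month": week_starting.month,
        # Assumption: ISO week numbering. ISO weeks can reach 53, and the
        # original notebook may have computed the week differently.
        "WeekOfYear": week_starting.isocalendar()[1],
        "Quarter": (week_starting.month - 1) // 3 + 1,
    }


def build_row(record, brand_codes, store_codes):
    """Return the nine features as a list in FEATURE_ORDER."""
    features = {
        "Quantity": int(record["Quantity"]),
        "Advert": int(record["Advert"]),
        "Price": float(record["Price"]),
        "Brand_encoded": brand_codes[record["Brand"]],
        "Store_encoded": store_codes[record["Store"]],
        **date_features(record["WeekStarting"]),
    }
    return [features[name] for name in FEATURE_ORDER]


# Brand names as listed in the table; store IDs are hypothetical.
brand_codes = label_encode(["Tropicana", "Dominicks", "Minute Maid"])
store_codes = label_encode([1010, 1002, 1005])

row = build_row(
    {
        "WeekStarting": date(1992, 6, 11),
        "Store": 1005,
        "Brand": "Minute Maid",
        "Quantity": 15000,
        "Advert": 1,
        "Price": 2.50,
    },
    brand_codes,
    store_codes,
)
print(row)
```

Keeping the column order in a single constant makes it harder to pass features to the model in the wrong position, which a plain NumPy array would not detect.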

Performance

Best Model: Gradient Boosting

| Metric | Score |
| --- | --- |
| R² | 0.9965 |
| MAE | 358.00 |
| RMSE | 454.92 |
| CV R² (5-fold) | 0.9964 ± 0.0008 |

All Models Compared

| Model | R² | MAE | RMSE |
| --- | --- | --- | --- |
| Gradient Boosting | 0.9965 | 358 | 455 |
| XGBoost | 0.9960 | 380 | 489 |
| Random Forest | 0.9952 | 412 | 533 |
| Linear Regression | 0.9474 | 1,835 | 2,450 |
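
For readers who want to check how the three reported metrics relate, here is a minimal standard-library sketch of their usual definitions. The function name `regression_metrics` and the toy numbers are illustrative only; the scores in the tables above were presumably computed with scikit-learn's metric functions, which follow the same formulas.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², MAE, and RMSE for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")

    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Toy values, not taken from the OJ Sales data.
scores = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
print(scores)
```

MAE and RMSE are in the same units as Revenue, while R² is unitless, which is why the ensembles can look close on R² yet differ more visibly on MAE and RMSE.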

Usage

import pickle
import numpy as np

# Load model
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Input: [Quantity, Advert, Price, Brand_encoded, Store_encoded, Year, Month, WeekOfYear, Quarter]
sample = np.array([[15000, 1, 2.50, 1, 5, 1992, 6, 24, 2]])
prediction = model.predict(sample)
print(f"Predicted Revenue: ${prediction[0]:,.2f}")

Key Insights

  • Quantity and Price are the dominant predictors of Revenue
  • Advertising (Advert=1) is associated with higher sales quantity
  • The tree ensembles (R² of 0.9952 to 0.9965) clearly outperform Linear Regression (R² of 0.9474)
  • The model is stable across folds: the 5-fold CV R² has a standard deviation of only 0.0008

Training Details

  • Hyperparameters: n_estimators=200, learning_rate=0.1, max_depth=5, random_state=42
  • Cross-Validation: 5-fold with R² scoring
  • Experiment Tracking: MLflow on Microsoft Fabric
  • Deployment: Real-time API endpoint on Microsoft Fabric
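
The training setup listed above could be reproduced roughly as follows with scikit-learn. This is a sketch under assumptions, not the original notebook: the data is a synthetic stand-in rather than the OJ Sales sample, the helper name `train_gbr` is hypothetical, and running cross-validation on the training split (rather than on all 500 rows) is a guess. MLflow logging and the Fabric deployment step are left out.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split


def train_gbr(X, y):
    """Fit a Gradient Boosting Regressor with the listed hyperparameters."""
    # 80/20 split with the seed given in the model details.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = GradientBoostingRegressor(
        n_estimators=200, learning_rate=0.1, max_depth=5, random_state=42
    )
    # 5-fold cross-validation with R² scoring.
    # Assumption: CV is run on the training split only.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    model.fit(X_train, y_train)
    test_r2 = model.score(X_test, y_test)  # R² on the held-out 20%
    return model, cv_scores, test_r2


# Synthetic stand-in data (500 rows, two columns); NOT the OJ Sales dataset.
rng = np.random.default_rng(0)
quantity = rng.integers(5_000, 20_000, size=500)
price = rng.uniform(1.5, 3.5, size=500)
X = np.column_stack([quantity, price])
y = quantity * price  # arbitrary synthetic target for the demo

model, cv_scores, test_r2 = train_gbr(X, y)
print(f"CV R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
print(f"Test R²: {test_r2:.4f}")
```

Reporting the mean and standard deviation of the fold scores mirrors the "CV R² (5-fold)" entry in the performance table; the numbers printed here will differ because the data is synthetic.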

Acknowledgments

  • Microsoft Elevate and Dicoding for organizing METC Online Training #5
  • Azure Open Datasets for providing the OJ Sales dataset