🚜 Bulldozer Sale Price Prediction (scikit-learn Random Forest)

This model predicts auction sale prices of bulldozers from structured historical sales and equipment-specification data (tabular regression). It follows the classic Kaggle “Blue Book for Bulldozers” workflow: time-based feature engineering, missing-value handling, categorical encoding, model tuning, and evaluation.

Model Details

Intended Use

  • Educational / portfolio demonstration of an end-to-end ML regression pipeline
  • Baseline price prediction experiments for auction-style heavy-equipment data

Out-of-scope / Not suitable for

  • Financial decision-making without additional validation and monitoring
  • Any production pricing system without robust data validation, drift monitoring, and periodic retraining

Training Data

  • Dataset source: Kaggle competition “Blue Book for Bulldozers”
  • File used in the project workflow: data/TrainAndValid.csv
  • Target column: SalePrice

The dataset includes historical auction records and many structured features describing equipment configuration, usage, and sale metadata.

Preprocessing & Feature Engineering

The training notebook applies the following transformations (high level):

  • Parse saledate as a datetime and sort records by time
  • Create time-based features from saledate:
    • saleYear, saleMonth, saleDay, saleDayofWeek, saleDayofYear
    • then drop saledate
  • Handle missing values:
    • Numeric columns: impute missing with median + add <col>_is_missing indicator
    • Categorical/object columns: add <col>_is_missing indicator and encode categories to integer codes (+1)

Important: This model is trained on the transformed (fully numeric) dataset; inputs must follow the same preprocessing rules.
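The preprocessing steps above can be sketched as a single helper. This is an illustrative reconstruction, not the notebook's exact code: only `saledate`, `SalePrice`, and the engineered feature names come from this card, and the decision to add numeric missing indicators only where NaNs occur is an assumption.

```python
import pandas as pd

def preprocess(df):
    """Apply the card's preprocessing: datetime features, missing-value
    handling, and integer category codes (illustrative sketch)."""
    df = df.copy()
    # Parse saledate and sort records chronologically
    df["saledate"] = pd.to_datetime(df["saledate"])
    df = df.sort_values("saledate").reset_index(drop=True)
    # Time-based features, then drop the raw datetime column
    df["saleYear"] = df["saledate"].dt.year
    df["saleMonth"] = df["saledate"].dt.month
    df["saleDay"] = df["saledate"].dt.day
    df["saleDayofWeek"] = df["saledate"].dt.dayofweek
    df["saleDayofYear"] = df["saledate"].dt.dayofyear
    df = df.drop(columns=["saledate"])
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            if df[col].isna().any():
                # Median imputation plus a binary missing indicator
                df[f"{col}_is_missing"] = df[col].isna()
                df[col] = df[col].fillna(df[col].median())
        else:
            # Missing indicator, then integer category codes.
            # .codes gives -1 for missing, hence the +1 shift.
            df[f"{col}_is_missing"] = df[col].isna()
            df[col] = pd.Categorical(df[col]).codes + 1
    return df
```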

Train / Validation Split

A time-based split is used:

  • Validation set: rows where saleYear == 2012
  • Training set: all other years

Shapes after preprocessing:

  • X_train: (401125, 102)
  • X_valid: (11573, 102)
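The split above can be expressed as a small helper; this assumes the frame has already been preprocessed, so it contains the engineered `saleYear` column alongside the `SalePrice` target:

```python
import pandas as pd

def time_split(df_tmp, valid_year=2012):
    """Time-based split: hold out rows from valid_year for validation,
    train on all other years."""
    df_val = df_tmp[df_tmp["saleYear"] == valid_year]
    df_train = df_tmp[df_tmp["saleYear"] != valid_year]
    X_train = df_train.drop("SalePrice", axis=1)
    y_train = df_train["SalePrice"]
    X_valid = df_val.drop("SalePrice", axis=1)
    y_valid = df_val["SalePrice"]
    return X_train, y_train, X_valid, y_valid
```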

Training Procedure

  • Base algorithm: RandomForestRegressor
  • Hyperparameter tuning: RandomizedSearchCV (5-fold CV, 20 iterations) over:
    • n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, max_samples

Final fitted model parameters (from the notebook’s “ideal_model”):

  • n_estimators=70
  • min_samples_split=14
  • min_samples_leaf=1
  • max_features=0.5
  • n_jobs=-1
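A sketch of the tuning setup and the final model follows. The search-space values in `rf_grid` are illustrative assumptions (the card does not list the ranges searched); only the `ideal_model` parameters are taken from the card itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space over the hyperparameters named in the card
rf_grid = {
    "n_estimators": np.arange(10, 100, 10),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
    "max_features": [0.5, 1.0, "sqrt"],
    "max_samples": [10000],
}

rs_model = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_distributions=rf_grid,
    n_iter=20,   # 20 iterations, as reported
    cv=5,        # 5-fold CV, as reported
    verbose=0,
)
# rs_model.fit(X_train, y_train); inspect rs_model.best_params_

# Final model with the tuned parameters reported in this card
ideal_model = RandomForestRegressor(
    n_estimators=70,
    min_samples_split=14,
    min_samples_leaf=1,
    max_features=0.5,
    n_jobs=-1,
)
```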

Evaluation

Metrics reported on the time-based validation set (saleYear=2012):

  • Validation MAE: 5910.1576
  • Validation RMSLE: 0.2448450
  • Validation R²: 0.8835954

Note: RMSLE is the standard metric used in the Kaggle competition; MAE and R² are included for interpretability.
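The three metrics can be reproduced with standard scikit-learn functions; `show_scores` is an assumed helper name for illustration:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_log_error,
    r2_score,
)

def show_scores(model, X_valid, y_valid):
    """Compute validation MAE, RMSLE, and R^2 for a fitted regressor."""
    preds = model.predict(X_valid)
    return {
        "Valid MAE": mean_absolute_error(y_valid, preds),
        # RMSLE = sqrt of mean squared log error
        "Valid RMSLE": np.sqrt(mean_squared_log_error(y_valid, preds)),
        "Valid R^2": r2_score(y_valid, preds),
    }
```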

How to Use

This repo’s notebook demonstrates the full preprocessing logic. For inference, you must:

  1. Construct features using the exact same preprocessing steps (datetime feature extraction, missing indicators, category code mapping strategy).
  2. Ensure the final inference DataFrame has the same columns (and order) as the training features used by the model.

Example loading pattern (adjust filenames to match your Hugging Face repo artifacts):

  • Load model with joblib.load(...) or pickle.load(...)
  • Build a single-row (or batch) DataFrame with the expected engineered numeric feature columns
  • Call model.predict(X)

Input Requirements

  • Inputs must be tabular (pandas DataFrame recommended)
  • Inputs must match the post-processed feature space used during training (102 columns in this notebook run)
  • Missing values must be handled exactly as in training (median imputation + missing indicators; categorical missing indicators + integer codes)
  • Datetime-based features must be generated from saledate the same way as in training
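One way to enforce the column-matching requirement is to reindex the inference frame against the training columns; `align_features` is an illustrative helper, not part of the notebook:

```python
import pandas as pd

def align_features(X_new, train_columns):
    """Reindex an inference frame to the training feature space.
    Columns absent at inference become NaN (handle them before calling
    predict); unexpected extra columns are dropped."""
    return X_new.reindex(columns=train_columns)
```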

Bias, Risks, and Limitations

  • Data is historical auction data; performance may degrade if market conditions shift (concept drift)
  • Category encoding is dataset-dependent; new/unseen categories at inference time need careful handling
  • Time-based relationships can change; regular backtesting and retraining are recommended
  • Model may underperform for rare equipment configurations with limited historical examples
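One conservative way to handle the unseen-category risk noted above is to encode inference data against the category list learned at training time, so unseen values collapse to the same code as missing. This is a suggested approach, not something the notebook is confirmed to do:

```python
import pandas as pd

def encode_with_train_categories(series, train_categories):
    """Encode an inference column using the training-time category list.
    pd.Categorical assigns -1 to values outside train_categories (and to
    missing values), so after the +1 shift both map to code 0."""
    cat = pd.Categorical(series, categories=train_categories)
    return cat.codes + 1
```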

Environmental Impact

Training is classical ML on tabular data (CPU-friendly) and typically has a much lower compute and carbon footprint than deep learning.

Technical Specifications

  • Framework: scikit-learn
  • Model type: RandomForestRegressor
  • Task: tabular regression
  • Recommended runtime: CPU

Model Card Authors

  • BrejBala

Contact

For questions/feedback, please open an issue on the GitHub repository: https://github.com/brej-29/Logicmojo-AIML-Assignments-bulldozer-price-prediction
