# 🚜 Bulldozer Sale Price Prediction (scikit-learn Random Forest)
This model predicts the auction sale price of bulldozers from structured historical sales + equipment specification data (tabular regression). It follows the classic Kaggle “Blue Book for Bulldozers” style workflow, including time-based feature engineering, missing-value handling, categorical encoding, model tuning, and evaluation.
## Model Details

- Developed by: brej-29
- Model type: `RandomForestRegressor` (scikit-learn)
- Task: Tabular regression (predict `SalePrice`)
- Output: Predicted sale price (continuous numeric value)
- Training notebook: `Bulldozer Sales Project.ipynb`
- Source repo: https://github.com/brej-29/Logicmojo-AIML-Assignments-bulldozer-price-prediction
- License: MIT
## Intended Use
- Educational / portfolio demonstration of an end-to-end ML regression pipeline
- Baseline price prediction experiments for auction-style heavy-equipment data
### Out-of-scope / Not suitable for
- Financial decision-making without additional validation and monitoring
- Any production pricing system without robust data validation, drift monitoring, and periodic retraining
## Training Data

- Dataset source: Kaggle competition “Blue Book for Bulldozers”
- File used in the project workflow: `data/TrainAndValid.csv`
- Target column: `SalePrice`
The dataset includes historical auction records and many structured features describing equipment configuration, usage, and sale metadata.
## Preprocessing & Feature Engineering
The training notebook applies the following transformations (high level):
- Parse `saledate` as a datetime and sort records by time
- Create time-based features from `saledate`: `saleYear`, `saleMonth`, `saleDay`, `saleDayofWeek`, `saleDayofYear`; then drop `saledate`
- Handle missing values:
  - Numeric columns: impute missing values with the column median and add a `<col>_is_missing` indicator
  - Categorical/object columns: add a `<col>_is_missing` indicator and encode categories as integer codes shifted by +1 (so missing values map to 0)
**Important:** This model is trained on the transformed (fully numeric) dataset; inputs must follow the same preprocessing rules.
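The rules above can be sketched as a single function. This is a minimal illustration under the card's stated rules, not the notebook's exact code; the column name `saledate` comes from the dataset, while the helper name `preprocess` is ours:

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the card's preprocessing rules to a raw bulldozer-sales frame."""
    df = df.copy()

    # Parse saledate, sort by time, derive datetime features, then drop it.
    df["saledate"] = pd.to_datetime(df["saledate"])
    df = df.sort_values("saledate")
    df["saleYear"] = df["saledate"].dt.year
    df["saleMonth"] = df["saledate"].dt.month
    df["saleDay"] = df["saledate"].dt.day
    df["saleDayofWeek"] = df["saledate"].dt.dayofweek
    df["saleDayofYear"] = df["saledate"].dt.dayofyear
    df = df.drop(columns=["saledate"])

    for col in list(df.columns):
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric: missing indicator + median imputation.
            if df[col].isna().any():
                df[f"{col}_is_missing"] = df[col].isna()
                df[col] = df[col].fillna(df[col].median())
        else:
            # Categorical/object: missing indicator + integer codes shifted
            # by +1 (pandas encodes missing as -1, so missing becomes 0).
            df[f"{col}_is_missing"] = df[col].isna()
            df[col] = pd.Categorical(df[col]).codes + 1
    return df
```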
## Train / Validation Split
A time-based split is used:
- Validation set: rows where `saleYear == 2012`
- Training set: all other years

Shapes after preprocessing:

- `X_train`: (401125, 102)
- `X_valid`: (11573, 102)
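A minimal sketch of the split (the helper name `time_split` is ours; it assumes the frame already carries the engineered `saleYear` column and the `SalePrice` target):

```python
import pandas as pd


def time_split(df: pd.DataFrame, valid_year: int = 2012):
    """Time-based split: hold out one sale year for validation."""
    valid = df[df["saleYear"] == valid_year]
    train = df[df["saleYear"] != valid_year]
    X_train, y_train = train.drop(columns=["SalePrice"]), train["SalePrice"]
    X_valid, y_valid = valid.drop(columns=["SalePrice"]), valid["SalePrice"]
    return X_train, y_train, X_valid, y_valid
```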
## Training Procedure
- Base algorithm: `RandomForestRegressor`
- Hyperparameter tuning: `RandomizedSearchCV` (5-fold CV, 20 iterations) over `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`, and `max_samples`

Final fitted model parameters (from the notebook’s `ideal_model`):

- `n_estimators=70`
- `min_samples_split=14`
- `min_samples_leaf=1`
- `max_features=0.5`
- `n_jobs=-1`
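A sketch of the tuning setup. Only the parameter names and the 20-iteration / 5-fold settings come from this card; the search distributions below are illustrative, since the notebook's exact ranges are not listed here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space over the hyperparameters named in the card.
param_dist = {
    "n_estimators": np.arange(10, 100, 10),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
    "max_features": [0.5, 1.0, "sqrt"],
    "max_samples": [None, 0.5, 0.8],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_distributions=param_dist,
    n_iter=20,  # notebook setting; lower this for quick experiments
    cv=5,
    random_state=42,
)
# search.fit(X_train, y_train)
# ideal_model = search.best_estimator_
```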
## Evaluation

Metrics reported on the time-based validation set (`saleYear == 2012`):

- Validation MAE: 5910.1576
- Validation RMSLE: 0.2448450
- Validation R²: 0.8835954
Note: RMSLE is the standard metric used in the Kaggle competition; MAE and R² are included for interpretability.
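For reference, RMSLE can be computed from scikit-learn's MSLE (a small helper, not notebook code):

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error


def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error, the Kaggle competition metric."""
    return np.sqrt(mean_squared_log_error(y_true, y_pred))
```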
## How to Use
This repo’s notebook demonstrates the full preprocessing logic. For inference, you must:
- Construct features using the exact same preprocessing steps (datetime feature extraction, missing indicators, category code mapping strategy).
- Ensure the final inference DataFrame has the same columns (and order) as the training features used by the model.
Example loading pattern (adjust filenames to match your Hugging Face repo artifacts):
- Load the model with `joblib.load(...)` or `pickle.load(...)`
- Build a single-row (or batch) DataFrame with the expected engineered numeric feature columns
- Call `model.predict(X)`
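The loading pattern above can be sketched as follows. The artifact filename in the comment is hypothetical, and a freshly fitted stand-in model is used here so the snippet runs on its own:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model so this sketch is self-contained. In practice, load the
# artifact shipped with the repo, e.g. (hypothetical filename):
#   model = joblib.load("bulldozer_rf_model.joblib")
stand_in = RandomForestRegressor(n_estimators=5, random_state=0)
stand_in.fit(np.zeros((10, 3)), np.arange(10.0))

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(stand_in, path)

model = joblib.load(path)
# The input must have exactly the columns (and order) seen at training time:
# 102 engineered features in the notebook run, 3 dummy features here.
preds = model.predict(np.zeros((2, 3)))
```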
## Input Requirements
- Inputs must be tabular (pandas DataFrame recommended)
- Inputs must match the post-processed feature space used during training (102 columns in this notebook run)
- Missing values must be handled exactly as in training (median imputation + missing indicators; categorical missing indicators + integer codes)
- Datetime-based features must be generated from `saledate` the same way as in training
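One way to enforce the column contract is `DataFrame.reindex`. This is illustrative only: `train_columns` stands in for a list of the training feature names, which you would persist alongside the model at training time:

```python
import pandas as pd

# Hypothetical saved list of training feature names (102 in the real run).
train_columns = ["saleYear", "saleMonth", "MachineHoursCurrentMeter"]

X_new = pd.DataFrame({"saleMonth": [5], "saleYear": [2012], "ExtraCol": [1]})

# reindex reorders columns to the training layout, drops unknown columns,
# and fills genuinely missing ones; a filled column usually signals a
# preprocessing mismatch that should be investigated, not silently ignored.
X_aligned = X_new.reindex(columns=train_columns, fill_value=0)
```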
## Bias, Risks, and Limitations
- Data is historical auction data; performance may degrade if market conditions shift (concept drift)
- Category encoding is dataset-dependent; new/unseen categories at inference time need careful handling
- Time-based relationships can change; regular backtesting and retraining are recommended
- Model may underperform for rare equipment configurations with limited historical examples
## Environmental Impact
Training is classical ML on tabular data (CPU-friendly) and typically has relatively low compute and carbon impact compared to deep learning.
## Technical Specifications

- Framework: scikit-learn
- Model type: `RandomForestRegressor`
- Task: tabular regression
- Recommended runtime: CPU
## Model Card Authors
- BrejBala
## Contact
For questions/feedback, please open an issue on the GitHub repository: https://github.com/brej-29/Logicmojo-AIML-Assignments-bulldozer-price-prediction
## Evaluation results

All metrics are self-reported, computed on the time-based validation split of the Kaggle Blue Book for Bulldozers data (`TrainAndValid.csv`):

| Metric | Value |
|---|---|
| Validation MAE | 5910.158 |
| Validation RMSLE | 0.245 |
| Validation R² | 0.884 |