# 🚜 Bulldozer Sale Price Prediction (scikit-learn Random Forest)
This model predicts the auction sale price of bulldozers from structured historical sales + equipment specification data (tabular regression). It follows the classic Kaggle “Blue Book for Bulldozers” style workflow, including time-based feature engineering, missing-value handling, categorical encoding, model tuning, and evaluation.
## Model Details

- Developed by: brej-29
- Model type: `RandomForestRegressor` (scikit-learn)
- Task: Tabular regression (predict `SalePrice`)
- Output: Predicted sale price (continuous numeric value)
- Training notebook: `Bulldozer Sales Project.ipynb`
- Source repo: https://github.com/brej-29/Logicmojo-AIML-Assignments-bulldozer-price-prediction
- License: MIT
## Intended Use
- Educational / portfolio demonstration of an end-to-end ML regression pipeline
- Baseline price prediction experiments for auction-style heavy-equipment data
### Out-of-scope / Not suitable for
- Financial decision-making without additional validation and monitoring
- Any production pricing system without robust data validation, drift monitoring, and periodic retraining
## Training Data

- Dataset source: Kaggle competition “Blue Book for Bulldozers”
- File used in the project workflow: `data/TrainAndValid.csv`
- Target column: `SalePrice`
The dataset includes historical auction records and many structured features describing equipment configuration, usage, and sale metadata.
## Preprocessing & Feature Engineering
The training notebook applies the following transformations (high level):
- Parse `saledate` as a datetime and sort records by time
- Create time-based features from `saledate`: `saleYear`, `saleMonth`, `saleDay`, `saleDayofWeek`, `saleDayofYear`; then drop `saledate`
- Handle missing values:
  - Numeric columns: impute missing values with the column median and add a `<col>_is_missing` indicator
  - Categorical/object columns: add a `<col>_is_missing` indicator and encode categories as integer codes shifted by +1 (so missing values map to 0)
**Important:** This model is trained on the transformed (fully numeric) dataset; inputs must follow the same preprocessing rules.
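The rules above can be sketched as a single function. This is a minimal illustration under the card's stated rules, not the notebook's exact code; the column name `saledate` comes from the dataset, while the helper name `preprocess` is ours:

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the card's preprocessing rules to a raw bulldozer-sales frame."""
    df = df.copy()

    # Parse saledate, sort by time, derive datetime features, then drop it.
    df["saledate"] = pd.to_datetime(df["saledate"])
    df = df.sort_values("saledate")
    df["saleYear"] = df["saledate"].dt.year
    df["saleMonth"] = df["saledate"].dt.month
    df["saleDay"] = df["saledate"].dt.day
    df["saleDayofWeek"] = df["saledate"].dt.dayofweek
    df["saleDayofYear"] = df["saledate"].dt.dayofyear
    df = df.drop(columns=["saledate"])

    for col in list(df.columns):
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric: missing indicator + median imputation.
            if df[col].isna().any():
                df[f"{col}_is_missing"] = df[col].isna()
                df[col] = df[col].fillna(df[col].median())
        else:
            # Categorical/object: missing indicator + integer codes shifted
            # by +1 (pandas encodes missing as -1, so missing becomes 0).
            df[f"{col}_is_missing"] = df[col].isna()
            df[col] = pd.Categorical(df[col]).codes + 1
    return df
```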
## Train / Validation Split
A time-based split is used:
- Validation set: rows where `saleYear == 2012`
- Training set: all other years

Shapes after preprocessing:

- `X_train`: (401125, 102)
- `X_valid`: (11573, 102)
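A minimal sketch of the split (the helper name `time_split` is ours; it assumes the frame already carries the engineered `saleYear` column and the `SalePrice` target):

```python
import pandas as pd


def time_split(df: pd.DataFrame, valid_year: int = 2012):
    """Time-based split: hold out one sale year for validation."""
    valid = df[df["saleYear"] == valid_year]
    train = df[df["saleYear"] != valid_year]
    X_train, y_train = train.drop(columns=["SalePrice"]), train["SalePrice"]
    X_valid, y_valid = valid.drop(columns=["SalePrice"]), valid["SalePrice"]
    return X_train, y_train, X_valid, y_valid
```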
## Training Procedure
- Base algorithm: `RandomForestRegressor`
- Hyperparameter tuning: `RandomizedSearchCV` (5-fold CV, 20 iterations) over `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`, and `max_samples`

Final fitted model parameters (from the notebook’s `ideal_model`):

- `n_estimators=70`
- `min_samples_split=14`
- `min_samples_leaf=1`
- `max_features=0.5`
- `n_jobs=-1`
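A sketch of the tuning setup. Only the parameter names and the 20-iteration / 5-fold settings come from this card; the search distributions below are illustrative, since the notebook's exact ranges are not listed here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space over the hyperparameters named in the card.
param_dist = {
    "n_estimators": np.arange(10, 100, 10),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
    "max_features": [0.5, 1.0, "sqrt"],
    "max_samples": [None, 0.5, 0.8],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_distributions=param_dist,
    n_iter=20,  # notebook setting; lower this for quick experiments
    cv=5,
    random_state=42,
)
# search.fit(X_train, y_train)
# ideal_model = search.best_estimator_
```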
## Evaluation

Metrics reported on the time-based validation set (`saleYear == 2012`):

- Validation MAE: 5910.1576
- Validation RMSLE: 0.2448450
- Validation R²: 0.8835954
Note: RMSLE is the standard metric used in the Kaggle competition; MAE and R² are included for interpretability.
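For reference, RMSLE can be computed from scikit-learn's MSLE (a small helper, not notebook code):

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error


def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error, the Kaggle competition metric."""
    return np.sqrt(mean_squared_log_error(y_true, y_pred))
```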
## How to Use
This repo’s notebook demonstrates the full preprocessing logic. For inference, you must:
- Construct features using the exact same preprocessing steps (datetime feature extraction, missing indicators, category code mapping strategy).
- Ensure the final inference DataFrame has the same columns (and order) as the training features used by the model.
Example loading pattern (adjust filenames to match your Hugging Face repo artifacts):
- Load the model with `joblib.load(...)` or `pickle.load(...)`
- Build a single-row (or batch) DataFrame with the expected engineered numeric feature columns
- Call `model.predict(X)`
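The loading pattern above can be sketched as follows. The artifact filename in the comment is hypothetical, and a freshly fitted stand-in model is used here so the snippet runs on its own:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model so this sketch is self-contained. In practice, load the
# artifact shipped with the repo, e.g. (hypothetical filename):
#   model = joblib.load("bulldozer_rf_model.joblib")
stand_in = RandomForestRegressor(n_estimators=5, random_state=0)
stand_in.fit(np.zeros((10, 3)), np.arange(10.0))

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(stand_in, path)

model = joblib.load(path)
# The input must have exactly the columns (and order) seen at training time:
# 102 engineered features in the notebook run, 3 dummy features here.
preds = model.predict(np.zeros((2, 3)))
```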
## Input Requirements
- Inputs must be tabular (pandas DataFrame recommended)
- Inputs must match the post-processed feature space used during training (102 columns in this notebook run)
- Missing values must be handled exactly as in training (median imputation + missing indicators; categorical missing indicators + integer codes)
- Datetime-based features must be generated from `saledate` the same way as in training
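One way to enforce the column contract is `DataFrame.reindex`. This is illustrative only: `train_columns` stands in for a list of the training feature names, which you would persist alongside the model at training time:

```python
import pandas as pd

# Hypothetical saved list of training feature names (102 in the real run).
train_columns = ["saleYear", "saleMonth", "MachineHoursCurrentMeter"]

X_new = pd.DataFrame({"saleMonth": [5], "saleYear": [2012], "ExtraCol": [1]})

# reindex reorders columns to the training layout, drops unknown columns,
# and fills genuinely missing ones; a filled column usually signals a
# preprocessing mismatch that should be investigated, not silently ignored.
X_aligned = X_new.reindex(columns=train_columns, fill_value=0)
```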
## Bias, Risks, and Limitations
- Data is historical auction data; performance may degrade if market conditions shift (concept drift)
- Category encoding is dataset-dependent; new/unseen categories at inference time need careful handling
- Time-based relationships can change; regular backtesting and retraining are recommended
- Model may underperform for rare equipment configurations with limited historical examples
## Environmental Impact
Training is classical ML on tabular data (CPU-friendly) and typically has relatively low compute and carbon impact compared to deep learning.
## Technical Specifications

- Framework: scikit-learn
- Model type: `RandomForestRegressor`
- Task: tabular regression
- Recommended runtime: CPU
## Model Card Authors
- BrejBala
## Contact
For questions/feedback, please open an issue on the GitHub repository: https://github.com/brej-29/Logicmojo-AIML-Assignments-bulldozer-price-prediction
## Evaluation results

All metrics are self-reported, computed on the time-based validation split of the Kaggle Blue Book for Bulldozers data (`TrainAndValid.csv`):

| Metric | Value |
|---|---|
| Validation MAE | 5910.158 |
| Validation RMSLE | 0.245 |
| Validation R² | 0.884 |