# End-to-End Vehicle Valuation Project
Author: Yuval · Course: Data Science · Goal: Predicting used vehicle prices and classifying market tiers using advanced ML.
## Video Presentation
https://www.youtube.com/watch?v=RAbcuBYAg2o
## 1. Project Overview & Data
The goal is to solve information asymmetry in the used car market by creating an Automated Valuation Model (AVM).
- Original dataset: ~426,000 raw listings, 26 columns
- Final dataset (after cleaning): 363,494 rows; 17 columns remained after cleaning, which expanded to 30 features for modeling after engineering and encoding
- Target: Predict Price ($) and Price Category (Low/Mid/High).
## 2. Exploratory Data Analysis (EDA)
I started by analyzing the market distribution to understand the data structure. First, I removed irrelevant fields that could not affect the price, such as the license plate number, the listing URL, and similar features. Then, assuming the dataset was large enough despite its many missing values, I initially dropped every row with missing or extreme values. I later realized this approach discarded too much data, so I pivoted (see Section 2.1). I also converted categorical variables into numerical ones: ordinal encoding for ordered categories (for example, mapping "bad", "good", and "excellent" to 1, 2, and 3), and one-hot encoding for the vehicle type feature, splitting it into one binary indicator column per type.
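The two encoding steps described above can be sketched with pandas; the toy frame and the column names `condition` and `type` are illustrative stand-ins for the real listings data:

```python
import pandas as pd

# Toy listings frame; column names and values are illustrative.
df = pd.DataFrame({
    "condition": ["bad", "good", "excellent", "good"],
    "type": ["sedan", "SUV", "truck", "sedan"],
})

# Ordinal encoding: ordered quality levels map to increasing integers.
condition_map = {"bad": 1, "good": 2, "excellent": 3}
df["condition_ord"] = df["condition"].map(condition_map)

# One-hot encoding: each vehicle type becomes a binary indicator column.
df = pd.get_dummies(df, columns=["type"], prefix="type")

print(df.columns.tolist())
```

`pd.get_dummies` drops the original `type` column and adds one `type_*` indicator per distinct value.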
### 2.1. The Data Pivot: From 70k to 360k Rows
A critical decision was made regarding missing data.
- Initial Approach: Dropping NaNs left only 70,000 rows.
- The Pivot: I used Smart Imputation (Median for Odometer, Mode for Categories) to recover data.
- Result: Training on 363,494 rows resulted in a much more robust model.
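A minimal sketch of the imputation pivot, on a toy frame with the same kinds of gaps (the specific values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values, standing in for the raw listings.
df = pd.DataFrame({
    "odometer": [45000.0, np.nan, 120000.0, 80000.0, np.nan],
    "fuel": ["gas", "gas", None, "diesel", "gas"],
})

# Median imputation for the numeric odometer column...
df["odometer"] = df["odometer"].fillna(df["odometer"].median())
# ...and mode (most frequent value) imputation for categorical columns.
df["fuel"] = df["fuel"].fillna(df["fuel"].mode()[0])

print(df)  # no NaNs remain; all rows are kept for training
```

Unlike dropping NaNs, every row survives, which is what lifted the training set from 70k to 363k rows.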
### Price Distribution
The market is right-skewed, with a long tail of expensive cars.
### Mileage Distribution
Most vehicles fall between 50k and 150k miles.
### Technology Trends
Gasoline engines and automatic transmissions dominate the market.
## 3. Basic Feature Engineering (Encoding) & Baseline Model
### Baseline Model
### Feature Importance
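A minimal sketch of a linear baseline on encoded features; the synthetic year/odometer/price data here is an assumption standing in for the cleaned listings, so the printed score will not match the project's 0.57:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
year = rng.integers(2000, 2022, n).astype(float)
odometer = rng.uniform(5_000, 200_000, n)
# Synthetic price with noise, standing in for the cleaned listings.
price = 1_200 * (year - 2000) - 0.05 * odometer + rng.normal(0, 3_000, n)

X = np.column_stack([year, odometer])
X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=0.2, random_state=42)

# Baseline: plain linear regression, scored with held-out R^2.
model = LinearRegression().fit(X_tr, y_tr)
baseline_r2 = r2_score(y_te, model.predict(X_te))
print(f"Baseline R^2: {baseline_r2:.3f}")
```

For the baseline's feature importance, the fitted `model.coef_` values (on standardized inputs) play the role the bar chart does in the report.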
## 4. Advanced Feature Engineering (Usage Clusters)
I hypothesized that Year and Odometer alone are not enough. I used K-Means Clustering to create a "Usage Profile".
### The Clusters (Year vs. Odometer)
Distinct groups were identified (e.g., "Garage Queens" vs. "High Mileage").
### Validation (PCA)
Using PCA, I confirmed that the clusters are mathematically distinct.
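The clustering-plus-validation step can be sketched as follows; the synthetic year/odometer sample and the choice of 4 clusters are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic year/odometer pairs standing in for the cleaned listings.
year = rng.integers(2000, 2022, 1000).astype(float)
odometer = rng.uniform(5_000, 250_000, 1000)
X = StandardScaler().fit_transform(np.column_stack([year, odometer]))

# K-Means assigns each vehicle a "Usage Profile" label.
km = KMeans(n_clusters=4, n_init=10, random_state=42)
usage_cluster = km.fit_predict(X)

# PCA projection gives a 2-D view for checking cluster separation.
coords = PCA(n_components=2).fit_transform(X)
print(coords.shape, np.bincount(usage_cluster))
```

The `usage_cluster` labels would then be appended to the feature matrix before retraining the regressors.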
## 5. Regression Modeling & Results
I trained three models. Here is the baseline performance vs. the final winner.
### The Improvement Evolution & Model Comparison
- Linear (baseline): R² = 0.57
- Linear + clusters: R² = 0.60 (feature-engineering impact)
- XGBoost: R² = 0.72
- Random Forest: R² = 0.837 (winner)
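A small sketch of the head-to-head comparison on data with a deliberately non-linear depreciation curve; the synthetic generator is an assumption, so the exact scores will differ from the report's numbers, but the ordering (forest beats linear) illustrates the same effect:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
age = rng.uniform(0, 20, n)
odometer = rng.uniform(5_000, 200_000, n)
# Exponential depreciation: deliberately non-linear in vehicle age.
price = 30_000 * np.exp(-0.15 * age) - 0.03 * odometer + rng.normal(0, 1_500, n)

X = np.column_stack([age, odometer])
X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=0.2, random_state=42)

results = {}
for name, model in [("Linear", LinearRegression()),
                    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=42))]:
    model.fit(X_tr, y_tr)
    results[name] = r2_score(y_te, model.predict(X_te))
    print(f"{name}: R^2 = {results[name]:.3f}")
```

The forest's tree splits follow the curve directly, while the linear model can only fit its best straight-line approximation.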
### Why did the Linear Model improve?
The Linear model relied heavily on the engineered clusters (Green bars below), proving the feature engineering was effective.
### Why did Random Forest win?
The Random Forest model was smart enough to use the raw Year and Odometer directly to find non-linear patterns.
## 6. Class Balance
Finally, I converted the problem into a classification task with three price tiers: Low ($0-10k), Mid ($10k-23.5k), and High ($23.5k-100k).
Using Quantile Binning, I ensured perfectly balanced classes (33% each).
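Quantile binning is a one-liner with `pd.qcut`; the lognormal prices below are a synthetic stand-in for the right-skewed listing prices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed synthetic prices standing in for the listings.
prices = pd.Series(rng.lognormal(mean=9.5, sigma=0.6, size=9_000))

# Quantile binning: cut points at the 33rd/66th percentiles yield
# three equally populated classes regardless of the skew.
tier = pd.qcut(prices, q=3, labels=["Low", "Mid", "High"])

print(tier.value_counts(normalize=True).round(3))  # ~0.333 each
```

Because the cut points adapt to the empirical distribution, the classes stay balanced even though the raw prices are heavily skewed.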
### Classification Models: Training & Results
I trained and evaluated three different classification models:
- Logistic Regression: Achieved 69% Accuracy.
- XGBoost: Achieved 78% Accuracy.
- Random Forest (Winner): Outperformed the others with 85% Accuracy.
### Confusion Matrix
The model achieved ~85.5% Accuracy and rarely confused "Low" with "High".
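A sketch of the winning classifier pipeline, from tier labels to the confusion matrix; as before, the synthetic data is an assumption, so the printed accuracy illustrates the workflow rather than reproducing the 85.5% figure:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
age = rng.uniform(0, 20, n)
odometer = rng.uniform(5_000, 200_000, n)
price = 30_000 * np.exp(-0.15 * age) - 0.03 * odometer + rng.normal(0, 1_500, n)

# Tertile labels (0=Low, 1=Mid, 2=High) via quantile cut points.
y = np.digitize(price, np.quantile(price, [1 / 3, 2 / 3]))
X = np.column_stack([age, odometer])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)
acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)
print(f"Accuracy: {acc:.3f}")
print(cm)  # the off-corner cells count Low/High confusions
```

In the confusion matrix, the corners `cm[0, 2]` and `cm[2, 0]` are the Low/High mix-ups the report says are rare; errors concentrate in adjacent tiers near the cut points.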
## 7. Key Takeaways
- Data Volume Wins: Pivoting from strict dropping (70k rows) to Smart Imputation (360k rows) significantly boosted model stability and reduced bias.
- Feature Engineering Works: The custom `Usage_Cluster` feature was the #1 strongest predictor for the Linear Model, proving that combining age and mileage creates value.
- Non-Linearity: Random Forest ($R^2 = 0.837$) outperformed Linear Regression ($R^2 = 0.61$), confirming that car depreciation follows complex, non-linear curves.
- Valuation vs. Bucketing: The project demonstrated two distinct AI use cases. Regression provides exact valuation for sellers (Precision), while Classification (85.5% Accuracy) serves as a robust "Bucketing" tool for buyers, allowing efficient filtering of price ranges.
- Trust-First: The classifier minimizes False Positives, ensuring we rarely mislabel a cheap car as expensive.
## 8. Project Files (Download Links)
All files are hosted in this repository. Click to download:
- Python Notebook: Copy_of_Assignment_2...ipynb
- Regression Model: vehicle_price_model.pkl ($R^2 = 0.837$)
- Classification Model: vehicle_classification_model.pkl (Acc = 80%)
- Dataset: vehicles.csv
Dataset used to train `yuvalkorem1/vehicle-price-predictor`.
Evaluation results:
- R² on Vehicle Sales Data (self-reported): 0.837