πŸš— End-to-End Vehicle Valuation Project

Author: Yuval | Course: Data Science | Goal: Predict used vehicle prices and classify market tiers using advanced ML.

πŸ“Ί Video Presentation

https://www.youtube.com/watch?v=RAbcuBYAg2o


πŸ“Œ 1. Project Overview & Data

The goal is to solve information asymmetry in the used car market by creating an Automated Valuation Model (AVM).

  • Original Dataset: ~426,000 raw listings, 26 columns
  • Final Dataset (After Cleaning): 363,494 rows; 17 columns after cleaning, expanded to 30 features for modeling through engineering and encoding
  • Target: Predict Price ($) and Price Category (Low/Mid/High).

πŸ“Š 2. Exploratory Data Analysis (EDA)

I started by analyzing the market distribution to understand the data structure. First, I removed irrelevant fields that could not affect the price, such as the license plate number and URL link. Then, assuming the dataset was large enough, I initially dropped every row with missing or extreme values; after finishing that pass I realized this approach discarded too much data, so I pivoted (see Section 2.1). I also converted categorical variables into numerical ones: ordinal encoding for ordered categories (for example, mapping 'bad', 'good', and 'excellent' to 1, 2, and 3), and one-hot encoding for the vehicle type, splitting it into multiple type-specific columns with a binary indicator for each type.
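The two encoding steps described above can be sketched with pandas. This is a minimal illustration on toy data; the column names and category values are assumptions, not the project's exact schema:

```python
import pandas as pd

df = pd.DataFrame({
    "condition": ["good", "excellent", "bad", "good"],
    "type": ["sedan", "SUV", "truck", "sedan"],
})

# Ordinal encoding: 'condition' has a natural order, so map it to integers
condition_map = {"bad": 1, "good": 2, "excellent": 3}
df["condition_ord"] = df["condition"].map(condition_map)

# One-hot encoding: 'type' has no order, so split it into binary indicator columns
df = pd.get_dummies(df, columns=["type"], prefix="type")
print(df.columns.tolist())
```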

πŸ› οΈ 2.1. The Data Pivot: From 70k to 360k Rows

A critical decision was made regarding missing data.

  1. Initial Approach: Dropping all rows with NaNs left only ~70,000 rows.
  2. The Pivot: I used Smart Imputation (median for Odometer, mode for categorical features) to recover the data.
  3. Result: Training on 363,494 rows produced a much more robust model.
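The imputation step can be sketched as follows; the toy values are illustrative, and the column names are taken loosely from the write-up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "odometer": [45000.0, np.nan, 120000.0, 88000.0, np.nan],
    "fuel": ["gas", "gas", None, "diesel", "gas"],
})

# Median imputation for the numeric odometer column (robust to outliers)
df["odometer"] = df["odometer"].fillna(df["odometer"].median())

# Mode imputation for categorical columns: fill with the most frequent value
df["fuel"] = df["fuel"].fillna(df["fuel"].mode()[0])

print(df)
```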

Price Distribution

The market is right-skewed with a long tail of expensive cars. *(Figure: price distribution)*

Mileage Distribution

Most vehicles fall between 50k and 150k miles. *(Figure: mileage distribution)*

Technology Trends

Gasoline engines and automatic transmissions dominate the market. *(Figures: fuel type and transmission distributions)*

πŸ“ˆ 3. Basic Feature Engineering (Encoding) & Baseline Model

Baseline Model

*(Figure: baseline model performance)*

Feature Importance

*(Figure: baseline model coefficients)*

🧠 4. Advanced Feature Engineering (Usage Clusters)

I hypothesized that Year and Odometer alone are not enough. I used K-Means Clustering to create a "Usage Profile".

The Clusters (Year vs. Odometer)

Distinct groups were identified (e.g., "Garage Queens" vs. "High Mileage"). *(Figure: cluster scatter plot)*

Validation (PCA)

Using PCA, I confirmed that the clusters are mathematically distinct. *(Figure: PCA projection)*
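The clustering-and-validation pipeline can be sketched as below. The data here is synthetic, and the cluster count of 4 is an assumption (the write-up does not state how many clusters were used):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic year/odometer data standing in for the real listings:
# older cars tend to have higher mileage
year = rng.integers(1995, 2022, size=500)
odometer = rng.normal(200000 - (year - 1995) * 6000, 20000).clip(0)
X = np.column_stack([year, odometer])

# Scale first: year and odometer live on very different numeric scales
X_scaled = StandardScaler().fit_transform(X)

# K-Means assigns each listing a "usage profile" cluster label
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Project to 2 principal components to check cluster separation
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```

In the real pipeline, `labels` would be appended to the feature matrix as the `Usage_Cluster` column.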


πŸ€– 5. Regression Modeling & Results

I trained three models. Here is the baseline performance vs. the final winner.

Improvement Evolution & Model Comparison

  • Linear (Base): RΒ² = 0.57
  • Linear + Clusters: RΒ² = 0.61 (feature engineering impact)
  • XGBoost: RΒ² = 0.72
  • Random Forest: RΒ² = 0.837 (Winner)

*(Figure: model improvement evolution)*
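A minimal sketch of the comparison loop, using scikit-learn's synthetic `make_regression` data as a stand-in for the listings (the RΒ² values above come from the real dataset and will not be reproduced here; XGBoost is omitted to keep the sketch dependency-free):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the vehicle listings
X, y = make_regression(n_samples=2000, n_features=10, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit each candidate model and score it on the same held-out split
results = {}
for name, model in {
    "Linear": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}.items():
    model.fit(X_train, y_train)
    results[name] = r2_score(y_test, model.predict(X_test))
print(results)
```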

Why did the Linear Model improve?

The Linear model relied heavily on the engineered clusters (green bars below), proving the feature engineering was effective. *(Figure: linear model feature importance)*

Why did Random Forest Win?

The Random Forest model was smart enough to use the raw Year and Odometer directly to find non-linear patterns. *(Figure: Random Forest feature importance)*
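How that claim can be checked: `feature_importances_` on a fitted forest shows how much each raw column drives the splits. The data and depreciation curve below are hand-made illustrations, not the project's numbers:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 1000
year = rng.integers(1995, 2022, size=n)
odometer = rng.normal(180000 - (year - 1995) * 5000, 15000).clip(0)
# Illustrative non-linear depreciation: exponential decay in age plus mileage penalty
price = 40000 * 0.92 ** (2022 - year) - 0.05 * odometer + rng.normal(0, 1500, n)

X = pd.DataFrame({"year": year, "odometer": odometer})
rf = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, price)

# Importances sum to 1; higher means the raw column drove more of the splits
for name, imp in zip(X.columns, rf.feature_importances_):
    print(name, round(imp, 3))
```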


🏷️ 6. Class Balance

Finally, I converted the problem into a classification task (Low: $0–10k / Mid: $10k–23.5k / High: $23.5k–100k). Using quantile binning, I ensured perfectly balanced classes (33% each). *(Figure: class balance)*
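Quantile binning is what pandas' `qcut` does: it cuts on quantiles rather than fixed values, so each class gets an equal share of rows. A sketch on synthetic log-normal prices (the real cut points $10k and $23.5k come from the actual data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed prices standing in for the real listings
prices = pd.Series(rng.lognormal(mean=9.5, sigma=0.6, size=9000))

# qcut splits at the 33rd and 67th percentiles, so each class holds ~1/3 of rows
labels = pd.qcut(prices, q=3, labels=["Low", "Mid", "High"])
print(labels.value_counts())
```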

Classification Models: Training & Results

I trained and evaluated three different classification models:

  • Logistic Regression: Achieved 69% Accuracy.
  • XGBoost: Achieved 78% Accuracy.
  • Random Forest (Winner): Outperformed the others with 85% Accuracy.
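A minimal sketch of this training-and-evaluation loop on synthetic 3-class data (XGBoost omitted to keep the sketch dependency-free; the accuracies above come from the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for the Low/Mid/High price tiers
X, y = make_classification(n_samples=3000, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
}
accs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accs[name] = accuracy_score(y_test, model.predict(X_test))
print(accs)

# Confusion matrix for the Random Forest (rows = true class, cols = predicted)
cm = confusion_matrix(y_test, models["RandomForest"].predict(X_test))
print(cm)
```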

Confusion Matrix

The model achieved ~85.5% accuracy and rarely confused "Low" with "High". *(Figure: confusion matrix)*


πŸ’‘ 7. Key Takeaways

  • Data Volume Wins: Pivoting from strict dropping (70k rows) to Smart Imputation (360k rows) significantly boosted model stability and reduced bias.
  • Feature Engineering Works: The custom Usage_Cluster feature was the #1 strongest predictor for the Linear Model, proving that combining age and mileage creates value.
  • Non-Linearity: Random Forest ($R^2=0.837$) outperformed Linear Regression ($R^2=0.61$), confirming that car depreciation follows complex, non-linear curves.
  • Valuation vs. Bucketing: The project demonstrated two distinct AI use cases. Regression provides exact valuation for sellers (Precision), while Classification (85.5% Accuracy) serves as a robust "Bucketing" tool for buyers, allowing efficient filtering of price ranges.
  • Trust-First: The classifier minimizes False Positives, ensuring we rarely mislabel a cheap car as expensive.

πŸ“‚ 8. Project Files (Download Links)

All files are hosted in this repository. Click to download:
