🚗 Car Price Regression & Classification Project

This repository contains my full machine learning project, built as part of Assignment #2 in the Introduction to Data Science course at Reichman University. The goal of the project was to analyze a large cars dataset, clean and enrich the data, visualize insights, engineer features, and ultimately train two predictive models:

  • Regression model: predicting car prices
  • Classification model: categorizing cars into Low / Mid / High price ranges

Both final models are exported as .pkl files and are available in this repository.


📊 1. Dataset Overview

The dataset includes almost 19,000 cars, with features such as:

  • Manufacturer & Model
  • Production Year
  • Mileage
  • Engine Volume
  • Gearbox Type
  • Fuel Type
  • Drive Wheels
  • Color
  • Airbags
  • and more.

Before modeling, I performed cleaning, handled outliers, and created new engineered features.


🔍 2. Exploratory Data Analysis (EDA)

Key steps:

  • Visualizing the raw price distribution
  • Applying log-transform to stabilize variance
  • Identifying outliers using quantiles
  • Removing the noisy top and bottom 1% of records
  • Plotting cleaned price distribution

This resulted in a cleaner dataset with 18,864 rows, much more suitable for modeling.
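
The quantile trimming and log-transform steps above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the column name `Price`, the helper `trim_price_outliers`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def trim_price_outliers(df, price_col="Price", lower=0.01, upper=0.99):
    """Keep rows whose price lies inside the [lower, upper] quantile range."""
    lo, hi = df[price_col].quantile([lower, upper])
    return df[df[price_col].between(lo, hi)].copy()


# Synthetic, right-skewed prices standing in for the real dataset (assumption)
rng = np.random.default_rng(0)
cars = pd.DataFrame({"Price": rng.lognormal(mean=9.5, sigma=1.0, size=1000)})

# Drop the bottom and top 1% of prices
trimmed = trim_price_outliers(cars)

# log(1 + price) compresses the long right tail of the distribution
trimmed["log_price"] = np.log1p(trimmed["Price"])

print(len(cars), "rows before trimming,", len(trimmed), "rows after")
```

Comparing histograms of `Price` and `log_price` is one way to check that the transform actually reduced the skew.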


⚙️ 3. Feature Engineering

I engineered several new features and cleaned others:

  • Car_age: calculated from production year
  • Engine_volume_clean: cleaned numeric version of engine volume
  • Doors_clean: fixed inconsistencies
  • Removed unnecessary columns (ID, Levy, Doors, raw Engine Volume)
  • One-Hot Encoding for all categorical variables
  • Scaling numeric features
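
A rough sketch of how these steps could fit together with pandas and scikit-learn is shown below. The raw column names (`Prod_year`, `Engine_volume`, `Fuel_type`, `Gearbox`), the sample values, and `REFERENCE_YEAR` are assumptions for illustration; only `Car_age` and `Engine_volume_clean` are names taken from this README.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical sample; raw column names and value formats are assumptions
cars = pd.DataFrame({
    "Prod_year": [2010, 2015, 2018, 2005],
    "Engine_volume": ["2.0", "1.6 Turbo", "3.5", "1.8"],
    "Fuel_type": ["Petrol", "Diesel", "Hybrid", "Petrol"],
    "Gearbox": ["Automatic", "Manual", "Automatic", "Manual"],
})

REFERENCE_YEAR = 2025  # assumed; the notebook may use a different reference year

# Car age derived from the production year
cars["Car_age"] = REFERENCE_YEAR - cars["Prod_year"]

# Keep only the leading number, dropping text suffixes such as " Turbo"
cars["Engine_volume_clean"] = (
    cars["Engine_volume"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
)

# Remove the raw column once its cleaned version exists
cars = cars.drop(columns=["Engine_volume"])

numeric_cols = ["Car_age", "Engine_volume_clean"]
categorical_cols = ["Fuel_type", "Gearbox"]

# Scale numeric columns and one-hot encode categorical ones;
# columns not listed here (Prod_year) are dropped by default
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X = preprocess.fit_transform(cars)
print(X.shape)
```

Wrapping the scaler and encoder in one `ColumnTransformer` keeps the preprocessing reusable: the fitted object can be applied to new rows with `preprocess.transform(...)`.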

🌀 Clustering Feature

I also added a new feature, Cluster, using KMeans clustering. It groups the vehicles into 3 segments, and including it improved model performance.
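
As an illustration, a cluster label like this could be produced as in the sketch below. Which columns were fed into KMeans is not stated in this README, so the inputs chosen here (`Car_age`, `Mileage`, `Engine_volume_clean`) and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset (assumption)
rng = np.random.default_rng(0)
cars = pd.DataFrame({
    "Car_age": rng.integers(1, 25, size=300),
    "Mileage": rng.integers(5_000, 300_000, size=300),
    "Engine_volume_clean": rng.choice([1.4, 1.6, 2.0, 2.5, 3.5], size=300),
})

# Assumed clustering inputs; scale first so no single column dominates distances
cluster_inputs = ["Car_age", "Mileage", "Engine_volume_clean"]
scaled = StandardScaler().fit_transform(cars[cluster_inputs])

# Three clusters, matching the 3 segments described above
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cars["Cluster"] = kmeans.fit_predict(scaled)

print(cars["Cluster"].value_counts().sort_index())
```

In a train/test workflow, fitting the scaler and KMeans on the training rows only and calling `predict` on the test rows avoids leaking test information into the feature.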


📈 4. Regression Models

I trained 3 regression models:

  1. Linear Regression (baseline + improved)
  2. Decision Tree
  3. Random Forest (winner)
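
A comparison of this kind can be sketched as below. The data is synthetic and the hyperparameters are arbitrary choices, so the numbers it prints say nothing about the results obtained on the cars dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the car features (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] + np.sin(2.0 * X[:, 1]) + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the same split and score it on the held-out rows
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(f"{name}: MAE={scores['MAE']:.3f} RMSE={scores['RMSE']:.3f} R2={scores['R2']:.3f}")
```

Scoring every model on the same held-out split is what makes the MAE, RMSE, and R² values comparable across models.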

🏆 Winning Model

The best model was the Random Forest Regressor, which achieved the lowest MAE and RMSE and the highest R² of the three models. The final model is exported here as:

car_price_random_forest.pkl
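
For readers unfamiliar with .pkl files, the sketch below shows a save/load round trip with Python's `pickle` module. It assumes the model was serialized with `pickle`; the notebook may have used `joblib` instead. The toy model and the file name `demo_random_forest.pkl` are placeholders, not the repository's actual artifact.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy model trained on synthetic data (assumption, not the real car model)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

# Save to a temporary location so the example leaves the working directory untouched
path = Path(tempfile.mkdtemp()) / "demo_random_forest.pkl"
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm the restored model predicts identically
with open(path, "rb") as f:
    restored = pickle.load(f)

print(np.allclose(model.predict(X), restored.predict(X)))
```

A loaded model expects inputs preprocessed the same way as its training data, and pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.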

🔄 5. Classification

I converted the price variable into 3 balanced classes using qcut:

  • 0 = Low
  • 1 = Mid
  • 2 = High
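
The binning step can be illustrated with `pandas.qcut`, which cuts at quantiles so that each class receives roughly the same number of rows. The prices below are synthetic, and the variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Synthetic prices standing in for the cleaned price column (assumption)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=9.5, sigma=0.8, size=900), name="Price")

# q=3 splits at the 1/3 and 2/3 quantiles; labels map to 0 = Low, 1 = Mid, 2 = High
price_class, bin_edges = pd.qcut(prices, q=3, labels=[0, 1, 2], retbins=True)

class_counts = price_class.value_counts().sort_index()
print(class_counts)
print("Bin edges:", bin_edges)
```

Because the cut points are quantiles rather than fixed price thresholds, the classes stay balanced even when the price distribution is heavily skewed.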

Then I trained:

  • Logistic Regression (🔥 winner, ~76% accuracy)
  • KNN
  • Random Forest Classifier

The Logistic Regression model gave the best results and is exported as:

car_classification_logistic.pkl
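
The three-way classifier comparison described above can be sketched as follows. The data comes from `make_classification`, not from the cars dataset, and the hyperparameters are arbitrary, so the printed accuracies are unrelated to the ~76% reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for Low / Mid / High (assumption)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Stratify so each class keeps the same share in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each classifier and record its accuracy on the held-out rows
accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, clf.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: accuracy={acc:.3f}")
```

With balanced classes, as produced by qcut, plain accuracy is a reasonable headline metric; a confusion matrix would additionally show which price ranges get mixed up.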

🤖 6. Files Included in This Repository

| File | Description |
| --- | --- |
| car_price_random_forest.pkl | Final regression model |
| car_classification_logistic.pkl | Final classification model |
| Assignment_notebook.ipynb | Full Google Colab notebook |
| README.md | This documentation |

🎬 7. Video Presentation

A video walkthrough of the entire project will be attached here. It covers:

  • Dataset overview
  • EDA
  • Feature engineering
  • Clustering
  • Model training
  • HuggingFace upload


✨ 8. Summary

This project gave me practical experience with:

  • Data cleaning
  • EDA
  • Feature engineering
  • Clustering
  • Regression & classification models
  • Exporting models
  • Managing a HuggingFace repository

All work is included in this repository and submitted as required.


Thank you for reviewing!

Yotam Gil

Colab notebook: https://colab.research.google.com/drive/1XQCA3yYdm1k0bnV02Gz1eC3uFdy_N0p2#scrollTo=LcN1CpJ0h_tt