🚗 Car Price Regression & Classification Project

This repository contains my full machine learning project, built as part of Assignment #2 in the Introduction to Data Science course at Reichman University. The goal of the project was to analyze a large cars dataset, clean and enrich the data, visualize insights, engineer features, and ultimately train two predictive models:

Regression model — Predicting car prices
Classification model — Categorizing cars into Low / Mid / High price ranges

Both final models are exported as .pkl files and available here in the repository.

📊 1. Dataset Overview

The dataset includes almost 19,000 cars, with features such as:

Manufacturer & Model
Production Year
Mileage
Engine Volume
Gearbox Type
Fuel Type
Drive Wheels
Color
Airbags
and more.

Before modeling, I performed cleaning, handled outliers, and created new engineered features.

🔍 2. Exploratory Data Analysis (EDA)

Key steps:

Visualizing the raw price distribution
Applying log-transform to stabilize variance
Identifying outliers using quantiles
Removing the top and bottom 1% of records as noise
Plotting cleaned price distribution

This resulted in a cleaner dataset with 18,864 rows, much more suitable for modeling.
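The quantile trimming and log-transform steps can be sketched roughly as follows. This is a minimal illustration on synthetic prices, not the notebook's actual code: the column name `Price`, the helper `trim_price_outliers`, and the use of `np.log1p` are all assumptions.

```python
import numpy as np
import pandas as pd


def trim_price_outliers(df: pd.DataFrame, column: str = "Price",
                        lower: float = 0.01, upper: float = 0.99) -> pd.DataFrame:
    """Keep rows whose value lies between the lower and upper quantiles (inclusive)."""
    low, high = df[column].quantile([lower, upper])
    return df[df[column].between(low, high)].copy()


# Synthetic, right-skewed prices standing in for the real dataset (assumption)
rng = np.random.default_rng(0)
cars = pd.DataFrame({"Price": rng.lognormal(mean=9.5, sigma=1.0, size=1000)})

trimmed = trim_price_outliers(cars)

# log(1 + price) compresses the long right tail of the distribution
trimmed["Log_price"] = np.log1p(trimmed["Price"])

print(len(cars), "rows before trimming,", len(trimmed), "rows after")
```

Plotting a histogram of `Price` next to `Log_price` is one way to compare the raw and transformed distributions.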

⚙️ 3. Feature Engineering

I engineered several new features and cleaned others:

Car_age — calculated from production year
Engine_volume_clean — cleaned numeric version of engine volume
Doors_clean — fixed inconsistencies
Removed unnecessary columns (ID, Levy, Doors, raw Engine Volume)
One-Hot Encoding for all categorical variables
Scaling numeric features
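The feature engineering steps above might look something like the sketch below. The raw column names (`Prod_year`, `Engine_volume`, `Fuel_type`), the sample values, and the reference year are assumptions made for illustration. The notebook may also use a different encoding route (for example `pd.get_dummies`) rather than a `ColumnTransformer`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

REFERENCE_YEAR = 2024  # assumption: the year used to compute car age

# Tiny made-up sample; the real dataset has many more columns and rows
cars = pd.DataFrame({
    "Prod_year": [2010, 2015, 2019],
    "Engine_volume": ["2.0", "1.6 Turbo", "3.5"],
    "Fuel_type": ["Petrol", "Diesel", "Hybrid"],
})

# Car_age: derived from the production year
cars["Car_age"] = REFERENCE_YEAR - cars["Prod_year"]

# Engine_volume_clean: keep only the leading number, drop any text suffix
cars["Engine_volume_clean"] = (
    cars["Engine_volume"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
)

numeric_cols = ["Car_age", "Engine_volume_clean"]
categorical_cols = ["Fuel_type"]

# Scale numeric columns and one-hot encode categorical ones in a single step
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X = preprocess.fit_transform(cars)
print(X.shape)
```

Wrapping the scaler and encoder in one transformer keeps the same preprocessing available at prediction time, which matters when the exported models are reused.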

🌀 Clustering Feature

I also added a new feature, Cluster, using KMeans clustering. It groups vehicles into 3 segments, and adding it improved model performance.
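A minimal sketch of adding a cluster label as a feature is shown below. Which columns were actually clustered is not stated here, so the three numeric columns and the synthetic values are assumptions. Only `n_clusters=3` and the `Cluster` column name come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features standing in for the cleaned dataset (assumption)
rng = np.random.default_rng(42)
cars = pd.DataFrame({
    "Car_age": rng.integers(1, 25, size=300),
    "Mileage": rng.integers(5_000, 300_000, size=300),
    "Engine_volume_clean": rng.choice([1.4, 1.6, 2.0, 2.5, 3.5], size=300),
})

# KMeans is distance-based, so features are scaled before clustering
scaled = StandardScaler().fit_transform(cars)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cars["Cluster"] = kmeans.fit_predict(scaled)

print(cars["Cluster"].value_counts().sort_index())
```

Since the cluster id is a label rather than a quantity, it is usually treated as a categorical feature in downstream models.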

📈 4. Regression Models

I trained 3 regression models:

Linear Regression (baseline + improved)
Decision Tree
Random Forest (winner)

🏆 Winning Model

The best model was the Random Forest Regressor, which had the lowest MAE and RMSE and the highest R² of the three models. The final model is exported here as:

car_price_random_forest.pkl
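The comparison of the three regressors on MAE, RMSE, and R² can be sketched as below. The data comes from `make_regression`, so it is synthetic and mostly linear; the ranking it produces will not necessarily match the project's results, and the hyperparameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the car features (assumption)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }

for name, metrics in results.items():
    print(f"{name}: MAE={metrics['MAE']:.2f} "
          f"RMSE={metrics['RMSE']:.2f} R2={metrics['R2']:.3f}")
```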

🔄 5. Classification

I converted the price variable into 3 balanced classes using qcut:

0 = Low
1 = Mid
2 = High
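As an illustration of the binning step, `pd.qcut` with `q=3` splits values at the tertiles so that each class gets roughly the same number of rows. The synthetic prices below are an assumption. The labels follow the 0/1/2 mapping above.

```python
import numpy as np
import pandas as pd

# Synthetic prices standing in for the cleaned dataset (assumption)
rng = np.random.default_rng(1)
prices = pd.Series(rng.lognormal(mean=9.5, sigma=0.8, size=900), name="Price")

# Split at the 1/3 and 2/3 quantiles: 0 = Low, 1 = Mid, 2 = High
price_class = pd.qcut(prices, q=3, labels=[0, 1, 2]).astype(int)

counts = price_class.value_counts().sort_index()
print(counts)
```

Because the bin edges are quantiles of the data itself, the classes are balanced by construction, which makes plain accuracy a reasonable metric for the classifiers.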

Then I trained:

Logistic Regression (🔥 Winner — ~76% accuracy)
KNN
Random Forest Classifier

The Logistic Regression model gave the best results and is exported as:

car_classification_logistic.pkl
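A comparison of the three classifiers by test accuracy could be set up roughly as follows. The 3-class data is synthetic (`make_classification`), and the scaling pipelines and hyperparameters are assumptions, so the accuracies printed here say nothing about the ~76% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the Low / Mid / High labels (assumption)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

classifiers = {
    # Logistic regression and KNN are scale-sensitive, so they get a scaler
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, clf.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: accuracy={acc:.3f}")
```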

🤖 6. Files Included in This Repository

| File | Description |
| --- | --- |
| `car_price_random_forest.pkl` | Final regression model |
| `car_classification_logistic.pkl` | Final classification model |
| `Assignment_notebook.ipynb` | Full Google Colab notebook |
| `README.md` | This documentation |
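For reference, a `.pkl` model file is typically written and read back as in the sketch below. Whether the files in this repository were saved with `pickle` or `joblib` is not stated, so the use of `pickle` is an assumption; the stand-in model and the file name `example_model.pkl` are hypothetical and only illustrate the round trip.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.linear_model import LinearRegression

# Stand-in model (hypothetical); the repository's models are not loaded here
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 2.0, 4.0])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_model.pkl"

    # Export the fitted model to disk
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a user of the exported file would
    with open(path, "rb") as f:
        loaded = pickle.load(f)

prediction = loaded.predict([[3.0]])
print(prediction)
```

Loading the exported models the same way would also require a compatible scikit-learn version and input features prepared with the same preprocessing used in training. Pickle files can execute code on load, so they should only be opened from trusted sources.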

🎬 7. Video Presentation

A video walkthrough of the entire project will be attached here. It covers:

Dataset overview
EDA
Feature engineering
Clustering
Model training
HuggingFace upload

✨ 8. Summary

This project gave me practical experience with:

Data cleaning
EDA
Feature engineering
Clustering
Regression & classification models
Exporting models
Managing a HuggingFace repository

All work is included in this repository and submitted as required.

Thank you for reviewing!

Yotam Gil

Colab notebook: https://colab.research.google.com/drive/1XQCA3yYdm1k0bnV02Gz1eC3uFdy_N0p2#scrollTo=LcN1CpJ0h_tt
