Car Price Regression & Classification Project
This repository contains my full machine learning project, built as part of Assignment #2 in the Introduction to Data Science course at Reichman University. The goal of the project was to analyze a large cars dataset, clean and enrich the data, visualize insights, engineer features, and ultimately train two predictive models:
- Regression model: predicting car prices
- Classification model: categorizing cars into Low / Mid / High price ranges
Both final models are exported as .pkl files and available here in the repository.
1. Dataset Overview
The dataset includes almost 19,000 cars, with features such as:
- Manufacturer & Model
- Production Year
- Mileage
- Engine Volume
- Gearbox Type
- Fuel Type
- Drive Wheels
- Color
- Airbags
- and more.
Before modeling, I performed cleaning, handled outliers, and created new engineered features.
2. Exploratory Data Analysis (EDA)
Key steps:
- Visualizing the raw price distribution
- Applying log-transform to stabilize variance
- Identifying outliers using quantiles
- Removing the top/bottom 1% noisy records
- Plotting cleaned price distribution
This resulted in a cleaner dataset with 18,864 rows, much more suitable for modeling.
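The trimming and log-transform steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the column name `Price`, the helper `trim_price_outliers`, and the choice of `np.log1p` are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def trim_price_outliers(df, price_col="Price", lower_q=0.01, upper_q=0.99):
    """Keep rows whose price lies inside the [lower_q, upper_q] quantile band.

    Hypothetical helper: the column name and default cut-offs are assumptions.
    """
    low, high = df[price_col].quantile([lower_q, upper_q])
    mask = df[price_col].between(low, high)
    return df.loc[mask].copy()


# Synthetic, right-skewed prices standing in for the real dataset.
rng = np.random.default_rng(0)
cars = pd.DataFrame({"Price": rng.lognormal(mean=9.5, sigma=1.0, size=10_000)})

# Drop the bottom and top 1% of prices, then log-transform what is left.
trimmed = trim_price_outliers(cars)
trimmed["log_price"] = np.log1p(trimmed["Price"])

print(f"rows before: {len(cars)}, rows after: {len(trimmed)}")
print(f"skew of Price: {trimmed['Price'].skew():.2f}")
print(f"skew of log_price: {trimmed['log_price'].skew():.2f}")
```

Comparing the skew of the raw and log-transformed columns is one quick way to check that the transform made the distribution more symmetric.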
3. Feature Engineering
I engineered several new features and cleaned others:
- Car_age: calculated from production year
- Engine_volume_clean: cleaned numeric version of engine volume
- Doors_clean: fixed inconsistencies
- Removed unnecessary columns (ID, Levy, Doors, raw Engine Volume)
- One-Hot Encoding for all categorical variables
- Scaling numeric features
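A rough sketch of these feature-engineering steps on a tiny made-up table is shown below. The raw column names, the sample values (such as `"2.0 Turbo"`), and the reference year used for `Car_age` are all assumptions for illustration; the notebook may use different ones.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

REFERENCE_YEAR = 2025  # assumption: the year against which car age is measured

# Made-up rows; column names and values are hypothetical.
raw = pd.DataFrame({
    "Prod_year": [2010, 2015, 2018, 2012],
    "Engine_volume": ["2.0 Turbo", "1.6", "3.5", "1.8 Turbo"],
    "Mileage": [150_000, 80_000, 40_000, 120_000],
    "Fuel_type": ["Petrol", "Diesel", "Hybrid", "Petrol"],
    "Gearbox_type": ["Automatic", "Manual", "Automatic", "Tiptronic"],
})

features = raw.copy()

# Car_age: derived from the production year.
features["Car_age"] = REFERENCE_YEAR - features["Prod_year"]

# Engine_volume_clean: pull the leading number out of strings like "2.0 Turbo".
features["Engine_volume_clean"] = (
    features["Engine_volume"]
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

# Drop the raw columns that the engineered features replace.
features = features.drop(columns=["Prod_year", "Engine_volume"])

numeric_cols = ["Car_age", "Engine_volume_clean", "Mileage"]
categorical_cols = ["Fuel_type", "Gearbox_type"]

# Scale numeric columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X = preprocess.fit_transform(features)
print(X.shape)
```

Wrapping the scaler and encoder in a `ColumnTransformer` keeps the same transformations available at prediction time, which matters when the fitted model is later loaded from a file.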
Clustering Feature
I also added a new feature, Cluster, produced by KMeans clustering.
It groups vehicles into 3 segments, and adding it improved model performance.
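As a minimal sketch of how a cluster label can be turned into a feature: the example below fits KMeans with 3 clusters on synthetic, scaled numeric data and appends the label as an extra column. Which columns the notebook actually clustered on is not stated here, so the two features used (car age and mileage) are an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Three synthetic groups of (car age in years, mileage); purely illustrative.
centers = np.array([[3, 30_000], [10, 120_000], [20, 250_000]])
X_num = np.vstack([
    center + rng.normal(scale=[1.0, 10_000], size=(100, 2))
    for center in centers
])

# KMeans is distance-based, so scale the features first.
X_scaled = StandardScaler().fit_transform(X_num)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Append the cluster id as one more column for downstream models.
X_with_cluster = np.column_stack([X_scaled, cluster_labels])
print(X_with_cluster.shape)
```

Because the cluster id is a category rather than a quantity, it can also be one-hot encoded before it is passed to a linear model. In a train/test setup, fitting KMeans on the training split only avoids leaking test data into the feature.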
4. Regression Models
I trained 3 regression models:
- Linear Regression (baseline + improved)
- Decision Tree
- Random Forest (winner)
Winning Model
The best model was the Random Forest Regressor, which achieved the lowest MAE and RMSE and the highest R² of the three models. The final model is exported here as:
car_price_random_forest.pkl
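The comparison of the three regressors on MAE, RMSE, and R² can be sketched as below. The data is synthetic and the hyperparameters are scikit-learn defaults chosen for illustration, so the numbers it prints say nothing about the results in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression problem standing in for the car data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(600, 4))
y = 10 * np.sin(3 * X[:, 0]) + 5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and collect the same three metrics on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }

for name, metrics in results.items():
    print(f"{name}: MAE={metrics['MAE']:.3f} "
          f"RMSE={metrics['RMSE']:.3f} R2={metrics['R2']:.3f}")
```

Evaluating all models on the same held-out split keeps the comparison fair; lower MAE and RMSE and higher R² indicate a better fit.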
5. Classification
I converted the price variable into 3 balanced classes using qcut:
- 0 = Low
- 1 = Mid
- 2 = High
Then I trained:
- Logistic Regression (winner, ~76% accuracy)
- KNN
- Random Forest Classifier
The Logistic Regression model gave the best results and is exported as:
car_classification_logistic.pkl
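A small sketch of the binning and classification steps: `pd.qcut` with `q=3` splits the price at its tertiles, which is what makes the three classes roughly equal in size. The synthetic price formula, the two input features, and the pipeline settings are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic cars: price falls with age and mileage, plus noise.
rng = np.random.default_rng(1)
n = 900
car_age = rng.uniform(0, 25, n)
mileage = rng.uniform(0, 300_000, n)
price = 40_000 - 1_000 * car_age - 0.05 * mileage + rng.normal(scale=2_000, size=n)

# Quantile-based binning: 0 = Low, 1 = Mid, 2 = High.
price_class = pd.qcut(pd.Series(price), q=3, labels=[0, 1, 2]).astype(int)
print(price_class.value_counts().sort_index())

X = np.column_stack([car_age, mileage])
X_train, X_test, y_train, y_test = train_test_split(
    X, price_class, test_size=0.2, stratify=price_class, random_state=1
)

# Scale inputs, then fit a multiclass logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"accuracy: {accuracy:.3f}")
```

With balanced classes, plain accuracy is a reasonable headline metric, since a model that always predicts one class would score only about one third.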
6. Files Included in This Repository
| File | Description |
|---|---|
| car_price_random_forest.pkl | Final regression model |
| car_classification_logistic.pkl | Final classification model |
| Assignment_notebook.ipynb | Full Google Colab notebook |
| README.md | This documentation |
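For readers who want to load one of the exported models, the round trip below shows the general pattern. It assumes the `.pkl` files were written with `joblib` (or the standard `pickle` module, which `joblib.load` can typically read as well); the demo model and the file name `demo_model.pkl` are stand-ins, not files from this repository.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model: fits y = 2x + 1 exactly on three points.
model = LinearRegression().fit(
    np.array([[0.0], [1.0], [2.0]]), np.array([1.0, 3.0, 5.0])
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)      # export
    restored = joblib.load(path)  # load it back, as a user of the repo would

prediction = restored.predict(np.array([[3.0]]))
print(prediction)
```

When loading the real models, the input rows need to go through the same cleaning, encoding, and scaling that was applied during training, and the installed scikit-learn version should be close to the one used to export them.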
7. Video Presentation
A video walkthrough of the entire project will be attached here. It covers:
- Dataset overview
- EDA
- Feature engineering
- Clustering
- Model training
- HuggingFace upload
8. Summary
This project gave me practical experience with:
- Data cleaning
- EDA
- Feature engineering
- Clustering
- Regression & classification models
- Exporting models
- Managing a HuggingFace repository
All work is included in this repository and submitted as required.
Thank you for reviewing!
Yotam Gil
Google Colab notebook: https://colab.research.google.com/drive/1XQCA3yYdm1k0bnV02Gz1eC3uFdy_N0p2#scrollTo=LcN1CpJ0h_tt






