## Project Video Walkthrough
# Airbnb Price Prediction & Price Classification: Complete Data Science Workflow (EDA, Feature Engineering, Regression, and Classification)
This project presents a full end-to-end data science pipeline applied to Airbnb listing data. The goal was to build models that:
Predict continuous log-price (Regression)
Classify listings into three price tiers: Low / Medium / High (Classification)
The workflow includes data cleaning, exploratory data analysis, extensive feature engineering, model training, evaluation, interpretation, and final model selection.
## Table of Contents

1. Project Overview
2. Dataset & Objective
3. Data Cleaning
4. Exploratory Data Analysis (EDA)
5. Feature Engineering
6. Regression Modeling
7. Price Classification Modeling
8. Key Insights & Takeaways
9. Tools & Technologies Used
## 1. Project Overview

This assignment demonstrates practical experience in:

- Data preprocessing and cleaning
- Exploratory Data Analysis
- Feature Engineering (including clustering features)
- Regression modeling
- Classification modeling
- Model evaluation and interpretation
- Building reproducible ML pipelines using Scikit-Learn
The project concludes with the selection of the best regression model and the best classification model.
## 2. Dataset & Objective

The dataset contains Airbnb listings with:

- Price & log-price
- Geographic coordinates
- Property features (bedrooms, beds, bathrooms, accommodates)
- Host attributes
- Review statistics
- Categorical descriptors such as room_type, property_type, neighbourhood
Two prediction tasks were defined:

### Task 1: Regression

Predict the log-price of a listing using numerical, categorical, engineered, and spatial features.

### Task 2: Classification

Transform log-price into 3 balanced classes using quantiles:

- 0 = Low price
- 1 = Medium price
- 2 = High price

Then train and compare classification models.
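The quantile binning described above can be sketched with `pandas.qcut`. This is a minimal example on synthetic log-price values, not the project's actual data:

```python
import numpy as np
import pandas as pd

# Synthetic log-price values (illustrative only, not the project's data)
rng = np.random.default_rng(42)
log_price = pd.Series(rng.normal(loc=4.8, scale=0.7, size=300))

# pd.qcut splits at the 33rd/66th percentiles, giving three near-equal tiers:
# 0 = Low, 1 = Medium, 2 = High
price_tier = pd.qcut(log_price, q=3, labels=[0, 1, 2]).astype(int)

print(price_tier.value_counts().sort_index())
```

Because the cut points are quantiles of the target itself, the resulting classes are balanced by construction.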
## 3. Data Cleaning

A structured cleaning process was performed.

**Missing values:**

- Removed rows missing the target log_price
- Filled numerical features using the median
- Filled categorical features using the mode
- Imputed categorical location fields with "Unknown"
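A minimal sketch of these imputation rules, using a toy frame whose column names are illustrative stand-ins for the real ones:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the cleaning rules (column names are illustrative)
df = pd.DataFrame({
    "log_price": [4.2, np.nan, 5.1, 4.8],
    "bedrooms": [1.0, 2.0, np.nan, 3.0],
    "room_type": ["Entire home", "Entire home", None, "Private room"],
    "neighbourhood": ["Downtown", "Downtown", None, "Harbor"],
})

# 1) Drop rows missing the target
df = df.dropna(subset=["log_price"])

# 2) Numerical features -> median
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())

# 3) Categorical features -> mode
df["room_type"] = df["room_type"].fillna(df["room_type"].mode()[0])

# 4) Location fields -> explicit "Unknown" token
df["neighbourhood"] = df["neighbourhood"].fillna("Unknown")

print(df)
```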
**Outlier analysis:**

Outliers were explored via:

- The IQR method
- Z-scores
- Boxplots

They were not removed, as they represent valid real-world listings.
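The IQR and Z-score checks can be sketched as follows. The values are synthetic, and the 1.5×IQR fence and |z| > 3 threshold are the conventional defaults, assumed here rather than taken from the project:

```python
import pandas as pd

# Synthetic price-like values with one extreme point (illustrative)
values = pd.Series([90, 100, 110, 105, 95, 100, 1000])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag |z| > 3
z = (values - values.mean()) / values.std()
z_outliers = z.abs() > 3

print(values[iqr_outliers])
```

Note that on small samples the two rules can disagree: a single extreme value inflates the standard deviation, so the Z-score rule may miss what the IQR rule catches.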
**Data types & formatting:**

Standardized data types and ensured consistency prior to modeling.
## 4. Exploratory Data Analysis (EDA)

EDA provided insights that guided Feature Engineering.

**Distribution plots:**

- log_price
- accommodates
- number_of_reviews
- ratings

**Correlation heatmap:** helped identify linear relationships and redundancy among features.

**Scatter/regression plots:** revealed weak or non-linear relationships for some variables.
**Key takeaway:** price is highly influenced by location and property characteristics, requiring more sophisticated feature transformations.
## 5. Feature Engineering

Feature Engineering was the most impactful part of the project.

### A. Spatial Features Using Clustering

Applied KMeans (k=10) to latitude & longitude, producing:

- cluster_id: the listing's geographic zone
- distance_to_centroid: how far the listing is from the center of its cluster

These features capture neighborhood effects, which strongly influence price.
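A minimal sketch of the two spatial features on synthetic coordinates. The k=10 setting matches the description above; everything else (coordinate ranges, sample size) is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic coordinates standing in for listing latitude/longitude
rng = np.random.default_rng(0)
coords = rng.uniform(low=[40.60, -74.05], high=[40.90, -73.70], size=(500, 2))

# k=10 geographic zones, as in the project
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(coords)

# cluster_id: which zone each listing falls in
cluster_id = km.labels_

# distance_to_centroid: how far each listing sits from its zone's center
distance_to_centroid = np.linalg.norm(
    coords - km.cluster_centers_[cluster_id], axis=1
)

print(cluster_id[:5], distance_to_centroid[:5].round(4))
```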
### B. Interaction & Ratio Features

Created meaningful engineered variables:

- beds_per_bedroom
- bath_per_bedroom
- reviews_ratio (reviews relative to guest capacity)

These capture density, comfort, and listing popularity patterns.
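These ratios can be sketched as below. The zero-bedroom guard (clipping the denominator to 1) is an assumption added for the example, not necessarily the project's handling:

```python
import pandas as pd

# Toy listing frame (columns mirror the ones named above)
df = pd.DataFrame({
    "beds": [2, 4, 1],
    "bedrooms": [1, 2, 0],
    "bathrooms": [1.0, 2.5, 1.0],
    "number_of_reviews": [10, 50, 5],
    "accommodates": [2, 6, 1],
})

# Guard against division by zero with a clipped denominator (assumed choice)
bedrooms = df["bedrooms"].clip(lower=1)
df["beds_per_bedroom"] = df["beds"] / bedrooms
df["bath_per_bedroom"] = df["bathrooms"] / bedrooms
df["reviews_ratio"] = df["number_of_reviews"] / df["accommodates"]

print(df[["beds_per_bedroom", "bath_per_bedroom", "reviews_ratio"]])
```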
### C. Encoding & Scaling

Used ColumnTransformer with:

- StandardScaler for numerical features
- OneHotEncoder for categorical features

All models were wrapped in a Scikit-Learn Pipeline to ensure reproducibility and prevent data leakage.
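A minimal sketch of such a preprocessing-plus-model pipeline; the column names and the tiny sample are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["accommodates", "bedrooms"]
cat_cols = ["room_type"]

# Scale numerics, one-hot encode categoricals, in a single transformer
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Wrapping both steps in a Pipeline means fit/transform statistics are
# learned only on training data, which is what prevents leakage
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

# Tiny synthetic sample to show the pipeline end to end
X = pd.DataFrame({
    "accommodates": [2, 4, 6, 3],
    "bedrooms": [1, 2, 3, 1],
    "room_type": ["Entire home", "Private room", "Entire home", "Private room"],
})
y = np.array([4.2, 4.8, 5.5, 4.4])

model.fit(X, y)
print(model.predict(X).round(2))
```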
## 6. Regression Modeling (Predicting log_price)

Three models were trained:

- Linear Regression (baseline + advanced pipeline)
- Decision Tree Regressor
- Random Forest Regressor

**Baseline model (simple numeric features only):**

- R² ≈ 0.38
- MAE ≈ 0.40

A weak but important benchmark.
**Final regression results (with engineered features):** the engineered feature set raised R² from ≈ 0.38 to ≈ 0.73.

**Regression winner: Random Forest Regressor.** It captured the nonlinear patterns and interactions introduced by the engineered features.
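The evaluation loop for such a model can be sketched on synthetic nonlinear data; the metrics printed are for the toy data only, not the project's results:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data standing in for the engineered feature matrix
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(400, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.05, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# A forest picks up the nonlinearities that a linear baseline would miss
rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_tr, y_tr)

pred = rf.predict(X_te)
print(f"MAE={mean_absolute_error(y_te, pred):.3f}  R2={r2_score(y_te, pred):.3f}")
```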
## 7. Classification Modeling (Price Tier Prediction)

log_price was transformed into 3 balanced classes using quantiles; the class distribution remained roughly uniform in both the train and test sets.

Three models were trained:

- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
**Findings:**

- Logistic Regression showed the best overall performance
- It made fewer severe errors and rarely confused low-priced listings with high-priced ones
- The Decision Tree overfit and performed poorly
- The Random Forest was strong but slightly less balanced than Logistic Regression
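The kind of check behind these findings can be sketched on a synthetic three-class problem. The off-diagonal corners of the confusion matrix (true 0 predicted 2, and vice versa) count exactly the severe low/high confusions mentioned above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the Low/Medium/High tiers
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))
# Tiers 0/1/2 derived from a linear score, so classes are ordered
y = np.digitize(X[:, 0] + 0.5 * X[:, 1], bins=[-0.5, 0.5])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2, stratify=y)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = clf.predict(X_te)
print("macro F1:", round(f1_score(y_te, pred, average="macro"), 3))
# Corner cells count Low<->High confusions, the most severe errors
print(confusion_matrix(y_te, pred))
```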
**Classification winner: Logistic Regression.**
## 8. Key Insights & Takeaways

- **Feature Engineering mattered more than model selection.** After engineering features, R² jumped from 0.38 to 0.73.
- **Spatial features were the most powerful.** cluster_id and distance_to_centroid were heavily used by the Random Forest.
- **Regression and Classification answer different business questions.** Regression asks "How much will this listing cost?"; Classification asks "Is this listing cheap, medium, or expensive?"
- **Balanced classes allowed fair evaluation.** Quantile binning ensured no class dominated the dataset.
- **Pipelines ensured clean, reproducible workflows.**
## 9. Tools & Technologies Used

- Python
- Scikit-Learn
- Pandas
- NumPy
- Seaborn & Matplotlib
- KMeans clustering
- Machine learning pipelines
- OneHotEncoder & StandardScaler
## Final Notes

This project demonstrates a complete ML workflow, from raw data to insights and deployment-ready models. Both regression and classification models were evaluated, and the impact of Feature Engineering was clearly visible in the results.

