## Project Video Walkthrough
# Airbnb Price Prediction & Price Classification: Complete Data Science Workflow (EDA, Feature Engineering, Regression, and Classification)
This project presents a full end-to-end data science pipeline applied to Airbnb listing data. The goal was to build models that:
Predict continuous log-price (Regression)
Classify listings into three price tiers: Low / Medium / High (Classification)
The workflow includes data cleaning, exploratory data analysis, extensive feature engineering, model training, evaluation, interpretation, and final model selection.
## Table of Contents

1. Project Overview
2. Dataset & Objective
3. Data Cleaning
4. Exploratory Data Analysis (EDA)
5. Feature Engineering
6. Regression Modeling
7. Price Classification Modeling
8. Key Insights & Takeaways
9. Tools & Technologies Used
## 1. Project Overview

This assignment demonstrates practical experience in:

- Data preprocessing and cleaning
- Exploratory Data Analysis
- Feature Engineering (including clustering features)
- Regression modeling
- Classification modeling
- Model evaluation and interpretation
- Building reproducible ML pipelines using Scikit-Learn
The project concludes with the selection of the best regression model and the best classification model.
## 2. Dataset & Objective

The dataset contains Airbnb listings with:

- Price & log-price
- Geographic coordinates
- Property features (bedrooms, beds, bathrooms, accommodates)
- Host attributes
- Review statistics
- Categorical descriptors such as room_type, property_type, neighbourhood
Two prediction tasks were defined:

### Task 1: Regression

Predict the log-price of a listing using numerical, categorical, engineered, and spatial features.

### Task 2: Classification

Transform log-price into 3 balanced classes using quantiles:

- 0 = Low price
- 1 = Medium price
- 2 = High price

Then train and compare classification models.
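The quantile binning described above can be sketched with `pandas.qcut`. This is a minimal example on synthetic log-price values, not the project's actual data:

```python
import numpy as np
import pandas as pd

# Synthetic log-price values (illustrative only, not the project's data)
rng = np.random.default_rng(42)
log_price = pd.Series(rng.normal(loc=4.8, scale=0.7, size=300))

# pd.qcut splits at the 33rd/66th percentiles, giving three near-equal tiers:
# 0 = Low, 1 = Medium, 2 = High
price_tier = pd.qcut(log_price, q=3, labels=[0, 1, 2]).astype(int)

print(price_tier.value_counts().sort_index())
```

Because the cut points are quantiles of the target itself, the resulting classes are balanced by construction.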
## 3. Data Cleaning

A structured cleaning process was performed.

**Missing values:**

- Removed rows missing the target log_price
- Filled numerical features using the median
- Filled categorical features using the mode
- Imputed categorical location fields with "Unknown"
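A minimal sketch of these imputation rules, using a toy frame whose column names are illustrative stand-ins for the real ones:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the cleaning rules (column names are illustrative)
df = pd.DataFrame({
    "log_price": [4.2, np.nan, 5.1, 4.8],
    "bedrooms": [1.0, 2.0, np.nan, 3.0],
    "room_type": ["Entire home", "Entire home", None, "Private room"],
    "neighbourhood": ["Downtown", "Downtown", None, "Harbor"],
})

# 1) Drop rows missing the target
df = df.dropna(subset=["log_price"])

# 2) Numerical features -> median
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())

# 3) Categorical features -> mode
df["room_type"] = df["room_type"].fillna(df["room_type"].mode()[0])

# 4) Location fields -> explicit "Unknown" token
df["neighbourhood"] = df["neighbourhood"].fillna("Unknown")

print(df)
```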
**Outlier analysis:**

Outliers were explored via:

- The IQR method
- Z-scores
- Boxplots

They were not removed, as they represent valid real-world listings.
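The IQR and Z-score checks can be sketched as follows. The values are synthetic, and the 1.5×IQR fence and |z| > 3 threshold are the conventional defaults, assumed here rather than taken from the project:

```python
import pandas as pd

# Synthetic price-like values with one extreme point (illustrative)
values = pd.Series([90, 100, 110, 105, 95, 100, 1000])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag |z| > 3
z = (values - values.mean()) / values.std()
z_outliers = z.abs() > 3

print(values[iqr_outliers])
```

Note that on small samples the two rules can disagree: a single extreme value inflates the standard deviation, so the Z-score rule may miss what the IQR rule catches.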
**Data types & formatting:**

Standardized data types and ensured consistency prior to modeling.
## 4. Exploratory Data Analysis (EDA)

EDA provided insights that guided Feature Engineering.

**Distribution plots:**

- log_price
- accommodates
- number_of_reviews
- ratings

**Correlation heatmap:** helped identify linear relationships and redundancy among features.

**Scatter/regression plots:** revealed weak or non-linear relationships for some variables.
**Key takeaway:** price is highly influenced by location and property characteristics, requiring more sophisticated feature transformations.
## 5. Feature Engineering

Feature Engineering was the most impactful part of the project.

### A. Spatial Features Using Clustering

Applied KMeans (k=10) to latitude & longitude, producing:

- cluster_id: the listing's geographic zone
- distance_to_centroid: how far the listing is from the center of its cluster

These features capture neighborhood effects, which strongly influence price.
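A minimal sketch of the two spatial features on synthetic coordinates. The k=10 setting matches the description above; everything else (coordinate ranges, sample size) is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic coordinates standing in for listing latitude/longitude
rng = np.random.default_rng(0)
coords = rng.uniform(low=[40.60, -74.05], high=[40.90, -73.70], size=(500, 2))

# k=10 geographic zones, as in the project
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(coords)

# cluster_id: which zone each listing falls in
cluster_id = km.labels_

# distance_to_centroid: how far each listing sits from its zone's center
distance_to_centroid = np.linalg.norm(
    coords - km.cluster_centers_[cluster_id], axis=1
)

print(cluster_id[:5], distance_to_centroid[:5].round(4))
```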
### B. Interaction & Ratio Features

Created meaningful engineered variables:

- beds_per_bedroom
- bath_per_bedroom
- reviews_ratio (reviews relative to guest capacity)

These capture density, comfort, and listing popularity patterns.
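These ratios can be sketched as below. The zero-bedroom guard (clipping the denominator to 1) is an assumption added for the example, not necessarily the project's handling:

```python
import pandas as pd

# Toy listing frame (columns mirror the ones named above)
df = pd.DataFrame({
    "beds": [2, 4, 1],
    "bedrooms": [1, 2, 0],
    "bathrooms": [1.0, 2.5, 1.0],
    "number_of_reviews": [10, 50, 5],
    "accommodates": [2, 6, 1],
})

# Guard against division by zero with a clipped denominator (assumed choice)
bedrooms = df["bedrooms"].clip(lower=1)
df["beds_per_bedroom"] = df["beds"] / bedrooms
df["bath_per_bedroom"] = df["bathrooms"] / bedrooms
df["reviews_ratio"] = df["number_of_reviews"] / df["accommodates"]

print(df[["beds_per_bedroom", "bath_per_bedroom", "reviews_ratio"]])
```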
### C. Encoding & Scaling

Used ColumnTransformer with:

- StandardScaler for numerical features
- OneHotEncoder for categorical features

All models were wrapped in a Scikit-Learn Pipeline to ensure reproducibility and prevent data leakage.
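A minimal sketch of such a preprocessing-plus-model pipeline; the column names and the tiny sample are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["accommodates", "bedrooms"]
cat_cols = ["room_type"]

# Scale numerics, one-hot encode categoricals, in a single transformer
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Wrapping both steps in a Pipeline means fit/transform statistics are
# learned only on training data, which is what prevents leakage
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

# Tiny synthetic sample to show the pipeline end to end
X = pd.DataFrame({
    "accommodates": [2, 4, 6, 3],
    "bedrooms": [1, 2, 3, 1],
    "room_type": ["Entire home", "Private room", "Entire home", "Private room"],
})
y = np.array([4.2, 4.8, 5.5, 4.4])

model.fit(X, y)
print(model.predict(X).round(2))
```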
## 6. Regression Modeling (Predicting log_price)

Three models were trained:

- Linear Regression (baseline + advanced pipeline)
- Decision Tree Regressor
- Random Forest Regressor

**Baseline model (simple numeric features only):**

- R² ≈ 0.38
- MAE ≈ 0.40

A weak but important benchmark.
**Final regression results (with engineered features):** the engineered feature set raised R² from ≈ 0.38 to ≈ 0.73.

**Regression winner: Random Forest Regressor.** It captured the nonlinear patterns and interactions introduced by the engineered features.
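The evaluation loop for such a model can be sketched on synthetic nonlinear data; the metrics printed are for the toy data only, not the project's results:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data standing in for the engineered feature matrix
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(400, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.05, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# A forest picks up the nonlinearities that a linear baseline would miss
rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_tr, y_tr)

pred = rf.predict(X_te)
print(f"MAE={mean_absolute_error(y_te, pred):.3f}  R2={r2_score(y_te, pred):.3f}")
```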
## 7. Classification Modeling (Price Tier Prediction)

log_price was transformed into 3 balanced classes using quantiles; the class distribution remained roughly uniform in both the train and test sets.

Three models were trained:

- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
**Findings:**

- Logistic Regression showed the best overall performance
- It made fewer severe errors and rarely confused low-priced listings with high-priced ones
- The Decision Tree overfit and performed poorly
- The Random Forest was strong but slightly less balanced than Logistic Regression
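The kind of check behind these findings can be sketched on a synthetic three-class problem. The off-diagonal corners of the confusion matrix (true 0 predicted 2, and vice versa) count exactly the severe low/high confusions mentioned above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the Low/Medium/High tiers
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))
# Tiers 0/1/2 derived from a linear score, so classes are ordered
y = np.digitize(X[:, 0] + 0.5 * X[:, 1], bins=[-0.5, 0.5])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2, stratify=y)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = clf.predict(X_te)
print("macro F1:", round(f1_score(y_te, pred, average="macro"), 3))
# Corner cells count Low<->High confusions, the most severe errors
print(confusion_matrix(y_te, pred))
```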
**Classification winner: Logistic Regression.**
## 8. Key Insights & Takeaways

- **Feature Engineering mattered more than model selection.** After engineering features, R² jumped from 0.38 to 0.73.
- **Spatial features were the most powerful.** cluster_id and distance_to_centroid were heavily used by the Random Forest.
- **Regression and Classification answer different business questions.** Regression asks "How much will this listing cost?"; Classification asks "Is this listing cheap, medium, or expensive?"
- **Balanced classes allowed fair evaluation.** Quantile binning ensured no class dominated the dataset.
- **Pipelines ensured clean, reproducible workflows.**
## 9. Tools & Technologies Used

- Python
- Scikit-Learn
- Pandas
- NumPy
- Seaborn & Matplotlib
- KMeans clustering
- Machine learning pipelines
- OneHotEncoder & StandardScaler
## Final Notes

This project demonstrates a complete ML workflow, from raw data to insights and deployment-ready models. Both regression and classification models were evaluated, and the impact of Feature Engineering was clearly visible in the results.

