Airbnb NYC Price Analysis and Classification Project
Overview
This project analyzes Airbnb listings in New York City. The primary goal was to predict listing prices and classify properties into price tiers (Low, Medium, High). The project demonstrates a full Data Science lifecycle: from rigorous data cleaning and EDA to training regression and classification models, identifying critical data leakage, and performing unsupervised clustering.
Workflow and Key Findings
The notebook follows these main steps:
- Data Cleaning: Removing currency symbols, handling missing values, and type conversion.
- EDA: Visualizing correlations which hinted at data leakage via the 'service_fee' feature.
- Regression (Baseline): Establishing a baseline model which performed suspiciously well due to the leakage.
- Classification: Retraining models after removing the leaking feature to reveal the true predictive power of physical attributes.
- Clustering: Segmenting the market using K-Means.
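The cleaning steps above can be sketched as follows. The raw formats and sample values here are assumptions modeled on the common NYC Airbnb dataset, not taken from the notebook itself:

```python
import pandas as pd

# Hypothetical raw sample mimicking Airbnb listing fields (values assumed)
df = pd.DataFrame({
    "price": ["$120", "$85", None, "$300"],
    "service_fee": ["$24", "$17", "$10", None],
    "minimum_nights": ["1", "2", "3", "1"],
})

# Remove currency symbols and convert to numeric
for col in ["price", "service_fee"]:
    df[col] = pd.to_numeric(df[col].str.replace(r"[$,]", "", regex=True))

# Handle missing values: drop rows missing the target, impute the rest
df = df.dropna(subset=["price"])
df["service_fee"] = df["service_fee"].fillna(df["service_fee"].median())

# Type conversion
df["minimum_nights"] = df["minimum_nights"].astype(int)
print(df.dtypes)
```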
1. Exploratory Data Analysis (EDA)
We started by examining the distribution of data and relationships between variables. The heatmap was particularly revealing.
Visualizations: Correlation and Distributions

Key Insight: The perfect correlation between 'price' and 'service_fee' was the first indicator that 'service_fee' is a direct derivative of the target variable, leading to data leakage.
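A minimal sketch of how such leakage can be flagged programmatically. The data is synthetic, with service_fee constructed as a fixed fraction of price to mirror the finding above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(50, 500, size=1000)

# Synthetic stand-ins: service_fee is a deterministic derivative of the
# target (the leak); number_of_reviews is unrelated noise
df = pd.DataFrame({
    "price": price,
    "service_fee": price * 0.2,
    "number_of_reviews": rng.integers(0, 300, size=1000),
})

# Any feature correlating near 1.0 with the target is a leakage suspect
corr = df.corr()["price"].drop("price")
suspects = corr[corr.abs() > 0.95].index.tolist()
print(suspects)  # ['service_fee']
```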
2. Regression Modeling (Baseline)
Model Setup
- Regression Goal: The goal is to predict the daily price (continuous variable) of an Airbnb listing in NYC based on its numeric attributes (reviews, availability, location coordinates, etc.).
- Feature Selection (Baseline): For the baseline model, we will use only the numeric features available in the dataset to establish a starting point.
- Features (X): service_fee, minimum_nights, number_of_reviews, reviews_per_month, review_rate_number, calculated_host_listings_count, availability_365, construction_year, lat, long
- Target (y): price
- Train-Test Split: We will split the data into 80% training set and 20% testing set using a random seed for reproducibility.
- Model Training: We will use a standard LinearRegression model from Scikit-Learn with default parameters.
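The baseline setup above can be sketched as follows. The data here is synthetic, with service_fee deliberately built as 20% of price to reproduce the leakage scenario; only a subset of the listed features is used for brevity:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data reproducing the leakage: service_fee = 20% of price
rng = np.random.default_rng(42)
n = 2000
price = rng.uniform(50, 500, size=n)
X = pd.DataFrame({
    "service_fee": price * 0.2,
    "minimum_nights": rng.integers(1, 30, size=n),
    "availability_365": rng.integers(0, 366, size=n),
})

# 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
rmse = mean_squared_error(y_test, pred) ** 0.5
print(f"RMSE: {rmse:.2f}, R2: {r2_score(y_test, pred):.3f}")
# Near-zero RMSE and R2 ~ 1.0: the classic signature of leakage
```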
Model Performance & Results
The graphs below show the actual vs. predicted prices and the feature importance based on the setup above.
Observation: The RMSE is extremely low (~$25) and the R² score is close to 1. While this looks impressive, it raises suspicion of data leakage (the model relying on a single feature that encodes the target). We investigate this further in the Classification section.
3. Classification and Leakage Correction
To build a realistic model, we defined three price classes: Low, Medium, and High. We then removed the service_fee feature and retrained the models.
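A sketch of this corrected pipeline: three tiers defined with pd.qcut, service_fee excluded, and a Logistic Regression retrained on the remaining features. The data is synthetic noise, so accuracy lands near chance:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "price": rng.uniform(50, 500, size=n),
    "minimum_nights": rng.integers(1, 30, size=n),
    "number_of_reviews": rng.integers(0, 300, size=n),
    "availability_365": rng.integers(0, 366, size=n),
})

# Define three equal-frequency price tiers: Low / Medium / High
df["price_tier"] = pd.qcut(df["price"], q=3, labels=["Low", "Medium", "High"])

# Features only -- price and its derivatives (service_fee) are excluded
X = df[["minimum_nights", "number_of_reviews", "availability_365"]]
y = df["price_tier"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(f"Accuracy: {acc:.2f}")  # near 1/3: features carry no price signal
```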
The Reality Check
Without the leaked information, accuracy dropped to ~33% across all models. This indicates that the physical features alone (such as neighbourhood or room type) are not sufficient to separate the price tiers in this dataset.
Model Comparison
We compared Logistic Regression, Decision Tree, and KNN classifiers.
Confusion Matrices
The matrices below illustrate the difficulty the models faced in distinguishing the three classes without the price proxy.
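For illustration, a confusion matrix for near-random three-class predictions (hypothetical labels, not the notebook's actual output) shows the roughly uniform spread described above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted tiers (0=Low, 1=Medium, 2=High)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=300)
y_pred = rng.integers(0, 3, size=300)  # near-random predictions, as observed

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)  # roughly uniform counts: no class is reliably separated
```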
Selected Model: Logistic Regression was chosen as the final model for its stability and its F1-score relative to the other candidates.
4. Clustering (Unsupervised Learning)
We applied K-Means clustering to identify inherent groups in the data based on location and property attributes.
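A minimal K-Means sketch on synthetic coordinates spanning the approximate NYC bounding box (bounds assumed); the notebook's actual feature set may include further property attributes:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic lat/long pairs roughly spanning NYC (coordinate bounds assumed)
rng = np.random.default_rng(1)
coords = np.column_stack([
    rng.uniform(40.50, 40.92, size=500),    # latitude
    rng.uniform(-74.25, -73.68, size=500),  # longitude
])

# Scale features so neither dimension dominates the distance metric
X = StandardScaler().fit_transform(coords)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_
print(np.bincount(labels))  # listings per cluster
```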
Cluster Analysis
The visualizations below show the segmentation of NYC listings.
Conclusion
- Data Integrity: The project highlights the importance of scrutinizing "too good to be true" results. Removing the service_fee leakage was critical for model validity.
- Predictive Power: The remaining features have low predictive power for price classification, suggesting the need for external data (e.g., precise location scoring, image quality analysis, or sentiment analysis of reviews).
- Final Deliverable: A robust pipeline that correctly preprocesses data and a deployed Logistic Regression baseline model.
How to Use the Model
import pickle
import pandas as pd

# Load the trained model
with open("classification_model.pkl", "rb") as f:
    model = pickle.load(f)

# Predict price tiers for new listings; the DataFrame must contain
# the same feature columns used during training
# predictions = model.predict(new_listings_df)
