NYC Real Estate Price Prediction

Classification, Regression, Clustering & Evaluation

Part 1. Dataset Description

Dataset: NYC Rolling Sales Data
Size: 84,550 rows, 22 original features

Features include:

Numeric: SALE PRICE (target), GROSS SQUARE FEET, LAND SQUARE FEET, YEAR BUILT, RESIDENTIAL UNITS, COMMERCIAL UNITS, TOTAL UNITS
Categorical: BOROUGH, NEIGHBORHOOD, BUILDING CLASS CATEGORY, TAX CLASS AT PRESENT
Temporal: SALE DATE

Research Question: Can I predict the sale price of a residential or commercial property in New York City based on its physical attributes (size, age, unit count), location (borough, neighborhood), and market context (building class)?

Part 2. Exploratory Data Analysis

Step 1 - Loading the Data and Initial Cleaning

The first step in any data science project is loading the raw data and performing basic structural cleaning before doing any analysis. Raw datasets almost always contain issues that need to be addressed immediately, before any statistics or visualizations are computed - otherwise the analysis is built on a corrupted foundation.

Cleaning steps applied:

Drop irrelevant columns: Unnamed: 0 is a leftover row index from the CSV export; EASE-MENT is completely empty in this dataset.
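Handle hidden missing values: Several numeric columns store missing entries as the string " - " instead of a proper NaN, which makes pandas read the whole column as text. I convert these entries to NaN and cast the columns to numeric so they can be used in arithmetic and models.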
Parse dates: SALE DATE is stored as a plain string. Converting it to a proper datetime object allows me to extract temporal features (month, year) in later steps.
Treat BOROUGH as categorical: The borough codes 1–5 are labels (1=Manhattan, 2=Bronx, 3=Brooklyn, 4=Queens, 5=Staten Island), not quantities.
Remove exact duplicate rows: Identical duplicate rows add noise and can artificially inflate the importance of certain patterns.
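
The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name basic_clean and the NUMERIC_COLS list are assumptions made for the example, and the tiny demo frame is invented.

```python
import pandas as pd

BOROUGH_NAMES = {1: "Manhattan", 2: "Bronx", 3: "Brooklyn", 4: "Queens", 5: "Staten Island"}
# Assumed list of columns that hide missing values as " - " strings.
NUMERIC_COLS = ["SALE PRICE", "GROSS SQUARE FEET", "LAND SQUARE FEET"]

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    # Drop the leftover index column and the empty EASE-MENT column if present.
    df = df.drop(columns=["Unnamed: 0", "EASE-MENT"], errors="ignore")
    for col in NUMERIC_COLS:
        if col in df.columns:
            # Non-numeric text such as " - " becomes NaN.
            df[col] = pd.to_numeric(df[col], errors="coerce")
    # Parse the sale date string into a datetime column.
    df["SALE DATE"] = pd.to_datetime(df["SALE DATE"], errors="coerce")
    # Borough codes are labels, so store them as a categorical of names.
    df["BOROUGH"] = df["BOROUGH"].map(BOROUGH_NAMES).astype("category")
    # Remove exact duplicate rows.
    return df.drop_duplicates().reset_index(drop=True)

# Invented three-row example; the last two rows are duplicates once the index column is gone.
raw = pd.DataFrame({
    "Unnamed: 0": [4, 5, 6],
    "BOROUGH": [1, 3, 3],
    "EASE-MENT": [" ", " ", " "],
    "SALE PRICE": ["1000000", " -  ", " -  "],
    "GROSS SQUARE FEET": ["1200", "900", "900"],
    "LAND SQUARE FEET": [" -  ", "2000", "2000"],
    "SALE DATE": ["2017-03-01 00:00:00", "2016-11-15 00:00:00", "2016-11-15 00:00:00"],
})
clean = basic_clean(raw)
print(clean.dtypes)
```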

Step 2 - Deep Cleaning and Outlier Handling

After the structural cleaning, I address domain-specific issues in the target variable and key numeric features. Raw sale records include many entries that do not represent real market transactions.

Issues fixed:

Zero or near-zero sale prices (e.g. $0 or $1) typically represent family transfers, estate settlements, or administrative corrections - not real market transactions. I keep only sales above $10,000.
Zero gross square footage is a data entry error and must be removed.
Year built = 0 means unknown year of construction, which would produce nonsensical Age_at_Sale values in feature engineering.
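
One way to express these filters is a single boolean mask. The sketch below assumes the three columns are already numeric and assumes that offending rows are dropped rather than imputed; keep_market_sales is a hypothetical helper name.

```python
import pandas as pd

MIN_PRICE = 10_000  # threshold stated in the text

def keep_market_sales(df: pd.DataFrame) -> pd.DataFrame:
    # Comparisons with NaN evaluate to False, so rows with missing values are dropped too.
    mask = (
        (df["SALE PRICE"] > MIN_PRICE)
        & (df["GROSS SQUARE FEET"] > 0)
        & (df["YEAR BUILT"] > 0)
    )
    return df.loc[mask].copy()

# Invented example: only the last row passes all three checks.
sales = pd.DataFrame({
    "SALE PRICE": [0, 10, 450_000, 800_000, 1_200_000],
    "GROSS SQUARE FEET": [1000, 1500, 0, 2000, 1800],
    "YEAR BUILT": [1950, 1960, 1970, 0, 1925],
})
market = keep_market_sales(sales)
print(market)
```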

Outlier decision: I am keeping high-end outliers (e.g. $50M+ luxury sales). NYC real estate genuinely spans this range, and removing luxury sales would bias the model against premium properties. The log transformation applied in Part 4 will compress extreme values and reduce their influence during training.

Step 3 - Correlation Analysis

A correlation matrix shows how strongly each pair of numeric features moves together.

What I was looking for:

Which features correlate strongly with SALE PRICE? These are the most promising predictors.
Are any features strongly correlated with each other?

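Insight: GROSS SQUARE FEET has the strongest correlation with SALE PRICE. GROSS SQUARE FEET and RESIDENTIAL UNITS are correlated with each other at 0.73: larger buildings tend to have more units. This multicollinearity can destabilize the linear model, because the two features carry overlapping information and their individual coefficients become hard to estimate reliably.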
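
A correlation matrix like this can be computed with DataFrame.corr. The sketch below runs on synthetic data invented for the example (so its numbers are unrelated to the 0.73 reported above), and price_correlations is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def price_correlations(df: pd.DataFrame, target: str = "SALE PRICE"):
    # Pearson correlation between every pair of numeric columns.
    corr = df.select_dtypes("number").corr()
    # Rank the other features by their correlation with the target.
    ranking = corr[target].drop(target).sort_values(ascending=False)
    return corr, ranking

# Synthetic data: price is driven by size, and unit count tracks size.
rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(500, 5000, n)
demo = pd.DataFrame({
    "GROSS SQUARE FEET": sqft,
    "RESIDENTIAL UNITS": sqft / 800 + rng.normal(0, 1, n),
    "YEAR BUILT": rng.integers(1900, 2015, n),
    "SALE PRICE": 600 * sqft + rng.normal(0, 1e5, n),
})
corr, ranking = price_correlations(demo)
print(ranking)
# Feature-to-feature correlation, the quantity that signals multicollinearity.
print(corr.loc["GROSS SQUARE FEET", "RESIDENTIAL UNITS"])
```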

Step 4 - Outlier Detection (IQR Method)

Even though I decided to keep high-value outliers, it is important to quantify them formally. I use the Interquartile Range (IQR) method:

Q1 = 25th percentile, Q3 = 75th percentile
IQR = Q3 − Q1
Outlier boundaries: below Q1 − 1.5×IQR or above Q3 + 1.5×IQR
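
As a small worked example of these boundaries, here is a sketch on an invented list of values; iqr_bounds is a hypothetical helper and k=1.5 is the multiplier from the rule above.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    # Q1 and Q3 are the 25th and 75th percentiles.
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])
lower, upper = iqr_bounds(values)
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)  # Q1 = 3, Q3 = 7, IQR = 4, so the bounds are -3 and 13
```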

I also visualize the sale price distribution before and after log transformation to confirm why this transformation is necessary:

Insight: The raw distribution is so heavily skewed that almost all properties are compressed into a thin bar near zero on the x-axis. After log transformation the distribution becomes approximately bell-shaped. This confirms that the log transformation is not optional - a model trained on the raw price scale would be dominated by a handful of extreme sales.
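
The effect of the transformation can also be checked numerically with the sample skewness. The sketch below uses synthetic log-normal prices invented for the example, and it assumes np.log1p as the transform; the notebook may use a plain logarithm instead.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices", not the real sales data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.3, sigma=1.0, size=5000))

raw_skew = prices.skew()            # strongly positive for right-skewed data
log_skew = np.log1p(prices).skew()  # close to zero once the tail is compressed
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```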

Research Question 1: Is There Seasonality in NYC Real Estate Sales?

Real estate markets are often seasonal, with activity commonly expected to peak in spring. I test whether this pattern holds in the NYC data by examining the total volume of sales per month and the median sale price per month.

Insight: Sales volume does NOT show a clear spring peak. Volume is relatively uniform throughout the year - August is the lowest month, while June, September, and December are the highest. Median price tells a different story: lowest in March–April, rising through summer, and peaking sharply in August, then dropping.
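
Both monthly series can come from one groupby. A minimal sketch, assuming SALE DATE is already a datetime column; monthly_summary is a hypothetical helper and the three sales are invented.

```python
import pandas as pd

def monthly_summary(df: pd.DataFrame) -> pd.DataFrame:
    # Group by calendar month (1-12), pooling all years together.
    month = df["SALE DATE"].dt.month.rename("month")
    return df.groupby(month)["SALE PRICE"].agg(volume="count", median_price="median")

demo = pd.DataFrame({
    "SALE DATE": pd.to_datetime(["2017-01-05", "2017-01-20", "2017-08-11"]),
    "SALE PRICE": [100, 300, 900],
})
summary = monthly_summary(demo)
print(summary)
```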

Research Question 2: The Marginal Value of Space

Does a larger property always mean a higher sale price, and is this consistent across boroughs?

Insight: The boroughs overlap heavily throughout the price range. Manhattan tends to appear at higher prices and Staten Island at lower prices, but Brooklyn and Queens are mixed throughout the entire chart. There are no cleanly distinct slopes per borough - the relationship between size and price is noisy and inconsistent, which is exactly why the raw borough code is a weak feature for a linear model and why clustering is needed.

Research Question 3: Which Neighborhoods Command the Highest Price Per Square Foot?

Comparing raw median sale prices across neighborhoods is misleading. To fairly measure desirability, I calculate price per square foot, which normalizes for size and isolates the pure location premium.

Insight: 8 of the top 10 neighborhoods are in Manhattan. However, Brooklyn Heights and Downtown-Fulton Mall also appear in the top 10, showing that premium Brooklyn locations rival Manhattan in price per square foot. Location is overwhelmingly the dominant driver of value in NYC real estate.
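
A ranking of this kind can be sketched as follows. The min_sales filter is an assumption added for the example (to keep single-sale neighborhoods from dominating), and the data is invented; the notebook may rank differently.

```python
import pandas as pd

def top_neighborhoods_by_ppsf(df: pd.DataFrame, n: int = 10, min_sales: int = 1) -> pd.DataFrame:
    # Price per square foot for each sale; assumes GROSS SQUARE FEET > 0.
    ppsf = df["SALE PRICE"] / df["GROSS SQUARE FEET"]
    stats = ppsf.groupby(df["NEIGHBORHOOD"]).agg(median_ppsf="median", sales="count")
    stats = stats[stats["sales"] >= min_sales]
    return stats.sort_values("median_ppsf", ascending=False).head(n)

demo = pd.DataFrame({
    "NEIGHBORHOOD": ["A", "A", "B", "B", "C"],
    "SALE PRICE": [1_000_000, 3_000_000, 500_000, 700_000, 9_000_000],
    "GROSS SQUARE FEET": [1000, 2000, 1000, 1000, 1000],
})
top = top_neighborhoods_by_ppsf(demo, n=10, min_sales=2)
print(top)  # "C" is excluded because it has only one sale
```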

Research Question 4: Residential vs. Commercial - Does Mixed-Use Command a Premium?

Properties with commercial units generate rental income in addition to residential value. I test whether the presence of commercial units meaningfully separates properties into different price tiers.

Insight: Properties with commercial units have a noticeably higher median sale price and a much wider distribution. COMMERCIAL UNITS carries meaningful signal and should be included in the feature set.

Research Question 5: The Age Factor - Old Charm vs. New Build

How does building age affect value in NYC? I group buildings by decade built and plot median sale price over time.

Insight: There is NO U-shaped pattern in this dataset. The oldest surviving buildings show the highest median price (~$5.5M), but this is a survivorship bias effect - the very few buildings from that era that still exist tend to be landmark properties in prime Manhattan locations. From 1900 onward prices are relatively flat at $600K–$800K with only a modest uptick for recent construction. There is no clear pre-war premium or new-construction premium visible at the dataset level.

Research Question 6: What is the Interaction Between Age and Location?

The value of a building's age depends entirely on where it is located. This heatmap cross-references decade built with borough to reveal this critical interaction effect.

Insight: Manhattan dominates all other boroughs across every decade. The key finding: the value of a building's age is almost entirely a Manhattan phenomenon. The same decade of construction means something completely different depending on location.
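
The table behind such a heatmap can be built with pivot_table. This is a sketch on invented rows; decade_borough_medians and the DECADE column are hypothetical names, and the resulting table could be handed to any heatmap plotting function.

```python
import pandas as pd

def decade_borough_medians(df: pd.DataFrame) -> pd.DataFrame:
    # Floor each construction year to its decade, e.g. 1928 -> 1920.
    decade = df["YEAR BUILT"] // 10 * 10
    return df.assign(DECADE=decade).pivot_table(
        index="DECADE", columns="BOROUGH", values="SALE PRICE", aggfunc="median"
    )

demo = pd.DataFrame({
    "BOROUGH": ["Manhattan", "Manhattan", "Queens", "Queens", "Queens"],
    "YEAR BUILT": [1925, 1928, 1925, 1963, 1967],
    "SALE PRICE": [2_000_000, 4_000_000, 600_000, 500_000, 700_000],
})
table = decade_borough_medians(demo)
print(table)  # cells with no sales are NaN
```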

Part 3. Baseline Model

Regression Goal: Predict the continuous sale price of NYC properties using only raw, untransformed features - before any engineering or advanced modeling.

Features: GROSS SQUARE FEET, RESIDENTIAL UNITS, COMMERCIAL UNITS, YEAR BUILT, BOROUGH (numeric code 1–5)

Result: R²≈0.06 - the model explains only 6% of price variance. The chart below reveals why: the raw price distribution is so skewed that the model produces negative predicted prices for some properties (physically impossible) and wildly overestimates others. This confirms that training on untransformed dollar values doesn't work - the log transformation in Part 4 is essential.

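Key finding: BOROUGH has the largest coefficient - each one-unit increase in the borough code (1 = Manhattan) lowers the predicted price by over $1M. Because the baseline treats the code as a number, it assumes the same price drop between every pair of consecutive codes. GROSS SQUARE FEET looks near-zero only because the borough effect dwarfs it on this scale - it still contributes ~$150–200 per sqft.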
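
A baseline of this shape can be sketched with scikit-learn's LinearRegression. The 80/20 split, the random seed, and the synthetic data are all assumptions for the example, so the printed R² and coefficients have no relation to the figures reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

FEATURES = ["GROSS SQUARE FEET", "RESIDENTIAL UNITS", "COMMERCIAL UNITS", "YEAR BUILT", "BOROUGH"]

def fit_baseline(df: pd.DataFrame, seed: int = 42):
    X, y = df[FEATURES], df["SALE PRICE"]  # raw dollars, no transformation
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    coefs = pd.Series(model.coef_, index=FEATURES)  # dollars per unit of each feature
    return model, r2_score(y_te, pred), coefs, pred

# Synthetic, heavily right-skewed data; BOROUGH is kept as the numeric code 1-5.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "GROSS SQUARE FEET": rng.lognormal(7.5, 0.6, n),
    "RESIDENTIAL UNITS": rng.integers(0, 10, n),
    "COMMERCIAL UNITS": rng.integers(0, 3, n),
    "YEAR BUILT": rng.integers(1880, 2017, n),
    "BOROUGH": rng.integers(1, 6, n),
    "SALE PRICE": rng.lognormal(13.3, 1.2, n),
})
model, r2, coefs, pred = fit_baseline(demo)
print(f"R2 = {r2:.3f}, negative predictions: {(pred < 0).sum()}")
print(coefs)
```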

Part 4. Feature Engineering & Clustering

Five new features were engineered to replace or augment the raw inputs:

LOG_PRICE - Removes right-skew from the target.
LOG_GROSS_FT - Makes the size–price relationship more linear.
Age_at_Sale - Sale year minus year built.
Market_Tier - Compresses 200+ neighborhoods into economically meaningful groups.
Centroid_Distance - Captures how "atypical" a neighborhood is within its tier.
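
The first three features are simple column transforms and can be sketched as below. The use of np.log1p is an assumption (a plain log would work equally well for strictly positive values), and add_basic_features is a hypothetical helper; the two clustering-based features are not covered by this snippet.

```python
import numpy as np
import pandas as pd

def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["LOG_PRICE"] = np.log1p(out["SALE PRICE"])
    out["LOG_GROSS_FT"] = np.log1p(out["GROSS SQUARE FEET"])
    # Sale year minus year built; assumes SALE DATE is already a datetime column.
    out["Age_at_Sale"] = out["SALE DATE"].dt.year - out["YEAR BUILT"]
    return out

demo = pd.DataFrame({
    "SALE PRICE": [750_000.0],
    "GROSS SQUARE FEET": [1800.0],
    "YEAR BUILT": [1930],
    "SALE DATE": pd.to_datetime(["2017-06-15"]),
})
features = add_basic_features(demo)
print(features[["LOG_PRICE", "LOG_GROSS_FT", "Age_at_Sale"]])
# np.expm1 inverts np.log1p, which is how predictions return to the dollar scale.
```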

Most NYC neighborhoods are small and similarly priced - a tiny number of extreme outliers are so unusual they got their own cluster.

Each row is a cluster with its median price and size - Cluster 3 is the typical NYC market (212 neighborhoods, ~$617K), while Clusters 0 and 1 are extreme outliers with median prices over $65M.

Think of it as a map where left = cheap and right = expensive - almost every neighborhood is packed on the left, and a few extreme ones sit alone on the right.

Key finding: PC1 alone explains 91.8% of variance - the entire neighborhood market structure compresses onto a single economic axis from affordable to expensive.
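
A hedged sketch of how Market_Tier and Centroid_Distance could be derived: aggregate sales per neighborhood, standardize, run KMeans, and measure each neighborhood's distance to its own cluster centroid. The choice of four clusters, the two aggregate features, the standardization step, and the helper name cluster_neighborhoods are all assumptions for illustration; the notebook's actual settings are not stated here. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def cluster_neighborhoods(df: pd.DataFrame, n_clusters: int = 4, seed: int = 0):
    # One row per neighborhood: median price and median size (assumed features).
    hood = df.groupby("NEIGHBORHOOD").agg(
        median_price=("SALE PRICE", "median"),
        median_sqft=("GROSS SQUARE FEET", "median"),
    )
    X = StandardScaler().fit_transform(hood)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    hood["Market_Tier"] = km.labels_
    # Euclidean distance from each neighborhood to its own cluster centroid.
    hood["Centroid_Distance"] = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    # Share of variance captured by each principal component.
    evr = PCA(n_components=2).fit(X).explained_variance_ratio_
    return hood, evr

# Synthetic sales: 12 neighborhoods with 5 sales each.
rng = np.random.default_rng(0)
rows = []
for i in range(12):
    for _ in range(5):
        rows.append({
            "NEIGHBORHOOD": f"hood_{i}",
            "SALE PRICE": 400_000 * (1 + i) * rng.uniform(0.8, 1.2),
            "GROSS SQUARE FEET": rng.uniform(800, 3000),
        })
demo = pd.DataFrame(rows)
hood, evr = cluster_neighborhoods(demo)
print(hood.head())
print("explained variance ratio:", evr)
```

The per-neighborhood columns can then be attached to individual sales with something like demo.join(hood[["Market_Tier", "Centroid_Distance"]], on="NEIGHBORHOOD").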

Part 5. Improved Models

Three models were trained on the engineered feature set and evaluated on the original dollar scale for a fair comparison against the baseline.

Biggest surprise: Centroid_Distance is the second most important feature with ~0.30 importance - almost as influential as property size. How atypical a neighborhood is within its market cluster turns out to be a stronger signal than borough, age, or unit count combined.

Declared Winner: Random Forest Regressor - highest R², lowest MAE. The improvement comes from four things: log-transforming the target (makes it learnable), Centroid_Distance (captures neighborhood atypicality), LOG_GROSS_FT (linearizes size–price), and ensemble learning (captures non-linear interactions). Honest caveat: the model is significantly better than the baseline but real estate prediction remains hard - renovation quality, floor level, and exact street-level factors are not in the dataset.
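
Training on the log target while scoring in dollars can be sketched as follows. The hyperparameters, the log1p/expm1 pair, the helper name fit_log_target_forest, and the synthetic two-feature data are assumptions for the example, not the notebook's configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

def fit_log_target_forest(X_train, y_train, X_test, y_test, seed: int = 0):
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    # Fit on the log of the dollar price.
    model.fit(X_train, np.log1p(y_train))
    # Invert the transform so the metrics are computed on the dollar scale.
    pred = np.expm1(model.predict(X_test))
    scores = {"r2": r2_score(y_test, pred), "mae": mean_absolute_error(y_test, pred)}
    return model, pred, scores

# Synthetic data in which price grows with size.
rng = np.random.default_rng(0)
n = 400
log_sqft = rng.normal(7.5, 0.5, n)
X = pd.DataFrame({"LOG_GROSS_FT": log_sqft, "Age_at_Sale": rng.integers(0, 120, n)})
y = np.exp(6.0 + log_sqft + rng.normal(0, 0.3, n))

model, pred, scores = fit_log_target_forest(X[:300], y[:300], X[300:], y[300:])
print(scores)
print(dict(zip(X.columns, model.feature_importances_)))
```

Because the forest averages log prices of positive training targets, the back-transformed predictions cannot be negative, unlike those of the raw-scale linear baseline.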

Part 7. Regression to Classification

The same problem is reframed as classification by converting SALE PRICE into three tiers using quantile binning:

Low - bottom 33% of sale prices
Medium - middle 33%
High - top 33%

Quantile binning was chosen over fixed dollar thresholds because it guarantees balanced classes (~33% each), making standard accuracy a valid metric. A fixed threshold on NYC's right-skewed prices would create a heavily imbalanced dataset.
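
Quantile binning of this kind maps directly onto pandas.qcut. A minimal sketch on nine invented prices; price_tiers is a hypothetical helper.

```python
import pandas as pd

def price_tiers(prices: pd.Series) -> pd.Series:
    # Three equal-frequency bins, so each tier holds about a third of the sales.
    return pd.qcut(prices, q=3, labels=["Low", "Medium", "High"])

prices = pd.Series(range(1, 10)) * 100_000
tiers = price_tiers(prices)
print(tiers.value_counts())
```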

Part 8. Classification Models

Precision vs. Recall: RECALL is more important. Missing a truly High-value property (False Negative) means underpricing an asset or missing a premium opportunity - a direct financial loss. A False Positive leads to over-scrutiny but the true value is discovered before transacting.

False Positive vs. False Negative: FALSE NEGATIVES are more critical for the same reason - predicting High as Low/Medium causes the seller to lose money, while predicting Low/Medium as High leads to investigation that catches the mistake before harm occurs.

Three classifiers were trained on the same engineered feature set as Part 5.

Logistic Regression: Accuracy 59.4% · False Negatives on High = 120 · Medium accuracy only 39%. Weakest model by a wide margin.
Random Forest: Accuracy 69.0% · False Negatives on High = 47 · Medium accuracy 57%. Large improvement over Logistic Regression.
Gradient Boosting: Accuracy 69.5% · False Negatives on High = 42 · Wins on every metric that matters.

All three models struggle most with the Medium class - boundary properties share features with both neighboring tiers and cannot be cleanly separated. Gradient Boosting wins on overall accuracy and makes the fewest costly errors (False Negatives on the High class).

Full per-class precision, recall, and F1-scores for all three models are documented in the notebook output.
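
The "False Negatives on High" figure can be read off a confusion matrix: it is the number of truly High properties predicted as anything else. A sketch with invented labels follows; high_tier_errors is a hypothetical helper, and it assumes at least one truly High sample is present.

```python
from sklearn.metrics import confusion_matrix

LABELS = ["Low", "Medium", "High"]

def high_tier_errors(y_true, y_pred):
    # Rows are true classes, columns are predicted classes.
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    hi = LABELS.index("High")
    true_pos = cm[hi, hi]
    false_neg = cm[hi].sum() - true_pos     # High predicted as Low or Medium
    false_pos = cm[:, hi].sum() - true_pos  # Low or Medium predicted as High
    recall = true_pos / cm[hi].sum()
    return int(false_neg), int(false_pos), float(recall)

y_true = ["High", "High", "High", "High", "Medium", "Low"]
y_pred = ["High", "High", "Medium", "Low", "High", "Low"]
fn, fp, recall = high_tier_errors(y_true, y_pred)
print(fn, fp, recall)  # 2 false negatives, 1 false positive, recall 0.5
```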

Declared Winner: Gradient Boosting Classifier - exported to nyc_real_estate_classifier_model.pkl.

Bonus: Interactive Property Price Estimator

An interactive estimator was built and deployed as a HuggingFace Space - enter borough, square footage, year built, and unit counts to get an instant price prediction and market tier with confidence probabilities.

Try the Interactive Property Price Estimator →

Conclusion

The most surprising finding was that Centroid_Distance - how atypical a neighborhood is within its market cluster - turned out to be the second most important feature, stronger than borough, age, and unit count combined. That was not expected going in.

The hardest part was the raw price distribution. The data was so skewed that the baseline model produced negative predicted prices - something physically impossible. The log transformation fixed that, but it took real trial and error to understand why the model was failing so badly before the transformation was applied.

The honest takeaway: even the best model here is still off by hundreds of thousands of dollars on individual properties. The data simply does not contain the features that drive individual sale prices - floor level, renovation quality, exact view. What it does capture is the market structure, and that alone is already useful.
