metadata
title: House Price Prediction
emoji: 🏠
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app.py
pinned: false

🏠 House Price Prediction

An end-to-end machine learning project that predicts house prices in Bengaluru from square footage, number of bedrooms (BHK), bathrooms, and a locality-level average price.

Built using:

  • Python
  • Pandas
  • Scikit-learn
  • XGBoost
  • Flask
  • Streamlit

📌 Project Overview

This project uses the Bengaluru House Price dataset to build a real estate price prediction system.

The workflow includes:

  • Data cleaning
  • Feature engineering
  • Outlier removal
  • Log transformation
  • Model training
  • Hyperparameter tuning
  • Feature importance analysis
  • Flask API development
  • Streamlit frontend deployment

🚀 Features

✅ Cleaned messy real-estate data
✅ Converted sqft ranges into numeric values
✅ Engineered a locality-level average price feature
✅ Removed outliers using IQR method
✅ Applied log transformation to target variable
✅ Compared multiple ML models
✅ Tuned XGBoost hyperparameters
✅ Built Flask prediction API
✅ Created interactive frontend UI
✅ Ready for deployment on Hugging Face / Render


📂 Dataset

Dataset used:

  • Bengaluru House Price Dataset

Main features:

  • location
  • size (e.g. "2 BHK"; parsed into bhk during preprocessing)
  • total_sqft
  • bath
  • balcony
  • price (target, in lakh INR)

🧹 Data Preprocessing

1. Handling Missing Values

Dropped rows that contain null values:

data = data.dropna()

2. Converted size Column to BHK

Example:

2 BHK → 2

Code:

data['bhk'] = data['size'].apply(
    lambda x: int(str(x).split()[0])
)
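The one-liner above assumes every size value starts with an integer. A slightly more defensive variant is sketched below; the helper name parse_bhk and the sample labels are illustrative assumptions, not taken from the project code.

```python
import re

def parse_bhk(size):
    """Return the leading integer of a size label such as '2 BHK', or None."""
    match = re.match(r"\s*(\d+)", str(size))
    return int(match.group(1)) if match else None

# Hypothetical sample labels, including two that carry no bedroom count
samples = ["2 BHK", "4 Bedroom", "1 RK", "studio", None]
parsed = [parse_bhk(s) for s in samples]
print(parsed)
```

Returning None instead of raising lets unparseable rows be dropped later with dropna().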

3. Cleaned total_sqft

Handled:

  • ranges (replaced by their midpoint)
  • inconsistent units (not parseable as a number, so set to None)
  • invalid values (set to None)

Examples:

2100 - 2850 → 2475

Code:

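def convert_sqft(x):
    """Convert a total_sqft entry to a float; return None if it cannot be parsed."""
    x = str(x)
    try:
        # A range such as "2100 - 2850" becomes its midpoint
        if '-' in x:
            a, b = x.split('-')
            return (float(a) + float(b)) / 2
        return float(x)
    except ValueError:
        # Unit suffixes, malformed ranges, and other non-numeric text
        return None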

Applied:

data['total_sqft'] = data['total_sqft'].apply(convert_sqft)
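Entries that carry a unit end up as None with the function above. If those rows are worth keeping, one possible extension is to convert them to square feet. The sketch below assumes unit labels such as "Sq. Meter" and "Sq. Yards"; both the labels and the helper name parse_sqft are assumptions for illustration.

```python
import re

# Conversion factors to square feet; the unit labels are assumed, not confirmed
SQFT_PER_UNIT = {
    "sq. meter": 10.7639,  # 1 square metre is about 10.7639 square feet
    "sq. yards": 9.0,      # 1 square yard is exactly 9 square feet
}

def parse_sqft(value):
    """Parse a total_sqft entry into square feet, or return None."""
    text = str(value).strip()
    parts = text.split("-")
    try:
        # Range such as "2100 - 2850": use the midpoint
        if len(parts) == 2:
            low, high = (float(p) for p in parts)
            return (low + high) / 2
        # Number followed by a unit label, e.g. "34.46Sq. Meter"
        match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([A-Za-z][A-Za-z. ]*)", text)
        if match:
            factor = SQFT_PER_UNIT.get(match.group(2).strip().lower())
            return float(match.group(1)) * factor if factor else None
        return float(text)
    except ValueError:
        return None

for raw in ["2100 - 2850", "1200", "34.46Sq. Meter", "abc"]:
    print(raw, "->", parse_sqft(raw))
```

Unknown unit labels still map to None, so they can be dropped the same way as other invalid values.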

⚙️ Feature Engineering

1. Price Per Sqft

Created normalized pricing feature:

data['price_per_sqft'] = (
    data['price'] * 100000
) / data['total_sqft']

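Used for:

  • outlier detection (the IQR filter below runs on this column)
  • comparing prices across homes of different sizes
  • cleaner training data once outliers are removed

Because it is derived from the target, it is not included in the model inputs.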

2. Locality Average Price Feature

Calculated average locality price using:

location_price = data.groupby(
    'location'
)['price'].mean()

Mapped back to dataset:

data['location_avg_price'] = data[
    'location'
].map(location_price)

This feature helps the model learn:

  • expensive locations
  • cheaper localities
  • pricing trends by area
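Because this feature is built from the target, computing the means over the whole dataset lets test-set prices influence the inputs. A common alternative is to fit the mapping on the training rows only and fall back to the overall training mean for unseen locations. The sketch below uses toy data and a hypothetical helper add_location_avg_price; it is not the project's actual pipeline.

```python
import pandas as pd

# Toy data standing in for the real train/test split
train = pd.DataFrame({
    "location": ["A", "A", "B"],
    "price": [50.0, 70.0, 100.0],
})
test = pd.DataFrame({"location": ["A", "C"]})

# Fit the encoding on training rows only
location_price = train.groupby("location")["price"].mean()
global_mean = train["price"].mean()

def add_location_avg_price(frame):
    """Map each location to its training mean; unseen locations get the global mean."""
    out = frame.copy()
    out["location_avg_price"] = (
        out["location"].map(location_price).fillna(global_mean)
    )
    return out

train_enc = add_location_avg_price(train)
test_enc = add_location_avg_price(test)
print(test_enc)
```

Location "C" never appears in the training rows, so it receives the global training mean rather than NaN.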

📊 Outlier Removal

Used the IQR (interquartile range) method on price_per_sqft.

Formula:

IQR = Q3 - Q1

Accepted range (values outside it are treated as outliers):

[Q1 - 1.5 × IQR, Q3 + 1.5 × IQR]

Code:

Q1 = data['price_per_sqft'].quantile(0.25)

Q3 = data['price_per_sqft'].quantile(0.75)

IQR = Q3 - Q1

lower_limit = Q1 - 1.5 * IQR

upper_limit = Q3 + 1.5 * IQR

data = data[
    (data['price_per_sqft'] >= lower_limit) &
    (data['price_per_sqft'] <= upper_limit)
]
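The same filter can be wrapped in a small reusable function. This is a sketch on made-up price-per-sqft values; the helper name iqr_bounds is an assumption.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits of the accepted range for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative values with one obvious outlier
pps = pd.Series([4000, 5000, 5500, 6000, 7000, 50000], dtype=float)
lower, upper = iqr_bounds(pps)
kept = pps[pps.between(lower, upper)]  # between() is inclusive on both ends
print(lower, upper, len(kept))
```

Rows where the value is NaN fail the between() check, so they are dropped by the same filter.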

📈 Log Transformation

Applied a logarithmic transformation to the target variable:

import numpy as np

data['log_price'] = np.log(data['price'])

Benefits:

  • reduced skewness
  • stabilized variance
  • improved regression performance
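A model trained on log_price predicts on the log scale, so predictions have to be mapped back with np.exp before they are reported as prices. A minimal round trip on illustrative values:

```python
import numpy as np

prices = np.array([45.0, 78.5, 120.0])  # illustrative prices
log_prices = np.log(prices)             # what the model is trained on
recovered = np.exp(log_prices)          # inverse transform for reporting
print(recovered)
```

np.log is undefined at zero, so this assumes every price is strictly positive.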

🤖 Machine Learning Models

Compared:

  • Linear Regression
  • Ridge Regression
  • XGBoost Regressor

📌 Feature & Target Selection

X = data[
    [
        'total_sqft',
        'bath',
        'bhk',
        'location_avg_price'
    ]
]

y = data['log_price']

✂️ Train Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

📊 Cross Validation Evaluation

Used:

cross_val_score()

Scoring Metric:

  • R² Score
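A sketch of how such a comparison can be run. The data here is synthetic, and the fold count (cv=5) and Ridge alpha are assumptions, since the README does not state them. XGBRegressor follows the scikit-learn estimator interface, so it can be added to the same dictionary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the four model inputs and the log-price target
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 4))
y = X @ np.array([2.0, 0.5, 0.3, 1.5]) + rng.normal(0, 0.1, size=200)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
}

# Mean R² over 5 folds for each model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```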

📈 Model Results

Model               R² Score
Linear Regression   0.559
Ridge Regression    0.559
XGBoost             0.827

🏆 Best Model

XGBoost Regressor

Reason:

  • captures non-linear relationships
  • handles feature interactions
  • performs well on tabular datasets

🔧 Hyperparameter Tuning

Used:

GridSearchCV

Parameter Grid:

params = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2]
}

Best Parameters:

{
 'learning_rate': 0.1,
 'max_depth': 7,
 'n_estimators': 100
}

Best Tuned Score:

0.823
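A sketch of the search using the same parameter grid. To stay runnable without xgboost installed, it swaps in scikit-learn's GradientBoostingRegressor, which accepts the same three parameter names; the synthetic data and cv=3 are also assumptions. Replacing the estimator with XGBRegressor should need no other changes.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear data standing in for the real features
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.05, size=120)

params = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
}

# Stand-in estimator; every grid combination is scored by cross-validated R²
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    params,
    cv=3,
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

best_score_ is the mean cross-validated score of the best combination, so it is comparable to cross_val_score results only when the folds match.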

📌 Feature Importance

Visualized feature importance using XGBoost.

Top contributing features:

  • Location Average Price
  • Total Square Feet
  • BHK
  • Bathrooms

Code:

import matplotlib.pyplot as plt

# `xgb` is the fitted XGBoost regressor here, not the xgboost module
importance = xgb.feature_importances_
features = X.columns

plt.figure(figsize=(8, 5))
plt.bar(features, importance)
plt.xlabel("Features")
plt.ylabel("Importance")
plt.title("Feature Importance")
plt.tight_layout()
plt.show()

🌐 Flask API

Created a Flask API for predictions.

POST Endpoint

/predict

Example Request

{
  "location": 85,
  "BHK": 2,
  "area": 1200,
  "bath": 2
}

Example Response

{
  "predicted_price": 78.5
}
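The request keys differ from the training column names, and the model is trained on log_price, so a handler has to reorder the fields and undo the log before responding. The sketch below shows that step in isolation. The key-to-column mapping (area to total_sqft, BHK to bhk, location to location_avg_price) is inferred from the example request rather than from app.py, and both helper names are hypothetical.

```python
import math

# Request keys in the assumed training column order:
# total_sqft, bath, bhk, location_avg_price
FEATURE_KEYS = ["area", "bath", "BHK", "location"]

def payload_to_features(payload):
    """Validate the JSON body and return one feature row in training order."""
    missing = [key for key in FEATURE_KEYS if key not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return [float(payload[key]) for key in FEATURE_KEYS]

def log_to_price(log_price):
    """Undo the log transformation applied to the target."""
    return round(math.exp(log_price), 2)

row = payload_to_features({"location": 85, "BHK": 2, "area": 1200, "bath": 2})
print(row)
```

Inside a Flask view, the row would be passed to the loaded model's predict() and the result through log_to_price before building the JSON response.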

🖥️ Streamlit Frontend

Built an interactive UI using Streamlit.

Features:

  • Area input
  • Bathroom input
  • BHK input
  • Location pricing input
  • Instant prediction display

📦 Installation

Clone repository:

git clone https://github.com/your-username/house-price-predictor.git

Move into project directory:

cd house-price-predictor

Install dependencies:

pip install -r requirements.txt

▶️ Run Flask App

python app.py

Open:

http://127.0.0.1:5000

▶️ Run Streamlit App

streamlit run streamlit_app.py

📁 Project Structure

house-price-predictor/
│
├── app.py
├── streamlit_app.py
├── house_price_model.pkl
├── requirements.txt
├── runtime.txt
└── README.md

🛠️ Tech Stack

Tool           Purpose
Python         Programming
Pandas         Data preprocessing
NumPy          Numerical operations
Matplotlib     Visualization
Scikit-learn   ML utilities
XGBoost        Regression model
Flask          API backend
Streamlit      Frontend UI

📚 Key Learnings

  • Real-world data preprocessing
  • Feature engineering
  • Outlier handling using IQR
  • Log transformation
  • Model comparison using cross-validation
  • Hyperparameter tuning
  • Flask API creation
  • Streamlit UI development
  • ML deployment workflow

🔮 Future Improvements

  • Use actual location names
  • Add location dropdown
  • Add map-based visualization
  • Improve frontend UI
  • Add cloud deployment pipeline
  • Add model monitoring

👨‍💻 Author

Mohd Faizanullah

Aspiring ML Engineer focused on:

  • Machine Learning
  • Deep Learning
  • AI Applications
  • Full ML Deployment Pipelines