title: House Price Prediction
emoji: 🏠
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app.py
pinned: false
House Price Prediction
An end-to-end machine learning project that predicts house prices in Bengaluru from features such as square footage, BHK (number of bedrooms), bathrooms, and locality-based pricing.
Built using:
- Python
- Pandas
- Scikit-learn
- XGBoost
- Flask
- Streamlit
Project Overview
This project uses the Bengaluru House Price dataset to build a real estate price prediction system.
The workflow includes:
- Data cleaning
- Feature engineering
- Outlier removal
- Log transformation
- Model training
- Hyperparameter tuning
- Feature importance analysis
- Flask API development
- Streamlit frontend deployment
Features
- Cleaned messy real-estate data
- Converted sqft ranges into numeric values
- Engineered geospatial locality pricing feature
- Removed outliers using IQR method
- Applied log transformation to target variable
- Compared multiple ML models
- Tuned XGBoost hyperparameters
- Built Flask prediction API
- Created interactive frontend UI
- Ready for deployment on Hugging Face / Render
Dataset
Dataset used:
- Bengaluru House Price Dataset
Main features:
- location
- total_sqft
- bath
- balcony
- BHK
- price
Data Preprocessing
1. Handling Missing Values
Removed null values and inconsistent rows.
data = data.dropna()
2. Converted size Column to BHK
Example:
2 BHK → 2
Code:
data['bhk'] = data['size'].apply(
    lambda x: int(str(x).split()[0])
)
3. Cleaned total_sqft
Handled:
- ranges
- inconsistent units
- invalid values
Examples:
2100 - 2850 → 2475
Code:
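def convert_sqft(x):
    """Convert a total_sqft entry to a float; a range becomes its midpoint."""
    x = str(x)
    try:
        if '-' in x:
            a, b = x.split('-')
            return (float(a) + float(b)) / 2
        return float(x)
    except ValueError:
        # Entries that cannot be parsed (e.g. other units) become missing values
        return None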
Applied:
data['total_sqft'] = data['total_sqft'].apply(convert_sqft)
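Because the conversion returns `None` for entries it cannot parse, those rows need to be dropped (or otherwise handled) before `total_sqft` is used in later calculations. The snippet below is a minimal sketch of that follow-up step on a made-up three-row frame; the sample values are illustrative and not taken from the dataset.

```python
import pandas as pd


def convert_sqft(x):
    """Return a float area; a range such as '2100 - 2850' becomes its midpoint."""
    text = str(x)
    try:
        if '-' in text:
            low, high = text.split('-')
            return (float(low) + float(high)) / 2
        return float(text)
    except ValueError:
        return None  # unparseable entry


# Hypothetical sample rows: a plain number, a range, and a value in another unit
sample = pd.DataFrame({'total_sqft': ['1200', '2100 - 2850', '34.46Sq. Meter']})
sample['total_sqft'] = sample['total_sqft'].apply(convert_sqft)

# Drop the rows whose area could not be converted
clean = sample.dropna(subset=['total_sqft'])
print(clean)
```

Only the first two rows survive; the third has no numeric area and is removed.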
Feature Engineering
1. Price Per Sqft
Created normalized pricing feature:
# price is recorded in lakhs (1 lakh = 100,000 rupees)
data['price_per_sqft'] = (
    data['price'] * 100000
) / data['total_sqft']
Used for:
- outlier detection
- normalization
- better model learning
2. Geospatial Locality Feature
Calculated average locality price using:
location_price = data.groupby(
    'location'
)['price'].mean()
Mapped back to dataset:
data['location_avg_price'] = data[
    'location'
].map(location_price)
This feature helps the model learn:
- expensive locations
- cheaper localities
- pricing trends by area
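One caveat with this feature: averaging `price` over every row means held-out rows contribute to their own `location_avg_price`, which can make validation scores look better than they would on unseen data. The sketch below shows one possible alternative, where the averages are computed from the training rows only and unseen locations fall back to the overall training mean. The toy data, the positional split, and the fallback rule are assumptions for illustration, not the project's actual code.

```python
import pandas as pd

# Made-up rows; locations and prices (in lakhs) are illustrative only
df = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'price': [50.0, 60.0, 70.0, 100.0, 120.0, 110.0, 30.0, 40.0],
})

# Simple positional split for the example: first six rows train, last two test
train = df.iloc[:6].copy()
test = df.iloc[6:].copy()

# Averages are learned from training rows only
location_price = train.groupby('location')['price'].mean()
global_mean = train['price'].mean()

train['location_avg_price'] = train['location'].map(location_price)
# Locations never seen in training (here 'C') fall back to the overall mean
test['location_avg_price'] = (
    test['location'].map(location_price).fillna(global_mean)
)
print(test)
```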
Outlier Removal
Used the interquartile range (IQR) method on price per square foot.
Formula:
IQR = Q3 - Q1
Outlier Range:
[Q1 - 1.5(IQR), Q3 + 1.5(IQR)]
Code:
Q1 = data['price_per_sqft'].quantile(0.25)
Q3 = data['price_per_sqft'].quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
data = data[
    (data['price_per_sqft'] >= lower_limit) &
    (data['price_per_sqft'] <= upper_limit)
]
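To see the filter in action, here is a small self-contained sketch that wraps the same steps in a helper and applies it to six made-up values. The function name `remove_iqr_outliers` and the toy numbers are hypothetical.

```python
import pandas as pd


def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]


# Five plausible values and one extreme value (all made up)
toy = pd.DataFrame(
    {'price_per_sqft': [4000, 4500, 5000, 5500, 6000, 50000]}
)
filtered = remove_iqr_outliers(toy, 'price_per_sqft')
print(filtered)
```

For this toy column Q1 = 4625 and Q3 = 5875, so the accepted range is [2750, 7750] and only the 50000 row is dropped.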
Log Transformation
Applied a logarithmic transformation to the target variable:
import numpy as np
data['log_price'] = np.log(data['price'])
Benefits:
- reduced skewness
- stabilized variance
- improved regression performance
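A model trained on `log_price` predicts log-prices, so its output has to be mapped back with the exponential before it can be read in the original price units. A minimal round-trip sketch, using made-up prices:

```python
import numpy as np

prices = np.array([45.0, 78.5, 120.0])  # made-up prices, same units as `price`

log_prices = np.log(prices)     # what the model is trained to predict
recovered = np.exp(log_prices)  # inverse transform applied to predictions

print(recovered)
```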
Machine Learning Models
Compared:
- Linear Regression
- Ridge Regression
- XGBoost Regressor
Feature & Target Selection
X = data[
    [
        'total_sqft',
        'bath',
        'bhk',
        'location_avg_price'
    ]
]
y = data['log_price']
Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
Cross Validation Evaluation
Used:
cross_val_score()
Scoring Metric:
- R² Score
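The comparison can be sketched roughly as follows. This is not the project's training script: the data is synthetic, the fold count `cv=5` and `alpha=1.0` are assumptions, and only the two scikit-learn models are included so the example runs without XGBoost installed. An `XGBRegressor` could be added to the `models` dictionary in the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: one area-like feature with a noisy linear target
rng = np.random.default_rng(42)
X = rng.uniform(500, 3000, size=(200, 1))
y = 0.05 * X[:, 0] + rng.normal(0, 5, size=200)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
}

scores = {}
for name, model in models.items():
    # One R² value per fold; the mean summarizes the model
    fold_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    scores[name] = fold_scores.mean()
    print(f"{name}: mean R² = {scores[name]:.3f}")
```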
Model Results
| Model | R² Score |
|---|---|
| Linear Regression | 0.559 |
| Ridge Regression | 0.559 |
| XGBoost | 0.827 |
Best Model
XGBoost Regressor
Reason:
- captures non-linear relationships
- handles feature interactions
- performs well on tabular datasets
Hyperparameter Tuning
Used:
GridSearchCV
Parameter Grid:
params = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2]
}
Best Parameters:
{
    'learning_rate': 0.1,
    'max_depth': 7,
    'n_estimators': 100
}
Best Tuned Score:
0.823
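A grid search over this parameter grid might look like the sketch below. To keep the example dependent on scikit-learn only, it swaps in `GradientBoostingRegressor` as a stand-in for XGBoost; both estimators accept these three parameter names. The synthetic data, `cv=3`, and the random seeds are assumptions, so the best parameters it prints will not match the ones reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with four features and a non-linear target
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 4))
y = 2 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(0, 0.1, size=120)

params = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
}

# Stand-in estimator; the project itself tuned an XGBoost regressor
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    params,
    cv=3,
    scoring='r2',
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated R² of the best combination, which is why it can differ slightly from a score measured with a different split.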
Feature Importance
Visualized feature importance using XGBoost.
Top contributing features:
- Location Average Price
- Total Square Feet
- BHK
- Bathrooms
Code:
import matplotlib.pyplot as plt
# `xgb` here is the fitted XGBoost regressor, not the xgboost module
importance = xgb.feature_importances_
features = X.columns
plt.figure(figsize=(8,5))
plt.bar(features, importance)
plt.xlabel("Features")
plt.ylabel("Importance")
plt.title("Feature Importance")
plt.show()
Flask API
Created a Flask API for predictions.
POST Endpoint
/predict
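Example Request
Here `location` is the locality's average price (the `location_avg_price` feature), not a place name.
{
    "location": 85,
    "BHK": 2,
    "area": 1200,
    "bath": 2
}
Example Response
{
    "predicted_price": 78.5
}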
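The endpoint could be wired up along the lines of the sketch below. It is a hedged illustration rather than the contents of `app.py`: the `create_app` factory, the `ConstantModel` stand-in, the error message, and the feature order (`total_sqft`, `bath`, `bhk`, `location_avg_price`) are assumptions, as is the idea that the model predicts a log-price that must be exponentiated. In the real app the stand-in would be replaced by the model loaded from `house_price_model.pkl`.

```python
import math

from flask import Flask, jsonify, request


def create_app(model):
    """Build a Flask app around any object exposing predict(rows)."""
    app = Flask(__name__)

    @app.route('/predict', methods=['POST'])
    def predict():
        payload = request.get_json(force=True)
        try:
            # Assumed feature order: total_sqft, bath, bhk, location_avg_price
            row = [[
                float(payload['area']),
                float(payload['bath']),
                float(payload['BHK']),
                float(payload['location']),
            ]]
        except (KeyError, TypeError, ValueError):
            return jsonify(
                {'error': 'expected numeric area, bath, BHK and location'}
            ), 400
        # Assumes the model was trained on log_price, so undo the log
        log_price = float(model.predict(row)[0])
        return jsonify({'predicted_price': round(math.exp(log_price), 2)})

    return app


class ConstantModel:
    """Hypothetical stand-in for the trained model; always predicts log(78.5)."""

    def predict(self, rows):
        return [math.log(78.5) for _ in rows]


app = create_app(ConstantModel())

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5000)
```

Passing the model into a factory keeps the route testable with Flask's built-in test client, without needing the pickled model on disk.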
Streamlit Frontend
Built an interactive UI using Streamlit.
Features:
- Area input
- Bathroom input
- BHK input
- Location pricing input
- Instant prediction display
Installation
Clone repository:
git clone https://github.com/your-username/house-price-predictor.git
Move into project directory:
cd house-price-predictor
Install dependencies:
pip install -r requirements.txt
Run Flask App
python app.py
Open:
http://127.0.0.1:5000
Run Streamlit App
streamlit run streamlit_app.py
Project Structure
house-price-predictor/
│
├── app.py
├── streamlit_app.py
├── house_price_model.pkl
├── requirements.txt
├── runtime.txt
└── README.md
Tech Stack
| Tool | Purpose |
|---|---|
| Python | Programming |
| Pandas | Data preprocessing |
| NumPy | Numerical operations |
| Matplotlib | Visualization |
| Scikit-learn | ML utilities |
| XGBoost | Regression model |
| Flask | API backend |
| Streamlit | Frontend UI |
Key Learnings
- Real-world data preprocessing
- Feature engineering
- Outlier handling using IQR
- Log transformation
- Model comparison using cross-validation
- Hyperparameter tuning
- Flask API creation
- Streamlit UI development
- ML deployment workflow
Future Improvements
- Use actual location names
- Add location dropdown
- Add map-based visualization
- Improve frontend UI
- Add cloud deployment pipeline
- Add model monitoring
Author
Mohd Faizanullah
Aspiring ML Engineer focused on:
- Machine Learning
- Deep Learning
- AI Applications
- Full ML Deployment Pipelines