---
title: House Price Prediction
emoji: 🏠
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app.py
pinned: false
---
# 🏠 House Price Prediction
An end-to-end Machine Learning project that predicts house prices in Bengaluru using features like square footage, BHK, bathrooms, and locality-based pricing.
Built using:
- Python
- Pandas
- Scikit-learn
- XGBoost
- Flask
- Streamlit
---
# 📌 Project Overview
This project uses the Bengaluru House Price dataset to build a real estate price prediction system.
The workflow includes:
- Data cleaning
- Feature engineering
- Outlier removal
- Log transformation
- Model training
- Hyperparameter tuning
- Feature importance analysis
- Flask API development
- Streamlit frontend deployment
---
# 🚀 Features
- ✅ Cleaned messy real-estate data
- ✅ Converted sqft ranges into numeric values
- ✅ Engineered geospatial locality pricing feature
- ✅ Removed outliers using IQR method
- ✅ Applied log transformation to target variable
- ✅ Compared multiple ML models
- ✅ Tuned XGBoost hyperparameters
- ✅ Built Flask prediction API
- ✅ Created interactive frontend UI
- ✅ Ready for deployment on Hugging Face / Render
---
# 📂 Dataset
Dataset used:
- Bengaluru House Price Dataset
Main features:
- location
- total_sqft
- bath
- balcony
- BHK
- price
---
# 🧹 Data Preprocessing
## 1. Handling Missing Values
Removed null values and inconsistent rows.
```python
data = data.dropna()
```
---
## 2. Converted `size` Column to BHK
Example:
```text
2 BHK → 2
```
Code:
```python
# Keep the leading integer from strings such as "2 BHK"
data['bhk'] = data['size'].apply(
    lambda x: int(str(x).split()[0])
)
```
---
## 3. Cleaned `total_sqft`
Handled:
- ranges
- inconsistent units
- invalid values
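Example:
```text
2100 - 2850 → 2475
```
Code:
```python
def convert_sqft(x):
    """Return total_sqft as a float; ranges become their midpoint."""
    x = str(x)
    try:
        if '-' in x:
            # A range such as "2100 - 2850" is replaced by its midpoint
            a, b = x.split('-')
            return (float(a) + float(b)) / 2
        return float(x)
    except ValueError:
        # Entries with units or other unparseable text are marked invalid
        return None
```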
Applied:
```python
data['total_sqft'] = data['total_sqft'].apply(convert_sqft)
```
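As a quick check, the sketch below applies the converter to a few made-up entries (illustrative values, not rows from the dataset) and then drops the ones that could not be parsed. The function is repeated so the snippet runs on its own.

```python
import pandas as pd

def convert_sqft(x):
    """Return total_sqft as a float; ranges become their midpoint."""
    x = str(x)
    try:
        if '-' in x:
            a, b = x.split('-')
            return (float(a) + float(b)) / 2
        return float(x)
    except ValueError:
        return None

# Illustrative values only: a plain number, a range, and a value with units
sample = pd.DataFrame({'total_sqft': ['1200', '2100 - 2850', '34.46Sq. Meter']})
sample['total_sqft'] = sample['total_sqft'].apply(convert_sqft)

# Unparseable entries come back as missing values and can be dropped
sample = sample.dropna(subset=['total_sqft'])
print(sample['total_sqft'].tolist())  # → [1200.0, 2475.0]
```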
---
# ⚙️ Feature Engineering
## 1. Price Per Sqft
Created a normalized pricing feature:
```python
# Multiplying by 100000 assumes `price` is recorded in lakhs of rupees
data['price_per_sqft'] = (
    data['price'] * 100000
) / data['total_sqft']
```
Used for:
- outlier detection
- normalization
- better model learning
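
A small worked example of the formula, using made-up numbers and assuming the price is expressed in lakhs:

```python
price_lakhs = 60.0   # illustrative listing price, assumed to be in lakhs
total_sqft = 1200.0  # illustrative area in square feet

# 60 lakhs = 6,000,000 rupees, spread over 1,200 sq ft
price_per_sqft = (price_lakhs * 100000) / total_sqft
print(price_per_sqft)  # → 5000.0
```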
---
## 2. Geospatial Locality Feature
Calculated the average price per locality:
```python
location_price = data.groupby(
    'location'
)['price'].mean()
```
Mapped back to the dataset:
```python
data['location_avg_price'] = data[
    'location'
].map(location_price)
```
This feature helps the model learn:
- expensive locations
- cheaper localities
- pricing trends by area
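
Because this feature is derived from `price`, computing it on the full dataset lets information from held-out rows reach the model. One possible variant, sketched here on made-up data, learns the locality means from the training rows only and falls back to the overall training mean for localities it has not seen. The helper name `add_location_avg_price` and the sample rows are assumptions for illustration, not part of the project code.

```python
import pandas as pd

def add_location_avg_price(train, other):
    """Map training-set locality means onto both frames (hypothetical helper)."""
    location_price = train.groupby('location')['price'].mean()
    fallback = train['price'].mean()  # used for localities absent from train
    train = train.assign(
        location_avg_price=train['location'].map(location_price)
    )
    other = other.assign(
        location_avg_price=other['location'].map(location_price).fillna(fallback)
    )
    return train, other

# Illustrative rows only
train = pd.DataFrame({
    'location': ['Whitefield', 'Whitefield', 'Hebbal'],
    'price': [60.0, 80.0, 100.0],
})
test = pd.DataFrame({
    'location': ['Hebbal', 'Yelahanka'],
    'price': [95.0, 50.0],
})

train, test = add_location_avg_price(train, test)
print(train['location_avg_price'].tolist())  # → [70.0, 70.0, 100.0]
print(test['location_avg_price'].tolist())   # → [100.0, 80.0]
```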
---
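# 📊 Outlier Removal
Used the IQR (Interquartile Range) method on `price_per_sqft`.
Formula:
```text
IQR = Q3 - Q1
```
Rows are kept when the value falls inside this range; anything outside it is treated as an outlier:
```text
[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
```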
Code:
```python
Q1 = data['price_per_sqft'].quantile(0.25)
Q3 = data['price_per_sqft'].quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
data = data[
    (data['price_per_sqft'] >= lower_limit) &
    (data['price_per_sqft'] <= upper_limit)
]
```
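The same filter can be wrapped in a reusable function. This is a minimal sketch on toy numbers; the name `remove_iqr_outliers` and the parameter `k` are illustrative choices, not taken from the project.

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() includes both endpoints by default
    return df[df[column].between(lower, upper)]

# Toy data: five ordinary values and one extreme value
toy = pd.DataFrame({'price_per_sqft': [4000, 4500, 5000, 5500, 6000, 50000]})
cleaned = remove_iqr_outliers(toy, 'price_per_sqft')
print(cleaned['price_per_sqft'].tolist())  # → [4000, 4500, 5000, 5500, 6000]
```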
---
# 📈 Log Transformation
Applied a logarithmic transformation to the target variable:
```python
import numpy as np
data['log_price'] = np.log(data['price'])
```
Benefits:
- reduced skewness
- stabilized variance
- improved regression performance
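
A model trained on `log_price` predicts on the log scale, so its output has to be passed through `np.exp` to get back a price in the original units. The round trip looks like this on illustrative values:

```python
import numpy as np

prices = np.array([45.0, 78.5, 120.0])  # illustrative prices
log_prices = np.log(prices)             # the scale the model is trained on
recovered = np.exp(log_prices)          # undo the transform on predictions

print(np.allclose(recovered, prices))   # → True
```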
---
# 🤖 Machine Learning Models
Compared:
- Linear Regression
- Ridge Regression
- XGBoost Regressor
---
# 📌 Feature & Target Selection
```python
X = data[
    [
        'total_sqft',
        'bath',
        'bhk',
        'location_avg_price'
    ]
]
y = data['log_price']
```
---
# ✂️ Train Test Split
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
```
---
# 📊 Cross Validation Evaluation
Used:
```python
cross_val_score()
```
Scoring Metric:
- R² Score
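
A minimal sketch of such a comparison, run on synthetic data so it is self-contained. The five folds and the generated data are assumptions; an `XGBRegressor` could be added to the `models` dictionary in the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features and log-price target
rng = np.random.default_rng(42)
X = rng.uniform(500, 3000, size=(200, 1))
y = 0.0005 * X[:, 0] + rng.normal(0, 0.1, 200)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
}

scores = {}
for name, model in models.items():
    # Each of the 5 folds yields one R² value; report the mean
    fold_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    scores[name] = fold_scores.mean()
    print(f"{name}: mean R² = {scores[name]:.3f}")
```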
---
# 📈 Model Results
| Model | R² Score |
|---|---|
| Linear Regression | 0.559 |
| Ridge Regression | 0.559 |
| XGBoost | 0.827 |
---
# 🏆 Best Model
## XGBoost Regressor
Reason:
- captures non-linear relationships
- handles feature interactions
- performs well on tabular datasets
---
# 🔧 Hyperparameter Tuning
Used:
```python
GridSearchCV
```
Parameter Grid:
```python
params = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2]
}
```
Best Parameters:
```python
{
    'learning_rate': 0.1,
    'max_depth': 7,
    'n_estimators': 100
}
```
Best Tuned Score:
```python
0.823
```
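The search itself might look like the sketch below. To keep the snippet dependent on scikit-learn only, `GradientBoostingRegressor` stands in for `XGBRegressor` (which follows the same estimator interface and accepts the same three parameter names); the synthetic data and `cv=3` are assumptions as well, so the printed values will not match the results above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear data standing in for the four model features
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(150, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.05, 150)

params = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
}

# Try every combination in the grid, scoring each by cross-validated R²
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid=params,
    cv=3,
    scoring='r2',
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```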
---
# 📌 Feature Importance
Visualized feature importance using XGBoost.
Top contributing features:
- Location Average Price
- Total Square Feet
- BHK
- Bathrooms
Code:
```python
import matplotlib.pyplot as plt
# `xgb` is assumed to be the fitted XGBRegressor instance, not the xgboost module
importance = xgb.feature_importances_
features = X.columns
plt.figure(figsize=(8, 5))
plt.bar(features, importance)
plt.xlabel("Features")
plt.ylabel("Importance")
plt.title("Feature Importance")
plt.show()
```
---
# 🌐 Flask API
Created a Flask API for predictions.
## POST Endpoint
```text
/predict
```
---
## Example Request
Here `location` appears to carry the numeric locality price feature rather than a place name.
```json
{
  "location": 85,
  "BHK": 2,
  "area": 1200,
  "bath": 2
}
```
---
## Example Response
```json
{
  "predicted_price": 78.5
}
```
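One way such an endpoint could be wired up is sketched below. It is not the project's `app.py`: `create_app`, `FEATURE_ORDER`, and `FixedModel` are hypothetical names, the mapping from request fields to training columns is inferred from the example request, and applying `exp` to the prediction assumes the served model was trained on `log_price`. In a real app the model would be loaded from `house_price_model.pkl` instead of the stand-in class. The last lines exercise the endpoint with Flask's built-in test client, so no server needs to be running.

```python
import math

import pandas as pd
from flask import Flask, jsonify, request

# Column order assumed to match the training features
FEATURE_ORDER = ['total_sqft', 'bath', 'bhk', 'location_avg_price']


def create_app(model):
    """Wrap any object with a predict() method in a small Flask app."""
    app = Flask(__name__)

    @app.route('/predict', methods=['POST'])
    def predict():
        body = request.get_json(silent=True) or {}
        try:
            # Map the request fields onto the training column names
            row = {
                'total_sqft': float(body['area']),
                'bath': float(body['bath']),
                'bhk': float(body['BHK']),
                'location_avg_price': float(body['location']),
            }
        except (KeyError, TypeError, ValueError):
            message = 'location, BHK, area and bath must be numbers'
            return jsonify({'error': message}), 400

        features = pd.DataFrame([row], columns=FEATURE_ORDER)
        log_price = float(model.predict(features)[0])
        # The model is assumed to predict log(price), so exponentiate
        return jsonify({'predicted_price': round(math.exp(log_price), 2)})

    return app


class FixedModel:
    """Stand-in for the trained model; always predicts log(78.5)."""

    def predict(self, features):
        return [math.log(78.5)]


client = create_app(FixedModel()).test_client()
response = client.post(
    '/predict',
    json={'location': 85, 'BHK': 2, 'area': 1200, 'bath': 2},
)
print(response.get_json())
```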
---
# 🖥️ Streamlit Frontend
Built an interactive UI using Streamlit.
Features:
- Area input
- Bathroom input
- BHK input
- Location pricing input
- Instant prediction display
---
# 📦 Installation
Clone repository:
```bash
git clone https://github.com/your-username/house-price-predictor.git
```
Move into project directory:
```bash
cd house-price-predictor
```
Install dependencies:
```bash
pip install -r requirements.txt
```
---
# ▶️ Run Flask App
```bash
python app.py
```
Open:
```text
http://127.0.0.1:5000
```
---
# ▶️ Run Streamlit App
```bash
streamlit run streamlit_app.py
```
---
# 📁 Project Structure
```text
house-price-predictor/
│
├── app.py
├── streamlit_app.py
├── house_price_model.pkl
├── requirements.txt
├── runtime.txt
└── README.md
```
---
# 🛠️ Tech Stack
| Tool | Purpose |
|---|---|
| Python | Programming |
| Pandas | Data preprocessing |
| NumPy | Numerical operations |
| Matplotlib | Visualization |
| Scikit-learn | ML utilities |
| XGBoost | Regression model |
| Flask | API backend |
| Streamlit | Frontend UI |
---
# 📚 Key Learnings
- Real-world data preprocessing
- Feature engineering
- Outlier handling using IQR
- Log transformation
- Model comparison using cross-validation
- Hyperparameter tuning
- Flask API creation
- Streamlit UI development
- ML deployment workflow
---
# 🔮 Future Improvements
- Use actual location names
- Add location dropdown
- Add map-based visualization
- Improve frontend UI
- Add cloud deployment pipeline
- Add model monitoring
---
# 👨‍💻 Author
Mohd Faizanullah
Aspiring ML Engineer focused on:
- Machine Learning
- Deep Learning
- AI Applications
- Full ML Deployment Pipelines