---
title: House Price Prediction
emoji: 🏠
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app.py
pinned: false
---

# House Price Prediction

An end-to-end machine learning project that predicts house prices in Bengaluru from square footage, BHK (bedroom count), number of bathrooms, and a locality-based average price.

Built using:
- Python
- Pandas
- Scikit-learn
- XGBoost
- Flask
- Streamlit

---

# Project Overview

This project uses the Bengaluru House Price dataset to build a real estate price prediction system.

The workflow includes:
- Data cleaning
- Feature engineering
- Outlier removal
- Log transformation
- Model training
- Hyperparameter tuning
- Feature importance analysis
- Flask API development
- Streamlit frontend deployment

---

# Features

- Cleaned messy real-estate data
- Converted sqft ranges into numeric values
- Engineered geospatial locality pricing feature
- Removed outliers using IQR method
- Applied log transformation to target variable
- Compared multiple ML models
- Tuned XGBoost hyperparameters
- Built Flask prediction API
- Created interactive frontend UI
- Ready for deployment on Hugging Face / Render

---

# Dataset

Dataset used:
- Bengaluru House Price Dataset

Main features:
- location
- total_sqft
- bath
- balcony
- BHK (derived from the `size` column)
- price

---

# Data Preprocessing

## 1. Handling Missing Values

Removed rows with null values and inconsistent entries. Null rows are dropped with:

```python
# Drop every row that has at least one missing value
data = data.dropna()
```

---

## 2. Converted `size` Column to BHK

Example:

```text
2 BHK → 2
```

Code:

```python
# "2 BHK" -> 2: take the first whitespace-separated token as an integer
data['bhk'] = data['size'].apply(
    lambda x: int(str(x).split()[0])
)
```

---

## 3. Cleaned `total_sqft`

Handled:
- ranges
- inconsistent units
- invalid values

Example:

```text
2100 - 2850 → 2475
```

Code:

```python
def convert_sqft(x):
    """Convert a raw total_sqft entry to a float, or None if it cannot be parsed."""
    x = str(x)

    try:
        # A range such as "2100 - 2850" becomes the midpoint of its two ends
        if '-' in x:
            a, b = x.split('-')
            return (float(a) + float(b)) / 2

        return float(x)

    except ValueError:
        # Values with a unit suffix or other text are treated as invalid
        return None
```

Applied:

```python
data['total_sqft'] = data['total_sqft'].apply(convert_sqft)
```
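
The conversion above leaves unparseable entries as missing values, so they still have to be dealt with before modelling. A minimal sketch of one way to do that follows; the sample rows are made up for illustration, and dropping the unparseable rows is an assumption rather than a step stated in this README.

```python
import pandas as pd


def convert_sqft(x):
    """Same parsing rule as above: midpoint for ranges, None when unparseable."""
    x = str(x)
    try:
        if '-' in x:
            a, b = x.split('-')
            return (float(a) + float(b)) / 2
        return float(x)
    except ValueError:
        return None


# Hypothetical sample rows, not taken from the real dataset
data = pd.DataFrame({'total_sqft': ['1200', '2100 - 2850', '34.46Sq. Meter']})

data['total_sqft'] = data['total_sqft'].apply(convert_sqft)

# Entries that could not be parsed are now missing; drop them (an assumption)
data = data.dropna(subset=['total_sqft']).reset_index(drop=True)

print(data['total_sqft'].tolist())
```

Only the plain number and the range survive; the value with a unit suffix is discarded instead of being converted.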

---

# Feature Engineering

## 1. Price Per Sqft

Created a normalized pricing feature. Multiplying by 100000 converts `price` from lakhs to rupees, so the result is in rupees per square foot:

```python
data['price_per_sqft'] = (
    data['price'] * 100000
) / data['total_sqft']
```

Used for:
- outlier detection
- normalization
- better model learning
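
As a quick arithmetic check of the formula, here is a small sketch with made-up numbers. It assumes `price` is recorded in lakhs, which is what the multiplication by 100000 suggests.

```python
import pandas as pd

# Hypothetical listing: 78.5 lakh for 1200 sqft (illustrative values only)
data = pd.DataFrame({'price': [78.5], 'total_sqft': [1200.0]})

data['price_per_sqft'] = (
    data['price'] * 100000
) / data['total_sqft']

# 7,850,000 rupees spread over 1200 sqft
print(round(data['price_per_sqft'].iloc[0], 2))
```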

---

## 2. Geospatial Locality Feature

Calculated the average price of each locality:

```python
# Mean price per location (a Series indexed by location name)
location_price = data.groupby(
    'location'
)['price'].mean()
```

Mapped back to the dataset:

```python
# Give every row the average price of its own location
data['location_avg_price'] = data[
    'location'
].map(location_price)
```

This feature helps the model learn:
- expensive locations
- cheaper localities
- pricing trends by area
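
Because this feature is computed from the target column, a common precaution is to derive the averages from training rows only, so that test prices do not leak into the inputs. The sketch below shows that variant. It is not part of the workflow described here; the two small frames, the fallback to the overall training mean, and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the two sides of a train/test split
train = pd.DataFrame({
    'location': ['A', 'A', 'B', 'B'],
    'price': [50.0, 70.0, 100.0, 120.0],
})
test = pd.DataFrame({'location': ['A', 'C']})

# Averages come from training rows only
location_price = train.groupby('location')['price'].mean()
global_mean = train['price'].mean()

train['location_avg_price'] = train['location'].map(location_price)

# Locations never seen in training fall back to the overall training mean
test['location_avg_price'] = (
    test['location'].map(location_price).fillna(global_mean)
)

print(test)
```

The same `location_price` lookup and fallback value would also be needed at prediction time, since a request for a single house cannot recompute them.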

---

# Outlier Removal

Used the IQR (Interquartile Range) method on `price_per_sqft`.

Formula:

```text
IQR = Q3 - Q1
```

Accepted range (values outside it are treated as outliers):

```text
[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
```

Code:

```python
# Quartiles of price per square foot
Q1 = data['price_per_sqft'].quantile(0.25)
Q3 = data['price_per_sqft'].quantile(0.75)
IQR = Q3 - Q1

lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# Keep only the rows inside the accepted range
data = data[
    (data['price_per_sqft'] >= lower_limit) &
    (data['price_per_sqft'] <= upper_limit)
]
```
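
To make the rule concrete, the same filter can be wrapped in a small helper and run on a handful of values. This is only a sketch: `remove_iqr_outliers` is a hypothetical name and the sample values are made up.

```python
import pandas as pd


def remove_iqr_outliers(df, column, k=1.5):
    """Return the rows of df whose `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends, matching the >= / <= filter above
    return df[df[column].between(lower, upper)]


# Hypothetical price-per-sqft values; the last one is deliberately extreme
sample = pd.DataFrame({'price_per_sqft': [4000, 4500, 5000, 5500, 6000, 25000]})
cleaned = remove_iqr_outliers(sample, 'price_per_sqft')

print(cleaned['price_per_sqft'].tolist())
```

With these six values and pandas' default linear interpolation, the quartiles are 4625 and 5875, so the accepted range is [2750, 7750] and only the 25000 entry is removed.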

---

# Log Transformation

Applied a logarithmic transformation to the target variable:

```python
import numpy as np

# Natural log of price; np.exp inverts it
data['log_price'] = np.log(data['price'])
```

Benefits:
- reduced skewness
- stabilized variance
- improved regression performance
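
A model trained on `log_price` predicts on the log scale, so its output has to be passed through the inverse transform before it is shown as a price. A minimal round-trip sketch, using made-up prices:

```python
import numpy as np

# Hypothetical prices in the dataset's own unit
price = np.array([45.0, 78.5, 120.0])

log_price = np.log(price)      # what the model is trained on
recovered = np.exp(log_price)  # what should be reported to the user

print(np.allclose(recovered, price))
```

Scores measured on `log_price` describe fit on the log scale, which is worth keeping in mind when comparing them with errors in the original price unit.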

---

# Machine Learning Models

Compared:
- Linear Regression
- Ridge Regression
- XGBoost Regressor

---

# Feature & Target Selection

```python
# Model inputs: area, bathrooms, bedrooms, and the locality average price
X = data[
    [
        'total_sqft',
        'bath',
        'bhk',
        'location_avg_price'
    ]
]

# Target: log-transformed price
y = data['log_price']
```

---

# Train Test Split

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; the fixed seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
```

---

# Cross Validation Evaluation

Used:

```python
cross_val_score()
```

Scoring Metric:
- R² Score
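
The comparison can be sketched as below. This is an illustration under assumptions, not the project's actual script: the data is synthetic, the fold count (`cv=5`) and `alpha=1.0` are guesses, and only the two scikit-learn models are included so the snippet runs without XGBoost installed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y (the real features are not reproduced here)
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 4))
y = X @ np.array([2.0, 0.5, 0.3, 1.5]) + rng.normal(0, 0.1, size=200)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation, scored with R²
    fold_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    scores[name] = fold_scores.mean()
    print(f"{name}: mean R2 = {scores[name]:.3f}")
```

Any estimator with the scikit-learn interface, including `xgboost.XGBRegressor`, can be added to the `models` dictionary in the same way.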

---

# Model Results

| Model | R² Score |
|---|---|
| Linear Regression | 0.559 |
| Ridge Regression | 0.559 |
| XGBoost | 0.827 |

---

# Best Model

## XGBoost Regressor

Reason:
- captures non-linear relationships
- handles feature interactions
- performs well on tabular datasets

---

# Hyperparameter Tuning

Used:

```python
GridSearchCV
```

Parameter Grid:

```python
params = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2]
}
```

Best Parameters:

```python
{
    'learning_rate': 0.1,
    'max_depth': 7,
    'n_estimators': 100
}
```

Best Tuned Score:

```text
0.823
```
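
A minimal sketch of how such a grid search is typically wired up follows. To keep it runnable with scikit-learn alone, it swaps in `GradientBoostingRegressor` as a stand-in for the XGBoost regressor; both accept the three parameter names in the grid. The synthetic data, `cv=3`, and the `r2` scoring are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a non-linear target
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.05, size=120)

params = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),  # stand-in model
    param_grid=params,
    cv=3,
    scoring='r2',
)
search.fit(X, y)

# Best combination and its mean cross-validated R²
print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated score of the winning combination, so it is not directly comparable to a score measured on a single held-out test set.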

---

# Feature Importance

Visualized feature importance using XGBoost.

Top contributing features:
- Location Average Price
- Total Square Feet
- BHK
- Bathrooms

Code:

```python
import matplotlib.pyplot as plt

# `xgb` is the fitted XGBoost regressor here, not the xgboost module
importance = xgb.feature_importances_
features = X.columns

plt.figure(figsize=(8, 5))
plt.bar(features, importance)

plt.xlabel("Features")
plt.ylabel("Importance")
plt.title("Feature Importance")

plt.show()
```

---

# Flask API

Created a Flask API for predictions.

## POST Endpoint

```text
/predict
```

---

## Example Request

`location` is a numeric locality price value (the model's `location_avg_price` input), not a place name.

```json
{
    "location": 85,
    "BHK": 2,
    "area": 1200,
    "bath": 2
}
```

---

## Example Response

```json
{
    "predicted_price": 78.5
}
```
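
One way an endpoint with this request and response shape could look is sketched below. It is an illustration only, not the contents of `app.py`: `predict_log_price` is a hypothetical placeholder for the trained model (a real app would load `house_price_model.pkl` and call its `predict` method), and the validation rules and error messages are assumptions.

```python
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Request keys in the order of the model inputs:
# total_sqft, bath, bhk, location_avg_price
REQUEST_KEYS = ['area', 'bath', 'BHK', 'location']


def predict_log_price(features):
    """Hypothetical placeholder for model.predict on the log-price scale."""
    area, bath, bhk, location_avg_price = features
    # Dummy rule: echo the locality price, only so the sketch runs end to end
    return float(np.log(location_avg_price))


@app.route('/predict', methods=['POST'])
def predict():
    payload = request.get_json(silent=True) or {}

    missing = [key for key in REQUEST_KEYS if key not in payload]
    if missing:
        return jsonify({'error': f'missing fields: {missing}'}), 400

    try:
        features = [float(payload[key]) for key in REQUEST_KEYS]
    except (TypeError, ValueError):
        return jsonify({'error': 'all fields must be numeric'}), 400

    # The model is trained on log(price), so undo the log before responding
    price = float(np.exp(predict_log_price(features)))
    return jsonify({'predicted_price': round(price, 2)})


# Exercise the endpoint in-process with Flask's test client (no server needed)
client = app.test_client()
response = client.post(
    '/predict',
    json={'location': 85, 'BHK': 2, 'area': 1200, 'bath': 2},
)
print(response.status_code, response.get_json())
```

Rejecting incomplete or non-numeric payloads with a 400 response keeps malformed requests from reaching the model.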

---

# Streamlit Frontend

Built an interactive UI using Streamlit.

Features:
- Area input
- Bathroom input
- BHK input
- Location pricing input
- Instant prediction display

---

# Installation

Clone repository:

```bash
git clone https://github.com/your-username/house-price-predictor.git
```

Move into project directory:

```bash
cd house-price-predictor
```

Install dependencies:

```bash
pip install -r requirements.txt
```

---

# Run Flask App

```bash
python app.py
```

Open:

```text
http://127.0.0.1:5000
```

---

# Run Streamlit App

```bash
streamlit run streamlit_app.py
```

---

# Project Structure

```text
house-price-predictor/
│
├── app.py
├── streamlit_app.py
├── house_price_model.pkl
├── requirements.txt
├── runtime.txt
└── README.md
```

---

# Tech Stack

| Tool | Purpose |
|---|---|
| Python | Programming |
| Pandas | Data preprocessing |
| NumPy | Numerical operations |
| Matplotlib | Visualization |
| Scikit-learn | ML utilities |
| XGBoost | Regression model |
| Flask | API backend |
| Streamlit | Frontend UI |

---

# Key Learnings

- Real-world data preprocessing
- Feature engineering
- Outlier handling using IQR
- Log transformation
- Model comparison using cross-validation
- Hyperparameter tuning
- Flask API creation
- Streamlit UI development
- ML deployment workflow

---

# Future Improvements

- Use actual location names
- Add location dropdown
- Add map-based visualization
- Improve frontend UI
- Add cloud deployment pipeline
- Add model monitoring

---

# Author

Mohd Faizanullah

Aspiring ML Engineer focused on:
- Machine Learning
- Deep Learning
- AI Applications
- Full ML Deployment Pipelines