---
title: House Price Prediction
emoji: 🏠
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app.py
pinned: false
---

# 🏠 House Price Prediction

An end-to-end machine learning project that predicts house prices in Bengaluru from features such as square footage, BHK, bathrooms, and locality-based pricing.

Built using:

- Python
- Pandas
- Scikit-learn
- XGBoost
- Flask
- Streamlit

---

# 📌 Project Overview

This project uses the Bengaluru House Price dataset to build a real estate price prediction system.

The workflow includes:

- Data cleaning
- Feature engineering
- Outlier removal
- Log transformation
- Model training
- Hyperparameter tuning
- Feature importance analysis
- Flask API development
- Streamlit frontend deployment

---

# 🚀 Features

- ✅ Cleaned messy real-estate data
- ✅ Converted sqft ranges into numeric values
- ✅ Engineered geospatial locality pricing feature
- ✅ Removed outliers using IQR method
- ✅ Applied log transformation to target variable
- ✅ Compared multiple ML models
- ✅ Tuned XGBoost hyperparameters
- ✅ Built Flask prediction API
- ✅ Created interactive frontend UI
- ✅ Ready for deployment on Hugging Face / Render

---

# 📂 Dataset

Dataset used:

- Bengaluru House Price Dataset

Main features:

- location
- total_sqft
- bath
- balcony
- BHK (derived from the `size` column)
- price

---

# 🧹 Data Preprocessing

## 1. Handling Missing Values

Removed null values and inconsistent rows.

```python
# Drop every row that contains at least one null value
data = data.dropna()
```

---

## 2. Converted `size` Column to BHK

Example:

```text
2 BHK → 2
```

Code:

```python
# Keep only the leading number of entries such as "2 BHK"
data['bhk'] = data['size'].apply(
    lambda x: int(str(x).split()[0])
)
```

---

## 3. Cleaned `total_sqft`

Handled:

- ranges
- inconsistent units
- invalid values

Examples:

```text
2100 - 2850 → 2475
```

Code:

```python
def convert_sqft(x):
    """Convert a total_sqft entry to a float; return None if it cannot be parsed."""
    x = str(x)
    try:
        if '-' in x:
            # Ranges such as "2100 - 2850" become their midpoint
            a, b = x.split('-')
            return (float(a) + float(b)) / 2
        return float(x)
    except ValueError:
        # Non-numeric entries (for example, values with unit text) are treated as invalid
        return None
```

Applied:

```python
data['total_sqft'] = data['total_sqft'].apply(convert_sqft)
```

---

# ⚙️ Feature Engineering

## 1. Price Per Sqft

Created a normalized pricing feature:

```python
# price is in lakhs, so multiply by 100000 to get rupees per square foot
data['price_per_sqft'] = (
    data['price'] * 100000
) / data['total_sqft']
```

Used for:

- outlier detection
- normalization
- better model learning

---

## 2. Geospatial Locality Feature

Calculated the average price per locality:

```python
location_price = data.groupby(
    'location'
)['price'].mean()
```

Mapped back to the dataset:

```python
data['location_avg_price'] = data[
    'location'
].map(location_price)
```

This feature helps the model learn:

- expensive locations
- cheaper localities
- pricing trends by area

Note: because this average is computed from `price` over the full dataset, it includes rows that later fall in the test split. Computing it from the training split only would avoid leaking target information into the evaluation.

---

# 📊 Outlier Removal

Used the IQR (Interquartile Range) method.
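The IQR method keeps values that lie within 1.5 × IQR of the first and third quartiles. As a quick illustration, the sketch below applies that rule to a handful of made-up `price_per_sqft` values using only the standard library. The names `iqr_bounds` and `sample` are hypothetical and not part of the project code; `method="inclusive"` is chosen because it interpolates linearly between data points, as pandas' default `quantile` does.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Return (lower, upper) fences placed 1.5 * IQR beyond the quartiles."""
    # quantiles(..., n=4) returns the three quartile cut points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Made-up price_per_sqft values; the last one is an obvious outlier
sample = [4000, 4500, 5000, 5200, 5500, 6000, 6500, 7000, 25000]

lower, upper = iqr_bounds(sample)
kept = [v for v in sample if lower <= v <= upper]

print(lower, upper)  # 2750.0 8750.0
print(kept)          # every value except 25000
```

On this sample the quartiles are 5000 and 6500, so the fences fall at 2750 and 8750 and only the 25000 entry is dropped.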
Formula:

```python
IQR = Q3 - Q1
```

Outlier Range:

```text
[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
```

Code:

```python
# Quartiles of the normalized price
Q1 = data['price_per_sqft'].quantile(0.25)
Q3 = data['price_per_sqft'].quantile(0.75)

IQR = Q3 - Q1

lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# Keep only rows inside the fences
data = data[
    (data['price_per_sqft'] >= lower_limit) &
    (data['price_per_sqft'] <= upper_limit)
]
```

---

# 📈 Log Transformation

Applied a logarithmic transformation to the target variable:

```python
import numpy as np

data['log_price'] = np.log(data['price'])
```

Benefits:

- reduced skewness
- stabilized variance
- improved regression performance

A model trained on `log_price` predicts on the log scale, so its outputs need to be converted back with `np.exp` to obtain prices.

---

# 🤖 Machine Learning Models

Compared:

- Linear Regression
- Ridge Regression
- XGBoost Regressor

---

# 📌 Feature & Target Selection

```python
X = data[
    [
        'total_sqft',
        'bath',
        'bhk',
        'location_avg_price'
    ]
]

y = data['log_price']
```

---

# ✂️ Train Test Split

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
```

---

# 📊 Cross Validation Evaluation

Used:

```python
cross_val_score()
```

Scoring Metric:

- R² Score

---

# 📈 Model Results

| Model | R² Score |
|---|---|
| Linear Regression | 0.559 |
| Ridge Regression | 0.559 |
| XGBoost | 0.827 |

---

# 🏆 Best Model

## XGBoost Regressor

Reason:

- captures non-linear relationships
- handles feature interactions
- performs well on tabular datasets

---

# 🔧 Hyperparameter Tuning

Used:

```python
GridSearchCV
```

Parameter Grid:

```python
params = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2]
}
```

Best Parameters:

```python
{
    'learning_rate': 0.1,
    'max_depth': 7,
    'n_estimators': 100
}
```

Best Tuned Score:

```python
0.823
```

---

# 📌 Feature Importance

Visualized feature importance using XGBoost.
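To read a ranking off the raw importance array, one option is to pair each feature name with its score and sort. The sketch below is illustrative only: `rank_features` is a hypothetical helper, and the scores are placeholders rather than the model's actual output. With a fitted model, `X.columns` and the model's `feature_importances_` array would be passed instead.

```python
def rank_features(names, scores):
    """Return (name, score) pairs sorted from most to least important."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Placeholder scores for illustration only; real values come from the fitted model
names = ["total_sqft", "bath", "bhk", "location_avg_price"]
scores = [0.30, 0.05, 0.10, 0.55]

for name, score in rank_features(names, scores):
    print(f"{name:<20} {score:.2f}")
```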
Top contributing features:

- Location Average Price
- Total Square Feet
- BHK
- Bathrooms

Code:

```python
import matplotlib.pyplot as plt

# xgb is the fitted XGBoost regressor, not the xgboost module
importance = xgb.feature_importances_
features = X.columns

plt.figure(figsize=(8, 5))
plt.bar(features, importance)
plt.xlabel("Features")
plt.ylabel("Importance")
plt.title("Feature Importance")
plt.show()
```

---

# 🌐 Flask API

Created a Flask API for predictions.

## POST Endpoint

```text
/predict
```

---

## Example Request

`location` takes a numeric locality pricing value rather than a place name.

```json
{
  "location": 85,
  "BHK": 2,
  "area": 1200,
  "bath": 2
}
```

---

## Example Response

```json
{
  "predicted_price": 78.5
}
```

---

# 🖥️ Streamlit Frontend

Built an interactive UI using Streamlit.

Features:

- Area input
- Bathroom input
- BHK input
- Location pricing input
- Instant prediction display

---

# 📦 Installation

Clone repository:

```bash
git clone https://github.com/your-username/house-price-predictor.git
```

Move into project directory:

```bash
cd house-price-predictor
```

Install dependencies:

```bash
pip install -r requirements.txt
```

---

# ▶️ Run Flask App

```bash
python app.py
```

Open:

```text
http://127.0.0.1:5000
```

---

# ▶️ Run Streamlit App

```bash
streamlit run streamlit_app.py
```

---

# 📁 Project Structure

```text
house-price-predictor/
│
├── app.py
├── streamlit_app.py
├── house_price_model.pkl
├── requirements.txt
├── runtime.txt
├── README.md
```

---

# 🛠️ Tech Stack

| Tool | Purpose |
|---|---|
| Python | Programming |
| Pandas | Data preprocessing |
| NumPy | Numerical operations |
| Matplotlib | Visualization |
| Scikit-learn | ML utilities |
| XGBoost | Regression model |
| Flask | API backend |
| Streamlit | Frontend UI |

---

# 📚 Key Learnings

- Real-world data preprocessing
- Feature engineering
- Outlier handling using IQR
- Log transformation
- Model comparison using cross-validation
- Hyperparameter tuning
- Flask API creation
- Streamlit UI development
- ML deployment workflow

---

# 🔮 Future Improvements

- Use actual location names
- Add location dropdown
- Add map-based visualization
- Improve frontend UI
- Add cloud deployment pipeline
- Add model monitoring

---

# 👨‍💻 Author

Mohd Faizanullah

Aspiring ML Engineer focused on:

- Machine Learning
- Deep Learning
- AI Applications
- Full ML Deployment Pipelines