---
title: Exercise1
emoji: "🏃"
colorFrom: gray
colorTo: gray
sdk: gradio
app_file: app.py
pinned: false
---
# Model Iterations Documentation
## Task: Apartment Price Prediction (Regression)
## Application Link
**Public URL (Hugging Face Space):**
https://huggingface.co/spaces/nbacchi/exercise1
---
## Summary of Iterative Process
| Iteration | Objective | Key Changes | Models Used | CV Mean R² | CV Std Dev | Change in Performance | Fit Diagnosis |
|------------|------------|-------------|-------------|------------|------------|-----------------------|----------------|
| **1** | Build baseline model | - Drop missing values<br>- Remove duplicates<br>- Price filter (750–8000 CHF)<br>- Valid rooms/area filter<br>- 5-fold CV | Linear Regression<br>Random Forest (n_estimators=300) | 0.5446 (LR)<br>0.5178 (RF) | 0.1071 (LR)<br>0.1195 (RF) | Baseline | ☑ Overfitting ☐ Underfitting ☐ Good Fit |
| **2** | Improve generalization | - Feature engineering<br>- municipality_area_proxy = pop/pop_dens<br>- emp_per_resident = emp/pop<br>- foreigner_count_est = pop×(frg_pct/100)<br>- Hyperparameter tuning<br>- 5-fold CV | Ridge (alpha=1.0)<br>Tuned Random Forest (n_estimators=500, max_depth=12, min_samples_split=5, min_samples_leaf=2) | 0.5297 (Ridge)<br>0.5509 (RF) | 0.0947 (Ridge)<br>0.1060 (RF) | +0.0331 (RF vs. baseline RF)<br>+0.0063 (RF vs. baseline LR) | ☐ Overfitting ☐ Underfitting ☑ Good Fit |
---
## Detailed Metrics Comparison
### Iteration 1 – Baseline
| Model | CV Mean R² | CV Std R² | CV Mean RMSE | CV Mean MAE |
|-------|---:|---:|---:|---:|
| Linear Regression | 0.5446 | 0.1071 | 673.00 | 468.07 |
| Random Forest | 0.5178 | 0.1195 | 698.51 | 500.13 |
### Iteration 2 – Feature Engineering
| Model | CV Mean R² | CV Std R² | CV Mean RMSE | CV Mean MAE |
|-------|---:|---:|---:|---:|
| Ridge | 0.5297 | 0.0947 | 682.01 | 481.08 |
| Tuned Random Forest | 0.5509 | 0.1060 | 674.54 | 473.98 |
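
The size of the improvement depends on which iteration 1 model is taken as the reference. A minimal sketch that recomputes both differences from the CV mean $R^2$ values in the two tables above (the dictionary keys and variable names are illustrative, not taken from the project code):

```python
# CV mean R² values copied from the two tables above.
cv_r2 = {
    "iter1_linear_regression": 0.5446,
    "iter1_random_forest": 0.5178,
    "iter2_ridge": 0.5297,
    "iter2_tuned_random_forest": 0.5509,
}

final = cv_r2["iter2_tuned_random_forest"]

# Gain of the tuned Random Forest over the baseline Random Forest.
gain_vs_baseline_rf = round(final - cv_r2["iter1_random_forest"], 4)

# Gain over the best iteration 1 model (Linear Regression).
gain_vs_best_baseline = round(final - cv_r2["iter1_linear_regression"], 4)

print(gain_vs_baseline_rf, gain_vs_best_baseline)
```

Both differences are small relative to the CV standard deviations of roughly 0.1, so they are best read as indicative rather than conclusive.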
---
## Created Features
**Iteration 2 Feature Engineering:**
- `municipality_area_proxy` = population / population density
- `emp_per_resident` = employees / population
- `foreigner_count_est` = population × (foreigner_pct / 100)
All features are reproducible from municipality-level variables and can be computed in real-time in the web application.
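
As a rough illustration of how these three features could be derived from user input at prediction time, here is a minimal sketch in plain Python. The function name `engineer_features` and the positivity check are assumptions for this example; the actual code in `app.py` may differ.

```python
def engineer_features(pop, pop_dens, emp, frg_pct):
    """Return the three iteration 2 features for one municipality.

    Hypothetical helper: the formulas follow the list above, the
    input validation is an added assumption.
    """
    if pop <= 0 or pop_dens <= 0:
        raise ValueError("pop and pop_dens must be positive")
    return {
        # residents / (residents per km²) gives an area in km²
        "municipality_area_proxy": pop / pop_dens,
        # jobs per resident
        "emp_per_resident": emp / pop,
        # frg_pct is a percentage, so divide by 100 first
        "foreigner_count_est": pop * (frg_pct / 100),
    }


# Made-up municipality values, for illustration only.
example = engineer_features(pop=20000, pop_dens=1000.0, emp=8000, frg_pct=25.0)
print(example)
```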
**Labels displayed in the app (German):**
- `municipality_area_proxy` → **Gemeindegröße** (municipality size)
- `emp_per_resident` → **Arbeitsplatzquote** (jobs per resident)
- `foreigner_count_est` → **Ausländerpopulation** (foreign population)
---
## Final Selected Features
**Feature Set for Final Model:**
- `rooms` – number of apartment rooms
- `area` – living area in m²
- `pop` – municipality population
- `pop_dens` – population density (per km²)
- `frg_pct` – percentage of foreign residents
- `emp` – number of employees in municipality
- `tax_income` – taxable income per capita
- `municipality_area_proxy` – proxy for geographic size
- `emp_per_resident` – economic activity indicator
- `foreigner_count_est` – estimated foreigner count
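
At prediction time the model must receive these ten features in the same order used during training. The sketch below shows one way to assemble such a row from the seven raw inputs. The column order, the helper names, and `StubModel` are assumptions for illustration; the real order is whatever was used when `models/apartment_price_model.pkl` was fitted.

```python
# Assumed column order (the order of the list above); must match training.
FEATURE_ORDER = [
    "rooms", "area", "pop", "pop_dens", "frg_pct", "emp", "tax_income",
    "municipality_area_proxy", "emp_per_resident", "foreigner_count_est",
]


def build_feature_row(rooms, area, pop, pop_dens, frg_pct, emp, tax_income):
    """Combine raw inputs with the engineered features into one row."""
    values = {
        "rooms": rooms, "area": area, "pop": pop, "pop_dens": pop_dens,
        "frg_pct": frg_pct, "emp": emp, "tax_income": tax_income,
        "municipality_area_proxy": pop / pop_dens,
        "emp_per_resident": emp / pop,
        "foreigner_count_est": pop * (frg_pct / 100),
    }
    return [values[name] for name in FEATURE_ORDER]


def predict_price(model, **inputs):
    """Call any object with a scikit-learn style predict() on one row."""
    row = build_feature_row(**inputs)
    return float(model.predict([row])[0])


class StubModel:
    """Hypothetical stand-in for the pickled model, used only for this demo."""

    def predict(self, rows):
        # Toy rule: CHF 1000 plus CHF 20 per m² (index 1 is "area").
        return [1000.0 + 20.0 * row[1] for row in rows]


inputs = dict(rooms=3.5, area=80.0, pop=20000, pop_dens=1000.0,
              frg_pct=25.0, emp=8000, tax_income=70000)
row = build_feature_row(**inputs)
price = predict_price(StubModel(), **inputs)
print(len(row), price)
```

Keeping the order in a single constant avoids silently swapping columns between training and inference.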
---
## Reason for Selection
**Final model:** `RandomForestRegressor` (tuned from iteration 2)
**Justification:**
- Highest cross-validated $R^2$ across all iterations (0.5509)
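- Lower fold-to-fold variability than the baseline Random Forest (CV Std 0.1060 vs. 0.1195); Ridge is more stable (0.0947) but reaches a lower mean $R^2$
- Feature engineering combined with hyperparameter tuning raises $R^2$ by +0.0331 over the baseline Random Forest (0.5178 → 0.5509) and by +0.0063 over the best baseline model (Linear Regression, 0.5446)
- Tuned hyperparameters constrain tree complexity to limit overfitting (`max_depth=12`, `min_samples_split=5`, `min_samples_leaf=2`)
- RMSE of CHF 674.54 is on par with the best baseline (CHF 673.00) and acceptable for the 750–8000 CHF price range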
---
## Preprocessing Steps (Iteration 1 → 2)
### Data Cleaning
1. Load original dataset (apartments in canton Zurich)
2. Remove rows with missing values (`dropna()`)
3. Remove duplicate rows (`drop_duplicates()`)
4. Filter unrealistic prices: keep `750 ≤ price ≤ 8000` CHF
5. Filter invalid structures: keep `rooms > 0` and `area > 0`
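
The cleaning steps above can be sketched with pandas as follows. The column names `price`, `rooms`, and `area` come from the steps themselves; the function name `clean_apartments` and the toy data are assumptions for illustration.

```python
import pandas as pd


def clean_apartments(df: pd.DataFrame) -> pd.DataFrame:
    """Apply cleaning steps 2 to 5 to a raw apartments table."""
    # Steps 2 and 3: drop incomplete rows, then exact duplicates.
    cleaned = df.dropna().drop_duplicates()
    # Steps 4 and 5: keep plausible prices and valid rooms/area.
    mask = (
        cleaned["price"].between(750, 8000)  # inclusive on both ends
        & (cleaned["rooms"] > 0)
        & (cleaned["area"] > 0)
    )
    return cleaned[mask].reset_index(drop=True)


# Made-up rows: one duplicate, one missing area, one with zero rooms,
# one price above the upper bound.
raw = pd.DataFrame({
    "rooms": [3.5, 3.5, 2.0, 0.0, 4.5, 1.0],
    "area": [80.0, 80.0, None, 50.0, 120.0, 30.0],
    "price": [2200, 2200, 1800, 1500, 9500, 900],
})
cleaned = clean_apartments(raw)
print(cleaned)
```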
### Feature Engineering (Iteration 2)
1. Compute `municipality_area_proxy` from `pop` and `pop_dens`
2. Compute `emp_per_resident` from `emp` and `pop`
3. Compute `foreigner_count_est` from `pop` and `frg_pct`
4. Combine with baseline features for final training
### Evaluation Method
- 5-fold cross-validation
- Metrics: $R^2$, RMSE, MAE
- No separate validation set (full data used with CV)
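
A minimal sketch of this evaluation setup with scikit-learn is shown below. It runs on synthetic data from `make_regression`, not the apartment dataset, so the numbers it prints are unrelated to the tables in this README. Shuffling the folds and the `random_state` values are assumptions; the hyperparameters are those listed for the tuned Random Forest.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate


def evaluate(model, X, y, n_splits=5, seed=42):
    """Return mean R² (with std), RMSE and MAE over k folds."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scoring = {
        "r2": "r2",
        "rmse": "neg_root_mean_squared_error",
        "mae": "neg_mean_absolute_error",
    }
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    return {
        "r2_mean": float(np.mean(scores["test_r2"])),
        "r2_std": float(np.std(scores["test_r2"])),
        # scikit-learn reports errors as negative scores, so flip the sign.
        "rmse_mean": float(-np.mean(scores["test_rmse"])),
        "mae_mean": float(-np.mean(scores["test_mae"])),
    }


# Synthetic stand-in data with 10 features, for illustration only.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
model = RandomForestRegressor(
    n_estimators=500, max_depth=12,
    min_samples_split=5, min_samples_leaf=2, random_state=42,
)
results = evaluate(model, X, y)
print(results)
```

Because hyperparameters were tuned and scored on the same folds, the reported CV scores may be slightly optimistic compared with a fully held-out test set.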
---
## Metric Definition
**$R^2$ (Coefficient of Determination):**
Proportion of variance in price explained by the features. At most 1; it can be negative on held-out folds when the model predicts worse than the mean price. Higher is better.
**RMSE (Root Mean Squared Error):**
Square root of average squared prediction error. Units: CHF. Lower is better.
**MAE (Mean Absolute Error):**
Average absolute prediction error. Units: CHF. Lower is better.
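
To make the three definitions concrete, here is a small worked example in plain Python using made-up prices in CHF. The function name `regression_metrics` is illustrative. The second call shows that $R^2$ drops below zero when predictions are worse than always predicting the mean.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², RMSE and MAE from two equal-length lists."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


prices = [1500.0, 2000.0, 2500.0, 3000.0]

# Every prediction is off by CHF 100.
good = regression_metrics(prices, [1600.0, 1900.0, 2600.0, 2900.0])

# Predictions in reversed order: worse than predicting the mean.
bad = regression_metrics(prices, [3000.0, 2500.0, 2000.0, 1500.0])

print(good)
print(bad)
```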
---
## Application & Deployment
- **App Framework:** Gradio
- **App File:** [app.py](app.py)
- **Saved Model:** [models/apartment_price_model.pkl](models/apartment_price_model.pkl)
- **Deployment Platform:** Hugging Face Spaces (https://huggingface.co/spaces/nbacchi/exercise1)
### How to Run Locally
```bash
cd Projekt1
uv run python app.py
```
---
## Submission Checklist (Mandatory)
- [x] Trained regression model available ([models/apartment_price_model.pkl](models/apartment_price_model.pkl))
- [x] New feature(s) added (iteration 2 feature engineering)
- [x] Working web application ([app.py](app.py))
- [x] Documented iterative modeling process (2 iterations, tables + metrics)
- [x] Completed README
- [x] README uploaded to Hugging Face repository
- [x] Public application link inserted above
---
## Notes
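- The best baseline model (Linear Regression, $R^2$ = 0.5446) already explains roughly 54% of the price variance
- Feature engineering plus tuning provides a modest improvement in $R^2$: +0.0331 over the baseline Random Forest and +0.0063 over the baseline Linear Regression
- The drop in CV standard deviation for Random Forest (0.1195 → 0.1060) indicates more consistent performance across folds
- The model is saved and deployed on Hugging Face Spaces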