---
title: Exercise1
emoji: "🏃"
colorFrom: gray
colorTo: gray
sdk: gradio
app_file: app.py
pinned: false
---
# Model Iterations Documentation
## Task: Apartment Price Prediction (Regression)
## Application Link
**Public URL (Hugging Face Space):**
https://huggingface.co/spaces/nbacchi/exercise1
---
## Summary of Iterative Process
| Iteration | Objective | Key Changes | Models Used | CV Mean R² | CV Std Dev | Change in Performance | Fit Diagnosis |
|------------|------------|-------------|-------------|------------|------------|-----------------------|----------------|
| **1** | Build baseline model | - Drop missing values<br>- Remove duplicates<br>- Price filter (750–8000 CHF)<br>- Valid rooms/area filter<br>- 5-fold CV | Linear Regression<br>Random Forest (n_estimators=300) | 0.5446 (LR)<br>0.5178 (RF) | 0.1071 (LR)<br>0.1195 (RF) | Baseline | ☑ Overfitting ☐ Underfitting ☐ Good Fit |
| **2** | Improve generalization | - Feature engineering<br>- municipality_area_proxy = pop/pop_dens<br>- emp_per_resident = emp/pop<br>- foreigner_count_est = pop×(frg_pct/100)<br>- Hyperparameter tuning<br>- 5-fold CV | Ridge (alpha=1.0)<br>Tuned Random Forest (n_estimators=500, max_depth=12, min_samples_split=5, min_samples_leaf=2) | 0.5297 (Ridge)<br>0.5509 (RF) | 0.0947 (Ridge)<br>0.1060 (RF) | +0.0331 (RF) | ☐ Overfitting ☐ Underfitting ☑ Good Fit |
---
## Detailed Metrics Comparison
### Iteration 1 – Baseline
| Model | CV Mean R² | CV Std R² | CV Mean RMSE | CV Mean MAE |
|-------|---:|---:|---:|---:|
| Linear Regression | 0.5446 | 0.1071 | 673.00 | 468.07 |
| Random Forest | 0.5178 | 0.1195 | 698.51 | 500.13 |
### Iteration 2 – Feature Engineering
| Model | CV Mean R² | CV Std R² | CV Mean RMSE | CV Mean MAE |
|-------|---:|---:|---:|---:|
| Ridge | 0.5297 | 0.0947 | 682.01 | 481.08 |
| Tuned Random Forest | 0.5509 | 0.1060 | 674.54 | 473.98 |
---
## Created Features
**Iteration 2 Feature Engineering:**
- `municipality_area_proxy` = population / population density
- `emp_per_resident` = employees / population
- `foreigner_count_est` = population × (foreigner_pct / 100)
All features are reproducible from municipality-level variables and can be computed in real-time in the web application.
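A minimal sketch of how these three features could be derived with pandas. The column names `pop`, `pop_dens`, `emp`, and `frg_pct` follow the feature list in this README; the function name `add_engineered_features` and the example values are illustrative assumptions, not taken from the project code or dataset.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three iteration-2 features appended."""
    out = df.copy()
    # population / (residents per km²) approximates the municipality area in km²
    out["municipality_area_proxy"] = out["pop"] / out["pop_dens"]
    # employees per resident as a rough indicator of local economic activity
    out["emp_per_resident"] = out["emp"] / out["pop"]
    # frg_pct is a percentage, so divide by 100 before scaling by population
    out["foreigner_count_est"] = out["pop"] * (out["frg_pct"] / 100)
    return out


# Illustrative values only (hypothetical municipality).
example = pd.DataFrame(
    {"pop": [10000.0], "pop_dens": [500.0], "emp": [4000.0], "frg_pct": [25.0]}
)
features = add_engineered_features(example)
print(features.iloc[0].to_dict())
```

Because each feature depends only on values from the same row, the same function can be applied to a one-row frame built from user input at prediction time.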
**Labels shown in the app (German):**
- `municipality_area_proxy` → **Gemeindegröße**
- `emp_per_resident` → **Arbeitsplatzquote**
- `foreigner_count_est` → **Ausländerpopulation**
---
## Final Selected Features
**Feature Set for Final Model:**
- `rooms` – number of apartment rooms
- `area` – living area in m²
- `pop` – municipality population
- `pop_dens` – population density (per km²)
- `frg_pct` – percentage of foreign residents
- `emp` – number of employees in municipality
- `tax_income` – taxable income per capita
- `municipality_area_proxy` – proxy for geographic size
- `emp_per_resident` – economic activity indicator
- `foreigner_count_est` – estimated foreigner count
---
## Reason for Selection
**Final model:** `RandomForestRegressor` (tuned from iteration 2)
**Justification:**
- Highest cross-validated $R^2$ across all iterations (0.5509)
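- Lower fold-to-fold variability than the baseline Random Forest (CV Std 0.1060 vs. 0.1195); Ridge is more stable still (0.0947) but has a lower mean $R^2$
- Feature engineering and tuning together raise $R^2$ by +0.0331 over the baseline Random Forest (+0.0063 over the baseline Linear Regression)
- Tuned hyperparameters limit tree complexity to curb overfitting (`max_depth=12`, `min_samples_split=5`, `min_samples_leaf=2`)
- RMSE of CHF 674.54 is acceptable for the CHF 750–8000 price range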
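The $R^2$ deltas can be checked directly against the values in the Detailed Metrics Comparison tables. The snippet below is a small arithmetic sketch; the dictionary keys are hypothetical labels chosen for this example.

```python
# CV mean R² values copied from the metrics tables in this README.
cv_r2 = {
    "lr_baseline": 0.5446,
    "rf_baseline": 0.5178,
    "ridge": 0.5297,
    "rf_tuned": 0.5509,
}

best_model = max(cv_r2, key=cv_r2.get)
# Gain of the tuned Random Forest over each baseline model.
gain_vs_rf_baseline = round(cv_r2["rf_tuned"] - cv_r2["rf_baseline"], 4)
gain_vs_lr_baseline = round(cv_r2["rf_tuned"] - cv_r2["lr_baseline"], 4)

print(best_model, gain_vs_rf_baseline, gain_vs_lr_baseline)
```

The gain is larger against the baseline Random Forest than against the baseline Linear Regression, so it matters which baseline a quoted improvement refers to.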
---
## Preprocessing Steps (Iteration 1 → 2)
### Data Cleaning
1. Load original dataset (apartments in canton Zurich)
2. Remove rows with missing values (`dropna()`)
3. Remove duplicate rows (`drop_duplicates()`)
4. Filter unrealistic prices: keep `750 ≤ price ≤ 8000` CHF
5. Filter invalid structures: keep `rooms > 0` and `area > 0`
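The cleaning steps above can be sketched as follows. This is an illustration under assumptions rather than the project's actual code: the column names `price`, `rooms`, and `area` follow this README, while the function name `clean_apartments` and the sample rows are made up for the example.

```python
import pandas as pd


def clean_apartments(df: pd.DataFrame) -> pd.DataFrame:
    """Apply cleaning steps 2-5: drop NaNs and duplicates, then filter rows."""
    cleaned = df.dropna().drop_duplicates()
    keep = (
        cleaned["price"].between(750, 8000)  # inclusive on both ends
        & (cleaned["rooms"] > 0)
        & (cleaned["area"] > 0)
    )
    return cleaned[keep].reset_index(drop=True)


# Made-up rows covering each removal rule.
raw = pd.DataFrame(
    {
        "rooms": [3.5, 3.5, 2.0, 0.0, 4.5, None],
        "area": [80.0, 80.0, 55.0, 40.0, 120.0, 60.0],
        "price": [2200.0, 2200.0, 500.0, 1500.0, 3900.0, 1800.0],
    }
)
clean = clean_apartments(raw)
print(clean)
```

In this sample, one row is removed as a duplicate, one for a price below 750 CHF, one for `rooms = 0`, and one for a missing value, leaving two rows.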
### Feature Engineering (Iteration 2)
1. Compute `municipality_area_proxy` from `pop` and `pop_dens`
2. Compute `emp_per_resident` from `emp` and `pop`
3. Compute `foreigner_count_est` from `pop` and `frg_pct`
4. Combine with baseline features for final training
### Evaluation Method
- 5-fold cross-validation
- Metrics: $R^2$, RMSE, MAE
- No separate validation set (full data used with CV)
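A sketch of how this evaluation could be run with scikit-learn's `cross_validate`. The hyperparameters match the tuned Random Forest described in this README; the synthetic data, the shuffled `KFold` split, and both `random_state` values are assumptions made so the example is self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the apartment data (assumption, not the real dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = 1000 + 3000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 100, size=200)

model = RandomForestRegressor(
    n_estimators=500,
    max_depth=12,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,  # assumption: fixed seed for reproducibility
    n_jobs=-1,
)

# scikit-learn reports error metrics as negated scores (higher is better).
scoring = {
    "r2": "r2",
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)

summary = {
    "cv_mean_r2": scores["test_r2"].mean(),
    "cv_std_r2": scores["test_r2"].std(),
    "cv_mean_rmse": -scores["test_rmse"].mean(),  # flip sign back to an error
    "cv_mean_mae": -scores["test_mae"].mean(),
}
print(summary)
```

The four summary values correspond to the columns of the metrics tables (CV Mean R², CV Std R², CV Mean RMSE, CV Mean MAE).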
---
## Metric Definition
**$R^2$ (Coefficient of Determination):**
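Proportion of variance in price explained by the features. The maximum is 1; on held-out data (as in cross-validation) it can be negative when the model predicts worse than the mean price. Higher is better.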
**RMSE (Root Mean Squared Error):**
Square root of average squared prediction error. Units: CHF. Lower is better.
**MAE (Mean Absolute Error):**
Average absolute prediction error. Units: CHF. Lower is better.
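To make the three definitions concrete, here is a small pure-Python sketch that computes them from lists of true and predicted prices. The function name `regression_metrics` and the price values are illustrative assumptions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², RMSE, and MAE from two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


prices = [1000, 2000, 3000]  # hypothetical rents in CHF
close = regression_metrics(prices, [1100, 1900, 3200])
reversed_order = regression_metrics(prices, [3000, 2000, 1000])
print(close)
print(reversed_order)
```

The second call predicts the prices in reverse order and yields a negative $R^2$, which shows that the score is bounded above by 1 but not below by 0.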
---
## Application & Deployment
- **App Framework:** Gradio
- **App File:** [app.py](app.py)
- **Saved Model:** [models/apartment_price_model.pkl](models/apartment_price_model.pkl)
- **Deployment Platform:** Hugging Face Spaces (https://huggingface.co/spaces/nbacchi/exercise1)
### How to Run Locally
```bash
cd Projekt1
uv run python app.py
```
---
## Submission Checklist (Mandatory)
- [x] Trained regression model available ([models/apartment_price_model.pkl](models/apartment_price_model.pkl))
- [x] New feature(s) added (iteration 2 feature engineering)
- [x] Working web application ([app.py](app.py))
- [x] Documented iterative modeling process (2 iterations, tables + metrics)
- [x] Completed README
- [x] README uploaded to Hugging Face repository
- [x] Public application link inserted above
---
## Notes
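- The Linear Regression baseline ($R^2$ = 0.5446) is a reasonable starting point given that only apartment size and municipality-level features are used
- Feature engineering and tuning provide a modest improvement in $R^2$: +0.0331 over the baseline Random Forest, +0.0063 over the baseline Linear Regression
- The drop in CV standard deviation for Random Forest (0.1195 → 0.1060) indicates more consistent performance across folds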
- Model saved and ready for deployment on Hugging Face Spaces