---
title: Exercise1
emoji: "🏃"
colorFrom: gray
colorTo: gray
sdk: gradio
app_file: app.py
pinned: false
---

# Model Iterations Documentation  
## Task: Apartment Price Prediction (Regression)

## Application Link

**Public URL (Hugging Face Space):** 

https://huggingface.co/spaces/nbacchi/exercise1

---

## Summary of Iterative Process

| Iteration | Objective | Key Changes | Models Used | CV Mean R² | CV Std Dev | Change in Performance | Fit Diagnosis |
|------------|------------|-------------|-------------|------------|------------|-----------------------|----------------|
| **1** | Build baseline model | - Drop missing values<br>- Remove duplicates<br>- Price filter (750–8000 CHF)<br>- Valid rooms/area filter<br>- 5-fold CV | Linear Regression<br>Random Forest (n_estimators=300) | 0.5446 (LR)<br>0.5178 (RF) | 0.1071 (LR)<br>0.1195 (RF) | Baseline | ☑ Overfitting ☐ Underfitting ☐ Good Fit |
| **2** | Improve generalization | - Feature engineering<br>- municipality_area_proxy = pop/pop_dens<br>- emp_per_resident = emp/pop<br>- foreigner_count_est = pop×(frg_pct/100)<br>- Hyperparameter tuning<br>- 5-fold CV | Ridge (alpha=1.0)<br>Tuned Random Forest (n_estimators=500, max_depth=12, min_samples_split=5, min_samples_leaf=2) | 0.5297 (Ridge)<br>0.5509 (RF) | 0.0947 (Ridge)<br>0.1060 (RF) | +0.0331 (RF) | ☐ Overfitting ☐ Underfitting ☑ Good Fit |

---

## Detailed Metrics Comparison

### Iteration 1 – Baseline
| Model | CV Mean R² | CV Std R² | CV Mean RMSE | CV Mean MAE |
|-------|---:|---:|---:|---:|
| Linear Regression | 0.5446 | 0.1071 | 673.00 | 468.07 |
| Random Forest | 0.5178 | 0.1195 | 698.51 | 500.13 |

### Iteration 2 – Feature Engineering
| Model | CV Mean R² | CV Std R² | CV Mean RMSE | CV Mean MAE |
|-------|---:|---:|---:|---:|
| Ridge | 0.5297 | 0.0947 | 682.01 | 481.08 |
| Tuned Random Forest | 0.5509 | 0.1060 | 674.54 | 473.98 |

---

## Created Features

**Iteration 2 Feature Engineering:**
- `municipality_area_proxy` = population / population density  
- `emp_per_resident` = employees / population  
- `foreigner_count_est` = population × (foreigner_pct / 100)  

All three features are derived solely from municipality-level variables, so the web application can compute them at prediction time.
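As a minimal sketch of how these formulas could be evaluated for a single prediction request, the function below takes the four municipality inputs and returns the three engineered values. The function name `engineer_features` and the sample numbers are illustrative assumptions, not taken from `app.py`.

```python
def engineer_features(pop, pop_dens, emp, frg_pct):
    """Return the three iteration-2 features for one municipality.

    Hypothetical helper: the name and signature are assumptions.
    """
    if pop <= 0 or pop_dens <= 0:
        # Both values are used as divisors, so they must be positive.
        raise ValueError("pop and pop_dens must be positive")
    return {
        "municipality_area_proxy": pop / pop_dens,
        "emp_per_resident": emp / pop,
        "foreigner_count_est": pop * (frg_pct / 100),
    }


# Made-up example values for one municipality.
features = engineer_features(pop=20000, pop_dens=1000, emp=8000, frg_pct=25)
print(features)
```

Guarding against non-positive `pop` and `pop_dens` avoids a division by zero when a user enters incomplete data.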

**Labels displayed in the app (German):**
- `municipality_area_proxy` → **Gemeindegröße** (municipality size)
- `emp_per_resident` → **Arbeitsplatzquote** (jobs per resident)
- `foreigner_count_est` → **Ausländerpopulation** (foreign population)

---

## Final Selected Features

**Feature Set for Final Model:**
- `rooms` – number of apartment rooms
- `area` – living area in m²
- `pop` – municipality population
- `pop_dens` – population density (per km²)
- `frg_pct` – percentage of foreign residents
- `emp` – number of employees in municipality
- `tax_income` – taxable income per capita
- `municipality_area_proxy` – proxy for geographic size
- `emp_per_resident` – economic activity indicator
- `foreigner_count_est` – estimated foreigner count

---

## Reason for Selection

**Final model:** `RandomForestRegressor` (tuned from iteration 2)  
**Justification:**
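- Highest cross-validated $R^2$ across all iterations (0.5509)
- Lower fold-to-fold variability than the baseline Random Forest (CV Std of $R^2$: 0.1060 vs. 0.1195); Ridge is slightly more stable (0.0947) but has a lower mean $R^2$ (0.5297)
- Feature engineering combined with tuning raises the Random Forest's $R^2$ by +0.0331 (0.5178 → 0.5509); the gain over the best baseline model, Linear Regression (0.5446), is +0.0063
- Tuned hyperparameters constrain tree complexity to limit overfitting (`max_depth=12`, `min_samples_split=5`, `min_samples_leaf=2`)
- CV mean RMSE of CHF 674.54, on par with the baseline Linear Regression (CHF 673.00) and judged acceptable for the 750–8000 CHF price range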
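For reference, the tuned configuration from iteration 2 could be instantiated roughly as follows. The four hyperparameters come from the summary table; `random_state` and `n_jobs` are assumptions added for reproducibility and speed, since the values used in the original training run are not documented here.

```python
from sklearn.ensemble import RandomForestRegressor

# Hyperparameters as listed for the tuned Random Forest in iteration 2.
model = RandomForestRegressor(
    n_estimators=500,
    max_depth=12,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,  # assumption: seed not stated in this README
    n_jobs=-1,        # assumption: use all available cores
)

print(model.get_params()["max_depth"])
```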

---

## Preprocessing Steps (Iteration 1 → 2)

### Data Cleaning
1. Load original dataset (apartments in canton Zurich)
2. Remove rows with missing values (`dropna()`)
3. Remove duplicate rows (`drop_duplicates()`)
4. Filter unrealistic prices: keep `750 ≤ price ≤ 8000` CHF
5. Filter invalid structures: keep `rooms > 0` and `area > 0`
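
The cleaning steps above can be sketched with pandas as shown below. This is an illustration under assumptions: the function name `clean_apartments` is hypothetical, the column names `price`, `rooms`, and `area` follow the names used in this README, and the toy rows are made up.

```python
import pandas as pd


def clean_apartments(df: pd.DataFrame) -> pd.DataFrame:
    """Apply steps 2 to 5: drop NaNs and duplicates, then filter rows."""
    cleaned = df.dropna().drop_duplicates()
    mask = (
        cleaned["price"].between(750, 8000)  # inclusive on both ends
        & (cleaned["rooms"] > 0)
        & (cleaned["area"] > 0)
    )
    return cleaned.loc[mask].reset_index(drop=True)


# Toy data: only the first row survives every filter.
raw = pd.DataFrame(
    {
        "rooms": [3.5, 3.5, 2.0, 0.0, 4.5, 1.0],
        "area": [80, 80, 55, 40, None, 30],
        "price": [2200, 2200, 500, 1500, 3000, 9000],
    }
)
print(clean_apartments(raw))
```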

### Feature Engineering (Iteration 2)
1. Compute `municipality_area_proxy` from `pop` and `pop_dens`
2. Compute `emp_per_resident` from `emp` and `pop`
3. Compute `foreigner_count_est` from `pop` and `frg_pct`
4. Combine with baseline features for final training
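
At training time the same formulas can be applied column-wise and the result combined with the baseline columns. The sketch below assumes a pandas DataFrame with the column names from the final feature list; `add_engineered_features` and the two sample rows are hypothetical.

```python
import pandas as pd

BASE_FEATURES = ["rooms", "area", "pop", "pop_dens", "frg_pct", "emp", "tax_income"]
ENGINEERED_FEATURES = [
    "municipality_area_proxy",
    "emp_per_resident",
    "foreigner_count_est",
]


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three iteration-2 columns appended."""
    out = df.copy()
    out["municipality_area_proxy"] = out["pop"] / out["pop_dens"]
    out["emp_per_resident"] = out["emp"] / out["pop"]
    out["foreigner_count_est"] = out["pop"] * (out["frg_pct"] / 100)
    return out


# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "rooms": [3.5, 2.5],
        "area": [80, 60],
        "pop": [20000, 5000],
        "pop_dens": [1000, 250],
        "frg_pct": [25, 10],
        "emp": [8000, 1000],
        "tax_income": [70000, 60000],
    }
)

# Final training matrix: 7 baseline columns followed by 3 engineered ones.
X = add_engineered_features(df)[BASE_FEATURES + ENGINEERED_FEATURES]
print(X.shape)
```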

### Evaluation Method
- 5-fold cross-validation
- Metrics: $R^2$, RMSE, MAE
- No separate validation set (full data used with CV)
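
One way to obtain all three metrics from a single 5-fold run is scikit-learn's `cross_validate`. The sketch below uses synthetic data and a plain `LinearRegression` as stand-ins; shuffling, the seed, and the use of the population standard deviation (`ddof=0`) are assumptions, since the original training script is not shown here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (not the apartment dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2000 + X @ np.array([300.0, 150.0, -50.0]) + rng.normal(scale=100, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=42)  # assumption: shuffled folds
scoring = {
    "r2": "r2",
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
}
results = cross_validate(LinearRegression(), X, y, cv=cv, scoring=scoring)

# scikit-learn reports error metrics negated, so flip the sign back.
r2_mean = results["test_r2"].mean()
r2_std = results["test_r2"].std()
rmse_mean = -results["test_rmse"].mean()
mae_mean = -results["test_mae"].mean()
print(f"R2 {r2_mean:.4f} +/- {r2_std:.4f}, RMSE {rmse_mean:.2f}, MAE {mae_mean:.2f}")
```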

---

## Metric Definition

**$R^2$ (Coefficient of Determination):**  
Proportion of variance in price explained by the features. At most 1; it can be negative on held-out data when the model predicts worse than the mean price. Higher is better.

**RMSE (Root Mean Squared Error):**  
Square root of average squared prediction error. Units: CHF. Lower is better.

**MAE (Mean Absolute Error):**  
Average absolute prediction error. Units: CHF. Lower is better.
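
To make the definitions concrete, the following sketch computes all three metrics by hand on three made-up prices. The helper name `regression_metrics` is hypothetical; in practice the equivalent functions in `sklearn.metrics` would normally be used.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, and MAE from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Illustrative prices in CHF (made up).
metrics = regression_metrics([1000, 2000, 3000], [1100, 1900, 3200])
print(metrics)
```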

---

## Application & Deployment

- **App Framework:** Gradio
- **App File:** [app.py](app.py)
- **Saved Model:** [models/apartment_price_model.pkl](models/apartment_price_model.pkl)
- **Deployment Platform:** Hugging Face Spaces (https://huggingface.co/spaces/nbacchi/exercise1)
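
A rough sketch of how the saved model might be loaded and queried is given below. It assumes the `.pkl` file was written with Python's `pickle` module and that the estimator expects the ten final features in the order listed above; `load_model` and `predict_price` are hypothetical helpers, not functions taken from `app.py`.

```python
import pickle
from pathlib import Path

MODEL_PATH = Path("models/apartment_price_model.pkl")


def load_model(path=MODEL_PATH):
    """Load a pickled estimator. Only unpickle files from a trusted source."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def predict_price(model, feature_row):
    """Predict one price from a list of feature values in training order."""
    return float(model.predict([feature_row])[0])


if MODEL_PATH.exists():
    model = load_model()
    print(type(model).__name__)
```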

### How to Run Locally
```bash
cd Projekt1
uv run python app.py
```

---

## Submission Checklist (Mandatory)

- [x] Trained regression model available ([models/apartment_price_model.pkl](models/apartment_price_model.pkl))
- [x] New feature(s) added (iteration 2 feature engineering)
- [x] Working web application ([app.py](app.py))
- [x] Documented iterative modeling process (2 iterations, tables + metrics)
- [x] Completed README
- [x] README uploaded to Hugging Face repository
- [x] Public application link inserted above

---

## Notes

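- Baseline $R^2$ (0.5446, Linear Regression) means the baseline already explains roughly half of the price variance
- Feature engineering and tuning provide a modest improvement of +0.0331 in $R^2$ for the Random Forest (0.5178 → 0.5509)
- The drop in the standard deviation of CV $R^2$ (0.1195 → 0.1060 for the Random Forest) indicates more consistent performance across folds
- Model is saved and deployed on Hugging Face Spaces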