| --- |
| title: Exercise1 |
| emoji: "🏃" |
| colorFrom: gray |
| colorTo: gray |
| sdk: gradio |
| app_file: app.py |
| pinned: false |
| --- |
| |
| # Model Iterations Documentation |
| ## Task: Apartment Price Prediction (Regression) |
|
|
| ## Application Link |
|
|
| **Public URL (Hugging Face Space):** |
|
|
| https://huggingface.co/spaces/nbacchi/exercise1 |
|
|
| --- |
|
|
| ## Summary of Iterative Process |
|
|
| | Iteration | Objective | Key Changes | Models Used | CV Mean R² | CV Std Dev | Change in Performance | Fit Diagnosis | |
| |------------|------------|-------------|-------------|------------|------------|-----------------------|----------------| |
| | **1** | Build baseline model | - Drop missing values<br>- Remove duplicates<br>- Price filter (750–8000 CHF)<br>- Valid rooms/area filter<br>- 5-fold CV | Linear Regression<br>Random Forest (n_estimators=300) | 0.5446 (LR)<br>0.5178 (RF) | 0.1071 (LR)<br>0.1195 (RF) | Baseline | ☑ Overfitting ☐ Underfitting ☐ Good Fit | |
| | **2** | Improve generalization | - Feature engineering<br>- municipality_area_proxy = pop/pop_dens<br>- emp_per_resident = emp/pop<br>- foreigner_count_est = pop×(frg_pct/100)<br>- Hyperparameter tuning<br>- 5-fold CV | Ridge (alpha=1.0)<br>Tuned Random Forest (n_estimators=500, max_depth=12, min_samples_split=5, min_samples_leaf=2) | 0.5297 (Ridge)<br>0.5509 (RF) | 0.0947 (Ridge)<br>0.1060 (RF) | +0.0331 R² (tuned RF vs. baseline RF)<br>+0.0063 R² (tuned RF vs. baseline LR) | ☐ Overfitting ☐ Underfitting ☑ Good Fit | |
| |
| --- |
| |
| ## Detailed Metrics Comparison |
| |
| ### Iteration 1 – Baseline |
| | Model | CV Mean R² | CV Std R² | CV Mean RMSE | CV Mean MAE | |
| |-------|---:|---:|---:|---:| |
| | Linear Regression | 0.5446 | 0.1071 | 673.00 | 468.07 | |
| | Random Forest | 0.5178 | 0.1195 | 698.51 | 500.13 | |
| |
| ### Iteration 2 – Feature Engineering |
| | Model | CV Mean R² | CV Std R² | CV Mean RMSE | CV Mean MAE | |
| |-------|---:|---:|---:|---:| |
| | Ridge | 0.5297 | 0.0947 | 682.01 | 481.08 | |
| | Tuned Random Forest | 0.5509 | 0.1060 | 674.54 | 473.98 | |
| |
| --- |
| |
| ## Created Features |
| |
| **Iteration 2 Feature Engineering:** |
| - `municipality_area_proxy` = population / population density |
| - `emp_per_resident` = employees / population |
| - `foreigner_count_est` = population × (foreigner_pct / 100) |
|
|
| All features are reproducible from municipality-level variables and can be computed in real-time in the web application. |
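
As a minimal sketch of how these three features could be derived for a single request, the function below takes one record as a plain dictionary. The function name and the input validation are assumptions for illustration; only the three formulas come from this README.

```python
def add_engineered_features(row: dict) -> dict:
    """Return a copy of `row` with the three iteration-2 features added.

    Expects the municipality-level keys `pop`, `pop_dens`, `emp`, `frg_pct`.
    """
    pop = row["pop"]
    pop_dens = row["pop_dens"]
    # Assumption: reject non-positive values to avoid division by zero.
    if pop <= 0 or pop_dens <= 0:
        raise ValueError("pop and pop_dens must be positive")

    out = dict(row)
    # population / (population per km²) approximates the area in km²
    out["municipality_area_proxy"] = pop / pop_dens
    out["emp_per_resident"] = row["emp"] / pop
    out["foreigner_count_est"] = pop * (row["frg_pct"] / 100)
    return out


# Hypothetical municipality, values chosen only for illustration
example = add_engineered_features(
    {"pop": 20000, "pop_dens": 2500, "emp": 8000, "frg_pct": 25}
)
print(example["municipality_area_proxy"])  # 8.0
print(example["emp_per_resident"])         # 0.4
print(example["foreigner_count_est"])      # 5000.0
```

Because the function only needs the raw municipality variables, the same code path can serve both training and the web form.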
|
|
| **Labels shown in the app (German):** |
| - `municipality_area_proxy` → **Gemeindegröße** (municipality size) |
| - `emp_per_resident` → **Arbeitsplatzquote** (jobs per resident) |
| - `foreigner_count_est` → **Ausländerpopulation** (foreign population) |
|
|
| --- |
|
|
| ## Final Selected Features |
|
|
| **Feature Set for Final Model:** |
| - `rooms` – number of apartment rooms |
| - `area` – living area in m² |
| - `pop` – municipality population |
| - `pop_dens` – population density (per km²) |
| - `frg_pct` – percentage of foreign residents |
| - `emp` – number of employees in municipality |
| - `tax_income` – taxable income per capita |
| - `municipality_area_proxy` – proxy for geographic size |
| - `emp_per_resident` – economic activity indicator |
| - `foreigner_count_est` – estimated foreigner count |
|
|
| --- |
|
|
| ## Reason for Selection |
|
|
| **Final model:** `RandomForestRegressor` (tuned from iteration 2) |
| **Justification:** |
| - Highest cross-validated $R^2$ across all iterations (0.5509) |
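| - More consistent across folds than the baseline Random Forest (CV Std = 0.1060 vs 0.1195), although Ridge varies least (0.0947) |
| - Feature engineering and tuning together raise the Random Forest's $R^2$ by +0.0331 (0.5178 → 0.5509); the gain over the best baseline model, Linear Regression (0.5446), is +0.0063 |
| - Tuned hyperparameters constrain tree growth to limit overfitting (`max_depth=12`, `min_samples_split=5`, `min_samples_leaf=2`) |
| - CV mean RMSE of CHF 674.54, on par with the best baseline (CHF 673.00), for prices in the 750–8000 CHF range |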
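
The comparison behind this selection can be re-checked from the cross-validation tables. The snippet below is only a worked calculation on the reported CV mean R² values; the dictionary keys are labels made up for this example.

```python
# CV mean R² values copied from the metrics tables in this README
cv_r2 = {
    "Linear Regression (iteration 1)": 0.5446,
    "Random Forest (iteration 1)": 0.5178,
    "Ridge (iteration 2)": 0.5297,
    "Tuned Random Forest (iteration 2)": 0.5509,
}

best_model = max(cv_r2, key=cv_r2.get)
best_r2 = cv_r2[best_model]

# Gain of the tuned forest over each baseline model
gain_vs_baseline_rf = round(best_r2 - cv_r2["Random Forest (iteration 1)"], 4)
gain_vs_baseline_lr = round(best_r2 - cv_r2["Linear Regression (iteration 1)"], 4)

print(best_model)           # Tuned Random Forest (iteration 2)
print(gain_vs_baseline_rf)  # 0.0331
print(gain_vs_baseline_lr)  # 0.0063
```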
|
|
| --- |
|
|
| ## Preprocessing Steps (Iteration 1 → 2) |
|
|
| ### Data Cleaning |
| 1. Load the original dataset (apartments in the canton of Zurich) |
| 2. Remove rows with missing values (`dropna()`) |
| 3. Remove duplicate rows (`drop_duplicates()`) |
| 4. Filter unrealistic prices: keep `750 ≤ price ≤ 8000` CHF |
| 5. Filter invalid structures: keep `rooms > 0` and `area > 0` |
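
The cleaning steps above could be sketched with pandas roughly as follows. The column names `price`, `rooms` and `area` are taken from the filters listed here; the function name and the tiny demo frame are hypothetical, and the real notebook may order or implement the steps differently.

```python
import pandas as pd


def clean_apartments(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the documented cleaning steps and return a new DataFrame."""
    cleaned = df.dropna().drop_duplicates()
    keep = (
        cleaned["price"].between(750, 8000)  # inclusive on both ends
        & (cleaned["rooms"] > 0)
        & (cleaned["area"] > 0)
    )
    return cleaned[keep].reset_index(drop=True)


# Hypothetical rows, for illustration only
raw = pd.DataFrame(
    {
        "rooms": [3.5, 3.5, 2.0, 0.0, 4.5, 1.0],
        "area": [80, 80, 55, 40, None, 30],
        "price": [2400, 2400, 9000, 1500, 3100, 750],
    }
)
cleaned = clean_apartments(raw)
print(len(raw), "->", len(cleaned))
```

In this toy frame one row is dropped for a missing area, one as a duplicate, one for a price above 8000, and one for `rooms == 0`.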
|
|
| ### Feature Engineering (Iteration 2) |
| 1. Compute `municipality_area_proxy` from `pop` and `pop_dens` |
| 2. Compute `emp_per_resident` from `emp` and `pop` |
| 3. Compute `foreigner_count_est` from `pop` and `frg_pct` |
| 4. Combine with baseline features for final training |
|
|
| ### Evaluation Method |
| - 5-fold cross-validation |
| - Metrics: $R^2$, RMSE, MAE |
| - No separate hold-out set (all cleaned data is used within cross-validation) |
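
A hedged sketch of this evaluation setup with scikit-learn is shown below. The model hyperparameters are the iteration-2 values from the summary table; the synthetic data, `random_state`, the use of unshuffled `cv=5`, and the helper names are assumptions, so the printed numbers will not match the tables in this README.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

SCORING = ("r2", "neg_root_mean_squared_error", "neg_mean_absolute_error")


def make_models():
    """Iteration-2 models with the documented hyperparameters."""
    return {
        "Ridge": Ridge(alpha=1.0),
        "Tuned Random Forest": RandomForestRegressor(
            n_estimators=500,
            max_depth=12,
            min_samples_split=5,
            min_samples_leaf=2,
            random_state=42,  # assumption: fixed seed for reproducibility
            n_jobs=-1,
        ),
    }


def evaluate(model, X, y, folds=5):
    """Run k-fold CV and return mean R², its std, mean RMSE and mean MAE."""
    scores = cross_validate(model, X, y, cv=folds, scoring=SCORING)
    return {
        "r2_mean": scores["test_r2"].mean(),
        "r2_std": scores["test_r2"].std(),
        # scikit-learn reports errors as negated scores, so flip the sign
        "rmse_mean": -scores["test_neg_root_mean_squared_error"].mean(),
        "mae_mean": -scores["test_neg_mean_absolute_error"].mean(),
    }


# Synthetic stand-in for the cleaned apartment data (10 features)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 10))
y = 1000 + 3000 * X[:, 0] + 1500 * X[:, 1] + rng.normal(0, 200, size=200)

results = {name: evaluate(model, X, y) for name, model in make_models().items()}
for name, metrics in results.items():
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```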
|
|
| --- |
|
|
| ## Metric Definition |
|
|
| **$R^2$ (Coefficient of Determination):** |
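| Proportion of variance in price explained by the model. At most 1; it can be negative on held-out data when the model predicts worse than always predicting the mean price. Higher is better. |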
|
|
| **RMSE (Root Mean Squared Error):** |
| Square root of average squared prediction error. Units: CHF. Lower is better. |
|
|
| **MAE (Mean Absolute Error):** |
| Average absolute prediction error. Units: CHF. Lower is better. |
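
To make the three definitions concrete, here is a small pure-Python sketch. The function names and the example prices are made up for illustration; in practice scikit-learn's metric functions would normally be used instead.

```python
import math


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; can be negative for very poor predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Square root of the mean squared error, in the unit of the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean of the absolute errors, in the unit of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical prices in CHF
actual = [1500, 2000, 2500, 3000]
good_guess = [1600, 1900, 2700, 2800]
bad_guess = [3000, 2500, 2000, 1500]  # ranking reversed

print(r_squared(actual, good_guess))  # 0.92
print(mae(actual, good_guess))        # 150.0
print(r_squared(actual, bad_guess))   # -3.0
```

The last line shows why the lower bound of $R^2$ is not 0: predictions that are worse than the mean price yield a negative score.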
|
|
| --- |
|
|
| ## Application & Deployment |
|
|
| - **App Framework:** Gradio |
| - **App File:** [app.py](app.py) |
| - **Saved Model:** [models/apartment_price_model.pkl](models/apartment_price_model.pkl) |
| - **Deployment Platform:** Hugging Face Spaces (https://huggingface.co/spaces/nbacchi/exercise1) |
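
For orientation, a Gradio app around the saved model might look like the sketch below. This is not the actual `app.py`: the helper names, the input labels, the feature order (assumed to follow the "Final Selected Features" list) and loading the file with `pickle` are all assumptions.

```python
import pickle

# Assumption: the model expects columns in the order of the final feature list.
FEATURE_ORDER = [
    "rooms", "area", "pop", "pop_dens", "frg_pct", "emp", "tax_income",
    "municipality_area_proxy", "emp_per_resident", "foreigner_count_est",
]


def build_feature_row(rooms, area, pop, pop_dens, frg_pct, emp, tax_income):
    """Combine raw inputs with the engineered features into one model row."""
    values = {
        "rooms": rooms, "area": area, "pop": pop, "pop_dens": pop_dens,
        "frg_pct": frg_pct, "emp": emp, "tax_income": tax_income,
        "municipality_area_proxy": pop / pop_dens,
        "emp_per_resident": emp / pop,
        "foreigner_count_est": pop * (frg_pct / 100),
    }
    return [values[name] for name in FEATURE_ORDER]


def predict_price(model, rooms, area, pop, pop_dens, frg_pct, emp, tax_income):
    """Return the predicted price in CHF for a single apartment."""
    row = build_feature_row(rooms, area, pop, pop_dens, frg_pct, emp, tax_income)
    return float(model.predict([row])[0])


def launch_app(model_path="models/apartment_price_model.pkl"):
    """Load the pickled model and start a simple Gradio interface."""
    import gradio as gr  # imported lazily so the helpers work without Gradio

    with open(model_path, "rb") as fh:
        model = pickle.load(fh)

    def predict(rooms, area, pop, pop_dens, frg_pct, emp, tax_income):
        return predict_price(model, rooms, area, pop, pop_dens, frg_pct, emp, tax_income)

    labels = [
        "Rooms", "Area (m²)", "Population", "Population density (per km²)",
        "Foreign residents (%)", "Employees", "Taxable income",
    ]
    demo = gr.Interface(
        fn=predict,
        inputs=[gr.Number(label=label) for label in labels],
        outputs=gr.Number(label="Predicted price (CHF)"),
    )
    demo.launch()
```

Calling `launch_app()` starts the interface; `build_feature_row` and `predict_price` can be unit-tested without Gradio or the model file.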
|
|
| ### How to Run Locally |
| ```bash |
| cd Projekt1 |
| uv run python app.py |
| ``` |
|
|
| --- |
|
|
| ## Submission Checklist (Mandatory) |
|
|
| - [x] Trained regression model available ([models/apartment_price_model.pkl](models/apartment_price_model.pkl)) |
| - [x] New feature(s) added (iteration 2 feature engineering) |
| - [x] Working web application ([app.py](app.py)) |
| - [x] Documented iterative modeling process (2 iterations, tables + metrics) |
| - [x] Completed README |
| - [x] README uploaded to Hugging Face repository |
| - [x] Public application link inserted above |
|
|
| --- |
|
|
| ## Notes |
|
|
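| - The baseline Linear Regression reaches a CV mean R² of 0.5446, i.e. it explains roughly 54% of the price variance on held-out folds |
| - Feature engineering and tuning give the Random Forest a modest +0.0331 improvement in $R^2$ (0.5178 → 0.5509) |
| - The drop in CV standard deviation (0.1195 → 0.1060) indicates more consistent Random Forest performance across folds |
| - Model is saved as `models/apartment_price_model.pkl` and deployed on Hugging Face Spaces |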
|
|
|
|