---
title: Exercise1
emoji: "🏃"
colorFrom: gray
colorTo: gray
sdk: gradio
app_file: app.py
pinned: false
---

# Model Iterations Documentation

## Task: Apartment Price Prediction (Regression)

## Application Link

**Public URL (Hugging Face Space):** https://huggingface.co/spaces/nbacchi/exercise1

---

## Summary of Iterative Process

| Iteration | Objective | Key Changes | Models Used | CV Mean R² | CV Std Dev | Change in Performance | Fit Diagnosis |
|------------|------------|-------------|-------------|------------|------------|-----------------------|----------------|
| **1** | Build baseline model | - Drop missing values<br>- Remove duplicates<br>- Price filter (750–8000 CHF)<br>- Valid rooms/area filter<br>- 5-fold CV | Linear Regression<br>Random Forest (n_estimators=300) | 0.5446 (LR)<br>0.5178 (RF) | 0.1071 (LR)<br>0.1195 (RF) | Baseline | ☑ Overfitting ☐ Underfitting ☐ Good Fit |
| **2** | Improve generalization | - Feature engineering<br>- municipality_area_proxy = pop/pop_dens<br>- emp_per_resident = emp/pop<br>- foreigner_count_est = pop×(frg_pct/100)<br>- Hyperparameter tuning<br>- 5-fold CV | Ridge (alpha=1.0)<br>Tuned Random Forest (n_estimators=500, max_depth=12, min_samples_split=5, min_samples_leaf=2) | 0.5297 (Ridge)<br>0.5509 (RF) | 0.0947 (Ridge)<br>0.1060 (RF) | +0.0331 (RF vs. baseline RF) | ☐ Overfitting ☐ Underfitting ☑ Good Fit |

---

## Detailed Metrics Comparison

### Iteration 1 – Baseline

| Model | CV Mean R² | CV Std R² | CV Mean RMSE (CHF) | CV Mean MAE (CHF) |
|-------|---:|---:|---:|---:|
| Linear Regression | 0.5446 | 0.1071 | 673.00 | 468.07 |
| Random Forest | 0.5178 | 0.1195 | 698.51 | 500.13 |

### Iteration 2 – Feature Engineering

| Model | CV Mean R² | CV Std R² | CV Mean RMSE (CHF) | CV Mean MAE (CHF) |
|-------|---:|---:|---:|---:|
| Ridge | 0.5297 | 0.0947 | 682.01 | 481.08 |
| Tuned Random Forest | 0.5509 | 0.1060 | 674.54 | 473.98 |

---

## Created Features

**Iteration 2 Feature Engineering:**

- `municipality_area_proxy` = population / population density
- `emp_per_resident` = employees / population
- `foreigner_count_est` = population × (foreigner_pct / 100)

All features are reproducible from municipality-level variables and can be computed in real time in the web application.

**Labels shown in the app (German):**

- `municipality_area_proxy` → **Gemeindegröße** (municipality size)
- `emp_per_resident` → **Arbeitsplatzquote** (jobs per resident)
- `foreigner_count_est` → **Ausländerpopulation** (foreign population)

---

## Final Selected Features

**Feature Set for Final Model:**

- `rooms` – number of apartment rooms
- `area` – living area in m²
- `pop` – municipality population
- `pop_dens` – population density (per km²)
- `frg_pct` – percentage of foreign residents
- `emp` – number of employees in municipality
- `tax_income` – taxable income per capita
- `municipality_area_proxy` – proxy for geographic size
- `emp_per_resident` – economic activity indicator
- `foreigner_count_est` – estimated foreigner count

---

## Reason for Selection

**Final model:** `RandomForestRegressor` (tuned from iteration 2)

**Justification:**

- Highest cross-validated $R^2$ across all iterations (0.5509)
- Lower fold-to-fold variability than the baseline Random Forest (CV Std = 0.1060 vs. 0.1195); only Ridge is more stable (0.0947), at a lower mean $R^2$
- Feature engineering and tuning together raise the Random Forest's $R^2$ by +0.0331 (0.5178 → 0.5509), which is +0.0063 over the best baseline model (Linear Regression, 0.5446)
- Tuned hyperparameters constrain tree complexity to reduce overfitting (`max_depth=12`, `min_samples_split=5`)
- RMSE of CHF 674.54 is acceptable for the price range of CHF 750–8000

---

## Preprocessing Steps (Iteration 1 → 2)

### Data Cleaning

1. Load original dataset (apartments in canton Zurich)
2. Remove rows with missing values (`dropna()`)
3. Remove duplicate rows (`drop_duplicates()`)
4. Filter unrealistic prices: keep `750 ≤ price ≤ 8000` CHF
5. Filter invalid structures: keep `rooms > 0` and `area > 0`

### Feature Engineering (Iteration 2)

1. Compute `municipality_area_proxy` from `pop` and `pop_dens`
2. Compute `emp_per_resident` from `emp` and `pop`
3. Compute `foreigner_count_est` from `pop` and `frg_pct`
4. Combine with baseline features for final training

### Evaluation Method

- 5-fold cross-validation
- Metrics: $R^2$, RMSE, MAE
- No separate hold-out set (the full dataset is used within cross-validation)

---

## Metric Definition

**$R^2$ (Coefficient of Determination):** Proportion of variance in price explained by the features. At most 1; it can be negative on held-out data when the model predicts worse than the mean price. Higher is better.

**RMSE (Root Mean Squared Error):** Square root of the average squared prediction error. Units: CHF. Lower is better.

**MAE (Mean Absolute Error):** Average absolute prediction error. Units: CHF. Lower is better.
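The three metrics defined above can be illustrated with a small standard-library sketch. The helper name `regression_metrics` and the sample prices are made up for illustration and are not taken from the project code; in a scikit-learn workflow the equivalent values would typically come from `r2_score`, `mean_squared_error`, and `mean_absolute_error`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for one set of predictions (hypothetical helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)              # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation around the mean
    return {
        "r2": 1.0 - ss_res / ss_tot,   # undefined if all true values are identical
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Made-up monthly rents in CHF, for illustration only.
y_true = [1500.0, 2000.0, 2500.0, 3000.0]
good = regression_metrics(y_true, [1600.0, 1900.0, 2600.0, 2900.0])
bad = regression_metrics(y_true, [3000.0, 2500.0, 2000.0, 1500.0])

print(good)  # R² close to 1, RMSE and MAE of 100 CHF
print(bad)   # R² below 0: worse than always predicting the mean
```

The second call shows why $R^2$ is not bounded below by 0: predictions that are worse than the mean price produce a negative value.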
---

## Application & Deployment

- **App Framework:** Gradio
- **App File:** [app.py](app.py)
- **Saved Model:** [models/apartment_price_model.pkl](models/apartment_price_model.pkl)
- **Deployment Platform:** Hugging Face Spaces (https://huggingface.co/spaces/nbacchi/exercise1)

### How to Run Locally

```bash
cd Projekt1
uv run python app.py
```

---

## Submission Checklist (Mandatory)

- [x] Trained regression model available ([models/apartment_price_model.pkl](models/apartment_price_model.pkl))
- [x] New feature(s) added (iteration 2 feature engineering)
- [x] Working web application ([app.py](app.py))
- [x] Documented iterative modeling process (2 iterations, tables + metrics)
- [x] Completed README
- [x] README uploaded to Hugging Face repository
- [x] Public application link inserted above

---

## Notes

- The baseline R² of 0.5446 (Linear Regression) is a reasonable starting point for apartment price prediction
- Feature engineering and tuning give the Random Forest a modest $R^2$ improvement of +0.0331 (0.5178 → 0.5509)
- The drop in CV standard deviation (0.1195 → 0.1060) indicates more stable performance across folds
- Model saved and ready for deployment on Hugging Face Spaces
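
As a rough illustration of how the three engineered features listed under Created Features could be derived from raw municipality inputs at prediction time, here is a minimal sketch. The function name `engineered_features` and the sample numbers are hypothetical; the actual `app.py` may compute these values differently.

```python
def engineered_features(pop, pop_dens, emp, frg_pct):
    """Compute the iteration 2 features from municipality-level inputs.

    Hypothetical helper: the argument names follow the README's feature names.
    """
    if pop <= 0 or pop_dens <= 0:
        raise ValueError("pop and pop_dens must be positive")
    return {
        # population / density: approximate municipality area in km²
        "municipality_area_proxy": pop / pop_dens,
        # employees per inhabitant
        "emp_per_resident": emp / pop,
        # estimated number of foreign residents
        "foreigner_count_est": pop * (frg_pct / 100),
    }


# Made-up municipality values, for illustration only.
features = engineered_features(pop=10000, pop_dens=2000, emp=5000, frg_pct=25)
print(features)
```

Guarding against non-positive `pop` and `pop_dens` avoids a division by zero if a user enters an invalid value in the web form; whether the app needs this depends on how its inputs are validated.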