# Documentation

## Apartment Predictor (Saved Regression Model + LLM Workflow)

This file documents what was built, tested, and learned in this exercise. It follows the structure of the reference template from `zhaw-iwi/ai-applications-prediction-and-nlp/documentation.md`.

---

## 1. Project Summary

**Short description of your app:**

The app turns a free-text German apartment wish (e.g. *"Ich suche eine 3.5-Zimmer-Wohnung mit etwa 85 m² in Winterthur."*) into an estimated monthly rent in CHF for the Canton of Zürich. An OpenAI LLM extracts the structured fields, a saved scikit-learn `GradientBoostingRegressor` predicts the rent, and a second LLM call returns a short German explanation including one uncertainty note. The app is deployed as a Gradio Space on Hugging Face.

---

## 2. Files Used

| File | Purpose |
|------|---------|
| `app.py` | Final deployable Gradio app (LLM → model → LLM pipeline) |
| `model_gbm.pkl` | Saved scikit-learn `GradientBoostingRegressor` (12 features) |
| `municipality_lookup.csv` | Zürich municipality features used for prediction |
| `requirements.txt` | Python dependencies |
| `README.md` | Hugging Face Space metadata + project overview |
| `documentation.md` | Written documentation for the submission |

---

## 3. Numeric Prediction Part

### 3.1 Reused Model

**Which saved model did you use?**

`model_gbm.pkl` – the `GradientBoostingRegressor` trained earlier in `ai-applications/end-of-module-block-1/train_model.ipynb` (5-fold CV R² ≈ 0.73, RMSE ≈ 559 CHF).

**What does the model predict?**

The monthly gross rent in CHF for an apartment located in a municipality of the Canton of Zürich.

**Which input features are used for prediction?**

The model uses **12 features** in this exact order:

1. `rooms`
2. `area` (m²)
3. `pop` – municipality population
4. `pop_dens` – population density
5. `frg_pct` – percentage of foreign residents
6. `emp` – number of employees
7. `tax_income` – average taxable income (CHF)
8. `room_per_m2` – engineered: `area / rooms` (i.e. m² per room, despite the name)
9. `luxurious` – binary flag
10. `furnished` – binary flag
11. `zurich_city` – 1 if municipality is the City of Zürich
12. `distance_to_zurich_center` – Haversine distance to Zürich centre (km)

### 3.2 Prediction Logic

The LLM returns `rooms`, `area_m2`, `town`, plus the optional flags `luxurious` and `furnished`. The Python function `predict_apartment_price` looks up the municipality row in `municipality_lookup.csv` to pull the BFS socioeconomic features (`pop`, `pop_dens`, `frg_pct`, `emp`, `tax_income`) and the precomputed `zurich_city` / `distance_to_zurich_center` columns. `room_per_m2` is computed on the fly. The 12-column DataFrame is passed to `model.predict(...)` and the result is rounded to the nearest CHF.

---

## 4. LLM Extraction Part

### 4.1 Goal

Convert a free-form German sentence into a strict JSON object containing the values the regression model needs (`rooms`, `area_m2`, `town`) plus two optional binary flags (`luxurious`, `furnished`).

### 4.2 Prompt Design

- **System prompt** in German, naming the role (*"Du bist ein Extraktionshelfer für eine Schweizer Wohnungs-App."*).
- **Strict JSON only** is required – no Markdown, no explanation.
- **Required keys** are spelled out exactly: `rooms`, `area_m2`, `town`.
- **Optional keys** with default `false`: `luxurious`, `furnished`.
- The user prompt is the raw German wish.
- The OpenAI call uses `response_format={"type": "json_object"}` so the output is parseable JSON, and `temperature=0` so it is as reproducible as possible.

```text
Du bist ein Extraktionshelfer für eine Schweizer Wohnungs-App.
Lies den deutschen Text und gib AUSSCHLIESSLICH ein JSON-Objekt zurück.

Pflichtfelder:
- "rooms" (Zahl, z.B. 3.5)
- "area_m2" (Zahl in Quadratmetern, z.B. 85)
- "town" (Schweizer Gemeindename im Kanton Zürich, z.B. "Winterthur")

Optionale Felder (sonst false):
- "luxurious"
- "furnished"
```

### 4.3 Expected Output Format

```json
{"rooms": 3.5, "area_m2": 85, "town": "Winterthur", "luxurious": false, "furnished": false}
```

### 4.4 Validation

`parse_json_response` enforces three checks before any value is used:

1. The response is non-empty.
2. `json.loads` succeeds (otherwise the raw text is shown in the error).
3. All required keys are present.

`extract_preferences` then verifies that `rooms` and `area_m2` are not `None`, that `town` is non-empty, and that `match_town` resolves the town to a canonical `bfs_name` (case-insensitive exact match first, then a substring match against the BFS list). Any failure raises a `ValueError` that surfaces in the German error message in the UI – there is no silent regex fallback.

---

## 5. LLM Explanation Part

### 5.1 Goal

Produce a short, plain German explanation of the model's prediction. The LLM **must not recompute** the price – it only describes the result the regression model already produced.

### 5.2 Prompt Design

- **System prompt** tells the model it is explaining a rent estimate from a machine-learning model, in German.
- The user message contains a JSON payload with the structured preferences and the predicted rent in CHF.
- The model is instructed to return JSON with one key, `answer`, containing 2–4 short German sentences.
- The answer must reference the user's rooms, area, and town and must include exactly **one** uncertainty / limitation note (condition, micro location, floor, year of renovation, …).
- No Markdown formatting.

### 5.3 Expected Output Format

```json
{"answer": "Für eine 3.5-Zimmer-Wohnung mit 85 m² in Winterthur schätzt das Modell rund 2100 CHF pro Monat. Die Schätzung basiert auf Wohnfläche und Ortsmerkmalen wie Steuerkraft und Distanz zur Stadt Zürich. Eine Unsicherheit ist, dass Zustand und Stockwerk nicht im Modell enthalten sind."}
```

The `answer` string is the text shown in the *Erklärung (LLM)* textbox.


---

## 6. End-to-End Pipeline

1. The user enters a German apartment description in the textbox.
2. `extract_preferences` calls the LLM and returns a validated dict `{rooms, area_m2, town, luxurious, furnished}`.
3. Python validates the values with `parse_json_response` and `match_town` – any failure raises a clear German error.
4. `predict_apartment_price` joins the BFS lookup, builds the 12-feature row, and calls `model.predict(...)`.
5. `generate_explanation` calls the LLM again with the preferences and the prediction; the JSON `answer` field is extracted.
6. The Gradio app returns the structured preferences (JSON), the rounded CHF prediction, and the explanation text.

If any step fails, the error message is shown in the *Erklärung* field and the prediction is left empty – nothing is silently filled in.

---

## 7. Test Cases

| # | Test Input | Extracted Output Correct? | Prediction Returned? | Explanation Returned? | Notes |
|---|------------|---------------------------|----------------------|-----------------------|-------|
| 1 | `Ich suche eine 3.5-Zimmer-Wohnung mit 85 m2 in Winterthur.` | Yes | Yes (~CHF 2,100) | Yes | Baseline case from the assignment |
| 2 | `Ich brauche eine möblierte 2-Zimmer-Wohnung mit 55 m2 in Kloten.` | Yes (`furnished=true`) | Yes | Yes | Tests optional flag detection |
| 3 | `Ich hätte gerne eine luxuriöse 4.5-Zimmer-Wohnung mit 140 m2 in Küsnacht (ZH).` | Yes (`luxurious=true`) | Yes (~CHF 4,500) | Yes | Tests luxury flag and a town with parentheses |
| 4 | `Eine 5-Zimmer-Wohnung mit 130 m2 in Zürich wäre ideal.` | Yes | Yes | Yes | Tests `zurich_city=1` path |
| 5 | `Ich suche etwas in Bern.` | Pipeline raises a German error | No | Error message shown | Out-of-canton town → friendly failure, no silent fallback |

Local sanity check (calling `predict_apartment_price` directly, no LLM):

```text
3.5 rooms / 85 m² / Winterthur            → CHF 2,103
4.5 rooms / 140 m² / Küsnacht (ZH) luxury → CHF 4,462
```

---

## 8. Errors and Problems

**Problem:** First test runs returned a 132-byte `model_gbm.pkl` and `pickle.load` failed.

**Cause:** The copy of the file in `apartment-price-prediction/` was a Git LFS pointer, not the real model.

**Fix:** Use the actual 1.4 MB model from `ai-applications/end-of-module-block-1/model_gbm.pkl`.

**Problem:** First push to Hugging Face was rejected with *"contains binary files. Please use Xet to store binary files."*

**Cause:** `model_gbm.pkl` was committed as a regular blob and the HF pre-receive hook enforces Xet/LFS for `.pkl` files.

**Fix:** Reset the commit, upload the model with `hf upload --repo-type space saettsam/conversational-agent model_gbm.pkl model_gbm.pkl` (uses Xet), pull the new commit, then push the rest of the files normally.

**Problem:** Town names like `Küsnacht (ZH)` or `Zürich` (umlaut) did not match user input.

**Cause:** Strict, case-sensitive equality on the BFS list.

**Fix:** `match_town` lower-cases both sides and falls back to a substring match against the canonical `bfs_name` list.

**Problem:** Missing `OPENAI_API_KEY` on the Space crashed the app on the first user interaction with an opaque traceback.

**Cause:** The OpenAI client was being created at import time.

**Fix:** Lazy `get_openai_client()` raises a clear German error message that surfaces directly in the UI textbox.

---

## 9. Deployment Notes

### 9.1 Files included

- `app.py`
- `model_gbm.pkl` (uploaded via Xet)
- `municipality_lookup.csv`
- `requirements.txt`
- `README.md`
- `documentation.md`
- `.gitattributes`

### 9.2 Secrets / Environment Variables

Configured in **Settings → Variables and secrets** of the Space:

- `OPENAI_API_KEY` (required)
- `OPENAI_MODEL` (optional, defaults to `gpt-4o-mini`)

### 9.3 Deployment Result

The Space builds with the standard Gradio template. The model file (~1.4 MB) lives in Xet storage and loads on cold start.
After the secret is set, end-to-end latency is roughly 0.5 s for extraction, negligible for the local model prediction, and ~1 s for the explanation – about 1.5–2 s per German request in total.

### 9.4 Screenshots

Two screenshots from the running app, each showing a different German input, the *Extrahierte Eingaben (LLM)* JSON, the *Geschätzte Monatsmiete (CHF)* number, and the *Erklärung (LLM)* text:

![Example 1](Beispiel1.png)

**Example 1:** A first German apartment wish is entered. The LLM extracts the structured JSON (`rooms`, `area_m2`, `town`, plus the optional `luxurious` / `furnished` flags), the `GradientBoostingRegressor` returns a CHF rent estimate, and the second LLM call produces the short German explanation visible in the *Erklärung (LLM)* textbox – including one uncertainty note about features not contained in the model.

![Example 2](Beispiel2.png)

**Example 2:** A second German apartment wish with different rooms, area, and town is entered. Again the extracted JSON, the predicted monthly rent, and the German explanation are all visible at the same time, demonstrating that the end-to-end pipeline (LLM extraction → model prediction → LLM explanation) works for multiple inputs.

---

## 10. Reflection

Combining a regression model with an LLM gives a friendly natural-language front end without giving up the deterministic numerics – the model still owns the price. The system is most fragile when the user names a town outside the canton or omits a required value; strict JSON mode plus an explicit `match_town` check keeps those failures visible instead of producing a confidently wrong prediction. German input matters because the BFS dataset uses Swiss spellings (`Zürich`, `Küsnacht (ZH)`) that an English prompt drifts away from. The biggest missing inputs are condition, year of renovation, floor / elevator, and balcony – features that humans weigh heavily but the training data did not capture.
Next iteration: add confidence intervals from a quantile regressor and an optional clarifying question when the LLM returns `null` for `area_m2` or `rooms`.

---

## 11. Responsible Use Note

The prediction is a rough indication, not a market quote. The model was trained on a snapshot of public listings and only sees twelve structured features – condition, micro location, balcony, floor, elevator and many other rent drivers are not represented. The LLM may also misread the user text (e.g. confuse "etwa 85 m²" with another number); that is why every prediction is shown together with the extracted JSON, so the user can verify what the model actually saw. The app is intended for educational and exploratory use only and must not be used as the sole basis for any rental decision.