gumannic's picture
Update documentation.md
4a06cd2 verified

A newer version of the Gradio SDK is available: 6.16.0

Upgrade

Documentation

Block 3: Apartment Predictor


1. Project Summary

The app takes a German free-text apartment request (e.g. number of rooms, area, municipality) and returns an estimated monthly rent in CHF. GPT-4o-mini (via OpenAI API) extracts the structured parameters from the text and converts them into a JSON object. A pre-trained Random Forest regression model then calculates the rent estimate. Finally, the LLM generates a clear German explanation of the result including an uncertainty note.


2. Files Used

File Purpose
ai_applications_exercise2.ipynb Notebook for developing and testing the functions
app_student.py Starting file with TODOs (not deployed)
app.py Final deployable app for Hugging Face Spaces
random_forest_regression.pkl Pre-trained scikit-learn regression model
bfs_municipality_and_tax_data.csv BFS municipality data (population, taxes, etc.)
requirements.txt Python dependencies
documentation.md This documentation
README.md Hugging Face Spaces configuration

3. Numeric Prediction Part

3.1 Reused Model

Which saved model did you use?
random_forest_regression.pkl – a pre-trained scikit-learn RandomForestRegressor model.

What does the model predict?
The model estimates the monthly gross rent (CHF) of a Swiss apartment based on apartment characteristics and municipality statistics.

Which input features are used for prediction?

  1. rooms – Number of rooms (e.g. 3.5)
  2. area_m2 – Living area in square metres
  3. pop – Population of the municipality
  4. pop_dens – Population density (inhabitants/km²)
  5. frg_pct – Percentage of foreign residents
  6. emp – Number of employees in the municipality
  7. tax_income – Taxable income (median income of the municipality)

3.2 Prediction Logic

The municipality name is matched in lowercase against the town_to_row dictionary. The five municipality features (pop, pop_dens, frg_pct, emp, tax_income) are retrieved and passed together with rooms and area_m2 as a NumPy array [[...]] to model.predict(). The result is rounded to two decimal places.


4. LLM Extraction Part

4.1 Goal

The LLM should extract the three fields rooms, area_m2 and town from a German free-text apartment request and return them as a pure JSON object.

4.2 Prompt Design

System Prompt:
The model is instructed to return exclusively a JSON object (no Markdown, no explanations) with exactly the three keys rooms (number), area_m2 (number) and town (string in German). An example JSON is included in the system prompt.

User Prompt:
The user's free text is passed directly with the instruction to extract the three values.

  • ✅ System instruction used
  • ✅ Strict JSON required
  • ✅ Keys explicitly named: rooms, area_m2, town
  • ✅ German input expected (municipality names match BFS dataset)

4.3 Expected Output Format

{"rooms": 3.5, "area_m2": 85, "town": "Winterthur"}

4.4 Validation

After the LLM call, parse_json_response() is used, which:

  1. Removes Markdown fences (if present)
  2. Applies json.loads() to the cleaned string
  3. Checks whether all three keys are present

Then match_town() is called to validate the extracted municipality name against the BFS dataset. If no match is found, a clear ValueError is raised.


5. LLM Explanation Part

5.1 Goal

The second LLM step should explain the already calculated model result in plain German. The LLM does not calculate its own price – it receives the prediction value and explains it.

5.2 Prompt Design

System Prompt: The model is instructed to return exclusively JSON with the key "answer". The explanation should be 2–4 sentences in German and mention a concrete uncertainty factor (e.g. fixtures, micro-location, year of construction).

User Prompt: Contains number of rooms, area, municipality and the model-calculated rent price.

  • ✅ Structured preferences included
  • ✅ Prediction value explicitly passed
  • ✅ German output required
  • ✅ Uncertainty note required
  • ✅ JSON output with key answer required

5.3 Expected Output Format

{"answer": "The predicted monthly rent for a 3.5-room apartment with 85.0 m² in Winterthur is 2117 CHF. This estimate may vary, as factors such as the apartment's fixtures, the exact micro-location or the year of construction can have a significant influence on the actual rent."}

6. End-to-End Pipeline

  1. User input: The user enters a German apartment request (Gradio text field).
  2. Extraction: extract_preferences() calls the LLM and receives rooms, area_m2, town as JSON.
  3. Validation: Python validates the fields and matches the municipality name using match_town().
  4. Prediction: predict_apartment_price() loads the municipality data and calls model.predict().
  5. Explanation: generate_explanation() passes preferences + prediction to the LLM and receives a German explanation.
  6. Output: Gradio displays the JSON extraction, estimated price and explanation text.

7. Test Cases

Test Input Extracted correctly? Prediction received? Explanation received? Notes
Ich suche eine 3.5-Zimmer-Wohnung mit etwa 85 m² in Winterthur. Yes Yes Yes Prediction: 2117 CHF
4-Zimmer-Wohnung, 110 m², Zürich Yes Yes Yes Prediction: 4029 CHF
Ich brauche 2 Zimmer und etwa 55 m2 in Kloten. Yes Yes Yes Smaller apartment near airport
5 Zimmer, 150 m2 in Zug Yes Yes Yes Affluent municipality, high rent

8. Errors and Problems

Problem 1: scikit-learn version conflict

  • Cause: The .pkl model was trained with a different scikit-learn version than installed on Hugging Face.
  • Fix: Pin scikit-learn==1.6.1 in requirements.txt.

Problem 2: LLM returns Markdown instead of pure JSON

  • Cause: The LLM sometimes wraps the response in ```json fences.
  • Fix: parse_json_response() strips Markdown fences before JSON parsing.

Problem 3: Municipality names do not match

  • Cause: User writes e.g. "Zürich" while the dataset entry is "Zürich (Kreis 1)", or typos occur.
  • Fix: match_town() first uses exact match, then substring match.

9. Deployment Notes

9.1 Files included

  • app.py
  • requirements.txt
  • README.md
  • documentation.md
  • random_forest_regression.pkl
  • bfs_municipality_and_tax_data.csv

9.2 Secrets / Environment Variables

  • OPENAI_API_KEY – OpenAI API key (mandatory)

9.3 Deployment Result

The Space runs successfully on Hugging Face. The Gradio UI is publicly accessible. All three output fields (JSON, price, explanation) are populated correctly.

9.4 Screenshots

Screenshot 1: Input "Ich suche eine 3.5-Zimmer-Wohnung mit etwa 85 m² in Winterthur."

Screenshot_1

Extracted JSON: rooms: 3.5, area_m2: 85, town: Winterthur – Prediction: 2117.32 CHF – The explanation mentions fixtures, micro-location and year of construction as uncertainty factors.

Screenshot 2: Input "4-Zimmer-Wohnung, 110 m², Zürich"

Screenshot_2

Extracted JSON: rooms: 4, area_m2: 110, town: Zürich – Prediction: 4028.79 CHF – The explanation highlights general market conditions in Zurich and names micro-location and year of construction as uncertainty factors.


10. Reflection

The combination of a numeric regression model and an LLM works well: the model delivers fast, consistent estimates, while the LLM makes the communication with the user friendly and natural. The biggest weakness is the dependency on correct municipality names – typos or unknown municipalities lead to errors. German input is important because the BFS municipality data is in German and the LLM should adopt the municipality names directly. Potential improvements include more robust fuzzy matching for municipality names and the ability to ask clarifying questions when information is missing.


11. Responsible Use Note

The app provides estimates only, based on aggregated municipality and apartment data – not reliable rent figures for specific properties. The model does not account for important factors such as building condition, floor level, fixtures and micro-location. LLM-based extraction may produce incorrect values for ambiguous inputs. The results should be understood as a rough guide and do not replace professional real estate advice.