Spaces:

gumannic
/

AI_Apartment_Prediction_Chatbot

Sleeping

App Files Files Community

AI_Apartment_Prediction_Chatbot / documentation.md

gumannic

Update documentation.md

4a06cd2 verified 25 days ago

preview code

raw

history blame contribute delete

8.88 kB

A newer version of the Gradio SDK is available: 6.16.0

Upgrade

Documentation

Block 3: Apartment Predictor

1. Project Summary

The app takes a German free-text apartment request (e.g. number of rooms, area, municipality) and returns an estimated monthly rent in CHF. GPT-4o-mini (via OpenAI API) extracts the structured parameters from the text and converts them into a JSON object. A pre-trained Random Forest regression model then calculates the rent estimate. Finally, the LLM generates a clear German explanation of the result including an uncertainty note.

2. Files Used

File	Purpose
`ai_applications_exercise2.ipynb`	Notebook for developing and testing the functions
`app_student.py`	Starting file with TODOs (not deployed)
`app.py`	Final deployable app for Hugging Face Spaces
`random_forest_regression.pkl`	Pre-trained scikit-learn regression model
`bfs_municipality_and_tax_data.csv`	BFS municipality data (population, taxes, etc.)
`requirements.txt`	Python dependencies
`documentation.md`	This documentation
`README.md`	Hugging Face Spaces configuration

3. Numeric Prediction Part

3.1 Reused Model

Which saved model did you use?
random_forest_regression.pkl – a pre-trained scikit-learn RandomForestRegressor model.

What does the model predict?
The model estimates the monthly gross rent (CHF) of a Swiss apartment based on apartment characteristics and municipality statistics.

Which input features are used for prediction?

rooms – Number of rooms (e.g. 3.5)
area_m2 – Living area in square metres
pop – Population of the municipality
pop_dens – Population density (inhabitants/km²)
frg_pct – Percentage of foreign residents
emp – Number of employees in the municipality
tax_income – Taxable income (median income of the municipality)

3.2 Prediction Logic

The municipality name is matched in lowercase against the town_to_row dictionary. The five municipality features (pop, pop_dens, frg_pct, emp, tax_income) are retrieved and passed together with rooms and area_m2 as a NumPy array [[...]] to model.predict(). The result is rounded to two decimal places.

4. LLM Extraction Part

4.1 Goal

The LLM should extract the three fields rooms, area_m2 and town from a German free-text apartment request and return them as a pure JSON object.

4.2 Prompt Design

System Prompt:
The model is instructed to return exclusively a JSON object (no Markdown, no explanations) with exactly the three keys rooms (number), area_m2 (number) and town (string in German). An example JSON is included in the system prompt.

User Prompt:
The user's free text is passed directly with the instruction to extract the three values.

✅ System instruction used
✅ Strict JSON required
✅ Keys explicitly named: rooms, area_m2, town
✅ German input expected (municipality names match BFS dataset)

4.3 Expected Output Format

{"rooms": 3.5, "area_m2": 85, "town": "Winterthur"}

4.4 Validation

After the LLM call, parse_json_response() is used, which:

Removes Markdown fences (if present)
Applies json.loads() to the cleaned string
Checks whether all three keys are present

Then match_town() is called to validate the extracted municipality name against the BFS dataset. If no match is found, a clear ValueError is raised.

5. LLM Explanation Part

5.1 Goal

The second LLM step should explain the already calculated model result in plain German. The LLM does not calculate its own price – it receives the prediction value and explains it.

5.2 Prompt Design

System Prompt: The model is instructed to return exclusively JSON with the key "answer". The explanation should be 2–4 sentences in German and mention a concrete uncertainty factor (e.g. fixtures, micro-location, year of construction).

User Prompt: Contains number of rooms, area, municipality and the model-calculated rent price.

✅ Structured preferences included
✅ Prediction value explicitly passed
✅ German output required
✅ Uncertainty note required
✅ JSON output with key answer required

5.3 Expected Output Format

{"answer": "The predicted monthly rent for a 3.5-room apartment with 85.0 m² in Winterthur is 2117 CHF. This estimate may vary, as factors such as the apartment's fixtures, the exact micro-location or the year of construction can have a significant influence on the actual rent."}

6. End-to-End Pipeline

User input: The user enters a German apartment request (Gradio text field).
Extraction: extract_preferences() calls the LLM and receives rooms, area_m2, town as JSON.
Validation: Python validates the fields and matches the municipality name using match_town().
Prediction: predict_apartment_price() loads the municipality data and calls model.predict().
Explanation: generate_explanation() passes preferences + prediction to the LLM and receives a German explanation.
Output: Gradio displays the JSON extraction, estimated price and explanation text.

7. Test Cases

Test Input	Extracted correctly?	Prediction received?	Explanation received?	Notes
`Ich suche eine 3.5-Zimmer-Wohnung mit etwa 85 m² in Winterthur.`	Yes	Yes	Yes	Prediction: 2117 CHF
`4-Zimmer-Wohnung, 110 m², Zürich`	Yes	Yes	Yes	Prediction: 4029 CHF
`Ich brauche 2 Zimmer und etwa 55 m2 in Kloten.`	Yes	Yes	Yes	Smaller apartment near airport
`5 Zimmer, 150 m2 in Zug`	Yes	Yes	Yes	Affluent municipality, high rent

8. Errors and Problems

Problem 1: scikit-learn version conflict

Cause: The .pkl model was trained with a different scikit-learn version than installed on Hugging Face.
Fix: Pin scikit-learn==1.6.1 in requirements.txt.

Problem 2: LLM returns Markdown instead of pure JSON

Cause: The LLM sometimes wraps the response in ```json fences.
Fix: parse_json_response() strips Markdown fences before JSON parsing.

Problem 3: Municipality names do not match

Cause: User writes e.g. "Zürich" while the dataset entry is "Zürich (Kreis 1)", or typos occur.
Fix: match_town() first uses exact match, then substring match.

9. Deployment Notes

9.1 Files included

app.py
requirements.txt
README.md
documentation.md
random_forest_regression.pkl
bfs_municipality_and_tax_data.csv

9.2 Secrets / Environment Variables

OPENAI_API_KEY – OpenAI API key (mandatory)

9.3 Deployment Result

The Space runs successfully on Hugging Face. The Gradio UI is publicly accessible. All three output fields (JSON, price, explanation) are populated correctly.

9.4 Screenshots

Screenshot 1: Input "Ich suche eine 3.5-Zimmer-Wohnung mit etwa 85 m² in Winterthur."

Extracted JSON: rooms: 3.5, area_m2: 85, town: Winterthur – Prediction: 2117.32 CHF – The explanation mentions fixtures, micro-location and year of construction as uncertainty factors.

Screenshot 2: Input "4-Zimmer-Wohnung, 110 m², Zürich"

Extracted JSON: rooms: 4, area_m2: 110, town: Zürich – Prediction: 4028.79 CHF – The explanation highlights general market conditions in Zurich and names micro-location and year of construction as uncertainty factors.

10. Reflection

The combination of a numeric regression model and an LLM works well: the model delivers fast, consistent estimates, while the LLM makes the communication with the user friendly and natural. The biggest weakness is the dependency on correct municipality names – typos or unknown municipalities lead to errors. German input is important because the BFS municipality data is in German and the LLM should adopt the municipality names directly. Potential improvements include more robust fuzzy matching for municipality names and the ability to ask clarifying questions when information is missing.

11. Responsible Use Note

The app provides estimates only, based on aggregated municipality and apartment data – not reliable rent figures for specific properties. The model does not account for important factors such as building condition, floor level, fixtures and micro-location. LLM-based extraction may produce incorrect values for ambiguous inputs. The results should be understood as a rough guide and do not replace professional real estate advice.