Spaces:

gumannic
/

AI_Apartment_Prediction_Chatbot

Sleeping

App Files Files Community

AI_Apartment_Prediction_Chatbot / documentation.md

gumannic

Update documentation.md

4a06cd2 verified 25 days ago

preview code

raw

history blame contribute delete

8.88 kB

	# Documentation
	## Block 3: Apartment Predictor

	---

	## 1. Project Summary

	The app takes a German free-text apartment request (e.g. number of rooms, area, municipality) and returns an estimated monthly rent in CHF. GPT-4o-mini (via OpenAI API) extracts the structured parameters from the text and converts them into a JSON object. A pre-trained Random Forest regression model then calculates the rent estimate. Finally, the LLM generates a clear German explanation of the result including an uncertainty note.

	---

	## 2. Files Used

	\| File \| Purpose \|
	\|------\|---------\|
	\| `ai_applications_exercise2.ipynb` \| Notebook for developing and testing the functions \|
	\| `app_student.py` \| Starting file with TODOs (not deployed) \|
	\| `app.py` \| Final deployable app for Hugging Face Spaces \|
	\| `random_forest_regression.pkl` \| Pre-trained scikit-learn regression model \|
	\| `bfs_municipality_and_tax_data.csv` \| BFS municipality data (population, taxes, etc.) \|
	\| `requirements.txt` \| Python dependencies \|
	\| `documentation.md` \| This documentation \|
	\| `README.md` \| Hugging Face Spaces configuration \|

	---

	## 3. Numeric Prediction Part

	### 3.1 Reused Model

	Which saved model did you use?
	`random_forest_regression.pkl` – a pre-trained scikit-learn `RandomForestRegressor` model.

	What does the model predict?
	The model estimates the monthly gross rent (CHF) of a Swiss apartment based on apartment characteristics and municipality statistics.

	Which input features are used for prediction?

	1. `rooms` – Number of rooms (e.g. 3.5)
	2. `area_m2` – Living area in square metres
	3. `pop` – Population of the municipality
	4. `pop_dens` – Population density (inhabitants/km²)
	5. `frg_pct` – Percentage of foreign residents
	6. `emp` – Number of employees in the municipality
	7. `tax_income` – Taxable income (median income of the municipality)

	### 3.2 Prediction Logic

	The municipality name is matched in lowercase against the `town_to_row` dictionary. The five municipality features (pop, pop_dens, frg_pct, emp, tax_income) are retrieved and passed together with `rooms` and `area_m2` as a NumPy array `[[...]]` to `model.predict()`. The result is rounded to two decimal places.

	---

	## 4. LLM Extraction Part

	### 4.1 Goal

	The LLM should extract the three fields `rooms`, `area_m2` and `town` from a German free-text apartment request and return them as a pure JSON object.

	### 4.2 Prompt Design

	System Prompt:
	The model is instructed to return exclusively a JSON object (no Markdown, no explanations) with exactly the three keys `rooms` (number), `area_m2` (number) and `town` (string in German). An example JSON is included in the system prompt.

	User Prompt:
	The user's free text is passed directly with the instruction to extract the three values.

	- ✅ System instruction used
	- ✅ Strict JSON required
	- ✅ Keys explicitly named: `rooms`, `area_m2`, `town`
	- ✅ German input expected (municipality names match BFS dataset)

	### 4.3 Expected Output Format

	```json
	{"rooms": 3.5, "area_m2": 85, "town": "Winterthur"}
	```

	### 4.4 Validation

	After the LLM call, `parse_json_response()` is used, which:
	1. Removes Markdown fences (if present)
	2. Applies `json.loads()` to the cleaned string
	3. Checks whether all three keys are present

	Then `match_town()` is called to validate the extracted municipality name against the BFS dataset. If no match is found, a clear `ValueError` is raised.

	---

	## 5. LLM Explanation Part

	### 5.1 Goal

	The second LLM step should explain the already calculated model result in plain German. The LLM does not calculate its own price – it receives the prediction value and explains it.

	### 5.2 Prompt Design

	System Prompt: The model is instructed to return exclusively JSON with the key `"answer"`. The explanation should be 2–4 sentences in German and mention a concrete uncertainty factor (e.g. fixtures, micro-location, year of construction).

	User Prompt: Contains number of rooms, area, municipality and the model-calculated rent price.

	- ✅ Structured preferences included
	- ✅ Prediction value explicitly passed
	- ✅ German output required
	- ✅ Uncertainty note required
	- ✅ JSON output with key `answer` required

	### 5.3 Expected Output Format

	```json
	{"answer": "The predicted monthly rent for a 3.5-room apartment with 85.0 m² in Winterthur is 2117 CHF. This estimate may vary, as factors such as the apartment's fixtures, the exact micro-location or the year of construction can have a significant influence on the actual rent."}
	```

	---

	## 6. End-to-End Pipeline

	1. User input: The user enters a German apartment request (Gradio text field).
	2. Extraction: `extract_preferences()` calls the LLM and receives `rooms`, `area_m2`, `town` as JSON.
	3. Validation: Python validates the fields and matches the municipality name using `match_town()`.
	4. Prediction: `predict_apartment_price()` loads the municipality data and calls `model.predict()`.
	5. Explanation: `generate_explanation()` passes preferences + prediction to the LLM and receives a German explanation.
	6. Output: Gradio displays the JSON extraction, estimated price and explanation text.

	---

	## 7. Test Cases

	\| Test Input \| Extracted correctly? \| Prediction received? \| Explanation received? \| Notes \|
	\|------------\|----------------------\|----------------------\|-----------------------\|-------\|
	\| `Ich suche eine 3.5-Zimmer-Wohnung mit etwa 85 m² in Winterthur.` \| Yes \| Yes \| Yes \| Prediction: 2117 CHF \|
	\| `4-Zimmer-Wohnung, 110 m², Zürich` \| Yes \| Yes \| Yes \| Prediction: 4029 CHF \|
	\| `Ich brauche 2 Zimmer und etwa 55 m2 in Kloten.` \| Yes \| Yes \| Yes \| Smaller apartment near airport \|
	\| `5 Zimmer, 150 m2 in Zug` \| Yes \| Yes \| Yes \| Affluent municipality, high rent \|

	---

	## 8. Errors and Problems

	Problem 1: scikit-learn version conflict
	- Cause: The `.pkl` model was trained with a different scikit-learn version than installed on Hugging Face.
	- Fix: Pin `scikit-learn==1.6.1` in `requirements.txt`.

	Problem 2: LLM returns Markdown instead of pure JSON
	- Cause: The LLM sometimes wraps the response in ` ```json ` fences.
	- Fix: `parse_json_response()` strips Markdown fences before JSON parsing.

	Problem 3: Municipality names do not match
	- Cause: User writes e.g. "Zürich" while the dataset entry is "Zürich (Kreis 1)", or typos occur.
	- Fix: `match_town()` first uses exact match, then substring match.

	---

	## 9. Deployment Notes

	### 9.1 Files included

	- `app.py`
	- `requirements.txt`
	- `README.md`
	- `documentation.md`
	- `random_forest_regression.pkl`
	- `bfs_municipality_and_tax_data.csv`

	### 9.2 Secrets / Environment Variables

	- `OPENAI_API_KEY` – OpenAI API key (mandatory)

	### 9.3 Deployment Result

	The Space runs successfully on Hugging Face. The Gradio UI is publicly accessible. All three output fields (JSON, price, explanation) are populated correctly.

	### 9.4 Screenshots

	Screenshot 1: Input "Ich suche eine 3.5-Zimmer-Wohnung mit etwa 85 m² in Winterthur."

	![Screenshot_1](https://cdn-uploads.huggingface.co/production/uploads/69b3ec7ff08fcd1e29a9308e/107UKAcbfTdfw4uRg8K7c.png)

	Extracted JSON: `rooms: 3.5, area_m2: 85, town: Winterthur` – Prediction: 2117.32 CHF – The explanation mentions fixtures, micro-location and year of construction as uncertainty factors.

	Screenshot 2: Input "4-Zimmer-Wohnung, 110 m², Zürich"

	![Screenshot_2](https://cdn-uploads.huggingface.co/production/uploads/69b3ec7ff08fcd1e29a9308e/wFZrlH8-ZxeI7tOQ0LDsq.png)

	Extracted JSON: `rooms: 4, area_m2: 110, town: Zürich` – Prediction: 4028.79 CHF – The explanation highlights general market conditions in Zurich and names micro-location and year of construction as uncertainty factors.

	---

	## 10. Reflection

	The combination of a numeric regression model and an LLM works well: the model delivers fast, consistent estimates, while the LLM makes the communication with the user friendly and natural. The biggest weakness is the dependency on correct municipality names – typos or unknown municipalities lead to errors. German input is important because the BFS municipality data is in German and the LLM should adopt the municipality names directly. Potential improvements include more robust fuzzy matching for municipality names and the ability to ask clarifying questions when information is missing.

	---

	## 11. Responsible Use Note

	The app provides estimates only, based on aggregated municipality and apartment data – not reliable rent figures for specific properties. The model does not account for important factors such as building condition, floor level, fixtures and micro-location. LLM-based extraction may produce incorrect values for ambiguous inputs. The results should be understood as a rough guide and do not replace professional real estate advice.