Exercise_3 / documentation.md
DKatheesrupan's picture
Upload 6 files
7638048 verified

Documentation

Week 2: Apartment Predictor (Saved Regression Model + LLM Workflow)

This file documents the apartment prediction application that combines a saved regression model with an LLM workflow.


1. Project Summary

The app accepts a German free-text apartment request, for example: "Ich suche eine 3.5-Zimmer-Wohnung mit 85 m2 in Winterthur." An LLM extracts the structured parameters rooms, area_m2, and town from the text. The app then uses a saved random forest regression model to estimate the monthly rent in CHF. A second LLM step turns the numeric prediction into a short German explanation with an uncertainty note.


2. Files Used

File Purpose
ai_applications_exercise2.ipynb Notebook work and testing during the exercise
app_student.py Student implementation with completed TODOs
app.py Final deployable Gradio app for Hugging Face Spaces
random_forest_regression.pkl Saved regression model used for numeric prediction
bfs_municipality_and_tax_data.csv Municipality features used for prediction
requirements.txt Python dependencies for the Space
README.md Hugging Face Space configuration and project summary
documentation.md Written documentation for the submission
screenshot1.png First screenshot from the running app
screenshot2.png Second screenshot from the running app

3. Numeric Prediction Part

3.1 Reused Model

Which saved model did you use?
random_forest_regression.pkl

What does the model predict?
The model predicts the estimated monthly rent of an apartment in CHF. The prediction is based on apartment size information and municipality-level data.

Which input features are used for prediction?

The model input uses the seven features in this exact order:

  1. rooms
  2. area_m2
  3. pop
  4. pop_dens
  5. frg_pct
  6. emp
  7. tax_income

3.2 Prediction Logic

The user provides rooms, area_m2, and town in natural language. The LLM extracts these values as JSON. Python then matches the extracted town name to the canonical bfs_name in bfs_municipality_and_tax_data.csv. The app reads the municipality values pop, pop_dens, frg_pct, emp, and tax_income from the matching CSV row. These values are combined with rooms and area_m2 into a NumPy array and passed to the saved regression model.


4. LLM Extraction Part

4.1 Goal

The first LLM step extracts structured model inputs from a German apartment request. It must return the number of rooms, the apartment area in square meters, and the town name.

4.2 Prompt Design

The extraction prompt tells the LLM that it is an extraction system for Swiss apartment wishes. The prompt requires strict JSON and explicitly lists the required keys: rooms, area_m2, and town. It also tells the model not to return Markdown, explanations, or additional keys. If a value is missing, the LLM is instructed to use null.

4.3 Expected Output Format

{"rooms": 3.5, "area_m2": 85, "town": "Winterthur"}

4.4 Validation

The app validates the LLM output with parse_json_response. This function checks that the response is not empty, that it is valid JSON, and that all required keys are present. The app also converts rooms and area_m2 to numbers and checks that they are greater than zero. The extracted town is matched against the BFS municipality data before the prediction is made.


5. LLM Explanation Part

5.1 Goal

The second LLM step generates a short natural-language answer in German. It explains the model prediction to the user. The LLM should not calculate a new price; it should use the prediction value that was produced by the regression model.

5.2 Prompt Design

The explanation prompt includes the structured preferences and the predicted rent. The prompt instructs the LLM to write in German, mention the number of rooms, area, and town, and include a short uncertainty note. The answer is also required to be valid JSON with the key answer.

5.3 Expected Output Format

{"answer": "Für eine 3.5-Zimmer-Wohnung mit 85 m2 in Winterthur schätzt das Modell die Monatsmiete auf rund 2800 CHF. Die Schätzung ist unsicher, weil Faktoren wie Zustand, Mikrolage und Ausstattung nicht direkt berücksichtigt werden."}

6. End-to-End Pipeline

  1. The user enters a German apartment request.
  2. The LLM extracts rooms, area_m2, and town as JSON.
  3. Python validates the JSON and converts the numeric values.
  4. Python matches the town name to the BFS municipality data.
  5. The app builds the seven-feature model input.
  6. The saved random forest regression model predicts the monthly rent.
  7. The second LLM step generates a short German explanation.
  8. The Gradio app returns the extracted JSON, the predicted rent, and the final explanation text.

7. Test Cases

Test Input Extracted Output Correct? Prediction Returned? Explanation Returned? Notes
Ich suche eine 3.5-Zimmer-Wohnung mit 85 m2 in Winterthur. Yes Yes Yes The app extracted the apartment size and town and returned a German explanation.
Wie viel kostet ungefähr eine 2-Zimmer-Wohnung mit 55 Quadratmetern in Zürich? Yes Yes Yes The LLM extracted rooms, area_m2, and town; the model returned a rent estimate.
Ich interessiere mich für eine 4.5 Zimmer Wohnung mit 110 m2 in Bern. Yes Yes Yes The app handled another Swiss city and produced the full output pipeline.

8. Errors and Problems

Problem: The app failed locally with FileNotFoundError: random_forest_regression.pkl.
Cause: The saved model file was not in the same folder as app.py.
Fix: I added random_forest_regression.pkl to the project folder before running and deploying the app.

Problem: The LLM could return invalid or incomplete JSON.
Cause: LLM output is not guaranteed unless the prompt and API response format are strict.
Fix: The app uses JSON mode and validates the response with parse_json_response.

Problem: Invalid or ambiguous town names may not match the BFS CSV.
Cause: The model requires municipality-level features from the CSV.
Fix: The app first tries exact town matching and then relaxed contains matching. Ambiguous or missing towns return a visible error.

Problem: The Space may fail if the API key is missing.
Cause: The LLM steps require an OpenAI API key.
Fix: I added OPENAI_API_KEY as a Hugging Face Space Secret.


9. Deployment Notes

9.1 Files included

The Hugging Face Space contains:

  • app.py
  • app_student.py
  • README.md
  • requirements.txt
  • documentation.md
  • bfs_municipality_and_tax_data.csv
  • random_forest_regression.pkl
  • screenshot1.png
  • screenshot2.png

9.2 Secrets / Environment Variables

The app requires this Hugging Face Secret:

  • OPENAI_API_KEY

Optional:

  • OPENAI_MODEL, for example gpt-4.1-mini

9.3 Deployment Result

The Space was deployed as a Gradio application. After uploading all required files and setting the OpenAI API key, the app started successfully and returned the extracted JSON, the predicted rent, and the final German explanation.

9.4 Screenshots

Add the two screenshots from the running Hugging Face Space here after deployment:

![Example 1](screenshot1.png)

In this example, the app extracted the apartment request as JSON, predicted the monthly rent, and generated a German explanation.

![Example 2](screenshot2.png)

In this example, a different German apartment request was used. The screenshot shows the extracted JSON, predicted rent, and final answer.

Example 1

In this example, the app extracted the apartment request as JSON, predicted the monthly rent, and generated a German explanation.

Example 2

In this example, a different German apartment request was used. The screenshot shows the extracted JSON, predicted rent, and final answer.


10. Reflection

The combination of a regression model and an LLM worked well because the LLM made the app easier to use with natural German input. The numeric prediction stayed inside the saved model, while the LLM handled text extraction and explanation. The system is still fragile because wrong town names, missing values, or invalid JSON can stop the pipeline. German input is important because the dataset contains Swiss municipality names and the expected users describe apartments in German. In a future version, I would add more apartment features such as floor, building condition, balcony, public transport access, and exact location.


11. Responsible Use Note

The predicted rent is only an estimate and should not be treated as a guaranteed market price. The model uses a limited set of structured features and does not directly include factors such as apartment condition, exact address, renovation status, balcony, view, or micro-location. The LLM may also extract values incorrectly, so the visible JSON output should always be checked. The app is useful for learning and experimentation, but not for final financial or rental decisions.