ApartmentPredictor2.0 / documentation.md
lst0004's picture
Update documentation.md
d3d1c18 verified

A newer version of the Gradio SDK is available: 6.15.2

Upgrade

Documentation

Week 12: Apartment Predictor (Saved Regression Model + LLM Workflow)

Use this file to document what you built, tested, and learned in this exercise.

Do not rename this file to README.md, because README.md is needed by Hugging Face Spaces.

This file is part of the submission. Complete it after you have tested and deployed your app.


1. Project Summary

Short description of your app:

This app allows users to describe their apartment search in natural German language (e.g. number of rooms, size, and location). The system extracts structured input from the text, predicts a monthly rent using a regression model, and generates a short explanation.

The regression model predicts the estimated monthly rent in CHF. The LLM is used for two tasks: extracting structured data from free text and generating a human-readable explanation of the prediction.


2. Files Used

List the main files you worked with.

File Purpose
ai_applications_exercise2.ipynb Notebook for initial understanding and testing
app_student.py Implementation of the full pipeline & deployable app for Hugging Face
random_forest_regression.pkl Pre-trained regression model
bfs_municipality_and_tax_data.csv Municipality features used for prediction
requirements.txt Python dependencies
documentation.md This documentation: Written documentation for the submission

3. Numeric Prediction Part

3.1 Reused Model

Which saved model did you use?
random_forest_regression.pkl

What does the model predict?

The model predicts the estimated monthly rent (in CHF) for an apartment based on apartment characteristics and municipality data.

Which input features are used for prediction?

  1. rooms
  2. area_m2
  3. pop
  4. pop_dens
  5. frg_pct
  6. emp
  7. tax_income
  8. distance_to_zurich_center (not available in dataset, approximated as 0)

3.2 Prediction Logic

The user input is first converted into structured values (rooms, area_m2, town). The town is matched to the dataset, and municipality features are retrieved.

Since the trained model expects 8 features, but the dataset does not include distance_to_zurich_center, this value is approximated as 0. The full feature vector is then passed to the model for prediction.


4. LLM Extraction Part

4.1 Goal

The LLM extracts structured information from the user’s German text:

  • number of rooms
  • apartment size (m²)
  • town name

4.2 Prompt Design

The LLM is instructed with:

  • a system prompt defining the task (extraction)
  • a requirement to return strict JSON only
  • explicit keys: rooms, area_m2, town
  • numeric values for rooms and area
  • no additional text or Markdown

4.3 Expected Output Format

Document the ideal extraction output.

{
"rooms":4,
"area_m2":110,
"town":"Zürich"
}

4.4 Validation

The JSON response is validated in Python:

  • empty responses are rejected
  • JSON parsing is enforced
  • required keys are checked
  • Markdown code blocks are removed if present

This ensures robustness against LLM formatting errors.


5. LLM Explanation Part

5.1 Goal

The second LLM step generates a short explanation of the predicted rent. It does not calculate a new prediction.

5.2 Prompt Design

The prompt includes:

  • structured preferences (rooms, area, town)
  • the predicted rent value
  • instruction to respond in German
  • requirement to include one uncertainty note
  • strict JSON output with key answer

5.3 Expected Output Format

Example:

{"answer": "Die geschätzte Monatsmiete für eine 4-Zimmer-Wohnung mit 110 m² in Zürich beträgt etwa 4.310 CHF. Bitte beachten Sie, dass es sich um eine Schätzung handelt und der tatsächliche Mietpreis je nach Lage und Ausstattung variieren kann."}

6. End-to-End Pipeline

  1. User enters a German apartment request.
  2. LLM extracts rooms, area_m2, and town.
  3. Python validates the extracted values.
  4. The regression model predicts the rent.
  5. The LLM generates a short explanation.
  6. The app returns JSON, prediction, and explanation

7. Test Cases

Document at least 3 test inputs.

Test Input Extracted Output Correct? Prediction Returned? Explanation Returned? Notes
3.5 Zimmer, 80m2, Adlikon Yes Yes Yes Short input works
Ich suche eine 3.5 Zimmer Wohnung mit 80m2 in Adlikon. Yes Yes Yes Correct extraction and prediction
Ich suche eine 4 Zimmer Wohnung mit 110m2 in Zürich. Yes Yes Yes Full pipeline works

8. Errors and Problems

Problem1: Missing OpenAI API key

Cause: Environment variable not set correctly

Fix: Used OPENAI_API_KEY and environment variables

Problem2: LLM returned invalid JSON

Cause: LLM returned Markdown-formatted JSON (```json)

Fix: Removed Markdown in parser and improved prompt

Problem3: Feature mismatch (7 vs 8 features)

Cause: Model expected 8 features

Fix: Added dummy value (0) for missing feature


9. Deployment Notes

https://huggingface.co/spaces/lst0004/ApartmentPredictor2.0

9.1 Files included

  • app_student.py
  • random_forest_regression.pkl
  • bfs_municipality_and_tax_data.csv
  • requirements.txt
  • documentation.md
  • README.md
  • .gitattributes

9.2 Secrets / Environment Variables

  • OPENAI_API_KEY

9.3 Deployment Result

The app runs successfully on Hugging Face Spaces. The full pipeline works, including LLM extraction, prediction, and explanation.

9.4 Screenshots

Example 1

Dieses Beispiel zeigt eine Eingabe für Adlikon. Die Werte wurden korrekt extrahiert und eine plausible Mietschätzung erzeugt.

Example 2

Hier wird eine Anfrage für Zürich verarbeitet. Die Pipeline funktioniert ebenfalls vollständig inklusive Erklärung durch das LLM.


10. Reflection

The combination of a regression model and an LLM worked well. The LLM simplifies the user interface by allowing natural language input.

However, the system is fragile when the LLM returns incorrectly formatted JSON. Additionally, the model is limited because it does not include important factors such as apartment condition or micro-location.

German input is important because the dataset contains Swiss town names. In the future, adding more features or improving extraction robustness would improve the system.


11. Responsible Use Note

The prediction is only an estimate and should not be used as a definitive price. The model uses limited structured features and ignores important real-world factors such as condition, location within a city, and amenities.

The LLM may also extract incorrect values from user input. Users should treat the output as a rough guideline rather than an exact prediction.