Documentation

Apartment Predictor Conversational Agent

Saved Regression Model + LLM Workflow

This file documents the conversational apartment rent prediction application created for the exercise.
The file is named documentation.md because Hugging Face Spaces uses README.md for the public Space description.


1. Project Summary

This project builds a conversational apartment rent prediction app for the Canton of Zurich. The user enters a natural-language apartment request in English, and the app uses an LLM to extract structured input variables from the text. These extracted variables are converted into the feature format required by my saved Random Forest regression model, which predicts the estimated monthly rent in CHF.

The app combines two components:

  1. A numeric prediction model trained for Zurich apartment rent prediction.
  2. An LLM interaction layer that extracts user preferences and explains the prediction in natural language.

The final deployed app shows:

  • the extracted JSON input,
  • the model variables used for prediction,
  • the estimated monthly rent,
  • and the final natural-language explanation.

The app was created by Mario Soler Vidal.


2. Files Used

| File | Purpose |
| --- | --- |
| app.py | Final deployable Gradio application |
| zurich_rent_rf_model.pkl | Saved Random Forest regression model used to predict monthly apartment rent |
| model_features.pkl | Ordered list of the 17 predictor columns expected by the saved model |
| requirements.txt | Python dependencies required for deployment |
| README.md | Public Hugging Face Space description |
| documentation.md | Written project documentation for the submission |

Note about the notebook

I did not use ai_applications_exercise2.ipynb as part of my final implementation.
The final deployed solution was implemented directly in app.py, using my own saved prediction model (zurich_rent_rf_model.pkl) and the corresponding feature list (model_features.pkl).


3. Numeric Prediction Part

3.1 Reused Model

Saved model used:
zurich_rent_rf_model.pkl

The reused model is a Random Forest regression model trained to predict apartment rental prices in the Canton of Zurich.

Target variable:
The model predicts the estimated monthly apartment rent in CHF.

The model does not receive raw natural-language text directly. Instead, the app converts the user request into structured variables and passes these variables to the trained model.

3.2 Model Features

The saved model uses the following 17 predictor variables:

  1. area
  2. rooms
  3. lat
  4. lon
  5. zurich_city
  6. log_emp
  7. log_tax_income
  8. log_pop_dens
  9. log_pop
  10. frg_pct
  11. luxurious
  12. furnished
  13. (ATTIKA)
  14. (EXKLUSIV)
  15. Kreis 7
  16. Kreis 8
  17. Kreis 9

These variables are loaded from:

model_features.pkl

This ensures that the prediction input is passed to the Random Forest model in the same feature order used during training.
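
A minimal sketch of how the saved column order could be enforced before prediction. It assumes that model_features.pkl holds a plain pickled Python list of column names (the serialization format is not stated in this document), and the helper name `order_features` is hypothetical. A short three-name list stands in for the real 17-name list.

```python
import pickle


def order_features(feature_values: dict, feature_names: list) -> list:
    """Return the values in the exact column order the model was trained on."""
    missing = [name for name in feature_names if name not in feature_values]
    extra = [name for name in feature_values if name not in feature_names]
    if missing or extra:
        raise ValueError(f"Feature mismatch. Missing: {missing}, unexpected: {extra}")
    return [feature_values[name] for name in feature_names]


# Stand-in for loading model_features.pkl: round-trip a short list through pickle in memory.
# Assumption: the real file contains all 17 names as a plain list.
feature_names = pickle.loads(pickle.dumps(["area", "rooms", "lat"]))

# The dict is deliberately written in a different order than the training order.
row = order_features({"lat": 47.37, "rooms": 3.5, "area": 85}, feature_names)
print(row)
```

Raising on missing or unexpected names makes a mismatch between the app and the saved model visible immediately instead of producing a silently misaligned input. With pandas, one way to get the same effect is to index a single-row DataFrame by the loaded list, for example `pd.DataFrame([feature_values])[feature_names]`.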

3.3 Why These Features Were Used

The original modelling process started from a larger set of explanatory variables. After exploratory data analysis, feature engineering, and feature selection, the model was reduced to these 17 representative predictors.

These variables were kept because they provided the best balance between:

  • predictive performance,
  • precision,
  • interpretability,
  • and reduced model complexity.

This feature reduction improves model efficiency, but it also creates an important limitation: the deployed conversational app can only predict using the variables that exist in the saved model. Any user request containing information outside this fixed feature set, such as balcony, view, exact floor, renovation quality, building age, parking, pets, garden, or proximity to public transport, cannot directly influence the prediction unless those variables are explicitly part of the trained model.

3.4 Prediction Logic

The app builds the model input in the following way:

| Feature group | Variables | Source |
| --- | --- | --- |
| Apartment size | area, rooms | Extracted from the user query by the LLM |
| Geographic location | lat, lon, zurich_city, Kreis 7, Kreis 8, Kreis 9 | Derived from the matched location |
| Socio-economic variables | log_emp, log_tax_income, log_pop_dens, log_pop, frg_pct | Derived from location defaults |
| Apartment attributes | luxurious, furnished, (ATTIKA), (EXKLUSIV) | Extracted from the user query by the LLM |

The model input is built as a pandas DataFrame with the exact 17 columns expected by the saved model.

Example model input dictionary:

feature_values = {
    "area": preferences["area"],
    "rooms": preferences["rooms"],
    "lat": loc["lat"],
    "lon": loc["lon"],
    "zurich_city": loc["zurich_city"],
    "log_emp": float(np.log1p(loc["emp"])),
    "log_tax_income": float(np.log1p(loc["tax_income"])),
    "log_pop_dens": float(np.log1p(loc["pop_dens"])),
    "log_pop": float(np.log1p(loc["pop"])),
    "frg_pct": loc["frg_pct"],
    "luxurious": preferences["luxurious"],
    "furnished": preferences["furnished"],
    "(ATTIKA)": preferences["attika"],
    "(EXKLUSIV)": preferences["exclusive"],
    "Kreis 7": loc["Kreis 7"],
    "Kreis 8": loc["Kreis 8"],
    "Kreis 9": loc["Kreis 9"],
}
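
The log-transformed entries in the dictionary above can be illustrated in isolation. The sketch below uses `math.log1p` from the standard library, which gives the same result as `np.log1p` for a single number. The helper name `log_location_features` is hypothetical, and the raw values are placeholders, not the app's real location defaults.

```python
import math


def log_location_features(loc: dict) -> dict:
    """Apply log1p to the raw socio-economic values, mirroring the dictionary above."""
    return {
        "log_emp": math.log1p(loc["emp"]),
        "log_tax_income": math.log1p(loc["tax_income"]),
        "log_pop_dens": math.log1p(loc["pop_dens"]),
        "log_pop": math.log1p(loc["pop"]),
    }


# Placeholder numbers for illustration only; they are NOT the app's real location defaults.
placeholder_loc = {"emp": 1000, "tax_income": 80000, "pop_dens": 0, "pop": 25000}
logged = log_location_features(placeholder_loc)

# log1p(0) is exactly 0.0, so a zero raw value does not break the transform.
print(logged["log_pop_dens"])

# expm1 inverts log1p, which recovers the raw value up to floating-point error.
print(round(math.expm1(logged["log_pop"])))
```

The point of `log1p` over a plain logarithm is that it stays defined at zero, which matters if any raw location value can be 0.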

4. Supported Locations

The app currently supports a selected set of Zurich locations. This is because the deployed prototype uses manually defined location defaults to provide the geographic and socio-economic variables required by the model.

Supported Locations

| Supported input | Type | Notes |
| --- | --- | --- |
| Zurich / Zurich City | Municipality / city | General Zurich city default values |
| Zurich District 7 / Kreis 7 | District inside Zurich city | Includes areas such as Fluntern, Hottingen, Hirslanden, and Witikon |
| Zurich District 8 / Kreis 8 | District inside Zurich city | Includes areas such as Seefeld and Riesbach |
| Zurich District 9 / Kreis 9 | District inside Zurich city | Includes areas such as Altstetten and Albisrieden |
| Winterthur | Municipality | Supported municipality in the Canton of Zurich |
| Uster | Municipality | Supported municipality in the Canton of Zurich |
| Dübendorf / Dubendorf | Municipality | Supported municipality in the Canton of Zurich |
| Kloten | Municipality | Supported municipality in the Canton of Zurich |

Limitation

The app does not cover all municipalities in the Canton of Zurich. It only understands the fixed set of locations defined in LOCATION_DEFAULTS.
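
One possible way to match free-text location names against a fixed set is sketched below. The canonical names come from the supported-locations table, but the alias map, the normalization rules, and the function names are assumptions for illustration; the actual structure of LOCATION_DEFAULTS in app.py may differ.

```python
import unicodedata

# Hypothetical alias map: normalized user input -> canonical supported location.
LOCATION_ALIASES = {
    "zurich": "Zurich",
    "zurich city": "Zurich",
    "zurich district 7": "Zurich District 7",
    "kreis 7": "Zurich District 7",
    "zurich district 8": "Zurich District 8",
    "kreis 8": "Zurich District 8",
    "zurich district 9": "Zurich District 9",
    "kreis 9": "Zurich District 9",
    "winterthur": "Winterthur",
    "uster": "Uster",
    "dubendorf": "Dübendorf",
    "kloten": "Kloten",
}


def normalize(text: str) -> str:
    """Lowercase, strip accents (ü becomes u), and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def match_location(raw_location: str) -> str:
    """Return the canonical location name or raise a visible error for unsupported input."""
    key = normalize(raw_location)
    if key not in LOCATION_ALIASES:
        supported = sorted(set(LOCATION_ALIASES.values()))
        raise ValueError(f"Unsupported location '{raw_location}'. Supported: {supported}")
    return LOCATION_ALIASES[key]


print(match_location("  Dübendorf "))
print(match_location("Kreis 8"))
```

Because accents are stripped before the lookup, both spellings of Dübendorf resolve to the same entry, and an unsupported name such as Basel fails with an error that lists the supported locations.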

This is one of the most important limitations of the project. Since location variables such as latitude, longitude, population, employment, tax income, population density, foreign-resident percentage, and Zurich district indicators must be passed to the model, the app needs a structured location mapping for each supported municipality. In the current prototype, this mapping is manually defined for only a small number of selected locations.

Therefore, the current app should be interpreted as a functional demo-like project that demonstrates the integration of:

LLM extraction → structured JSON → Random Forest prediction → LLM explanation

It should not be interpreted as a complete real-life Zurich apartment rent predictor.

A production-ready version would need a complete municipality-level dataset covering the Canton of Zurich and a more scalable location-matching system.


5. LLM Extraction Part

5.1 Goal

The first LLM step extracts structured apartment preferences from a free-text user query.

The user writes a request such as:

I am looking for a furnished 3.5-room apartment with 85 m2 in Zurich District 8.

The LLM converts this text into structured JSON so that Python can validate the values and pass them to the prediction pipeline.

5.2 Prompt Design

The extraction prompt instructs the LLM to behave as an information extraction assistant for a Zurich apartment rent prediction app.

The prompt requires:

  • strict JSON output,
  • no Markdown,
  • fixed required keys,
  • numeric values for area and rooms,
  • binary indicators for optional apartment attributes,
  • and null values if mandatory information is missing.

The required JSON keys are:

  • area
  • rooms
  • location
  • luxurious
  • furnished
  • attika
  • exclusive

The prompt is written for English user input because the deployed app interface is in English.
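
The requirements above could be assembled into a prompt along the following lines. This is a sketch only: the wording is illustrative and is not the actual prompt used in app.py, and `build_extraction_prompt` is a hypothetical helper name. No LLM call is made here.

```python
REQUIRED_KEYS = ["area", "rooms", "location", "luxurious", "furnished", "attika", "exclusive"]


def build_extraction_prompt(user_query: str) -> str:
    """Assemble an extraction prompt; the wording is illustrative, not the app's real prompt."""
    rules = [
        "You are an information extraction assistant for a Zurich apartment rent prediction app.",
        "Return strict JSON only, without Markdown or code fences.",
        "Use exactly these keys: " + ", ".join(REQUIRED_KEYS) + ".",
        "area and rooms must be numbers; location must be a string.",
        "luxurious, furnished, attika, and exclusive must be 0 or 1.",
        "Use null for area, rooms, or location if the user did not state them.",
    ]
    return "\n".join(rules) + "\n\nUser request: " + user_query


prompt = build_extraction_prompt(
    "I am looking for a furnished 3.5-room apartment with 85 m2 in Zurich District 8."
)
print(prompt)
```

Generating the key list from a single constant keeps the prompt and any later validation step in agreement about which keys are required.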

5.3 Expected Output Format

Example expected output:

{
  "area": 85,
  "rooms": 3.5,
  "location": "Zurich District 8",
  "luxurious": 0,
  "furnished": 1,
  "attika": 0,
  "exclusive": 0
}

5.4 Validation

The app validates the LLM output before using it for prediction.

Validation steps:

  1. Check that the LLM response is not empty.
  2. Parse the response as JSON.
  3. Check that all required keys are present.
  4. Check that area, rooms, and location are not missing.
  5. Convert area and rooms to numeric values.
  6. Convert binary variables to integers.
  7. Check that area and rooms are positive.
  8. Match the extracted location to one of the supported locations.

If validation fails, the app shows a visible error message. This follows the exercise requirement that errors should remain visible for debugging.
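
The eight validation steps can be sketched as a single function. The names `validate_extraction`, `MANDATORY_KEYS`, and `BINARY_KEYS` are hypothetical, the error messages are illustrative, and step 8 uses an exact-match lookup as a simplifying assumption; the real app may match locations more leniently.

```python
import json

REQUIRED_KEYS = ("area", "rooms", "location", "luxurious", "furnished", "attika", "exclusive")
MANDATORY_KEYS = ("area", "rooms", "location")
BINARY_KEYS = ("luxurious", "furnished", "attika", "exclusive")


def validate_extraction(raw_response: str, supported_locations: set) -> dict:
    # Step 1: the response must not be empty.
    if not raw_response or not raw_response.strip():
        raise ValueError("The LLM returned an empty response.")
    # Step 2: the response must parse as a JSON object.
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"The LLM response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("The LLM response must be a JSON object.")
    # Step 3: all required keys must be present.
    missing_keys = [key for key in REQUIRED_KEYS if key not in data]
    if missing_keys:
        raise ValueError(f"Missing keys in LLM output: {missing_keys}")
    # Step 4: mandatory values must not be null or empty.
    missing_values = [k for k in MANDATORY_KEYS if data[k] is None or data[k] == ""]
    if missing_values:
        raise ValueError(f"Please include area, rooms, and location. Missing: {missing_values}")
    # Step 5: area and rooms must be convertible to numbers.
    try:
        area, rooms = float(data["area"]), float(data["rooms"])
    except (TypeError, ValueError) as exc:
        raise ValueError("area and rooms must be numeric.") from exc
    # Step 6: binary indicators become 0 or 1 (anything other than 1/true/"1" counts as 0).
    flags = {key: 1 if data[key] in (1, True, "1") else 0 for key in BINARY_KEYS}
    # Step 7: area and rooms must be positive.
    if area <= 0 or rooms <= 0:
        raise ValueError("area and rooms must be positive.")
    # Step 8: the location must be supported (exact match here, as a simplification).
    location = str(data["location"]).strip()
    if location not in supported_locations:
        raise ValueError(f"Unsupported location: {location}")
    return {"area": area, "rooms": rooms, "location": location, **flags}


example = (
    '{"area": 85, "rooms": 3.5, "location": "Zurich District 8", '
    '"luxurious": 0, "furnished": 1, "attika": 0, "exclusive": 0}'
)
validated = validate_extraction(example, {"Zurich District 8", "Winterthur"})
print(validated)
```

Every failure raises a `ValueError` with a specific message, which the interface can display directly so that errors stay visible for debugging.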


6. LLM Explanation Part

6.1 Goal

The second LLM step generates a concise natural-language explanation of the model prediction.

Important distinction:

  • The Random Forest model predicts the numeric rent.
  • The LLM only explains the already computed prediction.
  • The LLM is not allowed to calculate a new price.

6.2 Prompt Design

The explanation prompt gives the LLM:

  • the structured user preferences,
  • the model variables used for prediction,
  • and the predicted monthly rent in CHF.

The prompt instructs the LLM to:

  • explain the result in English,
  • mention that the prediction is an estimate,
  • include one short uncertainty or limitation note,
  • return strict JSON only,
  • and use the key answer.

6.3 Expected Output Format

Example expected output:

{
  "answer": "For a furnished 3.5-room apartment with 85 m² in Zurich District 8, the model estimates a monthly rent of around 3,200 CHF. This is an estimate based on the selected apartment and location variables. One limitation is that exact micro-location and apartment condition are not fully captured by the model."
}
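
A possible way to consume this output is sketched below. `parse_explanation` reads the answer key, and `mentions_rent` is an optional guard that checks whether the explanation repeats the model's number instead of a newly invented one. Both helpers are hypothetical and are not described as part of app.py.

```python
import json
import re


def parse_explanation(raw_response: str) -> str:
    """Return the text stored under the 'answer' key, or raise if the structure is wrong."""
    data = json.loads(raw_response)
    answer = data.get("answer") if isinstance(data, dict) else None
    if not isinstance(answer, str) or not answer.strip():
        raise ValueError("Expected a JSON object with a non-empty string under 'answer'.")
    return answer.strip()


def mentions_rent(answer: str, predicted_rent: float) -> bool:
    """Check whether the rounded prediction appears in the text, ignoring thousands separators."""
    without_separators = re.sub(r"(?<=\d)[,'’](?=\d{3})", "", answer)
    return str(round(predicted_rent)) in re.findall(r"\d+", without_separators)


raw = '{"answer": "The model estimates a monthly rent of around 3,200 CHF. This is an estimate."}'
answer = parse_explanation(raw)
print(answer)
print(mentions_rent(answer, 3200.0))
```

The guard is deliberately simple: it only confirms that the predicted figure is present, so it would not catch an explanation that mentions the correct number alongside a different one.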

7. End-to-End Pipeline

The complete app pipeline works as follows:

  1. The user enters an apartment request in English.
  2. The LLM extracts structured preferences as JSON.
  3. Python validates the extracted JSON.
  4. The app matches the extracted location to one of the supported locations.
  5. The app builds the 17-feature input vector expected by the Random Forest model.
  6. The saved model predicts the estimated monthly rent in CHF.
  7. A second LLM call generates a concise English explanation.
  8. The Gradio interface displays:
    • extracted JSON,
    • model variables,
    • predicted rent,
    • final explanation.

Pipeline overview:

User query
   ↓
LLM extraction
   ↓
Validated JSON
   ↓
Location matching
   ↓
17-feature model input
   ↓
Random Forest prediction
   ↓
LLM explanation
   ↓
Final app output
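
The pipeline above can be sketched as one function that chains the steps. Each step is passed in as a callable so it can be replaced by a stub; all function names, the output keys, and the stub return values (including the dummy rent of 1234.0) are placeholders for illustration and do not come from app.py or from the real model.

```python
import json


def run_pipeline(user_query, extract, match_location, build_features, predict, explain):
    """Chain the pipeline steps; every step is injected so it can be swapped or stubbed."""
    raw_json = extract(user_query)                       # LLM extraction
    preferences = json.loads(raw_json)                   # parsing (full validation omitted here)
    location = match_location(preferences["location"])   # location matching
    features = build_features(preferences, location)     # model input
    rent = predict(features)                             # numeric prediction
    explanation = explain(preferences, features, rent)   # LLM explanation
    return {
        "extracted_json": preferences,
        "model_variables": features,
        "predicted_rent_chf": rent,
        "explanation": explanation,
    }


# Stubs standing in for the two LLM calls and the saved model.
def fake_extract(query):
    return '{"area": 85, "rooms": 3.5, "location": "Winterthur"}'


def fake_match(name):
    return name


def fake_features(prefs, location):
    return {"area": prefs["area"], "rooms": prefs["rooms"], "location": location}


def fake_predict(features):
    return 1234.0  # dummy value, not a real model prediction


def fake_explain(prefs, features, rent):
    return f"Dummy explanation for an estimated rent of {rent:.0f} CHF."


result = run_pipeline(
    "I want a 3.5-room apartment with 85 m2 in Winterthur.",
    fake_extract, fake_match, fake_features, fake_predict, fake_explain,
)
print(result["explanation"])
```

Injecting the steps keeps the numeric prediction and the two LLM calls clearly separated, and it allows the flow to be tested without an API key.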

8. Test Cases

The following test cases were used to check the complete app workflow.

| Test Input | Extracted Output Correct? | Prediction Returned? | Explanation Returned? | Notes |
| --- | --- | --- | --- | --- |
| I am looking for a furnished 3.5-room apartment with 85 m2 in Zurich District 8. | Yes | Yes | Yes | The app extracts area, rooms, location, and furnished = 1. |
| I need a luxurious 4-room apartment with 110 m2 in Zurich District 7. | Yes | Yes | Yes | The app extracts luxurious = 1 and maps the location to Zurich District 7. |
| I am looking for a 2-room apartment with about 55 m2 in Winterthur. | Yes | Yes | Yes | The app maps Winterthur as a supported municipality outside Zurich city. |
| I want a 3-room apartment in Winterthur. | No | No | No | The app correctly requests the missing apartment area. |
| I want a 3-room apartment with 80 m2 in Basel. | No | No | No | The app correctly rejects unsupported locations. |
| I want a 3-room apartment with 85 m2 in Winterthur, with a pool and close to public transport. | Partially | Yes | Yes | The app can use area, rooms, and location, but pool and transport proximity are not included in the fixed model feature set. |

9. Errors, Problems, and Critical Limitations

The deployed app is a working prototype, but the limitations described in this section prevent it from being a complete real-life rent prediction system.

Problem 1: Fixed model feature set

Problem:
The saved Random Forest model only accepts a fixed set of 17 features:

area, rooms, lat, lon, zurich_city, log_emp, log_tax_income, log_pop_dens,
log_pop, frg_pct, luxurious, furnished, (ATTIKA), (EXKLUSIV),
Kreis 7, Kreis 8, Kreis 9

Cause:
The model was trained only with these selected variables after EDA, feature engineering, and feature selection. Therefore, any variable not included in this final feature set cannot be used by the deployed model at prediction time.

Impact:
The user may mention relevant apartment characteristics such as:

  • balcony,
  • terrace,
  • garden,
  • parking,
  • pool,
  • public transport proximity,
  • exact building age,
  • apartment condition,
  • renovation quality,
  • view,
  • floor level,
  • elevator,
  • noise level,
  • exact address or micro-location.

However, unless these are part of the 17 model features, they do not directly affect the numeric prediction.

Fix / Future improvement:
A future model should be retrained with a richer feature set if these variables are available in the dataset. The LLM could then extract more user preferences and pass them to a more complete model.


Problem 2: Fixed supported municipality/location set

Problem:
The app only supports a small predefined set of locations:

Zurich, Zurich District 7, Zurich District 8, Zurich District 9,
Winterthur, Uster, Dübendorf, Kloten

Cause:
The model requires geographic and socio-economic location variables. In this prototype, these values are manually defined in LOCATION_DEFAULTS. Since this dictionary contains only selected locations, the app cannot process every municipality in the Canton of Zurich.

Impact:
This critically limits the functionality of the app. If the user enters a municipality that is not in the predefined location mapping, the app cannot generate the full model input and therefore cannot produce a valid prediction.

Fix / Future improvement:
A production-ready version should include a complete municipality-level dataset for the Canton of Zurich, containing coordinates and socio-economic variables for all municipalities. The app should then match the user’s location to this dataset automatically instead of relying on manual defaults.


Problem 3: Demo-like nature of the project

Problem:
Because the model uses a fixed set of features and the app supports only a fixed set of locations, the project should not be considered a full real-life apartment rent predictor.

Cause:
The project was designed as an educational exercise to demonstrate the combination of a saved regression model with an LLM interaction layer. The current implementation focuses on showing that the pipeline works, rather than building a complete real estate valuation system.

Impact:
The app is useful as a proof of concept, but it is not robust enough for real housing market decisions. It should be interpreted as a demo-like prototype.

Fix / Future improvement:
To become a real-world rent prediction tool, the system would need:

  • full municipality coverage,
  • more complete apartment-level features,
  • updated rental market data,
  • better micro-location modelling,
  • external validation,
  • model monitoring,
  • and a more complete error-handling system.

Problem 4: Invalid API key

Problem:
During deployment testing, the app returned an OpenAI authentication error:

401 invalid_api_key – Incorrect API key provided

Cause:
The API key used in Hugging Face Secrets was not accepted by the OpenAI API. This may happen if the key is expired, revoked, copied incorrectly, or not valid for the selected API endpoint.

Fix:
The Hugging Face Space must contain a valid OPENAI_API_KEY secret. The app should be restarted after replacing the key.


Problem 5: Missing area, rooms, or location

Problem:
The user enters an incomplete request.

Example:

I want a 3-room apartment in Winterthur.

Cause:
The model requires at least area, rooms, and location.

Fix:
The app raises a validation error and asks the user to include apartment area, number of rooms, and location.


10. Deployment Notes

10.1 Hugging Face Space

Public Application Link:

https://huggingface.co/spaces/Mariosolerzhawhugging/Block_3_Ex

10.2 Files Included

The following files are included in the Hugging Face Space:

| File | Included? | Purpose |
| --- | --- | --- |
| app.py | Yes | Main Gradio app |
| zurich_rent_rf_model.pkl | Yes | Saved Random Forest prediction model |
| model_features.pkl | Yes | Ordered list of model features |
| requirements.txt | Yes | Python dependencies |
| README.md | Yes | Public Hugging Face Space description |
| documentation.md | Yes | Submission documentation |

10.3 Secrets / Environment Variables

The app requires one private Hugging Face Secret to call the OpenAI API.

| Type | Name | Required? | Purpose |
| --- | --- | --- | --- |
| Secret | OPENAI_API_KEY | Yes | Stores the OpenAI API key used by the app to call the LLM for JSON extraction and explanation generation |
| Variable | OPENAI_MODEL | No | Optional model name. If it is not defined, the app uses gpt-4.1-mini by default |

The API key was added in Hugging Face under:

Settings → Variables and secrets → New secret

The secret name must be exactly:

OPENAI_API_KEY

This exact name is required because app.py reads the key with:

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")

The API key should not be written inside app.py, README.md, or documentation.md. It should only be stored as a private Hugging Face Secret.

The optional model variable can be added under:

Settings → Variables and secrets → New variable

with the name:

OPENAI_MODEL

If OPENAI_MODEL is not added, the app automatically uses:

gpt-4.1-mini
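
A minimal sketch of how both settings could be read at startup. The environment variable names and the gpt-4.1-mini default come from this document; the helper name `load_llm_config`, the returned dictionary, and the error message are assumptions. The `env` parameter exists only so the function can be exercised without touching the real environment.

```python
import os

DEFAULT_MODEL = "gpt-4.1-mini"


def load_llm_config(env=None) -> dict:
    """Read the API key and optional model name, failing early if the key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Add it under Settings → Variables and secrets."
        )
    # An empty or undefined OPENAI_MODEL falls back to the default model name.
    model = env.get("OPENAI_MODEL", "").strip() or DEFAULT_MODEL
    return {"api_key": api_key, "model": model}


# Usage with a plain dict standing in for the environment (the key is a dummy string).
config = load_llm_config({"OPENAI_API_KEY": "dummy-key-for-illustration"})
print(config["model"])
```

This check only catches a missing key. A key that is present but expired or revoked would still pass here and fail later with the 401 authentication error when the API is called.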

10.4 Deployment Result

The app was deployed on Hugging Face Spaces using Gradio. The user interface provides:

  • a text box for the apartment request,
  • an explanation of the 17 model features,
  • a list of supported locations,
  • an extracted JSON output,
  • a numeric rent prediction,
  • and a natural-language explanation.

When the API key is valid and the input contains area, rooms, and a supported location, the app returns the complete end-to-end output.


11. Screenshots

Because the web interface is relatively long, each test example is documented with three screenshots instead of one.
This makes it possible to clearly show the full end-to-end workflow required by the exercise.

For each example input, the screenshots should show:

  1. the user input,
  2. the extracted JSON and model variables,
  3. the predicted rent and final explanation text.

In total, this documentation includes six screenshots:

  • 3 screenshots for Example 1
  • 3 screenshots for Example 2



Example 1

Test input: I am looking for a furnished 3.5-room apartment with 85 m2 in Zurich District 8.

This example should demonstrate that the app extracts:

  • area = 85
  • rooms = 3.5
  • location = Zurich District 8
  • furnished = 1

It should also show the resulting prediction and final English explanation.

Screenshot 1.1 — User input and app interface

Example 1 - Input

This screenshot should show the text input field containing the first apartment request.

Screenshot 1.2 — Extracted JSON and model variables

Example 1 - Extracted JSON

This screenshot should show the extracted JSON and the model variables generated from the user request.

Screenshot 1.3 — Prediction and final explanation

Example 1 - Prediction and Explanation

This screenshot should show the predicted monthly rent in CHF and the final natural-language explanation.

Explanation of input text 1

For the first test case, the user entered: “I am looking for a furnished 3.5-room apartment with 85 m2 in Zurich District 8.” The LLM correctly extracted the main apartment characteristics: area = 85, rooms = 3.5, location = Zurich District 8, and furnished = 1. These values were then transformed into the 17 model features required by the saved Random Forest model. The app returned an estimated monthly rent of 3,173 CHF and generated a short English explanation clarifying that the prediction is an estimate based on the available model inputs.


Example 2

Test input: I need a luxurious 4-room apartment with 110 m2 in Zurich District 7.

This example should demonstrate that the app extracts:

  • area = 110
  • rooms = 4
  • location = Zurich District 7
  • luxurious = 1

It should also show the resulting prediction and final English explanation.

Screenshot 2.1 — User input and app interface

Example 2 - Input

This screenshot should show the text input field containing the second apartment request.

Screenshot 2.2 — Extracted JSON and model variables

Example 2 - Extracted JSON

This screenshot should show the extracted JSON and the model variables generated from the user request.

Screenshot 2.3 — Prediction and final explanation

Example 2 - Prediction and Explanation

This screenshot should show the predicted monthly rent in CHF and the final natural-language explanation.

Explanation of input text 2

For the second test case, the user entered: “I need a luxurious 4-room apartment with 110 m2 in Zurich District 7.” The LLM correctly extracted the main apartment characteristics: area = 110, rooms = 4, location = Zurich District 7, and luxurious = 1. These extracted values were converted into the fixed 17-feature input structure required by the saved Random Forest model. The app returned an estimated monthly rent of 3,752 CHF and generated a short English explanation, while also noting that the prediction is only an estimate and may vary depending on factors not captured by the model.


12. Reflection

This exercise demonstrates how a traditional machine learning model can be made easier to use through an LLM interface. The LLM allows users to describe an apartment request naturally, while the Random Forest model remains responsible for the numeric prediction. This separation is important because the LLM improves usability but does not replace the trained regression model.

The most important limitation is that the deployed model can only use the fixed features that were included during training. The second major limitation is that the app only supports a fixed set of manually defined locations. For this reason, the final app should be understood as a demo-like project that demonstrates the technical workflow, not as a full real-life apartment rent prediction product.

A future improvement would be to connect the app to a complete municipality-level dataset and retrain the model with richer apartment-level variables. This would allow the conversational agent to use more of the information users naturally provide in their queries.


13. Responsible Use Note

The predicted rent is only an estimate and should not be interpreted as a guaranteed market price. Real rental prices depend on additional factors such as exact address, apartment condition, renovation quality, floor level, view, building age, contract terms, and current market demand.

The LLM may also extract user information incorrectly. Therefore, the visible extracted JSON should always be checked before interpreting the prediction. The app is intended as an educational prototype, not as a professional real estate valuation tool.