# Documentation
## Week 2: Apartment Predictor (Saved Regression Model + LLM Workflow)

This file documents the apartment prediction application that combines a saved regression model with an LLM workflow.

---

## 1. Project Summary

The app accepts a German free-text apartment request, for example: "Ich suche eine 3.5-Zimmer-Wohnung mit 85 m2 in Winterthur." An LLM extracts the structured parameters `rooms`, `area_m2`, and `town` from the text. The app then uses a saved random forest regression model to estimate the monthly rent in CHF. A second LLM step turns the numeric prediction into a short German explanation with an uncertainty note.

---

## 2. Files Used

| File | Purpose |
|------|---------|
| `ai_applications_exercise2.ipynb` | Notebook work and testing during the exercise |
| `app_student.py` | Student implementation with completed TODOs |
| `app.py` | Final deployable Gradio app for Hugging Face Spaces |
| `random_forest_regression.pkl` | Saved regression model used for numeric prediction |
| `bfs_municipality_and_tax_data.csv` | Municipality features used for prediction |
| `requirements.txt` | Python dependencies for the Space |
| `README.md` | Hugging Face Space configuration and project summary |
| `documentation.md` | Written documentation for the submission |
| `screenshot1.png` | First screenshot from the running app |
| `screenshot2.png` | Second screenshot from the running app |

---

## 3. Numeric Prediction Part

### 3.1 Reused Model

**Which saved model did you use?**  
`random_forest_regression.pkl`

**What does the model predict?**  
The model predicts the estimated monthly rent of an apartment in CHF. The prediction is based on apartment size information and municipality-level data.

**Which input features are used for prediction?**

The model input uses the seven features in this exact order:

1. `rooms`
2. `area_m2`
3. `pop`
4. `pop_dens`
5. `frg_pct`
6. `emp`
7. `tax_income`

### 3.2 Prediction Logic

The user provides `rooms`, `area_m2`, and `town` in natural language. The LLM extracts these values as JSON. Python then matches the extracted town name to the canonical `bfs_name` in `bfs_municipality_and_tax_data.csv`. The app reads the municipality values `pop`, `pop_dens`, `frg_pct`, `emp`, and `tax_income` from the matching CSV row. These values are combined with `rooms` and `area_m2` into a NumPy array and passed to the saved regression model.

---

## 4. LLM Extraction Part

### 4.1 Goal

The first LLM step extracts structured model inputs from a German apartment request. It must return the number of rooms, the apartment area in square meters, and the town name.

### 4.2 Prompt Design

The extraction prompt tells the LLM that it is an extraction system for Swiss apartment wishes. The prompt requires strict JSON and explicitly lists the required keys: `rooms`, `area_m2`, and `town`. It also tells the model not to return Markdown, explanations, or additional keys. If a value is missing, the LLM is instructed to use `null`.

### 4.3 Expected Output Format

```json
{"rooms": 3.5, "area_m2": 85, "town": "Winterthur"}
```

### 4.4 Validation

The app validates the LLM output with `parse_json_response`. This function checks that the response is not empty, that it is valid JSON, and that all required keys are present. The app also converts `rooms` and `area_m2` to numbers and checks that they are greater than zero. The extracted town is matched against the BFS municipality data before the prediction is made.

---

## 5. LLM Explanation Part

### 5.1 Goal

The second LLM step generates a short natural-language answer in German. It explains the model prediction to the user. The LLM should not calculate a new price; it should use the prediction value that was produced by the regression model.

### 5.2 Prompt Design

The explanation prompt includes the structured preferences and the predicted rent. The prompt instructs the LLM to write in German, mention the number of rooms, area, and town, and include a short uncertainty note. The answer is also required to be valid JSON with the key `answer`.

### 5.3 Expected Output Format

```json
{"answer": "Für eine 3.5-Zimmer-Wohnung mit 85 m2 in Winterthur schätzt das Modell die Monatsmiete auf rund 2800 CHF. Die Schätzung ist unsicher, weil Faktoren wie Zustand, Mikrolage und Ausstattung nicht direkt berücksichtigt werden."}
```

---

## 6. End-to-End Pipeline

1. The user enters a German apartment request.
2. The LLM extracts `rooms`, `area_m2`, and `town` as JSON.
3. Python validates the JSON and converts the numeric values.
4. Python matches the town name to the BFS municipality data.
5. The app builds the seven-feature model input.
6. The saved random forest regression model predicts the monthly rent.
7. The second LLM step generates a short German explanation.
8. The Gradio app returns the extracted JSON, the predicted rent, and the final explanation text.

---

## 7. Test Cases

| Test Input | Extracted Output Correct? | Prediction Returned? | Explanation Returned? | Notes |
|------------|----------------------------|----------------------|-----------------------|-------|
| `Ich suche eine 3.5-Zimmer-Wohnung mit 85 m2 in Winterthur.` | Yes | Yes | Yes | The app extracted the apartment size and town and returned a German explanation. |
| `Wie viel kostet ungefähr eine 2-Zimmer-Wohnung mit 55 Quadratmetern in Zürich?` | Yes | Yes | Yes | The LLM extracted `rooms`, `area_m2`, and `town`; the model returned a rent estimate. |
| `Ich interessiere mich für eine 4.5 Zimmer Wohnung mit 110 m2 in Bern.` | Yes | Yes | Yes | The app handled another Swiss city and produced the full output pipeline. |

---

## 8. Errors and Problems

**Problem:** The app failed locally with `FileNotFoundError: random_forest_regression.pkl`.  
**Cause:** The saved model file was not in the same folder as `app.py`.  
**Fix:** I added `random_forest_regression.pkl` to the project folder before running and deploying the app.

**Problem:** The LLM could return invalid or incomplete JSON.  
**Cause:** LLM output is not guaranteed unless the prompt and API response format are strict.  
**Fix:** The app uses JSON mode and validates the response with `parse_json_response`.

**Problem:** Invalid or ambiguous town names may not match the BFS CSV.  
**Cause:** The model requires municipality-level features from the CSV.  
**Fix:** The app first tries exact town matching and then relaxed contains matching. Ambiguous or missing towns return a visible error.

**Problem:** The Space may fail if the API key is missing.  
**Cause:** The LLM steps require an OpenAI API key.  
**Fix:** I added `OPENAI_API_KEY` as a Hugging Face Space Secret.

---

## 9. Deployment Notes

### 9.1 Files included

The Hugging Face Space contains:

- `app.py`
- `app_student.py`
- `README.md`
- `requirements.txt`
- `documentation.md`
- `bfs_municipality_and_tax_data.csv`
- `random_forest_regression.pkl`
- `screenshot1.png`
- `screenshot2.png`

### 9.2 Secrets / Environment Variables

The app requires this Hugging Face Secret:

- `OPENAI_API_KEY`

Optional:

- `OPENAI_MODEL`, for example `gpt-4.1-mini`

### 9.3 Deployment Result

The Space was deployed as a Gradio application. After uploading all required files and setting the OpenAI API key, the app started successfully and returned the extracted JSON, the predicted rent, and the final German explanation.

### 9.4 Screenshots

Add the two screenshots from the running Hugging Face Space here after deployment:

```md
![Example 1](screenshot1.png)

In this example, the app extracted the apartment request as JSON, predicted the monthly rent, and generated a German explanation.

![Example 2](screenshot2.png)

In this example, a different German apartment request was used. The screenshot shows the extracted JSON, predicted rent, and final answer.
```

![Example 1](screenshot1.png)

In this example, the app extracted the apartment request as JSON, predicted the monthly rent, and generated a German explanation.

![Example 2](screenshot2.png)

In this example, a different German apartment request was used. The screenshot shows the extracted JSON, predicted rent, and final answer.

---

## 10. Reflection

The combination of a regression model and an LLM worked well because the LLM made the app easier to use with natural German input. The numeric prediction stayed inside the saved model, while the LLM handled text extraction and explanation. The system is still fragile because wrong town names, missing values, or invalid JSON can stop the pipeline. German input is important because the dataset contains Swiss municipality names and the expected users describe apartments in German. In a future version, I would add more apartment features such as floor, building condition, balcony, public transport access, and exact location.

---

## 11. Responsible Use Note

The predicted rent is only an estimate and should not be treated as a guaranteed market price. The model uses a limited set of structured features and does not directly include factors such as apartment condition, exact address, renovation status, balcony, view, or micro-location. The LLM may also extract values incorrectly, so the visible JSON output should always be checked. The app is useful for learning and experimentation, but not for final financial or rental decisions.