---
title: ProBas RAG Assistant
emoji: 🌍
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 6.16.0
app_file: app.py
pinned: false
short_description: RAG chat over the ProBas life-cycle process database
---

# ProBas RAG Assistant

ProBas RAG Assistant is a retrieval-augmented chat app for the ProBas process dataset in `probas_processes_by_classification_rag_json`.

It loads the ProBas JSON records, builds a cached BM25 plus embedding index, and answers questions through the Academic Cloud (GWDG) OpenAI-compatible API, with a model fallback chain.

## Features

- ProBas-only ingestion and hybrid retrieval (dense embeddings + BM25)
- Cached lexical and embedding index with checkpoint/resume
- Six selectable chat models with automatic failover
- Greeting / off-topic detection so casual messages get a friendly reply instead of forced citations
- Gradio chat UI with a retrieved-evidence panel

## Setup

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # then fill in OPENAI_API_KEY
```

## Environment

- `OPENAI_API_KEY`: API key for the OpenAI-compatible endpoint (**required**)
- `OPENAI_BASE_URL`: defaults to `https://chat-ai.academiccloud.de/v1`
- `PROBAS_EMBEDDING_MODEL`: defaults to `qwen3-embedding-4b` (must be an embedding model served by the endpoint)
- `PROBAS_MAX_RECORDS`: optional record limit for smoke tests
- `PROBAS_EMBED_CONCURRENCY`: parallel embedding requests during index build (default `8`); the main lever for build speed
- `PROBAS_EMBED_BATCH_SIZE`: texts per embedding request (default `24`); lower this if you see request timeouts
- `PROBAS_EMBED_TIMEOUT_SECONDS`: per-request timeout for embeddings (default `180`)
- `PROBAS_EMBED_MAX_RETRIES`: retries before a failing batch is split in half (default `1`)
- `PROBAS_CHECKPOINT_EVERY`: save a resume checkpoint every N waves (default `10`)
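
The retry-then-split behaviour described for `PROBAS_EMBED_MAX_RETRIES` can be sketched roughly as follows. This is an illustration under assumptions, not the app's actual code: `embed_with_split` and the `embed_batch` callable are hypothetical names, and the real implementation may back off or report errors differently.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_with_split(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    max_retries: int = 1,
) -> List[Vector]:
    """Embed texts; retry a failing batch, then split it in half.

    `embed_batch` is a hypothetical callable that sends one embedding
    request and returns one vector per input text.
    """
    if not texts:
        return []
    last_error: Optional[Exception] = None
    # One initial attempt plus `max_retries` retries on the whole batch.
    for _ in range(max_retries + 1):
        try:
            return embed_batch(texts)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    if len(texts) == 1:
        # A single text cannot be split further, so surface the failure.
        raise RuntimeError("embedding failed for a single text") from last_error
    mid = len(texts) // 2
    left = embed_with_split(texts[:mid], embed_batch, max_retries)
    right = embed_with_split(texts[mid:], embed_batch, max_retries)
    return left + right
```

Halving keeps the input order intact, so the returned vectors still line up with the original texts.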

### Retrieval and answer-quality tuning

- `PROBAS_BM25_WEIGHT` / `PROBAS_VECTOR_WEIGHT`: hybrid retrieval weights (defaults `0.30` / `0.70`). The dataset is German, and the multilingual dense embedding handles cross-lingual queries (English "lignite" → German "Braunkohle"). BM25 is kept as a minority signal because, at a high weight, it tends to rank generic boilerplate highly for such queries.
- `PROBAS_MIN_RELEVANCE`: minimum cosine similarity of the best-matching record for a query to be treated as on-topic (default `0.45`). Below this threshold, the query is answered conversationally and the user is told that no matching records were found, rather than the model fabricating an answer.
- `PROBAS_MAX_CONTEXT_CHARS`: per-record excerpt fed to the model (default `5000`).
- `PROBAS_EVIDENCE_SNIPPET_CHARS`: per-record snippet shown in the UI evidence panel (default `320`, kept compact and separate from the model context).
- `PROBAS_EMBED_QUERY_INSTRUCTION`: the instruction prefix added to **queries** (not documents), as Qwen3-Embedding expects. It greatly improves cross-lingual matching (English query → German records).
- `PORT`: optional deployment port (Hugging Face Spaces uses `7860`)
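
To make the interaction between the two weights and `PROBAS_MIN_RELEVANCE` concrete, here is a minimal sketch of weighted score fusion with a relevance gate. The function names and the min-max normalisation of BM25 scores are assumptions made for illustration; the app may normalise or fuse scores differently.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; constant inputs all map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_rank(
    bm25: Dict[str, float],
    cosine: Dict[str, float],
    bm25_weight: float = 0.30,
    vector_weight: float = 0.70,
    min_relevance: float = 0.45,
) -> Tuple[bool, List[Tuple[str, float]]]:
    """Return (on_topic, ranked records) for one query.

    `bm25` and `cosine` map hypothetical record ids to raw scores.
    """
    # The gate looks only at the best dense similarity, not the fused score.
    on_topic = bool(cosine) and max(cosine.values()) >= min_relevance
    bm25_scaled = min_max(bm25)  # assumption: BM25 is min-max normalised
    record_ids = set(bm25) | set(cosine)
    fused = {
        rid: bm25_weight * bm25_scaled.get(rid, 0.0)
        + vector_weight * cosine.get(rid, 0.0)
        for rid in record_ids
    }
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return on_topic, ranked
```

With the default weights, a record that BM25 strongly prefers can still be outranked by one with a clearly higher cosine similarity, which is the intended effect of keeping BM25 a minority signal.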

### Impact numbers (`key_impacts`)

The records' `rag_text` only previews the first few exchanges, which miss the
actual emission outputs (CO₂, SO₂, NOₓ) and impact indicators (GWP/Treibhauseffekt,
cumulative energy demand). The app extracts a compact `key_impacts` block from the
raw exchanges/LCIA so the model can answer "what are the CO₂ emissions" with real
numbers. A fresh index build does this automatically; to add it to an existing
prebuilt bundle **without re-embedding**, run once:

```bash
python enrich_bundle.py
```

## Run

```bash
python app.py
```

The first launch builds the index in the background (see "Index build, checkpointing, and resume" below). On later launches the cached index loads in about 15 seconds.

## Model dropdown

The UI exposes the six strongest general-purpose chat models on the endpoint, strongest first:

1. `qwen3.5-397b-a17b`  *(default: large MoE, strong multilingual, fast thanks to 17B active params)*
2. `mistral-large-3-675b-instruct-2512`
3. `qwen3.5-122b-a10b`
4. `openai-gpt-oss-120b`
5. `deepseek-r1-distill-llama-70b`
6. `glm-4.7`

The app tries the selected model first, then falls back through the rest with retry and backoff.
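
The fallback behaviour can be sketched as below. This is a simplified illustration, not the app's code: `chat_with_fallback`, `call_model`, and the retry and delay parameters are hypothetical, and the exponential backoff schedule is an assumption.

```python
import time
from typing import Callable, List, Sequence

# Model names as listed above, strongest first.
MODELS = [
    "qwen3.5-397b-a17b",
    "mistral-large-3-675b-instruct-2512",
    "qwen3.5-122b-a10b",
    "openai-gpt-oss-120b",
    "deepseek-r1-distill-llama-70b",
    "glm-4.7",
]


def fallback_order(selected: str, models: Sequence[str] = MODELS) -> List[str]:
    """Selected model first, then the remaining models in list order."""
    return [selected] + [name for name in models if name != selected]


def chat_with_fallback(
    selected: str,
    prompt: str,
    call_model: Callable[[str, str], str],
    retries_per_model: int = 2,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in turn, retrying with exponential backoff."""
    errors: List[str] = []
    for model in fallback_order(selected):
        for attempt in range(retries_per_model):
            try:
                return call_model(model, prompt)
            except Exception as exc:  # a real client would catch narrower errors
                errors.append(f"{model}: {exc}")
                if attempt < retries_per_model - 1:
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Passing `sleep` as a parameter keeps the sketch easy to test without real delays.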

## Index build, checkpointing, and resume

On first launch the app embeds every ProBas record in the background using
`PROBAS_EMBED_CONCURRENCY` parallel requests, periodically writing a resume
checkpoint under `indexes/probas_rag/`. If the build is interrupted, the next
launch resumes from the last checkpoint instead of starting over.

Checkpoints are keyed by a fingerprint of the dataset **and the embedding model**,
so changing `PROBAS_EMBEDDING_MODEL` intentionally invalidates the old checkpoint.
Cache files from older code versions are purged automatically on startup.
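
As an illustration of why changing the embedding model invalidates a checkpoint, the fingerprint idea can be sketched like this. The function name, the choice of SHA-256, and the use of file names and sizes as the dataset signature are all assumptions; the app's actual fingerprint may be computed from different inputs.

```python
import hashlib
from pathlib import Path


def index_fingerprint(dataset_dir: Path, embedding_model: str) -> str:
    """Short hash over the embedding model name and the dataset files.

    Hypothetical sketch: the dataset signature here is each JSON file's
    relative path and size, visited in sorted order for determinism.
    """
    digest = hashlib.sha256()
    digest.update(embedding_model.encode("utf-8"))
    digest.update(b"\0")  # separator between model name and file entries
    for path in sorted(dataset_dir.rglob("*.json")):
        relative = path.relative_to(dataset_dir).as_posix()
        entry = f"{relative}|{path.stat().st_size}\n"
        digest.update(entry.encode("utf-8"))
    return digest.hexdigest()[:16]
```

A checkpoint stored alongside this value is only reused when the value computed at startup matches, so either a dataset change or a model change forces a fresh build.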

If the raw dataset directory is absent but a prebuilt bundle is present under
`indexes/probas_rag/`, the app loads that bundle directly. This is what lets a
deployment that ships only the prebuilt index (e.g. a Hugging Face Space) start
without re-embedding.

### Tracking build progress and ETA

While embedding, the app logs a live line per wave:

```
Embedded 1440/23172 records (6.2%) | 3.1 rec/s | elapsed 7m42s | ETA 1h56m
```
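
The figures in that line follow from three inputs: records done, total records, and elapsed time. A minimal sketch of the arithmetic is shown below; `progress_line` and `format_duration` are hypothetical helpers, and the duration format is inferred from the sample line rather than taken from the app's code.

```python
def format_duration(seconds: float) -> str:
    """Render seconds as '7m42s', or as '1h56m' once past one hour."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}h{minutes:02d}m"
    return f"{minutes}m{secs:02d}s"


def progress_line(done: int, total: int, elapsed_seconds: float) -> str:
    """Build a progress line from counts and elapsed time."""
    rate = done / elapsed_seconds if elapsed_seconds > 0 else 0.0
    percent = 100.0 * done / total if total else 0.0
    # ETA assumes the average rate so far holds for the remaining records.
    eta = format_duration((total - done) / rate) if rate > 0 else "unknown"
    return (
        f"Embedded {done}/{total} records ({percent:.1f}%) | "
        f"{rate:.1f} rec/s | elapsed {format_duration(elapsed_seconds)} | "
        f"ETA {eta}"
    )
```

For example, `progress_line(1440, 23172, 462)` reproduces the sample line above: 1440 records in 462 seconds is about 3.1 rec/s, which leaves roughly 1h56m for the remaining 21,732 records.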

To check durable progress (what a restart would resume from) from a second terminal:

```bash
python check_progress.py
```

## Deploying to Hugging Face Spaces

See [DEPLOY_HF.md](DEPLOY_HF.md) for the full step-by-step guide. In short:

1. Set `OPENAI_API_KEY` as a **Space secret** (never commit it).
2. Commit the prebuilt index under `indexes/probas_rag/` via Git LFS (the
   `.gitattributes` already tracks it) so the Space starts without re-embedding
   and without shipping the 1.2 GB raw dataset.
3. Push to the Space remote.

## Data and cache

The dataset folder is read directly from [probas_processes_by_classification_rag_json](probas_processes_by_classification_rag_json). The generated cache is stored under `indexes/probas_rag/` and is safe to delete when rebuilding from scratch.