| --- |
| title: ProBas RAG Assistant |
| emoji: 🌍 |
| colorFrom: green |
| colorTo: blue |
| sdk: gradio |
| sdk_version: 6.16.0 |
| app_file: app.py |
| pinned: false |
| short_description: RAG chat over the ProBas life-cycle process database |
| --- |
| |
| # ProBas RAG Assistant |
|
|
| ProBas RAG Assistant is a retrieval-augmented chat app for the ProBas process dataset in `probas_processes_by_classification_rag_json`. |
|
|
| It loads the ProBas JSON records, builds a cached BM25 plus embedding index, and answers questions through the Academic Cloud (GWDG) OpenAI-compatible API, with a model fallback chain. |
|
|
| ## Features |
|
|
| - ProBas-only ingestion and hybrid retrieval (dense embeddings + BM25) |
| - Cached lexical and embedding index with checkpoint/resume |
| - Six selectable chat models with automatic failover |
| - Greeting / off-topic detection so casual messages get a friendly reply instead of forced citations |
| - Gradio chat UI with a retrieved-evidence panel |
|
|
| ## Setup |
|
|
| ```bash |
| python -m venv .venv |
| source .venv/bin/activate |
| pip install -r requirements.txt |
| cp .env.example .env # then fill in OPENAI_API_KEY |
| ``` |
|
|
| ## Environment |
|
|
| - `OPENAI_API_KEY`: API key for the OpenAI-compatible endpoint (**required**) |
| - `OPENAI_BASE_URL`: defaults to `https://chat-ai.academiccloud.de/v1` |
| - `PROBAS_EMBEDDING_MODEL`: defaults to `qwen3-embedding-4b` (must be an embedding model served by the endpoint) |
| - `PROBAS_MAX_RECORDS`: optional record limit for smoke tests |
| - `PROBAS_EMBED_CONCURRENCY`: parallel embedding requests during index build (default `8`); the main lever for build speed |
| - `PROBAS_EMBED_BATCH_SIZE`: texts per embedding request (default `24`); lower this if you see request timeouts |
| - `PROBAS_EMBED_TIMEOUT_SECONDS`: per-request timeout for embeddings (default `180`) |
| - `PROBAS_EMBED_MAX_RETRIES`: retries before a failing batch is split in half (default `1`) |
| - `PROBAS_CHECKPOINT_EVERY`: save a resume checkpoint every N waves (default `10`) |
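
As a rough illustration of how the variables above could be read with their documented defaults, here is a minimal sketch. The `EmbedSettings` class and `load_settings` function are hypothetical names chosen for this example, not the app's actual code; only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EmbedSettings:
    """Hypothetical settings container; field names are illustrative."""
    api_key: str
    base_url: str
    embedding_model: str
    max_records: Optional[int]
    concurrency: int
    batch_size: int
    timeout_seconds: float
    max_retries: int
    checkpoint_every: int


def load_settings(env: Mapping[str, str] = os.environ) -> EmbedSettings:
    """Read the documented variables, falling back to the documented defaults."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        # The key is the only variable marked as required.
        raise RuntimeError("OPENAI_API_KEY is required")
    raw_limit = env.get("PROBAS_MAX_RECORDS", "").strip()
    return EmbedSettings(
        api_key=api_key,
        base_url=env.get("OPENAI_BASE_URL", "https://chat-ai.academiccloud.de/v1"),
        embedding_model=env.get("PROBAS_EMBEDDING_MODEL", "qwen3-embedding-4b"),
        max_records=int(raw_limit) if raw_limit else None,  # unset means no limit
        concurrency=int(env.get("PROBAS_EMBED_CONCURRENCY", "8")),
        batch_size=int(env.get("PROBAS_EMBED_BATCH_SIZE", "24")),
        timeout_seconds=float(env.get("PROBAS_EMBED_TIMEOUT_SECONDS", "180")),
        max_retries=int(env.get("PROBAS_EMBED_MAX_RETRIES", "1")),
        checkpoint_every=int(env.get("PROBAS_CHECKPOINT_EVERY", "10")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in a test, for example `load_settings({"OPENAI_API_KEY": "k"}).batch_size` evaluates to `24`.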
|
|
| ### Retrieval and answer-quality tuning |
|
|
| - `PROBAS_BM25_WEIGHT` / `PROBAS_VECTOR_WEIGHT`: hybrid retrieval weights (defaults `0.30` / `0.70`). The dataset is German, and the multilingual dense embedding handles cross-lingual queries (English "lignite" → German "Braunkohle"). BM25 is kept as a minority signal because, at a high weight, it ranks generic boilerplate highly for such queries. |
| - `PROBAS_MIN_RELEVANCE`: minimum top cosine similarity for a query to be treated as on-topic (default `0.45`). Below it, the query is answered conversationally and the user is told no matching records were found, instead of fabricating an answer. |
| - `PROBAS_MAX_CONTEXT_CHARS`: per-record excerpt fed to the model (default `5000`). |
| - `PROBAS_EVIDENCE_SNIPPET_CHARS`: per-record snippet shown in the UI evidence panel (default `320`, kept compact and separate from the model context). |
| - `PROBAS_EMBED_QUERY_INSTRUCTION`: the instruction prefix added to **queries** (not documents), as Qwen3-Embedding expects. It greatly improves cross-lingual matching (English query → German records). |
| - `PORT`: optional deployment port (Hugging Face Spaces uses `7860`) |
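
The interplay of the two weights and the relevance threshold can be sketched as follows. This is an illustration under stated assumptions, not the app's implementation: the function names are hypothetical, and min-max normalization of both score lists before mixing is an assumption made here so that BM25 and cosine scores share a scale. The gate is applied to the raw top cosine similarity, as described for `PROBAS_MIN_RELEVANCE`.

```python
from typing import List, Sequence, Tuple


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; an assumed normalization step."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(
    bm25_scores: Sequence[float],
    cosine_scores: Sequence[float],
    bm25_weight: float = 0.30,    # PROBAS_BM25_WEIGHT default
    vector_weight: float = 0.70,  # PROBAS_VECTOR_WEIGHT default
    min_relevance: float = 0.45,  # PROBAS_MIN_RELEVANCE default
) -> Tuple[bool, List[int]]:
    """Return (on_topic, record indices ordered best first)."""
    # Gate on the best raw cosine similarity: below the threshold the query
    # is treated as off-topic and no records are returned.
    if not cosine_scores or max(cosine_scores) < min_relevance:
        return False, []
    lexical = min_max(bm25_scores)
    dense = min_max(cosine_scores)
    fused = [bm25_weight * l + vector_weight * d for l, d in zip(lexical, dense)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return True, order
```

With these defaults, a record that wins on BM25 alone does not outrank a record with a clearly better embedding match, which is the behaviour the weighting is meant to produce for cross-lingual queries.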
|
|
| ### Impact numbers (`key_impacts`) |
| |
| Each record's `rag_text` previews only the first few exchanges, so it omits the
actual emission outputs (CO₂, SO₂, NOₓ) and the impact indicators (GWP/Treibhauseffekt,
cumulative energy demand). The app therefore extracts a compact `key_impacts` block
from the raw exchanges/LCIA so the model can answer "what are the CO₂ emissions" with
real numbers. A fresh index build does this automatically; to add the block to an
existing prebuilt bundle **without re-embedding**, run once: |
|
|
| ```bash |
| python enrich_bundle.py |
| ``` |
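
To give a feel for what such an extraction might look like, here is a small sketch. It is not taken from `enrich_bundle.py`: the exchange fields (`name`, `amount`, `unit`), the keyword list, and the `build_key_impacts` function are all hypothetical, since the record schema is not described in this README.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical keyword list; the app's real selection rules are not documented here.
IMPACT_KEYWORDS = ("co2", "so2", "nox", "gwp", "treibhauseffekt")


def build_key_impacts(exchanges: Iterable[Dict[str, Any]], limit: int = 10) -> str:
    """Collect exchanges whose name matches an impact keyword into a compact text block."""
    lines: List[str] = []
    for exchange in exchanges:
        name = str(exchange.get("name", ""))
        # Fold subscript characters so "CO₂" and "CO2" match the same keyword.
        normalized = name.lower().replace("₂", "2").replace("ₓ", "x")
        if any(keyword in normalized for keyword in IMPACT_KEYWORDS):
            lines.append(f"{name}: {exchange.get('amount')} {exchange.get('unit', '')}".strip())
        if len(lines) >= limit:
            break
    return "\n".join(lines)
```

Keeping the block short (here capped by `limit`) matters because it is added to the text the model sees for every retrieved record.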
|
|
| ## Run |
|
|
| ```bash |
| python app.py |
| ``` |
|
|
| The first launch builds the index in the background (see below). On later launches the cached index loads in ~15s. |
|
|
| ## Model dropdown |
|
|
| The UI exposes the six strongest general-purpose chat models on the endpoint, strongest first: |
|
|
| 1. `qwen3.5-397b-a17b` *(default: large MoE, strong multilingual, fast with only 17B active parameters)* |
| 2. `mistral-large-3-675b-instruct-2512` |
| 3. `qwen3.5-122b-a10b` |
| 4. `openai-gpt-oss-120b` |
| 5. `deepseek-r1-distill-llama-70b` |
| 6. `glm-4.7` |
|
|
| The app tries the selected model first, then falls back through the rest with retry and backoff. |
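
The fallback behaviour can be sketched roughly as follows. This is an assumption-laden illustration: `ask` stands in for whatever function actually calls the chat endpoint, and the retry count, the exponential backoff schedule, and the function names are hypothetical choices, not the app's documented values.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple

# Model order as listed in the dropdown above.
MODELS = [
    "qwen3.5-397b-a17b",
    "mistral-large-3-675b-instruct-2512",
    "qwen3.5-122b-a10b",
    "openai-gpt-oss-120b",
    "deepseek-r1-distill-llama-70b",
    "glm-4.7",
]


def fallback_order(selected: str, models: Sequence[str] = MODELS) -> List[str]:
    """Selected model first, then the remaining models in their listed order."""
    return [selected] + [m for m in models if m != selected]


def ask_with_fallback(
    ask: Callable[[str, str], str],  # hypothetical: (model, prompt) -> answer
    selected: str,
    prompt: str,
    retries: int = 2,                # assumed attempts per model
    base_delay: float = 1.0,         # assumed first backoff delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Try each model up to `retries` times, backing off between attempts."""
    last_error: Optional[Exception] = None
    for model in fallback_order(selected):
        for attempt in range(retries):
            try:
                return model, ask(model, prompt)
            except Exception as error:  # a real client would catch narrower errors
                last_error = error
                if attempt < retries - 1:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all models failed") from last_error
```

Returning the model name alongside the answer lets the UI show which model actually responded when the selected one was unavailable.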
|
|
| ## Index build, checkpointing, and resume |
|
|
| On first launch the app embeds every ProBas record in the background using |
| `PROBAS_EMBED_CONCURRENCY` parallel requests, periodically writing a resume |
| checkpoint under `indexes/probas_rag/`. If the build is interrupted, the next |
| launch resumes from the last checkpoint instead of starting over. |
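
A simplified view of the resume logic is sketched here. It is sequential and treats one batch as one wave, whereas the app runs several requests in parallel; the JSON checkpoint layout and the function names are assumptions made for the example, and `embed_batch` is a hypothetical stand-in for the real embedding call.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence

Vector = List[float]


def save_checkpoint(path: Path, vectors: List[Vector]) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps({"vectors": vectors}), encoding="utf-8")
    tmp.replace(path)


def build_with_resume(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],  # hypothetical embedder
    checkpoint_path: Path,
    batch_size: int = 24,        # PROBAS_EMBED_BATCH_SIZE default
    checkpoint_every: int = 10,  # PROBAS_CHECKPOINT_EVERY default
) -> List[Vector]:
    """Embed texts in order, skipping whatever an earlier run already saved."""
    vectors: List[Vector] = []
    if checkpoint_path.exists():
        vectors = json.loads(checkpoint_path.read_text(encoding="utf-8"))["vectors"]
    waves_done = 0
    # Start at the first record that has no saved vector yet.
    for start in range(len(vectors), len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
        waves_done += 1
        if waves_done % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, vectors)
    save_checkpoint(checkpoint_path, vectors)
    return vectors
```

In this sketch an interruption loses at most the work done since the last saved wave, which is the trade-off `PROBAS_CHECKPOINT_EVERY` controls.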
|
|
| Checkpoints are keyed by a fingerprint of the dataset **and the embedding model**, |
| so changing `PROBAS_EMBEDDING_MODEL` intentionally invalidates the old checkpoint. |
| Cache files from older code versions are purged automatically on startup. |
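
One way to build such a fingerprint is sketched here. The exact inputs the app hashes are not documented in this README, so this example assumes file names and sizes plus the embedding model name; `index_fingerprint` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def index_fingerprint(dataset_dir: Path, embedding_model: str) -> str:
    """Hash the embedding model name with every JSON file's path and size."""
    digest = hashlib.sha256()
    digest.update(embedding_model.encode("utf-8"))
    # Sorted order keeps the digest stable across runs and file systems.
    for path in sorted(dataset_dir.rglob("*.json")):
        digest.update(b"\0")
        digest.update(path.relative_to(dataset_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(str(path.stat().st_size).encode("ascii"))
    return digest.hexdigest()
```

A checkpoint stored next to this digest can be reused only when the digest computed at startup matches, so switching the embedding model or changing the dataset forces a fresh build. Hashing sizes rather than full contents is a speed trade-off assumed for this sketch; it would miss an edit that leaves a file's size unchanged.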
|
|
| If the raw dataset directory is absent but a prebuilt bundle is present under
`indexes/probas_rag/`, the app loads that bundle directly. This lets a
deployment that ships only the prebuilt index (e.g. a Hugging Face Space) start
without re-embedding. |
|
|
| ### Tracking build progress and ETA |
|
|
| While embedding, the app logs one live progress line per wave: |
|
|
| ``` |
| Embedded 1440/23172 records (6.2%) | 3.1 rec/s | elapsed 7m42s | ETA 1h56m |
| ``` |
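
The fields in that line follow from three numbers: records done, total records, and elapsed seconds. The sketch below shows the arithmetic; `format_duration` and `progress_line` are hypothetical helpers written for this example, not the app's logging code.

```python
def format_duration(seconds: float) -> str:
    """Render seconds as 7m42s, or as 1h56m once an hour is reached."""
    whole = int(seconds)
    hours, rest = divmod(whole, 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours}h{minutes:02d}m"
    return f"{minutes}m{secs:02d}s"


def progress_line(done: int, total: int, elapsed_seconds: float) -> str:
    """Build a progress line from counts and elapsed time."""
    rate = done / elapsed_seconds if elapsed_seconds > 0 else 0.0
    # ETA assumes the average rate so far holds for the remaining records.
    eta = (total - done) / rate if rate > 0 else 0.0
    percent = 100.0 * done / total if total else 100.0
    return (
        f"Embedded {done}/{total} records ({percent:.1f}%) | {rate:.1f} rec/s | "
        f"elapsed {format_duration(elapsed_seconds)} | ETA {format_duration(eta)}"
    )
```

For the sample line above, 1440 records in 462 seconds gives about 3.1 records per second, and the remaining 21732 records at that rate take roughly 1 hour 56 minutes.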
|
|
| To check durable progress (what a restart would resume from) from a second terminal: |
|
|
| ```bash |
| python check_progress.py |
| ``` |
|
|
| ## Deploying to Hugging Face Spaces |
|
|
| See [DEPLOY_HF.md](DEPLOY_HF.md) for the full step-by-step. In short: |
|
|
| 1. Set `OPENAI_API_KEY` as a **Space secret** (never commit it). |
| 2. Commit the prebuilt index under `indexes/probas_rag/` via Git LFS (the |
| `.gitattributes` already tracks it) so the Space starts without re-embedding |
| and without shipping the 1.2 GB raw dataset. |
| 3. Push to the Space remote. |
|
|
| ## Data and cache |
|
|
| The dataset folder is read directly from [probas_processes_by_classification_rag_json](probas_processes_by_classification_rag_json). The generated cache is stored under `indexes/probas_rag/` and is safe to delete when rebuilding from scratch. |
|
|