---
title: ProBas RAG Assistant
emoji: 🌍
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 6.16.0
app_file: app.py
pinned: false
short_description: RAG chat over the ProBas life-cycle process database
---

# ProBas RAG Assistant

ProBas RAG Assistant is a retrieval-augmented chat app for the ProBas process dataset in `probas_processes_by_classification_rag_json`.

It loads the ProBas JSON records, builds a cached BM25 plus embedding index, and answers questions through the Academic Cloud (GWDG) OpenAI-compatible API, with a model fallback chain.

## Features

- ProBas-only ingestion and hybrid retrieval (dense embeddings + BM25)
- Cached lexical and embedding index with checkpoint/resume
- Six selectable chat models with automatic failover
- Greeting / off-topic detection so casual messages get a friendly reply instead of forced citations
- Gradio chat UI with a retrieved-evidence panel

## Setup

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # then fill in OPENAI_API_KEY
```

## Environment

- `OPENAI_API_KEY`: API key for the OpenAI-compatible endpoint (**required**)
- `OPENAI_BASE_URL`: defaults to `https://chat-ai.academiccloud.de/v1`
- `PROBAS_EMBEDDING_MODEL`: defaults to `qwen3-embedding-4b` (must be an embedding model served by the endpoint)
- `PROBAS_MAX_RECORDS`: optional record limit for smoke tests
- `PROBAS_EMBED_CONCURRENCY`: parallel embedding requests during index build (default `8`); the main lever for build speed
- `PROBAS_EMBED_BATCH_SIZE`: texts per embedding request (default `24`); lower this if you see request timeouts
- `PROBAS_EMBED_TIMEOUT_SECONDS`: per-request timeout for embeddings (default `180`)
- `PROBAS_EMBED_MAX_RETRIES`: retries before a failing batch is split in half (default `1`)
- `PROBAS_CHECKPOINT_EVERY`: save a resume checkpoint every N waves (default `10`)

### Retrieval and answer-quality tuning

- `PROBAS_BM25_WEIGHT` / `PROBAS_VECTOR_WEIGHT`: hybrid retrieval weights (defaults `0.30` / `0.70`). The dataset is German and the multilingual dense embedding handles cross-lingual queries (English "lignite" → German "Braunkohle"); BM25 is kept as a minority signal because at high weight it ranks generic boilerplate for such queries.
- `PROBAS_MIN_RELEVANCE`: minimum top cosine similarity for a query to be treated as on-topic (default `0.45`). Below it, the query is answered conversationally and the user is told no matching records were found, instead of fabricating an answer.
- `PROBAS_MAX_CONTEXT_CHARS`: per-record excerpt fed to the model (default `5000`).
- `PROBAS_EVIDENCE_SNIPPET_CHARS`: per-record snippet shown in the UI evidence panel (default `320`, kept compact and separate from the model context).
- `PROBAS_EMBED_QUERY_INSTRUCTION`: the instruction prefix added to **queries** (not documents), as Qwen3-Embedding expects. Greatly improves cross-lingual matching (English query → German records).
- `PORT`: optional deployment port (Hugging Face Spaces uses `7860`)

### Impact numbers (`key_impacts`)

The records' `rag_text` only previews the first few exchanges, which miss the actual emission outputs (CO₂, SO₂, NOₓ) and impact indicators (GWP/Treibhauseffekt, cumulative energy demand). The app extracts a compact `key_impacts` block from the raw exchanges/LCIA so the model can answer "what are the CO₂ emissions" with real numbers.

A fresh index build does this automatically; to add it to an existing prebuilt bundle **without re-embedding**, run once:

```bash
python enrich_bundle.py
```

## Run

```bash
python app.py
```

The first launch builds the index in the background (see below). On later launches the cached index loads in ~15s.

## Model dropdown

The UI exposes the six strongest general-purpose chat models on the endpoint, strongest first:

1. `qwen3.5-397b-a17b` *(default: large MoE, strong multilingual, fast 17B active params)*
2. `mistral-large-3-675b-instruct-2512`
3. `qwen3.5-122b-a10b`
4. `openai-gpt-oss-120b`
5. `deepseek-r1-distill-llama-70b`
6. `glm-4.7`

The app tries the selected model first, then falls back through the rest with retry and backoff.

## Index build, checkpointing, and resume

On first launch the app embeds every ProBas record in the background using `PROBAS_EMBED_CONCURRENCY` parallel requests, periodically writing a resume checkpoint under `indexes/probas_rag/`. If the build is interrupted, the next launch resumes from the last checkpoint instead of starting over.

Checkpoints are keyed by a fingerprint of the dataset **and the embedding model**, so changing `PROBAS_EMBEDDING_MODEL` intentionally invalidates the old checkpoint. Cache files from older code versions are purged automatically on startup.

If the raw dataset directory is absent but a prebuilt bundle is present under `indexes/probas_rag/`, the app loads that bundle directly. This is what makes a deployment that ships only the prebuilt index (e.g. a Hugging Face Space) work without re-embedding.

### Tracking build progress and ETA

While embedding, the app logs a live line per wave:

```
Embedded 1440/23172 records (6.2%) | 3.1 rec/s | elapsed 7m42s | ETA 1h56m
```

To check durable progress (what a restart would resume from) from a second terminal:

```bash
python check_progress.py
```

## Deploying to Hugging Face Spaces

See [DEPLOY_HF.md](DEPLOY_HF.md) for the full step-by-step. In short:

1. Set `OPENAI_API_KEY` as a **Space secret** (never commit it).
2. Commit the prebuilt index under `indexes/probas_rag/` via Git LFS (the `.gitattributes` already tracks it) so the Space starts without re-embedding and without shipping the 1.2 GB raw dataset.
3. Push to the Space remote.

## Data and cache

The dataset folder is read directly from [probas_processes_by_classification_rag_json](probas_processes_by_classification_rag_json).

The generated cache is stored under `indexes/probas_rag/` and is safe to delete when rebuilding from scratch.
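
To make the effect of the hybrid retrieval weights and the `PROBAS_MIN_RELEVANCE` gate more concrete, here is a minimal sketch of how a weighted BM25 plus cosine score could be fused and gated. This README does not describe the app's actual fusion formula, so the min-max normalization of BM25 scores, the weighted sum, and the function names `min_max`, `hybrid_rank`, and `is_on_topic` are all assumptions made for illustration; only the default values (`0.30`, `0.70`, `0.45`) come from the settings documented above.

```python
# Illustrative sketch only: the app's real scoring code may differ.
# Assumption: BM25 scores are min-max normalized to [0, 1] before being
# combined with cosine similarities in a weighted sum.

def min_max(scores):
    """Scale raw scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(bm25_scores, cosine_sims, bm25_weight=0.30, vector_weight=0.70):
    """Return record indices ordered by fused score, best first."""
    if len(bm25_scores) != len(cosine_sims):
        raise ValueError("score lists must have the same length")
    if not bm25_scores:
        return []
    bm25_norm = min_max(bm25_scores)
    fused = [
        bm25_weight * b + vector_weight * c
        for b, c in zip(bm25_norm, cosine_sims)
    ]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


def is_on_topic(cosine_sims, min_relevance=0.45):
    """Gate: treat the query as on-topic only if the best cosine clears the bar."""
    return bool(cosine_sims) and max(cosine_sims) >= min_relevance


# Hypothetical scores for three records. Record 0 has a strong lexical match
# but weak semantic similarity; record 1 is the semantically relevant one.
bm25 = [12.0, 3.0, 0.0]
cosine = [0.20, 0.62, 0.10]

print(hybrid_rank(bm25, cosine))                                       # [1, 0, 2]
print(hybrid_rank(bm25, cosine, bm25_weight=0.70, vector_weight=0.30)) # [0, 1, 2]
print(is_on_topic(cosine))                                             # True
print(is_on_topic([0.20, 0.30]))                                       # False
```

With the default weights the semantically relevant record ranks first, while flipping the weights lets the lexical match win. That is the behavior the weighting note above warns about for cross-lingual queries.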