---
title: ProBas RAG Assistant
emoji: 🌍
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 6.16.0
app_file: app.py
pinned: false
short_description: RAG chat over the ProBas life-cycle process database
---
# ProBas RAG Assistant
ProBas RAG Assistant is a retrieval-augmented chat app for the ProBas process dataset in `probas_processes_by_classification_rag_json`.
It loads the ProBas JSON records, builds a cached BM25 plus embedding index, and answers questions through the Academic Cloud (GWDG) OpenAI-compatible API, with a model fallback chain.
## Features
- ProBas-only ingestion and hybrid retrieval (dense embeddings + BM25)
- Cached lexical and embedding index with checkpoint/resume
- Six selectable chat models with automatic failover
- Greeting / off-topic detection so casual messages get a friendly reply instead of forced citations
- Gradio chat UI with a retrieved-evidence panel
## Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # then fill in OPENAI_API_KEY
```
## Environment
- `OPENAI_API_KEY`: API key for the OpenAI-compatible endpoint (**required**)
- `OPENAI_BASE_URL`: defaults to `https://chat-ai.academiccloud.de/v1`
- `PROBAS_EMBEDDING_MODEL`: defaults to `qwen3-embedding-4b` (must be an embedding model served by the endpoint)
- `PROBAS_MAX_RECORDS`: optional record limit for smoke tests
- `PROBAS_EMBED_CONCURRENCY`: parallel embedding requests during index build (default `8`); the main lever for build speed
- `PROBAS_EMBED_BATCH_SIZE`: texts per embedding request (default `24`); lower this if you see request timeouts
- `PROBAS_EMBED_TIMEOUT_SECONDS`: per-request timeout for embeddings (default `180`)
- `PROBAS_EMBED_MAX_RETRIES`: retries before a failing batch is split in half (default `1`)
- `PROBAS_CHECKPOINT_EVERY`: save a resume checkpoint every N waves (default `10`)
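The interplay between `PROBAS_EMBED_MAX_RETRIES` and batch splitting can be sketched as follows. This is a minimal illustration, not the app's actual code: `embed_with_split` and the `embed_batch` callable are hypothetical names, and the sketch assumes that "retries" means extra attempts after the first one and that a failing batch is halved recursively until it succeeds or shrinks to a single text.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_with_split(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],  # hypothetical API wrapper
    max_retries: int = 1,
) -> List[Vector]:
    """Embed texts; retry a failing batch, then split it in half and recurse."""
    last_error = None
    for _ in range(max_retries + 1):  # one initial attempt plus the retries
        try:
            return embed_batch(texts)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    if len(texts) <= 1:
        # Nothing left to split: surface the failure to the caller.
        raise RuntimeError("embedding failed for a single text") from last_error
    mid = len(texts) // 2
    # Order is preserved: left half first, then right half.
    return (
        embed_with_split(texts[:mid], embed_batch, max_retries)
        + embed_with_split(texts[mid:], embed_batch, max_retries)
    )
```

Under these assumptions, one oversized or slow batch degrades into smaller requests instead of aborting the whole build, which is also why lowering `PROBAS_EMBED_BATCH_SIZE` helps when timeouts are frequent.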
### Retrieval and answer-quality tuning
- `PROBAS_BM25_WEIGHT` / `PROBAS_VECTOR_WEIGHT`: hybrid retrieval weights (defaults `0.30` / `0.70`). The dataset is German, and the multilingual dense embedding handles cross-lingual queries (English "lignite" → German "Braunkohle"). BM25 is kept as a minority signal because, at a high weight, it tends to rank generic boilerplate highly for such queries.
- `PROBAS_MIN_RELEVANCE`: minimum top cosine similarity for a query to be treated as on-topic (default `0.45`). Below it, the query is answered conversationally and the user is told no matching records were found, instead of fabricating an answer.
- `PROBAS_MAX_CONTEXT_CHARS`: per-record excerpt fed to the model (default `5000`).
- `PROBAS_EVIDENCE_SNIPPET_CHARS`: per-record snippet shown in the UI evidence panel (default `320`, kept compact and separate from the model context).
- `PROBAS_EMBED_QUERY_INSTRUCTION`: the instruction prefix added to **queries** (not documents), as Qwen3-Embedding expects. Greatly improves cross-lingual matching (English query → German records).
- `PORT`: optional deployment port (Hugging Face Spaces uses `7860`)
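To illustrate how the hybrid weights and `PROBAS_MIN_RELEVANCE` could fit together, here is a minimal sketch. It is an assumption about the scoring, not the app's confirmed implementation: the function names are hypothetical, BM25 scores are min-max normalized before fusion, and cosine similarities are used as-is.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]; all-equal scores map to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (value - lo) / (hi - lo) for doc, value in scores.items()}


def hybrid_rank(
    bm25: Dict[str, float],
    cosine: Dict[str, float],
    bm25_weight: float = 0.30,    # PROBAS_BM25_WEIGHT default
    vector_weight: float = 0.70,  # PROBAS_VECTOR_WEIGHT default
    min_relevance: float = 0.45,  # PROBAS_MIN_RELEVANCE default
) -> Tuple[bool, List[Tuple[str, float]]]:
    """Return (on_topic, ranking). Off-topic queries get an empty ranking."""
    top_cosine = max(cosine.values(), default=0.0)
    if top_cosine < min_relevance:
        return False, []  # caller answers conversationally, without citations
    bm25_norm = min_max(bm25)
    fused = {
        doc: vector_weight * cosine.get(doc, 0.0) + bm25_weight * bm25_norm.get(doc, 0.0)
        for doc in set(bm25) | set(cosine)
    }
    # Sort by descending score, breaking ties by document id for determinism.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return True, ranked
```

In this sketch the relevance gate looks only at the dense similarity, matching the description of `PROBAS_MIN_RELEVANCE` as a cosine threshold, so a strong lexical match alone cannot make a query count as on-topic.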
### Impact numbers (`key_impacts`)
The records' `rag_text` only previews the first few exchanges, which miss the
actual emission outputs (CO₂, SO₂, NOₓ) and impact indicators (GWP/Treibhauseffekt,
cumulative energy demand). The app extracts a compact `key_impacts` block from the
raw exchanges/LCIA so the model can answer "what are the CO₂ emissions" with real
numbers. A fresh index build does this automatically; to add it to an existing
prebuilt bundle **without re-embedding**, run once:
```bash
python enrich_bundle.py
```
## Run
```bash
python app.py
```
The first launch builds the index in the background (see "Index build, checkpointing, and resume"). On later launches the cached index loads in about 15 seconds.
## Model dropdown
The UI exposes the six strongest general-purpose chat models on the endpoint, strongest first:
1. `qwen3.5-397b-a17b` *(default: large MoE, strong multilingual, fast thanks to 17B active parameters)*
2. `mistral-large-3-675b-instruct-2512`
3. `qwen3.5-122b-a10b`
4. `openai-gpt-oss-120b`
5. `deepseek-r1-distill-llama-70b`
6. `glm-4.7`
The app tries the selected model first, then falls back through the rest with retry and backoff.
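The fallback behavior can be pictured with the sketch below. It is illustrative only: `chat_with_fallback`, `fallback_order`, the `call_model` callable, the retry count, and the exponential backoff schedule are all assumptions, since the README does not specify them.

```python
import time
from typing import Callable, List, Sequence, Tuple

MODELS = [
    "qwen3.5-397b-a17b",
    "mistral-large-3-675b-instruct-2512",
    "qwen3.5-122b-a10b",
    "openai-gpt-oss-120b",
    "deepseek-r1-distill-llama-70b",
    "glm-4.7",
]


def fallback_order(selected: str, models: Sequence[str] = MODELS) -> List[str]:
    """Selected model first, then the remaining models in their listed order."""
    return [selected] + [m for m in models if m != selected]


def chat_with_fallback(
    selected: str,
    call_model: Callable[[str], str],  # hypothetical: sends the chat request to one model
    retries_per_model: int = 2,        # assumed value
    base_delay: float = 1.0,           # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Return (model_used, reply), trying each model with retry and backoff."""
    errors = []
    for model in fallback_order(selected):
        for attempt in range(retries_per_model):
            try:
                return model, call_model(model)
            except Exception as exc:  # a real client would catch narrower errors
                errors.append(f"{model}: {exc}")
                if attempt < retries_per_model - 1:
                    sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the reply lets the UI show which model actually answered when the selected one was unavailable.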
## Index build, checkpointing, and resume
On first launch the app embeds every ProBas record in the background using
`PROBAS_EMBED_CONCURRENCY` parallel requests, periodically writing a resume
checkpoint under `indexes/probas_rag/`. If the build is interrupted, the next
launch resumes from the last checkpoint instead of starting over.
Checkpoints are keyed by a fingerprint of the dataset **and the embedding model**,
so changing `PROBAS_EMBEDDING_MODEL` intentionally invalidates the old checkpoint.
Cache files from older code versions are purged automatically on startup.
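A fingerprint of this kind could be computed roughly as shown below. This is a sketch under stated assumptions, not the app's real scheme: `index_fingerprint` is a hypothetical name, and hashing relative file paths and sizes (rather than full file contents) is a cheap stand-in chosen for illustration.

```python
import hashlib
from pathlib import Path


def index_fingerprint(dataset_dir: str, embedding_model: str) -> str:
    """Short hash that changes when the dataset files or the embedding model change."""
    root = Path(dataset_dir)
    digest = hashlib.sha256()
    digest.update(embedding_model.encode("utf-8"))
    digest.update(b"\0")  # separator so fields cannot run together
    for path in sorted(root.rglob("*.json")):
        # Assumption: relative path plus file size is enough to detect changes.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(str(path.stat().st_size).encode("ascii"))
        digest.update(b"\0")
    return digest.hexdigest()[:16]
```

A checkpoint stored under a name that includes this value is simply not found after the model or dataset changes, which gives the intended invalidation without any explicit cleanup logic.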
If the raw dataset directory is absent but a prebuilt bundle is present under
`indexes/probas_rag/`, the app loads that bundle directly — this is what makes a
deployment that ships only the prebuilt index (e.g. a Hugging Face Space) work
without re-embedding.
### Tracking build progress and ETA
While embedding, the app logs a live line per wave:
```
Embedded 1440/23172 records (6.2%) | 3.1 rec/s | elapsed 7m42s | ETA 1h56m
```
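The rate and ETA in that line follow from simple arithmetic: rate is records done divided by elapsed time, and ETA is the remaining records divided by that rate. The sketch below reproduces the format; the function names are hypothetical and the app's own formatting code may differ.

```python
def fmt_duration(seconds: float) -> str:
    """Format as '1h56m' when at least an hour, otherwise as '7m42s'."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h{minutes:02d}m" if hours else f"{minutes}m{secs:02d}s"


def progress_line(done: int, total: int, elapsed_s: float) -> str:
    """Build a progress line from counts and elapsed seconds."""
    rate = done / elapsed_s if elapsed_s > 0 else 0.0
    pct = 100.0 * done / total if total else 0.0
    # ETA is undefined until at least one record has been embedded.
    eta = fmt_duration((total - done) / rate) if rate > 0 else "unknown"
    return (
        f"Embedded {done}/{total} records ({pct:.1f}%) | {rate:.1f} rec/s"
        f" | elapsed {fmt_duration(elapsed_s)} | ETA {eta}"
    )


# 1440 records in 7m42s (462 s) is about 3.1 rec/s; the remaining
# 21732 records then take about 6972 s, i.e. 1h56m.
print(progress_line(1440, 23172, 462))
```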
To check durable progress (what a restart would resume from) from a second terminal:
```bash
python check_progress.py
```
## Deploying to Hugging Face Spaces
See [DEPLOY_HF.md](DEPLOY_HF.md) for the full step-by-step guide. In short:
1. Set `OPENAI_API_KEY` as a **Space secret** (never commit it).
2. Commit the prebuilt index under `indexes/probas_rag/` via Git LFS (the
`.gitattributes` already tracks it) so the Space starts without re-embedding
and without shipping the 1.2 GB raw dataset.
3. Push to the Space remote.
## Data and cache
The dataset folder is read directly from [probas_processes_by_classification_rag_json](probas_processes_by_classification_rag_json). The generated cache is stored under `indexes/probas_rag/` and is safe to delete when rebuilding from scratch.