ProBas_RAG_Assistant / README.md

Mohamed284

Deploy ProBas RAG Assistant with enriched prebuilt index

0ca97fd 20 days ago

metadata

title: ProBas RAG Assistant
emoji: 🌍
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 6.16.0
app_file: app.py
pinned: false
short_description: RAG chat over the ProBas life-cycle process database

ProBas RAG Assistant

ProBas RAG Assistant is a retrieval-augmented chat app for the ProBas process dataset stored in the probas_processes_by_classification_rag_json folder.

It loads the ProBas JSON records, builds a cached BM25 plus embedding index, and answers questions through the Academic Cloud (GWDG) OpenAI-compatible API, with a model fallback chain.

Features

ProBas-only ingestion and hybrid retrieval (dense embeddings + BM25)
Cached lexical and embedding index with checkpoint/resume
Six selectable chat models with automatic failover
Greeting / off-topic detection so casual messages get a friendly reply instead of forced citations
Gradio chat UI with a retrieved-evidence panel

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # then fill in OPENAI_API_KEY

Environment

OPENAI_API_KEY: API key for the OpenAI-compatible endpoint (required)
OPENAI_BASE_URL: defaults to https://chat-ai.academiccloud.de/v1
PROBAS_EMBEDDING_MODEL: defaults to qwen3-embedding-4b (must be an embedding model served by the endpoint)
PROBAS_MAX_RECORDS: optional record limit for smoke tests
PROBAS_EMBED_CONCURRENCY: parallel embedding requests during index build (default 8); the main lever for build speed
PROBAS_EMBED_BATCH_SIZE: texts per embedding request (default 24); lower this if you see request timeouts
PROBAS_EMBED_TIMEOUT_SECONDS: per-request timeout for embeddings (default 180)
PROBAS_EMBED_MAX_RETRIES: retries before a failing batch is split in half (default 1)
PROBAS_CHECKPOINT_EVERY: save a resume checkpoint every N waves (default 10)
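As a rough illustration of how these variables and their documented defaults could be read, here is a minimal sketch. The EmbedSettings class and the helper names are hypothetical and are not taken from app.py; only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass


def _env_int(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to the default."""
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else default


@dataclass(frozen=True)
class EmbedSettings:
    """Hypothetical container for the embedding-related settings."""
    base_url: str
    embedding_model: str
    concurrency: int
    batch_size: int
    timeout_seconds: int
    max_retries: int
    checkpoint_every: int


def load_embed_settings() -> EmbedSettings:
    """Collect settings from the environment using the documented defaults."""
    return EmbedSettings(
        base_url=os.environ.get("OPENAI_BASE_URL", "https://chat-ai.academiccloud.de/v1"),
        embedding_model=os.environ.get("PROBAS_EMBEDDING_MODEL", "qwen3-embedding-4b"),
        concurrency=_env_int("PROBAS_EMBED_CONCURRENCY", 8),
        batch_size=_env_int("PROBAS_EMBED_BATCH_SIZE", 24),
        timeout_seconds=_env_int("PROBAS_EMBED_TIMEOUT_SECONDS", 180),
        max_retries=_env_int("PROBAS_EMBED_MAX_RETRIES", 1),
        checkpoint_every=_env_int("PROBAS_CHECKPOINT_EVERY", 10),
    )
```

For example, exporting PROBAS_EMBED_BATCH_SIZE=12 before launch would halve the batch size while leaving every other value at its default.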

Retrieval and answer-quality tuning

PROBAS_BM25_WEIGHT / PROBAS_VECTOR_WEIGHT: hybrid retrieval weights (defaults 0.30 / 0.70). The dataset is German, and the multilingual dense embedding handles cross-lingual queries (English "lignite" → German "Braunkohle"). BM25 is kept as a minority signal because, at a high weight, it ranks generic boilerplate highly for such queries.
PROBAS_MIN_RELEVANCE: minimum top cosine similarity for a query to be treated as on-topic (default 0.45). Below it, the query is answered conversationally and the user is told that no matching records were found, so the model does not fabricate an answer.
PROBAS_MAX_CONTEXT_CHARS: per-record excerpt fed to the model (default 5000).
PROBAS_EVIDENCE_SNIPPET_CHARS: per-record snippet shown in the UI evidence panel (default 320, kept compact and separate from the model context).
PROBAS_EMBED_QUERY_INSTRUCTION: the instruction prefix added to queries (not documents), as Qwen3-Embedding expects. It greatly improves cross-lingual matching (English query → German records).
PORT: optional deployment port (Hugging Face Spaces uses 7860)
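To make the interaction between the hybrid weights and PROBAS_MIN_RELEVANCE concrete, the sketch below fuses BM25 and cosine scores and applies the relevance gate. It is an illustration under assumptions, not the app's actual code: the function names are hypothetical, and min-max normalization before weighting is one common choice that the README does not confirm.

```python
from typing import List, Sequence, Tuple


def min_max(scores: Sequence[float]) -> List[float]:
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(
    bm25_scores: Sequence[float],
    cosine_scores: Sequence[float],
    bm25_weight: float = 0.30,    # documented default of PROBAS_BM25_WEIGHT
    vector_weight: float = 0.70,  # documented default of PROBAS_VECTOR_WEIGHT
    min_relevance: float = 0.45,  # documented default of PROBAS_MIN_RELEVANCE
) -> Tuple[bool, List[int]]:
    """Return (on_topic, record indices ordered best first).

    Both score lists are assumed to be aligned: index i refers to record i.
    """
    # Gate on the raw top cosine similarity: below the threshold the query
    # is treated as off-topic and no records are returned.
    if not cosine_scores or max(cosine_scores) < min_relevance:
        return False, []
    # Assumed fusion step: normalize each signal, then take a weighted sum.
    bm25_n = min_max(bm25_scores)
    vec_n = min_max(cosine_scores)
    fused = [bm25_weight * b + vector_weight * v for b, v in zip(bm25_n, vec_n)]
    order = sorted(range(len(fused)), key=fused.__getitem__, reverse=True)
    return True, order
```

With these defaults, a record that wins on BM25 alone cannot outrank a record with a clearly better embedding match, which is the behavior the weights above are meant to produce.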

Impact numbers (`key_impacts`)

The records' rag_text only previews the first few exchanges, which miss the actual emission outputs (CO₂, SO₂, NOₓ) and impact indicators (GWP/Treibhauseffekt, cumulative energy demand). The app extracts a compact key_impacts block from the raw exchanges/LCIA so the model can answer "what are the CO₂ emissions" with real numbers. A fresh index build does this automatically; to add it to an existing prebuilt bundle without re-embedding, run once:

python enrich_bundle.py
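The idea of a compact key_impacts block can be sketched as follows. The record schema here (a list of exchange dicts with name, amount, and unit keys) and the term list are assumptions made purely for illustration; the real ProBas JSON layout and the logic in enrich_bundle.py may differ.

```python
from typing import Dict, List

# Illustrative search terms only; the app's actual list is not documented here.
KEY_TERMS = ("CO2", "SO2", "NOx", "Treibhauseffekt", "GWP")


def build_key_impacts(exchanges: List[Dict], max_items: int = 10) -> str:
    """Collect the exchanges whose name mentions a key emission or indicator.

    `exchanges` is a hypothetical list of dicts with "name", "amount", "unit".
    """
    lines = []
    for exchange in exchanges:
        name = str(exchange.get("name", ""))
        if any(term.lower() in name.lower() for term in KEY_TERMS):
            amount = exchange.get("amount")
            unit = exchange.get("unit", "")
            lines.append(f"{name}: {amount} {unit}".strip())
        if len(lines) >= max_items:
            break
    return "\n".join(lines)
```

The point of keeping the block short is that it can be appended to each record's context without crowding out the rest of the excerpt.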

Run

python app.py

The first launch builds the index in the background (see "Index build, checkpointing, and resume"). On later launches the cached index loads in about 15 seconds.

Model dropdown

The UI exposes the six strongest general-purpose chat models on the endpoint, strongest first:

qwen3.5-397b-a17b (default: large MoE, strong multilingual, fast thanks to only 17B active parameters)
mistral-large-3-675b-instruct-2512
qwen3.5-122b-a10b
openai-gpt-oss-120b
deepseek-r1-distill-llama-70b
glm-4.7

The app tries the selected model first, then falls back through the rest with retry and backoff.
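A fallback chain of this kind could look roughly like the sketch below. The function names, the number of retries per model, and the exponential backoff schedule are assumptions; call_model stands in for whatever function actually sends the chat request.

```python
import time
from typing import Callable, List, Sequence, Tuple

MODELS = [
    "qwen3.5-397b-a17b",
    "mistral-large-3-675b-instruct-2512",
    "qwen3.5-122b-a10b",
    "openai-gpt-oss-120b",
    "deepseek-r1-distill-llama-70b",
    "glm-4.7",
]


def fallback_order(selected: str, models: Sequence[str] = MODELS) -> List[str]:
    """Selected model first, then the remaining models in their listed order."""
    return [selected] + [m for m in models if m != selected]


def chat_with_fallback(
    selected: str,
    call_model: Callable[[str], str],  # hypothetical: sends the request to one model
    models: Sequence[str] = MODELS,
    retries_per_model: int = 2,        # assumed value, not documented
    base_delay: float = 1.0,           # assumed value, not documented
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Return (model_used, reply), trying each model with retry and backoff."""
    last_error = None
    for model in fallback_order(selected, models):
        for attempt in range(retries_per_model):
            try:
                return model, call_model(model)
            except Exception as exc:  # broad on purpose: any failure triggers fallback
                last_error = exc
                if attempt < retries_per_model - 1:
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("all models failed") from last_error
```

Returning the model name alongside the reply lets the UI show which model actually answered when a fallback occurred.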

Index build, checkpointing, and resume

On first launch the app embeds every ProBas record in the background using PROBAS_EMBED_CONCURRENCY parallel requests, periodically writing a resume checkpoint under indexes/probas_rag/. If the build is interrupted, the next launch resumes from the last checkpoint instead of starting over.
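The combination of parallel requests and the "retry, then split in half" behavior described for PROBAS_EMBED_MAX_RETRIES can be sketched like this. The names are hypothetical, embed_batch stands in for the real embedding request, and treating a wave as one round of concurrent batches is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Vector = List[float]
EmbedFn = Callable[[Sequence[str]], List[Vector]]


def embed_with_split(texts: Sequence[str], embed_batch: EmbedFn, max_retries: int = 1) -> List[Vector]:
    """Embed a batch; after max_retries failed retries, split it in half and recurse."""
    for _ in range(max_retries + 1):  # one initial attempt plus the retries
        try:
            return embed_batch(texts)
        except Exception:
            continue
    if len(texts) == 1:
        raise RuntimeError("a single text could not be embedded")
    mid = len(texts) // 2
    left = embed_with_split(texts[:mid], embed_batch, max_retries)
    right = embed_with_split(texts[mid:], embed_batch, max_retries)
    return left + right  # order of the original batch is preserved


def embed_wave(batches: Sequence[Sequence[str]], embed_batch: EmbedFn, concurrency: int = 8) -> List[List[Vector]]:
    """Embed several batches in parallel threads; results keep the batch order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda batch: embed_with_split(batch, embed_batch), batches))
```

Splitting instead of failing outright means one oversized or slow batch costs a few extra requests rather than aborting the build.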

Checkpoints are keyed by a fingerprint of the dataset and the embedding model, so changing PROBAS_EMBEDDING_MODEL intentionally invalidates the old checkpoint. Cache files from older code versions are purged automatically on startup.
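One way to build such a fingerprint is sketched below. This is an assumption about the general approach, not the app's actual scheme: it hashes the embedding model name together with each JSON file's relative path and size, and the function name is hypothetical.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(dataset_dir: str, embedding_model: str) -> str:
    """Short hash that changes when the dataset files or the model name change."""
    root = Path(dataset_dir)
    digest = hashlib.sha256()
    digest.update(embedding_model.encode("utf-8"))
    digest.update(b"\0")  # separator so adjacent fields cannot run together
    for path in sorted(root.rglob("*.json")):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(str(path.stat().st_size).encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()[:16]
```

Hashing paths and sizes rather than file contents keeps the check fast on a large dataset, at the cost of missing edits that leave a file's size unchanged.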

If the raw dataset directory is absent but a prebuilt bundle is present under indexes/probas_rag/, the app loads that bundle directly. This is what lets a deployment that ships only the prebuilt index (e.g. a Hugging Face Space) work without re-embedding.

Tracking build progress and ETA

While embedding, the app logs a live line per wave:

Embedded 1440/23172 records (6.2%) | 3.1 rec/s | elapsed 7m42s | ETA 1h56m
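The rate and ETA in that line follow from simple arithmetic: rate = records done / elapsed seconds, and ETA = remaining records / rate. The sketch below reproduces the format of the sample line; the function names are hypothetical, and the duration format is inferred from that single example.

```python
def fmt_duration(seconds: float) -> str:
    """Format as '7m42s' below one hour and as '1h56m' from one hour up."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h{minutes:02d}m"
    return f"{minutes}m{secs:02d}s"


def progress_line(done: int, total: int, elapsed_seconds: float) -> str:
    """Build one progress log line from the counts and the elapsed time."""
    rate = done / elapsed_seconds if elapsed_seconds > 0 else 0.0
    pct = 100.0 * done / total
    # Without a positive rate there is nothing to extrapolate from.
    eta = fmt_duration((total - done) / rate) if rate > 0 else "n/a"
    return (
        f"Embedded {done}/{total} records ({pct:.1f}%) | {rate:.1f} rec/s | "
        f"elapsed {fmt_duration(elapsed_seconds)} | ETA {eta}"
    )


print(progress_line(1440, 23172, 7 * 60 + 42))
# Embedded 1440/23172 records (6.2%) | 3.1 rec/s | elapsed 7m42s | ETA 1h56m
```

Because the ETA extrapolates from the average rate so far, it will drift if throughput changes, for example after batches start timing out and splitting.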

To check durable progress (what a restart would resume from) from a second terminal:

python check_progress.py

Deploying to Hugging Face Spaces

See DEPLOY_HF.md for the full step-by-step. In short:

Set OPENAI_API_KEY as a Space secret (never commit it).
Commit the prebuilt index under indexes/probas_rag/ via Git LFS (the .gitattributes already tracks it) so the Space starts without re-embedding and without shipping the 1.2 GB raw dataset.
Push to the Space remote.

Data and cache

The dataset is read directly from the probas_processes_by_classification_rag_json folder. The generated cache is stored under indexes/probas_rag/ and is safe to delete when rebuilding from scratch.