Mohamed284

Deploy ProBas RAG Assistant with enriched prebuilt index

0ca97fd 20 days ago


	---
	title: ProBas RAG Assistant
	emoji: 🌍
	colorFrom: green
	colorTo: blue
	sdk: gradio
	sdk_version: 6.16.0
	app_file: app.py
	pinned: false
	short_description: RAG chat over the ProBas life-cycle process database
	---

	# ProBas RAG Assistant

	ProBas RAG Assistant is a retrieval-augmented chat app for the ProBas process dataset in `probas_processes_by_classification_rag_json`.

	It loads the ProBas JSON records, builds a cached BM25 plus embedding index, and answers questions through the Academic Cloud (GWDG) OpenAI-compatible API, with a model fallback chain.

	## Features

	- ProBas-only ingestion and hybrid retrieval (dense embeddings + BM25)
	- Cached lexical and embedding index with checkpoint/resume
	- Six selectable chat models with automatic failover
	- Greeting / off-topic detection so casual messages get a friendly reply instead of forced citations
	- Gradio chat UI with a retrieved-evidence panel

	## Setup

	```bash
	python -m venv .venv
	source .venv/bin/activate
	pip install -r requirements.txt
	cp .env.example .env # then fill in OPENAI_API_KEY
	```

	## Environment

	- `OPENAI_API_KEY`: API key for the OpenAI-compatible endpoint (required)
	- `OPENAI_BASE_URL`: defaults to `https://chat-ai.academiccloud.de/v1`
	- `PROBAS_EMBEDDING_MODEL`: defaults to `qwen3-embedding-4b` (must be an embedding model served by the endpoint)
	- `PROBAS_MAX_RECORDS`: optional record limit for smoke tests
	- `PROBAS_EMBED_CONCURRENCY`: parallel embedding requests during index build (default `8`); the main lever for build speed
	- `PROBAS_EMBED_BATCH_SIZE`: texts per embedding request (default `24`); lower this if you see request timeouts
	- `PROBAS_EMBED_TIMEOUT_SECONDS`: per-request timeout for embeddings (default `180`)
	- `PROBAS_EMBED_MAX_RETRIES`: retries before a failing batch is split in half (default `1`)
	- `PROBAS_CHECKPOINT_EVERY`: save a resume checkpoint every N waves (default `10`)
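
The retry-then-split behaviour described for `PROBAS_EMBED_MAX_RETRIES` can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: `embed_with_split` and the `embed_batch` callable are hypothetical names, and the sketch assumes any exception from a request counts as a failure.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_with_split(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],  # hypothetical request function
    max_retries: int = 1,
) -> List[Vector]:
    """Embed texts; retry a failing batch, then split it in half and recurse."""
    last_error: Optional[Exception] = None
    # One initial attempt plus `max_retries` retries on the whole batch.
    for _ in range(max_retries + 1):
        try:
            return embed_batch(texts)
        except Exception as exc:  # assumption: every request failure is retryable
            last_error = exc
    if len(texts) <= 1:
        # A single text that still fails cannot be split any further.
        assert last_error is not None
        raise last_error
    mid = len(texts) // 2
    left = embed_with_split(texts[:mid], embed_batch, max_retries)
    right = embed_with_split(texts[mid:], embed_batch, max_retries)
    return left + right  # original order is preserved
```

Splitting only after the retries are exhausted keeps healthy batches at full size, while an oversized or slow batch shrinks until it fits within the timeout.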

	### Retrieval and answer-quality tuning

	- `PROBAS_BM25_WEIGHT` / `PROBAS_VECTOR_WEIGHT`: hybrid retrieval weights (defaults `0.30` / `0.70`). The dataset is German, and the multilingual dense embedding handles cross-lingual queries (English "lignite" → German "Braunkohle"). BM25 is kept as a minority signal because, at a high weight, it ranks generic boilerplate above relevant records for such queries.
	- `PROBAS_MIN_RELEVANCE`: minimum top cosine similarity for a query to be treated as on-topic (default `0.45`). Below this threshold, the app answers conversationally and tells the user that no matching records were found, rather than fabricating an answer.
	- `PROBAS_MAX_CONTEXT_CHARS`: per-record excerpt fed to the model (default `5000`).
	- `PROBAS_EVIDENCE_SNIPPET_CHARS`: per-record snippet shown in the UI evidence panel (default `320`, kept compact and separate from the model context).
	- `PROBAS_EMBED_QUERY_INSTRUCTION`: the instruction prefix added to queries (not documents), as Qwen3-Embedding expects. It greatly improves cross-lingual matching (English query → German records).
	- `PORT`: optional deployment port (Hugging Face Spaces uses `7860`)
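
To make the weighting and the relevance gate concrete, here is a minimal sketch of how a `0.30` / `0.70` fusion with a `0.45` cosine threshold could work. The function names and the min-max normalisation step are assumptions for illustration; the README does not state how the app normalises or combines the two score lists.

```python
from typing import List, Sequence, Tuple


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(
    bm25_scores: Sequence[float],
    cosine_scores: Sequence[float],
    bm25_weight: float = 0.30,    # documented default of PROBAS_BM25_WEIGHT
    vector_weight: float = 0.70,  # documented default of PROBAS_VECTOR_WEIGHT
    min_relevance: float = 0.45,  # documented default of PROBAS_MIN_RELEVANCE
    top_k: int = 5,               # hypothetical parameter
) -> List[Tuple[int, float]]:
    """Return (record_index, fused_score) pairs, best first, or [] if off-topic."""
    # Gate on the raw top cosine similarity before any fusion.
    if not cosine_scores or max(cosine_scores) < min_relevance:
        return []
    bm25_n = min_max(bm25_scores)
    cos_n = min_max(cosine_scores)
    fused = [bm25_weight * b + vector_weight * c for b, c in zip(bm25_n, cos_n)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return [(i, fused[i]) for i in order[:top_k]]
```

With these defaults, a record with a strong dense match outranks one that only scores well lexically; swapping the weights to `0.90` / `0.10` reverses that, which is the boilerplate-ranking effect the BM25 note describes.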

	### Impact numbers (`key_impacts`)

	The records' `rag_text` only previews the first few exchanges, which miss the
	actual emission outputs (CO₂, SO₂, NOₓ) and impact indicators (GWP/Treibhauseffekt,
	cumulative energy demand). The app extracts a compact `key_impacts` block from the
	raw exchanges/LCIA so the model can answer "what are the CO₂ emissions" with real
	numbers. A fresh index build does this automatically; to add it to an existing
	prebuilt bundle without re-embedding, run once:

	```bash
	python enrich_bundle.py
	```

	## Run

	```bash
	python app.py
	```

	The first launch builds the index in the background (see "Index build, checkpointing, and resume" below). On later launches, the cached index loads in about 15 seconds.

	## Model dropdown

	The UI exposes the six strongest general-purpose chat models on the endpoint, strongest first:

	1. `qwen3.5-397b-a17b` (default: large MoE, strong multilingual, fast thanks to 17B active parameters)
	2. `mistral-large-3-675b-instruct-2512`
	3. `qwen3.5-122b-a10b`
	4. `openai-gpt-oss-120b`
	5. `deepseek-r1-distill-llama-70b`
	6. `glm-4.7`

	The app tries the selected model first, then falls back through the rest with retry and backoff.
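
A fallback chain of this kind might look like the sketch below. It assumes exponential backoff and a fixed number of attempts per model; neither detail is stated here, and `chat_with_fallback`, `call_model`, `retries_per_model`, and `base_delay` are hypothetical names.

```python
import time
from typing import Callable, List, Sequence, Tuple

# Model order as listed in the dropdown above.
MODELS: List[str] = [
    "qwen3.5-397b-a17b",
    "mistral-large-3-675b-instruct-2512",
    "qwen3.5-122b-a10b",
    "openai-gpt-oss-120b",
    "deepseek-r1-distill-llama-70b",
    "glm-4.7",
]


def fallback_order(selected: str, models: Sequence[str] = MODELS) -> List[str]:
    """Selected model first, then the remaining models in their listed order."""
    return [selected] + [m for m in models if m != selected]


def chat_with_fallback(
    prompt: str,
    selected: str,
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> reply
    retries_per_model: int = 2,             # assumption, not a documented setting
    base_delay: float = 1.0,                # assumption, not a documented setting
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Return (model_used, reply); raise RuntimeError if every model fails."""
    errors: List[str] = []
    for model in fallback_order(selected):
        for attempt in range(retries_per_model):
            try:
                return model, call_model(model, prompt)
            except Exception as exc:  # assumption: any failure triggers retry or fallback
                errors.append(f"{model}: {exc}")
                if attempt < retries_per_model - 1:
                    sleep(base_delay * 2 ** attempt)  # back off before retrying
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the reply lets the UI show which model actually answered when the selected one was unavailable.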

	## Index build, checkpointing, and resume

	On first launch the app embeds every ProBas record in the background using
	`PROBAS_EMBED_CONCURRENCY` parallel requests, periodically writing a resume
	checkpoint under `indexes/probas_rag/`. If the build is interrupted, the next
	launch resumes from the last checkpoint instead of starting over.

	Checkpoints are keyed by a fingerprint of the dataset and the embedding model,
	so changing `PROBAS_EMBEDDING_MODEL` intentionally invalidates the old checkpoint.
	Cache files from older code versions are purged automatically on startup.
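
One plausible way to derive such a fingerprint is to hash the embedding model name together with the dataset's file listing. The sketch below is an assumption about the general approach, not the app's actual scheme: `index_fingerprint` is a hypothetical name, and hashing file paths and sizes (rather than full contents) is a choice made here for speed.

```python
import hashlib
from pathlib import Path


def index_fingerprint(dataset_dir: Path, embedding_model: str) -> str:
    """Short hash that changes when the model name or the dataset files change."""
    digest = hashlib.sha256()
    digest.update(embedding_model.encode("utf-8"))
    digest.update(b"\0")  # separator so adjacent fields cannot run together
    # Sort so the result does not depend on directory listing order.
    for path in sorted(dataset_dir.rglob("*.json")):
        digest.update(path.relative_to(dataset_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(str(path.stat().st_size).encode("ascii"))
        digest.update(b"\0")
    return digest.hexdigest()[:16]
```

A checkpoint stored under a name that includes this value is simply never found once the model or the dataset changes, so a stale checkpoint cannot be resumed by accident.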

	If the raw dataset directory is absent but a prebuilt bundle is present under
	`indexes/probas_rag/`, the app loads that bundle directly — this is what makes a
	deployment that ships only the prebuilt index (e.g. a Hugging Face Space) work
	without re-embedding.

	### Tracking build progress and ETA

	While embedding, the app logs a live line per wave:

	```
	Embedded 1440/23172 records (6.2%) | 3.1 rec/s | elapsed 7m42s | ETA 1h56m
	```
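
The figures in that log line follow from simple arithmetic: the rate is records done divided by elapsed time, and the ETA is the remaining records divided by that rate. The sketch below reproduces the sample line under that assumption; `progress_line` and `fmt_duration` are hypothetical names, and the app may smooth or compute the rate differently.

```python
def fmt_duration(seconds: float) -> str:
    """Format as '7m42s' below one hour, or '1h56m' from one hour up."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h{minutes:02d}m"
    return f"{minutes}m{secs:02d}s"


def progress_line(done: int, total: int, elapsed_seconds: float) -> str:
    """Build a progress line from the counts and the elapsed time so far."""
    rate = done / elapsed_seconds if elapsed_seconds > 0 else 0.0
    # No meaningful ETA exists until at least one record has been embedded.
    eta = fmt_duration((total - done) / rate) if rate > 0 else "unknown"
    percent = 100 * done / total
    return (
        f"Embedded {done}/{total} records ({percent:.1f}%) | "
        f"{rate:.1f} rec/s | elapsed {fmt_duration(elapsed_seconds)} | ETA {eta}"
    )


# 1440 records in 7m42s (462 s) out of 23172:
print(progress_line(1440, 23172, 462))
```

Because the ETA scales inversely with the rate, doubling throughput (for example through `PROBAS_EMBED_CONCURRENCY`) roughly halves the remaining time.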

	To check durable progress (what a restart would resume from) from a second terminal:

	```bash
	python check_progress.py
	```

	## Deploying to Hugging Face Spaces

	See [DEPLOY_HF.md](DEPLOY_HF.md) for the full step-by-step. In short:

	1. Set `OPENAI_API_KEY` as a Space secret (never commit it).
	2. Commit the prebuilt index under `indexes/probas_rag/` via Git LFS (the
	`.gitattributes` already tracks it) so the Space starts without re-embedding
	and without shipping the 1.2 GB raw dataset.
	3. Push to the Space remote.

	## Data and cache

	The dataset folder is read directly from [probas_processes_by_classification_rag_json](probas_processes_by_classification_rag_json). The generated cache is stored under `indexes/probas_rag/` and is safe to delete when rebuilding from scratch.