# RAG QA Command Center

**A CI-tested Streamlit command center for reviewing offline RAG QA evaluation logs across retrieval quality, hallucination exposure, configuration trade-offs, trace evidence, and review-policy simulation.**

This is not a generic RAG chatbot. It is a deterministic review workspace for understanding where a RAG pipeline succeeds, where it fails, and which examples or configurations deserve investigation.

---
## What it shows

- **Quality posture:** correctness, hallucination rate, Recall@10, MRR@10, latency, and cost.
- **Retrieval diagnostics:** separates retrieval weakness from generation failure and hallucination exposure.
- **Risk board:** ranks domain, scenario, and difficulty slices by error, hallucination, and retrieval risk.
- **Config comparison:** compares retriever, generator, and chunking combinations under multiple objectives.
- **Policy simulator:** explores offline review thresholds and review-load trade-offs.
- **Trace explorer:** inspects example-level questions, gold answers, metadata, diagnosis, and retrieved chunks.
- **Export center:** downloads filtered evidence, risk tables, config leaderboards, policy curves, and an executive brief JSON.
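The headline retrieval metrics follow the standard definitions. A minimal sketch of Recall@K and MRR@K over a ranked list of relevance flags (an illustration only, not the project's actual implementation in `src/analytics.py`):

```python
def recall_at_k(ranked_relevant: list[bool], total_relevant: int, k: int = 10) -> float:
    """Fraction of all gold-relevant chunks that appear in the top-k results."""
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevant[:k]) / total_relevant


def mrr_at_k(ranked_relevant: list[bool], k: int = 10) -> float:
    """Reciprocal rank of the first relevant chunk within the top k, else 0."""
    for rank, flag in enumerate(ranked_relevant[:k], start=1):
        if flag:
            return 1.0 / rank
    return 0.0
```

For example, a query whose first relevant chunk appears at rank 2 scores an MRR@10 of 0.5.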
---

## Repository structure
```text
app.py                    # Thin Streamlit entrypoint
src/
  dashboard.py            # App controller and shared orchestration
  app_state.py            # Typed dashboard context models
  models.py               # Runtime settings and release metadata
  data.py                 # Data loading, standardization, and fail-fast validation
  analytics.py            # Metrics, risk slices, config scoring, and policy curves
  charts.py               # Plotly chart factories
  formatting.py           # Display formatting helpers
  text_search.py          # Literal user-search helper
  ui.py                   # Reusable UI primitives
  views/                  # Focused tab/page mixins
    overview.py
    quality_map.py
    retrieval_lab.py
    risk_board.py
    config_comparison.py
    policy_simulator.py
    trace_explorer.py
    export_center.py
scripts/
  run_pytest.py           # Cross-platform pytest runner
  check_http_health.py    # HTTP health check helper used by Docker
  docker_smoke.py         # Local Docker build/run/health smoke check
data/                     # Bundled synthetic/offline RAG QA artifacts
docs/                     # Architecture, limitations, manifest, and data dictionary
tests/                    # Data contracts, numeric analytics regressions, smoke tests, and release checks
Dockerfile
docker-compose.yml
Makefile
```
---

## Quickstart

### macOS / Linux

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py
```

### Windows PowerShell

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py
```

Then open:

```text
http://localhost:8501
```

If port `8501` is already in use, run Streamlit on another port:

```bash
streamlit run app.py --server.port 8510
```

---
## Docker

Build and run the container:

```bash
docker build -t rag-qa-command-center:1.0.0 .
docker run --rm -p 8501:8501 rag-qa-command-center:1.0.0
```

Or use Docker Compose:

```bash
docker compose up --build
```

For a local Docker health-smoke check:

```bash
python scripts/docker_smoke.py --image rag-qa-command-center:ci
```

The smoke check builds the image, starts a container, verifies that it remains running, and checks Streamlit's health endpoint.
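The endpoint probe itself can be as small as a single HTTP request. A minimal sketch, independent of the bundled `scripts/check_http_health.py` and assuming Streamlit's default `/_stcore/health` endpoint:

```python
import urllib.request


def streamlit_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the Streamlit health endpoint answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/_stcore/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Connection refused, DNS failure, and timeout all count as unhealthy.
        return False
```

For example, `streamlit_is_healthy("http://localhost:8501")` once the container reports ready.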
---
## Quality checks

Install the development toolchain with either `requirements-dev.txt` or the optional project metadata:

```bash
pip install -r requirements-dev.txt
# or (quoted so the extras spec survives shells that glob square brackets, e.g. zsh)
pip install -e ".[dev]"
```

Run the full check target:

```bash
make check
```

Equivalent cross-platform commands:

```bash
ruff check app.py src tests scripts
python -m compileall app.py src tests scripts
python scripts/run_pytest.py
python -c "import app; from src.dashboard import CommandCenterApp; CommandCenterApp()"
```

The test suite covers:

- data contract validation,
- artifact manifest integrity,
- analytics formula regressions,
- static checks for class helper methods,
- literal search behavior for trace filtering,
- Streamlit import and app-construction smoke checks.
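The literal-search check is worth spelling out: user input is matched as plain text, never compiled as a regex. A minimal sketch of that behavior (an illustration, not the actual `src/text_search.py` implementation):

```python
def literal_filter(rows: list[dict], query: str, fields: list[str]) -> list[dict]:
    """Case-insensitive literal substring match; regex metacharacters have no effect."""
    needle = query.lower()
    return [
        row for row in rows
        if any(needle in str(row.get(field, "")).lower() for field in fields)
    ]
```

A query such as `c++ (beta)` therefore matches only rows containing that exact text, rather than being misread as a regex pattern.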
CI runs the quality checks on Python 3.11 and 3.12. After the Python matrix passes, CI also builds the Docker image, starts the container, and validates the Streamlit health endpoint.

---
## Data contract

The packaged dataset includes:

| File | Purpose |
|---|---|
| `data/eval_runs.csv` | Example-level RAG QA evaluation rows. |
| `data/rag_retrieval_events.csv` | Retrieved chunk events by example and rank. |
| `data/rag_corpus_documents.csv` | Corpus document metadata. |
| `data/rag_corpus_chunks.csv` | Chunk-level corpus text and metadata. |
| `data/scenarios.csv` | Scenario/query metadata. |
| `docs/data_dictionary.csv` | Column reference. |

At startup, `load_bundle()` validates:

- required columns,
- non-empty tables,
- primary key presence and uniqueness,
- required foreign-key presence,
- retrieval-event references to evaluation examples and chunks,
- chunk references to documents,
- scenario references,
- required numeric fields with strict conversion and no silent non-numeric coercion,
- basic metric ranges,
- retrieval relevance flags,
- positive, integer-valued retrieval ranks and non-negative runtime/cost signals.
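Fail-fast validation of this kind needs very little machinery. An illustrative sketch (the real checks live in `src/data.py`; the function below is a hypothetical simplification):

```python
def validate_table(rows: list[dict], *, required_columns: set[str], primary_key: str) -> None:
    """Raise ValueError on the first contract violation instead of silently coercing."""
    if not rows:
        raise ValueError("table is empty")
    missing = required_columns - set(rows[0])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    keys = [row[primary_key] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError(f"duplicate values in primary key {primary_key!r}")
```

A foreign-key pass can then assert, in the same spirit, that every example referenced by a retrieval event exists in the evaluation table.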
See [`docs/artifact_manifest.md`](docs/artifact_manifest.md) for row counts and SHA256 checksums.

---
## Design intent

The project is built to demonstrate production-minded review patterns without pretending to be a live production platform:

- deterministic local data,
- clean app boundaries,
- typed state/context,
- fail-fast data validation,
- focused view modules,
- cross-platform local checks,
- CI checks across Python 3.11 and 3.12,
- Streamlit import smoke check,
- Docker build and container health-smoke verification,
- explicit limitations.

See [`docs/architecture.md`](docs/architecture.md) and [`docs/limitations.md`](docs/limitations.md).

---
## Boundaries

This project does **not** include live RAG serving, online document ingestion, vector indexing, authentication, alerting, scheduled jobs, or production incident automation.

It is a self-contained evaluation command center over bundled synthetic/offline artifacts.

---
## Suggested GitHub About

```text
RAG QA command center for retrieval quality, hallucination exposure, config trade-offs, trace review, and review-policy simulation.
```

## Suggested topics

```text
rag
rag-evaluation
streamlit
plotly
llm-evaluation
retrieval-augmented-generation
hallucination-detection
policy-simulation
trace-review
data-visualization
python
synthetic-data
```

---
## License

MIT License.