# RAG QA Command Center

![Python](https://img.shields.io/badge/Python-3.11%20%7C%203.12-blue) ![Streamlit](https://img.shields.io/badge/Streamlit-app-FF4B4B?logo=streamlit&logoColor=white) ![Tests](https://img.shields.io/badge/Tests-CI%20checked-success) ![Docker](https://img.shields.io/badge/Docker-health%20smoke-blue) ![License](https://img.shields.io/badge/License-MIT-green) ![Version](https://img.shields.io/badge/Version-v1.0.0-black)

**A CI-tested Streamlit command center for reviewing offline RAG QA evaluation logs across retrieval quality, hallucination exposure, configuration trade-offs, trace evidence, and review-policy simulation.**

This is not a generic RAG chatbot. It is a deterministic review workspace for understanding where a RAG pipeline succeeds, where it fails, and which examples or configurations deserve investigation.

![RAG QA Command Center overview](assets/Overview.png)

---

## What it shows

- **Quality posture:** correctness, hallucination rate, Recall@10, MRR@10, latency, and cost.
- **Retrieval diagnostics:** separates retrieval weakness from generation failure and hallucination exposure.
- **Risk board:** ranks domain, scenario, and difficulty slices by error, hallucination, and retrieval risk.
- **Config comparison:** compares retriever, generator, and chunking combinations under multiple objectives.
- **Policy simulator:** explores offline review thresholds and review-load trade-offs.
- **Trace explorer:** inspects example-level questions, gold answers, metadata, diagnosis, and retrieved chunks.
- **Export center:** downloads filtered evidence, risk tables, config leaderboards, policy curves, and an executive brief JSON.
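The review-policy trade-off in the bullets above can be sketched in a few lines. The following is a minimal, hypothetical illustration (the field names and the confidence-threshold policy are assumptions made for this sketch, not the project's actual simulator logic): each example carries a model confidence score, and a policy routes everything below a threshold to human review.

```python
def policy_curve(examples, thresholds):
    """For each threshold, report review load and the rate of unreviewed errors.

    examples: list of (confidence, is_correct) pairs.
    Returns one dict per threshold.
    """
    total = len(examples)
    curve = []
    for t in thresholds:
        reviewed = sum(1 for conf, _ in examples if conf < t)
        escaped = sum(1 for conf, ok in examples if conf >= t and not ok)
        curve.append({
            "threshold": t,
            "review_load": reviewed / total,        # share of examples sent to humans
            "escaped_error_rate": escaped / total,  # wrong answers that skip review
        })
    return curve


# Toy data: (model confidence, answered correctly)
examples = [(0.95, True), (0.80, True), (0.60, False), (0.40, False), (0.30, True)]
for point in policy_curve(examples, thresholds=[0.5, 0.7, 0.9]):
    print(point)
```

Plotting `review_load` against `escaped_error_rate` across thresholds yields the kind of trade-off curve the Policy simulator tab is built to explore.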
---

## Repository structure

```text
app.py                   # Thin Streamlit entrypoint
src/
  dashboard.py           # App controller and shared orchestration
  app_state.py           # Typed dashboard context models
  models.py              # Runtime settings and release metadata
  data.py                # Data loading, standardization, and fail-fast validation
  analytics.py           # Metrics, risk slices, config scoring, and policy curves
  charts.py              # Plotly chart factories
  formatting.py          # Display formatting helpers
  text_search.py         # Literal user-search helper
  ui.py                  # Reusable UI primitives
  views/                 # Focused tab/page mixins
    overview.py
    quality_map.py
    retrieval_lab.py
    risk_board.py
    config_comparison.py
    policy_simulator.py
    trace_explorer.py
    export_center.py
scripts/
  run_pytest.py          # Cross-platform pytest runner
  check_http_health.py   # HTTP health check helper used by Docker
  docker_smoke.py        # Local Docker build/run/health smoke check
data/                    # Bundled synthetic/offline RAG QA artifacts
docs/                    # Architecture, limitations, manifest, and data dictionary
tests/                   # Data contracts, numeric analytics regressions, smoke tests, and release checks
Dockerfile
docker-compose.yml
Makefile
```

---

## Quickstart

### macOS / Linux

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py
```

### Windows PowerShell

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py
```

Then open:

```text
http://localhost:8501
```

If port `8501` is already in use, run Streamlit on another port:

```bash
streamlit run app.py --server.port 8510
```

---

## Docker

Build and run the container:

```bash
docker build -t rag-qa-command-center:1.0.0 .
docker run --rm -p 8501:8501 rag-qa-command-center:1.0.0
```

Or use Docker Compose:

```bash
docker compose up --build
```

For a local Docker health-smoke check:

```bash
python scripts/docker_smoke.py --image rag-qa-command-center:ci
```

The smoke check builds the image, starts a container, verifies that it remains running, and checks Streamlit's health endpoint.

---

## Quality checks

Install the development toolchain with either `requirements-dev.txt` or the optional project metadata:

```bash
pip install -r requirements-dev.txt
# or
pip install -e .[dev]
```

Run the full check target:

```bash
make check
```

Equivalent cross-platform commands:

```bash
ruff check app.py src tests scripts
python -m compileall app.py src tests scripts
python scripts/run_pytest.py
python -c "import app; from src.dashboard import CommandCenterApp; CommandCenterApp()"
```

The test suite covers:

- data contract validation,
- artifact manifest integrity,
- analytics formula regressions,
- static checks for class helper methods,
- literal search behavior for trace filtering,
- Streamlit import and app-construction smoke checks.

CI runs these checks on Python 3.11 and 3.12. After the Python matrix passes, CI also builds the Docker image, starts the container, and validates the Streamlit health endpoint.

---

## Data contract

The packaged dataset includes:

| File | Purpose |
|---|---|
| `data/eval_runs.csv` | Example-level RAG QA evaluation rows. |
| `data/rag_retrieval_events.csv` | Retrieved chunk events by example and rank. |
| `data/rag_corpus_documents.csv` | Corpus document metadata. |
| `data/rag_corpus_chunks.csv` | Chunk-level corpus text and metadata. |
| `data/scenarios.csv` | Scenario/query metadata. |
| `docs/data_dictionary.csv` | Column reference. |
At startup, `load_bundle()` validates:

- required columns,
- non-empty tables,
- primary key presence and uniqueness,
- required foreign-key presence,
- retrieval-event references to evaluation examples and chunks,
- chunk references to documents,
- scenario references,
- required numeric fields with strict conversion and no silent non-numeric coercion,
- basic metric ranges,
- retrieval relevance flags,
- positive, integer-valued retrieval ranks and non-negative runtime/cost signals.

See [`docs/artifact_manifest.md`](docs/artifact_manifest.md) for row counts and SHA256 checksums.

---

## Design intent

The project is built to demonstrate production-minded review patterns without pretending to be a live production platform:

- deterministic local data,
- clean app boundaries,
- typed state/context,
- fail-fast data validation,
- focused view modules,
- cross-platform local checks,
- CI checks across Python 3.11 and 3.12,
- Streamlit import smoke check,
- Docker build and container health-smoke verification,
- explicit limitations.

See [`docs/architecture.md`](docs/architecture.md) and [`docs/limitations.md`](docs/limitations.md).

---

## Boundaries

This project does **not** include live RAG serving, online document ingestion, vector indexing, authentication, alerting, scheduled jobs, or production incident automation. It is a self-contained evaluation command center over bundled synthetic/offline artifacts.

---

## Suggested GitHub About

```text
RAG QA command center for retrieval quality, hallucination exposure, config trade-offs, trace review, and review-policy simulation.
```

## Suggested topics

```text
rag rag-evaluation streamlit plotly llm-evaluation retrieval-augmented-generation hallucination-detection policy-simulation trace-review data-visualization python synthetic-data
```

---

## License

MIT License.
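---

## Appendix: fail-fast validation sketch

The startup checks described under Data contract follow a validate-then-render pattern that can be approximated in a few lines. This is a hedged illustration with hypothetical column names and a hypothetical `validate_table` helper, not the project's actual `load_bundle()` implementation:

```python
import csv
import io

def validate_table(rows, required_columns, primary_key):
    """Fail fast: raise on empty tables, missing columns, or duplicate keys."""
    if not rows:
        raise ValueError("table is empty")
    missing = required_columns - rows[0].keys()
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    keys = [row[primary_key] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError(f"duplicate values in primary key {primary_key!r}")
    return rows

# Toy stand-in for an evaluation table (hypothetical columns):
raw = "example_id,correct\ne1,1\ne2,0\n"
rows = list(csv.DictReader(io.StringIO(raw)))
validate_table(rows, required_columns={"example_id", "correct"}, primary_key="example_id")
```

The real loader layers many more checks on top (foreign keys, metric ranges, strict numeric conversion), but the principle is the same: raise immediately at startup rather than render partial or inconsistent data.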