# RAG QA Command Center

**A CI-tested Streamlit command center for reviewing offline RAG QA evaluation logs across retrieval quality, hallucination exposure, configuration trade-offs, trace evidence, and review-policy simulation.**

This is not a generic RAG chatbot. It is a deterministic review workspace for understanding where a RAG pipeline succeeds, where it fails, and which examples or configurations deserve investigation.

---
## What it shows

- **Quality posture:** correctness, hallucination rate, Recall@10, MRR@10, latency, and cost.
- **Retrieval diagnostics:** separates retrieval weakness from generation failure and hallucination exposure.
- **Risk board:** ranks domain, scenario, and difficulty slices by error, hallucination, and retrieval risk.
- **Config comparison:** compares retriever, generator, and chunking combinations under multiple objectives.
- **Policy simulator:** explores offline review thresholds and review-load trade-offs.
- **Trace explorer:** inspects example-level questions, gold answers, metadata, diagnosis, and retrieved chunks.
- **Export center:** downloads filtered evidence, risk tables, config leaderboards, policy curves, and an executive brief JSON.
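The headline retrieval metrics follow the standard definitions. A minimal sketch of Recall@K and MRR@K over a ranked list of relevance flags (an illustration only, not the project's actual implementation in `src/analytics.py`):

```python
def recall_at_k(ranked_relevant: list[bool], total_relevant: int, k: int = 10) -> float:
    """Fraction of all gold-relevant chunks that appear in the top-k results."""
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevant[:k]) / total_relevant


def mrr_at_k(ranked_relevant: list[bool], k: int = 10) -> float:
    """Reciprocal rank of the first relevant chunk within the top k, else 0."""
    for rank, flag in enumerate(ranked_relevant[:k], start=1):
        if flag:
            return 1.0 / rank
    return 0.0
```

For example, a query whose first relevant chunk appears at rank 2 scores an MRR@10 of 0.5.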
---

## Repository structure
```text
app.py                    # Thin Streamlit entrypoint
src/
  dashboard.py            # App controller and shared orchestration
  app_state.py            # Typed dashboard context models
  models.py               # Runtime settings and release metadata
  data.py                 # Data loading, standardization, and fail-fast validation
  analytics.py            # Metrics, risk slices, config scoring, and policy curves
  charts.py               # Plotly chart factories
  formatting.py           # Display formatting helpers
  text_search.py          # Literal user-search helper
  ui.py                   # Reusable UI primitives
  views/                  # Focused tab/page mixins
    overview.py
    quality_map.py
    retrieval_lab.py
    risk_board.py
    config_comparison.py
    policy_simulator.py
    trace_explorer.py
    export_center.py
scripts/
  run_pytest.py           # Cross-platform pytest runner
  check_http_health.py    # HTTP health check helper used by Docker
  docker_smoke.py         # Local Docker build/run/health smoke check
data/                     # Bundled synthetic/offline RAG QA artifacts
docs/                     # Architecture, limitations, manifest, and data dictionary
tests/                    # Data contracts, numeric analytics regressions, smoke tests, and release checks
Dockerfile
docker-compose.yml
Makefile
```
---

## Quickstart

### macOS / Linux

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py
```

### Windows PowerShell

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py
```

Then open:

```text
http://localhost:8501
```

If port `8501` is already in use, run Streamlit on another port:

```bash
streamlit run app.py --server.port 8510
```

---
## Docker

Build and run the container:

```bash
docker build -t rag-qa-command-center:1.0.0 .
docker run --rm -p 8501:8501 rag-qa-command-center:1.0.0
```

Or use Docker Compose:

```bash
docker compose up --build
```

For a local Docker health-smoke check:

```bash
python scripts/docker_smoke.py --image rag-qa-command-center:ci
```

The smoke check builds the image, starts a container, verifies that it remains running, and checks Streamlit's health endpoint.
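The endpoint probe itself can be as small as a single HTTP request. A minimal sketch, independent of the bundled `scripts/check_http_health.py` and assuming Streamlit's default `/_stcore/health` endpoint:

```python
import urllib.request


def streamlit_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the Streamlit health endpoint answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/_stcore/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Connection refused, DNS failure, and timeout all count as unhealthy.
        return False
```

For example, `streamlit_is_healthy("http://localhost:8501")` once the container reports ready.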
---
## Quality checks

Install the development toolchain with either `requirements-dev.txt` or the optional project metadata:

```bash
pip install -r requirements-dev.txt
# or (quoted so the extras spec survives shells that glob square brackets, e.g. zsh)
pip install -e ".[dev]"
```

Run the full check target:

```bash
make check
```

Equivalent cross-platform commands:

```bash
ruff check app.py src tests scripts
python -m compileall app.py src tests scripts
python scripts/run_pytest.py
python -c "import app; from src.dashboard import CommandCenterApp; CommandCenterApp()"
```

The test suite covers:

- data contract validation,
- artifact manifest integrity,
- analytics formula regressions,
- static checks for class helper methods,
- literal search behavior for trace filtering,
- Streamlit import and app-construction smoke checks.
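The literal-search check is worth spelling out: user input is matched as plain text, never compiled as a regex. A minimal sketch of that behavior (an illustration, not the actual `src/text_search.py` implementation):

```python
def literal_filter(rows: list[dict], query: str, fields: list[str]) -> list[dict]:
    """Case-insensitive literal substring match; regex metacharacters have no effect."""
    needle = query.lower()
    return [
        row for row in rows
        if any(needle in str(row.get(field, "")).lower() for field in fields)
    ]
```

A query such as `c++ (beta)` therefore matches only rows containing that exact text, rather than being misread as a regex pattern.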
CI runs the quality checks on Python 3.11 and 3.12. After the Python matrix passes, CI also builds the Docker image, starts the container, and validates the Streamlit health endpoint.

---
## Data contract

The packaged dataset includes:

| File | Purpose |
|---|---|
| `data/eval_runs.csv` | Example-level RAG QA evaluation rows. |
| `data/rag_retrieval_events.csv` | Retrieved chunk events by example and rank. |
| `data/rag_corpus_documents.csv` | Corpus document metadata. |
| `data/rag_corpus_chunks.csv` | Chunk-level corpus text and metadata. |
| `data/scenarios.csv` | Scenario/query metadata. |
| `docs/data_dictionary.csv` | Column reference. |

At startup, `load_bundle()` validates:

- required columns,
- non-empty tables,
- primary key presence and uniqueness,
- required foreign-key presence,
- retrieval-event references to evaluation examples and chunks,
- chunk references to documents,
- scenario references,
- required numeric fields with strict conversion and no silent non-numeric coercion,
- basic metric ranges,
- retrieval relevance flags,
- positive, integer-valued retrieval ranks and non-negative runtime/cost signals.
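Fail-fast validation of this kind needs very little machinery. An illustrative sketch (the real checks live in `src/data.py`; the function below is a hypothetical simplification):

```python
def validate_table(rows: list[dict], *, required_columns: set[str], primary_key: str) -> None:
    """Raise ValueError on the first contract violation instead of silently coercing."""
    if not rows:
        raise ValueError("table is empty")
    missing = required_columns - set(rows[0])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    keys = [row[primary_key] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError(f"duplicate values in primary key {primary_key!r}")
```

A foreign-key pass can then assert, in the same spirit, that every example referenced by a retrieval event exists in the evaluation table.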
See [`docs/artifact_manifest.md`](docs/artifact_manifest.md) for row counts and SHA256 checksums.

---
## Design intent

The project is built to demonstrate production-minded review patterns without pretending to be a live production platform:

- deterministic local data,
- clean app boundaries,
- typed state/context,
- fail-fast data validation,
- focused view modules,
- cross-platform local checks,
- CI checks across Python 3.11 and 3.12,
- Streamlit import smoke check,
- Docker build and container health-smoke verification,
- explicit limitations.

See [`docs/architecture.md`](docs/architecture.md) and [`docs/limitations.md`](docs/limitations.md).

---
## Boundaries

This project does **not** include live RAG serving, online document ingestion, vector indexing, authentication, alerting, scheduled jobs, or production incident automation.

It is a self-contained evaluation command center over bundled synthetic/offline artifacts.

---
## Suggested GitHub About

```text
RAG QA command center for retrieval quality, hallucination exposure, config trade-offs, trace review, and review-policy simulation.
```

## Suggested topics

```text
rag
rag-evaluation
streamlit
plotly
llm-evaluation
retrieval-augmented-generation
hallucination-detection
policy-simulation
trace-review
data-visualization
python
synthetic-data
```

---
## License

MIT License.