RAG QA Command Center
A CI-tested Streamlit command center for reviewing offline RAG QA evaluation logs across retrieval quality, hallucination exposure, configuration trade-offs, trace evidence, and review-policy simulation.
This is not a generic RAG chatbot. It is a deterministic review workspace for understanding where a RAG pipeline succeeds, where it fails, and which examples or configurations deserve investigation.
What it shows
- Quality posture: correctness, hallucination rate, Recall@10, MRR@10, latency, and cost (Recall/MRR are sketched after this list).
- Retrieval diagnostics: separates retrieval weakness from generation failure and hallucination exposure.
- Risk board: ranks domain, scenario, and difficulty slices by error, hallucination, and retrieval risk.
- Config comparison: compares retriever, generator, and chunking combinations under multiple objectives.
- Policy simulator: explores offline review thresholds and review-load trade-offs.
- Trace explorer: inspects example-level questions, gold answers, metadata, diagnosis, and retrieved chunks.
- Export center: downloads filtered evidence, risk tables, config leaderboards, policy curves, and an executive brief JSON.
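As a reference point for the quality-posture metrics, here is a minimal sketch of how Recall@10 and MRR@10 can be computed per example from ranked retrieval events. The function names and the rank representation are illustrative assumptions, not the actual src/analytics.py API.

from typing import Sequence

def recall_at_k(relevant_ranks: Sequence[int], total_relevant: int, k: int = 10) -> float:
    """Fraction of an example's relevant chunks retrieved within the top k."""
    if total_relevant == 0:
        return 0.0
    return sum(1 for r in relevant_ranks if r <= k) / total_relevant

def mrr_at_k(relevant_ranks: Sequence[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant chunk within the top k, else 0."""
    in_window = [r for r in relevant_ranks if r <= k]
    return 1.0 / min(in_window) if in_window else 0.0

# Example: relevant chunks surfaced at ranks 2 and 7, out of 3 relevant chunks total.
print(recall_at_k([2, 7], total_relevant=3))  # ~0.667
print(mrr_at_k([2, 7]))                       # 0.5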
Repository structure
app.py                   # Thin Streamlit entrypoint
src/
  dashboard.py           # App controller and shared orchestration
  app_state.py           # Typed dashboard context models
  models.py              # Runtime settings and release metadata
  data.py                # Data loading, standardization, and fail-fast validation
  analytics.py           # Metrics, risk slices, config scoring, and policy curves
  charts.py              # Plotly chart factories
  formatting.py          # Display formatting helpers
  text_search.py         # Literal user-search helper
  ui.py                  # Reusable UI primitives
  views/                 # Focused tab/page mixins
    overview.py
    quality_map.py
    retrieval_lab.py
    risk_board.py
    config_comparison.py
    policy_simulator.py
    trace_explorer.py
    export_center.py
scripts/
  run_pytest.py          # Cross-platform pytest runner
  check_http_health.py   # HTTP health check helper used by Docker
  docker_smoke.py        # Local Docker build/run/health smoke check
data/                    # Bundled synthetic/offline RAG QA artifacts
docs/                    # Architecture, limitations, manifest, and data dictionary
tests/                   # Data contracts, numeric analytics regressions, smoke tests, and release checks
Dockerfile
docker-compose.yml
Makefile
Quickstart
macOS / Linux
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py
Windows PowerShell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py
Then open:
http://localhost:8501
If port 8501 is already in use, run Streamlit on another port:
streamlit run app.py --server.port 8510
Docker
Build and run the container:
docker build -t rag-qa-command-center:1.0.0 .
docker run --rm -p 8501:8501 rag-qa-command-center:1.0.0
Or use Docker Compose:
docker compose up --build
For a local Docker health-smoke check:
python scripts/docker_smoke.py --image rag-qa-command-center:ci
The smoke check builds the image, starts a container, verifies that it remains running, and checks Streamlit's health endpoint.
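The health probe itself is simple. The sketch below shows the idea in plain stdlib Python, assuming the /_stcore/health endpoint exposed by recent Streamlit releases; the real logic lives in scripts/check_http_health.py.

import urllib.request

def streamlit_healthy(base_url: str = "http://localhost:8501", timeout: float = 5.0) -> bool:
    """Return True if the Streamlit health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/_stcore/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    print("healthy" if streamlit_healthy() else "unhealthy")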
Quality checks
Install the development toolchain from requirements-dev.txt or via the project's optional dev extras:
pip install -r requirements-dev.txt
# or
pip install -e .[dev]
Run the full check target:
make check
Equivalent cross-platform commands:
ruff check app.py src tests scripts
python -m compileall app.py src tests scripts
python scripts/run_pytest.py
python -c "import app; from src.dashboard import CommandCenterApp; CommandCenterApp()"
The test suite covers:
- data contract validation,
- artifact manifest integrity,
- analytics formula regressions,
- static checks for class helper methods,
- literal search behavior for trace filtering (sketched after this list),
- Streamlit import and app-construction smoke checks.
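As an illustration of the literal-search tests, the sketch below pairs a hypothetical stand-in for src/text_search.py with a test asserting that regex metacharacters match literally; the real helper and test names may differ.

def literal_search(rows: list[str], query: str) -> list[str]:
    """Hypothetical stand-in: case-insensitive literal substring match, no regex."""
    q = query.casefold()
    return [row for row in rows if q in row.casefold()]

def test_literal_search_treats_metacharacters_literally():
    rows = ["cost (usd)", "latency p95", "recall@10"]
    # "(USD)" must match as plain text, not as a regex group.
    assert literal_search(rows, "(USD)") == ["cost (usd)"]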
CI runs the checks above across Python 3.11 and 3.12. After the Python matrix passes, CI also builds the Docker image, starts the container, and validates the Streamlit health endpoint.
Data contract
The packaged dataset includes:
| File | Purpose |
|---|---|
| data/eval_runs.csv | Example-level RAG QA evaluation rows. |
| data/rag_retrieval_events.csv | Retrieved chunk events by example and rank. |
| data/rag_corpus_documents.csv | Corpus document metadata. |
| data/rag_corpus_chunks.csv | Chunk-level corpus text and metadata. |
| data/scenarios.csv | Scenario/query metadata. |
| docs/data_dictionary.csv | Column reference. |
At startup, load_bundle() validates the following (a minimal sketch follows this list):
- required columns,
- non-empty tables,
- primary key presence and uniqueness,
- required foreign-key presence,
- retrieval-event references to evaluation examples and chunks,
- chunk references to documents,
- scenario references,
- required numeric fields with strict conversion and no silent non-numeric coercion,
- basic metric ranges,
- retrieval relevance flags,
- positive/integer retrieval ranks and non-negative runtime/cost signals.
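A minimal sketch of the fail-fast pattern, assuming hypothetical table and column names (example_id, latency_ms); the real checks live in src/data.py and cover every item above.

import pandas as pd

def _require_columns(df: pd.DataFrame, cols: set[str], table: str) -> None:
    missing = cols - set(df.columns)
    if missing:
        raise ValueError(f"{table}: missing required columns {sorted(missing)}")

def validate_bundle(eval_runs: pd.DataFrame, retrieval_events: pd.DataFrame) -> None:
    _require_columns(eval_runs, {"example_id", "latency_ms"}, "eval_runs")
    if eval_runs.empty:
        raise ValueError("eval_runs: table is empty")
    if not eval_runs["example_id"].is_unique:
        raise ValueError("eval_runs: example_id must be unique")
    # errors="raise" rejects non-numeric values instead of silently coercing to NaN.
    latency = pd.to_numeric(eval_runs["latency_ms"], errors="raise")
    if (latency < 0).any():
        raise ValueError("eval_runs: latency_ms must be non-negative")
    # Every retrieval event must reference a known evaluation example.
    orphans = ~retrieval_events["example_id"].isin(eval_runs["example_id"])
    if orphans.any():
        raise ValueError(f"retrieval_events: {int(orphans.sum())} rows reference unknown examples")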
See docs/artifact_manifest.md for row counts and SHA256 checksums.
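To recompute the checksums locally and compare them against the manifest, a stdlib approach along these lines works; the glob pattern is an assumption about which artifacts the manifest lists.

import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hex SHA256 digest of a file, streamed in 64 KiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()

for csv_path in sorted(Path("data").glob("*.csv")):
    print(f"{csv_path.name}  {sha256_of(csv_path)}")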
Design intent
The project is built to demonstrate production-minded review patterns without pretending to be a live production platform:
- deterministic local data,
- clean app boundaries,
- typed state/context,
- fail-fast data validation,
- focused view modules,
- cross-platform local checks,
- CI checks across Python 3.11 and 3.12,
- Streamlit import smoke check,
- Docker build and container health-smoke verification,
- explicit limitations.
See docs/architecture.md and docs/limitations.md.
Boundaries
This project does not include live RAG serving, online document ingestion, vector indexing, authentication, alerting, scheduled jobs, or production incident automation.
It is a self-contained evaluation command center over bundled synthetic/offline artifacts.
Suggested GitHub About
RAG QA command center for retrieval quality, hallucination exposure, config trade-offs, trace review, and review-policy simulation.
Suggested topics
rag
rag-evaluation
streamlit
plotly
llm-evaluation
retrieval-augmented-generation
hallucination-detection
policy-simulation
trace-review
data-visualization
python
synthetic-data
License
MIT License.
