
RAG QA Command Center


A CI-tested Streamlit command center for reviewing offline RAG QA evaluation logs across retrieval quality, hallucination exposure, configuration trade-offs, trace evidence, and review-policy simulation.

This is not a generic RAG chatbot. It is a deterministic review workspace for understanding where a RAG pipeline succeeds, where it fails, and which examples or configurations deserve investigation.

RAG QA Command Center overview


What it shows

  • Quality posture: correctness, hallucination rate, Recall@10, MRR@10, latency, and cost.
  • Retrieval diagnostics: separates retrieval weakness from generation failure and hallucination exposure.
  • Risk board: ranks domain, scenario, and difficulty slices by error, hallucination, and retrieval risk.
  • Config comparison: compares retriever, generator, and chunking combinations under multiple objectives.
  • Policy simulator: explores offline review thresholds and review-load trade-offs.
  • Trace explorer: inspects example-level questions, gold answers, metadata, diagnosis, and retrieved chunks.
  • Export center: downloads filtered evidence, risk tables, config leaderboards, policy curves, and an executive brief JSON.
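The Recall@10 and MRR@10 figures above can be sketched from retrieval events shaped like data/rag_retrieval_events.csv rows. This is a minimal illustration only: the column names (example_id, rank, is_relevant) and the exact averaging are assumptions, not the repository's actual schema or the logic in src/analytics.py.

```python
def recall_at_k(events, k=10):
    """Fraction of examples with at least one relevant chunk in the top k."""
    by_example = {}
    for e in events:
        if e["rank"] <= k:
            by_example.setdefault(e["example_id"], []).append(e["is_relevant"])
    hits = sum(1 for flags in by_example.values() if any(flags))
    return hits / len(by_example) if by_example else 0.0

def mrr_at_k(events, k=10):
    """Mean reciprocal rank of the first relevant chunk within the top k,
    averaged over all examples (misses contribute 0)."""
    best = {}
    for e in events:
        if e["rank"] <= k and e["is_relevant"]:
            rid = e["example_id"]
            best[rid] = min(best.get(rid, e["rank"]), e["rank"])
    example_ids = {e["example_id"] for e in events}
    if not example_ids:
        return 0.0
    return sum(1.0 / r for r in best.values()) / len(example_ids)

events = [
    {"example_id": "q1", "rank": 1, "is_relevant": True},
    {"example_id": "q1", "rank": 2, "is_relevant": False},
    {"example_id": "q2", "rank": 1, "is_relevant": False},
    {"example_id": "q2", "rank": 3, "is_relevant": True},
    {"example_id": "q3", "rank": 1, "is_relevant": False},
]
print(recall_at_k(events))  # 2 of 3 examples have a relevant hit in the top 10
print(mrr_at_k(events))     # (1/1 + 1/3 + 0) / 3
```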

Repository structure

app.py                         # Thin Streamlit entrypoint
src/
  dashboard.py                 # App controller and shared orchestration
  app_state.py                 # Typed dashboard context models
  models.py                    # Runtime settings and release metadata
  data.py                      # Data loading, standardization, and fail-fast validation
  analytics.py                 # Metrics, risk slices, config scoring, and policy curves
  charts.py                    # Plotly chart factories
  formatting.py                # Display formatting helpers
  text_search.py               # Literal user-search helper
  ui.py                        # Reusable UI primitives
  views/                       # Focused tab/page mixins
    overview.py
    quality_map.py
    retrieval_lab.py
    risk_board.py
    config_comparison.py
    policy_simulator.py
    trace_explorer.py
    export_center.py
scripts/
  run_pytest.py                # Cross-platform pytest runner
  check_http_health.py         # HTTP health check helper used by Docker
  docker_smoke.py              # Local Docker build/run/health smoke check
data/                          # Bundled synthetic/offline RAG QA artifacts
docs/                          # Architecture, limitations, manifest, and data dictionary
tests/                         # Data contracts, numeric analytics regressions, smoke tests, and release checks
Dockerfile
docker-compose.yml
Makefile

Quickstart

macOS / Linux

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py

Windows PowerShell

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py

Then open:

http://localhost:8501

If port 8501 is already in use, run Streamlit on another port:

streamlit run app.py --server.port 8510

Docker

Build and run the container:

docker build -t rag-qa-command-center:1.0.0 .
docker run --rm -p 8501:8501 rag-qa-command-center:1.0.0

Or use Docker Compose:

docker compose up --build

For a local Docker health-smoke check:

python scripts/docker_smoke.py --image rag-qa-command-center:ci

The smoke check builds the image, starts a container, verifies that it remains running, and checks Streamlit's health endpoint.
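A health probe of this kind can be sketched with the standard library alone. The port and the /_stcore/health path below are assumptions about a default local Streamlit setup, not a reading of scripts/check_http_health.py.

```python
import urllib.error
import urllib.request

def is_healthy(url="http://localhost:8501/_stcore/health", timeout=5.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

# Example usage once the container is up:
# print("healthy" if is_healthy() else "unhealthy")
```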


Quality checks

Install the development toolchain from requirements-dev.txt or via the optional dev extra in the project metadata:

pip install -r requirements-dev.txt
# or
pip install -e .[dev]

Run the full check target:

make check

Equivalent cross-platform commands:

ruff check app.py src tests scripts
python -m compileall app.py src tests scripts
python scripts/run_pytest.py
python -c "import app; from src.dashboard import CommandCenterApp; CommandCenterApp()"

The test suite covers:

  • data contract validation,
  • artifact manifest integrity,
  • analytics formula regressions,
  • static checks for class helper methods,
  • literal search behavior for trace filtering,
  • Streamlit import and app-construction smoke checks.

CI runs Python checks across Python 3.11 and 3.12. After the Python matrix passes, CI also builds the Docker image, starts the container, and validates the Streamlit health endpoint.


Data contract

The packaged dataset includes:

| File | Purpose |
| --- | --- |
| data/eval_runs.csv | Example-level RAG QA evaluation rows. |
| data/rag_retrieval_events.csv | Retrieved chunk events by example and rank. |
| data/rag_corpus_documents.csv | Corpus document metadata. |
| data/rag_corpus_chunks.csv | Chunk-level corpus text and metadata. |
| data/scenarios.csv | Scenario/query metadata. |
| docs/data_dictionary.csv | Column reference. |

At startup, load_bundle() validates:

  • required columns,
  • non-empty tables,
  • primary key presence and uniqueness,
  • required foreign-key presence,
  • retrieval-event references to evaluation examples and chunks,
  • chunk references to documents,
  • scenario references,
  • required numeric fields with strict conversion and no silent non-numeric coercion,
  • basic metric ranges,
  • retrieval relevance flags,
  • positive/integer retrieval ranks and non-negative runtime/cost signals.
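The fail-fast pattern behind these checks can be sketched over in-memory rows. The real validation lives in load_bundle() in src/data.py; the column names and ranges below are illustrative assumptions only.

```python
def validate_rows(rows, required, pk, metric_ranges):
    """Raise ValueError on the first contract violation instead of coercing."""
    if not rows:
        raise ValueError("table is empty")
    missing = required - set(rows[0])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    keys = [r[pk] for r in rows]
    if len(keys) != len(set(keys)):
        raise ValueError(f"duplicate primary key values in {pk!r}")
    for r in rows:
        for col, (lo, hi) in metric_ranges.items():
            value = float(r[col])  # strict conversion: non-numeric raises
            if not lo <= value <= hi:
                raise ValueError(f"{col}={value} outside [{lo}, {hi}]")

rows = [
    {"example_id": "q1", "recall_at_10": "1.0"},
    {"example_id": "q2", "recall_at_10": "0.5"},
]
validate_rows(rows, {"example_id", "recall_at_10"}, "example_id",
              {"recall_at_10": (0.0, 1.0)})
print("bundle ok")
```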

See docs/artifact_manifest.md for row counts and SHA256 checksums.
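Checksum verification of that kind can be sketched with hashlib; the manifest mapping shown in the usage comment is a made-up example, not the repository's real checksum list.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Stream a file through SHA256 and return its hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(manifest):
    """Return the paths whose digest is missing or does not match."""
    return [p for p, expected in manifest.items()
            if not Path(p).is_file() or sha256_of(p) != expected]

# Example usage against a hypothetical manifest entry:
# mismatches = verify({"data/eval_runs.csv": "<expected 64-char hex digest>"})
```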


Design intent

The project is built to demonstrate production-minded review patterns without pretending to be a live production platform:

  • deterministic local data,
  • clean app boundaries,
  • typed state/context,
  • fail-fast data validation,
  • focused view modules,
  • cross-platform local checks,
  • CI checks across Python 3.11 and 3.12,
  • Streamlit import smoke check,
  • Docker build and container health-smoke verification,
  • explicit limitations.

See docs/architecture.md and docs/limitations.md.


Boundaries

This project does not include live RAG serving, online document ingestion, vector indexing, authentication, alerting, scheduled jobs, or production incident automation.

It is a self-contained evaluation command center over bundled synthetic/offline artifacts.


Suggested GitHub About

RAG QA command center for retrieval quality, hallucination exposure, config trade-offs, trace review, and review-policy simulation.

Suggested topics

rag
rag-evaluation
streamlit
plotly
llm-evaluation
retrieval-augmented-generation
hallucination-detection
policy-simulation
trace-review
data-visualization
python
synthetic-data

License

MIT License.