# RAG QA Command Center

![Python](https://img.shields.io/badge/Python-3.11%20%7C%203.12-blue) ![Streamlit](https://img.shields.io/badge/Streamlit-app-FF4B4B?logo=streamlit&logoColor=white) ![Tests](https://img.shields.io/badge/Tests-CI%20checked-success) ![Docker](https://img.shields.io/badge/Docker-health%20smoke-blue) ![License](https://img.shields.io/badge/License-MIT-green) ![Version](https://img.shields.io/badge/Version-v1.0.0-black)

**A CI-tested Streamlit command center for reviewing offline RAG QA evaluation logs across retrieval quality, hallucination exposure, configuration trade-offs, trace evidence, and review-policy simulation.**

This is not a generic RAG chatbot. It is a deterministic review workspace for understanding where a RAG pipeline succeeds, where it fails, and which examples or configurations deserve investigation.

![RAG QA Command Center overview](assets/Overview.png)

---

## What it shows

- **Quality posture:** correctness, hallucination rate, Recall@10, MRR@10, latency, and cost.
- **Retrieval diagnostics:** separates retrieval weakness from generation failure and hallucination exposure.
- **Risk board:** ranks domain, scenario, and difficulty slices by error, hallucination, and retrieval risk.
- **Config comparison:** compares retriever, generator, and chunking combinations under multiple objectives.
- **Policy simulator:** explores offline review thresholds and review-load trade-offs.
- **Trace explorer:** inspects example-level questions, gold answers, metadata, diagnosis, and retrieved chunks.
- **Export center:** downloads filtered evidence, risk tables, config leaderboards, policy curves, and an executive brief JSON.
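The review-policy trade-off in the bullets above can be sketched in a few lines. The following is a minimal, hypothetical illustration (the field names and the confidence-threshold policy are assumptions made for this sketch, not the project's actual simulator logic): each example carries a model confidence score, and a policy routes everything below a threshold to human review.

```python
def policy_curve(examples, thresholds):
    """For each threshold, report review load and the rate of unreviewed errors.

    examples: list of (confidence, is_correct) pairs.
    Returns one dict per threshold.
    """
    total = len(examples)
    curve = []
    for t in thresholds:
        reviewed = sum(1 for conf, _ in examples if conf < t)
        escaped = sum(1 for conf, ok in examples if conf >= t and not ok)
        curve.append({
            "threshold": t,
            "review_load": reviewed / total,        # share of examples sent to humans
            "escaped_error_rate": escaped / total,  # wrong answers that skip review
        })
    return curve


# Toy data: (model confidence, answered correctly)
examples = [(0.95, True), (0.80, True), (0.60, False), (0.40, False), (0.30, True)]
for point in policy_curve(examples, thresholds=[0.5, 0.7, 0.9]):
    print(point)
```

Plotting `review_load` against `escaped_error_rate` across thresholds yields the kind of trade-off curve the Policy simulator tab is built to explore.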
---

## Repository structure

```text
app.py                   # Thin Streamlit entrypoint
src/
  dashboard.py           # App controller and shared orchestration
  app_state.py           # Typed dashboard context models
  models.py              # Runtime settings and release metadata
  data.py                # Data loading, standardization, and fail-fast validation
  analytics.py           # Metrics, risk slices, config scoring, and policy curves
  charts.py              # Plotly chart factories
  formatting.py          # Display formatting helpers
  text_search.py         # Literal user-search helper
  ui.py                  # Reusable UI primitives
  views/                 # Focused tab/page mixins
    overview.py
    quality_map.py
    retrieval_lab.py
    risk_board.py
    config_comparison.py
    policy_simulator.py
    trace_explorer.py
    export_center.py
scripts/
  run_pytest.py          # Cross-platform pytest runner
  check_http_health.py   # HTTP health check helper used by Docker
  docker_smoke.py        # Local Docker build/run/health smoke check
data/                    # Bundled synthetic/offline RAG QA artifacts
docs/                    # Architecture, limitations, manifest, and data dictionary
tests/                   # Data contracts, numeric analytics regressions, smoke tests, and release checks
Dockerfile
docker-compose.yml
Makefile
```

---

## Quickstart

### macOS / Linux

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py
```

### Windows PowerShell

```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
streamlit run app.py
```

Then open:

```text
http://localhost:8501
```

If port `8501` is already in use, run Streamlit on another port:

```bash
streamlit run app.py --server.port 8510
```

---

## Docker

Build and run the container:

```bash
docker build -t rag-qa-command-center:1.0.0 .
docker run --rm -p 8501:8501 rag-qa-command-center:1.0.0
```

Or use Docker Compose:

```bash
docker compose up --build
```

For a local Docker health-smoke check:

```bash
python scripts/docker_smoke.py --image rag-qa-command-center:ci
```

The smoke check builds the image, starts a container, verifies that it remains running, and checks Streamlit's health endpoint.

---

## Quality checks

Install the development toolchain with either `requirements-dev.txt` or the optional project metadata:

```bash
pip install -r requirements-dev.txt
# or
pip install -e .[dev]
```

Run the full check target:

```bash
make check
```

Equivalent cross-platform commands:

```bash
ruff check app.py src tests scripts
python -m compileall app.py src tests scripts
python scripts/run_pytest.py
python -c "import app; from src.dashboard import CommandCenterApp; CommandCenterApp()"
```

The test suite covers:

- data contract validation,
- artifact manifest integrity,
- analytics formula regressions,
- static checks for class helper methods,
- literal search behavior for trace filtering,
- Streamlit import and app-construction smoke checks.

CI runs these checks on Python 3.11 and 3.12. After the Python matrix passes, CI also builds the Docker image, starts the container, and validates the Streamlit health endpoint.

---

## Data contract

The packaged dataset includes:

| File | Purpose |
|---|---|
| `data/eval_runs.csv` | Example-level RAG QA evaluation rows. |
| `data/rag_retrieval_events.csv` | Retrieved chunk events by example and rank. |
| `data/rag_corpus_documents.csv` | Corpus document metadata. |
| `data/rag_corpus_chunks.csv` | Chunk-level corpus text and metadata. |
| `data/scenarios.csv` | Scenario/query metadata. |
| `docs/data_dictionary.csv` | Column reference. |
At startup, `load_bundle()` validates:

- required columns,
- non-empty tables,
- primary key presence and uniqueness,
- required foreign-key presence,
- retrieval-event references to evaluation examples and chunks,
- chunk references to documents,
- scenario references,
- required numeric fields with strict conversion and no silent non-numeric coercion,
- basic metric ranges,
- retrieval relevance flags,
- positive, integer-valued retrieval ranks and non-negative runtime/cost signals.

See [`docs/artifact_manifest.md`](docs/artifact_manifest.md) for row counts and SHA256 checksums.

---

## Design intent

The project is built to demonstrate production-minded review patterns without pretending to be a live production platform:

- deterministic local data,
- clean app boundaries,
- typed state/context,
- fail-fast data validation,
- focused view modules,
- cross-platform local checks,
- CI checks across Python 3.11 and 3.12,
- Streamlit import smoke check,
- Docker build and container health-smoke verification,
- explicit limitations.

See [`docs/architecture.md`](docs/architecture.md) and [`docs/limitations.md`](docs/limitations.md).

---

## Boundaries

This project does **not** include live RAG serving, online document ingestion, vector indexing, authentication, alerting, scheduled jobs, or production incident automation. It is a self-contained evaluation command center over bundled synthetic/offline artifacts.

---

## Suggested GitHub About

```text
RAG QA command center for retrieval quality, hallucination exposure, config trade-offs, trace review, and review-policy simulation.
```

## Suggested topics

```text
rag rag-evaluation streamlit plotly llm-evaluation retrieval-augmented-generation hallucination-detection policy-simulation trace-review data-visualization python synthetic-data
```

---

## License

MIT License.
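---

## Appendix: fail-fast validation sketch

The startup checks described under Data contract follow a validate-then-render pattern that can be approximated in a few lines. This is a hedged illustration with hypothetical column names and a hypothetical `validate_table` helper, not the project's actual `load_bundle()` implementation:

```python
import csv
import io

def validate_table(rows, required_columns, primary_key):
    """Fail fast: raise on empty tables, missing columns, or duplicate keys."""
    if not rows:
        raise ValueError("table is empty")
    missing = required_columns - rows[0].keys()
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    keys = [row[primary_key] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError(f"duplicate values in primary key {primary_key!r}")
    return rows

# Toy stand-in for an evaluation table (hypothetical columns):
raw = "example_id,correct\ne1,1\ne2,0\n"
rows = list(csv.DictReader(io.StringIO(raw)))
validate_table(rows, required_columns={"example_id", "correct"}, primary_key="example_id")
```

The real loader layers many more checks on top (foreign keys, metric ranges, strict numeric conversion), but the principle is the same: raise immediately at startup rather than render partial or inconsistent data.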