# Evaluation runner and reporting

This folder contains evaluation scripts and helpers.

Key files:

- `run_evaluation.py` - standard evaluation runner (token-overlap + citation checks)
- `enhanced_evaluation.py` - enhanced groundedness evaluation that can use an LLM evaluator
- `run_and_archive.sh` - convenience script that runs both evaluators and copies outputs to `../evaluation_results/`

## How to run locally

Set your target endpoint (defaults to `http://localhost:5000`):

```bash
EVAL_TARGET_URL="http://localhost:5000" bash evaluation/run_and_archive.sh
```

## CI Integration

A GitHub Actions workflow, `.github/workflows/evaluation.yml`, is included. When triggered it will:

- Check out the repo and install dependencies
- Run `evaluation/run_and_archive.sh` (the target URL can be provided via the `EVAL_TARGET_URL` secret)
- Upload the `evaluation_results/` folder as a workflow artifact for later retrieval

## Where results are stored

The evaluation scripts write their detailed JSON outputs to `evaluation/` (e.g. `results.json`, `enhanced_results.json`). The `run_and_archive.sh` script then copies timestamped snapshots of those files into the top-level `evaluation_results/` directory so that CI artifacts can be aggregated.

## Evaluation runner

This directory contains a small, reproducible evaluation harness that measures the following (a minimal illustrative sketch appears at the end of this README):

- Groundedness (approximate): token overlap between the model response and the gold answer
- Citation accuracy (approximate): fraction of the expected source filenames returned in the `sources` field
- Latency: p50 and p95 response times for the `POST /chat` endpoint

Files:

- `questions.json` — 20 evaluation questions covering policy areas
- `gold_answers.json` — short canonical answers and expected source filenames for each question
- `run_evaluation.py` — runner that posts to `/chat`, records responses, computes summary metrics, and writes `results.json`

How to run (local):

1. Start the app locally (default target `http://localhost:5000`):

   ```bash
   # from repo root
   python app.py
   ```

2. Run the evaluation runner against the local target:

   ```bash
   python evaluation/run_evaluation.py
   ```

How to run (deployed target):

```bash
EVAL_TARGET_URL=https://msse-ai-engineering.onrender.com python evaluation/run_evaluation.py
```

Notes & limitations:

- The groundedness and citation metrics are approximations chosen to keep the evaluation reproducible without direct access to the internal vector-store content; they should be read as lower-fidelity but repeatable checks.
- A full, high-fidelity evaluation would fetch the content of the cited chunks and verify that the model's statements are grounded in those chunks. That requires API access to the vector store, or a server-side endpoint that can return the chunk text for a given source id.
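To make the approximate metrics concrete, below is a minimal, self-contained sketch of the evaluation loop. It is illustrative only, not the actual `run_evaluation.py`, and it assumes schemas that may differ from the real files and endpoint: `questions.json` as a list of `{"id", "question"}` objects, `gold_answers.json` as a map from id to `{"answer", "sources"}`, and a `/chat` request/response shaped as `{"message": ...}` in and `{"answer", "sources"}` out.

```python
"""Minimal sketch of the evaluation loop described above (illustrative, not run_evaluation.py).

Field names marked ASSUMED are guesses about the JSON schemas; adjust to match the real files.
"""
import json
import os
import statistics
import time

import requests  # third-party HTTP client

TARGET_URL = os.environ.get("EVAL_TARGET_URL", "http://localhost:5000")


def token_overlap(response_text: str, gold_text: str) -> float:
    """Approximate groundedness: fraction of gold-answer tokens that appear in the response."""
    gold_tokens = set(gold_text.lower().split())
    response_tokens = set(response_text.lower().split())
    if not gold_tokens:
        return 0.0
    return len(gold_tokens & response_tokens) / len(gold_tokens)


def citation_accuracy(returned_sources: list[str], expected_sources: list[str]) -> float:
    """Approximate citation accuracy: fraction of expected source filenames that were returned."""
    if not expected_sources:
        return 1.0
    returned = set(returned_sources)
    return sum(1 for s in expected_sources if s in returned) / len(expected_sources)


def percentile(values: list[float], pct: float) -> float:
    """Simple nearest-rank percentile over recorded latencies (used for p50 and p95)."""
    ordered = sorted(values)
    index = max(0, min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1))))
    return ordered[index]


def main() -> None:
    # ASSUMED schemas: questions.json is a list of {"id", "question"};
    # gold_answers.json maps id -> {"answer", "sources"}.
    with open("evaluation/questions.json") as f:
        questions = json.load(f)
    with open("evaluation/gold_answers.json") as f:
        gold = json.load(f)

    latencies, records = [], []
    for q in questions:
        start = time.perf_counter()
        # ASSUMED request/response shape: {"message": ...} in, {"answer", "sources"} out.
        resp = requests.post(f"{TARGET_URL}/chat", json={"message": q["question"]}, timeout=60)
        latencies.append(time.perf_counter() - start)
        body = resp.json()
        g = gold[q["id"]]
        records.append({
            "id": q["id"],
            "groundedness": token_overlap(body.get("answer", ""), g["answer"]),
            "citation_accuracy": citation_accuracy(body.get("sources", []), g["sources"]),
        })

    summary = {
        "groundedness_mean": statistics.mean(r["groundedness"] for r in records),
        "citation_accuracy_mean": statistics.mean(r["citation_accuracy"] for r in records),
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
    }
    with open("evaluation/results.json", "w") as f:
        json.dump({"summary": summary, "records": records}, f, indent=2)
    print(json.dumps(summary, indent=2))


if __name__ == "__main__":
    main()
```

Like the real runner, the sketch reads `EVAL_TARGET_URL` from the environment and falls back to `http://localhost:5000`; it only illustrates the metric definitions and is not a substitute for `run_evaluation.py`.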