# Evaluation runner and reporting

This folder contains evaluation scripts and helpers.

Key files:

- `run_evaluation.py` - standard evaluation runner (token-overlap + citation checks)
- `enhanced_evaluation.py` - enhanced groundedness evaluation that can use an LLM evaluator
- `run_and_archive.sh` - convenience script that runs both evaluators and copies outputs to `../evaluation_results/`

## How to run locally

Set your target endpoint (defaults to `http://localhost:5000`):

```bash
EVAL_TARGET_URL="http://localhost:5000" bash evaluation/run_and_archive.sh
```

## CI Integration

A GitHub Actions workflow, `.github/workflows/evaluation.yml`, is included. When triggered it will:

- Check out the repo and install dependencies
- Run `evaluation/run_and_archive.sh` (the target URL can be provided via the `EVAL_TARGET_URL` secret)
- Upload the `evaluation_results/` folder as a workflow artifact for later retrieval

## Where results are stored

The evaluation scripts write their detailed JSON outputs to `evaluation/` (e.g. `results.json`, `enhanced_results.json`). The `run_and_archive.sh` script then copies timestamped snapshots of those files into the top-level `evaluation_results/` directory so that CI artifacts can be aggregated.

## Evaluation runner

This directory contains a small, reproducible evaluation harness that measures the following (a minimal illustrative sketch appears at the end of this README):

- Groundedness (approximate): token overlap between the model response and the gold answer
- Citation accuracy (approximate): fraction of the expected source filenames returned in the `sources` field
- Latency: p50 and p95 response times for the `POST /chat` endpoint

Files:

- `questions.json` — 20 evaluation questions covering policy areas
- `gold_answers.json` — short canonical answers and expected source filenames for each question
- `run_evaluation.py` — runner that posts to `/chat`, records responses, computes summary metrics, and writes `results.json`

How to run (local):

1. Start the app locally (default target `http://localhost:5000`):

   ```bash
   # from repo root
   python app.py
   ```

2. Run the evaluation runner against the local target:

   ```bash
   python evaluation/run_evaluation.py
   ```

How to run (deployed target):

```bash
EVAL_TARGET_URL=https://msse-ai-engineering.onrender.com python evaluation/run_evaluation.py
```

Notes & limitations:

- The groundedness and citation metrics are approximations chosen to keep the evaluation reproducible without direct access to the internal vector-store content; they should be read as lower-fidelity but repeatable checks.
- A full, high-fidelity evaluation would fetch the content of the cited chunks and verify that the model's statements are grounded in those chunks. That requires API access to the vector store, or a server-side endpoint that can return the chunk text for a given source id.
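To make the approximate metrics concrete, below is a minimal, self-contained sketch of the evaluation loop. It is illustrative only, not the actual `run_evaluation.py`, and it assumes schemas that may differ from the real files and endpoint: `questions.json` as a list of `{"id", "question"}` objects, `gold_answers.json` as a map from id to `{"answer", "sources"}`, and a `/chat` request/response shaped as `{"message": ...}` in and `{"answer", "sources"}` out.

```python
"""Minimal sketch of the evaluation loop described above (illustrative, not run_evaluation.py).

Field names marked ASSUMED are guesses about the JSON schemas; adjust to match the real files.
"""
import json
import os
import statistics
import time

import requests  # third-party HTTP client

TARGET_URL = os.environ.get("EVAL_TARGET_URL", "http://localhost:5000")


def token_overlap(response_text: str, gold_text: str) -> float:
    """Approximate groundedness: fraction of gold-answer tokens that appear in the response."""
    gold_tokens = set(gold_text.lower().split())
    response_tokens = set(response_text.lower().split())
    if not gold_tokens:
        return 0.0
    return len(gold_tokens & response_tokens) / len(gold_tokens)


def citation_accuracy(returned_sources: list[str], expected_sources: list[str]) -> float:
    """Approximate citation accuracy: fraction of expected source filenames that were returned."""
    if not expected_sources:
        return 1.0
    returned = set(returned_sources)
    return sum(1 for s in expected_sources if s in returned) / len(expected_sources)


def percentile(values: list[float], pct: float) -> float:
    """Simple nearest-rank percentile over recorded latencies (used for p50 and p95)."""
    ordered = sorted(values)
    index = max(0, min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1))))
    return ordered[index]


def main() -> None:
    # ASSUMED schemas: questions.json is a list of {"id", "question"};
    # gold_answers.json maps id -> {"answer", "sources"}.
    with open("evaluation/questions.json") as f:
        questions = json.load(f)
    with open("evaluation/gold_answers.json") as f:
        gold = json.load(f)

    latencies, records = [], []
    for q in questions:
        start = time.perf_counter()
        # ASSUMED request/response shape: {"message": ...} in, {"answer", "sources"} out.
        resp = requests.post(f"{TARGET_URL}/chat", json={"message": q["question"]}, timeout=60)
        latencies.append(time.perf_counter() - start)
        body = resp.json()
        g = gold[q["id"]]
        records.append({
            "id": q["id"],
            "groundedness": token_overlap(body.get("answer", ""), g["answer"]),
            "citation_accuracy": citation_accuracy(body.get("sources", []), g["sources"]),
        })

    summary = {
        "groundedness_mean": statistics.mean(r["groundedness"] for r in records),
        "citation_accuracy_mean": statistics.mean(r["citation_accuracy"] for r in records),
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
    }
    with open("evaluation/results.json", "w") as f:
        json.dump({"summary": summary, "records": records}, f, indent=2)
    print(json.dumps(summary, indent=2))


if __name__ == "__main__":
    main()
```

Like the real runner, the sketch reads `EVAL_TARGET_URL` from the environment and falls back to `http://localhost:5000`; it only illustrates the metric definitions and is not a substitute for `run_evaluation.py`.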