Evaluation runner and reporting

This folder contains evaluation scripts and helpers. Key files:

  • run_evaluation.py - standard evaluation runner (token-overlap + citation checks)
  • enhanced_evaluation.py - enhanced groundedness evaluation that can use an LLM evaluator
  • run_and_archive.sh - convenience script that runs both evaluators and copies outputs to ../evaluation_results/

How to run locally

Set your target endpoint (defaults to http://localhost:5000):

EVAL_TARGET_URL="http://localhost:5000" bash evaluation/run_and_archive.sh

CI Integration

A GitHub Actions workflow, .github/workflows/evaluation.yml, is included. When triggered, it will:

  • Check out the repo and install dependencies
  • Run evaluation/run_and_archive.sh (target URL can be provided via the EVAL_TARGET_URL secret)
  • Upload the evaluation_results/ folder as a workflow artifact for later retrieval

Where results are stored

The evaluation scripts write their detailed JSON outputs to evaluation/ (e.g. results.json, enhanced_results.json). The run_and_archive.sh script then places timestamped copies in the top-level evaluation_results/ directory so that CI artifacts can be aggregated.

Evaluation runner

This directory contains a small, reproducible evaluation harness to measure:

  • Groundedness (approximate): token overlap between the model response and the gold answer
  • Citation accuracy (approximate): fraction of the expected source filenames that appear in the returned sources field
  • Latency: p50 and p95 response times for the POST /chat endpoint (a sketch of all three computations follows this list)
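
All three approximations are simple to compute. A minimal sketch, assuming each record carries the model response, the gold answer, the returned sources, and the measured latency (field and function names here are illustrative, not necessarily those used in run_evaluation.py):

import statistics

def token_overlap(response, gold):
    # Fraction of gold-answer tokens that also appear in the model response.
    gold_tokens = set(gold.lower().split())
    resp_tokens = set(response.lower().split())
    return len(gold_tokens & resp_tokens) / len(gold_tokens) if gold_tokens else 0.0

def citation_accuracy(returned_sources, expected_sources):
    # Fraction of expected source filenames present in the returned sources field.
    if not expected_sources:
        return 1.0
    returned = set(returned_sources)
    return sum(1 for s in expected_sources if s in returned) / len(expected_sources)

def latency_percentiles(latencies_s):
    # p50 and p95 of per-request latencies in seconds (needs at least two samples).
    q = statistics.quantiles(latencies_s, n=100)
    return q[49], q[94]

Using set overlap rather than an LLM judge keeps the check deterministic and cheap, which is the point of the standard runner; enhanced_evaluation.py is where the LLM-based judgment lives.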

Files:

  • questions.json — 20 evaluation questions covering policy areas
  • gold_answers.json — short canonical answers and expected source filenames for each question
  • run_evaluation.py — runner that posts each question to /chat, records responses and latencies, computes the summary metrics, and writes results.json (see the request-loop sketch after this list)
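
A rough sketch of that request loop is below; the /chat payload and response field names ("message", "response", "sources") are assumptions for illustration, and the real shapes are defined by the app and by run_evaluation.py itself:

import json
import os
import time

import requests

TARGET = os.environ.get("EVAL_TARGET_URL", "http://localhost:5000")

def run(questions):
    records = []
    for q in questions:
        start = time.time()
        # Assumed request/response shape: {"message": ...} in, {"response": ..., "sources": [...]} out.
        resp = requests.post(f"{TARGET}/chat", json={"message": q["question"]}, timeout=60)
        latency_s = time.time() - start
        body = resp.json()
        records.append({
            "id": q.get("id"),
            "response": body.get("response", ""),
            "sources": body.get("sources", []),
            "latency_s": latency_s,
        })
    with open("evaluation/results.json", "w") as f:
        json.dump(records, f, indent=2)
    return records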

How to run (local):

  1. Start the app locally (default target http://localhost:5000):

# from repo root
python app.py

  2. Run the evaluation runner (local target):

python evaluation/run_evaluation.py

How to run (deployed target):

EVAL_TARGET_URL=https://msse-ai-engineering.onrender.com python evaluation/run_evaluation.py

Notes & limitations:

  • The groundedness and citation metrics are approximations to keep the evaluation reproducible without direct access to internal vector-store content. They should be interpreted as lower-fidelity but repeatable checks.
  • For a full, high-fidelity evaluation, the runner would fetch the content of the cited chunks and verify that the model's statements are grounded in them. That requires API access to the vector store, or a server-side endpoint that can return the chunk text for a given source id; a hedged sketch of that flow follows.
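
If such an endpoint existed, the higher-fidelity check could look roughly like this; the /chunks/<source_id> route and its JSON shape are hypothetical:

import requests

def grounded_fraction(response_sentences, source_ids,
                      base_url="http://localhost:5000", threshold=0.7):
    # Approximate check: a sentence counts as grounded when at least `threshold`
    # of its tokens appear in at least one cited chunk.
    # NOTE: the /chunks/<source_id> route and its {"text": ...} payload are
    # hypothetical; the current app does not expose them.
    chunk_token_sets = []
    for sid in source_ids:
        r = requests.get(f"{base_url}/chunks/{sid}", timeout=30)
        chunk_token_sets.append(set(r.json().get("text", "").lower().split()))
    grounded = 0
    for sentence in response_sentences:
        tokens = set(sentence.lower().split())
        if not tokens:
            continue
        best = max((len(tokens & chunk) / len(tokens) for chunk in chunk_token_sets), default=0.0)
        if best >= threshold:
            grounded += 1
    return grounded / len(response_sentences) if response_sentences else 0.0

Because the route is speculative, this is best read as a design note for a future server-side endpoint rather than something to run today.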