# Evaluation runner and reporting
This folder contains evaluation scripts and helpers. Key files:
- `run_evaluation.py` - standard evaluation runner (token-overlap + citation checks)
- `enhanced_evaluation.py` - enhanced groundedness evaluation that can use an LLM evaluator
- `run_and_archive.sh` - convenience script that runs both evaluators and copies outputs to `../evaluation_results/`
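For reference, here is a minimal sketch of the kind of token-overlap check `run_evaluation.py` performs against gold answers. The function name and tokenization are assumptions; see the script itself for the actual implementation:

```python
import re

def token_overlap(response: str, gold: str) -> float:
    """Approximate groundedness: fraction of gold-answer tokens that also
    appear in the model response (name and tokenization are illustrative)."""
    tokenize = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    gold_tokens = tokenize(gold)
    if not gold_tokens:
        return 0.0
    return len(gold_tokens & tokenize(response)) / len(gold_tokens)

# Example: 1.0 means every gold-answer token appears in the response.
print(token_overlap("Employees accrue 15 vacation days per year.",
                    "15 vacation days per year"))
```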
## How to run locally

Set your target endpoint (defaults to `http://localhost:5000`):

```bash
EVAL_TARGET_URL="http://localhost:5000" bash evaluation/run_and_archive.sh
```
## CI Integration
A GitHub Actions workflow, `.github/workflows/evaluation.yml`, is included. When triggered it will:
- Check out the repo and install dependencies
- Run `evaluation/run_and_archive.sh` (the target URL can be provided via the `EVAL_TARGET_URL` secret)
- Upload the `evaluation_results/` folder as a workflow artifact for later retrieval
## Where results are stored
The evaluation scripts write their detailed JSON outputs to `evaluation/` (e.g. `results.json`, `enhanced_results.json`). `run_and_archive.sh` then places timestamped copies in the top-level `evaluation_results/` directory so CI artifacts can be aggregated.
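As a rough illustration, aggregating those timestamped artifacts could look like the sketch below; the `results_*.json` filename pattern and the `summary` key are assumptions about the report schema:

```python
import json
from pathlib import Path

# Collect every archived run; the filename pattern is an assumption.
results_dir = Path("evaluation_results")
for path in sorted(results_dir.glob("results_*.json")):
    data = json.loads(path.read_text())
    # "summary" is an assumed top-level key holding the metric rollup.
    print(path.name, data.get("summary", {}))
```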
## Evaluation runner
This directory contains a small, reproducible evaluation harness to measure:
- Groundedness (approx): token overlap of the model response vs the gold answer
- Citation accuracy (approx): fraction of expected source filenames returned in the `sources` field
- Latency: p50 and p95 response times for the `POST /chat` endpoint (see the sketch below)
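Here is a minimal sketch of the latency measurement, assuming the request payload uses a `message` key and timing is simple wall clock (both assumptions; `run_evaluation.py` may differ in detail):

```python
import time
import requests  # third-party: pip install requests

def measure_latencies(base_url: str, questions: list[str]) -> list[float]:
    """POST each question to /chat and record wall-clock latency in seconds.
    The {"message": ...} payload shape is an assumption."""
    latencies = []
    for question in questions:
        start = time.perf_counter()
        requests.post(f"{base_url}/chat", json={"message": question}, timeout=60)
        latencies.append(time.perf_counter() - start)
    return latencies

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; adequate for a 20-question sample."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1)))
    return ordered[index]

# latencies = measure_latencies("http://localhost:5000", questions)
# print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))
```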
Files:
- `questions.json` - 20 evaluation questions covering policy areas
- `gold_answers.json` - short canonical answers and expected source filenames for each question
- `run_evaluation.py` - runner that posts to `/chat`, records responses, computes summary metrics, and writes `results.json`
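To make the citation-accuracy metric concrete, here is a hedged sketch of comparing expected filenames from `gold_answers.json` against a response's `sources` field (the field and file names are assumptions about the JSON schema):

```python
def citation_accuracy(expected: list[str], returned: list[str]) -> float:
    """Fraction of expected source filenames found in the response's
    sources list (field and file names here are illustrative)."""
    if not expected:
        return 1.0  # nothing expected, so nothing can be missed
    returned_set = set(returned)
    return sum(1 for src in expected if src in returned_set) / len(expected)

# Example: both expected sources were cited, so accuracy is 1.0.
print(citation_accuracy(["vacation_policy.md", "handbook.md"],
                        ["handbook.md", "vacation_policy.md", "extra.md"]))
```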
How to run (local):
- Start the app locally (default target `http://localhost:5000`):

  ```bash
  # from repo root
  python app.py
  ```

- Run the evaluation runner (local target):

  ```bash
  python evaluation/run_evaluation.py
  ```
How to run (deployed target):
```bash
EVAL_TARGET_URL=https://msse-ai-engineering.onrender.com python evaluation/run_evaluation.py
```
Notes & limitations:
- The groundedness and citation metrics are approximations to keep the evaluation reproducible without direct access to internal vector-store content. They should be interpreted as lower-fidelity but repeatable checks.
- For a full, high-fidelity evaluation, the runner would fetch the content of the actual cited chunks and verify that model statements are grounded in those chunks. That requires API access to the vector store or a server-side endpoint that can return chunk text for a source id.
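If such an endpoint existed, the high-fidelity check might look roughly like this; the `/chunks/<id>` route and its response shape are purely hypothetical:

```python
import requests  # third-party: pip install requests

def grounded_fraction(base_url: str, answer_sentences: list[str],
                      source_ids: list[str]) -> float:
    """Hypothetical high-fidelity check: fetch the text of each cited chunk
    and count answer sentences that appear verbatim in at least one chunk.
    The /chunks/<id> endpoint does not exist in this repo; it illustrates
    what a server-side chunk-text API would enable."""
    chunks = []
    for source_id in source_ids:
        resp = requests.get(f"{base_url}/chunks/{source_id}", timeout=30)
        chunks.append(resp.json().get("text", "").lower())
    if not answer_sentences:
        return 0.0
    grounded = sum(
        1 for sentence in answer_sentences
        if any(sentence.lower() in chunk for chunk in chunks)
    )
    return grounded / len(answer_sentences)
```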