# Evaluation runner and reporting
This folder contains evaluation scripts and helpers. Key files:
- `run_evaluation.py` - standard evaluation runner (token-overlap + citation checks)
- `enhanced_evaluation.py` - enhanced groundedness evaluation that can use an LLM evaluator
- `run_and_archive.sh` - convenience script that runs both evaluators and copies outputs to `../evaluation_results/`
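For reference, here is a minimal sketch of the kind of token-overlap check `run_evaluation.py` performs against gold answers. The function name and tokenization are assumptions; see the script itself for the actual implementation:

```python
import re

def token_overlap(response: str, gold: str) -> float:
    """Approximate groundedness: fraction of gold-answer tokens that also
    appear in the model response (name and tokenization are illustrative)."""
    tokenize = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    gold_tokens = tokenize(gold)
    if not gold_tokens:
        return 0.0
    return len(gold_tokens & tokenize(response)) / len(gold_tokens)

# Example: 1.0 means every gold-answer token appears in the response.
print(token_overlap("Employees accrue 15 vacation days per year.",
                    "15 vacation days per year"))
```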
## How to run locally

Set your target endpoint (defaults to `http://localhost:5000`):

```bash
EVAL_TARGET_URL="http://localhost:5000" bash evaluation/run_and_archive.sh
```
## CI Integration
A GitHub Actions workflow, `.github/workflows/evaluation.yml`, is included. When triggered it will:
- Check out the repo and install dependencies
- Run `evaluation/run_and_archive.sh` (the target URL can be provided via the `EVAL_TARGET_URL` secret)
- Upload the `evaluation_results/` folder as a workflow artifact for later retrieval
## Where results are stored
The evaluation scripts write their detailed JSON outputs to `evaluation/` (e.g. `results.json`, `enhanced_results.json`). `run_and_archive.sh` then places timestamped copies in the top-level `evaluation_results/` directory so CI artifacts can be aggregated.
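As a rough illustration, aggregating those timestamped artifacts could look like the sketch below; the `results_*.json` filename pattern and the `summary` key are assumptions about the report schema:

```python
import json
from pathlib import Path

# Collect every archived run; the filename pattern is an assumption.
results_dir = Path("evaluation_results")
for path in sorted(results_dir.glob("results_*.json")):
    data = json.loads(path.read_text())
    # "summary" is an assumed top-level key holding the metric rollup.
    print(path.name, data.get("summary", {}))
```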
## Evaluation runner
This directory contains a small, reproducible evaluation harness to measure:
- Groundedness (approx): token overlap of the model response vs the gold answer
- Citation accuracy (approx): fraction of expected source filenames returned in the `sources` field
- Latency: p50 and p95 response times for the `POST /chat` endpoint (see the sketch below)
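Here is a minimal sketch of the latency measurement, assuming the request payload uses a `message` key and timing is simple wall clock (both assumptions; `run_evaluation.py` may differ in detail):

```python
import time
import requests  # third-party: pip install requests

def measure_latencies(base_url: str, questions: list[str]) -> list[float]:
    """POST each question to /chat and record wall-clock latency in seconds.
    The {"message": ...} payload shape is an assumption."""
    latencies = []
    for question in questions:
        start = time.perf_counter()
        requests.post(f"{base_url}/chat", json={"message": question}, timeout=60)
        latencies.append(time.perf_counter() - start)
    return latencies

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; adequate for a 20-question sample."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1)))
    return ordered[index]

# latencies = measure_latencies("http://localhost:5000", questions)
# print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))
```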
Files:
- `questions.json` - 20 evaluation questions covering policy areas
- `gold_answers.json` - short canonical answers and expected source filenames for each question
- `run_evaluation.py` - runner that posts to `/chat`, records responses, computes summary metrics, and writes `results.json`
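To make the citation-accuracy metric concrete, here is a hedged sketch of comparing expected filenames from `gold_answers.json` against a response's `sources` field (the field and file names are assumptions about the JSON schema):

```python
def citation_accuracy(expected: list[str], returned: list[str]) -> float:
    """Fraction of expected source filenames found in the response's
    sources list (field and file names here are illustrative)."""
    if not expected:
        return 1.0  # nothing expected, so nothing can be missed
    returned_set = set(returned)
    return sum(1 for src in expected if src in returned_set) / len(expected)

# Example: both expected sources were cited, so accuracy is 1.0.
print(citation_accuracy(["vacation_policy.md", "handbook.md"],
                        ["handbook.md", "vacation_policy.md", "extra.md"]))
```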
How to run (local):
- Start the app locally (default target `http://localhost:5000`):

  ```bash
  # from repo root
  python app.py
  ```

- Run the evaluation runner (local target):

  ```bash
  python evaluation/run_evaluation.py
  ```
How to run (deployed target):
```bash
EVAL_TARGET_URL=https://msse-ai-engineering.onrender.com python evaluation/run_evaluation.py
```
Notes & limitations:
- The groundedness and citation metrics are approximations to keep the evaluation reproducible without direct access to internal vector-store content. They should be interpreted as lower-fidelity but repeatable checks.
- For a full, high-fidelity evaluation, the runner would fetch the content of the actual cited chunks and verify that model statements are grounded in those chunks. That requires API access to the vector store or a server-side endpoint that can return chunk text for a source id.
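If such an endpoint existed, the high-fidelity check might look roughly like this; the `/chunks/<id>` route and its response shape are purely hypothetical:

```python
import requests  # third-party: pip install requests

def grounded_fraction(base_url: str, answer_sentences: list[str],
                      source_ids: list[str]) -> float:
    """Hypothetical high-fidelity check: fetch the text of each cited chunk
    and count answer sentences that appear verbatim in at least one chunk.
    The /chunks/<id> endpoint does not exist in this repo; it illustrates
    what a server-side chunk-text API would enable."""
    chunks = []
    for source_id in source_ids:
        resp = requests.get(f"{base_url}/chunks/{source_id}", timeout=30)
        chunks.append(resp.json().get("text", "").lower())
    if not answer_sentences:
        return 0.0
    grounded = sum(
        1 for sentence in answer_sentences
        if any(sentence.lower() in chunk for chunk in chunks)
    )
    return grounded / len(answer_sentences)
```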