| # Evaluation Methodology, Metrics, and Reproducibility | |
| This document covers what we did, what we found, what's still open, and | |
| how to reproduce it. It addresses Eric's WP5 requirements: | |
| *"documented evaluation process; evaluation results across retrieval, | |
| paragraphs, answer, summary and tone."* | |
| --- | |
| ## 1. Background and constraints | |
| ### 1.1 What we were given | |
| A benchmark test dataset was built before the implementation phase began, | |
| using **RAGAS multi-hop question generation** over the audit-report | |
| corpus. The intent was to use this dataset to compute standard RAG | |
| metrics (recall@k, MRR, faithfulness, answer correctness). | |
| ### 1.2 What we discovered | |
| When we wired up the evaluation pipeline and ran the original test set, | |
| two systematic issues surfaced: | |
| 1. **Abstraction mismatch**. Questions were generated at a higher | |
| abstraction level than the corpus supports — e.g. "What are the main | |
| audit themes across Uganda's districts in 2022?". The corpus chunks | |
| are concrete (specific findings, specific districts, specific | |
| figures). A question that's vague enough to demand "the main themes | |
| across Uganda" maps to *many* chunks, almost none of which the test | |
| set labels as correct. | |
| 2. **Sparse gold labels**. RAGAS labels 2–3 chunks per question as | |
| ground truth. In a corpus where templated audit-report wording | |
| repeats across hundreds of similar reports (different districts, | |
| same finding categories), many other chunks would be equally | |
| appropriate answers. The metric punishes the system for retrieving | |
| them. | |
| The combined effect: metrics produced by this set were not predictive | |
| of real production quality. We could improve real quality and watch the | |
| metric drop, or vice versa. So we paused the formal quantitative | |
| evaluation and proceeded with **qualitative iterative testing** (Martin, | |
| Dyna, the implementation team) plus a **detailed cost-quality analysis** | |
| shared with the client in December 2025. | |
| ### 1.3 What this means for handover | |
| We are **carrying forward two evaluation activities** as part of WP5 | |
| close-out: | |
| 1. **Rebuild a representative benchmark dataset**. ~1 week of focused | |
| work to produce ~150-200 questions covering retrieval, paragraph | |
| relevance, answer faithfulness, summarisation quality, tone. | |
| 2. **Document the existing evaluation pipeline** (this doc + the code in | |
| `evaluation/`, once restored from `_archive/`). | |
| --- | |
| ## 2. What the system is (a reminder for the evaluator) | |
| The Audit Assistant is **NOT a single-shot RAG** system. It is a | |
| multi-turn, multi-agent system with: | |
| - structured filter inference from natural language; | |
| - LLM-based query rewriting using conversation history; | |
| - pre-validation of filter combinations against the corpus (cheap | |
| `count()` check) before expensive retrieval; | |
| - conversational state carrying anchored filters across turns; | |
| - cross-encoder reranking with a CPU-aware skip optimisation. | |
| Standard "give a question, score the retrieved chunks" metrics measure | |
| only **one slice** of what determines real production quality. | |
| A meaningful evaluation must cover **at least** the following five | |
| dimensions: | |
| | Dimension | What it measures | Why it matters here | | |
| |---|---|---| | |
| | Retrieval | Are the right chunks coming back for a one-shot query? | Standard RAG metric; necessary but not sufficient. | | |
| | Paragraph relevance | Are the chunks coherent paragraphs (vs noise)? | Our chunker has a boilerplate-filter step; this checks it. | | |
| | Answer faithfulness | Does the answer stay grounded in the retrieved chunks? | LLM hallucination risk. | | |
| | Summarisation quality | Does the answer cover the key points without padding? | Common user-task on this corpus. | | |
| | Tone & register | Is the answer appropriately formal / cautious for an audit-report context? | Domain-specific quality bar. | | |
| A **production-realistic** evaluation must also cover: | |
| | Dimension | What it measures | | |
| |---|---| | |
| | Filter-inference accuracy | When the user says "Gulu 2022", does the system extract those filters? | | |
| | Multi-turn task completion | Across a 2–5 turn conversation, does the system stay on-topic and remember anchored filters? | | |
| | Filter relaxation behaviour | When a filter combo has 0 docs, does the system relax gracefully or fail silently? | | |
| | Cost per query | What does each kind of query cost on the configured LLM? | | |
| The proposed close-out evaluation suite includes both dimension sets. | |
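The filter-inference check can be illustrated with a small exact-match scorer. This is only a sketch: the filter keys (`district`, `year`) and the normalisation rules (case-folding strings, ignoring empty values) are assumptions for illustration, not the system's actual filter schema.

```python
from typing import Any, Iterable


def normalise_filters(filters: dict[str, Any]) -> dict[str, Any]:
    """Drop empty values and case-fold strings (assumed normalisation rules)."""
    cleaned = {}
    for key, value in filters.items():
        if value is None or value == "" or value == []:
            continue  # an absent filter and an empty filter are treated alike
        cleaned[key] = value.strip().lower() if isinstance(value, str) else value
    return cleaned


def filter_exact_match(predicted: dict[str, Any], gold: dict[str, Any]) -> bool:
    """True only if every filter key and value agrees after normalisation."""
    return normalise_filters(predicted) == normalise_filters(gold)


def filter_accuracy(pairs: Iterable[tuple[dict[str, Any], dict[str, Any]]]) -> float:
    """Fraction of (predicted, gold) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    return sum(filter_exact_match(p, g) for p, g in pairs) / len(pairs)
```

Exact match is deliberately strict: a query for "Gulu 2022" that yields only the district counts as a miss, since the missing year would widen retrieval.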
| --- | |
| ## 3. Proposed evaluation methodology | |
| ### 3.1 Dataset construction | |
| **Goal**: produce 150-200 questions with gold-standard answers and | |
| gold-standard chunk references, covering: | |
| - **30%** simple-factual ("How much was district X's audit budget in | |
| 2022?") | |
| - **20%** comparative ("Which 3 districts had the largest VFM | |
| findings?") | |
| - **20%** summarisation ("Summarise the OAG 2022 annual report") | |
| - **15%** multi-turn ("What about 2021?", "And for sources audited by | |
| OAG only?") | |
| - **15%** edge cases (impossible filter combos, ambiguous district | |
| names, follow-ups about prior turns) | |
| Construction process: | |
| 1. Sample chunks from the live Qdrant collection covering each | |
| dimension and source. | |
| 2. For each chunk, manually formulate 1-3 questions that the chunk | |
| genuinely answers. (Manual = expensive, but the only way to avoid | |
| the abstraction-mismatch problem of automated generation.) | |
| 3. For each question, record the GOLD ANSWER text and the chunks the | |
| answer comes from. | |
| 4. For multi-turn questions, record the conversation history they | |
| depend on. | |
| 5. Have a second annotator review 20% of the questions for quality. | |
| Estimated effort: **4-6 ML-engineer days** (per Eric's WP3 estimate). | |
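One possible shape for a dataset record, following the construction steps above, is sketched here. The field names and the `EvalItem` class are hypothetical; the question types are the five categories used in this document.

```python
import json
from dataclasses import asdict, dataclass, field

QUESTION_TYPES = {"factual", "comparative", "summary", "multi-turn", "edge-case"}


@dataclass
class EvalItem:
    """One benchmark question (hypothetical schema, not the final format)."""
    question_id: str
    question_type: str
    question: str
    gold_answer: str
    gold_chunk_ids: list[str]
    # Prior turns as {"role": ..., "content": ...}; only needed for multi-turn items.
    history: list[dict[str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.question_type not in QUESTION_TYPES:
            raise ValueError(f"unknown question_type: {self.question_type!r}")
        if self.question_type == "multi-turn" and not self.history:
            raise ValueError("multi-turn items must record their conversation history")


def to_jsonl_line(item: EvalItem) -> str:
    """Serialise one item as a single JSON line."""
    return json.dumps(asdict(item), ensure_ascii=False)


def from_jsonl_line(line: str) -> EvalItem:
    """Parse one JSON line back into a validated item."""
    return EvalItem(**json.loads(line))
```

Validating at load time means a malformed record fails before any LLM calls are spent on it.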
| ### 3.2 Metrics | |
| | Dimension | Metric | How computed | | |
| |---|---|---| | |
| | Retrieval | Recall@k (k=5,10,20), MRR | Standard: did the gold chunks appear in top-k? | | |
| | Paragraph relevance | Coherence score (LLM-judge) | LLM-as-judge over each retrieved chunk | | |
| | Answer faithfulness | Faithfulness score (LLM-judge) | RAGAS-style faithfulness — does the answer match the retrieved evidence? | | |
| | Summarisation | ROUGE-L vs gold summary; LLM-judge for content coverage | Two complementary scores | | |
| | Tone | LLM-judge tone score (formal vs casual; cautious vs assertive) | Single LLM judge call | | |
| | Filter inference | Accuracy on filter dict (exact match) | Compare extracted filters to gold filters per question | | |
| | Multi-turn task completion | Task-completed (yes/no) per multi-turn case | Manual annotation of final-turn answer | | |
| | Filter relaxation | Behaviour label (relaxed/asked-follow-up/failed) | Manual review of edge-case set | |
| | Cost | $ per query | Sum of LLM input + output token cost | | |
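For reference, the retrieval metrics in the table can be computed as in the sketch below. It assumes chunks are identified by string IDs and that recall is the fraction of gold chunks found in the top k; the archived scripts may define it differently.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved_ids: Sequence[str], gold_ids: Iterable[str], k: int) -> float:
    """Fraction of gold chunks that appear among the first k retrieved chunks."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("recall is undefined without gold chunks")
    return len(gold & set(retrieved_ids[:k])) / len(gold)


def reciprocal_rank(retrieved_ids: Sequence[str], gold_ids: Iterable[str]) -> float:
    """1 / rank of the first gold chunk in the ranking, or 0.0 if none was retrieved."""
    gold = set(gold_ids)
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[tuple[Sequence[str], Iterable[str]]]) -> float:
    """Average reciprocal rank over (retrieved_ids, gold_ids) pairs."""
    if not runs:
        raise ValueError("need at least one question")
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
```

Edge-case questions with no gold chunks (for example, impossible filter combinations) would need to be excluded from recall rather than scored as zero.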
| Confidence intervals on all reported metrics: 95% bootstrap CIs with | |
| 1000 resamples. | |
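A percentile bootstrap over per-question scores is one way to obtain these intervals. The sketch below uses only the standard library and assumes the percentile method; the final pipeline may use another bootstrap variant.

```python
import random
from statistics import fmean
from typing import Sequence


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 1000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed so a report can be regenerated exactly
    n = len(scores)
    # Resample the per-question scores with replacement and record each mean.
    means = sorted(fmean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return fmean(scores), lower, upper
```

Resampling is done per question, so intervals for the smaller question-type subsets will be noticeably wider than for the full set.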
| ### 3.3 LLM-judge configuration | |
| We use the **OpenAI** strong model (`gpt-4.1` per | |
| `src/config/settings.yaml::reader.OPENAI_STRONG.model`) as the judge for | |
| faithfulness, tone, and content-coverage metrics. To avoid | |
| self-evaluation bias, the judge is deliberately a different model from | |
| the answer generator (`gpt-4o-mini`). | |
| Judge prompts and few-shot examples will be checked into `evaluation/` | |
| when the dataset is built. | |
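Until the real prompts are checked in, the sketch below shows how a faithfulness judge call might be assembled and its reply parsed. The prompt wording, the 1 to 5 scale, and the JSON reply format are all assumptions for illustration; the code makes no API call.

```python
import json

# Hypothetical prompt; the real judge prompts are not yet written.
JUDGE_TEMPLATE = """You are grading an answer produced by an audit-report assistant.
Score FAITHFULNESS from 1 (unsupported) to 5 (fully supported by the evidence).
Reply with JSON only: {{"score": <int>, "reason": "<one sentence>"}}

Evidence:
{evidence}

Answer:
{answer}
"""


def build_judge_prompt(answer: str, chunks: list[str]) -> str:
    """Number the retrieved chunks and place them in the prompt with the answer."""
    evidence = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return JUDGE_TEMPLATE.format(evidence=evidence, answer=answer)


def parse_judge_reply(reply: str) -> tuple[int, str]:
    """Extract (score, reason); raise ValueError if the reply is malformed."""
    data = json.loads(reply)  # json.JSONDecodeError is a subclass of ValueError
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score, str(data.get("reason", ""))
```

Rejecting malformed replies, rather than silently defaulting a score, keeps judge failures visible in the report.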
| --- | |
| ## 4. Existing evaluation infrastructure (status) | |
| The original evaluation pipeline (built before the dataset issue was | |
| discovered) lives in `_archive/` after the May 2026 cleanup. It included: | |
| - A test runner that loaded the RAGAS-generated questions, ran the | |
| pipeline, computed RAGAS metrics, and emitted a JSON report. | |
| - Hand-rolled retrieval-quality metrics (recall@k, MRR) per question. | |
| - A summarisation comparison script. | |
| For the close-out evaluation we will: | |
| - Restore the relevant scripts from `_archive/` into `evaluation/`. | |
| - Update them to consume the rebuilt dataset. | |
| - Wire the metrics into a single `python -m evaluation.run` command. | |
| --- | |
| ## 5. Reproducibility | |
| ### 5.1 Versioning | |
| Once the evaluation pipeline is restored: | |
| | Artefact | Where | | |
| |---|---| | |
| | Test dataset | `evaluation/data/test_set_vYYYY-MM-DD.jsonl` | | |
| | Evaluation script | `evaluation/run.py` | | |
| | Judge prompts | `evaluation/prompts/` | | |
| | Run reports | `evaluation/reports/<timestamp>_<commit_sha>.json` | | |
| | System config snapshot | First entry in each report (model versions, `settings.yaml` hash) | | |
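| Every report should be reproducible from `commit_sha` + `dataset_path`, | |
| provided the Qdrant collection state and LLM versions match those | |
| recorded in the report's config snapshot (see 5.3). | |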
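The config snapshot that opens each report could be built along these lines. This is a sketch: the key names are hypothetical, and hashing the dataset file is an added assumption so that a renamed or edited test set is detectable.

```python
import hashlib
import subprocess
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def current_commit_sha() -> str:
    """Short git commit SHA, or "unknown" when git or the repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def config_snapshot(settings_path: Path, dataset_path: Path, models: dict[str, str]) -> dict:
    """Collect the identifiers needed to tie a report to code, config, and data."""
    return {
        "commit_sha": current_commit_sha(),
        "settings_sha256": sha256_of_file(settings_path),
        "dataset_path": str(dataset_path),
        "dataset_sha256": sha256_of_file(dataset_path),
        "models": dict(models),  # e.g. judge and generator model names
    }
```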
| ### 5.2 How to reproduce a report | |
| ```bash | |
| # 1. Check out the commit the report was generated from | |
| git checkout <commit_sha> | |
| # 2. Restore dependencies | |
| uv sync # or pip install -r requirements.txt | |
| # 3. Set environment variables (see runbook/rotate-credentials.md) | |
| export OPENAI_API_KEY=... | |
| export QDRANT_URL=... | |
| export QDRANT_API_KEY=... | |
| # 4. Run evaluation | |
| python -m evaluation.run \ | |
| --dataset evaluation/data/test_set_v2026-XX-XX.jsonl \ | |
| --output evaluation/reports/$(date -u +%Y-%m-%dT%H%M%SZ)_$(git rev-parse --short HEAD).json | |
| ``` | |
| ### 5.3 Sources of nondeterminism | |
| - **LLM responses** are non-deterministic by default. We use | |
| `temperature=0` for evaluation runs to reduce variance, but exact | |
| byte-for-byte reproducibility across LLM provider releases is not | |
| guaranteed. | |
| - **Qdrant content** must match the test-time state. If the corpus | |
| grows between two runs, retrieval scores will change. The report | |
| snapshots a Qdrant collection-name + commit-time hash for traceability. | |
| --- | |
| ## 6. What we will report at handover | |
| The close-out report will include, per metric: | |
| - Score with bootstrap 95% CI. | |
| - Breakdown by question type (factual / comparative / summary / multi-turn / | |
| edge-case). | |
| - Comparison against a **baseline** — the simplest possible single-shot | |
| RAG (no filter inference, no agent flow, no reranker). This lets the | |
| client see what the multi-agent design buys, in numbers. | |
| Sample shape of the final report (placeholder values, not measured results): | |
| ``` | |
| Metric Score (95% CI) Baseline Δ | |
| Retrieval Recall@5 0.78 [0.74, 0.82] 0.61 +0.17 | |
| Answer Faithfulness 0.91 [0.88, 0.93] 0.79 +0.12 | |
| Filter Inference Accuracy 0.88 [0.83, 0.93] n/a n/a | |
| Multi-Turn Task Completion 0.74 [0.66, 0.81] 0.31 +0.43 | |
| Cost per Query (USD) 0.0042 [...] 0.0028 +0.0014 | |
| ``` | |
| --- | |
| ## 7. Known limitations of this methodology | |
| 1. **LLM-as-judge bias**. The judge (`gpt-4.1`) and the answer generator | |
| (`gpt-4o-mini`) are different models, but both come from the same | |
| provider and are likely to share biases. We will manually spot-check a | |
| 10% sample of judged items as a sanity check. | |
| 2. **Manual dataset construction is slow** and reflects the | |
| constructor's mental model of "good answers". A future iteration | |
| could blend in real user queries (anonymised from | |
| `spaces_logs`-backed conversation logs). | |
| 3. **Cost figures depend on the LLM provider's pricing** at evaluation | |
| time. The report stamps the pricing used. | |
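Cost per query, as the sum of input and output token costs across all LLM calls in a query, can be computed as sketched below. The `Pricing` class is hypothetical and holds whatever per-million-token rates apply on the evaluation date; no real provider prices are assumed here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Pricing:
    """Per-million-token rates in USD, stamped into the report alongside the costs."""
    input_usd_per_1m: float
    output_usd_per_1m: float


def query_cost_usd(calls: Iterable[tuple[int, int]], pricing: Pricing) -> float:
    """Sum cost over every LLM call in one query.

    `calls` holds one (input_tokens, output_tokens) pair per call, e.g. filter
    inference, query rewriting, and answer generation.
    """
    total = 0.0
    for input_tokens, output_tokens in calls:
        total += input_tokens * pricing.input_usd_per_1m / 1_000_000
        total += output_tokens * pricing.output_usd_per_1m / 1_000_000
    return total
```

Keeping the rates in an explicit object makes it straightforward to re-cost an old run's token counts under new pricing without re-running the evaluation.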
| --- | |
| ## Related | |
| - [`accountability-transparency-limitations-biases.md`](accountability-transparency-limitations-biases.md) | |
| - [`cost-and-performance.md`](cost-and-performance.md) | |
| - [`architecture/adrs/004-llm-model-choice-gpt-4o-mini.md`](architecture/adrs/004-llm-model-choice-gpt-4o-mini.md) | |
| - [`DEFERRED.md`](DEFERRED.md) — items related to evaluation | |