# Evaluation Methodology, Metrics, and Reproducibility

This document covers what we did, what we found, what's still open, and how to reproduce it. It addresses Eric's WP5 requirements: *"documented evaluation process; evaluation results across retrieval, paragraphs, answer, summary and tone."*

---

## 1. Background and constraints

### 1.1 What we were given

A benchmark test dataset was built before the implementation phase began, using **RAGAS multi-hop question generation** over the audit-report corpus. The intent was to use this dataset to compute standard RAG metrics (recall@k, MRR, faithfulness, answer correctness).

### 1.2 What we discovered

When we wired up the evaluation pipeline and ran the original test set, two systematic issues surfaced:

1. **Abstraction mismatch**. Questions were generated at a higher abstraction level than the corpus supports, e.g. "What are the main audit themes across Uganda's districts in 2022?". The corpus chunks are concrete (specific findings, specific districts, specific figures). A question that's vague enough to demand "the main themes across Uganda" maps to *many* chunks, none of which the test labels as correct.
2. **Sparse gold labels**. RAGAS labels 2–3 chunks per question as ground truth. In a corpus where templated audit-report wording repeats across hundreds of similar reports (different districts, same finding categories), many other chunks would be equally appropriate answers. The metric punishes the system for retrieving them.

The combined effect: metrics produced by this set were not predictive of real production quality. We could improve real quality and watch the metric drop, or vice versa. So we paused the formal quantitative evaluation and proceeded with **qualitative iterative testing** (Martin, Dyna, the implementation team) plus a **detailed cost-quality analysis** shared with the client in December 2025.
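The sparse-label effect described above can be illustrated with a toy calculation. This is a minimal sketch, not code from the archived pipeline: the chunk IDs, the `recall_at_k` and `acceptable_at_k` helpers, and the "acceptable" set are all hypothetical and exist only to show the mechanism.

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of gold chunk IDs that appear among the top-k retrieved IDs."""
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(set(retrieved[:k]) & set(gold)) / len(gold)


def acceptable_at_k(retrieved, acceptable, k):
    """Fraction of the top-k retrieved IDs that a human would accept as relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in acceptable) / k


# Hypothetical chunk IDs: the same templated finding, reported for different districts.
retrieved = ["gulu_f3", "lira_f3", "arua_f3", "kitgum_f3", "moroto_f3"]

# The generated test set labels only two chunks as ground truth (assumed IDs).
sparse_gold = {"mbale_f3", "jinja_f3"}

# A human annotator would accept any chunk carrying that templated finding.
acceptable = sparse_gold | set(retrieved)

sparse_recall = recall_at_k(retrieved, sparse_gold, k=5)
human_accept_rate = acceptable_at_k(retrieved, acceptable, k=5)

print(f"recall@5 against sparse labels: {sparse_recall:.2f}")
print(f"share of top-5 a human would accept: {human_accept_rate:.2f}")
```

In this constructed case every retrieved chunk is acceptable, yet recall@5 against the sparse labels is zero, which is the kind of disconnect between metric and real quality that led to pausing the quantitative evaluation.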
### 1.3 What this means for handover

We are **carrying forward two evaluation activities** as part of WP5 close-out:

1. **Rebuild a representative benchmark dataset**. ~1 week of focused work to produce ~150-200 questions covering retrieval, paragraph relevance, answer faithfulness, summarisation quality, and tone.
2. **Document the existing evaluation pipeline** (this doc + the code in `src/evaluation/` if/when added back from `_archive/`).

---

## 2. What the system is (a reminder for the evaluator)

The Audit Assistant is **NOT a single-shot RAG** system. It is a multi-turn, multi-agent system with:

- structured filter inference from natural language;
- LLM-based query rewriting using conversation history;
- pre-validation of filter combinations against the corpus (cheap `count()` check) before expensive retrieval;
- conversational state carrying anchored filters across turns;
- cross-encoder reranking with a CPU-aware skip optimisation.

Standard "give a question, score the retrieved chunks" metrics measure only **one slice** of what determines real production quality. A meaningful evaluation must cover **at least** the following five dimensions:

| Dimension | What it measures | Why it matters here |
|---|---|---|
| Retrieval | Are the right chunks coming back for a one-shot query? | Standard RAG metric; necessary but not sufficient. |
| Paragraph relevance | Are the chunks coherent paragraphs (vs noise)? | Our chunker has a boilerplate-filter step; this checks it. |
| Answer faithfulness | Does the answer stay grounded in the retrieved chunks? | LLM hallucination risk. |
| Summarisation quality | Does the answer cover the key points without padding? | Common user-task on this corpus. |
| Tone & register | Is the answer appropriately formal / cautious for an audit-report context? | Domain-specific quality bar. |

A **production-realistic** evaluation must also cover:

| Dimension | What it measures |
|---|---|
| Filter-inference accuracy | When the user says "Gulu 2022", does the system extract those filters? |
| Multi-turn task completion | Across a 2–5 turn conversation, does the system stay on-topic and remember anchored filters? |
| Filter relaxation behaviour | When a filter combo has 0 docs, does the system relax gracefully or fail silently? |
| Cost per query | What does each kind of query cost on the configured LLM? |

The proposed close-out evaluation suite includes both dimension sets.

---

## 3. Proposed evaluation methodology

### 3.1 Dataset construction

**Goal**: produce 150-200 questions with gold-standard answers and gold-standard chunk references, covering:

- **30%** simple-factual ("How much was district X's audit budget in 2022?")
- **20%** comparative ("Which 3 districts had the largest VFM findings?")
- **20%** summarisation ("Summarise the OAG 2022 annual report")
- **15%** multi-turn ("What about 2021?", "And for sources audited by OAG only?")
- **15%** edge cases (impossible filter combos, ambiguous district names, follow-ups about prior turns)

Construction process:

1. Sample chunks from the live Qdrant collection covering each dimension and source.
2. For each chunk, manually formulate 1-3 questions that the chunk genuinely answers. (Manual = expensive, but the only way to avoid the abstraction-mismatch problem of automated generation.)
3. For each question, record the GOLD ANSWER text and the chunks the answer comes from.
4. For multi-turn questions, record the conversation history they depend on.
5. Have a second annotator review 20% of the questions for quality.

Estimated effort: **4-6 ML-engineer days** (per Eric's WP3 estimate).

### 3.2 Metrics

| Dimension | Metric | How computed |
|---|---|---|
| Retrieval | Recall@k (k=5,10,20), MRR | Standard: did the gold chunks appear in top-k? |
| Paragraph relevance | Coherence score (LLM-judge) | LLM-as-judge over each retrieved chunk |
| Answer faithfulness | Faithfulness score (LLM-judge) | RAGAS-style faithfulness: does the answer match the retrieved evidence? |
| Summarisation | ROUGE-L vs gold summary; LLM-judge for content coverage | Two complementary scores |
| Tone | LLM-judge tone score (formal vs casual; cautious vs assertive) | Single LLM judge call |
| Filter inference | Accuracy on filter dict (exact match) | Compare extracted filters to gold filters per question |
| Multi-turn task completion | Task-completed (yes/no) per multi-turn case | Manual annotation of final-turn answer |
| Filter relaxation | Behaviour label (relaxed/asked-follow-up/failed) | Manual review of edge-case set |
| Cost | $ per query | Sum of LLM input + output token cost |

Confidence intervals on all reported metrics: bootstrap CIs with 1000 resamples.

### 3.3 LLM-judge configuration

We use the **OpenAI** strong model (`gpt-4.1` per `src/config/settings.yaml::reader.OPENAI_STRONG.model`) as the judge for faithfulness, tone, and content-coverage metrics. We do NOT use the same model as the answer generator (`gpt-4o-mini`) to avoid self-evaluation bias.

Judge prompts and few-shot examples will be checked into `evaluation/` when the dataset is built.

---

## 4. Existing evaluation infrastructure (status)

The original evaluation pipeline (built before the dataset issue was discovered) lives in `_archive/` after the May 2026 cleanup. It included:

- A test runner that loaded the RAGAS-generated questions, ran the pipeline, computed RAGAS metrics, and emitted a JSON report.
- Hand-rolled retrieval-quality metrics (recall@k, MRR) per question.
- A summarisation comparison script.

For the close-out evaluation we will:

- Restore the relevant scripts from `_archive/` into `evaluation/`.
- Update them to consume the rebuilt dataset.
- Wire the metrics into a single `python -m evaluation.run` command.

---

## 5. Reproducibility

### 5.1 Versioning

Once the evaluation pipeline is restored:

| Artefact | Where |
|---|---|
| Test dataset | `evaluation/data/test_set_vYYYY-MM-DD.jsonl` |
| Evaluation script | `evaluation/run.py` |
| Judge prompts | `evaluation/prompts/` |
| Run reports | `evaluation/reports/<timestamp>_<commit_sha>.json` |
| System config snapshot | First entry in each report (model versions, `settings.yaml` hash) |

Every report should be reproducible from `commit_sha` + `dataset_path` alone.

### 5.2 How to reproduce a report

```bash
# 1. Check out the commit the report was generated from
git checkout <commit_sha>

# 2. Restore dependencies
uv sync   # or pip install -r requirements.txt

# 3. Set environment variables (see runbook/rotate-credentials.md)
export OPENAI_API_KEY=...
export QDRANT_URL=...
export QDRANT_API_KEY=...

# 4. Run evaluation
python -m evaluation.run \
  --dataset evaluation/data/test_set_v2026-XX-XX.jsonl \
  --output "evaluation/reports/$(date -u +%Y-%m-%dT%H%M%SZ)_$(git rev-parse --short HEAD).json"
```

### 5.3 Sources of nondeterminism

- **LLM responses** are non-deterministic by default. We use `temperature=0` for evaluation runs to reduce variance, but exact byte-for-byte reproducibility across LLM provider releases is not guaranteed.
- **Qdrant content** must match the test-time state. If the corpus grows between two runs, retrieval scores will change. The report snapshots a Qdrant collection-name + commit-time hash for traceability.

---

## 6. What we will report at handover

The close-out report will include, per metric:

- Score with bootstrap 95% CI.
- Breakdown by question type (factual / comparative / summary / multi-turn / edge-case).
- Comparison against a **baseline**: the simplest possible single-shot RAG (no filter inference, no agent flow, no reranker). This lets the client see what the multi-agent design buys, in numbers.
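The bootstrap 95% CI reported per metric could be computed along the following lines. This is a sketch under assumptions rather than the restored pipeline's code: the `bootstrap_ci` helper, the percentile method, the fixed seed, and the example scores are all hypothetical; only the 1000-resample count comes from this document.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of per-question scores.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    n = len(scores)
    # Resample the questions with replacement and record each resample's mean.
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(scores), resampled_means[lower_idx], resampled_means[upper_idx]


# Hypothetical per-question recall@5 hits (1 = gold chunk found) for 150 questions.
example_scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 15

point, lower, upper = bootstrap_ci(example_scores)
print(f"Recall@5 {point:.2f} [{lower:.2f}, {upper:.2f}]")
```

Resampling at the question level keeps the interval tied to dataset size; for the per-question-type breakdown, the same helper would be applied to each subset separately, where smaller subsets will naturally produce wider intervals.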
Sample shape of the final report:

```
Metric                        Score (95% CI)        Baseline   Δ
Retrieval Recall@5            0.78 [0.74, 0.82]     0.61       +0.17
Answer Faithfulness           0.91 [0.88, 0.93]     0.79       +0.12
Filter Inference Accuracy     0.88 [0.83, 0.93]     n/a        n/a
Multi-Turn Task Completion    0.74 [0.66, 0.81]     0.31       +0.43
Cost per Query (USD)          0.0042 [...]          0.0028     +0.0014
```

---

## 7. Known limitations of this methodology

1. **LLM-as-judge bias**. Despite using a different model for answering vs judging, both models share design biases. We will spot-check a 10% sample manually as a sanity check.
2. **Manual dataset construction is slow** and reflects the constructor's mental model of "good answers". A future iteration could blend in real user queries (anonymised from `spaces_logs`-backed conversation logs).
3. **Cost figures depend on the LLM provider's pricing** at evaluation time. The report stamps the pricing used.

---

## Related

- [`accountability-transparency-limitations-biases.md`](accountability-transparency-limitations-biases.md)
- [`cost-and-performance.md`](cost-and-performance.md)
- [`architecture/adrs/004-llm-model-choice-gpt-4o-mini.md`](architecture/adrs/004-llm-model-choice-gpt-4o-mini.md)
- [`DEFERRED.md`](DEFERRED.md): items related to evaluation