audit_assistant / docs /evaluation.md
akryldigital's picture
add docs
815b494 verified

Evaluation Methodology, Metrics, and Reproducibility

This document covers what we did, what we found, what's still open, and how to reproduce it. It addresses Eric's WP5 requirements: "documented evaluation process; evaluation results across retrieval, paragraphs, answer, summary and tone."


1. Background and constraints

1.1 What we were given

A benchmark test dataset was built before the implementation phase began, using RAGAS multi-hop question generation over the audit-report corpus. The intent was to use this dataset to compute standard RAG metrics (recall@k, MRR, faithfulness, answer correctness).

1.2 What we discovered

When we wired up the evaluation pipeline and ran the original test set, two systematic issues surfaced:

  1. Abstraction mismatch. Questions were generated at a higher abstraction level than the corpus supports β€” e.g. "What are the main audit themes across Uganda's districts in 2022?". The corpus chunks are concrete (specific findings, specific districts, specific figures). A question that's vague enough to demand "the main themes across Uganda" maps to many chunks, none of which the test labels as correct.

  2. Sparse gold labels. RAGAS labels 2–3 chunks per question as ground truth. In a corpus where templated audit-report wording repeats across hundreds of similar reports (different districts, same finding categories), many other chunks would be equally appropriate answers. The metric punishes the system for retrieving them.

The combined effect: metrics produced by this set were not predictive of real production quality. We could improve real quality and watch the metric drop, or vice versa. So we paused the formal quantitative evaluation and proceeded with qualitative iterative testing (Martin, Dyna, the implementation team) plus a detailed cost-quality analysis shared with the client in December 2025.

1.3 What this means for handover

We are carrying forward two evaluation activities as part of WP5 close-out:

  1. Rebuild a representative benchmark dataset. ~1 week of focused work to produce ~150-200 questions covering retrieval, paragraph relevance, answer faithfulness, summarisation quality, tone.
  2. Document the existing evaluation pipeline (this doc + the code in src/evaluation/ if/when added back from _archive/).

2. What the system is (a reminder for the evaluator)

The Audit Assistant is NOT a single-shot RAG system. It is a multi-turn, multi-agent system with:

  • structured filter inference from natural language;
  • LLM-based query rewriting using conversation history;
  • pre-validation of filter combinations against the corpus (cheap count() check) before expensive retrieval;
  • conversational state carrying anchored filters across turns;
  • cross-encoder reranking with a CPU-aware skip optimisation.

Standard "give a question, score the retrieved chunks" metrics measure only one slice of what determines real production quality.

A meaningful evaluation must cover at least the following five dimensions:

Dimension What it measures Why it matters here
Retrieval Are the right chunks coming back for a one-shot query? Standard RAG metric; necessary but not sufficient.
Paragraph relevance Are the chunks coherent paragraphs (vs noise)? Our chunker has a boilerplate-filter step; this checks it.
Answer faithfulness Does the answer stay grounded in the retrieved chunks? LLM hallucination risk.
Summarisation quality Does the answer cover the key points without padding? Common user-task on this corpus.
Tone & register Is the answer appropriately formal / cautious for an audit-report context? Domain-specific quality bar.

A production-realistic evaluation must also cover:

Dimension What it measures
Filter-inference accuracy When the user says "Gulu 2022", does the system extract those filters?
Multi-turn task completion Across a 2–5 turn conversation, does the system stay on-topic and remember anchored filters?
Filter relaxation behaviour When a filter combo has 0 docs, does the system relax gracefully or fail silently?
Cost per query What does each kind of query cost on the configured LLM?

The proposed close-out evaluation suite includes both dimension sets.


3. Proposed evaluation methodology

3.1 Dataset construction

Goal: produce 150-200 questions with gold-standard answers and gold-standard chunk references, covering:

  • 30% simple-factual ("How much was district X's audit budget in 2022?")
  • 20% comparative ("Which 3 districts had the largest VFM findings?")
  • 20% summarisation ("Summarise the OAG 2022 annual report")
  • 15% multi-turn ("What about 2021?", "And for sources audited by OAG only?")
  • 15% edge cases (impossible filter combos, ambiguous district names, follow-ups about prior turns)

Construction process:

  1. Sample chunks from the live Qdrant collection covering each dimension and source.
  2. For each chunk, manually formulate 1-3 questions that the chunk genuinely answers. (Manual = expensive, but the only way to avoid the abstraction-mismatch problem of automated generation.)
  3. For each question, record the GOLD ANSWER text and the chunks the answer comes from.
  4. For multi-turn questions, record the conversation history they depend on.
  5. Have a second annotator review 20% of the questions for quality.

Estimated effort: 4-6 ML-engineer days (per Eric's WP3 estimate).

3.2 Metrics

Dimension Metric How computed
Retrieval Recall@k (k=5,10,20), MRR Standard: did the gold chunks appear in top-k?
Paragraph relevance Coherence score (LLM-judge) LLM-as-judge over each retrieved chunk
Answer faithfulness Faithfulness score (LLM-judge) RAGAS-style faithfulness β€” does the answer match the retrieved evidence?
Summarisation ROUGE-L vs gold summary; LLM-judge for content coverage Two complementary scores
Tone LLM-judge tone score (formal vs casual; cautious vs assertive) Single LLM judge call
Filter inference Accuracy on filter dict (exact match) Compare extracted filters to gold filters per question
Multi-turn task completion Task-completed (yes/no) per multi-turn case Manual annotation of final-turn answer
Filter relaxation Behavior label (relaxed/asked-follow-up/failed) Manual review of edge-case set
Cost $ per query Sum of LLM input + output token cost

Confidence intervals on all reported metrics: bootstrap CIs with 1000 resamples.

3.3 LLM-judge configuration

We use the OpenAI strong model (gpt-4.1 per src/config/settings.yaml::reader.OPENAI_STRONG.model) as the judge for faithfulness, tone, and content-coverage metrics. We do NOT use the same model as the answer generator (gpt-4o-mini) to avoid self-evaluation bias.

Judge prompts and few-shot examples will be checked into evaluation/ when the dataset is built.


4. Existing evaluation infrastructure (status)

The original evaluation pipeline (built before the dataset issue was discovered) lives in _archive/ after the May 2026 cleanup. It included:

  • A test runner that loaded the RAGAS-generated questions, ran the pipeline, computed RAGAS metrics, and emitted a JSON report.
  • Hand-rolled retrieval-quality metrics (recall@k, MRR) per question.
  • A summarisation comparison script.

For the close-out evaluation we will:

  • Restore the relevant scripts from _archive/ into evaluation/.
  • Update them to consume the rebuilt dataset.
  • Wire the metrics into a single python -m evaluation.run command.

5. Reproducibility

5.1 Versioning

Once the evaluation pipeline is restored:

Artefact Where
Test dataset evaluation/data/test_set_vYYYY-MM-DD.jsonl
Evaluation script evaluation/run.py
Judge prompts evaluation/prompts/
Run reports evaluation/reports/<timestamp>_<commit_sha>.json
System config snapshot First entry in each report (model versions, settings.yaml hash)

Every report should be reproducible from commit_sha + dataset_path alone.

5.2 How to reproduce a report

# 1. Check out the commit the report was generated from
git checkout <commit_sha>

# 2. Restore dependencies
uv sync  # or pip install -r requirements.txt

# 3. Set environment variables (see runbook/rotate-credentials.md)
export OPENAI_API_KEY=...
export QDRANT_URL=...
export QDRANT_API_KEY=...

# 4. Run evaluation
python -m evaluation.run \
    --dataset evaluation/data/test_set_v2026-XX-XX.jsonl \
    --output evaluation/reports/$(date -u +%Y-%m-%dT%H%M%SZ)_$(git rev-parse --short HEAD).json

5.3 Sources of nondeterminism

  • LLM responses are non-deterministic by default. We use temperature=0 for evaluation runs to reduce variance, but exact byte-for-byte reproducibility across LLM provider releases is not guaranteed.
  • Qdrant content must match the test-time state. If the corpus grows between two runs, retrieval scores will change. The report snapshots a Qdrant collection-name + commit-time hash for traceability.

6. What we will report at handover

The close-out report will include, per metric:

  • Score with bootstrap 95% CI.
  • Breakdown by question type (factual / comparative / summary / multi-turn / edge-case).
  • Comparison against a baseline β€” the simplest possible single-shot RAG (no filter inference, no agent flow, no reranker). This lets the client see what the multi-agent design buys, in numbers.

Sample shape of the final report:

Metric                       Score (95% CI)        Baseline       Ξ”
Retrieval Recall@5           0.78 [0.74, 0.82]     0.61           +0.17
Answer Faithfulness          0.91 [0.88, 0.93]     0.79           +0.12
Filter Inference Accuracy    0.88 [0.83, 0.93]     n/a            n/a
Multi-Turn Task Completion   0.74 [0.66, 0.81]     0.31           +0.43
Cost per Query (USD)         0.0042 [...]          0.0028         +0.0014

7. Known limitations of this methodology

  1. LLM-as-judge bias. Despite using a different model for answering vs judging, both models share design biases. We will spot-check a 10% sample manually as a sanity check.
  2. Manual dataset construction is slow and reflects the constructor's mental model of "good answers". A future iteration could blend in real user queries (anonymized from spaces_logs-backed conversation logs).
  3. Cost figures depend on the LLM provider's pricing at evaluation time. The report stamps the pricing used.

Related