# Evaluation Methodology, Metrics, and Reproducibility

This document covers what we did, what we found, what's still open, and
how to reproduce it. It addresses Eric's WP5 requirements:
*"documented evaluation process; evaluation results across retrieval,
paragraphs, answer, summary and tone."*

---

## 1. Background and constraints

### 1.1 What we were given

A benchmark test dataset was built before the implementation phase began,
using **RAGAS multi-hop question generation** over the audit-report
corpus. The intent was to use this dataset to compute standard RAG
metrics (recall@k, MRR, faithfulness, answer correctness).

### 1.2 What we discovered

When we wired up the evaluation pipeline and ran the original test set,
two systematic issues surfaced:

1. **Abstraction mismatch**. Questions were generated at a higher
   abstraction level than the corpus supports — e.g. "What are the main
   audit themes across Uganda's districts in 2022?". The corpus chunks
   are concrete (specific findings, specific districts, specific
   figures). A question that's vague enough to demand "the main themes
   across Uganda" maps to *many* chunks, none of which the test labels
   as correct.

2. **Sparse gold labels**. RAGAS labels 2–3 chunks per question as
   ground truth. In a corpus where templated audit-report wording
   repeats across hundreds of similar reports (different districts,
   same finding categories), many other chunks would be equally
   appropriate answers. The metric punishes the system for retrieving
   them.

The combined effect: metrics produced by this set were not predictive
of real production quality. We could improve real quality and watch the
metric drop, or vice versa. So we paused the formal quantitative
evaluation and proceeded with **qualitative iterative testing** (Martin,
Dyna, the implementation team) plus a **detailed cost-quality analysis**
shared with the client in December 2025.

### 1.3 What this means for handover

We are **carrying forward two evaluation activities** as part of WP5
close-out:

1. **Rebuild a representative benchmark dataset**. ~1 week of focused
   work to produce ~150-200 questions covering retrieval, paragraph
   relevance, answer faithfulness, summarisation quality, tone.
2. **Document the existing evaluation pipeline** (this doc + the code in
   `src/evaluation/` if/when added back from `_archive/`).

---

## 2. What the system is (a reminder for the evaluator)

The Audit Assistant is **NOT a single-shot RAG** system. It is a
multi-turn, multi-agent system with:

- structured filter inference from natural language;
- LLM-based query rewriting using conversation history;
- pre-validation of filter combinations against the corpus (cheap
  `count()` check) before expensive retrieval;
- conversational state carrying anchored filters across turns;
- cross-encoder reranking with a CPU-aware skip optimisation.

Standard "give a question, score the retrieved chunks" metrics measure
only **one slice** of what determines real production quality.

A meaningful evaluation must cover **at least** the following five
dimensions:

| Dimension | What it measures | Why it matters here |
|---|---|---|
| Retrieval | Are the right chunks coming back for a one-shot query? | Standard RAG metric; necessary but not sufficient. |
| Paragraph relevance | Are the chunks coherent paragraphs (vs noise)? | Our chunker has a boilerplate-filter step; this checks it. |
| Answer faithfulness | Does the answer stay grounded in the retrieved chunks? | LLM hallucination risk. |
| Summarisation quality | Does the answer cover the key points without padding? | Common user-task on this corpus. |
| Tone & register | Is the answer appropriately formal / cautious for an audit-report context? | Domain-specific quality bar. |

A **production-realistic** evaluation must also cover:

| Dimension | What it measures |
|---|---|
| Filter-inference accuracy | When the user says "Gulu 2022", does the system extract those filters? |
| Multi-turn task completion | Across a 2–5 turn conversation, does the system stay on-topic and remember anchored filters? |
| Filter relaxation behaviour | When a filter combo has 0 docs, does the system relax gracefully or fail silently? |
| Cost per query | What does each kind of query cost on the configured LLM? |

The proposed close-out evaluation suite includes both dimension sets.

---

## 3. Proposed evaluation methodology

### 3.1 Dataset construction

**Goal**: produce 150-200 questions with gold-standard answers and
gold-standard chunk references, covering:

- **30%** simple-factual ("How much was district X's audit budget in
  2022?")
- **20%** comparative ("Which 3 districts had the largest VFM
  findings?")
- **20%** summarisation ("Summarise the OAG 2022 annual report")
- **15%** multi-turn ("What about 2021?", "And for sources audited by
  OAG only?")
- **15%** edge cases (impossible filter combos, ambiguous district
  names, follow-ups about prior turns)

Construction process:

1. Sample chunks from the live Qdrant collection covering each
   dimension and source.
2. For each chunk, manually formulate 1-3 questions that the chunk
   genuinely answers. (Manual = expensive, but the only way to avoid
   the abstraction-mismatch problem of automated generation.)
3. For each question, record the GOLD ANSWER text and the chunks the
   answer comes from.
4. For multi-turn questions, record the conversation history they
   depend on.
5. Have a second annotator review 20% of the questions for quality.

Estimated effort: **4-6 ML-engineer days** (per Eric's WP3 estimate).

### 3.2 Metrics

| Dimension | Metric | How computed |
|---|---|---|
| Retrieval | Recall@k (k=5,10,20), MRR | Standard: did the gold chunks appear in top-k? |
| Paragraph relevance | Coherence score (LLM-judge) | LLM-as-judge over each retrieved chunk |
| Answer faithfulness | Faithfulness score (LLM-judge) | RAGAS-style faithfulness — does the answer match the retrieved evidence? |
| Summarisation | ROUGE-L vs gold summary; LLM-judge for content coverage | Two complementary scores |
| Tone | LLM-judge tone score (formal vs casual; cautious vs assertive) | Single LLM judge call |
| Filter inference | Accuracy on filter dict (exact match) | Compare extracted filters to gold filters per question |
| Multi-turn task completion | Task-completed (yes/no) per multi-turn case | Manual annotation of final-turn answer |
| Filter relaxation | Behavior label (relaxed/asked-follow-up/failed) | Manual review of edge-case set |
| Cost | $ per query | Sum of LLM input + output token cost |

Confidence intervals on all reported metrics: bootstrap CIs with 1000
resamples.

### 3.3 LLM-judge configuration

We use the **OpenAI** strong model (`gpt-4.1` per
`src/config/settings.yaml::reader.OPENAI_STRONG.model`) as the judge for
faithfulness, tone, and content-coverage metrics. We do NOT use the
same model as the answer generator (`gpt-4o-mini`) to avoid
self-evaluation bias.

Judge prompts and few-shot examples will be checked into `evaluation/`
when the dataset is built.

---

## 4. Existing evaluation infrastructure (status)

The original evaluation pipeline (built before the dataset issue was
discovered) lives in `_archive/` after the May 2026 cleanup. It included:

- A test runner that loaded the RAGAS-generated questions, ran the
  pipeline, computed RAGAS metrics, and emitted a JSON report.
- Hand-rolled retrieval-quality metrics (recall@k, MRR) per question.
- A summarisation comparison script.

For the close-out evaluation we will:
- Restore the relevant scripts from `_archive/` into `evaluation/`.
- Update them to consume the rebuilt dataset.
- Wire the metrics into a single `python -m evaluation.run` command.

---

## 5. Reproducibility

### 5.1 Versioning

Once the evaluation pipeline is restored:

| Artefact | Where |
|---|---|
| Test dataset | `evaluation/data/test_set_vYYYY-MM-DD.jsonl` |
| Evaluation script | `evaluation/run.py` |
| Judge prompts | `evaluation/prompts/` |
| Run reports | `evaluation/reports/<timestamp>_<commit_sha>.json` |
| System config snapshot | First entry in each report (model versions, `settings.yaml` hash) |

Every report should be reproducible from `commit_sha` + `dataset_path`
alone.

### 5.2 How to reproduce a report

```bash
# 1. Check out the commit the report was generated from
git checkout <commit_sha>

# 2. Restore dependencies
uv sync  # or pip install -r requirements.txt

# 3. Set environment variables (see runbook/rotate-credentials.md)
export OPENAI_API_KEY=...
export QDRANT_URL=...
export QDRANT_API_KEY=...

# 4. Run evaluation
python -m evaluation.run \
    --dataset evaluation/data/test_set_v2026-XX-XX.jsonl \
    --output evaluation/reports/$(date -u +%Y-%m-%dT%H%M%SZ)_$(git rev-parse --short HEAD).json
```

### 5.3 Sources of nondeterminism

- **LLM responses** are non-deterministic by default. We use
  `temperature=0` for evaluation runs to reduce variance, but exact
  byte-for-byte reproducibility across LLM provider releases is not
  guaranteed.
- **Qdrant content** must match the test-time state. If the corpus
  grows between two runs, retrieval scores will change. The report
  snapshots a Qdrant collection-name + commit-time hash for traceability.

---

## 6. What we will report at handover

The close-out report will include, per metric:

- Score with bootstrap 95% CI.
- Breakdown by question type (factual / comparative / summary / multi-turn /
  edge-case).
- Comparison against a **baseline** — the simplest possible single-shot
  RAG (no filter inference, no agent flow, no reranker). This lets the
  client see what the multi-agent design buys, in numbers.

Sample shape of the final report:

```
Metric                       Score (95% CI)        Baseline       Δ
Retrieval Recall@5           0.78 [0.74, 0.82]     0.61           +0.17
Answer Faithfulness          0.91 [0.88, 0.93]     0.79           +0.12
Filter Inference Accuracy    0.88 [0.83, 0.93]     n/a            n/a
Multi-Turn Task Completion   0.74 [0.66, 0.81]     0.31           +0.43
Cost per Query (USD)         0.0042 [...]          0.0028         +0.0014
```

---

## 7. Known limitations of this methodology

1. **LLM-as-judge bias**. Despite using a different model for
   answering vs judging, both models share design biases. We will
   spot-check a 10% sample manually as a sanity check.
2. **Manual dataset construction is slow** and reflects the
   constructor's mental model of "good answers". A future iteration
   could blend in real user queries (anonymized from
   `spaces_logs`-backed conversation logs).
3. **Cost figures depend on the LLM provider's pricing** at evaluation
   time. The report stamps the pricing used.

---

## Related

- [`accountability-transparency-limitations-biases.md`](accountability-transparency-limitations-biases.md)
- [`cost-and-performance.md`](cost-and-performance.md)
- [`architecture/adrs/004-llm-model-choice-gpt-4o-mini.md`](architecture/adrs/004-llm-model-choice-gpt-4o-mini.md)
- [`DEFERRED.md`](DEFERRED.md) — items related to evaluation