Spaces:

akryldigital
/

audit_assistant

Running

App Files Files Community

audit_assistant / docs /evaluation.md

akryldigital

add docs

815b494 verified 11 days ago

preview code

raw

history blame contribute delete

11.1 kB

	# Evaluation Methodology, Metrics, and Reproducibility

	This document covers what we did, what we found, what's still open, and
	how to reproduce it. It addresses Eric's WP5 requirements:
	*"documented evaluation process; evaluation results across retrieval,
	paragraphs, answer, summary and tone."*

	---

	## 1. Background and constraints

	### 1.1 What we were given

	A benchmark test dataset was built before the implementation phase began,
	using RAGAS multi-hop question generation over the audit-report
	corpus. The intent was to use this dataset to compute standard RAG
	metrics (recall@k, MRR, faithfulness, answer correctness).

	### 1.2 What we discovered

	When we wired up the evaluation pipeline and ran the original test set,
	two systematic issues surfaced:

	1. Abstraction mismatch. Questions were generated at a higher
	abstraction level than the corpus supports — e.g. "What are the main
	audit themes across Uganda's districts in 2022?". The corpus chunks
	are concrete (specific findings, specific districts, specific
	figures). A question that's vague enough to demand "the main themes
	across Uganda" maps to many chunks, none of which the test labels
	as correct.

	2. Sparse gold labels. RAGAS labels 2–3 chunks per question as
	ground truth. In a corpus where templated audit-report wording
	repeats across hundreds of similar reports (different districts,
	same finding categories), many other chunks would be equally
	appropriate answers. The metric punishes the system for retrieving
	them.

	The combined effect: metrics produced by this set were not predictive
	of real production quality. We could improve real quality and watch the
	metric drop, or vice versa. So we paused the formal quantitative
	evaluation and proceeded with qualitative iterative testing (Martin,
	Dyna, the implementation team) plus a detailed cost-quality analysis
	shared with the client in December 2025.

	### 1.3 What this means for handover

	We are carrying forward two evaluation activities as part of WP5
	close-out:

	1. Rebuild a representative benchmark dataset. ~1 week of focused
	work to produce ~150-200 questions covering retrieval, paragraph
	relevance, answer faithfulness, summarisation quality, tone.
	2. Document the existing evaluation pipeline (this doc + the code in
	`src/evaluation/` if/when added back from `_archive/`).

	---

	## 2. What the system is (a reminder for the evaluator)

	The Audit Assistant is NOT a single-shot RAG system. It is a
	multi-turn, multi-agent system with:

	- structured filter inference from natural language;
	- LLM-based query rewriting using conversation history;
	- pre-validation of filter combinations against the corpus (cheap
	`count()` check) before expensive retrieval;
	- conversational state carrying anchored filters across turns;
	- cross-encoder reranking with a CPU-aware skip optimisation.

	Standard "give a question, score the retrieved chunks" metrics measure
	only one slice of what determines real production quality.

	A meaningful evaluation must cover at least the following five
	dimensions:

	\| Dimension \| What it measures \| Why it matters here \|
	\|---\|---\|---\|
	\| Retrieval \| Are the right chunks coming back for a one-shot query? \| Standard RAG metric; necessary but not sufficient. \|
	\| Paragraph relevance \| Are the chunks coherent paragraphs (vs noise)? \| Our chunker has a boilerplate-filter step; this checks it. \|
	\| Answer faithfulness \| Does the answer stay grounded in the retrieved chunks? \| LLM hallucination risk. \|
	\| Summarisation quality \| Does the answer cover the key points without padding? \| Common user-task on this corpus. \|
	\| Tone & register \| Is the answer appropriately formal / cautious for an audit-report context? \| Domain-specific quality bar. \|

	A production-realistic evaluation must also cover:

	\| Dimension \| What it measures \|
	\|---\|---\|
	\| Filter-inference accuracy \| When the user says "Gulu 2022", does the system extract those filters? \|
	\| Multi-turn task completion \| Across a 2–5 turn conversation, does the system stay on-topic and remember anchored filters? \|
	\| Filter relaxation behaviour \| When a filter combo has 0 docs, does the system relax gracefully or fail silently? \|
	\| Cost per query \| What does each kind of query cost on the configured LLM? \|

	The proposed close-out evaluation suite includes both dimension sets.

	---

	## 3. Proposed evaluation methodology

	### 3.1 Dataset construction

	Goal: produce 150-200 questions with gold-standard answers and
	gold-standard chunk references, covering:

	- 30% simple-factual ("How much was district X's audit budget in
	2022?")
	- 20% comparative ("Which 3 districts had the largest VFM
	findings?")
	- 20% summarisation ("Summarise the OAG 2022 annual report")
	- 15% multi-turn ("What about 2021?", "And for sources audited by
	OAG only?")
	- 15% edge cases (impossible filter combos, ambiguous district
	names, follow-ups about prior turns)

	Construction process:

	1. Sample chunks from the live Qdrant collection covering each
	dimension and source.
	2. For each chunk, manually formulate 1-3 questions that the chunk
	genuinely answers. (Manual = expensive, but the only way to avoid
	the abstraction-mismatch problem of automated generation.)
	3. For each question, record the GOLD ANSWER text and the chunks the
	answer comes from.
	4. For multi-turn questions, record the conversation history they
	depend on.
	5. Have a second annotator review 20% of the questions for quality.

	Estimated effort: 4-6 ML-engineer days (per Eric's WP3 estimate).

	### 3.2 Metrics

	\| Dimension \| Metric \| How computed \|
	\|---\|---\|---\|
	\| Retrieval \| Recall@k (k=5,10,20), MRR \| Standard: did the gold chunks appear in top-k? \|
	\| Paragraph relevance \| Coherence score (LLM-judge) \| LLM-as-judge over each retrieved chunk \|
	\| Answer faithfulness \| Faithfulness score (LLM-judge) \| RAGAS-style faithfulness — does the answer match the retrieved evidence? \|
	\| Summarisation \| ROUGE-L vs gold summary; LLM-judge for content coverage \| Two complementary scores \|
	\| Tone \| LLM-judge tone score (formal vs casual; cautious vs assertive) \| Single LLM judge call \|
	\| Filter inference \| Accuracy on filter dict (exact match) \| Compare extracted filters to gold filters per question \|
	\| Multi-turn task completion \| Task-completed (yes/no) per multi-turn case \| Manual annotation of final-turn answer \|
	\| Filter relaxation \| Behavior label (relaxed/asked-follow-up/failed) \| Manual review of edge-case set \|
	\| Cost \| $ per query \| Sum of LLM input + output token cost \|

	Confidence intervals on all reported metrics: bootstrap CIs with 1000
	resamples.

	### 3.3 LLM-judge configuration

	We use the OpenAI strong model (`gpt-4.1` per
	`src/config/settings.yaml::reader.OPENAI_STRONG.model`) as the judge for
	faithfulness, tone, and content-coverage metrics. We do NOT use the
	same model as the answer generator (`gpt-4o-mini`) to avoid
	self-evaluation bias.

	Judge prompts and few-shot examples will be checked into `evaluation/`
	when the dataset is built.

	---

	## 4. Existing evaluation infrastructure (status)

	The original evaluation pipeline (built before the dataset issue was
	discovered) lives in `_archive/` after the May 2026 cleanup. It included:

	- A test runner that loaded the RAGAS-generated questions, ran the
	pipeline, computed RAGAS metrics, and emitted a JSON report.
	- Hand-rolled retrieval-quality metrics (recall@k, MRR) per question.
	- A summarisation comparison script.

	For the close-out evaluation we will:
	- Restore the relevant scripts from `_archive/` into `evaluation/`.
	- Update them to consume the rebuilt dataset.
	- Wire the metrics into a single `python -m evaluation.run` command.

	---

	## 5. Reproducibility

	### 5.1 Versioning

	Once the evaluation pipeline is restored:

	\| Artefact \| Where \|
	\|---\|---\|
	\| Test dataset \| `evaluation/data/test_set_vYYYY-MM-DD.jsonl` \|
	\| Evaluation script \| `evaluation/run.py` \|
	\| Judge prompts \| `evaluation/prompts/` \|
	\| Run reports \| `evaluation/reports/<timestamp>_<commit_sha>.json` \|
	\| System config snapshot \| First entry in each report (model versions, `settings.yaml` hash) \|

	Every report should be reproducible from `commit_sha` + `dataset_path`
	alone.

	### 5.2 How to reproduce a report

	```bash
	# 1. Check out the commit the report was generated from
	git checkout <commit_sha>

	# 2. Restore dependencies
	uv sync # or pip install -r requirements.txt

	# 3. Set environment variables (see runbook/rotate-credentials.md)
	export OPENAI_API_KEY=...
	export QDRANT_URL=...
	export QDRANT_API_KEY=...

	# 4. Run evaluation
	python -m evaluation.run \
	--dataset evaluation/data/test_set_v2026-XX-XX.jsonl \
	--output evaluation/reports/$(date -u +%Y-%m-%dT%H%M%SZ)_$(git rev-parse --short HEAD).json
	```

	### 5.3 Sources of nondeterminism

	- LLM responses are non-deterministic by default. We use
	`temperature=0` for evaluation runs to reduce variance, but exact
	byte-for-byte reproducibility across LLM provider releases is not
	guaranteed.
	- Qdrant content must match the test-time state. If the corpus
	grows between two runs, retrieval scores will change. The report
	snapshots a Qdrant collection-name + commit-time hash for traceability.

	---

	## 6. What we will report at handover

	The close-out report will include, per metric:

	- Score with bootstrap 95% CI.
	- Breakdown by question type (factual / comparative / summary / multi-turn /
	edge-case).
	- Comparison against a baseline — the simplest possible single-shot
	RAG (no filter inference, no agent flow, no reranker). This lets the
	client see what the multi-agent design buys, in numbers.

	Sample shape of the final report:

	```
	Metric Score (95% CI) Baseline Δ
	Retrieval Recall@5 0.78 [0.74, 0.82] 0.61 +0.17
	Answer Faithfulness 0.91 [0.88, 0.93] 0.79 +0.12
	Filter Inference Accuracy 0.88 [0.83, 0.93] n/a n/a
	Multi-Turn Task Completion 0.74 [0.66, 0.81] 0.31 +0.43
	Cost per Query (USD) 0.0042 [...] 0.0028 +0.0014
	```

	---

	## 7. Known limitations of this methodology

	1. LLM-as-judge bias. Despite using a different model for
	answering vs judging, both models share design biases. We will
	spot-check a 10% sample manually as a sanity check.
	2. Manual dataset construction is slow and reflects the
	constructor's mental model of "good answers". A future iteration
	could blend in real user queries (anonymized from
	`spaces_logs`-backed conversation logs).
	3. Cost figures depend on the LLM provider's pricing at evaluation
	time. The report stamps the pricing used.

	---

	## Related

	- [`accountability-transparency-limitations-biases.md`](accountability-transparency-limitations-biases.md)
	- [`cost-and-performance.md`](cost-and-performance.md)
	- [`architecture/adrs/004-llm-model-choice-gpt-4o-mini.md`](architecture/adrs/004-llm-model-choice-gpt-4o-mini.md)
	- [`DEFERRED.md`](DEFERRED.md) — items related to evaluation