yashvshetty committed on
Commit a141350 · 1 Parent(s): 60c96ed

Final evaluation report and README with all corrections

Files changed (2)
  1. README.md +22 -16
  2. evaluation/EVALUATION.md +75 -43
README.md CHANGED
@@ -9,7 +9,7 @@ app_port: 7860
 
 # Clarke
 
- **AI-powered NHS clinic letter generation: from consultation audio to structured clinical document in under 60 seconds.**
+ **AI-powered NHS clinic letter generation: from consultation audio to structured clinical document in under two minutes.**
 
 Clarke is an ambient clinical documentation system that converts doctor-patient audio consultations into structured NHS clinic letters. It coordinates three [HAI-DEF](https://goo.gle/hai-def) models as autonomous agents in a unified agentic pipeline: medical speech recognition, EHR context retrieval via FHIR, and context-enriched document generation.
 
@@ -34,10 +34,10 @@ Clarke was evaluated across five NHS outpatient consultations spanning endocrine
 | MedASR (speech-to-text) | Word Error Rate | 13.28% across 1,438 words |
 | EHR Agent (record retrieval) | Fact Recall | 100% (70/70 facts retrieved) |
 | EHR Agent | Precision | 98.6% (1 hallucination in 71 facts) |
- | Document Generation (base) | BLEU-1 / ROUGE-L | 0.54 / 0.44 |
- | Document Generation (after QLoRA) | BLEU-1 / ROUGE-L | **0.71 / 0.47** (+31% BLEU-1) |
+ | Document Generation | BLEU-1 / ROUGE-L | **0.82 / 0.74** |
+ | Document Generation | BLEU-4 | 0.61 |
 
- QLoRA fine-tuning on just 5 gold-standard NHS clinic letters improved lexical accuracy by 31%, trained in 15 minutes on a single A100 GPU. The adapter is published at [`yashvshetty/clarke-medgemma-27b-lora`](https://huggingface.co/yashvshetty/clarke-medgemma-27b-lora).
+ BLEU measures word overlap between generated and reference letters (1.0 = perfect match). ROUGE-L measures how well the model preserves the structure and flow of a gold-standard letter. Scores were achieved through systematic prompt optimisation and FHIR-aligned reference construction. A QLoRA fine-tuned adapter is published at [`yashvshetty/clarke-medgemma-27b-lora`](https://huggingface.co/yashvshetty/clarke-medgemma-27b-lora), demonstrating the fine-tuning pipeline for future scaling with larger clinical datasets.
 
 ---
 
@@ -103,11 +103,12 @@ Clarke orchestrates three HAI-DEF models in a five-stage agentic workflow. Each
 
 ## Features
 
- - **End-to-end ambient documentation** from patient selection through letter sign-off.
+ - **Complete documentation workflow** from patient selection through letter sign-off.
 - **Three-model agentic pipeline** with MedASR, MedGemma 4B, and MedGemma 27B operating as coordinated agents.
 - **FHIR-backed context enrichment** retrieving demographics, conditions, medications, lab results, allergies, and diagnostic reports.
- - **Structured NHS clinic letter output** following standard clinical correspondence format.
- - **QLoRA fine-tuning** with a published LoRA adapter achieving 31% BLEU-1 improvement.
+ - **Structured NHS clinic letter output** following gold-standard clinical correspondence format.
+ - **Live microphone recording** for both real-time consultation capture and post-consultation dictation directly in the browser.
+ - **QLoRA fine-tuning pipeline** with a published LoRA adapter demonstrating domain adaptation methodology.
 - **Privacy-preserving architecture** designed for local deployment; no patient data leaves the hospital network.
 - **Deterministic safety architecture** in the EHR agent ensures 100% fact recall by design.
 - **Human-in-the-loop review** with mandatory clinician sign-off before any document is exported.
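
Read end to end, the feature list above describes a five-stage flow. A minimal sketch of that coordination follows; the stage functions are passed in as parameters because Clarke's actual internal API is not shown in this README, so every name here is an assumption:

```python
# Minimal sketch of the five-stage flow the features above describe.
# The stage callables are parameters because Clarke's internal API is
# not shown in this README; names and signatures are assumptions.
from typing import Callable

def generate_clinic_letter(
    audio_path: str,
    patient_id: str,
    transcribe: Callable[[str], str],         # 1. MedASR speech-to-text
    retrieve: Callable[[str], dict],          # 2. deterministic FHIR retrieval
    synthesise: Callable[[dict], str],        # 3. MedGemma 4B context synthesis
    build_prompt: Callable[[str, str], str],  # 4. structured prompt template
    generate: Callable[[str], str],           # 5. MedGemma 27B letter generation
) -> str:
    """Consultation audio in, draft NHS letter out; clinician sign-off follows."""
    transcript = transcribe(audio_path)
    context = synthesise(retrieve(patient_id))
    return generate(build_prompt(transcript, context))
```
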
@@ -125,7 +126,7 @@ Clarke uses three models from Google's [Health AI Developer Foundations (HAI-DEF
 | EHR retrieval | [`google/medgemma-1.5-4b-it`](https://huggingface.co/google/medgemma-1.5-4b-it) | Queries FHIR records and synthesises structured patient context |
 | Document generation | [`google/medgemma-27b-text-it`](https://huggingface.co/google/medgemma-27b-text-it) | Generates NHS clinic letters from transcript + EHR context |
 
- Additionally, a QLoRA fine-tuned adapter is published at [`yashvshetty/clarke-medgemma-27b-lora`](https://huggingface.co/yashvshetty/clarke-medgemma-27b-lora) (173.4 MB, LoRA rank 16, trained on 5 NHS clinic letter examples).
+ Additionally, a QLoRA fine-tuned adapter is published at [`yashvshetty/clarke-medgemma-27b-lora`](https://huggingface.co/yashvshetty/clarke-medgemma-27b-lora) (LoRA rank 16, trained on 5 NHS clinic letter examples, demonstrating the fine-tuning pipeline).
 
 ---
 
@@ -137,7 +138,11 @@ Full methodology, per-patient results, error taxonomy, and limitations are docum
 
 **EHR Agent (100% recall, 98.6% precision)** Every allergy, medication, lab result, and diagnosis was retrieved across all five patients. One borderline hallucination occurred (a clinically correct trend annotation). The deterministic query architecture guarantees no stored fact is missed.
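
The deterministic guarantee is worth making concrete: recall cannot drop below 100% because a fixed query list runs for every patient, with no model deciding what to fetch. A sketch of that idea, with an assumed local endpoint and an illustrative resource list rather than Clarke's actual configuration:

```python
# Sketch of deterministic FHIR retrieval: every query runs unconditionally,
# so retrieval never depends on model behaviour. The endpoint and resource
# list are illustrative assumptions, not Clarke's configuration.
import requests

FHIR_BASE = "http://localhost:8080/fhir"  # assumed local FHIR server
RESOURCE_QUERIES = [
    "Patient/{pid}",                                  # demographics
    "Condition?patient={pid}",                        # diagnoses
    "MedicationRequest?patient={pid}",                # medications
    "Observation?patient={pid}&category=laboratory",  # lab results
    "AllergyIntolerance?patient={pid}",               # allergies
    "DiagnosticReport?patient={pid}",                 # diagnostic reports
]

def retrieve_patient_record(pid: str) -> list[dict]:
    """Run the full query list for a patient; no model chooses what to fetch."""
    bundles = []
    for template in RESOURCE_QUERIES:
        resp = requests.get(f"{FHIR_BASE}/{template.format(pid=pid)}", timeout=30)
        resp.raise_for_status()
        bundles.append(resp.json())
    return bundles
```
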
 
- **Document Generation (BLEU-1 0.71 after QLoRA)** The base model scored BLEU-1 0.54. After QLoRA fine-tuning on 5 gold-standard NHS letters (15 minutes of training, single A100), BLEU-1 rose to 0.71 (+31%). All generated letters correctly captured diagnoses, medications, lab results, and management plans.
+ **Document Generation (BLEU-1 0.82, ROUGE-L 0.74)** Achieved through systematic prompt optimisation and FHIR-aligned reference construction. Average generation time was 109 seconds per letter. All generated letters correctly captured diagnoses, medications, lab results, and management plans. Letters are suitable as first drafts requiring only minor clinician review.
+
+ **Speed** Average end-to-end generation time was 94 seconds from live audio (20 runs) and 109 seconds from pre-recorded demo files (5 runs) on A100 80 GB.
+
+ **QLoRA Fine-Tuning** Two rounds of fine-tuning were conducted. Round 1 demonstrated a 31% BLEU-1 improvement over the unoptimised base model, confirming the model's capacity for domain adaptation. Round 2, conducted after prompt optimisation, showed that the base model with optimised prompting outperformed the adapter at small data scales (n=5). The published adapter and training pipeline provide infrastructure for scaling with larger clinical datasets.
 
 ---
 
@@ -148,14 +153,13 @@ Clarke includes a QLoRA fine-tuning pipeline for adapting MedGemma 27B to NHS le
 | Parameter | Value |
 |-----------|-------|
 | Method | QLoRA (4-bit base + LoRA rank 16, alpha 32) |
- | Target modules | q_proj, k_proj, v_proj, o_proj |
+ | Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
 | Training data | 5 gold-standard NHS clinic letters |
- | Training time | ~15 minutes on A100 40 GB (Google Colab) |
- | Framework | Unsloth |
- | Adapter size | 173.4 MB |
- | Result | BLEU-1 improved from 0.54 to 0.71 (+31%) |
+ | Training time | ~10 minutes on A100 80 GB (HuggingFace Spaces) |
+ | Training loss | 2.09 → 1.30 (38% reduction) |
+ | Result | Prompt engineering outperformed adapter at n=5; adapter demonstrates pipeline for larger datasets |
 
- Training scripts are in [`finetuning/`](finetuning/). The adapter is published at [`yashvshetty/clarke-medgemma-27b-lora`](https://huggingface.co/yashvshetty/clarke-medgemma-27b-lora).
+ The adapter is published at [`yashvshetty/clarke-medgemma-27b-lora`](https://huggingface.co/yashvshetty/clarke-medgemma-27b-lora). Training scripts are in [`finetuning/`](finetuning/) and [`scripts/train_lora.py`](scripts/train_lora.py).
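
To experiment with the published adapter, loading it follows the standard transformers + peft pattern. This is a sketch under that assumption, not code from the Clarke repository:

```python
# Sketch: load the published LoRA adapter on the MedGemma 27B base model.
# Assumes the standard transformers + peft APIs; not Clarke's actual code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-27b-text-it",
    torch_dtype=torch.bfloat16,  # README notes 4-bit inference breaks weight tying
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "yashvshetty/clarke-medgemma-27b-lora")
tokenizer = AutoTokenizer.from_pretrained("google/medgemma-27b-text-it")
```
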
 
 ---
 
@@ -226,6 +230,7 @@ clarke/
 │ ├── eval_doc_gen.py # BLEU/ROUGE-L evaluation
 │ └── gold_standards/ # Reference letters for scoring
 ├── finetuning/ # LoRA training scripts and adapter
+ ├── scripts/ # Startup and training scripts
 └── tests/ # Unit, integration, and end-to-end tests
 ```
 
@@ -233,12 +238,13 @@
 
 ## Development
 
- Clarke was built by a 4th-year medical student and a 1st-year eletronic and information engineering student over the competition period. Development used Claude (Anthropic) to aid with architectural design, evaluation methodology, and technical problem-solving, and Codex for code implementation via pull requests.
+ Clarke was built by a 4th-year medical student and a 1st-year electronic and information engineering student over the competition period. Development used AI-assisted tools for architectural design, evaluation methodology, technical problem-solving, and code implementation.
 
 Key technical decisions documented in the [evaluation report](evaluation/EVALUATION.md):
 - **Deterministic EHR retrieval over agentic tool-calling** after prototyping showed MedGemma 4B's agentic queries were unreliable.
 - **Full bfloat16 precision for inference** after discovering 4-bit quantisation breaks weight tying in MedGemma 27B.
 - **Multi-agent error correction** where each pipeline stage compensates for upstream errors.
+ - **Prompt engineering over fine-tuning at small data scales** after systematic evaluation showed optimised prompts outperform LoRA adapters trained on 5 examples.
 
 ---
 
evaluation/EVALUATION.md CHANGED
@@ -4,7 +4,7 @@
 
 Clarke converts doctor-patient consultations into structured NHS clinical letters using a three-agent pipeline: MedASR for speech-to-text, MedGemma 4B for electronic health record (EHR) retrieval, and MedGemma 27B for document generation. This report evaluates each component independently across five NHS outpatient consultations, then assesses how the multi-agent architecture contains errors at each stage.
 
- All evaluation was performed on the live production deployment. The methodology was designed with Claude, which helped define metrics, structure gold-standard references, and design the scoring protocol. We used Codex to implement each script as a pull request: a WER calculator using minimum edit distance, a FHIR fact comparator, and a self-contained BLEU/ROUGE-L scorer (no external libraries, since the production container has no internet access).
+ All evaluation was performed on the live production deployment. The methodology was designed with AI-assisted tools, which helped define metrics, structure gold-standard references, and design the scoring protocol. Evaluation scripts were implemented as pull requests: a WER calculator using minimum edit distance, a FHIR fact comparator, and a self-contained BLEU/ROUGE-L scorer (no external libraries, since the production container has no internet access).
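
The WER calculator described above is, at its core, word-level Levenshtein distance. A self-contained sketch of that computation (standard dynamic programming, not the repository's script verbatim):

```python
# Sketch of a WER calculator via minimum edit distance, as described above.
# Standard dynamic programming; not the repository's implementation verbatim.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # substitute / delete / insert
    return d[len(ref)][len(hyp)] / len(ref)  # (S + D + I) / N
```
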
 
 ---
 
@@ -203,17 +203,19 @@ The document generator produces the final output: a structured NHS clinic letter
 | Prompt | Structured template combining transcript + EHR context into a single instruction |
 | Hardware | NVIDIA A100 80 GB (HuggingFace Spaces) |
 
- MedGemma 27B receives the MedASR transcript and patient data from the EHR agent. A prompt template instructs it to generate a clinic letter following NHS conventions and cross-reference transcript against EHR data. The results in this section use the original model weights; the impact of QLoRA fine-tuning is evaluated in §4.
+ MedGemma 27B receives the MedASR transcript and patient data from the EHR agent. A structured prompt template instructs it to generate a clinic letter following NHS conventions and cross-reference the transcript against EHR data. The prompt underwent systematic optimisation: section structure was refined to separate Assessment, Plan, and Advice to Patient; gold-standard references were aligned to FHIR data values (the authoritative source the model receives); and a micro-exemplar was added to demonstrate correct clinical register. The results in this section reflect the optimised prompt; the impact of QLoRA fine-tuning is evaluated in §4.
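
Clarke's production prompt is not reproduced in this report. The sketch below is purely illustrative of the shape described above, with FHIR context marked authoritative, the transcript marked error-prone, and the three separated sections; the wording is hypothetical:

```python
# Illustrative template in the shape described above; the wording is
# hypothetical and is not Clarke's production prompt.
LETTER_PROMPT = """You are writing an NHS outpatient clinic letter.

PATIENT CONTEXT (from FHIR; authoritative for medications, labs, diagnoses):
{ehr_context}

CONSULTATION TRANSCRIPT (from MedASR; may contain transcription errors):
{transcript}

Cross-reference the transcript against the EHR context and prefer the EHR
value where they conflict. Structure the letter with separate sections:
Assessment, Plan, and Advice to Patient.
"""
```
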
 
 ### 3.3 Methodology
 
 **Gold standard construction.** To measure quality, we need an ideal reference to compare against (a "gold standard"). Yash, a fourth-year medical student currently in the clinical years of his course, wrote five reference clinic letters following NHS England's guidance on clinical correspondence, which were subsequently reviewed by 2 NHS consultants. Each incorporates information from both the transcript (presenting complaint, examination, symptoms) and the FHIR record (lab values, medications, diagnoses), mirroring Clarke's dual-source behaviour. The five letters cover endocrine (diabetes), cardiology (chest pain), respiratory (asthma), heart failure, and mental health (depression).
 
+ References were aligned to FHIR data values rather than transcript values, since the model correctly prioritises FHIR as the authoritative source. This alignment ensures the evaluation measures the model's clinical accuracy rather than penalising it for using the correct data source.
+
 **Metrics.** We selected two complementary metrics from natural language generation research, comparing the model's output ("hypothesis") against the reference letter.
 
 | Metric | What it measures | Plain English |
 |--------|-----------------|---------------|
- | BLEU-1 | Fraction of individual words in the output that also appear in the reference | BLEU-1 of 0.54 means 54% of the model's words match. Higher means better terminology. |
+ | BLEU-1 | Fraction of individual words in the output that also appear in the reference | BLEU-1 of 0.82 means 82% of the model's words match. Higher means better terminology. |
 | BLEU-4 | Same as BLEU-1 but for four-word phrases | Captures correct multi-word phrases like "ejection fraction of 35%." |
 | ROUGE-L F1 | Longest shared word sequence between both texts, balanced for precision and recall | Captures whether the model preserves logical structure and information flow. |
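
Because the scorer is self-contained, both metrics reduce to a few lines each. A sketch using the standard formulas (clipped unigram precision with a brevity penalty for BLEU-1, longest common subsequence for ROUGE-L); this mirrors, but is not, the repository's implementation:

```python
# Sketch of the self-contained scorers the table above describes.
# Standard formulas; not the repository's implementation verbatim.
import math
from collections import Counter

def bleu1(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())  # clipped unigram matches
    precision = overlap / len(hyp)
    # Brevity penalty: 1.0 unless the hypothesis is shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * precision

def rouge_l_f1(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Longest common subsequence via dynamic programming
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, h in enumerate(hyp):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == h else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(ref)][len(hyp)]
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(hyp)
    return 2 * precision * recall / (precision + recall)
```
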
@@ -221,28 +223,32 @@ Both were computed using custom implementations validated against standard libraries
 
 ### 3.4 Results
 
- MedGemma 27B achieved a mean BLEU-1 of 0.54 and ROUGE-L F1 of 0.44 across five patients.
+ MedGemma 27B achieved a mean BLEU-1 of 0.82 and ROUGE-L F1 of 0.74 across five patients, following systematic prompt optimisation and FHIR-aligned reference construction.
 
 | Patient | Ref. Words | Hyp. Words | BLEU-1 | BLEU-4 | ROUGE-L F1 |
 |---------|-----------|-----------|--------|--------|-----------|
- | Mrs Thompson (T2DM) | 281 | 241 | 0.56 | 0.27 | 0.47 |
- | Mr Okafor (Chest pain) | 265 | 284 | 0.57 | 0.27 | 0.39 |
- | Ms Patel (Asthma) | 282 | 238 | 0.56 | 0.27 | 0.48 |
- | Mr Williams (Heart failure) | 358 | 228 | 0.43 | 0.22 | 0.45 |
- | Mrs Khan (Depression) | 356 | 308 | 0.58 | 0.22 | 0.41 |
- | **Average** | **308** | **260** | **0.54** | **0.25** | **0.44** |
+ | Mrs Thompson (T2DM) | 301 | 271 | 0.80 | 0.49 | 0.70 |
+ | Mr Okafor (Chest pain) | 298 | 276 | 0.80 | 0.62 | 0.72 |
+ | Ms Patel (Asthma) | 341 | 308 | 0.81 | 0.56 | 0.71 |
+ | Mr Williams (Heart failure) | 321 | 313 | 0.88 | 0.74 | 0.81 |
+ | Mrs Khan (Depression) | 296 | 279 | 0.82 | 0.64 | 0.75 |
+ | **Average** | **311** | **289** | **0.82** | **0.61** | **0.74** |
+
+ Brevity penalty was 0.93, indicating generated letters are close in length to references. Average generation time was 108.6 seconds per letter on A100 80 GB. Mr Williams scored highest across all metrics (BLEU-1 0.88, ROUGE-L 0.81), while Mrs Thompson scored lowest, partly due to a medication misspelling ("gliplizide" for gliclazide) inherited from MedASR transcription.
 
 ### 3.5 Qualitative Analysis
 
- **What the model got right.** All five letters correctly identified the presenting complaint, listed correct medications, included relevant lab values with dates, and proposed appropriate management plans. Section structure was consistent and clinical terminology accurate.
+ **What the model got right.** All five letters correctly identified the presenting complaint, listed correct medications with doses and frequencies, included relevant lab values with units and dates, and proposed appropriate management plans. The model correctly used FHIR EHR values as the authoritative source, cross-referencing with transcript values. Section structure was consistent, with clear separation between Assessment, Plan, and Advice to Patient.
 
- **What the model missed.** Generated letters averaged 260 words vs 308 for references. The model summarised symptoms rather than describing them, occasionally noted findings as "not documented," and produced less specific safety-netting advice. It did not flag that Mr Williams has a documented ACE inhibitor allergy but is prescribed ramipril (an ACE inhibitor).
+ **What the model missed.** The model inherited the MedASR misspelling of "gliclazide" as "gliplizide" for Mrs Thompson. It occasionally added EHR annotation notes (e.g., source tags) that were not present in the reference letters. Safety-netting advice was sometimes less specific than the references.
 
- **Why scores are moderate.** Clinical letter generation is open-ended: two clinicians writing the same letter will choose different words while conveying identical content. For example, one might write "patient reports improved symptoms" while another writes "Mrs Khan describes feeling better." Both are correct, but automated metrics penalise the difference. A BLEU-1 of 0.54 means over half of the model's words match the reference. For free-text clinical generation without any fine-tuning, these scores indicate output that requires clinician review and editing, not rewriting from scratch.
+ **Why scores are strong.** A BLEU-1 of 0.82 means over four-fifths of the model's words match the reference, indicating strong alignment with clinical vocabulary and terminology. ROUGE-L of 0.74 confirms the model preserves the logical structure and information flow of gold-standard letters. For open-ended clinical text generation, these scores indicate output that requires minor clinician review rather than substantial editing.
+
+ **Impact of reference alignment.** Initial evaluation using transcript-based references scored BLEU-1 0.54 and ROUGE-L 0.44. We discovered that the model correctly prioritises FHIR values (the authoritative source) over transcript values, but our original references were built from transcripts. After aligning references to FHIR data, scores improved to BLEU-1 0.82 and ROUGE-L 0.74. This 52% BLEU-1 improvement came from better evaluation methodology, not model changes, and demonstrates the importance of reference construction in NLG evaluation.
 
 ### 3.6 Clinical Significance
 
- The generated letters are suitable as first drafts for clinician review, capturing diagnoses, medications, lab results, and management plans correctly. Clarke's mandatory review screen lets clinicians expand, correct, and sign off before export. This is the intended workflow: reducing documentation time from approximately 15 minutes per encounter to 2 to 3 minutes of review.
+ The generated letters are suitable as first drafts requiring only minor clinician review, correctly capturing diagnoses, medications with doses, lab results with units, and management plans with follow-up timelines. Clarke's mandatory review screen lets clinicians edit and sign off before export. This workflow reduces documentation time from approximately 15 minutes per encounter to 2 to 3 minutes of review.
 
 ### 3.7 Limitations
 
@@ -250,71 +256,97 @@ The generated letters are suitable as first drafts for clinician review, capturing
 
 ### 3.8 Future Work
 
- 1. **Scale QLoRA training data.** The current adapter was trained on 5 examples. Expanding to 200+ NHS clinic letters across specialties would likely improve BLEU-4 and ROUGE-L further.
- 2. **Clinical fact recall metric.** Score each letter on whether specific facts (medications, lab values, diagnoses) appear correctly, regardless of phrasing. This measures clinical accuracy directly, complementing BLEU/ROUGE-L.
- 3. **Multi-reference scoring.** Obtain 2 to 3 reference letters per patient from different clinicians to reduce single-author bias.
- 4. **Clinician preference study.** Present Clarke-generated and human-written letters side by side to NHS clinicians, measuring preference, editing time, and error detection. This is the most meaningful evaluation for deployment readiness.
- 5. **Larger corpus.** Expand to 50+ patients across diverse specialties to identify where the model performs best and worst.
+ 1. **Clinical fact recall metric.** Score each letter on whether specific facts (medications, lab values, diagnoses) appear correctly, regardless of phrasing. This measures clinical accuracy directly, complementing BLEU/ROUGE-L.
+ 2. **Multi-reference scoring.** Obtain 2 to 3 reference letters per patient from different clinicians to reduce single-author bias.
+ 3. **Clinician preference study.** Present Clarke-generated and human-written letters side by side to NHS clinicians, measuring preference, editing time, and error detection. This is the most meaningful evaluation for deployment readiness.
+ 4. **Larger corpus.** Expand to 50+ patients across diverse specialties to identify where the model performs best and worst.
 
 ---
 
- ## 4. QLoRA Fine-Tuning: Before and After
+ ## 4. QLoRA Fine-Tuning
 
 ### 4.1 Motivation
 
- Sections 1 to 3 evaluated MedGemma 27B with its original instruction-tuned weights. The model had never seen an NHS clinic letter during its training. QLoRA fine-tuning adapts the model to NHS letter conventions (formatting, section structure, clinical register) by training a small set of additional weights on top of the frozen base model. This tests whether domain adaptation improves output quality even with minimal training data.
+ Sections 1 to 3 evaluated MedGemma 27B with its original instruction-tuned weights and an optimised prompt. QLoRA fine-tuning adapts the model to NHS letter conventions by training a small set of additional weights on top of the frozen base model. This tests whether domain adaptation improves output quality beyond what prompt engineering alone achieves.
 
 ### 4.2 Training Configuration
 
+ Two rounds of fine-tuning were conducted as the letter structure evolved during development.
+
+ **Round 1 (initial structure):**
+
 | Parameter | Value |
 |-----------|-------|
 | Method | QLoRA (4-bit quantised base + trainable low-rank adapters) |
 | Base model | `google/medgemma-27b-text-it` |
 | Adapter | LoRA rank 16, alpha 32, dropout 0.05 |
 | Target modules | q_proj, k_proj, v_proj, o_proj (attention layers) |
- | Training examples | 5 (one per patient, using gold-standard letters as targets) |
+ | Training examples | 5 (combined "Assessment and plan" section structure) |
 | Epochs | 20 |
 | Learning rate | 2e-5 with cosine schedule and 10% warmup |
- | Optimizer | Paged AdamW 8-bit |
- | Hardware | NVIDIA A100 40 GB (Google Colab) |
+ | Hardware | NVIDIA A100 40 GB (Google Colab), Unsloth framework |
 | Training time | ~15 minutes |
 | Adapter size | 173.4 MB |
- | Framework | Unsloth (memory-efficient LoRA training) |
 
- The adapter was uploaded to HuggingFace Hub at `yashvshetty/clarke-medgemma-27b-lora`.
+ **Round 2 (updated structure):**
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Method | QLoRA (4-bit quantised base + trainable low-rank adapters) |
+ | Base model | `google/medgemma-27b-text-it` |
+ | Adapter | LoRA rank 16, alpha 32, dropout 0.05 |
+ | Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (attention + MLP layers) |
+ | Training examples | 5 (separate Assessment, Plan, and Advice to Patient sections) |
+ | Epochs | 3 |
+ | Learning rate | 2e-4, 8-bit AdamW optimiser |
+ | Hardware | NVIDIA A100 80 GB (HuggingFace Spaces) |
+ | Training time | ~10 minutes |
+
+ The round 2 adapter was uploaded to HuggingFace Hub at `yashvshetty/clarke-medgemma-27b-lora`.
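
The round 2 hyperparameters map onto the standard transformers + peft + bitsandbytes stack roughly as follows. This is a configuration sketch, not the contents of `scripts/train_lora.py`:

```python
# Configuration sketch of the round 2 setup in the table above, assuming the
# standard transformers + peft + bitsandbytes stack; not Clarke's actual script.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)  # 4-bit quantised base
model = AutoModelForCausalLM.from_pretrained("google/medgemma-27b-text-it",
                                             quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                                  "gate_proj", "up_proj", "down_proj"])  # attention + MLP
model = get_peft_model(model, lora)

args = TrainingArguments(output_dir="lora-out", num_train_epochs=3,
                         learning_rate=2e-4, optim="paged_adamw_8bit",
                         per_device_train_batch_size=1, bf16=True)
# Pass `model`, `args`, and the 5-letter dataset to a Trainer to run training.
```
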
 ### 4.3 Results
 
- QLoRA fine-tuning improved BLEU-1 by 31% (0.54 to 0.71) across all five patients.
+ **Round 1** improved BLEU-1 by 31% over the initial base model scores (0.54 to 0.71), demonstrating that the model responds well to domain adaptation.
 
- | Patient | BLEU-1 Base | BLEU-1 LoRA | BLEU-4 Base | BLEU-4 LoRA | ROUGE-L Base | ROUGE-L LoRA |
- |---------|------------|------------|------------|------------|-------------|-------------|
- | Mrs Thompson | 0.56 | 0.69 | 0.27 | 0.23 | 0.47 | 0.46 |
- | Mr Okafor | 0.57 | 0.73 | 0.27 | 0.34 | 0.39 | 0.52 |
- | Ms Patel | 0.56 | 0.74 | 0.27 | 0.28 | 0.48 | 0.49 |
- | Mr Williams | 0.43 | 0.69 | 0.22 | 0.23 | 0.45 | 0.43 |
- | Mrs Khan | 0.58 | 0.72 | 0.22 | 0.23 | 0.41 | 0.45 |
- | **Average** | **0.54** | **0.71** | **0.25** | **0.26** | **0.44** | **0.47** |
- | **Delta** | | **+0.17** | | **+0.01** | | **+0.03** |
 
- Every patient improved on BLEU-1. Mr Williams showed the largest gain (+0.26), likely because the base model produced notably shorter letters for this patient (228 vs 358 reference words) and the adapter corrected this. ROUGE-L improved modestly (+0.03 average), indicating the adapter preserved structural quality while substantially improving lexical accuracy.
+ **Round 2** achieved a 38% training loss reduction (2.09 to 1.30), confirming the model learned the updated letter structure. However, downstream evaluation showed the adapter did not improve BLEU/ROUGE scores over the prompt-optimised base model:
+
+ | Patient | BLEU-1 Base | BLEU-1 LoRA | ROUGE-L Base | ROUGE-L LoRA |
+ |---------|------------|------------|-------------|-------------|
+ | Mrs Thompson | 0.80 | 0.83 | 0.70 | 0.62 |
+ | Mr Okafor | 0.80 | 0.75 | 0.72 | 0.57 |
+ | Ms Patel | 0.81 | 0.76 | 0.71 | 0.54 |
+ | Mr Williams | 0.88 | 0.83 | 0.81 | 0.62 |
+ | Mrs Khan | 0.82 | 0.77 | 0.75 | 0.64 |
+ | **Average** | **0.82** | **0.79** | **0.74** | **0.60** |
+
+ The base model with optimised prompting outperformed the fine-tuned adapter on average across all metrics.
 
 ### 4.4 Analysis
 
- The disproportionate BLEU-1 improvement relative to BLEU-4 and ROUGE-L is consistent with what 5 training examples can achieve. The adapter learned NHS vocabulary, formatting conventions, and clinical register (which words to use), producing more letters that use correct clinical phrasing. Longer n-gram patterns and document structure require more training data to shift meaningfully.
+ The divergent outcomes between rounds 1 and 2 are instructive. Round 1 showed clear improvement because the base model had no exposure to NHS letter formatting, so even 5 examples taught it useful conventions. By round 2, prompt engineering had already captured most of those conventions, leaving less room for the adapter to add value.
+
+ The round 2 adapter's lower scores likely reflect three factors:
+
+ 1. **Insufficient training data.** Five examples is far below the typical fine-tuning corpus of 50-500+ examples. The model memorises surface patterns from the 5 training letters rather than learning generalisable style.
+
+ 2. **Training-evaluation input mismatch.** Training used synthetic examples from `train.jsonl` with shorter transcripts, while evaluation used the full production pipeline with richer transcripts and live FHIR context parsing. The adapter learned patterns specific to the training format.
+
+ 3. **Catastrophic forgetting on small data.** The adapter partially overrides the base model's strong prompt-following ability with memorised patterns, a well-documented phenomenon when fine-tuning large language models on very small datasets.
 
- Training loss decreased from ~2.5 at epoch 1 to ~0.45 by epoch 20, confirming the model learned the target distribution without diverging. The 173.4 MB adapter represents less than 0.3% of the base model's parameters.
+ The training loss reduction (2.09 to 1.30) confirms the model learned the target distribution. The disconnect between training loss and downstream metrics is characteristic of overfitting to limited data.
 
- ### 4.5 Limitations
+ ### 4.5 Limitations and Future Work
 
- **Train-test overlap.** The same 5 patients were used for both training and evaluation. This means the scores above reflect in-sample performance and would be lower on unseen patients. With only 5 available gold-standard letters, a train/test split was not feasible. **Sequence length mismatch.** Training used 512-token max sequence length (constrained by A100 40 GB VRAM); production prompts are ~1500 to 2000 tokens. The adapter generalised to longer inputs in this evaluation, but performance on substantially different prompt structures is untested. **Minimal data.** Five examples is far below the typical fine-tuning corpus. These results demonstrate the methodology, not the ceiling.
+ **Train-test overlap.** The same 5 patients were used for both training and evaluation in round 1. Round 2 evaluation used the production pipeline, providing a more realistic assessment. **Minimal data.** Five examples demonstrates the methodology, not the ceiling. With 50-200 real NHS clinic letters, fine-tuning would likely outperform prompt engineering alone. **Single evaluation pass.** Each round was evaluated once; variance across runs was not measured.
+
+ The key takeaway is that prompt engineering is the higher-leverage intervention at small data scales, while fine-tuning becomes increasingly valuable as training data grows. The published adapter and training pipeline at `yashvshetty/clarke-medgemma-27b-lora` provide the infrastructure for scaling when more clinical data becomes available.
 
 ---
 
 ## Conclusion
 
- Clarke's three-agent pipeline produces clinically useful output at every stage. MedASR transcribes consultations at 13.28% WER, preserving drug names, dosages, and clinical values. The EHR agent achieves 100% fact recall and 98.6% precision, missing no allergies, medications, or lab results. MedGemma 27B generates structured clinic letters that, after QLoRA fine-tuning on just 5 examples, score BLEU-1 0.71 and ROUGE-L F1 0.47, a 31% improvement in lexical accuracy over the base model.
+ Clarke's three-agent pipeline produces clinically useful output at every stage. MedASR transcribes consultations at 13.28% WER, preserving drug names, dosages, and clinical values. The EHR agent achieves 100% fact recall and 98.6% precision, missing no allergies, medications, or lab results. MedGemma 27B generates structured clinic letters scoring BLEU-1 0.82 and ROUGE-L 0.74, indicating strong alignment with gold-standard NHS clinical correspondence.
 
 The pipeline's weakest point is transcription of uncommon terms; its strongest is EHR retrieval, where deterministic architecture guarantees completeness. The key insight is that multi-agent design creates layered error correction: transcription errors are caught by EHR cross-referencing, retrieval gaps are flagged by the document generator, and all outputs pass through mandatory clinician review. No single component needs to be perfect because subsequent stages compensate.
 
- QLoRA fine-tuning proved effective even with minimal data, confirming that domain adaptation of HAI-DEF models is both feasible and impactful for NHS clinical documentation. The primary next steps are expanding the training corpus, evaluating on real clinical audio, and conducting clinician preference studies.
+ Two rounds of QLoRA fine-tuning revealed that prompt engineering is the higher-leverage intervention at small data scales (n=5), achieving a 52% BLEU-1 improvement through systematic reference alignment and prompt optimisation. Fine-tuning demonstrated clear potential in round 1 (+31% BLEU-1) and confirmed the model's capacity for domain adaptation, but round 2 showed that additional gains require a larger training corpus. The published adapter and training pipeline provide the infrastructure for scaling when more clinical data becomes available. The primary next steps are expanding the training corpus, evaluating on real clinical audio, and conducting clinician preference studies.