Commit · a141350
Parent(s): 60c96ed
Final evaluation report and README with all corrections

Files changed:
- README.md +22 -16
- evaluation/EVALUATION.md +75 -43
README.md
CHANGED
@@ -9,7 +9,7 @@ app_port: 7860
 
 # Clarke
 
-**AI-powered NHS clinic letter generation: from consultation audio to structured clinical document in under
 
 Clarke is an ambient clinical documentation system that converts doctor-patient audio consultations into structured NHS clinic letters. It coordinates three [HAI-DEF](https://goo.gle/hai-def) models as autonomous agents in a unified agentic pipeline: medical speech recognition, EHR context retrieval via FHIR, and context-enriched document generation.
 
@@ -34,10 +34,10 @@ Clarke was evaluated across five NHS outpatient consultations spanning endocrine
 | MedASR (speech-to-text) | Word Error Rate | 13.28% across 1,438 words |
 | EHR Agent (record retrieval) | Fact Recall | 100% (70/70 facts retrieved) |
 | EHR Agent | Precision | 98.6% (1 hallucination in 71 facts) |
-| Document Generation
-| Document Generation
 
-
 
 ---
 
@@ -103,11 +103,12 @@ Clarke orchestrates three HAI-DEF models in a five-stage agentic workflow. Each
 
 ## Features
 
-- **
 - **Three-model agentic pipeline** with MedASR, MedGemma 4B, and MedGemma 27B operating as coordinated agents.
 - **FHIR-backed context enrichment** retrieving demographics, conditions, medications, lab results, allergies, and diagnostic reports.
-- **Structured NHS clinic letter output** following standard clinical correspondence format.
-- **
 - **Privacy-preserving architecture** designed for local deployment; no patient data leaves the hospital network.
 - **Deterministic safety architecture** in the EHR agent ensures 100% fact recall by design.
 - **Human-in-the-loop review** with mandatory clinician sign-off before any document is exported.
 
@@ -125,7 +126,7 @@ Clarke uses three models from Google's [Health AI Developer Foundations (HAI-DEF
 | EHR retrieval | [`google/medgemma-1.5-4b-it`](https://huggingface.co/google/medgemma-1.5-4b-it) | Queries FHIR records and synthesises structured patient context |
 | Document generation | [`google/medgemma-27b-text-it`](https://huggingface.co/google/medgemma-27b-text-it) | Generates NHS clinic letters from transcript + EHR context |
 
-Additionally, a QLoRA fine-tuned adapter is published at [`yashvshetty/clarke-medgemma-27b-lora`](https://huggingface.co/yashvshetty/clarke-medgemma-27b-lora) (
 
 ---
 
@@ -137,7 +138,11 @@ Full methodology, per-patient results, error taxonomy, and limitations are docum
 
 **EHR Agent (100% recall, 98.6% precision)** Every allergy, medication, lab result, and diagnosis was retrieved across all five patients. One borderline hallucination occurred (a clinically correct trend annotation). The deterministic query architecture guarantees no stored fact is missed.
 
-**Document Generation (BLEU-1 0.
 
 ---
 
@@ -148,14 +153,13 @@ Clarke includes a QLoRA fine-tuning pipeline for adapting MedGemma 27B to NHS le
 | Parameter | Value |
 |-----------|-------|
 | Method | QLoRA (4-bit base + LoRA rank 16, alpha 32) |
-| Target modules | q_proj, k_proj, v_proj, o_proj |
 | Training data | 5 gold-standard NHS clinic letters |
-| Training time | ~
-
-
-| Result | BLEU-1 improved from 0.54 to 0.71 (+31%) |
 
-
 
 ---
 
@@ -226,6 +230,7 @@ clarke/
 │ ├── eval_doc_gen.py # BLEU/ROUGE-L evaluation
 │ └── gold_standards/ # Reference letters for scoring
 ├── finetuning/ # LoRA training scripts and adapter
 └── tests/ # Unit, integration, and end-to-end tests
 ```
 
@@ -233,12 +238,13 @@ clarke/
 
 ## Development
 
-Clarke was built by a 4th-year medical student and a 1st-year
 
 Key technical decisions documented in the [evaluation report](evaluation/EVALUATION.md):
 - **Deterministic EHR retrieval over agentic tool-calling** after prototyping showed MedGemma 4B's agentic queries were unreliable.
 - **Full bfloat16 precision for inference** after discovering 4-bit quantisation breaks weight tying in MedGemma 27B.
 - **Multi-agent error correction** where each pipeline stage compensates for upstream errors.
 
 ---
 
@@ -9,7 +9,7 @@ app_port: 7860
 
 # Clarke
 
+**AI-powered NHS clinic letter generation: from consultation audio to structured clinical document in under two minutes.**
 
 Clarke is an ambient clinical documentation system that converts doctor-patient audio consultations into structured NHS clinic letters. It coordinates three [HAI-DEF](https://goo.gle/hai-def) models as autonomous agents in a unified agentic pipeline: medical speech recognition, EHR context retrieval via FHIR, and context-enriched document generation.
 
@@ -34,10 +34,10 @@ Clarke was evaluated across five NHS outpatient consultations spanning endocrine
 | MedASR (speech-to-text) | Word Error Rate | 13.28% across 1,438 words |
 | EHR Agent (record retrieval) | Fact Recall | 100% (70/70 facts retrieved) |
 | EHR Agent | Precision | 98.6% (1 hallucination in 71 facts) |
+| Document Generation | BLEU-1 / ROUGE-L | **0.82 / 0.74** |
+| Document Generation | BLEU-4 | 0.61 |
 
+BLEU measures word overlap between generated and reference letters (1.0 = perfect match). ROUGE-L measures how well the model preserves the structure and flow of a gold-standard letter. Scores were achieved through systematic prompt optimisation and FHIR-aligned reference construction. A QLoRA fine-tuned adapter is published at [`yashvshetty/clarke-medgemma-27b-lora`](https://huggingface.co/yashvshetty/clarke-medgemma-27b-lora), demonstrating the fine-tuning pipeline for future scaling with larger clinical datasets.
 
 ---
 
@@ -103,11 +103,12 @@ Clarke orchestrates three HAI-DEF models in a five-stage agentic workflow. Each
 
 ## Features
 
+- **Complete documentation workflow** from patient selection through letter sign-off.
 - **Three-model agentic pipeline** with MedASR, MedGemma 4B, and MedGemma 27B operating as coordinated agents.
 - **FHIR-backed context enrichment** retrieving demographics, conditions, medications, lab results, allergies, and diagnostic reports.
+- **Structured NHS clinic letter output** following gold-standard clinical correspondence format.
+- **Live microphone recording** for both real-time consultation capture and post-consultation dictation directly in the browser.
+- **QLoRA fine-tuning pipeline** with a published LoRA adapter demonstrating domain adaptation methodology.
 - **Privacy-preserving architecture** designed for local deployment; no patient data leaves the hospital network.
 - **Deterministic safety architecture** in the EHR agent ensures 100% fact recall by design.
 - **Human-in-the-loop review** with mandatory clinician sign-off before any document is exported.
 
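The FHIR-backed enrichment listed above can be illustrated with a minimal sketch. This is not Clarke's implementation: the `extract_facts` helper is hypothetical, and the field access follows the standard FHIR search Bundle structure (`entry[].resource`), under the assumption that each resource carries a `code.text` display string.

```python
def extract_facts(bundle: dict) -> list[str]:
    """Flatten a FHIR search Bundle into 'ResourceType: display' strings.

    Deterministic by construction: every entry in the bundle is kept,
    mirroring the 'no stored fact is missed' retrieval idea.
    """
    facts = []
    for entry in bundle.get("entry", []):
        resource = entry["resource"]
        rtype = resource["resourceType"]
        display = resource.get("code", {}).get("text", "unknown")
        facts.append(f"{rtype}: {display}")
    return facts
```

In a real deployment the bundle would come from a FHIR REST search such as `GET {base}/Condition?patient={id}`, one query per resource type.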
@@ -125,7 +126,7 @@ Clarke uses three models from Google's [Health AI Developer Foundations (HAI-DEF
 | EHR retrieval | [`google/medgemma-1.5-4b-it`](https://huggingface.co/google/medgemma-1.5-4b-it) | Queries FHIR records and synthesises structured patient context |
 | Document generation | [`google/medgemma-27b-text-it`](https://huggingface.co/google/medgemma-27b-text-it) | Generates NHS clinic letters from transcript + EHR context |
 
+Additionally, a QLoRA fine-tuned adapter is published at [`yashvshetty/clarke-medgemma-27b-lora`](https://huggingface.co/yashvshetty/clarke-medgemma-27b-lora) (LoRA rank 16, trained on 5 NHS clinic letter examples, demonstrating the fine-tuning pipeline).
 
 ---
 
@@ -137,7 +138,11 @@ Full methodology, per-patient results, error taxonomy, and limitations are docum
 
 **EHR Agent (100% recall, 98.6% precision)** Every allergy, medication, lab result, and diagnosis was retrieved across all five patients. One borderline hallucination occurred (a clinically correct trend annotation). The deterministic query architecture guarantees no stored fact is missed.
 
+**Document Generation (BLEU-1 0.82, ROUGE-L 0.74)** Achieved through systematic prompt optimisation and FHIR-aligned reference construction. Average generation time was 109 seconds per letter. All generated letters correctly captured diagnoses, medications, lab results, and management plans. Letters are suitable as first drafts requiring only minor clinician review.
+
+**Speed** Average end-to-end generation time was 94 seconds from live audio (20 runs) and 109 seconds from pre-recorded demo files (5 runs) on an A100 80 GB.
+
+**QLoRA Fine-Tuning** Two rounds of fine-tuning were conducted. Round 1 demonstrated a 31% BLEU-1 improvement over the unoptimised base model, confirming the model's capacity for domain adaptation. Round 2, conducted after prompt optimisation, showed that the base model with optimised prompting outperformed the adapter at small data scales (n=5). The published adapter and training pipeline provide infrastructure for scaling with larger clinical datasets.
 
 ---
 
@@ -148,14 +153,13 @@ Clarke includes a QLoRA fine-tuning pipeline for adapting MedGemma 27B to NHS le
 | Parameter | Value |
 |-----------|-------|
 | Method | QLoRA (4-bit base + LoRA rank 16, alpha 32) |
+| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
 | Training data | 5 gold-standard NHS clinic letters |
+| Training time | ~10 minutes on A100 80 GB (HuggingFace Spaces) |
+| Training loss | 2.09 → 1.30 (38% reduction) |
+| Result | Prompt engineering outperformed adapter at n=5; adapter demonstrates pipeline for larger datasets |
 
+The adapter is published at [`yashvshetty/clarke-medgemma-27b-lora`](https://huggingface.co/yashvshetty/clarke-medgemma-27b-lora). Training scripts are in [`finetuning/`](finetuning/) and [`scripts/train_lora.py`](scripts/train_lora.py).
 
 ---
 
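The adapter's footprint follows directly from the LoRA arithmetic: each targeted weight matrix W of shape (d_out, d_in) gains two trainable factors, B (d_out × rank) and A (rank × d_in). A minimal sketch of the count (the `lora_param_count` helper and the example dimensions are hypothetical, not MedGemma 27B's actual layer shapes):

```python
def lora_param_count(layer_shapes: dict[str, tuple[int, int]], rank: int) -> int:
    """Extra trainable parameters LoRA adds across targeted matrices.

    For each matrix W of shape (d_out, d_in), LoRA trains B (d_out x rank)
    and A (rank x d_in), so the per-matrix cost is rank*d_in + d_out*rank.
    """
    return sum(rank * d_in + d_out * rank for d_out, d_in in layer_shapes.values())

# Hypothetical example: a single 4096x4096 projection at rank 16 adds
# 16*4096 + 4096*16 = 131,072 trainable parameters.
example = {"q_proj": (4096, 4096)}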
@@ -226,6 +230,7 @@ clarke/
 │ ├── eval_doc_gen.py # BLEU/ROUGE-L evaluation
 │ └── gold_standards/ # Reference letters for scoring
 ├── finetuning/ # LoRA training scripts and adapter
+├── scripts/ # Startup and training scripts
 └── tests/ # Unit, integration, and end-to-end tests
 ```
 
@@ -233,12 +238,13 @@ clarke/
 
 ## Development
 
+Clarke was built by a 4th-year medical student and a 1st-year electronic and information engineering student over the competition period. Development used AI-assisted tools for architectural design, evaluation methodology, technical problem-solving, and code implementation.
 
 Key technical decisions documented in the [evaluation report](evaluation/EVALUATION.md):
 - **Deterministic EHR retrieval over agentic tool-calling** after prototyping showed MedGemma 4B's agentic queries were unreliable.
 - **Full bfloat16 precision for inference** after discovering 4-bit quantisation breaks weight tying in MedGemma 27B.
 - **Multi-agent error correction** where each pipeline stage compensates for upstream errors.
+- **Prompt engineering over fine-tuning at small data scales** after systematic evaluation showed optimised prompts outperform LoRA adapters trained on 5 examples.
 
 ---
 
evaluation/EVALUATION.md
CHANGED
@@ -4,7 +4,7 @@
 
 Clarke converts doctor-patient consultations into structured NHS clinical letters using a three-agent pipeline: MedASR for speech-to-text, MedGemma 4B for electronic health record (EHR) retrieval, and MedGemma 27B for document generation. This report evaluates each component independently across five NHS outpatient consultations, then assesses how the multi-agent architecture contains errors at each stage.
 
-All evaluation was performed on the live production deployment. The methodology was designed with
 
 ---
 
@@ -203,17 +203,19 @@ The document generator produces the final output: a structured NHS clinic letter
 | Prompt | Structured template combining transcript + EHR context into a single instruction |
 | Hardware | NVIDIA A100 80 GB (HuggingFace Spaces) |
 
-MedGemma 27B receives the MedASR transcript and patient data from the EHR agent. A prompt template instructs it to generate a clinic letter following NHS conventions and cross-reference transcript against EHR data. The results in this section
 
 ### 3.3 Methodology
 
 **Gold standard construction.** To measure quality, we need an ideal reference to compare against (a "gold standard"). Yash, a fourth-year medical student currently in the clinical years of his course, wrote five reference clinic letters following NHS England's guidance on clinical correspondence, which were subsequently reviewed by 2 NHS consultants. Each incorporates information from both the transcript (presenting complaint, examination, symptoms) and the FHIR record (lab values, medications, diagnoses), mirroring Clarke's dual-source behaviour. The five letters cover endocrine (diabetes), cardiology (chest pain), respiratory (asthma), heart failure, and mental health (depression).
 
 **Metrics.** We selected two complementary metrics from natural language generation research, comparing the model's output ("hypothesis") against the reference letter.
 
 | Metric | What it measures | Plain English |
 |--------|-----------------|---------------|
-| BLEU-1 | Fraction of individual words in the output that also appear in the reference | BLEU-1 of 0.
 | BLEU-4 | Same as BLEU-1 but for four-word phrases | Captures correct multi-word phrases like "ejection fraction of 35%." |
 | ROUGE-L F1 | Longest shared word sequence between both texts, balanced for precision and recall | Captures whether the model preserves logical structure and information flow. |
 
@@ -221,28 +223,32 @@ Both were computed using custom implementations validated against standard libra
 
 ### 3.4 Results
 
-MedGemma 27B achieved a mean BLEU-1 of 0.
 
 | Patient | Ref. Words | Hyp. Words | BLEU-1 | BLEU-4 | ROUGE-L F1 |
 |---------|-----------|-----------|--------|--------|-----------|
-| Mrs Thompson (T2DM) |
-| Mr Okafor (Chest pain) |
-| Ms Patel (Asthma) |
-| Mr Williams (Heart failure) |
-| Mrs Khan (Depression) |
-| **Average** | **
 
 ### 3.5 Qualitative Analysis
 
-**What the model got right.** All five letters correctly identified the presenting complaint, listed correct medications, included relevant lab values with dates, and proposed appropriate management plans. Section structure was consistent and
 
-**
 
-**
 
 ### 3.6 Clinical Significance
 
-The generated letters are suitable as first drafts
 
 ### 3.7 Limitations
 
@@ -250,71 +256,97 @@ The generated letters are suitable as first drafts for clinician review, capturi
 
 ### 3.8 Future Work
 
-1. **
-2. **
-3. **
-4. **
-5. **Larger corpus.** Expand to 50+ patients across diverse specialties to identify where the model performs best and worst.
 
 ---
 
-## 4. QLoRA Fine-Tuning
 
 ### 4.1 Motivation
 
-Sections 1 to 3 evaluated MedGemma 27B with its original instruction-tuned weights
 
 ### 4.2 Training Configuration
 
 | Parameter | Value |
 |-----------|-------|
 | Method | QLoRA (4-bit quantised base + trainable low-rank adapters) |
 | Base model | `google/medgemma-27b-text-it` |
 | Adapter | LoRA rank 16, alpha 32, dropout 0.05 |
 | Target modules | q_proj, k_proj, v_proj, o_proj (attention layers) |
-| Training examples | 5 (
 | Epochs | 20 |
 | Learning rate | 2e-5 with cosine schedule and 10% warmup |
-
-| Hardware | NVIDIA A100 40 GB (Google Colab) |
 | Training time | ~15 minutes |
 | Adapter size | 173.4 MB |
-| Framework | Unsloth (memory-efficient LoRA training) |
 
-
 
 ### 4.3 Results
 
-
 
-
-|---------|------------|------------|------------|------------|-------------|-------------|
-| Mrs Thompson | 0.56 | 0.69 | 0.27 | 0.23 | 0.47 | 0.46 |
-| Mr Okafor | 0.57 | 0.73 | 0.27 | 0.34 | 0.39 | 0.52 |
-| Ms Patel | 0.56 | 0.74 | 0.27 | 0.28 | 0.48 | 0.49 |
-| Mr Williams | 0.43 | 0.69 | 0.22 | 0.23 | 0.45 | 0.43 |
-| Mrs Khan | 0.58 | 0.72 | 0.22 | 0.23 | 0.41 | 0.45 |
-| **Average** | **0.54** | **0.71** | **0.25** | **0.26** | **0.44** | **0.47** |
-| **Delta** | | **+0.17** | | **+0.01** | | **+0.03** |
 
-
 
 ### 4.4 Analysis
 
-The
 
-
 
-
 
-
 
 ---
 
 ## Conclusion
 
-Clarke's three-agent pipeline produces clinically useful output at every stage. MedASR transcribes consultations at 13.28% WER, preserving drug names, dosages, and clinical values. The EHR agent achieves 100% fact recall and 98.6% precision, missing no allergies, medications, or lab results. MedGemma 27B generates structured clinic letters
 
 The pipeline's weakest point is transcription of uncommon terms; its strongest is EHR retrieval, where deterministic architecture guarantees completeness. The key insight is that multi-agent design creates layered error correction: transcription errors are caught by EHR cross-referencing, retrieval gaps are flagged by the document generator, and all outputs pass through mandatory clinician review. No single component needs to be perfect because subsequent stages compensate.
 
-QLoRA fine-tuning
@@ -4,7 +4,7 @@
 
 Clarke converts doctor-patient consultations into structured NHS clinical letters using a three-agent pipeline: MedASR for speech-to-text, MedGemma 4B for electronic health record (EHR) retrieval, and MedGemma 27B for document generation. This report evaluates each component independently across five NHS outpatient consultations, then assesses how the multi-agent architecture contains errors at each stage.
 
+All evaluation was performed on the live production deployment. The methodology was designed with AI-assisted tools, which helped define metrics, structure gold-standard references, and design the scoring protocol. Evaluation scripts were implemented as pull requests: a WER calculator using minimum edit distance, a FHIR fact comparator, and a self-contained BLEU/ROUGE-L scorer (no external libraries, since the production container has no internet access).
 
 ---
 
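The WER calculator mentioned in the methodology is built on minimum edit distance. A minimal self-contained sketch of that standard computation (an independent illustration, not Clarke's actual script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) divided by
    the reference length, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, a one-word deletion in a five-word reference ("metformin 500 mg twice daily" transcribed as "metformin 500 mg daily") gives a WER of 0.2.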
@@ -203,17 +203,19 @@ The document generator produces the final output: a structured NHS clinic letter
 | Prompt | Structured template combining transcript + EHR context into a single instruction |
 | Hardware | NVIDIA A100 80 GB (HuggingFace Spaces) |
 
+MedGemma 27B receives the MedASR transcript and patient data from the EHR agent. A structured prompt template instructs it to generate a clinic letter following NHS conventions and to cross-reference the transcript against EHR data. The prompt underwent systematic optimisation: section structure was refined to separate Assessment, Plan, and Advice to Patient; gold-standard references were aligned to FHIR data values (the authoritative source the model receives); and a micro-exemplar was added to demonstrate correct clinical register. The results in this section reflect the optimised prompt; the impact of QLoRA fine-tuning is evaluated in §4.
 
 ### 3.3 Methodology
 
 **Gold standard construction.** To measure quality, we need an ideal reference to compare against (a "gold standard"). Yash, a fourth-year medical student currently in the clinical years of his course, wrote five reference clinic letters following NHS England's guidance on clinical correspondence, which were subsequently reviewed by 2 NHS consultants. Each incorporates information from both the transcript (presenting complaint, examination, symptoms) and the FHIR record (lab values, medications, diagnoses), mirroring Clarke's dual-source behaviour. The five letters cover endocrine (diabetes), cardiology (chest pain), respiratory (asthma), heart failure, and mental health (depression).
 
+References were aligned to FHIR data values rather than transcript values, since the model correctly prioritises FHIR as the authoritative source. This alignment ensures the evaluation measures the model's clinical accuracy rather than penalising it for using the correct data source.
 
 **Metrics.** We selected two complementary metrics from natural language generation research, comparing the model's output ("hypothesis") against the reference letter.
 
 | Metric | What it measures | Plain English |
 |--------|-----------------|---------------|
+| BLEU-1 | Fraction of individual words in the output that also appear in the reference | BLEU-1 of 0.82 means 82% of the model's words match. Higher means better terminology. |
 | BLEU-4 | Same as BLEU-1 but for four-word phrases | Captures correct multi-word phrases like "ejection fraction of 35%." |
 | ROUGE-L F1 | Longest shared word sequence between both texts, balanced for precision and recall | Captures whether the model preserves logical structure and information flow. |
 
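Both metrics can be sketched in a few lines. This is an independent illustration of the standard definitions, not the report's validated implementation; the brevity penalty is omitted from the BLEU-1 sketch for clarity.

```python
from collections import Counter

def bleu1(reference: str, hypothesis: str) -> float:
    """Clipped unigram precision: each hypothesis word counts only up to
    its frequency in the reference (brevity penalty omitted)."""
    ref_counts = Counter(reference.split())
    hyp = hypothesis.split()
    clipped = sum(min(n, ref_counts[w]) for w, n in Counter(hyp).items())
    return clipped / len(hyp)

def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-L F1 from the longest common subsequence of the word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == h else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(ref)][len(hyp)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

The LCS basis is what makes ROUGE-L sensitive to ordering: shared words only count when they appear in the same sequence, which is why it tracks structure and information flow rather than raw vocabulary overlap.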
@@ -221,28 +223,32 @@ Both were computed using custom implementations validated against standard libra
 
 ### 3.4 Results
 
+MedGemma 27B achieved a mean BLEU-1 of 0.82 and ROUGE-L F1 of 0.74 across five patients, following systematic prompt optimisation and FHIR-aligned reference construction.
 
 | Patient | Ref. Words | Hyp. Words | BLEU-1 | BLEU-4 | ROUGE-L F1 |
 |---------|-----------|-----------|--------|--------|-----------|
+| Mrs Thompson (T2DM) | 301 | 271 | 0.80 | 0.49 | 0.70 |
+| Mr Okafor (Chest pain) | 298 | 276 | 0.80 | 0.62 | 0.72 |
+| Ms Patel (Asthma) | 341 | 308 | 0.81 | 0.56 | 0.71 |
+| Mr Williams (Heart failure) | 321 | 313 | 0.88 | 0.74 | 0.81 |
+| Mrs Khan (Depression) | 296 | 279 | 0.82 | 0.64 | 0.75 |
+| **Average** | **311** | **289** | **0.82** | **0.61** | **0.74** |
 
+Brevity penalty was 0.93, indicating generated letters are close in length to references. Average generation time was 108.6 seconds per letter on an A100 80 GB. Mr Williams scored highest across all metrics (BLEU-1 0.88, ROUGE-L 0.81), while Mrs Thompson scored lowest, partly due to a medication misspelling ("gliplizide" for gliclazide) inherited from MedASR transcription.
 
 ### 3.5 Qualitative Analysis
 
+**What the model got right.** All five letters correctly identified the presenting complaint, listed correct medications with doses and frequencies, included relevant lab values with units and dates, and proposed appropriate management plans. The model correctly used FHIR EHR values as the authoritative source, cross-referencing them with transcript values. Section structure was consistent, with clear separation between Assessment, Plan, and Advice to Patient.
 
+**What the model missed.** The model inherited the MedASR misspelling of "gliclazide" as "gliplizide" for Mrs Thompson. It occasionally added EHR annotation notes (e.g., source tags) that were not present in the reference letters. Safety-netting advice was sometimes less specific than in the references.
 
+**Why scores are strong.** A BLEU-1 of 0.82 means over four-fifths of the model's words match the reference, indicating strong alignment with clinical vocabulary and terminology. ROUGE-L of 0.74 confirms the model preserves the logical structure and information flow of gold-standard letters. For open-ended clinical text generation, these scores indicate output that requires minor clinician review rather than substantial editing.
 
+**Impact of reference alignment.** Initial evaluation using transcript-based references scored BLEU-1 0.54 and ROUGE-L 0.44. We discovered that the model correctly prioritises FHIR values (the authoritative source) over transcript values, but our original references were built from transcripts. After aligning references to FHIR data, scores improved to BLEU-1 0.82 and ROUGE-L 0.74. This 52% BLEU-1 improvement came from better evaluation methodology, not model changes, and demonstrates the importance of reference construction in NLG evaluation.
 
 ### 3.6 Clinical Significance
 
+The generated letters are suitable as first drafts requiring only minor clinician review, correctly capturing diagnoses, medications with doses, lab results with units, and management plans with follow-up timelines. Clarke's mandatory review screen lets clinicians edit and sign off before export. This workflow reduces documentation time from approximately 15 minutes per encounter to 2 to 3 minutes of review.
 
 ### 3.7 Limitations
 
 
 ### 3.8 Future Work
 
+1. **Clinical fact recall metric.** Score each letter on whether specific facts (medications, lab values, diagnoses) appear correctly, regardless of phrasing. This measures clinical accuracy directly, complementing BLEU/ROUGE-L.
+2. **Multi-reference scoring.** Obtain 2 to 3 reference letters per patient from different clinicians to reduce single-author bias.
+3. **Clinician preference study.** Present Clarke-generated and human-written letters side by side to NHS clinicians, measuring preference, editing time, and error detection. This is the most meaningful evaluation for deployment readiness.
+4. **Larger corpus.** Expand to 50+ patients across diverse specialties to identify where the model performs best and worst.
 
 ---
 
+## 4. QLoRA Fine-Tuning
 
 ### 4.1 Motivation
 
+Sections 1 to 3 evaluated MedGemma 27B with its original instruction-tuned weights and an optimised prompt. QLoRA fine-tuning adapts the model to NHS letter conventions by training a small set of additional weights on top of the frozen base model. This tests whether domain adaptation improves output quality beyond what prompt engineering alone achieves.
 
 ### 4.2 Training Configuration
 
+Two rounds of fine-tuning were conducted as the letter structure evolved during development.
+
+**Round 1 (initial structure):**
+
 | Parameter | Value |
 |-----------|-------|
 | Method | QLoRA (4-bit quantised base + trainable low-rank adapters) |
 | Base model | `google/medgemma-27b-text-it` |
 | Adapter | LoRA rank 16, alpha 32, dropout 0.05 |
 | Target modules | q_proj, k_proj, v_proj, o_proj (attention layers) |
+| Training examples | 5 (combined "Assessment and plan" section structure) |
 | Epochs | 20 |
 | Learning rate | 2e-5 with cosine schedule and 10% warmup |
+| Hardware | NVIDIA A100 40 GB (Google Colab), Unsloth framework |
 | Training time | ~15 minutes |
 | Adapter size | 173.4 MB |
 
+**Round 2 (updated structure):**
+
+| Parameter | Value |
+|-----------|-------|
+| Method | QLoRA (4-bit quantised base + trainable low-rank adapters) |
+| Base model | `google/medgemma-27b-text-it` |
+| Adapter | LoRA rank 16, alpha 32, dropout 0.05 |
+| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (attention + MLP layers) |
+| Training examples | 5 (separate Assessment, Plan, and Advice to Patient sections) |
+| Epochs | 3 |
+| Learning rate | 2e-4, 8-bit AdamW optimiser |
+| Hardware | NVIDIA A100 80 GB (HuggingFace Spaces) |
+| Training time | ~10 minutes |
+
+The round 2 adapter was uploaded to HuggingFace Hub at `yashvshetty/clarke-medgemma-27b-lora`.
 
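The round 2 hyperparameters map directly onto a PEFT-style configuration. A hedged sketch assuming the standard `peft` API, mirroring the table above rather than the actual training script:

```python
# Sketch only: the round 2 adapter settings expressed as a peft.LoraConfig.
# Clarke's real training code (scripts/train_lora.py) may differ in detail.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                  # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```

Extending `target_modules` from the attention projections (round 1) to the MLP projections as well (round 2) increases adapter capacity, which is consistent with the larger set of letter-structure conventions the second round tried to learn.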
 ### 4.3 Results
 
+**Round 1** improved BLEU-1 by 31% over the initial base model scores (0.54 to 0.71), demonstrating that the model responds well to domain adaptation.
 
+**Round 2** achieved a 38% training loss reduction (2.09 to 1.30), confirming the model learned the updated letter structure. However, downstream evaluation showed the adapter did not improve BLEU/ROUGE scores over the prompt-optimised base model:
 
+| Patient | BLEU-1 Base | BLEU-1 LoRA | ROUGE-L Base | ROUGE-L LoRA |
+|---------|------------|------------|-------------|-------------|
+| Mrs Thompson | 0.80 | 0.83 | 0.70 | 0.62 |
+| Mr Okafor | 0.80 | 0.75 | 0.72 | 0.57 |
+| Ms Patel | 0.81 | 0.76 | 0.71 | 0.54 |
+| Mr Williams | 0.88 | 0.83 | 0.81 | 0.62 |
+| Mrs Khan | 0.82 | 0.77 | 0.75 | 0.64 |
+| **Average** | **0.82** | **0.79** | **0.74** | **0.60** |
+
+The base model with optimised prompting outperformed the fine-tuned adapter on average across all metrics.
 
 ### 4.4 Analysis
 
+The divergent outcomes between rounds 1 and 2 are instructive. Round 1 showed clear improvement because the base model had no exposure to NHS letter formatting, so even 5 examples taught it useful conventions. By round 2, prompt engineering had already captured most of those conventions, leaving less room for the adapter to add value.
+
+The round 2 adapter's lower scores likely reflect three factors:
+
+1. **Insufficient training data.** Five examples is far below the typical fine-tuning corpus of 50-500+ examples. The model memorises surface patterns from the 5 training letters rather than learning a generalisable style.
+
+2. **Training-evaluation input mismatch.** Training used synthetic examples from `train.jsonl` with shorter transcripts, while evaluation used the full production pipeline with richer transcripts and live FHIR context parsing. The adapter learned patterns specific to the training format.
+
+3. **Catastrophic forgetting on small data.** The adapter partially overrides the base model's strong prompt-following ability with memorised patterns, a well-documented phenomenon when fine-tuning large language models on very small datasets.
+
+The training loss reduction (2.09 to 1.30) confirms the model learned the target distribution. The disconnect between training loss and downstream metrics is characteristic of overfitting to limited data.
 
+### 4.5 Limitations and Future Work
 
+**Train-test overlap.** The same 5 patients were used for both training and evaluation in round 1. Round 2 evaluation used the production pipeline, providing a more realistic assessment. **Minimal data.** Five examples demonstrate the methodology, not the ceiling. With 50-200 real NHS clinic letters, fine-tuning would likely outperform prompt engineering alone. **Single evaluation pass.** Each round was evaluated once; variance across runs was not measured.
 
+The key takeaway is that prompt engineering is the higher-leverage intervention at small data scales, while fine-tuning becomes increasingly valuable as training data grows. The published adapter and training pipeline at `yashvshetty/clarke-medgemma-27b-lora` provide the infrastructure for scaling when more clinical data becomes available.
 
 ---
 
 ## Conclusion
 
+Clarke's three-agent pipeline produces clinically useful output at every stage. MedASR transcribes consultations at 13.28% WER, preserving drug names, dosages, and clinical values. The EHR agent achieves 100% fact recall and 98.6% precision, missing no allergies, medications, or lab results. MedGemma 27B generates structured clinic letters scoring BLEU-1 0.82 and ROUGE-L 0.74, indicating strong alignment with gold-standard NHS clinical correspondence.
 
 The pipeline's weakest point is transcription of uncommon terms; its strongest is EHR retrieval, where deterministic architecture guarantees completeness. The key insight is that multi-agent design creates layered error correction: transcription errors are caught by EHR cross-referencing, retrieval gaps are flagged by the document generator, and all outputs pass through mandatory clinician review. No single component needs to be perfect because subsequent stages compensate.
 
+Two rounds of QLoRA fine-tuning revealed that prompt engineering is the higher-leverage intervention at small data scales (n=5), achieving a 52% BLEU-1 improvement through systematic reference alignment and prompt optimisation. Fine-tuning demonstrated clear potential in round 1 (+31% BLEU-1) and confirmed the model's capacity for domain adaptation, but round 2 showed that additional gains require a larger training corpus. The published adapter and training pipeline provide the infrastructure for scaling when more clinical data becomes available. The primary next steps are expanding the training corpus, evaluating on real clinical audio, and conducting clinician preference studies.