yakilee Claude Opus 4.6 committed on
Commit 1abff4e · 0 Parent(s)

chore: initialize project skeleton with pyproject.toml


- Add pyproject.toml with core deps (pydantic, httpx, streamlit, pytest)
- Empty package structure: trialpath/ (models, services, agent) and app/ (pages, components, services)
- Configure ruff and pytest

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CLAUDE.md ADDED
@@ -0,0 +1,82 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ TrialPath is an AI-powered clinical trial matching system for NSCLC (Non-Small Cell Lung Cancer) patients. It is currently in **pre-implementation design phase**: only design documents exist, no source code yet.
+
+ **Core idea:** Help patients understand which clinical trials they may qualify for, and transform "rejection" into "actionable next steps" via gap analysis.
+
+ ## Design Documents
+
+ - `Trialpath PRD.md` — Product requirements, success metrics, HAI-DEF submission plan
+ - `TrialPath AI Synergy in Digital Health Trials.md` — Technical architecture, data contracts, Parlant workflow design
+
+ ## Architecture (5 Components)
+
+ 1. **UI & Orchestrator** — Streamlit/FastAPI app embedding the Parlant engine
+ 2. **Parlant Agent + Journey** — Single agent (`patient_trial_copilot`) with 5 states: `INGEST` → `PRESCREEN` → `VALIDATE_TRIALS` → `GAP_FOLLOWUP` → `SUMMARY`
+ 3. **MedGemma 4B** (HF endpoint) — Multimodal extraction from PDFs/images → `PatientProfile` + evidence spans
+ 4. **Gemini 3 Pro** — LLM planner: generates `SearchAnchors` from the profile, reranks trials, orchestrates criterion evaluation
+ 5. **ClinicalTrials MCP Server** (existing, not custom) — Wraps the ClinicalTrials.gov REST API v2
+
+ ## Key Design Decisions
+
+ - **No vector DB / RAG** — Uses agentic search via the ClinicalTrials.gov API with iterative query refinement
+ - **Reuse existing MCP** — Don't build custom trial search; use off-the-shelf ClinicalTrials MCP servers
+ - **Two-stage clinical screening** — Mirrors the real world: prescreen (minimal dataset) → validation (full criterion-by-criterion review)
+ - **Evidence-linked** — Every decision must cite a source doc/page/span
+ - **Gap analysis as core differentiator** — "You'd qualify IF you had X" rather than just "No match"
+
+ ## Data Contracts (JSON Schemas)
+
+ Four core contracts are defined in the tech design doc (section 4):
+ - **PatientProfile v1** — MedGemma output with demographics, diagnosis, biomarkers, labs, treatments, unknowns
+ - **SearchAnchors v1** — Gemini-generated query params for MCP search
+ - **TrialCandidate v1** — Normalized MCP search results
+ - **EligibilityLedger v1** — Per-trial criterion-level assessment with evidence pointers and gaps
+
+ ## Planned Code Structure
+
+ From the PRD deliverables section:
+ ```
+ data/generate_synthetic_patients.py
+ data/generate_noisy_pdfs.py
+ matching/medgemma_extractor.py
+ matching/agentic_search.py   # Parlant + Gemini + MCP
+ evaluation/run_trec_benchmark.py
+ ```
+
+ ## Planned Tech Stack
+
+ - Python (Streamlit or FastAPI)
+ - Google Gemini 3 Pro (orchestration)
+ - MedGemma 4B via Hugging Face endpoint (multimodal extraction)
+ - Parlant (agentic workflow engine)
+ - Synthea FHIR (synthetic patient generation)
+ - TREC Clinical Trials Track 2021/2022 (benchmarking)
+
+ ## Success Targets
+
+ - MedGemma extraction F1 >= 0.85
+ - Trial retrieval Recall@50 >= 0.75
+ - Trial ranking NDCG@10 >= 0.60
+ - Criterion decision accuracy >= 0.85
+ - Latency < 15s; cost < $0.50/session
+
+ ## Scope
+
+ - Disease: NSCLC only
+ - Data: Synthetic patients only (no real PHI)
+ - Timeline: 3-month PoC
+
+ ## Dev Tools
+
+ - Use the Hugging Face CLI for model deployment
+ - Use uv, ruff, and Astral ty
+ - Use ripgrep
+
+ ## Commit Atomically
app/__init__.py ADDED
File without changes
app/tests/__init__.py ADDED
File without changes
docs/TrialPath AI technical design.md ADDED
@@ -0,0 +1,487 @@
+ Below is a compact but deepened tech design doc that applies your three constraints:
+
+ 1. Reuse existing ClinicalTrials MCPs.
+ 2. Make Parlant workflows map tightly onto real clinical screening.
+ 3. Lay out a general patient plan (using synthetic data) that feels like a real-world journey.
+
+ No code; just user flow, data contracts, and architecture.
+
+ ---
+
+ ## **1. Scope & Positioning**
+
+ **PoC Goal (2-week sprint, YAGNI):**
+ A working, demoable *patient-centric* trial-matching copilot that:
+
+ * Takes **synthetic NSCLC patients** (documents + minimal metadata).
+ * Uses **MedGemma 4B multimodal** to understand those artifacts.
+ * Uses **Gemini 3 Pro + Parlant** to orchestrate **patient-to-trials matching** via an **off-the-shelf ClinicalTrials MCP server**.
+ * Produces an **eligibility ledger + gap analysis** aligned with real clinical screening workflows (prescreen → validation), not "toy" UX.
+
+ We explicitly **don't** build our own trial MCP, our own search stack, or multi-service infra. Everything runs in a thin orchestrator + UI process.
+
+ ---
+
+ ## **2. Real-World Screening Workflow Mapping**
+
+ Evidence from clinical practice and trial-matching research converges on a two-stage flow:[appliedclinicaltrialsonline+4](https://www.appliedclinicaltrialsonline.com/view/clinical-trial-matching-solutions-understanding-the-landscape)
+
+ 1. **Prescreening**
+    * Quick eligibility judgment on a *minimal dataset*: diagnosis, stage, functional status (ECOG), basic labs, key comorbidities.
+    * Usually: oncologist + coordinator + minimal EHR context.
+    * Goal: "Is this patient worth deeper chart review for any trials here?"
+ 2. **Validation (Full Match / Chart Review)**
+    * Detailed comparison of the **full record** against the **full inclusion/exclusion criteria**, often 40–60 criteria per trial.
+    * Typically done by a coordinator/CRA with investigator sign-off.
+    * Goal: for a *specific trial*, decide: *eligible / excluded / unclear → needs further tests*.
+
+ Our PoC should simulate this **two-stage workflow**:
+
+ * **Stage 1 = "Patient-First Prescreen"** → shortlist trials via MCP + Gemini using the MedGemma-extracted "minimal dataset".
+ * **Stage 2 = "Trial-Specific Validation"** → trial-by-trial, criterion-by-criterion ledger using MedGemma evidence.
+
+ Parlant Journeys become the *explicit codification* of these two stages + transitions.
+
+ ---
+
+ ## **3. High-Level Architecture (YAGNI, Reusing MCP)**
+
+ ## **3.1 Components**
+
+ **1) UI & Orchestrator (single process)**
+
+ * Streamlit/FastAPI-style app (the exact stack is secondary) that:
+   * Hosts the chat/stepper UI.
+   * Embeds **Parlant** and maintains session state.
+   * Calls external tools (Gemini API, MedGemma HF endpoint, ClinicalTrials MCP).
+
+ **2) Parlant Agent + Journey**
+
+ * Single Parlant agent, e.g. `patient_trial_copilot`.
+ * One **Journey** with explicit stages mirroring the real-world workflow:
+   * `INGEST` → `PRESCREEN` → `VALIDATE_TRIALS` → `GAP_FOLLOWUP` → `SUMMARY`.
+ * Parlant rules enforce:
+   * When to call which tool.
+   * When to move from prescreen to validation.
+   * When to ask the patient (synthetic persona) for more documents.
+
+ **3) MedGemma 4B Multimodal Service (HF endpoint)**
+
+ * Input: PDF(s) + optional images.
+ * Output: structured **PatientProfile** + **evidence spans** (doc/page/region references).
+ * Used twice:
+   * Once for **prescreen dataset** extraction.
+   * Once for **criterion-level validation** (patient vs trial snippets).
+
+ **4) Gemini 3 Pro (LLM Planner & Re-ranker)**
+
+ * Uses Google AI / Vertex Gemini 3 Pro for:
+   * Generating query parameters for the ClinicalTrials MCP from the PatientProfile.
+   * Interpreting MCP results and producing a ranked **TrialCandidate** list.
+   * Orchestrating criterion slicing and gap reasoning.
+ * Strategy: keep Gemini in **tools + structured outputs** mode; no direct free-form "actions".
+
+ **5) ClinicalTrials MCP Server (Existing)**
+
+ * Choose an existing **ClinicalTrials MCP server** rather than hand-rolling one: e.g., one of the open-source MCP servers wrapping the ClinicalTrials.gov REST API v2.[github+3](https://github.com/JackKuo666/ClinicalTrials-MCP-Server)
+ * Must support at least:
+   * `search_trials(parameters)` → list of (NCT ID, title, conditions, locations, status, phase, eligibility text).
+   * `get_trial(nct_id)` → full record including inclusion/exclusion criteria.
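The two required tool calls can be exercised against a stub before any real server is wired up. A minimal sketch, assuming nothing about a particular MCP implementation: `StubClinicalTrialsMCP` and `TrialRecord` are illustrative names, and a real client would make MCP tool calls over a transport rather than dictionary lookups.

```python
# Illustrative stub of the minimal tool surface we require from an
# existing ClinicalTrials MCP server (search_trials + get_trial).
from dataclasses import dataclass, field

@dataclass
class TrialRecord:
    nct_id: str
    title: str
    conditions: list
    status: str
    phase: str
    eligibility_text: dict = field(default_factory=dict)

class StubClinicalTrialsMCP:
    """In-memory stand-in; a real MCP client would RPC to a server."""
    def __init__(self, records):
        self._db = {r.nct_id: r for r in records}

    def search_trials(self, parameters):
        # Supports only the filters this design needs: condition + status.
        cond = parameters.get("condition", "").lower()
        wanted = set(parameters.get("recruitment_status", []))
        return [
            {"nct_id": r.nct_id, "title": r.title,
             "status": r.status, "phase": r.phase}
            for r in self._db.values()
            if any(cond in c.lower() for c in r.conditions)
            and (not wanted or r.status in wanted)
        ]

    def get_trial(self, nct_id):
        return self._db[nct_id]

mcp = StubClinicalTrialsMCP([
    TrialRecord("NCT01234567", "Phase 3 Study of Osimertinib in EGFR+ NSCLC",
                ["NSCLC"], "Recruiting", "Phase 3",
                {"inclusion": "Histologically confirmed NSCLC ...",
                 "exclusion": "..."}),
])
hits = mcp.search_trials({"condition": "NSCLC",
                          "recruitment_status": ["Recruiting"]})
```

Swapping the stub for a real server should change only the constructor, which is the point of depending on the MCP tool surface rather than a specific implementation.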
+
+ ## **3.2 Why Reuse MCP is Critical**
+
+ * **Time**: The ClinicalTrials.gov v2 API is detailed and somewhat finicky (paging, filters, field lists). Existing MCPs already encode those details + JSON schemas.[nlm.nih+1](https://www.nlm.nih.gov/pubs/techbull/ma24/ma24_clinicaltrials_api.html)
+ * **Alignment with agentic ecosystems**: These MCP servers are already shaped as "tools" for LLMs. We just plug Parlant/Gemini on top.
+ * **YAGNI**: A custom MCP or RAG index for trials is a post-PoC optimization.
+
+ ---
+
+ ## **4. Data Contracts (Core JSON Schemas)**
+
+ We keep contracts minimal but explicit, so we can test each piece in isolation.
+
+ ## **4.1 PatientProfile (v1)**
+
+ Output of MedGemma's **prescreen extraction**; updated as new docs arrive:
+
+ ```json
+ {
+   "patient_id": "string",
+   "source_docs": [
+     { "doc_id": "string", "type": "clinic_letter|pathology|lab|imaging", "meta": {} }
+   ],
+   "demographics": {
+     "age": 52,
+     "sex": "female"
+   },
+   "diagnosis": {
+     "primary_condition": "Non-Small Cell Lung Cancer",
+     "histology": "adenocarcinoma",
+     "stage": "IVa",
+     "diagnosis_date": "2025-11-15"
+   },
+   "performance_status": {
+     "scale": "ECOG",
+     "value": 1,
+     "evidence": [{ "doc_id": "clinic_1", "page": 2, "span_id": "s_17" }]
+   },
+   "biomarkers": [
+     {
+       "name": "EGFR",
+       "result": "Exon 19 deletion",
+       "date": "2026-01-10",
+       "evidence": [{ "doc_id": "path_egfr", "page": 1, "span_id": "s_3" }]
+     }
+   ],
+   "key_labs": [
+     {
+       "name": "ANC",
+       "value": 1.8,
+       "unit": "10^9/L",
+       "date": "2026-01-28",
+       "evidence": [{ "doc_id": "labs_jan", "page": 1, "span_id": "tbl_anc" }]
+     }
+   ],
+   "treatments": [
+     {
+       "drug_name": "Pembrolizumab",
+       "start_date": "2024-06-01",
+       "end_date": "2024-11-30",
+       "line": 1,
+       "evidence": [{ "doc_id": "clinic_2", "page": 3, "span_id": "s_45" }]
+     }
+   ],
+   "comorbidities": [
+     {
+       "name": "CKD",
+       "grade": "Stage 3",
+       "evidence": [{ "doc_id": "clinic_1", "page": 2, "span_id": "s_20" }]
+     }
+   ],
+   "imaging_summary": [
+     {
+       "modality": "MRI brain",
+       "date": "2026-01-20",
+       "finding": "Stable 3mm left frontal lesion, no enhancement",
+       "interpretation": "likely inactive scar",
+       "certainty": "low|medium|high",
+       "evidence": [{ "doc_id": "mri_report", "page": 1, "span_id": "s_9" }]
+     }
+   ],
+   "unknowns": [
+     { "field": "EGFR", "reason": "No clear mention", "importance": "high" }
+   ]
+ }
+ ```
+
+ Notes:
+
+ * `unknowns` is **explicit**, enabling Parlant to decide what to ask for in `GAP_FOLLOWUP`.
+ * The `evidence` structure enables the later criterion-level ledger to reference the same spans.
+ * This is **not** a fully normalized EHR; it's what's needed for prescreening.[pmc.ncbi.nlm.nih+1](https://pmc.ncbi.nlm.nih.gov/articles/PMC11612666/)
+
+ ## **4.2 SearchAnchors (v1)**
+
+ Intermediate structure Gemini produces from the PatientProfile to drive the MCP search:
+
+ ```json
+ {
+   "condition": "Non-Small Cell Lung Cancer",
+   "subtype": "adenocarcinoma",
+   "biomarkers": ["EGFR exon 19 deletion"],
+   "stage": "IV",
+   "geography": {
+     "country": "DE",
+     "max_distance_km": 200
+   },
+   "age": 52,
+   "performance_status_max": 1,
+   "trial_filters": {
+     "recruitment_status": ["Recruiting", "Not yet recruiting"],
+     "phase": ["Phase 2", "Phase 3"]
+   },
+   "relaxation_order": [
+     "phase",
+     "distance",
+     "biomarker_strictness"
+   ]
+ }
+ ```
+
+ This mirrors the patient-centric matching literature: patient characteristics + geography + site status.[nature+1](https://www.nature.com/articles/s41467-024-53081-z)
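The `relaxation_order` field deserves a concrete reading: when a search comes back empty, the orchestrator loosens one dimension at a time, in the order Gemini proposed. A minimal sketch, assuming illustrative relaxation rules per dimension (the contract only fixes the field names, not these rules):

```python
# Applying `relaxation_order` to a SearchAnchors dict, one step at a time.
import copy

def relax_anchors(anchors: dict, steps: int) -> dict:
    """Return a relaxed copy of anchors after the first `steps` relaxations."""
    relaxed = copy.deepcopy(anchors)
    for dim in anchors.get("relaxation_order", [])[:steps]:
        if dim == "phase":
            relaxed["trial_filters"]["phase"] = []          # accept any phase
        elif dim == "distance":
            relaxed["geography"]["max_distance_km"] *= 2    # widen the radius
        elif dim == "biomarker_strictness":
            relaxed["biomarkers"] = []                      # drop biomarker filter
    return relaxed

anchors = {
    "biomarkers": ["EGFR exon 19 deletion"],
    "geography": {"country": "DE", "max_distance_km": 200},
    "trial_filters": {"phase": ["Phase 2", "Phase 3"]},
    "relaxation_order": ["phase", "distance", "biomarker_strictness"],
}
step2 = relax_anchors(anchors, 2)   # relaxes phase, then distance
```

Returning a copy keeps the original anchors intact, so the UI can always show the user which constraints were loosened and in what order.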
+
+ ## **4.3 TrialCandidate (v1)**
+
+ Returned by the ClinicalTrials MCP search and lightly normalized:
+
+ ```json
+ {
+   "nct_id": "NCT01234567",
+   "title": "Phase 3 Study of Osimertinib in EGFR+ NSCLC",
+   "conditions": ["NSCLC"],
+   "phase": "Phase 3",
+   "status": "Recruiting",
+   "locations": [
+     { "country": "DE", "city": "Berlin" },
+     { "country": "DE", "city": "Hamburg" }
+   ],
+   "age_range": { "min": 18, "max": 75 },
+   "fingerprint_text": "short concatenation of title + key inclusion/exclusion + keywords",
+   "eligibility_text": {
+     "inclusion": "raw inclusion criteria text ...",
+     "exclusion": "raw exclusion criteria text ..."
+   }
+ }
+ ```
+
+ `fingerprint_text` is purposely short and designed for Gemini reranking; the full eligibility text goes to MedGemma for criterion analysis.
+
+ ## **4.4 EligibilityLedger (v1)**
+
+ Final artifact per trial, shown to the "clinician" or patient:
+
+ ```json
+ {
+   "patient_id": "P001",
+   "nct_id": "NCT01234567",
+   "overall_assessment": "likely_eligible|likely_ineligible|uncertain",
+   "criteria": [
+     {
+       "criterion_id": "inc_1",
+       "type": "inclusion",
+       "text": "Histologically confirmed NSCLC, stage IIIB/IV",
+       "decision": "met|not_met|unknown",
+       "patient_evidence": [{ "doc_id": "clinic_1", "page": 1, "span_id": "s_12" }],
+       "trial_evidence": [{ "field": "eligibility_text.inclusion", "offset_start": 0, "offset_end": 80 }]
+     },
+     {
+       "criterion_id": "exc_3",
+       "type": "exclusion",
+       "text": "No prior treatment with immune checkpoint inhibitors",
+       "decision": "not_met",
+       "patient_evidence": [{ "doc_id": "clinic_2", "page": 3, "span_id": "s_45" }],
+       "trial_evidence": [{ "field": "eligibility_text.exclusion", "offset_start": 211, "offset_end": 280 }]
+     }
+   ],
+   "gaps": [
+     {
+       "description": "Requires brain MRI within 28 days; last MRI is 45 days old",
+       "recommended_action": "Repeat brain MRI",
+       "clinical_importance": "high"
+     }
+   ]
+ }
+ ```
+
+ This mirrors TrialGPT's criterion-level output (explanation + evidence locations + decision), but tuned to our multimodal extraction and PoC constraints.[nature](https://www.nature.com/articles/s41467-024-53081-z)
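The contract leaves the rollup from criterion decisions to `overall_assessment` unspecified. One plausible deterministic reading, stated here as an assumption rather than the design's rule: decisions are normalized so that "met" always means the patient satisfies the criterion as written (in the sample ledger, "not_met" on the exclusion phrased "No prior treatment with immune checkpoint inhibitors" means the patient fails it).

```python
# Hypothetical rollup: any failed criterion -> likely_ineligible;
# otherwise any unknown -> uncertain; otherwise likely_eligible.
def overall_assessment(criteria: list[dict]) -> str:
    if any(c["decision"] == "not_met" for c in criteria):
        return "likely_ineligible"
    if any(c["decision"] == "unknown" for c in criteria):
        return "uncertain"
    return "likely_eligible"

# The two criteria from the sample ledger above.
sample = [
    {"criterion_id": "inc_1", "type": "inclusion", "decision": "met"},
    {"criterion_id": "exc_3", "type": "exclusion", "decision": "not_met"},
]
```

Keeping this rollup in plain code (rather than asking an LLM) makes the overall label auditable: every red or yellow light traces back to a specific criterion row.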
+
+ ---
+
+ ## **5. Parlant Workflow Design (Aligned with Real Clinical Work)**
+
+ We design a **single Parlant Journey** that approximates the real-world job of a trial coordinator/oncologist team, but in a patient-centric context.[pmc.ncbi.nlm.nih+3](https://pmc.ncbi.nlm.nih.gov/articles/PMC6685132/)
+
+ ## **5.1 Journey States**
+
+ **States:**
+
+ 1. `INGEST` (Document Collection)
+ 2. `PRESCREEN` (Patient-Level Trial Shortlist)
+ 3. `VALIDATE_TRIALS` (Trial-Level Eligibility Ledger)
+ 4. `GAP_FOLLOWUP` (Patient Data Completion Loop)
+ 5. `SUMMARY` (Shareable Packet & Next Steps)
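The five states and the transitions described in the subsections that follow can be written down as a plain transition table, independent of Parlant's actual API (which would encode these as journey states plus rules). The event names here are illustrative, not Parlant identifiers:

```python
# The journey as a plain (state, event) -> next-state table.
from enum import Enum, auto

class State(Enum):
    INGEST = auto()
    PRESCREEN = auto()
    VALIDATE_TRIALS = auto()
    GAP_FOLLOWUP = auto()
    SUMMARY = auto()

TRANSITIONS = {
    (State.INGEST, "minimal_dataset_ready"): State.PRESCREEN,
    (State.INGEST, "missing_core_docs"): State.GAP_FOLLOWUP,
    (State.PRESCREEN, "candidates_found"): State.VALIDATE_TRIALS,
    (State.PRESCREEN, "zero_candidates"): State.GAP_FOLLOWUP,
    (State.VALIDATE_TRIALS, "some_viable"): State.SUMMARY,
    (State.VALIDATE_TRIALS, "all_ineligible"): State.GAP_FOLLOWUP,
    (State.GAP_FOLLOWUP, "new_documents"): State.INGEST,
    (State.GAP_FOLLOWUP, "no_more_data"): State.SUMMARY,
}

def step(state: State, event: str) -> State:
    # Unknown events leave the journey where it is.
    return TRANSITIONS.get((state, event), state)
```

Having the table in one place makes the gap-driven loop (`GAP_FOLLOWUP` → `INGEST` → re-run) visible at a glance, which is harder to see when the rules are scattered across guideline definitions.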
+
+ ## **State 1 — INGEST**
+
+ **Role in the real world:** The patient (or a referrer) provides records; a coordinator checks whether there is enough to do a prescreen.[trialchoices+2](https://www.trialchoices.org/post/what-to-expect-during-the-clinical-trial-screening-process)
+
+ **Inputs:**
+
+ * Uploaded PDFs/images (synthetic in the PoC).
+ * Lightweight metadata (age, sex, location) from a user form.
+
+ **Actions:**
+
+ * Parlant calls MedGemma with multimodal input (images + text) to generate `PatientProfile.v1`.
+ * The Parlant agent summarises back to the patient:
+   * What it understood ("You have stage IV NSCLC, ECOG 1, EGFR unknown").
+   * What it is missing ("I did not find EGFR mutation status or a recent brain MRI").
+
+ **Transitions:**
+
+ * If the **minimal prescreen dataset is present** (diagnosis + stage + ECOG + rough labs): → `PRESCREEN`.
+ * Else: stay in `INGEST` but trigger `GAP_FOLLOWUP`-style prompts ("Can you upload a pathology report or discharge summary?").
+
+ ## **State 2 — PRESCREEN**
+
+ **Role in the real world:** Pre-filter to "worth reviewing" trials based on limited data.[pmc.ncbi.nlm.nih+1](https://pmc.ncbi.nlm.nih.gov/articles/PMC11612666/)
+
+ **Inputs:**
+
+ * `PatientProfile.v1`.
+
+ **Actions:**
+
+ * Gemini converts the `PatientProfile` → `SearchAnchors.v1`.
+ * Parlant calls the **existing ClinicalTrials MCP**, mapping `SearchAnchors` to the MCP's parameters:
+   * Condition keywords
+   * Recruitment status
+   * Phase filters
+   * Geography
+ * Trials are returned as a `TrialCandidate` list.
+ * Gemini reranks them using `fingerprint_text` + `PatientProfile` to produce a shortlist (e.g., top 20).
+ * Parlant communicates to the user:
+   * "Based on your profile, I found 23 potentially relevant NSCLC trials; I'll now check each more carefully."
+
+ **Transitions:**
+
+ * If **0 trials** → `GAP_FOLLOWUP` (relax criteria and/or widen geography).
+ * If **>0 trials** → `VALIDATE_TRIALS`.
+
+ This maps to the patient-centric matching described in the applied literature: single patient → candidate trials, then deeper evaluation.[trec-cds+2](https://www.trec-cds.org/2021.html)
+
+ ## **State 3 — VALIDATE_TRIALS**
+
+ **Role in the real world:** Detailed chart review against full eligibility criteria.[pmc.ncbi.nlm.nih+1](https://pmc.ncbi.nlm.nih.gov/articles/PMC6685132/)
+
+ **Inputs:**
+
+ * Shortlisted `TrialCandidate` objects (e.g., top 10–20).
+
+ **Actions:**
+
+ For each trial in the shortlist:
+
+ 1. Gemini slices the inclusion/exclusion text into atomic criteria (each with an ID and text).
+ 2. For each criterion:
+    * Parlant calls **MedGemma** with:
+      * The `PatientProfile` + selected patient evidence snippets (and, where available, the underlying images).
+      * The criterion text snippet.
+    * MedGemma outputs:
+      * `decision: met/not_met/unknown`.
+      * `patient_evidence` span references (doc/page/span_id).
+ 3. Parlant aggregates per-trial results into `EligibilityLedger.v1`.
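Step 1 above is an LLM call in the design, but it is worth having a deterministic baseline slicer for tests and as a fallback. A naive sketch that just splits the raw eligibility text on line boundaries; the function name and ID scheme (`inc_1`, `exc_1`, matching the ledger examples) are illustrative:

```python
# Deterministic fallback for criterion slicing: split on bullet/newline
# boundaries and assign inc_*/exc_* IDs matching the ledger convention.
import re

def slice_criteria(eligibility_text: dict) -> list[dict]:
    criteria = []
    for ctype, prefix in (("inclusion", "inc"), ("exclusion", "exc")):
        raw = eligibility_text.get(ctype, "")
        parts = [p.strip(" -*\t") for p in re.split(r"\n+", raw)]
        for i, text in enumerate(p for p in parts if p):
            criteria.append({
                "criterion_id": f"{prefix}_{i + 1}",
                "type": ctype,
                "text": text,
            })
    return criteria

sliced = slice_criteria({
    "inclusion": "- Histologically confirmed NSCLC, stage IIIB/IV\n- ECOG 0-1",
    "exclusion": "- Active CNS metastases",
})
```

Real ClinicalTrials.gov eligibility text is messier than this (nested sub-bullets, multi-line criteria), which is exactly why the design hands the job to Gemini; the baseline mainly pins down the output shape the rest of the pipeline expects.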
+
+ **Outputs:**
+
+ * A ranked list of trials with:
+   * A traffic-light label (green/yellow/red) for overall eligibility (+ explanation).
+   * Criterion-level breakdowns and evidence pointers.
+
+ **Transitions:**
+
+ * If **no trial has any green/yellow** (all clearly ineligible):
+   * → `GAP_FOLLOWUP`, to explore whether missing data (e.g., outdated labs) could change this.
+ * Else:
+   * Offer `SUMMARY` while keeping `GAP_FOLLOWUP` open.
+
+ ## **State 4 — GAP_FOLLOWUP**
+
+ **Role in the real world:** Additional tests/data to confirm eligibility (e.g., labs, imaging).[pfizerclinicaltrials+2](https://www.pfizerclinicaltrials.com/about/steps-to-join)
+
+ **Inputs:**
+
+ * `PatientProfile.unknowns` + `EligibilityLedger.gaps`.
+
+ **Actions:**
+
+ * Gemini synthesizes the **minimal actionable set** of missing data:
+   * E.g., "The most promising trials require: (1) current EGFR mutation status, (2) a brain MRI < 28 days old."
+ * Parlant:
+   * Poses this to the patient in simple language.
+   * For the PoC, the user (you, or a script) uploads new synthetic documents representing those tests.
+ * On a new upload, we go back through `INGEST` → update the `PatientProfile` → fast-path directly to `PRESCREEN`/`VALIDATE_TRIALS`.
+
+ **Transitions:**
+
+ * On new docs → `INGEST` (update and re-run).
+ * If the user declines, or no additional data is possible → `SUMMARY` with a clear explanation ("Here's why the current trials don't fit").
+
+ ## **State 5 — SUMMARY**
+
+ **Role in the real world:** The coordinator/oncologist summarises findings, shares options, and discusses next steps.[pfizerclinicaltrials+2](https://www.pfizerclinicaltrials.com/about/steps-to-join)
+
+ **Inputs:**
+
+ * The final `PatientProfile`.
+ * The set of `EligibilityLedger` objects for the top trials.
+ * The list of `gaps`.
+
+ **Actions:**
+
+ * Generate:
+   * A **patient-friendly summary**: a 3–5 bullet explanation of the matches.
+   * A **clinician packet**: the aggregated ledger and evidence pointers, referencing doc IDs and trial NCT IDs.
+ * For the PoC: show in the UI + downloadable JSON/Markdown.
+
+ **Transitions:**
+
+ * End of Journey.
+
+ ---
+
+ ## **6. General Patient Plan (Synthetic Data Flow)**
+
+ We simulate realistic but synthetic patients, and run them through exactly the above journey.
+
+ ## **6.1 Synthetic Patient Generation & Formats**
+
+ **Source:**
+
+ * TREC Clinical Trials Track 2021/2022 patient topics (free-text vignettes) as the ground truth for "what the patient's story should convey".[trec-cds+3](https://www.trec-cds.org/2022.html)
+ * Synthea or custom scripts to generate structured NSCLC trajectories consistent with those vignettes (for additional fields we want).
+
+ **Artifacts per patient:**
+
+ 1. **Clinic letter PDF**
+    * Plain text + embedded logo; maybe 1–2 key tables (comorbidities, meds).
+ 2. **Biomarker/pathology PDF**
+    * EGFR/ALK/PD-L1 etc., with a small table or scanned-like image.
+ 3. **Lab report PDF**
+    * Hematology and chemistry values, with dates.
+ 4. **Imaging report PDF** (+ optional illustrative image)
+    * Brain MRI/CT narrative with lesion description; maybe a low-res "snapshot" image.
+
+ Each artifact is saved with metadata mapping to the underlying TREC topic (so we can label what the "true" conditions/stage/biomarkers are).
+
+ ## **6.2 Patient Journey (Narrative)**
+
+ For each synthetic patient "Anna":
+
+ 1. **Pre-visit (INGEST)**
+    * Anna (or a proxy) uploads her documents to the copilot.
+    * MedGemma extracts a `PatientProfile`.
+    * Parlant confirms: "You have stage IV NSCLC with ECOG 1 and prior pembrolizumab; I don't see your EGFR mutation test yet."
+ 2. **Prescreen (PRESCREEN)**
+    * Using `SearchAnchors`, trials are fetched via the ClinicalTrials MCP.
+    * The system returns, e.g., 30 candidates; after reranking, the top 10 are selected for validation.
+ 3. **Trial Validation (VALIDATE_TRIALS)**
+    * For each of the top 10, the eligibility ledger is computed.
+    * The system identifies, say, 3 trials with many green criteria but a few unknowns (e.g., a recent brain MRI).
+ 4. **Gap-Driven Iteration (GAP_FOLLOWUP)**
+    * Copilot: "You likely qualify for trial NCT01234567 if you have a brain MRI within the last 28 days. Your last MRI was 45 days ago. If your doctor orders a new MRI and the report shows no active brain metastases, you may qualify. For this PoC, you can upload a 'new MRI report' file to simulate this."
+    * A new synthetic PDF is uploaded; the `PatientProfile` is updated.
+ 5. **Re-match & Summary (PRESCREEN → VALIDATE_TRIALS → SUMMARY)**
+    * The system re-runs with the updated `PatientProfile`.
+    * Now 3 trials are "likely eligible", with red flags on only non-critical criteria.
+    * The copilot generates:
+      * A patient summary: "Here are three trials that look promising for your situation, and why."
+      * A clinician packet: ledger + evidence pointers that mimic a coordinator's notes.
+
+ This general patient plan is consistent across synthetic cases but parameterized by each TREC topic (e.g., biomarker variant, comorbidity pattern).
+
+ ---
+
+ ## **7. How This Plan Fixes Earlier Gaps**
+
+ 1. **No custom trial search stack**
+    * We explicitly plug into existing ClinicalTrials MCPs built for LLM agents, aligning with your "don't reinvent the wheel" constraint and drastically lowering infra risk in 2 weeks.[github+2](https://github.com/cyanheads/clinicaltrialsgov-mcp-server)
+ 2. **Parlant used as a real workflow engine, not just a wrapper**
+    * States mirror the prescreen vs validation vs gap-closure stages described in empirical screening studies and trial-matching frameworks.[appliedclinicaltrialsonline+3](https://www.appliedclinicaltrialsonline.com/view/clinical-trial-matching-solutions-understanding-the-landscape)
+    * Parlant becomes the place where you encode "when do we ask a human for more information, vs when do we refine a query, vs when do we stop?"
+ 3. **Patient plan grounded in real-world processes**
+    * The synthetic patient journey isn't just "upload docs → list trials."
+    * It follows actual clinical workflows: minimal dataset, prescreen, chart review, additional tests, and finally discussion/summary.[trialchoices+3](https://www.trialchoices.org/post/what-to-expect-during-the-clinical-trial-screening-process)
+ 4. **Minimal, testable contracts**
+    * PatientProfile, SearchAnchors, TrialCandidate, and EligibilityLedger together give you:
+      * Places to measure MedGemma extraction F1.
+      * Places to plug in TREC qrels (TrialCandidate → NDCG@10).[arxiv+2](https://arxiv.org/pdf/2202.07858.pdf)
+    * They're small enough to implement quickly but rich enough to survive PoC → MVP.
+
+ Source: [https://www.perplexity.ai/search/simulate-as-an-experienced-cto-i6TIXOP9TX.rqA97awuc1Q?sm=d#3](https://www.perplexity.ai/search/simulate-as-an-experienced-cto-i6TIXOP9TX.rqA97awuc1Q?sm=d#3)
docs/Trialpath PRD.md ADDED
@@ -0,0 +1,246 @@
1
+ # HAI-DEF Pitch: MedGemma Match – Patient Trial Copilot
2
+
3
+ **PoC Goal:** Demonstrate MedGemma + Gemini 3 Pro + Parlant agentic architecture for patient-facing clinical trial matching with **explainable eligibility reasoning** and **iterative gap-filling**.
4
+
5
+ ---
6
+
7
+ ## 1. Problem & Unmet Need
8
+
9
+ ### The Challenge
10
+ - **Low trial participation:** <5% of adult cancer patients enroll in clinical trials despite potential eligibility
11
+ - **Complex eligibility criteria:** Free-text criteria mix demographics, biomarkers, labs, imaging findings, and treatment history
12
+ - **Patient barrier:** Patients receive PDFs/reports but have no way to understand which trials fit their situation
13
+ - **Manual screening burden:** Clinicians spend hours per patient manually reviewing eligibility; automated tools show mixed real-world performance
14
+
15
+ ### Why AI? Why Now?
16
+ - Eligibility criteria require synthesis across multiple document types (pathology, labs, imaging, treatment history)—impossible with keyword search alone
17
+ - Recent LLM-based matching systems (TrialGPT, PRISM) show promise but lack patient-centric design and multimodal medical understanding
18
+ - HAI-DEF open-weight health models enable privacy-preserving deployment with medical domain expertise
19
+
20
+ ---
21
+
22
+ ## 2. Solution: MedGemma as Clinical Understanding Engine
23
+
24
+ ### Core Concept
25
+ **"Agentic Search + Multimodal Extraction"** replacing traditional vector-RAG approaches.
26
+
27
+ **Architecture:**
28
+ - **MedGemma (HAI-DEF):** Extracts structured clinical facts from messy PDFs/reports + understands medical imaging contexts
29
+ - **Gemini 3 Pro:** Orchestrates agentic search through ClinicalTrials.gov API with iterative query refinement
30
+ - **Parlant:** Enforces state machine (search → filter → verify) and prevents parameter hallucination
31
+ - **ClinicalTrials MCP:** Structured API wrapper for trials data (no vector DB needed)
32
+
33
+ ### Why MedGemma is Central (Not Replaceable)
34
+ 1. **Multimodal medical reasoning:** Designed for radiology reports, pathology, labs—where generic LLMs are weaker
35
+ 2. **Domain-aligned extraction:** Medical entity recognition with units, dates, and clinical context preservation
36
+ 3. **Open weights:** Enables VPC deployment for future PHI handling (vs closed-weight alternatives)
37
+ 4. **Health-safety guardrails:** Model card emphasizes validation/adaptation patterns we follow
38
+
39
+ ---
40
+
41
+ ## 3. User Journey (Patient-Centric)
42
+
43
+ ### Target User (PoC Persona)
44
+ **"Anna"** – 52-year-old NSCLC patient in Berlin with PDFs from her oncologist but no trial navigation support.
45
+
46
+ ### Journey Flow
47
+ 1. **Upload Documents** → Clinic letter, pathology report, lab results (synthetic PDFs in PoC)
48
+ 2. **MedGemma Extraction** → System builds "My Clinical Profile (draft)": Stage IVa, EGFR status unknown, ECOG 1
49
+ 3. **Agentic Search** → Gemini queries ClinicalTrials.gov via MCP:
50
+ - Initial: `condition=NSCLC, location=DE, status=RECRUITING, keywords=EGFR` → 47 results
51
+ - Refines: Adds `phase=PHASE3` → 12 results
52
+ - Reads summaries, filters to 5 relevant trials
53
+ 4. **Eligibility Analysis** → For each trial, MedGemma evaluates criteria against extracted facts
54
+ 5. **Gap Identification** → System highlights: *"You'd likely qualify IF you had EGFR mutation test"*
55
+ 6. **Iteration** → Anna uploads biomarker report → System re-matches → 3 new trials appear
56
+ 7. **Share with Doctor** → Generate clinician packet with evidence-linked eligibility ledger
57
+
58
+ ### Key Differentiator: The "Gap Analysis"
59
+ - We don't just say "No Match"
60
+ - We say: **"You would match NCT12345 IF you had: recent brain MRI showing no active CNS disease"**
61
+ - This transforms "rejection" into "actionable next steps"
62
+
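The gap analysis above reduces to a transformation over per-criterion verdicts. A minimal sketch (the `CriterionResult` shape and message wording are illustrative assumptions, not the real data contract):

```python
from dataclasses import dataclass

@dataclass
class CriterionResult:
    text: str    # criterion text from the trial protocol
    status: str  # "MET" | "NOT_MET" | "UNKNOWN"

def gap_message(nct_id: str, results: list) -> str:
    """Turn per-criterion verdicts into an actionable next-step message."""
    blockers = [r.text for r in results if r.status == "NOT_MET"]
    unknowns = [r.text for r in results if r.status == "UNKNOWN"]
    if blockers:
        return f"Not eligible for {nct_id}: {'; '.join(blockers)}"
    if unknowns:
        return f"You would match {nct_id} IF you had: {'; '.join(unknowns)}"
    return f"Likely eligible for {nct_id}"

verdicts = [
    CriterionResult("Stage IIIB-IV NSCLC", "MET"),
    CriterionResult("recent brain MRI showing no active CNS disease", "UNKNOWN"),
]
message = gap_message("NCT12345", verdicts)
# → "You would match NCT12345 IF you had: recent brain MRI showing no active CNS disease"
```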
63
+ ---
64
+
65
+ ## 4. Technical Innovation: Smart Agentic Search (No Vector DB)
66
+
67
+ ### Traditional Approach (What We're *Not* Doing)
68
+ ```
69
+ Patient text → Embeddings → Vector similarity search →
70
+ Retrieve top-K trials → LLM re-ranks
71
+ ```
72
+ **Problem:** Vector search is "dumb" about structured constraints (Phase, Location, Status) and negations.
73
+
74
+ ### Our Approach: Iterative Query Refinement
75
+ ```
76
+ MedGemma extracts "Search Anchors" (Condition, Biomarkers, Location) →
77
+ Gemini formulates API query with filters →
78
+ ClinicalTrials MCP returns results →
79
+ Too many (>50)? → Parlant enforces refinement (add phase/keywords)
80
+ Too few (0)? → Parlant enforces relaxation (remove city filter)
81
+ Right size (10-30)? → Gemini reads summaries in 2M context window →
82
+ Shortlist 5 NCT IDs → Deep eligibility verification with MedGemma
83
+ ```
84
+
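The refinement loop above can be sketched in a few lines of Python. This is illustrative only: `fake_search` stands in for the MCP call, and the tighten/relax moves are simplified placeholders for the Parlant-enforced rules:

```python
def iterative_search(search_trials, params: dict, max_steps: int = 5) -> list:
    """Tighten or relax query filters until the result set is a workable size."""
    results = []
    for _ in range(max_steps):
        results = search_trials(params)
        if len(results) > 50 and "phase" not in params:
            params["phase"] = "PHASE3"   # too many → add a filter
        elif len(results) == 0 and "city" in params:
            params.pop("city")           # too few → relax a filter
        else:
            break                        # workable size: read summaries next
    return results

# Stub standing in for the MCP search tool: adding PHASE3 narrows 60 hits to 12.
def fake_search(params):
    return [f"NCT{i:08d}" for i in range(12 if "phase" in params else 60)]

shortlist = iterative_search(fake_search, {"condition": "NSCLC", "location": "DE"})
# → 12 candidate NCT IDs, ready for Gemini to read and shortlist
```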
85
+ **Why This is Better:**
86
+ - **Precision:** Leverages native API filters (Phase, Status, Location) that vectors can't handle
87
+ - **Transparency:** Every search step is logged and explainable ("I searched X, got Y results, refined to Z")
88
+ - **Feasibility:** No vector DB infrastructure; uses live API
89
+ - **Showcases Gemini reasoning:** Demonstrates multi-step planning vs one-shot retrieval
90
+
91
+ ---
92
+
93
+ ## 5. MedGemma Showcase Moments (HAI-DEF "Fullest Potential")
94
+
95
+ ### Use Case 1: Temporal Lab Extraction
96
+ **Challenge:** Criterion requires "ANC ≥ 1.5 × 10⁹/L within 14 days of enrollment"
97
+ - **MedGemma extracts:** Value=1.8, Units=10⁹/L, Date=2026-01-28, DocID=labs_jan.pdf
98
+ - **System verifies:** Current date Feb 4 → 7 days ago → ✓ MEETS criterion
99
+ - **Evidence link:** User can click to see exact lab table and date
100
+
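The date arithmetic behind this verification step is straightforward; a minimal sketch (parameter names are illustrative):

```python
from datetime import date

def meets_anc_criterion(value: float, lab_date: date, today: date,
                        threshold: float = 1.5, window_days: int = 14) -> bool:
    """Check 'ANC >= 1.5 x 10^9/L within 14 days' against an extracted lab fact."""
    age_days = (today - lab_date).days
    return value >= threshold and 0 <= age_days <= window_days

# Extracted fact: Value=1.8, Date=2026-01-28; current date 2026-02-04 → 7 days old
meets = meets_anc_criterion(1.8, date(2026, 1, 28), date(2026, 2, 4))  # → True
```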
101
+ ### Use Case 2: Multimodal Imaging Context
102
+ **Challenge:** Criterion requires "No active CNS metastases"
103
+ - **MedGemma reads:** Brain MRI report text: *"Stable 3mm left frontal lesion, no enhancement, likely scarring from prior SRS"*
104
+ - **System interprets:** "Stable" + "no enhancement" + "scarring" → Likely inactive → Flags as ⚠️ UNKNOWN (requires clinician confirmation)
105
+ - **Evidence link:** Highlights report section for doctor review
106
+
107
+ ### Use Case 3: Treatment Line Reconstruction
108
+ **Challenge:** Criterion excludes "Prior immune checkpoint inhibitor therapy"
109
+ - **MedGemma reconstructs:** From medication list and notes → Patient received Pembrolizumab 2024-06 to 2024-11
110
+ - **System verifies:** → ✗ EXCLUDED
111
+ - **Evidence link:** Shows medication timeline with dates and sources
112
+
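Once the timeline is reconstructed, the exclusion check itself is a lookup against a known checkpoint-inhibitor list. A sketch (the drug set is abridged; a real system would use a curated drug ontology):

```python
# Abridged, illustrative list of immune checkpoint inhibitors.
CHECKPOINT_INHIBITORS = {"pembrolizumab", "nivolumab", "atezolizumab", "durvalumab"}

def prior_ici_exposure(medications: list) -> list:
    """Return any prior immune checkpoint inhibitors found in the med timeline."""
    return [m["name"] for m in medications
            if m["name"].lower() in CHECKPOINT_INHIBITORS]

timeline = [
    {"name": "Pembrolizumab", "start": "2024-06", "end": "2024-11"},
    {"name": "Carboplatin", "start": "2024-06", "end": "2024-09"},
]
hits = prior_ici_exposure(timeline)  # → ["Pembrolizumab"] → criterion EXCLUDED
```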
113
+ ---
114
+
115
+ ## 6. PoC Scope & Data Strategy
116
+
117
+ ### In Scope (3-Month PoC)
118
+ - **Disease:** NSCLC only (complex biomarkers, high trial volume)
119
+ - **Data:** Synthetic patients only (no real PHI)
120
+ - **Deliverables:**
121
+ - Working web prototype (video demo)
122
+ - Experimental validation on TREC benchmarks
123
+ - Technical write-up + public code repo
124
+
125
+ ### Data Sources
126
+ **Patients (Synthetic):**
127
+ - Structured ground truth: Synthea FHIR (500 NSCLC patients)
128
+ - Unstructured artifacts: LLM-generated clinic letters + lab PDFs with controlled noise (abbreviations, OCR errors, missing values)
129
+
130
+ **Trials (Real):**
131
+ - ClinicalTrials.gov live API via MCP wrapper
132
+ - Focus on NSCLC recruiting trials in Europe + US
133
+
134
+ **Benchmarking:**
135
+ - TREC Clinical Trials Track 2021/2022 (75 patient topics + judged relevance)
136
+ - Custom criterion-extraction test set (labeled synthetic reports)
137
+
138
+ ---
139
+
140
+ ## 7. Success Metrics & Evaluation Plan
141
+
142
+ ### Model Performance
143
+ | Metric | Target | Baseline | Method |
144
+ |--------|--------|----------|--------|
145
+ | **MedGemma Extraction F1** | ≥0.85 | Gemini-only: 0.65-0.75 | Field-level (stage, ECOG, biomarkers, labs) on labeled synthetic reports |
146
+ | **Trial Retrieval Recall@50** | ≥0.75 | BM25: ~0.60 | TREC 2021 patient topics |
147
+ | **Trial Ranking NDCG@10** | ≥0.60 | Non-LLM baseline: ~0.45 | TREC judged relevance |
148
+ | **Criterion Decision Accuracy** | ≥0.85 | Rule-based: ~0.70 | Per-criterion classification on synthetic patient-trial pairs |
149
+
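Field-level extraction F1 can be scored by comparing (field, value) pairs between the extracted and gold profiles. A hedged sketch of one way to do this (real evaluation would normalize values and handle partial matches):

```python
def field_f1(gold: dict, pred: dict) -> float:
    """Micro F1 over extracted (field, value) pairs; None/missing counts as absent."""
    gold_pairs = {(k, v) for k, v in gold.items() if v is not None}
    pred_pairs = {(k, v) for k, v in pred.items() if v is not None}
    if not gold_pairs or not pred_pairs:
        return 0.0
    tp = len(gold_pairs & pred_pairs)  # exact (field, value) agreements
    if tp == 0:
        return 0.0
    precision = tp / len(pred_pairs)
    recall = tp / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)

gold = {"stage": "IVa", "ecog": 1, "egfr": "positive"}
pred = {"stage": "IVa", "ecog": 1, "egfr": None}   # missed the EGFR field
score = field_f1(gold, pred)  # precision 1.0, recall 2/3 → F1 = 0.8
```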
150
+ ### Product Quality
151
+ - **Latency:** <15s from upload to first match results
152
+ - **Explainability:** 100% of "met/not met" decisions must include evidence pointer (trial text + patient doc ID)
153
+ - **Cost:** <$0.50 per patient session (token + GPU usage)
154
+
155
+ ### UX Validation (Small Study)
156
+ - Task completion: Can lay users identify ≥1 plausible trial from shortlist?
157
+ - Explanation clarity: SUS-style usability score ≥70
158
+ - Reading level: B1/8th-grade equivalent (Flesch-Kincaid)
159
+
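The reading-level target can be checked programmatically. A rough sketch using the standard Flesch-Kincaid grade formula with a naive vowel-group syllable counter (real use would rely on a library such as `textstat`):

```python
import re

def _syllables(word: str) -> int:
    """Crude estimate: count vowel groups, minimum one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllable_count = sum(_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllable_count / len(words) - 15.59

grade = fk_grade("You may qualify for this trial. Ask your doctor about an EGFR test.")
```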
160
+ ---
161
+
162
+ ## 8. Impact Potential
163
+
164
+ ### If PoC Succeeds (Quantified)
165
+ **Near-term (PoC phase):**
166
+ - Demonstrate 15-25% relative improvement in ranking quality (NDCG) vs non-LLM baselines on TREC benchmarks
167
+ - Show multimodal extraction advantage: MedGemma F1 ≥0.10 higher than Gemini-only on medical fields
168
+
169
+ **Post-PoC (Real-world projection):**
170
+ - **Patient impact:** Literature suggests automated tools can surface 20-30% more eligible trials than manual search, and NSCLC patients often face 50+ active trials but typically learn about only 2-3 from their oncologist
171
+ - **Clinician impact:** Trial coordinators report spending 2-4 hours per patient on manual screening; if our tool pre-screens with 85% sensitivity, it could reduce manual verification by ~60%
172
+ - **Trial enrollment:** Even a 10% increase in eligible patient identification could improve trial recruitment timelines (major pharma pain point)
173
+
174
+ ---
175
+
176
+ ## 9. Risks & Mitigations
177
+
178
+ | Risk | Mitigation |
179
+ |------|-----------|
180
+ | **Synthetic data too clean** | Add controlled noise to PDFs (OCR errors, abbreviations); validate against TREC which uses realistic synthetic cases |
181
+ | **MedGemma hallucination on edge cases** | Implement evidence-pointer system (every decision must cite doc ID + span); flag low-confidence as "unknown" not "met" |
182
+ | **API rate limits** | Cache trial protocols; batch requests during search refinement |
183
+ | **Regulatory misunderstanding** | Explicit "information only, not medical advice" framing throughout UI; follow MedGemma model card guidance on validation/adaptation |
184
+
185
+ ---
186
+
187
+ ## 10. Deliverables for HAI-DEF Submission
188
+
189
+ ### Video Demo (~5-7 min)
190
+ - Patient persona introduction
191
+ - Upload → extraction visualization (showing MedGemma in action)
192
+ - Agentic search loop (showing query refinement)
193
+ - Match results with traffic-light eligibility cards
194
+ - Gap-filling iteration (upload biomarker → new matches)
195
+ - "Share with doctor" packet generation
196
+
197
+ ### Technical Write-up
198
+ 1. Problem + why HAI-DEF models
199
+ 2. Architecture diagram (Parlant journey + MedGemma + Gemini + MCP)
200
+ 3. Data generation pipeline
201
+ 4. Experiments: extraction, retrieval, ranking (tables + ablations)
202
+ 5. Limitations + path to real PHI deployment
203
+
204
+ ### Code Repository
205
+ - `data/generate_synthetic_patients.py`
206
+ - `data/generate_noisy_pdfs.py`
207
+ - `matching/medgemma_extractor.py`
208
+ - `matching/agentic_search.py` (Parlant + Gemini + MCP)
209
+ - `evaluation/run_trec_benchmark.py`
210
+ - Clear README with one-command reproducibility
211
+
212
+ ---
213
+
214
+ ## 11. Why This Wins HAI-DEF
215
+
216
+ ### Effective Use of Models (20%)
217
+ ✓ MedGemma as primary clinical understanding engine (extraction + multimodal)
218
+ ✓ Concrete demos showing where non-HAI-DEF models fail (extraction accuracy gaps)
219
+ ✓ Plan for task-specific evaluation showing measurable improvement
220
+
221
+ ### Problem Domain (15%)
222
+ ✓ Clear unmet need (low trial enrollment, manual screening burden)
223
+ ✓ Patient-centric storytelling ("Anna's journey")
224
+ ✓ Evidence-based magnitude (enrollment stats, screening time data)
225
+
226
+ ### Impact Potential (15%)
227
+ ✓ Quantified near-term (benchmark improvements) and long-term (enrollment lift) impact
228
+ ✓ Clear calculation logic grounded in literature
229
+
230
+ ### Product Feasibility (20%)
231
+ ✓ Detailed technical architecture (agentic search innovation)
232
+ ✓ Realistic synthetic data strategy
233
+ ✓ Concrete evaluation plan with baselines
234
+ ✓ Deployment considerations (latency, cost, safety)
235
+
236
+ ### Execution & Communication (30%)
237
+ ✓ Cohesive narrative across video + write-up + code
238
+ ✓ Reproducible experiments
239
+ ✓ Clear explanation of design choices
240
+ ✓ Professional polish (evidence pointers, explanations, UX details)
241
+
242
+ ---
243
+
244
+ **Timeline:** 3 months to PoC demo ready for HAI-DEF submission.
245
+
246
+ **Team needs:** 1 ML engineer (MedGemma fine-tuning + evaluation), 1 full-stack engineer (web app + Parlant orchestration), 1 CPO (coordination + submission materials).
docs/tdd-guide-backend-service.md ADDED
The diff for this file is too large to render. See raw diff
 
docs/tdd-guide-data-evaluation.md ADDED
@@ -0,0 +1,2384 @@
1
+ # TrialPath Data & Evaluation Pipeline TDD Implementation Guide
2
+
3
+ > Based on in-depth research into DeepWiki, the official TREC documentation, and the ir-measures/ir_datasets libraries
4
+
5
+ ---
6
+
7
+ ## 1. Pipeline Architecture Overview
8
+
9
+ ### 1.1 Data Flow Diagram
10
+
11
+ ```
12
+ ┌─────────────────────────────────────────────────────────────────┐
13
+ │ Data & Evaluation Pipeline │
14
+ ├─────────────────────────────────────────────────────────────────┤
15
+ │ │
16
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
17
+ │ │ Synthea │───▶│ FHIR Bundle │───▶│ PatientProfile │ │
18
+ │ │ (Java CLI) │ │ (JSON) │ │ (JSON Schema) │ │
19
+ │ └──────────────┘ └──────────────┘ └────────┬─────────┘ │
20
+ │ │ │
21
+ │ ┌──────────────┐ ┌──────────────┐ ▼ │
22
+ │ │ LLM Letter │───▶│ ReportLab │───▶ Noisy Clinical PDFs │
23
+ │ │ Generator │ │ + Augraphy │ (Letters/Labs/Path) │
24
+ │ └──────────────┘ └──────────────┘ │
25
+ │ │
26
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
27
+ │ │ MedGemma │───▶│ Extracted │───▶│ F1 Evaluator │ │
28
+ │ │ Extractor │ │ Profile │ │ (scikit-learn) │ │
29
+ │ └──────────────┘ └──────────────┘ └──────────────────┘ │
30
+ │ │
31
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
32
+ │ │ TREC Topics │───▶│ TrialPath │───▶│ TREC Evaluator │ │
33
+ │ │ (ir_datasets)│ │ Matching │ │ (ir-measures) │ │
34
+ │ └──────────────┘ └──────────────┘ └──────────────────┘ │
35
+ │ │
36
+ └─────────────────────────────────────────────────────────────────┘
37
+ ```
38
+
39
+ ### 1.2 Module Relationships
40
+
41
+ | Module | Input | Output | Dependencies |
42
+ |------|------|------|------|
43
+ | `data/generate_synthetic_patients.py` | Synthea FHIR Bundles | `PatientProfile` JSON + Ground Truth | Synthea CLI, FHIR R4 |
44
+ | `data/generate_noisy_pdfs.py` | `PatientProfile` JSON | Noisy clinical PDFs | ReportLab, Augraphy |
45
+ | `evaluation/run_trec_benchmark.py` | TREC Topics + TrialPath Run | Recall@50, NDCG@10, P@10 | ir_datasets, ir-measures |
46
+ | `evaluation/extraction_eval.py` | Extracted vs Ground Truth Profiles | Field-level F1 | scikit-learn |
47
+ | `evaluation/criterion_eval.py` | EligibilityLedger vs Gold Standard | Criterion Accuracy | scikit-learn |
48
+ | `evaluation/latency_cost_tracker.py` | API call logs | Latency/Cost reports | time, logging |
49
+
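A minimal sketch of what `evaluation/latency_cost_tracker.py` could look like (the per-1K-token price constants are placeholders, not real provider pricing):

```python
import time
from contextlib import contextmanager

# Placeholder per-1K-token prices; real rates must come from the providers.
PRICE_PER_1K_TOKENS = {"gemini": 0.005, "medgemma": 0.002}

class SessionTracker:
    """Accumulates per-step latency and token cost for one patient session."""
    def __init__(self):
        self.latencies = {}
        self.cost_usd = 0.0

    @contextmanager
    def timed(self, step: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.latencies[step] = time.perf_counter() - start

    def add_tokens(self, model: str, tokens: int):
        self.cost_usd += PRICE_PER_1K_TOKENS[model] * tokens / 1000

tracker = SessionTracker()
with tracker.timed("extraction"):
    time.sleep(0.01)                    # stand-in for a model call
tracker.add_tokens("gemini", 12_000)    # cost_usd → 0.06
```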
50
+ ### 1.3 Directory Structure
51
+
52
+ ```
53
+ data/
54
+ ├── generate_synthetic_patients.py # Synthea FHIR → PatientProfile
55
+ ├── generate_noisy_pdfs.py # PatientProfile → Clinical PDFs
56
+ ├── synthea_config/
57
+ │   ├── synthea.properties           # Synthea configuration
58
+ │ └── modules/
59
+ │       └── lung_cancer_extended.json  # Extended NSCLC module (with biomarkers)
60
+ ├── templates/
61
+ │   ├── clinical_letter.py           # Clinical letter template
62
+ │   ├── pathology_report.py          # Pathology report template
63
+ │   ├── lab_report.py                # Lab report template
64
+ │   └── imaging_report.py            # Imaging report template
65
+ ├── noise/
66
+ │   └── noise_injector.py            # Noise injection engine
67
+ └── output/
68
+     ├── fhir/                        # Raw Synthea FHIR output
69
+     ├── profiles/                    # Converted PatientProfile JSON
70
+     ├── pdfs/                        # Generated clinical PDFs
71
+     └── ground_truth/                # Labeled ground-truth data
72
+
73
+ evaluation/
74
+ ├── run_trec_benchmark.py        # TREC retrieval evaluation
75
+ ├── extraction_eval.py           # MedGemma extraction F1
76
+ ├── criterion_eval.py            # Criterion Decision Accuracy
77
+ ├── latency_cost_tracker.py      # Latency and cost tracking
78
+ ├── trec_data/
79
+ │ ├── topics2021.xml # TREC 2021 topics
80
+ │ ├── qrels2021.txt # TREC 2021 relevance judgments
81
+ │ └── topics2022.xml # TREC 2022 topics
82
+ └── reports/                     # Evaluation report output
83
+
84
+ tests/
85
+ ├── test_synthea_data.py         # Synthea data validation
86
+ ├── test_pdf_generation.py       # PDF generation correctness
87
+ ├── test_noise_injection.py      # Noise injection effects
88
+ ├── test_trec_evaluation.py      # TREC evaluation computation
89
+ ├── test_extraction_f1.py        # F1 computation tests
90
+ ├── test_latency_cost.py         # Latency/cost tests
91
+ └── test_e2e_pipeline.py         # End-to-end pipeline test
92
+ ```
93
+
94
+ ---
95
+
96
+ ## 2. Synthea Synthetic Patient Generation Guide
97
+
98
+ ### 2.1 Synthea Overview
99
+
100
+ Synthea is an open-source synthetic patient simulator developed by MITRE and implemented in Java. It simulates disease trajectories through JSON state-machine modules and exports standard FHIR R4 Bundles.
101
+
102
+ **Key features (source: DeepWiki, synthetichealth/synthea):**
103
+ - Module-based disease simulation: each disease is defined as a JSON state machine
104
+ - Supports FHIR R4/STU3/DSTU2 export
105
+ - Built-in `lung_cancer.json` module with an 85% NSCLC / 15% SCLC split
106
+ - Supports Stage I-IV staging plus chemotherapy/radiation treatment pathways
107
+ - **Does not include NSCLC-specific biomarkers (EGFR, ALK, PD-L1, KRAS, ROS1); a custom extension is required**
108
+
109
+ ### 2.2 Installation and Configuration
110
+
111
+ **System requirements:**
112
+ - Java JDK 11 or later (LTS 11 or 17 recommended)
113
+
114
+ **Installation option A: use the prebuilt JAR (recommended for data generation)**
115
+ ```bash
116
+ # Download the latest release JAR
117
+ # available from https://github.com/synthetichealth/synthea/releases
118
+ wget https://github.com/synthetichealth/synthea/releases/download/master-branch-latest/synthea-with-dependencies.jar
119
+
120
+ # Verify the installation
121
+ java -jar synthea-with-dependencies.jar --help
122
+ ```
123
+
124
+ **Installation option B: build from source (required when customizing modules)**
125
+ ```bash
126
+ git clone https://github.com/synthetichealth/synthea.git
127
+ cd synthea
128
+ ./gradlew build check test
129
+ ```
130
+
131
+ ### 2.3 NSCLC Module Configuration
132
+
133
+ #### 2.3.1 Analysis of the Existing lung_cancer Module
134
+
135
+ Source: DeepWiki analysis of the `lung_cancer.json` module in `synthetichealth/synthea`:
136
+
137
+ - **Entry criteria**: ages 45-65, entered probabilistically
138
+ - **Diagnostic pathway**: symptoms (cough, hemoptysis, shortness of breath) → chest X-ray → chest CT → biopsy/cytology
139
+ - **Subtyping**: 85% NSCLC, 15% SCLC
140
+ - **Staging**: Stage I-IV, driven by `lung_cancer_nondiagnosis_counter`
141
+ - **Treatment**: NSCLC receives Cisplatin + Paclitaxel → radiation
142
+
143
+ #### 2.3.2 Custom NSCLC Biomarker Extension Module
144
+
145
+ Because the stock module lacks biomarkers such as EGFR/ALK/PD-L1, an extension submodule must be created.
146
+
147
+ **File: `data/synthea_config/modules/lung_cancer_biomarkers.json`**
148
+
149
+ Based on DeepWiki research into Synthea module state types, the available state types include:
150
+ - `Initial`: module entry point
151
+ - `Terminal`: module exit point
152
+ - `Observation`: records a clinical observation value (used for biomarkers)
153
+ - `SetAttribute`: sets a patient attribute
154
+ - `Guard`: conditional gate
155
+ - `Simple`: simple transition state
156
+ - `Encounter`: encounter state
157
+
158
+ Example structure of a biomarker Observation state:
159
+ ```json
160
+ {
161
+ "name": "NSCLC Biomarker Panel",
162
+ "states": {
163
+ "Initial": {
164
+ "type": "Initial",
165
+ "conditional_transition": [
166
+ {
167
+ "condition": {
168
+ "condition_type": "Attribute",
169
+ "attribute": "Lung Cancer Type",
170
+ "operator": "==",
171
+ "value": "NSCLC"
172
+ },
173
+ "transition": "EGFR_Test_Encounter"
174
+ },
175
+ {
176
+ "transition": "Terminal"
177
+ }
178
+ ]
179
+ },
180
+ "EGFR_Test_Encounter": {
181
+ "type": "Encounter",
182
+ "encounter_class": "ambulatory",
183
+ "codes": [
184
+ {
185
+ "system": "SNOMED-CT",
186
+ "code": "185349003",
187
+ "display": "Encounter for check up"
188
+ }
189
+ ],
190
+ "direct_transition": "EGFR_Mutation_Status"
191
+ },
192
+ "EGFR_Mutation_Status": {
193
+ "type": "Observation",
194
+ "category": "laboratory",
195
+ "codes": [
196
+ {
197
+ "system": "LOINC",
198
+ "code": "41103-3",
199
+ "display": "EGFR gene mutations found"
200
+ }
201
+ ],
202
+ "distributed_transition": [
203
+ {
204
+ "distribution": 0.15,
205
+ "transition": "EGFR_Positive"
206
+ },
207
+ {
208
+ "distribution": 0.85,
209
+ "transition": "EGFR_Negative"
210
+ }
211
+ ]
212
+ },
213
+ "EGFR_Positive": {
214
+ "type": "SetAttribute",
215
+ "attribute": "egfr_status",
216
+ "value": "positive",
217
+ "direct_transition": "ALK_Rearrangement_Status"
218
+ },
219
+ "EGFR_Negative": {
220
+ "type": "SetAttribute",
221
+ "attribute": "egfr_status",
222
+ "value": "negative",
223
+ "direct_transition": "ALK_Rearrangement_Status"
224
+ },
225
+ "ALK_Rearrangement_Status": {
226
+ "type": "Observation",
227
+ "category": "laboratory",
228
+ "codes": [
229
+ {
230
+ "system": "LOINC",
231
+ "code": "46264-8",
232
+ "display": "ALK gene rearrangement"
233
+ }
234
+ ],
235
+ "distributed_transition": [
236
+ {
237
+ "distribution": 0.05,
238
+ "transition": "ALK_Positive"
239
+ },
240
+ {
241
+ "distribution": 0.95,
242
+ "transition": "ALK_Negative"
243
+ }
244
+ ]
245
+ },
246
+ "ALK_Positive": {
247
+ "type": "SetAttribute",
248
+ "attribute": "alk_status",
249
+ "value": "positive",
250
+ "direct_transition": "PDL1_Expression"
251
+ },
252
+ "ALK_Negative": {
253
+ "type": "SetAttribute",
254
+ "attribute": "alk_status",
255
+ "value": "negative",
256
+ "direct_transition": "PDL1_Expression"
257
+ },
258
+ "PDL1_Expression": {
259
+ "type": "Observation",
260
+ "category": "laboratory",
261
+ "codes": [
262
+ {
263
+ "system": "LOINC",
264
+ "code": "85147-0",
265
+ "display": "PD-L1 by immune stain"
266
+ }
267
+ ],
268
+ "distributed_transition": [
269
+ {
270
+ "distribution": 0.30,
271
+ "transition": "PDL1_High"
272
+ },
273
+ {
274
+ "distribution": 0.35,
275
+ "transition": "PDL1_Low"
276
+ },
277
+ {
278
+ "distribution": 0.35,
279
+ "transition": "PDL1_Negative"
280
+ }
281
+ ]
282
+ },
283
+ "PDL1_High": {
284
+ "type": "SetAttribute",
285
+ "attribute": "pdl1_tps",
286
+ "value": ">=50%",
287
+ "direct_transition": "KRAS_Mutation_Status"
288
+ },
289
+ "PDL1_Low": {
290
+ "type": "SetAttribute",
291
+ "attribute": "pdl1_tps",
292
+ "value": "1-49%",
293
+ "direct_transition": "KRAS_Mutation_Status"
294
+ },
295
+ "PDL1_Negative": {
296
+ "type": "SetAttribute",
297
+ "attribute": "pdl1_tps",
298
+ "value": "<1%",
299
+ "direct_transition": "KRAS_Mutation_Status"
300
+ },
301
+ "KRAS_Mutation_Status": {
302
+ "type": "Observation",
303
+ "category": "laboratory",
304
+ "codes": [
305
+ {
306
+ "system": "LOINC",
307
+ "code": "21717-3",
308
+ "display": "KRAS gene mutations found"
309
+ }
310
+ ],
311
+ "distributed_transition": [
312
+ {
313
+ "distribution": 0.25,
314
+ "transition": "KRAS_Positive"
315
+ },
316
+ {
317
+ "distribution": 0.75,
318
+ "transition": "KRAS_Negative"
319
+ }
320
+ ]
321
+ },
322
+ "KRAS_Positive": {
323
+ "type": "SetAttribute",
324
+ "attribute": "kras_status",
325
+ "value": "positive",
326
+ "direct_transition": "Terminal"
327
+ },
328
+ "KRAS_Negative": {
329
+ "type": "SetAttribute",
330
+ "attribute": "kras_status",
331
+ "value": "negative",
332
+ "direct_transition": "Terminal"
333
+ },
334
+ "Terminal": {
335
+ "type": "Terminal"
336
+ }
337
+ }
338
+ }
339
+ ```
340
+
341
+ **Biomarker prevalence distribution (based on the NSCLC literature):**
342
+
343
+ | Biomarker | Positive rate | LOINC Code | Notes |
344
+ |-----------|--------|------------|------|
345
+ | EGFR mutation | ~15% | 41103-3 | Higher in never-smoking Asian women |
346
+ | ALK rearrangement | ~5% | 46264-8 | More common in young never-smokers |
347
+ | PD-L1 TPS>=50% | ~30% | 85147-0 | Eligibility threshold for immunotherapy |
348
+ | KRAS G12C | ~13% | 21717-3 | Targeted by Sotorasib |
349
+ | ROS1 fusion | ~1-2% | 46265-5 | Targeted by Crizotinib |
350
+
351
+ ### 2.4 Batch Generation Command
352
+
353
+ ```bash
354
+ # Generate 500 NSCLC patients with a fixed seed for reproducibility
355
+ java -jar synthea-with-dependencies.jar \
356
+ -p 500 \
357
+ -s 42 \
358
+ -m lung_cancer \
359
+ --exporter.fhir.export=true \
360
+ --exporter.fhir_stu3.export=false \
361
+ --exporter.fhir_dstu2.export=false \
362
+ --exporter.ccda.export=false \
363
+ --exporter.csv.export=false \
364
+ --exporter.hospital.fhir.export=false \
365
+ --exporter.practitioner.fhir.export=false \
366
+ --exporter.pretty_print=true \
367
+ Massachusetts
368
+
369
+ # Parameter notes:
370
+ # -p 500          : generate 500 patients
371
+ # -s 42           : random seed (reproducible)
372
+ # -m lung_cancer  : run only the lung_cancer module
373
+ # --exporter.fhir.export=true : enable FHIR R4 export
374
+ # Massachusetts   : generation region
375
+ ```
376
+
377
+ **Output location:** one JSON file per patient under `./output/fhir/`.
378
+
379
+ ### 2.5 FHIR Bundle Output Format
380
+
381
+ Source: DeepWiki analysis of the FHIR export system in `synthetichealth/synthea`.
382
+
383
+ **Top-level structure:**
384
+ ```json
385
+ {
386
+ "resourceType": "Bundle",
387
+ "type": "transaction",
388
+ "entry": [
389
+ {
390
+ "fullUrl": "urn:uuid:patient-uuid-here",
391
+ "resource": { "resourceType": "Patient", ... },
392
+ "request": { "method": "POST", "url": "Patient" }
393
+ },
394
+ {
395
+ "fullUrl": "urn:uuid:condition-uuid-here",
396
+ "resource": { "resourceType": "Condition", ... },
397
+ "request": { "method": "POST", "url": "Condition" }
398
+ }
399
+ ]
400
+ }
401
+ ```
402
+
403
+ **FHIR resource types generated by Synthea (confirmed via DeepWiki):**
404
+ - `Patient`: basic patient demographics
405
+ - `Condition`: diagnoses (e.g., NSCLC)
406
+ - `Observation`: lab results and vital signs
407
+ - `MedicationRequest`: medication orders
408
+ - `Procedure`: surgeries and interventions
409
+ - `DiagnosticReport`: diagnostic reports
410
+ - `DocumentReference`: clinical documents (requires US Core IG to be enabled)
411
+ - `Encounter`: encounter records
412
+ - `AllergyIntolerance`: allergy history
413
+ - `Immunization`: immunizations
414
+ - `CarePlan`: care plans
415
+ - `ImagingStudy`: imaging studies
416
+
417
+ ### 2.6 Mapping FHIR Resources to PatientProfile
418
+
419
+ ```python
420
+ # Mapping logic in data/generate_synthetic_patients.py
421
+
422
+ FHIR_TO_PATIENT_PROFILE_MAP = {
423
+ # Patient Resource → demographics
424
+ "Patient.name": "demographics.name",
425
+ "Patient.gender": "demographics.sex",
426
+ "Patient.birthDate": "demographics.date_of_birth",
427
+ "Patient.address.state": "demographics.state",
428
+
429
+ # Condition Resource → diagnosis
430
+ "Condition[code=SNOMED:254637007]": "diagnosis.primary", # NSCLC
431
+ "Condition.stage.summary": "diagnosis.stage",
432
+ "Condition.bodySite": "diagnosis.histology",
433
+
434
+ # Observation Resources → biomarkers
435
+ "Observation[code=LOINC:41103-3]": "biomarkers.egfr",
436
+ "Observation[code=LOINC:46264-8]": "biomarkers.alk",
437
+ "Observation[code=LOINC:85147-0]": "biomarkers.pdl1_tps",
438
+ "Observation[code=LOINC:21717-3]": "biomarkers.kras",
439
+
440
+ # Observation Resources → labs
441
+ "Observation[category=laboratory]": "labs[]",
442
+
443
+ # MedicationRequest → prior_treatments
444
+ "MedicationRequest.medicationCodeableConcept": "treatments[].medication",
445
+
446
+ # Procedure → prior_treatments
447
+ "Procedure.code": "treatments[].procedure",
448
+ }
449
+ ```
450
+
451
+ **Conversion function pattern:**
452
+ ```python
453
+ import json
454
+ from pathlib import Path
455
+ from dataclasses import dataclass, field, asdict
456
+ from typing import Optional
457
+
458
+ @dataclass
459
+ class Demographics:
460
+ name: str = ""
461
+ sex: str = ""
462
+ date_of_birth: str = ""
463
+ age: int = 0
464
+ state: str = ""
465
+
466
+ @dataclass
467
+ class Diagnosis:
468
+ primary: str = ""
469
+ stage: str = ""
470
+ histology: str = ""
471
+ diagnosis_date: str = ""
472
+
473
+ @dataclass
474
+ class Biomarkers:
475
+ egfr: Optional[str] = None
476
+ alk: Optional[str] = None
477
+ pdl1_tps: Optional[str] = None
478
+ kras: Optional[str] = None
479
+ ros1: Optional[str] = None
480
+
481
+ @dataclass
482
+ class LabResult:
483
+ name: str = ""
484
+ value: float = 0.0
485
+ unit: str = ""
486
+ date: str = ""
487
+ loinc_code: str = ""
488
+
489
+ @dataclass
490
+ class Treatment:
491
+ name: str = ""
492
+ type: str = "" # "medication" | "procedure" | "radiation"
493
+ start_date: str = ""
494
+ end_date: Optional[str] = None
495
+
496
+ @dataclass
497
+ class PatientProfile:
498
+ patient_id: str = ""
499
+ demographics: Demographics = field(default_factory=Demographics)
500
+ diagnosis: Diagnosis = field(default_factory=Diagnosis)
501
+ biomarkers: Biomarkers = field(default_factory=Biomarkers)
502
+ labs: list[LabResult] = field(default_factory=list)
503
+ treatments: list[Treatment] = field(default_factory=list)
504
+ unknowns: list[str] = field(default_factory=list)
505
+ evidence_spans: list[dict] = field(default_factory=list)
506
+
507
+
508
+ def parse_fhir_bundle(fhir_path: Path) -> PatientProfile:
509
+ """Parse a Synthea FHIR Bundle JSON into PatientProfile."""
510
+ with open(fhir_path) as f:
511
+ bundle = json.load(f)
512
+
513
+ profile = PatientProfile()
514
+ entries = bundle.get("entry", [])
515
+
516
+ for entry in entries:
517
+ resource = entry.get("resource", {})
518
+ resource_type = resource.get("resourceType")
519
+
520
+ if resource_type == "Patient":
521
+ _parse_patient(resource, profile)
522
+ elif resource_type == "Condition":
523
+ _parse_condition(resource, profile)
524
+ elif resource_type == "Observation":
525
+ _parse_observation(resource, profile)
526
+ elif resource_type == "MedicationRequest":
527
+ _parse_medication(resource, profile)
528
+ elif resource_type == "Procedure":
529
+ _parse_procedure(resource, profile)
530
+
531
+ return profile
532
+
533
+
534
+ def _parse_patient(resource: dict, profile: PatientProfile):
535
+ """Extract demographics from Patient resource."""
536
+ names = resource.get("name", [{}])
537
+ if names:
538
+ given = " ".join(names[0].get("given", []))
539
+ family = names[0].get("family", "")
540
+ profile.demographics.name = f"{given} {family}".strip()
541
+
542
+ profile.demographics.sex = resource.get("gender", "")
543
+ profile.demographics.date_of_birth = resource.get("birthDate", "")
544
+ profile.patient_id = resource.get("id", "")
545
+
546
+ addresses = resource.get("address", [{}])
547
+ if addresses:
548
+ profile.demographics.state = addresses[0].get("state", "")
549
+
550
+
551
+ def _parse_condition(resource: dict, profile: PatientProfile):
552
+ """Extract diagnosis from Condition resource."""
553
+ code = resource.get("code", {})
554
+ codings = code.get("coding", [])
555
+ for coding in codings:
556
+ # SNOMED codes for lung cancer
557
+ if coding.get("code") in ["254637007", "254632001"]:
558
+ profile.diagnosis.primary = coding.get("display", "")
559
+ onset = resource.get("onsetDateTime", "")
560
+ profile.diagnosis.diagnosis_date = onset
561
+ # Extract stage if available
562
+ stage_info = resource.get("stage", [])
563
+ if stage_info:
564
+ summary = stage_info[0].get("summary", {})
565
+ stage_codings = summary.get("coding", [])
566
+ if stage_codings:
567
+ profile.diagnosis.stage = stage_codings[0].get("display", "")
568
+
569
+
570
+ def _parse_observation(resource: dict, profile: PatientProfile):
571
+ """Extract labs and biomarkers from Observation resource."""
572
+ code = resource.get("code", {})
573
+ codings = code.get("coding", [])
574
+ category_list = resource.get("category", [])
575
+ is_lab = any(
576
+ cat_coding.get("code") == "laboratory"
577
+ for cat in category_list
578
+ for cat_coding in cat.get("coding", [])
579
+ )
580
+
581
+ for coding in codings:
582
+ loinc = coding.get("code", "")
583
+ display = coding.get("display", "")
584
+
585
+ # Biomarker mappings
586
+ biomarker_map = {
587
+ "41103-3": "egfr",
588
+ "46264-8": "alk",
589
+ "85147-0": "pdl1_tps",
590
+ "21717-3": "kras",
591
+ "46265-5": "ros1",
592
+ }
593
+
594
+ if loinc in biomarker_map:
595
+ value_cc = resource.get("valueCodeableConcept", {})
596
+ value_codings = value_cc.get("coding", [])
597
+ value_str = value_codings[0].get("display", "") if value_codings else ""
598
+ setattr(profile.biomarkers, biomarker_map[loinc], value_str)
599
+ elif is_lab:
600
+ value_qty = resource.get("valueQuantity", {})
601
+ lab = LabResult(
602
+ name=display,
603
+ value=value_qty.get("value", 0.0),
604
+ unit=value_qty.get("unit", ""),
605
+ date=resource.get("effectiveDateTime", ""),
606
+ loinc_code=loinc,
607
+ )
608
+ profile.labs.append(lab)
609
+ ```
610
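A bundle-level driver can dispatch each entry to the `_parse_*` helpers above by `resourceType`. A minimal self-contained sketch of that dispatch pattern (the `parse_bundle` name and the stub handlers are illustrative, not part of the extractor):

```python
# Hypothetical driver: dispatch FHIR Bundle entries by resourceType.
# The handler table mirrors _parse_patient/_parse_condition/_parse_observation;
# the stubs here just count resource types so the sketch runs standalone.
from collections import Counter


def parse_bundle(bundle: dict, handlers: dict) -> None:
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        handler = handlers.get(resource.get("resourceType"))
        if handler:
            handler(resource)


seen = Counter()
handlers = {
    "Patient": lambda r: seen.update(["Patient"]),
    "Condition": lambda r: seen.update(["Condition"]),
    "Observation": lambda r: seen.update(["Observation"]),
}

bundle = {"entry": [
    {"resource": {"resourceType": "Patient", "id": "p1"}},
    {"resource": {"resourceType": "Observation", "code": {}}},
    {"resource": {"resourceType": "MedicationRequest"}},  # unhandled -> skipped
]}
parse_bundle(bundle, handlers)
print(dict(seen))  # {'Patient': 1, 'Observation': 1}
```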

---

## 3. Synthetic PDF Generation Pipeline

### 3.1 Overview

Goal: convert a `PatientProfile` into realistic clinical-document PDFs, injecting controlled noise to simulate real-world OCR conditions.

**Tech stack:**
- **ReportLab** (`pip install reportlab`) — PDF generation engine supporting `SimpleDocTemplate`, `Table`, `Paragraph`, and other Platypus flowables
- **Augraphy** (`pip install augraphy`) — document-image degradation pipeline simulating print, fax, and scan noise
- **Pillow** (`pip install Pillow`) — image processing
- **pdf2image** (`pip install pdf2image`) — PDF-to-image conversion (pages are rasterized for noise injection, then converted back to PDF)
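The end-to-end flow is: render a clean PDF with ReportLab, rasterize it with pdf2image, degrade the page images with Augraphy, then reassemble a noisy PDF. A runnable sketch of that staging, with stub functions standing in for the library calls so it executes without any of them installed:

```python
# Stubbed document-synthesis stages; each stub stands in for a library call
# (ReportLab render, pdf2image rasterize, Augraphy degrade, PDF reassembly).
def render_pdf(profile):
    return {"profile": profile, "stage": "pdf"}

def rasterize(pdf):
    return {**pdf, "stage": "image"}

def degrade(image):
    return {**image, "stage": "degraded"}

def reassemble(image):
    return {**image, "stage": "noisy_pdf"}


def synth_document(profile: dict) -> dict:
    """PatientProfile -> clean PDF -> page images -> degraded images -> noisy PDF."""
    return reassemble(degrade(rasterize(render_pdf(profile))))


doc = synth_document({"patient_id": "synth-001"})
print(doc["stage"])  # noisy_pdf
```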

### 3.2 Clinical Letter Template

```python
# data/templates/clinical_letter.py
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.platypus import (
    SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle
)
from reportlab.lib import colors


def generate_clinical_letter(profile: dict, output_path: str):
    """Generate a clinical letter PDF from PatientProfile."""
    doc = SimpleDocTemplate(output_path, pagesize=letter,
                            topMargin=1*inch, bottomMargin=1*inch)
    styles = getSampleStyleSheet()
    story = []

    # Header
    header_style = ParagraphStyle(
        'Header', parent=styles['Heading1'], fontSize=14,
        spaceAfter=6
    )
    story.append(Paragraph("Clinical Summary Letter", header_style))
    story.append(Spacer(1, 12))

    # Patient Info
    info_data = [
        ["Patient Name:", profile["demographics"]["name"]],
        ["Date of Birth:", profile["demographics"]["date_of_birth"]],
        ["Sex:", profile["demographics"]["sex"]],
        ["MRN:", profile["patient_id"]],
    ]
    info_table = Table(info_data, colWidths=[2*inch, 4*inch])
    info_table.setStyle(TableStyle([
        ('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
        ('FONTNAME', (1, 0), (1, -1), 'Helvetica'),
        ('FONTSIZE', (0, 0), (-1, -1), 10),
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
    ]))
    story.append(info_table)
    story.append(Spacer(1, 18))

    # Diagnosis Section
    story.append(Paragraph("Diagnosis", styles['Heading2']))
    dx = profile.get("diagnosis", {})
    dx_text = (
        f"Primary: {dx.get('primary', 'Unknown')}. "
        f"Stage: {dx.get('stage', 'Unknown')}. "
        f"Histology: {dx.get('histology', 'Unknown')}. "
        f"Diagnosed: {dx.get('diagnosis_date', 'Unknown')}."
    )
    story.append(Paragraph(dx_text, styles['Normal']))
    story.append(Spacer(1, 12))

    # Biomarkers Section
    story.append(Paragraph("Molecular Testing", styles['Heading2']))
    bm = profile.get("biomarkers", {})
    bm_data = [["Biomarker", "Result"]]
    for marker, value in bm.items():
        if value is not None:
            bm_data.append([marker.upper(), str(value)])
    if len(bm_data) > 1:
        bm_table = Table(bm_data, colWidths=[2.5*inch, 3.5*inch])
        bm_table.setStyle(TableStyle([
            ('BACKGROUND', (0, 0), (-1, 0), colors.lightgrey),
            ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
            ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
            ('FONTSIZE', (0, 0), (-1, -1), 10),
        ]))
        story.append(bm_table)
    story.append(Spacer(1, 12))

    # Treatment History
    story.append(Paragraph("Treatment History", styles['Heading2']))
    treatments = profile.get("treatments", [])
    for tx in treatments:
        tx_text = f"- {tx['name']} ({tx['type']}): {tx.get('start_date', '')}"
        story.append(Paragraph(tx_text, styles['Normal']))

    doc.build(story)
```
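The templates read a plain-dict profile. A minimal illustrative profile showing the keys `generate_clinical_letter` expects (all values are made up):

```python
# Minimal illustrative profile dict consumed by the templates above.
sample_profile = {
    "patient_id": "synth-001",
    "demographics": {"name": "John Smith", "date_of_birth": "1962-03-14",
                     "sex": "male"},
    "diagnosis": {"primary": "Non-small cell lung cancer", "stage": "Stage IIIA",
                  "histology": "adenocarcinoma", "diagnosis_date": "2023-06-01"},
    "biomarkers": {"egfr": "Exon 19 deletion", "alk": None, "pdl1_tps": "60%"},
    "treatments": [{"name": "Carboplatin/Pemetrexed", "type": "chemotherapy",
                    "start_date": "2023-07-01"}],
    "labs": [],
}

# Only non-None biomarkers are rendered in the molecular-testing table:
rendered = [k.upper() for k, v in sample_profile["biomarkers"].items()
            if v is not None]
print(rendered)  # ['EGFR', 'PDL1_TPS']
```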

### 3.3 Pathology Report Template

```python
# data/templates/pathology_report.py
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table


def generate_pathology_report(profile: dict, output_path: str):
    """Generate a pathology report PDF."""
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    styles = getSampleStyleSheet()
    story = []

    story.append(Paragraph("SURGICAL PATHOLOGY REPORT", styles['Title']))
    story.append(Spacer(1, 12))

    # Specimen Info
    spec_data = [
        ["Specimen:", "Right lung, upper lobe, wedge resection"],
        ["Procedure:", "CT-guided needle biopsy"],
        ["Date:", profile["diagnosis"]["diagnosis_date"]],
    ]
    spec_table = Table(spec_data, colWidths=[2*inch, 4*inch])
    story.append(spec_table)
    story.append(Spacer(1, 12))

    # Final Diagnosis
    story.append(Paragraph("FINAL DIAGNOSIS", styles['Heading2']))
    story.append(Paragraph(
        f"Non-small cell lung carcinoma, {profile['diagnosis'].get('histology', 'adenocarcinoma')}, "
        f"{profile['diagnosis'].get('stage', 'Stage IIIA')}",
        styles['Normal']
    ))

    # Biomarker Results
    story.append(Spacer(1, 12))
    story.append(Paragraph("MOLECULAR/IMMUNOHISTOCHEMISTRY", styles['Heading2']))
    bm = profile.get("biomarkers", {})
    results = []
    if bm.get("egfr"):
        results.append(f"EGFR mutation analysis: {bm['egfr']}")
    if bm.get("alk"):
        results.append(f"ALK rearrangement (FISH): {bm['alk']}")
    if bm.get("pdl1_tps"):
        results.append(f"PD-L1 (22C3, TPS): {bm['pdl1_tps']}")
    if bm.get("kras"):
        results.append(f"KRAS mutation analysis: {bm['kras']}")
    for r in results:
        story.append(Paragraph(r, styles['Normal']))

    doc.build(story)
```

### 3.4 Lab Report Template

```python
# data/templates/lab_report.py
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle


def generate_lab_report(profile: dict, output_path: str):
    """Generate a laboratory report PDF with CBC, CMP, etc."""
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    styles = getSampleStyleSheet()
    story = []

    story.append(Paragraph("LABORATORY REPORT", styles['Title']))
    story.append(Spacer(1, 12))

    # Lab Results Table
    lab_data = [["Test", "Result", "Unit", "Reference Range", "Date"]]
    for lab in profile.get("labs", []):
        lab_data.append([
            lab["name"], str(lab["value"]), lab["unit"],
            "",  # Reference range (can be added)
            lab["date"][:10] if lab["date"] else ""
        ])

    if len(lab_data) > 1:
        lab_table = Table(lab_data, colWidths=[2*inch, 1*inch, 0.8*inch, 1.2*inch, 1*inch])
        lab_table.setStyle(TableStyle([
            ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#003366')),
            ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
            ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
            ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
            ('FONTSIZE', (0, 0), (-1, -1), 9),
            ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.HexColor('#f0f0f0')]),
        ]))
        story.append(lab_table)

    doc.build(story)
```

### 3.5 Noise Injection Strategy

```python
# data/noise/noise_injector.py
import random
import re

from PIL import Image

# Augraphy pipeline configuration (optional dependency)
try:
    from augraphy import (
        AugraphyPipeline, InkBleed, Letterpress, LowInkPeriodicLines,
        DirtyDrum, SubtleNoise, Jpeg, Brightness
    )
    AUGRAPHY_AVAILABLE = True
except ImportError:
    AUGRAPHY_AVAILABLE = False


class NoiseInjector:
    """Controlled noise-injection engine simulating real-world document degradation."""

    # Common OCR confusion mappings (single- and multi-character)
    OCR_ERROR_MAP = {
        "0": ["O", "o", "Q"],
        "1": ["l", "I", "|"],
        "5": ["S", "s"],
        "8": ["B"],
        "O": ["0", "Q"],
        "l": ["1", "I", "|"],
        "rn": ["m"],
        "cl": ["d"],
        "vv": ["w"],
    }

    # Medical abbreviation substitutions
    ABBREVIATION_MAP = {
        "non-small cell lung cancer": ["NSCLC", "non-small cell ca", "NSCC"],
        "adenocarcinoma": ["adeno", "adenoca", "adeno ca"],
        "squamous cell carcinoma": ["SCC", "squamous ca", "sq cell ca"],
        "Eastern Cooperative Oncology Group": ["ECOG"],
        "performance status": ["PS", "perf status"],
        "milligrams per deciliter": ["mg/dL", "mg/dl"],
        "computed tomography": ["CT", "cat scan"],
    }

    # Noise level configuration
    NOISE_LEVELS = {
        "clean": {"ocr_rate": 0.0, "abbrev_rate": 0.0, "missing_rate": 0.0},
        "mild": {"ocr_rate": 0.02, "abbrev_rate": 0.1, "missing_rate": 0.05},
        "moderate": {"ocr_rate": 0.05, "abbrev_rate": 0.2, "missing_rate": 0.1},
        "severe": {"ocr_rate": 0.10, "abbrev_rate": 0.3, "missing_rate": 0.2},
    }

    def __init__(self, noise_level: str = "mild", seed: int = 42):
        self.config = self.NOISE_LEVELS[noise_level]
        self.rng = random.Random(seed)

    def inject_text_noise(self, text: str) -> tuple[str, list[dict]]:
        """Inject OCR errors and abbreviations into text.

        Returns (noisy_text, list_of_injected_noise_records).
        """
        noise_records = []
        chars = list(text)

        # OCR substitutions; try two-character confusions (e.g. "rn" -> "m")
        # before single characters so the multi-char map entries can fire
        i = 0
        while i < len(chars):
            if self.rng.random() < self.config["ocr_rate"]:
                pair = "".join(chars[i:i + 2])
                key = pair if pair in self.OCR_ERROR_MAP else chars[i]
                if key in self.OCR_ERROR_MAP:
                    replacement = self.rng.choice(self.OCR_ERROR_MAP[key])
                    chars[i:i + len(key)] = list(replacement)
                    noise_records.append({
                        "type": "ocr_error",
                        "position": i,
                        "original": key,
                        "replacement": replacement,
                    })
            i += 1

        noisy_text = "".join(chars)

        # Abbreviation substitutions (case-insensitive match)
        for full_form, abbreviations in self.ABBREVIATION_MAP.items():
            if full_form.lower() in noisy_text.lower() and self.rng.random() < self.config["abbrev_rate"]:
                abbrev = self.rng.choice(abbreviations)
                noisy_text = re.sub(
                    re.escape(full_form), abbrev, noisy_text, count=1, flags=re.IGNORECASE
                )
                noise_records.append({
                    "type": "abbreviation",
                    "original": full_form,
                    "replacement": abbrev,
                })

        return noisy_text, noise_records

    def inject_missing_values(self, profile: dict) -> tuple[dict, list[str]]:
        """Randomly remove fields from profile to simulate missing data.

        Returns (modified_profile, list_of_removed_fields).
        """
        removed = []
        removable_fields = [
            ("biomarkers", "egfr"),
            ("biomarkers", "alk"),
            ("biomarkers", "pdl1_tps"),
            ("biomarkers", "kras"),
            ("biomarkers", "ros1"),
            ("diagnosis", "stage"),
            ("diagnosis", "histology"),
        ]

        for section, field_name in removable_fields:
            if self.rng.random() < self.config["missing_rate"]:
                if section in profile and field_name in profile[section]:
                    profile[section][field_name] = None
                    removed.append(f"{section}.{field_name}")

        return profile, removed

    def degrade_image(self, image: Image.Image) -> Image.Image:
        """Apply Augraphy degradation pipeline to document image."""
        if not AUGRAPHY_AVAILABLE:
            return image

        import numpy as np
        img_array = np.array(image)

        pipeline = AugraphyPipeline(
            ink_phase=[
                InkBleed(p=0.5),
                Letterpress(p=0.3),
                LowInkPeriodicLines(p=0.3),
            ],
            paper_phase=[
                SubtleNoise(p=0.5),
            ],
            post_phase=[
                DirtyDrum(p=0.3),
                Brightness(p=0.5),
                Jpeg(p=0.5),
            ],
        )

        degraded = pipeline(img_array)
        return Image.fromarray(degraded)
```
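Because the injector draws from a seeded `random.Random`, noise is reproducible across runs, and the returned records double as ground truth for extraction scoring. A small self-contained illustration of the seeded-substitution idea (simplified, not the class above):

```python
import random

# Tiny subset of an OCR confusion map, for illustration only.
OCR_ERROR_MAP = {"0": ["O"], "1": ["l", "I"]}


def inject(text: str, rate: float, seed: int):
    """Substitute characters at the given rate; return text plus noise records."""
    rng = random.Random(seed)
    out, records = [], []
    for i, ch in enumerate(text):
        if ch in OCR_ERROR_MAP and rng.random() < rate:
            rep = rng.choice(OCR_ERROR_MAP[ch])
            records.append((i, ch, rep))
            out.append(rep)
        else:
            out.append(ch)
    return "".join(out), records


# Same seed -> identical noise, which keeps evaluation sets stable.
a = inject("WBC 10.1", rate=0.5, seed=7)
b = inject("WBC 10.1", rate=0.5, seed=7)
assert a == b
```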

---

## 4. TREC Benchmark Evaluation Guide

### 4.1 Dataset Overview

**TREC Clinical Trials Track 2021:**
- Source: NIST Text REtrieval Conference
- Topics (queries): 75 synthetic patient descriptions (5-10 sentence admission notes)
- Document collection: 376,000+ clinical trials (April 2021 snapshot of ClinicalTrials.gov)
- Qrels: 35,832 relevance judgments
- Relevance labels: 0 = not relevant, 1 = excluded, 2 = eligible

**TREC Clinical Trials Track 2022:**
- Topics: 50 synthetic patient descriptions
- Uses the same document collection snapshot

### 4.2 Data Formats

#### Topics XML format
```xml
<topics task="2021 TREC Clinical Trials">
  <topic number="1">
    A 62-year-old male presents with a 3-month history of
    progressive dyspnea and a 20-pound weight loss. He has
    a 40 pack-year smoking history. CT chest reveals a 4.5cm
    right upper lobe mass with mediastinal lymphadenopathy.
    Biopsy confirms non-small cell lung cancer, adenocarcinoma.
    EGFR mutation testing is positive for exon 19 deletion.
    PD-L1 TPS is 60%. ECOG performance status is 1.
  </topic>
  <topic number="2">
    ...
  </topic>
</topics>
```

#### Qrels format (tab-separated)
```
topic_id 0 doc_id relevance
1 0 NCT00760162 2
1 0 NCT01234567 1
1 0 NCT09876543 0
```
- Column 1: topic number
- Column 2: fixed value 0 (iteration)
- Column 3: NCT document ID
- Column 4: relevance (0 = not relevant, 1 = excluded, 2 = eligible)

#### Run submission format
```
TOPIC_NO Q0 NCT_ID RANK SCORE RUN_NAME
1 Q0 NCT00760162 1 0.9999 trialpath-v1
1 Q0 NCT01234567 2 0.9998 trialpath-v1
```
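Run files are easy to break with an extra column or a missing `Q0`, so a quick stdlib sanity check before submission is worthwhile (a hypothetical helper, not part of any TREC tooling):

```python
def validate_run_line(line: str) -> bool:
    """Check one TREC run line: TOPIC_NO Q0 NCT_ID RANK SCORE RUN_NAME."""
    parts = line.split()
    if len(parts) != 6 or parts[1] != "Q0":
        return False
    try:
        int(parts[3])    # rank must be an integer
        float(parts[4])  # score must be numeric
    except ValueError:
        return False
    return True


print(validate_run_line("1 Q0 NCT00760162 1 0.9999 trialpath-v1"))  # True
print(validate_run_line("1 NCT00760162 1 0.9999 trialpath-v1"))     # False
```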

### 4.3 Loading Data with ir_datasets

```python
# evaluation/run_trec_benchmark.py
import ir_datasets


def load_trec_2021():
    """Load TREC CT 2021 topics and qrels via ir_datasets."""
    dataset = ir_datasets.load("clinicaltrials/2021/trec-ct-2021")

    # Load topics (GenericQuery: query_id, text)
    topics = {}
    for query in dataset.queries_iter():
        topics[query.query_id] = query.text

    # Load qrels (TrecQrel: query_id, doc_id, relevance, iteration)
    qrels = {}
    for qrel in dataset.qrels_iter():
        if qrel.query_id not in qrels:
            qrels[qrel.query_id] = {}
        qrels[qrel.query_id][qrel.doc_id] = qrel.relevance

    return topics, qrels


def load_trec_2022():
    """Load TREC CT 2022 topics and qrels."""
    dataset = ir_datasets.load("clinicaltrials/2021/trec-ct-2022")

    topics = {q.query_id: q.text for q in dataset.queries_iter()}
    qrels = {}
    for qrel in dataset.qrels_iter():
        if qrel.query_id not in qrels:
            qrels[qrel.query_id] = {}
        qrels[qrel.query_id][qrel.doc_id] = qrel.relevance

    return topics, qrels


def load_trial_documents():
    """Load the clinical trial documents from ir_datasets."""
    dataset = ir_datasets.load("clinicaltrials/2021")
    # ClinicalTrialsDoc: doc_id, title, condition, summary,
    # detailed_description, eligibility
    docs = {}
    for doc in dataset.docs_iter():
        docs[doc.doc_id] = {
            "title": doc.title,
            "condition": doc.condition,
            "summary": doc.summary,
            "detailed_description": doc.detailed_description,
            "eligibility": doc.eligibility,
        }
    return docs
```

### 4.4 Mapping TrialPath Output to TREC Run Format

```python
def convert_trialpath_to_trec_run(
    results: dict[str, list[dict]],
    run_name: str = "trialpath-v1"
) -> str:
    """Convert TrialPath matching results to TREC run format.

    Args:
        results: {topic_id: [{"nct_id": str, "score": float}, ...]}
        run_name: Run identifier

    Returns:
        TREC-format run string
    """
    lines = []
    for topic_id, candidates in results.items():
        sorted_candidates = sorted(candidates, key=lambda x: x["score"], reverse=True)
        for rank, candidate in enumerate(sorted_candidates[:1000], 1):
            lines.append(
                f"{topic_id} Q0 {candidate['nct_id']} {rank} "
                f"{candidate['score']:.6f} {run_name}"
            )
    return "\n".join(lines)


def save_trec_run(run_str: str, output_path: str):
    """Save TREC run to file."""
    with open(output_path, 'w') as f:
        f.write(run_str)
```

### 4.5 Computing Evaluation Metrics with ir-measures

```python
# evaluation/run_trec_benchmark.py (continued)
import ir_measures
from ir_measures import nDCG, P, Recall, AP, RR


def evaluate_trec_run(
    qrels_path: str,
    run_path: str,
) -> dict:
    """Evaluate a TREC run using ir-measures.

    Target metrics:
    - Recall@50 >= 0.75
    - NDCG@10 >= 0.60
    - P@10 (informational)
    """
    qrels = list(ir_measures.read_trec_qrels(qrels_path))
    run = list(ir_measures.read_trec_run(run_path))

    # Define target measures
    measures = [
        nDCG@10,     # Target >= 0.60
        Recall@50,   # Target >= 0.75
        P@10,        # Precision at 10
        AP,          # Mean Average Precision
        RR,          # Reciprocal Rank
        nDCG@20,     # Additional depth
        Recall@100,  # Extended recall
    ]

    # Aggregate metrics
    aggregate = ir_measures.calc_aggregate(measures, qrels, run)

    # Per-query metrics
    per_query = {}
    for metric in ir_measures.iter_calc(measures, qrels, run):
        qid = metric.query_id
        if qid not in per_query:
            per_query[qid] = {}
        per_query[qid][str(metric.measure)] = metric.value

    return {
        "aggregate": {str(k): v for k, v in aggregate.items()},
        "per_query": per_query,
        "pass_fail": {
            "ndcg@10": aggregate.get(nDCG@10, 0) >= 0.60,
            "recall@50": aggregate.get(Recall@50, 0) >= 0.75,
        }
    }


def evaluate_with_eligibility_levels(
    qrels_path: str,
    run_path: str,
) -> dict:
    """Evaluate with TREC CT graded relevance (0=NR, 1=Excluded, 2=Eligible).

    Uses rel=2 for strict eligible-only evaluation.
    """
    qrels = list(ir_measures.read_trec_qrels(qrels_path))
    run = list(ir_measures.read_trec_run(run_path))

    # Standard evaluation (relevance >= 1)
    standard_measures = [nDCG@10, Recall@50, P@10]
    standard = ir_measures.calc_aggregate(standard_measures, qrels, run)

    # Strict evaluation (only eligible = relevance 2)
    strict_measures = [
        AP(rel=2),
        P(rel=2)@10,
        Recall(rel=2)@50,
    ]
    strict = ir_measures.calc_aggregate(strict_measures, qrels, run)

    return {
        "standard": {str(k): v for k, v in standard.items()},
        "strict_eligible_only": {str(k): v for k, v in strict.items()},
    }
```

### 4.6 Alternative In-Memory qrels/run Format with ir-measures

```python
def evaluate_from_dicts(
    qrels_dict: dict[str, dict[str, int]],
    run_dict: dict[str, list[tuple[str, float]]],
) -> dict:
    """Evaluate using Python dict format (no files needed).

    Args:
        qrels_dict: {query_id: {doc_id: relevance}}
        run_dict: {query_id: [(doc_id, score), ...]}
    """
    # Convert to ir-measures format
    qrels = [
        ir_measures.Qrel(qid, did, rel)
        for qid, docs in qrels_dict.items()
        for did, rel in docs.items()
    ]
    run = [
        ir_measures.ScoredDoc(qid, did, score)
        for qid, docs in run_dict.items()
        for did, score in docs
    ]

    measures = [nDCG@10, Recall@50, P@10, AP]
    aggregate = ir_measures.calc_aggregate(measures, qrels, run)
    return {str(k): v for k, v in aggregate.items()}
```

---

## 5. MedGemma Extraction Evaluation

### 5.1 Annotated Dataset Design

```python
# evaluation/extraction_eval.py
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedField:
    """A single annotated field with ground truth and extraction result."""
    field_name: str               # e.g., "biomarkers.egfr"
    ground_truth: Optional[str]   # From Synthea profile (gold standard)
    extracted: Optional[str]      # From MedGemma extraction
    evidence_span: Optional[str]  # Text span in source document
    source_page: Optional[int]    # Page number in PDF


@dataclass
class ExtractionAnnotation:
    """Complete annotation for one patient's extraction."""
    patient_id: str
    fields: list[AnnotatedField]
    noise_level: str    # "clean", "mild", "moderate", "severe"
    document_type: str  # "clinical_letter", "pathology_report", etc.
```

**Annotated dataset structure:**
```json
{
  "patient_id": "synth-001",
  "noise_level": "mild",
  "document_type": "clinical_letter",
  "fields": [
    {
      "field_name": "demographics.name",
      "ground_truth": "John Smith",
      "extracted": "John Smith",
      "correct": true
    },
    {
      "field_name": "diagnosis.stage",
      "ground_truth": "Stage IIIA",
      "extracted": "Stage 3A",
      "correct": true,
      "note": "Equivalent representation"
    },
    {
      "field_name": "biomarkers.egfr",
      "ground_truth": "Exon 19 deletion",
      "extracted": "EGFR positive",
      "correct": false,
      "note": "Partial extraction - missing specific mutation"
    }
  ]
}
```

### 5.2 Field-Level F1 Computation

```python
# evaluation/extraction_eval.py
from sklearn.metrics import (
    f1_score, precision_score, recall_score, classification_report
)
import numpy as np


# All extractable fields
EXTRACTION_FIELDS = [
    "demographics.name",
    "demographics.sex",
    "demographics.date_of_birth",
    "demographics.age",
    "diagnosis.primary",
    "diagnosis.stage",
    "diagnosis.histology",
    "biomarkers.egfr",
    "biomarkers.alk",
    "biomarkers.pdl1_tps",
    "biomarkers.kras",
    "biomarkers.ros1",
    "labs.wbc",
    "labs.hemoglobin",
    "labs.platelets",
    "labs.creatinine",
    "labs.alt",
    "labs.ast",
    "treatments.current_regimen",
    "performance_status.ecog",
]


def compute_field_level_f1(
    annotations: list[dict],
) -> dict:
    """Compute field-level F1, precision, recall.

    For each field:
    - TP: ground_truth exists AND extracted matches
    - FP: extracted exists BUT ground_truth is None or mismatch
    - FN: ground_truth exists BUT extracted is None or mismatch

    Args:
        annotations: List of patient annotation dicts

    Returns:
        Per-field and aggregate metrics
    """
    field_metrics = {}

    for field_name in EXTRACTION_FIELDS:
        y_true = []  # 1 if field has ground truth value
        y_pred = []  # 1 if field was correctly extracted

        for ann in annotations:
            fields = {f["field_name"]: f for f in ann["fields"]}
            if field_name in fields:
                f = fields[field_name]
                has_gt = f["ground_truth"] is not None
                is_correct = f.get("correct", False)

                y_true.append(1 if has_gt else 0)
                y_pred.append(1 if is_correct else 0)

        if len(y_true) > 0:
            precision = precision_score(y_true, y_pred, zero_division=0)
            recall = recall_score(y_true, y_pred, zero_division=0)
            f1 = f1_score(y_true, y_pred, zero_division=0)
            field_metrics[field_name] = {
                "precision": round(precision, 4),
                "recall": round(recall, 4),
                "f1": round(f1, 4),
                "support": sum(y_true),
            }

    # Aggregate metrics
    all_y_true = []
    all_y_pred = []
    for ann in annotations:
        for f in ann["fields"]:
            has_gt = f["ground_truth"] is not None
            is_correct = f.get("correct", False)
            all_y_true.append(1 if has_gt else 0)
            all_y_pred.append(1 if is_correct else 0)

    micro_f1 = f1_score(all_y_true, all_y_pred, zero_division=0)
    macro_f1 = np.mean([m["f1"] for m in field_metrics.values()])

    return {
        "per_field": field_metrics,
        "micro_f1": round(micro_f1, 4),
        "macro_f1": round(macro_f1, 4),
        "total_fields": len(all_y_true),
        "pass": micro_f1 >= 0.85,  # Target: F1 >= 0.85
    }


def compute_extraction_report(annotations: list[dict]) -> str:
    """Generate a scikit-learn classification_report style output."""
    all_y_true = []
    all_y_pred = []

    for field_name in EXTRACTION_FIELDS:
        for ann in annotations:
            fields = {f["field_name"]: f for f in ann["fields"]}
            if field_name in fields:
                f = fields[field_name]
                has_gt = f["ground_truth"] is not None
                is_correct = f.get("correct", False)
                all_y_true.append(1 if has_gt else 0)
                all_y_pred.append(1 if is_correct else 0)

    return classification_report(
        all_y_true, all_y_pred,
        target_names=["absent", "present/correct"],
        digits=4,
    )


def compare_with_baseline(
    medgemma_annotations: list[dict],
    gemini_only_annotations: list[dict],
) -> dict:
    """Compare MedGemma extraction vs Gemini-only baseline."""
    medgemma_metrics = compute_field_level_f1(medgemma_annotations)
    gemini_metrics = compute_field_level_f1(gemini_only_annotations)

    comparison = {}
    for field_name in EXTRACTION_FIELDS:
        mg = medgemma_metrics["per_field"].get(field_name, {})
        gm = gemini_metrics["per_field"].get(field_name, {})
        comparison[field_name] = {
            "medgemma_f1": mg.get("f1", 0),
            "gemini_f1": gm.get("f1", 0),
            "delta": round(mg.get("f1", 0) - gm.get("f1", 0), 4),
        }

    return {
        "per_field_comparison": comparison,
        "medgemma_overall_f1": medgemma_metrics["micro_f1"],
        "gemini_overall_f1": gemini_metrics["micro_f1"],
        "improvement": round(
            medgemma_metrics["micro_f1"] - gemini_metrics["micro_f1"], 4
        ),
    }
```
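The code above distinguishes micro-F1 (pooling every field decision) from macro-F1 (averaging per-field F1, so rare biomarker fields count as much as common demographics). A toy contrast with hand-picked (tp, fp, fn) counts, in pure Python:

```python
def f1(tp, fp, fn):
    """Standard F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


# Field A: common and well extracted; field B: rare and poorly extracted.
fields = {"A": (90, 5, 5), "B": (1, 4, 5)}

# Micro pools all counts; macro averages per-field scores.
micro = f1(*map(sum, zip(*fields.values())))
macro = sum(f1(*c) for c in fields.values()) / len(fields)
print(round(micro, 3), round(macro, 3))  # 0.905 0.565
```

The gap shows why both are reported: micro-F1 hides a collapse on rare fields, macro-F1 surfaces it.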

### 5.3 Impact of Noise Level on Extraction Performance

```python
def analyze_noise_impact(annotations: list[dict]) -> dict:
    """Analyze how noise level affects extraction F1."""
    by_noise = {}
    for ann in annotations:
        level = ann["noise_level"]
        if level not in by_noise:
            by_noise[level] = []
        by_noise[level].append(ann)

    results = {}
    for level, level_anns in by_noise.items():
        metrics = compute_field_level_f1(level_anns)
        results[level] = {
            "micro_f1": metrics["micro_f1"],
            "macro_f1": metrics["macro_f1"],
            "n_patients": len(level_anns),
        }

    return results
```

---

## 6. End-to-End Evaluation Pipeline

### 6.1 Criterion Decision Accuracy

```python
# evaluation/criterion_eval.py

def compute_criterion_accuracy(
    predictions: list[dict],
    ground_truth: list[dict],
) -> dict:
    """Compute criterion-level decision accuracy.

    Each prediction/ground_truth entry:
        {
            "patient_id": str,
            "trial_id": str,
            "criteria": [
                {"criterion_id": str, "decision": "met"|"not_met"|"unknown",
                 "evidence": str}
            ]
        }

    Target: >= 0.85
    """
    total = 0
    correct = 0
    by_decision_type = {"met": {"tp": 0, "total": 0},
                        "not_met": {"tp": 0, "total": 0},
                        "unknown": {"tp": 0, "total": 0}}

    for pred, gt in zip(predictions, ground_truth):
        assert pred["patient_id"] == gt["patient_id"]
        assert pred["trial_id"] == gt["trial_id"]

        gt_map = {c["criterion_id"]: c["decision"] for c in gt["criteria"]}

        for criterion in pred["criteria"]:
            cid = criterion["criterion_id"]
            if cid in gt_map:
                total += 1
                gt_decision = gt_map[cid]
                pred_decision = criterion["decision"]
                by_decision_type[gt_decision]["total"] += 1
                if pred_decision == gt_decision:
                    correct += 1
                    by_decision_type[gt_decision]["tp"] += 1

    accuracy = correct / total if total > 0 else 0.0

    return {
        "overall_accuracy": round(accuracy, 4),
        "total_criteria": total,
        "correct": correct,
        "pass": accuracy >= 0.85,
        "by_decision_type": {
            k: {
                "accuracy": round(v["tp"] / v["total"], 4) if v["total"] > 0 else 0,
                "support": v["total"],
            }
            for k, v in by_decision_type.items()
        },
    }
```

### 6.2 Latency and Cost Benchmarking

```python
# evaluation/latency_cost_tracker.py
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class APICallRecord:
    """Record of a single API call."""
    service: str    # "medgemma", "gemini", "clinicaltrials_mcp"
    operation: str  # "extract", "search", "evaluate_criterion"
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    timestamp: str = ""


@dataclass
class SessionMetrics:
    """Aggregate metrics for a patient matching session."""
    patient_id: str
    total_latency_ms: float = 0.0
    total_cost_usd: float = 0.0
    api_calls: list[APICallRecord] = field(default_factory=list)

    @property
    def total_latency_s(self) -> float:
        return self.total_latency_ms / 1000.0

    @property
    def pass_latency(self) -> bool:
        """Target: < 15s per session."""
        return self.total_latency_s < 15.0

    @property
    def pass_cost(self) -> bool:
        """Target: < $0.50 per session."""
        return self.total_cost_usd < 0.50


class LatencyCostTracker:
    """Track latency and cost across API calls."""

    # Pricing per 1M tokens (approximate)
    PRICING = {
        "medgemma": {"input": 0.0, "output": 0.0},            # Self-hosted
        "gemini": {"input": 1.25, "output": 5.00},            # Gemini Pro
        "clinicaltrials_mcp": {"input": 0.0, "output": 0.0},  # Free API
    }

    def __init__(self):
        self.sessions: list[SessionMetrics] = []
        self._current_session: Optional[SessionMetrics] = None

    def start_session(self, patient_id: str):
        self._current_session = SessionMetrics(patient_id=patient_id)

    def end_session(self) -> SessionMetrics:
        session = self._current_session
        if session:
            session.total_latency_ms = sum(c.latency_ms for c in session.api_calls)
            session.total_cost_usd = sum(c.cost_usd for c in session.api_calls)
            self.sessions.append(session)
        self._current_session = None
        return session

    @contextmanager
    def track_call(self, service: str, operation: str):
        """Context manager to track an API call."""
        start = time.monotonic()
        record = APICallRecord(service=service, operation=operation, latency_ms=0)
        try:
            yield record
        finally:
            record.latency_ms = (time.monotonic() - start) * 1000
            # Compute cost from the token counts recorded by the caller
            pricing = self.PRICING.get(service, {"input": 0, "output": 0})
            record.cost_usd = (
                record.input_tokens * pricing["input"] / 1_000_000
                + record.output_tokens * pricing["output"] / 1_000_000
            )
            if self._current_session:
                self._current_session.api_calls.append(record)

    def summary(self) -> dict:
        """Generate aggregate summary across all sessions."""
        if not self.sessions:
            return {}

        latencies = [s.total_latency_s for s in self.sessions]
        costs = [s.total_cost_usd for s in self.sessions]

        return {
            "n_sessions": len(self.sessions),
            "latency": {
                "mean_s": round(sum(latencies) / len(latencies), 2),
                "p50_s": round(sorted(latencies)[len(latencies) // 2], 2),
                "p95_s": round(sorted(latencies)[int(len(latencies) * 0.95)], 2),
                "max_s": round(max(latencies), 2),
                "pass_rate": round(
                    sum(1 for s in self.sessions if s.pass_latency) / len(self.sessions), 4
                ),
            },
            "cost": {
                "mean_usd": round(sum(costs) / len(costs), 4),
                "total_usd": round(sum(costs), 4),
                "max_usd": round(max(costs), 4),
                "pass_rate": round(
                    sum(1 for s in self.sessions if s.pass_cost) / len(self.sessions), 4
                ),
            },
            "targets": {
                "latency_pass": all(s.pass_latency for s in self.sessions),
                "cost_pass": all(s.pass_cost for s in self.sessions),
            },
        }
```
+
1636
+ ---
1637
+
1638
+ ## 7. TDD Test Cases
1639
+
1640
+ ### 7.1 Synthea Data Validation Tests
1641
+
1642
+ ```python
1643
+ # tests/test_synthea_data.py
1644
+ import pytest
+ import json
+ from pathlib import Path
+
+ from data.generate_synthetic_patients import parse_fhir_bundle
1647
+
1648
+ # Expected FHIR resource types
1649
+ REQUIRED_RESOURCE_TYPES = {"Patient", "Condition", "Observation", "Encounter"}
1650
+
1651
+
1652
+ class TestSyntheaDataValidation:
1653
+ """Validate Synthea FHIR output for TrialPath requirements."""
1654
+
1655
+ def test_fhir_bundle_is_valid_json(self, fhir_file):
1656
+ """Bundle must be valid JSON."""
1657
+ with open(fhir_file) as f:
1658
+ data = json.load(f)
1659
+ assert data["resourceType"] == "Bundle"
1660
+ assert "entry" in data
1661
+
1662
+ def test_bundle_contains_required_resources(self, fhir_file):
1663
+ """Bundle must contain Patient, Condition, Observation, Encounter."""
1664
+ with open(fhir_file) as f:
1665
+ bundle = json.load(f)
1666
+ resource_types = {
1667
+ e["resource"]["resourceType"] for e in bundle["entry"]
1668
+ }
1669
+ for rt in REQUIRED_RESOURCE_TYPES:
1670
+ assert rt in resource_types, f"Missing {rt} resource"
1671
+
1672
+ def test_patient_has_demographics(self, fhir_file):
1673
+ """Patient resource must have name, gender, birthDate."""
1674
+ with open(fhir_file) as f:
1675
+ bundle = json.load(f)
1676
+ patients = [
1677
+ e["resource"] for e in bundle["entry"]
1678
+ if e["resource"]["resourceType"] == "Patient"
1679
+ ]
1680
+ assert len(patients) == 1
1681
+ patient = patients[0]
1682
+ assert "name" in patient
1683
+ assert "gender" in patient
1684
+ assert "birthDate" in patient
1685
+
1686
+ def test_lung_cancer_condition_present(self, fhir_file):
1687
+ """At least one Condition must be NSCLC or lung cancer."""
1688
+ with open(fhir_file) as f:
1689
+ bundle = json.load(f)
1690
+ conditions = [
1691
+ e["resource"] for e in bundle["entry"]
1692
+ if e["resource"]["resourceType"] == "Condition"
1693
+ ]
1694
+ lung_cancer_codes = {"254637007", "254632001", "162573006"}
1695
+ has_lung_cancer = False
1696
+ for cond in conditions:
1697
+ codings = cond.get("code", {}).get("coding", [])
1698
+ for c in codings:
1699
+ if c.get("code") in lung_cancer_codes:
1700
+ has_lung_cancer = True
1701
+ assert has_lung_cancer, "No lung cancer Condition found"
1702
+
1703
+ def test_patient_profile_conversion(self, fhir_file):
1704
+ """FHIR Bundle must convert to valid PatientProfile."""
1705
+ profile = parse_fhir_bundle(Path(fhir_file))
1706
+ assert profile.patient_id != ""
1707
+ assert profile.demographics.name != ""
1708
+ assert profile.demographics.sex in ("male", "female")
1709
+ assert profile.diagnosis.primary != ""
1710
+
1711
+ def test_batch_generation_produces_500_patients(self, output_dir):
1712
+ """Batch generation must produce at least 500 FHIR files."""
1713
+ fhir_files = list(Path(output_dir).glob("*.json"))
1714
+ assert len(fhir_files) >= 500
1715
+
1716
+ def test_nsclc_ratio(self, all_profiles):
1717
+ """~85% of lung cancer patients should be NSCLC."""
1718
+ nsclc_count = sum(
1719
+ 1 for p in all_profiles
1720
+ if "non-small cell" in p.diagnosis.primary.lower()
1721
+ or "nsclc" in p.diagnosis.primary.lower()
1722
+ )
1723
+ ratio = nsclc_count / len(all_profiles)
1724
+ assert 0.70 <= ratio <= 0.95, f"NSCLC ratio {ratio} outside expected range"
1725
+ ```
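The tests above rely on shared fixtures (`fhir_file`, `output_dir`, `all_profiles`) that are not defined in this guide. A minimal `conftest.py` sketch of what they could look like — the output directory path and session scoping here are assumptions, not part of the spec:

```python
# conftest.py — illustrative fixtures for the Synthea validation tests.
# SYNTHEA_OUTPUT_DIR is an assumed path; adjust to the actual batch output.
from pathlib import Path

import pytest

SYNTHEA_OUTPUT_DIR = Path("data/synthea_output/fhir")


@pytest.fixture(params=sorted(SYNTHEA_OUTPUT_DIR.glob("*.json")))
def fhir_file(request):
    """Parametrize each test over every generated FHIR bundle."""
    return str(request.param)


@pytest.fixture
def output_dir():
    """Directory containing the batch-generated FHIR files."""
    return str(SYNTHEA_OUTPUT_DIR)


@pytest.fixture(scope="session")
def all_profiles():
    """PatientProfiles parsed from every bundle (used by the ratio test)."""
    from data.generate_synthetic_patients import parse_fhir_bundle

    return [parse_fhir_bundle(p) for p in sorted(SYNTHEA_OUTPUT_DIR.glob("*.json"))]
```

Parametrizing `fhir_file` over the glob means every per-bundle test runs once per patient, which keeps the tests independent of how many patients the batch produced.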
1726
+
1727
+ ### 7.2 PDF Generation Correctness Tests
1728
+
1729
+ ```python
1730
+ # tests/test_pdf_generation.py
1731
+ import pytest
1732
+ from pathlib import Path
1733
+ from data.templates.clinical_letter import generate_clinical_letter
1734
+ from data.templates.pathology_report import generate_pathology_report
1735
+ from data.templates.lab_report import generate_lab_report
1736
+
1737
+
1738
+ class TestPDFGeneration:
1739
+ """Test that PDF generation produces valid documents."""
1740
+
1741
+ SAMPLE_PROFILE = {
1742
+ "patient_id": "test-001",
1743
+ "demographics": {
1744
+ "name": "Jane Doe",
1745
+ "sex": "female",
1746
+ "date_of_birth": "1960-05-15",
1747
+ },
1748
+ "diagnosis": {
1749
+ "primary": "Non-small cell lung cancer, adenocarcinoma",
1750
+ "stage": "Stage IIIA",
1751
+ "histology": "adenocarcinoma",
1752
+ "diagnosis_date": "2024-01-15",
1753
+ },
1754
+ "biomarkers": {
1755
+ "egfr": "Exon 19 deletion",
1756
+ "alk": "Negative",
1757
+ "pdl1_tps": "60%",
1758
+ "kras": None,
1759
+ },
1760
+ "labs": [
1761
+ {"name": "WBC", "value": 7.2, "unit": "10*3/uL", "date": "2024-01-10", "loinc_code": "6690-2"},
1762
+ {"name": "Hemoglobin", "value": 12.5, "unit": "g/dL", "date": "2024-01-10", "loinc_code": "718-7"},
1763
+ ],
1764
+ "treatments": [
1765
+ {"name": "Cisplatin", "type": "medication", "start_date": "2024-02-01"},
1766
+ ],
1767
+ }
1768
+
1769
+ def test_clinical_letter_generates_pdf(self, tmp_path):
1770
+ """Clinical letter must generate a non-empty PDF file."""
1771
+ output = tmp_path / "letter.pdf"
1772
+ generate_clinical_letter(self.SAMPLE_PROFILE, str(output))
1773
+ assert output.exists()
1774
+ assert output.stat().st_size > 0
1775
+
1776
+ def test_pathology_report_generates_pdf(self, tmp_path):
1777
+ """Pathology report must generate a non-empty PDF file."""
1778
+ output = tmp_path / "pathology.pdf"
1779
+ generate_pathology_report(self.SAMPLE_PROFILE, str(output))
1780
+ assert output.exists()
1781
+ assert output.stat().st_size > 0
1782
+
1783
+ def test_lab_report_generates_pdf(self, tmp_path):
1784
+ """Lab report must generate a non-empty PDF file."""
1785
+ output = tmp_path / "lab.pdf"
1786
+ generate_lab_report(self.SAMPLE_PROFILE, str(output))
1787
+ assert output.exists()
1788
+ assert output.stat().st_size > 0
1789
+
1790
+ def test_pdf_contains_patient_name(self, tmp_path):
1791
+ """Generated PDF must contain patient name (OCR-verifiable)."""
1792
+ output = tmp_path / "letter.pdf"
1793
+ generate_clinical_letter(self.SAMPLE_PROFILE, str(output))
1794
+ # Read PDF text (using pdfplumber or PyPDF2)
1795
+ import pdfplumber
1796
+ with pdfplumber.open(str(output)) as pdf:
1797
+ text = ""
1798
+ for page in pdf.pages:
1799
+ text += page.extract_text() or ""
1800
+ assert "Jane Doe" in text
1801
+
1802
+ def test_pdf_contains_biomarkers(self, tmp_path):
1803
+ """Generated PDF must contain biomarker results."""
1804
+ output = tmp_path / "pathology.pdf"
1805
+ generate_pathology_report(self.SAMPLE_PROFILE, str(output))
1806
+ import pdfplumber
1807
+ with pdfplumber.open(str(output)) as pdf:
1808
+ text = ""
1809
+ for page in pdf.pages:
1810
+ text += page.extract_text() or ""
1811
+ assert "EGFR" in text
1812
+ assert "Exon 19" in text or "positive" in text.lower()
1813
+
1814
+ def test_missing_biomarker_handled_gracefully(self, tmp_path):
1815
+ """PDF generation should not crash when biomarkers are None."""
1816
+ profile = self.SAMPLE_PROFILE.copy()
1817
+ profile["biomarkers"] = {
1818
+ "egfr": None, "alk": None, "pdl1_tps": None, "kras": None
1819
+ }
1820
+ output = tmp_path / "letter.pdf"
1821
+ generate_clinical_letter(profile, str(output))
1822
+ assert output.exists()
1823
+ ```
1824
+
1825
+ ### 7.3 Noise Injection Validation Tests
1826
+
1827
+ ```python
1828
+ # tests/test_noise_injection.py
1829
+ import pytest
1830
+ from data.noise.noise_injector import NoiseInjector
1831
+
1832
+
1833
+ class TestNoiseInjection:
1834
+ """Test noise injection produces expected results."""
1835
+
1836
+ def test_clean_noise_no_changes(self):
1837
+ """Clean level should produce no changes."""
1838
+ injector = NoiseInjector(noise_level="clean", seed=42)
1839
+ text = "Patient has EGFR mutation positive"
1840
+ noisy, records = injector.inject_text_noise(text)
1841
+ assert noisy == text
1842
+ assert len(records) == 0
1843
+
1844
+ def test_mild_noise_produces_some_changes(self):
1845
+ """Mild noise should produce some but limited changes."""
1846
+ injector = NoiseInjector(noise_level="mild", seed=42)
1847
+ # Use longer text to increase chance of noise
1848
+ text = "The patient is a 65 year old male with stage IIIA " * 10
1849
+ noisy, records = injector.inject_text_noise(text)
1850
+ # A fixed change count would be brittle (mild noise is probabilistic),
+ # so verify the returned pair is well-formed instead.
+ assert isinstance(noisy, str)
+ assert isinstance(records, list)
1852
+
1853
+ def test_severe_noise_produces_many_changes(self):
1854
+ """Severe noise should produce noticeable changes."""
1855
+ injector = NoiseInjector(noise_level="severe", seed=42)
1856
+ text = "The 50 year old patient has stage 1 NSCLC " * 20
1857
+ noisy, records = injector.inject_text_noise(text)
1858
+ assert noisy != text # Should differ from original
1859
+ assert len(records) > 0
1860
+
1861
+ def test_ocr_error_types_are_valid(self):
1862
+ """OCR errors should only substitute known character pairs."""
1863
+ injector = NoiseInjector(noise_level="severe", seed=42)
1864
+ text = "0123456789 OIBS" * 10
1865
+ _, records = injector.inject_text_noise(text)
1866
+ for r in records:
1867
+ if r["type"] == "ocr_error":
1868
+ assert r["original"] in NoiseInjector.OCR_ERROR_MAP
1869
+ assert r["replacement"] in NoiseInjector.OCR_ERROR_MAP[r["original"]]
1870
+
1871
+ def test_missing_value_injection(self):
1872
+ """Missing value injection should remove some fields."""
1873
+ injector = NoiseInjector(noise_level="moderate", seed=42)
1874
+ profile = {
1875
+ "biomarkers": {"egfr": "positive", "alk": "negative",
1876
+ "pdl1_tps": "60%", "kras": "negative", "ros1": "negative"},
1877
+ "diagnosis": {"stage": "IIIA", "histology": "adenocarcinoma"},
1878
+ }
1879
+ modified, removed = injector.inject_missing_values(profile)
1880
+ # At 10% rate with 7 fields, expect 0-3 removals
1881
+ assert len(removed) <= 7
1882
+ for field_path in removed:
1883
+ section, field_name = field_path.split(".")
1884
+ assert modified[section][field_name] is None
1885
+
1886
+ def test_noise_is_deterministic_with_seed(self):
1887
+ """Same seed should produce identical results."""
1888
+ text = "Patient has stage IIIA non-small cell lung cancer"
1889
+ inj1 = NoiseInjector(noise_level="moderate", seed=123)
1890
+ inj2 = NoiseInjector(noise_level="moderate", seed=123)
1891
+ noisy1, _ = inj1.inject_text_noise(text)
1892
+ noisy2, _ = inj2.inject_text_noise(text)
1893
+ assert noisy1 == noisy2
1894
+
1895
+ def test_different_seeds_produce_different_results(self):
1896
+ """Different seeds should generally produce different noise."""
1897
+ text = "The 50 year old patient has 10 biomarker tests 0 1 5 8" * 20
1898
+ inj1 = NoiseInjector(noise_level="severe", seed=1)
1899
+ inj2 = NoiseInjector(noise_level="severe", seed=999)
1900
+ noisy1, _ = inj1.inject_text_noise(text)
1901
+ noisy2, _ = inj2.inject_text_noise(text)
1902
+ # With severe noise on long text, different seeds should differ
1903
+ assert noisy1 != noisy2
1904
+ ```
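The determinism tests above only work if the injector seeds its own private RNG rather than the global `random` module. A toy sketch of that pattern — this is an illustration, not the project's `NoiseInjector`; the error map and 5% substitution rate are made-up values:

```python
import random


class SeededOCRNoiser:
    """Toy character-substitution noiser showing the seeded-RNG pattern
    the determinism tests rely on (illustrative, not the real class)."""

    OCR_ERROR_MAP = {"0": ["O"], "1": ["l", "I"], "5": ["S"], "8": ["B"]}

    def __init__(self, seed: int, rate: float = 0.05):
        self._rng = random.Random(seed)  # private RNG -> reproducible runs
        self._rate = rate

    def inject(self, text: str):
        out, records = [], []
        for i, ch in enumerate(text):
            subs = self.OCR_ERROR_MAP.get(ch)
            if subs and self._rng.random() < self._rate:
                repl = self._rng.choice(subs)
                records.append({"pos": i, "original": ch, "replacement": repl})
                out.append(repl)
            else:
                out.append(ch)
        return "".join(out), records


# Same seed -> byte-identical output, as test_noise_is_deterministic_with_seed expects
a, _ = SeededOCRNoiser(seed=123).inject("0123456789 " * 50)
b, _ = SeededOCRNoiser(seed=123).inject("0123456789 " * 50)
assert a == b
```

Using `random.Random(seed)` instead of the module-level functions also keeps the injector from interfering with any other randomized code running in the same process.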
1905
+
1906
+ ### 7.4 TREC Evaluation Metric Tests
1907
+
1908
+ ```python
1909
+ # tests/test_trec_evaluation.py
1910
+ import pytest
+ import ir_measures
+ from ir_measures import nDCG, Recall, P, AP
+
+ # Assumed module path for the run-format helper exercised below
+ from evaluation.trec_eval import convert_trialpath_to_trec_run
1913
+
1914
+
1915
+ class TestTRECEvaluation:
1916
+ """Test TREC evaluation metric computation."""
1917
+
1918
+ @pytest.fixture
1919
+ def sample_qrels(self):
1920
+ """Sample qrels with known ground truth."""
1921
+ return [
1922
+ ir_measures.Qrel("q1", "d1", 2), # eligible
1923
+ ir_measures.Qrel("q1", "d2", 1), # excluded
1924
+ ir_measures.Qrel("q1", "d3", 0), # not relevant
1925
+ ir_measures.Qrel("q1", "d4", 2), # eligible
1926
+ ir_measures.Qrel("q1", "d5", 0), # not relevant
1927
+ ]
1928
+
1929
+ @pytest.fixture
1930
+ def perfect_run(self):
1931
+ """Run that ranks all relevant docs at top."""
1932
+ return [
1933
+ ir_measures.ScoredDoc("q1", "d1", 1.0),
1934
+ ir_measures.ScoredDoc("q1", "d4", 0.9),
1935
+ ir_measures.ScoredDoc("q1", "d2", 0.8),
1936
+ ir_measures.ScoredDoc("q1", "d3", 0.1),
1937
+ ir_measures.ScoredDoc("q1", "d5", 0.05),
1938
+ ]
1939
+
1940
+ @pytest.fixture
1941
+ def worst_run(self):
1942
+ """Run that ranks relevant docs at bottom."""
1943
+ return [
1944
+ ir_measures.ScoredDoc("q1", "d3", 1.0),
1945
+ ir_measures.ScoredDoc("q1", "d5", 0.9),
1946
+ ir_measures.ScoredDoc("q1", "d2", 0.5),
1947
+ ir_measures.ScoredDoc("q1", "d4", 0.2),
1948
+ ir_measures.ScoredDoc("q1", "d1", 0.1),
1949
+ ]
1950
+
1951
+ def test_perfect_ndcg_at_10(self, sample_qrels, perfect_run):
1952
+ """Perfect ranking should yield NDCG@10 = 1.0."""
1953
+ result = ir_measures.calc_aggregate([nDCG@10], sample_qrels, perfect_run)
1954
+ assert result[nDCG@10] == pytest.approx(1.0, abs=0.01)
1955
+
1956
+ def test_worst_ndcg_lower(self, sample_qrels, perfect_run, worst_run):
1957
+ """Worst ranking should yield lower NDCG than perfect."""
1958
+ perfect = ir_measures.calc_aggregate([nDCG@10], sample_qrels, perfect_run)
1959
+ worst = ir_measures.calc_aggregate([nDCG@10], sample_qrels, worst_run)
1960
+ assert worst[nDCG@10] < perfect[nDCG@10]
1961
+
1962
+ def test_recall_at_50_perfect(self, sample_qrels, perfect_run):
1963
+ """Perfect run should retrieve all relevant docs."""
1964
+ result = ir_measures.calc_aggregate([Recall@50], sample_qrels, perfect_run)
1965
+ assert result[Recall@50] == pytest.approx(1.0, abs=0.01)
1966
+
1967
+ def test_empty_run_yields_zero(self, sample_qrels):
1968
+ """Empty run should yield 0 for all metrics."""
1969
+ empty_run = []
1970
+ result = ir_measures.calc_aggregate(
1971
+ [nDCG@10, Recall@50, P@10], sample_qrels, empty_run
1972
+ )
1973
+ assert result[nDCG@10] == 0.0
1974
+ assert result[Recall@50] == 0.0
1975
+ assert result[P@10] == 0.0
1976
+
1977
+ def test_per_query_results(self, sample_qrels, perfect_run):
1978
+ """Per-query results should return one entry per query."""
1979
+ results = list(ir_measures.iter_calc(
1980
+ [nDCG@10], sample_qrels, perfect_run
1981
+ ))
1982
+ assert len(results) == 1 # Only q1
1983
+ assert results[0].query_id == "q1"
1984
+
1985
+ def test_trec_run_format_conversion(self):
1986
+ """Test TrialPath results to TREC format conversion."""
1987
+ results = {
1988
+ "1": [
1989
+ {"nct_id": "NCT001", "score": 0.95},
1990
+ {"nct_id": "NCT002", "score": 0.80},
1991
+ ]
1992
+ }
1993
+ run_str = convert_trialpath_to_trec_run(results, "test-run")
1994
+ lines = run_str.strip().split("\n")
1995
+ assert len(lines) == 2
1996
+ assert "NCT001" in lines[0]
1997
+ assert "1" == lines[0].split()[3] # rank 1
1998
+ assert "2" == lines[1].split()[3] # rank 2
1999
+
2000
+ def test_graded_relevance_evaluation(self, sample_qrels, perfect_run):
2001
+ """Test strict eligible-only evaluation (rel=2)."""
2002
+ strict = ir_measures.calc_aggregate(
2003
+ [AP(rel=2)], sample_qrels, perfect_run
2004
+ )
2005
+ assert strict[AP(rel=2)] > 0.0
2006
+
2007
+ def test_qrels_dict_format(self):
2008
+ """Test evaluation from dict format."""
2009
+ qrels = {"q1": {"d1": 2, "d2": 1, "d3": 0}}
2010
+ run = [
2011
+ ir_measures.ScoredDoc("q1", "d1", 1.0),
2012
+ ir_measures.ScoredDoc("q1", "d2", 0.5),
2013
+ ir_measures.ScoredDoc("q1", "d3", 0.1),
2014
+ ]
2015
+ result = ir_measures.calc_aggregate([nDCG@10], qrels, run)
2016
+ assert nDCG@10 in result
2017
+ ```
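`test_trec_run_format_conversion` pins down the converter's contract: standard six-column TREC run lines (`qid Q0 docid rank score tag`), ranked by descending score. One possible stdlib-only implementation satisfying that contract (a sketch, not the project's actual helper):

```python
def convert_trialpath_to_trec_run(results: dict, run_tag: str) -> str:
    """Render {query_id: [{"nct_id", "score"}, ...]} as TREC run lines:
    'qid Q0 docid rank score tag', ranked by descending score."""
    lines = []
    for qid, docs in results.items():
        ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
        for rank, doc in enumerate(ranked, start=1):
            lines.append(
                f"{qid} Q0 {doc['nct_id']} {rank} {doc['score']:.4f} {run_tag}"
            )
    return "\n".join(lines)


run = convert_trialpath_to_trec_run(
    {"1": [{"nct_id": "NCT001", "score": 0.95}, {"nct_id": "NCT002", "score": 0.80}]},
    "test-run",
)
```

Sorting inside the converter (rather than trusting input order) guarantees rank and score never disagree, which `trec_eval`-style tooling would otherwise flag.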
2018
+
2019
+ ### 7.5 F1 Computation Tests
2020
+
2021
+ ```python
2022
+ # tests/test_extraction_f1.py
2023
+ import pytest
2024
+ from evaluation.extraction_eval import compute_field_level_f1
2025
+
2026
+
2027
+ class TestExtractionF1:
2028
+ """Test F1 computation for field-level extraction."""
2029
+
2030
+ def test_perfect_extraction(self):
2031
+ """All fields correctly extracted should yield F1=1.0."""
2032
+ annotations = [{
2033
+ "patient_id": "p1",
2034
+ "noise_level": "clean",
2035
+ "document_type": "clinical_letter",
2036
+ "fields": [
2037
+ {"field_name": "demographics.name", "ground_truth": "John", "extracted": "John", "correct": True},
2038
+ {"field_name": "demographics.sex", "ground_truth": "male", "extracted": "male", "correct": True},
2039
+ {"field_name": "diagnosis.primary", "ground_truth": "NSCLC", "extracted": "NSCLC", "correct": True},
2040
+ {"field_name": "biomarkers.egfr", "ground_truth": "positive", "extracted": "positive", "correct": True},
2041
+ ]
2042
+ }]
2043
+ result = compute_field_level_f1(annotations)
2044
+ assert result["micro_f1"] == 1.0
2045
+ assert result["pass"] is True
2046
+
2047
+ def test_zero_extraction(self):
2048
+ """No correct extractions should yield F1=0."""
2049
+ annotations = [{
2050
+ "patient_id": "p1",
2051
+ "noise_level": "clean",
2052
+ "document_type": "clinical_letter",
2053
+ "fields": [
2054
+ {"field_name": "demographics.name", "ground_truth": "John", "extracted": "Jane", "correct": False},
2055
+ {"field_name": "diagnosis.primary", "ground_truth": "NSCLC", "extracted": None, "correct": False},
2056
+ ]
2057
+ }]
2058
+ result = compute_field_level_f1(annotations)
2059
+ assert result["micro_f1"] == 0.0
2060
+ assert result["pass"] is False
2061
+
2062
+ def test_partial_extraction(self):
2063
+ """Partial extraction should yield 0 < F1 < 1."""
2064
+ annotations = [{
2065
+ "patient_id": "p1",
2066
+ "noise_level": "mild",
2067
+ "document_type": "clinical_letter",
2068
+ "fields": [
2069
+ {"field_name": "demographics.name", "ground_truth": "John", "extracted": "John", "correct": True},
2070
+ {"field_name": "diagnosis.primary", "ground_truth": "NSCLC", "extracted": "lung ca", "correct": False},
2071
+ {"field_name": "biomarkers.egfr", "ground_truth": "positive", "extracted": "positive", "correct": True},
2072
+ {"field_name": "biomarkers.alk", "ground_truth": "negative", "extracted": None, "correct": False},
2073
+ ]
2074
+ }]
2075
+ result = compute_field_level_f1(annotations)
2076
+ assert 0.0 < result["micro_f1"] < 1.0
2077
+
2078
+ def test_f1_threshold_boundary(self):
2079
+ """F1 exactly at 0.85 should pass."""
2080
+ # Create annotations that produce exactly 0.85 F1
2081
+ fields = []
2082
+ for i in range(85):
2083
+ fields.append({"field_name": f"field_{i}", "ground_truth": "val", "extracted": "val", "correct": True})
2084
+ for i in range(15):
2085
+ fields.append({"field_name": f"field_miss_{i}", "ground_truth": "val", "extracted": None, "correct": False})
2086
+
2087
+ annotations = [{"patient_id": "p1", "noise_level": "clean",
2088
+ "document_type": "test", "fields": fields}]
2089
+ result = compute_field_level_f1(annotations)
2090
+ # With 85/100 correct, F1 should be ~0.85
2091
+ assert result["pass"] is True
2092
+
2093
+ def test_empty_annotations(self):
2094
+ """Empty annotations should not crash."""
2095
+ result = compute_field_level_f1([])
2096
+ assert result["micro_f1"] == 0.0
2097
+
2098
+ def test_none_ground_truth_not_counted(self):
2099
+ """Fields with None ground truth should be handled."""
2100
+ annotations = [{
2101
+ "patient_id": "p1",
2102
+ "noise_level": "clean",
2103
+ "document_type": "test",
2104
+ "fields": [
2105
+ {"field_name": "biomarkers.ros1", "ground_truth": None,
2106
+ "extracted": None, "correct": False},
2107
+ ]
2108
+ }]
2109
+ result = compute_field_level_f1(annotations)
2110
+ # Should not crash, though metrics may be 0
2111
+ assert "micro_f1" in result
2112
+ ```
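Together these tests fix the contract of `compute_field_level_f1`: micro-averaged F1 over all annotated fields, with a pass threshold of 0.85. A stdlib-only sketch that satisfies them — the exact precision/recall bookkeeping (gold = fields with a ground truth, predicted = fields with a non-None extraction) is an assumption about the real implementation:

```python
def compute_field_level_f1(annotations: list, threshold: float = 0.85) -> dict:
    """Micro precision/recall/F1 over all annotated fields.
    A field counts toward recall if it has a ground truth, toward
    precision if something was extracted, and as a true positive
    when its `correct` flag is set."""
    tp = n_pred = n_gold = 0
    for ann in annotations:
        for f in ann["fields"]:
            if f["ground_truth"] is not None:
                n_gold += 1
            if f["extracted"] is not None:
                n_pred += 1
            if f["correct"]:
                tp += 1
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "micro_precision": round(precision, 4),
        "micro_recall": round(recall, 4),
        "micro_f1": round(f1, 4),
        "pass": f1 >= threshold,
    }
```

Note that under this bookkeeping the boundary test (85 correct, 15 missed extractions) yields precision 1.0 and recall 0.85, so micro-F1 lands above 0.85 and the `pass` assertion holds.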
2113
+
2114
+ ### 7.6 End-to-End Pipeline Tests
2115
+
2116
+ ```python
2117
+ # tests/test_e2e_pipeline.py
2118
+ import pytest
2119
+ from pathlib import Path
2120
+
2121
+
2122
+ class TestE2EPipeline:
2123
+ """End-to-end tests for the complete data & evaluation pipeline."""
2124
+
2125
+ def test_fhir_to_profile_to_pdf_roundtrip(self, sample_fhir_file, tmp_path):
2126
+ """FHIR → PatientProfile → PDF should complete without error."""
2127
+ from data.generate_synthetic_patients import parse_fhir_bundle
2128
+ from data.templates.clinical_letter import generate_clinical_letter
2129
+ from dataclasses import asdict
2130
+
2131
+ # Step 1: Parse FHIR
2132
+ profile = parse_fhir_bundle(Path(sample_fhir_file))
2133
+ assert profile.patient_id != ""
2134
+
2135
+ # Step 2: Generate PDF
2136
+ pdf_path = tmp_path / "test_roundtrip.pdf"
2137
+ generate_clinical_letter(asdict(profile), str(pdf_path))
2138
+ assert pdf_path.exists()
2139
+ assert pdf_path.stat().st_size > 1000 # Reasonable PDF size
2140
+
2141
+ def test_noisy_pdf_pipeline(self, sample_profile, tmp_path):
2142
+ """Profile → Noisy PDF should inject noise and produce valid PDF."""
2143
+ import copy
+
+ from data.templates.clinical_letter import generate_clinical_letter
+ from data.noise.noise_injector import NoiseInjector
+
+ injector = NoiseInjector(noise_level="moderate", seed=42)
+
+ # Inject text noise into profile fields for PDF rendering.
+ # Deep-copy first: we mutate the nested diagnosis dict, and a
+ # shallow .copy() would leak the change into the shared fixture.
+ profile = copy.deepcopy(sample_profile)
2150
+ dx_text = profile["diagnosis"]["primary"]
2151
+ noisy_dx, records = injector.inject_text_noise(dx_text)
2152
+ profile["diagnosis"]["primary"] = noisy_dx
2153
+
2154
+ pdf_path = tmp_path / "noisy.pdf"
2155
+ generate_clinical_letter(profile, str(pdf_path))
2156
+ assert pdf_path.exists()
2157
+
2158
+ def test_trec_evaluation_pipeline(self, tmp_path):
2159
+ """Complete TREC evaluation from dicts should produce metrics."""
2160
+ import ir_measures
2161
+ from ir_measures import nDCG, Recall, P
2162
+
2163
+ qrels = [
2164
+ ir_measures.Qrel("1", "NCT001", 2),
2165
+ ir_measures.Qrel("1", "NCT002", 1),
2166
+ ir_measures.Qrel("1", "NCT003", 0),
2167
+ ]
2168
+ run = [
2169
+ ir_measures.ScoredDoc("1", "NCT001", 0.9),
2170
+ ir_measures.ScoredDoc("1", "NCT002", 0.5),
2171
+ ir_measures.ScoredDoc("1", "NCT003", 0.1),
2172
+ ]
2173
+
2174
+ result = ir_measures.calc_aggregate(
2175
+ [nDCG@10, Recall@50, P@10], qrels, run
2176
+ )
2177
+ assert nDCG@10 in result
2178
+ assert Recall@50 in result
2179
+ assert result[nDCG@10] > 0
2180
+
2181
+ def test_latency_tracker_integration(self):
2182
+ """Latency tracker should record and summarize calls."""
2183
+ import time
2184
+ from evaluation.latency_cost_tracker import LatencyCostTracker
2185
+
2186
+ tracker = LatencyCostTracker()
2187
+ tracker.start_session("test-patient")
2188
+
2189
+ with tracker.track_call("gemini", "search_anchors") as record:
2190
+ time.sleep(0.01) # Simulate API call
2191
+ record.input_tokens = 500
2192
+ record.output_tokens = 200
2193
+
2194
+ session = tracker.end_session()
2195
+ assert session.total_latency_ms > 0
2196
+ assert len(session.api_calls) == 1
2197
+
2198
+ summary = tracker.summary()
2199
+ assert summary["n_sessions"] == 1
2200
+ assert summary["latency"]["mean_s"] > 0
2201
+ ```
2202
+
2203
+ ---
2204
+
2205
+ ## 8. Appendix
2206
+
2207
+ ### 8.1 Data Format Specifications
2208
+
2209
+ #### PatientProfile v1 JSON Schema
2210
+ ```json
2211
+ {
2212
+ "$schema": "http://json-schema.org/draft-07/schema#",
2213
+ "type": "object",
2214
+ "required": ["patient_id", "demographics", "diagnosis"],
2215
+ "properties": {
2216
+ "patient_id": {"type": "string"},
2217
+ "demographics": {
2218
+ "type": "object",
2219
+ "properties": {
2220
+ "name": {"type": "string"},
2221
+ "sex": {"type": "string", "enum": ["male", "female"]},
2222
+ "date_of_birth": {"type": "string", "format": "date"},
2223
+ "age": {"type": "integer"},
2224
+ "state": {"type": "string"}
2225
+ }
2226
+ },
2227
+ "diagnosis": {
2228
+ "type": "object",
2229
+ "properties": {
2230
+ "primary": {"type": "string"},
2231
+ "stage": {"type": ["string", "null"]},
2232
+ "histology": {"type": ["string", "null"]},
2233
+ "diagnosis_date": {"type": "string", "format": "date"}
2234
+ }
2235
+ },
2236
+ "biomarkers": {
2237
+ "type": "object",
2238
+ "properties": {
2239
+ "egfr": {"type": ["string", "null"]},
2240
+ "alk": {"type": ["string", "null"]},
2241
+ "pdl1_tps": {"type": ["string", "null"]},
2242
+ "kras": {"type": ["string", "null"]},
2243
+ "ros1": {"type": ["string", "null"]}
2244
+ }
2245
+ },
2246
+ "labs": {
2247
+ "type": "array",
2248
+ "items": {
2249
+ "type": "object",
2250
+ "properties": {
2251
+ "name": {"type": "string"},
2252
+ "value": {"type": "number"},
2253
+ "unit": {"type": "string"},
2254
+ "date": {"type": "string"},
2255
+ "loinc_code": {"type": "string"}
2256
+ }
2257
+ }
2258
+ },
2259
+ "treatments": {
2260
+ "type": "array",
2261
+ "items": {
2262
+ "type": "object",
2263
+ "properties": {
2264
+ "name": {"type": "string"},
2265
+ "type": {"type": "string", "enum": ["medication", "procedure", "radiation"]},
2266
+ "start_date": {"type": "string"},
2267
+ "end_date": {"type": ["string", "null"]}
2268
+ }
2269
+ }
2270
+ },
2271
+ "unknowns": {"type": "array", "items": {"type": "string"}},
2272
+ "evidence_spans": {"type": "array"}
2273
+ }
2274
+ }
2275
+ ```
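The schema above can be exercised directly in tests. A minimal stdlib check of the required top-level fields and the `sex` enum — full draft-07 validation would instead use the `jsonschema` package; the helper name here is illustrative:

```python
REQUIRED_TOP_LEVEL = ("patient_id", "demographics", "diagnosis")
SEX_VALUES = ("male", "female")


def check_patient_profile(profile: dict) -> list:
    """Return violations of the PatientProfile v1 schema (required
    top-level fields + sex enum only; not full draft-07 validation)."""
    errors = [
        f"missing required field: {k}"
        for k in REQUIRED_TOP_LEVEL
        if k not in profile
    ]
    sex = profile.get("demographics", {}).get("sex")
    if sex is not None and sex not in SEX_VALUES:
        errors.append(f"demographics.sex must be one of {SEX_VALUES}, got {sex!r}")
    return errors


ok = {"patient_id": "p1", "demographics": {"sex": "female"}, "diagnosis": {"primary": "NSCLC"}}
assert check_patient_profile(ok) == []
```

Returning a list of violations (rather than raising on the first) makes the check easy to assert against in pytest and to surface wholesale in the UI.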
2276
+
2277
+ ### 8.2 Tool API Reference
2278
+
2279
+ #### ir_datasets
2280
+
2281
+ | API | Description | Return type |
+ |-----|-------------|-------------|
+ | `ir_datasets.load("clinicaltrials/2021/trec-ct-2021")` | Load the TREC CT 2021 dataset | Dataset |
+ | `dataset.queries_iter()` | Iterate over topics | GenericQuery(query_id, text) |
+ | `dataset.qrels_iter()` | Iterate over qrels | TrecQrel(query_id, doc_id, relevance, iteration) |
+ | `dataset.docs_iter()` | Iterate over documents | ClinicalTrialsDoc(doc_id, title, condition, summary, detailed_description, eligibility) |
2287
+
2288
+ **Dataset IDs:**
2289
+ - `clinicaltrials/2021/trec-ct-2021` — 75 queries, 35,832 qrels
2290
+ - `clinicaltrials/2021/trec-ct-2022` — 50 queries
2291
+ - `clinicaltrials/2021` — 376K documents (base collection)
2292
+
2293
+ #### ir-measures
2294
+
2295
+ | API | Description |
+ |-----|-------------|
+ | `ir_measures.calc_aggregate(measures, qrels, run)` | Compute aggregate metrics |
+ | `ir_measures.iter_calc(measures, qrels, run)` | Iterate per-query metrics |
+ | `ir_measures.read_trec_qrels(path)` | Read a TREC qrels file |
+ | `ir_measures.read_trec_run(path)` | Read a TREC run file |
+ | `ir_measures.Qrel(qid, did, rel)` | Create a qrel record |
+ | `ir_measures.ScoredDoc(qid, did, score)` | Create a scored-document record |
2303
+
2304
+ **Measure objects:**
2305
+ - `nDCG@10` — Normalized DCG at cutoff 10
2306
+ - `Recall@50` — Recall at cutoff 50
2307
+ - `P@10` — Precision at cutoff 10
2308
+ - `AP` — Average Precision
2309
+ - `AP(rel=2)` — AP with minimum relevance 2
2310
+ - `RR` — Reciprocal Rank
2311
+
2312
+ #### scikit-learn Evaluation
2313
+
2314
+ | API | Description |
+ |-----|-------------|
+ | `f1_score(y_true, y_pred, average=None)` | Per-class F1 |
+ | `f1_score(y_true, y_pred, average='micro')` | Global micro F1 |
+ | `f1_score(y_true, y_pred, average='macro')` | Unweighted mean of per-class F1 |
+ | `precision_score(y_true, y_pred)` | Precision |
+ | `recall_score(y_true, y_pred)` | Recall |
+ | `classification_report(y_true, y_pred)` | Full classification report |
+ | `confusion_matrix(y_true, y_pred)` | Confusion matrix |
2323
+
2324
+ #### Synthea CLI
2325
+
2326
+ | Flag | Description | Example |
+ |------|-------------|---------|
+ | `-p N` | Generate N patients | `-p 500` |
+ | `-s SEED` | Random seed | `-s 42` |
+ | `-m MODULE` | Run a specific disease module | `-m lung_cancer` |
+ | `STATE` | State to generate patients in | `Massachusetts` |
+ | `--exporter.fhir.export` | Enable FHIR R4 export | `=true` |
+ | `--exporter.pretty_print` | Pretty-print JSON output | `=true` |
2334
+
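Put together, a batch run combining the flags above might look like the following; the working directory and exact module name are illustrative assumptions:

```shell
# Generate 500 seeded lung-cancer patients in Massachusetts with FHIR R4 export
./run_synthea -p 500 -s 42 -m lung_cancer \
  --exporter.fhir.export=true \
  --exporter.pretty_print=true \
  Massachusetts
```

Fixing `-s 42` makes the cohort reproducible, which the deterministic-data tests in Section 7.1 depend on.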
2335
+ #### ReportLab Core API
2336
+
2337
+ | Component | Description |
+ |-----------|-------------|
+ | `SimpleDocTemplate(path, pagesize=letter)` | Create a document template |
+ | `Paragraph(text, style)` | Paragraph flowable |
+ | `Table(data, colWidths)` | Table flowable |
+ | `TableStyle(commands)` | Table styling commands |
+ | `Spacer(width, height)` | Spacing flowable |
+ | `getSampleStyleSheet()` | Get the default stylesheet |
2345
+
2346
+ #### Augraphy Degradation Pipeline
2347
+
2348
+ | Component | Description |
+ |-----------|-------------|
+ | `AugraphyPipeline(ink_phase, paper_phase, post_phase)` | Full degradation pipeline |
+ | `InkBleed(p=0.5)` | Ink bleed effect |
+ | `Letterpress(p=0.3)` | Letterpress effect |
+ | `LowInkPeriodicLines(p=0.3)` | Low-ink periodic lines |
+ | `DirtyDrum(p=0.3)` | Dirty drum effect |
+ | `SubtleNoise(p=0.5)` | Subtle noise |
+ | `Jpeg(p=0.5)` | JPEG compression artifacts |
+ | `Brightness(p=0.5)` | Brightness variation |
2358
+
2359
+ ### 8.3 Python Dependency List
2360
+
2361
+ ```
2362
+ # requirements-data-eval.txt
2363
+ ir-datasets>=0.5.6
2364
+ ir-measures>=0.3.1
2365
+ reportlab>=4.0
2366
+ augraphy>=8.0
2367
+ Pillow>=10.0
2368
+ pdfplumber>=0.10
2369
+ scikit-learn>=1.3
2370
+ numpy>=1.24
2371
+ pandas>=2.0
2372
+ pdf2image>=1.16
2373
+ ```
2374
+
2375
+ ### 8.4 Success Metrics Quick Reference
2376
+
2377
+ | Metric | Target | Evaluation tool | Data source |
+ |--------|--------|-----------------|-------------|
+ | MedGemma Extraction F1 | >= 0.85 | scikit-learn `f1_score` | Synthetic patients + ground truth |
+ | Trial Retrieval Recall@50 | >= 0.75 | ir-measures `Recall@50` | TREC CT 2021/2022 |
+ | Trial Ranking NDCG@10 | >= 0.60 | ir-measures `nDCG@10` | TREC CT 2021/2022 |
+ | Criterion Decision Accuracy | >= 0.85 | Custom accuracy | Annotated EligibilityLedger |
+ | Latency | < 15s | `LatencyCostTracker` | API call timing |
+ | Cost | < $0.50/session | `LatencyCostTracker` | Token counting |
docs/tdd-guide-ux-frontend.md ADDED
@@ -0,0 +1,1524 @@
# TrialPath UX & Frontend TDD-Ready Implementation Guide

> Generated from DeepWiki research on `streamlit/streamlit` and `emcie-co/parlant`, supplemented by official Parlant documentation (`parlant.io`).
>
> **Architecture Decisions:**
> - Parlant runs as an **independent service** (REST API mode); the frontend communicates via `ParlantClient` (httpx)
> - Doctor packet export: **JSON + Markdown** (no PDF generation in PoC)
> - MedGemma: **HF Inference Endpoint** (cloud, no local GPU)

---

## 1. Architecture Overview

### 1.1 File Structure

```
app/
  app.py                   # Entrypoint: st.navigation, shared sidebar, Parlant client init
  pages/
    1_upload.py            # INGEST state: document upload + extraction trigger
    2_profile_review.py    # PRESCREEN state: PatientProfile review + edit
    3_trial_matching.py    # VALIDATE_TRIALS state: trial search + eligibility cards
    4_gap_analysis.py      # GAP_FOLLOWUP state: gap analysis + iterative refinement
    5_summary.py           # SUMMARY state: final report + doctor packet export
  components/
    file_uploader.py       # Multi-file PDF uploader component
    profile_card.py        # PatientProfile display/edit component
    trial_card.py          # Traffic-light eligibility card component
    gap_card.py            # Gap analysis action card component
    progress_tracker.py    # Journey state progress indicator
    chat_panel.py          # Parlant message panel (send/receive)
    search_process.py      # Search refinement step-by-step visualization
    disclaimer_banner.py   # Medical disclaimer banner (always visible)
  services/
    parlant_client.py      # Parlant REST API wrapper (sessions, events, agents)
    state_manager.py       # Session state orchestration
  tests/
    test_upload_page.py
    test_profile_review_page.py
    test_trial_matching_page.py
    test_gap_analysis_page.py
    test_summary_page.py
    test_components.py
    test_parlant_client.py
    test_state_manager.py
```

### 1.2 Module Dependency Graph

```
app.py
  -> pages/*                     (via st.navigation)
  -> services/parlant_client.py  (Parlant REST API)
  -> services/state_manager.py   (session state orchestration)

pages/*
  -> components/*                (UI building blocks)
  -> services/parlant_client.py
  -> services/state_manager.py

components/*
  -> st.session_state            (read/write)

services/parlant_client.py
  -> parlant-client SDK or httpx (REST calls to Parlant server)

services/state_manager.py
  -> st.session_state
  -> services/parlant_client.py
```

### 1.3 Key Dependencies

| Package             | Purpose                                    |
|---------------------|--------------------------------------------|
| `streamlit>=1.40`   | Frontend framework, multipage app, AppTest |
| `parlant-client`    | Python SDK for Parlant REST API            |
| `httpx`             | Async HTTP client (fallback for Parlant)   |
| `pytest`            | Test runner                                |

---

## 2. Streamlit Framework Guide

### 2.1 Multipage App with `st.navigation`

TrialPath uses the modern `st.navigation` API (not the legacy `pages/` auto-discovery) for explicit page control tied to Journey states.

**Pattern: Entrypoint with state-aware navigation**

```python
# app.py
import streamlit as st
from services.state_manager import get_current_journey_state

st.set_page_config(page_title="TrialPath", page_icon=":material/medical_services:", layout="wide")

# Define pages mapped to Journey states
pages = {
    "Patient Journey": [
        st.Page("pages/1_upload.py", title="Upload Documents", icon=":material/upload_file:"),
        st.Page("pages/2_profile_review.py", title="Review Profile", icon=":material/person:"),
        st.Page("pages/3_trial_matching.py", title="Trial Matching", icon=":material/search:"),
        st.Page("pages/4_gap_analysis.py", title="Gap Analysis", icon=":material/analytics:"),
        st.Page("pages/5_summary.py", title="Summary & Export", icon=":material/summarize:"),
    ]
}

current_page = st.navigation(pages)

# Shared sidebar: progress tracker
with st.sidebar:
    st.markdown("### Journey Progress")
    state = get_current_journey_state()
    # Render progress indicator based on current Parlant Journey state

current_page.run()
```

**Key API details (from DeepWiki):**
- `st.navigation(pages, position="sidebar")` returns the current `StreamlitPage`; you must call `.run()` on it.
- `st.switch_page("pages/2_profile_review.py")` for programmatic navigation (stops current page execution).
- `st.page_link(page, label, icon)` for clickable navigation links.
- Pages organized as a dict render as sections in the sidebar nav.

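Since `st.switch_page` takes a page path, keeping the Journey-state-to-page mapping in one pure function makes the navigation logic trivially testable. A minimal sketch (the helper name `page_for_state` is hypothetical; the file names follow the structure in Section 1.1):

```python
# Hypothetical helper: map a Parlant Journey state to its Streamlit page path,
# so pages can call st.switch_page(page_for_state(new_state)) after a transition.
STATE_TO_PAGE = {
    "INGEST": "pages/1_upload.py",
    "PRESCREEN": "pages/2_profile_review.py",
    "VALIDATE_TRIALS": "pages/3_trial_matching.py",
    "GAP_FOLLOWUP": "pages/4_gap_analysis.py",
    "SUMMARY": "pages/5_summary.py",
}

def page_for_state(state: str) -> str:
    """Return the page path for a Journey state (defaults to the upload page)."""
    return STATE_TO_PAGE.get(state, "pages/1_upload.py")
```

Falling back to the upload page for unknown states is a deliberate choice: a stale or corrupted session restarts the journey rather than crashing navigation.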
### 2.2 File Upload (`st.file_uploader`)

**Pattern: Multi-file PDF upload with validation**

```python
# components/file_uploader.py
import streamlit as st
from typing import List

def render_file_uploader() -> List:
    """Render multi-file uploader for clinical documents."""
    uploaded_files = st.file_uploader(
        "Upload clinical documents (PDF)",
        type=["pdf", "png", "jpg", "jpeg"],
        accept_multiple_files=True,
        key="clinical_docs_uploader",
        help="Upload clinic letters, pathology reports, lab results",
    )

    if uploaded_files:
        st.success(f"{len(uploaded_files)} file(s) uploaded")
        for f in uploaded_files:
            st.caption(f"{f.name} ({f.size / 1024:.1f} KB)")

    return uploaded_files or []
```

**Key API details (from DeepWiki):**
- `accept_multiple_files=True` returns `List[UploadedFile]`.
- `UploadedFile` extends `io.BytesIO` -- it can be passed directly to PDF parsers.
- Default size limit: 200 MB per file (configurable via `server.maxUploadSize` in `config.toml`).
- The `type` parameter is best-effort filtering, not a security guarantee.
- Files are held in memory after upload.
- Additive selection: clicking browse again adds files, it does not replace them.

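Because `UploadedFile` subclasses `io.BytesIO`, any code that consumes a binary stream works unchanged, and tests can substitute a plain `BytesIO`. A minimal sketch (the `read_header` helper is hypothetical, standing in for handing the stream to a real PDF parser):

```python
import io

def read_header(stream: io.BytesIO, n: int = 5) -> bytes:
    """Read the first n bytes of an uploaded document, e.g. to sniff '%PDF-'."""
    stream.seek(0)   # the stream may have been read on an earlier rerun
    header = stream.read(n)
    stream.seek(0)   # rewind so a downstream parser sees the full stream
    return header

# In tests, a plain BytesIO stands in for Streamlit's UploadedFile:
fake_upload = io.BytesIO(b"%PDF-1.7 ...rest of file...")
assert read_header(fake_upload) == b"%PDF-"
```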
### 2.3 Session State Management

**Pattern: Centralized state initialization**

```python
# services/state_manager.py
import streamlit as st

JOURNEY_STATES = ["INGEST", "PRESCREEN", "VALIDATE_TRIALS", "GAP_FOLLOWUP", "SUMMARY"]

def init_session_state():
    """Initialize all session state variables with defaults."""
    defaults = {
        "journey_state": "INGEST",
        "parlant_session_id": None,
        "parlant_agent_id": None,
        "patient_profile": None,   # PatientProfile dict
        "uploaded_files": [],
        "search_anchors": None,    # SearchAnchors dict
        "trial_candidates": [],    # List[TrialCandidate]
        "eligibility_ledger": [],  # List[EligibilityLedger]
        "last_event_offset": 0,    # For Parlant long-polling
    }
    for key, default_value in defaults.items():
        if key not in st.session_state:
            st.session_state[key] = default_value

def get_current_journey_state() -> str:
    return st.session_state.get("journey_state", "INGEST")

def advance_journey(target_state: str):
    """Advance the Journey to target_state; only forward transitions are applied."""
    current_idx = JOURNEY_STATES.index(st.session_state.journey_state)
    target_idx = JOURNEY_STATES.index(target_state)  # raises ValueError for unknown states
    if target_idx > current_idx:
        st.session_state.journey_state = target_state
```

**Key API details (from DeepWiki):**
- `st.session_state` is a `SessionStateProxy` wrapping a thread-safe `SafeSessionState`.
- Internally it is a three-layer dict: `_old_state` (previous run), `_new_session_state` (user-set), `_new_widget_state` (widget values).
- Widget-bound state cannot be modified after the widget is instantiated in the same run (raises `StreamlitAPIException`).
- A widget's `key` parameter maps to `st.session_state[key]` for read access.
- Values must be pickle-serializable.

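The forward-only guard in `advance_journey` is pure list arithmetic, so it can be unit-tested without a Streamlit runtime. A minimal sketch of the same rule as a standalone predicate (the `can_advance` name is an assumption, not part of the module above):

```python
JOURNEY_STATES = ["INGEST", "PRESCREEN", "VALIDATE_TRIALS", "GAP_FOLLOWUP", "SUMMARY"]

def can_advance(current: str, target: str) -> bool:
    """True only for strictly forward transitions along the journey."""
    if current not in JOURNEY_STATES or target not in JOURNEY_STATES:
        return False  # unknown states never advance the journey
    return JOURNEY_STATES.index(target) > JOURNEY_STATES.index(current)
```

Note that forward *jumps* (e.g. INGEST straight to SUMMARY) pass this check; if each state must be visited in order, tighten the comparison to `target_idx == current_idx + 1`.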
### 2.4 Real-Time Progress Feedback

**Pattern: AI inference progress with `st.status`**

```python
# Usage in pages/1_upload.py
def run_extraction(uploaded_files):
    """Run MedGemma extraction with real-time status feedback."""
    with st.status("Extracting clinical data from documents...", expanded=True) as status:
        st.write("Reading uploaded documents...")
        # Step 1: Send files to MedGemma
        st.write("Running AI extraction (MedGemma 4B)...")
        # Step 2: Poll for results
        st.write("Building patient profile...")
        # Step 3: Parse results into PatientProfile
        status.update(label="Extraction complete!", state="complete")
```

**Pattern: Streaming LLM output with `st.write_stream`**

```python
def stream_gap_analysis(generator):
    """Stream Gemini gap analysis output with typewriter effect."""
    st.write_stream(generator)
```

**Pattern: Auto-refreshing fragment for Parlant events**

```python
@st.fragment(run_every=3)  # Poll every 3 seconds
def parlant_event_listener():
    """Fragment that polls Parlant for new events without full page rerun."""
    from services.parlant_client import poll_events

    new_events = poll_events(
        st.session_state.parlant_session_id,
        st.session_state.last_event_offset,
    )
    if new_events:
        for event in new_events:
            if event["kind"] == "message" and event["source"] == "ai_agent":
                st.chat_message("assistant").write(event["message"])
            elif event["kind"] == "status":
                st.caption(f"Agent status: {event['data']}")
        st.session_state.last_event_offset = new_events[-1]["offset"] + 1
```

**Key API details (from DeepWiki):**
- `st.status(label, expanded, state)` -- context manager, auto-completes. States: `"running"`, `"complete"`, `"error"`.
- `st.spinner(text, show_time=True)` -- simple loading indicator.
- `st.progress(value, text)` -- accepts an int 0-100 or a float 0.0-1.0.
- `st.toast(body, icon, duration)` -- transient notification, top-right.
- `st.write_stream(generator)` -- typewriter effect for strings, `st.write` for other types. Supports OpenAI `ChatCompletionChunk` and LangChain `AIMessageChunk`.
- `@st.fragment(run_every=N)` -- partial rerun every N seconds, isolated from the full app rerun.
- `st.rerun(scope="fragment")` -- rerun only the enclosing fragment.

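`st.write_stream` accepts any generator of string chunks, so a complete Parlant message can be re-chunked client-side to get the typewriter effect even though the REST API delivers it in one piece. A minimal sketch (chunking by words is an arbitrary choice for illustration):

```python
from typing import Iterator

def as_stream(text: str) -> Iterator[str]:
    """Yield a message word-by-word so st.write_stream can animate it."""
    words = text.split(" ")
    for i, word in enumerate(words):
        # re-insert the space that split() removed, except before the first word
        yield word if i == 0 else " " + word

# st.write_stream(as_stream(event["message"])) would render this incrementally;
# joining the chunks reproduces the original text exactly:
assert "".join(as_stream("Found 3 matching trials")) == "Found 3 matching trials"
```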
### 2.5 Layout System (from DeepWiki `streamlit/streamlit`)

**Layout primitives for TrialPath UI:**

| Primitive | Purpose in TrialPath | Key Params |
|-----------|---------------------|------------|
| `st.columns(spec)` | Trial card grid, profile fields side-by-side | `spec` (int or list of ratios), `gap`, `vertical_alignment` |
| `st.tabs(labels)` | Switching between trial categories (Eligible/Borderline/Not Eligible) | Returns list of containers |
| `st.expander(label)` | Collapsible criterion detail, evidence citations | `expanded` (bool), `icon` |
| `st.container(height, border)` | Scrollable trial list, chat panel | `height` (int px), `horizontal` (bool) |
| `st.empty()` | Dynamic status updates, replacing content | Single-element, replaceable |

**Layout composition pattern for trial cards:**

```python
# Trial matching page layout
tabs = st.tabs(["Eligible", "Borderline", "Not Eligible", "Unknown"])

with tabs[0]:  # Eligible trials
    for trial in eligible_trials:
        with st.expander(f"{trial['nct_id']} - {trial['title']}", expanded=False):
            cols = st.columns([0.7, 0.3])
            with cols[0]:
                st.markdown(f"**Phase**: {trial['phase']}")
                st.markdown(f"**Sponsor**: {trial['sponsor']}")
            with cols[1]:
                # Traffic-light summary
                met = sum(1 for c in trial['criteria'] if c['status'] == 'MET')
                total = len(trial['criteria'])
                st.metric("Criteria Met", f"{met}/{total}")

            # Criterion-level detail
            for criterion in trial['criteria']:
                col1, col2 = st.columns([0.8, 0.2])
                with col1:
                    st.write(criterion['description'])
                with col2:
                    color_map = {"MET": "green", "NOT_MET": "red", "BORDERLINE": "orange", "UNKNOWN": "grey"}
                    st.markdown(f":{color_map[criterion['status']]}[{criterion['status']}]")
```

**Responsive behavior:**
- `st.columns` stacks vertically at viewport widths <= 640px.
- Use `width="stretch"` for elements that should fill the available space.
- Avoid nesting columns more than once.
- Scrolling containers: avoid heights > 500px for mobile.

### 2.6 Caching System (from DeepWiki `streamlit/streamlit`)

**Two caching decorators:**

| Decorator | Returns | Serialization | Use Case |
|-----------|---------|---------------|----------|
| `@st.cache_data` | Copy of cached value | Requires pickle | Data transformations, API responses, search results |
| `@st.cache_resource` | Shared instance (singleton) | No pickle needed | ParlantClient instance, HTTP clients, model objects |

**TrialPath caching patterns:**

```python
import os

@st.cache_resource
def get_parlant_client() -> ParlantClient:
    """Singleton Parlant client shared across all sessions."""
    return ParlantClient(base_url=os.environ.get("PARLANT_URL", "http://localhost:8000"))

@st.cache_data(ttl=300)  # 5-minute TTL
def search_trials(query_params: dict) -> list:
    """Cache trial search results to avoid redundant MCP calls."""
    client = get_parlant_client()
    # ... perform search
    return results
```

**Key details:**
- Cache key = hash of (function source code + arguments).
- `ttl` (time-to-live): auto-expires entries. Use it for API results that may change.
- `max_entries`: limits cache size.
- `hash_funcs`: custom hash for unhashable args.
- Prefix an arg with `_` to exclude it from the hash (e.g., `_client`).
- `@st.cache_resource` objects are shared across ALL sessions/threads -- they must be thread-safe.
- Do NOT call interactive widgets inside cached functions (triggers a warning).
- Cache is invalidated on: argument change, source code change, TTL expiry, `max_entries` overflow, explicit `.clear()`.

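The `_`-prefix rule matters because a live client handle is neither hashable nor stable across runs; excluding it keeps the cache key deterministic. A stdlib-only sketch of that keying rule (this mimics the behavior for illustration only -- it is not Streamlit's actual implementation):

```python
import functools

def cache_skipping_underscore_args(func):
    """Memoize on keyword args, ignoring names that start with '_' (mirroring st.cache_data)."""
    store = {}

    @functools.wraps(func)
    def wrapper(**kwargs):
        # Build the cache key only from non-underscore kwargs.
        key = tuple(sorted((k, v) for k, v in kwargs.items() if not k.startswith("_")))
        if key not in store:
            store[key] = func(**kwargs)
        return store[key]

    wrapper.cache = store  # exposed for inspection
    return wrapper

@cache_skipping_underscore_args
def search(_client=None, query=""):
    return f"results for {query}"

search(_client=object(), query="NSCLC")  # computed
search(_client=object(), query="NSCLC")  # different client object, same key -> cache hit
assert len(search.cache) == 1
```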
### 2.7 Global Disclaimer Banner (PRD Section 9)

Every page must display a medical disclaimer. Implement it as a shared component called from `app.py` before navigation.

**Pattern: Global disclaimer in entrypoint**

```python
# app/app.py (add before st.navigation)
from components.disclaimer_banner import render_disclaimer

# Always render the disclaimer at the top of every page
render_disclaimer()

nav = st.navigation(pages)
nav.run()
```

**Component: disclaimer_banner.py**

```python
# app/components/disclaimer_banner.py
import streamlit as st

DISCLAIMER_TEXT = (
    "This tool provides information for educational purposes only and does not "
    "constitute medical advice. Always consult your healthcare provider before "
    "making decisions about clinical trial participation."
)

def render_disclaimer():
    """Render the medical disclaimer banner. Must appear on every page."""
    st.info(DISCLAIMER_TEXT, icon="ℹ️")
```

---

## 3. Parlant Frontend Integration Guide

### 3.1 Architecture: Asynchronous Event-Driven Model

Parlant uses an **asynchronous, event-driven** conversation model -- NOT traditional request-reply. Both the customer and the AI agent can post events to a session at any time.

**Core concepts:**
- **Session** = timeline of all events (messages, status updates, tool calls, custom events)
- **Event** = timestamped item with `offset`, `kind`, `source`, `trace_id`
- **Long-polling** = client polls for new events with `min_offset` and a `wait_for_data` timeout

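The offset bookkeeping that drives long-polling is simple enough to isolate: after each poll, the next `min_offset` is one past the newest event seen, and an empty poll (e.g. a long-poll timeout) leaves it unchanged. A minimal sketch of that accounting (`next_min_offset` is a hypothetical helper; event dicts are shaped as in Section 3.4):

```python
def next_min_offset(events: list[dict], current: int) -> int:
    """Compute the min_offset for the next poll from the events just received."""
    if not events:
        return current  # nothing new arrived before the long-poll timed out
    return events[-1]["offset"] + 1

polled = [
    {"offset": 4, "kind": "status", "source": "ai_agent"},
    {"offset": 5, "kind": "message", "source": "ai_agent"},
]
assert next_min_offset(polled, current=4) == 6
assert next_min_offset([], current=6) == 6
```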
### 3.2 REST API Endpoints

| Method | Path                                     | Purpose                                |
|--------|------------------------------------------|----------------------------------------|
| POST   | `/agents`                                | Create agent                           |
| POST   | `/sessions`                              | Create session (agent + customer)      |
| GET    | `/sessions`                              | List sessions (filter by agent/customer, paginated) |
| POST   | `/sessions/{id}/events`                  | Send message/event                     |
| GET    | `/sessions/{id}/events`                  | List/poll events (long-polling)        |
| PATCH  | `/sessions/{id}/events/{event_id}`       | Update event metadata                  |

**Create Event request schema** (`EventCreationParamsDTO`):
- `kind`: `"message"` | `"custom"` | `"status"`
- `source`: `"customer"` | `"human_agent"` | `"customer_ui"`
- `message`: string (for message events)
- `data`: dict (for custom/status events)
- `metadata`: dict (optional)

**List Events query params:**
- `min_offset`: int -- only return events after this offset
- `wait_for_data`: int (seconds) -- long-poll timeout; returns `504` if no new events
- `source`, `correlation_id`, `trace_id`, `kinds`: optional filters

### 3.3 Parlant Client Service

```python
# services/parlant_client.py
import httpx
from typing import Optional

PARLANT_BASE_URL = "http://localhost:8000"

class ParlantClient:
    """Synchronous wrapper around the Parlant REST API for Streamlit."""

    def __init__(self, base_url: str = PARLANT_BASE_URL):
        self.base_url = base_url
        self.http = httpx.Client(base_url=base_url, timeout=65.0)  # > long-poll timeout

    def create_agent(self, name: str, description: str = "") -> dict:
        resp = self.http.post("/agents", json={"name": name, "description": description})
        resp.raise_for_status()
        return resp.json()

    def create_session(self, agent_id: str, customer_id: Optional[str] = None) -> dict:
        payload = {"agent_id": agent_id}
        if customer_id:
            payload["customer_id"] = customer_id
        resp = self.http.post("/sessions", json=payload)
        resp.raise_for_status()
        return resp.json()

    def send_message(self, session_id: str, message: str) -> dict:
        resp = self.http.post(
            f"/sessions/{session_id}/events",
            json={"kind": "message", "source": "customer", "message": message},
        )
        resp.raise_for_status()
        return resp.json()

    def send_custom_event(self, session_id: str, event_type: str, data: dict) -> dict:
        """Send a custom event (e.g., journey state change, file upload notification)."""
        resp = self.http.post(
            f"/sessions/{session_id}/events",
            json={"kind": "custom", "source": "customer_ui", "data": {"type": event_type, **data}},
        )
        resp.raise_for_status()
        return resp.json()

    def poll_events(self, session_id: str, min_offset: int = 0, wait_seconds: int = 60) -> list:
        resp = self.http.get(
            f"/sessions/{session_id}/events",
            params={"min_offset": min_offset, "wait_for_data": wait_seconds},
        )
        if resp.status_code == 504:
            # Long-poll timed out with no new events (see Section 3.2) -- not an error.
            return []
        resp.raise_for_status()
        return resp.json()
```


### 3.4 Event Types Reference

| Kind      | Source(s)                      | Description                          |
|-----------|--------------------------------|--------------------------------------|
| message   | customer, ai_agent             | Text message from participant        |
| status    | ai_agent                       | Agent state: acknowledged, processing, typing, ready, error, cancelled |
| tool      | ai_agent                       | Tool call result (MedGemma, MCP)     |
| custom    | customer_ui, system            | App-defined (journey state, uploads) |

### 3.5 Journey State Synchronization

Map Parlant events to TrialPath Journey states:

```python
# services/state_manager.py (continued)

JOURNEY_CUSTOM_EVENTS = {
    "extraction_complete": "PRESCREEN",
    "profile_confirmed": "VALIDATE_TRIALS",
    "trials_evaluated": "GAP_FOLLOWUP",
    "gaps_resolved": "SUMMARY",
}

def handle_parlant_event(event: dict):
    """Process an incoming Parlant event and update the Journey state if needed."""
    if event["kind"] == "custom" and event.get("data", {}).get("type") in JOURNEY_CUSTOM_EVENTS:
        new_state = JOURNEY_CUSTOM_EVENTS[event["data"]["type"]]
        advance_journey(new_state)
    elif event["kind"] == "status" and event.get("data") == "error":
        st.session_state["last_error"] = event.get("message", "Unknown error")
```

### 3.6 Parlant Journey System (from DeepWiki `emcie-co/parlant`)

Parlant's Journey System defines structured multi-step interaction flows. This is the core mechanism for implementing TrialPath's 5-state patient workflow.

**Journey state types:**
- **Chat State** -- the agent converses with the customer, guided by the state's `action`. Can stay for multiple turns.
- **Tool State** -- the agent calls an external tool; the result is loaded into context. Must be followed by a chat state.
- **Fork State** -- the agent evaluates conditions and branches the flow.

**TrialPath Journey definition pattern:**

```python
import parlant.sdk as p

async def create_trialpath_journey(agent: p.Agent):
    journey = await agent.create_journey(
        title="Clinical Trial Matching",
        conditions=["The patient wants to find matching clinical trials"],
        description="Guide NSCLC patients through clinical trial matching: "
                    "document upload, profile extraction, trial search, "
                    "eligibility analysis, and gap identification.",
    )

    # INGEST: Upload and extract
    t1 = await journey.initial_state.transition_to(
        chat_state="Ask patient to upload clinical documents (clinic letters, pathology reports, lab results)"
    )

    # Tool state: Run MedGemma extraction
    t2a = await t1.target.transition_to(
        condition="Documents uploaded",
        tool_state=extract_patient_profile,  # MedGemma tool
    )
    # PRESCREEN: Review extracted profile
    t2b = await t2a.target.transition_to(
        chat_state="Present extracted PatientProfile for review and confirmation"
    )

    # Tool state: Search trials via MCP
    t3a = await t2b.target.transition_to(
        condition="Profile confirmed",
        tool_state=search_clinical_trials,  # ClinicalTrials MCP tool
    )
    # VALIDATE_TRIALS: Show results with eligibility
    t3b = await t3a.target.transition_to(
        chat_state="Present trial matches with criterion-level eligibility assessment"
    )

    # GAP_FOLLOWUP: Identify gaps and suggest actions
    t4 = await t3b.target.transition_to(
        condition="Trials evaluated",
        chat_state="Analyze eligibility gaps and suggest next steps "
                   "(additional tests, document uploads)",
    )

    # Loop back if new documents are uploaded
    await t4.target.transition_to(
        condition="New documents uploaded for gap resolution",
        state=t2a.target,  # Back to extraction
    )

    # SUMMARY: Final report
    t5 = await t4.target.transition_to(
        condition="Gaps resolved or patient ready for summary",
        chat_state="Generate summary report and doctor packet",
    )
```

**Key details (from DeepWiki):**
- Journeys are activated by `conditions` (observational guidelines matched by `GuidelineMatcher`).
- Transitions can be **direct** (always taken) or **conditional** (only taken if the condition is met).
- A transition can target an existing state (for loops, e.g., the gap-resolution cycle).
- `END_JOURNEY` is a special terminal state.
- Journeys dynamically manage the LLM context to include only the guidelines relevant to each state.

### 3.7 Parlant Guideline System (from DeepWiki `emcie-co/parlant`)

Guidelines define behavioral rules for agents. Two types:

| Type | Has Action? | Purpose |
|------|-------------|---------|
| Observational | No | Track conditions, activate journeys |
| Actionable | Yes | Drive agent behavior when condition is met |

**Journey-scoped vs Global guidelines:**
- **Global** guidelines apply across all conversations.
- **Journey-scoped** guidelines are only active when their parent journey is active. Created via `journey.create_guideline()`.

**TrialPath guideline examples:**

```python
# Global guideline: always cite evidence
await agent.create_guideline(
    condition="the agent makes a clinical assessment",
    action="cite the source document, page number, and relevant text span",
)

# Journey-scoped: only active during VALIDATE_TRIALS
await journey.create_guideline(
    condition="a criterion cannot be evaluated due to missing data",
    action="mark it as UNKNOWN and add it to the gap list with the specific data needed",
)
```

**Matching pipeline** (from DeepWiki): GuidelineMatcher uses LLM-based evaluation with multiple batch types (observational, actionable, low-criticality, disambiguation, journey-node-selection) to determine which guidelines apply to the current conversation context.

### 3.8 Parlant Tool Integration (from DeepWiki `emcie-co/parlant`)

Parlant supports four tool service types: `local`, `sdk`/plugin, `openapi`, and `mcp`.

**TrialPath will use:**
- **SDK/plugin tools** for MedGemma extraction
- **MCP tools** for ClinicalTrials.gov search

**Tool definition with the `@p.tool` decorator:**

```python
@p.tool
async def extract_patient_profile(
    context: p.ToolContext,
    document_urls: list[str],
) -> p.ToolResult:
    """Extract patient clinical profile from uploaded documents using MedGemma 4B.

    Args:
        document_urls: List of URLs/paths to uploaded clinical documents.
    """
    # Call the MedGemma endpoint
    profile = await call_medgemma(document_urls)
    return p.ToolResult(
        data=profile,
        metadata={"source": "MedGemma 4B", "doc_count": len(document_urls)},
    )
```

**Tool execution flow** (from DeepWiki):
1. GuidelineMatcher identifies tools associated with matched guidelines
2. ToolCaller resolves tool parameters from the ServiceRegistry
3. ToolCallBatcher groups tools for efficient LLM inference
4. The LLM infers tool arguments from the conversation context
5. `ToolService.call_tool()` executes and returns a `ToolResult`
6. ToolEventGenerator emits a ToolEvent to the session

**ToolResult structure:**
- `data` -- visible to the agent for further processing
- `metadata` -- frontend-only info (not used by the agent)
- `control` -- processing options: `mode` (auto/manual), `lifespan` (response/session)

### 3.9 Parlant NLP Provider: Gemini (from DeepWiki `emcie-co/parlant`)

Parlant natively supports Google Gemini, which aligns with TrialPath's planned use of Gemini 3 Pro.

**Configuration:**

```bash
# Install with Gemini support
pip install parlant[gemini]

# Set the API key
export GEMINI_API_KEY="your-api-key"

# Start the server with the Gemini backend
parlant-server --gemini
```

**Supported providers** (from DeepWiki): OpenAI, Anthropic, Azure, AWS Bedrock, Google Gemini, Vertex AI, Together.ai, LiteLLM, Cerebras, DeepSeek, Ollama, Mistral, and more.

**Vertex AI alternative** -- for production, you can use `pip install parlant[vertex]` with `VERTEX_AI_MODEL=gemini-2.5-pro`.

### 3.10 AlphaEngine Processing Pipeline (from DeepWiki `emcie-co/parlant`)

This is the complete flow from customer message to agent response. It is critical for understanding latency and where UI feedback belongs.

**Step-by-step pipeline:**

```
1. EVENT CREATION
   Customer sends message -> POST /sessions/{id}/events
   -> SessionModule creates the event, dispatches background processing

2. CONTEXT LOADING
   AlphaEngine.process() loads:
   - Session history (interaction events)
   - Agent identity + description
   - Customer info
   - Context variables (per-customer/per-tag/global)
   -> Assembled into an EngineContext

3. PREPARATION LOOP (while not prepared_to_respond)
   a. GUIDELINE MATCHING
      GuidelineMatcher evaluates guidelines against the conversation context
      - Observational guidelines (track conditions)
      - Actionable guidelines (drive behavior)
      - Journey-node guidelines (determine next journey step)
      Uses the LLM to score relevance -> GuidelineMatch objects

   b. TOOL CALLING (if guidelines require tools)
      ToolCaller resolves + executes tools
      - ToolCallBatcher groups them for efficient LLM inference
      - The LLM infers arguments from context
      - ToolService.call_tool() executes
      - ToolEventGenerator emits a ToolEvent to the session
      -> Tool results may trigger re-evaluation of guidelines

4. PREAMBLE GENERATION (optional)
   Quick acknowledgment for perceived responsiveness
   -> Emitted as an early status event ("acknowledged" / "processing")

5. MESSAGE COMPOSITION
   Based on the agent's CompositionMode:
   - FLUID: MessageGenerator builds the prompt, generates via SchematicGenerator
     -> Revision loop with temperature-based retries
   - CANNED_STRICT: only uses predefined templates
   - CANNED_COMPOSITED: mimics the style of canned responses
   - CANNED_FLUID: prefers canned but falls back to fluid

6. EVENT EMISSION
   Generated message -> emitted as a message event
   A "ready" status event signals completion
```

**UI feedback mapping for TrialPath:**

| Pipeline Step | Parlant Status Event | UI Feedback |
|---------------|---------------------|-------------|
| Event created | `acknowledged` | "Message received" indicator |
| Context loading | `processing` | `st.status("Analyzing your request...")` |
| Tool calling | `tool` events | `st.status("Searching ClinicalTrials.gov...")` |
| Message generation | `typing` | Typing indicator animation |
| Complete | `ready` | Display agent response |
| Error | `error` | `st.error()` with retry option |

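The status-to-feedback table reduces to a lookup the chat panel can apply to each incoming status event. A minimal sketch (the `feedback_label` helper and its label strings are illustrative, not part of Parlant):

```python
STATUS_FEEDBACK = {
    "acknowledged": "Message received",
    "processing": "Analyzing your request...",
    "typing": "Agent is typing...",
    "ready": None,  # the message event follows; no interim label needed
    "error": "Something went wrong -- please retry",
}

def feedback_label(status: str) -> "str | None":
    """Map a Parlant status event to the interim label the UI should show."""
    # Fall through to a generic label for statuses not in the table (e.g. "cancelled").
    return STATUS_FEEDBACK.get(status, f"Agent status: {status}")

assert feedback_label("processing") == "Analyzing your request..."
assert feedback_label("ready") is None
assert feedback_label("cancelled") == "Agent status: cancelled"
```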
731
+ ### 3.11 Context Variables (from DeepWiki `emcie-co/parlant`)
732
+
733
+ Context variables store dynamic data that agents can reference during conversations. Essential for TrialPath to maintain patient profile state across the journey.
734
+
735
+ **Variable scoping (priority order):**
736
+ 1. Customer-specific values (per patient)
737
+ 2. Tag-specific values (e.g., per disease type)
738
+ 3. Global defaults

**TrialPath context variable examples:**

```python
# Create a context variable for patient data
patient_profile_var = await client.context_variables.create(
    name="patient_profile",
    description="Current patient clinical profile extracted from documents",
)

# Set a per-customer value
await client.context_variables.set_value(
    variable_id=patient_profile_var.id,
    key=customer_id,  # per-patient
    value=patient_profile_dict,
)

# Auto-refresh a variable via a tool (with freshness rules)
trial_results_var = await client.context_variables.create(
    name="matching_trials",
    description="Current list of matching clinical trials",
    tool_id=search_trials_tool_id,
    freshness_rules="*/10 * * * *",  # refresh every 10 minutes
)
```

**Key details:**
- Values are JSON-serializable.
- Included in PromptBuilder's `add_context_variables` section for LLM context.
- Can be auto-refreshed via associated tools plus cron-based `freshness_rules`.
- `ContextVariableStore.GLOBAL_KEY` holds default values.

### 3.12 MCP Tool Service Details (from DeepWiki `emcie-co/parlant`)

Parlant has native MCP support via `MCPToolClient`. This is how TrialPath connects to ClinicalTrials.gov.

**Registration:**

```
# Via REST API
PUT /services/clinicaltrials_mcp
{
  "kind": "mcp",
  "mcp": {
    "url": "http://localhost:8080"
  }
}
```

```bash
# Via CLI
parlant service create \
  --name clinicaltrials_mcp \
  --kind mcp \
  --url http://localhost:8080
```

**MCPToolClient internals:**
- Connects via `StreamableHttpTransport` to the MCP server's `/mcp` endpoint.
- `list_tools()` discovers available tools from the MCP server.
- `mcp_tool_to_parlant_tool()` converts MCP tool schemas to Parlant's `Tool` objects.
- Type mapping: `string`, `integer`, `number`, `boolean`, `date`, `datetime`, `uuid`, `array`, `enum`.
- `call_tool()` invokes the MCP tool, extracts text content from the result, and wraps it in a `ToolResult`.
- Default MCP port: `8181` (the registration examples above override this with `8080`).

**Integration with Guideline System:**

```python
# Associate an MCP tool with a guideline
search_guideline = await agent.create_guideline(
    condition="the patient profile has been confirmed and trial search is needed",
    action="search ClinicalTrials.gov for matching NSCLC trials using the patient's biomarkers and staging",
    tools=[clinicaltrials_search_tool],  # MCP tool reference
)
```

### 3.13 Prompt Construction (from DeepWiki `emcie-co/parlant`)

Understanding how Parlant builds LLM prompts is essential for designing effective guidelines and journey states.

**PromptBuilder sections (in order):**

| Section | Content | TrialPath Relevance |
|---------|---------|---------------------|
| General Instructions | Task description, role | Define clinical trial matching context |
| Agent Identity | Agent name + description | "patient_trial_copilot" identity |
| Customer Identity | Customer name, session ID | Patient identifier |
| Context Variables | Dynamic data (JSON) | PatientProfile, SearchAnchors, prior results |
| Glossary | Domain terms | NSCLC, ECOG, biomarker definitions |
| Capabilities | What the agent can do | Tool descriptions (MedGemma, MCP) |
| Interaction History | Conversation events | Full chat history with tool results |
| Guidelines | Matched condition/action pairs | Active behavioral rules for current state |
| Journey State | Current position in journey | Which step in the INGEST -> SUMMARY flow |
| Few-shot Examples | Desired output format | Example eligibility assessments |
| Staged Tool Events | Pending/completed tool results | MedGemma extraction results, MCP search results |

**Context window management:**
- GuidelineMatcher selectively loads only relevant guidelines and journeys.
- Journey-scoped guidelines are included only when the journey is active.
- Prevents context bloat by pruning low-probability journey guidelines.

### 3.14 Parlant Testing Framework (from DeepWiki `emcie-co/parlant`)

Parlant provides a dedicated testing framework with NLP-based assertions (LLM-as-a-Judge).

**Key test utilities:**

| Class | Purpose |
|-------|---------|
| `Suite` | Test runner, manages server connection and scenarios |
| `Session` | Test session context manager |
| `Response` | Agent response with `.should()` assertion |
| `InteractionBuilder` | Build conversation history for preloading |
| `CustomerMessage` / `AgentMessage` | Step types for conversation construction |

**TrialPath test examples:**

```python
from parlant.testing import Suite, InteractionBuilder
from parlant.testing.steps import AgentMessage, CustomerMessage

suite = Suite(
    server_url="http://localhost:8800",
    agent_id="patient_trial_copilot",
)

@suite.scenario
async def test_extraction_journey_step():
    """Test that the agent asks for documents in the INGEST state."""
    async with suite.session() as session:
        response = await session.send("I want to find clinical trials for my lung cancer")
        await response.should("ask the patient to upload clinical documents")
        await response.should("mention accepted file types like PDF or images")

@suite.scenario
async def test_gap_analysis_identifies_missing_data():
    """Test that gap analysis identifies unknown biomarkers."""
    async with suite.session() as session:
        # Preload history simulating completed extraction + matching
        history = (
            InteractionBuilder()
            .step(CustomerMessage("Here are my medical documents"))
            .step(AgentMessage("I've extracted your profile. You have NSCLC Stage IIIB, "
                               "EGFR positive, but KRAS status is unknown."))
            .step(CustomerMessage("What trials am I eligible for?"))
            .step(AgentMessage("I found 5 trials. For NCT04000005, KRAS status is required "
                               "but missing from your records."))
            .build()
        )
        await session.add_events(history)

        response = await session.send("What should I do about the missing KRAS test?")
        await response.should("suggest getting a KRAS mutation test")
        await response.should("explain which trials require KRAS status")

@suite.scenario
async def test_multi_turn_journey_flow():
    """Test the complete journey flow with unfold()."""
    async with suite.session() as session:
        await session.unfold([
            CustomerMessage("I have NSCLC and want to find trials"),
            AgentMessage(
                text="I'd be happy to help. Please upload your clinical documents.",
                should="ask for document upload",
            ),
            CustomerMessage("I've uploaded my pathology report"),
            AgentMessage(
                text="I've extracted your profile...",
                should=["confirm profile extraction", "present key findings"],
            ),
            CustomerMessage("That looks correct, please search for trials"),
            AgentMessage(
                text="I found 8 matching trials...",
                should=["present trial matches", "include eligibility assessment"],
            ),
        ])
```

**Running tests:**
```bash
parlant-test tests/           # Run all test files
parlant-test tests/ -k gap    # Filter by pattern
parlant-test tests/ -n 4      # Run in parallel
```

### 3.15 Canned Response System (from DeepWiki `emcie-co/parlant`)

Canned responses provide consistent, template-based messaging -- useful for TrialPath's structured outputs.

**CompositionMode options:**

| Mode | Behavior | TrialPath Use |
|------|----------|---------------|
| `FLUID` | Free-form LLM generation | General conversation, gap explanations |
| `CANNED_STRICT` | Only predefined templates | Disclaimer text, safety warnings |
| `CANNED_COMPOSITED` | Mimics canned style | Eligibility summaries |
| `CANNED_FLUID` | Prefers canned, falls back to fluid | Standard responses with flexibility |

**Journey-state-scoped canned responses:**

```python
# Canned response only active during the SUMMARY state
summary_template = await journey.create_canned_response(
    value="Based on your clinical profile, you match {{match_count}} trials. "
          "{{eligible_count}} are likely eligible, {{borderline_count}} are borderline, "
          "and {{gap_count}} have unresolved gaps. "
          "See the attached doctor packet for full details.",
    fields=["match_count", "eligible_count", "borderline_count", "gap_count"],
)
```

**Template features:**
- Jinja2 syntax for dynamic fields (e.g., `{{std.customer.name}}`).
- Fields auto-populated from tool results and context variables.
- Relevance-scored matching via LLM when multiple templates exist.
- `signals` and `metadata` for additional template categorization.
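The field substitution can be pictured with a minimal stand-in. Real Parlant templates use Jinja2 (which also supports dotted paths like `std.customer.name`); this regex version handles only simple `{{field}}` placeholders and is purely illustrative:

```python
import re


def fill_template(template: str, fields: dict) -> str:
    """Toy stand-in for Jinja2: substitute simple {{field}} placeholders.

    Unknown fields are left untouched so missing data is visible in review.
    """
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(fields.get(m.group(1), m.group(0))),
        template,
    )


summary = fill_template(
    "You match {{match_count}} trials; {{eligible_count}} are likely eligible.",
    {"match_count": 8, "eligible_count": 3},
)
```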

---

## 4. UI Component Design per Journey State

### 4.1 INGEST State -- Upload Page

```
+------------------------------------------+
| [i] This tool is for information only... |
| [Sidebar: Journey Progress]              |
|                                          |
| Upload Clinical Documents                |
| +---------------------------------+      |
| | Drag & drop or browse           |      |
| | Accepted: PDF, PNG, JPG         |      |
| +---------------------------------+      |
|                                          |
| Uploaded Files:                          |
| - clinic_letter.pdf (245 KB)     [x]     |
| - pathology_report.pdf (1.2 MB)  [x]     |
| - lab_results.png (890 KB)       [x]     |
|                                          |
| [Start Extraction]                       |
|                                          |
| st.status: "Extracting clinical data..." |
|   - Reading documents...                 |
|   - Running MedGemma 4B...               |
|   - Building patient profile...          |
+------------------------------------------+
```

**Key components:** `file_uploader`, `progress_tracker`

### 4.2 PRESCREEN State -- Profile Review Page

```
+------------------------------------------+
| [i] This tool is for information only... |
| [Sidebar: Journey Progress]              |
|                                          |
| Patient Clinical Profile                 |
| +--------------------------------------+ |
| | Demographics: Female, 62, ECOG 1     | |
| | Diagnosis: NSCLC Stage IIIB          | |
| | Histology: Adenocarcinoma            | |
| | Biomarkers:                          | |
| |   EGFR: Positive (exon 19 del)       | |
| |   ALK: Negative                      | |
| |   PD-L1: 45%                         | |
| | Prior Treatment:                     | |
| |   Carboplatin+Pemetrexed (2 cycles)  | |
| | Unknowns:                            | |
| |   [!] KRAS status not found          | |
| |   [!] Brain MRI not available        | |
| +--------------------------------------+ |
|                                          |
| [Edit Profile] [Confirm & Search Trials] |
|                                          |
| Searching ClinicalTrials.gov...          |
|   Step 1: Initial query -> 47 results    |
|   Refining: adding Phase 3 filter...     |
|   Step 2: Refined query -> 12 results    |
|   Shortlisting top candidates...         |
+------------------------------------------+
```

**Key components:** `profile_card`, `search_process`, `progress_tracker`
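The refine/shortlist loop shown in the search panel can be sketched as a simple policy. The function and its thresholds are illustrative assumptions (the planner's actual decision is LLM-driven):

```python
def next_search_action(found: int, max_results: int = 50) -> str:
    """Decide the next step in the iterative trial-search loop.

    Illustrative rule: zero hits -> relax filters, too many hits ->
    tighten them, otherwise shortlist for detailed criterion review.
    """
    if found == 0:
        return "relax"      # e.g. drop the Phase 3 filter
    if found > max_results:
        return "refine"     # e.g. add a phase or location filter
    return "shortlist"
```

Logging each `(query, found, action, reason)` tuple is what feeds the `search_process` component above.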

### 4.3 VALIDATE_TRIALS State -- Trial Matching Page

```
+------------------------------------------+
| [i] This tool is for information only... |
| [Sidebar: Journey Progress]              |
|                                          |
| Matching Trials (8 found)                |
|                                          |
| Search Process:                          |
|   Step 1: NSCLC + Stage IV + DE -> 47    |
|     -> Refined: added Phase 3            |
|   Step 2: + Phase 3 -> 12 results        |
|     -> Shortlisted: reading summaries    |
|   Step 3: 5 trials selected for review   |
| [Show/Hide Search Details]               |
|                                          |
| +--------------------------------------+ |
| | NCT04000001 - KEYNOTE-999            | |
| | Pembrolizumab + Chemo for NSCLC      | |
| | Overall: LIKELY ELIGIBLE             | |
| |                                      | |
| | Criteria:                            | |
| | [G] NSCLC confirmed                  | |
| | [G] ECOG 0-1                         | |
| | [Y] PD-L1 >= 50% (yours: 45%)        | |
| | [R] No prior immunotherapy           | |
| | [?] Brain mets (unknown)             | |
| +--------------------------------------+ |
| | NCT04000002 - ...                    | |
| +--------------------------------------+ |
|                                          |
| [G]=Met  [Y]=Borderline  [R]=Not Met     |
| [?]=Unknown/Needs Info                   |
+------------------------------------------+
```

**Key components:** `trial_card` (traffic-light display), `search_process`, `progress_tracker`
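The per-trial rollup from criterion lights to an overall label could work as below. The aggregation rule here is an assumption for illustration; the real assessment is produced by the criterion-evaluation step, not by this helper:

```python
def overall_eligibility(criteria: list[dict]) -> str:
    """Roll per-criterion statuses up to one trial-level label.

    Assumed rule: any NOT_MET dominates; otherwise unknown or
    borderline criteria downgrade the label step by step.
    """
    statuses = {c["status"] for c in criteria}
    if "NOT_MET" in statuses:
        return "NOT ELIGIBLE"
    if "UNKNOWN" in statuses:
        return "NEEDS INFO"
    if "BORDERLINE" in statuses:
        return "BORDERLINE"
    return "LIKELY ELIGIBLE"
```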

### 4.4 GAP_FOLLOWUP State -- Gap Analysis Page

```
+------------------------------------------+
| [i] This tool is for information only... |
| [Sidebar: Journey Progress]              |
|                                          |
| Gap Analysis & Next Steps                |
|                                          |
| +--------------------------------------+ |
| | GAP: Brain MRI results needed        | |
| | Impact: Would resolve [?] criteria   | |
| |   for NCT04000001, NCT04000003       | |
| | Action: Upload brain MRI report      | |
| | [Upload Document]                    | |
| +--------------------------------------+ |
| | GAP: KRAS mutation status            | |
| | Impact: Required for NCT04000005     | |
| | Action: Request test from oncologist | |
| +--------------------------------------+ |
|                                          |
| [Re-run Matching with New Data]          |
| [Proceed to Summary]                     |
+------------------------------------------+
```

**Key components:** `gap_card`, `file_uploader` (for additional docs), `progress_tracker`

### 4.5 SUMMARY State -- Summary & Export Page

```
+------------------------------------------+
| [i] This tool is for information only... |
| [Sidebar: Journey Progress]              |
|                                          |
| Clinical Trial Matching Summary          |
|                                          |
| Eligible Trials:    3                    |
| Borderline Trials:  2                    |
| Not Eligible:       3                    |
| Unresolved Gaps:    1                    |
|                                          |
| [Download Doctor Packet (JSON/Markdown)] |
| [Start New Session]                      |
|                                          |
| Chat with AI Copilot:                    |
| +--------------------------------------+ |
| | AI: Based on your profile...         | |
| | You: What about trial NCT...?        | |
| | AI: That trial requires...           | |
| +--------------------------------------+ |
| | [Type a message...]           [Send] | |
| +--------------------------------------+ |
+------------------------------------------+
```

**Key components:** `chat_panel`, `progress_tracker`

---

## 5. TDD Test Cases

### 5.1 Upload Page Tests

| Test Case | Input | Expected Output | Boundary |
|-----------|-------|-----------------|----------|
| No files uploaded | Empty uploader | "Start Extraction" button disabled | N/A |
| Single PDF upload | 1 PDF file | File listed, extraction button enabled | N/A |
| Multiple files | 3 PDF + 1 PNG | All 4 files listed with sizes | N/A |
| Invalid file type | 1 .docx file | File rejected, error message shown | File type filter |
| Large file | 250 MB PDF | Error or warning per `maxUploadSize` | Size limit |
| Extraction triggered | Click "Start Extraction" | `st.status` shows running, Parlant event sent | N/A |
| Extraction completes | MedGemma returns profile | Journey advances to PRESCREEN, profile in session_state | State transition |
| Extraction fails | MedGemma error | `st.status` shows error state, retry option | Error handling |

### 5.2 Profile Review Page Tests

| Test Case | Input | Expected Output | Boundary |
|-----------|-------|-----------------|----------|
| Profile display | PatientProfile in session_state | All fields rendered correctly | N/A |
| Unknown fields highlighted | Profile with unknowns list | Unknowns shown with warning icon | N/A |
| Edit profile | Click Edit, modify ECOG | session_state updated, confirmation shown | N/A |
| Confirm profile | Click "Confirm & Search" | Journey advances to VALIDATE_TRIALS | State transition |
| Empty profile | No profile in session_state | Redirect to Upload page | Guard clause |
| Biomarker display | Complex biomarker data | All biomarkers with values and methods | Data richness |

### 5.3 Trial Matching Page Tests

| Test Case | Input | Expected Output | Boundary |
|-----------|-------|-----------------|----------|
| Trials loading | Matching in progress | `st.spinner` or `st.status` shown | N/A |
| Trials displayed | 8 TrialCandidates | 8 trial cards with traffic-light criteria | N/A |
| Green criterion | Criterion met with evidence | Green indicator, evidence citation | N/A |
| Yellow criterion | Borderline match | Yellow indicator, explanation | N/A |
| Red criterion | Criterion not met | Red indicator, specific reason | N/A |
| Unknown criterion | Missing data | Question mark, linked to gap | N/A |
| Zero trials | No matches found | Informative message, suggest broadening | Empty state |
| Many trials | 50+ results | Pagination or scroll, performance ok | Scale |
| Search process displayed | SearchLog with 3 steps | 3 step entries shown with query params and result counts | N/A |
| Refinement visible | >50 initial results refined to 12 | Shows refinement action and reason | Iterative loop |
| Relaxation visible | 0 initial results relaxed to 5 | Shows relaxation action and reason | Iterative loop |

### 5.4 Gap Analysis Page Tests

| Test Case | Input | Expected Output | Boundary |
|-----------|-------|-----------------|----------|
| Gaps identified | 3 gaps in ledger | 3 gap cards with actions | N/A |
| Upload resolves gap | Upload brain MRI report | Gap card updates, re-match option | Iterative flow |
| No gaps | All criteria resolved | Message: "No gaps", proceed to summary | Happy path |
| Gap impacts multiple trials | 1 gap affects 3 trials | Gap card lists all 3 affected trials | Cross-reference |
| Re-run matching | Click re-run after upload | New extraction + matching cycle | Loop back |

### 5.5 Summary Page Tests

| Test Case | Input | Expected Output | Boundary |
|-----------|-------|-----------------|----------|
| Summary statistics | Complete ledger | Correct counts per category | N/A |
| Download doctor packet | Click download | JSON + Markdown files downloadable via `st.download_button` | N/A |
| Chat interaction | Send message | Message appears, agent responds | N/A |
| New session | Click "Start New" | State cleared, redirect to Upload | State reset |
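The "correct counts per category" row can be backed by a small pure helper that the summary page and its test share. The ledger shape and label strings follow this document's mockups; the function itself is an illustrative assumption:

```python
from collections import Counter


def summarize_ledger(ledger: list[dict]) -> dict:
    """Count trials per eligibility category for the summary page.

    Each ledger entry is assumed to carry an "overall" label matching
    the traffic-light rollup used on the trial matching page.
    """
    counts = Counter(entry["overall"] for entry in ledger)
    return {
        "eligible": counts["LIKELY ELIGIBLE"],
        "borderline": counts["BORDERLINE"],
        "not_eligible": counts["NOT ELIGIBLE"],
        "gaps": counts["NEEDS INFO"],
    }
```

Keeping the count logic out of the Streamlit layer makes the "Summary statistics" test a plain assertion instead of a widget scrape.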

### 5.6 Disclaimer Tests

| Test Case | Input | Expected Output | Boundary |
|-----------|-------|-----------------|----------|
| Disclaimer on upload page | Navigate to Upload | Info banner with disclaimer text visible | N/A |
| Disclaimer on profile page | Navigate to Profile Review | Info banner with disclaimer text visible | N/A |
| Disclaimer on matching page | Navigate to Trial Matching | Info banner with disclaimer text visible | N/A |
| Disclaimer on gap page | Navigate to Gap Analysis | Info banner with disclaimer text visible | N/A |
| Disclaimer on summary page | Navigate to Summary | Info banner with disclaimer text visible | N/A |
| Disclaimer text content | Any page | Contains "information only" and "not medical advice" | Exact wording |

---

## 6. Streamlit AppTest Testing Strategy

### 6.1 Test Setup Pattern

```python
# tests/test_upload_page.py
import pytest
from streamlit.testing.v1 import AppTest

@pytest.fixture
def upload_app():
    """Create an AppTest instance for the upload page."""
    at = AppTest.from_file("pages/1_upload.py")
    # Initialize required session state
    at.session_state["journey_state"] = "INGEST"
    at.session_state["parlant_session_id"] = "test-session-123"
    at.session_state["uploaded_files"] = []
    return at.run()

def test_initial_state(upload_app):
    """Upload page shows the uploader and raises no exceptions."""
    at = upload_app
    # Check the file uploader exists
    assert len(at.file_uploader) > 0
    # Check no error state
    assert len(at.exception) == 0

def test_extraction_button_disabled_without_files(upload_app):
    """Extraction button should be disabled when no files are uploaded."""
    at = upload_app
    # Button should exist, but extraction should not proceed without files
    assert at.button[0].disabled or at.session_state.get("uploaded_files") == []
```

### 6.2 Widget Interaction Patterns

```python
def test_text_input_profile_edit():
    """Test editing patient profile fields via text input."""
    at = AppTest.from_file("pages/2_profile_review.py")
    at.session_state["journey_state"] = "PRESCREEN"
    at.session_state["patient_profile"] = {
        "demographics": {"age": 62, "sex": "Female"},
        "diagnosis": {"stage": "IIIB", "histology": "Adenocarcinoma"},
    }
    at = at.run()

    # Simulate editing a field
    if len(at.text_input) > 0:
        at.text_input[0].input("IIIA").run()
        # Assert profile updated in session state

def test_button_click_advances_journey():
    """Clicking the confirm button advances the journey to the next state."""
    at = AppTest.from_file("pages/2_profile_review.py")
    at.session_state["journey_state"] = "PRESCREEN"
    at.session_state["patient_profile"] = {"demographics": {"age": 62}}
    at = at.run()

    # Find and click the confirm button
    confirm_buttons = [b for b in at.button if "Confirm" in str(b.label)]
    if confirm_buttons:
        confirm_buttons[0].click()
        at = at.run()
        assert at.session_state["journey_state"] == "VALIDATE_TRIALS"
```

### 6.3 Page Navigation Test

```python
def test_guard_redirect_without_profile():
    """Profile review page redirects to upload if no profile exists."""
    at = AppTest.from_file("pages/2_profile_review.py")
    at.session_state["journey_state"] = "PRESCREEN"
    at.session_state["patient_profile"] = None  # No profile
    at = at.run()

    # Should show a warning or error, not crash
    assert len(at.exception) == 0
    # Could check for a warning message
    warnings = [m for m in at.warning if "upload" in str(m.value).lower()]
    assert len(warnings) > 0 or at.session_state["journey_state"] == "INGEST"
```

### 6.4 Session State Test

```python
def test_session_state_initialization():
    """All session state keys should be initialized on first run."""
    at = AppTest.from_file("app.py").run()

    required_keys = [
        "journey_state", "parlant_session_id", "patient_profile",
        "uploaded_files", "trial_candidates", "eligibility_ledger",
    ]
    for key in required_keys:
        assert key in at.session_state, f"Missing session state key: {key}"

def test_session_state_persists_across_reruns():
    """Session state values persist across multiple reruns."""
    at = AppTest.from_file("app.py").run()
    at.session_state["journey_state"] = "PRESCREEN"
    at = at.run()
    assert at.session_state["journey_state"] == "PRESCREEN"
```

### 6.5 Component Rendering Tests

```python
def test_trial_card_traffic_light_rendering():
    """Trial card displays correct traffic-light colors for criteria."""
    at = AppTest.from_file("pages/3_trial_matching.py")
    at.session_state["journey_state"] = "VALIDATE_TRIALS"
    at.session_state["trial_candidates"] = [
        {
            "nct_id": "NCT04000001",
            "title": "Test Trial",
            "criteria_results": [
                {"criterion": "NSCLC", "status": "MET", "evidence": "pathology report p.1"},
                {"criterion": "ECOG 0-1", "status": "MET", "evidence": "clinic letter"},
                {"criterion": "No prior IO", "status": "NOT_MET", "evidence": "treatment history"},
                {"criterion": "Brain mets", "status": "UNKNOWN", "evidence": None},
            ],
        }
    ]
    at = at.run()

    # Check that trial card content is rendered
    assert len(at.exception) == 0
    # Check for presence of the trial ID in rendered markdown
    markdown_texts = [str(m.value) for m in at.markdown]
    assert any("NCT04000001" in text for text in markdown_texts)
```

### 6.6 Error Handling Tests

```python
def test_parlant_connection_error_handling():
    """App should handle Parlant server unavailability gracefully."""
    at = AppTest.from_file("app.py")
    at.session_state["parlant_session_id"] = None  # Simulate no connection
    at = at.run()

    # Should not crash
    assert len(at.exception) == 0

def test_extraction_error_shows_retry():
    """When extraction fails, the user sees an error status and a retry option."""
    at = AppTest.from_file("pages/1_upload.py")
    at.session_state["journey_state"] = "INGEST"
    at.session_state["extraction_error"] = "MedGemma timeout"
    at = at.run()

    # Should show an error message
    assert len(at.exception) == 0
    error_msgs = [str(e.value) for e in at.error]
    assert len(error_msgs) > 0 or at.session_state.get("extraction_error") is not None
```

### 6.7 Search Process Component Tests

```python
# tests/test_components.py (addition)

class TestSearchProcessComponent:
    """Test the search process visualization component."""

    def test_renders_search_steps(self):
        """Search process should display all refinement steps."""
        at = AppTest.from_file("app/components/search_process.py")
        at.session_state["search_log"] = {
            "steps": [
                {"step": 1, "query": {"condition": "NSCLC", "location": "DE"},
                 "found": 47, "action": "refine",
                 "reason": "Too many results, adding phase filter"},
                {"step": 2, "query": {"condition": "NSCLC", "location": "DE", "phase": "Phase 3"},
                 "found": 12, "action": "shortlist",
                 "reason": "Right size for detailed review"},
            ],
            "final_shortlist_nct_ids": ["NCT001", "NCT002", "NCT003", "NCT004", "NCT005"],
        }
        at = at.run()
        # Verify the steps are displayed
        assert "47" in at.text[0].value  # First step result count
        assert "12" in at.text[1].value  # Second step result count
        assert "Phase 3" in at.text[0].value or "Phase 3" in at.text[1].value

    def test_empty_search_log(self):
        """Should handle a missing search log gracefully."""
        at = AppTest.from_file("app/components/search_process.py")
        at = at.run()
        # Should not crash; shows a placeholder
        assert len(at.exception) == 0

    def test_collapsible_details(self):
        """Search details should sit in an expander for a clean UI."""
        at = AppTest.from_file("app/components/search_process.py")
        at.session_state["search_log"] = {
            "steps": [{"step": 1, "query": {}, "found": 10, "action": "shortlist", "reason": "OK"}],
        }
        at = at.run()
        # Verify an expander exists for search details
        assert len(at.expander) >= 1
```

### 6.8 Disclaimer Component Tests

```python
# tests/test_components.py (addition)

class TestDisclaimerBanner:
    """Test that the medical disclaimer banner appears correctly."""

    def test_disclaimer_renders(self):
        """Disclaimer banner should render on every page."""
        at = AppTest.from_file("app/components/disclaimer_banner.py")
        at = at.run()
        assert len(at.info) >= 1
        assert "information" in at.info[0].value.lower()
        assert "medical advice" in at.info[0].value.lower()

    def test_disclaimer_in_upload_page(self):
        """Upload page should include the disclaimer."""
        at = AppTest.from_file("app/pages/1_upload.py")
        at = at.run()
        info_texts = [i.value.lower() for i in at.info]
        assert any("information" in t and "medical" in t for t in info_texts)
```

### 6.9 AppTest Limitations

- `AppTest` does not support testing `st.file_uploader` file content directly (mock at the service layer instead).
- Not yet compatible with `st.navigation`/`st.Page` multipage apps (test individual pages via `from_file`).
- No browser rendering -- tests run headless, in pure Python.
- Must call `.run()` after every interaction to see updated state.

---

## 7. Appendix: API Reference

### 7.1 Streamlit Key APIs

| API | Purpose | Notes |
|-----|---------|-------|
| `st.navigation(pages, position)` | Define multipage app | Returns current page; must call `.run()` |
| `st.Page(page, title, icon, url_path)` | Define a page | `page` = filepath or callable |
| `st.switch_page(page)` | Programmatic navigation | Stops current page execution |
| `st.page_link(page, label, icon)` | Clickable nav link | Non-blocking |
| `st.file_uploader(label, type, accept_multiple_files, key)` | File upload widget | Returns `UploadedFile` (extends `BytesIO`) |
| `st.session_state` | Persistent key-value store | Survives reruns, per-session |
| `st.status(label, expanded, state)` | Collapsible status container | Context manager, auto-completes |
| `st.spinner(text, show_time)` | Loading spinner | Context manager |
| `st.progress(value, text)` | Progress bar | 0-100 int or 0.0-1.0 float |
| `st.toast(body, icon, duration)` | Transient notification | Top-right corner |
| `st.write_stream(generator)` | Streaming text output | Typewriter effect for strings |
| `@st.fragment(run_every=N)` | Partial rerun decorator | Isolated from full app rerun |
| `st.rerun(scope)` | Trigger rerun | `"app"` or `"fragment"` |
| `st.chat_message(name)` | Chat bubble | `"user"`, `"assistant"`, or custom |
| `st.chat_input(placeholder)` | Chat text input | Fixed at bottom of container |
| `AppTest.from_file(path)` | Create test instance | `.run()` to execute |
| `AppTest.from_string(code)` | Test from string | Quick inline tests |
| `at.button[i].click()` | Simulate button click | Chain with `.run()` |
| `at.text_input[i].input(val)` | Simulate text entry | Chain with `.run()` |
| `at.slider[i].set_value(val)` | Set slider value | Chain with `.run()` |
1458
+ ### 7.2 Parlant Key APIs (from DeepWiki `emcie-co/parlant`)
1459
+
1460
+ **REST Endpoints:**
1461
+
1462
+ | Endpoint | Method | Purpose | Key Params |
1463
+ |----------|--------|---------|------------|
1464
+ | `/agents` | POST | Create agent | `name`, `description` |
1465
+ | `/sessions` | POST | Create session | `agent_id`, `customer_id` (optional), `title`, `metadata` |
1466
+ | `/sessions` | GET | List sessions | `agent_id`, `customer_id`, `limit`, `cursor`, `sort` |
1467
+ | `/sessions/{id}/events` | POST | Send event | `kind`, `source`, `message`/`data`, `metadata`; query: `moderation` |
1468
+ | `/sessions/{id}/events` | GET | Poll events | `min_offset`, `wait_for_data`, `source`, `correlation_id`, `trace_id`, `kinds` |
1469
+ | `/sessions/{id}/events/{eid}` | PATCH | Update event | metadata updates only |
1470
+
1471
+ **Event kinds:** `message`, `status`, `tool`, `custom`
1472
+
1473
+ **Event sources:** `customer`, `customer_ui`, `ai_agent`, `human_agent`, `human_agent_on_behalf_of_ai_agent`, `system`
1474
+
1475
+ **Status event states:** `acknowledged`, `processing`, `typing`, `ready`, `error`, `cancelled`
1476
+
1477
+ **Long-polling behavior:** `wait_for_data` > 0 blocks until new events or timeout; returns `504` on timeout.
1478
+
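The long-polling contract above can be sketched as a small client loop. The only behavior taken from the table is the `min_offset`/`wait_for_data` parameters and the `504`-on-timeout response; the function shape and the injected `fetch` callable are assumptions made so the loop can be exercised without a running Parlant server (a real caller would wrap `httpx.get` on `GET /sessions/{id}/events`):

```python
from collections.abc import Callable


def poll_events(
    fetch: Callable[[dict], tuple[int, list[dict]]],
    min_offset: int = 0,
    wait_for_data: int = 30,
    max_rounds: int = 1,
) -> tuple[int, list[dict]]:
    """Collect new events, advancing min_offset past each one.

    `fetch(params)` returns (status_code, events). A 504 means the server's
    long-poll timed out with nothing new, so the loop just tries again.
    """
    collected: list[dict] = []
    for _ in range(max_rounds):
        status, events = fetch({"min_offset": min_offset, "wait_for_data": wait_for_data})
        if status == 504:  # long-poll timeout: no new events yet
            continue
        for ev in events:
            collected.append(ev)
            # Next poll should start one past the newest event we have seen.
            min_offset = max(min_offset, ev["offset"] + 1)
    return min_offset, collected
```

Treating `504` as "nothing new" rather than an error is the key detail; a naive client that raises on it will flap constantly.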
+ **SDK APIs:**
+
+ | SDK Method | Purpose |
+ |------------|---------|
+ | `agent.create_journey(title, conditions, description)` | Create Journey with state machine |
+ | `journey.initial_state.transition_to(chat_state=..., tool_state=..., condition=...)` | Define state transitions |
+ | `agent.create_guideline(condition, action, tools=[...])` | Create global guideline |
+ | `journey.create_guideline(condition, action, tools=[...])` | Create journey-scoped guideline |
+ | `p.Server(session_store="local"/"mongodb://...")` | Configure session persistence |
+
+ **Tool decorator:** `@p.tool` auto-extracts name, description, parameters from function signature.
+
+ **NLP backend:** `parlant-server --gemini` (requires `GEMINI_API_KEY` and `pip install parlant[gemini]`).
+
+ **Client SDK:** `parlant-client` (Python), TypeScript client, or direct REST.
+
+ **Storage options:** in-memory (default/testing), local JSON, MongoDB (production).
+
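The SDK surface above can be sketched as a setup function for the `patient_trial_copilot` journey. The method names come from the table; the `await` call style, keyword names, and condition/action strings are assumptions, so treat this as a sketch rather than verified Parlant usage. Taking `agent` as a parameter keeps the wiring testable with a stub:

```python
async def build_copilot(agent):
    """Wire the patient_trial_copilot journey and its guidelines (sketch)."""
    journey = await agent.create_journey(
        title="patient_trial_copilot",
        conditions=["the customer asks about clinical trial eligibility"],
        description="INGEST -> PRESCREEN -> VALIDATE_TRIALS -> GAP_FOLLOWUP -> SUMMARY",
    )
    # Journey-scoped guideline: only active while this journey runs.
    await journey.create_guideline(
        condition="the customer uploads a medical document",
        action="extract a PatientProfile before moving to PRESCREEN",
    )
    # Global guideline: applies across all of this agent's journeys.
    await agent.create_guideline(
        condition="the customer asks for medical advice",
        action="clarify that TrialPath matches trials and does not give medical advice",
    )
    return journey
```

Because `agent` is injected, the function runs against an `AsyncMock` in tests without a Parlant server.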
+ ### 7.3 Integration Pattern: Streamlit + Parlant
+
+ ```
+ User Action (Streamlit UI)
+ -> st.session_state update
+ -> ParlantClient.send_message() or send_custom_event()
+ -> Parlant Server processes (async)
+ -> @st.fragment polls ParlantClient.poll_events()
+ -> New events update st.session_state
+ -> UI rerenders with new data
+ ```
+
+ This polling loop runs via `@st.fragment(run_every=3)` to avoid blocking the main app thread, providing near-real-time updates without full page reruns.
+
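The "new events update `st.session_state`" step is worth keeping as a pure function so the merge logic is testable outside Streamlit. A minimal sketch, where the state keys (`events`, `min_offset`) and the `ParlantClient` wrapper are illustrative, not a fixed schema:

```python
def apply_events(state: dict, new_events: list[dict]) -> dict:
    """Append polled Parlant events and advance the next poll offset."""
    state.setdefault("events", [])
    state.setdefault("min_offset", 0)
    for ev in new_events:
        state["events"].append(ev)
        # Advance past the newest event so the next poll skips what we have.
        state["min_offset"] = max(state["min_offset"], ev.get("offset", -1) + 1)
    return state


# Inside the polling fragment (assumes streamlit and a client wrapper exist):
#   @st.fragment(run_every=3)
#   def poll():
#       events = client.poll_events(min_offset=st.session_state["min_offset"])
#       apply_events(st.session_state, events)
```

Since `st.session_state` behaves like a dict, the same function works on a plain dict in tests.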
+ ---
+
+ ## References
+
+ - Streamlit source: DeepWiki analysis of `streamlit/streamlit`
+ - Parlant source: DeepWiki analysis of `emcie-co/parlant`
+ - Parlant official docs: https://www.parlant.io/docs/
+ - Parlant Sessions: https://www.parlant.io/docs/concepts/sessions/
+ - Parlant Conversation API: https://www.parlant.io/docs/engine-internals/conversation-api/
+ - Parlant GitHub: https://github.com/emcie-co/parlant
+ - Parlant Journey System: DeepWiki `emcie-co/parlant` section 5.2
+ - Parlant Guideline System: DeepWiki `emcie-co/parlant` section 5.1
+ - Parlant Tool Integration: DeepWiki `emcie-co/parlant` section 6
+ - Parlant NLP Providers: DeepWiki `emcie-co/parlant` section 10.1
pyproject.toml ADDED
@@ -0,0 +1,29 @@
+ [project]
+ name = "trialpath"
+ version = "0.1.0"
+ description = "AI-powered clinical trial matching for NSCLC patients"
+ requires-python = ">=3.11"
+ dependencies = [
+ "pydantic>=2.0",
+ "httpx>=0.27",
+ "streamlit>=1.40",
+ "pytest>=8.0",
+ "pytest-asyncio>=0.24",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+ "ruff>=0.8",
+ "pytest-cov>=6.0",
+ ]
+
+ [tool.ruff]
+ line-length = 100
+ target-version = "py311"
+
+ [tool.ruff.lint]
+ select = ["E", "F", "I", "W"]
+
+ [tool.pytest.ini_options]
+ testpaths = ["trialpath/tests", "app/tests"]
+ asyncio_mode = "auto"
trialpath/__init__.py ADDED
File without changes
trialpath/agent/__init__.py ADDED
File without changes
trialpath/models/__init__.py ADDED
File without changes
trialpath/services/__init__.py ADDED
File without changes
trialpath/tests/__init__.py ADDED
File without changes