TrialPath Data & Evaluation Pipeline: TDD Implementation Guide
Based on in-depth research into DeepWiki, the official TREC documentation, and the ir-measures / ir_datasets libraries
1. Pipeline Architecture Overview
1.1 Data Flow Diagram
┌───────────────────────────────────────────────────────────────────┐
│                    Data & Evaluation Pipeline                     │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐     │
│  │   Synthea    │───▶│ FHIR Bundle  │───▶│  PatientProfile  │     │
│  │  (Java CLI)  │    │    (JSON)    │    │  (JSON Schema)   │     │
│  └──────────────┘    └──────────────┘    └────────┬─────────┘     │
│                                                   │               │
│  ┌──────────────┐    ┌──────────────┐             ▼               │
│  │  LLM Letter  │───▶│  ReportLab   │───▶ Noisy Clinical PDFs     │
│  │  Generator   │    │  + Augraphy  │     (Letters/Labs/Path)     │
│  └──────────────┘    └──────────────┘                             │
│                                                                   │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐     │
│  │   MedGemma   │───▶│  Extracted   │───▶│   F1 Evaluator   │     │
│  │  Extractor   │    │   Profile    │    │  (scikit-learn)  │     │
│  └──────────────┘    └──────────────┘    └──────────────────┘     │
│                                                                   │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────┐     │
│  │ TREC Topics  │───▶│  TrialPath   │───▶│  TREC Evaluator  │     │
│  │ (ir_datasets)│    │   Matching   │    │  (ir-measures)   │     │
│  └──────────────┘    └──────────────┘    └──────────────────┘     │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
1.2 Module Relationships

| Module | Input | Output | Dependencies |
|---|---|---|---|
| data/generate_synthetic_patients.py | Synthea FHIR Bundles | PatientProfile JSON + Ground Truth | Synthea CLI, FHIR R4 |
| data/generate_noisy_pdfs.py | PatientProfile JSON | Clinical PDFs (with injected noise) | ReportLab, Augraphy |
| evaluation/run_trec_benchmark.py | TREC Topics + TrialPath Run | Recall@50, NDCG@10, P@10 | ir_datasets, ir-measures |
| evaluation/extraction_eval.py | Extracted vs Ground Truth Profiles | Field-level F1 | scikit-learn |
| evaluation/criterion_eval.py | EligibilityLedger vs Gold Standard | Criterion Accuracy | scikit-learn |
| evaluation/latency_cost_tracker.py | API call logs | Latency/Cost reports | time, logging |
1.3 Directory Structure

data/
├── generate_synthetic_patients.py   # Synthea FHIR → PatientProfile
├── generate_noisy_pdfs.py           # PatientProfile → Clinical PDFs
├── synthea_config/
│   ├── synthea.properties           # Synthea configuration
│   └── modules/
│       └── lung_cancer_extended.json  # Extended NSCLC module (incl. biomarkers)
├── templates/
│   ├── clinical_letter.py           # Clinical letter template
│   ├── pathology_report.py          # Pathology report template
│   ├── lab_report.py                # Lab report template
│   └── imaging_report.py            # Imaging report template
├── noise/
│   └── noise_injector.py            # Noise injection engine
└── output/
    ├── fhir/                        # Raw Synthea FHIR output
    ├── profiles/                    # Converted PatientProfile JSON
    ├── pdfs/                        # Generated clinical PDFs
    └── ground_truth/                # Annotation data
evaluation/
├── run_trec_benchmark.py            # TREC retrieval evaluation
├── extraction_eval.py               # MedGemma extraction F1
├── criterion_eval.py                # Criterion decision accuracy
├── latency_cost_tracker.py          # Latency and cost tracking
├── trec_data/
│   ├── topics2021.xml               # TREC 2021 topics
│   ├── qrels2021.txt                # TREC 2021 relevance judgments
│   └── topics2022.xml               # TREC 2022 topics
└── reports/                         # Evaluation report output
tests/
├── test_synthea_data.py             # Synthea data validation
├── test_pdf_generation.py           # PDF generation correctness
├── test_noise_injection.py          # Noise injection effects
├── test_trec_evaluation.py          # TREC metric computation
├── test_extraction_f1.py            # F1 computation tests
├── test_latency_cost.py             # Latency/cost tests
└── test_e2e_pipeline.py             # End-to-end pipeline test
2. Synthea Synthetic Patient Generation Guide
2.1 Synthea Overview
Synthea is an open-source synthetic patient simulator developed by MITRE and implemented in Java. It simulates disease trajectories via JSON state-machine modules and outputs standard FHIR R4 Bundles.
Key properties (source: DeepWiki, synthetichealth/synthea):
- Module-based disease simulation: each disease is defined as a JSON state machine
- Supports FHIR R4/STU3/DSTU2 export
- Ships with a built-in lung_cancer.json module: 85% NSCLC / 15% SCLC split, Stage I-IV staging, and chemotherapy/radiation treatment pathways
- Does not include NSCLC-specific biomarkers (EGFR, ALK, PD-L1, KRAS, ROS1), so a custom extension module is required
2.2 Installation and Configuration
System requirements:
- Java JDK 11 or later (LTS 11 or 17 recommended)
Installation option A: run the prebuilt JAR (recommended for data generation)
# Download the latest release JAR
# from https://github.com/synthetichealth/synthea/releases
wget https://github.com/synthetichealth/synthea/releases/download/master-branch-latest/synthea-with-dependencies.jar
# Verify the installation
java -jar synthea-with-dependencies.jar --help
Installation option B: build from source (needed when adding custom modules)
git clone https://github.com/synthetichealth/synthea.git
cd synthea
./gradlew build check test
2.3 NSCLC Module Configuration
2.3.1 Analysis of the Existing lung_cancer Module
Source: DeepWiki analysis of the lung_cancer.json module in synthetichealth/synthea:
- Entry condition: ages 45-65, selected by probability
- Diagnostic workflow: symptoms (chest pain, cough, shortness of breath) → chest X-ray → chest CT → biopsy/cytology
- Split: 85% NSCLC, 15% SCLC
- Staging: Stage I-IV, driven by the lung_cancer_nondiagnosis_counter attribute
- Treatment: NSCLC receives Cisplatin + Paclitaxel → radiation therapy
2.3.2 Custom NSCLC Biomarker Extension Module
Because the stock module contains no EGFR/ALK/PD-L1 or other biomarkers, an extension submodule must be created.
File: data/synthea_config/modules/lung_cancer_biomarkers.json
Based on the Synthea module state types researched via DeepWiki, the usable state types include:
- Initial: module entry point
- Terminal: module exit
- Observation: records a clinical observation (used here for biomarkers)
- SetAttribute: sets a patient attribute
- Guard: conditional gate
- Simple: plain transition state
- Encounter: clinical encounter state
Example structure for the biomarker Observation states:
{
"name": "NSCLC Biomarker Panel",
"states": {
"Initial": {
"type": "Initial",
"conditional_transition": [
{
"condition": {
"condition_type": "Attribute",
"attribute": "Lung Cancer Type",
"operator": "==",
"value": "NSCLC"
},
"transition": "EGFR_Test_Encounter"
},
{
"transition": "Terminal"
}
]
},
"EGFR_Test_Encounter": {
"type": "Encounter",
"encounter_class": "ambulatory",
"codes": [
{
"system": "SNOMED-CT",
"code": "185349003",
"display": "Encounter for check up"
}
],
"direct_transition": "EGFR_Mutation_Status"
},
"EGFR_Mutation_Status": {
"type": "Observation",
"category": "laboratory",
"codes": [
{
"system": "LOINC",
"code": "41103-3",
"display": "EGFR gene mutations found"
}
],
"distributed_transition": [
{
"distribution": 0.15,
"transition": "EGFR_Positive"
},
{
"distribution": 0.85,
"transition": "EGFR_Negative"
}
]
},
"EGFR_Positive": {
"type": "SetAttribute",
"attribute": "egfr_status",
"value": "positive",
"direct_transition": "ALK_Rearrangement_Status"
},
"EGFR_Negative": {
"type": "SetAttribute",
"attribute": "egfr_status",
"value": "negative",
"direct_transition": "ALK_Rearrangement_Status"
},
"ALK_Rearrangement_Status": {
"type": "Observation",
"category": "laboratory",
"codes": [
{
"system": "LOINC",
"code": "46264-8",
"display": "ALK gene rearrangement"
}
],
"distributed_transition": [
{
"distribution": 0.05,
"transition": "ALK_Positive"
},
{
"distribution": 0.95,
"transition": "ALK_Negative"
}
]
},
"ALK_Positive": {
"type": "SetAttribute",
"attribute": "alk_status",
"value": "positive",
"direct_transition": "PDL1_Expression"
},
"ALK_Negative": {
"type": "SetAttribute",
"attribute": "alk_status",
"value": "negative",
"direct_transition": "PDL1_Expression"
},
"PDL1_Expression": {
"type": "Observation",
"category": "laboratory",
"codes": [
{
"system": "LOINC",
"code": "85147-0",
"display": "PD-L1 by immune stain"
}
],
"distributed_transition": [
{
"distribution": 0.30,
"transition": "PDL1_High"
},
{
"distribution": 0.35,
"transition": "PDL1_Low"
},
{
"distribution": 0.35,
"transition": "PDL1_Negative"
}
]
},
"PDL1_High": {
"type": "SetAttribute",
"attribute": "pdl1_tps",
"value": ">=50%",
"direct_transition": "KRAS_Mutation_Status"
},
"PDL1_Low": {
"type": "SetAttribute",
"attribute": "pdl1_tps",
"value": "1-49%",
"direct_transition": "KRAS_Mutation_Status"
},
"PDL1_Negative": {
"type": "SetAttribute",
"attribute": "pdl1_tps",
"value": "<1%",
"direct_transition": "KRAS_Mutation_Status"
},
"KRAS_Mutation_Status": {
"type": "Observation",
"category": "laboratory",
"codes": [
{
"system": "LOINC",
"code": "21717-3",
"display": "KRAS gene mutations found"
}
],
"distributed_transition": [
{
"distribution": 0.25,
"transition": "KRAS_Positive"
},
{
"distribution": 0.75,
"transition": "KRAS_Negative"
}
]
},
"KRAS_Positive": {
"type": "SetAttribute",
"attribute": "kras_status",
"value": "positive",
"direct_transition": "Terminal"
},
"KRAS_Negative": {
"type": "SetAttribute",
"attribute": "kras_status",
"value": "negative",
"direct_transition": "Terminal"
},
"Terminal": {
"type": "Terminal"
}
}
}
Biomarker prevalence distribution (based on the NSCLC literature):

| Biomarker | Positive rate | LOINC code | Notes |
|---|---|---|---|
| EGFR mutation | ~15% | 41103-3 | Higher in never-smokers and women |
| ALK rearrangement | ~5% | 46264-8 | More common in young never-smokers |
| PD-L1 TPS >= 50% | ~30% | 85147-0 | Standard criterion for immunotherapy eligibility |
| KRAS G12C | ~13% | 21717-3 | Sotorasib target |
| ROS1 fusion | ~1-2% | 46265-5 | Crizotinib target |
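The distributed_transition weights in the module above can be sanity-checked against this table with a quick Monte Carlo draw. A minimal stdlib sketch (the sampler here is an illustrative stand-in, not Synthea's own transition engine):

```python
import random

# distributed_transition weights copied from the biomarker module above
BIOMARKER_DISTRIBUTIONS = {
    "egfr": [("positive", 0.15), ("negative", 0.85)],
    "alk": [("positive", 0.05), ("negative", 0.95)],
    "pdl1_tps": [(">=50%", 0.30), ("1-49%", 0.35), ("<1%", 0.35)],
    "kras": [("positive", 0.25), ("negative", 0.75)],
}

def sample_biomarkers(rng: random.Random) -> dict[str, str]:
    """Draw one biomarker panel the way distributed_transition would."""
    panel = {}
    for marker, dist in BIOMARKER_DISTRIBUTIONS.items():
        values, weights = zip(*dist)
        panel[marker] = rng.choices(values, weights=weights, k=1)[0]
    return panel

def empirical_rate(marker: str, value: str, n: int = 50_000, seed: int = 42) -> float:
    """Fraction of n simulated patients with the given marker value."""
    rng = random.Random(seed)
    hits = sum(sample_biomarkers(rng)[marker] == value for _ in range(n))
    return hits / n
```

With 50,000 draws the empirical rates should land within about one percentage point of the configured weights, which makes this a cheap regression check for edited module JSON.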
2.4 Batch Generation Command
# Generate 500 NSCLC patients (seeded for reproducibility)
java -jar synthea-with-dependencies.jar \
-p 500 \
-s 42 \
-m lung_cancer \
--exporter.fhir.export=true \
--exporter.fhir_stu3.export=false \
--exporter.fhir_dstu2.export=false \
--exporter.ccda.export=false \
--exporter.csv.export=false \
--exporter.hospital.fhir.export=false \
--exporter.practitioner.fhir.export=false \
--exporter.pretty_print=true \
Massachusetts
# Parameter notes:
# -p 500          : generate 500 patients
# -s 42           : random seed (reproducible)
# -m lung_cancer  : run only the lung_cancer module
# --exporter.fhir.export=true : enable FHIR R4 export
# Massachusetts   : generation region
Output location: one JSON file per patient under ./output/fhir/.
2.5 FHIR Bundle Output Format
Source: DeepWiki analysis of the synthetichealth/synthea FHIR export system.
Top-level structure:
{
"resourceType": "Bundle",
"type": "transaction",
"entry": [
{
"fullUrl": "urn:uuid:patient-uuid-here",
"resource": { "resourceType": "Patient", ... },
"request": { "method": "POST", "url": "Patient" }
},
{
"fullUrl": "urn:uuid:condition-uuid-here",
"resource": { "resourceType": "Condition", ... },
"request": { "method": "POST", "url": "Condition" }
}
]
}
FHIR resource types generated by Synthea (confirmed via DeepWiki):
- Patient: basic demographics
- Condition: diagnoses (e.g., NSCLC)
- Observation: laboratory results and vital signs
- MedicationRequest: medication orders
- Procedure: surgeries and procedures
- DiagnosticReport: diagnostic reports
- DocumentReference: clinical documents (available when the US Core IG is enabled)
- Encounter: visit records
- AllergyIntolerance: allergy history
- Immunization: immunizations
- CarePlan: care plans
- ImagingStudy: imaging studies
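A quick way to check which of these resource types a generated bundle actually contains is to tally resourceType across the Bundle entries. A minimal stdlib sketch (useful in test_synthea_data.py-style validation):

```python
import json
from collections import Counter
from pathlib import Path

def count_resource_types(bundle: dict) -> Counter:
    """Tally resourceType across all entries of a FHIR Bundle dict."""
    return Counter(
        entry.get("resource", {}).get("resourceType", "Unknown")
        for entry in bundle.get("entry", [])
    )

def count_resource_types_in_file(path: Path) -> Counter:
    """Load one Synthea output file and tally its resource types."""
    with open(path) as f:
        return count_resource_types(json.load(f))
```

Running this over ./output/fhir/*.json gives a quick profile of each patient's bundle, e.g. confirming every bundle has exactly one Patient resource.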
2.6 Mapping FHIR Resources to PatientProfile
# Mapping logic in data/generate_synthetic_patients.py
FHIR_TO_PATIENT_PROFILE_MAP = {
# Patient Resource โ demographics
"Patient.name": "demographics.name",
"Patient.gender": "demographics.sex",
"Patient.birthDate": "demographics.date_of_birth",
"Patient.address.state": "demographics.state",
# Condition Resource โ diagnosis
"Condition[code=SNOMED:254637007]": "diagnosis.primary", # NSCLC
"Condition.stage.summary": "diagnosis.stage",
"Condition.bodySite": "diagnosis.histology",
# Observation Resources โ biomarkers
"Observation[code=LOINC:41103-3]": "biomarkers.egfr",
"Observation[code=LOINC:46264-8]": "biomarkers.alk",
"Observation[code=LOINC:85147-0]": "biomarkers.pdl1_tps",
"Observation[code=LOINC:21717-3]": "biomarkers.kras",
# Observation Resources โ labs
"Observation[category=laboratory]": "labs[]",
# MedicationRequest โ prior_treatments
"MedicationRequest.medicationCodeableConcept": "treatments[].medication",
# Procedure โ prior_treatments
"Procedure.code": "treatments[].procedure",
}
Conversion function pattern:
import json
from pathlib import Path
from dataclasses import dataclass, field, asdict
from typing import Optional
@dataclass
class Demographics:
name: str = ""
sex: str = ""
date_of_birth: str = ""
age: int = 0
state: str = ""
@dataclass
class Diagnosis:
primary: str = ""
stage: str = ""
histology: str = ""
diagnosis_date: str = ""
@dataclass
class Biomarkers:
egfr: Optional[str] = None
alk: Optional[str] = None
pdl1_tps: Optional[str] = None
kras: Optional[str] = None
ros1: Optional[str] = None
@dataclass
class LabResult:
name: str = ""
value: float = 0.0
unit: str = ""
date: str = ""
loinc_code: str = ""
@dataclass
class Treatment:
name: str = ""
type: str = "" # "medication" | "procedure" | "radiation"
start_date: str = ""
end_date: Optional[str] = None
@dataclass
class PatientProfile:
patient_id: str = ""
demographics: Demographics = field(default_factory=Demographics)
diagnosis: Diagnosis = field(default_factory=Diagnosis)
biomarkers: Biomarkers = field(default_factory=Biomarkers)
labs: list[LabResult] = field(default_factory=list)
treatments: list[Treatment] = field(default_factory=list)
unknowns: list[str] = field(default_factory=list)
evidence_spans: list[dict] = field(default_factory=list)
def parse_fhir_bundle(fhir_path: Path) -> PatientProfile:
"""Parse a Synthea FHIR Bundle JSON into PatientProfile."""
with open(fhir_path) as f:
bundle = json.load(f)
profile = PatientProfile()
entries = bundle.get("entry", [])
for entry in entries:
resource = entry.get("resource", {})
resource_type = resource.get("resourceType")
if resource_type == "Patient":
_parse_patient(resource, profile)
elif resource_type == "Condition":
_parse_condition(resource, profile)
elif resource_type == "Observation":
_parse_observation(resource, profile)
elif resource_type == "MedicationRequest":
_parse_medication(resource, profile)
elif resource_type == "Procedure":
_parse_procedure(resource, profile)
return profile
def _parse_patient(resource: dict, profile: PatientProfile):
"""Extract demographics from Patient resource."""
names = resource.get("name", [{}])
if names:
given = " ".join(names[0].get("given", []))
family = names[0].get("family", "")
profile.demographics.name = f"{given} {family}".strip()
profile.demographics.sex = resource.get("gender", "")
profile.demographics.date_of_birth = resource.get("birthDate", "")
profile.patient_id = resource.get("id", "")
addresses = resource.get("address", [{}])
if addresses:
profile.demographics.state = addresses[0].get("state", "")
def _parse_condition(resource: dict, profile: PatientProfile):
"""Extract diagnosis from Condition resource."""
code = resource.get("code", {})
codings = code.get("coding", [])
for coding in codings:
# SNOMED codes for lung cancer
if coding.get("code") in ["254637007", "254632001"]:
profile.diagnosis.primary = coding.get("display", "")
onset = resource.get("onsetDateTime", "")
profile.diagnosis.diagnosis_date = onset
# Extract stage if available
stage_info = resource.get("stage", [])
if stage_info:
summary = stage_info[0].get("summary", {})
stage_codings = summary.get("coding", [])
if stage_codings:
profile.diagnosis.stage = stage_codings[0].get("display", "")
def _parse_observation(resource: dict, profile: PatientProfile):
"""Extract labs and biomarkers from Observation resource."""
code = resource.get("code", {})
codings = code.get("coding", [])
category_list = resource.get("category", [])
is_lab = any(
cat_coding.get("code") == "laboratory"
for cat in category_list
for cat_coding in cat.get("coding", [])
)
for coding in codings:
loinc = coding.get("code", "")
display = coding.get("display", "")
# Biomarker mappings
biomarker_map = {
"41103-3": "egfr",
"46264-8": "alk",
"85147-0": "pdl1_tps",
"21717-3": "kras",
"46265-5": "ros1",
}
if loinc in biomarker_map:
value_cc = resource.get("valueCodeableConcept", {})
value_codings = value_cc.get("coding", [])
value_str = value_codings[0].get("display", "") if value_codings else ""
setattr(profile.biomarkers, biomarker_map[loinc], value_str)
elif is_lab:
value_qty = resource.get("valueQuantity", {})
lab = LabResult(
name=display,
value=value_qty.get("value", 0.0),
unit=value_qty.get("unit", ""),
date=resource.get("effectiveDateTime", ""),
loinc_code=loinc,
)
profile.labs.append(lab)
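Since parse_fhir_bundle returns dataclasses, persisting PatientProfile JSON plus ground truth is just asdict + json over the output/ layout above. A minimal round-trip sketch (MiniProfile is a hypothetical stand-in so the snippet is self-contained; the real code would use PatientProfile as defined above):

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path

@dataclass
class MiniProfile:
    """Illustrative stand-in for PatientProfile (defined earlier)."""
    patient_id: str = ""
    biomarkers: dict = field(default_factory=dict)

def save_profile(profile, out_dir: Path) -> Path:
    """Serialize a profile dataclass to <out_dir>/<patient_id>.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{profile.patient_id}.json"
    path.write_text(json.dumps(asdict(profile), indent=2))
    return path

def load_profile_dict(path: Path) -> dict:
    """Reload a serialized profile as a plain dict."""
    return json.loads(path.read_text())
```

The same serialized dict doubles as the ground-truth record under output/ground_truth/, which is what the extraction evaluator in section 5 compares against.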
3. Synthetic PDF Generation Pipeline
3.1 Overview
Goal: convert each PatientProfile into realistic clinical document PDFs and inject controlled noise to simulate real-world OCR conditions.
Technology stack:
- ReportLab (pip install reportlab): PDF generation engine; supports SimpleDocTemplate, Table, Paragraph, and other Platypus flowables
- Augraphy (pip install augraphy): document-image degradation pipeline that simulates print, fax, and scan noise
- Pillow (pip install Pillow): image processing
- pdf2image (pip install pdf2image): PDF-to-image conversion (for noise injection and re-rendering back to PDF)
3.2 Clinical Letter Template
# data/templates/clinical_letter.py
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.platypus import (
SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle
)
from reportlab.lib import colors
def generate_clinical_letter(profile: dict, output_path: str):
"""Generate a clinical letter PDF from PatientProfile."""
doc = SimpleDocTemplate(output_path, pagesize=letter,
topMargin=1*inch, bottomMargin=1*inch)
styles = getSampleStyleSheet()
story = []
# Header
header_style = ParagraphStyle(
'Header', parent=styles['Heading1'], fontSize=14,
spaceAfter=6
)
story.append(Paragraph("Clinical Summary Letter", header_style))
story.append(Spacer(1, 12))
# Patient Info
info_data = [
["Patient Name:", profile["demographics"]["name"]],
["Date of Birth:", profile["demographics"]["date_of_birth"]],
["Sex:", profile["demographics"]["sex"]],
["MRN:", profile["patient_id"]],
]
info_table = Table(info_data, colWidths=[2*inch, 4*inch])
info_table.setStyle(TableStyle([
('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
('FONTNAME', (1, 0), (1, -1), 'Helvetica'),
('FONTSIZE', (0, 0), (-1, -1), 10),
('VALIGN', (0, 0), (-1, -1), 'TOP'),
]))
story.append(info_table)
story.append(Spacer(1, 18))
# Diagnosis Section
story.append(Paragraph("Diagnosis", styles['Heading2']))
dx = profile.get("diagnosis", {})
dx_text = (
f"Primary: {dx.get('primary', 'Unknown')}. "
f"Stage: {dx.get('stage', 'Unknown')}. "
f"Histology: {dx.get('histology', 'Unknown')}. "
f"Diagnosed: {dx.get('diagnosis_date', 'Unknown')}."
)
story.append(Paragraph(dx_text, styles['Normal']))
story.append(Spacer(1, 12))
# Biomarkers Section
story.append(Paragraph("Molecular Testing", styles['Heading2']))
bm = profile.get("biomarkers", {})
bm_data = [["Biomarker", "Result"]]
for marker, value in bm.items():
if value is not None:
bm_data.append([marker.upper(), str(value)])
if len(bm_data) > 1:
bm_table = Table(bm_data, colWidths=[2.5*inch, 3.5*inch])
bm_table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), colors.lightgrey),
('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
('FONTSIZE', (0, 0), (-1, -1), 10),
]))
story.append(bm_table)
story.append(Spacer(1, 12))
# Treatment History
story.append(Paragraph("Treatment History", styles['Heading2']))
treatments = profile.get("treatments", [])
for tx in treatments:
tx_text = f"- {tx['name']} ({tx['type']}): {tx.get('start_date', '')}"
story.append(Paragraph(tx_text, styles['Normal']))
doc.build(story)
3.3 Pathology Report Template
# data/templates/pathology_report.py
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table
def generate_pathology_report(profile: dict, output_path: str):
"""Generate a pathology report PDF."""
doc = SimpleDocTemplate(output_path, pagesize=letter)
styles = getSampleStyleSheet()
story = []
story.append(Paragraph("SURGICAL PATHOLOGY REPORT", styles['Title']))
story.append(Spacer(1, 12))
# Specimen Info
spec_data = [
["Specimen:", "Right lung, upper lobe, wedge resection"],
["Procedure:", "CT-guided needle biopsy"],
["Date:", profile["diagnosis"]["diagnosis_date"]],
]
spec_table = Table(spec_data, colWidths=[2*inch, 4*inch])
story.append(spec_table)
story.append(Spacer(1, 12))
# Final Diagnosis
story.append(Paragraph("FINAL DIAGNOSIS", styles['Heading2']))
story.append(Paragraph(
f"Non-small cell lung carcinoma, {profile['diagnosis'].get('histology', 'adenocarcinoma')}, "
f"{profile['diagnosis'].get('stage', 'Stage IIIA')}",
styles['Normal']
))
# Biomarker Results
story.append(Spacer(1, 12))
story.append(Paragraph("MOLECULAR/IMMUNOHISTOCHEMISTRY", styles['Heading2']))
bm = profile.get("biomarkers", {})
results = []
if bm.get("egfr"):
results.append(f"EGFR mutation analysis: {bm['egfr']}")
if bm.get("alk"):
results.append(f"ALK rearrangement (FISH): {bm['alk']}")
if bm.get("pdl1_tps"):
results.append(f"PD-L1 (22C3, TPS): {bm['pdl1_tps']}")
if bm.get("kras"):
results.append(f"KRAS mutation analysis: {bm['kras']}")
for r in results:
story.append(Paragraph(r, styles['Normal']))
doc.build(story)
3.4 Laboratory Report Template
# data/templates/lab_report.py
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle
from reportlab.lib import colors
def generate_lab_report(profile: dict, output_path: str):
"""Generate a laboratory report PDF with CBC, CMP, etc."""
doc = SimpleDocTemplate(output_path, pagesize=letter)
styles = getSampleStyleSheet()
story = []
story.append(Paragraph("LABORATORY REPORT", styles['Title']))
story.append(Spacer(1, 12))
# Lab Results Table
lab_data = [["Test", "Result", "Unit", "Reference Range", "Date"]]
for lab in profile.get("labs", []):
lab_data.append([
lab["name"], str(lab["value"]), lab["unit"],
"", # Reference range (can be added)
lab["date"][:10] if lab["date"] else ""
])
if len(lab_data) > 1:
lab_table = Table(lab_data, colWidths=[2*inch, 1*inch, 0.8*inch, 1.2*inch, 1*inch])
lab_table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#003366')),
('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
('FONTSIZE', (0, 0), (-1, -1), 9),
('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.HexColor('#f0f0f0')]),
]))
story.append(lab_table)
doc.build(story)
3.5 Noise Injection Strategy
# data/noise/noise_injector.py
import random
import re
from pathlib import Path
from PIL import Image
# Augraphy pipeline configuration
try:
from augraphy import (
AugraphyPipeline, InkBleed, Letterpress, LowInkPeriodicLines,
DirtyDrum, SubtleNoise, Jpeg, Brightness, BleedThrough
)
AUGRAPHY_AVAILABLE = True
except ImportError:
AUGRAPHY_AVAILABLE = False
class NoiseInjector:
    """Controlled noise injection engine simulating real-world document degradation."""

    # Common OCR error mappings
OCR_ERROR_MAP = {
"0": ["O", "o", "Q"],
"1": ["l", "I", "|"],
"5": ["S", "s"],
"8": ["B"],
"O": ["0", "Q"],
"l": ["1", "I", "|"],
"rn": ["m"],
"cl": ["d"],
"vv": ["w"],
}
    # Medical abbreviation substitutions
ABBREVIATION_MAP = {
"non-small cell lung cancer": ["NSCLC", "non-small cell ca", "NSCC"],
"adenocarcinoma": ["adeno", "adenoca", "adeno ca"],
"squamous cell carcinoma": ["SCC", "squamous ca", "sq cell ca"],
"Eastern Cooperative Oncology Group": ["ECOG"],
"performance status": ["PS", "perf status"],
"milligrams per deciliter": ["mg/dL", "mg/dl"],
"computed tomography": ["CT", "cat scan"],
}
    # Noise level configuration
NOISE_LEVELS = {
"clean": {"ocr_rate": 0.0, "abbrev_rate": 0.0, "missing_rate": 0.0},
"mild": {"ocr_rate": 0.02, "abbrev_rate": 0.1, "missing_rate": 0.05},
"moderate": {"ocr_rate": 0.05, "abbrev_rate": 0.2, "missing_rate": 0.1},
"severe": {"ocr_rate": 0.10, "abbrev_rate": 0.3, "missing_rate": 0.2},
}
def __init__(self, noise_level: str = "mild", seed: int = 42):
self.config = self.NOISE_LEVELS[noise_level]
self.rng = random.Random(seed)
def inject_text_noise(self, text: str) -> tuple[str, list[dict]]:
"""Inject OCR errors and abbreviations into text.
Returns (noisy_text, list_of_injected_noise_records).
"""
noise_records = []
chars = list(text)
        # OCR character substitutions; check two-character sequences first so
        # multi-character keys in OCR_ERROR_MAP (e.g. "rn" -> "m") can match.
        i = 0
        while i < len(chars):
            if self.rng.random() < self.config["ocr_rate"]:
                pair = "".join(chars[i:i + 2])
                key = pair if pair in self.OCR_ERROR_MAP else chars[i]
                if key in self.OCR_ERROR_MAP:
                    replacement = self.rng.choice(self.OCR_ERROR_MAP[key])
                    chars[i:i + len(key)] = list(replacement)
                    noise_records.append({
                        "type": "ocr_error",
                        "position": i,
                        "original": key,
                        "replacement": replacement,
                    })
            i += 1
noisy_text = "".join(chars)
# Abbreviation substitutions
for full_form, abbreviations in self.ABBREVIATION_MAP.items():
if full_form in noisy_text.lower() and self.rng.random() < self.config["abbrev_rate"]:
abbrev = self.rng.choice(abbreviations)
noisy_text = re.sub(
re.escape(full_form), abbrev, noisy_text, count=1, flags=re.IGNORECASE
)
noise_records.append({
"type": "abbreviation",
"original": full_form,
"replacement": abbrev,
})
return noisy_text, noise_records
def inject_missing_values(self, profile: dict) -> tuple[dict, list[str]]:
"""Randomly remove fields from profile to simulate missing data.
Returns (modified_profile, list_of_removed_fields).
"""
removed = []
removable_fields = [
("biomarkers", "egfr"),
("biomarkers", "alk"),
("biomarkers", "pdl1_tps"),
("biomarkers", "kras"),
("biomarkers", "ros1"),
("diagnosis", "stage"),
("diagnosis", "histology"),
]
for section, field_name in removable_fields:
if self.rng.random() < self.config["missing_rate"]:
if section in profile and field_name in profile[section]:
profile[section][field_name] = None
removed.append(f"{section}.{field_name}")
return profile, removed
def degrade_image(self, image: Image.Image) -> Image.Image:
"""Apply Augraphy degradation pipeline to document image."""
if not AUGRAPHY_AVAILABLE:
return image
import numpy as np
img_array = np.array(image)
pipeline = AugraphyPipeline(
ink_phase=[
InkBleed(p=0.5),
Letterpress(p=0.3),
LowInkPeriodicLines(p=0.3),
],
paper_phase=[
SubtleNoise(p=0.5),
],
post_phase=[
DirtyDrum(p=0.3),
Brightness(p=0.5),
Jpeg(p=0.5),
],
)
degraded = pipeline(img_array)
return Image.fromarray(degraded)
4. TREC Benchmark Evaluation Guide
4.1 Dataset Overview
TREC Clinical Trials Track 2021:
- Source: the NIST Text REtrieval Conference
- Topics (queries): 75 synthetic patient descriptions (5-10 sentence admission notes)
- Document collection: ~375,000 clinical trials (April 2021 snapshot of ClinicalTrials.gov)
- Qrels: 35,832 relevance judgments
- Relevance labels: 0 = not relevant, 1 = excluded, 2 = eligible
TREC Clinical Trials Track 2022:
- Topics: 50 synthetic patient descriptions
- Uses the same document collection snapshot
4.2 Data Formats
Topics XML format:
<topics task="2021 TREC Clinical Trials">
<topic number="1">
A 62-year-old male presents with a 3-month history of
progressive dyspnea and a 20-pound weight loss. He has
a 40 pack-year smoking history. CT chest reveals a 4.5cm
right upper lobe mass with mediastinal lymphadenopathy.
Biopsy confirms non-small cell lung cancer, adenocarcinoma.
EGFR mutation testing is positive for exon 19 deletion.
PD-L1 TPS is 60%. ECOG performance status is 1.
</topic>
<topic number="2">
...
</topic>
</topics>
Qrels format (whitespace-separated columns):
topic_id 0 doc_id relevance
1 0 NCT00760162 2
1 0 NCT01234567 1
1 0 NCT09876543 0
- Column 1: topic number
- Column 2: fixed value 0 (iteration)
- Column 3: NCT document ID
- Column 4: relevance (0 = not relevant, 1 = excluded, 2 = eligible)
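Reading this four-column format into the nested dict shape used throughout this guide takes a few lines of stdlib; a minimal sketch (ir-measures ships its own read_trec_qrels reader, so this is just for quick inspection and tests):

```python
def parse_qrels(lines) -> dict[str, dict[str, int]]:
    """Parse TREC qrels lines (topic_id, iteration, doc_id, relevance)
    into {topic_id: {doc_id: relevance}}."""
    qrels: dict[str, dict[str, int]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        topic_id, _iteration, doc_id, relevance = line.split()
        qrels.setdefault(topic_id, {})[doc_id] = int(relevance)
    return qrels
```

Passing open("qrels2021.txt") directly works, since a file object iterates line by line.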
Run submission format:
TOPIC_NO Q0 NCT_ID RANK SCORE RUN_NAME
1 Q0 NCT00760162 1 0.9999 trialpath-v1
1 Q0 NCT01234567 2 0.9998 trialpath-v1
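Before submission it is worth validating each run line against this six-column shape. A minimal stdlib checker (an illustrative helper, not an official TREC validator):

```python
def validate_run_line(line: str) -> tuple[str, str, int, float]:
    """Validate one TREC run line and return (topic, nct_id, rank, score)."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 columns, got {len(parts)}: {line!r}")
    topic, q0, nct_id, rank, score, _run_name = parts
    if q0 != "Q0":
        raise ValueError(f"column 2 must be 'Q0', got {q0!r}")
    if not nct_id.startswith("NCT"):
        raise ValueError(f"doc id should be an NCT id, got {nct_id!r}")
    # int()/float() raise ValueError on malformed rank or score columns.
    return topic, nct_id, int(rank), float(score)
```

Running it over every line of a run file before handing the file to ir-measures catches formatting slips early.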
4.3 Loading Data with ir_datasets
# evaluation/run_trec_benchmark.py
import ir_datasets
def load_trec_2021():
"""Load TREC CT 2021 topics and qrels via ir_datasets."""
dataset = ir_datasets.load("clinicaltrials/2021/trec-ct-2021")
    # Load topics (GenericQuery: query_id, text)
topics = {}
for query in dataset.queries_iter():
topics[query.query_id] = query.text
    # Load qrels (TrecQrel: query_id, doc_id, relevance, iteration)
qrels = {}
for qrel in dataset.qrels_iter():
if qrel.query_id not in qrels:
qrels[qrel.query_id] = {}
qrels[qrel.query_id][qrel.doc_id] = qrel.relevance
return topics, qrels
def load_trec_2022():
"""Load TREC CT 2022 topics and qrels."""
dataset = ir_datasets.load("clinicaltrials/2021/trec-ct-2022")
topics = {q.query_id: q.text for q in dataset.queries_iter()}
qrels = {}
for qrel in dataset.qrels_iter():
if qrel.query_id not in qrels:
qrels[qrel.query_id] = {}
qrels[qrel.query_id][qrel.doc_id] = qrel.relevance
return topics, qrels
def load_trial_documents():
"""Load the clinical trial documents from ir_datasets."""
dataset = ir_datasets.load("clinicaltrials/2021")
# ClinicalTrialsDoc: doc_id, title, condition, summary,
# detailed_description, eligibility
docs = {}
for doc in dataset.docs_iter():
docs[doc.doc_id] = {
"title": doc.title,
"condition": doc.condition,
"summary": doc.summary,
"detailed_description": doc.detailed_description,
"eligibility": doc.eligibility,
}
return docs
4.4 Converting TrialPath Output to TREC Run Format
def convert_trialpath_to_trec_run(
results: dict[str, list[dict]],
run_name: str = "trialpath-v1"
) -> str:
"""Convert TrialPath matching results to TREC run format.
Args:
results: {topic_id: [{"nct_id": str, "score": float}, ...]}
run_name: Run identifier
Returns:
TREC-format run string
"""
lines = []
for topic_id, candidates in results.items():
sorted_candidates = sorted(candidates, key=lambda x: x["score"], reverse=True)
for rank, candidate in enumerate(sorted_candidates[:1000], 1):
lines.append(
f"{topic_id} Q0 {candidate['nct_id']} {rank} "
f"{candidate['score']:.6f} {run_name}"
)
return "\n".join(lines)
def save_trec_run(run_str: str, output_path: str):
"""Save TREC run to file."""
with open(output_path, 'w') as f:
f.write(run_str)
4.5 Computing Evaluation Metrics with ir-measures
# evaluation/run_trec_benchmark.py (continued)
import ir_measures
from ir_measures import nDCG, P, Recall, AP, RR
def evaluate_trec_run(
qrels_path: str,
run_path: str,
) -> dict:
"""Evaluate a TREC run using ir-measures.
Target metrics:
- Recall@50 >= 0.75
- NDCG@10 >= 0.60
- P@10 (informational)
"""
qrels = list(ir_measures.read_trec_qrels(qrels_path))
run = list(ir_measures.read_trec_run(run_path))
    # Define target metrics
measures = [
nDCG@10, # Target >= 0.60
Recall@50, # Target >= 0.75
P@10, # Precision at 10
AP, # Mean Average Precision
RR, # Reciprocal Rank
nDCG@20, # Additional depth
Recall@100, # Extended recall
]
    # Compute aggregate metrics
aggregate = ir_measures.calc_aggregate(measures, qrels, run)
    # Compute per-query metrics
per_query = {}
for metric in ir_measures.iter_calc(measures, qrels, run):
qid = metric.query_id
if qid not in per_query:
per_query[qid] = {}
per_query[qid][str(metric.measure)] = metric.value
return {
"aggregate": {str(k): v for k, v in aggregate.items()},
"per_query": per_query,
"pass_fail": {
"ndcg@10": aggregate.get(nDCG@10, 0) >= 0.60,
"recall@50": aggregate.get(Recall@50, 0) >= 0.75,
}
}
def evaluate_with_eligibility_levels(
qrels_path: str,
run_path: str,
) -> dict:
"""Evaluate with TREC CT graded relevance (0=NR, 1=Excluded, 2=Eligible).
Uses rel=2 for strict eligible-only evaluation.
"""
qrels = list(ir_measures.read_trec_qrels(qrels_path))
run = list(ir_measures.read_trec_run(run_path))
# Standard evaluation (relevance >= 1)
standard_measures = [nDCG@10, Recall@50, P@10]
standard = ir_measures.calc_aggregate(standard_measures, qrels, run)
# Strict evaluation (only eligible = relevance 2)
strict_measures = [
AP(rel=2),
P(rel=2)@10,
Recall(rel=2)@50,
]
strict = ir_measures.calc_aggregate(strict_measures, qrels, run)
return {
"standard": {str(k): v for k, v in standard.items()},
"strict_eligible_only": {str(k): v for k, v in strict.items()},
}
4.6 Evaluating from In-Memory Dicts with ir-measures
def evaluate_from_dicts(
qrels_dict: dict[str, dict[str, int]],
run_dict: dict[str, list[tuple[str, float]]],
) -> dict:
"""Evaluate using Python dict format (no files needed).
Args:
qrels_dict: {query_id: {doc_id: relevance}}
run_dict: {query_id: [(doc_id, score), ...]}
"""
# Convert to ir-measures format
qrels = [
ir_measures.Qrel(qid, did, rel)
for qid, docs in qrels_dict.items()
for did, rel in docs.items()
]
run = [
ir_measures.ScoredDoc(qid, did, score)
for qid, docs in run_dict.items()
for did, score in docs
]
measures = [nDCG@10, Recall@50, P@10, AP]
aggregate = ir_measures.calc_aggregate(measures, qrels, run)
return {str(k): v for k, v in aggregate.items()}
5. MedGemma Extraction Evaluation
5.1 Annotation Dataset Design
# evaluation/extraction_eval.py
from dataclasses import dataclass
from typing import Optional
@dataclass
class AnnotatedField:
"""A single annotated field with ground truth and extraction result."""
field_name: str # e.g., "biomarkers.egfr"
ground_truth: Optional[str] # From Synthea profile (gold standard)
extracted: Optional[str] # From MedGemma extraction
evidence_span: Optional[str] # Text span in source document
source_page: Optional[int] # Page number in PDF
@dataclass
class ExtractionAnnotation:
"""Complete annotation for one patient's extraction."""
patient_id: str
fields: list[AnnotatedField]
noise_level: str # "clean", "mild", "moderate", "severe"
document_type: str # "clinical_letter", "pathology_report", etc.
Annotation dataset structure:
{
"patient_id": "synth-001",
"noise_level": "mild",
"document_type": "clinical_letter",
"fields": [
{
"field_name": "demographics.name",
"ground_truth": "John Smith",
"extracted": "John Smith",
"correct": true
},
{
"field_name": "diagnosis.stage",
"ground_truth": "Stage IIIA",
"extracted": "Stage 3A",
"correct": true,
"note": "Equivalent representation"
},
{
"field_name": "biomarkers.egfr",
"ground_truth": "Exon 19 deletion",
"extracted": "EGFR positive",
"correct": false,
"note": "Partial extraction - missing specific mutation"
}
]
}
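As a self-contained sketch, the JSON above maps onto the §5.1 dataclass like this (a trimmed re-declaration of `AnnotatedField` is included so the snippet runs standalone; the `load_annotation` helper is ours):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotatedField:
    # Trimmed copy of the section 5.1 dataclass so this snippet runs alone.
    field_name: str
    ground_truth: Optional[str]
    extracted: Optional[str]

def load_annotation(raw: dict) -> list:
    """Build AnnotatedField records from one annotation JSON dict."""
    return [
        AnnotatedField(
            field_name=f["field_name"],
            ground_truth=f.get("ground_truth"),
            extracted=f.get("extracted"),
        )
        for f in raw["fields"]
    ]

raw = {
    "patient_id": "synth-001",
    "fields": [
        {"field_name": "demographics.name",
         "ground_truth": "John Smith", "extracted": "John Smith"},
    ],
}
fields = load_annotation(raw)
print(fields[0].field_name)  # demographics.name
```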
5.2 ๅญๆฎต็บง F1 ่ฎก็ฎ
# evaluation/extraction_eval.py
from sklearn.metrics import (
f1_score, precision_score, recall_score,
classification_report, confusion_matrix
)
import numpy as np
# ๅฎไนๆๆๅฏๆๅๅญๆฎต
EXTRACTION_FIELDS = [
"demographics.name",
"demographics.sex",
"demographics.date_of_birth",
"demographics.age",
"diagnosis.primary",
"diagnosis.stage",
"diagnosis.histology",
"biomarkers.egfr",
"biomarkers.alk",
"biomarkers.pdl1_tps",
"biomarkers.kras",
"biomarkers.ros1",
"labs.wbc",
"labs.hemoglobin",
"labs.platelets",
"labs.creatinine",
"labs.alt",
"labs.ast",
"treatments.current_regimen",
"performance_status.ecog",
]
def compute_field_level_f1(
annotations: list[dict],
) -> dict:
"""Compute field-level F1, precision, recall.
    Labels are binarized per field occurrence:
    - y_true = 1 when a ground-truth value exists
    - y_pred = 1 when the extraction was judged correct
    So TP = field present and correctly extracted; FN = field present but
    missed or mismatched; FP = judged correct despite no ground truth.
Args:
annotations: List of patient annotation dicts
Returns:
Per-field and aggregate metrics
"""
field_metrics = {}
for field_name in EXTRACTION_FIELDS:
y_true = [] # 1 if field has ground truth value
y_pred = [] # 1 if field was correctly extracted
for ann in annotations:
fields = {f["field_name"]: f for f in ann["fields"]}
if field_name in fields:
f = fields[field_name]
has_gt = f["ground_truth"] is not None
is_correct = f.get("correct", False)
y_true.append(1 if has_gt else 0)
y_pred.append(1 if is_correct else 0)
if len(y_true) > 0:
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
field_metrics[field_name] = {
"precision": round(precision, 4),
"recall": round(recall, 4),
"f1": round(f1, 4),
"support": sum(y_true),
}
# Aggregate metrics
all_y_true = []
all_y_pred = []
for ann in annotations:
for f in ann["fields"]:
has_gt = f["ground_truth"] is not None
is_correct = f.get("correct", False)
all_y_true.append(1 if has_gt else 0)
all_y_pred.append(1 if is_correct else 0)
    if not all_y_true:
        # Guard: sklearn metrics and np.mean misbehave on empty input
        return {"per_field": field_metrics, "micro_f1": 0.0, "macro_f1": 0.0,
                "total_fields": 0, "pass": False}
    micro_f1 = f1_score(all_y_true, all_y_pred, zero_division=0)
    macro_f1 = (float(np.mean([m["f1"] for m in field_metrics.values()]))
                if field_metrics else 0.0)
return {
"per_field": field_metrics,
"micro_f1": round(micro_f1, 4),
"macro_f1": round(macro_f1, 4),
"total_fields": len(all_y_true),
"pass": micro_f1 >= 0.85, # Target: F1 >= 0.85
}
def compute_extraction_report(annotations: list[dict]) -> str:
"""Generate a scikit-learn classification_report style output."""
    all_y_true = []
    all_y_pred = []
for field_name in EXTRACTION_FIELDS:
for ann in annotations:
fields = {f["field_name"]: f for f in ann["fields"]}
if field_name in fields:
f = fields[field_name]
has_gt = f["ground_truth"] is not None
is_correct = f.get("correct", False)
all_y_true.append(1 if has_gt else 0)
all_y_pred.append(1 if is_correct else 0)
return classification_report(
all_y_true, all_y_pred,
target_names=["absent", "present/correct"],
digits=4,
)
def compare_with_baseline(
medgemma_annotations: list[dict],
gemini_only_annotations: list[dict],
) -> dict:
"""Compare MedGemma extraction vs Gemini-only baseline."""
medgemma_metrics = compute_field_level_f1(medgemma_annotations)
gemini_metrics = compute_field_level_f1(gemini_only_annotations)
comparison = {}
for field_name in EXTRACTION_FIELDS:
mg = medgemma_metrics["per_field"].get(field_name, {})
gm = gemini_metrics["per_field"].get(field_name, {})
comparison[field_name] = {
"medgemma_f1": mg.get("f1", 0),
"gemini_f1": gm.get("f1", 0),
"delta": round(mg.get("f1", 0) - gm.get("f1", 0), 4),
}
return {
"per_field_comparison": comparison,
"medgemma_overall_f1": medgemma_metrics["micro_f1"],
"gemini_overall_f1": gemini_metrics["micro_f1"],
"improvement": round(
medgemma_metrics["micro_f1"] - gemini_metrics["micro_f1"], 4
),
}
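As a sanity reference for the sklearn calls above: with y_true = "field has ground truth" and y_pred = "extraction judged correct", the binary F1 reduces to 2·TP / (2·TP + FP + FN). A dependency-free check:

```python
def binary_f1(y_true, y_pred):
    """Pure-Python F1 for the positive class; agrees with
    sklearn.metrics.f1_score(..., zero_division=0) on binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Three fields have ground truth; two were extracted correctly:
# TP=2, FP=0, FN=1 -> F1 = 4 / 5 = 0.8
print(binary_f1([1, 1, 1], [1, 1, 0]))  # 0.8
```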
5.3 ๅชๅฃฐ็บงๅซๅฏนๆๅๆง่ฝ็ๅฝฑๅๅๆ
def analyze_noise_impact(annotations: list[dict]) -> dict:
"""Analyze how noise level affects extraction F1."""
by_noise = {}
for ann in annotations:
level = ann["noise_level"]
if level not in by_noise:
by_noise[level] = []
by_noise[level].append(ann)
results = {}
for level, level_anns in by_noise.items():
metrics = compute_field_level_f1(level_anns)
results[level] = {
"micro_f1": metrics["micro_f1"],
"macro_f1": metrics["macro_f1"],
"n_patients": len(level_anns),
}
return results
6. ็ซฏๅฐ็ซฏ่ฏไผฐ็ฎก็บฟ
6.1 Criterion Decision Accuracy
# evaluation/criterion_eval.py
def compute_criterion_accuracy(
predictions: list[dict],
ground_truth: list[dict],
) -> dict:
"""Compute criterion-level decision accuracy.
Each prediction/ground_truth entry:
{
"patient_id": str,
"trial_id": str,
"criteria": [
{"criterion_id": str, "decision": "met"|"not_met"|"unknown",
"evidence": str}
]
}
Target: >= 0.85
"""
total = 0
correct = 0
by_decision_type = {"met": {"tp": 0, "total": 0},
"not_met": {"tp": 0, "total": 0},
"unknown": {"tp": 0, "total": 0}}
for pred, gt in zip(predictions, ground_truth):
assert pred["patient_id"] == gt["patient_id"]
assert pred["trial_id"] == gt["trial_id"]
gt_map = {c["criterion_id"]: c["decision"] for c in gt["criteria"]}
for criterion in pred["criteria"]:
cid = criterion["criterion_id"]
if cid in gt_map:
total += 1
gt_decision = gt_map[cid]
pred_decision = criterion["decision"]
by_decision_type[gt_decision]["total"] += 1
if pred_decision == gt_decision:
correct += 1
by_decision_type[gt_decision]["tp"] += 1
accuracy = correct / total if total > 0 else 0.0
return {
"overall_accuracy": round(accuracy, 4),
"total_criteria": total,
"correct": correct,
"pass": accuracy >= 0.85,
"by_decision_type": {
k: {
"accuracy": round(v["tp"] / v["total"], 4) if v["total"] > 0 else 0,
"support": v["total"],
}
for k, v in by_decision_type.items()
},
}
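For a single (patient, trial) pair, the accuracy computation above reduces to comparing two decision maps. A minimal stdlib sketch (criterion IDs are illustrative):

```python
# Gold-standard vs predicted decisions for one (patient, trial) pair;
# criterion IDs are illustrative.
gt = {"c1": "met", "c2": "not_met", "c3": "unknown", "c4": "met"}
pred = {"c1": "met", "c2": "not_met", "c3": "met", "c4": "met"}

# Same core loop as compute_criterion_accuracy, without the per-type split.
total = len(gt)
correct = sum(1 for cid, decision in gt.items() if pred.get(cid) == decision)
accuracy = correct / total
print(accuracy)  # 0.75 -> below the 0.85 target
```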
6.2 ๅปถ่ฟๅบๅๆต่ฏ
# evaluation/latency_cost_tracker.py
import time
import json
from dataclasses import dataclass, field, asdict
from typing import Optional
from contextlib import contextmanager
@dataclass
class APICallRecord:
"""Record of a single API call."""
service: str # "medgemma", "gemini", "clinicaltrials_mcp"
operation: str # "extract", "search", "evaluate_criterion"
latency_ms: float
input_tokens: int = 0
output_tokens: int = 0
cost_usd: float = 0.0
timestamp: str = ""
@dataclass
class SessionMetrics:
"""Aggregate metrics for a patient matching session."""
patient_id: str
total_latency_ms: float = 0.0
total_cost_usd: float = 0.0
api_calls: list[APICallRecord] = field(default_factory=list)
@property
def total_latency_s(self) -> float:
return self.total_latency_ms / 1000.0
@property
def pass_latency(self) -> bool:
"""Target: < 15s per session."""
return self.total_latency_s < 15.0
@property
def pass_cost(self) -> bool:
"""Target: < $0.50 per session."""
return self.total_cost_usd < 0.50
class LatencyCostTracker:
"""Track latency and cost across API calls."""
# Pricing per 1M tokens (approximate)
PRICING = {
"medgemma": {"input": 0.0, "output": 0.0}, # Self-hosted
"gemini": {"input": 1.25, "output": 5.00}, # Gemini Pro
"clinicaltrials_mcp": {"input": 0.0, "output": 0.0}, # Free API
}
def __init__(self):
self.sessions: list[SessionMetrics] = []
self._current_session: Optional[SessionMetrics] = None
def start_session(self, patient_id: str):
self._current_session = SessionMetrics(patient_id=patient_id)
def end_session(self) -> SessionMetrics:
session = self._current_session
if session:
session.total_latency_ms = sum(c.latency_ms for c in session.api_calls)
session.total_cost_usd = sum(c.cost_usd for c in session.api_calls)
self.sessions.append(session)
self._current_session = None
return session
@contextmanager
def track_call(self, service: str, operation: str):
"""Context manager to track an API call."""
start = time.monotonic()
record = APICallRecord(service=service, operation=operation, latency_ms=0)
try:
yield record
finally:
record.latency_ms = (time.monotonic() - start) * 1000
# Compute cost
pricing = self.PRICING.get(service, {"input": 0, "output": 0})
record.cost_usd = (
record.input_tokens * pricing["input"] / 1_000_000
+ record.output_tokens * pricing["output"] / 1_000_000
)
if self._current_session:
self._current_session.api_calls.append(record)
def summary(self) -> dict:
"""Generate aggregate summary across all sessions."""
if not self.sessions:
return {}
latencies = [s.total_latency_s for s in self.sessions]
costs = [s.total_cost_usd for s in self.sessions]
return {
"n_sessions": len(self.sessions),
"latency": {
"mean_s": round(sum(latencies) / len(latencies), 2),
"p50_s": round(sorted(latencies)[len(latencies) // 2], 2),
"p95_s": round(sorted(latencies)[int(len(latencies) * 0.95)], 2),
"max_s": round(max(latencies), 2),
"pass_rate": round(
sum(1 for s in self.sessions if s.pass_latency) / len(self.sessions), 4
),
},
"cost": {
"mean_usd": round(sum(costs) / len(costs), 4),
"total_usd": round(sum(costs), 4),
"max_usd": round(max(costs), 4),
"pass_rate": round(
sum(1 for s in self.sessions if s.pass_cost) / len(self.sessions), 4
),
},
"targets": {
"latency_pass": all(s.pass_latency for s in self.sessions),
"cost_pass": all(s.pass_cost for s in self.sessions),
},
}
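The `track_call` context manager above relies on a standard pattern: `time.monotonic` around a `yield` inside `contextlib.contextmanager`. A stripped-down, dependency-free version of just that timing core (record shape simplified to a dict):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(records, service):
    """Minimal timing core of LatencyCostTracker.track_call: measure
    wall-clock latency with time.monotonic and append a record on exit."""
    start = time.monotonic()
    record = {"service": service, "latency_ms": 0.0}
    try:
        yield record
    finally:
        record["latency_ms"] = (time.monotonic() - start) * 1000
        records.append(record)

calls = []
with timed(calls, "gemini"):
    time.sleep(0.01)  # stand-in for an API call

print(calls[0]["service"])  # gemini
```

The `finally` block guarantees the record is appended even if the tracked call raises, which is why the tracker uses the same structure.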
7. TDD Test Cases
7.1 Synthea Data Validation Tests
# tests/test_synthea_data.py
import pytest
import json
from pathlib import Path
# ้ขๆ็ FHIR Resource ็ฑปๅ
REQUIRED_RESOURCE_TYPES = {"Patient", "Condition", "Observation", "Encounter"}
class TestSyntheaDataValidation:
"""Validate Synthea FHIR output for TrialPath requirements."""
def test_fhir_bundle_is_valid_json(self, fhir_file):
"""Bundle must be valid JSON."""
with open(fhir_file) as f:
data = json.load(f)
assert data["resourceType"] == "Bundle"
assert "entry" in data
def test_bundle_contains_required_resources(self, fhir_file):
"""Bundle must contain Patient, Condition, Observation, Encounter."""
with open(fhir_file) as f:
bundle = json.load(f)
resource_types = {
e["resource"]["resourceType"] for e in bundle["entry"]
}
for rt in REQUIRED_RESOURCE_TYPES:
assert rt in resource_types, f"Missing {rt} resource"
def test_patient_has_demographics(self, fhir_file):
"""Patient resource must have name, gender, birthDate."""
with open(fhir_file) as f:
bundle = json.load(f)
patients = [
e["resource"] for e in bundle["entry"]
if e["resource"]["resourceType"] == "Patient"
]
assert len(patients) == 1
patient = patients[0]
assert "name" in patient
assert "gender" in patient
assert "birthDate" in patient
def test_lung_cancer_condition_present(self, fhir_file):
"""At least one Condition must be NSCLC or lung cancer."""
with open(fhir_file) as f:
bundle = json.load(f)
conditions = [
e["resource"] for e in bundle["entry"]
if e["resource"]["resourceType"] == "Condition"
]
lung_cancer_codes = {"254637007", "254632001", "162573006"}
has_lung_cancer = False
for cond in conditions:
codings = cond.get("code", {}).get("coding", [])
for c in codings:
if c.get("code") in lung_cancer_codes:
has_lung_cancer = True
assert has_lung_cancer, "No lung cancer Condition found"
def test_patient_profile_conversion(self, fhir_file):
"""FHIR Bundle must convert to valid PatientProfile."""
profile = parse_fhir_bundle(Path(fhir_file))
assert profile.patient_id != ""
assert profile.demographics.name != ""
assert profile.demographics.sex in ("male", "female")
assert profile.diagnosis.primary != ""
def test_batch_generation_produces_500_patients(self, output_dir):
"""Batch generation must produce at least 500 FHIR files."""
fhir_files = list(Path(output_dir).glob("*.json"))
assert len(fhir_files) >= 500
def test_nsclc_ratio(self, all_profiles):
"""~85% of lung cancer patients should be NSCLC."""
nsclc_count = sum(
1 for p in all_profiles
if "non-small cell" in p.diagnosis.primary.lower()
or "nsclc" in p.diagnosis.primary.lower()
)
ratio = nsclc_count / len(all_profiles)
assert 0.70 <= ratio <= 0.95, f"NSCLC ratio {ratio} outside expected range"
7.2 PDF Generation Correctness Tests
# tests/test_pdf_generation.py
import pytest
from pathlib import Path
from data.templates.clinical_letter import generate_clinical_letter
from data.templates.pathology_report import generate_pathology_report
from data.templates.lab_report import generate_lab_report
class TestPDFGeneration:
"""Test that PDF generation produces valid documents."""
SAMPLE_PROFILE = {
"patient_id": "test-001",
"demographics": {
"name": "Jane Doe",
"sex": "female",
"date_of_birth": "1960-05-15",
},
"diagnosis": {
"primary": "Non-small cell lung cancer, adenocarcinoma",
"stage": "Stage IIIA",
"histology": "adenocarcinoma",
"diagnosis_date": "2024-01-15",
},
"biomarkers": {
"egfr": "Exon 19 deletion",
"alk": "Negative",
"pdl1_tps": "60%",
"kras": None,
},
"labs": [
{"name": "WBC", "value": 7.2, "unit": "10*3/uL", "date": "2024-01-10", "loinc_code": "6690-2"},
{"name": "Hemoglobin", "value": 12.5, "unit": "g/dL", "date": "2024-01-10", "loinc_code": "718-7"},
],
"treatments": [
{"name": "Cisplatin", "type": "medication", "start_date": "2024-02-01"},
],
}
def test_clinical_letter_generates_pdf(self, tmp_path):
"""Clinical letter must generate a non-empty PDF file."""
output = tmp_path / "letter.pdf"
generate_clinical_letter(self.SAMPLE_PROFILE, str(output))
assert output.exists()
assert output.stat().st_size > 0
def test_pathology_report_generates_pdf(self, tmp_path):
"""Pathology report must generate a non-empty PDF file."""
output = tmp_path / "pathology.pdf"
generate_pathology_report(self.SAMPLE_PROFILE, str(output))
assert output.exists()
assert output.stat().st_size > 0
def test_lab_report_generates_pdf(self, tmp_path):
"""Lab report must generate a non-empty PDF file."""
output = tmp_path / "lab.pdf"
generate_lab_report(self.SAMPLE_PROFILE, str(output))
assert output.exists()
assert output.stat().st_size > 0
def test_pdf_contains_patient_name(self, tmp_path):
"""Generated PDF must contain patient name (OCR-verifiable)."""
output = tmp_path / "letter.pdf"
generate_clinical_letter(self.SAMPLE_PROFILE, str(output))
# Read PDF text (using pdfplumber or PyPDF2)
import pdfplumber
with pdfplumber.open(str(output)) as pdf:
text = ""
for page in pdf.pages:
text += page.extract_text() or ""
assert "Jane Doe" in text
def test_pdf_contains_biomarkers(self, tmp_path):
"""Generated PDF must contain biomarker results."""
output = tmp_path / "pathology.pdf"
generate_pathology_report(self.SAMPLE_PROFILE, str(output))
import pdfplumber
with pdfplumber.open(str(output)) as pdf:
text = ""
for page in pdf.pages:
text += page.extract_text() or ""
assert "EGFR" in text
assert "Exon 19" in text or "positive" in text.lower()
def test_missing_biomarker_handled_gracefully(self, tmp_path):
"""PDF generation should not crash when biomarkers are None."""
profile = self.SAMPLE_PROFILE.copy()
profile["biomarkers"] = {
"egfr": None, "alk": None, "pdl1_tps": None, "kras": None
}
output = tmp_path / "letter.pdf"
generate_clinical_letter(profile, str(output))
assert output.exists()
7.3 ๅชๅฃฐๆณจๅ ฅๆๆ้ช่ฏๆต่ฏ
# tests/test_noise_injection.py
import pytest
from data.noise.noise_injector import NoiseInjector
class TestNoiseInjection:
"""Test noise injection produces expected results."""
def test_clean_noise_no_changes(self):
"""Clean level should produce no changes."""
injector = NoiseInjector(noise_level="clean", seed=42)
text = "Patient has EGFR mutation positive"
noisy, records = injector.inject_text_noise(text)
assert noisy == text
assert len(records) == 0
def test_mild_noise_produces_some_changes(self):
"""Mild noise should produce some but limited changes."""
injector = NoiseInjector(noise_level="mild", seed=42)
# Use longer text to increase chance of noise
text = "The patient is a 65 year old male with stage IIIA " * 10
        noisy, records = injector.inject_text_noise(text)
        # Smoke check: the call must return a (text, records) pair; with a
        # fixed seed the change count is deterministic but implementation-
        # dependent, so avoid asserting a brittle exact count here.
        assert isinstance(noisy, str)
        assert isinstance(records, list)
def test_severe_noise_produces_many_changes(self):
"""Severe noise should produce noticeable changes."""
injector = NoiseInjector(noise_level="severe", seed=42)
text = "The 50 year old patient has stage 1 NSCLC " * 20
noisy, records = injector.inject_text_noise(text)
assert noisy != text # Should differ from original
assert len(records) > 0
def test_ocr_error_types_are_valid(self):
"""OCR errors should only substitute known character pairs."""
injector = NoiseInjector(noise_level="severe", seed=42)
text = "0123456789 OIBS" * 10
_, records = injector.inject_text_noise(text)
for r in records:
if r["type"] == "ocr_error":
assert r["original"] in NoiseInjector.OCR_ERROR_MAP
assert r["replacement"] in NoiseInjector.OCR_ERROR_MAP[r["original"]]
def test_missing_value_injection(self):
"""Missing value injection should remove some fields."""
injector = NoiseInjector(noise_level="moderate", seed=42)
profile = {
"biomarkers": {"egfr": "positive", "alk": "negative",
"pdl1_tps": "60%", "kras": "negative", "ros1": "negative"},
"diagnosis": {"stage": "IIIA", "histology": "adenocarcinoma"},
}
modified, removed = injector.inject_missing_values(profile)
# At 10% rate with 7 fields, expect 0-3 removals
assert len(removed) <= 7
for field_path in removed:
section, field_name = field_path.split(".")
assert modified[section][field_name] is None
def test_noise_is_deterministic_with_seed(self):
"""Same seed should produce identical results."""
text = "Patient has stage IIIA non-small cell lung cancer"
inj1 = NoiseInjector(noise_level="moderate", seed=123)
inj2 = NoiseInjector(noise_level="moderate", seed=123)
noisy1, _ = inj1.inject_text_noise(text)
noisy2, _ = inj2.inject_text_noise(text)
assert noisy1 == noisy2
def test_different_seeds_produce_different_results(self):
"""Different seeds should generally produce different noise."""
text = "The 50 year old patient has 10 biomarker tests 0 1 5 8" * 20
inj1 = NoiseInjector(noise_level="severe", seed=1)
inj2 = NoiseInjector(noise_level="severe", seed=999)
noisy1, _ = inj1.inject_text_noise(text)
noisy2, _ = inj2.inject_text_noise(text)
# With severe noise on long text, different seeds should differ
assert noisy1 != noisy2
7.4 TREC Evaluation Computation Tests
# tests/test_trec_evaluation.py
import pytest
import ir_measures
from ir_measures import nDCG, Recall, P, AP
from evaluation.run_trec_benchmark import convert_trialpath_to_trec_run
class TestTRECEvaluation:
"""Test TREC evaluation metric computation."""
@pytest.fixture
def sample_qrels(self):
"""Sample qrels with known ground truth."""
return [
ir_measures.Qrel("q1", "d1", 2), # eligible
ir_measures.Qrel("q1", "d2", 1), # excluded
ir_measures.Qrel("q1", "d3", 0), # not relevant
ir_measures.Qrel("q1", "d4", 2), # eligible
ir_measures.Qrel("q1", "d5", 0), # not relevant
]
@pytest.fixture
def perfect_run(self):
"""Run that ranks all relevant docs at top."""
return [
ir_measures.ScoredDoc("q1", "d1", 1.0),
ir_measures.ScoredDoc("q1", "d4", 0.9),
ir_measures.ScoredDoc("q1", "d2", 0.8),
ir_measures.ScoredDoc("q1", "d3", 0.1),
ir_measures.ScoredDoc("q1", "d5", 0.05),
]
@pytest.fixture
def worst_run(self):
"""Run that ranks relevant docs at bottom."""
return [
ir_measures.ScoredDoc("q1", "d3", 1.0),
ir_measures.ScoredDoc("q1", "d5", 0.9),
ir_measures.ScoredDoc("q1", "d2", 0.5),
ir_measures.ScoredDoc("q1", "d4", 0.2),
ir_measures.ScoredDoc("q1", "d1", 0.1),
]
def test_perfect_ndcg_at_10(self, sample_qrels, perfect_run):
"""Perfect ranking should yield NDCG@10 = 1.0."""
result = ir_measures.calc_aggregate([nDCG@10], sample_qrels, perfect_run)
assert result[nDCG@10] == pytest.approx(1.0, abs=0.01)
def test_worst_ndcg_lower(self, sample_qrels, perfect_run, worst_run):
"""Worst ranking should yield lower NDCG than perfect."""
perfect = ir_measures.calc_aggregate([nDCG@10], sample_qrels, perfect_run)
worst = ir_measures.calc_aggregate([nDCG@10], sample_qrels, worst_run)
assert worst[nDCG@10] < perfect[nDCG@10]
def test_recall_at_50_perfect(self, sample_qrels, perfect_run):
"""Perfect run should retrieve all relevant docs."""
result = ir_measures.calc_aggregate([Recall@50], sample_qrels, perfect_run)
assert result[Recall@50] == pytest.approx(1.0, abs=0.01)
def test_empty_run_yields_zero(self, sample_qrels):
"""Empty run should yield 0 for all metrics."""
empty_run = []
result = ir_measures.calc_aggregate(
[nDCG@10, Recall@50, P@10], sample_qrels, empty_run
)
assert result[nDCG@10] == 0.0
assert result[Recall@50] == 0.0
assert result[P@10] == 0.0
def test_per_query_results(self, sample_qrels, perfect_run):
"""Per-query results should return one entry per query."""
results = list(ir_measures.iter_calc(
[nDCG@10], sample_qrels, perfect_run
))
assert len(results) == 1 # Only q1
assert results[0].query_id == "q1"
def test_trec_run_format_conversion(self):
"""Test TrialPath results to TREC format conversion."""
results = {
"1": [
{"nct_id": "NCT001", "score": 0.95},
{"nct_id": "NCT002", "score": 0.80},
]
}
run_str = convert_trialpath_to_trec_run(results, "test-run")
lines = run_str.strip().split("\n")
assert len(lines) == 2
assert "NCT001" in lines[0]
assert "1" == lines[0].split()[3] # rank 1
assert "2" == lines[1].split()[3] # rank 2
def test_graded_relevance_evaluation(self, sample_qrels, perfect_run):
"""Test strict eligible-only evaluation (rel=2)."""
strict = ir_measures.calc_aggregate(
[AP(rel=2)], sample_qrels, perfect_run
)
assert strict[AP(rel=2)] > 0.0
def test_qrels_dict_format(self):
"""Test evaluation from dict format."""
qrels = {"q1": {"d1": 2, "d2": 1, "d3": 0}}
run = [
ir_measures.ScoredDoc("q1", "d1", 1.0),
ir_measures.ScoredDoc("q1", "d2", 0.5),
ir_measures.ScoredDoc("q1", "d3", 0.1),
]
result = ir_measures.calc_aggregate([nDCG@10], qrels, run)
assert nDCG@10 in result
7.5 F1 Computation Tests
# tests/test_extraction_f1.py
import pytest
from evaluation.extraction_eval import compute_field_level_f1
class TestExtractionF1:
"""Test F1 computation for field-level extraction."""
def test_perfect_extraction(self):
"""All fields correctly extracted should yield F1=1.0."""
annotations = [{
"patient_id": "p1",
"noise_level": "clean",
"document_type": "clinical_letter",
"fields": [
{"field_name": "demographics.name", "ground_truth": "John", "extracted": "John", "correct": True},
{"field_name": "demographics.sex", "ground_truth": "male", "extracted": "male", "correct": True},
{"field_name": "diagnosis.primary", "ground_truth": "NSCLC", "extracted": "NSCLC", "correct": True},
{"field_name": "biomarkers.egfr", "ground_truth": "positive", "extracted": "positive", "correct": True},
]
}]
result = compute_field_level_f1(annotations)
assert result["micro_f1"] == 1.0
assert result["pass"] is True
def test_zero_extraction(self):
"""No correct extractions should yield F1=0."""
annotations = [{
"patient_id": "p1",
"noise_level": "clean",
"document_type": "clinical_letter",
"fields": [
{"field_name": "demographics.name", "ground_truth": "John", "extracted": "Jane", "correct": False},
{"field_name": "diagnosis.primary", "ground_truth": "NSCLC", "extracted": None, "correct": False},
]
}]
result = compute_field_level_f1(annotations)
assert result["micro_f1"] == 0.0
assert result["pass"] is False
def test_partial_extraction(self):
"""Partial extraction should yield 0 < F1 < 1."""
annotations = [{
"patient_id": "p1",
"noise_level": "mild",
"document_type": "clinical_letter",
"fields": [
{"field_name": "demographics.name", "ground_truth": "John", "extracted": "John", "correct": True},
{"field_name": "diagnosis.primary", "ground_truth": "NSCLC", "extracted": "lung ca", "correct": False},
{"field_name": "biomarkers.egfr", "ground_truth": "positive", "extracted": "positive", "correct": True},
{"field_name": "biomarkers.alk", "ground_truth": "negative", "extracted": None, "correct": False},
]
}]
result = compute_field_level_f1(annotations)
assert 0.0 < result["micro_f1"] < 1.0
    def test_f1_threshold_boundary(self):
        """F1 at or above the 0.85 threshold should pass."""
        # Build annotations near the threshold: 85 correct, 15 missed fields
fields = []
for i in range(85):
fields.append({"field_name": f"field_{i}", "ground_truth": "val", "extracted": "val", "correct": True})
for i in range(15):
fields.append({"field_name": f"field_miss_{i}", "ground_truth": "val", "extracted": None, "correct": False})
annotations = [{"patient_id": "p1", "noise_level": "clean",
"document_type": "test", "fields": fields}]
result = compute_field_level_f1(annotations)
        # 85/100 correct with perfect precision gives F1 ≈ 0.919, above 0.85
assert result["pass"] is True
def test_empty_annotations(self):
"""Empty annotations should not crash."""
result = compute_field_level_f1([])
assert result["micro_f1"] == 0.0
def test_none_ground_truth_not_counted(self):
"""Fields with None ground truth should be handled."""
annotations = [{
"patient_id": "p1",
"noise_level": "clean",
"document_type": "test",
"fields": [
{"field_name": "biomarkers.ros1", "ground_truth": None,
"extracted": None, "correct": False},
]
}]
result = compute_field_level_f1(annotations)
# Should not crash, though metrics may be 0
assert "micro_f1" in result
7.6 ็ซฏๅฐ็ซฏ็ฎก็บฟๆต่ฏ
# tests/test_e2e_pipeline.py
import pytest
from pathlib import Path
class TestE2EPipeline:
"""End-to-end tests for the complete data & evaluation pipeline."""
def test_fhir_to_profile_to_pdf_roundtrip(self, sample_fhir_file, tmp_path):
"""FHIR โ PatientProfile โ PDF should complete without error."""
from data.generate_synthetic_patients import parse_fhir_bundle
from data.templates.clinical_letter import generate_clinical_letter
from dataclasses import asdict
# Step 1: Parse FHIR
profile = parse_fhir_bundle(Path(sample_fhir_file))
assert profile.patient_id != ""
# Step 2: Generate PDF
pdf_path = tmp_path / "test_roundtrip.pdf"
generate_clinical_letter(asdict(profile), str(pdf_path))
assert pdf_path.exists()
assert pdf_path.stat().st_size > 1000 # Reasonable PDF size
def test_noisy_pdf_pipeline(self, sample_profile, tmp_path):
"""Profile โ Noisy PDF should inject noise and produce valid PDF."""
from data.templates.clinical_letter import generate_clinical_letter
from data.noise.noise_injector import NoiseInjector
        import copy

        injector = NoiseInjector(noise_level="moderate", seed=42)
        # Inject text noise into profile fields for PDF rendering.
        # Deep-copy first: dict.copy() is shallow, so mutating the nested
        # diagnosis dict would leak noise back into the shared fixture.
        profile = copy.deepcopy(sample_profile)
        dx_text = profile["diagnosis"]["primary"]
        noisy_dx, records = injector.inject_text_noise(dx_text)
        profile["diagnosis"]["primary"] = noisy_dx
pdf_path = tmp_path / "noisy.pdf"
generate_clinical_letter(profile, str(pdf_path))
assert pdf_path.exists()
def test_trec_evaluation_pipeline(self, tmp_path):
"""Complete TREC evaluation from dicts should produce metrics."""
import ir_measures
from ir_measures import nDCG, Recall, P
qrels = [
ir_measures.Qrel("1", "NCT001", 2),
ir_measures.Qrel("1", "NCT002", 1),
ir_measures.Qrel("1", "NCT003", 0),
]
run = [
ir_measures.ScoredDoc("1", "NCT001", 0.9),
ir_measures.ScoredDoc("1", "NCT002", 0.5),
ir_measures.ScoredDoc("1", "NCT003", 0.1),
]
result = ir_measures.calc_aggregate(
[nDCG@10, Recall@50, P@10], qrels, run
)
assert nDCG@10 in result
assert Recall@50 in result
assert result[nDCG@10] > 0
def test_latency_tracker_integration(self):
"""Latency tracker should record and summarize calls."""
import time
from evaluation.latency_cost_tracker import LatencyCostTracker
tracker = LatencyCostTracker()
tracker.start_session("test-patient")
with tracker.track_call("gemini", "search_anchors") as record:
time.sleep(0.01) # Simulate API call
record.input_tokens = 500
record.output_tokens = 200
session = tracker.end_session()
assert session.total_latency_ms > 0
assert len(session.api_calls) == 1
summary = tracker.summary()
assert summary["n_sessions"] == 1
assert summary["latency"]["mean_s"] > 0
8. ้ๅฝ
8.1 ๆฐๆฎๆ ผๅผ่ง่
PatientProfile v1 JSON Schema
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"required": ["patient_id", "demographics", "diagnosis"],
"properties": {
"patient_id": {"type": "string"},
"demographics": {
"type": "object",
"properties": {
"name": {"type": "string"},
"sex": {"type": "string", "enum": ["male", "female"]},
"date_of_birth": {"type": "string", "format": "date"},
"age": {"type": "integer"},
"state": {"type": "string"}
}
},
"diagnosis": {
"type": "object",
"properties": {
"primary": {"type": "string"},
"stage": {"type": ["string", "null"]},
"histology": {"type": ["string", "null"]},
"diagnosis_date": {"type": "string", "format": "date"}
}
},
"biomarkers": {
"type": "object",
"properties": {
"egfr": {"type": ["string", "null"]},
"alk": {"type": ["string", "null"]},
"pdl1_tps": {"type": ["string", "null"]},
"kras": {"type": ["string", "null"]},
"ros1": {"type": ["string", "null"]}
}
},
"labs": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"value": {"type": "number"},
"unit": {"type": "string"},
"date": {"type": "string"},
"loinc_code": {"type": "string"}
}
}
},
"treatments": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"type": {"type": "string", "enum": ["medication", "procedure", "radiation"]},
"start_date": {"type": "string"},
"end_date": {"type": ["string", "null"]}
}
}
},
"unknowns": {"type": "array", "items": {"type": "string"}},
"evidence_spans": {"type": "array"}
}
}
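In practice a profile would be validated with a full JSON-Schema validator such as the `jsonschema` package; as a dependency-free illustration of the schema's `required` clause above (the helper name is ours):

```python
# Required top-level keys, taken from the schema's "required" clause above.
REQUIRED_TOP_LEVEL = ("patient_id", "demographics", "diagnosis")

def check_required(profile: dict) -> list:
    """Return the required top-level keys missing from a profile dict."""
    return [k for k in REQUIRED_TOP_LEVEL if k not in profile]

profile = {"patient_id": "synth-001", "demographics": {}, "diagnosis": {}}
print(check_required(profile))              # []
print(check_required({"patient_id": "x"}))  # ['demographics', 'diagnosis']
```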
8.2 ๅทฅๅ ท API ๅ่
ir_datasets
| API | ่ฏดๆ | ่ฟๅ็ฑปๅ |
|---|---|---|
ir_datasets.load("clinicaltrials/2021/trec-ct-2021") |
ๅ ่ฝฝ TREC CT 2021 ๆฐๆฎ้ | Dataset |
dataset.queries_iter() |
้ๅ topics | GenericQuery(query_id, text) |
dataset.qrels_iter() |
้ๅ qrels | TrecQrel(query_id, doc_id, relevance, iteration) |
dataset.docs_iter() |
้ๅๆๆกฃ | ClinicalTrialsDoc(doc_id, title, condition, summary, detailed_description, eligibility) |
ๆฐๆฎ้ ID๏ผ
clinicaltrials/2021/trec-ct-2021โ 75 queries, 35,832 qrelsclinicaltrials/2021/trec-ct-2022โ 50 queriesclinicaltrials/2021โ 376K ๆๆกฃ๏ผๅบ็ก้๏ผ
ir-measures

| API | Description |
|---|---|
| `ir_measures.calc_aggregate(measures, qrels, run)` | Compute aggregate metrics |
| `ir_measures.iter_calc(measures, qrels, run)` | Iterate over per-query metrics |
| `ir_measures.read_trec_qrels(path)` | Read a TREC qrels file |
| `ir_measures.read_trec_run(path)` | Read a TREC run file |
| `ir_measures.Qrel(qid, did, rel)` | Create a qrel record |
| `ir_measures.ScoredDoc(qid, did, score)` | Create a scored-document record |

Measure objects:

- `nDCG@10` — normalized DCG at cutoff 10
- `Recall@50` — recall at cutoff 50
- `P@10` — precision at cutoff 10
- `AP` — average precision
- `AP(rel=2)` — AP counting only documents with relevance >= 2
- `RR` — reciprocal rank
scikit-learn Evaluation

| API | Description |
|---|---|
| `f1_score(y_true, y_pred, average=None)` | Per-class F1 |
| `f1_score(y_true, y_pred, average='micro')` | Global micro F1 |
| `f1_score(y_true, y_pred, average='macro')` | Unweighted per-class mean F1 |
| `precision_score(y_true, y_pred)` | Precision |
| `recall_score(y_true, y_pred)` | Recall |
| `classification_report(y_true, y_pred)` | Full classification report |
| `confusion_matrix(y_true, y_pred)` | Confusion matrix |
Synthea CLI

| Parameter | Description | Example |
|---|---|---|
| `-p N` | Generate N patients | `-p 500` |
| `-s SEED` | Random seed | `-s 42` |
| `-m MODULE` | Select a disease module | `-m lung_cancer` |
| `STATE` | State (positional argument) | `Massachusetts` |
| `--exporter.fhir.export` | Enable FHIR R4 export | `=true` |
| `--exporter.pretty_print` | Pretty-print JSON output | `=true` |
ReportLab Core API

| Component | Description |
|---|---|
| `SimpleDocTemplate(path, pagesize=letter)` | Create a document template |
| `Paragraph(text, style)` | Paragraph flowable |
| `Table(data, colWidths)` | Table flowable |
| `TableStyle(commands)` | Table styling |
| `Spacer(width, height)` | Spacing flowable |
| `getSampleStyleSheet()` | Get the default stylesheet |
Augraphy Texture Pipeline

| Component | Description |
|---|---|
| `AugraphyPipeline(ink_phase, paper_phase, post_phase)` | Full texture pipeline |
| `InkBleed(p=0.5)` | Ink-bleed effect |
| `Letterpress(p=0.3)` | Letterpress effect |
| `LowInkPeriodicLines(p=0.3)` | Low-ink periodic lines |
| `DirtyDrum(p=0.3)` | Dirty-drum effect |
| `SubtleNoise(p=0.5)` | Subtle noise |
| `Jpeg(p=0.5)` | JPEG compression artifacts |
| `Brightness(p=0.5)` | Brightness variation |
8.3 Python Dependencies
# requirements-data-eval.txt
ir-datasets>=0.5.6
ir-measures>=0.3.1
reportlab>=4.0
augraphy>=8.0
Pillow>=10.0
pdfplumber>=0.10
scikit-learn>=1.3
numpy>=1.24
pandas>=2.0
pdf2image>=1.16
8.4 ๆๅๆๆ ้ๆฅ่กจ
| ๆๆ | ็ฎๆ ๅผ | ่ฏไผฐๅทฅๅ ท | ๆฐๆฎๆบ |
|---|---|---|---|
| MedGemma Extraction F1 | >= 0.85 | scikit-learn f1_score |
ๅๆๆฃ่ + Ground Truth |
| Trial Retrieval Recall@50 | >= 0.75 | ir-measures Recall@50 |
TREC CT 2021/2022 |
| Trial Ranking NDCG@10 | >= 0.60 | ir-measures nDCG@10 |
TREC CT 2021/2022 |
| Criterion Decision Accuracy | >= 0.85 | Custom accuracy | ๆ ๆณจ EligibilityLedger |
| Latency | < 15s | LatencyCostTracker |
API call timing |
| Cost | < $0.50/session | LatencyCostTracker |
Token counting |