# TrialPath Data & Evaluation Pipeline: TDD Implementation Guide
> Produced from in-depth research into DeepWiki, official TREC documentation, and the ir-measures/ir_datasets libraries
---
## 1. Pipeline Architecture Overview
### 1.1 Data Flow Diagram
```
┌────────────────────────────────────────────────────────────────────┐
│                     Data & Evaluation Pipeline                     │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────────┐  │
│  │   Synthea    │ ───▶ │ FHIR Bundle  │ ───▶ │  PatientProfile  │  │
│  │  (Java CLI)  │      │   (JSON)     │      │  (JSON Schema)   │  │
│  └──────────────┘      └──────────────┘      └────────┬─────────┘  │
│                                                       │            │
│  ┌──────────────┐      ┌──────────────┐               ▼            │
│  │  LLM Letter  │ ───▶ │  ReportLab   │ ───▶ Noisy Clinical PDFs   │
│  │  Generator   │      │  + Augraphy  │      (Letters/Labs/Path)   │
│  └──────────────┘      └──────────────┘                            │
│                                                                    │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────────┐  │
│  │   MedGemma   │ ───▶ │  Extracted   │ ───▶ │   F1 Evaluator   │  │
│  │  Extractor   │      │   Profile    │      │  (scikit-learn)  │  │
│  └──────────────┘      └──────────────┘      └──────────────────┘  │
│                                                                    │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────────┐  │
│  │ TREC Topics  │ ───▶ │  TrialPath   │ ───▶ │  TREC Evaluator  │  │
│  │ (ir_datasets)│      │   Matching   │      │  (ir-measures)   │  │
│  └──────────────┘      └──────────────┘      └──────────────────┘  │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```
### 1.2 Module Relationships
| Module | Input | Output | Dependencies |
|------|------|------|------|
| `data/generate_synthetic_patients.py` | Synthea FHIR Bundles | `PatientProfile` JSON + Ground Truth | Synthea CLI, FHIR R4 |
| `data/generate_noisy_pdfs.py` | `PatientProfile` JSON | Clinical PDFs (with noise) | ReportLab, Augraphy |
| `evaluation/run_trec_benchmark.py` | TREC Topics + TrialPath Run | Recall@50, NDCG@10, P@10 | ir_datasets, ir-measures |
| `evaluation/extraction_eval.py` | Extracted vs Ground Truth Profiles | Field-level F1 | scikit-learn |
| `evaluation/criterion_eval.py` | EligibilityLedger vs Gold Standard | Criterion Accuracy | scikit-learn |
| `evaluation/latency_cost_tracker.py` | API call logs | Latency/Cost reports | time, logging |
### 1.3 Directory Structure
```
data/
├── generate_synthetic_patients.py   # Synthea FHIR → PatientProfile
├── generate_noisy_pdfs.py           # PatientProfile → Clinical PDFs
├── synthea_config/
│   ├── synthea.properties           # Synthea configuration
│   └── modules/
│       └── lung_cancer_extended.json  # Extended NSCLC module (with biomarkers)
├── templates/
│   ├── clinical_letter.py           # Clinical letter template
│   ├── pathology_report.py          # Pathology report template
│   ├── lab_report.py                # Laboratory report template
│   └── imaging_report.py            # Imaging report template
├── noise/
│   └── noise_injector.py            # Noise injection engine
└── output/
    ├── fhir/                        # Raw Synthea FHIR output
    ├── profiles/                    # Converted PatientProfile JSON
    ├── pdfs/                        # Generated clinical PDFs
    └── ground_truth/                # Annotated ground-truth data
evaluation/
├── run_trec_benchmark.py            # TREC retrieval evaluation
├── extraction_eval.py               # MedGemma extraction F1
├── criterion_eval.py                # Criterion Decision Accuracy
├── latency_cost_tracker.py          # Latency and cost tracking
├── trec_data/
│   ├── topics2021.xml               # TREC 2021 topics
│   ├── qrels2021.txt                # TREC 2021 relevance judgments
│   └── topics2022.xml               # TREC 2022 topics
└── reports/                         # Evaluation report output
tests/
├── test_synthea_data.py             # Synthea data validation
├── test_pdf_generation.py           # PDF generation correctness
├── test_noise_injection.py          # Noise injection effectiveness
├── test_trec_evaluation.py          # TREC evaluation computation
├── test_extraction_f1.py            # F1 computation tests
├── test_latency_cost.py             # Latency/cost tests
└── test_e2e_pipeline.py             # End-to-end pipeline test
```
---
## 2. Synthea Synthetic Patient Generation Guide
### 2.1 Synthea Overview
Synthea is an open-source synthetic patient simulator developed by MITRE and implemented in Java. It simulates disease trajectories through JSON state-machine modules and outputs standard FHIR R4 Bundles.
**Key features (source: DeepWiki synthetichealth/synthea):**
- Module-based disease simulation: each disease is defined as a JSON state machine
- Supports FHIR R4/STU3/DSTU2 export
- Ships a built-in `lung_cancer.json` module with an 85% NSCLC / 15% SCLC split
- Supports Stage I-IV staging plus chemotherapy/radiation treatment paths
- **Does not include NSCLC-specific biomarkers (EGFR, ALK, PD-L1, KRAS, ROS1); a custom extension is required**
### 2.2 Installation and Configuration
**System requirements:**
- Java JDK 11 or later (LTS 11 or 17 recommended)
**Install option A: use the release JAR directly (recommended for data generation)**
```bash
# Download the latest release JAR
# from https://github.com/synthetichealth/synthea/releases
wget https://github.com/synthetichealth/synthea/releases/download/master-branch-latest/synthea-with-dependencies.jar
# Verify the installation
java -jar synthea-with-dependencies.jar --help
```
**Install option B: build from source (required for custom modules)**
```bash
git clone https://github.com/synthetichealth/synthea.git
cd synthea
./gradlew build check test
```
### 2.3 NSCLC Module Configuration
#### 2.3.1 Analysis of the Existing lung_cancer Module
Source: DeepWiki analysis of the `lung_cancer.json` module in `synthetichealth/synthea`:
- **Entry condition**: ages 45-65, probability-based
- **Diagnostic workflow**: symptoms (cough, hemoptysis, dyspnea) → chest X-ray → chest CT → biopsy/cytology
- **Subtypes**: 85% NSCLC, 15% SCLC
- **Staging**: Stage I-IV, driven by `lung_cancer_nondiagnosis_counter`
- **Treatment**: NSCLC receives Cisplatin + Paclitaxel → radiation
#### 2.3.2 Custom NSCLC Biomarker Extension Module
Because the stock module lacks EGFR/ALK/PD-L1 and the other biomarkers, an extension submodule must be created.
**File: `data/synthea_config/modules/lung_cancer_biomarkers.json`**
Based on DeepWiki research into Synthea module state types, the available state types include:
- `Initial`: module entry point
- `Terminal`: module exit point
- `Observation`: records a clinical observation value (used for biomarkers)
- `SetAttribute`: sets a patient attribute
- `Guard`: conditional gate
- `Simple`: simple transition state
- `Encounter`: clinical encounter state
Example structure for a biomarker Observation state:
```json
{
"name": "NSCLC Biomarker Panel",
"states": {
"Initial": {
"type": "Initial",
"conditional_transition": [
{
"condition": {
"condition_type": "Attribute",
"attribute": "Lung Cancer Type",
"operator": "==",
"value": "NSCLC"
},
"transition": "EGFR_Test_Encounter"
},
{
"transition": "Terminal"
}
]
},
"EGFR_Test_Encounter": {
"type": "Encounter",
"encounter_class": "ambulatory",
"codes": [
{
"system": "SNOMED-CT",
"code": "185349003",
"display": "Encounter for check up"
}
],
"direct_transition": "EGFR_Mutation_Status"
},
"EGFR_Mutation_Status": {
"type": "Observation",
"category": "laboratory",
"codes": [
{
"system": "LOINC",
"code": "41103-3",
"display": "EGFR gene mutations found"
}
],
"distributed_transition": [
{
"distribution": 0.15,
"transition": "EGFR_Positive"
},
{
"distribution": 0.85,
"transition": "EGFR_Negative"
}
]
},
"EGFR_Positive": {
"type": "SetAttribute",
"attribute": "egfr_status",
"value": "positive",
"direct_transition": "ALK_Rearrangement_Status"
},
"EGFR_Negative": {
"type": "SetAttribute",
"attribute": "egfr_status",
"value": "negative",
"direct_transition": "ALK_Rearrangement_Status"
},
"ALK_Rearrangement_Status": {
"type": "Observation",
"category": "laboratory",
"codes": [
{
"system": "LOINC",
"code": "46264-8",
"display": "ALK gene rearrangement"
}
],
"distributed_transition": [
{
"distribution": 0.05,
"transition": "ALK_Positive"
},
{
"distribution": 0.95,
"transition": "ALK_Negative"
}
]
},
"ALK_Positive": {
"type": "SetAttribute",
"attribute": "alk_status",
"value": "positive",
"direct_transition": "PDL1_Expression"
},
"ALK_Negative": {
"type": "SetAttribute",
"attribute": "alk_status",
"value": "negative",
"direct_transition": "PDL1_Expression"
},
"PDL1_Expression": {
"type": "Observation",
"category": "laboratory",
"codes": [
{
"system": "LOINC",
"code": "85147-0",
"display": "PD-L1 by immune stain"
}
],
"distributed_transition": [
{
"distribution": 0.30,
"transition": "PDL1_High"
},
{
"distribution": 0.35,
"transition": "PDL1_Low"
},
{
"distribution": 0.35,
"transition": "PDL1_Negative"
}
]
},
"PDL1_High": {
"type": "SetAttribute",
"attribute": "pdl1_tps",
"value": ">=50%",
"direct_transition": "KRAS_Mutation_Status"
},
"PDL1_Low": {
"type": "SetAttribute",
"attribute": "pdl1_tps",
"value": "1-49%",
"direct_transition": "KRAS_Mutation_Status"
},
"PDL1_Negative": {
"type": "SetAttribute",
"attribute": "pdl1_tps",
"value": "<1%",
"direct_transition": "KRAS_Mutation_Status"
},
"KRAS_Mutation_Status": {
"type": "Observation",
"category": "laboratory",
"codes": [
{
"system": "LOINC",
"code": "21717-3",
"display": "KRAS gene mutations found"
}
],
"distributed_transition": [
{
"distribution": 0.25,
"transition": "KRAS_Positive"
},
{
"distribution": 0.75,
"transition": "KRAS_Negative"
}
]
},
"KRAS_Positive": {
"type": "SetAttribute",
"attribute": "kras_status",
"value": "positive",
"direct_transition": "Terminal"
},
"KRAS_Negative": {
"type": "SetAttribute",
"attribute": "kras_status",
"value": "negative",
"direct_transition": "Terminal"
},
"Terminal": {
"type": "Terminal"
}
}
}
```
**Biomarker prevalence distribution (based on NSCLC literature):**
| Biomarker | Positive rate | LOINC Code | Notes |
|-----------|--------|------------|------|
| EGFR mutation | ~15% | 41103-3 | Higher in never-smoker Asian women |
| ALK rearrangement | ~5% | 46264-8 | More common in younger never-smokers |
| PD-L1 TPS>=50% | ~30% | 85147-0 | Threshold for immunotherapy eligibility |
| KRAS G12C | ~13% | 21717-3 | Sotorasib target |
| ROS1 fusion | ~1-2% | 46265-5 | Crizotinib target |
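The prevalence table above can also drive a seeded sampler for standalone experiments outside Synthea. A minimal sketch, assuming illustrative rates copied from the table (the `sample_biomarker_panel` helper and `BIOMARKER_PREVALENCE` constant are hypothetical, not part of the codebase):

```python
import random

# Positive rates from the prevalence table above (illustrative values)
BIOMARKER_PREVALENCE = {
    "egfr": 0.15,
    "alk": 0.05,
    "kras": 0.13,
    "ros1": 0.015,
}

def sample_biomarker_panel(seed: int) -> dict:
    """Draw a biomarker panel for one synthetic NSCLC patient."""
    rng = random.Random(seed)
    panel = {
        name: "positive" if rng.random() < rate else "negative"
        for name, rate in BIOMARKER_PREVALENCE.items()
    }
    # PD-L1 TPS is a three-way split (~30% / ~35% / ~35%), mirroring the module above
    r = rng.random()
    panel["pdl1_tps"] = ">=50%" if r < 0.30 else ("1-49%" if r < 0.65 else "<1%")
    return panel
```

With a fixed seed the draw is reproducible, which keeps ground truth stable across regenerations.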
### 2.4 Batch Generation Command
```bash
# Generate 500 NSCLC patients with a fixed seed for reproducibility
java -jar synthea-with-dependencies.jar \
-p 500 \
-s 42 \
-m lung_cancer \
--exporter.fhir.export=true \
--exporter.fhir_stu3.export=false \
--exporter.fhir_dstu2.export=false \
--exporter.ccda.export=false \
--exporter.csv.export=false \
--exporter.hospital.fhir.export=false \
--exporter.practitioner.fhir.export=false \
--exporter.pretty_print=true \
Massachusetts
# ๅ‚ๆ•ฐ่ฏดๆ˜Ž:
# -p 500 : ็”Ÿๆˆ 500 ไธชๆ‚ฃ่€…
# -s 42 : ้šๆœบ็งๅญ (ๅฏ้‡็Žฐ)
# -m lung_cancer : ไป…่ฟ่กŒ lung_cancer ๆจกๅ—
# --exporter.fhir.export=true : ๅฏ็”จ FHIR R4 ๅฏผๅ‡บ
# Massachusetts : ็”ŸๆˆๅœฐๅŒบ
```
**Output location:** one JSON file per patient under `./output/fhir/`.
### 2.5 FHIR Bundle Output Format
Source: DeepWiki analysis of the FHIR export system in `synthetichealth/synthea`.
**Top-level structure:**
```json
{
"resourceType": "Bundle",
"type": "transaction",
"entry": [
{
"fullUrl": "urn:uuid:patient-uuid-here",
"resource": { "resourceType": "Patient", ... },
"request": { "method": "POST", "url": "Patient" }
},
{
"fullUrl": "urn:uuid:condition-uuid-here",
"resource": { "resourceType": "Condition", ... },
"request": { "method": "POST", "url": "Condition" }
}
]
}
```
**FHIR resource types generated by Synthea (confirmed via DeepWiki):**
- `Patient`: patient demographics
- `Condition`: diagnoses (e.g. NSCLC)
- `Observation`: laboratory tests and vital signs
- `MedicationRequest`: medication orders
- `Procedure`: surgeries and procedures
- `DiagnosticReport`: diagnostic reports
- `DocumentReference`: clinical documents (requires the US Core IG to be enabled)
- `Encounter`: encounter records
- `AllergyIntolerance`: allergy history
- `Immunization`: immunizations
- `CarePlan`: care plans
- `ImagingStudy`: imaging studies
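Before mapping, it can help to sanity-check a generated Bundle by tallying its entries per resource type. A minimal stdlib sketch (the `count_resource_types` helper is hypothetical):

```python
from collections import Counter

def count_resource_types(bundle: dict) -> Counter:
    """Count FHIR resource types in one Synthea Bundle dict."""
    return Counter(
        entry.get("resource", {}).get("resourceType", "Unknown")
        for entry in bundle.get("entry", [])
    )
```

A Bundle with zero `Condition` or `Observation` entries, for instance, signals a module-configuration problem before any downstream parsing runs.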
### 2.6 Mapping FHIR Resources to PatientProfile
```python
# Mapping logic in data/generate_synthetic_patients.py
FHIR_TO_PATIENT_PROFILE_MAP = {
    # Patient Resource → demographics
    "Patient.name": "demographics.name",
    "Patient.gender": "demographics.sex",
    "Patient.birthDate": "demographics.date_of_birth",
    "Patient.address.state": "demographics.state",
    # Condition Resource → diagnosis
    "Condition[code=SNOMED:254637007]": "diagnosis.primary",  # NSCLC
    "Condition.stage.summary": "diagnosis.stage",
    "Condition.bodySite": "diagnosis.histology",
    # Observation Resources → biomarkers
    "Observation[code=LOINC:41103-3]": "biomarkers.egfr",
    "Observation[code=LOINC:46264-8]": "biomarkers.alk",
    "Observation[code=LOINC:85147-0]": "biomarkers.pdl1_tps",
    "Observation[code=LOINC:21717-3]": "biomarkers.kras",
    # Observation Resources → labs
    "Observation[category=laboratory]": "labs[]",
    # MedicationRequest → prior_treatments
    "MedicationRequest.medicationCodeableConcept": "treatments[].medication",
    # Procedure → prior_treatments
    "Procedure.code": "treatments[].procedure",
}
```
**่ฝฌๆขๅ‡ฝๆ•ฐๆจกๅผ๏ผš**
```python
import json
from pathlib import Path
from dataclasses import dataclass, field, asdict
from typing import Optional
@dataclass
class Demographics:
name: str = ""
sex: str = ""
date_of_birth: str = ""
age: int = 0
state: str = ""
@dataclass
class Diagnosis:
primary: str = ""
stage: str = ""
histology: str = ""
diagnosis_date: str = ""
@dataclass
class Biomarkers:
egfr: Optional[str] = None
alk: Optional[str] = None
pdl1_tps: Optional[str] = None
kras: Optional[str] = None
ros1: Optional[str] = None
@dataclass
class LabResult:
name: str = ""
value: float = 0.0
unit: str = ""
date: str = ""
loinc_code: str = ""
@dataclass
class Treatment:
name: str = ""
type: str = "" # "medication" | "procedure" | "radiation"
start_date: str = ""
end_date: Optional[str] = None
@dataclass
class PatientProfile:
patient_id: str = ""
demographics: Demographics = field(default_factory=Demographics)
diagnosis: Diagnosis = field(default_factory=Diagnosis)
biomarkers: Biomarkers = field(default_factory=Biomarkers)
labs: list[LabResult] = field(default_factory=list)
treatments: list[Treatment] = field(default_factory=list)
unknowns: list[str] = field(default_factory=list)
evidence_spans: list[dict] = field(default_factory=list)
def parse_fhir_bundle(fhir_path: Path) -> PatientProfile:
"""Parse a Synthea FHIR Bundle JSON into PatientProfile."""
with open(fhir_path) as f:
bundle = json.load(f)
profile = PatientProfile()
entries = bundle.get("entry", [])
for entry in entries:
resource = entry.get("resource", {})
resource_type = resource.get("resourceType")
if resource_type == "Patient":
_parse_patient(resource, profile)
elif resource_type == "Condition":
_parse_condition(resource, profile)
elif resource_type == "Observation":
_parse_observation(resource, profile)
elif resource_type == "MedicationRequest":
_parse_medication(resource, profile)
elif resource_type == "Procedure":
_parse_procedure(resource, profile)
return profile
def _parse_patient(resource: dict, profile: PatientProfile):
"""Extract demographics from Patient resource."""
names = resource.get("name", [{}])
if names:
given = " ".join(names[0].get("given", []))
family = names[0].get("family", "")
profile.demographics.name = f"{given} {family}".strip()
profile.demographics.sex = resource.get("gender", "")
profile.demographics.date_of_birth = resource.get("birthDate", "")
profile.patient_id = resource.get("id", "")
addresses = resource.get("address", [{}])
if addresses:
profile.demographics.state = addresses[0].get("state", "")
def _parse_condition(resource: dict, profile: PatientProfile):
"""Extract diagnosis from Condition resource."""
code = resource.get("code", {})
codings = code.get("coding", [])
for coding in codings:
# SNOMED codes for lung cancer
if coding.get("code") in ["254637007", "254632001"]:
profile.diagnosis.primary = coding.get("display", "")
onset = resource.get("onsetDateTime", "")
profile.diagnosis.diagnosis_date = onset
# Extract stage if available
stage_info = resource.get("stage", [])
if stage_info:
summary = stage_info[0].get("summary", {})
stage_codings = summary.get("coding", [])
if stage_codings:
profile.diagnosis.stage = stage_codings[0].get("display", "")
def _parse_observation(resource: dict, profile: PatientProfile):
"""Extract labs and biomarkers from Observation resource."""
code = resource.get("code", {})
codings = code.get("coding", [])
category_list = resource.get("category", [])
is_lab = any(
cat_coding.get("code") == "laboratory"
for cat in category_list
for cat_coding in cat.get("coding", [])
)
for coding in codings:
loinc = coding.get("code", "")
display = coding.get("display", "")
# Biomarker mappings
biomarker_map = {
"41103-3": "egfr",
"46264-8": "alk",
"85147-0": "pdl1_tps",
"21717-3": "kras",
"46265-5": "ros1",
}
if loinc in biomarker_map:
value_cc = resource.get("valueCodeableConcept", {})
value_codings = value_cc.get("coding", [])
value_str = value_codings[0].get("display", "") if value_codings else ""
setattr(profile.biomarkers, biomarker_map[loinc], value_str)
elif is_lab:
value_qty = resource.get("valueQuantity", {})
lab = LabResult(
name=display,
value=value_qty.get("value", 0.0),
unit=value_qty.get("unit", ""),
date=resource.get("effectiveDateTime", ""),
loinc_code=loinc,
)
profile.labs.append(lab)
```
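Note that `Demographics.age` is declared above but never populated by `_parse_patient`; if age is needed downstream, it can be derived from the FHIR `birthDate`. A hedged sketch (the `age_from_birthdate` helper is an assumption, not existing code):

```python
from datetime import date

def age_from_birthdate(birth_date: str, as_of: date) -> int:
    """Compute age in whole years from a FHIR birthDate string (YYYY-MM-DD)."""
    born = date.fromisoformat(birth_date)
    # Subtract one year if the birthday has not yet occurred in the as_of year
    return as_of.year - born.year - ((as_of.month, as_of.day) < (born.month, born.day))
```

Pinning `as_of` to a fixed reference date (e.g. the simulation end date) rather than `date.today()` keeps the ground truth reproducible.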
---
## 3. Synthetic PDF Generation Pipeline
### 3.1 Overview
Goal: convert each `PatientProfile` into realistic clinical-document PDFs and inject controlled noise to simulate real-world OCR conditions.
**Tech stack:**
- **ReportLab** (`pip install reportlab`): PDF generation engine supporting Platypus flowables such as `SimpleDocTemplate`, `Table`, and `Paragraph`
- **Augraphy** (`pip install augraphy`): document-image degradation pipeline simulating print, fax, and scan noise
- **Pillow** (`pip install Pillow`): image processing
- **pdf2image** (`pip install pdf2image`): PDF-to-image conversion (for rasterizing before noise injection and converting back to PDF)
### 3.2 Clinical Letter Template
```python
# data/templates/clinical_letter.py
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.platypus import (
SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle
)
from reportlab.lib import colors
def generate_clinical_letter(profile: dict, output_path: str):
"""Generate a clinical letter PDF from PatientProfile."""
doc = SimpleDocTemplate(output_path, pagesize=letter,
topMargin=1*inch, bottomMargin=1*inch)
styles = getSampleStyleSheet()
story = []
# Header
header_style = ParagraphStyle(
'Header', parent=styles['Heading1'], fontSize=14,
spaceAfter=6
)
story.append(Paragraph("Clinical Summary Letter", header_style))
story.append(Spacer(1, 12))
# Patient Info
info_data = [
["Patient Name:", profile["demographics"]["name"]],
["Date of Birth:", profile["demographics"]["date_of_birth"]],
["Sex:", profile["demographics"]["sex"]],
["MRN:", profile["patient_id"]],
]
info_table = Table(info_data, colWidths=[2*inch, 4*inch])
info_table.setStyle(TableStyle([
('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
('FONTNAME', (1, 0), (1, -1), 'Helvetica'),
('FONTSIZE', (0, 0), (-1, -1), 10),
('VALIGN', (0, 0), (-1, -1), 'TOP'),
]))
story.append(info_table)
story.append(Spacer(1, 18))
# Diagnosis Section
story.append(Paragraph("Diagnosis", styles['Heading2']))
dx = profile.get("diagnosis", {})
dx_text = (
f"Primary: {dx.get('primary', 'Unknown')}. "
f"Stage: {dx.get('stage', 'Unknown')}. "
f"Histology: {dx.get('histology', 'Unknown')}. "
f"Diagnosed: {dx.get('diagnosis_date', 'Unknown')}."
)
story.append(Paragraph(dx_text, styles['Normal']))
story.append(Spacer(1, 12))
# Biomarkers Section
story.append(Paragraph("Molecular Testing", styles['Heading2']))
bm = profile.get("biomarkers", {})
bm_data = [["Biomarker", "Result"]]
for marker, value in bm.items():
if value is not None:
bm_data.append([marker.upper(), str(value)])
if len(bm_data) > 1:
bm_table = Table(bm_data, colWidths=[2.5*inch, 3.5*inch])
bm_table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), colors.lightgrey),
('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
('FONTSIZE', (0, 0), (-1, -1), 10),
]))
story.append(bm_table)
story.append(Spacer(1, 12))
# Treatment History
story.append(Paragraph("Treatment History", styles['Heading2']))
treatments = profile.get("treatments", [])
for tx in treatments:
tx_text = f"- {tx['name']} ({tx['type']}): {tx.get('start_date', '')}"
story.append(Paragraph(tx_text, styles['Normal']))
doc.build(story)
```
### 3.3 Pathology Report Template
```python
# data/templates/pathology_report.py
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table

def generate_pathology_report(profile: dict, output_path: str):
"""Generate a pathology report PDF."""
doc = SimpleDocTemplate(output_path, pagesize=letter)
styles = getSampleStyleSheet()
story = []
story.append(Paragraph("SURGICAL PATHOLOGY REPORT", styles['Title']))
story.append(Spacer(1, 12))
# Specimen Info
spec_data = [
["Specimen:", "Right lung, upper lobe, wedge resection"],
["Procedure:", "CT-guided needle biopsy"],
["Date:", profile["diagnosis"]["diagnosis_date"]],
]
spec_table = Table(spec_data, colWidths=[2*inch, 4*inch])
story.append(spec_table)
story.append(Spacer(1, 12))
# Final Diagnosis
story.append(Paragraph("FINAL DIAGNOSIS", styles['Heading2']))
story.append(Paragraph(
f"Non-small cell lung carcinoma, {profile['diagnosis'].get('histology', 'adenocarcinoma')}, "
f"{profile['diagnosis'].get('stage', 'Stage IIIA')}",
styles['Normal']
))
# Biomarker Results
story.append(Spacer(1, 12))
story.append(Paragraph("MOLECULAR/IMMUNOHISTOCHEMISTRY", styles['Heading2']))
bm = profile.get("biomarkers", {})
results = []
if bm.get("egfr"):
results.append(f"EGFR mutation analysis: {bm['egfr']}")
if bm.get("alk"):
results.append(f"ALK rearrangement (FISH): {bm['alk']}")
if bm.get("pdl1_tps"):
results.append(f"PD-L1 (22C3, TPS): {bm['pdl1_tps']}")
if bm.get("kras"):
results.append(f"KRAS mutation analysis: {bm['kras']}")
for r in results:
story.append(Paragraph(r, styles['Normal']))
doc.build(story)
```
### 3.4 Laboratory Report Template
```python
# data/templates/lab_report.py
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle

def generate_lab_report(profile: dict, output_path: str):
"""Generate a laboratory report PDF with CBC, CMP, etc."""
doc = SimpleDocTemplate(output_path, pagesize=letter)
styles = getSampleStyleSheet()
story = []
story.append(Paragraph("LABORATORY REPORT", styles['Title']))
story.append(Spacer(1, 12))
# Lab Results Table
lab_data = [["Test", "Result", "Unit", "Reference Range", "Date"]]
for lab in profile.get("labs", []):
lab_data.append([
lab["name"], str(lab["value"]), lab["unit"],
"", # Reference range (can be added)
lab["date"][:10] if lab["date"] else ""
])
if len(lab_data) > 1:
lab_table = Table(lab_data, colWidths=[2*inch, 1*inch, 0.8*inch, 1.2*inch, 1*inch])
lab_table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#003366')),
('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
('FONTSIZE', (0, 0), (-1, -1), 9),
('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.HexColor('#f0f0f0')]),
]))
story.append(lab_table)
doc.build(story)
```
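The reference-range column above is left blank; one way to fill it is a static lookup keyed by LOINC code. A sketch with illustrative adult ranges (the `REFERENCE_RANGES` table is hypothetical sample data for document rendering, not clinical guidance):

```python
# Illustrative adult reference ranges keyed by LOINC code (hypothetical subset)
REFERENCE_RANGES = {
    "718-7": "13.5-17.5 g/dL",     # Hemoglobin
    "6690-2": "4.5-11.0 10*3/uL",  # White blood cell count
    "777-3": "150-400 10*3/uL",    # Platelet count
    "2160-0": "0.7-1.3 mg/dL",     # Creatinine
}

def reference_range_for(loinc_code: str) -> str:
    """Look up a display reference range; empty string if unknown."""
    return REFERENCE_RANGES.get(loinc_code, "")
```

The lab table loop can then pass `reference_range_for(lab["loinc_code"])` instead of the empty string.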
### 3.5 Noise Injection Strategy
```python
# data/noise/noise_injector.py
import random
import re
from pathlib import Path
from PIL import Image
# Augraphy pipeline configuration (optional dependency)
try:
from augraphy import (
AugraphyPipeline, InkBleed, Letterpress, LowInkPeriodicLines,
DirtyDrum, SubtleNoise, Jpeg, Brightness, BleedThrough
)
AUGRAPHY_AVAILABLE = True
except ImportError:
AUGRAPHY_AVAILABLE = False
class NoiseInjector:
"""ๅ—ๆŽงๅ™ชๅฃฐๆณจๅ…ฅๅผ•ๆ“Ž๏ผŒๆจกๆ‹Ÿ็œŸๅฎžไธ–็•Œๆ–‡ๆกฃ้€€ๅŒ–ใ€‚"""
# OCR ๅธธ่ง้”™่ฏฏๆ˜ ๅฐ„
OCR_ERROR_MAP = {
"0": ["O", "o", "Q"],
"1": ["l", "I", "|"],
"5": ["S", "s"],
"8": ["B"],
"O": ["0", "Q"],
"l": ["1", "I", "|"],
"rn": ["m"],
"cl": ["d"],
"vv": ["w"],
}
    # Medical abbreviation substitutions
ABBREVIATION_MAP = {
"non-small cell lung cancer": ["NSCLC", "non-small cell ca", "NSCC"],
"adenocarcinoma": ["adeno", "adenoca", "adeno ca"],
"squamous cell carcinoma": ["SCC", "squamous ca", "sq cell ca"],
"Eastern Cooperative Oncology Group": ["ECOG"],
"performance status": ["PS", "perf status"],
"milligrams per deciliter": ["mg/dL", "mg/dl"],
"computed tomography": ["CT", "cat scan"],
}
    # Noise level configurations
NOISE_LEVELS = {
"clean": {"ocr_rate": 0.0, "abbrev_rate": 0.0, "missing_rate": 0.0},
"mild": {"ocr_rate": 0.02, "abbrev_rate": 0.1, "missing_rate": 0.05},
"moderate": {"ocr_rate": 0.05, "abbrev_rate": 0.2, "missing_rate": 0.1},
"severe": {"ocr_rate": 0.10, "abbrev_rate": 0.3, "missing_rate": 0.2},
}
def __init__(self, noise_level: str = "mild", seed: int = 42):
self.config = self.NOISE_LEVELS[noise_level]
self.rng = random.Random(seed)
def inject_text_noise(self, text: str) -> tuple[str, list[dict]]:
"""Inject OCR errors and abbreviations into text.
Returns (noisy_text, list_of_injected_noise_records).
"""
noise_records = []
chars = list(text)
        # OCR character substitutions. Two-character confusions (e.g. "rn" -> "m")
        # are checked first so the multi-character keys in OCR_ERROR_MAP take effect.
        i = 0
        while i < len(chars):
            if self.rng.random() < self.config["ocr_rate"]:
                pair = "".join(chars[i:i + 2])
                original = pair if pair in self.OCR_ERROR_MAP else chars[i]
                if original in self.OCR_ERROR_MAP:
                    replacement = self.rng.choice(self.OCR_ERROR_MAP[original])
                    chars[i:i + len(original)] = [replacement]
                    noise_records.append({
                        "type": "ocr_error",
                        "position": i,
                        "original": original,
                        "replacement": replacement,
                    })
            i += 1
noisy_text = "".join(chars)
# Abbreviation substitutions
for full_form, abbreviations in self.ABBREVIATION_MAP.items():
            if full_form.lower() in noisy_text.lower() and self.rng.random() < self.config["abbrev_rate"]:
abbrev = self.rng.choice(abbreviations)
noisy_text = re.sub(
re.escape(full_form), abbrev, noisy_text, count=1, flags=re.IGNORECASE
)
noise_records.append({
"type": "abbreviation",
"original": full_form,
"replacement": abbrev,
})
return noisy_text, noise_records
def inject_missing_values(self, profile: dict) -> tuple[dict, list[str]]:
"""Randomly remove fields from profile to simulate missing data.
Returns (modified_profile, list_of_removed_fields).
"""
removed = []
removable_fields = [
("biomarkers", "egfr"),
("biomarkers", "alk"),
("biomarkers", "pdl1_tps"),
("biomarkers", "kras"),
("biomarkers", "ros1"),
("diagnosis", "stage"),
("diagnosis", "histology"),
]
for section, field_name in removable_fields:
if self.rng.random() < self.config["missing_rate"]:
if section in profile and field_name in profile[section]:
profile[section][field_name] = None
removed.append(f"{section}.{field_name}")
return profile, removed
def degrade_image(self, image: Image.Image) -> Image.Image:
"""Apply Augraphy degradation pipeline to document image."""
if not AUGRAPHY_AVAILABLE:
return image
import numpy as np
img_array = np.array(image)
pipeline = AugraphyPipeline(
ink_phase=[
InkBleed(p=0.5),
Letterpress(p=0.3),
LowInkPeriodicLines(p=0.3),
],
paper_phase=[
SubtleNoise(p=0.5),
],
post_phase=[
DirtyDrum(p=0.3),
Brightness(p=0.5),
Jpeg(p=0.5),
],
)
degraded = pipeline(img_array)
return Image.fromarray(degraded)
```
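Because `inject_text_noise` returns structured noise records, every injected error stays auditable. A sketch of reversing single-character OCR substitutions from those records, useful for verifying alignment in tests (the `revert_ocr_noise` helper is illustrative, not part of the codebase):

```python
def revert_ocr_noise(noisy_text: str, noise_records: list[dict]) -> str:
    """Undo recorded single-character OCR substitutions (audit helper)."""
    chars = list(noisy_text)
    for record in noise_records:
        if record["type"] != "ocr_error":
            continue
        pos = record["position"]
        # Only single-character swaps keep positions stable; skip anything else
        if len(record["original"]) == 1 and chars[pos] == record["replacement"]:
            chars[pos] = record["original"]
    return "".join(chars)
```

A round-trip test (`revert_ocr_noise(noisy, records) == original`) on clean-vs-noisy pairs is a cheap invariant for `test_noise_injection.py`.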
---
## 4. TREC Benchmark Evaluation Guide
### 4.1 Dataset Overview
**TREC Clinical Trials Track 2021:**
- Source: the NIST Text REtrieval Conference (TREC)
- Topics (queries): 75 synthetic patient descriptions (5-10 sentence admission notes)
- Document collection: 376,000+ clinical trials (April 2021 snapshot of ClinicalTrials.gov)
- Qrels: 35,832 relevance judgments
- Relevance labels: 0 = not relevant, 1 = excluded, 2 = eligible
**TREC Clinical Trials Track 2022:**
- Topics: 50 synthetic patient descriptions
- Uses the same document collection snapshot
### 4.2 Data Formats
#### Topics XML format
```xml
<topics task="2021 TREC Clinical Trials">
<topic number="1">
A 62-year-old male presents with a 3-month history of
progressive dyspnea and a 20-pound weight loss. He has
a 40 pack-year smoking history. CT chest reveals a 4.5cm
right upper lobe mass with mediastinal lymphadenopathy.
Biopsy confirms non-small cell lung cancer, adenocarcinoma.
EGFR mutation testing is positive for exon 19 deletion.
PD-L1 TPS is 60%. ECOG performance status is 1.
</topic>
<topic number="2">
...
</topic>
</topics>
```
#### Qrels format (whitespace-separated)
```
topic_id 0 doc_id relevance
1 0 NCT00760162 2
1 0 NCT01234567 1
1 0 NCT09876543 0
```
- Column 1: topic number
- Column 2: fixed value 0 (iteration)
- Column 3: NCT document ID
- Column 4: relevance (0 = not relevant, 1 = excluded, 2 = eligible)
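The qrels columns parse naturally into the nested-dict shape used elsewhere in this guide. A minimal stdlib sketch (the `parse_qrels` helper is illustrative; `ir_measures.read_trec_qrels` covers the same job in the real pipeline):

```python
def parse_qrels(lines: list[str]) -> dict[str, dict[str, int]]:
    """Parse whitespace-separated TREC qrels lines into {topic_id: {doc_id: relevance}}."""
    qrels: dict[str, dict[str, int]] = {}
    for line in lines:
        parts = line.split()
        # Skip blank, malformed, or header lines
        if len(parts) != 4 or not parts[3].isdigit():
            continue
        topic_id, _iteration, doc_id, relevance = parts
        qrels.setdefault(topic_id, {})[doc_id] = int(relevance)
    return qrels
```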
#### Run submission format
```
TOPIC_NO Q0 NCT_ID RANK SCORE RUN_NAME
1 Q0 NCT00760162 1 0.9999 trialpath-v1
1 Q0 NCT01234567 2 0.9998 trialpath-v1
```
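Before submission it is worth validating each run line against the six-column shape above. A small sketch (the `validate_run_line` helper is hypothetical):

```python
def validate_run_line(line: str) -> bool:
    """Check one TREC run line: topic, 'Q0', doc id, integer rank >= 1, float score, tag."""
    parts = line.split()
    if len(parts) != 6 or parts[1] != "Q0":
        return False
    try:
        rank = int(parts[3])
        float(parts[4])
    except ValueError:
        return False
    return rank >= 1
```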
### 4.3 Loading Data with ir_datasets
```python
# evaluation/run_trec_benchmark.py
import ir_datasets
def load_trec_2021():
"""Load TREC CT 2021 topics and qrels via ir_datasets."""
dataset = ir_datasets.load("clinicaltrials/2021/trec-ct-2021")
    # Load topics (GenericQuery: query_id, text)
topics = {}
for query in dataset.queries_iter():
topics[query.query_id] = query.text
    # Load qrels (TrecQrel: query_id, doc_id, relevance, iteration)
qrels = {}
for qrel in dataset.qrels_iter():
if qrel.query_id not in qrels:
qrels[qrel.query_id] = {}
qrels[qrel.query_id][qrel.doc_id] = qrel.relevance
return topics, qrels
def load_trec_2022():
"""Load TREC CT 2022 topics and qrels."""
dataset = ir_datasets.load("clinicaltrials/2021/trec-ct-2022")
topics = {q.query_id: q.text for q in dataset.queries_iter()}
qrels = {}
for qrel in dataset.qrels_iter():
if qrel.query_id not in qrels:
qrels[qrel.query_id] = {}
qrels[qrel.query_id][qrel.doc_id] = qrel.relevance
return topics, qrels
def load_trial_documents():
"""Load the clinical trial documents from ir_datasets."""
dataset = ir_datasets.load("clinicaltrials/2021")
# ClinicalTrialsDoc: doc_id, title, condition, summary,
# detailed_description, eligibility
docs = {}
for doc in dataset.docs_iter():
docs[doc.doc_id] = {
"title": doc.title,
"condition": doc.condition,
"summary": doc.summary,
"detailed_description": doc.detailed_description,
"eligibility": doc.eligibility,
}
return docs
```
### 4.4 Converting TrialPath Output to TREC Run Format
```python
def convert_trialpath_to_trec_run(
results: dict[str, list[dict]],
run_name: str = "trialpath-v1"
) -> str:
"""Convert TrialPath matching results to TREC run format.
Args:
results: {topic_id: [{"nct_id": str, "score": float}, ...]}
run_name: Run identifier
Returns:
TREC-format run string
"""
lines = []
for topic_id, candidates in results.items():
sorted_candidates = sorted(candidates, key=lambda x: x["score"], reverse=True)
for rank, candidate in enumerate(sorted_candidates[:1000], 1):
lines.append(
f"{topic_id} Q0 {candidate['nct_id']} {rank} "
f"{candidate['score']:.6f} {run_name}"
)
return "\n".join(lines)
def save_trec_run(run_str: str, output_path: str):
"""Save TREC run to file."""
with open(output_path, 'w') as f:
f.write(run_str)
```
### 4.5 Computing Evaluation Metrics with ir-measures
```python
# evaluation/run_trec_benchmark.py (continued)
import ir_measures
from ir_measures import nDCG, P, Recall, AP, RR, SetP, SetR, SetF
def evaluate_trec_run(
qrels_path: str,
run_path: str,
) -> dict:
"""Evaluate a TREC run using ir-measures.
Target metrics:
- Recall@50 >= 0.75
- NDCG@10 >= 0.60
- P@10 (informational)
"""
qrels = list(ir_measures.read_trec_qrels(qrels_path))
run = list(ir_measures.read_trec_run(run_path))
    # Define target measures
measures = [
nDCG@10, # Target >= 0.60
Recall@50, # Target >= 0.75
P@10, # Precision at 10
AP, # Mean Average Precision
RR, # Reciprocal Rank
nDCG@20, # Additional depth
Recall@100, # Extended recall
]
# ่ฎก็ฎ—่šๅˆๆŒ‡ๆ ‡
aggregate = ir_measures.calc_aggregate(measures, qrels, run)
# ่ฎก็ฎ—้€ๆŸฅ่ฏขๆŒ‡ๆ ‡
per_query = {}
for metric in ir_measures.iter_calc(measures, qrels, run):
qid = metric.query_id
if qid not in per_query:
per_query[qid] = {}
per_query[qid][str(metric.measure)] = metric.value
return {
"aggregate": {str(k): v for k, v in aggregate.items()},
"per_query": per_query,
"pass_fail": {
"ndcg@10": aggregate.get(nDCG@10, 0) >= 0.60,
"recall@50": aggregate.get(Recall@50, 0) >= 0.75,
}
}
def evaluate_with_eligibility_levels(
qrels_path: str,
run_path: str,
) -> dict:
"""Evaluate with TREC CT graded relevance (0=NR, 1=Excluded, 2=Eligible).
Uses rel=2 for strict eligible-only evaluation.
"""
qrels = list(ir_measures.read_trec_qrels(qrels_path))
run = list(ir_measures.read_trec_run(run_path))
# Standard evaluation (relevance >= 1)
standard_measures = [nDCG@10, Recall@50, P@10]
standard = ir_measures.calc_aggregate(standard_measures, qrels, run)
# Strict evaluation (only eligible = relevance 2)
strict_measures = [
AP(rel=2),
P(rel=2)@10,
Recall(rel=2)@50,
]
strict = ir_measures.calc_aggregate(strict_measures, qrels, run)
return {
"standard": {str(k): v for k, v in standard.items()},
"strict_eligible_only": {str(k): v for k, v in strict.items()},
}
```
### 4.6 Alternative qrels/run Formats (Python dicts via ir-measures)
```python
def evaluate_from_dicts(
qrels_dict: dict[str, dict[str, int]],
run_dict: dict[str, list[tuple[str, float]]],
) -> dict:
"""Evaluate using Python dict format (no files needed).
Args:
qrels_dict: {query_id: {doc_id: relevance}}
run_dict: {query_id: [(doc_id, score), ...]}
"""
# Convert to ir-measures format
qrels = [
ir_measures.Qrel(qid, did, rel)
for qid, docs in qrels_dict.items()
for did, rel in docs.items()
]
run = [
ir_measures.ScoredDoc(qid, did, score)
for qid, docs in run_dict.items()
for did, score in docs
]
measures = [nDCG@10, Recall@50, P@10, AP]
aggregate = ir_measures.calc_aggregate(measures, qrels, run)
return {str(k): v for k, v in aggregate.items()}
```
---
## 5. MedGemma Extraction Evaluation
### 5.1 Annotated Dataset Design
```python
# evaluation/extraction_eval.py
from dataclasses import dataclass
from typing import Optional
@dataclass
class AnnotatedField:
"""A single annotated field with ground truth and extraction result."""
field_name: str # e.g., "biomarkers.egfr"
ground_truth: Optional[str] # From Synthea profile (gold standard)
extracted: Optional[str] # From MedGemma extraction
evidence_span: Optional[str] # Text span in source document
source_page: Optional[int] # Page number in PDF
@dataclass
class ExtractionAnnotation:
"""Complete annotation for one patient's extraction."""
patient_id: str
fields: list[AnnotatedField]
noise_level: str # "clean", "mild", "moderate", "severe"
document_type: str # "clinical_letter", "pathology_report", etc.
```
**ๆ ‡ๆณจๆ•ฐๆฎ้›†็ป“ๆž„๏ผš**
```json
{
"patient_id": "synth-001",
"noise_level": "mild",
"document_type": "clinical_letter",
"fields": [
{
"field_name": "demographics.name",
"ground_truth": "John Smith",
"extracted": "John Smith",
"correct": true
},
{
"field_name": "diagnosis.stage",
"ground_truth": "Stage IIIA",
"extracted": "Stage 3A",
"correct": true,
"note": "Equivalent representation"
},
{
"field_name": "biomarkers.egfr",
"ground_truth": "Exon 19 deletion",
"extracted": "EGFR positive",
"correct": false,
"note": "Partial extraction - missing specific mutation"
}
]
}
```
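The "Equivalent representation" note above ("Stage IIIA" vs "Stage 3A") implies annotators need a normalization rule before comparing extracted values. A minimal sketch of one possible normalizer; the roman-numeral mapping and regex are illustrative assumptions, not part of the annotation spec:

```python
import re

# Map arabic stage digits to roman numerals so "Stage 3A" == "Stage IIIA"
_ARABIC_TO_ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV"}

def normalize_stage(value: str) -> str:
    """Normalize a cancer-stage string for equality comparison."""
    text = value.strip().upper()
    m = re.match(r"(?:STAGE\s*)?([1-4]|I{1,3}|IV)\s*([A-C]?)$", text)
    if not m:
        return text  # leave unrecognized values untouched
    numeral, suffix = m.groups()
    numeral = _ARABIC_TO_ROMAN.get(numeral, numeral)
    return f"STAGE {numeral}{suffix}"

assert normalize_stage("Stage IIIA") == normalize_stage("Stage 3A")
```

With a rule like this, the `correct` flag can be computed mechanically for stage fields instead of relying on annotator judgment.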
### 5.2 Field-Level F1 Computation
```python
# evaluation/extraction_eval.py
from sklearn.metrics import (
f1_score, precision_score, recall_score,
classification_report, confusion_matrix
)
import numpy as np
# ๅฎšไน‰ๆ‰€ๆœ‰ๅฏๆๅ–ๅญ—ๆฎต
EXTRACTION_FIELDS = [
"demographics.name",
"demographics.sex",
"demographics.date_of_birth",
"demographics.age",
"diagnosis.primary",
"diagnosis.stage",
"diagnosis.histology",
"biomarkers.egfr",
"biomarkers.alk",
"biomarkers.pdl1_tps",
"biomarkers.kras",
"biomarkers.ros1",
"labs.wbc",
"labs.hemoglobin",
"labs.platelets",
"labs.creatinine",
"labs.alt",
"labs.ast",
"treatments.current_regimen",
"performance_status.ecog",
]
def compute_field_level_f1(
annotations: list[dict],
) -> dict:
"""Compute field-level F1, precision, recall.
    For each field:
    - TP: ground_truth exists AND the extraction is correct
    - FP: an extraction was produced where no ground_truth exists
    - FN: ground_truth exists BUT the extraction is missing or wrong
    Args:
        annotations: List of patient annotation dicts
    Returns:
        Per-field and aggregate metrics
    """
    field_metrics = {}
    for field_name in EXTRACTION_FIELDS:
        y_true = []  # 1 if field has a ground-truth value
        y_pred = []  # 1 if field was extracted (correctly or spuriously)
        for ann in annotations:
            fields = {f["field_name"]: f for f in ann["fields"]}
            if field_name in fields:
                f = fields[field_name]
                has_gt = f["ground_truth"] is not None
                is_correct = f.get("correct", False)
                spurious = not has_gt and f.get("extracted") is not None
                y_true.append(1 if has_gt else 0)
                # correct -> TP; spurious -> FP; missed/wrong -> FN
                y_pred.append(1 if (is_correct or spurious) else 0)
        if len(y_true) > 0:
            precision = precision_score(y_true, y_pred, zero_division=0)
            recall = recall_score(y_true, y_pred, zero_division=0)
            f1 = f1_score(y_true, y_pred, zero_division=0)
            field_metrics[field_name] = {
                "precision": round(precision, 4),
                "recall": round(recall, 4),
                "f1": round(f1, 4),
                "support": sum(y_true),
            }
    # Aggregate metrics (guard against empty input so callers never see NaN)
    all_y_true = []
    all_y_pred = []
    for ann in annotations:
        for f in ann["fields"]:
            has_gt = f["ground_truth"] is not None
            is_correct = f.get("correct", False)
            spurious = not has_gt and f.get("extracted") is not None
            all_y_true.append(1 if has_gt else 0)
            all_y_pred.append(1 if (is_correct or spurious) else 0)
    micro_f1 = f1_score(all_y_true, all_y_pred, zero_division=0) if all_y_true else 0.0
    macro_f1 = float(np.mean([m["f1"] for m in field_metrics.values()])) if field_metrics else 0.0
return {
"per_field": field_metrics,
"micro_f1": round(micro_f1, 4),
"macro_f1": round(macro_f1, 4),
"total_fields": len(all_y_true),
"pass": micro_f1 >= 0.85, # Target: F1 >= 0.85
}
def compute_extraction_report(annotations: list[dict]) -> str:
"""Generate a scikit-learn classification_report style output."""
all_y_true = []
all_y_pred = []
    for field_name in EXTRACTION_FIELDS:
        for ann in annotations:
            fields = {f["field_name"]: f for f in ann["fields"]}
            if field_name in fields:
                f = fields[field_name]
                has_gt = f["ground_truth"] is not None
                is_correct = f.get("correct", False)
                spurious = not has_gt and f.get("extracted") is not None
                all_y_true.append(1 if has_gt else 0)
                all_y_pred.append(1 if (is_correct or spurious) else 0)
return classification_report(
all_y_true, all_y_pred,
target_names=["absent", "present/correct"],
digits=4,
)
def compare_with_baseline(
medgemma_annotations: list[dict],
gemini_only_annotations: list[dict],
) -> dict:
"""Compare MedGemma extraction vs Gemini-only baseline."""
medgemma_metrics = compute_field_level_f1(medgemma_annotations)
gemini_metrics = compute_field_level_f1(gemini_only_annotations)
comparison = {}
for field_name in EXTRACTION_FIELDS:
mg = medgemma_metrics["per_field"].get(field_name, {})
gm = gemini_metrics["per_field"].get(field_name, {})
comparison[field_name] = {
"medgemma_f1": mg.get("f1", 0),
"gemini_f1": gm.get("f1", 0),
"delta": round(mg.get("f1", 0) - gm.get("f1", 0), 4),
}
return {
"per_field_comparison": comparison,
"medgemma_overall_f1": medgemma_metrics["micro_f1"],
"gemini_overall_f1": gemini_metrics["micro_f1"],
"improvement": round(
medgemma_metrics["micro_f1"] - gemini_metrics["micro_f1"], 4
),
}
```
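To make the micro/macro distinction above concrete: micro-F1 pools all field decisions before computing F1, while macro-F1 averages the per-field F1 values. A hand-computed toy case (counts invented for illustration):

```python
# Field A: 3 gt instances, all extracted correctly -> P=1, R=1,   F1_A = 1.0
# Field B: 3 gt instances, 1 extracted correctly   -> P=1, R=1/3, F1_B = 0.5
f1_a = 1.0
f1_b = 2 * 1.0 * (1 / 3) / (1.0 + 1 / 3)  # harmonic mean = 0.5
macro_f1 = (f1_a + f1_b) / 2              # average of per-field F1 = 0.75

# Pooled over all 6 instances: TP=4, FP=0, FN=2 -> P=1, R=2/3
micro_p, micro_r = 1.0, 4 / 6
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)  # = 0.8
```

Micro-F1 weights every field instance equally (so frequent fields dominate), while macro-F1 weights every field equally; the 0.85 target in `compute_field_level_f1` is defined on micro-F1.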
### 5.3 Impact of Noise Level on Extraction Performance
```python
def analyze_noise_impact(annotations: list[dict]) -> dict:
"""Analyze how noise level affects extraction F1."""
by_noise = {}
for ann in annotations:
level = ann["noise_level"]
if level not in by_noise:
by_noise[level] = []
by_noise[level].append(ann)
results = {}
for level, level_anns in by_noise.items():
metrics = compute_field_level_f1(level_anns)
results[level] = {
"micro_f1": metrics["micro_f1"],
"macro_f1": metrics["macro_f1"],
"n_patients": len(level_anns),
}
return results
```
---
## 6. End-to-End Evaluation Pipeline
### 6.1 Criterion Decision Accuracy
```python
# evaluation/criterion_eval.py
def compute_criterion_accuracy(
predictions: list[dict],
ground_truth: list[dict],
) -> dict:
"""Compute criterion-level decision accuracy.
Each prediction/ground_truth entry:
{
"patient_id": str,
"trial_id": str,
"criteria": [
{"criterion_id": str, "decision": "met"|"not_met"|"unknown",
"evidence": str}
]
}
Target: >= 0.85
"""
total = 0
correct = 0
by_decision_type = {"met": {"tp": 0, "total": 0},
"not_met": {"tp": 0, "total": 0},
"unknown": {"tp": 0, "total": 0}}
for pred, gt in zip(predictions, ground_truth):
assert pred["patient_id"] == gt["patient_id"]
assert pred["trial_id"] == gt["trial_id"]
gt_map = {c["criterion_id"]: c["decision"] for c in gt["criteria"]}
for criterion in pred["criteria"]:
cid = criterion["criterion_id"]
if cid in gt_map:
total += 1
gt_decision = gt_map[cid]
pred_decision = criterion["decision"]
by_decision_type[gt_decision]["total"] += 1
if pred_decision == gt_decision:
correct += 1
by_decision_type[gt_decision]["tp"] += 1
accuracy = correct / total if total > 0 else 0.0
return {
"overall_accuracy": round(accuracy, 4),
"total_criteria": total,
"correct": correct,
"pass": accuracy >= 0.85,
"by_decision_type": {
k: {
"accuracy": round(v["tp"] / v["total"], 4) if v["total"] > 0 else 0,
"support": v["total"],
}
for k, v in by_decision_type.items()
},
}
```
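As a worked micro-example of the decision-matching logic above (criterion ids and decisions are made up): with three criteria and one mismatched decision, accuracy is 2/3.

```python
# Ground truth and predicted decisions for one patient/trial pair
gt =   {"c1": "met", "c2": "not_met", "c3": "unknown"}
pred = {"c1": "met", "c2": "met",     "c3": "unknown"}

# Same matching rule as compute_criterion_accuracy, on plain dicts
correct = sum(1 for cid, decision in pred.items() if gt.get(cid) == decision)
accuracy = correct / len(gt)
assert abs(accuracy - 2 / 3) < 1e-9
```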
### 6.2 Latency Benchmarking
```python
# evaluation/latency_cost_tracker.py
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class APICallRecord:
"""Record of a single API call."""
service: str # "medgemma", "gemini", "clinicaltrials_mcp"
operation: str # "extract", "search", "evaluate_criterion"
latency_ms: float
input_tokens: int = 0
output_tokens: int = 0
cost_usd: float = 0.0
timestamp: str = ""
@dataclass
class SessionMetrics:
"""Aggregate metrics for a patient matching session."""
patient_id: str
total_latency_ms: float = 0.0
total_cost_usd: float = 0.0
api_calls: list[APICallRecord] = field(default_factory=list)
@property
def total_latency_s(self) -> float:
return self.total_latency_ms / 1000.0
@property
def pass_latency(self) -> bool:
"""Target: < 15s per session."""
return self.total_latency_s < 15.0
@property
def pass_cost(self) -> bool:
"""Target: < $0.50 per session."""
return self.total_cost_usd < 0.50
class LatencyCostTracker:
"""Track latency and cost across API calls."""
# Pricing per 1M tokens (approximate)
PRICING = {
"medgemma": {"input": 0.0, "output": 0.0}, # Self-hosted
"gemini": {"input": 1.25, "output": 5.00}, # Gemini Pro
"clinicaltrials_mcp": {"input": 0.0, "output": 0.0}, # Free API
}
def __init__(self):
self.sessions: list[SessionMetrics] = []
self._current_session: Optional[SessionMetrics] = None
def start_session(self, patient_id: str):
self._current_session = SessionMetrics(patient_id=patient_id)
    def end_session(self) -> Optional[SessionMetrics]:
session = self._current_session
if session:
session.total_latency_ms = sum(c.latency_ms for c in session.api_calls)
session.total_cost_usd = sum(c.cost_usd for c in session.api_calls)
self.sessions.append(session)
self._current_session = None
return session
@contextmanager
def track_call(self, service: str, operation: str):
"""Context manager to track an API call."""
start = time.monotonic()
record = APICallRecord(service=service, operation=operation, latency_ms=0)
try:
yield record
finally:
record.latency_ms = (time.monotonic() - start) * 1000
# Compute cost
pricing = self.PRICING.get(service, {"input": 0, "output": 0})
record.cost_usd = (
record.input_tokens * pricing["input"] / 1_000_000
+ record.output_tokens * pricing["output"] / 1_000_000
)
if self._current_session:
self._current_session.api_calls.append(record)
def summary(self) -> dict:
"""Generate aggregate summary across all sessions."""
if not self.sessions:
return {}
        latencies = sorted(s.total_latency_s for s in self.sessions)
        costs = [s.total_cost_usd for s in self.sessions]
        return {
            "n_sessions": len(self.sessions),
            "latency": {
                "mean_s": round(sum(latencies) / len(latencies), 2),
                "p50_s": round(latencies[len(latencies) // 2], 2),
                "p95_s": round(latencies[int(len(latencies) * 0.95)], 2),
                "max_s": round(latencies[-1], 2),
"pass_rate": round(
sum(1 for s in self.sessions if s.pass_latency) / len(self.sessions), 4
),
},
"cost": {
"mean_usd": round(sum(costs) / len(costs), 4),
"total_usd": round(sum(costs), 4),
"max_usd": round(max(costs), 4),
"pass_rate": round(
sum(1 for s in self.sessions if s.pass_cost) / len(self.sessions), 4
),
},
"targets": {
"latency_pass": all(s.pass_latency for s in self.sessions),
"cost_pass": all(s.pass_cost for s in self.sessions),
},
}
```
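The cost arithmetic inside `track_call` is worth spelling out once with the assumed Gemini Pro prices from `PRICING` ($1.25 per 1M input tokens, $5.00 per 1M output tokens): a call with 500 input and 200 output tokens costs $0.001625, comfortably under the $0.50 session budget.

```python
# Cost arithmetic used by track_call, with the assumed Gemini Pro pricing
input_tokens, output_tokens = 500, 200
price_in, price_out = 1.25, 5.00  # USD per 1M tokens (assumption from PRICING)

cost = (input_tokens * price_in / 1_000_000
        + output_tokens * price_out / 1_000_000)
assert abs(cost - 0.001625) < 1e-12
```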
---
## 7. TDD Test Cases
### 7.1 Synthea Data Validation Tests
```python
# tests/test_synthea_data.py
import pytest
import json
from pathlib import Path
# Expected FHIR resource types
REQUIRED_RESOURCE_TYPES = {"Patient", "Condition", "Observation", "Encounter"}
class TestSyntheaDataValidation:
"""Validate Synthea FHIR output for TrialPath requirements."""
def test_fhir_bundle_is_valid_json(self, fhir_file):
"""Bundle must be valid JSON."""
with open(fhir_file) as f:
data = json.load(f)
assert data["resourceType"] == "Bundle"
assert "entry" in data
def test_bundle_contains_required_resources(self, fhir_file):
"""Bundle must contain Patient, Condition, Observation, Encounter."""
with open(fhir_file) as f:
bundle = json.load(f)
resource_types = {
e["resource"]["resourceType"] for e in bundle["entry"]
}
for rt in REQUIRED_RESOURCE_TYPES:
assert rt in resource_types, f"Missing {rt} resource"
def test_patient_has_demographics(self, fhir_file):
"""Patient resource must have name, gender, birthDate."""
with open(fhir_file) as f:
bundle = json.load(f)
patients = [
e["resource"] for e in bundle["entry"]
if e["resource"]["resourceType"] == "Patient"
]
assert len(patients) == 1
patient = patients[0]
assert "name" in patient
assert "gender" in patient
assert "birthDate" in patient
def test_lung_cancer_condition_present(self, fhir_file):
"""At least one Condition must be NSCLC or lung cancer."""
with open(fhir_file) as f:
bundle = json.load(f)
conditions = [
e["resource"] for e in bundle["entry"]
if e["resource"]["resourceType"] == "Condition"
]
lung_cancer_codes = {"254637007", "254632001", "162573006"}
has_lung_cancer = False
for cond in conditions:
codings = cond.get("code", {}).get("coding", [])
for c in codings:
if c.get("code") in lung_cancer_codes:
has_lung_cancer = True
assert has_lung_cancer, "No lung cancer Condition found"
    def test_patient_profile_conversion(self, fhir_file):
        """FHIR Bundle must convert to valid PatientProfile."""
        from data.generate_synthetic_patients import parse_fhir_bundle
        profile = parse_fhir_bundle(Path(fhir_file))
assert profile.patient_id != ""
assert profile.demographics.name != ""
assert profile.demographics.sex in ("male", "female")
assert profile.diagnosis.primary != ""
def test_batch_generation_produces_500_patients(self, output_dir):
"""Batch generation must produce at least 500 FHIR files."""
fhir_files = list(Path(output_dir).glob("*.json"))
assert len(fhir_files) >= 500
def test_nsclc_ratio(self, all_profiles):
"""~85% of lung cancer patients should be NSCLC."""
nsclc_count = sum(
1 for p in all_profiles
if "non-small cell" in p.diagnosis.primary.lower()
or "nsclc" in p.diagnosis.primary.lower()
)
ratio = nsclc_count / len(all_profiles)
assert 0.70 <= ratio <= 0.95, f"NSCLC ratio {ratio} outside expected range"
```
### 7.2 PDF Generation Correctness Tests
```python
# tests/test_pdf_generation.py
import pytest
from pathlib import Path
from data.templates.clinical_letter import generate_clinical_letter
from data.templates.pathology_report import generate_pathology_report
from data.templates.lab_report import generate_lab_report
class TestPDFGeneration:
"""Test that PDF generation produces valid documents."""
SAMPLE_PROFILE = {
"patient_id": "test-001",
"demographics": {
"name": "Jane Doe",
"sex": "female",
"date_of_birth": "1960-05-15",
},
"diagnosis": {
"primary": "Non-small cell lung cancer, adenocarcinoma",
"stage": "Stage IIIA",
"histology": "adenocarcinoma",
"diagnosis_date": "2024-01-15",
},
"biomarkers": {
"egfr": "Exon 19 deletion",
"alk": "Negative",
"pdl1_tps": "60%",
"kras": None,
},
"labs": [
{"name": "WBC", "value": 7.2, "unit": "10*3/uL", "date": "2024-01-10", "loinc_code": "6690-2"},
{"name": "Hemoglobin", "value": 12.5, "unit": "g/dL", "date": "2024-01-10", "loinc_code": "718-7"},
],
"treatments": [
{"name": "Cisplatin", "type": "medication", "start_date": "2024-02-01"},
],
}
def test_clinical_letter_generates_pdf(self, tmp_path):
"""Clinical letter must generate a non-empty PDF file."""
output = tmp_path / "letter.pdf"
generate_clinical_letter(self.SAMPLE_PROFILE, str(output))
assert output.exists()
assert output.stat().st_size > 0
def test_pathology_report_generates_pdf(self, tmp_path):
"""Pathology report must generate a non-empty PDF file."""
output = tmp_path / "pathology.pdf"
generate_pathology_report(self.SAMPLE_PROFILE, str(output))
assert output.exists()
assert output.stat().st_size > 0
def test_lab_report_generates_pdf(self, tmp_path):
"""Lab report must generate a non-empty PDF file."""
output = tmp_path / "lab.pdf"
generate_lab_report(self.SAMPLE_PROFILE, str(output))
assert output.exists()
assert output.stat().st_size > 0
def test_pdf_contains_patient_name(self, tmp_path):
"""Generated PDF must contain patient name (OCR-verifiable)."""
output = tmp_path / "letter.pdf"
generate_clinical_letter(self.SAMPLE_PROFILE, str(output))
# Read PDF text (using pdfplumber or PyPDF2)
import pdfplumber
with pdfplumber.open(str(output)) as pdf:
text = ""
for page in pdf.pages:
text += page.extract_text() or ""
assert "Jane Doe" in text
def test_pdf_contains_biomarkers(self, tmp_path):
"""Generated PDF must contain biomarker results."""
output = tmp_path / "pathology.pdf"
generate_pathology_report(self.SAMPLE_PROFILE, str(output))
import pdfplumber
with pdfplumber.open(str(output)) as pdf:
text = ""
for page in pdf.pages:
text += page.extract_text() or ""
assert "EGFR" in text
assert "Exon 19" in text or "positive" in text.lower()
def test_missing_biomarker_handled_gracefully(self, tmp_path):
"""PDF generation should not crash when biomarkers are None."""
profile = self.SAMPLE_PROFILE.copy()
profile["biomarkers"] = {
"egfr": None, "alk": None, "pdl1_tps": None, "kras": None
}
output = tmp_path / "letter.pdf"
generate_clinical_letter(profile, str(output))
assert output.exists()
```
### 7.3 Noise Injection Validation Tests
```python
# tests/test_noise_injection.py
import pytest
from data.noise.noise_injector import NoiseInjector
class TestNoiseInjection:
"""Test noise injection produces expected results."""
def test_clean_noise_no_changes(self):
"""Clean level should produce no changes."""
injector = NoiseInjector(noise_level="clean", seed=42)
text = "Patient has EGFR mutation positive"
noisy, records = injector.inject_text_noise(text)
assert noisy == text
assert len(records) == 0
def test_mild_noise_produces_some_changes(self):
"""Mild noise should produce some but limited changes."""
injector = NoiseInjector(noise_level="mild", seed=42)
# Use longer text to increase chance of noise
text = "The patient is a 65 year old male with stage IIIA " * 10
noisy, records = injector.inject_text_noise(text)
        # With a fixed seed the change count is deterministic, but the exact
        # number depends on the noise tables, so only sanity-check the bounds
        assert 0 <= len(records) <= len(text)
def test_severe_noise_produces_many_changes(self):
"""Severe noise should produce noticeable changes."""
injector = NoiseInjector(noise_level="severe", seed=42)
text = "The 50 year old patient has stage 1 NSCLC " * 20
noisy, records = injector.inject_text_noise(text)
assert noisy != text # Should differ from original
assert len(records) > 0
def test_ocr_error_types_are_valid(self):
"""OCR errors should only substitute known character pairs."""
injector = NoiseInjector(noise_level="severe", seed=42)
text = "0123456789 OIBS" * 10
_, records = injector.inject_text_noise(text)
for r in records:
if r["type"] == "ocr_error":
assert r["original"] in NoiseInjector.OCR_ERROR_MAP
assert r["replacement"] in NoiseInjector.OCR_ERROR_MAP[r["original"]]
def test_missing_value_injection(self):
"""Missing value injection should remove some fields."""
injector = NoiseInjector(noise_level="moderate", seed=42)
profile = {
"biomarkers": {"egfr": "positive", "alk": "negative",
"pdl1_tps": "60%", "kras": "negative", "ros1": "negative"},
"diagnosis": {"stage": "IIIA", "histology": "adenocarcinoma"},
}
modified, removed = injector.inject_missing_values(profile)
# At 10% rate with 7 fields, expect 0-3 removals
assert len(removed) <= 7
for field_path in removed:
section, field_name = field_path.split(".")
assert modified[section][field_name] is None
def test_noise_is_deterministic_with_seed(self):
"""Same seed should produce identical results."""
text = "Patient has stage IIIA non-small cell lung cancer"
inj1 = NoiseInjector(noise_level="moderate", seed=123)
inj2 = NoiseInjector(noise_level="moderate", seed=123)
noisy1, _ = inj1.inject_text_noise(text)
noisy2, _ = inj2.inject_text_noise(text)
assert noisy1 == noisy2
def test_different_seeds_produce_different_results(self):
"""Different seeds should generally produce different noise."""
text = "The 50 year old patient has 10 biomarker tests 0 1 5 8" * 20
inj1 = NoiseInjector(noise_level="severe", seed=1)
inj2 = NoiseInjector(noise_level="severe", seed=999)
noisy1, _ = inj1.inject_text_noise(text)
noisy2, _ = inj2.inject_text_noise(text)
# With severe noise on long text, different seeds should differ
assert noisy1 != noisy2
```
### 7.4 TREC Evaluation Metric Tests
```python
# tests/test_trec_evaluation.py
import pytest
import ir_measures
from ir_measures import nDCG, Recall, P, AP
from evaluation.run_trec_benchmark import convert_trialpath_to_trec_run
class TestTRECEvaluation:
"""Test TREC evaluation metric computation."""
@pytest.fixture
def sample_qrels(self):
"""Sample qrels with known ground truth."""
return [
ir_measures.Qrel("q1", "d1", 2), # eligible
ir_measures.Qrel("q1", "d2", 1), # excluded
ir_measures.Qrel("q1", "d3", 0), # not relevant
ir_measures.Qrel("q1", "d4", 2), # eligible
ir_measures.Qrel("q1", "d5", 0), # not relevant
]
@pytest.fixture
def perfect_run(self):
"""Run that ranks all relevant docs at top."""
return [
ir_measures.ScoredDoc("q1", "d1", 1.0),
ir_measures.ScoredDoc("q1", "d4", 0.9),
ir_measures.ScoredDoc("q1", "d2", 0.8),
ir_measures.ScoredDoc("q1", "d3", 0.1),
ir_measures.ScoredDoc("q1", "d5", 0.05),
]
@pytest.fixture
def worst_run(self):
"""Run that ranks relevant docs at bottom."""
return [
ir_measures.ScoredDoc("q1", "d3", 1.0),
ir_measures.ScoredDoc("q1", "d5", 0.9),
ir_measures.ScoredDoc("q1", "d2", 0.5),
ir_measures.ScoredDoc("q1", "d4", 0.2),
ir_measures.ScoredDoc("q1", "d1", 0.1),
]
def test_perfect_ndcg_at_10(self, sample_qrels, perfect_run):
"""Perfect ranking should yield NDCG@10 = 1.0."""
result = ir_measures.calc_aggregate([nDCG@10], sample_qrels, perfect_run)
assert result[nDCG@10] == pytest.approx(1.0, abs=0.01)
def test_worst_ndcg_lower(self, sample_qrels, perfect_run, worst_run):
"""Worst ranking should yield lower NDCG than perfect."""
perfect = ir_measures.calc_aggregate([nDCG@10], sample_qrels, perfect_run)
worst = ir_measures.calc_aggregate([nDCG@10], sample_qrels, worst_run)
assert worst[nDCG@10] < perfect[nDCG@10]
def test_recall_at_50_perfect(self, sample_qrels, perfect_run):
"""Perfect run should retrieve all relevant docs."""
result = ir_measures.calc_aggregate([Recall@50], sample_qrels, perfect_run)
assert result[Recall@50] == pytest.approx(1.0, abs=0.01)
def test_empty_run_yields_zero(self, sample_qrels):
"""Empty run should yield 0 for all metrics."""
empty_run = []
result = ir_measures.calc_aggregate(
[nDCG@10, Recall@50, P@10], sample_qrels, empty_run
)
assert result[nDCG@10] == 0.0
assert result[Recall@50] == 0.0
assert result[P@10] == 0.0
def test_per_query_results(self, sample_qrels, perfect_run):
"""Per-query results should return one entry per query."""
results = list(ir_measures.iter_calc(
[nDCG@10], sample_qrels, perfect_run
))
assert len(results) == 1 # Only q1
assert results[0].query_id == "q1"
def test_trec_run_format_conversion(self):
"""Test TrialPath results to TREC format conversion."""
results = {
"1": [
{"nct_id": "NCT001", "score": 0.95},
{"nct_id": "NCT002", "score": 0.80},
]
}
run_str = convert_trialpath_to_trec_run(results, "test-run")
lines = run_str.strip().split("\n")
assert len(lines) == 2
assert "NCT001" in lines[0]
assert "1" == lines[0].split()[3] # rank 1
assert "2" == lines[1].split()[3] # rank 2
def test_graded_relevance_evaluation(self, sample_qrels, perfect_run):
"""Test strict eligible-only evaluation (rel=2)."""
strict = ir_measures.calc_aggregate(
[AP(rel=2)], sample_qrels, perfect_run
)
assert strict[AP(rel=2)] > 0.0
def test_qrels_dict_format(self):
"""Test evaluation from dict format."""
qrels = {"q1": {"d1": 2, "d2": 1, "d3": 0}}
run = [
ir_measures.ScoredDoc("q1", "d1", 1.0),
ir_measures.ScoredDoc("q1", "d2", 0.5),
ir_measures.ScoredDoc("q1", "d3", 0.1),
]
result = ir_measures.calc_aggregate([nDCG@10], qrels, run)
assert nDCG@10 in result
```
### 7.5 F1 Computation Tests
```python
# tests/test_extraction_f1.py
import pytest
from evaluation.extraction_eval import compute_field_level_f1
class TestExtractionF1:
"""Test F1 computation for field-level extraction."""
def test_perfect_extraction(self):
"""All fields correctly extracted should yield F1=1.0."""
annotations = [{
"patient_id": "p1",
"noise_level": "clean",
"document_type": "clinical_letter",
"fields": [
{"field_name": "demographics.name", "ground_truth": "John", "extracted": "John", "correct": True},
{"field_name": "demographics.sex", "ground_truth": "male", "extracted": "male", "correct": True},
{"field_name": "diagnosis.primary", "ground_truth": "NSCLC", "extracted": "NSCLC", "correct": True},
{"field_name": "biomarkers.egfr", "ground_truth": "positive", "extracted": "positive", "correct": True},
]
}]
result = compute_field_level_f1(annotations)
assert result["micro_f1"] == 1.0
assert result["pass"] is True
def test_zero_extraction(self):
"""No correct extractions should yield F1=0."""
annotations = [{
"patient_id": "p1",
"noise_level": "clean",
"document_type": "clinical_letter",
"fields": [
{"field_name": "demographics.name", "ground_truth": "John", "extracted": "Jane", "correct": False},
{"field_name": "diagnosis.primary", "ground_truth": "NSCLC", "extracted": None, "correct": False},
]
}]
result = compute_field_level_f1(annotations)
assert result["micro_f1"] == 0.0
assert result["pass"] is False
def test_partial_extraction(self):
"""Partial extraction should yield 0 < F1 < 1."""
annotations = [{
"patient_id": "p1",
"noise_level": "mild",
"document_type": "clinical_letter",
"fields": [
{"field_name": "demographics.name", "ground_truth": "John", "extracted": "John", "correct": True},
{"field_name": "diagnosis.primary", "ground_truth": "NSCLC", "extracted": "lung ca", "correct": False},
{"field_name": "biomarkers.egfr", "ground_truth": "positive", "extracted": "positive", "correct": True},
{"field_name": "biomarkers.alk", "ground_truth": "negative", "extracted": None, "correct": False},
]
}]
result = compute_field_level_f1(annotations)
assert 0.0 < result["micro_f1"] < 1.0
def test_f1_threshold_boundary(self):
"""F1 exactly at 0.85 should pass."""
# Create annotations that produce exactly 0.85 F1
fields = []
for i in range(85):
fields.append({"field_name": f"field_{i}", "ground_truth": "val", "extracted": "val", "correct": True})
for i in range(15):
fields.append({"field_name": f"field_miss_{i}", "ground_truth": "val", "extracted": None, "correct": False})
annotations = [{"patient_id": "p1", "noise_level": "clean",
"document_type": "test", "fields": fields}]
result = compute_field_level_f1(annotations)
# With 85/100 correct, F1 should be ~0.85
assert result["pass"] is True
def test_empty_annotations(self):
"""Empty annotations should not crash."""
result = compute_field_level_f1([])
assert result["micro_f1"] == 0.0
def test_none_ground_truth_not_counted(self):
"""Fields with None ground truth should be handled."""
annotations = [{
"patient_id": "p1",
"noise_level": "clean",
"document_type": "test",
"fields": [
{"field_name": "biomarkers.ros1", "ground_truth": None,
"extracted": None, "correct": False},
]
}]
result = compute_field_level_f1(annotations)
# Should not crash, though metrics may be 0
assert "micro_f1" in result
```
### 7.6 End-to-End Pipeline Tests
```python
# tests/test_e2e_pipeline.py
import pytest
from pathlib import Path
class TestE2EPipeline:
"""End-to-end tests for the complete data & evaluation pipeline."""
def test_fhir_to_profile_to_pdf_roundtrip(self, sample_fhir_file, tmp_path):
"""FHIR โ†’ PatientProfile โ†’ PDF should complete without error."""
from data.generate_synthetic_patients import parse_fhir_bundle
from data.templates.clinical_letter import generate_clinical_letter
from dataclasses import asdict
# Step 1: Parse FHIR
profile = parse_fhir_bundle(Path(sample_fhir_file))
assert profile.patient_id != ""
# Step 2: Generate PDF
pdf_path = tmp_path / "test_roundtrip.pdf"
generate_clinical_letter(asdict(profile), str(pdf_path))
assert pdf_path.exists()
assert pdf_path.stat().st_size > 1000 # Reasonable PDF size
def test_noisy_pdf_pipeline(self, sample_profile, tmp_path):
"""Profile โ†’ Noisy PDF should inject noise and produce valid PDF."""
from data.templates.clinical_letter import generate_clinical_letter
from data.noise.noise_injector import NoiseInjector
injector = NoiseInjector(noise_level="moderate", seed=42)
        # Inject text noise into profile fields for PDF rendering
        import copy
        profile = copy.deepcopy(sample_profile)  # a shallow copy would mutate the shared fixture
        dx_text = profile["diagnosis"]["primary"]
        noisy_dx, records = injector.inject_text_noise(dx_text)
        profile["diagnosis"]["primary"] = noisy_dx
pdf_path = tmp_path / "noisy.pdf"
generate_clinical_letter(profile, str(pdf_path))
assert pdf_path.exists()
def test_trec_evaluation_pipeline(self, tmp_path):
"""Complete TREC evaluation from dicts should produce metrics."""
import ir_measures
from ir_measures import nDCG, Recall, P
qrels = [
ir_measures.Qrel("1", "NCT001", 2),
ir_measures.Qrel("1", "NCT002", 1),
ir_measures.Qrel("1", "NCT003", 0),
]
run = [
ir_measures.ScoredDoc("1", "NCT001", 0.9),
ir_measures.ScoredDoc("1", "NCT002", 0.5),
ir_measures.ScoredDoc("1", "NCT003", 0.1),
]
result = ir_measures.calc_aggregate(
[nDCG@10, Recall@50, P@10], qrels, run
)
assert nDCG@10 in result
assert Recall@50 in result
assert result[nDCG@10] > 0
def test_latency_tracker_integration(self):
"""Latency tracker should record and summarize calls."""
import time
from evaluation.latency_cost_tracker import LatencyCostTracker
tracker = LatencyCostTracker()
tracker.start_session("test-patient")
with tracker.track_call("gemini", "search_anchors") as record:
time.sleep(0.01) # Simulate API call
record.input_tokens = 500
record.output_tokens = 200
session = tracker.end_session()
assert session.total_latency_ms > 0
assert len(session.api_calls) == 1
summary = tracker.summary()
assert summary["n_sessions"] == 1
assert summary["latency"]["mean_s"] > 0
```
---
## 8. Appendix
### 8.1 Data Format Specifications
#### PatientProfile v1 JSON Schema
```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"required": ["patient_id", "demographics", "diagnosis"],
"properties": {
"patient_id": {"type": "string"},
"demographics": {
"type": "object",
"properties": {
"name": {"type": "string"},
"sex": {"type": "string", "enum": ["male", "female"]},
"date_of_birth": {"type": "string", "format": "date"},
"age": {"type": "integer"},
"state": {"type": "string"}
}
},
"diagnosis": {
"type": "object",
"properties": {
"primary": {"type": "string"},
"stage": {"type": ["string", "null"]},
"histology": {"type": ["string", "null"]},
"diagnosis_date": {"type": "string", "format": "date"}
}
},
"biomarkers": {
"type": "object",
"properties": {
"egfr": {"type": ["string", "null"]},
"alk": {"type": ["string", "null"]},
"pdl1_tps": {"type": ["string", "null"]},
"kras": {"type": ["string", "null"]},
"ros1": {"type": ["string", "null"]}
}
},
"labs": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"value": {"type": "number"},
"unit": {"type": "string"},
"date": {"type": "string"},
"loinc_code": {"type": "string"}
}
}
},
"treatments": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"type": {"type": "string", "enum": ["medication", "procedure", "radiation"]},
"start_date": {"type": "string"},
"end_date": {"type": ["string", "null"]}
}
}
},
"unknowns": {"type": "array", "items": {"type": "string"}},
"evidence_spans": {"type": "array"}
}
}
```
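To make the schema concrete, here is a minimal profile that satisfies the `required` list, plus a lightweight top-level field check. The `check_required` helper is purely illustrative and not part of the pipeline; in practice `jsonschema.validate()` against the schema above does this job.

```python
# Minimal PatientProfile instance and a lightweight required-field check.
# All field values below are illustrative sample data.

SCHEMA_REQUIRED = ["patient_id", "demographics", "diagnosis"]

profile = {
    "patient_id": "synthea-0001",
    "demographics": {"sex": "female", "age": 62, "state": "Massachusetts"},
    "diagnosis": {
        "primary": "Non-small cell lung cancer",
        "stage": "IIIA",
        "histology": "adenocarcinoma",
        "diagnosis_date": "2023-04-12",
    },
    "biomarkers": {"egfr": "L858R positive", "alk": None},
    "unknowns": ["ros1"],
}

def check_required(doc: dict, required: list) -> list:
    """Return the required top-level keys missing from doc."""
    return [key for key in required if key not in doc]

missing = check_required(profile, SCHEMA_REQUIRED)
print(missing)  # [] -> the profile satisfies the schema's required list
```

An empty result means only that the required keys are present; full type and enum validation still needs a schema validator.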
### 8.2 Tool API Reference
#### ir_datasets
| API | Description | Return type |
|-----|-------------|-------------|
| `ir_datasets.load("clinicaltrials/2021/trec-ct-2021")` | Load the TREC CT 2021 dataset | Dataset |
| `dataset.queries_iter()` | Iterate over topics | GenericQuery(query_id, text) |
| `dataset.qrels_iter()` | Iterate over qrels | TrecQrel(query_id, doc_id, relevance, iteration) |
| `dataset.docs_iter()` | Iterate over documents | ClinicalTrialsDoc(doc_id, title, condition, summary, detailed_description, eligibility) |
**Dataset IDs:**
- `clinicaltrials/2021/trec-ct-2021` — 75 queries, 35,832 qrels
- `clinicaltrials/2021/trec-ct-2022` — 50 queries
- `clinicaltrials/2021` — ~376K documents (base corpus)
#### ir-measures
| API | Description |
|-----|-------------|
| `ir_measures.calc_aggregate(measures, qrels, run)` | Compute aggregate metrics |
| `ir_measures.iter_calc(measures, qrels, run)` | Iterate per-query metrics |
| `ir_measures.read_trec_qrels(path)` | Read a TREC qrels file |
| `ir_measures.read_trec_run(path)` | Read a TREC run file |
| `ir_measures.Qrel(qid, did, rel)` | Create a qrel record |
| `ir_measures.ScoredDoc(qid, did, score)` | Create a scored-document record |
**Measure objects:**
- `nDCG@10` — Normalized DCG at cutoff 10
- `Recall@50` — Recall at cutoff 50
- `P@10` — Precision at cutoff 10
- `AP` — Average Precision
- `AP(rel=2)` — AP counting only documents with relevance >= 2 as relevant
- `RR` — Reciprocal Rank
#### scikit-learn Evaluation
| API | Description |
|-----|-------------|
| `f1_score(y_true, y_pred, average=None)` | Per-class F1 |
| `f1_score(y_true, y_pred, average='micro')` | Global micro-averaged F1 |
| `f1_score(y_true, y_pred, average='macro')` | Unweighted mean of per-class F1 |
| `precision_score(y_true, y_pred)` | Precision |
| `recall_score(y_true, y_pred)` | Recall |
| `classification_report(y_true, y_pred)` | Full classification report |
| `confusion_matrix(y_true, y_pred)` | Confusion matrix |
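The choice between `average='micro'` and `average='macro'` matters when field frequencies are skewed: micro pools all true/false positives globally, while macro weights every class equally. A stdlib sketch that follows scikit-learn's averaging definitions on a tiny 3-class example (illustrative; the pipeline itself should call `f1_score` directly):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class, micro, and macro F1 following scikit-learn's definitions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    def f1(c):
        denom = 2 * tp[c] + fp[c] + fn[c]
        return 2 * tp[c] / denom if denom else 0.0
    per_class = {c: f1(c) for c in classes}
    micro_denom = 2 * sum(tp.values()) + sum(fp.values()) + sum(fn.values())
    micro = 2 * sum(tp.values()) / micro_denom if micro_denom else 0.0
    macro = sum(per_class.values()) / len(classes)
    return per_class, micro, macro

per_class, micro, macro = f1_scores([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print(round(micro, 4), round(macro, 4))  # 0.8 0.8222
```

Micro F1 here is dominated by the one global error rate, while macro F1 reflects that class 0 is only half-recalled.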
#### Synthea CLI
| Flag | Description | Example |
|------|-------------|---------|
| `-p N` | Generate N patients | `-p 500` |
| `-s SEED` | Random seed | `-s 42` |
| `-m MODULE` | Restrict generation to a disease module | `-m lung_cancer` |
| `STATE` | Target US state (positional argument) | `Massachusetts` |
| `--exporter.fhir.export` | Enable FHIR R4 export | `=true` |
| `--exporter.pretty_print` | Pretty-print JSON output | `=true` |
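Putting the flags together, a hedged sketch of assembling the Synthea invocation as a `subprocess`-style argument list. The `./run_synthea` launcher path is an assumption (adjust to your Synthea checkout), and only the flags from the table above are used:

```python
import shlex

def synthea_command(n_patients, seed, module, state):
    """Assemble a Synthea CLI argument list from the flags in the table above."""
    return [
        "./run_synthea",        # assumed launcher path in a Synthea checkout
        "-p", str(n_patients),  # number of patients
        "-s", str(seed),        # random seed for reproducibility
        "-m", module,           # disease module
        "--exporter.fhir.export=true",
        "--exporter.pretty_print=true",
        state,                  # positional state argument comes last
    ]

cmd = synthea_command(500, 42, "lung_cancer", "Massachusetts")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the Synthea directory would then produce the FHIR bundles consumed by the profile builder.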
#### ReportLab Core API
| Component | Description |
|-----------|-------------|
| `SimpleDocTemplate(path, pagesize=letter)` | Create a document template |
| `Paragraph(text, style)` | Paragraph flowable |
| `Table(data, colWidths)` | Table flowable |
| `TableStyle(commands)` | Table styling |
| `Spacer(width, height)` | Spacing flowable |
| `getSampleStyleSheet()` | Get the default stylesheet |
#### Augraphy Degradation Pipeline
| Component | Description |
|-----------|-------------|
| `AugraphyPipeline(ink_phase, paper_phase, post_phase)` | Full degradation pipeline |
| `InkBleed(p=0.5)` | Ink-bleed effect |
| `Letterpress(p=0.3)` | Letterpress effect |
| `LowInkPeriodicLines(p=0.3)` | Low-ink periodic lines |
| `DirtyDrum(p=0.3)` | Dirty-drum effect |
| `SubtleNoise(p=0.5)` | Subtle noise |
| `Jpeg(p=0.5)` | JPEG compression artifacts |
| `Brightness(p=0.5)` | Brightness variation |
### 8.3 Python Dependency List
```
# requirements-data-eval.txt
ir-datasets>=0.5.6
ir-measures>=0.3.1
reportlab>=4.0
augraphy>=8.0
Pillow>=10.0
pdfplumber>=0.10
scikit-learn>=1.3
numpy>=1.24
pandas>=2.0
pdf2image>=1.16
```
### 8.4 Success Metrics Quick Reference
| Metric | Target | Evaluation tool | Data source |
|--------|--------|-----------------|-------------|
| MedGemma Extraction F1 | >= 0.85 | scikit-learn `f1_score` | Synthetic patients + ground truth |
| Trial Retrieval Recall@50 | >= 0.75 | ir-measures `Recall@50` | TREC CT 2021/2022 |
| Trial Ranking NDCG@10 | >= 0.60 | ir-measures `nDCG@10` | TREC CT 2021/2022 |
| Criterion Decision Accuracy | >= 0.85 | Custom accuracy | Annotated EligibilityLedger |
| Latency | < 15 s/session | `LatencyCostTracker` | API call timing |
| Cost | < $0.50/session | `LatencyCostTracker` | Token counting |