# Test Results — CDS Agent
> Last updated after 50-case MedQA validation with MedGemma 27B via HuggingFace Dedicated Endpoint.
---
## 1. RAG Retrieval Quality Test
**Test file:** `src/backend/test_rag_quality.py`
**What it tests:** Whether the RAG system retrieves the correct clinical guideline for a given clinical query.
**Methodology:** 30 clinical queries, each with an expected guideline ID. For each query, the test retrieves the top-5 guidelines from ChromaDB and checks whether the expected guideline appears in the results, and whether it scores above the relevance threshold (0.4).
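The pass/fail rule above can be sketched in a few lines of Python. The helper name and the `(guideline_id, relevance)` result shape are illustrative assumptions, not the actual test code:

```python
def query_passes(expected_id, ranked_results, k=5, threshold=0.4):
    """Pass if the expected guideline appears in the top-k results
    AND its relevance score clears the threshold.

    ranked_results: list of (guideline_id, relevance), best first.
    """
    return any(
        gid == expected_id and relevance >= threshold
        for gid, relevance in ranked_results[:k]
    )

# Hypothetical ChromaDB output for one query:
results = [("acs_chest_pain", 0.72), ("stemi_mgmt", 0.58), ("gerd", 0.41)]
print(query_passes("acs_chest_pain", results))  # True: ranked #1, above 0.4
```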
### Summary
| Metric | Value |
|--------|-------|
| Total queries | 30 |
| Passed | 30 |
| Failed | 0 |
| **Pass rate** | **100%** |
| Avg relevance score | 0.639 |
| Min relevance score | 0.519 |
| Max relevance score | 0.765 |
| Top-1 accuracy | 100% (correct guideline ranked #1 for all 30 queries) |
### Results by Specialty
| Specialty | Queries | Passed | Pass Rate | Avg Relevance |
|-----------|---------|--------|-----------|---------------|
| Cardiology | 4 | 4 | 100% | 0.65 |
| Emergency Medicine | 5 | 5 | 100% | 0.62 |
| Endocrinology | 3 | 3 | 100% | 0.64 |
| Pulmonology | 2 | 2 | 100% | 0.63 |
| Neurology | 2 | 2 | 100% | 0.66 |
| Gastroenterology | 2 | 2 | 100% | 0.61 |
| Infectious Disease | 2 | 2 | 100% | 0.67 |
| Psychiatry | 2 | 2 | 100% | 0.64 |
| Pediatrics | 2 | 2 | 100% | 0.63 |
| Nephrology | 2 | 2 | 100% | 0.65 |
| Hematology | 1 | 1 | 100% | 0.62 |
| Rheumatology | 1 | 1 | 100% | 0.64 |
| OB/GYN | 1 | 1 | 100% | 0.66 |
| Other | 1 | 1 | 100% | 0.61 |
### How to Reproduce
```bash
cd src/backend
python test_rag_quality.py --rebuild --verbose
```
**Flags:**
- `--rebuild` — Rebuild ChromaDB from `clinical_guidelines.json` before testing
- `--verbose` — Print each query, expected ID, actual top result, and relevance score
- `--stats` — Print summary statistics only
- `--query "chest pain"` — Test a single ad-hoc query
---
## 2. End-to-End Pipeline Test
**Test file:** `src/backend/test_e2e.py`
**What it tests:** Full 6-step agent pipeline from free-text input to synthesized CDS report.
**Test case:** 62-year-old male with crushing substernal chest pain, diaphoresis, nausea, HTN history, on lisinopril + metformin + atorvastatin.
### Pipeline Step Results
| Step | Status | Duration | Key Findings |
|------|--------|----------|--------------|
| 1. Parse Patient Data | PASSED | 7.8 s | Correctly extracted: age 62, male, chest pain chief complaint, 3 medications, HTN/DM history |
| 2. Clinical Reasoning | PASSED | 21.2 s | Top differential: Acute Coronary Syndrome (ACS). Also considered: GERD, PE, aortic dissection |
| 3. Drug Interaction Check | PASSED | 11.3 s | Queried OpenFDA + RxNorm for lisinopril, metformin, atorvastatin interactions |
| 4. Guideline Retrieval | PASSED | 9.6 s | Retrieved ACC/AHA chest pain / ACS guidelines from RAG corpus |
| 5. Conflict Detection | PASSED | — | Compared guidelines against patient data for omissions, contradictions, dosage issues, and monitoring gaps |
| 6. Synthesis | PASSED | 25.3 s | Generated comprehensive CDS report with differential, warnings, conflicts, guideline recommendations |
**Total pipeline time:** 75.2 s
### How to Reproduce
```bash
# Start the backend first
cd src/backend
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
# In another terminal
cd src/backend
python test_e2e.py
```
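The harness follows a submit-then-poll pattern against the running backend. A minimal sketch of the polling half is below; the status strings and the idea of wrapping an `httpx.get(...)` call in the callable are assumptions based on the pipeline description, not the actual API:

```python
import time

def poll_until_done(fetch_status, timeout_s=300, interval_s=2.0):
    """Poll a status callable until the case reaches a terminal state.

    fetch_status: zero-arg callable returning a status string, e.g. a
    lambda wrapping an httpx GET on the case-status endpoint.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"case did not finish within {timeout_s}s")

# Simulated backend: two in-progress polls, then completion.
states = iter(["parsing", "reasoning", "completed"])
print(poll_until_done(lambda: next(states), interval_s=0))  # completed
```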
---
## 3. Clinical Test Suite
**Test file:** `src/backend/test_clinical_cases.py`
**What it tests:** 22 diverse clinical scenarios across 11 medical specialties.
**Methodology:** Each case has a clinical vignette, expected keywords in the CDS report output, and specialty classification. The test submits each case through the full pipeline and validates that expected terms appear in the report.
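The keyword validation can be sketched as a case-insensitive substring scan. Requiring *all* expected keywords to appear is an assumption about the actual pass criterion:

```python
def validate_report(report_text, expected_keywords):
    """Return (passed, missing_keywords) for one clinical case."""
    text = report_text.lower()
    missing = [kw for kw in expected_keywords if kw.lower() not in text]
    return len(missing) == 0, missing

report = "Suspect sepsis: draw lactate and blood cultures, start IV fluids."
print(validate_report(report, ["Sepsis", "lactate", "blood cultures", "fluids"]))
# (True, [])
```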
### Test Cases
| ID | Specialty | Scenario | Key Validation Keywords |
|----|-----------|----------|------------------------|
| `cardio_acs` | Cardiology | 62M crushing chest pain | ACS, troponin, ECG |
| `cardio_afib` | Cardiology | 72F palpitations, irregular pulse | Atrial fibrillation, anticoagulation, CHA2DS2-VASc |
| `cardio_hf` | Cardiology | 68M progressive dyspnea, edema | Heart failure, BNP, diuretic |
| `neuro_stroke` | Neurology | 75M sudden left-sided weakness | Stroke, CT, tPA, NIH Stroke Scale |
| `em_sepsis` | Emergency Medicine | 45F fever, tachycardia, hypotension | Sepsis, lactate, blood cultures, fluids |
| `em_anaphylaxis` | Emergency Medicine | 28F bee sting, urticaria, wheezing | Anaphylaxis, epinephrine, airway |
| `em_polytrauma` | Emergency Medicine | 35M MVC, multiple injuries | Trauma, ATLS, FAST, C-spine |
| `endo_dka` | Endocrinology | 22F T1DM, vomiting, Kussmaul breathing | DKA, insulin, potassium, anion gap |
| `endo_thyroid_storm` | Endocrinology | 40F Graves disease, fever, tachycardia, AMS | Thyroid storm, PTU, beta-blocker |
| `endo_adrenal` | Endocrinology | 55M weakness, hypotension, hyperpigmentation | Adrenal insufficiency, cortisol, hydrocortisone |
| `pulm_pe` | Pulmonology | 50F post-surgical, sudden dyspnea | Pulmonary embolism, CT angiography, anticoagulation |
| `pulm_asthma` | Pulmonology | 19M severe wheezing, accessory muscles | Status asthmaticus, albuterol, steroids |
| `gi_bleed` | Gastroenterology | 60M hematemesis, melena, cirrhosis history | Upper GI bleed, endoscopy, PPI, variceal |
| `gi_pancreatitis` | Gastroenterology | 48F epigastric pain, lipase elevated | Pancreatitis, NPO, IV fluids, imaging |
| `neuro_seizure` | Neurology | 30F witnessed generalized seizure | Status epilepticus, benzodiazepine, EEG |
| `id_meningitis` | Infectious Disease | 20M fever, neck stiffness, photophobia | Meningitis, lumbar puncture, empiric antibiotics |
| `psych_suicidal` | Psychiatry | 35M suicidal ideation, plan, access | Suicide risk, safety assessment, hospitalization |
| `peds_fever` | Pediatrics | 3-week-old neonate, fever 38.5°C | Neonatal fever, sepsis workup, admit |
| `peds_dehydration` | Pediatrics | 2-year-old, 5 days diarrhea/vomiting | Dehydration, ORS, electrolytes |
| `nephro_hyperkalemia` | Nephrology | 70M CKD, K+ 7.2, ECG changes | Hyperkalemia, calcium gluconate, insulin/glucose, dialysis |
| `tox_acetaminophen` | Emergency Medicine | 23F intentional APAP overdose | Acetaminophen, NAC, liver, Rumack-Matthew |
| `geri_polypharmacy` | Geriatrics | 82F on 12 medications, recurrent falls | Polypharmacy, fall risk, medication reconciliation, Beers criteria |
### How to Reproduce
```bash
cd src/backend
# List all available cases
python test_clinical_cases.py --list
# Run a single case
python test_clinical_cases.py --case em_sepsis
# Run all cases in a specialty
python test_clinical_cases.py --specialty Cardiology
# Run all 22 cases
python test_clinical_cases.py
# Run all and save report to JSON
python test_clinical_cases.py --report results.json
# Quiet mode (summary only)
python test_clinical_cases.py --quiet
```
---
## 4. RAG Corpus Statistics
| Metric | Value |
|--------|-------|
| Total guidelines | 62 |
| Specialties covered | 14 |
| Guidelines stored in ChromaDB | 62 |
| Embedding model | all-MiniLM-L6-v2 (384 dimensions) |
| Embedding time (full rebuild) | ~5 s |
| ChromaDB persist directory | `./data/chroma` |
| Source file | `app/data/clinical_guidelines.json` |
### Guidelines per Specialty
| Specialty | Count |
|-----------|-------|
| Emergency Medicine | 10 |
| Cardiology | 8 |
| Endocrinology | 7 |
| Gastroenterology | 5 |
| Infectious Disease | 5 |
| Pulmonology | 4 |
| Neurology | 4 |
| Psychiatry | 4 |
| Pediatrics | 4 |
| Nephrology | 2 |
| Hematology | 2 |
| Rheumatology | 2 |
| OB/GYN | 2 |
| Preventive / Perioperative / Dermatology | 3 |
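The per-specialty counts can be regenerated from the source file with a few lines. The JSON shape assumed here (a top-level list of objects with a `specialty` key) is a guess at the `clinical_guidelines.json` schema:

```python
import json
from collections import Counter

def load_guidelines(path="app/data/clinical_guidelines.json"):
    with open(path) as f:
        return json.load(f)

def specialty_counts(guidelines):
    """Tally guidelines per specialty; with the full corpus the
    counts should sum to 62."""
    return Counter(g["specialty"] for g in guidelines)
```

Run from `src/backend` so the relative path resolves.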
---
## 5. Test Infrastructure
| File | Lines | Purpose |
|------|-------|---------|
| `test_e2e.py` | ~60 | Submit chest pain case, poll for completion, validate all 6 steps |
| `test_clinical_cases.py` | ~400 | 22 clinical cases with keyword validation, CLI flags for filtering |
| `test_rag_quality.py` | ~350 | 30 RAG retrieval queries with expected guideline IDs, relevance scoring |
| `test_poll.py` | ~30 | Utility: poll a case ID until completion |
### Dependencies for Testing
Tests use only the Python standard library plus `httpx` (for REST calls) and the backend's own modules (for the RAG tests). No additional test frameworks are required beyond what is already in `requirements.txt`.
---
## 6. External Dataset Validation
**Test files:** `src/backend/validation/` (package)
**What it tests:** Full pipeline diagnostic accuracy and parse quality against real-world clinical datasets.
**Methodology:** Each harness fetches a public dataset, converts cases into patient narratives, runs them through the `Orchestrator` directly (no HTTP server), and scores the output against known ground truth.
### Datasets
| Dataset | Source | Cases Available | Metrics |
|---------|--------|-----------------|--------|
| **MedQA (USMLE)** | HuggingFace (`GBaker/MedQA-USMLE-4-options`) | 1,273 test cases | top-1, top-3, mentioned diagnostic accuracy |
| **MTSamples** | GitHub (`socd06/medical-nlp`) | ~5,000 transcription notes | parse success, field completeness, specialty alignment |
| **PMC Case Reports** | PubMed E-utilities (esearch + efetch) | Dynamic (curated queries) | diagnostic accuracy vs published diagnosis |
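The top-1 / top-3 / mentioned metrics can be sketched as follows. Matching the ground-truth diagnosis by case-insensitive substring is an assumption about how the harness compares diagnoses:

```python
def score_case(truth, differential, report_text):
    """truth: dataset's ground-truth diagnosis;
    differential: the pipeline's ranked diagnoses."""
    t = truth.lower()
    diffs = [d.lower() for d in differential]
    return {
        "top1": bool(diffs) and t in diffs[0],
        "top3": any(t in d for d in diffs[:3]),
        "mentioned": t in report_text.lower(),
    }

print(score_case(
    "pulmonary embolism",
    ["Acute pulmonary embolism", "Pneumonia", "ACS"],
    "CT angiography recommended to rule out pulmonary embolism.",
))  # {'top1': True, 'top3': True, 'mentioned': True}
```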
### Initial Results (Smoke Test β€” 3 MedQA Cases)
| Metric | Value |
|--------|-------|
| Cases run | 3 |
| Parse success | 100% (3/3) |
| Top-1 diagnostic accuracy | 66.7% (2/3) |
| Top-3 diagnostic accuracy | 66.7% (2/3) |
| Avg pipeline time | ~94 s per case |
### 50-Case MedQA Validation (MedGemma 27B Text IT via HF Endpoint)
Run with: `python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2`
| Metric | Value |
|--------|-------|
| Cases run | 50 |
| Pipeline success | 94% (47/50) |
| Top-1 diagnostic accuracy | 36% |
| Top-3 diagnostic accuracy | 38% |
| Differential accuracy | 10% |
| Mentioned in report | 38% |
| Avg pipeline time | 204 s per case |
| Total run time | ~60 min |
**Breakdown by question type (50 cases):**
| Type | Count | Mentioned | Differential |
|------|-------|-----------|-------------|
| Diagnostic | 36 | 14 (39%) | 5 (14%) |
| Treatment | 6 | — | — |
| Pathophysiology | 6 | — | — |
| Statistics | 1 | — | — |
| Anatomy | 1 | — | — |
> **Notes:** MedQA includes many non-diagnostic question types (treatment selection, mechanism of action, etc.) that the CDS pipeline is not designed to answer. On diagnostic-only questions, the pipeline mentioned the correct diagnosis 39% of the time. The 3/50 pipeline failures were caused by the HF endpoint scaling to zero mid-run.
> Full validation was run on Feb 15, 2026 using the `medgemma-27b-cds` HuggingFace Dedicated Endpoint (1Γ— A100 80 GB, bfloat16). Incremental checkpoints saved to `validation/results/medqa_checkpoint.jsonl` with `--resume` support.
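The checkpoint/resume mechanic can be sketched as an append-only JSONL log. The record shape and the `case_id` field name are assumptions about the checkpoint format:

```python
import json

def load_completed_ids(checkpoint_path):
    """Case IDs already processed, so --resume can skip them."""
    done = set()
    try:
        with open(checkpoint_path) as f:
            for line in f:
                if line.strip():
                    done.add(json.loads(line)["case_id"])
    except FileNotFoundError:
        pass  # first run: no checkpoint yet
    return done

def append_checkpoint(checkpoint_path, record):
    """Append one result per line; a crash mid-run loses at most one case."""
    with open(checkpoint_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Appending one line per case (rather than rewriting a single JSON file) keeps every prior result intact if the run is interrupted.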
### How to Reproduce
```bash
cd src/backend
# Fetch datasets only (no pipeline runs)
python -m validation.run_validation --fetch-only
# Run MedQA validation (N cases)
python -m validation.run_validation --medqa --max-cases 10
# Run MTSamples validation
python -m validation.run_validation --mtsamples --max-cases 10
# Run PMC Case Reports validation
python -m validation.run_validation --pmc --max-cases 5
# Run all 3 datasets
python -m validation.run_validation --all --max-cases 10
# Additional flags:
# --seed 42 Reproducible random sampling
# --delay 2 Seconds between cases (rate limiting)
# --no-drugs Skip drug interaction step
# --no-guidelines Skip guideline retrieval step
```
Results are saved to `validation/results/` as timestamped JSON files.