Test Results — CDS Agent

Last updated after the 50-case MedQA validation run with MedGemma 27B on a HuggingFace Dedicated Endpoint.

1. RAG Retrieval Quality Test

Test file: src/backend/test_rag_quality.py
What it tests: Whether the RAG system retrieves the correct clinical guideline for a given clinical query.
Methodology: 30 clinical queries, each paired with an expected guideline ID. For each query, the test retrieves the top 5 guidelines from ChromaDB and checks that the expected guideline appears among them and that its relevance score exceeds the threshold (0.4).
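
The pass criterion described above can be sketched as follows. This is an illustration only, not the code in test_rag_quality.py: the function name `evaluate_query`, the `(guideline_id, relevance)` tuple shape, and the sample guideline IDs are hypothetical. Only the top-5 cutoff and the 0.4 threshold come from the methodology.

```python
from typing import List, Tuple

TOP_K = 5                  # guidelines retrieved per query (from the methodology)
RELEVANCE_THRESHOLD = 0.4  # minimum relevance score (from the methodology)


def evaluate_query(expected_id: str, results: List[Tuple[str, float]]) -> dict:
    """Score one query against ranked (guideline_id, relevance) results.

    `results` is assumed to be sorted by descending relevance.
    """
    top = results[:TOP_K]
    # Relevance of the expected guideline, or None if it was not retrieved.
    score = next((rel for gid, rel in top if gid == expected_id), None)
    return {
        "found": score is not None,
        "above_threshold": score is not None and score > RELEVANCE_THRESHOLD,
        "top1": bool(top) and top[0][0] == expected_id,
        "score": score,
    }


# Hypothetical ranked results for a chest-pain query.
sample = [("acs_guideline", 0.71), ("pe_guideline", 0.48), ("gerd_guideline", 0.33)]
outcome = evaluate_query("acs_guideline", sample)
passed = outcome["found"] and outcome["above_threshold"]
```

A query counts as passed only when both conditions hold. The `top1` flag is tracked separately so that top-1 accuracy can be reported alongside the pass rate.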

Summary

Metric	Value
Total queries	30
Passed	30
Failed	0
Pass rate	100%
Avg relevance score	0.639
Min relevance score	0.519
Max relevance score	0.765
Top-1 accuracy	100% (correct guideline ranked #1 for all 30 queries)

Results by Specialty

Specialty	Queries	Passed	Pass Rate	Avg Relevance
Cardiology	4	4	100%	0.65
Emergency Medicine	5	5	100%	0.62
Endocrinology	3	3	100%	0.64
Pulmonology	2	2	100%	0.63
Neurology	2	2	100%	0.66
Gastroenterology	2	2	100%	0.61
Infectious Disease	2	2	100%	0.67
Psychiatry	2	2	100%	0.64
Pediatrics	2	2	100%	0.63
Nephrology	2	2	100%	0.65
Hematology	1	1	100%	0.62
Rheumatology	1	1	100%	0.64
OB/GYN	1	1	100%	0.66
Other	1	1	100%	0.61

How to Reproduce

cd src/backend
python test_rag_quality.py --rebuild --verbose

Flags:

--rebuild — Rebuild ChromaDB from clinical_guidelines.json before testing
--verbose — Print each query, expected ID, actual top result, and relevance score
--stats — Print summary statistics only
--query "chest pain" — Test a single ad-hoc query

2. End-to-End Pipeline Test

Test file: src/backend/test_e2e.py
What it tests: Full 6-step agent pipeline from free-text input to synthesized CDS report.
Test case: 62-year-old male with crushing substernal chest pain, diaphoresis, nausea, HTN history, on lisinopril + metformin + atorvastatin.

Pipeline Step Results

Step	Status	Duration	Key Findings
1. Parse Patient Data	PASSED	7.8 s	Correctly extracted: age 62, male, chest pain chief complaint, 3 medications, HTN/DM history
2. Clinical Reasoning	PASSED	21.2 s	Top differential: Acute Coronary Syndrome (ACS). Also considered: GERD, PE, aortic dissection
3. Drug Interaction Check	PASSED	11.3 s	Queried OpenFDA + RxNorm for lisinopril, metformin, atorvastatin interactions
4. Guideline Retrieval	PASSED	9.6 s	Retrieved ACC/AHA chest pain / ACS guidelines from RAG corpus
5. Conflict Detection	PASSED	—	Compares guidelines against patient data for omissions, contradictions, dosage, monitoring gaps
6. Synthesis	PASSED	25.3 s	Generated comprehensive CDS report with differential, warnings, conflicts, guideline recommendations

Total pipeline time: 75.2 s
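
The reported total equals the sum of the five timed steps. Step 5 has no recorded duration and is left out. A quick check using the values from the table (the dictionary keys are illustrative labels, not identifiers from the codebase):

```python
# Step durations in seconds, copied from the table above.
# Step 5 (Conflict Detection) has no recorded duration and is omitted.
step_durations = {
    "parse_patient_data": 7.8,
    "clinical_reasoning": 21.2,
    "drug_interaction_check": 11.3,
    "guideline_retrieval": 9.6,
    "synthesis": 25.3,
}

total = round(sum(step_durations.values()), 1)
print(f"Total pipeline time: {total} s")  # Total pipeline time: 75.2 s
```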

How to Reproduce

# Start the backend first
cd src/backend
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# In another terminal
cd src/backend
python test_e2e.py

3. Clinical Test Suite

Test file: src/backend/test_clinical_cases.py
What it tests: 22 diverse clinical scenarios across the 11 medical specialties listed in the table below.
Methodology: Each case defines a clinical vignette, a specialty classification, and a set of keywords expected in the CDS report. The test submits each case through the full pipeline and checks that the expected keywords appear in the report.

Test Cases

ID	Specialty	Scenario	Key Validation Keywords
`cardio_acs`	Cardiology	62M crushing chest pain	ACS, troponin, ECG
`cardio_afib`	Cardiology	72F palpitations, irregular pulse	Atrial fibrillation, anticoagulation, CHA2DS2-VASc
`cardio_hf`	Cardiology	68M progressive dyspnea, edema	Heart failure, BNP, diuretic
`neuro_stroke`	Neurology	75M sudden left-sided weakness	Stroke, CT, tPA, NIH Stroke Scale
`em_sepsis`	Emergency Medicine	45F fever, tachycardia, hypotension	Sepsis, lactate, blood cultures, fluids
`em_anaphylaxis`	Emergency Medicine	28F bee sting, urticaria, wheezing	Anaphylaxis, epinephrine, airway
`em_polytrauma`	Emergency Medicine	35M MVC, multiple injuries	Trauma, ATLS, FAST, C-spine
`endo_dka`	Endocrinology	22F T1DM, vomiting, Kussmaul breathing	DKA, insulin, potassium, anion gap
`endo_thyroid_storm`	Endocrinology	40F Graves' disease, fever, tachycardia, AMS	Thyroid storm, PTU, beta-blocker
`endo_adrenal`	Endocrinology	55M weakness, hypotension, hyperpigmentation	Adrenal insufficiency, cortisol, hydrocortisone
`pulm_pe`	Pulmonology	50F post-surgical, sudden dyspnea	Pulmonary embolism, CT angiography, anticoagulation
`pulm_asthma`	Pulmonology	19M severe wheezing, accessory muscles	Status asthmaticus, albuterol, steroids
`gi_bleed`	Gastroenterology	60M hematemesis, melena, cirrhosis history	Upper GI bleed, endoscopy, PPI, variceal
`gi_pancreatitis`	Gastroenterology	48F epigastric pain, lipase elevated	Pancreatitis, NPO, IV fluids, imaging
`neuro_seizure`	Neurology	30F witnessed generalized seizure	Status epilepticus, benzodiazepine, EEG
`id_meningitis`	Infectious Disease	20M fever, neck stiffness, photophobia	Meningitis, lumbar puncture, empiric antibiotics
`psych_suicidal`	Psychiatry	35M suicidal ideation, plan, access	Suicide risk, safety assessment, hospitalization
`peds_fever`	Pediatrics	3-week-old neonate, fever 38.5°C	Neonatal fever, sepsis workup, admit
`peds_dehydration`	Pediatrics	2-year-old, 5 days diarrhea/vomiting	Dehydration, ORS, electrolytes
`nephro_hyperkalemia`	Nephrology	70M CKD, K+ 7.2, ECG changes	Hyperkalemia, calcium gluconate, insulin/glucose, dialysis
`tox_acetaminophen`	Emergency Medicine	23F intentional APAP overdose	Acetaminophen, NAC, liver, Rumack-Matthew
`geri_polypharmacy`	Geriatrics	82F on 12 medications, recurrent falls	Polypharmacy, fall risk, medication reconciliation, Beers criteria
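
Keyword validation of this kind can be sketched in a few lines. The helper name `missing_keywords`, the case-insensitive substring matching, and the sample report text are assumptions made for illustration. The actual matching rules in test_clinical_cases.py may differ.

```python
from typing import Iterable, List


def missing_keywords(report_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected keywords that do not appear in the report (case-insensitive)."""
    haystack = report_text.lower()
    return [kw for kw in expected if kw.lower() not in haystack]


# Hypothetical report excerpt for the `em_sepsis` case; keywords taken from the table.
report = "Assessment: suspected sepsis. Obtain lactate and blood cultures; start IV fluids."
expected = ["Sepsis", "lactate", "blood cultures", "fluids"]

missing = missing_keywords(report, expected)
case_passed = not missing
```

Returning the list of missing keywords, rather than a bare boolean, makes a failing case easier to diagnose.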

How to Reproduce

cd src/backend

# List all available cases
python test_clinical_cases.py --list

# Run a single case
python test_clinical_cases.py --case em_sepsis

# Run all cases in a specialty
python test_clinical_cases.py --specialty Cardiology

# Run all 22 cases
python test_clinical_cases.py

# Run all and save report to JSON
python test_clinical_cases.py --report results.json

# Quiet mode (summary only)
python test_clinical_cases.py --quiet

4. RAG Corpus Statistics

Metric	Value
Total guidelines	62
Specialties covered	14
Guidelines stored in ChromaDB	62
Embedding model	all-MiniLM-L6-v2 (384 dimensions)
Embedding time (full rebuild)	~5 s
ChromaDB persist directory	`./data/chroma`
Source file	`app/data/clinical_guidelines.json`

Guidelines per Specialty

Specialty	Count
Emergency Medicine	10
Cardiology	8
Endocrinology	7
Gastroenterology	5
Infectious Disease	5
Pulmonology	4
Neurology	4
Psychiatry	4
Pediatrics	4
Nephrology	2
Hematology	2
Rheumatology	2
OB/GYN	2
Preventive / Perioperative / Dermatology	3

5. Test Infrastructure

File	Lines	Purpose
`test_e2e.py`	~60	Submit chest pain case, poll for completion, validate all 6 steps
`test_clinical_cases.py`	~400	22 clinical cases with keyword validation, CLI flags for filtering
`test_rag_quality.py`	~350	30 RAG retrieval queries with expected guideline IDs, relevance scoring
`test_poll.py`	~30	Utility: poll a case ID until completion
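
The poll-until-completion pattern used by these utilities might look like the sketch below. It is not the contents of test_poll.py: the function name, the status strings, and the timeout values are hypothetical, and the HTTP call is replaced by an injected callable so the example runs without a backend.

```python
import time
from typing import Callable, FrozenSet


def poll_until_complete(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    done_states: FrozenSet[str] = frozenset({"completed", "failed"}),
) -> str:
    """Call `fetch_status` until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"case still '{status}' after {timeout_s} s")
        time.sleep(interval_s)


# Stand-in for an HTTP status request: reports "running" twice, then "completed".
_states = iter(["running", "running", "completed"])
final = poll_until_complete(lambda: next(_states), interval_s=0.01, timeout_s=5)
```

In a real test, `fetch_status` would wrap an httpx request against the backend and extract the case status from the response.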

Dependencies for Testing

The tests use only the Python standard library, httpx (for REST calls), and the backend's own modules (for the RAG tests). No test framework is required beyond what is already listed in requirements.txt.

6. External Dataset Validation

Test files: src/backend/validation/ (package)
What it tests: Full pipeline diagnostic accuracy and parse quality against real-world clinical datasets.
Methodology: Each harness fetches a public dataset, converts cases into patient narratives, runs them through the Orchestrator directly (no HTTP server), and scores the output against known ground truth.

Datasets

Dataset	Source	Cases Available	Metrics
MedQA (USMLE)	HuggingFace (`GBaker/MedQA-USMLE-4-options`)	1,273 test cases	top-1, top-3, mentioned diagnostic accuracy
MTSamples	GitHub (`socd06/medical-nlp`)	~5,000 transcription notes	parse success, field completeness, specialty alignment
PMC Case Reports	PubMed E-utilities (esearch + efetch)	Dynamic (curated queries)	diagnostic accuracy vs published diagnosis
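
As a rough illustration of how the MedQA metrics could be computed, the sketch below scores one case by case-insensitive substring matching. The function names, the matching rule, and the sample cases are assumptions, not the scoring logic in the validation package.

```python
from typing import List


def score_case(truth: str, differential: List[str], report_text: str) -> dict:
    """Score one case against its ground-truth diagnosis."""
    target = truth.lower()
    ranked = [d.lower() for d in differential]
    return {
        "top1": bool(ranked) and target in ranked[0],     # correct diagnosis ranked first
        "top3": any(target in d for d in ranked[:3]),      # correct diagnosis in first three
        "mentioned": target in report_text.lower(),        # appears anywhere in the report
    }


def accuracy(scores: List[dict], key: str) -> float:
    """Fraction of cases for which `key` is true."""
    return sum(s[key] for s in scores) / len(scores) if scores else 0.0


# Two hypothetical cases.
scores = [
    score_case(
        "pulmonary embolism",
        ["Pneumonia", "Pulmonary embolism", "Acute coronary syndrome"],
        "Consider pulmonary embolism; order CT angiography.",
    ),
    score_case(
        "appendicitis",
        ["Acute appendicitis"],
        "Findings are most consistent with acute appendicitis.",
    ),
]
top1 = accuracy(scores, "top1")
top3 = accuracy(scores, "top3")
mentioned = accuracy(scores, "mentioned")
```

Substring matching is deliberately simple. It misses synonyms and abbreviations, so a production scorer would likely need fuzzier matching.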

Initial Results (Smoke Test — 3 MedQA Cases)

Metric	Value
Cases run	3
Parse success	100% (3/3)
Top-1 diagnostic accuracy	66.7% (2/3)
Top-3 diagnostic accuracy	66.7% (2/3)
Avg pipeline time	~94 s per case

50-Case MedQA Validation (MedGemma 27B Text IT via HF Endpoint)

Run with: python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2

Metric	Value
Cases run	50
Pipeline success	94% (47/50)
Top-1 diagnostic accuracy	36%
Top-3 diagnostic accuracy	38%
Differential accuracy	10%
Mentioned in report	38%
Avg pipeline time	204 s per case
Total run time	~60 min

Breakdown by question type (50 cases):

Type	Count	Mentioned	Differential
Diagnostic	36	14 (39%)	5 (14%)
Treatment	6	—	—
Pathophysiology	6	—	—
Statistics	1	—	—
Anatomy	1	—	—

Notes: MedQA includes many non-diagnostic question types (treatment selection, mechanism of action, etc.) that the CDS pipeline is not designed to answer. On diagnostic-only questions, the pipeline mentioned the correct diagnosis 39% of the time (14/36). The three pipeline failures (3/50) were caused by the HF endpoint scaling to zero mid-run.
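
The percentages quoted in this section follow directly from the counts in the tables. A short check (the helper `pct` is illustrative):

```python
def pct(numerator: int, denominator: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


pipeline_success = pct(47, 50)         # 94
diagnostic_mentioned = pct(14, 36)     # 39
diagnostic_differential = pct(5, 36)   # 14

# Question-type counts from the breakdown table add up to the 50 cases run.
type_counts = {"Diagnostic": 36, "Treatment": 6, "Pathophysiology": 6, "Statistics": 1, "Anatomy": 1}
total_cases = sum(type_counts.values())  # 50
```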

The full validation was run on Feb 15, 2026 using the medgemma-27b-cds HuggingFace Dedicated Endpoint (1× A100 80 GB, bfloat16). Incremental checkpoints are saved to validation/results/medqa_checkpoint.jsonl, and an interrupted run can be continued with --resume.
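
A JSONL checkpoint with resume support can be sketched as below. This shows the general technique, not the validation package's implementation: the function names, the `case_id` field, and the stand-in pipeline are hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Set


def completed_ids(path: str) -> Set[str]:
    """Read case IDs already recorded in a JSONL checkpoint (empty set if none)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return {json.loads(line)["case_id"] for line in fh if line.strip()}


def run_with_checkpoint(cases: Iterable[dict], run_case: Callable[[dict], dict], path: str) -> int:
    """Run only cases missing from the checkpoint; append each result as one JSON line."""
    done = completed_ids(path)
    ran = 0
    with open(path, "a", encoding="utf-8") as fh:
        for case in cases:
            if case["case_id"] in done:
                continue
            result = run_case(case)
            fh.write(json.dumps({"case_id": case["case_id"], **result}) + "\n")
            fh.flush()  # persist progress so an interrupted run can resume
            ran += 1
    return ran


# Demo with a stand-in pipeline: the second call skips the two cases already saved.
cases = [{"case_id": f"medqa_{i}"} for i in range(3)]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "medqa_checkpoint.jsonl")
    first = run_with_checkpoint(cases[:2], lambda c: {"ok": True}, ckpt)
    second = run_with_checkpoint(cases, lambda c: {"ok": True}, ckpt)
```

Writing one JSON object per line means a run that dies mid-way (for example, when the endpoint scales to zero) loses at most the case in flight.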

How to Reproduce

cd src/backend

# Fetch datasets only (no pipeline runs)
python -m validation.run_validation --fetch-only

# Run MedQA validation (N cases)
python -m validation.run_validation --medqa --max-cases 10

# Run MTSamples validation
python -m validation.run_validation --mtsamples --max-cases 10

# Run PMC Case Reports validation
python -m validation.run_validation --pmc --max-cases 5

# Run all 3 datasets
python -m validation.run_validation --all --max-cases 10

# Additional flags:
#   --seed 42          Reproducible random sampling
#   --delay 2          Seconds between cases (rate limiting)
#   --no-drugs         Skip drug interaction step
#   --no-guidelines    Skip guideline retrieval step

Results are saved to validation/results/ as timestamped JSON files.
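
The `--seed` and `--delay` flags listed above correspond to two common patterns: seeded sampling and a fixed pause between cases. A minimal sketch, assuming a hypothetical `sample_cases` helper and a placeholder case pool (the real harness may sample differently):

```python
import random
import time
from typing import List, Sequence


def sample_cases(cases: Sequence[dict], max_cases: int, seed: int) -> List[dict]:
    """Pick up to `max_cases` cases; the same seed always yields the same subset."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return rng.sample(list(cases), k=min(max_cases, len(cases)))


pool = [{"id": i} for i in range(100)]
run_a = sample_cases(pool, max_cases=10, seed=42)
run_b = sample_cases(pool, max_cases=10, seed=42)

for case in run_a[:2]:
    # ... run the pipeline on `case` here ...
    time.sleep(0.01)  # stands in for --delay; the documented run used 2 seconds
```

Using a dedicated `random.Random` instance keeps the selection reproducible even if other code in the process also draws random numbers.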