cds-agent / docs /test_results.md
bshepp
docs: full documentation vs reality audit
5d53fbf

Test Results β€” CDS Agent

Last updated after 50-case MedQA validation with MedGemma 27B via HuggingFace Dedicated Endpoint.


1. RAG Retrieval Quality Test

Test file: src/backend/test_rag_quality.py
What it tests: Whether the RAG system retrieves the correct clinical guideline for a given clinical query.
Methodology: 30 clinical queries, each with an expected guideline ID. For each query, the test retrieves the top-5 guidelines from ChromaDB and checks whether the expected guideline appears in the results, and whether it scores above the relevance threshold (0.4).

Summary

Metric Value
Total queries 30
Passed 30
Failed 0
Pass rate 100%
Avg relevance score 0.639
Min relevance score 0.519
Max relevance score 0.765
Top-1 accuracy 100% (correct guideline ranked #1 for all 30 queries)

Results by Specialty

Specialty Queries Passed Pass Rate Avg Relevance
Cardiology 4 4 100% 0.65
Emergency Medicine 5 5 100% 0.62
Endocrinology 3 3 100% 0.64
Pulmonology 2 2 100% 0.63
Neurology 2 2 100% 0.66
Gastroenterology 2 2 100% 0.61
Infectious Disease 2 2 100% 0.67
Psychiatry 2 2 100% 0.64
Pediatrics 2 2 100% 0.63
Nephrology 2 2 100% 0.65
Hematology 1 1 100% 0.62
Rheumatology 1 1 100% 0.64
OB/GYN 1 1 100% 0.66
Other 1 1 100% 0.61

How to Reproduce

cd src/backend
python test_rag_quality.py --rebuild --verbose

Flags:

  • --rebuild β€” Rebuild ChromaDB from clinical_guidelines.json before testing
  • --verbose β€” Print each query, expected ID, actual top result, and relevance score
  • --stats β€” Print summary statistics only
  • --query "chest pain" β€” Test a single ad-hoc query

2. End-to-End Pipeline Test

Test file: src/backend/test_e2e.py
What it tests: Full 6-step agent pipeline from free-text input to synthesized CDS report.
Test case: 62-year-old male with crushing substernal chest pain, diaphoresis, nausea, HTN history, on lisinopril + metformin + atorvastatin.

Pipeline Step Results

Step Status Duration Key Findings
1. Parse Patient Data PASSED 7.8 s Correctly extracted: age 62, male, chest pain chief complaint, 3 medications, HTN/DM history
2. Clinical Reasoning PASSED 21.2 s Top differential: Acute Coronary Syndrome (ACS). Also considered: GERD, PE, aortic dissection
3. Drug Interaction Check PASSED 11.3 s Queried OpenFDA + RxNorm for lisinopril, metformin, atorvastatin interactions
4. Guideline Retrieval PASSED 9.6 s Retrieved ACC/AHA chest pain / ACS guidelines from RAG corpus
5. Conflict Detection PASSED β€” Compares guidelines against patient data for omissions, contradictions, dosage, monitoring gaps
6. Synthesis PASSED 25.3 s Generated comprehensive CDS report with differential, warnings, conflicts, guideline recommendations

Total pipeline time: 75.2 s

How to Reproduce

# Start the backend first
cd src/backend
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# In another terminal
cd src/backend
python test_e2e.py

3. Clinical Test Suite

Test file: src/backend/test_clinical_cases.py
What it tests: 22 diverse clinical scenarios across 14 medical specialties.
Methodology: Each case has a clinical vignette, expected keywords in the CDS report output, and specialty classification. The test submits each case through the full pipeline and validates that expected terms appear in the report.

Test Cases

ID Specialty Scenario Key Validation Keywords
cardio_acs Cardiology 62M crushing chest pain ACS, troponin, ECG
cardio_afib Cardiology 72F palpitations, irregular pulse Atrial fibrillation, anticoagulation, CHA2DS2-VASc
cardio_hf Cardiology 68M progressive dyspnea, edema Heart failure, BNP, diuretic
neuro_stroke Neurology 75M sudden left-sided weakness Stroke, CT, tPA, NIH Stroke Scale
em_sepsis Emergency Medicine 45F fever, tachycardia, hypotension Sepsis, lactate, blood cultures, fluids
em_anaphylaxis Emergency Medicine 28F bee sting, urticaria, wheezing Anaphylaxis, epinephrine, airway
em_polytrauma Emergency Medicine 35M MVC, multiple injuries Trauma, ATLS, FAST, C-spine
endo_dka Endocrinology 22F T1DM, vomiting, Kussmaul breathing DKA, insulin, potassium, anion gap
endo_thyroid_storm Endocrinology 40F graves, fever, tachycardia, AMS Thyroid storm, PTU, beta-blocker
endo_adrenal Endocrinology 55M weakness, hypotension, hyperpigmentation Adrenal insufficiency, cortisol, hydrocortisone
pulm_pe Pulmonology 50F post-surgical, sudden dyspnea Pulmonary embolism, CT angiography, anticoagulation
pulm_asthma Pulmonology 19M severe wheezing, accessory muscles Status asthmaticus, albuterol, steroids
gi_bleed Gastroenterology 60M hematemesis, melena, cirrhosis history Upper GI bleed, endoscopy, PPI, variceal
gi_pancreatitis Gastroenterology 48F epigastric pain, lipase elevated Pancreatitis, NPO, IV fluids, imaging
neuro_seizure Neurology 30F witnessed generalized seizure Status epilepticus, benzodiazepine, EEG
id_meningitis Infectious Disease 20M fever, neck stiffness, photophobia Meningitis, lumbar puncture, empiric antibiotics
psych_suicidal Psychiatry 35M suicidal ideation, plan, access Suicide risk, safety assessment, hospitalization
peds_fever Pediatrics 3-week-old neonate, fever 38.5Β°C Neonatal fever, sepsis workup, admit
peds_dehydration Pediatrics 2-year-old, 5 days diarrhea/vomiting Dehydration, ORS, electrolytes
nephro_hyperkalemia Nephrology 70M CKD, K+ 7.2, ECG changes Hyperkalemia, calcium gluconate, insulin/glucose, dialysis
tox_acetaminophen Emergency Medicine 23F intentional APAP overdose Acetaminophen, NAC, liver, Rumack-Matthew
geri_polypharmacy Geriatrics 82F on 12 medications, recurrent falls Polypharmacy, fall risk, medication reconciliation, Beers criteria

How to Reproduce

cd src/backend

# List all available cases
python test_clinical_cases.py --list

# Run a single case
python test_clinical_cases.py --case em_sepsis

# Run all cases in a specialty
python test_clinical_cases.py --specialty Cardiology

# Run all 22 cases
python test_clinical_cases.py

# Run all and save report to JSON
python test_clinical_cases.py --report results.json

# Quiet mode (summary only)
python test_clinical_cases.py --quiet

4. RAG Corpus Statistics

Metric Value
Total guidelines 62
Specialties covered 14
Guidelines stored in ChromaDB 62
Embedding model all-MiniLM-L6-v2 (384 dimensions)
Embedding time (full rebuild) ~5 s
ChromaDB persist directory ./data/chroma
Source file app/data/clinical_guidelines.json

Guidelines per Specialty

Specialty Count
Emergency Medicine 10
Cardiology 8
Endocrinology 7
Gastroenterology 5
Infectious Disease 5
Pulmonology 4
Neurology 4
Psychiatry 4
Pediatrics 4
Nephrology 2
Hematology 2
Rheumatology 2
OB/GYN 2
Preventive / Perioperative / Dermatology 3

5. Test Infrastructure

File Lines Purpose
test_e2e.py ~60 Submit chest pain case, poll for completion, validate all 6 steps
test_clinical_cases.py ~400 22 clinical cases with keyword validation, CLI flags for filtering
test_rag_quality.py ~350 30 RAG retrieval queries with expected guideline IDs, relevance scoring
test_poll.py ~30 Utility: poll a case ID until completion

Dependencies for Testing

Tests use only the standard library + httpx (for REST calls) and the backend's own modules (for RAG tests). No additional test frameworks required beyond what's in requirements.txt.


6. External Dataset Validation

Test files: src/backend/validation/ (package)
What it tests: Full pipeline diagnostic accuracy and parse quality against real-world clinical datasets.
Methodology: Each harness fetches a public dataset, converts cases into patient narratives, runs them through the Orchestrator directly (no HTTP server), and scores the output against known ground truth.

Datasets

Dataset Source Cases Available Metrics
MedQA (USMLE) HuggingFace (GBaker/MedQA-USMLE-4-options) 1,273 test cases top-1, top-3, mentioned diagnostic accuracy
MTSamples GitHub (socd06/medical-nlp) ~5,000 transcription notes parse success, field completeness, specialty alignment
PMC Case Reports PubMed E-utilities (esearch + efetch) Dynamic (curated queries) diagnostic accuracy vs published diagnosis

Initial Results (Smoke Test β€” 3 MedQA Cases)

Metric Value
Cases run 3
Parse success 100% (3/3)
Top-1 diagnostic accuracy 66.7% (2/3)
Top-3 diagnostic accuracy 66.7% (2/3)
Avg pipeline time ~94 s per case

50-Case MedQA Validation (MedGemma 27B Text IT via HF Endpoint)

Run with: python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2

Metric Value
Cases run 50
Pipeline success 94% (47/50)
Top-1 diagnostic accuracy 36%
Top-3 diagnostic accuracy 38%
Differential accuracy 10%
Mentioned in report 38%
Avg pipeline time 204 s per case
Total run time ~60 min

Breakdown by question type (50 cases):

Type Count Mentioned Differential
Diagnostic 36 14 (39%) 5 (14%)
Treatment 6 β€” β€”
Pathophysiology 6 β€” β€”
Statistics 1 β€” β€”
Anatomy 1 β€” β€”

Notes: MedQA questions include many non-diagnostic question types (treatment selection, mechanism of action, etc.) which the CDS pipeline is not designed to answer. On diagnostic-only questions, the pipeline mentioned the correct diagnosis 39% of the time. Pipeline failures (3/50) were due to HF endpoint scale-to-zero mid-run.

Full validation was run on Feb 15, 2026 using the medgemma-27b-cds HuggingFace Dedicated Endpoint (1Γ— A100 80 GB, bfloat16). Incremental checkpoints saved to validation/results/medqa_checkpoint.jsonl with --resume support.

How to Reproduce

cd src/backend

# Fetch datasets only (no pipeline runs)
python -m validation.run_validation --fetch-only

# Run MedQA validation (N cases)
python -m validation.run_validation --medqa --max-cases 10

# Run MTSamples validation
python -m validation.run_validation --mtsamples --max-cases 10

# Run PMC Case Reports validation
python -m validation.run_validation --pmc --max-cases 5

# Run all 3 datasets
python -m validation.run_validation --all --max-cases 10

# Additional flags:
#   --seed 42          Reproducible random sampling
#   --delay 2          Seconds between cases (rate limiting)
#   --no-drugs         Skip drug interaction step
#   --no-guidelines    Skip guideline retrieval step

Results are saved to validation/results/ as timestamped JSON files.