Test Results – CDS Agent
Last updated after the 50-case MedQA validation run with MedGemma 27B via a HuggingFace Dedicated Endpoint.
1. RAG Retrieval Quality Test
Test file: `src/backend/test_rag_quality.py`
What it tests: Whether the RAG system retrieves the correct clinical guideline for a given clinical query.
Methodology: 30 clinical queries, each with an expected guideline ID. For each query, the test retrieves the top-5 guidelines from ChromaDB and checks whether the expected guideline appears in the results, and whether it scores above the relevance threshold (0.4).
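The per-query check can be sketched as follows. This is a minimal illustration, not the harness's actual code: it assumes a ChromaDB collection queried with `query_texts`, and the conversion from cosine distance to a relevance score is an assumption.

```python
# Sketch of the per-query retrieval check. Function and field names are
# illustrative; only the ChromaDB query() call mirrors the real API.
RELEVANCE_THRESHOLD = 0.4
TOP_K = 5

def check_query(collection, query: str, expected_id: str) -> dict:
    """Retrieve the top-K guidelines and verify the expected one appears
    above the relevance threshold."""
    res = collection.query(query_texts=[query], n_results=TOP_K)
    ids = res["ids"][0]
    # ChromaDB returns distances; treat (1 - distance) as a relevance score.
    scores = [1.0 - d for d in res["distances"][0]]
    hit = expected_id in ids
    score = scores[ids.index(expected_id)] if hit else None
    return {
        "passed": hit and score >= RELEVANCE_THRESHOLD,
        "rank": ids.index(expected_id) + 1 if hit else None,
        "score": score,
    }
```

A query passes only if the expected guideline is both retrieved in the top 5 and scores above 0.4; top-1 accuracy additionally requires `rank == 1`.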
Summary
| Metric | Value |
|---|---|
| Total queries | 30 |
| Passed | 30 |
| Failed | 0 |
| Pass rate | 100% |
| Avg relevance score | 0.639 |
| Min relevance score | 0.519 |
| Max relevance score | 0.765 |
| Top-1 accuracy | 100% (correct guideline ranked #1 for all 30 queries) |
Results by Specialty
| Specialty | Queries | Passed | Pass Rate | Avg Relevance |
|---|---|---|---|---|
| Cardiology | 4 | 4 | 100% | 0.65 |
| Emergency Medicine | 5 | 5 | 100% | 0.62 |
| Endocrinology | 3 | 3 | 100% | 0.64 |
| Pulmonology | 2 | 2 | 100% | 0.63 |
| Neurology | 2 | 2 | 100% | 0.66 |
| Gastroenterology | 2 | 2 | 100% | 0.61 |
| Infectious Disease | 2 | 2 | 100% | 0.67 |
| Psychiatry | 2 | 2 | 100% | 0.64 |
| Pediatrics | 2 | 2 | 100% | 0.63 |
| Nephrology | 2 | 2 | 100% | 0.65 |
| Hematology | 1 | 1 | 100% | 0.62 |
| Rheumatology | 1 | 1 | 100% | 0.64 |
| OB/GYN | 1 | 1 | 100% | 0.66 |
| Other | 1 | 1 | 100% | 0.61 |
How to Reproduce
```shell
cd src/backend
python test_rag_quality.py --rebuild --verbose
```
Flags:
- `--rebuild` – Rebuild ChromaDB from `clinical_guidelines.json` before testing
- `--verbose` – Print each query, expected ID, actual top result, and relevance score
- `--stats` – Print summary statistics only
- `--query "chest pain"` – Test a single ad-hoc query
2. End-to-End Pipeline Test
Test file: `src/backend/test_e2e.py`
What it tests: Full 6-step agent pipeline from free-text input to synthesized CDS report.
Test case: 62-year-old male with crushing substernal chest pain, diaphoresis, and nausea; history of HTN and DM; on lisinopril + metformin + atorvastatin.
Pipeline Step Results
| Step | Status | Duration | Key Findings |
|---|---|---|---|
| 1. Parse Patient Data | PASSED | 7.8 s | Correctly extracted: age 62, male, chest pain chief complaint, 3 medications, HTN/DM history |
| 2. Clinical Reasoning | PASSED | 21.2 s | Top differential: Acute Coronary Syndrome (ACS). Also considered: GERD, PE, aortic dissection |
| 3. Drug Interaction Check | PASSED | 11.3 s | Queried OpenFDA + RxNorm for lisinopril, metformin, atorvastatin interactions |
| 4. Guideline Retrieval | PASSED | 9.6 s | Retrieved ACC/AHA chest pain / ACS guidelines from RAG corpus |
| 5. Conflict Detection | PASSED | – | Compared guidelines against patient data, flagging omissions, contradictions, dosage issues, and monitoring gaps |
| 6. Synthesis | PASSED | 25.3 s | Generated comprehensive CDS report with differential, warnings, conflicts, guideline recommendations |
Total pipeline time: 75.2 s
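The final validation step can be sketched as a check that all six pipeline steps report success. The step names and result shape below are illustrative assumptions, not the harness's actual schema:

```python
# Hypothetical shape of the step validation in test_e2e.py: every one
# of the six pipeline steps must report "passed".
EXPECTED_STEPS = [
    "parse_patient_data", "clinical_reasoning", "drug_interaction_check",
    "guideline_retrieval", "conflict_detection", "synthesis",
]

def validate_steps(result: dict) -> list[str]:
    """Return the names of steps that are missing or did not pass."""
    steps = {s["name"]: s for s in result.get("steps", [])}
    return [name for name in EXPECTED_STEPS
            if steps.get(name, {}).get("status") != "passed"]
```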
How to Reproduce
```shell
# Start the backend first
cd src/backend
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# In another terminal
cd src/backend
python test_e2e.py
```
3. Clinical Test Suite
Test file: `src/backend/test_clinical_cases.py`
What it tests: 22 diverse clinical scenarios across 14 medical specialties.
Methodology: Each case has a clinical vignette, expected keywords in the CDS report output, and specialty classification. The test submits each case through the full pipeline and validates that expected terms appear in the report.
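The keyword validation can be sketched as a case-insensitive substring check; exact matching semantics are an assumption about the harness:

```python
# Minimal sketch of the per-case keyword check run against the
# generated CDS report. Case-insensitive substring matching is an
# assumption, not necessarily the harness's exact logic.
def validate_keywords(report: str, expected: list[str]) -> dict:
    """Check which expected clinical terms appear in the CDS report."""
    text = report.lower()
    missing = [kw for kw in expected if kw.lower() not in text]
    return {"passed": not missing, "missing": missing}
```

A case passes only when every expected term appears somewhere in the report.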
Test Cases
| ID | Specialty | Scenario | Key Validation Keywords |
|---|---|---|---|
| `cardio_acs` | Cardiology | 62M crushing chest pain | ACS, troponin, ECG |
| `cardio_afib` | Cardiology | 72F palpitations, irregular pulse | Atrial fibrillation, anticoagulation, CHA2DS2-VASc |
| `cardio_hf` | Cardiology | 68M progressive dyspnea, edema | Heart failure, BNP, diuretic |
| `neuro_stroke` | Neurology | 75M sudden left-sided weakness | Stroke, CT, tPA, NIH Stroke Scale |
| `em_sepsis` | Emergency Medicine | 45F fever, tachycardia, hypotension | Sepsis, lactate, blood cultures, fluids |
| `em_anaphylaxis` | Emergency Medicine | 28F bee sting, urticaria, wheezing | Anaphylaxis, epinephrine, airway |
| `em_polytrauma` | Emergency Medicine | 35M MVC, multiple injuries | Trauma, ATLS, FAST, C-spine |
| `endo_dka` | Endocrinology | 22F T1DM, vomiting, Kussmaul breathing | DKA, insulin, potassium, anion gap |
| `endo_thyroid_storm` | Endocrinology | 40F Graves disease, fever, tachycardia, AMS | Thyroid storm, PTU, beta-blocker |
| `endo_adrenal` | Endocrinology | 55M weakness, hypotension, hyperpigmentation | Adrenal insufficiency, cortisol, hydrocortisone |
| `pulm_pe` | Pulmonology | 50F post-surgical, sudden dyspnea | Pulmonary embolism, CT angiography, anticoagulation |
| `pulm_asthma` | Pulmonology | 19M severe wheezing, accessory muscle use | Status asthmaticus, albuterol, steroids |
| `gi_bleed` | Gastroenterology | 60M hematemesis, melena, cirrhosis history | Upper GI bleed, endoscopy, PPI, variceal |
| `gi_pancreatitis` | Gastroenterology | 48F epigastric pain, elevated lipase | Pancreatitis, NPO, IV fluids, imaging |
| `neuro_seizure` | Neurology | 30F witnessed generalized seizure | Status epilepticus, benzodiazepine, EEG |
| `id_meningitis` | Infectious Disease | 20M fever, neck stiffness, photophobia | Meningitis, lumbar puncture, empiric antibiotics |
| `psych_suicidal` | Psychiatry | 35M suicidal ideation, plan, access | Suicide risk, safety assessment, hospitalization |
| `peds_fever` | Pediatrics | 3-week-old neonate, fever 38.5 °C | Neonatal fever, sepsis workup, admit |
| `peds_dehydration` | Pediatrics | 2-year-old, 5 days diarrhea/vomiting | Dehydration, ORS, electrolytes |
| `nephro_hyperkalemia` | Nephrology | 70M CKD, K+ 7.2, ECG changes | Hyperkalemia, calcium gluconate, insulin/glucose, dialysis |
| `tox_acetaminophen` | Emergency Medicine | 23F intentional APAP overdose | Acetaminophen, NAC, liver, Rumack-Matthew |
| `geri_polypharmacy` | Geriatrics | 82F on 12 medications, recurrent falls | Polypharmacy, fall risk, medication reconciliation, Beers criteria |
How to Reproduce
```shell
cd src/backend

# List all available cases
python test_clinical_cases.py --list

# Run a single case
python test_clinical_cases.py --case em_sepsis

# Run all cases in a specialty
python test_clinical_cases.py --specialty Cardiology

# Run all 22 cases
python test_clinical_cases.py

# Run all and save report to JSON
python test_clinical_cases.py --report results.json

# Quiet mode (summary only)
python test_clinical_cases.py --quiet
```
4. RAG Corpus Statistics
| Metric | Value |
|---|---|
| Total guidelines | 62 |
| Specialties covered | 14 |
| Guidelines stored in ChromaDB | 62 |
| Embedding model | all-MiniLM-L6-v2 (384 dimensions) |
| Embedding time (full rebuild) | ~5 s |
| ChromaDB persist directory | `./data/chroma` |
| Source file | `app/data/clinical_guidelines.json` |
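The rebuild step's data shaping can be sketched as follows; the `id`/`text`/`specialty` field names are assumptions about the JSON schema, and embedding itself (all-MiniLM-L6-v2, 384-d) is left to the collection's embedding function:

```python
# Sketch of shaping clinical_guidelines.json into the parallel lists
# that ChromaDB's collection.add() expects. Field names are assumptions.
def to_chroma_payload(guidelines: list[dict]) -> dict:
    """Split guideline records into ids / documents / metadatas;
    embeddings are computed by the collection's embedding function
    (all-MiniLM-L6-v2) at add() time."""
    return {
        "ids": [g["id"] for g in guidelines],
        "documents": [g["text"] for g in guidelines],
        "metadatas": [{"specialty": g["specialty"]} for g in guidelines],
    }
```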
Guidelines per Specialty
| Specialty | Count |
|---|---|
| Emergency Medicine | 10 |
| Cardiology | 8 |
| Endocrinology | 7 |
| Gastroenterology | 5 |
| Infectious Disease | 5 |
| Pulmonology | 4 |
| Neurology | 4 |
| Psychiatry | 4 |
| Pediatrics | 4 |
| Nephrology | 2 |
| Hematology | 2 |
| Rheumatology | 2 |
| OB/GYN | 2 |
| Preventive / Perioperative / Dermatology | 3 |
5. Test Infrastructure
| File | Lines | Purpose |
|---|---|---|
| `test_e2e.py` | ~60 | Submit chest pain case, poll for completion, validate all 6 steps |
| `test_clinical_cases.py` | ~400 | 22 clinical cases with keyword validation, CLI flags for filtering |
| `test_rag_quality.py` | ~350 | 30 RAG retrieval queries with expected guideline IDs, relevance scoring |
| `test_poll.py` | ~30 | Utility: poll a case ID until completion |
Dependencies for Testing
Tests use only the standard library plus httpx (for REST calls) and the backend's own modules (for RAG tests). No additional test frameworks are required beyond what's in `requirements.txt`.
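The polling pattern shared by `test_poll.py` and `test_e2e.py` can be sketched like this. The loop is written against any status-fetching callable so it is easy to test; the `state` field name and terminal states are assumptions about the backend's API:

```python
import time

def poll_until_done(fetch_status, interval: float = 2.0, max_polls: int = 120):
    """Poll fetch_status() until the case reaches a terminal state.

    fetch_status is any callable returning a dict with a 'state' key,
    e.g. an httpx GET against the backend's case-status endpoint
    (the exact path and field names are assumptions).
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("state") in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("pipeline did not finish within the polling budget")
```

With the e2e pipeline taking ~75 s, the default 2 s interval and 120-poll budget leave ample headroom.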
6. External Dataset Validation
Test files: `src/backend/validation/` (package)
What it tests: Full pipeline diagnostic accuracy and parse quality against real-world clinical datasets.
Methodology: Each harness fetches a public dataset, converts cases into patient narratives, runs them through the Orchestrator directly (no HTTP server), and scores the output against known ground truth.
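Per-case scoring against ground truth might look roughly like this; case-insensitive substring matching between diagnosis strings is an assumption about the harness:

```python
# Illustrative scoring of one validation case against the known diagnosis.
# Substring matching is a simplification; the real harness may normalize
# diagnosis names differently.
def score_case(differential: list[str], report: str, truth: str) -> dict:
    """Compute top-1 / top-3 / mentioned flags for a single case."""
    def match(text: str) -> bool:
        return truth.lower() in text.lower()
    return {
        "top1": bool(differential) and match(differential[0]),
        "top3": any(match(dx) for dx in differential[:3]),
        "mentioned": match(report),
    }
```

Aggregate accuracies are then just the mean of each flag over all successfully completed cases.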
Datasets
| Dataset | Source | Cases Available | Metrics |
|---|---|---|---|
| MedQA (USMLE) | HuggingFace (`GBaker/MedQA-USMLE-4-options`) | 1,273 test cases | top-1, top-3, mentioned diagnostic accuracy |
| MTSamples | GitHub (`socd06/medical-nlp`) | ~5,000 transcription notes | parse success, field completeness, specialty alignment |
| PMC Case Reports | PubMed E-utilities (esearch + efetch) | Dynamic (curated queries) | diagnostic accuracy vs. published diagnosis |
Initial Results (Smoke Test – 3 MedQA Cases)
| Metric | Value |
|---|---|
| Cases run | 3 |
| Parse success | 100% (3/3) |
| Top-1 diagnostic accuracy | 66.7% (2/3) |
| Top-3 diagnostic accuracy | 66.7% (2/3) |
| Avg pipeline time | ~94 s per case |
50-Case MedQA Validation (MedGemma 27B Text IT via HF Endpoint)
Run with: `python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2`
| Metric | Value |
|---|---|
| Cases run | 50 |
| Pipeline success | 94% (47/50) |
| Top-1 diagnostic accuracy | 36% |
| Top-3 diagnostic accuracy | 38% |
| Differential accuracy | 10% |
| Mentioned in report | 38% |
| Avg pipeline time | 204 s per case |
| Total run time | ~60 min |
Breakdown by question type (50 cases):
| Type | Count | Mentioned | Differential |
|---|---|---|---|
| Diagnostic | 36 | 14 (39%) | 5 (14%) |
| Treatment | 6 | – | – |
| Pathophysiology | 6 | – | – |
| Statistics | 1 | – | – |
| Anatomy | 1 | – | – |
Notes: MedQA questions include many non-diagnostic question types (treatment selection, mechanism of action, etc.) which the CDS pipeline is not designed to answer. On diagnostic-only questions, the pipeline mentioned the correct diagnosis 39% of the time. Pipeline failures (3/50) were due to HF endpoint scale-to-zero mid-run.
Full validation was run on Feb 15, 2026 using the `medgemma-27b-cds` HuggingFace Dedicated Endpoint (1× A100 80 GB, bfloat16). Incremental checkpoints are saved to `validation/results/medqa_checkpoint.jsonl` with `--resume` support.
How to Reproduce
```shell
cd src/backend

# Fetch datasets only (no pipeline runs)
python -m validation.run_validation --fetch-only

# Run MedQA validation (N cases)
python -m validation.run_validation --medqa --max-cases 10

# Run MTSamples validation
python -m validation.run_validation --mtsamples --max-cases 10

# Run PMC Case Reports validation
python -m validation.run_validation --pmc --max-cases 5

# Run all 3 datasets
python -m validation.run_validation --all --max-cases 10

# Additional flags:
#   --seed 42         Reproducible random sampling
#   --delay 2         Seconds between cases (rate limiting)
#   --no-drugs        Skip drug interaction step
#   --no-guidelines   Skip guideline retrieval step
```
Results are saved to `validation/results/` as timestamped JSON files.