# Test Results — CDS Agent

> Last updated after the 50-case MedQA validation with MedGemma 27B via a HuggingFace Dedicated Endpoint.

---

## 1. RAG Retrieval Quality Test

**Test file:** `src/backend/test_rag_quality.py`

**What it tests:** Whether the RAG system retrieves the correct clinical guideline for a given clinical query.

**Methodology:** 30 clinical queries, each with an expected guideline ID. For each query, the test retrieves the top-5 guidelines from ChromaDB and checks whether the expected guideline appears in the results and scores above the relevance threshold (0.4).

### Summary

| Metric | Value |
|--------|-------|
| Total queries | 30 |
| Passed | 30 |
| Failed | 0 |
| **Pass rate** | **100%** |
| Avg relevance score | 0.639 |
| Min relevance score | 0.519 |
| Max relevance score | 0.765 |
| Top-1 accuracy | 100% (correct guideline ranked #1 for all 30 queries) |

### Results by Specialty

| Specialty | Queries | Passed | Pass Rate | Avg Relevance |
|-----------|---------|--------|-----------|---------------|
| Cardiology | 4 | 4 | 100% | 0.65 |
| Emergency Medicine | 5 | 5 | 100% | 0.62 |
| Endocrinology | 3 | 3 | 100% | 0.64 |
| Pulmonology | 2 | 2 | 100% | 0.63 |
| Neurology | 2 | 2 | 100% | 0.66 |
| Gastroenterology | 2 | 2 | 100% | 0.61 |
| Infectious Disease | 2 | 2 | 100% | 0.67 |
| Psychiatry | 2 | 2 | 100% | 0.64 |
| Pediatrics | 2 | 2 | 100% | 0.63 |
| Nephrology | 2 | 2 | 100% | 0.65 |
| Hematology | 1 | 1 | 100% | 0.62 |
| Rheumatology | 1 | 1 | 100% | 0.64 |
| OB/GYN | 1 | 1 | 100% | 0.66 |
| Other | 1 | 1 | 100% | 0.61 |

### How to Reproduce

```bash
cd src/backend
python test_rag_quality.py --rebuild --verbose
```

**Flags:**

- `--rebuild` — Rebuild ChromaDB from `clinical_guidelines.json` before testing
- `--verbose` — Print each query, expected ID, actual top result, and relevance score
- `--stats` — Print summary statistics only
- `--query "chest pain"` — Test a single ad-hoc query

---

## 2. End-to-End Pipeline Test

**Test file:** `src/backend/test_e2e.py`

**What it tests:** The full 6-step agent pipeline, from free-text input to synthesized CDS report.

**Test case:** 62-year-old male with crushing substernal chest pain, diaphoresis, nausea, and HTN history, on lisinopril + metformin + atorvastatin.

### Pipeline Step Results

| Step | Status | Duration | Key Findings |
|------|--------|----------|--------------|
| 1. Parse Patient Data | PASSED | 7.8 s | Correctly extracted: age 62, male, chest pain chief complaint, 3 medications, HTN/DM history |
| 2. Clinical Reasoning | PASSED | 21.2 s | Top differential: Acute Coronary Syndrome (ACS). Also considered: GERD, PE, aortic dissection |
| 3. Drug Interaction Check | PASSED | 11.3 s | Queried OpenFDA + RxNorm for lisinopril, metformin, atorvastatin interactions |
| 4. Guideline Retrieval | PASSED | 9.6 s | Retrieved ACC/AHA chest pain / ACS guidelines from RAG corpus |
| 5. Conflict Detection | PASSED | — | Compared guidelines against patient data for omissions, contradictions, dosage issues, and monitoring gaps |
| 6. Synthesis | PASSED | 25.3 s | Generated comprehensive CDS report with differential, warnings, conflicts, and guideline recommendations |

**Total pipeline time:** 75.2 s

### How to Reproduce

```bash
# Start the backend first
cd src/backend
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# In another terminal
cd src/backend
python test_e2e.py
```

---

## 3. Clinical Test Suite

**Test file:** `src/backend/test_clinical_cases.py`

**What it tests:** 22 diverse clinical scenarios across 14 medical specialties.

**Methodology:** Each case has a clinical vignette, expected keywords in the CDS report output, and a specialty classification. The test submits each case through the full pipeline and validates that the expected terms appear in the report.
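The keyword validation can be sketched with a small helper; the function name and signature here are illustrative, not the actual `test_clinical_cases.py` API:

```python
def case_passes(report: str, expected_keywords: list[str]) -> tuple[bool, list[str]]:
    """Return (passed, missing_keywords): a case passes only when every
    expected keyword appears, case-insensitively, in the generated report."""
    text = report.lower()
    missing = [kw for kw in expected_keywords if kw.lower() not in text]
    return (not missing, missing)
```

For example, the `em_sepsis` case below would pass only if "Sepsis", "lactate", "blood cultures", and "fluids" all appear somewhere in the report text. Reporting the missing keywords (rather than a bare pass/fail) makes per-case failures easier to debug.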
### Test Cases

| ID | Specialty | Scenario | Key Validation Keywords |
|----|-----------|----------|------------------------|
| `cardio_acs` | Cardiology | 62M crushing chest pain | ACS, troponin, ECG |
| `cardio_afib` | Cardiology | 72F palpitations, irregular pulse | Atrial fibrillation, anticoagulation, CHA2DS2-VASc |
| `cardio_hf` | Cardiology | 68M progressive dyspnea, edema | Heart failure, BNP, diuretic |
| `neuro_stroke` | Neurology | 75M sudden left-sided weakness | Stroke, CT, tPA, NIH Stroke Scale |
| `em_sepsis` | Emergency Medicine | 45F fever, tachycardia, hypotension | Sepsis, lactate, blood cultures, fluids |
| `em_anaphylaxis` | Emergency Medicine | 28F bee sting, urticaria, wheezing | Anaphylaxis, epinephrine, airway |
| `em_polytrauma` | Emergency Medicine | 35M MVC, multiple injuries | Trauma, ATLS, FAST, C-spine |
| `endo_dka` | Endocrinology | 22F T1DM, vomiting, Kussmaul breathing | DKA, insulin, potassium, anion gap |
| `endo_thyroid_storm` | Endocrinology | 40F Graves disease, fever, tachycardia, AMS | Thyroid storm, PTU, beta-blocker |
| `endo_adrenal` | Endocrinology | 55M weakness, hypotension, hyperpigmentation | Adrenal insufficiency, cortisol, hydrocortisone |
| `pulm_pe` | Pulmonology | 50F post-surgical, sudden dyspnea | Pulmonary embolism, CT angiography, anticoagulation |
| `pulm_asthma` | Pulmonology | 19M severe wheezing, accessory muscles | Status asthmaticus, albuterol, steroids |
| `gi_bleed` | Gastroenterology | 60M hematemesis, melena, cirrhosis history | Upper GI bleed, endoscopy, PPI, variceal |
| `gi_pancreatitis` | Gastroenterology | 48F epigastric pain, elevated lipase | Pancreatitis, NPO, IV fluids, imaging |
| `neuro_seizure` | Neurology | 30F witnessed generalized seizure | Status epilepticus, benzodiazepine, EEG |
| `id_meningitis` | Infectious Disease | 20M fever, neck stiffness, photophobia | Meningitis, lumbar puncture, empiric antibiotics |
| `psych_suicidal` | Psychiatry | 35M suicidal ideation, plan, access | Suicide risk, safety assessment, hospitalization |
| `peds_fever` | Pediatrics | 3-week-old neonate, fever 38.5°C | Neonatal fever, sepsis workup, admit |
| `peds_dehydration` | Pediatrics | 2-year-old, 5 days diarrhea/vomiting | Dehydration, ORS, electrolytes |
| `nephro_hyperkalemia` | Nephrology | 70M CKD, K+ 7.2, ECG changes | Hyperkalemia, calcium gluconate, insulin/glucose, dialysis |
| `tox_acetaminophen` | Emergency Medicine | 23F intentional APAP overdose | Acetaminophen, NAC, liver, Rumack-Matthew |
| `geri_polypharmacy` | Geriatrics | 82F on 12 medications, recurrent falls | Polypharmacy, fall risk, medication reconciliation, Beers criteria |

### How to Reproduce

```bash
cd src/backend

# List all available cases
python test_clinical_cases.py --list

# Run a single case
python test_clinical_cases.py --case em_sepsis

# Run all cases in a specialty
python test_clinical_cases.py --specialty Cardiology

# Run all 22 cases
python test_clinical_cases.py

# Run all and save report to JSON
python test_clinical_cases.py --report results.json

# Quiet mode (summary only)
python test_clinical_cases.py --quiet
```

---

## 4. RAG Corpus Statistics

| Metric | Value |
|--------|-------|
| Total guidelines | 62 |
| Specialties covered | 14 |
| Guidelines stored in ChromaDB | 62 |
| Embedding model | all-MiniLM-L6-v2 (384 dimensions) |
| Embedding time (full rebuild) | ~5 s |
| ChromaDB persist directory | `./data/chroma` |
| Source file | `app/data/clinical_guidelines.json` |

### Guidelines per Specialty

| Specialty | Count |
|-----------|-------|
| Emergency Medicine | 10 |
| Cardiology | 8 |
| Endocrinology | 7 |
| Gastroenterology | 5 |
| Infectious Disease | 5 |
| Pulmonology | 4 |
| Neurology | 4 |
| Psychiatry | 4 |
| Pediatrics | 4 |
| Nephrology | 2 |
| Hematology | 2 |
| Rheumatology | 2 |
| OB/GYN | 2 |
| Preventive / Perioperative / Dermatology | 3 |

---

## 5. Test Infrastructure

| File | Lines | Purpose |
|------|-------|---------|
| `test_e2e.py` | ~60 | Submit chest pain case, poll for completion, validate all 6 steps |
| `test_clinical_cases.py` | ~400 | 22 clinical cases with keyword validation, CLI flags for filtering |
| `test_rag_quality.py` | ~350 | 30 RAG retrieval queries with expected guideline IDs, relevance scoring |
| `test_poll.py` | ~30 | Utility: poll a case ID until completion |

### Dependencies for Testing

Tests use only the standard library plus `httpx` (for REST calls) and the backend's own modules (for the RAG tests). No additional test frameworks are required beyond what's in `requirements.txt`.

---

## 6. External Dataset Validation

**Test files:** `src/backend/validation/` (package)

**What it tests:** Full-pipeline diagnostic accuracy and parse quality against real-world clinical datasets.

**Methodology:** Each harness fetches a public dataset, converts cases into patient narratives, runs them through the `Orchestrator` directly (no HTTP server), and scores the output against known ground truth.
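The scoring step can be illustrated with a simplified sketch. This is not the harness's actual implementation: the function name is hypothetical, and matching here is naive case-insensitive substring matching, whereas the real scorer may be more lenient (synonyms, abbreviations):

```python
def score_case(gold: str, differential: list[str], report_text: str) -> dict:
    """Score one case against the known ground-truth diagnosis.

    top1:      gold matches the first entry of the ranked differential
    top3:      gold appears within the first three entries
    mentioned: gold appears anywhere in the full report text
    """
    g = gold.lower()
    ranked = [d.lower() for d in differential]
    return {
        "top1": bool(ranked) and g in ranked[0],
        "top3": any(g in d for d in ranked[:3]),
        "mentioned": g in report_text.lower(),
    }
```

Note that "mentioned" is deliberately the loosest metric: a diagnosis buried in the report's warnings still counts, which is why it can exceed top-1/top-3 accuracy in the results below.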
### Datasets

| Dataset | Source | Cases Available | Metrics |
|---------|--------|-----------------|---------|
| **MedQA (USMLE)** | HuggingFace (`GBaker/MedQA-USMLE-4-options`) | 1,273 test cases | top-1, top-3, mentioned diagnostic accuracy |
| **MTSamples** | GitHub (`socd06/medical-nlp`) | ~5,000 transcription notes | parse success, field completeness, specialty alignment |
| **PMC Case Reports** | PubMed E-utilities (esearch + efetch) | Dynamic (curated queries) | diagnostic accuracy vs. published diagnosis |

### Initial Results (Smoke Test — 3 MedQA Cases)

| Metric | Value |
|--------|-------|
| Cases run | 3 |
| Parse success | 100% (3/3) |
| Top-1 diagnostic accuracy | 66.7% (2/3) |
| Top-3 diagnostic accuracy | 66.7% (2/3) |
| Avg pipeline time | ~94 s per case |

### 50-Case MedQA Validation (MedGemma 27B Text IT via HF Endpoint)

Run with: `python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2`

| Metric | Value |
|--------|-------|
| Cases run | 50 |
| Pipeline success | 94% (47/50) |
| Top-1 diagnostic accuracy | 36% |
| Top-3 diagnostic accuracy | 38% |
| Differential accuracy | 10% |
| Mentioned in report | 38% |
| Avg pipeline time | 204 s per case |
| Total run time | ~60 min |

**Breakdown by question type (50 cases):**

| Type | Count | Mentioned | Differential |
|------|-------|-----------|-------------|
| Diagnostic | 36 | 14 (39%) | 5 (14%) |
| Treatment | 6 | — | — |
| Pathophysiology | 6 | — | — |
| Statistics | 1 | — | — |
| Anatomy | 1 | — | — |

> **Notes:** MedQA includes many non-diagnostic question types (treatment selection, mechanism of action, etc.) that the CDS pipeline is not designed to answer. On diagnostic-only questions, the pipeline mentioned the correct diagnosis 39% of the time. The 3/50 pipeline failures were caused by the HF endpoint scaling to zero mid-run.
>
> Full validation was run on Feb 15, 2026 using the `medgemma-27b-cds` HuggingFace Dedicated Endpoint (1× A100 80 GB, bfloat16).
> Incremental checkpoints are saved to `validation/results/medqa_checkpoint.jsonl`, with `--resume` support.

### How to Reproduce

```bash
cd src/backend

# Fetch datasets only (no pipeline runs)
python -m validation.run_validation --fetch-only

# Run MedQA validation (N cases)
python -m validation.run_validation --medqa --max-cases 10

# Run MTSamples validation
python -m validation.run_validation --mtsamples --max-cases 10

# Run PMC Case Reports validation
python -m validation.run_validation --pmc --max-cases 5

# Run all 3 datasets
python -m validation.run_validation --all --max-cases 10

# Additional flags:
#   --seed 42         Reproducible random sampling
#   --delay 2         Seconds between cases (rate limiting)
#   --no-drugs        Skip drug interaction step
#   --no-guidelines   Skip guideline retrieval step
```

Results are saved to `validation/results/` as timestamped JSON files.
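Rolling per-case scores up into the summary percentages reported above can be sketched as follows. The record schema (`ok`, `top1`, `top3`, `mentioned` flags) is an assumption for illustration, not the harness's actual format, and the sketch assumes all sampled cases (including pipeline failures) form the denominator for the accuracy metrics:

```python
def aggregate(records: list[dict]) -> dict:
    """Aggregate per-case boolean scoring records into summary percentages.

    'ok' marks a case whose pipeline run completed; accuracy flags are
    only trusted on completed cases, but percentages are taken over all
    sampled cases (a convention assumed here, not confirmed by the harness).
    """
    n = len(records)
    if n == 0:
        return {}
    done = [r for r in records if r.get("ok")]

    def pct(flag: str) -> int:
        return round(100 * sum(1 for r in done if r.get(flag)) / n)

    return {
        "cases": n,
        "pipeline_success_pct": round(100 * len(done) / n),
        "top1_pct": pct("top1"),
        "top3_pct": pct("top3"),
        "mentioned_pct": pct("mentioned"),
    }
```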
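The JSONL checkpoint/`--resume` behavior noted above could be implemented along these lines; the `case_id` field and function names are assumptions for illustration, not the validation package's actual schema:

```python
import json
from pathlib import Path


def load_completed(checkpoint: Path) -> set[str]:
    """Read a JSONL checkpoint and return the case IDs already scored,
    so a resumed run can skip them. Assumes one JSON object per line,
    each carrying a 'case_id' field."""
    if not checkpoint.exists():
        return set()
    done = set()
    for line in checkpoint.read_text().splitlines():
        if line.strip():
            done.add(json.loads(line)["case_id"])
    return done


def append_result(checkpoint: Path, case_id: str, result: dict) -> None:
    """Append one case result as a single JSON line. Appending per case
    means an interrupted run loses at most the in-flight case."""
    with checkpoint.open("a") as f:
        f.write(json.dumps({"case_id": case_id, **result}) + "\n")
```

Append-per-case is what makes mid-run failures (such as the endpoint scale-to-zero noted above) cheap to recover from: the resumed run skips every ID already present in the file.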