# Test Results – CDS Agent

> Last updated after 50-case MedQA validation with MedGemma 27B via HuggingFace Dedicated Endpoint.

---

## 1. RAG Retrieval Quality Test

**Test file:** `src/backend/test_rag_quality.py`

**What it tests:** Whether the RAG system retrieves the correct clinical guideline for a given clinical query.

**Methodology:** 30 clinical queries, each with an expected guideline ID. For each query, the test retrieves the top-5 guidelines from ChromaDB and checks whether the expected guideline appears in the results, and whether it scores above the relevance threshold (0.4).
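
For reference, the per-query check amounts to roughly the following sketch (the collection name and the distance-to-relevance conversion are assumptions, not verified against `test_rag_quality.py`):

```python
import chromadb

RELEVANCE_THRESHOLD = 0.4  # same threshold the test uses

client = chromadb.PersistentClient(path="./data/chroma")
collection = client.get_collection("clinical_guidelines")  # collection name is an assumption

def check_query(query: str, expected_id: str) -> bool:
    """Pass iff the expected guideline is in the top-5 and scores above threshold."""
    res = collection.query(query_texts=[query], n_results=5)
    ids = res["ids"][0]
    # ChromaDB returns distances; assuming cosine distance, relevance = 1 - distance.
    relevances = [1.0 - d for d in res["distances"][0]]
    return expected_id in ids and relevances[ids.index(expected_id)] >= RELEVANCE_THRESHOLD
```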

### Summary

| Metric | Value |
|--------|-------|
| Total queries | 30 |
| Passed | 30 |
| Failed | 0 |
| **Pass rate** | **100%** |
| Avg relevance score | 0.639 |
| Min relevance score | 0.519 |
| Max relevance score | 0.765 |
| Top-1 accuracy | 100% (correct guideline ranked #1 for all 30 queries) |

### Results by Specialty

| Specialty | Queries | Passed | Pass Rate | Avg Relevance |
|-----------|---------|--------|-----------|---------------|
| Cardiology | 4 | 4 | 100% | 0.65 |
| Emergency Medicine | 5 | 5 | 100% | 0.62 |
| Endocrinology | 3 | 3 | 100% | 0.64 |
| Pulmonology | 2 | 2 | 100% | 0.63 |
| Neurology | 2 | 2 | 100% | 0.66 |
| Gastroenterology | 2 | 2 | 100% | 0.61 |
| Infectious Disease | 2 | 2 | 100% | 0.67 |
| Psychiatry | 2 | 2 | 100% | 0.64 |
| Pediatrics | 2 | 2 | 100% | 0.63 |
| Nephrology | 2 | 2 | 100% | 0.65 |
| Hematology | 1 | 1 | 100% | 0.62 |
| Rheumatology | 1 | 1 | 100% | 0.64 |
| OB/GYN | 1 | 1 | 100% | 0.66 |
| Other | 1 | 1 | 100% | 0.61 |

### How to Reproduce

```bash
cd src/backend
python test_rag_quality.py --rebuild --verbose
```

**Flags:**

- `--rebuild` – Rebuild ChromaDB from `clinical_guidelines.json` before testing
- `--verbose` – Print each query, expected ID, actual top result, and relevance score
- `--stats` – Print summary statistics only
- `--query "chest pain"` – Test a single ad-hoc query

---

## 2. End-to-End Pipeline Test

**Test file:** `src/backend/test_e2e.py`

**What it tests:** Full 6-step agent pipeline from free-text input to synthesized CDS report.

**Test case:** 62-year-old male with crushing substernal chest pain, diaphoresis, nausea, history of HTN and diabetes, on lisinopril + metformin + atorvastatin.

### Pipeline Step Results

| Step | Status | Duration | Key Findings |
|------|--------|----------|--------------|
| 1. Parse Patient Data | PASSED | 7.8 s | Correctly extracted: age 62, male, chest pain chief complaint, 3 medications, HTN/DM history |
| 2. Clinical Reasoning | PASSED | 21.2 s | Top differential: Acute Coronary Syndrome (ACS). Also considered: GERD, PE, aortic dissection |
| 3. Drug Interaction Check | PASSED | 11.3 s | Queried OpenFDA + RxNorm for lisinopril, metformin, atorvastatin interactions |
| 4. Guideline Retrieval | PASSED | 9.6 s | Retrieved ACC/AHA chest pain / ACS guidelines from RAG corpus |
| 5. Conflict Detection | PASSED | – | Compared guidelines against patient data for omissions, contradictions, dosage issues, and monitoring gaps |
| 6. Synthesis | PASSED | 25.3 s | Generated comprehensive CDS report with differential, warnings, conflicts, guideline recommendations |

**Total pipeline time:** 75.2 s

### How to Reproduce

```bash
# Start the backend first
cd src/backend
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# In another terminal
cd src/backend
python test_e2e.py
```
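
Under the hood, the test's submit-and-poll flow looks roughly like the sketch below (the `/api/cases` routes, payload, and response field names are assumptions, not confirmed from `test_e2e.py`):

```python
import time

import httpx

BASE_URL = "http://localhost:8000"
VIGNETTE = (
    "62-year-old male with crushing substernal chest pain, diaphoresis, nausea, "
    "history of HTN and diabetes, on lisinopril, metformin, and atorvastatin."
)

with httpx.Client(base_url=BASE_URL, timeout=30.0) as client:
    # Submit the free-text case; endpoint path and payload shape are assumptions.
    case_id = client.post("/api/cases", json={"text": VIGNETTE}).json()["case_id"]

    # Poll until the 6-step pipeline finishes (a full run takes ~75 s).
    while True:
        status = client.get(f"/api/cases/{case_id}").json()
        if status.get("state") in ("completed", "failed"):
            break
        time.sleep(5)

    # Validate that every step reports PASSED; field names are assumptions.
    assert all(step["status"] == "PASSED" for step in status["steps"])
```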

---

## 3. Clinical Test Suite

**Test file:** `src/backend/test_clinical_cases.py`

**What it tests:** 22 diverse clinical scenarios across 14 medical specialties.

**Methodology:** Each case has a clinical vignette, expected keywords in the CDS report output, and a specialty classification. The test submits each case through the full pipeline and validates that the expected terms appear in the report.
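
The keyword validation itself is a simple case-insensitive containment check, roughly as sketched here (the exact matching rules in `test_clinical_cases.py` may differ, e.g. synonym handling):

```python
def validate_case(report_text: str, expected_keywords: list[str]) -> tuple[bool, list[str]]:
    """Return (passed, missing_keywords) for one case's CDS report."""
    text = report_text.lower()
    missing = [kw for kw in expected_keywords if kw.lower() not in text]
    return (not missing, missing)

# Example: the em_sepsis case expects all four of these terms in the report.
report = "...CDS report text produced by the pipeline..."
passed, missing = validate_case(report, ["sepsis", "lactate", "blood cultures", "fluids"])
```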

### Test Cases

| ID | Specialty | Scenario | Key Validation Keywords |
|----|-----------|----------|-------------------------|
| `cardio_acs` | Cardiology | 62M crushing chest pain | ACS, troponin, ECG |
| `cardio_afib` | Cardiology | 72F palpitations, irregular pulse | Atrial fibrillation, anticoagulation, CHA2DS2-VASc |
| `cardio_hf` | Cardiology | 68M progressive dyspnea, edema | Heart failure, BNP, diuretic |
| `neuro_stroke` | Neurology | 75M sudden left-sided weakness | Stroke, CT, tPA, NIH Stroke Scale |
| `em_sepsis` | Emergency Medicine | 45F fever, tachycardia, hypotension | Sepsis, lactate, blood cultures, fluids |
| `em_anaphylaxis` | Emergency Medicine | 28F bee sting, urticaria, wheezing | Anaphylaxis, epinephrine, airway |
| `em_polytrauma` | Emergency Medicine | 35M MVC, multiple injuries | Trauma, ATLS, FAST, C-spine |
| `endo_dka` | Endocrinology | 22F T1DM, vomiting, Kussmaul breathing | DKA, insulin, potassium, anion gap |
| `endo_thyroid_storm` | Endocrinology | 40F Graves disease, fever, tachycardia, AMS | Thyroid storm, PTU, beta-blocker |
| `endo_adrenal` | Endocrinology | 55M weakness, hypotension, hyperpigmentation | Adrenal insufficiency, cortisol, hydrocortisone |
| `pulm_pe` | Pulmonology | 50F post-surgical, sudden dyspnea | Pulmonary embolism, CT angiography, anticoagulation |
| `pulm_asthma` | Pulmonology | 19M severe wheezing, accessory muscle use | Status asthmaticus, albuterol, steroids |
| `gi_bleed` | Gastroenterology | 60M hematemesis, melena, cirrhosis history | Upper GI bleed, endoscopy, PPI, variceal |
| `gi_pancreatitis` | Gastroenterology | 48F epigastric pain, elevated lipase | Pancreatitis, NPO, IV fluids, imaging |
| `neuro_seizure` | Neurology | 30F witnessed generalized seizure | Status epilepticus, benzodiazepine, EEG |
| `id_meningitis` | Infectious Disease | 20M fever, neck stiffness, photophobia | Meningitis, lumbar puncture, empiric antibiotics |
| `psych_suicidal` | Psychiatry | 35M suicidal ideation, plan, access | Suicide risk, safety assessment, hospitalization |
| `peds_fever` | Pediatrics | 3-week-old neonate, fever 38.5°C | Neonatal fever, sepsis workup, admit |
| `peds_dehydration` | Pediatrics | 2-year-old, 5 days diarrhea/vomiting | Dehydration, ORS, electrolytes |
| `nephro_hyperkalemia` | Nephrology | 70M CKD, K+ 7.2, ECG changes | Hyperkalemia, calcium gluconate, insulin/glucose, dialysis |
| `tox_acetaminophen` | Emergency Medicine | 23F intentional APAP overdose | Acetaminophen, NAC, liver, Rumack-Matthew |
| `geri_polypharmacy` | Geriatrics | 82F on 12 medications, recurrent falls | Polypharmacy, fall risk, medication reconciliation, Beers criteria |

### How to Reproduce

```bash
cd src/backend

# List all available cases
python test_clinical_cases.py --list

# Run a single case
python test_clinical_cases.py --case em_sepsis

# Run all cases in a specialty
python test_clinical_cases.py --specialty Cardiology

# Run all 22 cases
python test_clinical_cases.py

# Run all and save report to JSON
python test_clinical_cases.py --report results.json

# Quiet mode (summary only)
python test_clinical_cases.py --quiet
```

---

## 4. RAG Corpus Statistics

| Metric | Value |
|--------|-------|
| Total guidelines | 62 |
| Specialties covered | 14 |
| Guidelines stored in ChromaDB | 62 |
| Embedding model | all-MiniLM-L6-v2 (384 dimensions) |
| Embedding time (full rebuild) | ~5 s |
| ChromaDB persist directory | `./data/chroma` |
| Source file | `app/data/clinical_guidelines.json` |
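
A `--rebuild` boils down to re-embedding each guideline and re-adding it to the collection, along these lines (a sketch; the collection name and the JSON field names are assumptions):

```python
import json

import chromadb
from chromadb.utils import embedding_functions

# Same model as listed above; chromadb wraps sentence-transformers for us.
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

client = chromadb.PersistentClient(path="./data/chroma")
collection = client.get_or_create_collection("clinical_guidelines", embedding_function=ef)

with open("app/data/clinical_guidelines.json") as f:
    guidelines = json.load(f)

# Assumed schema: each entry carries "id", "text", and "specialty" keys.
collection.add(
    ids=[g["id"] for g in guidelines],
    documents=[g["text"] for g in guidelines],
    metadatas=[{"specialty": g["specialty"]} for g in guidelines],
)
```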

### Guidelines per Specialty

| Specialty | Count |
|-----------|-------|
| Emergency Medicine | 10 |
| Cardiology | 8 |
| Endocrinology | 7 |
| Gastroenterology | 5 |
| Infectious Disease | 5 |
| Pulmonology | 4 |
| Neurology | 4 |
| Psychiatry | 4 |
| Pediatrics | 4 |
| Nephrology | 2 |
| Hematology | 2 |
| Rheumatology | 2 |
| OB/GYN | 2 |
| Preventive / Perioperative / Dermatology | 3 |

---

## 5. Test Infrastructure

| File | Lines | Purpose |
|------|-------|---------|
| `test_e2e.py` | ~60 | Submit chest pain case, poll for completion, validate all 6 steps |
| `test_clinical_cases.py` | ~400 | 22 clinical cases with keyword validation, CLI flags for filtering |
| `test_rag_quality.py` | ~350 | 30 RAG retrieval queries with expected guideline IDs, relevance scoring |
| `test_poll.py` | ~30 | Utility: poll a case ID until completion |

### Dependencies for Testing

Tests use only the standard library plus `httpx` (for REST calls) and the backend's own modules (for RAG tests). No additional test frameworks are required beyond what's in `requirements.txt`.

---

## 6. External Dataset Validation

**Test files:** `src/backend/validation/` (package)

**What it tests:** Full-pipeline diagnostic accuracy and parse quality against real-world clinical datasets.

**Methodology:** Each harness fetches a public dataset, converts cases into patient narratives, runs them through the `Orchestrator` directly (no HTTP server), and scores the output against known ground truth.
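
In outline, a harness run looks like the sketch below (the `Orchestrator` import path, its `run` method, and the report's field layout are all assumptions based on the description above, not the actual validation code):

```python
from datasets import load_dataset  # HuggingFace datasets library

from app.orchestrator import Orchestrator  # import path is an assumption

ds = load_dataset("GBaker/MedQA-USMLE-4-options", split="test")
orchestrator = Orchestrator()

for case in ds.select(range(10)):
    # Use the USMLE question stem as a free-text patient narrative.
    narrative = case["question"]
    ground_truth = case["answer"]

    # Run the full 6-step pipeline in-process (no HTTP server).
    report = orchestrator.run(narrative)  # method name is an assumption

    # Score: is the known answer the top differential? Field access is illustrative.
    top1_correct = ground_truth.lower() in report.differential[0].lower()
```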

### Datasets

| Dataset | Source | Cases Available | Metrics |
|---------|--------|-----------------|---------|
| **MedQA (USMLE)** | HuggingFace (`GBaker/MedQA-USMLE-4-options`) | 1,273 test cases | Top-1, top-3, mentioned diagnostic accuracy |
| **MTSamples** | GitHub (`socd06/medical-nlp`) | ~5,000 transcription notes | Parse success, field completeness, specialty alignment |
| **PMC Case Reports** | PubMed E-utilities (esearch + efetch) | Dynamic (curated queries) | Diagnostic accuracy vs. published diagnosis |

### Initial Results (Smoke Test – 3 MedQA Cases)

| Metric | Value |
|--------|-------|
| Cases run | 3 |
| Parse success | 100% (3/3) |
| Top-1 diagnostic accuracy | 66.7% (2/3) |
| Top-3 diagnostic accuracy | 66.7% (2/3) |
| Avg pipeline time | ~94 s per case |

### 50-Case MedQA Validation (MedGemma 27B Text IT via HF Endpoint)

Run with: `python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2`

| Metric | Value |
|--------|-------|
| Cases run | 50 |
| Pipeline success | 94% (47/50) |
| Top-1 diagnostic accuracy | 36% |
| Top-3 diagnostic accuracy | 38% |
| Differential accuracy | 10% |
| Mentioned in report | 38% |
| Avg pipeline time | 204 s per case |
| Total run time | ~60 min |

**Breakdown by question type (50 cases):**

| Type | Count | Mentioned | Differential |
|------|-------|-----------|--------------|
| Diagnostic | 36 | 14 (39%) | 5 (14%) |
| Treatment | 6 | – | – |
| Pathophysiology | 6 | – | – |
| Statistics | 1 | – | – |
| Anatomy | 1 | – | – |

> **Notes:** MedQA includes many non-diagnostic question types (treatment selection, mechanism of action, etc.) that the CDS pipeline is not designed to answer. On diagnostic-only questions, the pipeline mentioned the correct diagnosis 39% of the time. The 3/50 pipeline failures were caused by the HF endpoint scaling to zero mid-run.
>
> The full validation was run on Feb 15, 2026 against the `medgemma-27b-cds` HuggingFace Dedicated Endpoint (1× A100 80 GB, bfloat16). Incremental checkpoints are saved to `validation/results/medqa_checkpoint.jsonl` with `--resume` support.
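
The checkpoint/resume mechanism is the usual JSONL append-and-skip pattern, roughly as follows (a sketch; the record schema in `medqa_checkpoint.jsonl` is assumed):

```python
import json
from pathlib import Path

CHECKPOINT = Path("validation/results/medqa_checkpoint.jsonl")

def load_done_ids() -> set[str]:
    """IDs of cases already scored, so --resume can skip them."""
    if not CHECKPOINT.exists():
        return set()
    with CHECKPOINT.open() as f:
        return {json.loads(line)["case_id"] for line in f if line.strip()}

def append_result(result: dict) -> None:
    """Append one case's result immediately; a crash loses at most one case."""
    with CHECKPOINT.open("a") as f:
        f.write(json.dumps(result) + "\n")
```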

### How to Reproduce

```bash
cd src/backend

# Fetch datasets only (no pipeline runs)
python -m validation.run_validation --fetch-only

# Run MedQA validation (N cases)
python -m validation.run_validation --medqa --max-cases 10

# Run MTSamples validation
python -m validation.run_validation --mtsamples --max-cases 10

# Run PMC Case Reports validation
python -m validation.run_validation --pmc --max-cases 5

# Run all 3 datasets
python -m validation.run_validation --all --max-cases 10

# Additional flags:
#   --seed 42         Reproducible random sampling
#   --delay 2         Seconds between cases (rate limiting)
#   --no-drugs        Skip drug interaction step
#   --no-guidelines   Skip guideline retrieval step
```

Results are saved to `validation/results/` as timestamped JSON files.