bshepp committed
Commit 1f36481 · 1 Parent(s): 3d02eb2

MedGemma validation: 50-case MedQA run, TGI endpoint config, prompt improvements


- Fixed TGI token limits (MAX_INPUT_TOKENS=12288, MAX_TOTAL_TOKENS=16384)
- Reduced per-step max_tokens for faster generation
- Improved clinical reasoning prompt (disease-level dx, not symptoms)
- Fixed Unicode encoding issues for Windows console
- Fixed error masking in orchestrator (failed steps now surface errors)
- Fixed endpoint URL in .env
- Added analyze_results.py for question-type categorization
- Results: 94% pipeline success, 38% top3 accuracy, 14% dx-only accuracy
- Paused endpoint to save costs

README.md CHANGED
@@ -333,6 +333,6 @@ curl -X POST http://localhost:8000/api/cases/submit \
 
 Licensed under the [Apache License 2.0](LICENSE).
 
-This project uses the Gemma model, which is subject to the [HAI-DEF Terms of Use](https://developers.google.com/health-ai-developer-foundations/terms).
+This project uses MedGemma and other models from Google's [Health AI Developer Foundations (HAI-DEF)](https://developers.google.com/health-ai-developer-foundations), subject to the [HAI-DEF Terms of Use](https://developers.google.com/health-ai-developer-foundations/terms).
 
 > **Disclaimer:** This is a research / demonstration system. It is NOT a substitute for professional medical judgment. All clinical decisions must be made by qualified healthcare professionals.
docs/deploy_medgemma_hf.md ADDED
@@ -0,0 +1,141 @@
+# Deploying MedGemma 27B on HuggingFace Dedicated Endpoints
+
+This guide walks through deploying `google/medgemma-27b-text-it` as a
+HuggingFace Dedicated Inference Endpoint, which our CDS Agent calls via an
+OpenAI-compatible API.
+
+## Why HuggingFace Endpoints?
+
+| Feature | Details |
+|---|---|
+| **Model** | `google/medgemma-27b-text-it` (HAI-DEF, competition-required) |
+| **Cost** | ~$2.50/hr (1× A100 80 GB on AWS) |
+| **Scale-to-zero** | Yes — no charges while idle |
+| **API format** | OpenAI-compatible (TGI) — zero code changes |
+| **Setup time** | ~10 minutes |
+
+## Prerequisites
+
+1. **HuggingFace account** with a valid payment method.
+2. **MedGemma access** — accept the gated-model terms at
+   <https://huggingface.co/google/medgemma-27b-text-it>. You must agree to
+   Google's Health AI Developer Foundations (HAI-DEF) license.
+3. A **HuggingFace token** with `read` scope (already in `.env` as `HF_TOKEN`).
+
+## Step-by-step Deployment
+
+### 1. Create the endpoint
+
+1. Go to <https://ui.endpoints.huggingface.co/new>.
+2. **Model Repository**: `google/medgemma-27b-text-it`
+3. **Cloud Provider**: AWS (cheapest) or GCP
+4. **Region**: `us-east-1` (AWS) or `us-central1` (GCP)
+5. **Instance type**: GPU — **1× NVIDIA A100 80 GB**
+   - AWS: ~$2.50/hr
+   - GCP: ~$3.60/hr
+6. **Container type**: Text Generation Inference (TGI) — this is the default.
+7. **Advanced Settings**:
+   - **Max Input Length**: `32768`
+   - **Max Total Tokens**: `40960`
+   - **Quantization**: `none` (bfloat16 fits in 80 GB)
+   - **Scale-to-zero**: **Enable** (idle timeout: 15 min recommended)
+8. Click **Create Endpoint**.
+
+### 2. Wait for the endpoint to become ready
+
+The first deployment downloads the model weights (~54 GB) and starts the TGI
+server. This typically takes **5–15 minutes**. The status will change from
+`Initializing` → `Running`.
+
+### 3. Configure the CDS Agent
+
+Edit `src/backend/.env`:
+
+```dotenv
+MEDGEMMA_API_KEY=hf_YOUR_TOKEN_HERE
+MEDGEMMA_BASE_URL=https://YOUR_ENDPOINT_ID.us-east-1.aws.endpoints.huggingface.cloud/v1
+MEDGEMMA_MODEL_ID=tgi
+```
+
+- **`MEDGEMMA_API_KEY`**: Your HuggingFace token (same as `HF_TOKEN`).
+- **`MEDGEMMA_BASE_URL`**: The endpoint URL from the HF dashboard, with `/v1`
+  appended. Example:
+  `https://x1y2z3.us-east-1.aws.endpoints.huggingface.cloud/v1`
+- **`MEDGEMMA_MODEL_ID`**: Use `tgi` — TGI exposes the model under this name
+  by default. Alternatively, you can use the full model name
+  `google/medgemma-27b-text-it`.
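Before starting the backend against a fresh endpoint, it can be worth sanity-checking these three variables. A minimal sketch (the `check_medgemma_env` helper is hypothetical, not part of the repo):

```python
def check_medgemma_env(env: dict) -> list[str]:
    """Return a list of config problems; an empty list means the values look sane."""
    problems = [
        f"{key} is missing"
        for key in ("MEDGEMMA_API_KEY", "MEDGEMMA_BASE_URL", "MEDGEMMA_MODEL_ID")
        if not env.get(key)
    ]
    url = env.get("MEDGEMMA_BASE_URL", "")
    # The OpenAI client needs the /v1 suffix on the endpoint URL
    if url and not url.rstrip("/").endswith("/v1"):
        problems.append("MEDGEMMA_BASE_URL should end with /v1")
    return problems
```

Running it against `os.environ` after loading `.env` catches the most common misconfiguration, a missing `/v1` suffix, before any request is made.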
+
+### 4. Verify the connection
+
+```bash
+cd src/backend
+python -c "
+import asyncio
+from app.services.medgemma import MedGemmaService
+
+async def test():
+    svc = MedGemmaService()
+    r = await svc.generate('What is the differential diagnosis for chest pain?')
+    print(r[:200])
+
+asyncio.run(test())
+"
+```
+
+You should see a clinical response from MedGemma.
+
+### 5. Run validation
+
+```bash
+cd src/backend
+python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
+```
+
+## Cost Estimation
+
+| Scenario | Hours | Cost |
+|---|---|---|
+| Validation run (120 cases @ ~1 min/case) | ~2 hrs | ~$5 |
+| Development / debugging (4 hrs) | ~4 hrs | ~$10 |
+| Competition demo recording | ~1 hr | ~$2.50 |
+| **Total estimated** | **~7 hrs** | **~$17.50** |
+
+With scale-to-zero enabled, the endpoint automatically shuts down after 15 min
+of inactivity — no overnight charges.
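The totals in the table are straight hourly arithmetic; a quick check (rate from the table above, hour counts are estimates):

```python
A100_RATE = 2.50  # $/hr for 1x A100 80 GB on AWS, per the table above
hours = {"validation": 2, "development": 4, "demo": 1}  # estimated hours per activity
total_hours = sum(hours.values())
total_cost = total_hours * A100_RATE
print(f"{total_hours} hrs x ${A100_RATE:.2f}/hr = ${total_cost:.2f}")  # 7 hrs x $2.50/hr = $17.50
```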
+
+## Troubleshooting
+
+### Cold start latency
+After scaling to zero, the first request takes 5–15 min while the model
+reloads. Send a warm-up request before benchmarking.
+
+### 403 Forbidden
+Your HF token may not have access to the gated model. Verify at
+<https://huggingface.co/google/medgemma-27b-text-it> that your account has been
+granted access.
+
+### Out of memory
+If the endpoint fails to start, ensure you selected the **80 GB** A100, not the
+40 GB variant. MedGemma 27B in bfloat16 requires ~54 GB VRAM.
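The ~54 GB figure is just parameter count times bytes per parameter (weights only; the KV cache and activations need additional headroom on top):

```python
params = 27e9          # MedGemma 27B parameter count
bytes_per_param = 2    # bfloat16 stores each parameter in 2 bytes
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~54 GB, so a 40 GB card cannot hold them
```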
+
+### "model not found" error
+TGI exposes the model as `tgi` by default. If you get a model-not-found error,
+try setting `MEDGEMMA_MODEL_ID=google/medgemma-27b-text-it` or check the
+endpoint's `/v1/models` route.
+
+## Deleting the Endpoint
+
+When you're done, delete the endpoint from the HF dashboard to stop all
+charges:
+
+1. Go to <https://ui.endpoints.huggingface.co/>
+2. Select your endpoint → **Settings** → **Delete**
+
+## Comparison with Alternatives
+
+| Platform | GPU | $/hr | Scale-to-Zero | Code Changes | Setup |
+|---|---|---|---|---|---|
+| **HF Endpoints** | 1× A100 80 GB | **$2.50** | **Yes** | **None** | **Easy** |
+| Vertex AI | a2-ultragpu-1g | $5.78 | No | Medium | Medium |
+| AWS EC2 (g5.12xlarge) | 4× A10G 96 GB | $5.67 | No (manual) | High | Hard |
+| AWS EC2 (p4de.24xlarge) | 8× A100 80 GB | $27.45 | No (manual) | High | Hard |
docs/writeup_draft.md CHANGED
@@ -53,15 +53,16 @@ Estimated reach: There are approximately 140 million ED visits per year in the U
 
 **HAI-DEF models used:**
 
-- **Gemma 3 27B IT** (`gemma-3-27b-it`) — accessed via Google AI Studio's OpenAI-compatible endpoint
+- **MedGemma** (`google/medgemma-27b-text-it`) — Google's medical-domain model from the Health AI Developer Foundations (HAI-DEF) collection
+- Development/validation also performed with **Gemma 3 27B IT** (`gemma-3-27b-it`) via Google AI Studio for rapid iteration
 
-**Why this model:**
+**Why MedGemma:**
 
-Gemma 3 27B IT provides the right balance of capability and accessibility for a clinical decision support application:
-- Large enough to perform complex clinical reasoning with chain-of-thought transparency
+MedGemma is purpose-built for medical applications and is part of Google's HAI-DEF collection:
+- Trained specifically for health and biomedical tasks, providing stronger clinical reasoning than general-purpose models
 - Open-weight model that can be self-hosted for HIPAA compliance in production
-- Available via API for rapid development and demonstration
-- Part of the HAI-DEF family, designed with health AI applications in mind
+- Large enough (27B parameters) for complex chain-of-thought clinical reasoning
+- Designed to be the foundation for healthcare AI applications, exactly what this competition demands
 
 **How the model is used:**

@@ -96,7 +97,7 @@ All inter-step data is strongly typed with Pydantic v2 models. The pipeline stre
 
 **Fine-tuning:**
 
-No fine-tuning was performed in the current version. The base `gemma-3-27b-it` model was used with carefully crafted prompt engineering for each pipeline step. Fine-tuning on clinical reasoning datasets is a planned improvement.
+No fine-tuning was performed in the current version. The base MedGemma model (`medgemma-27b-text-it`) was used with carefully crafted prompt engineering for each pipeline step. Fine-tuning on clinical reasoning datasets is a planned improvement.
 
 **Performance analysis:**

@@ -113,13 +114,13 @@ No fine-tuning was performed in the current version. The base `gemma-3-27b-it` m
 |-------|-----------|
 | Frontend | Next.js 14, React 18, TypeScript, Tailwind CSS |
 | Backend | FastAPI, Python 3.10, Pydantic v2, WebSocket |
-| LLM | Gemma 3 27B IT via Google AI Studio |
+| LLM | MedGemma 27B Text IT (HAI-DEF) + Gemma 3 27B IT for dev |
 | RAG | ChromaDB + sentence-transformers (all-MiniLM-L6-v2) |
 | Drug Data | OpenFDA API, RxNorm / NLM API |
 
 **Deployment considerations:**
 
-- **HIPAA compliance:** Gemma is an open-weight model that can be self-hosted on-premises, eliminating the need to send patient data to external APIs. This is critical for healthcare deployment.
+- **HIPAA compliance:** MedGemma is an open-weight model that can be self-hosted on-premises, eliminating the need to send patient data to external APIs. This is critical for healthcare deployment.
 - **Latency:** Current pipeline takes ~75 s end-to-end. For production, this could be reduced with: smaller/distilled models, parallel LLM calls, or GPU-accelerated inference.
 - **Scalability:** FastAPI + uvicorn supports async request handling. For high-throughput deployment, add worker processes and a task queue (e.g., Celery).
 - **EHR integration:** Current input is manual text paste. A production system would integrate with EHR systems via FHIR APIs for automatic patient data extraction.

@@ -163,4 +164,4 @@ The system is explicitly designed as a **decision support** tool, not a decision
 - Video: [To be recorded]
 - Code Repository: [github.com/bshepp/clinical-decision-support-agent](https://github.com/bshepp/clinical-decision-support-agent)
 - Live Demo: [To be deployed]
-- Hugging Face Model: N/A (using base Gemma 3 27B IT)
+- Hugging Face Model: [google/medgemma-27b-text-it](https://huggingface.co/google/medgemma-27b-text-it)
src/backend/analyze_results.py ADDED
@@ -0,0 +1,189 @@
+"""
+Post-analysis of MedQA validation results.
+
+Categorizes questions by type and reports accuracy for each category.
+This is important because the CDS pipeline focuses on DIAGNOSIS, while
+MedQA includes many non-diagnostic questions (pharmacology, management,
+biostatistics, pathophysiology).
+"""
+import json
+import re
+from collections import defaultdict
+from pathlib import Path
+
+CHECKPOINT = Path("validation/results/medqa_checkpoint.jsonl")
+DATA_FILE = Path("validation/data/medqa_test.jsonl")
+
+
+def classify_answer(correct_answer: str, full_question: str = "") -> str:
+    """Classify the MedQA answer type.
+
+    Categories:
+    - diagnosis: Answer is a disease, condition, or syndrome
+    - treatment: Answer is a drug, procedure, or intervention
+    - management: Answer is a management strategy (reassurance, referral, etc.)
+    - pathophysiology: Answer is a mechanism, pathway, or biochemical entity
+    - statistics: Answer is about study design, statistics, or epidemiology
+    - anatomy: Answer is about anatomy/location
+    - other: Everything else
+    """
+    answer = correct_answer.lower().strip()
+    question = full_question.lower()
+
+    # Statistics / study design
+    stats_patterns = [
+        r"type [12] error", r"null hypothesis", r"p.value", r"confidence interval",
+        r"odds ratio", r"relative risk", r"sensitivity", r"specificity",
+        r"positive predictive", r"negative predictive", r"number needed",
+        r"standard deviation", r"study design", r"randomized", r"case.control",
+        r"cohort study", r"cross.sectional", r"meta.analysis", r"selection bias",
+        r"recall bias", r"confounding", r"blinding", r"power of",
+    ]
+    for p in stats_patterns:
+        if re.search(p, answer) or re.search(p, question):
+            return "statistics"
+
+    # Treatment / pharmacology (drugs, procedures, interventions)
+    treatment_patterns = [
+        r"^start\b", r"^administer\b", r"^give\b", r"^prescribe\b",
+        r"^begin\b", r"^initiate\b", r"surgery", r"laparotomy",
+        r"laparoscop", r"analgesia", r"^reassurance", r"^observation",
+        r"^follow.up", r"^refer", r"^discharge",
+        r"corticosteroid", r"hydrocortisone", r"fludrocortisone",
+        r"prednisone", r"methylprednisolone", r"dexamethasone",
+        r"amitriptyline", r"fluoxetine", r"sertraline", r"metformin",
+        r"insulin", r"heparin", r"warfarin", r"aspirin",
+        r"amoxicillin", r"azithromycin", r"ceftriaxone",
+        r"exploratory", r"endoscop",
+    ]
+    for p in treatment_patterns:
+        if re.search(p, answer):
+            return "treatment"
+
+    # Management strategies
+    management_patterns = [
+        r"reassurance", r"watchful waiting", r"follow.up", r"counseling",
+        r"lifestyle", r"observation", r"monitor", r"admit",
+        r"discharge", r"consult",
+    ]
+    for p in management_patterns:
+        if re.search(p, answer):
+            return "management"
+
+    # Pathophysiology / biochemistry
+    patho_patterns = [
+        r"prostaglandin", r"acetaldehyde", r"histamine", r"serotonin",
+        r"dopamine", r"cytokine", r"interleukin", r"antibod",
+        r"complement", r"release of", r"synthesis of", r"inhibition of",
+        r"degradation of", r"mutation in", r"deficiency of",
+        r"mechanism", r"pathway", r"receptor", r"kinase",
+        r"affective symptoms", r"diagnosis of exclusion",
+    ]
+    for p in patho_patterns:
+        if re.search(p, answer):
+            return "pathophysiology"
+
+    # Anatomy
+    anatomy_patterns = [
+        r"lytic lesions", r"fracture", r"artery", r"vein",
+        r"nerve", r"muscle", r"bone", r"ligament",
+        r"right.sided", r"left.sided", r"posterior", r"anterior",
+    ]
+    for p in anatomy_patterns:
+        if re.search(p, answer):
+            return "anatomy"
+
+    # Default: assume it's a diagnosis
+    return "diagnosis"
+
+
+def analyze():
+    if not CHECKPOINT.exists():
+        print("No checkpoint file found. Run validation first.")
+        return
+
+    # Load results
+    results = []
+    for line in CHECKPOINT.read_text(encoding="utf-8").strip().split("\n"):
+        if line.strip():
+            results.append(json.loads(line))
+
+    # Load original questions for classification
+    questions = {}
+    if DATA_FILE.exists():
+        raw = DATA_FILE.read_text(encoding="utf-8").strip().split("\n")
+        for item_str in raw:
+            item = json.loads(item_str)
+            questions[item.get("question", "")] = item
+
+    # Classify and categorize
+    categories = defaultdict(list)
+
+    for r in results:
+        det = r.get("details", {})
+        correct = det.get("correct_answer", "")
+        full_q = det.get("full_question", "")
+
+        # Try to get the full question from the ground truth
+        if not full_q:
+            for case_key in r.get("ground_truth", {}).keys():
+                pass  # Fallback
+
+        cat = classify_answer(correct, full_q)
+        categories[cat].append(r)
+
+    # Print summary
+    print("=" * 70)
+    print(" MedQA RESULTS BY QUESTION CATEGORY")
+    print("=" * 70)
+
+    total_cases = len(results)
+    total_mentioned = sum(1 for r in results if r.get("details", {}).get("match_location", "not_found") != "not_found")
+    total_diff = sum(1 for r in results if r.get("details", {}).get("match_location") == "differential")
+
+    print(f"\n OVERALL: {total_cases} cases | Mentioned: {total_mentioned}/{total_cases} ({100*total_mentioned/total_cases:.0f}%) | Differential: {total_diff}/{total_cases} ({100*total_diff/total_cases:.0f}%)")
+
+    print(f"\n {'Category':<20} {'Count':>6} {'Mentioned':>10} {'Differential':>13} {'Pipeline OK':>12}")
+    print(f" {'-'*20} {'-'*6} {'-'*10} {'-'*13} {'-'*12}")
+
+    for cat in sorted(categories.keys()):
+        items = categories[cat]
+        n = len(items)
+        mentioned = sum(1 for r in items if r.get("details", {}).get("match_location", "not_found") != "not_found")
+        differential = sum(1 for r in items if r.get("details", {}).get("match_location") == "differential")
+        success = sum(1 for r in items if r.get("success"))
+
+        mentioned_pct = f"{100*mentioned/n:.0f}%" if n > 0 else "N/A"
+        diff_pct = f"{100*differential/n:.0f}%" if n > 0 else "N/A"
+        success_pct = f"{100*success/n:.0f}%" if n > 0 else "N/A"
+
+        print(f" {cat:<20} {n:>6} {mentioned:>5} ({mentioned_pct:>4}) {differential:>7} ({diff_pct:>4}) {success:>6} ({success_pct:>4})")
+
+    # Detailed per-case
+    print(f"\n DETAILED PER-CASE RESULTS:")
+    print(f" {'Case':<14} {'Cat':<15} {'Location':<14} {'Correct':<35} {'Top Dx':<35}")
+    print(f" {'-'*14} {'-'*15} {'-'*14} {'-'*35} {'-'*35}")
+
+    for r in results:
+        det = r.get("details", {})
+        correct = det.get("correct_answer", "?")[:34]
+        top = det.get("top_diagnosis", "?")[:34]
+        loc = det.get("match_location", "not_found")
+        cat = classify_answer(det.get("correct_answer", ""))
+
+        print(f" {r['case_id']:<14} {cat:<15} {loc:<14} {correct:<35} {top:<35}")
+
+    # Key insight
+    diag_items = categories.get("diagnosis", [])
+    if diag_items:
+        d_mentioned = sum(1 for r in diag_items if r.get("details", {}).get("match_location", "not_found") != "not_found")
+        d_diff = sum(1 for r in diag_items if r.get("details", {}).get("match_location") == "differential")
+        d_n = len(diag_items)
+        print(f"\n KEY INSIGHT:")
+        print(f" On DIAGNOSTIC questions only: Mentioned {d_mentioned}/{d_n} ({100*d_mentioned/d_n:.0f}%), Differential {d_diff}/{d_n} ({100*d_diff/d_n:.0f}%)")
+        print(f" The CDS pipeline is designed for diagnosis support; non-diagnostic questions")
+        print(f" (treatment, stats, pathophysiology) are outside its intended scope.")
+
+
+if __name__ == "__main__":
+    analyze()
src/backend/app/config.py CHANGED
@@ -27,6 +27,7 @@ class Settings(BaseSettings):
     rxnorm_base_url: str = "https://rxnav.nlm.nih.gov/REST"
     pubmed_base_url: str = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
     pubmed_api_key: str = ""  # Optional, increases rate limits
+    hf_token: str = ""  # HuggingFace token for dataset downloads
 
     # RAG
     chroma_persist_dir: str = "./data/chroma"
src/backend/app/services/medgemma.py CHANGED
@@ -125,25 +125,44 @@ class MedGemmaService:
     async def _generate_api(
         self, prompt: str, system_prompt: Optional[str], max_tokens: int, temperature: float
     ) -> str:
-        """Generate via OpenAI-compatible API."""
+        """Generate via OpenAI-compatible API.
+
+        MedGemma (served by TGI on HuggingFace Endpoints) natively supports the
+        system role, so we send system/user messages properly. If the backend
+        happens to be plain Gemma on Google AI Studio (which rejects the system
+        role), we automatically fall back to folding the system prompt into the
+        user message.
+        """
         client = await self._get_client()
 
         messages = []
-        # Some models (e.g. Gemma via Google AI Studio) don't support system role.
-        # Try with system prompt first, fall back to folding it into the user message.
         if system_prompt:
-            user_content = f"{system_prompt}\n\n{prompt}"
-        else:
-            user_content = prompt
-        messages.append({"role": "user", "content": user_content})
-
-        response = await client.chat.completions.create(
-            model=settings.medgemma_model_id,
-            messages=messages,
-            max_tokens=max_tokens,
-            temperature=temperature,
-        )
-        return response.choices[0].message.content
+            messages.append({"role": "system", "content": system_prompt})
+        messages.append({"role": "user", "content": prompt})
+
+        try:
+            response = await client.chat.completions.create(
+                model=settings.medgemma_model_id,
+                messages=messages,
+                max_tokens=max_tokens,
+                temperature=temperature,
+            )
+            return response.choices[0].message.content
+        except Exception as e:
+            # Fallback: fold system prompt into user message (Google AI Studio compat)
+            if system_prompt and "system" in str(e).lower():
+                logger.warning("Backend rejected system role — folding into user message.")
+                fallback_messages = [
+                    {"role": "user", "content": f"{system_prompt}\n\n{prompt}"}
+                ]
+                response = await client.chat.completions.create(
+                    model=settings.medgemma_model_id,
+                    messages=fallback_messages,
+                    max_tokens=max_tokens,
+                    temperature=temperature,
+                )
+                return response.choices[0].message.content
+            raise
 
     async def _generate_local(
         self, prompt: str, system_prompt: Optional[str], max_tokens: int, temperature: float
src/backend/app/tools/clinical_reasoning.py CHANGED
@@ -16,16 +16,21 @@ from app.services.medgemma import MedGemmaService
 
 logger = logging.getLogger(__name__)
 
-SYSTEM_PROMPT = """You are an expert clinical reasoning assistant. Given a structured
-patient profile, perform systematic clinical reasoning to generate a differential
-diagnosis, risk assessment, and recommended workup.
-
-IMPORTANT GUIDELINES:
-- Think step-by-step through the clinical reasoning process
+SYSTEM_PROMPT = """You are an expert clinical reasoning assistant trained in USMLE-level
+diagnostic reasoning. Given a structured patient profile, perform systematic clinical
+reasoning to generate a differential diagnosis, risk assessment, and recommended workup.
+
+CRITICAL GUIDELINES:
+- Each diagnosis MUST be a specific DISEASE or PATHOLOGICAL CONDITION (the root cause),
+  NOT a symptom, sign, lab finding, or descriptive term.
+  GOOD: "Primary hyperaldosteronism (Conn syndrome)", "Chikungunya fever",
+        "Clear cell adenocarcinoma of the cervix"
+  BAD: "Hypokalemia", "Fatigue", "Metabolic alkalosis", "Muscle cramps"
+- Think step-by-step: symptoms -> pathophysiology -> ETIOLOGICAL diagnosis
 - Consider the most likely diagnoses first, then less common but important ones
 - Always consider dangerous "can't miss" diagnoses
-- Base your reasoning on the available evidence (symptoms, labs, history)
-- Be explicit about your reasoning chain
+- Include at least 5 differential diagnoses when clinically reasonable
+- For each diagnosis, cite the specific findings that support or argue against it
 - Rate likelihood as "low", "moderate", or "high"
 - Rate priority of actions as "low", "moderate", "high", or "critical"
 - This is a decision SUPPORT tool — always recommend clinician judgment"""

@@ -86,7 +91,7 @@ class ClinicalReasoningTool:
             response_model=ClinicalReasoningResult,
             system_prompt=SYSTEM_PROMPT,
             temperature=0.3,
-            max_tokens=4096,
+            max_tokens=3072,
         )
 
         logger.info(
src/backend/app/tools/conflict_detection.py CHANGED
@@ -125,7 +125,7 @@ class ConflictDetectionTool:
             response_model=ConflictDetectionResult,
             system_prompt=SYSTEM_PROMPT,
             temperature=0.1,  # Low temp for safety-critical analysis
-            max_tokens=4096,
+            max_tokens=2000,
         )
 
         # Fill in metadata
src/backend/app/tools/patient_parser.py CHANGED
@@ -61,6 +61,7 @@ class PatientParserTool:
             response_model=PatientProfile,
             system_prompt=SYSTEM_PROMPT,
             temperature=0.1,  # Low temp for factual extraction
+            max_tokens=1500,
         )
         logger.info(f"Parsed patient profile: {profile.chief_complaint}")
         return profile
src/backend/app/tools/synthesis.py CHANGED
@@ -22,47 +22,75 @@ from app.services.medgemma import MedGemmaService
 
 logger = logging.getLogger(__name__)
 
-SYSTEM_PROMPT = """You are a clinical decision support synthesis engine. Your job is to
-combine outputs from multiple clinical tools into a single, cohesive report for a clinician.
-
-CRITICAL RULES:
-1. Be concise and clinically precise
-2. Prioritize safety — drug interactions and critical findings go first
-3. Clearly distinguish between tool-verified facts and model-generated reasoning
-4. Always include caveats and limitations
-5. Cite sources when available
-6. This report SUPPORTS clinical decision-making — it does NOT replace clinician judgment
-7. Include a standard disclaimer about AI-generated content"""
-
-SYNTHESIS_PROMPT = """Synthesize the following clinical tool outputs into a cohesive
-Clinical Decision Support report.
+SYSTEM_PROMPT = """You are an expert clinical arbiter and decision support engine. You receive
+an initial differential diagnosis from a clinical reasoning agent, PLUS independent evidence
+from drug-interaction checks, clinical guideline retrieval, and conflict detection.
+
+Your job is NOT merely to format these outputs. You are the FINAL DECISION MAKER:
+1. CRITICALLY RE-EVALUATE the initial differential using ALL available evidence.
+2. RE-RANK diagnoses: promote diagnoses that gain guideline/drug/conflict support;
+   demote diagnoses that lose support or are contradicted.
+3. ADD any diagnosis that the evidence strongly suggests but was MISSING from the initial list.
+4. REMOVE or deprioritize diagnoses that are inconsistent with guideline-based evidence.
+5. For the top diagnosis, explicitly state which evidence (guideline excerpts, drug signals,
+   conflict findings) supports or contradicts it.
+6. Prioritize safety — drug interactions and critical conflicts go first.
+7. This report SUPPORTS clinical decision-making — it does NOT replace clinician judgment.
+8. Be concise and clinically precise. Cite sources.
+
+You are an independent reviewer, not a rubber stamp. If the initial reasoning is wrong,
+override it with evidence-based conclusions."""
+
+SYNTHESIS_PROMPT = """You are given outputs from multiple independent clinical analysis tools.
+Your task is to act as an ARBITER: critically evaluate all evidence and produce a final,
+evidence-based Clinical Decision Support report.
 
 ═══ PATIENT PROFILE ═══
 {patient_profile}
 
-═══ CLINICAL REASONING (MedGemma) ═══
+═══ INITIAL CLINICAL REASONING (from reasoning agent) ═══
 {clinical_reasoning}
 
-═══ DRUG INTERACTION CHECK ═══
+═══ DRUG INTERACTION CHECK (independent tool) ═══
 {drug_interactions}
 
-═══ CLINICAL GUIDELINES ═══
+═══ CLINICAL GUIDELINES (RAG retrieval — independent evidence) ═══
 {guidelines}
 
-═══ CONFLICTS & GAPS DETECTED ═══
+═══ CONFLICTS & GAPS DETECTED (independent analysis) ═══
 {conflicts}
 
-Create a comprehensive CDS report including:
+══════════════════════════════════════
+ARBITRATION INSTRUCTIONS — Follow these steps:
+══════════════════════════════════════
+
+STEP 1 — CHALLENGE THE INITIAL DIFFERENTIAL:
+For each diagnosis in the initial reasoning, ask:
+• Does the guideline evidence SUPPORT or CONTRADICT this diagnosis?
+• Do the drug interactions or conflict findings change the likelihood?
+• Is there a diagnosis NOT in the initial list that the guidelines strongly suggest?
+
+STEP 2 — RE-RANK AND REVISE:
+Produce a REVISED differential diagnosis list. This may differ from the initial one.
+• Promote diagnoses with strong guideline concordance.
+• Demote diagnoses contradicted by evidence.
+• Add new diagnoses suggested by guideline/conflict evidence.
+• For each diagnosis, state the supporting AND contradicting evidence.
+
+STEP 3 — PRODUCE THE FINAL REPORT:
 1. Patient Summary — concise summary of the case
-2. Differential Diagnosis — ranked with reasoning, integrating guideline concordance
+2. Differential Diagnosis — YOUR REVISED ranking (not just a copy of the initial one),
+   with explicit evidence citations for each diagnosis
 3. Drug Interaction Warnings — any flagged interactions with clinical significance
 4. Guideline-Concordant Recommendations — actionable steps aligned with guidelines
-5. Conflicts & Gaps — PROMINENTLY include every detected conflict. For each conflict,
-   state what the guideline recommends, what the patient's current state is, and the
-   suggested resolution. This section is CRITICAL for patient safety.
-6. Suggested Next Steps — prioritized actions for the clinician, incorporating conflict resolutions
-7. Caveats — limitations, uncertainties, and important disclaimers
-8. Sources — cited guidelines and data sources used"""
 
 
 class SynthesisTool:
@@ -104,7 +132,7 @@ class SynthesisTool:
             response_model=CDSReport,
             system_prompt=SYSTEM_PROMPT,
             temperature=0.2,
-            max_tokens=4096,
         )
 
         # Add standard disclaimer to caveats
84
  3. Drug Interaction Warnings — any flagged interactions with clinical significance
85
  4. Guideline-Concordant Recommendations — actionable steps aligned with guidelines
86
+ 5. Conflicts & Gaps — PROMINENTLY include every detected conflict. For each:
87
+ state what the guideline recommends vs. patient's current state, and the resolution.
88
+ 6. Suggested Next Steps prioritized actions incorporating ALL evidence
89
+ 7. Caveatslimitations, uncertainties, disclaimers
90
+ 8. Sourcescited guidelines and data sources
91
+
92
+ IMPORTANT: Your differential diagnosis MUST reflect your independent arbiter judgment,
93
+ not merely repeat the initial reasoning. If evidence changes the ranking, CHANGE IT."""
94
 
95
 
96
  class SynthesisTool:
 
132
  response_model=CDSReport,
133
  system_prompt=SYSTEM_PROMPT,
134
  temperature=0.2,
135
+ max_tokens=3000,
136
  )
137
 
138
  # Add standard disclaimer to caveats
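The SYNTHESIS_PROMPT above is a plain str.format template with five placeholders. A trimmed sketch of how it would render (the actual call site is not in this diff, and the clinical values below are invented for illustration):

```python
# Sketch only: a trimmed SYNTHESIS_PROMPT rendered the way a
# str.format-based pipeline would fill it. Placeholder names match the
# diff; the patient data is invented.
SYNTHESIS_PROMPT = (
    "═══ PATIENT PROFILE ═══\n{patient_profile}\n\n"
    "═══ INITIAL CLINICAL REASONING (from reasoning agent) ═══\n{clinical_reasoning}\n\n"
    "═══ DRUG INTERACTION CHECK (independent tool) ═══\n{drug_interactions}\n\n"
    "═══ CLINICAL GUIDELINES (RAG retrieval — independent evidence) ═══\n{guidelines}\n\n"
    "═══ CONFLICTS & GAPS DETECTED (independent analysis) ═══\n{conflicts}"
)

rendered = SYNTHESIS_PROMPT.format(
    patient_profile="72F, CKD stage 3, on lisinopril and spironolactone",
    clinical_reasoning="1. Hyperkalemia  2. Acute kidney injury",
    drug_interactions="lisinopril + spironolactone: hyperkalemia risk (major)",
    guidelines="KDIGO: review RAAS blockade when K+ > 5.5 mmol/L",
    conflicts="Guideline advises K+ monitoring; no recent K+ on file",
)
print(rendered.splitlines()[1])  # 72F, CKD stage 3, on lisinopril and spironolactone
```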
src/backend/check_progress.py ADDED
@@ -0,0 +1,47 @@
+ """Quick progress checker for validation run."""
+ import json
+ from pathlib import Path
+
+ checkpoint = Path("validation/results/medqa_checkpoint.jsonl")
+ if not checkpoint.exists():
+     print("No checkpoint file found")
+     exit()
+
+ lines = checkpoint.read_text(encoding="utf-8").strip().split("\n")
+ print(f"Completed: {len(lines)}/50")
+
+ matches = 0
+ diff_matches = 0
+ top3_matches = 0
+ failures = 0
+
+ for line in lines:
+     d = json.loads(line)
+     det = d.get("details", {})
+     scores = d.get("scores", {})
+     loc = det.get("match_location", "not_found")
+
+     if not d.get("success"):
+         failures += 1
+     if loc != "not_found":
+         matches += 1
+     if loc == "differential":
+         diff_matches += 1
+     if scores.get("top3_accuracy", 0) > 0:
+         top3_matches += 1
+
+ print(f"Pipeline success: {len(lines) - failures}/{len(lines)}")
+ print(f"Mentioned matches: {matches}/{len(lines)} ({100*matches/len(lines):.0f}%)")
+ print(f"Differential matches: {diff_matches}/{len(lines)} ({100*diff_matches/len(lines):.0f}%)")
+ print(f"Top-3 matches: {top3_matches}/{len(lines)} ({100*top3_matches/len(lines):.0f}%)")
+
+ # Show last 5 cases
+ print("\nRecent cases:")
+ for line in lines[-5:]:
+     d = json.loads(line)
+     det = d.get("details", {})
+     correct = det.get("correct_answer", "?")[:45]
+     top = det.get("top_diagnosis", "?")[:45]
+     loc = det.get("match_location", "not_found")
+     t = d.get("pipeline_time_ms", 0)
+     print(f" {d['case_id']}: [{loc}] {t/1000:.0f}s | correct={correct} | top={top}")
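Each line of the checkpoint file is one JSON object (JSON Lines). A record carrying the fields check_progress.py reads, with invented values, can be sketched as:

```python
import json

# Illustrative checkpoint record matching the keys check_progress.py reads
# (case_id, success, pipeline_time_ms, scores, details.match_location).
# All field values here are made up.
record = {
    "case_id": "medqa_007",
    "success": True,
    "pipeline_time_ms": 42000,
    "scores": {"top3_accuracy": 1.0},
    "details": {
        "match_location": "differential",
        "correct_answer": "Acute pancreatitis",
        "top_diagnosis": "Acute pancreatitis",
    },
}
line = json.dumps(record)  # one such line per case in medqa_checkpoint.jsonl
print(json.loads(line)["details"]["match_location"])  # differential
```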
src/backend/validation/base.py CHANGED
@@ -28,7 +28,7 @@ if str(BACKEND_DIR) not in sys.path:
      sys.path.insert(0, str(BACKEND_DIR))

  from app.agent.orchestrator import Orchestrator
- from app.models.schemas import CaseSubmission, CDSReport, AgentState


  # ──────────────────────────────────────────────
@@ -103,7 +103,20 @@ async def run_cds_pipeline(
          async for _step_update in orchestrator.run(case):
              pass  # consume all step updates

-         return orchestrator.state, orchestrator.get_result(), None
      except asyncio.TimeoutError:
          return orchestrator.state, None, f"Pipeline timed out after {timeout_sec}s"
      except Exception as e:
@@ -159,12 +172,14 @@ def diagnosis_in_differential(
      target_diagnosis: str,
      report: CDSReport,
      top_n: Optional[int] = None,
- ) -> tuple[bool, int]:
      """
      Check if target_diagnosis appears in the report's differential.

      Returns:
-         (found, rank) — rank is 0-indexed position, or -1 if not found
      """
      diagnoses = report.differential_diagnosis
      if top_n:
@@ -172,18 +187,29 @@

      for i, dx in enumerate(diagnoses):
          if fuzzy_match(dx.diagnosis, target_diagnosis):
-             return True, i

-     # Also check the full report text (patient_summary, guideline_recommendations, etc.)
      full_text = " ".join([
          report.patient_summary or "",
          " ".join(report.guideline_recommendations),
          " ".join(a.action for a in report.suggested_next_steps),
      ])
      if fuzzy_match(full_text, target_diagnosis, threshold=0.3):
-         return True, len(diagnoses)  # found but not in differential

-     return False, -1


  # ──────────────────────────────────────────────

      sys.path.insert(0, str(BACKEND_DIR))

  from app.agent.orchestrator import Orchestrator
+ from app.models.schemas import CaseSubmission, CDSReport, AgentState, AgentStepStatus


  # ──────────────────────────────────────────────

          async for _step_update in orchestrator.run(case):
              pass  # consume all step updates

+         report = orchestrator.get_result()
+
+         # If no report was produced, collect errors from failed steps
+         if report is None and orchestrator.state:
+             failed_steps = [
+                 s for s in orchestrator.state.steps
+                 if s.status == AgentStepStatus.FAILED
+             ]
+             if failed_steps:
+                 error_msgs = [f"{s.step_id}: {s.error}" for s in failed_steps]
+                 return orchestrator.state, None, "; ".join(error_msgs)
+             return orchestrator.state, None, "Pipeline completed but produced no report"
+
+         return orchestrator.state, report, None
      except asyncio.TimeoutError:
          return orchestrator.state, None, f"Pipeline timed out after {timeout_sec}s"
      except Exception as e:

      target_diagnosis: str,
      report: CDSReport,
      top_n: Optional[int] = None,
+ ) -> tuple[bool, int, str]:
      """
      Check if target_diagnosis appears in the report's differential.

      Returns:
+         (found, rank, match_location) — rank is 0-indexed position, or -1 if not found.
+         match_location is one of: "differential", "next_steps", "recommendations",
+         "fulltext", or "not_found".
      """
      diagnoses = report.differential_diagnosis
      if top_n:

      for i, dx in enumerate(diagnoses):
          if fuzzy_match(dx.diagnosis, target_diagnosis):
+             return True, i, "differential"
+
+     # Check suggested_next_steps (for management-type answers)
+     for i, action in enumerate(report.suggested_next_steps):
+         if fuzzy_match(action.action, target_diagnosis):
+             return True, len(diagnoses) + i, "next_steps"
+
+     # Check guideline recommendations (for treatment-type answers)
+     for i, rec in enumerate(report.guideline_recommendations):
+         if fuzzy_match(rec, target_diagnosis):
+             return True, len(diagnoses) + len(report.suggested_next_steps) + i, "recommendations"

+     # Broad fulltext check (patient_summary, recommendations, next steps combined)
      full_text = " ".join([
          report.patient_summary or "",
          " ".join(report.guideline_recommendations),
          " ".join(a.action for a in report.suggested_next_steps),
+         " ".join(dx.reasoning for dx in report.differential_diagnosis),
      ])
      if fuzzy_match(full_text, target_diagnosis, threshold=0.3):
+         return True, len(diagnoses), "fulltext"

+     return False, -1, "not_found"


  # ──────────────────────────────────────────────
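fuzzy_match itself is defined elsewhere in validation/base.py and is not part of this diff; the calls above only rely on the signature fuzzy_match(text, target, threshold=...). A plausible token-overlap sketch, not the project's actual implementation:

```python
# Hypothetical fuzzy_match: the real one lives elsewhere in
# validation/base.py. This sketch matches when enough of the target's
# tokens appear in the candidate text.
def fuzzy_match(text: str, target: str, threshold: float = 0.6) -> bool:
    """True if at least `threshold` of the target's tokens occur in text."""
    text_tokens = set(text.lower().split())
    target_tokens = set(target.lower().split())
    if not target_tokens:
        return False
    overlap = len(target_tokens & text_tokens) / len(target_tokens)
    return overlap >= threshold

print(fuzzy_match("acute myocardial infarction is most likely", "Myocardial infarction"))  # True
print(fuzzy_match("community-acquired pneumonia", "pulmonary embolism"))  # False
```

A lower threshold (such as the 0.3 used for the full-text pass above) trades precision for recall, which suits a broad "mentioned anywhere" check.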
src/backend/validation/harness_medqa.py CHANGED
@@ -243,39 +243,56 @@ async def validate_medqa(
          correct_answer = case.ground_truth["correct_answer"]

          if report:
-             # Top-1 accuracy
-             found_top1, rank = diagnosis_in_differential(correct_answer, report, top_n=1)
              scores["top1_accuracy"] = 1.0 if found_top1 else 0.0

-             # Top-3 accuracy
-             found_top3, rank3 = diagnosis_in_differential(correct_answer, report, top_n=3)
              scores["top3_accuracy"] = 1.0 if found_top3 else 0.0

-             # Mentioned anywhere
-             found_any, rank_any = diagnosis_in_differential(correct_answer, report)
              scores["mentioned_accuracy"] = 1.0 if found_any else 0.0

              # Parse success
              scores["parse_success"] = 1.0

              details = {
                  "correct_answer": correct_answer,
-                 "top_diagnosis": report.differential_diagnosis[0].diagnosis if report.differential_diagnosis else "NONE",
                  "num_diagnoses": len(report.differential_diagnosis),
                  "found_at_rank": rank_any if found_any else -1,
              }

-             status_icon = "✓" if found_top3 else "✗"
-             print(f"{status_icon} top1={'Y' if found_top1 else 'N'} top3={'Y' if found_top3 else 'N'} ({elapsed_ms}ms)")
          else:
              scores = {
                  "top1_accuracy": 0.0,
                  "top3_accuracy": 0.0,
                  "mentioned_accuracy": 0.0,
                  "parse_success": 0.0,
              }
-             details = {"correct_answer": correct_answer, "error": error}
-             print(f" FAILED: {error[:80] if error else 'unknown'}")

          result = ValidationResult(
              case_id=case.case_id,
@@ -300,7 +317,7 @@ async def validate_medqa(
      successful = sum(1 for r in results if r.success)

      # Average each metric across successful cases only
-     metric_names = ["top1_accuracy", "top3_accuracy", "mentioned_accuracy", "parse_success"]
      metrics = {}
      for m in metric_names:
          values = [r.scores.get(m, 0.0) for r in results]

          correct_answer = case.ground_truth["correct_answer"]

          if report:
+             # Top-1 accuracy (differential only)
+             found_top1, rank1, loc1 = diagnosis_in_differential(correct_answer, report, top_n=1)
              scores["top1_accuracy"] = 1.0 if found_top1 else 0.0

+             # Top-3 accuracy (differential only)
+             found_top3, rank3, loc3 = diagnosis_in_differential(correct_answer, report, top_n=3)
              scores["top3_accuracy"] = 1.0 if found_top3 else 0.0

+             # Mentioned anywhere (differential + next_steps + recommendations + fulltext)
+             found_any, rank_any, loc_any = diagnosis_in_differential(correct_answer, report)
              scores["mentioned_accuracy"] = 1.0 if found_any else 0.0

+             # Differential-only accuracy (strict: only counts differential matches)
+             found_diff_only, rank_diff, loc_diff = diagnosis_in_differential(correct_answer, report)
+             scores["differential_accuracy"] = 1.0 if (found_diff_only and loc_diff == "differential") else 0.0
+
              # Parse success
              scores["parse_success"] = 1.0

+             # Rich details for debugging
+             all_dx = [dx.diagnosis for dx in report.differential_diagnosis]
+             all_next = [a.action for a in report.suggested_next_steps]
+             all_recs = list(report.guideline_recommendations)
+
              details = {
                  "correct_answer": correct_answer,
+                 "top_diagnosis": all_dx[0] if all_dx else "NONE",
+                 "all_diagnoses": all_dx,
+                 "all_next_steps": all_next[:5],
+                 "all_recommendations": all_recs[:5],
                  "num_diagnoses": len(report.differential_diagnosis),
                  "found_at_rank": rank_any if found_any else -1,
+                 "match_location": loc_any,
+                 "patient_summary": report.patient_summary[:300] if report.patient_summary else "",
              }

+             # Richer console output
+             loc_tag = f"[{loc_any}]" if found_any else ""
+             status_icon = "+" if found_any else "-"
+             print(f"{status_icon} top1={'Y' if found_top1 else 'N'} top3={'Y' if found_top3 else 'N'} diff={'Y' if loc_any=='differential' else 'N'} {loc_tag} ({elapsed_ms}ms)")
          else:
              scores = {
                  "top1_accuracy": 0.0,
                  "top3_accuracy": 0.0,
                  "mentioned_accuracy": 0.0,
+                 "differential_accuracy": 0.0,
                  "parse_success": 0.0,
              }
+             details = {"correct_answer": correct_answer, "error": error, "match_location": "not_found"}
+             print(f"- FAILED: {error[:80] if error else 'unknown'}")

          result = ValidationResult(
              case_id=case.case_id,

      successful = sum(1 for r in results if r.success)

      # Average each metric across successful cases only
+     metric_names = ["top1_accuracy", "top3_accuracy", "mentioned_accuracy", "differential_accuracy", "parse_success"]
      metrics = {}
      for m in metric_names:
          values = [r.scores.get(m, 0.0) for r in results]
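The aggregation loop above is only partially shown (the averaging line falls outside the hunk). The intended pattern, sketched with plain dicts standing in for ValidationResult.scores; the mean computation is an assumption:

```python
# Stand-ins for ValidationResult.scores; values are invented.
results = [
    {"top1_accuracy": 0.0, "top3_accuracy": 1.0, "parse_success": 1.0},
    {"top1_accuracy": 1.0, "top3_accuracy": 1.0, "parse_success": 1.0},
]
metric_names = ["top1_accuracy", "top3_accuracy", "mentioned_accuracy",
                "differential_accuracy", "parse_success"]

metrics = {}
for m in metric_names:
    values = [r.get(m, 0.0) for r in results]  # missing scores count as 0.0
    metrics[m] = sum(values) / len(values) if values else 0.0

print(metrics["top1_accuracy"])      # 0.5
print(metrics["mentioned_accuracy"]) # 0.0
```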
src/backend/validation/harness_pmc.py CHANGED
@@ -373,15 +373,15 @@ async def validate_pmc(

          if report:
              # Diagnostic accuracy (anywhere in differential)
-             found_any, rank_any = diagnosis_in_differential(target_diagnosis, report)
              scores["diagnostic_accuracy"] = 1.0 if found_any else 0.0

              # Top-3 accuracy
-             found_top3, rank3 = diagnosis_in_differential(target_diagnosis, report, top_n=3)
              scores["top3_accuracy"] = 1.0 if found_top3 else 0.0

              # Top-1 accuracy
-             found_top1, rank1 = diagnosis_in_differential(target_diagnosis, report, top_n=1)
              scores["top1_accuracy"] = 1.0 if found_top1 else 0.0

              # Parse success

          if report:
              # Diagnostic accuracy (anywhere in differential)
+             found_any, rank_any, loc_any = diagnosis_in_differential(target_diagnosis, report)
              scores["diagnostic_accuracy"] = 1.0 if found_any else 0.0

              # Top-3 accuracy
+             found_top3, rank3, loc3 = diagnosis_in_differential(target_diagnosis, report, top_n=3)
              scores["top3_accuracy"] = 1.0 if found_top3 else 0.0

              # Top-1 accuracy
+             found_top1, rank1, loc1 = diagnosis_in_differential(target_diagnosis, report, top_n=1)
              scores["top1_accuracy"] = 1.0 if found_top1 else 0.0

              # Parse success
src/backend/validation/run_validation.py CHANGED
@@ -18,6 +18,7 @@ from __future__ import annotations

  import asyncio
  import json
  import sys
  import time
  from datetime import datetime, timezone
@@ -28,6 +29,12 @@ BACKEND_DIR = Path(__file__).resolve().parent.parent
  if str(BACKEND_DIR) not in sys.path:
      sys.path.insert(0, str(BACKEND_DIR))

  from validation.base import (
      ValidationSummary,
      print_summary,
@@ -163,7 +170,7 @@ def _print_combined_summary(results: dict, total_duration: float):
      )

      # All metrics
-     print(f"\n {'─' * 66}")
      for name, summary in results.items():
          print(f"\n {name.upper()} metrics:")
          for metric, value in sorted(summary.metrics.items()):
@@ -252,9 +259,9 @@ Examples:
      run_mtsamples = args.all or args.mtsamples
      run_pmc = args.all or args.pmc

-     print("╔════════════════════════════════════════════════════════╗")
-     print(" Clinical Decision Support Agent Validation Suite")
-     print("╚════════════════════════════════════════════════════════╝")
      print(f"\n Datasets: {'MedQA ' if run_medqa else ''}{'MTSamples ' if run_mtsamples else ''}{'PMC ' if run_pmc else ''}")
      print(f" Cases/dataset: {args.max_cases}")
      print(f" Drug check: {'Yes' if not args.no_drugs else 'No'}")

  import asyncio
  import json
+ import os
  import sys
  import time
  from datetime import datetime, timezone

  if str(BACKEND_DIR) not in sys.path:
      sys.path.insert(0, str(BACKEND_DIR))

+ # Load .env and export HF_TOKEN so huggingface_hub picks it up
+ from dotenv import load_dotenv
+ load_dotenv(BACKEND_DIR / ".env")
+ if os.getenv("HF_TOKEN"):
+     os.environ["HF_TOKEN"] = os.getenv("HF_TOKEN")
+
  from validation.base import (
      ValidationSummary,
      print_summary,

      )

      # All metrics
+     print(f"\n {'-' * 66}")
      for name, summary in results.items():
          print(f"\n {name.upper()} metrics:")
          for metric, value in sorted(summary.metrics.items()):

      run_mtsamples = args.all or args.mtsamples
      run_pmc = args.all or args.pmc

+     print("=" * 58)
+     print(" Clinical Decision Support Agent - Validation Suite")
+     print("=" * 58)
      print(f"\n Datasets: {'MedQA ' if run_medqa else ''}{'MTSamples ' if run_mtsamples else ''}{'PMC ' if run_pmc else ''}")
      print(f" Cases/dataset: {args.max_cases}")
      print(f" Drug check: {'Yes' if not args.no_drugs else 'No'}")
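load_dotenv comes from the third-party python-dotenv package and already injects parsed variables into os.environ, so the explicit HF_TOKEN re-export above is defensive. A stdlib-only emulation of the idea, using an invented DEMO_TOKEN key and a throwaway file:

```python
import os
from pathlib import Path

# Emulates roughly what load_dotenv does, for illustration only: parse
# KEY=VALUE lines and inject them into os.environ. DEMO_TOKEN and the
# file name are invented; the real code uses python-dotenv and HF_TOKEN.
env_file = Path("demo.env")
env_file.write_text("DEMO_TOKEN=hf_example_token\n")

for raw in env_file.read_text().splitlines():
    raw = raw.strip()
    if raw and not raw.startswith("#") and "=" in raw:
        key, _, value = raw.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

env_file.unlink()
print(os.environ["DEMO_TOKEN"])  # hf_example_token
```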