Multi-Agent Biomedical Reasoning Grounded in EMR & Guidelines
MedSwin is an evidence-constrained clinical QA stack: specialised agents coordinate retrieval, EMR summarisation, guideline synthesis, and safety critique, while a calibrated reranker enforces evidence sufficiency under a token budget.
At a Glance

- 500k+ curated & synthetic cases
- LoRA/QLoRA, knowledge distillation, GRPO
- Agentic RAG (node & graph) with citations
- Safety rails, human-in-the-loop review, uncertainty flags
- HPC-reproducible training
Trust Stack
Design primitives for clinical deployment readiness
- Typed artifacts with explicit provenance (document IDs, sections, timestamps, chunk offsets) enable replay and review.
- Retrieval is accepted only when EMR/CPG coverage targets are met under a strict token budget.
- The long-context biomedical reranker outputs calibrated probabilities for deterministic inclusion policies.
- The critic checks for missing evidence, contraindications, and unsafe advice, then routes to "retrieve-more" when needed.
Overview

Accuracy, traceability, and safety via multi-agent orchestration and evidence-grounded generation.

Specialist Agents
Diagnostics (differentials + red flags), Pharmacology (DDIs, dosing), and Triage (urgency & disposition).

Reasoning Orchestrator
The MCP planner routes tasks, fuses evidence, and enforces self-consistency with safe refusal and uncertainty flags.
MedSwin treats clinical QA as an evidence-constrained pipeline, producing an answer, a compact evidence bundle under a context budget, and a structured trace for audit and safety review.
Agentic RAG

Graph & Node RAG over EMR/EHR + PubMed with section-aware chunking, source allowlists, and citations.

- A role-based team exchanges typed artifacts: Query Normaliser, Evidence Retriever, EMR Summariser, Guideline Synthesiser, and Safety Critic.
- Hybrid dense + lexical candidate generation, followed by long-context biomedical reranking with calibrated scores for deterministic evidence inclusion policies.
- A compact medical model is trained with large-scale augmentation (SFT) and then refined via hard/soft-label KD from a larger teacher, enabling institution-controlled deployment.
Core Capabilities

Modelling & optimisation to drive accurate, inspectable clinical reasoning.

LoRA/QLoRA
Parameter-efficient adapters let us fine-tune 7–13B models on modest GPUs while retaining high performance.
What's New in MedSwin

Three core contributions: auditable orchestration, policy-aware two-stage biomedical retrieval, and a distilled 7B model pipeline.
Knowledge Distillation
Teacher→Student compression delivers fast, strong specialists with small runtime footprints.

GRPO Reasoning
A reinforcement learning variant targeting multi-step, self-consistent reasoning at lower compute cost.
1) Auditable Multi-Agent Orchestration

Every agent produces artifacts with provenance metadata and logs its tool calls and selected evidence, creating a structured audit trail suitable for review and replay.

- Typed artifacts: IDs, sections, timestamps, offsets
- Deterministic "retrieve-more" instead of guessing
- Critique + safety checks before finalisation
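As an illustration, a typed artifact might look like the sketch below. The field names follow the provenance metadata described above, but the class itself and its method are hypothetical, not the actual MedSwin schema.

```python
# Hypothetical sketch of a typed evidence artifact with provenance fields.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class EvidenceArtifact:
    doc_id: str        # source document identifier
    section: str       # section tag within the document
    chunk_start: int   # character offset where the chunk begins
    chunk_end: int     # character offset where the chunk ends
    text: str          # the evidence span itself
    retrieved_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def provenance(self) -> dict:
        """Compact provenance record for the audit trail."""
        return {
            "doc_id": self.doc_id,
            "section": self.section,
            "offsets": (self.chunk_start, self.chunk_end),
            "retrieved_at": self.retrieved_at,
        }
```

Because the artifact is frozen and carries its own timestamps and offsets, any downstream decision can be replayed against the exact evidence spans that produced it.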
2) Two-Stage Retrieval with Calibrated Reranking

Stage 1 retrieves candidates as the union of dense retrieval and BM25. Stage 2 reranks with a long-context biomedical reranker (LoRA-adapted) and outputs calibrated probabilities used by policy constraints.

- Hybrid candidate pool: ANN + lexical coverage
- Evidence sufficiency thresholds for EMR/CPG
- Budgeted, diverse selection (MMR-style)
3) Distilled 7B Medical LLM Pipeline

A compact student is built with SFT on augmented biomedical QA, then refined using hard + soft KD from a larger teacher. This targets deployability while preserving calibrated reasoning behaviour.

- Large-scale augmentation with semantic checks
- Hard labels expand coverage; soft labels preserve uncertainty
- PEFT (QLoRA/LoRA) enables training on modest GPUs
QAC + Counterfactuals
Paraphrasing, chunking, and "what-if" synthesis improve robustness across presentation styles.
From question → audited answer

MedSwin outputs an answer only when the evidence bundle is sufficient. Otherwise it asks clarifying questions or expands retrieval.

1. Normalise: canonicalise terms, expand abbreviations, form retrieval probes.
2. Retrieve + Rank: generate hybrid candidates, rerank with calibrated probabilities, enforce sufficiency.
3. Synthesise + Critique: summarise the EMR, synthesise guideline actions, run the safety critique, then answer.
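The gated flow above can be sketched as a small control loop. All function names, the sufficiency threshold, and the probe-expansion step are illustrative assumptions, not the actual MedSwin API.

```python
# Illustrative control loop for the normalise → retrieve → synthesise flow.
# `retrieve`, `score_sufficiency`, and `synthesise` are caller-supplied stubs.
def answer(question, retrieve, score_sufficiency, synthesise,
           threshold=0.8, max_rounds=3):
    probes = [question.lower().strip()]   # stand-in for query normalisation
    for _ in range(max_rounds):
        evidence = retrieve(probes)
        if score_sufficiency(evidence) >= threshold:
            return synthesise(question, evidence)  # answer with citations
        probes.append(question + " (expanded)")    # "retrieve-more" round
    return None  # still insufficient: ask clarifying questions instead
```

Returning `None` rather than a low-confidence answer is the point: the loop either clears the sufficiency gate or explicitly declines.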
Safety Rails
Allowlists, section filters, and citation-required answers reduce hallucinations and protect privacy.
Architecture Explorer
Explore MedSwin layers: multi-agent workflow, two-stage retrieval, evidence sufficiency checks, and MAC coordination.
HPC Reproducibility
Deterministic seeds, LR scheduling, and checkpointing ensure auditability and consistent results.
High-Level System Diagram

```mermaid
flowchart LR
  U(["Clinician UI / EMR"]) -->|"q + patient context"| ORCH["Orchestrator (MAC)<br/>planning · policy checks · logging"]
  ORCH --> QN["Query Normaliser"]
  ORCH --> RET["Evidence Retriever"]
  ORCH --> EMRS["EMR Summariser"]
  ORCH --> GS["Guideline Synthesiser"]
  ORCH --> SC["Safety Critic"]

  subgraph IR["Two-Stage Retrieval (Budgeted)"]
    DENSE["Stage-1: Dense ANN (MedEmbed)"] --> CAND["Candidates (dense ∪ BM25)"]
    BM25["Stage-1: Lexical (BM25)"] --> CAND
    CAND --> RER["Stage-2: Long-context Reranker<br/>(LoRA-adapted, calibrated)"]
    RER --> SEL["Policy-aware selection<br/>MMR + sufficiency constraints"]
  end

  RET --> IR
  SEL --> EVID["Evidence bundle M<br/>EMR + CPG + metadata"]

  EMRS --> STATE["Clinical state summary"]
  GS --> ACTIONS["Guideline actions + contraindications"]
  SC --> FLAGS["Safety flags / missing evidence"]

  EVID --> FUSE["Evidence-constrained synthesis"]
  STATE --> FUSE
  ACTIONS --> FUSE
  FLAGS --> FUSE
  FUSE --> OUT{{"Final answer + citations + cautions<br/>+ structured trace"}}
```
Architecture
Click tabs to switch between layers.

```mermaid
flowchart LR
  U(["Clinician UI / EMR"]) -->|"symptoms, meds, files"| MCP["MCP Orchestrator<br/>FastAPI routing, planning, safety, tracing"]
  MCP --> DX["Diagnostics Agent"]
  MCP --> RX["Pharmacology Agent"]
  subgraph RAG["Agentic RAG"]
    QR["Query Router"] --> RET["Retriever"]
    RET --> SR["Safety Rails"]
  end
  DX --> RAG
  RX --> RAG
  SR --> KB[("Med KB / PubMed")]
  SR --> EMR[("EMR/EHR summaries")]
  DX --> FUSE["Evidence Fusion + Self-Consistency"]
  RX --> FUSE
  FUSE --> OUT{{"Final Report<br/>summary, plan, citations, cautions"}}
  OUT --> QA["Evaluation & QA<br/>MedMCQA, PubMedQA, similarity audits"]
```
Two-Stage Retrieval & Calibrated Reranking

MedSwin selects a compact, diverse evidence set under budget using hybrid retrieval, a long-context biomedical reranker, and policy-aware selection with sufficiency constraints.
Data, Training & Reproducibility

From 500k+ cases to specialised, efficient agents.

Datasets & Augmentation

- 500k+ curated & synthetic clinical cases across specialties
- QAC paraphrasing & chunking, self-consistency sampling
- Counterfactual case generation, back-translation
Pipeline (click to expand)

Retrieve the top-K dense candidates using ANN over biomedical embeddings, then take the union with BM25 results to handle rare terms and abbreviations.

// Candidate pool
C(q) = TopK'( C_dense(q) ∪ C_lex(q) ),  K' ≥ K
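As a sketch of this candidate-pool step, assuming simple (doc_id, score) pairs and max-score fusion when a document appears in both lists (the fusion rule is an assumption; the formula above only specifies the union):

```python
# Stage-1 candidate pool: union of dense and lexical hits, truncated to K'.
def candidate_pool(dense_hits, lexical_hits, k_prime):
    # dense_hits / lexical_hits: lists of (doc_id, score) pairs
    pooled = {}
    for doc_id, score in dense_hits + lexical_hits:
        # keep the best score seen for each document across both retrievers
        pooled[doc_id] = max(pooled.get(doc_id, float("-inf")), score)
    ranked = sorted(pooled.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k_prime]]
```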
A pointwise LLM reranker scores each (query, passage) pair and provides a calibrated probability used by downstream policy checks.

// Calibrated probability (Platt / temperature scaling)
p_cal(q,d) = σ( (ℓ(q,d) − b) / T )
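The calibration step is a one-liner in code; the bias b and temperature T would be fit on held-out validation data (the fitting procedure is an assumption here).

```python
import math

# Map a raw reranker logit ℓ(q,d) to a calibrated probability using
# Platt-style bias b and temperature T, per the formula above.
def p_cal(logit, b=0.0, T=1.0):
    return 1.0 / (1.0 + math.exp(-(logit - b) / T))
```

A temperature T > 1 flattens the distribution, pulling overconfident logits toward 0.5, which is what makes deterministic inclusion thresholds meaningful.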
Fuse the calibrated reranker probability with dense/lexical signals and lightweight clinical priors, then select a diverse set under budget. Accept only if the EMR and CPG sufficiency targets are met; otherwise trigger "retrieve-more".
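The budgeted, MMR-style selection can be sketched as greedy relevance-minus-redundancy picking. The trade-off weight λ, the similarity function, and the token-cost accounting are illustrative assumptions.

```python
# Greedy MMR-style selection under a token budget (a sketch).
# `sim(a, b)` is any passage-passage similarity in [0, 1].
def mmr_select(candidates, sim, budget, lam=0.7):
    # candidates: list of (passage_id, relevance, token_cost)
    selected, used = [], 0
    remaining = list(candidates)
    while remaining:
        def mmr(c):
            pid, rel, _ = c
            # redundancy = similarity to the closest already-selected passage
            redundancy = max((sim(pid, s[0]) for s in selected), default=0.0)
            return lam * rel - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        remaining.remove(best)
        if used + best[2] <= budget:   # skip passages that would exceed budget
            selected.append(best)
            used += best[2]
    return [pid for pid, _, _ in selected]
```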
Fine-Tuning & KD

- Teacher→Student knowledge distillation
- LoRA/QLoRA adapters, GRPO for reasoning
- HPC runs: deterministic seeds, LR schedules, checkpoints
What "evidence sufficiency" means in practice

The orchestrator treats sufficiency as a gate: if the selected bundle lacks the required guideline and EMR coverage above calibrated thresholds, it will not synthesise a confident answer.

- Ensure key recommendations & contraindications are present (not just background text).
- Include the patient-specific medication/lab/history signals needed to avoid unsafe generalisations.
- Use MMR-style selection to avoid redundant passages and preserve coverage breadth.
- Detect missing evidence, conflicts, or contraindication risks; request "retrieve-more" if needed.
Trace fields: doc_id · guideline_version · section_tags · chunk_offsets · scores · thresholds · tool_calls
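A minimal sketch of such a gate follows; the per-type thresholds and the max-probability coverage measure are assumptions, not the system's actual policy.

```python
# Sufficiency gate: accept the bundle only when calibrated coverage for
# both EMR and guideline (CPG) evidence clears its threshold.
def sufficiency_gate(bundle, emr_min=0.7, cpg_min=0.7):
    # bundle: list of dicts like {"kind": "emr"|"cpg", "p_cal": float}
    def coverage(kind):
        probs = [e["p_cal"] for e in bundle if e["kind"] == kind]
        return max(probs, default=0.0)   # 0.0 when that evidence type is absent
    if coverage("emr") >= emr_min and coverage("cpg") >= cpg_min:
        return "answer"
    return "retrieve-more"
```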
Evaluation & Safety
Benchmarks, semantic audits, and runtime guards.
Data, Training & Distillation

MedSwin's deployable 7B model is produced via SFT on augmented biomedical QA, then KD (hard + soft labels) from a larger teacher, using PEFT techniques to fit modest GPU footprints.
Pipeline Timeline
A readable progression from data → model → deployable checkpoints.

- Augmentation: paraphrasing and multi-variant formatting, back-translation, style standardisation, PHI scrubbing, deduplication, and medical consistency checks to prevent semantic drift.
- SFT: the student learns consistent instruction following and robust clinical writing styles from mixed supervision sources.
- KD: hard labels expand coverage; soft labels preserve calibration and uncertainty. Training uses a combined CE + KL objective at temperature τ.
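The combined objective can be sketched in plain Python for a single example; α, τ, and the per-example framing are illustrative (a real implementation would operate on batched logits in a training framework).

```python
import math

# Combined KD objective: cross-entropy on the hard label plus
# KL(teacher ‖ student) on temperature-softened distributions.
def softmax(logits, tau=1.0):
    exps = [math.exp(l / tau) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(student_logits, teacher_logits, hard_label, alpha=0.5, tau=2.0):
    p_s = softmax(student_logits)
    ce = -math.log(p_s[hard_label])                     # hard-label CE
    q_t = softmax(teacher_logits, tau)                  # softened teacher
    q_s = softmax(student_logits, tau)                  # softened student
    kl = sum(t * math.log(t / s) for t, s in zip(q_t, q_s))
    # the τ² factor keeps soft-label gradients on the hard-label scale
    return alpha * ce + (1 - alpha) * (tau ** 2) * kl
```

When the student matches the teacher exactly, the KL term vanishes and only the hard-label term remains, which is why soft labels act as a regulariser rather than a replacement for supervision.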
- Merging: weight-space merging can combine SFT robustness and KD teacher-aligned behaviour without extra full training passes.
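A minimal sketch of such merging, assuming plain per-parameter linear interpolation between two checkpoints (real checkpoints would be framework state dicts of tensors; the mixing weight β is an assumption):

```python
# Weight-space merge: convex combination of SFT and KD checkpoints,
# applied per named parameter.
def merge_checkpoints(sft_weights, kd_weights, beta=0.5):
    # sft_weights / kd_weights: {param_name: list_of_floats}
    return {
        name: [(1 - beta) * a + beta * b
               for a, b in zip(sft_weights[name], kd_weights[name])]
        for name in sft_weights
    }
```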
Training Modules (click tabs)

SFT optimises token-level cross-entropy over instruction-formatted examples; stratified mixing reduces overfitting to any single genre.
Benchmarks
MedMCQA (medical exam QA) and PubMedQA (research abstract QA), complemented by semantic similarity audits with biomedical embeddings.
Evaluation & Safety

MedSwin evaluates beyond generic RAG metrics by emphasising retrieval quality, guideline coverage, and answer faithfulness, plus runtime guardrails.
Runtime Guards
Uncertainty prompts, refusal policies for out-of-scope queries, citation-required answers, and human-in-the-loop oversight to ensure safety.
What's measured

- Retrieval quality: how well the evidence bundle matches the clinical information need under budget.
- Guideline coverage: presence of actionable recommendations and contraindications, not just generic background.
- Faithfulness: does the final answer stay grounded in retrieved evidence and cite what it used?
Runtime guards

At inference time, MedSwin prioritises safety and transparency: when evidence is weak or incomplete, it avoids confident recommendations.

- If sufficiency fails, the system requests missing context or expands retrieval.
- Answers are paired with evidence references and trace-friendly provenance fields.
- Missing contraindications and unsafe advice are detected before the final response.
Open-Source & Reproducibility

MedSwin is designed to be inspectable: models, training recipes, retrieval calibration, and orchestration traces are publishable and replayable.

- Distilled biomedical LLM checkpoints, the LoRA-adapted biomedical reranker, and embedding model assets can be shared with configs and scripts.
- Candidate generation, reranking, calibration, and policy-aware selection are independently testable modules.
- Replayable traces show which agents ran, which sources were used, and which evidence was accepted under constraints.
Explore the live prototype, ingestion pipeline, and training collections.
Team

A multidisciplinary research team building an auditable medical AI system.