Multi-Agent Biomedical Reasoning Grounded in EMR & Guidelines
MedSwin is an evidence-constrained clinical QA stack: specialised agents coordinate retrieval, EMR summarisation, guideline synthesis, and safety critique, while a calibrated reranker enforces evidence sufficiency under a token budget.
Trust Stack
Design primitives for clinical deployment readiness
Typed artifacts + explicit provenance (document IDs, sections, timestamps, chunk offsets) enable replay and review.
Retrieval is accepted only when EMR/CPG coverage targets are met under a strict token budget.
Long-context biomedical reranker outputs calibrated probabilities for deterministic inclusion policies.
Critic checks for missing evidence, contraindications, and unsafe advice, then routes “retrieve-more” when needed.
Overview
MedSwin frames clinical QA as an evidence-constrained decision pipeline. Every answer is gated by evidence sufficiency, bounded by a strict context budget, and accompanied by a replayable trace suitable for audit and safety review.
Clinically phrased, uncertainty-aware output generated only when evidence gates are satisfied.
Compact EMR + guideline passages selected under token and diversity constraints.
Structured artifact log: retrieval, ranking, policies, safety checks.
1) Auditable Multi-Agent Orchestration
Every agent produces artifacts with provenance metadata and logs tool calls + selected evidence. This creates a structured audit trail suitable for review and replay.
- Typed artifacts: ids, sections, timestamps, offsets
- Deterministic “retrieve-more” instead of guessing
- Critique + safety checks before finalisation
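The typed-artifact idea can be sketched as a small dataclass. A minimal sketch, assuming a flat per-passage record: the field names (`doc_id`, `section`, `chunk_offset`, `retrieved_at`) mirror the provenance metadata listed above and are illustrative, not MedSwin’s actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class EvidenceArtifact:
    """One retrieved passage with full provenance for replay and audit."""
    doc_id: str          # source document identifier
    section: str         # section heading within the document
    chunk_offset: int    # character offset of the chunk in the source
    text: str            # the passage itself
    score: float         # calibrated relevance probability
    retrieved_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Each agent appends the artifacts it produces, building a replayable trace.
trace: list[EvidenceArtifact] = []
trace.append(EvidenceArtifact(
    doc_id="emr:pt-123/notes/2024-05-01",
    section="Assessment & Plan",
    chunk_offset=1024,
    text="Started metformin 500 mg BID for new T2DM diagnosis.",
    score=0.91,
))
```

Because artifacts are frozen and timestamped, a reviewer can replay the exact evidence an agent saw rather than reconstructing it after the fact.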
2) Two-Stage Retrieval with Calibrated Reranking
Stage-1 retrieves candidates via dense retrieval + BM25 union. Stage-2 reranks with a long-context biomedical reranker (LoRA-adapted) and outputs calibrated probabilities used by policy constraints.
- Hybrid candidate pool: ANN + lexical coverage
- Evidence sufficiency thresholds for EMR/CPG
- Budgeted, diverse selection (MMR-style)
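The stage-1 union and the threshold-based inclusion policy can be sketched as follows. This is a minimal sketch under stated assumptions: the `emr:`/`cpg:` id prefixes, the normalised scores, and the thresholds `tau_emr`/`tau_cpg` are illustrative, not MedSwin’s actual configuration.

```python
def hybrid_candidates(dense_hits, lexical_hits, k=50):
    """Union dense-ANN and BM25 hits, keeping the best score per passage.

    dense_hits / lexical_hits: dicts mapping passage_id -> score,
    assumed already normalised to [0, 1].
    """
    pool = {}
    for hits in (dense_hits, lexical_hits):
        for pid, score in hits.items():
            pool[pid] = max(pool.get(pid, 0.0), score)
    # Keep the top-k candidates for the stage-2 reranker.
    return sorted(pool, key=pool.get, reverse=True)[:k]

def accept(passages, reranker_probs, tau_emr=0.6, tau_cpg=0.6):
    """Deterministic inclusion: keep a passage only if its calibrated
    reranker probability clears the threshold for its source type."""
    kept = []
    for pid in passages:
        tau = tau_emr if pid.startswith("emr:") else tau_cpg
        if reranker_probs[pid] >= tau:
            kept.append(pid)
    return kept

dense = {"emr:a": 0.9, "cpg:b": 0.4}
lexical = {"cpg:b": 0.7, "cpg:c": 0.5}
candidates = hybrid_candidates(dense, lexical, k=3)
selected = accept(candidates, {"emr:a": 0.95, "cpg:b": 0.65, "cpg:c": 0.3})
```

Because the probabilities are calibrated, the thresholds act as interpretable policy knobs rather than arbitrary score cut-offs.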
3) Distilled 7B Medical LLM Pipeline
A compact student model is built with SFT on augmented biomedical QA, then refined using hard + soft knowledge distillation from a larger teacher. This targets deployability while preserving calibrated reasoning behaviour.
- Large-scale augmentation with semantic checks
- Hard labels expand coverage; soft labels preserve uncertainty
- PEFT (QLoRA/LoRA) enables modest GPU training
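The hard + soft objective can be written for a single token position as a weighted sum of cross-entropy on the ground-truth label and temperature-scaled KL against the teacher. A minimal sketch: `alpha` and `T` are illustrative hyperparameters, and the pure-Python softmax is for clarity, not efficiency.

```python
import math

def kd_loss(student_logits, teacher_logits, hard_label, alpha=0.5, T=2.0):
    """Hard + soft knowledge distillation at one token position.

    student_logits / teacher_logits: raw logits over the vocabulary.
    hard_label: index of the ground-truth token.
    """
    def softmax(xs, t=1.0):
        m = max(xs)
        exps = [math.exp((x - m) / t) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    # Hard term: cross-entropy against the ground-truth token.
    ce = -math.log(softmax(student_logits)[hard_label])

    # Soft term: KL(teacher || student) at temperature T, scaled by T^2
    # so gradient magnitudes stay comparable across temperatures.
    pt = softmax(teacher_logits, T)
    ps = softmax(student_logits, T)
    kl = sum(t * math.log(t / s) for t, s in zip(pt, ps))

    return alpha * ce + (1 - alpha) * (T * T) * kl
```

The soft term is what carries the teacher’s uncertainty into the student; dropping it leaves plain SFT on hard labels.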
From question → audited answer
MedSwin outputs an answer only when the evidence bundle is sufficient. Otherwise it asks clarifying questions or expands retrieval.
1) Normalise: canonicalise terms, expand abbreviations, and form retrieval probes.
2) Retrieve + Rank: gather hybrid candidates, rerank with calibrated probabilities, and enforce sufficiency.
3) Synthesise + Critique: summarise the EMR, synthesise guideline actions, run the safety critique, then answer.
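The three steps, plus the “retrieve-more” routing, can be sketched as one evidence-gated control loop. The hook functions (`retrieve`, `sufficient`, `synthesise`, `critique`, `expand`) are hypothetical placeholders for the corresponding agents, injected so the loop itself stays testable.

```python
def run_pipeline(question, retrieve, sufficient, synthesise, critique,
                 expand, max_rounds=3):
    """Evidence-gated control loop: answer only when the gate passes.

    All hook functions are injected placeholders for MedSwin's agents.
    """
    probes = [question]               # step 1 output: retrieval probes
    for _ in range(max_rounds):
        evidence = retrieve(probes)   # step 2: hybrid retrieve + rerank
        if not sufficient(evidence):
            probes = expand(probes)   # deterministic "retrieve-more"
            continue
        draft = synthesise(question, evidence)  # step 3: synthesis
        flags = critique(draft, evidence)       # safety critique
        if not flags:
            return draft
        probes = expand(probes)       # critic routes back to retrieval
    return None  # evidence never sufficed: ask a clarifying question
```

Returning `None` (rather than a best guess) is the point of the gate: an insufficient bundle yields a clarifying question, never an ungrounded answer.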
Architecture Explorer
Explore MedSwin layers: multi-agent workflow, two-stage retrieval, evidence sufficiency checks, and MAC coordination.
High-Level System Diagram
flowchart LR
U["Clinician UI / EMR"] -->|q + patient context| ORCH["Orchestrator (MAC)\nplanning · policy checks · logging"]
ORCH --> QN["Query Normaliser"]
ORCH --> RET["Evidence Retriever"]
ORCH --> EMRS["EMR Summariser"]
ORCH --> GS["Guideline Synthesiser"]
ORCH --> SC["Safety Critic"]
subgraph IR["Two-Stage Retrieval (Budgeted)"]
DENSE["Stage 1: Dense ANN (MedEmbed)"] --> CAND["Candidates (dense OR BM25)"]
BM25["Stage 1: Lexical (BM25)"] --> CAND
CAND --> RER["Stage 2: Long-context Reranker\n(LoRA-adapted, calibrated)"]
RER --> SEL["Policy-aware selection\nMMR + sufficiency constraints"]
end
RET --> IR
SEL --> EVID["Evidence bundle M\nEMR + CPG + metadata"]
EMRS --> STATE["Clinical state summary"]
GS --> ACTIONS["Guideline actions + contraindications"]
SC --> FLAGS["Safety flags / missing evidence"]
EVID --> FUSE["Evidence-constrained synthesis"]
STATE --> FUSE
ACTIONS --> FUSE
FLAGS --> FUSE
FUSE --> OUT["Final answer + citations + cautions\n+ structured trace"]
Two-Stage Retrieval & Calibrated Reranking
Evidence selection is separated into recall-oriented candidate generation and precision-oriented reranking. This avoids early truncation while enabling deterministic, policy-aware inclusion decisions.
Dense ANN retrieval is unioned with BM25 to preserve rare clinical terms, abbreviations, and lab-specific phrasing.
A biomedical LLM reranker scores each passage and outputs calibrated probabilities usable as policy thresholds.
Final selection enforces EMR + guideline sufficiency, diversity (MMR-style), and a strict token budget.
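The budgeted, diversity-aware selection can be sketched as greedy MMR with a token-budget check. A sketch under stated assumptions: `lam`, the similarity function, and the budget are illustrative parameters, not MedSwin’s deployed values.

```python
def mmr_select(candidates, relevance, sim, token_len, budget, lam=0.7):
    """Greedy maximal-marginal-relevance selection under a token budget.

    candidates: passage ids; relevance: id -> calibrated probability;
    sim(a, b): redundancy score in [0, 1]; token_len: id -> token count.
    """
    selected, used = [], 0
    pool = set(candidates)
    while pool:
        def mmr(pid):
            # Relevance minus redundancy against already-selected passages.
            red = max((sim(pid, s) for s in selected), default=0.0)
            return lam * relevance[pid] - (1 - lam) * red
        best = max(pool, key=mmr)
        pool.discard(best)
        if used + token_len[best] > budget:
            continue  # skip passages that would exceed the budget
        selected.append(best)
        used += token_len[best]
    return selected

rel = {"a": 0.9, "b": 0.85, "c": 0.2}
near_dupes = lambda x, y: 0.9 if {x, y} == {"a", "b"} else 0.0
lengths = {"a": 100, "b": 100, "c": 100}
picked = mmr_select(["a", "b", "c"], rel, near_dupes, lengths, budget=250)
```

Here the near-duplicate pair (`a`, `b`) still both survive because their combined relevance outweighs the redundancy penalty, while the budget excludes the weak third passage.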
Data, Training & Distillation
MedSwin’s deployable 7B model is trained for reliability rather than raw scale, combining large-scale augmentation, supervised fine-tuning, and knowledge distillation.
Paraphrasing, formatting variants, deduplication, and medical consistency checks expand coverage without semantic drift.
Aligns the student to clinical instruction style, tone control, and structured answers.
Hard labels expand task coverage; soft labels preserve the calibration and uncertainty of a larger teacher model.
Evaluation & Safety
MedSwin evaluates clinical QA systems beyond answer accuracy, focusing on evidence quality, guideline compliance, and runtime safety behaviour.
Evidence relevance and coverage under a fixed token budget.
Presence of actionable recommendations and contraindications.
Final answers remain grounded in cited evidence only.
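One concrete runtime check for the “cited evidence only” property is a citation-coverage metric over answer sentences. A minimal sketch: the inline `[doc_id]` citation format and the naive sentence splitter are assumptions for illustration, not MedSwin’s actual answer format.

```python
import re

def citation_coverage(answer, evidence_ids):
    """Fraction of answer sentences carrying only valid citations.

    Citations are assumed to appear inline as [doc_id] markers; sentences
    are split naively on '.', '!' or '?' (illustrative only).
    """
    sentences = [s for s in re.split(r"[.!?]\s*", answer) if s.strip()]
    cited = 0
    for s in sentences:
        refs = re.findall(r"\[([^\]]+)\]", s)
        # A sentence counts only if it cites, and every citation resolves
        # to a passage that is actually in the evidence bundle.
        if refs and all(r in evidence_ids for r in refs):
            cited += 1
    return cited / len(sentences) if sentences else 0.0

ans = ("Start metformin [cpg:ada-2024]. "
       "Monitor renal function [emr:labs]. "
       "Consider a GLP-1 agonist.")
coverage = citation_coverage(ans, {"cpg:ada-2024", "emr:labs"})
```

An uncited sentence (the third above) lowers the score, which a deployment gate could treat the same way as a failed sufficiency check.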
Team
A multidisciplinary research team building an auditable medical AI system.