Evidence-first · Two-stage retrieval · Auditable traces · Distilled 7B

Multi-Agent Biomedical Reasoning Grounded in EMR & Guidelines

MedSwin is an evidence-constrained clinical QA stack: specialised agents coordinate retrieval, EMR summarisation, guideline synthesis, and safety critique—while a calibrated reranker enforces evidence sufficiency under a token budget.

MedSwin · Local-deployable by design · Provenance-aware output
Replayable traces · Evidence sufficiency checks · Token budget control · Hybrid ANN + BM25 · Calibrated reranker · Multi-agent coordination (MAC)

Trust Stack

Design primitives for clinical deployment readiness

System
Auditability

Typed artifacts + explicit provenance (document IDs, sections, timestamps, chunk offsets) enable replay and review.

Evidence sufficiency

Retrieval is accepted only when EMR/CPG coverage targets are met under a strict token budget.

Calibrated ranking

Long-context biomedical reranker outputs calibrated probabilities for deterministic inclusion policies.

Safety critique

Critic checks missing evidence, contraindications, and unsafe advice—then routes “retrieve-more” when needed.

Research prototype — not a substitute for professional medical advice.

Overview

MedSwin frames clinical QA as an evidence-constrained decision pipeline. Every answer is gated by evidence sufficiency, bounded by a strict context budget, and accompanied by a replayable trace suitable for audit and safety review.

Answer

Clinically phrased, uncertainty-aware output generated only when evidence gates are satisfied.

Evidence bundle

Compact EMR + guideline passages selected under token and diversity constraints.

Trace

Structured artifact log: retrieval, ranking, policies, safety checks.

1) Auditable Multi-Agent Orchestration

Every agent produces artifacts with provenance metadata and logs tool calls + selected evidence. This creates a structured audit trail suitable for review and replay.

  • Typed artifacts: ids, sections, timestamps, offsets
  • Deterministic “retrieve-more” instead of guessing
  • Critique + safety checks before finalisation
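The typed-artifact idea can be illustrated with a minimal sketch; the field names and `make_artifact` helper below are hypothetical, chosen only to show how provenance metadata (document IDs, sections, timestamps, chunk offsets) travels with every piece of evidence:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class Provenance:
    doc_id: str        # source document identifier
    section: str       # section heading within the document
    char_start: int    # chunk offset (inclusive)
    char_end: int      # chunk offset (exclusive)
    retrieved_at: str  # ISO-8601 timestamp

@dataclass(frozen=True)
class EvidenceArtifact:
    text: str
    score: float       # calibrated reranker probability
    provenance: Provenance

def make_artifact(text, score, doc_id, section, start, end):
    prov = Provenance(doc_id, section, start, end,
                      datetime.now(timezone.utc).isoformat())
    return EvidenceArtifact(text, score, prov)

# An orchestrator would append asdict(artifact) to the structured trace
# log, making every retrieval decision replayable during review.
```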

2) Two-Stage Retrieval with Calibrated Reranking

Stage-1 retrieves candidates via dense retrieval + BM25 union. Stage-2 reranks with a long-context biomedical reranker (LoRA-adapted) and outputs calibrated probabilities used by policy constraints.

  • Hybrid candidate pool: ANN + lexical coverage
  • Evidence sufficiency thresholds for EMR/CPG
  • Budgeted, diverse selection (MMR-style)
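A minimal sketch of the stage-1 union (toy passage IDs and scores, not MedSwin's actual retrievers): dense and BM25 scores are not directly comparable, so the pool only records which retriever proposed each passage and leaves scoring to the calibrated stage-2 reranker:

```python
def union_candidates(dense_hits, bm25_hits, k=5):
    """Merge two ranked hit lists of (passage_id, score) into one pool.

    Stage-1 scores are incommensurable across retrievers, so the pool
    tracks provenance of each candidate; the stage-2 reranker assigns
    the calibrated probabilities that drive inclusion policies.
    """
    pool = {}
    for source, hits in (("dense", dense_hits[:k]), ("bm25", bm25_hits[:k])):
        for pid, _score in hits:
            entry = pool.setdefault(pid, {"sources": set()})
            entry["sources"].add(source)
    return pool

dense = [("p1", 0.82), ("p2", 0.77), ("p3", 0.60)]
bm25  = [("p2", 11.4), ("p4", 9.8)]
pool = union_candidates(dense, bm25)
# p2 is proposed by both retrievers; p4 survives only via BM25,
# e.g. a rare clinical abbreviation the dense encoder misses.
```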

3) Distilled 7B Medical LLM Pipeline

A compact student model is built with SFT on augmented biomedical QA, then refined with hard- and soft-label knowledge distillation from a larger teacher. This targets deployability while preserving calibrated reasoning behaviour.

  • Large-scale augmentation with semantic checks
  • Hard labels expand coverage; soft labels preserve uncertainty
  • PEFT (QLoRA/LoRA) enables modest GPU training
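The hard+soft objective can be sketched with a small NumPy implementation of the standard distillation loss; the α and T values below are illustrative placeholders, not MedSwin's training hyperparameters:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, hard_label, alpha=0.5, T=2.0):
    """Combined distillation loss: hard cross-entropy + soft KL term.

    The T**2 factor keeps soft-label gradients comparable in magnitude
    across temperatures (the usual Hinton-style convention).
    """
    p_s = softmax(student_logits)
    ce_hard = -np.log(p_s[hard_label] + 1e-12)
    p_t = softmax(teacher_logits, T)          # softened teacher targets
    p_sT = softmax(student_logits, T)         # softened student output
    kl_soft = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_sT + 1e-12)))
    return alpha * ce_hard + (1 - alpha) * (T ** 2) * kl_soft
```

Hard labels pull the student toward correct answers on new coverage; the KL term transfers the teacher's full output distribution, which is what preserves its uncertainty estimates.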

From question → audited answer

MedSwin outputs an answer only when the evidence bundle is sufficient. Otherwise it asks clarifying questions or expands retrieval.

Clarify · Retrieve-more · Safe final
  1. Normalise

    Canonicalise terms, expand abbreviations, form retrieval probes.

  2. Retrieve + Rank

    Hybrid candidates, rerank with calibrated probabilities, enforce sufficiency.

  3. Synthesise + Critique

    Summarise EMR, synthesise guideline actions, run safety critique, then answer.
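The routing between answering, retrieving more, and asking for clarification can be sketched as a small policy function; the coverage scores, thresholds, budget, and round limit here are illustrative placeholders, not MedSwin's actual sufficiency checks:

```python
def next_action(emr_coverage, cpg_coverage, tokens_used, *,
                min_emr=0.6, min_cpg=0.6, budget=4096,
                max_rounds=3, round_idx=0):
    """Route the pipeline: answer only when both sufficiency targets hold.

    Falls back to deterministic retrieve-more while budget remains,
    and to a clarifying question instead of guessing.
    """
    if emr_coverage >= min_emr and cpg_coverage >= min_cpg:
        return "answer"
    if round_idx < max_rounds and tokens_used < budget:
        return "retrieve-more"
    return "clarify"   # ask the clinician rather than guess
```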

Architecture Explorer

Explore MedSwin layers: multi-agent workflow, two-stage retrieval, evidence sufficiency checks, and MAC coordination.

Orchestrated · Provenance · Budgeted

High-Level System Diagram

Mermaid
flowchart LR
  U["Clinician UI / EMR"] -->|q + patient context| ORCH["Orchestrator (MAC)\nplanning · policy checks · logging"]
  ORCH --> QN["Query Normaliser"]
  ORCH --> RET["Evidence Retriever"]
  ORCH --> EMRS["EMR Summariser"]
  ORCH --> GS["Guideline Synthesiser"]
  ORCH --> SC["Safety Critic"]

  subgraph IR["Two-Stage Retrieval (Budgeted)"]
    DENSE["Stage 1: Dense ANN (MedEmbed)"] --> CAND["Candidates (dense OR BM25)"]
    BM25["Stage 1: Lexical (BM25)"] --> CAND
    CAND --> RER["Stage 2: Long-context Reranker\n(LoRA-adapted, calibrated)"]
    RER --> SEL["Policy-aware selection\nMMR + sufficiency constraints"]
  end

  RET --> IR
  SEL --> EVID["Evidence bundle M\nEMR + CPG + metadata"]

  EMRS --> STATE["Clinical state summary"]
  GS --> ACTIONS["Guideline actions + contraindications"]
  SC --> FLAGS["Safety flags / missing evidence"]

  EVID --> FUSE["Evidence-constrained synthesis"]
  STATE --> FUSE
  ACTIONS --> FUSE
  FLAGS --> FUSE
  FUSE --> OUT["Final answer + citations + cautions\n+ structured trace"]
                  

Two-Stage Retrieval & Calibrated Reranking

Evidence selection is separated into recall-oriented candidate generation and precision-oriented reranking. This avoids early truncation while enabling deterministic, policy-aware inclusion decisions.

Stage 1 — Candidate generation

Dense ANN retrieval is unioned with BM25 to preserve rare clinical terms, abbreviations, and lab-specific phrasing.

Stage 2 — Long-context reranking

A biomedical LLM reranker scores each passage and outputs calibrated probabilities usable as policy thresholds.
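One common way to obtain such probabilities is temperature/Platt-style scaling of the reranker's raw scores; the sketch below uses placeholder parameters that would in practice be fit on a held-out labelled set:

```python
import math

def calibrated_prob(raw_score, temperature=2.0, bias=0.0):
    """Map a raw reranker score to a probability via a scaled sigmoid.

    temperature and bias are illustrative; Platt scaling would fit them
    on held-out relevance labels so thresholds behave predictably.
    """
    return 1.0 / (1.0 + math.exp(-(raw_score - bias) / temperature))

def include(raw_score, tau=0.5):
    """Deterministic inclusion policy: threshold the calibrated probability."""
    return calibrated_prob(raw_score) >= tau
```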

Policy-aware selection

Final selection enforces EMR + guideline sufficiency, diversity (MMR-style), and a strict token budget.
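A minimal sketch of the MMR-style, budget-aware selection (the λ weight, similarity function, and token counts are placeholders, not MedSwin's policy constants):

```python
def mmr_select(candidates, sim, budget, lam=0.7):
    """Greedy MMR-style selection under a token budget.

    candidates: list of (id, calibrated_prob, n_tokens);
    sim(a, b) -> redundancy in [0, 1] between two passage ids.
    Trades calibrated relevance against redundancy, skipping
    passages that would exceed the budget.
    """
    selected, used = [], 0
    remaining = list(candidates)
    while remaining:
        def mmr(c):
            redundancy = max((sim(c[0], s[0]) for s in selected), default=0.0)
            return lam * c[1] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        remaining.remove(best)
        if used + best[2] <= budget:
            selected.append(best)
            used += best[2]
    return [c[0] for c in selected]
```

With a 200-token budget and two near-duplicate passages, the duplicate is penalised and a diverse lower-scored passage is chosen instead.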

Data, Training & Distillation

MedSwin’s deployable 7B model is trained for reliability rather than raw scale, combining large-scale augmentation, supervised fine-tuning, and knowledge distillation.

A · Data augmentation

Paraphrasing, formatting variants, deduplication, and medical consistency checks expand coverage without semantic drift.
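The deduplication step can be sketched with a cheap lexical pass (a hypothetical `normalise` helper); a full pipeline would layer embedding-based semantic checks on top of this to catch paraphrase drift:

```python
import re

def normalise(q):
    """Canonicalise a QA item for near-duplicate detection."""
    return re.sub(r"\s+", " ", q.lower().strip().rstrip("?.!"))

def dedupe(items):
    """Drop exact duplicates after normalisation.

    Only the lexical pass is sketched here; semantic consistency
    checks on embeddings would run as a second filter.
    """
    seen, kept = set(), []
    for q in items:
        key = normalise(q)
        if key not in seen:
            seen.add(key)
            kept.append(q)
    return kept
```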

B · Supervised fine-tuning

Aligns the student to clinical instruction style, tone control, and structured answers.

C · Knowledge distillation

Hard labels expand task coverage; soft labels preserve the calibration and uncertainty of the larger teacher.

Evaluation & Safety

MedSwin evaluates clinical QA systems beyond answer accuracy, focusing on evidence quality, guideline compliance, and runtime safety behaviour.

Retrieval quality

Evidence relevance and coverage under a fixed token budget.

Guideline coverage

Presence of actionable recommendations and contraindications.

Faithfulness

Final answers remain grounded in cited evidence only.
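Coverage under a fixed budget can be made concrete with a simple metric sketch; the gold-evidence set and token counts below are illustrative, not a MedSwin benchmark:

```python
def coverage_at_budget(selected, gold, budget_tokens):
    """Fraction of gold evidence ids recovered within the token budget.

    selected: (passage_id, n_tokens) pairs in selection order;
    gold: set of passage ids judged necessary for the answer.
    """
    used, hit = 0, set()
    for pid, n_tok in selected:
        if used + n_tok > budget_tokens:
            break
        used += n_tok
        if pid in gold:
            hit.add(pid)
    return len(hit) / len(gold) if gold else 1.0
```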

Team

A multidisciplinary research team building an auditable medical AI system.

Swinburne · Multi-role

  • 🎖️ Liam (Leader)
  • 🧪 Henry (LLM)
  • 🔗 Hai (System)