|
|
| \chapter{Methodology} |
| \label{ch:method} |
|
|
| This chapter presents the proposed Chest X-Ray Vision--Language Model |
| (CXR-VLM) in nine sections. The first describes the data |
| materials and the pre-processing pipeline that produces the training |
| corpus from MIMIC-CXR. The next four sections describe the four |
| modules that constitute the model itself --- a frozen vision encoder, |
| a learnable cross-modal projection, an auxiliary abnormality |
| classifier, and a frozen large language model fitted with |
| parameter-efficient adapters. Section~\ref{sec:prompt} explains how |
| these modules are tied together by a single prompt template that |
| specialises to each downstream task. Section~\ref{sec:training} lays |
| out the two-phase training schedule, with the contrastive alignment |
| loss of Phase~1 and the instruction-tuning loss of Phase~2 stated |
| explicitly. Section~\ref{sec:inference} formalises the inference |
| procedure as an algorithm. Section~\ref{sec:eval} closes the chapter |
| by specifying the evaluation protocol --- the held-out splits, the |
| metrics, and the model-selection criterion --- under which the |
| experimental results of Chapter~\ref{ch:experiments} are reported. |
|
|
|
|
| |
| \section{Data Materials} |
| \label{sec:data} |
| |
|
|
| \subsection{Datasets} |
|
|
| Three publicly available chest radiograph datasets are used in this |
| work; each plays a distinct role. |
|
|
| \paragraph{MIMIC-CXR-JPG~\citep{mimic-cxr-jpg}.} The primary training |
| and evaluation corpus. MIMIC-CXR-JPG is the JPEG-converted release of |
| MIMIC-CXR~\citep{mimic-cxr}, comprising 377{,}110 chest radiographs |
| from 227{,}827 imaging studies of 65{,}379 patients, paired with |
| free-text radiology reports. Each study is additionally accompanied |
| by a row in \texttt{mimic-cxr-2.0.0-chexpert.csv}, in which the |
| CheXbert NLP labeler~\citep{chexbert} extracts a $14$-pathology |
| abnormality vector from the report text, with values in |
| $\{-1, 0, 1, \text{blank}\}$ denoting uncertain, negative, positive, |
| and not-mentioned respectively. The official patient-disjoint train / |
| validate / test split in \texttt{mimic-cxr-2.0.0-split.csv} is |
| used as the starting point for our subset construction. |
|
|
| \paragraph{MIMIC-Ext-CXR-VQA~\citep{mimic-cxr-vqa}.} A visual |
| question--answering extension of MIMIC-CXR that provides |
| 1.4\,million $(\text{image}, \text{question}, \text{answer})$ triples |
| covering presence, location, severity, and comparative reasoning over |
| the same 14 pathologies. Each triple references a MIMIC-CXR image by |
| \texttt{dicom\_id}, so the dataset is filtered down to the subset |
| described below by simple set intersection. |
|
|
| \paragraph{IU X-Ray~\citep{iuxray}.} A smaller open-access dataset of |
| 3{,}955 fully de-identified chest X-ray studies with associated |
| reports, used as an out-of-distribution evaluation corpus to |
| assess cross-domain generalisation of the report-generation tasks. |
| Following the convention of prior work~\citep{r2gen, metransformer}, |
| we adopt a 7:1:2 train/validate/test partition at the study level. |
|
|
| \subsection{Subset Construction} |
| \label{sec:subset} |
|
|
| The full MIMIC-CXR collection is too large to train on with the |
| hardware budget of this work, and its native distribution is heavily |
| imbalanced both in view types (frontal vs lateral exposures) and in |
| report formatting. We therefore curate a 50{,}000-study subset by a |
| four-stage filter chain, applied once and then re-used across all |
| experiments. The filter is documented here at a level intended to be |
| sufficient for reproducibility. |
|
|
|
|
| \paragraph{(a) Frontal-view selection.} Each DICOM is joined with |
| \texttt{mimic-cxr-2.0.0-metadata.csv} by \texttt{dicom\_id}, and only |
| rows whose \texttt{ViewPosition} field is PA or AP are retained. A |
| study with multiple frontal exposures is collapsed to one image, |
| preferring the PA projection where available because PA is the |
| standard reference projection while AP is reserved for bedside or |
| supine portable studies. Lateral views are discarded because they are |
| clinically complementary but stylistically and anatomically distinct |
| from frontals; mixing them with frontals would inject input |
| heterogeneity that the model has no architectural mechanism to handle. |
| After this step every retained study contributes exactly one |
| frontal image to the corpus. |
|
|
| \paragraph{(b) Report parsing.} Each study's \texttt{.txt} file is |
| scanned by a strict regular expression that recognises section |
| headers of the form \verb|^[A-Z ,/().-]+:|. A section body is |
| accepted only if the upper-cased header is exactly \texttt{FINDINGS} |
| or \texttt{IMPRESSION}; near-synonyms such as \texttt{CONCLUSION}, |
| \texttt{WET READ}, \texttt{INDICATION}, or composite headers like |
| \texttt{FINDINGS AND IMPRESSION} are deliberately not merged because |
| they vary substantially in style and would dilute the training signal. |
| A study survives this stage only if both sections are present and |
| non-empty. This requirement is stricter than in RaDialog because, as |
| explained in Section~\ref{sec:prompt}, the impression-generation |
| task in this work is conditioned on the ground-truth findings string, |
| so a study without a clean findings section cannot supply a usable |
| impression sample. |
|
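The parsing rule of stage~(b) can be sketched as follows. This is a minimal illustration built on the header pattern quoted above, not the pipeline's actual code: the names \texttt{parse\_sections} and \texttt{is\_eligible} are hypothetical, and collapsing whitespace inside a section body is an assumption.

```python
import re

# Header pattern quoted in the text: an upper-case run ending in a colon.
HEADER = re.compile(r"^[A-Z ,/().-]+:", re.MULTILINE)
KEEP = {"FINDINGS", "IMPRESSION"}


def parse_sections(report: str) -> dict:
    """Return the bodies of sections whose header is exactly in KEEP."""
    matches = list(HEADER.finditer(report))
    sections = {}
    for i, m in enumerate(matches):
        name = m.group(0)[:-1].strip()  # drop the colon and padding
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        body = " ".join(report[m.end():end].split())  # assumption: collapse whitespace
        if name in KEEP and body:
            sections[name] = body
    return sections


def is_eligible(report: str) -> bool:
    """A study survives only if both sections are present and non-empty."""
    return KEEP.issubset(parse_sections(report))
```

Composite headers such as \texttt{FINDINGS AND IMPRESSION} match the pattern but fail the exact-name test, so their bodies are skipped rather than merged.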
|
| \paragraph{(c) Length-based outlier removal.} Word counts are computed |
| per section; the upper cutoff is set to $Q_3 + 1.5\cdot\mathrm{IQR}$ |
| per section, dropping multi-paragraph teaching reports that sit far |
| from the median, and a lower floor of two words for findings and one |
| word for impression is imposed. The combination removes the long tail |
| while leaving the per-section medians essentially unchanged. |
|
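A minimal sketch of the length filter of stage~(c) is given below. The quartile interpolation rule is not specified in the text, so the default of Python's \texttt{statistics.quantiles} is assumed here; the function names are hypothetical.

```python
import statistics


def length_bounds(word_counts, floor):
    """Return (lower, upper) bounds: a fixed floor and Q3 + 1.5 * IQR."""
    # Assumption: default ("exclusive") quartile interpolation.
    q1, _, q3 = statistics.quantiles(word_counts, n=4)
    return floor, q3 + 1.5 * (q3 - q1)


def keep_study(n_findings, n_impression, findings_bounds, impression_bounds):
    """A study is kept only if both section lengths lie inside their bounds."""
    lo_f, hi_f = findings_bounds
    lo_i, hi_i = impression_bounds
    return lo_f <= n_findings <= hi_f and lo_i <= n_impression <= hi_i
```

The bounds are computed separately for findings (floor of two words) and impression (floor of one word), matching the per-section cutoffs described above.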
|
| \paragraph{(d) Stratified patient-disjoint sampling.} The remaining |
| eligible pool is sampled to 40{,}000 train / 5{,}000 validation / |
| 5{,}000 test studies. Each study is first assigned a \emph{stratum} |
| equal to its rarest positive CheXpert label (the rarity order is |
| computed over the eligible pool, so e.g.\ \emph{Pleural Other}, |
| \emph{Lung Lesion}, and \emph{Fracture} sit at the top); studies with |
| no positive label fall into \texttt{No Finding} or \texttt{None}. |
| Sampling allocates the target count per stratum proportionally to |
| prevalence and draws within each stratum, preferring studies that |
| also have an associated VQA triple in MIMIC-Ext-CXR-VQA so that the |
| VQA task receives more supervision. The validation and test pools are |
| populated first from the official PhysioNet splits; if a target count |
| exceeds the official pool, the remainder is drawn from the train pool |
| and the affected subject ids are then removed from train. The three |
| resulting splits are therefore patient-disjoint --- no subject appears |
| in more than one split. Per-pathology prevalence deltas between the |
| final subset and the eligible pool remain below 0.7 percentage points |
| on every label, confirming that stratification did not skew clinical |
| coverage. |
|
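The stratum assignment and proportional allocation of stage~(d) can be illustrated with the sketch below. It is a simplification under stated assumptions: each study is represented by the set of its positive labels, a single fallback stratum is used for studies without a positive label, ties in rarity are broken alphabetically, and the preference for studies with VQA triples and the patient-disjointness bookkeeping are omitted. All function names are hypothetical.

```python
from collections import Counter


def rarity_order(label_sets):
    """Order labels from rarest to most common over the eligible pool."""
    counts = Counter(label for labels in label_sets for label in labels)
    return sorted(counts, key=lambda name: (counts[name], name))


def assign_stratum(positive_labels, order, fallback="No Finding"):
    """Stratum = rarest positive label; studies with none use the fallback."""
    for name in order:
        if name in positive_labels:
            return name
    return fallback


def proportional_targets(strata, total):
    """Allocate `total` draws across strata in proportion to prevalence.

    Rounding means the targets may not sum exactly to `total`.
    """
    counts = Counter(strata)
    n = len(strata)
    return {name: round(total * c / n) for name, c in counts.items()}
```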
|
| \subsection{Image Pre-processing} |
|
|
| Because every training step ultimately resizes the input to the |
| encoder's native $518\!\times\!518$ window (Section~\ref{sec:encoder}), |
| all corpus images are resized once, offline, so the shortest edge |
| equals 518 with the longer edge proportional, and saved at JPEG |
| quality 90 --- near-lossless on grayscale chest radiographs. The 50k |
| subset compresses from $\sim$100\,GB of full-resolution JPEGs down to |
| $\sim$5--8\,GB, which fits comfortably on a single Hugging Face Hub |
| repository for distribution. At training time the dataset loader |
| applies the encoder's standard pre-processing (centre crop to |
| $518\!\times\!518$, channel-wise normalisation with the ImageNet mean |
| and standard deviation) on the resized JPEGs. |
|
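The geometry of the offline resize and the training-time centre crop can be sketched as follows. Only the size arithmetic is shown; the rounding rule and the function names are assumptions, and the actual resampling and JPEG encoding are left to an image library.

```python
def shortest_edge_size(width, height, target=518):
    """Scale (width, height) so the shorter edge equals `target`."""
    scale = target / min(width, height)
    # max() guards against a rounding result one pixel below the target.
    return max(target, round(width * scale)), max(target, round(height * scale))


def centre_crop_box(width, height, crop=518):
    """Return the (left, top, right, bottom) box of a centred square crop."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop
```

For a portrait radiograph of $2544 \times 3056$ pixels, for instance, the stored image is $518 \times 622$ and the crop discards 52 rows at the top and 52 at the bottom.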
|
| |
| \section{Visual Feature Extractor} |
| \label{sec:encoder} |
| |
|
|
|
|
| The visual backbone is Microsoft's RAD-DINO~\citep{raddino}, a |
| ViT-B/14 vision transformer pre-trained with the DINOv2 self-distil\-lation |
| objective~\citep{dinov2} on $\approx$840{,}000 chest radiographs drawn |
| from MIMIC-CXR, CheXpert, NIH ChestX-ray14, and PadChest. The |
| \texttt{microsoft/rad-dino} checkpoint is loaded through the |
| HuggingFace \texttt{transformers} hub and accepts inputs at its native |
| resolution of $518\!\times\!518$. At this resolution each image is |
| tokenised into $37\!\times\!37 = 1369$ patches of size |
| $14\!\times\!14$, linearly projected to a 768-dimensional embedding, |
| and processed by a 12-layer transformer with 12 attention heads per |
| layer. The class token at the head of the sequence is discarded |
| downstream, so the encoder output consumed by the projection is |
|
|
| \begin{equation} |
| \mathbf{P} \;=\; \operatorname{RAD-DINO}(\mathbf{X}) \;\in\; \mathbb{R}^{B \times 1369 \times 768} |
| \label{eq:patches} |
| \end{equation} |
|
|
| where $\mathbf{X} \in \mathbb{R}^{B \times 3 \times 518 \times 518}$ |
| is the pre-processed image batch. The choice of RAD-DINO over more |
| commonly-used backbones such as BioViL-T~\citep{biovilt} or |
| generic ImageNet-pretrained ViTs is motivated by three factors: it is |
| trained on roughly an order of magnitude more chest-X-ray data than |
| BioViL-T; its DINOv2 self-distillation objective yields denser, more |
| spatially-coherent patch features than contrastive |
| text--image alignment objectives, which matters for the focal |
| abnormalities (nodules, small consolidations) the report-generation |
| task needs to mention; and it is distributed as a standard |
| HuggingFace model, free of the legacy dependency chain of BioViL-T's |
| \texttt{hi-ml-multimodal} package. |
|
|
| The encoder is kept entirely frozen throughout this work. Allowing |
| gradients to flow back into an 86\,M-parameter ViT, on top of the |
| activations of the quantised LLM in the same model, exhausts the |
| memory budget of consumer GPUs, and the empirical observation in |
| RaDialog~\citep{radialog} and LLaVA-Med~\citep{llavamed} that |
| unfreezing the encoder before the projection has converged tends to |
| collapse the joint representation gives independent reason for caution. |
| The pre-trained patch features are instead specialised to the |
| chest-X-ray report distribution by the projection module described |
| next, under the explicit alignment objective of Phase~1. |
|
|
| |
| \section{Cross-Modal Projection} |
| \label{sec:projection} |
| |
|
|
|
|
| The projection module serves two roles: pool the 1369-token patch |
| sequence of Equation~\ref{eq:patches} into a fixed, much smaller |
| visual-token budget the language model can attend to, and lift each |
| visual token from the encoder's 768-dimensional embedding space into |
| the 4096-dimensional embedding space of Vicuna-7B. We adopt a design |
| in the spirit of the perceiver resampler used by |
| Flamingo~\citep{flamingo} and the Q-Former of BLIP-2~\citep{blip2}, |
| but without the heavy multi-layer transformer stack of either: a |
| single learnable query bank cross-attends over the patches, followed |
| by a two-layer MLP. This light-weight choice was first demonstrated |
| on the chest-X-ray task by RaDialog~v2~\citep{radialog}, whose authors report |
| that the gain from a deeper Q-Former does not materialise on radiology |
| metrics. |
|
|
| Concretely, the projection holds three learnable components: |
|
|
| \begin{enumerate} |
| \item A query bank |
| $\mathbf{Q}_0 \in \mathbb{R}^{32 \times 768}$, initialised from |
| $\mathcal{N}(0, 0.02^2)$ and broadcast to a batched |
| $\mathbf{Q}\in\mathbb{R}^{B\times 32\times 768}$ at forward time. |
| \item An 8-head cross-attention block at width 768, with dropout 0.1. |
| \item A two-layer MLP $\mathbf{W}_1 \in \mathbb{R}^{768\times 1024}$, |
| GELU, dropout, $\mathbf{W}_2 \in \mathbb{R}^{1024\times 4096}$, all |
| initialised from $\mathcal{N}(0, 0.02^2)$ with zero biases. |
| \end{enumerate} |
|
|
| For the patch features $\mathbf{P}_i \in \mathbb{R}^{1369 \times 768}$ |
| of sample $i$, the projection emits visual tokens |
| $\mathbf{V}_i \in \mathbb{R}^{32 \times 4096}$ in three steps. Tokens |
| are stored as rows, so the weight matrices multiply from the right; |
| biases are omitted for brevity: |
|
|
| \begin{align} |
| \mathbf{H}^{(0)}_i &= \operatorname{CrossAttn}\!\left(\mathbf{Q}_i, \mathbf{P}_i, \mathbf{P}_i\right) \in \mathbb{R}^{32 \times 768} \label{eq:crossattn}\\ |
| \mathbf{H}^{(1)}_i &= \operatorname{Dropout}\!\left(\operatorname{GELU}(\mathbf{H}^{(0)}_i \mathbf{W}_1)\right) \in \mathbb{R}^{32 \times 1024} \label{eq:tap}\\ |
| \mathbf{V}_i &= \mathbf{H}^{(1)}_i\, \mathbf{W}_2 \in \mathbb{R}^{32 \times 4096} |
| \end{align} |
|
|
| Three design choices deserve explanation. First, the number of visual |
| tokens is fixed at 32, matching RaDialog and sitting inside the |
| empirical sweet spot reported by BLIP-2 and LLaVA: below 16, the |
| pool is too coarse for small focal pathologies; above 64, the LLM's |
| sequence length grows without measurable gain on radiology metrics. |
| Second, the cross-attention is unidirectional --- queries attend over |
| patches, but patches do not attend back. This is the standard |
| perceiver setup and keeps the projection module small (1.8\,M |
| parameters) and easy to train under a contrastive objective in |
| Phase~1. Third, the 1024-d intermediate activation $\mathbf{H}^{(1)}$ |
| in Equation~\ref{eq:tap} is exposed deliberately --- it is the |
| grounding feature the Phase~1 contrastive head operates on, and |
| placing it between the GELU and the second linear means the |
| contrastive objective sees an already-nonlinear representation rather |
| than the raw cross-attention output. |
|
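The pooling role of the query bank can be illustrated with a deliberately stripped-down sketch. It uses a single attention head on plain Python lists and omits the learned key, value, and output projections, the eight heads, and dropout of the actual block, so it shows only the mechanism: a fixed number of queries each return a softmax-weighted mean of the patches, whatever the number of patches. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(queries, patches):
    """Single-head attention: one pooled vector per query.

    queries: list of n_q vectors; patches: list of n_p vectors (same width).
    The output has n_q rows regardless of n_p.
    """
    dim = len(patches[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in patches]
        weights = softmax(scores)
        pooled.append([sum(w * p[k] for w, p in zip(weights, patches)) for k in range(dim)])
    return pooled
```

Because patches never attend back to the queries, the cost is linear in the number of patches, which is the unidirectional property noted above.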
|
| |
| \section{Abnormality Classifier} |
| \label{sec:chex} |
| |
|
|
|
|
| The abnormality classifier is the principal mechanism by which |
| explicit clinical knowledge enters the language model. Rather than |
| fusing classifier logits into the visual embedding pathway, we |
| serialise the classifier's predictions to text and append them to the |
| prompt. This design is borrowed from the U-MultiClass formulation of |
| META-CXR~\citep{metacxr}, where each of the 14 CheXpert pathologies |
| is classified into one of three classes --- positive (P), negative |
| (N), or uncertain (U) --- rather than collapsed to a binary |
| present/absent decision. The three-way distinction matters because |
| the negative class (the report explicitly rules a pathology out) |
| carries different information from the uncertain class (the report |
| hedges with phrases such as ``may represent'' or ``cannot exclude''), |
| and the language model can exploit that distinction when phrasing the |
| generated report. |
|
|
| The classifier architecture is intentionally small. It consists of a |
| multi-layer perceptron head on top of the global \texttt{[CLS]} token |
| of the same frozen RAD-DINO encoder used by the rest of the model. |
| Sharing the encoder is what makes the design near-free at inference: |
| the encoder forward is already required for the visual pathway, so |
| adding the classifier costs only a single $\mathbb{R}^{768} \to |
| \mathbb{R}^{14 \times 3}$ MLP forward per image. The head produces a |
| logit tensor |
|
|
| \begin{equation} |
| \mathbf{Z} \;=\; \operatorname{MLP}\!\left(\mathbf{c}_{\texttt{[CLS]}}\right) \;\in\; \mathbb{R}^{14 \times 3} |
| \end{equation} |
|
|
| which is softmaxed along the class axis to yield, for each |
| pathology $j$, the probability distribution $\hat{\mathbf{y}}_j \in |
| \Delta^2$ over the three U-MultiClass categories. The predicted |
| category is the arg-max, |
|
|
| \begin{equation} |
| c_j \;=\; \operatorname*{arg\,max}_{c \in \{\text{P}, \text{N}, \text{U}\}} \hat{\mathbf{y}}_{j, c}. |
| \end{equation} |
|
|
| These per-pathology predictions are then serialised into a |
| human-readable abnormality block --- which we refer to as the |
| \emph{PNU string} --- by listing the pathology names that fall into |
| each of the three categories on three lines: |
|
|
| \begin{verbatim} |
| Positive Abnormalities: <comma-separated names with c_j = P> |
| Negative Abnormalities: <comma-separated names with c_j = N> |
| Uncertain Abnormalities: <comma-separated names with c_j = U> |
| \end{verbatim} |
|
|
| At training time the PNU string is built directly from the |
| \texttt{mimic-cxr-2.0.0-chexpert.csv} ground-truth labels (the |
| classifier-as-oracle setting), with missing-value labels in the CSV |
| mapped to negative, so that the language model learns to condition on |
| \emph{reliable} abnormality information from the very first step. At |
| inference time the classifier's own predictions are used instead. The |
| classifier itself is fitted in a brief preparatory |
| phase (Section~\ref{sec:phase0}) and frozen for all subsequent |
| training. |
|
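The arg-max and serialisation steps can be sketched as below. The column order of the logit tensor (P, N, U) and the handling of an empty category (the heading is kept with an empty list) are assumptions, and the function names are hypothetical.

```python
HEADINGS = {
    "P": "Positive Abnormalities",
    "N": "Negative Abnormalities",
    "U": "Uncertain Abnormalities",
}
CLASSES = ("P", "N", "U")  # assumed column order of the logit tensor


def predict_classes(logits):
    """Arg-max over the class axis of a table of per-pathology logits.

    Softmax is monotone, so the arg-max of the logits equals the arg-max
    of the class probabilities.
    """
    return [CLASSES[max(range(len(CLASSES)), key=row.__getitem__)] for row in logits]


def serialise_pnu(names, classes):
    """Build the three-line PNU string from per-pathology categories."""
    lines = []
    for key in CLASSES:
        members = [n for n, c in zip(names, classes) if c == key]
        lines.append(f"{HEADINGS[key]}: {', '.join(members)}")
    return "\n".join(lines)
```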
|
| |
| \section{Language Model with Low-Rank Adaptation} |
| \label{sec:llm} |
| |
|
|
|
|
| The decoder is Vicuna-7B v1.3~\citep{vicuna}, a LLaMA-1 derivative |
| fine-tuned on ShareGPT conversational data. We selected the v1.3 |
| release rather than the newer v1.5 (LLaMA-2 base) or more recent |
| Llama-3 models for two practical reasons: it is the exact language |
| model used by the RaDialog baseline against which our work is most |
| directly comparable, and its chat template employs a single |
| unambiguous \verb|USER: ... ASSISTANT: ...| structure that |
| simplifies the label-masking required at training time. |
|
|
| The base model is loaded in 4-bit NF4 quantisation through the |
| \texttt{BitsAndBytesConfig} interface of \texttt{transformers}, backed |
| by the \texttt{bitsandbytes} |
| library~\citep{qlora}, with double quantisation enabled and the |
| compute dtype matched to the host GPU (BF16 on Ampere-or-newer cards, |
| FP16 on Turing). This reduces the resident weight footprint from |
| $\approx\!14$\,GB in FP16 to $\approx\!4.0$\,GB, which fits |
| comfortably alongside the encoder, projection, and activation budget |
| on a single 16\,GB GPU. |
|
|
| Vicuna-7B is otherwise kept frozen; adaptation is performed entirely |
| through LoRA~\citep{lora}. For each attention weight matrix |
| $\mathbf{W}_0 \in \mathbb{R}^{4096 \times 4096}$ in the |
| $\{q, k, v, o\}_{\text{proj}}$ projections of every transformer |
| block, we introduce a low-rank update |
|
|
| \begin{equation} |
| \mathbf{W}' \;=\; \mathbf{W}_0 \;+\; \frac{\alpha}{r}\,\mathbf{B}\mathbf{A}, |
| \qquad |
| \mathbf{A} \in \mathbb{R}^{r \times 4096}, |
| \quad |
| \mathbf{B} \in \mathbb{R}^{4096 \times r}, |
| \quad |
| r \ll 4096 |
| \label{eq:lora} |
| \end{equation} |
|
|
| where $\mathbf{A}$ is initialised from $\mathcal{N}(0, \sigma^2)$ and |
| $\mathbf{B}$ is zero-initialised, so that $\mathbf{B}\mathbf{A} = \mathbf{0}$ |
| and hence $\mathbf{W}' = \mathbf{W}_0$ |
| at the start of training. Only $\mathbf{A}$ and $\mathbf{B}$ |
| receive gradients; the quantised base weight $\mathbf{W}_0$ remains |
| fixed. We set $r = 16$, $\alpha = 32$, and apply a dropout of |
| $0.05$ on the LoRA forward, giving an effective adapter scaling of |
| $\alpha/r = 2$ that we found preferable to $1$ and $4$ in a brief |
| sweep. The feed-forward sublayers are deliberately left without LoRA |
| adapters: pilot runs that additionally adapted \texttt{gate\_proj} |
| and the other MLP projections enlarged the trainable parameter count by |
| $3$--$4\times$ without a corresponding gain on the downstream |
| clinical-accuracy metric. With this configuration the total trainable |
| parameter count of the language model is $\approx\!19.7$\,M, which |
| together with the $\approx\!1.8$\,M of the projection brings the |
| overall trainable budget to $\approx\!21.5$\,M out of $\approx\!7.1$\,B |
| total --- about $0.30\%$. |
|
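As a back-of-envelope check on the adapter size, the sketch below counts only the $\mathbf{A}$ and $\mathbf{B}$ matrices of Equation~\ref{eq:lora} for the stated configuration. The function names are hypothetical. Under this simple count the four attention projections of 32 blocks give $\approx\!16.8$\,M parameters, so the larger figure quoted above presumably includes parameters that this sketch does not itemise.

```python
def lora_param_count(d_in, d_out, rank, modules_per_block, blocks):
    """Parameters in A (rank x d_in) and B (d_out x rank) over all adapters."""
    per_matrix = rank * d_in + d_out * rank
    return per_matrix * modules_per_block * blocks


def lora_scaling(alpha, rank):
    """Multiplier applied to the low-rank update BA."""
    return alpha / rank


# Stated configuration: r = 16, alpha = 32, q/k/v/o projections, 32 blocks.
n_params = lora_param_count(4096, 4096, rank=16, modules_per_block=4, blocks=32)
scale = lora_scaling(32, 16)
```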
|
| |
| \section{Prompt Construction} |
| \label{sec:prompt} |
| |
|
|
| All three downstream tasks --- findings generation, impression |
| generation, and visual question answering --- share a single prompt |
| template that follows Vicuna's v1.1 chat format: |
|
|
| \begin{verbatim} |
| {SYSTEM_PROMPT} USER: <image> |
| {PNU abnormality block} |
| {task-specific context} |
| {instruction} ASSISTANT: {target} |
| \end{verbatim} |
|
|
| The \verb|<image>| token is added to the tokenizer as a special |
| single-id placeholder at vocabulary slot 32{,}000, immediately above |
| Vicuna's existing 32{,}000-token vocabulary. At forward time the |
| model locates this token in the tokenised input and substitutes the |
| 32 visual tokens $\mathbf{V}$ from the projection in its place, |
| expanding the attention mask and label tensor by 31 positions per |
| sample so the causal mask stays consistent. The label tensor entries |
| for visual-token positions are set to $-100$, the |
| \texttt{torch.nn.CrossEntropyLoss} ignore index, so that no loss is |
| incurred on visual tokens. |
|
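The bookkeeping around the \verb|<image>| placeholder can be sketched on plain lists of token ids. This is an illustration only: the actual model substitutes embeddings rather than ids, and the function name and the assumption of exactly one placeholder per sample are hypothetical.

```python
IGNORE_INDEX = -100       # ignore index of torch.nn.CrossEntropyLoss
IMAGE_TOKEN_ID = 32000    # vocabulary slot of <image>, as stated in the text
NUM_VISUAL_TOKENS = 32


def expand_image_slot(input_ids, attention_mask, labels):
    """Widen the single <image> position to NUM_VISUAL_TOKENS positions.

    Returns the expanded ids, mask, and labels, plus the (start, end) span
    into which the projected visual tokens would be written.
    """
    pos = input_ids.index(IMAGE_TOKEN_ID)  # assumption: exactly one placeholder
    n = NUM_VISUAL_TOKENS
    ids = input_ids[:pos] + [IMAGE_TOKEN_ID] * n + input_ids[pos + 1:]
    mask = attention_mask[:pos] + [1] * n + attention_mask[pos + 1:]
    labs = labels[:pos] + [IGNORE_INDEX] * n + labels[pos + 1:]
    return ids, mask, labs, (pos, pos + n)
```

Each sequence grows by $32 - 1 = 31$ positions, and every visual position carries the ignore index so that it contributes no loss.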
|
| The fields \verb|{PNU abnormality block}|, \verb|{task-specific context}|, |
| and \verb|{instruction}| are populated differently per task. |
|
|
| \paragraph{Findings generation.} The model is asked to produce the |
| findings paragraph from the frontal radiograph alone, with the PNU |
| abnormality block as soft clinical guidance. There is no task-specific |
| context. The instruction slot draws from a pool of ten hand-written |
| paraphrases (e.g.\ ``Generate the findings section for this chest |
| X-ray.'') sampled uniformly at random during training; at evaluation |
| the first variant is used deterministically for reproducibility. |
|
|
| \paragraph{Impression generation.} The impression is by clinical |
| convention a summary of the findings, so we frame the |
| impression task as conditional summarisation rather than free |
| generation. The task-specific context slot is populated with the |
| ground-truth findings text, prefixed by the literal token |
| \texttt{Findings:}. The model therefore receives the frontal image, |
| the PNU block, the ground-truth findings paragraph, and an instruction |
| drawn from a pool of ten impression-specific paraphrases (e.g.\ |
| ``Generate the impression section for this chest X-ray.''). Studies |
| whose report lacks a findings section emit no impression sample, by |
| construction of the filter chain in Section~\ref{sec:subset}. We |
| acknowledge that this is a teacher-forced setting: the impression |
| metrics reported in Chapter~\ref{ch:experiments} are an upper bound on |
| what a fully end-to-end cascade, in which the model's own |
| generated findings feed the impression prompt, would achieve, and |
| we discuss the gap in the limitations section of that chapter. |
|
|
| \paragraph{Visual question answering.} The task-specific context slot |
| is empty, and the instruction slot is the natural-language question |
| drawn from MIMIC-Ext-CXR-VQA. The reference answer in the dataset |
| supplies the target. |
|
|
| For every task, the system prompt is the radiologist persona used by |
| RaDialog --- ``A chat between a curious user and an artificial |
| intelligence assistant acting as an experienced radiologist''. The |
| full prompt is tokenised with a 512-token right-truncation budget. We |
| choose right-truncation deliberately: a too-long prompt should erode |
| the target on the right, where labels matter for the loss, rather |
| than the system prompt or PNU block on the left, both of which carry |
| non-redundant information. The label tensor is masked with $-100$ at |
| every prompt-token, padding-token, and visual-token position, so the |
| language-modelling loss is computed strictly on the assistant response. |
|
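How the template specialises per task can be sketched as follows. The exact whitespace between slots is not specified in the text, so joining the non-empty slots with newlines is an assumption, as are the function name and the error raised when the findings text is missing.

```python
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence "
    "assistant acting as an experienced radiologist"
)


def build_prompt(task, pnu_block, instruction, findings=None):
    """Fill the shared template up to (and including) 'ASSISTANT:'."""
    if task == "impression":
        if not findings:
            raise ValueError("impression prompts need the findings text")
        context = f"Findings: {findings}"
    else:
        context = ""  # findings generation and VQA have no extra context
    slots = ["<image>", pnu_block, context, instruction]
    user = "\n".join(slot for slot in slots if slot)
    return f"{SYSTEM_PROMPT} USER: {user} ASSISTANT:"
```

During training the target string is appended after \verb|ASSISTANT:| and is the only part of the sequence that keeps its labels.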
|
| |
| \section{Two-Phase Training} |
| \label{sec:training} |
| |
|
|
| Training proceeds in three stages: a brief preparatory stage that |
| fits the abnormality classifier head, then two main phases that train |
| the model itself. The two-phase split mirrors the standard practice |
| of BLIP-2, LLaVA, and RaDialog: in the first phase the alignment |
| module is fitted under an explicit contrastive objective without |
| touching the language model, and in the second phase the projection |
| and the LoRA adapters are fine-tuned jointly under the autoregressive |
| language-modelling objective. |
|
|
| \subsection{Preparatory Phase: Abnormality Classifier} |
| \label{sec:phase0} |
|
|
| The PNU classifier of Section~\ref{sec:chex} is fitted before either |
| of the two main phases. The frozen RAD-DINO encoder is forwarded over |
| every training image and the resulting \texttt{[CLS]} embeddings are |
| cached to disk; the small MLP head is then trained under fourteen |
| independent three-way cross-entropies against the |
| $\{\text{positive}, \text{negative}, \text{uncertain}\}$ targets |
| derived from \texttt{mimic-cxr-2.0.0-chexpert.csv} (with missing-value |
| labels mapped to negative). Because the encoder is frozen and its |
| features are cached, this phase converges in a handful of epochs on a |
| single GPU and is unremarkable beyond noting that the resulting |
| checkpoint is loaded read-only by both of the subsequent phases. |
|
|
| \subsection{Phase 1 --- Contrastive Alignment} |
| \label{sec:phase1} |
|
|
|
|
| Phase 1 specialises the projection to the chest-X-ray report |
| distribution without involving the language model. The projection |
| output is augmented with a small contrastive head that mean-pools |
| the 32 intermediate 1024-dimensional tokens of Equation~\ref{eq:tap}, |
| projects them to a 128-dimensional joint embedding space, and |
| L2-normalises the result: |
|
|
| \begin{equation} |
| \mathbf{v}_i \;=\; \frac{\mathbf{W}_{\text{itc}}\,\bar{\mathbf{H}}^{(1)}_i}{\bigl\lVert \mathbf{W}_{\text{itc}}\,\bar{\mathbf{H}}^{(1)}_i \bigr\rVert_2}, |
| \qquad |
| \bar{\mathbf{H}}^{(1)}_i \;=\; \frac{1}{32}\sum_{k=1}^{32} \mathbf{H}^{(1)}_{i,k} |
| \label{eq:itc-head} |
| \end{equation} |
|
|
| with $\mathbf{W}_{\text{itc}} \in \mathbb{R}^{128 \times 1024}$ the |
| contrastive projection. On the text side, the canonical reference |
| sentence per study --- the findings paragraph, falling back to the |
| impression where findings is absent --- is encoded once, offline, by |
| the frozen CXR-BERT-specialized model~\citep{cxrbert} through its |
| \verb|get_projected_text_embeddings| interface, which itself emits a |
| 128-dimensional L2-normalised vector $\mathbf{t}_i$. These per-study |
| text embeddings are written to disk and re-used across every Phase~1 |
| run; no text encoder is loaded at training time. |
|
|
| The model is then trained to maximise the cosine similarity between |
| $\mathbf{v}_i$ and the text embedding $\mathbf{t}_i$ of the same |
| study, while pushing $\mathbf{v}_i$ away from the text embeddings of |
| the other studies in the batch. We use the symmetric InfoNCE loss of |
| CLIP~\citep{clip}: |
|
|
| \begin{equation} |
| \mathcal{L}_{\text{ITC}} |
| \;=\; |
| -\tfrac{1}{2}\!\left[ |
| \sum_{i=1}^{B}\!\log\!\frac{\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_i/\tau)} |
| {\sum_{j=1}^{B}\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_j/\tau)} |
| \;+\; |
| \sum_{i=1}^{B}\!\log\!\frac{\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_i/\tau)} |
| {\sum_{j=1}^{B}\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_j/\tau)} |
| \right] |
| \label{eq:itc} |
| \end{equation} |
|
|
| with the softmax temperature $\tau = 0.07$, matching CLIP and the |
| CXR-BERT recipe. The dataset is de-duplicated to one image per study |
| before batching, because the text embedding is study-level. Trainable |
| parameters are restricted to the projection module and the contrastive |
| head; the encoder and the classifier are frozen, and the language |
| model is not even instantiated during this phase. Phase 1 |
| runs for two epochs at peak learning rate $1\!\times\!10^{-3}$ under a |
| cosine schedule with a 5\% warm-up. Because Vicuna is not loaded, $\approx\!13$\,GB |
| of GPU memory is free, which we exploit to push the per-device batch |
| size to 64--96 and thereby enlarge the InfoNCE negative pool --- |
| known to be the dominant factor in contrastive-alignment |
| quality~\citep{clip, blip2}. The contrastive head is discarded at the |
| end of Phase~1; only the projection weights are carried forward. |
|
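Equation~\ref{eq:itc} can be written out directly for small batches. The sketch below is a plain-Python illustration on lists of already L2-normalised vectors, summing over the batch exactly as the equation does; the function names are hypothetical, and a real training loop would use a tensor library instead.

```python
import math


def log_softmax_diag(sim):
    """Sum over i of log softmax(sim[i])[i] for a square similarity matrix."""
    total = 0.0
    for i, row in enumerate(sim):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in row))
        total += row[i] - log_z
    return total


def symmetric_info_nce(v, t, tau=0.07):
    """Symmetric InfoNCE over image embeddings v and text embeddings t."""
    # sim[i][j] = v_i . t_j / tau  (image-to-text direction)
    sim = [[sum(a * b for a, b in zip(vi, tj)) / tau for tj in t] for vi in v]
    # Transpose gives t_i . v_j / tau  (text-to-image direction)
    sim_t = [list(col) for col in zip(*sim)]
    return -0.5 * (log_softmax_diag(sim) + log_softmax_diag(sim_t))
```

With a batch of $B$ studies each positive pair competes against $B - 1$ in-batch negatives in each direction, which is why the batch size matters for this objective.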
|
| \subsection{Phase 2 --- Instruction Tuning} |
| \label{sec:phase2} |
|
|
|
|
| Phase 2 reinstates the language model and switches to autoregressive |
| fine-tuning under the standard causal-language-modelling loss. The |
| projection weights from Phase~1 are loaded into the freshly-built |
| full model; the LoRA adapters of Equation~\ref{eq:lora} are |
| initialised as described in Section~\ref{sec:llm}, so their update is |
| zero at the start. Trainable parameters are the projection module and the |
| LoRA adapters on the four attention projections of all 32 transformer |
| blocks. The encoder, the classifier, and the quantised base weights |
| of Vicuna remain frozen. |
|
|
| Mini-batches are drawn from the union of the three task corpora --- |
| findings, impression, and VQA --- with sampling weights $0.3$, |
| $0.2$, and $0.5$ respectively. The findings/impression ratio favours |
| findings because findings is the harder task (no textual context), |
| and VQA is weighted equally to the combined report-generation tasks |
| because it is the most diverse subset and benefits most from |
| mini-batch coverage. For each sample, the prompt is constructed |
| according to its task as described in Section~\ref{sec:prompt} and |
| tokenised. The forward pass emits visual tokens via the encoder of |
| Equation~\ref{eq:patches} and the projection of |
| Section~\ref{sec:projection}, and inserts them in place of |
| the \verb|<image>| token; the language model then produces logits |
| $\mathbf{l}_{i, t}$ at every position $t$. The loss is |
|
|
| \begin{equation} |
| \mathcal{L}_{\text{LM}} |
| \;=\; |
| -\,\frac{1}{|\mathcal{T}|} |
| \sum_{(i, t)\,\in\,\mathcal{T}} |
| \log p_{\theta}\!\bigl(y_{i, t}\,\bigm|\,y_{i, < t},\,\mathbf{V}_i,\,\mathbf{c}_i\bigr) |
| \label{eq:lm} |
| \end{equation} |
|
|
| where $\mathcal{T}$ is the set of $(i, t)$ pairs whose label entry is |
| non-$-100$ (i.e.\ assistant tokens only), $\mathbf{V}_i$ are the |
| visual tokens, and $\mathbf{c}_i$ is the textual prompt context. |
| Phase 2 runs for ten epochs at peak learning rate |
| $2\!\times\!10^{-4}$ under a cosine schedule with a 5\% warm-up. The effective batch |
| size is held at 16 across all hardware profiles by varying the |
| per-device batch and gradient-accumulation step count in inverse |
| proportion. The optimiser is AdamW~\citep{adamw} with weight decay |
| $0.01$ on all non-bias, non-LayerNorm parameters, applied to FP32 |
| master copies of the projection and LoRA weights; the quantised base |
| LLM acts as a fixed nonlinearity. |
|
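The masking in Equation~\ref{eq:lm} can be illustrated with a minimal plain-Python sketch. It assumes the logits have already been shifted so that row $t$ predicts label $t$, and returning zero for a fully masked batch is an assumption; the function name is hypothetical.

```python
import math

IGNORE_INDEX = -100


def masked_lm_loss(logits, labels):
    """Mean negative log-likelihood over positions with a real label.

    logits: list of per-position score lists; labels: list of token ids,
    with IGNORE_INDEX at prompt, padding, and visual-token positions.
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # position lies outside the set T of assistant tokens
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0
```

Dividing by the number of unmasked positions corresponds to the $1/|\mathcal{T}|$ factor, so long prompts do not dilute the loss.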
|
| \subsection{Hyperparameters} |
|
|
| Table~\ref{tab:hparams} consolidates the hyperparameter settings of |
| both phases. The values are inherited from RaDialog with two |
| adjustments. The Phase~1 batch size was swept over |
| $\{16, 32, 64, 96\}$; 64 was selected as the smallest batch that lay |
| within $0.3\%$ of the best validation contrastive accuracy. The |
| Phase~2 peak learning rate was swept over |
| $\{1{\times}10^{-4}, 2{\times}10^{-4}, 5{\times}10^{-4}\}$; $5{\times}10^{-4}$ |
| diverged within $\approx\!300$ optimiser steps and $2{\times}10^{-4}$ |
| was unambiguously best on the validation loss. |
|
|
| \begin{table}[ht] |
| \centering |
| \small |
| \caption{Training hyperparameters for the two phases.} |
| \label{tab:hparams} |
| \begin{tabular}{lll} |
| \toprule |
| \textbf{Hyperparameter} & \textbf{Phase 1 (ITC)} & \textbf{Phase 2 (LM)} \\ |
| \midrule |
| Trainable modules & Projection + ITC head & Projection + LoRA \\ |
| Frozen modules & Encoder, classifier; LLM not loaded & Encoder, classifier, base LLM \\ |
| Loss & Symmetric InfoNCE (Eq.~\ref{eq:itc}) & Causal CE (Eq.~\ref{eq:lm}) \\ |
| Temperature $\tau$ & $0.07$ & --- \\ |
| Epochs & 2 & 10 \\ |
| Peak learning rate & $1\times 10^{-3}$ & $2\times 10^{-4}$ \\ |
| LR schedule & cosine, 5\% warm-up & cosine, 5\% warm-up \\ |
| Weight decay & $0.01$ & $0.01$ \\ |
| Effective batch size & 64--96 & 16 \\ |
| Mixed precision & BF16 (Ampere+) / FP16 (Turing) & BF16 / FP16 \\ |
| LLM quantisation & --- (not loaded) & 4-bit NF4, double-quant \\ |
| LoRA $(r, \alpha, p)$ & --- & $(16, 32, 0.05)$ \\ |
| LoRA modules & --- & $\{q, k, v, o\}_{\text{proj}}$ \\ |
| Cutoff length & --- & 512 tokens \\ |
| Optimiser & AdamW & AdamW \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| |
| \section{Inference Pipeline} |
| \label{sec:inference} |
| |
|
|
| At inference, all parameters are frozen and the model is invoked once |
| per chest radiograph and task: one pass through the vision encoder, |
| classifier, and projection, followed by autoregressive greedy |
| decoding (\verb|do_sample=False|, \verb|num_beams=1|) up to a |
| task-dependent maximum of newly generated tokens. The configuration |
| also sets \verb|temperature=0.1|, which has no effect when sampling |
| is disabled. Algorithm~\ref{alg:inference} formalises the |
| procedure. |
|
|
| \begin{algorithm}[h] |
| \caption{Inference Procedure of CXR-VLM} |
| \label{alg:inference} |
| \begin{algorithmic}[1] |
| \Require Chest radiograph $\mathbf{X}$, task $\mathcal{T} \in \{\textsc{findings}, \textsc{impression}, \textsc{vqa}\}$, optional textual side-input $\mathbf{u}$ |
| \Ensure Generated text $\mathbf{R}$ (report section or VQA answer) |
| \State $\mathbf{P} \gets \operatorname{RAD-DINO}(\mathbf{X})$ \Comment{$\mathbb{R}^{1369 \times 768}$, frozen} |
| \State $\mathbf{c}_{\texttt{[CLS]}} \gets$ class-token of the same forward pass |
| \State $\mathbf{Z} \gets \operatorname{MLP}(\mathbf{c}_{\texttt{[CLS]}}) \in \mathbb{R}^{14 \times 3}$ |
| \State $\hat{\mathbf{y}}_j \gets \operatorname{softmax}(\mathbf{Z}_j)$, \, $c_j \gets \arg\max_k \hat{\mathbf{y}}_{j,k}$ \quad for $j = 1, \dots, 14$ |
| \State $\textsc{pnu} \gets \operatorname{Serialise}(\{c_j\}_{j=1}^{14})$ \Comment{P/N/U text block} |
| \State $\mathbf{V} \gets \operatorname{Projection}(\mathbf{P}) \in \mathbb{R}^{32 \times 4096}$ |
| \If{$\mathcal{T} = \textsc{findings}$} |
| \State context $\gets$ \verb|""| |
| \State instruction $\gets$ canonical findings instruction |
| \ElsIf{$\mathcal{T} = \textsc{impression}$} |
| \State context $\gets$ ``\texttt{Findings:}\ $\mathbf{u}$'' |
| \State instruction $\gets$ canonical impression instruction |
| \ElsIf{$\mathcal{T} = \textsc{vqa}$} |
| \State context $\gets$ \verb|""| |
| \State instruction $\gets \mathbf{u}$ \Comment{the question itself} |
| \EndIf |
| \State $\mathbf{p} \gets \operatorname{BuildPrompt}(\textsc{system}, \textsc{pnu}, \text{context}, \text{instruction})$ |
| \State $\mathbf{e} \gets \operatorname{Tokenize}(\mathbf{p})$, then substitute $\mathbf{V}$ at the \verb|<image>| position |
| \State $\mathbf{R} \gets \operatorname{Vicuna+LoRA}(\mathbf{e})$ via greedy decoding up to $L_\mathcal{T}$ new tokens |
| \State \Return $\mathbf{R}$ |
| \end{algorithmic} |
| \end{algorithm} |
|
|
| The maximum number of newly generated tokens $L_\mathcal{T}$ is set |
| to 300 for findings, 200 for impression, and 64 for VQA, matching |
| the 99th-percentile reference length per task. For the impression |
| task, the textual side-input $\mathbf{u}$ is the ground-truth |
| findings paragraph in the experiments of |
| Chapter~\ref{ch:experiments}, mirroring the teacher-forced training |
| regime; an end-to-end cascade variant that feeds the model's own |
| generated findings as $\mathbf{u}$ is technically supported by the |
| codebase but not reported as a primary result. |
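|
| For reference, the greedy decoding used in the final step of |
| Algorithm~\ref{alg:inference} can be written as follows, where $r_t$ |
| denotes the $t$-th generated token and $\mathcal{W}$ the vocabulary |
| (both symbols are introduced here for illustration): |
| \[ |
| r_t \;=\; \arg\max_{w \in \mathcal{W}} \; |
| p_{\theta}\bigl(w \,\bigm|\, r_{<t},\, \mathbf{e}\bigr), |
| \qquad t = 1, \dots, L_{\mathcal{T}}, |
| \] |
| with generation stopping early once $r_t$ is the end-of-sequence |
| token. Because dividing the logits by a positive temperature does not |
| change their $\arg\max$, the decoded text does not depend on the |
| temperature setting. |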
|
|
| |
| \section{Evaluation Protocol} |
| \label{sec:eval} |
| |
|
|
| The evaluation reports a metric suite tailored to each of the three |
| downstream tasks. For findings and impression generation we report |
| both natural-language-generation (NLG) metrics --- lexical, fluency, |
| and embedding-based --- and the clinical-correctness metric that has |
| become the de facto standard for chest-X-ray report generation. For |
| VQA we report a parallel suite adapted to short-form answers, |
| optionally augmented with an LLM-as-judge evaluation for clinical |
| equivalence. Table~\ref{tab:metrics} summarises the inventory. |
|
|
| \begin{table}[ht] |
| \centering |
| \small |
| \caption{Evaluation metrics by task.} |
| \label{tab:metrics} |
| \begin{tabular}{lll} |
| \toprule |
| \textbf{Family} & \textbf{Metric} & \textbf{Tasks} \\ |
| \midrule |
| Lexical n-gram & BLEU-1 & findings, impression, VQA \\ |
| & BLEU-4 & findings, impression \\ |
| & ROUGE-1, ROUGE-2, ROUGE-L & findings, impression \\ |
| Fluency / synonym & METEOR & findings, impression, VQA \\ |
| Embedding semantic & BERTScore F1 & findings, impression, VQA \\ |
| Clinical accuracy & CheXbert macro F1, P, R & findings, impression \\ |
| Exact answer & Exact match, token F1 & VQA \\ |
| Semantic judge & GPT-4o-mini score (0--5) & VQA (optional) \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| \subsection{Held-Out Data and Decoding} |
|
|
| All metrics are computed on the patient-disjoint test split of |
| Section~\ref{sec:subset}: 5{,}000 studies whose subject IDs appear in |
| neither the training nor the validation split. Greedy decoding is used |
| throughout (Section~\ref{sec:inference}); the canonical instruction |
| variant of each task's prompt template is used so that the reported |
| numbers reflect the model's single-shot best guess and remain |
| reproducible across runs. For findings and impression the |
| classifier-predicted PNU block populates the prompt at test time; for |
| the impression task, the ground-truth findings string populates the |
| task-specific context slot as in training. For the IU-X-Ray |
| out-of-distribution evaluation, the entire pipeline is applied |
| without modification, with PNU labels predicted from images directly. |
|
|
| \subsection{Natural Language Generation Metrics} |
| \label{sec:nlg} |
|
|
| \paragraph{BLEU.} Corpus-level BLEU-1 and BLEU-4~\citep{bleu} are |
| computed with NLTK's smoothing method~1, which replaces zero $n$-gram |
| match counts with a small constant so that the geometric mean does |
| not collapse to zero on short texts. BLEU is included primarily for |
| comparability with the radiology-report-generation literature; its |
| weak correlation with clinical correctness on this task is well |
| documented, so we treat it as a fluency floor rather than a quality |
| measure. |
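|
| For concreteness, the standard corpus-level definition, assuming |
| uniform $n$-gram weights (the implementation's weights are not stated |
| in this chapter), is |
| \[ |
| \mathrm{BLEU}\text{-}N \;=\; \mathrm{BP} \cdot |
| \exp\Bigl( \tfrac{1}{N} \sum_{n=1}^{N} \log p_n \Bigr), |
| \qquad |
| \mathrm{BP} \;=\; \min\bigl(1,\; e^{\,1 - \ell_{\text{ref}} / \ell_{\text{hyp}}}\bigr), |
| \] |
| where $p_n$ is the clipped $n$-gram precision pooled over the corpus |
| and $\ell_{\text{hyp}}$, $\ell_{\text{ref}}$ are the total hypothesis |
| and reference lengths (symbols introduced here). Smoothing replaces a |
| zero numerator in $p_n$ by a small constant so that $\log p_n$ stays |
| finite. |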
|
|
| \paragraph{ROUGE.} ROUGE-1, ROUGE-2, and ROUGE-L~\citep{rouge} |
| F-measures are computed with Porter stemming enabled --- so that, for |
| instance, \emph{enlarged} and \emph{enlarging} match. ROUGE-L in |
| particular is the most commonly reported single number in prior work |
| on this task (R2Gen, KGAE, RaDialog) and is included for that reason. |
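|
| As a reminder of what the ROUGE-L number measures, a minimal |
| statement of the sentence-level F-measure is given below, assuming |
| equal weighting of precision and recall; $h$ and $g$ denote the |
| stemmed hypothesis and reference token sequences (notation introduced |
| here): |
| \[ |
| P_{\mathrm{lcs}} = \frac{\mathrm{LCS}(h, g)}{\lvert h \rvert}, |
| \qquad |
| R_{\mathrm{lcs}} = \frac{\mathrm{LCS}(h, g)}{\lvert g \rvert}, |
| \qquad |
| \mathrm{ROUGE\text{-}L} = \frac{2\, P_{\mathrm{lcs}}\, R_{\mathrm{lcs}}}{P_{\mathrm{lcs}} + R_{\mathrm{lcs}}}, |
| \] |
| where $\mathrm{LCS}(h, g)$ is the length of the longest common |
| subsequence. ROUGE-1 and ROUGE-2 follow the same pattern with clipped |
| unigram and bigram overlap counts in place of the LCS length. |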
|
|
| \paragraph{METEOR.} METEOR~\citep{meteor} computes a weighted |
| F-measure at the token level, with matches granted not only for exact |
| tokens but also for stems and WordNet synonyms, |
| and a chunk-fragmentation penalty that discourages matches scattered |
| across the sentence. Because the alignment is token-to-token, a |
| multi-word paraphrase such as |
| \emph{cardiomegaly}~$\leftrightarrow$~\emph{enlarged heart} is |
| credited only through its individually matching tokens. Of the three |
| n-gram-style metrics, METEOR is the one expected to track human |
| judgement most closely on radiology reports, where paraphrasing is |
| common; we report it alongside BLEU and ROUGE rather |
| than as a replacement. |
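|
| The scoring rule can be summarised as follows, using the parameter |
| values of the original METEOR formulation as an assumption |
| ($\alpha = 0.9$, $\beta = 3$, $\gamma = 0.5$; the implementation used |
| here may be configured differently). With $m$ matched tokens, |
| $P = m / \lvert h \rvert$ and $R = m / \lvert g \rvert$ for hypothesis |
| $h$ and reference $g$, and $ch$ the number of contiguous matched |
| chunks, |
| \[ |
| F_{\text{mean}} = \frac{P\,R}{\alpha P + (1 - \alpha) R}, |
| \qquad |
| \mathrm{Pen} = \gamma \Bigl( \frac{ch}{m} \Bigr)^{\beta}, |
| \qquad |
| \mathrm{METEOR} = F_{\text{mean}}\,(1 - \mathrm{Pen}). |
| \] |
| A hypothesis whose matches form one contiguous run has $ch = 1$ and |
| a small penalty, whereas matches scattered one token at a time give |
| $ch = m$ and the maximum penalty $\gamma$. |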
|
|
| \paragraph{BERTScore.} BERTScore~\citep{bertscore} computes |
| greedy-aligned cosine similarities between the contextual embeddings |
| of hypothesis and reference, then aggregates to a precision, recall, |
| and F1. We report F1 with the lightweight |
| \texttt{distilbert-base-uncased} backbone for speed, and verify on a |
| subset that switching to PubMedBERT |
| (\texttt{BiomedNLP-PubMedBERT-base-uncased-abstract}) preserves the |
| relative ordering of methods. BERTScore captures semantic |
| equivalence that the n-gram metrics miss but, unlike the |
| clinical-accuracy metric below, has no notion of clinical |
| \emph{correctness} --- it can reward a fluent paraphrase that flips a |
| positive finding to a negative one. |
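|
| In symbols, with $\mathbf{x}_i$ the L2-normalised contextual |
| embeddings of the reference tokens and $\hat{\mathbf{x}}_j$ those of |
| the hypothesis tokens, and assuming no idf weighting (this chapter |
| does not state whether it is enabled), |
| \[ |
| R_{\mathrm{BERT}} = \frac{1}{\lvert x \rvert} \sum_{i} \max_{j}\, \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j, |
| \qquad |
| P_{\mathrm{BERT}} = \frac{1}{\lvert \hat{x} \rvert} \sum_{j} \max_{i}\, \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j, |
| \qquad |
| F_{\mathrm{BERT}} = \frac{2\, P_{\mathrm{BERT}}\, R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}. |
| \] |
| Each token is matched to its most similar counterpart on the other |
| side, which is why a negated paraphrase can still score highly: most |
| of its tokens have a close match even though the finding is flipped. |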
|
|
| \subsection{Clinical Accuracy: CheXbert F1} |
| \label{sec:clinical-f1} |
|
|
| The clinical-accuracy metric is the macro-averaged F1 of CheXbert |
| labels~\citep{chexbert}, the standard for chest-X-ray report |
| generation since its introduction. The CheXbert labeler is applied to |
| both the generated and the reference report, producing for each a |
| 14-element observation vector whose entries are positive ($1$), |
| negative ($0$), uncertain ($-1$), or blank (not mentioned). Following |
| the convention of prior work, we collapse negative, uncertain, and |
| blank to not-positive and compute, per pathology, |
|
|
| \begin{equation} |
| \mathrm{F1}_j \;=\; \frac{2\,\mathrm{precision}_j\,\mathrm{recall}_j}{\mathrm{precision}_j + \mathrm{recall}_j}, |
| \end{equation} |
|
|
| then macro-average across the 14 pathologies. Precision and recall |
| are reported alongside F1 to disambiguate the trade-off when the |
| overall numbers are close between methods. Macro-averaging |
| (rather than micro) is chosen because the CheXpert label distribution |
| is heavy-tailed and micro-averaging would let the most prevalent |
| labels dominate the score. The CheXbert checkpoint is the public |
| release from Stanford ML Group; when it is not available on a given |
| host the metric degrades gracefully to a logged zero rather than |
| failing the evaluation run. |
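|
| Spelled out, with $\mathrm{TP}_j$, $\mathrm{FP}_j$, and |
| $\mathrm{FN}_j$ counting the test studies in which pathology $j$ is |
| positive in both reports, in the generated report only, and in the |
| reference report only (symbols introduced here for illustration), |
| \[ |
| \mathrm{precision}_j = \frac{\mathrm{TP}_j}{\mathrm{TP}_j + \mathrm{FP}_j}, |
| \qquad |
| \mathrm{recall}_j = \frac{\mathrm{TP}_j}{\mathrm{TP}_j + \mathrm{FN}_j}, |
| \qquad |
| \mathrm{F1}_{\text{macro}} = \frac{1}{14} \sum_{j=1}^{14} \mathrm{F1}_j . |
| \] |
| The handling of a zero denominator (for instance, a pathology never |
| predicted positive) is not specified in this chapter; scoring that |
| term as zero is one common convention and is stated here only as an |
| assumption. |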
|
|
| \subsection{Visual Question Answering Metrics} |
| \label{sec:vqa-metrics} |
|
|
| VQA targets are short --- often a single word or phrase --- so the |
| suite differs from generation. |
|
|
| \paragraph{Exact match.} Hypothesis and reference are lower-cased, |
| stripped of punctuation, and whitespace-normalised, then compared |
| for string equality. Exact match is a lower bound on correctness: |
| it is harsh on open-ended phrasing (``mild pleural effusion'' vs.\ |
| ``small effusion'') but rewards the closed-form yes/no and |
| quantitative questions that dominate the dataset. |
|
|
| \paragraph{Token F1.} For each $(\text{hypothesis}, \text{reference})$ |
| pair we compute the F1 between the bags of normalised tokens. This |
| is the standard short-answer metric inherited from SQuAD-style |
| extractive QA, and it is the most diagnostic single number for VQA |
| on this dataset. |
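|
| A short worked example may help. Take, purely for illustration, the |
| hypothesis ``small effusion'' and the reference ``mild pleural |
| effusion''. The two bags of tokens share one token |
| (\emph{effusion}), so with $o$ the size of the multiset overlap and |
| $\lvert h \rvert$, $\lvert g \rvert$ the hypothesis and reference |
| lengths (notation introduced here), |
| \[ |
| P = \frac{o}{\lvert h \rvert} = \frac{1}{2}, |
| \qquad |
| R = \frac{o}{\lvert g \rvert} = \frac{1}{3}, |
| \qquad |
| \mathrm{F1} = \frac{2PR}{P + R} = \frac{2\,o}{\lvert h \rvert + \lvert g \rvert} = \frac{2}{5} = 0.4 . |
| \] |
| Exact match scores this pair as $0$, whereas token F1 gives partial |
| credit for the shared finding. |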
|
|
| \paragraph{BLEU-1, METEOR, BERTScore.} Reported for symmetry with the |
| generation tasks: BLEU-1 captures unigram overlap, METEOR adds synonym |
| matching, BERTScore F1 captures semantic similarity. BLEU-4 (rarely |
| meaningful on short answers) and ROUGE-L (subsumed by token F1) are |
| omitted. |
|
|
| \paragraph{LLM-as-judge.} For free-form answers where the metrics |
| above misalign with clinical equivalence (e.g.\ ``cardiomegaly'' vs |
| ``enlarged cardiac silhouette''), we additionally evaluate a sample |
| of the test set with GPT-4o-mini~\citep{openai-gpt4o} as a zero-shot |
| judge. Following G-Eval~\citep{geval}, the judge is prompted with the |
| question, the reference, and the prediction, and asked to return an |
| integer score on a $0$--$5$ rubric ($5$~=~clinical equivalence, |
| $0$~=~contradiction) together with a one-sentence rationale. We report the |
| mean score on the $0$--$5$ scale and its normalisation to $[0, 1]$ |
| for comparability with the other metrics. The judge runs at |
| temperature $0$ with \verb|response_format=json_object| so that its |
| output parses reliably and varies as little as possible between runs; |
| the codebase supports Gemini and local Ollama |
| back-ends behind the same interface, so the evaluation can be rerun |
| without an OpenAI subscription. To keep cost bounded, the judge is |
| applied to at most $500$ randomly-sampled VQA items. |
|
|
| \subsection{Model Selection and Reporting} |
|
|
| Model selection during training uses the validation |
| \verb|eval_loss| --- the causal cross-entropy on assistant tokens, |
| the same loss minimised in Equation~\ref{eq:lm}. We deliberately do |
| \emph{not} use any of the downstream metrics for checkpoint selection, |
| to avoid the optimistic bias of optimising on the same signal that is |
| later reported. CheXbert F1 in particular is expensive to compute and |
| unstable across small validation samples, which makes it unsuitable |
| as a training-loop criterion regardless. |
|
|
| Each evaluation run writes per-task prediction files containing the |
| image path, the reference, the generated text, and (for VQA) the |
| question, alongside an aggregated metric summary. Confidence intervals |
| are obtained by bootstrap resampling of the per-sample metric values |
| (1{,}000 resamples, 95\% percentile interval); corpus-level BLEU and |
| CheXbert F1, which are non-decomposable, are reported as point |
| estimates. |
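|
| The percentile bootstrap described above can be written as follows, |
| where $m_1, \dots, m_n$ are the per-sample values of a given metric |
| and the remaining symbols are introduced here for illustration. For |
| each resample $b = 1, \dots, 1000$, indices |
| $i^{(b)}_1, \dots, i^{(b)}_n$ are drawn uniformly with replacement |
| from $\{1, \dots, n\}$ and |
| \[ |
| \bar{m}^{(b)} \;=\; \frac{1}{n} \sum_{k=1}^{n} m_{i^{(b)}_k}, |
| \qquad |
| \mathrm{CI}_{95\%} \;=\; \bigl[\, q_{0.025},\; q_{0.975} \,\bigr], |
| \] |
| where $q_{\alpha}$ is the empirical $\alpha$-quantile of |
| $\{\bar{m}^{(b)}\}_{b=1}^{1000}$. This construction needs a metric |
| that is a mean of per-sample values, which is why the corpus-level |
| scores are excluded. |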
|
|