File size: 48,333 Bytes

2c84a70

% =============================================================================
%  Chapter: Methodology
%
%  Structure follows the META-CXR (IEEE Access 2025) layout:
%    III. METHODOLOGY
%      A. Data Materials
%      B. Module 1 (visual extractor)
%      C. Module 2 (cross-modal alignment)
%      D. Module 3 (abnormality classifier)
%      E. Module 4 (language model)
%      F. Prompt Construction
%      G. Two-Phase Training
%      H. Inference Pipeline
%      I. Evaluation Protocol
%
%  The reported configuration is the only configuration described:
%  one frontal view per study, impression conditioned on the
%  ground-truth findings string, etc.  The codebase exposes other
%  options but they are not part of this report.
% =============================================================================

\chapter{Methodology}
\label{ch:method}

This chapter presents the proposed Chest X-Ray Vision--Language Model
(CXR-VLM) in nine subsections. The first describes the data
materials and the pre-processing pipeline that produces the training
corpus from MIMIC-CXR. The next four subsections describe the four
modules that constitute the model itself --- a frozen vision encoder,
a learnable cross-modal projection, an auxiliary abnormality
classifier, and a frozen large language model fitted with
parameter-efficient adapters. Section~\ref{sec:prompt} explains how
these modules are tied together by a single prompt template that
specialises to each downstream task. Section~\ref{sec:training} lays
out the two-phase training schedule, with the contrastive alignment
loss of Phase~1 and the instruction-tuning loss of Phase~2 stated
explicitly. Section~\ref{sec:inference} formalises the inference
procedure as an algorithm. Section~\ref{sec:eval} closes the chapter
by specifying the evaluation protocol --- the held-out splits, the
metrics, and the model-selection criterion --- under which the
experimental results of Chapter~\ref{ch:experiments} are reported.

% TODO INSERT FIGURE: full pipeline diagram
% File: figures/pipeline_overview.pdf   (the SVG already generated by
%       scripts/generate_pipeline_svg.py can be exported to PDF)
% Caption: ``Overview of the proposed CXR-VLM. A frozen RAD-DINO image
%   encoder produces patch features that are pooled by a learnable
%   MLP projection into 32 visual tokens. The PNU abnormality string,
%   produced by a frozen 14-pathology CheXpert classifier sharing the
%   same encoder, is appended to the textual prompt. A frozen Vicuna-7B
%   language model with rank-16 LoRA adapters consumes the joint
%   visual--textual sequence and emits the radiology report.''

% =============================================================================
\section{Data Materials}
\label{sec:data}
% =============================================================================

\subsection{Datasets}

Three publicly-available chest radiograph datasets are used in this
work; each plays a distinct role.

\paragraph{MIMIC-CXR-JPG~\citep{mimic-cxr-jpg}.} The primary training
and evaluation corpus. MIMIC-CXR-JPG is the JPEG-converted release of
MIMIC-CXR~\citep{mimic-cxr}, comprising 377{,}110 chest radiographs
from 227{,}827 imaging studies of 65{,}379 patients, paired with
free-text radiology reports. Each study is additionally accompanied
by a row in \texttt{mimic-cxr-2.0.0-chexpert.csv}, in which the
CheXbert NLP labeler~\citep{chexbert} extracts a $14$-pathology
abnormality vector from the report text, with values in
$\{-1, 0, 1, \text{blank}\}$ denoting uncertain, negative, positive,
and not-mentioned respectively. The official patient-disjoint train /
validate / test split in \texttt{mimic-cxr-2.0.0-split.csv} is
used as the starting point for our subset construction.

\paragraph{MIMIC-Ext-CXR-VQA~\citep{mimic-cxr-vqa}.} A visual
question--answering extension of MIMIC-CXR that provides
1.4\,million $(\text{image}, \text{question}, \text{answer})$ triples
covering presence, location, severity, and comparative reasoning over
the same 14 pathologies. Each triple references a MIMIC-CXR study by
\texttt{dicom\_id}, so it is filtered down to the subset described
below by simple set intersection.

\paragraph{IU X-Ray~\citep{iuxray}.} A smaller open-access dataset of
3{,}955 fully de-identified chest X-ray studies with associated
reports, used as an out-of-distribution evaluation corpus to
assess cross-domain generalisation of the report-generation tasks.
Following the convention of prior work~\citep{r2gen, metransformer},
we adopt a 7:1:2 train/validate/test partition at the study level.

\subsection{Subset Construction}
\label{sec:subset}

The full MIMIC-CXR collection is too large to train on with the
hardware budget of this work, and its native distribution is heavily
imbalanced both in view types (frontal vs lateral exposures) and in
report formatting. We therefore curate a 50{,}000-study subset by a
four-stage filter chain, applied once and then re-used across all
experiments. The filter is documented here at a level intended to be
sufficient for reproducibility.

% TODO INSERT FIGURE: EDA from the notebook
% File: figures/eda_pathology_prevalence.pdf
%   (per-pathology prevalence: full MIMIC -> eligible pool -> final
%    subset, side-by-side bar chart). Source notebook: data/build_subset_*.ipynb
% Caption: ``Per-pathology prevalence across the full MIMIC-CXR
%   distribution, the eligible pool after report/view filtering, and
%   the final stratified subset. Absolute deltas remain under 0.7
%   percentage points on every label, confirming that stratification
%   preserves the clinical distribution.''

\paragraph{(a) Frontal-view selection.} Each DICOM is joined with
\texttt{mimic-cxr-2.0.0-metadata.csv} by \texttt{dicom\_id}, and only
rows whose \texttt{ViewPosition} field is PA or AP are retained. A
study with multiple frontal exposures is collapsed to one image,
preferring the PA projection where available because PA is the
standard reference projection while AP is reserved for bedside or
supine portable studies. Lateral views are discarded because they are
clinically complementary but stylistically and anatomically distinct
from frontals; mixing them with frontals would inject input
heterogeneity that the model has no architectural mechanism to handle.
After this step every retained study contributes exactly one
frontal image to the corpus.

\paragraph{(b) Report parsing.} Each study's \texttt{.txt} file is
scanned by a strict regular expression that recognises section
headers of the form \verb|^[A-Z ,/().-]+:|. A section body is
accepted only if the upper-cased header is exactly \texttt{FINDINGS}
or \texttt{IMPRESSION}; near-synonyms such as \texttt{CONCLUSION},
\texttt{WET READ}, \texttt{INDICATION}, or composite headers like
\texttt{FINDINGS AND IMPRESSION} are deliberately not merged because
they vary substantially in style and would dilute the training signal.
A study survives this stage only if both sections are present and
non-empty --- a requirement made stricter than RaDialog because, as
explained in Section~\ref{sec:prompt}, the impression-generation
task in this work is conditioned on the ground-truth findings string,
so a study without a clean findings section cannot supply a usable
impression sample.

\paragraph{(c) Length-based outlier removal.} Word counts are computed
per section; the upper cutoff is set to $Q_3 + 1.5\cdot\mathrm{IQR}$
per section, dropping multi-paragraph teaching reports that sit far
from the median, and a lower floor of two words for findings and one
word for impression is imposed. The combination removes the long tail
without affecting the median statistics.

\paragraph{(d) Stratified patient-disjoint sampling.} The remaining
eligible pool is sampled to 40{,}000 train / 5{,}000 validation /
5{,}000 test studies. Each study is first assigned a \emph{stratum}
equal to its rarest positive CheXpert label (the rarity order is
computed over the eligible pool, so e.g.\ \emph{Pleural Other},
\emph{Lung Lesion}, and \emph{Fracture} sit at the top); studies with
no positive label fall into \texttt{No Finding} or \texttt{None}.
Sampling allocates the target count per stratum proportionally to
prevalence and draws within each stratum, preferring studies that
also have an associated VQA triple in MIMIC-Ext-CXR-VQA so that the
VQA task receives more supervision. The validation and test pools are
populated first from the official PhysioNet splits; if a target count
exceeds the official pool, the remainder is drawn from the train pool
and the affected subject ids are then removed from train. The three
resulting splits are therefore patient-disjoint --- no subject appears
in more than one split. Per-pathology prevalence deltas between the
final subset and the eligible pool remain below 0.7 percentage points
on every label, confirming that stratification did not skew clinical
coverage.

\subsection{Image Pre-processing}

Because every training step ultimately resizes the input to the
encoder's native $518\!\times\!518$ window (Section~\ref{sec:encoder}),
all corpus images are resized once, offline, so the shortest edge
equals 518 with the longer edge proportional, and saved at JPEG
quality 90 --- near-lossless on grayscale chest radiographs. The 50k
subset compresses from $\sim$100\,GB of full-resolution JPEGs down to
$\sim$5--8\,GB, which fits comfortably on a single Hugging Face Hub
repository for distribution. At training time the dataset loader
applies the encoder's standard pre-processing (centre crop to
$518\!\times\!518$, channel-wise normalisation with the ImageNet mean
and standard deviation) on the resized JPEGs.

% =============================================================================
\section{Visual Feature Extractor}
\label{sec:encoder}
% =============================================================================

% TODO INSERT FIGURE: RAD-DINO / DINOv2 ViT-B/14 architecture
% Source: original RAD-DINO paper (P\'erez-Garc\'ia et al., 2024) Figure 1
%   or DINOv2 paper (Oquab et al., 2023) Figure 1 - either is fine.
% Caption: ``Architecture of the RAD-DINO visual backbone. The
%   $518\times518$ chest radiograph is divided into $14\times14$
%   patches, linearly projected, and processed by a 12-layer ViT-B
%   transformer pre-trained with the DINOv2 self-distillation objective
%   on $\approx$840k chest radiographs. The class token is discarded
%   downstream; only the 1369 patch tokens are passed to the
%   projection.''

The visual backbone is Microsoft's RAD-DINO~\citep{raddino}, a
ViT-B/14 vision transformer pre-trained with the DINOv2 self-distil\-lation
objective~\citep{dinov2} on $\approx$840{,}000 chest radiographs drawn
from MIMIC-CXR, CheXpert, NIH ChestX-ray14, and PadChest. The
\texttt{microsoft/rad-dino} checkpoint is loaded through the
HuggingFace \texttt{transformers} hub and accepts inputs at its native
resolution of $518\!\times\!518$. At this resolution each image is
tokenised into $37\!\times\!37 = 1369$ patches of size
$14\!\times\!14$, linearly projected to a 768-dimensional embedding,
and processed by a 12-layer transformer with 12 attention heads per
layer. The class token at the head of the sequence is discarded
downstream, so the encoder output consumed by the projection is

\begin{equation}
  \mathbf{P} \;=\; \operatorname{RAD-DINO}(\mathbf{X}) \;\in\; \mathbb{R}^{B \times 1369 \times 768}
  \label{eq:patches}
\end{equation}

where $\mathbf{X} \in \mathbb{R}^{B \times 3 \times 518 \times 518}$
is the pre-processed image batch. The choice of RAD-DINO over more
commonly-used backbones such as BioViL-T~\citep{biovilt} or
generic ImageNet-pretrained ViTs is motivated by three factors: it is
trained on roughly an order of magnitude more chest-X-ray data than
BioViL-T; its DINOv2 self-distillation objective yields denser, more
spatially-coherent patch features than contrastive
text--image alignment objectives, which matters for the focal
abnormalities (nodules, small consolidations) the report-generation
task needs to mention; and it is distributed as a standard
HuggingFace model, free of the legacy dependency chain of BioViL-T's
\texttt{hi-ml-multimodal} package.

The encoder is kept entirely frozen throughout this work. Allowing
gradients to flow back into an 86\,M-parameter ViT, on top of the
quantised LLM activations in the same model, exhausts the activation
budget of consumer GPUs, and the empirical observation in
RaDialog~\citep{radialog} and LLaVA-Med~\citep{llavamed} that
unfreezing the encoder before the projection has converged tends to
collapse the joint representation gives independent reason for caution.
The pre-trained patch features are instead specialised to the
chest-X-ray report distribution by the projection module described
next, under the explicit alignment objective of Phase~1.

% =============================================================================
\section{Cross-Modal Projection}
\label{sec:projection}
% =============================================================================

% TODO INSERT FIGURE: projection module schematic
% File: figures/projection_module.pdf
% Caption: ``Architecture of the cross-modal MLP projection. A learnable
%   query bank of 32 tokens cross-attends over the 1369 patch features
%   to pool them into a fixed visual-token budget. A two-layer MLP
%   then lifts the pooled tokens from 768 to 4096 dimensions, the
%   embedding size of Vicuna-7B. The intermediate 1024-d activation
%   is exposed to the ITC head used during Phase 1 of training.''

The projection module serves two roles: pool the variable-length
patch sequence from Equation~\ref{eq:patches} into a fixed
visual-token budget the language model can attend to, and lift each
visual token from the encoder's 768-dimensional embedding space into
the 4096-dimensional embedding space of Vicuna-7B. We adopt a design
in the spirit of the perceiver resampler used by
Flamingo~\citep{flamingo} and the Q-Former of BLIP-2~\citep{blip2},
but without the heavy multi-layer transformer stack of either: a
single learnable query bank cross-attends over the patches, followed
by a two-layer MLP. This light-weight choice was first demonstrated
on the chest-X-ray task by RaDialog~v2~\citep{radialog}, who report
that the gain from a deeper Q-Former does not materialise on radiology
metrics.

Concretely, the projection holds three learnable components:

\begin{enumerate}
  \item A query bank
  $\mathbf{Q}_0 \in \mathbb{R}^{32 \times 768}$, initialised from
  $\mathcal{N}(0, 0.02^2)$ and broadcast to a batched
  $\mathbf{Q}\in\mathbb{R}^{B\times 32\times 768}$ at forward time.
  \item An 8-head cross-attention block at width 768, with dropout 0.1.
  \item A two-layer MLP $\mathbf{W}_1 \in \mathbb{R}^{768\times 1024}$,
  GELU, dropout, $\mathbf{W}_2 \in \mathbb{R}^{1024\times 4096}$, all
  initialised from $\mathcal{N}(0, 0.02^2)$ with zero biases.
\end{enumerate}

For a batch of patch features $\mathbf{P}_i \in \mathbb{R}^{1369
\times 768}$, the projection emits visual tokens
$\mathbf{V}_i \in \mathbb{R}^{32 \times 4096}$ in three steps:

\begin{align}
  \mathbf{H}^{(0)}_i &= \operatorname{CrossAttn}\!\left(\mathbf{Q}_i, \mathbf{P}_i, \mathbf{P}_i\right) \in \mathbb{R}^{32 \times 768} \label{eq:crossattn}\\
  \mathbf{H}^{(1)}_i &= \operatorname{Dropout}\!\left(\operatorname{GELU}(\mathbf{W}_1 \mathbf{H}^{(0)}_i)\right) \in \mathbb{R}^{32 \times 1024} \label{eq:tap}\\
  \mathbf{V}_i &= \mathbf{W}_2\, \mathbf{H}^{(1)}_i \in \mathbb{R}^{32 \times 4096}
\end{align}

Three design choices deserve explanation. First, the number of visual
tokens is fixed at 32, matching RaDialog and sitting inside the
empirical sweet spot reported by BLIP-2 and LLaVA: below 16, the
pool is too coarse for small focal pathologies; above 64, the LLM's
sequence length grows without measurable gain on radiology metrics.
Second, the cross-attention is unidirectional --- queries attend over
patches, but patches do not attend back. This is the standard
perceiver setup and keeps the projection module small (1.8\,M
parameters) and easy to train under a contrastive objective in
Phase~1. Third, the 1024-d intermediate activation $\mathbf{H}^{(1)}$
in Equation~\ref{eq:tap} is exposed deliberately --- it is the
grounding feature the Phase~1 contrastive head operates on, and
placing it between the GELU and the second linear means the
contrastive objective sees an already-nonlinear representation rather
than the raw cross-attention output.

% =============================================================================
\section{Abnormality Classifier}
\label{sec:chex}
% =============================================================================

% TODO INSERT FIGURE: PNU example
% A two-column figure showing (left) a sample chest X-ray, (right) the
% predicted PNU block as it would appear in the prompt -- something like:
%   "Positive Abnormalities: Cardiomegaly, Pleural Effusion
%    Negative Abnormalities: No Finding, Edema, Pneumothorax, ...
%    Uncertain Abnormalities: Atelectasis"
% Caption: ``Example PNU abnormality block generated by the auxiliary
%   classifier for a representative MIMIC-CXR study. The block is
%   inserted between the visual tokens and the natural-language
%   instruction in the prompt fed to the language model.''

The abnormality classifier is the principal mechanism by which
explicit clinical knowledge enters the language model. Rather than
fusing classifier logits into the visual embedding pathway, we
serialise the classifier's predictions to text and append them to the
prompt. This design is borrowed from the U-MultiClass formulation of
META-CXR~\citep{metacxr}, where each of the 14 CheXpert pathologies
is classified into one of three classes --- positive (P), negative
(N), or uncertain (U) --- rather than collapsed to a binary
present/absent decision. The three-way distinction matters because
the negative class (the report explicitly rules a pathology out)
carries different information from the uncertain class (the report
hedges with phrases such as ``may represent'' or ``cannot exclude''),
and the language model can exploit that distinction when phrasing the
generated report.

The classifier architecture is intentionally small. It consists of a
multi-layer perceptron head on top of the global \texttt{[CLS]} token
of the same frozen RAD-DINO encoder used by the rest of the model.
Sharing the encoder is what makes the design near-free at inference:
the encoder forward is already required for the visual pathway, so
adding the classifier costs only a single $\mathbb{R}^{768} \to
\mathbb{R}^{14 \times 3}$ MLP forward per image. The head produces a
logit tensor

\begin{equation}
  \mathbf{Z} \;=\; \operatorname{MLP}\!\left(\mathbf{c}_{\texttt{[CLS]}}\right) \;\in\; \mathbb{R}^{14 \times 3}
\end{equation}

which is softmaxed along the class axis to yield, for each
pathology $j$, the probability distribution $\hat{\mathbf{y}}_j \in
\Delta^2$ over the three U-MultiClass categories. The predicted
category is the arg-max,

\begin{equation}
  c_j \;=\; \operatorname*{arg\,max}_{c \in \{\text{P}, \text{N}, \text{U}\}} \hat{\mathbf{y}}_{j, c}.
\end{equation}

These per-pathology predictions are then serialised into a
human-readable abnormality block --- which we refer to as the
\emph{PNU string} --- by listing the pathology names that fall into
each of the three categories on three lines:

\begin{verbatim}
Positive Abnormalities: <comma-separated names with c_j = P>
Negative Abnormalities: <comma-separated names with c_j = N>
Uncertain Abnormalities: <comma-separated names with c_j = U>
\end{verbatim}

At training time the PNU string is built directly from the
\texttt{mimic-cxr-2.0.0-chexpert.csv} ground-truth labels (the
classifier-as-oracle setting), so that the language model learns to
condition on \emph{reliable} abnormality information from the very
first step. At inference time the classifier's own predictions are
used, with missing-value labels in the CSV mapped to negative for
consistency. The classifier itself is fitted in a brief preparatory
phase (Section~\ref{sec:phase0}) and frozen for all subsequent
training.

% =============================================================================
\section{Language Model with Low-Rank Adaptation}
\label{sec:llm}
% =============================================================================

% TODO INSERT FIGURE: LoRA decomposition
% Source: original LoRA paper (Hu et al., 2021) Figure 1
% Caption: ``Low-Rank Adaptation (LoRA). For each adapted weight matrix
%   $W_0 \in \mathbb{R}^{d \times d}$ in the language model, a low-rank
%   update $\Delta W = BA$ with $A \in \mathbb{R}^{r\times d}$, $B\in
%   \mathbb{R}^{d \times r}$ and $r \ll d$ is added at inference. Only
%   $A$ and $B$ are trained; the original matrix $W_0$ remains frozen
%   and 4-bit quantised. We set $r=16$, $\alpha = 32$.''

The decoder is Vicuna-7B v1.3~\citep{vicuna}, a LLaMA-1 derivative
fine-tuned on ShareGPT conversational data. We selected the v1.3
release rather than the larger v1.5 (LLaMA-2 base) or more recent
Llama-3 models for two practical reasons: it is the exact language
model used by the RaDialog baseline against which our work is most
directly comparable, and its chat template employs a single
unambiguous \verb|USER: ... ASSISTANT: ...| structure that
simplifies the label-masking required at training time.

The base model is loaded in 4-bit NF4 quantisation through the
\texttt{BitsAndBytesConfig} interface of the \texttt{bitsandbytes}
library~\citep{qlora}, with double quantisation enabled and the
compute dtype matched to the host GPU (BF16 on Ampere-or-newer cards,
FP16 on Turing). This reduces the resident weight footprint from
$\approx\!14$\,GB in FP16 to $\approx\!4.0$\,GB, which fits
comfortably alongside the encoder, projection, and activation budget
on a single 16\,GB GPU.

Vicuna-7B is otherwise kept frozen; adaptation is performed entirely
through LoRA~\citep{lora}. For each attention weight matrix
$\mathbf{W}_0 \in \mathbb{R}^{4096 \times 4096}$ in the
$\{q, k, v, o\}_{\text{proj}}$ projections of every transformer
block, we introduce a low-rank update

\begin{equation}
  \mathbf{W}' \;=\; \mathbf{W}_0 \;+\; \frac{\alpha}{r}\,\mathbf{B}\mathbf{A},
  \qquad
  \mathbf{A} \in \mathbb{R}^{r \times 4096},
  \quad
  \mathbf{B} \in \mathbb{R}^{4096 \times r},
  \quad
  r \ll 4096
  \label{eq:lora}
\end{equation}

where $\mathbf{A}$ is initialised from $\mathcal{N}(0, \sigma^2)$ and
$\mathbf{B}$ is zero-initialised so that the adapter is the identity
function at the start of training. Only $\mathbf{A}$ and $\mathbf{B}$
receive gradients; the quantised base weight $\mathbf{W}_0$ remains
fixed. We set $r = 16$, $\alpha = 32$, and apply a dropout of
$0.05$ on the LoRA forward, giving an effective adapter scaling of
$\alpha/r = 2$ that we found preferable to $1$ and $4$ in a brief
sweep. The feed-forward sublayers are deliberately left without LoRA
adapters: pilot runs that additionally adapted $\texttt{gate\_proj}$
and the MLP projections enlarged the trainable parameter count by
$3$--$4\times$ without a corresponding gain on the downstream
clinical-accuracy metric. With this configuration the total trainable
parameter count of the language model is $\approx\!19.7$\,M, which
together with the $\approx\!1.8$\,M of the projection brings the
overall trainable budget to $\approx\!21.5$\,M out of $\approx\!7.1$\,B
total --- about $0.30\%$.

% =============================================================================
\section{Prompt Construction}
\label{sec:prompt}
% =============================================================================

All three downstream tasks --- findings generation, impression
generation, and visual question answering --- share a single prompt
template that follows Vicuna's v1.1 chat format:

\begin{verbatim}
{SYSTEM_PROMPT} USER: <image>
{PNU abnormality block}
{task-specific context}
{instruction} ASSISTANT: {target}
\end{verbatim}

The \verb|<image>| token is added to the tokenizer as a special
single-id placeholder at vocabulary slot 32{,}000, immediately above
Vicuna's existing 32{,}000-token vocabulary. At forward time the
model locates this token in the tokenised input and substitutes the
32 visual tokens $\mathbf{V}$ from the projection in its place,
expanding the attention mask and label tensor by 31 positions per
sample so the causal mask stays consistent. The label tensor entries
for visual-token positions are set to $-100$, the
\texttt{torch.nn.CrossEntropyLoss} ignore index, so that no loss is
incurred on visual tokens.

The fields \verb|{PNU abnormality block}|, \verb|{task-specific context}|,
and \verb|{instruction}| are populated differently per task.

\paragraph{Findings generation.} The model is asked to produce the
findings paragraph from the frontal radiograph alone, with the PNU
abnormality block as soft clinical guidance. There is no task-specific
context. The instruction slot draws from a pool of ten hand-written
paraphrases (e.g.\ ``Generate the findings section for this chest
X-ray.'') sampled uniformly at random during training; at evaluation
the first variant is used deterministically for reproducibility.

\paragraph{Impression generation.} The impression is by clinical
convention a summary of the findings, so we frame the
impression task as conditional summarisation rather than free
generation. The task-specific context slot is populated with the
ground-truth findings text, prefixed by the literal token
\texttt{Findings:}. The model therefore receives the frontal image,
the PNU block, the ground-truth findings paragraph, and an instruction
drawn from a pool of ten impression-specific paraphrases (e.g.\
``Generate the impression section for this chest X-ray.''). Studies
whose report lacks a findings section emit no impression sample, by
construction of the filter chain in Section~\ref{sec:subset}. We
acknowledge that this is a teacher-forced setting: the impression
metrics reported in Chapter~\ref{ch:experiments} are an upper bound on
what a fully end-to-end cascade --- in which the model's own
generated findings feeds the impression prompt --- would achieve, and
we discuss the gap in the limitations section of that chapter.

\paragraph{Visual question answering.} The task-specific context slot
is empty, and the instruction slot is the natural-language question
drawn from MIMIC-Ext-CXR-VQA. The reference answer in the dataset
supplies the target.

For every task, the system prompt is the radiologist persona used by
RaDialog --- ``A chat between a curious user and an artificial
intelligence assistant acting as an experienced radiologist''. The
full prompt is tokenised with a 512-token right-truncation budget. We
choose right-truncation deliberately: a too-long prompt should erode
the target on the right, where labels matter for the loss, rather
than the system prompt or PNU block on the left, both of which carry
non-redundant information. The label tensor is masked with $-100$ at
every prompt-token, padding-token, and visual-token position, so the
language-modelling loss is computed strictly on the assistant response.

% =============================================================================
\section{Two-Phase Training}
\label{sec:training}
% =============================================================================

Training proceeds in three stages: a brief preparatory stage that
fits the abnormality classifier head, then two main phases that train
the model itself. The two-phase split mirrors the standard practice
of BLIP-2, LLaVA, and RaDialog: in the first phase the alignment
module is fitted under an explicit contrastive objective without
touching the language model, and in the second phase the full model
is fine-tuned end-to-end under the autoregressive language-modelling
objective.

\subsection{Preparatory Phase: Abnormality Classifier}
\label{sec:phase0}

The PNU classifier of Section~\ref{sec:chex} is fitted before either
of the two main phases. The frozen RAD-DINO encoder is forwarded over
every training image and the resulting \texttt{[CLS]} embeddings are
cached to disk; the small MLP head is then trained under fourteen
independent three-way cross-entropies against the
$\{\text{positive}, \text{negative}, \text{uncertain}\}$ targets
derived from \texttt{mimic-cxr-2.0.0-chexpert.csv} (with missing-value
labels mapped to negative). Because the encoder is frozen and its
features are cached, this phase converges in a handful of epochs on a
single GPU and is unremarkable beyond noting that the resulting
checkpoint is loaded read-only by both of the subsequent phases.

\subsection{Phase 1 --- Contrastive Alignment}
\label{sec:phase1}

% TODO INSERT FIGURE: contrastive alignment schematic
% A two-tower diagram: (top) frozen RAD-DINO -> projection -> ITC head
% -> L2-norm 128-d image embedding;  (bottom) frozen CXR-BERT ->
% projected 128-d text embedding from the findings paragraph; an
% InfoNCE block between the two towers.
% Caption: ``Phase 1 contrastive alignment. The projection is trained
%   to map the visual representation of a chest radiograph close to
%   the CXR-BERT text representation of the corresponding findings
%   paragraph under a symmetric InfoNCE loss, with all other study
%   pairings in the batch acting as negatives. The text embeddings
%   are pre-computed offline; the language model is not loaded
%   during this phase.''

Phase 1 specialises the projection to the chest-X-ray report
distribution without involving the language model. The projection
output is augmented with a small contrastive head that mean-pools
the 32 intermediate-dimensional tokens of Equation~\ref{eq:tap},
projects them to a 128-dimensional joint embedding space, and
L2-normalises the result:

\begin{equation}
  \mathbf{v}_i \;=\; \frac{\mathbf{W}_{\text{itc}}\,\bar{\mathbf{H}}^{(1)}_i}{\bigl\lVert \mathbf{W}_{\text{itc}}\,\bar{\mathbf{H}}^{(1)}_i \bigr\rVert_2},
  \qquad
  \bar{\mathbf{H}}^{(1)}_i \;=\; \frac{1}{32}\sum_{k=1}^{32} \mathbf{H}^{(1)}_{i,k}
  \label{eq:itc-head}
\end{equation}

with $\mathbf{W}_{\text{itc}} \in \mathbb{R}^{128 \times 1024}$ the
contrastive projection. On the text side, the canonical reference
sentence per study --- the findings paragraph, falling back to the
impression where findings is absent --- is encoded once, offline, by
the frozen CXR-BERT-specialized model~\citep{cxrbert} through its
\verb|get_projected_text_embeddings| interface, which itself emits a
128-dimensional L2-normalised vector $\mathbf{t}_i$. These per-study
text embeddings are written to disk and re-used across every Phase~1
run; no text encoder is loaded at training time.

The model is then trained to maximise the cosine similarity between
$\mathbf{v}_i$ and the text embedding $\mathbf{t}_i$ of the same
study, while pushing $\mathbf{v}_i$ away from the text embeddings of
the other studies in the batch. We use the symmetric InfoNCE loss of
CLIP~\citep{clip}:

\begin{equation}
  \mathcal{L}_{\text{ITC}}
  \;=\;
  -\tfrac{1}{2}\!\left[
    \sum_{i=1}^{B}\!\log\!\frac{\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_i/\tau)}
        {\sum_{j=1}^{B}\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_j/\tau)}
    \;+\;
    \sum_{i=1}^{B}\!\log\!\frac{\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_i/\tau)}
        {\sum_{j=1}^{B}\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_j/\tau)}
  \right]
  \label{eq:itc}
\end{equation}

with the softmax temperature $\tau = 0.07$, matching CLIP and the
CXR-BERT recipe. The dataset is de-duplicated to one image per study
before batching, because the text embedding is study-level. Trainable
parameters are restricted to the projection MLP and the contrastive
head; the encoder, the classifier, and the language model (which is
not even instantiated during this phase) are all frozen. Phase 1
runs for two epochs at peak learning rate $1\!\times\!10^{-3}$ with a
5\% cosine warm-up. Because Vicuna is not loaded, $\approx\!13$\,GB
of GPU memory is free, which we exploit to push the per-device batch
size to 64--96 and thereby enlarge the InfoNCE negative pool ---
known to be the dominant factor in contrastive-alignment
quality~\citep{clip, blip2}. The contrastive head is discarded at the
end of Phase~1; only the projection weights are carried forward.

\subsection{Phase 2 --- Instruction Tuning}
\label{sec:phase2}

% TODO INSERT FIGURE: phase 2 schematic (optional)
% Could show the full model with the loss flowing only on assistant
% tokens. Or omit if the overview figure already covers it.

Phase 2 reinstates the language model and switches to autoregressive
fine-tuning under the standard causal-language-modelling loss. The
projection weights from Phase~1 are loaded into the freshly-built
full model; the LoRA adapters of Equation~\ref{eq:lora} are
zero-initialised. Trainable parameters are the projection MLP and the
LoRA adapters on the four attention projections of all 32 transformer
blocks. The encoder, the classifier, and the quantised base weights
of Vicuna remain frozen.

Mini-batches are drawn from the union of the three task corpora ---
findings, impression, and VQA --- with sampling weights $0.3$,
$0.2$, and $0.5$ respectively. The findings/impression ratio favours
findings because findings is the harder task (no textual context),
and VQA is weighted equally to the combined report-generation tasks
because it is the most diverse subset and benefits most from
mini-batch coverage. For each sample, the prompt is constructed
according to its task as described in Section~\ref{sec:prompt} and
tokenised. The forward pass emits visual tokens via
Equation~\ref{eq:patches}--\ref{eq:tap} and inserts them in place of
the \verb|<image>| token; the language model then produces logits
$\mathbf{l}_{i, t}$ at every position $t$. The loss is

\begin{equation}
  \mathcal{L}_{\text{LM}}
  \;=\;
  -\,\frac{1}{|\mathcal{T}|}
    \sum_{(i, t)\,\in\,\mathcal{T}}
      \log p_{\theta}\!\bigl(y_{i, t}\,\bigm|\,y_{i, < t},\,\mathbf{V}_i,\,\mathbf{c}_i\bigr)
  \label{eq:lm}
\end{equation}

where $\mathcal{T}$ is the set of $(i, t)$ pairs whose label entry is
non-$-100$ (i.e.\ assistant tokens only), $\mathbf{V}_i$ are the
visual tokens, and $\mathbf{c}_i$ is the textual prompt context.
Phase 2 runs for ten epochs at peak learning rate
$2\!\times\!10^{-4}$ with a 5\% cosine warm-up. The effective batch
size is held at 16 across all hardware profiles by varying the
per-device batch and gradient-accumulation step count in inverse
proportion. The optimiser is AdamW~\citep{adamw} with weight decay
$0.01$ on all non-bias, non-LayerNorm parameters, applied to FP32
master copies of the projection and LoRA weights; the quantised base
LLM acts as a fixed nonlinearity.

\subsection{Hyperparameters}

Table~\ref{tab:hparams} consolidates the hyperparameter settings of
both phases. The values are inherited from RaDialog with two
adjustments. The Phase~1 batch size was swept over
$\{16, 32, 64, 96\}$; 64 was selected as the smallest batch that lay
within $0.3\%$ of the best validation contrastive accuracy. The
Phase~2 peak learning rate was swept over
$\{1{\times}10^{-4}, 2{\times}10^{-4}, 5{\times}10^{-4}\}$; $5{\times}10^{-4}$
diverged within $\approx\!300$ optimiser steps and $2{\times}10^{-4}$
was unambiguously best on the validation loss.

\begin{table}[ht]
\centering
\small
\caption{Training hyperparameters for the two phases.}
\label{tab:hparams}
\begin{tabular}{lll}
\toprule
\textbf{Hyperparameter} & \textbf{Phase 1 (ITC)} & \textbf{Phase 2 (LM)} \\
\midrule
Trainable modules        & Projection + ITC head            & Projection + LoRA \\
Frozen modules           & Encoder, classifier; LLM not loaded & Encoder, classifier, base LLM \\
Loss                     & Symmetric InfoNCE (Eq.~\ref{eq:itc}) & Causal CE (Eq.~\ref{eq:lm}) \\
Temperature $\tau$       & $0.07$                            & --- \\
Epochs                   & 2                                 & 10 \\
Peak learning rate       & $1\times 10^{-3}$                 & $2\times 10^{-4}$ \\
LR schedule              & cosine, 5\% warm-up               & cosine, 5\% warm-up \\
Weight decay             & $0.01$                            & $0.01$ \\
Effective batch size     & 64--96                            & 16 \\
Mixed precision          & BF16 (Ampere+) / FP16 (Turing)    & BF16 / FP16 \\
LLM quantisation         & --- (not loaded)                  & 4-bit NF4, double-quant \\
LoRA $(r, \alpha, p)$    & ---                               & $(16, 32, 0.05)$ \\
LoRA modules             & ---                               & $\{q, k, v, o\}_{\text{proj}}$ \\
Cutoff length            & ---                               & 512 tokens \\
Optimiser                & AdamW                             & AdamW \\
\bottomrule
\end{tabular}
\end{table}

% =============================================================================
\section{Inference Pipeline}
\label{sec:inference}
% =============================================================================

At inference, all parameters are frozen and the model is invoked as a
single forward pass per chest radiograph and task, with greedy
decoding (\verb|do_sample=False|, \verb|num_beams=1|,
\verb|temperature=0.1|) up to a task-dependent maximum of newly
generated tokens. Algorithm~\ref{alg:inference} formalises the
procedure.

\begin{algorithm}[h]
\caption{Inference Procedure of CXR-VLM}
\label{alg:inference}
\begin{algorithmic}[1]
\Require Chest radiograph $\mathbf{X}$, task $\mathcal{T} \in \{\textsc{findings}, \textsc{impression}, \textsc{vqa}\}$, optional textual side-input $\mathbf{u}$
\Ensure Generated report $\mathbf{R}$
\State $\mathbf{P} \gets \operatorname{RAD-DINO}(\mathbf{X})$ \Comment{$\mathbb{R}^{1369 \times 768}$, frozen}
\State $\mathbf{c}_{\texttt{[CLS]}} \gets$ class-token of the same forward pass
\State $\mathbf{Z} \gets \operatorname{MLP}(\mathbf{c}_{\texttt{[CLS]}}) \in \mathbb{R}^{14 \times 3}$
\State $\hat{\mathbf{y}}_j \gets \operatorname{softmax}(\mathbf{Z}_j)$, \, $c_j \gets \arg\max_c \hat{\mathbf{y}}_{j,c}$
\State $\textsc{pnu} \gets \operatorname{Serialise}(\{c_j\}_{j=1}^{14})$ \Comment{P/N/U text block}
\State $\mathbf{V} \gets \operatorname{Projection}(\mathbf{P}) \in \mathbb{R}^{32 \times 4096}$
\If{$\mathcal{T} = \textsc{findings}$}
  \State context $\gets$ \verb|""|
  \State instruction $\gets$ canonical findings instruction
\ElsIf{$\mathcal{T} = \textsc{impression}$}
  \State context $\gets$ ``\texttt{Findings:}\ $\mathbf{u}$''
  \State instruction $\gets$ canonical impression instruction
\ElsIf{$\mathcal{T} = \textsc{vqa}$}
  \State context $\gets$ \verb|""|
  \State instruction $\gets \mathbf{u}$ \Comment{the question itself}
\EndIf
\State $\mathbf{p} \gets \operatorname{BuildPrompt}(\textsc{system}, \textsc{pnu}, \text{context}, \text{instruction})$
\State $\mathbf{e} \gets \operatorname{Tokenize}(\mathbf{p})$, then substitute $\mathbf{V}$ at the \verb|<image>| position
\State $\mathbf{R} \gets \operatorname{Vicuna+LoRA}(\mathbf{e})$ via greedy decoding up to $L_\mathcal{T}$ new tokens
\State \Return $\mathbf{R}$
\end{algorithmic}
\end{algorithm}

The maximum number of newly generated tokens $L_\mathcal{T}$ is set
to 300 for findings, 200 for impression, and 64 for VQA, matching
the 99th-percentile reference length per task. For the impression
task, the textual side-input $\mathbf{u}$ is the ground-truth
findings paragraph in the experiments of
Chapter~\ref{ch:experiments}, mirroring the teacher-forced training
regime; an end-to-end cascade variant that feeds the model's own
generated findings as $\mathbf{u}$ is technically supported by the
codebase but not reported as a primary result.

% =============================================================================
\section{Evaluation Protocol}
\label{sec:eval}
% =============================================================================

The evaluation reports a metric suite tailored to each of the three
downstream tasks. For findings and impression generation we report
both natural-language-generation (NLG) metrics --- lexical, fluency,
and embedding-based --- and the clinical-correctness metric that has
become the de facto standard for chest-X-ray report generation. For
VQA we report a parallel suite adapted to short-form answers,
optionally augmented with an LLM-as-judge evaluation for clinical
equivalence. Table~\ref{tab:metrics} summarises the inventory.

\begin{table}[ht]
\centering
\small
\caption{Evaluation metrics by task.}
\label{tab:metrics}
\begin{tabular}{lll}
\toprule
\textbf{Family} & \textbf{Metric} & \textbf{Tasks} \\
\midrule
Lexical n-gram      & BLEU-1, BLEU-4              & findings, impression, VQA \\
                    & ROUGE-1, ROUGE-2, ROUGE-L   & findings, impression \\
Fluency / synonym   & METEOR                      & findings, impression, VQA \\
Embedding semantic  & BERTScore F1                & findings, impression, VQA \\
Clinical accuracy   & CheXbert macro F1, P, R     & findings, impression \\
Exact answer        & Exact match, token F1       & VQA \\
Semantic judge      & GPT-4o-mini score (0--5)    & VQA (optional) \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Held-Out Data and Decoding}

All metrics are computed on the patient-disjoint test split of
Section~\ref{sec:subset}: 5{,}000 studies whose subject ids appear in
neither the train nor the validation split. Greedy decoding is used
throughout (Section~\ref{sec:inference}); the canonical instruction
variant of each task's prompt template is used so that the reported
numbers reflect the model's single-shot best guess and remain
reproducible across runs. For findings and impression the
classifier-predicted PNU block populates the prompt at test time; for
the impression task, the ground-truth findings string populates the
task-specific context slot as in training. For the IU-X-Ray
out-of-distribution evaluation, the entire pipeline is applied
without modification, with PNU labels predicted from images directly.

\subsection{Natural Language Generation Metrics}
\label{sec:nlg}

\paragraph{BLEU.} Corpus-level BLEU-1 and BLEU-4~\citep{bleu} are
computed with NLTK's smoothing method~1 to avoid zero-probability
collapse on short references. BLEU is included primarily for
comparability with the radiology-report-generation literature; its
weak correlation with clinical correctness on this task is well
documented, so we treat it as a fluency floor rather than a quality
measure.

\paragraph{ROUGE.} ROUGE-1, ROUGE-2, and ROUGE-L~\citep{rouge}
F-measures are computed with Porter stemming enabled --- so that, for
instance, \emph{enlarged} and \emph{enlarging} match. ROUGE-L in
particular is the most commonly reported single number in prior work
on this task (R2Gen, KGAE, RaDialog) and is included for that reason.

\paragraph{METEOR.} METEOR~\citep{meteor} computes a weighted
F-measure at the token level, with matches granted not only for exact
tokens but also for stems and WordNet synonyms
(e.g.\ \emph{cardiomegaly}~$\leftrightarrow$~\emph{enlarged heart}),
and a chunk-fragmentation penalty that discourages matches scattered
across the sentence. Of the three n-gram-style metrics, METEOR
correlates best with human judgement on radiology reports, where
paraphrasing is rampant; we report it alongside BLEU and ROUGE rather
than as a replacement.

\paragraph{BERTScore.} BERTScore~\citep{bertscore} computes
greedy-aligned cosine similarities between the contextual embeddings
of hypothesis and reference, then aggregates to a precision, recall,
and F1. We report F1 with the default
\texttt{distilbert-base-uncased} backbone for speed, and verify on a
subset that switching to PubMedBERT
(\texttt{BiomedNLP-PubMedBERT-base-uncased-abstract}) preserves the
relative ordering of methods. BERTScore captures semantic
equivalence that the n-gram metrics miss but, unlike the
clinical-accuracy metric below, has no notion of clinical
\emph{correctness} --- it can reward a fluent paraphrase that flips a
positive finding to a negative one.

\subsection{Clinical Accuracy: CheXbert F1}
\label{sec:clinical-f1}

The clinical-accuracy metric is the macro-averaged F1 of CheXbert
labels~\citep{chexbert}, the standard for chest-X-ray report
generation since its introduction. The CheXbert labeler is applied to
both the generated and the reference report, producing a 14-element
abnormality vector with values in $\{-1, 0, 1\}$ for each. Following
the convention of prior work, we collapse $\{-1, 0\}$ to
not-positive and compute, per pathology,

\begin{equation}
  \mathrm{F1}_j \;=\; \frac{2\,\mathrm{precision}_j\,\mathrm{recall}_j}{\mathrm{precision}_j + \mathrm{recall}_j},
\end{equation}

then macro-average across the 14 pathologies. Precision and recall
are reported alongside F1 to disambiguate the trade-off when the
overall numbers are close between methods. Macro-averaging
(rather than micro) is chosen because the CheXpert label distribution
is heavy-tailed and micro-averaging would let the most prevalent
labels dominate the score. The CheXbert checkpoint is the public
release from Stanford ML Group; when it is not available on a given
host the metric degrades gracefully to a logged zero rather than
failing the evaluation run.

\subsection{Visual Question Answering Metrics}
\label{sec:vqa-metrics}

VQA targets are short --- often a single word or phrase --- so the
suite differs from generation.

\paragraph{Exact match.} Hypothesis and reference are lower-cased,
stripped of punctuation, and collapsed on whitespace, then compared
for string equality. Exact match is the lower bound on correctness;
it is harsh on open-ended phrasing (``mild pleural effusion'' vs
``small effusion'') but rewards the closed-form yes/no and
quantitative questions that dominate the dataset.

\paragraph{Token F1.} For each $(\text{hypothesis}, \text{reference})$
pair we compute the F1 between the bags of normalised tokens. This
is the standard short-answer metric inherited from SQuAD-style
extractive QA, and it is the most diagnostic single number for VQA
on this dataset.

\paragraph{BLEU-1, METEOR, BERTScore.} Reported for symmetry with the
generation tasks: BLEU-1 captures unigram overlap, METEOR adds synonym
matching, BERTScore F1 captures semantic similarity. BLEU-4 (rarely
meaningful on short answers) and ROUGE-L (subsumed by token F1) are
omitted.

\paragraph{LLM-as-judge.} For free-form answers where the metrics
above misalign with clinical equivalence (e.g.\ ``cardiomegaly'' vs
``enlarged cardiac silhouette''), we additionally evaluate a sample
of the test set with GPT-4o-mini~\citep{openai-gpt4o} as a zero-shot
judge. Following G-Eval~\citep{geval}, the judge is prompted with the
question, the reference, and the prediction, and asked to return an
integer score on a $0$--$5$ rubric ($5$~=~clinical equivalence, $0$~=
~contradiction) together with a one-sentence rationale. We report the
mean score on the $0$--$5$ scale and its normalisation to $[0, 1]$
for comparability with the other metrics. The judge runs at
temperature $0$ with \verb|response_format=json_object| for
deterministic parsing; the codebase supports Gemini and local Ollama
back-ends behind the same interface so the result is reproducible
without an OpenAI subscription. To keep cost bounded, the judge is
applied to at most $500$ randomly-sampled VQA items.

\subsection{Model Selection and Reporting}

Model selection during training uses the validation
\verb|eval_loss| --- the causal cross-entropy on assistant tokens,
the same loss minimised in Equation~\ref{eq:lm}. We deliberately do
\emph{not} use any of the downstream metrics for checkpoint selection,
to avoid the optimistic bias of optimising on the same signal that is
later reported. CheXbert F1 in particular is expensive to compute and
unstable across small validation samples, which makes it unsuitable
as a training-loop criterion regardless.

Each evaluation run writes per-task prediction files containing the
image path, the reference, the generated text, and (for VQA) the
question, alongside an aggregated metric summary. Confidence intervals
are obtained by bootstrap resampling of the per-sample metric values
(1{,}000 resamples, 95\% percentile interval); corpus-level BLEU and
CheXbert F1, which are non-decomposable, are reported as point
estimates.