convitom

2c84a70 3 days ago

48.3 kB

	% =============================================================================
	% Chapter: Methodology
	%
	% Structure follows the META-CXR (IEEE Access 2025) layout:
	% III. METHODOLOGY
	% A. Data Materials
	% B. Module 1 (visual extractor)
	% C. Module 2 (cross-modal alignment)
	% D. Module 3 (abnormality classifier)
	% E. Module 4 (language model)
	% F. Prompt Construction
	% G. Two-Phase Training
	% H. Inference Pipeline
	% I. Evaluation Protocol
	%
	% The reported configuration is the only configuration described:
	% one frontal view per study, impression conditioned on the
	% ground-truth findings string, etc. The codebase exposes other
	% options but they are not part of this report.
	% =============================================================================

	\chapter{Methodology}
	\label{ch:method}

	This chapter presents the proposed Chest X-Ray Vision--Language Model
	(CXR-VLM) in nine subsections. The first describes the data
	materials and the pre-processing pipeline that produces the training
	corpus from MIMIC-CXR. The next four subsections describe the four
	modules that constitute the model itself --- a frozen vision encoder,
	a learnable cross-modal projection, an auxiliary abnormality
	classifier, and a frozen large language model fitted with
	parameter-efficient adapters. Section~\ref{sec:prompt} explains how
	these modules are tied together by a single prompt template that
	specialises to each downstream task. Section~\ref{sec:training} lays
	out the two-phase training schedule, with the contrastive alignment
	loss of Phase~1 and the instruction-tuning loss of Phase~2 stated
	explicitly. Section~\ref{sec:inference} formalises the inference
	procedure as an algorithm. Section~\ref{sec:eval} closes the chapter
	by specifying the evaluation protocol --- the held-out splits, the
	metrics, and the model-selection criterion --- under which the
	experimental results of Chapter~\ref{ch:experiments} are reported.

	% TODO INSERT FIGURE: full pipeline diagram
	% File: figures/pipeline_overview.pdf (the SVG already generated by
	% scripts/generate_pipeline_svg.py can be exported to PDF)
	% Caption: ``Overview of the proposed CXR-VLM. A frozen RAD-DINO image
	% encoder produces patch features that are pooled by a learnable
	% MLP projection into 32 visual tokens. The PNU abnormality string,
	% produced by a frozen 14-pathology CheXpert classifier sharing the
	% same encoder, is appended to the textual prompt. A frozen Vicuna-7B
	% language model with rank-16 LoRA adapters consumes the joint
	% visual--textual sequence and emits the radiology report.''

	% =============================================================================
	\section{Data Materials}
	\label{sec:data}
	% =============================================================================

	\subsection{Datasets}

	Three publicly-available chest radiograph datasets are used in this
	work; each plays a distinct role.

	\paragraph{MIMIC-CXR-JPG~\citep{mimic-cxr-jpg}.} The primary training
	and evaluation corpus. MIMIC-CXR-JPG is the JPEG-converted release of
	MIMIC-CXR~\citep{mimic-cxr}, comprising 377{,}110 chest radiographs
	from 227{,}827 imaging studies of 65{,}379 patients, paired with
	free-text radiology reports. Each study is additionally accompanied
	by a row in \texttt{mimic-cxr-2.0.0-chexpert.csv}, in which the
	CheXbert NLP labeler~\citep{chexbert} extracts a $14$-pathology
	abnormality vector from the report text, with values in
	$\{-1, 0, 1, \text{blank}\}$ denoting uncertain, negative, positive,
	and not-mentioned respectively. The official patient-disjoint train /
	validate / test split in \texttt{mimic-cxr-2.0.0-split.csv} is
	used as the starting point for our subset construction.

	\paragraph{MIMIC-Ext-CXR-VQA~\citep{mimic-cxr-vqa}.} A visual
	question--answering extension of MIMIC-CXR that provides
	1.4\,million $(\text{image}, \text{question}, \text{answer})$ triples
	covering presence, location, severity, and comparative reasoning over
	the same 14 pathologies. Each triple references a MIMIC-CXR study by
	\texttt{dicom\_id}, so it is filtered down to the subset described
	below by simple set intersection.

	\paragraph{IU X-Ray~\citep{iuxray}.} A smaller open-access dataset of
	3{,}955 fully de-identified chest X-ray studies with associated
	reports, used as an out-of-distribution evaluation corpus to
	assess cross-domain generalisation of the report-generation tasks.
	Following the convention of prior work~\citep{r2gen, metransformer},
	we adopt a 7:1:2 train/validate/test partition at the study level.

	\subsection{Subset Construction}
	\label{sec:subset}

	The full MIMIC-CXR collection is too large to train on with the
	hardware budget of this work, and its native distribution is heavily
	imbalanced both in view types (frontal vs lateral exposures) and in
	report formatting. We therefore curate a 50{,}000-study subset by a
	four-stage filter chain, applied once and then re-used across all
	experiments. The filter is documented here at a level intended to be
	sufficient for reproducibility.

	% TODO INSERT FIGURE: EDA from the notebook
	% File: figures/eda_pathology_prevalence.pdf
	% (per-pathology prevalence: full MIMIC -> eligible pool -> final
	% subset, side-by-side bar chart). Source notebook: data/build_subset_*.ipynb
	% Caption: ``Per-pathology prevalence across the full MIMIC-CXR
	% distribution, the eligible pool after report/view filtering, and
	% the final stratified subset. Absolute deltas remain under 0.7
	% percentage points on every label, confirming that stratification
	% preserves the clinical distribution.''

	\paragraph{(a) Frontal-view selection.} Each DICOM is joined with
	\texttt{mimic-cxr-2.0.0-metadata.csv} by \texttt{dicom\_id}, and only
	rows whose \texttt{ViewPosition} field is PA or AP are retained. A
	study with multiple frontal exposures is collapsed to one image,
	preferring the PA projection where available because PA is the
	standard reference projection while AP is reserved for bedside or
	supine portable studies. Lateral views are discarded because they are
	clinically complementary but stylistically and anatomically distinct
	from frontals; mixing them with frontals would inject input
	heterogeneity that the model has no architectural mechanism to handle.
	After this step every retained study contributes exactly one
	frontal image to the corpus.

	\paragraph{(b) Report parsing.} Each study's \texttt{.txt} file is
	scanned by a strict regular expression that recognises section
	headers of the form \verb\|^[A-Z ,/().-]+:\|. A section body is
	accepted only if the upper-cased header is exactly \texttt{FINDINGS}
	or \texttt{IMPRESSION}; near-synonyms such as \texttt{CONCLUSION},
	\texttt{WET READ}, \texttt{INDICATION}, or composite headers like
	\texttt{FINDINGS AND IMPRESSION} are deliberately not merged because
	they vary substantially in style and would dilute the training signal.
	A study survives this stage only if both sections are present and
	non-empty --- a requirement made stricter than RaDialog because, as
	explained in Section~\ref{sec:prompt}, the impression-generation
	task in this work is conditioned on the ground-truth findings string,
	so a study without a clean findings section cannot supply a usable
	impression sample.

	\paragraph{(c) Length-based outlier removal.} Word counts are computed
	per section; the upper cutoff is set to $Q_3 + 1.5\cdot\mathrm{IQR}$
	per section, dropping multi-paragraph teaching reports that sit far
	from the median, and a lower floor of two words for findings and one
	word for impression is imposed. The combination removes the long tail
	without affecting the median statistics.

	\paragraph{(d) Stratified patient-disjoint sampling.} The remaining
	eligible pool is sampled to 40{,}000 train / 5{,}000 validation /
	5{,}000 test studies. Each study is first assigned a \emph{stratum}
	equal to its rarest positive CheXpert label (the rarity order is
	computed over the eligible pool, so e.g.\ \emph{Pleural Other},
	\emph{Lung Lesion}, and \emph{Fracture} sit at the top); studies with
	no positive label fall into \texttt{No Finding} or \texttt{None}.
	Sampling allocates the target count per stratum proportionally to
	prevalence and draws within each stratum, preferring studies that
	also have an associated VQA triple in MIMIC-Ext-CXR-VQA so that the
	VQA task receives more supervision. The validation and test pools are
	populated first from the official PhysioNet splits; if a target count
	exceeds the official pool, the remainder is drawn from the train pool
	and the affected subject ids are then removed from train. The three
	resulting splits are therefore patient-disjoint --- no subject appears
	in more than one split. Per-pathology prevalence deltas between the
	final subset and the eligible pool remain below 0.7 percentage points
	on every label, confirming that stratification did not skew clinical
	coverage.

	\subsection{Image Pre-processing}

	Because every training step ultimately resizes the input to the
	encoder's native $518\!\times\!518$ window (Section~\ref{sec:encoder}),
	all corpus images are resized once, offline, so the shortest edge
	equals 518 with the longer edge proportional, and saved at JPEG
	quality 90 --- near-lossless on grayscale chest radiographs. The 50k
	subset compresses from $\sim$100\,GB of full-resolution JPEGs down to
	$\sim$5--8\,GB, which fits comfortably on a single Hugging Face Hub
	repository for distribution. At training time the dataset loader
	applies the encoder's standard pre-processing (centre crop to
	$518\!\times\!518$, channel-wise normalisation with the ImageNet mean
	and standard deviation) on the resized JPEGs.

	% =============================================================================
	\section{Visual Feature Extractor}
	\label{sec:encoder}
	% =============================================================================

	% TODO INSERT FIGURE: RAD-DINO / DINOv2 ViT-B/14 architecture
	% Source: original RAD-DINO paper (P\'erez-Garc\'ia et al., 2024) Figure 1
	% or DINOv2 paper (Oquab et al., 2023) Figure 1 - either is fine.
	% Caption: ``Architecture of the RAD-DINO visual backbone. The
	% $518\times518$ chest radiograph is divided into $14\times14$
	% patches, linearly projected, and processed by a 12-layer ViT-B
	% transformer pre-trained with the DINOv2 self-distillation objective
	% on $\approx$840k chest radiographs. The class token is discarded
	% downstream; only the 1369 patch tokens are passed to the
	% projection.''

	The visual backbone is Microsoft's RAD-DINO~\citep{raddino}, a
	ViT-B/14 vision transformer pre-trained with the DINOv2 self-distil\-lation
	objective~\citep{dinov2} on $\approx$840{,}000 chest radiographs drawn
	from MIMIC-CXR, CheXpert, NIH ChestX-ray14, and PadChest. The
	\texttt{microsoft/rad-dino} checkpoint is loaded through the
	HuggingFace \texttt{transformers} hub and accepts inputs at its native
	resolution of $518\!\times\!518$. At this resolution each image is
	tokenised into $37\!\times\!37 = 1369$ patches of size
	$14\!\times\!14$, linearly projected to a 768-dimensional embedding,
	and processed by a 12-layer transformer with 12 attention heads per
	layer. The class token at the head of the sequence is discarded
	downstream, so the encoder output consumed by the projection is

	\begin{equation}
	\mathbf{P} \;=\; \operatorname{RAD-DINO}(\mathbf{X}) \;\in\; \mathbb{R}^{B \times 1369 \times 768}
	\label{eq:patches}
	\end{equation}

	where $\mathbf{X} \in \mathbb{R}^{B \times 3 \times 518 \times 518}$
	is the pre-processed image batch. The choice of RAD-DINO over more
	commonly-used backbones such as BioViL-T~\citep{biovilt} or
	generic ImageNet-pretrained ViTs is motivated by three factors: it is
	trained on roughly an order of magnitude more chest-X-ray data than
	BioViL-T; its DINOv2 self-distillation objective yields denser, more
	spatially-coherent patch features than contrastive
	text--image alignment objectives, which matters for the focal
	abnormalities (nodules, small consolidations) the report-generation
	task needs to mention; and it is distributed as a standard
	HuggingFace model, free of the legacy dependency chain of BioViL-T's
	\texttt{hi-ml-multimodal} package.

	The encoder is kept entirely frozen throughout this work. Allowing
	gradients to flow back into an 86\,M-parameter ViT, on top of the
	quantised LLM activations in the same model, exhausts the activation
	budget of consumer GPUs, and the empirical observation in
	RaDialog~\citep{radialog} and LLaVA-Med~\citep{llavamed} that
	unfreezing the encoder before the projection has converged tends to
	collapse the joint representation gives independent reason for caution.
	The pre-trained patch features are instead specialised to the
	chest-X-ray report distribution by the projection module described
	next, under the explicit alignment objective of Phase~1.

	% =============================================================================
	\section{Cross-Modal Projection}
	\label{sec:projection}
	% =============================================================================

	% TODO INSERT FIGURE: projection module schematic
	% File: figures/projection_module.pdf
	% Caption: ``Architecture of the cross-modal MLP projection. A learnable
	% query bank of 32 tokens cross-attends over the 1369 patch features
	% to pool them into a fixed visual-token budget. A two-layer MLP
	% then lifts the pooled tokens from 768 to 4096 dimensions, the
	% embedding size of Vicuna-7B. The intermediate 1024-d activation
	% is exposed to the ITC head used during Phase 1 of training.''

	The projection module serves two roles: pool the variable-length
	patch sequence from Equation~\ref{eq:patches} into a fixed
	visual-token budget the language model can attend to, and lift each
	visual token from the encoder's 768-dimensional embedding space into
	the 4096-dimensional embedding space of Vicuna-7B. We adopt a design
	in the spirit of the perceiver resampler used by
	Flamingo~\citep{flamingo} and the Q-Former of BLIP-2~\citep{blip2},
	but without the heavy multi-layer transformer stack of either: a
	single learnable query bank cross-attends over the patches, followed
	by a two-layer MLP. This light-weight choice was first demonstrated
	on the chest-X-ray task by RaDialog~v2~\citep{radialog}, who report
	that the gain from a deeper Q-Former does not materialise on radiology
	metrics.

	Concretely, the projection holds three learnable components:

	\begin{enumerate}
	\item A query bank
	$\mathbf{Q}_0 \in \mathbb{R}^{32 \times 768}$, initialised from
	$\mathcal{N}(0, 0.02^2)$ and broadcast to a batched
	$\mathbf{Q}\in\mathbb{R}^{B\times 32\times 768}$ at forward time.
	\item An 8-head cross-attention block at width 768, with dropout 0.1.
	\item A two-layer MLP $\mathbf{W}_1 \in \mathbb{R}^{768\times 1024}$,
	GELU, dropout, $\mathbf{W}_2 \in \mathbb{R}^{1024\times 4096}$, all
	initialised from $\mathcal{N}(0, 0.02^2)$ with zero biases.
	\end{enumerate}

	For a batch of patch features $\mathbf{P}_i \in \mathbb{R}^{1369
	\times 768}$, the projection emits visual tokens
	$\mathbf{V}_i \in \mathbb{R}^{32 \times 4096}$ in three steps:

	\begin{align}
	\mathbf{H}^{(0)}_i &= \operatorname{CrossAttn}\!\left(\mathbf{Q}_i, \mathbf{P}_i, \mathbf{P}_i\right) \in \mathbb{R}^{32 \times 768} \label{eq:crossattn}\\
	\mathbf{H}^{(1)}_i &= \operatorname{Dropout}\!\left(\operatorname{GELU}(\mathbf{W}_1 \mathbf{H}^{(0)}_i)\right) \in \mathbb{R}^{32 \times 1024} \label{eq:tap}\\
	\mathbf{V}_i &= \mathbf{W}_2\, \mathbf{H}^{(1)}_i \in \mathbb{R}^{32 \times 4096}
	\end{align}

	Three design choices deserve explanation. First, the number of visual
	tokens is fixed at 32, matching RaDialog and sitting inside the
	empirical sweet spot reported by BLIP-2 and LLaVA: below 16, the
	pool is too coarse for small focal pathologies; above 64, the LLM's
	sequence length grows without measurable gain on radiology metrics.
	Second, the cross-attention is unidirectional --- queries attend over
	patches, but patches do not attend back. This is the standard
	perceiver setup and keeps the projection module small (1.8\,M
	parameters) and easy to train under a contrastive objective in
	Phase~1. Third, the 1024-d intermediate activation $\mathbf{H}^{(1)}$
	in Equation~\ref{eq:tap} is exposed deliberately --- it is the
	grounding feature the Phase~1 contrastive head operates on, and
	placing it between the GELU and the second linear means the
	contrastive objective sees an already-nonlinear representation rather
	than the raw cross-attention output.

	% =============================================================================
	\section{Abnormality Classifier}
	\label{sec:chex}
	% =============================================================================

	% TODO INSERT FIGURE: PNU example
	% A two-column figure showing (left) a sample chest X-ray, (right) the
	% predicted PNU block as it would appear in the prompt -- something like:
	% "Positive Abnormalities: Cardiomegaly, Pleural Effusion
	% Negative Abnormalities: No Finding, Edema, Pneumothorax, ...
	% Uncertain Abnormalities: Atelectasis"
	% Caption: ``Example PNU abnormality block generated by the auxiliary
	% classifier for a representative MIMIC-CXR study. The block is
	% inserted between the visual tokens and the natural-language
	% instruction in the prompt fed to the language model.''

	The abnormality classifier is the principal mechanism by which
	explicit clinical knowledge enters the language model. Rather than
	fusing classifier logits into the visual embedding pathway, we
	serialise the classifier's predictions to text and append them to the
	prompt. This design is borrowed from the U-MultiClass formulation of
	META-CXR~\citep{metacxr}, where each of the 14 CheXpert pathologies
	is classified into one of three classes --- positive (P), negative
	(N), or uncertain (U) --- rather than collapsed to a binary
	present/absent decision. The three-way distinction matters because
	the negative class (the report explicitly rules a pathology out)
	carries different information from the uncertain class (the report
	hedges with phrases such as ``may represent'' or ``cannot exclude''),
	and the language model can exploit that distinction when phrasing the
	generated report.

	The classifier architecture is intentionally small. It consists of a
	multi-layer perceptron head on top of the global \texttt{[CLS]} token
	of the same frozen RAD-DINO encoder used by the rest of the model.
	Sharing the encoder is what makes the design near-free at inference:
	the encoder forward is already required for the visual pathway, so
	adding the classifier costs only a single $\mathbb{R}^{768} \to
	\mathbb{R}^{14 \times 3}$ MLP forward per image. The head produces a
	logit tensor

	\begin{equation}
	\mathbf{Z} \;=\; \operatorname{MLP}\!\left(\mathbf{c}_{\texttt{[CLS]}}\right) \;\in\; \mathbb{R}^{14 \times 3}
	\end{equation}

	which is softmaxed along the class axis to yield, for each
	pathology $j$, the probability distribution $\hat{\mathbf{y}}_j \in
	\Delta^2$ over the three U-MultiClass categories. The predicted
	category is the arg-max,

	\begin{equation}
	c_j \;=\; \operatorname*{arg\,max}_{c \in \{\text{P}, \text{N}, \text{U}\}} \hat{\mathbf{y}}_{j, c}.
	\end{equation}

	These per-pathology predictions are then serialised into a
	human-readable abnormality block --- which we refer to as the
	\emph{PNU string} --- by listing the pathology names that fall into
	each of the three categories on three lines:

	\begin{verbatim}
	Positive Abnormalities: <comma-separated names with c_j = P>
	Negative Abnormalities: <comma-separated names with c_j = N>
	Uncertain Abnormalities: <comma-separated names with c_j = U>
	\end{verbatim}

	At training time the PNU string is built directly from the
	\texttt{mimic-cxr-2.0.0-chexpert.csv} ground-truth labels (the
	classifier-as-oracle setting), so that the language model learns to
	condition on \emph{reliable} abnormality information from the very
	first step. At inference time the classifier's own predictions are
	used, with missing-value labels in the CSV mapped to negative for
	consistency. The classifier itself is fitted in a brief preparatory
	phase (Section~\ref{sec:phase0}) and frozen for all subsequent
	training.

	% =============================================================================
	\section{Language Model with Low-Rank Adaptation}
	\label{sec:llm}
	% =============================================================================

	% TODO INSERT FIGURE: LoRA decomposition
	% Source: original LoRA paper (Hu et al., 2021) Figure 1
	% Caption: ``Low-Rank Adaptation (LoRA). For each adapted weight matrix
	% $W_0 \in \mathbb{R}^{d \times d}$ in the language model, a low-rank
	% update $\Delta W = BA$ with $A \in \mathbb{R}^{r\times d}$, $B\in
	% \mathbb{R}^{d \times r}$ and $r \ll d$ is added at inference. Only
	% $A$ and $B$ are trained; the original matrix $W_0$ remains frozen
	% and 4-bit quantised. We set $r=16$, $\alpha = 32$.''

	The decoder is Vicuna-7B v1.3~\citep{vicuna}, a LLaMA-1 derivative
	fine-tuned on ShareGPT conversational data. We selected the v1.3
	release rather than the larger v1.5 (LLaMA-2 base) or more recent
	Llama-3 models for two practical reasons: it is the exact language
	model used by the RaDialog baseline against which our work is most
	directly comparable, and its chat template employs a single
	unambiguous \verb\|USER: ... ASSISTANT: ...\| structure that
	simplifies the label-masking required at training time.

	The base model is loaded in 4-bit NF4 quantisation through the
	\texttt{BitsAndBytesConfig} interface of the \texttt{bitsandbytes}
	library~\citep{qlora}, with double quantisation enabled and the
	compute dtype matched to the host GPU (BF16 on Ampere-or-newer cards,
	FP16 on Turing). This reduces the resident weight footprint from
	$\approx\!14$\,GB in FP16 to $\approx\!4.0$\,GB, which fits
	comfortably alongside the encoder, projection, and activation budget
	on a single 16\,GB GPU.

	Vicuna-7B is otherwise kept frozen; adaptation is performed entirely
	through LoRA~\citep{lora}. For each attention weight matrix
	$\mathbf{W}_0 \in \mathbb{R}^{4096 \times 4096}$ in the
	$\{q, k, v, o\}_{\text{proj}}$ projections of every transformer
	block, we introduce a low-rank update

	\begin{equation}
	\mathbf{W}' \;=\; \mathbf{W}_0 \;+\; \frac{\alpha}{r}\,\mathbf{B}\mathbf{A},
	\qquad
	\mathbf{A} \in \mathbb{R}^{r \times 4096},
	\quad
	\mathbf{B} \in \mathbb{R}^{4096 \times r},
	\quad
	r \ll 4096
	\label{eq:lora}
	\end{equation}

	where $\mathbf{A}$ is initialised from $\mathcal{N}(0, \sigma^2)$ and
	$\mathbf{B}$ is zero-initialised so that the adapter is the identity
	function at the start of training. Only $\mathbf{A}$ and $\mathbf{B}$
	receive gradients; the quantised base weight $\mathbf{W}_0$ remains
	fixed. We set $r = 16$, $\alpha = 32$, and apply a dropout of
	$0.05$ on the LoRA forward, giving an effective adapter scaling of
	$\alpha/r = 2$ that we found preferable to $1$ and $4$ in a brief
	sweep. The feed-forward sublayers are deliberately left without LoRA
	adapters: pilot runs that additionally adapted $\texttt{gate\_proj}$
	and the MLP projections enlarged the trainable parameter count by
	$3$--$4\times$ without a corresponding gain on the downstream
	clinical-accuracy metric. With this configuration the total trainable
	parameter count of the language model is $\approx\!19.7$\,M, which
	together with the $\approx\!1.8$\,M of the projection brings the
	overall trainable budget to $\approx\!21.5$\,M out of $\approx\!7.1$\,B
	total --- about $0.30\%$.

	% =============================================================================
	\section{Prompt Construction}
	\label{sec:prompt}
	% =============================================================================

	All three downstream tasks --- findings generation, impression
	generation, and visual question answering --- share a single prompt
	template that follows Vicuna's v1.1 chat format:

	\begin{verbatim}
	{SYSTEM_PROMPT} USER: <image>
	{PNU abnormality block}
	{task-specific context}
	{instruction} ASSISTANT: {target}
	\end{verbatim}

	The \verb\|<image>\| token is added to the tokenizer as a special
	single-id placeholder at vocabulary slot 32{,}000, immediately above
	Vicuna's existing 32{,}000-token vocabulary. At forward time the
	model locates this token in the tokenised input and substitutes the
	32 visual tokens $\mathbf{V}$ from the projection in its place,
	expanding the attention mask and label tensor by 31 positions per
	sample so the causal mask stays consistent. The label tensor entries
	for visual-token positions are set to $-100$, the
	\texttt{torch.nn.CrossEntropyLoss} ignore index, so that no loss is
	incurred on visual tokens.

	The fields \verb\|{PNU abnormality block}\|, \verb\|{task-specific context}\|,
	and \verb\|{instruction}\| are populated differently per task.

	\paragraph{Findings generation.} The model is asked to produce the
	findings paragraph from the frontal radiograph alone, with the PNU
	abnormality block as soft clinical guidance. There is no task-specific
	context. The instruction slot draws from a pool of ten hand-written
	paraphrases (e.g.\ ``Generate the findings section for this chest
	X-ray.'') sampled uniformly at random during training; at evaluation
	the first variant is used deterministically for reproducibility.

	\paragraph{Impression generation.} The impression is by clinical
	convention a summary of the findings, so we frame the
	impression task as conditional summarisation rather than free
	generation. The task-specific context slot is populated with the
	ground-truth findings text, prefixed by the literal token
	\texttt{Findings:}. The model therefore receives the frontal image,
	the PNU block, the ground-truth findings paragraph, and an instruction
	drawn from a pool of ten impression-specific paraphrases (e.g.\
	``Generate the impression section for this chest X-ray.''). Studies
	whose report lacks a findings section emit no impression sample, by
	construction of the filter chain in Section~\ref{sec:subset}. We
	acknowledge that this is a teacher-forced setting: the impression
	metrics reported in Chapter~\ref{ch:experiments} are an upper bound on
	what a fully end-to-end cascade --- in which the model's own
	generated findings feeds the impression prompt --- would achieve, and
	we discuss the gap in the limitations section of that chapter.

	\paragraph{Visual question answering.} The task-specific context slot
	is empty, and the instruction slot is the natural-language question
	drawn from MIMIC-Ext-CXR-VQA. The reference answer in the dataset
	supplies the target.

	For every task, the system prompt is the radiologist persona used by
	RaDialog --- ``A chat between a curious user and an artificial
	intelligence assistant acting as an experienced radiologist''. The
	full prompt is tokenised with a 512-token right-truncation budget. We
	choose right-truncation deliberately: a too-long prompt should erode
	the target on the right, where labels matter for the loss, rather
	than the system prompt or PNU block on the left, both of which carry
	non-redundant information. The label tensor is masked with $-100$ at
	every prompt-token, padding-token, and visual-token position, so the
	language-modelling loss is computed strictly on the assistant response.

	% =============================================================================
	\section{Two-Phase Training}
	\label{sec:training}
	% =============================================================================

	Training proceeds in three stages: a brief preparatory stage that
	fits the abnormality classifier head, then two main phases that train
	the model itself. The two-phase split mirrors the standard practice
	of BLIP-2, LLaVA, and RaDialog: in the first phase the alignment
	module is fitted under an explicit contrastive objective without
	touching the language model, and in the second phase the full model
	is fine-tuned end-to-end under the autoregressive language-modelling
	objective.

	\subsection{Preparatory Phase: Abnormality Classifier}
	\label{sec:phase0}

	The PNU classifier of Section~\ref{sec:chex} is fitted before either
	of the two main phases. The frozen RAD-DINO encoder is forwarded over
	every training image and the resulting \texttt{[CLS]} embeddings are
	cached to disk; the small MLP head is then trained under fourteen
	independent three-way cross-entropies against the
	$\{\text{positive}, \text{negative}, \text{uncertain}\}$ targets
	derived from \texttt{mimic-cxr-2.0.0-chexpert.csv} (with missing-value
	labels mapped to negative). Because the encoder is frozen and its
	features are cached, this phase converges in a handful of epochs on a
	single GPU and is unremarkable beyond noting that the resulting
	checkpoint is loaded read-only by both of the subsequent phases.

	\subsection{Phase 1 --- Contrastive Alignment}
	\label{sec:phase1}

	% TODO INSERT FIGURE: contrastive alignment schematic
	% A two-tower diagram: (top) frozen RAD-DINO -> projection -> ITC head
	% -> L2-norm 128-d image embedding; (bottom) frozen CXR-BERT ->
	% projected 128-d text embedding from the findings paragraph; an
	% InfoNCE block between the two towers.
	% Caption: ``Phase 1 contrastive alignment. The projection is trained
	% to map the visual representation of a chest radiograph close to
	% the CXR-BERT text representation of the corresponding findings
	% paragraph under a symmetric InfoNCE loss, with all other study
	% pairings in the batch acting as negatives. The text embeddings
	% are pre-computed offline; the language model is not loaded
	% during this phase.''

	Phase 1 specialises the projection to the chest-X-ray report
	distribution without involving the language model. The projection
	output is augmented with a small contrastive head that mean-pools
	the 32 intermediate-dimensional tokens of Equation~\ref{eq:tap},
	projects them to a 128-dimensional joint embedding space, and
	L2-normalises the result:

	\begin{equation}
	\mathbf{v}_i \;=\; \frac{\mathbf{W}_{\text{itc}}\,\bar{\mathbf{H}}^{(1)}_i}{\bigl\lVert \mathbf{W}_{\text{itc}}\,\bar{\mathbf{H}}^{(1)}_i \bigr\rVert_2},
	\qquad
	\bar{\mathbf{H}}^{(1)}_i \;=\; \frac{1}{32}\sum_{k=1}^{32} \mathbf{H}^{(1)}_{i,k}
	\label{eq:itc-head}
	\end{equation}

	with $\mathbf{W}_{\text{itc}} \in \mathbb{R}^{128 \times 1024}$ the
	contrastive projection. On the text side, the canonical reference
	sentence per study --- the findings paragraph, falling back to the
	impression where findings is absent --- is encoded once, offline, by
	the frozen CXR-BERT-specialized model~\citep{cxrbert} through its
	\verb\|get_projected_text_embeddings\| interface, which itself emits a
	128-dimensional L2-normalised vector $\mathbf{t}_i$. These per-study
	text embeddings are written to disk and re-used across every Phase~1
	run; no text encoder is loaded at training time.

	The model is then trained to maximise the cosine similarity between
	$\mathbf{v}_i$ and the text embedding $\mathbf{t}_i$ of the same
	study, while pushing $\mathbf{v}_i$ away from the text embeddings of
	the other studies in the batch. We use the symmetric InfoNCE loss of
	CLIP~\citep{clip}:

	\begin{equation}
	\mathcal{L}_{\text{ITC}}
	\;=\;
	-\tfrac{1}{2}\!\left[
	\sum_{i=1}^{B}\!\log\!\frac{\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_i/\tau)}
	{\sum_{j=1}^{B}\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_j/\tau)}
	\;+\;
	\sum_{i=1}^{B}\!\log\!\frac{\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_i/\tau)}
	{\sum_{j=1}^{B}\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_j/\tau)}
	\right]
	\label{eq:itc}
	\end{equation}

	with the softmax temperature $\tau = 0.07$, matching CLIP and the
	CXR-BERT recipe. The dataset is de-duplicated to one image per study
	before batching, because the text embedding is study-level. Trainable
	parameters are restricted to the projection MLP and the contrastive
	head; the encoder, the classifier, and the language model (which is
	not even instantiated during this phase) are all frozen. Phase 1
	runs for two epochs at peak learning rate $1\!\times\!10^{-3}$ with a
	5\% cosine warm-up. Because Vicuna is not loaded, $\approx\!13$\,GB
	of GPU memory is free, which we exploit to push the per-device batch
	size to 64--96 and thereby enlarge the InfoNCE negative pool ---
	known to be the dominant factor in contrastive-alignment
	quality~\citep{clip, blip2}. The contrastive head is discarded at the
	end of Phase~1; only the projection weights are carried forward.

	\subsection{Phase 2 --- Instruction Tuning}
	\label{sec:phase2}

	% TODO INSERT FIGURE: phase 2 schematic (optional)
	% Could show the full model with the loss flowing only on assistant
	% tokens. Or omit if the overview figure already covers it.

	Phase 2 reinstates the language model and switches to autoregressive
	fine-tuning under the standard causal-language-modelling loss. The
	projection weights from Phase~1 are loaded into the freshly-built
	full model; the LoRA adapters of Equation~\ref{eq:lora} are
	zero-initialised. Trainable parameters are the projection MLP and the
	LoRA adapters on the four attention projections of all 32 transformer
	blocks. The encoder, the classifier, and the quantised base weights
	of Vicuna remain frozen.

	Mini-batches are drawn from the union of the three task corpora ---
	findings, impression, and VQA --- with sampling weights $0.3$,
	$0.2$, and $0.5$ respectively. The findings/impression ratio favours
	findings because findings is the harder task (no textual context),
	and VQA is weighted equally to the combined report-generation tasks
	because it is the most diverse subset and benefits most from
	mini-batch coverage. For each sample, the prompt is constructed
	according to its task as described in Section~\ref{sec:prompt} and
	tokenised. The forward pass emits visual tokens via
	Equation~\ref{eq:patches}--\ref{eq:tap} and inserts them in place of
	the \verb\|<image>\| token; the language model then produces logits
	$\mathbf{l}_{i, t}$ at every position $t$. The loss is

	\begin{equation}
	\mathcal{L}_{\text{LM}}
	\;=\;
	-\,\frac{1}{\|\mathcal{T}\|}
	\sum_{(i, t)\,\in\,\mathcal{T}}
	\log p_{\theta}\!\bigl(y_{i, t}\,\bigm\|\,y_{i, < t},\,\mathbf{V}_i,\,\mathbf{c}_i\bigr)
	\label{eq:lm}
	\end{equation}

	where $\mathcal{T}$ is the set of $(i, t)$ pairs whose label entry is
	non-$-100$ (i.e.\ assistant tokens only), $\mathbf{V}_i$ are the
	visual tokens, and $\mathbf{c}_i$ is the textual prompt context.
	Phase 2 runs for ten epochs at peak learning rate
	$2\!\times\!10^{-4}$ with a 5\% cosine warm-up. The effective batch
	size is held at 16 across all hardware profiles by varying the
	per-device batch and gradient-accumulation step count in inverse
	proportion. The optimiser is AdamW~\citep{adamw} with weight decay
	$0.01$ on all non-bias, non-LayerNorm parameters, applied to FP32
	master copies of the projection and LoRA weights; the quantised base
	LLM acts as a fixed nonlinearity.

	\subsection{Hyperparameters}

	Table~\ref{tab:hparams} consolidates the hyperparameter settings of
	both phases. The values are inherited from RaDialog with two
	adjustments. The Phase~1 batch size was swept over
	$\{16, 32, 64, 96\}$; 64 was selected as the smallest batch that lay
	within $0.3\%$ of the best validation contrastive accuracy. The
	Phase~2 peak learning rate was swept over
	$\{1{\times}10^{-4}, 2{\times}10^{-4}, 5{\times}10^{-4}\}$; $5{\times}10^{-4}$
	diverged within $\approx\!300$ optimiser steps and $2{\times}10^{-4}$
	was unambiguously best on the validation loss.

	\begin{table}[ht]
	\centering
	\small
	\caption{Training hyperparameters for the two phases.}
	\label{tab:hparams}
	\begin{tabular}{lll}
	\toprule
	\textbf{Hyperparameter} & \textbf{Phase 1 (ITC)} & \textbf{Phase 2 (LM)} \\
	\midrule
	Trainable modules & Projection + ITC head & Projection + LoRA \\
	Frozen modules & Encoder, classifier; LLM not loaded & Encoder, classifier, base LLM \\
	Loss & Symmetric InfoNCE (Eq.~\ref{eq:itc}) & Causal CE (Eq.~\ref{eq:lm}) \\
	Temperature $\tau$ & $0.07$ & --- \\
	Epochs & 2 & 10 \\
	Peak learning rate & $1\times 10^{-3}$ & $2\times 10^{-4}$ \\
	LR schedule & cosine, 5\% warm-up & cosine, 5\% warm-up \\
	Weight decay & $0.01$ & $0.01$ \\
	Effective batch size & 64--96 & 16 \\
	Mixed precision & BF16 (Ampere+) / FP16 (Turing) & BF16 / FP16 \\
	LLM quantisation & --- (not loaded) & 4-bit NF4, double-quant \\
	LoRA $(r, \alpha, p)$ & --- & $(16, 32, 0.05)$ \\
	LoRA modules & --- & $\{q, k, v, o\}_{\text{proj}}$ \\
	Cutoff length & --- & 512 tokens \\
	Optimiser & AdamW & AdamW \\
	\bottomrule
	\end{tabular}
	\end{table}

	% =============================================================================
	\section{Inference Pipeline}
	\label{sec:inference}
	% =============================================================================

	At inference, all parameters are frozen and the model is invoked as a
	single forward pass per chest radiograph and task, with greedy
	decoding (\verb\|do_sample=False\|, \verb\|num_beams=1\|,
	\verb\|temperature=0.1\|) up to a task-dependent maximum of newly
	generated tokens. Algorithm~\ref{alg:inference} formalises the
	procedure.

	\begin{algorithm}[h]
	\caption{Inference Procedure of CXR-VLM}
	\label{alg:inference}
	\begin{algorithmic}[1]
	\Require Chest radiograph $\mathbf{X}$, task $\mathcal{T} \in \{\textsc{findings}, \textsc{impression}, \textsc{vqa}\}$, optional textual side-input $\mathbf{u}$
	\Ensure Generated report $\mathbf{R}$
	\State $\mathbf{P} \gets \operatorname{RAD-DINO}(\mathbf{X})$ \Comment{$\mathbb{R}^{1369 \times 768}$, frozen}
	\State $\mathbf{c}_{\texttt{[CLS]}} \gets$ class-token of the same forward pass
	\State $\mathbf{Z} \gets \operatorname{MLP}(\mathbf{c}_{\texttt{[CLS]}}) \in \mathbb{R}^{14 \times 3}$
	\State $\hat{\mathbf{y}}_j \gets \operatorname{softmax}(\mathbf{Z}_j)$, \, $c_j \gets \arg\max_c \hat{\mathbf{y}}_{j,c}$
	\State $\textsc{pnu} \gets \operatorname{Serialise}(\{c_j\}_{j=1}^{14})$ \Comment{P/N/U text block}
	\State $\mathbf{V} \gets \operatorname{Projection}(\mathbf{P}) \in \mathbb{R}^{32 \times 4096}$
	\If{$\mathcal{T} = \textsc{findings}$}
	\State context $\gets$ \verb\|""\|
	\State instruction $\gets$ canonical findings instruction
	\ElsIf{$\mathcal{T} = \textsc{impression}$}
	\State context $\gets$ ``\texttt{Findings:}\ $\mathbf{u}$''
	\State instruction $\gets$ canonical impression instruction
	\ElsIf{$\mathcal{T} = \textsc{vqa}$}
	\State context $\gets$ \verb\|""\|
	\State instruction $\gets \mathbf{u}$ \Comment{the question itself}
	\EndIf
	\State $\mathbf{p} \gets \operatorname{BuildPrompt}(\textsc{system}, \textsc{pnu}, \text{context}, \text{instruction})$
	\State $\mathbf{e} \gets \operatorname{Tokenize}(\mathbf{p})$, then substitute $\mathbf{V}$ at the \verb\|<image>\| position
	\State $\mathbf{R} \gets \operatorname{Vicuna+LoRA}(\mathbf{e})$ via greedy decoding up to $L_\mathcal{T}$ new tokens
	\State \Return $\mathbf{R}$
	\end{algorithmic}
	\end{algorithm}

	The maximum number of newly generated tokens $L_\mathcal{T}$ is set
	to 300 for findings, 200 for impression, and 64 for VQA, matching
	the 99th-percentile reference length per task. For the impression
	task, the textual side-input $\mathbf{u}$ is the ground-truth
	findings paragraph in the experiments of
	Chapter~\ref{ch:experiments}, mirroring the teacher-forced training
	regime; an end-to-end cascade variant that feeds the model's own
	generated findings as $\mathbf{u}$ is technically supported by the
	codebase but not reported as a primary result.

	% =============================================================================
	\section{Evaluation Protocol}
	\label{sec:eval}
	% =============================================================================

	The evaluation reports a metric suite tailored to each of the three
	downstream tasks. For findings and impression generation we report
	both natural-language-generation (NLG) metrics --- lexical, fluency,
	and embedding-based --- and the clinical-correctness metric that has
	become the de facto standard for chest-X-ray report generation. For
	VQA we report a parallel suite adapted to short-form answers,
	optionally augmented with an LLM-as-judge evaluation for clinical
	equivalence. Table~\ref{tab:metrics} summarises the inventory.

	\begin{table}[ht]
	\centering
	\small
	\caption{Evaluation metrics by task.}
	\label{tab:metrics}
	\begin{tabular}{lll}
	\toprule
	\textbf{Family} & \textbf{Metric} & \textbf{Tasks} \\
	\midrule
	Lexical n-gram & BLEU-1, BLEU-4 & findings, impression, VQA \\
	& ROUGE-1, ROUGE-2, ROUGE-L & findings, impression \\
	Fluency / synonym & METEOR & findings, impression, VQA \\
	Embedding semantic & BERTScore F1 & findings, impression, VQA \\
	Clinical accuracy & CheXbert macro F1, P, R & findings, impression \\
	Exact answer & Exact match, token F1 & VQA \\
	Semantic judge & GPT-4o-mini score (0--5) & VQA (optional) \\
	\bottomrule
	\end{tabular}
	\end{table}

	\subsection{Held-Out Data and Decoding}

	All metrics are computed on the patient-disjoint test split of
	Section~\ref{sec:subset}: 5{,}000 studies whose subject ids appear in
	neither the train nor the validation split. Greedy decoding is used
	throughout (Section~\ref{sec:inference}); the canonical instruction
	variant of each task's prompt template is used so that the reported
	numbers reflect the model's single-shot best guess and remain
	reproducible across runs. For findings and impression the
	classifier-predicted PNU block populates the prompt at test time; for
	the impression task, the ground-truth findings string populates the
	task-specific context slot as in training. For the IU-X-Ray
	out-of-distribution evaluation, the entire pipeline is applied
	without modification, with PNU labels predicted from images directly.

	\subsection{Natural Language Generation Metrics}
	\label{sec:nlg}

	\paragraph{BLEU.} Corpus-level BLEU-1 and BLEU-4~\citep{bleu} are
	computed with NLTK's smoothing method~1 to avoid zero-probability
	collapse on short references. BLEU is included primarily for
	comparability with the radiology-report-generation literature; its
	weak correlation with clinical correctness on this task is well
	documented, so we treat it as a fluency floor rather than a quality
	measure.

	\paragraph{ROUGE.} ROUGE-1, ROUGE-2, and ROUGE-L~\citep{rouge}
	F-measures are computed with Porter stemming enabled --- so that, for
	instance, \emph{enlarged} and \emph{enlarging} match. ROUGE-L in
	particular is the most commonly reported single number in prior work
	on this task (R2Gen, KGAE, RaDialog) and is included for that reason.

	\paragraph{METEOR.} METEOR~\citep{meteor} computes a weighted
	F-measure at the token level, with matches granted not only for exact
	tokens but also for stems and WordNet synonyms
	(e.g.\ \emph{cardiomegaly}~$\leftrightarrow$~\emph{enlarged heart}),
	and a chunk-fragmentation penalty that discourages matches scattered
	across the sentence. Of the three n-gram-style metrics, METEOR
	correlates best with human judgement on radiology reports, where
	paraphrasing is rampant; we report it alongside BLEU and ROUGE rather
	than as a replacement.

	\paragraph{BERTScore.} BERTScore~\citep{bertscore} computes
	greedy-aligned cosine similarities between the contextual embeddings
	of hypothesis and reference, then aggregates to a precision, recall,
	and F1. We report F1 with the default
	\texttt{distilbert-base-uncased} backbone for speed, and verify on a
	subset that switching to PubMedBERT
	(\texttt{BiomedNLP-PubMedBERT-base-uncased-abstract}) preserves the
	relative ordering of methods. BERTScore captures semantic
	equivalence that the n-gram metrics miss but, unlike the
	clinical-accuracy metric below, has no notion of clinical
	\emph{correctness} --- it can reward a fluent paraphrase that flips a
	positive finding to a negative one.

	\subsection{Clinical Accuracy: CheXbert F1}
	\label{sec:clinical-f1}

	The clinical-accuracy metric is the macro-averaged F1 of CheXbert
	labels~\citep{chexbert}, the standard for chest-X-ray report
	generation since its introduction. The CheXbert labeler is applied to
	both the generated and the reference report, producing a 14-element
	abnormality vector with values in $\{-1, 0, 1\}$ for each. Following
	the convention of prior work, we collapse $\{-1, 0\}$ to
	not-positive and compute, per pathology,

	\begin{equation}
	\mathrm{F1}_j \;=\; \frac{2\,\mathrm{precision}_j\,\mathrm{recall}_j}{\mathrm{precision}_j + \mathrm{recall}_j},
	\end{equation}

	then macro-average across the 14 pathologies. Precision and recall
	are reported alongside F1 to disambiguate the trade-off when the
	overall numbers are close between methods. Macro-averaging
	(rather than micro) is chosen because the CheXpert label distribution
	is heavy-tailed and micro-averaging would let the most prevalent
	labels dominate the score. The CheXbert checkpoint is the public
	release from Stanford ML Group; when it is not available on a given
	host the metric degrades gracefully to a logged zero rather than
	failing the evaluation run.

	\subsection{Visual Question Answering Metrics}
	\label{sec:vqa-metrics}

	VQA targets are short --- often a single word or phrase --- so the
	suite differs from generation.

	\paragraph{Exact match.} Hypothesis and reference are lower-cased,
	stripped of punctuation, and collapsed on whitespace, then compared
	for string equality. Exact match is the lower bound on correctness;
	it is harsh on open-ended phrasing (``mild pleural effusion'' vs
	``small effusion'') but rewards the closed-form yes/no and
	quantitative questions that dominate the dataset.

	\paragraph{Token F1.} For each $(\text{hypothesis}, \text{reference})$
	pair we compute the F1 between the bags of normalised tokens. This
	is the standard short-answer metric inherited from SQuAD-style
	extractive QA, and it is the most diagnostic single number for VQA
	on this dataset.

	\paragraph{BLEU-1, METEOR, BERTScore.} Reported for symmetry with the
	generation tasks: BLEU-1 captures unigram overlap, METEOR adds synonym
	matching, BERTScore F1 captures semantic similarity. BLEU-4 (rarely
	meaningful on short answers) and ROUGE-L (subsumed by token F1) are
	omitted.

	\paragraph{LLM-as-judge.} For free-form answers where the metrics
	above misalign with clinical equivalence (e.g.\ ``cardiomegaly'' vs
	``enlarged cardiac silhouette''), we additionally evaluate a sample
	of the test set with GPT-4o-mini~\citep{openai-gpt4o} as a zero-shot
	judge. Following G-Eval~\citep{geval}, the judge is prompted with the
	question, the reference, and the prediction, and asked to return an
	integer score on a $0$--$5$ rubric ($5$~=~clinical equivalence, $0$~=
	~contradiction) together with a one-sentence rationale. We report the
	mean score on the $0$--$5$ scale and its normalisation to $[0, 1]$
	for comparability with the other metrics. The judge runs at
	temperature $0$ with \verb\|response_format=json_object\| for
	deterministic parsing; the codebase supports Gemini and local Ollama
	back-ends behind the same interface so the result is reproducible
	without an OpenAI subscription. To keep cost bounded, the judge is
	applied to at most $500$ randomly-sampled VQA items.

	\subsection{Model Selection and Reporting}

	Model selection during training uses the validation
	\verb\|eval_loss\| --- the causal cross-entropy on assistant tokens,
	the same loss minimised in Equation~\ref{eq:lm}. We deliberately do
	\emph{not} use any of the downstream metrics for checkpoint selection,
	to avoid the optimistic bias of optimising on the same signal that is
	later reported. CheXbert F1 in particular is expensive to compute and
	unstable across small validation samples, which makes it unsuitable
	as a training-loop criterion regardless.

	Each evaluation run writes per-task prediction files containing the
	image path, the reference, the generated text, and (for VQA) the
	question, alongside an aggregated metric summary. Confidence intervals
	are obtained by bootstrap resampling of the per-sample metric values
	(1{,}000 resamples, 95\% percentile interval); corpus-level BLEU and
	CheXbert F1, which are non-decomposable, are reported as point
	estimates.