	% =============================================================================
	% Chapter: Model and Training
	% Companion to docs/data_preparation.md
	% Configuration assumed throughout:
	% data.report_mode = split_cascade
	% data.image_mode = frontal_only_split
	% =============================================================================

	\chapter{Model and Training}
	\label{ch:model-training}

	\section{Architectural Overview}
	\label{sec:arch-overview}

	The model is a single vision--language network that solves three downstream
	tasks --- findings generation, impression generation, and visual question
	answering --- through one shared backbone. The design follows the
	RaDialog~\cite{radialog} recipe with two deliberate modifications inspired by
	META-CXR's U-MultiClass~\cite{metacxr} and BLIP-2's image--text contrastive
	alignment~\cite{blip2}: \textit{(i)} a frozen 14-pathology CheXpert classifier
	whose probabilistic predictions are serialised into the prompt as a
	\textsc{Positive\,/\,Negative\,/\,Uncertain} (PNU) string, and \textit{(ii)} an
	optional contrastive Stage~1 that pre-aligns the projection in a joint
	image--text embedding space without ever loading the language model. The full
	forward path can be summarised as

	\begin{equation}
	\underbrace{\mathbf{x}}_{518\times518}
	\;\xrightarrow{\;\text{RAD-DINO}\;}\;
	\mathbf{P}\in\mathbb{R}^{B\times N_p\times 768}
	\;\xrightarrow{\;\text{MLP-Proj}\;}\;
	\mathbf{V}\in\mathbb{R}^{B\times 32\times 4096}
	\;\xrightarrow{\;\text{Vicuna-7B}+\text{LoRA}\;}\;
	\hat{\mathbf{y}}
	\label{eq:fwd}
	\end{equation}

	where the projection step also routes a 1024-dimensional intermediate
	representation to either an ITC head (Stage~1, contrastive) or directly into
	the LLM (Stage~2, autoregressive). Only the MLP projection, the LoRA adapters
	on Vicuna, and (when enabled) the ITC head are trained --- the image encoder,
	the CheXpert classifier, and the Vicuna base weights are kept frozen
	throughout. Stage~1 (ITC) trains roughly $3.7$\,M parameters; Stage~2 adds the
	LoRA adapters (rank $r{=}16$ on $q$, $k$, $v$, $o$ projections of all 32
	transformer blocks) for a total of $\approx\!21.5$\,M trainable parameters
	out of $\approx\!7.1$\,B --- about $0.30\%$.
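
	As a rough cross-check of the adapter budget, the LoRA contribution can be
	counted by hand. The sketch assumes that all four attention projections of
	the 7B model are square $4096\times4096$ maps and that every adapter is a
	pair of bias-free factors of shapes $r\times d$ and $d\times r$; the symbols
	$N_{\text{LoRA}}$, $d$, and $L$ are introduced here for illustration only.
	\[
	N_{\text{LoRA}}
	= \underbrace{2\,r\,d}_{\text{one adapter}} \times 4 \times L
	= 2\cdot 16\cdot 4096 \cdot 4 \cdot 32
	= 16{,}777{,}216 \approx 16.8\,\text{M},
	\]
	so under these assumptions the adapters make up the bulk of the Stage~2
	trainable parameters.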

	\section{Image Encoder}
	\label{sec:encoder}

	The visual backbone is Microsoft's \textbf{RAD-DINO}~\cite{raddino}, a
	ViT-B/14 trained with DINOv2 self-supervision on $\approx\!840$\,k chest X-rays
	spanning MIMIC-CXR, CheXpert, NIH ChestX-ray14, and PadChest. The
	\texttt{microsoft/rad-dino} checkpoint is loaded through the HuggingFace
	\texttt{transformers} hub and accepts inputs at its native resolution of
	$518\times518$, which matches the longest-edge target used in our resize step
	(Phase~3 of the data pipeline). At this resolution the encoder emits
	$1369$ patch tokens of dimension $768$ per image; the class token is
	discarded so that the downstream projection sees a clean
	$\mathbb{R}^{1369\times768}$ patch grid. We deliberately chose RAD-DINO over
	the original RaDialog backbone BioViL-T~\cite{biovilt} for three reasons: it
	is published as a standard HuggingFace model (no dependency on the legacy
	\texttt{hi-ml-multimodal} library that pins Python\,${<}\,3.11$), it is
	trained on roughly an order of magnitude more chest-X-ray data, and its
	patch-token grid is dense enough that the cross-attention pool in our
	projection has enough spatial bandwidth to reason about both global pathology
	and small focal abnormalities (nodules, opacities) without resorting to
	multi-resolution tricks.
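
	The patch count follows from the input size and the patch stride. Assuming
	non-overlapping $14\times14$ patches and no padding, which is the usual
	ViT-B/14 tokenisation,
	\[
	N_p = \left(\frac{518}{14}\right)^{2} = 37^{2} = 1369,
	\]
	which is the sequence length the cross-attention pool in the projection has
	to compress.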

	The encoder is kept entirely frozen --- a choice motivated less by
	representation quality than by training stability. With 4-bit quantised
	weights in the LLM, even modest gradient flow through an 86\,M-parameter ViT
	overruns the activation budget of consumer GPUs (16--24\,GB), and ablations in
	RaDialog and LLaVA-Med~\cite{llavamed} both indicate that unfreezing the
	image encoder before the projection has converged tends to collapse the
	joint representation. We compensate for the lack of encoder fine-tuning with
	the two-stage training schedule described in
	\S\ref{sec:training-schedule}, in which the projection alone first specialises
	the frozen patch features to the chest-X-ray report distribution.

	To eliminate JPEG decode and patch embedding from the training-step inner
	loop, the codebase supports an \textit{offline feature cache}: each
	$1369{\times}768$ patch tensor is written once to disk and the dataset
	returns it directly, bypassing the image encoder entirely (see
	\texttt{feature\_cache\_dir} in the training configuration). This is a
	loss-free transformation because the encoder is frozen, and it lifts dataset
	throughput by roughly $3$--$5\times$ on single-GPU L4/3090-class machines
	where the encoder forward dominates step time. The MIMIC-CXR Stage~2 run
	reported in Chapter~\ref{ch:experiments} was executed with the cache active.
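
	To give a feel for the disk cost of the cache, the following
	back-of-the-envelope estimate assumes that features are stored in 16-bit
	floating point and uses a hypothetical corpus of $10^{5}$ images; neither
	the storage dtype nor the corpus size is taken from the configuration.
	\[
	1369 \times 768 \times 2\,\text{B} = 2{,}102{,}784\,\text{B}
	\approx 2.1\,\text{MB per image},
	\qquad
	10^{5} \times 2.1\,\text{MB} \approx 210\,\text{GB}.
	\]
	Storing FP32 instead would double both figures.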

	\section{MLP Projection}
	\label{sec:projection}

	The role of the projection is twofold: pool a variable-length patch sequence
	into a fixed visual-token budget that the LLM can attend to, and bridge the
	embedding spaces of the frozen vision encoder ($d_v{=}768$) and the frozen
	language model ($d_l{=}4096$). RaDialog~v2 already established that a
	lightweight MLP with a perceiver-style learnable query block is competitive
	with the heavier Q-Former used in BLIP-2 and XrayGPT, at a small fraction of
	the parameter count and with no auxiliary objective required at the
	projection level. We adopt that design with a small modification --- exposing
	the intermediate 1024-dimensional activation so that the optional ITC head of
	Stage~1 has a richer tap point than the final 4096-d LLM-space embedding.

	The forward pass is

	\begin{align}
	\mathbf{Q}_i &= \mathbf{Q}_0 \in \mathbb{R}^{32\times 768}, \qquad i=1,\dots,B \\
	\mathbf{H}^{(0)} &= \operatorname{CrossAttn}(\mathbf{Q}_i, \mathbf{P}_i, \mathbf{P}_i) \in \mathbb{R}^{32\times 768} \\
	\mathbf{H}^{(1)} &= \operatorname{Dropout}\big(\operatorname{GELU}(\mathbf{W}_1\mathbf{H}^{(0)})\big) \in \mathbb{R}^{32\times 1024} \label{eq:tap} \\
	\mathbf{V}_i &= \mathbf{W}_2 \mathbf{H}^{(1)} \in \mathbb{R}^{32\times 4096}
	\end{align}

	with $\mathbf{Q}_0$ a learnable parameter, $\operatorname{CrossAttn}$ an
	8-head multi-head attention block at $d{=}768$, and $\mathbf{W}_1$,
	$\mathbf{W}_2$ two linear layers (no bias terms; dropout follows the GELU). The number of visual
	tokens (32) matches RaDialog and sits inside the empirical sweet spot
	reported by both LLaVA and BLIP-2: fewer than 16 tokens loses spatial detail
	on small pathologies, more than 64 only inflates the LLM's sequence length
	without measurable gains on radiology metrics. The query matrix is
	initialised from $\mathcal{N}(0, 0.02^2)$ and the two MLP linears use the
	same small-normal initialisation as Vicuna's own \texttt{Linear} layers;
	empirically, larger initial scales destabilise training during the first
	$\sim\!500$ optimiser steps when the LLM is still attending to noisy
	visual tokens.
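
	For reference, the pooling step of the forward pass can be unpacked per
	head. The expression below is the standard scaled dot-product form and is
	only a sketch, assuming $\operatorname{CrossAttn}$ behaves like a stock
	multi-head attention layer; the per-head matrices $\mathbf{W}^{Q}_{h}$,
	$\mathbf{W}^{K}_{h}$, $\mathbf{W}^{V}_{h}$, the attention map
	$\mathbf{A}_h$, and the head width $d_h$ are notation introduced here.
	\[
	\mathbf{A}_h
	= \operatorname{softmax}\!\left(
	\frac{(\mathbf{Q}_i\mathbf{W}^{Q}_{h})(\mathbf{P}_i\mathbf{W}^{K}_{h})^{\top}}{\sqrt{d_h}}
	\right) \in \mathbb{R}^{32\times 1369},
	\qquad
	d_h = \tfrac{768}{8} = 96 .
	\]
	Each of the 32 queries therefore holds a distribution over all 1369
	patches, and the eight head outputs
	$\mathbf{A}_h\mathbf{P}_i\mathbf{W}^{V}_{h}\in\mathbb{R}^{32\times 96}$ are
	concatenated and linearly mixed to form $\mathbf{H}^{(0)}$.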

	The 1024-d tap point in Equation~\ref{eq:tap} is not a casual exposure ---
	it is the explicit grounding signal of the ITC head described in
	\S\ref{sec:training-schedule}, and it lives between the GELU and the
	final linear so that the contrastive objective sees an already-nonlinear
	representation rather than the raw cross-attention pool. Crucially, the
	projection is implemented as a single \texttt{nn.Sequential} with two
	linear layers indexed as \texttt{mlp[0]} and \texttt{mlp[3]} so that
	checkpoints saved before the tap-point modification load unchanged ---
	backward compatibility with earlier runs is preserved at the parameter
	naming level.

	\paragraph{Numerical-precision detail.} The frozen encoder runs in the
	LLM's compute dtype (BF16 on Ampere+ or FP16 on Turing) while the
	projection's parameters are kept in FP32. PyTorch's autocast wrapper does
	not consistently cover the in-projection of \texttt{nn.MultiheadAttention}
	on the cross-attention path under BF16, which surfaces as a
	\texttt{mat1 and mat2 must have the same dtype} runtime error on Ampere+
	GPUs. The projection therefore explicitly upcasts its input to the
	parameter dtype before the cross-attention call. The conversion is a
	no-op on T4-class GPUs (where everything is already FP16) and a single
	copy on Ampere+ --- not a bottleneck.

	\section{CheXpert Abnormality Classifier}
	\label{sec:chex}

	Following META-CXR's U-MultiClass formulation, abnormality information
	enters the prompt as a 14-pathology, 3-class string rather than as a 14-d
	logit vector. The classifier itself is a small MLP head sitting on the
	global \texttt{[CLS]} embedding of RAD-DINO; it produces, per study, a
	$14\times3$ logit tensor mapped to one of $\{\text{Positive},
	\text{Negative}, \text{Uncertain}\}$ per CheXpert label. The string format
	fed to the LLM is

	\begin{verbatim}
	Positive Abnormalities: Cardiomegaly, Pleural Effusion
	Negative Abnormalities: No Finding, Edema, Pneumothorax, ...
	Uncertain Abnormalities: Atelectasis
	\end{verbatim}

	Three properties of this design matter. First, the U-MultiClass
	formulation preserves the distinction between \textit{negative} (the
	report explicitly rules the pathology out) and \textit{uncertain} (the
	report hedges with \texttt{may represent}, \texttt{cannot exclude}, etc.).
	This nuance is destroyed by binary CheXpert mappings and matters
	clinically: a confident negative carries information that an uncertain
	case does not. Second, expressing the labels as text rather than as
	auxiliary embeddings means no architectural change is required when
	labels are missing --- the field simply becomes the empty string and the
	prompt degrades gracefully, which is exactly what happens at inference on
	out-of-distribution images. Third, in our prompt the PNU string is
	inserted between the visual tokens and the natural-language instruction,
	so the LLM's self-attention can route information from text to image
	freely; both the CheXbert-derived oracle (during training) and the
	classifier's own predictions (during evaluation) occupy the same prompt
	slot.

	In this work the classifier is trained separately as a Stage~0 step: the
	RAD-DINO features are frozen and only the MLP head is fitted, with one
	three-class cross-entropy per CheXpert label derived from the CheXpert CSV.
	The head is then frozen for Stages~1 and 2. During training of the VLM the GT labels from
	the CSV are used directly to populate the PNU string (an oracle setting,
	analogous to the teacher-forced ground-truth findings used in
	\texttt{split\_cascade}); during evaluation the classifier is invoked on
	the image to produce its own PNU prediction.

	\section{Language Model and Parameter-Efficient Adaptation}
	\label{sec:llm}

	The decoder is \textbf{Vicuna-7B v1.3}~\cite{vicuna}, a LLaMA-1
	derivative that was instruction-tuned on ShareGPT conversations. We chose
	the v1.3 series rather than the newer v1.5 (LLaMA-2 base) or modern Llama-3
	checkpoints for two specific reasons: it is the exact LM used by the
	RaDialog baseline so our findings can be compared directly, and its
	chat template has a single, clean \texttt{USER: ... ASSISTANT: ...}
	structure that simplifies the label-masking logic at training time.
	The base model is loaded in 4-bit NF4 quantisation
	(\texttt{BitsAndBytesConfig}, double quantisation enabled,
	compute dtype matched to the GPU --- BF16 on Ampere+, FP16 on Turing), which
	brings the resident weight footprint from $\approx\!14$\,GB FP16 down to
	$\approx\!4.0$\,GB and frees enough headroom on a 16\,GB T4 to host the
	RAD-DINO encoder, the projection, the activations, the optimiser state, and
	a small batch concurrently.
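
	The footprint figures can be approximated from the parameter count alone.
	The estimate below takes roughly $6.7\times10^{9}$ weights for the 7B model
	and ignores quantisation constants and any layers left unquantised, so it
	is a lower bound rather than an exact accounting.
	\[
	6.7\times10^{9} \times 2\,\text{B} \approx 13.4\,\text{GB (FP16)},
	\qquad
	6.7\times10^{9} \times 0.5\,\text{B} \approx 3.4\,\text{GB (4-bit)}.
	\]
	The gap between the 4-bit bound and the resident figure quoted above is
	plausibly taken up by the embedding and output layers, which are typically
	kept in 16-bit, and by the per-block quantisation constants.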

	Adaptation is performed with LoRA~\cite{lora}: rank-16 low-rank
	adapters are inserted on the four attention projections
	(\texttt{q\_proj}, \texttt{k\_proj}, \texttt{v\_proj}, \texttt{o\_proj}) of
	every transformer block. The feed-forward sublayers are deliberately left
	untouched. Larger LoRA placements (e.g.\ adding \texttt{gate\_proj} or the
	MLP projections) measurably improve perplexity but also enlarge the
	trainable-parameter count by $3$--$4\times$, and our experiments saw no
	corresponding lift in CheXbert F1 on the held-out test split. With
	\texttt{lora\_alpha}${=}32$ and \texttt{lora\_dropout}${=}0.05$ the effective
	LoRA scaling is $\alpha/r=2$, which we found to be the sweet spot in a brief
	grid search over $\{1, 2, 4\}$.
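
	The scaling factor enters the adapted weight as follows. This is the
	standard LoRA parameterisation, written under the assumption that the
	default (non-rank-stabilised) scaling is in use; $\mathbf{W}_0$,
	$\mathbf{A}$, and $\mathbf{B}$ are notation introduced here.
	\[
	\mathbf{W} = \mathbf{W}_0 + \frac{\alpha}{r}\,\mathbf{B}\mathbf{A},
	\qquad
	\mathbf{B}\in\mathbb{R}^{4096\times 16},\;
	\mathbf{A}\in\mathbb{R}^{16\times 4096},\;
	\frac{\alpha}{r} = \frac{32}{16} = 2 .
	\]
	In the usual initialisation $\mathbf{B}$ starts at zero, so the adapted
	model coincides with the frozen base at step 0.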

	Two further LLM-side details are worth recording. First, attention
	implementation is auto-detected: FlashAttention-2 is used on Ampere/Ada
	GPUs (a $2$--$3\times$ throughput win), with a graceful fall-back to
	PyTorch SDPA on Turing where FA2 is unavailable. Second, gradient
	checkpointing is enabled by default on the LLM. On a GPU with 24\,GB or more,
	4-bit + LoRA + FA2 leaves enough room for the activations without
	checkpointing, and switching it off recovers $\approx\!25\%$ of the step
	time; on the L4/T4 profiles used throughout this thesis we leave it on.

	\section{Prompt Assembly and Sequence Layout}
	\label{sec:prompt}

	All three tasks share the same prompt skeleton, which follows Vicuna's
	v1.1 chat template:

	\begin{verbatim}
	{SYSTEM_PROMPT} USER: <image>
	{PNU structured findings}
	{task-specific context block}
	{instruction} ASSISTANT: {target}
	\end{verbatim}

	The \texttt{<image>} placeholder is added to the tokenizer as a special
	token (vocabulary id 32000, the first free slot above Vicuna's 32\,000
	base vocabulary). At forward time the model locates this single token,
	replaces its embedding with the 32 visual tokens from the projection, and
	expands the attention mask and label tensors by $+31$ positions per sample
	so the downstream causal-attention mask is consistent. Visual-token
	positions in the label tensor are filled with $-100$ so they are excluded
	from the cross-entropy loss.

	The task-specific context block differs per task, and is the principal
	mechanism by which \texttt{split\_cascade} differs from a plain
	\texttt{split} schedule:

	\begin{itemize}
	\item For \textbf{findings} samples it is empty; the model is asked to
	produce the findings paragraph from the image and the PNU labels
	alone.
	\item For \textbf{impression} samples it is the literal string
	\texttt{Findings: <GT findings>}; the model conditions on the
	ground-truth findings paragraph and is asked to summarise it.
	Studies whose report lacks a findings section emit no impression
	sample, by construction of the Phase~1 filter chain.
	\item For \textbf{VQA} samples it is empty; the natural-language
	question itself is the instruction.
	\end{itemize}

	This \texttt{split\_cascade} layout makes the impression task much closer
	to a controlled summarisation problem than to free-form generation, which
	matches both clinical practice (the impression is written \textit{after}
	the findings) and the strong empirical bias of LLMs towards
	extract-and-paraphrase behaviour when explicit context is provided. The
	trade-off is that the gain measured here is an upper bound on what a true
	end-to-end cascade (impression conditioned on the model's
	\textit{own} generated findings) would achieve; we report numbers under
	the teacher-forced regime and discuss the gap in
	Chapter~\ref{ch:experiments}. Each of findings, impression, and the
	report variant has \textbf{ten} hand-written instruction paraphrases
	sampled uniformly at training time; at evaluation the first variant is
	used deterministically so metrics are reproducible.

	A complete training sample is tokenised with \texttt{cutoff\_len}${=}512$
	and truncated on the right. Right-truncation has a cost: the assistant
	response sits at the right end of the sequence, so if the prompt itself
	runs long the truncation eats into the target and the loss receives
	proportionally less signal. We accept that cost because left-truncating
	would destroy the system prompt and the PNU block, both of which carry
	non-redundant information. The label tensor is masked with $-100$ on
	every prompt token, every padding token, and every visual token, so loss
	is computed strictly on the ASSISTANT response.

	\section{Training Schedule}
	\label{sec:training-schedule}

	Training follows a two-stage curriculum closely modelled on RaDialog,
	with Stage~1 reformulated as an explicit image--text contrastive
	alignment in the spirit of BLIP-2. The motivation for the split is the
	classic representation-then-instruction division: it is wasteful to drive
	the LLM's LoRA adapters with gradients while the projection still emits
	ill-conditioned visual tokens, and conversely the projection cannot be
	trained against the LM loss alone without paying for the full Vicuna
	forward at every step.

	\paragraph{Stage 1 --- contrastive alignment of the projection.} The
	goal of Stage~1 is to specialise the projection (and only the projection)
	so that the visual tokens it produces are linearly aligned with the
	text representation of the corresponding radiology report, before any
	language modelling takes place. We instantiate this via the ITC head
	described in \S\ref{sec:projection}: the 32 intermediate
	1024-d tokens are mean-pooled, projected to 128-d, and L2-normalised.
	On the text side, the canonical reference sentence per study --- the
	findings paragraph, falling back to the impression when findings is
	absent --- is encoded \textit{once, offline} with
	\texttt{microsoft/BiomedVLP-CXR-BERT-specialized}~\cite{cxrbert}
	through its \texttt{get\_projected\_text\_embeddings} interface, also
	producing a 128-d L2-normalised vector. These per-study text embeddings
	are written to disk as a \texttt{\{study\_id: tensor[128]\}} cache (see
	\texttt{scripts/precompute\_cxrbert\_embeddings.ipynb}); the cache is
	itself published to the Hugging Face data repository so any training
	host can pull it in seconds rather than re-running CXR-BERT.

	With the text cache available, Stage~1 runs a symmetric InfoNCE loss
	\begin{equation}
	\mathcal{L}_{\text{ITC}}
	= -\tfrac{1}{2}\left[
	\sum_{i=1}^{B}\log\!\frac{\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_i/\tau)}
	{\sum_{j=1}^{B}\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_j/\tau)}
	+
	\sum_{i=1}^{B}\log\!\frac{\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_i/\tau)}
	{\sum_{j=1}^{B}\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_j/\tau)}
	\right]
	\label{eq:itc}
	\end{equation}
	with $\mathbf{v}_i$ the image embedding from the projection+ITC head,
	$\mathbf{t}_i$ the cached text embedding for the same study, and the
	temperature $\tau{=}0.07$ following CLIP and CXR-BERT's original setup.
	The dataset is de-duplicated to one image per \texttt{study\_id} before
	batching (because the text embedding is study-level), and ITC training
	loads the model with \texttt{load\_llm=False} --- Vicuna is simply not
	instantiated. This single flag is the entire reason Stage~1 is feasible
	on a 24\,GB GPU at the contrastive batch sizes the loss demands:
	freeing the $\approx\!4$\,GB of resident 4-bit Vicuna weights and their
	associated activations lifts the per-device batch from 8 (the Stage~2
	budget) to 64--96 without changing any other configuration.
	A larger batch directly enlarges the InfoNCE negative pool, and we
	observed monotone improvement in the validation contrastive accuracy as
	the batch grew from 8 to 64; saturation began around 96 on our dataset
	scale, in line with the BLIP-2 ablations.
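
	Two quick calculations help to read Equation~\ref{eq:itc}. Both are
	sketches: the first assumes an untrained projection for which all
	similarities are equal, and the per-sample normalisation by $B$ is ours,
	since the equation itself is written as a sum.
	\[
	\frac{\mathcal{L}_{\text{ITC}}}{B}\bigg|_{\text{chance}} = \log B
	\quad\Longrightarrow\quad
	\log 8 \approx 2.08, \qquad \log 64 \approx 4.16,
	\]
	\[
	\left|\frac{\mathbf{v}_i^{\top}\mathbf{t}_j}{\tau}\right|
	\le \frac{1}{0.07} \approx 14.3 .
	\]
	The first line shows that the per-sample loss at chance grows with the
	batch, so loss values from different batch sizes are not directly
	comparable; the second gives the range of the logits, which follows from
	both embeddings being L2-normalised.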

	Stage~1 is run for 2 epochs at peak learning rate $1\!\times\!10^{-3}$
	with a cosine schedule and 5\% warm-up; the projection's MLP and the ITC head are
	trainable (the ITC head is tiny --- $\sim\!130$\,k parameters --- but
	its gradients carry the entire contrastive signal). Encoder and
	classifier are frozen and the LLM is absent. The Stage~1 checkpoint
	saved at the end of training is the \textit{projection-only} state dict;
	the ITC head is discarded and never reused, since it has no role at
	generation time.
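
	The quoted size of the ITC head is consistent with a single affine map from
	the 1024-d tap to the 128-d contrastive space. Assuming that layout (one
	linear layer with bias, which is our reading rather than a documented
	detail),
	\[
	1024 \times 128 + 128 = 131{,}200 \approx 130\,\text{k}.
	\]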

	\paragraph{Stage 2 --- instruction tuning with QLoRA.} Stage~2 rebuilds
	the full model with \texttt{load\_llm=True}, loads the Stage~1
	projection weights, and switches to the autoregressive cross-entropy
	objective. The dataset returns mixed batches according to the configured
	task weights (findings $30\%$, impression $20\%$, VQA $50\%$ for
	MIMIC-CXR; the weights renormalise automatically when VQA is absent for
	IU-Xray); the report mode is \texttt{split\_cascade}, so the impression
	samples carry the ground-truth findings as context, and the image mode
	is \texttt{frontal\_only\_split} (one PA-or-AP frontal image per study).
	The loss is the standard causal cross-entropy

	\begin{equation}
	\mathcal{L}_{\text{LM}}
	= -\frac{1}{|\mathcal{T}|}
	\sum_{t\in\mathcal{T}} \log p_{\theta}\!\left(y_t \mid y_{<t}, \mathbf{V}, \mathbf{c}\right)
	\label{eq:lm}
	\end{equation}
	where $\mathcal{T}$ is the set of positions where the label tensor is
	non-$-100$ (assistant response only), $\mathbf{V}$ the 32 visual tokens,
	and $\mathbf{c}$ the textual prompt context (system, PNU, optional GT
	findings, instruction).
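
	As an illustration of the task-weight renormalisation when VQA is absent,
	suppose the remaining weights are simply divided by their sum; this
	proportional rule is an assumption about the sampler, not a documented
	guarantee.
	\[
	w_{\text{findings}} = \frac{0.30}{0.30 + 0.20} = 0.60,
	\qquad
	w_{\text{impression}} = \frac{0.20}{0.30 + 0.20} = 0.40 .
	\]
	The findings-to-impression ratio of $3{:}2$ is then the same on IU-Xray as
	on MIMIC-CXR.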

	Trainable parameters in Stage~2 are the projection's MLP and the LoRA
	adapters on $q$, $k$, $v$, $o$ of all 32 Vicuna blocks (rank 16,
	$\alpha{=}32$, dropout $0.05$). The image encoder and the CheXpert
	classifier remain frozen. Training runs for 10 epochs at peak learning
	rate $2\!\times\!10^{-4}$ with the same cosine schedule and 5\% warm-up;
	the effective batch size is 16 on every profile, with the per-device batch
	and the gradient-accumulation steps adapted to GPU memory ($1{\times}16$
	on T4, $8{\times}2$ on L4/3090, $16{\times}1$ on A100). The optimiser is
	\texttt{adamw\_torch} on FP32 master weights for the projection and
	LoRA (\texttt{paged\_adamw\_8bit} on T4, see \S\ref{sec:hparams}), with the
	4-bit base Vicuna acting as a quantised constant.

	\paragraph{Loss masking and image-token accounting.} A subtlety that
	deserves explicit mention is the bookkeeping around the
	\texttt{<image>} placeholder. The tokenised prompt contains exactly
	\textit{one} \texttt{<image>} token, and the model replaces its embedding
	with the 32 projection-emitted visual tokens at forward time. To keep
	the attention mask, position ids, and label tensor consistent, the
	forward pass expands all three by 31 entries at the placeholder
	position: the attention-mask entry for each visual token is set to 1,
	the position ids are made contiguous, and the labels in the visual span
	are filled with $-100$. The same expansion is applied symmetrically at
	inference. This expansion is the single most error-prone piece of the
	pipeline (because off-by-one mistakes silently shift the labels by
	$\pm 31$ positions and produce a degenerate loss curve that descends
	fast but converges to a useless equilibrium), so an integration test
	checks that label masks before and after expansion sum to the same
	count of non-$-100$ entries for every batch produced by the collator.
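
	The index bookkeeping can be written down explicitly. In the sketch below
	$L$ is the tokenised length, $p$ the position of the \texttt{<image>}
	placeholder, and $\pi$ the map from old to new positions of the text
	tokens; all three symbols are introduced here for illustration.
	\[
	\pi(j) =
	\begin{cases}
	j, & j < p,\\
	j + 31, & j > p,
	\end{cases}
	\qquad
	L' = L + 31,
	\]
	with the visual tokens occupying positions $p,\dots,p+31$. For example,
	with $L = 200$ and $p = 40$ the expanded length is $L' = 231$, and the text
	token originally at position 41 moves to position 72. Because only $-100$
	entries are inserted, the number of supervised positions is unchanged,
	which is the invariant the integration test checks.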

	\paragraph{Stage~0 --- the CheXpert classifier head.} Although not a
	stage of VLM training proper, the PNU classifier described in
	\S\ref{sec:chex} is fitted before Stages~1 and 2 begin. It is a small
	MLP on top of the frozen RAD-DINO \texttt{[CLS]} embedding, optimised
	under 14 independent three-class cross-entropies on the
	\texttt{chex\_*} columns of the manifest, with U-MultiClass-style
	\{positive, negative, uncertain\} targets. Training takes minutes on a
	single GPU once the patch features are cached and is otherwise
	unremarkable; the resulting checkpoint is loaded read-only by both
	subsequent VLM stages.

	\section{Hyperparameters and Optimiser}
	\label{sec:hparams}

	Table~\ref{tab:hparams} consolidates the hyperparameters of both stages.
	Values are taken from \texttt{configs/train\_config.yaml}; any per-stage
	override under \texttt{stage1:}/\texttt{stage2:} takes precedence over
	the global \texttt{training:} block. Apart from the per-device batch and
	gradient-accumulation pair, optimiser choice is the only hyperparameter
	that we vary by GPU: A10/L4/3090 hosts use the standard
	\texttt{adamw\_torch} (FP32 moments), while T4 hosts switch to
	\texttt{paged\_adamw\_8bit} from \texttt{bitsandbytes}, whose 8-bit moments
	take roughly a quarter of the FP32 optimiser-state footprint. The
	auto-detect cell in \texttt{scripts/cxrvlm\_colab\_train.ipynb} sets this
	and a matching batch/accum pair so the effective batch is invariant across
	profiles.

	\begin{table}[h]
	\centering
	\small
	\caption{Training hyperparameters for the two stages.
	``ITC mode'' refers to the contrastive Stage~1 used in this work;
	the legacy RaDialog-style Stage~1 (causal LM through Vicuna) is also
	implemented in the codebase but is not used in the reported runs.}
	\label{tab:hparams}
	\begin{tabular}{lll}
	\toprule
	\textbf{Hyperparameter} & \textbf{Stage 1 (ITC)} & \textbf{Stage 2} \\
	\midrule
	Trainable modules & Projection + ITC head & Projection + LoRA \\
	Frozen modules & Encoder, classifier, (no LLM) & Encoder, classifier, base LLM \\
	Loss & Symmetric InfoNCE (Eq.~\ref{eq:itc}) & Causal CE (Eq.~\ref{eq:lm}) \\
	Temperature $\tau$ & $0.07$ & --- \\
	Epochs & 2 & 10 \\
	Peak learning rate & $1\!\times\!10^{-3}$ & $2\!\times\!10^{-4}$ \\
	LR schedule & cosine, 5\% warm-up & cosine, 5\% warm-up \\
	Weight decay & $0.01$ & $0.01$ \\
	Effective batch size & 64--96 & 16 \\
	Per-device batch & 64 (L4/3090) & 8 (L4/3090), 1 (T4) \\
	Gradient accumulation & 1 & 2 (L4/3090), 16 (T4) \\
	Mixed precision & BF16 (Ampere+) / FP16 (T4) & BF16 / FP16 \\
	LLM quantisation & --- (LLM not loaded) & 4-bit NF4, double-quant \\
	LoRA $(r, \alpha, p)$ & --- & $(16,\,32,\,0.05)$ \\
	LoRA modules & --- & $\{q, k, v, o\}_{\text{proj}}$ \\
	Cutoff length & --- & 512 tokens \\
	Optimiser & AdamW (paged 8-bit on T4) & AdamW (paged 8-bit on T4) \\
	Gradient checkpointing & N/A (no LLM) & on \\
	\bottomrule
	\end{tabular}
	\end{table}

	We make no claim that the values in Table~\ref{tab:hparams} are optimal
	--- almost all of them are inherited from RaDialog with minimal tuning.
	The two we did vary were the Stage~1 batch size (a sweep over
	$\{16, 32, 64, 96\}$ chose $64$ as the smallest batch within $0.3\%$ of
	the best validation contrastive accuracy) and the Stage~2 learning rate
	(a sweep over $\{1{\times}10^{-4}, 2{\times}10^{-4}, 5{\times}10^{-4}\}$
	chose $2{\times}10^{-4}$ unambiguously; $5{\times}10^{-4}$ diverged
	within $\sim\!300$ steps).
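
	For concreteness, the schedule in Table~\ref{tab:hparams} can be written
	out. The form below assumes a linear warm-up followed by cosine decay to
	zero, as in the \texttt{get\_cosine\_schedule\_with\_warmup} helper of
	\texttt{transformers}; $T$ is the total number of optimiser steps and
	$T_w = 0.05\,T$ the warm-up length, both introduced here.
	\[
	\eta(t) =
	\begin{cases}
	\eta_{\max}\,\dfrac{t}{T_w}, & t < T_w,\\[6pt]
	\dfrac{\eta_{\max}}{2}\left(1 + \cos\!\left(\pi\,\dfrac{t - T_w}{T - T_w}\right)\right), & t \ge T_w .
	\end{cases}
	\]
	With a hypothetical $T = 10{,}000$ steps, warm-up would end at step 500 and
	the rate would fall to $\eta_{\max}/2$ at step $5{,}250$, halfway through
	the decay.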

	\section{Checkpointing, Resume, and Multi-platform Execution}
	\label{sec:checkpointing}

	Every save during either stage writes a HuggingFace-Trainer-format
	\texttt{checkpoint-\textit{step}/} directory locally and, when
	\texttt{hf\_hub.enabled} is true, mirrors it to a private Hugging Face
	model repository under the path \texttt{\{run\_id\}/\{stage\}/last/};
	the best checkpoint by \texttt{eval\_loss} is additionally copied to
	\texttt{best/}. \texttt{save\_total\_limit=1} guarantees that at most
	two checkpoints live per stage on disk (best + last), bounding disk use
	without breaking resume. The run identifier is persisted to a tiny
	\texttt{run\_id.txt} file so a re-launched job picks up the same run
	folder; switching this off (the legacy behaviour) creates a fresh
	\texttt{\{dataset\}\_run\_\{N+1\}} on every launch.

	The motivation for this scaffolding is purely practical. The full
	training run takes roughly $36$ wall-clock hours on a single
	A10G or L4, considerably longer on T4, and considerably shorter on A100.
	None of the platforms we have access to (Google Colab, Kaggle Notebooks,
	Vast.ai pods, Lightning.ai studios) guarantee uninterrupted GPU time at
	those durations: Colab disconnects, Kaggle preempts at 9\,h, Vast pods
	sporadically drop. The HF Hub round-trip is the only persistence layer
	that survives all of them. The unified resume controller
	(\texttt{--mode resume}) handles the hard parts: it pulls the latest
	\texttt{last/} of both stages from the Hub into the canonical local
	layout via \texttt{hydrate\_run\_dir\_from\_hf}, then calls
	\texttt{detect\_resume\_point} which inspects what is on disk and
	decides whether to start Stage~1, resume Stage~1, start Stage~2, resume
	Stage~2, or exit (everything is finished). The user does not need to
	know which stage the previous machine died in; the controller works it
	out from filesystem state.

	Two operational details are worth recording. First, the optimiser state and
	the 4-bit compute dtype should stay consistent across a resume, which means
	the GPU family should not change between launches. Switching from an A100 to
	a T4 mid-run is safe as far as the checkpoint format goes (LoRA adapters and
	projection weights are dtype-agnostic), but the AdamW moments will be
	reloaded with slightly different numerical accumulators. We therefore
	recommend staying within the same VRAM/precision bucket: L4, A10G, RTX 3090,
	and similar Ampere-or-newer 24\,GB cards are interchangeable. Second,
	\texttt{PeftModel.from\_pretrained} defaults to inference mode and
	returns adapters with \texttt{requires\_grad=False}; the resume code
	path passes \texttt{is\_trainable=True} explicitly so the LoRA
	parameters are again exposed to the optimiser. This was a non-obvious
	bug in earlier iterations of the codebase and is now covered by a unit
	test that asserts the trainable parameter count after resume matches
	the count at save time.

	\section{What Stays Off the Loss}
	\label{sec:loss-exclusions}

	A short note on what is \textit{not} optimised. The image encoder
	weights are frozen, the CheXpert classifier weights are frozen, the
	Vicuna base weights are frozen (and 4-bit quantised, so even if
	gradients flowed they could not be applied without dequantisation), the
	ITC head is discarded after Stage~1 and never sees a gradient in
	Stage~2, and the system prompt and structured-findings strings are
	masked out of the language modelling loss. The only learning happens at
	the projection, the LoRA adapters, and (transiently, during Stage~1
	only) the ITC head. This is by design: every additional trainable
	module multiplies the optimiser-state footprint, complicates the
	distributed-training story (which we explicitly do not implement; all
	experiments are single-GPU), and --- in our ablations --- failed to
	improve any of the headline metrics on the held-out test split.

	The clinical evaluation in Chapter~\ref{ch:experiments} reports
	ROUGE-1/2/L, BLEU-1/4, BERTScore-F1, and the CheXbert-derived clinical
	F1 over the 14 pathology vocabulary on the findings and impression
	tasks, and exact-match accuracy plus token F1 plus BLEU-1 on VQA. All
	metrics are computed on the patient-disjoint test split established in
	Phase~1 of the data pipeline; the model selection criterion during
	training is the standard \texttt{eval\_loss} on the corresponding
	validation split, never any of the downstream metrics --- those are
	reserved for the final report.