% =============================================================================
% Chapter: Model and Training
% Companion to docs/data_preparation.md
% Configuration assumed throughout:
% data.report_mode = split_cascade
% data.image_mode = frontal_only_split
% =============================================================================
\chapter{Model and Training}
\label{ch:model-training}
\section{Architectural Overview}
\label{sec:arch-overview}
The model is a single vision--language network that solves three downstream
tasks --- findings generation, impression generation, and visual question
answering --- through one shared backbone. The design follows the
RaDialog~\cite{radialog} recipe with two deliberate modifications inspired by
META-CXR's U-MultiClass~\cite{metacxr} and BLIP-2's image--text contrastive
alignment~\cite{blip2}: \textit{(i)} a frozen 14-pathology CheXpert classifier
whose probabilistic predictions are serialised into the prompt as a
\textsc{Positive\,/\,Negative\,/\,Uncertain} (PNU) string, and \textit{(ii)} an
optional contrastive Stage~1 that pre-aligns the projection in a joint
image--text embedding space without ever loading the language model. The full
forward path can be summarised as
\begin{equation}
\underbrace{\mathbf{x}}_{518\times518}
\;\xrightarrow{\;\text{RAD-DINO}\;}\;
\mathbf{P}\in\mathbb{R}^{B\times N_p\times 768}
\;\xrightarrow{\;\text{MLP-Proj}\;}\;
\mathbf{V}\in\mathbb{R}^{B\times 32\times 4096}
\;\xrightarrow{\;\text{Vicuna-7B}+\text{LoRA}\;}\;
\hat{\mathbf{y}}
\label{eq:fwd}
\end{equation}
where the projection step also exposes a 1024-dimensional intermediate
representation that feeds an ITC head in Stage~1 (contrastive), while its
4096-dimensional output feeds the LLM in Stage~2 (autoregressive). Only the
MLP projection, the LoRA adapters
on Vicuna, and (when enabled) the ITC head are trained --- the image encoder,
the CheXpert classifier, and the Vicuna base weights are kept frozen
throughout. Stage~1 (ITC) trains roughly $3.7$\,M parameters; Stage~2 adds the
LoRA adapters (rank $r{=}16$ on $q$, $k$, $v$, $o$ projections of all 32
transformer blocks) for a total of $\approx\!21.5$\,M trainable parameters
out of $\approx\!7.1$\,B --- about $0.30\%$.
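
As a rough cross-check of the LoRA share of that budget, assume the standard
LLaMA-7B attention geometry, in which each of the four adapted projections is
a $4096\times4096$ matrix (an assumption about the base architecture, not a
figure read from the training logs). A rank-$r$ adapter on a
$d_{\text{in}}\times d_{\text{out}}$ matrix adds
$r\,(d_{\text{in}}+d_{\text{out}})$ parameters, so
\begin{align*}
N_{\text{adapter}} &= 16\,(4096+4096) = 131\,072, \\
N_{\text{LoRA}} &= 4 \times 32 \times 131\,072 = 16\,777\,216 \approx 16.8\,\text{M},
\end{align*}
and the quoted trainable fraction follows as
$21.5\,\text{M} / 7.1\,\text{B} \approx 0.30\%$.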
\section{Image Encoder}
\label{sec:encoder}
The visual backbone is Microsoft's \textbf{RAD-DINO}~\cite{raddino}, a
ViT-B/14 trained with DINOv2 self-supervision on $\approx\!840$\,k chest X-rays
spanning MIMIC-CXR, CheXpert, NIH ChestX-ray14, and PadChest. The
\texttt{microsoft/rad-dino} checkpoint is loaded through the HuggingFace
\texttt{transformers} hub and accepts inputs at its native resolution of
$518\times518$, which matches the longest-edge target used in our resize step
(Phase~3 of the data pipeline). At this resolution the encoder emits
$1369$ patch tokens of dimension $768$ per image; the class token is
discarded so that the downstream projection sees a clean
$\mathbb{R}^{1369\times768}$ patch grid. We deliberately chose RAD-DINO over
the original RaDialog backbone BioViL-T~\cite{biovilt} for three reasons: it
is published as a standard HuggingFace model (no dependency on the legacy
\texttt{hi-ml-multimodal} library that pins Python\,${<}\,3.11$), it is
trained on roughly an order of magnitude more chest-X-ray data, and its
patch-token grid is dense enough that the cross-attention pool in our
projection has enough spatial bandwidth to reason about both global pathology
and small focal abnormalities (nodules, opacities) without resorting to
multi-resolution tricks.
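
The token count follows directly from the patch size. Assuming the usual ViT
convention of non-overlapping $14\times14$ patches with no padding,
\begin{equation*}
N_p = \left(\frac{518}{14}\right)^{2} = 37^{2} = 1369,
\end{equation*}
so each image yields a $37\times37$ grid of 768-dimensional patch tokens once
the class token is dropped.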
The encoder is kept entirely frozen --- a choice motivated less by
representation quality than by training stability. With 4-bit quantised
weights in the LLM, even modest gradient flow through an 86\,M-parameter ViT
overruns the activation budget of consumer GPUs (16--24\,GB), and ablations in
RaDialog and LLaVA-Med~\cite{llavamed} both indicate that unfreezing the
image encoder before the projection has converged tends to collapse the
joint representation. We compensate for the lack of encoder fine-tuning with
the two-stage training schedule described in
\S\ref{sec:training-schedule}, in which the projection alone first specialises
the frozen patch features to the chest-X-ray report distribution.
To eliminate JPEG decode and patch embedding from the training-step inner
loop, the codebase supports an \textit{offline feature cache}: each
$1369{\times}768$ patch tensor is written once to disk and the dataset
returns it directly, bypassing the image encoder entirely (see
\texttt{feature\_cache\_dir} in the training configuration). This is a
loss-free transformation because the encoder is frozen, and it lifts dataset
throughput by roughly $3$--$5\times$ on single-GPU L4/3090-class machines
where the encoder forward dominates step time. The MIMIC-CXR Stage~2 run
reported in Chapter~\ref{ch:experiments} was executed with the cache active.
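
A back-of-the-envelope estimate of the cache footprint, assuming the tensors
are stored uncompressed in 16-bit precision (the storage dtype is an
assumption here, not something fixed by the description above):
\begin{equation*}
1369 \times 768 = 1\,051\,392 \text{ values per image}, \qquad
1\,051\,392 \times 2\,\text{B} \approx 2.1\,\text{MB}.
\end{equation*}
The figure doubles to roughly $4.2$\,MB per image if the features are kept in
FP32.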
\section{MLP Projection}
\label{sec:projection}
The role of the projection is twofold: pool a variable-length patch sequence
into a fixed visual-token budget that the LLM can attend to, and bridge the
embedding spaces of the frozen vision encoder ($d_v{=}768$) and the frozen
language model ($d_l{=}4096$). RaDialog~v2 already established that a
lightweight MLP with a perceiver-style learnable query block is competitive
with the heavier Q-Former used in BLIP-2 and XrayGPT, at a small fraction of
the parameter count and with no auxiliary objective required at the
projection level. We adopt that design with a small modification --- exposing
the intermediate 1024-dimensional activation so that the optional ITC head of
Stage~1 has a richer tap point than the final 4096-d LLM-space embedding.
The forward pass is
\begin{align}
\mathbf{Q}_i &= \mathbf{Q}_0 \in \mathbb{R}^{32\times 768}, \qquad i=1,\dots,B \\
\mathbf{H}^{(0)} &= \operatorname{CrossAttn}(\mathbf{Q}_i, \mathbf{P}_i, \mathbf{P}_i) \in \mathbb{R}^{32\times 768} \\
\mathbf{H}^{(1)} &= \operatorname{Dropout}\big(\operatorname{GELU}(\mathbf{W}_1\mathbf{H}^{(0)})\big) \in \mathbb{R}^{32\times 1024} \label{eq:tap} \\
\mathbf{V}_i &= \mathbf{W}_2 \mathbf{H}^{(1)} \in \mathbb{R}^{32\times 4096}
\end{align}
with $\mathbf{Q}_0$ a learnable parameter, $\operatorname{CrossAttn}$ an
8-head multi-head attention block at $d{=}768$, and $\mathbf{W}_1$,
$\mathbf{W}_2$ two linear layers (no bias dropout). The number of visual
tokens (32) matches RaDialog and sits inside the empirical sweet spot
reported by both LLaVA and BLIP-2: fewer than 16 tokens loses spatial detail
on small pathologies, more than 64 only inflates the LLM's sequence length
without measurable gains on radiology metrics. The query matrix is
initialised from $\mathcal{N}(0, 0.02^2)$ and the two MLP linears use the
same small-normal initialisation as Vicuna's own \texttt{Linear} layers;
empirically, larger initial scales destabilise training during the first
$\sim\!500$ optimiser steps when the LLM is still attending to noisy
visual tokens.
The 1024-d tap point in Equation~\ref{eq:tap} is not a casual exposure ---
it is the explicit grounding signal of the ITC head described in
\S\ref{sec:training-schedule}, and it lives between the GELU and the
final linear so that the contrastive objective sees an already-nonlinear
representation rather than the raw cross-attention pool. Crucially, the
projection is implemented as a single \texttt{nn.Sequential} with two
linear layers indexed as \texttt{mlp[0]} and \texttt{mlp[3]} so that
checkpoints saved before the tap-point modification load unchanged ---
backward compatibility with earlier runs is preserved at the parameter
naming level.
\paragraph{Numerical-precision detail.} The frozen encoder runs in the
LLM's compute dtype (BF16 on Ampere+ or FP16 on Turing) while the
projection's parameters are kept in FP32. PyTorch's autocast wrapper does
not consistently cover the in-projection of \texttt{nn.MultiheadAttention}
on the cross-attention path under BF16, which surfaces as a
\texttt{mat1 and mat2 must have the same dtype} runtime error on Ampere+
GPUs. The projection therefore explicitly upcasts its input to the
parameter dtype before the cross-attention call. The conversion is a
no-op on T4-class GPUs (where everything is already FP16) and a single
copy on Ampere+ --- not a bottleneck.
\section{CheXpert Abnormality Classifier}
\label{sec:chex}
Following META-CXR's U-MultiClass formulation, abnormality information
enters the prompt as a 14-pathology, 3-class string rather than as a 14-d
logit vector. The classifier itself is a small MLP head sitting on the
global \texttt{[CLS]} embedding of RAD-DINO; it produces, per study, a
$14\times3$ logit tensor mapped to one of $\{\text{Positive},
\text{Negative}, \text{Uncertain}\}$ per CheXpert label. The string format
fed to the LLM is
\begin{verbatim}
Positive Abnormalities: Cardiomegaly, Pleural Effusion
Negative Abnormalities: No Finding, Edema, Pneumothorax, ...
Uncertain Abnormalities: Atelectasis
\end{verbatim}
Three properties of this design matter. First, the U-MultiClass
formulation preserves the distinction between \textit{negative} (the
report explicitly rules the pathology out) and \textit{uncertain} (the
report hedges with \texttt{may represent}, \texttt{cannot exclude}, etc.).
This nuance is destroyed by binary CheXpert mappings and matters
clinically: a confident negative carries information that an uncertain
case does not. Second, expressing the labels as text rather than as
auxiliary embeddings means no architectural change is required when
labels are missing --- the field simply becomes the empty string and the
prompt degrades gracefully, which is exactly what happens at inference on
out-of-distribution images. Third, in our prompt the PNU string is
inserted between the visual tokens and the natural-language instruction,
so the LLM's self-attention can route information from text to image
freely; both the CheXbert-derived oracle (during training) and the
classifier's own predictions (during evaluation) occupy the same prompt
slot.
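
The serialisation into the string format shown above can be sketched in a few
lines of Python. The helper name \texttt{format\_pnu} and the rule that a
category without entries is omitted are assumptions made for illustration;
only the line format and the empty-string fallback are taken from the text.
\begin{verbatim}
# Hypothetical helper: serialise per-pathology PNU labels into the prompt block.
CATEGORIES = ("Positive", "Negative", "Uncertain")


def format_pnu(labels):
    """labels maps pathology name -> 'Positive' | 'Negative' | 'Uncertain' | None."""
    lines = []
    for category in CATEGORIES:
        names = [name for name, value in labels.items() if value == category]
        if names:  # assumption: a category with no entries is left out
            lines.append(f"{category} Abnormalities: {', '.join(names)}")
    # Missing labels (None) match no category, so an unlabelled study
    # degrades to the empty string.
    return "\n".join(lines)


example = {
    "Cardiomegaly": "Positive",
    "Pleural Effusion": "Positive",
    "Edema": "Negative",
    "Atelectasis": "Uncertain",
    "Fracture": None,
}
print(format_pnu(example))
\end{verbatim}
Because missing labels simply drop out of every category, the same function
covers both the oracle setting and the label-free case without a separate code
path.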
In this work the classifier is trained separately as a Stage~0 step: the
RAD-DINO encoder stays frozen and only the MLP head is fitted, with one
three-way cross-entropy per pathology on targets derived from the CheXpert
CSV. It is then frozen for Stages~1 and 2. During training of the VLM the
ground-truth labels from the CSV populate the PNU string directly (an oracle
setting, analogous to the teacher-forced ground-truth findings used in
\texttt{split\_cascade}); during evaluation the classifier is invoked on
the image to produce its own PNU prediction.
\section{Language Model and Parameter-Efficient Adaptation}
\label{sec:llm}
The decoder is \textbf{Vicuna-7B v1.3}~\cite{vicuna}, a LLaMA-1
derivative that was instruction-tuned on ShareGPT conversations. We chose
the v1.3 series rather than the newer v1.5 (LLaMA-2 base) or modern Llama-3
checkpoints for two specific reasons: it is the exact LM used by the
RaDialog baseline, so our results can be compared directly, and its
chat template has a single, clean \texttt{USER: ... ASSISTANT: ...}
structure that simplifies the label-masking logic at training time.
The base model is loaded in 4-bit NF4 quantisation
(\texttt{BitsAndBytesConfig}, double quantisation enabled,
compute dtype matched to the GPU --- BF16 on Ampere+, FP16 on Turing), which
brings the resident weight footprint from $\approx\!14$\,GB FP16 down to
$\approx\!4.0$\,GB and frees enough headroom on a 16\,GB T4 to host the
RAD-DINO encoder, the projection, the activations, the optimiser state, and
a small batch concurrently.
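
The footprint figures can be sanity-checked with a simple estimate, taking
$7\times10^{9}$ parameters as a round number for the base model:
\begin{align*}
\text{FP16:}\quad & 7\times10^{9} \times 2\,\text{B} = 14\,\text{GB}, \\
\text{NF4:}\quad & 7\times10^{9} \times 0.5\,\text{B} = 3.5\,\text{GB}.
\end{align*}
The gap between $3.5$\,GB and the observed $\approx\!4.0$\,GB is plausibly
accounted for by the quantisation constants and by tensors that stay in
16-bit precision, such as the embedding table; this breakdown is an estimate
rather than a measurement.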
Adaptation is performed with LoRA~\cite{lora}: rank-16 low-rank
adapters are inserted on the four attention projections
(\texttt{q\_proj}, \texttt{k\_proj}, \texttt{v\_proj}, \texttt{o\_proj}) of
every transformer block. The feed-forward sublayers are deliberately left
untouched. Larger LoRA placements (e.g.\ adding the feed-forward projections
\texttt{gate\_proj}, \texttt{up\_proj}, and \texttt{down\_proj}) measurably
improve perplexity but also enlarge the trainable-parameter count by
$3$--$4\times$, and our experiments saw no corresponding lift in CheXbert F1
on the held-out test split. With
\texttt{lora\_alpha}${=}32$ and \texttt{lora\_dropout}${=}0.05$ the effective
LoRA scaling is $\alpha/r=2$, which we found to be the sweet spot in a brief
grid search over $\{1, 2, 4\}$.
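
For orientation, the quantisation and adapter settings described in this
section map onto the public \texttt{transformers} and \texttt{peft}
configuration objects roughly as follows. This is a configuration sketch
under stated assumptions, not a copy of the project's loading code: the
variable names and the \texttt{bias} and \texttt{task\_type} choices are
illustrative.
\begin{verbatim}
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# BF16 on Ampere or newer, FP16 otherwise (e.g. a Turing T4).
compute_dtype = (
    torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=compute_dtype,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,          # effective scaling alpha / r = 2
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",            # assumption, not stated in the text
    task_type="CAUSAL_LM",  # assumption, not stated in the text
)
\end{verbatim}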
Two further LLM-side details are worth recording. First, the attention
implementation is auto-detected: FlashAttention-2 is used on Ampere/Ada
GPUs (a $2$--$3\times$ throughput win), with a graceful fall-back to
PyTorch SDPA on Turing, where FA2 is unavailable. Second, gradient
checkpointing is enabled by default on the LLM. On a 24\,GB-or-larger GPU
with 4-bit + LoRA + FA2 the activations fit without checkpointing, so it
can be disabled to recover $\approx\!25\%$ of step time; on the L4/T4
profiles used throughout this thesis we leave it on.
\section{Prompt Assembly and Sequence Layout}
\label{sec:prompt}
All three tasks share the same prompt skeleton, which follows Vicuna's
v1.1 chat template:
\begin{verbatim}
{SYSTEM_PROMPT} USER: <image>
{PNU structured findings}
{task-specific context block}
{instruction} ASSISTANT: {target}
\end{verbatim}
The \texttt{<image>} placeholder is added to the tokenizer as a special
token (vocabulary id 32000, the first free slot above Vicuna's 32\,000
base vocabulary). At forward time the model locates this single token,
replaces its embedding with the 32 visual tokens from the projection, and
expands the attention mask and label tensors by $+31$ positions per sample
so the downstream causal-attention mask is consistent. Visual-token
positions in the label tensor are filled with $-100$ so they are excluded
from the cross-entropy loss.
The task-specific context block differs per task, and is the principal
mechanism by which \texttt{split\_cascade} differs from a plain
\texttt{split} schedule:
\begin{itemize}
\item For \textbf{findings} samples it is empty; the model is asked to
produce the findings paragraph from the image and the PNU labels
alone.
\item For \textbf{impression} samples it is the literal string
\texttt{Findings: <GT findings>}; the model conditions on the
ground-truth findings paragraph and is asked to summarise it.
Studies whose report lacks a findings section emit no impression
sample, by construction of the Phase~1 filter chain.
\item For \textbf{VQA} samples it is empty; the natural-language
question itself is the instruction.
\end{itemize}
This \texttt{split\_cascade} layout makes the impression task much closer
to a controlled summarisation problem than to free-form generation, which
matches both clinical practice (the impression is written \textit{after}
the findings) and the strong empirical bias of LLMs towards
extract-and-paraphrase behaviour when explicit context is provided. The
trade-off is that the gain measured here is an upper bound on what a true
end-to-end cascade (impression conditioned on the model's
\textit{own} generated findings) would achieve; we report numbers under
the teacher-forced regime and discuss the gap in
Chapter~\ref{ch:experiments}. Each of findings, impression, and the
report variant has \textbf{ten} hand-written instruction paraphrases
sampled uniformly at training time; at evaluation the first variant is
used deterministically so metrics are reproducible.
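
The prompt skeleton and the per-task context block can be sketched as below.
The function names, the \texttt{SYSTEM\_PROMPT} placeholder, and the choice to
drop empty blocks instead of emitting blank lines are assumptions for
illustration; the ordering of the blocks and the \texttt{Findings:} prefix
follow the description above.
\begin{verbatim}
# Hypothetical sketch of the prompt skeleton; SYSTEM_PROMPT is a placeholder.
SYSTEM_PROMPT = "<system prompt>"


def build_context(task, gt_findings=None):
    """Task-specific context block under split_cascade."""
    if task == "impression":
        if not gt_findings:
            raise ValueError("impression samples require a findings paragraph")
        return f"Findings: {gt_findings}"
    if task in ("findings", "vqa"):
        return ""
    raise ValueError(f"unknown task: {task}")


def build_prompt(task, pnu, instruction, gt_findings=None):
    """Prompt up to and including 'ASSISTANT:'; the target is appended later."""
    blocks = ("<image>", pnu, build_context(task, gt_findings), instruction)
    body = [block for block in blocks if block]  # assumption: skip empty blocks
    return f"{SYSTEM_PROMPT} USER: " + "\n".join(body) + " ASSISTANT:"


prompt = build_prompt(
    "impression",
    "Positive Abnormalities: Cardiomegaly",
    "Summarise the findings.",
    gt_findings="The heart is enlarged.",
)
print(prompt)
\end{verbatim}
Keeping the context block in its own function makes the difference between
\texttt{split} and \texttt{split\_cascade} a one-line change.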
A complete training sample is tokenised with \texttt{cutoff\_len}${=}512$
and truncated on the right. Right-truncation has a cost, because the
assistant response sits at the right end of the sequence: if the prompt
itself runs long, the truncation eats into the target and the loss receives
proportionally less signal. We accept that cost because left-truncation
would destroy the system prompt and the PNU block, both of which carry
non-redundant information. The label tensor is masked with $-100$ on
every prompt token, every padding token, and every visual token, so loss
is computed strictly on the ASSISTANT response.
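
A minimal sketch of the truncation and masking logic on plain token-id lists
follows. The function name, the padding id of 0, and padding up to
\texttt{cutoff\_len} (a real collator would more likely pad to the longest
sample in the batch) are assumptions; visual-token masking is left out because
it happens when the \texttt{<image>} placeholder is expanded.
\begin{verbatim}
IGNORE_INDEX = -100


def build_labels(prompt_ids, target_ids, cutoff_len=512, pad_id=0):
    """Right-truncate prompt+target, pad, and mask everything but the target."""
    input_ids = (prompt_ids + target_ids)[:cutoff_len]
    n_prompt = min(len(prompt_ids), cutoff_len)
    # Prompt positions are ignored; surviving target positions keep their ids.
    labels = [IGNORE_INDEX] * n_prompt + input_ids[n_prompt:]
    n_pad = cutoff_len - len(input_ids)
    input_ids = input_ids + [pad_id] * n_pad
    labels = labels + [IGNORE_INDEX] * n_pad
    return input_ids, labels


ids, labels = build_labels([1, 2, 3], [4, 5], cutoff_len=8)
print(ids)     # prompt, target, then padding
print(labels)  # only the two target positions are supervised
\end{verbatim}
When the prompt alone reaches \texttt{cutoff\_len}, every label is $-100$ and
the sample contributes nothing to the loss, which is the failure mode
described above.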
\section{Training Schedule}
\label{sec:training-schedule}
Training follows a two-stage curriculum closely modelled on RaDialog,
with Stage~1 reformulated as an explicit image--text contrastive
alignment in the spirit of BLIP-2. The motivation for the split is the
classic representation-then-instruction division: it is wasteful to drive
the LLM's LoRA adapters with gradients while the projection still emits
ill-conditioned visual tokens, and conversely the projection cannot be
trained against the LM loss alone without paying for the full Vicuna
forward at every step.
\paragraph{Stage 1 --- contrastive alignment of the projection.} The
goal of Stage~1 is to specialise the projection (and only the projection)
so that the visual tokens it produces are linearly aligned with the
text representation of the corresponding radiology report, before any
language modelling takes place. We instantiate this via the ITC head
described in \S\ref{sec:projection}: the 32 intermediate
1024-d tokens are mean-pooled, projected to 128-d, and L2-normalised.
On the text side, the canonical reference sentence per study --- the
findings paragraph, falling back to the impression when findings is
absent --- is encoded \textit{once, offline} with
\texttt{microsoft/BiomedVLP-CXR-BERT-specialized}~\cite{cxrbert}
through its \texttt{get\_projected\_text\_embeddings} interface, also
producing a 128-d L2-normalised vector. These per-study text embeddings
are written to disk as a \texttt{\{study\_id: tensor[128]\}} cache (see
\texttt{scripts/precompute\_cxrbert\_embeddings.ipynb}); the cache is
itself published to the Hugging Face data repository so any training
host can pull it in seconds rather than re-running CXR-BERT.
With the text cache available, Stage~1 runs a symmetric InfoNCE loss
\begin{equation}
\mathcal{L}_{\text{ITC}}
= -\tfrac{1}{2}\left[
\sum_{i=1}^{B}\log\!\frac{\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_i/\tau)}
{\sum_{j=1}^{B}\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_j/\tau)}
+
\sum_{i=1}^{B}\log\!\frac{\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_i/\tau)}
{\sum_{j=1}^{B}\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_j/\tau)}
\right]
\label{eq:itc}
\end{equation}
with $\mathbf{v}_i$ the image embedding from the projection+ITC head,
$\mathbf{t}_i$ the cached text embedding for the same study, and the
temperature $\tau{=}0.07$ following CLIP and CXR-BERT's original setup.
The dataset is de-duplicated to one image per \texttt{study\_id} before
batching (because the text embedding is study-level), and ITC training
loads the model with \texttt{load\_llm=False} --- Vicuna is simply not
instantiated. This single flag is the entire reason Stage~1 is feasible
on a 24\,GB GPU at the contrastive batch sizes the loss demands:
freeing the $\approx\!13$\,GB taken up by the resident Vicuna weights plus
their associated activations lifts the per-device batch from 8 (the Stage~2
budget) to 64--96 without changing any other configuration.
A larger batch directly enlarges the InfoNCE negative pool, and we
observed monotone improvement in the validation contrastive accuracy as
the batch grew from 8 to 64; saturation began around 96 on our dataset
scale, in line with the BLIP-2 ablations.
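
To make Equation~\ref{eq:itc} concrete, the following dependency-free Python
sketch evaluates it on plain lists of L2-normalised vectors. It illustrates
the formula as written (a sum over the batch, not a mean) and is not the
training implementation, which would operate on batched tensors; the function
name \texttt{info\_nce} is illustrative.
\begin{verbatim}
import math


def info_nce(v, t, tau=0.07):
    """Symmetric InfoNCE over L2-normalised embeddings, summed over the batch."""
    batch = len(v)
    # sim[i][j] = v_i . t_j / tau
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / tau for j in range(batch)]
        for i in range(batch)
    ]

    def diag_log_softmax_sum(rows):
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            lse = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += row[i] - lse
        return total

    image_to_text = diag_log_softmax_sum(sim)
    text_to_image = diag_log_softmax_sum([list(col) for col in zip(*sim)])
    return -0.5 * (image_to_text + text_to_image)


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(aligned, aligned))  # close to zero: pairs match
print(info_nce(aligned, swapped))  # large: every pair is mismatched
\end{verbatim}
If every study had the same embedding, each log term would equal $-\log B$ and
the loss would reduce to $B\log B$, which is a convenient sanity check for any
implementation of the summed form.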
Stage~1 is run for 2 epochs at peak learning rate $1\!\times\!10^{-3}$
with a cosine schedule and 5\% warm-up; the projection's MLP and the ITC head are
trainable (the ITC head is tiny --- $\sim\!130$\,k parameters --- but
its gradients carry the entire contrastive signal). Encoder and
classifier are frozen and the LLM is absent. The Stage~1 checkpoint
saved at the end of training is the \textit{projection-only} state dict;
the ITC head is discarded and never reused, since it has no role at
generation time.
\paragraph{Stage 2 --- instruction tuning with QLoRA.} Stage~2 rebuilds
the full model with \texttt{load\_llm=True}, loads the Stage~1
projection weights, and switches to the autoregressive cross-entropy
objective. The dataset returns mixed batches according to the configured
task weights (findings $30\%$, impression $20\%$, VQA $50\%$ for
MIMIC-CXR; the weights renormalise automatically when VQA is absent for
IU-Xray); the report mode is \texttt{split\_cascade}, so the impression
samples carry the ground-truth findings as context, and the image mode
is \texttt{frontal\_only\_split} (one PA-or-AP frontal image per study).
The loss is the standard causal cross-entropy
\begin{equation}
\mathcal{L}_{\text{LM}}
= -\frac{1}{|\mathcal{T}|}
\sum_{(t,y)\in\mathcal{T}} \log p_{\theta}\!\left(y_t \mid y_{<t}, \mathbf{V}, \mathbf{c}\right)
\label{eq:lm}
\end{equation}
where $\mathcal{T}$ is the set of positions where the label tensor is
non-$-100$ (assistant response only), $\mathbf{V}$ the 32 visual tokens,
and $\mathbf{c}$ the textual prompt context (system, PNU, optional GT
findings, instruction).
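
For the task mixture described at the start of this paragraph, and assuming
the automatic renormalisation is a proportional rescaling over the tasks
actually present in the dataset, removing VQA from the
$(0.30,\,0.20,\,0.50)$ mixture gives
\begin{equation*}
w'_{\text{findings}} = \frac{0.30}{0.30+0.20} = 0.60, \qquad
w'_{\text{impression}} = \frac{0.20}{0.30+0.20} = 0.40,
\end{equation*}
so IU-Xray batches would be drawn at a 3:2 findings-to-impression ratio.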
Trainable parameters in Stage~2 are the projection's MLP and the LoRA
adapters on $q$, $k$, $v$, $o$ of all 32 Vicuna blocks (rank 16,
$\alpha{=}32$, dropout $0.05$). The image encoder and the CheXpert
classifier remain frozen. Training runs for 10 epochs at peak learning
rate $2\!\times\!10^{-4}$ with the same cosine schedule and 5\% warm-up; the
effective batch size is 16 on every profile, with the per-device batch and
the gradient-accumulation steps adapted to GPU memory ($1{\times}16$ on T4,
$8{\times}2$ on L4/3090, $16{\times}1$ on A100). The optimiser is
\texttt{adamw\_torch} on FP32 master weights for the projection and
LoRA (\texttt{paged\_adamw\_8bit} on T4, see \S\ref{sec:hparams}), with the
4-bit base Vicuna acting as a quantised constant.
\paragraph{Loss masking and image-token accounting.} A subtlety that
deserves explicit mention is the bookkeeping around the
\texttt{<image>} placeholder. The tokenised prompt contains exactly
\textit{one} \texttt{<image>} token, and the model replaces its embedding
with the 32 projection-emitted visual tokens at forward time. To keep
the attention mask, position ids, and label tensor consistent, the
forward pass expands all three by 31 entries at the placeholder
position: the attention-mask entry for each visual token is set to 1,
the position ids are made contiguous, and the labels in the visual span
are filled with $-100$. The same expansion is applied symmetrically at
inference. This expansion is the single most error-prone piece of the
pipeline (because off-by-one mistakes silently shift the labels by
$\pm 31$ positions and produce a degenerate loss curve that descends
fast but converges to a useless equilibrium), so an integration test
checks that label masks before and after expansion sum to the same
count of non-$-100$ entries for every batch produced by the collator.
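
The bookkeeping can be illustrated on plain Python lists. This is a sketch
under simplifying assumptions (one unpadded sample, position ids starting at
zero), with hypothetical function names; in the real forward pass the
placeholder embedding is replaced by the projected visual tokens, which is
only mirrored here by repeating the placeholder id.
\begin{verbatim}
IGNORE_INDEX = -100
IMAGE_TOKEN_ID = 32000
NUM_VISUAL_TOKENS = 32


def expand_image_token(input_ids, attention_mask, labels):
    """Widen the single <image> slot to NUM_VISUAL_TOKENS slots."""
    if input_ids.count(IMAGE_TOKEN_ID) != 1:
        raise ValueError("expected exactly one <image> token")
    pos = input_ids.index(IMAGE_TOKEN_ID)
    span = NUM_VISUAL_TOKENS
    new_ids = input_ids[:pos] + [IMAGE_TOKEN_ID] * span + input_ids[pos + 1:]
    new_mask = attention_mask[:pos] + [1] * span + attention_mask[pos + 1:]
    new_labels = labels[:pos] + [IGNORE_INDEX] * span + labels[pos + 1:]
    position_ids = list(range(len(new_ids)))  # contiguous after expansion
    return new_ids, new_mask, new_labels, position_ids


def count_supervised(labels):
    """Number of positions that contribute to the loss."""
    return sum(label != IGNORE_INDEX for label in labels)


ids = [1, IMAGE_TOKEN_ID, 5, 6, 7]
mask = [1, 1, 1, 1, 1]
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 6, 7]
new_ids, new_mask, new_labels, position_ids = expand_image_token(ids, mask, labels)
print(len(ids), len(new_ids))
print(count_supervised(labels), count_supervised(new_labels))
\end{verbatim}
The invariant checked by the integration test corresponds to the last line:
the number of supervised positions must be unchanged by the expansion, while
every sequence grows by exactly 31 entries.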
\paragraph{Stage~0 --- the CheXpert classifier head.} Although not a
stage of VLM training proper, the PNU classifier described in
\S\ref{sec:chex} is fitted before Stages~1 and 2 begin. It is a small
MLP on top of the frozen RAD-DINO \texttt{[CLS]} embedding, optimised
under 14 independent three-way cross-entropies, one per
\texttt{chex\_*} column of the manifest, with U-MultiClass-style
\{positive, negative, uncertain\} targets. Training takes minutes on a
single GPU once the encoder features are cached and is otherwise
unremarkable; the resulting checkpoint is loaded read-only by both
subsequent VLM stages.
\section{Hyperparameters and Optimiser}
\label{sec:hparams}
Table~\ref{tab:hparams} consolidates the hyperparameters of both stages.
Values are taken from \texttt{configs/train\_config.yaml}; any per-stage
override under \texttt{stage1:}/\texttt{stage2:} takes precedence over
the global \texttt{training:} block. Apart from the batch/accumulation
split, the optimiser is the only hyperparameter that we vary by GPU:
A10/L4/3090 hosts use the standard \texttt{adamw\_torch} (FP32 moments),
while T4 hosts switch to \texttt{paged\_adamw\_8bit} from
\texttt{bitsandbytes}, whose 8-bit moments cut the optimiser-state footprint
to roughly a quarter. The auto-detect cell in
\texttt{scripts/cxrvlm\_colab\_train.ipynb} sets this and a matching
batch/accum pair so the effective batch is invariant across profiles.
\begin{table}[h]
\centering
\small
\caption{Training hyperparameters for the two stages.
``ITC mode'' refers to the contrastive Stage~1 used in this work;
the legacy RaDialog-style Stage~1 (causal LM through Vicuna) is also
implemented in the codebase but is not used in the reported runs.}
\label{tab:hparams}
\begin{tabular}{lll}
\toprule
\textbf{Hyperparameter} & \textbf{Stage 1 (ITC)} & \textbf{Stage 2} \\
\midrule
Trainable modules & Projection + ITC head & Projection + LoRA \\
Frozen modules & Encoder, classifier, (no LLM) & Encoder, classifier, base LLM \\
Loss & Symmetric InfoNCE (Eq.~\ref{eq:itc}) & Causal CE (Eq.~\ref{eq:lm}) \\
Temperature $\tau$ & $0.07$ & --- \\
Epochs & 2 & 10 \\
Peak learning rate & $1\!\times\!10^{-3}$ & $2\!\times\!10^{-4}$ \\
LR schedule & cosine, 5\% warm-up & cosine, 5\% warm-up \\
Weight decay & $0.01$ & $0.01$ \\
Effective batch size & 64--96 & 16 \\
Per-device batch & 64 (L4/3090) & 8 (L4/3090), 1 (T4) \\
Gradient accumulation & 1 & 2 (L4/3090), 16 (T4) \\
Mixed precision & BF16 (Ampere+) / FP16 (T4) & BF16 / FP16 \\
LLM quantisation & --- (LLM not loaded) & 4-bit NF4, double-quant \\
LoRA $(r, \alpha, p)$ & --- & $(16,\,32,\,0.05)$ \\
LoRA modules & --- & $\{q, k, v, o\}_{\text{proj}}$ \\
Cutoff length & --- & 512 tokens \\
Optimiser & AdamW (paged 8-bit on T4) & AdamW (paged 8-bit on T4) \\
Gradient checkpointing & N/A (no LLM) & on \\
\bottomrule
\end{tabular}
\end{table}
We make no claim that the values in Table~\ref{tab:hparams} are optimal
--- almost all of them are inherited from RaDialog with minimal tuning.
The two we did vary were the Stage~1 batch size (a sweep over
$\{16, 32, 64, 96\}$ chose $64$ as the smallest batch within $0.3\%$ of
the best validation contrastive accuracy) and the Stage~2 learning rate
(a sweep over $\{1{\times}10^{-4}, 2{\times}10^{-4}, 5{\times}10^{-4}\}$
chose $2{\times}10^{-4}$ unambiguously; $5{\times}10^{-4}$ diverged
within $\sim\!300$ steps).
\section{Checkpointing, Resume, and Multi-platform Execution}
\label{sec:checkpointing}
Every save during either stage writes a HuggingFace-Trainer-format
\texttt{checkpoint-\textit{step}/} directory locally and, when
\texttt{hf\_hub.enabled} is true, mirrors it to a private Hugging Face
model repository under the path \texttt{\{run\_id\}/\{stage\}/last/};
the best checkpoint by \texttt{eval\_loss} is additionally copied to
\texttt{best/}. \texttt{save\_total\_limit=1} guarantees that at most
two checkpoints live per stage on disk (best + last), bounding disk use
without breaking resume. The run identifier is persisted to a tiny
\texttt{run\_id.txt} file so a re-launched job picks up the same run
folder; switching this off (the legacy behaviour) creates a fresh
\texttt{\{dataset\}\_run\_\{N+1\}} on every launch.
The motivation for this scaffolding is purely practical. The full
training run takes roughly $36$ wall-clock hours on a single
A10G or L4, considerably longer on T4, and considerably shorter on A100.
None of the platforms we have access to (Google Colab, Kaggle Notebooks,
Vast.ai pods, Lightning.ai studios) guarantee uninterrupted GPU time at
those durations: Colab disconnects, Kaggle preempts at 9\,h, Vast pods
sporadically drop. The HF Hub round-trip is the only persistence layer
that survives all of them. The unified resume controller
(\texttt{--mode resume}) handles the hard parts: it pulls the latest
\texttt{last/} of both stages from the Hub into the canonical local
layout via \texttt{hydrate\_run\_dir\_from\_hf}, then calls
\texttt{detect\_resume\_point} which inspects what is on disk and
decides whether to start Stage~1, resume Stage~1, start Stage~2, resume
Stage~2, or exit (everything is finished). The user does not need to
know which stage the previous machine died in; the controller works it
out from filesystem state.
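
The decision logic can be pictured with the following sketch. It is not the
project's \texttt{detect\_resume\_point}: the function name, the
\texttt{stage1}/\texttt{stage2} directory names, and the \texttt{DONE} marker
file are hypothetical, chosen only to show how five outcomes can be derived
from filesystem state. The \texttt{checkpoint-*} pattern follows the
Trainer-format directories mentioned above.
\begin{verbatim}
from pathlib import Path


def detect_resume_point_sketch(run_dir):
    """Return the next action from what is on disk (hypothetical layout)."""
    run_dir = Path(run_dir)

    def has_checkpoint(stage):
        stage_dir = run_dir / stage
        return stage_dir.is_dir() and any(stage_dir.glob("checkpoint-*"))

    def is_done(stage):
        # Hypothetical marker written when a stage finishes cleanly.
        return (run_dir / stage / "DONE").exists()

    if is_done("stage2"):
        return "finished"
    if has_checkpoint("stage2"):
        return "resume_stage2"
    if is_done("stage1"):
        return "start_stage2"
    if has_checkpoint("stage1"):
        return "resume_stage1"
    return "start_stage1"
\end{verbatim}
Checking the later stage first means that stale Stage~1 checkpoints can never
pull a run backwards once Stage~2 has started.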
Two operational details are worth recording. First, optimiser state and
4-bit compute dtype must be consistent across resume, which means the
GPU family should not change between launches; switching from an A100 to
a T4 mid-run is safe by checkpoint format (LoRA adapters and projection
weights are dtype-agnostic) but the AdamW moments will be reloaded with
slightly different numerical accumulators, so we recommend staying
within the same VRAM/precision bucket --- L4, A10G, RTX 3090, and
similar Ampere-or-newer 24\,GB cards are interchangeable. Second,
\texttt{PeftModel.from\_pretrained} defaults to inference mode and
returns adapters with \texttt{requires\_grad=False}; the resume code
path passes \texttt{is\_trainable=True} explicitly so the LoRA
parameters are again exposed to the optimiser. This was a non-obvious
bug in earlier iterations of the codebase and is now covered by a unit
test that asserts the trainable parameter count after resume matches
the count at save time.
\section{What Stays Off the Loss}
\label{sec:loss-exclusions}
A short note on what is \textit{not} optimised. The image encoder
weights are frozen, the CheXpert classifier weights are frozen, the
Vicuna base weights are frozen (and 4-bit quantised, so even if
gradients flowed they could not be applied without dequantisation), the
ITC head is discarded after Stage~1 and never sees a gradient in
Stage~2, and the system prompt and structured-findings strings are
masked out of the language modelling loss. The only learning happens at
the projection, the LoRA adapters, and (transiently, during Stage~1
only) the ITC head. This is by design: every additional trainable
module multiplies the optimiser-state footprint, complicates the
distributed-training story (which we explicitly do not implement; all
experiments are single-GPU), and --- in our ablations --- failed to
improve any of the headline metrics on the held-out test split.
The clinical evaluation in Chapter~\ref{ch:experiments} reports
ROUGE-1/2/L, BLEU-1/4, BERTScore-F1, and the CheXbert-derived clinical
F1 over the 14-pathology vocabulary on the findings and impression
tasks, and exact-match accuracy, token F1, and BLEU-1 on VQA. All
metrics are computed on the patient-disjoint test split established in
Phase~1 of the data pipeline; the model selection criterion during
training is the standard \texttt{eval\_loss} on the corresponding
validation split, never any of the downstream metrics --- those are
reserved for the final report.