| |
| |
| |
| |
| |
| |
| |
|
|
| \chapter{Model and Training} |
| \label{ch:model-training} |
|
|
| \section{Architectural Overview} |
| \label{sec:arch-overview} |
|
|
| The model is a single vision--language network that solves three downstream |
| tasks --- findings generation, impression generation, and visual question |
| answering --- through one shared backbone. The design follows the |
| RaDialog~\cite{radialog} recipe with two deliberate modifications inspired by |
| META-CXR's U-MultiClass~\cite{metacxr} and BLIP-2's image--text contrastive |
| alignment~\cite{blip2}: \textit{(i)} a frozen 14-pathology CheXpert classifier |
| whose probabilistic predictions are serialised into the prompt as a |
| \textsc{Positive\,/\,Negative\,/\,Uncertain} (PNU) string, and \textit{(ii)} an |
| optional contrastive Stage~1 that pre-aligns the projection in a joint |
| image--text embedding space without ever loading the language model. The full |
| forward path can be summarised as |
|
|
| \begin{equation} |
| \underbrace{\mathbf{x}}_{518\times518} |
| \;\xrightarrow{\;\text{RAD-DINO}\;}\; |
| \mathbf{P}\in\mathbb{R}^{B\times N_p\times 768} |
| \;\xrightarrow{\;\text{MLP-Proj}\;}\; |
| \mathbf{V}\in\mathbb{R}^{B\times 32\times 4096} |
| \;\xrightarrow{\;\text{Vicuna-7B}+\text{LoRA}\;}\; |
| \hat{\mathbf{y}} |
| \label{eq:fwd} |
| \end{equation} |
|
|
| where the projection step also routes a 1024-dimensional intermediate |
| representation to either an ITC head (Stage~1, contrastive) or directly into |
| the LLM (Stage~2, autoregressive). Only the MLP projection, the LoRA adapters |
| on Vicuna, and (when enabled) the ITC head are trained --- the image encoder, |
| the CheXpert classifier, and the Vicuna base weights are kept frozen |
| throughout. Stage~1 (ITC) trains roughly $3.7$\,M parameters; Stage~2 adds the |
| LoRA adapters (rank $r{=}16$ on $q$, $k$, $v$, $o$ projections of all 32 |
| transformer blocks) for a total of $\approx\!21.5$\,M trainable parameters |
| out of $\approx\!7.1$\,B --- about $0.30\%$. |
|
|
| \section{Image Encoder} |
| \label{sec:encoder} |
|
|
| The visual backbone is Microsoft's \textbf{RAD-DINO}~\cite{raddino}, a |
| ViT-B/14 trained with DINOv2 self-supervision on $\approx\!840$\,k chest X-rays |
| spanning MIMIC-CXR, CheXpert, NIH ChestX-ray14, and PadChest. The |
| \texttt{microsoft/rad-dino} checkpoint is loaded through the HuggingFace |
| \texttt{transformers} hub and accepts inputs at its native resolution of |
| $518\times518$, which matches the longest-edge target used in our resize step |
| (Phase~3 of the data pipeline). At this resolution the encoder emits |
| $1369$ patch tokens of dimension $768$ per image; the class token is |
| discarded so that the downstream projection sees a clean |
| $\mathbb{R}^{1369\times768}$ patch grid. We deliberately chose RAD-DINO over |
| the original RaDialog backbone BioViL-T~\cite{biovilt} for three reasons: it |
| is published as a standard HuggingFace model (no dependency on the legacy |
| \texttt{hi-ml-multimodal} library that pins Python\,${<}\,3.11$), it is |
| trained on roughly an order of magnitude more chest-X-ray data, and its |
| patch-token grid is dense enough that the cross-attention pool in our |
| projection has enough spatial bandwidth to reason about both global pathology |
| and small focal abnormalities (nodules, opacities) without resorting to |
| multi-resolution tricks. |
|
|
| The encoder is kept entirely frozen --- a choice motivated less by |
| representation quality than by training stability. With 4-bit quantised |
| weights in the LLM, even modest gradient flow through a 86\,M-parameter ViT |
| overruns the activation budget of consumer GPUs (16--24\,GB), and ablations in |
| RaDialog and LLaVA-Med~\cite{llavamed} both indicate that unfreezing the |
| image encoder before the projection has converged tends to collapse the |
| joint representation. We compensate for the lack of encoder fine-tuning with |
| the two-stage training schedule described in |
| \S\ref{sec:training-schedule}, in which the projection alone first specialises |
| the frozen patch features to the chest-X-ray report distribution. |
|
|
| To eliminate JPEG decode and patch embedding from the training-step inner |
| loop, the codebase supports an \textit{offline feature cache}: each |
| $1369{\times}768$ patch tensor is written once to disk and the dataset |
| returns it directly, bypassing the image encoder entirely (see |
| \texttt{feature\_cache\_dir} in the training configuration). This is a |
| loss-free transformation because the encoder is frozen, and it lifts dataset |
| throughput by roughly $3{-}5\times$ on single-GPU L4/3090-class machines |
| where the encoder forward dominates step time. The MIMIC-CXR Stage~2 run |
| reported in Chapter~\ref{ch:experiments} was executed with the cache active. |
|
|
| \section{MLP Projection} |
| \label{sec:projection} |
|
|
| The role of the projection is twofold: pool a variable-length patch sequence |
| into a fixed visual-token budget that the LLM can attend to, and bridge the |
| embedding spaces of the frozen vision encoder ($d_v{=}768$) and the frozen |
| language model ($d_l{=}4096$). RaDialog~v2 already established that a |
| lightweight MLP with a perceiver-style learnable query block is competitive |
| with the heavier Q-Former used in BLIP-2 and XrayGPT, at a small fraction of |
| the parameter count and with no auxiliary objective required at the |
| projection level. We adopt that design with a small modification --- exposing |
| the intermediate 1024-dimensional activation so that the optional ITC head of |
| Stage~1 has a richer tap point than the final 4096-d LLM-space embedding. |
|
|
| The forward pass is |
|
|
| \begin{align} |
| \mathbf{Q}_i &= \mathbf{Q}_0 \in \mathbb{R}^{32\times 768}, \qquad i=1,\dots,B \\ |
| \mathbf{H}^{(0)} &= \operatorname{CrossAttn}(\mathbf{Q}_i, \mathbf{P}_i, \mathbf{P}_i) \in \mathbb{R}^{32\times 768} \\ |
| \mathbf{H}^{(1)} &= \operatorname{Dropout}\big(\operatorname{GELU}(\mathbf{W}_1\mathbf{H}^{(0)})\big) \in \mathbb{R}^{32\times 1024} \label{eq:tap} \\ |
| \mathbf{V}_i &= \mathbf{W}_2 \mathbf{H}^{(1)} \in \mathbb{R}^{32\times 4096} |
| \end{align} |
|
|
| with $\mathbf{Q}_0$ a learnable parameter, $\operatorname{CrossAttn}$ an |
| 8-head multi-head attention block at $d{=}768$, and $\mathbf{W}_1$, |
| $\mathbf{W}_2$ two linear layers (no bias dropout). The number of visual |
| tokens (32) matches RaDialog and sits inside the empirical sweet spot |
| reported by both LLaVA and BLIP-2: fewer than 16 tokens loses spatial detail |
| on small pathologies, more than 64 only inflates the LLM's sequence length |
| without measurable gains on radiology metrics. The query matrix is |
| initialised from $\mathcal{N}(0, 0.02^2)$ and the two MLP linears use the |
| same small-normal initialisation as Vicuna's own \texttt{Linear} layers; |
| empirically, larger initial scales destabilise training during the first |
| $\sim\!500$ optimiser steps when the LLM is still attending to noisy |
| visual tokens. |
|
|
| The 1024-d tap point in Equation~\ref{eq:tap} is not a casual exposure --- |
| it is the explicit grounding signal of the ITC head described in |
| \S\ref{sec:training-schedule}, and it lives between the GELU and the |
| final linear so that the contrastive objective sees an already-nonlinear |
| representation rather than the raw cross-attention pool. Crucially, the |
| projection is implemented as a single \texttt{nn.Sequential} with two |
| linear layers indexed as \texttt{mlp[0]} and \texttt{mlp[3]} so that |
| checkpoints saved before the tap-point modification load unchanged --- |
| backward compatibility with earlier runs is preserved at the parameter |
| naming level. |
|
|
| \paragraph{Numerical-precision detail.} The frozen encoder runs in the |
| LLM's compute dtype (BF16 on Ampere+ or FP16 on Turing) while the |
| projection's parameters are kept in FP32. PyTorch's autocast wrapper does |
| not consistently cover the in-projection of \texttt{nn.MultiheadAttention} |
| on the cross-attention path under BF16, which surfaces as a |
| \texttt{mat1 and mat2 must have the same dtype} runtime error on Ampere+ |
| GPUs. The projection therefore explicitly upcasts its input to the |
| parameter dtype before the cross-attention call. The conversion is a |
| no-op on T4-class GPUs (where everything is already FP16) and a single |
| copy on Ampere+ --- not a bottleneck. |
|
|
| \section{CheXpert Abnormality Classifier} |
| \label{sec:chex} |
|
|
| Following META-CXR's U-MultiClass formulation, abnormality information |
| enters the prompt as a 14-pathology, 3-class string rather than as a 14-d |
| logit vector. The classifier itself is a small MLP head sitting on the |
| global \texttt{[CLS]} embedding of RAD-DINO; it produces, per study, a |
| $14\times3$ logit tensor mapped to one of $\{\text{Positive}, |
| \text{Negative}, \text{Uncertain}\}$ per CheXpert label. The string format |
| fed to the LLM is |
|
|
| \begin{verbatim} |
| Positive Abnormalities: Cardiomegaly, Pleural Effusion |
| Negative Abnormalities: No Finding, Edema, Pneumothorax, ... |
| Uncertain Abnormalities: Atelectasis |
| \end{verbatim} |
|
|
| Three properties of this design matter. First, the U-MultiClass |
| formulation preserves the distinction between \textit{negative} (the |
| report explicitly rules the pathology out) and \textit{uncertain} (the |
| report hedges with \texttt{may represent}, \texttt{cannot exclude}, etc.). |
| This nuance is destroyed by binary CheXpert mappings and matters |
| clinically: a confident negative carries information that an uncertain |
| case does not. Second, expressing the labels as text rather than as |
| auxiliary embeddings means no architectural change is required when |
| labels are missing --- the field simply becomes the empty string and the |
| prompt degrades gracefully, which is exactly what happens at inference on |
| out-of-distribution images. Third, in our prompt the PNU string is |
| inserted between the visual tokens and the natural-language instruction, |
| so the LLM's self-attention can route information from text to image |
| freely; both the CheXbert-derived oracle (during training) and the |
| classifier's own predictions (during evaluation) occupy the same prompt |
| slot. |
|
|
| In this work the classifier is trained separately as a Stage~0 step --- |
| the same RAD-DINO features are frozen, only the MLP head is fitted on the |
| 14 binary cross-entropies derived from the CheXpert CSV --- and is then |
| frozen for Stages~1 and 2. During training of the VLM the GT labels from |
| the CSV are used directly to populate the PNU string (an oracle setting, |
| analogous to the teacher-forced ground-truth findings used in |
| \texttt{split\_cascade}); during evaluation the classifier is invoked on |
| the image to produce its own PNU prediction. |
|
|
| \section{Language Model and Parameter-Efficient Adaptation} |
| \label{sec:llm} |
|
|
| The decoder is \textbf{Vicuna-7B v1.3}~\cite{vicuna}, a LLaMA-1 |
| derivative that was instruction-tuned on ShareGPT conversations. We chose |
| the v1.3 series rather than the larger v1.5 (LLaMA-2 base) or modern Llama-3 |
| checkpoints for two specific reasons: it is the exact LM used by the |
| RaDialog baseline so our findings can be compared directly, and its |
| chat template has a single, clean \texttt{USER: ... ASSISTANT: ...} |
| structure that simplifies the label-masking logic at training time. |
| The base model is loaded in 4-bit NF4 quantisation |
| (\texttt{BitsAndBytesConfig}, double quantisation enabled, |
| compute dtype matched to the GPU --- BF16 on Ampere+, FP16 on Turing), which |
| brings the resident weight footprint from $\approx\!14$\,GB FP16 down to |
| $\approx\!4.0$\,GB and frees enough headroom on a 16\,GB T4 to host the |
| RAD-DINO encoder, the projection, the activations, the optimiser state, and |
| a small batch concurrently. |
|
|
| Adaptation is performed with LoRA~\cite{lora}: rank-16 low-rank |
| adapters are inserted on the four attention projections |
| (\texttt{q\_proj}, \texttt{k\_proj}, \texttt{v\_proj}, \texttt{o\_proj}) of |
| every transformer block. The feed-forward sublayers are deliberately left |
| untouched. Larger LoRA placements (e.g.\ adding \texttt{gate\_proj} or the |
| MLP projections) measurably improve perplexity but also enlarge the |
| trainable-parameter count by $3{-}4\times$ and our experiments saw no |
| corresponding lift in CheXbert F1 on the held-out test split. With |
| \texttt{lora\_alpha}${=}32$ and \texttt{lora\_dropout}${=}0.05$ the effective |
| LoRA scaling is $\alpha/r=2$, which we found to be the sweet spot in a brief |
| grid search over $\{1, 2, 4\}$. |
|
|
| Two further LLM-side details are worth recording. First, attention |
| implementation is auto-detected: FlashAttention-2 is used on Ampere/Ada |
| GPUs (a $2{-}3\times$ throughput win), with a graceful fall-back to |
| PyTorch SDPA on Turing where FA2 is unavailable. Second, gradient |
| checkpointing is enabled by default on the LLM. On a 24\,GB-or-larger GPU |
| with 4-bit + LoRA + FA2 the activations fit without checkpointing and |
| the user can trade $\approx\!25\%$ step time back for it; on the L4/T4 |
| profiles used throughout this thesis we leave it on. |
|
|
| \section{Prompt Assembly and Sequence Layout} |
| \label{sec:prompt} |
|
|
| All three tasks share the same prompt skeleton, which follows Vicuna's |
| v1.1 chat template: |
|
|
| \begin{verbatim} |
| {SYSTEM_PROMPT} USER: <image> |
| {PNU structured findings} |
| {task-specific context block} |
| {instruction} ASSISTANT: {target} |
| \end{verbatim} |
|
|
| The \texttt{<image>} placeholder is added to the tokenizer as a special |
| token (vocabulary id 32000, the first free slot above Vicuna's 32\,000 |
| base vocabulary). At forward time the model locates this single token, |
| replaces its embedding with the 32 visual tokens from the projection, and |
| expands the attention mask and label tensors by $+31$ positions per sample |
| so the downstream causal-attention mask is consistent. Visual-token |
| positions in the label tensor are filled with $-100$ so they are excluded |
| from the cross-entropy loss. |
|
|
| The task-specific context block differs per task, and is the principal |
| mechanism by which \texttt{split\_cascade} differs from a plain |
| \texttt{split} schedule: |
|
|
| \begin{itemize} |
| \item For \textbf{findings} samples it is empty; the model is asked to |
| produce the findings paragraph from the image and the PNU labels |
| alone. |
| \item For \textbf{impression} samples it is the literal string |
| \texttt{Findings: <GT findings>}; the model conditions on the |
| ground-truth findings paragraph and is asked to summarise it. |
| Studies whose report lacks a findings section emit no impression |
| sample, by construction of the Phase~1 filter chain. |
| \item For \textbf{VQA} samples it is empty; the natural-language |
| question itself is the instruction. |
| \end{itemize} |
|
|
| This \texttt{split\_cascade} layout makes the impression task much closer |
| to a controlled summarisation problem than to free-form generation, which |
| matches both clinical practice (the impression is written \textit{after} |
| the findings) and the strong empirical bias of LLMs towards |
| extract-and-paraphrase behaviour when explicit context is provided. The |
| trade-off is that the gain measured here is an upper bound on what a true |
| end-to-end cascade (impression conditioned on the model's |
| \textit{own} generated findings) would achieve; we report numbers under |
| the teacher-forced regime and discuss the gap in |
| Chapter~\ref{ch:experiments}. Each of findings, impression, and the |
| report variant has \textbf{ten} hand-written instruction paraphrases |
| sampled uniformly at training time; at evaluation the first variant is |
| used deterministically so metrics are reproducible. |
|
|
| A complete training sample is tokenised with \texttt{cutoff\_len}${=}512$ |
| on the right side. Right-truncation is chosen rather than left-truncation |
| because the assistant response sits at the right end of the sequence; if |
| the prompt itself runs long, the truncation eats into the target and the |
| loss receives proportionally less signal --- whereas left-truncating |
| would destroy the system prompt and the PNU block, both of which carry |
| non-redundant information. The label tensor is masked with $-100$ on |
| every prompt token, every padding token, and every visual token, so loss |
| is computed strictly on the ASSISTANT response. |
|
|
| \section{Training Schedule} |
| \label{sec:training-schedule} |
|
|
| Training follows a two-stage curriculum closely modelled on RaDialog, |
| with Stage~1 reformulated as an explicit image--text contrastive |
| alignment in the spirit of BLIP-2. The motivation for the split is the |
| classic representation-then-instruction division: it is wasteful to drive |
| the LLM's LoRA adapters with gradients while the projection still emits |
| ill-conditioned visual tokens, and conversely the projection cannot be |
| trained against the LM loss alone without paying for the full Vicuna |
| forward at every step. |
|
|
| \paragraph{Stage 1 --- contrastive alignment of the projection.} The |
| goal of Stage~1 is to specialise the projection (and only the projection) |
| so that the visual tokens it produces are linearly aligned with the |
| text representation of the corresponding radiology report, before any |
| language modelling takes place. We instantiate this via the ITC head |
| described in \S\ref{sec:projection}: the 32 intermediate |
| 1024-d tokens are mean-pooled, projected to 128-d, and L2-normalised. |
| On the text side, the canonical reference sentence per study --- the |
| findings paragraph, falling back to the impression when findings is |
| absent --- is encoded \textit{once, offline} with |
| \texttt{microsoft/BiomedVLP-CXR-BERT-specialized}~\cite{cxrbert} |
| through its \texttt{get\_projected\_text\_embeddings} interface, also |
| producing a 128-d L2-normalised vector. These per-study text embeddings |
| are written to disk as a \texttt{\{study\_id: tensor[128]\}} cache (see |
| \texttt{scripts/precompute\_cxrbert\_embeddings.ipynb}); the cache is |
| itself published to the Hugging Face data repository so any training |
| host can pull it in seconds rather than re-running CXR-BERT. |
|
|
| With the text cache available, Stage~1 runs a symmetric InfoNCE loss |
| \begin{equation} |
| \mathcal{L}_{\text{ITC}} |
| = -\tfrac{1}{2}\left[ |
| \sum_{i=1}^{B}\log\!\frac{\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_i/\tau)} |
| {\sum_{j=1}^{B}\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_j/\tau)} |
| + |
| \sum_{i=1}^{B}\log\!\frac{\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_i/\tau)} |
| {\sum_{j=1}^{B}\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_j/\tau)} |
| \right] |
| \label{eq:itc} |
| \end{equation} |
| with $\mathbf{v}_i$ the image embedding from the projection+ITC head, |
| $\mathbf{t}_i$ the cached text embedding for the same study, and the |
| temperature $\tau{=}0.07$ following CLIP and CXR-BERT's original setup. |
| The dataset is de-duplicated to one image per \texttt{study\_id} before |
| batching (because the text embedding is study-level), and ITC training |
| loads the model with \texttt{load\_llm=False} --- Vicuna is simply not |
| instantiated. This single flag is the entire reason Stage~1 is feasible |
| on a 24\,GB GPU at the contrastive batch sizes the loss demands: |
| freeing the $\approx\!13$\,GB of resident Vicuna weights and their |
| associated activations lifts the per-device batch from 8 (the Stage~2 |
| budget) to 64--96 without changing any other configuration. |
| A larger batch directly enlarges the InfoNCE negative pool, and we |
| observed monotone improvement in the validation contrastive accuracy as |
| the batch grew from 8 to 64; saturation began around 96 on our dataset |
| scale, in line with the BLIP-2 ablations. |
|
|
| Stage~1 is run for 2 epochs at peak learning rate $1\!\times\!10^{-3}$ |
| with a 5\% cosine warm-up; the projection's MLP and the ITC head are |
| trainable (the ITC head is tiny --- $\sim\!130$\,k parameters --- but |
| its gradients carry the entire contrastive signal). Encoder and |
| classifier are frozen and the LLM is absent. The Stage~1 checkpoint |
| saved at the end of training is the \textit{projection-only} state dict; |
| the ITC head is discarded and never reused, since it has no role at |
| generation time. |
|
|
| \paragraph{Stage 2 --- instruction tuning with QLoRA.} Stage~2 rebuilds |
| the full model with \texttt{load\_llm=True}, loads the Stage~1 |
| projection weights, and switches to the autoregressive cross-entropy |
| objective. The dataset returns mixed batches according to the configured |
| task weights (findings $30\%$, impression $20\%$, VQA $50\%$ for |
| MIMIC-CXR; the weights renormalise automatically when VQA is absent for |
| IU-Xray); the report mode is \texttt{split\_cascade}, so the impression |
| samples carry the ground-truth findings as context, and the image mode |
| is \texttt{frontal\_only\_split} (one PA-or-AP frontal image per study). |
| The loss is the standard causal cross-entropy |
|
|
| \begin{equation} |
| \mathcal{L}_{\text{LM}} |
| = -\frac{1}{|\mathcal{T}|} |
| \sum_{(t,y)\in\mathcal{T}} \log p_{\theta}\!\left(y_t \mid y_{<t}, \mathbf{V}, \mathbf{c}\right) |
| \label{eq:lm} |
| \end{equation} |
| where $\mathcal{T}$ is the set of positions where the label tensor is |
| non-$-100$ (assistant response only), $\mathbf{V}$ the 32 visual tokens, |
| and $\mathbf{c}$ the textual prompt context (system, PNU, optional GT |
| findings, instruction). |
|
|
| Trainable parameters in Stage~2 are the projection's MLP and the LoRA |
| adapters on $q$, $k$, $v$, $o$ of all 32 Vicuna blocks (rank 16, |
| $\alpha{=}32$, dropout $0.05$). The image encoder and the CheXpert |
| classifier remain frozen. Training runs for 10 epochs at peak learning |
| rate $2\!\times\!10^{-4}$ with the same 5\% cosine warm-up; effective |
| batch size is 16 on every profile (per-device batch $\times$ gradient |
| accumulation steps adapt to GPU memory: $1{\times}16$ on T4, |
| $8{\times}2$ on L4/3090, $16{\times}1$ on A100). Optimiser is |
| \texttt{adamw\_torch} on FP32 master weights for the projection and |
| LoRA, with the 4-bit base Vicuna acting as a quantised constant. |
|
|
| \paragraph{Loss masking and image-token accounting.} A subtlety that |
| deserves explicit mention is the bookkeeping around the |
| \texttt{<image>} placeholder. The tokenised prompt contains exactly |
| \textit{one} \texttt{<image>} token, and the model replaces its embedding |
| with the 32 projection-emitted visual tokens at forward time. To keep |
| the attention mask, position ids, and label tensor consistent, the |
| forward pass expands all three by 31 entries at the placeholder |
| position: the attention-mask entry for each visual token is set to 1, |
| the position ids are made contiguous, and the labels in the visual span |
| are filled with $-100$. The same expansion is applied symmetrically at |
| inference. This expansion is the single most error-prone piece of the |
| pipeline (because off-by-one mistakes silently shift the labels by |
| $\pm 31$ positions and produce a degenerate loss curve that descends |
| fast but converges to a useless equilibrium), so an integration test |
| checks that label masks before and after expansion sum to the same |
| count of non-$-100$ entries for every batch produced by the collator. |
|
|
| \paragraph{Stage~0 --- the CheXpert classifier head.} Although not a |
| stage of VLM training proper, the PNU classifier described in |
| \S\ref{sec:chex} is fitted before Stages~1 and 2 begin. It is a small |
| MLP on top of the frozen RAD-DINO \texttt{[CLS]} embedding, optimised |
| under 14 independent binary cross-entropies on the |
| \texttt{chex\_*} columns of the manifest, with U-MultiClass-style |
| \{positive, negative, uncertain\} targets. Training takes minutes on a |
| single GPU once the patch features are cached and is otherwise |
| unremarkable; the resulting checkpoint is loaded read-only by both |
| subsequent VLM stages. |
|
|
| \section{Hyperparameters and Optimiser} |
| \label{sec:hparams} |
|
|
| Table~\ref{tab:hparams} consolidates the hyperparameters of both stages. |
| Values are taken from \texttt{configs/train\_config.yaml}; any per-stage |
| override under \texttt{stage1:}/\texttt{stage2:} takes precedence over |
| the global \texttt{training:} block. Optimiser choice is the single |
| hyperparameter that we vary by GPU: A10/L4/3090 hosts use the standard |
| \texttt{adamw\_torch} (FP32 moments), while T4 hosts switch to |
| \texttt{paged\_adamw\_8bit} from \texttt{bitsandbytes} to halve the |
| optimiser-state footprint --- the auto-detect cell in |
| \texttt{scripts/cxrvlm\_colab\_train.ipynb} sets this and a matching |
| batch/accum pair so the effective batch is invariant across profiles. |
|
|
| \begin{table}[h] |
| \centering |
| \small |
| \caption{Training hyperparameters for the two stages. |
| ``ITC mode'' refers to the contrastive Stage~1 used in this work; |
| the legacy RaDialog-style Stage~1 (causal LM through Vicuna) is also |
| implemented in the codebase but is not used in the reported runs.} |
| \label{tab:hparams} |
| \begin{tabular}{lll} |
| \toprule |
| \textbf{Hyperparameter} & \textbf{Stage 1 (ITC)} & \textbf{Stage 2} \\ |
| \midrule |
| Trainable modules & Projection + ITC head & Projection + LoRA \\ |
| Frozen modules & Encoder, classifier, (no LLM) & Encoder, classifier, base LLM \\ |
| Loss & Symmetric InfoNCE (Eq.~\ref{eq:itc}) & Causal CE (Eq.~\ref{eq:lm}) \\ |
| Temperature $\tau$ & $0.07$ & --- \\ |
| Epochs & 2 & 10 \\ |
| Peak learning rate & $1\!\times\!10^{-3}$ & $2\!\times\!10^{-4}$ \\ |
| LR schedule & cosine, 5\% warm-up & cosine, 5\% warm-up \\ |
| Weight decay & $0.01$ & $0.01$ \\ |
| Effective batch size & 64--96 & 16 \\ |
| Per-device batch & 64 (L4/3090) & 8 (L4/3090), 1 (T4) \\ |
| Gradient accumulation & 1 & 2 (L4/3090), 16 (T4) \\ |
| Mixed precision & BF16 (Ampere+) / FP16 (T4) & BF16 / FP16 \\ |
| LLM quantisation & --- (LLM not loaded) & 4-bit NF4, double-quant \\ |
| LoRA $(r, \alpha, p)$ & --- & $(16,\,32,\,0.05)$ \\ |
| LoRA modules & --- & $\{q, k, v, o\}_{\text{proj}}$ \\ |
| Cutoff length & --- & 512 tokens \\ |
| Optimiser & AdamW (paged 8-bit on T4) & AdamW (paged 8-bit on T4) \\ |
| Gradient checkpointing & N/A (no LLM) & on \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| We make no claim that the values in Table~\ref{tab:hparams} are optimal |
| --- almost all of them are inherited from RaDialog with minimal tuning. |
| The two we did vary were the Stage~1 batch size (a sweep over |
| $\{16, 32, 64, 96\}$ chose $64$ as the smallest batch within $0.3\%$ of |
| the best validation contrastive accuracy) and the Stage~2 learning rate |
| (a sweep over $\{1{\times}10^{-4}, 2{\times}10^{-4}, 5{\times}10^{-4}\}$ |
| chose $2{\times}10^{-4}$ unambiguously; $5{\times}10^{-4}$ diverged |
| within $\sim\!300$ steps). |
|
|
| \section{Checkpointing, Resume, and Multi-platform Execution} |
| \label{sec:checkpointing} |
|
|
| Every save during either stage writes a HuggingFace-Trainer-format |
| \texttt{checkpoint-\textit{step}/} directory locally and, when |
| \texttt{hf\_hub.enabled} is true, mirrors it to a private Hugging Face |
| model repository under the path \texttt{\{run\_id\}/\{stage\}/last/}; |
| the best checkpoint by \texttt{eval\_loss} is additionally copied to |
| \texttt{best/}. \texttt{save\_total\_limit=1} guarantees that at most |
| two checkpoints live per stage on disk (best + last), bounding disk use |
| without breaking resume. The run identifier is persisted to a tiny |
| \texttt{run\_id.txt} file so a re-launched job picks up the same run |
| folder; switching this off (the legacy behaviour) creates a fresh |
| \texttt{\{dataset\}\_run\_\{N+1\}} on every launch. |
|
|
| The motivation for this scaffolding is purely practical. The full |
| training run is on the order of $\sim\!36$ wall-clock hours on a single |
| A10G or L4, considerably longer on T4, and considerably shorter on A100. |
| None of the platforms we have access to (Google Colab, Kaggle Notebooks, |
| Vast.ai pods, Lightning.ai studios) guarantee uninterrupted GPU time at |
| those durations: Colab disconnects, Kaggle preempts at 9\,h, Vast pods |
| sporadically drop. The HF Hub round-trip is the only persistence layer |
| that survives all of them. The unified resume controller |
| (\texttt{--mode resume}) handles the hard parts: it pulls the latest |
| \texttt{last/} of both stages from the Hub into the canonical local |
| layout via \texttt{hydrate\_run\_dir\_from\_hf}, then calls |
| \texttt{detect\_resume\_point} which inspects what is on disk and |
| decides whether to start Stage~1, resume Stage~1, start Stage~2, resume |
| Stage~2, or exit (everything is finished). The user does not need to |
| know which stage the previous machine died in; the controller works it |
| out from filesystem state. |
|
|
| Two operational details are worth recording. First, optimiser state and |
| 4-bit compute dtype must be consistent across resume, which means the |
| GPU family should not change between launches; switching from an A100 to |
| a T4 mid-run is safe by checkpoint format (LoRA adapters and projection |
| weights are dtype-agnostic) but the AdamW moments will be reloaded with |
| slightly different numerical accumulators, so we recommend staying |
| within the same VRAM/precision bucket --- L4, A10G, RTX 3090, and |
| similar Ampere-or-newer 24\,GB cards are interchangeable. Second, |
| \texttt{PeftModel.from\_pretrained} defaults to inference mode and |
| returns adapters with \texttt{requires\_grad=False}; the resume code |
| path passes \texttt{is\_trainable=True} explicitly so the LoRA |
| parameters are again exposed to the optimiser. This was a non-obvious |
| bug in earlier iterations of the codebase and is now covered by a unit |
| test that asserts the trainable parameter count after resume matches |
| the count at save time. |
|
|
| \section{What Stays Off the Loss} |
| \label{sec:loss-exclusions} |
|
|
| A short note on what is \textit{not} optimised. The image encoder |
| weights are frozen, the CheXpert classifier weights are frozen, the |
| Vicuna base weights are frozen (and 4-bit quantised, so even if |
| gradients flowed they could not be applied without dequantisation), the |
| ITC head is discarded after Stage~1 and never sees a gradient in |
| Stage~2, and the system prompt and structured-findings strings are |
| masked out of the language modelling loss. The only learning happens at |
| the projection, the LoRA adapters, and (transiently, during Stage~1 |
| only) the ITC head. This is by design: every additional trainable |
| module multiplies the optimiser-state footprint, complicates the |
| distributed-training story (which we explicitly do not implement; all |
| experiments are single-GPU), and --- in our ablations --- failed to |
| improve any of the headline metrics on the held-out test split. |
|
|
| The clinical evaluation in Chapter~\ref{ch:experiments} reports |
| ROUGE-1/2/L, BLEU-1/4, BERTScore-F1, and the CheXbert-derived clinical |
| F1 over the 14 pathology vocabulary on the findings and impression |
| tasks, and exact-match accuracy plus token F1 plus BLEU-1 on VQA. All |
| metrics are computed on the patient-disjoint test split established in |
| Phase~1 of the data pipeline; the model selection criterion during |
| training is the standard \texttt{eval\_loss} on the corresponding |
| validation split, never any of the downstream metrics --- those are |
| reserved for the final report. |
|
|