% ============================================================================= % Chapter: Model and Training % Companion to docs/data_preparation.md % Configuration assumed throughout: % data.report_mode = split_cascade % data.image_mode = frontal_only_split % ============================================================================= \chapter{Model and Training} \label{ch:model-training} \section{Architectural Overview} \label{sec:arch-overview} The model is a single vision--language network that solves three downstream tasks --- findings generation, impression generation, and visual question answering --- through one shared backbone. The design follows the RaDialog~\cite{radialog} recipe with two deliberate modifications inspired by META-CXR's U-MultiClass~\cite{metacxr} and BLIP-2's image--text contrastive alignment~\cite{blip2}: \textit{(i)} a frozen 14-pathology CheXpert classifier whose probabilistic predictions are serialised into the prompt as a \textsc{Positive\,/\,Negative\,/\,Uncertain} (PNU) string, and \textit{(ii)} an optional contrastive Stage~1 that pre-aligns the projection in a joint image--text embedding space without ever loading the language model. The full forward path can be summarised as \begin{equation} \underbrace{\mathbf{x}}_{518\times518} \;\xrightarrow{\;\text{RAD-DINO}\;}\; \mathbf{P}\in\mathbb{R}^{B\times N_p\times 768} \;\xrightarrow{\;\text{MLP-Proj}\;}\; \mathbf{V}\in\mathbb{R}^{B\times 32\times 4096} \;\xrightarrow{\;\text{Vicuna-7B}+\text{LoRA}\;}\; \hat{\mathbf{y}} \label{eq:fwd} \end{equation} where the projection step also routes a 1024-dimensional intermediate representation to either an ITC head (Stage~1, contrastive) or directly into the LLM (Stage~2, autoregressive). Only the MLP projection, the LoRA adapters on Vicuna, and (when enabled) the ITC head are trained --- the image encoder, the CheXpert classifier, and the Vicuna base weights are kept frozen throughout. Stage~1 (ITC) trains roughly $3.7$\,M parameters; Stage~2 adds the LoRA adapters (rank $r{=}16$ on $q$, $k$, $v$, $o$ projections of all 32 transformer blocks) for a total of $\approx\!21.5$\,M trainable parameters out of $\approx\!7.1$\,B --- about $0.30\%$. \section{Image Encoder} \label{sec:encoder} The visual backbone is Microsoft's \textbf{RAD-DINO}~\cite{raddino}, a ViT-B/14 trained with DINOv2 self-supervision on $\approx\!840$\,k chest X-rays spanning MIMIC-CXR, CheXpert, NIH ChestX-ray14, and PadChest. The \texttt{microsoft/rad-dino} checkpoint is loaded through the HuggingFace \texttt{transformers} hub and accepts inputs at its native resolution of $518\times518$, which matches the longest-edge target used in our resize step (Phase~3 of the data pipeline). At this resolution the encoder emits $1369$ patch tokens of dimension $768$ per image; the class token is discarded so that the downstream projection sees a clean $\mathbb{R}^{1369\times768}$ patch grid. We deliberately chose RAD-DINO over the original RaDialog backbone BioViL-T~\cite{biovilt} for three reasons: it is published as a standard HuggingFace model (no dependency on the legacy \texttt{hi-ml-multimodal} library that pins Python\,${<}\,3.11$), it is trained on roughly an order of magnitude more chest-X-ray data, and its patch-token grid is dense enough that the cross-attention pool in our projection has enough spatial bandwidth to reason about both global pathology and small focal abnormalities (nodules, opacities) without resorting to multi-resolution tricks. The encoder is kept entirely frozen --- a choice motivated less by representation quality than by training stability. With 4-bit quantised weights in the LLM, even modest gradient flow through a 86\,M-parameter ViT overruns the activation budget of consumer GPUs (16--24\,GB), and ablations in RaDialog and LLaVA-Med~\cite{llavamed} both indicate that unfreezing the image encoder before the projection has converged tends to collapse the joint representation. We compensate for the lack of encoder fine-tuning with the two-stage training schedule described in \S\ref{sec:training-schedule}, in which the projection alone first specialises the frozen patch features to the chest-X-ray report distribution. To eliminate JPEG decode and patch embedding from the training-step inner loop, the codebase supports an \textit{offline feature cache}: each $1369{\times}768$ patch tensor is written once to disk and the dataset returns it directly, bypassing the image encoder entirely (see \texttt{feature\_cache\_dir} in the training configuration). This is a loss-free transformation because the encoder is frozen, and it lifts dataset throughput by roughly $3{-}5\times$ on single-GPU L4/3090-class machines where the encoder forward dominates step time. The MIMIC-CXR Stage~2 run reported in Chapter~\ref{ch:experiments} was executed with the cache active. \section{MLP Projection} \label{sec:projection} The role of the projection is twofold: pool a variable-length patch sequence into a fixed visual-token budget that the LLM can attend to, and bridge the embedding spaces of the frozen vision encoder ($d_v{=}768$) and the frozen language model ($d_l{=}4096$). RaDialog~v2 already established that a lightweight MLP with a perceiver-style learnable query block is competitive with the heavier Q-Former used in BLIP-2 and XrayGPT, at a small fraction of the parameter count and with no auxiliary objective required at the projection level. We adopt that design with a small modification --- exposing the intermediate 1024-dimensional activation so that the optional ITC head of Stage~1 has a richer tap point than the final 4096-d LLM-space embedding. The forward pass is \begin{align} \mathbf{Q}_i &= \mathbf{Q}_0 \in \mathbb{R}^{32\times 768}, \qquad i=1,\dots,B \\ \mathbf{H}^{(0)} &= \operatorname{CrossAttn}(\mathbf{Q}_i, \mathbf{P}_i, \mathbf{P}_i) \in \mathbb{R}^{32\times 768} \\ \mathbf{H}^{(1)} &= \operatorname{Dropout}\big(\operatorname{GELU}(\mathbf{W}_1\mathbf{H}^{(0)})\big) \in \mathbb{R}^{32\times 1024} \label{eq:tap} \\ \mathbf{V}_i &= \mathbf{W}_2 \mathbf{H}^{(1)} \in \mathbb{R}^{32\times 4096} \end{align} with $\mathbf{Q}_0$ a learnable parameter, $\operatorname{CrossAttn}$ an 8-head multi-head attention block at $d{=}768$, and $\mathbf{W}_1$, $\mathbf{W}_2$ two linear layers (no bias dropout). The number of visual tokens (32) matches RaDialog and sits inside the empirical sweet spot reported by both LLaVA and BLIP-2: fewer than 16 tokens loses spatial detail on small pathologies, more than 64 only inflates the LLM's sequence length without measurable gains on radiology metrics. The query matrix is initialised from $\mathcal{N}(0, 0.02^2)$ and the two MLP linears use the same small-normal initialisation as Vicuna's own \texttt{Linear} layers; empirically, larger initial scales destabilise training during the first $\sim\!500$ optimiser steps when the LLM is still attending to noisy visual tokens. The 1024-d tap point in Equation~\ref{eq:tap} is not a casual exposure --- it is the explicit grounding signal of the ITC head described in \S\ref{sec:training-schedule}, and it lives between the GELU and the final linear so that the contrastive objective sees an already-nonlinear representation rather than the raw cross-attention pool. Crucially, the projection is implemented as a single \texttt{nn.Sequential} with two linear layers indexed as \texttt{mlp[0]} and \texttt{mlp[3]} so that checkpoints saved before the tap-point modification load unchanged --- backward compatibility with earlier runs is preserved at the parameter naming level. \paragraph{Numerical-precision detail.} The frozen encoder runs in the LLM's compute dtype (BF16 on Ampere+ or FP16 on Turing) while the projection's parameters are kept in FP32. PyTorch's autocast wrapper does not consistently cover the in-projection of \texttt{nn.MultiheadAttention} on the cross-attention path under BF16, which surfaces as a \texttt{mat1 and mat2 must have the same dtype} runtime error on Ampere+ GPUs. The projection therefore explicitly upcasts its input to the parameter dtype before the cross-attention call. The conversion is a no-op on T4-class GPUs (where everything is already FP16) and a single copy on Ampere+ --- not a bottleneck. \section{CheXpert Abnormality Classifier} \label{sec:chex} Following META-CXR's U-MultiClass formulation, abnormality information enters the prompt as a 14-pathology, 3-class string rather than as a 14-d logit vector. The classifier itself is a small MLP head sitting on the global \texttt{[CLS]} embedding of RAD-DINO; it produces, per study, a $14\times3$ logit tensor mapped to one of $\{\text{Positive}, \text{Negative}, \text{Uncertain}\}$ per CheXpert label. The string format fed to the LLM is \begin{verbatim} Positive Abnormalities: Cardiomegaly, Pleural Effusion Negative Abnormalities: No Finding, Edema, Pneumothorax, ... Uncertain Abnormalities: Atelectasis \end{verbatim} Three properties of this design matter. First, the U-MultiClass formulation preserves the distinction between \textit{negative} (the report explicitly rules the pathology out) and \textit{uncertain} (the report hedges with \texttt{may represent}, \texttt{cannot exclude}, etc.). This nuance is destroyed by binary CheXpert mappings and matters clinically: a confident negative carries information that an uncertain case does not. Second, expressing the labels as text rather than as auxiliary embeddings means no architectural change is required when labels are missing --- the field simply becomes the empty string and the prompt degrades gracefully, which is exactly what happens at inference on out-of-distribution images. Third, in our prompt the PNU string is inserted between the visual tokens and the natural-language instruction, so the LLM's self-attention can route information from text to image freely; both the CheXbert-derived oracle (during training) and the classifier's own predictions (during evaluation) occupy the same prompt slot. In this work the classifier is trained separately as a Stage~0 step --- the same RAD-DINO features are frozen, only the MLP head is fitted on the 14 binary cross-entropies derived from the CheXpert CSV --- and is then frozen for Stages~1 and 2. During training of the VLM the GT labels from the CSV are used directly to populate the PNU string (an oracle setting, analogous to the teacher-forced ground-truth findings used in \texttt{split\_cascade}); during evaluation the classifier is invoked on the image to produce its own PNU prediction. \section{Language Model and Parameter-Efficient Adaptation} \label{sec:llm} The decoder is \textbf{Vicuna-7B v1.3}~\cite{vicuna}, a LLaMA-1 derivative that was instruction-tuned on ShareGPT conversations. We chose the v1.3 series rather than the larger v1.5 (LLaMA-2 base) or modern Llama-3 checkpoints for two specific reasons: it is the exact LM used by the RaDialog baseline so our findings can be compared directly, and its chat template has a single, clean \texttt{USER: ... ASSISTANT: ...} structure that simplifies the label-masking logic at training time. The base model is loaded in 4-bit NF4 quantisation (\texttt{BitsAndBytesConfig}, double quantisation enabled, compute dtype matched to the GPU --- BF16 on Ampere+, FP16 on Turing), which brings the resident weight footprint from $\approx\!14$\,GB FP16 down to $\approx\!4.0$\,GB and frees enough headroom on a 16\,GB T4 to host the RAD-DINO encoder, the projection, the activations, the optimiser state, and a small batch concurrently. Adaptation is performed with LoRA~\cite{lora}: rank-16 low-rank adapters are inserted on the four attention projections (\texttt{q\_proj}, \texttt{k\_proj}, \texttt{v\_proj}, \texttt{o\_proj}) of every transformer block. The feed-forward sublayers are deliberately left untouched. Larger LoRA placements (e.g.\ adding \texttt{gate\_proj} or the MLP projections) measurably improve perplexity but also enlarge the trainable-parameter count by $3{-}4\times$ and our experiments saw no corresponding lift in CheXbert F1 on the held-out test split. With \texttt{lora\_alpha}${=}32$ and \texttt{lora\_dropout}${=}0.05$ the effective LoRA scaling is $\alpha/r=2$, which we found to be the sweet spot in a brief grid search over $\{1, 2, 4\}$. Two further LLM-side details are worth recording. First, attention implementation is auto-detected: FlashAttention-2 is used on Ampere/Ada GPUs (a $2{-}3\times$ throughput win), with a graceful fall-back to PyTorch SDPA on Turing where FA2 is unavailable. Second, gradient checkpointing is enabled by default on the LLM. On a 24\,GB-or-larger GPU with 4-bit + LoRA + FA2 the activations fit without checkpointing and the user can trade $\approx\!25\%$ step time back for it; on the L4/T4 profiles used throughout this thesis we leave it on. \section{Prompt Assembly and Sequence Layout} \label{sec:prompt} All three tasks share the same prompt skeleton, which follows Vicuna's v1.1 chat template: \begin{verbatim} {SYSTEM_PROMPT} USER: {PNU structured findings} {task-specific context block} {instruction} ASSISTANT: {target} \end{verbatim} The \texttt{} placeholder is added to the tokenizer as a special token (vocabulary id 32000, the first free slot above Vicuna's 32\,000 base vocabulary). At forward time the model locates this single token, replaces its embedding with the 32 visual tokens from the projection, and expands the attention mask and label tensors by $+31$ positions per sample so the downstream causal-attention mask is consistent. Visual-token positions in the label tensor are filled with $-100$ so they are excluded from the cross-entropy loss. The task-specific context block differs per task, and is the principal mechanism by which \texttt{split\_cascade} differs from a plain \texttt{split} schedule: \begin{itemize} \item For \textbf{findings} samples it is empty; the model is asked to produce the findings paragraph from the image and the PNU labels alone. \item For \textbf{impression} samples it is the literal string \texttt{Findings: }; the model conditions on the ground-truth findings paragraph and is asked to summarise it. Studies whose report lacks a findings section emit no impression sample, by construction of the Phase~1 filter chain. \item For \textbf{VQA} samples it is empty; the natural-language question itself is the instruction. \end{itemize} This \texttt{split\_cascade} layout makes the impression task much closer to a controlled summarisation problem than to free-form generation, which matches both clinical practice (the impression is written \textit{after} the findings) and the strong empirical bias of LLMs towards extract-and-paraphrase behaviour when explicit context is provided. The trade-off is that the gain measured here is an upper bound on what a true end-to-end cascade (impression conditioned on the model's \textit{own} generated findings) would achieve; we report numbers under the teacher-forced regime and discuss the gap in Chapter~\ref{ch:experiments}. Each of findings, impression, and the report variant has \textbf{ten} hand-written instruction paraphrases sampled uniformly at training time; at evaluation the first variant is used deterministically so metrics are reproducible. A complete training sample is tokenised with \texttt{cutoff\_len}${=}512$ on the right side. Right-truncation is chosen rather than left-truncation because the assistant response sits at the right end of the sequence; if the prompt itself runs long, the truncation eats into the target and the loss receives proportionally less signal --- whereas left-truncating would destroy the system prompt and the PNU block, both of which carry non-redundant information. The label tensor is masked with $-100$ on every prompt token, every padding token, and every visual token, so loss is computed strictly on the ASSISTANT response. \section{Training Schedule} \label{sec:training-schedule} Training follows a two-stage curriculum closely modelled on RaDialog, with Stage~1 reformulated as an explicit image--text contrastive alignment in the spirit of BLIP-2. The motivation for the split is the classic representation-then-instruction division: it is wasteful to drive the LLM's LoRA adapters with gradients while the projection still emits ill-conditioned visual tokens, and conversely the projection cannot be trained against the LM loss alone without paying for the full Vicuna forward at every step. \paragraph{Stage 1 --- contrastive alignment of the projection.} The goal of Stage~1 is to specialise the projection (and only the projection) so that the visual tokens it produces are linearly aligned with the text representation of the corresponding radiology report, before any language modelling takes place. We instantiate this via the ITC head described in \S\ref{sec:projection}: the 32 intermediate 1024-d tokens are mean-pooled, projected to 128-d, and L2-normalised. On the text side, the canonical reference sentence per study --- the findings paragraph, falling back to the impression when findings is absent --- is encoded \textit{once, offline} with \texttt{microsoft/BiomedVLP-CXR-BERT-specialized}~\cite{cxrbert} through its \texttt{get\_projected\_text\_embeddings} interface, also producing a 128-d L2-normalised vector. These per-study text embeddings are written to disk as a \texttt{\{study\_id: tensor[128]\}} cache (see \texttt{scripts/precompute\_cxrbert\_embeddings.ipynb}); the cache is itself published to the Hugging Face data repository so any training host can pull it in seconds rather than re-running CXR-BERT. With the text cache available, Stage~1 runs a symmetric InfoNCE loss \begin{equation} \mathcal{L}_{\text{ITC}} = -\tfrac{1}{2}\left[ \sum_{i=1}^{B}\log\!\frac{\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_i/\tau)} {\sum_{j=1}^{B}\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_j/\tau)} + \sum_{i=1}^{B}\log\!\frac{\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_i/\tau)} {\sum_{j=1}^{B}\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_j/\tau)} \right] \label{eq:itc} \end{equation} with $\mathbf{v}_i$ the image embedding from the projection+ITC head, $\mathbf{t}_i$ the cached text embedding for the same study, and the temperature $\tau{=}0.07$ following CLIP and CXR-BERT's original setup. The dataset is de-duplicated to one image per \texttt{study\_id} before batching (because the text embedding is study-level), and ITC training loads the model with \texttt{load\_llm=False} --- Vicuna is simply not instantiated. This single flag is the entire reason Stage~1 is feasible on a 24\,GB GPU at the contrastive batch sizes the loss demands: freeing the $\approx\!13$\,GB of resident Vicuna weights and their associated activations lifts the per-device batch from 8 (the Stage~2 budget) to 64--96 without changing any other configuration. A larger batch directly enlarges the InfoNCE negative pool, and we observed monotone improvement in the validation contrastive accuracy as the batch grew from 8 to 64; saturation began around 96 on our dataset scale, in line with the BLIP-2 ablations. Stage~1 is run for 2 epochs at peak learning rate $1\!\times\!10^{-3}$ with a 5\% cosine warm-up; the projection's MLP and the ITC head are trainable (the ITC head is tiny --- $\sim\!130$\,k parameters --- but its gradients carry the entire contrastive signal). Encoder and classifier are frozen and the LLM is absent. The Stage~1 checkpoint saved at the end of training is the \textit{projection-only} state dict; the ITC head is discarded and never reused, since it has no role at generation time. \paragraph{Stage 2 --- instruction tuning with QLoRA.} Stage~2 rebuilds the full model with \texttt{load\_llm=True}, loads the Stage~1 projection weights, and switches to the autoregressive cross-entropy objective. The dataset returns mixed batches according to the configured task weights (findings $30\%$, impression $20\%$, VQA $50\%$ for MIMIC-CXR; the weights renormalise automatically when VQA is absent for IU-Xray); the report mode is \texttt{split\_cascade}, so the impression samples carry the ground-truth findings as context, and the image mode is \texttt{frontal\_only\_split} (one PA-or-AP frontal image per study). The loss is the standard causal cross-entropy \begin{equation} \mathcal{L}_{\text{LM}} = -\frac{1}{|\mathcal{T}|} \sum_{(t,y)\in\mathcal{T}} \log p_{\theta}\!\left(y_t \mid y_{} placeholder. The tokenised prompt contains exactly \textit{one} \texttt{} token, and the model replaces its embedding with the 32 projection-emitted visual tokens at forward time. To keep the attention mask, position ids, and label tensor consistent, the forward pass expands all three by 31 entries at the placeholder position: the attention-mask entry for each visual token is set to 1, the position ids are made contiguous, and the labels in the visual span are filled with $-100$. The same expansion is applied symmetrically at inference. This expansion is the single most error-prone piece of the pipeline (because off-by-one mistakes silently shift the labels by $\pm 31$ positions and produce a degenerate loss curve that descends fast but converges to a useless equilibrium), so an integration test checks that label masks before and after expansion sum to the same count of non-$-100$ entries for every batch produced by the collator. \paragraph{Stage~0 --- the CheXpert classifier head.} Although not a stage of VLM training proper, the PNU classifier described in \S\ref{sec:chex} is fitted before Stages~1 and 2 begin. It is a small MLP on top of the frozen RAD-DINO \texttt{[CLS]} embedding, optimised under 14 independent binary cross-entropies on the \texttt{chex\_*} columns of the manifest, with U-MultiClass-style \{positive, negative, uncertain\} targets. Training takes minutes on a single GPU once the patch features are cached and is otherwise unremarkable; the resulting checkpoint is loaded read-only by both subsequent VLM stages. \section{Hyperparameters and Optimiser} \label{sec:hparams} Table~\ref{tab:hparams} consolidates the hyperparameters of both stages. Values are taken from \texttt{configs/train\_config.yaml}; any per-stage override under \texttt{stage1:}/\texttt{stage2:} takes precedence over the global \texttt{training:} block. Optimiser choice is the single hyperparameter that we vary by GPU: A10/L4/3090 hosts use the standard \texttt{adamw\_torch} (FP32 moments), while T4 hosts switch to \texttt{paged\_adamw\_8bit} from \texttt{bitsandbytes} to halve the optimiser-state footprint --- the auto-detect cell in \texttt{scripts/cxrvlm\_colab\_train.ipynb} sets this and a matching batch/accum pair so the effective batch is invariant across profiles. \begin{table}[h] \centering \small \caption{Training hyperparameters for the two stages. ``ITC mode'' refers to the contrastive Stage~1 used in this work; the legacy RaDialog-style Stage~1 (causal LM through Vicuna) is also implemented in the codebase but is not used in the reported runs.} \label{tab:hparams} \begin{tabular}{lll} \toprule \textbf{Hyperparameter} & \textbf{Stage 1 (ITC)} & \textbf{Stage 2} \\ \midrule Trainable modules & Projection + ITC head & Projection + LoRA \\ Frozen modules & Encoder, classifier, (no LLM) & Encoder, classifier, base LLM \\ Loss & Symmetric InfoNCE (Eq.~\ref{eq:itc}) & Causal CE (Eq.~\ref{eq:lm}) \\ Temperature $\tau$ & $0.07$ & --- \\ Epochs & 2 & 10 \\ Peak learning rate & $1\!\times\!10^{-3}$ & $2\!\times\!10^{-4}$ \\ LR schedule & cosine, 5\% warm-up & cosine, 5\% warm-up \\ Weight decay & $0.01$ & $0.01$ \\ Effective batch size & 64--96 & 16 \\ Per-device batch & 64 (L4/3090) & 8 (L4/3090), 1 (T4) \\ Gradient accumulation & 1 & 2 (L4/3090), 16 (T4) \\ Mixed precision & BF16 (Ampere+) / FP16 (T4) & BF16 / FP16 \\ LLM quantisation & --- (LLM not loaded) & 4-bit NF4, double-quant \\ LoRA $(r, \alpha, p)$ & --- & $(16,\,32,\,0.05)$ \\ LoRA modules & --- & $\{q, k, v, o\}_{\text{proj}}$ \\ Cutoff length & --- & 512 tokens \\ Optimiser & AdamW (paged 8-bit on T4) & AdamW (paged 8-bit on T4) \\ Gradient checkpointing & N/A (no LLM) & on \\ \bottomrule \end{tabular} \end{table} We make no claim that the values in Table~\ref{tab:hparams} are optimal --- almost all of them are inherited from RaDialog with minimal tuning. The two we did vary were the Stage~1 batch size (a sweep over $\{16, 32, 64, 96\}$ chose $64$ as the smallest batch within $0.3\%$ of the best validation contrastive accuracy) and the Stage~2 learning rate (a sweep over $\{1{\times}10^{-4}, 2{\times}10^{-4}, 5{\times}10^{-4}\}$ chose $2{\times}10^{-4}$ unambiguously; $5{\times}10^{-4}$ diverged within $\sim\!300$ steps). \section{Checkpointing, Resume, and Multi-platform Execution} \label{sec:checkpointing} Every save during either stage writes a HuggingFace-Trainer-format \texttt{checkpoint-\textit{step}/} directory locally and, when \texttt{hf\_hub.enabled} is true, mirrors it to a private Hugging Face model repository under the path \texttt{\{run\_id\}/\{stage\}/last/}; the best checkpoint by \texttt{eval\_loss} is additionally copied to \texttt{best/}. \texttt{save\_total\_limit=1} guarantees that at most two checkpoints live per stage on disk (best + last), bounding disk use without breaking resume. The run identifier is persisted to a tiny \texttt{run\_id.txt} file so a re-launched job picks up the same run folder; switching this off (the legacy behaviour) creates a fresh \texttt{\{dataset\}\_run\_\{N+1\}} on every launch. The motivation for this scaffolding is purely practical. The full training run is on the order of $\sim\!36$ wall-clock hours on a single A10G or L4, considerably longer on T4, and considerably shorter on A100. None of the platforms we have access to (Google Colab, Kaggle Notebooks, Vast.ai pods, Lightning.ai studios) guarantee uninterrupted GPU time at those durations: Colab disconnects, Kaggle preempts at 9\,h, Vast pods sporadically drop. The HF Hub round-trip is the only persistence layer that survives all of them. The unified resume controller (\texttt{--mode resume}) handles the hard parts: it pulls the latest \texttt{last/} of both stages from the Hub into the canonical local layout via \texttt{hydrate\_run\_dir\_from\_hf}, then calls \texttt{detect\_resume\_point} which inspects what is on disk and decides whether to start Stage~1, resume Stage~1, start Stage~2, resume Stage~2, or exit (everything is finished). The user does not need to know which stage the previous machine died in; the controller works it out from filesystem state. Two operational details are worth recording. First, optimiser state and 4-bit compute dtype must be consistent across resume, which means the GPU family should not change between launches; switching from an A100 to a T4 mid-run is safe by checkpoint format (LoRA adapters and projection weights are dtype-agnostic) but the AdamW moments will be reloaded with slightly different numerical accumulators, so we recommend staying within the same VRAM/precision bucket --- L4, A10G, RTX 3090, and similar Ampere-or-newer 24\,GB cards are interchangeable. Second, \texttt{PeftModel.from\_pretrained} defaults to inference mode and returns adapters with \texttt{requires\_grad=False}; the resume code path passes \texttt{is\_trainable=True} explicitly so the LoRA parameters are again exposed to the optimiser. This was a non-obvious bug in earlier iterations of the codebase and is now covered by a unit test that asserts the trainable parameter count after resume matches the count at save time. \section{What Stays Off the Loss} \label{sec:loss-exclusions} A short note on what is \textit{not} optimised. The image encoder weights are frozen, the CheXpert classifier weights are frozen, the Vicuna base weights are frozen (and 4-bit quantised, so even if gradients flowed they could not be applied without dequantisation), the ITC head is discarded after Stage~1 and never sees a gradient in Stage~2, and the system prompt and structured-findings strings are masked out of the language modelling loss. The only learning happens at the projection, the LoRA adapters, and (transiently, during Stage~1 only) the ITC head. This is by design: every additional trainable module multiplies the optimiser-state footprint, complicates the distributed-training story (which we explicitly do not implement; all experiments are single-GPU), and --- in our ablations --- failed to improve any of the headline metrics on the held-out test split. The clinical evaluation in Chapter~\ref{ch:experiments} reports ROUGE-1/2/L, BLEU-1/4, BERTScore-F1, and the CheXbert-derived clinical F1 over the 14 pathology vocabulary on the findings and impression tasks, and exact-match accuracy plus token F1 plus BLEU-1 on VQA. All metrics are computed on the patient-disjoint test split established in Phase~1 of the data pipeline; the model selection criterion during training is the standard \texttt{eval\_loss} on the corresponding validation split, never any of the downstream metrics --- those are reserved for the final report.