% =============================================================================
% Chapter: Methodology
%
% Structure follows the META-CXR (IEEE Access 2025) layout:
%   III. METHODOLOGY
%     A. Data Materials
%     B. Module 1 (visual extractor)
%     C. Module 2 (cross-modal alignment)
%     D. Module 3 (abnormality classifier)
%     E. Module 4 (language model)
%     F. Prompt Construction
%     G. Two-Phase Training
%     H. Inference Pipeline
%     I. Evaluation Protocol
%
% The reported configuration is the only configuration described:
% one frontal view per study, impression conditioned on the
% ground-truth findings string, etc. The codebase exposes other
% options but they are not part of this report.
% =============================================================================
\chapter{Methodology}
\label{ch:method}

This chapter presents the proposed Chest X-Ray Vision--Language Model (CXR-VLM) in nine sections. The first describes the data materials and the pre-processing pipeline that produces the training corpus from MIMIC-CXR. The next four describe the four modules that constitute the model itself --- a frozen vision encoder, a learnable cross-modal projection, an auxiliary abnormality classifier, and a frozen large language model fitted with parameter-efficient adapters. Section~\ref{sec:prompt} explains how these modules are tied together by a single prompt template that specialises to each downstream task. Section~\ref{sec:training} lays out the two-phase training schedule, with the contrastive alignment loss of Phase~1 and the instruction-tuning loss of Phase~2 stated explicitly. Section~\ref{sec:inference} formalises the inference procedure as an algorithm. Section~\ref{sec:eval} closes the chapter by specifying the evaluation protocol --- the held-out splits, the metrics, and the model-selection criterion --- under which the experimental results of Chapter~\ref{ch:experiments} are reported.

% TODO INSERT FIGURE: full pipeline diagram % File: figures/pipeline_overview.pdf (the SVG already generated by % scripts/generate_pipeline_svg.py can be exported to PDF) % Caption: ``Overview of the proposed CXR-VLM. A frozen RAD-DINO image % encoder produces patch features that are pooled by a learnable % MLP projection into 32 visual tokens. The PNU abnormality string, % produced by a frozen 14-pathology CheXpert classifier sharing the % same encoder, is appended to the textual prompt. A frozen Vicuna-7B % language model with rank-16 LoRA adapters consumes the joint % visual--textual sequence and emits the radiology report.'' % ============================================================================= \section{Data Materials} \label{sec:data} % ============================================================================= \subsection{Datasets} Three publicly-available chest radiograph datasets are used in this work; each plays a distinct role. \paragraph{MIMIC-CXR-JPG~\citep{mimic-cxr-jpg}.} The primary training and evaluation corpus. MIMIC-CXR-JPG is the JPEG-converted release of MIMIC-CXR~\citep{mimic-cxr}, comprising 377{,}110 chest radiographs from 227{,}827 imaging studies of 65{,}379 patients, paired with free-text radiology reports. Each study is additionally accompanied by a row in \texttt{mimic-cxr-2.0.0-chexpert.csv}, in which the CheXbert NLP labeler~\citep{chexbert} extracts a $14$-pathology abnormality vector from the report text, with values in $\{-1, 0, 1, \text{blank}\}$ denoting uncertain, negative, positive, and not-mentioned respectively. The official patient-disjoint train / validate / test split in \texttt{mimic-cxr-2.0.0-split.csv} is used as the starting point for our subset construction. 
\paragraph{MIMIC-Ext-CXR-VQA~\citep{mimic-cxr-vqa}.} A visual question--answering extension of MIMIC-CXR that provides 1.4\,million $(\text{image}, \text{question}, \text{answer})$ triples covering presence, location, severity, and comparative reasoning over the same 14 pathologies. Each triple references a MIMIC-CXR study by \texttt{dicom\_id}, so it is filtered down to the subset described below by simple set intersection. \paragraph{IU X-Ray~\citep{iuxray}.} A smaller open-access dataset of 3{,}955 fully de-identified chest X-ray studies with associated reports, used as an out-of-distribution evaluation corpus to assess cross-domain generalisation of the report-generation tasks. Following the convention of prior work~\citep{r2gen, metransformer}, we adopt a 7:1:2 train/validate/test partition at the study level. \subsection{Subset Construction} \label{sec:subset} The full MIMIC-CXR collection is too large to train on with the hardware budget of this work, and its native distribution is heavily imbalanced both in view types (frontal vs lateral exposures) and in report formatting. We therefore curate a 50{,}000-study subset by a four-stage filter chain, applied once and then re-used across all experiments. The filter is documented here at a level intended to be sufficient for reproducibility. % TODO INSERT FIGURE: EDA from the notebook % File: figures/eda_pathology_prevalence.pdf % (per-pathology prevalence: full MIMIC -> eligible pool -> final % subset, side-by-side bar chart). Source notebook: data/build_subset_*.ipynb % Caption: ``Per-pathology prevalence across the full MIMIC-CXR % distribution, the eligible pool after report/view filtering, and % the final stratified subset. 
Absolute deltas remain under 0.7 % percentage points on every label, confirming that stratification % preserves the clinical distribution.'' \paragraph{(a) Frontal-view selection.} Each DICOM is joined with \texttt{mimic-cxr-2.0.0-metadata.csv} by \texttt{dicom\_id}, and only rows whose \texttt{ViewPosition} field is PA or AP are retained. A study with multiple frontal exposures is collapsed to one image, preferring the PA projection where available because PA is the standard reference projection while AP is reserved for bedside or supine portable studies. Lateral views are discarded because they are clinically complementary but stylistically and anatomically distinct from frontals; mixing them with frontals would inject input heterogeneity that the model has no architectural mechanism to handle. After this step every retained study contributes exactly one frontal image to the corpus. \paragraph{(b) Report parsing.} Each study's \texttt{.txt} file is scanned by a strict regular expression that recognises section headers of the form \verb|^[A-Z ,/().-]+:|. A section body is accepted only if the upper-cased header is exactly \texttt{FINDINGS} or \texttt{IMPRESSION}; near-synonyms such as \texttt{CONCLUSION}, \texttt{WET READ}, \texttt{INDICATION}, or composite headers like \texttt{FINDINGS AND IMPRESSION} are deliberately not merged because they vary substantially in style and would dilute the training signal. A study survives this stage only if both sections are present and non-empty --- a requirement made stricter than RaDialog because, as explained in Section~\ref{sec:prompt}, the impression-generation task in this work is conditioned on the ground-truth findings string, so a study without a clean findings section cannot supply a usable impression sample. 
\paragraph{(c) Length-based outlier removal.} Word counts are computed per section; the upper cutoff is set to $Q_3 + 1.5\cdot\mathrm{IQR}$ per section, dropping multi-paragraph teaching reports that sit far from the median, and a lower floor of two words for findings and one word for impression is imposed. The combination removes the long tail without affecting the median statistics. \paragraph{(d) Stratified patient-disjoint sampling.} The remaining eligible pool is sampled to 40{,}000 train / 5{,}000 validation / 5{,}000 test studies. Each study is first assigned a \emph{stratum} equal to its rarest positive CheXpert label (the rarity order is computed over the eligible pool, so e.g.\ \emph{Pleural Other}, \emph{Lung Lesion}, and \emph{Fracture} sit at the top); studies with no positive label fall into \texttt{No Finding} or \texttt{None}. Sampling allocates the target count per stratum proportionally to prevalence and draws within each stratum, preferring studies that also have an associated VQA triple in MIMIC-Ext-CXR-VQA so that the VQA task receives more supervision. The validation and test pools are populated first from the official PhysioNet splits; if a target count exceeds the official pool, the remainder is drawn from the train pool and the affected subject ids are then removed from train. The three resulting splits are therefore patient-disjoint --- no subject appears in more than one split. Per-pathology prevalence deltas between the final subset and the eligible pool remain below 0.7 percentage points on every label, confirming that stratification did not skew clinical coverage. \subsection{Image Pre-processing} Because every training step ultimately resizes the input to the encoder's native $518\!\times\!518$ window (Section~\ref{sec:encoder}), all corpus images are resized once, offline, so the shortest edge equals 518 with the longer edge proportional, and saved at JPEG quality 90 --- near-lossless on grayscale chest radiographs. 
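As a worked illustration of the resize rule, assume a hypothetical source radiograph of $2544 \times 3056$ pixels (this size is chosen for the example only and is not taken from the corpus statistics). Scaling the shortest edge to 518 gives
\begin{equation*}
s \;=\; \frac{518}{2544} \;\approx\; 0.2036, \qquad 3056 \cdot s \;\approx\; 622,
\end{equation*}
so the stored image would be $518 \times 622$ pixels and would carry roughly $s^2 \approx 4.1\%$ of the original pixel count before JPEG compression is taken into account.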
The 50k subset compresses from $\sim$100\,GB of full-resolution JPEGs down to $\sim$5--8\,GB, which fits comfortably on a single Hugging Face Hub repository for distribution. At training time the dataset loader applies the encoder's standard pre-processing (centre crop to $518\!\times\!518$, channel-wise normalisation with the ImageNet mean and standard deviation) on the resized JPEGs. % ============================================================================= \section{Visual Feature Extractor} \label{sec:encoder} % ============================================================================= % TODO INSERT FIGURE: RAD-DINO / DINOv2 ViT-B/14 architecture % Source: original RAD-DINO paper (P\'erez-Garc\'ia et al., 2024) Figure 1 % or DINOv2 paper (Oquab et al., 2023) Figure 1 - either is fine. % Caption: ``Architecture of the RAD-DINO visual backbone. The % $518\times518$ chest radiograph is divided into $14\times14$ % patches, linearly projected, and processed by a 12-layer ViT-B % transformer pre-trained with the DINOv2 self-distillation objective % on $\approx$840k chest radiographs. The class token is discarded % downstream; only the 1369 patch tokens are passed to the % projection.'' The visual backbone is Microsoft's RAD-DINO~\citep{raddino}, a ViT-B/14 vision transformer pre-trained with the DINOv2 self-distil\-lation objective~\citep{dinov2} on $\approx$840{,}000 chest radiographs drawn from MIMIC-CXR, CheXpert, NIH ChestX-ray14, and PadChest. The \texttt{microsoft/rad-dino} checkpoint is loaded through the HuggingFace \texttt{transformers} hub and accepts inputs at its native resolution of $518\!\times\!518$. At this resolution each image is tokenised into $37\!\times\!37 = 1369$ patches of size $14\!\times\!14$, linearly projected to a 768-dimensional embedding, and processed by a 12-layer transformer with 12 attention heads per layer. 
The class token at the head of the sequence is discarded downstream, so the encoder output consumed by the projection is \begin{equation} \mathbf{P} \;=\; \operatorname{RAD-DINO}(\mathbf{X}) \;\in\; \mathbb{R}^{B \times 1369 \times 768} \label{eq:patches} \end{equation} where $\mathbf{X} \in \mathbb{R}^{B \times 3 \times 518 \times 518}$ is the pre-processed image batch. The choice of RAD-DINO over more commonly-used backbones such as BioViL-T~\citep{biovilt} or generic ImageNet-pretrained ViTs is motivated by three factors: it is trained on roughly an order of magnitude more chest-X-ray data than BioViL-T; its DINOv2 self-distillation objective yields denser, more spatially-coherent patch features than contrastive text--image alignment objectives, which matters for the focal abnormalities (nodules, small consolidations) the report-generation task needs to mention; and it is distributed as a standard HuggingFace model, free of the legacy dependency chain of BioViL-T's \texttt{hi-ml-multimodal} package. The encoder is kept entirely frozen throughout this work. Allowing gradients to flow back into an 86\,M-parameter ViT, on top of the quantised LLM activations in the same model, exhausts the activation budget of consumer GPUs, and the empirical observation in RaDialog~\citep{radialog} and LLaVA-Med~\citep{llavamed} that unfreezing the encoder before the projection has converged tends to collapse the joint representation gives independent reason for caution. The pre-trained patch features are instead specialised to the chest-X-ray report distribution by the projection module described next, under the explicit alignment objective of Phase~1. 
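
The shape in Equation~\ref{eq:patches} can be checked with a short calculation. The sketch below assumes that the checkpoint emits one class token and no other special tokens, and that features would be cached in FP16; neither assumption is stated in the text above.
\begin{align*}
\frac{518}{14} &= 37 \quad \text{patches per side}, \\
37^2 &= 1369 \quad \text{patch tokens}, \qquad 1369 + 1 = 1370 \quad \text{tokens including the class token}, \\
1369 \times 768 &= 1{,}051{,}392 \quad \text{values per image} \;\approx\; 2.1\,\text{MB at 2 bytes per value}.
\end{align*}
Because 518 is an exact multiple of 14, no padding or cropping is needed at the patch-embedding stage.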
% ============================================================================= \section{Cross-Modal Projection} \label{sec:projection} % ============================================================================= % TODO INSERT FIGURE: projection module schematic % File: figures/projection_module.pdf % Caption: ``Architecture of the cross-modal MLP projection. A learnable % query bank of 32 tokens cross-attends over the 1369 patch features % to pool them into a fixed visual-token budget. A two-layer MLP % then lifts the pooled tokens from 768 to 4096 dimensions, the % embedding size of Vicuna-7B. The intermediate 1024-d activation % is exposed to the ITC head used during Phase 1 of training.'' The projection module serves two roles: pool the variable-length patch sequence from Equation~\ref{eq:patches} into a fixed visual-token budget the language model can attend to, and lift each visual token from the encoder's 768-dimensional embedding space into the 4096-dimensional embedding space of Vicuna-7B. We adopt a design in the spirit of the perceiver resampler used by Flamingo~\citep{flamingo} and the Q-Former of BLIP-2~\citep{blip2}, but without the heavy multi-layer transformer stack of either: a single learnable query bank cross-attends over the patches, followed by a two-layer MLP. This light-weight choice was first demonstrated on the chest-X-ray task by RaDialog~v2~\citep{radialog}, who report that the gain from a deeper Q-Former does not materialise on radiology metrics. Concretely, the projection holds three learnable components: \begin{enumerate} \item A query bank $\mathbf{Q}_0 \in \mathbb{R}^{32 \times 768}$, initialised from $\mathcal{N}(0, 0.02^2)$ and broadcast to a batched $\mathbf{Q}\in\mathbb{R}^{B\times 32\times 768}$ at forward time. \item An 8-head cross-attention block at width 768, with dropout 0.1. 
\item A two-layer MLP with weights $\mathbf{W}_1 \in \mathbb{R}^{768\times 1024}$ and $\mathbf{W}_2 \in \mathbb{R}^{1024\times 4096}$, separated by a GELU and dropout; both weight matrices are initialised from $\mathcal{N}(0, 0.02^2)$ with zero biases.
\end{enumerate}

For the patch features $\mathbf{P}_i \in \mathbb{R}^{1369 \times 768}$ of sample $i$, the projection emits visual tokens $\mathbf{V}_i \in \mathbb{R}^{32 \times 4096}$ in three steps, where $\mathbf{Q}_i = \mathbf{Q}_0$ is the shared query bank and the bias terms are omitted for brevity:
\begin{align}
\mathbf{H}^{(0)}_i &= \operatorname{CrossAttn}\!\left(\mathbf{Q}_i, \mathbf{P}_i, \mathbf{P}_i\right) \in \mathbb{R}^{32 \times 768} \label{eq:crossattn}\\
\mathbf{H}^{(1)}_i &= \operatorname{Dropout}\!\left(\operatorname{GELU}(\mathbf{H}^{(0)}_i \mathbf{W}_1)\right) \in \mathbb{R}^{32 \times 1024} \label{eq:tap}\\
\mathbf{V}_i &= \mathbf{H}^{(1)}_i\, \mathbf{W}_2 \in \mathbb{R}^{32 \times 4096}
\end{align}

Three design choices deserve explanation. First, the number of visual tokens is fixed at 32, matching RaDialog and sitting inside the empirical sweet spot reported by BLIP-2 and LLaVA: below 16, the pool is too coarse for small focal pathologies; above 64, the LLM's sequence length grows without measurable gain on radiology metrics. Second, the cross-attention is unidirectional --- queries attend over patches, but patches do not attend back. This is the standard perceiver setup and keeps the projection module small (1.8\,M parameters) and easy to train under a contrastive objective in Phase~1. Third, the 1024-d intermediate activation $\mathbf{H}^{(1)}$ in Equation~\ref{eq:tap} is exposed deliberately --- it is the grounding feature the Phase~1 contrastive head operates on, and placing it between the GELU and the second linear layer means the contrastive objective sees an already-nonlinear representation rather than the raw cross-attention output.

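
The pooling step can be traced shape by shape. The sketch below assumes standard multi-head scaled dot-product attention, in which the width is split evenly across heads; the text above specifies only the head count and the width, so the per-head dimension $d_h$ and the attention map $\mathbf{A}^{(h)}$ are illustrative symbols introduced here.
\begin{align*}
d_h &= \frac{768}{8} = 96, \\
\mathbf{A}^{(h)} &= \operatorname{softmax}\!\left(\frac{\mathbf{Q}^{(h)} {\mathbf{K}^{(h)}}^{\!\top}}{\sqrt{d_h}}\right) \in \mathbb{R}^{32 \times 1369}, \qquad h = 1, \dots, 8, \\
\mathbf{A}^{(h)} \mathbf{V}^{(h)} &\in \mathbb{R}^{32 \times 96}, \qquad \text{concatenated over 8 heads} \;\Rightarrow\; \mathbb{R}^{32 \times 768}.
\end{align*}
Each of the 32 rows of $\mathbf{A}^{(h)}$ is a probability distribution over the 1369 patches, so every visual token is a learned weighted average of patch features. The sequence seen by the language model shrinks by a factor of $1369 / 32 \approx 42.8$ relative to the raw patch sequence.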
% ============================================================================= \section{Abnormality Classifier} \label{sec:chex} % ============================================================================= % TODO INSERT FIGURE: PNU example % A two-column figure showing (left) a sample chest X-ray, (right) the % predicted PNU block as it would appear in the prompt -- something like: % "Positive Abnormalities: Cardiomegaly, Pleural Effusion % Negative Abnormalities: No Finding, Edema, Pneumothorax, ... % Uncertain Abnormalities: Atelectasis" % Caption: ``Example PNU abnormality block generated by the auxiliary % classifier for a representative MIMIC-CXR study. The block is % inserted between the visual tokens and the natural-language % instruction in the prompt fed to the language model.'' The abnormality classifier is the principal mechanism by which explicit clinical knowledge enters the language model. Rather than fusing classifier logits into the visual embedding pathway, we serialise the classifier's predictions to text and append them to the prompt. This design is borrowed from the U-MultiClass formulation of META-CXR~\citep{metacxr}, where each of the 14 CheXpert pathologies is classified into one of three classes --- positive (P), negative (N), or uncertain (U) --- rather than collapsed to a binary present/absent decision. The three-way distinction matters because the negative class (the report explicitly rules a pathology out) carries different information from the uncertain class (the report hedges with phrases such as ``may represent'' or ``cannot exclude''), and the language model can exploit that distinction when phrasing the generated report. The classifier architecture is intentionally small. It consists of a multi-layer perceptron head on top of the global \texttt{[CLS]} token of the same frozen RAD-DINO encoder used by the rest of the model. 
Sharing the encoder is what makes the design near-free at inference: the encoder forward pass is already required for the visual pathway, so adding the classifier costs only a single $\mathbb{R}^{768} \to \mathbb{R}^{14 \times 3}$ MLP forward pass per image. The head produces a logit tensor
\begin{equation}
\mathbf{Z} \;=\; \operatorname{MLP}\!\left(\mathbf{c}_{\texttt{[CLS]}}\right) \;\in\; \mathbb{R}^{14 \times 3}
\end{equation}
which is softmaxed along the class axis to yield, for each pathology $j$, the probability distribution $\hat{\mathbf{y}}_j \in \Delta^2$ over the three U-MultiClass categories. The predicted category is the arg-max,
\begin{equation}
c_j \;=\; \operatorname*{arg\,max}_{c \in \{\text{P}, \text{N}, \text{U}\}} \hat{\mathbf{y}}_{j, c}.
\end{equation}
These per-pathology predictions are then serialised into a human-readable abnormality block --- which we refer to as the \emph{PNU string} --- by listing the pathology names that fall into each of the three categories on three lines:
\begin{verbatim}
Positive Abnormalities:
Negative Abnormalities:
Uncertain Abnormalities:
\end{verbatim}

At training time the PNU string is built directly from the \texttt{mimic-cxr-2.0.0-chexpert.csv} ground-truth labels (the classifier-as-oracle setting), with missing-value labels in the CSV mapped to negative for consistency with the classifier targets of Section~\ref{sec:phase0}, so that the language model learns to condition on \emph{reliable} abnormality information from the very first step. At inference time the classifier's own predictions are used instead. The classifier itself is fitted in a brief preparatory phase (Section~\ref{sec:phase0}) and frozen for all subsequent training.

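
A small numerical example may help fix the arg-max rule. The logits below are hypothetical values chosen for illustration, not outputs of the trained classifier, and the class order $(\text{P}, \text{N}, \text{U})$ is likewise an assumption of this sketch. For a single pathology $j$ with logits $\mathbf{z}_j = (2.0,\, 0.5,\, -1.0)$,
\begin{align*}
\left(e^{2.0},\, e^{0.5},\, e^{-1.0}\right) &\approx (7.389,\, 1.649,\, 0.368), \qquad \text{sum} \approx 9.406, \\
\hat{\mathbf{y}}_j &\approx (0.786,\, 0.175,\, 0.039), \\
c_j &= \text{P}.
\end{align*}
The name of pathology $j$ would therefore be listed on the \texttt{Positive Abnormalities} line of the PNU string. Note that the rule is a pure arg-max: a pathology with a near-tie between two classes is serialised exactly like a confident one, since no probability is written into the prompt.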
% =============================================================================
\section{Language Model with Low-Rank Adaptation}
\label{sec:llm}
% =============================================================================
% TODO INSERT FIGURE: LoRA decomposition
% Source: original LoRA paper (Hu et al., 2021) Figure 1
% Caption: ``Low-Rank Adaptation (LoRA). For each adapted weight matrix
% $W_0 \in \mathbb{R}^{d \times d}$ in the language model, a low-rank
% update $\Delta W = BA$ with $A \in \mathbb{R}^{r\times d}$, $B\in
% \mathbb{R}^{d \times r}$ and $r \ll d$ is added at inference. Only
% $A$ and $B$ are trained; the original matrix $W_0$ remains frozen
% and 4-bit quantised. We set $r=16$, $\alpha = 32$.''

The decoder is Vicuna-7B v1.3~\citep{vicuna}, a LLaMA-1 derivative fine-tuned on ShareGPT conversational data. We selected the v1.3 release rather than the later v1.5 (LLaMA-2 base) or more recent Llama-3 models for two practical reasons: it is the exact language model used by the RaDialog baseline against which our work is most directly comparable, and its chat template employs a single unambiguous \verb|USER: ... ASSISTANT: ...| structure that simplifies the label-masking required at training time.

The base model is loaded in 4-bit NF4 quantisation through the \texttt{BitsAndBytesConfig} interface of \texttt{transformers}, backed by the \texttt{bitsandbytes} library~\citep{qlora}, with double quantisation enabled and the compute dtype matched to the host GPU (BF16 on Ampere-or-newer cards, FP16 on Turing). This reduces the resident weight footprint from $\approx\!14$\,GB in FP16 to $\approx\!4.0$\,GB, which fits comfortably alongside the encoder, projection, and activation budget on a single 16\,GB GPU. Vicuna-7B is otherwise kept frozen; adaptation is performed entirely through LoRA~\citep{lora}.
For each attention weight matrix $\mathbf{W}_0 \in \mathbb{R}^{4096 \times 4096}$ in the $\{q, k, v, o\}_{\text{proj}}$ projections of every transformer block, we introduce a low-rank update
\begin{equation}
\mathbf{W}' \;=\; \mathbf{W}_0 \;+\; \frac{\alpha}{r}\,\mathbf{B}\mathbf{A}, \qquad \mathbf{A} \in \mathbb{R}^{r \times 4096}, \quad \mathbf{B} \in \mathbb{R}^{4096 \times r}, \quad r \ll 4096
\label{eq:lora}
\end{equation}
where $\mathbf{A}$ is initialised from $\mathcal{N}(0, \sigma^2)$ and $\mathbf{B}$ is zero-initialised, so that $\mathbf{B}\mathbf{A} = \mathbf{0}$ and the adapted layer reproduces the frozen base layer exactly at the start of training. Only $\mathbf{A}$ and $\mathbf{B}$ receive gradients; the quantised base weight $\mathbf{W}_0$ remains fixed. We set $r = 16$, $\alpha = 32$, and apply a dropout of $0.05$ on the LoRA branch, giving an effective adapter scaling of $\alpha/r = 2$ that we found preferable to $1$ and $4$ in a brief sweep.

The feed-forward sublayers are deliberately left without LoRA adapters: pilot runs that additionally adapted \texttt{gate\_proj} and the other MLP projections enlarged the trainable parameter count by $3$--$4\times$ without a corresponding gain on the downstream clinical-accuracy metric. With this configuration the total trainable parameter count of the language model is $\approx\!19.7$\,M, which together with the $\approx\!1.8$\,M of the projection brings the overall trainable budget to $\approx\!21.5$\,M out of $\approx\!7.1$\,B total --- about $0.30\%$.

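
The saving per adapted matrix follows from Equation~\ref{eq:lora} by counting entries. The calculation below covers a single $4096 \times 4096$ projection only and makes no claim about the model-wide totals quoted above.
\begin{align*}
\text{trainable entries in } \mathbf{A} \text{ and } \mathbf{B} &= r \cdot 4096 + 4096 \cdot r = 16 \cdot 8192 = 131{,}072, \\
\text{frozen entries in } \mathbf{W}_0 &= 4096^2 = 16{,}777{,}216, \\
\text{ratio} &= \frac{131{,}072}{16{,}777{,}216} = \frac{1}{128} \approx 0.78\%.
\end{align*}
With $\alpha = 32$ and $r = 16$ the update is scaled by $\alpha / r = 2$, so a change of rank at fixed $\alpha$ would also change the effective step size of the adapter; this is one reason the two values are usually tuned together.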
% =============================================================================
\section{Prompt Construction}
\label{sec:prompt}
% =============================================================================

All three downstream tasks --- findings generation, impression generation, and visual question answering --- share a single prompt template that follows Vicuna's v1.1 chat format:
\begin{verbatim}
{SYSTEM_PROMPT} USER: {PNU abnormality block} {task-specific context} {instruction} ASSISTANT: {target}
\end{verbatim}

The image-placeholder token, which sits in the user turn ahead of the PNU abnormality block, is added to the tokenizer as a special single-id token with index 32{,}000, the first free slot above Vicuna's existing 32{,}000-token vocabulary. At forward time the model locates this token in the tokenised input and substitutes the 32 visual tokens $\mathbf{V}$ from the projection in its place, expanding the attention mask and label tensor by 31 positions per sample so that they stay aligned with the expanded embedding sequence. The label tensor entries for visual-token positions are set to $-100$, the \texttt{torch.nn.CrossEntropyLoss} ignore index, so that no loss is incurred on visual tokens.

The fields \verb|{PNU abnormality block}|, \verb|{task-specific context}|, and \verb|{instruction}| are populated differently per task.

\paragraph{Findings generation.} The model is asked to produce the findings paragraph from the frontal radiograph alone, with the PNU abnormality block as soft clinical guidance. There is no task-specific context. The instruction slot draws from a pool of ten hand-written paraphrases (e.g.\ ``Generate the findings section for this chest X-ray.'') sampled uniformly at random during training; at evaluation the first variant is used deterministically for reproducibility.

\paragraph{Impression generation.} The impression is by clinical convention a summary of the findings, so we frame the impression task as conditional summarisation rather than free generation.
The task-specific context slot is populated with the ground-truth findings text, prefixed by the literal token \texttt{Findings:}. The model therefore receives the frontal image, the PNU block, the ground-truth findings paragraph, and an instruction drawn from a pool of ten impression-specific paraphrases (e.g.\ ``Generate the impression section for this chest X-ray.''). Studies whose report lacks a findings section emit no impression sample, by construction of the filter chain in Section~\ref{sec:subset}. We acknowledge that this is a teacher-forced setting: the impression metrics reported in Chapter~\ref{ch:experiments} are an upper bound on what a fully end-to-end cascade --- in which the model's own generated findings feeds the impression prompt --- would achieve, and we discuss the gap in the limitations section of that chapter. \paragraph{Visual question answering.} The task-specific context slot is empty, and the instruction slot is the natural-language question drawn from MIMIC-Ext-CXR-VQA. The reference answer in the dataset supplies the target. For every task, the system prompt is the radiologist persona used by RaDialog --- ``A chat between a curious user and an artificial intelligence assistant acting as an experienced radiologist''. The full prompt is tokenised with a 512-token right-truncation budget. We choose right-truncation deliberately: a too-long prompt should erode the target on the right, where labels matter for the loss, rather than the system prompt or PNU block on the left, both of which carry non-redundant information. The label tensor is masked with $-100$ at every prompt-token, padding-token, and visual-token position, so the language-modelling loss is computed strictly on the assistant response. 
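
The interaction between the truncation budget and the visual-token expansion can be made concrete with a short calculation. It assumes that the 512-token budget is applied to the text sequence before the single placeholder is expanded; the text above does not state the order, so this is an assumption made for illustration only. Under it, the longest sequence that reaches the language model is
\begin{equation*}
512 - 1 + 32 \;=\; 543 \text{ positions},
\end{equation*}
of which 32 are visual tokens masked with $-100$ and at most 511 are text tokens.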
% ============================================================================= \section{Two-Phase Training} \label{sec:training} % ============================================================================= Training proceeds in three stages: a brief preparatory stage that fits the abnormality classifier head, then two main phases that train the model itself. The two-phase split mirrors the standard practice of BLIP-2, LLaVA, and RaDialog: in the first phase the alignment module is fitted under an explicit contrastive objective without touching the language model, and in the second phase the full model is fine-tuned end-to-end under the autoregressive language-modelling objective. \subsection{Preparatory Phase: Abnormality Classifier} \label{sec:phase0} The PNU classifier of Section~\ref{sec:chex} is fitted before either of the two main phases. The frozen RAD-DINO encoder is forwarded over every training image and the resulting \texttt{[CLS]} embeddings are cached to disk; the small MLP head is then trained under fourteen independent three-way cross-entropies against the $\{\text{positive}, \text{negative}, \text{uncertain}\}$ targets derived from \texttt{mimic-cxr-2.0.0-chexpert.csv} (with missing-value labels mapped to negative). Because the encoder is frozen and its features are cached, this phase converges in a handful of epochs on a single GPU and is unremarkable beyond noting that the resulting checkpoint is loaded read-only by both of the subsequent phases. \subsection{Phase 1 --- Contrastive Alignment} \label{sec:phase1} % TODO INSERT FIGURE: contrastive alignment schematic % A two-tower diagram: (top) frozen RAD-DINO -> projection -> ITC head % -> L2-norm 128-d image embedding; (bottom) frozen CXR-BERT -> % projected 128-d text embedding from the findings paragraph; an % InfoNCE block between the two towers. % Caption: ``Phase 1 contrastive alignment. 
The projection is trained % to map the visual representation of a chest radiograph close to % the CXR-BERT text representation of the corresponding findings % paragraph under a symmetric InfoNCE loss, with all other study % pairings in the batch acting as negatives. The text embeddings % are pre-computed offline; the language model is not loaded % during this phase.'' Phase 1 specialises the projection to the chest-X-ray report distribution without involving the language model. The projection output is augmented with a small contrastive head that mean-pools the 32 intermediate-dimensional tokens of Equation~\ref{eq:tap}, projects them to a 128-dimensional joint embedding space, and L2-normalises the result: \begin{equation} \mathbf{v}_i \;=\; \frac{\mathbf{W}_{\text{itc}}\,\bar{\mathbf{H}}^{(1)}_i}{\bigl\lVert \mathbf{W}_{\text{itc}}\,\bar{\mathbf{H}}^{(1)}_i \bigr\rVert_2}, \qquad \bar{\mathbf{H}}^{(1)}_i \;=\; \frac{1}{32}\sum_{k=1}^{32} \mathbf{H}^{(1)}_{i,k} \label{eq:itc-head} \end{equation} with $\mathbf{W}_{\text{itc}} \in \mathbb{R}^{128 \times 1024}$ the contrastive projection. On the text side, the canonical reference sentence per study --- the findings paragraph, falling back to the impression where findings is absent --- is encoded once, offline, by the frozen CXR-BERT-specialized model~\citep{cxrbert} through its \verb|get_projected_text_embeddings| interface, which itself emits a 128-dimensional L2-normalised vector $\mathbf{t}_i$. These per-study text embeddings are written to disk and re-used across every Phase~1 run; no text encoder is loaded at training time. The model is then trained to maximise the cosine similarity between $\mathbf{v}_i$ and the text embedding $\mathbf{t}_i$ of the same study, while pushing $\mathbf{v}_i$ away from the text embeddings of the other studies in the batch. 
We use the symmetric InfoNCE loss of CLIP~\citep{clip}:
\begin{equation}
\mathcal{L}_{\text{ITC}} \;=\; -\tfrac{1}{2}\!\left[
\sum_{i=1}^{B}\!\log\!\frac{\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_i/\tau)}
{\sum_{j=1}^{B}\exp(\mathbf{v}_i^{\!\top}\mathbf{t}_j/\tau)}
\;+\;
\sum_{i=1}^{B}\!\log\!\frac{\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_i/\tau)}
{\sum_{j=1}^{B}\exp(\mathbf{t}_i^{\!\top}\mathbf{v}_j/\tau)}
\right]
\label{eq:itc}
\end{equation}
with the softmax temperature $\tau = 0.07$, matching CLIP and the CXR-BERT recipe. The first sum is the image-to-text term and the second the text-to-image term. The dataset is de-duplicated to one image per study before batching, because the text embedding is study-level. Trainable parameters are restricted to the projection MLP and the contrastive head; the encoder and the classifier are frozen, and the language model is not even instantiated during this phase.

Phase 1 runs for two epochs at peak learning rate $1\!\times\!10^{-3}$ with a cosine schedule and a 5\% warm-up. Because Vicuna is not loaded, $\approx\!13$\,GB of GPU memory is free, which we exploit to push the per-device batch size to 64--96 and thereby enlarge the InfoNCE negative pool --- known to be the dominant factor in contrastive-alignment quality~\citep{clip, blip2}. The contrastive head is discarded at the end of Phase~1; only the projection weights are carried forward.

\subsection{Phase 2 --- Instruction Tuning}
\label{sec:phase2}
% TODO INSERT FIGURE: phase 2 schematic (optional)
% Could show the full model with the loss flowing only on assistant
% tokens. Or omit if the overview figure already covers it.

Phase 2 reinstates the language model and switches to autoregressive fine-tuning under the standard causal-language-modelling loss. The projection weights from Phase~1 are loaded into the freshly-built full model; the LoRA adapters of Equation~\ref{eq:lora} are freshly initialised as described in Section~\ref{sec:llm} (Gaussian $\mathbf{A}$, zero $\mathbf{B}$), so the low-rank update is zero at the first step. Trainable parameters are the projection MLP and the LoRA adapters on the four attention projections of all 32 transformer blocks.
The encoder, the classifier, and the quantised base weights of Vicuna
remain frozen.

Mini-batches are drawn from the union of the three task corpora ---
findings, impression, and VQA --- with sampling weights $0.3$, $0.2$, and
$0.5$ respectively.
The findings/impression ratio favours findings because findings is the
harder task (no textual context), and VQA is weighted equally to the
combined report-generation tasks because it is the most diverse subset
and benefits most from mini-batch coverage.

For each sample, the prompt is constructed according to its task as
described in Section~\ref{sec:prompt} and tokenised.
The forward pass emits visual tokens via
Equations~\ref{eq:patches}--\ref{eq:tap} and inserts them in place of the
image-placeholder token; the language model then produces logits
$\mathbf{l}_{i, t}$ at every position $t$.
The loss is
\begin{equation}
  \mathcal{L}_{\text{LM}} \;=\; -\,\frac{1}{|\mathcal{T}|}
  \sum_{(i, t)\,\in\,\mathcal{T}}
  \log p_{\theta}\!\bigl(y_{i, t}\,\bigm|\,y_{i, < t},\,\mathbf{V}_i,\,\mathbf{c}_i\bigr)
  \label{eq:lm}
\end{equation}
where $\mathcal{T}$ is the set of $(i, t)$ pairs whose label entry is not
the ignore index $-100$ (i.e.\ assistant tokens only), $\mathbf{V}_i$ are
the visual tokens, and $\mathbf{c}_i$ is the textual prompt context.

Phase 2 runs for ten epochs at peak learning rate $2\!\times\!10^{-4}$
under a cosine schedule with 5\% warm-up.
The effective batch size is held at 16 across all hardware profiles by
varying the per-device batch and gradient-accumulation step count in
inverse proportion.
The optimiser is AdamW~\citep{adamw} with weight decay $0.01$ on all
non-bias, non-LayerNorm parameters, applied to FP32 master copies of the
projection and LoRA weights; the quantised base LLM acts as a fixed
nonlinearity.

\subsection{Hyperparameters}

Table~\ref{tab:hparams} consolidates the hyperparameter settings of both
phases.
The values are inherited from RaDialog with two adjustments.
The Phase~1 batch size was swept over $\{16, 32, 64, 96\}$; 64 was
selected as the smallest batch that lay within $0.3\%$ of the best
validation contrastive accuracy.
The Phase~2 peak learning rate was swept over
$\{1{\times}10^{-4}, 2{\times}10^{-4}, 5{\times}10^{-4}\}$;
$5{\times}10^{-4}$ diverged within $\approx\!300$ optimiser steps and
$2{\times}10^{-4}$ gave the lowest validation loss.

\begin{table}[ht]
  \centering
  \small
  \caption{Training hyperparameters for the two phases.}
  \label{tab:hparams}
  \begin{tabular}{lll}
    \toprule
    \textbf{Hyperparameter} & \textbf{Phase 1 (ITC)} & \textbf{Phase 2 (LM)} \\
    \midrule
    Trainable modules    & Projection + ITC head                  & Projection + LoRA \\
    Frozen modules       & Encoder, classifier; LLM not loaded    & Encoder, classifier, base LLM \\
    Loss                 & Symmetric InfoNCE (Eq.~\ref{eq:itc})   & Causal CE (Eq.~\ref{eq:lm}) \\
    Temperature $\tau$   & $0.07$                                 & --- \\
    Epochs               & 2                                      & 10 \\
    Peak learning rate   & $1\times 10^{-3}$                      & $2\times 10^{-4}$ \\
    LR schedule          & cosine, 5\% warm-up                    & cosine, 5\% warm-up \\
    Weight decay         & $0.01$                                 & $0.01$ \\
    Effective batch size & 64--96                                 & 16 \\
    Mixed precision      & BF16 (Ampere+) / FP16 (Turing)         & BF16 / FP16 \\
    LLM quantisation     & --- (not loaded)                       & 4-bit NF4, double-quant \\
    LoRA $(r, \alpha, p)$ & ---                                   & $(16, 32, 0.05)$ \\
    LoRA modules         & ---                                    & $\{q, k, v, o\}_{\text{proj}}$ \\
    Cutoff length        & ---                                    & 512 tokens \\
    Optimiser            & AdamW                                  & AdamW \\
    \bottomrule
  \end{tabular}
\end{table}

% =============================================================================
\section{Inference Pipeline}
\label{sec:inference}
% =============================================================================

At inference, all parameters are frozen and the model is invoked once per
chest radiograph and task, with greedy decoding (\verb|do_sample=False|,
\verb|num_beams=1|; the configured \verb|temperature=0.1| has no effect
on the output, because the arg-max token is unchanged by temperature
scaling) up to a task-dependent maximum number of newly generated tokens.
Algorithm~\ref{alg:inference} formalises the procedure.
\begin{algorithm}[h]
  \caption{Inference Procedure of CXR-VLM}
  \label{alg:inference}
  \begin{algorithmic}[1]
    \Require Chest radiograph $\mathbf{X}$, task $\mathcal{T} \in \{\textsc{findings}, \textsc{impression}, \textsc{vqa}\}$, optional textual side-input $\mathbf{u}$
    \Ensure Generated report $\mathbf{R}$
    \State $\mathbf{P} \gets \operatorname{RAD-DINO}(\mathbf{X})$ \Comment{$\mathbb{R}^{1369 \times 768}$, frozen}
    \State $\mathbf{c}_{\texttt{[CLS]}} \gets$ class-token of the same forward pass
    \State $\mathbf{Z} \gets \operatorname{MLP}(\mathbf{c}_{\texttt{[CLS]}}) \in \mathbb{R}^{14 \times 3}$
    \State $\hat{\mathbf{y}}_j \gets \operatorname{softmax}(\mathbf{Z}_j)$, \, $c_j \gets \arg\max_c \hat{\mathbf{y}}_{j,c}$
    \State $\textsc{pnu} \gets \operatorname{Serialise}(\{c_j\}_{j=1}^{14})$ \Comment{P/N/U text block}
    \State $\mathbf{V} \gets \operatorname{Projection}(\mathbf{P}) \in \mathbb{R}^{32 \times 4096}$
    \If{$\mathcal{T} = \textsc{findings}$}
      \State context $\gets$ \verb|""|
      \State instruction $\gets$ canonical findings instruction
    \ElsIf{$\mathcal{T} = \textsc{impression}$}
      \State context $\gets$ ``\texttt{Findings:}\ $\mathbf{u}$''
      \State instruction $\gets$ canonical impression instruction
    \ElsIf{$\mathcal{T} = \textsc{vqa}$}
      \State context $\gets$ \verb|""|
      \State instruction $\gets \mathbf{u}$ \Comment{the question itself}
    \EndIf
    \State $\mathbf{p} \gets \operatorname{BuildPrompt}(\textsc{system}, \textsc{pnu}, \text{context}, \text{instruction})$
    \State $\mathbf{e} \gets \operatorname{Tokenize}(\mathbf{p})$, then substitute $\mathbf{V}$ at the image-placeholder position
    \State $\mathbf{R} \gets \operatorname{Vicuna+LoRA}(\mathbf{e})$ via greedy decoding up to $L_\mathcal{T}$ new tokens
    \State \Return $\mathbf{R}$
  \end{algorithmic}
\end{algorithm}

The maximum number of newly generated tokens $L_\mathcal{T}$ is set to
300 for findings, 200 for impression, and 64 for VQA, matching the
99th-percentile reference length per task.
For the impression task, the textual side-input $\mathbf{u}$ is the
ground-truth findings paragraph in the experiments of
Chapter~\ref{ch:experiments}, mirroring the teacher-forced training
regime; an end-to-end cascade variant that feeds the model's own
generated findings as $\mathbf{u}$ is technically supported by the
codebase but not reported as a primary result.

% =============================================================================
\section{Evaluation Protocol}
\label{sec:eval}
% =============================================================================

The evaluation reports a metric suite tailored to each of the three
downstream tasks.
For findings and impression generation we report both
natural-language-generation (NLG) metrics --- lexical, fluency, and
embedding-based --- and the clinical-correctness metric that has become
the de facto standard for chest-X-ray report generation.
For VQA we report a parallel suite adapted to short-form answers,
optionally augmented with an LLM-as-judge evaluation for clinical
equivalence.
Table~\ref{tab:metrics} summarises the inventory.

\begin{table}[ht]
  \centering
  \small
  \caption{Evaluation metrics by task.}
  \label{tab:metrics}
  \begin{tabular}{lll}
    \toprule
    \textbf{Family} & \textbf{Metric} & \textbf{Tasks} \\
    \midrule
    Lexical n-gram     & BLEU-1, BLEU-4             & findings, impression, VQA \\
                       & ROUGE-1, ROUGE-2, ROUGE-L  & findings, impression \\
    Fluency / synonym  & METEOR                     & findings, impression, VQA \\
    Embedding semantic & BERTScore F1               & findings, impression, VQA \\
    Clinical accuracy  & CheXbert macro F1, P, R    & findings, impression \\
    Exact answer       & Exact match, token F1      & VQA \\
    Semantic judge     & GPT-4o-mini score (0--5)   & VQA (optional) \\
    \bottomrule
  \end{tabular}
\end{table}

\subsection{Held-Out Data and Decoding}

All metrics are computed on the patient-disjoint test split of
Section~\ref{sec:subset}: 5{,}000 studies whose subject IDs appear in
neither the train nor the validation split.
Greedy decoding is used throughout (Section~\ref{sec:inference}); the canonical instruction variant of each task's prompt template is used so that the reported numbers reflect the model's single-shot best guess and remain reproducible across runs. For findings and impression the classifier-predicted PNU block populates the prompt at test time; for the impression task, the ground-truth findings string populates the task-specific context slot as in training. For the IU-X-Ray out-of-distribution evaluation, the entire pipeline is applied without modification, with PNU labels predicted from images directly. \subsection{Natural Language Generation Metrics} \label{sec:nlg} \paragraph{BLEU.} Corpus-level BLEU-1 and BLEU-4~\citep{bleu} are computed with NLTK's smoothing method~1 to avoid zero-probability collapse on short references. BLEU is included primarily for comparability with the radiology-report-generation literature; its weak correlation with clinical correctness on this task is well documented, so we treat it as a fluency floor rather than a quality measure. \paragraph{ROUGE.} ROUGE-1, ROUGE-2, and ROUGE-L~\citep{rouge} F-measures are computed with Porter stemming enabled --- so that, for instance, \emph{enlarged} and \emph{enlarging} match. ROUGE-L in particular is the most commonly reported single number in prior work on this task (R2Gen, KGAE, RaDialog) and is included for that reason. \paragraph{METEOR.} METEOR~\citep{meteor} computes a weighted F-measure at the token level, with matches granted not only for exact tokens but also for stems and WordNet synonyms (e.g.\ \emph{cardiomegaly}~$\leftrightarrow$~\emph{enlarged heart}), and a chunk-fragmentation penalty that discourages matches scattered across the sentence. Of the three n-gram-style metrics, METEOR correlates best with human judgement on radiology reports, where paraphrasing is rampant; we report it alongside BLEU and ROUGE rather than as a replacement. 
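
As an illustrative calculation, assume the original METEOR
parameterisation, in which the harmonic mean weights recall nine times
more than precision and the fragmentation penalty is
$0.5\,(c/m)^{3}$ for $c$ chunks over $m$ matched unigrams; the
implementation behind the reported numbers may use different defaults.
Take the hypothetical pair (not drawn from the dataset) with hypothesis
\emph{the heart is enlarged} and reference
\emph{the heart size is enlarged}.
All four hypothesis tokens match exactly, so $m = 4$, $P = 4/4 = 1$ and
$R = 4/5 = 0.8$.
The matches form $c = 2$ chunks (\emph{the heart} and \emph{is enlarged}),
because \emph{size} separates them in the reference:
\[
  F_{\text{mean}} \;=\; \frac{10\,P\,R}{R + 9P} \;=\; \frac{8}{9.8} \;\approx\; 0.816,
  \qquad
  \text{Pen} \;=\; 0.5\left(\frac{2}{4}\right)^{3} \;=\; 0.0625,
\]
\[
  \text{METEOR} \;=\; F_{\text{mean}}\,(1 - \text{Pen}) \;\approx\; 0.816 \times 0.9375 \;\approx\; 0.765 .
\]
A single omitted word therefore costs both recall and contiguity, which
is the behaviour the chunk penalty is designed to capture.
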
\paragraph{BERTScore.}
BERTScore~\citep{bertscore} computes greedy-aligned cosine similarities
between the contextual embeddings of hypothesis and reference, then
aggregates to a precision, recall, and F1.
We report F1 with the \texttt{distilbert-base-uncased} backbone, chosen
for speed, and verify on a subset that switching to PubMedBERT
(\texttt{BiomedNLP-PubMedBERT-base-uncased-abstract}) preserves the
relative ordering of methods.
BERTScore captures semantic equivalence that the n-gram metrics miss but,
unlike the clinical-accuracy metric below, has no notion of clinical
\emph{correctness} --- it can reward a fluent paraphrase that flips a
positive finding to a negative one.

\subsection{Clinical Accuracy: CheXbert F1}
\label{sec:clinical-f1}

The clinical-accuracy metric is the macro-averaged F1 of CheXbert
labels~\citep{chexbert}, the standard for chest-X-ray report generation
since its introduction.
The CheXbert labeler is applied to both the generated and the reference
report, producing for each a 14-element abnormality vector whose entries
are positive ($1$), negative ($0$), uncertain ($-1$), or blank (not
mentioned).
Following the convention of prior work, we collapse every entry other
than $1$ to not-positive and compute, per pathology,
\begin{equation}
  \mathrm{F1}_j \;=\;
  \frac{2\,\mathrm{precision}_j\,\mathrm{recall}_j}{\mathrm{precision}_j + \mathrm{recall}_j},
\end{equation}
then macro-average across the 14 pathologies.
Precision and recall are reported alongside F1 to disambiguate the
trade-off when the overall numbers are close between methods.
Macro-averaging (rather than micro) is chosen because the CheXpert label
distribution is heavy-tailed and micro-averaging would let the most
prevalent labels dominate the score.
The CheXbert checkpoint is the public release from Stanford ML Group;
when it is not available on a given host the metric degrades gracefully
to a logged zero rather than failing the evaluation run.
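
The effect of the averaging choice can be seen in a small calculation
with hypothetical counts, chosen only for illustration and not taken
from the experiments.
Suppose a prevalent pathology has $(\mathrm{TP}, \mathrm{FP}, \mathrm{FN}) = (90, 10, 10)$
and a rare one has $(1, 1, 3)$.
The per-pathology scores are
\[
  \mathrm{F1}_{\text{prevalent}} \;=\; \frac{2 \cdot 0.9 \cdot 0.9}{0.9 + 0.9} \;=\; 0.900,
  \qquad
  \mathrm{F1}_{\text{rare}} \;=\; \frac{2 \cdot 0.5 \cdot 0.25}{0.5 + 0.25} \;\approx\; 0.333,
\]
so that
\[
  \mathrm{F1}_{\text{macro}} \;=\; \tfrac{1}{2}\,(0.900 + 0.333) \;\approx\; 0.617,
  \qquad
  \mathrm{F1}_{\text{micro}} \;=\; \frac{2 \cdot 91}{2 \cdot 91 + 11 + 13} \;=\; \frac{182}{206} \;\approx\; 0.883 .
\]
The micro average barely registers the poor performance on the rare
pathology, whereas the macro average is pulled down by it.
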
\subsection{Visual Question Answering Metrics}
\label{sec:vqa-metrics}

VQA targets are short --- often a single word or phrase --- so the suite
differs from generation.

\paragraph{Exact match.}
Hypothesis and reference are lower-cased, stripped of punctuation, and
whitespace-normalised, then compared for string equality.
Exact match is a lower bound on correctness; it is harsh on open-ended
phrasing (``mild pleural effusion'' vs ``small effusion'') but rewards
the closed-form yes/no and quantitative questions that dominate the
dataset.

\paragraph{Token F1.}
For each $(\text{hypothesis}, \text{reference})$ pair we compute the F1
between the bags of normalised tokens.
This is the standard short-answer metric inherited from SQuAD-style
extractive QA, and it is the most diagnostic single number for VQA on
this dataset.

\paragraph{BLEU-1, METEOR, BERTScore.}
Reported for symmetry with the generation tasks: BLEU-1 captures unigram
overlap, METEOR adds synonym matching, BERTScore F1 captures semantic
similarity.
BLEU-4 (rarely meaningful on short answers) and ROUGE-L (subsumed by
token F1) are omitted.

\paragraph{LLM-as-judge.}
For free-form answers where the metrics above misalign with clinical
equivalence (e.g.\ ``cardiomegaly'' vs ``enlarged cardiac silhouette''),
we additionally evaluate a sample of the test set with
GPT-4o-mini~\citep{openai-gpt4o} as a zero-shot judge.
Following G-Eval~\citep{geval}, the judge is prompted with the question,
the reference, and the prediction, and asked to return an integer score
on a $0$--$5$ rubric ($5$~=~clinical equivalence, $0$~=~contradiction)
together with a one-sentence rationale.
We report the mean score on the $0$--$5$ scale and its normalisation to
$[0, 1]$ for comparability with the other metrics.
The judge runs at temperature $0$ with \verb|response_format=json_object|
for deterministic parsing; the codebase supports Gemini and local Ollama
back-ends behind the same interface, so the evaluation can be rerun
without an OpenAI subscription.
To keep cost bounded, the judge is applied to at most $500$ randomly
sampled VQA items.

\subsection{Model Selection and Reporting}

Model selection during training uses the validation \verb|eval_loss| ---
the causal cross-entropy on assistant tokens, the same loss minimised in
Equation~\ref{eq:lm}.
We deliberately do \emph{not} use any of the downstream metrics for
checkpoint selection, to avoid the optimistic bias of optimising on the
same signal that is later reported.
CheXbert F1 in particular is expensive to compute and unstable across
small validation samples, which makes it unsuitable as a training-loop
criterion regardless.

Each evaluation run writes per-task prediction files containing the image
path, the reference, the generated text, and (for VQA) the question,
alongside an aggregated metric summary.
Confidence intervals are obtained by bootstrap resampling of the
per-sample metric values (1{,}000 resamples, 95\% percentile interval);
corpus-level BLEU and CheXbert F1, which are non-decomposable, are
reported as point estimates.
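
The percentile interval can be sketched as follows, assuming the reported
statistic is the mean of the per-sample scores; the symbols
$s_k$, $n$, $B$ and $\hat{\theta}^{*}_{b}$ are introduced here for
illustration only and do not refer to identifiers in the codebase.
Given per-sample scores $s_1, \dots, s_n$ and $B = 1000$ resamples, each
resample $b$ draws indices $i^{(b)}_1, \dots, i^{(b)}_n$ uniformly with
replacement from $\{1, \dots, n\}$ and yields
\[
  \hat{\theta}^{*}_{b} \;=\; \frac{1}{n} \sum_{k=1}^{n} s_{i^{(b)}_k},
  \qquad b = 1, \dots, B .
\]
Writing $\hat{\theta}^{*}_{(1)} \le \dots \le \hat{\theta}^{*}_{(B)}$ for
the sorted values, the 95\% percentile interval is approximately
\[
  \bigl[\,\hat{\theta}^{*}_{(25)},\; \hat{\theta}^{*}_{(975)}\,\bigr],
\]
with the exact order statistics depending on the interpolation
convention of the quantile routine.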