Title: Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach
URL Source: https://arxiv.org/html/2502.00577
License: CC BY 4.0 | arXiv:2502.00577v2 [cs.AI] | 25 May 2025

Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach

Authors: Changdae Oh, Zhen Fang, Shawn Im, Xuefeng Du, Yixuan Li

Abstract
Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application of MLLMs in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the characterization of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We then derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights.
Contact: Changdae Oh (changdae@cs.wisc.edu)
1 Introduction
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in handling complex tasks that require reasoning over both visual and textual modalities. By leveraging visual instruction tuning (Liu et al., 2023; Dai et al., 2023; Zhu et al., 2024), MLLMs have shown promise in answering open-ended questions and generating contextually relevant captions. As a critical aspect for real-world deployment, MLLMs are expected to operate robustly in the wild, where distribution shifts occur—that is, when the evaluation data deviates from the instruction tuning data, whether due to changes in visual inputs (e.g., domain-specific images), textual inputs (e.g., linguistic variations), or the combination thereof. However, negative reports on MLLM failures in edge cases have steadily emerged, raising concerns about their reliability.
For example, MLLMs struggle with queries in specialized domains such as medicine and chemistry (Zhang et al., 2024a; Han et al., 2024; Zhou et al., 2024), perform poorly on simple image classification tasks compared to open-ended question answering (Zhang et al., 2024b; Zhai et al., 2024), and exhibit hallucinations in response to biased queries (Li et al., 2023b; Ye-Bin et al., 2025). Given the increasing impact of MLLMs, it is crucial to understand their failure modes under distribution shifts. Despite the significance of the problem, existing works often lack a fine-grained diagnosis of the various factors of shift. More importantly, the absence of a formal framework to explain the underlying principle further hinders a systematic understanding of an MLLM's behavior. This motivates us to raise the research question:
Could we derive a theoretical framework to characterize MLLM’s behavior under distribution shifts?
To address this, we propose an information-theoretic framework that characterizes MLLM performance under distribution shifts with theoretical rigor and practical interpretability. Our framework is well-suited for analyzing instruction-tuned models, wherein the purpose of learning is naturally connected to the mutual information between the input query and response. In this framework, we introduce effective mutual information (EMI) as a principled measure to assess the relevance between an input query and model response. Intuitively, EMI is expected to be higher when a test input query originates from the in-distribution (ID) data, similar to those used during instruction tuning, compared to when the input comes from out-of-distribution (OOD). To quantify this performance gap, we compute the difference in EMI between ID and OOD and derive an upper bound for this difference, expressed in terms of distributional discrepancies in the input and output spaces (see Theorems 4.5 and 4.6). By grounding the measure in information theory, we provide the first theoretical framework to analyze and understand the impact of distribution shifts on MLLM performance.
Beyond the theoretical rigor, we further show that our framework holds practical value. In particular, we demonstrate that EMI is closely related to a widely used empirical metric for MLLMs, the relative preference score under the LLM-as-a-judge paradigm (Zheng et al., 2023). The relative preference score commonly relies on an external judge model, e.g., GPT-4o (Hurst et al., 2024), and thus lacks mathematical guarantees due to the black-box nature of these models. In contrast, EMI provides a theory-grounded measurement of the relevance between input queries and output responses of the MLLM being evaluated, offering a fundamental basis for assessing the MLLM's performance gap under shifts.
Finally, we conduct comprehensive validation of the theoretical framework and show that our theorems empirically hold on real-world benchmarks. Our experiments examine 34 synthetic and 27 natural distribution shift scenarios, resulting in a total of 61 ID-OOD evaluations for each MLLM. Results confirm strong correlations between empirical estimates of EMI and relative preference, as well as correlations between the EMI difference and its upper bounds, demonstrating the effectiveness of our framework in capturing performance gaps under diverse shifts. Our contributions can be summarized as follows:
• We propose a new framework, effective mutual information (EMI), to analyze MLLMs under distribution shifts, and justify the use of EMI by showing the theoretical connection between EMI and the LLM-judge-driven relative preference score.
• We derive theoretical upper bounds on the MLLM performance gap, which can be characterized by shifts over multimodal input queries and output discrepancies.
• We empirically verify our theoretical statements on 61 real-world distribution shift scenarios of open-ended question-answering benchmarks with six MLLMs.
2 Preliminary

Random variable and distribution.
Let $\mathcal{X} = \mathcal{X}_v \times \mathcal{X}_t$ denote the input space, where $\mathcal{X}_v$ and $\mathcal{X}_t$ correspond to the visual and textual feature spaces, respectively. Similarly, let $\mathcal{Y}$ denote the response space. We define the random variables $\mathbf{X} = (X_v, X_t) \in \mathcal{X}$ and $Y \in \mathcal{Y}$, where $\mathbf{X}$ is the sequence of tokens combining visual and text input queries, and $Y$ represents the associated response tokens. The joint distribution is denoted by $P_{\mathbf{X}Y}$, with marginals $P_{\mathbf{X}}$, $P_Y$, and conditional distribution $P_{Y|\mathbf{X}}$. In subsequent sections, $P_{\mathbf{X}Y}$ refers to the instruction tuning data distribution, which we consider as in-distribution (ID).
MLLM and visual instruction tuning.
An MLLM usually consists of three components: (1) a visual encoder, (2) a vision-to-language projector, and (3) an LLM backbone that processes a multimodal input sequence to generate a valid textual output $y$ in response to an input query $\mathbf{x}$. An MLLM can be regarded as modeling a conditional distribution $P_\theta(y|\mathbf{x})$, where $\theta$ denotes the model parameters. To attain multimodal conversation capability, MLLMs commonly undergo a so-called visual instruction tuning phase (Liu et al., 2023) with a conditional language modeling loss:
$$\arg\min_{\theta \in \Theta} \; \mathbb{E}_{\mathbf{x}, y \sim P_{\mathbf{X}Y}} \Big[ \sum_{l=1}^{L} -\log P_\theta(y_l \mid \mathbf{x}, y_{<l}) \Big], \tag{1}$$

where $L$ is the sequence length and $y = (y_1, \dots, y_L)$. After being trained with Eq. (1), the MLLM produces a response given a query for any possible task represented by text.
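As a concrete toy illustration, the inner sum of Eq. (1) is a summed token-level negative log-likelihood. The sketch below assumes the per-token log-probabilities are given directly rather than produced autoregressively by a real MLLM:

```python
import math

def instruction_tuning_loss(token_log_probs):
    """Summed token-level negative log-likelihood of one response.

    token_log_probs[l] stands in for log P_theta(y_l | x, y_<l), which a
    real MLLM would produce autoregressively; here the values are given.
    """
    return -sum(token_log_probs)

# Hypothetical per-token probabilities for a 3-token response.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]
loss = instruction_tuning_loss(log_probs)  # = -log(0.5 * 0.25 * 0.8)
```

Minimizing this loss over the instruction tuning distribution is exactly the objective in Eq. (1), averaged over query-response pairs.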
Evaluation of open-ended generations.
(M)LLM-as-a-judge method (Ouyang et al., 2022; Zheng et al., 2023; Kim et al., 2024) is commonly adopted to evaluate open-ended generation. In this paradigm, a judge model produces preference scores or rankings for the responses given a query, model responses, and a scoring rubric. Among the evaluation metrics, the relative preference score (RP score; Eq. (2)) is one of the most representative ones.
Definition 2.1 (Relative Preference Score).
Given a reward function $r: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, the relative preference (RP) score of model $P_\theta$ w.r.t. $P_{\mathbf{X}Y}$ is defined as follows:

$$\mathrm{RP}(P_{\mathbf{X}Y}; P_\theta) := \mathbb{E}_{\mathbf{x}, y \sim P_{\mathbf{X}Y},\; \hat{y} \sim P_\theta(\cdot|\mathbf{x})} \big[ r(\mathbf{x}, \hat{y}) / r(\mathbf{x}, y) \big]. \tag{2}$$

Here, the reward function $r$ can be instantiated by any capable MLLM, such as GPT-4o (Hurst et al., 2024), or by discriminative language models (Lambert et al., 2024), and we often take an output from another MLLM (usually more powerful than $P_\theta$) as the reference answer $y$.
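The RP score is straightforward to estimate from samples once a reward function is fixed. The sketch below is a minimal illustration with a hypothetical word-overlap reward; a real setup would use a judge model such as GPT-4o in place of `toy_reward`:

```python
def rp_score(samples, reward):
    """Empirical relative preference (Eq. 2): mean of r(x, y_hat) / r(x, y).

    `samples` holds (x, y_ref, y_model) triples; `reward` is any scoring
    function r(x, y) -> positive float (a judge model in practice).
    """
    return sum(reward(x, yh) / reward(x, y) for x, y, yh in samples) / len(samples)

# Hypothetical reward: 1 + token overlap with the query (illustration only).
def toy_reward(x, y):
    return 1.0 + len(set(x.split()) & set(y.split()))

data = [("describe the cat", "a cat on a mat", "the cat sits"),
        ("count the dogs", "two dogs", "three dogs run")]
score = rp_score(data, toy_reward)  # ratio of model reward to reference reward
```

A score above 1 means the model's responses are preferred over the references on average under this reward.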
3 Motivation

A systematic understanding of MLLMs under distributional shifts.
While instruction-following MLLMs are designed to handle a diverse range of tasks, they often struggle with specialized domains (Zhang et al., 2024a; Zhou et al., 2024), perform poorly on simple image classification tasks (Zhai et al., 2024; Zhang et al., 2024b), and hallucinate in response to biased queries (Li et al., 2023b; Ye-Bin et al., 2025). We argue that the fundamental cause of these failure modes in MLLMs can be traced back to distribution shifts. Specifically, poor performance on classification tasks and specialized distributions can be attributed to shifts between the instruction tuning distribution $P_{\mathbf{X}Y}$ and the evaluation distribution $Q_{\mathbf{X}Y}$. This work comprehensively analyzes three types of distributional shifts that can arise in MLLMs, as follows:
Figure 1: Performance variation under varying distribution shifts. We evaluated LLaVA v1.5 (top) and LLaVA NeXT (bottom) models on 27 out-of-distribution (OOD) variants of LLaVA-Bench COCO (ID). The x-axis is sorted by the severity of the shift between ID and OOD. There is a consistent trend: increased degrees of distribution shift result in performance degradation of the MLLM.

Figure 2: Types of distribution shifts between training and evaluation of MLLMs. We simulate visual, text, and joint shifts by controlling the shift of each input modality.
1. Visual shift: the marginal distribution of the visual query shifts, $D(P_{X_v} \| Q_{X_v}) \gg 0$, while that of the text query remains largely unchanged, $D(P_{X_t} \| Q_{X_t}) \approx 0$.
2. Text shift: the marginal distribution of the text query shifts, $D(P_{X_t} \| Q_{X_t}) \gg 0$, while that of the visual query remains largely unchanged, $D(P_{X_v} \| Q_{X_v}) \approx 0$.
3. Joint shift: both visual and text queries shift simultaneously, and the relationship between visual and text queries may also shift, $D(P_{\mathbf{X}} \| Q_{\mathbf{X}}) \gg 0$,
where $D$ denotes a divergence that measures the discrepancy between distributions $P$ and $Q$. With $M = (P + Q)/2$, the Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence are given by:

$$D_{\mathrm{KL}}(P \| Q) := \mathbb{E}_{\mathbf{z} \sim P}\big[\log P(\mathbf{z})/Q(\mathbf{z})\big], \qquad D_{\mathrm{JS}}(P \| Q) := \big[D_{\mathrm{KL}}(P \| M) + D_{\mathrm{KL}}(Q \| M)\big]/2.$$
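For discrete distributions these divergences can be computed directly from their definitions; a minimal sketch:

```python
import math

def kl(p, q):
    """D_KL(P || Q) for discrete distributions given as aligned lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """D_JS(P || Q) = [D_KL(P || M) + D_KL(Q || M)] / 2 with M = (P + Q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
d = js(p, q)
assert 0.0 <= d <= math.log(2)  # JS (in nats) is bounded by log 2
```

Unlike KL, the JS divergence is symmetric and bounded, which is one reason it appears in the upper bounds derived later in the paper.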
Pilot study.
We hypothesize that: (1) performance degradation in MLLMs becomes more severe as $Q_{\mathbf{X}Y}$ deviates further from $P_{\mathbf{X}Y}$; (2) the total performance degradation can be factored into visual query shift and text query shift. To test these hypotheses, we design three types of shifts (visual shift, text shift, and joint shift), illustrated in Figure 2, and evaluate MLLMs under these shifts.
Specifically, we adopt LLaVA-1.5 (Liu et al., 2023) and LLaVA-NeXT (Liu et al., 2024a) in 7B and 13B sizes as our target MLLMs, with LLaVA-Bench COCO (Liu et al., 2023) serving as the ID dataset, which is distributionally similar to the instruction tuning data. We adopt LLaVA-Bench Wild (Liu et al., 2023) to vary visual input semantics, and we apply language translation with GPT-4, from English to {German, Chinese, Korean, Greek, Arabic, Hindi}, to realize shifts in the text query. We vary the severity of shifts by controlling the magnitude of perturbations in the synthetic shift setup and by partitioning a dataset based on the mean embedding distance from ID samples in the natural shift setup. Following Liu et al. (2023), we evaluate performance using the RP score (Eq. (2)) with a GPT-4 judge.
Figure 1 shows the performance variations of MLLMs under different types and magnitudes of distribution shifts, where the x-axis is sorted by shift severity (more results for different types of shifts can be found in Appendix C). Across all models, a consistent trend emerges: as the severity of the shift increases, the performance degradation becomes more significant. This trend holds robustly for both visual and text shifts. Joint shifts result in greater performance degradation, suggesting a complementary effect of shifts across modalities. These consistent observations suggest an underlying principle relating performance variation to distributional discrepancy, which motivates us to investigate the theoretical model behind these empirical results.
Our position.
Although there have been similar observations of MLLM performance degradation under distribution shifts (Achiam et al., 2023; Zhang et al., 2024a; Zhou et al., 2024; Zhang et al., 2024b), all of them present only coarse empirical evaluation results without finer analysis of the underlying factors of those degradations. To the best of our knowledge, there is no formal framework to explain the performance variations of MLLMs in terms of distribution shifts, despite its crucial importance for ensuring reliable applications of MLLMs. To bridge this gap, we propose the first theoretical framework that characterizes MLLM performance variations under distribution shifts from an information-theoretic perspective.
4 Information-theoretic Analysis on MLLM Performance Gap
In this section, we start by introducing mutual information (MI) and its limitations as a metric in Sec. 4.1, and present a new metric for MLLM evaluation (Sec. 4.2). We then derive theorems based on it to characterize the MLLM performance gap under distribution shifts (Sec. 4.3).
4.1 Mutual Information for MLLM
A fundamental capability of MLLMs is their instruction-following property (Ouyang et al., 2022)—a direct outcome of instruction-tuning, where the model is trained to generate responses that are aligned with the intent of a given input query or instruction. To evaluate instruction-following capability, we first consider the mutual information (Shannon, 1948) to measure the shared information between the query and the corresponding model response.
Definition 4.1 (Mutual Information (MI)).
For a joint distribution $P_{\mathbf{X}Y}$ over $\mathcal{X} \times \mathcal{Y}$, the mutual information with respect to $P_{\mathbf{X}Y}$ is defined as

$$I(P_{\mathbf{X}Y}) := \mathbb{E}_{\mathbf{x}, y \sim P_{\mathbf{X}Y}} \left[ \log \frac{P_{\mathbf{X}Y}(\mathbf{x}, y)}{P_{\mathbf{X}}(\mathbf{x}) P_Y(y)} \right]. \tag{3}$$

MI is deeply related to entropy, which is defined as $H(P_{\mathbf{X}}) := -\mathbb{E}_{\mathbf{x} \sim P_{\mathbf{X}}}[\log P_{\mathbf{X}}(\mathbf{x})]$. It is easy to check that $I(P_{\mathbf{X}Y}) = H(P_Y) - \mathbb{E}_{\mathbf{x} \sim P_{\mathbf{X}}}[H(P_{Y|\mathbf{X}=\mathbf{x}})]$. Intuitively, MI captures how much the response tells us about the query and vice versa. One reason for considering MI for model evaluation is that the conditional language modeling objective in Eq. (1) is closely connected to MI, as shown in Eq. (4).
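The entropy identity above can be checked numerically on a small discrete joint; the sketch below uses a toy probability table, not the neural estimation used later in the paper:

```python
import math

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

def mutual_information(joint):
    """I(P_XY) per Eq. (3), for a joint given as a |X| x |Y| probability table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(v * math.log(v / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, v in enumerate(row) if v > 0)

joint = [[0.3, 0.1],
         [0.1, 0.5]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
# Check the identity I(P_XY) = H(P_Y) - E_x[H(P_{Y|X=x})]:
cond = sum(px[i] * entropy([v / px[i] for v in row]) for i, row in enumerate(joint))
assert abs(mutual_information(joint) - (entropy(py) - cond)) < 1e-12
```

The second term of the identity is the average uncertainty left in the response once the query is known, which is exactly what instruction tuning drives down.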
Information-theoretic interpretation of instruction tuning.
In Eq. (4), we show that the instruction tuning objective (Eq. 1) forms a lower bound on the MI between $\mathbf{X}$ and $Y$ minus the entropy of $Y$, given a sufficiently large representation capacity of the model, i.e., small $\delta$. As we are interested in measuring the input-output dependency, we focus on the $I(P_{\mathbf{X}Y})$ term as a metric that effectively gauges the upper bound of the visual instruction tuning objective, Eq. (1).

$$I(P_{\mathbf{X}Y}) - H(P_Y) = \mathbb{E}_{\mathbf{x}, y \sim P_{\mathbf{X}Y}}\big[\log P_\theta(y|\mathbf{x})\big] + \delta \;\geq\; \mathbb{E}_{\mathbf{x}, y \sim P_{\mathbf{X}Y}}\big[\log P_\theta(y|\mathbf{x})\big], \tag{4}$$

where $\delta = \mathbb{E}_{\mathbf{x} \sim P_{\mathbf{X}}}\big[D_{\mathrm{KL}}(P_{Y|\mathbf{X}=\mathbf{x}} \| P_\theta(\cdot|\mathbf{x}))\big] \geq 0$.
Going from instruction tuning to test-time MI.
While $I(P_{\mathbf{X}Y})$ measures the MI between the input query and the ground-truth response from $P_{\mathbf{X}Y}$, one may be interested in measuring MI between the query and the model response on inference-time distributions. We use a tensor product $P_{\mathbf{X}} \otimes P_\theta$ to denote the joint distribution between the input distribution $P_{\mathbf{X}}$ and the model output distribution $P_\theta(y|\mathbf{x})$:

$$(P_{\mathbf{X}} \otimes P_\theta)(\mathbf{x}, y) := P_{\mathbf{X}}(\mathbf{x})\, P_\theta(y|\mathbf{x}), \quad \forall (\mathbf{x}, y) \in \mathcal{X} \times \mathcal{Y}. \tag{5}$$

Accordingly, the mutual information w.r.t. the joint distribution $P_{\mathbf{X}} \otimes P_\theta$ can be written as:

$$I(P_{\mathbf{X}} \otimes P_\theta) = H\big(\mathbb{E}_{\mathbf{x} \sim P_{\mathbf{X}}}[P_\theta(\cdot|\mathbf{x})]\big) - \mathbb{E}_{\mathbf{x} \sim P_{\mathbf{X}}}\big[H(P_\theta(\cdot|\mathbf{x}))\big]. \tag{6}$$

Limitation of test-time MI under distribution shifts.
Although one could directly use $I(P_{\mathbf{X}} \otimes P_\theta)$, the mutual information between the input query and the model response, as a metric, the vanilla MI may not be suitable for scenarios involving distribution shifts. For example, consider the distribution $P_{\mathbf{X}}$ (e.g., general domain) and the distribution $Q_{\mathbf{X}}$ (e.g., medical domain). Suppose the MI of model $P_\theta$ on $P_{\mathbf{X}}$ is $I(P_{\mathbf{X}} \otimes P_\theta) = 2.0$, while on $Q_{\mathbf{X}}$ it is $I(Q_{\mathbf{X}} \otimes P_\theta) = 1.0$. Does this imply that model $P_\theta$ performs twice as poorly on $Q_{\mathbf{X}}$? The answer is unclear.
The challenge lies in the inherent variability of MI scales across data domains. Recalling the formulation of $I(P_{\mathbf{X}} \otimes P_\theta)$ in Eq. (6): the first term $H(\mathbb{E}_{\mathbf{x} \sim P_{\mathbf{X}}}[P_\theta(\cdot|\mathbf{x})])$ represents the upper bound of the MI and varies with the data domain, while the second term $\mathbb{E}_{\mathbf{x} \sim P_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))]$ reflects the input-output dependency and depends on the true data-generating process. For instance, given a fixed vocabulary, responses from an MLLM could contain more diverse words in a general domain, whereas a narrower subset of words is expected in specialized domains such as medicine. Both terms would then be larger in the general domain. Therefore, we argue that a desired evaluation metric should disentangle the pure input-response relevance from the intrinsic characteristics of the dataset.
4.2 Effective Mutual Information for Reliable MLLM Evaluation
To remove the influence of the domain-dependent scale, we propose effective mutual information (EMI) as a remedy.
Definition 4.2 (Effective Mutual Information (EMI)).
Given a joint distribution $P_{\mathbf{X}Y}$ and an MLLM $P_\theta$ parameterized by $\theta$, the effective mutual information between the input and model response is defined as

$$\mathrm{EMI}(P_{\mathbf{X}Y}; P_\theta) := I(P_{\mathbf{X}} \otimes P_\theta) - I(P_{\mathbf{X}Y}). \tag{7}$$
Compared to the standard MI, $\mathrm{EMI}(P_{\mathbf{X}Y}; P_\theta)$ measures the "effective" relevance between the query $\mathbf{x}$ and the model response $\hat{y}$ by subtracting the ground-truth MI $I(P_{\mathbf{X}Y})$ from $I(P_{\mathbf{X}} \otimes P_\theta)$. Refer to Figure 5 in Appendix A for an intuitive example: by accounting for a baseline level of MI, EMI quantifies the extent to which the model captures the effective relevance between the input and output. The use of EMI as an evaluation metric for MLLMs is further supported by (1) its analogy to excess risk and effective robustness, and (2) its connection to an LLM-judge score.
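For small discrete distributions, EMI in Eq. (7) can be computed in closed form; the sketch below uses toy probability tables, not the neural estimators used later in the paper:

```python
import math

def mi_from_joint(joint):
    """Mutual information of a discrete joint given as a |X| x |Y| table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(v * math.log(v / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, v in enumerate(row) if v > 0)

def emi(joint, p_theta):
    """EMI (Eq. 7): I(P_X tensor P_theta) - I(P_XY), where p_theta[i] is the
    model's response distribution for input i."""
    px = [sum(row) for row in joint]
    model_joint = [[px[i] * q for q in p_theta[i]] for i in range(len(px))]
    return mi_from_joint(model_joint) - mi_from_joint(joint)

joint = [[0.4, 0.1],
         [0.1, 0.4]]                      # ground-truth P_XY
perfect = [[0.8, 0.2], [0.2, 0.8]]       # model matching P_{Y|X} exactly
uniform = [[0.5, 0.5], [0.5, 0.5]]       # model ignoring the input
assert abs(emi(joint, perfect)) < 1e-12  # EMI = 0 for a perfect model
assert emi(joint, uniform) < 0           # weaker dependence -> negative EMI
```

Subtracting the baseline $I(P_{\mathbf{X}Y})$ makes scores comparable across domains with different intrinsic MI scales, which is the point of the definition.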
Analogy to the excess risk and effective robustness.
The minimum achievable error varies depending on the data-generating process. To enable reliable model selection that is agnostic to data distributions, excess risk—defined as the difference between a model’s risk and the minimum possible risk—has been extensively studied (Castro & Nowak, 2008; Koltchinskii, 2010; Mohri, 2018). More recently, Taori et al. (2020) introduced the concept of effective robustness to quantify the “effective” OOD generalization accuracy of classification models by subtracting their ID accuracy. The motivation behind EMI aligns with these concepts, i.e., mitigating the influence of external confounding effects that hinder the accurate measure of model performance. EMI ensures that the metric focuses on the model’s effective ability to capture input-output relevance, independent of confounding effects from the data domain.
Connection to LLM-judge score.
We also show that EMI is closely related to an LLM-judge-based metric, the RP score in Eq. (2), a common metric used to assess MLLM outputs. Conceptually, EMI quantifies the effective relevance between the input query and the model's response by accounting for the baseline mutual information, whereas the RP score measures the relative preference of the model's responses over a reference response. More formally, their connection can be mathematically established through the lens of a logit Bradley-Terry preference model (PM) formulation (Bradley & Terry, 1952; Hunter, 2004), $\operatorname{logit} P(\hat{y} \succ y \mid \mathbf{x})$, an alternative formulation of relative preference. Specifically, $\log P(\hat{y} \succ y \mid \mathbf{x})$ is commonly used to train a reward model (RM), which is then adopted to compute LLM-judge scores such as Eq. (2). We compare both terms below.
$$\mathrm{PM}(P_{\mathbf{X}Y}; P_\theta) := \mathbb{E}_{\mathbf{x}, y \sim P_{\mathbf{X}Y},\; \hat{y} \sim P_\theta(\cdot|\mathbf{x})}\big[\operatorname{logit} P(\hat{y} \succ y \mid \mathbf{x})\big] = \mathbb{E}_{\mathbf{x}, y \sim P_{\mathbf{X}Y},\; \hat{y} \sim P_\theta(\cdot|\mathbf{x})}\big[r(\mathbf{x}, \hat{y}) - r(\mathbf{x}, y)\big],$$

$$\mathrm{RM}(P_{\mathbf{X}Y}; P_\theta) := \mathbb{E}_{\mathbf{x}, y \sim P_{\mathbf{X}Y},\; \hat{y} \sim P_\theta(\cdot|\mathbf{x})}\big[\log P(\hat{y} \succ y \mid \mathbf{x})\big] = \mathbb{E}_{\mathbf{x}, y \sim P_{\mathbf{X}Y},\; \hat{y} \sim P_\theta(\cdot|\mathbf{x})}\big[\log \sigma\big(r(\mathbf{x}, \hat{y}) - r(\mathbf{x}, y)\big)\big],$$
where $r(\cdot, \cdot)$ is the latent score function, the so-called reward model, that generates preferences for $(\mathbf{x}, y)$. It is clear that

$$\mathrm{PM}(P_{\mathbf{X}Y}; P_\theta) = \mathrm{RM}(P_{\mathbf{X}Y}; P_\theta) - \log\big(1 - e^{\mathrm{RM}(P_{\mathbf{X}Y}; P_\theta)}\big), \qquad \mathrm{RM}(P_{\mathbf{X}Y}; P_\theta) = \mathrm{PM}(P_{\mathbf{X}Y}; P_\theta) - \log\big(1 + e^{\mathrm{PM}(P_{\mathbf{X}Y}; P_\theta)}\big).$$
Therefore, $\mathrm{PM}(P_{\mathbf{X}Y}; P_\theta)$ and $\mathrm{RM}(P_{\mathbf{X}Y}; P_\theta)$ exhibit a mutual equivalence: an increase in $\mathrm{PM}(P_{\mathbf{X}Y}; P_\theta)$ corresponds to an increase in $\mathrm{RM}(P_{\mathbf{X}Y}; P_\theta)$, and vice versa. In Lemma 4.3, we establish an upper bound for the absolute difference between EMI and $\mathrm{PM}(P_{\mathbf{X}Y}; P_\theta)$, thereby demonstrating their closeness and ultimately highlighting the connection between EMI and LLM-judge scores such as Eq. (2).
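The two identities relating PM and RM hold pointwise for any reward gap $d = r(\mathbf{x}, \hat{y}) - r(\mathbf{x}, y)$ under the Bradley-Terry model $P(\hat{y} \succ y \mid \mathbf{x}) = \sigma(d)$; a quick numerical check:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for d in [-2.0, -0.5, 0.3, 1.7]:       # hypothetical reward gaps
    pm = d                              # logit P(y_hat > y | x) = d
    rm = math.log(sigmoid(d))           # log   P(y_hat > y | x)
    # RM = PM - log(1 + e^PM) and PM = RM - log(1 - e^RM):
    assert abs(rm - (pm - math.log(1.0 + math.exp(pm)))) < 1e-9
    assert abs(pm - (rm - math.log(1.0 - math.exp(rm)))) < 1e-9
```

Both directions follow from $\log \sigma(d) = d - \log(1 + e^d)$ and $\operatorname{logit} p = \log p - \log(1 - p)$, which is why the two formulations are monotonically equivalent.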
Lemma 4.3.
Given a distribution $P_{\mathbf{X}Y}$ and an MLLM $P_\theta$, if $\mathbb{E}_{\mathbf{x} \sim P_{\mathbf{X}}}\big[D_{\mathrm{KL}}(P_\theta(\cdot|\mathbf{x}) \| P_{Y|\mathbf{X}=\mathbf{x}})\big] \leq \delta$, and the reward function is $r(\mathbf{x}, y) = \log P_{Y|\mathbf{X}=\mathbf{x}}(y)$, then

$$\big|\mathrm{EMI}(P_{\mathbf{X}Y}; P_\theta) - \mathrm{PM}(P_{\mathbf{X}Y}; P_\theta)\big| \leq \delta + 4.4\,\delta^{1/8}.$$
Intuitively, Lemma 4.3 shows that if the MLLM $P_\theta$ can approximate the given distribution $P_{\mathbf{X}Y}$ with approximation error $\delta$, the difference between EMI and PM is bounded by a small term in $\delta$. Furthermore, assuming that the model class $\{P_\theta : \theta \in \Theta\}$ has sufficient expressive power (i.e., Eq. (8)), we can derive an additional bound for the optimal solution of the autoregressive objective (Eq. (1)), as shown below.
Theorem 4.4.
Given a distribution $P_{\mathbf{X}Y}$ with $P_{\mathbf{X}Y} \geq c > 0$ for some constant $c$, or $P_{Y|\mathbf{X}} \geq c$, if the $\epsilon$-representation capacity assumption holds, i.e.,

$$\min_{\theta \in \Theta} \mathbb{E}_{\mathbf{x} \sim P_{\mathbf{X}}}\big[D_{\mathrm{KL}}(P_{Y|\mathbf{X}=\mathbf{x}} \| P_\theta(\cdot|\mathbf{x}))\big] \leq \epsilon, \tag{8}$$

and the reward function is $r(\mathbf{x}, y) = \log P_{Y|\mathbf{X}=\mathbf{x}}(y)$, then

$$\big|\mathrm{EMI}(P_{\mathbf{X}Y}; P_{\theta^*}) - \mathrm{PM}(P_{\mathbf{X}Y}; P_{\theta^*})\big| \leq \delta + 4.4\,\delta^{1/8},$$

where $\theta^*$ is the optimal solution of Eq. (1) over $P_{\mathbf{X}Y}$, and $\delta = 4.4\,\epsilon^{1/8} - \log c\,\sqrt{2\epsilon}$.
Theorem 4.4 shows that with a sufficiently expressive model class, EMI exhibits a stronger alignment with PM when the optimal MLLM parameter 𝜃 ∗ is obtained through the autoregressive objective. This alignment underscores the validity of using EMI as a reliable metric for evaluating MLLM and quantifying the relative preference of responses.
Although we confine our analysis to $I(P_{\mathbf{X}Y})$ and $\mathrm{EMI}(P_{\mathbf{X}Y}; P_\theta)$, the chain rule of MI, i.e., $I(P_{\mathbf{X}Y}) = I(P_{X_v Y}) + I(P_{X_t Y | X_v})$, allows us to further factorize the query-response relevance into the two input modalities, which is suitable for fine-grained evaluation of multimodal LLMs.
4.3 Characterizing MLLM Performance Gap via Effective Mutual Information Difference
Now, based on EMI, we are ready to establish formal guarantees on the performance gap of MLLMs via the effective mutual information difference (EMID). EMID is defined as the difference between the EMI on the ID distribution $P_{\mathbf{X}Y}$ and on the OOD distribution $Q_{\mathbf{X}Y}$:

$$\mathrm{EMID}(P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}; P_\theta) := \mathrm{EMI}(P_{\mathbf{X}Y}; P_\theta) - \mathrm{EMI}(Q_{\mathbf{X}Y}; P_\theta). \tag{9}$$
To elucidate the key insight and provide a clear foundation, we begin by analyzing a simple scenario where the conditional variables remain consistent across both ID and OOD distributions. In this case, we can derive an upper bound for EMID, as stated in Theorem 4.5. This bound enables us to characterize the maximum performance gap of MLLM over two distributions by measuring the severity of the marginal distribution shift over visual and language modalities.
Theorem 4.5 (Simplified Scenario).
Given an MLLM $P_\theta$ and distributions $P_{\mathbf{X}Y}$, $Q_{\mathbf{X}Y}$ with consistent conditional distributions over the variables $X_v | X_t$, $X_t | X_v$, and $Y | \mathbf{X}$, denote $P_Y^\theta := \mathbb{E}_{P_{\mathbf{X}}}[P_\theta(\cdot|\mathbf{x})]$ and $Q_Y^\theta := \mathbb{E}_{Q_{\mathbf{X}}}[P_\theta(\cdot|\mathbf{x})]$. If there exist constants $\delta_P$ and $\delta_Q$ such that

$$D_{\mathrm{JS}}(P_Y^\theta \| P_Y) \leq \delta_P, \quad D_{\mathrm{JS}}(Q_Y^\theta \| Q_Y) \leq \delta_Q, \quad \Delta := \delta_P + \delta_Q,$$

then $\mathrm{EMID}(P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}; P_\theta)$ is upper bounded by

$$\hat{H}\big(D_{\mathrm{JS}}^{1/2}(P_{X_v} \| Q_{X_v}) + D_{\mathrm{JS}}^{1/2}(P_{X_t} \| Q_{X_t})\big) + 8\Delta^{1/4}, \tag{10}$$

where $\hat{H} := \max_{\mathbf{x} \in \mathcal{X}}\big[H(Q_{Y|\mathbf{X}=\mathbf{x}}) + H(P_\theta(\cdot|\mathbf{x}))\big]$.
Figure 3: Scatter plot with regression line between empirical estimates of EMID and its upper bound. Over the 34 synthetic and 27 natural distribution shift scenarios, we evaluate four MLLMs, yielding 136 synthetic-shift and 108 natural-shift cases for visualizing EMID and its scale-adjusted upper bound estimates (see Appendix B for details). The two panels on the left show results for all four models, whereas the right ones distinguish them per model with fitted linear regression coefficients (Slope).

Implication. Theorem 4.5 implies that in the simplified scenario, EMID depends on two main factors: (1) the divergence between the marginal distributions of the visual and textual inputs; (2) the divergence between the model's predictions and the true output distributions, encapsulated by $\delta_P$ and $\delta_Q$.
Theorem 4.5 naturally captures special cases such as visual-only or text-only input shifts. For a visual-only shift, where $D_{\mathrm{JS}}(P_{X_t} \| Q_{X_t}) = 0$, the EMID upper bound depends primarily on the divergence between the visual input distributions, plus the output discrepancy terms. Similarly, for a text-only shift, the bound reflects the divergence in the textual input distributions, plus the output discrepancy terms. These cases not only underscore the flexibility of Theorem 4.5 in isolating the impact of modality-specific distribution shifts on model performance but also highlight the importance of visual and text input shifts. In Appendix D, we provide a looser yet more interpretable version of this upper bound (Corollary D.12) by replacing $\Delta$ with discrepancy terms between the model output and the ground-truth conditional distributions.
General scenario.
Moving beyond the simplified scenario, we now consider the general scenario in which no assumptions are made about the consistency of conditional distributions across ID and OOD settings. This more realistic scenario accommodates shifts not only in the marginal distributions of visual and textual inputs but also in their conditional dependencies and the relationships between inputs and outputs. By relaxing these constraints, we aim to capture the full complexity of distributional shifts encountered in practice and analyze how such shifts collectively influence the performance gap of MLLMs. The formal upper bound is provided in Theorem 4.6.
Theorem 4.6 (General Scenario).
Given distributions $P_{\mathbf{X}Y}$ and $Q_{\mathbf{X}Y}$ and an MLLM $P_\theta$, denote $P_Y^\theta := \mathbb{E}_{P_{\mathbf{X}}}[P_\theta(\cdot|\mathbf{x})]$ and $Q_Y^\theta := \mathbb{E}_{Q_{\mathbf{X}}}[P_\theta(\cdot|\mathbf{x})]$. If there exist constants $\delta_P$ and $\delta_Q$ such that

$$D_{\mathrm{JS}}(P_Y^\theta \| P_Y) \leq \delta_P, \quad D_{\mathrm{JS}}(Q_Y^\theta \| Q_Y) \leq \delta_Q, \quad \Delta := \delta_P + \delta_Q,$$

then $\mathrm{EMID}(P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}; P_\theta)$ is upper bounded by

$$\hat{H}\big(D_{\mathrm{JS}}^{1/2}(P_{X_v} \| Q_{X_v}) + D_{\mathrm{JS}}^{1/2}(P_{X_t} \| Q_{X_t})\big) + \hat{H}\big(\bar{D}_{\mathrm{JS}}^{1/2}(P_{X_t|X_v} \| Q_{X_t|X_v}) + \bar{D}_{\mathrm{JS}}^{1/2}(P_{X_v|X_t} \| Q_{X_v|X_t})\big) + 4\,\mathbb{E}_{\mathbf{x} \sim P_{\mathbf{X}}}\big[D_{\mathrm{JS}}^{1/4}(P_{Y|\mathbf{X}=\mathbf{x}} \| Q_{Y|\mathbf{X}=\mathbf{x}})\big] + 8\Delta^{1/4},$$

where $\hat{H} := \max_{\mathbf{x} \in \mathcal{X}}\big[H(Q_{Y|\mathbf{X}=\mathbf{x}}) + H(P_\theta(\cdot|\mathbf{x}))\big]$ and

$$\bar{D}_{\mathrm{JS}}(P_{X|X'} \| Q_{X|X'}) := \mathbb{E}_{\mathbf{x} \sim P_{X'}}\big[D_{\mathrm{JS}}(P_{X|X'=\mathbf{x}} \| Q_{X|X'=\mathbf{x}})\big] + \mathbb{E}_{\mathbf{x} \sim Q_{X'}}\big[D_{\mathrm{JS}}(P_{X|X'=\mathbf{x}} \| Q_{X|X'=\mathbf{x}})\big].$$
Implication. Compared to Theorem 4.5, Theorem 4.6 indicates that, in the general case, EMID is also influenced by divergences in conditional distributions. Specifically, EMID is upper bounded by marginal distribution shifts in visual and textual inputs ( 𝑋 𝑣 and 𝑋 𝑡 ); divergence between marginal output and model response distributions; shifts in conditional dependencies ( 𝑋 𝑣 | 𝑋 𝑡 and 𝑋 𝑡 | 𝑋 𝑣 ); and a shift between conditional output distributions ( 𝑌 | 𝐗 ).
Theorem 4.6 holds for broader cases, whereas Theorem 4.5 is much simpler to analyze; thus, we focus on validating Theorem 4.5 in the following section. With some knowledge of the data-generating processes of $P_{\mathbf{X}Y}$ and $Q_{\mathbf{X}Y}$, one can choose the theorem suitable for the given distributions. Although the bounds in both theorems are not tight, they help us understand the sources of the MLLM performance gap. That is, both Theorems 4.5 and 4.6 provide an analytic tool to characterize the performance gap of MLLMs, representing the first formal framework for evaluating MLLMs under distribution shifts.
5 Empirical Validation on Real Benchmark

Setup.
As in the pilot experiments (Figures 1 and 6), we mainly use LLaVA v1.5 (Liu et al., 2024a) and LLaVA NeXT (Liu et al., 2024b) in 7B and 13B sizes and evaluate them on the LLaVA-Bench COCO and LLaVA-Bench Wild (Liu et al., 2023) datasets to assess open-ended generation quality. We also consider two advanced MLLMs, Qwen2.5-VL-7B (Bai et al., 2025) and InternVL2.5-7B (Chen et al., 2024), and a domain-specialized open-ended question answering dataset, LLaVA-Med (Li et al., 2024), to explore the broad applicability of our framework. For a comprehensive examination of diverse types of shifts, we simulate both synthetic and natural distribution shifts. For synthetic shifts, we consider 7 visual scenarios (1 ID case + 2 synthetic perturbation types at 3 severity levels) and 5 text scenarios (1 ID case + 2 synthetic perturbation types at 2 severity levels), resulting in 7 × 5 = 35 scenarios, of which 1 is ID and the other 34 are OOD. For natural shifts, we use 4 visual scenarios (1 ID + 3 OOD difficulty levels) and 7 text scenarios (1 ID English + 6 different languages), yielding 4 × 7 = 28 scenarios. This comprehensive design covers 34 synthetic and 27 natural shifts, spanning 61 shift scenarios in total. A summary of the OOD construction strategies is in Table 1.
Table 1: Summary of distribution shift scenarios.

| Type | Strategy (# of categories) |
| --- | --- |
| Synthetic visual shift (COCO images) | Perturbation (2): defocus blur, frost; Severity (3): weak, normal, strong |
| Synthetic text shift | Perturbation (2): typo, word replacement; Severity (2): weak, strong |
| Natural visual shift (Wild images) | LLaVA-Bench Wild or LLaVA-Med; Split (3): easy, normal, hard |
| Natural text shift | Translation (6): GE, CH, KO, EL, AR, HI |

Estimation of MI and JSD.
For the empirical realization of our theoretical statements, we adopt a popular neural MI estimator, CLUB (Cheng et al., 2020), to compute empirical EMI and EMID, and a JS divergence estimator, RJSD (Hoyos-Osorio & Sanchez-Giraldo, 2023), to compute empirical estimates of the EMID upper bound, on top of embeddings of 𝐗, 𝑌, and Ŷ extracted from CLIP-ViT-B/32 (Radford et al., 2021) and XLM-RoBERTa-base (Conneau, 2019) (see Appendix B for details). Experiments with alternative implementations based on other estimators yielded consistent conclusions (see Tables 7 and 8).
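CLUB and RJSD are learned neural estimators and are not reproduced here. As a hedged illustration of what "MI between embedding vectors" means, the sketch below uses a closed-form estimate under a joint-Gaussian assumption on synthetic data; this is a simplified stand-in, not the paper's CLUB estimator.

```python
import numpy as np

def gaussian_mi(x, y, eps=1e-6):
    """MI estimate I(X;Y) assuming (X, Y) is jointly Gaussian:
    I = 0.5 * (logdet Cov(X) + logdet Cov(Y) - logdet Cov([X, Y]))."""
    x = x - x.mean(0)
    y = y - y.mean(0)
    xy = np.concatenate([x, y], axis=1)

    def logdet(z):
        cov = np.cov(z, rowvar=False) + eps * np.eye(z.shape[1])
        return np.linalg.slogdet(cov)[1]

    return 0.5 * (logdet(x) + logdet(y) - logdet(xy))

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 4))                                    # toy "query" embeddings
y_dep = x @ rng.normal(size=(4, 4)) + 0.5 * rng.normal(size=(2000, 4))  # relevant responses
y_ind = rng.normal(size=(2000, 4))                                # irrelevant responses
print(gaussian_mi(x, y_dep), gaussian_mi(x, y_ind))  # dependent pair carries far more MI
```

A query-response pair that actually depends on the query yields a much larger estimate than an independent one, which is the property EMI builds on.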
Correlation between RP score and EMI.
We first conduct Spearman correlation and Kendall's tau analyses between the RP score and our empirical estimates of EMI. Table 2 shows that the EMI estimates exhibit a strong correlation with the RP score, in terms of both the absolute coefficient and the 𝑝-value, across all models. This empirical evidence validates the theoretical connection between EMI and the RP score established in Theorem 4.4. Therefore, EMI can serve as a reliable, cost-effective, and theoretically grounded alternative to RP for MLLM evaluation.
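The two rank-correlation statistics used here can be computed as follows. This is a minimal self-contained sketch with hypothetical per-scenario numbers (not the paper's measurements); in practice one would use library routines such as `scipy.stats.spearmanr` and `kendalltau`.

```python
import numpy as np

def rankdata(a):
    order = np.argsort(a)
    ranks = np.empty(len(a))
    ranks[order] = np.arange(1, len(a) + 1)
    return ranks

def spearman_rho(x, y):
    # Spearman's rho is the Pearson correlation of the ranks.
    return float(np.corrcoef(rankdata(x), rankdata(y))[0, 1])

def kendall_tau(x, y):
    # Fraction of concordant minus discordant pairs (no tie correction).
    n, s = len(x), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign((x[i] - x[j]) * (y[i] - y[j]))
    return s / (n * (n - 1) / 2)

# Hypothetical scores per shift scenario, for illustration only:
rp  = [72.7, 65.8, 68.0, 59.6, 61.2, 70.1]   # RP score
emi = [0.83, 0.64, 0.70, 0.51, 0.55, 0.78]   # empirical EMI estimate
print(spearman_rho(rp, emi), kendall_tau(rp, emi))  # both 1.0: perfectly monotone toy data
```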
Table 2: Spearman rank correlation and Kendall's tau between relative preference (RP) score and EMI. We conduct correlation analysis between the RP score (Eq. (2)) and EMI (Eq. (7)) on 34 synthetic and 27 natural distribution shifts across four MLLMs.

| Shift | Model | Spearman 𝜌 | 𝑝-val | Kendall 𝜏 | 𝑝-val |
| --- | --- | --- | --- | --- | --- |
| Synthetic | LLaVA v1.5 7B | 0.794 | < 0.001 | 0.604 | < 0.001 |
| | LLaVA v1.5 13B | 0.652 | < 0.001 | 0.483 | < 0.001 |
| | LLaVA NeXT 7B | 0.738 | < 0.001 | 0.564 | < 0.001 |
| | LLaVA NeXT 13B | 0.726 | < 0.001 | 0.527 | < 0.001 |
| Natural | LLaVA v1.5 7B | 0.610 | 0.001 | 0.450 | 0.001 |
| | LLaVA v1.5 13B | 0.720 | < 0.001 | 0.575 | < 0.001 |
| | LLaVA NeXT 7B | 0.593 | 0.001 | 0.435 | 0.001 |
| | LLaVA NeXT 13B | 0.457 | 0.014 | 0.321 | 0.017 |

Table 3: Pearson correlation analysis between EMID and its upper bound. We provide Pearson 𝑟 and 𝑝-value between the empirical estimates of EMID (Eq. (9)) and its upper bound (Eq. (10)) on 34 synthetic and 27 natural shift scenarios per model.

| Model | Synthetic Pearson 𝑟 | 𝑝-val | Natural Pearson 𝑟 | 𝑝-val |
| --- | --- | --- | --- | --- |
| LLaVA v1.5 7B | 0.755 | < 0.001 | 0.553 | 0.003 |
| LLaVA v1.5 13B | 0.785 | < 0.001 | 0.638 | < 0.001 |
| LLaVA NeXT 7B | 0.742 | < 0.001 | 0.594 | 0.001 |
| LLaVA NeXT 13B | 0.807 | < 0.001 | 0.550 | 0.003 |
| All models | 0.746 | < 0.001 | 0.565 | < 0.001 |

Verification of bound.
We now validate our main theorems. Figure 3 (left two panels) shows scatter plots comparing EMID with its upper bound across the four models, each evaluated over 34 synthetic and 27 natural distribution shifts. We see a clear trend between EMID and its upper bound under synthetic shifts, where we could directly control the severity of the shift. While the natural-shift results are noisier, the overall trend is similar. Meanwhile, our bounds depend on the distributional discrepancy between the model's response ŷ and the ground-truth response 𝑦, and thus naturally induce a different bound per MLLM. The right panel of Figure 3 presents model-wise plots with linear regression coefficients, where we observe that each model has a different degree of sensitivity to shifts. The Pearson correlation results in Table 3 further confirm statistically significant correlations between EMID and its upper bound, supporting the validity of our theorems.
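The per-model analysis above amounts to a Pearson correlation plus a fitted regression line. A minimal sketch with hypothetical per-scenario estimates (not the paper's measurements):

```python
import numpy as np

# Hypothetical per-scenario values, for illustration only:
emid = np.array([0.02, 0.05, 0.08, 0.11, 0.15, 0.21])   # empirical EMID
ub   = np.array([0.10, 0.14, 0.20, 0.27, 0.33, 0.45])   # its upper-bound estimate

r = float(np.corrcoef(emid, ub)[0, 1])        # Pearson r between EMID and its bound
slope, intercept = np.polyfit(ub, emid, 1)    # per-model regression line, as in Figure 3
print(f"Pearson r = {r:.3f}, slope = {slope:.3f}")
```

A valid upper bound should also dominate EMID pointwise, which the toy numbers respect by construction.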
Partial bound analysis.
In practice, we often cannot access the ground-truth response 𝑌 for our training and evaluation datasets. One may then want to use only part of the EMID upper bound (e.g., the JSD terms over the visual and textual inputs) as an estimator of EMID, neglecting the output-related term Δ. In Figure 4, we investigate whether the sum of the two JS divergence terms remains predictive of EMID. Although the trends are looser than with the full bound, due to the non-optimality of the MLLM parameters, the partial upper bound still has moderately high correlations (denoted by Pearson 𝑟) with EMID.
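The JSD terms in the partial bound measure how far the OOD input distribution has drifted from the ID one. The paper estimates them with RJSD on embeddings; as a hedged, minimal illustration of the quantity itself, here is the discrete Jensen-Shannon divergence on toy distributions:

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2, in [0, 1])."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]  # e.g., an ID input distribution over a toy vocabulary
q = [0.1, 0.2, 0.7]  # a shifted (OOD) input distribution
print(jsd(p, p), jsd(p, q))  # 0.0 for identical distributions; larger under shift
```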
Figure 4: Scatter plot with regression line between empirical estimates of EMID and partial components of its upper bound. We remove the Δ term of the bound (Eq. (10)) and use only the estimates of the JSD terms over the visual and text inputs.

Validation with advanced MLLMs.
So far, our evaluation has focused on LLaVA-series MLLMs. We now validate EMI and the EMID upper bound with two state-of-the-art MLLMs, Qwen2.5-VL-7B (Bai et al., 2025) and InternVL2.5-7B (Chen et al., 2024), under the synthetic shift scenarios. In Table 4, the Qwen2.5-VL model shows strong correlations between EMI and the RP score, and between EMID and its upper bound, with remarkable statistical significance. Although the correlations between EMI and the RP score are somewhat weaker for InternVL2.5, they remain non-negligible (Schober et al., 2018) at the 0.05 significance level, implying the generality of our framework across different models.
Table 4: Verification of EMI and EMID bound with advanced MLLMs. We conduct correlation analysis between EMI and RP, and between EMID and its upper bound (UB), with Qwen2.5-VL-7B (Bai et al., 2025) and InternVL2.5-7B (Chen et al., 2024), and observe statistically significant correlations.

| Model | EMI ⇔ RP: Spearman 𝜌 (𝑝-val) | EMI ⇔ RP: Kendall 𝜏 (𝑝-val) | EMID ⇔ UB: Pearson 𝑟 (𝑝-val) |
| --- | --- | --- | --- |
| Qwen2.5-VL | 0.767 (< 0.001) | 0.571 (< 0.001) | 0.672 (< 0.001) |
| InternVL2.5 | 0.375 (0.029) | 0.273 (0.023) | 0.810 (< 0.001) |

EMI analysis on a specialized domain.
Since the proposed information-theoretic measures rely on embeddings from external models (e.g., CLIP and RoBERTa) to compute empirical estimates of MI and JSD easily, one may wonder whether our framework applies to specialized-domain datasets such as medical visual question answering. To investigate this, we adopt the LLaVA-Med instruction dataset (Li et al., 2024) as a domain-specific open-ended benchmark of medical images and corresponding questions. Table 5 demonstrates that EMI and the EMID upper bound show strong correlations with the RP score and EMID, respectively, indicating the broad applicability of our method in a specialized domain as well.
Table 5: Verification of EMI and EMID bound on the LLaVA-Med dataset. We conduct correlation analysis between EMI and RP, and between EMID and its upper bound (UB), with LLaVA v1.5 (7B) on open-ended medical-domain QA tasks.

| EMI ⇔ RP: Spearman 𝜌 (𝑝-val) | EMI ⇔ RP: Kendall 𝜏 (𝑝-val) | EMID ⇔ UB: Pearson 𝑟 (𝑝-val) |
| --- | --- | --- |
| 0.718 (< 0.001) | 0.572 (< 0.001) | 0.930 (< 0.001) |

Application: EMID upper bound as a regularization.
Although our framework is primarily intended for analyzing MLLM performance under distribution shifts at inference time, one can also leverage EMI and/or the EMID upper bound at training time. We give a simple example that instantiates the EMID upper bound as an additional regularization term during visual instruction tuning:
$$R \;=\; \mathbb{E}\big[H(P_{\theta}(\cdot \mid \mathbf{x}))\big] \cdot \Big( D_{\mathrm{JS}}^{\frac{1}{2}}\big(P_{Z_v} \,\big\|\, \mathcal{N}\big) + D_{\mathrm{JS}}^{\frac{1}{2}}\big(P_{Z_t} \,\big\|\, \mathcal{N}\big) \Big), \qquad (11)$$

where $\mathbf{Z} = (Z_v, Z_t)$ denotes an intermediate representation of the MLLM given $\mathbf{X}$, and $\mathcal{N}$ is an isotropic Gaussian distribution with the same dimensionality as $Z_v$ and $Z_t$. Here, we use the non-informative prior $\mathcal{N}$ as a surrogate for the target distribution $Q$ (which is usually inaccessible during training) to implicitly penalize the representation discrepancy between $P$ and $Q$. This term encourages the intermediate representations of the visual and text inputs to shrink toward a zero-centered Gaussian, with the penalization strength scaled by the average model output entropy over the batch. Table 6 presents evaluation results of the LLaVA v1.5 7B model on LLaVA-Bench COCO (ID) and its synthetic shift variants (V, T, and J Shift) after instruction tuning on a LLaVA-mix-665k subset, where we see that our EMID-based regularization effectively improves robustness to shifts while maintaining ID performance.
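The shape of this regularizer can be sketched numerically. The sketch below is a minimal NumPy illustration on toy batch tensors, not the paper's implementation: the closed-form KL divergence of a fitted diagonal Gaussian to 𝒩(0, I) stands in for the JS-divergence terms (which the paper estimates with RJSD), and all array shapes are hypothetical.

```python
import numpy as np

def kl_to_std_normal(z, eps=1e-6):
    """KL( N(mu, diag(var)) || N(0, I) ) fitted to batch representations z of shape
    (batch, dim). A tractable stand-in for the JS-divergence-to-prior terms in Eq. (11)."""
    mu, var = z.mean(0), z.var(0) + eps
    return 0.5 * float(np.sum(var + mu**2 - 1.0 - np.log(var)))

def batch_entropy(logits):
    """Average Shannon entropy of the model's output distribution over the batch."""
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(1).mean())

def emid_regularizer(z_visual, z_text, logits):
    # Eq. (11): entropy-scaled sum of divergences of Z_v and Z_t to the prior.
    return batch_entropy(logits) * (kl_to_std_normal(z_visual) + kl_to_std_normal(z_text))

rng = np.random.default_rng(0)
z_v = rng.normal(1.0, 2.0, (32, 8))      # visual representations, far from N(0, I)
z_t = rng.normal(0.0, 1.0, (32, 8))      # text representations, close to N(0, I)
logits = rng.normal(size=(32, 100))      # toy output logits
print(emid_regularizer(z_v, z_t, logits) >= 0.0)
```

As expected, representations that drift from the zero-centered prior (here `z_v`) contribute a larger penalty than well-centered ones (`z_t`).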
Table 6: Visual instruction tuning with EMID upper bound. We use the EMID upper bound as a regularization term (Eq. (11)) during instruction tuning of LLaVA v1.5 (7B) on a 10% subset of LLaVA-mix-665K, w/ and w/o 𝑅, and report relative preference scores.

| Method | ID | V Shift | T Shift | J Shift |
| --- | --- | --- | --- | --- |
| Baseline | 72.7 | 65.8 | 68.0 | 59.6 |
| Baseline w/ 𝑅 | 72.7 | 66.3 | 68.3 | 60.8 |

6 Related Works

Fine-tuned foundation models under distribution shifts.
Recent findings suggest that fine-tuning on relatively small ID datasets hurts the OOD generalization capability of foundation models (Kumar et al., 2022; Wortsman et al., 2022). Although many follow-up studies (Goyal et al., 2023; Tian et al., 2023; Oh et al., 2025), including theory-inspired methods (Kumar et al., 2022; Ju et al., 2022; Oh et al., 2024), have been proposed, almost all of them focus on discriminative models such as CLIP (Radford et al., 2021) for image classification. Given the rapidly growing popularity of MLLMs, it is necessary to investigate the reliability of MLLMs under distribution shifts with a tangible formulation; we lay a cornerstone for this.
Performance analysis of MLLM.
There have been numerous reports on MLLMs' corner-case behaviors. Zhang et al. (2024a), Zhou et al. (2024), and Verma et al. (2024) observed that MLLMs perform poorly on specialized domains or under synthetic perturbations, while Zhai et al. (2024) and Zhang et al. (2024b) showed that MLLMs fail at some simple image classification tasks. Besides, Li et al. (2023b) and Ye-Bin et al. (2025) focused on object hallucination in MLLMs under spurious correlation. However, all of these works lacked a formal framework to explain such degradation. We recast the degeneration of MLLMs as a robustness problem under distribution shifts between instruction-tuning and evaluation data (Liang et al., 2025), and devise the first theoretical framework to analyze MLLMs.
Information-theoretic approach for model evaluation.
The information-theoretic view has been steadily adopted to establish evaluation criteria for language model probing (Hewitt et al., 2021), prompt engineering (Sorensen et al., 2022), and rationale evaluation (Chen et al., 2023), alongside learning objectives (Alemi et al., 2016; Chen et al., 2016; Tschannen et al., 2020; Kong et al., 2020; Wang et al., 2021), but it remains relatively unexplored for MLLMs. We also note some works adopting information-theoretic approaches to analyze models under distribution shifts (Federici et al., 2021; Shui et al., 2022); whereas they focused on classification tasks with discriminative models, we establish new theorems for MLLM analysis based on a new metric, EMI.
7 Discussion
This work urged the development of a formal framework to understand MLLMs under distribution shifts, which is unexplored yet crucial for reliable artificial intelligence in the wild. As a first step, we devised effective mutual information (EMI) as a metric for MLLM evaluation and showed its theoretical connection to an existing standard metric, the relative preference score. We then provided a theoretical upper bound on an MLLM's EMI difference between ID and OOD data, consisting of JS divergences between marginal and conditional distributions of the input/output variables. Through experiments on benchmarks spanning 61 distribution shifts, we showed the correlation between relative preference and EMI, and further showed correlations between the EMI difference and its upper bound, thereby empirically verifying our theoretical claims across various models.
Practical implication.
As shown in Table 2, EMI strongly correlates with RP. Compared to RP, the MI estimator can be computed far more efficiently, without relying on a computationally expensive judge LLM (Achiam et al., 2023) (see Appendix C.4). Thus, EMI can serve as a theoretically grounded, cost-effective evaluation metric that measures the effective relevance between multimodal queries and open-ended responses. Besides, the EMID upper bound can be adopted as a regularizer during model training, as shown in Table 6, or for test-time adaptation of MLLMs to improve robustness to distribution shifts (Li et al., 2023a), as well as serving as a robustness measure for MLLMs.
Limitation and future work.
Although the input-output relevance measured by EMI is one of the most important properties of instruction-following models, other crucial quality attributes are not captured by the relevance term. Extending the theory to support evaluation across multiple facets of MLLMs is worth exploring. In addition, we only simulated intuitive types of distribution shift under a simplified assumption on the data structure, leaving for future work the analysis of more complex shifts driven by spurious correlation (Simon, 1954), which may be covered by Theorem 4.6. Meanwhile, despite its consistent correlation with EMID, our upper bound is not tight in theory, and we did not discuss a lower bound on EMID. Pursuing a tighter bound or exploring the lower limit of EMID would also be worthwhile.
Acknowledgement
We thank the anonymous ICML reviewers for their valuable feedback, and also appreciate constructive comments from Max Khanov, Min-Hsuan Yeh, Dongkwan Kim, and Kyungeun Lee. Changdae Oh, Shawn Im, Xuefeng Du, and Yixuan Li are supported by the AFOSR Young Investigator Program under award number FA9550-23-1-0184, National Science Foundation (NSF) Awards No. IIS-2237037 and IIS-2331669, Office of Naval Research grant N00014-23-1-2643, the Schmidt Sciences Foundation, and an Alfred P. Sloan Fellowship. Zhen Fang was funded by the Australian Government through the Australian Research Council (ARC) under grant number DE250100363.
Impact Statement
This work lays the first theoretical foundation for quantifying the reliability of MLLMs. Our theoretical statements shed light on the MLLM performance gap under distribution shifts, which commonly emerge in real-world applications. Our framework can raise awareness of the potential risk, i.e., performance variation, of MLLMs and guide practitioners in devising methods for robust adaptation of chat assistants, thereby helping ensure the trustworthiness of artificial intelligence in critical social applications such as finance and healthcare.
References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, pp. 531–540. PMLR, 2018.
Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
Bu, Y., Zou, S., Liang, Y., and Veeravalli, V. V. Estimation of KL divergence: Optimal minimax rate. IEEE Transactions on Information Theory, 64(4):2648–2674, 2018.
Castro, R. M. and Nowak, R. D. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.
Chen, H., Brahman, F., Ren, X., Ji, Y., Choi, Y., and Swayamdipta, S. REV: Information-theoretic evaluation of free-text rationales. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2007–2030, 2023.
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, 29, 2016.
Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
Cheng, P., Hao, W., Dai, S., Liu, J., Gan, Z., and Carin, L. CLUB: A contrastive log-ratio upper bound of mutual information. In International Conference on Machine Learning, pp. 1779–1788. PMLR, 2020.
Cheng, P., Hao, W., Yuan, S., Si, S., and Carin, L. FairFil: Contrastive neural debiasing method for pretrained text encoders. In International Conference on Learning Representations, 2021.
Conneau, A. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.
Cover, T. M. Elements of Information Theory. John Wiley & Sons, 1999.
Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 49250–49267, 2023.
Federici, M., Tomioka, R., and Forré, P. An information-theoretic approach to distribution shifts. Advances in Neural Information Processing Systems, 34:17628–17641, 2021.
Fraser, A. M. and Swinney, H. L. Independent coordinates for strange attractors from mutual information. Physical Review A, 33(2):1134, 1986.
Furuya, T., de Hoop, M. V., and Peyré, G. Transformers are universal in-context learners. arXiv preprint arXiv:2408.01367, 2024.
Goyal, S., Kumar, A., Garg, S., Kolter, Z., and Raghunathan, A. Finetune like you pretrain: Improved finetuning of zero-shot vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19338–19347, 2023.
Han, Z., Zhou, G., He, R., Wang, J., Wu, T., Yin, Y., Khan, S., Yao, L., Liu, T., and Zhang, K. How well does GPT-4V(ision) adapt to distribution shifts? A preliminary investigation. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024.
Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021.
Hewitt, J., Ethayarajh, K., Liang, P., and Manning, C. D. Conditional probing: Measuring usable information beyond a baseline. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1626–1639, 2021.
Hoyos, J. K. and Giraldo, L. G. S. A kernel two-sample test with the representation Jensen-Shannon divergence. In LatinX in AI @ NeurIPS 2024, 2024. URL https://openreview.net/forum?id=bKZbWy3DnR.
Hoyos-Osorio, J. K. and Sanchez-Giraldo, L. G. The representation Jensen-Shannon divergence. arXiv preprint arXiv:2305.16446, 2023.
Hunter, D. R. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004.
Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
Jiang, T., Song, M., Zhang, Z., Huang, H., Deng, W., Sun, F., Zhang, Q., Wang, D., and Zhuang, F. E5-V: Universal embeddings with multimodal large language models. arXiv preprint arXiv:2407.12580, 2024.
Ju, H., Li, D., and Zhang, H. R. Robust fine-tuning of deep neural networks with Hessian-based generalization guarantees. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 10431–10461. PMLR, 2022.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., et al. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2024.
Koltchinskii, V. Rademacher complexities and bounding the excess risk in active learning. The Journal of Machine Learning Research, 11:2457–2485, 2010.
Kong, L., de Masson d'Autume, C., Yu, L., Ling, W., Dai, Z., and Yogatama, D. A mutual information maximization perspective of language representation learning. In International Conference on Learning Representations, 2020.
Kraskov, A., Stögbauer, H., and Grassberger, P. Estimating mutual information. Physical Review E, 69(6):066138, 2004.
Kumar, A., Raghunathan, A., Jones, R., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022.
Lambert, N., Pyatkin, V., Morrison, J., Miranda, L., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., et al. RewardBench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787, 2024.
Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., and Gao, J. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36, 2024.
Li, M., Wang, W., Feng, F., Cao, Y., Zhang, J., and Chua, T.-S. Robust prompt optimization for large language models against distribution shifts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 1539–1554, 2023a.
Li, Y. and Turner, R. E. Rényi divergence variational inference. Advances in Neural Information Processing Systems, 29, 2016.
Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 292–305, 2023b. URL https://aclanthology.org/2023.emnlp-main.20.
Liang, Y., Zheng, T., Du, X., Zhang, G., Qu, X., Yue, X., Zheng, C., Liu, J., Ma, L., Chen, W., et al. Aligning instruction tuning with pre-training. arXiv preprint arXiv:2501.09368, 2025.
Liu, F., Xu, W., Lu, J., Zhang, G., Gretton, A., and Sutherland, D. J. Learning deep kernels for non-parametric two-sample tests. In International Conference on Machine Learning, pp. 6316–6326. PMLR, 2020.
Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Conference on Neural Information Processing Systems (NeurIPS), 36, 2023.
Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024a.
Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024b.
Luo, S., Li, S., Zheng, S., Liu, T.-Y., Wang, L., and He, D. Your transformer may not be as powerful as you expect. Advances in Neural Information Processing Systems, 35:4301–4315, 2022.
Mohri, M. Foundations of Machine Learning, 2018.
Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
Oh, C., Kim, M., Lim, H., Park, J., Jeong, E., Cheng, Z.-Q., and Song, K. Towards calibrated robust fine-tuning of vision-language models. Advances in Neural Information Processing Systems, 37, 2024.
Oh, C., Li, Y., Song, K., Yun, S., and Han, D. DaWin: Training-free dynamic weight interpolation for robust adaptation. In The Thirteenth International Conference on Learning Representations, 2025.
Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Paninski, L. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.
Pinsker, M. S. Information and Information Stability of Random Variables and Processes. Holden-Day, 1964.
Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., and Tucker, G. On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180. PMLR, 2019.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.
Schober, P., Boer, C., and Schwarte, L. A. Correlation coefficients: Appropriate use and interpretation. Anesthesia & Analgesia, 126(5):1763–1768, 2018. doi: 10.1213/ANE.0000000000002864.
Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
Shui, C., Chen, Q., Wen, J., Zhou, F., Gagné, C., and Wang, B. A novel domain adaptation theory with Jensen-Shannon divergence. Knowledge-Based Systems, 257:109808, 2022.
Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
Simon, H. A. Spurious correlation: A causal interpretation. Journal of the American Statistical Association, 49(267):467–479, 1954.
Sinn, M. and Rawat, A. Non-parametric estimation of Jensen-Shannon divergence in generative adversarial network training. In International Conference on Artificial Intelligence and Statistics, pp. 642–651. PMLR, 2018.
Sorensen, T., Robinson, J., Rytting, C., Shaw, A., Rogers, K., Delorey, A., Khalil, M., Fulda, N., and Wingate, D. An information-theoretic approach to prompt engineering without ground truth labels. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 819–862, 2022.
Sreekumar, S. and Goldfeld, Z. Neural estimation of statistical divergences. Journal of Machine Learning Research, 23(126):1–75, 2022.
Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. R. G. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020.
Thekumparampil, K. K., Khetan, A., Lin, Z., and Oh, S. Robustness of conditional GANs to noisy labels. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/565e8a413d0562de9ee4378402d2b481-Paper.pdf.
Tian, J., He, Z., Dai, X., Ma, C.-Y., Liu, Y.-C., and Kira, Z. Trainable projected gradient method for robust fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7836–7845, 2023.
Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On mutual information maximization for representation learning. In International Conference on Learning Representations, 2020.
Verma, A. A., Saeidi, A., Hegde, S., Therala, A., Bardoliya, F. D., Machavarapu, N., Ravindhiran, S. A. K., Malyala, S., Chatterjee, A., Yang, Y., et al. Evaluating multimodal large language models across distribution shifts and augmentations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5314–5324, 2024.
Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
Wang, B., Wang, S., Cheng, Y., Gan, Z., Jia, R., Li, B., and Liu, J. InfoBERT: Improving robustness of language models from an information theoretic perspective. In International Conference on Learning Representations, 2021.
Wortsman, M., Ilharco, G., Kim, J. W., Li, M., Kornblith, S., Roelofs, R., Lopes, R. G., Hajishirzi, H., Farhadi, A., Namkoong, H., et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7959–7971, 2022.
Yang, Y. and Barron, A. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, pp. 1564–1599, 1999.
Ye-Bin, M., Hyeon-Woo, N., Choi, W., and Oh, T.-H. BEAF: Observing before-after changes to evaluate hallucination in vision-language models. In European Conference on Computer Vision, pp. 232–248. Springer, 2025.
Yuan, W., Neubig, G., and Liu, P. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021.
Zhai, Y., Tong, S., Li, X., Cai, M., Qu, Q., Lee, Y. J., and Ma, Y. Investigating the catastrophic forgetting in multimodal large language model fine-tuning. In Conference on Parsimony and Learning (Proceedings Track), 2024.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020.
Zhang, X., Li, J., Chu, W., Hai, J., Xu, R., Yang, Y., Guan, S., Xu, J., and Cui, P. On the out-of-distribution generalization of multimodal large language models. arXiv preprint arXiv:2402.06599, 2024a.
Zhang, Y., Unell, A., Wang, X., Ghosh, D., Su, Y., Schmidt, L., and Yeung-Levy, S. Why are visually-grounded language models bad at image classification? Conference on Neural Information Processing Systems (NeurIPS), 2024b.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
Zhou, G., Han, Z., Chen, S., Huang, B., Zhu, L., Khan, S., Gao, X., and Yao, L. Adapting large multimodal models to distribution shifts: The role of in-context learning. arXiv preprint arXiv:2405.12217, 2024.
Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024.
Appendix
Contents 1 Introduction 2 Preliminary 3 Motivation 4 Information-theoretic Analysis on MLLM Performance Gap 5 Empirical Validation on Real Benchmark 6 Related Works 7 Discussion Appendix A Detailed Description for Effective Mutual Information
We propose effective mutual information (EMI) as an alternative to vanilla mutual information (MI) for evaluating a model-generated response to an input query. As explained in Section 4.2, MI (e.g., $I(P_{\mathbf{X}} \otimes P_\theta)$) cannot take into account the intrinsic characteristics of the data distribution. See Figure 5 for an intuitive example. The amount of information represented by the entropy $H(\cdot)$ and conditional entropy $H(\cdot|\cdot)$ can vary depending on the data-generating process of each dataset. For example, if the task of interest is a narrow problem in a specific domain (e.g., OOD1: LLaVA-Med; Li et al. (2024)), the cardinality of the desired output response space may be significantly smaller than that of a general problem-solving task in a general domain (e.g., OOD2: LLaVA-Bench Wild; Liu et al. (2023)), and the ground-truth MI can differ depending on the domain. By accounting for these baseline amounts of information, EMI measures how much the model captures the effective relevance between input and output.
In Section 4.2, we provide justifications for using EMI as an evaluation metric for MLLMs by revealing analogies to excess risk and effective robustness and by presenting its theoretical connection to the relative preference score. While LLM-as-a-Judge enables flexible evaluation of open-ended generation tasks with multiple user-defined criteria, EMI confines the facet of evaluation to query-response relevance. However, this compromise in evaluation flexibility enables us to build the solid theoretical statements necessary for understanding how MLLMs respond to shifts and for improving them in a principled way.
Figure 5: Information diagram and motivation of effective mutual information. The difference between vanilla MI terms does not account for the domain-dependent intrinsic scale of entropy and mutual information, thereby failing to fairly measure the relevance between the input query $x$ and model prediction $\hat{y}$. Meanwhile, EMI ablates the domain-dependent characteristics to focus on measuring the effective relevance between $x$ and $\hat{y}$.
Meanwhile, since we adopt neural network models for the empirical estimation of EMI, it resembles model-based heuristic metrics such as BERTScore (Zhang et al., 2020), BARTScore (Yuan et al., 2021), and CLIPScore (Hessel et al., 2021), which map input(s) to a scalar score through a single forward pass of a model. However, we take a step beyond such working heuristics and lay a theoretical foundation with EMI.
Appendix B Implementation Details
In this paper, we proposed EMI for the reliable, theoretically grounded evaluation of multimodal large language models (MLLMs). Based on EMI, to analyze the MLLM performance gap under distribution shift, we provided an upper bound for the EMI difference between ID and OOD data. In this section, we describe the procedures for estimating EMI and its upper bound in detail.
Overview.
To estimate EMI and its upper bound, we first need estimators for MI and the Jensen-Shannon divergence (JSD). Both commonly adopt neural network encoders to project raw data, such as text and images, into an embedding space to reduce problem complexity (Oord et al., 2018; Liu et al., 2020); the MI estimator then optimizes a simple critic function (Poole et al., 2019) on top of the data embeddings. After training the MI estimator, we evaluate empirical MI over different data distributions. For JSD estimation, given the embedding spaces of pre-trained models, no additional training is necessary. The procedure thus divides into two phases: (1) neural MI estimator training, and (2) inference of MI and JSD.
MI estimation.
Estimating MI from finite samples of an unknown population distribution is a non-trivial problem and has been actively studied (Fraser & Swinney, 1986; Paninski, 2003; Kraskov et al., 2004; Nguyen et al., 2010; Shwartz-Ziv & Tishby, 2017; Belghazi et al., 2018; Poole et al., 2019; Cheng et al., 2020). We adopt the contrastive log-ratio upper bound (CLUB; Cheng et al. (2020)) as our default MI estimator, similar to Cheng et al. (2021). We first extract embeddings for the visual input query $Z_v = \mathrm{enc}_v(X_v)$ and the text input query $Z_t = \mathrm{enc}_t(X_t)$ from visual and text encoder models, and take their mean as the input query embedding $Z_{\mathbf{X}} = (Z_v + Z_t)/2$. Specifically, we adopt representative embedding models for each modality by default: CLIP pre-trained ViT-B/32 as the visual encoder and XLM-RoBERTa-Base (Conneau, 2019) as the text encoder. We also obtain embedding vectors for the model response $Z_{\hat{Y}} = \mathrm{enc}_t(\hat{Y})$ and the reference response $Z_Y = \mathrm{enc}_t(Y)$ with the text encoder. Then, we train the MI estimator $\hat{I}_\psi(\cdot,\cdot)$ with parameters $\psi$ via gradient descent. To be specific, CLUB formulates the sampled MI estimate as

$$\hat{I}_{\mathrm{CLUB}}(P_{Z_{\mathbf{X}} Z_Y}) = \frac{1}{N}\sum_{i=1}^{N} \log q_\psi(z_{y_i} \mid z_{x_i}) - \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \log q_\psi(z_{y_j} \mid z_{x_i}), \quad (12)$$

where $q_\psi(\cdot\mid\cdot)$ denotes a variational approximation of the ground-truth conditional density $p(\cdot\mid\cdot)$.

Following Cheng et al. (2020; 2021), we parameterize $q_\psi$ as a multivariate Gaussian distribution and estimate its mean and variance with two separate two-layer MLPs with hidden dimension 250. During mini-batch training, these MLPs consume the concatenated input and response embeddings $\{[z_{\mathbf{x}_i}, z_{y_i}]\}_{i=1}^{N}$ to produce a scalar MI estimate, and they are optimized jointly by the AdamW optimizer with learning rate 0.001 and batch size 1,024 for 5,000 iterations. However, training an estimator for every ID-OOD data pair would be impractical when many pairs must be evaluated. We therefore constructed a dataset integrating all ID-OOD data subsets for MI training (the union of all LLaVA-Bench variants reaches roughly 5,000 samples for natural shift and 9,000 samples for synthetic shift), train on it only once, and then run inference for all ID-OOD scenarios (27 for natural shift, 34 for synthetic shift) with these common MI estimators. This not only significantly reduces the time required to evaluate model robustness across multiple distribution-shift scenarios, but also stabilizes the training process by enlarging the training set.
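To make Eq. (12) concrete, the CLUB estimate can be sketched in a few lines. This is a minimal sketch, not the paper's released implementation: for self-containedness, the variational $q_\psi$ is fit here as a Gaussian with a closed-form linear mean via least squares, whereas the paper trains two-layer MLPs with AdamW; all function names are ours.

```python
import numpy as np

def fit_gaussian_q(z_x, z_y):
    # Variational q_psi(z_y | z_x): diagonal Gaussian whose mean is a linear
    # map of z_x, fit in closed form by least squares (a stand-in for the
    # paper's two-layer MLPs), with residual variance per output dimension.
    n = z_x.shape[0]
    X = np.hstack([z_x, np.ones((n, 1))])          # append a bias column
    W, *_ = np.linalg.lstsq(X, z_y, rcond=None)
    var = (z_y - X @ W).var(axis=0) + 1e-6
    return W, var

def log_q(W, var, z_x, z_y):
    # Per-pair Gaussian log-density log q_psi(z_y | z_x) (up to a constant).
    X = np.hstack([z_x, np.ones((z_x.shape[0], 1))])
    mu = X @ W
    return (-0.5 * (z_y - mu) ** 2 / var - 0.5 * np.log(var)).sum(axis=-1)

def club_mi(z_x, z_y):
    # Eq. (12): mean log-density over matched (positive) pairs minus the
    # mean over all N^2 (mostly mismatched) pairs.
    W, var = fit_gaussian_q(z_x, z_y)
    n = z_x.shape[0]
    pos = log_q(W, var, z_x, z_y).mean()
    neg = log_q(W, var, np.repeat(z_x, n, axis=0), np.tile(z_y, (n, 1))).mean()
    return pos - neg
```

On synthetic embeddings, strongly coupled $(z_x, z_y)$ pairs yield a markedly larger estimate than independent ones, which is the behavior the EMI estimate relies on.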
JS divergence estimation.
Estimating distributional divergence from finite samples has also been a central research topic (Yang & Barron, 1999; Sriperumbudur et al., 2012; Li & Turner, 2016; Bu et al., 2018; Sinn & Rawat, 2018; Sreekumar & Goldfeld, 2022; Hoyos-Osorio & Sanchez-Giraldo, 2023). We adopt the most recent of these, the representation Jensen-Shannon divergence (RJSD; Hoyos-Osorio & Sanchez-Giraldo (2023); Hoyos & Giraldo (2024)), which has proven effective on real benchmark datasets, as our JSD estimator:

$$\hat{D}_{\mathrm{RJSD}}(P, Q) = S\!\left(\frac{C_P + C_Q}{2}\right) - \frac{1}{2}\big(S(C_P) + S(C_Q)\big), \quad (13)$$

where $C_P = \mathbb{E}_{X\sim P}[\phi(X)\otimes\phi(X)]$ and $S(C_P) = -\mathrm{Trace}(C_P \log C_P)$. As with MI, we compute $\hat{D}_{\mathrm{RJSD}}(P, Q)$ in the embedding space of the same frozen pre-trained models, i.e., we use the neural embedding space to induce the kernel $\kappa(x, x') = \langle \phi(x), \phi(x')\rangle$. In contrast to MI estimation, RJSD with a frozen neural embedding model requires no additional training; still, one might consider learning the embedding model from scratch if necessary.
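As a concrete illustration of Eq. (13), the RJSD between two sets of embeddings reduces to von Neumann entropies of uncentered covariance operators. The sketch below assumes unit-normalized embedding rows so that each covariance operator has unit trace; function names are ours, not the RJSD authors' code.

```python
import numpy as np

def vn_entropy(C, eps=1e-12):
    # von Neumann entropy S(C) = -Trace(C log C), computed via eigenvalues.
    w = np.clip(np.linalg.eigvalsh(C), eps, None)
    return float(-(w * np.log(w)).sum())

def cov_operator(emb):
    # C_P = E[phi(X) (x) phi(X)] with unit-norm features, so Trace(C_P) = 1.
    phi = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return phi.T @ phi / phi.shape[0]

def rjsd(emb_p, emb_q):
    # Eq. (13): S((C_P + C_Q)/2) - (S(C_P) + S(C_Q)) / 2.
    C_p, C_q = cov_operator(emb_p), cov_operator(emb_q)
    return vn_entropy((C_p + C_q) / 2) - 0.5 * (vn_entropy(C_p) + vn_entropy(C_q))
```

Identical sample sets give a divergence of zero, and a mean-shifted set gives a strictly positive value, mirroring how the estimator separates ID from OOD embeddings.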
Scale-adjusted EMID upper bound construction.
By leveraging the estimators described above, one can compute all JSD terms in the proposed EMID upper bound (Eq. 10): $D_{JS}(P_{X_v}\|Q_{X_v})$, $D_{JS}(P_{X_t}\|Q_{X_t})$, $D_{JS}(P_Y^\theta\|P_Y)$, and $D_{JS}(Q_Y^\theta\|Q_Y)$. Because exact computation of the entropy scaler term $\hat{H}$ in Eq. (10) is intractable, we relax it with an estimate on batch samples and replace the inaccessible true conditional distribution $Q_{Y|\mathbf{X}}$ with $P_\theta$. For a sentence output $Y = \{y_1, \ldots, y_L\}$ with $L$ tokens, the length-normalized batch entropy estimate over $N$ samples indexed by $i \in \mathcal{I} = \{1,\ldots,N\}$ can be formulated as

$$\tilde{H} = \max_{i\in\mathcal{I}} -\frac{1}{L}\sum_{l=1}^{L} \log P_\theta(y_{i,l} \mid y_{i,<l}, x_i). \quad (14)$$
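In code, the batch estimate $\tilde{H}$ of Eq. (14) is just a max over length-normalized negative log-likelihoods. The sketch below assumes per-token log-probabilities have already been extracted from the MLLM (a model-specific step omitted here); the function name is ours.

```python
import numpy as np

def batch_entropy_estimate(token_logprobs):
    # Eq. (14): for each response i, average the negative per-token
    # log-likelihoods -log P_theta(y_{i,l} | y_{i,<l}, x_i) over its L tokens,
    # then take the max over the batch as the entropy-scale estimate H~.
    return max(-np.mean(lp) for lp in token_logprobs)
```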
In pilot experiments, we observed that the values of $\tilde{H}$ are centered around 2.0 on the datasets considered in this work. Therefore, $\hat{H} = \max_{\mathbf{x}\in\mathcal{X}}[H(Q_{Y|\mathbf{X}=\mathbf{x}}) + H(P_\theta(\cdot\mid\mathbf{x}))]$ can approximately take the value 4.0. As the main purposes of the EMID upper bound are to characterize MLLM performance gaps and to serve as a practical measure of robustness, we compute scale-adjusted estimates of the EMID upper bound (UB) as follows:

$$\begin{aligned} \text{EMID UB} &= \hat{H}\big(D_{JS}^{1/2}(P_{X_v}\|Q_{X_v}) + D_{JS}^{1/2}(P_{X_t}\|Q_{X_t})\big) + 4\big(D_{JS}^{1/4}(P_Y^\theta\|P_Y) + D_{JS}^{1/4}(Q_Y^\theta\|Q_Y)\big) \\ &\approx 4\big(D_{JS}^{1/2}(P_{X_v}\|Q_{X_v}) + D_{JS}^{1/2}(P_{X_t}\|Q_{X_t}) + D_{JS}^{1/4}(P_Y^\theta\|P_Y) + D_{JS}^{1/4}(Q_Y^\theta\|Q_Y)\big) \\ \text{Scale-adjusted } \widehat{\text{UB}} &:= \hat{D}_{\mathrm{RJSD}}^{1/2}(P_{X_v}, Q_{X_v}) + \hat{D}_{\mathrm{RJSD}}^{1/2}(P_{X_t}, Q_{X_t}) + \hat{D}_{\mathrm{RJSD}}^{1/4}(P_Y^\theta, P_Y) + \hat{D}_{\mathrm{RJSD}}^{1/4}(Q_Y^\theta, Q_Y). \end{aligned}$$
See Section D for the detailed derivation of the EMID UB. Note that applying a linear transformation to the target variables of a Pearson correlation analysis does not affect the correlation coefficient; hence the correlation between EMID and its scale-adjusted UB equals that between EMID and the original UB, even though the former is not an exact estimate of the EMID UB.
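Putting the pieces together, the scale-adjusted upper bound is simply a power-weighted sum of the four estimated divergences; a minimal sketch with hypothetical argument names (each argument is the corresponding $\hat{D}_{\mathrm{RJSD}}$ value):

```python
def scale_adjusted_ub(d_xv, d_xt, d_py, d_qy):
    # Square-root powers on the input-side (visual/text) divergences and
    # quarter powers on the output-side divergences; the common entropy
    # scaler (~4) is dropped, since a linear rescaling does not change the
    # Pearson correlation with EMID.
    return d_xv ** 0.5 + d_xt ** 0.5 + d_py ** 0.25 + d_qy ** 0.25
```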
Figure 6: Performance variation against varying degrees of distribution shift. We evaluated LLaVA v1.5 and LLaVA NeXT models on 34 out-of-distribution (OOD) variants induced by image and text perturbations of the LLaVA-Bench COCO dataset (ID). The $x$-axis is sorted by the severity of shift between ID and OOD. There is a consistent trend: increased degrees of distribution shift result in performance degradation of the MLLM.

MLLM judge and relative preference (RP) score.
For the evaluation of open-ended generation tasks, (M)LLM-as-a-Judge has become the current de facto standard. Following Liu et al. (2023; 2024a), we use GPT-4 in text-only inference mode (with a plain-text visual cue, such as the ground-truth caption of the image) as the judge model, and also use the output of the same model as the reference answer for each query. We leverage the prompts provided in the LLaVA source code, and compute the RP score of a model of interest by comparing its output with the reference answer.
Appendix C Extended Empirical Validation and Discussion

C.1 Additional Results from the Pilot Study
In Section 3, we conducted an experiment to validate our hypotheses on the relation between MLLM performance degradation and the severity of natural distribution shift. In Figure 6, we provide additional results for another type of distribution shift, induced by image and text perturbations. For image perturbation, we consider defocus blur and frost with three different magnitudes; for text perturbation, we consider keyboard typo errors and word-synonym replacement with two different magnitudes (we adopt the source code of https://github.com/Jielin-Qiu/MM_Robustness to generate the perturbed datasets). We observe a consistent trend in the relation between MLLM performance degradation and the severity of distribution shifts for visual-only, text-only, and joint shifts, as in the natural-shift case of Figure 1: an increased magnitude of distribution shift induces more severe MLLM performance degradation, and the degree of degradation can be attributed to shifts in the two modalities.
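For illustration, text perturbations of the keyboard-typo kind can be sketched as follows (the actual datasets were generated with the MM_Robustness codebase referenced above; the adjacency map here is a tiny, hypothetical subset of a full QWERTY layout, and the function name is ours):

```python
import random

# Illustrative subset of a QWERTY adjacency map (the real perturbation
# toolkit covers the full keyboard).
KEY_NEIGHBORS = {'a': 'qwsz', 'e': 'wrsd', 'o': 'ipkl', 't': 'rygf'}

def keyboard_typo(text, rate=0.1, seed=0):
    # Replace each mapped character with a neighboring key with probability
    # `rate`, mimicking the typo perturbation; a larger `rate` corresponds
    # to a stronger (more severe) text shift.
    rng = random.Random(seed)
    return ''.join(
        rng.choice(KEY_NEIGHBORS[ch]) if ch in KEY_NEIGHBORS and rng.random() < rate else ch
        for ch in text
    )
```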
C.2 Different Design Choices of MI and JSD Estimation
Note that none of our theoretical results (Lemma 4.3, Theorem 4.4, Theorem 4.5, and Theorem 4.6) are limited to a specific class of MI and JSD estimators. To investigate whether our empirical verification of the theorems holds in an estimator-agnostic manner (provided the estimator is valid), we present an ablation study over the MI estimator, the JSD estimator, and the embedding space on which the estimators are built.
Specifically, we consider four MI estimators {NWJ (Nguyen et al., 2010), MINE (Belghazi et al., 2018), InfoNCE (Oord et al., 2018), CLUB (Cheng et al., 2020)}, three embedding spaces {individual models (CLIP ViT and XLM-RoBERTa), E5-V joint (Jiang et al., 2024), E5-V disjoint (Jiang et al., 2024)}, and two JSD estimators {MMD (Liu et al., 2020), RJSD (Hoyos-Osorio & Sanchez-Giraldo, 2023)}. E5-V (Jiang et al., 2024) is a recently proposed embedding extraction method that leverages an MLLM with a carefully designed prompt. We used the default prompt "Summary above sentence/image in one word: " to separately extract embeddings (E5-V disjoint) for images and sentences, and designed an ensemble of four custom prompts:

1. "Summary of the image , and sentence in one word: "
2. "Summary of the visual content "" with an associated text query "" in one word: "
3. "Given , Summary of the sentence "" in one word: "
4. "Given visual content "", Summary of the text query "" in one word: ",

to extract the multimodal joint query embedding (E5-V joint) by averaging the four embedding vectors per (image, sentence) pair.
Table 7: Ablation study for MI estimator and embedding space. We evaluate four MI estimators with three different embedding space choices in terms of the Spearman correlation coefficient ρ between the RP score and EMI. EMI and the RP score are robustly correlated under variations of the embedding space and MI estimator, with CLUB showing the most stable correlation.

| MI estimator | Embedding | LLaVA v1.5 7B ρ (p-val) | LLaVA v1.5 13B ρ (p-val) | LLaVA NeXT 7B ρ (p-val) | LLaVA NeXT 13B ρ (p-val) |
| --- | --- | --- | --- | --- | --- |
| CLUB | E5-V disjoint | 0.695 (0.000) | 0.726 (0.000) | 0.581 (0.001) | 0.579 (0.001) |
| CLUB | E5-V joint | 0.910 (0.000) | 0.846 (0.000) | 0.817 (0.000) | 0.902 (0.000) |
| CLUB | Individual models | 0.606 (0.001) | 0.720 (0.000) | 0.594 (0.001) | 0.457 (0.014) |
| InfoNCE | E5-V disjoint | 0.670 (0.000) | 0.708 (0.000) | 0.638 (0.000) | 0.590 (0.001) |
| InfoNCE | E5-V joint | 0.800 (0.000) | 0.717 (0.000) | 0.636 (0.000) | 0.609 (0.001) |
| InfoNCE | Individual models | 0.519 (0.005) | 0.421 (0.026) | 0.410 (0.030) | 0.275 (0.157) |
| MINE | E5-V disjoint | 0.664 (0.000) | 0.605 (0.001) | 0.269 (0.167) | 0.278 (0.153) |
| MINE | E5-V joint | 0.632 (0.000) | 0.559 (0.002) | 0.610 (0.001) | 0.308 (0.111) |
| MINE | Individual models | 0.632 (0.000) | 0.562 (0.002) | 0.632 (0.000) | 0.613 (0.001) |
| NWJ | E5-V disjoint | 0.583 (0.001) | 0.552 (0.002) | 0.513 (0.005) | 0.429 (0.023) |
| NWJ | E5-V joint | 0.502 (0.005) | 0.519 (0.005) | 0.492 (0.008) | 0.480 (0.010) |
| NWJ | Individual models | 0.510 (0.006) | 0.717 (0.000) | 0.488 (0.008) | 0.322 (0.095) |

Table 8: Ablation study for MI estimator, JSD estimator, and embedding space. We evaluate four MI estimator candidates and two JSD estimator candidates, with three different embedding space choices, in terms of the Pearson correlation coefficient between EMID and its upper bound. Across all considered variations, EMID and its upper bound (i.e., the simplified version in Theorem 4.5) show strong correlations, implying that our theorem holds robustly in practice.

| JSD estimator | MI estimator | Embedding | r (p-val) |
| --- | --- | --- | --- |
| RJSD | CLUB | E5-V disjoint | 0.618 (0.000) |
| RJSD | CLUB | E5-V joint | 0.659 (0.000) |
| RJSD | CLUB | Individual models | 0.565 (0.000) |
| RJSD | InfoNCE | E5-V disjoint | 0.618 (0.000) |
| RJSD | InfoNCE | E5-V joint | 0.617 (0.000) |
| RJSD | InfoNCE | Individual models | 0.295 (0.002) |
| RJSD | MINE | E5-V disjoint | 0.602 (0.000) |
| RJSD | MINE | E5-V joint | 0.534 (0.000) |
| RJSD | MINE | Individual models | 0.630 (0.000) |
| RJSD | NWJ | E5-V disjoint | 0.611 (0.000) |
| RJSD | NWJ | E5-V joint | 0.413 (0.000) |
| RJSD | NWJ | Individual models | 0.468 (0.000) |
| MMD | CLUB | E5-V disjoint | 0.618 (0.000) |
| MMD | CLUB | E5-V joint | 0.659 (0.000) |
| MMD | CLUB | Individual models | 0.478 (0.000) |
| MMD | InfoNCE | E5-V disjoint | 0.432 (0.000) |
| MMD | InfoNCE | E5-V joint | 0.617 (0.000) |
| MMD | InfoNCE | Individual models | 0.295 (0.002) |
| MMD | MINE | E5-V disjoint | 0.602 (0.000) |
| MMD | MINE | E5-V joint | 0.623 (0.000) |
| MMD | MINE | Individual models | 0.630 (0.000) |
| MMD | NWJ | E5-V disjoint | 0.611 (0.000) |
| MMD | NWJ | E5-V joint | 0.273 (0.004) |
| MMD | NWJ | Individual models | 0.468 (0.000) |
In Table 7, we conduct Spearman correlation analysis over the 12 (4 × 3) combinations of MI estimator and embedding space. EMI is consistently correlated with the RP score, which demonstrates the robust effectiveness of our theorem in practice. Among the candidate MI estimators and embedding spaces, CLUB and the E5-V joint embedding show outstanding results. However, E5-V embedding extraction requires a forward pass of an MLLM, in contrast to leveraging relatively small individual models (ViT-Base and a BERT-base-scale text encoder). To strike a balance between effectiveness and efficiency, we adopt the CLIP ViT-B/32 and XLM-RoBERTa-Base embedding spaces by default.
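The Spearman coefficient $\rho$ reported in Table 7 is the Pearson correlation of ranks; a dependency-free sketch (no tie correction, which suffices for the continuous scores compared here; the function name is ours):

```python
import numpy as np

def spearman_rho(a, b):
    # Rank both score lists via double argsort, then compute the Pearson
    # correlation of the centered ranks.
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))
```

Any monotone relation between EMI and the RP score yields $\rho = 1$, regardless of the (estimator-dependent) absolute scale of the EMI values.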
Next, we present the Pearson correlation analysis in Table 8, ablating the JSD estimator together with the MI estimator and embedding space choices. Although there is some variation in the exact values, we observe consistently significant correlations between EMID and its upper bound (that of Eq. 10). Therefore, the upper bound of EMID we derived holds robustly in practice across diverse estimator configurations.
C.3 Hyperparameter Sensitivity

We provide the hyperparameter configuration for MI estimator training (Table 9) and a sensitivity analysis over varying hyperparameters (Figure 7). The CLUB estimator is quite robust to varying hyperparameters, i.e., batch size, learning rate, and hidden dimension, which implies that EMI can be estimated effectively without intensive hyperparameter tuning.
Table 9: Hyperparameter tuning grid and selected values for MI estimator training. The hyperparameters are selected based on the variance over the last 10 iterations of training.

| Parameter | Selected | Sweep |
| --- | --- | --- |
| learning rate | 0.001 | {0.005, 0.001, 0.0005, 0.0001} |
| batch size | 1024 | {64, 128, 256, 512, 1024, 2048} |
| hidden dimension | 100/500 | {250, 500, 1000, 2000} |

Figure 7: Sensitivity analysis for batch size, learning rate, and hidden dimension during MI estimator training. Over the considered hyperparameter grid, MI estimates derived by the CLUB estimator robustly achieve a high Spearman correlation with the RP score (the largest deviation is less than 0.02 in absolute value).

C.4 Runtime Analysis

Table 10: Runtime comparison on the 90 samples of LLaVA-Bench COCO. We compare the actual wall-clock time (seconds) of the EMI estimation and MLLM judgment protocols by evaluating the model-generated response and reference response given input queries from the LLaVA-Bench COCO dataset (Liu et al., 2023). The EMI estimation protocol on top of CLIP ViT-B/32 and XLM-RoBERTa-Base embeddings achieves a 138× speedup over the MLLM judgment protocol with GPT-4o.

| Protocol | LLaVA v1.5 7B (per instance / dataset) | LLaVA v1.5 13B | LLaVA NeXT 7B | LLaVA NeXT 13B | Total runtime |
| --- | --- | --- | --- | --- | --- |
| MI estimator training | – | – | – | – | 663.45 |
| EMI estimation | 0.0388 / 3.5884 | 0.0392 / 3.5652 | 0.0412 / 3.7107 | 0.0411 / 3.7039 | 14.56 (×138 speedup) |
| MLLM judgment (GPT-4o API) | 5.49 / 493.75 | 5.48 / 493.59 | 5.52 / 496.66 | 5.83 / 524.45 | 2008.45 |
In addition to enabling rigorous theoretical statements, EMI also has practical advantages over the RP score derived from the LLM judge. Specifically, while both the RP score and EMI are model-dependent, the former relies on models with tens to hundreds of billions of parameters, whereas the latter enables meaningful evaluation even with relatively small embedding models with millions of parameters. To support this quantitatively, we compare the inference time per instance and over the entire LLaVA-Bench COCO dataset in Table 10. As shown in the table, EMI shortens evaluation time by a factor of 138 compared to the RP score. Even including the time required for MI estimator training, EMI-based evaluation is still 3 times faster than MLLM-judgment-based evaluation. Since MI training is performed only once and then transferred to all ID-OOD scenarios, the efficiency advantage of EMI-based evaluation over MLLM judgment becomes more evident as the number of datasets to be evaluated increases. In addition, although open-source LLM judges (Kim et al., 2024) have been actively studied recently, proprietary LLMs still dominate the LLM-judge paradigm in practice, so one must pay per query, whereas EMI can make meaningful inferences with publicly available open-source feature extractors at no per-query cost.
C.5 Notes on Practical Usage
Although the empirical estimates of EMI and of the EMID upper bound show consistent correlations with the relative preference score and EMID, respectively, the absolute values of the MI estimates vary with the estimator type and configuration. Therefore, if one uses empirical EMI estimates (computed by a two-layer MLP-based CLUB estimator on the CLIP ViT and XLM-RoBERTa embedding spaces) to assess the quality of MLLM responses, the same estimator and the same embedding models should be used across the targeted models and datasets for a fair comparison.
Appendix D Extended Theoretical Analysis with Full Proof
In this section, we provide proofs of all theorems (Lemma 4.3, Theorem 4.4, Theorem 4.5, and Theorem 4.6) in our manuscript, and introduce an additional theoretical result (Corollary D.12).
D.1 Proof for the relationship between EMI and preference model
First, we provide proof of the closeness between the effective mutual information (EMI) and the preference model.
Lemma D.1 (Restatement of Lemma 4.3). Given a distribution $P_{\mathbf{X}Y}$ and an MLLM $P_\theta$, let the reward function $r(\mathbf{x}, y)$ be $\log P_{Y|\mathbf{X}=\mathbf{x}}(y)$. If $\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{KL}(P_\theta(\cdot|\mathbf{x}) \,\|\, P_{Y|\mathbf{X}=\mathbf{x}}) \le \delta$, then

$$|\mathrm{EMI}(P_{\mathbf{X}Y}; P_\theta) - \mathrm{PM}(P_{\mathbf{X}Y}; P_\theta)| \le \delta + 4.4\,\delta^{1/8}.$$
Proof. Let $P_Y^\theta = \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} P_\theta(\cdot|\mathbf{x})$, and note the expression for EMI below:

$$\begin{aligned} \mathrm{EMI}(P_{\mathbf{X}Y}; P_\theta) &= I(P_{\mathbf{X}} \otimes P_\theta) - I(P_{\mathbf{X}Y}) \\ &= H(P_Y^\theta) - H(P_\theta(\cdot|\mathbf{X})) - H(P_Y) + H(P_{Y|\mathbf{X}}) \\ &= \big(H(P_Y^\theta) - H(P_Y)\big) + \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[\mathbb{E}_{\hat{y}\sim P_\theta(\cdot|\mathbf{x})} \log P_\theta(\hat{y}|\mathbf{x})\big] - \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[\mathbb{E}_{y\sim P_{Y|\mathbf{X}=\mathbf{x}}} \log P_{Y|\mathbf{X}=\mathbf{x}}(y)\big]. \end{aligned}$$

Next, given $r(\mathbf{x}, y) = \log P_{Y|\mathbf{X}=\mathbf{x}}(y)$, the logit Bradley-Terry preference model (PM) (Hunter, 2004) can be expressed as

$$\mathrm{PM}(P_{\mathbf{X}Y}; P_\theta) = \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[\mathbb{E}_{\hat{y}\sim P_\theta(\cdot|\mathbf{x})} \log P_{Y|\mathbf{X}=\mathbf{x}}(\hat{y}) - \mathbb{E}_{y\sim P_{Y|\mathbf{X}=\mathbf{x}}} \log P_{Y|\mathbf{X}=\mathbf{x}}(y)\big].$$

Therefore,

$$\begin{aligned} |\mathrm{EMI}(P_{\mathbf{X}Y}; P_\theta) - \mathrm{PM}(P_{\mathbf{X}Y}; P_\theta)| &= \Big| H(P_Y^\theta) - H(P_Y) + \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\Big[\mathbb{E}_{\hat{y}\sim P_\theta(\cdot|\mathbf{x})} \log \frac{P_\theta(\hat{y}|\mathbf{x})}{P_{Y|\mathbf{X}=\mathbf{x}}(\hat{y})}\Big] \Big| \\ &\le |H(P_Y^\theta) - H(P_Y)| + \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[D_{KL}(P_\theta(\cdot|\mathbf{x}) \,\|\, P_{Y|\mathbf{X}=\mathbf{x}})\big] \\ &\le 4.4\,\delta^{1/8} + \delta. \end{aligned}$$

∎

Here, we adopted Lemma D.4 to replace $|H(P_Y^\theta) - H(P_Y)|$ with its upper bound $4 D_{JS}^{1/4}(P_Y^\theta \| P_Y)$, and used Pinsker's inequality (Pinsker, 1964).
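For concreteness, the constant $4.4$ in the last step comes from chaining Lemma D.4, Lemma D.6, and Pinsker's inequality:

$$4\,D_{JS}^{1/4}(P_Y^\theta \| P_Y) \;\le\; 4\,D_{TV}^{1/4}(P_Y^\theta, P_Y) \;\le\; 4\,(2\delta)^{1/8} \;=\; 4\cdot 2^{1/8}\,\delta^{1/8} \;\approx\; 4.36\,\delta^{1/8} \;\le\; 4.4\,\delta^{1/8},$$

using $D_{JS} \le D_{TV}$ and $D_{TV}(P_Y, P_Y^\theta) \le \sqrt{2\,\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{KL}(P_\theta(\cdot|\mathbf{x}) \,\|\, P_{Y|\mathbf{X}=\mathbf{x}})} \le \sqrt{2\delta}$ from Lemma D.6.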
We provide a proof for the theorem extending Lemma D.1 by considering the optimal model parameter, as below.
Theorem D.2 (Restatement of Theorem 4.4). Given a distribution $P_{\mathbf{X}Y}$ and an MLLM $P_\theta$, assume $P_{\mathbf{X}Y} \ge c > 0$ for a constant $c$. If the $\epsilon$-representation capacity assumption holds, i.e.,

$$\min_{\theta\in\Theta} \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{KL}(P_{Y|\mathbf{X}=\mathbf{x}} \,\|\, P_\theta(\cdot|\mathbf{x})) \le \epsilon, \quad (15)$$

and the reward function $r(\mathbf{x}, y)$ is $\log P_{Y|\mathbf{X}=\mathbf{x}}(y)$, then

$$|\mathrm{EMI}(P_{\mathbf{X}Y}; \theta^*) - \mathrm{PM}(P_{\mathbf{X}Y}; \theta^*)| \le \delta + 4.4\,\delta^{1/8},$$

where $\theta^*$ is the optimal solution of Eq. (1) over $P_{\mathbf{X}Y}$, and $\delta = 4.4\,\epsilon^{1/8} - \log c\,\sqrt{2\epsilon}$.
Proof. Recall the formulation of mutual information:

$$\begin{aligned} I(P_{\mathbf{X}Y}) &= H(P_Y) - H(P_{Y|\mathbf{X}}) \\ &= \mathbb{E}_{\mathbf{x},y\sim P_{\mathbf{X}Y}}[\log P_{Y|\mathbf{X}=\mathbf{x}}(y)] + H(P_Y) \\ &= \mathbb{E}_{\mathbf{x},y\sim P_{\mathbf{X}Y}}[\log P_\theta(y|\mathbf{x})] - \mathbb{E}_{\mathbf{x},y\sim P_{\mathbf{X}Y}}[\log P_\theta(y|\mathbf{x})] + \mathbb{E}_{\mathbf{x},y\sim P_{\mathbf{X}Y}}[\log P_{Y|\mathbf{X}=\mathbf{x}}(y)] + H(P_Y) \\ &= \mathbb{E}_{\mathbf{x},y\sim P_{\mathbf{X}Y}}[\log P_\theta(y|\mathbf{x})] + \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[D_{KL}(P_{Y|\mathbf{X}=\mathbf{x}} \,\|\, P_\theta(\cdot|\mathbf{x}))\big] + H(P_Y). \end{aligned}$$

So $I(P_{\mathbf{X}Y}) - \mathbb{E}_{\mathbf{x},y\sim P_{\mathbf{X}Y}}[\log P_\theta(y|\mathbf{x})] - H(P_Y) = \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[D_{KL}(P_{Y|\mathbf{X}=\mathbf{x}} \,\|\, P_\theta(\cdot|\mathbf{x}))\big]$. Therefore, the optimally learned parameter satisfies $\theta^* \in \arg\min_{\theta\in\Theta} \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[D_{KL}(P_{Y|\mathbf{X}=\mathbf{x}} \,\|\, P_\theta(\cdot|\mathbf{x}))\big]$, which implies

$$\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[D_{KL}(P_{Y|\mathbf{X}=\mathbf{x}} \,\|\, P_{\theta^*}(\cdot|\mathbf{x}))\big] \le \epsilon.$$

Meanwhile, Lemma D.7 yields the upper bound

$$\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[D_{KL}(P_{\theta^*}(\cdot|\mathbf{x}) \,\|\, P_{Y|\mathbf{X}=\mathbf{x}})\big] \le 4.4\,\epsilon^{1/8} - \log c\,\sqrt{2\epsilon}.$$

By denoting $\delta := 4.4\,\epsilon^{1/8} - \log c\,\sqrt{2\epsilon}$ and plugging into Lemma D.1, we complete the proof. ∎
Note that the assumption $P_{\mathbf{X}Y} \ge c > 0$ is reasonable in practice, given the following two observations. First, we can only observe samples for which $P_{\mathbf{X}Y}(\mathbf{x}, y) > 0$; hence restricting attention to such $(\mathbf{x}, y)$ does not affect the practical implications of our analysis. Second, for the space $\mathcal{X}\times\mathcal{Y}$, it is clear that $|\mathcal{X}\times\mathcal{Y}| < +\infty$. Therefore $\mathcal{X}\times\mathcal{Y}$ is a compact space, and since $P_{\mathbf{X}Y}(\mathbf{x}, y) > 0$ over a compact space, there exists a constant $c > 0$ such that $P_{\mathbf{X}Y}(\mathbf{x}, y) \ge c$.
Lemma D.3. Given two distributions $P_X$ and $Q_X$ defined over $\mathcal{X}$, let $f: \mathcal{X} \to [0, c]$. Then

$$|\mathbb{E}_{x\sim P_X}[f(x)] - \mathbb{E}_{x\sim Q_X}[f(x)]| \le c \cdot D_{TV}(P_X, Q_X),$$

where $D_{TV}(P_X, Q_X) := \sum_{x\in\mathcal{X}} |P_X(x) - Q_X(x)|$ is the total variation distance between the two distributions.
Proof.

$$\begin{aligned} |\mathbb{E}_{x\sim P_X}[f(x)] - \mathbb{E}_{x\sim Q_X}[f(x)]| &= \Big|\sum_{x\in\mathcal{X}} P_X(x) f(x) - \sum_{x\in\mathcal{X}} Q_X(x) f(x)\Big| \\ &= \Big|\sum_{x\in\mathcal{X}} \big(P_X(x) - Q_X(x)\big) f(x)\Big| \\ &= \Big|\sum_{x\in\mathcal{X}} \big(P_X(x) - Q_X(x)\big)\big(f(x) - c\big) + c\Big(\sum_{x\in\mathcal{X}} P_X(x) - Q_X(x)\Big)\Big| \\ &\le \sum_{x\in\mathcal{X}} |P_X(x) - Q_X(x)| \cdot |f(x) - c| \\ &\le c \cdot \|P_X - Q_X\|_1 = c \cdot D_{TV}(P_X, Q_X). \end{aligned}$$

∎
Lemma D.4. Given a random variable $\mathbf{X}$ and two distributions $P_{\mathbf{X}Y} = P_{Y|\mathbf{X}} P_{\mathbf{X}}$ and $Q_{\mathbf{X}Y} = Q_{Y|\mathbf{X}} Q_{\mathbf{X}}$, the differences of the entropy $H(\cdot)$ under the two distributions are bounded as follows:

$$\begin{aligned} |H(P_{\mathbf{X}}) - H(Q_{\mathbf{X}})| &\le 4 D_{JS}^{1/4}(P_{\mathbf{X}} \| Q_{\mathbf{X}}), \\ |H(P_Y) - H(Q_Y)| &\le 4 D_{JS}^{1/4}(P_Y \| Q_Y), \\ |H(P_{Y|\mathbf{X}=\mathbf{x}}) - H(Q_{Y|\mathbf{X}=\mathbf{x}})| &\le 4 D_{JS}^{1/4}(P_{Y|\mathbf{X}=\mathbf{x}} \| Q_{Y|\mathbf{X}=\mathbf{x}}), \end{aligned}$$

where $D_{JS}(\cdot\|\cdot)$ is the Jensen-Shannon divergence between two distributions.
Proof. Let $M_{\mathbf{X}} = (P_{\mathbf{X}} + Q_{\mathbf{X}})/2$, and let $D_{JS}(P_{\mathbf{X}}\|Q_{\mathbf{X}})$ and $D_{TV}(P_{\mathbf{X}}, Q_{\mathbf{X}})$ be the Jensen-Shannon divergence and total variation distance between $P_{\mathbf{X}}$ and $Q_{\mathbf{X}}$, respectively. Then

$$\begin{aligned} |H(P_{\mathbf{X}}) - H(Q_{\mathbf{X}})| &= |\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\log P_{\mathbf{X}}(\mathbf{x}) - \mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}}\log Q_{\mathbf{X}}(\mathbf{x})| \\ &= |\mathbb{E}_{P_{\mathbf{X}}}\log P_{\mathbf{X}} - \mathbb{E}_{P_{\mathbf{X}}}\log M_{\mathbf{X}} - \mathbb{E}_{Q_{\mathbf{X}}}\log Q_{\mathbf{X}} + \mathbb{E}_{Q_{\mathbf{X}}}\log M_{\mathbf{X}} + \mathbb{E}_{P_{\mathbf{X}}}\log M_{\mathbf{X}} - \mathbb{E}_{Q_{\mathbf{X}}}\log M_{\mathbf{X}}| \\ &\le \Big|\mathbb{E}_{P_{\mathbf{X}}}\log\frac{P_{\mathbf{X}}}{M_{\mathbf{X}}} + \mathbb{E}_{Q_{\mathbf{X}}}\log\frac{Q_{\mathbf{X}}}{M_{\mathbf{X}}}\Big| + \big|\mathbb{E}_{P_{\mathbf{X}}}\log M_{\mathbf{X}} - \mathbb{E}_{Q_{\mathbf{X}}}\log M_{\mathbf{X}}\big| \\ &\le 2 D_{JS}(P_{\mathbf{X}}\|Q_{\mathbf{X}}) + 2\sum_{x}\Big|\frac{P_{\mathbf{X}}(x)}{2} - \frac{Q_{\mathbf{X}}(x)}{2}\Big|\cdot|\log M_{\mathbf{X}}(x)| \\ &= 2 D_{JS}(P_{\mathbf{X}}\|Q_{\mathbf{X}}) + 2\sum_{x}\Big|\frac{P_{\mathbf{X}}(x)}{2} - \frac{Q_{\mathbf{X}}(x)}{2}\Big|\cdot\Big|\log\Big(\frac{P_{\mathbf{X}}(x)}{2} + \frac{Q_{\mathbf{X}}(x)}{2}\Big)\Big| \\ &\le 2 D_{JS}(P_{\mathbf{X}}\|Q_{\mathbf{X}}) + 2\sum_{x}\Big|\frac{P_{\mathbf{X}}(x)}{2} - \frac{Q_{\mathbf{X}}(x)}{2}\Big|\cdot\Big|\log\Big|\frac{P_{\mathbf{X}}(x)}{2} - \frac{Q_{\mathbf{X}}(x)}{2}\Big|\Big| \\ &\le 2 D_{JS}(P_{\mathbf{X}}\|Q_{\mathbf{X}}) + 2\sqrt{\sum_{x}\Big|\frac{P_{\mathbf{X}}(x)}{2} - \frac{Q_{\mathbf{X}}(x)}{2}\Big|} \\ &= 2 D_{JS}(P_{\mathbf{X}}\|Q_{\mathbf{X}}) + \sqrt{2 D_{TV}(P_{\mathbf{X}}, Q_{\mathbf{X}})} \\ &\le 2 D_{JS}(P_{\mathbf{X}}\|Q_{\mathbf{X}}) + 2 D_{JS}^{1/4}(P_{\mathbf{X}}\|Q_{\mathbf{X}}) \\ &\le 4 D_{JS}^{1/4}(P_{\mathbf{X}}\|Q_{\mathbf{X}}). \end{aligned}$$

In the above inequalities, we used $x + x\log x \ge 0$ for $x\in(0,1)$, Hölder's inequality, and $D_{TV}(P_{\mathbf{X}}, Q_{\mathbf{X}}) \le \sqrt{2 D_{JS}(P_{\mathbf{X}}\|Q_{\mathbf{X}})}$, proved in Lemma 3 of Thekumparampil et al. (2018). The remaining two inequalities,

$$|H(P_Y) - H(Q_Y)| \le 4 D_{JS}^{1/4}(P_Y\|Q_Y), \qquad |H(P_{Y|\mathbf{X}=\mathbf{x}}) - H(Q_{Y|\mathbf{X}=\mathbf{x}})| \le 4 D_{JS}^{1/4}(P_{Y|\mathbf{X}=\mathbf{x}}\|Q_{Y|\mathbf{X}=\mathbf{x}}),$$

can be proved with the same strategy. ∎
Corollary D.5. For a data distribution $P_{\mathbf{X}Y} = P_{Y|\mathbf{X}} P_{\mathbf{X}}$, an MLLM $P_\theta(\cdot|\mathbf{x})$, and Kullback-Leibler divergence $D_{KL}$, if $\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{KL}(P_{Y|\mathbf{X}=\mathbf{x}} \,\|\, P_\theta(\cdot|\mathbf{x})) \le \epsilon$ for a constant $\epsilon$, then

$$\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[H(P_\theta(\cdot|\mathbf{x})) - H(P_{Y|\mathbf{X}=\mathbf{x}})\big] \le 4.4\,\epsilon^{1/8}.$$
Proof.

$$\begin{aligned} \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[H(P_\theta(\cdot|\mathbf{x})) - H(P_{Y|\mathbf{X}=\mathbf{x}})\big] &\le \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[|H(P_\theta(\cdot|\mathbf{x})) - H(P_{Y|\mathbf{X}=\mathbf{x}})|\big] \\ &\le 4\,\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{JS}^{1/4}(P_{Y|\mathbf{X}=\mathbf{x}} \,\|\, P_\theta(\cdot|\mathbf{x})) \\ &\le 4\cdot 2^{1/8}\,\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{KL}^{1/8}(P_{Y|\mathbf{X}=\mathbf{x}} \,\|\, P_\theta(\cdot|\mathbf{x})) \\ &\le 4.4\,\epsilon^{1/8}. \end{aligned}$$

We started from Lemma D.4 and used Pinsker's inequality (Pinsker, 1964) to obtain $D_{JS}(\cdot\|\cdot) \le D_{TV}(\cdot,\cdot) \le \sqrt{2 D_{KL}(\cdot\|\cdot)}$; the last step follows from Jensen's inequality, since $z \mapsto z^{1/8}$ is concave. ∎
Lemma D.6. For a data distribution $P_{\mathbf{X}Y} = P_{Y|\mathbf{X}} P_{\mathbf{X}}$ and an MLLM $P_\theta(\cdot|\mathbf{x})$, let $D_{KL}$ and $D_{TV}$ be the Kullback-Leibler divergence and total variation distance, respectively. Denoting $P_Y^\theta = \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}[P_\theta(\cdot|\mathbf{x})]$, we have the inequality

$$D_{TV}(P_Y, P_Y^\theta) \le \sqrt{2\,\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{KL}(P_\theta(\cdot|\mathbf{x}) \,\|\, P_{Y|\mathbf{X}=\mathbf{x}})}.$$
Proof.

$$\begin{aligned} D_{TV}(P_Y, P_Y^\theta) &= \sum_{y\in\mathcal{Y}} \big|\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} P_{Y|\mathbf{X}=\mathbf{x}}(y) - \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} P_\theta(y|\mathbf{x})\big| \\ &\le \sum_{y\in\mathcal{Y}} \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} \big|P_{Y|\mathbf{X}=\mathbf{x}}(y) - P_\theta(y|\mathbf{x})\big| \\ &= \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{TV}(P_{Y|\mathbf{X}=\mathbf{x}}, P_\theta(\cdot|\mathbf{x})) \\ &\le \sqrt{2\,\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{KL}(P_\theta(\cdot|\mathbf{x}) \,\|\, P_{Y|\mathbf{X}=\mathbf{x}})}, \end{aligned}$$

where the last step uses Pinsker's inequality and Jensen's inequality. ∎
Lemma D.7. For a data distribution $P_{\mathbf{X}Y} = P_{Y|\mathbf{X}} P_{\mathbf{X}}$, an MLLM $P_\theta(\cdot|\mathbf{x})$, and Kullback-Leibler divergence $D_{KL}$, if $\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{KL}(P_{Y|\mathbf{X}=\mathbf{x}} \,\|\, P_\theta(\cdot|\mathbf{x})) \le \epsilon$ and $P_{\mathbf{X}Y} \ge c > 0$ for constants $c$ and $\epsilon$, then

$$\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{KL}(P_\theta(\cdot|\mathbf{x}) \,\|\, P_{Y|\mathbf{X}=\mathbf{x}}) \le 4.4\,\epsilon^{1/8} - \log c\,\sqrt{2\epsilon}.$$
Proof. Note that

$$\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{KL}(P_\theta(\cdot|\mathbf{x}) \,\|\, P_{Y|\mathbf{X}=\mathbf{x}}) = \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[H(P_{Y|\mathbf{X}=\mathbf{x}}) - H(P_\theta(\cdot|\mathbf{x}))\big] + \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[\mathbb{E}_{y\sim P_{Y|\mathbf{X}=\mathbf{x}}} \log P_{Y|\mathbf{X}=\mathbf{x}}(y) - \mathbb{E}_{y\sim P_\theta(\cdot|\mathbf{x})} \log P_{Y|\mathbf{X}=\mathbf{x}}(y)\big].$$

Therefore,

$$\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{KL}(P_\theta(\cdot|\mathbf{x}) \,\|\, P_{Y|\mathbf{X}=\mathbf{x}}) \le \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[|H(P_{Y|\mathbf{X}=\mathbf{x}}) - H(P_\theta(\cdot|\mathbf{x}))|\big] + \Big|\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[\mathbb{E}_{y\sim P_\theta(\cdot|\mathbf{x})} \log P_{Y|\mathbf{X}=\mathbf{x}}(y) - \mathbb{E}_{y\sim P_{Y|\mathbf{X}=\mathbf{x}}} \log P_{Y|\mathbf{X}=\mathbf{x}}(y)\big]\Big|.$$

Given Lemma D.3 with $f(y) = -\log P_{Y|\mathbf{X}=\mathbf{x}}(y) \in [0, -\log c]$, it is easy to check that

$$\Big|\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[\mathbb{E}_{y\sim P_\theta(\cdot|\mathbf{x})} \log P_{Y|\mathbf{X}=\mathbf{x}}(y) - \mathbb{E}_{y\sim P_{Y|\mathbf{X}=\mathbf{x}}} \log P_{Y|\mathbf{X}=\mathbf{x}}(y)\big]\Big| \le -\log c\; \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{TV}(P_\theta(\cdot|\mathbf{x}), P_{Y|\mathbf{X}=\mathbf{x}}) \le -\log c\,\sqrt{2\epsilon}.$$

Then, with Corollary D.5, we complete this proof. ∎
D.2 Proof for the EMID upper bound
Now, we give the proof for the upper bound of the EMI Difference (EMID) as below.
Theorem D.8 (General scenario). Given distributions $P_{\mathbf{X}Y}$ and $Q_{\mathbf{X}Y}$ and an MLLM $P_\theta$, if there exist constants $\delta_P$ and $\delta_Q$ such that

$$D_{JS}(P_Y^\theta \| P_Y) \le \delta_P, \qquad D_{JS}(Q_Y^\theta \| Q_Y) \le \delta_Q,$$

where $P_Y^\theta = \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} P_\theta(\cdot|\mathbf{x})$ and $Q_Y^\theta = \mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}} P_\theta(\cdot|\mathbf{x})$, then $\mathrm{EMID}(P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}; P_\theta)$ is upper bounded by

$$\hat{H}\big(D_{JS}^{1/2}(P_{X_v}\|Q_{X_v}) + D_{JS}^{1/2}(P_{X_t}\|Q_{X_t})\big) + \hat{H}\big(\bar{D}_{JS}^{1/2}(P_{X_t|X_v}\|Q_{X_t|X_v}) + \bar{D}_{JS}^{1/2}(P_{X_v|X_t}\|Q_{X_v|X_t})\big) + 4\,\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\big[D_{JS}^{1/4}(P_{Y|\mathbf{X}=\mathbf{x}} \| Q_{Y|\mathbf{X}=\mathbf{x}})\big] + 8\Delta^{1/4},$$

where $\Delta = \delta_P + \delta_Q$, $\hat{H} = \max_{\mathbf{x}\in\mathcal{X}}[H(Q_{Y|\mathbf{X}=\mathbf{x}}) + H(P_\theta(\cdot|\mathbf{x}))]$, and

$$\bar{D}_{JS}(P_{X|X'} \| Q_{X|X'}) := \mathbb{E}_{\mathbf{x}\sim P_{X'}} D_{JS}(P_{X|X'=\mathbf{x}} \| Q_{X|X'=\mathbf{x}}) + \mathbb{E}_{\mathbf{x}\sim Q_{X'}} D_{JS}(P_{X|X'=\mathbf{x}} \| Q_{X|X'=\mathbf{x}}).$$
Proof. Let $P_Y^\theta = \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} P_\theta(\cdot|\mathbf{x})$ and $Q_Y^\theta = \mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}} P_\theta(\cdot|\mathbf{x})$; EMID can be expressed with entropy and conditional entropy terms as follows:

$$\begin{aligned} \mathrm{EMID}(P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}; P_\theta) &= \mathrm{EMI}(P_{\mathbf{X}Y}; P_\theta) - \mathrm{EMI}(Q_{\mathbf{X}Y}; P_\theta) \\ &= \big(H(P_Y^\theta) - \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))] - H(P_Y) + H(P_{Y|\mathbf{X}})\big) - \big(H(Q_Y^\theta) - \mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))] - H(Q_Y) + H(Q_{Y|\mathbf{X}})\big) \\ &\le \big(H(P_{Y|\mathbf{X}}) - H(Q_{Y|\mathbf{X}})\big) + \big(\mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))] - \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))]\big) + |H(P_Y^\theta) - H(P_Y)| + |H(Q_Y) - H(Q_Y^\theta)| \\ &\le \big(H(P_{Y|\mathbf{X}}) - H(Q_{Y|\mathbf{X}})\big) + \big(\mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))] - \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))]\big) + 4\big(D_{JS}^{1/4}(P_Y^\theta \| P_Y) + D_{JS}^{1/4}(Q_Y^\theta \| Q_Y)\big) \\ &\le \underbrace{\big(H(P_{Y|\mathbf{X}}) - H(Q_{Y|\mathbf{X}})\big)}_{\text{(A)}} + \underbrace{\big(\mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))] - \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))]\big)}_{\text{(B)}} + 8\Delta^{1/4}, \end{aligned} \quad (16)$$

where $\Delta := \delta_P + \delta_Q$. Now, we derive upper bounds for the terms (A) and (B) independently. First, by adopting Lemma D.4, we can express the term $H(P_{Y|\mathbf{X}})$ as below.
$$\begin{aligned}
H(P_{Y|\mathbf{X}}) &= \mathbb{E}_{P_{\mathbf{X}}}\left[H(P_{Y|\mathbf{X}=\mathbf{x}}) - H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] + \mathbb{E}_{P_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] \\
&\le \mathbb{E}_{P_{\mathbf{X}}}\left[\left|H(P_{Y|\mathbf{X}=\mathbf{x}}) - H(Q_{Y|\mathbf{X}=\mathbf{x}})\right|\right] + \mathbb{E}_{P_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] \\
&\le 4\,\mathbb{E}_{P_{\mathbf{X}}}\left[D_{\mathrm{JS}}^{\frac{1}{4}}(P_{Y|\mathbf{X}=\mathbf{x}} \| Q_{Y|\mathbf{X}=\mathbf{x}})\right] + \mathbb{E}_{P_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] \\
&= 4\,\mathbb{E}_{P_{\mathbf{X}}}\left[D_{\mathrm{JS}}^{\frac{1}{4}}(P_{Y|\mathbf{X}=\mathbf{x}} \| Q_{Y|\mathbf{X}=\mathbf{x}})\right] + \mathbb{E}_{P_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] - \mathbb{E}_{Q_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] + \mathbb{E}_{Q_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right].
\end{aligned}$$
Then, since $\mathbb{E}_{Q_{\mathbf{X}}}[H(Q_{Y|\mathbf{X}=\mathbf{x}})] = H(Q_{Y|\mathbf{X}})$, the term (A) of ineq. (16), i.e., $H(P_{Y|\mathbf{X}}) - H(Q_{Y|\mathbf{X}})$, is bounded as below:

$$H(P_{Y|\mathbf{X}}) - H(Q_{Y|\mathbf{X}}) \le 4\,\mathbb{E}_{P_{\mathbf{X}}}\left[D_{\mathrm{JS}}^{\frac{1}{4}}(P_{Y|\mathbf{X}=\mathbf{x}} \| Q_{Y|\mathbf{X}=\mathbf{x}})\right] + \mathbb{E}_{P_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] - \mathbb{E}_{Q_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right].$$
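The entropy-difference step above, $|H(P) - H(Q)| \le 4\,D_{\mathrm{JS}}^{1/4}(P\|Q)$, can be spot-checked numerically on small discrete distributions (an illustrative check only; it does not replace the lemma invoked in the proof):

```python
import math

# Spot-check of the entropy-difference bound used above:
#   |H(P) - H(Q)| <= 4 * D_JS(P || Q) ** 0.25
# on a few small discrete distributions (entropies and divergences in nats).

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

pairs = [([0.5, 0.5], [0.9, 0.1]),
         ([1.0, 0.0], [0.5, 0.5]),
         ([0.6, 0.3, 0.1], [0.1, 0.3, 0.6])]
for p, q in pairs:
    assert abs(entropy(p) - entropy(q)) <= 4 * js(p, q) ** 0.25 + 1e-12
print("entropy-difference bound holds on all test cases")
```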
To replace $\mathbb{E}_{P_{\mathbf{X}}}[H(Q_{Y|\mathbf{X}=\mathbf{x}})] - \mathbb{E}_{Q_{\mathbf{X}}}[H(Q_{Y|\mathbf{X}=\mathbf{x}})]$ with a more interpretable term, we start from a restatement of Lemma 1 of Shui et al. (2022).
Lemma D.9 (restatement of Lemma 1 of Shui et al. (2022)).
Let $Z \in \mathcal{Z}$ be a real-valued integrable random variable, and denote two distributions on a common space $\mathcal{Z}$ by $P$ and $Q$ such that $Q$ is absolutely continuous w.r.t. $P$. For any function $f$ and any $\lambda \in \mathbb{R}$ such that $\mathbb{E}_P[\exp(\lambda(f(z) - \mathbb{E}_P[f(z)]))] < \infty$, we have:

$$\lambda\left(\mathbb{E}_Q[f(z)] - \mathbb{E}_P[f(z)]\right) \le D_{\mathrm{KL}}(Q \| P) + \log \mathbb{E}_P\left[\exp\left(\lambda\left(f(z) - \mathbb{E}_P[f(z)]\right)\right)\right].$$
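As a quick numerical sanity check (illustrative only, not part of the proof), this change-of-measure inequality can be verified on a small discrete space; the distributions `P`, `Q` and the function `f` below are arbitrary choices:

```python
import math

# Numerical sanity check of the change-of-measure inequality in Lemma D.9:
#   lambda * (E_Q[f] - E_P[f]) <= D_KL(Q || P) + log E_P[exp(lambda*(f - E_P[f]))]
# on a small discrete space.

def kl(q, p):
    """KL divergence D_KL(q || p) in nats for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]          # absolutely continuous w.r.t. P
f = [0.0, 1.0, 2.5]          # arbitrary bounded function on the space

Ef_P = sum(pi * fi for pi, fi in zip(P, f))
Ef_Q = sum(qi * fi for qi, fi in zip(Q, f))

for lam in [-2.0, -0.5, 0.5, 1.0, 2.0]:
    lhs = lam * (Ef_Q - Ef_P)
    log_mgf = math.log(sum(pi * math.exp(lam * (fi - Ef_P)) for pi, fi in zip(P, f)))
    assert lhs <= kl(Q, P) + log_mgf + 1e-12, (lam, lhs)
print("change-of-measure inequality holds for all tested lambda")
```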
Now, let $\mathbf{X}$ and $Y$ denote observable variables from a joint distribution $D_{\mathbf{X}Y} \in \{P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}\}$, and let $f(\mathbf{x}) := H(Q_{Y|\mathbf{X}=\mathbf{x}}) \ge 0$ be the loss function of our interest, i.e., the conditional entropy of $y$ given $\mathbf{x}$. Then, $f$ has a finite value of $\mathbb{E}_D[\exp(\lambda(f(\mathbf{x}) - \mathbb{E}_D[f(\mathbf{x})]))]$, and is bounded within the interval $[0, \hat{H}(Q_{Y|\mathbf{x}})]$, where $\hat{H}(Q_{Y|\mathbf{x}}) := \max_{\mathbf{x}\in\mathcal{X}} H(Q_{Y|\mathbf{X}=\mathbf{x}})$.
We next define a mixture distribution $M_{\mathbf{X}Y} := \frac{1}{2}(P_{\mathbf{X}Y} + Q_{\mathbf{X}Y})$, whose support covers that of both $P_{\mathbf{X}Y}$ and $Q_{\mathbf{X}Y}$. Then, we get the inequality below by setting $P \leftarrow M_{\mathbf{X}Y}$ and $Q \leftarrow Q_{\mathbf{X}Y}$ for all $\lambda > 0$ according to Lemma D.9:
$$\mathbb{E}_{Q_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] - \mathbb{E}_{M_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] \le \frac{1}{\lambda}\left(\log \mathbb{E}_{M_{\mathbf{X}}}\left[\exp\left(\lambda\left(f(\mathbf{x}) - \mathbb{E}_{M_{\mathbf{X}}}[f(\mathbf{x})]\right)\right)\right] + D_{\mathrm{KL}}(Q_{\mathbf{X}} \| M_{\mathbf{X}})\right). \tag{17}$$
Also, we get a similar inequality by setting $P \leftarrow M_{\mathbf{X}Y}$ and $Q \leftarrow P_{\mathbf{X}Y}$ for all $\lambda < 0$ according to Lemma D.9:

$$\mathbb{E}_{P_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] - \mathbb{E}_{M_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] \ge \frac{1}{\lambda}\left(\log \mathbb{E}_{M_{\mathbf{X}}}\left[\exp\left(\lambda\left(f(\mathbf{x}) - \mathbb{E}_{M_{\mathbf{X}}}[f(\mathbf{x})]\right)\right)\right] + D_{\mathrm{KL}}(P_{\mathbf{X}} \| M_{\mathbf{X}})\right). \tag{18}$$
Meanwhile, given that $f(\mathbf{x})$ is bounded within the interval $[0, \hat{H}(Q_{Y|\mathbf{x}})]$, the quantity $f(\mathbf{x}) - \mathbb{E}_{M_{\mathbf{X}}}[f(\mathbf{x})]$ is sub-Gaussian (Wainwright, 2019) with scale parameter at most $\sigma = \hat{H}(Q_{Y|\mathbf{x}})/2$. We can then leverage the sub-Gaussian property of the log moment generating function:

$$\log \mathbb{E}_{M_{\mathbf{X}}}\left[\exp\left(\lambda\left(f(\mathbf{x}) - \mathbb{E}_{M_{\mathbf{X}}}[f(\mathbf{x})]\right)\right)\right] \le \log\left(\exp\left(\frac{\lambda^2\sigma^2}{2}\right)\right) \le \frac{\lambda^2 \hat{H}(Q_{Y|\mathbf{x}})^2}{8}. \tag{19}$$
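This log-MGF bound for a bounded variable (Hoeffding's lemma) can be spot-checked numerically; the toy distribution below is an arbitrary choice, not from the paper:

```python
import math

# Sanity check of the sub-Gaussian log-MGF bound in ineq. (19):
# for f bounded in [0, B], log E[exp(lam*(f - E[f]))] <= lam^2 * B^2 / 8
# (Hoeffding's lemma), on a toy discrete distribution.

B = 2.0
probs = [0.1, 0.4, 0.3, 0.2]
vals  = [0.0, 0.5, 1.5, 2.0]   # all within [0, B]

mean = sum(p * v for p, v in zip(probs, vals))
for lam in [x / 4 for x in range(-12, 13)]:
    log_mgf = math.log(sum(p * math.exp(lam * (v - mean)) for p, v in zip(probs, vals)))
    assert log_mgf <= lam**2 * B**2 / 8 + 1e-12, (lam, log_mgf)
print("log-MGF bound holds for all tested lambda")
```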
By plugging ineq. (19) into ineq. (17) and ineq. (18), we can derive the following new inequalities:
$$\begin{aligned}
\mathbb{E}_{Q_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] - \mathbb{E}_{M_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] &\le \frac{\lambda_0 \hat{H}(Q_{Y|\mathbf{x}})^2}{8} + \frac{1}{\lambda_0} D_{\mathrm{KL}}(Q_{\mathbf{X}} \| M_{\mathbf{X}}), \\
\mathbb{E}_{M_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] - \mathbb{E}_{P_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] &\le \frac{\lambda_0 \hat{H}(Q_{Y|\mathbf{x}})^2}{8} + \frac{1}{\lambda_0} D_{\mathrm{KL}}(P_{\mathbf{X}} \| M_{\mathbf{X}}),
\end{aligned}$$

where $\lambda_0 = \lambda$ stands for $\lambda > 0$ in ineq. (17), and $\lambda_0 = -\lambda$ for $\lambda < 0$ in ineq. (18).
Adding both inequalities above and setting $\lambda_0 = \frac{2}{\hat{H}(Q_{Y|\mathbf{x}})}\sqrt{D_{\mathrm{KL}}(P_{\mathbf{X}} \| M_{\mathbf{X}}) + D_{\mathrm{KL}}(Q_{\mathbf{X}} \| M_{\mathbf{X}})}$ results in:

$$\mathbb{E}_{Q_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] - \mathbb{E}_{P_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] \le \hat{H}(Q_{Y|\mathbf{x}})\sqrt{2\,D_{\mathrm{JS}}(P_{\mathbf{X}} \| Q_{\mathbf{X}})}. \tag{20}$$
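Ineq. (20) is an instance of a generic fact: for any $f$ bounded in $[0, B]$, $\mathbb{E}_Q[f] - \mathbb{E}_P[f] \le B\sqrt{2\,D_{\mathrm{JS}}(P \| Q)}$. A toy numeric check follows, with made-up values of `f` standing in for $H(Q_{Y|\mathbf{X}=\mathbf{x}})$ and `B` for $\hat{H}(Q_{Y|\mathbf{x}})$:

```python
import math

# Numerical check of the bound in ineq. (20): for any f bounded in [0, B],
#   E_Q[f] - E_P[f] <= B * sqrt(2 * D_JS(P || Q)).

def kl(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

B = 1.5
P_X = [0.6, 0.3, 0.1]
Q_X = [0.1, 0.3, 0.6]
f   = [0.2, 0.9, 1.5]          # hypothetical values of H(Q_{Y|X=x}), within [0, B]

gap = sum(q * v for q, v in zip(Q_X, f)) - sum(p * v for p, v in zip(P_X, f))
bound = B * math.sqrt(2 * js(P_X, Q_X))
assert gap <= bound + 1e-12
print(f"gap = {gap:.4f} <= bound = {bound:.4f}")
```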
Next, the chain-rule decomposition of the KL divergence and the definition of the JS divergence lead to the following inequality (the second equality applies the chain rule twice, once conditioning on $X_v$ and once on $X_t$, and averages the two decompositions for each of $P$ and $Q$):

$$\begin{aligned}
2\,D_{\mathrm{JS}}(P_{X_v X_t} \| Q_{X_v X_t}) &= D_{\mathrm{KL}}(P_{X_v X_t} \| M_{X_v X_t}) + D_{\mathrm{KL}}(Q_{X_v X_t} \| M_{X_v X_t}) \\
&= \tfrac{1}{2}\left(D_{\mathrm{KL}}(P_{X_v} \| M_{X_v}) + \mathbb{E}_{P_{X_v}} D_{\mathrm{KL}}(P_{X_t|X_v=x_v} \| M_{X_t|X_v=x_v})\right) \\
&\quad + \tfrac{1}{2}\left(D_{\mathrm{KL}}(Q_{X_v} \| M_{X_v}) + \mathbb{E}_{Q_{X_v}} D_{\mathrm{KL}}(Q_{X_t|X_v=x_v} \| M_{X_t|X_v=x_v})\right) \\
&\quad + \tfrac{1}{2}\left(D_{\mathrm{KL}}(P_{X_t} \| M_{X_t}) + \mathbb{E}_{P_{X_t}} D_{\mathrm{KL}}(P_{X_v|X_t=x_t} \| M_{X_v|X_t=x_t})\right) \\
&\quad + \tfrac{1}{2}\left(D_{\mathrm{KL}}(Q_{X_t} \| M_{X_t}) + \mathbb{E}_{Q_{X_t}} D_{\mathrm{KL}}(Q_{X_v|X_t=x_t} \| M_{X_v|X_t=x_t})\right) \\
&= D_{\mathrm{JS}}(P_{X_v} \| Q_{X_v}) + D_{\mathrm{JS}}(P_{X_t} \| Q_{X_t}) \\
&\quad + \tfrac{1}{2}\left(\mathbb{E}_{P_{X_v}} D_{\mathrm{KL}}(P_{X_t|X_v=x_v} \| M_{X_t|X_v=x_v}) + \mathbb{E}_{Q_{X_v}} D_{\mathrm{KL}}(Q_{X_t|X_v=x_v} \| M_{X_t|X_v=x_v})\right) \\
&\quad + \tfrac{1}{2}\left(\mathbb{E}_{P_{X_t}} D_{\mathrm{KL}}(P_{X_v|X_t=x_t} \| M_{X_v|X_t=x_t}) + \mathbb{E}_{Q_{X_t}} D_{\mathrm{KL}}(Q_{X_v|X_t=x_t} \| M_{X_v|X_t=x_t})\right) \\
&\le D_{\mathrm{JS}}(P_{X_v} \| Q_{X_v}) + D_{\mathrm{JS}}(P_{X_t} \| Q_{X_t}) + \bar{D}_{\mathrm{JS}}(P_{X_t|X_v} \| Q_{X_t|X_v}) + \bar{D}_{\mathrm{JS}}(P_{X_v|X_t} \| Q_{X_v|X_t}),
\end{aligned}$$

where $\bar{D}_{\mathrm{JS}}(P_{Y|X} \| Q_{Y|X}) := \mathbb{E}_{x\sim P_X} D_{\mathrm{JS}}(P_{Y|X=x} \| Q_{Y|X=x}) + \mathbb{E}_{x\sim Q_X} D_{\mathrm{JS}}(P_{Y|X=x} \| Q_{Y|X=x})$.
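The first equality in the display above rests on the chain rule for the KL divergence. A toy numeric verification of that chain rule on a small 2×2 joint (the distributions are arbitrary choices, illustrative only):

```python
import math

# Numerical check of the chain rule for KL divergence used above:
#   D_KL(P_XY || M_XY) = D_KL(P_X || M_X) + E_{x~P_X} D_KL(P_{Y|X=x} || M_{Y|X=x}),
# with M_XY = (P_XY + Q_XY) / 2, on a toy 2x2 joint distribution.

def kl(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# joint distributions over (x, y) in {0,1} x {0,1}, flattened row-major
P = [0.4, 0.1, 0.2, 0.3]
Q = [0.1, 0.3, 0.4, 0.2]
M = [(p + q) / 2 for p, q in zip(P, Q)]

def marg_x(j):            # marginal over x
    return [j[0] + j[1], j[2] + j[3]]

def cond_y(j, x):         # conditional distribution of y given x
    row = j[2 * x: 2 * x + 2]
    s = sum(row)
    return [r / s for r in row]

joint_kl = kl(P, M)
chain = kl(marg_x(P), marg_x(M)) + sum(
    marg_x(P)[x] * kl(cond_y(P, x), cond_y(M, x)) for x in (0, 1)
)
assert abs(joint_kl - chain) < 1e-12
print("KL chain rule verified")
```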
Based on the above decomposition, we can modify the bound as below:

$$\begin{aligned}
\mathbb{E}_{Q_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] - \mathbb{E}_{P_{\mathbf{X}}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}})\right] &\le \hat{H}(Q_{Y|\mathbf{x}})\sqrt{2\,D_{\mathrm{JS}}(P_{\mathbf{X}} \| Q_{\mathbf{X}})} \\
&\le \hat{H}(Q_{Y|\mathbf{x}})\sqrt{D_{\mathrm{JS}}(P_{X_v} \| Q_{X_v}) + D_{\mathrm{JS}}(P_{X_t} \| Q_{X_t}) + \bar{D}_{\mathrm{JS}}(P_{X_t|X_v} \| Q_{X_t|X_v}) + \bar{D}_{\mathrm{JS}}(P_{X_v|X_t} \| Q_{X_v|X_t})}.
\end{aligned}$$
Therefore, using $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$, we get an upper bound for the term (A) in ineq. (16) as below:

$$\begin{aligned}
H(P_{Y|\mathbf{X}}) - H(Q_{Y|\mathbf{X}}) &\le \hat{H}(Q_{Y|\mathbf{x}})\left(D_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_v} \| Q_{X_v}) + D_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_t} \| Q_{X_t})\right) \\
&\quad + \hat{H}(Q_{Y|\mathbf{x}})\left(\bar{D}_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_t|X_v} \| Q_{X_t|X_v}) + \bar{D}_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_v|X_t} \| Q_{X_v|X_t})\right) \\
&\quad + 4\,\mathbb{E}_{P_{\mathbf{X}}}\left[D_{\mathrm{JS}}^{\frac{1}{4}}(P_{Y|\mathbf{X}=\mathbf{x}} \| Q_{Y|\mathbf{X}=\mathbf{x}})\right].
\end{aligned}$$
Then, deriving a bound for the remaining term (B) in ineq. (16), i.e., $\mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))] - \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))]$, closely follows the procedure for the term (A): we switch $Q_{Y|\mathbf{X}}$ to $P_\theta(\cdot|\mathbf{X})$ and set $f$ in Lemma D.9 as $f(\mathbf{x}) := H(P_\theta(\cdot|\mathbf{x}))$, so that $f$ is bounded within the interval $[0, \hat{H}(P_\theta)]$, where $\hat{H}(P_\theta) := \max_{\mathbf{x}\in\mathcal{X}} H(P_\theta(\cdot|\mathbf{x}))$. This induces the upper bound below:
$$\begin{aligned}
\mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}}\left[H(P_\theta(\cdot|\mathbf{x}))\right] - \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\left[H(P_\theta(\cdot|\mathbf{x}))\right] &\le \hat{H}(P_\theta)\sqrt{2\,D_{\mathrm{JS}}(P_{\mathbf{X}} \| Q_{\mathbf{X}})} \\
&\le \hat{H}(P_\theta)\left(D_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_v} \| Q_{X_v}) + D_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_t} \| Q_{X_t}) + \bar{D}_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_t|X_v} \| Q_{X_t|X_v}) + \bar{D}_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_v|X_t} \| Q_{X_v|X_t})\right).
\end{aligned} \tag{21}$$
Finally, we complete the proof by adding ineq. (16) and ineq. (21) to induce the upper bound of $\mathrm{EMID}(P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}; P_\theta)$:

$$\begin{aligned}
\mathrm{EMID}(P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}; P_\theta) &\le \left(H(P_{Y|\mathbf{X}}) - H(Q_{Y|\mathbf{X}})\right) + \left(\mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))] - \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}[H(P_\theta(\cdot|\mathbf{x}))]\right) + 8\Delta^{\frac{1}{4}} \\
&\le \hat{H}\left(D_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_v} \| Q_{X_v}) + D_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_t} \| Q_{X_t})\right) + \hat{H}\left(\bar{D}_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_t|X_v} \| Q_{X_t|X_v}) + \bar{D}_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_v|X_t} \| Q_{X_v|X_t})\right) \\
&\quad + 4\,\mathbb{E}_{P_{\mathbf{X}}}\left[D_{\mathrm{JS}}^{\frac{1}{4}}(P_{Y|\mathbf{X}=\mathbf{x}} \| Q_{Y|\mathbf{X}=\mathbf{x}})\right] + 8\Delta^{\frac{1}{4}},
\end{aligned}$$

where $\hat{H} := \max_{\mathbf{x}\in\mathcal{X}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}}) + H(P_\theta(\cdot|\mathbf{x}))\right]$ and $\Delta := \delta_P + \delta_Q$. ∎
Then, we introduce an assumption on the consistency between conditional distributions as below.

Assumption D.10 (Consistency of conditional distributions).
For the distributions $P_{\mathbf{X}Y}$ and $Q_{\mathbf{X}Y}$ over $\mathcal{X} \times \mathcal{Y}$, the conditional distributions of $X_t$ given $X_v$, of $X_v$ given $X_t$, and of $Y$ given $\mathbf{X} = (X_v, X_t)$ are consistent between $P_{\mathbf{X}Y}$ and $Q_{\mathbf{X}Y}$. That is,

- $P_{X_t|X_v} = Q_{X_t|X_v}$ and $P_{X_v|X_t} = Q_{X_v|X_t}$,
- $P_{Y|\mathbf{X}} = Q_{Y|\mathbf{X}}$.
Finally, we present the simplified version of the EMID upper bound by leveraging Assumption D.10.

Theorem D.11 (Simplified scenario).
Given an MLLM $P_\theta$ and distributions $P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}$ which have consistent conditional distributions over the variables $X_v|X_t$, $X_t|X_v$, and $Y|\mathbf{X}$, if there exist some constants $\delta_P$ and $\delta_Q$ such that

$$D_{\mathrm{JS}}(P_Y^\theta \,\|\, P_Y) \le \delta_P, \qquad D_{\mathrm{JS}}(Q_Y^\theta \,\|\, Q_Y) \le \delta_Q,$$

where $P_Y^\theta := \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} P_\theta(\cdot|\mathbf{x})$ and $Q_Y^\theta := \mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}} P_\theta(\cdot|\mathbf{x})$, then $\mathrm{EMID}(P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}; P_\theta)$ is upper bounded by

$$\hat{H}\left(D_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_v} \| Q_{X_v}) + D_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_t} \| Q_{X_t})\right) + 8\Delta^{\frac{1}{4}}, \tag{22}$$

where $\hat{H} := \max_{\mathbf{x}\in\mathcal{X}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}}) + H(P_\theta(\cdot|\mathbf{x}))\right]$ and $\Delta := \delta_P + \delta_Q$.
Proof.
Given Theorem D.8, Assumption D.10 zeroes out the terms $\left(\bar{D}_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_t|X_v} \| Q_{X_t|X_v}) + \bar{D}_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_v|X_t} \| Q_{X_v|X_t})\right)$ and $\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\left[D_{\mathrm{JS}}^{\frac{1}{4}}(P_{Y|\mathbf{X}=\mathbf{x}} \| Q_{Y|\mathbf{X}=\mathbf{x}})\right]$, which induces Eq. (22) accordingly. ∎
Corollary D.12.
Given an MLLM $P_\theta$ and distributions $P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}$ which have consistent conditional distributions over the variables $X_v|X_t$, $X_t|X_v$, and $Y|\mathbf{X}$, $\mathrm{EMID}(P_{\mathbf{X}Y}, Q_{\mathbf{X}Y}; P_\theta)$ is upper bounded by

$$\hat{H}\left(D_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_v} \| Q_{X_v}) + D_{\mathrm{JS}}^{\frac{1}{2}}(P_{X_t} \| Q_{X_t})\right) + 8\left(\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} D_{\mathrm{TV}}(P_{Y|\mathbf{X}=\mathbf{x}}, P_\theta(\cdot|\mathbf{x})) + \mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}} D_{\mathrm{TV}}(Q_{Y|\mathbf{X}=\mathbf{x}}, P_\theta(\cdot|\mathbf{x}))\right)^{\frac{1}{4}}, \tag{23}$$

where $D_{\mathrm{TV}}(\cdot,\cdot)$ is the total variation distance, and $\hat{H} := \max_{\mathbf{x}\in\mathcal{X}}\left[H(Q_{Y|\mathbf{X}=\mathbf{x}}) + H(P_\theta(\cdot|\mathbf{x}))\right]$.
Proof.
Let $P_Y^\theta := \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}} P_\theta(\cdot|\mathbf{x})$ and $Q_Y^\theta := \mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}} P_\theta(\cdot|\mathbf{x})$. Then the joint convexity of the JS divergence together with $D_{\mathrm{JS}}(\cdot \| \cdot) \le D_{\mathrm{TV}}(\cdot, \cdot)$ allows us to induce the following:

$$\begin{aligned}
D_{\mathrm{JS}}(P_Y^\theta \| P_Y) &\le \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\left[D_{\mathrm{JS}}(P_\theta(\cdot|\mathbf{x}) \| P_{Y|\mathbf{X}=\mathbf{x}})\right] \le \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{X}}}\left[D_{\mathrm{TV}}(P_{Y|\mathbf{X}=\mathbf{x}}, P_\theta(\cdot|\mathbf{x}))\right], \\
D_{\mathrm{JS}}(Q_Y^\theta \| Q_Y) &\le \mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}}\left[D_{\mathrm{JS}}(P_\theta(\cdot|\mathbf{x}) \| Q_{Y|\mathbf{X}=\mathbf{x}})\right] \le \mathbb{E}_{\mathbf{x}\sim Q_{\mathbf{X}}}\left[D_{\mathrm{TV}}(Q_{Y|\mathbf{X}=\mathbf{x}}, P_\theta(\cdot|\mathbf{x}))\right].
\end{aligned} \tag{24}$$
Noting that $a^{\frac{1}{4}} + b^{\frac{1}{4}} \le 2(a+b)^{\frac{1}{4}}$ for $a, b \ge 0$, plugging the above inequality into Theorem D.11 completes the proof. ∎
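Both elementary facts used in this proof, $D_{\mathrm{JS}} \le D_{\mathrm{TV}}$ (JS in nats) and $a^{1/4} + b^{1/4} \le 2(a+b)^{1/4}$, can be spot-checked numerically (illustrative only):

```python
import math

# Checks of the two elementary facts used in the corollary's proof:
#   (i)  D_JS(P || Q) <= D_TV(P, Q)   (JS in nats, TV = 0.5 * sum |p - q|)
#   (ii) a**0.25 + b**0.25 <= 2 * (a + b)**0.25  for a, b >= 0

def kl(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

pairs = [([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]),
         ([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]),
         ([0.9, 0.05, 0.05], [0.3, 0.4, 0.3])]
for p, q in pairs:
    assert js(p, q) <= tv(p, q) + 1e-12

for a, b in [(0.0, 1.0), (0.3, 0.7), (2.0, 5.0)]:
    assert a**0.25 + b**0.25 <= 2 * (a + b)**0.25 + 1e-12
print("both inequalities hold on all test cases")
```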
Although this alternative upper bound is looser than that of Theorem D.11, Corollary D.12 is more interpretable in the sense that it directly represents the model-specific discrepancy via the distance between the model's output distribution and the true conditional distributions, rather than expressing it through the marginal distribution terms $D_{\mathrm{JS}}(P_Y^\theta \| P_Y)$ and $D_{\mathrm{JS}}(Q_Y^\theta \| Q_Y)$. Therefore, as the model becomes more accurate at modeling the ground-truth conditional distribution of $Y$ given $\mathbf{X}$, the EMID mainly depends on the divergence between the marginal distributions of the visual and text inputs.
D.3 Discussion
We made the $\epsilon$-representation capacity assumption to derive Lemma 4.3 and Theorem 4.4, which provide the theoretical justification for EMI by showing its connection to a classic preference model. The assumption captures a minimum achievable discrepancy between the true distribution $P_{Y|\mathbf{X}}$ and the model's distribution $P_\theta(\cdot|\mathbf{x})$. As models become more expressive, e.g., by increasing model size (Kaplan et al., 2020) and leveraging advanced positional encoding (Luo et al., 2022), an MLLM approaches a universal approximator of sequence-to-sequence mappings (Luo et al., 2022; Furuya et al., 2024); as a result, the minimum expected discrepancy tends to decrease, leading to a smaller $\epsilon$.
Meanwhile, EMI and EMID admit flexible potential use cases given their generality. For example, they can be used as metrics to evaluate a pure LLM as well as a multimodal LLM, depending on the type of targeted problem. Moreover, although we confined our analysis to $I(P_{\mathbf{X}Y})$, the chain rule of mutual information (Cover, 1999) allows us to conduct partial-modality EMI analysis through $I(P_{\mathbf{X}Y}) = 0.5\,I(P_{X_v Y}) + 0.5\,I(P_{X_t Y}) + 0.5\,I(P_{X_t Y|X_v}) + 0.5\,I(P_{X_v Y|X_t})$, where we can decompose the aggregated EMI into modality-specific EMI terms and modality-interaction (conditional) EMI terms.
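This averaged chain-rule identity is exact and can be verified numerically on a toy joint distribution (the probabilities below are arbitrary choices, illustrative only):

```python
import math

# Numerical check of the averaged chain-rule identity for mutual information:
#   I(Xv,Xt; Y) = 0.5*I(Xv;Y) + 0.5*I(Xt;Y) + 0.5*I(Xt;Y|Xv) + 0.5*I(Xv;Y|Xt)
# on a toy joint distribution over binary (xv, xt, y).

# joint probabilities p[(xv, xt, y)]
p = {(0,0,0): 0.10, (0,0,1): 0.05, (0,1,0): 0.05, (0,1,1): 0.20,
     (1,0,0): 0.15, (1,0,1): 0.10, (1,1,0): 0.05, (1,1,1): 0.30}

def marginal(keep):
    """Marginal over the axes listed in `keep` (subset of 0,1,2)."""
    out = {}
    for k, v in p.items():
        key = tuple(k[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

def mi(a_axes, b_axes):
    """I(A;B) in nats for variable groups given by axis indices."""
    pab = marginal(a_axes + b_axes)
    pa, pb = marginal(a_axes), marginal(b_axes)
    total = 0.0
    for k, v in pab.items():
        ka, kb = k[:len(a_axes)], k[len(a_axes):]
        if v > 0:
            total += v * math.log(v / (pa[ka] * pb[kb]))
    return total

def cmi(a_axes, b_axes, c_axes):
    """I(A;B|C) = I(A,C;B) - I(C;B), via the chain rule."""
    return mi(a_axes + c_axes, b_axes) - mi(c_axes, b_axes)

lhs = mi([0, 1], [2])                      # I(Xv,Xt; Y)
rhs = 0.5 * mi([0], [2]) + 0.5 * mi([1], [2]) \
    + 0.5 * cmi([1], [2], [0]) + 0.5 * cmi([0], [2], [1])
assert abs(lhs - rhs) < 1e-12
print("averaged chain-rule identity verified")
```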