Title: Deep Neural Networks Tend To Extrapolate Predictably

URL Source: https://arxiv.org/html/2310.00873

License: arXiv.org perpetual non-exclusive license
arXiv:2310.00873v2 [cs.LG] 15 Mar 2024

Deep Neural Networks Tend To Extrapolate Predictably

Katie Kang¹, Amrith Setlur², Claire Tomlin¹, Sergey Levine¹ (¹UC Berkeley, ²Carnegie Mellon University)

Abstract

Conventional wisdom suggests that neural network predictions tend to be unpredictable and overconfident when faced with out-of-distribution (OOD) inputs. Our work reassesses this assumption for neural networks with high-dimensional inputs. Rather than extrapolating in arbitrary ways, we observe that neural network predictions often tend towards a constant value as input data becomes increasingly OOD. Moreover, we find that this value often closely approximates the optimal constant solution (OCS), i.e., the prediction that minimizes the average loss over the training data without observing the input. We present results showing this phenomenon across 8 datasets with different distributional shifts (including CIFAR10-C and ImageNet-R, S), different loss functions (cross entropy, MSE, and Gaussian NLL), and different architectures (CNNs and transformers). Furthermore, we present an explanation for this behavior, which we first validate empirically and then study theoretically in a simplified setting involving deep homogeneous networks with ReLU activations. Finally, we show how one can leverage our insights in practice to enable risk-sensitive decision-making in the presence of OOD inputs.

1Introduction

The prevailing belief in machine learning posits that deep neural networks behave erratically when presented with out-of-distribution (OOD) inputs, often yielding predictions that are not only incorrect, but incorrect with high confidence [Guo et al., 2017, Nguyen et al., 2015]. However, there is some evidence which seemingly contradicts this conventional wisdom – for example, Hendrycks and Gimpel [2016] show that the softmax probabilities outputted by neural network classifiers actually tend to be less confident on OOD inputs, making them surprisingly effective OOD detectors. In our work, we find that this softmax behavior may be reflective of a more general pattern in the way neural networks extrapolate: as inputs diverge further from the training distribution, a neural network’s predictions often converge towards a fixed constant value. Moreover, this constant value often approximates the best prediction the network can produce without observing any inputs, which we refer to as the optimal constant solution (OCS). We call this the “reversion to the OCS” hypothesis:

Neural network predictions on high-dimensional OOD inputs tend to revert towards the optimal constant solution.

In classification, the OCS corresponds to the marginal distribution of the training labels, typically a high-entropy distribution. Therefore, our hypothesis posits that classifier outputs should become higher-entropy as the input distribution becomes more OOD, which is consistent with the findings in Hendrycks and Gimpel [2016]. Beyond classification, to the best of our knowledge, we are the first to present and provide evidence for the “reversion to the OCS” hypothesis in its full generality. Our experiments show that the amount of distributional shift correlates strongly with the distance between model outputs and the OCS across 8 datasets, including both vision and NLP domains, 3 loss functions, and for both CNNs and transformers.

Having made this observation, we set out to understand why neural networks have a tendency to behave this way. Our empirical analysis reveals that the feature representations corresponding to OOD inputs tend to have smaller norms than those of in-distribution inputs, leading to less signal being propagated from the input. As a result, neural network outputs on OOD inputs tend to be dominated by the input-independent parts of the network (e.g., bias vectors at each layer), which we observe to often map closely to the OCS. We also theoretically analyze the extrapolation behavior of deep homogeneous networks with ReLU activations, and derive evidence which supports this mechanism in the simplified setting.

Lastly, we leverage our observations to propose a simple strategy to enable risk-sensitive decision-making in the face of OOD inputs. The OCS can be viewed as a “backup default output” to which the neural network reverts when it encounters novel inputs. If we design the loss function such that the OCS aligns with the desirable cautious behavior as dictated by the decision-making problem, then the neural network model will automatically produce cautious decisions when its inputs are OOD. We describe a way to enable this alignment, and empirically demonstrate that this simple strategy can yield surprisingly good results in OOD selective classification.

In summary, our key contributions are as follows. First, we present the observation that neural networks often exhibit a predictable pattern of extrapolation towards the OCS, and empirically illustrate this phenomenon for 8 datasets with different distribution shifts, 3 loss functions, and both CNNs and transformers. Second, we provide both empirical and theoretical analyses to better understand the mechanisms that lead to this phenomenon. Finally, we make use of these insights to propose a simple strategy for enabling cautious decision-making in face of OOD inputs. Although we do not yet have a complete characterization of precisely when, and to what extent, we can rely on “reversion to the OCS” to occur, we hope our observations will prompt further investigation into this phenomenon.

Figure 1: A summary of our observations. On in-distribution samples (top), neural network outputs tend to vary significantly based on input labels. In contrast, on OOD samples (bottom), we observe that model predictions tend not only to be more similar to one another, but also to gravitate towards the optimal constant solution (OCS). We also observe that OOD inputs tend to map to representations with smaller magnitudes, leading to predictions largely dominated by the (constant) network biases, which may shed light on why neural networks have this tendency.

2Related Work

A large body of prior work has studied various properties of neural network extrapolation. One line of work focuses on the failure modes of neural networks when presented with OOD inputs, such as poor generalization and overconfidence [Torralba and Efros, 2011, Gulrajani and Lopez-Paz, 2020, Recht et al., 2019, Ben-David et al., 2006, Koh et al., 2021]. Other works have noted that neural networks are ineffective in capturing epistemic uncertainty in their predictions [Ovadia et al., 2019, Lakshminarayanan et al., 2017, Nalisnick et al., 2018, Guo et al., 2017, Gal and Ghahramani, 2016], and that a number of techniques can manipulate neural networks to produce incorrect predictions with high confidence [Szegedy et al., 2013, Nguyen et al., 2015, Papernot et al., 2016, Hein et al., 2019]. However, Hendrycks et al. [2018] observed that neural networks assign lower maximum softmax probabilities to OOD points than to in-distribution points, meaning neural networks may actually exhibit less confidence on OOD inputs. Our work supports this observation, while further generalizing it to arbitrary loss functions. Other lines of research have explored OOD detection via the norm of the learned features [Sun et al., 2022, Tack et al., 2020], the influence of architectural decisions on generalization [Xu et al., 2020, Yehudai et al., 2021, Cohen-Karlik et al., 2022, Wu et al., 2022], the relationship between in-distribution and OOD performance [Miller et al., 2021, Baek et al., 2022, Balestriero et al., 2021], and the behavior of neural network representations under OOD conditions [Webb et al., 2020, Idnani et al., 2022, Pearce et al., 2021, Dietterich and Guyer, 2022, Huang et al., 2020]. While our work also analyzes representations in the context of extrapolation, our focus is on understanding the mechanism behind “reversion to the OCS”, which differs from the aforementioned works.

Our work also explores risk-sensitive decision-making using selective classification as a testbed. Selective classification is a well-studied problem, and various methods have been proposed to enhance selective classification performance [Geifman and El-Yaniv, 2017, Feng et al., 2011, Charoenphakdee et al., 2021, Ni et al., 2019, Cortes et al., 2016, Xia and Bouganis, 2022]. In contrast, our aim is not to develop the best possible selective classification approach, but rather to provide insights into the effective utilization of neural network predictions in OOD decision-making.

3Reversion to the Optimal Constant Solution

In this work, we will focus on the widely studied covariate shift setting [Gretton et al., 2009, Sugiyama et al., 2007]. Formally, let the training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be generated by sampling $x_i \sim P_{\text{train}}(x)$ and $y_i \sim P(y \mid x_i)$. At test time, we query the model with inputs generated from $P_{\text{OOD}}(x) \neq P_{\text{train}}(x)$, whose ground truth labels are generated from the same conditional distribution, $P(y \mid x)$, as in training. We denote a neural network model as $f_\theta : \mathbb{R}^d \to \mathbb{R}^m$, where $d$ and $m$ are the dimensionalities of the input and output, and $\theta \in \Theta$ represents the network weights. We focus on settings where the input dimensionality $d$ is large. The network weights are optimized by minimizing a loss function $\mathcal{L}$ using gradient descent:

$$\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f_\theta(x_i), y_i).$$
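To make the training setup concrete, the following is a minimal sketch of minimizing the empirical risk above by gradient descent, using an invented toy linear model with squared-error loss (the dataset, hyperparameters, and `w_true` are our own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for D = {(x_i, y_i)} with x_i ~ P_train(x) and y_i ~ P(y | x_i)
N, d = 256, 4
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Gradient descent on (1/N) * sum_i L(f_theta(x_i), y_i) with L = squared error
theta = np.zeros(d)
lr = 0.1
for _ in range(500):
    grad = (2.0 / N) * X.T @ (X @ theta - y)
    theta -= lr * grad
```

The paper's models are deep networks trained the same way in principle; the linear case simply makes the objective and its gradient explicit.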

Figure 2: Neural network predictions from training with cross entropy and Gaussian NLL on MNIST (top 3 rows) and CIFAR10 (bottom 3 rows). The models were trained with 0 rotation/noise, and evaluated on increasingly OOD inputs consisting of the digit 6 for MNIST, and of automobiles for CIFAR10. The blue plots represent the average model prediction over the evaluation dataset. The orange plots show the OCS associated with each model. We can see that as the distribution shift increases (going left to right), the network predictions tend towards the OCS (rightmost column).

3.1Main Hypothesis

In our experiments, we observed that as inputs become more OOD, neural network predictions tend to revert towards a constant prediction. This means that, assuming there is little label shift, model predictions will tend to be more similar to one another for OOD inputs than for the training distribution. Furthermore, we find that this constant prediction is often similar to the optimal constant solution (OCS), which minimizes the training loss if the network is constrained to ignore the input. The OCS can be interpreted as being the maximally cautious prediction, producing the class marginal in the case of the cross-entropy loss, and a high-variance Gaussian in the case of the Gaussian NLL. More precisely, we define the OCS as

$$f^*_{\text{constant}} = \arg\min_{f \in \mathbb{R}^m} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f, y_i).$$
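Because the OCS ignores the input entirely, it can be computed in closed form from the training labels alone. A minimal sketch for the three losses used in this paper (the function names are ours, not the paper's):

```python
import numpy as np

def ocs_cross_entropy(labels, num_classes):
    """Constant minimizer of cross entropy: the marginal distribution of labels."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()

def ocs_mse(targets):
    """Constant minimizer of MSE: the mean of the training targets."""
    return np.mean(np.asarray(targets, dtype=float), axis=0)

def ocs_gaussian_nll(targets):
    """Constant minimizer of Gaussian NLL: mean and std of the training targets."""
    t = np.asarray(targets, dtype=float)
    return t.mean(), t.std()
```

For a balanced 10-class dataset, `ocs_cross_entropy` returns the uniform distribution, matching the high-entropy OCS described above.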

Based on our observations, we hypothesize that as the likelihood of samples from $P_{\text{OOD}}(x)$ under $P_{\text{train}}(x)$ decreases, $f_{\hat{\theta}}(x)$ for $x \sim P_{\text{OOD}}(x)$ tends to approach $f^*_{\text{constant}}$.

As an illustrative example, we trained models using either cross-entropy or (continuous-valued) Gaussian negative log-likelihood (NLL) on the MNIST and CIFAR10 datasets. The blue plots in Fig. 2 show the models’ predictions as their inputs become increasingly OOD, and the orange plots visualize the OCS associated with each model. We can see that even though we trained on different datasets and evaluated on different kinds of distribution shifts, the neural network predictions exhibit the same pattern of extrapolation: as the distribution shift increases, the network predictions move closer to the OCS. Note that while the behavior of the cross-entropy models could likewise be explained by the network simply producing lower magnitude outputs, the Gaussian NLL models’ predicted variance actually increases with distribution shift, which contradicts this alternative explanation.

3.2Experiments

We will now provide empirical evidence for the “reversion to the OCS” hypothesis. Our experiments aim to answer the question: As the test-time inputs become more OOD, do neural network predictions move closer to the optimal constant solution?

Experimental setup.

We trained our models on 8 different datasets, and evaluated them on both natural and synthetic distribution shifts. See Table 1 for a summary, and Appendix B.1 for a more detailed description of each dataset. Models with image inputs use ResNet [He et al., 2016] or VGG [Simonyan and Zisserman, 2014] style architectures, and models with text inputs use DistilBERT [Sanh et al., 2019], a distilled version of BERT [Devlin et al., 2018].

| Dataset | Label Type | Input Modality | Distribution Shift Type |
|---|---|---|---|
| CIFAR10 [Krizhevsky et al., 2009] / CIFAR10-C [Hendrycks and Dietterich, 2019] | Discrete | Image | Synthetic |
| ImageNet [Deng et al., 2009] / ImageNet-R(endition) [Hendrycks et al., 2021] | Discrete | Image | Natural |
| ImageNet [Deng et al., 2009] / ImageNet-Sketch [Wang et al., 2019] | Discrete | Image | Natural |
| DomainBed OfficeHome [Gulrajani and Lopez-Paz, 2020] | Discrete | Image | Natural |
| SkinLesionPixels [Gustafsson et al., 2023] | Continuous | Image | Natural |
| UTKFace [Zhang et al., 2017] | Continuous | Image | Synthetic |
| BREEDS living-17 [Santurkar et al., 2020] | Discrete | Image | Natural |
| BREEDS non-living-26 [Santurkar et al., 2020] | Discrete | Image | Natural |
| WILDS Amazon [Koh et al., 2021] | Discrete | Text | Natural |

Table 1: Summary of the datasets that we train/evaluate on in our experiments.

We focus on three tasks, each using a different loss function: classification with cross entropy (CE), selective classification with mean squared error (MSE), and regression with Gaussian NLL. Datasets with discrete labels are used for classification and selective classification, and datasets with continuous labels are used for regression. The cross entropy models are trained to predict the likelihood that the input belongs to each class, as is typical in classification. The MSE models are trained to predict rewards for a selective classification task. More specifically, the models output a value for each class as well as an abstain option, where the value represents the reward of selecting that option given the input. The ground truth reward is +1 for the correct class, -4 for the incorrect classes, and 0 for abstaining. We train these models by minimizing the MSE loss between the predicted and ground truth rewards. We will later use these models for decision-making in Section 5. The Gaussian NLL models predict a mean and a standard deviation, parameterizing a Gaussian distribution. They are trained to minimize the negative log likelihood of the labels under the predicted distributions.
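For the selective classification task, the MSE regression target can be built mechanically from a label under the reward structure above (+1 correct, -4 incorrect, 0 abstain). A sketch, with the helper name being our own:

```python
import numpy as np

def reward_targets(label, num_classes, r_correct=1.0, r_wrong=-4.0, r_abstain=0.0):
    """Ground-truth reward vector: one entry per class plus a final abstain entry."""
    r = np.full(num_classes + 1, r_wrong)
    r[label] = r_correct   # reward for picking the correct class
    r[-1] = r_abstain      # reward for abstaining, regardless of the label
    return r
```

The MSE model is then regressed onto these vectors, one per training example.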

Evaluation protocol.

To answer our question, we need to quantify (1) the dissimilarity between the training data and the evaluation data, and (2) the proximity of network predictions to the OCS. To estimate the former, we trained a low-capacity model to discriminate between the training and evaluation datasets, and measured the average predicted likelihood that the evaluation dataset is generated from the evaluation distribution, which we refer to as the OOD score. This score is 0.5 when the train and evaluation data are indistinguishable, and 1 for a perfect discriminator. To estimate the distance between the model’s predictions and the OCS, we compute the average KL divergence between the model’s predicted distribution and the distribution parameterized by the OCS, $\frac{1}{N} \sum_{i=1}^{N} D_{\mathrm{KL}}\big(P_\theta(y \mid x_i) \,\|\, P_{f^*_{\text{constant}}}(y)\big)$, for models trained with cross-entropy and Gaussian NLL. For MSE models, the distance is measured using the mean squared error, $\frac{1}{N} \sum_{i=1}^{N} \| f_\theta(x_i) - f^*_{\text{constant}} \|^2$. See Appendix B.3 for more details on our evaluation protocol, and Appendix B.4 for the closed form solution of the OCS for each loss.
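The two distance measures can be sketched directly from their definitions; this is a hypothetical implementation for the categorical (cross-entropy) case and the MSE case, with function names of our choosing:

```python
import numpy as np

def mean_kl_to_ocs(pred_probs, ocs_probs, eps=1e-12):
    """(1/N) * sum_i KL(P_theta(y | x_i) || P_ocs(y)) for categorical predictions."""
    p = np.clip(pred_probs, eps, 1.0)   # (N, num_classes)
    q = np.clip(ocs_probs, eps, 1.0)    # (num_classes,)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))

def mean_mse_to_ocs(preds, ocs):
    """(1/N) * sum_i ||f_theta(x_i) - f*_constant||^2."""
    d = np.asarray(preds) - np.asarray(ocs)
    return float(np.mean(np.sum(d * d, axis=1)))
```

Both distances are zero exactly when every prediction coincides with the OCS, which is the limit the hypothesis predicts under increasing distribution shift.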

Figure 3: Evaluating the distance between network predictions and the OCS as the input distribution becomes more OOD. Each point represents a different evaluation dataset, with the red star representing the (holdout) training distribution, and circles representing OOD datasets. The vertical line associated with each point represents the standard deviation over 5 training runs. As the OOD score of the evaluation dataset increases, there is a clear trend of the neural network predictions approaching the OCS.

Results.

In Fig. 3, we plot the OOD score (x-axis) against the distance between the network predictions and the OCS (y-axis) for both the training and OOD datasets. Our results indicate a clear trend: as the OOD score of the evaluation dataset increases, neural network predictions move closer to the OCS. Moreover, our results show that this trend holds relatively consistently across different loss functions, input modalities, network architectures, and types of distribution shifts. We also found instances where this phenomenon did not hold, such as adversarial inputs, which we discuss in greater detail in Appendix A. However, the overall prevalence of “reversion to the OCS” across different settings suggests that it may capture a general pattern in the way neural networks extrapolate.

4Why do OOD Predictions Revert to the OCS?

In this section, we aim to provide insights into why neural networks have a tendency to revert to the OCS. We will begin with an intuitive explanation, and provide empirical and theoretical evidence in Sections 4.1 and 4.2. In our analysis, we observe that weight matrices and network representations associated with training inputs often occupy low-dimensional subspaces with high overlap. However, when the network encounters OOD inputs, we observe that their associated representations tend to have less overlap with the weight matrices compared to those from the training distribution, particularly in the later layers. As a result, OOD representations tend to diminish in magnitude as they pass through the layers of the network, causing the network’s output to be primarily influenced by the accumulation of model constants (e.g., bias terms). Furthermore, both empirically and theoretically, we find that this accumulation of model constants tends to closely approximate the OCS. We posit that reversion to the OCS for OOD inputs occurs due to the combination of these two factors: that accumulated model constants in a trained network tend towards the OCS, and that OOD points yield smaller-magnitude representations in the network that become dominated by model constants.

4.1Empirical Analysis

Figure 4: Analysis of the interaction between representations and weights as distribution shift increases. The plots in the first column visualize the norm of network features at different layers of the network for different levels of distribution shift. In later layers of the network, the norm of the features tends to decrease as distribution shift increases. The plots in the second column show the proportion of the network features which lies within the span of the following linear layer; this also tends to decrease as distribution shift increases. Error bars represent the standard deviation taken over the test distribution. The plots in the third and fourth columns show the accumulation of model constants compared to the OCS for a cross entropy and an MSE model; the two closely mirror one another.

We will now provide empirical evidence for the mechanism we describe above using deep neural network models trained on MNIST and CIFAR10. MNIST models use a small 4 layer network, and CIFAR10 models use a ResNet20 [He et al., 2016]. To more precisely describe the quantities we will be illustrating, let us rewrite the neural network as $f(x) = g_{i+1}(\sigma(W_i \phi_i(x) + b_i))$, where $\phi_i(x)$ is an intermediate representation at layer $i$, $W_i$ and $b_i$ are the corresponding weight matrix and bias, $\sigma$ is a nonlinearity, and $g_{i+1}$ denotes the remaining layers of the network. Because we use a different network architecture for each domain, we will use variables to denote different intermediate layers of the network, and defer details about the specific choice of layers to Appendix C.2. We present additional experiments analyzing the ImageNet domain, as well as the effects of batch and layer normalization, in Appendix C.1.

First, we will show that $W_i \phi_i(x)$ tends to diminish in magnitude for OOD inputs. The first column of plots in Fig. 4 shows $\mathbb{E}_{x \sim P_{\text{OOD}}(x)}[\|W_i \phi_i(x)\|_2] \,/\, \mathbb{E}_{x \sim P_{\text{train}}(x)}[\|W_i \phi_i(x)\|_2]$ for $P_{\text{OOD}}$ with different levels of rotation or noise. The x-axis represents different layers in the network, with the leftmost being the input and the rightmost being the output. We can see that in the later layers of the network, $\|W_i \phi_i(x)\|_2$ consistently becomes smaller as inputs become more OOD (greater rotation/noise). Furthermore, the diminishing effect becomes more pronounced as the representations pass through more layers.

Next, we will present evidence that this decrease in representation magnitude occurs because $\phi_j(x)$ for $x \sim P_{\text{train}}(x)$ tends to lie more within the low-dimensional subspace spanned by the rows of $W_j$ than $\phi_j(x)$ for $x \sim P_{\text{OOD}}(x)$ does. Let $V_{\text{top}}$ denote the top (right) singular vectors of $W_j$. The middle plots of Fig. 4 show the fraction of the representation’s norm at layer $j$ that is captured by projecting the representation onto $V_{\text{top}}$, i.e., $\|\phi_j(x)^\top V_{\text{top}} V_{\text{top}}^\top\|_2 \,/\, \|\phi_j(x)\|_2$, as distribution shift increases. We can see that as the inputs become more OOD, this ratio goes down, suggesting that the representations lie increasingly outside the subspace spanned by the weight matrix.
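The projection ratio above can be computed from an SVD of the weight matrix. A sketch (the helper and its `k` parameter for the number of top singular directions are our own):

```python
import numpy as np

def subspace_capture_ratio(W, phi, k):
    """Fraction of ||phi|| captured by the top-k right singular vectors of W."""
    _, _, Vt = np.linalg.svd(W)
    V = Vt[:k].T                 # columns: top-k right singular vectors of W
    proj = V @ (V.T @ phi)       # orthogonal projection of phi onto span(V)
    return float(np.linalg.norm(proj) / np.linalg.norm(phi))
```

A ratio near 1 means the representation lies almost entirely inside the dominant row space of the following weight matrix; a small ratio means most of its norm will be attenuated by the matrix multiply.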

Finally, we will provide evidence for the part of the mechanism which accounts for the optimality of the OCS. Previously, we established that OOD representations tend to diminish in magnitude in later layers of the network. This begs the question: what would the output of the network be if the input representation at an intermediate layer had a magnitude of 0? We call this the accumulation of model constants, i.e., $g_{k+1}(\sigma(b_k))$. In the third and fourth columns of Fig. 4, we visualize the accumulation of model constants at one of the final layers $k$ of the network for both a cross entropy and an MSE model (details in Sec. 3.2), along with the OCS for each model. We can see that the accumulation of model constants closely approximates the OCS in each case.
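The accumulation of model constants can be probed by forward-propagating a zero representation from layer $k$, so that only the bias terms contribute to the output. A minimal sketch for a plain MLP (the paper's models are CNNs/ResNets, so this is only an illustration of the idea):

```python
import numpy as np

def accumulated_constants(weights, biases, k):
    """Output of the network when the layer-k representation is the zero vector.

    ReLU is applied after every layer except the last, so the result
    g_{k+1}(sigma(b_k)) depends only on the biases of layers k, ..., L.
    """
    h = np.zeros(weights[k].shape[1])
    for i in range(k, len(weights)):
        h = weights[i] @ h + biases[i]
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)
    return h
```

Comparing this bias-only output against the OCS is exactly the check visualized in the third and fourth columns of Fig. 4.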

4.2Theoretical Analysis

We will now explicate our empirical findings more formally by analyzing solutions of gradient flow (gradient descent with infinitesimally small step size) on deep homogeneous neural networks with ReLU activations. We adopt this setting due to its theoretical convenience in reasoning about solutions at convergence of gradient descent [Lyu and Li, 2019, Galanti et al., 2022, Huh et al., 2021], and its relative similarity to deep neural networks used in practice [Neyshabur et al., 2015, Du et al., 2018].

Setup: We consider a class of homogeneous neural networks $\mathcal{F} := \{f(W; x) : W \in \mathcal{W}\}$ with $L$ layers and ReLU activations, taking the functional form

$$f(W; x) = W_L \, \sigma(W_{L-1} \cdots \sigma(W_2 \, \sigma(W_1 x)) \cdots),$$

where $W_i \in \mathbb{R}^{m \times m}$ for all $i \in \{2, \ldots, L-1\}$, $W_1 \in \mathbb{R}^{m \times d}$, and $W_L \in \mathbb{R}^{1 \times m}$. Our focus is on a binary classification problem where we consider two joint distributions $P_{\text{train}}, P_{\text{OOD}}$ over inputs and labels $\mathcal{X} \times \mathcal{Y}$, where inputs come from $\mathcal{X} := \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$ and labels from $\mathcal{Y} := \{-1, +1\}$. We consider gradient descent with a small learning rate on the objective $L(W; \mathcal{D}) := \sum_{(x, y) \in \mathcal{D}} \ell(f(W; x), y)$, where $\ell(f(W; x), y) = \exp(-y f(W; x))$ is the exponential loss and $\mathcal{D}$ is an IID sampled dataset of size $N$ from $P_{\text{train}}$. For more details on the setup, background on homogeneous networks, and full proofs for all results in this section, please see Appendix D.
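A bias-free ReLU network of this form is positively homogeneous of degree $L$ in its weights: scaling every $W_i$ by $c > 0$ scales the output by $c^L$. A small sketch checking this property and the exponential loss, using toy weights of our own construction:

```python
import numpy as np

def homogeneous_relu_net(Ws, x):
    """f(W; x) = W_L sigma(W_{L-1} ... sigma(W_1 x)): no bias terms anywhere."""
    h = np.asarray(x, dtype=float)
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)   # ReLU on all but the final linear layer
    return float(Ws[-1] @ h)

def exp_loss(f, y):
    """Exponential loss l(f, y) = exp(-y * f) for labels y in {-1, +1}."""
    return float(np.exp(-y * f))
```

With two layers, doubling all weights multiplies the output by $2^2 = 4$, which is the degree-2 homogeneity the theory relies on.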

We will begin by providing a lower bound on the expected magnitude of intermediate layer features corresponding to inputs from the training distribution:

Proposition 4.1 ($P_{\text{train}}$ observes high norm features)

When $f(\hat{W}; x)$ fits $\mathcal{D}$, i.e., $y_i f(\hat{W}; x_i) \ge \gamma$ for all $i \in [N]$, then with high probability $1 - \delta$ over $\mathcal{D}$, the layer-$j$ representations $f_j(\hat{W}; x)$ satisfy

$$\mathbb{E}_{P_{\text{train}}}\big[\| f_j(\hat{W}; x) \|_2\big] \;\ge\; (1/C_0)\Big(\gamma - \tilde{\mathcal{O}}\big(\sqrt{\log(1/\delta)/N} + C_1 \sqrt{\log m / N}\, \gamma\big)\Big),$$

if there exist constants $C_0, C_1$ such that $\|\hat{W}_j\|_2 \le C_0^{1/L}$ and $C_1 \ge C_0^{3L/2}$.

Here, we can see that if the trained network perfectly fits the training data ($y_i f(\hat{W}; x_i) \ge \gamma$ for all $i \in [N]$), and the training set size $N$ is sufficiently large, then the expected $\ell_2$ norm of the layer-$j$ activations $f_j(\hat{W}; x)$ on $P_{\text{train}}$ is large, scaling at least linearly with $\gamma$.

Next, we will analyze the size of network outputs corresponding to points which lie outside the training distribution. Our analysis builds on prior results for gradient flow on deep homogeneous nets with ReLU activations, which show that gradient flow is biased towards the solution (KKT point) of a specific optimization problem: minimizing the weight norms while achieving a sufficiently large margin on each training point [Timor et al., 2023, Arora et al., 2019, Lyu and Li, 2019]. Based on this, it is easy to show that the solution of this constrained optimization problem is given by a neural network with low rank matrices in each layer, for sufficiently deep and wide networks. Furthermore, the low rank nature of these solutions is exacerbated by increasing depth and width, where the network approaches an almost rank one solution in each layer. If test samples deviate from this low rank space of weights in any layer, the dot products of the weights and features collapse in the subsequent layer, and the effect carries over to the final layer, which outputs features with very small magnitude. Using this insight, we present an upper bound on the magnitude of the final layer features corresponding to OOD inputs:

Theorem 4.1 (Feature norms can drop easily on $P_{\text{OOD}}$)

Suppose there exists a network $f'(W; x)$ with $L'$ layers and $m'$ neurons satisfying the conditions in Proposition 4.1 with $\gamma = 1$. When we optimize the training objective with gradient flow over a class of deeper and wider homogeneous networks $\mathcal{F}$ with $L > L'$, $m > m'$, the resulting solution converges directionally to a network $f(\hat{W}; x)$ for which the following is true: there exists a set of rank one projection matrices $\{A_i\}_{i=1}^{L}$ such that, if the representations at any layer $j$ satisfy $\mathbb{E}_{P_{\text{OOD}}} \|A_j f_j(\hat{W}; x)\|_2 \le \epsilon$, then there exists $C_2$ for which

$$\mathbb{E}_{P_{\text{OOD}}}\big[| f(\hat{W}; x) |\big] \;\lesssim\; C_0 \big(\epsilon + C_2^{-1/L} L^{1/L}\big).$$

This theorem tells us that for any layer $j$, there exists only a narrow rank one space $A_j$ in which OOD representations may lie in order for their corresponding final layer outputs to remain significant in norm. Because neural networks are not optimized on OOD inputs, we hypothesize that the features corresponding to OOD inputs tend to lie outside this narrow space, leading to a collapse in last layer magnitudes for OOD inputs in deep networks. Indeed, this result is consistent with our empirical findings in the first and second columns of Fig. 4, where we observed that OOD features tend to align less with the weight matrices, resulting in a drop in OOD feature norms.

To study the accumulation of model constants, we now analyze a slightly modified class of functions $\tilde{\mathcal{F}} = \{f(W; \cdot) + b : b \in \mathbb{R},\; f(W; \cdot) \in \mathcal{F}\}$, which consists of deep homogeneous networks with a bias term in the final layer. In Proposition 4.2, we show that there exists a set of margin points (analogous to support vectors in the linear setting) which solely determines the model’s bias $\hat{b}$.

Proposition 4.2 (Analyzing network bias)

If gradient flow on $\tilde{\mathcal{F}}$ converges directionally to $\hat{W}, \hat{b}$, then $\hat{b} \propto \sum_k y_k$, where the sum ranges over the margin points $\{(x_k, y_k) : y_k \cdot f(\hat{W}; x_k) = \min_{j \in [N]} y_j \cdot f(\hat{W}; x_j)\}$.

If the label marginal of these margin points mimics that of the overall training distribution, then the learned bias will approximate the OCS for the exponential loss. This result is consistent with our empirical findings in the third and fourth columns of Fig. 4, where we found that the accumulation of bias terms tends to approximate the OCS.
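To make the last step concrete, the OCS of the exponential loss can be worked out in closed form. With $N_+$ positive and $N_-$ negative training labels, the optimal constant prediction solves

$$\min_{f \in \mathbb{R}} \; N_+ e^{-f} + N_- e^{f}.$$

Setting the derivative to zero, $-N_+ e^{-f} + N_- e^{f} = 0$, gives $e^{2f} = N_+ / N_-$, so

$$f^*_{\text{constant}} = \tfrac{1}{2} \log\!\big(N_+ / N_-\big),$$

which depends only on the label marginal. A bias proportional to $\sum_k y_k$ over margin points whose label marginal matches the training marginal therefore has the same sign and ordering as this OCS: positive when positive labels dominate, negative otherwise.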

5Risk-Sensitive Decision-Making

Lastly, we will explore an application of our observations to decision-making problems. In many decision-making scenarios, certain actions offer a high potential for reward when the agent chooses them correctly, but also higher penalties when chosen incorrectly, while other more cautious actions consistently provide a moderate level of reward. When utilizing a learned model for decision-making, it is desirable for the agent to select the high-risk, high-reward actions when the model is likely to be accurate, while opting for more cautious actions when the model is prone to errors, such as when the inputs are OOD. It turns out that if we leverage “reversion to the OCS” appropriately, such risk-sensitive behavior can emerge automatically. If the OCS of the agent’s learned model corresponds to cautious actions, then “reversion to the OCS” posits that the agent will take increasingly cautious actions as its inputs become more OOD. However, not all decision-making algorithms leverage “reversion to the OCS” by default. Depending on the choice of loss function (and consequently the OCS), different algorithms which have similar in-distribution performance can have different OOD behavior. In the following sections, we will use selective classification as an example of a decision-making problem to more concretely illustrate this idea.

5.1Example Application: Selective Classification

In selective classification, the agent can choose to classify the input or abstain from making a decision. As an example, we will consider a selective classification task using CIFAR10, where the agent receives a reward of +1 for correctly selecting a class, a reward of -4 for an incorrect classification, and a reward of 0 for choosing to abstain.

Let us consider one approach that leverages “reversion to the OCS” and one that does not, and discuss their respective OOD behavior. An example of the former involves learning a model to predict the reward associated with taking each action, and selecting the action with the highest predicted reward. This reward model, $f_\theta : \mathbb{R}^d \to \mathbb{R}^{|\mathcal{A}|}$, takes as input an image and outputs a vector the size of the action space. We train $f_\theta$ on a dataset of images, actions, and rewards, $\mathcal{D} = \{(x_i, a_i, r_i)\}_{i=1}^{N}$, by minimizing the MSE loss, $\frac{1}{N} \sum_{i=1}^{N} (f_\theta(x_i)_{a_i} - r_i)^2$, and select actions using the policy $\pi(x) = \arg\max_{a \in \mathcal{A}} f_\theta(x)_a$. The OCS of $f_\theta$ is the average reward for each action over the training points, i.e.,

$$(f^*_{\text{constant}})_a = \frac{\sum_{i=1}^{N} r_i \cdot \mathbb{1}[a_i = a]}{\sum_{j=1}^{N} \mathbb{1}[a_j = a]}.$$

In our example, the OCS is -3.5 for selecting each class and 0 for abstaining, so the policy corresponding to the OCS will choose to abstain. Thus, according to “reversion to the OCS”, this agent should choose to abstain more and more frequently as its inputs become more OOD. We illustrate this behavior in Figure 5. In the first row, we depict the average predictions of a reward model when presented with test images of a specific class with increasing levels of noise (visualized in Figure 1). In the second row, we plot a histogram of the agent’s selected actions for each input distribution. We can see that as the inputs become more OOD, the model’s predictions converge towards the OCS, and consequently, the agent automatically transitions from the high-risk, high-reward decision of predicting a class to the more cautious decision of abstaining.
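The resulting decision rule is simply a greedy policy over the predicted reward vector, with abstain as the final action. A sketch, where the OCS values are the ones from the running example (+1/-4/0 rewards over 10 balanced CIFAR10 classes) and the variable names are our own:

```python
import numpy as np

ABSTAIN = 10  # index of the abstain action, after the 10 CIFAR10 classes

def select_action(pred_rewards):
    """pi(x) = argmax_a f_theta(x)_a over the 10 classes plus abstain."""
    return int(np.argmax(pred_rewards))

# OCS of the reward model: average training reward per action,
# i.e., 0.1 * (+1) + 0.9 * (-4) = -3.5 per class, and 0 for abstaining
ocs_rewards = np.array([-3.5] * 10 + [0.0])
```

On the OCS itself, the greedy policy picks abstain, which is exactly why reverting to the OCS produces cautious behavior on OOD inputs.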

Figure 5: Selective classification via reward prediction on CIFAR10. We evaluate on holdout datasets consisting of automobiles (class 1) with increasing levels of noise. The x-axis represents the agent’s actions, where classes are indexed by numbers and abstain is represented by "A". We plot the average reward predicted by the model for each class (top), and the distribution of actions selected by the policy (bottom). The rightmost plots represent the OCS (top), and the actions selected by an OCS policy (bottom). As distribution shift increased, the model predictions approached the OCS, and the policy automatically selected the abstain action more frequently.

One example of an approach which does not leverage “reversion to the OCS” is standard classification via cross entropy. The classification model takes as input an image and directly predicts a probability distribution over whether each action is the optimal action. In this case, the optimal action given an input is always its ground truth class. Because the OCS for cross entropy is the marginal distribution of labels in the training data, and the optimal action is never to abstain, the OCS for this approach corresponds to a policy that never chooses to abstain. In this case, “reversion to the OCS” posits that the agent will continue to make high-risk high-reward decisions even as its inputs become more OOD. As a result, while this approach can yield high rewards on the training distribution, it is likely to yield very low rewards on OOD inputs, where the model’s predictions are likely to be incorrect.

5.2 Experiments

We will now more thoroughly compare the behavior of a reward prediction agent with that of a standard classification agent on selective classification across a variety of datasets. Our experiments aim to answer the question: how does the performance of a decision-making approach that leverages “reversion to the OCS” compare to that of one that does not?

Experimental Setup. Using the same problem setting as the previous section, we consider a selective classification task in which the agent receives a reward of +1 for selecting the correct class, -4 for selecting an incorrect class, and +0 for abstaining from classifying. We experiment with 4 datasets: CIFAR10, DomainBed OfficeHome, BREEDS living-17 and non-living-26. We compare the performance of the reward prediction and standard classification approaches described in the previous section, as well as a third oracle approach that is optimally risk-sensitive, thereby providing an upper bound on the agent’s achievable reward. To obtain the oracle policy, we train a classifier on the training dataset to predict the likelihood of each class, and then calibrate the predictions with temperature scaling on the OOD evaluation dataset. We then use the reward function to calculate the theoretically optimal threshold on the classifier’s maximum predicted likelihood, below which the abstaining option is selected. Note that the oracle policy has access to the reward function and the OOD evaluation dataset for calibration, which the other two approaches do not have access to.
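The oracle's threshold rule admits a short worked example. Under this reward structure, classifying is worth more than abstaining only when the calibrated maximum class probability $p$ satisfies $p \cdot 1 + (1-p) \cdot (-4) \geq 0$, i.e., $p \geq 4/5$. A sketch (the function names are our own):

```python
# Reward structure from the selective classification task:
# +1 correct, -4 incorrect, 0 abstain.
r_correct, r_incorrect, r_abstain = 1.0, -4.0, 0.0

def optimal_threshold(r_correct, r_incorrect, r_abstain):
    # Solve p * r_correct + (1 - p) * r_incorrect = r_abstain for p.
    return (r_abstain - r_incorrect) / (r_correct - r_incorrect)

def oracle_policy(class_probs, threshold):
    """Return the argmax class when confident enough, else 'abstain'."""
    p_max = max(class_probs)
    return class_probs.index(p_max) if p_max >= threshold else "abstain"

t = optimal_threshold(r_correct, r_incorrect, r_abstain)
print(t)                                    # 0.8
print(oracle_policy([0.9, 0.05, 0.05], t))  # 0 (classify)
print(oracle_policy([0.5, 0.3, 0.2], t))    # abstain
```

With calibrated probabilities, thresholding at 0.8 maximizes expected reward, which is why the oracle needs both the reward function and a calibration set.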

Figure 6: Ratio of abstain action to total actions; error bars represent standard deviation over 5 random seeds; (t) denotes the training distribution. While the oracle and reward prediction approaches selected the abstain action more frequently as inputs became more OOD, the standard classification approach almost never selected abstain.

Figure 7: Reward obtained by each approach. While all three approaches performed similarly on the training distribution, reward prediction increasingly outperformed standard classification as inputs became more OOD.

Results. In Fig. 6, we plot the frequency with which the abstain action is selected for each approach. As distribution shift increased, both the reward prediction and oracle approaches selected the abstain action more frequently, whereas the standard classification approach never selected this option. This discrepancy arises because the OCS of the reward prediction approach aligns with the abstain action, whereas the OCS of standard classification does not. In Fig. 7, we plot the average reward received by each approach. Although the performance of all three approaches is relatively similar on the training distribution, the reward prediction policy increasingly outperformed the classification policy as distribution shift increased. Furthermore, the gaps between the rewards yielded by the reward prediction and classification policies are substantial compared to the gaps between the reward prediction and oracle policies, suggesting that the former difference in performance is nontrivial. Note that the goal of our experiments is not to demonstrate that our approach is the best possible method for selective classification (in fact, it is likely not better than SOTA approaches), but rather to highlight how the OCS associated with an agent’s learned model can influence its OOD decision-making behavior. To this end, this result shows that appropriately leveraging “reversion to the OCS” can substantially improve an agent’s performance on OOD inputs.

6 Conclusion

We presented the observation that neural network predictions for OOD inputs tend to converge towards a specific constant, which often corresponds to the optimal input-independent prediction based on the model’s loss function. We proposed a mechanism to explain this phenomenon and a simple strategy that leverages this phenomenon to enable risk-sensitive decision-making. Finally, we demonstrated the prevalence of this phenomenon and the effectiveness of our decision-making strategy across diverse datasets and different types of distributional shifts.

Our understanding of this phenomenon is not complete. Further research is needed to discern the properties of an OOD distribution that govern when, and to what extent, we can rely on “reversion to the OCS” to occur. Another exciting direction would be to extend our investigation of the effect of the OCS on decision-making to more complex multistep problems, and to study the OOD behavior of common algorithms such as imitation learning, Q-learning, and policy gradient.

As neural network models become more broadly deployed to make decisions in the “wild”, we believe it is increasingly essential to ensure neural networks behave safely and robustly in the presence of OOD inputs. While our understanding of “reversion to the OCS” is still rudimentary, we believe it offers a new perspective on how we may predict and even potentially steer the behavior of neural networks on OOD inputs. We hope our observations will prompt further investigations on how we should prepare models to tackle the diversity of in-the-wild inputs they must inevitably encounter.

7 Acknowledgements

This work was supported by DARPA ANSR, DARPA Assured Autonomy, and the Office of Naval Research under N00014-21-1-2838. Katie Kang was supported by the NSF GRFP. We would like to thank Dibya Ghosh, Ilya Kostrikov, Karl Pertsch, Aviral Kumar, Eric Wallace, Kevin Black, Ben Eysenbach, Michael Janner, Young Geng, Colin Li, Manan Tomar, and Simon Zhai for insightful feedback and discussions.

References

- Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32, 2019.
- Christina Baek, Yiding Jiang, Aditi Raghunathan, and J Zico Kolter. Agreement-on-the-line: Predicting the performance of neural networks under distribution shift. Advances in Neural Information Processing Systems, 35:19274–19289, 2022.
- Randall Balestriero, Jerome Pesenti, and Yann LeCun. Learning in high dimension always amounts to extrapolation. arXiv preprint arXiv:2110.09485, 2021.
- Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017.
- Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19, 2006.
- Nontawat Charoenphakdee, Zhenghang Cui, Yivan Zhang, and Masashi Sugiyama. Classification with rejection based on cost-sensitive classification. In International Conference on Machine Learning, pages 1507–1517. PMLR, 2021.
- Edo Cohen-Karlik, Avichai Ben David, Nadav Cohen, and Amir Globerson. On the implicit bias of gradient descent for temporal extrapolation. In International Conference on Artificial Intelligence and Statistics, pages 10966–10981. PMLR, 2022.
- Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In Algorithmic Learning Theory: 27th International Conference, ALT 2016, Bari, Italy, October 19–21, 2016, Proceedings 27, pages 67–82. Springer, 2016.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Thomas G Dietterich and Alex Guyer. The familiarity hypothesis: Explaining the behavior of deep open set methods. Pattern Recognition, 132:108931, 2022.
- Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. Advances in Neural Information Processing Systems, 31, 2018.
- Leo Feng, Mohamed Osama Ahmed, Hossein Hajimirsadeghi, and Amir H Abdi. Towards better selective classification. In The Eleventh International Conference on Learning Representations, 2023.
- Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059. PMLR, 2016.
- Tomer Galanti, Zachary S Siegel, Aparna Gupte, and Tomaso Poggio. SGD and weight decay provably induce a low-rank bias in neural networks. arXiv, 2022.
- Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. Advances in Neural Information Processing Systems, 30, 2017.
- Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3(4):5, 2009.
- Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.
- Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
- Fredrik K Gustafsson, Martin Danelljan, and Thomas B Schön. How reliable is your regression model’s uncertainty under real-world distribution shifts? arXiv preprint arXiv:2302.03679, 2023.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 41–50, 2019.
- Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
- Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
- Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018.
- Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. ICCV, 2021.
- Haiwen Huang, Zhihan Li, Lulu Wang, Sishuo Chen, Bin Dong, and Xinyu Zhou. Feature space singularity for out-of-distribution detection. arXiv preprint arXiv:2011.14654, 2020.
- Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit Agrawal, and Phillip Isola. The low-rank simplicity bias in deep networks. arXiv preprint arXiv:2103.10427, 2021.
- Daksh Idnani, Vivek Madan, Naman Goyal, David J Schwab, and Shanmukha Ramakrishna Vedantam. Don’t forget the nullspace! Nullspace occupancy as a mechanism for out of distribution failure. In The Eleventh International Conference on Learning Representations, 2022.
- Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning. Advances in Neural Information Processing Systems, 2020.
- Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pages 5637–5664. PMLR, 2021.
- Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.
- Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2019.
- John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, pages 7721–7735. PMLR, 2021.
- Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don’t know? arXiv preprint arXiv:1810.09136, 2018.
- Behnam Neyshabur, Russ R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. Advances in Neural Information Processing Systems, 28, 2015.
- Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.
- Chenri Ni, Nontawat Charoenphakdee, Junya Honda, and Masashi Sugiyama. On the calibration of multiclass classification with rejection. Advances in Neural Information Processing Systems, 32, 2019.
- Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 32, 2019.
- Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387. IEEE, 2016.
- Tim Pearce, Alexandra Brintrup, and Jun Zhu. Understanding softmax confidence and uncertainty. arXiv preprint arXiv:2106.04972, 2021.
- Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019.
- Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
- Shibani Santurkar, Dimitris Tsipras, and Aleksander Madry. BREEDS: Benchmarks for subpopulation shift. arXiv preprint arXiv:2008.04859, 2020.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
- Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(5), 2007.
- Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In International Conference on Machine Learning, pages 20827–20840. PMLR, 2022.
- Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Jihoon Tack, Sangwoo Mo, Jongheon Jeong, and Jinwoo Shin. CSI: Novelty detection via contrastive learning on distributionally shifted instances. Advances in Neural Information Processing Systems, 33:11839–11852, 2020.
- Nadav Timor, Gal Vardi, and Ohad Shamir. Implicit regularization towards rank minimization in ReLU networks. In International Conference on Algorithmic Learning Theory, pages 1429–1459. PMLR, 2023.
- Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE, 2011.
- Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pages 10506–10518, 2019.
- Taylor Webb, Zachary Dulberg, Steven Frankland, Alexander Petrov, Randall O’Reilly, and Jonathan Cohen. Learning representations that support extrapolation. In International Conference on Machine Learning, pages 10136–10146. PMLR, 2020.
- Yongtao Wu, Zhenyu Zhu, Fanghui Liu, Grigorios Chrysos, and Volkan Cevher. Extrapolation and spectral bias of neural nets with Hadamard product: A polynomial net study. Advances in Neural Information Processing Systems, 35:26980–26993, 2022.
- Guoxuan Xia and Christos-Savvas Bouganis. Augmenting softmax information for selective classification with out-of-distribution data. In Proceedings of the Asian Conference on Computer Vision, pages 1995–2012, 2022.
- Keyulu Xu, Mozhi Zhang, Jingling Li, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. How neural networks extrapolate: From feedforward to graph neural networks. arXiv preprint arXiv:2009.11848, 2020.
- Gilad Yehudai, Ethan Fetaya, Eli Meirom, Gal Chechik, and Haggai Maron. From local structures to size generalization in graph neural networks. In International Conference on Machine Learning, pages 11975–11986. PMLR, 2021.
- Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5810–5818, 2017.

Appendix A Instances Where “Reversion to the OCS” Does Not Hold

In this section, we will discuss some instances where “reversion to the OCS” does not hold. The first example is evaluating an MNIST classifier (trained via cross entropy) on an adversarially generated dataset using the Fast Gradient Sign Method (FGSM) [Goodfellow et al., 2014]. We show our findings in Fig. 8. In the left plot, we can see that the outputs corresponding to the adversarial dataset were farther from the OCS than the predictions corresponding to the training distribution, even though the adversarial dataset is more OOD. In the right plot, we show the normalized representation magnitude of the original and adversarial distributions throughout different layers of the network. The representations corresponding to adversarial inputs have larger magnitudes than those corresponding to the original inputs. This is a departure from the leftmost plots in Fig. 4, where the norm of the representations in later layers decreased as the inputs became more OOD. One potential reason for this behavior is that the adversarial optimization pushes the adversarial inputs to yield representations which align more closely with the network weights, leading to higher-magnitude representations which push outputs farther from the OCS.
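For reference, FGSM perturbs an input by one step along the sign of the input gradient of the loss, $x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x \mathcal{L})$. A minimal numpy sketch on binary logistic regression, which we use as a stand-in for the MNIST classifier (our own simplification, chosen so the gradient has a closed form):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One FGSM step against a binary logistic-regression model.

    The gradient of the negative log-likelihood w.r.t. the input is
    (sigmoid(w @ x + b) - y) * w, so the attack moves each coordinate
    by eps in the sign of that gradient, increasing the loss.
    """
    grad_x = (sigmoid(w @ x + b) - y) * w
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w = rng.normal(size=16)   # fixed "model" weights
b = 0.0
x = rng.normal(size=16)   # a clean input
y = 1.0                   # its label

x_adv = fgsm(x, y, w, b, eps=0.1)
# The attack raises the loss, so the predicted probability of the true
# label drops relative to the clean input.
print(sigmoid(w @ x + b) > sigmoid(w @ x_adv + b))  # True
```

The same update applied to a deep network (with the gradient computed by backpropagation) produces the adversarial evaluation set described above.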

Figure 8: On the left, we evaluate the distance between network predictions and the OCS as the input distribution becomes more OOD for an MNIST classifier. The red star represents the (holdout) training distribution, and the blue circle represents an adversarially generated evaluation distribution. Even though the adversarial distribution is more OOD, its predictions were farther from the OCS. On the right, we plot the normalized representation magnitude across different layers of the network. The representations corresponding to adversarial inputs had greater magnitude throughout all the layers.

Our second example is Gaussian NLL models trained on UTKFace and evaluated on inputs with impulse noise. Previously, in Fig. 3, we showed that “reversion to the OCS” holds for UTKFace with Gaussian blur, but we found this is not necessarily the case for all corruptions. In Fig. 9, we show the behavior of the models evaluated on inputs with increasing amounts of impulse noise. In the middle plot, we can see that as the OOD score increases (greater noising), the distance to the OCS increases, contradicting “reversion to the OCS”. In the right plot, we show the magnitude of the representations in an internal layer for inputs with both Gaussian blur and impulse noise. While the representation norm decreases with greater Gaussian blur, it actually increases with greater impulse noise. We are not sure why the model follows “reversion to the OCS” for Gaussian blur but not impulse noise. We hypothesize that one potential reason could be that, because the model was trained to predict age, it learned to identify uneven texture as a proxy for wrinkles. Indeed, we found that the model predicted higher ages for inputs with greater levels of impulse noise, which caused the predictions to move farther from the OCS.

Figure 9: On the left, we visualize an example of UTKFace inputs in their original form, with Gaussian blur, and with impulse noise. In the middle, we evaluate the distance between network predictions and the OCS as the input distribution becomes more OOD. Each point represents a different evaluation dataset, with the red star representing the (holdout) training distribution, and circles representing OOD datasets with increasing levels of impulse noise. The vertical line associated with each point represents the standard deviation over 5 training runs. As the OOD score of the evaluation dataset increases, the model predictions here tended to move farther from the OCS. On the right, we plot the magnitude of the representation in a specific layer of the model corresponding to inputs with different levels of Gaussian blur and impulse noise. The representation magnitude tends to decrease with greater Gaussian blur, while it tends to increase with greater impulse noise.

Appendix B Experiment Details

B.1 Datasets

The datasets with discrete labels include CIFAR10 [Krizhevsky et al., 2009], ImageNet [Deng et al., 2009] (subsampled to 200 classes to match ImageNet-R(rendition) [Hendrycks et al., 2021]), DomainBed OfficeHome [Gulrajani and Lopez-Paz, 2020], BREEDS LIVING-17 and NON-LIVING-26 [Santurkar et al., 2020], and Wilds Amazon [Koh et al., 2021], and the datasets with continuous labels include SkinLesionPixels [Gustafsson et al., 2023] and UTKFace [Zhang et al., 2017]. We evaluate CIFAR10 models on CIFAR10-C [Hendrycks and Dietterich, 2019], which includes synthetic corruptions at varying levels of intensity. We evaluate ImageNet models on ImageNet-R and ImageNet-Sketch [Wang et al., 2019], which include renditions of ImageNet classes in novel styles and in sketch form. OfficeHome includes images of furniture in the style of product, photo, clipart, and art. We train on product images and evaluate on the other styles. BREEDS datasets consist of training and OOD test datasets consisting of the same classes but distinct subclasses. Wilds Amazon consists of training and OOD test datasets of Amazon reviews from different sets of users. To amplify the distribution shift, we subsample the OOD dataset to only include incorrectly classified points. SkinLesionPixels consists of dermatoscopic images, where the training and OOD test datasets were collected from patients from different countries. UTKFace consists of images of faces, and we construct OOD test datasets by adding different levels of Gaussian blur to the input.

B.2 Training Parameters

First, we will discuss the parameters we used to train our models.

| Task | Network Architecture |
| --- | --- |
| MNIST | 2 convolution layers followed by 2 fully connected layers, ReLU nonlinearities |
| CIFAR10 | ResNet20 |
| ImageNet | ResNet50 |
| OfficeHome | ResNet50 |
| BREEDS | ResNet18 |
| Amazon | DistilBERT |
| SkinLesionPixels | ResNet34 |
| UTKFace | Custom VGG-style architecture |

| Task | Optimizer | Learning Rate | Learning Rate Scheduler | Weight Decay | Momentum |
| --- | --- | --- | --- | --- | --- |
| MNIST | Adam | 0.001 | Step; γ = 0.7 | 0.01 | - |
| CIFAR10 | SGD | 0.1 | Multi-step; milestones = [100, 150] | 0.0001 | 0.9 |
| ImageNet | SGD | 0.1 | Step; γ = 0.1, step size = 30 | 0.0001 | 0.9 |
| OfficeHome | Adam | 0.00005 | - | 0 | - |
| BREEDS | SGD | 0.2 | Linear with warmup (warmup frac = 0.05) | 0.00005 | 0.9 |
| Amazon | AdamW | 0.00001 | Linear with warmup (warmup frac = 0) | 0.01 | - |
| SkinLesionPixels | Adam | 0.001 | - | 0 | - |
| UTKFace | Adam | 0.001 | - | 0 | - |

| Task | Data Preprocessing |
| --- | --- |
| MNIST | Normalization |
| CIFAR10 | Random horizontal flip, random crop, normalization |
| ImageNet | Random resized crop, random horizontal flip, normalization |
| OfficeHome | Random resized crop, random horizontal flip, color jitter, random grayscale, normalization |
| BREEDS | Random horizontal flip, random resized crop, RandAugment |
| Amazon | DistilBERT tokenizer |
| SkinLesionPixels | Normalization |
| UTKFace | Normalization |

B.3 Evaluation Metrics

Next, we will describe the details of our OOD score calculation. For image datasets, we pass the image inputs through a pretrained ResNet18 ImageNet featurizer to get feature representations, and train a linear classifier to classify whether the feature representations are from the training distribution or the evaluation distribution. We balance the training data of the classifier such that each distribution makes up 50 percent. We then evaluate the linear classifier on the evaluation distribution, and calculate the average predicted likelihood that the batch of inputs are sampled from the evaluation distribution, which we use as the OOD score. For text datasets, we use a similar approach, but use a DistilBERT classifier and Tokenizer instead of the linear classifier and ImageNet featurizer.
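The OOD score procedure can be sketched end-to-end. In this illustration we substitute 2D Gaussian blobs for the ResNet18 features and a hand-rolled logistic regression for the linear classifier; both are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for featurized inputs (the paper uses a pretrained ResNet18
# featurizer; shifted Gaussian blobs are our own simplification).
train_feats = rng.normal(loc=0.0, size=(500, 2))
eval_feats = rng.normal(loc=2.0, size=(500, 2))  # shifted -> OOD

# Balanced domain-classification problem: label 1 = "from eval distribution".
X = np.vstack([train_feats, eval_feats])
y = np.concatenate([np.zeros(500), np.ones(500)])

# Plain logistic regression trained by gradient descent on the NLL.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - y                       # gradient of the mean NLL w.r.t. logits
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

# OOD score: average predicted likelihood that the evaluation batch
# was sampled from the evaluation distribution.
ood_score = (1.0 / (1.0 + np.exp(-(eval_feats @ w + b)))).mean()
print(float(ood_score))  # close to 1 for a clearly shifted distribution
```

An in-distribution evaluation set would instead receive a score near 0.5 or below, since the domain classifier cannot separate it from the training features.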

There are some limitations to the OOD score. Ideally, we would like the OOD score to measure how well a model trained on the training distribution will generalize to the evaluation distribution. This is often the case, as with the datasets in our experiments, but not always. Consider a scenario where the evaluation dataset is a subset of the training dataset containing a particular type of feature. Here, a neural network model trained on the training dataset will likely generalize well to the evaluation dataset in terms of task performance. However, the evaluation dataset will likely receive a high OOD score, because its inputs will be distinguishable from the training inputs due to their high concentration of that particular feature. In this case, the OOD score is not a good measure of the distribution shift of the evaluation dataset.

Additionally, with regards to our measure of distance between model predictions and the OCS, we note that this measure is only informative if the different datasets being evaluated have around the same distribution of labels. This is because both the MSE and the KL metrics are being averaged over the evaluation dataset.

B.4 Characterizing the OCS for Common Loss Functions

In this section, we will precisely characterize the OCS for a few of the most common loss functions: cross entropy, mean squared error (MSE), and Gaussian negative log likelihood (Gaussian NLL).

Cross entropy.

With a cross entropy loss, the neural network outputs a vector in which each entry is associated with a class, which we denote as $f_\theta(x)_i$. This vector parameterizes a categorical distribution, $P_\theta(y_i \mid x) = \frac{e^{f_\theta(x)_i}}{\sum_{1 \leq j \leq m} e^{f_\theta(x)_j}}$. The loss function minimizes the divergence between $P_\theta(y \mid x)$ and $P(y \mid x)$, given by

$$\mathcal{L}(f_\theta(x), y) = -\sum_{1 \leq i \leq m} \mathbb{1}[y = y_i] \log\left(\frac{e^{f_\theta(x)_i}}{\sum_{1 \leq j \leq m} e^{f_\theta(x)_j}}\right).$$

While there can exist multiple optimal constant solutions for the cross entropy loss, they all map to the same distribution, which matches the marginal empirical distribution of the training labels:

$$P_{f^*_{\text{constant}}}(y_i) = \frac{e^{f^*_{\text{constant},i}}}{\sum_{1 \leq j \leq m} e^{f^*_{\text{constant},j}}} = \frac{1}{N} \sum_{1 \leq j \leq N} \mathbb{1}[y_j = y_i].$$

For the cross entropy loss, the uncertainty of the neural network prediction can be captured by the entropy of the output distribution. Because 𝑃 𝑓 constant * ⁢ ( 𝑦 ) usually has much higher entropy than 𝑃 𝜃 ⁢ ( 𝑦 | 𝑥 ) evaluated on the training distribution, 𝑃 𝜃 ⁢ ( 𝑦 | 𝑥 ) on OOD inputs will tend to have higher entropy than on in-distribution inputs.
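One can verify numerically that the constant logits minimizing the average cross entropy reproduce the empirical label marginal. A small numpy sketch with an arbitrary 4-class marginal (our own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 4, 1000
labels = rng.choice(m, size=N, p=[0.5, 0.25, 0.15, 0.1])
marginal = np.bincount(labels, minlength=m) / N

# Minimize the average cross entropy over a single constant logit vector.
logits = np.zeros(m)
for _ in range(2000):
    p = np.exp(logits) / np.exp(logits).sum()  # softmax
    grad = p - marginal                        # d(avg cross entropy)/d(logits)
    logits -= 0.5 * grad

ocs_probs = np.exp(logits) / np.exp(logits).sum()
print(ocs_probs.round(3))  # matches the empirical label marginal
```

Since the average cross entropy over the dataset depends on the constant logits only through the softmax, gradient descent drives the OCS distribution to the label marginal, exactly as the closed form above states.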

Gaussian NLL.

With this loss function, the output of the neural network parameterizes the mean and standard deviation of a Gaussian distribution, which we denote as $f_\theta(x) = [\mu_\theta(x), \sigma_\theta(x)]$. The objective is to minimize the negative log likelihood of the training labels under the predicted distribution, $P_\theta(y \mid x) \sim \mathcal{N}(\mu_\theta(x), \sigma_\theta(x))$:

$$\mathcal{L}(f_\theta(x), y) = \log(\sigma_\theta(x)^2) + \frac{(y - \mu_\theta(x))^2}{\sigma_\theta(x)^2}.$$

Let us similarly denote $f^*_{\text{constant}} = [\mu^*_{\text{constant}}, \sigma^*_{\text{constant}}]$. In this case, $\mu^*_{\text{constant}} = \frac{1}{N} \sum_{1 \leq i \leq N} y_i$, and $(\sigma^*_{\text{constant}})^2 = \frac{1}{N} \sum_{1 \leq i \leq N} (y_i - \mu^*_{\text{constant}})^2$. Here, $\sigma^*_{\text{constant}}$ is usually much higher than the standard deviation of $P(y \mid x)$ for any given $x$. Thus, our observation suggests that neural networks should predict higher standard deviations for OOD inputs than training inputs.
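This closed-form constant solution can be checked numerically: with labels drawn from $\mathcal{N}(5, 2^2)$ (an arbitrary choice for illustration), the empirical mean and standard deviation should minimize the average NLL, and perturbing either parameter should only increase it:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=10000)

def avg_nll(mu, sigma):
    # Average Gaussian NLL of the labels y under a constant prediction
    # [mu, sigma], using the loss form from this section.
    return np.mean(np.log(sigma**2) + (y - mu) ** 2 / sigma**2)

mu_star = y.mean()    # closed-form optimal constant mean
sigma_star = y.std()  # closed-form optimal constant std (population std)

# Perturbing either parameter away from the closed form increases the loss.
for mu, sigma in [(mu_star + 0.5, sigma_star),
                  (mu_star, sigma_star * 1.2),
                  (mu_star - 0.5, sigma_star * 0.8)]:
    assert avg_nll(mu, sigma) > avg_nll(mu_star, sigma_star)

print(round(float(mu_star), 2), round(float(sigma_star), 2))  # near 5.0, 2.0
```

Setting the derivative of the average NLL with respect to $\sigma^2$ to zero gives $\sigma^2 = \frac{1}{N}\sum_i (y_i - \mu)^2$, which is why the population standard deviation (not the per-input noise level) appears in the OCS.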

MSE.

Mean squared error can be seen as a special case of Gaussian NLL in which the network only predicts the mean of the Gaussian distribution while the standard deviation is held constant, i.e., $P_\theta(y|x) \sim \mathcal{N}(f_\theta(x), 1)$. The loss function is given by:

$$\mathcal{L}(f_\theta(x), y) = (y - f_\theta(x))^2.$$

Here, $f^*_{\text{constant}} = \frac{1}{N}\sum_{1\le i\le N} y_i$. Unlike the previous two examples, in which OOD predictions exhibited greater uncertainty, predictions from an MSE loss do not capture any explicit notion of uncertainty. However, our observation suggests that the model's predicted mean will still move closer to the average label value as the test-time inputs become more OOD.
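A small sketch (ours, not from the paper) computing the OCS for the MSE and Gaussian NLL losses directly from toy training labels:

```python
# Sketch (ours): the optimal constant solution (OCS) for MSE and Gaussian
# NLL can be computed in closed form from the training labels.
import numpy as np

y_train = np.array([0.5, 1.0, 1.5, 3.0, 4.0])   # toy regression labels

# MSE: the constant minimizing mean((y - c)^2) is the label mean.
f_mse = y_train.mean()

# Gaussian NLL: the constant (mu, sigma) minimizing
# mean(log(sigma^2) + (y - mu)^2 / sigma^2) is the label mean and std.
mu_star = y_train.mean()
sigma_star = np.sqrt(np.mean((y_train - mu_star) ** 2))

# Cross-check the MSE OCS against a brute-force grid search over constants.
grid = np.linspace(-1.0, 6.0, 2001)
mse = np.array([np.mean((y_train - c) ** 2) for c in grid])
assert abs(grid[mse.argmin()] - f_mse) < 1e-2
```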

Appendix C Empirical Analysis

C.1 Additional Experiments

In this section, we provide additional experimental analysis to support the hypothesis that we put forth in Section 4. First, to understand whether the trends that we observed for CIFAR10 and MNIST scale to larger models and datasets, we perform the same analysis as in Figure 4 on a ResNet50 model trained on ImageNet and evaluated on ImageNet-Sketch and ImageNet-R(endition). Our findings are presented in Figure 10. Here, we can see that the same trends from the CIFAR10 and MNIST analysis transfer to the ImageNet models.

Figure 10: Analysis of the interaction between representations and weights as distribution shift increases, for a model trained on ImageNet and evaluated on ImageNet-Sketch and ImageNet-R(enditions).

Next, we aim to better understand the effect of the network's normalization layers on the behavior of the model representations. We trained models with no normalization (NN), batch normalization (BN), and layer normalization (LN) on the MNIST and CIFAR10 datasets, and evaluated them on OOD test sets with increasing levels of rotation and noise. Note that the model architecture and all other training details are held fixed across these models (for each dataset), with the exception of the type of normalization layer used (or lack thereof). We perform the same analysis as in Figure 4 and present our findings in Figures 11 and 12. We found similar trends across the different models, consistent with the ones we presented in Figure 4.

Figure 11: Analysis of the interaction between representations and weights as distribution shift increases, for a model trained on MNIST and evaluated on increasing levels of rotation. The models considered include a four-layer neural network with no normalization (top), batch normalization (middle), and layer normalization (bottom).

Figure 12: Analysis of the interaction between representations and weights as distribution shift increases, for a model trained on CIFAR10 and evaluated on increasing levels of noise. The models considered include AlexNet with no normalization (top), batch normalization (middle), and layer normalization (bottom).

C.2 Analysis Details

In this section, we provide details on the specific neural network layers used in our analysis in Sections 4.1 and C.1. We illustrate diagrams of the neural network architectures used in each of our experiments, along with labels of the layers associated with each quantity we measure. In the first column of each analysis figure, we measure quantities at different layers of the network; we denote these by $i_0, \ldots, i_n$, where each $i$ represents one tick on the X-axis from left to right. We use $j$ to denote the layer used in the plots in the second column of each figure, and $k_{\text{CE}}$ (and $k_{\text{MSE}}$) to denote the layer used in the plots in the third (and fourth) column of each figure, respectively. We illustrate the networks used in Figure 4 in Figure 13, the one used in Figure 10 in Figure 14, the ones used in Figure 11 in Figure 15, and the ones used in Figure 12 in Figure 16.

Figure 13: Diagram of the neural network models used in our experimental analysis in Figure 4, along with labels of the specific layers used in our analysis.

Figure 14: Diagram of the neural network models used in our experimental analysis in Figure 10, along with labels of the specific layers used in our analysis.

Figure 15: Diagram of the neural network models used in our experimental analysis in Figure 11, along with labels of the specific layers used in our analysis.

Figure 16: Diagram of the neural network models used in our experimental analysis in Figure 12, along with labels of the specific layers used in our analysis.

Appendix D Proofs from Section 4.2

In the first subsection, we describe the setup and gradient flow, along with some results from prior works that we rely on in proving our claims about in-distribution and out-of-distribution activation magnitudes. In the following subsections, we prove our main claims from Section 4.2.

Setup.

We are learning over a class of homogeneous neural networks $\mathcal{F} := \{f(W; x) : W \in \mathcal{W}\}$ with $L$ layers and element-wise ReLU activation $\sigma(x) = x\,\mathbb{1}(x \ge 0)$, taking the functional form:

$$f(W; x) = W_L\, \sigma\big(W_{L-1} \ldots \sigma(W_2\, \sigma(W_1 x)) \ldots\big),$$

where $W_i \in \mathbb{R}^{m\times m}$ for all $i \in \{2, \ldots, L-1\}$, $W_1 \in \mathbb{R}^{m\times d}$, and the output dimension is set to $1$, i.e., $W_L \in \mathbb{R}^{1\times m}$. We say that the class $\mathcal{F}$ is homogeneous if there exists a constant $C$ such that, for all $W \in \mathcal{W}$ and all $\alpha > 0$:

$$f(\alpha \cdot W; x) = \alpha^C \cdot f(W; x).$$
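As a quick sanity check (ours, not from the paper), one can verify numerically that a bias-free deep ReLU network of this form is homogeneous with $C = L$:

```python
# Numerical check (ours): scaling every weight matrix of a bias-free
# ReLU network by alpha scales the output by alpha**L (homogeneity).
import numpy as np

def relu_net(weights, x):
    """Forward pass of f(W; x) = W_L σ(W_{L-1} ... σ(W_1 x))."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)   # ReLU after every hidden layer
    return weights[-1] @ h           # no activation on the scalar output

rng = np.random.default_rng(0)
d, m, L = 5, 8, 4
weights = [rng.standard_normal((m, d))]
weights += [rng.standard_normal((m, m)) for _ in range(L - 2)]
weights += [rng.standard_normal((1, m))]

x = rng.standard_normal(d)
alpha = 1.7
out = relu_net(weights, x)
out_scaled = relu_net([alpha * W for W in weights], x)
assert np.allclose(out_scaled, alpha**L * out)  # homogeneity of degree L
```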

Our focus is on a binary classification problem where we have a joint distribution over inputs and labels: $\mathcal{X} \times \mathcal{Y}$. Here, the inputs come from the set $\mathcal{X} := \{x \in \mathbb{R}^d : \|x\|_2 \le B\}$, and the labels are binary, $\mathcal{Y} := \{-1, +1\}$. We have an IID-sampled training dataset $\mathcal{D} := \{(x_i, y_i)\}_{i=1}^{n}$ containing data points $x_i$ and their corresponding labels $y_i$.

For a loss function $\ell : \mathbb{R} \to \mathbb{R}$, the empirical loss of $f(W; x)$ on the dataset $\mathcal{D}$ is

$$L(W; \mathcal{D}) := \sum_{i=1}^{n} \ell\big(y_i\, f(W; x_i)\big). \tag{1}$$

Here, the loss $\ell$ can be the exponential loss $\ell(q) = e^{-q}$ or the logistic loss $\ell(q) = \log(1 + e^{-q})$. To refer to the output (post-activations, if applicable) of layer $j$ of the network $f(W; \cdot)$ on the input $x$, we use the notation $f_j(W; x)$.

Gradient flow (GF).

We optimize the objective in Equation 1 using gradient flow. Gradient flow captures the behavior of gradient descent with an infinitesimally small step size [Arora et al., 2019, Huh et al., 2021, Galanti et al., 2022, Timor et al., 2023]. Let $W(t)$ be the trajectory of gradient flow, starting from some initial point $W(0)$, with dynamics given by the differential equation:

$$\frac{dW(t)}{dt} = -\nabla_W\, L(W; \mathcal{D})\,\Big|_{W = W(t)}. \tag{2}$$

Note that the ReLU function is not differentiable at $0$. Practical implementations of gradient methods define the derivative $\sigma'(0)$ to be some constant in $[0, 1]$. Following prior work [Timor et al., 2023], we assume for convenience that $\sigma'(0) = 0$. We say that gradient flow converges if the limit $\lim_{t\to\infty} W(t)$ exists; in this case, we denote $W(\infty) := \lim_{t\to\infty} W(t)$. We say that gradient flow converges in direction if the limit $\lim_{t\to\infty} W(t)/\|W(t)\|_2$ exists.

Whenever the limit point $\lim_{t\to\infty} W(t)$ exists, we refer to it as the ERM solution found by gradient flow and denote it $\hat{W}$.
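As an illustration (our sketch, not the paper's experiments), gradient descent with a small step size is an Euler discretization of the gradient flow in Equation 2; on linearly separable data with the exponential loss, the training loss is driven toward zero and every point is eventually fit with positive margin:

```python
# Sketch (ours): gradient descent with a small step size approximating
# gradient flow on L(w; D) = sum_i exp(-y_i <w, x_i>) for the simplest
# homogeneous model, a linear predictor f(w; x) = <w, x>.
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 3
y = rng.choice([-1.0, 1.0], size=n)
X = rng.standard_normal((n, d))
X[:, 0] = y * (1.0 + rng.random(n))     # separable with margin >= 1 along e_0

w = np.zeros(d)
eta = 1e-2                               # small step size ~ gradient flow
for _ in range(50000):
    margins = y * (X @ w)
    grad = -(np.exp(-margins) * y) @ X   # gradient of sum_i exp(-y_i <w, x_i>)
    w -= eta * grad

assert np.exp(-(y * (X @ w))).sum() < 0.2   # loss driven toward zero
assert (y * (X @ w) > 0).all()              # all training points fit
```

The norm of `w` keeps growing while its direction stabilizes, which is exactly the "convergence in direction" notion used above.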

Gradient flow convergence for interpolating homogeneous networks.

Now, we use a result from prior work on the implicit bias of gradient flow towards max-margin solutions when sufficiently deep and wide homogeneous networks are trained with small learning rates and exponential-tail classification losses. As the loss converges to zero, the solution approaches a KKT point of an optimization problem that finds the minimum-$\ell_2$-norm neural network with a margin of at least 1 on each training point. This is formalized in the following lemma, adapted from Ji and Telgarsky [2020] and Lyu and Li [2019].

Lemma D.1 (Gradient flow is implicitly biased towards minimum ∥ ⋅ ∥ 2 )

Consider minimizing the average of either the exponential or the logistic loss (in equation 1) over a binary classification dataset 𝒟 using gradient flow (in equation 2) over the class of homogeneous neural networks ℱ with ReLU activations. If the average loss on 𝒟 converges to zero as  𝑡 → ∞ , then gradient flow converges in direction to a first order stationary point (KKT point) of the following maximum margin problem in the parameter space of 𝒲 :

$$\min_{f(W;x)\in\mathcal{F}} \; \frac{1}{2}\sum_{j\in[L]} \|W_j\|_2^2 \quad \text{s.t. } \forall i\in[n]:\; y_i\, f(W; x_i) \ge 1. \tag{3}$$

Spectrally normalized margin-based generalization bounds [Bartlett et al., 2017].

Prior work on Rademacher complexity based generalization bounds provides excess risk bounds based on spectrally normalized margins, which scale with the Lipschitz constant (product of spectral norms of weight matrices) divided by the margin.

Lemma D.2 (Adaptation of Theorem 1.1 from [Bartlett et al., 2017])

For the class $\mathcal{F}$ of homogeneous ReLU networks with reference matrices $(A_1, \ldots, A_L)$, and for all distributions $P$ inducing binary classification problems, i.e., distributions over $\mathbb{R}^d \times \{-1, +1\}$: with probability $1-\delta$ over the IID-sampled dataset $\mathcal{D}$, for any margin $\gamma > 0$, the network $f(W; x)$ has expected margin loss upper bounded as:

$$\mathbb{E}_{(x,y)\sim P}\, \mathbb{1}\big(y\, f(W; x) < \gamma\big) \;\le\; \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big(y_i\, f(W; x_i) < \gamma\big) \;+\; \tilde{\mathcal{O}}\!\left(\frac{\mathcal{R}_{W,A}\,\sqrt{\log m}}{\gamma\,\sqrt{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right),$$

where the covering number bound determines

$$\mathcal{R}_{W,A} := \left(\prod_{i=1}^{L} \|W_i\|_{\mathrm{op}}\right)\left(\sum_{i=1}^{L} \frac{\|W_i^\top - A_i^\top\|_{2,1}}{\|W_i\|_{\mathrm{op}}^{2/3}}\right).$$

D.1 Lower bound for activation magnitude on in-distribution data

Proposition D.1 ($P_{\text{train}}$ observes high-norm features)

When $f(\hat W; x)$ fits $\mathcal{D}$, i.e., $y_i\, f(\hat W; x_i) \ge \gamma$ for all $i \in [N]$, then with high probability $1-\delta$ over $\mathcal{D}$, the layer-$j$ representations $f_j(\hat W; x)$ satisfy

$$\mathbb{E}_{P_{\text{train}}}\big[\|f_j(\hat W; x)\|_2\big] \;\ge\; \frac{1}{C_0}\left(\gamma - \tilde{\mathcal{O}}\!\left(\sqrt{\frac{\log(1/\delta)}{N}} + \frac{C_1\,\sqrt{\log m}}{\sqrt{N}\,\gamma}\right)\right),$$

if there exist constants $C_0, C_1$ such that $\|\hat W_j\|_{\mathrm{op}} \le C_0^{1/L}$ for every layer $j$ and $C_1 \ge C_0^{3L/2}$.

Proof.

Here, we lower bound the expected magnitude of the in-distribution activation norms at a fixed layer $j$ in terms of the expected activation norm at the last layer, by alternately applying the Cauchy-Schwarz inequality and the ReLU property $\|\sigma(x)\|_2 \le \|x\|_2$:

$$\begin{aligned}
\mathbb{E}_{P_{\text{train}}}\,|f(\hat W; x)| &= \mathbb{E}_{P_{\text{train}}}\big[\|\hat W_L\, \sigma(\hat W_{L-1} \ldots \sigma(\hat W_2\, \sigma(\hat W_1 x))\ldots)\|_2\big] \\
&\le \|\hat W_L\|_{\mathrm{op}}\; \mathbb{E}_{P_{\text{train}}}\big[\|\sigma(\hat W_{L-1} \ldots \sigma(\hat W_2\, \sigma(\hat W_1 x))\ldots)\|_2\big] && \text{(Cauchy-Schwarz)} \\
&\le \|\hat W_L\|_{\mathrm{op}}\; \mathbb{E}_{P_{\text{train}}}\big[\|\hat W_{L-1} \ldots \sigma(\hat W_2\, \sigma(\hat W_1 x))\|_2\big] && \text{(ReLU property)}
\end{aligned}$$

Applying these two steps repeatedly gives the bound:

$$\mathbb{E}_{P_{\text{train}}}\,|f(\hat W; x)| \;\le\; \left(\prod_{k=j+1}^{L} \|\hat W_k\|_{\mathrm{op}}\right) \mathbb{E}_{P_{\text{train}}}\big[\|f_j(\hat W; x)\|_2\big] \;\le\; C_0^{\frac{L-j}{L}}\; \mathbb{E}_{P_{\text{train}}}\big[\|f_j(\hat W; x)\|_2\big] \;\le\; C_0\; \mathbb{E}_{P_{\text{train}}}\big[\|f_j(\hat W; x)\|_2\big].$$
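The peeling argument above can be checked numerically; the following sketch (ours, with arbitrary random weights) verifies $|f(W;x)| \le \big(\prod_{k=j+1}^{L}\|W_k\|_{\mathrm{op}}\big)\,\|f_j(W;x)\|_2$:

```python
# Numerical sanity check (ours) of the layer-peeling bound: the output
# magnitude is at most the layer-j activation norm times the product of
# operator norms of layers j+1..L (Cauchy-Schwarz + ||relu(v)|| <= ||v||).
import numpy as np

rng = np.random.default_rng(2)
d, m, L, j = 6, 10, 5, 2
Ws = [rng.standard_normal((m, d))] + \
     [rng.standard_normal((m, m)) for _ in range(L - 2)] + \
     [rng.standard_normal((1, m))]

x = rng.standard_normal(d)
h, acts = x, []
for W in Ws[:-1]:
    h = np.maximum(W @ h, 0.0)
    acts.append(h)                       # post-ReLU activations f_1..f_{L-1}
out = float(Ws[-1] @ h)                  # f(W; x)

f_j = acts[j - 1]                        # layer-j representation
op_norms = [np.linalg.norm(W, 2) for W in Ws[j:]]  # layers j+1, ..., L
bound = np.prod(op_norms) * np.linalg.norm(f_j)
assert abs(out) <= bound + 1e-9
```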

Next, we use a generalization bound on the margin loss to further lower bound the expected norm of the last-layer activations. Recall that we use gradient flow to converge to a globally optimal solution of the objective in Equation 1, such that the training loss converges to $0$ as $t\to\infty$. Since $f(\hat W; x)$ fits $\mathcal{D}$ with margin $\gamma$, the empirical margin-loss term vanishes, and the spectrally normalized generalization bound from Lemma D.2 gives:

$$\mathbb{E}_{(x,y)\sim P}\, \mathbb{1}\big(y\, f(\hat W; x) < \gamma\big) \;\lesssim\; \tilde{\mathcal{O}}\!\left(\frac{\sqrt{\log m}}{\gamma\sqrt{N}}\left(\prod_{i=1}^{L}\|\hat W_i\|_{\mathrm{op}}\right)\left(\sum_{i=1}^{L}\frac{\|\hat W_i^\top - A_i^\top\|_{2,1}}{\|\hat W_i\|_{\mathrm{op}}^{2/3}}\right) + \sqrt{\frac{\log(1/\delta)}{N}}\right).$$

This implies that, with probability $1-\delta$ over the training set $\mathcal{D}$, the margin is at least $\gamma$ on all but a $\tilde{\mathcal{O}}(\cdot)$-fraction of the test distribution. If we denote the set of test points classified correctly with margin at least $\gamma$ as $\mathcal{C}_{\hat W}$, then:

$$\mathbb{E}\big[|f(\hat W; x)| \mid (x,y) \in \mathcal{C}_{\hat W}\big] = \mathbb{E}\big[|y \cdot f(\hat W; x)| \mid (x,y) \in \mathcal{C}_{\hat W}\big] \ge \mathbb{E}\big[y \cdot f(\hat W; x) \mid (x,y) \in \mathcal{C}_{\hat W}\big] \ge \gamma.$$

We are left with lower bounding $\mathbb{E}\big[|f(\hat W; x)| \mid (x,y) \notin \mathcal{C}_{\hat W}\big]$, which is trivially $\ge 0$. Now, from the generalization guarantee we know:

$$\begin{aligned}
\mathbb{E}\big[\mathbb{1}\big((x,y) \notin \mathcal{C}_{\hat W}\big)\big] &\lesssim \tilde{\mathcal{O}}\!\left(\frac{\sqrt{\log m}}{\gamma\sqrt{N}}\left(\prod_{i=1}^{L} C_0^{1/L}\right)\left(\sum_{i=1}^{L}\frac{\|\hat W_i^\top - A_i^\top\|_{2,1}}{\|\hat W_i\|_{\mathrm{op}}^{2/3}}\right) + \sqrt{\frac{\log(1/\delta)}{N}}\right) \\
&\lesssim \tilde{\mathcal{O}}\!\left(\frac{\sqrt{\log m}}{\gamma\sqrt{N}}\; C_0 \left(\sum_{i=1}^{L}\frac{\|\hat W_i^\top - A_i^\top\|_{2,1}}{\|\hat W_i\|_{\mathrm{op}}^{2/3}}\right) + \sqrt{\frac{\log(1/\delta)}{N}}\right) \\
&\lesssim \tilde{\mathcal{O}}\!\left(\frac{\sqrt{\log m}}{\gamma\sqrt{N}}\; C_1 + \sqrt{\frac{\log(1/\delta)}{N}}\right),
\end{aligned}$$

where the final inequality uses

$$\frac{1}{L}\sum_{i=1}^{L} \|W_i\|_2^{2/3} \;\ge\; \left(\prod_{i=1}^{L} \|W_i\|_2^{2/3}\right)^{1/L},$$

which is the standard AM-GM inequality, together with the assumption $C_1 \ge C_0^{3L/2}$. This bound tells us that:

$$\mathbb{E}\big[\mathbb{1}\big((x,y) \in \mathcal{C}_{\hat W}\big)\big] \;\ge\; 1 - \tilde{\mathcal{O}}\!\left(\frac{\sqrt{\log m}}{\gamma\sqrt{N}}\; C_1 + \sqrt{\frac{\log(1/\delta)}{N}}\right).$$

Plugging the above into the lower bound we derived completes the proof of Proposition D.1.

D.2 Upper bound for activation magnitude on out-of-distribution data

Theorem D.3 (Feature norms can drop easily on $P_{\text{OOD}}$)

If there exists a shallow network $f'(W; x)$ with $L'$ layers and $m'$ neurons satisfying the conditions in Proposition 4.1 (with $\gamma = 1$), then optimizing the training objective with gradient flow over a class of deeper and wider homogeneous networks $\mathcal{F}$ with $L > L'$, $m > m'$ converges directionally to a solution $f(\hat W; x)$ for which the following holds: there exists a set of rank-$1$ projection matrices $\{A_i\}_{i=1}^{L}$ such that, if the representations at any layer $j$ satisfy $\mathbb{E}_{P_{\text{OOD}}}\|A_j f_j(\hat W; x)\|_2 \le \epsilon$, then there exists a constant $C_2$ for which $\mathbb{E}_{P_{\text{OOD}}}\big[|f(\hat W; x)|\big] \lesssim C_0\big(\epsilon + C_2^{1/L}\cdot \frac{L+1}{L}\big)$.

Proof.

Here, we show that there exist almost-rank-one subspaces at each layer of the neural network such that, if the OOD representations deviate even slightly from the low-rank subspace at any given layer, the last-layer magnitude will collapse. This phenomenon is exacerbated in deep and wide networks, since gradient descent on deep homogeneous networks is biased towards KKT points of a minimum-norm, max-margin solution [Lyu and Li, 2019], which consequently leads gradient flow on sufficiently deep and wide networks towards weight matrices that are low-rank (as low as rank $1$).

We then show by construction that there always exist low-rank subspaces from which the OOD representations must not deviate if the last-layer magnitudes are not to drop. Before we prove the main result, we adapt some results from Timor et al. [2023] to show that gradient flow is biased towards low-rank solutions in our setting. This is formally presented in Lemma D.4.

Lemma D.4 (GF on deep and wide nets learns low-rank $W_1, \ldots, W_L$)

We are given the IID-sampled dataset for the binary classification task defined by the distribution $P_{\text{train}}$, i.e., $\mathcal{D} := \{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathbb{R}^d \times \{-1, 1\}$, where $\|x_i\|_2 \le 1$ with probability $1$. Suppose there exists a homogeneous network $f(W'; x)$ in a class $\mathcal{F}'$ of depth $L' \ge 2$ and width $m' \ge 2$ such that $f(W'; x_i)\cdot y_i \ge \gamma$ for all $(x_i, y_i) \in \mathcal{D}$, and the weight matrices $W'_1, \ldots, W'_{L'}$ satisfy $\|W'_i\|_F \le C$ for some fixed constant $C > 0$. If the solution $f(W^\star; x)$ of gradient flow over any class $\mathcal{F}$ of deeper ($L > L'$) and wider ($m > m'$) networks $f(W; x)$ converges to the global optimum of the optimization problem:

$$\min_{f(W;x)\in\mathcal{F}} \; \frac{1}{2}\sum_{j\in[L]} \|W_j\|_2^2 \quad \text{s.t. } \forall i\in[n]:\; y_i\, f(W; x_i) \ge 1, \tag{4}$$

then for some universal constant $C_1$, the following is satisfied:

$$\max_{i\in[L]} \frac{\|W_i^\star\|_{\mathrm{op}}}{\|W_i^\star\|_F} \;\ge\; \frac{1}{L}\sum_{i=1}^{L} \frac{\|W_i^\star\|_{\mathrm{op}}}{\|W_i^\star\|_F} \;\ge\; C_1^{1/L}\cdot\frac{L}{L+1}. \tag{5}$$

Proof.

We prove this using the result from Lemma D.1; major parts of the proof technique re-derive some of the results from Timor et al. [2023] in our setting. From Lemma D.1 we know that gradient flow on $\mathcal{F}$ necessarily converges in direction to a KKT point of the optimization problem in Equation 4. Furthermore, from Lyu and Li [2019] we know that this optimization problem satisfies the Mangasarian-Fromovitz Constraint Qualification (MFCQ) condition, which means that the KKT conditions are first-order necessary conditions for global optimality.

We first construct a wide and deep network $f(W; x) \in \mathcal{F}$ from the network $f(W'; x)$ in the relatively shallower class $\mathcal{F}'$, and then argue that the Frobenius norm of the constructed network's weights upper bounds that of the global optimum of the problem in Equation 4.

Recall that $f(W'; x) \in \mathcal{F}'$ satisfies:

$$\forall (x_i, y_i) \in \mathcal{D}: \quad f(W'; x_i)\cdot y_i \ge \gamma.$$

Now we can begin the construction of $f(W; x) \in \mathcal{F}$. Set the scaling factor:

$$\alpha = (2/C)^{\frac{L-L'}{L}}.$$

Then, for any weight matrix $W_i$ with $i \in \{1, \ldots, L'-1\}$, set:

$$W_i = \alpha \cdot W'_i = (2/C)^{\frac{L-L'}{L}} \cdot W'_i.$$

Let $v$ be the weight vector of the output layer $L'$ in the shallow network $f(W'; x)$. Note that this is an $m'$-dimensional vector, since this is the final layer of $f(W'; x)$. But in our construction, layer $L'$ is a fully connected layer with matrix $W_{L'} \in \mathbb{R}^{m\times m}$. So for layer $L'$, we set the new matrix to be:

$$W_{L'} = \alpha \cdot \begin{bmatrix} v^\top \\ -v^\top \\ \mathbf{0}_m^\top \\ \vdots \\ \mathbf{0}_m^\top \end{bmatrix},$$

where $\mathbf{0}_m$ is the $m$-dimensional vector of $0$s (and $v$ is zero-padded to dimension $m$).

This means that the $L'$-th layer of $f(W; x)$ has the following first two neurons: one whose weights match the output layer of $f(W'; x)$, and one whose weights are its negation. Since the weights of $f(W; x)$ are constructed directly from those of $f(W'; x)$ via the scaling parameter $\alpha$ defined above, the pre-activation output of layer $L'$ in $f(W; x)$ satisfies, for every input $x$:

$$\big[\,\alpha^{L'}\, f(W'; x),\; -\alpha^{L'}\, f(W'; x),\; 0,\, \ldots,\, 0\,\big]^\top.$$

Next, we define the weight matrices for layers $\{L'+1, \ldots, L\}$. We set the matrices for $i \in \{L'+1, \ldots, L-1\}$ to be:

$$W_i = (2/C)^{-\frac{L'}{L}} \cdot \mathbf{I}_m,$$

where $\mathbf{I}_m$ is the $m\times m$ identity matrix. The last layer $L$ of $f(W; x)$ is set to $\big[(2/C)^{-\frac{L'}{L}},\, -(2/C)^{-\frac{L'}{L}},\, 0, \ldots, 0\big] \in \mathbb{R}^{1\times m}$. For this construction, we shall now prove that $f(W; x) = f(W'; x)$ for every input $x$.

For any input $x$, the post-activation output of layer $L'$ in $f(W; x)$ is:

$$\big(\mathrm{ReLU}(\alpha^{L'} \cdot f(W'; x)),\; \mathrm{ReLU}(-\alpha^{L'} \cdot f(W'; x)),\; 0,\, \ldots,\, 0\big).$$

Given our construction for the layers that follow, the output of the second-to-last layer is:

$$\Big(\big((2/C)^{-\frac{L'}{L}}\big)^{L-L'-1} \mathrm{ReLU}(\alpha^{L'} f(W'; x)),\; \big((2/C)^{-\frac{L'}{L}}\big)^{L-L'-1} \mathrm{ReLU}(-\alpha^{L'} f(W'; x)),\; 0,\, \ldots,\, 0\Big).$$

Applying the last layer $\big[(2/C)^{-\frac{L'}{L}}, -(2/C)^{-\frac{L'}{L}}, 0, \ldots, 0\big]$ and using $\mathrm{ReLU}(a) - \mathrm{ReLU}(-a) = a$, we get:

$$\big((2/C)^{-\frac{L'}{L}}\big)^{L-L'} \cdot \alpha^{L'} \cdot f(W'; x) \;=\; (2/C)^{-\frac{L'(L-L')}{L}} \cdot (2/C)^{\frac{(L-L')L'}{L}} \cdot f(W'; x) \;=\; f(W'; x).$$

Thus, $f(W; x) = f(W'; x)$ for all $x$.

Let $W = [W_1, \ldots, W_L]$ denote the parameters of the wider and deeper network $f(W; x)$ constructed above, and let $f(W^\star; x)$ be the network whose parameters $W^\star$ achieve the global optimum of the constrained optimization problem in Equation 4.

Because $f(W^\star; x)$ is of depth $L > L'$, has $m > m'$ neurons, and is the optimum of the squared $\ell_2$-norm minimization problem, we can conclude that $\|W^\star\|_2^2 \le \|W\|_2^2$. Therefore,

$$\begin{aligned}
\|W^\star\|_2^2 \;\le\; \|W\|_2^2 &= \sum_{i=1}^{L'-1} \big((2/C)^{\frac{L-L'}{L}}\big)^2\, \|W'_i\|_F^2 \;+\; \big((2/C)^{\frac{L-L'}{L}}\big)^2 \big(2\,\|v\|_2^2\big) \\
&\quad\; + (L-L'-1)\,\big((2/C)^{-\frac{L'}{L}}\big)^2 \cdot 2 \;+\; \big((2/C)^{-\frac{L'}{L}}\big)^2 \cdot 2 \\
&\le \big((2/C)^{\frac{L-L'}{L}}\big)^2\, C^2 (L'-1) \;+\; \big((2/C)^{\frac{L-L'}{L}}\big)^2 \cdot 2C^2 \;+\; \big(2(L-L'-1)+2\big)\,\big((2/C)^{-\frac{L'}{L}}\big)^2 \\
&= (2/C)^{\frac{2(L-L')}{L}}\, C^2 (L'+1) \;+\; (2/C)^{-\frac{2L'}{L}} \cdot 2\,(L-L') \\
&= 4\,(2/C)^{-\frac{2L'}{L}}\, (L'+1) \;+\; (2/C)^{-\frac{2L'}{L}} \cdot 2\,(L-L') \\
&\le 4\,(2/C)^{-\frac{2L'}{L}}\, (L+1).
\end{aligned}$$

Since $W^\star$ is a global optimum of Equation 4, we can also show that it satisfies:

$$\|W_i^\star\|_F = \|W_j^\star\|_F, \quad \forall\, i < j,\; i, j \in [L].$$

Hence there is a $C^\star > 0$ such that $C^\star = \|W_i^\star\|_F$ for all $i \in [L]$. To see why this is true, consider the network $f(\tilde W; x)$ where $\tilde W_i = \eta\, W_i^\star$ and $\tilde W_j = (1/\eta)\, W_j^\star$ for some $i < j$, $i, j \in [L]$. By the homogeneity of the network, for every input $x$ we have $f(\tilde W; x) = f(W^\star; x)$. We can see how the sum of squared weight norms changes with a small change in $\eta$:

$$\frac{d}{d\eta}\Big(\eta^2\, \|W_i^\star\|_F^2 + (1/\eta^2)\, \|W_j^\star\|_F^2\Big) = 0 \quad \text{at } \eta = 1,$$

since $W^\star$ is the optimal solution. Taking the derivative gives $2\eta\,\|W_i^\star\|_F^2 - (2/\eta^3)\,\|W_j^\star\|_F^2$, which vanishes at $\eta = 1$ only if $\|W_i^\star\|_F = \|W_j^\star\|_F$, for any $i < j$ and $i, j \in [L]$.

Based on the above result, we can return to our derivation of $\|W^\star\|_F^2 \le 4\,(2/C)^{-2L'/L}(L+1)$. Since $\|W^\star\|_F^2 = \sum_{i=1}^{L}\|W_i^\star\|_F^2 = L\,(C^\star)^2$, we have:

$$(C^\star)^2\, L \le 4\,(2/C)^{-\frac{2L'}{L}}(L+1) \;\implies\; (C^\star)^2 \le \frac{4\,(2/C)^{-\frac{2L'}{L}}(L+1)}{L} \;\implies\; \frac{1}{C^\star} \ge \frac{1}{2}\,\Big(\frac{2}{C}\Big)^{\frac{L'}{L}}\sqrt{\frac{L}{L+1}}.$$

Now we use the fact that $\forall x \in \mathcal{X}$, $\|x\|_2 \le 1$:

$$1 \le y_i\, f^\star(x_i) \le |f^\star(x_i)| \le \|x_i\|_2 \prod_{i\in[L]} \|W_i^\star\|_{\mathrm{op}} \le \prod_{i\in[L]} \|W_i^\star\|_{\mathrm{op}} \le \left(\frac{1}{L}\sum_{i\in[L]} \|W_i^\star\|_{\mathrm{op}}\right)^{L}.$$

Thus $\frac{1}{L}\sum_{i\in[L]} \|W_i^\star\|_{\mathrm{op}} \ge 1$. Plugging this into the lower bound on $1/C^\star$:

$$\frac{1}{L}\sum_{i\in[L]} \frac{\|W_i^\star\|_{\mathrm{op}}}{\|W_i^\star\|_F} \;=\; \frac{1}{C^\star}\cdot \frac{1}{L}\sum_{i\in[L]} \|W_i^\star\|_{\mathrm{op}} \;\ge\; \frac{1}{C^\star}\cdot 1 \;\ge\; \frac{1}{2}\cdot\Big(\frac{2}{C}\Big)^{\frac{L'}{L}}\cdot\frac{L}{L+1}. \tag{6}$$

This further implies that $\forall i \in [L]$:

$$\|W_i^\star\|_F \;\le\; \|W_i^\star\|_{\mathrm{op}}\cdot 2\cdot\Big(\frac{2}{C}\Big)^{-\frac{L'}{L}}\cdot\frac{L+1}{L}.$$

Setting $C' = 2^{L}\,(C/2)^{L'}$, so that $C'^{1/L} = 2\,(C/2)^{L'/L}$, we get the final result:

$$\|W_i^\star\|_F \;\le\; \|W_i^\star\|_{\mathrm{op}}\cdot C'^{1/L}\cdot \frac{L+1}{L}, \quad \forall i \in [L].$$

From Lemma D.4 we know that each weight matrix is almost rank $1$, i.e., for each layer $j \in [L]$ there exists a vector $v_j$ with $\|v_j\|_2 = 1$ such that $W_j \approx \sigma_j\, v_j v_j^\top$ for some $\sigma_j > 0$. More formally, for any $L > L'$ and $m \ge m'$ satisfying the conditions in Lemma D.4, for every layer $j$ we have:

$$\big\|(I - v_j v_j^\top)\,\hat W_j\big\|_F^2 \;=\; \|\hat W_j\|_F^2 - \sigma_j^2. \tag{8}$$
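The identity in Equation 8 can be checked numerically with an SVD; the following sketch (ours) also illustrates that for a nearly rank-$1$ matrix the residual outside the top singular direction is tiny:

```python
# Sketch (ours): if v is the top LEFT singular vector of W and sigma the
# top singular value, projecting out v satisfies
#   ||(I - v v^T) W||_F^2 = ||W||_F^2 - sigma^2,
# and for a nearly rank-1 W this residual is small relative to ||W||_F^2.
import numpy as np

rng = np.random.default_rng(3)
m = 12
u, v = rng.standard_normal(m), rng.standard_normal(m)
W = 10.0 * np.outer(u / np.linalg.norm(u), v / np.linalg.norm(v))
W += 0.01 * rng.standard_normal((m, m))    # small full-rank perturbation

U, S, Vt = np.linalg.svd(W)
v1, sigma1 = U[:, 0], S[0]                 # top left singular pair

residual = np.linalg.norm((np.eye(m) - np.outer(v1, v1)) @ W, 'fro') ** 2
assert np.isclose(residual, np.linalg.norm(W, 'fro') ** 2 - sigma1 ** 2)
assert residual < 1e-3 * np.linalg.norm(W, 'fro') ** 2  # nearly rank 1
```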

Next, we substitute the bound derived above, $\|\hat W_j\|_F \le \sigma_j\, C'^{1/L}\cdot\frac{L+1}{L}$ (where $\sigma_j = \|\hat W_j\|_{\mathrm{op}}$), to get the following when gradient flow converges to the globally optimal solution of Equation 4:

$$\big\|(I - v_j v_j^\top)\,\hat W_j\big\|_F^2 \;\le\; \sigma_j^2\left(C'^{2/L}\Big(\frac{L+1}{L}\Big)^2 - 1\right), \quad \text{hence} \quad \big\|(I - v_j v_j^\top)\,\hat W_j\big\|_F \;\le\; \sigma_j\, C'^{1/L}\,\frac{L+1}{L}. \tag{9}$$

Now we are ready to derive the final bound on the last layer activations:

$$\begin{aligned}
\mathbb{E}_{x\sim P_{\text{OOD}}}\,|f(\hat W; x)| &= \mathbb{E}_{P_{\text{OOD}}}\,\big|W_L\, \sigma(W_{L-1}\, \sigma(W_{L-2} \ldots \sigma(W_2\, \sigma(W_1 x))\ldots))\big| \\
&= \mathbb{E}_{P_{\text{OOD}}}\left[\|f_j(\hat W; x)\|_2 \cdot \Big|W_L\, \sigma\Big(W_{L-1} \ldots \frac{f_j(\hat W; x)}{\|f_j(\hat W; x)\|_2} \ldots\Big)\Big|\right] \\
&\le \mathbb{E}_{P_{\text{OOD}}}\big[\|f_j(\hat W; x)\|_2\big] \cdot \prod_{z=j+1}^{L} C_0^{1/L} \;\le\; \mathbb{E}_{P_{\text{OOD}}}\big[\|f_j(\hat W; x)\|_2\big]\cdot C_0,
\end{aligned}$$

where the inequality repeatedly applies Cauchy-Schwarz to the unit-norm vector $f_j(\hat W; x)/\|f_j(\hat W; x)\|_2$, along with the ReLU property $\|\sigma(v)\|_2 \le \|v\|_2$.

We bound $\mathbb{E}_{P_{\text{OOD}}}\|f_j(\hat W; x)\|_2$ by splitting $\hat W_j$ into its top rank-$1$ component and the residual, and using the assumption $\mathbb{E}_{P_{\text{OOD}}}\|v_j v_j^\top f_j(\hat W; x)\|_2 \le \epsilon$ (with $A_j = v_j v_j^\top$):

$$\begin{aligned}
\mathbb{E}_{P_{\text{OOD}}}\,\|f_j(\hat W; x)\|_2 &\le \sqrt{\mathbb{E}_{P_{\text{OOD}}}\,\|f_j(\hat W; x)\|_2^2} \\
&\le \sqrt{\mathbb{E}_{P_{\text{OOD}}}\Big(\sigma_j^2\,\big\|v_j v_j^\top f_j(\hat W; x)\big\|_2^2 \;+\; \big\|(I - v_j v_j^\top)\hat W_j\big\|_F^2\; C_0^2\Big)} \\
&\le \sqrt{\sigma_j^2\,\epsilon^2 \;+\; \big\|(I - v_j v_j^\top)\hat W_j\big\|_F^2\; C_0^2} \\
&\lesssim \sigma_j\left(\epsilon + C'^{1/L}\cdot\frac{L+1}{L}\right),
\end{aligned}$$

where the last step uses the bound from Equation 9 and absorbs the constant $C_0$ into the $\lesssim$.

Recall that $\sigma_j \le C_0^{1/L}$. From the above, we get the following result:

$$\mathbb{E}_{x\sim P_{\text{OOD}}}\,|f(\hat W; x)| \;\le\; C_0\; \mathbb{E}_{P_{\text{OOD}}}\,\|f_j(\hat W; x)\|_2 \;\lesssim\; C_0\left(\epsilon + C'^{1/L}\cdot\frac{L+1}{L}\right),$$

which completes the proof of Theorem D.3. ∎

D.3 Bias learned for nearly homogeneous nets

In this subsection, we analyze a slightly modified form of typical deep homogeneous networks with ReLU activations. To study the accumulation of model constants, we analyze the class of functions $\tilde{\mathcal{F}} = \{f(W; \cdot) + b : b \in \mathbb{R},\; f(W; \cdot) \in \mathcal{F}\}$, which consists of deep homogeneous networks with a bias term in the final layer. In Proposition 4.2, we show that there exists a set of margin points (analogous to support vectors in the linear setting) which solely determines the model's bias $\hat b$.

Proposition D.2 (Analyzing network bias)

If gradient flow on $\tilde{\mathcal{F}}$ converges directionally to $\hat W, \hat b$, then $\hat b \propto \sum_k y_k$ over the margin points $\big\{(x_k, y_k) : y_k \cdot f(\hat W; x_k) = \min_{j\in[N]} y_j \cdot f(\hat W; x_j)\big\}$.

Proof.

Lemmas C.8 and C.9 from Lyu and Li [2019] can be proven for gradient flow over $\tilde{\mathcal{F}}$ as well, since all we need to do is construct $h_1, \ldots, h_N$ such that $h_i$ satisfies first-order stationarity for the $i$-th constraint in the following optimization problem, which is only a lightly modified version of the problem instance $P$ in Lyu and Li [2019]:

$$\min_{W, b}\; L(W, b; \mathcal{D}) := \sum_{i=1}^{L} \|W_i\|_F^2 \quad \text{s.t. } y_i\, f(W; x_i) \ge 1 - b,\; \forall i \in [N].$$

Note that the above problem instance also satisfies MFCQ (Mangasarian-Fromovitz Constraint Qualification), which can also be shown directly using Lemma C.7 from Lyu and Li [2019].

As a consequence of the above, we can show that using gradient flow to optimize the objective:

$$\min_{W, b}\; \frac{1}{N}\sum_{(x,y)\in\mathcal{D}} \exp\big(-y\cdot(f(W; x) + b)\big),$$

also converges in direction to the KKT point of the above optimization problem for the nearly homogeneous networks $\tilde{\mathcal{F}}$. This result is also in line with the result for linear non-homogeneous networks derived in Soudry et al. [2018].

Finally, at directional convergence, the gradient of the loss $\partial L(W; \mathcal{D})/\partial W$ converges in direction, as a direct consequence of the asymptotic analysis in Lyu and Li [2019].

Let us denote the set of margin points at convergence as $\mathcal{M} = \big\{(x_k, y_k) : y_k\, f(W; x_k) = \min_{j\in[N]} y_j\, f(W; x_j)\big\}$. These are precisely the points for which the constraint in the above optimization problem is tight, and the gradients of their objectives are the only contributors to the construction of $h_1, \ldots, h_N$ for Lemma C.8 in Lyu and Li [2019]. Thus, at convergence the gradient directionally converges to the following value, which is determined purely by the margin points in $\mathcal{M}$:

$$\lim_{t\to\infty}\; \frac{\frac{\partial}{\partial W} L(W, b; \mathcal{D})}{\big\|\frac{\partial}{\partial W} L(W, b; \mathcal{D})\big\|_2} \;=\; -\,\frac{\sum_{k\in\mathcal{M}} y_k \cdot \nabla_W f(W; x_k)}{\big\|\sum_{k\in\mathcal{M}} y_k \cdot \nabla_W f(W; x_k)\big\|_2}.$$

Similarly, we can take the derivative of the objective with respect to the bias $b$ and verify its direction. Note that $\frac{\partial}{\partial b}\exp\big(-y(f(W;x)+b)\big) = -y\cdot\exp\big(-y(f(W;x)+b)\big)$, and, more importantly, $\exp\big(-y(f(W;x)+b)\big)$ evaluates to the same value for all margin points in $\mathcal{M}$. Hence,

$$\lim_{t\to\infty}\; \frac{\frac{\partial}{\partial b} L(W, b; \mathcal{D})}{\big|\frac{\partial}{\partial b} L(W, b; \mathcal{D})\big|} \;=\; -\,\frac{\sum_{k\in\mathcal{M}} y_k}{\big|\sum_{k\in\mathcal{M}} y_k\big|}.$$

While both $\hat b$ and $\hat W$ have converged directionally, their norms keep increasing, similar to the analyses in other works [Soudry et al., 2018, Huh et al., 2021, Galanti et al., 2022, Lyu and Li, 2019, Timor et al., 2023]. Thus, from the above result it is easy to see that the bias keeps growing along the direction given by the sum of the labels of the margin points in the binary classification task. This direction also matches the OCS solution direction (which depends only on the label marginal) if the label marginal distribution matches the distribution of the targets $y$ on the margin points. This completes the proof of Proposition D.2. ∎
