Title: Beyond In-Domain Scenarios: Robust Density-Aware Calibration
URL Source: https://arxiv.org/html/2302.05118
Abstract
Calibrating deep learning models to yield uncertainty-aware predictions is crucial as deep neural networks get increasingly deployed in safety-critical applications. While existing post-hoc calibration methods achieve impressive results on in-domain test datasets, they are limited by their inability to yield reliable uncertainty estimates in domain-shift and out-of-domain (OOD) scenarios. We aim to bridge this gap by proposing DAC, an accuracy-preserving as well as Density-Aware Calibration method based on k-nearest-neighbors (KNN). In contrast to existing post-hoc methods, we utilize hidden layers of classifiers as a source for uncertainty-related information and study their importance. We show that DAC is a generic method that can readily be combined with state-of-the-art post-hoc methods. DAC boosts the robustness of calibration performance in domain-shift and OOD, while maintaining excellent in-domain predictive uncertainty estimates. We demonstrate that DAC leads to consistently better calibration across a large number of model architectures, datasets, and metrics. Additionally, we show that DAC improves calibration substantially on recent large-scale neural networks pre-trained on vast amounts of data.
Machine Learning, Calibration, Uncertainty, Trustworthiness, ICML
1 Introduction
Deep learning models have become state-of-the-art (SOTA) in several different fields. Especially in safety-critical applications such as medical diagnosis and autonomous driving, with environments changing over time, reliable model estimates of predictive uncertainty are crucial. Thus, models are required to be accurate as well as calibrated, meaning that their predictive uncertainty (or confidence) matches the expected accuracy. Because many deep neural networks are generally uncalibrated (Guo et al., 2017), post-hoc calibration of already trained neural networks has received increasing attention in the last few years.
Figure 1: Left: Our density-aware calibration method (DAC) $g$ can be combined with existing post-hoc methods $h$, leading to robust and reliable uncertainty estimates. To this end, DAC leverages information from feature vectors $z_1 \dots z_L$ across the entire classifier $f$. Right: DAC is based on KNN, where predictive uncertainty is expected to be high for test samples lying in low-density regions of the empirical training distribution and vice versa.
In order to tackle the miscalibration of neural networks, researchers have come up with a plethora of post-hoc calibration methods (Guo et al., 2017; Zhang et al., 2020; Rahimi et al., 2020b; Milios et al., 2018; Tomani et al., 2022; Gupta et al., 2021). These current approaches are particularly designed for in-domain calibration, where test samples are drawn from the same distribution the network was trained on. Although these approaches perform almost perfectly in-domain, recent works (Tomani et al., 2021) have shown that they fall substantially short in providing reliable confidence scores in domain-shift and out-of-domain (OOD) scenarios (up to one order of magnitude worse than in-domain), which is unacceptable, particularly in safety-critical real-world scenarios where calibration of neural networks matters in order to prevent unforeseeable failures. To date, the only post-hoc methods that have been introduced to mitigate this shortcoming in domain-shift and OOD settings use artificially created data or data from different sources in order to estimate the potential test distribution (Tomani et al., 2021; Yu et al., 2022; Wald et al., 2021; Gong et al., 2021). However, these methods are not generic, in that they require domain knowledge about the dataset and utilize multiple domains for calibration. Additionally, they might only work well for a narrow subset of anticipated distributional shifts, because they rely heavily on strong assumptions about the potential test distribution. Furthermore, they can hurt in-domain calibration performance.
To mitigate the issue of miscalibration in scenarios where test samples are not necessarily drawn from dense regions of the empirical training distribution or are even OOD, we introduce a density-aware method that extends the field of post-hoc calibration beyond in-domain calibration. Contrary to the aforementioned existing works, which rely on specially crafted calibration data, our method DAC does not depend on additional data, does not rely on any assumptions about potentially shifted or out-of-domain test distributions, and is even domain-agnostic. The proposed method can, therefore, simply be added to an existing post-hoc calibration pipeline, because it relies on the exact same training paradigm with a held-out in-domain validation set as current post-hoc methods do.
Previous works on calibration have focused primarily on post-hoc methods that solely take softmax outputs or logits into account (Guo et al., 2017; Zhang et al., 2020; Rahimi et al., 2020b; Milios et al., 2018; Tomani et al., 2022; Gupta et al., 2021). However, we argue that prior layers in neural networks contain valuable information for recalibration too. Moreover, we report which layers our method identifies as particularly relevant for providing well-calibrated predictive uncertainty estimates.
Recently developed large-scale neural networks that benefit from pre-training on vast amounts of data (Kolesnikov et al., 2020; Mahajan et al., 2018) have mostly been overlooked when benchmarking post-hoc calibration methods. One explanation could be that, e.g., vision transformers (ViTs) (Dosovitskiy et al., 2020a) are well calibrated out of the box (Minderer et al., 2021). Nevertheless, we show that these models, too, can profit from post-hoc methods, and in particular from DAC, through more robust uncertainty estimates.
1.1 Contribution
• We propose DAC, an accuracy-preserving and density-aware calibration method that can be combined with existing post-hoc methods to boost domain-shift and out-of-domain performance while maintaining in-domain calibration. (Source code available at: https://github.com/futakw/DensityAwareCalibration)
• We discover that the common practice of using solely the final logits for post-hoc calibration is sub-optimal and that aggregating intermediate outputs yields improved results.
• We study recent large-scale models, such as transformers, pre-trained on vast amounts of data, and find that our proposed method yields substantial calibration gains for these models as well.
2 Related Work
Calibration methods can be divided into post-hoc calibration methods and methods that adapt the training procedure of the classifier itself. The latter includes methods such as self-supervised learning (Hendrycks et al., 2019), Bayesian neural networks (Gal & Ghahramani, 2016; Wen et al., 2018), Deep Ensembles (Lakshminarayanan et al., 2017), label smoothing (Müller et al., 2019), methods based on synthesized feature statistics (Wang et al., 2020), mixup techniques (Thulasidasan et al., 2019; Zhang et al., 2017), and other intrinsically calibrated approaches (Sensoy et al., 2018; Tomani & Buettner, 2021; Ashukha et al., 2020). Similar to post-hoc methods, Ovadia et al. (2019) have found that intrinsic methods suffer from miscalibration in domain-shift scenarios as well.
Post-hoc calibration methods, on the other hand, can be applied on top of already trained classifiers and do not require any retraining of the underlying neural network. Wang et al. (2021) argue in favour of a unified framework comprising main training and post-hoc calibration. Rahimi et al. (2020a) provide a theoretical basis for post-hoc calibration schemes, showing that learning calibration functions post-hoc using a proper loss function leads to calibrated outputs.
These post-hoc calibration methods include non-parametric approaches such as histogram binning (Zadrozny & Elkan, 2001), where uncalibrated confidence scores are partitioned into bins and assigned a respective calibrated score by optimizing a bin-wise squared loss on a validation set. Isotonic regression (Zadrozny & Elkan, 2002), an extension of histogram binning, fits a piecewise constant function to intervals of uncalibrated confidence scores, while Bayesian Binning into Quantiles (BBQ) (Naeini et al., 2015) differs from isotonic regression in that it considers multiple binning models and their combination. In addition, Zhang et al. (2020) introduce an accuracy-preserving version of isotonic regression beyond binary tasks, which they call multi-class isotonic regression (IRM). Moreover, Wenger et al. (2020) and Milios et al. (2018) propose Gaussian-process-based calibration methods.
Approaches for training a mapping function include Platt scaling (Platt, 1999), matrix and vector scaling, and temperature scaling (Guo et al., 2017). Temperature scaling (TS) transforms logits by a single scalar parameter in an accuracy-preserving manner, since re-scaling does not affect the ranking of the logits. Moreover, Ensemble Temperature scaling (ETS) (Zhang et al., 2020) extends temperature scaling by two additional calibration maps with fixed temperatures of 1 and ∞, respectively. More recent and advanced approaches include Dirichlet-based scaling (Kull et al., 2019) and Parameterized Temperature scaling (Tomani et al., 2022), where a temperature is calculated sample-wise via a neural network architecture. Rahimi et al. (2020b) designed a post-hoc neural network architecture for transforming classifier logits that represents a class of intra-order-preserving functions, and Gupta et al. (2021) introduce a method for obtaining a calibration function by approximating the empirical cumulative distribution of output probabilities with the help of splines.
These post-hoc calibration methods are trained on a hold-out calibration set. Although there has been a surge of research on these approaches in recent years, Tomani et al. (2021) have discovered that post-hoc calibration methods yield highly over-confident predictions under domain-shift and are, therefore, not well suited for OOD scenarios. They introduce a strategy where samples are perturbed in the calibration set before performing the post-hoc calibration step. However, such an approach makes a strong distributional assumption on potential domain shifts during testing by perturbing training samples in a particular way, which may not necessarily hold in each case. To date, post-hoc calibration methods that are themselves capable of distinguishing in-domain samples from gradually shifted or out-of-domain samples without any distributional assumptions have not yet been addressed.
3 Method
3.1 Definitions
We study the multi-class classification problem, where $X \in \mathbb{R}^D$ denotes a $D$-dimensional random input variable and $Y \in \{1, 2, \dots, C\}$ denotes the label with $C$ classes, with a ground truth joint distribution $\pi(X, Y) = \pi(Y|X)\pi(X)$. The dataset $\mathbb{D}$ contains $N$ i.i.d. samples $\mathbb{D} = \{(X_n, Y_n)\}_{n=1}^{N}$ drawn from $\pi(X, Y)$.

Let the output of a trained neural network classifier $f$ be $f(X) = (y, \mathbf{z}_L)$, where $y$ denotes the predicted class and $\mathbf{z}_L$ the associated logits vector. The softmax function $\sigma_{SM}$ is then used to transform $\mathbf{z}_L$ into a confidence score (predictive uncertainty) $p = \max_c \sigma_{SM}(\mathbf{z}_L)^{(c)}$ w.r.t. $y$. In this paper, we propose an approach to improve the quality of the predictive uncertainty $p$ by recalibrating the logits $\mathbf{z}_L$ from $f(X)$ via a combination of two calibration methods:

$$p = h(g(f(X))) \tag{1}$$

where $g$ denotes our density-aware calibration method DAC, which rescales logits to boost domain-shift and OOD calibration performance, and $h$ denotes an existing state-of-the-art in-domain post-hoc calibration method (Fig. 1).
Following Guo et al. (2017), perfect calibration is defined such that confidence and accuracy match for all confidence levels:

$$\mathbb{P}(Y = y \mid P = p) = p, \quad \forall p \in [0, 1] \tag{2}$$

Consequently, miscalibration is defined as the difference in expectation between accuracy and confidence:

$$\mathbb{E}_{P}\left[\, \big| \mathbb{P}(Y = y \mid P = p) - p \big| \,\right] \tag{3}$$
3.2 Measuring Calibration
The expected calibration error (ECE) (Naeini et al., 2015) is frequently used for quantifying miscalibration. ECE is a scalar summary measure estimating miscalibration by approximating equation (3) as follows. In the first step, the confidence scores $\hat{\mathcal{P}}$ of all samples are partitioned into $M$ equally sized bins of width $1/M$; secondly, for each bin $B_m$, the respective mean confidence and accuracy are computed based on the ground truth class $y$. Finally, the ECE is estimated by calculating the mean difference between confidence and accuracy over all bins:

$$\mathrm{ECE}^{d} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left\| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right\|_{d} \tag{4}$$

with $d$ usually set to 1 (L1 norm).
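The binned estimator in equation (4) can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code, and the function name is ours:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Estimate ECE with d=1 and equal-width binning, as in Eq. (4).

    confidences: max softmax probability per sample, shape (N,)
    correct: 1 if the prediction matched the label, else 0, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    # Assign each sample to one of M equal-width bins over [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for m in range(n_bins):
        mask = bin_ids == m
        if mask.any():
            acc = correct[mask].mean()       # acc(B_m)
            conf = confidences[mask].mean()  # conf(B_m)
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```

With `n_bins=15` this matches the equal-width, 15-bin setting used throughout the paper.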
3.3 Density-Aware Calibration (DAC)
Our main idea for the proposed calibration method $g$ stems from the fact that test samples lying in high-density regions of the empirical training distribution can generally be predicted with higher confidence than samples lying in low-density regions. In the latter case, the network has seen very few, if any, training samples in the neighborhood of the respective test sample in feature space, and is thus unable to provide reliable predictions for those samples. Leveraging this information about density through a proxy can result in better calibration.
In order to estimate such a proxy for each sample, we propose to utilize non-parametric density estimation using k-nearest-neighbor (KNN) based on feature embeddings extracted from the classifier. KNN has successfully been applied in out-of-distribution detection (Sun et al., 2022). In contrast to Sun et al. (2022), who only take the penultimate layer into account, we argue that prior layers yield important information too, and therefore, incorporate them in our method as follows. We call our method Density-Aware Calibration (DAC).
Temperature scaling (Guo et al., 2017) is a frequently used calibration method, where a single scalar parameter $T$ is used to re-scale the logits of an already trained classifier in order to obtain calibrated probability estimates $\hat{Q}$ for logits $\mathbf{z}_L$ using the softmax function $\sigma_{SM}$ as

$$\hat{Q} = \sigma_{SM}(\mathbf{z}_L / T) \tag{5}$$

Similar to temperature scaling, our method is also accuracy-preserving, in that we use a single parameter $S(\mathbf{x}, w)$ for re-scaling the logits of the classifier:

$$\hat{Q}(\mathbf{x}, w) = \sigma_{SM}(\mathbf{z}_L / S(\mathbf{x}, w)) \tag{6}$$

In contrast to temperature scaling, in our case $S(\mathbf{x}, w)$ depends on the sample $\mathbf{x}$ and is calculated via a linear combination of density estimates $s_l$ as follows:

$$S(\mathbf{x}, w) = \sum_{l=1}^{L} w_l s_l + w_0 \tag{7}$$

with $w_1 \dots w_L$ being the weights for the $L$ feature layers and $w_0$ being a bias term. Note that only positive weights are valid, because negative weights would assign high confidence to outliers. Thus, we constrain the weights to be positive, which also tackles overfitting.
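Equations (6) and (7) amount to a per-sample temperature built from the layer-wise density proxies. A minimal sketch (helper names are ours, not the authors' implementation):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a single logits vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def density_aware_scale(s, w):
    """Per-sample scaling S(x, w) = sum_l w_l * s_l + w_0, cf. Eq. (7).

    s: density proxies s_1..s_L for one sample, shape (L,)
    w: weights [w_0, w_1, ..., w_L]; w_1..w_L are constrained non-negative
    """
    s = np.asarray(s, dtype=float)
    w = np.asarray(w, dtype=float)
    return w[0] + np.dot(w[1:], s)

def recalibrate(z_L, s, w):
    """Recalibrated probabilities Q_hat = softmax(z_L / S(x, w)), cf. Eq. (6)."""
    return softmax(np.asarray(z_L, dtype=float) / density_aware_scale(s, w))
```

Because the logits are divided by a single positive scalar per sample, their ranking, and hence the predicted class, is unchanged, which is why the method is accuracy-preserving.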
For each feature layer $l$, we compute the density estimate $s_l$ in the neighborhood of the empirical training distribution of the respective test sample $\mathbf{x}$ via the $k$-th nearest neighbor distance: First, we derive the test feature vector $\mathbf{z}_l$ from the trained classifier $f$ given the test input sample $\mathbf{x}$, average it over spatial dimensions, and normalize it. We then use the normalized training feature vectors $Z_{N_{Tr},l} = (\mathbf{z}_{1,l}, \mathbf{z}_{2,l}, \dots, \mathbf{z}_{N_{Tr},l})$, which we gathered from the training dataset $X_{N_{Tr}} = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{N_{Tr}})$, to calculate the Euclidean distance between $\mathbf{z}_l$ and each element in $Z_{N_{Tr},l}$, for each sample $i$ in the training set:

$$d_{i,l} = \| \mathbf{z}_{i,l} - \mathbf{z}_l \| \tag{8}$$

The resulting sequence $D_{N_{Tr},l} = (d_{1,l}, d_{2,l}, \dots, d_{N_{Tr},l})$ is sorted in ascending order. Finally, $s_l$ is given by the $k$-th smallest element (the $k$-th nearest neighbor) in the sequence: $s_l = d_{(k)}$, with $(k)$ indicating the index in the sorted sequence $D_{N_{Tr},l}$. For determining $k$, we follow Sun et al. (2022), who did a thorough analysis and concluded that a proper choice for $k$ is 50 for CIFAR10 and 200 for CIFAR100 (using all training samples), and 10 for ImageNet (using 1% of the training samples).
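The per-layer density proxy $s_l$ reduces to a $k$-th-nearest-neighbor distance over precomputed feature vectors. A minimal sketch, assuming features have already been extracted, spatially averaged, and normalized (the function name is ours):

```python
import numpy as np

def knn_density_proxy(z_test, Z_train, k):
    """Density proxy s_l for one test sample at layer l, cf. Eq. (8).

    z_test: normalized test feature vector at layer l, shape (D,)
    Z_train: normalized training feature vectors at layer l, shape (N_tr, D)
    k: neighborhood size (e.g. 50 for CIFAR10, following Sun et al., 2022)
    """
    # Euclidean distances d_{i,l} to every training feature (Eq. 8).
    d = np.linalg.norm(Z_train - z_test, axis=1)
    # s_l is the k-th smallest distance, i.e. the k-th nearest neighbor.
    return np.partition(d, k - 1)[k - 1]
```

`np.partition` selects the $k$-th smallest distance in linear time, avoiding a full sort of all $N_{Tr}$ distances.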
We fit our method for a trained neural network $f(X) = (y, \mathbf{z}_L)$ by optimizing a squared error loss $L_w$ w.r.t. $w$:

$$L_w = \sum_{c=1}^{C} \left( I_c - \sigma_{SM}(\mathbf{z}_L / S(\mathbf{x}, w))^{(c)} \right)^2 \tag{9}$$

where $I_c$ denotes a binary variable, which is 1 if the respective sample has true class $c$, and 0 otherwise. We accumulate $L_w$ over all samples in the validation set.

The rescaled logits $\hat{\mathbf{z}}_L(\mathbf{x}, w) = \mathbf{z}_L / S(\mathbf{x}, w)$, and consequently the recalibrated probability estimates $\hat{Q}(\mathbf{x}, w)$, can directly be fed to another post-hoc method. Thus, DAC can be applied prior to other existing in-domain post-hoc calibration methods for robustly calibrating models in domain-shift and OOD scenarios (Fig. 1).
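Fitting the weights amounts to minimizing equation (9) accumulated over the validation set, under the non-negativity constraint on $w_1 \dots w_L$. A sketch using SciPy's bound-constrained L-BFGS-B; the optimizer choice and function names are ours, as the paper does not specify them here:

```python
import numpy as np
from scipy.optimize import minimize

def softmax_rows(Z):
    # Row-wise numerically stable softmax.
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def fit_dac_weights(Z_logits, S_feats, y_true, w_init=None):
    """Fit w = [w_0, w_1, ..., w_L] by minimizing the squared loss of Eq. (9)
    accumulated over a held-out validation set.

    Z_logits: validation logits, shape (N, C)
    S_feats:  per-sample density proxies s_1..s_L, shape (N, L)
    y_true:   integer class labels, shape (N,)
    """
    C = Z_logits.shape[1]
    L = S_feats.shape[1]
    onehot = np.eye(C)[y_true]  # I_c for every sample

    def loss(w):
        S = w[0] + S_feats @ w[1:]               # Eq. (7), per sample
        Q = softmax_rows(Z_logits / S[:, None])  # Eq. (6)
        return ((onehot - Q) ** 2).sum()         # Eq. (9), accumulated

    if w_init is None:
        w_init = np.concatenate([[1.0], np.zeros(L)])
    # w_0 > 0 keeps S positive; w_1..w_L are constrained non-negative.
    bounds = [(1e-3, None)] + [(0.0, None)] * L
    res = minimize(loss, w_init, method="L-BFGS-B", bounds=bounds)
    return res.x, res.fun
```

Starting from $w_0 = 1$ and $w_l = 0$ recovers uncalibrated softmax outputs, so the optimizer can only improve on (or match) the uncalibrated validation loss.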
DAC uses KNN (a non-parametric method) to compute a density proxy per layer and combines these proxies linearly across layers, whereas other methods like Parameterized Temperature scaling (Tomani et al., 2022) and models that utilize intra-order-preserving functions for calibration (Rahimi et al., 2020b) are parametric methods using a neural network. Moreover, DAC uses intermediate features from hidden layers and is particularly designed with domain shift and OOD calibration behavior in mind.
Our method has the following advantages:
• Density-aware: Due to distance-based density estimation across feature layers, our method is capable of inferring how close or how far a test sample lies in feature space with respect to the training distribution and can adjust the predictive estimates of the classifier accordingly.
• Domain-agnostic: Since we use KNN, a non-parametric method for density estimation, no distributional assumptions are imposed on the feature space, and our method is therefore applicable to any type of in-domain, domain-shift, or OOD scenario.
• Backbone-agnostic: DAC adapts easily to different underlying classifier architectures (e.g., CNNs, ResNets, and more recent models such as transformers) because, during training, it automatically identifies the feature layers that are informative for uncertainty calibration.
Table 1: Mean expected calibration error across all test domain-shift scenarios. For each model, the macro-averaged ECE (×10²) (with equal-width binning and 15 bins) is computed across all corruptions from severity=0 (in-domain) to severity=5 (heavily corrupted). Post-hoc calibration methods paired with our method are consistently better calibrated than simple post-hoc methods (lower ECE is better).
4 Experimental Setup
Models and Datasets
In our study, we quantify the performance of our proposed method for various model architectures and different datasets. We consider three different datasets for evaluation: CIFAR10/100 (Krizhevsky et al., 2009) and ImageNet-1k (Deng et al., 2009). In particular, for measuring performance on CIFAR10 and CIFAR100, we train ResNet18 (He et al., 2016a), VGG16 (Simonyan & Zisserman, 2014), and DenseNet121 (Huang et al., 2017); for ImageNet, we use 3 pre-trained models, namely ResNet152 (He et al., 2016a), DenseNet169 (Huang et al., 2017), and Xception (Chollet, 2017). We further investigate post-hoc calibration methods applied to new state-of-the-art architectures as well as modern training schemes. To this end, we include the following models in our study, which are all fine-tuned on ImageNet-1k:
• BiT-M (Kolesnikov et al., 2020): a ResNet-based architecture (ResNetV2) (He et al., 2016b) pre-trained on ImageNet-21k.
• ResNeXt-WSL (Mahajan et al., 2018): a ResNeXt-based architecture (ResNeXt101 32x8d) (Xie et al., 2017), pre-trained with weak supervision on billions of hashtagged social media images.
• ViT-B (Dosovitskiy et al., 2020b): a transformer-based architecture pre-trained on ImageNet-21k.
We quantify calibration performance for in-domain, domain-shift, and OOD scenarios. In order to ensure a gradual domain shift in our evaluation pipeline, we use ImageNet-C as well as CIFAR-C (Hendrycks & Dietterich, 2019), which were specifically developed to produce domain shift and have since been incorporated in many related studies. Both datasets have 18 distinct corruption types, each with 5 different levels of severity, mimicking a scenario where the input data to a classifier gradually shifts away from the training distribution. Additionally, we test our models on a real-world OOD dataset, namely ObjectNet-OOD. ObjectNet (Barbu et al., 2019) is a dataset consisting of 50,000 test images with a total of 313 classes, of which 200 classes are out-of-domain with respect to ImageNet. Hence, we make use of these 200 classes for our OOD analysis.
Post-hoc Calibration Methods
We consider the currently best-performing post-hoc calibration methods both for benchmarking and for combining them with DAC: Temperature scaling (TS) (Guo et al., 2017), Ensemble Temperature scaling (ETS) (Zhang et al., 2020), the accuracy-preserving version of isotonic regression (IRM) (Zhang et al., 2020), Intra-order preserving calibration (DIA) (Rahimi et al., 2020b), and Calibration using Splines (SPL) (Gupta et al., 2021). Additionally, we show results for Isotonic Regression (IR) (Zadrozny & Elkan, 2002), Parameterized Temperature scaling (PTS) (Tomani et al., 2022), and Dirichlet calibration (DIR) (Kull et al., 2019) in Appendix D.
Our proposed method DAC does not solely rely on logits for calibration as other post-hoc calibration approaches do; it rather takes various layers at certain positions of the network into account. Even though DAC could use every layer in a classifier due to its weighting scheme, we opt for a much simpler and faster version. That is, we follow a structured approach for choosing layers, e.g., after each ResNet or transformer block. A detailed description of which layers we use can be found in Appendix C.1. In the results, we show that our selective approach produces results similar to taking all layers into account.
Measuring Calibration
Our evaluation is based on various calibration measures. Throughout the paper, we provide results for ECE and Brier scores using equal-width binning with 15 bins. Although ECE is the most commonly used metric for evaluating and comparing post-hoc calibration methods, it bears several limitations. That is why, in Appendix E, we show that our results hold for different kinds of calibration measures, including ECE based on kernel density estimation (ECE-KDE) (Zhang et al., 2020), ECE using equal-mass binning, and class-wise ECE (Kull et al., 2019), and we additionally demonstrate consistency with the likelihood.

[Figure 2 panels: CIFAR10-ResNet18, CIFAR100-VGG16, ImageNet-DenseNet169, ImageNet-BiT-M, ImageNet-ResNeXt-WSL, ImageNet-ViT-B]
Figure 2: Expected calibration error (×10²) of post-hoc methods with and without our method DAC for different model and dataset combinations. Line plots: macro-averaged ECE across all corruption types, shown for each corruption severity from in-domain to severity=5 (heavily corrupted). Bar plots: macro-averaged ECE across all corruption types as well as across all severities. Our model captures domain-shift scenarios reliably and thus increases calibration (i.e., decreases ECE) across the whole spectrum of corruptions.
Table 2: Difference in expected calibration error (×10²) of post-hoc calibration methods with and without our method DAC. InD: in-domain; Sev. 5: heavily corrupted (severity of 5); All: macro-averaged ECE across all corruptions from severity=0 (in-domain) to severity=5 (heavily corrupted). For All, we additionally report the ratio to indicate the overall performance gain. Post-hoc calibration methods combined with DAC consistently improve calibration under heavy corruption as well as overall (relative improvement of around 5-40%), while preserving in-domain performance (negative deltas are better).
a.) CIFAR10 ResNet18

| Method | InD | Sev. 5 | All | Ratio |
|---|---|---|---|---|
| TS | -0.42 | -1.14 | -0.46 | 10% |
| ETS | -0.49 | -1.25 | -0.56 | 13% |
| IRM | -0.36 | -1.67 | -0.92 | 18% |
| DIA | -1.09 | -2.91 | -1.67 | 29% |
| SPL | -1.02 | -2.00 | -1.45 | 26% |

b.) CIFAR100 VGG16

| Method | InD | Sev. 5 | All | Ratio |
|---|---|---|---|---|
| TS | -0.22 | -8.05 | -4.29 | 42% |
| ETS | -0.22 | -8.06 | -4.29 | 42% |
| IRM | -0.85 | -7.15 | -4.46 | 38% |
| DIA | -1.87 | -6.20 | -3.97 | 30% |
| SPL | -0.27 | -4.77 | -2.83 | 30% |

c.) ImageNet DenseNet169

| Method | InD | Sev. 5 | All | Ratio |
|---|---|---|---|---|
| TS | +0.08 | -3.16 | -1.49 | 26% |
| ETS | +0.03 | -4.61 | -2.08 | 38% |
| IRM | -0.48 | -4.25 | -2.47 | 38% |
| DIA | -0.09 | -3.49 | -1.81 | 25% |
| SPL | -0.15 | -4.23 | -2.15 | 35% |

d.) ImageNet BiT-M

| Method | InD | Sev. 5 | All | Ratio |
|---|---|---|---|---|
| TS | +0.47 | -6.40 | -2.26 | 36% |
| ETS | +0.28 | -6.23 | -2.13 | 38% |
| IRM | -0.34 | -5.20 | -2.34 | 38% |
| DIA | -0.69 | -3.42 | -1.75 | 27% |
| SPL | -0.01 | -6.11 | -2.41 | 42% |

e.) ImageNet ResNeXt-WSL

| Method | InD | Sev. 5 | All | Ratio |
|---|---|---|---|---|
| TS | +0.09 | -1.95 | -0.58 | 8% |
| ETS | -2.37 | -3.24 | -2.39 | 32% |
| IRM | -0.57 | -4.82 | -2.02 | 27% |
| DIA | +0.30 | -4.10 | -1.64 | 23% |
| SPL | -0.25 | -5.39 | -1.94 | 36% |

f.) ImageNet ViT-B

| Method | InD | Sev. 5 | All | Ratio |
|---|---|---|---|---|
| TS | -0.08 | -0.74 | -0.38 | 10% |
| ETS | -0.03 | -0.72 | -0.32 | 10% |
| IRM | -0.08 | -0.59 | -0.23 | 6% |
| DIA | -0.00 | -0.62 | -0.28 | 6% |
| SPL | +0.02 | -0.74 | -0.30 | 9% |
5 Results
First, we show that combining our method DAC with existing post-hoc calibration methods increases calibration performance across the entire spectrum, from in-domain to heavily corrupted data distributions, for different datasets and various model architectures, including transformers. Secondly, we show the calibration performance of our method on purely OOD scenarios. Lastly, we conduct additional experiments, such as a layer importance analysis and a data efficiency analysis.
5.1 DAC Boosts Calibration Performance Beyond In-Domain Scenarios
We begin by systematically assessing whether the performance of state-of-the-art post-hoc calibration methods can be improved when they are extended by our proposed DAC method. In particular, we are interested in scenarios under domain shift. To this end, we show calibration performance on CIFAR-C and ImageNet-C for severity levels from 1 to 5 and additionally provide results for in-domain scenarios (severity=0). Fig. 2 compares stand-alone post-hoc methods with methods combined with DAC for various classifiers. The line charts reveal that DAC consistently improves the calibration performance of post-hoc methods in domain-shift cases. Tab. 2 underpins this performance increase further by revealing a substantial decrease in absolute ECE for heavily corrupted data (severity=5) when DAC is used. When optimizing for domain-shift performance, a sharp decline in in-domain performance is generally observed for other existing methods (Tomani et al., 2021). This is, however, not the case for our method: we even observe slight improvements in in-domain ECE for many classifier and post-hoc method configurations, and in the few cases where in-domain ECE marginally increases (on the order of 10⁻⁴ to 10⁻³), the ECE in heavily corrupted scenarios decreases by far more than an order of magnitude in comparison. Finally, to measure the overall improvement of DAC across the entire spectrum of corruption from severity 0 to 5, we calculate the macro-averaged ECE across all corruption types and severity levels for each method. We find a consistent improvement for all post-hoc methods combined with DAC, as opposed to stand-alone methods, visualized in the bar charts in Fig. 2 and in Tab. 2.
Moreover, Tab. 1 further reveals that DAC consistently boosts calibration performance across diverse architectures and datasets. In Tab. 3, we demonstrate that improved calibration performance is also consistent with better Brier scores.
Table 3: Mean Brier-score computed across all corruptions from severity=0 (in-domain) until severity=5 (heavily corrupted). Note that since SPL only calibrates the highest predicted confidence, it is not directly possible to evaluate Brier scores.
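For reference, the multi-class Brier score reported here is simply the mean squared error between the predicted probability vector and the one-hot encoding of the true label; a minimal sketch (the function name is ours):

```python
import numpy as np

def brier_score(probs, labels):
    """Multi-class Brier score: mean squared error between predicted
    probability vectors (N, C) and one-hot encodings of the true labels."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[labels]   # (N, C) one-hot matrix
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))
```

Unlike ECE, the Brier score requires the full probability vector, which is why it cannot be evaluated for SPL, which only calibrates the top-class confidence.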
Since research on state-of-the-art post-hoc methods applied to recent large-scale neural networks is still lacking, we attempt to bridge this gap and show results for modern ResNet as well as transformer architectures trained on large corpora of data. We make the same observation as Minderer et al. (2021): even modern ResNet architectures are not well calibrated in domain-shift settings despite being pre-trained on huge amounts of data, whereas transformer architectures perform particularly well. In our experiments, we observe that modern ResNet architectures (BiT-M and ResNeXt) can indeed be further calibrated with post-hoc methods. In particular, SPL+DAC performs best and reduces the ECE by around 37% for ResNeXt-WSL and 42% for BiT-M compared to the best-performing standard post-hoc method (Tab. 1). ViT-B, on the other hand, can benefit from existing post-hoc methods too, at least in-domain. In domain-shift scenarios, ViT outperforms all standard post-hoc methods unless they are combined with DAC; in that case, ETS+DAC outperforms existing methods by 12%.
5.2 Calibration in OOD Scenarios
To complement the previous distributional-shift experiments, we conduct additional experiments for the out-of-domain case, incorporating data samples with completely different classes w.r.t. the training data. Ideally, a well-calibrated, uncertainty-aware model would produce high-confidence predictions for in-domain data and low-confidence ones for the OOD case, allowing for the detection of OOD data samples. Based on this idea, several metrics have been proposed in the OOD literature (Hendrycks & Gimpel, 2017; Liang et al., 2018; Lee et al., 2018) to quantify model performance in OOD scenarios, including FPR at 95% TPR, detection error, AUROC, and AUPR-In/AUPR-Out, which we employ in our experiments.
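These threshold-based metrics are all derived from the same ranking of top-class confidences. A minimal sketch of two of them, AUROC and FPR at 95% TPR, treating in-domain samples as positives (a simplified estimator that does not handle ties in the scores; the function name is ours):

```python
import numpy as np

def ood_metrics(conf_in, conf_ood):
    """AUROC and FPR@95%TPR from top-class confidences, treating in-domain
    samples as positives: higher confidence should mean 'more in-domain'."""
    scores = np.concatenate([conf_in, conf_ood])
    labels = np.concatenate([np.ones(len(conf_in)), np.zeros(len(conf_ood))])
    order = np.argsort(-scores)                    # descending confidence
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()         # true-positive rate curve
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()
    # trapezoidal area under the ROC curve
    auroc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    fpr_at_95_tpr = float(fpr[np.searchsorted(tpr, 0.95)])
    return auroc, fpr_at_95_tpr
```

A perfectly separating confidence score gives AUROC = 1.0 and FPR@95%TPR = 0.0; a random score gives AUROC ≈ 0.5.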
To examine DAC in the OOD scenario, we set up the OOD dataset with in-domain data from ImageNet-1k and OOD data from ObjectNet-OOD (cf. Section 4 for dataset descriptions). Top-class confidence predictions are produced by models trained and calibrated on ImageNet-1k with various calibration methods, with and without the proposed DAC method. The OOD metrics are computed from the confidence predictions. We summarize the results for the DenseNet169 backbone in Tab. 4. Additional results for other backbones can be found in Appendix H.
Table 4: OOD performance with DenseNet169 trained on ImageNet-1k and using ImageNet-1k/ObjectNet-OOD as in-domain/OOD test sets, respectively. We observe that DAC consistently improves all the OOD metrics for all baseline methods.
In general, we see that DAC yields more robust calibration in the OOD scenario, as demonstrated by its consistent improvement of OOD results.
5.3 Layer Importance for Calibration
Next, we investigate which layers of each classifier carry valuable information for DAC to yield calibrated predictions. For each layer, DAC learns a weight based on the importance of the respective layer (Equation (7)). In Fig. 3, we demonstrate that DAC focuses on a few important layers, yet the logits layer is never one of them. This is particularly interesting because current state-of-the-art post-hoc calibration methods focus only on the logits vector for recalibration, without considering hidden layers of the classifier at all. Hence, we conclude that one reason for DAC's performance improvement is its ability to take information from layers other than the logits layer into account.
Even though the layers DAC has access to are well distributed throughout the architecture of the classifier, we want to investigate whether DAC can capture all the necessary information present in all the layers of the classifier. To this end, we compare our simple and fast DAC method, which uses a subset of the layers, to a holistic DAC, which utilizes all layers. In Fig. 4, we illustrate the weights DAC assigns to every layer, normalized to add up to 1. The holistic DAC is able to attend to various layers; however, we observe that this does not necessarily result in better calibration performance, which can be attributed to overfitting (see Appendix F for further insights).
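Equation (7) is not reproduced in this section, but the weighting idea can be illustrated schematically: per-layer KNN distances are combined using weights normalized to sum to 1, and the resulting sample-wise scalar rescales the logits, which preserves the argmax and hence accuracy. The function names and the linear map from score to temperature below are our own illustrative assumptions, not the exact parameterization of DAC:

```python
import numpy as np

def combined_density_score(knn_dists, raw_weights):
    """Combine per-layer KNN distances (N samples x L layers) into one scalar
    per sample, with layer weights normalized to sum to 1 via a softmax."""
    rw = np.asarray(raw_weights, dtype=float)
    w = np.exp(rw - rw.max())
    w /= w.sum()                                   # normalized layer importance
    return np.asarray(knn_dists, dtype=float) @ w

def scale_logits(logits, score, a=1.0, b=1.0):
    """Accuracy-preserving rescaling: dividing logits by a positive sample-wise
    scalar never changes the argmax. The map t = a*score + b is a hypothetical
    placeholder for the learned parameterization."""
    t = a * np.asarray(score) + b
    return np.asarray(logits) / t[:, None]
```

Samples far from the training density (large KNN distances) receive a larger effective temperature and hence softer, less confident predictions, which is the desired behavior under domain shift.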
Figure 3: Importance of classifier layers found by DAC. The size of blobs indicates the magnitude of assigned weights for each layer after training of DAC from left (input) to right (logits).
Figure 4: Comparison between our DAC with "selected layers" and a holistic DAC, which utilizes "all layers" present in ResNet18 trained on CIFAR10. DAC is able to capture the most relevant areas in layer space with important information for calibration.
5.4 Sensitivity Analysis of KNN
In this section, we evaluate the sensitivity of DAC to the hyperparameter k used for the KNN operations in each layer. To this end, we study the resulting calibration performance while varying k. Figure 5 illustrates the impact of k on the calibration performance of DAC when combined with ETS and SPL, with k ∈ {1, 10, 50, 100, 200}. For each model, the macro-averaged ECE (×10²) is computed across all corruptions from severity=0 (in-domain) to severity=5 (heavily corrupted). We show that DAC consistently boosts the performance of ETS and SPL regardless of the choice of k, indicating that the performance of DAC is not overly sensitive to its value. On the other hand, we observe that the choice of k suggested by Sun et al. (2022) indeed results in the best calibration performance for our method as well, implying that additional effort spent on hyperparameter tuning can further enhance the performance of DAC.
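The per-layer KNN score follows Sun et al. (2022): the distance to the k-th nearest training feature, computed on L2-normalized features. A brute-force sketch (the function name is ours; at ImageNet scale an approximate index such as FAISS would be used instead of the dense distance matrix):

```python
import numpy as np

def knn_distance(train_feats, test_feats, k=50):
    """Per-layer density score: distance to the k-th nearest training feature
    on L2-normalized features. Larger distance = lower estimated density."""
    def l2_normalize(x):
        x = np.asarray(x, dtype=float)
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    tr, te = l2_normalize(train_feats), l2_normalize(test_feats)
    # dense pairwise squared Euclidean distances: (N_test, N_train)
    d2 = ((te[:, None, :] - tr[None, :, :]) ** 2).sum(-1)
    # k-th smallest distance per test sample (np.partition avoids a full sort)
    return np.sqrt(np.partition(d2, k - 1, axis=1)[:, k - 1])
```

Because the features are normalized to the unit sphere, this distance is monotonically related to cosine similarity, so the score only depends on feature direction, not magnitude.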
Figure 5: The sensitivity of DAC to the hyperparameter k used for KNN operations in each layer. We combined DAC with ETS (1st row) and SPL (2nd row), varying the hyperparameter k.
5.5 Data Efficiency of DAC
Lastly, we investigate how sensitive our method is to the validation set size. We focus on the best-performing methods, ETS+DAC and SPL+DAC, and compare them to the respective stand-alone methods, ETS and SPL. Additionally, we include TS in our study, since this method is least likely to suffer from overfitting, as it comprises only one trainable parameter. Fig. 6 shows that, regardless of the validation set size, combinations with DAC perform better than the methods without DAC. Additionally, DAC does not overfit for small validation set sizes.
Figure 6: DAC is robust across different validation set sizes (10%-100%) for DenseNet169 trained on ImageNet (ECE ×10²). We conducted five experiments with randomly sampled validation sets.
6 Conclusion
In this work, we have introduced an accuracy-preserving, density-aware calibration method that can readily be combined with SOTA post-hoc methods in order to boost domain-shift and OOD calibration performance. We found that our proposed method DAC, combined with existing post-hoc calibration methods, yields robust predictive uncertainty estimates for any level of domain shift, from in-domain to truly OOD scenarios. In particular, ETS+DAC as well as SPL+DAC performed best. We further demonstrated that hidden layers in classifiers carry valuable information for accurately predicting uncertainty estimates. Lastly, we showed that even recently developed large-scale models pre-trained on vast amounts of data can be calibrated effectively by DAC, opening up new research directions within the field of post-hoc calibration for entirely new applications. One limitation of our method can arise when applying it to highly parametric classifiers with numerous layers, as well as when determining which layers possess the most calibration-related information for DAC. The amount of calibration-related information within specific layers of a classifier seems to depend not only on the model architecture but also on the relationship between model size and dataset characteristics. We hope our findings will encourage further research into developing post-hoc methods that take into account features from the underlying neural network classifier, rather than just the output features.
References
- Ashukha et al. (2020) Ashukha, A., Lyzhov, A., Molchanov, D., and Vetrov, D. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. arXiv preprint arXiv:2002.06470, 2020.
- Barbu et al. (2019) Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., Tenenbaum, J., and Katz, B. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in neural information processing systems, 32, 2019.
- Chen & Koltun (2017) Chen, Q. and Koltun, V. Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE international conference on computer vision, pp. 1511–1520, 2017.
- Chollet (2017) Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258, 2017.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE, 2009.
- Dosovitskiy et al. (2020a) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020a.
- Dosovitskiy et al. (2020b) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020b.
- Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International conference on machine learning, pp. 1050–1059, 2016.
- Gatys et al. (2016) Gatys, L.A., Ecker, A.S., and Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423, 2016.
- Gong et al. (2021) Gong, Y., Lin, X., Yao, Y., Dietterich, T.G., Divakaran, A., and Gervasio, M. Confidence calibration for domain generalization under covariate shift. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8958–8967, 2021.
- Guo et al. (2017) Guo, C., Pleiss, G., Sun, Y., and Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. JMLR.org, 2017.
- Gupta et al. (2021) Gupta, K., Rahimi, A., Ajanthan, T., Mensink, T., Sminchisescu, C., and Hartley, R. Calibration of neural networks using splines. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=eQe8DEWNN2W.
- He et al. (2016a) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
- He et al. (2016b) He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.
- Hendrycks & Dietterich (2019) Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. International Conference on Learning Representations 2019, 2019.
- Hendrycks & Gimpel (2017) Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.
- Hendrycks et al. (2019) Hendrycks, D., Mazeika, M., Kadavath, S., and Song, D. Using self-supervised learning can improve model robustness and uncertainty. Advances in neural information processing systems, 32, 2019.
- Huang et al. (2017) Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
- Johnson et al. (2019) Johnson, J., Douze, M., and Jégou, H. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
- Kolesnikov et al. (2020) Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Big transfer (bit): General visual representation learning. In European conference on computer vision, pp. 491–507. Springer, 2020.
- Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
- Kull et al. (2019) Kull, M., Perello Nieto, M., Kängsepp, M., Silva Filho, T., Song, H., and Flach, P. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. Advances in neural information processing systems, 32, 2019.
- Lakshminarayanan et al. (2017) Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
- Lee et al. (2018) Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, 2018.
- Liang et al. (2018) Liang, S., Li, Y., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR, 2018.
- Maddox et al. (2019) Maddox, W.J., Izmailov, P., Garipov, T., Vetrov, D.P., and Wilson, A.G. A simple baseline for bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems, 32, 2019.
- Mahajan et al. (2018) Mahajan, D.K., Girshick, R.B., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
- Milios et al. (2018) Milios, D., Camoriano, R., Michiardi, P., Rosasco, L., and Filippone, M. Dirichlet-based gaussian processes for large-scale calibrated classification. Advances in Neural Information Processing Systems, 31, 2018.
- Minderer et al. (2021) Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., Tran, D., and Lucic, M. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34:15682–15694, 2021.
- Müller et al. (2019) Müller, R., Kornblith, S., and Hinton, G.E. When does label smoothing help? Advances in neural information processing systems, 32, 2019.
- Naeini et al. (2015) Naeini, M.P., Cooper, G., and Hauskrecht, M. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- Ovadia et al. (2019) Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., and Snoek, J. Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pp. 13991–14002, 2019.
- Platt (1999) Platt, J.C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in large margin classifiers, pp. 61–74. MIT Press, 1999.
- Rahimi et al. (2020a) Rahimi, A., Gupta, K., Ajanthan, T., Mensink, T., Sminchisescu, C., and Hartley, R. Post-hoc calibration of neural networks. arXiv preprint arXiv:2006.12807, 2020a.
- Rahimi et al. (2020b) Rahimi, A., Shaban, A., Cheng, C.-A., Hartley, R., and Boots, B. Intra order-preserving functions for calibration of multi-class neural networks. Advances in Neural Information Processing Systems, 33:13456–13467, 2020b.
- Sensoy et al. (2018) Sensoy, M., Kaplan, L., and Kandemir, M. Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems, pp. 3179–3189, 2018.
- Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Sun et al. (2022) Sun, Y., Ming, Y., Zhu, X., and Li, Y. Out-of-distribution detection with deep nearest neighbors. arXiv preprint arXiv:2204.06507, 2022.
- Thulasidasan et al. (2019) Thulasidasan, S., Chennupati, G., Bilmes, J.A., Bhattacharya, T., and Michalak, S. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- Tomani & Buettner (2021) Tomani, C. and Buettner, F. Towards trustworthy predictions from deep neural networks with fast adversarial calibration. In Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021.
- Tomani et al. (2021) Tomani, C., Gruber, S., Erdem, M.E., Cremers, D., and Buettner, F. Post-hoc uncertainty calibration for domain drift scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10124–10132, 2021.
- Tomani et al. (2022) Tomani, C., Cremers, D., and Buettner, F. Parameterized temperature scaling for boosting the expressive power in post-hoc uncertainty calibration. In European Conference on Computer Vision, pp. 555–569. Springer, 2022.
- Wald et al. (2021) Wald, Y., Feder, A., Greenfeld, D., and Shalit, U. On calibration and out-of-domain generalization. Advances in neural information processing systems, 34:2215–2227, 2021.
- Wang et al. (2021) Wang, D.-B., Feng, L., and Zhang, M.-L. Rethinking calibration of deep neural networks: Do not be afraid of overconfidence. Advances in Neural Information Processing Systems, 34:11809–11820, 2021.
- Wang et al. (2020) Wang, X., Long, M., Wang, J., and Jordan, M. Transferable calibration with lower bias and variance in domain adaptation. Advances in Neural Information Processing Systems, 33:19212–19223, 2020.
- Wen et al. (2018) Wen, Y., Vicol, P., Ba, J., Tran, D., and Grosse, R. Flipout: Efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386, 2018.
- Wenger et al. (2020) Wenger, J., Kjellström, H., and Triebel, R. Non-parametric calibration for classification. In International Conference on Artificial Intelligence and Statistics, pp. 178–190. PMLR, 2020.
- Xie et al. (2017) Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500, 2017.
- Yu et al. (2022) Yu, Y., Bates, S., Ma, Y., and Jordan, M.I. Robust calibration with multi-domain temperature scaling. arXiv preprint arXiv:2206.02757, 2022.
- Zadrozny & Elkan (2001) Zadrozny, B. and Elkan, C. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Icml, volume 1, pp. 609–616. Citeseer, 2001.
- Zadrozny & Elkan (2002) Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699, 2002.
- Zhang et al. (2017) Zhang, H., Cisse, M., Dauphin, Y.N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
- Zhang et al. (2020) Zhang, J., Kailkhura, B., and Han, T. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning (ICML), 2020.
- Zhang et al. (2018) Zhang, R., Isola, P., Efros, A.A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
Appendix A Datasets
We follow the standard setup for evaluating calibration performance (Guo et al., 2017). We split each dataset into train, validation, and test set. The train set is used to train classifier neural networks. The validation set is used to optimize the post-hoc calibration methods to recalibrate classifiers. Then, the calibration performances are evaluated on the test set.
In Tab. 5, we show the number of image-label pairs in each split. For CIFAR10 and CIFAR100 (Krizhevsky et al., 2009), following Guo et al. (2017), we split the original train set of 50,000 image-label pairs into a train set of 45,000 pairs and a validation set of 5,000 pairs. For ImageNet (Deng et al., 2009), following common practice (Minderer et al., 2021), we split the official validation set of 50,000 image-label pairs into a validation set of 12,500 pairs, used to train our post-hoc methods, and a test set of 37,500 pairs.
Table 5: The numbers of image-label pairs we used for each dataset.
To quantify calibration performance for domain-shift scenarios, we use ImageNet-C as well as CIFAR-C (Hendrycks & Dietterich, 2019). Both datasets have 18 distinct corruption types, each having 5 different levels of severity, mimicking a scenario where the input data to a classifier gradually shifts away from the training distribution. The 18 corruptions include 4 noise corruptions (gaussian noise, shot noise, speckle noise, and impulse noise), 4 blur corruptions (defocus blur, gaussian blur, motion blur, and zoom blur), 5 weather corruptions (snow, fog, brightness, spatter, and frost), and 5 digital corruptions (elastic transform, pixelate, JPEG compression, contrast, and saturate).
Appendix B Classifiers
In this section we describe the implementation of the classifiers we used in our work.
- CIFAR10:
  - ResNet18 (He et al., 2016a)/VGG16 (Simonyan & Zisserman, 2014)/DenseNet121 (Huang et al., 2017): We use PyTorch's official implementation to obtain the model architectures. We trained all models for 200 epochs: 100 epochs at a learning rate of 0.01, 50 epochs at 0.005, 30 epochs at 0.001, and 20 epochs at 0.0001. We use basic data augmentation consisting of random cropping and horizontal flipping.
- CIFAR100:
  - ResNet18 (He et al., 2016a)/VGG16 (Simonyan & Zisserman, 2014)/DenseNet121 (Huang et al., 2017): We obtain architectures from a GitHub repository (https://github.com/weiaicunzai/pytorch-cifar100) that provides PyTorch implementations of architectures optimized for the CIFAR100 dataset. We trained all models for 200 epochs, with an initial learning rate of 0.01 that is decayed by a factor of 0.2 at the 60th, 120th, and 160th epochs. We use basic data augmentation consisting of random cropping and horizontal flipping.
- ImageNet-1k: We use pre-trained models.
  - ResNet152 (He et al., 2016a)/DenseNet169 (Huang et al., 2017): We use the ImageNet-1k pre-trained models from the torchvision library.
  - Xception (Chollet, 2017): We use the ImageNet-1k pre-trained model provided by the timm library (https://github.com/rwightman/pytorch-image-models).
  - BiT-M (Kolesnikov et al., 2020): We use the ImageNet-21k pre-trained and ImageNet-1k fine-tuned model provided by the timm library. Specifically, we use the BiT-M based on the ResNetV2 101x1 architecture.
  - ResNeXt-WSL (Mahajan et al., 2018): We use the Instagram pre-trained and ImageNet-1k fine-tuned model provided by Meta's research group (https://github.com/facebookresearch/WSL-Images). Specifically, we use a model with the ResNeXt101 32x8d architecture.
  - ViT-Base (Dosovitskiy et al., 2020b): We use the ImageNet-21k pre-trained and ImageNet-1k fine-tuned model provided by the timm library. Specifically, we use the ViT-Base variant that expects an input image size of 224 and uses a patch size of 16.
Appendix C Post-hoc methods
C.1 Density-Aware Calibration (DAC)
Table 6: Layers from classifiers used for DAC. For neural networks with block structures, we pick the very last layer of each block (denoted Block-i for the i-th block) and, additionally, the layer immediately before the first block (denoted Pre-block), the layer just before the fully-connected layers at the end of the network (denoted Penultimate), and the logits layer (denoted Logits). For VGG, we use the feature vectors after every max-pooling layer where the resolution of the feature map changes (denoted Maxpool-i).
Even though DAC is capable of using every layer in a classifier due to its weighting scheme, we opt for a much simpler and faster version that uses a subset of layers. We follow a structured approach for choosing layers so as to end up with a well-distributed subset: we choose (1) the last layer of each "block" in a neural network, e.g., a ResNet or Transformer block, or (2) the layer where the resolution or channel size of the feature map changes, e.g., for VGG. The intuition behind this approach is that layers in neural networks represent an image at increasing levels of abstraction, from low-level to high-level representations, and thus different blocks, or layers at different resolutions, are expected to capture different levels of representation. Choosing a subset of layers to obtain different levels of representation has been applied in a wide range of computer vision works (Zhang et al., 2018; Gatys et al., 2016; Chen & Koltun, 2017): Zhang et al. (2018) leverage different levels of representation for better measurement of the perceptual difference between two images; Gatys et al. (2016) leverage them to separate image content from style for the image style transfer task; Chen & Koltun (2017) applied the strategy to the image synthesis task.
Based on the above idea, we select layers for each classifier as described in Tab. 6. For neural networks with block structures, we pick the very last layer of each block (i.e., Block-i for the i-th block) and, additionally, the layer immediately before the first block (i.e., Pre-block), e.g., the first max-pooling layer in ResNet, the layer just before the fully-connected layers at the end of the network (i.e., Penultimate), and the logits layer (i.e., Logits). For VGG, which does not have a block structure, we choose layers at 5 different resolutions, similar to Gatys et al. (2016) and Chen & Koltun (2017): specifically, we use the feature vectors after every max-pooling layer where the resolution of the feature map changes (i.e., Maxpool-i).
Note that the feature vector extracted from each layer is always converted to a one-dimensional vector by a pooling operation. For convolutional neural networks, we apply spatial avg-pooling to each layer output. For Transformer architectures, we apply avg-pooling over the token dimension.
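The pooling step above can be sketched in a few lines. This is an illustrative NumPy stand-in (the function name and shape conventions are our own choices, not taken from the released code):

```python
import numpy as np

def pool_feature(feat: np.ndarray) -> np.ndarray:
    """Collapse a layer output to a single 1-D descriptor.

    Conv feature maps of shape (C, H, W) are spatially average-pooled
    to a C-dim vector; Transformer outputs of shape (T, D) are
    average-pooled over the token dimension to a D-dim vector.
    """
    if feat.ndim == 3:   # convolutional map: (channels, height, width)
        return feat.mean(axis=(1, 2))
    if feat.ndim == 2:   # token sequence: (tokens, embed_dim)
        return feat.mean(axis=0)
    return feat          # already 1-D (e.g., the logits layer)
```

Either way, each selected layer contributes one fixed-length vector per sample, which is what the KNN search downstream operates on.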
C.2 Baseline Post-hoc Methods
Here, we describe the details of the implementation of the baseline post-hoc methods.
• Temperature scaling (TS) (Guo et al., 2017): We use the implementation provided by Zhang et al. (2020): https://github.com/zhang64-llnl/Mix-n-Match-Calibration.
• Ensemble Temperature scaling (ETS) (Zhang et al., 2020): Ensemble version of TS with 4 parameters. We use the official implementation from the same GitHub repository as TS.
• Isotonic Regression for multi-class (IR) (Zhang et al., 2020): Extends Isotonic Regression to the multi-class setting by decomposing the problem into one-versus-all problems. We use the official implementation from the same GitHub repository as TS.
• Accuracy-preserving version of Isotonic Regression (IRM) (Zhang et al., 2020): We use the official implementation from the same GitHub repository as TS.
• Parameterized Temperature scaling (PTS) (Tomani et al., 2022): Sample-wise version of TS. Following their paper, PTS was trained as a neural network with 2 fully-connected hidden layers of 5 nodes each, using a learning rate of 0.00005, a batch size of 1000, and 100,000 training steps. The top 10 most confident predictions were used as input.
• Dirichlet calibration (DIR) (Kull et al., 2019): Matrix scaling with off-diagonal and intercept regularization. Among the variants, we use MS-ODIR, which is intended for calibrating logits rather than probabilities and is reported as the best-performing variant by the authors. We use the official implementation: https://github.com/dirichletcal/experiments_dnn. However, we encountered some difficulties and could not adapt the code to run the ImageNet experiments.
• Intra-order preserving calibration (DIA) (Rahimi et al., 2020b): Among the variants, we use DIAG (diagonal intra-order-preserving), which works best on average across datasets and classifiers. We use the official implementation: https://github.com/AmirooR/IntraOrderPreservingCalibration. DIA can be run with or without hyperparameter optimization; for a fair comparison with our other baselines, including DAC, which use fixed hyperparameters, we report results without hyperparameter optimization.
• Calibration using Splines (SPL) (Gupta et al., 2021): We use the official implementation: https://github.com/kartikgupta-at-anu/spline-calibration. Following their paper, we use natural cubic spline fitting with 6 knots in all our experiments.
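For reference, the simplest of the baselines above, temperature scaling, fits a single scalar T by minimizing the negative log-likelihood on a held-out validation set. The sketch below is our own minimal NumPy/SciPy reconstruction, not the code from the repository linked above:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, bounds=(0.05, 10.0)):
    """Fit the single TS parameter T on held-out (logits, labels)
    by minimizing the NLL of softmax(logits / T)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)

    def nll(t):
        probs = softmax(logits / t)
        return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

    return minimize_scalar(nll, bounds=bounds, method="bounded").x
```

At test time, predictions are calibrated as softmax(logits / T); since T rescales all logits jointly, the arg-max, and hence the accuracy, is unchanged, which is why TS (like DAC) is accuracy-preserving.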
C.3 Training and Evaluation of DAC
Our post-hoc calibration method DAC requires no GPU for either the training or the inference phase. For computing s_l via the k-nearest-neighbor method (Equation (7)), we use the Faiss library (Johnson et al., 2019) for efficient and fast similarity search, which can run on a CPU or a GPU: https://github.com/facebookresearch/faiss. To minimize the loss function described in Equation (9), we use SciPy optimization on a CPU.
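The per-layer KNN score can be sketched with a brute-force nearest-neighbor search. In practice the paper uses a Faiss index for speed; the function below is an illustrative NumPy stand-in, and the name and default `k` are our own choices:

```python
import numpy as np

def knn_distance(train_feats, test_feats, k=1):
    """For each test feature vector, the Euclidean distance to its k-th
    nearest training feature, i.e., one density score s_l per layer l.
    Brute-force NumPy stand-in for the Faiss index used in the paper."""
    train_feats = np.asarray(train_feats, dtype=float)
    test_feats = np.asarray(test_feats, dtype=float)
    # pairwise squared distances: |a-b|^2 = |a|^2 + |b|^2 - 2 a.b
    d2 = (
        (test_feats ** 2).sum(1, keepdims=True)
        + (train_feats ** 2).sum(1)
        - 2.0 * test_feats @ train_feats.T
    )
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    return np.partition(np.sqrt(d2), k - 1, axis=1)[:, k - 1]
```

A Faiss `IndexFlatL2` computes the same exact distances but scales to ImageNet-sized feature sets on CPU or GPU.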
Appendix D Results for Additional Baseline-Calibration Methods
In addition to the state-of-the-art post-hoc calibration methods compared in the main text of the paper, we evaluate another 3 commonly used methods in this section and combine them with our proposed DAC method. Across all methods, classifiers, and datasets, we see consistent improvements when post-hoc methods are combined with DAC in terms of macro-averaged ECE (calculated across all corruptions and levels of severity), as shown in Tab.7. Moreover, in Tab.8 we show the additional gain for each model, and in Fig.7 we demonstrate how ECE behaves across different severity levels of corruption.
Table 7: ECE (×10²) for additional post-hoc calibration methods: Mean expected calibration error across all test domain-shift scenarios. For each model, the macro-averaged ECE (with equal-width binning and 15 bins) is computed across all corruptions from severity=0 (in-domain) to severity=5 (heavily corrupted). For these additional post-hoc calibration methods as well, results are consistently better calibrated when paired with our method. (lower ECE is better)
Table 8: Deltas for additional post-hoc methods: Difference in expected calibration error (×10²) of post-hoc calibration methods with and without our method DAC. InD: in-domain; Sev. 5: heavily corrupted (severity of 5); All: macro-averaged ECE across all corruptions from severity=0 (in-domain) to severity=5 (heavily corrupted). For All, we additionally report the ratio to indicate the overall performance gain. (negative deltas are better)
a) CIFAR10 ResNet18

| Method | InD | Sev. 5 | All |
| --- | --- | --- | --- |
| PTS | -0.37 | -0.94 | -0.43 (9.8%) |
| IR | -0.17 | -1.44 | -0.74 (13.4%) |
| DIR | -0.56 | -1.44 | -0.72 (14.8%) |

b) CIFAR100 VGG16

| Method | InD | Sev. 5 | All |
| --- | --- | --- | --- |
| PTS | +0.45 | -3.83 | -1.77 (21.0%) |
| IR | -1.58 | -6.45 | -4.30 (31.5%) |
| DIR | -1.77 | -9.22 | -5.80 (42.3%) |

c) ImageNet DenseNet169

| Method | InD | Sev. 5 | All |
| --- | --- | --- | --- |
| PTS | -0.14 | -1.79 | -0.82 (12.0%) |
| IR | -0.97 | -2.66 | -1.93 (14.1%) |

d) ImageNet BiT-M

| Method | InD | Sev. 5 | All |
| --- | --- | --- | --- |
| PTS | +0.06 | -1.87 | -0.84 (15.1%) |
| IR | -0.73 | -4.24 | -2.41 (20.2%) |

e) ImageNet ResNeXt-WSL

| Method | InD | Sev. 5 | All |
| --- | --- | --- | --- |
| PTS | +0.15 | -2.40 | -0.76 (12.4%) |
| IR | -1.37 | -4.84 | -2.97 (25.8%) |

f) ImageNet ViT-B

| Method | InD | Sev. 5 | All |
| --- | --- | --- | --- |
| PTS | +0.57 | -4.02 | -1.09 (24.2%) |
| IR | +0.08 | -0.64 | -0.29 (2.7%) |

[Figure 7 panels: CIFAR10-ResNet18, CIFAR100-VGG16, ImageNet-DenseNet169, ImageNet-BiT-M, ImageNet-ResNeXt-WSL, ImageNet-ViT-B]
Figure 7: ECE for additional post-hoc methods: Expected calibration error (×10²) of post-hoc methods with and without our method DAC for different model and dataset combinations. Line plots: Macro-averaged ECE across all corruption types, shown for each corruption severity from in-domain to OOD. Bar plots: Macro-averaged ECE across all corruption types as well as across all severities. (lower ECE is better)
Appendix E Results for Additional Calibration Measures
Even though ECE is the most commonly used metric for evaluating calibration performance, other metrics have been proposed as well. Here we show that our results on ECE with equal-width binning are consistent across various other calibration metrics. We additionally evaluate based on: (a) ECE using equal-mass binning (Tab.9), (b) ECE based on kernel density estimation (ECE-KDE) (Zhang et al., 2020) (Tab.10), and (c) class-wise ECE (Kull et al., 2019) (Tab.11). Moreover, we show that the negative log-likelihood (Tab.12) of our method remains similar to that of the baseline calibration methods without DAC. In each of these tables, we report macro-averaged ECE or NLL scores. For each model, we compute the macro-averaged ECE or NLL across all corruptions from severity=0 (in-domain) to severity=5 (OOD). For SPL, the authors did not provide a way to calibrate the full probabilistic predictions, which is why we could not compute the negative log-likelihood and class-wise ECE for this method.
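For concreteness, the two binning schemes can be sketched as follows. This is an illustrative implementation of standard top-label ECE, not the exact evaluation code; the function name and signature are our own:

```python
import numpy as np

def ece(confidences, correct, n_bins=15, scheme="width"):
    """Expected calibration error with equal-width or equal-mass binning.

    confidences: top-label confidence per sample, in [0, 1]
    correct:     1.0 if the prediction was correct, else 0.0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    if scheme == "width":
        edges = np.linspace(0.0, 1.0, n_bins + 1)
    else:  # "mass": edges at quantiles, so bins hold similar numbers of points
        edges = np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))
    total, n = 0.0, len(confidences)
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        if i == n_bins - 1:  # last bin includes its right edge
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.sum() / n * gap
    return total
```

Equal-mass binning avoids the near-empty high-confidence bins that equal-width binning produces for confident classifiers, which is why we report both.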
Table 9: Mean ECE (×10²) with equal-mass binning (15 bins), computed across all corruptions from severity=0 (in-domain) to severity=5 (heavily corrupted). (lower ECE is better)
Table 10: Mean ECE-KDE (×10²) computed across all corruptions from severity=0 (in-domain) to severity=5 (heavily corrupted). (lower ECE-KDE is better)
Table 11: Mean class-wise ECE (×10²) computed across all corruptions from severity=0 (in-domain) to severity=5 (heavily corrupted). (lower class-wise ECE is better). Note that since SPL only calibrates the highest predicted confidence, it is not directly possible to evaluate class-wise ECE.
Table 12: Mean negative log-likelihood computed across all corruptions from severity=0 (in-domain) until severity=5 (heavily corrupted). Note that since SPL only calibrates the highest predicted confidence, it is not directly possible to evaluate negative log-likelihood.
Appendix F Ablation Study: How Information from Hidden Layers Benefits Calibration with DAC
[Figure 8 panels: CIFAR10-ResNet18, CIFAR100-VGG16]
Figure 8: Ablation study on how information from hidden layers benefits the calibration performance of DAC. The y-axis shows the macro-averaged ECE (×10²) computed across all corruptions from severity=0 (in-domain) to severity=5 (heavily corrupted). DAC boosts calibration performance significantly as soon as information from hidden layers is available ("selected layers" and "all layers") compared to DAC with access only to the "logits layer".
In order to investigate the importance of hidden layers from classifiers towards the calibration performance of DAC, we conduct an ablation study. In this study, we analyze the sensitivity of ECE with regard to different subsets of hidden layers that our DAC has access to.
We compare the calibration performance of the following variations:
• w/o DAC: Baseline post-hoc calibration method without DAC.
• DAC (no layer): Baseline method + DAC using no layer at all. If DAC has no access to any layer of the classifier, only the bias term w_0 remains in Equation (8); in this setting, DAC is therefore equivalent to temperature scaling.
• DAC (logits layer): Baseline method + DAC using only the logits layer.
• DAC (selected layers): Baseline method + DAC using selected layers (a well-distributed subset of layers throughout the classifier). This is the default DAC used throughout the paper.
• DAC (all layers): Baseline method + DAC using all layers of the classifier.
Fig.8 shows the macro-averaged ECE across all corruptions from severity=0 (in-domain) to severity=5 (OOD). The results indicate that DAC using solely the logits layer can already boost calibration compared to the respective stand-alone baseline methods. However, relying purely on the logits layer as a basis for post-hoc calibration is still suboptimal: DAC boosts calibration performance significantly as soon as information from hidden layers becomes available. Increasing the access further, from selected layers to all layers, does not improve calibration in all scenarios, however. This indicates that more layers can benefit DAC's performance, but we assume that too many layers can also cause overfitting due to the larger number of parameters to optimize.
Appendix G Reliability Diagrams
Reliability diagrams provide insight into calibration performance by showing the difference between a method's confidence in its predictions and its accuracy (Guo et al., 2017). Following Maddox et al. (2019), we construct the reliability diagrams as follows: we split the test data into 15 bins based on the confidence values, so that each bin contains an equal number of data points, and then evaluate the accuracy and mean confidence for each bin. For a well-calibrated model, this difference should be close to zero in every bin.
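The binning procedure described above can be sketched as follows; this is an illustrative NumPy version, with names of our own choosing:

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=15):
    """Per-bin (mean confidence, accuracy - confidence) pairs for a
    reliability diagram with equal-mass bins: sort the samples by
    confidence and split them into n_bins groups of (near-)equal size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)
    rows = []
    for conf_bin, corr_bin in zip(np.array_split(confidences[order], n_bins),
                                  np.array_split(correct[order], n_bins)):
        if len(conf_bin):
            rows.append((conf_bin.mean(), corr_bin.mean() - conf_bin.mean()))
    return rows
```

Plotting the second component against the first yields the diagrams in Fig.9; bars near zero indicate good calibration in that confidence range.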
In Fig.9, we show the reliability diagrams for 6 different dataset-classifier pairs on the CIFAR-C or ImageNet-C dataset, with all corruptions and severity levels combined in one reliability diagram. The figure shows that our post-hoc calibration method DAC successfully boosts the calibration performance of existing post-hoc methods in domain-shift scenarios.


[Figure 9 panels: CIFAR10-ResNet18, CIFAR100-VGG16, ImageNet-DenseNet169, ImageNet-BiT-M, ImageNet-ResNeXt-WSL, ImageNet-ViT-B]
Figure 9: Reliability diagrams for all corruptions and severity levels combined (with equal-mass binning and 15 bins).
Appendix H Additional Results: OOD Scenarios
Here, we include additional results from the OOD experiments. Fig.10 summarizes the in-domain/OOD confidence distributions in various cases; we see that DAC can, in general, better separate in-domain data from OOD data, which is desirable. Tab.13 provides a more comprehensive summary of the quantitative OOD results for various network backbones. We observe consistent improvement with DAC in the majority of cases.
Note that BiT-M, ResNeXt-WSL, and ViT-B are pre-trained with additional data, as mentioned in Section 4. This might influence the outcome and the validity of the OOD experimental setup. The effect of backbone pre-training on OOD tasks could be an interesting topic for future work.

[Figure 10 panels: ImageNet-ResNet152, ImageNet-DenseNet169, ImageNet-Xception, ImageNet-BiT-M, ImageNet-ResNeXt-WSL, ImageNet-ViT-B]
Figure 10: Boxplots of confidence values for the ImageNet test set and the ObjectNet-OOD dataset. By combining the existing calibration methods with our DAC method, all classifiers output lower confidence values for novel-class images.
Table 13: Additional OOD results using ImageNet-1k/ObjectNet-OOD as in-domain/OOD test set, respectively.
Appendix I Data Efficiency Analysis
I.1 Additional Results
We show additional data efficiency diagrams for models on CIFAR10 and CIFAR100. As with ImageNet in the main text, we observe no drastic sensitivity to the validation set size. We conclude that these post-hoc methods combined with DAC can be trained on very small validation sets.

[Figure 11 panels: CIFAR10-ResNet18, CIFAR100-VGG16]
Figure 11: Data efficiency diagrams for macro-averaged ECE (×10²) (with 15 bins, across all corruptions from severity=0 (in-domain) to severity=5 (heavily corrupted)) from 10% to 100% validation set size.
I.2 Trade-off between in-domain and domain-shift performance
We observe that the larger the validation set gets, the more likely the calibration method is to overfit to in-domain data, leading to degraded performance on domain-shift and OOD data. In Figure 12, we show the data efficiency diagrams for ECE (×10²) with 15 bins from 10% to 100% validation set size, for the in-domain case and for corruption severity=5 (heavily corrupted), respectively. While the in-domain calibration performance benefits from a larger validation set, the domain-shift calibration performance worsens due to overfitting. Fig.6 in the main text shows the mean ECE across all test domain-shift scenarios, including in-domain, and thus incorporates this overfitting behavior at large validation set sizes.

[Figure 12 panels: IMG-DenseNet169 (in-domain), IMG-DenseNet169 (Sev. 5)]
Figure 12: Data efficiency diagrams for ECE (×10²) with 15 bins from 10% to 100% validation set size. Left: image corruption severity=0 (in-domain). Right: image corruption severity=5 (heavily corrupted). The figures show the trade-off between in-domain and out-of-domain calibration errors of DAC.
Appendix J Computational Cost
In this section, we compare the computational cost of DAC and the existing methods. In short, the computational cost of DAC is similar to that of existing post-hoc methods during both the training and inference phases.
We first compare the training speed of DAC, DIA, ETS, and SPL for DenseNet169 trained on ImageNet, using an NVIDIA Titan X (12GB) GPU. The table below shows that DAC requires a total training time of only about 14 minutes. Compared to the overall training process of the classifier, which takes at least several hours, the additional computational cost added by DAC is minor.
Table 14: Training time comparison: Our proposed method DAC vs. existing post-hoc calibration methods (DIA, ETS, SPL).
In addition, we compare the per-sample inference speed. As shown in the table below, we found that the inference speed of DenseNet169 combined with DAC is comparable to that of the existing calibration methods. This can be attributed to DAC's efficient GPU-accelerated KNN search, enabled by the Faiss library (Johnson et al., 2019): https://github.com/facebookresearch/faiss.
Table 15: Per-sample inference time comparison: Our proposed method DAC vs. existing post-hoc calibration methods (DIA, ETS, SPL).