Title: Conformal Risk Control
URL Source: https://arxiv.org/html/2208.02814
License: arXiv.org perpetual non-exclusive license. arXiv:2208.02814v4 [stat.ME] 13 Jun 2025.
Conformal Risk Control

Anastasios N. Angelopoulos (1), Stephen Bates (1), Adam Fisch (2), Lihua Lei (3), and Tal Schuster (4)

(1) University of California, Berkeley; (2) Massachusetts Institute of Technology; (3) Stanford University; (4) Google Research

Abstract
We extend conformal prediction to control the expected value of any monotone loss function. The algorithm generalizes split conformal prediction together with its coverage guarantee. Like conformal prediction, the conformal risk control procedure is tight up to an $\mathcal{O}(1/n)$ factor. We also introduce extensions of the idea to distribution shift, quantile risk control, multiple and adversarial risk control, and expectations of U-statistics. Worked examples from computer vision and natural language processing demonstrate the usage of our algorithm to bound the false negative rate, graph distance, and token-level F1-score.
1 Introduction
We seek to endow a pre-trained machine learning model with performance guarantees so as to ensure its safe deployment. Suppose we have a base model $f$, a function mapping inputs $x \in \mathcal{X}$ to values in some other space, such as a probability distribution over classes. Our job is to design a procedure that takes the output of $f$ and post-processes it into quantities with desirable statistical guarantees.
Split conformal prediction [vovk2005algorithmic, papadopoulos2002inductive], which we will henceforth refer to simply as "conformal prediction", has been useful in areas such as computer vision [angelopoulos2020sets] and natural language processing [fisch2020efficient] to provide such a guarantee. By measuring the model's performance on a calibration dataset $\{(X_i, Y_i)\}_{i=1}^n$ of feature-response pairs, conformal prediction post-processes the model to construct prediction sets that bound the miscoverage,

$$\mathbb{P}\big(Y_{n+1} \notin \mathcal{C}(X_{n+1})\big) \leq \alpha, \qquad (1)$$

where $(X_{n+1}, Y_{n+1})$ is a new test point, $\alpha$ is a user-specified error rate (e.g., 10%), and $\mathcal{C}$ is a function of the model and calibration data that outputs a prediction set. Note that $\mathcal{C}$ is formed using the first $n$ data points, and the probability in (1) is over the randomness in all $n+1$ data points (i.e., the draw of both the calibration and test points).
In this work, we extend conformal prediction to prediction tasks where the natural notion of error is not simply miscoverage. In particular, our main result is that a generalization of conformal prediction provides guarantees of the form

$$\mathbb{E}\big[\ell(\mathcal{C}(X_{n+1}), Y_{n+1})\big] \leq \alpha, \qquad (2)$$

for any bounded loss function $\ell$ that shrinks as $\mathcal{C}(X_{n+1})$ grows. We call this a conformal risk control guarantee. Note that (2) recovers the conformal miscoverage guarantee in (1) when using the miscoverage loss, $\ell(\mathcal{C}(X_{n+1}), Y_{n+1}) = \mathbb{1}\{Y_{n+1} \notin \mathcal{C}(X_{n+1})\}$. However, our algorithm also extends conformal prediction to situations where other loss functions, such as the false negative rate (FNR) or F1-score, are more appropriate.
As an example, consider multilabel classification, where the $Y_i \subseteq \{1, \dots, K\}$ are sets comprising a subset of $K$ classes. Given a trained multilabel classifier $f : \mathcal{X} \to [0,1]^K$, we want to output sets that include a large fraction of the true classes in $Y_i$. To that end, we post-process the model's raw outputs into the set of classes with sufficiently high scores, $\mathcal{C}_\lambda(x) = \{k : f(x)_k \geq 1 - \lambda\}$. Note that as the threshold $\lambda$ grows, we include more classes in $\mathcal{C}_\lambda(x)$; i.e., it becomes more conservative. In this case, conformal risk control finds a threshold value $\hat\lambda$ that controls the fraction of missed classes, i.e., the expected value of $\ell(\mathcal{C}_{\hat\lambda}(X_{n+1}), Y_{n+1}) = 1 - |Y_{n+1} \cap \mathcal{C}_{\hat\lambda}(X_{n+1})| / |Y_{n+1}|$. Setting $\alpha = 0.1$ would ensure that our algorithm produces sets $\mathcal{C}_{\hat\lambda}(X_{n+1})$ containing $\geq 90\%$ of the true classes in $Y_{n+1}$ on average.
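As a concrete illustration, the thresholded set and the fraction-of-missed-classes loss above can be sketched in a few lines. The classifier scores and label set below are hypothetical; the check at the end simply confirms that growing $\lambda$ cannot increase the loss.

```python
import numpy as np

def prediction_set(probs, lam):
    """C_lambda(x): classes whose scores are at least 1 - lambda."""
    return set(np.flatnonzero(probs >= 1 - lam))

def fnr_loss(pred_set, true_set):
    """Fraction of true classes missed: 1 - |Y ∩ C| / |Y| (Y assumed nonempty)."""
    return 1 - len(true_set & pred_set) / len(true_set)

probs = np.array([0.9, 0.6, 0.2, 0.05])  # hypothetical scores over K = 4 classes
true = {0, 1, 2}
# larger lambda -> larger set -> loss can only shrink
assert fnr_loss(prediction_set(probs, 0.9), true) <= fnr_loss(prediction_set(probs, 0.1), true)
```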
1.1 Algorithm and preview of main results
Formally, we will consider post-processing the predictions of the model $f$ to create a function $\mathcal{C}_\lambda(\cdot)$. The function has a parameter $\lambda$ that encodes its level of conservativeness: larger $\lambda$ values yield more conservative outputs (e.g., larger prediction sets). To measure the quality of the output of $\mathcal{C}_\lambda$, we consider a loss function $\ell(\mathcal{C}_\lambda(x), y) \in (-\infty, B]$ for some $B < \infty$. We require the loss function to be non-increasing as a function of $\lambda$. Our goal is to choose $\hat\lambda$ based on the observed data $\{(X_i, Y_i)\}_{i=1}^n$ so that risk control as in (2) holds.
We now rewrite this same task in a more notationally convenient and abstract form. Consider an exchangeable collection of non-increasing random functions $L_i : \Lambda \to (-\infty, B]$, $i = 1, \dots, n+1$. Throughout the paper, we assume $\lambda_{\max} := \sup \Lambda \in \Lambda$. We seek to use the first $n$ functions to choose a value of the parameter, $\hat\lambda$, in such a way that the risk on the unseen function is controlled:

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \leq \alpha. \qquad (3)$$

We are primarily motivated by the case where $L_i(\lambda) = \ell(\mathcal{C}_\lambda(X_i), Y_i)$, in which case the guarantee in (3) coincides with risk control as in (2).
Now we describe the algorithm. Let $\hat{R}_n(\lambda) = \big(L_1(\lambda) + \dots + L_n(\lambda)\big)/n$. Given any desired risk level $\alpha \in (-\infty, B)$, define

$$\hat\lambda = \inf\Big\{\lambda : \frac{n}{n+1}\hat{R}_n(\lambda) + \frac{B}{n+1} \leq \alpha\Big\}. \qquad (4)$$

When the set is empty, we define $\hat\lambda = \lambda_{\max}$. Our proposed conformal risk control algorithm is to deploy $\hat\lambda$ on the forthcoming test point. Our main result is that this algorithm satisfies (3). When the $L_i$ are i.i.d. from a continuous distribution, the algorithm also satisfies a lower bound saying it is not too conservative,

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \geq \alpha - \frac{2B}{n+1}. \qquad (5)$$
We show the reduction of conformal prediction to conformal risk control in Section 2.3. Furthermore, if the risk is non-monotone, then this algorithm does not control the risk; we discuss this in Section 2.4. Finally, we provide both practical examples using real-world data and several theoretical extensions of our procedure in Sections 3 and 4, respectively.
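The choice of $\hat\lambda$ in (4) amounts to a one-pass search over a grid of candidate $\lambda$ values. The sketch below assumes the calibration losses have already been evaluated on such a grid; the function and variable names are ours, not from the paper's released code.

```python
import numpy as np

def conformal_risk_control(loss_matrix, lambdas, alpha, B=1.0):
    """lambda-hat per equation (4): the smallest lambda with
    (n/(n+1)) * Rhat_n(lambda) + B/(n+1) <= alpha.

    loss_matrix[i, j] = L_i(lambdas[j]); rows are calibration points and
    columns are candidate lambdas in increasing order, so each row is
    non-increasing left to right."""
    n = loss_matrix.shape[0]
    rhat = loss_matrix.mean(axis=0)                    # Rhat_n at each candidate
    ok = (n / (n + 1)) * rhat + B / (n + 1) <= alpha   # the inequality in (4)
    if not ok.any():
        return lambdas[-1]                             # empty set -> lambda_max
    return lambdas[np.argmax(ok)]                      # first (smallest) valid lambda
```

Because the losses are non-increasing in $\lambda$, the boolean mask `ok` is monotone, so the first `True` entry attains the infimum.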
1.2 Related work
Conformal prediction was developed by Vladimir Vovk and collaborators beginning in the late 1990s [vovk1999machine, vovk2005algorithmic], and has recently become a popular uncertainty estimation tool in the machine learning community due to its favorable model-agnostic, distribution-free, finite-sample guarantees. See [angelopoulos-gentle] for a modern introduction to the area or [shafer2008tutorial] for a more classical alternative. As previously discussed, in this paper we primarily build on split conformal prediction [papadopoulos2002inductive]; statistical properties of this algorithm, including the coverage upper bound, were studied in [lei2018distribution]. Recently there have been many extensions of the conformal algorithm, mainly targeting deviations from exchangeability [tibshirani2019conformal, gibbs2021adaptive, barber2022conformal, fannjiang2022conformal] and improved conditional coverage [barber2019limits, romano2019conformalized, guan2020conformal, romano2020classification, angelopoulos2020sets]. Most relevant to us is recent work on high-probability risk control [vovk2012conditional, bates2021distribution, angelopoulos2021learn] and its applications [Park2020PAC, fisch2022conformal, schuster2021consistent, schuster2022confident, sankaranarayanan2022semantic, angelopoulos2022image, angelopoulos2022recommendation, inter alia]. Though these works closely relate to ours in motivation, the algorithm presented herein differs greatly: it has a guarantee in expectation, and neither the algorithm nor its analysis shares much technical similarity with these previous works.
To elaborate on the difference between our work and previous literature, first consider conformal prediction. The purpose of conformal prediction is to provide coverage guarantees of the form in (1). The guarantee available through conformal risk control, (3), strictly subsumes that of conformal prediction; it is generally impossible to recast risk control as coverage control. As a second question, one might ask whether (3) can be achieved through standard statistical machinery, such as uniform concentration inequalities. Though it is possible to integrate a uniform concentration inequality to get a bound in expectation, this strategy tends to be excessively loose both in theory and in practice (see, e.g., the bound of [anthony1993result]). The technique herein avoids these complications; it is simpler than concentration-based approaches, practical to implement, and tight up to a factor of $\mathcal{O}(1/n)$, a faster rate than concentration would allow. Finally, herein we target distribution-free finite-sample control of (3), but as a side note it is also worth pointing the reader to the rich literature on functional central limit theorems [davidson2000functional], which are another way of estimating risk functions.
2 Theory
In this section, we establish the core theoretical properties of conformal risk control. All proofs, unless otherwise specified, are deferred to Appendix B.
2.1 Risk control
We first show that the proposed algorithm leads to risk control when the loss is monotone.
Theorem 2.1.
Assume that $L_i(\lambda)$ is non-increasing in $\lambda$, right-continuous, and

$$L_i(\lambda_{\max}) \leq \alpha, \qquad \sup_\lambda L_i(\lambda) \leq B < \infty \text{ almost surely}. \qquad (6)$$

Then

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \leq \alpha. \qquad (7)$$

Proof.
Let $\hat{R}_{n+1}(\lambda) = \big(L_1(\lambda) + \dots + L_{n+1}(\lambda)\big)/(n+1)$ and

$$\hat\lambda' = \inf\big\{\lambda \in \Lambda : \hat{R}_{n+1}(\lambda) \leq \alpha\big\}. \qquad (8)$$

Since $\inf_\lambda L_i(\lambda) = L_i(\lambda_{\max}) \leq \alpha$, $\hat\lambda'$ is well-defined almost surely. Since $L_{n+1}(\lambda) \leq B$, we know

$$\hat{R}_{n+1}(\lambda) = \frac{n}{n+1}\hat{R}_n(\lambda) + \frac{L_{n+1}(\lambda)}{n+1} \leq \frac{n}{n+1}\hat{R}_n(\lambda) + \frac{B}{n+1}.$$

Thus,

$$\frac{n}{n+1}\hat{R}_n(\lambda) + \frac{B}{n+1} \leq \alpha \implies \hat{R}_{n+1}(\lambda) \leq \alpha. \qquad (9)$$

This implies $\hat\lambda' \leq \hat\lambda$ when the left-hand side of (9) holds for some $\lambda \in \Lambda$. When the left-hand side is above $\alpha$ for all $\lambda \in \Lambda$, by definition, $\hat\lambda = \lambda_{\max} \geq \hat\lambda'$. Thus, $\hat\lambda' \leq \hat\lambda$ almost surely. Since $L_{n+1}(\lambda)$ is non-increasing in $\lambda$,

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \leq \mathbb{E}\big[L_{n+1}(\hat\lambda')\big]. \qquad (10)$$

Let $E$ be the multiset of loss functions $\{L_1, \dots, L_{n+1}\}$. Then $\hat\lambda'$ is a function of $E$; equivalently, $\hat\lambda'$ is a constant conditional on $E$. Additionally, $L_{n+1}(\cdot) \mid E \sim \mathrm{Uniform}(\{L_1, \dots, L_{n+1}\})$ by exchangeability. These facts combined with the right-continuity of $L_i$ imply

$$\mathbb{E}\big[L_{n+1}(\hat\lambda') \mid E\big] = \frac{1}{n+1}\sum_{i=1}^{n+1} L_i(\hat\lambda') \leq \alpha. \qquad (11)$$

The proof is completed by the law of total expectation and (10). ∎
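A quick Monte Carlo experiment illustrates the theorem. This is a sketch with a synthetic miscoverage-style loss (uniform scores, which satisfy the assumptions); averaged over many independent calibration/test draws, the test loss should land at, or just below, $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)
lambdas = np.linspace(0, 1, 201)
alpha, B, n, trials = 0.2, 1.0, 100, 2000

def lam_hat(scores):
    # losses L_i(lambda) = 1{score_i > lambda}, evaluated on the grid
    rhat = (scores[:, None] > lambdas[None, :]).mean(axis=0)
    ok = (n / (n + 1)) * rhat + B / (n + 1) <= alpha   # inequality in (4)
    return lambdas[np.argmax(ok)] if ok.any() else lambdas[-1]

# average test loss over fresh draws of calibration set + test point
risks = []
for _ in range(trials):
    s = rng.uniform(size=n + 1)
    risks.append(float(s[-1] > lam_hat(s[:-1])))
print(np.mean(risks))  # close to, and at most about, alpha = 0.2
```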
2.2 A risk lower bound
Next we show that the conformal risk control procedure is tight up to a factor of $2B/(n+1)$. Like the standard conformal coverage upper bound, the proof relies on a form of continuity that prohibits large jumps in the risk function. Towards that end, we define the jump function, which quantifies the size of the discontinuity of a right-continuous input function $f$ at the point $\lambda$:

$$J(f, \lambda) = \lim_{\epsilon \to 0^+} f(\lambda - \epsilon) - f(\lambda). \qquad (12)$$

When $f$ is non-increasing and has a discontinuity at $\lambda$, we have $J(f, \lambda) > 0$; when there is no discontinuity, the jump function is zero. The next theorem assumes that the probability that $L_i$ has a discontinuity at any pre-specified $\lambda$ is zero, i.e., $\mathbb{P}(J(L_i, \lambda) > 0) = 0$. Under this assumption, the conformal risk control procedure is not too conservative.

Theorem 2.2.
In the setting of Theorem 2.1, further assume that the $L_i$ are i.i.d., $L_i \geq 0$, and, for any $\lambda$, $\mathbb{P}(J(L_i, \lambda) > 0) = 0$. Then

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \geq \alpha - \frac{2B}{n+1}.$$
2.3 Conformal prediction reduces to risk control
Conformal prediction can be thought of as controlling the expectation of an indicator loss function. Recall that the risk upper bound (2) specializes to the conformal coverage guarantee in (1) when the loss function is the indicator of a miscoverage event. The conformal risk control procedure specializes to conformal prediction under this loss function as well. However, the risk lower bound in Theorem 2.2 has a slightly worse constant than the usual conformal guarantee. We now describe these correspondences.
First, we show the equivalence of the algorithms. In conformal prediction, we have conformal scores $s(X_i, Y_i)$ for some score function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. Based on this score function, we create prediction sets for the test point $X_{n+1}$ as

$$\mathcal{C}_{\hat\lambda}(X_{n+1}) = \{y : s(X_{n+1}, y) \leq \hat\lambda\},$$

where $\hat\lambda$ is the conformal quantile, a parameter that is set based on the calibration data. In particular, conformal prediction chooses $\hat\lambda$ to be the $\lceil (n+1)(1-\alpha) \rceil / n$ sample quantile of $\{s(X_i, Y_i)\}_{i=1}^n$. To formulate this in the language of risk control, we consider the miscoverage loss

$$L_i^{\mathrm{Cvg}}(\lambda) = \mathbb{1}\{Y_i \notin \mathcal{C}_\lambda(X_i)\} = \mathbb{1}\{s(X_i, Y_i) > \lambda\}.$$

Direct calculation of $\hat\lambda$ from (4) then shows the equivalence of the proposed procedure to conformal prediction:

$$\hat\lambda = \inf\Big\{\lambda : \frac{1}{n+1}\sum_{i=1}^{n} \mathbb{1}\{s(X_i, Y_i) > \lambda\} + \frac{1}{n+1} \leq \alpha\Big\} = \inf\Big\{\lambda : \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{s(X_i, Y_i) \leq \lambda\} \geq \frac{\lceil (n+1)(1-\alpha) \rceil}{n}\Big\}, \qquad (13)$$

which is exactly the conformal prediction algorithm.
Next, we discuss how the risk lower bound relates to its conformal prediction equivalent. In the setting of conformal prediction, [lei2018distribution] proves that $\mathbb{P}(Y_{n+1} \notin \mathcal{C}_{\hat\lambda}(X_{n+1})) \geq \alpha - 1/(n+1)$ when the conformal score follows a continuous distribution. Theorem 2.2 recovers this guarantee with a slightly worse constant: $\mathbb{P}(Y_{n+1} \notin \mathcal{C}_{\hat\lambda}(X_{n+1})) \geq \alpha - 2/(n+1)$. First, note that our assumption in Theorem 2.2 about the distribution of discontinuities specializes to the continuity of the score distribution when the miscoverage loss is used:

$$\mathbb{P}\big(J(L_i^{\mathrm{Cvg}}, \lambda) > 0\big) = 0 \iff \mathbb{P}\big(s(X_i, Y_i) = \lambda\big) = 0. \qquad (14)$$

However, the bound for the conformal case is better than the bound for the general case in Theorem 2.2 by a factor of two. We do not know whether this factor of two is improvable. It is of little practical importance, though, as the difference between $1/(n+1)$ and $2/(n+1)$ is small even for moderate values of $n$.
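The equivalence in (13) is easy to verify numerically. The sketch below computes $\hat\lambda$ from (4) under the miscoverage loss and compares it against the classical split-conformal quantile; with continuously distributed (hence almost surely distinct) scores, the two coincide. The helper names are ours.

```python
import numpy as np

def crc_lambda(scores, alpha):
    """lambda-hat from (4) with the miscoverage loss 1{s(X_i, Y_i) > lambda}."""
    n = len(scores)
    for lam in np.sort(scores):  # the infimum is attained at an observed score
        if (np.sum(scores > lam) + 1) / (n + 1) <= alpha:
            return lam
    return np.inf

def conformal_quantile(scores, alpha):
    """Classical split conformal: the ceil((n+1)(1-alpha))/n sample quantile."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[k - 1] if k <= n else np.inf

scores = np.random.default_rng(1).normal(size=500)
assert crc_lambda(scores, 0.1) == conformal_quantile(scores, 0.1)
```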
2.4 Controlling general loss functions
We next show that the conformal risk control algorithm does not control the risk if the $L_i$ are not assumed to be monotone; in particular, (3) does not hold in general. We show this by example.
Proposition 2.1.
For any $\epsilon > 0$, there exists a non-monotone loss function such that

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \geq B - \epsilon. \qquad (15)$$
Notice that for any desired level $\alpha$, the expectation in (3) can be arbitrarily close to $B$. Since the loss values here lie in $[0, B]$, this means that even for bounded random variables, risk control can be violated by an arbitrary amount unless further assumptions are placed on the $L_i$. However, the algorithms developed here may still be appropriate for near-monotone loss functions. Simply "monotonizing" all loss functions $L_i$ and running conformal risk control will guarantee (3), but this strategy is only powerful if the loss is near-monotone. For concreteness, we describe this procedure below as a corollary of Theorem 2.1.
Corollary 1.
Allow $L_i(\lambda)$ to be any (possibly non-monotone) function of $\lambda$ satisfying (6). Take

$$\tilde{L}_i(\lambda) = \sup_{\lambda' \geq \lambda} L_i(\lambda'), \qquad \tilde{R}_n(\lambda) = \frac{1}{n}\sum_{i=1}^{n} \tilde{L}_i(\lambda), \quad \text{and} \quad \tilde\lambda = \inf\Big\{\lambda : \frac{n}{n+1}\tilde{R}_n(\lambda) + \frac{B}{n+1} \leq \alpha\Big\}. \qquad (16)$$

Then,

$$\mathbb{E}\big[L_{n+1}(\tilde\lambda)\big] \leq \alpha. \qquad (17)$$
If the loss function is already monotone, then $\tilde\lambda$ reduces to $\hat\lambda$. We propose a further algorithm for picking $\lambda$ in Appendix A that provides an asymptotic risk-control guarantee for non-monotone loss functions. However, this algorithm again is only powerful when the risk $\mathbb{E}[L_{n+1}(\lambda)]$ is near-monotone, and it reduces to the standard conformal risk control algorithm when the loss is monotone.
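On a finite grid of increasing $\lambda$ values, the monotonization in (16) is just a reversed running maximum, for example:

```python
import numpy as np

def monotonize(losses):
    """L~(lambda) = sup_{lambda' >= lambda} L(lambda'), evaluated on a grid of
    increasing lambda values: each entry becomes the maximum of itself and
    everything to its right (a reversed cumulative maximum)."""
    return np.maximum.accumulate(losses[::-1])[::-1]

raw = np.array([0.9, 0.4, 0.55, 0.2, 0.3, 0.1])  # non-monotone loss on a grid
print(monotonize(raw))  # non-increasing envelope of `raw`
```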
3 Examples
To demonstrate the flexibility and empirical effectiveness of the proposed algorithm, we apply it to four tasks across computer vision and natural language processing. All four loss functions are non-binary, monotone losses bounded by 1, and all are commonly used within their respective application domains. Our results validate that the procedure bounds the risk as desired and gives useful outputs to the end-user. We note that the choices of $\mathcal{C}_\lambda$ used herein are only for the purposes of illustration; any nested family of sets will work. For each example use case, at a representative level $\alpha$ (details provided for each task), we provide both qualitative results and quantitative histograms of the risk and set sizes over 1000 random data splits, demonstrating valid risk control (i.e., with mean $\leq \alpha$). Code to reproduce our examples is available at https://github.com/aangelopoulos/conformal-risk.
3.1 FNR control in tumor segmentation

Figure 1: FNR control in tumor segmentation. The top figure shows examples of our procedure with correct pixels in white, false positives in blue, and false negatives in red. The bottom plots report FNR and set size over 1000 independent random data splits. The dashed gray line marks $\alpha$.
In the tumor segmentation setting, our input is a $d \times d$ image and our label is a set of pixels $Y_i \in \wp(\{(1,1), (1,2), \dots, (d,d)\})$, where $\wp$ denotes the power set. We build on an image segmentation model $f : \mathcal{X} \to [0,1]^{d \times d}$ outputting a probability for each pixel, and we measure loss as the fraction of false negatives,

$$L_i^{\mathrm{FNR}}(\lambda) = 1 - \frac{|Y_i \cap \mathcal{C}_\lambda(X_i)|}{|Y_i|}, \quad \text{where } \mathcal{C}_\lambda(X_i) = \{y : f(X_i)_y \geq 1 - \lambda\}. \qquad (18)$$

The expected value of $L_i^{\mathrm{FNR}}$ is the FNR. Since $L_i^{\mathrm{FNR}}$ is monotone, so is the FNR. Thus, we use the technique in Section 2.1, picking $\hat\lambda$ by (4), to control the FNR on a new point, which results in the following guarantee:

$$\mathbb{E}\big[L_{n+1}^{\mathrm{FNR}}(\hat\lambda)\big] \leq \alpha. \qquad (19)$$
To evaluate the proposed procedure, we pool data from several online open-source gut polyp segmentation datasets: Kvasir, Hyper-Kvasir, CVC-ColonDB, CVC-ClinicDB, and ETIS-Larib. We choose a PraNet [fan2020pranet] as our base model $f$, use $n = 1000$ calibration points, and evaluate risk control with the 781 remaining validation points. We report results with $\alpha = 0.1$ in Figure 1. The mean and standard deviation of the risk over 1000 trials are 0.0987 and 0.0114, respectively.
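The per-image loss in (18) can be sketched on a toy probability map; the 2×2 "image" below is hypothetical, and real masks would be $d \times d$ boolean arrays with a nonempty tumor region.

```python
import numpy as np

def fnr_loss(probs, mask, lam):
    """L^FNR(lambda) = 1 - |Y ∩ C_lambda(X)| / |Y| for a pixelwise
    probability map `probs` and boolean ground-truth `mask` (nonempty)."""
    pred = probs >= 1 - lam                      # C_lambda: confident pixels
    caught = np.logical_and(pred, mask).sum()    # true pixels we recovered
    return 1 - caught / mask.sum()

probs = np.array([[0.9, 0.7], [0.3, 0.1]])       # hypothetical 2x2 "image"
mask = np.array([[True, True], [True, False]])
print(fnr_loss(probs, mask, 0.4))                # one missed pixel out of three
```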
3.2 FNR control in multilabel classification

Figure 2: FNR control on MS COCO. The top figure shows examples of our procedure with correct classes in black, false positives in blue, and false negatives in red. The bottom plots report FNR and set size over 1000 independent random data splits. The dashed gray line marks $\alpha$.
In the multilabel classification setting, our input $X_i$ is an image and our label is a set of classes $Y_i \subseteq \{1, \dots, K\}$ for some number of classes $K$. Using a multilabel classification model $f : \mathcal{X} \to [0,1]^K$, we form prediction sets and compute the false negative proportion exactly as in (18). By Theorem 2.1, picking $\hat\lambda$ as in (4) again yields the FNR-control guarantee in (19).
We use the Microsoft Common Objects in Context (MS COCO) dataset [lin2014microsoft], a large-scale 80-class multilabel classification benchmark commonly used in computer vision, to evaluate the proposed procedure. We choose a TResNet [ridnik2020tresnet] as our base model $f$, use $n = 4000$ calibration points, and evaluate risk control with 1000 validation points. We report results with $\alpha = 0.1$ in Figure 2. The mean and standard deviation of the risk over 1000 trials are 0.0996 and 0.0052, respectively.
3.3 Control of graph distance in hierarchical image classification

Figure 3: Control of graph distance on hierarchical ImageNet. The top figure shows examples of our procedure with correct classes in black, false positives in blue, and false negatives in red. The bottom plots report our minimum hierarchical distance loss and set size over 1000 independent random data splits. The dashed gray line marks $\alpha$.
In the $K$-class hierarchical classification setting, our input $X_i$ is an image and our label is a leaf node $Y_i \in \{1, \dots, K\}$ of a tree with nodes $\mathcal{V}$ and edges $\mathcal{E}$. Using a single-label classification model $f : \mathcal{X} \to \Delta^K$, we calibrate a loss measured in graph distance between the interior node we select and the closest ancestor of the true class. For any $x \in \mathcal{X}$, let $\hat{y}(x) = \arg\max_k f(x)_k$ be the class with the highest estimated probability. Further, let $d : \mathcal{V} \times \mathcal{V} \to \mathbb{Z}$ be the function that returns the length of the shortest path between two nodes, let $\mathcal{A} : \mathcal{V} \to 2^{\mathcal{V}}$ be the function that returns the ancestors of its argument, and let $\mathcal{P} : \mathcal{V} \to 2^{\mathcal{V}}$ be the function that returns the set of leaf nodes that are descendants of its argument. We also let $g(v, x) = \sum_{k \in \mathcal{P}(v)} f(x)_k$ be the sum of scores of the leaves descended from $v$. Further, define the hierarchical distance

$$d_H(v, u) = \inf_{a \in \mathcal{A}(v)} d(a, u).$$

For a set-valued prediction $\mathcal{C}_\lambda(x) \in 2^{\mathcal{V}}$, we then define the loss

$$L_i^{\mathrm{Graph}}(\lambda) = \inf_{y \in \mathcal{C}_\lambda(X_i)} d_H(y, Y_i)/D, \quad \text{where } \mathcal{C}_\lambda(x) = \bigcup_{\{a \in \mathcal{A}(\hat{y}(x)) \,:\, g(a, x) \geq -\lambda\}} \mathcal{P}(a). \qquad (20)$$

This loss returns zero if $Y_i$ is a child of any element in $\mathcal{C}_\lambda(X_i)$, and otherwise returns the minimum distance between any element of $\mathcal{C}_\lambda(X_i)$ and any ancestor of $Y_i$, scaled by the maximum depth $D$. Thus, it is a monotone loss function and can be controlled by choosing $\hat\lambda$ as in (4) to achieve the guarantee

$$\mathbb{E}\big[L_{n+1}^{\mathrm{Graph}}(\hat\lambda)\big] \leq \alpha. \qquad (21)$$
For this experiment, we use the ImageNet dataset [deng2009imagenet], which comes with an existing label hierarchy, WordNet, of maximum depth $D = 14$. We choose a ResNet152 [he2016deep] as our base model $f$, use $n = 30000$ calibration points, and evaluate risk control with the remaining 20000. We report results with $\alpha = 0.05$ in Figure 3. The mean and standard deviation of the risk over 1000 trials are 0.0499 and 0.0011, respectively.
3.4 F1-score control in open-domain question answering

Figure 4: F1-score control on Natural Questions. The top figure shows examples of our procedure with fully correct answers in green, partially correct answers in blue, and false positives in gray. Note that due to the nature of the evaluation, answers that are technically correct may still be downgraded if they do not match the reference. We treat this as part of the randomness in the task. The bottom plots report the F1 risk and average set size over 1000 independent random data splits. The dashed gray line marks $\alpha$.
In the open-domain question answering setting, our input $X_i$ is a question and our label $Y_i$ is a set of (possibly non-unique) correct answers. For example, the input

$X_{n+1}$ = "Where was Barack Obama born?"

could have the answer set

$Y_{n+1}$ = {"Hawaii", "Honolulu, Hawaii", "Kapoʻolani Medical Center"}.

Formally, here we treat all questions and answers as being composed of sequences of up to $m$ tokens in a vocabulary $\mathcal{V}$; i.e., assuming $k$ valid answers, we have $X_i \in \mathcal{O}$ and $Y_i \in \mathcal{O}^k$, where $\mathcal{O} := \mathcal{V}^m$. Using an open-domain question answering model $f : \mathcal{O} \times \mathcal{O} \to \mathbb{R}$ that individually scores candidate output answers, we calibrate the best token-based F1-score of the prediction set, taken over all pairs of predictions and answers:
$$L_i^{\mathrm{F1}}(\lambda) = 1 - \max\big\{\mathrm{F1}(a, c) : a \in \mathcal{C}_\lambda(X_i),\ c \in Y_i\big\}, \quad \text{where } \mathcal{C}_\lambda(X_i) = \{y \in \mathcal{V}^m : f(X_i, y) \geq -\lambda\}. \qquad (22)$$
We define the F1-score following popular QA evaluation metrics [squad]: we treat predictions and ground-truth answers as bags of tokens and compute the harmonic mean of their precision and recall (while ignoring punctuation and the articles {"a", "an", "the"}). Since $L_i^{\mathrm{F1}}$, as defined in this way, is monotone and upper bounded by 1, it can be controlled by choosing $\hat\lambda$ as in Section 2.1 to achieve the following guarantee:

$$\mathbb{E}\big[L_{n+1}^{\mathrm{F1}}(\hat\lambda)\big] \leq \alpha. \qquad (23)$$
We use the Natural Questions (NQ) dataset [nq], a popular open-domain question answering benchmark, to evaluate our method. We use the splits distributed as part of the Dense Passage Retrieval (DPR) package [dpr]. Our base model is the DPR Retriever-Reader [dpr], which retrieves passages from Wikipedia that might contain the answer to the given query and then uses a reader model to extract text sub-spans from the retrieved passages that serve as candidate answers. Instead of enumerating all possible answers to a given question (which is intractable), we retrieve the top several hundred candidate answers, extracted from the top 100 passages (which is sufficient to control all risks of interest). We use $n = 2500$ calibration points and evaluate risk control with the remaining 1110. We report results with $\alpha = 0.3$ (chosen empirically as the lowest F1 level that typically yields nearly correct answers) in Figure 4. The mean and standard deviation of the risk over 1000 trials are 0.2996 and 0.0150, respectively.
4 Extensions
In this section, we discuss several theoretical extensions of our procedure.
4.1 Risk control under distributional shift
Suppose the researcher wants to control the risk under a distribution shift. Then the goal in (3) can be redefined as

$$\mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n) \sim P_{\mathrm{train}},\ (X_{n+1},Y_{n+1}) \sim P_{\mathrm{test}}}\big[L_{n+1}(\hat\lambda)\big] \leq \alpha, \qquad (24)$$

where $P_{\mathrm{test}}$ denotes the test distribution, which differs from the training distribution $P_{\mathrm{train}}$ from which $(X_i, Y_i)_{i=1}^n$ are sampled. Assuming that $P_{\mathrm{test}}$ is absolutely continuous with respect to $P_{\mathrm{train}}$, the objective (24) can be rewritten in weighted form as

$$\mathbb{E}_{(X_1,Y_1),\dots,(X_{n+1},Y_{n+1}) \sim P_{\mathrm{train}}}\big[w(X_{n+1}, Y_{n+1})\, L_{n+1}(\hat\lambda)\big] \leq \alpha, \quad \text{where } w(x, y) = \frac{dP_{\mathrm{test}}(x, y)}{dP_{\mathrm{train}}(x, y)}. \qquad (25)$$

When $w$ is known and bounded, we can apply our procedure to the loss functions $\tilde{L}_i(\lambda) = w(X_i, Y_i) L_i(\lambda)$, which are non-increasing, bounded, and right-continuous in $\lambda$ whenever the $L_i$ are. Thus, Theorem 2.1 guarantees that the resulting $\hat\lambda$ satisfies (25).
In the setting of transductive learning, $X_{n+1}$ is available to the user. If the conditional distribution of $Y$ given $X$ remains the same in the training and test domains, the distribution shift reduces to a covariate shift and

$$w(X_{n+1}, Y_{n+1}) = w(X_{n+1}) := \frac{dP_{\mathrm{test}}(X_{n+1})}{dP_{\mathrm{train}}(X_{n+1})}.$$

In this case, we can achieve risk control even when $w$ is unbounded. In particular, assuming $L_i \in [0, B]$, for any potential value $x$ of the covariate, we define

$$\hat\lambda(x) = \inf\Big\{\lambda : \frac{\sum_{i=1}^{n} w(X_i) L_i(\lambda) + w(x) B}{\sum_{i=1}^{n} w(X_i) + w(x)} \leq \alpha\Big\}. \qquad (26)$$

When no such $\lambda$ exists, we simply set $\hat\lambda(x) = \max \Lambda$. It is not hard to see that $\hat\lambda(x) \equiv \hat\lambda$ in the absence of covariate shift. We can prove the following result.
Proposition 4.1.
In the setting of Theorem 2.1,

$$\mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n) \sim P_{\mathrm{train}},\ (X_{n+1},Y_{n+1}) \sim P_{\mathrm{test}}}\big[L_{n+1}(\hat\lambda(X_{n+1}))\big] \leq \alpha. \qquad (27)$$
It is easy to show that the weighted conformal procedure [tibshirani2019conformal] is a special case with $L_i(\lambda) = \mathbb{1}\{Y_i \notin \mathcal{C}_\lambda(X_i)\}$, where $\mathcal{C}_\lambda(X_i)$ is the prediction set that thresholds the conformity score at $\lambda$. Thus, Proposition 4.1 generalizes [tibshirani2019conformal] to any monotone risk. When the covariate shift $w(x)$ is unknown but unlabeled data from the test domain are available, it can be estimated, up to a multiplicative factor that does not affect $\hat\lambda(x)$, by any probabilistic classification algorithm; see [lei2020conformal] and [candes2023conformalized] in the contexts of missing and censored data, respectively. We leave the full investigation of weighted conformal risk control with an estimated covariate shift for future research.
Total variation bound

Finally, for arbitrary distribution shifts, we give a total variation bound describing how standard (unweighted) conformal risk control degrades. The bound is analogous to that of [barber2022conformal] for independent but non-identically distributed data (see their Section 4.1), though the proof is different. Here we use the notation $Z_i = (X_i, Y_i)$ and write $\hat\lambda(Z_1, \dots, Z_n)$ for the choice in (4).

Proposition 4.2.
Let $Z = (Z_1, \dots, Z_{n+1})$ be a sequence of random variables. Then, under the conditions of Theorem 2.1,

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \leq \alpha + B \sum_{i=1}^{n} \mathrm{TV}(Z_i, Z_{n+1}).$$

If, further, the assumptions of Theorem 2.2 hold,

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \geq \alpha - B\Big(\frac{2}{n+1} + \sum_{i=1}^{n} \mathrm{TV}(Z_i, Z_{n+1})\Big).$$
4.2 Quantile risk control
[snell2022quantile] generalizes [bates2021distribution] to control the quantile of a monotone loss function, conditional on $(X_i, Y_i)_{i=1}^n$, with probability $1 - \delta$ over the calibration dataset, for any user-specified tolerance parameter $\delta$. In some applications, it may be sufficient to control the unconditional quantile of the loss function, which relieves the user of the burden of choosing the tolerance parameter $\delta$.

For any random variable $Z$, let

$$\mathrm{Quantile}_\beta(Z) = \inf\{x : \mathbb{P}(Z \leq x) \geq \beta\}.$$

Analogous to (3), we want to find $\hat\lambda_\beta$ based on $(X_i, Y_i)_{i=1}^n$ such that

$$\mathrm{Quantile}_\beta\big(L_{n+1}(\hat\lambda_\beta)\big) \leq \alpha. \qquad (28)$$

By definition,

$$\mathrm{Quantile}_\beta\big(L_{n+1}(\hat\lambda_\beta)\big) \leq \alpha \iff \mathbb{E}\big[\mathbb{1}\{L_{n+1}(\hat\lambda_\beta) > \alpha\}\big] \leq 1 - \beta.$$

As a consequence, quantile risk control is equivalent to expected risk control (3) with the loss function $\tilde{L}_i(\lambda) = \mathbb{1}\{L_i(\lambda) > \alpha\}$. Let

$$\hat\lambda_\beta = \inf\Big\{\lambda \in \Lambda : \frac{1}{n+1}\sum_{i=1}^{n} \mathbb{1}\{L_i(\lambda) > \alpha\} + \frac{1}{n+1} \leq 1 - \beta\Big\}.$$

Proposition 4.3.
In the setting of Theorem 2.1, (28) is achieved.
[snell2022quantile] considers the high-probability control of a wider class of quantile-based risks, which includes the conditional value-at-risk (CVaR). It is unclear whether those more general risks can be controlled unconditionally. We leave this open problem for future research.
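Computing $\hat\lambda_\beta$ is the same grid search as before, applied to the binarized losses $\mathbb{1}\{L_i(\lambda) > \alpha\}$, for which the bound is $B = 1$. A sketch, with names of our choosing:

```python
import numpy as np

def lambda_hat_beta(loss_matrix, lambdas, alpha, beta):
    """Quantile risk control: conformal risk control at level 1 - beta on the
    indicator losses 1{L_i(lambda) > alpha} (so B = 1).

    loss_matrix[i, j] = L_i(lambdas[j]), with lambdas increasing."""
    n = loss_matrix.shape[0]
    exceed = (loss_matrix > alpha).mean(axis=0)          # fraction above alpha
    ok = (n / (n + 1)) * exceed + 1 / (n + 1) <= 1 - beta
    return lambdas[np.argmax(ok)] if ok.any() else lambdas[-1]
```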
4.3 Controlling multiple risks
Let ๐ฟ ๐ โข ( ๐ ; ๐พ ) be a family of loss functions indexed by ๐พ โ ฮ for some domain ฮ that may have infinitely many elements. A researcher may want to control ๐ผ โข [ ๐ฟ ๐ โข ( ๐ ; ๐พ ) ] at level ๐ผ โข ( ๐พ ) . Equivalently, we need to find an ๐ ^ based on ( ๐ ๐ , ๐ ๐ ) ๐
1 ๐ such that
sup ๐พ โ ฮ ๐ผ โข [ ๐ฟ ๐ โข ( ๐ ^ ; ๐พ ) ๐ผ โข ( ๐พ ) ] โค 1 .
(29)
Though the above worst-case risk is not an expectation, it can still be controlled. Towards this end, we define
$$\hat{\lambda} = \sup_{\gamma \in \Gamma} \hat{\lambda}_{\gamma}, \quad \text{where } \hat{\lambda}_{\gamma} = \inf\left\{\lambda : \frac{1}{n+1}\sum_{i=1}^{n} L_i(\lambda; \gamma) + \frac{B}{n+1} \le \alpha(\gamma)\right\}. \quad (30)$$
Then the risk is controlled.
Proposition 4.4.
In the setting of Theorem 2.1, (29) is satisfied.
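The construction in (30) can be sketched for a finite index set $\Gamma$: compute one threshold per $\gamma$ and take the most conservative. The data layout below is an illustrative assumption (a loss tensor on a grid), not an interface from the paper.

```python
# Sketch of (30) for a finite index set Gamma (illustrative): losses[g][i][k]
# stores L_i(lambda_grid[k]; gamma_g), non-increasing in lambda, bounded by B.

def lambda_hat_gamma(losses_g, lambda_grid, alpha_g, B):
    # inf{lambda : (1/(n+1)) sum_i L_i(lambda; gamma) + B/(n+1) <= alpha(gamma)}
    n = len(losses_g)
    for k, lam in enumerate(lambda_grid):
        if (sum(row[k] for row in losses_g) + B) / (n + 1) <= alpha_g:
            return lam
    return lambda_grid[-1]

def multi_risk_lambda_hat(losses, lambda_grid, alphas, B):
    # lambda-hat = sup over gamma of lambda-hat_gamma: the most conservative threshold
    return max(lambda_hat_gamma(losses[g], lambda_grid, alphas[g], B)
               for g in range(len(losses)))

# Toy usage: two risks with different target levels.
grid = [0.0, 1.0]
losses = [[[0.6, 0.0], [0.4, 0.0]],   # gamma_0, two calibration points
          [[0.2, 0.0], [0.0, 0.0]]]   # gamma_1
lam = multi_risk_lambda_hat(losses, grid, alphas=[0.5, 0.9], B=1.0)
```

Taking the supremum over $\gamma$ is what makes a single $\hat{\lambda}$ valid simultaneously for every risk in the family.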
4.4 Adversarial risks
We next show how to control risks defined by adversarial perturbations. We adopt the same notation as Section 4.3. [bates2021distribution] (Section 6.3) discusses the adversarial risk where $\Gamma$ parametrizes a class of perturbations of $X_{n+1}$, e.g., $L_i(\lambda; \gamma) = L(X_i + \gamma, Y_i)$ and $\Gamma = \{\gamma : \|\gamma\|_{\infty} \le \epsilon\}$. A researcher may want to find a $\hat{\lambda}$ based on $(X_i, Y_i)_{i=1}^{n}$ such that
$$\mathbb{E}\left[\sup_{\gamma \in \Gamma} L_{n+1}(\hat{\lambda}; \gamma)\right] \le \alpha. \quad (31)$$
This can be recast as a conformal risk control problem by taking $\tilde{L}_i(\lambda) = \sup_{\gamma \in \Gamma} L_i(\lambda; \gamma)$. Then, the following choice of $\lambda$ leads to risk control:
$$\hat{\lambda} = \inf\left\{\lambda : \frac{1}{n+1}\sum_{i=1}^{n} \tilde{L}_i(\lambda) + \frac{B}{n+1} \le \alpha\right\}. \quad (32)$$
Proposition 4.5.
In the setting of Theorem 2.1, (31) is satisfied.
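The reduction behind (32) can be sketched by replacing each calibration loss with its worst case over a finite set of candidate perturbations; the toy 0-1 loss, data, and perturbation set below are illustrative assumptions standing in for $\Gamma$.

```python
# Sketch of the reduction (32): replace each calibration loss by its worst case
# over a finite set of candidate perturbations (a stand-in for the ball Gamma;
# the toy loss and data below are illustrative, not from the paper).

def adversarial_lambda_hat(loss_fn, xs, ys, perturbations, lambda_grid, alpha, B):
    # loss_fn(x, y, lam) is assumed bounded by B and non-increasing in lam
    n = len(xs)
    for lam in lambda_grid:
        # L~_i(lam) = max over gamma of L(x_i + gamma, y_i; lam)
        worst = [max(loss_fn(x + g, y, lam) for g in perturbations)
                 for x, y in zip(xs, ys)]
        if (sum(worst) + B) / (n + 1) <= alpha:
            return lam
    return lambda_grid[-1]

# Toy usage: 0-1 loss on a residual threshold, l-infinity style perturbations.
loss = lambda x, y, lam: 1.0 if abs(x - y) > lam else 0.0
lam = adversarial_lambda_hat(loss, xs=[0.0, 0.25], ys=[0.0, 0.0],
                             perturbations=[-0.25, 0.0, 0.25],
                             lambda_grid=[0.25, 0.5, 1.0], alpha=0.5, B=1.0)
```

Once the supremum is folded into $\tilde{L}_i$, the procedure is identical to ordinary conformal risk control.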
4.5 U-risk control
For ranking and metric learning, [bates2021distribution] considered loss functions that depend on two test points. In general, for any $m > 1$ and subset $\mathcal{S} \subset \{1, \ldots, n+m\}$ with $|\mathcal{S}| = m$, let $L_{\mathcal{S}}(\lambda)$ be a loss function. Our goal is to find $\hat{\lambda}_m$ based on $(X_i, Y_i)_{i=1}^{n}$ such that
$$\mathbb{E}\big[L_{\{n+1, \ldots, n+m\}}(\hat{\lambda}_m)\big] \le \alpha. \quad (33)$$
We call the LHS a U-risk since, for any fixed $\hat{\lambda}_m$, it is the expectation of an order-$m$ U-statistic. As a natural extension, we can define
$$\hat{\lambda}_m = \inf\left\{\lambda : \frac{m!\,n!}{(n+m)!}\sum_{\mathcal{S} \subset \{1,\ldots,n\},\, |\mathcal{S}| = m} L_{\mathcal{S}}(\lambda) + B\left(1 - \frac{(n!)^2}{(n+m)!\,(n-m)!}\right) \le \alpha\right\}. \quad (34)$$
Again, we define $\hat{\lambda}_m = \lambda_{\max}$ when the right-hand side is an empty set. Then we can prove the following result.
Proposition 4.6.
Assume that $L_{\mathcal{S}}(\lambda)$ is non-increasing in $\lambda$, right-continuous, and
$$L_{\mathcal{S}}(\lambda_{\max}) \le \alpha, \quad \sup_{\lambda} L_{\mathcal{S}}(\lambda) \le B < \infty \text{ almost surely}.$$
Then (33) is achieved.
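The combinatorial weight in (34) is $m!\,n!/(n+m)! = 1/\binom{n+m}{m}$, and the slack coefficient is $\binom{n}{m}/\binom{n+m}{m}$ away from one, which the sketch below computes directly. The `loss_of` interface and toy data are illustrative assumptions.

```python
from itertools import combinations
from math import comb

# Sketch of (34). Illustrative interface: loss_of(S, lam) returns L_S(lam) for a
# size-m index tuple S, assumed to lie in [0, B] and be non-increasing in lam.

def u_risk_lambda_hat(loss_of, n, m, lambda_grid, alpha, B):
    weight = 1 / comb(n + m, m)                    # equals m! n! / (n+m)!
    slack = B * (1 - comb(n, m) / comb(n + m, m))  # equals B(1 - (n!)^2/((n+m)!(n-m)!))
    for lam in lambda_grid:
        avg = weight * sum(loss_of(S, lam) for S in combinations(range(n), m))
        if avg + slack <= alpha:
            return lam
    return lambda_grid[-1]  # lambda_max when the constraint set is empty

# Toy usage: every pairwise loss drops to 0 once lambda reaches 0.5.
loss_of = lambda S, lam: 0.0 if lam >= 0.5 else 1.0
lam = u_risk_lambda_hat(loss_of, n=3, m=2, lambda_grid=[0.0, 0.5, 1.0],
                        alpha=0.8, B=1.0)
```

Note the slack term can be large when $m$ is not small relative to $n$, since most size-$m$ subsets of $\{1,\ldots,n+m\}$ then touch the test indices.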
5 Conclusion
This generalization of conformal prediction broadens its scope to new applications, as shown in Section 3. The mathematical tools developed in Section 2, Section 4, and the Appendix may be of independent technical interest, since they provide a new and more general language for studying conformal prediction along with new results about its validity.
Acknowledgements
The authors thank Christopher Yeh for pointing out that the factor of 2 in Theorem 2.2 may not be tight, correcting an error in an earlier draft. The incorrect proposition stating that it was tight has been removed; the tightness of this factor remains an open question. The authors would like to thank Nicolas Christianson, Amit Kohli, Sherrie Wang, and Tijana Zrnić for comments on early drafts. A. A. would like to thank Ziheng (Tony) Wang for helpful conversations. A. A. is funded by the NSF GRFP and a Berkeley Fellowship. S. B. is supported by the NSF FODSI fellowship and the Simons Institute. A. F. is partially funded by the NSF GRFP and MIT MLPDS.
Appendix A Monotonizing non-monotone risks
We next show that the proposed algorithm leads to asymptotic risk control for non-monotone risk functions when applied to a monotonized version of the empirical risk. We set the monotonized empirical risk to be
$$\hat{R}_n^*(\lambda) = \sup_{t \ge \lambda} \hat{R}_n(t), \quad (35)$$
then define
$$\hat{\lambda}_n^* = \inf\{\lambda : \hat{R}_n^*(\lambda) \le \alpha\}. \quad (36)$$
Theorem A.1.
Let the $L_i(\lambda)$ be right-continuous, i.i.d., bounded (both above and below) functions satisfying (6). Then,
$$\lim_{n \to \infty} \mathbb{E}\big[L_{n+1}(\hat{\lambda}_n^*)\big] \le \alpha. \quad (37)$$
Theorem A.1 implies that a procedure analogous to (4) also controls the risk asymptotically. In particular, taking
$$\tilde{\lambda}^* = \inf\left\{\lambda : \hat{R}_n^*(\lambda) + \frac{B}{n+1} \le \alpha\right\} \quad (38)$$
also results in asymptotic risk control (to see this, plug $\tilde{\lambda}^*$ into Theorem A.1 and see that the risk level is bounded above by $\alpha - \frac{B}{n+1}$). Note that in the case of a monotone loss function, $\tilde{\lambda}^* = \hat{\lambda}$. However, the counterexample in Proposition 2.1 does not apply to $\tilde{\lambda}^*$, and it is currently unknown whether this procedure provides finite-sample risk control.
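On a finite grid, the monotonization in (35) is a suffix supremum (a reverse running maximum), after which (36) and (38) are the same threshold search as before. The sketch below uses an illustrative precomputed vector of empirical risks.

```python
# Sketch of (35)-(36) and (38) on a finite grid (illustrative inputs): the
# monotonized empirical risk is a suffix supremum, i.e. a reverse running max.

def monotonized_lambda(r_hat, lambda_grid, alpha, n, B=None):
    # r_hat[k] = R-hat_n(lambda_grid[k]); grid sorted ascending
    r_star = [0.0] * len(r_hat)
    running = float("-inf")
    for k in range(len(r_hat) - 1, -1, -1):  # sup over t >= lambda
        running = max(running, r_hat[k])
        r_star[k] = running
    slack = 0.0 if B is None else B / (n + 1)  # with B given, this is variant (38)
    for k, lam in enumerate(lambda_grid):
        if r_star[k] + slack <= alpha:
            return lam
    return lambda_grid[-1]

# Toy usage with a non-monotone empirical risk.
grid = [0.0, 1.0, 2.0, 3.0]
r_hat = [0.9, 0.2, 0.4, 0.1]
lam_star = monotonized_lambda(r_hat, grid, alpha=0.45, n=10)          # (36)
lam_tilde = monotonized_lambda(r_hat, grid, alpha=0.45, n=10, B=1.0)  # (38)
```

The monotonized risk dominates the raw one pointwise, so these thresholds are never smaller than a naive search over $\hat{R}_n$ would give.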
Appendix B Proofs
The proof of Theorem 2.2 uses the following lemma on the approximate continuity of the empirical risk.
Lemma 1 (Jump Lemma).
In the setting of Theorem 2.2, any jumps in the empirical risk are bounded, i.e.,
$$\sup_{\lambda} J(\hat{R}_n, \lambda) \overset{a.s.}{\le} \frac{B}{n}. \quad (39)$$
Proof of Jump Lemma, Lemma 1.
By boundedness, the maximum contribution of any single point to the jump is $B/n$, so
$$\exists \lambda : J(\hat{R}_n, \lambda) > \frac{B}{n} \implies \exists \lambda : J(L_i, \lambda) > 0 \text{ and } J(L_j, \lambda) > 0 \text{ for some } i \ne j. \quad (40)$$
Call $D_i = \{\lambda : J(L_i, \lambda) > 0\}$ the set of discontinuities in $L_i$. Since $L_i$ is bounded and monotone, $D_i$ has countably many points. The union bound then implies that
$$\mathbb{P}\left(\exists \lambda : J(\hat{R}_n, \lambda) > \frac{B}{n}\right) \le \sum_{i \ne j} \mathbb{P}(D_i \cap D_j \ne \emptyset). \quad (41)$$
Rewriting each term of the right-hand side using the tower property and the law of total probability gives
$$\mathbb{P}(D_i \cap D_j \ne \emptyset) = \mathbb{E}\big[\mathbb{P}(D_i \cap D_j \ne \emptyset \mid D_i)\big] \quad (42)$$
$$\le \mathbb{E}\left[\sum_{\lambda \in D_i} \mathbb{P}(\lambda \in D_j \mid D_i)\right] = \mathbb{E}\left[\sum_{\lambda \in D_i} \mathbb{P}(\lambda \in D_j)\right], \quad (43)$$
where the inequality is a union bound over the events $\{\lambda \in D_j\}$, which cover $\{D_i \cap D_j \ne \emptyset\}$ but need not be disjoint, and the final equality is due to the independence between $L_i$ and $L_j$. Rewriting in terms of the jump function and applying the assumption $\mathbb{P}(J(L_j, \lambda) > 0) = 0$,
$$\mathbb{E}\left[\sum_{\lambda \in D_i} \mathbb{P}(\lambda \in D_j)\right] = \mathbb{E}\left[\sum_{\lambda \in D_i} \mathbb{P}(J(L_j, \lambda) > 0)\right] = 0. \quad (44)$$
Chaining the above inequalities yields $\mathbb{P}(\exists \lambda : J(\hat{R}_n, \lambda) > B/n) \le 0$, so
$$\mathbb{P}\left(\exists \lambda : J(\hat{R}_n, \lambda) > \frac{B}{n}\right) = 0. \qquad \blacksquare$$
Proof of Theorem 2.2.
If $L_i(\lambda_{\max}) \ge \alpha - 2B/(n+1)$, then $\mathbb{E}[L_{n+1}(\hat{\lambda})] \ge \alpha - 2B/(n+1)$. Throughout the rest of the proof, we assume that $L_i(\lambda_{\max}) < \alpha - 2B/(n+1)$. Define the quantity
$$\hat{\lambda}'' = \inf\left\{\lambda : \hat{R}_{n+1}(\lambda) + \frac{B}{n+1} \le \alpha\right\}. \quad (45)$$
Since $L_i(\lambda_{\max}) < \alpha - 2B/(n+1) < \alpha - B/(n+1)$, $\hat{\lambda}''$ exists almost surely. Deterministically, $\frac{n}{n+1}\hat{R}_n(\lambda) \le \hat{R}_{n+1}(\lambda)$, which yields $\hat{\lambda} \le \hat{\lambda}''$. Again since $L_i(\lambda)$ is non-increasing in $\lambda$,
$$\mathbb{E}\big[L_{n+1}(\hat{\lambda}'')\big] \le \mathbb{E}\big[L_{n+1}(\hat{\lambda})\big]. \quad (46)$$
By exchangeability and the fact that $\hat{\lambda}''$ is a symmetric function of $L_1, \ldots, L_{n+1}$,
$$\mathbb{E}\big[L_{n+1}(\hat{\lambda}'')\big] = \mathbb{E}\big[\hat{R}_{n+1}(\hat{\lambda}'')\big]. \quad (47)$$
For the remainder of the proof we focus on lower-bounding $\hat{R}_{n+1}(\hat{\lambda}'')$. We begin with the following identity:
$$\alpha = \hat{R}_{n+1}(\hat{\lambda}'') + \frac{B}{n+1} - \left(\hat{R}_{n+1}(\hat{\lambda}'') + \frac{B}{n+1} - \alpha\right). \quad (48)$$
Rearranging the identity,
$$\hat{R}_{n+1}(\hat{\lambda}'') = \alpha - \frac{B}{n+1} + \left(\hat{R}_{n+1}(\hat{\lambda}'') + \frac{B}{n+1} - \alpha\right). \quad (49)$$
Using the Jump Lemma to bound $\big(\hat{R}_{n+1}(\hat{\lambda}'') + \frac{B}{n+1} - \alpha\big)$ below by $-\frac{B}{n+1}$ gives
$$\hat{R}_{n+1}(\hat{\lambda}'') \ge \alpha - \frac{2B}{n+1}. \quad (50)$$
Finally, chaining together the above inequalities,
$$\mathbb{E}\big[L_{n+1}(\hat{\lambda})\big] \ge \mathbb{E}\big[\hat{R}_{n+1}(\hat{\lambda}'')\big] \ge \alpha - \frac{2B}{n+1}. \quad (51)$$
$\blacksquare$
Proof of Proposition 2.1.
Without loss of generality, we assume $B = 1$. Assume $\hat{\lambda}$ takes values in $[0, 1]$ and $\alpha \in (1/(n+1), 1)$. Let $\epsilon \in (0, 1)$, $m$ be any positive integer, and $L_i(\lambda)$ be i.i.d. right-continuous piecewise constant (random) functions with
$$L_i(m/m) = 0, \quad \big(L_i(0/m), L_i(1/m), \ldots, L_i((m-1)/m)\big) \overset{i.i.d.}{\sim} \mathrm{Ber}(\epsilon). \quad (52)$$
By definition, $\hat{\lambda}$ is independent of $L_{n+1}$. Thus, for any $j = 0, 1, \ldots, m-1$,
$$\big\{L_{n+1}(\hat{\lambda}) \mid \hat{\lambda} = j/m\big\} \sim \mathrm{Ber}(\epsilon), \quad \big\{L_{n+1}(\hat{\lambda}) \mid \hat{\lambda} = 1\big\} \sim \delta_0. \quad (53)$$
Then,
$$\mathbb{E}\big[L_{n+1}(\hat{\lambda})\big] = \epsilon \, \mathbb{P}(\hat{\lambda} \ne 1). \quad (54)$$
Note that
$$\hat{\lambda} \ne 1 \iff \min_{j \in \{0, \ldots, m-1\}} \frac{1}{n+1}\sum_{i=1}^{n} L_i(j/m) \le \alpha - \frac{1}{n+1}. \quad (55)$$
Since $\alpha > 1/(n+1)$,
$$\mathbb{P}(\hat{\lambda} \ne 1) = 1 - \mathbb{P}(\hat{\lambda} = 1) = 1 - \mathbb{P}\left(\text{for all } j, \text{ we have } \frac{1}{n+1}\sum_{i=1}^{n} L_i(j/m) > \alpha - \frac{1}{n+1}\right)$$
$$= 1 - \left(\sum_{k = \lceil (n+1)\alpha \rceil}^{n} \binom{n}{k} \epsilon^k (1-\epsilon)^{n-k}\right)^m = 1 - \big(1 - \mathrm{BinoCDF}(n, \epsilon, \lceil (n+1)\alpha \rceil - 1)\big)^m.$$
As a result,
$$\mathbb{E}\big[L_{n+1}(\hat{\lambda})\big] = \epsilon\left(1 - \big(1 - \mathrm{BinoCDF}(n, \epsilon, \lceil (n+1)\alpha \rceil - 1)\big)^m\right). \quad (56)$$
Now let $m$ be sufficiently large such that
$$1 - \big(1 - \mathrm{BinoCDF}(n, \epsilon, \lceil (n+1)\alpha \rceil - 1)\big)^m \ge \epsilon. \quad (57)$$
Then
$$\mathbb{E}\big[L_{n+1}(\hat{\lambda})\big] \ge \epsilon^2. \quad (58)$$
Since $\epsilon^2$ can be made arbitrarily close to 1, for any $\alpha < 1$ we can take $\epsilon$ close enough to 1 to render the claim false. $\blacksquare$
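As a numeric trace of (56)-(58), assuming $\mathrm{BinoCDF}(n, \epsilon, k) = \mathbb{P}(\mathrm{Binomial}(n, \epsilon) \le k)$, the following stdlib sketch picks $m$ so that (57) holds (the specific $n$, $\alpha$, $\epsilon$ are illustrative; $m$ comes out astronomically large, so the power is evaluated in log space):

```python
import math

# Numeric trace of (56)-(58); BinoCDF(n, eps, k) = P(Binomial(n, eps) <= k).
def bino_cdf(n, p, k):
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

n, alpha, eps = 100, 0.1, 0.9       # illustrative choices with alpha in (1/(n+1), 1)
p_low = bino_cdf(n, eps, math.ceil((n + 1) * alpha) - 1)
# Choose m large enough that (57) holds, i.e. (1 - p_low)^m <= 1 - eps.
m = math.ceil(math.log(1 - eps) / math.log1p(-p_low))
risk = eps * (1 - math.exp(m * math.log1p(-p_low)))  # (56), evaluated in log space
# risk is approximately eps^2 = 0.81, far above the nominal level alpha = 0.1.
```

This makes concrete why the counterexample needs a very fine grid: $p_{\text{low}}$ is tiny when $\epsilon$ is near 1, so $m$ must be enormous before some grid point looks feasible, but once it does the realized risk is near $\epsilon^2$.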
Proof of Theorem A.1.
Define the monotonized population risk as
$$R^*(\lambda) = \sup_{t \ge \lambda} \mathbb{E}\big[L_{n+1}(t)\big]. \quad (59)$$
Note that the independence of $L_{n+1}$ and $\hat{\lambda}_n^*$ implies that for all $n$,
$$\mathbb{E}\big[L_{n+1}(\hat{\lambda}_n^*)\big] \le \mathbb{E}\big[R^*(\hat{\lambda}_n^*)\big]. \quad (60)$$
Since $R$ is bounded, monotone, and one-dimensional, a generalization of the Glivenko-Cantelli theorem given in Theorem 1 of [dehardt1971generalizations] gives that, uniformly over $\lambda$,
$$\sup_{\lambda} \big|\hat{R}_n(\lambda) - R(\lambda)\big| \overset{a.s.}{\to} 0 \text{ as } n \to \infty. \quad (61)$$
As a result,
$$\sup_{\lambda} \big|\hat{R}_n^*(\lambda) - R^*(\lambda)\big| \overset{a.s.}{\to} 0, \quad (62)$$
which implies that
$$\big|\hat{R}_n^*(\hat{\lambda}_n^*) - R^*(\hat{\lambda}_n^*)\big| \overset{a.s.}{\to} 0. \quad (63)$$
By definition, $\hat{R}_n^*(\hat{\lambda}_n^*) \le \alpha$ almost surely, and thus this directly implies
$$\limsup_{n \to \infty} R^*(\hat{\lambda}_n^*) \le \alpha \quad \text{a.s.} \quad (64)$$
Finally, since for all $n$, $R^*(\hat{\lambda}_n^*) \le B$, by Fatou's lemma,
$$\lim_{n \to \infty} \mathbb{E}\big[L_{n+1}(\hat{\lambda}_n^*)\big] \le \limsup_{n \to \infty} \mathbb{E}\big[R^*(\hat{\lambda}_n^*)\big] \le \mathbb{E}\left[\limsup_{n \to \infty} R^*(\hat{\lambda}_n^*)\right] \le \alpha. \quad (65)$$
$\blacksquare$
Proof of Proposition 4.1.
Let
$$\hat{\lambda}' = \inf\left\{\lambda : \frac{\sum_{i=1}^{n+1} w(X_i) L_i(\lambda)}{\sum_{i=1}^{n+1} w(X_i)} \le \alpha\right\}. \quad (66)$$
Since $\inf_{\lambda} L_i(\lambda) \le \alpha$, $\hat{\lambda}'$ exists almost surely. Using the same argument as in the proof of Theorem 2.1, we can show that $\hat{\lambda}' \le \hat{\lambda}(X_{n+1})$. Since $L_{n+1}(\lambda)$ is non-increasing in $\lambda$,
$$\mathbb{E}\big[L_{n+1}(\hat{\lambda}(X_{n+1}))\big] \le \mathbb{E}\big[L_{n+1}(\hat{\lambda}')\big].$$
Let $E$ be the multiset $\{(X_1, Y_1), \ldots, (X_{n+1}, Y_{n+1})\}$. Then $\hat{\lambda}'$ is a function of $E$, or, equivalently, $\hat{\lambda}'$ is a constant conditional on $E$. Lemma 3 of [tibshirani2019conformal] implies that
$$(X_{n+1}, Y_{n+1}) \mid E \sim \frac{\sum_{i=1}^{n+1} w(X_i)\,\delta_{(X_i, Y_i)}}{\sum_{i=1}^{n+1} w(X_i)} \implies L_{n+1} \mid E \sim \frac{\sum_{i=1}^{n+1} w(X_i)\,\delta_{L_i}}{\sum_{i=1}^{n+1} w(X_i)},$$
where $\delta_z$ denotes the point mass at $z$. Together with the right-continuity of $L_i$, the above result implies
$$\mathbb{E}\big[L_{n+1}(\hat{\lambda}') \mid E\big] = \frac{\sum_{i=1}^{n+1} w(X_i) L_i(\hat{\lambda}')}{\sum_{i=1}^{n+1} w(X_i)} \le \alpha. \quad (67)$$
The proof is then completed by the law of total expectation. $\blacksquare$
Proof of Proposition 4.2.
Define the vector $Z' = (Z_1', \ldots, Z_n', Z_{n+1})$, where $Z_i'$ is drawn i.i.d. from the distribution of $Z_{n+1}$ for all $i \in [n]$. Let
$$\epsilon = \sum_{i=1}^{n} \mathrm{TV}(Z_i, Z_i').$$
By sublinearity,
$$\mathrm{TV}(Z, Z') \le \epsilon. \quad (68)$$
It is a standard fact that (68) implies
$$\sup_{f \in \mathcal{F}} \big|\mathbb{E}[f(Z)] - \mathbb{E}[f(Z')]\big| \le \epsilon, \quad (69)$$
where $\mathcal{F} = \{f : \mathcal{Z}^{n+1} \mapsto [0, 1]\}$. Let $\ell : \mathcal{Z} \times \Lambda \to [0, B]$ be a bounded loss function. Furthermore, let $f(z) = \ell\big(z_{n+1}; \hat{\lambda}(z_1, \ldots, z_n)\big)$. Since $f(Z) \in [0, B]$,
$$\big|\mathbb{E}[f(Z)] - \mathbb{E}[f(Z')]\big| \le B\epsilon. \quad (70)$$
Furthermore, since $Z_1', \ldots, Z_n', Z_{n+1}$ are exchangeable, we can apply Theorems 2.1 and 2.2 to $\mathbb{E}[f(Z')]$, recovering
$$\alpha - \frac{2B}{n+1} \le \mathbb{E}[f(Z')] \le \alpha. \quad (71)$$
A final application of the triangle inequality implies the result:
$$\alpha - \frac{2B}{n+1} - B\epsilon \le \mathbb{E}[f(Z)] \le \alpha + B\epsilon. \quad (72)$$
$\blacksquare$
Proof of Proposition 4.3.
It is left to prove that $\tilde{L}_i(\lambda)$ satisfies the conditions of Theorem 2.1. It is clear that $\tilde{L}_i(\lambda) \le 1$ and that $\tilde{L}_i(\lambda)$ is non-increasing in $\lambda$ when $L_i(\lambda)$ is. Since $L_i(\lambda)$ is non-increasing and right-continuous, for any sequence $\lambda_k \downarrow \lambda$,
$$L_i(\lambda_k) \to L_i(\lambda) \implies \mathbf{1}\{L_i(\lambda_k) > \alpha\} \to \mathbf{1}\{L_i(\lambda) > \alpha\}.$$
Thus, $\tilde{L}_i(\lambda)$ is right-continuous. Finally, $L_i(\lambda_{\max}) \le \alpha$ implies $\tilde{L}_i(\lambda_{\max}) = 0 \le 1 - \beta$. $\blacksquare$
Proof of Proposition 4.4.
Examining (30), for each $\gamma \in \Gamma$, since $\hat{\lambda} \ge \hat{\lambda}_{\gamma}$ and the loss is non-increasing in $\lambda$, we have
$$\mathbb{E}\big[L(\hat{\lambda}, \gamma)\big] \le \mathbb{E}\big[L(\hat{\lambda}_{\gamma}, \gamma)\big] \le \alpha(\gamma). \quad (73)$$
Thus, dividing both sides by $\alpha(\gamma)$ and taking the supremum, we get that $\sup_{\gamma \in \Gamma} \frac{\mathbb{E}[L(\hat{\lambda}, \gamma)]}{\alpha(\gamma)} \le 1$, and the worst-case risk is controlled. $\blacksquare$
Proof of Proposition 4.5.
Because $L_i(\lambda; \gamma)$ is bounded and monotone in $\lambda$ for all choices of $\gamma$, it is also true that $\tilde{L}_i(\lambda)$ is bounded and monotone. Furthermore, the pointwise supremum of right-continuous functions is also right-continuous. Therefore, the $\tilde{L}_i$ satisfy the assumptions of Theorem 2.1. $\blacksquare$
Proof of Proposition 4.6.
Let
$$\hat{\lambda}_m' = \inf\left\{\lambda : \frac{m!\,n!}{(n+m)!}\sum_{\mathcal{S} \subset \{1,\ldots,n+m\},\, |\mathcal{S}| = m} L_{\mathcal{S}}(\lambda) \le \alpha\right\}.$$
Since $L_{\mathcal{S}}(\lambda_{\max}) \le \alpha$, $\hat{\lambda}_m'$ exists almost surely. Since $L_{\mathcal{S}}(\lambda) \le B$, we have
$$\frac{m!\,n!}{(n+m)!}\sum_{\mathcal{S} \subset \{1,\ldots,n+m\},\, |\mathcal{S}| = m} L_{\mathcal{S}}(\lambda) \le \frac{m!\,n!}{(n+m)!}\sum_{\mathcal{S} \subset \{1,\ldots,n\},\, |\mathcal{S}| = m} L_{\mathcal{S}}(\lambda) + B \cdot \frac{m!\,n!}{(n+m)!}\sum_{\mathcal{S} \cap \{n+1,\ldots,n+m\} \ne \emptyset,\, |\mathcal{S}| = m} 1$$
$$= \frac{m!\,n!}{(n+m)!}\sum_{\mathcal{S} \subset \{1,\ldots,n\},\, |\mathcal{S}| = m} L_{\mathcal{S}}(\lambda) + B\left(1 - \frac{m!\,n!}{(n+m)!}\sum_{\mathcal{S} \subset \{1,\ldots,n\},\, |\mathcal{S}| = m} 1\right)$$
$$= \frac{m!\,n!}{(n+m)!}\sum_{\mathcal{S} \subset \{1,\ldots,n\},\, |\mathcal{S}| = m} L_{\mathcal{S}}(\lambda) + B\left(1 - \frac{(n!)^2}{(n+m)!\,(n-m)!}\right).$$
Since $L_{\mathcal{S}}(\lambda)$ is non-increasing in $\lambda$, we conclude that $\hat{\lambda}_m' \le \hat{\lambda}_m$ if the right-hand side of (34) is not empty; otherwise, by definition, $\hat{\lambda}_m' \le \lambda_{\max} = \hat{\lambda}_m$. Thus, $\hat{\lambda}_m' \le \hat{\lambda}_m$ almost surely. Let $E$ be the multiset of loss functions $\{L_{\mathcal{S}} : \mathcal{S} \subset \{1,\ldots,n+m\}, |\mathcal{S}| = m\}$. Using the same argument as at the end of the proof of Theorem 2.1 and the right-continuity of $L_{\mathcal{S}}$, we can show that
$$\mathbb{E}\big[L_{\{n+1,\ldots,n+m\}}(\hat{\lambda}_m') \mid E\big] = \frac{m!\,n!}{(n+m)!}\sum_{\mathcal{S} \subset \{1,\ldots,n+m\},\, |\mathcal{S}| = m} L_{\mathcal{S}}(\hat{\lambda}_m') \le \alpha.$$
The proof is then completed by the law of iterated expectation. $\blacksquare$