Title: Realizable Learning is All You Need
URL Source: https://arxiv.org/html/2111.04746
License: CC BY 4.0 | arXiv:2111.04746v4 [cs.LG] 03 Feb 2024
Max Hopkins (nmhopkin@ucsd.edu) and Shachar Lovett (slovett@cs.ucsd.edu), Department of Computer Science and Engineering, UCSD, California, CA 92092; Daniel M. Kane (dakane@eng.ucsd.edu), Department of Computer Science and Engineering and Department of Mathematics, UCSD, California, CA 92092; Gaurav Mahajan (gaurav.mahajan@yale.edu), Department of Computer Science, Yale University, Connecticut, CT 06511.

Received Sep 28, 2022; revised Nov 21, 2023; accepted Dec 11, 2023; published Feb 3, 2024. DOI: 10.46298/theoretics.24.2. Keywords: PAC-Learning, Realizable Learning, Agnostic Learning, Blackbox Reductions. Invited article from COLT 2022.
Realizable Learning is All You Need

Abstract
The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression.
In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions and more general loss functions, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model.
More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.
1 Introduction

The equivalence of realizable and agnostic learnability in Valiant's Probably Approximately Correct (PAC) model [58] is one of the best known results in learning theory, and numbers among its most surprising. Given a set $X$ and a family of binary classifiers $H$, the result states that the ability to learn a classifier $h \in H$ from examples of the form $(x, h(x))$ is in fact sufficient for something much stronger: given samples from any distribution $D$ over $X \times \{0,1\}$, it is possible to learn the best approximation to $D$ in $H$. This surprising equivalence stems from a classical result of Vapnik and Chervonenkis (VC) [60], and independently Blumer, Ehrenfeucht, Haussler, and Warmuth (BEHW) [19] and Haussler [36], who equate both the former model (known as realizable learning) and the latter model (known as agnostic learning) to a strong property of pairs $(X, H)$ called uniform convergence.1

VC, BEHW, and Haussler's result was certainly a breakthrough in its own right, but its proof technique is too indirect to reveal any deeper connections between realizable and agnostic learning beyond the PAC setting. Further, recent years have seen both theory and practice shift away not only from this original formalization, but more generally from the "uniform convergence equals learnability" paradigm, often in favor of distributional or data-dependent assumptions like margin that are more applicable to the real world. The inability of VC, BEHW, and Haussler's proof technique to generalize to such scenarios raises a fundamental question: is the equivalence of realizable and agnostic learning a fundamental property of learnability, or simply a happy coincidence derived from the original PAC framework?

In the 30 years since these works, a mountain of evidence has amassed in favor of the former: almost every reasonable variant of learning shares some sort of similar equivalence. This includes a long list of popular settings such as regression [9], distribution-dependent learning [17], multi-class learning [29], robust learning [52], online learning [16], private learning [13, 3], and partial learning [47, 5]. What's more, the uniform convergence paradigm fails miserably in most of these models. In the distribution-dependent model, for instance, it is easy to build classes which are trivially learnable (even with one sample!) but completely fail to satisfy uniform convergence [17]. On the other hand, models such as private learning give well-known examples where uniform convergence fails to imply learnability [6]. In spite of this, we are really no closer today to a general understanding of this phenomenon than we were in the early 90s. Much like the proofs of Vapnik and Chervonenkis [60], Blumer, Ehrenfeucht, Haussler, and Warmuth [19], and Haussler [36], the above works often use indirect methods and tend to rely on powerful model-dependent assumptions.
In this work, we aim to offer a generic, unifying theory by way of the first direct reduction from agnostic to realizable learning. Unlike any previous work, our reduction is blackbox, relies on no additional assumptions, and, perhaps most importantly, is incredibly simple. In fact, the basic algorithm can be stated in three lines.
Input: Realizable PAC-learner $A$, unlabeled sample oracle $\mathcal{O}_U$, labeled sample oracle $\mathcal{O}_L$.

Algorithm:

1. Draw an unlabeled sample $S_U \sim \mathcal{O}_U$ and a labeled sample $S_L \sim \mathcal{O}_L$.

2. Run $A$ over all possible labelings of $S_U$ to get:

$$C(S_U) := \{A(S_U, h(S_U)) \mid h \in H|_{S_U}\}.$$

3. Return the hypothesis in $C(S_U)$ with lowest empirical error over $S_L$.

Algorithm 1: Agnostic to Realizable Reduction
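To make the steps concrete, here is a minimal Python sketch of the reduction for a finite, explicitly given class. All names here (`agnostic_via_realizable`, the hypothesis-as-function representation) are illustrative choices, not notation from the paper; Step 1's sampling is left to the caller.

```python
def agnostic_via_realizable(realizable_learner, H, S_U, S_L):
    """Sketch of Algorithm 1 for a finite, explicitly given class H.

    Step 1 (drawing S_U from O_U and S_L from O_L) is left to the caller.
    Step 2: run the realizable learner on every labeling of S_U induced
            by some h in H (i.e. the projection H restricted to S_U).
    Step 3: return the candidate with lowest empirical error on S_L.
    """
    # Step 2: distinct labelings of the unlabeled sample induced by H.
    labelings = {tuple(h(x) for x in S_U) for h in H}
    candidates = [realizable_learner(list(zip(S_U, ys))) for ys in labelings]

    # Step 3: empirical risk minimization over the finite candidate set.
    def empirical_error(f):
        return sum(f(x) != y for (x, y) in S_L) / len(S_L)
    return min(candidates, key=empirical_error)
```

For intuition, instantiating `realizable_learner` with empirical risk minimization over $H$ (a valid realizable learner for finite classes) recovers a zero-error hypothesis whenever the labeled data happens to be realizable.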
This basic reduction simplifies and unifies classic results such as VC [60], BEHW [19], and Haussler's [36] distribution-free equivalence and Benedek and Itai's [17] analogous result in the distribution-dependent setting2 up to log factors. More importantly, because Algorithm 1 doesn't rely on model-dependent properties like uniform convergence, it extends to learning regimes without known characterizations. One such example is the notoriously difficult distribution-family model, in which the adversary is given a restricted family of distributions $\mathcal{D}$ along with the pair $(X, H)$. The distribution-family model has no finitary characterization [44],3 yet Algorithm 1 can still be used to show that the realizable and agnostic settings are equivalent.
Unfortunately, while Algorithm 1 does avoid any significant blowup in sample complexity, it is inherently computationally inefficient. In fact, this is necessary unless $P = NP$: there are many basic classes (e.g. halfspaces) which are easy to learn in the realizable model, but NP-hard or cryptographically hard to learn in the agnostic setting (see e.g. [33]). As such, we focus in this work only on information-theoretic considerations, though building computationally efficient reductions in restricted settings remains an interesting avenue of research.
The core of Algorithm 1 lies in an equivalence between PAC learning and a new form of randomized covering of independent interest we call a non-uniform cover. In contrast to more classical notions, a non-uniform cover is a distribution over subsets of hypotheses that covers any fixed hypothesis in the class with high probability, but may fail to cover all hypotheses simultaneously. The connection between supervised learning and non-uniform covering is inherent in Algorithm 1, where Steps 1 and 2 turn the realizable learner $A$ into a non-uniform cover $C(S_U)$, and Step 3 uses the cover to perform agnostic learning. At a high level, this process works because the adversary does not see the randomness inherent to Steps 1 and 2, and therefore cannot detect or exploit which hypotheses in the class will fail to be well-estimated in the process.
In fact, this connection has many broader implications within the theory of supervised learning. For one, the method is not inherently restricted to the agnostic setting. Algorithm 1 achieves agnostic learning for general classes by applying an agnostic learner for finite classes in Step 3; replacing this with learners satisfying other properties (e.g. the exponential mechanism for privacy) leads to other families of reductions to the realizable setting. At a high level, this can be summarized by the following informal โguiding principleโ:
Guiding Principle (Property Generalization).
If there is a (sample-efficient) algorithm with property $P$ over finite classes, then Algorithm 1 gives a (sample-efficient) learner with property $P$ over any "learnable" class.
We stress that the above is a guide, not a theorem, and indeed often requires modification or weakening of the desired property for a given application. For instance, when Step 3 is replaced with a private algorithm, the result is a semi-private learner for general classes, a weakened model allowing the use of a small amount of public unlabeled data [14, 2].
Algorithm 1 also provides a unified framework for many settings beyond the PAC-model. This includes basic extensions such as general loss functions4 and the distribution-family model, but also more involved modifications such as partial learning, robust learning, or even the statistical query model. Moreover, in some of these settings removing reliance on setting-specific assumptions like uniform convergence actually quantitatively improves the sample complexity; such is the case for the semi-private model, where we use this fact to achieve information-theoretically optimal unlabeled sample complexity for the first time.
Finally, we note there are a few settings where Algorithm 1 runs into issues, especially discrete infinite settings such as infinite multi-class classification and properties such as privacy that require more careful data handling. We leave the extension of our method to these settings as an intriguing open problem.
1.1 Proof Overview

Before moving to a more detailed discussion of our results, we briefly overview the proof that Algorithm 1 is an agnostic learner. At its most basic, the idea is simply to observe that the set $C(S_U)$ almost certainly contains a "near-optimal" hypothesis $\tilde{h} \in H$. Since $C(S_U)$ has bounded size, standard uniform convergence for finite classes promises that Step 3 outputs a hypothesis close to $\tilde{h}$, which is therefore itself "near-optimal" as desired.

Taking a step back, the key observation in this process is really that $C(S_U)$ "acts like a cover" of $H$ in the following weak sense we call non-uniform covering:
Definition 1.1 (Non-uniform Cover (Informal Definition 5.1)).
Let $(X, H)$ be a hypothesis class, $D$ a marginal distribution over $X$, and $C$ a random variable over the power set $\mathcal{P}(H)$. We call $C$ a non-uniform $(\varepsilon, \delta)$-cover of $H$ with respect to $D$ if:

$$\forall h \in H: \Pr_C\left[\exists h' \in C : \Pr_{x \sim D}[h'(x) \neq h(x)] \leq \varepsilon\right] \geq 1 - \delta.$$

In other words, for every fixed hypothesis $h$ in the class, $C$ is very likely to contain a hypothesis close to $h$. This differs from standard covering, which enforces that $C$ simultaneously cover every $h \in H$ (equivalent to pushing the "$\forall h$" quantifier inside the probability above). In Section 13, we show that the latter (which is the standard in the literature) is strictly stronger and requires more samples to generate. This is critical in our application to semi-private learning, where we cut the additional samples required to generate a full cover and thereby achieve the optimal sample complexity.
It is left to observe that $C(S_U)$ in Algorithm 1 is actually a non-uniform cover, but this is essentially immediate from the definition of realizable learning! Realizable learning promises that for every fixed hypothesis $h \in H$, when $A$ receives samples labeled by $h$ it outputs a hypothesis close to $h$ with high probability. $C(S_U)$ is generated by running $A$ across all $h \in H$, so this guarantee exactly translates to the above. We refer the reader to Sections 2 and 5 for a formal explanation of this argument.
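This argument can be sanity-checked numerically. The sketch below (illustrative names; Monte Carlo estimates stand in for the true probabilities over $C$ and $D$) builds the candidate set by running an ERM-based realizable learner over all labelings of a fresh unlabeled sample, then tests the non-uniform cover condition of Definition 1.1 for a small threshold class.

```python
import random

def is_nonuniform_cover(sample_C, H, draw_x, eps, delta, trials=100, m=500):
    """Monte Carlo test of the non-uniform (eps, delta)-cover condition:
    for every fixed h in H, a fresh draw C ~ sample_C() should contain,
    with probability >= 1 - delta, some h' whose disagreement with h
    under D (estimated on m points from draw_x) is at most eps."""
    xs = [draw_x() for _ in range(m)]  # fixed Monte Carlo estimate of D

    def close(h1, h2):
        return sum(h1(x) != h2(x) for x in xs) / m <= eps

    for h in H:
        hits = sum(any(close(hp, h) for hp in sample_C())
                   for _ in range(trials))
        if hits / trials < 1 - delta:
            return False
    return True
```

With thresholds over a uniform marginal, the candidate set generated as in Steps 1 and 2 of Algorithm 1 passes this test, matching the discussion above.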
1.2 Beyond PAC Learning

Algorithm 1 is an extremely flexible framework for proving agnostic to realizable reductions in supervised learning. In this section, we informally overview the many extended models studied in this paper. The most basic setting in which Algorithm 1 applies beyond the standard model is learning under distributional assumptions (or formally what we call the "distribution-family model", see Sections 2 and 4). Standard techniques using combinatorial dimensions require that the learner $A$ works over every distribution (the "distribution-free" setting), while modern algorithms frequently only work under some niceness conditions on the distribution. On the other hand, it is easy to observe that the process described above works under arbitrary distributional assumptions: as long as $A$ is a realizable learner over a distribution $D$, then Algorithm 1 is an agnostic learner whenever the data is marginally distributed as $D$. This leads to the following corollary:
Theorem 1.2 (Distribution Family Model (Informal Theorem 5.4)).
The sample complexity of agnostically learning a class $(X, H)$ in the distribution-family model is at most:

$$m(\varepsilon, \delta) \leq O\left(\frac{n(\varepsilon, \delta) + \log(1/\delta)}{\varepsilon^2}\right),$$

where $n(\varepsilon, \delta)$ is the realizable sample complexity of $(X, H)$.

In the finite VC setting, this can be improved to roughly $\frac{d\log(d/\varepsilon) + \log(1/\delta)}{\varepsilon^2}$ for VC dimension $d$, which is optimal up to a log factor.
Algorithm 1 also covers many other supervised settings in the literature that diverge from the PAC model in more substantial ways, including general loss functions, constraints on the learner, and properties beyond agnostic learning. For simplicity we cover these here informally, avoiding exact definitions and sample complexities (which are generally similar to the above), and give formal details in the main body.
1.2.1 General Loss Functions

Perhaps the most natural extension of the PAC framework is to loss functions beyond binary classification. Here data points are labeled by an arbitrary label space $Y$ and error is measured with respect to a generic loss function $\ell: Y \times Y \to \mathbb{R}_{\geq 0}$; for instance, we may take $Y = \mathbb{R}$ and $\ell(y, y') = (y - y')^2$, the square loss. In general, agnostic and realizable learning are actually not equivalent in this setting, even for nice loss functions (see Proposition 6.1). The issue is that over infinite label spaces it is possible to encode hypotheses into the labels with infinite precision, making the class trivial to learn realizably, but impossible to learn even with the smallest amount of noise (which wipes out the encoding). We show this type of counter-example is essentially the only barrier to agnostic learnability.

Somewhat more formally, we call a class $H$ discretely learnable if for every $\varepsilon > 0$, there exists an $\varepsilon$-discretization5 of $H$ that is learnable up to $O(\varepsilon)$ error. Discrete learnability can informally be thought of as a very weak type of noise tolerance that essentially acts only to rule out the above construction. We prove discrete and agnostic learnability are equivalent under weak conditions on the loss function.
Theorem 1.3 (General Loss (Informal Theorems 7.2 and 6.6)).
If $\ell$ is a (probably) upper bounded loss function6 and either:

1. $\ell$ is bounded from below, or

2. $\ell$ satisfies a $C$-approximate triangle inequality,

then discrete and agnostic learnability are equivalent under $\ell$ up to polynomial factors.

We remark that in the latter case, the learner only achieves $C \cdot OPT + \varepsilon$ error, and that this is tight (Proposition 6.8). To our knowledge, these results are new even in the distribution-free setting, where such an equivalence was only known for bounded Lipschitz [9, 62] or binary-valued [15, 29] loss functions. In the models below, similar guarantees hold under the above assumptions on the loss. We omit the exact dependence on $\ell$ for simplicity and discuss it where relevant in the main body.
Another well-studied setting for loss functions beyond standard classification is adversarial robustness. Robust learning is an extension of the PAC model introduced to handle adversarial perturbations at test time by taking a maximum of the loss function over specified โperturbation setsโ around the test point. We give a modification of Algorithm 1 in the robust setting that handles general loss and distributional assumptions.
Theorem 1.4 (Robust Classification (Informal Theorem 8.5)).
Robust realizable and robust agnostic learning are equivalent up to polynomial factors.
Note this is completely independent of the perturbation sets. In the classification setting, Theorem 1.4 generalizes recent work giving such an equivalence in the distribution-free model [52, 50], though the sample complexity of our algorithm suffers an extra factor of $\varepsilon^{-1}$ in this special case.
Finally, another recent setting that works with a sort of "modified" loss function is the learning of partial functions, capturing "data-dependent" models such as halfspaces with geometric margin. Here the functions in $H$ are allowed to be "undefined" on some portions of $X$, and the loss is fixed to $1$ on any undefined point no matter the response of the algorithm.
Theorem 1.5 (Partial functions (Informal Theorem 9.5)).
Realizable and agnostic learning of partial functions are equivalent up to polynomial factors.
In the distribution-free setting, a variant of this equivalence is known via compression schemes, but the above is new under distributional assumptions and more general loss (again at the cost of an extra $\varepsilon^{-1}$ factor).
1.2.2 Constrained Models

Another frequent modification of the PAC setting is to impose constraints either on how the algorithm $A$ uses the data, or on various properties of the output classifier itself. In this section, we cover applications to three such examples: fairness, stability, and statistical queries.

We start with fairness, where the goal is to output a hypothesis with low error conditional on "treating similar individuals similarly" under a fixed metric on the data space.
Theorem 1.6 (Fair Learning (Informal Theorem 12.3)).
Realizable and agnostic fair learning are equivalent up to polynomial factors.
Formally, the above result holds in Rothblum and Yona's [63] popular "Probably Approximately Correct and Fair" (PACF) learning model. To our knowledge, it is the first such reduction even in the distribution-free setting, as Rothblum and Yona [63] study the agnostic case directly.

Another popular constraint in the literature is algorithmic stability, where one roughly imposes that running the algorithm twice should produce similar results. We study a simple variant known as uniform stability [24] (closely related to "private prediction" [32]), which promises that for every fixed point $x$ in the space, the output distribution of $A$ on $x$ is similar when $A$ is run on neighboring datasets.
Theorem 1.7 (Uniform Stability (Informal Theorem 10.2)).
Realizable and agnostic learning are equivalent under uniform stability up to polynomial factors.
This is known in the distribution-free setting by VC arguments [32, 24], but is new for general loss and under distributional assumptions. Unlike the prior examples, here the sample complexity matches that of prior techniques.
Another way to constrain the algorithm $A$ is through the way it interacts with data. Perhaps the most popular example of this is the statistical query model, where $A$ is constrained to approximating general population statistics (such algorithms are then guaranteed to satisfy certain nice properties such as noise tolerance and privacy). We give a variant of our reduction for the statistical query setting as well.
Theorem 1.8 (Statistical Query Model (Informal Theorem 11.2)).
Realizable and agnostic learning are equivalent in the statistical query model up to polynomial factors.
Variants of such a result are known via combinatorial characterization in the fixed-distribution setting [57, 33], but are new in our general setup.
1.2.3 Beyond Agnostic Learning

In the introduction, we claimed Algorithm 1 can be used to build a learner satisfying any "finitely-satisfiable" property, not just agnostic learning. In fact, one of the examples above already displays this property: Theorem 1.7 does not require the base learner to be uniformly stable. Instead, we apply a uniformly stable agnostic learner to $C(S_U)$ in Step 3 of Algorithm 1 and "lift" uniform stability from finite to infinite classes. We give two further applications of this strategy: to malicious noise, and to (semi-)private learning.

Kearns and Li's [41] malicious noise is a model for data contamination where one is given a faulty sample oracle $\mathcal{O}_M$ which (with some small probability) returns a completely adversarial pair $(x, y)$, and otherwise draws from a true "ground truth" distribution. A learner is said to be tolerant to malicious noise if it achieves standard PAC guarantees while drawing from $\mathcal{O}_M$. Like agnostic learning, tolerance to malicious noise is known to be achievable for finite classes [41]. As a result, (a slight modification of) Algorithm 1 gives a blackbox reduction from learning with malicious noise to realizable learning.
Theorem 1.9 (Realizable ⇒ Malicious (Informal Theorem 6.11)).
Realizable learning and learning with malicious noise are equivalent up to polynomial factors.
This extends the original result of [41] to the distribution-family and general loss settings.
Our final application of Algorithm 1 is to the ubiquitous notion of differential privacy. Informally, an algorithm is said to be differentially private if its output is not susceptible to small changes in the underlying sample (see Section 6.3). Privacy is a strong condition, and even relaxed notions require finite Littlestone dimension in the distribution-free setting [6], ruling out a direct reduction from realizable learning. However, Algorithm 1 does recover a weaker variant known as semi-privacy [14] where the learner is allowed to use a few โpublicโ unlabeled samples, but must otherwise maintain privacy with respect to the data.
Theorem 1.10 (Realizable ⇒ Semi-Private (Informal Theorem 6.21)).
Realizable and semi-private learning are equivalent up to polynomial factors.
This generalizes Beimel, Nissim, and Stemmer's [14] equivalence in the distribution-free setting and extends it to general loss. Perhaps most interesting is that in this case, by relying only on non-uniform covering rather than more classical uniform convergence arguments, Algorithm 1 actually gives a quantitative improvement over prior methods.
Corollary 1.11 (Semi-Optimal Semi-Private Learning (Informal Corollary 6.23)).
Let $(X, H)$ be a class with VC dimension $d$. Then $(X, H)$ is $\alpha$-semi-private, agnostically learnable in

$$n_{pub}(\varepsilon, \delta, \alpha) \leq O\left(\frac{d + \log(1/\delta)}{\varepsilon}\right)$$

unlabeled (public) samples, and

$$n_{priv}(\varepsilon, \delta, \alpha) \leq O\left(\frac{d\log(1/\varepsilon) + \log(1/\delta)}{\varepsilon \cdot \min\{\varepsilon, \alpha\}}\right)$$

labeled (private) samples.

For fixed $\varepsilon$ and $\delta$, the public sample complexity of Corollary 1.11 is tight and resolves a conjecture of Alon, Bassily, and Moran [2], who gave the corresponding lower bound.7 The private complexity is off by a log factor from known lower bounds [22], and it remains an interesting question whether this can be fixed.

Theorem 1.10 and Corollary 1.11 are also robust to light distribution shift between the public and private databases. This problem, called covariate shift, is commonly observed in machine learning practice and is especially of concern in privacy, where a distribution over "opt-in" public users could easily differ from the overall distribution of private data. We discuss covariate shift in more depth in Section 6.4.
1.3 Modification Archetypes

Finally, before moving to a formal presentation of our methods and results, we briefly overview the four generic modifications of Algorithm 1 used in the above extensions, and outline where each can be found formally in the main body of the paper.
Discretization.
We'll start by discussing our main technique for extending Algorithm 1 to infinite label spaces. The basic idea is simple: since we cannot afford to run our learner over all possible labelings of $S_U$, we instead run the learner over labelings coming from some discretization of the class. As long as we have access to a learner for the discretization, we can then use the same arguments covered in Section 2 to prove various occurrences of property generalization. We formalize these notions more generally in Section 6.1, where we use the technique to prove Theorem 1.3. Discretization can also be used to handle learning models such as the statistical query setting, which output real-valued query responses (see Section 11).
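As a toy illustration of the idea (not the paper's formal $\varepsilon$-discretization of a class), one can round real-valued labels to a grid so that any learner run downstream only ever sees finitely many labels on a bounded range:

```python
def discretize_labels(labeled, eps):
    """Illustrative rounding step: snap each real-valued label to the
    nearest multiple of eps, so a learner run on the rounded data sees
    only a discrete label set on any bounded range."""
    return [(x, round(y / eps) * eps) for (x, y) in labeled]
```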
Subsampling.
Another core limitation of Algorithm 1 is its reliance on clean unlabeled data. Algorithm 1 works by running a realizable learner over a representative set of unlabeled data, but, in practice, such data may often be corrupted, and data-dependent assumptions such as margin might mean that the optimal hypothesis isn't even well-defined on this set. We handle cases like these by a simple subsampling procedure: instead of running our realizable learner over labelings of $S_U$, we run the learner over all labelings of all subsets of $S_U$. As long as $S_U$ contains some amount of uncorrupted data, this subsampling procedure will find it and we can maintain the guarantees discussed in Section 2. We use this technique to prove property generalization for models such as robust learning (see Theorem 1.4 and Section 8), partial learning (see Section 9), and malicious noise (see Theorem 1.9 and Section 6.2).
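The subsampling step itself is easy to write down; the sketch below is an illustrative, exhaustive version (exponential-time, consistent with the purely information-theoretic focus here), with hypothetical helper names.

```python
from itertools import combinations

def subsampled_candidates(realizable_learner, H, S_U, k):
    """Subsampling archetype (sketch): run the realizable learner on every
    H-induced labeling of every size-k subset of S_U. If at least k points
    of S_U are clean, some subset consists entirely of clean data, and its
    candidates retain the usual realizable guarantees."""
    candidates = []
    for subset in combinations(S_U, k):
        labelings = {tuple(h(x) for x in subset) for h in H}
        for ys in labelings:
            candidates.append(realizable_learner(list(zip(subset, ys))))
    return candidates
```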
Replacing the Finite Learner.
In the introduction, we proposed a general paradigm (guiding principle) called property generalization: that a variant of any learning property which holds for finite classes should in fact hold for any โlearnableโ class in the base model. The main idea relies on replacing Step 3 of Algorithm 1 (which, as stated, is an empirical risk minimization process) with a generic learner for finite classes with the desired property. For noise-tolerance properties such as agnostic and malicious noise, empirical risk minimization works. Properties such as privacy or stability, however, require a different finite learner. To prove Theorem 1.10, for example, we replace the ERM process in Algorithm 1 with the exponential mechanism [49]. We use a similar strategy in Section 10 to prove an analogous result for uniform stability.
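As an illustration of swapping in a different finite learner, here is a standard sketch of the exponential mechanism used as the selection step over a finite candidate set. The utility is negative empirical error; the exact scaling and names are illustrative rather than the paper's construction.

```python
import math
import random

def exponential_mechanism(candidates, labeled, alpha):
    """Sketch of replacing Step 3's ERM with the exponential mechanism
    (McSherry-Talwar): sample a candidate with probability proportional
    to exp(-alpha * n * empirical_error / 2), a differentially private
    selection rule over a finite candidate set."""
    n = len(labeled)

    def empirical_error(f):
        return sum(f(x) != y for (x, y) in labeled) / n

    scores = [math.exp(-alpha * n * empirical_error(f) / 2)
              for f in candidates]
    # Sample a candidate with probability proportional to its score.
    r = random.random() * sum(scores)
    acc = 0.0
    for f, s in zip(candidates, scores):
        acc += s
        if r <= acc:
            return f
    return candidates[-1]
```

With a large privacy parameter the mechanism concentrates on low-error candidates, mirroring how the private finite learner preserves the accuracy analysis of Step 3.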
Replacing the Base Learner.
Finally we note a very basic modification of Algorithm 1 that allows us to extend property generalization beyond the PAC setting: simply replace the input realizable PAC learner with a realizable learner in the desired model. This is usually combined with one of the techniques above depending on the specific application, e.g. to prove property generalization for robust learning and the statistical query model. The same idea can also be used to analyze semi-private learning with covariate shift (see Section 6.4) and property generalization for fair learning (see Section 12).
2 Distribution Family Classification

Since all of our results are derived from variants of Algorithm 1, it is instructive to start by considering its basic analysis in our simplest non-trivial setting: distribution-family classification. We remark that this section is entirely pedagogical, and the results are subsumed in Section 5.
The distribution-family model captures learnability with arbitrary distributional assumptions, a well-studied relaxation of PAC learning in practice where worst-case distributional assumptions are often too strong, and encompasses both the distribution-free and distribution-dependent PAC settings. Unlike these models, however, we cannot assume uniform convergence in the distribution family setting, due to a classical example of Benedek and Itai [17].
Proposition 2.1 (Benedek and Itai [17]).
There exists a PAC-learnable class $(D, X, H)$ over binary labels and classification loss without the uniform convergence property.
Proof 2.2.
Let $X = [0,1]$, let $D$ be the uniform distribution over $X$, let $Y = \{0,1\}$, and let $H$ consist of all indicator functions for finite sets $S \subset X$, as well as the indicator of $X$ itself. It is not hard to see that $(D, X, H)$ is realizably PAC-learnable by the following scheme using only a single sample: if the learner draws a sample labeled $1$, output the all-$1$s function; otherwise, output the all-$0$s function. When the adversary has chosen a finite set, with probability $1$ the learner draws a sample labeled $0$, and outputs a hypothesis with $0$ error (since the finite set has measure $0$). If the adversary chooses the all-$1$s function, the learner will always output the all-$1$s function.

On the other hand, it is clear that when the adversary chooses the all-$1$s function, no matter how many samples the learner draws, there will exist a hypothesis in the class that is poorly approximated by the sample. Namely, the hypothesis whose support is given by the support of the sample itself has empirical measure $1$, but true measure $0$. As a result, this class fails to have the uniform convergence property despite its learnability.
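The single-sample scheme from this proof can be written out directly (a sketch of the setting above: uniform marginal on $[0,1]$, targets that are finite-set indicators or the all-$1$s function; the function name is ours):

```python
import random

def one_sample_learner(x, y):
    """Benedek-Itai scheme: a single labeled example (x, y) suffices.
    Label 1 is a probability-zero event when the target is the indicator
    of a finite set, so it identifies the all-1s function; label 0 means
    the all-0s function is correct almost everywhere."""
    return (lambda z: 1) if y == 1 else (lambda z: 0)

# Demo: target is the indicator of the finite set {1.0} (measure zero).
# A uniform draw from [0, 1) never hits it, so the learner sees label 0
# and outputs the all-0s function, which has true error 0.
random.seed(1)
target = lambda z: int(z == 1.0)
x = random.random()
learned = one_sample_learner(x, target(x))
```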
This simple example rules out many classical techniques in supervised learning, including uniform convergence based reductions (e.g. [8, 14, 2]). Furthermore, since the distribution-family setting has no finitary characterization, we also cannot hope to take the traditional approach [60, 17, 36, 9] of proving equivalence of realizable and agnostic models by simultaneous characterization.
With this motivation in mind, let's define distribution-family learning more formally. Let $X$ be a set (called the instance space), $Y = \{0,1\}$ the set of binary labels, $\mathcal{D}$ a family of distributions over $X$, and $H \subseteq \{h : X \to Y\}$ a family of binary classifiers. A tuple $(\mathcal{D}, X, H)$ is said to be realizably learnable if there exists an algorithm8 $A$ and a function $n(\varepsilon, \delta)$ such that for every $\varepsilon, \delta > 0$, every choice of distribution $D \in \mathcal{D}$, and every hypothesis $h \in H$, $A$ outputs a good classifier with high probability on samples of size $n(\varepsilon, \delta)$:

$$\Pr_{S \sim D^{n(\varepsilon, \delta)}}\left[\mathrm{err}_{D \times h}(A(S, h(S))) \leq \varepsilon\right] \geq 1 - \delta,$$

where $\mathrm{err}_{D \times h}(h')$ is commonly called the error or risk of $h'$:

$$\mathrm{err}_{D \times h}(h') := \Pr_{x \sim D}[h'(x) \neq h(x)].$$

Likewise, a tuple $(\mathcal{D}, X, H)$ is said to be agnostically learnable if there exists an algorithm $A$ which, for every distribution $D$ over $X \times Y$ whose marginal satisfies $D_X \in \mathcal{D}$, outputs an $h'$ close to the best hypothesis in $H$ with probability $1 - \delta$:

$$\Pr_{S \sim D^{n(\varepsilon, \delta)}}\left[\mathrm{err}_D(A(S)) \leq OPT + \varepsilon\right] \geq 1 - \delta,$$

where $OPT := \inf_{h \in H}\{\mathrm{err}_D(h)\}$ is the error of the best hypothesis in the class9 and the risk $\mathrm{err}_D(h')$ is similarly defined:

$$\mathrm{err}_D(h') := \Pr_{(x, y) \sim D}[h'(x) \neq y].$$
With this in mind, we can now state the most basic application of Algorithm 1: the equivalence of agnostic and realizable learning for distribution-family classification.
Theorem 2.3 (Realizable ⇒ Agnostic (Distribution-Family Classification)).
Let $\mathcal{A}$ be a realizable learner for $(X, \mathcal{D}, H)$ using $n(\varepsilon, \delta)$ samples. Then Algorithm 1 is an agnostic learner for $(X, \mathcal{D}, H)$ using

$$n_U(\varepsilon, \delta) \leq n(\varepsilon/2, \delta/2)$$

unlabeled samples, and

$$n_L(\varepsilon, \delta) \leq O\!\left(\frac{n(\varepsilon/2, \delta/2) + \log(1/\delta)}{\varepsilon^2}\right)$$

labeled samples. Moreover, if $(X, H)$ has finite VC dimension $d$, Algorithm 1 needs only

$$n_L(\varepsilon, \delta) \leq O\!\left(\frac{d\log(1/\varepsilon) + \log(1/\delta)}{\varepsilon^2}\right)$$

labeled samples.
Along with its novelty in the distribution-family setting (where no such equivalence was known), it is worth noting that in the distribution-free setting, Theorem 2.3 actually recovers the same sample complexity bound as the standard uniform convergence analysis (though it remains off by a log factor from the optimal bound given by chaining [45]). We also note that while unlabeled sample complexity is not usually considered separately from labeled complexity in the PAC setting, this will become a useful distinction in the semi-supervised extensions considered later in the work. As such, it is instructive to keep the complexities separate for the time being.
With this out of the way, let's prove Theorem 2.3. The analysis breaks naturally into two parts, corresponding respectively to Step 2 and Step 3 of Algorithm 1. In the first part, we'll show that $C(S_U)$, the set of outputs obtained by running the realizable learner $\mathcal{A}$ across all possible labelings of the unlabeled sample $S_U$, is in some sense a "good approximation" of the class $H$. More formally, the crucial observation is that for any choice of the adversary's distribution, $C(S_U)$ will (almost) always contain a hypothesis close to the optimal solution.
Claim 1.
For any distribution $D$ over $X \times Y$ whose marginal $D_X \in \mathcal{D}$, with probability $1 - \delta/2$ there exists $h' \in C(S_U)$ which is within $\varepsilon/2$ of the optimal risk:

$$\mathrm{err}_D(h') \leq \mathrm{OPT} + \varepsilon/2.$$

Once we have this claim, the second step is to show that Step 3, an empirical risk minimization process on $C(S_U)$, gives the desired agnostic learner. This actually follows from standard arguments. In particular, given a hypothesis $h \in C(S_U)$, let

$$\mathrm{err}_{S_L}(h) := \Pr_{(x,y) \sim S_L}[h(x) \neq y]$$

denote its empirical risk with respect to $S_L$. Since $C(S_U)$ is finite, a standard Chernoff + union bound gives that with probability at least $1 - \delta/2$, the empirical risk of every hypothesis in $C(S_U)$ with respect to $S_L$ is close to its true risk. Then as long as $S_L$ is sufficiently large, empirical risk minimization returns a solution with at most $\mathrm{OPT} + \varepsilon$ error with high probability (we'll formalize this in a moment).
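As a sanity check on the Chernoff + union bound step, the following sketch computes a labeled sample size sufficient for every hypothesis in a finite cover to have empirical risk within $\varepsilon/4$ of its true risk; the function name and exact constants are ours, derived from the standard Hoeffding bound, not taken from the paper.

```python
import math

def erm_sample_bound(cover_size, eps, delta):
    """Smallest m with cover_size * 2 * exp(-2 m (eps/4)^2) <= delta/2:
    by Hoeffding plus a union bound over the cover, all empirical risks
    are then (eps/4)-accurate except with probability delta/2."""
    t = eps / 4.0
    return math.ceil(math.log(4.0 * cover_size / delta) / (2.0 * t * t))

# e.g. a cover of 1024 hypotheses at eps = 0.1, delta = 0.05
m = erm_sample_bound(1024, 0.1, 0.05)
```

Note the logarithmic dependence on the cover size: doubling the cover only adds an additive $\log 2$ inside the bound.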
It remains to prove Claim 1. The key observation lies in an equivalence between realizable PAC-learning and a weak type of randomized covering: for any fixed $h \in H$, $C(S_U)$ contains a hypothesis close to $h$ with high probability.
Lemma 2.4.
For any distribution $D$ over $X \times Y$ with marginal $D_X \in \mathcal{D}$ and any $h \in H$, with probability $1 - \delta/2$ there exists $h' \in C(S_U)$ which is within $\varepsilon/2$ of $h$ in classification distance:

$$\Pr_{x \sim D_X}[h'(x) \neq h(x)] \leq \varepsilon/2.$$

Proof 2.5.
The proof is essentially immediate from the definition of realizable PAC-learning. $\mathcal{A}$ promises that for any $h \in H$ and $D \in \mathcal{D}$, a $1 - \delta/2$ fraction of labeled samples $(S, h(S)) \sim D^{n(\varepsilon/2, \delta/2)}$ satisfy

$$\mathrm{err}_{D \times h}(\mathcal{A}(S, h(S))) = \Pr_{D}[h'(x) \neq h(x)] \leq \varepsilon/2,$$

where $h' := \mathcal{A}(S, h(S))$. Since $C(S_U)$ contains $\mathcal{A}(S_U, h(S_U))$ for every $h \in H$ by definition, the result follows.
More generally, we call such objects non-uniform covers.
Definition 2.6 (Non-uniform Cover (Informal Definition 5.1)).
Let $(X, H)$ be a class over label space $Y$, $D$ a marginal distribution over $X$, and $C$ a random variable over the power set $P(H)$. We call $C$ a non-uniform $(\varepsilon, \delta)$-cover of $H$ with respect to $D$ if for every $h \in H$:

$$\Pr_{C}\left[\exists\, h' \in C : \Pr_{x \sim D}[h'(x) \neq h(x)] \leq \varepsilon\right] \geq 1 - \delta.$$

Note that Lemma 2.4 (and non-uniform covering in general) does not imply that $C(S_U)$ contains hypotheses close to every $h \in H$ simultaneously. This stronger object is called a uniform cover and provably takes more samples to construct (see Section 13). In our case, a non-uniform cover is sufficient: since the guarantee holds for every fixed $h \in H$, it must hold in particular for the optimal hypothesis $h_{\mathrm{OPT}}$, so $C(S_U)$ contains some $h'$ within $\varepsilon/2$ of optimal. Let's now formalize these ideas and put everything together to prove Theorem 2.3.
Proof 2.7 (Proof of Theorem 2.3).
Let $D$ be the adversary's distribution over $X \times Y$, and let $h_{\mathrm{OPT}} \in H$ be a hypothesis achieving the optimal error. By Lemma 2.4, with probability $1 - \delta/2$, $C(S_U)$ contains a hypothesis $h'$ such that

$$\Pr_{x \sim D_X}[h'(x) \neq h_{\mathrm{OPT}}(x)] \leq \varepsilon/2.$$

This implies Claim 1 (that $C(S_U)$ contains a hypothesis with error at most $\mathrm{OPT} + \varepsilon/2$) since

$$\mathrm{err}_D(h') = \Pr_{(x,y) \sim D}[h'(x) \neq y] \leq \Pr_{(x,y) \sim D}[h_{\mathrm{OPT}}(x) \neq y] + \Pr_{(x,y) \sim D}[h_{\mathrm{OPT}}(x) \neq h'(x)] \leq \mathrm{OPT} + \varepsilon/2.$$

We can now use standard empirical risk minimization bounds on $C(S_U)$ to find a hypothesis with error at most $\mathrm{OPT} + \varepsilon$. Chernoff and union bounds imply that with probability at least $1 - \delta/2$, the empirical risk of every hypothesis in $C(S_U)$ on a sample of size $O\!\left(\frac{\log(|C(S_U)|/\delta)}{\varepsilon^2}\right)$ is at most $\varepsilon/4$ away from its true error. Since $h'$ has error at most $\mathrm{OPT} + \varepsilon/2$, its empirical risk is at most $\mathrm{OPT} + 3\varepsilon/4$, and by the above guarantee any hypothesis in $C(S_U)$ with empirical risk at most $\mathrm{OPT} + 3\varepsilon/4$ has true error at most $\mathrm{OPT} + \varepsilon$.
Putting everything together, we have that with probability $1 - \delta$ over the entire process, the empirical risk minimizer over $C(S_U)$ has error at most $\mathrm{OPT} + \varepsilon$ as desired. The sample complexity bounds follow from noting that $|C(S_U)|$ is at most $2^{n(\varepsilon/2, \delta/2)}$, and at most $\left(\frac{e \cdot n(\varepsilon/2, \delta/2)}{d}\right)^d$ if the class has VC dimension $d$. The sample complexity bound for the latter case then follows by plugging in the standard bound for distribution-free realizable classification: $n(\varepsilon/2, \delta/2) \leq O\!\left(\frac{d\log(1/\varepsilon) + \log(1/\delta)}{\varepsilon}\right)$.
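To get a feel for the two bounds on $|C(S_U)|$, here is a small numeric comparison of their logarithms; the particular values of $n$ and $d$ below are arbitrary illustrative choices.

```python
import math

def log_cover_generic(n):
    # log of 2^n: one candidate hypothesis per binary labeling of S_U
    return n * math.log(2)

def log_cover_vc(n, d):
    # log of (e n / d)^d from the Sauer-Shelah lemma (valid for n >= d)
    return d * math.log(math.e * n / d)

n, d = 10_000, 10
# the VC-based bound is exponentially smaller, which is what buys the
# d log(1/eps) labeled sample complexity in Theorem 2.3
print(log_cover_generic(n), log_cover_vc(n, d))
```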
3 Related Work
Agnostic learning is a widely studied model across learning theory, and works in many different sub-areas have noted model-specific equivalences with realizable learning. Here we'll survey a few representative examples and discuss how they relate to and differ from our approach.
3.1 Beyond Binary Classification
Uniform Convergence and Multiclass Classification.
It is well known that the "uniform convergence equals learnability" paradigm continues to hold for 0/1-valued loss functions over constant-size label spaces [19, 15, 59, 54], and that agnostic and realizable learning are equivalent as a result. On the other hand, Daniely, Sabato, Ben-David, and Shalev-Shwartz [25] showed this is no longer the case as the number of labels grows large. In this regime, even basic multi-class learning is no longer equivalent to uniform convergence, so the connection between realizable and agnostic learning becomes non-trivial. A few years later, David, Moran, and Yehudayoff (DMY) [29] proved the equivalence nevertheless holds in the infinite multi-class setting through the weaker "sample compression equals learnability" paradigm. While more general than the uniform convergence paradigm, their proof remains model-specific and fails in many of the settings we consider, e.g. partial learning [5].
Discretization and General Loss Functions.
Basic forms of discretization were also considered back in the mid-90s in work on characterizing the learnability of real-valued functions. In a seminal work, Bartlett, Long, and Williamson (BLW) [9] proved that a scale-sensitive measure introduced by Kearns and Schapire [42], called fat-shattering dimension, characterizes learnability under bounded Lipschitz loss functions. BLW use a basic form of discrete learning (called quantization) to prove that finite fat-shattering dimension is a necessary condition, and use uniform convergence to prove sufficiency. We give a similar argument to BLW's in the necessary direction, but show that uniform convergence is not necessary for the equivalence to hold, and instead use Algorithm 1 to appeal directly to discrete learnability. This allows us to extend BLW's result across a much more general set of loss functions and scenarios without strong model-specific assumptions.
3.2 Semi-supervised, Active, and Semi-Private Learning
Our reduction hinges on combining a realizable learner with unlabeled data to cut down the number of potential hypotheses in our class. The use of unlabeled samples to this effect is one of the core ideas in the field of semi-supervised learning [8, 65]. Here, it is usually additionally assumed that the function to be learned has some relation (or "compatibility") to the underlying data distribution; for example, it might have large margin on unlabeled data as in transductive SVM [38], or redundant sufficient information as in co-training [18, 28]. In their seminal work on the topic, Balcan and Blum [8] employed a strategy similar to Algorithm 1 in which they draw an unlabeled sample $S_U$ and select hypotheses consistent with each possible labeling based upon compatibility. They argue via uniform convergence that this results in a uniform cover, and then use empirical risk minimization to select a good hypothesis in the cover. It is worth noting that around the same time, a similar strategy independently found use in the online learning literature in work of Ben-David, Pál, and Shalev-Shwartz [16], who simulated the so-called "standard optimal algorithm" (SOA) over a sequence of examples and applied weighted majority [46] over the resulting set of hypotheses to obtain an agnostic online learner.
Similar strategies have also found use in the related active learning literature. Hanneke and Yang [35] use the same technique to build a cover from unlabeled samples (adding one hypothesis consistent with each possible labeling), and then apply active (adaptive) query algorithms to learn the best hypothesis in the cover in as few labeled samples as possible. This generalized earlier work of Dasgupta [27], who assumed a priori that the cover was known to the learner ahead of time. Most recently, the approach has seen use in the study of semi-private learning. In their original work on the model, Beimel, Nissim, and Stemmer [14] again apply the same trick for building a uniform cover, but then find the best hypothesis privately via the exponential mechanism (similar to our proof of Theorem 1.10). The analysis of this strategy was later improved by Alon, Bassily, and Moran (ABM) [2].
The above works differ from ours in two crucial senses. First, each work focuses solely on developing an algorithm for its specific framework (rather than working to understand a more general equivalence or reduction between settings). In this sense, one can view each of these prior results as a specific instance of our general framework where the "base learner" in our reduction is restricted to be an empirical risk minimizer (or the SOA in the online setting), and a problem-specific learner for the relevant property (online, agnostic, active, or private) is then applied over the resulting cover. Second, and perhaps most importantly, these previous works all rely fundamentally on uniform convergence. This means that their algorithms break down as soon as one moves away from the original PAC model (even to, say, the basic distribution-dependent setting), and can also lead to sub-optimal sample complexity bounds. In the analysis of semi-private learning, for instance, we show that avoiding uniform convergence leads to asymptotically better bounds, actually resolving the public sample complexity of the model altogether. Indeed, one can show that building a uniform cover requires asymptotically more unlabeled samples than a non-uniform one, and therefore cannot result in optimal semi-supervised algorithms.
3.3 Non-Uniform Covering and Probabilistic Representations
Covering techniques have long been used in learning theory, and while almost all prior works focus on uniform notions (where all hypotheses are covered simultaneously), there is one notable exception. In 2013, Beimel, Nissim, and Stemmer (BNS) [12] introduced probabilistic representations, a strong randomized form of covering used to characterize pure differentially private learning. In the language of our work, given a class $(X, H)$, a probabilistic representation is a distribution over subsets of $H$ which is a non-uniform cover simultaneously over all distributions over $X$. BNS prove that private learning is equivalent to the existence of a probabilistic representation for the class. Equivalently, this can be thought of as the ability to build a non-uniform cover without access to the underlying distribution at all. On the other hand, we are interested in the much weaker setting where a non-uniform cover can be built from a bounded number of samples from the distribution (and crucially argue that this is equivalent to realizable learning). Thus in a sense, our core connection between realizable learning and non-uniform covering can be thought of as an analog of BNS's characterization of private learning by probabilistic representations.
Paper Organization
The main body of this paper is split into two main portions. In the first we cover preliminaries (Section 4) and the base reduction for all finitely supported loss functions (Section 5), and discuss in detail the four main modification archetypes to Algorithm 1: discretization (Section 6.1), sub-sampling (Section 6.2), replacing ERM (Section 6.3), and changing the base model (Section 6.4), covering extensions to infinite label classes, malicious noise, semi-private learning, and covariate shift respectively. In the remainder of the paper we cover applications of these modified versions to doubly-bounded loss (Section 7), robust learning (Section 8), partial learning (Section 9), uniformly-stable learning (Section 10), the statistical query model (Section 11), and fair learning (Section 12), and discuss further connections of non-uniform covers to previous notions of covering (Section 13). Each of the latter sections is self-contained, and the reader is encouraged to skip directly to any model of interest.
4 Preliminaries
Before moving to a more formal discussion of our results, we'll cover the most basic learning models discussed in this work: standard (distribution-free) PAC-learning and distribution-family PAC-learning. Extended models we consider beyond these (e.g. malicious noise, robust learning, partial learning, etc.) will instead be introduced in their respective sections.
4.1 PAC-Learning
We'll start by reviewing the seminal PAC-learning model of Valiant [58] and Vapnik and Chervonenkis [60]. We start with a few core definitions for the setting of general loss. Let $X$ be an arbitrary set called the instance space (e.g. $\mathbb{R}^d$), $Y$ a set called the label space (e.g. $\{0,1\}$), and $H$ a family of labelings of $X$ by $Y$ (that is, a family of functions of the form $h: X \to Y$). Given a class $(X, H)$, it will often be useful to consider its growth function $\Pi_H(n)$, which measures the maximum size of $H$ when restricted to a sample of size $n$:

$$\Pi_H(n) := \max_{S \in X^n} \left| H|_S \right|.$$

We note that the growth function is trivially bounded by $|Y|^n$, but one can often give stronger bounds when $(X, H)$ satisfies some finite combinatorial dimension (e.g. VC dimension in the binary case).
While PAC-learning is sometimes used to refer only to classification, we will study the model under general loss functions. With that in mind, we call a function $\ell: Y \times Y \to \mathbb{R}_{\geq 0}$ a loss function if $\ell(y, y) = 0$ for all $y \in Y$. We say a loss $\ell$ satisfies the identity of indiscernibles if $\ell(y_1, y_2) = 0$ iff $y_1 = y_2$. Given any distribution $D$ over $X \times Y$ and loss $\ell$, the risk of a labeling $h: X \to Y$ with respect to $D$ and $\ell$ is its expected loss:

$$\mathrm{err}_{D, \ell}(h) := \mathbb{E}_{(x,y) \sim D}[\ell(h(x), y)].$$

The goal of learning is generally to find a classifier $h \in H$ that minimizes risk. More formally, there are two commonly studied variants of this problem. The original formulation, now called realizable learning, assumes the existence of a hypothesis in $H$ with no loss.
Definition 4.1 ((Realizable) PAC-learning).
We say $(X, H, \ell)$ is realizable PAC-learnable if there exists an algorithm $\mathcal{A}$ and function $n(\varepsilon, \delta)$ such that for all $\varepsilon, \delta > 0$ and distributions $D$ over $X \times Y$ such that $\min_{h \in H} \mathrm{err}_{D,\ell}(h) = 0$:

$$\Pr_{S \sim D^{n(\varepsilon, \delta)}}\left[\mathrm{err}_{D,\ell}(\mathcal{A}(S)) > \varepsilon\right] \leq \delta.$$

$\mathcal{A}$ is called proper if it outputs only hypotheses in $H$.
Perhaps a more realistic variant of PAC-learning is to drop this restriction on the adversary and let them choose an arbitrary distribution over $X \times Y$. This model, introduced by Haussler [36] and Kearns, Schapire, and Sellie [41], is known as agnostic learning.
Definition 4.2 ((Agnostic) PAC-learning).
We say $(X, H, \ell)$ is agnostic PAC-learnable if there exists an algorithm $\mathcal{A}$ and function $n(\varepsilon, \delta)$ such that for all $\varepsilon, \delta > 0$ and distributions $D$ over $X \times Y$:

$$\Pr_{S \sim D^{n(\varepsilon, \delta)}}\left[\mathrm{err}_{D,\ell}(\mathcal{A}(S)) > \mathrm{OPT} + \varepsilon\right] \leq \delta,$$

where $\mathrm{OPT} := \min_{h \in H}\{\mathrm{err}_{D,\ell}(h)\}$.

For some settings covered in this work, it will turn out that reaching $\mathrm{OPT} + \varepsilon$ error is too stringent a condition. However, we will show that in these cases it is sometimes possible to maintain a weaker guarantee and learn up to $c \cdot \mathrm{OPT} + \varepsilon$ error for some constant $c > 1$. We call such classes $c$-agnostically learnable.
Finally, we note that for simplicity, when $\ell$ is the standard "classification error"

$$\ell(y_1, y_2) := \begin{cases} 0 & \text{if } y_1 = y_2 \\ 1 & \text{otherwise,} \end{cases}$$

we'll simply write $(X, H)$ to mean $(X, H, \ell)$. Realizable and agnostic learning are well studied under many basic loss functions, including binary classification, where both models are known to be characterized by a combinatorial parameter called VC dimension.
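As a concrete reference point, the classification loss and its induced empirical risk are simple to write down; the helper names below are our own, not the paper's.

```python
def zero_one_loss(y1, y2):
    """The standard classification error: 0 on agreement, 1 otherwise.
    It trivially satisfies the identity of indiscernibles."""
    return 0 if y1 == y2 else 1

def empirical_risk(h, sample, loss=zero_one_loss):
    """Average loss of the labeling h over a labeled sample [(x, y), ...]:
    the plug-in estimate of the risk err_{D,l}(h)."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)
```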
4.2 Learning Under Distribution Families
The standard PAC-models described above are often called distribution-free due to the fact that no assumptions are made on the marginal distribution over $X$. In practice, however, this is usually too worst-case an assumption. We often expect distributions in nature to be "nice" in some way, or at least somewhat restricted. This is reflected in the fact that popular machine learning algorithms usually significantly outperform the PAC-model's worst-case generalization bounds. Indeed, such niceness assumptions have long been popular in learning theory as well, where conditions such as tail bounds or anti-concentration are frequently used to build efficient algorithms.
These ideas are captured more generally by a simple (but notoriously difficult) extension to the PAC framework originally proposed by Benedek and Itai [17], where the adversary is restricted to picking from a fixed, known set of distributions.
Definition 4.3 ((Realizable) Distribution-Family PAC-learning).
Let $X$ be an instance space and $\mathcal{D}$ a family of distributions over $X$. We say $(X, \mathcal{D}, H, \ell)$ is realizable PAC-learnable if there exists an algorithm $\mathcal{A}$ and function $n(\varepsilon, \delta)$ such that for all $\varepsilon, \delta > 0$ and distributions $D$ over $X \times Y$ satisfying
1. the marginal $D_X \in \mathcal{D}$, and
2. $\min_{h \in H} \mathrm{err}_{D,\ell}(h) = 0$,
we have

$$\Pr_{S \sim D^{n(\varepsilon, \delta)}}\left[\mathrm{err}_{D,\ell}(\mathcal{A}(S)) > \varepsilon\right] \leq \delta.$$
Agnostic learning is defined similarly. The adversary must still choose a marginal distribution in $\mathcal{D}$, but the conditional labeling can be arbitrary.
Definition 4.4 (Agnostic Distribution-Family PAC-learning).
We say $(X, \mathcal{D}, H, \ell)$ is agnostic PAC-learnable if there exists an algorithm $\mathcal{A}$ and function $n(\varepsilon, \delta)$ such that for all $\varepsilon, \delta > 0$ and distributions $D$ over $X \times Y$ whose marginal satisfies $D_X \in \mathcal{D}$, $\mathcal{A}$ outputs a good hypothesis with high probability:

$$\Pr_{S \sim D^{n(\varepsilon, \delta)}}\left[\mathrm{err}_{D,\ell}(\mathcal{A}(S)) > \mathrm{OPT} + \varepsilon\right] \leq \delta,$$

where $\mathrm{OPT} := \min_{h \in H}\{\mathrm{err}_{D,\ell}(h)\}$.

The weaker $c$-agnostic learning is defined analogously with $\mathrm{OPT}$ replaced by $c \cdot \mathrm{OPT}$. Unlike the standard model, very little is known about distribution-family learnability. A number of works have given partial characterizations of the model [17, 31, 43, 61], but it was recently shown by Lechner and Ben-David [44] that no general finitary characterization is possible.
5 The Core Reduction: Finite Label Classes
In this section, we give a more detailed exposition of our main reduction as covered in Section 2 for the general setting of arbitrary loss on constant-size label spaces, along with matching lower bounds and additional discussion of non-uniform covers. As mentioned previously, since there is no combinatorial characterization of learnability in the distribution-family model [44], standard techniques [19, 15, 59, 54] cannot be used.
Before jumping into our reduction proper, it is worth reiterating why we can't simply take the approach of prior works and rely on uniform convergence, a strong condition which promises that on a large enough sample, the empirical error of every hypothesis will be close to its true error. While uniform convergence was a very popular technique in the early years of learning theory, practitioners have since moved away from the paradigm, which fails to capture learning rates seen in practice [64, 53]. Indeed, it soon became clear that the technique fails to capture even basic theoretical models such as the distribution-dependent setting (as discussed in Proposition 2.1). In later sections, we will even see distribution-free models where uniform convergence fails, such as the partial PAC model [47, 5], which captures realistic scenarios such as learning with margin. Since even the most basic modifications of PAC-learning fail to satisfy uniform convergence, it is clear we need to move beyond the condition to gain a more general understanding of the common phenomenon of equivalence between learning models.
Instead of relying on uniform convergence, our core observation is an equivalence between learning and sample access to a combinatorial object we call a non-uniform cover.
Definition 5.1 (Non-uniform Cover).
Let $(X, H)$ be a class over label space $Y$ and let $L_{X,Y}$ denote the family of all labelings from $X$ to $Y$. If $C$ is a random variable over the power set $P(L_{X,Y})$ and $d: L_{X,Y} \times L_{X,Y} \to \mathbb{R}_{\geq 0}$ is a "distance" function between labelings, we call $C$ a non-uniform $(\varepsilon, \delta)$-cover of $H$ with respect to $d$ if for all $h \in H$:

$$\Pr_{c \sim C}\left[\exists\, h' \in c : d(h', h) \leq \varepsilon\right] \geq 1 - \delta.$$

We call $C$ bounded if its support lies entirely on subsets of size at most some $s \in \mathbb{N}$, and we call the smallest such $s$ its size.
Non-uniform covers share a close connection to several notions of covering used throughout the learning literature, such as uniform covers [2] and fractional covers [4]. We discuss these connections in more detail in Section 13. For the moment, we note only that previous works using the strictly stronger notion of uniform covering necessarily lose factors in the sample complexity as a result. We discuss this further in Section 6.3.
In Section 2, we argued (at least implicitly) that once we have sampling access to a bounded non-uniform cover, agnostic learnability follows from standard arguments. Namely, since a sample $c$ has bounded size and is guaranteed to contain a concept "close" to optimal, it suffices to run empirical risk minimization over about $\log(|c|/\delta)/\varepsilon^2$ samples. The key to our reduction therefore boils down to turning blackbox access to a realizable PAC-learner into sampling access to some relevant non-uniform cover. This is given by Step 2 of Algorithm 1, which we rewrite here as a subroutine called LearningToCover.
Input: Hypothesis class $H$, realizable PAC-learner $\mathcal{A}$, unlabeled sample $S_U$.
Algorithm:
1. Run $\mathcal{A}$ over all possible labelings of $S_U$ by $H$.
2. Return the set of responses $C(S_U)$:

$$C(S_U) := \left\{\mathcal{A}(S_U, h(S_U)) \;\middle|\; h \in H|_{S_U}\right\}.$$

Algorithm 2 LearningToCover($H$, $\mathcal{A}$, $S_U$)
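For intuition, here is a minimal executable sketch of LearningToCover for a finite, explicitly enumerated hypothesis class. The representation of hypotheses as Python callables and the ERM base learner below are illustrative choices of ours, not the paper's.

```python
def learning_to_cover(hypotheses, learner, S_U):
    """Run the realizable learner once per distinct labeling of the
    unlabeled sample S_U induced by the class (i.e. per element of
    H restricted to S_U), and return the set of responses C(S_U)."""
    labelings = {tuple(h(x) for x in S_U) for h in hypotheses}
    return [learner(list(zip(S_U, labels))) for labels in labelings]

# illustrative class: thresholds h_t(x) = 1[x >= t] on a small domain
thresholds = [lambda x, t=t: int(x >= t) for t in range(4)]

def erm(sample):
    # a simple realizable base learner: return a hypothesis with
    # minimum empirical error on the labeled sample
    return min(thresholds, key=lambda h: sum(h(x) != y for x, y in sample))

cover = learning_to_cover(thresholds, erm, [0, 1, 2])
```

On the unlabeled sample `[0, 1, 2]` the four thresholds induce four distinct labelings, so the returned cover contains four hypotheses, one per labeling.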
In fact, we already argued in Section 2 that LearningToCover gives sampling access to a non-uniform cover, but we will restate the result here in this formulation for convenience.
Lemma 5.2 (Core Lemma: Realizable Learning Implies Non-Uniform Covering).
Let $\mathcal{A}$ be an algorithm that $(\varepsilon, \delta)$-PAC learns a class $(X, \mathcal{D}, H, \ell)$ in $n = n(\varepsilon, \delta)$ samples. Then for any $D \in \mathcal{D}$, running LearningToCover on $S_U \sim D^n$ returns a sample from a size-$\Pi_H(n)$, non-uniform $(\varepsilon, \delta)$-cover with respect to the standard distance between hypotheses:

$$d(h, h') := \mathbb{E}_{D}[\ell(h'(x), h(x))].$$

Proof 5.3.
The proof is essentially immediate from the definition of realizable PAC-learning. $\mathcal{A}$ promises that for any $h \in H$ and $D \in \mathcal{D}$, a $1 - \delta$ fraction of labeled samples $(S_U, h(S_U)) \sim D^{n(\varepsilon, \delta)}$ satisfy

$$\mathrm{err}_{D \times h, \ell}(\mathcal{A}(S_U, h(S_U))) = \mathbb{E}_{D}[\ell(h'(x), h(x))] \leq \varepsilon,$$

where $h' := \mathcal{A}(S_U, h(S_U))$. Since $C(S_U)$ contains $\mathcal{A}(S_U, h(S_U))$ for every $h$ by definition, the result follows.
This means that as long as we have blackbox access to a realizable PAC-learner and unlabeled samples from the adversary's distribution, we can simulate access to a non-uniform cover. Let's now formalize our previous intuition that this is sufficient to turn a realizable learner into an agnostic one for any finite label class. We will generalize this result to doubly-bounded loss in Section 7, but it is instructive to consider the setting of finite $Y$ first.
Theorem 5.4 (Realizable ⇒ Agnostic (Finite Label Classes)).
Let $(X, \mathcal{D}, H, \ell)$ be any class on a finite label space $Y$ with loss function $\ell: Y \times Y \to \mathbb{R}_{\geq 0}$ satisfying the identity of indiscernibles. Then Algorithm 1 is an agnostic learner with sample complexity

$$n'(\varepsilon, \delta) \leq n(c_\ell\, \varepsilon, \delta/2) + O\!\left(\frac{\log\!\left(\Pi_H(n(c_\ell\, \varepsilon, \delta/2))/\delta\right)}{\varepsilon^2}\right),$$

where $c_\ell \geq \Omega\!\left(\frac{\min_{i \neq j}\,\ell(i,j)}{\max_{i,j}\,\ell(i,j)}\right)$ is a constant depending only on $\ell$.
Proof 5.5.
Let $\mathcal{A}$ be the promised realizable learner for $(X, \mathcal{D}, H, \ell)$ with sample complexity $n(\varepsilon, \delta)$. Run LearningToCover with parameters $\varepsilon' = c_\ell\, \varepsilon$ and $\delta' = \delta/2$. We argue that the output contains some $h'$ such that $\mathrm{err}_{D,\ell}(h') \leq \mathrm{OPT} + \varepsilon/2$. Since $C(S_U)$ is finite and $\ell$ is upper bounded, a standard Chernoff bound gives that choosing an empirical risk minimizer from $C(S_U)$ based on $O\!\left(\frac{\log(|C(S_U)|/\delta)}{\varepsilon^2}\right)$ additional samples gives the desired learner. The sample complexity then follows immediately from the fact that $C(S_U)$ contains one hypothesis for every labeling of the sample, and is therefore bounded by the growth function $\Pi_H(n)$.
To see why $C(S_U)$ has this property, recall that for any $h \in H$, Lemma 5.2 states that $C(S_U)$ contains some $h'$ such that

$$\mathbb{E}_{x \sim D_X}[\ell(h'(x), h(x))] \leq c_\ell\, \varepsilon.$$

Because we assume that $\ell(i, j) = 0$ iff $i = j$, this actually implies a stronger relation: $h$ and $h'$ must be close in classification error:

$$\Pr_{x \sim D_X}[h(x) \neq h'(x)] \leq \frac{c_\ell\, \varepsilon}{\min_{i \neq j}\,\ell(i,j)}.$$

Let $h_{\mathrm{OPT}} \in H$ be an optimal hypothesis, and let $h'_{\mathrm{OPT}}$ denote its corresponding close hypothesis as promised above in the output of LearningToCover. Then by the above, we have

$$\begin{aligned}
\mathrm{err}_{D,\ell}(h'_{\mathrm{OPT}}) &= \mathbb{E}_{(x,y) \sim D}[\ell(h'_{\mathrm{OPT}}(x), y)] \\
&= \mathbb{E}_{(x,y) \sim D}[\ell(h'_{\mathrm{OPT}}(x), y) - \ell(h_{\mathrm{OPT}}(x), y) + \ell(h_{\mathrm{OPT}}(x), y)] \\
&= \mathrm{OPT} + \mathbb{E}_{(x,y) \sim D}[\ell(h'_{\mathrm{OPT}}(x), y) - \ell(h_{\mathrm{OPT}}(x), y)] \\
&\leq \mathrm{OPT} + \Pr_{D}[h_{\mathrm{OPT}}(x) \neq h'_{\mathrm{OPT}}(x)] \cdot \max_{i,j}\,\ell(i,j) \\
&\leq \mathrm{OPT} + \varepsilon/2,
\end{aligned}$$

where we have used the assumption that we set $c_\ell = c \cdot \frac{\min_{i \neq j}\,\ell(i,j)}{\max_{i,j}\,\ell(i,j)}$ for some universal constant $c > 0$.
It's worth spending a moment discussing our only assumption on the loss function $\ell$: that it satisfies the identity of indiscernibles. Not only is this a natural assumption for most cases in practice (that mislabeling has non-zero loss), it is theoretically justified as well: realizable and agnostic learning aren't necessarily equivalent for $\ell$ without this property, even in the distribution-free setting.
Proposition 5.6 (Identity of Indiscernibles Lower Bound).
There exists a realizably learnable class $(X, H, \ell)$ over a finite label space $Y$ which is not agnostically learnable.
Proof 5.7.
Let the instance space $X = \mathbb{N}$ be the set of natural numbers and the label space $Y = \{0,1\}^2$. We consider the hypothesis class $H$ of all functions which output 0 as the first bit, that is:

$$H := \{h : h(x) = (0, \cdot)\ \forall x \in X\}.$$

Furthermore, writing labels as pairs $(a, b)$ of first and second bits, we define the loss function $\ell: Y \times Y \to \{0, 1, T\}$ as

$$\ell((a_1, b_1), (a_2, b_2)) := \begin{cases} 0 & \text{if } a_1 = a_2 \\ 1 & \text{if } a_1 \neq a_2 \text{ and } b_1 = b_2 \\ T & \text{otherwise.} \end{cases}$$
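In code, the piecewise loss reads as follows (our reconstruction of the three cases). Note that it deliberately violates the identity of indiscernibles: distinct labels agreeing in the first bit incur zero loss.

```python
def ell(y1, y2, T=100):
    """Loss from Proposition 5.6: 0 when the first bits agree, 1 when
    the first bits differ but the second bits agree, T when both bits
    differ. T is a free parameter; 100 is an arbitrary choice here."""
    (a1, b1), (a2, b2) = y1, y2
    if a1 == a2:
        return 0
    if b1 == b2:
        return 1
    return T

# identity of indiscernibles fails: distinct labels with zero loss
assert ell((0, 0), (0, 1)) == 0
```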
Note that $(X, H, \ell)$ is trivially learnable in the realizable setting simply by returning any $h \in H$. On the other hand, we will show it is only $O(T)$-agnostically learnable. First, notice that for any labeling $f: X \to Y$ there exists a hypothesis $h \in H$ which matches $f$ on the second bit, and therefore for any marginal $D$ over $X$:

$$\mathrm{OPT} \leq \mathrm{err}_{D, f}(h) \leq 1.$$

As a result, it suffices to show that for every $n \in \mathbb{N}$ and (randomized) algorithm $\mathcal{A}$ using $n$ samples, there exists a labeling $f: X \to Y$ and marginal distribution $D_X$ such that

$$\mathbb{E}_{S \sim D_X^n}\, \mathbb{E}_{x \sim D_X}\left[\ell(\mathcal{A}(S, f(S))(x), f(x))\right] \geq T/12. \tag{1}$$

As long as this holds, Markov's inequality gives that every algorithm must have error at least $\Omega(T)$ with constant probability.
For simplicity, we will restrict our attention in the rest of the proof to the marginal distribution $D_X$ which is uniform over the set $[m]$ for some natural number $m$ we will fix later. To prove Equation (1), by Yao's minimax principle it is enough to prove there is a distribution $U$ over functions $f: [m] \to Y$ such that any deterministic algorithm $\mathcal{A}$ has expected loss at least $T/12$ over $U$:

$$\mathbb{E}_{f \sim U}\, \mathbb{E}_{S \sim D_X^n}\, \mathbb{E}_{x \sim D_X}\left[\ell(\mathcal{A}(S, f(S))(x), f(x))\right] \geq T/12.$$
We now show that the above holds for $\mathcal{P}$ being uniform over all functions from $[N]$ to $Y$ for $N = 2n$. Here, we have that
$$\mathbb{E}_{x \sim D_f}\,\mathbb{E}_{f \sim \mathcal{P}}\,\mathbb{E}_{S \sim D_f^n}\left[\ell(A(S,f(S))(x),\, f(x))\right] \ge \mathbb{E}_{x \sim D_f}\left[\Pr_{S \sim D_f^n}[x \notin S] \cdot c/4\right],$$
where the last step follows from noting that for any value $(b,\sigma)$ that $A(S,f(S))$ assigns to $x \notin S$, $f(x)$ will be $(1-b, 1-\sigma)$ with probability $1/4$, incurring a loss of $c$. The result then follows by noting that for every $x \in [N]$:
$$\Pr_{S \sim D_f^n}[x \notin S] = \left(1 - \frac{1}{N}\right)^n \ge 1/e$$
since $1 - x \ge \exp\{-x/(1-x)\}$ and $N = 2n$. Therefore, we get that
$$\mathbb{E}_{x \sim D_f}\,\mathbb{E}_{f \sim \mathcal{P}}\,\mathbb{E}_{S \sim D_f^n}\left[\ell(A(S,f(S))(x),\, f(x))\right] \ge c/(4e),$$
which completes the proof.
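The two numeric facts driving the final bound, $(1-1/N)^n \ge 1/e$ for $N = 2n$ and $c/(4e) \ge c/12$, are easy to confirm directly; the following quick check (ours, not part of the paper) verifies both:

```python
import math

# the sampling bound used above: Pr[x not in S] = (1 - 1/(2n))^n >= 1/e
for n in range(1, 5000):
    assert (1 - 1 / (2 * n)) ** n >= 1 / math.e

# and the final comparison: the bound c/(4e) dominates the claimed c/12
assert 1 / (4 * math.e) >= 1 / 12
```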
Note that this bound holds even if $A$ is allowed to be improper. It is worth noting that if we are willing to increase the size of $Y$, the learner's error in this bound can actually be increased all the way to $c$, the maximum possible (see Proposition 6.8). This is in fact tight, as we will show that any loss function like the above satisfying a $c$-approximate triangle inequality can be $c$-agnostically learned (that is, learned to within $c \cdot OPT + \varepsilon$ error).
6 Four Modification Archetypes
6.1 Discretization: Infinite Label Classes
In the previous section, we showed that our base reduction characterizes the equivalence of realizable and agnostic learning for loss functions satisfying the identity of indiscernibles for all finite label classes. In this section, we discuss a technique called discretization that allows our reduction to extend this result to infinite label classes. Normally, it's clear that when $Y$ is infinite our standard reduction will fail: since the total number of possible labelings of a finite sample may be infinite, LearningToCover may output an infinite set. In fact, this is more than a technical barrier: realizable and agnostic learning simply aren't equivalent for infinite label classes.
Proposition 6.1.
Let $\ell$ be any continuous loss function (in the first variable) over $\mathbb{R}$ satisfying the identity of indiscernibles. Then there exists a class $(X, \mathcal{D}, H, \ell)$ which is realizably learnable but not agnostically learnable.
Proof 6.2.
Let $X = \mathbb{N}$ and $Y = [0,2]$. To construct our class, we first consider the set of all boolean functions over $X$ with finite support. Each function $c$ in this class may equivalently be thought of as a binary string in $\{0,1\}^*$. Denote the corresponding decimal value of this string in $[0,1]$ by $\sigma_c$. To construct $H$, for every boolean function $c : \mathbb{N} \to \{0,1\}$ with finite support, include in $H$ the function $h_c(x) = c(x) + \sigma_c$. Note that $(X, H)$ is clearly realizably learnable under any distribution family $\mathcal{D}$ and any loss function, since a single sample always uniquely determines $h_c$. On the other hand, adding even the smallest amount of noise erases this unique identification, making the class impossible to learn.

More formally, let $\mathcal{D}$ be the family of all distributions. By the continuity of $\ell$ and the fact that $\ell(0,0) = \ell(1,1) = 0$, notice that for all $\varepsilon > 0$ there exists $\gamma = \gamma(\varepsilon) > 0$ such that $\max_{0 \le \gamma' \le \gamma}\{\ell(\gamma',0), \ell(1+\gamma',1)\} < \varepsilon$. Let $n_\gamma \in \mathbb{N}$ be the index of the first non-zero digit in the binary representation of $\gamma$. The idea is to note that beyond these first $n_\gamma$ coordinates, our class is within $\varepsilon$ of an arbitrary boolean function. More formally, notice that for any distribution $D$ and boolean function $f$ which is $0$ on $[n_\gamma]$, we have that $OPT_H(f) := \min_{h \in H}\{\mathrm{err}_{D \times f, \ell}(h)\} \le \varepsilon$. The bound then follows from the fact that such arbitrary functions are not learnable.

In more detail, Yao's minimax principle states that it is sufficient to show that for any potential sample complexity $n(\varepsilon,\delta)$, there exists a randomized strategy for the adversary such that no deterministic learner can achieve $OPT + c$ accuracy with constant probability for some constant $c > 0$. To this end, consider the following strategy: the adversary chooses the uniform distribution over $[n_\gamma, n_\gamma + 2n(\varepsilon,\delta)]$, and a binary function on that interval uniformly at random (recall $OPT$ is at most $\varepsilon$ for every such function). Since the learner can only see $1/2$ of the mass, any strategy must be incorrect on half of the remaining points in expectation. In particular, conditioned on any sample, the expected loss of any predicted labeling of an unseen point is at least
$$\ell_{\text{min-err}} = \min_{y \in [0,2]}\left\{\frac{\ell(y,0) + \ell(y,1)}{2}\right\}$$
(since each unseen label appears with $1/2$ probability conditioned on the learner's sample). The total expected loss of any strategy is then at least $\ell_{\text{min-err}}/2$, which is bounded away from $0$. Setting $\varepsilon$ and $\delta$ sufficiently small then gives the desired result.
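To make the "single sample determines $h_c$" step concrete, here is a small sketch of the encoding in exact rational arithmetic (the normalization that a support string ends at its last $1$ is our own assumption, made so that the map $c \mapsto \sigma_c$ is injective):

```python
from fractions import Fraction

def sigma(bits):
    # binary-fraction value 0.b1b2... in [0, 1] of the support string; strings
    # are normalized to end at their last 1 so the encoding is injective
    return Fraction(int("".join(map(str, bits)) or "0", 2), 2 ** len(bits))

def h(bits, x):
    # the hypothesis h_c(x) = c(x) + sigma_c for a finite-support c
    cx = bits[x - 1] if 1 <= x <= len(bits) else 0
    return cx + sigma(bits)

def decode(value):
    # one labeled example reveals everything: the integer part is c(x), and
    # expanding the fractional part in binary recovers sigma_c, hence all of c
    frac = value - int(value)
    bits = []
    while frac:
        frac *= 2
        bits.append(1 if frac >= 1 else 0)
        frac -= bits[-1]
    return bits

c = [1, 0, 1]                # c(1)=1, c(2)=0, c(3)=1, and 0 elsewhere
assert decode(h(c, 2)) == c  # a single sample pins down the whole function
assert int(h(c, 1)) == 1     # ...including the label c(x) itself
```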
Proposition 6.1 relies crucially on the fact that the adversary can erase a significant amount of information with a very small label perturbation. In the rest of this section, we'll discuss a technique for modifying our reduction that shows this is essentially the only barrier between realizable and agnostic learning (at least for a broad class of loss functions). The key is to require a slightly stronger notion of learnability based upon discretization.
Definition 6.3 (Discretization).
We say $(X, \mathcal{D}, H', \ell)$ is an $\varepsilon$-discretization of $(X, \mathcal{D}, H, \ell)$ if the following three conditions hold:
1. $H'$ is probably bounded. That is, for all $n \in \mathbb{N}$, $\delta > 0$, and $D \in \mathcal{D}$ there exists a bound $N(n,\delta) \in \mathbb{N}$ such that:
$$\Pr_{S \sim D^n}\left[\left|\Pi(H'|_S)\right| \le N(n,\delta)\right] \ge 1 - \delta,$$
where $\Pi(H'|_S)$ denotes the set of distinct labelings $H'$ induces on $S$.
2. $H'$ point-wise $\varepsilon$-covers $H$ with respect to $\ell$. That is, for all $h \in H$, there exists $h' \in H'$ satisfying:
$$\forall x \in X : \ell(h'(x), h(x)) \le \varepsilon.$$
3. $H'$ is always useful. That is, for all $h' \in H'$, there exists $h \in H$ such that:
$$\forall x \in X : \ell(h'(x), h(x)) \le \varepsilon.$$
Note that most realistic settings have reasonable discretizations (e.g., it is enough to have some Lipschitz-like condition and a weak tail-bound on the loss). We now define a basic notion of learnability based on discretization which essentially serves to rule out adversarial constructions in the vein of Proposition 6.1.
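As a toy illustration of the Lipschitz remark above (the hypothesis class and the use of absolute loss are our own assumptions, not the paper's): rounding real-valued predictions to an $\varepsilon$-grid produces a class that point-wise covers the original and whose members each stay close to their source, in the spirit of conditions 2 and 3 of Definition 6.3.

```python
import math

def make_discretization(hypotheses, eps):
    # round each real-valued prediction to the nearest point of an eps-grid;
    # under absolute loss this moves any prediction by at most eps/2
    return [lambda x, h=h: round(h(x) / eps) * eps for h in hypotheses]

# a tiny class of sigmoid predictors and its 0.1-discretization
H = [lambda x, a=a: 1 / (1 + math.exp(-a * x)) for a in (0.5, 1.0, 2.0)]
H_eps = make_discretization(H, eps=0.1)

# point-wise cover and "always useful" both hold with slack eps/2 here
assert all(abs(hp(x) - h(x)) <= 0.05 + 1e-12
           for h, hp in zip(H, H_eps) for x in (-3.0, -1.0, 0.0, 1.0, 3.0))
```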
Definition 6.4 (Discretely-learnable).
We say $(X, \mathcal{D}, H, \ell)$ is discretely-learnable with sample complexity $n(\varepsilon,\delta)$ if there is some constant $c_1 > 0$ such that for all $\varepsilon, \delta > 0$ there exists an $\varepsilon$-discretization $H_\varepsilon$ which is $(c_1\varepsilon, \delta)$-PAC-learnable in at most $n(\varepsilon,\delta)$ samples. We call the learner proper if it outputs hypotheses in $H$.
We'll prove that discrete and agnostic learnability are equivalent as long as the loss satisfies an approximate triangle inequality.
Definition 6.5 (Approximate pseudometric).
We call a loss function $\ell : Y \times Y \to \mathbb{R}_{\ge 0}$ a $c$-approximate pseudometric if for all triples $y_1, y_2, y_3 \in Y$:
$$\ell(y_1, y_3) \le c\left(\ell(y_1, y_2) + \ell(y_2, y_3)\right).$$
Approximate pseudometrics are natural choices for loss functions in practice, capture a broad set of scenarios including finite-range losses and standard setups such as $\ell_p$-regression, and have seen some previous study in the literature [23]. By modifying the first step of our reduction to take discretization into account and leveraging the approximate triangle inequality in the second, we prove that discrete learnability and $c$-agnostic learnability are equivalent under $c$-approximate pseudometrics.
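For a concrete instance of the definition, squared loss is a 2-approximate pseudometric, which follows from $(u+v)^2 \le 2u^2 + 2v^2$; the randomized check below (our own sanity test) confirms it:

```python
import random

def sq_loss(a, b):
    return (a - b) ** 2

# randomized check that squared loss is a 2-approximate pseudometric:
# (x - z)^2 <= 2((x - y)^2 + (y - z)^2), from (u + v)^2 <= 2u^2 + 2v^2
rng = random.Random(0)
for _ in range(10_000):
    x, y, z = (rng.uniform(-10, 10) for _ in range(3))
    assert sq_loss(x, z) <= 2 * (sq_loss(x, y) + sq_loss(y, z)) + 1e-9
```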
Theorem 6.6.
Let $\ell : Y \times Y \to \mathbb{R}_{\ge 0}$ be a bounded $c$-approximate pseudometric. Then the following are equivalent for all $(X, \mathcal{D}, H, \ell)$:
1. $(X, \mathcal{D}, H, \ell)$ is discretely-learnable.
2. $(X, \mathcal{D}, H, \ell)$ is $c$-agnostically learnable.
Proof 6.7.
The proof is similar to Theorem 5.4. We first show the forward direction. Assume $(X, \mathcal{D}, H, \ell)$ is discretely-learnable. Fix $\varepsilon' = \frac{\varepsilon}{4c^2c_1}$ (where $c_1$ is the constant from Definition 6.4), and let $H_{\varepsilon'}$ be a learnable $\varepsilon'$-discretization of $H$. We argue that running LearningToCover on $H_{\varepsilon'}$ gives the desired agnostic learner. Since $\ell$ is bounded, it is sufficient to prove that $C(S_U)$ contains a hypothesis $h'$ such that $\mathrm{err}_{D,\ell}(h') \le c \cdot OPT + \varepsilon/2$. Empirical risk minimization then works as in the finite case.

Let $h_{OPT} \in H$ be an optimal hypothesis. Since $H_{\varepsilon'}$ is a discretization of $H$, there exists $h^{\varepsilon'}_{OPT} \in H_{\varepsilon'}$ such that:
$$\forall x \in X : \ell(h_{OPT}(x), h^{\varepsilon'}_{OPT}(x)) < \varepsilon'.$$
Further, by the guarantees of discrete learnability, with probability at least $1 - \delta/2$ there exists $h' \in C(S_U)$ which is close to $h^{\varepsilon'}_{OPT}$ in the following sense:
$$\mathbb{E}_{x \sim D_X}\left[\ell(h'(x), h^{\varepsilon'}_{OPT}(x))\right] \le c_1\varepsilon' = \frac{\varepsilon}{4c^2}.$$
Plugging in the previous observation and applying our approximate triangle inequality, we get that $h'$ is close to $h_{OPT}$ in the following sense:
$$\mathbb{E}_{x \sim D_X}\left[\ell(h'(x), h_{OPT}(x))\right] \le c\left(\mathbb{E}_{x \sim D_X}\left[\ell(h'(x), h^{\varepsilon'}_{OPT}(x))\right] + \mathbb{E}_{x \sim D_X}\left[\ell(h^{\varepsilon'}_{OPT}(x), h_{OPT}(x))\right]\right) \le \frac{\varepsilon}{2c}.$$
The final step is to transfer from the marginal $D_X$ to the full joint distribution of the adversary, which follows immediately from a similar application of the approximate triangle inequality. This is the only step that loses a factor in the OPT term:
$$\mathrm{err}_{D,\ell}(h') = \mathbb{E}_{(x,y) \sim D}\left[\ell(h'(x), y)\right] \le c\left(\mathbb{E}_{(x,y) \sim D}\left[\ell(h'(x), h_{OPT}(x))\right] + \mathbb{E}_{(x,y) \sim D}\left[\ell(h_{OPT}(x), y)\right]\right) \le c \cdot OPT + \varepsilon/2,$$
as desired.
We now prove the reverse direction, which is essentially immediate. Assume the existence of a $c$-agnostic learner for $(X, \mathcal{D}, H, \ell)$. Given a discretization $H_\varepsilon$, we want to show $(X, \mathcal{D}, H_\varepsilon, \ell)$ is learnable to within $c_1\varepsilon$ error for some $c_1 > 0$. This is achieved simply by running the agnostic learner on $(X, \mathcal{D}, H_\varepsilon, \ell)$. Since $H_\varepsilon$ is "always useful", every $h \in H_\varepsilon$ is $\varepsilon$-close to some $h' \in H$ in the sense that:
$$\forall x \in X : \ell(h'(x), h(x)) \le \varepsilon.$$
In particular, this means that for any choice of $h$ by the adversary there exists $h' \in H$ with low error:
$$\mathrm{err}_{D,\ell}(h') = \mathbb{E}\left[\ell(h'(x), h(x))\right] \le \varepsilon.$$
As a result, running the $c$-agnostic learner for $(X, \mathcal{D}, H, \ell)$ returns a hypothesis of at most $(c+1)\varepsilon$ error with high probability.
It is worth noting that bounded loss is not really necessary for Theorem 6.6. More generally, we can require that $(X, \mathcal{D}, H, \ell)$ is "finitely learnable" in the sense that for all finite subsets $H' \subseteq H$, $(X, \mathcal{D}, H', \ell)$ is agnostically learnable. When $\ell$ is bounded, this is true for any finite class by empirical risk minimization. When $\ell$ is unbounded, we can still get finite learnability if $\ell(h(\cdot), \cdot)$ satisfies some concentration bounds [48] (such classes are discretizable since one can simply truncate the loss). It is also possible to achieve improved convergence rates by using algorithms other than empirical risk minimization for the finite learner (for example a generalization of the median-of-means estimator [37]), but our reduction will suffer an additional factor of $c$ over applying such techniques directly due to the size of the non-uniform cover.
It is also worth noting that various modifications to the definition of loss (e.g., defining loss between hypotheses rather than on $Y$ directly) will continue to work with the above. Similarly, there are various cases when one can get better than $c \cdot OPT$ accuracy for a $c$-approximate pseudometric, generally by instead optimizing over some surrogate loss function. For instance, if a simple transformation of the loss gives a $c'$-approximate pseudometric for $c' < c$, then one can generally learn up to $c' \cdot OPT$. As an example, note that while square loss $\ell_2(x,y) = (x-y)^2$ is a 2-approximate pseudometric, taking $\sqrt{\mathrm{err}_{D,\ell_2}}$ gives a true metric between hypotheses. As a result, as long as $OPT$ is bounded, we can get truly agnostic learning by optimizing $\sqrt{\mathrm{err}_{D,\ell_2}}$ instead. This strategy works for any polynomial loss, such as $\ell_p(x,y) = |x - y|^p$.
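The claim that the square root of squared error is a true metric can be sanity-checked on empirical averages, where it is just a scaled Euclidean distance between prediction vectors (this randomized check is our own illustration):

```python
import math, random

def rmse(u, v):
    # square root of the average squared loss between two prediction vectors:
    # a scaled Euclidean distance, hence a genuine metric
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))

# randomized triangle-inequality check
rng = random.Random(1)
for _ in range(1000):
    u, v, w = ([rng.uniform(-5, 5) for _ in range(8)] for _ in range(3))
    assert rmse(u, w) <= rmse(u, v) + rmse(v, w) + 1e-9
```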
On the other hand, outside of these special cases, Theorem 6.6 is tight: there exist $c$-approximate pseudometric loss functions which cannot be $c'$-agnostically learned for any $c' < c$. The argument is similar to Proposition 6.1, but requires a bit more care.
Proposition 6.8.
There exists a discretely-learnable class over a $c$-approximate pseudometric that is not $c'$-agnostically learnable for any $c' < c$.
Proof 6.9.
The proof is similar to Proposition 5.6. We consider the same instance space $X = \mathbb{N}$ and hypothesis class $H$:
$$H = \{h : h(x) = (0, \cdot)\ \forall x \in X\},$$
i.e., all hypotheses whose first coordinate is identically $0$. The loss function $\ell : Y \times Y \to \{0,1,c\}$ is also the same, but extended to the larger domain $Y = \mathbb{N}^2$:
$$\ell((b_1,\sigma_1),(b_2,\sigma_2)) = \begin{cases} 0 & b_1 = b_2 \\ 1 & b_1 \neq b_2 \text{ and } \sigma_1 = \sigma_2 \\ c & \text{otherwise.} \end{cases}$$
As before, note that $(X, H, \ell)$ is trivially realizably learnable by always returning any $h \in H$, $\ell$ is a $c$-approximate pseudometric by definition, and for any labeling $f : X \to Y$ there exists $h \in H$ such that for all distributions $D$:
$$OPT \le \mathrm{err}_{D,f}(h) \le 1.$$
We now show that the class $(X, H, \ell)$ is only $c$-agnostically learnable. Since $OPT \le 1$, it suffices to show that for every $n \in \mathbb{N}$, large enough $k \in \mathbb{N}$, and randomized algorithm $A$ on $n$ samples, there exists a labeling $f : X \to Y$ and a marginal distribution $D_f$ such that:
$$\mathbb{E}_{S \sim D_f^n}\,\mathbb{E}_{x \sim D_f}\left[\ell(A(S,f(S))(x),\, f(x))\right] \ge \left(1 - \frac{1}{k}\right)^3 c. \tag{2}$$
For $k \ge \frac{1}{1 - \left(1 - (c - c')/c\right)^{1/3}}$, applying Markov's inequality to Equation 2 implies that $A$ has error at least $c'$ with constant probability.
For simplicity, we now restrict our attention to the marginal $D_f$ which is uniform over the set $[N]$ for some $N \in \mathbb{N}$ to be fixed. By Yao's minimax principle, it is enough to prove that there exists a distribution $\mathcal{P}$ over functions $f : [N] \to [k]^2$ such that for any deterministic algorithm $A$ the following holds:
$$\mathbb{E}_{f \sim \mathcal{P}}\,\mathbb{E}_{S \sim D_f^n}\,\mathbb{E}_{x \sim D_f}\left[\ell(A(S,f(S))(x),\, f(x))\right] \ge \left(1 - \frac{1}{k}\right)^3 c.$$
We now show that the above holds for $\mathcal{P}$ being uniform over all functions from $[N]$ to $[k]^2$ when $N = \frac{2n}{\ln(k/(k-1))}$. Similar to Proposition 5.6, we have that
$$\mathbb{E}_{x \sim D_f}\,\mathbb{E}_{f \sim \mathcal{P}}\,\mathbb{E}_{S \sim D_f^n}\left[\ell(A(S,f(S))(x),\, f(x))\right] \ge \mathbb{E}_{x \sim D_f}\left[\Pr_{S \sim D_f^n}[x \notin S] \cdot \left(1 - \frac{1}{k}\right)^2 c\right],$$
since no matter the assignment $A$ gives to $x \notin S$, it will be wrong on both coordinates with probability $(1 - 1/k)^2$ over the randomness of $f$. The result follows by noting that for every $x \in [N]$:
$$\Pr_{S \sim (D_f \times f(D_f))^n}[(x, \cdot) \notin S] = \left(1 - \frac{1}{N}\right)^n \ge 1 - \frac{1}{k}$$
since $1 - x \ge \exp\{-x/(1-x)\}$, and we have assumed $N = \frac{2n}{\ln(k/(k-1))}$ and $n \ge 1$. Therefore, we get that
$$\mathbb{E}_{x \sim D_f}\,\mathbb{E}_{f \sim \mathcal{P}}\,\mathbb{E}_{S \sim D_f^n}\left[\ell(A(S,f(S))(x),\, f(x))\right] \ge \left(1 - \frac{1}{k}\right)^3 c,$$
which completes the proof.
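The choice $N = 2n/\ln(k/(k-1))$ and the resulting bound $(1 - 1/N)^n \ge 1 - 1/k$ can be verified numerically (a quick check of our own):

```python
import math

# with N = 2n / ln(k/(k-1)), the miss probability satisfies
# (1 - 1/N)^n >= 1 - 1/k, as used in the proof above
for n in (1, 5, 50, 500):
    for k in (2, 3, 10, 100):
        N = 2 * n / math.log(k / (k - 1))
        assert (1 - 1 / N) ** n >= 1 - 1 / k
```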
6.2 Sub-sampling: Malicious Noise
Now that we've seen how to handle practical problems like regression over infinite label spaces, we'll discuss a technique that helps handle data corruption and data-dependent assumptions: sub-sampling. The main idea is as follows. Say that the original unlabeled sample we draw is, in some sense, partially corrupted: perhaps an adversary has changed some fraction of examples (malicious noise), or some portion of the sample is un-realizable for a concept in the class (robust and partial learning). In either case, there generally exists a core subset of "clean" samples that we can use to recover the guarantees of LearningToCover. Since we cannot necessarily identify these, the idea is to run LearningToCover over enough subsets of the unlabeled sample that we find a clean subsample with high probability. In this section we'll discuss the application of this technique in detail to Kearns and Li's [41] well-studied malicious noise model. In Sections 8 and 9, we discuss further applications of the method to adversarially robust and partial learning.
To start, let's recall the standard malicious noise model. In this variant of PAC learning, instead of having access to the standard sample oracle from the adversary's distribution $D$ over $X \times Y$, we have access to a malicious oracle $\mathcal{O}_\eta(\cdot)$ which, with probability $\eta$, outputs an adversarially chosen pair $(x,y)$, and otherwise samples from $D$ as usual.
Definition 6.10 (PAC-learning with Malicious Noise).
We call $(X, \mathcal{D}, H, \ell)$ (agnostically) $(\varepsilon,\delta)$-learnable with malicious noise at rate $\eta = \eta(\varepsilon)$ if there exists an algorithm $A$ and function $m = m_{mal}(\varepsilon,\delta)$ such that for all $\varepsilon, \delta > 0$, and distributions $D$ over $X \times Y$ satisfying:
1. The marginal $D_X \in \mathcal{D}$,
$A$ outputs a good hypothesis with high probability over samples of size $m(\varepsilon,\delta)$ drawn from the malicious oracle:
$$\Pr_{S \sim \mathcal{O}_\eta(\cdot)^m}\left[\mathrm{err}_{D,\ell}(A(S)) > OPT + \varepsilon\right] \le \delta,$$
where $OPT = \min_{h \in H}\{\mathrm{err}_{D,\ell}(h)\}$.
In other words, malicious noise essentially gives a worst-case formalization of the idea that an $\eta$-fraction of the learner's data is (adversarial) garbage.
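The oracle $\mathcal{O}_\eta$ itself is easy to sketch (the stand-in distributions in the usage below are our own illustrative choices, not from the paper):

```python
import random

def malicious_oracle(draw_clean, adversary, eta, rng):
    # with probability eta return an adversarially chosen pair (x, y),
    # otherwise draw honestly from the underlying distribution D
    def sample():
        return adversary() if rng.random() < eta else draw_clean()
    return sample

# toy usage: a clean source that always labels 1, an adversary planting 0s
rng = random.Random(0)
clean = lambda: (rng.uniform(0, 1), 1)
adv = lambda: (0.5, 0)
oracle = malicious_oracle(clean, adv, eta=0.1, rng=rng)
corrupted = sum(1 for _ in range(1000) if oracle()[1] == 0)
assert 50 < corrupted < 200  # roughly a 10% corrupted fraction
```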
Let's now formalize the argument above: modifying LearningToCover to run over subsamples gives a sample-efficient algorithm for learning with malicious noise. For readability, we'll (somewhat informally) restate the algorithm with this change.
Input: Realizable PAC-Learner $\mathcal{A}$, Accuracy Parameter $\varepsilon < 1/2$, Noise Parameter $\eta < \frac{\varepsilon}{1+\varepsilon}$, Unlabeled Sample Oracle $\mathcal{O}_U$, Labeled Sample Oracle $\mathcal{O}_L$
Algorithm:
1. Draw an unlabeled sample $S_U \sim \mathcal{O}_U$, and labeled sample $S_L \sim \mathcal{O}_L$.
2. Run LearningToCover over all $S \in (S_U)_{\eta'} := \{S \subseteq S_U : |S| = \lceil(1-\eta')|S_U|\rceil\}$, where:
$$\eta' = \frac{3\eta + \frac{\varepsilon}{1+\varepsilon}}{4}.$$
3. Return the hypothesis in $C(S_U) := \bigcup_{S \in (S_U)_{\eta'}} C(S)$ with lowest empirical error over $S_L$.
Algorithm 3 Malicious to Realizable Reduction
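A brute-force rendering of Algorithm 3 (our own sketch; the exponential enumeration of subsamples is intentional since the reduction is information-theoretic, and `realizable_learner` is an assumed stand-in for LearningToCover that returns a finite candidate set per subsample):

```python
from itertools import combinations
from math import ceil

def malicious_to_realizable(realizable_learner, S_U, S_L, eta_prime, loss):
    # enumerate every subsample keeping a (1 - eta') fraction of the unlabeled
    # data, collect the candidate hypotheses produced on each, and return the
    # empirical risk minimizer over the labeled sample S_L
    keep = ceil((1 - eta_prime) * len(S_U))
    cover = []
    for S in combinations(S_U, keep):
        cover.extend(realizable_learner(list(S)))
    def emp_err(h):
        return sum(loss(h(x), y) for x, y in S_L) / len(S_L)
    return min(cover, key=emp_err)

# toy usage: thresholds on [0, 1] with 0/1 loss; the one-hypothesis "learner"
# below is a hypothetical stand-in, not the paper's LearningToCover
learner = lambda S: [lambda x, t=max(S, default=0): int(x > t)]
S_U = [0.1, 0.2, 0.9]
S_L = [(0.05, 0), (0.95, 1)]
h = malicious_to_realizable(learner, S_U, S_L, 0.34, lambda p, y: int(p != y))
assert h(0.95) == 1 and h(0.05) == 0
```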
We now prove that Algorithm 3 gives an (agnostic) learner that is tolerant to malicious noise.
Theorem 6.11.
Let $(X, \mathcal{D}, H)$ be realizably PAC-learnable with sample complexity $n(\varepsilon,\delta)$. Then for any $\eta < \frac{\varepsilon}{1+\varepsilon}$, Algorithm 3 is an agnostic learner for $(X, \mathcal{D}, H)$ tolerant to $\eta$ malicious noise. Furthermore, letting $\Delta = \frac{\varepsilon}{1+\varepsilon} - \eta$ and $\beta = \left(1 + \frac{\eta}{\Delta}\right)\log\left(\frac{1}{\eta}\right)$, its sample complexity is at most:
$$m_{mal}(\varepsilon,\delta) \le O\left(\frac{\beta\, n(\Delta/4, \delta/4)}{\Delta} + \frac{\log\left(\Pi_H\left(O\left(\frac{n(\Delta/2,\delta/2) + \eta^2\log(1/\delta)}{\Delta^2}\right)\right)\right) + \log(1/\delta)}{\Delta^2} + \frac{\beta\,\eta^2\log(1/\delta)}{\Delta^3}\right),$$
where we've assumed $\varepsilon < 1/2$ for simplicity.
Proof 6.12.
To start, we'll review for completeness a fairly standard analysis of empirical risk minimization under malicious noise. Assume for the moment that the output of LearningToCover, $C(S_U)$, contains a hypothesis $h'$ satisfying $\mathrm{err}(h') \le OPT + \beta_1$. Say we draw $m$ labeled samples for the ERM step, and an $\eta' = \eta + \beta_2$ fraction are corrupted by the adversary. For large enough $m$, we can assume by a Chernoff and union bound that the empirical loss of every hypothesis returned by LearningToCover is at most some $\beta_3$ away from its true loss on the un-corrupted portion of $S$ (we will make all these assumptions formal in a moment). Given these facts, notice that the empirical loss of $h'$ on $S$ is at most:
$$\mathrm{err}_S(h') \le (1-\eta')(OPT + \beta_1 + \beta_3) + \eta'.$$
On the other hand, the empirical error of any $h \in H$ whose true error is greater than $OPT + \varepsilon$ is at least:
$$\mathrm{err}_S(h) \ge (1-\eta')(OPT + \varepsilon - \beta_3).$$
To ensure that our ERM works, it is enough to show that for any such $h$, $\mathrm{err}_S(h) > \mathrm{err}_S(h')$. A simple calculation shows that this is satisfied as long as $\beta_1 + 2\beta_3 \le \varepsilon$ and $\beta_1 + \beta_2 + 2\beta_3 \le \Delta$. Setting $\beta_1 = \beta_2 = \beta_3 = \Delta/4$ gives the desired result.

It is left to argue that our assumptions above hold with high probability. First, note that by a Chernoff bound, the probability that $\eta' \ge \eta + \Delta/4$ on a set of $m$ samples is at most $e^{-c(\Delta/\eta)^2 m}$ for some $c > 0$, so this occurs with high probability as long as $m \ge \Omega(\log(1/\delta)\,\eta^2/\Delta^2)$. Similarly, the empirical error of every hypothesis in $C(S_U)$ on the remaining clean samples will be within $\Delta/4$ of its true error as long as $m \ge \Omega(\log(|C(S_U)|/\delta)\,\eta^2/\Delta^2)$.

It is then left to show that $C(S_U)$ contains a hypothesis of error at most $OPT + \Delta/4$. To show this, it is enough to ensure that we run LearningToCover over a clean subsample of size at least $n(\Delta/4, \delta/4)$ with high probability. If we draw
$$|S_U| = O\left(\frac{n(\Delta/4, \delta/4)}{1-\eta'} + \frac{\log(1/\delta)\,\eta^2}{\Delta^2}\right)$$
unlabeled samples, a similar Chernoff bound to the above promises that at most an $\eta'$ fraction are corrupted with high probability, and therefore that at least $n(\Delta/4, \delta/4)$ remain un-corrupted. Running LearningToCover over all subsets of size $(1-\eta')|S_U|$ then gives the desired result. The sample complexity bound follows from choosing $m$ large enough to satisfy the above conditions, along with the fact that $|C(S_U)| \le \binom{m}{(1-\eta')m}\,\Pi_H(m)$.
A few remarks are in order. First, the error tolerance of Theorem 6.11 is tight. In their original introduction of malicious noise, Kearns and Li [41] proved that for most non-trivial concept classes, no PAC-learner can tolerate $\frac{\varepsilon}{1+\varepsilon}$ malicious noise. Second, we note that Kearns and Li also give a reduction procedure from a base learner which achieves better dependence on $\Delta$ above; however, it allows the learner to query specifically for positive or negative examples, which is a stronger setup than our model. Finally, we note that in the case of finite VC dimension, the above is off from known (randomized) bounds by a factor of roughly $\Delta^2$ [21]. On the other hand, Theorem 6.11 has the advantage of extending far beyond the VC setting, including to the general loss models discussed in Section 6.1. The proof remains mostly the same (with optimal error tolerance differing slightly), and is omitted.
Since our agnostic model restricts the adversary to choosing a distribution whose marginal lies in the original family, Theorem 6.11 provides the first insight on robustness against an adversary who can corrupt the underlying data as well as the labels. One might wonder whether this result can be pushed further: is it possible to be robust against an adversary who can corrupt the marginal over $X$ in some stronger sense? Unfortunately, the answer is no: malicious noise is essentially the most distributional corruption we can handle. Let's look at two basic lower bounds to see why. First, we'll consider an adversary who can remove a portion of the learner's sample.
Proposition 6.13.
For any $\delta > 0$, there exists a class $(X, \mathcal{D}, H)$ which is realizably PAC-learnable, but not learnable under an adversary who can remove a $\delta$ fraction of the learner's sample.
Proof 6.14.
This follows from a result of Dudley, Kulkarni, Richardson, and Zeitouni [31] that there exists an unlearnable class $(X, \mathcal{D}, H)$ such that for some $n(\varepsilon,\delta)$, $(X, \{D\}, H)$ is learnable in $n(\varepsilon,\delta)$ samples for every $D \in \mathcal{D}$. The lower bound then follows simply from adding an extra unique identifying point $x_D$ to $X$ for every distribution $D$, and modifying each $D \in \mathcal{D}$ to have $\Theta(\delta)$ support on $x_D$. This modified class is clearly learnable, since after drawing $O(1/\delta)$ samples, the learner will draw $x_D$ and identify the distribution $D$ with good probability. However, the class is not learnable under an adversary who removes points, since with high probability the adversary can completely remove any mention of $x_D$ from the learner's sample, reducing to the original unlearnable class $(X, \mathcal{D}, H)$.
An adversary who can add samples is similarly powerful. Consider a model in which the adversary, after choosing a marginal distribution and labeling, sees the learner's sample ahead of time and may add an additional $\gamma$ fraction of correctly labeled elements. While the realizable setting is largely unaffected by such an adversary (as any non-optimal hypothesis will still see some bad example), even near-trivial concept classes cannot be agnostically learned.
Proposition 6.15.
There exists a class $(X, H)$ which for any $\gamma > 0$ is realizably but not agnostically learnable under an adversary who can add a $\gamma$ fraction of correctly labeled points to the learner's sample.
Proof 6.16.
Let $X = \{x, x_1, x_2\}$ and $H = \{h_1, h_2\}$ be any class such that $h_1(x) = h_2(x)$, but $h_1(x_i) \neq h_2(x_i)$ for $i = 1, 2$. In the realizable setting, note that a single labeled example on $x_1$ or $x_2$ exactly determines the hypothesis. As long as there is less than $1 - \varepsilon$ mass on $x$, the learner will draw such a sample after $O(1/\varepsilon)$ samples with good probability. Further, if the mass on $x$ is at least $1 - \varepsilon$, then $h_1$ and $h_2$ are both valid outputs. As a result, any ERM is a valid PAC-learner. Since adding correctly labeled examples can only help this learner, the class remains realizably learnable under an adversary who can add an arbitrary number of clean samples.

In the agnostic setting, consider an adversary who chooses a labeling $f$ such that $f(x) = h_1(x) = h_2(x)$, $f(x_1) = h_1(x_1)$, and $f(x_2) = h_2(x_2)$. The optimal hypothesis $h_{OPT}$ is then decided by the amount of mass on $x_1$ and $x_2$ in the marginal distribution. Namely, if the adversary chooses a distribution $D$ over $\{x, x_1, x_2\}$, the optimal error is $\min\{D(x_1), D(x_2)\}$. The idea is then to note that for $\gamma' \le \gamma/4$, the learner cannot distinguish between the two following distributions:
$$D_1(x_1) = c_1\gamma', \quad D_1(x_2) = (1 - c_1)\gamma'$$
and
$$D_2(x_1) = (1 - c_1)\gamma', \quad D_2(x_2) = c_1\gamma',$$
where $0 < c_1 < 1/2$ is some small constant. Informally, if the two distributions are indistinguishable, any learner will always incur error of around $\frac{(1 - c_1)\gamma'}{2}$, whereas OPT is $c_1\gamma'$ for both distributions.
Let's now give the formal argument. By Yao's minimax principle it is enough to prove there is a strategy over distributions such that any deterministic learner has high error. In particular, if we can prove that the expected error is at least $3 \cdot OPT$, then $\Pr[\mathrm{error} \ge 2 \cdot OPT] \ge OPT$. Since OPT is just some constant $c_1\gamma'$ (dependent only on $\gamma$), this is sufficient to prove the result. Moving on, consider the strategy in which the adversary chooses the labeling described above, and chooses each marginal ($D_1$ or $D_2$) with probability $1/2$. We'll break our analysis into two cases depending on the sample complexity of the learner. If the learner uses $O(1/\gamma')$ examples, then there is a constant probability of drawing a sample consisting only of the point $x$. Let $f'$ be the hypothesis returned by the deterministic learner on this sample. By construction, $f'$ must disagree with either $h_1$ or $h_2$ on $x_1$ or $x_2$. Assume $f'$ differs on $h_1(x_1)$ (the other cases follow similarly). When the distribution is $D_2$, $f'$ has error at least $(1 - c_1)\gamma'$. Since this occurs with constant probability independent of the choice of $c_1$, choosing $c_1$ sufficiently small leads to an expected error of at least $3c_1\gamma'$ as desired.
On the other hand, when there are $m = \Omega(1/\gamma')$ samples, we claim that the adversary can force the following sample to occur with constant probability: $2\gamma'm$ instances each of $x_1$ and $x_2$, and $(1+\gamma)m - 4\gamma'm$ instances of $x$. This follows from the fact that for the appropriate choice of constant for $m$, a Chernoff bound gives that both $x_1$ and $x_2$ occur at most $2\gamma'm$ times with constant probability. Since the adversary is allowed to add $\gamma m \ge 4\gamma'm$ arbitrary examples, they can add instances of $x_1$, $x_2$, and $x$ until the above sample is achieved. The remainder of the argument is then the same as the previous case, as any learner response on this sample will incur similarly high expected error.
It is also reasonable to consider distributional corruption in the semi-supervised setting, where the unlabeled and labeled data-sets might have different underlying distributions. We discuss this model in Section 6.4.
6.3 Replacing ERM: Semi-Private Learning
So far we have focused on property generalization for two forms of noise-tolerance: agnostic learning and learning with malicious noise. In this section, we'll show how to use Algorithm 1 to generalize a broader spectrum of finitely-satisfiable properties by replacing the ERM process with a generic finite learner with the desired property. Our prototypical example will be privacy, which is well known to be finitely-satisfiable via McSherry and Talwar's [49] exponential mechanism. To start, we'll cover a few basic privacy definitions.
Definition 6.17 (Differential Privacy).
A learning algorithm $\mathcal{A}$ is said to be $\alpha$-differentially private if for all neighboring inputs $S, S'$ which differ on a single example:
$$\Pr[\mathcal{A}(S) \in E] \le e^\alpha \Pr[\mathcal{A}(S') \in E],$$
for all measurable events $E$ in the range of $\mathcal{A}$.
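As a minimal concrete instance of the definition (our own example, not from the paper): randomized response on a single bit reports the truth with probability $e^\alpha/(1+e^\alpha)$ and the flipped bit otherwise, and a direct calculation confirms the $e^\alpha$ ratio bound on neighboring inputs.

```python
import math

def randomized_response_prob(bit, output, alpha):
    # alpha-DP randomized response on one bit: report the truth with
    # probability e^alpha / (1 + e^alpha), the flipped bit otherwise
    p_true = math.exp(alpha) / (1 + math.exp(alpha))
    return p_true if output == bit else 1 - p_true

# check the DP inequality on the two neighboring one-bit datasets 0 and 1
alpha = 0.5
for out in (0, 1):
    p0 = randomized_response_prob(0, out, alpha)
    p1 = randomized_response_prob(1, out, alpha)
    assert p0 <= math.exp(alpha) * p1 * (1 + 1e-12)
    assert p1 <= math.exp(alpha) * p0 * (1 + 1e-12)
```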
The exponential mechanism is one of the most widely used techniques in privacy. Informally, the algorithm allows for differentially private selection of a "good" choice from a finite set of objects (potential hypotheses in our case). More formally, let $q : (X \times Y)^* \times H \to \mathbb{R}$ be a "score" function, and define its "sensitivity" $\Delta_q$ to be
$$\Delta_q = \max_{h \in H}\,\max_{S, S'}\,|q(S, h) - q(S', h)|,$$
where $S, S'$ range over neighbouring datasets. The exponential mechanism selects an item with a good score with high probability, while maintaining privacy.
Definition 6.18 (Exponential Mechanism [49]).
The exponential mechanism $M_E$ on inputs $S, H, q$ with privacy parameter $\alpha$ selects and outputs $h \in H$ with probability
$$\frac{\exp\left(\frac{\alpha\, q(S,h)}{2\Delta_q}\right)}{\sum_{h' \in H}\exp\left(\frac{\alpha\, q(S,h')}{2\Delta_q}\right)}.$$
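A direct implementation of the mechanism (the score function and toy error values in the usage are our own stand-ins):

```python
import math, random

def exponential_mechanism(S, H, q, sensitivity, alpha, rng=random):
    # select h in H with probability proportional to exp(alpha*q(S,h)/(2*sens));
    # subtracting the max score first keeps the exponentials numerically stable
    scores = [alpha * q(S, h) / (2 * sensitivity) for h in H]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(H, weights=weights, k=1)[0]

# toy usage: score = negative empirical error (sensitivity 1/|S|)
S = [(0.2, 0), (0.8, 1), (0.9, 1)]
H = ["h_lo", "h_hi"]
err = {"h_lo": 2 / 3, "h_hi": 0.0}  # stand-in empirical errors
q = lambda S, h: -err[h]
picks = [exponential_mechanism(S, H, q, 1 / len(S), 5.0, random.Random(i))
         for i in range(200)]
assert picks.count("h_hi") > picks.count("h_lo")  # better score wins usually
```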
It is well known that the exponential mechanism leads to a private learner for finite hypothesis classes under bounded loss.
Theorem 6.19 (Theorem 3.4 [39]).
Let $(X, \mathcal{D}, H, \ell)$ be a finite class with a bounded loss function. Then the sample complexity of $\alpha$-differentially private learning $(X, \mathcal{D}, H, \ell)$ is at most:
$$n_{priv}(\alpha, \varepsilon, \delta) \le O\left(\log\left(\frac{|H|}{\delta}\right)\max\{\varepsilon^{-2},\, \varepsilon^{-1}\alpha^{-1}\}\right).$$
We note that [39, Theorem 3.4] only considers classification loss, but the extension to bounded loss is immediate. Unfortunately, even with the power of the exponential mechanism, privacy is a very restrictive condition in the general PAC framework, since we're most often interested in infinite hypothesis classes. Indeed, even improper private learning requires finiteness of a highly restrictive measure known as representation dimension [12], which can be infinite for classes of VC dimension $1$. As a result, the past decade has seen the introduction of a number of weaker, more practical definitions of privacy. In this section we'll focus on a model introduced in 2013 by Beimel, Nissim, and Stemmer [14] called semi-private learning.
Definition 6.20 (Semi-Private Learning).
We call a class $(X, \mathcal{D}, H, \ell)$ semi-private PAC-learnable if there exists an algorithm $\mathcal{A}$ and two functions $n_{pub} = n_{pub}(\varepsilon,\delta,\alpha)$ and $n_{priv} = n_{priv}(\varepsilon,\delta,\alpha)$ such that for all $\varepsilon, \delta > 0$ and distributions $D$ over $X \times Y$ whose marginal $D_X$ is in $\mathcal{D}$, $\mathcal{A}$ satisfies the following:
1. $\mathcal{A}$ outputs a good hypothesis with high probability:
$$\Pr_{S_U \sim D_X^{n_{pub}},\ S_L \sim D^{n_{priv}}}\left[\mathrm{err}_{D,\ell}(\mathcal{A}(S_U, S_L)) > OPT + \varepsilon\right] \le \delta.$$
2. $\mathcal{A}$ is semi-private. That is, for all $S_U \in X^{n_{pub}}$:
$$\mathcal{A}(S_U, \cdot)\ \text{is}\ \alpha\text{-differentially private}.$$
In other words, semi-private learning offers a model for applications where labeled data is sensitive, but some (perhaps opt-in) users might not care about their participation itself being released. Unlike standard private learning, distribution-free semi-private classification is known to be characterized by VC dimension, just like realizable PAC-learning [14]. The best sample complexity bounds are due to Alon, Bassily, and Moran (ABM) [2], who use uniform convergence to build a uniform cover for H from unlabeled data, and then apply the exponential mechanism to the resulting cover.
Due to their reliance on uniform convergence, ABM's techniques fail in the more general settings we consider. Further, their use of uniform covers results in sub-optimal public sample complexity even for distribution-free classification. We prove in Section 13 that these objects require asymptotically more samples than non-uniform covers (at least in the distribution-family model), and therefore cannot be used to achieve optimal semi-private learning. We circumvent both of these issues by appealing directly to a realizable learner to build a weaker non-uniform cover. For readability, we first restate the algorithm here.
Input: Realizable PAC-Learner A, Unlabeled Sample Oracle O_U, Labeled Sample Oracle O_L
Algorithm:
1.
Draw an unlabeled sample S_U ∼ O_U, and labeled sample S_L ∼ O_L.
2.
Run LearningToCover over S_U to get C(S_U).
3.
Return the hypothesis in C(S_U) given by applying the exponential mechanism with respect to S_L.
Algorithm 4 Semi-Private to Realizable Reduction
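To make the reduction concrete, the following is a toy Python sketch of Algorithm 4 (our own illustration, not the paper's code). The cover is built by running a realizable base learner on every labeling of the public sample consistent with H, and the exponential mechanism then selects from the cover using the private labels. The 1-D threshold class and the trivially consistent `erm` base learner are hypothetical stand-ins.

```python
import math
import random

def learning_to_cover(H, S_U, learner):
    """Build the non-uniform cover C(S_U): run the realizable base learner
    on every labeling of the public unlabeled sample S_U consistent with
    some h in H (deduplicating identical labelings)."""
    cover, seen = [], set()
    for h in H:
        labels = tuple(h(x) for x in S_U)
        if labels not in seen:
            seen.add(labels)
            cover.append(learner(S_U, labels))
    return cover

def semi_private_learn(H, S_U, S_L, learner, alpha, rng=None):
    """Algorithm 4 sketch: cover from public data, then the exponential
    mechanism on the private labeled sample S_L.  The score of a candidate
    c is minus its number of mistakes on S_L, which changes by at most 1
    between neighbouring private datasets (sensitivity 1)."""
    rng = rng or random.Random(0)
    cover = learning_to_cover(H, S_U, learner)
    scores = [-sum(1 for x, y in S_L if c(x) != y) for c in cover]
    m = max(scores)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp(alpha * (s - m) / 2.0) for s in scores]
    r = rng.random() * sum(weights)
    for c, w in zip(cover, weights):
        r -= w
        if r <= 0:
            return c
    return cover[-1]

# Toy instance: thresholds h_t(x) = 1[x >= t] on {0,...,5}, with a
# trivially consistent ERM standing in for the realizable learner.
H = [lambda x, t=t: int(x >= t) for t in range(6)]

def erm(sample, labels):
    for h in H:
        if all(h(x) == y for x, y in zip(sample, labels)):
            return h
    return H[0]
```

Note that only the selection step touches the private labels, matching the structure of the privacy argument above.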
We prove that Algorithm 4 gives a semi-private agnostic learner in the distribution-family setting.
Theorem 6.21 (PAC-learning implies Semi-Private Learning).
Let ℓ : Y × Y → ℝ_{≥0} be a bounded c-approximate pseudometric. Then the following are equivalent for all classes (𝒟, Y, H, ℓ):
1.
(𝒟, Y, H, ℓ) is discretely-learnable,
2.
(𝒟, Y, H, ℓ) is c-agnostically, semi-private learnable.
Proof 6.22.
The proof is essentially the same as Theorem 6.6. The only difference in the argument is to replace the generic ERM learner over the output of LearningToCover with the exponential mechanism [39].
Let's now take a look at what Theorem 6.21 implies in the special case of distribution-free classification.
Corollary 6.23.
Let (X, H) be a class of VC-dimension d with sample complexity n(ε, δ). The sample complexity of (agnostic) semi-private learning (X, H) is at most:
n_pub(ε, δ, α) ≤ n(ε/2, δ/2)
and
n_prv(ε, δ, α) ≤ O((d·log(n(ε/2, δ/2)/ε) + log(1/δ))·max{ε^{-2}, ε^{-1}α^{-1}}).
Further, the sample complexity of improperly (agnostic) semi-private learning (X, H) is
n_pub(ε, δ, α) ≤ O((d + log(1/δ))/ε)
and
n_prv(ε, δ, α) ≤ O((d·log(1/ε) + log(1/δ))·max{ε^{-2}, ε^{-1}α^{-1}}).
Proof 6.24.
LearningToCover uses n(ε/2, δ/2) samples by definition. In the improper case, Hanneke showed that n(ε/2, δ/2) ≤ O((d + log(1/δ))/ε). Since the class has VC dimension d, the Sauer-Shelah lemma bounds the size of the resulting cover by (e·n(ε/2, δ/2)/d)^d, and the private sample complexity bound then follows from Theorem 6.19.
This improves over the recent upper bound of ABM, who showed that
n_pub(ε, δ, α) ≤ O((d·log(1/ε) + log(1/δ))/ε).
In fact, for constant d and δ, Corollary 6.23 completely resolves the unlabeled sample complexity of semi-private learning, as ABM [2] prove a matching lower bound.
Theorem 6.25 (Public Lower Bound [2]).
Every class that is not privately learnable requires at least Ω(1/ε) unlabeled samples to learn in the semi-private model under classification error.
On the other hand, we note that the private sample complexity remains off by a log factor from the best known lower bounds of Chaudhuri and Hsu [22].
Theorem 6.26 (Private Lower Bound [22]).
There exist classes of VC dimension O(1) which require at least:
n_prv(ε, δ, α) ≥ Ω(max{ε^{-2}, ε^{-1}α^{-1}})
private samples to learn.
While we have now resolved the public sample complexity of improper learning, it remains an interesting open problem in the proper regime, where certain adversarial examples are known to require an extra log(1/ε) factor in the standard PAC sample complexity [26, 34]. We conjecture that Theorem 6.21 should still be tight in this setting: namely, that the unlabeled semi-private sample complexity should always be at least the realizable PAC sample complexity.
Conjecture 6.27.
Let (X, H) be a hypothesis class which is not privately learnable. Then the realizable sample complexity of (X, H) lower bounds the unlabeled sample complexity of proper semi-private learning:
n_pub(ε, 1/2) ≥ Ω(n(ε, 1/2)).
6.4 Changing the Base Model: Covariate Shift
One issue with semi-supervised models like semi-private learning is that, in practice, the distribution over unlabeled data probably won't match the labeled data exactly. In this section, we'll discuss a final modification to our reduction that tackles such scenarios and, more generally, extends property generalization beyond the realizable PAC setting: replacing the base learner. In fact, we already saw this strategy used to a lesser extent in Section 6.1, where we replaced our standard realizable base learner with a discrete learner. Here we'll look at an application in which we assume our initial learner is robust to covariate shift [56], meaning that even if the distribution underlying the data shifts between train and test time, the algorithm will continue to perform well. This stronger assumption will allow us to build semi-private learners that can handle corruption between the public and private databases. To start, let's formalize covariate shift in the distribution-family model.
Definition 6.28 (Covariate Shift).
Let (𝒟, Y, H, ℓ) be any class, and for every ε > 0 let C_ε be a "covariate-shift" function that maps every D ∈ 𝒟 to some family of distributions over X. Given any distribution D ∈ 𝒟 and any h ∈ H, let the error of a potential labeling f be given by its worst-case error over C_ε(D), that is:
CS-err_{D×h, ℓ, ε}(f) = max_{D′ ∈ C_ε(D)} { E_{x∼D′}[ℓ(f(x), h(x))] }.
We say that (𝒟, Y, H, ℓ) is realizably learnable under covariate shift C = {C_ε} if there exists an algorithm A and function n = n(ε, δ) such that for all ε, δ > 0, D ∈ 𝒟, and h ∈ H:
Pr_{S ∼ D^n}[CS-err_{D×h, ℓ, ε}(A(S, h(S))) > ε] ≤ δ.
We call such a learner robust to covariate shift.
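For a finite family of shifted marginals over a finite domain, the worst-case error in the definition above can be computed directly. The Python sketch below is our own illustration (all names are hypothetical); distributions are represented as dictionaries mapping points to probabilities.

```python
def cs_err(f, h, shifted_family, loss):
    """Worst-case expected loss of a labeling f against the target h over
    a finite family of shifted marginals C_eps(D): the maximum, over each
    shifted distribution D', of E_{x ~ D'}[loss(f(x), h(x))]."""
    return max(
        sum(p * loss(f(x), h(x)) for x, p in D.items())
        for D in shifted_family
    )

# Toy instance: binary labels with 0-1 loss; f disagrees with h only at x = 0,
# so the worst-case error is the largest mass the family places on 0.
zero_one = lambda a, b: int(a != b)
h = lambda x: int(x >= 1)
f = lambda x: 1
family = [{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}]
```

As the toy instance shows, a shift that concentrates mass on the disagreement region drives the worst-case error up, which is exactly why the tolerable shift must scale with the accuracy parameter.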
We emphasize that in the above definition, the covariate shift family scales with the error parameter ε. This is a bit different from Shimodaira's original definition [56], but is a natural choice in our context since we consider algorithms which only use access to the original source distribution (sometimes called "conservative domain adaptation" [30]). In this setting, we'd expect that as we demand higher accuracy, the amount of covariate shift we can tolerate will decrease. Indeed, in the agnostic model it's clear this scaling is necessary, by a similar argument to Proposition 6.15.
The key observation needed to apply learning under covariate shift in our reduction is simply that the non-uniform cover output by LearningToCover must contain a hypothesis close to optimal under any shifted distribution in C_ε(D). This can then be used to analyze any semi-supervised model where the marginal of the labeled distribution may be corrupted from D to any distribution in C_ε(D). In this section, we'll again focus on the setting of semi-private learning. First, let's formalize what it means to be semi-private learnable under covariate shift.
Definition 6.29 (Semi-Private Learning under Covariate Shift).
We call a class (𝒟, Y, H, ℓ) semi-private (agnostically) PAC-Learnable under covariate shift C = {C_ε} if there exists an algorithm A and two functions n_pub = n_pub(ε, δ, α) and n_prv = n_prv(ε, δ, α) such that for all ε, δ > 0, distributions D_X ∈ 𝒟, and distributions D′ over X × Y whose marginal D′_X ∈ C_ε(D_X), A satisfies the following:
1.
A outputs a good hypothesis over D′ with high probability:
Pr_{S_U ∼ D_X^{n_pub}, S_P ∼ D′^{n_prv}}[err_{D′,ℓ}(A(S_U, S_P)) > OPT′ + ε] ≤ δ,
where OPT′ = min_{h∈H}[err_{D′,ℓ}(h)] is the minimum error over the shifted distribution.
2.
A is semi-private. That is, for all S_U ∈ X^{n_pub}:
A(S_U, ·) is α-differentially private.
In other words, we'd like to recover a near-optimal hypothesis even when the marginal distribution over private data is shifted from the public data. This is a realistic scenario in practice, since the distribution of "opt-in" users is likely different from the marginal over the total population. We'll show that this issue is solvable in the semi-private setting as long as the analogous issue in the non-private setting (distribution shift between train and test time) can be resolved.
Theorem 6.30.
Let (𝒟, Y, H, ℓ) be any class over a finite label space, with a loss function satisfying the identity of indiscernibles, which is realizably learnable under covariate shift C = {C_ε}. Then (𝒟, Y, H, ℓ) is semi-private, agnostically learnable under covariate shift C with sample complexity:
n(ε, δ) ≤ n(c_ℓ·ε, δ/2) + O(log(Π_H(n(c_ℓ·ε, δ/2))/δ)/ε^2)
where we recall that c_ℓ ≥ Ω(min_{y≠y′} ℓ(y, y′) / max_{y≠y′} ℓ(y, y′)) is a constant depending only on ℓ.
Proof 6.31.
The proof is essentially the same as Theorem 5.4. The only difference is to note that for all marginal distributions D_X ∈ 𝒟 and choices of shift D′_X ∈ C_ε(D_X), the output of LearningToCover has some h′ satisfying
E_{x∼D′_X}[ℓ(h′(x), h_{OPT′}(x))] ≤ c_ℓ·ε,
where h_{OPT′} is some optimal hypothesis over the shifted distribution D′. As in the standard analysis, this is guaranteed by including the output of the realizable learner on h_{OPT′} (robustness to covariate shift promises this output is close under D′_X with high probability). The remainder of the argument is exactly as in Theorem 6.6, with the exception of working over the shifted distribution D′_X instead of D_X.
We note that this result can also be extended to the more general loss functions discussed in Section 6.1 without too much difficulty, given the appropriate definition of discrete learnability under covariate shift.
Since Theorem 6.30 is a bit abstract, let's take a look at one concrete application. Given a class (X, H), the class-dependent total variation distance is a metric on distributions measuring the worst-case distance across elements of HΔH := {hΔh′ : h, h′ ∈ H}, where hΔh′ denotes the set of points on which h and h′ disagree:
TV_{HΔH}(D, D′) := max_{S ∈ HΔH} {|D(S) − D′(S)|}.
It is not hard to see that any realizable learner is robust to O(ε) covariate shift in TV_{HΔH} distance. We can then apply Theorem 6.30 to build a robust semi-private learner.
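For finite binary classes on a finite domain, TV_{HΔH} can be computed by brute force over all symmetric differences. The sketch below is our own illustration (hypothetical names; distributions are dictionaries mapping points to probabilities).

```python
from itertools import combinations

def tv_class_distance(H, D1, D2, domain):
    """Class-dependent total variation distance TV_{HdeltaH}(D1, D2):
    the largest gap |D1(S) - D2(S)| over disagreement sets
    S = {x : h(x) != g(x)} for pairs h, g from a finite binary class H."""
    best = 0.0
    for h, g in combinations(H, 2):
        S = [x for x in domain if h(x) != g(x)]
        gap = abs(sum(D1.get(x, 0.0) for x in S)
                  - sum(D2.get(x, 0.0) for x in S))
        best = max(best, gap)
    return best

# Toy instance: thresholds on {0, 1, 2}; the uniform distribution versus one
# concentrated on {0, 1}.
H = [lambda x, t=t: int(x >= t) for t in range(4)]
domain = [0, 1, 2]
D1 = {0: 1/3, 1: 1/3, 2: 1/3}
D2 = {0: 0.5, 1: 0.5}
```

Because the maximum ranges only over sets distinguishable by pairs from H, this distance can be far smaller than the usual total variation distance, which is what makes robustness to it achievable.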
Corollary 6.32.
Let (X, H) be a class of VC-dimension d, and for every ε > 0 and distribution D over X, define a covariate shift function:
C_ε(D) := {D′ : TV_{HΔH}(D, D′) ≤ ε/2}.
Then (X, H) is semi-private learnable under covariate shift C = {C_ε} in only
n_pub(ε, δ, α) ≤ O((d + log(1/δ))/ε)
unlabeled samples and
n_prv(ε, δ, α) ≤ O((d·log(1/ε) + log(1/δ))·max{ε^{-2}, ε^{-1}α^{-1}})
labeled samples.
Finally, we note again that our original learner in these results is robust to covariate shift despite having no access to samples from the new distribution. Unfortunately, this model does come with fairly strong lower bounds on the type of covariate shifts to which it is possible to be robust [30]. One solution to this problem is to consider a relaxed variant called (non-conservative) domain adaptation, where the learner additionally has access to a small number of unlabeled samples from the test-time distribution. It is certainly possible to define an analog in the semi-private setting, but naively the use of unlabeled data from the private distribution breaks our reduction, since privacy won't be preserved. We leave as an open question whether any sort of PAC-learner in the non-conservative model could imply semi-private learners with stronger robustness to covariate shift. Some progress has been made in this direction recently by Bassily, Moran, and Nandi [11] for distribution-free classification of halfspaces.
7 Doubly Bounded Loss
In this section we discuss a natural generalization of loss functions over finite label classes that we call doubly bounded loss: for all distinct y, y′ ∈ Y, we require ℓ(y, y′) ∈ [a, b] for some b ≥ a > 0. This is trivially satisfied by any loss function on a finite label class that satisfies the identity of indiscernibles.
Definition 7.1 (Doubly Bounded Loss).
We say ℓ : Y × Y → ℝ_{≥0} is (a, b)-bounded if there exist b ≥ a > 0 such that for any distinct y, y′ ∈ Y:
ℓ(y, y′) ∈ [a, b].
As discussed in Section 6.1, since we now allow Y to be infinite, we need to work with discrete learnability instead of realizable learnability. We can use a slight modification of the discretization technique in Theorem 6.6 to prove the equivalence of discrete and agnostic learnability for doubly-bounded loss functions. Note that this is stronger than our guarantee for c-approximate pseudometrics, which only gives c-agnostic learnability.
Theorem 7.2.
Let ℓ : Y × Y → ℝ_{≥0} be an (a, b)-bounded loss function. Then for any class (𝒟, Y, H, ℓ) the following are equivalent:
1.
(𝒟, Y, H, ℓ) is (properly) discretely-learnable,
2.
(𝒟, Y, H, ℓ) is (properly) agnostically learnable.
Proof 7.3.
The proof that agnostic learnability implies discrete learnability is the same as in Theorem 6.6, so we focus only on the forward direction. Assume (𝒟, Y, H, ℓ) is discretely-learnable. Fix ε′ = aε/(4b), and let H_{ε′} be a learnable ε′-discretization of H. We argue that running LearningToCover on H_{ε′} (using the promised discrete learner) gives the desired agnostic learner. As before, it is sufficient to prove that C(S_U) contains a hypothesis h′ such that err_{D,ℓ}(h′) ≤ OPT + ε/2. Since ℓ is upper bounded, standard empirical risk minimization arguments then give the desired result.
To see why C(S_U) has this property, recall from Lemma 5.2 that for any h ∈ H_{ε′}, with probability at least 1 − δ/2 there exists h′ ∈ C(S_U) that is ε′-close to h in the following sense:
E_{x∼D_X}[ℓ(h′(x), h(x))] ≤ aε/(4b).
Because ℓ is a-lower bounded, this implies that h′ must be close to h in classification error:
Pr_{x∼D_X}[h(x) ≠ h′(x)] ≤ ε/(4b),
and since the loss is bounded by b, the risk of h′ cannot be much more than that of h:
err_{D,ℓ}(h′) ≤ err_{D,ℓ}(h) + ε/4.
Let h_{OPT} ∈ H be an optimal hypothesis. Since H_{ε′} ε′-covers H, there exists h^{ε′}_{OPT} ∈ H_{ε′} such that:
err_{D,ℓ}(h^{ε′}_{OPT}) − err_{D,ℓ}(h_{OPT}) ≤ ε′.
Let h′_{OPT,ε′} ∈ H_{ε′} denote the output of the base learner A on the labeling given by h^{ε′}_{OPT}. Then by the above, we have that:
err_{D,ℓ}(h′_{OPT,ε′}) = E_{(x,y)∼D}[ℓ(h′_{OPT,ε′}(x), y)]
≤ E_{(x,y)∼D}[ℓ(h^{ε′}_{OPT}(x), y)] + ε/4
= E_{(x,y)∼D}[ℓ(h^{ε′}_{OPT}(x), y) − ℓ(h_{OPT}(x), y) + ℓ(h_{OPT}(x), y)] + ε/4
= err_{D,ℓ}(h^{ε′}_{OPT}) − err_{D,ℓ}(h_{OPT}) + OPT + ε/4
≤ OPT + ε/2
as desired.
It is worth noting that the upper bound on the loss can be removed if the adversary is restricted to choosing a marginal over X which is weakly concentrated.
8 Robust Learning
Robust learning is an extension to the PAC setting that models an adversary with the power to perturb examples at test time. In practice, this corresponds to the fact that we'd like our predictors to be stable to small amounts of adversarial noise: this could range anywhere from a sticker on a stop sign tricking a self-driving car, to completely imperceptible perturbations that totally fool standard classifiers. The latter was famously demonstrated by Athalye, Engstrom, Ilyas, and Kwok [7], who showed how to generate such perturbations and provided the classic example of tricking a standard ImageNet classifier into thinking a turtle was a rifle. Their seminal work caused an explosion of both practical and theoretical research in the area.
Formally, adversarial robustness is modeled simply by changing the error function to be the maximum error over some pre-defined set of neighboring perturbations.
Definition 8.1 (Robust Loss).
Let X be an instance space and U : X → P(X) a "perturbation function" mapping elements to a set of possible perturbations. Given a loss function ℓ : Y × Y → ℝ_{≥0}, the robust loss of a concept h : X → Y with respect to a distribution D over X × Y is:
R-err_{U,D}(h) = E_{(x,y)∼D}[max_{x′∈U(x)} ℓ(h(x′), y)].
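Estimating this quantity on a finite sample is exactly the weak access to the perturbation set that our reduction will require later. A minimal Python sketch of the empirical version (our own illustration, using a toy radius-1 perturbation set on the integers):

```python
def empirical_robust_err(h, U, sample, loss):
    """Empirical robust loss: average over labeled examples (x, y) of the
    worst-case loss of h on any perturbation x' in U(x)."""
    return sum(max(loss(h(xp), y) for xp in U(x)) for x, y in sample) / len(sample)

# Toy instance: a threshold at 0 with 0-1 loss, where the adversary may move
# each point by at most 1 on the integers.
zero_one = lambda a, b: int(a != b)
h = lambda x: int(x >= 0)
U = lambda x: [x - 1, x, x + 1]
```

Points well inside either side of the boundary stay correct under every perturbation, while points near the boundary can be flipped, which is precisely what the robust loss charges for.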
In other words, a hypothesis with low robust loss performs well even against an adversary who can perturb x to any "nearby" point (i.e. any x′ ∈ U(x)). Standard realizable and agnostic robust PAC-learning are then simply defined by replacing the standard error function with the robust error function. Robust learning in the distribution-family model does require one extra twist: we need to make sure that each hypothesis in the class actually has a corresponding distribution over which it is realizable. To this end, we introduce a basic notion of closure for distribution families.
Definition 8.2 (Robust Closure).
Let 𝒟 be a set of distributions over an instance space X and H a concept class. Given any concept h, let X_h denote the set of points in X on which h has 0 robust loss with respect to itself, that is:
X_h := {x ∈ X : ∀x′ ∈ U(x), ℓ(h(x′), h(x)) = 0}.
For notational simplicity, let D|_h denote the restriction D|_{X_h}. The robust closure of 𝒟 under H is:
𝒟_H := 𝒟 ∪ ⋃_{D ∈ 𝒟, h ∈ H} D|_h.
In the robust distribution-family model, it only really makes sense to define realizable learnability over the robust closure of 𝒟, since otherwise there may be hypotheses in the class that are not realizable with respect to any distribution in 𝒟 and cannot be chosen by the adversary at all. With this in mind, let's formalize this model.
Definition 8.3 ((Realizable) Distribution-Family Robust PAC Learning).
A class (𝒟, Y, H, ℓ) is robustly PAC-learnable in the realizable setting with respect to perturbation function U if there exists an algorithm A and function n(ε, δ) such that for all ε, δ > 0 and distributions D over X × Y satisfying:
1.
The marginal D_X ∈ 𝒟_H,
2.
min_{h∈H} R-err_{U,D}(h) = 0,
we have:
Pr_{S ∼ D^{n(ε,δ)}}[R-err_{U,D}(A(S)) > ε] ≤ δ.
Agnostic learnability is defined similarly, but since the adversary is unrestricted, there is no longer any need to take the robust closure.
Definition 8.4 ((Agnostic) Distribution-Family Robust PAC Learning).
A class (𝒟, Y, H, ℓ) is robustly PAC-learnable in the agnostic setting with respect to perturbation function U if there exists an algorithm A and function n(ε, δ) such that for all ε, δ > 0 and distributions D over X × Y satisfying D_X ∈ 𝒟, we have:
Pr_{S ∼ D^{n(ε,δ)}}[R-err_{U,D}(A(S)) > OPT + ε] ≤ δ,
where OPT = min_{h∈H} {R-err_{U,D}(h)}.
We note that different works consider different models of access to the perturbation set U as well (e.g. assuming U is known to the learner [52], or granting some type of oracle access [51, 50]). Our reduction requires fairly weak access to U: it is enough to be able to estimate the empirical robust loss of a hypothesis h over any finite sample S ⊂ X. With this in mind, let's now prove realizable and agnostic robust learning are equivalent in the distribution-family model. We'll focus on the special case of (multi-class) classification, and start by restating our modified algorithm for simplicity of presentation.
Input: Realizable Robust PAC-Learner A
Algorithm:
1.
Draw an unlabeled sample S_U, and labeled sample S_L.
2.
Run A over all possible subsets and labelings of S_U to get:
C(S) := {A(S, h(S)) | S ⊆ S_U, h ∈ H|_S}.
3.
Return the hypothesis in C(S) with lowest empirical robust error over S_L.
Algorithm 5 Agnostic to Realizable Reduction: Robust Setting
Theorem 8.5.
If (𝒟, Y, H) is robustly PAC-learnable in the realizable setting with sample complexity n(ε, δ), then Algorithm 5 robustly learns (𝒟, Y, H) in the agnostic setting in:
n_U(ε, δ) ≤ O(max_{p ∈ [0, 1−ε]} {n(ε/(2(1−p)), δ/3)/(1−p)})
unlabeled samples, and
n_L(ε, δ) ≤ O((n_U(ε, δ) + log(1/δ))/ε^2)
labeled samples.
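The subset-enumeration step of Algorithm 5 can be sketched directly in Python. This is our own toy illustration (hypothetical names): thresholds with radius-1 perturbations, and a trivially consistent ERM standing in for the realizable robust learner. The enumeration is exponential in |S_U|, exactly as in the analysis, so this is only meant for tiny samples.

```python
from itertools import combinations

def agnostic_robust_learn(H, S_U, S_L, learner, U):
    """Algorithm 5 sketch: run the realizable learner on every subset of
    the unlabeled sample under every labeling induced by H, then return
    the candidate with lowest empirical robust (0-1) error on S_L."""
    candidates = []
    for r in range(len(S_U) + 1):
        for subset in combinations(S_U, r):
            for h in H:
                labels = [h(x) for x in subset]
                candidates.append(learner(list(subset), labels))
    def robust_mistakes(c):
        return sum(1 for x, y in S_L if any(c(xp) != y for xp in U(x)))
    return min(candidates, key=robust_mistakes)

# Toy instance: thresholds on the integers with radius-1 perturbations.
H = [lambda x, t=t: int(x >= t) for t in range(5)]
U = lambda x: [x - 1, x, x + 1]

def erm(sample, labels):
    for h in H:
        if all(h(x) == y for x, y in zip(sample, labels)):
            return h
    return H[0]
```

The point of enumerating subsets is that one of them approximates the restriction of the sample to the region where the optimal hypothesis is robustly self-consistent, so the base learner sees an (approximately) realizable input on at least one branch.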
Proof 8.6.
The proof is similar to Theorem 5.4, but like malicious noise (Theorem 6.11), requires the use of subsampling. The key issue is that for a distribution ๐ท over ๐ ร ๐ , the optimal hypothesis โ ๐ โข ๐ โข ๐ may not be realizable with respect to ๐ท ๐ , that is we may have:
R-err ๐ , ๐ท ๐ ร โ ๐ โข ๐ โข ๐ โข ( โ ๐ โข ๐ โข ๐ )
0 .
As a result, our realizable learner (and therefore LearningToCover) has no guarantees over this distribution. On the other hand, our realizable learner does have good guarantees over the restricted marginal ๐ท ๐ | โ ๐ โข ๐ โข ๐ . We can then fix the above issue by running LearningToCover of all subsets of ๐ including its restriction to ๐ โ ๐ โข ๐ โข ๐ . We will see that this essentially simulates running the realizable learner on the realizable restriction ๐ท ๐ | โ ๐ โข ๐ โข ๐ and recovers the desired guarantees.
Letโs take a look at this more formally. As in our previous arguments it is enough to prove that ๐ถ โข ( ๐ ) contains a hypothesis โ โฒ with robust error at most ๐ โข ๐ โข ๐ + ๐ / 2 , since a standard Chernoff bound tells us that ๐ โข ( log โก ( | ๐ถ โข ( ๐ ) | / ๐ฟ ) / ๐ 2 ) labeled examples are enough to estimate the robust loss of every hypothesis in ๐ถ โข ( ๐ ) with high probability. We note that this is the only step in our reduction which requires access to the perturbation set ๐ฐ .
It is left to show that ๐ถ โข ( ๐ ) satisfies this property. Let ๐ท | โ ๐ โข ๐ โข ๐ denote the restriction of ๐ท to ๐ โ ๐ โข ๐ โข ๐ , the points in ๐ on which โ ๐ โข ๐ โข ๐ has 0 robust loss with respect to itself. Let ๐ท ยฏ | โ ๐ โข ๐ โข ๐ be the restriction to the complement, that is ๐ โ ๐ โ ๐ โข ๐ โข ๐ . The idea is to decompose our analysis into two separate parts over ๐ท | โ ๐ โข ๐ โข ๐ and ๐ท ยฏ | โ ๐ โข ๐ โข ๐ . With this in mind, let ๐ * denote the mass of ๐ท ๐ on ๐ โ ๐ โข ๐ โข ๐ , and let ๐ โข ๐ โข ๐ โฒ denote the robust error of โ ๐ โข ๐ โข ๐ over ๐ท | โ ๐ โข ๐ โข ๐ . Since we are restricting our attention to classification error, notice that we can decompose ๐ โข ๐ โข ๐ as:
OPT = R-err_{U,D}(h_{OPT})
= Pr_{(x,y)∼D}[∃x′ ∈ U(x) : h_{OPT}(x′) ≠ y]
= p*·Pr_{(x,y)∼D̄|_{h_OPT}}[∃x′ ∈ U(x) : h_{OPT}(x′) ≠ y] + (1 − p*)·Pr_{(x,y)∼D|_{h_OPT}}[∃x′ ∈ U(x) : h_{OPT}(x′) ≠ y]
= p* + (1 − p*)·OPT′,
where the last step follows from noting that by definition, for all x in the support of D̄|_{h_OPT}, h_{OPT} is not constant on U(x). To get a function within ε/2 robust loss of OPT, we claim it is sufficient to prove C(S) contains some h within robust error ε/(2(1−p*)) of h_{OPT} over D|_{h_OPT}, that is, some h satisfying:
Pr_{x∼D_X|_{h_OPT}}[∃x′ ∈ U(x) : h(x′) ≠ h_{OPT}(x′)] ≤ ε/(2(1−p*)).
(3)
This follows from a similar analysis to the above. Namely, letting E(h, x, y) denote the event
E(h, x, y) := ∃x′ ∈ U(x) : h(x′) ≠ y
for notational simplicity, we can break down R-err_{U,D}(h) as:
R-err_{U,D}(h) = p*·Pr_{(x,y)∼D̄|_{h_OPT}}[E(h, x, y)] + (1 − p*)·Pr_{(x,y)∼D|_{h_OPT}}[E(h, x, y)]
≤ p* + (1 − p*)·Pr_{(x,y)∼D|_{h_OPT}}[E(h, x, y)]
≤ p* + (1 − p*)·Pr_{(x,y)∼D|_{h_OPT}}[E(h_{OPT}, x, y)] + (1 − p*)·Pr_{(x,y)∼D|_{h_OPT}}[∃x′ ∈ U(x) : h(x′) ≠ h_{OPT}(x′)]
≤ p* + (1 − p*)·(OPT′ + ε/(2(1 − p*)))
= OPT + ε/2.
It remains to prove that C(S) contains a hypothesis satisfying Equation 3 with high probability. To see this, note that by definition of realizable robust learning, on a labeled sample (S, h_{OPT}(S)) ∼ D|_{h_OPT} × h_{OPT} of size n(ε/(2(1−p*)), δ/3) our learner will output h satisfying:
Pr_{x∼D_X|_{h_OPT}}[∃x′ ∈ U(x) : h(x′) ≠ h_{OPT}(x)] ≤ ε/(2(1−p*))
with probability at least 1 − δ/3. To get Equation 3, it is then enough to note that h_{OPT} is constant on U(x) for all x in the support of D_X|_{h_OPT} by definition.
The idea is now to draw a large enough unlabeled sample such that with probability at least 1 − δ/3, the restriction to X_{h_OPT} is at least this size. By a Chernoff bound, it is enough to draw c_1·n(ε/(1−p*), δ/3)/(1−p*) points to achieve this for some large enough constant c_1 > 0. Since we do not know p*, we'll need to draw c_1·max_{p ∈ [0, 1−ε]} {n(ε/(1−p), δ/3)/(1−p)} points to ensure this property holds (if p* ≥ 1 − ε, note that any hypothesis gives a valid solution). By a union bound we have that this overall process succeeds with probability at least 1 − 2δ/3, and outputting the hypothesis in C(S) with the lowest empirical robust risk then succeeds with probability 1 − δ as desired.
Theorem 8.5 can be extended to many of the generic property generalization results in the main body, including approximate pseudometric loss, malicious noise, and semi-private learning, though the exact parameters may be somewhat weaker (e.g. learning over non-binary loss may incur additional factors and lead to c-agnostic rather than truly agnostic learning). This comes at the cost of an extra factor of ε^{-1} over reductions in the finite VC setting [52].
9 Partial PAC-Learning
Partial PAC-learning is an extension of the standard PAC model to functions that are only defined on a certain portion of the input space. Originally introduced by Long [47] and recently developed in greater depth by Alon, Hanneke, Holzman, and Moran (AHHM) [5], this model allows for the theoretical formalization of popular data-dependent assumptions, such as margin, that have no known analog in the PAC model. Combined with the distribution-family framework, this captures a significant portion of learning assumptions studied in both theory and practice (e.g. learning halfspaces with margin and distributional tail bounds). Let's formalize this model, starting with partial functions.
Definition 9.1 (Partial Function).
Let X be an instance space and Y a label space. A partial function is a labeling f : X → Y ∪ {*}, where elements labeled "*" are thought of as undefined. The support of f, denoted supp(f), is the set of elements x ∈ X s.t. f(x) ≠ *.
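A partial function can be modeled concretely by reserving a sentinel value for "*". The Python sketch below is our own illustration (hypothetical names): it encodes a threshold with margin, where points inside the margin are undefined, and an empirical error in which an undefined prediction always counts as a mistake.

```python
STAR = None  # stands in for the undefined label "*"

def margin_threshold(t, gamma):
    """A partial threshold on the reals: points within distance gamma of
    the boundary t are labeled "*" (undefined), formalizing a margin
    assumption."""
    def f(x):
        if abs(x - t) < gamma:
            return STAR
        return int(x >= t)
    return f

def partial_err(f, sample):
    """Empirical error of a partial hypothesis: a prediction of "*"
    always counts as a mistake."""
    return sum(1 for x, y in sample if f(x) is STAR or f(x) != y) / len(sample)
```

In the realizable setting the adversary's marginal must avoid the margin region entirely, so the "*" penalty is never paid; in the agnostic setting it is charged like any other mistake.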
Standard Partial PAC-learning is defined much like the standard model, with the simple modification that "*" labels are always considered incorrect. As a result, in the realizable case, when the adversary selects a particular partial function f, their marginal distribution over the instance space X must be restricted to lie on supp(f). This makes formalizing data-dependent assumptions easy: if one wanted to consider halfspaces with margin γ, for instance, one simply labels every point within γ of the decision boundary as "*". Interestingly, much like the distribution-family setting, Partial PAC-learning falls outside both the uniform convergence and the sample compression paradigms [1]. AHHM also show a dramatic failure of empirical risk minimization: not only does naively applying an ERM to the partial class fail, it will also fail on any total extension of the class. Despite the lack of these standard tools, both Long and AHHM were able to show that distribution-free classification of partial classes is still controlled by VC dimension, and as a result that the equivalence of realizable and agnostic learnability extends to this setting. In this section, we'll discuss how a variant of our reduction extends this result to the distribution-family model, to extended loss functions, and to properties beyond agnostic learning.
In the distribution-family model, formalizing realizable learnability requires some slight changes from the standard model, since we need to make sure our hypotheses are actually realizable over some distribution in the family (this is automatic in the distribution-free setting). To this end, we introduce a basic notion of closure for distribution families.
Definition 9.2 (Partial Closure).
Let 𝒟 be a set of distributions over an instance space X and H a concept class. Given any concept h and distribution D over X, let D|_h denote the restriction D|_{supp(h)}. The partial closure of 𝒟 under H is:
𝒟_H := 𝒟 ∪ ⋃_{D ∈ 𝒟, h ∈ H} D|_h.
In the realizable model it makes more sense to work with the closure of 𝒟 than 𝒟 itself, since otherwise the class H may contain hypotheses that cannot be realized over any distribution, and therefore cannot be accessed by the adversary at all. For simplicity, we'll also restrict our attention to (multi-class) classification where the label space Y = [k], and recall that the loss of any undefined point is always 1.
Definition 9.3 ((Realizable) Distribution-Family Partial PAC Learning).
A partial class (𝒟, Y, H) is PAC-learnable in the realizable setting if there exists an algorithm A and function n(ε, δ) such that for all ε, δ > 0 and distributions D over X × Y satisfying:
1.
The marginal D_X ∈ 𝒟_H,
2.
min_{h∈H} err_D(h) = 0,
we have:
Pr_{S ∼ D^{n(ε,δ)}}[err_D(A(S)) > ε] ≤ δ,
where the error err_D(h) is standard classification error:
err_D(h) = Pr_{(x,y)∼D}[h(x) ≠ y].
Agnostic learnability is defined analogously, but since the adversary is unrestricted, there is no need to move to the closure of 𝒟.
Definition 9.4 ((Agnostic) Distribution-Family Partial PAC Learning).
A partial class (𝒟, Y, H) is PAC-learnable in the agnostic setting if there exists an algorithm A and function n(ε, δ) such that for all ε, δ > 0 and distributions D over X × Y satisfying D_X ∈ 𝒟, we have:
Pr_{S ∼ D^{n(ε,δ)}}[err_D(A(S)) > OPT + ε] ≤ δ.
The issue with our standard reduction strategy for partial functions is that in the agnostic model, the adversary's marginal distribution over $\mathcal{X}$ might have support outside of $supp(h_{\mathit{OPT}})$, which causes LearningToCover to lose its guarantee of outputting a non-uniform cover. This can be dealt with by a variant of our subsampling technique. If we run LearningToCover over all subsamples of the unlabeled sample $S_U$, one of these subsamples must match the support of $h_{\mathit{OPT}}$. This is in fact the same strategy used for adversarial robustness in Section 8, but we include the algorithm again to make this section self-contained.
Input: Realizable Partial PAC-Learner $\mathcal{A}$
Algorithm:
1. Draw an unlabeled sample $S_U$ and a labeled sample $S_L$.
2. Run $\mathcal{A}$ over all possible subsets and labelings of $S_U$ to get:
$$C(S) \coloneqq \{\mathcal{A}(T, h(T)) \mid T \subseteq S_U,\ h \in H|_T\}.$$
3. Return the hypothesis in $C(S)$ with lowest empirical error over $S_L$.
Algorithm 6: Agnostic to Realizable Reduction (Partial PAC Setting)
Theorem 9.5.
If $(\mathcal{P}, \mathcal{Y}, H)$ is a realizably PAC-learnable partial class with sample complexity $n(\varepsilon, \delta)$, then Algorithm 6 agnostically learns $(\mathcal{P}, \mathcal{Y}, H)$ in
$$n_U(\varepsilon, \delta) \le O\left(\max_{\rho \in [0, 1-\varepsilon]} \left\{ \frac{n(\varepsilon/(2(1-\rho)), \delta/3)}{1-\rho} \right\}\right)$$
unlabeled samples, and
$$n_L(\varepsilon, \delta) \le O\left(\frac{n_U(\varepsilon, \delta) + \log(1/\delta)}{\varepsilon^2}\right)$$
labeled samples.
Proof 9.6.
The proof is essentially the same as for Theorem 8.5, but we repeat it here for completeness. As always, it is enough to prove that $C(S)$ (from Algorithm 6) contains a hypothesis $h'$ with error at most $\mathit{OPT} + \varepsilon/2$. The key issue with our standard reduction is that the optimal hypothesis $h_{\mathit{OPT}}$ may be undefined on certain examples in the unlabeled sample $S_U$. By running over all subsamples of $S_U$, we in essence simulate pulling samples only from the support of $h_{\mathit{OPT}}$, which is enough to get the desired guarantee.
More formally, let $D|_{h_{\mathit{OPT}}}$ be the restriction of $D$ to $supp(h_{\mathit{OPT}})$, and $\bar{D}|_{h_{\mathit{OPT}}}$ the restriction to its complement $\mathcal{X} \setminus supp(h_{\mathit{OPT}})$. The idea is to decompose our analysis into two separate parts over $D|_{h_{\mathit{OPT}}}$ and $\bar{D}|_{h_{\mathit{OPT}}}$. With this in mind, let $\rho^*$ denote the mass of $D_{\mathcal{X}}$ on the undefined portion of $h_{\mathit{OPT}}$, and let $\mathit{OPT}'$ denote the error of $h_{\mathit{OPT}}$ over $D|_{h_{\mathit{OPT}}}$. Since we have restricted our attention to classification, notice that we can decompose $\mathit{OPT}$ as:
$$\mathit{err}_D(h_{\mathit{OPT}}) = \Pr_{(x,y) \sim D}[h_{\mathit{OPT}}(x) \ne y] = \rho^* \Pr_{(x,y) \sim \bar{D}|_{h_{\mathit{OPT}}}}[h_{\mathit{OPT}}(x) \ne y] + (1-\rho^*) \Pr_{(x,y) \sim D|_{h_{\mathit{OPT}}}}[h_{\mathit{OPT}}(x) \ne y] = \rho^* + (1-\rho^*)\mathit{OPT}'.$$
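As a quick numerical sanity check of this decomposition (the toy distribution and all numbers below are purely illustrative, not from the paper):

```python
# Sanity check of err_D(h_OPT) = rho* + (1 - rho*) * OPT' on a toy
# distribution: points 0,1 lie outside supp(h_OPT) (loss always 1),
# points 2..5 lie inside it. All numbers are illustrative.
mass = {0: 0.1, 1: 0.1, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}
inside = {2, 3, 4, 5}                              # supp(h_OPT)
mistake = {2: False, 3: True, 4: False, 5: False}  # h_OPT's mistakes inside

# rho* = mass on the undefined portion of h_OPT
rho_star = sum(m for x, m in mass.items() if x not in inside)
# OPT' = error of h_OPT conditioned on the defined portion
opt_prime = sum(m for x, m in mass.items()
                if x in inside and mistake[x]) / (1 - rho_star)
# total error: undefined points always count as mistakes
err = sum(m for x, m in mass.items() if x not in inside or mistake[x])

assert abs(err - (rho_star + (1 - rho_star) * opt_prime)) < 1e-12
```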
We'd like to prove that $C(S)$ contains a hypothesis $h$ within $\varepsilon/2$ error of optimal. We claim it is sufficient to show that $C(S)$ contains a hypothesis within $\varepsilon/(2(1-\rho^*))$ classification distance of $h_{\mathit{OPT}}$, since:
$$\begin{aligned}
\mathit{err}_D(h) &= \rho^* \Pr_{(x,y) \sim \bar{D}|_{h_{\mathit{OPT}}}}[h(x) \ne y] + (1-\rho^*) \Pr_{(x,y) \sim D|_{h_{\mathit{OPT}}}}[h(x) \ne y] \\
&\le \rho^* + (1-\rho^*) \Pr_{(x,y) \sim D|_{h_{\mathit{OPT}}}}[h(x) \ne y] \\
&\le \rho^* + (1-\rho^*) \left( \Pr_{(x,y) \sim D|_{h_{\mathit{OPT}}}}[h_{\mathit{OPT}}(x) \ne y] + \frac{\varepsilon}{2(1-\rho^*)} \right) \\
&= \rho^* + (1-\rho^*)\mathit{OPT}' + \frac{\varepsilon}{2} \\
&= \mathit{OPT} + \frac{\varepsilon}{2},
\end{aligned}$$
where the third line follows from the fact that $h$ and $h_{\mathit{OPT}}$ only differ on an $\frac{\varepsilon}{2(1-\rho^*)}$ fraction of inputs over $D|_{h_{\mathit{OPT}}}$.
It is left to argue that $C(S)$ contains such a hypothesis $h$. Recall that on a labeled sample $(T, h_{\mathit{OPT}}(T))$ with $T \sim D|_{h_{\mathit{OPT}}}$ of size $n(\varepsilon/(2(1-\rho^*)), \delta/3)$, LearningToCover outputs a set containing an $h$ that is $\varepsilon/(2(1-\rho^*))$-close to $h_{\mathit{OPT}}$ in classification error over $D|_{h_{\mathit{OPT}}}$ with probability at least $1 - \delta/3$. The idea is then to draw a large enough unlabeled sample such that with probability at least $1 - \delta/3$, the restriction of the sample to $D|_{h_{\mathit{OPT}}}$ is at least this size (since we run over every subsample, we will always hit this restriction). By a Chernoff bound, it is enough to draw $c_1 \frac{n(\varepsilon/(2(1-\rho^*)), \delta/3)}{1-\rho^*}$ points to achieve this for some large enough constant $c_1 > 0$. Since we do not know $\rho^*$, we'll need to draw $c_1 \max_{\rho \in [0, 1-\varepsilon]}\left\{\frac{n(\varepsilon/(2(1-\rho)), \delta/3)}{1-\rho}\right\}$ points to ensure this property holds (if $\rho^* \ge 1 - \varepsilon$, note that any hypothesis gives a valid solution). By a union bound we have that this overall process succeeds with probability at least $1 - 2\delta/3$, and outputting the hypothesis in $C(S)$ with the lowest empirical risk then succeeds with probability $1 - \delta$ as desired.
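To make the mechanics of Algorithm 6 concrete, here is a minimal Python sketch over a toy finite class of partial hypotheses. The class `H`, the stand-in realizable learner, and all names are illustrative assumptions, not the paper's implementation:

```python
from itertools import combinations

# Toy class of PARTIAL hypotheses: each maps a subset of X to {0, 1} and is
# undefined elsewhere (undefined points always count as a mistake).
H = [
    {0: 0, 1: 0, 2: 1},           # defined only on {0, 1, 2}
    {0: 1, 1: 1, 3: 0, 4: 0},     # defined only on {0, 1, 3, 4}
    {2: 0, 3: 1, 4: 1, 5: 0},
]

def err(h, sample):
    """Empirical error; undefined points always count as mistakes."""
    return sum(h.get(x, None) != y for x, y in sample) / len(sample)

def realizable_learner(labeled):
    """Stand-in for the realizable PAC learner A: return any h in H
    consistent with the labeled sample, if one exists."""
    for h in H:
        if all(h.get(x, None) == y for x, y in labeled):
            return h
    return None

def agnostic_via_subsampling(S_U, S_L):
    """Algorithm 6: run A on every subset T of S_U under every labeling
    h|_T, then pick the lowest empirical error over S_L."""
    C = []
    for r in range(len(S_U) + 1):
        for T in combinations(S_U, r):
            for h in H:
                if all(x in h for x in T):  # h|_T is fully defined
                    out = realizable_learner([(x, h[x]) for x in T])
                    if out is not None:
                        C.append(out)
    return min(C, key=lambda g: err(g, S_L))

S_U = [0, 1, 2, 3]                      # unlabeled sample
S_L = [(0, 1), (1, 1), (3, 0), (4, 0)]  # labeled sample
best = agnostic_via_subsampling(S_U, S_L)
```

Here the brute-force enumeration over subsets is exponential in $|S_U|$, mirroring the computational (not statistical) cost of the reduction.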
Like Theorem 8.5, Theorem 9.5 can be extended to many of the generic property generalization results in the main body, including approximate pseudometric loss, malicious noise, and semi-private learning, though it may experience some degradation of parameters (e.g. $c$-agnostic rather than truly agnostic learning) depending on how the loss on undefined ("$\ast$") values is formalized in these settings. Again this comes at the cost of an additional $1/\varepsilon$ factor over known reductions based on sample compression in the finite VC regime [5].
10Uniform Stability
Uniform stability, originally introduced by Bousquet and Elisseeff [20], is a useful algorithmic property that is closely tied to both generalization and privacy. Informally, an algorithm $\mathcal{A}$ is said to be uniformly stable if for all elements $x \in \mathcal{X}$, the probability that $\mathcal{A}$ changes its output on $x$ over neighboring datasets is small.
Definition 10.1 (Uniform Stability).
A learning algorithm $\mathcal{A}$ is said to be $\alpha$-uniformly stable if for all neighboring inputs $S, S'$ which differ on a single example, all $x \in \mathcal{X}$, and all $y \in \mathcal{Y}$:
$$\Pr[\mathcal{A}(S)(x) = y] \le \Pr[\mathcal{A}(S')(x) = y] + \alpha.$$
Uniform stability can also be thought of as a variant of private prediction [32], which protects against adversaries who have restricted access to a model only through prediction responses on individual points (this is often the case in practice, since it is common to release APIs with query access rather than full models). Like semi-privacy, this definition has the benefit of maintaining practicality in a reasonable range of circumstances while weakening the stringent requirements of standard private learning. Indeed, it is well known that in the distribution-free classification setting, uniformly stable learning and private prediction are both possible for any class with finite VC dimension [55, 32, 24]. Unsurprisingly, these previous works (at least those working in the agnostic model) rely on uniform convergence and uniform covers. We'll show these can be replaced with a variant of our standard reduction. The argument is otherwise similar to the proof in [24].
Theorem 10.2.
Let $(\mathcal{P}, \mathcal{Y}, H)$ be a realizably learnable class with sample complexity $n(\varepsilon, \delta)$. Then there exists an $\alpha$-uniformly stable, $\alpha$-semi-private algorithm that agnostically learns $(\mathcal{P}, \mathcal{Y}, H)$ in only
$$n_U(\varepsilon, \delta, \alpha) \le O\left(\frac{n(\varepsilon/2, \delta/2)}{\alpha}\right)$$
unlabeled samples, and
$$n_L(\varepsilon, \delta, \alpha) \le O\left(\frac{\log(\Pi_H(n(\varepsilon, \delta)))}{\min\{\alpha\varepsilon, \varepsilon^2\}}\right)$$
labeled samples.
Proof 10.3.
The proof boils down to a standard subsampling trick first noted by [55]. Instead of drawing our standard unlabeled sample of size $n(\varepsilon/2, \delta/2)$, we draw a sample of size $\frac{2n(\varepsilon/2, \delta/2)}{\alpha}$ and run LearningToCover over a random $\alpha/2$ fraction of the sample. This ensures that swapping out any individual sample can only affect the result with probability $\alpha/2$. Since this subsample is of size $n(\varepsilon/2, \delta/2)$, LearningToCover keeps its standard guarantees and the output $C(S_U)$ has a hypothesis within $\varepsilon/2$ of optimal with probability $1 - \delta/2$. We can now apply the exponential mechanism with privacy parameter $\alpha/4$, which ensures the algorithm is $\alpha/2$-uniformly stable with respect to the labeled sample as well. The sample complexity bounds come from standard analysis of the exponential mechanism and the size of $C(S_U)$. Semi-privacy comes for free due to our use of the exponential mechanism.
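A rough sketch of this subsample-then-select routine, with the exponential mechanism written out explicitly. The `run_learning_to_cover` interface is a hypothetical placeholder for LearningToCover, and the whole snippet is an illustration of the trick, not the paper's implementation:

```python
import math
import random

def exponential_mechanism(candidates, utility, eps, sensitivity=1.0):
    """Select a candidate with probability proportional to
    exp(eps * utility / (2 * sensitivity))."""
    weights = [math.exp(eps * utility(c) / (2 * sensitivity)) for c in candidates]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if acc >= r:
            return c
    return candidates[-1]

def stable_agnostic_learner(S_U, S_L, alpha, run_learning_to_cover):
    """Sketch of the reduction: keep only a random alpha/2 fraction of the
    unlabeled data (so a single swapped point enters the input with
    probability alpha/2), then select from the resulting cover privately."""
    random.shuffle(S_U)
    sub = S_U[: max(1, int(len(S_U) * alpha / 2))]
    C = run_learning_to_cover(sub)
    # Utility = minus the number of empirical mistakes (sensitivity 1).
    def utility(h):
        return -sum(h(x) != y for x, y in S_L)
    return exponential_mechanism(C, utility, eps=alpha / 4)
```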
As in previous sections, Theorem 10.2 can be extended to any of the generic property generalization results in the main body, including for instance $c$-approximate pseudometric loss, malicious noise, and robustness to covariate shift. In the VC setting, the complexity essentially matches the best known bounds in the literature, which are given by a similar reduction using uniform convergence [24].
11Statistical Query Model
Kearns' [40] statistical query model is a popular modification of PAC learning where the sample oracle is replaced with the ability to ask noisy statistical questions about the data.
Definition 11.1 (Realizable SQ-learning).
Given a distribution $D$ over $\mathcal{X}$ and $h \in H$, let $STAT(D, h)$ be an oracle which, upon input of a function $\varphi: \mathcal{X} \times \mathcal{Y} \to [-1, 1]$ and tolerance $\tau \in \mathbb{R}_{\ge 0}$, may output any estimate of the expectation of $\varphi$ up to $\tau$ error, that is:
$$STAT(D, h)(\varphi, \tau) \in \mathbb{E}_{x \sim D}[\varphi(x, h(x))] \pm \tau.$$
We call a class $(\mathcal{P}, \mathcal{Y}, H, \ell)$ SQ-learnable if for all $\varepsilon > 0$, there exist some tolerance $\tau = \tau(\varepsilon)$, query complexity $q(\varepsilon, \tau)$, and an algorithm $\mathcal{A}$ such that for all $D \in \mathcal{P}$ and $h \in H$, $\mathcal{A}$ achieves $\varepsilon$ error in at most $q(\varepsilon, \tau)$ oracle calls to $STAT(D, h)$ with tolerance at worst $\tau$.
Agnostic learning is then defined analogously, where $(D, h)$ is replaced with a generic distribution over $\mathcal{X} \times \mathcal{Y}$ whose marginal lies in $\mathcal{P}$. We can use a basic form of discretization to prove property generalization in the SQ model.
Theorem 11.2.
Let $\ell$ be a $c$-approximate pseudometric and $(\mathcal{P}, \mathcal{Y}, H, \ell)$ a realizably SQ-learnable class with query complexity $q(\varepsilon, \tau)$. Then $(\mathcal{P}, \mathcal{Y}, H, \ell)$ is $c$-agnostically SQ-learnable up to $c \cdot \mathit{OPT} + \varepsilon$ error in $(1/\tau)^{q(\varepsilon, \tau)}$ statistical queries of tolerance at worst $\tau$.
Proof 11.3.
The idea is similar to our discretization in Theorem 6.6. The realizable SQ-learner $\mathcal{A}$ makes some finite $q(\varepsilon, \tau)$ queries. Let $C_{\mathcal{A}}$ denote the set of outputs of $\mathcal{A}$ when fed every possible combination of responses from the discretized set $\{-1, -1 + 2\tau, \ldots, 1 - 2\tau, 1\}$. For every $D \in \mathcal{P}$ and $h \in H$, one of these combinations must be a valid query response in the realizable model, so $C_{\mathcal{A}}$ covers $(\mathcal{P}, \mathcal{Y}, H, \ell)$. By the same arguments as Theorem 6.6, $C_{\mathcal{A}}$ must contain a hypothesis with error at most $c \cdot \mathit{OPT} + \varepsilon$. Since we can directly compute the loss of every element in $C_{\mathcal{A}}$ up to $\varepsilon$ error in the SQ model simply by querying the loss function, this gives the desired result in $|C_{\mathcal{A}}| = (1/\tau)^{q(\varepsilon, \tau)}$ queries.
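The enumeration underlying this proof is easy to sketch: treat the SQ learner as a function of its discretized response sequence and run it on every sequence. The `sq_learner` interface is an illustrative assumption; note the grid has $1/\tau + 1$ points, matching the $(1/\tau)^{q}$ bound up to constants:

```python
from itertools import product

def grid(tau):
    """Discretized query responses {-1, -1 + 2*tau, ..., 1 - 2*tau, 1}."""
    n = int(round(1 / tau))
    return [-1 + 2 * tau * i for i in range(n + 1)]

def cover_from_sq_learner(sq_learner, q, tau):
    """Run the SQ learner against every sequence of q discretized responses;
    the set of outputs covers every realizable (D, h). `sq_learner` takes a
    response sequence and returns a hypothesis (illustrative interface)."""
    return [sq_learner(responses) for responses in product(grid(tau), repeat=q)]
```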
We note that while our reduction in this model experiences an exponential blowup in the number of queries, this should really be thought of as corresponding to a blowup in run-time rather than "sample complexity" in the standard sense (which corresponds more closely to $\tau$).
12Fairness
Recent years have seen rising interest in an algorithmic property called fairness. Informally, fairness tries to tackle the issue that "well-performing" classifiers in the standard sense may actually discriminate against certain individuals or subgroups. We will consider a form of fair learning introduced by Rothblum and Yona [63] called Probably Approximately Correct and Fair (PACF) learning. Their definition is based on a notion of fairness ensuring that similar individuals are treated similarly with respect to a fixed metric.
Definition 12.1 (Metric Fairness).
Let $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\ge 0}$ be a similarity measure on $\mathcal{X}$ and $D$ a distribution over $\mathcal{X}$. A classifier $h: \mathcal{X} \to \mathcal{Y}_{out}$ is called $(\alpha, \gamma)$-fair with respect to $d$ and $D$ if $h$ acts similarly on most similar individuals:
$$\Pr_{x, x' \sim D}\left[|h(x) - h(x')| > d(x, x') + \gamma\right] \le \alpha.$$
We note that the output space $\mathcal{Y}_{out}$ may differ from the label space $\mathcal{Y}$ in general learning problems.
In fact, this definition only really makes sense when the output classifier $h$ is allowed to be real-valued (as this allows for some flexibility in the $|h(x) - h(x')|$ term). As such, when considering settings such as binary classification where $\mathcal{Y} = \{0, 1\}$ is discrete, Rothblum and Yona's [63] initial formalization considers returning probabilistic classifiers with $\mathcal{Y}_{out} = [0, 1]$. Here $h(x) = y \in [0, 1]$ is taken to be the probability of the label being 1. The error of a probabilistic classifier $h$ with respect to any distribution $D$ over $\mathcal{X} \times \{0, 1\}$ is then given by its expected $\ell_1$ distance:
$$\mathit{err}_D(h) = \mathbb{E}_{(x, y) \sim D}[|h(x) - y|].$$
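Both quantities, the metric-fairness violation rate and the expected $\ell_1$ error, are straightforward to estimate empirically from samples. A minimal sketch (function names are illustrative):

```python
def fairness_violation_rate(h, d, xs, gamma):
    """Empirical estimate of Pr_{x, x'}[|h(x) - h(x')| > d(x, x') + gamma]
    over all ordered pairs drawn from the sample xs."""
    pairs = [(x, xp) for x in xs for xp in xs]
    bad = sum(abs(h(x) - h(xp)) > d(x, xp) + gamma for x, xp in pairs)
    return bad / len(pairs)

def l1_error(h, labeled):
    """Empirical err_D(h) = E[|h(x) - y|] for a probabilistic classifier h
    mapping X into [0, 1], over a labeled sample of (x, y) pairs."""
    return sum(abs(h(x) - y) for x, y in labeled) / len(labeled)
```

For instance, the 1-Lipschitz classifier `h(x) = x` under the metric `d(x, x') = |x - x'|` has violation rate 0 for any `gamma >= 0`.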
For simplicity, we'll focus in this section on this same regime extended to the distribution-family model.
In broad strokes, the goal of Fair PAC learning is to output a fair classifier satisfying standard PAC guarantees. Practically this requires a few modifications. First, since there may be no fair classifier satisfying these guarantees, we will only require our output to be as good as the best fair classifier. Second, we will actually allow some slack in the fairness parameters, which Rothblum and Yona [63] show is a practical way to ensure that fair learnability remains possible across a broad range of classes.
Definition 12.2 (PACF-learning [63]).
We say $(\mathcal{P}, \mathcal{Y}, H)$ is (agnostically) $(\alpha, \gamma)$-PACF-learnable with respect to a similarity metric $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ if there exists an algorithm $\mathcal{A}$ and function $n = n(\varepsilon, \varepsilon_\alpha, \varepsilon_\gamma, \delta)$ such that for all $\varepsilon, \varepsilon_\alpha, \varepsilon_\gamma, \delta > 0$ and distributions $D$ over $\mathcal{X} \times \mathcal{Y}$ such that $D_{\mathcal{X}} \in \mathcal{P}$, $\mathcal{A}(S)$ satisfies the following guarantees with probability $1 - \delta$ over samples $S$ of size $n$:
1. $\mathcal{A}(S)$ is accurate: $\mathit{err}_{D, \ell}(\mathcal{A}(S)) \le \mathit{OPT}_{\alpha, \gamma} + \varepsilon$,
2. $\mathcal{A}(S)$ is $(\alpha + \varepsilon_\alpha, \gamma + \varepsilon_\gamma)$-fair.
Here $\mathit{OPT}_{\alpha, \gamma}$ is the optimal error of any $(\alpha, \gamma)$-fair classifier, that is:
$$\mathit{OPT}_{\alpha, \gamma} \coloneqq \min_{h \in H_{D_{\mathcal{X}}}^{\alpha, \gamma}} \left\{ \mathit{err}_{D, \ell}(h) \right\},$$
and
$$H_{D_{\mathcal{X}}}^{\alpha, \gamma} = \left\{ h \in H : h \text{ is } (\alpha, \gamma)\text{-fair with respect to } d \text{ and } D_{\mathcal{X}} \right\}.$$
Realizable learnability is defined similarly, where the adversary is constrained to picking distributions which have 0 error with respect to some $(\alpha, \gamma)$-fair classifier in $H$. We show that property generalization holds for the PACF model.
Theorem 12.3 (Agnostic → Realizable (PACF Setting)).
Let $(\mathcal{P}, \mathcal{Y}, H)$ be any class that is realizably $(\alpha, \gamma)$-PACF learnable with sample complexity $n(\varepsilon, \varepsilon_\alpha, \varepsilon_\gamma, \delta)$. Then $(\mathcal{P}, \mathcal{Y}, H)$ is agnostically $(\alpha, \gamma)$-fair-PAC learnable in only
$$n_U(\varepsilon, \varepsilon_\alpha, \varepsilon_\gamma, \delta) \le n(\varepsilon/2, \varepsilon_\alpha, \varepsilon_\gamma, \delta/2)$$
unlabeled samples, and
$$n_L(\varepsilon, \varepsilon_\alpha, \varepsilon_\gamma, \delta) \le O\left(\frac{\log(\Pi_H(n(\varepsilon/2, \varepsilon_\alpha, \varepsilon_\gamma, \delta/2))) + \log(1/\delta)}{\varepsilon^2}\right)$$
labeled samples.
Proof 12.4.
The key observation is that the definition of fairness depends only on the classifier $h$ and the marginal distribution $D_{\mathcal{X}}$. Let $h_{\mathit{OPT}}$ be the hypothesis achieving the minimum error over $H_{D_{\mathcal{X}}}^{\alpha, \gamma}$. By the above observation, with probability $1 - \delta$ the hypothesis set $C(S_U)$ returned by LearningToCover contains an $(\alpha + \varepsilon_\alpha, \gamma + \varepsilon_\gamma)$-fair $h$ satisfying:
$$\mathbb{E}_{x \sim D_{\mathcal{X}}}[|h(x) - h_{\mathit{OPT}}(x)|] \le \varepsilon/2.$$
Since $\ell_1$ error is a metric (and therefore satisfies the triangle inequality), we can use our argument for $c$-pseudometric loss functions from Theorem 6.6 to argue that choosing the lowest empirical risk $(\alpha + \varepsilon_\alpha, \gamma + \varepsilon_\gamma)$-fair classifier in $C(S_U)$ with respect to a sufficiently large labeled sample $S_L$ gives the desired learner.
With care, this result can be extended to a broader range of loss functions as well as to other finitely-satisfiable properties covered in this work.
13Notions of Coverability
In this section we discuss the connection between non-uniform covers and several previous notions of coverability used in various learning applications. For simplicity, we'll restrict our attention to covering with respect to standard classification distance; that is, given a distribution $D$ and hypotheses $h$ and $h'$ over some instance space $\mathcal{X}$:
$$d_D(h, h') = \Pr_{x \sim D}[h(x) \ne h'(x)].$$
To start, let's recall the basic notion of an $\varepsilon$-cover, specialized to this measure for simplicity.
Definition 13.1 ($\varepsilon$-cover).
Let $\mathcal{X}$ be an instance space, $\mathcal{Y}$ a label space, and let $L_{\mathcal{X}, \mathcal{Y}}$ denote the set of all labelings $f: \mathcal{X} \to \mathcal{Y}$. A set $C \subseteq L_{\mathcal{X}, \mathcal{Y}}$ is said to form an $\varepsilon$-cover for $(D, \mathcal{Y}, H)$ if for every hypothesis $h \in H$, there exists $f \in C$ such that
$$d_D(f, h) \le \varepsilon.$$
$C$ is called proper if $C \subseteq H$.
Finite $\varepsilon$-covers are exceedingly useful in learning theory. As discussed in Section 3, a common strategy in the literature is to use unlabeled samples to construct an $\varepsilon$-cover with high probability [8, 35, 2, 10]. This results in a distribution over potential covers we call a uniform $(\varepsilon, \delta)$-cover.
Definition 13.2 (Uniform $(\varepsilon, \delta)$-cover).
Let $\mathcal{X}$ be an instance space, $\mathcal{Y}$ a label space, and let $L_{\mathcal{X}, \mathcal{Y}}$ denote the set of all labelings $f: \mathcal{X} \to \mathcal{Y}$. A distribution $D_C$ over the power set $P(L_{\mathcal{X}, \mathcal{Y}})$ is said to form a uniform $(\varepsilon, \delta)$-cover for $(D, \mathcal{Y}, H)$ if:
$$\Pr_{C \sim D_C}[C \text{ is an } \varepsilon\text{-cover for } (D, \mathcal{Y}, H)] \ge 1 - \delta.$$
$D_C$ is called proper if its support lies entirely in $H$.
In this work, we introduce a weaker non-uniform variant of this notion, where each $h$ has an individual guarantee of being covered by the distribution, but it is not necessarily the case that a sample will cover all $h \in H$ simultaneously.
Definition 13.3 (Non-Uniform $(\varepsilon, \delta)$-cover).
Let $\mathcal{X}$ be an instance space, $\mathcal{Y}$ a label space, and let $L_{\mathcal{X}, \mathcal{Y}}$ denote the set of all labelings $f: \mathcal{X} \to \mathcal{Y}$. A distribution $D_C$ over the power set $P(L_{\mathcal{X}, \mathcal{Y}})$ is said to form a non-uniform $(\varepsilon, \delta)$-cover for $(D, \mathcal{Y}, H)$ if for every fixed hypothesis $h \in H$,
$$\Pr_{C \sim D_C}[C \text{ is an } \varepsilon\text{-cover for } (D, \mathcal{Y}, \{h\})] \ge 1 - \delta.$$
$D_C$ is called proper if its support lies entirely in $H$.
In the context of learning, we are usually interested not just in the existence of these covers, but in the more challenging problem of constructing them from a small number of unlabeled samples. In other words, given a class $(\mathcal{P}, \mathcal{Y}, H)$, we'd like to know how many unlabeled samples from an adversarially chosen distribution $D \in \mathcal{P}$ are necessary to build a uniform (or non-uniform) $(\varepsilon, \delta)$-cover for $(D, \mathcal{Y}, H)$. In Section 6.3, we saw that the ability to construct a non-uniform $(\varepsilon, \delta)$-cover from $O\left(\frac{\log(1/\delta)}{\varepsilon}\right)$ samples was crucial to give a semi-private learner with optimal public sample complexity. This improved over recent work of Alon, Bassily, and Moran (ABM) [2], who showed that it is possible to build a uniform $(\varepsilon, \delta)$-cover in $O\left(\frac{\log(1/\varepsilon) + \log(1/\delta)}{\varepsilon}\right)$ samples.
It is interesting to ask whether non-uniformity is really necessary here, or whether ABM's analysis is simply sub-optimal. We'll show that the former is true, at least in the proper distribution-family setting: the $\log(1/\varepsilon)$ gap between these models is necessary, and uniform covers cannot be used to build optimal semi-private learners.
Theorem 13.4 (Separation of Uniform and Non-Uniform Covers).
There exists an instance space $\mathcal{X}$, hypothesis class $H$, and family of distributions $\mathcal{P}$ such that for any sufficiently small $\varepsilon > 0$, the following statements hold:
1. Any algorithm which returns a finite proper uniform $(\varepsilon, 1/3)$-cover for $(\mathcal{P}, \mathcal{Y}, H)$ requires at least $\Omega(1/\varepsilon \cdot \log(1/\varepsilon))$ samples.
2. There exists an algorithm which returns a finite proper non-uniform $(\varepsilon, \delta)$-cover for $(\mathcal{P}, \mathcal{Y}, H)$ in $O(\log(1/\delta)/\varepsilon)$ samples.
Proof 13.5.
Let the instance space $\mathcal{X} = \mathbb{N}$ and $H$ be the class of indicators along with the all-0's function, that is $H = \{h_i : i \in \mathbb{N}\} \cup \{h_0\}$ where $h_i(x) = \mathbb{1}\{x = i\}$ and $h_0$ is 0 everywhere. We consider the family of distributions $\mathcal{P} = \{\mathcal{P}_{T, k}\}_{T, k > 0}$ given by $k$-sets of $[T]$, where
$$\mathcal{P}_{T, k} = \{\mathrm{unif}(S) : S \subseteq [T] \text{ and } |S| = k\},$$
and $\mathrm{unif}(S)$ is the uniform distribution over $S$.
We start with the first claim: building a finite proper uniform $(\varepsilon, 1/3)$-cover needs at least $\Omega(1/\varepsilon \cdot \log(1/\varepsilon))$ samples. More formally, for any error parameter $\varepsilon > 0$ and size bound $M = M(\varepsilon) \in \mathbb{N}$, let $k = \lfloor 1/(2\varepsilon) \rfloor$. We will show that any algorithm $\mathcal{A}$ on $k \log(k)$ samples that outputs at most $M$ hypotheses must fail to output an $\varepsilon$-cover with probability at least $1/3$.
Let $T \gg M, k$ be some natural number to be fixed later and consider the family of distributions $\mathcal{P}_{T, k}$. By Yao's minimax principle, it is sufficient to show that there exists a distribution over the elements in $\mathcal{P}_{T, k}$ such that any deterministic algorithm over $k \log(k)$ samples outputting a set of (at most) $M$ hypotheses fails to give a proper $\varepsilon$-cover with probability at least $1/3$. We claim that taking the uniform distribution over $\mathcal{P}_{T, k}$ suffices. To formalize this, it is useful to observe the following claim.
Claim 2.
Any subset of hypotheses $C \subseteq H$ of size $M$ can be a proper $\varepsilon$-cover for $H$ under at most $\binom{M}{k}$ distributions in $\mathcal{P}_{T, k}$.
Let's prove the result under this assumption. The key observation is that by standard lower bounds on the coupon collector problem, a sample $P$ of $k \log(k)$ points from any $\mathrm{unif}(S) \in \mathcal{P}_{T, k}$ will not include $\mathrm{unif}(S)$'s entire support with probability at least $1/2$. With this in mind, assume that the input sample $P$ contains only $|supp(P)| = j < k = |supp(\mathrm{unif}(S))|$ elements. As a result, there are $\binom{T - j}{k - j}$ distributions consistent with $P$, and by Claim 2, $\mathcal{A}(P)$ is a proper $\varepsilon$-cover for at most $\binom{M}{k}$ of them. Since $P$ is equally likely to have been sampled from any of these distributions, the probability that $\mathcal{A}(P)$ fails to be a proper $\varepsilon$-cover is at least:
$$\Pr\left[\mathcal{A} \text{ fails given } |supp(P)| = j < k\right] \ge \frac{\binom{T - j}{k - j} - \binom{M}{k}}{\binom{T - j}{k - j}}.$$
Taking $T$ sufficiently larger than $M$ and $k$, we can make this probability as close to 1 as desired for any $0 < j < k$. Finally, since samples of this form occur with probability at least $1/2$, the algorithm fails with probability at least $1/3$ as desired. It is left to prove Claim 2.
Proof 13.6 (Proof of Claim 2).
Notice that for any distribution $\mathrm{unif}(S) \in \mathcal{P}_{T, k}$, any $i \in S$, and any $j \ne i$, $d_{\mathrm{unif}(S)}(h_i, h_j) > 2\varepsilon$. Let $C$ be any proper $\varepsilon$-cover of $H$ under the distribution $\mathrm{unif}(S)$. Then, by the above argument, it must contain $\{h_i : i \in S\}$. Since $|S| = k$, $C$ can be a proper $\varepsilon$-cover of $H$ under at most $\binom{|C|}{k}$ distributions in $\mathcal{P}_{T, k}$.
We now move to proving that a proper non-uniform $(\varepsilon, \delta)$-cover can be built in only $O(\log(1/\delta)/\varepsilon)$ samples. This follows from the fact that for any $T \ge k$ and distribution $\mathrm{unif}(S) \in \mathcal{P}_{T, k}$, each $i \in S$ appears in the random sample $P$ with probability $1 - \delta$. Since each $h_i$ for $i \notin S$ is covered by $h_0$, outputting $\{h_i : i \in P\} \cup \{h_0\}$ generates a proper non-uniform $(\varepsilon, \delta)$-cover.
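The coupon-collector event driving the lower bound is easy to simulate: with roughly $k \log k$ uniform draws from a support of size $k$, a constant fraction of trials miss at least one support element. A Monte Carlo sketch with illustrative parameters:

```python
import math
import random

def misses_support(k, n_samples, rng):
    """True if n_samples uniform draws from a k-element support
    fail to hit every element (the coupon-collector failure event)."""
    return len({rng.randrange(k) for _ in range(n_samples)}) < k

# With ~ k*log(k) draws, the expected number of unseen elements is about 1,
# so roughly a 1 - 1/e fraction of trials miss part of the support.
rng = random.Random(0)
k, trials = 50, 200
n = int(k * math.log(k))
miss_rate = sum(misses_support(k, n, rng) for _ in range(trials)) / trials
```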
The construction in Theorem 13.4 can easily be modified to give a class with the same gap which is not privately learnable (say, by embedding a single copy of a threshold over $[0, 1]$). Since any such class requires at least $\Omega\left(\frac{1}{\varepsilon}\right)$ public samples to semi-privately learn by Theorem 6.25, Theorem 13.4 then provides a separation between using uniform and non-uniform covers in semi-private learning: the former provably requires an extra log factor, while the latter matches the lower bound exactly. Unfortunately, our proof of this result only holds in the proper setting, as Claim 2 fails when improper hypotheses are allowed. We conjecture that this is not an inherent barrier: the separation should continue to hold in the improper case, albeit with some different analysis.
We have now seen a weak separation between uniform and non-uniform covers, but one might reasonably wonder whether a much stronger separation is possible. In particular, all previous constructions of uniform covers use uniform convergence, but there exist simple examples of learnable classes in the distribution-family model that fail this property: do such classes provide an example of objects which are non-uniformly coverable but not uniformly coverable? Surprisingly, the answer is no! It turns out that an algorithm for non-uniform covering can always be used to construct a uniform covering without too much overhead. Moreover, we'll see that the $\log(1/\varepsilon)$ gap is tight when $(\mathcal{X}, H)$ has finite VC dimension.
To prove this, it will actually be useful to make a brief aside and introduce another closely related notion of covering called fractional covers. These objects are essentially a form of non-uniform covering which output a single hypothesis instead of a set of them.
Definition 13.7 (Fractional cover).
Let $\mathcal{X}$ be an instance space, $\mathcal{Y}$ a label space, and let $L_{\mathcal{X}, \mathcal{Y}}$ denote the set of all labelings $f: \mathcal{X} \to \mathcal{Y}$. A distribution $D_C$ over $L_{\mathcal{X}, \mathcal{Y}}$ is said to form a fractional $(\varepsilon, \alpha)$-cover for $(D, \mathcal{Y}, H)$ if for any fixed $h \in H$, a sample from $D_C$ covers $h$ with probability $\alpha$:
$$\Pr_{f \sim D_C}[d_D(f, h) \le \varepsilon] \ge \alpha.$$
Fractional covers are closely connected to non-uniform covers. In fact, one can easily move between the two by sampling or subsampling.
Proposition 13.8 (Non-uniform cover ↔ Fractional cover).
Let $(D, \mathcal{Y}, H)$ be any class, $C_{\mathit{frac}}$ a fractional $(\varepsilon, \alpha)$-cover, and $C_{\mathit{n\text{-}u}}$ a non-uniform $(\varepsilon, 1/2)$-cover. Then the following hold:
1. Drawing $\log_{1/(1-\alpha)}(1/\delta)$ samples from $C_{\mathit{frac}}$ gives a non-uniform $(\varepsilon, \delta)$-cover.
2. Choosing a random hypothesis from $C_{\mathit{n\text{-}u}}$ gives a fractional $(\varepsilon, 1/(2|C|))$-cover.
Proof 13.9.
Both statements are essentially immediate from the definitions. For any fixed $h \in H$, if we draw $n$ samples from $C_{\mathit{frac}}$, the probability we fail to cover $h$ is $(1 - \alpha)^n$, so setting $n = \log_{1/(1-\alpha)}(1/\delta)$ gives the desired non-uniform cover. On the other hand, for any fixed $h \in H$, a sample $C \sim C_{\mathit{n\text{-}u}}$ contains an $f$ that is $\varepsilon$-close to $h$ with probability $1/2$. Outputting a uniformly random element of $C$ then gives an element within $\varepsilon$ of $h$ with probability $1/(2|C|)$ as desired.
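The first conversion is just a repeated-trials calculation; as a small sketch, the number of draws needed from a fractional $(\varepsilon, \alpha)$-cover so that a fixed $h$ is covered with probability $1 - \delta$ is:

```python
import math

def samples_for_nonuniform_cover(alpha, delta):
    """Number of independent draws from a fractional (eps, alpha)-cover so
    that a fixed h is covered w.p. 1 - delta: the smallest n with
    (1 - alpha)^n <= delta, i.e. n = ceil(log_{1/(1-alpha)}(1/delta))."""
    return math.ceil(math.log(1 / delta) / math.log(1 / (1 - alpha)))
```

For example, with $\alpha = 1/2$ and $\delta = 1/4$, two draws suffice.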
It will also be useful to note a classical relation between covers and fractional covers.
Lemma 13.10.
If there exists a fractional $(\varepsilon, \alpha)$-cover for $(D, \mathcal{Y}, H)$, then there exists a $2\varepsilon$-cover of size $O(1/\alpha)$.
Proof 13.11.
This follows from classical packing-covering duality. The existence of a fractional $(\varepsilon, \alpha)$-cover implies there cannot exist a $2\varepsilon$-packing of size greater than $1/\alpha$ (that is, a set of more than $1/\alpha$ hypotheses in $H$ that are pairwise $2\varepsilon$-separated with respect to $D$). By packing-covering duality, this implies the existence of a $2\varepsilon$-cover of size $\lfloor 1/\alpha \rfloor + 1$.
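The duality used here can be made algorithmic in finite settings by greedily building a maximal $2\varepsilon$-packing, whose centers then also form a $2\varepsilon$-cover. A sketch under the assumption of a finite hypothesis list and an explicit distance oracle (names are illustrative):

```python
def greedy_cover(hyps, dist, eps):
    """Greedily build a maximal 2*eps-packing of the finite list hyps:
    keep a hypothesis iff it is 2*eps-far from all kept ones. By maximality,
    every hypothesis is within 2*eps of some kept center, so the centers
    are simultaneously a 2*eps-packing and a 2*eps-cover."""
    centers = []
    for h in hyps:
        if all(dist(h, c) > 2 * eps for c in centers):
            centers.append(h)
    return centers
```

On one-dimensional "hypotheses" with `dist(a, b) = |a - b|` and `eps = 0.5`, the centers returned are pairwise more than 1 apart, and every input point lies within 1 of a center.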
With this in hand, let's show that uniform covers can be constructed for any realizably learnable class, regardless of whether we have uniform convergence.
Theorem 13.12 (Realizable learning → Uniform cover).
Let $(\mathcal{P}, \mathcal{Y}, H)$ be realizably PAC-learnable with sample complexity $n(\varepsilon, \delta)$. Then it is possible to construct a uniform $(\varepsilon, \delta)$-cover for $(\mathcal{P}, \mathcal{Y}, H)$ in $n(\varepsilon/2, \delta')$ samples, where
$$\delta' = O\left(\frac{\delta}{\Pi_H(n(\varepsilon/2, 1/2))}\right).$$
Proof 13.13.
We'll start by proving a slightly more general fact: if for every $D \in \mathcal{P}$, $(D, \mathcal{Y}, H)$ has a proper $(\varepsilon/2)$-cover $C_D$ of size at most $C = C(\varepsilon/2)$, then it is possible to construct a uniform $(\varepsilon, \delta)$-cover in $n(\varepsilon/2, \delta/C)$ samples. This is essentially immediate from Lemma 5.2, which states that running LearningToCover over a sample of size $n(\varepsilon/2, \delta/C)$ gives a non-uniform $(\varepsilon/2, \delta/C)$-cover. Union bounding over $C_D$ then gives that a sample from the non-uniform cover $(\varepsilon/2)$-covers $C_D$ with probability at least $1 - \delta$. Since $C_D$ is itself an $(\varepsilon/2)$-cover, this implies that the entire class $H$ is $\varepsilon$-covered by the sample with probability at least $1 - \delta$ as desired.
It remains to show that for every $D \in \mathcal{P}$, $(D, \mathcal{Y}, H)$ has a proper $(\varepsilon/2)$-cover of size $O(\Pi_H(n(\varepsilon/2, 1/2)))$. This follows from combining Proposition 13.8 and Lemma 13.10. In particular, Lemma 5.2 implies that running LearningToCover over a sample of size $n(\varepsilon/2, 1/2)$ produces a non-uniform $(\varepsilon/2, 1/2)$-cover of size at most $\Pi_H(n(\varepsilon/2, 1/2))$. Proposition 13.8 states that subsampling from this cover gives a fractional $(\varepsilon/2, 1/(2\Pi_H(n(\varepsilon/2, 1/2))))$-cover, which in turn implies the existence of a $(\varepsilon/2)$-cover of size $O(\Pi_H(n(\varepsilon/2, 1/2)))$ as desired. We note that this last argument is similar to an observation made in Benedek and Itai's [17] seminal work on the distribution-dependent model.
When $(\mathcal{X}, H)$ has finite VC-dimension $d$, note that Theorem 13.12 exactly matches the lower bound exhibited in Theorem 13.4, as the required number of samples for a uniform $(\varepsilon, \delta)$-cover becomes:
$$n(\varepsilon/2, \delta') \le O\left(\frac{d \log(1/\varepsilon) + \log(1/\delta)}{\varepsilon}\right).$$
This also matches the bound given by ABM [2] using uniform convergence.
Acknowledgements
The authors would like to thank Shay Moran, Russell Impagliazzo, Omar Montasser, and Avrim Blum for enlightening discussions. We also thank anonymous referees for constructive feedback, and especially for pointing out the notion of probabilistic representations and that prior work discussed in Section 3 falls under the general framework of our reduction.