Title: Realizable Learning is All You Need

URL Source: https://arxiv.org/html/2111.04746



License: CC BY 4.0 | arXiv:2111.04746v4 [cs.LG] 03 Feb 2024

Max Hopkins (nmhopkin@ucsd.edu), Daniel M. Kane (dakane@eng.ucsd.edu), and Shachar Lovett (slovett@cs.ucsd.edu), Department of Computer Science and Engineering, UCSD (Kane also Department of Mathematics, UCSD); Gaurav Mahajan (gaurav.mahajan@yale.edu), Department of Computer Science, Yale University.

Keywords: PAC-Learning, Realizable Learning, Agnostic Learning, Blackbox Reductions. DOI: 10.46298/theoretics.24.2. Received Sep 28, 2022; revised Nov 21, 2023; accepted Dec 11, 2023; published Feb 3, 2024. Invited article from COLT 2022.

Realizable Learning is All You Need

Abstract

The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression.

In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions and more general loss functions, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model.

More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.

1 Introduction

The equivalence of realizable and agnostic learnability in Valiant's Probably Approximately Correct (PAC) model [58] is one of the best known results in learning theory, and numbers among its most surprising. Given a set X and a family of binary classifiers H, the result states that the ability to learn a classifier h ∈ H from examples of the form (x, h(x)) is in fact sufficient for something much stronger: given samples from any distribution D over X × {0,1}, it is possible to learn the best approximation to D in H. This surprising equivalence stems from a classical result of Vapnik and Chervonenkis (VC) [60], and independently Blumer, Ehrenfeucht, Haussler, and Warmuth (BEHW) [19] and Haussler [36], who equate both the former model (known as realizable learning) and the latter model (known as agnostic learning) to a strong property of pairs (X, H) called uniform convergence.1

VC, BEHW, and Haussler's result was certainly a breakthrough in its own right, but its proof technique is too indirect to reveal any deeper connections between realizable and agnostic learning beyond the PAC setting. Further, recent years have seen both theory and practice shift away not only from this original formalization, but more generally from the "uniform convergence equals learnability" paradigm, often in favor of distributional or data-dependent assumptions like margin that are more applicable to the real world. The inability of VC, BEHW, and Haussler's proof technique to generalize to such scenarios raises a fundamental question: is the equivalence of realizable and agnostic learning a fundamental property of learnability, or simply a happy coincidence derived from the original PAC framework?

In the 30 years since these works, a mountain of evidence has amassed in favor of the former: almost every reasonable variant of learning shares some sort of similar equivalence. This includes a long list of popular settings such as regression [9], distribution-dependent learning [17], multi-class learning [29], robust learning [52], online learning [16], private learning [13, 3], and partial learning [47, 5]. What's more, the uniform convergence paradigm fails miserably in most of these models. In the distribution-dependent model, for instance, it is easy to build classes which are trivially learnable (even with one sample!) but completely fail to satisfy uniform convergence [17]. On the other hand, models such as private learning give well-known examples where uniform convergence fails to imply learnability [6]. In spite of this, we are really no closer today to a general understanding of this phenomenon than we were in the early 90s. Much like Vapnik and Chervonenkis [60], Blumer, Ehrenfeucht, Haussler, and Warmuth [19], and Haussler's [36] proofs, the above works often use indirect methods and tend to rely on powerful model-dependent assumptions.

In this work, we aim to offer a generic, unifying theory by way of the first direct reduction from agnostic to realizable learning. Unlike any previous work, our reduction is blackbox, relies on no additional assumptions, and, perhaps most importantly, is incredibly simple. In fact, the basic algorithm can be stated in three lines.

Input: Realizable PAC-learner A, unlabeled sample oracle O_U, labeled sample oracle O_L.

Algorithm:

1. Draw an unlabeled sample S_U ∼ O_U, and a labeled sample S_L ∼ O_L.

2. Run A over all possible labelings of S_U to get:

   C(S_U) := { A(S_U, h(S_U)) : h ∈ H|_{S_U} }.

3. Return the hypothesis in C(S_U) with lowest empirical error over S_L.

Algorithm 1: Agnostic to Realizable Reduction
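To make the three steps concrete, here is a minimal Python sketch of the reduction over a finite hypothesis class (an illustration, not the paper's formal construction; `realizable_learner`, `hypotheses`, `S_U`, and `S_L` are placeholder names):

```python
def agnostic_from_realizable(realizable_learner, hypotheses, S_U, S_L):
    """Sketch of Algorithm 1 for a finite class of binary classifiers.

    `realizable_learner(labeled_sample)` is any realizable learner;
    `hypotheses` plays the role of H (finite here, so we can enumerate
    the labelings H|_{S_U} directly).
    """
    # Step 2: run the learner on every labeling of S_U induced by the
    # class, collecting the outputs C(S_U).
    cover = [realizable_learner([(x, h(x)) for x in S_U]) for h in hypotheses]

    # Step 3: return the element of C(S_U) with lowest empirical error
    # on the labeled sample S_L.
    def empirical_error(g):
        return sum(g(x) != y for x, y in S_L) / len(S_L)

    return min(cover, key=empirical_error)
```

Enumerating all labelings makes the reduction exponential-time in general, consistent with the computational inefficiency discussed below; the point is that it is sample-efficient and fully blackbox in the realizable learner.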

This basic reduction simplifies and unifies classic results such as VC [60], BEHW [19], and Haussler's [36] distribution-free equivalence and Benedek and Itai's [17] analogous result in the distribution-dependent setting2 up to log factors. More importantly, because Algorithm 1 doesn't rely on model-dependent properties like uniform convergence, it extends to learning regimes without known characterizations. One such example is the notoriously difficult distribution-family model, in which the adversary is given a restricted family of distributions 𝒟 along with the pair (X, H). The distribution-family model has no finitary characterization [44],3 yet Algorithm 1 can still be used to show that the realizable and agnostic settings are equivalent.

Unfortunately, while Algorithm 1 does avoid any significant blowup in sample complexity, it is inherently computationally inefficient. In fact, this is necessary unless P = NP: there are many basic classes (e.g. halfspaces) which are easy to learn in the realizable model, but NP-hard or cryptographically hard in the agnostic setting (see e.g. [33]). As such, we focus in this work only on information-theoretic considerations, though building computationally efficient reductions in restricted settings remains an interesting avenue of research.

The core of Algorithm 1 lies in an equivalence between PAC learning and a new form of randomized covering of independent interest we call a non-uniform cover. In contrast to more classical notions, a non-uniform cover is a distribution over subsets of hypotheses that covers any fixed hypothesis in the class with high probability, but may fail to cover all hypotheses simultaneously. The connection between supervised learning and non-uniform covering is inherent in Algorithm 1, where Steps 1 and 2 turn the realizable learner A into a non-uniform cover C(S_U), and Step 3 uses the cover to perform agnostic learning. At a high level, this process works because the adversary does not see the randomness inherent to Steps 1 and 2, and therefore cannot detect or exploit which hypotheses in the class will fail to be well-estimated in the process.

In fact, this connection has many broader implications within the theory of supervised learning. For one, the method is not inherently restricted to the agnostic setting. Algorithm 1 achieves agnostic learning for general classes by applying an agnostic learner for finite classes in Step 3; replacing this with learners satisfying other properties (e.g. the exponential mechanism for privacy) leads to other families of reductions to the realizable setting. At a high level, this can be summarized by the following informal 'guiding principle':

Guiding Principle (Property Generalization).

If there is a (sample-efficient) algorithm with property P over finite classes, then Algorithm 1 gives a (sample-efficient) learner with property P over any 'learnable' class.

We stress that the above is a guide, not a theorem, and indeed often requires modification or weakening of the desired property for a given application. For instance, when Step 3 is replaced with a private algorithm, the result is a semi-private learner for general classes, a weakened model allowing the use of a small amount of public unlabeled data [14, 2].

Algorithm 1 also provides a unified framework for many settings beyond the PAC-model. This includes basic extensions such as general loss functions4 and the distribution-family model, but also more involved modifications such as partial learning, robust learning, or even the statistical query model. Moreover, in some of these settings removing reliance on setting-specific assumptions like uniform convergence actually quantitatively improves the sample complexity; such is the case for the semi-private model, where we use this fact to achieve information-theoretically optimal unlabeled sample complexity for the first time.

Finally, we note there are a few settings where Algorithm 1 runs into issues, especially discrete infinite settings such as infinite multi-class classification and properties such as privacy that require more careful data handling. We leave the extension of our method to these settings as an intriguing open problem.

1.1 Proof Overview

Before moving to a more detailed discussion of our results, we briefly overview the proof that Algorithm 1 is an agnostic learner. At its most basic, the idea is simply to observe that the set C(S_U) almost certainly contains a 'near-optimal' hypothesis h̃ ∈ H. Since C(S_U) has bounded size, standard uniform convergence for finite classes promises Step 3 outputs a hypothesis close to h̃, which is therefore itself 'near-optimal' as desired.

Taking a step back, the key observation in this process is really that C(S_U) 'acts like a cover' of H in the following weak sense we call non-uniform covering:

Definition 1.1 (Non-uniform Cover (Informal Definition 5.1)).

Let (X, H) be a hypothesis class, D a marginal distribution over X, and C a random variable over the power set P(H). We call C a non-uniform (ε, δ)-cover of H with respect to D if:

∀h ∈ H: Pr_C[ ∃h′ ∈ C : Pr_{x∼D}[h′(x) ≠ h(x)] ≤ ε ] ≥ 1 − δ.

In other words, for every fixed hypothesis h in the class, C is very likely to contain a hypothesis close to h. This differs from standard covering, which enforces that C simultaneously cover every h ∈ H (equivalent to pushing the '∀' quantifier inside the probability above). In Section 13, we show that the latter (which is the standard in the literature) is strictly stronger and requires more samples to generate. This is critical in our application to semi-private learning, where we cut the additional samples required to generate a full cover and thereby achieve the optimal sample complexity.
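As a quick numerical illustration of the definition (not from the paper), one can Monte Carlo estimate the covering probability for a single fixed hypothesis; `draw_C` is a hypothetical sampler for the random set C, and the distribution D is approximated by a fixed sample:

```python
import random

def nonuniform_cover_prob(h, draw_C, D_sample, eps, trials=100):
    """Estimate Pr_C[exists h' in C with Pr_{x~D}[h'(x) != h(x)] <= eps],
    approximating D by the fixed point set `D_sample`."""
    def dist(h1, h2):
        return sum(h1(x) != h2(x) for x in D_sample) / len(D_sample)
    hits = sum(any(dist(hp, h) <= eps for hp in draw_C()) for _ in range(trials))
    return hits / trials
```

A non-uniform cover only needs this probability to be at least 1 − δ separately for each fixed h; a uniform cover would need a single draw of C to be ε-close to every h ∈ H simultaneously, which is the strictly stronger requirement discussed above.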

It is left to observe that C(S_U) in Algorithm 1 is actually a non-uniform cover, but this is essentially immediate from the definition of realizable learning! Realizable learning promises that for every fixed hypothesis h ∈ H, when A receives samples labeled by h it outputs a hypothesis close to h with high probability. C(S_U) is generated by running A across all h ∈ H, so this guarantee exactly translates to the above. We refer the reader to Sections 2 and 5 for a formal explanation of this argument.

1.2 Beyond PAC Learning

Algorithm 1 is an extremely flexible framework for proving agnostic to realizable reductions in supervised learning. In this section, we informally overview the many extended models studied in this paper. The most basic setting in which Algorithm 1 applies beyond the standard model is learning under distributional assumptions (or formally what we call the "distribution-family model", see Sections 2 and 4). Standard techniques using combinatorial dimensions require that the learner A works over every distribution (the "distribution-free" setting), while modern algorithms frequently only work under some niceness conditions on the distribution. On the other hand, it is easy to observe that the process described above works under arbitrary distributional assumptions: as long as A is a realizable learner over a distribution D, then Algorithm 1 is an agnostic learner whenever the data is marginally distributed as D. This leads to the following corollary:

Theorem 1.2 (Distribution Family Model (Informal Theorem 5.4)).

The sample complexity of agnostically learning a class (X, H) in the distribution family model is at most:

m(ε, δ) ≤ O( (n(ε, δ) + log(1/δ)) / ε² ),

where n(ε, δ) is the realizable sample complexity of (X, H).

In the finite VC setting, this can be improved to roughly (d log(d/ε) + log(1/δ)) / ε² for VC dimension d, which is optimal up to a log factor.

Algorithm 1 also covers many other supervised settings in the literature that diverge from the PAC model in more substantial ways, including general loss functions, constraints on the learner, and properties beyond agnostic learning. For simplicity we cover these here informally, avoiding exact definitions and sample complexities (which are generally similar to the above), and give formal details in the main body.

1.2.1 General Loss Functions

Perhaps the most natural extension of the PAC framework is to loss functions beyond binary classification. Here data points are labeled by an arbitrary label space Y and error is measured with respect to a generic loss function ℓ : Y × Y → ℝ≥0; for instance, we may take Y = ℝ and ℓ(y, y′) = (y − y′)², the square loss. In general, agnostic and realizable learning are actually not equivalent in this setting, even for nice loss functions (see Proposition 6.1). The issue is that over infinite label spaces it is possible to encode hypotheses into the labels with infinite precision, making the class trivial to learn realizably, but impossible even with the smallest amount of noise (which wipes out the encoding). We show this type of counter-example is essentially the only barrier to agnostic learnability.

Somewhat more formally, we call a class H discretely learnable if for every ε > 0, there exists an ε-discretization5 of H that is learnable up to O(ε) error. Discrete learnability can informally be thought of as a very weak type of noise tolerance that essentially acts only to rule out the above construction. We prove discrete and agnostic learnability are equivalent under weak conditions on the loss function.

Theorem 1.3 (General Loss (Informal Theorems 7.2 and 6.6)).

If ℓ is a (probably) upper bounded loss function6 and either:

1. ℓ is bounded from below, or

2. ℓ satisfies a C-approximate triangle inequality,

then discrete and agnostic learnability are equivalent under ℓ up to polynomial factors.

We remark that in the latter case, the learner only achieves C · OPT + ε error and that this is tight (Proposition 6.8). To our knowledge, these results are new even in the distribution-free setting, where such an equivalence was only known for bounded Lipschitz [9, 62] or binary-valued [15, 29] loss functions. In the models below, similar guarantees hold under the above assumptions on loss. We omit the exact dependence on ℓ for simplicity and discuss it where relevant in the main body.

Another well-studied setting for loss functions beyond standard classification is adversarial robustness. Robust learning is an extension of the PAC model introduced to handle adversarial perturbations at test time by taking a maximum of the loss function over specified 'perturbation sets' around the test point. We give a modification of Algorithm 1 in the robust setting that handles general loss and distributional assumptions.

Theorem 1.4 (Robust Classification (Informal Theorem 8.5)).

Robust realizable and robust agnostic learning are equivalent up to polynomial factors.

Note this is completely independent of the perturbation sets. In the classification setting, Theorem 1.4 generalizes recent work giving such an equivalence in the distribution-free model [52, 50], though the sample complexity of our algorithm suffers an extra factor of ε⁻¹ in this special case.

Finally, another recent setting that works with a sort of 'modified' loss function is the learning of partial functions, capturing 'data-dependent' models such as halfspaces with geometric margin. Here the functions in H are allowed to be 'undefined' on some portions of X, and the loss is fixed to 1 on any undefined point no matter the response of the algorithm.

Theorem 1.5 (Partial functions (Informal Theorem 9.5)).

Realizable and agnostic learning of partial functions are equivalent up to polynomial factors.

In the distribution-free setting, a variant of this equivalence is known via compression schemes, but the above is new under distributional assumptions and more general loss (again at the cost of an extra ε⁻¹ factor).

1.2.2 Constrained Models

Another frequent modification of the PAC setting is to impose constraints either on how the algorithm A uses the data, or on various properties of the output classifier itself. In this section, we cover applications to three such examples: fairness, stability, and statistical queries.

We start with fairness, where the goal is to output a hypothesis with low error conditional on 'treating similar individuals similarly' under a fixed metric on the data space.

Theorem 1.6 (Fair Learning (Informal Theorem 12.3)).

Realizable and agnostic fair learning are equivalent up to polynomial factors.

Formally, the above result holds in Rothblum and Yona's [63] popular 'Probably Approximately Correct and Fair' (PACF)-learning. To our knowledge it is the first such reduction even in the distribution-free setting, as Rothblum and Yona [63] study the agnostic case directly.

Another popular constraint in the literature is algorithmic stability, where one roughly imposes that running the algorithm twice should produce similar results. We study a simple variant known as uniform stability [24] (closely related to 'private prediction' [32]) which promises that for every fixed point x in the space, the output distribution of A on x is similar when A is run on neighboring datasets.

Theorem 1.7 (Uniform Stability (Informal Theorem 10.2)).

Realizable and agnostic learning are equivalent under uniform stability up to polynomial factors.

This is known in the distribution-free setting by VC arguments [32, 24], but is new for general loss and under distributional assumptions. Unlike the prior examples, the sample complexity here matches that of prior techniques.

Another way to constrain the algorithm A is through the way it interacts with data. Perhaps the most popular example of this is the statistical query model, where A is constrained to approximating general population statistics (such algorithms are then guaranteed to satisfy certain nice properties such as noise-tolerance and privacy). We give a variant of our reduction for the statistical query setting as well.

Theorem 1.8 (Statistical Query Model (Informal Theorem 11.2)).

Realizable and agnostic learning are equivalent in the statistical query model up to polynomial factors.

Variants of such a result are known via combinatorial characterization in the fixed distribution setting [57, 33], but are new in our general setup.

1.2.3 Beyond Agnostic Learning

In the introduction, we claimed Algorithm 1 can be used to build a learner satisfying any "finitely-satisfiable" property, not just agnostic learning. In fact, one of the examples above already displays this property: Theorem 1.7 does not require the base learner to be uniformly stable. Instead, we apply a uniformly stable agnostic learner to C(S_U) in Step 3 of Algorithm 1 and 'lift' uniform stability from finite to infinite classes. We give two further applications of this strategy: to malicious noise, and (semi)-private learning.

Kearns and Li's [41] malicious noise is a model for data contamination where one is given a faulty sample oracle O_M(·) which (with some small probability) returns a completely adversarial pair (x, y), and otherwise draws from a true 'ground truth' distribution. A learner is said to be tolerant to malicious noise if it achieves standard PAC guarantees while drawing from O_M(·). Like agnostic learning, tolerance to malicious noise is known to be achievable for finite classes [41]. As a result, (a slight modification of) Algorithm 1 gives a blackbox reduction from learning with malicious noise to realizable learning.

Theorem 1.9 (Realizable → Malicious (Informal Theorem 6.11)).

Realizable learning and learning with malicious noise are equivalent up to polynomial factors.

This extends the original result of [41] to the distribution-family and general loss settings.

Our final application of Algorithm 1 is to the ubiquitous notion of differential privacy. Informally, an algorithm is said to be differentially private if its output is not susceptible to small changes in the underlying sample (see Section 6.3). Privacy is a strong condition, and even relaxed notions require finite Littlestone dimension in the distribution-free setting [6], ruling out a direct reduction from realizable learning. However, Algorithm 1 does recover a weaker variant known as semi-privacy [14] where the learner is allowed to use a few 'public' unlabeled samples, but must otherwise maintain privacy with respect to the data.

Theorem 1.10 (Realizable → Semi-Private (Informal Theorem 6.21)).

Realizable and semi-private learning are equivalent up to polynomial factors.

This generalizes Beimel, Nissim, and Stemmer's [14] equivalence in the distribution-free setting and extends to general loss. Perhaps most interesting is that in this case, by relying only on non-uniform covering over more classical uniform convergence arguments, Algorithm 1 actually gives a quantitative improvement over prior methods.

Corollary 1.11 (Semi-Optimal Semi-Private Learning (Informal Corollary 6.23)).

Let (X, H) be a class with VC-dimension d. Then (X, H) is α-semi-privately, agnostically learnable in

m_pub(ε, δ, α) ≤ O( (d + log(1/δ)) / ε )

unlabeled (public) samples, and

m_pri(ε, δ, α) ≤ O( (d log(1/ε) + log(1/δ)) / (ε · min{ε, α}) )

labeled (private) samples.

For fixed ๐‘‘ and ๐›ฟ , the public sample complexity of Corollary 1.11 is tight and resolves a conjecture of Alon, Bassily, and Moran [2] who gave the corresponding lower bound.7 The private complexity is off by a log factor from known lower bounds [22], and it remains an interesting question whether this can be fixed.

Theorem 1.10 and Corollary 1.11 are also robust to light distribution shift between the public and private databases. This problem, called covariate shift, is commonly observed in machine learning practice and is especially of concern in privacy, where a distribution over "opt-in" public users could easily differ from the overall distribution of private data. We discuss covariate shift in more depth in Section 6.4.

1.3 Modification Archetypes

Finally, before moving to the formal presentation of our methods and results, we briefly overview the four generic modifications of Algorithm 1 used in the above extensions, and outline where they can be found formally in the main body of the paper.

Discretization.

We'll start by discussing our main technique to extend Algorithm 1 to infinite label spaces. The basic idea is simple: since we cannot afford to run our learner over all possible labelings of S_U, we instead run the learner over labelings coming from some discretization of the class. As long as we have access to a learner for the discretization, we can then use the same arguments covered in Section 2 to prove various occurrences of property generalization. We formalize these notions more generally in Section 6.1, where we use the technique to prove Theorem 1.3. Discretization can also be used to handle learning models such as the statistical query setting which output real-valued query responses (see Section 11).
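For intuition, a crude form of ε-discretization for real-valued predictors simply rounds outputs to a grid of width ε. This is a toy stand-in for the paper's formal discretization, with a hypothetical `discretize` helper:

```python
def discretize(h, eps):
    """Round h's real-valued outputs to a grid of width eps, so the
    discretized hypothesis takes finitely many values on any bounded
    output range and |h(x) - discretize(h, eps)(x)| <= eps/2."""
    return lambda x: round(h(x) / eps) * eps
```

Under, say, the square loss, replacing h by its discretization perturbs the risk by O(ε), which is the sense in which a learner for the discretization suffices.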

Subsampling.

Another core limitation of Algorithm 1 is its reliance on clean unlabeled data. Algorithm 1 works by running a realizable learner over a representative set of unlabeled data, but, in practice, such data may often be corrupted, and data-dependent assumptions such as margin might mean that the optimal hypothesis isn't even well-defined on this set. We handle cases like these by a simple sub-sampling procedure: instead of running our realizable learner over labelings of S_U, we run the learner over all labelings of all subsets of S_U. As long as S_U contains some amount of uncorrupted data, this subsampling procedure will find it and we can maintain the guarantees discussed in Section 2. We use this technique to prove property generalization for models such as robust learning (see Theorem 1.4 and Section 8), partial learning (see Section 9), and malicious noise (see Theorem 1.9 and Section 6.2).
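A minimal sketch of this subsampling variant (illustrative only, with placeholder names; like Algorithm 1 itself, it is exponential-time):

```python
from itertools import combinations

def subsampled_cover(realizable_learner, hypotheses, S_U, min_size):
    """Run the realizable learner on every labeling (by the class) of
    every subset of S_U of size >= min_size, so that if S_U contains an
    uncorrupted subset, some run of the learner sees only clean data."""
    cover = []
    for k in range(min_size, len(S_U) + 1):
        for T in combinations(S_U, k):
            for h in hypotheses:
                cover.append(realizable_learner([(x, h(x)) for x in T]))
    return cover
```

The resulting (larger) candidate set plays the same role as C(S_U) in Step 3: an empirical-error minimization over the labeled sample then picks out a good hypothesis.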

Replacing the Finite Learner.

In the introduction, we proposed a general paradigm (guiding principle) called property generalization: that a variant of any learning property which holds for finite classes should in fact hold for any "learnable" class in the base model. The main idea relies on replacing Step 3 of Algorithm 1 (which, as stated, is an empirical risk minimization process) with a generic learner for finite classes with the desired property. For noise-tolerance properties such as agnostic and malicious noise, empirical risk minimization works. Properties such as privacy or stability, however, require a different finite learner. To prove Theorem 1.10, for example, we replace the ERM process in Algorithm 1 with the exponential mechanism [49]. We use a similar strategy in Section 10 to prove an analogous result for uniform stability.

Replacing the Base Learner.

Finally we note a very basic modification of Algorithm 1 that allows us to extend property generalization beyond the PAC setting: simply replace the input realizable PAC learner with a realizable learner in the desired model. This is usually combined with one of the techniques above depending on the specific application, e.g. to prove property generalization for robust learning and the statistical query model. The same idea can also be used to analyze semi-private learning with covariate shift (see Section 6.4) and property generalization for fair learning (see Section 12).

2 Distribution Family Classification

Since all of our results are derived from variants of Algorithm 1, it is instructive to start by considering its basic analysis in our simplest non-trivial setting: distribution-family classification. We remark that this section is entirely pedagogical, and the results are subsumed in Section 5.

The distribution-family model captures learnability with arbitrary distributional assumptions, a well-studied relaxation of PAC learning in practice where worst-case distributional assumptions are often too strong, and encompasses both the distribution-free and distribution-dependent PAC settings. Unlike these models, however, we cannot assume uniform convergence in the distribution family setting, due to a classical example of Benedek and Itai [17].

Proposition 2.1 (Benedek and Itai [17]).

There exists a PAC-learnable class (D, X, H) over binary labels and classification loss without the uniform convergence property.

Proof 2.2.

Let X = [0,1], D be the uniform distribution over X, Y = {0,1}, and H consist of all indicator functions for finite sets S ⊂ X, as well as for X itself. It is not hard to see that (D, X, H) is realizably PAC-learnable by the following scheme with only a single sample: if the learner draws a sample labeled 1, output the all-1's function; otherwise, output the all-0's function. When the adversary has chosen a finite set, with probability 1 the learner draws a sample labeled 0, and outputs a hypothesis with 0 error (since the finite set has measure 0). If the adversary chooses the all-1's function, the learner will always output the all-1's function.

On the other hand, it is clear that when the adversary chooses the all-1's function, no matter how many samples the learner draws, there will exist a hypothesis in the class that is poorly approximated by the sample. Namely, the hypothesis whose support is given by the support of the sample itself has empirical measure 1, but true measure 0. As a result, this class fails to have the uniform convergence property despite its learnability.
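The one-sample learner from the proof can be written out directly (an illustrative sketch; the returned functions are the all-0 and all-1 hypotheses):

```python
def benedek_itai_learner(sample):
    """One-sample realizable learner for the class of indicators of
    finite sets (plus all of [0,1]) under the uniform distribution.
    A finite target set has measure 0, so the sample is all-0 almost
    surely and the all-0 hypothesis has zero error; if any label is 1,
    the target must be the all-1 function."""
    if any(y == 1 for _, y in sample):
        return lambda x: 1
    return lambda x: 0
```

The learner is correct with probability 1 over the draw of the sample, yet, as the proof shows, no sample size makes the empirical errors of all hypotheses in H concentrate, so learnability here genuinely decouples from uniform convergence.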

This simple example rules out many classical techniques in supervised learning, including uniform convergence based reductions (e.g. [8, 14, 2]). Furthermore, since the distribution-family setting has no finitary characterization, we also cannot hope to take the traditional approach [60, 17, 36, 9] of proving equivalence of realizable and agnostic models by simultaneous characterization.

With this motivation in mind, let's define distribution-family learning more formally. Let $X$ be a set (called the instance space), $Y = \{0,1\}$ the set of binary labels, $\mathcal{D}$ a family of distributions over $X$, and $H \subseteq \{h : X \to Y\}$ a family of binary classifiers. A tuple $(\mathcal{D}, X, H)$ is said to be realizably learnable if there exists an algorithm $\mathcal{A}$ and a function $n(\varepsilon, \delta)$ such that for every $\varepsilon, \delta > 0$, every choice of distribution $D \in \mathcal{D}$, and every hypothesis $h \in H$, $\mathcal{A}$ outputs a good classifier with high probability on samples of size $n(\varepsilon, \delta)$:

Pr ๐‘† โˆผ ๐ท ๐‘› โข ( ๐œ€ , ๐›ฟ ) โก [ err ๐ท ร— โ„Ž โข ( ๐’œ โข ( ๐‘† , โ„Ž โข ( ๐‘† ) ) ) โ‰ค ๐œ€ ] โ‰ฅ 1 โˆ’ ๐›ฟ ,

where err ๐ท ร— โ„Ž โข ( โ‹… ) is commonly called the error or risk of โ„Ž :

๐‘’ โข ๐‘Ÿ โข ๐‘Ÿ ๐ท ร— โ„Ž โข ( โ„Ž โ€ฒ )

Pr ๐‘ฅ โˆผ ๐ท โข [ โ„Ž โ€ฒ โข ( ๐‘ฅ ) โ‰  โ„Ž โข ( ๐‘ฅ ) ] .

Likewise, a tuple $(\mathcal{D}, X, H)$ is said to be agnostically learnable if there exists an algorithm $\mathcal{A}$ which, for every distribution $D$ over $X \times Y$ whose marginal satisfies $D_X \in \mathcal{D}$, outputs an $h'$ close to the best hypothesis in $H$ with probability $1 - \delta$:

Pr ๐‘† โˆผ ๐ท ๐‘› โข ( ๐œ€ , ๐›ฟ ) โก [ err ๐ท โข ( ๐’œ โข ( ๐‘† ) ) โ‰ค ๐‘‚ โข ๐‘ƒ โข ๐‘‡ + ๐œ€ ] โ‰ฅ 1 โˆ’ ๐›ฟ ,

where ๐‘‚ โข ๐‘ƒ โข ๐‘‡

inf โ„Ž โˆˆ ๐ป { err ๐ท โข ( โ„Ž ) } is the error of the best hypothesis in the class9 and the risk err ๐ท โข ( โ‹… ) is similarly defined:

err ๐ท โข ( โ„Ž โ€ฒ )

Pr ( ๐‘ฅ , ๐‘ฆ ) โˆผ ๐ท โข [ โ„Ž โ€ฒ โข ( ๐‘ฅ ) โ‰  ๐‘ฆ ] .

With this in mind, we can now state the most basic application of Algorithm 1: the equivalence of agnostic and realizable learning for distribution-family classification.

Theorem 2.3 (Realizable → Agnostic (Distribution-Family Classification)).

Let ๐’œ be a realizable learner for ( ๐’Ÿ , ๐‘‹ , ๐ป ) using ๐‘› โข ( ๐œ€ , ๐›ฟ ) samples. Then Algorithm 1 is an agnostic learner for ( ๐’Ÿ , ๐‘‹ , ๐ป ) using:

๐‘š ๐‘ˆ โข ( ๐œ€ , ๐›ฟ ) โ‰ค ๐‘› โข ( ๐œ€ / 2 , ๐›ฟ / 2 )

unlabeled samples, and

๐‘š ๐ฟ โข ( ๐œ€ , ๐›ฟ ) โ‰ค ๐‘‚ โข ( ๐‘› โข ( ๐œ€ / 2 , ๐›ฟ / 2 ) + log โก ( 1 / ๐›ฟ ) ๐œ€ 2 )

labeled samples. Moreover if ( ๐‘‹ , ๐ป ) has finite VC dimension ๐‘‘ , Algorithm 1 needs only

๐‘š ๐ฟ โข ( ๐œ€ , ๐›ฟ ) โ‰ค ๐‘‚ โข ( ๐‘‘ โข log โก ( 1 / ๐œ€ ) + log โก ( 1 / ๐›ฟ ) ๐œ€ 2 )

labeled samples.

Along with its novelty in the distribution-family setting (where no such equivalence was known), it is worth noting that in the distribution-free setting, Theorem 2.3 actually recovers the same sample complexity bound as the standard uniform convergence analysis (though it remains a log factor off from the optimal bound given by chaining [45]). We also note that while unlabeled sample complexity is not usually considered separately from labeled complexity in the PAC setting, this distinction will become useful in the semi-supervised extensions considered later in the work. As such, it is instructive to keep the two complexities separate for the time being.

With this out of the way, let's prove Theorem 2.3. The analysis breaks naturally into two parts, corresponding respectively to Step 2 and Step 3 of Algorithm 1. In the first part, we'll show that $C(S_U)$, the set of outputs obtained by running the realizable learner $\mathcal{A}$ across all possible labelings of the unlabeled sample $S_U$, is in some sense a "good approximation" of the class $H$. More formally, the crucial observation is that for any choice of the adversary's distribution, $C(S_U)$ will (almost) always contain a hypothesis close to the optimal solution.

Claim 1.

For any distribution $D$ over $X \times Y$ whose marginal satisfies $D_X \in \mathcal{D}$, with probability $1 - \delta/2$ there exists $h' \in C(S_U)$ within $\varepsilon/2$ of the optimal risk:

$$\mathrm{err}_D(h') \le OPT + \varepsilon/2.$$

Once we have this claim, the second step is to show that Step 3, empirical risk minimization over $C(S_U)$, gives the desired agnostic learner. This follows from standard arguments. In particular, given a hypothesis $h \in C(S_U)$, let

$$\mathrm{err}_{S_L}(h) = \Pr_{(x,y) \sim S_L}\left[h(x) \ne y\right]$$

denote its empirical risk with respect to $S_L$. Since $C(S_U)$ is finite, a standard Chernoff + union bound gives that with probability at least $1 - \delta/2$, the empirical risk of every hypothesis in $C(S_U)$ with respect to $S_L$ is close to its true risk. Then as long as $S_L$ is sufficiently large, empirical risk minimization returns a solution with at most $OPT + \varepsilon$ error with high probability (we'll formalize this in a moment).
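The empirical risk minimization step above can be sketched as follows; the helper names and the leading Chernoff constant are illustrative, not from the paper:

```python
import math

def empirical_risk(h, labeled_sample):
    """Fraction of labeled examples the hypothesis misclassifies."""
    return sum(h(x) != y for x, y in labeled_sample) / len(labeled_sample)

def erm(candidates, labeled_sample):
    """Return a hypothesis in `candidates` minimizing empirical risk."""
    return min(candidates, key=lambda h: empirical_risk(h, labeled_sample))

def labeled_sample_size(num_candidates, eps, delta):
    """Chernoff + union bound: O(log(|C|/delta)/eps^2) labeled samples make
    every candidate's empirical risk close to its true risk (the constant 32
    here is an illustrative choice)."""
    return math.ceil(32 * math.log(2 * num_candidates / delta) / eps ** 2)

# Toy check: ERM over a small candidate set recovers the zero-risk threshold.
candidates = [lambda x: 0, lambda x: 1, lambda x: int(x >= 0.5)]
S_L = [(0.1, 0), (0.2, 0), (0.6, 1), (0.9, 1)]
best = erm(candidates, S_L)
assert empirical_risk(best, S_L) == 0.0
```

In the actual reduction, `candidates` is the finite set $C(S_U)$ produced by running the realizable learner over all labelings of the unlabeled sample.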

It remains to prove Claim 1. The key observation lies in an equivalence between realizable PAC-learning and a weak type of randomized covering: for any fixed $h \in H$, $C(S_U)$ contains a hypothesis close to $h$ with high probability.

Lemma 2.4.

For any distribution $D$ over $X \times Y$ with marginal $D_X \in \mathcal{D}$ and any $h \in H$, with probability $1 - \delta/2$ there exists $h' \in C(S_U)$ within $\varepsilon/2$ of $h$ in classification distance:

$$\Pr_{x \sim D_X}\left[h'(x) \ne h(x)\right] \le \varepsilon/2.$$

Proof 2.5.

The proof is essentially immediate from the definition of realizable PAC-learning. $\mathcal{A}$ promises that for any $h \in H$ and $D \in \mathcal{D}$, a $1 - \delta/2$ fraction of labeled samples $(S, h(S)) \sim D^{n(\varepsilon/2, \delta/2)}$ satisfy

$$\mathrm{err}_{D \times h}\left[\mathcal{A}(S, h(S))\right] = \Pr_{x \sim D}\left[h'(x) \ne h(x)\right] \le \varepsilon/2,$$

where $h' = \mathcal{A}(S, h(S))$. Since $C(S_U)$ contains $\mathcal{A}(S_U, h(S_U))$ for every $h \in H$ by definition, the result follows.

More generally, we call such objects non-uniform covers.

Definition 2.6 (Non-uniform Cover (Informal Definition 5.1)).

Let ( ๐‘‹ , ๐ป ) be a class over label space ๐‘Œ , ๐ท a marginal distribution over ๐‘‹ , and ๐ถ a random variable over the power set ๐‘ƒ โข ( ๐ป ) . We call ๐ถ a non-uniform ( ๐œ€ , ๐›ฟ ) -cover of ๐ป with respect to ๐ท if for every โ„Ž โˆˆ ๐ป :

Pr ๐ถ โก [ โˆƒ โ„Ž โ€ฒ โˆˆ ๐ถ : Pr ๐‘ฅ โˆผ ๐ท โก [ โ„Ž โ€ฒ โข ( ๐‘ฅ ) โ‰  โ„Ž โข ( ๐‘ฅ ) ] โ‰ค ๐œ€ ] โ‰ฅ 1 โˆ’ ๐›ฟ .

Note that Lemma 2.4 (and non-uniform covering in general) does not imply that $C(S_U)$ contains hypotheses close to every $h \in H$ simultaneously. This stronger object is called a uniform cover and provably takes more samples to construct (see Section 13). In our case, a non-uniform cover is sufficient: since the guarantee holds for every fixed $h \in H$, it holds in particular for the optimal hypothesis $h_{OPT}$, so $C(S_U)$ contains some $h'$ within $\varepsilon/2$ of optimal. Let's now formalize these ideas and put everything together to prove Theorem 2.3.

Proof 2.7 (Proof of Theorem 2.3).

Let ๐ท be the adversaryโ€™s distribution over ๐‘‹ ร— ๐‘Œ , and let โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โˆˆ ๐ป be a hypothesis achieving the optimal error. By Lemma 2.4, with probability 1 โˆ’ ๐›ฟ / 2 , ๐ถ โข ( ๐‘† ๐‘ˆ ) contains a hypothesis โ„Ž โ€ฒ such that:

Pr ๐‘ฅ โˆผ ๐ท ๐‘‹ โก [ โ„Ž โ€ฒ โข ( ๐‘ฅ ) โ‰  โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โข ( ๐‘ฅ ) ] โ‰ค ๐œ€ / 2 .

This implies Claim 1 (that $C(S_U)$ contains a hypothesis with error at most $OPT + \varepsilon/2$), since

$$
\begin{aligned}
\mathrm{err}_D(h') &= \Pr_{(x,y) \sim D}\left[h'(x) \ne y\right] \\
&\le \Pr_{(x,y) \sim D}\left[h_{OPT}(x) \ne y\right] + \Pr_{(x,y) \sim D}\left[h_{OPT}(x) \ne h'(x)\right] \\
&\le OPT + \varepsilon/2.
\end{aligned}
$$

We can now use standard empirical risk minimization bounds on $C(S_U)$ to find a hypothesis with error at most $OPT + \varepsilon$. Chernoff and union bounds imply that with probability at least $1 - \delta/2$, the empirical risk of every hypothesis in $C(S_U)$ on a sample of size $O\!\left(\frac{\log(|C(S_U)|/\delta)}{\varepsilon^2}\right)$ is within $\varepsilon/4$ of its true error. Since $h'$ has error at most $OPT + \varepsilon/2$, its empirical risk is at most $OPT + 3\varepsilon/4$, and by the above guarantee any hypothesis in $C(S_U)$ with empirical risk at most $OPT + 3\varepsilon/4$ has true error at most $OPT + \varepsilon$.

Putting everything together, with probability $1 - \delta$ over the entire process, the empirical risk minimizer over $C(S_U)$ has error at most $OPT + \varepsilon$, as desired. The sample complexity bounds follow from noting that $|C(S_U)|$ is at most $2^{n(\varepsilon/2, \delta/2)}$, and at most $\left(\frac{e \cdot n(\varepsilon/2, \delta/2)}{d}\right)^d$ if the class has VC dimension $d$. The bound for the latter case then follows by plugging in the standard bound for distribution-free classification: $n(\varepsilon/2, \delta/2) \le O\!\left(\frac{d \log(1/\varepsilon) + \log(1/\delta)}{\varepsilon}\right)$.

3 Related Work

Agnostic learning is a widely studied model across learning theory, and works in many different sub-areas have noted model-specific equivalences with realizable learning. Here we'll survey a few representative examples and discuss how they relate to and differ from our approach.

3.1 Beyond Binary Classification

Uniform Convergence and Multiclass Classification.

It is well known that the "uniform convergence equals learnability" paradigm continues to hold for 0/1-valued loss functions over constant-size label spaces [19, 15, 59, 54], and that agnostic and realizable learning are equivalent as a result. On the other hand, Daniely, Sabato, Ben-David, and Shalev-Shwartz [25] showed this is no longer the case as the number of labels grows large. In this regime, even basic multi-class learning is no longer equivalent to uniform convergence, so the connection between realizable and agnostic learning becomes non-trivial. A few years later, David, Moran, and Yehudayoff (DMY) [29] proved the equivalence nevertheless holds in the infinite multi-class setting through the weaker "sample compression equals learnability" paradigm. While more general than the uniform convergence paradigm, their proof remains model-specific and fails in many of the settings we consider, e.g. partial learning [5].

Discretization and General Loss Functions.

Basic forms of discretization were also considered back in the mid-90s in work on characterizing the learnability of real-valued functions. In a seminal work, Bartlett, Long, and Williamson (BLW) [9] proved that a scale-sensitive measure introduced by Kearns and Schapire [42], called fat-shattering dimension, characterizes learnability under bounded Lipschitz loss functions. BLW use a basic form of discrete learning (called quantization) to prove that fat-shattering dimension is a necessary condition, and use uniform convergence to prove sufficiency. We give a similar argument to BLW in the necessary direction, but show that uniform convergence is not needed for the equivalence to hold, and instead use Algorithm 1 to appeal directly to discrete learnability. This allows us to extend BLW's result across a much more general set of loss functions and scenarios without strong model-specific assumptions.

3.2 Semi-supervised, Active, and Semi-Private Learning

Our reduction hinges on combining a realizable learner with unlabeled data to cut down the number of potential hypotheses in our class. The use of unlabeled samples to this effect is one of the core ideas in the field of semi-supervised learning [8, 65]. Here, it is usually additionally assumed that the function to be learned has some relation (or "compatibility") to the underlying data distribution; for example, it might have large margin on unlabeled data as in Transductive SVM [38], or redundant sufficient information as in Co-training [18, 28]. In their seminal work on the topic, Balcan and Blum [8] employed a strategy similar to Algorithm 1 in which they draw an unlabeled sample $S_U$ and select hypotheses consistent with each possible labeling based upon compatibility. They argue via uniform convergence that this results in a uniform cover, and then use empirical risk minimization to select a good hypothesis in the cover. It is worth noting that around the same time, a similar strategy independently found use in the online learning literature in work of Ben-David, Pál, and Shalev-Shwartz [16], who simulated the so-called "standard optimal algorithm" (SOA) over a sequence of examples and applied weighted majority [46] over the resulting set of hypotheses to obtain an agnostic online learner.

Similar strategies have also found use in the related active learning literature. Hanneke and Yang [35] use the same technique to build a cover from unlabeled samples (adding one hypothesis consistent with each possible labeling), and then apply active (adaptive) query algorithms to learn the best hypothesis in the cover in as few labeled samples as possible. This generalized earlier work of Dasgupta [27], who assumed a priori that the cover was known to the learner ahead of time. Most recently, the approach has seen use in the study of semi-private learning. In their original work on the model, Beimel, Nissim, and Stemmer [14] again apply the same trick for building a uniform cover, but then find the best hypothesis privately via the exponential mechanism (similar to our proof of Theorem 1.10). The analysis of this strategy was later improved by Alon, Bassily, and Moran (ABM) [2].

The above works differ from ours in two crucial senses. First, each work focuses solely on developing an algorithm for its specific framework (rather than working to understand a more general equivalence or reduction between settings). In this sense, one can view each of these prior results as a specific instance of our general framework where the "base learner" in our reduction is restricted to be an empirical risk minimizer (or SOA in the online setting), and a problem-specific learner for the relevant property (online, agnostic, active, or private) is then applied over the resulting cover. Second, and perhaps most importantly, these previous works all rely fundamentally on uniform convergence. This means that their algorithms break down as soon as one moves away from the original PAC model (even to, say, the basic distribution-dependent setting), and can also lead to sub-optimal sample complexity bounds. In the analysis of semi-private learning, for instance, we show that avoiding uniform convergence leads to asymptotically better bounds, actually resolving the public sample complexity of the model altogether. Indeed, one can show that building a uniform cover requires asymptotically more unlabeled samples than a non-uniform one, and therefore cannot result in optimal semi-supervised algorithms.

3.3 Non-Uniform Covering and Probabilistic Representations

Covering techniques have long been used in learning theory, and while almost all prior works focus on uniform notions (where all hypotheses are covered simultaneously), there is one notable exception. In 2013, Beimel, Nissim, and Stemmer (BNS) [12] introduced probabilistic representations, a strong randomized form of covering used to characterize pure differentially private learning. In the language of our work, given a class $(X, H)$, a probabilistic representation is a distribution over subsets of $H$ which is a non-uniform cover simultaneously over all distributions on $X$. BNS prove that private learning is equivalent to the existence of a probabilistic representation for the class. Equivalently, this can be thought of as the ability to build a non-uniform cover without access to the underlying distribution at all. On the other hand, we are interested in the much weaker setting where a non-uniform cover can be built from a bounded number of samples from the distribution (and crucially argue that this is equivalent to realizable learning). Thus in a sense, our core connection between realizable learning and non-uniform covering can be thought of as an analog of BNS' characterization of private learning by probabilistic representations.

Paper Organization

The main body of this paper is split into two main portions. In the first, we cover preliminaries (Section 4), the base reduction for all finitely supported loss functions (Section 5), and discuss in detail the four main modification archetypes to Algorithm 1: discretization (Section 6.1), sub-sampling (Section 6.2), replacing ERM (Section 6.3), and changing the base model (Section 6.4), covering extensions to infinite label classes, malicious noise, semi-private learning, and covariate shift respectively. In the remainder of the paper we cover applications of these modified versions to doubly-bounded loss (Section 7), robust learning (Section 8), partial learning (Section 9), uniformly-stable learning (Section 10), the statistical query model (Section 11), and fair learning (Section 12), and discuss further connections of non-uniform covers to previous notions of covering (Section 13). Each of the latter sections is self-contained, and the reader is encouraged to skip directly to any model they wish to see.

4 Preliminaries

Before moving to a more formal discussion of our results, we'll cover the most basic learning models discussed in this work: standard (distribution-free) PAC-learning and distribution-family PAC-learning. Extended models beyond these (e.g. malicious noise, robust learning, partial learning) will instead be introduced in their respective sections.

4.1 PAC-Learning

We'll start by reviewing the seminal PAC-learning model of Valiant [58] and Vapnik and Chervonenkis [60], beginning with a few core definitions for the setting of general loss. Let $X$ be an arbitrary set called the instance space (e.g. $\mathbb{R}^d$), $Y$ a set called the label space (e.g. $\{0,1\}$), and $H$ a family of labelings of $X$ by $Y$ (that is, a family of functions of the form $h : X \to Y$). Given a class $(X, H)$, it will often be useful to consider its growth function $\Pi_H(n)$, which measures the maximum size of $H$ when restricted to a sample of size $n$:

$$\Pi_H(n) = \max_{S \in X^n} \big|\, H|_S \,\big|.$$

We note that the growth function is trivially bounded by $|Y|^n$, but one can often give stronger bounds when $(X, H)$ satisfies some finite combinatorial dimension (e.g. VC dimension in the binary case).
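As a quick illustration of the growth function, the following sketch counts the labelings a class induces on a fixed sample; the threshold class is our example, not the paper's:

```python
def growth_on_sample(H, sample):
    """Number of distinct labelings H induces on `sample`: the quantity
    whose maximum over samples of size n defines the growth function Pi_H(n)."""
    return len({tuple(h(x) for x in sample) for h in H})

# Thresholds on the line, H = {x -> 1[x >= t]}: on n distinct points they
# induce at most n + 1 labelings, far below the trivial |Y|^n = 2^n bound.
H = [lambda x, t=t: int(x >= t) for t in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2]]
points = [0.1, 0.3, 0.5, 0.7, 0.9]
assert growth_on_sample(H, points) == len(points) + 1
```

This polynomial (rather than exponential) behavior is exactly what a finite combinatorial dimension buys.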

While PAC-learning is sometimes used to refer only to classification, we will study the model under general loss functions. With that in mind, we call a function $\ell : Y \times Y \to \mathbb{R}_{\ge 0}$ a loss function if $\ell(y, y) = 0$ for all $y \in Y$. We say a loss $\ell$ satisfies the identity of indiscernibles if $\ell(y_1, y_2) = 0$ iff $y_1 = y_2$. Given any distribution $D$ over $X \times Y$ and loss $\ell$, the risk of a labeling $h : X \to Y$ with respect to $D$ and $\ell$ is its expected loss:

$$\mathrm{err}_{D,\ell}(h) = \mathbb{E}_{(x,y) \sim D}\left[\ell(h(x), y)\right].$$

The goal of learning is generally to find a classifier $h \in H$ that minimizes risk. More formally, there are two commonly studied variants of this problem. The original formulation, now called realizable learning, assumes the existence of a hypothesis in $H$ with no loss.

Definition 4.1 ((Realizable) PAC-learning).

We say ( ๐‘‹ , ๐ป , โ„“ ) is realizable PAC-learnable if there exists an algorithm ๐’œ and function ๐‘› โข ( ๐œ€ , ๐›ฟ ) such that for all ๐œ€ , ๐›ฟ > 0 and distributions ๐ท over ๐‘‹ ร— ๐‘Œ such that min โ„Ž โˆˆ ๐ป โก ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž )

0 :

Pr ๐‘† โˆผ ๐ท ๐‘› โข ( ๐œ€ , ๐›ฟ ) โก [ ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( ๐’œ โข ( ๐‘† ) )

๐œ€ ] โ‰ค ๐›ฟ .

๐’œ is called proper if it outputs only labels in ๐ป .

Perhaps a more realistic variant of PAC-learning is to drop this restriction on the adversary, and let them choose an arbitrary distribution over ๐‘‹ ร— ๐‘Œ . This model, introduced by Haussler [36] and Kearns, Schapire, and Sellie [41], is known as agnostic learning.

Definition 4.2 ((Agnostic) PAC-learning).

We say ( ๐‘‹ , ๐ป , โ„“ ) is agnostic PAC-learnable if there exists an algorithm ๐’œ and function ๐‘› โข ( ๐œ€ , ๐›ฟ ) such that for all ๐œ€ , ๐›ฟ

0 and distributions ๐ท over ๐‘‹ ร— ๐‘Œ :

Pr ๐‘† โˆผ ๐ท ๐‘› โข ( ๐œ€ , ๐›ฟ ) โก [ ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( ๐’œ โข ( ๐‘† ) )

๐‘‚ โข ๐‘ƒ โข ๐‘‡ + ๐œ€ ] โ‰ค ๐›ฟ ,

where ๐‘‚ โข ๐‘ƒ โข ๐‘‡

min โ„Ž โˆˆ ๐ป โก { ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž ) } .

For some settings covered in this work, it will turn out that reaching $OPT + \varepsilon$ error is too stringent a condition. However, we will show that in these cases it is sometimes possible to maintain a weaker guarantee and learn up to $c \cdot OPT + \varepsilon$ error for some constant $c > 1$. We call such classes $c$-agnostically learnable.

Finally, we note that for simplicity, when $\ell$ is the standard classification error,

$$\ell(y_1, y_2) = \begin{cases} 0 & \text{if } y_1 = y_2 \\ 1 & \text{otherwise,} \end{cases}$$

we'll simply write $(X, H)$ to mean $(X, H, \ell)$. Realizable and agnostic learning are well studied under many basic loss functions, including binary classification, where both models are known to be characterized by a combinatorial parameter called VC dimension.

4.2 Learning Under Distribution Families

The standard PAC-models described above are often called distribution-free due to the fact that no assumptions are made on the marginal distribution over $X$. In practice, however, this is usually too worst-case an assumption. We often expect distributions in nature to be "nice" in some way, or at least somewhat restricted. This is reflected in the fact that popular machine learning algorithms usually significantly outperform the PAC-model's worst-case generalization bounds. Indeed, such niceness assumptions have long been popular in learning theory as well, where conditions such as tail bounds or anti-concentration are frequently used to build efficient algorithms.

These ideas are captured more generally by a simple (but notoriously difficult) extension to the PAC framework originally proposed by Benedek and Itai [17], where the adversary is restricted to picking from a fixed, known set of distributions.

Definition 4.3 ((Realizable) Distribution-Family PAC-learning).

Let ๐‘‹ be an instance space and ๐’Ÿ a family of distributions over ๐‘‹ . We say ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) is realizable PAC-learnable if there exists an algorithm ๐’œ and function ๐‘› โข ( ๐œ€ , ๐›ฟ ) such that for all ๐œ€ , ๐›ฟ

0 and distributions ๐ท over ๐‘‹ ร— ๐‘Œ satisfying:

1.

The marginal ๐ท ๐‘‹ โˆˆ ๐’Ÿ ,

2.

min โ„Ž โˆˆ ๐ป โก ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž )

0 ,

we have

Pr ๐‘† โˆผ ๐ท ๐‘› โข ( ๐œ€ , ๐›ฟ ) โก [ ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( ๐’œ โข ( ๐‘† ) )

๐œ€ ] โ‰ค ๐›ฟ .

Agnostic learning is defined similarly. The adversary must still choose a marginal distribution in $\mathcal{D}$, but the conditional labeling can be arbitrary.

Definition 4.4 (Agnostic Distribution-Family PAC-learning).

We say ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) is agnostic PAC-learnable if there exists an algorithm ๐’œ and function ๐‘› โข ( ๐œ€ , ๐›ฟ ) such that for all ๐œ€ , ๐›ฟ

0 and distributions ๐ท over ๐‘‹ ร— ๐‘Œ satisfying:

1.

The marginal ๐ท ๐‘‹ โˆˆ ๐’Ÿ ,

๐’œ outputs a good hypothesis with high probability:

Pr ๐‘† โˆผ ๐ท ๐‘› โข ( ๐œ€ , ๐›ฟ ) โก [ ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( ๐’œ โข ( ๐‘† ) )

๐‘‚ โข ๐‘ƒ โข ๐‘‡ + ๐œ€ ] โ‰ค ๐›ฟ

where ๐‘‚ โข ๐‘ƒ โข ๐‘‡

min โ„Ž โˆˆ ๐ป โก { ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž ) } .

The weaker ๐‘ -agnostic learning is defined analogously with ๐‘‚ โข ๐‘ƒ โข ๐‘‡ replaced by ๐‘ โ‹… ๐‘‚ โข ๐‘ƒ โข ๐‘‡ . Unlike the standard model, very little is known about distribution-family learnability. A number of works have given partial characterizations of the model [17, 31, 43, 61], but it was recently shown by Lechner and Ben-David [44] that no general finitary characterization is possible.

5 The Core Reduction: Finite Label Classes

In this section, we give a more detailed exposition of our main reduction from Section 2 in the general setting of arbitrary loss on constant-size label spaces, along with matching lower bounds and additional discussion of non-uniform covers. As mentioned previously, since there is no combinatorial characterization of learnability in the distribution-family model [44], standard techniques [19, 15, 59, 54] cannot be used.

Before jumping into our reduction proper, it is worth reiterating why we can't simply take the approach of prior works and rely on uniform convergence, a strong condition which promises that on a large enough sample, the empirical error of every hypothesis will be close to its true error. While uniform convergence was a very popular technique in the early years of learning theory, practitioners have since moved away from the paradigm, which fails to capture learning rates seen in practice [64, 53]. Indeed, it soon became clear that the technique fails to capture even basic theoretical models such as the distribution-dependent setting (as discussed in Proposition 2.1). In later sections, we will even see distribution-free models where uniform convergence fails, such as the Partial PAC model [47, 5], which captures realistic scenarios such as learning with margin. Since even the most basic modifications of PAC-learning fail to satisfy uniform convergence, it is clear we need to move beyond the condition to gain a more general understanding of the common phenomenon of equivalence between learning models.

Instead of relying on uniform convergence, our core observation is an equivalence between learning and sample access to a combinatorial object we call a non-uniform cover.

Definition 5.1 (Non-uniform Cover).

Let ( ๐‘‹ , ๐ป ) be a class over label space ๐‘Œ and ๐ฟ ๐‘‹ , ๐‘Œ denote the family of all labelings from ๐‘‹ to ๐‘Œ . If ๐ถ is a random variable over the power set ๐‘ƒ โข ( ๐ฟ ๐‘‹ , ๐‘Œ ) and ๐‘‘ : ๐ฟ ๐‘‹ , ๐‘Œ ร— ๐ฟ ๐‘‹ , ๐‘Œ โ†’ โ„ โ‰ฅ 0 is a โ€œdistanceโ€ function between labelings, we call ๐ถ a non-uniform ( ๐œ€ , ๐›ฟ ) -cover of ๐ป with respect to ๐‘‘ if for all โ„Ž โˆˆ ๐ป :

Pr ๐‘‡ โˆผ ๐ถ โก [ โˆƒ โ„Ž โ€ฒ โˆˆ ๐‘‡ : ๐‘‘ โข ( โ„Ž โ€ฒ , โ„Ž ) โ‰ค ๐œ€ ] โ‰ฅ 1 โˆ’ ๐›ฟ .

We call ๐ถ bounded if its support lies entirely on subsets of size at most some ๐‘˜ โˆˆ โ„• , and we call the smallest such ๐‘˜ its size.

Non-uniform covers share a close connection to several notions of covering used throughout the learning literature such as uniform covers [2] and fractional covers [4]. We discuss these connections in more detail in Section 13. For the moment, we note only that previous works using the strictly stronger notion of uniform covering necessarily lose factors in the sample complexity as a result. We discuss this further in Section 6.3.

In Section 2, we argued (at least implicitly) that once we have sampling access to a bounded non-uniform cover, agnostic learnability follows from standard arguments: since a sample $T$ has bounded size and is guaranteed to contain a concept "close" to optimal, it suffices to run empirical risk minimization over roughly $\log(|T|/\delta)/\varepsilon^2$ samples. The key to our reduction therefore boils down to turning blackbox access to a realizable PAC-learner into sampling access to a relevant non-uniform cover. This is given by Step 2 of Algorithm 1, which we rewrite here as a subroutine called LearningToCover.

Input: Hypothesis class $H$, realizable PAC-learner $\mathcal{A}$, unlabeled sample $S_U$.

Algorithm:

1. Run $\mathcal{A}$ over all possible labelings of $S_U$ by $H$.

2. Return the set of responses $C(S_U)$:

$$C(S_U) := \left\{\, \mathcal{A}(S_U, h(S_U)) \;\middle|\; h \in H|_{S_U} \,\right\}.$$

Algorithm 2: LearningToCover($H$, $\mathcal{A}$, $S_U$)
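A minimal Python sketch of this subroutine, using a toy consistent-hypothesis learner as the blackbox realizable learner (both helper names are ours, and the threshold class is only an illustration):

```python
def learning_to_cover(H, learner, S_U):
    """Run the realizable learner on every labeling of S_U realized by H and
    return the set C(S_U) of resulting hypotheses (at most Pi_H(|S_U|) many)."""
    labelings = {tuple(h(x) for x in S_U) for h in H}  # H restricted to S_U
    return [learner(S_U, labels) for labels in labelings]

def consistent_learner(H):
    """A toy realizable learner: return some h in H consistent with the sample."""
    def A(S, labels):
        return next(h for h in H if all(h(x) == y for x, y in zip(S, labels)))
    return A

# Thresholds on the line as a toy class.
H = [lambda x, t=t: int(x >= t) for t in [0.0, 0.25, 0.5, 0.75, 1.0]]
S_U = [0.1, 0.4, 0.6, 0.9]
C = learning_to_cover(H, consistent_learner(H), S_U)

# C(S_U) contains, for every h in H, a hypothesis agreeing with h on S_U.
for h in H:
    assert any(all(c(x) == h(x) for x in S_U) for c in C)
```

In the reduction proper, $\mathcal{A}$ is an arbitrary realizable learner (not necessarily an ERM), which is exactly what lets the argument avoid uniform convergence.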

In fact, we already argued in Section 2 that LearningToCover gives sampling access to a non-uniform cover, but we will restate the result here in this formulation for convenience.

Lemma 5.2 (Core Lemma: Realizable Learning Implies Non-Uniform Covering).

Let ๐’œ be an algorithm that ( ๐œ€ , ๐›ฟ ) -PAC learns a class ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) in ๐‘›

๐‘› โข ( ๐œ€ , ๐›ฟ ) samples. Then for any ๐ท โˆˆ ๐’Ÿ , running LearningToCover on ๐‘† ๐‘ˆ โˆผ ๐ท ๐‘› returns a sample from a size ฮ  ๐ป โข ( ๐‘› ) , non-uniform ( ๐œ€ , ๐›ฟ ) -cover with respect to the standard distance between hypotheses:

๐‘‘ โข ( โ„Ž , โ„Ž โ€ฒ )

๐”ผ ๐ท โข [ โ„“ โข ( โ„Ž โ€ฒ โข ( ๐‘ฅ ) , โ„Ž โข ( ๐‘ฅ ) ) ] .

Proof 5.3.

The proof is essentially immediate from the definition of realizable PAC-learning. $\mathcal{A}$ promises that for any $h \in H$ and $D \in \mathcal{D}$, a $1 - \delta$ fraction of labeled samples $(S_U, h(S_U)) \sim D^{n(\varepsilon,\delta)}$ satisfy

$$\mathrm{err}_{D \times h, \ell}\left(\mathcal{A}(S_U, h(S_U))\right) = \mathbb{E}_{x \sim D}\left[\ell(h'(x), h(x))\right] \le \varepsilon,$$

where $h' = \mathcal{A}(S_U, h(S_U))$. Since $C(S_U)$ contains $\mathcal{A}(S_U, h(S_U))$ for every $h$ by definition, the result follows.

This means that as long as we have blackbox access to a realizable PAC-learner and unlabeled samples from the adversary's distribution, we can simulate access to a non-uniform cover. Let's now formalize our previous intuition that this is sufficient to turn a realizable learner into an agnostic one for any finite label class. We will generalize this result to doubly-bounded loss in Section 7, but it is instructive to first consider the setting of finite $Y$.

Theorem 5.4 (Realizable → Agnostic (Finite Label Classes)).

Let $(\mathcal{D}, X, H, \ell)$ be any class on a finite label space $Y$ with loss function $\ell : Y \times Y \to \mathbb{R}_{\ge 0}$ satisfying the identity of indiscernibles. Then Algorithm 1 is an agnostic learner with sample complexity

$$m(\varepsilon, \delta) \le n(\eta_\ell\, \varepsilon, \delta/2) + O\!\left(\frac{\log\!\big(\Pi_H(n(\eta_\ell\, \varepsilon, \delta/2))/\delta\big)}{\varepsilon^2}\right),$$

where $\eta_\ell \ge \Omega\!\left(\frac{\min_{a \ne b} \ell(a,b)}{\max_{a \ne b} \ell(a,b)}\right)$ is a constant depending only on $\ell$.

Proof 5.5.

Let ๐’œ be the promised realizable learner for ( ๐‘‹ , ๐ป , ๐’Ÿ , โ„“ ) with sample complexity ๐‘› โข ( ๐œ€ , ๐›ฟ ) . Run LearningToCover with parameters ๐œ€ โ€ฒ

๐œ‚ โ„“ โข ๐œ€ and ๐›ฟ โ€ฒ

๐›ฟ / 2 . We argue that the output contains some โ„Ž โ€ฒ such that ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž โ€ฒ ) โ‰ค ๐‘‚ โข ๐‘ƒ โข ๐‘‡ + ๐œ€ / 2 . Since ๐ถ โข ( ๐‘† ๐‘ˆ ) is finite and โ„“ is upper bounded, a standard Chernoff bound gives that choosing an empirical risk minimizer from ๐ถ โข ( ๐‘† ๐‘ˆ ) based on ๐‘‚ โข ( log โก ( | ๐ถ โข ( ๐‘† ๐‘ˆ ) | ๐›ฟ ) ๐œ€ 2 ) additional samples gives the desired learner. The sample complexity then follows immediately from the fact that ๐ถ โข ( ๐‘† ๐‘ˆ ) contains one hypothesis for every labeling of the sample, and is therefore bounded by the growth function ฮ  ๐ป โข ( ๐‘› ) .

To see why ๐ถ โข ( ๐‘† ๐‘ˆ ) has this property, recall that for any โ„Ž โˆˆ ๐ป , Lemma 5.2 states that ๐ถ โข ( ๐‘† ๐‘ˆ ) contains some โ„Ž โ€ฒ such that:

๐”ผ ๐‘ฅ โˆผ ๐ท ๐‘‹ โข [ โ„“ โข ( โ„Ž โ€ฒ โข ( ๐‘ฅ ) , โ„Ž โข ( ๐‘ฅ ) ) ] โ‰ค ๐œ‚ โ„“ โข ๐œ€ .

Because we assume that โ„“ โข ( ๐‘Ž , ๐‘ )

0 iff ๐‘Ž

๐‘ , this actually implies a stronger relationโ€” โ„Ž and โ„Ž โ€ฒ must be close in classification error:

Pr ๐‘ฅ โˆผ ๐ท ๐‘‹ โก [ โ„Ž โข ( ๐‘ฅ ) โ‰  โ„Ž โ€ฒ โข ( ๐‘ฅ ) ] โ‰ค ๐œ‚ โ„“ โข ๐œ€ min ๐‘Ž โ‰  ๐‘ โก ( โ„“ โข ( ๐‘Ž , ๐‘ ) ) .

Let โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โˆˆ ๐ป be an optimal hypothesis, and let โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โ€ฒ denote its corresponding close hypothesis as promised above in the output of LearningToCover. Then by the above, we have that:

๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โ€ฒ )

๐”ผ ( ๐‘ฅ , ๐‘ฆ ) โˆผ ๐ท โข [ โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โ€ฒ โข ( ๐‘ฅ ) , ๐‘ฆ ) ]

๐”ผ ( ๐‘ฅ , ๐‘ฆ ) โˆผ ๐ท โข [ โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โ€ฒ โข ( ๐‘ฅ ) , ๐‘ฆ ) โˆ’ โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โข ( ๐‘ฅ ) , ๐‘ฆ ) + โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โข ( ๐‘ฅ ) , ๐‘ฆ ) ]

๐‘‚ โข ๐‘ƒ โข ๐‘‡ + ๐”ผ ( ๐‘ฅ , ๐‘ฆ ) โˆผ ๐ท โข [ โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โ€ฒ โข ( ๐‘ฅ ) , ๐‘ฆ ) โˆ’ โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โข ( ๐‘ฅ ) , ๐‘ฆ ) ]

โ‰ค ๐‘‚ โข ๐‘ƒ โข ๐‘‡ + ๐‘ƒ โข ๐‘Ÿ ๐ท โข [ โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โข ( ๐‘ฅ ) โ‰  โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โ€ฒ โข ( ๐‘ฅ ) ] โข max ๐‘Ž โ‰  ๐‘ โก ( โ„“ โข ( ๐‘Ž , ๐‘ ) )

โ‰ค ๐‘‚ โข ๐‘ƒ โข ๐‘‡ + ๐œ€ / 2

where we have used the assumption that we set ๐œ‚ โ„“

๐‘ โข min ๐‘Ž โ‰  ๐‘ โก ( โ„“ โข ( ๐‘Ž , ๐‘ ) ) max ๐‘Ž โ‰  ๐‘ โก ( โ„“ โข ( ๐‘Ž , ๐‘ ) ) for some universal ๐‘

0 .
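The reduction analyzed above can be sketched concretely. Here is a minimal Python sketch (not from the paper): `realizable_learner` is a hypothetical blackbox that returns a hypothesis consistent with its labeled input whenever one exists; on unrealizable labelings its output may be arbitrary, which is harmless since the ERM step discards bad hypotheses.

```python
import itertools

def learning_to_cover(realizable_learner, unlabeled, labels):
    """Build the non-uniform cover C(S_U): run the realizable learner
    once on every possible labeling of the unlabeled sample."""
    cover = []
    for labeling in itertools.product(labels, repeat=len(unlabeled)):
        cover.append(realizable_learner(list(zip(unlabeled, labeling))))
    return cover

def agnostic_learner(realizable_learner, unlabeled, labeled, labels, loss):
    """Return the hypothesis in the cover minimizing empirical risk
    on a fresh labeled sample."""
    cover = learning_to_cover(realizable_learner, unlabeled, labels)
    emp = lambda h: sum(loss(h(x), y) for x, y in labeled) / len(labeled)
    return min(cover, key=emp)

# Toy blackbox: a consistent learner for thresholds h_t(x) = 1[x >= t].
def threshold_learner(sample):
    ones = [x for x, y in sample if y == 1]
    t = min(ones) if ones else float("inf")
    return lambda x, t=t: int(x >= t)
```

For instance, on the labeled sample `[(0,0),(1,0),(2,1),(3,1),(4,0)]` with zero-one loss no threshold is consistent, but the cover built from the unlabeled points `[0,1,2,3,4]` contains the threshold at 2, which the ERM step selects with empirical error 1/5.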

It's worth spending a moment discussing our only assumption on the loss function $\ell$: that it satisfies the identity of indiscernibles. This is not only a natural assumption in most practical settings (mislabeling should incur non-zero error), it is theoretically justified as well: realizable and agnostic learning aren't necessarily equivalent for $\ell$ without this property, even in the distribution-free setting.

Proposition 5.6 (Identity of Indiscernibles Lower Bound).

There exists a realizably learnable class $(X, H, \ell)$ over a finite label space $Y$ which is not agnostically learnable.

Proof 5.7.

Let the instance space $X = \mathbb{N}$ be the set of natural numbers and the label space $Y = \{0,1\}^2$. We consider the hypothesis class $H$ of all functions which output $0$ as the first bit, that is:

$$H = \{h : h(x) = (0, \cdot) \ \forall x \in X\}.$$

Furthermore, we define the loss function $\ell : Y \times Y \to \{0, 1, c\}$ as

$$\ell((b_1, r_1), (b_2, r_2)) = \begin{cases} 0 & b_1 = b_2 \\ 1 & b_1 \neq b_2 \text{ and } r_1 = r_2 \\ c & \text{otherwise.} \end{cases}$$

Note that $(X, H, \ell)$ is trivially learnable in the realizable setting simply by returning any $h \in H$. On the other hand, we will show it is only $O(c)$-agnostically learnable. First, notice that for any labeling $f : X \to Y$, there exists a hypothesis $h \in H$ which matches $f$ on the second bit, and therefore for any marginal $D$ over $X$:

$$OPT \leq err_{D,f}(h) \leq 1.$$

As a result, it suffices to show that for every $m \in \mathbb{N}$ and (randomized) algorithm $\mathcal{A}$ using $m$ samples there exists a labeling $f : X \to Y$ and marginal distribution $D_X$ such that

$$\mathbb{E}_{S \sim D_X^m} \mathbb{E}_{x \sim D_X}[\ell(\mathcal{A}(S, f(S))(x), f(x))] \geq c/12. \tag{1}$$

As long as this holds, Markov's inequality gives that every algorithm must have error at least $\Omega(c)$ with constant probability.

For simplicity, we will restrict our attention in the rest of the proof to the marginal distribution $D_X$ which is uniform over the set $[k]$ for some natural number $k$ we will fix later. To prove Equation 1, by Yao's minimax principle it is enough to prove there is a distribution $\mu$ over functions $f : [k] \to Y$ such that any deterministic algorithm $\mathcal{A}$ has expected loss at least $c/12$ over $\mu$:

$$\mathbb{E}_{f \sim \mu} \mathbb{E}_{S \sim D_X^m} \mathbb{E}_{x \sim D_X}[\ell(\mathcal{A}(S, f(S))(x), f(x))] \geq c/12.$$

We now show that the above holds for $\mu$ being uniform over all functions from $[k]$ to $Y$ for any $k \geq 2m$. Here, we have that

$$\mathbb{E}_{x \sim D_X} \mathbb{E}_{f \sim \mu} \mathbb{E}_{S \sim D_X^m}[\ell(\mathcal{A}(S, f(S))(x), f(x))] \geq \mathbb{E}_{x \sim D_X}\left[\Pr_{S \sim D_X^m}[x \notin S] \cdot c/4\right],$$

where the last step follows from noting that for any value $(a, b)$ that $\mathcal{A}(S, f(S))$ assigns to $x \notin S$, $f(x)$ will be $(1-a, 1-b)$ with probability $1/4$, incurring a loss of $c$. The result then follows by noting that for every $x \in [k]$:

$$\Pr_{S \sim D_X^m}[x \notin S] = \left(1 - \frac{1}{k}\right)^m \geq 1/e$$

since $1 - x \geq \exp\{-x/(1-x)\}$ and $k \geq 2m$. Therefore, we get that

$$\mathbb{E}_{x \sim D_X} \mathbb{E}_{f \sim \mu} \mathbb{E}_{S \sim D_X^m}[\ell(\mathcal{A}(S, f(S))(x), f(x))] \geq c/(4e),$$

which completes the proof.
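The loss function driving this lower bound is easy to write down and check. A small sketch (the value `c = 10.0` is an arbitrary illustrative choice, not from the paper) confirming that it violates the identity of indiscernibles: distinct labels agreeing on the first bit incur zero loss.

```python
def ell(y1, y2, c=10.0):
    """Loss from Proposition 5.6 on labels in {0,1}^2: free if the first
    bits agree, cheap if only the second bits agree, costly otherwise."""
    (b1, r1), (b2, r2) = y1, y2
    if b1 == b2:
        return 0.0
    if r1 == r2:
        return 1.0
    return c

# (0,0) and (0,1) are distinct labels with zero loss between them,
# so ell fails the identity of indiscernibles.
print(ell((0, 0), (0, 1)))  # 0.0
print(ell((0, 0), (1, 0)))  # 1.0
print(ell((0, 0), (1, 1)))  # 10.0
```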

Note that this bound holds even if $\mathcal{A}$ is allowed to be improper. It is worth noting that if we are willing to increase the size of $Y$, the learner's error in this bound can actually be increased all the way to $c$, the maximum possible (see Proposition 6.8). This is in fact tight, as we will show that any loss function like the above satisfying a $c$-approximate triangle inequality can be $c$-agnostically learned (that is, learned to within $c \cdot OPT + \varepsilon$ error).

6 Four Modification Archetypes

6.1 Discretization: Infinite Label Classes

In the previous section, we showed that our base reduction characterizes the equivalence of realizable and agnostic learning for loss functions satisfying the identity of indiscernibles over any finite label class. In this section, we discuss a technique called discretization that extends this result to infinite label classes. It's clear that when $Y$ is infinite our standard reduction will generally fail: since the total number of possible labelings of a finite sample may be infinite, LearningToCover may output an infinite set. In fact, this is more than a technical barrier: realizable and agnostic learning simply aren't equivalent for infinite label classes.

Proposition 6.1.

Let $\ell$ be any continuous loss function (in the first variable) over $\mathbb{R}$ satisfying the identity of indiscernibles. Then there exists a class $(\mathcal{D}, X, H, \ell)$ which is realizably learnable but not agnostically learnable.

Proof 6.2.

Let $X = \mathbb{N}$ and $Y = [0, 2]$. To construct our class, we first consider the set of all boolean functions over $X$ with finite support. Each function $f$ in this class may equivalently be thought of as a binary string in $\{0,1\}^*$. Denote the corresponding decimal value of this string in $[0,1]$ by $s_f$. To construct $H$, for every boolean function $f : \mathbb{N} \to \{0,1\}$ with finite support, include in $H$ the function $h_f(x) = f(x) + s_f$. Note that $(X, H)$ is clearly realizably learnable under any distribution family $\mathcal{D}$ and any loss function, since a single sample always uniquely determines $h_f$. On the other hand, adding even the smallest amount of noise erases this unique identification, making the class impossible to learn.

More formally, let $\mathcal{D}$ be the family of all distributions. By the continuity of $\ell$ and the fact that $\ell(0,0) = \ell(1,1) = 0$, notice that for all $\varepsilon > 0$ there exists $\gamma = \gamma(\varepsilon) > 0$ such that $\max_{0 \leq \gamma' \leq \gamma}\{\ell(\gamma', 0), \ell(1 + \gamma', 1)\} < \varepsilon$. Let $n_\gamma \in \mathbb{N}$ be the index of the first non-zero digit in the binary representation of $\gamma$. The idea is to note that beyond these first $n_\gamma$ coordinates, our class is within $\varepsilon$ of an arbitrary boolean function. More formally, notice that for any distribution $D$ and boolean function $f$ which is $0$ on $[n_\gamma]$, we have that $OPT_H(f) := \min_{h \in H}\{err_{D \times f, \ell}(h)\} \leq \varepsilon$. The bound then follows from the fact that such arbitrary functions are not learnable.

In more detail, Yao's minimax principle states that it is sufficient to show that for any potential sample complexity $m(\varepsilon, \delta)$, there exists a randomized strategy for the adversary such that no deterministic learner can achieve $OPT + c$ accuracy with constant probability for some constant $c > 0$. To this end, consider the following strategy: the adversary chooses the uniform distribution over $[n_\gamma, n_\gamma + 2m(\varepsilon, \delta)]$, and a binary function on that interval uniformly at random (recall $OPT$ is at most $\varepsilon$ for every such function). Since the learner can only see $1/2$ of the mass, any strategy must be incorrect on half of the remaining points in expectation. In particular, conditioned on any sample, the expected loss of any predicted labeling of an unseen point is at least $\ell_{\text{min-err}} = \min_{y \in [0,2]}\left\{\frac{\ell(y, 0) + \ell(y, 1)}{2}\right\}$ (since each unseen label appears with probability $1/2$ conditioned on the learner's sample). The total expected loss of any strategy is then at least $\ell_{\text{min-err}}/2$, which is bounded away from $0$. Setting $\varepsilon$ and $c$ sufficiently small then gives the desired result.
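The encoding $h_f(x) = f(x) + s_f$ used in this construction can be made concrete. A hypothetical sketch, representing $f$ by the bit string of its finite support: a single evaluation of $h_f$ reveals both $f(x)$ (the integer part) and $s_f$ (the fractional part), hence $f$ itself, and it is exactly this unique identification that a small label perturbation destroys.

```python
def make_h(f_bits):
    """h_f(x) = f(x) + s_f, where s_f in [0,1) is the binary-decimal
    value of the finite support string f_bits."""
    s_f = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(f_bits))
    def h(x):
        fx = f_bits[x] if x < len(f_bits) else 0
        return fx + s_f
    return h

h = make_h((1, 0, 1))   # s_f = 0.5 + 0.125 = 0.625
print(h(0))             # 1.625: integer part is f(0)=1, fraction is s_f
print(h(1))             # 0.625
print(h(7))             # 0.625: outside the support, f(x)=0
```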

Proposition 6.1 relies crucially on the fact that the adversary can erase a significant amount of information with a very small label perturbation. In the rest of this section, we'll discuss a modification of our reduction showing that this is essentially the only barrier between realizable and agnostic learning (at least for a broad class of loss functions). The key is to require a slightly stronger notion of learnability based upon discretization.

Definition 6.3 (Discretization).

We say $(\mathcal{D}, X, H', \ell)$ is an $\varepsilon$-discretization of $(\mathcal{D}, X, H, \ell)$ if the following three conditions hold:

1. $H'$ is probably bounded. That is, for all $n \in \mathbb{N}$, $\delta > 0$, and $D \in \mathcal{D}$ there exists a bound $m(n, \delta) \in \mathbb{N}$ such that:

$$\Pr_{S \sim D^n}\left[|Im(H'|_S)| \leq m(n, \delta)\right] \geq 1 - \delta.$$

2. $H'$ point-wise $\varepsilon$-covers $H$ with respect to $\ell$. That is, for all $h \in H$, there exists $h' \in H'$ satisfying:

$$\forall x \in X : \ell(h'(x), h(x)) \leq \varepsilon.$$

3. $H'$ is always useful. That is, for all $h' \in H'$, there exists $h \in H$ such that:

$$\forall x \in X : \ell(h'(x), h(x)) \leq \varepsilon.$$

Note that most realistic settings have reasonable discretizations (e.g. it is enough to have some Lipschitz-like condition and a weak tail-bound on the loss). We now define a basic notion of learnability based on discretization which essentially serves to rule out adversarial constructions in the vein of Proposition 6.1.
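As a toy illustration of condition 2 (a sketch, not part of the paper's formal development): if $Y \subseteq \mathbb{R}$ and $\ell(a,b) = |a-b|$, rounding every hypothesis pointwise to a grid of spacing $\varepsilon$ yields a pointwise $(\varepsilon/2)$-cover.

```python
def grid_round(h, eps):
    """Round a real-valued hypothesis pointwise to an eps-spaced grid.
    For absolute-difference loss this changes each value by at most
    eps/2, so the rounded class pointwise (eps/2)-covers the original."""
    return lambda x: round(h(x) / eps) * eps

h = lambda x: 0.337 + 0.1 * x
h_prime = grid_round(h, 0.05)
assert all(abs(h(x) - h_prime(x)) <= 0.025 + 1e-12 for x in range(10))
```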

Definition 6.4 (Discretely-learnable).

We say $(\mathcal{D}, X, H, \ell)$ is discretely-learnable with sample complexity $n(\varepsilon, \delta)$ if there is some constant $c_1 > 0$ such that for all $\varepsilon, \delta > 0$ there exists an $\varepsilon$-discretization $H_\varepsilon$ which is $(c_1 \varepsilon, \delta)$-PAC-learnable in at most $n(\varepsilon, \delta)$ samples. We call the learner proper if it outputs hypotheses in $H$.

We'll prove that discrete and agnostic learnability are equivalent as long as the loss satisfies an approximate triangle inequality.

Definition 6.5 (Approximate pseudometric).

We call a loss function $\ell : Y \times Y \to \mathbb{R}_{\geq 0}$ a $c$-approximate pseudometric if for all triples $y_1, y_2, y_3 \in Y$:

$$\ell(y_1, y_3) \leq c\left(\ell(y_1, y_2) + \ell(y_2, y_3)\right).$$

Approximate pseudometrics are natural choices for loss functions in practice, capturing a broad set of scenarios including finite-range losses and standard setups such as $\ell_p$-regression, and they have seen some previous study in the literature [23]. By modifying the first step of our reduction to take discretization into account and leveraging the approximate triangle inequality in the second, we prove that discrete learnability and $c$-agnostic learnability are equivalent under $c$-approximate pseudometrics.

Theorem 6.6.

Let $\ell : Y \times Y \to \mathbb{R}_{\geq 0}$ be a bounded $c$-approximate pseudometric. Then the following are equivalent for all $(\mathcal{D}, X, H, \ell)$:

1. $(\mathcal{D}, X, H, \ell)$ is discretely-learnable.

2. $(\mathcal{D}, X, H, \ell)$ is $c$-agnostically learnable.

Proof 6.7.

The proof is similar to Theorem 5.4. We first show the forward direction. Assume $(\mathcal{D}, X, H, \ell)$ is discretely-learnable. Fix $\varepsilon' = \frac{\varepsilon}{4c^2 c_1}$ (where $c_1$ is the constant from Definition 6.4), and let $H_{\varepsilon'}$ be a learnable $\varepsilon'$-discretization of $H$. We argue that running LearningToCover on $H_{\varepsilon'}$ gives the desired agnostic learner. Since $\ell$ is bounded, it is sufficient to prove that $C(S_U)$ contains a hypothesis $h'$ such that $err_{D,\ell}(h') \leq c \cdot OPT + \varepsilon/2$. Empirical risk minimization then works as in the finite case.

Let $h_{OPT} \in H$ be an optimal hypothesis. Since $H_{\varepsilon'}$ is a discretization of $H$, there exists $h_{OPT}^{\varepsilon'} \in H_{\varepsilon'}$ such that:

$$\forall x \in X : \ell(h_{OPT}(x), h_{OPT}^{\varepsilon'}(x)) < \varepsilon'.$$

Further, by the guarantees of discrete learnability, with probability at least $1 - \delta/2$ there exists $h' \in C(S_U)$ which is close to $h_{OPT}^{\varepsilon'}$ in the following sense:

$$\mathbb{E}_{x \sim D_X}[\ell(h'(x), h_{OPT}^{\varepsilon'}(x))] \leq c_1 \varepsilon' = \frac{\varepsilon}{4c^2}.$$

Plugging in the previous observation and applying our approximate triangle inequality, we get that $h'$ is close to $h_{OPT}$ in the following sense:

$$\mathbb{E}_{x \sim D_X}[\ell(h'(x), h_{OPT}(x))] \leq c\left(\mathbb{E}_{x \sim D_X}[\ell(h'(x), h_{OPT}^{\varepsilon'}(x))] + \mathbb{E}_{x \sim D_X}[\ell(h_{OPT}^{\varepsilon'}(x), h_{OPT}(x))]\right) \leq \frac{\varepsilon}{2c}.$$

The final step is to transfer from the marginal $D_X$ to the full joint distribution of the adversary, which follows immediately from a similar application of the approximate triangle inequality. This is the only step that loses a factor in the $OPT$ term:

$$err_{D,\ell}(h') = \mathbb{E}_{(x,y) \sim D}[\ell(h'(x), y)] \leq c\left(\mathbb{E}_{(x,y) \sim D}[\ell(h'(x), h_{OPT}(x))] + \mathbb{E}_{(x,y) \sim D}[\ell(h_{OPT}(x), y)]\right) \leq c \cdot OPT + \varepsilon/2$$

as desired.

We now prove the reverse direction, which is essentially immediate. Assume the existence of a $c$-agnostic learner for $(\mathcal{D}, X, H, \ell)$. Given a discretization $H_\varepsilon$, we want to show $(\mathcal{D}, X, H_\varepsilon, \ell)$ is learnable to within $c_1 \varepsilon$ error for some $c_1 > 0$. This is achieved simply by running the agnostic learner on $(\mathcal{D}, X, H_\varepsilon, \ell)$. Since $H_\varepsilon$ is "always useful", every $h \in H_\varepsilon$ is $\varepsilon$-close to some $h' \in H$ in the sense that:

$$\forall x \in X : \ell(h'(x), h(x)) \leq \varepsilon.$$

In particular, this means that for any choice of $h$ by the adversary there exists $h' \in H$ with low error:

$$err_{D,\ell}(h') = \mathbb{E}[\ell(h'(x), h(x))] \leq \varepsilon.$$

As a result, running the $c$-agnostic learner for $(\mathcal{D}, X, H, \ell)$ returns a hypothesis of at most $(c+1)\varepsilon$ error with high probability.

It is worth noting that bounded loss is not really necessary for Theorem 6.6. More generally, we can require that $(\mathcal{D}, X, H, \ell)$ is "finitely learnable" in the sense that for all finite subsets $H' \subset H$, $(\mathcal{D}, X, H', \ell)$ is agnostically learnable. When $\ell$ is bounded, this is true for any finite class by empirical risk minimization. When $\ell$ is unbounded, we can still get finite learnability if $\ell(h(\cdot), \cdot)$ satisfies some concentration bounds [48] (such classes are discretizable since one can simply truncate the loss). It is also possible to achieve improved convergence rates by using algorithms other than empirical risk minimization for the finite learner (for example a generalization of the median-of-means estimator [37]), but our reduction will suffer an additional factor of $\varepsilon$ over applying such techniques directly due to the size of the non-uniform cover.

It is also worth noting that various modifications to the definition of loss (e.g. defining loss between hypotheses rather than on $Y$ directly) continue to work with the above. Similarly, there are various cases where one can get better than $c \cdot OPT$ accuracy for a $c$-approximate pseudometric, generally by instead optimizing over some surrogate loss function. For instance, if a simple transformation of the loss gives a $c'$-approximate pseudometric for $c' < c$, then one can generally learn up to $c' \cdot OPT$. As an example, note that while square loss $\ell_2(x, y) = (x - y)^2$ is a 2-approximate pseudometric, taking $\sqrt{err_{D,\ell_2}}$ gives a true metric between hypotheses. As a result, as long as $OPT$ is bounded, we can get truly agnostic learning by optimizing this surrogate instead. This strategy works for any polynomial loss, such as $\ell_p(x, y) = |x - y|^p$.
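A quick numeric illustration of the square-loss remark (a sketch, not from the paper): $\ell_2$ violates the exact triangle inequality but satisfies the 2-approximate one, while the root of the (empirical) expected loss between hypotheses is a genuine metric.

```python
import math

l2 = lambda x, y: (x - y) ** 2
assert l2(0, 2) > l2(0, 1) + l2(1, 2)           # 4 > 1 + 1: not a metric
assert l2(0, 2) <= 2 * (l2(0, 1) + l2(1, 2))    # but 2-approximate

def rms_dist(h1, h2, xs):
    """Square root of the empirical expected square loss between two
    hypotheses: the L2 distance, which obeys the true triangle inequality."""
    return math.sqrt(sum((h1(x) - h2(x)) ** 2 for x in xs) / len(xs))

h1, h2, h3 = (lambda x: 0.0), (lambda x: x), (lambda x: x * x)
xs = [0.1 * i for i in range(1, 11)]
assert rms_dist(h1, h3, xs) <= rms_dist(h1, h2, xs) + rms_dist(h2, h3, xs) + 1e-12
```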

On the other hand, outside of these special cases, Theorem 6.6 is tight: there exist $c$-approximate pseudometric loss functions which cannot be $c'$-agnostically learned for any $c' < c$. The argument is similar to Proposition 6.1, but requires a bit more care.

Proposition 6.8.

There exists a discretely-learnable class over a $c$-approximate pseudometric that is not $c'$-agnostically learnable for any $c' < c$.

Proof 6.9.

The proof is similar to Proposition 5.6. We consider the same instance space $X = \mathbb{N}$ and hypothesis class $H$:

$$H = \{h : h(x) = (0, \cdot) \ \forall x \in X\}.$$

The loss function $\ell : Y \times Y \to \{0, 1, c\}$ is also the same, but extended to the larger domain $Y = \mathbb{N}^2$:

$$\ell((b_1, r_1), (b_2, r_2)) = \begin{cases} 0 & b_1 = b_2 \\ 1 & b_1 \neq b_2 \text{ and } r_1 = r_2 \\ c & \text{otherwise.} \end{cases}$$

As before, note that $(X, H, \ell)$ is trivially realizably learnable by always returning any $h \in H$, $\ell$ is a $c$-approximate pseudometric by definition, and for any labeling $f : X \to Y$ there exists $h \in H$ such that for all distributions $D$:

$$OPT \leq err_{D,f}(h) \leq 1.$$

We now show that the class $(X, H, \ell)$ is only $c$-agnostically learnable. Since $OPT \leq 1$, it suffices to show that for every $m \in \mathbb{N}$, large enough $n \in \mathbb{N}$, and randomized algorithm $\mathcal{A}$ on $m$ samples, there exists a labeling $f : X \to Y$ and a marginal distribution $D_X$ such that:

$$\mathbb{E}_{S \sim D_X^m} \mathbb{E}_{x \sim D_X}[\ell(\mathcal{A}(S, f(S))(x), f(x))] \geq \left(1 - \frac{1}{n}\right)^3 c. \tag{2}$$

For $n \geq \frac{1}{1 - (1 - (c - c')/c)^{1/3}}$, applying Markov's inequality to Equation 2 implies that $\mathcal{A}$ has error at least $c'$ with constant probability.

For simplicity, we now restrict our attention to the marginal $D_X$ which is uniform over the set $[k]$ for some $k \in \mathbb{N}$ to be fixed. By Yao's minimax principle, it is enough to prove that there exists a distribution $\mu$ over functions $f : [k] \to [n]^2$ such that for any deterministic algorithm $\mathcal{A}$ the following holds:

$$\mathbb{E}_{f \sim \mu} \mathbb{E}_{S \sim D_X^m} \mathbb{E}_{x \sim D_X}[\ell(\mathcal{A}(S, f(S))(x), f(x))] \geq \left(1 - \frac{1}{n}\right)^3 c.$$

We now show that the above holds for $\mu$ being uniform over all functions from $[k]$ to $[n]^2$ when $k = \frac{2m}{\ln(n/(n-1))}$. Similar to Proposition 5.6, we have that

$$\mathbb{E}_{x \sim D_X} \mathbb{E}_{f \sim \mu} \mathbb{E}_{S \sim D_X^m}[\ell(\mathcal{A}(S, f(S))(x), f(x))] \geq \mathbb{E}_{x \sim D_X}\left[\Pr_{S \sim D_X^m}[x \notin S] \cdot \left(1 - \frac{1}{n}\right)^2 \cdot c\right],$$

since no matter the assignment $\mathcal{A}$ gives to $x \notin S$, it will be wrong on both coordinates with probability $(1 - 1/n)^2$ over the randomness of $\mu$. The result follows by noting that for every $x \in [k]$:

$$\Pr_{S \sim (D_X \times f(X))^m}[(x, \cdot) \notin S] = \left(1 - \frac{1}{k}\right)^m \geq 1 - \frac{1}{n}$$

since $1 - x \geq \exp\{-x/(1-x)\}$, and we have assumed $k = \frac{2m}{\ln(n/(n-1))}$ and $n > 1$. Therefore, we get that

$$\mathbb{E}_{x \sim D_X} \mathbb{E}_{f \sim \mu} \mathbb{E}_{S \sim D_X^m}[\ell(\mathcal{A}(S, f(S))(x), f(x))] \geq \left(1 - \frac{1}{n}\right)^3 c,$$

which completes the proof.

6.2 Sub-sampling: Malicious Noise

Now that we've seen how to handle practical problems like regression over infinite label spaces, we'll discuss a technique that helps handle data corruption and data-dependent assumptions: sub-sampling. The main idea is as follows. Say that the original unlabeled sample we draw is, in some sense, partially corrupted: perhaps an adversary has changed some fraction of examples (malicious noise), or some portion of the sample is un-realizable for a concept in the class (robust and partial learning). In either case, there generally exists a core subset of "clean" samples that we can use to recover the guarantees of LearningToCover. Since we cannot necessarily identify these, the idea is to run LearningToCover over enough subsets of the unlabeled sample that we find a clean subsample with high probability. In this section we'll discuss the application of this technique in detail to Kearns and Li's [41] well-studied malicious noise model. In Sections 8 and 9, we discuss further applications of the method to adversarially robust and partial learning.

To start, let's recall the standard malicious noise model. In this variant of PAC learning, instead of having access to the standard sample oracle from the adversary's distribution $D$ over $X \times Y$, we have access to a malicious oracle $\mathcal{O}_M(\cdot)$ which, with probability $\eta$, outputs an adversarially chosen pair $(x, y)$, and otherwise samples from $D$ as usual.

Definition 6.10 (PAC-learning with Malicious Noise).

We call $(X, H, \mathcal{D}, \ell)$ (agnostically) $(\varepsilon, \delta)$-learnable with malicious noise at rate $\eta = \eta(\varepsilon)$ if there exists an algorithm $\mathcal{A}$ and function $m = m_{mal}(\varepsilon, \delta)$ such that for all $\varepsilon, \delta > 0$, and distributions $D$ over $X \times Y$ satisfying:

1. The marginal $D_X \in \mathcal{D}$,

$\mathcal{A}$ outputs a good hypothesis with high probability over samples of size $m$ drawn from the malicious oracle:

$$\Pr_{S \sim \mathcal{O}_M(\cdot)^m}\left[err_{D,\ell}(\mathcal{A}(S)) > OPT + \varepsilon\right] \leq \delta,$$

where $OPT = \min_{h \in H}\{err_{D,\ell}(h)\}$.

In other words, malicious noise essentially gives a worst-case formalization of the idea that an $\eta$-fraction of the learner's data is (adversarial) garbage.

Let's now formalize the argument above: modifying LearningToCover to run over subsamples gives a sample-efficient algorithm for learning with malicious noise. For readability, we'll (somewhat informally) restate the algorithm with this change.

Input: Realizable PAC-Learner $\mathcal{A}$, Accuracy Parameter $\varepsilon < 1/2$, Noise Parameter $\eta < \frac{\varepsilon}{1+\varepsilon}$, Unlabeled Sample Oracle $\mathcal{O}_U$, Labeled Sample Oracle $\mathcal{O}_L$

Algorithm:

1. Draw an unlabeled sample $S_U \sim \mathcal{O}_U$, and labeled sample $S_L \sim \mathcal{O}_L$.

2. Run LearningToCover over all $S \in (S_U)_{\eta'} := \{S \subseteq S_U : |S| = \lfloor (1 - \eta')|S_U| \rfloor\}$, where:

$$\eta' = \frac{3\eta + \varepsilon/(1 - \varepsilon)}{4}.$$

3. Return the hypothesis in $C(S_U) := \bigcup_{S \in (S_U)_{\eta'}} C(S)$ with lowest empirical error over $S_L$.

Algorithm 3: Malicious to Realizable Reduction
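A minimal (exponential-time, as in the analysis) sketch of the subsampling step, not an optimized implementation: `realizable_learner`, `labels`, and `loss` are placeholders for the blackbox components, and a toy threshold learner stands in for the blackbox realizable learner.

```python
import itertools
from math import floor

def malicious_to_realizable(realizable_learner, S_U, S_L, labels, loss, eta_prime):
    """Run the cover construction over every subsample of size
    floor((1 - eta') * |S_U|), then return the empirical risk
    minimizer over the labeled sample S_L."""
    k = floor((1 - eta_prime) * len(S_U))
    cover = []
    for S in itertools.combinations(S_U, k):
        # one hypothesis per labeling of the candidate clean subsample
        for labeling in itertools.product(labels, repeat=k):
            cover.append(realizable_learner(list(zip(S, labeling))))
    emp = lambda h: sum(loss(h(x), y) for x, y in S_L) / len(S_L)
    return min(cover, key=emp)

# Toy blackbox: a consistent learner for thresholds h_t(x) = 1[x >= t].
def threshold_learner(sample):
    ones = [x for x, y in sample if y == 1]
    t = min(ones) if ones else float("inf")
    return lambda x, t=t: int(x >= t)
```

Even if one point of `S_U` is adversarial, some size-$k$ subsample is clean, so the union of covers still contains a good hypothesis for the ERM step to find.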

We now prove that Algorithm 3 gives an (agnostic) learner that is tolerant to malicious noise.

Theorem 6.11.

Let $(\mathcal{D}, X, H)$ be realizably PAC-learnable with sample complexity $n(\varepsilon, \delta)$. Then for any $\eta < \frac{\varepsilon}{1+\varepsilon}$, Algorithm 3 is an agnostic learner for $(\mathcal{D}, X, H)$ tolerant to $\eta$ malicious noise. Furthermore, letting $\Delta = \frac{\varepsilon}{1+\varepsilon} - \eta$ and $\beta = \left(1 + \frac{\eta}{\Delta}\right)\log\left(\frac{1}{\eta}\right)$, its sample complexity is at most:

$$m_{mal}(\varepsilon, \delta) \leq O\left(\frac{\beta\, n\!\left(\frac{\Delta}{4}, \frac{\delta}{4}\right)}{\Delta} + \frac{\log\left(\Pi_H\!\left(O\!\left(n\!\left(\frac{\Delta}{2}, \frac{\delta}{2}\right) + \frac{\eta^2 \log(1/\delta)}{\Delta^2}\right)\right)\right) + \log(1/\delta)}{\Delta^2} + \frac{\beta \eta^2 \log(1/\delta)}{\Delta^3}\right)$$

where we've assumed $\varepsilon < 1/2$ for simplicity.

Proof 6.12.

To start, we'll review for completeness a fairly standard analysis of empirical risk minimization under malicious noise. Assume for the moment that the output of LearningToCover, $C(S_U)$, contains a hypothesis $h'$ satisfying $err(h') \leq OPT + \beta_1$. Say we draw $M$ labeled samples for the ERM step, and an $\eta' = \eta + \beta_2$ fraction are corrupted by the adversary. For large enough $M$, we can assume by a Chernoff and union bound that the empirical loss of every hypothesis returned by LearningToCover is at most some $\beta_3$ away from its true loss on the un-corrupted portion of $M$ (we will make all these assumptions formal in a moment). Given these facts, notice that the empirical loss of $h'$ on $M$ is at most:

$$err_M(h') \leq (1 - \eta')(OPT + \beta_1 + \beta_3) + \eta'.$$

On the other hand, the empirical error of any $h \in H$ whose true error is greater than $OPT + \varepsilon$ is at least:

$$err_M(h) \geq (1 - \eta')(OPT + \varepsilon - \beta_3).$$

To ensure that our ERM works, it is enough to show that for any such $h$, $err_M(h) > err_M(h')$. A simple calculation shows that this is satisfied as long as $\beta_1 + 2\beta_3 \leq \varepsilon$ and $\beta_1 + \beta_2 + 2\beta_3 \leq \Delta$. Setting $\beta_1 = \beta_2 = \beta_3 = \Delta/4$ gives the desired result.

It is left to argue that our assumptions above hold with high probability. First, note that by a Chernoff bound, the probability that $\eta' \geq \eta + \Delta/4$ on a set of $M$ samples is at most $e^{-c(\Delta/\eta)^2 M}$ for some $c > 0$, so this occurs with high probability as long as $M \geq \Omega(\log(1/\delta)\eta^2/\Delta^2)$. Similarly, the empirical error of every hypothesis in $C(S_U)$ on the remaining clean samples will be within $\Delta/4$ of its true error as long as $M \geq \Omega(\log(|C(S_U)|/\delta)\eta^2/\Delta^2)$.

It is then left to show that $C(S_U)$ contains a hypothesis of error at most $OPT + \Delta/4$. To show this, it is enough to ensure that we run LearningToCover over a clean subsample of size at least $n(\Delta/4, \delta/4)$ with high probability. If we draw

$$|S_U| = O\left(\frac{n(\Delta/4, \delta/4)}{1 - \eta'} + \frac{\log(1/\delta)\eta^2}{\Delta^2}\right)$$

unlabeled samples, a similar Chernoff bound to the above promises that at most an $\eta'$ fraction are corrupted with high probability, and therefore that at least $n(\Delta/4, \delta/4)$ remain un-corrupted. Running LearningToCover over all subsets of size $(1 - \eta')|S_U|$ then gives the desired result. The sample complexity bound follows from choosing $M$ large enough to satisfy the above conditions along with the fact that $|C(S_U)| \leq \binom{n}{(1 - \eta')n}\Pi_H(n)$.

A few remarks are in order. First, the error tolerance of Theorem 6.11 is tight: in their original introduction of malicious noise, Kearns and Li [41] proved that for most non-trivial concept classes, no PAC-learner can tolerate $\frac{\varepsilon}{1+\varepsilon}$ malicious noise. Second, we note that Kearns and Li also give a reduction procedure from a base learner which achieves better dependence on $\Delta$ above; however, it allows the learner to query specifically for positive or negative examples, which is a stronger setup than our model. Finally, we note that in the case of finite VC dimension, the above is off from known (randomized) bounds by a factor of roughly $\Delta^2$ [21]. On the other hand, Theorem 6.11 has the advantage of extending far beyond the VC setting, including to the general loss models discussed in Section 6.1. The proof remains mostly the same (with optimal error tolerance differing slightly), and is omitted.

Since our agnostic model restricts the adversary to choosing a distribution whose marginal lies in the original family, Theorem 6.11 provides the first insight on robustness against an adversary who can corrupt the underlying data as well as the labels. One might wonder whether this result can be pushed further: is it possible to be robust against an adversary who can corrupt the marginal over $X$ in some stronger sense? Unfortunately, the answer is no: malicious noise is essentially the strongest distributional corruption we can handle. Let's look at two basic lower bounds to see why. First, we'll consider an adversary who can remove a portion of the learner's sample.

Proposition 6.13.

For any $\delta > 0$, there exists a class $(\mathcal{D}, X, H)$ which is realizably PAC-learnable, but not learnable under an adversary who can remove a $\delta$ fraction of the learner's sample.

Proof 6.14.

This follows from a result of Dudley, Kulkarni, Richardson, and Zeitouni [31] that there exists an unlearnable class $(\mathcal{D}, X, H)$ such that for some $n(\varepsilon, \delta)$, $(D, X, H)$ is learnable in $n(\varepsilon, \delta)$ samples for every $D \in \mathcal{D}$. The lower bound then follows simply from adding an extra unique identifying point $x_D$ to $X$ for every distribution $D$, and modifying each $D \in \mathcal{D}$ to have $\Theta(\delta)$ support on $x_D$. This modified class is clearly learnable, since after drawing $O(1/\delta)$ samples, the learner will draw $x_D$ and identify the distribution $D$ with good probability. However, the class is not learnable under an adversary who removes points, since with high probability the adversary can completely remove any mention of $x_D$ from the learner's sample, reducing to the original unlearnable class $(\mathcal{D}, X, H)$.

An adversary who can add samples is similarly powerful. Consider a model in which the adversary, after choosing a marginal distribution and labeling, sees the learner's sample ahead of time and may add an additional $\gamma$ fraction of correctly labeled elements.15 While the realizable setting is largely unaffected by such an adversary (as any non-optimal hypothesis will still see some bad example), even near-trivial concept classes cannot be agnostically learned.

Proposition 6.15.

There exists a class $(X, H)$ which for any $\gamma > 0$ is realizably but not agnostically learnable under an adversary who can add a $\gamma$ fraction of correctly labeled points to the learner's sample.

Proof 6.16.

Let ๐‘‹

{ ๐‘ฅ , ๐‘ฅ 1 , ๐‘ฅ 2 } and ๐ป

{ โ„Ž 1 , โ„Ž 2 } be any class such that โ„Ž 1 โข ( ๐‘ฅ )

โ„Ž 2 โข ( ๐‘ฅ ) , but โ„Ž 1 โข ( ๐‘ฅ ๐‘– ) โ‰  โ„Ž 2 โข ( ๐‘ฅ ๐‘– ) for ๐‘–

1 , 2 . In the realizable setting, note that a single labeled example on ๐‘ฅ 1 or ๐‘ฅ 2 exactly determines the hypothesis. As long as there is less than 1 โˆ’ ๐œ€ mass on ๐‘ฅ , the learner will draw such a sample after ๐‘‚ โข ( 1 / ๐œ€ ) samples with good probability. Further, if the mass on ๐‘ฅ is at least 1 โˆ’ ๐œ€ , then โ„Ž 1 and โ„Ž 2 are both valid outputs. As a result, any ERM is a valid PAC-learner. Since adding correctly labeled examples can only help this learner, the class remains realizably learnable under an adversary who can add an arbitrary number of clean samples.

In the agnostic setting, consider an adversary who chooses a labeling $f$ such that $f(x) = h_1(x) = h_2(x)$, $f(x_1) = h_1(x_1)$, and $f(x_2) = h_2(x_2)$. The optimal hypothesis $h_{OPT}$ is then decided by the amount of mass on $x_1$ and $x_2$ in the marginal distribution. Namely, if the adversary chooses a distribution $D$ over $\{x, x_1, x_2\}$, the optimal error is $\min\{D(x_1), D(x_2)\}$. The idea is then to note that for $\gamma' \le \gamma/4$, the learner cannot distinguish between the two following distributions:

$$D_1(x_1) = c_1\gamma', \quad D_1(x_2) = (1 - c_1)\gamma'$$

and

$$D_2(x_1) = (1 - c_1)\gamma', \quad D_2(x_2) = c_1\gamma'$$

where $1/2 > c_1 > 0$ is some small constant. Informally, if the two distributions are indistinguishable, any learner will always incur error of around $\frac{(1 - c_1)\gamma'}{2}$, whereas OPT is $c_1\gamma'$ for both distributions.

Let's now give the formal argument. By Yao's Minimax Principle it is enough to prove there is a strategy over distributions such that any deterministic learner has high error. In particular, if we can prove that the expected error is at least $3 \cdot OPT$, then $\Pr[\mathit{error} \ge 2 \cdot OPT] \ge OPT$. Since OPT is just some constant $c_1\gamma'$ (dependent only on $\gamma$), this is sufficient to prove the result. Moving on, consider the strategy in which the adversary chooses the labeling described above, and chooses each marginal ($D_1$ or $D_2$) with probability $1/2$. We'll break our analysis into two cases depending on the sample complexity of the learner. If the learner uses $O(1/\gamma')$ examples, then there is a constant probability of drawing a sample consisting only of the point $x$. Let $f'$ be the hypothesis returned by the deterministic learner on this sample. By construction, $f'$ must disagree with either $h_1$ or $h_2$ on $x_1$ or $x_2$. Assume $f'$ differs on $h_1(x_1)$ (the other cases follow similarly). When the distribution is $D_2$, $f'$ has error at least $(1 - c_1)\gamma'$. Since this occurs with constant probability independent of the choice of $c_1$, choosing $c_1$ sufficiently small leads to an expected error of at least $3c_1\gamma'$ as desired.

On the other hand, when there are $n = \Omega(1/\gamma')$ samples, we claim that the adversary can force the following sample to occur with constant probability: $2\gamma'n$ instances each of $x_1$ and $x_2$, and $(1 + \gamma)n - 4\gamma'n$ instances of $x$. This follows from the fact that for the appropriate choice of constant for $n$, a Chernoff bound gives that both $x_1$ and $x_2$ occur at most $2\gamma'n$ times with constant probability. Since the adversary is allowed to add $\gamma n \ge 4\gamma'n$ arbitrary examples, they can add instances of $x_1$, $x_2$, and $x$ until the above sample is achieved. The remainder of the argument is then the same as in the previous case, as any learner response on this sample will incur similarly high expected error.

It is also reasonable to consider distributional corruption in the semi-supervised setting, where the unlabeled and labeled data-sets might have different underlying distributions. We discuss this model in Section 6.4.

6.3 Replacing ERM: Semi-Private Learning

So far we have focused on property generalization for two forms of noise-tolerance: agnostic learning and learning with malicious noise. In this section, we'll show how to use Algorithm 1 to generalize a broader spectrum of finitely-satisfiable properties by replacing the ERM step with a generic finite learner with the desired property. Our prototypical example will be privacy, which is well known to be finitely-satisfiable via McSherry and Talwar's [49] exponential mechanism. To start, we'll cover a few basic privacy definitions.

Definition 6.17 (Differential Privacy).

A learning algorithm ๐’œ is said to be ๐›ผ -differentially private if for all neighboring inputs ๐‘† , ๐‘† โ€ฒ which differ on a single example:

Pr โก [ ๐’œ โข ( ๐‘† ) โˆˆ ๐‘‡ ] โ‰ค ๐‘’ ๐›ผ โข Pr โก [ ๐’œ โข ( ๐‘† โ€ฒ ) โˆˆ ๐‘‡ ] ,

for all measurable events ๐‘‡ in the range of ๐’œ .

The exponential mechanism is one of the most widely used techniques in privacy. Informally, the algorithm allows for differentially private selection of a "good" choice from a finite set of objects (potential hypotheses in our case). More formally, let $s : (X \times Y)^* \times H \to \mathbb{R}$ be a "score" function, and define the "sensitivity" $\Delta_s$ to be

$$\Delta_s = \max_{h \in H} \max_{S, S'} |s(S, h) - s(S', h)|$$

where $S, S'$ are two neighboring datasets. The exponential mechanism selects an item with a good score with high probability, while maintaining privacy.

Definition 6.18 (Exponential Mechanism [49]).

The exponential mechanism ๐‘€ ๐ธ on inputs ๐‘† , ๐ป , ๐‘  with privacy parameter ๐›ผ selects and outputs โ„Ž โˆˆ ๐ป with probability

exp โก ( ๐›ผ โข ๐‘  โข ( ๐‘† , โ„Ž ) 2 โข ฮ” ) โˆ‘ โ„Ž โ€ฒ โˆˆ ๐ป exp โก ( ๐›ผ โข ๐‘  โข ( ๐‘† , โ„Ž โ€ฒ ) 2 โข ฮ” ) .

It is well known that the exponential mechanism leads to a private learner for finite hypothesis classes under bounded loss.

Theorem 6.19 (Theorem 3.4 [39]).

Let ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) be a finite class with a bounded loss function. Then the sample complexity of ๐›ผ -differentially private learning ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) is at most:

๐‘› ๐‘ โข ๐‘Ÿ โข ๐‘– โข ( ๐›ผ , ๐œ€ , ๐›ฟ ) โ‰ค ๐‘‚ โข ( log โก ( | ๐ป | ๐›ฟ ) โข max โก { ๐œ€ โˆ’ 2 , ๐œ€ โˆ’ 1 โข ๐›ผ โˆ’ 1 } ) .

We note that [39, Theorem 3.4] only considers classification loss, but the extension to bounded loss is immediate. Unfortunately, even with the power of the exponential mechanism, privacy is a very restrictive condition in the general PAC framework, since we're most often interested in infinite hypothesis sets. Indeed, even improper private learning requires finiteness of a highly restrictive measure known as representation dimension [12], which can be infinite for classes of VC dimension $1$. As a result, the past decade has seen the introduction of a number of weaker, more practical definitions of privacy. In this section we'll focus on a model introduced in 2013 by Beimel, Nissim, and Stemmer [14] called semi-private learning.

Definition 6.20 (Semi-Private Learning).

We call a class ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) semi-private PAC-Learnable if there exists an algorithm ๐’œ and two functions ๐‘› ๐‘๐‘ข๐‘

๐‘› ๐‘๐‘ข๐‘ โข ( ๐œ€ , ๐›ฟ , ๐›ผ ) and ๐‘› ๐‘๐‘Ÿ๐‘–

๐‘› ๐‘๐‘Ÿ๐‘– โข ( ๐œ€ , ๐›ฟ , ๐›ผ ) such that for all ๐œ€ , ๐›ฟ

0 and distributions ๐ท over ๐‘‹ ร— ๐‘Œ whose marginal ๐ท ๐‘‹ is in ๐’Ÿ , ๐’œ satisfies the following:

1.

๐’œ outputs a good hypothesis with high probability:

Pr ๐‘† ๐‘ˆ โˆผ ๐ท ๐‘‹ ๐‘› ๐‘๐‘ข๐‘ , ๐‘† ๐ฟ โˆผ ๐ท ๐‘› ๐‘๐‘Ÿ๐‘– โก [ ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( ๐’œ โข ( ๐‘† ๐‘ˆ , ๐‘† ๐ฟ ) )

๐‘‚ โข ๐‘ƒ โข ๐‘‡ + ๐œ€ ] โ‰ค ๐›ฟ .

2.

๐’œ is semi-private. That is for all ๐‘† ๐‘ˆ โˆˆ ๐‘‹ ๐‘› ๐‘๐‘ข๐‘ :

๐’œ โข ( ๐‘† ๐‘ˆ , โ‹… ) โข  is  โข ๐›ผ โข -differentially private .

In other words, semi-private learning offers a model for applications where labeled data is sensitive, but some (perhaps opt-in) users might not care about their participation itself being released. Unlike standard private learning, distribution-free semi-private classification is known to be characterized by VC dimension, just like realizable PAC-learning [14]. The best sample complexity bounds are due to Alon, Bassily, and Moran (ABM) [2], who use uniform convergence to build a uniform cover for $H$ from unlabeled data, and then apply the exponential mechanism to the resulting cover.

Due to their reliance on uniform convergence, ABM's techniques fail in the more general settings we consider. Further, their use of uniform covers results in sub-optimal public sample complexity even for distribution-free classification. We prove in Section 13 that these objects require asymptotically more samples than non-uniform covers (at least in the distribution-family model), and therefore cannot be used to achieve optimal semi-private learning. We circumvent both of these issues by appealing directly to a realizable learner to build a weaker non-uniform cover. For readability, we first restate the algorithm here.

Input: Realizable PAC-Learner $\mathcal{A}$, Unlabeled Sample Oracle $\mathcal{O}_U$, Labeled Sample Oracle $\mathcal{O}_L$ Algorithm:

1.

Draw an unlabeled sample $S_U \sim \mathcal{O}_U$, and labeled sample $S_L \sim \mathcal{O}_L$.

2.

Run LearningToCover over $S_U$ to get $C(S_U)$.

3.

Return the hypothesis in $C(S_U)$ given by applying the exponential mechanism with respect to $S_L$.

Algorithm 4 Semi-Private to Realizable Reduction
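A schematic rendering of Algorithm 4 in code may help fix ideas; everything here (the black-box learner interface, the cover enumeration, the count-based score) is an illustrative assumption rather than the paper's implementation:

```python
import math
import random

def learning_to_cover(realizable_learner, S_U, labelings):
    """Step 2 of Algorithm 4: run the realizable base learner on each
    candidate labeling of the unlabeled sample, collecting a cover."""
    return [realizable_learner(S_U, lab) for lab in labelings(S_U)]

def semi_private_learner(realizable_learner, labelings, S_U, S_L, alpha):
    """Step 3: select from the cover via the exponential mechanism,
    scored by negative empirical error on the private labeled sample
    (a count, so the score has sensitivity 1)."""
    cover = learning_to_cover(realizable_learner, S_U, labelings)
    weights = [math.exp(-alpha * sum(h(x) != y for x, y in S_L) / 2.0)
               for h in cover]
    return random.choices(cover, weights=weights, k=1)[0]
```

Crucially, the unlabeled sample $S_U$ is used only to build the cover, while every interaction with the labeled (private) sample $S_L$ passes through the exponential mechanism, which is what makes the composed algorithm semi-private.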

We prove that Algorithm 4 gives a semi-private agnostic learner in the distribution-family setting.

Theorem 6.21 (PAC-learning implies Semi-Private Learning).

Let โ„“ : ๐‘Œ ร— ๐‘Œ โ†’ โ„ โ‰ฅ 0 be a bounded ๐‘ -approximate pseudometric. Then the following are equivalent for all triples ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) :

1.

( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) is discretely-learnable

2.

( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) is ๐‘ -agnostically, semi-private learnable.

Proof 6.22.

The proof is essentially the same as Theorem 6.6. The only difference in the argument is to replace the generic ERM learner over the output of LearningToCover with the exponential mechanism [39].

Let's now take a look at what Theorem 6.21 implies about the special case of distribution-free classification.

Corollary 6.23.

Let ( ๐‘‹ , ๐ป ) be a class of VC-dimension ๐‘‘ with sample complexity ๐‘› โข ( ๐œ€ , ๐›ฟ ) . The sample complexity of (agnostic) semi-private learning ( ๐‘‹ , ๐ป ) is at most:

๐‘› ๐‘๐‘ข๐‘ โข ( ๐œ€ , ๐›ฟ , ๐›ผ ) โ‰ค ๐‘› โข ( ๐œ€ / 2 , ๐›ฟ / 2 )

and

๐‘› ๐‘๐‘Ÿ๐‘– โข ( ๐œ€ , ๐›ฟ , ๐›ผ ) โ‰ค ๐‘‚ โข ( ( ๐‘‘ โข log โก ( ๐‘› โข ( ๐œ€ / 2 , ๐›ฟ / 2 ) ๐‘‘ ) + log โก ( 1 ๐›ฟ ) ) โข max โก { ๐œ€ โˆ’ 2 , ๐œ€ โˆ’ 1 โข ๐›ผ โˆ’ 1 } ) .

Further, the sample complexity of improperly (agnostic) semi-private learning ( ๐‘‹ , ๐ป ) is

๐‘› ๐‘๐‘ข๐‘ โข ( ๐œ€ , ๐›ฟ , ๐›ผ ) โ‰ค ๐‘‚ โข ( ๐‘‘ + log โก ( 1 / ๐›ฟ ) ๐œ€ )

and

๐‘› ๐‘๐‘Ÿ๐‘– โข ( ๐œ€ , ๐›ฟ , ๐›ผ ) โ‰ค ๐‘‚ โข ( ( ๐‘‘ โข log โก ( 1 ๐œ€ ) + log โก ( 1 ๐›ฟ ) ) โข max โก { ๐œ€ โˆ’ 2 , ๐œ€ โˆ’ 1 โข ๐›ผ โˆ’ 1 } ) .

Proof 6.24.

LearningToCover uses $n(\varepsilon/2, \delta/2)$ samples by definition. In the improper case, Hanneke showed that $n(\varepsilon/2, \delta/2) \le O\left(\frac{d + \log(1/\delta)}{\varepsilon}\right)$. Since the class has VC dimension $d$, the size of the resulting cover is at most $(e \cdot n(\varepsilon/2, \delta/2)/d)^d$, and the private sample complexity bound then follows from Theorem 6.19.
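To make the bookkeeping in this proof concrete, one can plug sample values into the chain of bounds; the constants (set to $1$) and the parameter values below are illustrative assumptions, since the theorem only specifies the bounds up to $O(\cdot)$:

```python
from math import e, log, ceil

def improper_realizable_n(d, eps, delta, C=1.0):
    """Hanneke-style bound n(eps, delta) = O((d + log(1/delta)) / eps);
    the hidden constant C is set to 1 for illustration."""
    return ceil(C * (d + log(1.0 / delta)) / eps)

def private_sample_bound(d, eps, delta, alpha, C=1.0):
    """Cover size at most (e*n/d)^d by Sauer's lemma, then Theorem 6.19
    with log|H| replaced by the log of the cover size."""
    n = improper_realizable_n(d, eps / 2.0, delta / 2.0, C)
    log_cover = d * log(e * n / d)
    return ceil(C * (log_cover + log(1.0 / delta))
                * max(eps ** -2, (eps * alpha) ** -1))

d, eps, delta, alpha = 5, 0.1, 0.05, 0.5
n_pub = improper_realizable_n(d, eps / 2.0, delta / 2.0)
n_pri = private_sample_bound(d, eps, delta, alpha)
assert n_pri > n_pub  # private (labeled) data dominates at these settings
```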

This improves over the recent upper bound of ABM, who showed that

$$n_{pub}(\varepsilon, \delta, \alpha) \le O\left(\frac{d \log(1/\varepsilon) + \log(1/\delta)}{\varepsilon}\right).$$

In fact, for constant $d$ and $\delta$, Corollary 6.23 completely resolves the unlabeled sample complexity of semi-private learning, as ABM [2] prove a matching lower bound.

Theorem 6.25 (Public Lower Bound [2]).

Every class that is not privately learnable requires at least $\Omega(1/\varepsilon)$ unlabeled samples to learn in the semi-private model under classification error.

On the other hand, we note that the private sample complexity remains off by a log factor from the best known lower bounds of Chaudhuri and Hsu [22].

Theorem 6.26 (Private Lower Bound [22]).

There exist classes of VC dimension $O(1)$ which require at least:

$$n_{pri}(\varepsilon, \delta, \alpha) \ge \Omega\left(\max\{\varepsilon^{-2}, \varepsilon^{-1}\alpha^{-1}\}\right)$$

private samples to learn.

While we have now resolved the public sample complexity of improper learning, it remains an interesting open problem in the proper regime, where certain adversarial examples are known to require an extra $\log(1/\varepsilon)$ factor in the standard PAC sample complexity [26, 34]. We conjecture that Theorem 6.21 should still be tight in this setting: namely, that the unlabeled semi-private sample complexity should always be at least the realizable PAC sample complexity.

Conjecture 6.27.

Let ( ๐‘‹ , ๐ป ) be a hypothesis class which is not privately learnable. Then the realizable sample complexity of ( ๐‘‹ , ๐ป ) lower bounds the unlabeled sample complexity of proper semi-private learning:

๐‘› ๐‘๐‘ข๐‘ โข ( ๐œ€ , 1 / 2 ) โ‰ฅ ฮฉ โข ( ๐‘› โข ( ๐œ€ , 1 / 2 ) ) .

6.4 Changing the Base Model: Covariate Shift

One issue with semi-supervised models like semi-private learning is that, in practice, the distribution over unlabeled data probably won't match the labeled data exactly. In this section, we'll talk about a final modification to our reduction to tackle such scenarios and, more generally, to extend property generalization beyond the realizable PAC setting: replacing the base learner. In fact, we already saw this strategy used to a lesser extent in Section 6.1, where we replaced our standard realizable base learner with a discrete learner. Here we'll look at an application in which we assume our initial learner is robust to covariate shift [56], meaning that even if the distribution underlying the data shifts between train and test time, the algorithm will continue to perform well. This stronger assumption will allow us to build semi-private learners that can handle corruption between the public and private databases. To start, let's formalize covariate shift in the distribution-family model.

Definition 6.28 (Covariate Shift).

Let ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) be any class, and for every ๐œ€

0 let ๐ถ ๐œ€ be a โ€œcovariate-shiftโ€ function that maps every ๐ท โˆˆ ๐’Ÿ to some family of distributions over ๐‘‹ . Given any distribution ๐ท โˆˆ ๐’Ÿ and any โ„Ž โˆˆ ๐ป , let the error of a potential labeling be given by its worst-case error over ๐ถ ๐œ€ โข ( ๐ท ) , that is:

CS-err ๐ท ร— โ„Ž , โ„“ , ๐œ€ โข ( ๐‘ )

max ๐ท โ€ฒ โˆˆ ๐ถ ๐œ€ โข ( ๐ท ) โก { ๐”ผ ๐‘ฅ โˆผ ๐ท โ€ฒ โข [ โ„“ โข ( ๐‘ โข ( ๐‘ฅ ) , โ„Ž โข ( ๐‘ฅ ) ) ] } .

We say that ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) is realizably learnable under covariate shift ๐ถ

{ ๐ถ ๐œ€ } if there exists an algorithm ๐’œ and function ๐‘›

๐‘› โข ( ๐œ€ , ๐›ฟ ) such that for all ๐œ€ , ๐›ฟ

0 , ๐ท โˆˆ ๐’Ÿ , and โ„Ž โˆˆ ๐ป :

Pr ๐‘† โˆผ ๐ท ๐‘› โก [ CS-err ๐ท ร— โ„Ž , โ„“ , ๐œ€ โข ( ๐’œ โข ( ๐‘† , โ„Ž โข ( ๐‘† ) ) )

๐œ€ ] โ‰ค ๐›ฟ .

We call such a learner robust to covariate shift.
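For intuition, when the shift family $C_\varepsilon(D)$ is finite and all distributions have finite support, the worst-case error above is a direct computation; the representation of distributions as dictionaries and the toy labelings below are illustrative choices:

```python
def cs_err(c, h, loss, shifted_dists):
    """Worst-case expected loss of labeling c against target h over a
    finite stand-in for C_eps(D); each D' is a dict mapping x -> prob."""
    return max(sum(p * loss(c(x), h(x)) for x, p in Dp.items())
               for Dp in shifted_dists)

# Toy check with 0/1 loss: c and h disagree only on x = 0.55, so the
# worst-case error is the largest mass any shifted marginal puts there.
h = lambda x: int(x >= 0.5)
c = lambda x: int(x >= 0.6)
zero_one = lambda a, b: float(a != b)
family = [{0.55: 0.3, 0.7: 0.7}, {0.55: 0.1, 0.7: 0.9}]
worst = cs_err(c, h, zero_one, family)
assert abs(worst - 0.3) < 1e-12
```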

We emphasize that in the above definition, the covariate shift family scales with the error parameter $\varepsilon$. This is a bit different from Shimodaira's original definition [56], but is a natural choice in our context since we consider algorithms which only use access to the original source distribution (sometimes called "conservative domain adaptation" [30]). In this setting, we'd expect that as we demand higher accuracy, the amount of covariate shift we can tolerate will decrease. Indeed, in the agnostic model it's clear this scaling is necessary, by an argument similar to that of Proposition 6.15.

The key observation for applying learning under covariate shift in our reduction is simply to notice that the non-uniform cover output by LearningToCover must contain a hypothesis close to optimal under any shifted distribution in $C_\varepsilon(D)$. This can then be used to analyze any semi-supervised model where the marginal of the labeled distribution may be corrupted from $D$ to any distribution in $C_\varepsilon(D)$. In this section, we'll again focus on the setting of semi-private learning. First, let's formalize what it means to be semi-private learnable under covariate shift.

Definition 6.29 (Semi-Private Learning under Covariate Shift).

We call a class ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) semi-private (agnostically) PAC-Learnable under covariate shift ๐ถ

{ ๐ถ ๐œ€ } if there exists an algorithm ๐’œ and two functions ๐‘› ๐‘๐‘ข๐‘

๐‘› ๐‘๐‘ข๐‘ โข ( ๐œ€ , ๐›ฟ , ๐›ผ ) and ๐‘› ๐‘๐‘Ÿ๐‘–

๐‘› ๐‘๐‘Ÿ๐‘– โข ( ๐œ€ , ๐›ฟ , ๐›ผ ) such that for all ๐œ€ , ๐›ฟ

0 and distributions ๐ท ๐‘‹ โˆˆ ๐’Ÿ and ๐ท โ€ฒ over ๐‘‹ ร— ๐‘Œ whose marginal ๐ท ๐‘‹ โ€ฒ โˆˆ ๐ถ ๐œ€ โข ( ๐ท ๐‘‹ ) , ๐’œ satisfies the following:

1.

๐’œ outputs a good hypothesis over ๐ท โ€ฒ with high probability:

Pr ๐‘† ๐‘ˆ โˆผ ๐ท ๐‘‹ ๐‘› ๐‘๐‘ข๐‘ , ๐‘† ๐‘ โˆผ ๐ท โ€ฒ โฃ ๐‘› ๐‘๐‘Ÿ๐‘– โก [ ๐‘’๐‘Ÿ๐‘Ÿ ๐ท โ€ฒ , โ„“ โข ( ๐’œ โข ( ๐‘† ๐‘ˆ , ๐‘† ๐‘ ) )

๐‘‚ โข ๐‘ƒ โข ๐‘‡ โ€ฒ + ๐œ€ ] โ‰ค ๐›ฟ ,

where ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โ€ฒ

min โ„Ž โˆˆ ๐ป โก [ ๐‘’๐‘Ÿ๐‘Ÿ ๐ท โ€ฒ , โ„“ โข ( โ„Ž ) ] is the minimum error over the shifted distribution.

2.

๐’œ is semi-private. That is for all ๐‘† ๐‘ˆ โˆˆ ๐‘‹ ๐‘› ๐‘๐‘ข๐‘ :

๐’œ โข ( ๐‘† ๐‘ˆ , โ‹… ) โข  is  โข ๐›ผ โข -differentially private

In other words, we'd like to recover a near-optimal hypothesis even when the marginal distribution over private data is shifted from the public data. This is a realistic scenario in practice, since the distribution of "opt-in" users is likely different from the marginal over the total population. We'll show that this issue is solvable in the semi-private setting as long as the analogous issue in the non-private setting (distribution shift between train and test time) can be resolved.

Theorem 6.30.

Let ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) be any class of a finite label space with loss function satisfying the identity of indiscernibles which is realizably learnable under covariate shift ๐ถ

{ ๐ถ ๐œ€ } . Then ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) is semi-private, agnostically learnable under covariate shift ๐ถ with sample complexity:

๐‘š โข ( ๐œ€ , ๐›ฟ ) โ‰ค ๐‘› โข ( ๐œ‚ โ„“ โข ๐œ€ , ๐›ฟ / 2 ) + ๐‘‚ โข ( log โก ( ฮ  ๐ป โข ( ๐‘› โข ( ๐œ‚ โ„“ โข ๐œ€ , ๐›ฟ / 2 ) ) ๐›ฟ ) ๐œ€ 2 )

where we recall ๐œ‚ โ„“ โ‰ฅ ฮฉ โข ( min ๐‘Ž โ‰  ๐‘ โก ( โ„“ โข ( ๐‘Ž , ๐‘ ) ) max ๐‘Ž โ‰  ๐‘ โก ( โ„“ โข ( ๐‘Ž , ๐‘ ) ) ) is a constant depending only on โ„“ .

Proof 6.31.

The proof is essentially the same as that of Theorem 5.4. The only difference is to note that for all marginal distributions $D_X \in \mathcal{D}$ and choices of shift $D'_X \in C_\varepsilon(D_X)$, the output of LearningToCover contains some $h'$ satisfying

$$\mathbb{E}_{x \sim D'_X}[\ell(h'(x), h'_{OPT}(x))] \le \eta_\ell \varepsilon,$$

where $h'_{OPT}$ is some optimal hypothesis over the shifted distribution $D'$. As in the standard analysis, this is guaranteed by including the output of the realizable learner on $h'_{OPT}$ (robustness to covariate shift promises this output is close under $D'_X$ with high probability). The remainder of the argument is exactly as in Theorem 6.6, with the exception of working over the shifted distribution $D'_X$ instead of $D_X$.

We note that this result can also be extended to the more general loss functions discussed in Section 6.1 without much difficulty, given the appropriate definition of discrete learnability under covariate shift.

Since Theorem 6.30 is a bit abstract, let's take a look at one concrete application. Given a class $(X, H)$, the class-dependent total variation distance is a metric on distributions measuring the worst-case distance across elements of $H \Delta H \coloneqq \{h \Delta h' : h, h' \in H\}$:

$$TV_{H\Delta H}(D, D') \coloneqq \max_{h \in H\Delta H}\{|D(h) - D'(h)|\}.$$

It is not hard to see that any realizable learner is robust to $O(\varepsilon)$ covariate shift in $TV_{H\Delta H}$ distance. We can then apply Theorem 6.30 to build a robust semi-private learner.

Corollary 6.32.

Let ( ๐‘‹ , ๐ป ) be a class of VC-dimension ๐‘‘ , and for every ๐œ€

0 and distribution ๐ท over ๐‘‹ , define a coviarate shift function:

๐ถ ๐œ€ โข ( ๐ท ) โข \coloneqq โข { ๐ท โ€ฒ : ๐‘‡ โข ๐‘‰ ๐ป โข ฮ” โข ๐ป โข ( ๐ท , ๐ท โ€ฒ ) โ‰ค ๐œ€ / 2 } .

Then ( ๐‘‹ , ๐ป ) is semi-private learnable under covariate shift ๐ถ

{ ๐ถ ๐œ€ } in only

๐‘› ๐‘๐‘ข๐‘ โข ( ๐œ€ , ๐›ฟ , ๐›ผ ) โ‰ค ๐‘‚ โข ( ๐‘‘ + log โก ( 1 / ๐›ฟ ) ๐œ€ )

unlabeled samples and

๐‘› ๐‘๐‘Ÿ๐‘– โข ( ๐œ€ , ๐›ฟ , ๐›ผ ) โ‰ค ๐‘‚ โข ( ( ๐‘‘ โข log โก ( 1 ๐œ€ ) + log โก ( 1 ๐›ฟ ) ) โข max โก { ๐œ€ โˆ’ 2 , ๐œ€ โˆ’ 1 โข ๐›ผ โˆ’ 1 } )

labeled samples.

Finally, we note again that our original learner in these results is robust to covariate shift despite having no access to samples from the new distribution. Unfortunately, this model does come with fairly strong lower bounds regarding the type of covariate shifts to which it is possible to be robust [30]. One solution to this problem is to consider a relaxed variant called (non-conservative) domain adaptation, where the learner additionally has access to a small number of unlabeled samples from the test-time distribution. It is certainly possible to define an analog in the semi-private setting, but naively the use of unlabeled data from the private distribution breaks our reduction, since privacy won't be preserved. We leave as an open question whether any sort of PAC-learner in the non-conservative model could imply semi-private learners with stronger robustness to covariate shift. Some progress has been made in this direction recently by Bassily, Moran, and Nandi [11] for distribution-free classification of halfspaces.

7 Doubly Bounded Loss

In this section we discuss a natural generalization of loss functions over finite label classes which we call doubly bounded loss: for all distinct $y, y' \in Y$, we require $\ell(y, y') \in [a, b]$ for some $b \ge a > 0$. This is trivially satisfied by any loss function on a finite label class that satisfies the identity of indiscernibles.

Definition 7.1 (Doubly Bounded Loss).

We say $\ell : Y \times Y \to \mathbb{R}_{\ge 0}$ is $(a, b)$-bounded if there exist $b \ge a > 0$ such that for any distinct $y, y' \in Y$:

$$\ell(y, y') \in [a, b].$$

As discussed in Section 6.1, since we now allow $Y$ to be infinite, we need to work with discrete learnability instead of realizable learnability. We can use a slight modification of the discretization technique in Theorem 6.6 to prove the equivalence of discrete and agnostic learnability for doubly bounded loss functions. Note that this is stronger than our guarantee for $c$-approximate pseudometrics, which only gives $c$-agnostic learnability.

Theorem 7.2.

Let โ„“ : ๐‘Œ ร— ๐‘Œ โ†’ โ„ โ‰ฅ 0 be an ( ๐‘Ž , ๐‘ ) -bounded loss function. Then for any class ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) the following are equivalent:

1.

( ๐‘‹ , ๐ป , ๐’Ÿ , โ„“ ) is (properly) discretely-learnable

2.

( ๐‘‹ , ๐ป , ๐’Ÿ , โ„“ ) is (properly) agnostically learnable.

Proof 7.3.

The proof that agnostic learnability implies discrete learnability is the same as in Theorem 6.6, so we focus only on the forward direction. Assume $(\mathcal{D}, X, H, \ell)$ is discretely-learnable. Fix $\varepsilon' = \frac{a\varepsilon}{4b}$, and let $H_{\varepsilon'}$ be a learnable $\varepsilon'$-discretization of $H$. We argue that running LearningToCover on $H_{\varepsilon'}$ (using the promised discrete learner) gives the desired agnostic learner. As before, it is sufficient to prove that $C(S_U)$ contains a hypothesis $h'$ such that $\mathit{err}_{D,\ell}(h') \le OPT + \varepsilon/2$. Since $\ell$ is upper bounded, standard empirical risk minimization arguments then give the desired result.

To see why ๐ถ โข ( ๐‘† ๐‘ˆ ) has this property, recall from Lemma 5.2 that for any โ„Ž โˆˆ ๐ป ๐œ€ โ€ฒ , with probability at least 1 โˆ’ ๐›ฟ / 2 there exists โ„Ž โ€ฒ โˆˆ ๐ถ โข ( ๐‘† ๐‘ˆ ) that is ๐œ€ โ€ฒ -close to โ„Ž in the following sense:

๐”ผ ๐‘ฅ โˆผ ๐ท ๐‘‹ โข [ โ„“ โข ( โ„Ž โ€ฒ โข ( ๐‘ฅ ) , โ„Ž โข ( ๐‘ฅ ) ) ] โ‰ค ๐‘Ž โข ๐œ€ 4 โข ๐‘ .

Because โ„“ is ๐‘Ž -lower bounded, this implies that โ„Ž โ€ฒ must be close to โ„Ž in classification error:

Pr ๐‘ฅ โˆผ ๐ท ๐‘‹ โก [ โ„Ž โข ( ๐‘ฅ ) โ‰  โ„Ž โ€ฒ โข ( ๐‘ฅ ) ] โ‰ค ๐œ€ 4 โข ๐‘ ,

and since the loss is bounded by ๐‘ , the risk of โ„Ž โ€ฒ cannot be much more than of โ„Ž :

๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž โ€ฒ ) โ‰ค ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž ) + ๐œ€ / 4 .

Let โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โˆˆ ๐ป be an optimal hypothesis. Since ๐ป ๐œ€ โ€ฒ

๐œ€ โ€ฒ -covers ๐ป , there exists โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ ๐œ€ โ€ฒ โˆˆ ๐ป ๐œ€ โ€ฒ such that:

๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ ๐œ€ โ€ฒ ) โˆ’ ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ ) โ‰ค ๐œ€ โ€ฒ .

Let โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โ€ฒ โฃ ๐œ€ โ€ฒ โˆˆ ๐ป ๐œ€ โ€ฒ denote the output of the base learner ๐’œ on the labeling given by โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ ๐œ€ โ€ฒ . Then by the above, we have that:

๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โ€ฒ โฃ ๐œ€ โ€ฒ )

๐”ผ ( ๐‘ฅ , ๐‘ฆ ) โˆผ ๐ท โข [ โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โ€ฒ โฃ ๐œ€ โ€ฒ โข ( ๐‘ฅ ) , ๐‘ฆ ) ]

โ‰ค ๐”ผ ( ๐‘ฅ , ๐‘ฆ ) โˆผ ๐ท โข [ โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ ๐œ€ โ€ฒ โข ( ๐‘ฅ ) , ๐‘ฆ ) ] + ๐œ€ / 4

โ‰ค ๐”ผ ( ๐‘ฅ , ๐‘ฆ ) โˆผ ๐ท โข [ โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ ๐œ€ โ€ฒ โข ( ๐‘ฅ ) , ๐‘ฆ ) โˆ’ โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โข ( ๐‘ฅ ) , ๐‘ฆ ) + โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ โข ( ๐‘ฅ ) , ๐‘ฆ ) ] + ๐œ€ / 4

๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ ๐œ€ โ€ฒ ) โˆ’ ๐‘’๐‘Ÿ๐‘Ÿ ๐ท , โ„“ โข ( โ„Ž ๐‘‚ โข ๐‘ƒ โข ๐‘‡ ) + ๐‘‚ โข ๐‘ƒ โข ๐‘‡ + ๐œ€ / 4

โ‰ค ๐‘‚ โข ๐‘ƒ โข ๐‘‡ + ๐œ€ / 2

as desired.

It is worth noting that the upper bound on the loss can be removed if the adversary is restricted to choosing a marginal over $Y$ which is weakly concentrated.

8 Robust Learning

Robust learning is an extension of the PAC setting that models an adversary with the power to perturb examples at test time. In practice, this corresponds to the fact that we'd like our predictors to be stable to small amounts of adversarial noise; this could range anywhere from a sticker on a stop sign tricking a self-driving car, to completely imperceptible perturbations that totally fool standard classifiers. The latter was famously demonstrated by Athalye, Engstrom, Ilyas, and Kwok [7], who showed how to generate such perturbations and provided the classic example of tricking a standard ImageNet classifier into thinking a turtle was a rifle. Their seminal work caused an explosion of both practical and theoretical research in the area.16

Formally, adversarial robustness is modeled simply by changing the error function to be the maximum error over some pre-defined set of neighboring perturbations.

Definition 8.1 (Robust Loss).

Let ๐‘‹ be an instance space and ๐’ฐ : ๐‘‹ โ†’ ๐‘ƒ โข ( ๐‘‹ ) a โ€œperturbation functionโ€ mapping elements to a set of possible perturbations. Given a loss function โ„“ : ๐‘Œ ร— ๐‘Œ โ†’ โ„ โ‰ฅ 0 , the robust loss of a concept โ„Ž : ๐‘‹ โ†’ ๐‘Œ with respect to a distribution ๐ท over ๐‘‹ ร— ๐‘Œ is:

R-err ๐’ฐ , ๐ท โข ( โ„Ž )

๐”ผ ( ๐‘ฅ , ๐‘ฆ ) โˆผ ๐ท โข [ max ๐‘ฅ โ€ฒ โˆˆ ๐’ฐ โข ( ๐‘ฅ ) โก โ„“ โข ( โ„Ž โข ( ๐‘ฅ โ€ฒ ) , ๐‘ฆ ) ] .

In other words, a hypothesis with low robust loss performs well even against an adversary who can perturb $x$ to any "nearby point" (i.e. any $x' \in \mathcal{U}(x)$). Standard realizable and agnostic robust PAC-learning are then simply defined by replacing the standard error function with the robust error function. Robust learning in the distribution-family model does require one extra twist: we need to make sure that each hypothesis in the class actually has a corresponding distribution over which it is realizable. To this end, we introduce a basic notion of closure for distribution families.

Definition 8.2 (Robust Closure).

Let ๐’Ÿ be a set of distributions over an instance space ๐‘‹ and ๐ป a concept class. Given any concept โ„Ž , let ๐‘‹ โ„Ž denote the set of points in ๐‘‹ on which โ„Ž has 0 robust loss with respect to itself, that is:

๐‘‹ โ„Ž โข \coloneqq โข { ๐‘ฅ โˆˆ ๐‘‹ : โˆ€ ๐‘ฅ โ€ฒ โˆˆ ๐’ฐ โข ( ๐‘ฅ ) , โ„“ โข ( โ„Ž โข ( ๐‘ฅ โ€ฒ ) , โ„Ž โข ( ๐‘ฅ ) )

0 } .

For notational simplicity, let ๐ท | โ„Ž denote the restriction ๐ท | ๐‘‹ โ„Ž . The robust closure of ๐’Ÿ under ๐ป is:

๐’Ÿ ๐ป โข \coloneqq โข ๐’Ÿ โˆช โ‹ƒ ๐ท โˆˆ ๐’Ÿ , โ„Ž โˆˆ ๐ป ๐ท | โ„Ž .

In the robust distribution-family model, it only really makes sense to define realizable learnability over the robust closure of $\mathcal{D}$, since otherwise there may be hypotheses in the class that are not realizable with respect to any distribution in $\mathcal{D}$ and cannot be chosen by the adversary at all. With this in mind, let's formalize this model.

Definition 8.3 ((Realizable) Distribution-Family Robust PAC Learning).

A class ( ๐’Ÿ , ๐‘‹ , ๐ป , โ„“ ) is Robustly PAC-learnable in the realizable setting with respect to perturbation function ๐’ฐ if there exists an algorithm ๐’œ and function ๐‘› โข ( ๐œ€ , ๐›ฟ ) such that for all ๐œ€ , ๐›ฟ

0 and distributions ๐ท over ๐‘‹ ร— ๐‘Œ satisfying:

1.

The marginal ๐ท ๐‘‹ โˆˆ ๐’Ÿ ๐ป ,

2.

min โ„Ž โˆˆ ๐ป โก R-err ๐’ฐ , ๐ท โข ( โ„Ž )

0 ,

then:

Pr ๐‘† โˆผ ๐ท ๐‘› โข ( ๐œ€ , ๐›ฟ ) โก [ R-err ๐’ฐ , ๐ท โข ( ๐’œ โข ( ๐‘† ) )

๐œ€ ] โ‰ค ๐›ฟ .

Agnostic learnability is defined similarly, but since the adversary is unrestricted, there is no longer any need to take the robust closure.

Definition 8.4 ((Agnostic) Distribution-Family Robust PAC Learning).

A class $(\mathcal{D}, X, H, \ell)$ is Robustly PAC-learnable in the agnostic setting with respect to perturbation function $\mathcal{U}$ if there exists an algorithm $\mathcal{A}$ and function $n(\varepsilon, \delta)$ such that for all $\varepsilon, \delta > 0$ and distributions $D$ over $X \times Y$ satisfying $D_X \in \mathcal{D}$:

$$\Pr_{S \sim D^{n(\varepsilon,\delta)}}\left[ \text{R-err}_{\mathcal{U},D}(\mathcal{A}(S)) > OPT + \varepsilon \right] \leq \delta,$$

where $OPT = \min_{h \in H}\{\text{R-err}_{\mathcal{U},D}(h)\}$.

We note that different works consider different models of access to the perturbation set $\mathcal{U}$ as well (e.g. assuming $\mathcal{U}$ is known to the learner [52], or that the learner has some type of oracle access to it [51, 50]). Our reduction requires fairly weak access to $\mathcal{U}$: it is enough to be able to estimate the empirical robust loss of a hypothesis $h$ over any finite sample $S \subset X$. With this in mind, let's now prove that realizable and agnostic robust learning are equivalent in the distribution-family model. We'll focus on the special case of (multi-class) classification, and start by restating our modified algorithm for simplicity of presentation.
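The weak access requirement above amounts to the following routine. The snippet is our own minimal sketch: it assumes the perturbation set $\mathcal{U}(x)$ is available as a finite list of nearby points, and the threshold classifier in the example is purely illustrative:

```python
# Sketch: estimating the empirical robust loss of a hypothesis h over a
# finite labeled sample, given the perturbation set U(x) as a finite list.

def robust_loss(h, U, sample):
    """Fraction of (x, y) pairs for which some perturbation x' in U(x)
    makes h(x') disagree with the label y (0-1 robust loss)."""
    errors = sum(1 for x, y in sample
                 if any(h(xp) != y for xp in U(x)))
    return errors / len(sample)

# Toy example: 1-D threshold classifier, perturbations within radius 1.
h = lambda x: 1 if x >= 5 else 0
U = lambda x: [x - 1, x, x + 1]
sample = [(2, 0), (4.5, 0), (7, 1), (5.5, 1)]
# (4.5, 0): perturbing to 5.5 gives h = 1 != 0 -> robust error
# (5.5, 1): perturbing to 4.5 gives h = 0 != 1 -> robust error
print(robust_loss(h, U, sample))  # 0.5
```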

Input: Realizable Robust PAC-Learner $\mathcal{A}$

Algorithm:

1. Draw an unlabeled sample $S_U$ and a labeled sample $S_L$.
2. Run $\mathcal{A}$ over all possible subsets and labelings of $S_U$ to get:
$$C(S) \coloneqq \{ \mathcal{A}(S, h(S)) \mid S \subseteq S_U,\ h \in H|_S \}.$$
3. Return the hypothesis in $C(S)$ with lowest empirical robust error over $S_L$.

Algorithm 5: Agnostic to Realizable Reduction (Robust Setting)

Theorem 8.5.

If $(\mathcal{D}, X, H)$ is robustly PAC-learnable in the realizable setting with sample complexity $n(\varepsilon, \delta)$, then Algorithm 5 robustly learns $(\mathcal{D}, X, H)$ in the agnostic setting using:

$$m_U(\varepsilon, \delta) \leq O\left( \max_{\mu \in [0, 1-\varepsilon]} \left\{ \frac{n(\varepsilon/(2(1-\mu)), \delta/3)}{1-\mu} \right\} \right)$$

unlabeled samples, and

$$m_L(\varepsilon, \delta) \leq O\left( \frac{m_U(\varepsilon, \delta) + \log(1/\delta)}{\varepsilon^2} \right)$$

labeled samples.

Proof 8.6.

The proof is similar to that of Theorem 5.4 but, like the proof for malicious noise (Theorem 6.11), requires the use of subsampling. The key issue is that for a distribution $D$ over $X \times Y$, the optimal hypothesis $h_{OPT}$ may not be realizable with respect to $D_X$, that is, we may have:

$$\text{R-err}_{\mathcal{U},\, D_X \times h_{OPT}}(h_{OPT}) > 0.$$

As a result, our realizable learner (and therefore LearningToCover) has no guarantees over this distribution. On the other hand, our realizable learner does have good guarantees over the restricted marginal $D_X|_{h_{OPT}}$. We can then fix the above issue by running LearningToCover over all subsets of $S$, including its restriction to $X_{h_{OPT}}$. We will see that this essentially simulates running the realizable learner on the realizable restriction $D_X|_{h_{OPT}}$ and recovers the desired guarantees.

Let's take a look at this more formally. As in our previous arguments, it is enough to prove that $C(S)$ contains a hypothesis $h'$ with robust error at most $OPT + \varepsilon/2$, since a standard Chernoff bound tells us that $O(\log(|C(S)|/\delta)/\varepsilon^2)$ labeled examples are enough to estimate the robust loss of every hypothesis in $C(S)$ with high probability. We note that this is the only step in our reduction which requires access to the perturbation set $\mathcal{U}$.

It is left to show that $C(S)$ satisfies this property. Let $D|_{h_{OPT}}$ denote the restriction of $D$ to $X_{h_{OPT}}$, the set of points in $X$ on which $h_{OPT}$ has 0 robust loss with respect to itself, and let $\bar{D}|_{h_{OPT}}$ be the restriction to the complement $X \setminus X_{h_{OPT}}$. The idea is to decompose our analysis into two separate parts over $D|_{h_{OPT}}$ and $\bar{D}|_{h_{OPT}}$. With this in mind, let $\mu^*$ denote the mass of $D_X$ on the complement $X \setminus X_{h_{OPT}}$, and let $OPT'$ denote the robust error of $h_{OPT}$ over $D|_{h_{OPT}}$. Since we are restricting our attention to classification error, notice that we can decompose $OPT$ as:

$$\begin{aligned}
OPT = \text{R-err}_{\mathcal{U},D}(h_{OPT}) &= \Pr_{(x,y)\sim D}\big[\exists x' \in \mathcal{U}(x) : h_{OPT}(x') \neq y\big] \\
&= \mu^* \Pr_{(x,y)\sim \bar{D}|_{h_{OPT}}}\big[\exists x' \in \mathcal{U}(x) : h_{OPT}(x') \neq y\big] \\
&\quad + (1-\mu^*) \Pr_{(x,y)\sim D|_{h_{OPT}}}\big[\exists x' \in \mathcal{U}(x) : h_{OPT}(x') \neq y\big] \\
&= \mu^* + (1-\mu^*)\,OPT',
\end{aligned}$$

where the last step follows from noting that, by definition, for all $x$ in the support of $\bar{D}|_{h_{OPT}}$, $h_{OPT}$ is not constant on $\mathcal{U}(x)$, so the corresponding event occurs with probability 1. To get a function within $\varepsilon/2$ robust loss of $OPT$, we claim it is sufficient to prove that $C(S)$ contains some $h$ within robust error $\varepsilon/(2(1-\mu^*))$ of $h_{OPT}$ over $D|_{h_{OPT}}$, that is, some $h$ satisfying:

$$\Pr_{x \sim D_X|_{h_{OPT}}}\big[\exists x' \in \mathcal{U}(x) : h(x') \neq h_{OPT}(x')\big] \leq \varepsilon/(2(1-\mu^*)). \tag{3}$$

This follows from a similar analysis to the above. Namely, letting $R(h, x, y)$ denote the event

$$R(h, x, y) \coloneqq \exists x' \in \mathcal{U}(x) : h(x') \neq y$$

for notational simplicity, we can break down $\text{R-err}_{\mathcal{U},D}(h)$ as:

$$\begin{aligned}
\text{R-err}_{\mathcal{U},D}(h) &= \mu^* \Pr_{(x,y)\sim \bar{D}|_{h_{OPT}}}[R(h,x,y)] + (1-\mu^*)\Pr_{(x,y)\sim D|_{h_{OPT}}}[R(h,x,y)] \\
&\leq \mu^* + (1-\mu^*)\Pr_{(x,y)\sim D|_{h_{OPT}}}[R(h,x,y)] \\
&\leq \mu^* + (1-\mu^*)\Pr_{(x,y)\sim D|_{h_{OPT}}}[R(h_{OPT},x,y)] \\
&\quad + (1-\mu^*)\Pr_{(x,y)\sim D|_{h_{OPT}}}\big[\exists x' \in \mathcal{U}(x) : h(x') \neq h_{OPT}(x')\big] \\
&\leq \mu^* + (1-\mu^*)\left(OPT' + \frac{\varepsilon}{2(1-\mu^*)}\right) \\
&= OPT + \varepsilon/2.
\end{aligned}$$

It remains to prove that $C(S)$ contains a hypothesis satisfying Equation 3 with high probability. To see this, note that by definition of realizable robust learning, on a labeled sample $(S, h_{OPT}(S)) \sim D|_{h_{OPT}} \times h_{OPT}$ of size $n(\varepsilon/(2(1-\mu^*)), \delta/3)$, our learner will output $h$ satisfying:

$$\Pr_{x \sim D_X|_{h_{OPT}}}\big[\exists x' \in \mathcal{U}(x) : h(x') \neq h_{OPT}(x)\big] \leq \varepsilon/(2(1-\mu^*))$$

with probability at least $1 - \delta/3$. To get Equation 3, it is then enough to note that $h_{OPT}$ is constant on $\mathcal{U}(x)$ for all $x$ in the support of $D_X|_{h_{OPT}}$ by definition.

The idea is now to draw a large enough unlabeled sample such that with probability at least $1 - \delta/3$, its restriction to $X_{h_{OPT}}$ is at least this size. By a Chernoff bound, it is enough to draw $c_1 \frac{n(\varepsilon/(2(1-\mu^*)), \delta/3)}{1-\mu^*}$ points to achieve this for some large enough constant $c_1 > 0$. Since we do not know $\mu^*$, we'll need to draw $c_1 \max_{\mu \in [0, 1-\varepsilon]}\left\{\frac{n(\varepsilon/(2(1-\mu)), \delta/3)}{1-\mu}\right\}$ points to ensure this property holds (if $\mu^* \geq 1 - \varepsilon$, note that any hypothesis gives a valid solution). By a union bound, this overall process succeeds with probability at least $1 - 2\delta/3$, and outputting the hypothesis in $C(S)$ with the lowest empirical robust risk then succeeds with probability $1 - \delta$, as desired.
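The reduction itself is short enough to sketch end to end for a finite hypothesis class. The snippet below is our own toy instantiation of Algorithm 5: the class of integer thresholds, the consistent ERM playing the role of the realizable learner $\mathcal{A}$, and the radius-0.5 perturbation map are all illustrative stand-ins:

```python
# Sketch of Algorithm 5: run the realizable learner A on every subset of the
# unlabeled sample under every labeling induced by H, then return the
# candidate with the lowest empirical robust error on the labeled sample.
from itertools import chain, combinations

def robust_err(g, U, sample):
    """Empirical 0-1 robust error of g on a labeled sample."""
    return sum(any(g(xp) != y for xp in U(x)) for x, y in sample) / len(sample)

def reduction(A, H, U, S_U, S_L):
    subsets = chain.from_iterable(
        combinations(S_U, r) for r in range(len(S_U) + 1))
    # Labelings h(S) for h in H cover H restricted to each subset S.
    candidates = [A(S, tuple(h(x) for x in S)) for S in subsets for h in H]
    return min(candidates, key=lambda g: robust_err(g, U, S_L))

# Toy instantiation: integer thresholds on the line, A = a consistent ERM
# over H, perturbations of radius 0.5.
H = [lambda x, t=t: int(x >= t) for t in range(6)]
A = lambda S, labels: min(H, key=lambda h: sum(h(x) != y
                                               for x, y in zip(S, labels)))
U = lambda x: (x - 0.5, x, x + 0.5)
S_U = [0.2, 1.7, 3.4, 4.8]
S_L = [(0.5, 0), (1.2, 0), (3.9, 1), (4.6, 1)]
best = reduction(A, H, U, S_U, S_L)
```

Enumerating all $2^{|S_U|}$ subsets is of course exponential; as in the body of the paper, the reduction is about sample complexity, not computational efficiency.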

Theorem 8.5 can be extended to many of the generic property generalization results in the main body, including approximate pseudometric loss, malicious noise, and semi-private learning, though the exact parameters may be somewhat weaker (e.g. learning over non-binary loss may incur additional factors and lead to $c$-agnostic rather than truly agnostic learning). This comes at the cost of an extra factor of $\varepsilon^{-1}$ over reductions in the finite VC setting [52].

9 Partial PAC-Learning

Partial PAC-learning is an extension of the standard PAC model to functions that are only defined on a certain portion of the input. Originally introduced by Long [47] and recently developed in greater depth by Alon, Hanneke, Holzman, and Moran (AHHM) [5], this model allows for the theoretical formalization of popular data-dependent assumptions such as margin that have no known analog in the PAC model. Combined with the distribution-family framework, this captures a significant portion of learning assumptions studied in both theory and practice (e.g. learning halfspaces with margin and distributional tail bounds). Let's formalize this model, starting with partial functions.

Definition 9.1 (Partial Function).

Let ๐‘‹ be an instance space and ๐‘Œ a label space. A partial function is a labeling ๐‘“ : ๐‘‹ โ†’ ๐‘Œ โˆช { * } , where elements labeled โ€œ * โ€ are thought as of undefined. The support of ๐‘“ , denoted ๐‘ ๐‘ข๐‘๐‘ โข ( ๐‘“ ) , is the set of elements ๐‘ฅ โˆˆ ๐‘‹ s.t.  ๐‘“ โข ( ๐‘ฅ ) โ‰  * .

Standard Partial PAC-learning is defined much like the standard model, with the simple modification that "$*$" labels are always considered incorrect. As a result, in the realizable case, when the adversary selects a particular partial function $f$, their marginal distribution over the instance space $X$ must be restricted to lie on $supp(f)$. This makes formalizing data-dependent assumptions easy. If one wanted to consider halfspaces with margin $\gamma$, for instance, one simply labels every point within $\gamma$ of the decision boundary as "$*$." Interestingly, much like the distribution-family setting, Partial PAC-learning falls outside both the uniform convergence and the sample compression paradigms [1]. AHHM also show a dramatic failure of empirical risk minimization: not only does naively applying an ERM to the partial class fail, it also fails on any total extension of the class. Despite the lack of these standard tools, both Long and AHHM were able to show that distribution-free classification of partial classes is still controlled by VC dimension, and as a result that the equivalence of realizable and agnostic learnability extends to this setting. In this section, we'll discuss how a variant of our reduction extends this result to the distribution-family model, to our extended loss functions, and to properties beyond agnostic learning.
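The halfspace-with-margin example above is easy to make concrete. The following is our own toy illustration (the weight vector, margin, and `STAR` sentinel are all ours), not code from the paper:

```python
# Sketch: a halfspace-with-margin partial concept. Points within distance
# gamma of the decision boundary get the undefined label "*"; everything
# else is labeled by the side of the boundary it lies on.
import math

STAR = "*"

def margin_halfspace(w, gamma):
    norm = math.sqrt(sum(wi * wi for wi in w))
    def f(x):
        dist = sum(wi * xi for wi, xi in zip(w, x)) / norm  # signed distance
        if abs(dist) < gamma:
            return STAR          # undefined: inside the margin band
        return 1 if dist > 0 else 0
    return f

f = margin_halfspace(w=(1.0, 0.0), gamma=0.5)
print(f((2.0, 1.0)))   # 1  (well above the boundary)
print(f((0.2, 3.0)))   # *  (inside the margin band)
print(f((-1.0, 0.0)))  # 0
```

In the realizable partial model, the adversary's marginal must then avoid the band of `STAR`-labeled points, exactly encoding the margin assumption.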

In the distribution-family model, formalizing realizable learnability requires some slight changes from the standard model, since we need to make sure our hypotheses are actually realizable over some distribution in the family (this is automatic in the distribution-free setting). To this end, we introduce a basic notion of closure for distribution families.

Definition 9.2 (Partial Closure).

Let ๐’Ÿ be a set of distributions over an instance space ๐‘‹ and ๐ป a concept class. Given any concept โ„Ž , and distribution ๐ท over ๐‘‹ , let ๐ท | โ„Ž denote the restriction ๐ท | ๐‘ ๐‘ข๐‘๐‘ โข ( ๐‘“ ) . The partial closure of ๐’Ÿ under ๐ป is:

๐’Ÿ ๐ป โข \coloneqq โข ๐’Ÿ โˆช โ‹ƒ ๐ท โˆˆ ๐’Ÿ , โ„Ž โˆˆ ๐ป ๐ท | โ„Ž .

In the realizable model it makes more sense to work with the closure of $\mathcal{D}$ than with $\mathcal{D}$ itself, since otherwise the class $H$ may contain hypotheses that cannot be realized over any distribution, and therefore cannot be accessed by the adversary at all. For simplicity, we'll also restrict our attention to (multi-class) classification where the label space $Y = [m]$, and recall that the loss on any undefined point is always 1.

Definition 9.3 ((Realizable) Distribution-Family Partial PAC Learning).

A partial class $(\mathcal{D}, X, H)$ is PAC-learnable in the realizable setting if there exists an algorithm $\mathcal{A}$ and function $n(\varepsilon, \delta)$ such that for all $\varepsilon, \delta > 0$ and distributions $D$ over $X \times Y$ satisfying:

1. The marginal $D_X \in \mathcal{D}_H$,
2. $\min_{h \in H} err_D(h) = 0$,

then:

$$\Pr_{S \sim D^{n(\varepsilon,\delta)}}\big[ err_D(\mathcal{A}(S)) > \varepsilon \big] \leq \delta,$$

where the error $err_D(h)$ is standard classification error:

$$err_D(h) = \Pr_{(x,y) \sim D}[h(x) \neq y].$$

Agnostic learnability is defined analogously, but since the adversary is unrestricted, there is no need to move to the closure of $\mathcal{D}$.

Definition 9.4 ((Agnostic) Distribution-Family Partial PAC Learning).

A partial class $(\mathcal{D}, X, H)$ is PAC-learnable in the agnostic setting if there exists an algorithm $\mathcal{A}$ and function $n(\varepsilon, \delta)$ such that for all $\varepsilon, \delta > 0$ and distributions $D$ over $X \times Y$ satisfying $D_X \in \mathcal{D}$:

$$\Pr_{S \sim D^{n(\varepsilon,\delta)}}\big[ err_D(\mathcal{A}(S)) > OPT + \varepsilon \big] \leq \delta.$$

The issue with our standard reduction strategy for partial functions is that in the agnostic model, the adversary's marginal distribution over $X$ might have support outside of $supp(h_{OPT})$, which causes LearningToCover to lose its guarantee of outputting a non-uniform cover. This can be dealt with by a variant of our subsampling technique. If we run LearningToCover over all subsamples of the unlabeled sample $S_U$, one of these subsamples must match the restriction to the support of $h_{OPT}$. This is in fact the same strategy used for adversarial robustness in Section 8, but we include the algorithm again to make this section self-contained.

Input: Realizable Partial PAC-Learner $\mathcal{A}$

Algorithm:

1. Draw an unlabeled sample $S_U$ and a labeled sample $S_L$.
2. Run $\mathcal{A}$ over all possible subsets and labelings of $S_U$ to get:
$$C(S) \coloneqq \{ \mathcal{A}(S, h(S)) \mid S \subseteq S_U,\ h \in H|_S \}.$$
3. Return the hypothesis in $C(S)$ with lowest empirical error over $S_L$.

Algorithm 6: Agnostic to Realizable Reduction (Partial PAC Setting)

Theorem 9.5.

If $(\mathcal{D}, X, H)$ is a realizably PAC-learnable partial class with sample complexity $n(\varepsilon, \delta)$, then Algorithm 6 agnostically learns $(\mathcal{D}, X, H)$ using

$$m_U(\varepsilon, \delta) \leq O\left( \max_{\mu \in [0, 1-\varepsilon]} \left\{ \frac{n(\varepsilon/(2(1-\mu)), \delta/3)}{1-\mu} \right\} \right)$$

unlabeled samples, and

$$m_L(\varepsilon, \delta) \leq O\left( \frac{m_U(\varepsilon, \delta) + \log(1/\delta)}{\varepsilon^2} \right)$$

labeled samples.

Proof 9.6.

The proof is essentially the same as for Theorem 8.5, but we repeat it here for completeness. As always, it is enough to prove that $C(S)$ (from Algorithm 6) contains a hypothesis $h'$ with error at most $OPT + \varepsilon/2$. The key issue with our standard reduction is that the optimal hypothesis $h_{OPT}$ may be undefined on certain examples in the unlabeled sample $S_U$. By running over all subsamples of $S_U$, we in essence simulate drawing samples only from the support of $h_{OPT}$, which is enough to get the desired guarantee.

More formally, let $D|_{h_{OPT}}$ be the restriction of $D$ to $supp(h_{OPT})$, and $\bar{D}|_{h_{OPT}}$ the restriction to its complement $X \setminus supp(h_{OPT})$. The idea is to decompose our analysis into two separate parts over $D|_{h_{OPT}}$ and $\bar{D}|_{h_{OPT}}$. With this in mind, let $\mu^*$ denote the mass of $D_X$ on the undefined portion of $h_{OPT}$, and let $OPT'$ denote the error of $h_{OPT}$ over $D|_{h_{OPT}}$. Since we have restricted our attention to classification, notice that we can decompose $OPT$ as:

$$\begin{aligned}
err_D(h_{OPT}) &= \Pr_{(x,y)\sim D}[h_{OPT}(x) \neq y] \\
&= \mu^* \Pr_{(x,y)\sim \bar{D}|_{h_{OPT}}}[h_{OPT}(x) \neq y] + (1-\mu^*)\Pr_{(x,y)\sim D|_{h_{OPT}}}[h_{OPT}(x) \neq y] \\
&= \mu^* + (1-\mu^*)\,OPT'.
\end{aligned}$$

We'd like to prove that $C(S)$ contains a hypothesis $h$ within $\varepsilon/2$ error of optimal. We claim it is sufficient to show that $C(S)$ contains a hypothesis within $\varepsilon/(2(1-\mu^*))$ classification distance of $h_{OPT}$ over $D|_{h_{OPT}}$, since:

$$\begin{aligned}
err_D(h) &= \mu^* \Pr_{(x,y)\sim \bar{D}|_{h_{OPT}}}[h(x) \neq y] + (1-\mu^*)\Pr_{(x,y)\sim D|_{h_{OPT}}}[h(x) \neq y] \\
&\leq \mu^* + (1-\mu^*)\Pr_{(x,y)\sim D|_{h_{OPT}}}[h(x) \neq y] \\
&\leq \mu^* + (1-\mu^*)\left(\Pr_{(x,y)\sim D|_{h_{OPT}}}[h_{OPT}(x) \neq y] + \frac{\varepsilon}{2(1-\mu^*)}\right) \\
&= \mu^* + (1-\mu^*)\,OPT' + \frac{\varepsilon}{2} \\
&= OPT + \varepsilon/2,
\end{aligned}$$

where the third line follows from the fact that $h$ and $h_{OPT}$ only differ on an $\frac{\varepsilon}{2(1-\mu^*)}$ fraction of inputs over $D|_{h_{OPT}}$.

It is left to argue that $C(S)$ contains such a hypothesis $h$. Recall that on a labeled sample $(S, h_{OPT}(S)) \sim D|_{h_{OPT}} \times h_{OPT}$ of size $n(\varepsilon/(2(1-\mu^*)), \delta/3)$, LearningToCover will contain an $h$ that is $\varepsilon/(2(1-\mu^*))$-close to $h_{OPT}$ in classification error over $D|_{h_{OPT}}$ with probability at least $1 - \delta/3$. The idea is then to draw a large enough unlabeled sample such that, with probability at least $1 - \delta/3$, the restriction of the sample to $supp(h_{OPT})$ is at least this size (since we run over every subsample, we will always hit this restriction). By a Chernoff bound, it is enough to draw $c_1 \frac{n(\varepsilon/(2(1-\mu^*)), \delta/3)}{1-\mu^*}$ points to achieve this for some large enough constant $c_1 > 0$. Since we do not know $\mu^*$, we'll need to draw $c_1 \max_{\mu \in [0, 1-\varepsilon]}\left\{\frac{n(\varepsilon/(2(1-\mu)), \delta/3)}{1-\mu}\right\}$ points to ensure this property holds (if $\mu^* \geq 1 - \varepsilon$, note that any hypothesis gives a valid solution). By a union bound, this overall process succeeds with probability at least $1 - 2\delta/3$, and outputting the hypothesis in $C(S)$ with the lowest empirical risk then succeeds with probability $1 - \delta$, as desired.
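Since $\mu^*$ is unknown, the unlabeled bound takes a max over all $\mu \in [0, 1-\varepsilon]$. The snippet below is a purely numeric sanity-check sketch of that bound (the realizable sample complexity `n` is a hypothetical placeholder of our choosing, linear in $1/\varepsilon$):

```python
# Numeric sketch of the unlabeled-sample bound
#   m_U = O( max_{mu in [0, 1-eps]} n(eps/(2(1-mu)), delta/3) / (1-mu) )
# evaluated by grid search over mu.
import math

def n(eps, delta):
    # Placeholder realizable sample complexity, linear in 1/eps.
    return (math.log(1 / delta) + 10) / eps

def m_U(eps, delta, grid=1000):
    best = 0.0
    for i in range(grid):
        mu = (1 - eps) * i / (grid - 1)   # sweep mu over [0, 1 - eps]
        best = max(best, n(eps / (2 * (1 - mu)), delta / 3) / (1 - mu))
    return best
```

For a sample complexity linear in $1/\varepsilon$ the two factors of $(1-\mu)$ cancel and the max is flat in $\mu$; for faster-growing $n$ the maximizing $\mu$ shifts, which is why the bound takes the worst case.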

Like Theorem 8.5, Theorem 9.5 can be extended to many of the generic property generalization results in the main body, including approximate pseudometric loss, malicious noise, and semi-private learning, though it may experience some degradation of parameters (e.g. $c$-agnostic rather than truly agnostic learning) depending on how the loss on "$*$" values is formalized in these settings. Again, this comes at the cost of an additional factor of $\varepsilon^{-1}$ over known reductions based on sample compression in the finite VC regime [5].

10 Uniform Stability

Uniform stability, originally introduced by Bousquet and Elisseeff [20], is a useful algorithmic property closely tied to both generalization and privacy. Informally, an algorithm $\mathcal{A}$ is said to be uniformly stable if, for every element $x \in X$, the probability that $\mathcal{A}$ changes its output on $x$ over neighboring datasets is small.

Definition 10.1 (Uniform Stability).

A learning algorithm $\mathcal{A}$ is said to be $\alpha$-uniformly stable if for all neighboring inputs $S, S'$ which differ on a single example, all $x \in X$, and all $y \in Y$:

$$\Pr[\mathcal{A}(S)(x) = y] \leq \Pr[\mathcal{A}(S')(x) = y] + \alpha.$$

Uniform stability can also be thought of as a variant of private prediction [32], which protects against adversaries whose access to a model is restricted to prediction responses on individual points (this is often the case in practice, since it is common to release APIs with query access rather than full models). Like semi-privacy, this definition has the benefit of remaining practical in a reasonable range of circumstances while weakening the stringent requirements of standard private learning. Indeed, it is well known that in the distribution-free classification setting, uniformly stable learning and private prediction are both possible for any class with finite VC dimension [55, 32, 24]. Unsurprisingly, these previous works (at least those working in the agnostic model) rely on uniform convergence and uniform covers. We'll show these can be replaced with a variant of our standard reduction. The argument is otherwise similar to the proof in [24].

Theorem 10.2.

Let $(\mathcal{D}, X, H)$ be a realizably learnable class with sample complexity $n(\varepsilon, \delta)$. Then there exists an $\alpha$-uniformly stable, $\alpha$-semi-private algorithm that agnostically learns $(\mathcal{D}, X, H)$ using only

$$m_U(\varepsilon, \delta, \alpha) \leq O\left( \frac{n(\varepsilon/2, \delta/2)}{\alpha} \right)$$

unlabeled samples, and

$$m_L(\varepsilon, \delta, \alpha) \leq O\left( \frac{\log(\Pi_H(n(\varepsilon, \delta)))}{\min\{\alpha\varepsilon, \varepsilon^2\}} \right)$$

labeled samples.

Proof 10.3.

The proof boils down to a standard subsampling trick first noted by [55]. Instead of drawing our standard unlabeled sample of size $n(\varepsilon/2, \delta/2)$, we draw a sample of size $2n(\varepsilon/2, \delta/2)/\alpha$ and run LearningToCover over a random $\alpha/2$ fraction of the sample. This ensures that swapping out any individual sample can only affect the result with probability $\alpha/2$. Since this subsample is of size $n(\varepsilon/2, \delta/2)$, LearningToCover keeps its standard guarantees and the output $C(S_U)$ contains a hypothesis within $\varepsilon/2$ of optimal with probability $1 - \delta/2$. We can now apply the exponential mechanism with privacy parameter $\alpha/4$, which ensures the algorithm is $\alpha/2$-uniformly stable with respect to the labeled sample as well. The sample complexity bounds come from standard analysis of the exponential mechanism and the size of $C(S_U)$. Semi-privacy comes for free due to our use of the exponential mechanism.
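The selection step via the exponential mechanism can be sketched as follows. This is our own minimal illustration, eliding the proof's exact privacy-parameter bookkeeping; the toy hypotheses and the fixed seed are ours:

```python
# Sketch: selecting a low-error hypothesis from a finite candidate set C via
# the exponential mechanism. Utility = -(# mistakes on S_L), which has
# sensitivity 1: changing one labeled example changes it by at most 1.
import math
import random

def exponential_mechanism(C, S_L, alpha, rng=random.Random(0)):
    scores = [-sum(h(x) != y for x, y in S_L) for h in C]
    top = max(scores)
    # Shift by the top score for numerical stability before exponentiating.
    weights = [math.exp(alpha * (s - top) / 2) for s in scores]
    r = rng.random() * sum(weights)
    for h, w in zip(C, weights):
        r -= w
        if r <= 0:
            return h
    return C[-1]

# Toy check: with a strongly separated candidate set and a large alpha, the
# mechanism all but always returns the zero-error hypothesis.
good = lambda x: int(x >= 0)
bad = lambda x: 1 - int(x >= 0)
S_L = [(i, int(i >= 0)) for i in range(-5, 5)]
chosen = exponential_mechanism([good, bad], S_L, alpha=10.0)
```

The randomness in the selection, rather than any property of the candidates, is what yields stability: a one-example change in $S_L$ shifts each score by at most 1 and hence each selection probability by an $e^{\alpha/2}$ factor.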

As in previous sections, Theorem 10.2 can be extended to any of the generic property generalization results in the main body, including for instance $c$-approximate pseudometric loss, malicious noise, and robustness to covariate shift. In the VC setting, the complexity essentially matches the best known bounds in the literature, which are given by a similar reduction using uniform convergence [24].

11 Statistical Query Model

Kearns' [40] statistical query model is a popular modification of PAC learning in which the sample oracle is replaced with the ability to ask noisy statistical questions about the data.

Definition 11.1 (Realizable SQ-learning).

Given a distribution $D$ over $X$ and $h \in H$, let $STAT(D, h)$ be an oracle which, upon input of a function $\psi : X \times Y \to [-1, 1]$ and tolerance $\tau \in \mathbb{R}_{\geq 0}$, may output any estimate of the expectation of $\psi$ up to $\tau$ error, that is:

$$STAT(D, h)(\psi, \tau) \in \mathbb{E}_{x \sim D}[\psi(x, h(x))] \pm \tau.$$

We call a class $(\mathcal{D}, X, H, \ell)$ SQ-learnable if for all $\varepsilon > 0$ there exist a tolerance $\tau = \tau(\varepsilon)$, a query complexity $n(\varepsilon, \tau)$, and an algorithm $\mathcal{A}$ such that for all $D \in \mathcal{D}$ and $h \in H$, $\mathcal{A}$ achieves $\varepsilon$ error in at most $n(\varepsilon, \tau)$ oracle calls to $STAT(D, h)$ with tolerance at worst $\tau$.

Agnostic learning is then defined analogously, with $(D, h)$ replaced by a generic distribution over $X \times Y$ whose marginal lies in $\mathcal{D}$. We can use a basic form of discretization to prove property generalization in the SQ model.

Theorem 11.2.

Let $\ell$ be a $c$-approximate pseudometric and $(\mathcal{D}, X, H, \ell)$ a realizably SQ-learnable class with query complexity $n(\varepsilon, \tau)$. Then $(\mathcal{D}, X, H, \ell)$ is $c$-agnostically SQ-learnable up to $\varepsilon + \tau$ error in $(1/\tau)^{n(\varepsilon,\tau)}$ statistical queries of tolerance at worst $\tau$.

Proof 11.3.

The idea is similar to our discretization in Theorem 6.6. The realizable SQ-learner $\mathcal{A}$ makes some finite number $n(\varepsilon, \tau)$ of queries. Let $C_\mathcal{A}$ denote the set of outputs of $\mathcal{A}$ when fed every possible combination of responses from the discretized set $\{-1, -1+2\tau, \ldots, 1-2\tau, 1\}$. For every $D \in \mathcal{D}$ and $h \in H$, one of these combinations must be a valid query response in the realizable model, so $C_\mathcal{A}$ covers $(\mathcal{D}, X, H, \ell)$. By the same arguments as Theorem 6.6, $C_\mathcal{A}$ must contain a hypothesis with error at most $c \cdot OPT + \varepsilon$. Since we can directly compute the loss of every element in $C_\mathcal{A}$ up to $\tau$ error in the SQ model simply by querying the loss function, this gives the desired result in $|C_\mathcal{A}| = (1/\tau)^{n(\varepsilon,\tau)}$ queries.

We note that while our reduction in this model experiences exponential blowup in the number of queries, this should really be thought of as corresponding to a blowup in run-time rather than in "sample complexity" in the standard sense (which corresponds more closely to $\tau$).

12 Fairness

Recent years have seen rising interest in an algorithmic property called fairness. Informally, fairness tries to tackle the issue that classifiers which perform well in the standard sense may actually discriminate against certain individuals or subgroups. We will consider a form of fair learning introduced by Rothblum and Yona [63] called Probably Approximately Correct and Fair (PACF) learning. Their definition is based on a notion of fairness ensuring that similar individuals are treated similarly with respect to a fixed metric.

Definition 12.1 (Metric Fairness).

Let $d : X \times X \to \mathbb{R}_{\geq 0}$ be a similarity measure on $X$ and $D$ a distribution over $X$. A classifier $h : X \to Y_{out}$ is called $(\alpha, \gamma)$-fair with respect to $d$ and $D$ if $h$ acts similarly on most similar individuals:

$$\Pr_{x, x' \sim D}\big[ |h(x) - h(x')| > d(x, x') + \gamma \big] \leq \alpha.$$

We note that the output space $Y_{out}$ may differ from the label space $Y$ in general learning problems.
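Metric fairness is straightforward to estimate empirically over pairs from a sample. The snippet below is our own illustration (the clamp classifier, the $|x - x'|$ metric, and the sample points are all ours); it estimates the smallest $\alpha$ for which $h$ looks $(\alpha, \gamma)$-fair on the sample:

```python
# Sketch: empirically estimating the metric-fairness violation rate of a
# classifier h over all pairs from a finite sample of points.
from itertools import combinations

def fairness_violation_rate(h, d, points, gamma):
    """Fraction of pairs (x, x') with |h(x) - h(x')| > d(x, x') + gamma."""
    pairs = list(combinations(points, 2))
    bad = sum(abs(h(x) - h(xp)) > d(x, xp) + gamma for x, xp in pairs)
    return bad / len(pairs)

# Toy check: a 1-Lipschitz classifier on the line is perfectly fair for the
# metric d(x, x') = |x - x'| and any gamma >= 0.
h = lambda x: min(max(x, 0.0), 1.0)
d = lambda x, xp: abs(x - xp)
pts = [-0.5, 0.1, 0.4, 0.9, 1.7]
print(fairness_violation_rate(h, d, pts, gamma=0.0))  # 0.0
```

A hard threshold classifier, by contrast, violates the same constraint on close pairs straddling its boundary, which is why the definition is stated for real-valued outputs.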

In fact, this definition only really makes sense when the output classifier $h$ is allowed to be real-valued (as this allows for some flexibility in the $|h(x) - h(x')|$ term). As such, when considering settings such as binary classification where $Y = \{0, 1\}$ is discrete, Rothblum and Yona's [63] initial formalization considers returning probabilistic classifiers with $Y_{out} = [0, 1]$. Here $h(x) = y \in [0, 1]$ is taken to be the probability of the label being 1. The error of a probabilistic classifier $h$ with respect to any distribution $D$ over $X \times \{0, 1\}$ is then given by its expected $\ell_1$ distance:

$$err_D(h) = \mathbb{E}_{(x,y) \sim D}\big[ |h(x) - y| \big].$$

For simplicity, we'll focus in this section on this same regime extended to the distribution-family model.

In broad strokes, the goal of Fair PAC learning is to output a fair classifier satisfying standard PAC guarantees. Practically this requires a few modifications. First, since there may be no fair classifier satisfying these guarantees, we will only require our output to be as good as the best fair classifier. Second, we will actually allow some slack in the fairness parameters, which Rothblum and Yona [63] show is a practical way to ensure that fair learnability remains possible across a broad range of classes.

Definition 12.2 (PACF-learning [63]).

We say $(\mathcal{D}, X, H)$ is (agnostically) $(\alpha, \gamma)$-PACF-learnable with respect to a similarity metric $d : X \times X \to \mathbb{R}_{\geq 0}$ if there exists an algorithm $\mathcal{A}$ and function $n = n(\varepsilon, \varepsilon_\alpha, \varepsilon_\gamma, \delta)$ such that for all $\varepsilon, \varepsilon_\alpha, \varepsilon_\gamma, \delta > 0$ and distributions $D$ over $X \times Y$ such that $D_X \in \mathcal{D}$, $\mathcal{A}(S)$ satisfies the following guarantees with probability $1 - \delta$ over samples $S$ of size $n$:

1. $\mathcal{A}(S)$ is accurate:
$$err_{D,\ell}(\mathcal{A}(S)) \leq OPT_{\alpha,\gamma} + \varepsilon.$$
2. $\mathcal{A}(S)$ is $(\alpha + \varepsilon_\alpha, \gamma + \varepsilon_\gamma)$-fair.

Here $OPT_{\alpha,\gamma}$ is the optimal error of any $(\alpha, \gamma)$-fair classifier, that is:

$$OPT_{a,b} \coloneqq \min_{h \in H^d_{D_X,a,b}} \{ err_{D,\ell}(h) \},$$

and

$$H^d_{D_X,a,b} = \{ h \in H : h \text{ is } (a, b)\text{-fair with respect to } d \text{ and } D_X \}.$$

Realizable learnability is defined similarly, where the adversary is constrained to picking distributions which have 0 error with respect to some $(\alpha, \gamma)$-fair classifier in $H$. We show that property generalization holds for the PACF model.

Theorem 12.3 (Agnostic → Realizable Reduction, PACF Setting).

Let $(\mathcal{D}, X, H)$ be any class that is realizably $(\alpha, \gamma)$-PACF-learnable with sample complexity $n(\varepsilon, \varepsilon_\alpha, \varepsilon_\gamma, \delta)$. Then $(\mathcal{D}, X, H)$ is agnostically $(\alpha, \gamma)$-PACF-learnable using only

$$m_U(\varepsilon, \varepsilon_\alpha, \varepsilon_\gamma, \delta) \leq n(\varepsilon/2, \varepsilon_\alpha, \varepsilon_\gamma, \delta/2)$$

unlabeled samples, and

$$m_L(\varepsilon, \varepsilon_\alpha, \varepsilon_\gamma, \delta) \leq O\left( \frac{\log(\Pi_H(n(\varepsilon/2, \varepsilon_\alpha, \varepsilon_\gamma, \delta/2))) + \log(1/\delta)}{\varepsilon^2} \right)$$

labeled samples.

Proof 12.4.

The key observation is that the definition of fairness depends only on the classifier $h$ and the marginal distribution $D_X$. Let $h_{OPT}$ be the hypothesis achieving the minimum error over $H^d_{D_X,\alpha,\gamma}$. By the above observation, with probability $1 - \delta$ the hypothesis set $C(S_U)$ returned by LearningToCover contains an $(\alpha + \varepsilon_\alpha, \gamma + \varepsilon_\gamma)$-fair $h$ satisfying:

$$\mathbb{E}_{x \sim D_X}\big[ |h(x) - h_{OPT}(x)| \big] \leq \varepsilon/2.$$

Since $\ell_1$ error is a metric (and therefore satisfies the triangle inequality), we can use our argument for $c$-pseudometric loss functions from Theorem 6.6 to argue that choosing the lowest empirical risk $(\alpha + \varepsilon_\alpha, \gamma + \varepsilon_\gamma)$-fair classifier in $C(S_U)$ with respect to a sufficiently large labeled sample $S_L$ gives the desired learner.

With care, this result can be extended to a broader range of loss functions, as well as to other finitely-satisfiable properties covered in this work.

13 Notions of Coverability

In this section we discuss the connection between non-uniform covers and several previous notions of coverability used in various learning applications. For simplicity, we'll restrict our attention to covering with respect to standard classification distance; that is, given a distribution $D$ and hypotheses $h$ and $h'$ over some instance space $X$:

$$d_D(h, h') \coloneqq \Pr_{x \sim D}[h(x) \neq h'(x)].$$

To start, let's recall the basic notion of an $\varepsilon$-cover, specialized to this measure for simplicity.

Definition 13.1 ($\varepsilon$-cover).

Let $X$ be an instance space, $Y$ a label space, and let $L_{X,Y}$ denote the set of all labelings $c : X \to Y$. A set $C \subset L_{X,Y}$ is said to form an $\varepsilon$-cover for $(D, X, H)$ if for every hypothesis $h \in H$ there exists $c \in C$ such that

$$d_D(c, h) \leq \varepsilon.$$

$C$ is called proper if $C \subset H$.
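To make the definition concrete, the cover condition is easy to check directly on a small finite class. The following sketch (our own toy example with hypothetical threshold hypotheses, not an object from the paper) computes $d_D$ for an explicit finite distribution and tests the $\varepsilon$-cover condition:

```python
# Hypothetical finite example: X = {0,...,9}, D uniform on X,
# H = "threshold" labelings h_t(x) = 1 iff x >= t.

def d_D(h, g, D):
    """Classification distance Pr_{x~D}[h(x) != g(x)] for a finite distribution D."""
    return sum(p for x, p in D.items() if h(x) != g(x))

def is_eps_cover(C, H, D, eps):
    """C is an eps-cover of H under D iff every h in H has some c in C with d_D(c, h) <= eps."""
    return all(any(d_D(c, h, D) <= eps for c in C) for h in H)

def threshold(t):
    return lambda x: int(x >= t)

D = {x: 0.1 for x in range(10)}              # uniform distribution on {0,...,9}
H = [threshold(t) for t in range(11)]        # 11 thresholds
C = [threshold(t) for t in range(0, 11, 2)]  # every other threshold

# Adjacent thresholds differ on a single point of mass 0.1, so C is a 0.1-cover.
print(is_eps_cover(C, H, D, eps=0.1))  # → True
```

Since $C \subset H$, this particular cover is also proper in the sense above.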

Finite $\varepsilon$-covers are exceedingly useful in learning theory. As discussed in Section 3, a common strategy in the literature is to use unlabeled samples to construct an $\varepsilon$-cover with high probability [8, 35, 2, 10]. This results in a distribution over potential covers that we call a uniform $(\varepsilon, \delta)$-cover.

Definition 13.2 (Uniform $(\varepsilon, \delta)$-cover).

Let $X$ be an instance space, $Y$ a label space, and let $L_{X,Y}$ denote the set of all labelings $c : X \to Y$. A distribution $D_C$ over the power set $P(L_{X,Y})$ is said to form a uniform $(\varepsilon, \delta)$-cover for $(D, X, H)$ if:

$$\Pr_{C \sim D_C}\big[ C \text{ is an } \varepsilon\text{-cover for } (D, X, H) \big] \geq 1 - \delta.$$

$D_C$ is called proper if its support lies entirely in $H$.

In this work, we introduce a weaker, non-uniform variant of this notion in which each $h$ has an individual guarantee of being covered by the distribution, but it is not necessarily the case that a single sample covers all $h \in H$ simultaneously.

Definition 13.3 (Non-Uniform $(\varepsilon, \delta)$-cover).

Let $X$ be an instance space, $Y$ a label space, and let $L_{X,Y}$ denote the set of all labelings $c : X \to Y$. A distribution $D_C$ over the power set $P(L_{X,Y})$ is said to form a non-uniform $(\varepsilon, \delta)$-cover for $(D, X, H)$ if for every fixed hypothesis $h \in H$,

$$\Pr_{C \sim D_C}\big[ C \text{ is an } \varepsilon\text{-cover for } (D, X, \{h\}) \big] \geq 1 - \delta.$$

$D_C$ is called proper if its support lies entirely in $H$.

In the context of learning, we are usually interested not just in the existence of these covers, but in the more challenging problem of constructing them from a small number of unlabeled samples. In other words, given a class $(\mathcal{D}, X, H)$, we'd like to know how many unlabeled samples from an adversarially chosen distribution $D \in \mathcal{D}$ are necessary to build a uniform (or non-uniform) $(\varepsilon, \delta)$-cover for $(D, X, H)$. In Section 6.3, we saw that the ability to construct a non-uniform $(\varepsilon, \delta)$-cover from $O\left(\frac{\log(1/\delta)}{\varepsilon}\right)$ samples was crucial to give a semi-private learner with optimal public sample complexity. This improved over recent work of Alon, Bassily, and Moran (ABM) [2], who showed that it is possible to build a uniform $(\varepsilon, \delta)$-cover in $O\left(\frac{\log(1/\varepsilon) + \log(1/\delta)}{\varepsilon}\right)$ samples.

It is interesting to ask whether non-uniformity is really necessary here, or whether ABM's analysis is simply sub-optimal. We'll show that the former is true, at least in the proper distribution-family setting: the $\log(1/\varepsilon)$ gap between these models is necessary, and uniform covers cannot be used to build optimal semi-private learners.

Theorem 13.4 (Separation of Uniform and Non-Uniform Covers).

There exists an instance space $X$, hypothesis class $H$, and family of distributions $\mathcal{D}$ such that for any sufficiently small $\varepsilon > 0$, the following statements hold:

1. Any algorithm which returns a finite proper uniform $(\varepsilon, 1/3)$-cover for $(\mathcal{D}, X, H)$ requires at least $\Omega(1/\varepsilon \cdot \log(1/\varepsilon))$ samples.

2. There exists an algorithm which returns a finite proper non-uniform $(\varepsilon, \delta)$-cover for $(\mathcal{D}, X, H)$ in $O(\log(1/\delta)/\varepsilon)$ samples.

Proof 13.5.

Let the instance space be $X \coloneqq \mathbb{N}$, and let $H$ be the class of indicators along with the all-$0$'s function; that is, $H \coloneqq \{h_i : i \in \mathbb{N}\} \cup \{h_0\}$ where $h_i(x) \coloneqq \mathbb{1}\{x = i\}$ and $h_0$ is $0$ everywhere. We consider the family of distributions $\mathcal{D} \coloneqq \{\mathcal{D}_{n,k}\}_{n,k > 0}$ given by $k$-sets of $[n]$, where

$$\mathcal{D}_{n,k} \coloneqq \{ \mathrm{unif}(T) : T \subset [n] \text{ and } |T| = k \},$$

and $\mathrm{unif}(T)$ is the uniform distribution over $T$.

We start with the first claim, that building a bounded uniform $(\varepsilon, 1/3)$-cover needs at least $\Omega(1/\varepsilon \cdot \log(1/\varepsilon))$ samples. More formally, for any error parameter $\varepsilon > 0$ and size bound $m \coloneqq m(\varepsilon) \in \mathbb{N}$, let $k \coloneqq \lfloor 1/(2\varepsilon) \rfloor$. We will show that any algorithm $\mathcal{A}$ on $k \log(k)$ samples that outputs at most $m$ hypotheses must fail to output an $\varepsilon$-cover with probability at least $1/3$.

Let $n \gg m, k$ be some natural number to be fixed later, and consider the family of distributions $\mathcal{D}_{n,k}$. By Yao's minimax principle, it is sufficient to show that there exists a distribution over the elements in $\mathcal{D}_{n,k}$ such that any deterministic algorithm over $k \log(k)$ samples outputting a set of (at most) $m$ hypotheses fails to give a proper $\varepsilon$-cover with probability at least $1/3$. We claim that taking the uniform distribution over $\mathcal{D}_{n,k}$ suffices. To formalize this, it is useful to observe the following claim.

Claim 2.

Any subset of hypotheses $C \subset H$ of size $m$ can be a proper $\varepsilon$-cover for $H$ under at most $\binom{m}{k}$ distributions in $\mathcal{D}_{n,k}$.

Let's prove the result under this assumption. The key observation is that by standard lower bounds on the coupon collector problem, a sample $S$ of $k \log(k)$ points from any $\mathrm{unif}(T) \in \mathcal{D}_{n,k}$ will fail to include $\mathrm{unif}(T)$'s entire support with probability at least $1/2$. With this in mind, assume that the input sample $S$ contains only $|\mathrm{supp}(S)| = j < k$ elements of $\mathrm{supp}(\mathrm{unif}(T))$. As a result, there are $\binom{n-j}{k-j}$ distributions consistent with $S$, and by Claim 2, $\mathcal{A}(S)$ is a proper $\varepsilon$-cover for at most $\binom{m}{k}$ of them. Since $S$ is equally likely to have been sampled from any of these distributions, the probability that $\mathcal{A}(S)$ fails to be a proper $\varepsilon$-cover is at least:

$$\Pr\big[\mathcal{A} \text{ fails given } |\mathrm{supp}(S)| = j < k\big] \geq \frac{\binom{n-j}{k-j} - \binom{m}{k}}{\binom{n-j}{k-j}}.$$

Taking $n$ sufficiently larger than $m$ and $k$, we can make this probability as close to $1$ as desired for any $0 < j < k$. Finally, since samples of this form occur with probability at least $1/2$, the algorithm fails with probability at least $1/3$ as desired. It is left to prove Claim 2.

Proof 13.6 (Proof of Claim 2).

Notice that for any distribution $\mathrm{unif}(T) \in \mathcal{D}_{n,k}$, any $i \in T$, and any $j \neq i$, we have $d_{\mathrm{unif}(T)}(h_i, h_j) > 2\varepsilon$. Let $C$ be any proper $\varepsilon$-cover of $H$ under the distribution $\mathrm{unif}(T)$. Then, by the above argument, it must contain $\{h_i : i \in T\}$. Since $|T| = k$, $C$ can be a proper $\varepsilon$-cover of $H$ under at most $\binom{|C|}{k}$ distributions in $\mathcal{D}_{n,k}$.

We now move to proving that a proper non-uniform $(\varepsilon, \delta)$-cover can be built in only $O(\log(1/\delta)/\varepsilon)$ samples. This follows from the fact that for any $n \geq k$ and distribution $\mathrm{unif}(T) \in \mathcal{D}_{n,k}$, each $i \in T$ is in the random sample $S$ with probability at least $1 - \delta$. Since each $h_j$ for $j \notin T$ is covered by $h_0$, outputting $\{h_i : i \in S\} \cup \{h_0\}$ generates a proper non-uniform $(\varepsilon, \delta)$-cover.
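The coupon-collector step in the lower bound above can also be checked numerically. In this sketch (our own, with hypothetical parameters), we draw $k \log k$ points from $\mathrm{unif}(T)$ with $|T| = k$ and estimate how often the sample misses part of the support, which is exactly the event under which the covering algorithm fails:

```python
import math
import random

random.seed(1)

k = 50
m = int(k * math.log(k))   # ~ k log k draws, as in the lower-bound argument
trials = 2000

def misses_support():
    """Draw m points from unif(T) with |T| = k; check if some element of T is never seen."""
    seen = {random.randrange(k) for _ in range(m)}
    return len(seen) < k

frac = sum(misses_support() for _ in range(trials)) / trials
print(f"fraction of samples missing part of the support: {frac:.2f}")  # roughly 1 - 1/e
```

At exactly $k \ln k$ draws the miss probability sits near $1 - 1/e \approx 0.63$, comfortably above the $1/2$ used in the proof.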

The construction in Theorem 13.4 can easily be modified to give a class with the same gap which is not privately learnable (say, by embedding a single copy of a threshold over $[0,1]$). Since any such class requires at least $\Omega(1/\varepsilon)$ public samples to semi-privately learn by Theorem 6.25, Theorem 13.4 then provides a separation between using uniform and non-uniform covers in semi-private learning: the former provably requires an extra log factor, while the latter matches the lower bound exactly. Unfortunately, our proof of this result only holds in the proper setting, as Claim 2 fails when improper hypotheses are allowed. We conjecture that this is not an inherent barrier: the separation should continue to hold in the improper case, albeit with a different analysis.

We have now seen a weak separation between uniform and non-uniform covers, but one might reasonably wonder whether a much stronger separation is possible. In particular, all previous constructions of uniform covers use uniform convergence, but there exist simple examples of learnable classes in the distribution-family model that fail this property: do such classes provide an example of objects which are non-uniformly coverable but not uniformly coverable? Surprisingly, the answer is no! It turns out that an algorithm for non-uniform covering can always be used to construct a uniform covering without too much overhead. Moreover, we'll see that the $\log(1/\varepsilon)$ gap is tight when $(X, H)$ has finite VC dimension.

To prove this, it will actually be useful to make a brief aside and introduce another closely related notion of covering, called fractional covers. These objects are essentially a form of non-uniform covering which outputs a single hypothesis instead of a set of them.

Definition 13.7 (Fractional cover).

Let $X$ be an instance space, $Y$ a label space, and let $L_{X,Y}$ denote the set of all labelings $c : X \to Y$. A distribution $D_C$ over $L_{X,Y}$ is said to form a fractional $(\varepsilon, p)$-cover for $(D, X, H)$ if for any fixed $h \in H$, a sample from $D_C$ covers $h$ with probability at least $p$:

$$\Pr_{c \sim D_C}\big[ d_D(c, h) \leq \varepsilon \big] \geq p.$$

Fractional covers are closely connected to non-uniform covers. In fact, one can easily move between the two by sampling or subsampling.

Proposition 13.8 (Non-uniform cover ⇔ Fractional cover).

Let $(D, X, H)$ be any class, $C_{\mathit{frac}}$ a fractional $(\varepsilon, p)$-cover, and $C_{\text{n-u}}$ a non-uniform $(\varepsilon, 1/2)$-cover. Then the following hold:

1. Drawing $\log_{1/(1-p)}(1/\delta)$ samples from $C_{\mathit{frac}}$ gives a non-uniform $(\varepsilon, \delta)$-cover.

2. Choosing a random hypothesis from $C_{\text{n-u}}$ gives a fractional $\left(\varepsilon, \frac{1}{2|C|}\right)$-cover.

Proof 13.9.

Both statements are essentially immediate from the definitions. For any fixed $h \in H$, if we draw $M$ samples from $C_{\mathit{frac}}$, the probability we fail to cover $h$ is at most $(1-p)^M$, so setting $M \coloneqq \log_{1/(1-p)}(1/\delta)$ gives the desired non-uniform cover. On the other hand, for any fixed $h \in H$, a sample $C \sim C_{\text{n-u}}$ contains a $c$ that is $\varepsilon$-close to $h$ with probability $1/2$. Outputting a uniformly random element of $C$ then gives an element within $\varepsilon$ of $h$ with probability $\frac{1}{2|C|}$, as desired.
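As a quick numeric sanity check of the first direction (our own illustration, with hypothetical values of $p$ and $\delta$), the prescribed number of draws does push the miss probability below $\delta$:

```python
import math

# Hypothetical parameters: amplify a fractional (eps, p)-cover to a non-uniform (eps, delta)-cover.
p, delta = 0.1, 1e-3

# Number of independent draws so a fixed h is missed with probability at most delta:
# M = log base 1/(1-p) of 1/delta.
M = math.ceil(math.log(1 / delta) / math.log(1 / (1 - p)))

fail = (1 - p) ** M   # probability that all M draws miss the fixed h
print(M, fail)
```

Here each draw independently misses $h$ with probability at most $1 - p$, so the failure probability decays geometrically in $M$.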

It will also be useful to note a classical relation between covers and fractional covers.

Lemma 13.10.

If there exists a fractional $(\varepsilon, p)$-cover for $(D, X, H)$, then there exists a $2\varepsilon$-cover of size $O(1/p)$.

Proof 13.11.

This follows from classical packing-covering duality. The existence of a fractional $(\varepsilon, p)$-cover implies there cannot exist a $2\varepsilon$-packing of size greater than $1/p$ (that is, a set of more than $1/p$ hypotheses in $H$ that are pairwise $2\varepsilon$-separated with respect to $D$). By packing-covering duality, this implies the existence of a $2\varepsilon$-cover of size at most $\lceil 1/p \rceil + 1$.
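The duality step (a maximal set of pairwise-separated points automatically covers the whole space) can be illustrated on a toy metric. This is our own sketch, not the paper's construction; the grid of "hypotheses" and the radius are hypothetical:

```python
def greedy_packing(points, dist, r):
    """Greedily build a maximal r-separated set; by maximality it is also an r-cover."""
    packing = []
    for p in points:
        # keep p only if it is more than r away from everything chosen so far
        if all(dist(p, q) > r for q in packing):
            packing.append(p)
    return packing

# Toy metric: absolute difference on a grid of "hypotheses".
points = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
dist = lambda a, b: abs(a - b)

pack = greedy_packing(points, dist, r=0.25)
print(pack)  # pairwise > 0.25 apart, yet every point lies within 0.25 of some element
```

Any point left out of `pack` must be within `r` of a chosen element (otherwise greedy would have kept it), which is exactly how a bound on the packing size yields a cover of the same size.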

With this in hand, let's show that uniform covers can be constructed for any realizably learnable class, regardless of whether we have uniform convergence.

Theorem 13.12 (Realizable learning → Uniform cover).

Let $(\mathcal{D}, X, H)$ be realizably PAC-learnable with sample complexity $n(\varepsilon, \delta)$. Then it is possible to construct a uniform $(\varepsilon, \delta)$-cover for $(\mathcal{D}, X, H)$ in $n(\varepsilon/2, \delta')$ samples, where

$$\delta' \coloneqq O\left(\frac{\delta}{\Pi_H(n(\varepsilon/2, 1/2))}\right).$$

Proof 13.13.

We'll start by proving a slightly more general fact: if for every $D \in \mathcal{D}$, $(D, X, H)$ has a proper $(\varepsilon/2)$-cover $C_D$ of size at most $C \coloneqq C(\varepsilon/2)$, then it is possible to construct a uniform $(\varepsilon, \delta)$-cover in $n(\varepsilon/2, \delta/C)$ samples. This is essentially immediate from Lemma 5.2, which states that running LearningToCover over a sample of size $n(\varepsilon/2, \delta/C)$ gives a non-uniform $(\varepsilon/2, \delta/C)$-cover. Union bounding over $C_D$ then gives that a sample from the non-uniform cover $(\varepsilon/2)$-covers $C_D$ with probability at least $1 - \delta$. Since $C_D$ is itself an $(\varepsilon/2)$-cover, this implies that the entire class $H$ is $\varepsilon$-covered by the sample with probability at least $1 - \delta$, as desired.

It remains to show that for every $D \in \mathcal{D}$, $(D, X, H)$ has a proper $(\varepsilon/2)$-cover of size $O(\Pi_H(n(\varepsilon/2, 1/2)))$. This follows from combining Proposition 13.8 and Lemma 13.10. In particular, Lemma 5.2 implies that running LearningToCover over a sample of size $n(\varepsilon/2, 1/2)$ produces a non-uniform $(\varepsilon/2, 1/2)$-cover of size at most $\Pi_H(n(\varepsilon/2, 1/2))$. Proposition 13.8 states that subsampling from this cover gives a fractional $\left(\varepsilon/2, \frac{1}{2\Pi_H(n(\varepsilon/2, 1/2))}\right)$-cover, which in turn implies the existence of an $(\varepsilon/2)$-cover of size $O(\Pi_H(n(\varepsilon/2, 1/2)))$ as desired. We note that this last argument is similar to an observation made in Benedek and Itai's [17] seminal work on the distribution-dependent model.

When ( ๐‘‹ , ๐ป ) has finite VC-dimension ๐‘‘ , note that Theorem 13.12 exactly matches the lower bound exhibited in Theorem 13.4 as the required number of samples for a uniform ( ๐œ€ , ๐›ฟ ) -cover becomes:

๐‘› โข ( ๐œ€ / 2 , ๐›ฟ โ€ฒ ) โ‰ค ๐‘‚ โข ( ๐‘‘ โข log โก ( 1 / ๐œ€ ) + log โก ( 1 / ๐›ฟ ) ๐œ€ ) .

This also matches the bound given by ABM [2] using uniform convergence.
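To see where the $d \log(1/\varepsilon)$ term comes from, recall the Sauer-Shelah bound $\Pi_H(n) \leq (en/d)^d$ together with the standard realizable sample complexity $n(\varepsilon, \delta) = O((d + \log(1/\delta))/\varepsilon)$, so that $\log \Pi_H(n(\varepsilon/2, 1/2)) = O(d \log(1/\varepsilon))$. A small numeric sketch (our own, with hypothetical constants and the $O(\cdot)$'s dropped):

```python
import math

# Hypothetical constants: realizable sample complexity n(eps, delta) ~ (d + log(1/delta)) / eps,
# and the growth function bounded via Sauer-Shelah: Pi_H(n) <= (e * n / d) ** d.
d = 10

def n_realizable(eps, delta):
    return (d + math.log(1 / delta)) / eps

def log_growth(n):
    # log of the Sauer-Shelah bound (e * n / d) ** d
    return d * math.log(math.e * n / d)

for eps in [0.1, 0.01, 0.001]:
    n = n_realizable(eps / 2, 1 / 2)
    print(f"eps={eps}: log Pi_H(n) = {log_growth(n):.1f} vs d*log(1/eps) = {d * math.log(1 / eps):.1f}")
```

The two columns track each other up to a constant factor, which is exactly the substitution that turns $\delta'$ into the displayed $O\left(\frac{d\log(1/\varepsilon)+\log(1/\delta)}{\varepsilon}\right)$ bound.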

Acknowledgements

The authors would like to thank Shay Moran, Russell Impagliazzo, Omar Montasser, and Avrim Blum for enlightening discussions. We also thank anonymous referees for constructive feedback, and especially for pointing out the notion of probabilistic representations and that prior work discussed in Section 3 falls under the general framework of our reduction.

References

[1] Nir Ailon, Anup Bhattacharya and Ragesh Jaiswal. "Approximate correlation clustering using same-cluster queries". In: Latin American Symposium on Theoretical Informatics, LATIN '18, 2018, pp. 14–27. Springer. DOI: 10.1007/978-3-319-77404-6_2
[2] Noga Alon, Raef Bassily and Shay Moran. "Limits of Private Learning with Access to Public Data". In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, NeurIPS '19. NY, USA: Curran Associates Inc., 2019. DOI: 10.5555/3454287.3455215
[3] Noga Alon, Amos Beimel, Shay Moran and Uri Stemmer. "Closure properties for private classification and online prediction". In: Conference on Learning Theory, COLT '20, 2020, pp. 119–152. PMLR. URL: https://proceedings.mlr.press/v125/alon20a.html
[4] Noga Alon et al. "Adversarial Laws of Large Numbers and Optimal Regret in Online Classification". In: Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC '21. Virtual, Italy: Association for Computing Machinery, 2021, pp. 447–455. DOI: 10.1145/3406325.3451041
[5] Noga Alon, Steve Hanneke, Ron Holzman and Shay Moran. "A Theory of PAC Learnability of Partial Concept Classes". In: 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science, FOCS '21. CA, USA: IEEE Computer Society, 2021, pp. 658–671. DOI: 10.1109/FOCS52979.2021.00070
[6] Noga Alon, Roi Livni, Maryanthe Malliaris and Shay Moran. "Private PAC learning implies finite Littlestone dimension". In: Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC '19, 2019, pp. 852–860. DOI: 10.1145/3313276.3316312
[7] Anish Athalye, Logan Engstrom, Andrew Ilyas and Kevin Kwok. "Synthesizing Robust Adversarial Examples". In: Proceedings of the 35th International Conference on Machine Learning 80, ICML '18. PMLR, 2018, pp. 284–293. URL: https://proceedings.mlr.press/v80/athalye18b.html
[8] Maria-Florina Balcan and Avrim Blum. "A discriminative model for semi-supervised learning". In: Journal of the ACM (JACM) 57.3. ACM New York, NY, USA, 2010, pp. 1–46. DOI: 10.1145/1706591.1706599
[9] Peter L. Bartlett, Philip M. Long and Robert C. Williamson. "Fat-shattering and the learnability of real-valued functions". In: Journal of Computer and System Sciences 52.3. Elsevier, 1996, pp. 434–452. DOI: 10.1006/jcss.1996.0033
[10] Raef Bassily et al. "Private Query Release Assisted by Public Data". In: Proceedings of the 37th International Conference on Machine Learning 119, ICML '20. PMLR, 2020, pp. 695–703. URL: https://proceedings.mlr.press/v119/bassily20a.html
[11] Raef Bassily, Shay Moran and Anupama Nandi. "Learning from Mixtures of Private and Public Populations". In: Advances in Neural Information Processing Systems 33, NeurIPS '20. Curran Associates, Inc., 2020, pp. 2947–2957. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1ee942c6b182d0f041a2312947385b23-Paper.pdf
[12] Amos Beimel, Kobbi Nissim and Uri Stemmer. "Characterizing the sample complexity of private learners". In: Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, ITCS '13, 2013, pp. 97–110. DOI: 10.1145/2422436.2422450
[13] Amos Beimel, Kobbi Nissim and Uri Stemmer. "Learning privately with labeled and unlabeled examples". In: Algorithmica 83. Springer, 2021, pp. 177–215. DOI: 10.1007/s00453-020-00753-z
[14] Amos Beimel, Kobbi Nissim and Uri Stemmer. "Private Learning and Sanitization: Pure vs. Approximate Differential Privacy". In: Theory of Computing 12.1. Theory of Computing Exchange, 2016, pp. 1–61. DOI: 10.4086/toc.2016.v012a001
[15] Shai Ben-David, Nicolò Cesa-Bianchi, David Haussler and Philip Long. "Characterizations of Learnability for Classes of {0, …, n}-Valued Functions". In: Journal of Computer and System Sciences 50.1, 1995, pp. 74–86. DOI: 10.1006/jcss.1995.1008
[16] Shai Ben-David, Dávid Pál and Shai Shalev-Shwartz. "Agnostic Online Learning". In: The 22nd Annual Conference on Learning Theory 3, COLT '09, 2009. URL: https://www.cs.huji.ac.il/~shais/papers/BendavidPalShalevtech09.pdf
[17] Gyora M. Benedek and Alon Itai. "Learnability with respect to fixed distributions". In: Theoretical Computer Science 86.2. Elsevier, 1991, pp. 377–389. DOI: 10.1016/0304-3975(91)90026-X
[18] Avrim Blum and Tom Mitchell. "Combining Labeled and Unlabeled Data with Co-Training". In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT '98. Wisconsin, USA: Association for Computing Machinery, 1998, pp. 92–100. DOI: 10.1145/279943.279962
[19] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler and Manfred K. Warmuth. "Learnability and the Vapnik-Chervonenkis dimension". In: Journal of the ACM (JACM) 36.4. ACM, 1989, pp. 929–965. DOI: 10.1145/76359.76371
[20] Olivier Bousquet and André Elisseeff. "Stability and generalization". In: The Journal of Machine Learning Research 2. JMLR.org, 2002, pp. 499–526. URL: https://www.jmlr.org/papers/volume2/bousquet02a/bousquet02a.pdf
[21] Nicolò Cesa-Bianchi et al. "Sample-efficient strategies for learning in the presence of noise". In: Journal of the ACM (JACM) 46.5. ACM New York, NY, USA, 1999, pp. 684–719. DOI: 10.1145/324133.324221
[22] Kamalika Chaudhuri and Daniel Hsu. "Sample complexity bounds for differentially private learning". In: Proceedings of the 24th Annual Conference on Learning Theory, COLT '11, 2011, pp. 155–186. JMLR Workshop and Conference Proceedings. URL: https://proceedings.mlr.press/v19/chaudhuri11a.html
[23] Koby Crammer, Michael Kearns and Jennifer Wortman. "Learning from Multiple Sources". In: Journal of Machine Learning Research 9.57, 2008, pp. 1757–1774. URL: http://jmlr.org/papers/v9/crammer08a.html
[24] Yuval Dagan and Vitaly Feldman. "PAC learning with stable and private predictions". In: Conference on Learning Theory, COLT '20, 2020, pp. 1389–1410. PMLR. URL: https://proceedings.mlr.press/v125/dagan20a.html
[25] Amit Daniely, Sivan Sabato, Shai Ben-David and Shai Shalev-Shwartz. "Multiclass learnability and the ERM principle". In: Journal of Machine Learning Research 16, 2015, pp. 2377–2404. URL: https://www.jmlr.org/papers/volume16/daniely15a
[26] Amit Daniely and Shai Shalev-Shwartz. "Optimal learners for multiclass problems". In: Proceedings of The 27th Conference on Learning Theory 35, COLT '14. Barcelona, Spain: PMLR, 2014, pp. 287–316. URL: https://proceedings.mlr.press/v35/daniely14b.html
[27] Sanjoy Dasgupta. "Coarse sample complexity bounds for active learning". In: Advances in Neural Information Processing Systems 18, NeurIPS '05. MIT Press, 2005. URL: https://proceedings.neurips.cc/paper_files/paper/2005/file/6e82873a32b95af115de1c414a1849cb-Paper.pdf
[28] Sanjoy Dasgupta, Michael Littman and David McAllester. "PAC Generalization Bounds for Co-training". In: Advances in Neural Information Processing Systems 14, NeurIPS '02. MIT Press, 2002. URL: https://proceedings.neurips.cc/paper/2001/file/4c144c47ecba6f8318128703ca9e2601-Paper.pdf
[29] Ofir David, Shay Moran and Amir Yehudayoff. "Supervised learning through the lens of compression". In: Advances in Neural Information Processing Systems 29, NeurIPS '16, 2016, pp. 2784–2792. URL: https://papers.nips.cc/paper_files/paper/2016/hash/59f51fd6937412b7e56ded1ea2470c25-Abstract.html
[30] Shai Ben David, Tyler Lu, Teresa Luu and David Pal. "Impossibility Theorems for Domain Adaptation". In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics 9, AISTATS '10. Sardinia, Italy: PMLR, 2010, pp. 129–136. URL: https://proceedings.mlr.press/v9/david10a.html
[31] Richard M. Dudley, Sanjeev R. Kulkarni, Thomas Richardson and Ofer Zeitouni. "A metric entropy bound is not sufficient for learnability". In: IEEE Transactions on Information Theory 40.3, 1994, pp. 883–885. DOI: 10.1109/18.335898
[32] Cynthia Dwork and Vitaly Feldman. "Privacy-preserving prediction". In: Conference On Learning Theory, COLT '18, 2018, pp. 1693–1702. PMLR. URL: https://proceedings.mlr.press/v75/dwork18a
[33] Vitaly Feldman, Parikshit Gopalan, Subhash Khot and Ashok Kumar Ponnuswami. "On agnostic learning of parities, monomials, and halfspaces". In: SIAM Journal on Computing 39.2. SIAM, 2009, pp. 606–645. DOI: 10.1137/070684914
[34] Steve Hanneke. "Proper PAC learning VC dimension bounds". Theoretical Computer Science Stack Exchange, 2018. URL: https://cstheory.stackexchange.com/q/44252
[35] Steve Hanneke and Liu Yang. "Minimax Analysis of Active Learning". In: Journal of Machine Learning Research 16.109, 2015, pp. 3487–3602. URL: http://jmlr.org/papers/v16/hanneke15a.html
[36] David Haussler. "Decision theoretic generalizations of the PAC model for neural net and other learning applications". In: Information and Computation 100.1. Elsevier, 1992, pp. 78–150. DOI: 10.1016/0890-5401(92)90010-D
[37] Daniel Hsu and Sivan Sabato. "Loss Minimization and Parameter Estimation with Heavy Tails". In: Journal of Machine Learning Research 17.18, 2016, pp. 1–40. URL: http://jmlr.org/papers/v17/14-273.html
[38] Thorsten Joachims. "Transductive Inference for Text Classification Using Support Vector Machines". In: Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99. CA, USA: Morgan Kaufmann Publishers Inc., 1999, pp. 200–209. DOI: 10.5555/645528.657646
[39] Shiva Prasad Kasiviswanathan et al. "What can we learn privately?" In: SIAM Journal on Computing 40.3. SIAM, 2011, pp. 793–826. DOI: 10.1137/090756090
[40] Michael Kearns. "Efficient Noise-Tolerant Learning from Statistical Queries". In: Journal of the ACM 45.6. New York, NY, USA: Association for Computing Machinery, 1998, pp. 983–1006. DOI: 10.1145/293347.293351
[41] Michael Kearns and Ming Li. "Learning in the presence of malicious errors". In: SIAM Journal on Computing 22.4. SIAM, 1993, pp. 807–837. DOI: 10.1137/0222052
[42] Michael J. Kearns and Robert E. Schapire. "Efficient distribution-free learning of probabilistic concepts". In: Journal of Computer and System Sciences 48.3, 1994, pp. 464–497. DOI: 10.1016/S0022-0000(05)80062-5
[43] Sanjeev R. Kulkarni and Mathukumalli Vidyasagar. "Learning decision rules for pattern classification under a family of probability measures". In: IEEE Transactions on Information Theory 43.1, 1997, pp. 154–166. DOI: 10.1109/18.567668
[44] Tosca Lechner and Shai Ben-David. "Impossibility of Characterizing Distribution Learning – a simple solution to a long-standing problem". In: arXiv, 2023. URL: https://arxiv.org/abs/2304.08712
[45] Yi Li, Philip M. Long and Aravind Srinivasan. "Improved bounds on the sample complexity of learning". In: Journal of Computer and System Sciences 62.3. Elsevier, 2001, pp. 516–527. DOI: 10.1006/jcss.2000.1741
[46] Nick Littlestone and Manfred Warmuth. "The Weighted Majority Algorithm". In: Information and Computation 108.2, 1994, pp. 212–261. DOI: 10.1006/inco.1994.1009
[47] Philip M. Long. "On agnostic learning with {0, *, 1}-valued and real-valued hypotheses". In: International Conference on Computational Learning Theory, COLT '01, 2001, pp. 289–302. Springer. URL: https://link.springer.com/chapter/10.1007/3-540-44581-1_19
[48] Andreas Maurer and Massimiliano Pontil. "Concentration inequalities under sub-Gaussian and sub-exponential conditions". In: Advances in Neural Information Processing Systems 34, NeurIPS '21. Curran Associates, Inc., 2021, pp. 7588–7597. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/3e33b970f21d2fc65096871ea0d2c6e4-Paper.pdf
[49] Frank McSherry and Kunal Talwar. "Mechanism design via differential privacy". In: 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS '07, 2007, pp. 94–103. IEEE. DOI: 10.1109/FOCS.2007.66
[50] Omar Montasser, Steve Hanneke and Nathan Srebro. "Adversarially Robust Learning with Unknown Perturbation Sets". In: Proceedings of the Thirty Fourth Conference on Learning Theory 134, COLT '21. PMLR, 2021, pp. 3452–3482. URL: https://proceedings.mlr.press/v134/montasser21a.html
[51] Omar Montasser, Steve Hanneke and Nathan Srebro. "Reducing Adversarially Robust Learning to Non-Robust PAC Learning". In: Proceedings of the 34th International Conference on Neural Information Processing Systems, NeurIPS '20. BC, Canada: Curran Associates Inc., 2020. URL: https://dl.acm.org/doi/10.5555/3495724.3496950
[52] Omar Montasser, Steve Hanneke and Nathan Srebro. "VC classes are adversarially robustly learnable, but only improperly". In: Conference on Learning Theory, COLT '19, 2019, pp. 2512–2530. PMLR. URL: https://proceedings.mlr.press/v99/montasser19a/montasser19a.pdf
[53] Vaishnavh Nagarajan and J. Zico Kolter. "Uniform convergence may be unable to explain generalization in deep learning". In: Advances in Neural Information Processing Systems 32, NeurIPS '19. Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper_files/paper/2019/file/05e97c207235d63ceb1db43c60db7bbb-Paper.pdf
[54] Balas K. Natarajan. "On Learning Sets and Functions". In: Machine Learning 4.1. USA: Kluwer Academic Publishers, 1989, pp. 67–97. DOI: 10.1007/BF00114804
[55] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro and Karthik Sridharan. "Learnability, stability and uniform convergence". In: Journal of Machine Learning Research 11.90, 2010, pp. 2635–2670. URL: http://jmlr.org/papers/v11/shalev-shwartz10a.html
[56] Hidetoshi Shimodaira. "Improving predictive inference under covariate shift by weighting the log-likelihood function". In: Journal of Statistical Planning and Inference 90.2, 2000, pp. 227–244. DOI: 10.1016/S0378-3758(00)00115-4
[57] Hans Ulrich Simon. "A characterization of strong learnability in the statistical query model". In: Annual Symposium on Theoretical Aspects of Computer Science, STACS '07, 2007, pp. 393–404. Springer. DOI: 10.1007/978-3-540-70918-3_34
[58] Leslie G. Valiant. "A theory of the learnable". In: Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, STOC '84, 1984, pp. 436–445. ACM. DOI: 10.1145/1968.1972
[59] Vladimir Vapnik and Alexey Chervonenkis. "On the uniform convergence of relative frequencies of events to their probabilities". In: Theory of Probability and Its Applications, 1971. DOI: 10.1137/1116025
[60] Vladimir Vapnik and Alexey Chervonenkis. "Theory of Pattern Recognition". Moscow: Nauka Publishing House, 1974. URL: https://cml.rhul.ac.uk/resources/PatternRecognition/patternreclowres.pdf
[61] Mathukumalli Vidyasagar, Srinivasan Balaji and Barbara Hammer. "Closure properties of uniform convergence of empirical means and PAC learnability under a family of probability measures". In: Systems and Control Letters 42.2, 2001, pp. 151–157. DOI: 10.1016/S0167-6911(00)00086-4
[62] Michael M. Wolf. "Mathematical foundations of supervised learning". Munich: mediaTUM, 2023. URL: https://mediatum.ub.tum.de/doc/1723378/1723378.pdf
[63] Gal Yona and Guy Rothblum. "Probably Approximately Metric-Fair Learning". In: Proceedings of the 35th International Conference on Machine Learning 80, ICML '18. PMLR, 2018, pp. 5680–5688. URL: https://proceedings.mlr.press/v80/yona18a.html
[64] Chiyuan Zhang et al. "Understanding Deep Learning (Still) Requires Rethinking Generalization". In: Communications of the ACM 64.3. New York, NY, USA: Association for Computing Machinery, 2021, pp. 107–115. DOI: 10.1145/3446776
[65] Xiaojin Zhu. "Semi-supervised Learning". In: Encyclopedia of Machine Learning and Data Mining. Springer US, 2017, pp. 1142–1147. DOI: 10.1007/978-0-387-30164-8_749

