Title: Sample Complexity of Probability Divergences under Group Symmetry

URL Source: https://arxiv.org/html/2302.01915


License: CC BY 4.0. arXiv:2302.01915v3 [math.ST], 22 Nov 2024.

Sample Complexity of Probability Divergences under Group Symmetry

Ziyu Chen, Markos A. Katsoulakis, Luc Rey-Bellet, Wei Zhu
Department of Mathematics and Statistics, University of Massachusetts Amherst, Amherst, MA 01003, USA
{ziyuchen, markos, luc, weizhu}@umass.edu

Abstract

We rigorously quantify the improvement in the sample complexity of variational divergence estimation for group-invariant distributions. For the Wasserstein-1 metric and the Lipschitz-regularized α-divergences, the reduction in sample complexity is proportional to the group size when the group is finite. Beyond the version published at ICML 2023, our proof also covers infinite groups, such as compact Lie groups: in that case the convergence rate improves further and depends on the intrinsic dimension of the fundamental domain, characterized by the scaling of its covering number. Our approach differs from that of [Tahmasebi & Jegelka, ICML 2024], and our work also applies to asymmetric divergences, such as the Lipschitz-regularized α-divergences. For the maximum mean discrepancy (MMD), the improvement in sample complexity is more nuanced, as it depends not only on the group size but also on the choice of kernel. Numerical simulations verify our theory.

1 Introduction

Probability divergences provide means to measure the discrepancy between two probability distributions. They have broad applications in a variety of inference tasks, such as independence testing [64, 37], independent component analysis [34], and generative modeling [28, 50, 1, 33, 60, 49].

A key task within the above applications is the computation and estimation of the divergences from finite data, which is known to be a difficult problem [51, 25]. Empirical estimators based on the variational representations for the probability divergences are generally favored and widely used due to their scalability to both the data size and the ambient space dimension [3, 8, 47, 48, 54, 57, 7, 10, 58, 30, 32, 31, 27].

Figure 1: The distribution of the whole-slide prostate cancer images (LYSTO data set [17]) is rotation-invariant, i.e., an image and its rotated copies are equiprobable.

Empirical computation of the probability divergences and theoretical analysis on their sample complexity are typically studied without any a priori structural assumption on the probability measures. Many distributions in real life, however, are known to have intrinsic structures, such as group symmetry. For example, the distribution of the medical images collected without preferred orientation should be rotation-invariant, i.e., an image is supposed to have the same likelihood as its rotated copies; see Figure 1. Such structural information could be leveraged to improve the accuracy and/or sample-efficiency for divergence estimation.

Indeed, the recent work by Birrell et al. [9] shows that one can develop an improved variational representation for divergences between group-invariant distributions. The key idea is to reduce the test function space in the variational formula to its subset of group-invariant functions, which effectively acts as an unbiased regularization. When used in a generative adversarial network (GAN) for group-invariant distribution learning, Birrell et al. [9] empirically show that divergence estimation/optimization based on their proposed variational representation under group symmetry leads to significantly improved sample generation, especially in the small data regime.

The purpose of this work is to rigorously quantify the performance gain of divergence estimation under group symmetry. More specifically, we analyze the reduction in sample complexity of divergence estimation in terms of the convergence of the empirical estimators. We focus, in particular, on three types of probability divergences: the Wasserstein-1 metric, the maximum mean discrepancy (MMD), and the family of Lipschitz-regularized α-divergences; see Section 3.1 for the exact definitions. Our main results show that the reduction in sample complexity is proportional to the group size when the group is finite. When the group is infinite, the convergence rate can be further improved and depends on the intrinsic dimension of the fundamental domain, characterized by the scaling of its covering number defined in Definition 3; see Theorems 1 and 2 for the Wasserstein-1 metric, and Theorems 3 and 4 for the Lipschitz-regularized α-divergences. In the case of MMD, the reduction in sample complexity due to group invariance is more nuanced and depends on the properties of the kernel; see Theorem 5. As a byproduct, we also establish the consistency and sample complexity of the Lipschitz-regularized α-divergences without group symmetry, which, to the best of our knowledge, was missing from the previous literature.

The rest of the paper is organized as follows. In Section 2, we review previous work related to the empirical estimation of probability divergences and group-invariant distributions. We introduce the background and motivation in Section 3. Theoretical results and numerical examples are provided in Section 4. We conclude our work in Section 5. Proofs and detailed statements of the theorems in Section 4 are provided in Section 6.

2 Related work

Empirical estimation of probability divergences. Probability divergences have been widely used, including in generative adversarial networks (GANs) [1, 28, 50, 9, 33], uncertainty quantification [16, 22], independence determination through mutual information estimation [3], bounding risk in probably approximately correct (PAC) learning [14, 44, 55], statistical mechanics and interacting particles [38], large deviations [21], and parameter estimation [13].

A growing body of literature has been dedicated to the empirical estimation of divergences from finite data. Earlier works based on density estimation are known to work best in low dimensions [35, 52]. Recent research has shown that statistical estimators based on the variational representations of probability divergences scale better with dimension; such studies include the KL-divergence [3], the more general f-divergences [8, 47, 48, 54, 57], Rényi divergences [7, 10], integral probability metrics (IPMs) [58, 30, 32, 31], and Sinkhorn divergences [27]. Such estimators are typically constructed to compare an arbitrary pair of probability measures without any a priori structural assumption, and are hence sub-optimal in estimating divergences between distributions with known structures, such as group symmetry.

Group-invariant distributions. Recent developments in group-equivariant machine learning [18, 19, 63] have sparked a flurry of research in neural generative models for group-invariant distributions. Most of these works focus only on guaranteeing the generation of group-invariant distributions, e.g., through an equivariant normalizing flow [5, 11, 26, 39, 43, 53]; the divergence computation between the generated distribution and the ground-truth target, a crucial step in the optimization pipeline, however, does not leverage their group-invariant structure. Equivariant GANs for group-invariant distribution learning have also been proposed by modifying the inner loop of the discriminator update through either data augmentation [65] or constrained optimization within a subspace of group-invariant discriminators [20]; the theoretical justification of such procedures, as well as the resulting performance gain, has been explained by Birrell et al. [9] as an improved estimation of variational divergences under group symmetry via an unbiased regularization. The exact quantification of the improvement, in terms of the reduction in sample complexity, is however still missing; this is the main focus of this work.

3 Background and motivation

3.1 Variational divergences and probability metrics

Let 𝒳 be a measurable space, and 𝒫(𝒳) the set of probability measures on 𝒳. A map D: 𝒫(𝒳) × 𝒫(𝒳) → [0, ∞] is called a divergence on 𝒫(𝒳) if

$$D(P, Q) = 0 \;\Longleftrightarrow\; P = Q, \qquad P, Q \in \mathcal{P}(\mathcal{X}), \tag{1}$$

hence providing a notion of “distance” between probability measures. Many probability divergences of interest can be formulated using a variational representation

$$D(P, Q) = \sup_{\gamma \in \Gamma} H(\gamma; P, Q), \tag{2}$$

where Γ ⊂ ℳ(𝒳) is a space of test functions, ℳ(𝒳) is the set of measurable functions on 𝒳, and H: ℳ(𝒳) × 𝒫(𝒳) × 𝒫(𝒳) → [−∞, ∞] is some objective functional. Through suitable choices of H(γ; P, Q) and Γ, formula (2) includes many divergences and probability metrics. Below we list two specific classes of examples.

(a) Γ-Integral Probability Metrics (Γ-IPMs). Given Γ ⊂ ℳ_b(𝒳), the space of bounded measurable functions on 𝒳, the Γ-IPM between P and Q is defined as

$$D^{\Gamma}(P, Q) := \sup_{\gamma \in \Gamma} \{ E_P[\gamma] - E_Q[\gamma] \}. \tag{3}$$

Some prominent examples of Γ-IPMs include the Wasserstein-1 metric, the total variation metric, the Dudley metric, and the maximum mean discrepancy (MMD) [46, 58]. Our work, in particular, focuses on the following two specific IPMs.

• The Wasserstein-1 metric, W(P, Q) ≔ D^{Lip_L(𝒳)}(P, Q), i.e.,

$$W(P, Q) := \sup_{\gamma \in \mathrm{Lip}_L(\mathcal{X})} \{ E_P[\gamma] - E_Q[\gamma] \}, \tag{4}$$

where Lip_L(𝒳) is the space of L-Lipschitz functions on 𝒳. We note that the normalizing factor L^{−1} has been omitted from the formula.

• The maximum mean discrepancy, MMD(P, Q) ≔ D^{B_ℋ}(P, Q), i.e.,

$$\mathrm{MMD}(P, Q) := \sup_{\gamma \in B_{\mathcal{H}}} \{ E_P[\gamma] - E_Q[\gamma] \}, \tag{5}$$

where B_ℋ is the unit ball of some reproducing kernel Hilbert space (RKHS) ℋ on 𝒳.
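Between empirical measures, the supremum in (5) admits the well-known closed-form kernel expression MMD²(P_m, Q_n) = (1/m²)Σ k(x_i, x_j) + (1/n²)Σ k(y_i, y_j) − (2/mn)Σ k(x_i, y_j). A minimal sketch of the (biased) estimator; the Gaussian kernel and its bandwidth are illustrative assumptions, not choices made by the paper:

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), evaluated pairwise."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased estimator of MMD^2 between the empirical measures P_m and Q_n:
    mean k(x_i, x_j) + mean k(y_i, y_j) - 2 * mean k(x_i, y_j)."""
    kxx = gaussian_kernel(xs, xs, bandwidth).mean()
    kyy = gaussian_kernel(ys, ys, bandwidth).mean()
    kxy = gaussian_kernel(xs, ys, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy
```

Identical samples give exactly zero, and well-separated samples give a strictly positive value.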

(b) (f, Γ)-divergences. Let f: [0, ∞) → ℝ be convex and lower semi-continuous, with f(1) = 0 and f strictly convex at x = 1. Given Γ ⊂ ℳ_b(𝒳) that is closed under the shift transformations γ ↦ γ + ν, ν ∈ ℝ, the (f, Γ)-divergence introduced by Birrell et al. [6] is defined as

$$D_f^{\Gamma}(P \,\|\, Q) := \sup_{\gamma \in \Gamma} \{ E_P[\gamma] - E_Q[f^*(\gamma)] \}, \tag{6}$$

where f* denotes the Legendre transform of f. Formula (6) includes, as a special case when Γ = ℳ_b(𝒳), the widely known class of f-divergences, with notable examples such as the Kullback-Leibler (KL) divergence [42], the total variation distance, the Jensen-Shannon divergence, the χ²-divergence, the Hellinger distance, and more generally the family of α-divergences [50]. Of particular interest to us is the class of Lipschitz-regularized α-divergences,

$$D_{f_\alpha}^{\Gamma}(P \,\|\, Q), \qquad \Gamma = \mathrm{Lip}_L(\mathcal{X}), \qquad f_\alpha(x) = \frac{x^\alpha - 1}{\alpha(\alpha - 1)}, \tag{7}$$

where α > 0, α ≠ 1, is a positive parameter.

An important observation that will be useful in one of our results, Theorem 3, is that D_{f_α}^Γ admits an equivalent representation,

$$D_{f_\alpha}^{\Gamma}(P \,\|\, Q) = \sup_{\gamma \in \Gamma,\, \nu \in \mathbb{R}} \{ E_P[\gamma + \nu] - E_Q[f_\alpha^*(\gamma + \nu)] \}, \tag{8}$$

due to the invariance of Γ = Lip_L(𝒳) under the shift map γ ↦ γ + ν for ν ∈ ℝ.

3.2 Empirical estimation of variational divergences

Given i.i.d. samples X = {x₁, x₂, ⋯, x_m} and Y = {y₁, y₂, ⋯, y_n}, respectively, from two unknown probability measures P, Q ∈ 𝒫(𝒳), it is often of interest—in applications such as two-sample testing [4, 30, 31, 15] and independence testing [32, 31, 64, 37]—to estimate the divergence between P and Q [58, 7, 47, 48]. For variational divergences D^Γ(P, Q) and D_f^Γ(P‖Q) in the form of (3) and (6), their empirical estimators can naturally be given by

$$D^{\Gamma}(P_m, Q_n) = \sup_{\gamma \in \Gamma} \left\{ \frac{1}{m}\sum_{i=1}^{m} \gamma(x_i) - \frac{1}{n}\sum_{i=1}^{n} \gamma(y_i) \right\}, \tag{9}$$

$$D_f^{\Gamma}(P_m \,\|\, Q_n) = \sup_{\gamma \in \Gamma} \left\{ \frac{1}{m}\sum_{i=1}^{m} \gamma(x_i) - \frac{1}{n}\sum_{i=1}^{n} f^*(\gamma(y_i)) \right\}, \tag{10}$$

where $P_m = \frac{1}{m}\sum_{i=1}^{m} \delta_{x_i}$ and $Q_n = \frac{1}{n}\sum_{j=1}^{n} \delta_{y_j}$ represent the empirical distributions of P and Q, respectively.
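In one dimension with equal sample sizes m = n, the supremum in (9) over 1-Lipschitz test functions has a simple closed form: the Wasserstein-1 distance between the two empirical measures is the average absolute difference of the sorted samples. A minimal sketch; this order-statistics formula is standard background, not a construction from the paper:

```python
import numpy as np

def w1_empirical_1d(xs, ys):
    """Wasserstein-1 distance between the empirical measures of two equally
    sized 1-D samples: W1(P_n, Q_n) = (1/n) * sum_i |x_(i) - y_(i)|,
    where x_(i), y_(i) denote the order statistics."""
    assert len(xs) == len(ys)
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)))
```

For example, the samples {0, 1} and {0.5, 1.5} are 0.5 apart, and any permutation of the same sample gives distance zero.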

The consistency and sample complexity of the empirical estimators W(P_m, Q_n) and MMD(P_m, Q_n) in the form of (9) for, respectively, the Wasserstein-1 metric (4) and the MMD (5) between two general distributions P, Q ∈ 𝒫(𝒳) have been well studied [58, 31]. However, for probability measures with special structures, such as group symmetry, one can potentially obtain a divergence estimator with substantially improved sample complexity, as empirically observed by Birrell et al. [9]. We provide, in the following section, a brief review of group-invariant distributions and the improved variational representations for probability divergences under group symmetry, which serves as a motivation and foundation for our theoretical analysis in Section 4.

3.3 Variational divergences under group symmetry

A group is a set Σ equipped with a group product satisfying the axioms of associativity, identity, and invertibility. Given a group Σ and a set 𝒳, a map θ: Σ × 𝒳 → 𝒳 is called a group action on 𝒳 if θ_σ ≔ θ(σ, ·): 𝒳 → 𝒳 is an automorphism of 𝒳 for all σ ∈ Σ, and θ_{σ₂} ∘ θ_{σ₁} = θ_{σ₂·σ₁}, ∀σ₁, σ₂ ∈ Σ. By convention, we will abbreviate θ(σ, x) as σx throughout the paper.

A function γ: 𝒳 → ℝ is called Σ-invariant if γ ∘ θ_σ = γ, ∀σ ∈ Σ. Let Γ be a set of measurable functions γ: 𝒳 → ℝ; its subset Γ_Σ of Σ-invariant functions is defined as

$$\Gamma_\Sigma := \{ \gamma \in \Gamma : \gamma \circ \theta_\sigma = \gamma,\ \forall \sigma \in \Sigma \}. \tag{11}$$

On the other hand, a probability measure P ∈ 𝒫(𝒳) is called Σ-invariant if P = (θ_σ)_* P, ∀σ ∈ Σ, where (θ_σ)_* P ≔ P ∘ (θ_σ)^{−1} is the push-forward measure of P under θ_σ. We denote the set of all Σ-invariant distributions on 𝒳 by 𝒫_Σ(𝒳) ≔ {P ∈ 𝒫(𝒳): P is Σ-invariant}.

Finally, for a compact Hausdorff topological group Σ [23], we define two symmetrization operators, S_Σ: ℳ_b(𝒳) → ℳ_b(𝒳) and S_Σ: 𝒫(𝒳) → 𝒫(𝒳), on functions and probability measures, respectively, as follows:

$$S_\Sigma[\gamma](x) := \int_\Sigma \gamma(\sigma x)\, \mu_\Sigma(d\sigma), \qquad \forall \gamma \in \mathcal{M}_b(\mathcal{X}), \tag{12}$$

$$E_{S_\Sigma[P]}[\gamma] := E_P\big[S_\Sigma[\gamma]\big], \qquad \forall P \in \mathcal{P}(\mathcal{X}),\ \forall \gamma \in \mathcal{M}_b(\mathcal{X}), \tag{13}$$

where μ_Σ is the unique Haar probability measure on Σ. The operators S_Σ[γ] and S_Σ[P] can be intuitively understood, respectively, as “averaging” the function γ or “spreading” the probability mass of P across the group orbits in 𝒳; one can easily verify that they are projection operators onto the corresponding invariant subsets Γ_Σ ⊂ Γ and 𝒫_Σ(𝒳) ⊂ 𝒫(𝒳) [9].

The main result of Birrell et al. [9], which we summarize in Result 1, is that for Σ-invariant distributions, the function space Γ in the variational formulae (3) and (6) can be reduced to its invariant subset Γ_Σ ⊂ Γ.

Result 1 (paraphrased from [9]).

If S_Σ[Γ] ⊂ Γ and P, Q ∈ 𝒫(𝒳), then

$$D^{\Gamma}(S_\Sigma[P], S_\Sigma[Q]) = D^{\Gamma_\Sigma}(P, Q), \tag{14}$$

$$D_f^{\Gamma}(S_\Sigma[P] \,\|\, S_\Sigma[Q]) = D_f^{\Gamma_\Sigma}(P \,\|\, Q), \tag{15}$$

where D^Γ(P, Q) and D_f^Γ(P‖Q) are given by (3) and (6). In particular, if P, Q ∈ 𝒫_Σ(𝒳) are Σ-invariant, then

$$D^{\Gamma}(P, Q) = D^{\Gamma_\Sigma}(P, Q), \qquad D_f^{\Gamma}(P \,\|\, Q) = D_f^{\Gamma_\Sigma}(P \,\|\, Q).$$

Result 1 motivates a potentially more sample-efficient way to estimate the divergences D^Γ(P, Q) and D_f^Γ(P‖Q) between Σ-invariant distributions P, Q ∈ 𝒫(𝒳) using

$$D^{\Gamma_\Sigma}(P_m, Q_n) = \sup_{\gamma \in \Gamma_\Sigma} \left\{ \frac{1}{m}\sum_{i=1}^{m} \gamma(x_i) - \frac{1}{n}\sum_{i=1}^{n} \gamma(y_i) \right\}, \tag{16}$$

$$D_f^{\Gamma_\Sigma}(P_m \,\|\, Q_n) = \sup_{\gamma \in \Gamma_\Sigma} \left\{ \frac{1}{m}\sum_{i=1}^{m} \gamma(x_i) - \frac{1}{n}\sum_{i=1}^{n} f^*(\gamma(y_i)) \right\}. \tag{17}$$

Compared to Eqs. (9) and (10), the estimators (16) and (17) have the benefit of optimizing over a reduced space Γ_Σ ⊂ Γ of test functions, effectively acting as an unbiased regularization; their efficacy has been empirically observed by Birrell et al. [9] in the neural generation of group-invariant distributions with substantially improved data-efficiency. However, a theoretical understanding of the performance gain is still lacking.
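For a finite group, one way to realize the IPM estimator (16) in practice follows from (14): restricting the test functions to Γ_Σ is equivalent to first symmetrizing the empirical measures, and S_Σ[P_m] is simply the empirical measure augmented with all group copies of each sample. A minimal 1-D sketch for a cyclic translation group on [0, 1); the sorted-sample formula for 1-D Wasserstein-1 and the specific group are illustrative assumptions. Since Γ_Σ ⊂ Γ, the symmetrized estimate can never exceed the plain one:

```python
import numpy as np

def w1_1d(xs, ys):
    """W1 between empirical measures of two equally sized 1-D samples."""
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)))

def orbit_augment(samples, r):
    """S_Sigma[P_n] for the group generated by x -> (x + 1/r) mod 1:
    each sample is replaced by its r orbit copies with equal weights."""
    shifts = np.arange(r) / r
    return ((samples[:, None] + shifts[None, :]) % 1.0).ravel()

def w1_invariant(xs, ys, r):
    """W_Sigma(P_n, Q_n), computed as W1 between the symmetrized
    empirical measures, cf. Eq. (14)."""
    return w1_1d(orbit_augment(xs, r), orbit_augment(ys, r))
```

The invariant estimate is always bounded above by the unrestricted one, reflecting the smaller test-function class.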

The purpose of this work is to rigorously quantify the improvement in sample complexity of the divergence estimators (16) and (17) for group-invariant distributions. To contextualize the idea, we will focus our analysis on three specific types of probability divergences, the Wasserstein-1 metric (4), the MMD (5), and the Lipschitz-regularized α-divergence (6)–(7), between Σ-invariant P, Q ∈ 𝒫_Σ(𝒳):

$$W(P, Q) = W_\Sigma(P, Q) \approx W_\Sigma(P_m, Q_n), \tag{18}$$

$$\mathrm{MMD}(P, Q) = \mathrm{MMD}_\Sigma(P, Q) \approx \mathrm{MMD}_\Sigma(P_m, Q_n), \tag{19}$$

$$D_{f_\alpha}^{\Gamma}(P \,\|\, Q) = D_{f_\alpha}^{\Gamma_\Sigma}(P \,\|\, Q) \approx D_{f_\alpha}^{\Gamma_\Sigma}(P_m \,\|\, Q_n), \tag{20}$$

where

$$W_\Sigma(P, Q) := D^{[\mathrm{Lip}_L(\mathcal{X})]_\Sigma}(P, Q), \tag{21}$$

$$\mathrm{MMD}_\Sigma(P, Q) := D^{[B_{\mathcal{H}}]_\Sigma}(P, Q), \tag{22}$$

and the definition of D_{f_α}^{Γ_Σ}(P‖Q) is given by Equations (6), (7) and (11).

3.4 Further notations and assumptions

For the rest of the paper, we assume the measurable space 𝒳 ⊂ ℝ^D is a bounded subset of ℝ^D equipped with the Euclidean metric ‖·‖₂. The group is denoted by Σ. The Haar measure μ_Σ is thus a uniform probability measure over Σ, and the symmetrization S_Σ[γ] [Eq. (12)] is an average of γ over the group orbit. We next introduce the concept of a fundamental domain in the following definition.

Definition 1.

A subset 𝒳₀ ⊂ 𝒳 is called a fundamental domain of 𝒳 under the Σ-action if for each x ∈ 𝒳 there exists a unique x₀ ∈ 𝒳₀ such that x = σx₀ for some σ ∈ Σ.

Figure 2: The unit disk 𝒳 ⊂ ℝ² with the action of the (discrete) rotation groups Σ = C_n, n = 1, 4, 16, 64. The fundamental domain 𝒳₀ for each C_n is filled with yellow color.

Figure 2 displays an example where 𝒳 is the unit disk in ℝ², and Σ = C_n, n = 1, 4, 16, 64, are the discrete rotation groups acting on 𝒳; the fundamental domain 𝒳₀ for each Σ = C_n is filled with yellow color. We note that the choice of the fundamental domain 𝒳₀ is not unique. We will slightly abuse the notation 𝒳 = Σ × 𝒳₀ to denote that 𝒳₀ is a fundamental domain of 𝒳 under the Σ-action. We define T₀: 𝒳 → 𝒳₀ by

$$T_0(x) := y \in \mathcal{X}_0, \quad \text{if } y = \sigma x \text{ for some } \sigma \in \Sigma, \tag{23}$$

i.e., T₀ maps x ∈ 𝒳 to its unique orbit representative in 𝒳₀. In addition, we denote by P_{𝒳₀} ∈ 𝒫(𝒳₀) the distribution on the fundamental domain 𝒳₀ induced by a Σ-invariant distribution P ∈ 𝒫_Σ(𝒳) on 𝒳; that is,

$$dP_{\mathcal{X}_0}(x) = \int_{\{ y \in \mathcal{X} :\, y = \sigma x,\ \sigma \in \Sigma \}} dP(y), \qquad x \in \mathcal{X}_0. \tag{24}$$

The diameter of 𝒳 ⊂ ℝ^D is defined as

$$\mathrm{diam}(\mathcal{X}) = \sup_{x, y \in \mathcal{X}} \| x - y \|_2. \tag{25}$$

Finally, part of our results in Section 4 relies heavily on the concept of covering numbers which we define below.

Definition 2 (Covering number).

Let (𝒳, ρ) be a metric space. A subset S ⊂ 𝒳 is called a δ-cover of 𝒳 if for any x ∈ 𝒳 there is an s ∈ S such that ρ(s, x) ≤ δ. The δ-covering number of 𝒳 is defined as

$$\mathcal{N}(\mathcal{X}, \delta, \rho) := \min \{ |S| : S \text{ is a } \delta\text{-cover of } \mathcal{X} \}.$$

When ρ(x, y) = ‖x − y‖₂ is the Euclidean metric in ℝ^D, we abbreviate 𝒩(𝒳, δ, ρ) as 𝒩(𝒳, δ).

Following the notion of capacity dimension from [36], we define the intrinsic dimension of a set as follows.

Definition 3.

The intrinsic dimension of 𝒳 ⊂ ℝ^D, denoted by dim(𝒳), is defined as

$$\dim(\mathcal{X}) := -\lim_{\delta \to 0^+} \frac{\ln \mathcal{N}(\mathcal{X}, \delta)}{\ln \delta}. \tag{26}$$

For example, if 𝒳 ⊂ ℝ^D contains an open set of ℝ^D, then dim(𝒳) = D; if 𝒳 is a d-dimensional submanifold of ℝ^D, then dim(𝒳) = d.

4 Sample complexity under group invariance

In this section, we outline our main results for the sample complexity of divergence estimation under group invariance. In particular, we focus on three cases: the Wasserstein-1 metric (18), the MMD (19) and the (f_α, Γ)-divergence (20). While the convergence rates in the bounds for the Wasserstein-1 metric and the (f_α, Γ)-divergence depend on the dimension of the ambient space, that for the MMD does not. In all the numerical experiments, for simplicity, we choose X = {x₁, x₂, ⋯, x_m} and Y = {y₁, y₂, ⋯, y_n} to be sampled from the same Σ-invariant distribution P = Q for easy visualization and a clear benchmark.

4.1 Wasserstein-1 metric, W(P, Q)

In this section, we set Γ = Lip_L(𝒳), the set of L-Lipschitz functions on 𝒳; see Eq. (4). We further assume that the Σ-action on 𝒳 is 1-Lipschitz, i.e., ‖σx − σy‖₂ ≤ ‖x − y‖₂, ∀σ ∈ Σ, ∀x, y ∈ 𝒳, so that S_Σ[Γ] ⊂ Γ (see Lemma 1 for a proof). By Result 1, we have W(P, Q) = W_Σ(P, Q) for Σ-invariant probability measures P, Q ∈ 𝒫_Σ(𝒳).

To convey the main message, we summarize in Theorem 1 our result on the sample complexity under group invariance for the Wasserstein-1 metric. The detailed statement and technical assumptions of the theorem, as well as its proof, are deferred to Section 6.1. Readers are referred to Section 3 for the notations.

Theorem 1 (Finite groups).

Let 𝒳 = Σ × 𝒳₀ be a bounded subset of ℝ^D equipped with the Euclidean distance, with |Σ| < ∞ and dim(𝒳) = d. Suppose P, Q ∈ 𝒫_Σ(𝒳) are Σ-invariant distributions on 𝒳. If the numbers m, n of samples drawn from P and Q are sufficiently large, then we have, with high probability,

1. when d ≥ 2, for any s > 0,

$$|W(P, Q) - W_\Sigma(P_m, Q_n)| \le C \left[ \left( \frac{1}{|\Sigma|\, m} \right)^{\frac{1}{d+s}} + \left( \frac{1}{|\Sigma|\, n} \right)^{\frac{1}{d+s}} \right], \tag{27}$$

where C > 0 depends only on d, s and 𝒳, and is independent of m and n;

2. for d = 1,

$$|W(P, Q) - W_\Sigma(P_m, Q_n)| \le C \cdot \mathrm{diam}(\mathcal{X}_0) \left( \frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}} \right), \tag{28}$$

where C > 0 is an absolute constant independent of 𝒳, 𝒳₀, m and n.

When Σ is infinite and dim(𝒳₀) = d*, we have the following result with improved convergence rates.

Theorem 2 (Infinite groups).

Let 𝒳 = Σ × 𝒳₀ be a bounded subset of ℝ^D equipped with the Euclidean distance, with dim(𝒳₀) = d*. Suppose P, Q ∈ 𝒫_Σ(𝒳) are Σ-invariant distributions on 𝒳. Then we have, with high probability,

1. when d* ≥ 2, for any s > 0,

$$|W(P, Q) - W_\Sigma(P_m, Q_n)| \le C \left[ \left( \frac{\mathrm{vol}(\mathcal{X}_0)}{m} \right)^{\frac{1}{d^* + s}} + \left( \frac{\mathrm{vol}(\mathcal{X}_0)}{n} \right)^{\frac{1}{d^* + s}} \right], \tag{29}$$

where C > 0 depends only on d, s and 𝒳, is independent of m and n, and vol(𝒳₀) is the volume of 𝒳₀;

2. for d* = 1,

$$|W(P, Q) - W_\Sigma(P_m, Q_n)| \le C \cdot \mathrm{diam}(\mathcal{X}_0) \left( \frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}} \right), \tag{30}$$

where C > 0 is an absolute constant independent of 𝒳, 𝒳₀, m and n.

Furthermore, if 𝒳 is a d-dimensional connected submanifold and Σ is a compact Lie group acting locally smoothly on 𝒳, then d* = d − dim(Σ), where dim(Σ) is the dimension of a principal orbit (i.e., the maximal dimension among all orbits) by Theorem IV 3.8 in [12]. This recovers the bound derived in [59].

Sketch of the proof. Using the group invariance and the map T₀ defined in (23), we can transform the i.i.d. samples on 𝒳 into i.i.d. samples on 𝒳₀, which are effectively sampled from P_{𝒳₀} and Q_{𝒳₀} [cf. Eq. (24)]. Hence, after applying the triangle inequality to the estimation error, the supremum can be taken over L-Lipschitz functions defined on the fundamental domain 𝒳₀, i.e., Lip_L(𝒳₀), instead of over the original space Lip_L(𝒳). We further demonstrate in Lemma 2 that the supremum can be taken over an even smaller function space,

$$\mathcal{F}_0 = \{ \gamma \in \mathrm{Lip}_L(\mathcal{X}_0) : \|\gamma\|_\infty \le M \} \subset \mathrm{Lip}_L(\mathcal{X}_0), \tag{31}$$

with some uniform L^∞ bound M, due to the translation-invariance of γ in definition (4). Using Dudley's entropy integral [2], the error can be bounded in terms of the metric entropy of ℱ₀:

$$\inf_{\alpha > 0} \left\{ 8\alpha + \frac{24}{\sqrt{m}} \int_\alpha^M \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)}\, d\delta \right\}. \tag{32}$$

For d ≥ 2, we establish the relations between the metric entropy ln 𝒩(ℱ₀, δ, ‖·‖_∞) of ℱ₀ and the covering numbers of 𝒳₀ and 𝒳 via Lemma 4 and Lemma 5:

$$\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty) \le \mathcal{N}\!\left( \mathcal{X}_0, \frac{c_2 \delta}{L} \right) \ln\!\left( \frac{c_1 M}{\delta} \right), \tag{33}$$

$$\frac{\mathcal{N}(\mathcal{X}_0, \delta)}{\mathcal{N}(\mathcal{X}, \delta)} \le \frac{1}{|\Sigma|}, \qquad \text{for small enough } \delta, \tag{34}$$

which yields a factor involving the group size |Σ| in Eq. (27) when Σ is finite. The dominant term of the bound, arising from the singularity of the entropy integral at α = 0, is the one shown in Eq. (27). For d = 1, the entropy integral is not singular at the origin, and we bound the covering number of ℱ₀ in terms of diam(𝒳₀) instead. The probability bound follows from an application of McDiarmid's inequality. For Theorem 2, the result follows directly from Equation (33) together with the condition 𝒩(𝒳₀, δ) ≲ δ^{−d*} implied by Definition 3, without resorting to Equation (34).

Remark 1.

Even though we present in Theorem 1 only the dominant terms showing the rate of convergence for the estimator, our result for sample complexity is actually non-asymptotic. See Theorem 6 in Section 6.1 for a complete description of the result.

Remark 2.

When |Σ| = 1, i.e., when no group symmetry is leveraged in the divergence estimation, our result reduces to the case considered in, e.g., [58], for general distributions P, Q ∈ 𝒫_Σ(𝒳) = 𝒫(𝒳).

Remark 3.

In the case d ≥ 2, the s > 0 in Theorem 1 means the rate can be made arbitrarily close to −1/d. If we further assume that 𝒳₀ is connected, then the bound can be improved to

$$|W(P, Q) - W_\Sigma(P_m, Q_n)| \le C \left[ \left( \frac{1}{|\Sigma|\, m} \right)^{\frac{1}{2}} \ln m + \left( \frac{1}{|\Sigma|\, n} \right)^{\frac{1}{2}} \ln n \right]$$

for d = 2, and

$$|W(P, Q) - W_\Sigma(P_m, Q_n)| \le C \left[ \left( \frac{1}{|\Sigma|\, m} \right)^{\frac{1}{d}} + \left( \frac{1}{|\Sigma|\, n} \right)^{\frac{1}{d}} \right]$$

for d ≥ 3, without the dependence on s, which matches the rate in [24]. See Remark 9 after Lemma 4. A similar improvement holds for Theorem 2, with d replaced by d*.

Remark 4.

The factor diam(𝒳₀) in the case d = 1 is not necessarily directly related to the group size |Σ|. We refer to Example 1 below and its explanation in Remark 10 for cases where we can achieve a factor of |Σ|^{−1} in the convergence bound.

Example 1.

Let 𝒳 = [0, 1) and Γ = Lip_L([0, 1)), i.e., d = 1. We consider the Σ-action on 𝒳 generated by the translation x ↦ (x + 1/r) mod 1, where r = 1, 4, 16, 64, 256, so that |Σ| = r is the group size. We draw samples x_i ∼ P = Q ∈ 𝒫_Σ(𝒳) on 𝒳 in the following way: x_i = r^{−1} ξ_i^{1/3} + η_i, where the ξ_i are i.i.d. uniformly distributed random variables on [0, 1) and the η_i take values in {0, 1/r, …, (r−1)/r} with equal probabilities. One can easily verify that P = Q is indeed Σ-invariant. The numerical results for the empirical estimation of W(P, Q) = 0 using W_Σ(P_n, Q_n) with different group sizes |Σ| = r, r = 1, 4, 16, 64, 256, are shown in the left panel of Figure 3. One can clearly observe a significant improvement of the estimator as the group size |Σ| increases. Furthermore, the right panel of Figure 3 displays the ratios between adjacent curves, all of which converge to 4, which is the ratio between consecutive group sizes. This matches our calculation in Remark 10; see also Remark 4.

Example 2.

We let 𝒳 = ℝ², i.e., d = 2. The probability distributions P = Q are a mixture of 8 Gaussians centered at (cos(2πr/8), sin(2πr/8)), r = 0, 1, …, 7, with the same covariance. The distribution has C₈-rotation symmetry, but we pretend that it is only C₁-, C₂- or C₄-invariant; that is, the Σ used in the empirical estimation W_Σ(P_m, Q_n) does not reflect the entire invariance structure. Even though in this case the domain 𝒳 is unbounded, which is beyond our theoretical assumptions, we can still see in Figure 4 that as we increase the group size |Σ| in the computation of W_Σ(P_m, Q_n), fewer samples are needed to reach the same accuracy level. The ratios between adjacent curves in this case are slightly above the value √2 ≈ 1.414 predicted by our theory (see Remark 3), suggesting that the complexity bound could be further improved. For instance, in [58], a logarithmic correction term can be revealed for d = 2 after a more thorough analysis.
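The C₈-invariance of the target in Example 2 can be verified directly on the density: rotating the argument by 2π/8 permutes the eight centers and leaves an isotropic-covariance mixture unchanged. A minimal sketch; the specific isotropic covariance value is an illustrative assumption, as the paper only requires equal covariances:

```python
import numpy as np

CENTERS = np.array([[np.cos(2 * np.pi * r / 8), np.sin(2 * np.pi * r / 8)]
                    for r in range(8)])

def mixture_pdf(x, sigma=0.3):
    """Equal-weight mixture of 8 isotropic Gaussians centered on the
    8th roots of unity (unnormalized weights cancel in the comparison)."""
    d2 = np.sum((CENTERS - x) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2))

t = 2 * np.pi / 8
ROT8 = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])  # C_8 generator
```

Evaluating the density at x and at the rotated point gives the same value up to floating-point error, confirming the symmetry that the estimators in Example 2 only partially exploit.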

Figure 3: Left: the Wasserstein-1 distance with different group sizes on [0, 1), averaged over 10 replicas. Right: the ratio of the average Wasserstein-1 distance between different group sizes: |Σ| = 1 over |Σ| = 4 (blue), |Σ| = 4 over |Σ| = 16 (red), |Σ| = 16 over |Σ| = 64 (orange), |Σ| = 64 over |Σ| = 256 (purple). The black horizontal dashed line marks the ratio 4, the value theoretically predicted in Theorem 1 for d = 1. See Example 1 and Remark 10 for details.

Figure 4: Left: the Wasserstein-1 distance assuming different group sizes in ℝ², averaged over 10 replicas. Right: the ratio of the average Wasserstein-1 distance between different group sizes: |Σ| = 1 over |Σ| = 2 (blue), |Σ| = 2 over |Σ| = 4 (red). The black horizontal dashed line marks the ratio √2, the value theoretically predicted in Theorem 1 for d = 2. The ratios are slightly above the reference line, suggesting that the complexity bound could be further improved. See Example 2 and Remark 3 for details.

4.2 Lipschitz-regularized α-divergence, D_{f_α}^Γ(P‖Q)

The space Γ in this section is always set to Γ = Lip_L(𝒳); see Eq. (7). We only consider α > 1, as the case 0 < α < 1 can be treated in a similar manner. For α > 1, the Legendre transform of f_α, defined in (7), is

$$f_\alpha^*(y) = \left( \frac{(\alpha - 1)^{\frac{\alpha}{\alpha - 1}}}{\alpha}\, y^{\frac{\alpha}{\alpha - 1}} + \frac{1}{\alpha(\alpha - 1)} \right) \mathbf{1}_{\{y > 0\}}.$$

We provide a theorem for the sample complexity of the (f_α, Γ)-divergence under group invariance; its detailed statement and proof can be found in Section 6.2. We note that this is a new sample complexity result for the (f_α, Γ)-divergence even without the group structure, which, to the best of our knowledge, was previously missing from the literature.

Theorem 3 (Finite groups).

Let $\mathcal{X} = \Sigma \times \mathcal{X}_0$ be a subset of $\mathbb{R}^D$ equipped with the Euclidean distance, $|\Sigma| < \infty$ and $\dim(\mathcal{X}) = d$. Let $f_\alpha(x) = \frac{x^\alpha - 1}{\alpha(\alpha - 1)}$, $\alpha > 1$, and $\Gamma = \mathrm{Lip}_L(\mathcal{X})$. Suppose $P$ and $Q$ are $\Sigma$-invariant distributions on $\mathcal{X}$. If the numbers of samples $m, n$ drawn from $P$ and $Q$ are sufficiently large, we have with high probability:

1. when $d \ge 2$, for any $s > 0$,

$$\left| D_{f_\alpha}^{\Gamma}(P \| Q) - D_{f_\alpha}^{\Gamma_\Sigma}(P_m \| Q_n) \right| \le C_1 \left( \frac{1}{|\Sigma| m} \right)^{\frac{1}{d+s}} + C_2 \left( \frac{1}{|\Sigma| n} \right)^{\frac{1}{d+s}},  (35)$$

where $C_1$ depends only on $d$, $s$ and $\mathcal{X}$; $C_2$ depends only on $d$, $s$, $\mathcal{X}$ and $\alpha$; both $C_1$ and $C_2$ are independent of $m$ and $n$;

2. when $d = 1$,

$$\left| D_{f_\alpha}^{\Gamma}(P \| Q) - D_{f_\alpha}^{\Gamma_\Sigma}(P_m \| Q_n) \right| \le \mathrm{diam}(\mathcal{X}_0) \left( \frac{C_1}{\sqrt{m}} + \frac{C_2}{\sqrt{n}} \right),  (36)$$

where $C_1$ and $C_2$ are independent of $\mathcal{X}_0$, $m$ and $n$; $C_2$ depends on $\alpha$.

When $\Sigma$ is infinite and $\dim(\mathcal{X}_0) = d^*$ for some $d^* < d$, we have the following result with improved convergence rates.

Theorem 4 (Infinite groups).

Let $\mathcal{X} = \Sigma \times \mathcal{X}_0$ be a subset of $\mathbb{R}^D$ equipped with the Euclidean distance and $\dim(\mathcal{X}_0) = d^*$. Let $f_\alpha(x) = \frac{x^\alpha - 1}{\alpha(\alpha - 1)}$, $\alpha > 1$, and $\Gamma = \mathrm{Lip}_L(\mathcal{X})$. Suppose $P$ and $Q$ are $\Sigma$-invariant distributions on $\mathcal{X}$. If the numbers of samples $m, n$ drawn from $P$ and $Q$ are sufficiently large, we have with high probability:

1. when $d^* \ge 2$, for any $s > 0$,

$$\left| D_{f_\alpha}^{\Gamma}(P \| Q) - D_{f_\alpha}^{\Gamma_\Sigma}(P_m \| Q_n) \right| \le C_1 \left( \frac{\mathrm{vol}(\mathcal{X}_0)}{m} \right)^{\frac{1}{d^* + s}} + C_2 \left( \frac{\mathrm{vol}(\mathcal{X}_0)}{n} \right)^{\frac{1}{d^* + s}},  (37)$$

where $C_1$ depends only on $d$, $s$ and $\mathcal{X}$; $C_2$ depends only on $d$, $s$, $\mathcal{X}$ and $\alpha$; both $C_1$ and $C_2$ are independent of $m$ and $n$;

2. when $d^* = 1$,

$$\left| D_{f_\alpha}^{\Gamma}(P \| Q) - D_{f_\alpha}^{\Gamma_\Sigma}(P_m \| Q_n) \right| \le \mathrm{diam}(\mathcal{X}_0) \left( \frac{C_1}{\sqrt{m}} + \frac{C_2}{\sqrt{n}} \right),  (38)$$

where $C_1$ and $C_2$ are independent of $\mathcal{X}_0$, $m$ and $n$; $C_2$ depends on $\alpha$.

Sketch of the proof. The idea is similar to the proof of Theorem 1. The only difference is that we need to treat the $f_\alpha^*(\gamma)$ term separately, since it is not translation-invariant in $\gamma$. Using the equivalent form (8), we can obtain a different Lipschitz constant associated with $f_\alpha^*$, as well as a different $L^\infty$ bound $M$ from that in Eq. (31), by Lemma 7. This results in the $\alpha$-dependence of $C_2$.

Remark 5 (Compact Lie groups).

Similar to Theorem 1 and Footnote 1, the $s$ in the bound can be removed in Theorem 3 and Theorem 4 when $\mathcal{X}_0$ is connected. Furthermore, if $\mathcal{X}$ is a $d$-dimensional connected submanifold and $\Sigma$ is a compact Lie group acting locally smoothly on $\mathcal{X}$, then $d^* = d - \dim(\Sigma)$, where $\dim(\Sigma)$ is the dimension of a principal orbit (i.e., the maximal dimension among all orbits), by Theorem IV.3.8 in [12].

4.3 Maximum mean discrepancy, $\mathrm{MMD}(P, Q)$

Though one could use results on the covering number of the unit ball of a reproducing kernel Hilbert space, e.g. [66, 41], to derive sample complexity bounds that depend on the dimension $d$, we instead provide a dimension-independent bound as in [31], without the use of covering numbers. In the MMD case, we let $B_{\mathcal{H}}$ denote the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ on $\mathcal{X}$; see Eq. (5). In addition, we make the following assumptions on the kernel $k(x, y)$.

Assumption 1.

The kernel $k(x, y)$ for $\mathcal{H}$ satisfies:

• $k(x, y) \ge 0$ and $k(\sigma(x), \sigma(y)) = k(x, y)$ for all $\sigma \in \Sigma$ and $x, y \in \mathcal{X}$;

• letting $K := \max_{x, y \in \mathcal{X}} k(x, y)$, we have $k(x, y) = K$ if and only if $x = y$;

• there exists $c_{\Sigma, k} \in (0, 1)$ such that for any non-identity $\sigma \in \Sigma$ and any $x \in \mathcal{X}_0$, we have $k(\sigma x, x) \le c_{\Sigma, k} K$.

Intuitively, the third condition in Assumption 1 suggests uniform decay of the kernel on the group orbits. See Remark 7 and Example 3 for more details and a related example.
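For intuition, here is a small check (our own sketch, not the authors' code) of this decay condition for the Gaussian kernel $k_s(x, y) = e^{-\|x - y\|_2^2 / (2s^2)}$ used in Example 3 below, under cyclic rotation groups and evaluated at points on the unit circle. (The condition degenerates near the origin of the disk, where every orbit collapses to a single point, so we restrict attention to the boundary here.)

```python
import numpy as np

def c_sigma_k(l, s):
    # worst-case (largest) normalized kernel value k_s(sigma x, x)/K over
    # non-identity rotations sigma in the cyclic group C_l, for x on the
    # unit circle; K = k_s(x, x) = 1 for the Gaussian kernel
    angles = 2 * np.pi * np.arange(1, l) / l
    dists = 2 * np.sin(angles / 2)          # chord length ||sigma x - x||
    return np.exp(-dists.min()**2 / (2 * s**2))

s = 2 * np.pi / (6 * 16)                    # bandwidth tuned for l = 16 sectors
for group_size in (4, 16, 64):
    print(group_size, c_sigma_k(group_size, s))
```

With this bandwidth, $c_{\Sigma, k}$ stays negligible up to group size 16 but becomes order one at group size 64, mirroring the saturation seen in Figure 5.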

From Lemma C.1 in [9], the first assumption implies $S_\Sigma[\Gamma] \subset \Gamma$. Below is an abbreviated statement of the sample complexity result for the MMD, whose detailed version and proof can be found in Section 6.3.

Theorem 5.

Let $\mathcal{X} = \Sigma \times \mathcal{X}_0$ and $|\Sigma| < \infty$. Let $\mathcal{H}$ be an RKHS on $\mathcal{X}$ whose kernel satisfies Assumption 1. Suppose $P$ and $Q$ are $\Sigma$-invariant distributions on $\mathcal{X}$. Then for $m, n$ sufficiently large, we have with high probability

$$\left| \mathrm{MMD}(P, Q) - \mathrm{MMD}^\Sigma(P_m, Q_n) \right| < O\!\left( C_{\Sigma, k} \left( \frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}} \right) \right),  (39)$$

where $C_{\Sigma, k} = \sqrt{\frac{1 + c_{\Sigma, k}(|\Sigma| - 1)}{|\Sigma|}}$, and $c_{\Sigma, k}$ is the constant in Assumption 1.

Sketch of the proof. Based on Result 1, we use the equality $\mathrm{MMD}^\Sigma(P_m, Q_n) = \mathrm{MMD}(S_\Sigma[P_m], S_\Sigma[Q_n])$ to expand the divergence over all the orbit elements. The error is controlled in terms of the Rademacher average, whose supremum is attained at a known witness function due to the structure of the RKHS, by Lemma 8. Since the Rademacher average is estimated without covering numbers, the rate is independent of the dimension $d$. We then use the decay of the kernel to obtain the bound.

Remark 6.

When $|\Sigma| = 1$, the proof reduces to that in [58].

Remark 7.

Unlike the cases of the Wasserstein metric and the Lipschitz-regularized $\alpha$-divergence in Theorem 1 and Theorem 3, the improvement of the sample complexity under group symmetry for the MMD (measured by $C_{\Sigma, k}$ in Theorem 5) depends not only on the group size $|\Sigma|$ but also on the choice of kernel $k(x, y)$. For a fixed $\mathcal{X}$ and kernel $k(x, y)$, simply increasing the group size $|\Sigma|$ does not necessarily lead to a reduced sample complexity beyond a certain threshold; see the first four subfigures in Figure 5. However, we show in Example 3 below that, by adaptively picking a suitable kernel $k$ depending on the group size $|\Sigma|$, one can obtain an improvement in sample complexity of $C_{\Sigma, k} \approx \frac{1}{\sqrt{|\Sigma|}}$ for arbitrarily large $|\Sigma|$.

Figure 5: MMD simulations with Gaussian kernels $k_s(x, y) = e^{-\|x - y\|_2^2 / (2s^2)}$. From left to right, top to bottom: $s = \frac{2\pi}{1 \times 6}$, $s = \frac{2\pi}{4 \times 6}$, $s = \frac{2\pi}{16 \times 6}$, $s = \frac{2\pi}{64 \times 6}$, $s = \frac{2\pi}{6|\Sigma|}$. The first four subfigures (top two rows) show that the Gaussian kernel with a fixed bandwidth $s > 0$ satisfies the third condition in Assumption 1 up to a group size of $|\Sigma| = l$, $l = 1, 4, 16, 64$, and thus an improvement of sample complexity of order $C_{\Sigma, k} \approx |\Sigma|^{-1/2}$ persists until $|\Sigma| = l$; when $|\Sigma| > l$, no further reduction in sample complexity is observed. The last subfigure demonstrates that with an adaptive bandwidth $s$ inversely scaled with $|\Sigma|$, nonstop improvement of the sample complexity can be achieved as the group size $|\Sigma|$ increases. See Example 3 for the details and explanations.

Example 3.

Let $\mathcal{X} = \{ (r\cos\theta, r\sin\theta) \in \mathbb{R}^2 : r \in [0, 1], \theta \in [0, 2\pi) \}$ be the unit disk centered at the origin, and let $k_s(x, y) = e^{-\|x - y\|_2^2 / (2s^2)}$, $x, y \in \mathcal{X}$, be the Gaussian kernel. Consider the group actions generated by a rotation (with respect to the origin) of $\frac{2\pi}{l}$, $l = 1, 4, 16, 64, 256$, so that $|\Sigma| = l$ is the group size. The fundamental domain under the $\Sigma$-action is $\mathcal{X}_0 = [0, 1] \times [0, \frac{2\pi}{l})$ in polar coordinates (see Figure 2 for a visual illustration). We draw samples $x_i \sim P = Q \in \mathcal{P}_\Sigma(\mathcal{X})$ in the following way:

$$x_i = \xi_i \left( \cos\left[ \tfrac{2\pi}{l}\theta_i^{1/3} + 2\pi\eta_i \right], \; \sin\left[ \tfrac{2\pi}{l}\theta_i^{1/3} + 2\pi\eta_i \right] \right),$$

where $\xi_i$ and $\theta_i$ are i.i.d. uniformly distributed random variables on $[0, 1)$ and $\eta_i$ takes values in $\{0, \frac{1}{l}, \dots, \frac{l-1}{l}\}$ with equal probabilities. We select the kernel bandwidth $s > 0$ in different ways:

• Fixed $s$ with changing group size $|\Sigma| = l$. We follow the "three-sigma rule" in the angular direction to pick $s$: since the angle of each sector is $\frac{2\pi}{l}$, we select $s = \frac{2\pi}{6l}$, $l = 1, 4, 16, 64$. A smaller bandwidth $s$ corresponds to faster decay of the kernel $k_s(x, y)$, so that for a fixed bandwidth $s = \frac{2\pi}{6l}$, the third condition in Assumption 1 is satisfied with a small $c_{\Sigma, k}$ for any group $\Sigma$ with $|\Sigma| \le l$, i.e., $C_{\Sigma, k} \approx |\Sigma|^{-1/2}$. On the other hand, it is difficult to observe further improvement by increasing the group size beyond $|\Sigma| > l$, since the third condition in Assumption 1 is then not satisfied with any uniformly small $c_{\Sigma, k}$. See the top two rows in Figure 5 for the results for $s = \frac{2\pi}{6l}$, $l = 1, 4, 16, 64$. Notice that the sample complexity improvement stops right at $|\Sigma| = l$, matching our theoretical result, Theorem 5.

• $s$ inversely scaled with $|\Sigma|$, i.e., $s = \frac{2\pi}{6|\Sigma|}$. Unlike the fixed $s$ discussed previously, with these adaptive selections of kernels we observe nonstop improvement of the sample complexity as the group size $|\Sigma|$ increases; see the last row of Figure 5. This numerical result is explained by the third condition in Assumption 1: to continuously benefit from the increasing group size $|\Sigma|$, we need a faster decay of the kernel $k_s$ (i.e., smaller $s$) so that $c_{\Sigma, k_s}$ is uniformly small for all $|\Sigma|$.

Remark 8.

The bound provided in Theorem 5 for the MMD case is almost sharp in the sense that, by a direct calculation, one can obtain

$$\frac{E_{\mathcal{X}}\,\mathrm{MMD}^\Sigma(P, P_n)^2}{E_{\mathcal{X}}\,\mathrm{MMD}(P, P_n)^2} \approx C_{\Sigma, k}^2,$$

if the Gaussian kernel bandwidth satisfies $s \propto \frac{2 c_{\Sigma, k}\,\pi}{|\Sigma|}$.

5Conclusion and future work

We provide a rigorous analysis quantifying the reduction in sample complexity for variational divergence estimation between group-invariant distributions. We obtain a reduction in the error bound by a power of the group size when the group is finite. The exponent on the group size depends on the intrinsic dimension of the support, characterized by the covering number rate, for the Wasserstein-1 metric and the Lipschitz-regularized $\alpha$-divergence; that reduction, however, is independent of the ambient dimension for the MMD, as in [30, 32].

This work also motivates several possible future directions. For the Wasserstein-1 metric in $\mathbb{R}^2$, one could potentially derive a sharper bound in terms of the group size. For the MMD with Gaussian kernels, it is worth investigating how to choose the bandwidth to make as much use of the group structure as possible. Further applications of these theories to machine learning, such as neural generative models or neural estimation of divergences under symmetry, are also expected.

6Proofs

In this section, we provide detailed statements of the theorems introduced in Section 4 as well as their proofs.

6.1 Wasserstein-1 metric

Assumption 2.

Let $\mathcal{X} = \Sigma \times \mathcal{X}_0 \subset \mathbb{R}^d$. Assume that there exists some $\delta_0 > 0$ such that

1) $\|\sigma(x) - \sigma'(x')\|_2 \ge 2\delta_0$ for all $x, x' \in \mathcal{X}_0$ and $\sigma \ne \sigma' \in \Sigma$; and

2) $\|\sigma(x) - \sigma(x')\|_2 \ge \|x - x'\|_2$ for all $x, x' \in \mathcal{X}_0$ and $\sigma \in \Sigma$,

where $\|\cdot\|_2$ is the Euclidean norm on $\mathbb{R}^d$.

Example 1 provides a simple example when this assumption holds.

Theorem 6.

Let $\mathcal{X} = \Sigma \times \mathcal{X}_0$ be a subset of $\mathbb{R}^D$ satisfying the conditions in Assumption 2. Assume that $\mathcal{N}(\mathcal{X}, \delta) \lesssim \delta^{-d}$ for sufficiently small $\delta$. Suppose $P$ and $Q$ are $\Sigma$-invariant probability measures on $\mathcal{X}$.

1) If $d \ge 2$, then for any $s > 0$, $\epsilon > 0$ and $m, n$ sufficiently large, we have with probability at least $1 - \epsilon$,

$$\left| W(P, Q) - W^\Sigma(P_m, Q_n) \right| \le \left( 8 + \frac{24}{\frac{d+s}{2} - 1} \right) \left[ \left( \frac{9 D_{\mathcal{X}, L}^2}{|\Sigma| m} \right)^{\frac{1}{d+s}} + \left( \frac{9 D_{\mathcal{X}, L}^2}{|\Sigma| n} \right)^{\frac{1}{d+s}} \right] + \bar{D}_{\mathcal{X}_0, L} \left( \frac{24}{\sqrt{m}} + \frac{24}{\sqrt{n}} \right) + L \cdot \mathrm{diam}(\mathcal{X}_0) \sqrt{\frac{2(m+n)}{mn} \ln\frac{1}{\epsilon}},$$

where $D_{\mathcal{X}, L}$ depends only on $\mathcal{X}$ and $L$; $\bar{D}_{\mathcal{X}_0, L}$ depends only on $\mathcal{X}_0$ and $L$, and is increasing in $\mathcal{X}_0$, i.e., $\bar{D}_{A_1, L} \le \bar{D}_{A_2, L}$ for $A_1 \subset A_2$.

2) If $d = 1$, then for any $\epsilon > 0$ and $m, n$ sufficiently large, we have with probability at least $1 - \epsilon$,

$$\left| W(P, Q) - W^\Sigma(P_m, Q_n) \right| \le c\,L \cdot \mathrm{diam}(\mathcal{X}_0) \left( \frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}} \right) + L \cdot \mathrm{diam}(\mathcal{X}_0) \sqrt{\frac{2(m+n)}{mn} \ln\frac{1}{\epsilon}},$$

where $c > 0$ is an absolute constant independent of $\mathcal{X}$ and $\mathcal{X}_0$.

Before proving this theorem, we need the following lemmas.

Lemma 1.

Suppose the $\Sigma$-actions on $\mathcal{X}$ are $1$-Lipschitz, i.e., $\|\sigma x - \sigma y\|_2 \le \|x - y\|_2$ for any $x, y \in \mathcal{X}$ and $\sigma \in \Sigma$. Then we have $S_\Sigma[\Gamma] \subset \Gamma$, where $\Gamma = \mathrm{Lip}_L(\mathcal{X})$.

Proof.

For any $x, y \in \mathcal{X}$ and $f \in \Gamma$, we have

$$\left| S_\Sigma(f)(x) - S_\Sigma(f)(y) \right| = \left| \frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} f(\sigma x) - \frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} f(\sigma y) \right| \le \frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} \left| f(\sigma x) - f(\sigma y) \right| \le \frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} L \|\sigma x - \sigma y\|_2 \le \frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} L \|x - y\|_2 = L \|x - y\|_2 .$$

Therefore, we have $S_\Sigma(f) \in \Gamma$. ∎

Lemma 2.

For any $\gamma \in \mathrm{Lip}_L(\mathcal{X}_0)$, there exists $\nu \in \mathbb{R}$ such that $\|\gamma + \nu\|_\infty \le L \cdot \mathrm{diam}(\mathcal{X}_0)$.

Proof.

Let $\nu = -\frac{1}{2}\left( \sup_{x \in \mathcal{X}_0} \gamma(x) + \inf_{x \in \mathcal{X}_0} \gamma(x) \right)$. Since $\gamma$ is $L$-Lipschitz on $\mathcal{X}_0$, we have $\sup_{x \in \mathcal{X}_0} \gamma(x) - \inf_{x \in \mathcal{X}_0} \gamma(x) \le L \cdot \mathrm{diam}(\mathcal{X}_0)$, so that

$$\|\gamma + \nu\|_\infty = \frac{1}{2}\left( \sup_{x \in \mathcal{X}_0} \gamma(x) - \inf_{x \in \mathcal{X}_0} \gamma(x) \right) \le \frac{L \cdot \mathrm{diam}(\mathcal{X}_0)}{2} \le L \cdot \mathrm{diam}(\mathcal{X}_0) . \;∎$$

We provide a variant of Dudley's entropy integral, together with its proof, for completeness.

Lemma 3.

Suppose $\mathcal{F}$ is a family of functions mapping the metric space $(\mathcal{X}, \rho)$ to $[-M, M]$ for some $M > 0$. Also assume that $0 \in \mathcal{F}$ and $\mathcal{F} = -\mathcal{F}$. Let $\xi = \{\xi_1, \dots, \xi_m\}$ be a set of independent random variables taking values in $\{-1, 1\}$ with equal probabilities, and let $x_1, x_2, \dots, x_m \in \mathcal{X}$. Then we have

$$E_\xi \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^m \xi_i f(x_i) \right| \le \inf_{\alpha > 0} \left\{ 4\alpha + \frac{12}{\sqrt{m}} \int_\alpha^M \sqrt{\ln \mathcal{N}(\mathcal{F}, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta \right\} .$$

The proof of Lemma 3 is standard, using the dyadic chaining argument; see, e.g., the proof of Lemma A.5 in [2].

Proof.

Let $N$ be an arbitrary positive integer and $\delta_k = M \, 2^{-(k-1)}$, $k = 1, \dots, N$. Let $V_k$ be a cover achieving $\mathcal{N}(\mathcal{F}, \delta_k, \|\cdot\|_\infty)$ and denote $|V_k| = \mathcal{N}(\mathcal{F}, \delta_k, \|\cdot\|_\infty)$. For any $f \in \mathcal{F}$, let $\pi_k(f) \in V_k$ be such that $\|f - \pi_k(f)\|_\infty \le \delta_k$. We have

$$E_\xi \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^m \xi_i f(x_i) \right| \le E_\xi \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^m \xi_i \big( f(x_i) - \pi_N(f)(x_i) \big) \right| + \sum_{j=1}^{N-1} E_\xi \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^m \xi_i \big( \pi_{j+1}(f)(x_i) - \pi_j(f)(x_i) \big) \right| + E_\xi \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^m \xi_i \, \pi_1(f)(x_i) \right| .$$

The first term on the right-hand side is bounded by $\delta_N$. Note that we can choose $V_1 = \{0\}$, so that $\pi_1(f)$ is the zero function. For each $j$, let $W_j = \{ \pi_{j+1}(f) - \pi_j(f) : f \in \mathcal{F} \}$. We have $|W_j| \le |V_{j+1}||V_j| \le |V_{j+1}|^2$. Then

$$\sum_{j=1}^{N-1} E_\xi \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^m \xi_i \big( \pi_{j+1}(f)(x_i) - \pi_j(f)(x_i) \big) \right| = \sum_{j=1}^{N-1} E_\xi \sup_{w \in W_j} \left| \frac{1}{m} \sum_{i=1}^m \xi_i w(x_i) \right| .$$

In addition, we have

$$\sup_{w \in W_j} \sqrt{\sum_{i=1}^m w(x_i)^2} = \sup_{f \in \mathcal{F}} \sqrt{\sum_{i=1}^m \big( \pi_{j+1}(f)(x_i) - \pi_j(f)(x_i) \big)^2} \le \sup_{f \in \mathcal{F}} \sqrt{\sum_{i=1}^m \big( \pi_{j+1}(f)(x_i) - f(x_i) \big)^2} + \sup_{f \in \mathcal{F}} \sqrt{\sum_{i=1}^m \big( f(x_i) - \pi_j(f)(x_i) \big)^2} \le \sqrt{m \, \delta_{j+1}^2} + \sqrt{m \, \delta_j^2} = \sqrt{m} \, (\delta_{j+1} + \delta_j) = 3\sqrt{m} \, \delta_{j+1} .$$

By the Massart finite class lemma (see, e.g., [45]), we have

$$E_\xi \sup_{w \in W_j} \left| \frac{1}{m} \sum_{i=1}^m \xi_i w(x_i) \right| \le \frac{3\sqrt{m} \, \delta_{j+1} \sqrt{2 \ln |W_j|}}{m} \le \frac{6 \, \delta_{j+1} \sqrt{\ln |V_{j+1}|}}{\sqrt{m}} .$$

Therefore,

$$E_\xi \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^m \xi_i f(x_i) \right| \le \delta_N + \frac{6}{\sqrt{m}} \sum_{j=1}^{N-1} \delta_{j+1} \sqrt{\ln \mathcal{N}(\mathcal{F}, \delta_{j+1}, \|\cdot\|_\infty)} \le \delta_N + \frac{12}{\sqrt{m}} \sum_{j=1}^{N} (\delta_j - \delta_{j+1}) \sqrt{\ln \mathcal{N}(\mathcal{F}, \delta_j, \|\cdot\|_\infty)} \le \delta_N + \frac{12}{\sqrt{m}} \int_{\delta_{N+1}}^M \sqrt{\ln \mathcal{N}(\mathcal{F}, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta .$$

Finally, select any $\alpha \in (0, M)$ and let $N$ be the largest integer with $\delta_{N+1} > \alpha$ (implying $\delta_{N+2} \le \alpha$ and $\delta_N = 4\delta_{N+2} \le 4\alpha$), so that

$$\delta_N + \frac{12}{\sqrt{m}} \int_{\delta_{N+1}}^M \sqrt{\ln \mathcal{N}(\mathcal{F}, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta \le 4\alpha + \frac{12}{\sqrt{m}} \int_\alpha^M \sqrt{\ln \mathcal{N}(\mathcal{F}, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta . \;∎$$

We can easily extend Lemma 6 in [29] to the following lemma by meshing over the range $[-M, M]$ rather than $[0, 1]$.

Lemma 4.

Let $\mathcal{F}$ be the family of $L$-Lipschitz functions mapping the metric space $(\mathcal{X}, \|\cdot\|_2)$ to $[-M, M]$ for some $M > 0$. Then we have

$$\mathcal{N}(\mathcal{F}, \delta, \|\cdot\|_\infty) \le \left( \frac{c_1 M}{\delta} \right)^{\mathcal{N}\left(\mathcal{X}, \frac{c_2 \delta}{L}\right)},$$

where $c_1 \ge 1$ and $c_2 \le 1$ are absolute constants not depending on $\mathcal{X}$, $M$, and $\delta$.

Remark 9.

If $\mathcal{X}$ is connected, then the bound can be improved to $\mathcal{N}(\mathcal{F}, \delta, \|\cdot\|_\infty) \le e^{\mathcal{N}\left(\mathcal{X}, \frac{c_2 \delta}{L}\right)}$ by the result in [40].

Lemma 5 (Theorem 3 in [56]).

Assume that $\mathcal{X} = \Sigma \times \mathcal{X}_0$. If for some $\delta > 0$ we have 1) $\|\sigma(x) - \sigma'(x')\|_2 \ge 2\delta$ for all $x, x' \in \mathcal{X}_0$ and $\sigma \ne \sigma' \in \Sigma$, and 2) $\|\sigma(x) - \sigma(x')\|_2 \ge \|x - x'\|_2$ for all $x, x' \in \mathcal{X}_0$ and $\sigma \in \Sigma$, then we have

$$\frac{\mathcal{N}(\mathcal{X}_0, \delta)}{\mathcal{N}(\mathcal{X}, \delta)} \le \frac{1}{|\Sigma|} .$$

In addition, we provide the following lemma for the scaling of covering numbers.

Lemma 6.

Let $\mathcal{X}$ be a subset of $\mathbb{R}^d$, or more generally assume $\mathcal{N}(\mathcal{X}, \delta) \lesssim \delta^{-d}$, and let $\bar\delta > 0$. Then there exists a constant $C_{d, \bar\delta}$ depending on $d$ and $\bar\delta$ such that for $\delta \in (0, 1)$ we have

$$\mathcal{N}(\mathcal{X}, \delta) \le \frac{C_{d, \bar\delta} \cdot \mathcal{N}(\mathcal{X}, \bar\delta)}{\delta^d} .$$

Proof.

Let $N := \mathcal{N}(\mathcal{X}, \bar\delta)$. Then $\mathcal{X}$ can be covered by $N$ balls of radius $\bar\delta$. From Proposition 4.2.12 in [62], each ball of radius $\bar\delta$ can be covered by $\frac{(\bar\delta + \delta/2)^d}{(\delta/2)^d}$ balls of radius $\delta$. This implies that $\mathcal{X}$ can be covered by $N \cdot \frac{(\bar\delta + \delta/2)^d}{(\delta/2)^d}$ balls of radius $\delta$, so that $\mathcal{N}^{\mathrm{ext}}(\mathcal{X}, \delta) \le N \cdot \frac{(\bar\delta + \delta/2)^d}{(\delta/2)^d}$, where $\mathcal{N}^{\mathrm{ext}}(\mathcal{X}, \delta)$ is the exterior covering number of $\mathcal{X}$ with radius $\delta$. Therefore, $\mathcal{N}(\mathcal{X}, \delta) \le \mathcal{N}^{\mathrm{ext}}(\mathcal{X}, \delta/2) \le N \cdot \frac{(\bar\delta + \delta/4)^d}{(\delta/4)^d} = N \cdot \left( \frac{4\bar\delta}{\delta} + 1 \right)^d \le \frac{N \, (4\bar\delta + 1)^d}{\delta^d}$, using $\delta < 1$. ∎

Proof of Theorem 6.

$$\begin{aligned} \left| W(P, Q) - W^\Sigma(P_m, Q_n) \right| &= \left| \sup_{\gamma \in \Gamma_\Sigma} \{ E_P[\gamma] - E_Q[\gamma] \} - \sup_{\gamma \in \Gamma_\Sigma} \{ E_{P_m}[\gamma] - E_{Q_n}[\gamma] \} \right| \\ &\le \sup_{\gamma \in \Gamma_\Sigma} \left| E_P[\gamma] - \frac{1}{m} \sum_{i=1}^m \gamma(x_i) - \left( E_Q[\gamma] - \frac{1}{n} \sum_{i=1}^n \gamma(y_i) \right) \right| \\ &= \sup_{\gamma \in \Gamma_\Sigma} \left| E_P[\gamma] - \frac{1}{m} \sum_{i=1}^m \gamma(T_0(x_i)) - \left( E_Q[\gamma] - \frac{1}{n} \sum_{i=1}^n \gamma(T_0(y_i)) \right) \right| \\ &\overset{(a)}{\le} \sup_{\gamma \in \mathrm{Lip}_L(\mathcal{X}_0)} \left| E_{P_{\mathcal{X}_0}}[\gamma] - \frac{1}{m} \sum_{i=1}^m \gamma(T_0(x_i)) - \left( E_{Q_{\mathcal{X}_0}}[\gamma] - \frac{1}{n} \sum_{i=1}^n \gamma(T_0(y_i)) \right) \right| \qquad (40) \\ &:= f(x_1, \dots, x_m, y_1, \dots, y_n), \end{aligned}$$

where inequality $(a)$ is due to the fact that $E_P[\gamma] = E_{P_{\mathcal{X}_0}}[\gamma|_{\mathcal{X}_0}]$ and $E_Q[\gamma] = E_{Q_{\mathcal{X}_0}}[\gamma|_{\mathcal{X}_0}]$, since $P$ and $Q$ are both $\Sigma$-invariant and $\gamma \in \Gamma_\Sigma$, together with the fact that if $\gamma \in \Gamma_\Sigma$, then $\gamma|_{\mathcal{X}_0} \in \mathrm{Lip}_L(\mathcal{X}_0)$, where $\gamma|_{\mathcal{X}_0}$ is the restriction of $\gamma$ to $\mathcal{X}_0$.

Note that the quantity inside the absolute value in (40) does not change if we replace $\gamma$ by $\gamma + \nu$, and we still have $\gamma + \nu \in \mathrm{Lip}_L(\mathcal{X}_0)$ for any $\nu \in \mathbb{R}$. Therefore, by Lemma 2, the supremum in (40) can be taken over $\gamma \in \mathrm{Lip}_L(\mathcal{X}_0)$ with $\|\gamma\|_\infty \le L \cdot \mathrm{diam}(\mathcal{X}_0)$. The denominator in the exponent when applying McDiarmid's inequality is thus equal to

$$m \left( \frac{2L \cdot \mathrm{diam}(\mathcal{X}_0)}{m} \right)^2 + n \left( \frac{2L \cdot \mathrm{diam}(\mathcal{X}_0)}{n} \right)^2 = 4L^2 \, \mathrm{diam}(\mathcal{X}_0)^2 \, \frac{m + n}{mn} . \qquad (41)$$

Denote by $X' = \{x_1', x_2', \dots, x_m'\}$ and $Y' = \{y_1', y_2', \dots, y_n'\}$ i.i.d. samples drawn from $P_{\mathcal{X}_0}$ and $Q_{\mathcal{X}_0}$. Also note that $T_0(x_1), \dots, T_0(x_m)$ and $T_0(y_1), \dots, T_0(y_n)$ can be viewed as i.i.d. samples on $\mathcal{X}_0$ drawn from $P_{\mathcal{X}_0}$ and $Q_{\mathcal{X}_0}$, respectively, so that the expectation

$$E_{X,Y} \, f(x_1, \dots, x_m, y_1, \dots, y_n) = E_{X,Y} \sup_{\gamma \in \mathrm{Lip}_L(\mathcal{X}_0)} \left| E_{P_{\mathcal{X}_0}}[\gamma] - \frac{1}{m} \sum_{i=1}^m \gamma(T_0(x_i)) - \left( E_{Q_{\mathcal{X}_0}}[\gamma] - \frac{1}{n} \sum_{i=1}^n \gamma(T_0(y_i)) \right) \right|$$

can be replaced by the equivalent quantity

$$E_{X,Y} \sup_{\gamma \in \mathrm{Lip}_L(\mathcal{X}_0)} \left| E_{P_{\mathcal{X}_0}}[\gamma] - \frac{1}{m} \sum_{i=1}^m \gamma(x_i) - \left( E_{Q_{\mathcal{X}_0}}[\gamma] - \frac{1}{n} \sum_{i=1}^n \gamma(y_i) \right) \right|,$$

where $X = \{x_1, x_2, \dots, x_m\}$ and $Y = \{y_1, y_2, \dots, y_n\}$ are i.i.d. samples on $\mathcal{X}_0$ drawn from $P_{\mathcal{X}_0}$ and $Q_{\mathcal{X}_0}$, respectively. Then we have

𝐸 𝑋 , π‘Œ ⁒ sup 𝛾 ∈ Lip 𝐿 ⁒ ( 𝒳 0 ) | 𝐸 𝑃 𝒳 0 ⁒ [ 𝛾 ] βˆ’ 1 π‘š ⁒ βˆ‘ 𝑖

1 π‘š 𝛾 ⁒ ( π‘₯ 𝑖 ) βˆ’ ( 𝐸 𝑄 𝒳 0 ⁒ [ 𝛾 ] βˆ’ 1 𝑛 ⁒ βˆ‘ 𝑖

1 𝑛 𝛾 ⁒ ( 𝑦 𝑖 ) ) |

𝐸 𝑋 , π‘Œ ⁒ sup 𝛾 ∈ Lip 𝐿 ⁒ ( 𝒳 0 ) | 𝐸 𝑋 β€² ⁒ ( 1 π‘š ⁒ βˆ‘ 𝑖

1 π‘š 𝛾 ⁒ ( π‘₯ 𝑖 β€² ) ) βˆ’ 1 π‘š ⁒ βˆ‘ 𝑖

1 π‘š 𝛾 ⁒ ( π‘₯ 𝑖 ) βˆ’ 𝐸 π‘Œ β€² ⁒ ( 1 𝑛 ⁒ βˆ‘ 𝑖

1 𝑛 𝛾 ⁒ ( 𝑦 𝑖 β€² ) ) + 1 𝑛 ⁒ βˆ‘ 𝑖

1 𝑛 𝛾 ⁒ ( 𝑦 𝑖 ) |

≀ 𝐸 𝑋 , π‘Œ , 𝑋 β€² , π‘Œ β€² ⁒ sup 𝛾 ∈ Lip 𝐿 ⁒ ( 𝒳 0 ) | 1 π‘š ⁒ βˆ‘ 𝑖

1 π‘š 𝛾 ⁒ ( π‘₯ 𝑖 β€² ) βˆ’ 1 π‘š ⁒ βˆ‘ 𝑖

1 π‘š 𝛾 ⁒ ( π‘₯ 𝑖 ) βˆ’ 1 𝑛 ⁒ βˆ‘ 𝑖

1 𝑛 𝛾 ⁒ ( 𝑦 𝑖 β€² ) + 1 𝑛 ⁒ βˆ‘ 𝑖

1 𝑛 𝛾 ⁒ ( 𝑦 𝑖 ) |

𝐸 𝑋 , π‘Œ , 𝑋 β€² , π‘Œ β€² , πœ‰ , πœ‰ β€² ⁒ sup 𝛾 ∈ Lip 𝐿 ⁒ ( 𝒳 0 ) | 1 π‘š ⁒ βˆ‘ 𝑖

1 π‘š πœ‰ 𝑖 ⁒ ( 𝛾 ⁒ ( π‘₯ 𝑖 β€² ) βˆ’ 𝛾 ⁒ ( π‘₯ 𝑖 ) ) βˆ’ 1 𝑛 ⁒ βˆ‘ 𝑖

1 𝑛 πœ‰ 𝑖 β€² ⁒ ( 𝛾 ⁒ ( 𝑦 𝑖 β€² ) βˆ’ 𝛾 ⁒ ( 𝑦 𝑖 ) ) |

≀ 𝐸 𝑋 , 𝑋 β€² , πœ‰ ⁒ sup 𝛾 ∈ Lip 𝐿 ⁒ ( 𝒳 0 ) | 1 π‘š ⁒ βˆ‘ 𝑖

1 π‘š πœ‰ 𝑖 ⁒ ( 𝛾 ⁒ ( π‘₯ 𝑖 β€² ) βˆ’ 𝛾 ⁒ ( π‘₯ 𝑖 ) ) | + 𝐸 π‘Œ , π‘Œ β€² , πœ‰ β€² ⁒ sup 𝛾 ∈ Lip 𝐿 ⁒ ( 𝒳 0 ) | 1 𝑛 ⁒ βˆ‘ 𝑖

1 𝑛 πœ‰ 𝑖 β€² ⁒ ( 𝛾 ⁒ ( 𝑦 𝑖 β€² ) βˆ’ 𝛾 ⁒ ( 𝑦 𝑖 ) ) |

≀ inf 𝛼

0 8 ⁒ 𝛼 + 24 π‘š ⁒ ∫ 𝛼 𝑀 ln ⁑ 𝒩 ⁒ ( β„± 0 , 𝛿 , βˆ₯ β‹… βˆ₯ ∞ ) ⁒ d 𝛿 + inf 𝛼

0 8 ⁒ 𝛼 + 24 𝑛 ⁒ ∫ 𝛼 𝑀 ln ⁑ 𝒩 ⁒ ( β„± 0 , 𝛿 , βˆ₯ β‹… βˆ₯ ∞ ) ⁒ d 𝛿 ,

where β„± 0

{ 𝛾 ∈ Lip 𝐿 ⁒ ( 𝒳 0 ) : βˆ₯ 𝛾 βˆ₯ ∞ ≀ 𝑀 } and 𝑀

𝐿 β‹… diam ⁒ ( 𝒳 0 ) by Lemma 2.

For $d \ge 2$, from Lemma 4 we have $\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty) \le \mathcal{N}\left(\mathcal{X}_0, \frac{c_2 \delta}{L}\right) \ln\left( \frac{c_1 M}{\delta} \right)$. We fix a $\bar\delta > 0$ such that $\mathcal{N}\left(\mathcal{X}, \frac{c_2 \bar\delta}{L}\right) = 1$, and select $\delta^*$ such that $\frac{c_2 \delta^*}{L} \le 1$ and $\frac{c_2 \delta^*}{L} \le \delta_0$; that is, $\delta^* \le \min\left( \frac{L}{c_2}, \frac{L \delta_0}{c_2} \right)$, so that by Lemmas 5 and 6 we have

$$\mathcal{N}\left(\mathcal{X}_0, \frac{c_2 \delta}{L}\right) \ln\left( \frac{c_1 M}{\delta} \right) \le \frac{\mathcal{N}\left(\mathcal{X}, \frac{c_2 \delta}{L}\right)}{|\Sigma|} \ln\left( \frac{c_1 M}{\delta} \right) \le \frac{C_{d, \bar\delta} \, L^d}{|\Sigma| \, c_2^d \, \delta^d} \ln\left( \frac{c_1 M}{\delta} \right)$$

when $\delta < \delta^*$. Therefore, for sufficiently small $\alpha$, we have

$$\int_\alpha^M \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta = \int_\alpha^{\delta^*} \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta + \int_{\delta^*}^M \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta \le \int_\alpha^{\delta^*} \sqrt{\frac{C_{d, \bar\delta} \, L^d}{|\Sigma| \, c_2^d \, \delta^d} \ln\left( \frac{c_1 M}{\delta} \right)} \, \mathrm{d}\delta + \int_{\delta^*}^M \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta . \qquad (42)$$

For any $s > 0$, we can choose $\delta^*$ sufficiently small so that $\ln\left( \frac{c_1 M}{\delta} \right) \le \frac{1}{\delta^s}$ when $\delta \le \delta^*$. Therefore, if we let $D_{\mathcal{X}, L} = \sqrt{\frac{C_{d, \bar\delta} \, L^d}{c_2^d}}$, we will have

$$\int_\alpha^{\delta^*} \sqrt{\frac{C_{d, \bar\delta} \, L^d}{|\Sigma| \, c_2^d \, \delta^d} \ln\left( \frac{c_1 M}{\delta} \right)} \, \mathrm{d}\delta \le D_{\mathcal{X}, L} \int_\alpha^{\delta^*} \sqrt{\frac{1}{|\Sigma| \, \delta^{d+s}}} \, \mathrm{d}\delta \le D_{\mathcal{X}, L} \int_\alpha^\infty \sqrt{\frac{1}{|\Sigma| \, \delta^{d+s}}} \, \mathrm{d}\delta = \frac{D_{\mathcal{X}, L}}{\sqrt{|\Sigma|}} \cdot \frac{\alpha^{1 - \frac{d+s}{2}}}{\frac{d+s}{2} - 1} .$$

Notice that the second integral in (42) is bounded while the first integral diverges as $\alpha$ tends to zero, so we can optimize the majorizing terms

$$8\alpha + \frac{24}{\sqrt{m}} \cdot \frac{D_{\mathcal{X}, L}}{\sqrt{|\Sigma|}} \cdot \frac{\alpha^{1 - \frac{d+s}{2}}}{\frac{d+s}{2} - 1}$$

with respect to $\alpha$, yielding the minimizer

$$\alpha = \left( \frac{9}{m} \right)^{\frac{1}{d+s}} \cdot \left( \frac{D_{\mathcal{X}, L}^2}{|\Sigma|} \right)^{\frac{1}{d+s}},$$

so that

$$\inf_{\alpha > 0} \left\{ 8\alpha + \frac{24}{\sqrt{m}} \int_\alpha^M \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta \right\} \le 8 \left( \frac{9 D_{\mathcal{X}, L}^2}{|\Sigma| m} \right)^{\frac{1}{d+s}} + \frac{24}{\frac{d+s}{2} - 1} \left( \frac{9 D_{\mathcal{X}, L}^2}{|\Sigma| m} \right)^{\frac{1}{d+s}} + \frac{24}{\sqrt{m}} \int_{\delta^*}^M \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta .$$

Therefore, for sufficiently large $m$ and $n$, we have

$$E_{X,Y} \sup_{\gamma \in \mathrm{Lip}_L(\mathcal{X}_0)} \left| E_{P_{\mathcal{X}_0}}[\gamma] - \frac{1}{m} \sum_{i=1}^m \gamma(x_i) - \left( E_{Q_{\mathcal{X}_0}}[\gamma] - \frac{1}{n} \sum_{i=1}^n \gamma(y_i) \right) \right| \le \left( 8 + \frac{24}{\frac{d+s}{2} - 1} \right) \left[ \left( \frac{9 D_{\mathcal{X}, L}^2}{|\Sigma| m} \right)^{\frac{1}{d+s}} + \left( \frac{9 D_{\mathcal{X}, L}^2}{|\Sigma| n} \right)^{\frac{1}{d+s}} \right] + \left( \frac{24}{\sqrt{m}} + \frac{24}{\sqrt{n}} \right) \int_{\delta^*}^M \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta .$$

For $d = 1$, the first integral in (42) does not have a singularity at $\alpha = 0$. On the other hand, the covering number $\mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)$ is bounded by the covering number obtained after extending the domain to an interval containing $\mathcal{X}_0$. Replacing the interval $[0, 1]$ by an interval of length $\mathrm{diam}(\mathcal{X}_0)$ in Lemma 5.16 in [61], there exists a constant $c > 0$ such that

$$\mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty) \le e^{\frac{c L \cdot \mathrm{diam}(\mathcal{X}_0)}{\delta}} \quad \text{for } \delta < M = L \cdot \mathrm{diam}(\mathcal{X}_0) .$$

Therefore, we have

$$8\alpha + \frac{24}{\sqrt{m}} \int_\alpha^M \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta \le 8\alpha + \frac{24}{\sqrt{m}} \int_\alpha^M \sqrt{\frac{c L \cdot \mathrm{diam}(\mathcal{X}_0)}{\delta}} \, \mathrm{d}\delta,$$

whose minimum is achieved at $\alpha = \frac{9 c L \cdot \mathrm{diam}(\mathcal{X}_0)}{m}$. This implies that

$$\inf_{\alpha > 0} \left\{ 8\alpha + \frac{24}{\sqrt{m}} \int_\alpha^M \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta \right\} \le \frac{72 c L \cdot \mathrm{diam}(\mathcal{X}_0)}{m} + \frac{48 \sqrt{c} \, L \cdot \mathrm{diam}(\mathcal{X}_0)}{\sqrt{m}} - \frac{144 c L \cdot \mathrm{diam}(\mathcal{X}_0)}{m} = \frac{48 \sqrt{c} \, L \cdot \mathrm{diam}(\mathcal{X}_0)}{\sqrt{m}} - \frac{72 c L \cdot \mathrm{diam}(\mathcal{X}_0)}{m} .$$

Hence, we have

$$E_{X,Y} \sup_{\gamma \in \mathrm{Lip}_L(\mathcal{X}_0)} \left| E_{P_{\mathcal{X}_0}}[\gamma] - \frac{1}{m} \sum_{i=1}^m \gamma(x_i) - \left( E_{Q_{\mathcal{X}_0}}[\gamma] - \frac{1}{n} \sum_{i=1}^n \gamma(y_i) \right) \right| \le \frac{48 \sqrt{c} \, L \cdot \mathrm{diam}(\mathcal{X}_0)}{\sqrt{m}} - \frac{72 c L \cdot \mathrm{diam}(\mathcal{X}_0)}{m} + \frac{48 \sqrt{c} \, L \cdot \mathrm{diam}(\mathcal{X}_0)}{\sqrt{n}} - \frac{72 c L \cdot \mathrm{diam}(\mathcal{X}_0)}{n} .$$

Finally, applying McDiarmid's inequality with the bound in (41) proves the theorem. ∎

Remark 10.

Though we do not directly observe the effect of the group invariance in the case $d = 1$ in Theorem 6, the upper bound can be improved in some special cases. Here we analyze Example 1. Replacing the interval $[0, 1]$ by $\mathcal{X}_0 = [0, \frac{1}{|\Sigma|})$ in Lemma 5.16 in [61], there exists a constant $c > 0$ such that

$$\mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty) \le e^{\frac{c L}{|\Sigma| \delta}} \quad \text{for } \delta < M = L \cdot \mathrm{diam}(\mathcal{X}_0) .$$

Therefore, we have

$$8\alpha + \frac{24}{\sqrt{m}} \int_\alpha^M \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta \le 8\alpha + \frac{24}{\sqrt{m}} \int_\alpha^M \sqrt{\frac{c L}{|\Sigma| \delta}} \, \mathrm{d}\delta,$$

whose minimum is achieved at $\alpha = \frac{9 c L}{m |\Sigma|}$. This implies that

$$\inf_{\alpha > 0} \left\{ 8\alpha + \frac{24}{\sqrt{m}} \int_\alpha^M \sqrt{\ln \mathcal{N}(\mathcal{F}_0, \delta, \|\cdot\|_\infty)} \, \mathrm{d}\delta \right\} \le \frac{72 c L}{|\Sigma| m} + \frac{48 \sqrt{c} \, L}{|\Sigma| \sqrt{m}} - \frac{144 c L}{|\Sigma| m} = \frac{48 \sqrt{c} \, L}{|\Sigma| \sqrt{m}} - \frac{72 c L}{|\Sigma| m} .$$

Hence, we have

$$E_{X,Y} \sup_{\gamma \in \mathrm{Lip}_L(\mathcal{X}_0)} \left| E_{P_{\mathcal{X}_0}}[\gamma] - \frac{1}{m} \sum_{i=1}^m \gamma(x_i) - \left( E_{Q_{\mathcal{X}_0}}[\gamma] - \frac{1}{n} \sum_{i=1}^n \gamma(y_i) \right) \right| \le \frac{48 \sqrt{c} \, L}{|\Sigma| \sqrt{m}} - \frac{72 c L}{|\Sigma| m} + \frac{48 \sqrt{c} \, L}{|\Sigma| \sqrt{n}} - \frac{72 c L}{|\Sigma| n} .$$

This matches the numerical result in Figure 3, where the ratio curves hover around 4, since our group sizes are $|\Sigma| = 1, 4, 16, 64, 256$, increasing by a factor of 4.

6.2 $(f_\alpha, \Gamma)$-divergence

We assume that Assumption 2 also holds in this case.

Theorem 7.

Let $\mathcal{X} = \Sigma \times \mathcal{X}_0$ be a subset of $\mathbb{R}^D$ equipped with the Euclidean distance, $f(x) = f_\alpha(x) = \frac{x^\alpha - 1}{\alpha(\alpha - 1)}$, $\alpha > 1$, and $\Gamma = \mathrm{Lip}_L(\mathcal{X})$. Assume that $\mathcal{N}(\mathcal{X}, \delta) \lesssim \delta^{-d}$ for sufficiently small $\delta$. Suppose $P$ and $Q$ are $\Sigma$-invariant distributions on $\mathcal{X}$. We have:

1. if $d \ge 2$, then for any $s > 0$ and $m, n$ sufficiently large, with probability at least $1 - \epsilon$,

$$\left| D_{f_\alpha}^{\Gamma}(P \| Q) - D_{f_\alpha}^{\Gamma_\Sigma}(P_m \| Q_n) \right| \le \left( 8 + \frac{24}{\frac{d+s}{2} - 1} \right) \left[ \left( \frac{9 D_{\mathcal{X}, L}^2}{|\Sigma| m} \right)^{\frac{1}{d+s}} + \left( \frac{9 D_{\mathcal{X}, L}'^{\,2}}{|\Sigma| n} \right)^{\frac{1}{d+s}} \right] + \frac{24 \bar{D}_{\mathcal{X}_0, L}}{\sqrt{m}} + \frac{24 \bar{D}_{\mathcal{X}_0, L}'}{\sqrt{n}} + \sqrt{\frac{2 (M_1^2 m + M_0^2 n)}{mn} \ln\frac{1}{\epsilon}},$$

where $D_{\mathcal{X}, L}$ depends only on $\mathcal{X}$ and $L$, and $D_{\mathcal{X}, L}'$ depends only on $\mathcal{X}$, $L$ and $\alpha$; $\bar{D}_{\mathcal{X}_0, L}$ depends only on $\mathcal{X}_0$ and $L$, and $\bar{D}_{\mathcal{X}_0, L}'$ depends only on $\mathcal{X}_0$, $L$ and $\alpha$, and both are increasing in $\mathcal{X}_0$; $M_0$ and $M_1$ depend only on $\mathcal{X}$, $L$ and $\alpha$;

2. if $d = 1$, then for any $\epsilon > 0$ and $m, n$ sufficiently large, with probability at least $1 - \epsilon$,

$$\left| D_{f_\alpha}^{\Gamma}(P \| Q) - D_{f_\alpha}^{\Gamma_\Sigma}(P_m \| Q_n) \right| \le \frac{48 \sqrt{c} \, L \cdot \mathrm{diam}(\mathcal{X}_0)}{\sqrt{m}} - \frac{72 c L \cdot \mathrm{diam}(\mathcal{X}_0)}{m} + \frac{48 \sqrt{c} \, L' \cdot \mathrm{diam}(\mathcal{X}_0)}{\sqrt{n}} - \frac{72 c L' \cdot \mathrm{diam}(\mathcal{X}_0)}{n} + \sqrt{\frac{2 (M_1^2 m + M_0^2 n)}{mn} \ln\frac{1}{\epsilon}},$$

where $c > 0$ is an absolute constant independent of $\mathcal{X}_0$; $L'$ depends only on $\mathcal{X}$, $L$ and $\alpha$; $M_0$ and $M_1$ depend only on $\mathcal{X}$, $L$ and $\alpha$.

Before proving this theorem, we first provide the following lemma.

Lemma 7.

$D_{f_\alpha}^{\Gamma}(P \| Q) = D_{f_\alpha}^{\mathcal{F}}(P \| Q)$, where

$$\mathcal{F} = \left\{ \gamma \in \mathrm{Lip}_L(\mathcal{X}) : \|\gamma\|_\infty \le (\alpha - 1)^{-1} + L \cdot \mathrm{diam}(\mathcal{X}) \right\},$$

and $P$ and $Q$ are probability distributions on $\mathcal{X}$ that are not necessarily $\Sigma$-invariant.

Proof.

For any fixed $\gamma \in \Gamma$, let $h(\nu) = E_P[\gamma + \nu] - E_Q[f_\alpha^*(\gamma + \nu)]$. We know that $\sup_{x \in \mathcal{X}} \gamma(x) - \inf_{x \in \mathcal{X}} \gamma(x) \le L \cdot \mathrm{diam}(\mathcal{X})$, and interchanging integration and differentiation is allowed by the dominated convergence theorem: $h'(\nu) = 1 - E_Q[f_\alpha^{*\prime}(\gamma + \nu)]$, where

$$f_\alpha^{*\prime}(y) = (\alpha - 1)^{\frac{1}{\alpha - 1}} \, y^{\frac{1}{\alpha - 1}} \, \mathbf{1}_{y \ge 0} .$$

If $\inf_{x \in \mathcal{X}} \gamma(x) > (\alpha - 1)^{-1}$, then $h'(0) < 0$, so there exists some $\nu_0 < 0$ such that $E_P[\gamma + \nu_0] - E_Q[f_\alpha^*(\gamma + \nu_0)] = h(\nu_0) > h(0) = E_P[\gamma] - E_Q[f_\alpha^*(\gamma)]$. This indicates that the supremum in $D_f^{\Gamma}(P \| Q)$ is attained only if $\sup_{x \in \mathcal{X}} \gamma(x) \le (\alpha - 1)^{-1} + L \cdot \mathrm{diam}(\mathcal{X})$. On the other hand, if $\sup_{x \in \mathcal{X}} \gamma(x) < 0$, then there exists $\nu_0 > 0$ with $\sup_{x \in \mathcal{X}} \gamma(x) + \nu_0 < 0$ such that $E_P[\gamma + \nu_0] - E_Q[f_\alpha^*(\gamma + \nu_0)] = E_P[\gamma] + \nu_0 > E_P[\gamma] = E_P[\gamma] - E_Q[f_\alpha^*(\gamma)]$. This indicates that the supremum in $D_f^{\Gamma}(P \| Q)$ is attained only if $\inf_{x \in \mathcal{X}} \gamma(x) \ge -L \cdot \mathrm{diam}(\mathcal{X})$. Therefore, the supremum in $D_f^{\Gamma}(P \| Q)$ is attained only if $\|\gamma\|_\infty \le (\alpha - 1)^{-1} + L \cdot \mathrm{diam}(\mathcal{X})$. ∎

Proof of Theorem 7.

Similar to the beginning of the proof of Theorem 6, we have by Lemma 7 that

$$\begin{aligned}
&\big|D_{f_\alpha}^{\Gamma}(P\|Q)-D_{f_\alpha}^{\Gamma_\Sigma}(P_m,Q_n)\big|\\
&=\Big|\sup_{\gamma\in\Gamma_\Sigma,\,\|\gamma\|_\infty\le M_0}\big\{E_P[\gamma]-E_Q[f_\alpha^*(\gamma)]\big\}-\sup_{\gamma\in\Gamma_\Sigma,\,\|\gamma\|_\infty\le M_0}\big\{E_{P_m}[\gamma]-E_{Q_n}[f_\alpha^*(\gamma)]\big\}\Big|\\
&\le\sup_{\gamma\in\Gamma_\Sigma,\,\|\gamma\|_\infty\le M_0}\Big|E_P[\gamma]-\frac1m\sum_{i=1}^m\gamma(x_i)-\Big(E_Q[f_\alpha^*(\gamma)]-\frac1n\sum_{i=1}^n f_\alpha^*(\gamma(y_i))\Big)\Big|\\
&=\sup_{\gamma\in\Gamma_\Sigma,\,\|\gamma\|_\infty\le M_0}\Big|E_P[\gamma]-\frac1m\sum_{i=1}^m\gamma(T_0(x_i))-\Big(E_Q[f_\alpha^*(\gamma)]-\frac1n\sum_{i=1}^n f_\alpha^*(\gamma(T_0(y_i)))\Big)\Big|\\
&\le\sup_{\gamma\in\mathrm{Lip}_L(\mathcal X_0),\,\|\gamma\|_\infty\le M_0}\Big|E_{P_{\mathcal X_0}}[\gamma]-\frac1m\sum_{i=1}^m\gamma(T_0(x_i))-\Big(E_{Q_{\mathcal X_0}}[f_\alpha^*(\gamma)]-\frac1n\sum_{i=1}^n f_\alpha^*(\gamma(T_0(y_i)))\Big)\Big|\\
&:=g(x_1,\dots,x_m,y_1,\dots,y_n),
\end{aligned}$$

where $T_0$ is the same as defined in (23). The denominator in the exponent when applying McDiarmid's inequality is thus equal to

$$m\Big(\frac{2M_0}{m}\Big)^2+n\Big(\frac{2M_1}{n}\Big)^2=\frac{4M_0^2}{m}+\frac{4M_1^2}{n},$$

where $M_0=(\alpha-1)^{-1}+L\cdot\mathrm{diam}(\mathcal X)$ and $M_1=f_\alpha^*(M_0)$, since for any $\gamma$ with $\|\gamma\|_\infty\le M_0$ we have $\|f_\alpha^*\circ\gamma\|_\infty\le M_1$. Denote by $X'=\{x_1',x_2',\dots,x_m'\}$ and $Y'=\{y_1',y_2',\dots,y_n'\}$ the i.i.d. samples drawn from $P_{\mathcal X_0}$ and $Q_{\mathcal X_0}$. Also note that $T_0(x_1),\dots,T_0(x_m)$ and $T_0(y_1),\dots,T_0(y_n)$ can be viewed as i.i.d. samples on $\mathcal X_0$ drawn from $P_{\mathcal X_0}$ and $Q_{\mathcal X_0}$, respectively, so that the expectation

$$E_{X,Y}\,g(x_1,\dots,x_m,y_1,\dots,y_n)=E_{X,Y}\sup_{\gamma\in\mathrm{Lip}_L(\mathcal X_0),\,\|\gamma\|_\infty\le M_0}\Big|E_{P_{\mathcal X_0}}[\gamma]-\frac1m\sum_{i=1}^m\gamma(T_0(x_i))-\Big(E_{Q_{\mathcal X_0}}[f_\alpha^*(\gamma)]-\frac1n\sum_{i=1}^n f_\alpha^*(\gamma(T_0(y_i)))\Big)\Big|$$

can be replaced by the equivalent quantity

$$E_{X,Y}\sup_{\gamma\in\mathrm{Lip}_L(\mathcal X_0),\,\|\gamma\|_\infty\le M_0}\Big|E_{P_{\mathcal X_0}}[\gamma]-\frac1m\sum_{i=1}^m\gamma(x_i)-\Big(E_{Q_{\mathcal X_0}}[f_\alpha^*(\gamma)]-\frac1n\sum_{i=1}^n f_\alpha^*(\gamma(y_i))\Big)\Big|,$$

where $X=\{x_1,x_2,\dots,x_m\}$ and $Y=\{y_1,y_2,\dots,y_n\}$ are i.i.d. samples on $\mathcal X_0$ drawn from $P_{\mathcal X_0}$ and $Q_{\mathcal X_0}$, respectively. Then we have

$$\begin{aligned}
&E_{X,Y}\sup_{\gamma\in\mathrm{Lip}_L(\mathcal X_0),\,\|\gamma\|_\infty\le M_0}\Big|E_{P_{\mathcal X_0}}[\gamma]-\frac1m\sum_{i=1}^m\gamma(x_i)-\Big(E_{Q_{\mathcal X_0}}[f_\alpha^*(\gamma)]-\frac1n\sum_{i=1}^n f_\alpha^*(\gamma(y_i))\Big)\Big|\\
&=E_{X,Y}\sup_{\gamma}\Big|E_{X'}\Big(\frac1m\sum_{i=1}^m\gamma(x_i')\Big)-\frac1m\sum_{i=1}^m\gamma(x_i)-E_{Y'}\Big(\frac1n\sum_{i=1}^n f_\alpha^*(\gamma(y_i'))\Big)+\frac1n\sum_{i=1}^n f_\alpha^*(\gamma(y_i))\Big|\\
&\le E_{X,Y,X',Y'}\sup_{\gamma}\Big|\frac1m\sum_{i=1}^m\gamma(x_i')-\frac1m\sum_{i=1}^m\gamma(x_i)-\frac1n\sum_{i=1}^n f_\alpha^*(\gamma(y_i'))+\frac1n\sum_{i=1}^n f_\alpha^*(\gamma(y_i))\Big|\\
&=E_{X,Y,X',Y',\xi,\xi'}\sup_{\gamma}\Big|\frac1m\sum_{i=1}^m\xi_i\big(\gamma(x_i')-\gamma(x_i)\big)-\frac1n\sum_{i=1}^n\xi_i'\big(f_\alpha^*(\gamma(y_i'))-f_\alpha^*(\gamma(y_i))\big)\Big|\\
&\le E_{X,X',\xi}\sup_{\gamma}\Big|\frac1m\sum_{i=1}^m\xi_i\big(\gamma(x_i')-\gamma(x_i)\big)\Big|+E_{Y,Y',\xi'}\sup_{\gamma}\Big|\frac1n\sum_{i=1}^n\xi_i'\big(f_\alpha^*(\gamma(y_i'))-f_\alpha^*(\gamma(y_i))\big)\Big|\\
&\le\inf_{a>0}\Big\{8a+\frac{24}{\sqrt m}\int_a^{M_0}\sqrt{\ln\mathcal N(\mathcal F_0,\delta,\|\cdot\|_\infty)}\,\mathrm d\delta\Big\}+\inf_{a>0}\Big\{8a+\frac{24}{\sqrt n}\int_a^{M_1}\sqrt{\ln\mathcal N(\mathcal F_1,\delta,\|\cdot\|_\infty)}\,\mathrm d\delta\Big\},
\end{aligned}$$

where $\xi=\{\xi_i\}_{i=1}^m$ and $\xi'=\{\xi_i'\}_{i=1}^n$ are independent Rademacher random variables, $\mathcal F_0:=\{\gamma\in\mathrm{Lip}_L(\mathcal X_0):\|\gamma\|_\infty\le M_0\}$, and $\mathcal F_1:=\{\gamma\in\mathrm{Lip}_{L'}(\mathcal X_0):\|\gamma\|_\infty\le M_1\}$, since for any $\gamma\in\mathcal F_0$ we have $\|f_\alpha^*\circ\gamma\|_\infty\le M_1$ and $\big|\frac{\mathrm d}{\mathrm dy}f_\alpha^*(y)\big|\le(\alpha-1)^{\frac1{\alpha-1}}M_0^{\frac1{\alpha-1}}$ for $|y|\le M_0$, so that $f_\alpha^*\circ\gamma$ is $L'$-Lipschitz, where $M_1=f_\alpha^*(M_0)$ and $L'=L(\alpha-1)^{\frac1{\alpha-1}}M_0^{\frac1{\alpha-1}}$. The rest of the proof follows from the proof of Theorem 6. ∎
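The chain-rule Lipschitz constant $L'$ used in the proof can be spot-checked numerically. The snippet below (an illustration with placeholder values of $\alpha$, $L$, $M_0$, not from the paper) verifies on a grid that $|\frac{\mathrm d}{\mathrm dy}f_\alpha^*(y)|$ is maximized at $y=M_0$ on $[0,M_0]$, which is what gives $L'=L((\alpha-1)M_0)^{1/(\alpha-1)}$:

```python
# Illustrative check (assumed formula): on [0, M0] the conjugate derivative
# y -> ((alpha - 1) * y)^(1/(alpha - 1)) is increasing, so its supremum is
# attained at M0, yielding L' = L * ((alpha - 1) * M0)^(1/(alpha - 1)) as the
# Lipschitz constant of f_alpha^* composed with an L-Lipschitz gamma.

def conj_deriv(y: float, alpha: float) -> float:
    return ((alpha - 1.0) * max(y, 0.0)) ** (1.0 / (alpha - 1.0))

alpha, L, M0 = 2.0, 3.0, 1.7  # placeholder constants for the illustration
grid = [M0 * k / 1000.0 for k in range(1001)]
sup_deriv = max(conj_deriv(y, alpha) for y in grid)
L_prime = L * ((alpha - 1.0) * M0) ** (1.0 / (alpha - 1.0))
assert abs(L * sup_deriv - L_prime) < 1e-9
```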

6.3 MMD

We assume the kernel $k(x,y)$ satisfies Assumption 1. Furthermore, let $\phi(x)$ be the evaluation functional at $x$ in $\mathcal H$:

$$\langle\phi(x),\phi(y)\rangle_{\mathcal H}=k(x,y),\qquad\forall\,x,y\in\mathcal X.$$

Theorem 8.

Let $\mathcal X=\Sigma\times\mathcal X_0$ and let $\mathcal H$ be an RKHS on $\mathcal X$ whose kernel satisfies Assumption 1. Suppose $P$ and $Q$ are $\Sigma$-invariant distributions on $\mathcal X$. Then for $m,n$ sufficiently large and any $\epsilon>0$, we have with probability at least $1-\epsilon$,

$$|\mathrm{MMD}(P,Q)-\mathrm{MMD}_\Sigma(P_m,Q_n)|<2K^{\frac12}\big[1+c(|\Sigma|-1)\big]^{\frac12}\Big(\frac{1}{\sqrt{|\Sigma|m}}+\frac{1}{\sqrt{|\Sigma|n}}\Big)+\sqrt{\frac{2K\big(1+c(|\Sigma|-1)\big)\ln(1/\epsilon)}{|\Sigma|}\Big(\frac1m+\frac1n\Big)},$$

where $K$ and $c$ are the constants in Assumption 1.
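To see how this bound behaves, the sketch below evaluates its right-hand side for illustrative placeholder constants (the values of $K$, $c$, $m$, $n$, $\epsilon$ are assumptions of the illustration, not from the paper): with $c=0$ the bound gains the full $\sqrt{|\Sigma|}$ factor over $|\Sigma|=1$, while with $c=1$ the $|\Sigma|$-dependence cancels, reflecting the kernel dependence noted in the abstract.

```python
import math

# Evaluate the right-hand side of Theorem 8 for illustrative constants.
# The qualitative point: the group size |Sigma| enters only through
# (1 + c(|Sigma| - 1)) / |Sigma|, so the improvement depends on the kernel.

def mmd_bound(K, c, size, m, n, eps):
    amp = 1.0 + c * (size - 1.0)
    term1 = 2.0 * math.sqrt(K * amp) * (1.0 / math.sqrt(size * m)
                                        + 1.0 / math.sqrt(size * n))
    term2 = math.sqrt(2.0 * K * amp * math.log(1.0 / eps) / size
                      * (1.0 / m + 1.0 / n))
    return term1 + term2

K, m, n, eps = 1.0, 400, 400, 0.05
# c = 0: genuine improvement as the group grows
assert mmd_bound(K, 0.0, 4, m, n, eps) < mmd_bound(K, 0.0, 1, m, n, eps)
# c = 1: the |Sigma|-dependence cancels and there is no improvement
assert abs(mmd_bound(K, 1.0, 4, m, n, eps)
           - mmd_bound(K, 1.0, 1, m, n, eps)) < 1e-12
```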

Before proving the theorem, we provide the following lemma.

Lemma 8.

Suppose the kernel of an RKHS satisfies Assumption 1, and let $\xi=\{\xi_1,\dots,\xi_m\}$ be a set of independent random variables, each taking values in $\{-1,1\}$ with equal probability. Then we have

$$E_\xi\sup_{\|\gamma\|_{\mathcal H}\le1}\Big|\frac{1}{m|\Sigma|}\sum_{i=1}^m\xi_i\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jx_i)\Big|\le\frac{\big[\big(1+c(|\Sigma|-1)\big)K\big]^{\frac12}}{\sqrt{|\Sigma|m}}.$$

Proof.

Since the witness function attaining the supremum is explicit, we can write

$$\begin{aligned}
E_\xi\sup_{\|\gamma\|_{\mathcal H}\le1}\Big|\frac{1}{m|\Sigma|}\sum_{i=1}^m\xi_i\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jx_i)\Big|
&=E_\xi\Big\|\frac{1}{m|\Sigma|}\sum_{i=1}^m\xi_i\sum_{j=1}^{|\Sigma|}\phi(\sigma_jx_i)\Big\|_{\mathcal H}\\
&=\frac{1}{m|\Sigma|}E_\xi\Big[\sum_{i,i'=1}^m\xi_i\xi_{i'}\sum_{j,j'=1}^{|\Sigma|}k(\sigma_jx_i,\sigma_{j'}x_{i'})\Big]^{\frac12}\\
&\le\frac{1}{m|\Sigma|}\Big[E_\xi\sum_{i,i'=1}^m\xi_i\xi_{i'}\sum_{j,j'=1}^{|\Sigma|}k(\sigma_jx_i,\sigma_{j'}x_{i'})\Big]^{\frac12}\\
&=\frac{1}{m|\Sigma|}\Big[\sum_{i=1}^m E_\xi[\xi_i^2]\sum_{j,j'=1}^{|\Sigma|}k(\sigma_jx_i,\sigma_{j'}x_i)\Big]^{\frac12}\\
&\le\frac{1}{m|\Sigma|}\Big[m\cdot\big(|\Sigma|K+c(|\Sigma|^2-|\Sigma|)K\big)\Big]^{\frac12}\\
&=\frac{K^{\frac12}\big[1+c(|\Sigma|-1)\big]^{\frac12}}{\sqrt{|\Sigma|m}},
\end{aligned}$$

where the first inequality is Jensen's inequality. ∎
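The key step $E_\xi\|\cdot\|_{\mathcal H}\le\frac{1}{m|\Sigma|}\big(\sum_i G_{ii}\big)^{1/2}$ with $G_{ii}=\sum_{j,j'}k(\sigma_jx_i,\sigma_{j'}x_i)$ can be checked concretely. Below is a hedged sketch with the sign-flip group $\Sigma=\{\pm1\}$ acting on $\mathbb R$ and a Gaussian kernel, for which $K=1$ and $c=1$ are admissible constants (this concrete setup is an assumption of the illustration, not the paper's setting):

```python
import math
import random

# Sketch verifying Lemma 8's bound for the two-element sign-flip group acting
# on R with the Gaussian kernel k(x, y) = exp(-(x - y)^2). Since 0 < k <= 1,
# K = 1 and c = 1 are admissible constants here. By Jensen's inequality,
# E_xi ||.|| <= (E_xi ||.||^2)^{1/2}, and the squared expectation is exact:
# cross terms E[xi_i xi_{i'}] vanish for i != i', leaving the diagonal blocks.

def k(x, y):
    return math.exp(-(x - y) ** 2)

random.seed(0)
m, group = 50, (1.0, -1.0)
xs = [random.gauss(0.0, 1.0) for _ in range(m)]
S = len(group)

# diag = sum_i G_ii with G_ii = sum_{j, j'} k(sigma_j x_i, sigma_{j'} x_i)
diag = sum(k(s * x, t * x) for x in xs for s in group for t in group)
exact_sq_root = math.sqrt(diag) / (m * S)  # = (E_xi ||.||^2)^{1/2}

K, c = 1.0, 1.0
bound = math.sqrt((1.0 + c * (S - 1)) * K) / math.sqrt(S * m)
assert exact_sq_root <= bound  # Lemma 8's bound holds (deterministically here)
```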

Proof of Theorem 8.

The proof below is a generalization of the proof of Theorem 7 in [31], which does not need the notion of covering numbers due to the structure of the RKHS. We have

$$\begin{aligned}
&|\mathrm{MMD}(P,Q)-\mathrm{MMD}_\Sigma(P_m,Q_n)|\\
&=|\mathrm{MMD}(P,Q)-\mathrm{MMD}(S_\Sigma[P_m],S_\Sigma[Q_n])|\\
&=\Big|\sup_{\|\gamma\|_{\mathcal H}\le1}\big\{E_P[\gamma]-E_Q[\gamma]\big\}-\sup_{\|\gamma\|_{\mathcal H}\le1}\big\{E_{S_\Sigma[P_m]}[\gamma]-E_{S_\Sigma[Q_n]}[\gamma]\big\}\Big|\\
&=\Big|\sup_{\|\gamma\|_{\mathcal H}\le1}\big\{E_P[\gamma]-E_Q[\gamma]\big\}-\sup_{\|\gamma\|_{\mathcal H}\le1}\Big\{\frac{1}{m|\Sigma|}\sum_{i=1}^m\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jx_i)-\frac{1}{n|\Sigma|}\sum_{i=1}^n\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jy_i)\Big\}\Big|\\
&\le\sup_{\|\gamma\|_{\mathcal H}\le1}\Big|E_P[\gamma]-E_Q[\gamma]-\frac{1}{m|\Sigma|}\sum_{i=1}^m\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jx_i)+\frac{1}{n|\Sigma|}\sum_{i=1}^n\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jy_i)\Big|\\
&:=f(x_1,x_2,\dots,x_m,y_1,y_2,\dots,y_n).
\end{aligned}$$

Now we estimate how much $f$ can change if one of the $x_i$'s is replaced:

$$\begin{aligned}
&|f(x_1,\dots,x_i,\dots,y_1,\dots,y_n)-f(x_1,\dots,\tilde x_i,\dots,y_1,\dots,y_n)|\\
&\le\sup_{\|\gamma\|_{\mathcal H}\le1}\Big|\frac{1}{m|\Sigma|}\sum_{j=1}^{|\Sigma|}\big(\gamma(\sigma_jx_i)-\gamma(\sigma_j\tilde x_i)\big)\Big|\\
&=\frac{1}{m|\Sigma|}\Big\|\sum_{j=1}^{|\Sigma|}\big(\phi(\sigma_jx_i)-\phi(\sigma_j\tilde x_i)\big)\Big\|_{\mathcal H}\qquad(43)\\
&\le\frac{1}{m|\Sigma|}\Big(\Big\|\sum_{j=1}^{|\Sigma|}\phi(\sigma_jx_i)\Big\|_{\mathcal H}+\Big\|\sum_{j=1}^{|\Sigma|}\phi(\sigma_j\tilde x_i)\Big\|_{\mathcal H}\Big).
\end{aligned}$$

To bound $\big\|\sum_{j=1}^{|\Sigma|}\phi(\sigma_jx_i)\big\|_{\mathcal H}$, we have

$$\begin{aligned}
\Big\|\sum_{j=1}^{|\Sigma|}\phi(\sigma_jx_i)\Big\|_{\mathcal H}
&=\Big[\sum_{j=1}^{|\Sigma|}k(\sigma_jx_i,\sigma_jx_i)+\sum_{j\ne l}k(\sigma_jx_i,\sigma_lx_i)\Big]^{\frac12}\\
&=\Big[\sum_{j=1}^{|\Sigma|}k(\sigma_jx_i,\sigma_jx_i)+|\Sigma|\sum_{\sigma_j\ne\mathrm{id}}k(\sigma_jx_i,x_i)\Big]^{\frac12}\\
&\le\big[|\Sigma|\cdot K+(|\Sigma|^2-|\Sigma|)\cdot cK\big]^{\frac12}.
\end{aligned}$$

The same bound on the change of $f$ holds when one of the $y_i$'s is replaced. In applying McDiarmid's inequality, the denominator in the exponent is thus

$$m\cdot\frac{4\big[|\Sigma|K+(|\Sigma|^2-|\Sigma|)cK\big]}{m^2|\Sigma|^2}+n\cdot\frac{4\big[|\Sigma|K+(|\Sigma|^2-|\Sigma|)cK\big]}{n^2|\Sigma|^2}=4K\Big(\frac1m+\frac1n\Big)\cdot\frac{1+c(|\Sigma|-1)}{|\Sigma|}.$$
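The algebra collapsing the bounded-difference sum can be verified mechanically; the snippet below checks the identity for a few illustrative placeholder values of $K$, $c$, $|\Sigma|$, $m$, $n$:

```python
# Check, for arbitrary illustrative values, that the McDiarmid denominator
#   m * 4[|S|K + (|S|^2 - |S|)cK] / (m^2 |S|^2) + (same term with n)
# simplifies to 4K(1/m + 1/n)(1 + c(|S| - 1)) / |S|.

def raw(K, c, S, m, n):
    per = 4.0 * (S * K + (S * S - S) * c * K) / (S * S)
    return per / m + per / n

def simplified(K, c, S, m, n):
    return 4.0 * K * (1.0 / m + 1.0 / n) * (1.0 + c * (S - 1.0)) / S

for (K, c, S, m, n) in [(1.0, 0.5, 4, 100, 200), (2.5, 0.0, 8, 50, 50),
                        (0.7, 1.0, 2, 10, 30)]:
    assert abs(raw(K, c, S, m, n) - simplified(K, c, S, m, n)) < 1e-12
```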

Moreover, we can extend inequality (16) in [31] to take the group invariance into account. Denote by $X'=\{x_1',x_2',\dots,x_m'\}$ and $Y'=\{y_1',y_2',\dots,y_n'\}$ the i.i.d. samples drawn from $P$ and $Q$, and by $\xi=\{\xi_1,\dots,\xi_m\}$ and $\xi'=\{\xi_1',\dots,\xi_n'\}$ sets of independent random variables, each taking values in $\{-1,1\}$ with equal probability. Then we have

$$\begin{aligned}
&E_{X,Y}\,f(x_1,x_2,\dots,x_m,y_1,y_2,\dots,y_n)\\
&=E_{X,Y}\sup_{\|\gamma\|_{\mathcal H}\le1}\Big|E_P[\gamma]-E_Q[\gamma]-\frac{1}{m|\Sigma|}\sum_{i=1}^m\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jx_i)+\frac{1}{n|\Sigma|}\sum_{i=1}^n\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jy_i)\Big|\\
&=E_{X,Y}\sup_{\|\gamma\|_{\mathcal H}\le1}\Big|E_{X'}\Big(\frac{1}{m|\Sigma|}\sum_{i=1}^m\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jx_i')\Big)-E_{Y'}\Big(\frac{1}{n|\Sigma|}\sum_{i=1}^n\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jy_i')\Big)\\
&\qquad\qquad-\frac{1}{m|\Sigma|}\sum_{i=1}^m\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jx_i)+\frac{1}{n|\Sigma|}\sum_{i=1}^n\sum_{j=1}^{|\Sigma|}\gamma(\sigma_jy_i)\Big|\\
&\le E_{X,Y,X',Y'}\sup_{\|\gamma\|_{\mathcal H}\le1}\Big|\frac{1}{m|\Sigma|}\sum_{i=1}^m\sum_{j=1}^{|\Sigma|}\big(\gamma(\sigma_jx_i')-\gamma(\sigma_jx_i)\big)-\frac{1}{n|\Sigma|}\sum_{i=1}^n\sum_{j=1}^{|\Sigma|}\big(\gamma(\sigma_jy_i')-\gamma(\sigma_jy_i)\big)\Big|\\
&=E_{X,Y,X',Y',\xi,\xi'}\sup_{\|\gamma\|_{\mathcal H}\le1}\Big|\frac{1}{m|\Sigma|}\sum_{i=1}^m\xi_i\sum_{j=1}^{|\Sigma|}\big(\gamma(\sigma_jx_i')-\gamma(\sigma_jx_i)\big)-\frac{1}{n|\Sigma|}\sum_{i=1}^n\xi_i'\sum_{j=1}^{|\Sigma|}\big(\gamma(\sigma_jy_i')-\gamma(\sigma_jy_i)\big)\Big|\\
&\le E_{X,X',\xi}\sup_{\|\gamma\|_{\mathcal H}\le1}\Big|\frac{1}{m|\Sigma|}\sum_{i=1}^m\xi_i\sum_{j=1}^{|\Sigma|}\big(\gamma(\sigma_jx_i')-\gamma(\sigma_jx_i)\big)\Big|+E_{Y,Y',\xi'}\sup_{\|\gamma\|_{\mathcal H}\le1}\Big|\frac{1}{n|\Sigma|}\sum_{i=1}^n\xi_i'\sum_{j=1}^{|\Sigma|}\big(\gamma(\sigma_jy_i')-\gamma(\sigma_jy_i)\big)\Big|\\
&\le2\Big[\frac{K^{\frac12}\big[1+c(|\Sigma|-1)\big]^{\frac12}}{\sqrt{|\Sigma|m}}+\frac{K^{\frac12}\big[1+c(|\Sigma|-1)\big]^{\frac12}}{\sqrt{|\Sigma|n}}\Big],
\end{aligned}$$

where the last inequality is due to Lemma 8. Therefore, by McDiarmid's inequality, we have

$$\mathbb P\bigg(|\mathrm{MMD}(P,Q)-\mathrm{MMD}_\Sigma(P_m,Q_n)|-2K^{\frac12}\big[1+c(|\Sigma|-1)\big]^{\frac12}\Big(\frac{1}{\sqrt{|\Sigma|m}}+\frac{1}{\sqrt{|\Sigma|n}}\Big)\ge\epsilon\bigg)\le\exp\bigg(-\frac{\epsilon^2mn|\Sigma|}{2K(m+n)\big(1+c(|\Sigma|-1)\big)}\bigg).$$

By a change of variable, we have with probability at least $1-\epsilon$,

$$|\mathrm{MMD}(P,Q)-\mathrm{MMD}_\Sigma(P_m,Q_n)|<2K^{\frac12}\big[1+c(|\Sigma|-1)\big]^{\frac12}\Big(\frac{1}{\sqrt{|\Sigma|m}}+\frac{1}{\sqrt{|\Sigma|n}}\Big)+\sqrt{\frac{2K\big(1+c(|\Sigma|-1)\big)\ln(1/\epsilon)}{|\Sigma|}\Big(\frac1m+\frac1n\Big)}.$$

∎

Acknowledgements

The research of M.K. and L.R.-B. was partially supported by the Air Force Office of Scientific Research (AFOSR) under grant FA9550-21-1-0354, and by the National Science Foundation (NSF) under grants DMS-2008970 and TRIPODS CISE-1934846. The research of Z.C. and W.Z. was partially supported by NSF under grants DMS-2052525 and DMS-2140982. We thank Yulong Lu for the insightful discussions.

References

[1] M. Arjovsky, S. Chintala, and L. Bottou, Wasserstein generative adversarial networks, in International Conference on Machine Learning, PMLR, 2017, pp. 214–223.
[2] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, Spectrally-normalized margin bounds for neural networks, Advances in Neural Information Processing Systems, 30 (2017).
[3] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, Mutual information neural estimation, in Proceedings of the 35th International Conference on Machine Learning, PMLR 80, 2018, pp. 531–540.
[4] P. J. Bickel, A distribution free version of the Smirnov two sample test in the p-variate case, The Annals of Mathematical Statistics, 40 (1969), pp. 1–23.
[5] M. Biloš and S. Günnemann, Scalable normalizing flows for permutation invariant densities, in International Conference on Machine Learning, PMLR, 2021, pp. 957–967.
[6] J. Birrell, P. Dupuis, M. A. Katsoulakis, Y. Pantazis, and L. Rey-Bellet, (f,Γ)-divergences: interpolating between f-divergences and integral probability metrics, Journal of Machine Learning Research, 23 (2022), pp. 1–70.
[7] J. Birrell, P. Dupuis, M. A. Katsoulakis, L. Rey-Bellet, and J. Wang, Variational representations and neural network estimation of Rényi divergences, SIAM Journal on Mathematics of Data Science, 3 (2021), pp. 1093–1116.
[8] J. Birrell, M. A. Katsoulakis, and Y. Pantazis, Optimizing variational representations of divergences and accelerating their statistical estimation, IEEE Transactions on Information Theory, (2022).
[9] J. Birrell, M. A. Katsoulakis, L. Rey-Bellet, and W. Zhu, Structure-preserving GANs, in Proceedings of the 39th International Conference on Machine Learning, PMLR 162, 2022, pp. 1982–2020.
[10] J. Birrell, Y. Pantazis, P. Dupuis, M. A. Katsoulakis, and L. Rey-Bellet, Function-space regularized Rényi divergences, arXiv preprint arXiv:2210.04974, 2022.
[11] D. Boyda, G. Kanwar, S. Racanière, D. J. Rezende, M. S. Albergo, K. Cranmer, D. C. Hackett, and P. E. Shanahan, Sampling using SU(N) gauge equivariant flows, Physical Review D, 103 (2021), p. 074504.
[12] G. E. Bredon, Introduction to Compact Transformation Groups, Academic Press, 1972.
[13] M. Broniatowski and A. Keziou, Parametric estimation and tests through divergences and the duality technique, Journal of Multivariate Analysis, 100 (2009), pp. 16–36.
[14] O. Catoni, PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, Lecture Notes–Monograph Series, 2008.
[15] X. Cheng and Y. Xie, Kernel two-sample tests for manifold data, arXiv preprint arXiv:2105.03425, 2021.
[16] K. Chowdhary and P. Dupuis, Distinguishing and integrating aleatoric and epistemic variation in uncertainty quantification, ESAIM: Mathematical Modelling and Numerical Analysis, 47 (2013), pp. 635–662.
[17] F. Ciompi, Y. Jiao, and J. van der Laak, Lymphocyte assessment hackathon (LYSTO), Oct. 2019.
[18] T. Cohen and M. Welling, Group equivariant convolutional networks, in Proceedings of the 33rd International Conference on Machine Learning, PMLR 48, 2016, pp. 2990–2999.
[19] T. S. Cohen, M. Geiger, and M. Weiler, A general theory of equivariant CNNs on homogeneous spaces, in Advances in Neural Information Processing Systems, vol. 32, 2019.
[20] N. Dey, A. Chen, and S. Ghafurian, Group equivariant generative adversarial networks, in International Conference on Learning Representations, 2021.
[21] P. Dupuis and R. S. Ellis, A Weak Convergence Approach to the Theory of Large Deviations, John Wiley & Sons, 2011.
[22] P. Dupuis, M. A. Katsoulakis, Y. Pantazis, and P. Plechac, Path-space information bounds for uncertainty quantification and sensitivity analysis of stochastic dynamics, SIAM/ASA Journal on Uncertainty Quantification, 4 (2016), pp. 80–111.
[23] G. B. Folland, Real Analysis: Modern Techniques and Their Applications, John Wiley & Sons, 1999.
[24] N. Fournier and A. Guillin, On the rate of convergence in Wasserstein distance of the empirical measure, Probability Theory and Related Fields, 162 (2015), pp. 707–738.
[25] S. Gao, G. Ver Steeg, and A. Galstyan, Efficient estimation of mutual information for strongly dependent variables, in Artificial Intelligence and Statistics, PMLR, 2015, pp. 277–286.
[26] V. Garcia Satorras, E. Hoogeboom, F. Fuchs, I. Posner, and M. Welling, E(n) equivariant normalizing flows, Advances in Neural Information Processing Systems, 34 (2021).
[27] A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré, Sample complexity of Sinkhorn divergences, in The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, 2019, pp. 1574–1583.
[28] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, Advances in Neural Information Processing Systems, 27 (2014).
[29] L.-A. Gottlieb, A. Kontorovich, and R. Krauthgamer, Efficient regression in metric spaces via approximate Lipschitz extension, IEEE Transactions on Information Theory, 63 (2017), pp. 4838–4849.
[30] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, A kernel method for the two-sample-problem, Advances in Neural Information Processing Systems, 19 (2006).
[31] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, A kernel two-sample test, The Journal of Machine Learning Research, 13 (2012), pp. 723–773.
[32] A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola, A kernel statistical test of independence, Advances in Neural Information Processing Systems, 20 (2007).
[33] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, Improved training of Wasserstein GANs, in Advances in Neural Information Processing Systems, vol. 30, 2017.
[34] A. Hyvarinen, J. Karhunen, and E. Oja, Independent component analysis, Studies in Informatics and Control, 11 (2002), pp. 205–207.
[35] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, et al., Nonparametric von Mises estimators for entropies, divergences and mutual informations, Advances in Neural Information Processing Systems, 28 (2015).
[36] B. Kégl, Intrinsic dimension estimation using packing numbers, Advances in Neural Information Processing Systems, 15 (2002).
[37] J. B. Kinney and G. S. Atwal, Equitability, mutual information, and the maximal information coefficient, Proceedings of the National Academy of Sciences, 111 (2014), pp. 3354–3359.
[38] C. Kipnis and C. Landim, Scaling Limits of Interacting Particle Systems, Springer-Verlag, 1999.
[39] J. Köhler, L. Klein, and F. Noé, Equivariant flows: sampling configurations for multi-body systems with symmetric energies, arXiv preprint arXiv:1910.00753, 2019.
[40] A. N. Kolmogorov, ε-entropy and ε-capacity of sets in function spaces, American Mathematical Society Translations, 17 (1961), pp. 277–364.
[41] T. Kühn, Covering numbers of Gaussian reproducing kernel Hilbert spaces, Journal of Complexity, 27 (2011), pp. 489–499.
[42] S. Kullback and R. A. Leibler, On information and sufficiency, The Annals of Mathematical Statistics, 22 (1951), pp. 79–86.
[43] J. Liu, A. Kumar, J. Ba, J. Kiros, and K. Swersky, Graph normalizing flows, Advances in Neural Information Processing Systems, 32 (2019).
[44] D. A. McAllester, PAC-Bayesian model averaging, in Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT '99), ACM, 1999, pp. 164–170.
[45] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, MIT Press, 2018.
[46] A. Müller, Integral probability metrics and their generating classes of functions, Advances in Applied Probability, 29 (1997), pp. 429–443.
[47] X. Nguyen, M. J. Wainwright, and M. I. Jordan, Nonparametric estimation of the likelihood ratio and divergence functionals, in 2007 IEEE International Symposium on Information Theory, IEEE, 2007, pp. 2016–2020.
[48] X. Nguyen, M. J. Wainwright, and M. I. Jordan, Estimating divergence functionals and the likelihood ratio by convex risk minimization, IEEE Transactions on Information Theory, 56 (2010), pp. 5847–5861.
[49] S. Nietert, Z. Goldfeld, and K. Kato, Smooth p-Wasserstein distance: structure, empirical approximation, and statistical applications, in International Conference on Machine Learning, PMLR, 2021, pp. 8172–8183.
[50] S. Nowozin, B. Cseke, and R. Tomioka, f-GAN: training generative neural samplers using variational divergence minimization, Advances in Neural Information Processing Systems, 29 (2016).
[51] L. Paninski, Estimation of entropy and mutual information, Neural Computation, 15 (2003), pp. 1191–1253.
[52] B. Póczos, L. Xiong, and J. Schneider, Nonparametric divergence estimation with applications to machine learning on distributions, in Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 2011, pp. 599–608.
[53] D. J. Rezende, S. Racanière, I. Higgins, and P. Toth, Equivariant Hamiltonian flows, arXiv preprint arXiv:1909.13739, 2019.
[54] A. Ruderman, M. D. Reid, D. García-García, and J. Petterson, Tighter variational representations of f-divergences via restriction to probability measures, in Proceedings of the 29th International Conference on Machine Learning (ICML '12), Omnipress, 2012, pp. 1155–1162.
[55] J. Shawe-Taylor and R. C. Williamson, A PAC analysis of a Bayesian estimator, in Proceedings of the Tenth Annual Conference on Computational Learning Theory (COLT '97), ACM, 1997, pp. 2–9.
[56] J. Sokolic, R. Giryes, G. Sapiro, and M. Rodrigues, Generalization error of invariant classifiers, in Artificial Intelligence and Statistics, PMLR, 2017, pp. 1094–1103.
[57] S. Sreekumar and Z. Goldfeld, Neural estimation of statistical divergences, Journal of Machine Learning Research, 23 (2022), pp. 1–75.
[58] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet, On the empirical estimation of integral probability metrics, Electronic Journal of Statistics, 6 (2012), pp. 1550–1599.
[59] B. Tahmasebi and S. Jegelka, Sample complexity bounds for estimating probability divergences under invariances, in Forty-first International Conference on Machine Learning, 2024.
[60] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, Wasserstein auto-encoders, in International Conference on Learning Representations, 2018.
[61] R. van Handel, Probability in High Dimension, technical report, Princeton University, 2014.
[62] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, Cambridge University Press, 2018.
[63] M. Weiler and G. Cesa, General E(2)-equivariant steerable CNNs, in Advances in Neural Information Processing Systems, vol. 32, 2019.
[64] Q. Zhang, S. Filippi, A. Gretton, and D. Sejdinovic, Large-scale kernel methods for independence testing, Statistics and Computing, 28 (2018), pp. 113–130.
[65] S. Zhao, Z. Liu, J. Lin, J.-Y. Zhu, and S. Han, Differentiable augmentation for data-efficient GAN training, Advances in Neural Information Processing Systems, 33 (2020), pp. 7559–7570.
[66] D.-X. Zhou, The covering number in learning theory, Journal of Complexity, 18 (2002), pp. 739–767.
