Title: New aspect for full-rank weights
URL Source: https://arxiv.org/html/2302.05825
License: CC BY 4.0 arXiv:2302.05825v3 [cs.LG] 16 Mar 2024

Koopman-based generalization bound: New aspect for full-rank weights

Yuka Hashimoto 1,2, Sho Sonoda 2, Isao Ishikawa 3,2, Atsushi Nitanda 4, Taiji Suzuki 5,2

1. NTT Corporation 2. Center for Advanced Intelligence Project, RIKEN 3. Center for Data Science, Ehime University 4. A*STAR Centre for Frontier AI and Research 5. Graduate School of Information Science and Technology, The University of Tokyo

Abstract
We propose a new generalization bound for neural networks using Koopman operators. Whereas most existing works focus on low-rank weight matrices, we focus on full-rank weight matrices. Our bound is tighter than existing norm-based bounds when the condition numbers of the weight matrices are small. In particular, it is completely independent of the width of the network if the weight matrices are orthogonal. Our bound does not contradict existing bounds but complements them. As supported by several existing empirical results, low-rankness is not the only reason for generalization. Furthermore, our bound can be combined with existing bounds to obtain a tighter bound. Our result sheds new light on understanding the generalization of neural networks with full-rank weight matrices, and it provides a connection between operator-theoretic analysis and the generalization of neural networks.
1 Introduction
Understanding the generalization property has been one of the biggest topics for analyzing neural networks. A major approach for theoretical investigation of this topic is bounding some complexity of networks (Bartlett & Mendelson, 2002; Mohri et al., 2018). Intuitively, a large number of parameters makes the complexity and generalization error large. This intuition has been studied, for example, based on a classical VC-dimension theory (Harvey et al., 2017; Anthony & Bartlett, 2009). However, for neural networks, small generalization error can be achieved even in over-parameterized regimes (Novak et al., 2018; Neyshabur et al., 2019). To explain this behavior, norm-based bounds have been investigated (Neyshabur et al., 2015; Bartlett et al., 2017; Golowich et al., 2018; Neyshabur et al., 2018; Wei & Ma, 2019, 2020; Li et al., 2021; Ju et al., 2022; Weinan E et al., 2022). These bounds do not depend on the number of parameters explicitly. However, they are typically described by the ( 𝑝 , 𝑞 ) norms of the weight matrices, and if the norms are large, these bounds grow exponentially with respect to the depth of the network. Another approach to tackle over-parameterized networks is a compression-based approach (Arora et al., 2018; Suzuki et al., 2020). These bounds explain the generalization of networks by investigating how much the networks can be compressed. The bounds get smaller as the ranks of the weight matrices become smaller. However, low-rankness is not the only reason for generalization. Goldblum et al. (2020) empirically showed that even with high-rank weight matrices, networks generalize well. This implies that if the ranks of the weight matrices are large, the existing compression-based bounds are not always tight.
In this paper, we derive a completely different type of uniform complexity bound using Koopman operators, which sheds light on why networks generalize well even when their weights are high- or full-rank matrices. More precisely, let $L$ be the depth, $d_j$ be the width of the $j$th layer, $g$ be the final nonlinear transformation, and $n$ be the number of samples. For $j = 1,\ldots,L$, let $W_j \in \mathbb{R}^{d_j \times d_{j-1}}$ be an injective weight matrix, and let $s_j > d_j/2$ describe the smoothness of the function space $H_j$ on which the Koopman operator is defined. Our results are summarized as follows, where $\|\cdot\|$ is the operator norm, and $E_j$ and $G_j$ are factors determined by the activation functions and the range of $W_j$, respectively:

$$\text{Rademacher complexity} \le O\bigg(\frac{\|g\|_{H_L}}{\sqrt{n}} \prod_{j=1}^{L} \frac{G_j E_j \|W_j\|^{s_{j-1}}}{\det(W_j^* W_j)^{1/4}}\bigg). \tag{1}$$
Surprisingly, the determinant factor tells us that if the singular values of $W_j$ are large, the bound gets small. It is tight when the condition number of $W_j$, i.e., the ratio of its largest and smallest singular values, is small. In particular, when $W_j$ is orthogonal, $G_j = 1$ and the factor $\|W_j\|^{s_{j-1}}/\det(W_j^* W_j)^{1/4}$ reduces to $1$. We can interpret this as $W_j$ transforming signals in certain directions, which makes it easy for the network to extract features of data. Networks with orthogonal weight matrices have been proposed (Maduranga et al., 2019; Wang et al., 2020; Li et al., 2021). Our bound also justifies the generalization property of these networks.
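As a quick numeric sanity check of this behavior, the per-layer weight factor can be evaluated directly. The following sketch (assuming NumPy is available; the width `d` and smoothness exponent `s` are illustrative choices, not values from the paper) compares the factor for an orthogonal matrix and for an ill-conditioned one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 8, 5.0  # illustrative width and smoothness exponent (s > d/2)

def weight_factor(W, s):
    """The per-layer factor ||W||^s / det(W* W)^(1/4) from the bound."""
    op_norm = np.linalg.norm(W, ord=2)
    return op_norm**s / np.linalg.det(W.T @ W)**0.25

# Orthogonal weight: ||W|| = 1 and det(W* W) = 1, so the factor is exactly 1.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
print(weight_factor(Q, s))  # -> 1.0 up to rounding

# Ill-conditioned weight (condition number 10^3): the factor blows up.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
W_bad = U @ np.diag([1.0] * (d - 1) + [1e-3]) @ V.T
print(weight_factor(W_bad, s))  # large, even though ||W_bad|| = 1
```

Both matrices have operator norm $1$; only the determinant factor distinguishes them, which is exactly the mechanism discussed above.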
In addition to providing this new perspective, our bound can be combined with existing bounds; in other words, it complements them. Goldblum et al. (2020) pointed out that the rank of the weight matrix tends to be large near the input layer but small near the output layer. By adopting our bound for lower layers and existing bounds for higher layers, we obtain a tight bound that takes the role of each layer into account. The determinant factors come from Koopman operators. A Koopman operator is a linear operator defined by the composition of functions. It has been used to analyze dynamical systems and time-series data generated from dynamical systems (Koopman, 1931; Budišić et al., 2012; Kawahara, 2016; Ishikawa et al., 2018; Klus et al., 2020; Hashimoto et al., 2020; Giannakis & Das, 2020; Brunton et al., 2022), and its theoretical aspects have also been studied (Das et al., 2021; Ikeda et al., 2022a, b; Ishikawa, 2023). Connections between Koopman operators and neural networks have also been discussed. For example, efficient learning algorithms have been proposed by describing the learning dynamics of neural network parameters with Koopman operators (Redman et al., 2022; Dogra & Redman, 2020). Lusch et al. (2017) applied neural networks to identifying eigenfunctions of Koopman operators to extract features of dynamical systems. Konishi & Kawahara (2023) applied Koopman operators to the analysis of equilibrium models. In this paper, by contrast, we represent the composition structure of neural networks using Koopman operators and apply them to analyzing the complexity of neural networks from an operator-theoretic perspective.
Our main contributions are as follows:
• We show a new complexity bound, which involves both the norm and the determinant of the weight matrices. By virtue of the determinant term, our bound gets small as the condition numbers of the weight matrices get small. In particular, it justifies the generalization property of existing networks with orthogonal weight matrices. It also provides a new perspective on why networks with high-rank weights generalize well (Section 2, the paragraph after Proposition 6, and Remark 7).

• We can combine our bound with existing bounds to obtain a bound which describes the role of each layer (Subsection 4.4).

• We provide an operator-theoretic approach to analyzing networks. We use Koopman operators to derive the determinant term in our bound (Subsection 3.4 and Subsection 4.1).
We emphasize that our operator-theoretic approach reveals a new aspect of neural networks.
2 Related works

Norm-based bounds
Generalization bounds based on the norm of $W_j$ have been investigated in previous studies. These bounds are typically described by the $(p,q)$ matrix norm $\|W_j\|_{p,q}$ of $W_j$ (see the upper part of Table 1). Thus, although these bounds do not explicitly depend on the width of the layers, they tend to grow as the width grows. For example, the $(p,q)$ matrix norm of the $d$ by $d$ identity matrix is $d^{1/p}$. This situation is totally different from our case, where the factor related to the weight matrix reduces to $1$ if the matrix is orthogonal. Our bound is described by the spectral properties of the weight matrices and is tight when the condition number of $W_j$ is small. Bartlett et al. (2017); Wei & Ma (2020); Ju et al. (2022) showed bounds with reference matrices $A_j$ and $B_j$. These bounds explain the generalization property through the discrepancy between the weight matrices and fixed reference matrices. A main difference of these bounds from ours is that they only measure the discrepancy from fixed reference matrices, whereas our bound measures the discrepancy of the spectral properties from a certain class of matrices, which is much broader than a set of fixed matrices. Li et al. (2021) showed a bound described by $\big|\prod_{j=1}^{L} \|W_j\| - 1\big|$, the discrepancy between the product of the largest singular values of $W_1,\ldots,W_L$ and $1$. It is motivated by the covering number $c_{\mathcal{X}}$ of the input space $\mathcal{X}$ and a diameter $\gamma_{\mathbf{x}}$ of the covering set, which depends on the input samples $\mathbf{x} = (x_1,\ldots,x_n)$. This bound describes the spectral properties of $W_j$ to some extent. However, it only involves the largest singular value of $W_j$ and does not take the other singular values into account. In contrast, our bound is described by how much all the singular values of $W_j$ differ from the largest one.
Compression-based bounds
Arora et al. (2018) and Suzuki et al. (2020) focused on compression of the network and showed bounds that get small as the rank of the weight matrices decreases, with the bias $\hat{r}$ induced by compression (see the middle part of Table 1). Arora et al. (2018) introduced the layer cushion $\mu_j$ and the interlayer cushion $\mu_{j\to}$ for layer $j$, which tend to be small if the rank of $W_j$ is small. They also observed that noise is filtered by weight matrices whose stable ranks are small. Suzuki et al. (2020) showed the dependency on the rank more directly. These bounds explain why networks with low-rank matrices generalize well. However, networks with high-rank matrices can also generalize well (Goldblum et al., 2020), and the bounds cannot describe this phenomenon. Arora et al. (2018, Figure 1) also empirically showed that noise rapidly decreases on higher layers. Since noise stability is related to the rank of the weight matrices, this result supports the tightness of their bound on higher layers. However, it also implies that noise does not really decrease on lower layers, and we need an additional theory that describes what happens there.
Table 1: Comparison of our bound to existing bounds.

| Authors | Rate | Type |
| --- | --- | --- |
| Neyshabur et al. (2015) | $2^L \prod_{j=1}^{L} \|W_j\|_{2,2} / \sqrt{n}$ | Norm-based |
| Neyshabur et al. (2018) | $\frac{L \sqrt{\max_j d_j} \prod_{j=1}^{L} \|W_j\|}{\sqrt{n}} \big( \sum_{j=1}^{L} \frac{\|W_j\|_{2,2}^2}{\|W_j\|^2} \big)^{1/2}$ | Norm-based |
| Golowich et al. (2018) | $\big( \prod_{j=1}^{L} \|W_j\|_{2,2} \big) \min\big\{ \frac{1}{n^{1/4}}, \sqrt{\frac{L}{n}} \big\}$ | Norm-based |
| Bartlett et al. (2017) | $\frac{\prod_{j=1}^{L} \|W_j\|}{\sqrt{n}} \big( \sum_{j=1}^{L} \frac{\|W_j^T - A_j^T\|_{2,1}^{2/3}}{\|W_j\|^{2/3}} \big)^{3/2}$ | Norm-based |
| Wei & Ma (2020) | $\frac{1}{\sqrt{n}} \big( \sum_{j=1}^{L} \kappa_j^{2/3} \min\{ L^{1/2} \|W_j - A_j\|_{2,2},\ \|W_j - B_j\|_{1,1} \}^{2/3} \big)^{3/2}$ | Norm-based |
| Ju et al. (2022) | $\sum_{j=1}^{L} \theta_j \|W_j - A_j\|_{2,2} / \sqrt{n}$ | Norm-based |
| Li et al. (2021) | $\big( \|\mathbf{x}\| \big| \prod_{j=1}^{L} \|W_j\| - 1 \big| + \gamma_{\mathbf{x}} + c_{\mathcal{X}} \big) / \sqrt{n}$ | Norm-based |
| Arora et al. (2018) | $\hat{r} + \frac{L \max_i \|f(x_i)\|}{\hat{r} \sqrt{n}} \big( \sum_{j=1}^{L} \frac{1}{\mu_j^2 \mu_{j\to}^2} \big)^{1/2}$ | Compression |
| Suzuki et al. (2020) | $\frac{\hat{r}}{\sqrt{n}} + \sqrt{\frac{L}{n}} \big( \sum_{j=1}^{L} \tilde{r}_j (\tilde{d}_{j-1} + \tilde{d}_j) \big)^{1/2}$ | Compression |
| Ours | $\frac{\|g\|_{H_L}}{\sqrt{n}} \prod_{j=1}^{L} \frac{G_j E_j \|W_j\|^{s_{j-1}}}{\det(W_j^* W_j)^{1/4}}$ | Operator-theoretic |

$\kappa_j$ and $\theta_j$ are determined by the Jacobian and Hessian of the network $f$ with respect to the $j$th layer and $W_j$, respectively. $\tilde{r}_j$ and $\tilde{d}_j$ are the rank and dimension of the $j$th weight matrix of the compressed network.
3 Preliminaries

3.1 Notation
For a linear operator $W$ on a Hilbert space, its range and kernel are denoted by $\mathcal{R}(W)$ and $\ker(W)$, respectively. Its operator norm is denoted by $\|W\|$. For a function $p \in L^\infty(\mathbb{R}^d)$, its $L^\infty$-norm is denoted by $\|p\|_\infty$. For a function $h$ on $\mathbb{R}^d$ and a subset $\mathcal{S}$ of $\mathbb{R}^d$, the restriction of $h$ to $\mathcal{S}$ is denoted by $h|_{\mathcal{S}}$. For $f \in L^2(\mathbb{R}^d)$, we denote the Fourier transform of $f$ by $\hat{f}(\omega) = \int_{\mathbb{R}^d} f(x)\, \mathrm{e}^{-\mathrm{i} x \cdot \omega}\, \mathrm{d}x$. We denote the adjoint of a matrix $W$ by $W^*$.
3.2 Koopman operator
Let $\mathcal{F}_1$ and $\mathcal{F}_2$ be function spaces on $\mathbb{R}^{d_1}$ and $\mathbb{R}^{d_2}$, respectively. For a function $f: \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$, we define the Koopman operator from $\mathcal{F}_2$ to $\mathcal{F}_1$ by composition, as follows. Let $\mathcal{D}_f = \{g \in \mathcal{F}_2 \mid g \circ f \in \mathcal{F}_1\}$. The Koopman operator $K_f$ from $\mathcal{F}_2$ to $\mathcal{F}_1$ with respect to $f$ is defined as $K_f g = g \circ f$ for $g \in \mathcal{D}_f$.
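As a minimal illustration of this definition (a Python sketch with NumPy; the maps $f$, $g_1$, $g_2$ are arbitrary choices), the Koopman operator is simply precomposition, and it acts linearly on observables even when $f$ itself is nonlinear:

```python
import numpy as np

def koopman(f):
    """Return K_f, which maps an observable g to g ∘ f."""
    return lambda g: (lambda x: g(f(x)))

f = lambda x: np.tanh(x)              # a nonlinear map on R^2
g1 = lambda x: float(np.sum(x**2))    # observables on R^2
g2 = lambda x: float(np.sum(x))

K_f = koopman(f)
x = np.array([0.5, -1.0])
# Definition: (K_f g)(x) = g(f(x)).
assert K_f(g1)(x) == g1(f(x))
# Linearity in the observable: K_f(a*g1 + b*g2) = a*K_f(g1) + b*K_f(g2).
a, b = 2.0, -3.0
lhs = K_f(lambda y: a * g1(y) + b * g2(y))(x)
rhs = a * K_f(g1)(x) + b * K_f(g2)(x)
print(abs(lhs - rhs))
```

This linearity in $g$ is what allows the operator-norm arguments used later in the paper.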
3.3 Reproducing kernel Hilbert space (RKHS)
As function spaces, we consider RKHSs. Let $p$ be a non-negative function with $p \in L^1(\mathbb{R}^d)$. Let $H_p(\mathbb{R}^d) = \{f \in L^2(\mathbb{R}^d) \mid \hat{f}/\sqrt{p} \in L^2(\mathbb{R}^d)\}$ be the Hilbert space equipped with the inner product $\langle f, g \rangle_{H_p(\mathbb{R}^d)} = \int_{\mathbb{R}^d} \hat{f}(\omega) \overline{\hat{g}(\omega)} / p(\omega)\, \mathrm{d}\omega$. We can see that $k_p(x,y) := \int_{\mathbb{R}^d} \mathrm{e}^{\mathrm{i}(x-y)\cdot\omega} p(\omega)\, \mathrm{d}\omega$ is the reproducing kernel of $H_p(\mathbb{R}^d)$, i.e., $\langle k_p(\cdot,x), f \rangle_{H_p(\mathbb{R}^d)} = f(x)$ for $f \in H_p(\mathbb{R}^d)$. Note that since $H_p(\mathbb{R}^d) \subseteq L^2(\mathbb{R}^d)$, functions in $H_p(\mathbb{R}^d)$ vanish at infinity.
One advantage of focusing on RKHSs is that they have well-defined evaluation operators induced by the reproducing property. This property makes deriving an upper bound of the Rademacher complexity easier (see, for example, Mohri et al. (2018, Theorem 6.12)).
Example 1
Set $p(\omega) = 1/(1 + \|\omega\|^2)^s$ for $s > d/2$. Then $H_p(\mathbb{R}^d)$ is the Sobolev space $W^{s,2}(\mathbb{R}^d)$ of order $s$. Here, $\|\omega\|$ is the Euclidean norm of $\omega \in \mathbb{R}^d$. Note that if $s \in \mathbb{N}$, then $\|f\|_{H_p(\mathbb{R}^d)}^2 = \sum_{|\alpha| \le s} c_{\alpha,s,d} \|\partial^\alpha f\|_{L^2(\mathbb{R}^d)}^2$, where $c_{\alpha,s,d} = (2\pi)^d s! / (\alpha! (s - |\alpha|)!)$ and $\alpha$ is a multi-index. See Appendix K for more details.
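The norm identity in Example 1 can be checked numerically in the simplest case $d = 1$, $s = 1$, where $c_{0,1,1} = c_{1,1,1} = 2\pi$. The sketch below (assuming NumPy; the Gaussian test function and its closed-form Fourier transform are our illustrative choices) compares $\int |\hat{f}(\omega)|^2 (1 + \omega^2)\, \mathrm{d}\omega$ with $2\pi(\|f\|_{L^2}^2 + \|f'\|_{L^2}^2)$:

```python
import numpy as np

def trapezoid(y, x):
    """Simple trapezoidal quadrature."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

# f(x) = exp(-x^2); with the convention f̂(ω) = ∫ f(x) e^{-ixω} dx,
# the Fourier transform is f̂(ω) = √π · exp(-ω²/4).
x = np.linspace(-20.0, 20.0, 200001)
f = np.exp(-x**2)
fprime = -2 * x * np.exp(-x**2)
fhat = np.sqrt(np.pi) * np.exp(-x**2 / 4)  # frequency grid reuses x

lhs = trapezoid(np.abs(fhat)**2 * (1 + x**2), x)                  # ||f||²_{H_p}
rhs = 2 * np.pi * (trapezoid(f**2, x) + trapezoid(fprime**2, x))  # Sobolev side
print(lhs, rhs)  # agree to quadrature accuracy
```

Both sides evaluate to $2\pi\sqrt{2\pi} \approx 15.75$, consistent with the stated coefficients $c_{\alpha,s,d}$.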
3.4 Problem setting
In this paper, we consider an $L$-layer deep neural network. Let $d_0$ be the dimension of the input space. For $j = 1,\ldots,L$, we set the width as $d_j$, let $W_j: \mathbb{R}^{d_{j-1}} \to \mathbb{R}^{d_j}$ be a linear operator, let $b_j: \mathbb{R}^{d_j} \to \mathbb{R}^{d_j}$ be a shift operator defined as $x \mapsto x + a_j$ with a bias $a_j \in \mathbb{R}^{d_j}$, and let $\sigma_j: \mathbb{R}^{d_j} \to \mathbb{R}^{d_j}$ be a nonlinear activation function. In addition, let $g: \mathbb{R}^{d_L} \to \mathbb{C}$ be a nonlinear transformation in the final layer. We consider a network $f$ defined as

$$f = g \circ b_L \circ W_L \circ \sigma_{L-1} \circ b_{L-1} \circ W_{L-1} \circ \cdots \circ \sigma_1 \circ b_1 \circ W_1. \tag{2}$$
Typically, we regard $W_1$ as being composed with $b_1$ and $\sigma_1$ to construct the first layer; we sequentially construct the second and third layers, and so on, and finally obtain the whole network $f$. From the operator-theoretic perspective, by contrast, we analyze $f$ in the opposite direction. For $j = 0,\ldots,L$, let $p_j$ be a density function of a finite positive measure on $\mathbb{R}^{d_j}$. The network $f$ is described by the Koopman operators $K_{W_j}: H_{p_j}(\mathbb{R}^{d_j}) \to H_{p_{j-1}}(\mathbb{R}^{d_{j-1}})$ and $K_{b_j}, K_{\sigma_j}: H_{p_j}(\mathbb{R}^{d_j}) \to H_{p_j}(\mathbb{R}^{d_j})$ as

$$f = K_{W_1} K_{b_1} K_{\sigma_1} \cdots K_{W_{L-1}} K_{b_{L-1}} K_{\sigma_{L-1}} K_{W_L} K_{b_L}\, g.$$

The final nonlinear transformation $g$ is given first. Then $g$ is composed with $b_L$ and $W_L$, i.e., the corresponding Koopman operators act on $g$. We sequentially apply the Koopman operators corresponding to the $(L-1)$th layer, the $(L-2)$th layer, and so on, and finally obtain the whole network $f$. By representing the network as a product of Koopman operators, we can bound the Rademacher complexity by the product of the norms of the Koopman operators.
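The two constructions can be mirrored in code. A small sketch (Python with NumPy; the widths, the tanh activation, and the Gaussian $g$ are arbitrary illustrative choices): building the network layer by layer from the input side gives the same function as starting from $g$ and applying the Koopman operators $K_{b_j}$, $K_{W_j}$, $K_{\sigma_j}$ in reverse order.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 3, 4
Ws = [rng.standard_normal((d, d)) for _ in range(L)]
bs = [rng.standard_normal(d) for _ in range(L)]
sigma = np.tanh
g = lambda x: float(np.exp(-np.sum(x**2)))

compose = lambda h, f: (lambda x: h(f(x)))  # K_f h = h ∘ f

def forward(x):
    """Usual construction: f = g ∘ b_L ∘ W_L ∘ σ_{L-1} ∘ ... ∘ σ_1 ∘ b_1 ∘ W_1."""
    for j in range(L):
        x = Ws[j] @ x + bs[j]
        if j < L - 1:
            x = sigma(x)
    return g(x)

# Operator-theoretic construction: start from g and apply Koopman operators
# in reverse layer order.
h = g
for j in reversed(range(L)):
    h = compose(h, lambda x, a=bs[j]: x + a)   # K_{b_{j+1}}
    h = compose(h, lambda x, W=Ws[j]: W @ x)   # K_{W_{j+1}}
    if j > 0:
        h = compose(h, sigma)                  # K_{σ_j}

x0 = rng.standard_normal(d)
print(forward(x0), h(x0))  # identical
```

The two evaluations agree at every input, which is exactly the factorization used above.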
To simplify the notation, we denote 𝐻 𝑝 𝑗 ( ℝ 𝑑 𝑗 ) by 𝐻 𝑗 . We impose the following assumptions.
Assumption 1
The function $g$ is contained in $H_L$, and $K_{\sigma_j}$ is bounded for $j = 1,\ldots,L$.
Assumption 2
There exists $B > 0$ such that for any $x \in \mathbb{R}^{d_0}$, $|k_{p_0}(x,x)| \le B^2$.
We denote by $F$ the set of all functions of the form (2) satisfying Assumption 1. As a typical example, if we set $p_j(\omega) = 1/(1 + \|\omega\|^2)^{s_j}$ for $s_j > d_j/2$ and $g(x) = \mathrm{e}^{-\|x\|^2}$, then Assumption 1 holds whenever $K_{\sigma_j}$ is bounded, and $k_{p_0}$ satisfies Assumption 2.
Remark 1
Let $g$ be a smooth function which does not decay at infinity (e.g., the sigmoid). Although $H_p(\mathbb{R}^d)$ does not contain $g$, we can construct a function $\tilde{g} \in H_p(\mathbb{R}^d)$ such that $\tilde{g}(x) = g(x)$ for $x$ in a sufficiently large compact region, and replace $g$ by $\tilde{g}$ in practical cases. See Remark 14 for details.
For the activation function, we have the following proposition (cf. Sawano (2018, Theorem 4.46)).
Proposition 2
Let $p(\omega) = 1/(1 + \|\omega\|^2)^s$ for $\omega \in \mathbb{R}^d$, with $s \in \mathbb{N}$ and $s > d/2$. If the activation function $\sigma$ has the following properties, then $K_\sigma: H_p(\mathbb{R}^d) \to H_p(\mathbb{R}^d)$ is bounded.
• $\sigma$ is $s$-times differentiable and its derivative $\partial^\alpha \sigma$ is bounded for any multi-index $\alpha \in \{(\alpha_1,\ldots,\alpha_d) \mid \sum_{j=1}^{d} \alpha_j \le s\}$.

• $\sigma$ is bi-Lipschitz, i.e., $\sigma$ is bijective and both $\sigma$ and $\sigma^{-1}$ are Lipschitz continuous.
Example 2
We can choose 𝜎 as a smooth version of Leaky ReLU (Biswas et al., 2022). We will see how we can deal with other activation functions, such as the sigmoid and the hyperbolic tangent in Remark 17.
Remark 3
We have $\|K_\sigma\| \le \|\det(J_{\sigma^{-1}})\|_\infty \max\{1, \|\partial_1 \sigma\|_\infty, \ldots, \|\partial_d \sigma\|_\infty\}$ if $s = 1$ and $\sigma$ is elementwise, where $J_{\sigma^{-1}}$ is the Jacobian of $\sigma^{-1}$. As we discuss in Appendix B, deriving a tight bound for larger $s$ is challenging and left for future work.
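To make the Remark 3 bound concrete, the sketch below (Python/NumPy; the smoothed leaky-ReLU $\sigma(x) = a x + (1-a)\,\mathrm{softplus}(x)$ is our own illustrative choice of a bi-Lipschitz elementwise activation, not the construction of Biswas et al. (2022)) estimates $\|\det(J_{\sigma^{-1}})\|_\infty \max\{1, \|\partial_1\sigma\|_\infty, \ldots, \|\partial_d\sigma\|_\infty\}$ on a grid:

```python
import numpy as np

a, d = 0.2, 3  # leak parameter and dimension (illustrative)

# sigma(x) = a*x + (1-a)*softplus(x), applied elementwise; sigma'(x) ∈ (a, 1).
grid = np.linspace(-40.0, 40.0, 100001)
dsigma = a + (1 - a) / (1 + np.exp(-grid))

# For elementwise sigma, det(J_{sigma^{-1}}) = prod_i 1/sigma'(x_i) <= (1/a)^d,
# and every partial derivative of sigma is bounded by 1.
det_J_inv_sup = (1.0 / dsigma.min()) ** d
bound = det_J_inv_sup * max(1.0, float(dsigma.max()))
print(bound)  # approaches (1/a)^d = 125 as the grid widens
```

The bound is finite precisely because $\sigma'$ stays away from $0$; for the sigmoid or tanh this fails, which is the situation addressed in Remark 17.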
4 Koopman-based bound of Rademacher complexity
We derive an upper bound of the Rademacher complexity. We first focus on the case where the weight matrices are invertible or injective. Then, we generalize the results to the non-injective case.
Let $\Omega$ be a probability space equipped with a probability measure $P$. We denote the integral $\int_\Omega s(\omega)\, \mathrm{d}P(\omega)$ of a measurable function $s$ on $\Omega$ by $\mathrm{E}[s]$. Let $s_1,\ldots,s_n$ be i.i.d. Rademacher variables. For a function class $G$ and $x_1,\ldots,x_n \in \mathbb{R}^d$, we denote the empirical Rademacher complexity by $\hat{R}_n(\mathbf{x}, G)$, where $\mathbf{x} = (x_1,\ldots,x_n)$. We will provide an upper bound of $\hat{R}_n(\mathbf{x}, G)$ using Koopman operators in the following subsections.
4.1 Bound for invertible weight matrices ($d_j = d$)

In this subsection, we focus on the most fundamental case: $d_j = d$ ($j = 0,\ldots,L$) for some $d \in \mathbb{N}$, and $W_j$ is invertible for $j = 1,\ldots,L$. For $C, D > 0$, set a class of weight matrices $\mathcal{W}(C,D) = \{W \in \mathbb{R}^{d \times d} \mid \|W\| \le C,\ |\det(W)| \ge D\}$ and a function class $F_{\mathrm{inv}}(C,D) = \{f \in F \mid W_j \in \mathcal{W}(C,D)\}$. We have the following theorem bounding the Rademacher complexity.
Theorem 4 (First Main Theorem)
The Rademacher complexity $\hat{R}_n(\mathbf{x}, F_{\mathrm{inv}}(C,D))$ is bounded as

$$\hat{R}_n(\mathbf{x}, F_{\mathrm{inv}}(C,D)) \le \frac{B \|g\|_{H_L}}{\sqrt{n}} \sup_{W_j \in \mathcal{W}(C,D)} \bigg( \prod_{j=1}^{L} \frac{\|p_j / (p_{j-1} \circ W_j^*)\|_\infty^{1/2}}{|\det(W_j)|^{1/2}} \bigg) \bigg( \prod_{j=1}^{L-1} \|K_{\sigma_j}\| \bigg). \tag{3}$$
By representing the network using the product of Koopman operators, we can get the whole bound by bounding the norm of each Koopman operator. A main difference of our bound from existing bounds, such as the ones by Neyshabur et al. (2015); Golowich et al. (2018) is that our bound has the determinant factors in the denominator in Eq. (3). They come from a change of variable when we bound the norm of the Koopman operators, described by the following inequality:
$$\|K_{W_j}\| \le \big( \|p_j / (p_{j-1} \circ W_j^*)\|_\infty / |\det(W_j)| \big)^{1/2}, \qquad \|K_{b_j}\| = 1$$

for $j = 1,\ldots,L$. Since the network $f$ in Eq. (2) satisfies $f \in H_0$, using the reproducing property of $H_0$, we derive an upper bound of $\hat{R}_n(\mathbf{x}, F_{\mathrm{inv}}(C,D))$ in a similar manner to that for kernel methods (see Mohri et al. (2018, Theorem 6.12)).
Regarding the factor $\|p_j / (p_{j-1} \circ W_j^*)\|_\infty$ in Eq. (3), we obtain the following lemma and proposition, which show that it is bounded in terms of $\|W_j\|$ and induces the factor $\|W_j\|^{s_{j-1}} / \det(W_j^* W_j)^{1/4}$ in Eq. (1).
Lemma 5
Let $p(\omega) = 1/(1 + \|\omega\|^2)^s$ for $s > d/2$ and $p_j = p$ for $j = 0,\ldots,L$. Then, we have $\|p / (p \circ W_j^*)\|_\infty \le \max\{1, \|W_j\|^{2s}\}$.
As a result, the following proposition is obtained by applying Lemma 5 to Theorem 4.
Proposition 6
Let $p(\omega) = 1/(1 + \|\omega\|^2)^s$ for $s > d/2$ and $p_j = p$ for $j = 0,\ldots,L$. We have

$$\hat{R}_n(\mathbf{x}, F_{\mathrm{inv}}(C,D)) \le \frac{B \|g\|_{H_L}}{\sqrt{n}} \bigg( \frac{\max\{1, C^s\}}{\sqrt{D}} \bigg)^{L} \bigg( \prod_{j=1}^{L-1} \|K_{\sigma_j}\| \bigg).$$
Comparison to existing bounds
We investigate the bound (1) in terms of the singular values of $W_j$. Let $\eta_{1,j} \ge \cdots \ge \eta_{d,j}$ be the singular values of $W_j$ and let $\alpha = s - d/2$. Then, the term depending on the weight matrices in the bound (1) is $\eta_{1,j}^{\alpha} \prod_{i=1}^{d} r_{i,j}^{1/2}$, where $r_{i,j} = \eta_{1,j} / \eta_{i,j}$. On the other hand, the corresponding term in the existing bound by Golowich et al. (2018) is $\eta_{1,j} \big( \sum_{i=1}^{d} r_{i,j}^{-2} \big)^{1/2}$. Since both are merely upper bounds, our bound does not contradict the existing one. Our bound is tight when the condition number $r_{d,j}$ of $W_j$ is small; the existing bound is tight when $r_{d,j}$ is large. Note that the factors in our bound (1) are bounded below by $1$.
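This comparison is easy to tabulate. A sketch (Python/NumPy; the singular-value profiles and the values of $d$ and $\alpha$ are illustrative) evaluates both weight-dependent terms, $\eta_1^{\alpha} \prod_i r_i^{1/2}$ (ours) and $\eta_1 (\sum_i r_i^{-2})^{1/2}$ (the term of Golowich et al. (2018)):

```python
import numpy as np

d, alpha = 16, 0.5  # width and alpha = s - d/2 (illustrative values)

def ours(sv, alpha):
    r = sv[0] / sv                        # r_i = eta_1 / eta_i
    return float(sv[0]**alpha * np.prod(np.sqrt(r)))

def golowich(sv):
    return float(np.sqrt(np.sum(sv**2)))  # equals eta_1 * (sum_i r_i^{-2})^{1/2}

well = np.ones(d)                              # condition number 1
ill = np.array([1.0, 1.0] + [1e-2] * (d - 2))  # condition number 100

print(ours(well, alpha), golowich(well))  # 1.0 vs 4.0: ours is tighter
print(ours(ill, alpha), golowich(ill))    # huge vs ~1.41: theirs is tighter
```

The numbers illustrate the regimes described above: the two bounds are complementary, which motivates combining them in Subsection 4.4.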
Remark 7
If a weight matrix $W_j$ is orthogonal, then the factor $\|W_j\|^{s_{j-1}} / \det(W_j^* W_j)^{1/4}$ in the bound (1) reduces to $1$. On the other hand, existing norm-based bounds are described by the $(p,q)$ matrix norm, and the $(p,q)$ matrix norm of the $d$ by $d$ identity matrix is $d^{1/p}$, which is larger than $1$. Indeed, restricting the weight matrices to orthogonal matrices has been proposed (Maduranga et al., 2019; Wang et al., 2020; Li et al., 2021). The tightness of our bound in the orthogonal case implies an advantage of these methods from the perspective of generalization.
Remark 8
There is a tradeoff between $s$ and $B$. We focus on the case where $p_j(\omega) = 1/(1 + \|\omega\|^2)^s$. According to Lemma 5, the factor $\|p_j / (p_{j-1} \circ W_j^*)\|_\infty$ becomes small as $s$ approaches $d/2$. However, the factor $B$ goes to infinity as $s$ goes to $d/2$. Indeed, we have

$$k_p(x,x) = \int_{\mathbb{R}^d} p(\omega)\, \mathrm{d}\omega = C \int_0^\infty \frac{r^{d-1}}{(1 + r^2)^s}\, \mathrm{d}r \ge C \int_1^\infty \frac{(r-1)^{d/2-1}}{2 r^s}\, \mathrm{d}r > \frac{C}{4} \int_2^\infty r^{d/2-1-s}\, \mathrm{d}r = \frac{C}{4} \cdot \frac{2^{d/2-s}}{s - d/2}$$

for some constant $C > 0$. This behavior corresponds to the fact that if $s = d/2$, the Sobolev space is not an RKHS, and the evaluation operator becomes unbounded.
4.2 Bound for injective weight matrices ($d_j \ge d_{j-1}$)

We generalize Theorem 4 to injective weight matrices. If $d_j > d_{j-1}$, then $W_j$ is not surjective; however, it can be injective, and we first focus on the injective case. Let $\mathcal{W}_j(C,D) = \{W \in \mathbb{R}^{d_j \times d_{j-1}} \mid d_j \ge d_{j-1},\ \|W\| \le C,\ \det(W^* W) \ge D\}$ and $F_{\mathrm{inj}}(C,D) = \{f \in F \mid W_j \in \mathcal{W}_j(C,D)\}$. Let $f_j = g \circ b_L \circ W_L \circ \sigma_{L-1} \circ b_{L-1} \circ W_{L-1} \circ \cdots \circ \sigma_j \circ b_j$ and $G_j = \|f_j|_{\mathcal{R}(W_j)}\|_{H_{p_{j-1}}(\mathcal{R}(W_j))} / \|f_j\|_{H_j}$. We have the following theorem for injective weight matrices.
Theorem 9 (Second Main Theorem)
The Rademacher complexity $\hat{R}_n(\mathbf{x}, F_{\mathrm{inj}}(C,D))$ is bounded as

$$\hat{R}_n(\mathbf{x}, F_{\mathrm{inj}}(C,D)) \le \frac{B \|g\|_{H_L}}{\sqrt{n}} \sup_{W_j \in \mathcal{W}_j(C,D)} \bigg( \prod_{j=1}^{L} \frac{\|p_{j-1} / (p_{j-1} \circ W_j^*)\|_{\mathcal{R}(W_j),\infty}^{1/2}\, G_j}{\det(W_j^* W_j)^{1/4}} \bigg) \bigg( \prod_{j=1}^{L-1} \|K_{\sigma_j}\| \bigg). \tag{4}$$

Since $W_j$ is not always square, $\det(W_j)$ from Theorem 4 is not available here; however, we can replace it with $\det(W_j^* W_j)^{1/2}$. As with Lemma 5, we have the following lemma about the factor $\|p_{j-1} / (p_{j-1} \circ W_j^*)\|_{\mathcal{R}(W_j),\infty}$ in Eq. (4).
Lemma 10
Let $p_j(\omega) = 1/(1 + \|\omega\|^2)^{s_j}$ for $s_j > d_j/2$ and $j = 0,\ldots,L$. Then, we have

$$\bigg\| \frac{p_{j-1}}{p_{j-1} \circ W_j^*} \bigg\|_{\mathcal{R}(W_j),\infty} \le \max\{1, \|W_j\|^{2 s_{j-1}}\}.$$
Applying Lemma 10 to Theorem 9, we finally obtain the bound (1). Regarding the factor 𝐺 𝑗 , the following lemma shows that 𝐺 𝑗 is determined by the isotropy of 𝑓 𝑗 .
Lemma 11
Let $p_j(\omega) = 1/(1 + \|\omega\|^2)^{s_j}$ for $s_j > d_j/2$ and $j = 0,\ldots,L$. Let $s_j \ge s_{j-1}$ and let $\tilde{H}_j$ be the Sobolev space on $\mathcal{R}(W_j)^\perp$ with $p(\omega) = 1/(1 + \|\omega\|^2)^{s_j - s_{j-1}}$. Then, $G_j \le G_{j,0} \|f_j|_{\mathcal{R}(W_j)} \cdot f_j|_{\mathcal{R}(W_j)^\perp}\|_{H_j} / \|f_j\|_{H_j}$, where $G_{j,0} = \|f_j\|_{\tilde{H}_j}^{-1}$.
Remark 12
The factor $G_j$ is expected to be small. Indeed, if $f_j(x) = \mathrm{e}^{-c\|x\|^2}$ (which is the case if $g$ is Gaussian and $j = L$), then $G_j$ gets small as $s_j$ and $d_j$ get large. Moreover, $G_j$ can alleviate the dependency of $\|K_{\sigma_j}\|$ on $d_j$ and $s_j$. See Appendix C for more details.
4.3 Bounds for non-injective weight matrices
The bounds in Theorems 4 and 9 are only valid for injective weight matrices, and they go to infinity as the weight matrices approach singular ones. This is because if $W_j: \mathbb{R}^{d_{j-1}} \to \mathbb{R}^{d_j}$ is singular, then $h \circ W_j$ for $h \in H_j$ is constant along the directions of $\ker(W_j)$. As a result, $h \circ W_j$ is not contained in $H_{j-1}$ and $K_{W_j}$ becomes unbounded. To deal with this situation, we propose two approaches that are valid for non-injective weight matrices: a graph-based one and a weighted Koopman-based one.
4.3.1 Graph-based bound
The first approach is to construct an injective operator related to the graph of $W_j$. For $j = 1,\ldots,L$, let $r_j = \dim(\ker(W_j))$, $\delta_j = \sum_{k=0}^{j} r_k$, and let $\tilde{W}_j$ be defined as $\tilde{W}_j(x_1, x_2) = (W_j x_1, P_j x_1, x_2)$ for $x_1 \in \mathbb{R}^{d_{j-1}}$ and $x_2 \in \mathbb{R}^{\delta_{j-2}}$ for $j \ge 2$, and $\tilde{W}_1 x = (W_1 x, P_1 x)$. Here, $P_j$ is the projection onto $\ker(W_j)$. Then, $\tilde{W}_j$ is injective (see Appendix I). Let $\tilde{\sigma}_j: \mathbb{R}^{\delta_j} \to \mathbb{R}^{\delta_j}$ and $\tilde{b}_j: \mathbb{R}^{\delta_j} \to \mathbb{R}^{\delta_j}$ be defined as $\tilde{\sigma}_j(x_1, x_2) = (\sigma_j(x_1), x_2)$ and $\tilde{b}_j(x_1, x_2) = (b_j(x_1), x_2)$ for $x_1 \in \mathbb{R}^{d_j}$ and $x_2 \in \mathbb{R}^{\delta_{j-1}}$, respectively. In addition, let $\tilde{g} \in H_{p_L}(\mathbb{R}^{\delta_L})$ be defined as $\tilde{g}(x_1, x_2) = g(x_1) \psi(x_2)$ for $x_1 \in \mathbb{R}^{d_L}$ and $x_2 \in \mathbb{R}^{\delta_{L-1}}$, where $\psi$ is a rapidly decaying function on $\mathbb{R}^{\delta_{L-1}}$ chosen so that $\tilde{g} \in H_{p_L}(\mathbb{R}^{\delta_{L-1} + d_L})$. Consider the following network:
$$\begin{aligned} \tilde{f}(x) &= \tilde{g} \circ \tilde{b}_L \circ \tilde{W}_L \circ \cdots \circ \tilde{\sigma}_1 \circ \tilde{b}_1 \circ \tilde{W}_1(x) \\ &= \psi\big( P_L\, \sigma_{L-1} \circ b_{L-1} \circ W_{L-1} \circ \cdots \circ \sigma_1 \circ b_1 \circ W_1(x),\ \ldots,\ P_2\, \sigma_1 \circ b_1 \circ W_1(x),\ P_1 x \big) \\ &\quad \cdot g\big( b_L \circ W_L \circ \cdots \circ \sigma_1 \circ b_1 \circ W_1(x) \big). \end{aligned} \tag{5}$$
Since $\det(\tilde{W}_j^* \tilde{W}_j) = \det(W_j^* W_j + I)$ and $\|\tilde{W}_j\| = \|I + W_j^* W_j\|$, we set $\tilde{\mathcal{W}}_j(C,D) = \{W \in \mathbb{R}^{d_j \times d_{j-1}} \mid \|I + W^* W\| \le C,\ \det(W^* W + I) \ge D\}$ and $\tilde{F}(C,D) = \{\tilde{f} \mid f \in F,\ W_j \in \tilde{\mathcal{W}}_j(C,D)\}$. Moreover, put $\tilde{H}_L = H_{p_L}(\mathbb{R}^{\delta_L})$. By Theorem 9, we obtain the following bound, where the determinant factor does not go to infinity by virtue of the identity $I$ appearing in $\tilde{W}_j$.
Proposition 13
The Rademacher complexity $\hat{R}_n(\mathbf{x}, \tilde{F}(C,D))$ is bounded as

$$\hat{R}_n(\mathbf{x}, \tilde{F}(C,D)) \le \frac{B \|\tilde{g}\|_{\tilde{H}_L}}{\sqrt{n}} \sup_{W_j \in \tilde{\mathcal{W}}_j(C,D)} \bigg( \prod_{j=1}^{L} \frac{\|p_j / (p_{j-1} \circ \tilde{W}_j^*)\|_{\mathcal{R}(\tilde{W}_j),\infty}^{1/2}\, \tilde{G}_j}{\det(W_j^* W_j + I)^{1/4}} \bigg) \bigg( \prod_{j=1}^{L-1} \|K_{\sigma_j}\| \bigg).$$
Remark 14
The difference between the networks (5) and (2) is the factor $\psi$. If we set $\psi$ so that $\psi(x) = 1$ for $x$ in a sufficiently large region $\Omega$, then $\tilde{f}(x) = f(x)$ for $x \in \Omega$. We can choose $\psi$, for example, as a smooth bump function (Tu, 2011, Chapter 13); see Appendix D for the definition. If the data lies in a compact region, then the output of each layer is also in a compact region since $\|W_j\| \le \|I + W_j^* W_j\| \le C$. Thus, it is natural to assume that $f$ can be replaced by $\tilde{f}$ in practical cases.
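A concrete choice of $\psi$ is easy to write down. The sketch below (Python/NumPy; the plateau radius $R$ and the transition width are arbitrary choices) implements a standard one-dimensional smooth bump that equals $1$ on $[-R, R]$ and vanishes outside $[-R-1, R+1]$:

```python
import numpy as np

def smooth_step(t):
    """C-infinity step: 0 for t <= 0, 1 for t >= 1 (standard construction)."""
    t = np.asarray(t, dtype=float)
    h = np.where(t > 0, np.exp(-1.0 / np.maximum(t, 1e-300)), 0.0)
    hc = np.where(1 - t > 0, np.exp(-1.0 / np.maximum(1 - t, 1e-300)), 0.0)
    return h / (h + hc)

def bump(x, R=2.0):
    """psi(x) = 1 on [-R, R], smoothly decreasing to 0 outside [-R-1, R+1]."""
    return smooth_step(R + 1 - np.abs(x))

vals = bump(np.array([0.0, 2.0, 2.5, 3.5]))
print(vals)  # -> [1.  1.  0.5 0. ]
```

A multivariate $\psi$ on $\mathbb{R}^{\delta_{L-1}}$ can then be taken as `bump(np.linalg.norm(x))`, which is the kind of rapidly decaying factor the construction above needs.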
Remark 15
The factor $\|\tilde{g}\|_{\tilde{H}_L}$ grows as $\Omega$ becomes large. This is because $\|\psi\|_{L^2(\mathbb{R}^{\delta_{L-1}})}^2$ becomes large as the volume of $\Omega$ grows. This does not contradict the fact that the bound in Theorem 4 goes to infinity if $W_j$ is singular. More details are explained in Appendix E.
4.3.2 Weighted Koopman-based bound
Instead of constructing the injective operators of Subsection 4.3.1, we can also use weighted Koopman operators. For $\psi_j: \mathbb{R}^{d_j} \to \mathbb{C}$, define $\tilde{K}_{W_j} h = \psi_j \cdot h \circ W_j$, which is called the weighted Koopman operator. We construct the same network as $\tilde{f}$ in Eq. (5) using $\tilde{K}_{W_j}$:

$$\begin{aligned} \tilde{f}(x) &= \tilde{K}_{W_1} K_{b_1} K_{\sigma_1} \cdots \tilde{K}_{W_L} K_{b_L}\, g(x) \\ &= \psi_1(x)\, \psi_2\big( \sigma_1 \circ b_1 \circ W_1(x) \big) \cdots \psi_L\big( \sigma_{L-1} \circ b_{L-1} \circ W_{L-1} \circ \cdots \circ \sigma_1 \circ b_1 \circ W_1(x) \big) \cdot g\big( b_L \circ W_L \circ \cdots \circ \sigma_1 \circ b_1 \circ W_1(x) \big) \\ &= \psi\big( \sigma_{L-1} \circ b_{L-1} \circ W_{L-1} \circ \cdots \circ \sigma_1 \circ b_1 \circ W_1(x),\ \ldots,\ \sigma_1 \circ b_1 \circ W_1(x),\ x \big) \cdot g\big( b_L \circ W_L \circ \cdots \circ \sigma_1 \circ b_1 \circ W_1(x) \big), \end{aligned}$$

where $\psi(x_1,\ldots,x_L) = \psi_1(x_1) \cdots \psi_L(x_L)$ for $x_j \in \mathbb{R}^{d_{j-1}}$. Let $W_{\mathrm{r},j} = W_j|_{\ker(W_j)^\perp}$ be the restriction of $W_j$ to $\ker(W_j)^\perp$. Set $\mathcal{W}_{\mathrm{r},j}(C,D) = \{W \in \mathbb{R}^{d_j \times d_{j-1}} \mid \|W\| \le C,\ |\det(W_{\mathrm{r}})| \ge D\}$ and $F_{\mathrm{r}}(C,D) = \{f \in F \mid W_j \in \mathcal{W}_{\mathrm{r},j}(C,D)\}$. By letting $\psi_j$ decay in the directions of $\ker(W_j)$ (e.g., a smooth bump function), $\tilde{K}_{W_j} h$ for $h \in H_j$ decays in all directions, and we only need to care about $W_{\mathrm{r},j}$ rather than the whole of $W_j$. Note that $\psi_j$ plays the same role as $\psi$ in Eq. (5). If the data lies in a compact region, we replace $f$ by $\tilde{f}$, which coincides with $f$ on that region. By virtue of the decay of $\psi_j$, we can bound $\tilde{K}_{W_j}$ even if $W_j$ is singular.
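The effect of the weight $\psi_j$ can be seen in a two-dimensional toy example (Python/NumPy; the singular $W$, the Gaussian observable $h$, and the Gaussian weight $\psi$ are all illustrative choices): composition with a singular $W$ destroys decay along $\ker(W)$, and multiplying by a $\psi$ that decays in that direction restores it.

```python
import numpy as np

W = np.array([[1.0, 0.0],
              [0.0, 0.0]])                  # singular: ker(W) = span{e_2}
h = lambda x: float(np.exp(-np.sum(x**2)))  # a decaying observable
psi = lambda x: float(np.exp(-x[1]**2))     # decays along ker(W)

K_W = lambda h: (lambda x: h(W @ x))                    # plain Koopman
K_W_weighted = lambda h: (lambda x: psi(x) * h(W @ x))  # weighted Koopman

far = np.array([0.0, 100.0])  # a point far out along ker(W)
print(K_W(h)(far))            # -> 1.0: h∘W is constant along ker(W)
print(K_W_weighted(h)(far))   # -> ~0.0: the weight restores decay
```

Without the weight, $h \circ W$ does not vanish at infinity and so falls outside the RKHS; with it, only the restricted part $W_{\mathrm{r}}$ matters, as stated above.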
Proposition 16
Let $p_j(\omega) = 1/(1 + \|\omega\|^2)^{s_j}$ with $s_j \in \mathbb{N}$, $s_j \ge s_{j-1}$, and $s_j > d_j/2$. Let $\tilde{H}_{j-1} = H_{p_{j-1}}(\ker(W_j))$ and let $\psi_j$ be a function satisfying $\psi_j(x) = \psi_{j,1}(x_1)$ for some $\psi_{j,1} \in \tilde{H}_{j-1}$, where $x = x_1 + x_2$ with $x_1 \in \ker(W_j)$ and $x_2 \in \ker(W_j)^\perp$. Moreover, let $G_j = \|f_j\|_{H_{j-1}(\ker(W_j)^\perp)} / \|f_j\|_{H_j}$. Then, we have

$$\hat{R}_n(\mathbf{x}, F_{\mathrm{r}}(C,D)) \le \frac{B \|g\|_{H_L}}{\sqrt{n}} \sup_{W_j \in \mathcal{W}_{\mathrm{r},j}(C,D)} \bigg( \prod_{j=1}^{L} \frac{\|\psi_{j,1}\|_{\tilde{H}_{j-1}}\, G_j \max\{1, \|W_j\|^{s_{j-1}}\}}{|\det(W_{\mathrm{r},j})|^{1/2}} \bigg) \bigg( \prod_{j=1}^{L-1} \|K_{\sigma_j}\| \bigg).$$
Remark 17
Although we focus on the singularity of $K_{W_j}$ and use the weighted Koopman operator with respect to $W_j$, we can deal with the singularity of $K_{\sigma_j}$ in the same manner, i.e., by constructing the weighted Koopman operator $\tilde{K}_{\sigma_j}$ with respect to $\sigma_j$. For example, the sigmoid and the hyperbolic tangent do not satisfy the assumption on $\sigma_j$ stated in Proposition 2 because the Jacobian of $\sigma_j^{-1}$ is not bounded. However, $\tilde{K}_{\sigma_j}$ is bounded by virtue of $\psi_j$.
Remark 18
The norm of 𝜓 𝑗 can be canceled by the factor 𝐺 𝑗 . See Appendix F for more details.
4.4 Combining the Koopman-based bound with other bounds
In this subsection, we show that our Koopman-based bound is flexible enough to be combined with another bound. As stated in Subsection 4.1, the case where our bound is tight differs from the case where existing bounds such as the one by Golowich et al. (2018) are tight. We can combine these bounds to obtain a tighter bound. For 1 ≤ 𝑙 ≤ 𝐿 , let 𝐹 1 : 𝑙 be the set of all functions in the form
$$\sigma_l \circ b_l \circ W_l \circ \sigma_{l-1} \circ b_{l-1} \circ W_{l-1} \circ \cdots \circ \sigma_1 \circ b_1 \circ W_1 \tag{6}$$

with Assumption 1, and let $F_{1:l,\mathrm{inj}}(C,D) = \{f \in F_{1:l} \mid W_j \in \mathcal{W}_j(C,D)\}$. For $l \le L-1$, consider the set of all functions in $H_l$ of the form

$$g \circ b_L \circ W_L \circ \sigma_{L-1} \circ b_{L-1} \circ W_{L-1} \circ \cdots \circ \sigma_{l+1} \circ b_{l+1} \circ W_{l+1},$$

and consider any nonempty subset $F_{l+1:L}$ of it. For $l = L$, we set $F_{L+1:L} = \{g\}$. Let $F_{l,\mathrm{comb}}(C,D) = \{f_1 \circ f_2 \mid f_1 \in F_{l+1:L},\ f_2 \in F_{1:l,\mathrm{inj}}(C,D)\}$. The following proposition connects the Rademacher complexity of $F_{l,\mathrm{comb}}(C,D)$ with that of $F_{l+1:L}$.
Proposition 19
Let $\tilde{\mathbf{x}} = (\tilde{x}_1,\ldots,\tilde{x}_n) \in (\mathbb{R}^{d_l})^n$. Let $v_n(\omega) = \sum_{i=1}^{n} s_i(\omega) k_{p_0}(\cdot, x_i)$, $\tilde{v}_n(\omega) = \sum_{i=1}^{n} s_i(\omega) k_{p_l}(\cdot, \tilde{x}_i)$, $\mathcal{W} = \{(W_1,\ldots,W_l) \mid W_j \in \mathcal{W}_j(C,D)\}$, and $\gamma_n = \|v_n\|_{H_0} / \|\tilde{v}_n\|_{H_l}$. Then,
$$\begin{aligned} \hat{R}_n(\mathbf{x}, F_{l,\mathrm{comb}}(C,D)) \le \sup_{(W_1,\ldots,W_l) \in \mathcal{W}} &\prod_{j=1}^{l} \bigg\| \frac{p_{j-1}}{p_{j-1} \circ W_j^*} \bigg\|_{\mathcal{R}(W_j),\infty}^{1/2} \frac{G_j \|K_{\sigma_j}\|}{\det(W_j^* W_j)^{1/4}} \\ &\cdot \bigg( \hat{R}_n(\tilde{\mathbf{x}}, F_{l+1:L}) + \frac{B}{\sqrt{n}} \inf_{h_1 \in F_{l+1:L}} \mathrm{E}^{\frac{1}{2}} \Big[ \sup_{h_2 \in F_{l+1:L}} \Big\| h_1 - \frac{\|h_2\|_{H_l}\, \gamma_n}{\|\tilde{v}_n\|_{H_l}}\, \tilde{v}_n \Big\|_{H_l}^2 \Big] \bigg). \end{aligned}$$
The complexity of the whole network is decomposed into the Koopman-based bound for the first $l$ layers and the complexity of the remaining $L-l$ layers, together with a term describing the approximation power of the function class corresponding to the last $L-l$ layers. Note that Proposition 19 generalizes Theorem 9 up to multiplication by a constant. Indeed, if $l = L$, we have $\hat{R}_n(\mathbf{x}, F_{l+1:L}) = 0$. In addition, we have

$$\inf_{h_1 \in F_{l+1:L}} \mathrm{E}^{1/2}\Big[ \sup_{h_2 \in F_{l+1:L}} \big\| h_1 - \|h_2\|\, \gamma_n \tilde{v}_n / \|\tilde{v}_n\| \big\|^2 \Big] = \mathrm{E}^{1/2}\big[ \big\| g - \|g\|\, \gamma_n \tilde{v}_n / \|\tilde{v}_n\| \big\|^2 \big] \le \|g\|\, \mathrm{E}^{1/2}\big[ (1+\gamma_n)^2 \big].$$
Remark 20
We can also combine our Koopman-based approach with the existing "peeling" approach, e.g., by Neyshabur et al. (2015); Golowich et al. (2018) (see Appendix G). Then, for $1 \le l \le L$, we obtain a bound such as

$$O\left( \Big( \prod_{j=l+1}^{L} \|W_j\|_{2,2} \Big) \Big( \prod_{j=1}^{l} \frac{\|W_j\|^{s_j}}{\det(W_j^* W_j)^{1/4}} \Big) \right).$$
Typically, in many networks, the width grows near the input layer, i.e., $d_{j-1} \le d_j$ for small $j$, and decays near the output layer, i.e., $d_{j-1} \ge d_j$ for large $j$. Therefore, this type of combination is suitable in many practical cases for deriving a bound tighter than existing ones.
Proposition 19 and Remark 20 theoretically imply that our Koopman-based bound is suitable for the lower layers. Practically, we can interpret this as signals being transformed by the lower layers so that their essential information can be extracted by the higher layers. This interpretation also supports the result in Figure 1 of Arora et al. (2018): noise is removed (i.e., signals are extracted) by the higher layers, but it is not completely removed by the lower layers. We investigate the behavior of each layer numerically in Section 5 and Appendix J.2.1.
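The choice of the split point $l$ in this combination can be made concrete. The following NumPy sketch (our illustration, not the authors' code; the function and variable names are ours) evaluates the combined factor from Remark 20, $\big(\prod_{j>l}\|W_j\|_2\big)\big(\prod_{j\le l}\|W_j\|^{s_j}/\det(W_j^*W_j)^{1/4}\big)$, for every split point and keeps the smallest one.

```python
import numpy as np

def koopman_factor(W, s):
    """Koopman-based per-layer factor ||W||^s / det(W^T W)^{1/4}."""
    return np.linalg.norm(W, 2) ** s / np.linalg.det(W.T @ W) ** 0.25

def norm_factor(W):
    """Norm-based ("peeling") per-layer factor ||W||_2."""
    return np.linalg.norm(W, 2)

def combined_bound(weights, s_list):
    """Try every split point l and return the one minimizing the product."""
    best_l, best_val = 0, float("inf")
    for l in range(len(weights) + 1):  # l = number of lower layers
        val = 1.0
        for j, W in enumerate(weights):
            val *= koopman_factor(W, s_list[j]) if j < l else norm_factor(W)
        if val < best_val:
            best_l, best_val = l, val
    return best_l, best_val

rng = np.random.default_rng(0)
# near-orthogonal lower layer (condition number ~1) and a generic upper layer
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
weights = [Q, rng.standard_normal((4, 4))]
s_list = [(4 + 0.1) / 2] * 2
l, val = combined_bound(weights, s_list)
print(l, val)
```

For an orthogonal lower layer the Koopman-based factor equals $1$, so assigning it to the Koopman side never hurts; for ill-conditioned upper layers the norm-based factor is smaller, matching the discussion above.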
5 Numerical results

Validity of the bound
To investigate our bound numerically, we consider a regression problem on $\mathbb{R}^3$, where the target function $t$ is $t(x) = \mathrm{e}^{-\|2x-1\|^2}$. We constructed a simple network $f(x) = g(W_2 \sigma(W_1 x + b_1) + b_2)$, where $W_1 \in \mathbb{R}^{3\times 3}$, $W_2 \in \mathbb{R}^{6\times 3}$, $b_1 \in \mathbb{R}^3$, $b_2 \in \mathbb{R}^6$, $g(x) = \mathrm{e}^{-\|x\|^2}$, and $\sigma$ is a smooth version of Leaky ReLU proposed by Biswas et al. (2022). We created a training dataset from samples randomly drawn from the standard normal distribution. Figure 1 (a) illustrates the relationship between the generalization error and our bound $O(\prod_{j=1}^{L} \|W_j\|^{s_j} / \det(W_j^* W_j)^{1/4})$. Here, we set $s_j = (d_j + 0.1)/2$. In Figure 1 (a), we can see that our bound shrinks in proportion to the generalization error. In addition, we investigated the generalization property of a network with a regularization based on our bound. We considered the classification task with MNIST. For training the network, we used only $n = 1000$ samples to create a situation where the model is hard to generalize. We constructed a network with four dense layers and trained it with and without the regularization term $\|W_j\| + 1/\det(I + W_j^* W_j)$, which keeps the norm of $W_j$ small and the determinant of $I + W_j^* W_j$ large. Figure 1 (b) shows the test accuracy. We can see that the regularization based on our bound leads to a better generalization property, which supports the validity of our bound.
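The regularization term can be written down directly. The following NumPy sketch (names ours; the paper does not provide code, and in actual training this would be implemented with differentiable operations in a framework such as PyTorch) evaluates $\|W_j\| + 1/\det(I + W_j^* W_j)$ summed over the layers.

```python
import numpy as np

def bound_regularizer(weights):
    """Sum over layers of ||W|| + 1/det(I + W^T W)."""
    reg = 0.0
    for W in weights:
        WtW = W.T @ W
        reg += np.linalg.norm(W, 2)                             # keeps the norm small
        reg += 1.0 / np.linalg.det(np.eye(WtW.shape[0]) + WtW)  # keeps the determinant large
    return reg

rng = np.random.default_rng(0)
W_orth, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # orthogonal: condition number 1
W_rand = rng.standard_normal((8, 8))
# an orthogonal 8x8 matrix attains norm 1 and det(I + W^T W) = 2^8,
# so the regularizer evaluates to 1 + 2**-8
print(bound_regularizer([W_orth]))
print(bound_regularizer([W_rand]))
```

Note that orthogonal matrices nearly minimize this term, consistent with the paper's observation that the bound is smallest for well-conditioned, full-rank weights.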
Figure 1: (a) Scatter plot of the generalization error versus our bound (for 5 independent runs). The color is set to get darker as the epochs proceed. (b) Test accuracy with and without the regularization based on our bound. (c) The condition number $r_{d,j} = \eta_{1,j}/\eta_{d,j}$ of the weight matrix for layers $j = 2, \ldots, 4$.

Singular values of the weight matrices
We investigated the difference in the behavior of the singular values of the weight matrix of each layer. We considered the classification task with CIFAR-10 and AlexNet (Krizhevsky et al., 2012). AlexNet has five convolutional layers followed by dense layers. For each $j = 2, \ldots, 5$, we computed the condition number $r_{d,j}$ of the weight matrix. The results are illustrated in Figure 1 (c). We scaled the values for $j = 4, 5$ for readability. Since the weight matrix of the first layer ($j = 1$) is huge and the computational cost of computing its singular values is expensive, we focus on $j = 2, \ldots, 5$. We can see that for the second layer ($j = 2$), $r_{d,j}$ tends to become small as the learning process proceeds (as the test accuracy grows). On the other hand, for the third and fourth layers ($j = 3, 4$), $r_{d,j}$ tends to become large. This means that the behavior of the singular values differs depending on the layer. According to the paragraph after Proposition 6, our bound becomes smaller as $r_{d,j}$ becomes smaller, whereas the existing bound becomes smaller as $r_{d,j}$ becomes larger. In this case, our bound describes the behavior of the second layer, and the existing bound describes that of the third and fourth layers. See Appendix J for more details and results.
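As an illustration of the diagnostic used here (a NumPy sketch of ours, not the authors' code), the condition number $r_{d,j} = \eta_{1,j}/\eta_{d,j}$ is the ratio of the largest to the smallest singular value of a (flattened) weight matrix:

```python
import numpy as np

def condition_number(W):
    """Ratio of the largest to the smallest singular value of W."""
    svals = np.linalg.svd(W, compute_uv=False)  # sorted in descending order
    return svals[0] / svals[-1]

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
print(condition_number(Q))                                        # ~1 for an orthogonal matrix
print(condition_number(Q @ np.diag(np.linspace(1.0, 10.0, 16))))  # ~10 by construction
```

An orthogonal weight matrix has $r_{d,j} = 1$, the regime in which the Koopman-based bound is tightest.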
6 Conclusion and discussion
In this paper, we proposed a new uniform complexity bound of neural networks using Koopman operators. Our bound describes why networks with full-rank weight matrices generalize well and justifies the generalization property of networks with orthogonal matrices. In addition, we provided an operator-theoretic approach to analyzing the generalization property of neural networks. There are several possible limitations. First, our setting excludes non-smooth activation functions; generalizing our framework to other function spaces may make such activation functions tractable. Moreover, although the factor $\|K_{\sigma_j}\|$ is bounded for certain activation functions, how this factor changes depending on the choice of the activation function has not been clarified yet, and further investigation of this factor is required. Furthermore, we assume a modification of the structure of the neural network for deriving a bound for non-injective matrices, so there is still room for improving this bound. Finally, we simply decomposed the product of Koopman operators into individual Koopman operators; for a more refined analysis, they should be tied together to investigate the connection between layers. Considering a function space on a manifold that contains the range of the transformation corresponding to each layer may also be effective for resolving this issue. These challenging topics are left for future work.
Acknowledgement
SS was partially supported by JST PRESTO JPMJPR2125 and JST CREST JPMJCR2015. II was partially supported by JST CREST JPMJCR1913 and JST ACT-X JPMJAX2004. AN was partially supported by JSPS KAKENHI 22H03650 and JST PRESTO JPMJPR1928. TS was partially supported by JSPS KAKENHI 20H00576 and JST CREST.
References

Anthony & Bartlett (2009): Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
Arora et al. (2018): Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
Bartlett & Mendelson (2002): Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
Bartlett et al. (2017): Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Proceedings of Advances in Neural Information Processing Systems 31 (NIPS), 2017.
Biswas et al. (2022): Koushik Biswas, Sandeep Kumar, Shilpak Banerjee, and Ashish Kumar Pandey. Smooth maximum unit: Smooth activation function for deep networks using smoothing maximum technique. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Brunton et al. (2022): Steven L. Brunton, Marko Budišić, Eurika Kaiser, and J. Nathan Kutz. Modern Koopman theory for dynamical systems. SIAM Review, 64(2):229–340, 2022.
Budišić et al. (2012): Marko Budišić, Ryan Mohr, and Igor Mezić. Applied Koopmanism. Chaos, 22:047510, 2012.
Das et al. (2021): Suddhasattwa Das, Dimitrios Giannakis, and Joanna Slawinska. Reproducing kernel Hilbert space compactification of unitary evolution groups. Applied and Computational Harmonic Analysis, 54:75–136, 2021.
Dogra & Redman (2020): Akshunna S. Dogra and William Redman. Optimizing neural networks via Koopman operator theory. In Proceedings of Advances in Neural Information Processing Systems 34 (NeurIPS), 2020.
Giannakis & Das (2020): Dimitrios Giannakis and Suddhasattwa Das. Extraction and prediction of coherent patterns in incompressible flows through space-time Koopman analysis. Physica D: Nonlinear Phenomena, 402:132211, 2020.
Goldblum et al. (2020): Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, and Tom Goldstein. Truth or backpropaganda? An empirical investigation of deep learning theory. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020.
Golowich et al. (2018): Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Proceedings of the 2018 Conference on Learning Theory (COLT), 2018.
Harvey et al. (2017): Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Proceedings of the 2017 Conference on Learning Theory (COLT), pp. 1064–1068, 2017.
Hashimoto et al. (2020): Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda, Yoichi Matsuo, and Yoshinobu Kawahara. Krylov subspace method for nonlinear dynamical systems with random noise. Journal of Machine Learning Research, 21(172):1–29, 2020.
He et al. (2015): Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
Ikeda et al. (2022a): Masahiro Ikeda, Isao Ishikawa, and Yoshihiro Sawano. Boundedness of composition operators on reproducing kernel Hilbert spaces with analytic positive definite functions. Journal of Mathematical Analysis and Applications, 511(1):126048, 2022a.
Ikeda et al. (2022b): Masahiro Ikeda, Isao Ishikawa, and Corbinian Schlosser. Koopman and Perron–Frobenius operators on reproducing kernel Banach spaces. Chaos: An Interdisciplinary Journal of Nonlinear Science, 32(12):123143, 2022b.
Ishikawa (2023): Isao Ishikawa. Bounded composition operators on functional quasi-Banach spaces and stability of dynamical systems. Advances in Mathematics, 424:109048, 2023.
Ishikawa et al. (2018): Isao Ishikawa, Keisuke Fujii, Masahiro Ikeda, Yuka Hashimoto, and Yoshinobu Kawahara. Metric on nonlinear dynamical systems with Perron–Frobenius operators. In Proceedings of Advances in Neural Information Processing Systems 32 (NeurIPS), 2018.
Ju et al. (2022): Haotian Ju, Dongyue Li, and Hongyang R. Zhang. Robust fine-tuning of deep neural networks with Hessian-based generalization guarantees. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
Kawahara (2016): Yoshinobu Kawahara. Dynamic mode decomposition with reproducing kernels for Koopman spectral analysis. In Proceedings of Advances in Neural Information Processing Systems 30 (NIPS), 2016.
Kingma & Ba (2015): Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
Klus et al. (2020): Stefan Klus, Ingmar Schuster, and Krikamol Muandet. Eigendecompositions of transfer operators in reproducing kernel Hilbert spaces. Journal of Nonlinear Science, 30:283–315, 2020.
Konishi & Kawahara (2023): Takuya Konishi and Yoshinobu Kawahara. Stable invariant models via Koopman spectra. Neural Networks, 165:393–405, 2023.
Koopman (1931): Bernard Koopman. Hamiltonian systems and transformation in Hilbert space. Proceedings of the National Academy of Sciences, 17(5):315–318, 1931.
Krizhevsky et al. (2012): Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems 26 (NIPS), 2012.
Li et al. (2021): Shuai Li, Kui Jia, Yuxin Wen, Tongliang Liu, and Dacheng Tao. Orthogonal deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(4):1352–1368, 2021.
Lusch et al. (2017): Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, 9:4950, 2017.
Maduranga et al. (2019): Kehelwala D. G. Maduranga, Kyle E. Helfrich, and Qiang Ye. Complex unitary recurrent neural networks using scaled Cayley transform. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), 2019.
Mohri et al. (2018): Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
Neyshabur et al. (2015): Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Proceedings of the 2015 Conference on Learning Theory (COLT), 2015.
Neyshabur et al. (2018): Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
Neyshabur et al. (2019): Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.
Novak et al. (2018): Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: An empirical study. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
Redman et al. (2022): William T. Redman, Maria Fonoberova, Ryan Mohr, Yannis Kevrekidis, and Igor Mezić. An operator theoretic view on pruning deep neural networks. In Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022.
Rubin (2018): B. A. Rubin. A note on the Blaschke–Petkantschin formula, Riesz distributions, and Drury's identity. Fractional Calculus and Applied Analysis, 21:1641–1650, 2018.
Sawano (2018): Yoshihiro Sawano. Theory of Besov Spaces. Springer Singapore, 2018.
Suzuki et al. (2020): Taiji Suzuki, Hiroshi Abe, and Tomoaki Nishimura. Compression based bound for non-compressed network: Unified generalization error analysis of large compressible deep neural network. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020.
Tu (2011): Loring W. Tu. An Introduction to Manifolds. Springer New York, second edition, 2011.
Wang et al. (2020): Jiayun Wang, Yubei Chen, Rudrasis Chakraborty, and Stella X. Yu. Orthogonal convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Wei & Ma (2019): Colin Wei and Tengyu Ma. Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. In Proceedings of Advances in Neural Information Processing Systems 33 (NeurIPS), 2019.
Wei & Ma (2020): Colin Wei and Tengyu Ma. Improved sample complexities for deep neural networks and robust classification via an all-layer margin. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020.
Weinan E et al. (2022): Weinan E, Chao Ma, and Lei Wu. The Barron space and the flow-induced function spaces for neural network models. Constructive Approximation, 55:369–406, 2022.

Appendix A Proofs
We provide the proofs of the theorems, propositions, and lemmas in the main text.
Proposition 2 Let $p(\omega) = 1/(1+\|\omega\|^2)^s$ for $\omega \in \mathbb{R}^d$, $s \in \mathbb{N}$, and $s > d/2$. If the activation function $\sigma$ has the following properties, then $K_\sigma : H_p(\mathbb{R}^d) \to H_p(\mathbb{R}^d)$ is bounded.
• $\sigma$ is $s$-times differentiable and its derivative $\partial^\alpha \sigma$ is bounded for any multi-index $\alpha \in \{(\alpha_1,\ldots,\alpha_d) \mid \sum_{j=1}^d \alpha_j \le s\}$.
• $\sigma$ is bi-Lipschitz, i.e., $\sigma$ is bijective and both $\sigma$ and $\sigma^{-1}$ are Lipschitz continuous.
Proof For $h \in H_p(\mathbb{R}^d)$, we have $\|K_\sigma h\|_{H_p(\mathbb{R}^d)}^2 = \sum_{|\alpha|\le s} \|\partial^\alpha(h\circ\sigma)\|_{L^2(\mathbb{R}^d)}^2$. We denote $\sigma(x) = (\sigma_1(x),\ldots,\sigma_d(x))$ and $D^\gamma \sigma(x) = (\partial^\gamma \sigma_1(x),\ldots,\partial^\gamma \sigma_d(x))$ for $\gamma \in \mathbb{N}^d$. By the Faà di Bruno formula, we have

$$\partial^\alpha (h\circ\sigma)(x) = \sum_{|\beta|\le|\alpha|} \partial^\beta h(\sigma(x)) \sum_{i=1}^{|\alpha|} \sum_{\gamma \in p(\alpha,\beta)} \alpha! \prod_{j=1}^{i} \frac{(\partial^{l_j}\sigma(x))^{k_j}}{k_j!\,(l_j!)^{|k_j|}},$$

where $p(\alpha,\beta) = \{\gamma = (k_1,\ldots,k_s,l_1,\ldots,l_s) \mid 0 \le l_1 \le \cdots \le l_s,\ \sum_{j=1}^s k_j = \alpha,\ \sum_{j=1}^s |k_j|\, l_j = \beta\}$. Thus, $\partial^\alpha(h\circ\sigma)(x)$ is written as a finite weighted sum of $\partial^\beta h(\sigma(x)) \prod_{i=1}^m (D^{\gamma_i}\sigma(x))^{\delta_i}$ for some $m \le |\alpha|$ and $\beta, \gamma_i, \delta_i \in \mathbb{N}^d$ with $|\beta| \le |\alpha|$, $|\gamma_i| \le |\alpha|$, $|\delta_i| \le |\alpha|$. By the boundedness of the derivatives of $\sigma$, there exists $C_{\beta,\gamma,\delta} > 0$ such that

$$\int_{\mathbb{R}^d} \Big| \partial^\beta h(\sigma(x)) \prod_{i=1}^m (D^{\gamma_i}\sigma(x))^{\delta_i} \Big|^2 \mathrm{d}x \le C_{\beta,\gamma,\delta} \int_{\mathbb{R}^d} |\partial^\beta h(\sigma(x))|^2 \mathrm{d}x.$$

Moreover, by the Lipschitzness of $\sigma^{-1}$, there exists $\tilde{C} > 0$ such that

$$\int_{\mathbb{R}^d} |\partial^\beta h(\sigma(x))|^2 \mathrm{d}x \le \|\det(J_{\sigma^{-1}})\|_\infty \int_{\mathbb{R}^d} |\partial^\beta h(x)|^2 \mathrm{d}x \le \tilde{C} \int_{\mathbb{R}^d} |\partial^\beta h(x)|^2 \mathrm{d}x,$$

where $J_{\sigma^{-1}}$ is the Jacobian of $\sigma^{-1}$. This shows the boundedness of $K_\sigma$. □
Theorem 4 The Rademacher complexity $\hat{R}_n(\mathbf{x}, F_{\mathrm{inv}}(C,D))$ is bounded as

$$\hat{R}_n(\mathbf{x}, F_{\mathrm{inv}}(C,D)) \le \frac{B\|g\|_{H_L}}{\sqrt{n}} \sup_{W_j \in \mathcal{W}(C,D)} \left( \prod_{j=1}^{L} \frac{\|p_j/(p_{j-1}\circ W_j^*)\|_\infty^{1/2}}{|\det(W_j)|^{1/2}} \right) \left( \prod_{j=1}^{L-1} \|K_{\sigma_j}\| \right).$$
We use the following lemma to show Theorem 4.
Lemma A
Assume $W_j : \mathbb{R}^d \to \mathbb{R}^d$ is invertible for $j = 1,\ldots,L$. Then, for $j = 1,\ldots,L$, we have

$$\|K_{W_j}\| \le \left( \Big\| \frac{p_j}{p_{j-1}\circ W_j^*} \Big\|_\infty \frac{1}{|\det(W_j)|} \right)^{1/2}, \qquad \|K_{b_j}\| = 1.$$
Proof For $h \in H_j$, we have $(h\circ W_j)^{\wedge}(\omega) = \int_{\mathbb{R}^d} h(W_j x)\,\mathrm{e}^{-\mathrm{i}x\cdot\omega}\,\mathrm{d}x = \hat{h}(W_j^{-*}\omega)/|\det(W_j)|$. Thus, the norm of the Koopman operator is bounded as

$$\|K_{W_j} h\|_{H_{j-1}}^2 = \int_{\mathbb{R}^d} \frac{|\hat{h}(W_j^{-*}\omega)|^2}{|\det(W_j)|^2\, p_{j-1}(\omega)}\, \mathrm{d}\omega \le \|h\|_{H_j}^2 \sup_{\omega\in\mathbb{R}^d} \Big| \frac{p_j(\omega)}{p_{j-1}(W_j^*\omega)} \Big| \frac{1}{|\det(W_j)|}.$$

In addition, for $h \in H_j$, we have $(h\circ b_j)^{\wedge}(\omega) = \mathrm{e}^{-\mathrm{i}a_j\cdot\omega}\hat{h}(\omega)$. Thus, we obtain $\|K_{b_j}h\|^2 = \|h\|^2$. □
Proof of Theorem 4 Let $x_1,\ldots,x_n \in \mathbb{R}^{d_0}$ and let $s_1,\ldots,s_n$ be i.i.d. Rademacher variables (random variables following the uniform distribution on $\{-1,1\}$). By the reproducing property of $H_0$ and the Cauchy–Schwarz inequality, we have

$$\frac{1}{n} \mathrm{E}\Big[ \sup_{f\in F_{\mathrm{inv}}(C,D)} \sum_{i=1}^n s_i f(x_i) \Big] = \frac{1}{n} \mathrm{E}\Big[ \sup_{f\in F_{\mathrm{inv}}(C,D)} \sum_{i=1}^n \langle s_i k_{p_0}(\cdot,x_i), f\rangle_{H_0} \Big] \le \frac{1}{n} \mathrm{E}\Big[ \sup_{f\in F_{\mathrm{inv}}(C,D)} \Big( \sum_{i,j=1}^n s_i s_j k_{p_0}(x_i,x_j) \Big)^{1/2} \|f\|_{H_0} \Big]$$
$$\le \frac{1}{n} \sup_{f\in F_{\mathrm{inv}}(C,D)} \|f\|_{H_0}\, \mathrm{E}^{1/2}\Big[ \sum_{i,j=1}^n s_i s_j k_{p_0}(x_i,x_j) \Big] \le \frac{1}{n} \sup_{f\in F_{\mathrm{inv}}(C,D)} \|f\|_{H_0} \Big( \sum_{i=1}^n k_{p_0}(x_i,x_i) \Big)^{1/2}$$
$$\le \frac{B}{\sqrt{n}} \sup_{W_j\in\mathcal{W}(C,D)} \| K_{W_1} K_{b_1} K_{\sigma_1} \cdots K_{W_L} K_{b_L}\, g \|_{H_0} \le \frac{B}{\sqrt{n}} \sup_{W_j\in\mathcal{W}(C,D)} \Big( \prod_{j=1}^{L} \|K_{W_j}\| \|K_{b_j}\| \Big) \Big( \prod_{j=1}^{L-1} \|K_{\sigma_j}\| \Big) \|g\|_{H_L}, \quad (7)$$

where the third inequality is derived by Jensen's inequality. By Lemma A, we obtain the final result. □
Lemma 5 Let $p(\omega) = 1/(1+\|\omega\|^2)^s$ for $s > d/2$, and let $p_j = p$ for $j = 0,\ldots,L$. Then, we have $\|p/(p\circ W_j^*)\|_\infty \le \max\{1, \|W_j\|^{2s}\}$.
Proof By the definition of $p$, we have

$$\Big\| \frac{p}{p\circ W_j^*} \Big\|_\infty = \sup_{\omega\in\mathbb{R}^d} \Big| \frac{p(\omega)}{p(W_j^*\omega)} \Big| = \sup_{\omega\in\mathbb{R}^d} \Big| \Big( \frac{1+\|W_j^*\omega\|^2}{1+\|\omega\|^2} \Big)^s \Big| \le \max\{1, \|W_j\|^{2s}\}. \qquad \square$$
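Lemma 5 can also be checked numerically. The sketch below (ours, for illustration only) samples random frequencies $\omega$ and verifies that the ratio $\big((1+\|W^*\omega\|^2)/(1+\|\omega\|^2)\big)^s$ never exceeds $\max\{1, \|W\|^{2s}\}$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
s = 2
bound = max(1.0, np.linalg.norm(W, 2) ** (2 * s))  # max{1, ||W||^{2s}}

worst = 0.0
for _ in range(1000):
    # sample frequencies at a wide range of scales
    w = rng.standard_normal(3) * rng.uniform(0.1, 100.0)
    ratio = ((1 + np.linalg.norm(W.T @ w) ** 2) / (1 + np.linalg.norm(w) ** 2)) ** s
    worst = max(worst, ratio)
print(worst, bound)
```

The inequality follows from $1+\|W^*\omega\|^2 \le \max\{1,\|W\|^2\}(1+\|\omega\|^2)$, so the sampled maximum stays below the bound.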
Theorem 9 Let $\mathcal{W}_j(C,D) = \{ W \in \mathbb{R}^{d_j\times d_{j-1}} \mid d_j \ge d_{j-1},\ \|W\| \le C,\ \det(W^*W) \ge D \}$ and $F_{\mathrm{inj}}(C,D) = \{ f \in F \mid W_j \in \mathcal{W}_j(C,D) \}$. The Rademacher complexity $\hat{R}_n(\mathbf{x}, F_{\mathrm{inj}}(C,D))$ is bounded as

$$\hat{R}_n(\mathbf{x}, F_{\mathrm{inj}}(C,D)) \le \frac{B\|g\|_{H_L}}{\sqrt{n}} \sup_{W_j\in\mathcal{W}_j(C,D)} \left( \prod_{j=1}^{L} \frac{\|p_{j-1}/(p_{j-1}\circ W_j^*)\|_{\mathcal{R}(W_j),\infty}^{1/2}\, G_j}{\det(W_j^*W_j)^{1/4}} \right) \left( \prod_{j=1}^{L-1} \|K_{\sigma_j}\| \right),$$

where $G_j = \| f_j|_{\mathcal{R}(W_j)} \|_{H_{p_{j-1}}(\mathcal{R}(W_j))} / \|f_j\|_{H_j}$ and $f_j = g\circ b_L\circ W_L\circ \sigma_{L-1}\circ b_{L-1}\circ W_{L-1}\circ \cdots \circ \sigma_j\circ b_j$.
Proof For $h \in H_j$, we have

$$(h\circ W_j)^{\wedge}(\omega) = \int_{\mathbb{R}^{d_{j-1}}} h(W_j x)\,\mathrm{e}^{-\mathrm{i}x\cdot\omega}\,\mathrm{d}x = \int_{\mathcal{R}(W_j)} h(x)\,\mathrm{e}^{-\mathrm{i}x\cdot W_j^{-*}\omega}\,\mathrm{d}x\, \frac{1}{|\det(R_j)|} = \frac{\hat{h}(W_j^{-*}\omega)}{|\det(R_j)|},$$

where $W_j = Q_j R_j$ is the QR decomposition of $W_j$ and $\mathcal{R}(W_j)$ is the range of $W_j$. In addition, we regard $W_j : \mathbb{R}^{d_{j-1}} \to \mathcal{R}(W_j)$. Since $|\det(R_j)| = (\det(R_j^* R_j))^{1/2} = (\det(W_j^* W_j))^{1/2}$, the norm of the Koopman operator is bounded as

$$\|K_{W_j}h\|_{H_{j-1}}^2 = \int_{\mathbb{R}^{d_{j-1}}} \frac{|\hat{h}(W_j^{-*}\omega)|^2}{\det(W_j^* W_j)\, p_{j-1}(\omega)}\,\mathrm{d}\omega = \int_{\mathcal{R}(W_j)} \frac{|\hat{h}(\omega)|^2}{\det(W_j^* W_j)^{1/2}\, p_{j-1}(W_j^*\omega)}\,\mathrm{d}\omega \le \| h|_{\mathcal{R}(W_j)} \|_{H_{p_{j-1}}(\mathcal{R}(W_j))}^2 \sup_{\omega\in\mathcal{R}(W_j)} \Big| \frac{p_{j-1}(\omega)}{p_{j-1}(W_j^*\omega)} \Big| \frac{1}{\det(W_j^* W_j)^{1/2}}. \quad (8)$$

Thus, we have

$$\|f\|_{H_0} = \|K_{W_1} f_1\|_{H_0} \le \| f_1|_{\mathcal{R}(W_1)} \|_{H_{p_0}(\mathcal{R}(W_1))} \sup_{\omega\in\mathcal{R}(W_1)} \Big| \frac{p_0(\omega)}{p_0(W_1^*\omega)} \Big|^{1/2} \frac{1}{\det(W_1^* W_1)^{1/4}} = G_1 \|f_1\|_{H_1} \sup_{\omega\in\mathcal{R}(W_1)} \Big| \frac{p_0(\omega)}{p_0(W_1^*\omega)} \Big|^{1/2} \frac{1}{\det(W_1^* W_1)^{1/4}}$$
$$\le G_1 \|K_{\sigma_1}\| \|K_{W_2} f_2\|_{H_1} \sup_{\omega\in\mathcal{R}(W_1)} \Big| \frac{p_0(\omega)}{p_0(W_1^*\omega)} \Big|^{1/2} \frac{1}{\det(W_1^* W_1)^{1/4}}.$$

Applying the inequality (8) iteratively, we obtain

$$\|f\|_{H_0} \le \prod_{j=1}^{L} \frac{\|p_{j-1}/(p_{j-1}\circ W_j^*)\|_{\mathcal{R}(W_j),\infty}^{1/2}\, G_j\, \|K_{\sigma_j}\|}{\det(W_j^* W_j)^{1/4}}\, \|g\|_{H_L}. \quad (9)$$

Applying the inequality (9) to $\|f\|_{H_0}$ in the inequality (7) completes the proof. □
Lemma 10 Let $p_j(\omega) = 1/(1+\|\omega\|^2)^{s_j}$ with $s_j > d_j/2$ for $j = 0,\ldots,L$. Then, we have $\|p_{j-1}/(p_{j-1}\circ W_j^*)\|_{\mathcal{R}(W_j),\infty} \le \max\{1, \|W_j\|^{2s_{j-1}}\}$.
Proof By the definition of $p_j$ and the assumption $s_j \ge s_{j-1}$, we have

$$\Big\| \frac{p_{j-1}}{p_{j-1}\circ W_j^*} \Big\|_{\mathcal{R}(W_j),\infty} = \sup_{\omega\in\mathcal{R}(W_j)} \Big| \frac{p_{j-1}(\omega)}{p_{j-1}(W_j^*\omega)} \Big| \le \sup_{\omega\in\mathcal{R}(W_j)} \Big| \frac{(1+\|W_j^*\omega\|^2)^{s_{j-1}}}{(1+\|\omega\|^2)^{s_{j-1}}} \Big| \le \max\{1, \|W_j\|^{2s_{j-1}}\} \sup_{\omega\in\mathcal{R}(W_j)} \Big| \Big( \frac{1+\|\omega\|^2}{1+\|\omega\|^2} \Big)^{s_{j-1}} \Big| = \max\{1, \|W_j\|^{2s_{j-1}}\}. \qquad \square$$
Lemma 11 Let $p_j(\omega) = 1/(1+\|\omega\|^2)^{s_j}$ with $s_j > d_j/2$ for $j = 0,\ldots,L$. Let $s_j \ge s_{j-1}$ and let $\tilde{H}_j$ be the Sobolev space on $\mathcal{R}(W_j)^\perp$ with $p(\omega) = 1/(1+\|\omega\|^2)^{s_j-s_{j-1}}$. Then, $G_j \le G_{j,0}\, \| f_j|_{\mathcal{R}(W_j)} \cdot f_j|_{\mathcal{R}(W_j)^\perp} \|_{H_j} / \|f_j\|_{H_j}$, where $G_{j,0} = \|f_j\|_{\tilde{H}_j}^{-1}$.
Remark B
Since $W_j$ is injective, $\dim(\mathcal{R}(W_j)^\perp) = d_j - d_{j-1}$. Thus, we take $s_j - s_{j-1} > \dim(\mathcal{R}(W_j)^\perp)/2$.
Proof By the definition of $G_{j,0}$, and since $\mathcal{R}(W_j)$ and $\mathcal{R}(W_j)^\perp$ are orthogonal, we have

$$\|f_j\|_{H_{p_{j-1}}(\mathcal{R}(W_j))}^2 = \int_{\mathcal{R}(W_j)} |\hat{f}_j(\omega)|^2 \frac{1}{p_{j-1}(\omega)}\,\mathrm{d}\omega = G_{j,0}^2 \int_{\mathcal{R}(W_j)} |\hat{f}_j(\omega_1)|^2 (1+\|\omega_1\|^2)^{s_{j-1}}\,\mathrm{d}\omega_1 \int_{\mathcal{R}(W_j)^\perp} |\hat{f}_j(\omega_2)|^2 (1+\|\omega_2\|^2)^{s_j-s_{j-1}}\,\mathrm{d}\omega_2$$
$$= G_{j,0}^2 \int_{\mathcal{R}(W_j)} \int_{\mathcal{R}(W_j)^\perp} |\hat{f}_j(\omega_1)|^2 |\hat{f}_j(\omega_2)|^2 \frac{1}{p_j(\omega_1+\omega_2)} \cdot \frac{(1+\|\omega_1\|^2)^{s_{j-1}} (1+\|\omega_2\|^2)^{s_j-s_{j-1}}}{(1+\|\omega_1+\omega_2\|^2)^{s_j}}\,\mathrm{d}\omega_2\,\mathrm{d}\omega_1$$
$$= G_{j,0}^2 \int_{\mathcal{R}(W_j)} \int_{\mathcal{R}(W_j)^\perp} |\hat{f}_j(\omega_1)|^2 |\hat{f}_j(\omega_2)|^2 \frac{1}{p_j(\omega_1+\omega_2)} \cdot \frac{(1+\|\omega_1\|^2)^{s_{j-1}} (1+\|\omega_2\|^2)^{s_j-s_{j-1}}}{(1+\|\omega_1\|^2+\|\omega_2\|^2)^{s_{j-1}} (1+\|\omega_1\|^2+\|\omega_2\|^2)^{s_j-s_{j-1}}}\,\mathrm{d}\omega_2\,\mathrm{d}\omega_1$$
$$\le G_{j,0}^2 \int_{\mathcal{R}(W_j)} \int_{\mathcal{R}(W_j)^\perp} |\hat{f}_j(\omega_1)\hat{f}_j(\omega_2)|^2 \frac{1}{p_j(\omega_1+\omega_2)}\,\mathrm{d}\omega_2\,\mathrm{d}\omega_1 = G_{j,0}^2 \int_{\mathbb{R}^{d_j}} |\hat{\tilde{f}}_j(\omega)|^2 \frac{1}{p_j(\omega)}\,\mathrm{d}\omega = G_{j,0}^2 \|\tilde{f}_j\|_{H_j}^2 = G_{j,0}^2 \|f_j\|_{H_j}^2\, \frac{\|\tilde{f}_j\|_{H_j}^2}{\|f_j\|_{H_j}^2},$$

where $\tilde{h}(x) = h(x_1)h(x_2)$ for $x = x_1+x_2$, $x_1 \in \mathcal{R}(W_j)$, and $x_2 \in \mathcal{R}(W_j)^\perp$. Note that since $\mathcal{R}(W_j)$ and $\mathcal{R}(W_j)^\perp$ are orthogonal, we have $\hat{h}(\omega_1)\hat{h}(\omega_2) = \hat{\tilde{h}}(\omega)$. □
Proposition 16 Let $p_j(\omega) = 1/(1+\|\omega\|^2)^{s_j}$ with $s_j \in \mathbb{N}$, $s_j \ge s_{j-1}$, and $s_j > d_j/2$. Let $\psi_j$ be a function satisfying $\psi_j(x) = \psi_{j,1}(x_1)$ for some $\psi_{j,1} \in H_{p_{j-1}}(\ker(W_j))$, where $x = x_1 + x_2$ for $x_1 \in \ker(W_j)$ and $x_2 \in \ker(W_j)^\perp$, and $\ker(W_j)$ is the kernel of $W_j$. Let $\mathcal{W}_{\mathrm{r},j}(C,D) = \{ W \in \mathbb{R}^{d_j\times d_{j-1}} \mid \|W\| \le C,\ |\det(W_{\mathrm{r}})| \ge D \}$ and $F_{\mathrm{r}}(C,D) = \{ f \in F \mid W_j \in \mathcal{W}_{\mathrm{r},j}(C,D) \}$. Moreover, let $G_j = \|f_j\|_{H_{j-1}(\ker(W_j)^\perp)} / \|f_j\|_{H_j}$. Then, we have

$$\hat{R}_n(\mathbf{x}, F_{\mathrm{r}}(C,D)) \le \frac{B\|g\|_{H_L}}{\sqrt{n}} \sup_{W_j\in\mathcal{W}_{\mathrm{r},j}(C,D)} \left( \prod_{j=1}^{L} \frac{\|\psi_{j,1}\|_{\tilde{H}_{j-1}}\, G_j \max\{1, \|W_j\|^{s_{j-1}}\}}{|\det(W_{\mathrm{r},j})|^{1/2}} \right) \left( \prod_{j=1}^{L-1} \|K_{\sigma_j}\| \right).$$
Proof For $h \in H_j$, we have $\|\tilde{K}_{W_j} h\|_{H_{j-1}}^2 = \sum_{|\alpha|\le s_{j-1}} \|\partial^\alpha(\psi_j \cdot (h\circ W_j))\|_{L^2(\mathbb{R}^{d_{j-1}})}^2$, where the directions of the derivatives are along the directions of $\ker(W_j)^\perp$ and $\ker(W_j)$. We have

$$\int_{\mathbb{R}^{d_{j-1}}} |\partial^\beta \psi_j(x)\, \partial^\gamma(h\circ W_j)(x)|^2\,\mathrm{d}x \le \|W_j\|^{2|\gamma|} \int_{\mathbb{R}^{d_{j-1}}} |\partial^\beta \psi_j(x)\, (\partial^\gamma h)(W_j x)|^2\,\mathrm{d}x. \quad (10)$$

In addition, let $\phi$ be a function satisfying $\phi(x) = \phi_1(x_1)$ for some $\phi_1 \in H_{p_{j-1}}(\ker(W_j))$, where $x = x_1+x_2$ for $x_1 \in \ker(W_j)$ and $x_2 \in \ker(W_j)^\perp$. Let $u \in H_{p_{j-1}}(\mathbb{R}^{d_{j-1}})$. Then, we have

$$\int_{\mathbb{R}^{d_{j-1}}} |\phi(x)\, u(W_j x)|^2\,\mathrm{d}x = \int_{\ker(W_j)} \int_{\ker(W_j)^\perp} |\phi_1(x_1)\, u(W_j x_2)|^2\,\mathrm{d}x_2\,\mathrm{d}x_1 = \int_{\ker(W_j)} |\phi_1(x_1)|^2\,\mathrm{d}x_1 \int_{\ker(W_j)^\perp} |u(W_j x_2)|^2\,\mathrm{d}x_2 = \|\phi_1\|_{L^2(\ker(W_j))}^2 \int_{\ker(W_j)^\perp} |u(W_j x_2)|^2\,\mathrm{d}x_2. \quad (11)$$

Combining Eqs. (10) and (11), we obtain

$$\int_{\mathbb{R}^{d_{j-1}}} |\partial^\beta \psi_j(x)\, (\partial^\gamma h \circ W_j)(x)|^2\,\mathrm{d}x \le \|W_j\|^{2|\gamma|} \|\partial^\beta \psi_j\|_{L^2(\ker(W_j))}^2 \int_{\ker(W_j)^\perp} |(\partial^\gamma h)(W_j x)|^2\,\mathrm{d}x \le \|W_j\|^{2|\gamma|} \|\partial^\beta \psi_j\|_{L^2(\ker(W_j))}^2 \frac{1}{|\det(W_{\mathrm{r},j})|} \int_{\ker(W_j)^\perp} |\partial^\gamma h(x)|^2\,\mathrm{d}x.$$

As a result, we have

$$\|\tilde{K}_{W_j} h\|_{H_{j-1}}^2 = \sum_{|\alpha|\le s_{j-1}} c_{\alpha,s_{j-1},d_{j-1}} \|\partial^\alpha(\psi_j \cdot (h\circ W_j))\|_{L^2(\mathbb{R}^{d_{j-1}})}^2 = \sum_{|\alpha|\le s_{j-1}} c_{\alpha,s_{j-1},d_{j-1}} \int_{\ker(W_j)} \int_{\ker(W_j)^\perp} |\partial^\beta \psi_{j,1}(x_1)\, \partial^{\alpha-\beta}(h\circ W_j)(x_2)|^2\,\mathrm{d}x_2\,\mathrm{d}x_1$$
$$= \sum_{|\beta|\le s_{j-1}} \sum_{|\gamma|\le s_{j-1}-|\beta|} c_{\beta,s_{j-1},r_j}\, c_{\gamma,s_{j-1}-|\beta|,d_{j-1}-r_j} \int_{\ker(W_j)} \int_{\ker(W_j)^\perp} |\partial^\beta \psi_{j,1}(x_1)\, \partial^\gamma(h\circ W_j)(x_2)|^2\,\mathrm{d}x_2\,\mathrm{d}x_1$$
$$\le \sum_{|\beta|\le s_{j-1}} \sum_{|\gamma|\le s_{j-1}-|\beta|} c_{\beta,s_{j-1},r_j}\, c_{\gamma,s_{j-1}-|\beta|,d_{j-1}-r_j} \|\partial^\beta \psi_{j,1}\|_{L^2(\ker(W_j))}^2 \frac{\|W_j\|^{2|\gamma|}}{|\det(W_{\mathrm{r},j})|} \|\partial^\gamma h\|_{L^2(\ker(W_j)^\perp)}^2$$
$$\le \frac{\max\{1, \|W_j\|^{2s_{j-1}}\}}{|\det(W_{\mathrm{r},j})|} \sum_{|\beta|\le s_{j-1}} c_{\beta,s_{j-1},r_j} \|\partial^\beta \psi_{j,1}\|_{L^2(\ker(W_j))}^2 \sum_{|\gamma|\le s_{j-1}} c_{\gamma,s_{j-1},d_{j-1}-r_j} \|\partial^\gamma h\|_{L^2(\ker(W_j)^\perp)}^2$$
$$\le \frac{\max\{1, \|W_j\|^{2s_{j-1}}\}}{|\det(W_{\mathrm{r},j})|}\, \|\psi_{j,1}\|_{H_{j-1}(\ker(W_j))}^2\, \|h\|_{H_{j-1}(\ker(W_j)^\perp)}^2,$$

where $\beta$ in the second line of the above formula is the multi-index whose elements corresponding to $\ker(W_j)$ equal those of $\alpha$ and whose other elements are zero. In addition, $r_j = \dim(\ker(W_j))$ and $c_{\alpha,s,d} = (2\pi)^d s!/(\alpha!\,(s-|\alpha|)!)$. Therefore, setting $h = f_j$, we have

$$\|\tilde{K}_{W_j} f_j\|_{H_{j-1}} \le \frac{\|\psi_{j,1}\|_{H_{j-1}(\ker(W_j))}\, G_j \max\{1, \|W_j\|^{s_{j-1}}\}}{|\det(W_{\mathrm{r},j})|^{1/2}}\, \|f_j\|_{H_j},$$

which completes the proof of the proposition. □
Proposition 19 Let $\tilde{\mathbf{x}} = (\tilde{x}_1,\ldots,\tilde{x}_n) \in (\mathbb{R}^{d_l})^n$. Let $v_n(\omega) = \sum_{i=1}^n s_i(\omega)\, k_{p_0}(\cdot, x_i)$, $\tilde{v}_n(\omega) = \sum_{i=1}^n s_i(\omega)\, k_{p_l}(\cdot, \tilde{x}_i)$, and $\gamma_n = \|v_n\|_{H_0} / \|\tilde{v}_n\|_{H_l}$. Then, we have

$$\hat{R}_n(\mathbf{x}, F_{l,\mathrm{comb}}(C,D)) \le \sup_{\substack{W_j\in\mathcal{W}_j(C,D)\\ (j=1,\ldots,l)}} \prod_{j=1}^{l} \frac{\|p_{j-1}/(p_{j-1}\circ W_j^*)\|_{\mathcal{R}(W_j),\infty}^{1/2}\, G_j\, \|K_{\sigma_j}\|}{\det(W_j^* W_j)^{1/4}} \cdot \left( \hat{R}_n(\tilde{\mathbf{x}}, F_{l+1:L}) + \frac{B}{\sqrt{n}} \inf_{h_1\in F_{l+1:L}} \mathrm{E}^{1/2}\Big[ \sup_{h_2\in F_{l+1:L}} \Big\| h_1 - \frac{\|h_2\|_{H_l}\,\gamma_n}{\|\tilde{v}_n\|_{H_l}}\, \tilde{v}_n \Big\|_{H_l}^2 \Big] \right).$$
Proof To simplify the notation, let

$$\beta_j = \frac{\|p_{j-1}/(p_{j-1}\circ W_j^*)\|_{\mathcal{R}(W_j),\infty}^{1/2}\, G_j\, \|K_{\sigma_j}\|}{\det(W_j^* W_j)^{1/4}}.$$

By the reproducing property of $H_0$ and the Cauchy–Schwarz inequality, we have

$$\hat{R}_n(\mathbf{x}, F_{l,\mathrm{comb}}(C,D)) = \frac{1}{n}\mathrm{E}\Big[\sup_{f\in F_{l,\mathrm{comb}}(C,D)} \sum_{i=1}^n s_i f(x_i)\Big] = \frac{1}{n}\mathrm{E}\Big[\sup_{f\in F_{l,\mathrm{comb}}(C,D)} \langle v_n, f\rangle_{H_0}\Big]$$
$$\le \frac{1}{n}\mathrm{E}\Big[\sup_{f\in F_{l,\mathrm{comb}}(C,D)} \|v_n\|_{H_0}\, \| K_{W_1}K_{b_1}K_{\sigma_1}\cdots K_{W_l}K_{b_l}K_{\sigma_l}\, K_{W_{l+1}}K_{b_{l+1}}K_{\sigma_{l+1}}\cdots K_{W_L}K_{b_L}\, g \|_{H_0}\Big]$$
$$\le \frac{1}{n}\mathrm{E}\Big[\sup_{f\in F_{l,\mathrm{comb}}(C,D)} \|v_n\|_{H_0} \prod_{j=1}^{l} \frac{\|p_{j-1}/(p_{j-1}\circ W_j^*)\|_{\mathcal{R}(W_j),\infty}^{1/2}\, G_j\, \|K_{\sigma_j}\|}{\det(W_j^* W_j)^{1/4}} \cdot \| K_{W_{l+1}}K_{b_{l+1}}K_{\sigma_{l+1}}\cdots K_{W_L}K_{b_L}\, g \|_{H_l}\Big]$$
$$\le \frac{1}{n}\mathrm{E}\Big[ \sup_{\substack{W_j\in\mathcal{W}_j(C,D)\\(j=1,\ldots,l)}} \prod_{j=1}^{l}\beta_j \sup_{h_2\in F_{l+1:L}} \Big\langle \tilde{v}_n, \frac{\|h_2\|_{H_l}\|v_n\|_{H_0}}{\|\tilde{v}_n\|_{H_l}^2}\, \tilde{v}_n \Big\rangle_{H_l} \Big]$$
$$= \frac{1}{n}\sup_{\substack{W_j\in\mathcal{W}_j(C,D)\\(j=1,\ldots,l)}} \prod_{j=1}^{l}\beta_j\, \mathrm{E}\Big[ \sup_{h_2\in F_{l+1:L}} \Big( \langle \tilde{v}_n, h\rangle_{H_l} + \Big\langle \tilde{v}_n, \frac{\|h_2\|_{H_l}\gamma_n}{\|\tilde{v}_n\|_{H_l}}\, \tilde{v}_n - h \Big\rangle_{H_l} \Big) \Big]$$
$$\le \frac{1}{n}\sup_{\substack{W_j\in\mathcal{W}_j(C,D)\\(j=1,\ldots,l)}} \prod_{j=1}^{l}\beta_j\, \mathrm{E}\Big[ \sup_{h_1\in F_{l+1:L}} \langle \tilde{v}_n, h_1\rangle_{H_l} + \sup_{h_2\in F_{l+1:L}} \Big\langle \tilde{v}_n, \frac{\|h_2\|_{H_l}\gamma_n}{\|\tilde{v}_n\|_{H_l}}\, \tilde{v}_n - h \Big\rangle_{H_l} \Big]$$
$$\le \sup_{\substack{W_j\in\mathcal{W}_j(C,D)\\(j=1,\ldots,l)}} \prod_{j=1}^{l}\beta_j \Big( \hat{R}_n(\tilde{\mathbf{x}}, F_{l+1:L}) + \frac{1}{n}\mathrm{E}\Big[ \|\tilde{v}_n\|_{H_l} \sup_{h_2\in F_{l+1:L}} \Big\| \frac{\|h_2\|_{H_l}\gamma_n}{\|\tilde{v}_n\|_{H_l}}\, \tilde{v}_n - h \Big\|_{H_l} \Big] \Big)$$

for any $h \in F_{l+1:L}$. Moreover, again by the Cauchy–Schwarz inequality, we have

$$\mathrm{E}\Big[ \|\tilde{v}_n\|_{H_l} \sup_{h_2\in F_{l+1:L}} \Big\| \frac{\|h_2\|_{H_l}\gamma_n}{\|\tilde{v}_n\|_{H_l}}\,\tilde{v}_n - h \Big\|_{H_l} \Big] \le \mathrm{E}^{1/2}\big[ \|\tilde{v}_n\|_{H_l}^2 \big]\, \mathrm{E}^{1/2}\Big[ \sup_{h_2\in F_{l+1:L}} \Big\| \frac{\|h_2\|_{H_l}\gamma_n}{\|\tilde{v}_n\|_{H_l}}\,\tilde{v}_n - h \Big\|_{H_l}^2 \Big] \le B\sqrt{n}\, \mathrm{E}^{1/2}\Big[ \sup_{h_2\in F_{l+1:L}} \Big\| \frac{\|h_2\|_{H_l}\gamma_n}{\|\tilde{v}_n\|_{H_l}}\,\tilde{v}_n - h \Big\|_{H_l}^2 \Big],$$

where the second inequality is derived in the same manner as in the proof of Theorem 4. Since $h \in F_{l+1:L}$ is arbitrary, we obtain the final result. □
Appendix B Details of Remark 3
To derive a bound on $\|K_\sigma\|$, we bound $\sum_{|\alpha|\le s} c_{\alpha,s,d} \|\partial^\alpha(h\circ\sigma)\|^2$ by $\sum_{|\alpha|\le s} c_{\alpha,s,d} \|\partial^\alpha h\|^2$. As in the proof of the boundedness of $K_\sigma$, one strategy is to use the Faà di Bruno formula. By the Faà di Bruno formula, we have

$$\partial^\alpha(h\circ\sigma)(x) = \sum_{|\beta|\le|\alpha|} \partial^\beta h(\sigma(x)) \sum_{i=1}^{|\alpha|} \sum_{\gamma\in p(\alpha,\beta)} \alpha! \prod_{j=1}^{i} \frac{(\partial^{l_j}\sigma(x))^{k_j}}{k_j!\,(l_j!)^{|k_j|}},$$

where $p(\alpha,\beta) = \{\gamma = (k_1,\ldots,k_s,l_1,\ldots,l_s) \mid 0\le l_1\le\cdots\le l_s,\ \sum_{j=1}^s k_j = \alpha,\ \sum_{j=1}^s |k_j|\, l_j = \beta\}$. If $\sigma$ is elementwise, $l_j$ and $k_j$ are chosen so that each of them has only one nonzero element, such as $(|l_j|, 0, \ldots, 0)$. By counting the number of terms in the summation and calculating their coefficients, we can derive a bound on $\|K_\sigma\|$. However, analytically representing the number of terms in the summation is challenging, and we admit that this strategy does not give a tight bound. There may be a more sophisticated approach to deriving a tight bound on $\|K_\sigma\|$. However, the main goal of this paper is to investigate how the properties of the weight matrices affect the generalization property. Since $\|K_\sigma\|$ does not depend on the weight matrices, if the structure of the network is given, the properties of the weight matrices do not affect $\|K_\sigma\|$. As stated in Section 6, investigating $\|K_\sigma\|$ and deriving a tighter bound is future work.
Appendix CDetails of Remark 12
We first show that 𝐺 𝑗 is bounded by a constant that is independent of 𝑓 𝑗 . Since 𝐺 𝑗 depends on ker ( 𝑊 𝑗 ) , we denote it by 𝐺 𝑗 ( ker ( 𝑊 𝑗 ) ) . Let 𝒲 be a 𝑘 -dimensional subspace of ℝ 𝑑 and { 𝑢 1 , … , 𝑢 𝑘 } be an orthonormal basis of 𝒲 . We consider the average of 𝐺 𝑗 ( 𝒲 ) on the Grassmann manifold 𝒢 𝑑 , 𝑘 . For this purpose, we fix an orthonormal basis 𝑒 1 , … , 𝑒 𝑑 on ℝ 𝑑 and denote by ∂ 𝑖 𝑓 the derivative of 𝑓 in the direction of 𝑒 𝑖 . In addition, we denote ∂ 𝑈 𝛼
∏ 𝑗
1 𝑘 ( ∑ 𝑖
1 𝑑 𝑢 𝑗 , 𝑖 ∂ 𝑖 ) 𝛼 𝑗 , where 𝑢 𝑗 , 𝑖
⟨ 𝑢 𝑗 , 𝑒 𝑖 ⟩ . Let 𝑠 ∈ ℕ . Then, we have
‖ 𝑓 ‖ 𝐻 𝑠 ( 𝒲 ) 2
∑ | 𝛼 | ≤ 𝑠 𝑐 𝛼 , 𝑠 , 𝑑 ‖ ∂ 𝑈 𝛼 𝑓 ‖ 𝐿 2 ( 𝒲 ) 2 ≤ ∑ 𝑙
1 𝑠 ∑ | 𝛼 |
𝑙 ∑ | 𝛽 |
𝑙 𝑐 𝛼 , 𝑠 , 𝑑 𝐷 𝑠 , 𝑑 , 𝑘 ‖ ∂ 𝛽 𝑓 ‖ 𝐿 2 ( 𝒲 ) 2
𝐷 𝑠 , 𝑑 , 𝑘 ∑ 𝑙
1 𝑠 ∑ | 𝛼 |
𝑙 𝑐 𝛼 , 𝑠 , 𝑑 ∑ | 𝛽 |
𝑙 ‖ ∂ 𝛽 𝑓 ‖ 𝐿 2 ( 𝒲 ) 2
𝐷 𝑠 , 𝑑 , 𝑘 ∑ | 𝛼 | ≤ 𝑠 𝑐 𝛼 , 𝑠 , 𝑑 ∑ | 𝛽 | ≤ 𝑠 ‖ ∂ 𝛽 𝑓 ‖ 𝐿 2 ( 𝒲 ) 2
≤ 𝐷 𝑠 , 𝑑 , 𝑘 ( 2 𝜋 ) 𝑑 ( 𝑑 + 1 ) 𝑠 ∑ | 𝛽 | ≤ 𝑠 𝑐 𝛽 , 𝑠 , 𝑑 ‖ ∂ 𝛽 𝑓 ‖ 𝐿 2 ( 𝒲 ) 2 .
(12)
Here, we used the Cauchy–Schwarz inequality and derived the second inequality as follows:
$$\begin{aligned}
\Big|\prod_{j=1}^{k}\Big(\sum_{i=1}^{d}u_{j,i}\partial_i\Big)^{\alpha_j} f\Big|^2 &\le \Big(\prod_{j=1}^{k}\Big(\sum_{i=1}^{d}u_{j,i}^2\Big)^{\alpha_j}\Big)\Big(\prod_{j=1}^{k}\Big(\sum_{i=1}^{d}\mathcal{D}_i^2\Big)^{\alpha_j} f\Big) = \prod_{j=1}^{k}\Big(\sum_{i=1}^{d}\mathcal{D}_i^2\Big)^{\alpha_j} f \\
&= \prod_{j=1}^{k}\sum_{|\beta|=\alpha_j}\binom{\alpha_j}{\beta}\mathcal{D}_{\beta}^2 f \le D_{s,d,k}\sum_{|\beta|=|\alpha|}|\partial^{\beta} f|^2
\end{aligned}$$
for some $D_{s,d,k} > 0$ that depends on $s$, $d$, and $k$. Here, $\mathcal{D}_i^2$ is the operator defined by $\mathcal{D}_i^2 f = |\partial_i f|^2$, and $\mathcal{D}_{\beta}^2 f = |\partial_i^{\beta_i} f|^2$. Assume $\{x \in \mathbb{R}^d \mid \|x\| \le \epsilon\}$ is not contained in the support of $f$. In this case, we have
$$\sum_{|\beta|\le s} c_{\beta,s,d}\|\partial^{\beta} f\|_{L^2(\mathcal{W})}^2 \le \epsilon^{k-d}\sum_{|\beta|\le s} c_{\beta,s,d}\int_{\mathcal{W}}|\partial^{\beta} f(x)|^2\,|x|^{d-k}\,\mathrm{d}x.\tag{13}$$
Integrating both sides of (13) and applying Theorem 2 of Rubin (2018), we have
$$\int_{\mathcal{G}_{d,k}}\sum_{|\beta|\le s} c_{\beta,s,d}\|\partial^{\beta} f\|_{L^2(\mathcal{W})}^2\,\mathrm{d}\mathcal{W} \le \epsilon^{k-d}\sum_{|\beta|\le s} c_{\beta,s,d}\frac{\sigma_d}{\sigma_k}\int_{\mathbb{R}^d}|\partial^{\beta} f(x)|^2\,\mathrm{d}x,$$
where $\sigma_d = 2\pi^{d/2}/\Gamma(d/2)$ and $\Gamma$ is the Gamma function. In addition, $\mathrm{d}\mathcal{W}$ denotes integration with respect to the $O(d)$-invariant probability measure on $\mathcal{G}_{d,k}$, where $O(d)$ is the orthogonal group on $\mathbb{R}^d$. Combining with Eq. (12), we obtain
$$\int_{\mathcal{G}_{d,k}}\|f\|_{H^s(\mathcal{W})}^2\,\mathrm{d}\mathcal{W} \le D_{s,d,k}(2\pi)^d(d+1)^s\epsilon^{k-d}\frac{\sigma_d}{\sigma_k}\|f\|_{H^s(\mathbb{R}^d)}^2.$$
As a result, we obtain
$$\int_{\mathcal{G}_{d_j,d_{j-1}}} G_j(\mathcal{W})^2\,\mathrm{d}\mathcal{W} \le \int_{\mathcal{G}_{d_j,d_{j-1}}}\frac{\|f\|_{H^{s_j}(\mathcal{W})}^2}{\|f\|_{H_j}^2}\,\mathrm{d}\mathcal{W} \le D_{s_j,d_j,d_{j-1}}(d_j+1)^{s_j}(2\pi)^{d_j}\epsilon^{d_{j-1}-d_j}\frac{\sigma_{d_j}}{\sigma_{d_{j-1}}}.\tag{14}$$
We admit that this inequality is not tight with respect to its dependence on $s_j$, $d_j$, and $d_{j-1}$. However, surprisingly, inequality (14) shows that the factor $G_j$ is bounded by a constant independent of $f_j$ if $\{x \in \mathbb{R}^d \mid \|x\| \le \epsilon\}$ is not contained in the support of $f_j$. The assumption about $\{x \in \mathbb{R}^d \mid \|x\| \le \epsilon\}$ can be satisfied if the input is transformed so that it does not take values near $0$.
One of the reasons for the looseness of the above bound is that we upper bounded $\|f\|_{H^{s_{j-1}}(\mathcal{W})}$ by $\|f\|_{H^{s_j}(\mathcal{W})}$. If $f_j$ can be controlled by a Gaussian, then the factor $G_j$ does not seriously affect the bound. Indeed, let $\phi_c(x) = \mathrm{e}^{-\pi^2\|x\|^2/c}$. In the case of $|\hat{f}_j(\omega_1+\omega_2)| \ge |\hat{f}_j(\omega_1)|\phi_c(\omega_2)$ for $\omega_1 \in \mathcal{R}(W_j)$ and $\omega_2 \in \mathcal{R}(W_j)^{\perp}$, we can evaluate $G_j$ as follows:
$$\begin{aligned}
\|\phi_c\|_{H^s(\mathbb{R}^d)}^2 &= \int_{\omega\in\mathbb{R}^d}|\phi_c(\omega)|^2(1+\|\omega\|^2)^s\,\mathrm{d}\omega = \int_{\omega\in\mathbb{R}^d}\mathrm{e}^{-2\pi^2\|\omega\|^2/c}(1+\|\omega\|^2)^s\,\mathrm{d}\omega \\
&= \int_0^{\infty}\mathrm{e}^{-2\pi^2 r^2/c}(1+r^2)^s r^{d-1}\,\mathrm{d}r\cdot 2\pi\prod_{i=1}^{d-2}\tilde{c}_i = 2\pi\int_0^{\infty}\mathrm{e}^{-2\pi^2 r^2/c}\sum_{i=0}^{s}\binom{s}{i}r^{2i+d-1}\,\mathrm{d}r\,\prod_{i=1}^{d-2}\tilde{c}_i \\
&= 2\pi\sum_{i=0}^{s}\binom{s}{i}\int_0^{\infty}\mathrm{e}^{-t}t^{i+(d-1)/2}\Big(\frac{c}{2\pi^2}\Big)^{i+(d-1)/2}\frac{\sqrt{c}}{\pi\sqrt{8t}}\,\mathrm{d}t\,\prod_{i=1}^{d-2}\tilde{c}_i \\
&\sim c^{s+d/2}\pi^{-2s-d+1}2^{-s-d/2}\int_0^{\infty}\mathrm{e}^{-t}t^{s+d/2-1}\,\mathrm{d}t\,\prod_{i=1}^{d-2}\tilde{c}_i = c^{s+d/2}\pi^{-2s-d+1}2^{-s-d/2}\,\Gamma(s+d/2)\prod_{i=1}^{d-2}\tilde{c}_i,
\end{aligned}\tag{15}$$
where $a \sim b$ means $a/b \to 1$ as $s \to \infty$ and $d \to \infty$. In addition, $\tilde{c}_i = \int_0^{\pi}\sin^i\theta\,\mathrm{d}\theta$. Thus, we have
$$\begin{aligned}
G_j^2 &= \frac{\|f_j|_{\mathcal{R}(W_j)}\|_{H_{p_{j-1}}(\mathcal{R}(W_j))}^2}{\|f_j\|_{H_j}^2} = \frac{\int_{\mathcal{R}(W_j)}|\hat{f}_j(\omega_1)|^2(1+\|\omega_1\|^2)^{s_{j-1}}\,\mathrm{d}\omega_1}{\int_{\mathbb{R}^d}|\hat{f}_j(\omega_1+\omega_2)|^2(1+\|\omega_1\|^2+\|\omega_2\|^2)^{s_j}\,\mathrm{d}\omega} \\
&\le \frac{\int_{\mathcal{R}(W_j)}|\hat{f}_j(\omega_1)|^2(1+\|\omega_1\|^2)^{s_{j-1}}\,\mathrm{d}\omega_1}{\int_{\mathbb{R}^d}|\hat{f}_j(\omega_1)|^2\phi_c(\omega_2)^2(1+\|\omega_1\|^2+\|\omega_2\|^2)^{s_j}\,\mathrm{d}\omega} \\
&\le \frac{\int_{\mathcal{R}(W_j)}|\hat{f}_j(\omega_1)|^2(1+\|\omega_1\|^2)^{s_{j-1}}\,\mathrm{d}\omega_1}{\int_{\mathcal{R}(W_j)}|\hat{f}_j(\omega_1)|^2(1+\|\omega_1\|^2)^{s_{j-1}}\,\mathrm{d}\omega_1\int_{\mathcal{R}(W_j)^{\perp}}\phi_c(\omega_2)^2(1+\|\omega_2\|^2)^{\tilde{s}_j}\,\mathrm{d}\omega_2} \\
&= \Big(\int_{\mathcal{R}(W_j)^{\perp}}\phi_c(\omega_2)^2(1+\|\omega_2\|^2)^{\tilde{s}_j}\,\mathrm{d}\omega_2\Big)^{-1} \sim \Big(c^{\tilde{s}_j+\tilde{d}_j/2}\pi^{-2\tilde{s}_j-\tilde{d}_j+1}2^{-\tilde{s}_j-\tilde{d}_j/2}\,\Gamma(\tilde{s}_j+\tilde{d}_j/2)\prod_{i=1}^{\tilde{d}_j-2}\tilde{c}_i\Big)^{-1},
\end{aligned}$$
where $\tilde{s}_j = s_j - s_{j-1}$ and $\tilde{d}_j = d_j - d_{j-1}$. Note that since $s_j \ge s_{j-1}$ and $d_j \ge d_{j-1}$, $G_j$ becomes small as $c$, $s_j$, and $d_j$ become large. The assumption $|\hat{f}_j(\omega_1+\omega_2)| \ge |\hat{f}_j(\omega_1)|\phi_c(\omega_2)$ means that $\hat{f}_j$ decays no faster than the Gaussian in the direction of $\omega_2$. Even if $c$ is chosen small to satisfy the condition, the factor $\Gamma(\tilde{s}_j+\tilde{d}_j/2)$ becomes sufficiently large if $s_j$ is sufficiently large. As a result, the upper bound becomes sufficiently small if $s_j$ is sufficiently large.
Moreover, the factor $G_j$ can alleviate the dependency of $\|K_{\sigma_j}\|$ on $d_j$ and $s_j$. Even if the dependence of $\|K_{\sigma_j}\|$ on $d_j$ and $s_j$ is exponential, since the exponents appearing in the above evaluation are $-(d_j - d_{j-1})$ and $-(s_j - s_{j-1})$, we can expect the dependency on $d_j$ and $s_j$ to be reduced to a dependency on $d_{j-1}$ and $s_{j-1}$.
Appendix D Details of Remark 14
As an example of $\psi$, we can use a bump function $\psi(x) = 1 - g((\|x\|^2 - a^2)/(b^2 - a^2))$ on $\mathbb{R}^d$ for $0 < a < b$, where $g(x) = f(x)/(f(x) + f(1-x))$, $f(x) = \mathrm{e}^{-1/x}$ for $x > 0$, and $f(x) = 0$ for $x \le 0$. In this case, the support of $\psi$ is $\{x \in \mathbb{R}^d \mid \|x\| \le b\}$, and $\psi(x) = 1$ for $x \in \{x \in \mathbb{R}^d \mid \|x\| \le a\}$. If the output of each layer is bounded on $\{x \in \mathbb{R}^d \mid \|x\| \le a\}$, then with this bump function we can obtain a modified network that agrees exactly with the original network on $\{x \in \mathbb{R}^d \mid \|x\| \le a\}$. If $a$ and $b$ are large, then the support of $\psi$ becomes large, which makes the $L^2$-norm of $\psi$ large, and the Sobolev norm of $\psi$ also becomes large. If $b - a$ is small, then $|\psi(x) - \psi(y)|/\|x - y\|$ for $\|x\| = a$ and $y = (b/a)x$ becomes large. Thus, the $L^2$-norms of the derivatives of $\psi$ are expected to be large, and the Sobolev norm of $\psi$ also becomes large if $b - a$ is small.
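For concreteness, the bump function above can be sketched in a few lines of Python (our own illustrative code, not from the paper, using the standard smooth step $g(x) = f(x)/(f(x) + f(1-x))$); it is exactly $1$ on $\|x\| \le a$ and exactly $0$ on $\|x\| \ge b$:

```python
import math

def f(t):
    # e^{-1/t} for t > 0, and 0 otherwise (smooth, but not analytic at 0)
    return math.exp(-1.0 / t) if t > 0 else 0.0

def g(t):
    # smooth step: equals 0 for t <= 0 and 1 for t >= 1
    return f(t) / (f(t) + f(1.0 - t))

def bump(x, a, b):
    # psi(x) = 1 - g((||x||^2 - a^2) / (b^2 - a^2)):
    # equals 1 on ||x|| <= a, vanishes on ||x|| >= b, smooth in between
    r2 = sum(xi * xi for xi in x)
    return 1.0 - g((r2 - a * a) / (b * b - a * a))
```

Since $f$ vanishes identically on $(-\infty, 0]$, the plateau at $1$ and the vanishing outside the support are exact, not approximate.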
Appendix E Details of Remark 15
The factor $\|\tilde{g}\|_{\tilde{H}_L}$ grows as $\Omega$ becomes large, where $\Omega$ is the region on which $\tilde{f}(x) = f(x)$ for the network $f$ and the modified network $\tilde{f}$. Indeed, if $p_L(\omega) = 1/(1+\|\omega\|^2)^{s_L}$ for $s_L \in \mathbb{N}$, then we have
$$\begin{aligned}
\|\tilde{g}\|_{\tilde{H}_L}^2 &= \sum_{|\alpha|\le s_L} c_{\alpha,s_L,\delta_{L-1}+d_L}\|\partial^{\alpha}(\psi\cdot g)\|_{L^2(\mathbb{R}^{\delta_L})}^2 = \sum_{|\alpha|\le s_L} c_{\alpha,s_L,\delta_{L-1}+d_L}\|\partial^{\beta}\psi\,\partial^{\alpha-\beta}g\|_{L^2(\mathbb{R}^{\delta_{L-1}+d_L})}^2 \\
&= \sum_{|\alpha|\le s_L} c_{\alpha,s_L,\delta_{L-1}+d_L}\|\partial^{\beta}\psi\|_{L^2(\mathbb{R}^{\delta_{L-1}})}^2\|\partial^{\alpha-\beta}g\|_{L^2(\mathbb{R}^{d_L})}^2,
\end{aligned}$$
where $\beta$ is the multi-index whose elements corresponding to $\delta_{L-1}$ equal those of $\alpha$ and whose other elements are zero. The factor $\|\psi\|_{L^2(\mathbb{R}^{\delta_{L-1}})}^2$ becomes large as the volume of $\Omega$ gets large. Thus, under the condition that $\|\partial^{\alpha}\psi\|_{L^2(\mathbb{R}^{\delta_{L-1}})}$ does not change, $\|\tilde{g}\|_{\tilde{H}_L}$ becomes large as $\Omega$ gets large. Indeed, assume $\delta_L = 1$, $\psi(x) = 1$ for $x \in \Omega$, and $\Omega$ is an interval (e.g., $\psi$ is the bump function defined in Appendix D). Then we can create from $\psi$ a new function $\tilde{\psi}$ that satisfies $\|\tilde{\psi}^{(i)}\|_{L^2(\mathbb{R})} = \|\psi^{(i)}\|_{L^2(\mathbb{R})}$ for any $i = 1, 2, \ldots$, as follows. Here, $\psi^{(i)}$ is the $i$th derivative of $\psi$. For simplicity, we consider the case $\Omega = [-a, a]$ for some $a > 0$. For $c > 0$, we set $\tilde{\psi}(x) = 1$ for $x \in [0, c]$, $\tilde{\psi}(x) = \psi(x - c)$ for $x \in (c, \infty)$, $\tilde{\psi}(x) = 1$ for $x \in [-c, 0]$, and $\tilde{\psi}(x) = \psi(x + c)$ for $x \in (-\infty, -c)$.
Appendix F Details of Remark 18
In Proposition 16, combining with the factor $G_j$, the norm of $\psi_j$ can be canceled as follows. Let $p_j(\omega) = 1/(1+\|\omega\|^2)^{s_j}$ and $s_j = 2s_{j-1}$. We have
$$\begin{aligned}
&\|\psi_j\|_{H_{j-1}(\ker(W_j))}^2\|f_j\|_{H_{j-1}(\ker(W_j)^{\perp})}^2 \\
&\quad= \int_{\ker(W_j)}|\hat{\psi}_j(\omega_1)|^2(1+\|\omega_1\|^2)^{s_{j-1}}\,\mathrm{d}\omega_1\int_{\ker(W_j)^{\perp}}|\hat{f}_j(\omega_2)|^2(1+\|\omega_2\|^2)^{s_{j-1}}\,\mathrm{d}\omega_2 \\
&\quad\le \int_{\ker(W_j)}\int_{\ker(W_j)^{\perp}}|\hat{\psi}_j(\omega_1)\hat{f}_j(\omega_2)|^2(1+\|\omega_1+\omega_2\|^2)^{s_j}\,\mathrm{d}\omega_1\,\mathrm{d}\omega_2 \\
&\quad= \|\psi_{j,1} f_j|_{\ker(W_j)^{\perp}}\|_{H_j(\mathbb{R}^{d_{j-1}})}^2.
\end{aligned}$$
Thus, we obtain
$$\|\psi_j\|_{H_{j-1}(\ker(W_j))} G_j = \frac{\|\psi_j\|_{H_{j-1}(\ker(W_j))}\|f_j\|_{H_{j-1}(\ker(W_j)^{\perp})}}{\|f_j\|_{H_j}} \le \frac{\|\psi_{j,1} f_j|_{\ker(W_j)^{\perp}}\|_{H_j(\mathbb{R}^{d_{j-1}})}}{\|f_j\|_{H_j}}.$$
If $\psi_{j,1} f_j|_{\ker(W_j)^{\perp}} = f_j$ (e.g., $\psi_{j,1}$ and $f_j$ are Gaussians) and $d_{j-1} = d_j$, then the factor $\|\psi_j\|_{H_{j-1}(\ker(W_j))} G_j$ is bounded by $1$.
Appendix G Details of Remark 20
We can combine our Koopman-based approach with the existing "peeling" approach. For $1 \le l \le L$, let $F_l^{\mathrm{ReLU}}(C_1)$ be the set of $l$-layer ReLU networks whose weight matrices $W_1, \ldots, W_l$ have Frobenius norms bounded by $C_1$, as considered by Neyshabur et al. (2015). Let $\tilde{F}_{1:l}$ be the set of functions defined in the same manner as Eq. (6), except that $\sigma_l$ is replaced with $g \in H_l$. In addition, let $\tilde{F}_{1:l,\mathrm{inj}}(C_2, D) = \{f \in \tilde{F}_{1:l} \mid W_j \in \mathcal{W}_j(C_2, D)\}$. We combine $F_{L-l}^{\mathrm{ReLU}}(C_1)$ and $\tilde{F}_{1:l,\mathrm{inj}}(C_2, D)$, and define $F_{1:l,\mathrm{inj}}^{\mathrm{ReLU},L}(C_1, C_2, D) = \{h\circ f \mid h \in F_{L-l}^{\mathrm{ReLU}}(C_1),\ f \in \tilde{F}_{1:l,\mathrm{inj}}(C_2, D)\}$. Then, applying Theorem 1 of Neyshabur et al. (2015), we can bound the complexity of $L$-layer networks using that of $(L-1)$-layer networks and the Frobenius norm of $W_L$. For example, by Eq. (8) of Golowich et al. (2018), we have
$$\hat{R}_n(\mathbf{x}, F_{1:l,\mathrm{inj}}^{\mathrm{ReLU},L}(C_1, C_2, D)) \le \|W_L\|_{2,2}\,\hat{R}_n(\mathbf{x}, \sigma(F_{1:l,\mathrm{inj}}^{\mathrm{ReLU},L-1}(C_1, C_2, D))) \le 2\|W_L\|_{2,2}\,\hat{R}_n(\mathbf{x}, F_{1:l,\mathrm{inj}}^{\mathrm{ReLU},L-1}(C_1, C_2, D)),$$
where $\hat{R}_n(\mathbf{x}, \mathcal{F})$ for a vector-valued function class $\mathcal{F}$ is defined as $\mathrm{E}[\sup_{f\in\mathcal{F}}\frac{1}{n}\|\sum_{i=1}^{n} s_i f(x_i)\|]$ and $\sigma$ is the ReLU. As a result, we obtain
$$\begin{aligned}
\hat{R}_n(\mathbf{x}, F_{1:l,\mathrm{inj}}^{\mathrm{ReLU},L}(C_1, C_2, D)) &\le 2\|W_L\|_{2,2}\,\hat{R}_n(\mathbf{x}, F_{1:l,\mathrm{inj}}^{\mathrm{ReLU},L-1}(C_1, C_2, D)) \\
&\le 2^{L-l}\prod_{j=l+1}^{L}\|W_j\|_{2,2}\,\hat{R}_n(\mathbf{x}, \tilde{F}_{1:l,\mathrm{inj}}(C_2, D)) \\
&\le 2^{L-l}\|g\|_{H_l}\Big(\prod_{j=l+1}^{L}\|W_j\|_{2,2}\Big)\Big(\prod_{j=1}^{l} G_j\|K_{\sigma_j}\|\frac{\|W_j\|^{s_j}}{\det(W_j^*W_j)^{1/4}}\Big).
\end{aligned}$$
We can also apply other peeling approaches in the same manner as the above case.
Appendix H Examples of concrete Koopman-based bounds
We show examples of our Koopman-based bounds.
Example 3
Let $g(x) = \mathrm{e}^{-c\|x\|^2}$ and $p_j(\omega) = 1/(1+\|\omega\|^2)^{s_j}$ for $s_j > d_j/2$. We consider a shallow network $f(x) = g(Wx + b)$. Assume $d_1 \ge d_0$ and $W$ is full rank. The final nonlinear transformation $g$ in $f$ maps the high-dimensional vector on $\mathbb{R}^{d_1}$ to a scalar value. In this case, $f_1(x) = g(x + b)$ is also Gaussian. We have
$$\hat{R}_n(\mathbf{x}, F_{\mathrm{inj}}(C, D)) \le \frac{B}{\sqrt{n}}\,G_1\|g\|_{H_1}\frac{\max\{1, \|W\|\}^{s_0}}{\det(W^*W)^{1/4}}.$$
Since $\|g\|_{H_1} = \|f_1\|_{H_1}$, by Eq. (15), we have
$$G_1^2\|g\|_{H_1}^2 = \frac{\|f_1|_{\mathcal{R}(W)}\|_{H_{p_0}(\mathcal{R}(W))}^2\|g\|_{H_1}^2}{\|f_1\|_{H_1}^2} \sim c^{s_0-d_0/2}\pi^{-2s_0+1}2^{-s_0-d_0/2}\,\Gamma(s_0+d_0/2)\prod_{i=1}^{d_0-2}\tilde{c}_i.$$
As a result, we have
$$\hat{R}_n(\mathbf{x}, F_{\mathrm{inj}}(C, D)) \lesssim \frac{B}{\sqrt{n}}\,c^{s_0/2-d_0/4}\pi^{-s_0+1/2}2^{-s_0/2-d_0/4}\Big(\prod_{i=1}^{d_0-2}\tilde{c}_i\Big)^{1/2}\Gamma(s_0+d_0/2)^{1/2}\,\frac{\max\{1, \|W\|\}^{s_0}}{\det(W^*W)^{1/4}}.$$
Note that $\dim(\mathcal{R}(W)) = d_0$ is the dimension of the input and $s_0$ is chosen so that $s_0 > d_0/2$. Both are independent of the structure of the network.
Example 4
Let $\sigma(x) = (\mathrm{e}^{-c_1\|x\|^2}, \ldots, \mathrm{e}^{-c_{d_1}\|x\|^2})$ and $p_j(\omega) = 1/(1+\|\omega\|^2)^{s_j}$ for $s_j > d_j/2$. We consider a shallow network $f(x) = W_2\sigma(W_1 x + b)$. Assume $d_1 \ge d_0$, $d_2 = 1$, and $W_1$ is full rank. Using the "peeling" approach and Example 3, we obtain
$$\begin{aligned}
\hat{R}_n(\mathbf{x}, F_{1:1,\mathrm{inj}}^{2}(C_1, C_2, D)) &\le \|W_2\|_{2,2}\,\hat{R}_n(\mathbf{x}, F_{\mathrm{inj}}^{1}(C_2, D)) \\
&\lesssim \|W_2\|_{2,2}\sum_{i=1}^{d_1}\frac{B}{\sqrt{n}}\,c_i^{s_0/2-d_0/4}\pi^{-s_0+1/2}2^{-s_0/2-d_0/4}\Big(\prod_{i=1}^{d_0-2}\tilde{c}_i\Big)^{1/2}\Gamma(s_0+d_0/2)^{1/2}\,\frac{\max\{1, \|W_1\|\}^{s_0}}{\det(W_1^*W_1)^{1/4}}.
\end{aligned}$$
Here, we used the inequality
$$\begin{aligned}
\hat{R}_n(\mathbf{x}, \mathcal{F}) &= \frac{1}{n}\mathrm{E}\Big[\sup_{f\in\mathcal{F}}\Big\|\sum_{i=1}^{n} s_i f(x_i)\Big\|\Big] = \frac{1}{n}\mathrm{E}\Big[\sup_{f\in\mathcal{F}}\sqrt{\sum_{j=1}^{d}\Big(\sum_{i=1}^{n} s_i(f(x_i))_j\Big)^2}\,\Big] \\
&\le \frac{1}{n}\mathrm{E}\Big[\sup_{f\in\mathcal{F}}\sum_{j=1}^{d}\Big|\sum_{i=1}^{n} s_i(f(x_i))_j\Big|\Big] \le \frac{1}{n}\sum_{j=1}^{d}\mathrm{E}\Big[\sup_{f\in\mathcal{F}}\Big|\sum_{i=1}^{n} s_i(f(x_i))_j\Big|\Big],
\end{aligned}$$
where $(f(x_i))_j$ is the $j$th element of the vector $f(x_i)$. In this case, the bound depends linearly on $d_1$.
Appendix I Injectivity of $\tilde{W}_j$
The operator $\tilde{W}_j$ defined in Subsection 4.3.1 is injective. Indeed, assume $(W_1 x, P_1 x) = (W_1 y, P_1 y)$ for $x, y \in \mathbb{R}^{d_0}$. Then, we have $x - y \in \ker(W_1)$. On the other hand, we have $P_1(x - y) = 0$. Since $x - y \in \ker(W_1)$, we have $x - y = P_1(x - y) = 0$. The case $j \ge 2$ is the same.
Appendix J Experimental details and additional experimental results
We explain details of experiments in Section 5 and show additional experimental results. All the experiments were executed with Python 3.9 and TensorFlow 2.6.
J.1 Validity of the bound (Synthetic data)
We constructed a network $f_\theta(x) = g(W_2\sigma(W_1 x + b_1) + b_2)$, where $W_1 \in \mathbb{R}^{3\times 3}$, $W_2 \in \mathbb{R}^{6\times 3}$, $b_1 \in \mathbb{R}^3$, $b_2 \in \mathbb{R}^6$, $\theta = (W_1, b_1, W_2, b_2)$, $\sigma(x) = ((1+\alpha)x + (1-\alpha)x\,\mathrm{erf}(\mu(1-\alpha)x))/2$, and $g(x) = \mathrm{e}^{-\|x\|^2}$. Here, $\mathrm{erf}$ is the Gaussian error function, and $\sigma$ is a smooth version of Leaky ReLU proposed by Biswas et al. (2022). We set $\alpha = \mu = 0.5$. For training the network, we used $n = 1000$ samples $x_i$ ($i = 1, \ldots, 1000$) drawn independently from the normal distribution with mean 0 and standard deviation 1. The weight matrices were initialized by Kaiming initialization (He et al., 2015), and we used SGD as the optimizer. In addition, we set the error function to $l_\theta(x, y) = |f_\theta(x) - y|^2$ and added the regularization term $0.01(\prod_{j=1}^{2}\det(W_j^*W_j)^{-1/2} + 10\prod_{j=1}^{2}\|W_j\|)$. We added this regularization term since, according to our bound, both the determinant and the operator norm of $W_j$ should be small to achieve a small generalization error. The generalization error here means $|\mathrm{E}[l_\theta(x, t(x))] - \frac{1}{n}\sum_{i=1}^{n} l_\theta(x_i, t(x_i))|$, which is compared with our bound $O(\prod_{j=1}^{L}\|W_j\|^{s_j}/\det(W_j^*W_j)^{1/4})$ in Figure 1 (a). Here, we set $s_j = (d_j + 0.1)/2$.
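The smooth Leaky ReLU used above is easy to reproduce; the following is our own illustrative sketch (not the authors' training code), verifying that with $\alpha = \mu = 0.5$ the activation behaves like $x$ for large positive inputs and like $\alpha x$ for large negative inputs:

```python
import math

ALPHA, MU = 0.5, 0.5  # values used in the experiment

def smooth_leaky_relu(x, alpha=ALPHA, mu=MU):
    # sigma(x) = ((1 + alpha) x + (1 - alpha) x erf(mu (1 - alpha) x)) / 2
    # erf(.) -> +/-1 for large |x|, so the asymptotic slopes are 1 and alpha
    return ((1 + alpha) * x + (1 - alpha) * x * math.erf(mu * (1 - alpha) * x)) / 2
```

At the origin the function vanishes, and it interpolates smoothly between the two linear regimes, unlike the non-differentiable Leaky ReLU.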
J.2 Validity of the bound (MNIST)
We constructed a network $f_\theta(x) = g(W_4\sigma(W_3\sigma(W_2\sigma(W_1 x + b_1) + b_2) + b_3) + b_4)$ with dense layers, where $W_1 \in \mathbb{R}^{1024\times 784}$, $W_2 \in \mathbb{R}^{2048\times 1024}$, $W_3 \in \mathbb{R}^{2048\times 2048}$, $W_4 \in \mathbb{R}^{10\times 2048}$, $b_1 \in \mathbb{R}^{1024}$, $b_2 \in \mathbb{R}^{2048}$, $b_3 \in \mathbb{R}^{2048}$, $b_4 \in \mathbb{R}^{10}$, $\theta = (W_1, b_1, W_2, b_2, W_3, b_3, W_4, b_4)$, $\sigma$ is the same function as in Section J.1, and $g$ is the softmax. See Remark 1 for the validity of our bound when $g$ is the softmax. For training the network, we used only $n = 1000$ samples to create a situation in which the model is hard to generalize. We used the regularization term $\lambda_1\|W_j\| + \lambda_2/\det(I + W_j^*W_j)$ with $\lambda_1 = \lambda_2 = 0.01$ to make both the norm and the determinant of $W_j$ small (and thus make our bound small). Based on the observation in the last part of Subsection 4.4, we applied the regularization only for $j = 1, 2$. The weight matrices were initialized by orthogonal initialization for $j = 1, 2$ and by samples from a truncated normal distribution for $j = 3, 4$, and we used Adam (Kingma & Ba, 2015) as the optimizer. In addition, we set the error function to the categorical cross-entropy loss.
J.2.1 Transformation of signals by lower layers
We also investigated the transformation by the lower layers, as stated in the last part of Subsection 4.4. We computed $|\cos(\theta)|$, where $\theta$ is the maximum of the angles between the output of the second layer and the directions of the singular vectors of $W_3$ associated with singular values larger than 0.1. The results are illustrated in Figure 2. We can see that with the regularization based on our bound, as the test accuracy grows, the value $|\cos(\theta)|$ also grows. This means that the signals turn toward the directions of the singular vectors of the subsequent weight matrix associated with large singular values, which makes the extraction of information from the signals in higher layers easier. On the other hand, without the regularization, neither the test accuracy nor the value $|\cos(\theta)|$ becomes large after a sufficiently long learning process (see also Figure 1 (b)). The results in Figures 1 (b) and 2 are obtained by three independent runs.
Figure 2: Behavior of the value $|\cos(\theta)|$. Here, $\theta$ is the maximum of the angles between the output of the second layer and the directions of the singular vectors of $W_3$ associated with singular values larger than 0.1.
Figure 3: The ratio $r_{d,j} = \eta_{1,j}/\eta_{d,j}$ of singular values (condition number) of the weight matrices for layers $j = 1, 2, 4$. (Left) With the regularization based on our bound. (Right) Without regularization.
Figure 4: Test accuracy of AlexNet trained on CIFAR-10 with and without regularization.
Figure 5: Test and train loss of AlexNet trained on CIFAR-10 with and without regularization. (Left) Train loss. (Right) Test loss.
J.3 Singular values of the weight matrices
We constructed an AlexNet (Krizhevsky et al., 2012) in which the ReLU activation function is replaced by the smooth version of Leaky ReLU ($\sigma$ in Section J.1) to meet our setting. For training the network, we used $n = 50000$ samples and the Adam optimizer. We set the error function to the categorical cross-entropy loss. We show the test accuracy through the learning process in Figure 4. In addition to the AlexNet, we also computed the ratio $r_{d,j}$ of the largest to the smallest singular value (condition number) of the weight matrices for the network used in Section J.2. Since the behavior of the singular values of $W_3$ was unstable and did not show clear patterns, we only show the results for $j = 1, 2, 4$ in Figure 3. We scaled the values for $j = 4$ for readability. In the case of the AlexNet, the behavior of the singular values of each weight matrix differed depending on the layer. However, in the case of the network in Section J.2, without the regularization, the condition number $r_{d,j}$ stagnates for $j = 1, 2, 4$ through the learning process, and the test accuracy also stagnates. On the other hand, with the regularization based on our bound, $r_{d,j}$ becomes small for $j = 1, 2$ as the learning process proceeds by virtue of the regularization. We can also see that $r_{d,j}$ grows for $j = 4$ as the learning process proceeds, which makes the angle $\theta$ in Figure 2 large. As discussed in the last part of Subsection 4.4, we conclude that the regularization transforms the signals in the lower layers ($j = 1, 2$) and makes it easier for them to be extracted in the higher layers ($j = 4$), and the test accuracy becomes higher than in the case without the regularization. The results in Figures 1 (c), 3, and 4 are obtained by three independent runs.
J.4 Validity of the bound (CIFAR-10)
We used the same network and dataset as in Appendix J.3 and observed the generalization property with and without a regularization based on our result. We used the regularization term $\lambda_1\|W_j\| + \lambda_2/\|0.01 I + W_j^*W_j\|$ with $\lambda_1 = 0.1$ and $\lambda_2 = 0.001$ to make both the largest and the smallest singular values of $W_j$ small (and thus make our bound small). Since the AlexNet is composed of convolutional layers, we represented the convolutional layers as matrices. For the convolution $\sum_{i=1}^{n}\sum_{j=1}^{m} f_{k-i,l-j}x_{i,j}$ with a convolutional filter $F = [f_{i,j}]$, we can construct a tensor $\tilde{W}_{i,j,k,l} = f_{k-i,l-j}$. If $i$ or $j$ is out of the bounds of the index of the filter, then we set $f_{i,j} = 0$. We can combine the indices $(i, j)$ and $(k, l)$ in $\tilde{W}_j$ and obtain a matrix $W_j$ that represents the convolution. Note that since the dimension of $W_j$ is large, setting a regularization term with the determinant of $W_j$ can cause numerical overflow. Thus, we used $\|0.01 I + W_j^*W_j\|$ instead of the determinant, in the same manner as in Appendix J.2. Based on the observation in the last part of Subsection 4.4, we applied the regularization only for $j = 1, 2$. Figure 4 shows the test accuracy obtained with and without the regularization; the behaviors are similar. Figure 5 shows the test and train loss. Without the regularization, although the train loss becomes small, the test loss becomes large as the iterations proceed. With the regularization, the train loss becomes small, and the test loss does not become as large as in the case without the regularization.
Appendix K Norm of the Sobolev space
Let $p(\omega) = (1+\|\omega\|^2)^s$ with $s \in \mathbb{N}$. We can represent the Sobolev norm $\|f\|_{H_p(\mathbb{R}^d)}$ using the derivatives of $f$. Indeed, we have
$$\begin{aligned}
\|f\|_{H_p(\mathbb{R}^d)}^2 &= \int_{\mathbb{R}^d}|\hat{f}(\omega)|^2(1+\|\omega\|^2)^s\,\mathrm{d}\omega = \int_{\mathbb{R}^d}|\hat{f}(\omega)|^2\sum_{i=0}^{s}\binom{s}{i}\|\omega\|^{2i}\,\mathrm{d}\omega \\
&= \int_{\mathbb{R}^d}|\hat{f}(\omega)|^2\sum_{i=0}^{s}\binom{s}{i}\Big(\sum_{j=1}^{d}\omega_j^2\Big)^i\,\mathrm{d}\omega = \int_{\mathbb{R}^d}|\hat{f}(\omega)|^2\sum_{i=0}^{s}\binom{s}{i}\sum_{|\alpha|=i}\binom{i}{\alpha}(\omega^{\alpha})^2\,\mathrm{d}\omega \\
&= \sum_{|\alpha|\le s}\frac{s!}{(s-|\alpha|)!\,\alpha!}\int_{\mathbb{R}^d}|\hat{f}(\omega)\omega^{\alpha}|^2\,\mathrm{d}\omega = \sum_{|\alpha|\le s}\frac{s!}{(s-|\alpha|)!\,\alpha!}(2\pi)^d\|\partial^{\alpha} f\|_{L^2(\mathbb{R}^d)}^2.
\end{aligned}$$
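The key combinatorial step above, the expansion $(1+\|\omega\|^2)^s = \sum_{|\alpha|\le s}\frac{s!}{(s-|\alpha|)!\,\alpha!}\omega^{2\alpha}$, can be verified numerically; this is our own illustrative sketch (function names are assumptions), checking only the coefficient identity, not the Fourier-side norms:

```python
import math
from itertools import product

def multi_indices(d, s):
    # all multi-indices alpha in N^d with |alpha| <= s
    return [a for a in product(range(s + 1), repeat=d) if sum(a) <= s]

def weighted_derivative_sum(omega, s):
    # sum over |alpha| <= s of s!/((s - |alpha|)! alpha!) * omega^(2 alpha)
    total = 0.0
    for alpha in multi_indices(len(omega), s):
        coeff = math.factorial(s) / (
            math.factorial(s - sum(alpha))
            * math.prod(math.factorial(a) for a in alpha)
        )
        total += coeff * math.prod(w ** (2 * a) for w, a in zip(omega, alpha))
    return total
```

For any $\omega$, this sum agrees with $(1+\|\omega\|^2)^s$ up to floating-point rounding, confirming the coefficient $s!/((s-|\alpha|)!\,\alpha!)$ used in the derivation.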