Title: Quadratic models for understanding catapult dynamics of neural networks

URL Source: https://arxiv.org/html/2205.11787

License: arXiv.org perpetual non-exclusive license
arXiv:2205.11787v3 [cs.LG] 01 May 2024

Quadratic models for understanding catapult dynamics of neural networks

Libin Zhu (Department of Computer Science, UC San Diego), Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin (Department of Computer Science, UC San Diego)
Emails: {libinzhu,ch1212,mbelkin}@ucsd.edu, aradha@mit.edu

Abstract

While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models. In this work we show that recently proposed Neural Quadratic Models can exhibit the “catapult phase” (Lewkowycz et al., 2020) that arises when training such models with large learning rates. We then empirically show that the behaviour of neural quadratic models parallels that of neural networks in generalization, especially in the catapult phase regime. Our analysis further demonstrates that quadratic models can be an effective tool for analysis of neural networks.

1 Introduction

A recent remarkable finding on neural networks, originating from Jacot et al. (2018) and termed the "transition to linearity" (Liu et al., 2020), is that, as the network width goes to infinity, such models become linear functions in the parameter space. Thus, a linear (in parameters) model can be built to accurately approximate wide neural networks under certain conditions. While this finding has helped improve our understanding of trained neural networks (Du et al., 2019; Nichani et al., 2021; Zou & Gu, 2019; Montanari & Zhong, 2020; Ji & Telgarsky, 2019; Chizat et al., 2019), not all properties of finite width neural networks can be understood in terms of linear models, as shown in several recent works (Yang & Hu, 2020; Ortiz-Jiménez et al., 2021; Long, 2021; Fort et al., 2020). In this work, we show that properties of finitely wide neural networks in optimization and generalization that cannot be captured by linear models are, in fact, manifested in quadratic models.

The training dynamics of linear models with respect to the choice of learning rate are well understood (Polyak, 1987). Indeed, such models exhibit linear training dynamics: there exists a critical learning rate, $\eta_{\mathrm{crit}}$, such that the loss converges monotonically if and only if the learning rate is smaller than $\eta_{\mathrm{crit}}$ (see Figure 1a).

Figure 1: Optimization dynamics for linear and non-linear models based on the choice of learning rate. (a) Linear models converge monotonically if the learning rate is less than $\eta_{\mathrm{crit}}$ and diverge otherwise. (b) Unlike linear models, finitely wide neural networks and NQMs Eq. (2) (or general quadratic models Eq. (3)) can additionally exhibit a catapult phase when $\eta_{\mathrm{crit}} < \eta < \eta_{\max}$.

Figure 2: (a) Optimization dynamics of wide neural networks ($f$) with sub-critical and super-critical learning rates: linear dynamics and catapult dynamics. With sub-critical learning rates ($0 < \eta < \eta_{\mathrm{crit}}$), the tangent kernel of wide neural networks is nearly constant during training, and the loss decreases monotonically; the whole optimization path is contained in the ball $B(\mathbf{w}_0, R) := \{\mathbf{w} : \|\mathbf{w} - \mathbf{w}_0\| \le R\}$ with a finite radius $R$. With super-critical learning rates ($\eta_{\mathrm{crit}} < \eta < \eta_{\max}$), the catapult phase happens: the loss first increases and then decreases, along with a decrease of the norm of the tangent kernel, and the optimization path goes beyond the finite-radius ball. (b) Generalization performance: test loss of $f_{\mathrm{quad}}$, $f$ and $f_{\mathrm{lin}}$ plotted against different learning rates. With sub-critical learning rates, all three models have nearly identical test loss. With super-critical learning rates, $f$ and $f_{\mathrm{quad}}$ achieve a smaller best test loss than with any sub-critical learning rate. Experimental details are in Appendix N.5.

Recent work Lee et al. (2019) showed that the training dynamics of a wide neural network $f(\mathbf{w}; \mathbf{x})$ can be accurately approximated by those of a linear model $f_{\mathrm{lin}}(\mathbf{w}; \mathbf{x})$:

$$f_{\mathrm{lin}}(\mathbf{w}; \mathbf{x}) = f(\mathbf{w}_0; \mathbf{x}) + (\mathbf{w} - \mathbf{w}_0)^T \nabla f(\mathbf{w}_0; \mathbf{x}), \qquad (1)$$

where $\nabla f(\mathbf{w}_0; \mathbf{x})$ denotes the gradient of $f$ with respect to the trainable parameters $\mathbf{w}$ at an initial point $\mathbf{w}_0$ and input sample $\mathbf{x}$. This approximation holds for learning rates less than $\eta_{\mathrm{crit}} \approx 2/\|\nabla f(\mathbf{w}_0; \mathbf{x})\|^2$, when the width is sufficiently large.
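To make Eq. (1) concrete, the sketch below (our own minimal numpy example, not the paper's code; the toy one-neuron model and the finite-difference gradient are illustrative assumptions) builds $f_{\mathrm{lin}}$ for a simple differentiable model and checks that it matches $f$ near $\mathbf{w}_0$:

```python
import numpy as np

def f(w, x):
    # toy differentiable model standing in for a neural network
    return np.tanh(w @ x)

def grad_f(w, x, eps=1e-6):
    # central finite-difference gradient of f with respect to w
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e, x) - f(w - e, x)) / (2 * eps)
    return g

def f_lin(w, w0, x):
    # Eq. (1): f_lin(w; x) = f(w0; x) + (w - w0)^T grad f(w0; x)
    return f(w0, x) + (w - w0) @ grad_f(w0, x)

rng = np.random.default_rng(0)
w0 = rng.normal(size=5)
x = rng.normal(size=5)
w = w0 + 1e-3 * rng.normal(size=5)  # small perturbation of the initialization
```

Close to $\mathbf{w}_0$ the two models agree up to a second-order error, which is exactly the term the NQM of Eq. (2) restores.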

However, the training dynamics of finite width neural networks, $f$, can sharply differ from those of linear models when using large learning rates. A striking non-linear property of wide neural networks, discovered in Lewkowycz et al. (2020), is that when the learning rate is larger than $\eta_{\mathrm{crit}}$ but smaller than a certain maximum learning rate, $\eta_{\max}$, gradient descent still converges but experiences a "catapult phase": the loss initially grows exponentially and then decreases after reaching a large value, along with a decrease of the norm of the tangent kernel (see Figure 2a). Such training dynamics are therefore non-linear (see Figure 1b).

As linear models cannot exhibit such a catapult phase, under what models and conditions does this phenomenon arise? The work of Lewkowycz et al. (2020) first observed the catapult phase phenomenon in finite width neural networks and analyzed it for a two-layer linear neural network. However, a theoretical understanding of this phenomenon for general non-linear neural networks remains open. In this work, we utilize a quadratic model as a tool to shed light on the optimization and generalization discrepancies between finite and infinite width neural networks. We define the Neural Quadratic Model (NQM) by the second-order Taylor series expansion of $f(\mathbf{w}; \mathbf{x})$ around the point $\mathbf{w}_0$:

$$\mathbf{NQM}: \quad f_{\mathrm{quad}}(\mathbf{w}) = f(\mathbf{w}_0) + (\mathbf{w} - \mathbf{w}_0)^T \nabla f(\mathbf{w}_0) + \frac{1}{2}(\mathbf{w} - \mathbf{w}_0)^T H_f(\mathbf{w}_0)(\mathbf{w} - \mathbf{w}_0). \qquad (2)$$

Here we suppress the dependence on the input data $\mathbf{x}$ in the notation, and $H_f(\mathbf{w}_0)$ is the Hessian of $f$ with respect to $\mathbf{w}$ evaluated at $\mathbf{w}_0$. Note that $f_{\mathrm{quad}}(\mathbf{w}) = f_{\mathrm{lin}}(\mathbf{w}) + \frac{1}{2}(\mathbf{w} - \mathbf{w}_0)^T H_f(\mathbf{w}_0)(\mathbf{w} - \mathbf{w}_0)$.

Indeed, we note that NQMs are contained in a more general class of quadratic models:

$$\mathbf{General\ Quadratic\ Model}: \quad g(\mathbf{w}; \mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + \frac{1}{2}\,\gamma\, \mathbf{w}^T \Sigma(\mathbf{x})\, \mathbf{w}, \qquad (3)$$

where $\mathbf{w}$ are trainable parameters and $\mathbf{x}$ is the input data. We discuss the optimization dynamics of such general quadratic models in Section 3.3 and show empirically that they exhibit the catapult phase phenomenon in Appendix N.4. Note that the two-layer linear network analyzed in Lewkowycz et al. (2020) is a special case of Eq. (3) with $\phi(\mathbf{x}) = 0$ (see Appendix M).
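As a concrete instance of Eq. (3), the following minimal numpy sketch (with arbitrary placeholder choices of $\phi$, $\Sigma$ and $\gamma$, not ones from the paper) implements a general quadratic model and its parameter gradient:

```python
import numpy as np

def g(w, phi, Sigma, gamma):
    # Eq. (3): g(w; x) = w^T phi(x) + (gamma/2) * w^T Sigma(x) w
    return w @ phi + 0.5 * gamma * (w @ Sigma @ w)

def grad_g(w, phi, Sigma, gamma):
    # for symmetric Sigma, the gradient in w is phi + gamma * Sigma w
    return phi + gamma * (Sigma @ w)

rng = np.random.default_rng(1)
p = 10
phi = rng.normal(size=p)        # feature vector phi(x)
A = rng.normal(size=(p, p))
Sigma = (A + A.T) / 2           # symmetric second-order term Sigma(x)
gamma = 0.1                     # small gamma makes the model close to linear
w = rng.normal(size=p)
```

Setting $\gamma = 0$ recovers a linear model; the quadratic term $\frac{1}{2}\gamma\,\mathbf{w}^T\Sigma\,\mathbf{w}$ is what makes non-linear (catapult) dynamics possible.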

Main Contributions. We prove that NQMs, 𝑓 quad , which approximate shallow fully-connected ReLU activated neural networks, exhibit catapult phase dynamics. Specifically, we analyze the optimization dynamics of 𝑓 quad by deriving the evolution of 𝑓 quad and the tangent kernel during gradient descent with squared loss, for a single training example and multiple uni-dimensional training examples. We identify three learning rate regimes yielding different optimization dynamics for 𝑓 quad , which are (1) converging monotonically (linear dynamics); (2) converging via a catapult phase (catapult dynamics); and (3) diverging. We provide a number of experimental results corroborating our theoretical analysis (See Section 3).

We then empirically show that NQMs, for both shallow (see Figure 2b for an example) and deep network architectures, achieve better test performance when catapult dynamics occur. While this was observed for some synthetic examples of neural networks in Lewkowycz et al. (2020), we systematically demonstrate the improved generalization of NQMs across a range of experimental settings. Namely, we consider fully-connected and convolutional neural networks with ReLU and other activation functions, trained with GD/SGD on multiple vision, speech and text datasets (see Section 4).

To the best of our knowledge, our work is the first to analyze non-linear wide neural networks in the catapult regime through the lens of quadratic approximation. While NQMs (or quadratic models) were proposed and analyzed in Roberts et al. (2022), our work focuses on the properties of NQMs in the large learning rate regime, which was not discussed in Roberts et al. (2022). Similarly, the following related works did not study catapult dynamics: Huang & Yau (2020) analyzed higher order approximations to neural networks under gradient flow (infinitesimal learning rates); Bai & Lee (2019) studied different quadratic models with randomized second order terms; and Zhang et al. (2019) considered losses of quadratic form, where no catapult phase happens. A recent work showed the existence of the catapult phase in two-layer, homogeneous networks (Meltzer & Liu, 2023).

Discontinuity in dynamics transition.

In the ball $B(\mathbf{w}_0, R) := \{\mathbf{w} : \|\mathbf{w} - \mathbf{w}_0\| \le R\}$ with constant radius $R > 0$, the transition to linearity of a wide neural network (with linear output layer) is continuous in the network width $m$. That is, the deviation of the network function from its linear approximation within the ball can be continuously controlled by the Hessian of the network function, $H_f$, which scales with $m$ (Liu et al., 2020):

$$\|f(\mathbf{w}) - f_{\mathrm{lin}}(\mathbf{w})\| \le \sup_{\mathbf{w} \in B(\mathbf{w}_0, R)} \|H_f(\mathbf{w})\|\, R^2 = \tilde{O}(1/\sqrt{m}). \qquad (4)$$

Using the inequality from Eq. (4), we obtain $\|f_{\mathrm{quad}} - f_{\mathrm{lin}}\| = \tilde{O}(1/\sqrt{m})$; hence $f_{\mathrm{quad}}$ transitions to linearity continuously as well in $B(\mathbf{w}_0, R)$. Given the continuous nature of the transition to linearity, one may expect that the transition from non-linear dynamics to linear dynamics for $f$ and $f_{\mathrm{quad}}$ is continuous in $m$ as well. Namely, one would expect that the domain of catapult dynamics, $[\eta_{\mathrm{crit}}, \eta_{\max}]$, shrinks and ultimately converges to a single point, i.e., $\eta_{\mathrm{crit}} = \eta_{\max}$, as $m$ goes to infinity, with non-linear dynamics turning into linear dynamics. However, as shown both analytically and empirically, the transition is not continuous, for both network functions $f$ and NQMs $f_{\mathrm{quad}}$, since the domain of the catapult dynamics can be independent of the width $m$ (or $\gamma$). Additionally, the length of the optimization path of $f$ in catapult dynamics grows with $m$: otherwise, the optimization path would be contained in a ball with a constant radius independent of $m$, in which $f$ can be approximated by $f_{\mathrm{lin}}$; since the optimization of $f_{\mathrm{lin}}$ diverges in catapult dynamics, by the approximation, the optimization of $f$ would diverge as well, contradicting the fact that the optimization of $f$ can converge in catapult dynamics (see Figure 2a).

2 Notation and preliminary

We use bold lowercase letters to denote vectors and capital letters to denote matrices. We denote the set { 1 , 2 , ⋯ , 𝑛 } by [ 𝑛 ] . We use ∥ ⋅ ∥ to denote the Euclidean norm for vectors and the spectral norm for matrices. We use ⊙ to denote element-wise multiplication (Hadamard product) for vectors. We use 𝜆 max ⁢ ( 𝐴 ) and 𝜆 min ⁢ ( 𝐴 ) to denote the largest and smallest eigenvalue of a matrix 𝐴 , respectively.

Given a model 𝑓 ⁢ ( 𝐰 ; 𝒙 ) , where 𝒙 is input data and 𝐰 are model parameters, we use ∇ 𝐰 𝑓 to represent the partial first derivative ∂ 𝑓 ⁢ ( 𝐰 ; 𝒙 ) / ∂ 𝐰 . When clear from context, we let ∇ 𝑓 := ∇ 𝐰 𝑓 for ease of notation. We use 𝐻 𝑓 and 𝐻 ℒ to denote the Hessian (second derivative matrix) of the function 𝑓 ⁢ ( 𝐰 ; 𝒙 ) and the loss ℒ ⁢ ( 𝐰 ) with respect to parameters 𝐰 , respectively.

In this paper, we consider the following supervised learning task: given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with data $\mathbf{x}_i \in \mathbb{R}^d$ and labels $y_i \in \mathbb{R}$ for $i \in [n]$, we minimize the empirical risk with the squared loss $\mathcal{L}(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^n \big(f(\mathbf{w}; \mathbf{x}_i) - y_i\big)^2$. Here $f(\mathbf{w}; \cdot)$ is a parametric family of models, e.g., a neural network or a kernel machine, with parameters $\mathbf{w} \in \mathbb{R}^p$. We use full-batch gradient descent to minimize the loss, and we denote the trainable parameters $\mathbf{w}$ at iteration $t$ by $\mathbf{w}(t)$. With constant step size (learning rate) $\eta$, the update rule for the parameters is:

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta\, \frac{d\mathcal{L}}{d\mathbf{w}}\Big|_{\mathbf{w}(t)}, \qquad \forall\, t \ge 0.$$

Definition 1 (Tangent Kernel).

The tangent kernel $K(\mathbf{w}; \cdot, \cdot)$ of $f(\mathbf{w}; \cdot)$ is defined as

$$K(\mathbf{w}; \mathbf{x}, \mathbf{z}) = \langle \nabla f(\mathbf{w}; \mathbf{x}), \nabla f(\mathbf{w}; \mathbf{z}) \rangle, \qquad \forall\, \mathbf{x}, \mathbf{z} \in \mathbb{R}^d. \qquad (5)$$

In the context of the optimization problem with $n$ training examples, the tangent kernel matrix $K \in \mathbb{R}^{n \times n}$ satisfies $K_{i,j}(\mathbf{w}) = K(\mathbf{w}; \mathbf{x}_i, \mathbf{x}_j)$ for $i, j \in [n]$. The critical learning rate for optimization is given as follows.

Definition 2 (Critical learning rate).

With an initialization of parameters $\mathbf{w}_0$, the critical learning rate of $f(\mathbf{w}; \cdot)$ is defined as

$$\eta_{\mathrm{crit}} := 2/\lambda_{\max}\big(H_\mathcal{L}(\mathbf{w}_0)\big). \qquad (6)$$

A learning rate $\eta$ is said to be sub-critical if $0 < \eta < \eta_{\mathrm{crit}}$ and super-critical if $\eta_{\mathrm{crit}} < \eta < \eta_{\max}$. Here $\eta_{\max}$ is the maximum learning rate such that the optimization of $\mathcal{L}(\mathbf{w})$ initialized at $\mathbf{w}_0$ can converge.
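Definition 2 is easy to evaluate numerically. For a model that is linear in its parameters under squared loss, the loss Hessian and the tangent kernel share their positive eigenvalues, so $\eta_{\mathrm{crit}}$ can be computed from either. A minimal numpy sketch (the random matrix standing in for the Jacobian $\nabla F(\mathbf{w}_0)$ is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 20
J = rng.normal(size=(n, p))   # stand-in Jacobian of n model outputs w.r.t. p parameters

# For squared loss and a linear-in-parameters model:
H_L = J.T @ J                 # loss Hessian H_L(w0), p x p
K = J @ J.T                   # tangent kernel matrix K(w0), n x n

eta_crit_hessian = 2 / np.linalg.eigvalsh(H_L).max()  # Definition 2
eta_crit_kernel = 2 / np.linalg.eigvalsh(K).max()     # via the tangent kernel
```

Both routes give the same value, since $J^T J$ and $J J^T$ have the same non-zero eigenvalues.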

Dynamics for Linear models.

When $f$ is linear in $\mathbf{w}$, the gradient $\nabla f$ and the tangent kernel are constant: $K(\mathbf{w}(t)) = K(\mathbf{w}_0)$. Therefore, the gradient descent dynamics are:

$$F(\mathbf{w}(t+1)) - \mathbf{y} = \big(I - \eta K(\mathbf{w}_0)\big)\big(F(\mathbf{w}(t)) - \mathbf{y}\big), \qquad \forall\, t \ge 0, \qquad (7)$$

where $F(\mathbf{w}_0) = [f_1(\mathbf{w}_0), \ldots, f_n(\mathbf{w}_0)]^T$ with $f_i(\mathbf{w}_0) = f(\mathbf{w}_0; \mathbf{x}_i)$.

Noting that $H_\mathcal{L}(\mathbf{w}_0) = \nabla F(\mathbf{w}_0)^T \nabla F(\mathbf{w}_0)$ and the tangent kernel $K(\mathbf{w}_0) = \nabla F(\mathbf{w}_0)\, \nabla F(\mathbf{w}_0)^T$ share the same positive eigenvalues, we have $\lambda_{\max}(H_\mathcal{L}(\mathbf{w}_0)) = \lambda_{\max}(K(\mathbf{w}_0))$, and hence,

$$\eta_{\mathrm{crit}} = 2/\lambda_{\max}\big(K(\mathbf{w}_0)\big). \qquad (8)$$

Therefore, from Eq. (7), if $0 < \eta < \eta_{\mathrm{crit}}$, the loss $\mathcal{L}$ decreases monotonically, and if $\eta > \eta_{\mathrm{crit}}$, the loss $\mathcal{L}$ keeps increasing. Note that the critical and maximum learning rates are equal in this setting.
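The dichotomy above can be verified by iterating Eq. (7) directly. The following is a minimal numpy sketch (our own illustrative example; the kernel, residual, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))
K = A @ A.T                      # random PSD tangent kernel, constant over time
r0 = rng.normal(size=n)          # initial residual F(w(0)) - y

def residual_norms(eta, steps=200):
    # iterate Eq. (7): r(t+1) = (I - eta K) r(t)
    r, norms = r0.copy(), []
    M = np.eye(n) - eta * K
    for _ in range(steps):
        r = M @ r
        norms.append(np.linalg.norm(r))
    return norms

eta_crit = 2 / np.linalg.eigvalsh(K).max()
sub = residual_norms(0.9 * eta_crit)   # sub-critical: monotone convergence
sup = residual_norms(1.1 * eta_crit)   # super-critical: divergence
```

For linear models there is no intermediate regime: the residual either shrinks monotonically or blows up.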

3 Optimization dynamics in Neural Quadratic Models

In this section, we analyze the gradient descent dynamics of the NQM corresponding to a two-layer fully-connected neural network. We show that, unlike a linear model, the NQM exhibits catapult dynamics: the loss increases in the early stage of training and then decreases afterwards. We further show that the top eigenvalues of the tangent kernel typically become smaller as a consequence of the catapult.

Neural Quadratic Model (NQM).

Consider the NQM that approximates the following two-layer neural network:

$$f(\mathbf{u}, \mathbf{v}; \mathbf{x}) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} v_i\, \sigma\!\Big(\frac{1}{\sqrt{d}}\, \mathbf{u}_i^T \mathbf{x}\Big), \qquad (9)$$

where $\mathbf{u}_i \in \mathbb{R}^d$, $v_i \in \mathbb{R}$ for $i \in [m]$ are trainable parameters, $\mathbf{x} \in \mathbb{R}^d$ is the input, and $\sigma(\cdot)$ is the ReLU activation function. We initialize $\mathbf{u}_i \sim \mathcal{N}(0, I_d)$ and $v_i \sim \mathrm{Unif}\{-1, 1\}$ for each $i$ independently. Letting $g(\mathbf{u}, \mathbf{v}; \mathbf{x}) := f_{\mathrm{quad}}(\mathbf{u}, \mathbf{v}; \mathbf{x})$, this NQM has the following expression (see the full derivation in Appendix A):

$$g(\mathbf{u}, \mathbf{v}; \mathbf{x}) = f(\mathbf{u}_0, \mathbf{v}_0; \mathbf{x}) + \frac{1}{\sqrt{md}} \sum_{i=1}^{m} v_{0,i}\, (\mathbf{u}_i - \mathbf{u}_{0,i})^T \mathbf{x}\, \mathbb{1}_{\{\mathbf{u}_{0,i}^T \mathbf{x} \ge 0\}} + \frac{1}{\sqrt{md}} \sum_{i=1}^{m} (v_i - v_{0,i})\, \sigma(\mathbf{u}_{0,i}^T \mathbf{x}) + \frac{1}{\sqrt{md}} \sum_{i=1}^{m} (v_i - v_{0,i})\, (\mathbf{u}_i - \mathbf{u}_{0,i})^T \mathbf{x}\, \mathbb{1}_{\{\mathbf{u}_{0,i}^T \mathbf{x} \ge 0\}}. \qquad (10)$$

Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, we minimize the empirical risk with the squared loss $\mathcal{L}(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^n \big(g(\mathbf{w}; \mathbf{x}_i) - y_i\big)^2$ using GD with constant learning rate $\eta$. Throughout this section, we denote $g(\mathbf{u}(t), \mathbf{v}(t); \mathbf{x})$ by $g(t)$ and its tangent kernel $K(\mathbf{u}(t), \mathbf{v}(t))$ by $K(t)$, where $t$ is the iteration of GD. We assume $\|\mathbf{x}_i\| = O(1)$ and $|y_i| = O(1)$ for $i \in [n]$, and we assume the width of $f$ is much larger than the input dimension $d$ and the data size $n$, i.e., $m \gg \max\{d, n\}$. Hence, $d$ and $n$ can be regarded as small constants. Throughout the paper, we use big-O and small-o notation with respect to the width $m$. Below, we start with the single training example case, which already showcases the non-linear dynamics of NQMs.
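Eq. (10) can be implemented directly. The sketch below (our own illustrative numpy code with arbitrary small sizes) freezes the ReLU activation pattern at initialization, as Eq. (10) prescribes; since the two-layer ReLU network Eq. (9) is bilinear in $(\mathbf{u}, \mathbf{v})$ on each activation region, $f_{\mathrm{quad}}$ coincides exactly with $f$ as long as no activation flips sign:

```python
import numpy as np

def f(U, v, x):
    # Eq. (9): two-layer ReLU network with 1/sqrt(m d) scaling
    m, d = U.shape
    return (v * np.maximum(U @ x, 0.0)).sum() / np.sqrt(m * d)

def f_quad(U, v, U0, v0, x):
    # Eq. (10): second-order Taylor expansion of f around (U0, v0);
    # the ReLU indicator 1{u_{0,i}^T x >= 0} is frozen at initialization
    m, d = U.shape
    s = np.sqrt(m * d)
    ind = (U0 @ x >= 0).astype(float)
    pre = (U - U0) @ x                      # (u_i - u_{0,i})^T x
    return (f(U0, v0, x)
            + (v0 * pre * ind).sum() / s
            + ((v - v0) * np.maximum(U0 @ x, 0.0)).sum() / s
            + ((v - v0) * pre * ind).sum() / s)

rng = np.random.default_rng(4)
m, d = 200, 3
U0 = rng.normal(size=(m, d))                # u_i ~ N(0, I_d)
v0 = rng.choice([-1.0, 1.0], size=m)        # v_i ~ Unif{-1, 1}
x = rng.normal(size=d)
```

Away from the initial activation region, $f_{\mathrm{quad}}$ and $f$ differ, and it is exactly this fixed-Hessian surrogate whose dynamics we analyze below.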

3.1 Catapult dynamics with a single training example

In this subsection, we consider the training dynamics of the NQM Eq. (10) with a single training example $(\mathbf{x}, y)$, where $\mathbf{x} \in \mathbb{R}^d$ and $y \in \mathbb{R}$. In this case, the tangent kernel matrix $K$ reduces to a scalar, which we denote by $\lambda$ to distinguish it from a matrix.

By gradient descent with step size 𝜂 , the updates for 𝑔 ⁢ ( 𝑡 ) − 𝑦 and 𝜆 ⁢ ( 𝑡 ) , which we refer to as dynamics equations, can be derived as follows (see the derivation in Appendix B.1):

Dynamics equations.

$$g(t+1) - y = \Big(1 - \eta\lambda(t) + \underbrace{\tfrac{\|\mathbf{x}\|^2}{md}\,\eta^2\, (g(t)-y)\, g(t)}_{R_g(t)}\Big)\big(g(t) - y\big) := \mu(t)\,\big(g(t) - y\big), \qquad (11)$$

$$\lambda(t+1) = \lambda(t) - \eta\,\underbrace{\tfrac{\|\mathbf{x}\|^2}{md}\,\big(g(t)-y\big)^2\Big(\tfrac{4\,g(t)}{g(t)-y} - \eta\lambda(t)\Big)}_{R_\lambda(t)}, \qquad \forall\, t \ge 0. \qquad (12)$$

Note that since the loss is given by $\mathcal{L}(t) = \frac{1}{2}(g(t) - y)^2$, to understand convergence it suffices to analyze the dynamics equations above. Compared to the linear dynamics Eq. (7), these non-linear dynamics have the extra terms $R_g(t)$ and $R_\lambda(t)$, which are induced by the non-linear term in the NQM. We will see that the convergence of gradient descent depends on the scale and sign of $R_g(t)$ and $R_\lambda(t)$. For example, for a constant learning rate slightly larger than $\eta_{\mathrm{crit}}$ (which would result in divergence for linear models), $R_\lambda(t)$ stays positive during training, resulting in a monotonic decrease of both the tangent kernel $\lambda$ and the loss.

As $\lambda(t) = \lambda_0 - \eta\sum_{\tau=0}^{t-1} R_\lambda(\tau)$, to track the scale of $|\mu(t)|$ we will focus on the scale and sign of $R_g(t)$ and $R_\lambda(t)$ in the following analysis. As for the scale of $\lambda_0$, which is non-negative by Definition 1, we can show that with high probability over random initialization, $\lambda_0 = \Theta(1)$ (see Appendix I), and $|g(0)| = O(1)$ with high probability as well (Du et al., 2018). Therefore, the following discussion holds with high probability over the random initialization. We start by establishing monotonic convergence for sub-critical learning rates.

Monotonic convergence: sub-critical learning rates ($\eta < 2/\lambda_0 = \eta_{\mathrm{crit}}$).

The key observation is that when $|g(t)| = O(1)$ and $\lambda(t) = \Theta(1)$, $|R_g(t)|$ and $|R_\lambda(t)|$ are of the order $o(1)$. Then, the dynamics equations approximately reduce to those of linear dynamics:

$$g(t+1) - y = \big(1 - \eta\lambda(t) + o(1)\big)\big(g(t) - y\big), \qquad \lambda(t+1) = \lambda(t) + o(1).$$

Note that at initialization the output satisfies $|g(0)| = O(1)$, and we have shown $\lambda_0 = \Theta(1)$. With this choice of $\eta$, we have $|\mu(t)| = |1 - \eta\lambda(t) + o(1)| < 1$ for all $t \ge 0$; hence $|g(t) - y|$ decreases monotonically. The cumulative change of the tangent kernel is $o(1)$, i.e., $\sum_t |R_\lambda(t)| = o(1)$: for all $t$ we have $|R_\lambda(t)| = O(1/m)$, and since the loss, and with it $|R_\lambda(t)|$, decreases exponentially, the sum is dominated by a geometric series, giving $\sum_t |R_\lambda(t)| = O(1/m) = o(1)$. See Appendix C for a detailed discussion.

Catapult convergence: super-critical learning rates ($\eta_{\mathrm{crit}} = 2/\lambda_0 < \eta < 4/\lambda_0 = \eta_{\max}$).

The training dynamics are given by the following theorem.

Theorem 1 (Catapult dynamics on a single training example).

Consider training the NQM Eq. (10) with squared loss on a single training example by GD. With a super-critical learning rate $\eta \in \big[\frac{2+\epsilon}{\lambda_0}, \frac{4-\epsilon}{\lambda_0}\big]$, where $\epsilon = \Theta\big(\frac{\log m}{m}\big)$, the catapult happens: with high probability over random initialization, the loss increases to the order of $\Omega\big(\frac{m(\eta\lambda_0 - 2)^2}{\log m}\big)$ and then decreases to $O(1)$.

Proof of Theorem 1.

We use the following transformation of variables to simplify notation:

$$u(t) = \frac{\|\mathbf{x}\|^2}{md}\,\eta^2\,\big(g(t)-y\big)^2, \qquad w(t) = \frac{\|\mathbf{x}\|^2}{md}\,\eta^2\,\big(g(t)-y\big)\,y, \qquad v(t) = \eta\,\lambda(t).$$

Then Eq. (11) and Eq. (12) reduce to

$$u(t+1) = \big(1 - v(t) + u(t) + w(t)\big)^2\, u(t) := \kappa(t)\, u(t), \qquad (13)$$

$$v(t+1) = v(t) - u(t)\,\big(4 - v(t)\big) - 4\,w(t). \qquad (14)$$

At initialization, since $|g(0)| = O(1)$, we have $u(0) = O(1/m)$ and $w(0) = O(1/m)$. Note that by definition $u(t) \ge 0$ for all $t \ge 0$, and $v(t) \ge 0$ since $\lambda(t)$ is the tangent kernel evaluated on a single training example.

In the following, we analyze the above dynamical equations. To make the analysis more understandable, we separate the dynamics during training into an increasing phase and a decreasing phase. We denote $\delta := (\eta - \eta_{\mathrm{crit}})\,\lambda_0 = \eta\lambda_0 - 2$.

Increasing phase.

In this phase, $u(t)$ increases exponentially from $O(1/m)$ to $\Theta\big(\frac{\delta^2}{\log m}\big)$, while $|v(t) - v(0)| = O\big(\frac{\delta}{\log m}\big)$. This can be shown by the following lemma.

Lemma 1.

For $T > 0$ such that $\sup_{t \in [0,T]} u(t) = O\big(\frac{\delta^2}{\log m}\big)$, $u(t)$ increases exponentially with $\inf_{t \in [0,T]} \kappa(t) \ge \big(1 + \delta - O\big(\frac{\delta}{\log m}\big)\big)^2 > 1$, and $\sup_{t \in [0,T]} |v(t) - v(0)| = O\big(\frac{\delta}{\log m}\big)$.

Proof.

See the proof in Section D. ∎

After the increasing phase, we can infer from the order of $u(t)$ that the loss is of the order $\Theta\big(\frac{m\delta^2}{\log m}\big)$.

Decreasing phase.

When $u(t)$ is sufficiently large, $v(t)$ has a non-negligible decrease, which leads to a decrease of $\kappa(t)$, in turn making $u(t)$ decrease as well. Consequently, we have:

Lemma 2.

There exists $T^* > 0$ such that $u(T^*) = O\big(\frac{1}{m}\big)$.

Proof.

See the proof in Section E. ∎

Then, accordingly, the loss is of the order $O(1)$. ∎

Once the loss decreases to the order of $O(1)$, the catapult finishes, and we in general have $\eta < 2/\lambda(t)$, as $|\mu(t)| = |1 - \eta\lambda(t) + R_g(t)| < 1$ with $|R_g(t)| = O(\mathcal{L}(t)/m) = O(1/m)$. Therefore the training dynamics fall into linear dynamics, and we can apply the same analysis as for sub-critical learning rates to the remaining training. The stability of the steady-state equilibria of the dynamical equations is guaranteed by the following:

Theorem 2.

For the dynamical equations Eq. (11) and (12), the stable steady-state equilibria satisfy $g(t) = y$ (i.e., the loss is $0$), and $\lambda(t) \in [\epsilon,\, 2/\eta - \epsilon]$ with $\epsilon = \Theta(\log m / m)$.

Divergence ($\eta > \eta_{\max} = 4/\lambda_0$).

Initially, the dynamics follow the same pattern as the increasing phase of catapult convergence: $|g(t) - y|$ increases exponentially since $|\mu(t)| > 1$, and $\lambda(t)$ barely changes since $R_\lambda(t)$ is small. However, note that $R_\lambda(t) < 0$, since (1) $g(t)/(g(t)-y) \approx 1$ when $g(t)$ becomes large, and (2) $\eta > 4/\lambda(t)$. Therefore, $\lambda(t)$ keeps increasing during training, which consequently leads to the divergence of the optimization. See Appendix G for a detailed discussion.

3.2 Catapult dynamics with multiple training examples

In this subsection, we show that the catapult phase occurs for NQMs corresponding to Eq. (9) with multiple training examples. We assume unidimensional input data, which is common in the literature and simplifies the analysis for neural networks (see, for example, Williams et al. (2019); Savarese et al. (2019)).

Assumption 1.

The input dimension $d = 1$, and not all $x_i$ are $0$, i.e., $\sum_i |x_i| > 0$.

We similarly analyze the dynamics equations, now the update equations of $\mathbf{g}(t) - \mathbf{y}$ and $K(t)$, under different learning rates for multiple training examples (see the derivation of Eq. (22) and (23) in the Appendix). Similarly, we show there are three training dynamics: monotonic convergence, catapult convergence, and divergence.

In the analysis, we consider the training dynamics projected onto two orthogonal eigenvectors of the tangent kernel, $\mathbf{p}_1$ and $\mathbf{p}_2$, and we show that, depending on the learning rate, the catapult phase can occur only in the direction of $\mathbf{p}_1$, or in both directions. We consider the case where $2/\lambda_2(0) < 4/\lambda_1(0)$, hence the catapult can occur in both directions; the analysis for the other case follows directly from our results. We denote the loss projected onto $\mathbf{p}_i$ by $\Pi_i\mathcal{L} := \frac{1}{2}\langle \mathbf{g} - \mathbf{y}, \mathbf{p}_i \rangle^2$ for $i = 1, 2$. We have $\Pi_i\mathcal{L}(0) = O(1)$ with high probability over the random initialization of the weights.

We formulate the result for the catapult dynamics, which happens when training with super-critical learning rates, into the following theorem, and defer the proof of it and the full discussion of training dynamics to Appendix K.

Theorem 3 (Catapult dynamics on multiple training examples).

Supposing Assumption 1 holds, consider training the NQM Eq. (10) with squared loss on multiple training examples by GD. Then, with high probability over random initialization, we have:

1. with $\eta \in \big[\frac{2+\epsilon}{\lambda_1(0)}, \frac{2-\epsilon}{\lambda_2(0)}\big]$, the catapult occurs only in eigendirection $\mathbf{p}_1$: $\Pi_1\mathcal{L}$ increases to the order of $\Omega\big(\frac{m(\eta\lambda_1(0) - 2)^2}{\log m}\big)$ and then decreases to $O(1)$;

2. with $\eta \in \big[\frac{2+\epsilon}{\lambda_2(0)}, \frac{4-\epsilon}{\lambda_1(0)}\big]$, the catapult occurs in both eigendirections $\mathbf{p}_1$ and $\mathbf{p}_2$: $\Pi_i\mathcal{L}$ for $i = 1, 2$ increases to the order of $\Omega\big(\frac{m(\eta\lambda_i(0) - 2)^2}{\log m}\big)$ and then decreases to $O(1)$;

where $\epsilon = \Theta\big(\frac{\log m}{m}\big)$.

We verify the results for multiple training examples via the experiments in Figure 3.

Figure 3: Training dynamics of NQMs in the multiple-example case with different learning rates. (a) Training loss. (b) Largest eigenvalue of the tangent kernel. (c) Second largest eigenvalue of the tangent kernel. By our analysis, the two critical values are $2/\lambda_1(0) \approx 0.37$ and $2/\lambda_2(0) \approx 0.39$. When $\eta < 0.37$, linear dynamics dominate, hence the kernel is nearly constant; when $0.37 < \eta < 0.39$, the catapult phase happens in $\mathbf{p}_1$ and only $\lambda_1(t)$ decreases; when $0.39 < \eta < \eta_{\max}$, the catapult phase happens in $\mathbf{p}_1$ and $\mathbf{p}_2$, hence both $\lambda_1(t)$ and $\lambda_2(t)$ decrease. The experimental details can be found in Appendix N.1.

3.3 Connection to general quadratic models and wide neural networks

General quadratic models.

As mentioned in the introduction, NQMs are contained in a general class of quadratic models of the form given in Eq. (3). Additionally, we show that the two-layer linear neural network analyzed in Lewkowycz et al. (2020) is a special case of Eq. (3), and we provide a more general condition for such models to have catapult dynamics in Appendix M. Furthermore, we empirically observe that a broader class of quadratic models 𝑔 can have catapult dynamics simply by letting 𝜙 ⁢ ( 𝒙 ) and Σ be random and assigning a small value to 𝛾 (See Appendix N.4).

Wide neural networks.

We have seen that NQMs, which have a fixed Hessian, exhibit the catapult phase phenomenon. Therefore, a change in the Hessian of wide neural networks during training is not required to produce the catapult phase. We now discuss the high-level idea of analyzing the catapult phase for a general NQM with large learning rates, and empirically show that this idea applies to neural networks. We train an NQM $f_{\mathrm{quad}}$, Eq. (2), on $n$ data points $\{(\mathbf{x}_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$ with GD. The dynamics equations take the following form:

$$\mathbf{f}_{\mathrm{quad}}(t+1) - \mathbf{y} = \Big(I - \eta K(t) + \underbrace{\tfrac{1}{2}\,\eta^2\, G(t)\, \nabla \mathbf{f}_{\mathrm{quad}}(t)^T}_{R_{\mathbf{f}_{\mathrm{quad}}}(t)}\Big)\big(\mathbf{f}_{\mathrm{quad}}(t) - \mathbf{y}\big), \qquad (15)$$

$$K(t+1) = K(t) - \frac{1}{4}\,\eta\,\underbrace{\big(4\, G(t)\, \nabla \mathbf{f}_{\mathrm{quad}}(t)^T - \eta\, G(t)\, G(t)^T\big)}_{R_K(t)}, \qquad (16)$$

where $G_{i,:}(t) = \big(\mathbf{f}_{\mathrm{quad}}(t) - \mathbf{y}\big)^T\, \nabla \mathbf{f}_{\mathrm{quad}}(t)\, H_f(\mathbf{x}_i) \in \mathbb{R}^m$ for $i \in [n]$.

In our analysis of $f_{\mathrm{quad}}$ approximating two-layer networks in Section 3.2, we show that catapult dynamics occur in the top eigenspace of the tangent kernel. Specifically, we analyze the dynamics equations confined to the top eigendirection $\mathbf{p}_1$ of the tangent kernel (i.e., $\Pi_1\mathcal{L}$ and $\lambda_1(t)$). We show that $\mathbf{p}_1^T R_{\mathbf{f}_{\mathrm{quad}}} \mathbf{p}_1$ and $\mathbf{p}_1^T R_K \mathbf{p}_1$ scale with the loss and remain positive when the loss becomes large; therefore $\mathbf{p}_1^T K \mathbf{p}_1$ (i.e., $\lambda_{\max}(K)$) as well as the loss will be driven down, and consequently we obtain catapult convergence.

We empirically verify that catapults indeed happen in the top eigenspace of the tangent kernel for additional NQMs and wide neural networks in Appendix N.3. Furthermore, the top eigenvalues of the tangent kernel of wide neural networks behave similarly to those of NQMs when training with different learning rates (see Figure 5 in Appendix N).

4 Quadratic models parallel neural networks in generalization

In this section, we empirically compare the test performance of the three models considered in this paper as the learning rate varies. In particular, we consider (1) the NQM, $f_{\mathrm{quad}}$; (2) the corresponding neural network, $f$; and (3) the linear model, $f_{\mathrm{lin}}$.

Figure 4:Best test loss plotted against different learning rates for 𝑓 ⁢ ( 𝐰 ) , 𝑓 lin ⁢ ( 𝐰 ) and 𝑓 quad ⁢ ( 𝐰 ) across a variety of datasets and network architectures.

We implement our experiments on 3 vision datasets: CIFAR-2 (a 2-class subset of CIFAR-10 (Krizhevsky et al., 2009)), MNIST (LeCun et al., 1998), and SVHN (The Street View House Numbers) (Netzer et al., 2011); 1 speech dataset: the Free Spoken Digit Dataset (FSDD) (Jakobovski); and 1 text dataset: AG NEWS (Gulli).

In all experiments, we train the models by minimizing the squared loss using standard GD/SGD with constant learning rate 𝜂 . We report the best test loss achieved during the training process with each learning rate. Experimental details can be found in Appendix N.5. We also report the best test accuracy in Appendix N.6. For networks with 3 layers, see Appendix N.7. From the experimental results, we observe the following:

Sub-critical learning rates.

In accordance with our theoretical analyses, we observe that all three models have nearly identical test loss for any sub-critical learning rate. Specifically, note that as the width $m$ increases, $f$ and $f_{\mathrm{quad}}$ transition to linearity in the ball $B(\mathbf{w}_0, R)$:

$$\|f - f_{\mathrm{lin}}\| = \tilde{O}(1/\sqrt{m}), \qquad \|f_{\mathrm{quad}} - f_{\mathrm{lin}}\| = \tilde{O}(1/\sqrt{m}),$$

where $R > 0$ is a constant large enough to contain the optimization path for sub-critical learning rates. Thus, the generalization performance of these three models will be similar when $m$ is large, as shown in Figure 4.

Super-critical learning rates.

The best test loss of both $f(\mathbf{w})$ and $f_{\mathrm{quad}}(\mathbf{w})$ is consistently smaller than that achieved with sub-critical learning rates, and it decreases as the learning rate increases over a range of values beyond $\eta_{\mathrm{crit}}$, as was observed for wide neural networks in Lewkowycz et al. (2020).

As discussed in the introduction, with super-critical learning rates, both $f_{\mathrm{quad}}$ and $f$ exhibit a catapult phase, while the loss of $f_{\mathrm{lin}}$ diverges. Together with the similar generalization behaviour of $f_{\mathrm{quad}}$ and $f$ at super-critical learning rates, this suggests that NQMs are a better model than the linear approximation $f_{\mathrm{lin}}$ for understanding the training and test dynamics of $f$.

In Figure 4 we report the results for networks with the ReLU activation function. We also run the experiments using networks with Tanh and Swish (Ramachandran et al., 2017) activation functions, and observe the same generalization phenomena for $f$, $f_{\mathrm{lin}}$ and $f_{\mathrm{quad}}$ (see Appendix N.8).

5 Summary and Discussion

Summary.

In this paper, we use quadratic models as a tool to better understand optimization and generalization properties of finite width neural networks trained using large learning rates. Notably, we prove that quadratic models exhibit properties of neural networks such as the catapult phase that cannot be explained using linear models, which importantly includes linear approximations to neural networks given by the neural tangent kernel. Interestingly, we show empirically that quadratic models mimic the generalization properties of neural networks when trained with large learning rate, and that such models perform better than linearized neural networks.

Future directions.

As quadratic models are more analytically tractable than finite width neural networks, these models open further avenues for understanding the good performance of finite width networks in practice. In particular, one interesting direction of future work is to understand the change in the kernel corresponding to a trained quadratic model. As we showed, training a quadratic model with large learning rate causes a decrease in the eigenvalues of the neural tangent kernel, and it would be interesting to understand the properties of this changed kernel that correspond with improved generalization. Indeed, prior work Long (2021) has analyzed the properties of the “after kernel” corresponding to finite width neural networks, and it would be interesting to observe whether similar properties hold for the kernel corresponding to trained quadratic models.

Another interesting avenue of research is to understand whether quadratic models can be used for representation learning. Indeed, prior work Yang & Hu (2020) argues that networks in the neural tangent kernel regime do not learn useful representations of data through training. As quadratic models trained with large learning rate can already exhibit nonlinear dynamics and better capture generalization properties of finite width networks, it would be interesting to understand whether such models learn useful representations of data as well.

Acknowledgements

We thank Boris Hanin, Daniel A. Roberts and Sho Yaida for the discussion about quadratic models and catapults. A.R. is funded by the George F. Carrier fellowship at Harvard School of Engineering and Applied Sciences. We are grateful for the support from the National Science Foundation (NSF) and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning (https://deepfoundations.ai/) through awards DMS-2031883 and #814639 and the TILOS institute (NSF CCF-2112665). This work used NVIDIA V100 GPUs NVLINK and HDR IB (Expanse GPU) at SDSC Dell Cluster through allocation TG-CIS220009 and also, Delta system at the National Center for Supercomputing Applications through allocation bbjr-delta-gpu from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

References

Bai & Lee (2019) Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. In International Conference on Learning Representations, 2019.
Bof et al. (2018) Nicoletta Bof, Ruggero Carli, and Luca Schenato. Lyapunov theory for discrete time systems. arXiv preprint arXiv:1809.05289, 2018.
Chizat et al. (2019) Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.
Du et al. (2019) Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685, 2019.
Du et al. (2018) Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
Fort et al. (2020) Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. Advances in Neural Information Processing Systems, 33:5850–5861, 2020.
Gulli. Antonio Gulli. AG's corpus of news articles. http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.
Huang & Yau (2020) Jiaoyang Huang and Horng-Tzer Yau. Dynamics of deep neural networks and neural tangent hierarchy. In International Conference on Machine Learning, pp. 4542–4551. PMLR, 2020.
Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.
Jakobovski. Free-Spoken-Digit-Dataset. https://github.com/Jakobovski/free-spoken-digit-dataset.
Ji & Telgarsky (2019) Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. In International Conference on Learning Representations, 2019.
Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Lee et al. (2019) Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, pp. 8570–8581, 2019.
Lewkowycz et al. (2020) Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.
Liu et al. (2020) Chaoyue Liu, Libin Zhu, and Mikhail Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33, 2020.
Long (2021) Philip M. Long. Properties of the after kernel. arXiv preprint arXiv:2105.10585, 2021.
Meltzer & Liu (2023) David Meltzer and Junyu Liu. Catapult dynamics and phase transitions in quadratic nets. arXiv preprint arXiv:2301.07737, 2023.
Montanari & Zhong (2020) Andrea Montanari and Yiqiao Zhong. The interpolation phase transition in neural networks: Memorization and generalization under lazy training. arXiv preprint arXiv:2007.12826, 2020.
Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
Nichani et al. (2021) Eshaan Nichani, Adityanarayanan Radhakrishnan, and Caroline Uhler. Increasing depth leads to U-shaped test risk in over-parameterized convolutional networks. In International Conference on Machine Learning Workshop on Over-parameterization: Pitfalls and Opportunities, 2021.
Ortiz-Jiménez et al. (2021) Guillermo Ortiz-Jiménez, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. What can linearized neural networks actually say about generalization? Advances in Neural Information Processing Systems, 34, 2021.
Polyak (1987) Boris T. Polyak. Introduction to optimization. Optimization Software, Inc., New York, 1987.
Ramachandran et al. (2017) Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
Roberts et al. (2022) Daniel A. Roberts, Sho Yaida, and Boris Hanin. The principles of deep learning theory. Cambridge University Press, 2022.
Savarese et al. (2019) Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pp. 2667–2690. PMLR, 2019.
Williams et al. (2019) Francis Williams, Matthew Trager, Daniele Panozzo, Claudio Silva, Denis Zorin, and Joan Bruna. Gradient dynamics of shallow univariate ReLU networks. Advances in Neural Information Processing Systems, 32, 2019.
Yang & Hu (2020) Greg Yang and Edward J. Hu. Feature learning in infinite-width neural networks. arXiv preprint arXiv:2011.14522, 2020.
Zhang et al. (2019) Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George Dahl, Chris Shallue, and Roger B. Grosse. Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. Advances in Neural Information Processing Systems, 32, 2019.
Zou & Gu (2019) Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, pp. 2053–2062, 2019.
Appendix A Derivation of NQM

We derive the NQM that approximates the two-layer fully connected neural network with ReLU activation, based on Eq. (2).
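Before the derivation, a quick numerical check. For this two-layer ReLU model (we assume the parameterization $f(\mathbf{u},\mathbf{v};\boldsymbol{x}) = \frac{1}{\sqrt{m}}\sum_i v_i\,\sigma(\mathbf{u}_i^T\boldsymbol{x}/\sqrt{d})$, our reading of Eq. (2)), $f$ is bilinear in $(\mathbf{u},\mathbf{v})$ wherever no activation changes sign, so the second-order model $f_{\mathrm{quad}}$ derived below agrees with $f$ exactly for perturbations that preserve the activation pattern:

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 50, 4
x = rng.normal(size=d)
u0, v0 = rng.normal(size=(m, d)), rng.normal(size=m)

def f(u, v):
    # two-layer ReLU network (assumed parameterization of Eq. (2))
    return v @ np.maximum(u @ x / np.sqrt(d), 0.0) / np.sqrt(m)

def f_quad(u, v):
    # NQM of Eq. (17): the two linear terms plus the bilinear cross term,
    # with the activation pattern frozen at u0
    ind = (u0 @ x >= 0).astype(float)
    du, dv = u - u0, v - v0
    lin_u = np.sum((du @ x) * ind * v0) / np.sqrt(m * d)
    lin_v = dv @ np.maximum(u0 @ x / np.sqrt(d), 0.0) / np.sqrt(m)
    cross = np.sum((du @ x) * ind * dv) / np.sqrt(m * d)
    return f(u0, v0) + lin_u + lin_v + cross

# a perturbation small enough to leave every sign of u_i^T x unchanged
eps = 1e-3 * np.min(np.abs(u0 @ x)) / np.linalg.norm(x)
du, dv = eps * rng.normal(size=(m, d)), eps * rng.normal(size=m)
print(f(u0 + du, v0 + dv), f_quad(u0 + du, v0 + dv))
```

For larger perturbations that flip activations, $f_{\mathrm{quad}}$ and $f$ differ; that difference is exactly what the transition-to-linearity bounds control at large width.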

The first derivatives of $f$ can be computed as:

$$\frac{\partial f}{\partial \mathbf{u}_i} = \frac{1}{\sqrt{md}}\, v_i\, \mathbb{1}_{\{\mathbf{u}_i^T \boldsymbol{x} \ge 0\}}\, \boldsymbol{x}^T, \qquad \frac{\partial f}{\partial v_i} = \frac{1}{\sqrt{m}}\, \sigma\!\left(\frac{1}{\sqrt{d}}\, \mathbf{u}_i^T \boldsymbol{x}\right), \qquad \forall i \in [m].$$

And each entry of the Hessian of $f$, i.e., $H_f$, can be computed as

$$\frac{\partial^2 f}{\partial \mathbf{u}_i^2} = \mathbf{0}, \qquad \frac{\partial^2 f}{\partial v_i^2} = 0, \qquad \frac{\partial^2 f}{\partial \mathbf{u}_i\, \partial v_i} = \frac{1}{\sqrt{md}}\, \mathbb{1}_{\{\mathbf{u}_i^T \boldsymbol{x} \ge 0\}}\, \boldsymbol{x}^T, \qquad \forall i \in [m].$$

Now we obtain $f_{\mathrm{quad}}$, which takes the following form:

$$\begin{aligned}
\mathbf{NQM}:\quad f_{\mathrm{quad}}(\mathbf{u}, \mathbf{v}; \boldsymbol{x}) ={}& f(\mathbf{u}_0, \mathbf{v}_0; \boldsymbol{x}) + \frac{1}{\sqrt{md}} \sum_{i=1}^{m} (\mathbf{u}_i - \mathbf{u}_{0,i})^T \boldsymbol{x}\, \mathbb{1}_{\{\mathbf{u}_{0,i}^T \boldsymbol{x} \ge 0\}}\, v_{0,i} + \frac{1}{\sqrt{m}} \sum_{i=1}^{m} (v_i - v_{0,i})\, \sigma\!\left(\tfrac{1}{\sqrt{d}}\, \mathbf{u}_{0,i}^T \boldsymbol{x}\right) \\
&+ \frac{1}{\sqrt{md}} \sum_{i=1}^{m} (\mathbf{u}_i - \mathbf{u}_{0,i})^T \boldsymbol{x}\, \mathbb{1}_{\{\mathbf{u}_{0,i}^T \boldsymbol{x} \ge 0\}}\, (v_i - v_{0,i}). \qquad (17)
\end{aligned}$$

Appendix B Derivation of dynamics equations

For simplicity of notation, we denote 𝑓 quad by 𝑔 . Note that at initialization, the first and second derivatives of 𝑓 with respect to parameters are the same as those of 𝑔 .

B.1 Single training example

The NQM can be equivalently written as:

$$\begin{aligned}
g(\mathbf{u}, \mathbf{v}; \boldsymbol{x}) ={}& g(\mathbf{u}_0, \mathbf{v}_0; \boldsymbol{x}) + \left\langle \mathbf{u} - \mathbf{u}_0,\ \nabla_{\mathbf{u}} g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})\big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0} \right\rangle + \left\langle \mathbf{v} - \mathbf{v}_0,\ \nabla_{\mathbf{v}} g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})\big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0} \right\rangle \\
&+ \left\langle \mathbf{u} - \mathbf{u}_0,\ \frac{\partial^2 g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})}{\partial \mathbf{u}\, \partial \mathbf{v}}\Big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0} (\mathbf{v} - \mathbf{v}_0) \right\rangle,
\end{aligned}$$

since $\frac{\partial^2 g}{\partial \mathbf{u}^2} = 0$ and $\frac{\partial^2 g}{\partial \mathbf{v}^2} = 0$.

And the tangent kernel $\lambda(\mathbf{u}, \mathbf{v}; \boldsymbol{x})$ takes the form

$$\begin{aligned}
\lambda(\mathbf{u}, \mathbf{v}; \boldsymbol{x}) ={}& \left\| \nabla_{\mathbf{u}} g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})\big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0} + \frac{\partial^2 g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})}{\partial \mathbf{u}\, \partial \mathbf{v}}\Big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0} (\mathbf{v} - \mathbf{v}_0) \right\|_F^2 \\
&+ \left\| \nabla_{\mathbf{v}} g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})\big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0} + (\mathbf{u} - \mathbf{u}_0)^T\, \frac{\partial^2 g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})}{\partial \mathbf{u}\, \partial \mathbf{v}}\Big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0} \right\|^2.
\end{aligned}$$

Here

$$\nabla_{\mathbf{u}_i} g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})\big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0} = \frac{1}{\sqrt{md}}\, v_{0,i}\, \mathbb{1}_{\{\mathbf{u}_{0,i}^T \boldsymbol{x} \ge 0\}}\, \boldsymbol{x}, \qquad \forall i \in [m],$$

$$\nabla_{\mathbf{v}} g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})\big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0} = \frac{1}{\sqrt{md}}\, \sigma(\mathbf{u}_0^T \boldsymbol{x}).$$

In the following, we consider the dynamics of $g$ and $\lambda$ under GD; for simplicity of notation, we denote

$$\nabla_{\mathbf{u}} g(0) := \nabla_{\mathbf{u}} g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})\big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0}, \qquad \nabla_{\mathbf{v}} g(0) := \nabla_{\mathbf{v}} g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})\big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0}, \qquad \frac{\partial^2 g(0)}{\partial \mathbf{u}\, \partial \mathbf{v}} := \frac{\partial^2 g(\mathbf{u}, \mathbf{v}; \boldsymbol{x})}{\partial \mathbf{u}\, \partial \mathbf{v}}\Big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0}.$$

By gradient descent with learning rate $\eta$, at iteration $t$ we have the update equations for the weights $\mathbf{u}$ and $\mathbf{v}$:

$$\begin{aligned}
\mathbf{u}(t+1) &= \mathbf{u}(t) - \eta\, (g(t) - y) \left( \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\, \partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right), \\
\mathbf{v}(t+1) &= \mathbf{v}(t) - \eta\, (g(t) - y) \left( \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T\, \frac{\partial^2 g(0)}{\partial \mathbf{u}\, \partial \mathbf{v}} \right).
\end{aligned}$$
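These updates can be exercised on a generic model of the same bilinear form. In the sketch below, random vectors $a, b$ and a random matrix $B$ are illustrative stand-ins for $\nabla_{\mathbf{u}} g(0)$, $\nabla_{\mathbf{v}} g(0)$ and $\partial^2 g(0)/\partial\mathbf{u}\,\partial\mathbf{v}$ (not the ReLU-specific quantities); GD with a small, sub-critical learning rate drives the loss to zero:

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, y, eta = 20, 15, 1.0, 0.1
a = rng.normal(size=p) / np.sqrt(p)            # stand-in for grad_u g(0)
b = rng.normal(size=q) / np.sqrt(q)            # stand-in for grad_v g(0)
B = rng.normal(size=(p, q)) / np.sqrt(p * q)   # stand-in for the cross Hessian

u0, v0, g0 = np.zeros(p), np.zeros(q), 0.0

def g(u, v):
    return g0 + a @ (u - u0) + b @ (v - v0) + (u - u0) @ B @ (v - v0)

u, v = u0.copy(), v0.copy()
losses = []
for _ in range(500):
    r = g(u, v) - y
    losses.append(0.5 * r * r)
    # the GD update equations above (both evaluated at the old u, v)
    u, v = (u - eta * r * (a + B @ (v - v0)),
            v - eta * r * (b + (u - u0) @ B))
print(losses[0], losses[-1])
```

Here the tangent kernel at initialization is $\|a\|^2 + \|b\|^2 \approx 2$, so $\eta = 0.1$ is well below the critical value $2/\lambda(0)$.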

Then we plug them into the expression of $\lambda(t+1)$ and get

$$\begin{aligned}
\lambda(t+1) ={}& \left\| \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t+1) - \mathbf{v}(0)) \right\|_F^2 + \left\| \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t+1) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\|^2 \\
={}& \left\| \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \left( \mathbf{v}(t) - \eta (g(t) - y) \left( \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right) - \mathbf{v}(0) \right) \right\|_F^2 \\
&+ \left\| \nabla_{\mathbf{v}} g(0) + \left( \mathbf{u}(t) - \eta (g(t) - y) \left( \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right) - \mathbf{u}(0) \right)^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\|^2 \\
={}& \lambda(t) + \eta^2 (g(t) - y)^2 \left\| \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \left( \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right) \right\|_F^2 \\
&+ \eta^2 (g(t) - y)^2 \left\| \left( \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right)^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\|^2 \\
&- 2 \eta (g(t) - y) \left\langle \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)),\ \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \left( \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right) \right\rangle \\
&- 2 \eta (g(t) - y) \left\langle \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}},\ \left( \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right)^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\rangle.
\end{aligned}$$

Due to the structure of $\frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}$, we have

$$\left\| \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \left( \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right) \right\|_F^2 = \frac{\|\boldsymbol{x}\|^2}{md} \left\| \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\|^2 = \frac{\|\boldsymbol{x}\|^2}{md} \left\| \nabla_{\mathbf{v}} g(t) \right\|^2,$$

and

$$\left\| \left( \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right)^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\|^2 = \frac{\|\boldsymbol{x}\|^2}{md} \left\| \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right\|_F^2 = \frac{\|\boldsymbol{x}\|^2}{md} \left\| \nabla_{\mathbf{u}} g(t) \right\|_F^2.$$

Furthermore,

$$\begin{aligned}
&\left\langle \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)),\ \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \left( \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right) \right\rangle \\
&\qquad = \frac{\|\boldsymbol{x}\|^2}{md} \left( \langle \mathbf{v}(t) - \mathbf{v}(0),\ \nabla_{\mathbf{v}} g(0) \rangle + \langle \nabla_{\mathbf{u}} g(0),\ \mathbf{u}(t) - \mathbf{u}(0) \rangle + g(0) + \left\langle \mathbf{u}(t) - \mathbf{u}(0),\ \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right\rangle \right) \\
&\qquad = g(t)\, \|\boldsymbol{x}\|^2 / (md).
\end{aligned}$$

Similarly, we have

$$\left\langle \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}},\ \left( \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right)^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\rangle = g(t)\, \|\boldsymbol{x}\|^2 / (md).$$

As a result,

$$\begin{aligned}
\lambda(t+1) &= \lambda(t) + \frac{\|\boldsymbol{x}\|^2}{md}\, \eta^2 (g(t) - y)^2\, \lambda(t) - 4\, \frac{\|\boldsymbol{x}\|^2}{md}\, \eta\, (g(t) - y)\, g(t) \\
&= \lambda(t) + \eta\, \frac{\|\boldsymbol{x}\|^2}{md}\, (g(t) - y)^2 \left( \eta\, \lambda(t) - \frac{4\, g(t)}{g(t) - y} \right).
\end{aligned}$$

For $g$, we plug the update equations for $\mathbf{u}$ and $\mathbf{v}$ into the expression of $g(t+1)$ and get

$$\begin{aligned}
g(t+1) ={}& g(0) + \langle \mathbf{u}(t+1) - \mathbf{u}(0),\ \nabla_{\mathbf{u}} g(0) \rangle + \langle \mathbf{v}(t+1) - \mathbf{v}(0),\ \nabla_{\mathbf{v}} g(0) \rangle + \left\langle \mathbf{u}(t+1) - \mathbf{u}(0),\ \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t+1) - \mathbf{v}(0)) \right\rangle \\
={}& g(t) - \eta (g(t) - y) \left\langle \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)),\ \nabla_{\mathbf{u}} g(0) \right\rangle \\
&- \eta (g(t) - y) \left\langle \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}},\ \nabla_{\mathbf{v}} g(0) \right\rangle \\
&- \eta (g(t) - y) \left\langle \mathbf{u}(t) - \mathbf{u}(0),\ \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \left( \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right) \right\rangle \\
&- \eta (g(t) - y) \left\langle \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)),\ \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right\rangle \\
&+ \eta^2 (g(t) - y)^2 \left\langle \nabla_{\mathbf{u}} g(0) + \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)),\ \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \left( \nabla_{\mathbf{v}} g(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right) \right\rangle \\
={}& g(t) - \eta\, (g(t) - y)\, \lambda(t) + \frac{\|\boldsymbol{x}\|^2}{md}\, \eta^2 (g(t) - y)^2\, g(t).
\end{aligned}$$

Therefore,

$$g(t+1) - y = \left( 1 - \eta\, \lambda(t) + \frac{\|\boldsymbol{x}\|^2}{md}\, \eta^2\, (g(t) - y)\, g(t) \right) (g(t) - y).$$
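The pair of recursions just derived, for $\lambda(t)$ and $g(t) - y$, can be iterated directly as scalars. The constants below are illustrative assumptions ($\|\boldsymbol{x}\|^2/(md) = 10^{-3}$, $\lambda(0) = 1$, $y = 1$, $g(0) = 0$) with a super-critical learning rate $\eta = 3 > 2/\lambda(0)$; the loss first grows, the tangent kernel drops, and the loss then converges — the catapult:

```python
c = 1e-3           # stands in for ||x||^2 / (m d)
eta, y = 3.0, 1.0  # super-critical: eta > 2 / lambda(0)
g, lam = 0.0, 1.0
losses, lams = [], []
for _ in range(100):
    r = g - y
    losses.append(0.5 * r * r)
    lams.append(lam)
    # the derived single-example dynamics
    lam_next = lam + c * eta ** 2 * r * r * lam - 4 * c * eta * r * g
    g_next = y + (1.0 - eta * lam + c * eta ** 2 * r * g) * r
    g, lam = g_next, lam_next
print(max(losses), losses[-1], lams[-1])
```

With a sub-critical $\eta$ the same loop contracts monotonically and $\lambda$ barely moves, consistent with Theorem 4 below.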

B.2 Multiple training examples

We follow notation analogous to that of Appendix B.1 for the first- and second-order derivatives of $g$. Specifically, for $k \in [n]$, we denote

$$\nabla_{\mathbf{u}} g_k(0) := \nabla_{\mathbf{u}} g(\mathbf{u}, \mathbf{v}; x_k)\big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0}, \qquad \nabla_{\mathbf{v}} g_k(0) := \nabla_{\mathbf{v}} g(\mathbf{u}, \mathbf{v}; x_k)\big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0}, \qquad \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\, \partial \mathbf{v}} := \frac{\partial^2 g(\mathbf{u}, \mathbf{v}; x_k)}{\partial \mathbf{u}\, \partial \mathbf{v}}\Big|_{\mathbf{u}=\mathbf{u}_0, \mathbf{v}=\mathbf{v}_0}.$$

By GD with learning rate $\eta$, we have the update equations for the weights $\mathbf{u}$ and $\mathbf{v}$ at iteration $t$:

$$\begin{aligned}
\mathbf{u}(t+1) &= \mathbf{u}(t) - \eta \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{u}} g_k(0) + \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\, \partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right), \\
\mathbf{v}(t+1) &= \mathbf{v}(t) - \eta \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{v}} g_k(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T\, \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\, \partial \mathbf{v}} \right).
\end{aligned}$$

We consider the evolution of $K(t)$ first:

$$\begin{aligned}
K_{i,j}(t+1) ={}& \left\langle \nabla_{\mathbf{u}} g_i(0) + \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t+1) - \mathbf{v}(0)),\ \nabla_{\mathbf{u}} g_j(0) + \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t+1) - \mathbf{v}(0)) \right\rangle \\
&+ \left\langle \nabla_{\mathbf{v}} g_i(0) + (\mathbf{u}(t+1) - \mathbf{u}(0))^T \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}},\ \nabla_{\mathbf{v}} g_j(0) + (\mathbf{u}(t+1) - \mathbf{u}(0))^T \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\rangle \\
={}& K_{i,j}(t) - \eta \left\langle \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{v}} g_k(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right),\ \nabla_{\mathbf{u}} g_j(0) + \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right\rangle \\
&- \eta \left\langle \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{v}} g_k(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right),\ \nabla_{\mathbf{u}} g_i(0) + \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right\rangle \\
&+ \eta^2 \left\langle \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{v}} g_k(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right),\ \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{v}} g_k(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right) \right\rangle \\
&- \eta \left\langle \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{u}} g_k(0) + \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right),\ \nabla_{\mathbf{v}} g_i(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\rangle \\
&- \eta \left\langle \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{u}} g_k(0) + \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right),\ \nabla_{\mathbf{v}} g_j(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\rangle \\
&+ \eta^2 \left\langle \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{u}} g_k(0) + \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right),\ \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{u}} g_k(0) + \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right) \right\rangle.
\end{aligned}$$

We separate the data into two sets according to their sign:

$$\mathcal{S}_+ := \{ i : x_i \ge 0,\ i \in [n] \}, \qquad \mathcal{S}_- := \{ i : x_i < 0,\ i \in [n] \}.$$

We consider two scenarios: (1) $x_i$ and $x_j$ have different signs; (2) $x_i$ and $x_j$ have the same sign.

(1) A simple calculation shows that if $x_i$ and $x_j$ have different signs, i.e., $i \in \mathcal{S}_+, j \in \mathcal{S}_-$ or $i \in \mathcal{S}_-, j \in \mathcal{S}_+$, then

$$\frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} = 0, \qquad \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, \nabla_{\mathbf{u}} g_j(0) = 0, \qquad \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, \nabla_{\mathbf{v}} g_j(0) = 0.$$

Without loss of generality, assume $i \in \mathcal{S}_+$, $j \in \mathcal{S}_-$. Then we have

$$K_{i,j}(t+1) = K_{i,j}(t).$$

(2) If $x_i$ and $x_j$ have the same sign, i.e., $i, j \in \mathcal{S}_+$ or $i, j \in \mathcal{S}_-$, then

$$\frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} = \frac{x_j}{\sqrt{m}}\, \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}, \qquad \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, \nabla_{\mathbf{u}} g_j(0) = \frac{x_j}{\sqrt{m}}\, \nabla_{\mathbf{u}} g_i(0), \qquad \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, \nabla_{\mathbf{v}} g_j(0) = \frac{x_j}{\sqrt{m}}\, \nabla_{\mathbf{v}} g_i(0).$$

For $i, j \in \mathcal{S}_+$, we have

$$\begin{aligned}
K_{i,j}(t+1) ={}& K_{i,j}(t) - \frac{2\eta}{\sqrt{m}} \sum_{k \in \mathcal{S}_+} (g_k(t) - y_k)\, x_i \left\langle \nabla_{\mathbf{v}} g_k(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}},\ \nabla_{\mathbf{u}} g_j(0) + \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right\rangle \\
&- \frac{2\eta}{\sqrt{m}} \sum_{k \in \mathcal{S}_+} (g_k(t) - y_k)\, x_i \left\langle \nabla_{\mathbf{u}} g_k(0) + \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)),\ \nabla_{\mathbf{v}} g_j(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_j(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\rangle \\
&+ \frac{\eta^2}{m}\, x_i x_j \left\| \sum_{k \in \mathcal{S}_+} (g_k(t) - y_k) \left( \nabla_{\mathbf{v}} g_k(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right) \right\|^2 \\
&+ \frac{\eta^2}{m}\, x_i x_j \left\| \sum_{k \in \mathcal{S}_+} (g_k(t) - y_k) \left( \nabla_{\mathbf{u}} g_k(0) + \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right) \right\|^2 \\
={}& K_{i,j}(t) - \frac{4\eta}{m}\, x_i x_j \sum_{k \in \mathcal{S}_+} (g_k(t) - y_k)\, g_k(t) + \frac{\eta^2}{m}\, x_i x_j\, \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_+ \big)^T K(t) \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_+ \big) \\
={}& K_{i,j}(t) - \frac{4\eta}{m}\, x_i x_j\, \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_+ \big)^T \big( \mathbf{g}(t) \odot \boldsymbol{m}_+ \big) + \frac{\eta^2}{m}\, x_i x_j\, \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_+ \big)^T K(t) \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_+ \big).
\end{aligned}$$

Similarly, for $i, j \in \mathcal{S}_-$, we have

$$K_{i,j}(t+1) = K_{i,j}(t) - \frac{4\eta}{m}\, x_i x_j\, \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_- \big)^T \big( \mathbf{g}(t) \odot \boldsymbol{m}_- \big) + \frac{\eta^2}{m}\, x_i x_j\, \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_- \big)^T K(t) \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_- \big).$$

Combining the results together, we have

$$\begin{aligned}
K(t+1) ={}& K(t) + \frac{\eta^2}{m}\, \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_+ \big)^T K(t) \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_+ \big)\, \boldsymbol{p}_1 \boldsymbol{p}_1^T \\
&+ \frac{\eta^2}{m}\, \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_- \big)^T K(t) \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_- \big)\, \boldsymbol{p}_2 \boldsymbol{p}_2^T \\
&- \frac{4\eta}{m}\, \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_+ \big)^T \big( \mathbf{g}(t) \odot \boldsymbol{m}_+ \big)\, \boldsymbol{p}_1 \boldsymbol{p}_1^T - \frac{4\eta}{m}\, \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_- \big)^T \big( \mathbf{g}(t) \odot \boldsymbol{m}_- \big)\, \boldsymbol{p}_2 \boldsymbol{p}_2^T.
\end{aligned}$$

Now we derive the evolution of $\mathbf{g}(t) - \mathbf{y}$. Suppose $i \in \mathcal{S}_+$. Then we have

$$\begin{aligned}
g_i(t+1) ={}& g_i(0) + \langle \mathbf{u}(t+1) - \mathbf{u}(0),\ \nabla_{\mathbf{u}} g_i(0) \rangle + \langle \mathbf{v}(t+1) - \mathbf{v}(0),\ \nabla_{\mathbf{v}} g_i(0) \rangle + \left\langle \mathbf{u}(t+1) - \mathbf{u}(0),\ \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t+1) - \mathbf{v}(0)) \right\rangle \\
={}& g_i(t) - \eta \left\langle \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{u}} g_k(0) + \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right),\ \nabla_{\mathbf{u}} g_i(0) \right\rangle \\
&- \eta \left\langle \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{v}} g_k(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right),\ \nabla_{\mathbf{v}} g_i(0) \right\rangle \\
&- \eta \left\langle \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{u}} g_k(0) + \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right),\ \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right\rangle \\
&- \eta \left\langle \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{v}} g_k(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right),\ (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right\rangle \\
&+ \eta^2 \left\langle \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{u}} g_k(0) + \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}}\, (\mathbf{v}(t) - \mathbf{v}(0)) \right),\ \frac{\partial^2 g_i(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \sum_{k=1}^{n} (g_k(t) - y_k) \left( \nabla_{\mathbf{v}} g_k(0) + (\mathbf{u}(t) - \mathbf{u}(0))^T \frac{\partial^2 g_k(0)}{\partial \mathbf{u}\,\partial \mathbf{v}} \right) \right\rangle \\
={}& g_i(t) - \eta \sum_{k \in \mathcal{S}_+} (g_k(t) - y_k)\, K_{k,i}(t) + \frac{\eta^2}{m} \sum_{k \in \mathcal{S}_+} \sum_{j \in \mathcal{S}_+} (g_k(t) - y_k)(g_j(t) - y_j)\, g_j(t)\, x_k\, x_i.
\end{aligned}$$

Similarly, for $i \in \mathcal{S}_-$, we have

$$g_i(t+1) = g_i(t) - \eta \sum_{k \in \mathcal{S}_-} (g_k(t) - y_k)\, K_{k,i}(t) + \frac{\eta^2}{m} \sum_{k \in \mathcal{S}_-} \sum_{j \in \mathcal{S}_-} (g_k(t) - y_k)(g_j(t) - y_j)\, g_j(t)\, x_k\, x_i.$$

Combining the results together, we have

$$\begin{aligned}
\mathbf{g}(t+1) - \mathbf{y} ={}& \Big( I - \eta K(t) + \frac{\eta^2}{m}\, \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_+ \big)^T \big( \mathbf{g}(t) \odot \boldsymbol{m}_+ \big)\, \boldsymbol{p}_1 \boldsymbol{p}_1^T \\
&+ \frac{\eta^2}{m}\, \big( (\mathbf{g}(t) - \mathbf{y}) \odot \boldsymbol{m}_- \big)^T \big( \mathbf{g}(t) \odot \boldsymbol{m}_- \big)\, \boldsymbol{p}_2 \boldsymbol{p}_2^T \Big)\, (\mathbf{g}(t) - \mathbf{y}).
\end{aligned}$$

Appendix C Optimization with sub-critical learning rates

Theorem 4. Consider training the NQM Eq. (3) with squared loss on a single training example by GD. With a sub-critical learning rate $\eta \in \left[ \frac{\epsilon}{\lambda_0},\ \frac{2 - \epsilon}{\lambda_0} \right]$ with $\epsilon = \Theta\!\left( \frac{\log m}{\sqrt{m}} \right)$, the loss decreases exponentially:

$$\mathcal{L}(t+1) \le \left( 1 - \delta + O\!\left( \frac{1}{m\delta} \right) \right)^2 \mathcal{L}(t) = \big( 1 - \delta + o(\delta) \big)^2\, \mathcal{L}(t),$$

where $\delta = \min(\eta \lambda_0,\ 2 - \eta \lambda_0)$. Furthermore, $\sup_t |\lambda(t) - \lambda(0)| = O\!\left( \frac{1}{m\delta} \right)$.

We use the following transformation of variables to simplify notation:

$$u(t) = \frac{\|\boldsymbol{x}\|^2}{md}\, \eta^2\, (g(t) - y)^2, \qquad w(t) = \frac{\|\boldsymbol{x}\|^2}{md}\, \eta^2\, (g(t) - y)\, y, \qquad v(t) = \eta\, \lambda(t).$$

Then Eq. (11) and Eq. (12) reduce to

$$\begin{aligned}
u(t+1) &= (1 - v(t) + u(t) + w(t))^2\, u(t) := \kappa(t)\, u(t), \\
v(t+1) &= v(t) - u(t)\, (4 - v(t)) - 4\, w(t).
\end{aligned}$$
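These reduced dynamics are easy to iterate. The sketch below drops the $O(1/\sqrt{m})$ term $w(t)$ for illustration and contrasts a sub-critical start $v(0) < 2$, where $u(t)$ simply contracts, with a super-critical start $v(0) > 2$, where $u(t)$ first grows and then collapses once $v(t)$ has been driven below $2$ — the mechanism analyzed in Appendices D and E:

```python
def run(u, v, steps=200):
    # u(t+1) = (1 - v + u)^2 u,  v(t+1) = v - u (4 - v)   (w(t) dropped)
    peak_u = u
    for _ in range(steps):
        u, v = (1.0 - v + u) ** 2 * u, v - u * (4.0 - v)
        peak_u = max(peak_u, u)
    return u, v, peak_u

sub = run(1e-3, 1.0)   # sub-critical: v(0) = eta * lambda_0 = 1 < 2
sup = run(1e-3, 2.5)   # super-critical: v(0) = 2.5 > 2
print(sub, sup)
```

In the super-critical run, $u$ (the scaled loss) peaks and then decays while $v$ (the scaled tangent kernel) settles strictly below its initial value, matching the catapult picture.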

At initialization, since $|g(0)| = O(1)$, we have $u(0) \le C_u/m$ for some constant $C_u > 0$. As $|w(t)| = C \sqrt{u(t)}/\sqrt{m}$, where $C := \eta \|\boldsymbol{x}\| |y| / \sqrt{d} > 0$, we have $|w(0)| \le C_u C / m^{3/2}$. Note that by definition $u(t) \ge 0$ for all $t \ge 0$, and $v(t) \ge 0$ since $\lambda(t)$ is the tangent kernel for a single training example. From the definition of $\delta$, we can infer that $\delta < 1$.

In the following, we will show that if $v(0) \in [\epsilon, 2 - \epsilon]$ with $\epsilon = \Theta\!\left( \frac{\log m}{\sqrt{m}} \right)$, then there exist constants $C_\kappa, C_v > 0$ such that for all $t \ge 0$,

$$\kappa(t) \le \left( 1 - \delta + \frac{C_\kappa}{m\delta} \right)^2 < 1, \qquad |v(t) - v(0)| \le \frac{C_v}{m\delta},$$

if $C_\kappa \ge 9 C_u + (C_u + C_u C)\,\delta$ and $m$ satisfies

$$m \ge \max\left\{ \frac{12\, C_\kappa}{\delta^2},\ \frac{6\, C_\kappa}{\delta^{3/2}},\ C^2 \right\}.$$

Given the condition on $m$, we have $\left( 1 - \delta + \frac{C_\kappa}{m\delta} \right)^2 < 1$.

We prove the result by induction. When $t = 0$, we have

$$\kappa(0) = (1 - v(0) + u(0) + w(0))^2 < \left( 1 - \delta + \frac{C_u}{m} + \frac{C_u C}{m^{3/2}} \right)^2 < \left( 1 - \delta + \frac{C_\kappa}{m\delta} \right)^2,$$

where we use the assumption $C_\kappa \ge (C_u + C_u C)\,\delta$. Therefore the result holds at $t = 0$.

Suppose the results hold at $t = T$. Then at $t = T + 1$, by the inductive hypothesis that $u(t)$ decreases exponentially with $\kappa(t) < \left( 1 - \delta + \frac{C_\kappa}{m\delta} \right)^2$, we can bound the change of $v(T+1)$ from $v(0)$:

$$\begin{aligned}
|v(T+1) - v(0)| &= \left| \sum_{t=0}^{T} u(t)\,(4 - v(t)) + 4\, w(t) \right| \le \sum_{t=0}^{T} |u(t)|\, |4 - v(t)| + \sum_{t=0}^{T} 4\, |w(t)| \\
&\le \max_{t \in [0,T]} |v(t) - 4| \cdot \frac{u(0) - u(T) \max_{t \in [0,T]} \kappa(t)}{1 - \max_{t \in [0,T]} \kappa(t)} + 4\, \frac{|w(0)| - |w(T)| \max_{t \in [0,T]} \kappa(t)}{1 - \max_{t \in [0,T]} \kappa(t)} \\
&\le \frac{4\, u(0)}{1 - \max_{t \in [0,T]} \kappa(t)} + \frac{4\, |w(0)|}{1 - \max_{t \in [0,T]} \kappa(t)} \le \frac{4\, C_u/m}{\delta/2} + \frac{4\, C_u C / m^{3/2}}{\delta/2} \le \frac{9\, C_u}{m\delta}.
\end{aligned}$$

For the summation of the "geometric sequence" $\{u(0), u(1), \cdots, u(T)\}$, in which $u(0)$ and $u(T)$ are fixed but the per-step ratio only has an upper bound, we use the maximum ratio $\max_{t} \kappa(t)$ to upper bound the summation.
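This geometric-series bound can be checked on a concrete sequence (an illustrative numerical example, not part of the proof): if every per-step ratio is at most $\kappa_{\max} < 1$, the sum is at most $u(0)/(1 - \kappa_{\max})$.

```python
import random

random.seed(0)
kappa_max = 0.9
u = [1.0]
for _ in range(200):
    # per-step ratio varies but never exceeds kappa_max
    u.append(u[-1] * random.uniform(0.5, kappa_max))
print(sum(u), u[0] / (1 - kappa_max))  # the sum respects the bound
```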

For $1 - \max_{t \in [0,T]} \kappa(t)$, we use the bound

$$1 - \left( 1 - \delta + \frac{C_\kappa}{m\delta} \right)^2 = 2\delta - \delta^2 - \frac{C_\kappa^2}{m^2 \delta^2} - \frac{2\, C_\kappa}{m\delta} + \frac{2\, C_\kappa}{m} \ge \delta - \frac{\delta}{6} - \frac{\delta}{6} - \frac{\delta}{6} \ge \frac{\delta}{2},$$

where we use the assumption on $m$.

Furthermore,

$$\begin{aligned}
\kappa(T+1) &= (1 - v(T+1) + u(T+1) + w(T+1))^2 = (1 - v(0) + v(0) - v(T+1) + u(T+1) + w(T+1))^2 \\
&\le \left( 1 - \delta + \frac{9\, C_u}{m\delta} + \frac{C_u}{m} + \frac{C_u C}{m^{3/2}} \right)^2 \le \left( 1 - \delta + \frac{C_\kappa}{m\delta} \right)^2.
\end{aligned}$$

Here we use the assumption $C_\kappa \ge 9 C_u + (C_u + C_u C)\,\delta$. This completes the inductive step and hence the proof.

Appendix D Proof of Lemma 1

We present the formal statement of Lemma 1:

Lemma 1. Consider constants $C_u, C_u', C_v, C_\kappa, C_\epsilon > 0$ which satisfy $C_\kappa \ge 28\, C_u' + C_u'\, \delta + C\, C_u'$, where $C = \eta \|\mathbf{x}\| |y| / \sqrt{d}$, and $C_v \ge 28\, C_u'$. If $m$ satisfies

$$m \ge \max\left\{ \frac{C_\kappa\, \delta}{C_u\,(C + 1)},\ \left( \frac{2\, C_u + 4\, C\, C_u}{C_\kappa\, C_\epsilon} \right)^2,\ \frac{576\, C^2}{C_u'\, \delta^2},\ \exp(C_v\, \delta),\ \exp(4\, C_\kappa) \right\},$$

then with high probability over the random initialization of the weights the following holds: for $T \ge 0$ such that $\sup_{t \in [0,T]} u(t) \le \frac{C_u'\, \delta^2}{\log m}$, $u(t)$ increases exponentially with ratio $\inf_{t \in [0,T]} \kappa(t) \ge \left( 1 + \delta - \frac{C_\kappa\, \delta}{\log m} \right)^2 > 1$, and $\sup_{t \in [0,T]} |v(t) - v(0)| \le \frac{C_v\, \delta}{\log m}$.

Proof.

Due to the random initialization of the weights, with high probability there exists a constant $C_u > 0$ such that $|u(0)| \le C_u/m$. As $|w(t)| = C \sqrt{u(t)}/\sqrt{m}$, where $C := \eta \|\boldsymbol{x}\| |y| / \sqrt{d} > 0$, we have $|w(0)| \le C\, C_u / m^{3/2}$.

We prove the results by induction.

Recall that $\delta := \eta \lambda_0 - 2 \in [\epsilon, 2 - \epsilon]$, where $\epsilon \in \left[ C_\epsilon\, \frac{\log m}{\sqrt{m}},\ C_\epsilon'\, \frac{\log m}{\sqrt{m}} \right]$ for some constants $0 < C_\epsilon < C_\epsilon'$.

When $t = 0$, as $v(0) = \eta \lambda_0 = \delta + 2$, we have

$$\kappa(0) = (1 - v(0) + u(0) + w(0))^2 = (1 - (\delta + 2) + u(0) + w(0))^2 = (1 + \delta - u(0) - w(0))^2.$$

Based on the condition on $m$ that $m \ge \frac{C_\kappa\, \delta}{C_u\,(C + 1)}$, we have

$$(1 + \delta - u(0) - w(0))^2 \ge \left( 1 + \delta - \frac{C_u}{m} - \frac{C_u C}{m^{3/2}} \right)^2 \ge \left( 1 + \delta - \frac{C_u}{m} - \frac{C_u C}{m} \right)^2 \ge \left( 1 + \delta - \frac{C_\kappa\, \delta}{\log m} \right)^2.$$

And by the condition $m \ge \left( \frac{2\, C_u + 4\, C\, C_u}{C_\kappa\, C_\epsilon} \right)^2$, we get

$$|v(1) - v(0)| \le |u(0)|\, |2 - \delta| + 4\, |w(0)| \le \frac{2\, C_u}{m} + \frac{4\, C\, C_u}{m^{3/2}} \le \frac{C_\kappa\, \delta}{\log m}.$$

Therefore the results hold at $t = 0$.

Suppose the results hold at $t = T'$. Then at $t = T' + 1$, by the inductive hypothesis that $u(t)$ increases exponentially with rate at least $\left( 1 + \delta - \frac{C_\kappa\, \delta}{\log m} \right)^2$ from $u(0) \le C_u/m$ to $u(T') \le \frac{C_u'\, \delta^2}{\log m}$, we can bound the change of $v(t)$:

$$\begin{aligned}
|v(T'+1) - v(0)| &= \left| \sum_{t=1}^{T'} u(t)\,(v(t) - 4) + 4\, w(t) \right| \le \max_{t \in [0,T']} |v(t) - 4| \sum_{t=1}^{T'} u(t) + 4 \sum_{t=1}^{T'} |w(t)| \\
&\le \max_{t \in [0,T']} |v(t) - 4|\ \frac{u(T')\, \min_{t \in [0,T']} \kappa(t)}{\min_{t \in [0,T']} \kappa(t) - 1} + 4\, |w(T')|\ \frac{\min_{t \in [0,T']} \kappa(t)}{\min_{t \in [0,T']} \kappa(t) - 1} \\
&\le \left( \max_{t \in [0,T']} |v(t) - v(0)| + |v(0) - 4| \right) \left( \frac{C_u'\, \delta^2}{\log m} \right) \frac{(1 + \delta)^2}{\left( 1 + \delta - \frac{C_\kappa\, \delta}{\log m} \right)^2 - 1} + 4 \left( \frac{C\, C_u'\, \delta}{\sqrt{m \log m}} \right) \frac{1 + \delta}{\left( 1 + \delta - \frac{C_\kappa\, \delta}{\log m} \right) - 1} \\
&\le \left( 2 - \delta + \frac{C_v\, \delta}{\log m} \right) \left( \frac{9\, C_u'\, \delta}{\log m} \right) + \frac{24\, C\, C_u'}{\sqrt{m \log m}} \le \frac{28\, C_u'\, \delta}{\log m}. \qquad (18)
\end{aligned}$$

Here are the techniques used for the above inequalities. For the summation of the "geometric sequence" $\{u(0), u(1), \cdots, u(T')\}$, in which $u(0)$ and $u(T')$ are fixed but the per-step ratio only has a lower bound, we use the smallest ratio, i.e., $\inf_t \kappa(t)$, to upper bound the summation. Specifically, we apply the following inequality:

$$\sum_{t=1}^{T'} u(t) \le \sum_{t=1}^{T'} \frac{u(T')}{\left( \min_{t \in [0,T']} \kappa(t) \right)^{T'-t}} = u(T') \sum_{t=1}^{T'} \frac{1}{\left( \min_{t \in [0,T']} \kappa(t) \right)^{T'-t}} \le u(T')\, \frac{\min_{t \in [0,T']} \kappa(t)}{\min_{t \in [0,T']} \kappa(t) - 1}.$$

Additionally, since $m \ge \exp(4\, C_\kappa\, \delta)$, we used the inequality

$$\left( 1 + \delta - \frac{C_\kappa\, \delta}{\log m} \right)^2 - 1 = \left( 1 - \frac{C_\kappa}{\log m} \right)^2 \delta^2 + 2 \left( 1 - \frac{C_\kappa}{\log m} \right) \delta \ge 2 \left( 1 - \frac{C_\kappa}{\log m} \right) \delta \ge \delta,$$

and $\left( 1 + \delta - \frac{C_\kappa\, \delta}{\log m} \right) - 1 \ge \frac{\delta}{2}$ to bound the denominators of the geometric-sequence summations.

We further used the inequalities $0 < \delta < 2$ and $\frac{24\, C\, C_u'}{\sqrt{m \log m}} \le \frac{C_u'\, \delta}{\log m}$, which follows from the condition on $m$, to get the final upper bound.

Consequently, by the assumption $C_v \ge 28\, C_u'$, we have $|v(T'+1) - v(0)| \le \frac{C_v\, \delta}{\log m}$.

Now we bound the ratio $\kappa(T'+1)$. By our assumption, $u(T'+1) \le \frac{C_u'\, \delta^2}{\log m}$, and since $|w(T'+1)| = C\sqrt{u(T'+1)}/\sqrt{m}$ we can similarly bound $|w(T'+1)| \le \frac{C\, C_u'\, \delta}{\sqrt{m \log m}}$.

And the rate $\kappa(T'+1)$ satisfies

$$\kappa(T'+1) = (1 - v(T'+1) + u(T'+1) + w(T'+1))^2 = (1 - v(0) + v(0) - v(T'+1) + u(T'+1) + w(T'+1))^2 = (1 + \delta + v(T'+1) - v(0) - u(T'+1) - w(T'+1))^2.$$

Note that $|v(T'+1) - v(0)| \le \frac{28\, C_u'\, \delta}{\log m}$ by Eq. (18). By the assumption that $m \ge \exp(4\, C_\kappa)$ and $C_\kappa \ge 28\, C_u' + C_u'\, \delta + C\, C_u'$, we have $\delta > |v(T'+1) - v(0)| + u(T'+1) + |w(T'+1)|$.

Consequently, we can get

𝜅 ⁢ ( 𝑇 ′ + 1 )

( 1 + 𝛿 + 𝑣 ⁢ ( 𝑇 ′ + 1 ) − 𝑣 ⁢ ( 0 ) − 𝑢 ⁢ ( 𝑇 ′ + 1 ) − 𝑤 ⁢ ( 𝑇 ′ + 1 ) ) 2

≥ ( 1 + 𝛿 − | 𝑣 ⁢ ( 𝑇 ′ + 1 ) − 𝑣 ⁢ ( 0 ) | − 𝑢 ⁢ ( 𝑇 ′ + 1 ) − | 𝑤 ⁢ ( 𝑇 ′ + 1 ) | ) 2

≥ ( 1 + 𝛿 − 28 ⁢ 𝐶 𝑢 ′ ⁢ 𝛿 log ⁡ 𝑚 − 𝐶 𝑢 ′ ⁢ 𝛿 2 log ⁡ 𝑚 − 𝐶 ⁢ 𝐶 𝑢 ′ ⁢ 𝛿 𝑚 ⁢ log ⁡ 𝑚 ) 2

≥ ( 1 + 𝛿 − 𝐶 𝜅 ⁢ 𝛿 log ⁡ 𝑚 ) 2 .

Since 𝑚 ≥ exp ⁡ ( 4 ⁢ 𝐶 𝜅 ) , we have ( 1 + 𝛿 − 𝐶 𝜅 ⁢ 𝛿 log ⁡ 𝑚 ) 2 ≥ ( 1 + 3 4 ⁢ 𝛿 ) 2

1 .

Then we finish the inductive step hence finishing the proof.

Appendix E Proof of Lemma 2

A formal statement of Lemma 2 is as follows:

Lemma 2: Under the condition of Lemma 1, if we further assume that $m$ satisfies
$$
m \ge \max\Big\{\exp(2C_v\delta),\ \frac{256C^2}{(C_\epsilon-C_v')^2C_u'^2},\ \exp\big(5(C_u'+4CC_u')\big),\ \exp\Big(\frac{C_u'(C_\epsilon-2C_v')-8CC_u'}{20CC_u'}\Big)\Big\},
$$
where $C_v' := 18C_u' + 2C_v$, $C_v \ge 4CC_u'$ and $C_\epsilon > 2C_v'$, then with high probability over the random initialization of the weights the following holds: there exists $T^* > 0$ such that $u(T^*) = O\big(\frac1m\big)$.

Proof.

The main idea of the proof is the following: as $u(t)$ increases, $v(t)$ decreases, since $u(t)(4-v(t)) \gg w(t) = \Theta\big(\sqrt{u(t)/m}\big)$ in Eq. (14) and $u(t)(4-v(t)) > 0$. Furthermore, the increase of $u(t)$ speeds up the decrease of $v(t)$. However, $v(t)$ cannot decrease indefinitely, since $v(t) \ge 0$ by definition. Therefore, $u(t)$ has to stop increasing at some point and then decrease to a small value.

We first show that, by the choice of the learning rate, $4-v(0) \ge \epsilon$ with $\epsilon = \Theta\big(\frac{\log m}{\sqrt m}\big)$, and that $4-v(t) > 0$ for all $t$ in the increasing phase. Recall that $\delta := \eta\lambda_0-2$.

Proposition 1. Under the condition of Lemma 1, if we further assume $m \ge \exp\big(\big(\frac{48CC_u'}{C_\epsilon}\big)^{2/3}\big)$, then for all $T \ge 0$ such that $\sup_{t\in[0,T]}u(t) \le \frac{C_u'\delta^2}{\log m}$, we have $v(T) < 4-\frac{C_\epsilon\log m}{2\sqrt m}$.

See the proof in Appendix H.1.

Given the constant $C_u'$ in Lemma 1, we define the end of the increasing phase by $T_1$, i.e.,
$$
T_1 := \sup\Big\{t : u(t) \le \frac{C_u'\delta^2}{\log m}\Big\}.\tag{19}
$$
We further show that there exists $T_2 \ge T_1$ such that $v(T_2) \le 3$.

Note that we can in fact show that there exists $T_2$ such that $v(T_2) < \bar C$, where $\bar C \in (2,4)$ is a constant independent of $m$. Here, for simplicity of presentation, we take $\bar C = 3$. Furthermore, we note that $T_1$ and $T_2$ depend on $m$.

Before that, we present a useful result that controls the decrease of $v(t)$:

Proposition 2. For $t$ such that $v(t) < 4$, if $u(t) \ge \Big(\frac{4C}{\sqrt m\,(4-v(t))}\Big)^2$, then $v(t+1) < v(t)$.

See the proof in Appendix H.2.

Now we are ready to show the existence of $T_2$ such that $v(T_2) \le 3$.

Proposition 3. Under the condition of Lemma 1, if we further assume that $m$ satisfies
$$
m \ge \max\Big\{\exp\Big(\frac{768^2C^2}{C_u'C_\epsilon^2}\Big),\ \exp(2C_u'+C_\epsilon),\ \exp\Big(\Big(\frac{48CC_u'}{C_\epsilon}\Big)^{2/3}\Big),\ \frac{16C^2}{C_u'^2}\Big\},
$$
and $C_u' \ge 4C^2$, then there exists $T_2 \ge T_1$ such that $v(T_2) \le 3$.

See the proof in Appendix H.3.

Since $v(T_2) \le 3$, we have $4-v(T_2) \ge 1$. Then, simply using Proposition 2, we get:

Proposition 4. $v(t)$ keeps decreasing after $T_2$ until $u(t) = O\big(\frac1m\big)$.

Since by definition $v(t) = \eta\lambda(t)$ with $\lambda(t) \ge 0$, $v(t)$ cannot keep decreasing as $t\to\infty$; hence there exists $T^*$ such that $u(T^*) = O\big(\frac1m\big)$, which indicates that the loss decreases to the order of $O(1)$.

Appendix F Proof of Theorem 2

We compute the steady-state equilibria of Eq. (13) and (14). Setting $u(t+1) = u(t)$ and $v(t+1) = v(t)$, the steady-state equilibria $(u^*, v^*)$ satisfy one of the following:

(1) $u^* = 0$, $v^* \in \mathbb{R}$;

(2) $|1 - v^* + u^* + w^*| = 1$ and $u^*(4 - v^*) + 4w^* = 0$.

As $w(t)^2 = \frac{C^2u(t)}{m}$ where $C := \frac{\eta\|\boldsymbol x\|\,|y|}{\sqrt d} > 0$, we write $w$ as a function of $u$ for simplicity, hence $w^* = w(u^*)$.

As the dynamics equations are non-linear, we analyze the local stability of the steady-state equilibria through the Jacobian matrix of the dynamical system:
$$
J(u,v) = \begin{pmatrix} 2(1-v+u+w)\big(1+\frac{dw}{du}\big)u + (1-v+u+w)^2 & -2(1-v+u+w)u\\[4pt] v-4-4\frac{dw}{du} & 1+u \end{pmatrix}.
$$

We analyze the stability of the two equilibria separately.

For Scenario (1), evaluating $J(u,v)$ at the steady-state equilibrium $(u^*, v^*)$ gives
$$
J(u^*, v^*) = \begin{pmatrix} (1-v^*)^2 & 0\\[2pt] v^*-4-4\frac{dw}{du} & 1 \end{pmatrix}.
$$
The two eigenvalues of $J(u^*, v^*)$ are $1$ and $(1-v^*)^2$. We will show the Lyapunov stability of the equilibrium $(u^*, v^*)$. Specifically, we apply Theorem 1.2 in Bof et al. (2018). We choose the domain
$$
D = \big\{(u,v) : u \le C_1,\ |v-v^*| \le \min\big(|C_2-v^*|,\ |2-C_2-v^*|\big)\big\},
$$
where $C_1 = \Theta(1/m)$ and $C_2 = \Theta(1/m)$, and the Lyapunov function $V(u,v) = u + (v-v^*)^2$. It is not hard to verify that $V$ is locally Lipschitz in $D$, since $V$ is continuous on a compact domain. Furthermore, one can check that $(u^*, v^*)$ with $u^* = 0$ and $v^* \in [\epsilon, 2-\epsilon]$, where $\epsilon = \Theta(\log m/\sqrt m)$, satisfies the conditions of Eq. (3, 4) in Theorem 1.2 of Bof et al. (2018). Therefore, $(u^*, v^*)$ with $u^* = 0$ and $v^* \in [\epsilon, 2-\epsilon]$ is a stable equilibrium point.

For Scenario (2), we again evaluate $J(u,v)$ at the steady-state equilibrium $(u^*, v^*)$ and get
$$
J(u^*, v^*) = \begin{pmatrix} -2u^* + C\sqrt{\frac{u^*}{m}} + 1 & 2u^*\\[4pt] -\frac{2C}{\sqrt{mu^*}} & 1+u^* \end{pmatrix},
$$
where we replace $v^*$ by $4 + 4w^*/u^*$ based on the second equality in Scenario (2). Note that $u^*(4-v^*) > 0$, since $v < 4$ during the whole training process; therefore we must have $w^* < 0$ to achieve the equilibrium.

Computing the eigenvalues of $J(u^*, v^*)$, we get
$$
\lambda_J = 1 + \frac C2\sqrt{\frac{u^*}{m}} - \frac{u^*}{2} \pm \frac12(u^*)^{1/4}\sqrt{\frac{16C}{\sqrt m} - \frac{C^2\sqrt{u^*}}{m} + \frac{6Cu^*}{\sqrt m} - 9(u^*)^{3/2}}\; i.
$$
Note that when Scenario (2) holds, there are only two possible cases:

(2.1) $u^* = \Theta(1/m)$, $|w^*| = \Theta(1/m)$ and $v^* = \Theta(1)$;

(2.2) $u^* = \Theta(1/m)$, $|w^*| = \Theta(1/m)$ and $v^* = \Theta(1/m)$.

For (2.1), the first equality gives $v^* = 2 + u^* + w^* \in (1,2)$. Plugging $v^*$ into the second equality then yields $u^* \in \big(\frac{4C^2}{3m}, \frac{2C^2}{m}\big)$.

For (2.2), the second equality $u^*(4-v^*) + 4w^* = 0$ gives $u^* = \frac{C^2}{m} + o(1/m)$.

Computing the modulus of $\lambda_J$, we have
$$
|\lambda_J|^2 = 1 + 5C\sqrt{\frac{u^*}{m}} - u^* + o\Big(\frac1m\Big).
$$
Therefore, in both (2.1) and (2.2) we have $|\lambda_J| > 1$, which indicates that $(u^*, v^*)$ is unstable.

Appendix G Optimization with $\eta > \eta_{\max}$

Theorem 5. Consider training the NQM Eq. (3) with squared loss on a single training example by GD. If the learning rate satisfies $\eta \in \big[\frac{4+\epsilon}{\lambda_0}, \infty\big)$ with $\epsilon = \Theta\big(\frac{\log m}{\sqrt m}\big)$, then GD diverges.

Proof.

We again use the transformation of variables to simplify notation:
$$
u(t) = \frac{\|\boldsymbol x\|^2}{md}\eta^2(g(t)-y)^2,\qquad w(t) = \frac{\|\boldsymbol x\|^2}{md}\eta^2(g(t)-y)y,\qquad v(t) = \eta\lambda(t).
$$
Then Eq. (11) and Eq. (12) reduce to
$$
u(t+1) = \big(1-v(t)+u(t)+w(t)\big)^2u(t) := \kappa(t)u(t),
$$
$$
v(t+1) = v(t) - u(t)\big(4-v(t)\big) - 4w(t).
$$

We similarly consider the interval $[0,T]$ such that $\sup_{t\in[0,T]}u(t) = O\big(\frac{1}{\log m}\big)$. By Lemma 1, on $[0,T]$, $u(t)$ increases exponentially at a rate $\kappa(t) > 9$. We assume $|w(t)| > |u(t)(4-v(t))|$ for all $t\in[0,T]$, which is the worst case, since $v(t)$ then increases the least. By Lemma 1, we have $\sum_{t=0}^T|w(t)| = O\big(\frac{1}{\sqrt m\log m}\big)$, which is less than $\epsilon$. Therefore, we have $v(T) > 4$.

Then, at the end of the increasing phase, $|u(T_1)(4-v(T_1))| = \Omega(1/\sqrt m)$ is of a greater order than $|w(T_1)| = O\big(\frac{1}{\sqrt m\log m}\big)$, hence $v(t)$ increases at $T_1$. Note that $\kappa(T_1) = (1-4+o(1))^2 = 9+o(1)$, hence $u(t)$ also increases at $T_1$.

It is not hard to see that $v(t)$ will keep increasing unless $u(t)$ decreases to a smaller order. Specifically, $|u(t)(4-v(t))| < |4w(t)|$ would require $u(t)$ to be of order at most $O(1/\log m)$ (by letting $\epsilon u(t) = \Theta(w(t)) = \Theta(\sqrt{u(t)/m})$), which does not happen, since $\kappa(t) = (1-v(t)+o(1))^2 > 1$ contradicts the decrease of $u(t)$.

Therefore, both $u(t)$ and $v(t)$ keep increasing, which leads to the divergence of GD.
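The two regimes above (catapult for $2 < v(0) < 4$, divergence for $v(0) > 4$) can be reproduced by iterating the reduced map directly. The sketch below is a numerical illustration only: the values of $m$ and $C$ are arbitrary, and $w(t)$ is modeled with a fixed sign and magnitude $|w(t)| = C\sqrt{u(t)/m}$ as in the analysis.

```python
import math

def simulate(v0, m=10_000, C=1.0, steps=400):
    """Iterate u(t+1) = (1 - v + u + w)^2 u and v(t+1) = v - u(4 - v) - 4w,
    with |w(t)| = C * sqrt(u(t)/m) (sign fixed; an arbitrary modeling choice)."""
    u, v = 1.0 / m, v0
    for _ in range(steps):
        w = C * math.sqrt(u / m)
        u, v = (1 - v + u + w) ** 2 * u, v - u * (4 - v) - 4 * w
        if u > 1e12:          # treat as divergence
            return "diverged"
    return "converged" if u < 1e-2 else "other"

print(simulate(v0=3.0))   # 2 < v(0) < 4: loss spikes, then converges (catapult)
print(simulate(v0=4.2))   # v(0) > 4: u(t) and v(t) both grow without bound
```

For $v(0) = 3$, $u$ grows geometrically, drags $v$ below $2$, and then decays; for $v(0) = 4.2$, the $-u(4-v)$ term pushes $v$ further above $4$, accelerating the growth of $u$.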

Appendix H Proofs of propositions

H.1 Proof of Proposition 1

Proof. Note that $4-v(0) = 2-\delta \ge \frac{C_\epsilon\log m}{\sqrt m}$ by definition, where $C_\epsilon > 0$ is a constant. To show $4-v(T) > \frac{C_\epsilon\log m}{2\sqrt m}$, a sufficient condition is $v(T)-v(0) < \frac{C_\epsilon\log m}{2\sqrt m}$.

Specifically, we will prove that for any $T \ge 0$ such that $\sup_{t\in[0,T]}u(t) \le \frac{C_u'\delta^2}{\log m}$, the following holds:
$$
v(T)-v(0) < 4\sum_{t=0}^T|w(t)| \le \frac{24CC_u'}{\sqrt{m\log m}},
$$
where $C, C_u'$ are the same constants defined in Lemma 1. Then, by the condition $m \ge \exp\big(\big(\frac{48CC_u'}{C_\epsilon}\big)^{2/3}\big)$, we have $v(T)-v(0) < \frac{C_\epsilon\log m}{2\sqrt m}$.

We prove the result by induction.

When $T = 0$, the result holds trivially.

Suppose the result holds for $T = T'$. For $T = T'+1$: since $v(T')-v(0) < \frac{C_\epsilon\log m}{2\sqrt m} \le 4-v(0)$, we have $v(T') < 4$. Therefore, by the update equation of $v(t)$, Eq. (14), we have
$$
v(T'+1) = v(T') - u(T')\big(4-v(T')\big) - 4w(T') \le v(T') - 4w(T') \le v(T') + 4|w(T')|.
$$
Then
$$
v(T'+1)-v(0) = v(T'+1)-v(T') + v(T')-v(0) \le 4\sum_{t=0}^{T'+1}|w(t)|.
$$
By Lemma 1, we have $4\sum_{t=0}^{T'+1}|w(t)| \le \frac{24CC_u'}{\sqrt{m\log m}}$. Indeed, this inequality holds for any $T'+1$ such that $\sup_{t\in[0,T'+1]}u(t) \le \frac{C_u'\delta^2}{\log m}$.

This finishes the inductive step, hence finishing the proof.

H.2 Proof of Proposition 2

Proof. A sufficient condition for $v(t)$ to decrease is
$$
u(t)\big(4-v(t)\big) > 4|w(t)| = 4C\sqrt{\frac{u(t)}{m}}.
$$
If $u(t) \ge \Big(\frac{4C}{\sqrt m\,(4-v(t))}\Big)^2$, then the above condition is satisfied. ∎

H.3 Proof of Proposition 3

Proof. Note that for $t\in[0,T_1]$, the change of $v(t)$ satisfies $\sup_t|v(t)-v(0)| \le \frac{C_v\delta}{\log m}$ by Lemma 1.

For $\delta < 1-\frac{C_v\delta}{\log m}$, i.e., $v(0) < 3-\frac{C_v\delta}{\log m}$, we have $v(T_1) \le v(0) + |v(T_1)-v(0)| = 2+\delta+|v(0)-v(T_1)| < 3$. Therefore, the existence of $T_2$ is guaranteed by simply letting $T_2 = T_1$.

For $\delta \ge 1-\frac{C_v\delta}{\log m}$, i.e., $v(0) \ge 3-\frac{C_v\delta}{\log m}$, we show that there exists $T_2 \ge T_1$ (depending on $m$) such that $v(T_2) < 3$.

We prove the existence of $T_2$ by contradiction. Suppose that $v(t) \ge 3$ for all $t \ge T_1+1$.

In the simple case where $u(t) \ge \big(\frac{4C}{\sqrt m(4-v(t))}\big)^2$ for all $t$, Proposition 2 implies that $v(t)$ keeps decreasing, which ultimately leads to $v(t) < 3$.

Otherwise, suppose there is an iteration $t \ge T_1+1$ such that $u(t) \le \big(\frac{4C}{\sqrt m(4-v(t))}\big)^2$. The following proposition guarantees that $v(t)$ will decrease to a smaller value after such a $t$ occurs. Therefore, we can find $T_2$.

Proposition 5. Under the condition of Lemma 1, suppose $m$ further satisfies
$$
m \ge \max\Big\{\exp\Big(\frac{768^2C^2}{C_u'C_\epsilon^2}\Big),\ \frac{16C^2}{C_u'^2},\ \exp(2C_u'+C_\epsilon)\Big\},
$$
where $C_u' \ge 4C^2$. Then, if there is $T \ge 0$ such that $u(T) \le \big(\frac{4C}{\sqrt m(4-v(T))}\big)^2$ and $v(T) \ge 3$, we have $v(T'+2) < v(T)$ and $u(T'+1) \le \big(\frac{4C}{\sqrt m(4-v(T'))}\big)^2$, where $T'$ is the end of the increasing phase starting from $T$.

See the proof in Appendix H.4.

H.4Proof of proposition 5 Proof.

Since 𝑢 ⁢ ( 𝑇 ) ≤ 𝐶 𝑢 ′ log ⁡ 𝑚 by the assumption that 𝑚

16 ⁢ 𝐶 2 𝐶 𝑢 ′ 2 , the training dynamics falls into the increasing phase from 𝑇 . We denote the end of the increasing phase starting from 𝑇 by 𝑇 ′ , i.e.,

𝑇 ′

sup { 𝑡 : 𝑢 ⁢ ( 𝑡 ) ≤ 𝐶 𝑢 ′ ⁢ 𝛿 2 log ⁡ 𝑚 , 𝑡 ≥ 𝑇 } .

We will prove the result by induction.

Suppose 𝑇 ′ is the end of the first increasing phase, i.e., 𝑇 ′

𝑇 1 . By proposition 1, 𝑣 ⁢ ( 𝑇 1 ) < 4 − 𝐶 𝜖 ⁢ log ⁡ 𝑚 2 ⁢ 𝑚 . And the magnitude of 𝑢 ⁢ ( 𝑇 1 + 1 ) can be lower bounded by

𝑢 ⁢ ( 𝑇 1 + 1 ) ≥ 𝐶 𝑢 ′ 4 ⁢ log ⁡ 𝑚 ,

(20)

where we use 𝛿 ≥ 1 2 by the assumption on 𝑚 that and plug 𝛿 in 𝐶 𝑢 ′ ⁢ 𝛿 2 log ⁡ 𝑚 .

Note that when 𝑚

exp ⁡ ( 28 ⁢ 𝐶 𝑢 ′ ) , 𝛿 ≥ 1 2 is necessary for 𝑣 ⁢ ( 𝑇 1 )

3 as | 𝑣 ⁢ ( 𝑇 1 ) − 𝑣 ⁢ ( 0 ) | ≤ 28 ⁢ 𝐶 𝑢 ′ ⁢ 𝛿 log ⁡ 𝑚 by Inequality (18).

Furthermore, with the above bound on $u(T_1+1)$, we have
$$
\begin{aligned}
v(T_1+1) &= v(T_1) - u(T_1)\big(4-v(T_1)\big) - 4w(T_1)\\
&\le 4 - \frac{C_\epsilon\log m}{2\sqrt m} + \frac{C_u'}{4\log m}\cdot\frac{C_\epsilon\log m}{2\sqrt m} + 4|w(T_1)|\\
&\le 4 - \frac{C_\epsilon\log m}{2\sqrt m} + \frac{C_u'C_\epsilon}{8\sqrt m} + \frac{2C_u'}{\sqrt{m\log m}}\\
&\le 4 - \frac{C_\epsilon\log m}{4\sqrt m}.
\end{aligned}\tag{21}
$$
Consequently, when $C_u' \ge 4C^2$ and $m \ge \exp(2C_u'+C_\epsilon)$, we have
$$
u(T_1+1) = \kappa(T_1)u(T_1) = \big(1-v(T_1)+u(T_1)+w(T_1)\big)^2u(T_1) \le \Big(1-4+\frac{C_\epsilon\log m}{4\sqrt m}+\frac{C_u'}{4\log m}+\frac{CC_u'}{2\sqrt{m\log m}}\Big)^2\frac{C_u'}{4\log m} \le \frac{9C_u'}{4\log m}.
$$
Therefore, at $T_1+1$ we have
$$
\begin{aligned}
v(T_1+1) - v(T_1+2) &\ge u(T_1+1)\big(4-v(T_1+1)\big) - 4C\sqrt{\frac{u(T_1+1)}{m}}\\
&\ge \frac{C_u'}{4\log m}\cdot\frac{C_\epsilon\log m}{4\sqrt m} - \frac{6C\sqrt{C_u'}}{\sqrt{m\log m}}\\
&= \frac{C_u'C_\epsilon}{16\sqrt m} - \frac{6C\sqrt{C_u'}}{\sqrt{m\log m}}\\
&\ge \frac{C_u'C_\epsilon}{32\sqrt m},
\end{aligned}
$$
where we use the assumption that $m \ge \exp\big(\frac{256C^2}{9C_u'C_\epsilon^2}\big)$.

Note that any increase of $v(t)$ during $[T, T_1+1]$ is caused by the $w(t)$ term, since the term $-u(t)(4-v(t))$ is negative for all $t\in[T, T_1+1]$. Hence the maximum increase during $[T, T_1+1]$ is bounded as
$$
v(T_1+1)-v(T) \le \sum_{t=T}^{T_1+1}4|w(t)| \le \frac{24CC_u'}{\sqrt{m\log m}},
$$
where we use Eq. (18) in the proof of Lemma 1.

By the assumption that $m \ge \exp\big(\frac{768^2C^2}{C_u'C_\epsilon^2}\big)$, we have $v(T_1+2) < v(T)$.

If there is $\tilde T > 0$ such that $u(\tilde T) \le \big(\frac{4C}{\sqrt m(4-v(\tilde T))}\big)^2$ while $v(\tilde T) \ge 3$, there is another increasing phase. Since $v(\tilde T) < v(T)$, we can apply the same analysis under the same conditions to show $v(\tilde T_1+2) < v(\tilde T)$, where $\tilde T_1$ is the end of the increasing phase starting from $\tilde T$. This finishes the inductive step, hence finishing the proof.

H.5 Proof of Proposition 7

We restate Proposition 7: for any $\mathbf u, \mathbf v \in \mathbb R^m$, $\mathrm{rank}(K) \le 2$. Furthermore, $\mathbf p_1, \mathbf p_2$ are eigenvectors of $K$, where $p_{1,i} = x_i\mathbb 1\{i\in\mathcal S_+\}$ and $p_{2,i} = x_i\mathbb 1\{i\in\mathcal S_-\}$ for $i\in[n]$.

Proof. By Definition 1,
$$
K_{i,j} = \frac1m\sum_{k=1}^m(v_k^2+u_k^2)\,x_ix_j\,\mathbb 1\{u_kx_i\ge0\}\mathbb 1\{u_kx_j\ge0\},\qquad i,j\in[n].
$$
By the definition of an eigenvector, we can see that
$$
\begin{aligned}
\sum_{j=1}^nK_{i,j}p_{1,j} &= \frac1m\sum_{j=1}^n\sum_{k=1}^m(v_k^2+u_k^2)\,x_ix_j^2\,\mathbb 1\{u_kx_i\ge0\}\mathbb 1\{u_kx_j\ge0\}\mathbb 1\{j\in\mathcal S_+\}\\
&= \sum_{j=1}^nx_j^2\,\mathbb 1\{j\in\mathcal S_+\}\cdot\frac1m\sum_{k=1}^m(v_k^2+u_k^2)\,x_i\,\mathbb 1\{u_kx_i\ge0\}\mathbb 1\{u_kx_j\ge0\}\\
&= x_i\mathbb 1\{i\in\mathcal S_+\}\sum_{j=1}^nx_j^2\,\mathbb 1\{j\in\mathcal S_+\}\cdot\frac1m\sum_{k=1}^m(v_k^2+u_k^2)\,\mathbb 1\{u_kx_j\ge0\},
\end{aligned}
$$
where we use the fact that $K_{i,j} = 0$ if $x_ix_j < 0$. As $p_{1,i} = x_i\mathbb 1\{i\in\mathcal S_+\}$ and $\sum_{j=1}^nx_j^2\mathbb 1\{j\in\mathcal S_+\}\cdot\frac1m\sum_{k=1}^m(v_k^2+u_k^2)\mathbb 1\{u_kx_j\ge0\}$ does not depend on $i$, we see that $\boldsymbol p_1$ is an eigenvector of $K$ with corresponding eigenvalue $\lambda_1 = \sum_{j=1}^nx_j^2\mathbb 1\{j\in\mathcal S_+\}\cdot\frac1m\sum_{k=1}^m(v_k^2+u_k^2)\mathbb 1\{u_kx_j\ge0\}$. The same analysis shows that $\boldsymbol p_2$ is another eigenvector of $K$, with corresponding eigenvalue $\lambda_2 = \sum_{j=1}^nx_j^2\mathbb 1\{j\in\mathcal S_-\}\cdot\frac1m\sum_{k=1}^m(v_k^2+u_k^2)\mathbb 1\{u_kx_j\ge0\}$. As for the rank of $K$, it is not hard to verify that $K = \lambda_1\frac{\boldsymbol p_1\boldsymbol p_1^T}{\|\boldsymbol p_1\|^2} + \lambda_2\frac{\boldsymbol p_2\boldsymbol p_2^T}{\|\boldsymbol p_2\|^2}$, hence the rank of $K$ is at most $2$. ∎
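Proposition 7 can be checked numerically for scalar inputs. The sketch below uses arbitrary synthetic weights and data: it forms the kernel $K_{ij} = \frac1m\sum_k(v_k^2+u_k^2)x_ix_j\mathbb 1\{u_kx_i\ge0\}\mathbb 1\{u_kx_j\ge0\}$ and verifies that its rank is at most $2$ and that $\boldsymbol p_1, \boldsymbol p_2$ are eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500
u = rng.standard_normal(m)                         # first-layer weights (scalar inputs)
v = rng.standard_normal(m)                         # second-layer weights
x = np.array([1.3, -0.7, 2.1, -1.5, 0.4, -0.2])    # scalar training inputs, mixed signs

c = u**2 + v**2
A = (u[:, None] * x[None, :] >= 0).astype(float)   # ReLU activation pattern, m x n
B = A * x[None, :]                                 # row k: x_j * 1{u_k x_j >= 0}
K = (B * c[:, None]).T @ B / m                     # K_ij exactly as in the proof

p1 = np.where(x >= 0, x, 0.0)                      # eigenvector supported on S_+
p2 = np.where(x < 0, x, 0.0)                       # eigenvector supported on S_-
lam1 = (p1 @ K @ p1) / (p1 @ p1)
lam2 = (p2 @ K @ p2) / (p2 @ p2)

assert np.linalg.matrix_rank(K) <= 2
assert np.allclose(K @ p1, lam1 * p1)
assert np.allclose(K @ p2, lam2 * p2)
print(lam1, lam2)
```

The check relies on the same two facts as the proof: the activation pattern is identical for all positive inputs (and for all negative inputs), and mixed-sign entries of $K$ vanish.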

Appendix I Scale of the tangent kernel for a single training example

Proposition 6 (Scale of the tangent kernel). For any $\delta\in(0,1)$, if $m \ge c'\log(4/\delta)$, where $c'$ is an absolute constant, then with probability at least $1-\delta$, $\|\mathbf x\|^2/(2d) \le \lambda(0) \le 3\|\mathbf x\|^2/(2d)$.

Proof. Note that when $t = 0$,
$$
\lambda(0) = \frac1{md}\sum_{i=1}^m\big(\mathbf u_{0,i}^T\boldsymbol x\,\mathbb 1\{\mathbf u_{0,i}^T\boldsymbol x\ge0\}\big)^2 + \frac1{md}\sum_{i=1}^m(v_{0,i})^2\|\boldsymbol x\|^2\big(\mathbb 1\{\mathbf u_{0,i}^T\boldsymbol x\ge0\}\big)^2.
$$
According to the NTK initialization, for each $i\in[m]$, $v_{0,i}\sim\mathcal N(0,1)$ and $\mathbf u_{0,i}\sim\mathcal N(0,I)$. We consider the random variables
$$
\zeta_i := \mathbf u_{0,i}^T\boldsymbol x\,\mathbb 1\{\mathbf u_{0,i}^T\boldsymbol x\ge0\},\qquad \xi_i := v_{0,i}\,\mathbb 1\{\mathbf u_{0,i}^T\boldsymbol x\ge0\}.
$$
It is not hard to see that $\zeta_i$ and $\xi_i$ are sub-gaussian, since $\mathbf u_{0,i}^T\boldsymbol x$ and $v_{0,i}$ are sub-gaussian. Specifically, for any $t\ge0$,
$$
\mathbb P\{|\zeta_i|\ge t\} \le \mathbb P\{|\mathbf u_{0,i}^T\boldsymbol x|\ge t\} \le 2\exp\big(-t^2/(2\|\boldsymbol x\|^2)\big),
$$
$$
\mathbb P\{|\xi_i|\ge t\} \le \mathbb P\{|v_{0,i}|\ge t\} \le 2\exp(-t^2/2),
$$
where the second inequality in each line comes from the definition of sub-gaussian variables.

Since $\xi_i$ is sub-gaussian, by definition $\xi_i^2$ is sub-exponential, and its sub-exponential norm is bounded:
$$
\|\xi_i^2\|_{\psi_1} \le \|\xi_i\|_{\psi_2}^2 \le C,
$$
where $C > 0$ is an absolute constant. Similarly, we have $\|\zeta_i\|_{\psi_2}^2 \le C\|\boldsymbol x\|^2$.

By Bernstein's inequality, for every $t\ge0$ we have
$$
\mathbb P\Big\{\Big|\sum_{i=1}^m\xi_i^2-\frac m2\Big|\ge t\Big\} \le 2\exp\Big(-c\min\Big(\frac{t^2}{\sum_{i=1}^m\|\xi_i^2\|_{\psi_1}^2},\ \frac{t}{\max_i\|\xi_i^2\|_{\psi_1}}\Big)\Big),
$$
where $c > 0$ is an absolute constant.

Letting $t = m/4$, with probability at least $1-2\exp(-m/c')$ we have
$$
\frac m4 \le \sum_{i=1}^m\xi_i^2 \le \frac{3m}4,
$$
where $c' = c/(4C)$.

Similarly, with probability at least $1-2\exp(-m/c')$ we have
$$
\frac m4\|\boldsymbol x\|^2 \le \sum_{i=1}^m\zeta_i^2 \le \frac{3m}4\|\boldsymbol x\|^2.
$$
As a result, using a union bound, with probability at least $1-4\exp(-m/c')$ we have
$$
\frac{\|\boldsymbol x\|^2}{2d} \le \lambda(0) \le \frac{3\|\boldsymbol x\|^2}{2d}.
$$
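The two-sided bound on $\lambda(0)$ concentrates quickly. A Monte Carlo check at moderate width (with an arbitrary input $\boldsymbol x$; a numerical illustration, not part of the proof) lands comfortably inside $[\|\boldsymbol x\|^2/(2d),\ 3\|\boldsymbol x\|^2/(2d)]$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50_000, 10
x = rng.standard_normal(d)
U = rng.standard_normal((m, d))    # u_{0,i} ~ N(0, I), rows of U
v = rng.standard_normal(m)         # v_{0,i} ~ N(0, 1)

pre = U @ x                        # preactivations u_{0,i}^T x
act = pre >= 0                     # ReLU activation indicator
lam0 = ((pre * act) ** 2).sum() / (m * d) + ((v**2) * act).sum() * (x @ x) / (m * d)

lo, hi = (x @ x) / (2 * d), 3 * (x @ x) / (2 * d)
assert lo <= lam0 <= hi
print(lam0, (lo, hi))
```

In expectation each of the two sums contributes $\|\boldsymbol x\|^2/2$ per unit of $m$, so $\lambda(0)$ concentrates around $\|\boldsymbol x\|^2/d$, the midpoint of the interval.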

Appendix J Scale of the tangent kernel for multiple training examples

Proof. As shown in Proposition 7, $\boldsymbol p_1$ and $\boldsymbol p_2$ are eigenvectors of $K$; hence we have two eigenvalues:
$$
\lambda_1(0) = \frac{\boldsymbol p_1^TK(0)\boldsymbol p_1}{\|\boldsymbol p_1\|^2},\qquad \lambda_2(0) = \frac{\boldsymbol p_2^TK(0)\boldsymbol p_2}{\|\boldsymbol p_2\|^2}.
$$
Take $\lambda_1(0)$ as an example:
$$
\begin{aligned}
\lambda_1(0)\|\boldsymbol p_1\|^2 &= \sum_{i,j=1}^nx_ix_j\,\mathbb 1\{x_i\ge0\}\mathbb 1\{x_j\ge0\}\cdot\frac1m\sum_{k=1}^m(u_{0,k}^2+v_{0,k}^2)\,x_ix_j\,\mathbb 1\{u_{0,k}x_i\ge0\}\mathbb 1\{u_{0,k}x_j\ge0\}\\
&= \frac1m\sum_{k=1}^m(u_{0,k}^2+v_{0,k}^2)\big(\mathbb 1\{u_{0,k}\ge0\}\big)^2\sum_{i,j=1}^nx_i^2x_j^2\,\mathbb 1\{x_i\ge0\}\mathbb 1\{x_j\ge0\}.
\end{aligned}
$$

Similar to the proof of Proposition 6, we consider $\xi_k := v_{0,k}\mathbb 1\{u_{0,k}\ge0\}$, which is a sub-gaussian random variable. Hence $\xi_k^2$ is sub-exponential, so that $\|\xi_k^2\|_{\psi_1} \le C$, where $C > 0$ is an absolute constant. By Bernstein's inequality, for every $t\ge0$ we have
$$
\mathbb P\Big\{\Big|\sum_{k=1}^m\xi_k^2-\frac m2\Big|\ge t\Big\} \le 2\exp\Big(-c\min\Big(\frac{t^2}{\sum_{k=1}^m\|\xi_k^2\|_{\psi_1}^2},\ \frac{t}{\max_k\|\xi_k^2\|_{\psi_1}}\Big)\Big),
$$
where $c > 0$ is an absolute constant.

Letting $t = m/4$, with probability at least $1-2\exp(-m/c')$ we have
$$
\frac m4 \le \sum_{k=1}^m\xi_k^2 \le \frac{3m}4,
$$
where $c' = c/(4C)$.

The same analysis applies to $\zeta_k := u_{0,k}\mathbb 1\{u_{0,k}\ge0\}$ as well, and with probability at least $1-2\exp(-m/c')$ we have
$$
\frac m4 \le \sum_{k=1}^m\zeta_k^2 \le \frac{3m}4.
$$
As a result, with probability at least $1-4\exp(-m/c')$,
$$
\lambda_1(0)\|\boldsymbol p_1\|^2 = \frac1m\sum_{k=1}^m(u_{0,k}^2+v_{0,k}^2)\big(\mathbb 1\{u_{0,k}\ge0\}\big)^2\sum_{i,j=1}^nx_i^2x_j^2\,\mathbb 1\{x_i\ge0\}\mathbb 1\{x_j\ge0\}
$$
$$
\in\Big[\frac12\sum_{i,j=1}^nx_i^2x_j^2\,\mathbb 1\{x_i\ge0\}\mathbb 1\{x_j\ge0\},\ \frac32\sum_{i,j=1}^nx_i^2x_j^2\,\mathbb 1\{x_i\ge0\}\mathbb 1\{x_j\ge0\}\Big].
$$

Applying the same analysis to $\lambda_2(0)$, with probability at least $1-4\exp(-m/c')$ we have
$$
\lambda_2(0)\|\boldsymbol p_2\|^2 = \frac1m\sum_{k=1}^m(u_{0,k}^2+v_{0,k}^2)\big(\mathbb 1\{u_{0,k}\le0\}\big)^2\sum_{i,j=1}^nx_i^2x_j^2\,\mathbb 1\{x_i\le0\}\mathbb 1\{x_j\le0\}
$$
$$
\in\Big[\frac12\sum_{i,j=1}^nx_i^2x_j^2\,\mathbb 1\{x_i\le0\}\mathbb 1\{x_j\le0\},\ \frac32\sum_{i,j=1}^nx_i^2x_j^2\,\mathbb 1\{x_i\le0\}\mathbb 1\{x_j\le0\}\Big].
$$
The largest eigenvalue is $\max\{\lambda_1(0),\lambda_2(0)\}$. Combining the results, with probability at least $1-4\exp(-m/c')$ we have
$$
\frac12M \le \|K(0)\| \le \frac32M,
$$
where
$$
M = \max\Big\{\frac{\sum_{i,j=1}^nx_i^2x_j^2\,\mathbb 1\{x_i\ge0\}\mathbb 1\{x_j\ge0\}}{\sum_{i=1}^nx_i^2\,\mathbb 1\{x_i\ge0\}},\ \frac{\sum_{i,j=1}^nx_i^2x_j^2\,\mathbb 1\{x_i\le0\}\mathbb 1\{x_j\le0\}}{\sum_{i=1}^nx_i^2\,\mathbb 1\{x_i\le0\}}\Big\}.\qquad∎
$$

Appendix K Analysis of optimization dynamics for multiple training examples

In this section, we discuss the optimization dynamics for multiple training examples. We will see that, by confining the dynamics to each eigendirection of the tangent kernel, the training dynamics is similar to that for a single training example.

Since $x_i$ is a scalar for all $i\in[n]$, by the homogeneity of the ReLU activation function we can compute the exact eigenvectors of $K(t)$ for all $t\ge0$. To that end, we group the data into two sets $\mathcal S_+$ and $\mathcal S_-$ according to their sign:
$$
\mathcal S_+ := \{i : x_i\ge0,\ i\in[n]\},\qquad \mathcal S_- := \{i : x_i<0,\ i\in[n]\}.
$$
Now we have the following proposition for the tangent kernel $K$ (the proof is deferred to Appendix H.5):

Proposition 7 (Eigenvectors and low-rank structure of $K$). For any $\mathbf u, \mathbf v\in\mathbb R^m$, $\mathrm{rank}(K)\le2$. Furthermore, $\mathbf p_1, \mathbf p_2$ are eigenvectors of $K$, where $p_{1,i} = x_i\mathbb 1\{i\in\mathcal S_+\}$ and $p_{2,i} = x_i\mathbb 1\{i\in\mathcal S_-\}$ for $i\in[n]$.

Note that when all $x_i$ have the same sign, $\mathrm{rank}(K) = 1$ and $K$ has only one eigenvector (either $\boldsymbol p_1$ or $\boldsymbol p_2$, depending on the sign). This is in fact a simpler setting, since we only need to consider one direction, and its analysis is covered by the one for $\mathrm{rank}(K) = 2$. Therefore, in the following we assume $\mathrm{rank}(K) = 2$. We denote the two eigenvalues of $K(t)$ by $\lambda_1(t)$ and $\lambda_2(t)$, corresponding to $\boldsymbol p_1$ and $\boldsymbol p_2$ respectively, i.e., $K(t)\boldsymbol p_1 = \lambda_1(t)\boldsymbol p_1$ and $K(t)\boldsymbol p_2 = \lambda_2(t)\boldsymbol p_2$. Without loss of generality, we assume $\lambda_1(0) \ge \lambda_2(0)$.

By Eq. (5), the tangent kernel $K$ at step $t$ is defined as
$$
K_{i,j}(t) = \langle\nabla_{\mathbf v}g_i(t),\nabla_{\mathbf v}g_j(t)\rangle + \langle\nabla_{\mathbf u}g_i(t),\nabla_{\mathbf u}g_j(t)\rangle = \frac1m\sum_{k=1}^m\big((u_k(t))^2+(v_k(t))^2\big)x_ix_j\,\mathbb 1\{u_k(0)x_i\ge0\}\mathbb 1\{u_k(0)x_j\ge0\},\quad\forall i,j\in[n].
$$

Similar to the single-example case, the largest eigenvalue of the tangent kernel is bounded away from $0$:

Proposition 8. For any $\delta\in(0,1)$, if $m \ge c'\log(4/\delta)$, where $c'$ is an absolute constant, then with probability at least $1-\delta$, $M/2 \le \lambda_{\max}(K(0)) \le 3M/2$, where
$$
M = \max\Big\{\frac{\sum_{i,j=1}^nx_i^2x_j^2\,\mathbb 1\{x_i\ge0\}\mathbb 1\{x_j\ge0\}}{\sum_{i=1}^nx_i^2\,\mathbb 1\{x_i\ge0\}},\ \frac{\sum_{i,j=1}^nx_i^2x_j^2\,\mathbb 1\{x_i\le0\}\mathbb 1\{x_j\le0\}}{\sum_{i=1}^nx_i^2\,\mathbb 1\{x_i\le0\}}\Big\}.
$$
The proof can be found in Appendix J.

For simplicity of notation, given $\boldsymbol p, \boldsymbol m\in\mathbb R^n$, we define the matrices $K_{\boldsymbol p,\boldsymbol m}$ and $Q_{\boldsymbol p,\boldsymbol m}$:
$$
K_{\boldsymbol p,\boldsymbol m}(t) := \big((\mathbf g(t)-\mathbf y)\odot\boldsymbol m\big)^TK(t)\big((\mathbf g(t)-\mathbf y)\odot\boldsymbol m\big)\,\boldsymbol p\boldsymbol p^T,
$$
$$
Q_{\boldsymbol p,\boldsymbol m}(t) := \big((\mathbf g(t)-\mathbf y)\odot\boldsymbol m\big)^T\big(\mathbf g(t)\odot\boldsymbol m\big)\,\boldsymbol p\boldsymbol p^T.
$$
It is not hard to see that for all $t$, $K_{\boldsymbol p,\boldsymbol m}$ and $Q_{\boldsymbol p,\boldsymbol m}$ are rank-1 matrices. In particular, $\boldsymbol p$ is the only eigenvector of $K_{\boldsymbol p,\boldsymbol m}$ and $Q_{\boldsymbol p,\boldsymbol m}$ with a nonzero eigenvalue.

With the above notation, we can write the update equations for $\mathbf g(t)-\mathbf y$ and $K(t)$ during gradient descent with learning rate $\eta$:

Dynamics equations.
$$
\mathbf g(t+1)-\mathbf y = \Big(I-\eta K(t)+\underbrace{\frac{\eta^2}{m}\big(Q_{\boldsymbol p_1,\boldsymbol m_+}(t)+Q_{\boldsymbol p_2,\boldsymbol m_-}(t)\big)}_{R_{\mathbf g}(t)}\Big)(\mathbf g(t)-\mathbf y),\tag{22}
$$
$$
K(t+1) = K(t)+\underbrace{\frac{\eta^2}{m}\big(K_{\boldsymbol p_1,\boldsymbol m_+}(t)+K_{\boldsymbol p_2,\boldsymbol m_-}(t)\big)-\frac{4\eta}{m}\big(Q_{\boldsymbol p_1,\boldsymbol m_+}(t)+Q_{\boldsymbol p_2,\boldsymbol m_-}(t)\big)}_{R_K(t)},\tag{23}
$$
where $\boldsymbol m_+, \boldsymbol m_-\in\mathbb R^n$ are mask vectors: $m_{+,i} = \mathbb 1\{i\in\mathcal S_+\}$ and $m_{-,i} = \mathbb 1\{i\in\mathcal S_-\}$.

Now we are ready to discuss the three different optimization dynamics for the multiple-training-example case, in analogy with the single-example case.

Monotonic convergence: sub-critical learning rates ($\eta < 2/\lambda_1(0)$).

We use the key observation that when $\|\mathbf g(t)\|$ is small, i.e., $O(1)$, and $\|K(t)\|$ is bounded, then $\|R_{\mathbf g}(t)\|$ and $\|R_K(t)\|$ are of order $o(1)$. The dynamics equations then approximately reduce to the linear dynamics for multiple training examples:
$$
\mathbf g(t+1)-\mathbf y = \big(I-\eta K(t)+o(1)\big)(\mathbf g(t)-\mathbf y),\qquad K(t+1) = K(t)+o(1).
$$
At initialization, $\|\mathbf g(0)\| = O(1)$ with high probability over the random initialization. By the choice of the learning rate, for all $t\ge0$ we have $|1-\eta\lambda_i(t)| < 1$ for $i = 1,2$; hence the projection of $\|\mathbf g(t)-\mathbf y\|$ onto $\boldsymbol p_1, \boldsymbol p_2$ decreases exponentially. The cumulative change in the norm of the tangent kernel is $o(1)$, since $\|R_K(t)\| = O(1/m)$ and the loss decreases exponentially, so that $\sum_t\|R_K(t)\| = O(1/m) = o(1)$.

Catapult convergence: super-critical learning rates ($2/\lambda_1(0) < \eta < 4/\lambda_1(0)$).

We summarize the catapult dynamics in the following restatement.

Restate Theorem 3 (Catapult dynamics on multiple training examples). Suppose Assumption 1 holds, and consider training the NQM Eq. (3) with squared loss on multiple training examples by GD. Then:

1. with $\eta\in\big[\frac{2+\epsilon}{\lambda_1(0)}, \frac{2-\epsilon}{\lambda_2(0)}\big]$, the catapult occurs only in eigendirection $\boldsymbol p_1$: $\Pi_1\mathcal L$ increases to the order of $\Omega\big(\frac{m(\eta-2/\lambda_1(0))^2}{\log m}\big)$ and then decreases to $O(1)$;

2. with $\eta\in\big[\frac{2+\epsilon}{\lambda_2(0)}, \frac{4-\epsilon}{\lambda_1(0)}\big]$, the catapult occurs in both eigendirections $\boldsymbol p_1$ and $\boldsymbol p_2$: $\Pi_i\mathcal L$ for $i = 1,2$ increases to the order of $\Omega\big(\frac{m(\eta-2/\lambda_i(0))^2}{\log m}\big)$ and then decreases to $O(1)$,

where $\epsilon = \Theta\big(\frac{\log m}{\sqrt m}\big)$.

The proof can be found in Appendix L.

For the remaining eigendirections $\boldsymbol p_3,\cdots,\boldsymbol p_n$, i.e., a basis of the subspace orthogonal to $\boldsymbol p_1$ and $\boldsymbol p_2$, the following proposition shows that the loss projected onto this subspace does not change during training. It follows from the fact that $K$, $R_{\mathbf g}(t)$ and $R_K(t)$ are orthogonal to $\boldsymbol p_i\boldsymbol p_i^T$ for $i = 3,\cdots,n$.

Proposition 9. $\forall t\ge0$, $\Pi_i\mathcal L(t) = \Pi_i\mathcal L(0)$ for $i = 3,\cdots,n$.

Once the catapult finishes and the loss has decreased to the order of $O(1)$, we generally have $\eta < 2/\lambda_1$ and $\eta < 2/\lambda_2$. The training dynamics therefore fall into the linear dynamics regime, and we can reuse the analysis for sub-critical learning rates for the remainder of training.

Divergence ($\eta > \eta_{\max} = 4/\lambda_1(0)$).

Similar to the increasing phase of the catapult convergence, initially $\|\mathbf g(t)-\mathbf y\|$ increases in directions $\boldsymbol p_1$ and $\boldsymbol p_2$, since the linear dynamics dominate and the learning rate is chosen to be larger than $\eta_{\mathrm{crit}}$. Also, by an analysis similar to that for the catapult convergence, we approximately have $\eta \ge 4/\lambda_1(t)$ at the end of the increasing phase. We consider the evolution of $K(t)$ in the direction $\boldsymbol p_1$. Note that when $\|\mathbf g(t)\|$ increases to the order of $\Theta(\sqrt m)$, $\mathbf g(t)\odot\boldsymbol m_+$ becomes aligned with $\boldsymbol p_1$; hence, by a simple calculation, we approximately have
$$
\boldsymbol p_1^TR_K(t)\boldsymbol p_1 \approx \frac{\|\mathbf g(t)\|^2\|\boldsymbol p_1\|^2}{m}\,\eta\big(\eta\lambda_1(t)-4\big) > 0.
$$
Therefore, $\lambda_1(t)$ increases, since $\boldsymbol p_1^TK(t+1)\boldsymbol p_1 = \boldsymbol p_1^TK(t)\boldsymbol p_1 + \boldsymbol p_1^TR_K(t)\boldsymbol p_1 > \boldsymbol p_1^TK(t)\boldsymbol p_1$. As a result, $\|I-\eta K(t)+R_{\mathbf g}(t)\|$ becomes even larger, which makes $\|\mathbf g(t)-\mathbf y\|$ grow faster and ultimately leads to divergence of the optimization.

Appendix L Proof of Theorem 3

As the tangent kernel $K$ has rank $2$ by Proposition 7, the update to the weight parameters $\mathbf w$ lies in a subspace of dimension $2$. Specifically,
$$
\mathbf w(t+1) = \mathbf w(t) - \eta\frac{\partial\mathbf g}{\partial\mathbf w}\frac{\partial\mathcal L}{\partial\mathbf g}(t),
$$
where $\partial\mathbf g/\partial\mathbf w$ has rank $2$. Therefore, to understand the whole training dynamics, it suffices to analyze the dynamics of the loss in eigendirections $\boldsymbol p_1$ and $\boldsymbol p_2$.

We analyze the dynamics of the loss $\mathcal L$ and the tangent kernel $K$ confined to $\boldsymbol p_1$ and $\boldsymbol p_2$. It turns out that the dynamics in each eigendirection is almost independent of the other, and hence reduces to the same training dynamics as for a single training example.

We start with eigendirection $\boldsymbol p_1$. Considering the training dynamics of Eqs. (22) and (23) confined to direction $\boldsymbol p_1$, we have
$$
\Pi_1\mathcal L(t+1) = \big(1-\eta\lambda_1(t)+\boldsymbol p_1^TR_{\mathbf g}(t)\boldsymbol p_1\big)^2\,\Pi_1\mathcal L(t) := \kappa_1(t)\,\Pi_1\mathcal L(t),
$$
$$
\lambda_1(t+1) = \lambda_1(t) + \boldsymbol p_1^TR_K(t)\boldsymbol p_1,
$$
where we use the notation $\Pi_1\mathcal L(t) = \frac12\langle\mathbf g(t)-\mathbf y, \boldsymbol p_1\rangle^2$.

Expanding $\boldsymbol p_1^TR_{\mathbf g}(t)\boldsymbol p_1$ and $\boldsymbol p_1^TR_K(t)\boldsymbol p_1$, we have
$$
\boldsymbol p_1^TR_{\mathbf g}(t)\boldsymbol p_1 = \frac{2\eta^2}{m}\Pi_1\mathcal L(t) + \frac{\eta^2}{m}\big\langle(\mathbf g(t)-\mathbf y)\odot\boldsymbol m_+,\ \mathbf y\odot\boldsymbol m_+\big\rangle,
$$
$$
\boldsymbol p_1^TR_K(t)\boldsymbol p_1 = \frac{2\eta}{m}\Pi_1\mathcal L(t)\big(\eta\lambda_1(t)-4\big) - \frac{4\eta}{m}\big\langle(\mathbf g(t)-\mathbf y)\odot\boldsymbol m_+,\ \mathbf y\odot\boldsymbol m_+\big\rangle.
$$

Analogous to the transformation applied to Eqs. (11) and (12) in the proof of Theorem 1, we let
$$
u_1(t) = \frac{2\eta^2}{m}\Pi_1\mathcal L(t),\quad w_1(t) = \frac{\eta^2}{m}\big\langle(\mathbf g(t)-\mathbf y)\odot\boldsymbol m_+,\ \mathbf y\odot\boldsymbol m_+\big\rangle,\quad v_1(t) = \eta\lambda_1(t).
$$
Then the dynamics equations can be written as
$$
u_1(t+1) = \big(1-v_1(t)+u_1(t)+w_1(t)\big)^2u_1(t),\tag{24}
$$
$$
v_1(t+1) = v_1(t) - u_1(t)\big(4-v_1(t)\big) - 4w_1(t).\tag{25}
$$
Note that at initialization $\|\mathbf g(0)\| = O(1)$ with high probability; hence $u_1(0) = O\big(\frac1m\big)$ and $w_1(0) = O\big(\frac1m\big)$ (we omit the factor $n$, as $n$ is a constant). Furthermore, $|w_1(t)| = \Theta\big(\sqrt{u_1(t)/m}\big)$. Therefore, both the dynamics equations and the initial conditions are exactly the same as those for a single training example (Eqs. (13) and (14)). We can then follow the same idea as in the proof of Theorem 1 to establish the catapult in eigendirection $\boldsymbol p_1$.

Similarly, when we consider the training dynamics confined to $\boldsymbol p_2$, we have
$$
u_2(t+1) = \big(1-v_2(t)+u_2(t)+w_2(t)\big)^2u_2(t),\tag{26}
$$
$$
v_2(t+1) = v_2(t) - u_2(t)\big(4-v_2(t)\big) - 4w_2(t),\tag{27}
$$
where
$$
u_2(t) = \frac{2\eta^2}{m}\Pi_2\mathcal L(t),\quad w_2(t) = \frac{\eta^2}{m}\big\langle(\mathbf g(t)-\mathbf y)\odot\boldsymbol m_-,\ \mathbf y\odot\boldsymbol m_-\big\rangle,\quad v_2(t) = \eta\lambda_2(t).
$$
Then the same analysis as in Theorem 1 shows the catapult in direction $\boldsymbol p_2$.

Note that when $2/\lambda_2(0) > 4/\lambda_1(0)$, the learning rate is only allowed to be less than $4/\lambda_1(0)$ (otherwise GD diverges); in that case there is no catapult in direction $\boldsymbol p_2$.

Appendix MSpecial case of quadratic models when 𝜙 ⁢ ( 𝒙 )

0

In this section we will show under some special settings, the catapult phase phenomenon also happens and how two layer linear neural networks fit in our quadratic model.

We consider one training example ( 𝒙 , 𝑦 ) with label 𝑦

0 and assume the initial tangent kernel 𝜆 ⁢ ( 0 )

Ω ⁢ ( 1 ) . Letting the feature vector 𝜙 ⁢ ( 𝒙 )

0 , the quadratic model Eq.(3) becomes:

𝑔 ⁢ ( 𝐰 )

1 2 ⁢ 𝛾 ⁢ 𝐰 𝑇 ⁢ Σ ⁢ ( 𝒙 ) ⁢ 𝐰 .

For this quadratic model, we have the following proposition:

Proposition 10.

With learning rate 2 𝜆 ⁢ ( 0 ) < 𝜂 < 4 𝜆 ⁢ ( 0 ) , if Σ ⁢ ( 𝐱 ) 2

‖ 𝐱 ‖ 2 ⋅ 𝐼 , 𝑔 ⁢ ( 𝐰 ) exhibits catapult phase.

Proof.

With a simple computation, we get

$g(t+1) = \big(1 - \eta\lambda(t) + \gamma\eta^2 \|\boldsymbol{x}\|^2 (g(t))^2\big)\, g(t),$

$\lambda(t+1) = \lambda(t) - \gamma \|\boldsymbol{x}\|^2 (g(t))^2 \big(4 - \eta\lambda(t)\big).$

We note that the evolution of $g$ and $\lambda$ is almost the same as Eq. (11) and Eq. (12) if we regard $\gamma = 1/m$. Hence we can apply the same analysis to show the catapult phase phenomenon. ∎
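As a sanity check, the recursion in the proof can be simulated directly. All constants below ($\gamma = 10^{-2}$, $\lambda(0) = 1$, $\|\boldsymbol{x}\| = 1$, $g(0) = 0.1$, $\eta = 3$) are hypothetical illustrative choices, not values from the paper; $\eta = 3$ lies in $(2/\lambda(0), 4/\lambda(0)) = (2, 4)$, the range of Proposition 10.

```python
# Simulate the g / lambda recursion from the proof of Proposition 10.
# All constants are hypothetical illustrative choices: eta = 3 lies in
# (2/lambda(0), 4/lambda(0)) = (2, 4), the regime of Proposition 10.

def prop10_dynamics(g0, lam0, eta, gamma, x_sq=1.0, steps=300):
    g, lam = g0, lam0
    losses = [0.5 * g * g]          # squared loss with label y = 0
    for _ in range(steps):
        g, lam = (1.0 - eta * lam + gamma * eta ** 2 * x_sq * g * g) * g, \
                 lam - gamma * x_sq * g * g * (4.0 - eta * lam)
        losses.append(0.5 * g * g)
    return losses, lam

losses, lam_end = prop10_dynamics(g0=0.1, lam0=1.0, eta=3.0, gamma=1e-2)
# Catapult: the loss spikes far above its initial value, then the
# tangent kernel drops until eta * lambda < 2 and the loss converges.
print(max(losses), losses[-1], 3.0 * lam_end)
```

Running this shows the loss climbing by orders of magnitude before the kernel $\lambda$ shrinks enough that $\eta\lambda < 2$ and GD reconverges.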

It is worth pointing out that the two-layer linear neural network with input $\boldsymbol{x} \in \mathbb{R}^d$ analyzed in Lewkowycz et al. (2020),

$f(\mathbf{U}, \mathbf{v}; \boldsymbol{x}) = \frac{1}{\sqrt{m}}\, \mathbf{v}^T \mathbf{U} \boldsymbol{x},$

where $\mathbf{v} \in \mathbb{R}^m$ and $\mathbf{U} \in \mathbb{R}^{m \times d}$, is a special case of our model with $\mathbf{w} = [\mathrm{Vec}(\mathbf{U})^T, \mathbf{v}^T]^T$, $\gamma = 1/\sqrt{m}$, and

$\Sigma = \begin{pmatrix} 0 & I_m \otimes \boldsymbol{x} \\ I_m \otimes \boldsymbol{x}^T & 0 \end{pmatrix} \in \mathbb{R}^{(md+m) \times (md+m)}.$
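To see the reduction concretely, one can check numerically that $\frac{1}{2}\gamma\,\mathbf{w}^T \Sigma \mathbf{w}$ with the block matrix above equals $\gamma\, \mathbf{v}^T \mathbf{U} \boldsymbol{x}$. The sketch below uses small hypothetical values of $m$, $d$ and the weights, and assumes $\mathrm{Vec}$ stacks the rows of $\mathbf{U}$ (the Kronecker blocks are indexed to match that convention).

```python
# Check, for small m and d, that (1/2) gamma * w^T Sigma w with the block
# Sigma above equals gamma * v^T U x, i.e. the two-layer linear network.
# The concrete numbers are arbitrary test values, not from the paper;
# Vec is taken to stack the rows of U.

m, d = 2, 3
U = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]   # m x d
v = [1.0, -2.0]                              # length m
x = [0.3, -0.6, 0.9]                         # length d
gamma = 1.0 / (m ** 0.5)                     # NTK-style scaling 1/sqrt(m)

# w = [Vec(U)^T, v^T]^T
w = [U[i][j] for i in range(m) for j in range(d)] + v

# Sigma = [[0, I_m kron x], [I_m kron x^T, 0]], block i of Vec(U) pairs with v_i
n = m * d + m
Sigma = [[0.0] * n for _ in range(n)]
for i in range(m):
    for j in range(d):
        Sigma[i * d + j][m * d + i] = x[j]
        Sigma[m * d + i][i * d + j] = x[j]

quad = 0.5 * gamma * sum(w[a] * Sigma[a][b] * w[b]
                         for a in range(n) for b in range(n))
lin = gamma * sum(v[i] * sum(U[i][j] * x[j] for j in range(d))
                  for i in range(m))
print(quad, lin)
```

The two anti-diagonal blocks each contribute $\mathbf{v}^T \mathbf{U} \boldsymbol{x}$, so the factor $\frac{1}{2}$ cancels and the quadratic form reproduces the linear network exactly.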

Appendix N Experimental settings and additional results

N.1 Verification of non-linear training dynamics of NQMs, i.e., Figure 3

We train the NQM that approximates a two-layer fully-connected neural network with ReLU activation on 128 data points, where each input is drawn i.i.d. from $\mathcal{N}(-2, 1)$ if the label is $-1$ or from $\mathcal{N}(2, 1)$ if the label is $1$. The network width is 5,000.

N.2 Experiments on the training dynamics of wide neural networks with multiple examples

We train a two-layer fully-connected neural network with ReLU activation function on 128 data points where each input is drawn i.i.d. from 𝒩 ⁢ ( − 2 , 1 ) if the label is − 1 or 𝒩 ⁢ ( 2 , 1 ) if the label is 1 . The network width is 5 , 000 . See the results in Figure 5.
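The synthetic dataset described above can be generated in a few lines. The sketch below assumes scalar inputs and a balanced class split, neither of which the text states explicitly; only the two Gaussians and the labels are specified.

```python
# Sketch of the synthetic dataset above: 128 inputs drawn i.i.d. from
# N(-2, 1) with label -1 or N(2, 1) with label +1. Scalar inputs and a
# balanced 64/64 split are assumptions; the text does not specify them.
import random

def make_dataset(n=128, seed=0):
    rng = random.Random(seed)
    data = []
    for k in range(n):
        label = -1 if k < n // 2 else 1      # assumed balanced split
        mean = -2.0 if label == -1 else 2.0
        data.append((rng.gauss(mean, 1.0), label))
    rng.shuffle(data)
    return data

train = make_dataset()
print(len(train), train[0])
```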

Figure 5: Training dynamics of wide neural networks in the multiple-example case with different learning rates. (a) Training loss. (b) Largest eigenvalue of the tangent kernel. (c) Second largest eigenvalue of the tangent kernel. Compared to the training dynamics of NQMs (Figure 3), the behaviour of the top eigenvalues is almost the same across learning rates: when $\eta < 0.37$, the kernel is nearly constant; when $0.37 < \eta < 0.39$, only $\lambda_1(t)$ decreases; when $0.39 < \eta < \eta_{\max}$, both $\lambda_1(t)$ and $\lambda_2(t)$ decrease. See the experiment setting in Appendix N.2.

N.3 Training dynamics confined to the top eigenspace of the tangent kernel

We consider the corresponding dynamics equations (15) and (16) for neural networks:

$\mathbf{f}(t+1) - \mathbf{y} = \big(I - \eta K(t) + R_{\mathbf{f}}(t)\big)\big(\mathbf{f}(t) - \mathbf{y}\big),$   (28)

$K(t+1) = K(t) - R_K(t).$   (29)

Note that for NQMs, $R_{\mathbf{f}}(t)$ and $R_K(t)$ have closed-form expressions, but for general neural networks they do not.

We consider the training dynamics confined to the top eigenvector of the tangent kernel 𝒑 1 ⁢ ( 𝑡 ) :

$\langle \boldsymbol{p}_1(t), \mathbf{f}(t+1) - \mathbf{y} \rangle = \big(1 - \eta\lambda_1(t) + \boldsymbol{p}_1(t)^T R_{\mathbf{f}}(t)\, \boldsymbol{p}_1(t)\big)\, \langle \boldsymbol{p}_1(t), \mathbf{f}(t) - \mathbf{y} \rangle,$

$\boldsymbol{p}_1(t)^T K(t+1)\, \boldsymbol{p}_1(t) = \lambda_1(t) - \boldsymbol{p}_1(t)^T R_K(t)\, \boldsymbol{p}_1(t).$

We conduct experiments to show that 𝒑 1 ⁢ ( 𝑡 ) 𝑇 ⁢ 𝑅 𝐟 ⁢ ( 𝑡 ) ⁢ 𝒑 1 ⁢ ( 𝑡 ) and 𝒑 1 ⁢ ( 𝑡 ) 𝑇 ⁢ 𝑅 𝐾 ⁢ ( 𝑡 ) ⁢ 𝒑 1 ⁢ ( 𝑡 ) scale with the loss and remain positive when the loss is large. Furthermore, the loss confined to 𝒑 1 can almost capture the spike in the training loss.
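These confined quantities can be measured from quantities logged during training: the top eigenpair of $K(t)$ via power iteration, after which $\boldsymbol{p}_1^T R_K \boldsymbol{p}_1$ and $\boldsymbol{p}_1^T R_{\mathbf{f}} \boldsymbol{p}_1$ are read off from the two confined equations. The small matrices and residuals in the sketch below are hypothetical stand-ins for logged values, not data from the paper.

```python
# Measure the two confined quantities from consecutive GD steps: p1 is
# obtained by power iteration on K(t); R_K and R_f are read off as
# deviations from the linearized update. The matrices and residuals
# below are hypothetical stand-ins for quantities logged in training.

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def top_eigpair(M, iters=200):
    """Power iteration for the top eigenpair of a symmetric PSD matrix."""
    v = [1.0] * len(M)
    for _ in range(iters):
        w = matvec(M, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(vi * mi for vi, mi in zip(v, matvec(M, v)))
    return lam, v

# Hypothetical logged quantities at steps t and t+1 (residual r = f - y).
K_t  = [[3.0, 0.5, 0.2], [0.5, 2.0, 0.1], [0.2, 0.1, 1.0]]
K_t1 = [[2.6, 0.4, 0.2], [0.4, 1.9, 0.1], [0.2, 0.1, 1.0]]
r_t, r_t1 = [1.0, -0.5, 0.25], [-1.1, 0.4, -0.2]
eta = 1.0

lam1, p1 = top_eigpair(K_t)
proj = lambda r: sum(p * x for p, x in zip(p1, r))
# p1^T R_K p1 = lambda_1(t) - p1^T K(t+1) p1
rK = lam1 - sum(pi * ki for pi, ki in zip(p1, matvec(K_t1, p1)))
# p1^T R_f p1 from <p1, r(t+1)> = (1 - eta*lam1 + p1^T R_f p1) <p1, r(t)>
rF = proj(r_t1) / proj(r_t) - (1.0 - eta * lam1)
print(lam1, rK, rF)
```

With these stand-in values, both measured quantities come out positive, consistent with the behaviour reported for the large-loss phase.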

In the experiments, we train a two-layer FC and a CNN, with widths 2048 and 1024 respectively, on 128 points from CIFAR-2 (a 2-class subset of CIFAR-10) and SVHN-2 (a 2-class subset of SVHN). The results for NQMs are shown in Figure 6 and those for neural networks in Figure 7.

Figure 6: Training dynamics confined to the top eigenspace of the tangent kernel for NQMs. (a) FC on CIFAR-2. (b) CNN on CIFAR-2. (c) FC on SVHN-2. (d) CNN on SVHN-2.

Figure 7: Training dynamics confined to the top eigenspace of the tangent kernel for wide neural networks. (a) FC on CIFAR-2. (b) CNN on CIFAR-2. (c) FC on SVHN-2. (d) CNN on SVHN-2.

N.4 Training dynamics of general quadratic models and neural networks

As discussed at the end of Section 3, a more general quadratic model can exhibit the catapult phase phenomenon. Specifically, we consider a general quadratic model:

$g(\mathbf{w}; \boldsymbol{x}) = \mathbf{w}^T \phi(\boldsymbol{x}) + \frac{1}{2}\gamma\, \mathbf{w}^T \Sigma(\boldsymbol{x})\, \mathbf{w}.$

We train the general quadratic model with different learning rates and different values of $\gamma$ to see how the catapult phase phenomenon depends on these two factors. For comparison, we run the same experiments for neural networks. The experimental settings are as follows:

General quadratic models.

We set the input dimension $d = 100$. We let the feature vector $\phi(\boldsymbol{x}) = \boldsymbol{x}/\|\boldsymbol{x}\|$, where $x_i \sim \mathcal{N}(0, 1)$ i.i.d. for each $i \in [d]$. We let $\Sigma$ be a diagonal matrix with each $\Sigma_{i,i}$ drawn from $\{-1, 1\}$ uniformly and independently. The weight parameters $\mathbf{w}$ are initialized from $\mathcal{N}(0, I_d)$. Unless stated otherwise, $\gamma = 10^{-3}$ and the learning rate is $2.8$.
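The setup above can be written out directly. The sketch below builds $\phi(\boldsymbol{x})$ and the diagonal $\Sigma$, checks the gradient used in the GD update by finite differences, and confirms that the initial tangent kernel $\lambda(0) = \|\nabla_{\mathbf{w}} g\|^2$ is close to $1$, so a learning rate of $2.8$ falls in the super-critical range $(2/\lambda(0), 4/\lambda(0))$. The random seed and the probed coordinate are arbitrary choices, not from the paper.

```python
# Build the general quadratic model g(w; x) = w^T phi(x) + (1/2) gamma w^T Sigma(x) w
# with diagonal Sigma_ii in {-1, +1}, following the settings above.
# The random seed and the probed coordinate are arbitrary choices.
import random

d, gamma = 100, 1e-3
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
norm_x = sum(c * c for c in x) ** 0.5
phi = [c / norm_x for c in x]                        # phi(x) = x / ||x||
sigma = [rng.choice([-1.0, 1.0]) for _ in range(d)]  # diagonal of Sigma
w = [rng.gauss(0.0, 1.0) for _ in range(d)]          # w ~ N(0, I_d)

def g(w):
    lin = sum(wi * pi for wi, pi in zip(w, phi))
    quad = 0.5 * gamma * sum(si * wi * wi for si, wi in zip(sigma, w))
    return lin + quad

def grad_g(w):
    # dg/dw_i = phi_i + gamma * Sigma_ii * w_i; GD uses w <- w - eta*(g - y)*grad
    return [pi + gamma * si * wi for pi, si, wi in zip(phi, sigma, w)]

# Finite-difference check of one gradient coordinate
eps, i = 1e-6, 7
w_eps = w[:]
w_eps[i] += eps
fd = (g(w_eps) - g(w)) / eps

# Initial tangent kernel lambda(0) = ||grad g||^2 is close to 1, so the
# learning rate 2.8 lies in the super-critical range (2/lambda, 4/lambda).
lam0 = sum(gi * gi for gi in grad_g(w))
print(fd, grad_g(w)[i], lam0)
```

Because $\|\phi\| = 1$ and $\gamma$ is small, $\lambda(0) \approx 1$, which is what makes $\eta = 2.8$ super-critical in these experiments.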

Neural networks.

We train a two-layer fully-connected neural network with ReLU activation on 20 data points from CIFAR-2. Unless stated otherwise, the network width is $10^4$ and the learning rate is $2.8$.

See the results in Figure 8.

Figure 8: General quadratic models have training dynamics similar to those of neural networks when trained with super-critical learning rates. (a) Loss (log scaled) vs. $\gamma$/width. (b) Tangent kernel norm vs. $\gamma$/width. (c) Loss vs. learning rate. (d) Tangent kernel norm vs. learning rate. Panel (A): experiments on general quadratic models. Smaller $\gamma$ or larger learning rates lead to a larger training loss at the peak, and larger learning rates make the tangent kernel decrease more. Panel (B): experiments on two-layer neural networks. Larger width (corresponding to smaller $\gamma$) and larger learning rates affect the peak training loss and the decrease of the tangent kernel norm in the same way as for quadratic models. Note that the width or $\gamma$ appears to have no effect on the tangent kernel norm at convergence.

N.5 Test performance of $f$, $f_{\text{lin}}$ and $f_{\text{quad}}$, i.e., Figure 2(b) and Figure 4

For the two-layer fully-connected and two-layer convolutional architectures, we set the width to 5,000 and 1,000 respectively. Specific to Figure 2(b), we use a two-layer fully-connected neural network.

Due to the large number of parameters in NQMs, we use small subsets of all the datasets. We use the first class (airplanes) and third class (birds) of CIFAR-10, which we call CIFAR-2, and select 256 data points from it as the training set. We use digits 0 and 2 of SVHN and select 256 data points as the training set. We select 128, 256 and 128 data points from MNIST, FSDD and AG NEWS respectively as the training sets. The test set size is 2,000 in all cases. When using SGD, we choose a batch size of 32.

For each setting, we report the average result of 5 independent runs.

N.6 Test performance of $f$, $f_{\text{lin}}$ and $f_{\text{quad}}$ in terms of accuracy

In this section, we report the best test accuracy for $f$, $f_{\text{lin}}$ and $f_{\text{quad}}$, corresponding to the best test loss in Figure 4. We use the same settings as in Appendix N.5.

Figure 9: Best test accuracy plotted against different learning rates for $f_{\text{quad}}$, $f$, and $f_{\text{lin}}$. Left panel: 2-layer FC on MNIST trained with GD. Right panel: 2-layer FC on AG NEWS trained with GD.

Figure 10: Best test accuracy plotted against different learning rates for $f_{\text{quad}}$, $f$, and $f_{\text{lin}}$. Left panel: 2-layer FC on CIFAR-2 trained with SGD. Right panel: 2-layer CNN on SVHN trained with GD.

Figure 11: Best test accuracy plotted against different learning rates for $f_{\text{quad}}$, $f$, and $f_{\text{lin}}$. Left panel: 2-layer FC on FSDD trained with GD. Right panel: 2-layer CNN on CIFAR-2 trained with GD.

N.7 Test performance of $f$, $f_{\text{lin}}$ and $f_{\text{quad}}$ with the architecture of a 3-layer FC

In this section, we extend the results for shallow neural networks discussed in Section 4 to 3-layer fully-connected neural networks. As before, we compare the test performance of the three models $f$, $f_{\text{lin}}$ and $f_{\text{quad}}$ while varying the learning rate. We observe the same phenomenon for 3-layer ReLU-activated FCs as for shallow neural networks. See Figures 12 and 13.

We use the first class (airplanes) and third class (birds) of CIFAR-10, which we call CIFAR-2, and select 100 data points from it as the training set. We use digits 0 and 2 of SVHN and select 100 data points as the training set. We select 100 data points from the AG NEWS dataset as the training set. For the speech dataset FSDD, we select 100 data points from classes 1 and 3 as the training set. The test set size is 500 in all cases.

For each setting, we report the average result of 5 independent runs.

Figure 12: Best test accuracy plotted against different learning rates for $f_{\text{quad}}$, $f$, and $f_{\text{lin}}$. Left panel: 3-layer FC on CIFAR-2 trained with GD. Right panel: 3-layer FC on SVHN-2 trained with GD.

Figure 13: Best test accuracy plotted against different learning rates for $f_{\text{quad}}$, $f$, and $f_{\text{lin}}$. Left panel: 3-layer FC on FSDD-2 trained with GD. Right panel: 3-layer FC on AG NEWS trained with GD.

N.8 Test performance with Tanh and Swish activation functions

We replace ReLU with Tanh and Swish activation functions and train the models with the same settings as in Figure 4. We observe the same phenomenon as described in Section 4.

Figure 14: Best test loss plotted against different learning rates for $f_{\text{quad}}$, $f$, and $f_{\text{lin}}$. We choose a 2-layer FC as the architecture and train the models on AG NEWS with GD. (a) Swish activation function. (b) Tanh activation function.

Figure 15: Best test loss plotted against different learning rates for $f_{\text{quad}}$, $f$, and $f_{\text{lin}}$. We choose a 2-layer FC as the architecture and train the models on FSDD with GD. (a) Swish activation function. (b) Tanh activation function.

Figure 16: Best test loss plotted against different learning rates for $f_{\text{quad}}$, $f$, and $f_{\text{lin}}$. We choose a 2-layer CNN as the architecture and train the models on CIFAR-2 with GD. (a) Swish activation function. (b) Tanh activation function.

Figure 17: Best test loss plotted against different learning rates for $f_{\text{quad}}$, $f$, and $f_{\text{lin}}$. We choose a 2-layer CNN as the architecture and train the models on SVHN with GD. (a) Swish activation function. (b) Tanh activation function.
