Title: The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing

URL Source: https://arxiv.org/html/2302.01186


License: CC BY 4.0. arXiv:2302.01186v4 [cs.LG] 30 Dec 2025

The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing

Xingyu Xu (Carnegie Mellon University; xingyuxu@andrew.cmu.edu), Yandi Shen (Carnegie Mellon University; yandis@andrew.cmu.edu), Yuejie Chi (Yale University; yuejie.chi@yale.edu), Cong Ma (University of Chicago; congm@uchicago.edu)

(February 2023; Revised December 2025)

Abstract

We propose ScaledGD( πœ† ), a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparameterized factor representations, ScaledGD( πœ† ) starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. ScaledGD( πœ† ) is remarkably robust to ill-conditioning compared to vanilla gradient descent (GD), even with overparameterization. Specifically, we show that, under the restricted isometry property (RIP) of the sensing operator, ScaledGD( πœ† ) converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla GD, which suffers from a polynomial dependency on the condition number. Furthermore, we show that in the presence of measurement noise, ScaledGD( πœ† ) converges, at the same rate as in the noiseless setting, to the minimax-optimal error up to a multiplicative factor of the condition number; this makes it the first nearly minimax-optimal overparameterized gradient method for low-rank matrix sensing whose guarantees scale with the true rank rather than the (possibly much larger) overparameterized rank. Our results also extend to the setting where the matrix is only approximately low-rank under the Gaussian design. Our work provides evidence of the power of preconditioning in accelerating convergence without hurting generalization in overparameterized learning.

Keywords: low-rank matrix sensing, overparameterization, preconditioned gradient descent method, random initialization, ill-conditioning

Contents: 1 Introduction Β· 2 Problem formulation Β· 3 Main results Β· 4 Analysis Β· 5 Numerical experiments Β· 6 Discussions

1 Introduction

Low-rank matrix recovery plays an essential role in modern machine learning and signal processing. To fix ideas, let us consider estimating a rank- π‘Ÿ ⋆ positive semidefinite matrix 𝑀 ⋆ ∈ ℝ 𝑛 Γ— 𝑛 based on a few linear measurements 𝑦 ≔ π’œ ​ ( 𝑀 ⋆ ) , where π’œ : ℝ 𝑛 Γ— 𝑛 β†’ ℝ π‘š models the measurement process. Significant research efforts have been devoted to tackling low-rank matrix recovery in a statistically and computationally efficient manner in recent years. Perhaps the most well-known method is convex relaxation (candes2011tight; recht2010guaranteed; davenport2016overview), which seeks the matrix with lowest nuclear norm to fit the observed measurements:

$\min_{M \succeq 0} \; \|M\|_* \quad \text{s.t.} \quad y = \mathcal{A}(M).$

While statistically optimal, convex relaxation is prohibitive in terms of both computation and memory as it directly operates in the ambient matrix domain, i.e., ℝ 𝑛 Γ— 𝑛 . To address this challenge, nonconvex approaches based on low-rank factorization have been proposed (burer2005LRSDP):

$\min_{X \in \mathbb{R}^{n \times r}} \; \frac{1}{4}\,\|\mathcal{A}(XX^\top) - y\|_2^2, \qquad (1)$

where π‘Ÿ is a user-specified rank parameter. Despite nonconvexity, when the rank is correctly specified, i.e., when π‘Ÿ = π‘Ÿ ⋆ , the problem (1) admits computationally efficient solvers (chi2019nonconvex), e.g., gradient descent (GD) with spectral initialization or with small random initialization. However, three main challenges remain when applying the factorization-based nonconvex approach (1) in practice.

- Unknown rank. First, the true rank π‘Ÿ ⋆ is often unknown, which makes it infeasible to set π‘Ÿ = π‘Ÿ ⋆ . One necessarily needs to consider an overparameterized setting in which π‘Ÿ is set conservatively, i.e., one sets π‘Ÿ β‰₯ π‘Ÿ ⋆ or even π‘Ÿ = 𝑛 .

- Poor conditioning. Second, the ground truth matrix 𝑀 ⋆ may be ill-conditioned, which is commonly encountered in practice. Existing approaches such as gradient descent are still computationally expensive in such settings as the number of iterations necessary for convergence increases with the condition number.

- Robustness to noise and approximate low-rankness. Last but not least, it is desirable that the performance is robust when the measurement 𝑦 is contaminated by noise and when 𝑀 ⋆ is approximately low-rank.

In light of these three challenges, the main goal of this work is to address the following question:

Can one develop an efficient and robust method for solving ill-conditioned matrix recovery in the overparameterized setting?

| parameterization | reference | algorithm | init. | iteration complexity |
| --- | --- | --- | --- | --- |
| π‘Ÿ > π‘Ÿ ⋆ | stoger2021small | GD | random | πœ…^8 + πœ…^6 log(πœ…π‘›/πœ€) |
| π‘Ÿ > π‘Ÿ ⋆ | zhang2021preconditioned | PrecGD | spectral | log(1/πœ€) |
| π‘Ÿ > π‘Ÿ ⋆ | Theorem 2 | ScaledGD( πœ† ) | random | log πœ… Β· log(πœ…π‘›) + log(1/πœ€) |
| π‘Ÿ = π‘Ÿ ⋆ | tong2021accelerating | ScaledGD | spectral | log(1/πœ€) |
| π‘Ÿ = π‘Ÿ ⋆ | stoger2021small | GD | random | πœ…^8 log(πœ…π‘›) + πœ…^2 log(1/πœ€) |
| π‘Ÿ = π‘Ÿ ⋆ | Theorem 3 | ScaledGD( πœ† ) | random | log πœ… Β· log(πœ…π‘›) + log(1/πœ€) |

Table 1: Comparison of iteration complexity with existing algorithms for low-rank matrix sensing under Gaussian designs. Here, 𝑛 is the matrix dimension, π‘Ÿ ⋆ is the true rank, π‘Ÿ is the overparameterized rank, and πœ… is the condition number of the problem instance (see Section 2 for a formal problem formulation). It is important to note that in the overparameterized setting ( π‘Ÿ > π‘Ÿ ⋆ ), the sample complexity of zhang2021preconditioned scales polynomially with the overparameterized rank π‘Ÿ , while those of stoger2021small and ours scale polynomially only with the true rank π‘Ÿ ⋆ .

1.1 Our contributions: a preview

The main contribution of the current paper is to answer the question affirmatively by developing a preconditioned gradient descent method (ScaledGD( πœ† )) that converges to the (possibly ill-conditioned) low-rank matrix in a fast and global manner, even with an overparameterized rank π‘Ÿ β‰₯ π‘Ÿ ⋆ .

Theorem 1 (Informal).

Under overparameterization π‘Ÿ β‰₯ π‘Ÿ ⋆ and mild statistical assumptions, ScaledGD( πœ† ), starting from a sufficiently small random initialization and with a sample complexity depending polynomially on the true rank π‘Ÿ ⋆ , achieves relative πœ€ -accuracy, i.e., β€– 𝑋 𝑇 𝑋 𝑇 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯ ≀ πœ€ β€– 𝑀 ⋆ β€– , with no more than an order of

log ⁑ πœ… β‹… log ⁑ ( πœ… ​ 𝑛 ) + log ⁑ ( 1 / πœ€ )

iterations, where πœ… is the condition number of the problem. Moreover, in the presence of per-entry Gaussian measurement noise 𝒩 ​ ( 0 , 𝜎 2 ) , ScaledGD( πœ† ) converges to the nearly minimax-optimal error

$\|X_T X_T^\top - M_\star\|_{\mathsf{F}} \lesssim \kappa^4\, \sigma \sqrt{n r_\star}$

with the same rate as above.

The above theorem suggests that from a small random initialization, ScaledGD( πœ† ) converges at a constant linear rateβ€”independent of the condition numberβ€”after a small logarithmic number of iterations. Overall, the iteration complexity is nearly independent of the condition number and the problem dimension, making it extremely suitable for solving large-scale and ill-conditioned problems. To the best of our knowledge, ScaledGD( πœ† ) is the first provably minimax-optimal overparameterized gradient method for low-rank matrix sensing, where both the sample complexity and the error bound depend on the true rank π‘Ÿ ⋆ . In contrast, prior error bounds for nonconvex gradient methods zhang2024fast; zhuo2021computational scale with the overparameterized rank π‘Ÿ , which can be significantly larger. Our results also extend to the setting when the matrix 𝑀 ⋆ is only approximately low-rank under the Gaussian design, which is new. See Table 1 for a summary of comparisons with prior art in the noiseless setting.

Our algorithm ScaledGD( πœ† ) is closely related to scaled gradient descent (ScaledGD) (tong2021accelerating), a recently proposed preconditioned gradient descent method that achieves a πœ… -independent convergence rate under spectral initialization and exact parameterization. We modify the preconditioner design by introducing a fixed damping term, which prevents the preconditioner itself from being ill-conditioned due to overparameterization; the modified preconditioner preserves the low computational overhead when the overparameterization is moderate. In the exact parameterization setting, our result extends ScaledGD beyond local convergence by characterizing the number of iterations it takes to enter the local basin of attraction from a small random initialization.

Moreover, our results shed light on the power of preconditioning in accelerating optimization over vanilla GD while still guaranteeing generalization in overparameterized learning models (amari2020does). Remarkably, despite the existence of infinitely many global minima in the landscape of (1) that do not generalize, i.e., that do not correspond to the ground truth, GD starting from a small random initialization (li2018algorithmic; stoger2021small) is known to converge to a generalizable solution without explicit regularization. However, GD takes 𝑂 ( πœ…^8 + πœ…^6 log ( πœ… 𝑛 / πœ€ ) ) iterations to reach πœ€ -accuracy, which is prohibitive even for moderate condition numbers. On the other hand, while common wisdom suggests that preconditioning accelerates convergence, it has been unclear whether a preconditioned method still converges to a generalizable global minimum. Our work answers this question in the affirmative for overparameterized low-rank matrix sensing: ScaledGD( πœ† ) significantly accelerates convergence in the face of poor conditioning, both in the initial phase and in the local phase, without hurting generalization, as corroborated in Figure 1.

Figure 1: Comparison between ScaledGD( πœ† ) and GD. The learning rate of GD has been fine-tuned to achieve the fastest convergence for each πœ… , while that of ScaledGD( πœ† ) is fixed to 0.3. The initialization scale 𝛼 in each case has been fine-tuned so that the final accuracy is 10^{-9}. The details of the experiment are deferred to Section 5.

1.2 Related work

Significant efforts have been devoted to understanding nonconvex optimization for low-rank matrix estimation in recent years; see chi2019nonconvex and chen2018harnessing for recent overviews. By reparameterizing the low-rank matrix into a product of factor matrices, also known as the Burer-Monteiro factorization (burer2005LRSDP), the focus has been on examining whether the factor matrices can be recovered faithfully, up to invertible transformations, using simple iterative algorithms in a provably efficient manner. However, the majority of prior efforts share two limitations: they assume an exact parameterization where the rank of the ground truth is given or estimated reliably, and they rely on a carefully constructed initialization (e.g., using the spectral method (chen2021spectral)) in order to guarantee global convergence in polynomial time. The analyses adopted in the exact parameterization case fail to generalize when overparameterization is present, and drastically new approaches are called for.

Overparameterization in low-rank matrix sensing.

li2018algorithmic made a theoretical breakthrough by showing that gradient descent converges globally to any prescribed accuracy even in the presence of full overparameterization ( π‘Ÿ = 𝑛 ) with a small random initialization; their analyses were subsequently adapted and extended in stoger2021small and zhuo2021computational. ding2021rank investigated robust low-rank matrix recovery with overparameterization from a spectral initialization, and ma2022global examined the same problem from a small random initialization with noisy measurements. zhang2021preconditioned; zhang2022preconditioned developed a preconditioned gradient descent method for overparameterized low-rank matrix sensing, where an adaptive damping parameter is introduced in ScaledGD. A variant with global convergence guarantee is studied in zhang2022preconditioned, which requires adding perturbation at the initial stage to first converge to a second-order stationary point before switching to a fast local convergence. Last but not least, a number of other notable works that study overparameterized low-rank models include, but are not limited to, soltanolkotabi2018theoretical; geyer2020low; oymak2019overparameterized; zhang2021sharp; zhang2022improved.

Global convergence from random initialization without overparameterization.

Despite nonconvexity, it has been established recently that several structured learning models admit global convergence via simple iterative methods when initialized randomly, even without overparameterization. For example, chen2019gradient showed that phase retrieval converges globally from a random initialization using a near-minimal number of samples through a delicate leave-one-out analysis. In addition, the efficiency of randomly initialized GD is established for complete dictionary learning (gilboa2019efficient; bai2018subgradient), multi-channel sparse blind deconvolution (qu2019nonconvex; shi2021manifold), asymmetric low-rank matrix factorization (ye2021global), and rank-one matrix completion (kim2022rank). Moving beyond GD, lee2022randomly showed that randomly initialized alternating least-squares converges globally for rank-one matrix sensing, whereas chandrasekher2022alternating developed sharp recovery guarantees of alternating minimization for generalized rank-one matrix sensing with sample-splitting and random initialization.

Algorithmic or implicit regularization.

Our work is related to the phenomenon of algorithmic or implicit regularization (gunasekar2017implicit), where the trajectory of simple iterative algorithms follows a path that maintains desirable properties without explicit regularization. Along this line, ma2017implicit; chen2019nonconvex; li2021nonconvex highlighted the implicit regularization of GD for several statistical estimation tasks, and ma2021beyond showed that GD automatically balances the factor matrices in asymmetric low-rank matrix sensing, while jiang2022algorithmic analyzed the algorithmic regularization in overparameterized asymmetric matrix factorization in a model-free setting.

2 Problem formulation

Section 2.1 introduces the problem of low-rank matrix sensing, and Section 2.2 provides background on the proposed ScaledGD( πœ† ) algorithm developed for the possibly overparameterized case.

2.1 Model and assumptions

Suppose that the ground truth 𝑀 ⋆ ∈ ℝ 𝑛 Γ— 𝑛 is a positive-semidefinite (PSD) matrix of rank π‘Ÿ ⋆ β‰ͺ 𝑛 , whose (compact) eigendecomposition is given by

$M_\star = U_\star\, \Sigma_\star^2\, U_\star^\top.$

Here, the columns of π‘ˆ ⋆ ∈ ℝ 𝑛 Γ— π‘Ÿ ⋆ specify the set of eigenvectors, and Ξ£ ⋆ ∈ ℝ π‘Ÿ ⋆ Γ— π‘Ÿ ⋆ is a diagonal matrix where the diagonal entries are ordered in a non-increasing fashion. Setting 𝑋 ⋆ ≔ π‘ˆ ⋆ ​ Ξ£ ⋆ ∈ ℝ 𝑛 Γ— π‘Ÿ ⋆ , we can rewrite 𝑀 ⋆ as

$M_\star = X_\star X_\star^\top. \qquad (2)$

We call 𝑋 ⋆ the ground truth low-rank factor matrix, whose condition number πœ… is defined as

πœ… ≔ 𝜎 max ​ ( 𝑋 ⋆ ) 𝜎 min ​ ( 𝑋 ⋆ ) .

(3)

Here we recall that 𝜎 max ​ ( 𝑋 ⋆ ) and 𝜎 min ​ ( 𝑋 ⋆ ) are the largest and the smallest singular values of 𝑋 ⋆ , respectively.

Instead of having access to 𝑀 ⋆ directly, we wish to recover 𝑀 ⋆ from a set of random linear measurements π’œ ​ ( 𝑀 ⋆ ) , where π’œ : Sym 2 ⁑ ( ℝ 𝑛 ) β†’ ℝ π‘š is a linear map from the space of 𝑛 Γ— 𝑛 symmetric matrices to ℝ π‘š , namely

$y = \mathcal{A}(M_\star), \qquad (4)$

or equivalently,

$y_i = \langle A_i, M_\star \rangle, \quad 1 \le i \le m.$

We are interested in recovering 𝑀 ⋆ based on the measurements 𝑦 and the sensing operator π’œ in a provably efficient manner, even when the true rank π‘Ÿ ⋆ is unknown.

2.2 ScaledGD( πœ† ) for overparameterized low-rank matrix sensing

Inspired by the factorized representation (2), we aim to recover the low-rank matrix 𝑀 ⋆ by solving the following optimization problem (burer2005LRSDP):

$\min_{X \in \mathbb{R}^{n \times r}} \; f(X) := \frac{1}{4}\,\|\mathcal{A}(XX^\top) - y\|_2^2, \qquad (5)$

where π‘Ÿ is a predetermined rank parameter, possibly different from π‘Ÿ ⋆ . It is evident that for any rotation matrix 𝑂 ∈ π’ͺ π‘Ÿ , it holds that 𝑓 ​ ( 𝑋 )

𝑓 ​ ( 𝑋 ​ 𝑂 ) , leading to an infinite number of global minima of the loss function  𝑓 .

A prelude: exact parameterization.

When π‘Ÿ is set to be the true rank π‘Ÿ ⋆ of 𝑀 ⋆ , tong2021accelerating set forth a provable algorithmic approach called scaled gradient descent (ScaledGD)β€”gradient descent with a specific form of preconditioningβ€”that adopts the following update rule

$\mathsf{ScaledGD}: \quad X_{t+1} = X_t - \eta\, \nabla f(X_t)\, (X_t^\top X_t)^{-1} = X_t - \eta\, \mathcal{A}^*\mathcal{A}(X_t X_t^\top - M_\star)\, X_t\, (X_t^\top X_t)^{-1}. \qquad (6)$

Here, 𝑋 𝑑 is the 𝑑 -th iterate, βˆ‡ 𝑓 ( 𝑋 𝑑 ) is the gradient of 𝑓 at 𝑋 = 𝑋 𝑑 , and πœ‚ > 0 is the learning rate. Moreover, π’œ* : ℝ^π‘š β†’ Sym β‚‚ ( ℝ^𝑛 ) is the adjoint operator of π’œ , that is, $\mathcal{A}^*(y) = \sum_{i=1}^{m} y_i A_i$ for $y \in \mathbb{R}^m$.

At the expense of light computational overhead, ScaledGD is remarkably robust to ill-conditioning compared with vanilla gradient descent (GD). It is established in tong2021accelerating that ScaledGD, when starting from spectral initialization, converges linearly at a constant rate, independent of the condition number πœ… of 𝑋 ⋆ (cf. (3)); in contrast, the iteration complexity of GD (tu2015low; zheng2015convergent) scales on the order of πœ…^2 from the same initialization. GD therefore becomes exceedingly slow when the problem instance is even moderately ill-conditioned, a scenario quite commonly encountered in practice.

ScaledGD( πœ† ): overparametrization under unknown rank.

In this paper, we are interested in the so-called overparameterization regime, where π‘Ÿ ⋆ ≀ π‘Ÿ ≀ 𝑛 . From an operational perspective, the true rank π‘Ÿ ⋆ is related to model order, e.g., the number of sources or targets in a scene of interest, which is often unavailable and makes it necessary to consider the misspecified setting. Unfortunately, in the presence of overparameterization, the original ScaledGD algorithm is no longer appropriate, as the preconditioner ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 ) βˆ’ 1 might become numerically unstable to calculate. Therefore, we propose a new variant of ScaledGD by adjusting the preconditioner as

𝖲𝖼𝖺𝗅𝖾𝖽𝖦𝖣 ( πœ† ) : 𝑋 𝑑 + 1

𝑋 𝑑 βˆ’ πœ‚ ​ βˆ‡ 𝑓 ​ ( 𝑋 𝑑 ) ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 ,

(7)

= 𝑋 𝑑 βˆ’ πœ‚ ​ π’œ βˆ— ​ π’œ ​ ( 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ ) ​ 𝑋 𝑑 ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 ,

where πœ† > 0 is a fixed damping parameter. The new algorithm is dubbed as ScaledGD( πœ† ), and it recovers the original ScaledGD when πœ†

0 . Similar to ScaledGD, a key property of ScaledGD( πœ† ) is that the iterates { 𝑋 𝑑 } are equivariant with respect to the parameterization of the factor matrix. Specifically, taking a rotationally equivalent factor 𝑋 𝑑 ​ 𝑂 with an arbitrary 𝑂 ∈ π’ͺ π‘Ÿ , and feeding it into the update rule (7), the next iterate

𝑋 𝑑 ​ 𝑂 βˆ’ πœ‚ ​ π’œ βˆ— ​ π’œ ​ ( 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ ) ​ 𝑋 𝑑 ​ 𝑂 ​ ( 𝑂 ⊀ ​ 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 ​ 𝑂 + πœ† ​ 𝐼 ) βˆ’ 1

𝑋 𝑑 + 1 ​ 𝑂

is rotated simultaneously by the same rotation matrix 𝑂 . In other words, the recovered matrix sequence 𝑀 𝑑

𝑋 𝑑 ​ 𝑋 𝑑 ⊀ is invariant with respect to the parameterization of the factor matrix.
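To make the update (7) concrete, here is a minimal NumPy sketch of one ScaledGD( πœ† ) iteration; the stacked array `A` of symmetric sensing matrices and the function name are our illustrative choices, not notation from the paper.

```python
import numpy as np

def scaled_gd_lam_step(X, A, y, eta, lam):
    """One ScaledGD(lambda) update (7). A is an (m, n, n) stack of symmetric
    sensing matrices A_i; y holds the m measurements."""
    r = X.shape[1]
    residual = np.einsum('kij,ij->k', A, X @ X.T) - y   # A(X X^T) - y
    grad = np.einsum('k,kij->ij', residual, A) @ X      # gradient of f at X
    # Damped preconditioner: (X^T X + lambda I) stays invertible and
    # well-conditioned even when X is rank-deficient due to overparameterization.
    return X - eta * grad @ np.linalg.inv(X.T @ X + lam * np.eye(r))
```

Setting `lam = 0` recovers the original ScaledGD update (6). A quick numerical check of the equivariance property discussed above is to feed `X @ O` into the step and verify that the output equals the original output rotated by the same `O`.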

Remark 1.

We note that a related variant of ScaledGD, called 𝖯𝗋𝖾𝖼𝖦𝖣 , has been proposed recently in zhang2021preconditioned; zhang2022preconditioned for the overparameterized setting, which follows the update rule

$\mathsf{PrecGD}: \quad X_{t+1} = X_t - \eta\, \mathcal{A}^*\mathcal{A}(X_t X_t^\top - M_\star)\, X_t\, (X_t^\top X_t + \lambda_t I)^{-1}, \qquad (8)$

where the damping parameters πœ† 𝑑 are selected in an iteration-varying manner based on the current loss 𝑓 ( 𝑋 𝑑 ) . In contrast, ScaledGD( πœ† ) assumes a fixed damping parameter πœ† throughout the iterations. We defer more detailed comparisons with PrecGD to Section 3.

3 Main results

Before formally presenting our theorems, let us introduce several key assumptions that will be in effect throughout this paper.

Restricted Isometry Property.

A key property of the operator π’œ ​ ( β‹… ) is the celebrated Restricted Isometry Property (RIP) (recht2010guaranteed), which says that the operator π’œ ​ ( β‹… ) approximately preserves the distances between low-rank matrices. The formal definition is given as follows.

Definition 1 (Restricted Isometry Property).

The linear map π’œ ​ ( β‹… ) is said to obey rank- π‘Ÿ RIP with a constant 𝛿 π‘Ÿ ∈ [ 0 , 1 ) , if for all matrices 𝑀 ∈ Sym 2 ⁑ ( ℝ 𝑛 ) of rank at most π‘Ÿ , it holds that

$(1 - \delta_r)\, \|M\|_{\mathsf{F}}^2 \le \|\mathcal{A}(M)\|_2^2 \le (1 + \delta_r)\, \|M\|_{\mathsf{F}}^2. \qquad (9)$

The Restricted Isometry Constant (RIC) is defined to be the smallest positive 𝛿 π‘Ÿ such that (9) holds.

The RIP is a standard assumption in low-rank matrix sensing, which has been verified to hold with high probability for a wide variety of measurement operators. The following lemma establishes the RIP for the Gaussian design.

Lemma 1.

(stoger2024non, Lemma 1) If the sensing operator π’œ ( β‹… ) follows the Gaussian design, i.e., the entries of $\{A_i\}_{i=1}^{m}$ are independent up to symmetry, with diagonal elements sampled from 𝒩 ( 0 , 1 / π‘š ) and off-diagonal elements from 𝒩 ( 0 , 1 / ( 2 π‘š ) ) , then with high probability, π’œ ( β‹… ) satisfies rank- π‘Ÿ RIP with constant 𝛿 π‘Ÿ , as long as π‘š β‰₯ 𝐢 𝑛 π‘Ÿ / 𝛿 π‘Ÿ ^2 for some sufficiently large universal constant 𝐢 > 0 .
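As a sanity check on Lemma 1, the sampling scheme can be sketched in a few lines of NumPy; the helper name `gaussian_design` is ours, and the check below only verifies that $\mathbb{E}\,\|\mathcal{A}(M)\|_2^2 = \|M\|_{\mathsf{F}}^2$ on one fixed low-rank matrix, not the full uniform RIP guarantee.

```python
import numpy as np

def gaussian_design(n, m, rng):
    """Sample the sensing matrices {A_i} of Lemma 1, stacked as (m, n, n):
    symmetric, diagonal entries N(0, 1/m), off-diagonal entries N(0, 1/(2m))."""
    G = rng.normal(size=(m, n, n))
    # (G + G^T)/2 has diagonal variance 1 and off-diagonal variance 1/2;
    # dividing by sqrt(m) yields the variances required by Lemma 1.
    return (G + G.transpose(0, 2, 1)) / (2.0 * np.sqrt(m))
```

For a rank-1 test matrix 𝑀 , the squared measurement norm β€– π’œ ( 𝑀 ) β€–β‚‚Β² then concentrates around β€– 𝑀 β€– π–₯Β² as π‘š grows, which is the mechanism behind the RIP bound (9).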

We make the following assumption about the operator π’œ ​ ( β‹… ) .

Assumption 1.

The operator π’œ ​ ( β‹… ) satisfies the rank- ( π‘Ÿ ⋆ + 1 ) RIP with 𝛿 π‘Ÿ ⋆ + 1 ≕ 𝛿 . Furthermore, there exist a sufficiently small constant 𝑐 𝛿

0 and a sufficiently large constant 𝐢 𝛿

0 such that

𝛿 ≀ 𝑐 𝛿 ​ π‘Ÿ ⋆ βˆ’ 1 / 2 ​ πœ… βˆ’ 𝐢 𝛿 .

(10) Small random initialization.

Similar to li2018algorithmic; stoger2021small, we set the initialization 𝑋 0 to be a small random matrix, i.e.,

$X_0 = \alpha\, G, \qquad (11)$

where 𝐺 ∈ ℝ^{𝑛 Γ— π‘Ÿ} is some matrix considered to be normalized and 𝛼 > 0 controls the magnitude of the initialization. To simplify exposition, we take 𝐺 to be a standard random Gaussian matrix, that is, 𝐺 is a random matrix with i.i.d. entries distributed as 𝒩 ( 0 , 1 / 𝑛 ) .

Choice of parameters.

Last but not least, the parameters of ScaledGD( πœ† ) are selected according to the following assumption.

Assumption 2.

There exist some universal constants 𝑐 πœ‚ , 𝑐 πœ† , 𝐢 𝛼 > 0 such that ( πœ‚ , πœ† , 𝛼 ) in ScaledGD( πœ† ) satisfy the following conditions:

(learning rate) $\eta \le c_\eta$, (12a)

(damping parameter) $\frac{1}{100}\, c_\lambda\, \kappa^{-4}\, \sigma_{\min}^2(X_\star) \le \lambda \le c_\lambda\, \sigma_{\min}^2(X_\star)$, (12b)

(initialization size) $\log \frac{\|X_\star\|}{\alpha} \ge C_\alpha\, \max(\eta, \kappa^{-2})\, \log(2\kappa) \cdot \log(2\kappa n)$. (12c)
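In code, a parameter choice consistent with (12a)-(12c) might look as follows; note that the universal constants c_πœ‚ , c_πœ† , C_𝛼 are left unspecified by the theory, so the defaults below are illustrative guesses, not values from the paper.

```python
import numpy as np

def choose_params(sigma_min_X, kappa, n, c_eta=0.3, c_lam=0.05, C_alpha=5.0):
    """Return (eta, lam, alpha) satisfying conditions (12a)-(12c), given
    sigma_min(X_star), the condition number kappa, and the dimension n.
    The constants c_eta, c_lam, C_alpha are illustrative placeholders."""
    eta = c_eta                                      # (12a): eta <= c_eta
    lam = 0.5 * c_lam * sigma_min_X ** 2             # (12b): inside [c_lam*kappa^-4/100, c_lam] * sigma_min^2
    X_norm = kappa * sigma_min_X                     # ||X_star|| = sigma_max(X_star)
    gap = C_alpha * max(eta, kappa ** -2) * np.log(2 * kappa) * np.log(2 * kappa * n)
    alpha = X_norm * np.exp(-2.0 * gap)              # (12c) holds with a factor-2 margin
    return eta, lam, alpha
```

The factor-2 margin on the exponent simply keeps (12c) satisfied strictly; any smaller 𝛼 also works, at the cost of a longer initial phase.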

We are now in a position to present the main theorems.

3.1 The overparameterization setting

We begin with our main theorem, which characterizes the performance of ScaledGD( πœ† ) with overparameterization.

Theorem 2.

Suppose Assumptions 1 and 2 hold. With high probability (with respect to the realization of the random initialization 𝐺 ), there exists a universal constant 𝐢 min > 0 such that for some $T \le T_{\min} := \frac{C_{\min}}{\eta} \log \frac{\|X_\star\|}{\alpha}$, we have

$\|X_T X_T^\top - M_\star\|_{\mathsf{F}} \le \alpha^{1/3}\, \|X_\star\|^{5/3}.$

In particular, for any prescribed accuracy target πœ€ ∈ ( 0 , 1 ) , by choosing a sufficiently small 𝛼 fulfilling both (12c) and 𝛼 ≀ πœ€^3 β€– 𝑋 ⋆ β€– , we have β€– 𝑋 𝑇 𝑋 𝑇 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯ ≀ πœ€ β€– 𝑀 ⋆ β€– .

A few remarks are in order.

Iteration complexity.

Theorem 2 shows that by choosing an appropriate 𝛼 , ScaledGD( πœ† ) finds an πœ€ -accurate solution, i.e., β€– 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯ ≀ πœ€ ​ β€– 𝑀 ⋆ β€– , in no more than an order of

log ⁑ πœ… β‹… log ⁑ ( πœ… ​ 𝑛 ) + log ⁑ ( 1 / πœ€ )

iterations. Roughly speaking, this asserts that ScaledGD( πœ† ) converges at a constant linear rate after an initial phase of approximately 𝑂 ​ ( log ⁑ πœ… β‹… log ⁑ ( πœ… ​ 𝑛 ) ) iterations. Most notably, the iteration complexity is nearly independent of the condition number πœ… , with a small overhead only through the poly-logarithmic additive term 𝑂 ​ ( log ⁑ πœ… β‹… log ⁑ ( πœ… ​ 𝑛 ) ) . In contrast, GD requires 𝑂 ​ ( πœ… 8 + πœ… 6 ​ log ⁑ ( πœ… ​ 𝑛 / πœ€ ) ) iterations to converge from a small random initialization to πœ€ -accuracy; see stoger2021small; li2018algorithmic. Thus, the convergence of GD is much slower than ScaledGD( πœ† ) even for mildly ill-conditioned matrices.
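This behavior can be reproduced on a toy instance in a few lines of NumPy. The sketch below (our construction, with arbitrarily chosen sizes and hyperparameters rather than the paper's tuned constants) runs ScaledGD( πœ† ) with overparameterized rank π‘Ÿ = 2 π‘Ÿ ⋆ on a mildly ill-conditioned ground truth and records the relative recovery error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r_star, r, m = 8, 2, 4, 1000   # dimension, true rank, overparameterized rank, measurements
kappa = 10.0

# Ill-conditioned ground truth M_star = X_star X_star^T with sigma(X_star) = (1, 1/kappa).
U, _ = np.linalg.qr(rng.normal(size=(n, r_star)))
X_star = U * np.array([1.0, 1.0 / kappa])
M_star = X_star @ X_star.T

# Symmetric Gaussian sensing matrices (cf. Lemma 1) and noiseless measurements.
G = rng.normal(size=(m, n, n))
A = (G + G.transpose(0, 2, 1)) / (2.0 * np.sqrt(m))
y = np.einsum('kij,ij->k', A, M_star)

# ScaledGD(lambda) from a small random initialization (11).
eta, lam, alpha = 0.3, 1e-4, 1e-12   # lam ~ 0.01 * sigma_min^2(X_star), cf. (12b)
X = alpha * rng.normal(size=(n, r)) / np.sqrt(n)
for _ in range(500):
    residual = np.einsum('kij,ij->k', A, X @ X.T) - y
    grad = np.einsum('k,kij->ij', residual, A) @ X
    X -= eta * grad @ np.linalg.inv(X.T @ X + lam * np.eye(r))

rel_err = np.linalg.norm(X @ X.T - M_star) / np.linalg.norm(M_star)
```

In this configuration the relative error drops by many orders of magnitude within a few hundred iterations, illustrating the short initial phase followed by constant-rate linear convergence; the exact floor depends on 𝛼 , the seed, and the RIP quality of the sampled π’œ .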

Sample complexity.

The sample complexity of ScaledGD( πœ† ) hinges upon Assumption 1. When the sensing operator π’œ ( β‹… ) follows the Gaussian design, this assumption is fulfilled as long as π‘š ≳ 𝑛 π‘Ÿ ⋆^2 Β· poly( πœ… ). Notably, our sample complexity depends only on the true rank π‘Ÿ ⋆ , but not on the overparameterized rank π‘Ÿ , a crucial feature in order to provide meaningful guarantees when the overparameterized rank π‘Ÿ is close to the full dimension 𝑛 . The dependency on πœ… in the sample complexity, on the other hand, is believed to be an artifact of the proof, as suggested empirically in some related settings (see, e.g., Figure 4 of chen2019noisy). Rigorously proving this, however, remains an open problem in nonconvex low-rank estimation (chi2019nonconvex).

Comparison with zhang2021preconditioned; zhang2022preconditioned.

As mentioned earlier, our proposed algorithm ScaledGD( πœ† ) is similar to 𝖯𝗋𝖾𝖼𝖦𝖣 proposed in zhang2021preconditioned that adopts an iteration-varying damping parameter in ScaledGD tong2021accelerating, with several important distinctions. In terms of theoretical guarantees, zhang2021preconditioned only provides the local convergence for 𝖯𝗋𝖾𝖼𝖦𝖣 assuming an initialization close to the ground truth; in contrast, we provide global convergence guarantees where a small random initialization is used. More critically, the sample complexity of 𝖯𝗋𝖾𝖼𝖦𝖣 zhang2021preconditioned depends on the overparameterized rank π‘Ÿ , while ours only depends on the true rank π‘Ÿ ⋆ . While zhang2022preconditioned also studied variants of 𝖯𝗋𝖾𝖼𝖦𝖣 with global convergence guarantees, they require additional operations such as gradient perturbations and switching between different algorithmic stages, which are harder to implement in practice. Furthermore, their convergence rate is much more pessimistic than ours. Our theory suggests that additional perturbation is unnecessary to ensure the global convergence of ScaledGD( πœ† ), as ScaledGD( πœ† ) automatically adapts to different curvatures of the optimization landscape throughout the entire trajectory.

3.2 The exact parameterization setting

We now single out the exact parameterization case, i.e., when π‘Ÿ = π‘Ÿ ⋆ . In this case, our theory suggests that ScaledGD( πœ† ) converges to the ground truth even from a random initialization with a fixed scale 𝛼 > 0 .

Theorem 3.

Assume that π‘Ÿ

π‘Ÿ ⋆ . Suppose Assumptions 1 and 2 hold. With high probability (with respect to the realization of the random initialization 𝐺 ), there exist some universal constants 𝐢 min > 0 and 𝑐 > 0 such that for some 𝑇 ≀ 𝑇 min

𝐢 min πœ‚ ​ log ⁑ ( β€– 𝑋 ⋆ β€– / 𝛼 ) , we have for any 𝑑 β‰₯ 𝑇

β€– 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯ ≀ ( 1 βˆ’ 𝑐 ​ πœ‚ ) 𝑑 βˆ’ 𝑇 ​ β€– 𝑀 ⋆ β€– .

Theorem 3 shows that with some fixed initialization scale 𝛼 , ScaledGD( πœ† ) takes at most an order of

log ⁑ πœ… β‹… log ⁑ ( πœ… ​ 𝑛 ) + log ⁑ ( 1 / πœ€ )

iterations to converge to πœ€ -accuracy for any πœ€ > 0 in the exact parameterization case. Compared with ScaledGD (tong2021accelerating), which takes 𝑂 ( log ( 1 / πœ€ ) ) iterations to converge from a spectral initialization, we only pay a logarithmic order 𝑂 ( log πœ… β‹… log ( πœ… 𝑛 ) ) of additional iterations to converge from a random initialization. In addition, once the algorithms enter the local regime, both ScaledGD( πœ† ) and ScaledGD behave similarly and converge at a fast constant linear rate, suggesting the effect of damping is locally negligible. Furthermore, compared with GD (stoger2021small), which requires 𝑂 ( πœ…^8 log ( πœ… 𝑛 ) + πœ…^2 log ( 1 / πœ€ ) ) iterations to achieve πœ€ -accuracy, our theory again highlights the benefit of ScaledGD( πœ† ) in boosting the global convergence even for mildly ill-conditioned matrices.

3.3 The noisy setting

We next consider the case where the measurements are contaminated by noise $\xi = (\xi_i)_{i=1}^{m}$, that is,

$y = \mathcal{A}(M_\star) + \xi, \quad \text{or more concretely,} \quad y_i = \langle A_i, M_\star \rangle + \xi_i, \quad 1 \le i \le m. \qquad (13)$

Instantiating (7) with the noisy measurements, the update rule of ScaledGD( πœ† ) can be written as

$X_{t+1} = X_t - \eta\, \big(\mathcal{A}^*\mathcal{A}(X_t X_t^\top) - \mathcal{A}^*(y)\big)\, X_t\, (X_t^\top X_t + \lambda I)^{-1}. \qquad (14)$

For simplicity, we make the following mild assumption on the noise.

Assumption 3.

We assume that the πœ‰ 𝑖 ’s are independent of π’œ ( β‹… ) , and are i.i.d. Gaussian, i.e.,

πœ‰ 𝑖 ∼ i.i.d. 𝒩 ​ ( 0 , 𝜎 2 ) , 1 ≀ 𝑖 ≀ π‘š .

Our theory demonstrates that ScaledGD( πœ† ) achieves the nearly minimax-optimal error in this noisy setting as long as the noise is not too large.

Theorem 4.

Assume that $\sigma \sqrt{n} \le c_\sigma\, \kappa^{-C_\sigma}\, \|M_\star\|$ for some sufficiently small universal constant 𝑐 𝜎 > 0 and some sufficiently large universal constant 𝐢 𝜎 > 0 . Then the following holds with high probability (with respect to the realization of the random initialization 𝐺 and the noise πœ‰ ). Suppose Assumptions 1, 2 and 3 hold. Given a prescribed accuracy target πœ€ ∈ ( 0 , 1 ) , suppose further that 𝛼 ≀ πœ€^3 β€– 𝑋 ⋆ β€– . There exist universal constants 𝐢 min > 0 and 𝐢 4 > 0 such that for some $T \le T_{\min} := \frac{C_{\min}}{\eta} \log \frac{\|X_\star\|}{\alpha}$, we have

$\|X_T X_T^\top - M_\star\| \le \max\big(\varepsilon \|M_\star\|,\; C_4\, \kappa^4\, \sigma \sqrt{n}\big),$

$\|X_T X_T^\top - M_\star\|_{\mathsf{F}} \le \max\big(\varepsilon \|M_\star\|,\; C_4\, \kappa^4\, \sigma \sqrt{n r_\star}\big).$

A few remarks are in order.

Minimax optimality.

Theorem 4 suggests that as long as the noise level is not too large, by setting the optimization error πœ€ sufficiently small, i.e., $\varepsilon \|M_\star\| \asymp \kappa^4 \sigma \sqrt{n}$, ScaledGD( πœ† ) finds a solution that satisfies

$\|X_T X_T^\top - M_\star\| \lesssim \kappa^4\, \sigma \sqrt{n}, \qquad \|X_T X_T^\top - M_\star\|_{\mathsf{F}} \lesssim \kappa^4\, \sigma \sqrt{n r_\star} \qquad (15)$

in no more than $\log \kappa \cdot \log(\kappa n) + \log\big(\frac{\|M_\star\|}{\kappa^4 \sigma \sqrt{n}}\big)$ iterations, the number of which again only depends logarithmically on the problem parameters. When πœ… is upper bounded by a constant, our result is minimax optimal, in the sense that the final error matches the minimax lower bound in the classical work of candes2011tight, which we recall here for completeness: for any estimator $\widehat{M}(y)$ based on the measurement 𝑦 defined in (13), and for any π‘Ÿ ⋆ ≀ 𝑛 , there always exists some 𝑀 ⋆ ∈ ℝ^{𝑛×𝑛} of rank π‘Ÿ ⋆ such that

$\|\widehat{M}(y) - M_\star\| \gtrsim \sigma \sqrt{n}, \qquad \|\widehat{M}(y) - M_\star\|_{\mathsf{F}} \gtrsim \sigma \sqrt{n r_\star}$

with probability at least 0.99 (with respect to the realization of the noise πœ‰ ). To the best of our knowledge, Theorem 4 is the first result to establish the minimax optimality (up to multiplicative factors of πœ… ) of overparameterized gradient methods in the context of low-rank matrix sensing. We remark that similar sub-optimality with respect to πœ… is also observed in chen2019noisy.

Consistency.

It is often desirable that the estimator is (asymptotically) consistent, i.e., the estimation error converges to 0 as the number of samples π‘š β†’ ∞. To see that Theorem 4 indicates ScaledGD( πœ† ) indeed produces a consistent estimator, let us consider again the Gaussian design. In this case, ⟨ 𝐴_𝑖, 𝑀⋆ ⟩ is on the order of β€– 𝑀⋆ β€– / βˆšπ‘š, thus the signal-to-noise ratio can be measured by 𝖲𝖭𝖱 ≔ (β€– 𝑀⋆ β€–/βˆšπ‘š)Β² / 𝜎² = β€– 𝑀⋆ β€–Β² / (π‘š 𝜎²). With this notation, Theorem 4 asserts that the final error is 𝑂(𝖲𝖭𝖱^{βˆ’1/2} √(𝑛/π‘š) β€– 𝑀⋆ β€–) in operator norm and 𝑂(𝖲𝖭𝖱^{βˆ’1/2} √(𝑛 π‘Ÿβ‹†/π‘š) β€– 𝑀⋆ β€–) in Frobenius norm, both of which converge to 0 at a rate of √(𝑛 π‘Ÿβ‹†/π‘š) as π‘š β†’ ∞ when 𝖲𝖭𝖱 is fixed.
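For concreteness, the rewriting behind this consistency claim is a one-line substitution of the definition of 𝖲𝖭𝖱 into the operator-norm error (with the πœ…β΄ factor treated as a constant):

```latex
\sigma\sqrt{n}
  \;=\; \frac{\|M_\star\|}{\sqrt{m\,\mathsf{SNR}}}\cdot\sqrt{n}
  \;=\; \mathsf{SNR}^{-1/2}\,\sqrt{\frac{n}{m}}\;\|M_\star\| ,
```

and the Frobenius-norm error follows identically with 𝑛 replaced by 𝑛 π‘Ÿβ‹†.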

3.4The approximately low-rank setting

Last but not least, we examine a more general model of 𝑀 ⋆ , which does not need to be exactly low-rank, but only approximately low-rank. Instead of recovering 𝑀 ⋆ exactly, one seeks to find a low-rank approximation to 𝑀 ⋆ from its linear measurements.

To set up, let 𝑀⋆ ∈ ℝ^{𝑛 Γ— 𝑛} be a general PSD ground truth matrix, whose spectral decomposition is given by 𝑀⋆ = βˆ‘_{𝑖=1}^{𝑛} 𝜎_𝑖 𝑒_𝑖 𝑒_𝑖^⊀, with

𝜎_1 β‰₯ 𝜎_2 β‰₯ β‹― β‰₯ 𝜎_𝑛.

For any given π‘Ÿ ≀ 𝑛 , let 𝑀 π‘Ÿ be the best rank- π‘Ÿ approximation of 𝑀 ⋆ and 𝑀 π‘Ÿ β€² be the residual, i.e.,

𝑀⋆ = βˆ‘_{𝑖=1}^{π‘Ÿ} 𝜎_𝑖 𝑒_𝑖 𝑒_𝑖^⊀ (≕ 𝑀_π‘Ÿ) + βˆ‘_{𝑖=π‘Ÿ+1}^{𝑛} 𝜎_𝑖 𝑒_𝑖 𝑒_𝑖^⊀ (≕ 𝑀_π‘Ÿβ€²).

(16)

If 𝑀 ^ π‘Ÿ is a rank- π‘Ÿ approximation to 𝑀 ⋆ , the approximation error can be measured by β€– 𝑀 ^ π‘Ÿ βˆ’ 𝑀 ⋆ β€– π–₯ . It is well-known that the best rank- π‘Ÿ approximation in this sense is exactly 𝑀 π‘Ÿ , and the optimal error is thus β€– 𝑀 π‘Ÿ β€² β€– π–₯ . By picking a larger π‘Ÿ , one has a smaller approximation error β€– 𝑀 π‘Ÿ β€² β€– π–₯ , but a higher memory footprint for the low-rank approximation 𝑀 π‘Ÿ whose condition number also grows with π‘Ÿ .

For simplicity, we consider the Gaussian design (cf. Lemma 1) in this subsection, which is less general than the RIP. The following theorem demonstrates that, as long as the sample size satisfies π‘š ≳ 𝑛 ​ π‘Ÿ ⋆ 2 β‹… π—‰π—ˆπ—…π—’ ​ ( πœ… ) , ScaledGD( πœ† ) automatically adapts to the available sample size and produces a near-optimal rank- π‘Ÿ ⋆ approximation to 𝑀 ⋆ in spite of overparameterization.

Theorem 5.

Assume that 𝑀⋆ is given in (16) and the sensing operator π’œ follows the Gaussian design with π‘š β‰₯ 𝐢 𝑛 π‘Ÿβ‹†Β² πœ…^𝐢, where πœ… = 𝜎_1 / 𝜎_{π‘Ÿβ‹†} is the condition number of 𝑀_{π‘Ÿβ‹†}. In addition, assume β€– 𝑀_{π‘Ÿβ‹†}β€² β€– ≀ 𝑐_𝜎 πœ…^{βˆ’πΆ_𝜎} β€– 𝑀⋆ β€– and β€– 𝑀_{π‘Ÿβ‹†}β€² β€–_π–₯ ≀ 𝑐_𝜎 πœ…^{βˆ’πΆ_𝜎} √(π‘š/𝑛) β€– 𝑀⋆ β€–. Then the following holds with high probability (with respect to the realization of the random initialization 𝐺 and the sensing operator π’œ). Suppose Assumption 2 holds for 𝑀_{π‘Ÿβ‹†} = 𝑋⋆ 𝑋⋆^⊀. Given a prescribed accuracy target πœ€ ∈ (0, 1), suppose further that 𝛼 ≀ πœ€Β³ β€– 𝑋⋆ β€–. There exist universal constants 𝐢_min > 0, 𝐢_5 > 0, such that for some 𝑇 ≀ 𝑇_min ≔ (𝐢_min/πœ‚) log(β€– 𝑋⋆ β€–/𝛼), we have

β€– 𝑋_𝑇 𝑋_𝑇^⊀ βˆ’ 𝑀⋆ β€–_π–₯ ≀ max( πœ€ β€– 𝑀⋆ β€–, 𝐢_5 πœ…β΄ β€– 𝑀_{π‘Ÿβ‹†}β€² β€–_π–₯ ).

Here, 𝐢 > 0 and 𝐢_𝜎 > 0 are some sufficiently large universal constants, and 𝑐_𝜎 > 0 is some sufficiently small universal constant.

Remark 2.

Theorem 5 also holds in the matrix factorization setting, i.e., when π’œ is the identity operator.

Theorem 5 suggests that as long as 𝑀⋆ is well approximated by a low-rank matrix, by setting the optimization error πœ€ sufficiently small, i.e., πœ€ β€– 𝑀⋆ β€– ≍ πœ…β΄ β€– 𝑀_{π‘Ÿβ‹†}β€² β€–_π–₯, ScaledGD( πœ† ) finds a solution that satisfies

β€– 𝑋_𝑇 𝑋_𝑇^⊀ βˆ’ 𝑀⋆ β€–_π–₯ ≲ πœ…β΄ β€– 𝑀_{π‘Ÿβ‹†}β€² β€–_π–₯

(17)

in no more than log πœ… β‹… log(πœ… 𝑛) + log(β€– 𝑀⋆ β€– / (πœ…β΄ β€– 𝑀_{π‘Ÿβ‹†}β€² β€–_π–₯)) iterations, which again depend only logarithmically on the problem parameters. This suggests that if the residual 𝑀_{π‘Ÿβ‹†}β€² is small, ScaledGD( πœ† ) produces an approximate solution to the best rank- π‘Ÿβ‹† approximation problem with near-optimal error, up to a multiplicative factor depending only on πœ…, without knowing the rank π‘Ÿβ‹† a priori. To the best of our knowledge, this is the first near-optimal theoretical guarantee for approximate low-rank matrix sensing using gradient-based methods.

4Analysis

In this section, we present the main steps for proving Theorem 2 and Theorem 3. The proofs of Theorem 4 and Theorem 5 follow the same ideas with minor modifications. The detailed proofs are collected in the appendix. All of our statements are conditioned on the following high-probability event regarding the initialization matrix 𝐺:

β„° ≔ { β€– 𝐺 β€– ≀ 𝐢_𝐺 } ∩ { 𝜎_min(π‘ˆΜ‚^⊀ 𝐺) β‰₯ (2𝑛)^{βˆ’πΆ_𝐺} },

(18)

where π‘ˆΜ‚ ∈ ℝ^{𝑛 Γ— π‘Ÿβ‹†} is an orthonormal basis of the eigenspace associated with the π‘Ÿβ‹† largest eigenvalues of π’œ^* π’œ(𝑀⋆), and 𝐢_𝐺 > 0 is some sufficiently large universal constant. It is a standard result in random matrix theory that β„° happens with high probability, as verified by the following lemma.

Lemma 2.

With respect to the randomness in 𝐺, the event β„° happens with probability at least 1 βˆ’ (𝑐𝑛)^{βˆ’πΆ_𝐺 (π‘Ÿ βˆ’ π‘Ÿβ‹† + 1)/2} βˆ’ 2 exp(βˆ’π‘π‘›), where 𝑐 > 0 is some universal constant.

Proof.

See Appendix A.1. ∎

4.1Preliminaries: decomposition of the iterates

Before embarking on the main proof, we present a useful decomposition (cf. (19)) of the iterate 𝑋 𝑑 into a signal term, a misalignment error term, and an overparametrization error term. Choose some matrix π‘ˆ ⋆ , βŸ‚ ∈ ℝ 𝑛 Γ— ( 𝑛 βˆ’ π‘Ÿ ⋆ ) such that [ π‘ˆ ⋆ , π‘ˆ ⋆ , βŸ‚ ] is orthonormal. Then we can define

𝑆 𝑑 ≔ π‘ˆ ⋆ ⊀ ​ 𝑋 𝑑 ∈ ℝ π‘Ÿ ⋆ Γ— π‘Ÿ , and 𝑁 𝑑 ≔ π‘ˆ ⋆ , βŸ‚ ⊀ ​ 𝑋 𝑑 ∈ ℝ ( 𝑛 βˆ’ π‘Ÿ ⋆ ) Γ— π‘Ÿ .

Let the SVD of 𝑆_𝑑 be

𝑆_𝑑 = π‘ˆ_𝑑 Ξ£_𝑑 𝑉_𝑑^⊀,

where π‘ˆ_𝑑 ∈ ℝ^{π‘Ÿβ‹† Γ— π‘Ÿβ‹†}, Ξ£_𝑑 ∈ ℝ^{π‘Ÿβ‹† Γ— π‘Ÿβ‹†}, and 𝑉_𝑑 ∈ ℝ^{π‘Ÿ Γ— π‘Ÿβ‹†}. Similar to π‘ˆβ‹†,βŸ‚, we define the orthogonal complement of 𝑉_𝑑 as 𝑉_{𝑑,βŸ‚} ∈ ℝ^{π‘Ÿ Γ— (π‘Ÿ βˆ’ π‘Ÿβ‹†)}. When π‘Ÿ = π‘Ÿβ‹† we simply set 𝑉_{𝑑,βŸ‚} = 0.

We are now ready to present the main decomposition of 𝑋 𝑑 , which we use repeatedly in later analysis. This decomposition is inspired by stoger2021small. A similar decomposition also appeared in ma2022global.

Proposition 1.

The following decomposition holds:

𝑋_𝑑 = π‘ˆβ‹† 𝑆̃_𝑑 𝑉_𝑑^⊀ (π—Œπ—‚π—€π—‡π–Ίπ—…) + π‘ˆβ‹†,βŸ‚ 𝑁̃_𝑑 𝑉_𝑑^⊀ (π—†π—‚π—Œπ–Ίπ—…π—‚π—€π—‡π—†π–Ύπ—‡π—) + π‘ˆβ‹†,βŸ‚ 𝑂̃_𝑑 𝑉_{𝑑,βŸ‚}^⊀ (π—ˆπ—π–Ύπ—‹π—‰π–Ίπ—‹π–Ίπ—†π–Ύπ—π—‹π—‚π—“π–Ίπ—π—‚π—ˆπ—‡),

(19)

where

𝑆̃_𝑑 ≔ 𝑆_𝑑 𝑉_𝑑 ∈ ℝ^{π‘Ÿβ‹† Γ— π‘Ÿβ‹†}, 𝑁̃_𝑑 ≔ 𝑁_𝑑 𝑉_𝑑 ∈ ℝ^{(𝑛 βˆ’ π‘Ÿβ‹†) Γ— π‘Ÿβ‹†}, and 𝑂̃_𝑑 ≔ 𝑁_𝑑 𝑉_{𝑑,βŸ‚} ∈ ℝ^{(𝑛 βˆ’ π‘Ÿβ‹†) Γ— (π‘Ÿ βˆ’ π‘Ÿβ‹†)}.

(20)

Proof.

See Appendix A.2. ∎
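The decomposition (19)–(20) is easy to verify numerically; below is a minimal NumPy sketch (the dimensions and the random "iterate" are illustrative choices, not quantities from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, r_star = 8, 5, 2

# An orthonormal frame [U_star, U_perp] and an arbitrary iterate X_t.
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
U_star, U_perp = U[:, :r_star], U[:, r_star:]
X_t = rng.standard_normal((n, r))

# S_t = U_star^T X_t and N_t = U_perp^T X_t, then the full SVD of S_t.
S_t, N_t = U_star.T @ X_t, U_perp.T @ X_t
_, _, Vt_full = np.linalg.svd(S_t, full_matrices=True)
V_t = Vt_full[:r_star].T        # right singular vectors of S_t (r x r_star)
V_perp = Vt_full[r_star:].T     # orthogonal complement (r x (r - r_star))

S_til = S_t @ V_t               # signal term
N_til = N_t @ V_t               # misalignment term
O_til = N_t @ V_perp            # overparameterization term

# Reassemble X_t from the three terms, as in (19).
X_rec = U_star @ S_til @ V_t.T + U_perp @ N_til @ V_t.T + U_perp @ O_til @ V_perp.T
```

Here `X_rec` matches `X_t` to machine precision, because S_t annihilates V_perp (the rows of Vt_full beyond r_star span the null space of S_t).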

Several remarks on the decomposition are in order.

β€’

First, since 𝑉_{𝑑,βŸ‚} spans the obsolete subspace arising from overparameterization, 𝑂̃_𝑑 naturally represents the error incurred by overparameterization; in particular, in the well-specified case (i.e., π‘Ÿ = π‘Ÿβ‹†), one has zero overparameterization error, i.e., 𝑂̃_𝑑 = 0.

β€’

Second, apart from the rotation matrix 𝑉 𝑑 , 𝑆 ~ 𝑑 documents the projection of the iterates 𝑋 𝑑 onto the signal space π‘ˆ ⋆ . Similarly, 𝑁 ~ 𝑑 characterizes the misalignment of the iterates with the signal subspace π‘ˆ ⋆ . It is easy to observe that in order for 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ β‰ˆ 𝑀 ⋆ , one must have 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ β‰ˆ Ξ£ ⋆ 2 , and 𝑁 ~ 𝑑 β‰ˆ 0 .

β€’

Last but not least, the extra rotation induced by 𝑉_𝑑 is extremely useful in making the signal/misalignment terms rotationally invariant. To see this, suppose that we rotate the current iterate by 𝑋_𝑑 ↦ 𝑋_𝑑 𝑄 with some rotation matrix 𝑄 ∈ π’ͺ_π‘Ÿ; then 𝑆_𝑑 ↦ 𝑆_𝑑 𝑄 but 𝑆̃_𝑑 remains unchanged, and similarly for 𝑁̃_𝑑.
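To make this invariance concrete, here is a small NumPy check (all sizes illustrative). We compare the scaled misalignment matrix 𝑁̃_𝑑 𝑆̃_𝑑^{βˆ’1}, which appears in (22c) below, rather than 𝑆̃_𝑑 itself, since a numerical SVD pins the singular vectors down only up to simultaneous column sign flips, and those flips cancel in the product:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, r_star = 10, 5, 2
U = np.linalg.qr(rng.standard_normal((n, n)))[0]
U_star, U_perp = U[:, :r_star], U[:, r_star:]

def signal_misalignment(X):
    # S_t = U_star^T X, N_t = U_perp^T X; V_t from the SVD of S_t.
    S, N = U_star.T @ X, U_perp.T @ X
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    S_til, N_til = S @ Vt.T, N @ Vt.T
    # N_til S_til^{-1} is invariant to SVD sign flips and to rotations of X.
    return N_til @ np.linalg.inv(S_til)

X = rng.standard_normal((n, r))
Q = np.linalg.qr(rng.standard_normal((r, r)))[0]   # an orthogonal Q: X -> XQ
M1 = signal_misalignment(X)
M2 = signal_misalignment(X @ Q)
```

`M1` and `M2` agree to machine precision, while the raw factors S_t and N_t themselves get rotated by Q.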

4.2Proof roadmap

Our analysis breaks into a few phases that characterize the dynamics of the key terms in the above decomposition; we provide a roadmap here to facilitate understanding. Denote

𝐢_max ≔ 4 𝐢_min if π‘Ÿ > π‘Ÿβ‹†, and 𝐢_max ≔ ∞ if π‘Ÿ = π‘Ÿβ‹†; and 𝑇_max ≔ (𝐢_max/πœ‚) log(β€– 𝑋⋆ β€–/𝛼),

where 𝑇 max represents the largest index of the iterates that we maintain error control. The analysis boils down to the following phases, indicated by time points 𝑑 1 , 𝑑 2 , 𝑑 3 , 𝑑 4 that satisfy

𝑑 1 ≀ 𝑇 min / 16 , 𝑑 1 ≀ 𝑑 2 ≀ 𝑑 1 + 𝑇 min / 16 , 𝑑 2 ≀ 𝑑 3 ≀ 𝑑 2 + 𝑇 min / 16 , 𝑑 3 ≀ 𝑑 4 ≀ 𝑑 3 + 𝑇 min / 16 .

β€’

Phase I: approximate power iterations. In the initial phase, ScaledGD( πœ† ) behaves similarly to GD, which is shown in stoger2021small to approximate the power method in the first few iterations up to 𝑑 1 . After this phase, namely for 𝑑 ∈ [ 𝑑 1 , 𝑇 max ] , although the signal strength is still quite small, it begins to be aligned with the ground truth with the overparameterization error kept relatively small.

β€’

Phase II: exponential amplification of the signal. In this phase, ScaledGD( πœ† ) behaves somewhat as a mixture of GD and ScaledGD with a proper choice of the damping parameter πœ† ≍ 𝜎 min 2 ​ ( 𝑋 ⋆ ) , which ensures the signal strength first grows exponentially fast to reach a constant level no later than 𝑑 2 , and then reaches the desired level no later than 𝑑 3 , i.e., 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ β‰ˆ Ξ£ ⋆ 2 .

β€’

Phase III: local linear convergence. At the last phase, ScaledGD( πœ† ) behaves similarly to ScaledGD, which converges linearly at a rate independent of the condition number. Specifically, for 𝑑 ∈ [ 𝑑 3 , 𝑇 max ] , the reconstruction error β€– 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯ converges at a linear rate up to some small overparameterization error, until reaching the desired accuracy for any 𝑑 ∈ [ 𝑑 4 , 𝑇 max ] .

4.3Phase I: approximate power iterations

It has been observed in stoger2021small that when initialized at a small scaled random matrix, the first few iterations of GD mimic the power iterations on the matrix π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) . When it comes to ScaledGD( πœ† ), since the initialization size 𝛼 is chosen to be much smaller than the damping parameter πœ† , the preconditioner ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 behaves like ( πœ† ​ 𝐼 ) βˆ’ 1 in the beginning. This renders ScaledGD( πœ† ) akin to gradient descent in the initial phase. As a result, we also expect the first few iterations of ScaledGD( πœ† ) to be similar to the power iterations, i.e.,

𝑋_𝑑 β‰ˆ (𝐼 + (πœ‚/πœ†) π’œ^* π’œ(𝑀⋆))^𝑑 𝑋_0, when 𝑑 is small.

Such proximity between ScaledGD( πœ† ) and power iterations can indeed be justified in the beginning period, which allows us to deduce the following nice properties after the initial iterates of ScaledGD( πœ† ).

Lemma 3.

Under the same setting as Theorem 2, there exists an iteration number 𝑑_1 ≀ 𝑇_min/16 such that

𝜎_min(𝑆̃_{𝑑_1}) β‰₯ 𝛼² / β€– 𝑋⋆ β€–,

(21)

and that, for any 𝑑 ∈ [𝑑_1, 𝑇_max], 𝑆̃_𝑑 is invertible and one has

β€– 𝑂̃_𝑑 β€– ≀ (𝐢_{3.𝑏} πœ… 𝑛)^{βˆ’πΆ_{3.𝑏}} β€– 𝑋⋆ β€– 𝜎_min((Σ⋆² + πœ† 𝐼)^{βˆ’1/2} 𝑆̃_𝑑),

(22a)

β€– 𝑂̃_𝑑 β€– ≀ (1 + πœ‚/(12 𝐢_max πœ…))^{𝑑 βˆ’ 𝑑_1} 𝛼^{5/6} β€– 𝑋⋆ β€–^{1/6},

(22b)

β€– 𝑁̃_𝑑 𝑆̃_𝑑^{βˆ’1} Σ⋆ β€– ≀ 𝑐_3 πœ…^{βˆ’πΆ_𝛿/2} β€– 𝑋⋆ β€–,

(22c)

β€– 𝑆̃_𝑑 β€– ≀ 𝐢_{3.π‘Ž} πœ…Β³ β€– 𝑋⋆ β€–,

(22d)

where 𝐢_{3.π‘Ž}, 𝐢_{3.𝑏}, 𝑐_3 are some positive constants satisfying 𝐢_{3.π‘Ž} ≲ 𝑐_πœ†^{βˆ’1/2}, 𝑐_3 ≲ 𝑐_𝛿 / 𝑐_πœ†, and 𝐢_{3.𝑏} can be made arbitrarily large by increasing 𝐢_𝛼.

Proof.

See Appendix C. ∎

Remark 3.

Let us record two immediate consequences of (22), which sometimes are more convenient for later analysis. From (22a), we may deduce

β€– 𝑂̃_𝑑 β€– ≀ (𝐢_{3.𝑏} πœ… 𝑛)^{βˆ’πΆ_{3.𝑏}} β€– 𝑋⋆ β€– 𝜎_min(Σ⋆² + πœ† 𝐼)^{βˆ’1/2} 𝜎_min(𝑆̃_𝑑) ≀ πœ… (𝐢_{3.𝑏} πœ… 𝑛)^{βˆ’πΆ_{3.𝑏}} 𝜎_min(𝑆̃_𝑑) ≀ (𝐢_{3.𝑏}β€² πœ… 𝑛)^{βˆ’πΆ_{3.𝑏}β€²} 𝜎_min(𝑆̃_𝑑),

(23)

where 𝐢_{3.𝑏}β€² = 𝐢_{3.𝑏}/2, provided 𝐢_{3.𝑏} β‰₯ 4. It is clear that 𝐢_{3.𝑏}β€² can also be made arbitrarily large by enlarging 𝐢_𝛼. Similarly, from (22b), we may deduce

β€– 𝑂̃_𝑑 β€– ≀ (1 + πœ‚/(12 𝐢_max πœ…))^{𝑑 βˆ’ 𝑑_1} 𝛼^{5/6} β€– 𝑋⋆ β€–^{1/6} ≀ (1 + πœ‚/(12 𝐢_max πœ…))^{(𝐢_max/πœ‚) log(β€– 𝑋⋆ β€–/𝛼)} 𝛼^{5/6} β€– 𝑋⋆ β€–^{1/6} ≀ (β€– 𝑋⋆ β€–/𝛼)^{1/12} 𝛼^{5/6} β€– 𝑋⋆ β€–^{1/6} = 𝛼^{3/4} β€– 𝑋⋆ β€–^{1/4}.

(24)

Lemma 3 ensures that the iterates of ScaledGD( πœ† ) maintain several desired properties after iteration 𝑑_1, as summarized in (22). In particular, for any 𝑑 ∈ [𝑑_1, 𝑇_max]: (i) the overparameterization error β€– 𝑂̃_𝑑 β€– remains small relative to the signal strength measured in terms of the scaled minimum singular value 𝜎_min((Σ⋆² + πœ† 𝐼)^{βˆ’1/2} 𝑆̃_𝑑), and remains bounded with respect to the size of the initialization 𝛼 (cf. (22a) and (22b) and their consequences (23) and (24)); (ii) the scaled misalignment-to-signal ratio remains bounded, suggesting the iterates remain aligned with the ground-truth signal subspace π‘ˆβ‹† (cf. (22c)); (iii) the size of the signal component 𝑆̃_𝑑 remains bounded (cf. (22d)). These properties play an important role in the follow-up analysis.

Remark 4.

It is worth noting that, the scaled minimum singular value 𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 ) plays a key role in our analysis, which is in sharp contrast to the use of the vanilla minimum singular value 𝜎 min ​ ( 𝑆 ~ 𝑑 ) in the analysis of gradient descent (stoger2021small). This new measure of signal strength is inspired by the scaled distance for ScaledGD introduced in tong2021accelerating; tong2022scaling, which carefully takes the preconditioner design into consideration. Similarly, the metrics β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– in (22c) and β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– (to be seen momentarily) are also scaled for similar considerations to unveil the fast convergence (almost) independent of the condition number.

4.4Phase II: exponential amplification of the signal

By the end of Phase I, the signal strength is still quite small (cf. (21)), which is far from the desired level. Fortunately, the properties established in Lemma 3 allow us to establish an exponential amplification of the signal term 𝑆 ~ 𝑑 thereafter, which can be further divided into two stages.

1.

In the first stage, the signal is boosted to a constant level, i.e., 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βͺ° 1 10 ​ Ξ£ ⋆ 2 ;

2.

In the second stage, the signal grows further to the desired level, i.e., 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ β‰ˆ Ξ£ ⋆ 2 .

We start with the first stage, which again uses 𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 ) as a measure of signal strength in the following lemma.

Lemma 4.

For any 𝑑 such that (22) holds, we have

𝜎_min((Σ⋆² + πœ† 𝐼)^{βˆ’1/2} 𝑆̃_{𝑑+1}) β‰₯ (1 βˆ’ 2πœ‚) 𝜎_min((Σ⋆² + πœ† 𝐼)^{βˆ’1/2} 𝑆̃_𝑑).

Moreover, if 𝜎_min((Σ⋆² + πœ† 𝐼)^{βˆ’1/2} 𝑆̃_𝑑) ≀ 1/3, then

𝜎_min((Σ⋆² + πœ† 𝐼)^{βˆ’1/2} 𝑆̃_{𝑑+1}) β‰₯ (1 + πœ‚/8) 𝜎_min((Σ⋆² + πœ† 𝐼)^{βˆ’1/2} 𝑆̃_𝑑).

Proof.

See Appendix D.1. ∎

The second half of Lemma 4 uncovers the exponential growth of the signal strength 𝜎_min((Σ⋆² + πœ† 𝐼)^{βˆ’1/2} 𝑆̃_𝑑) until it reaches a constant level after a number of iterations, which resembles the exponential growth of the signal strength under GD (stoger2021small). This is formally established in the following corollary.

Corollary 1.

There exists an iteration number 𝑑 2 : 𝑑 1 ≀ 𝑑 2 ≀ 𝑑 1 + 𝑇 min / 16 such that for all 𝑑 ∈ [ 𝑑 2 , 𝑇 max ] , we have

𝑆̃_𝑑 𝑆̃_𝑑^⊀ βͺ° (1/10) Σ⋆².

(25)

Proof.

See Appendix D.2. ∎

We next aim to show that 𝑆̃_𝑑 𝑆̃_𝑑^⊀ β‰ˆ Σ⋆² once the signal strength is above the constant level. In this regime, the behavior of ScaledGD( πœ† ) becomes closer to that of ScaledGD, and it turns out to be easier to work with β€– Σ⋆^{βˆ’1} (𝑆̃_𝑑 𝑆̃_𝑑^⊀ βˆ’ Σ⋆²) Σ⋆^{βˆ’1} β€– as a measure of the scaled recovery error of the signal component. We establish the approximate exponential shrinkage of this measure in the following lemma.

Lemma 5.

For all 𝑑 ∈ [ 𝑑 2 , 𝑇 max ] with 𝑑 2 given in Corollary 1, one has

β€– Σ⋆^{βˆ’1} (𝑆̃_{𝑑+1} 𝑆̃_{𝑑+1}^⊀ βˆ’ Σ⋆²) Σ⋆^{βˆ’1} β€– ≀ (1 βˆ’ πœ‚) β€– Σ⋆^{βˆ’1} (𝑆̃_𝑑 𝑆̃_𝑑^⊀ βˆ’ Σ⋆²) Σ⋆^{βˆ’1} β€– + πœ‚/100.

(26)

Proof.

See Appendix D.3. ∎

With the help of Lemma 5, it is straightforward to establish the desired approximate recovery guarantee of the signal component, i.e., 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ β‰ˆ Ξ£ ⋆ 2 .

Corollary 2.

There exists an iteration number 𝑑 3 : 𝑑 2 ≀ 𝑑 3 ≀ 𝑑 2 + 𝑇 min / 16 such that for any 𝑑 ∈ [ 𝑑 3 , 𝑇 max ] , one has

(9/10) Σ⋆² βͺ― 𝑆̃_𝑑 𝑆̃_𝑑^⊀ βͺ― (11/10) Σ⋆².

(27)

Proof.

See Appendix D.4. ∎

4.5Phase III: local convergence

Corollary 2 tells us that after iteration 𝑑 3 , we enter a local region in which 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ is close to the ground truth Ξ£ ⋆ 2 . In this local region, the behavior of ScaledGD( πœ† ) becomes closer to that of ScaledGD analyzed in tong2021accelerating. We turn attention to the reconstruction error β€– 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯ that measures the generalization performance, and show it converges at a linear rate independent of the condition number up to some small overparameterization error.

Lemma 6.

There exists some universal constant 𝑐_6 > 0 such that for any 𝑑: 𝑑_3 ≀ 𝑑 ≀ 𝑇_max, we have

β€– 𝑋_𝑑 𝑋_𝑑^⊀ βˆ’ 𝑀⋆ β€–_π–₯ ≀ (1 βˆ’ 𝑐_6 πœ‚)^{𝑑 βˆ’ 𝑑_3} βˆšπ‘Ÿβ‹† β€– 𝑀⋆ β€– + 8 𝑐_6^{βˆ’1} β€– 𝑀⋆ β€– max_{𝑑_3 ≀ 𝜏 ≀ 𝑑} (β€– 𝑂̃_𝜏 β€– / β€– 𝑋⋆ β€–)^{1/2}.

(28)

In particular, there exists an iteration number 𝑑 4 : 𝑑 3 ≀ 𝑑 4 ≀ 𝑑 3 + 𝑇 min / 16 such that for any 𝑑 ∈ [ 𝑑 4 , 𝑇 max ] , we have

β€– 𝑋_𝑑 𝑋_𝑑^⊀ βˆ’ 𝑀⋆ β€–_π–₯ ≀ 𝛼^{1/3} β€– 𝑋⋆ β€–^{5/3} ≀ πœ€ β€– 𝑀⋆ β€–.

(29)

Here, πœ€ and 𝛼 are as stated in Theorem 2.

Proof.

See Appendix E. ∎

4.6Proofs of main theorems

Now we are ready to collect the results in the preceding sections to prove our main results, i.e., Theorem 2 and Theorem 3. The proofs of Theorem 4 and Theorem 5 follow from similar ideas but with additional technicalities, and are thus postponed to Appendix F.

We start with proving Theorem 2. By Lemma 3, Corollary 1, Corollary 2 and Lemma 6, the final 𝑑 4 given by Lemma 6 is no more than 4 Γ— 𝑇 min / 16 ≀ 𝑇 min / 2 , thus (29) holds for all 𝑑 ∈ [ 𝑇 min / 2 , 𝑇 max ] , in particular, for some 𝑇 ≀ 𝑇 min , as claimed.

Now we consider Theorem 3. In the case π‘Ÿ = π‘Ÿβ‹†, it follows from the definition that 𝑂̃_𝑑 = 0 for all 𝑑. It then follows from Lemma 6, in particular from (28), that

β€– 𝑋_𝑑 𝑋_𝑑^⊀ βˆ’ 𝑀⋆ β€–_π–₯ ≀ (1 βˆ’ 𝑐_6 πœ‚)^{𝑑 βˆ’ 𝑑_3} βˆšπ‘Ÿβ‹† β€– 𝑀⋆ β€–,

for any 𝑑 β‰₯ 𝑑_3 (recall that 𝑇_max = ∞ by definition when π‘Ÿ = π‘Ÿβ‹†). Note that (1 βˆ’ 𝑐_6 πœ‚)^𝑑 βˆšπ‘Ÿβ‹† ≀ (1 βˆ’ 𝑐_6 πœ‚)^{𝑑 βˆ’ 𝑇 + 𝑑_3} if 𝑇 βˆ’ 𝑑_3 β‰₯ 4 log(π‘Ÿβ‹†)/(𝑐_6 πœ‚), given that πœ‚ ≀ 𝑐_πœ‚ is sufficiently small. Thus for any 𝑑 β‰₯ 𝑇 we have

β€– 𝑋_𝑑 𝑋_𝑑^⊀ βˆ’ 𝑀⋆ β€–_π–₯ ≀ (1 βˆ’ 𝑐_6 πœ‚)^{𝑑 βˆ’ 𝑇} β€– 𝑀⋆ β€–.

It is clear that one may choose such a 𝑇 which also satisfies 𝑇 ≀ 𝑑_3 + 8/(𝑐_6 πœ‚) ≀ 𝑑_3 + 𝑇_min/16. We have already shown in the proof of Theorem 2 that 𝑑_3 ≀ 4 Γ— 𝑇_min/16 ≀ 𝑇_min/4, thus 𝑇 ≀ 𝑇_min as desired.

Early stopping.

In the overparameterized setting, our theory guarantees the reconstruction error to be small until some iteration 𝑇_max. This is consistent with the phenomenon known as early stopping in prior works on learning with overparameterized models (stoger2021small; li2018algorithmic). Given the form of (22b), one may wonder whether early stopping needs to be timed precisely, since β€– 𝑂̃_𝑑 β€– could in principle grow excessively. Fortunately, this is not the case, as the following proposition – proved in Appendix E – demonstrates.

Proposition 2.

Under the same setting as Theorem 2, we have

β€– 𝑂̃_𝑑 β€– ≀ 𝛼^{7/10} β€– 𝑋⋆ β€–^{3/10}, βˆ€ 𝑑 ≀ (β€– 𝑋⋆ β€–/𝛼)^{3/10}.

As we pick a very small 𝛼, this means that, for all practical purposes, early stopping is not needed.

5Numerical experiments

In this section, we conduct numerical experiments to demonstrate the efficacy of ScaledGD( πœ† ) for solving overparameterized low-rank matrix sensing. We set the ground-truth matrix 𝑋⋆ = π‘ˆβ‹† Σ⋆ ∈ ℝ^{𝑛 Γ— π‘Ÿβ‹†}, where π‘ˆβ‹† ∈ ℝ^{𝑛 Γ— π‘Ÿβ‹†} is a random orthogonal matrix and Σ⋆ ∈ ℝ^{π‘Ÿβ‹† Γ— π‘Ÿβ‹†} is a diagonal matrix whose condition number is set to be πœ…. We set 𝑛 = 150 and π‘Ÿβ‹† = 3, and use random Gaussian measurements with π‘š = 10 𝑛 π‘Ÿβ‹†. The overparameterization rank π‘Ÿ is set to 5 unless otherwise specified.

Throughout our experiments, to choose πœ†, we estimate 𝜎_min(𝑋⋆) using a simple rule of thumb. Let πœŽΜ‚_1 β‰₯ πœŽΜ‚_2 β‰₯ β‹― β‰₯ πœŽΜ‚_𝑛 be the singular values of π’œ^*(𝑦). Let 𝑖_0 be the smallest number such that

βˆ‘_{𝑖 ≀ 𝑖_0} πœŽΜ‚_𝑖 β‰₯ 0.95 βˆ‘_{𝑖 ≀ 𝑛} πœŽΜ‚_𝑖.

Then we estimate πœŽΜ‚_minΒ²(𝑋⋆) = πœŽΜ‚_{𝑖_0}. This heuristic also applies to noisy or approximately low-rank matrices, thanks to our Theorem 4 and Theorem 5. In practice, the 0.95 threshold can be tuned towards a desired accuracy level.
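This rule of thumb takes only a few lines; a sketch (the function name and the symmetrization of π’œ^*(𝑦) before the SVD are our own choices):

```python
import numpy as np

def estimate_lambda(A_adj_y, thresh=0.95):
    """Rule-of-thumb estimate of sigma_min(X_star)^2 from the n x n matrix A^*(y)."""
    # Symmetrize before taking singular values (A^*(y) approximates the PSD M_star).
    s = np.linalg.svd((A_adj_y + A_adj_y.T) / 2, compute_uv=False)  # descending order
    cum = np.cumsum(s)
    # Smallest (0-based) index whose partial sum reaches thresh * total,
    # i.e. the i_0-th singular value in the paper's 1-based indexing.
    i0 = np.searchsorted(cum, thresh * cum[-1])
    return s[i0]
```

For example, on a noiseless rank-2 surrogate `np.diag([9.0, 4.0, 0.0, 0.0])` the rule returns 4.0, the smallest significant singular value.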

Comparison with overparameterized GD.

We run ScaledGD( πœ† ) and GD with random initialization and compare their convergence speeds under different condition numbers πœ… of the ground truth 𝑋⋆; the result is depicted in Figure 1. Even for a moderate range of πœ…, GD slows down significantly, while the convergence speed of ScaledGD( πœ† ) remains almost the same with an almost negligible initial phase, which is consistent with our theory. The advantage of ScaledGD( πœ† ) widens as πœ… increases, and ScaledGD( πœ† ) is already more than 10Γ— faster than GD when πœ… = 7.

Effect of initialization size.

We study the effect of the initialization scale 𝛼 on the reconstruction accuracy of ScaledGD( πœ† ).

We fix the learning rate πœ‚ to be a constant and vary the initialization scale. We run ScaledGD( πœ† ) until it converges.1 The resulting reconstruction errors and their corresponding initialization scales are plotted in Figure 2. The reconstruction error increases with 𝛼, which is consistent with our theory.

Figure 2: Relative reconstruction error versus initialization scale 𝛼. The slope of the dashed line is approximately 1.

Comparison with zhang2021preconditioned.

We compare ScaledGD( πœ† ) with the algorithm 𝖯𝗋𝖾𝖼𝖦𝖣 proposed in zhang2021preconditioned, which also has a πœ…-independent convergence rate assuming a sufficiently good initialization obtained via spectral initialization. However, 𝖯𝗋𝖾𝖼𝖦𝖣 requires the RIP of rank π‘Ÿ, thus demanding 𝑂(𝑛 π‘ŸΒ²) samples instead of 𝑂(𝑛 π‘Ÿβ‹†Β²) as in GD and ScaledGD( πœ† ). This can be troublesome for larger π‘Ÿ. To demonstrate this point, we run ScaledGD( πœ† ) and 𝖯𝗋𝖾𝖼𝖦𝖣 with different overparameterization ranks π‘Ÿ while fixing all other parameters. The results are shown in Figure 3. The convergence rates of 𝖯𝗋𝖾𝖼𝖦𝖣 and ScaledGD( πœ† ) are almost the same when the rank is exactly specified (π‘Ÿ = π‘Ÿβ‹† = 3), though ScaledGD( πœ† ) requires a few more iterations for the initial phases2. As π‘Ÿ grows, ScaledGD( πœ† ) is almost unaffected, while 𝖯𝗋𝖾𝖼𝖦𝖣 suffers from a significant drop in the convergence rate and even breaks down under a moderate overparameterization π‘Ÿ = 20.

Figure 3: Relative reconstruction error versus the number of iterates with different overparameterization ranks π‘Ÿ for ScaledGD( πœ† ) and 𝖯𝗋𝖾𝖼𝖦𝖣.

Noisy setting.

Though our theoretical results here are formulated in the noiseless setting, empirical evidence indicates that our algorithm ScaledGD( πœ† ) also works in the noisy setting. Modifying equation (4) for noiseless measurements, we assume the noisy measurements 𝑦_𝑖 = ⟨ 𝐴_𝑖, 𝑀 ⟩ + πœ‰_𝑖, where πœ‰_𝑖 ∼ 𝒩(0, 𝜎²) are i.i.d. Gaussian noises. The minimax lower bound for the reconstruction error β€– 𝑋_𝑑 𝑋_𝑑^⊀ βˆ’ 𝑀⋆ β€–_π–₯ is denoted by β„°_π—Œπ—π–Ίπ— = 𝜎 √(𝑛 π‘Ÿβ‹†) (candes2011tight). We compare the reconstruction error of ScaledGD( πœ† ) with β„°_π—Œπ—π–Ίπ— under different noise levels 𝜎. The results are shown in Figure 4. The final error of ScaledGD( πœ† ) matches the minimax optimal error β„°_π—Œπ—π–Ίπ— within a small multiplicative factor for all noise levels.

Figure 4: Relative reconstruction error of ScaledGD( πœ† ) versus the number of iterates in the noisy setting, where it is observed that the final error of ScaledGD( πœ† ) approaches the minimax error.

6Discussions

This paper demonstrates that an appropriately preconditioned gradient descent method, called ScaledGD( πœ† ), guarantees accelerated convergence to the ground-truth low-rank matrix in overparameterized low-rank matrix sensing, when initialized from a sufficiently small random initialization. Furthermore, in the case of exact parameterization, our analysis guarantees fast global convergence of ScaledGD( πœ† ) from a small random initialization. Collectively, this complements and represents a major step forward from prior analyses of ScaledGD (tong2021accelerating) by allowing overparameterization and small random initialization in noisy and approximately low-rank settings. This work opens up a few exciting future directions that are worth exploring further.

β€’

Asymmetric case. Our current analysis is confined to the recovery of low-rank positive semidefinite matrices, with only one factor matrix to be recovered. It remains to generalize this analysis to the recovery of general low-rank matrices with overparameterization.

β€’

Robust setting. Many applications encounter corrupted measurements that call for robust recovery algorithms optimizing nonsmooth losses such as the least absolute deviation loss. One such example is the scaled subgradient method (tong2021low), the nonsmooth counterpart of ScaledGD that is robust to ill-conditioning, and it will be interesting to study its performance under overparameterization.

β€’

Other overparameterized learning models. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized low-rank matrix sensing, which is one kind of overparameterized learning models. It will be greatly desirable to extend the insights developed herein to other overparameterized learning models, for example low-rank matrix optimization (boumal2016smooth), tensors (tong2022scaling; dong2022fast), and neural networks (wang2021deep).

We believe the analysis framework put forth in this paper can be extended to address these questions by leveraging similar error decompositions and tailoring the treatment to the corresponding measurement or data models; see the overview ma2024provably and the recent works giampouras2024guarantees; diaz2025preconditioned along this line, which appeared after the initial version of this paper.

Acknowledgements

The work of X. Xu and Y. Chi is supported in part by the Office of Naval Research under N00014-19-1-2404, by the Air Force Office of Scientific Research under award number FA9550-25-1-0060, and by the National Science Foundation under CCF-1901199, DMS-2134080 and ECCS-2126634. The work of C. Ma is partially supported by the National Science Foundation via grant DMS-2311127 and DMS CAREER Award 2443867. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force.

Appendix APreliminaries

This section collects several preliminary results that are useful in later proofs. In general, for a matrix 𝐴, we denote by π‘ˆ_𝐴 the first factor in its compact SVD 𝐴 = π‘ˆ_𝐴 Ξ£_𝐴 𝑉_𝐴^⊀, unless otherwise specified.

A.1Proof of Lemma 2

It is a standard result in random matrix theory (vershynin2010nonasym; rudelson2009smallest) that an 𝑀 Γ— 𝑁 (𝑀 β‰₯ 𝑁) random matrix 𝐺_0 with i.i.d. standard Gaussian entries satisfies

β„™(β€– 𝐺_0 β€– ≀ 4(βˆšπ‘€ + βˆšπ‘)) β‰₯ 1 βˆ’ exp(βˆ’π‘€/𝐢),

(30a)

β„™(𝜎_min(𝐺_0) β‰₯ πœ€(βˆšπ‘€ βˆ’ √(𝑁 βˆ’ 1))) β‰₯ 1 βˆ’ (𝐢 πœ€)^{𝑀 βˆ’ 𝑁 + 1} βˆ’ exp(βˆ’π‘€/𝐢),

(30b)

for some universal constant 𝐢 > 0 and for any πœ€ > 0. Applying (30a) to the random matrix βˆšπ‘› 𝐺, which is an 𝑛 Γ— π‘Ÿ random matrix with i.i.d. standard Gaussian entries, we have

β€– 𝐺 β€– ≀ 4(βˆšπ‘› + βˆšπ‘Ÿ)/βˆšπ‘› ≀ 8

with probability at least 1 βˆ’ exp(βˆ’π‘›/𝐢).

Turning to the bound on 𝜎_min^{βˆ’1}(π‘ˆΜ‚^⊀ 𝐺), observe that βˆšπ‘› π‘ˆΜ‚^⊀ 𝐺 is an π‘Ÿβ‹† Γ— π‘Ÿ random matrix with i.i.d. standard Gaussian entries, thus applying (30b) to βˆšπ‘› π‘ˆΜ‚^⊀ 𝐺 with πœ€ = (2𝑛)^{βˆ’πΆ_𝐺 + 1} yields

𝜎_min^{βˆ’1}(π‘ˆΜ‚^⊀ 𝐺) ≀ (2𝑛)^{𝐢_𝐺 βˆ’ 1} (βˆšπ‘Ÿ βˆ’ √(π‘Ÿβ‹† βˆ’ 1))^{βˆ’1} ≀ (2𝑛)^{𝐢_𝐺 βˆ’ 1} (2βˆšπ‘Ÿ) ≀ (2𝑛)^{𝐢_𝐺}

with probability at least 1 βˆ’ (2𝑛/𝐢)^{βˆ’(𝐢_𝐺 βˆ’ 1)(π‘Ÿ βˆ’ π‘Ÿβ‹† + 1)} βˆ’ exp(βˆ’π‘›/𝐢). Here, the second inequality follows from

1/(βˆšπ‘Ÿ βˆ’ √(π‘Ÿβ‹† βˆ’ 1)) ≀ 1/(βˆšπ‘Ÿ βˆ’ √(π‘Ÿ βˆ’ 1)) = βˆšπ‘Ÿ + √(π‘Ÿ βˆ’ 1) < 2βˆšπ‘Ÿ.

Combining the above two bounds directly implies the desired probability bound if we choose 𝑐 = 1/𝐢 and choose a large 𝐢_𝐺 such that 𝐢_𝐺 β‰₯ 8 and 𝐢_𝐺 βˆ’ 1 β‰₯ 𝐢_𝐺/2.

A.2Proof of Proposition 1

Using the definitions of 𝑆_𝑑 and 𝑁_𝑑, we have

𝑋_𝑑 = (π‘ˆβ‹† π‘ˆβ‹†^⊀ + π‘ˆβ‹†,βŸ‚ π‘ˆβ‹†,βŸ‚^⊀) 𝑋_𝑑 = π‘ˆβ‹† 𝑆_𝑑 + π‘ˆβ‹†,βŸ‚ 𝑁_𝑑 = π‘ˆβ‹† 𝑆̃_𝑑 𝑉_𝑑^⊀ + π‘ˆβ‹†,βŸ‚ 𝑁_𝑑 (𝑉_𝑑 𝑉_𝑑^⊀ + 𝑉_{𝑑,βŸ‚} 𝑉_{𝑑,βŸ‚}^⊀) = π‘ˆβ‹† 𝑆̃_𝑑 𝑉_𝑑^⊀ + π‘ˆβ‹†,βŸ‚ 𝑁̃_𝑑 𝑉_𝑑^⊀ + π‘ˆβ‹†,βŸ‚ 𝑂̃_𝑑 𝑉_{𝑑,βŸ‚}^⊀,

where in the second line, we used the relation 𝑆̃_𝑑 = 𝑆_𝑑 𝑉_𝑑 = π‘ˆ_𝑑 Ξ£_𝑑 𝑉_𝑑^⊀ 𝑉_𝑑 = π‘ˆ_𝑑 Ξ£_𝑑 and thus

𝑆_𝑑 = 𝑆̃_𝑑 𝑉_𝑑^⊀.

(31)

A.3Consequences of RIP

The first result is a standard consequence of RIP, see, for example stoger2021small.

Lemma 7.

Suppose that the linear map π’œ : Sym 2 ⁑ ( ℝ 𝑛 ) β†’ ℝ π‘š satisfies Assumption 1. Then we have

β€– ( ℐ βˆ’ π’œ βˆ— ​ π’œ ) ​ ( 𝑍 ) β€– ≀ 𝛿 ​ β€– 𝑍 β€– π–₯

for any 𝑍 ∈ Sym 2 ⁑ ( ℝ 𝑛 ) with rank at most π‘Ÿ ⋆ . Consequently, with πœ† ^ 1 β‰₯ β‹― β‰₯ πœ† ^ 𝑛 denoting the eigenvalues of π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) , it holds that

| πœ† ^ 𝑖 βˆ’ 𝜎 𝑖 2 ​ ( 𝑋 ⋆ ) | ≀ 𝛿 ​ π‘Ÿ ⋆ ​ β€– 𝑋 ⋆ β€– 2 .

We need another straightforward consequence of RIP, given by the following lemma.

Lemma 8.

Under the same setting as Lemma 7, we have

β€– (ℐ βˆ’ π’œ^* π’œ)(𝑍) β€– ≀ √2 𝛿 √((π‘Ÿ ∨ π‘Ÿβ‹†)/π‘Ÿβ‹†) β€– 𝑍 β€–_π–₯ ≀ √2 (π‘Ÿ ∨ π‘Ÿβ‹†) 𝛿 / βˆšπ‘Ÿβ‹† β€– 𝑍 β€–

for any 𝑍 ∈ Sym 2 ⁑ ( ℝ 𝑛 ) with rank at most π‘Ÿ .

Proof.

Without loss of generality we may assume π‘Ÿ β‰₯ π‘Ÿβ‹†, so that π‘Ÿ ∨ π‘Ÿβ‹† = π‘Ÿ. We claim that it is possible to decompose 𝑍 = βˆ‘_{𝑖 ≀ ⌈π‘Ÿ/π‘Ÿβ‹†βŒ‰} 𝑍_𝑖, where 𝑍_𝑖 ∈ SymΒ²(ℝ^𝑛), rank(𝑍_𝑖) ≀ π‘Ÿβ‹†, and 𝑍_𝑖 𝑍_𝑗 = 0 if 𝑖 β‰  𝑗. To see why this is the case, notice that the spectral decomposition of 𝑍 gives π‘Ÿ rank-one components that are mutually orthogonal, so we may divide them into ⌈π‘Ÿ/π‘Ÿβ‹†βŒ‰ subgroups indexed by 𝑖 = 1, …, ⌈π‘Ÿ/π‘Ÿβ‹†βŒ‰, such that each subgroup contains at most π‘Ÿβ‹† components. Letting 𝑍_𝑖 be the sum of the components in subgroup 𝑖, it is easy to check that the 𝑍_𝑖 have the desired properties.

The property of the decomposition yields

β€– 𝑍 β€– π–₯ 2

tr ⁑ ( 𝑍 2 )

βˆ‘ 𝑖 , 𝑗 ≀ ⌈ π‘Ÿ / π‘Ÿ ⋆ βŒ‰ tr ⁑ ( 𝑍 𝑖 ​ 𝑍 𝑗 )

βˆ‘ 𝑖 ≀ ⌈ π‘Ÿ / π‘Ÿ ⋆ βŒ‰ β€– 𝑍 𝑖 β€– π–₯ 2 .

(32)

But for each 𝑍 𝑖 , Lemma 7 implies

β€– ( ℐ βˆ’ π’œ βˆ— ​ π’œ ) ​ ( 𝑍 𝑖 ) β€– ≀ 𝛿 ​ β€– 𝑍 𝑖 β€– π–₯ .

Summing up for 𝑖 ≀ ⌈ π‘Ÿ / π‘Ÿ ⋆ βŒ‰ yields

β€– ( ℐ βˆ’ π’œ βˆ— ​ π’œ ) ​ ( 𝑍 ) β€– ≀ βˆ‘ 𝑖 ≀ ⌈ π‘Ÿ / π‘Ÿ ⋆ βŒ‰ β€– ( ℐ βˆ’ π’œ βˆ— ​ π’œ ) ​ ( 𝑍 𝑖 ) β€– ≀ 𝛿 ​ βˆ‘ 𝑖 ≀ ⌈ π‘Ÿ / π‘Ÿ ⋆ βŒ‰ β€– 𝑍 𝑖 β€– π–₯ ≀ 𝛿 ​ ⌈ π‘Ÿ / π‘Ÿ ⋆ βŒ‰ ​ β€– 𝑍 β€– π–₯ ,

where the last inequality follows from (32) and from Cauchy-Schwarz inequality.

The first inequality in Lemma 8 follows from the above inequality by noting that ⌈ π‘Ÿ / π‘Ÿ ⋆ βŒ‰ ≀ 2 ​ π‘Ÿ / π‘Ÿ ⋆ given π‘Ÿ β‰₯ π‘Ÿ ⋆ which was assumed in the beginning of the proof. The second inequality in Lemma 8 follows from β€– 𝑍 β€– π–₯ ≀ π‘Ÿ ​ β€– 𝑍 β€– . ∎
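As a quick numerical sanity check on the grouping argument (an illustrative sketch, not part of the paper; the matrices and group sizes below are arbitrary stand-ins), one can form a random symmetric $Z$ of rank $r$, split its spectral components into groups of at most $r_\star$, and verify both the Pythagorean identity (32) and the Cauchy–Schwarz step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, r_star = 10, 6, 2

# Random symmetric Z of rank r via its spectral decomposition Z = Q diag(w) Q^T.
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
w = rng.standard_normal(r)
Z = Q @ np.diag(w) @ Q.T

# Group the r mutually orthogonal rank-one components into ceil(r/r_star) pieces.
groups = [list(range(i, min(i + r_star, r))) for i in range(0, r, r_star)]
Z_parts = [Q[:, g] @ np.diag(w[g]) @ Q[:, g].T for g in groups]

fro = np.linalg.norm(Z, 'fro')
# Pythagorean identity (32): ||Z||_F^2 = sum_i ||Z_i||_F^2.
assert np.isclose(sum(np.linalg.norm(Zi, 'fro') ** 2 for Zi in Z_parts), fro ** 2)
# Cauchy-Schwarz step: sum_i ||Z_i||_F <= sqrt(ceil(r/r_star)) * ||Z||_F.
assert sum(np.linalg.norm(Zi, 'fro') for Zi in Z_parts) <= np.sqrt(len(groups)) * fro + 1e-12
```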

A.4 Matrix perturbation results

The next few results all concern matrix perturbation. We first present a perturbation result for the matrix inverse.

Lemma 9.

Assume that $A, B$ are square matrices of the same dimension, and that $A$ is invertible. If $\|A^{-1}B\| \le 1/2$, then

$$
(A + B)^{-1} = A^{-1} + A^{-1} B Q A^{-1}, \quad \text{for some } \|Q\| \le 2.
$$

Similarly, if $\|B A^{-1}\| \le 1/2$, then we have

$$
(A + B)^{-1} = A^{-1} + A^{-1} Q B A^{-1}, \quad \text{for some } \|Q\| \le 2.
$$

In particular, if $\|B\| \le \sigma_{\min}(A)/2$, then both of the above equations hold.

Proof.

The claims follow from the identity

$$
(A + B)^{-1} = A^{-1} - A^{-1} B (I + A^{-1} B)^{-1} A^{-1} = A^{-1} - A^{-1} (I + B A^{-1})^{-1} B A^{-1}.
$$

For the first claim, when $\|A^{-1}B\| \le 1/2$, we set $Q := -(I + A^{-1}B)^{-1}$, which satisfies $\|Q\| = \|(I + A^{-1}B)^{-1}\| \le \frac{1}{1 - \|A^{-1}B\|} \le 2$. The second claim follows similarly. Finally, we note that when $\|B\| \le \sigma_{\min}(A)/2$, it holds that

$$
\|A^{-1}B\| \le \frac{\|B\|}{\sigma_{\min}(A)} \le \frac{1}{2} \quad \text{and} \quad \|BA^{-1}\| \le \frac{\|B\|}{\sigma_{\min}(A)} \le \frac{1}{2},
$$

thus completing the proof. ∎
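To make the mechanics concrete, here is a small numerical verification of the first identity (an illustrative sketch; the explicit choice $Q = -(I + A^{-1}B)^{-1}$ is the one used in the proof, while the random matrices are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n)) + 3 * np.eye(n)   # an invertible A
B = rng.standard_normal((n, n))
Ainv = np.linalg.inv(A)
B *= 0.25 / np.linalg.norm(Ainv @ B, 2)           # rescale so ||A^{-1}B|| = 1/4 <= 1/2

Q = -np.linalg.inv(np.eye(n) + Ainv @ B)          # the Q from the proof
lhs = np.linalg.inv(A + B)
rhs = Ainv + Ainv @ B @ Q @ Ainv

assert np.linalg.norm(Q, 2) <= 2.0                # ||Q|| <= 1/(1 - ||A^{-1}B||) <= 2
assert np.allclose(lhs, rhs)                      # the perturbation identity holds exactly
```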

Next, we focus on the minimum singular value of certain matrices of the form $I + AB$.

Lemma 10.

If $A, B$ are positive definite matrices of the same size, we have

$$
\sigma_{\min}(I + AB) \ge \kappa^{-1/2}(A), \quad \text{where } \kappa(A) := \frac{\|A\|}{\sigma_{\min}(A)}.
$$

Proof.

Writing $I + AB = A^{1/2}(I + A^{1/2} B A^{1/2}) A^{-1/2}$, we obtain

$$
\sigma_{\min}(I + AB) \ge \sigma_{\min}(A^{1/2})\,\sigma_{\min}(A^{-1/2})\,\sigma_{\min}(I + A^{1/2} B A^{1/2}).
$$

The proof is completed by noting that $\sigma_{\min}(A^{1/2}) = \sigma_{\min}^{1/2}(A)$, $\sigma_{\min}(A^{-1/2}) = \|A\|^{-1/2}$, and that $\sigma_{\min}(I + A^{1/2} B A^{1/2}) \ge 1$ since $A^{1/2} B A^{1/2}$ is positive semidefinite. ∎
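The bound can be checked numerically on random positive definite matrices (illustrative only; `random_pd` is a helper defined here, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6

def random_pd(rng, n):
    """A random symmetric positive definite matrix."""
    M = rng.standard_normal((n, n))
    return M @ M.T + 0.1 * np.eye(n)

for _ in range(200):
    A, B = random_pd(rng, n), random_pd(rng, n)
    smin = np.linalg.svd(np.eye(n) + A @ B, compute_uv=False)[-1]
    kappa_A = np.linalg.cond(A, 2)
    # Lemma 10: sigma_min(I + AB) >= kappa(A)^{-1/2}.
    assert smin >= kappa_A ** (-0.5) * (1 - 1e-9)
```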

The last result also concerns the minimum singular value of a matrix of interest.

Lemma 11.

There exists a universal constant $c_{11} > 0$ such that if $\Lambda$ is a positive definite matrix obeying $\|\Lambda\| \le c_{11}$ and $\sigma_{\min}(Y) \le 1/3$, then for any $\eta \le c_{11}$ we have

$$
\sigma_{\min}\Big(\big((1-\eta) I + \eta (Y Y^\top + \Lambda)^{-1}\big) Y\Big) \ge \Big(1 + \frac{\eta}{6}\Big)\,\sigma_{\min}(Y).
\tag{33}
$$

Proof.

Denote $Z := Y Y^\top$ and let $U \Sigma U^\top = Z + \Lambda$ be the spectral decomposition of $Z + \Lambda$. By a coordinate transform one may assume $Z + \Lambda = \Sigma$. It suffices to show

$$
\lambda_{\min}\Big(\big((1-\eta) I + \eta \Sigma^{-1}\big) Z \big((1-\eta) I + \eta \Sigma^{-1}\big)\Big) \ge \Big(1 + \frac{\eta}{6}\Big)^2 \lambda_{\min}(Z).
\tag{34}
$$

For simplicity we denote $\zeta := \lambda_{\min}(Z)$, which is by assumption at most $1/9$. Fix $K = 1/4$, so that $K \ge 2\zeta + 4 c_{11}$ by choosing $c_{11}$ small enough. By permuting coordinates we may further assume that the diagonal matrix $\Sigma$ is of the following form:

$$
\Sigma = \begin{bmatrix} \Sigma_{\le K} & \\ & \Sigma_{>K} \end{bmatrix},
\tag{35}
$$

where $\Sigma_{\le K}, \Sigma_{>K}$ are diagonal matrices such that $\lambda_{\max}(\Sigma_{\le K}) \le K$ and $\lambda_{\min}(\Sigma_{>K}) > K$. It suffices to consider the case where $\Sigma_{>K}$ is not degenerate, because otherwise $\lambda_{\max}(\Sigma) \le K \le 1/2$, and the desired (34) follows as

$$
\lambda_{\min}\Big(\big((1-\eta) I + \eta \Sigma^{-1}\big) Z \big((1-\eta) I + \eta \Sigma^{-1}\big)\Big) \ge \big(1 - \eta + \eta\,\lambda_{\max}^{-1}(\Sigma)\big)^2\,\lambda_{\min}(Z) \ge (1+\eta)^2\,\lambda_{\min}(Z).
$$

For the rest of the proof, we assume the block corresponding to $\Sigma_{>K}$ is not degenerate.

Divide $Z$ into blocks of the same shape as (35):

$$
Z = \begin{bmatrix} Z_0 & A \\ A^\top & Z_1 \end{bmatrix}.
\tag{36}
$$

The purpose of this division is to facilitate the computation of minimum eigenvalues via Schur's complement lemma. In preparation, we make a few simple observations. Since $Z = \Sigma - \Lambda$, we see that $A$, being an off-diagonal submatrix of $Z$, satisfies $\|A\| \le \|\Lambda\| \le c_{11}$, and similarly $\|Z_0 - \Sigma_{\le K}\| \le c_{11}$, $\|Z_1 - \Sigma_{>K}\| \le c_{11}$. In particular, we have

$$
\lambda_{\min}(Z_1) \ge \lambda_{\min}(\Sigma_{>K}) - c_{11} > K - c_{11} \ge 2\zeta + 3 c_{11} > \zeta,
\tag{37}
$$

which implies that $Z_1 - \zeta I$ is positive definite and invertible. Thus by Schur's complement lemma, $Z \succeq \zeta I$ is equivalent to

$$
Z_0 - \zeta I - A (Z_1 - \zeta I)^{-1} A^\top \succeq 0,
\tag{38}
$$

which provides an analytic characterization of the minimum eigenvalue $\zeta$ of $Z$.

The rest of the proof proceeds in the following steps. We first show, again by Schur's complement lemma, that (34) admits a similar analytic characterization. More precisely, denoting $\zeta' := (1 + \frac{\eta}{6})^2 \zeta$, $\Sigma_0 := (1-\eta) I + \eta \Sigma_{\le K}^{-1}$ and $\Sigma_1 := (1-\eta) I + \eta \Sigma_{>K}^{-1}$, (34) is equivalent to

$$
Z_0 - \zeta' \Sigma_0^{-2} - A \big(Z_1 - \zeta' \Sigma_1^{-2}\big)^{-1} A^\top \succeq 0.
\tag{39}
$$

After proving this equivalence, we will show that (39) holds as long as the following sufficient condition holds:

$$
Z_0 - (1 + 3\eta)^{-2} \zeta' I - A (Z_1 - \zeta I)^{-1} A^\top - 10 \eta \zeta\, A (Z_1 - \zeta I)^{-2} A^\top \succeq 0.
\tag{40}
$$

In the last step, we establish this sufficient condition to complete the proof.

Step 1: equivalence between (34) and (39).

First notice that

$$
\big((1-\eta) I + \eta \Sigma^{-1}\big) Z \big((1-\eta) I + \eta \Sigma^{-1}\big)
= \begin{bmatrix} \Sigma_0 Z_0 \Sigma_0 & \Sigma_0 A \Sigma_1 \\ \Sigma_1 A^\top \Sigma_0 & \Sigma_1 Z_1 \Sigma_1 \end{bmatrix}.
\tag{41}
$$

In order to invoke Schur's complement lemma, we need to verify $\Sigma_1 Z_1 \Sigma_1 - \zeta' I \succ 0$. Observe that by definition we have

$$
\Sigma_0 \succeq \big(1 + (K^{-1} - 1)\eta\big) I = (1 + 3\eta) I, \qquad \Sigma_1 \succeq (1-\eta) I.
\tag{42}
$$

Hence

$$
\Sigma_1 Z_1 \Sigma_1 - \zeta' I \succeq (1-\eta)^2 Z_1 - \Big(1 + \frac{\eta}{6}\Big)^2 \zeta I \succ 2(1-\eta)^2 \zeta I - \Big(1 + \frac{\eta}{6}\Big)^2 \zeta I \succ 0,
$$

where in the second inequality we used $Z_1 - 2\zeta I \succ 0$, proved in (37), and in the last inequality we used $\eta \le c_\eta$ with $c_\eta$ sufficiently small. This completes the verification that $\Sigma_1 Z_1 \Sigma_1 - \zeta' I \succ 0$. Now, invoking Schur's complement lemma yields that (34) is equivalent to

$$
\Sigma_0 Z_0 \Sigma_0 - \zeta' I - \Sigma_0 A \Sigma_1 \big(\Sigma_1 Z_1 \Sigma_1 - \zeta' I\big)^{-1} \Sigma_1 A^\top \Sigma_0 \succeq 0,
$$

which simplifies easily to (39), as claimed.

Step 2: establishing (40) as a sufficient condition for (39).

By (42), it follows that

$$
\big(Z_1 - \zeta' \Sigma_1^{-2}\big)^{-1} \preceq \big(Z_1 - (1-\eta)^{-2} \zeta' I\big)^{-1} = \big(Z_1 - \zeta I - ((1-\eta)^{-2} \zeta' - \zeta) I\big)^{-1},
\tag{43}
$$

where we used the well-known fact that $A \preceq B$ implies $B^{-1} \preceq A^{-1}$ for positive definite matrices $A$ and $B$ (cf. (bhatia1997matrix, Proposition V.1.6)). We aim to apply Lemma 9 to control the above term, treating $((1-\eta)^{-2} \zeta' - \zeta) I$ as a perturbation. For this purpose we need to verify

$$
\big|(1-\eta)^{-2} \zeta' - \zeta\big| \le \frac{1}{2} \lambda_{\min}(Z_1 - \zeta I).
\tag{44}
$$

Given $\eta \le c_\eta$ with sufficiently small $c_\eta$, we have $(1-\eta)^{-2} \le 1 + 3\eta$, $(1 + \frac{\eta}{6})^2 \le 1 + \eta$, and $(1 + 3\eta)(1 + \eta) \le 1 + 5\eta$, thus

$$
0 \le (1-\eta)^{-2} \Big(1 + \frac{\eta}{6}\Big)^2 \zeta - \zeta = (1-\eta)^{-2} \zeta' - \zeta \le (1 + 3\eta)(1 + \eta)\zeta - \zeta \le 5\eta\zeta < \zeta/2,
$$

where the last inequality follows from $c_\eta \le 1/10$. On the other hand, invoking (37), we obtain

$$
\frac{1}{2}\zeta \le \frac{1}{2}\big(\lambda_{\min}(Z_1) - \zeta\big) = \frac{1}{2}\lambda_{\min}(Z_1 - \zeta I),
$$

which verifies (44). Thus we may apply Lemma 9 to show

$$
\Big\|(Z_1 - \zeta I)\Big((Z_1 - \zeta I)^{-1} - \big(Z_1 - \zeta I - ((1-\eta)^{-2}\zeta' - \zeta) I\big)^{-1}\Big)(Z_1 - \zeta I)\Big\| \le 2\big|(1-\eta)^{-2}\zeta' - \zeta\big| \le 10\eta\zeta,
$$

therefore

$$
\big(Z_1 - \zeta I - ((1-\eta)^{-2}\zeta' - \zeta) I\big)^{-1} \preceq (Z_1 - \zeta I)^{-1} + 10\eta\zeta\, (Z_1 - \zeta I)^{-2}.
$$

Together with (43), this implies

$$
\big(Z_1 - \zeta' \Sigma_1^{-2}\big)^{-1} \preceq (Z_1 - \zeta I)^{-1} + 10\eta\zeta\, (Z_1 - \zeta I)^{-2}.
\tag{45}
$$

Combining (42) and (45), we see that (40) is a sufficient condition for (39) to hold.

Step 3: establishing (40).

It is clear that (40) is implied by

$$
\zeta I - (1 + 3\eta)^{-2} \zeta' I - 10\eta\zeta\, A (Z_1 - \zeta I)^{-2} A^\top \succeq 0,
\tag{46}
$$

by leveraging the relation $Z_0 \succeq \zeta I + A (Z_1 - \zeta I)^{-1} A^\top$ from (38).

Hence, it boils down to proving (46). Recalling $\|A\| \le c_{11}$, and from (37), we know $\lambda_{\min}(Z_1 - \zeta I) \ge K - c_{11} - \zeta \ge \zeta + 3 c_{11}$. Thus

$$
\big\|A (Z_1 - \zeta I)^{-2} A^\top\big\| \le \|A\|^2 \big\|(Z_1 - \zeta I)^{-2}\big\| \le c_{11}^2 / (\zeta + 3 c_{11})^2 \le 1/9.
$$

Therefore, to prove (46) it suffices to show

$$
\zeta - (1 + 3\eta)^{-2} \zeta' \ge \frac{10}{9}\eta\zeta.
\tag{47}
$$

It is easy to verify that this inequality holds for our choice $\zeta' = (1 + \frac{\eta}{6})^2 \zeta$. In fact, given $\eta \le c_\eta$ for sufficiently small $c_\eta$, we have $(1 + 3\eta)^{-2} \le 1 - 4\eta$ and $(1 + \frac{\eta}{6})^2 \le 1 + \eta$. These together yield

$$
\zeta - (1 + 3\eta)^{-2} \Big(1 + \frac{\eta}{6}\Big)^2 \zeta \ge \zeta - (1 - 4\eta)(1 + \eta)\zeta = 3\eta\zeta + 4\eta^2\zeta \ge 3\eta\zeta \ge \frac{10}{9}\eta\zeta,
$$

establishing (47) as desired. ∎
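For intuition, the conclusion (33) can be checked numerically on a modest instance (all singular values of $Y$ kept below $1/3$, a tiny $\Lambda$, and a small step size $\eta$; this is only an illustrative check, since the universal constant $c_{11}$ is not made explicit in the lemma):

```python
import numpy as np

rng = np.random.default_rng(3)
n, eta = 5, 0.01

for _ in range(50):
    # Y with all singular values in [0.01, 0.3], so sigma_min(Y) <= 1/3.
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    Y = U @ np.diag(rng.uniform(0.01, 0.3, n)) @ V.T
    # A small positive definite Lambda with ||Lambda|| about 1e-3.
    G = rng.standard_normal((n, n))
    Lam = 1e-3 * (G @ G.T) / np.linalg.norm(G @ G.T, 2) + 1e-6 * np.eye(n)

    # Damped preconditioned map ((1 - eta) I + eta (Y Y^T + Lambda)^{-1}) Y.
    P = (1 - eta) * np.eye(n) + eta * np.linalg.inv(Y @ Y.T + Lam)
    smin = np.linalg.svd(P @ Y, compute_uv=False)[-1]
    assert smin >= (1 + eta / 6) * np.linalg.svd(Y, compute_uv=False)[-1]
```

The check illustrates the mechanism: the preconditioner inflates the small singular directions, so the minimum singular value grows by at least the factor $1 + \eta/6$.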

Appendix B Decompositions of key terms

In this section, we first present a useful bound on a key error quantity

$$
\Delta_t := (\mathcal{I} - \mathcal{A}^*\mathcal{A})(X_t X_t^\top - M_\star),
\tag{48}
$$

where $X_t$ is the iterate of ScaledGD($\lambda$) given in (7).

Lemma 12.

Suppose π’œ ​ ( β‹… ) satisfies Assumption 1. For any 𝑑 β‰₯ 0 such that (22) holds, we have

β€– Ξ” 𝑑 β€– ≀ 8 ​ 𝛿 ​ ( β€– 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 β€– π–₯ + β€– 𝑆 ~ 𝑑 β€– ​ β€– 𝑁 ~ 𝑑 β€– π–₯ + 𝑛 ​ β€– 𝑂 ~ 𝑑 β€– 2 ) .

(49)

In particular, there exists some constant 𝑐 12 ≲ 𝑐 𝛿 / 𝑐 πœ† such that

β€– Ξ” 𝑑 β€– ≀ 16 ​ ( 𝐢 3 . π‘Ž + 1 ) 2 ​ 𝑐 𝛿 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ​ β€– 𝑋 ⋆ β€– 2 ≀ 𝑐 12 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ​ β€– 𝑋 ⋆ β€– 2 .

(50) Proof.

The decomposition (19) in Proposition 1 yields

$$
X_t X_t^\top = U_\star \tilde{S}_t \tilde{S}_t^\top U_\star^\top + U_\star \tilde{S}_t \tilde{N}_t^\top U_{\star,\perp}^\top + U_{\star,\perp} \tilde{N}_t \tilde{S}_t^\top U_\star^\top + U_{\star,\perp} \tilde{N}_t \tilde{N}_t^\top U_{\star,\perp}^\top + U_{\star,\perp} \tilde{O}_t \tilde{O}_t^\top U_{\star,\perp}^\top.
$$

Since $M_\star = U_\star \Sigma_\star^2 U_\star^\top$, we have

$$
X_t X_t^\top - M_\star
= \underbrace{U_\star (\tilde{S}_t \tilde{S}_t^\top - \Sigma_\star^2) U_\star^\top}_{=:\, T_1}
+ \underbrace{U_\star \tilde{S}_t \tilde{N}_t^\top U_{\star,\perp}^\top + U_{\star,\perp} \tilde{N}_t \tilde{S}_t^\top U_\star^\top}_{=:\, T_2}
+ \underbrace{U_{\star,\perp} \tilde{N}_t \tilde{N}_t^\top U_{\star,\perp}^\top}_{=:\, T_3}
+ \underbrace{U_{\star,\perp} \tilde{O}_t \tilde{O}_t^\top U_{\star,\perp}^\top}_{=:\, T_4}.
\tag{51}
$$

Note that $U_\star \in \mathbb{R}^{n \times r_\star}$ is of rank $r_\star$; thus $T_1$ has rank at most $r_\star$ and $T_2$ has rank at most $2 r_\star$. Similarly, since $\tilde{N}_t = N_t V_t$ while $V_t \in \mathbb{R}^{r \times r_\star}$ is of rank $r_\star$, $T_3$ has rank at most $r_\star$. It is also trivial that $T_4$, being an $n \times n$ matrix, has rank at most $n$. Invoking Lemma 8, we obtain

$$
\begin{aligned}
\|(\mathcal{I} - \mathcal{A}^*\mathcal{A})(T_1)\| &\le \sqrt{2}\,\delta\,\big\|U_\star (\tilde{S}_t \tilde{S}_t^\top - \Sigma_\star^2) U_\star^\top\big\|_{\mathsf{F}} \le \sqrt{2}\,\delta\,\big\|\tilde{S}_t \tilde{S}_t^\top - \Sigma_\star^2\big\|_{\mathsf{F}}, \\
\|(\mathcal{I} - \mathcal{A}^*\mathcal{A})(T_2)\| &\le 2\sqrt{3}\,\delta\,\big\|U_\star \tilde{S}_t \tilde{N}_t^\top U_{\star,\perp}^\top + U_{\star,\perp} \tilde{N}_t \tilde{S}_t^\top U_\star^\top\big\|_{\mathsf{F}} \le 4\sqrt{2}\,\delta\,\|\tilde{S}_t\| \|\tilde{N}_t\|_{\mathsf{F}}, \\
\|(\mathcal{I} - \mathcal{A}^*\mathcal{A})(T_3)\| &\le \sqrt{2}\,\delta\,\big\|U_{\star,\perp} \tilde{N}_t \tilde{N}_t^\top U_{\star,\perp}^\top\big\|_{\mathsf{F}} \le \sqrt{2}\,\delta\,\big\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\big\| \|\tilde{S}_t\| \|\Sigma_\star^{-1}\| \|\tilde{N}_t\|_{\mathsf{F}} \le \delta\,\|\tilde{S}_t\| \|\tilde{N}_t\|_{\mathsf{F}}, \\
\|(\mathcal{I} - \mathcal{A}^*\mathcal{A})(T_4)\| &\le 2\delta\sqrt{n}\,\big\|U_{\star,\perp} \tilde{O}_t \tilde{O}_t^\top U_{\star,\perp}^\top\big\| \le 2\delta\sqrt{n}\,\|\tilde{O}_t\|^2,
\end{aligned}
$$

where the third line follows from $\|\Sigma_\star^{-1}\| = \kappa \|X_\star\|^{-1}$ and from (22c), in view of the fact that $C_\delta$ is sufficiently large and $c_3$ is sufficiently small. The conclusion (49) follows from summing up the above inequalities.

For the remaining part of the lemma, note that the following inequalities, which bound the individual terms of (49), can be inferred from (22): namely,

$$
\big\|\tilde{S}_t \tilde{S}_t^\top - \Sigma_\star^2\big\|_{\mathsf{F}} \le \sqrt{2 r_\star}\,\big\|\tilde{S}_t \tilde{S}_t^\top - \Sigma_\star^2\big\| \le \sqrt{2 r_\star}\,(C_{3.a}^2 \kappa^2 + 1) \|X_\star\|^2
$$

by (22d), and

$$
\begin{aligned}
\|\tilde{S}_t\| \|\tilde{N}_t\|_{\mathsf{F}}
&\le \sqrt{r_\star}\,\|\tilde{S}_t\| \|\tilde{N}_t\| \\
&\le \sqrt{r_\star}\,(C_{3.a} \kappa^3 \|X_\star\|) \cdot \big\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\big\| \cdot \|\tilde{S}_t\| \cdot \|\Sigma_\star^{-1}\| \\
&\le \sqrt{r_\star}\,(C_{3.a} \kappa^3 \|X_\star\|) \cdot (c_3 \kappa^{-C_\delta/2} \|X_\star\|) \cdot (C_{3.a} \kappa^3 \|X_\star\|) \cdot \sigma_{\min}^{-1}(\Sigma_\star) \\
&= \sqrt{r_\star}\,c_3 C_{3.a}^2 \kappa^6 \|X_\star\|^2 \kappa^{-C_\delta/2} \\
&\le \sqrt{r_\star}\,C_{3.a}^2 \|X_\star\|^2,
\end{aligned}
$$

where the first inequality uses the fact that $\tilde{N}_t = N_t V_t$ contains a rank-$r_\star$ factor $V_t$, hence has rank at most $r_\star$; the second line follows from (22d), the third line follows from (22c) and (22d), and the last line follows from choosing $c_\delta$ sufficiently small such that $c_3 \le 1$ (which is possible since $c_3 \lesssim c_\delta / c_\lambda$) and from choosing $C_\delta$ such that $\kappa^6 \kappa^{-C_\delta/2} \le 1$. Finally, from (22b) and its corollary (24), we have

$$
2\sqrt{n}\,\|\tilde{O}_t\|^2 \le 2\sqrt{n}\,\alpha^{3/2} \|X_\star\|^{1/2} \le \|X_\star\|^2,
$$

since from (12c) it is easy to show that $\alpha \le (2\sqrt{n})^{-2/3} \|X_\star\|$.

Combining these inequalities and (49) yields

$$
\|\Delta_t\| \le 8\delta\sqrt{r_\star}\,\big(\sqrt{2}\,C_{3.a}^2 \kappa^2 + 1 + C_{3.a}^2 + 1\big) \|X_\star\|^2 \le 16\delta\sqrt{r_\star}\,\kappa^2 (C_{3.a}^2 + 1) \|X_\star\|^2.
\tag{52}
$$

Recalling that by (10) we have $\delta \sqrt{r_\star}\,\kappa^2 \le c_\delta \kappa^{-C_\delta + 2} \le c_\delta \kappa^{-2 C_\delta/3}$ as long as $C_\delta \ge 6$, we obtain the desired conclusion. We may choose $c_{12} = 32 (C_{3.a} + 1)^2 c_\delta$; the bound $c_{12} \lesssim c_\delta / c_\lambda$ then follows from $C_{3.a} \lesssim c_\lambda^{-1/2}$. ∎

We next present several decompositions of the signal term $S_{t+1}$ and the noise term $N_{t+1}$, which are extremely useful in later developments.

Lemma 13.

For any $t$ such that $\tilde{S}_t$ is invertible and (22) holds, we have

$$
S_{t+1} = \Big((1-\eta) I + \eta (\Sigma_\star^2 + \lambda I + E_t^a)(\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1}\Big) \tilde{S}_t V_t^\top + \eta E_t^b,
\tag{53a}
$$

$$
N_{t+1} = \tilde{N}_t \tilde{S}_t^{-1} \big((1-\eta) \tilde{S}_t \tilde{S}_t^\top + \lambda I + \eta E_t^c\big)(\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t V_t^\top + \eta E_t^e (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t V_t^\top + \tilde{O}_t V_{t,\perp}^\top + \eta E_t^d,
\tag{53b}
$$

where the error terms satisfy

$$
|\!|\!| E_t^a |\!|\!| \le 2 c_3 \kappa^{-4} \|X_\star\| \cdot |\!|\!| \tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star |\!|\!| + 2 |\!|\!| U_\star^\top \Delta_t |\!|\!|,
\tag{54a}
$$

$$
|\!|\!| E_t^b |\!|\!| \le \bigg(\frac{\|\tilde{O}_t\|}{\sigma_{\min}(\tilde{S}_t)}\bigg)^{3/4} \sigma_{\min}(\tilde{S}_t) \le \frac{1}{20} \kappa^{-10} \sigma_{\min}(\tilde{S}_t),
\tag{54b}
$$

$$
|\!|\!| E_t^c |\!|\!| \le \kappa^{-6} \|X_\star\| \cdot |\!|\!| \tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star |\!|\!|,
\tag{54c}
$$

$$
|\!|\!| E_t^d |\!|\!| \le \bigg(\frac{\|\tilde{O}_t\|}{\sigma_{\min}(\tilde{S}_t)}\bigg)^{3/4} \sigma_{\min}(\tilde{S}_t),
\tag{54d}
$$

$$
|\!|\!| E_t^e |\!|\!| \le 2 |\!|\!| U_\star^\top \Delta_t |\!|\!| + c_{12} \kappa^{-6} \|X_\star\| \cdot |\!|\!| \tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star |\!|\!|.
\tag{54e}
$$

Moreover, we have

$$
\|E_t^b\| \le \frac{1}{24 C_{\max} \kappa} \|\tilde{O}_t\|,
\tag{54f}
$$

$$
\|E_t^d\| \le \frac{1}{24 C_{\max} \kappa} \|\tilde{O}_t\|.
\tag{54g}
$$

Here, $|\!|\!| \cdot |\!|\!|$ can be either the Frobenius norm or the spectral norm.

To proceed, we need the approximate update equations of the rotated signal term $\tilde{S}_{t+1}$ and the rotated misalignment term $\tilde{N}_{t+1} \tilde{S}_{t+1}^{-1}$ later in the proof. Since directly analyzing the evolution of these two terms seems challenging, we resort to two surrogate matrices, $S_{t+1} V_t + S_{t+1} V_{t,\perp} Q$ and $(N_{t+1} V_t + N_{t+1} V_{t,\perp} Q)(S_{t+1} V_t + S_{t+1} V_{t,\perp} Q)^{-1}$, as documented in the following two lemmas.

Lemma 14.

For any $t$ such that $\tilde{S}_t$ is invertible and (22) holds, and any matrix $Q \in \mathbb{R}^{(r - r_\star) \times r_\star}$ with $\|Q\| \le 2$, we have

$$
S_{t+1} V_t + S_{t+1} V_{t,\perp} Q = (I + \eta E_t^{14}) \Big((1-\eta) I + \eta (\Sigma_\star^2 + \lambda I)(\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1}\Big) \tilde{S}_t,
\tag{55}
$$

where $E_t^{14} \in \mathbb{R}^{r_\star \times r_\star}$ is a matrix (depending on $Q$) satisfying

$$
\|E_t^{14}\| \le \frac{1}{200 (C_{3.a} + 1)^4 \kappa^6}.
$$

Here, $C_{3.a} > 0$ is given in Lemma 3.

Lemma 15.

For any $t$ such that $\tilde{S}_t$ is invertible and (22) holds, and any matrix $Q \in \mathbb{R}^{(r - r_\star) \times r_\star}$ with $\|Q\| \le 2$, we have

$$
\begin{aligned}
&(N_{t+1} V_t + N_{t+1} V_{t,\perp} Q)(S_{t+1} V_t + S_{t+1} V_{t,\perp} Q)^{-1} \\
&\quad = \tilde{N}_t \tilde{S}_t^{-1} (I + \eta E_t^{15.a}) \big((1-\eta) \tilde{S}_t \tilde{S}_t^\top + \lambda I\big) \big((1-\eta) \tilde{S}_t \tilde{S}_t^\top + \lambda I + \eta \Sigma_\star^2\big)^{-1} (I + \eta E_t^{14})^{-1} + \eta E_t^{15.b},
\end{aligned}
$$

where $E_t^{15.a}, E_t^{15.b}$ are matrices (depending on $Q$) satisfying

$$
\|E_t^{15.a}\| \le \frac{1}{200 (C_{3.a} + 1)^4 \kappa^6},
\tag{56a}
$$

$$
|\!|\!| E_t^{15.b} |\!|\!| \le 400 c_\lambda^{-1} \kappa^6 \|X_\star\|^{-2} |\!|\!| U_\star^\top \Delta_t |\!|\!| + \frac{1}{64 (C_{3.a} + 1)^2 \kappa^5 \|X_\star\|} |\!|\!| \tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star |\!|\!| + \frac{1}{64} \bigg(\frac{\|\tilde{O}_t\|}{\sigma_{\min}(\tilde{S}_t)}\bigg)^{2/3}.
\tag{56b}
$$

Here, $|\!|\!| \cdot |\!|\!|$ can be either the Frobenius norm or the spectral norm, and $C_{3.a} > 0$ is given in Lemma 3.

B.1 Proof of Lemma 13

We split the proof into three steps: (1) provide several useful approximation results for the matrix inverses, utilizing the facts that $\|\tilde{O}_t\|$ and $\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\|$ are small (as shown by Lemma 3); (2) prove the claims (53a), (54a), (54b), and (54f) associated with the signal term $S_{t+1}$; (3) prove the claims (53b), (54c), (54d), (54e), and (54g) associated with the noise term $N_{t+1}$. Note that our approximation results in step (1) involve choices of certain matrices $\{Q_i\}$ with small spectral norms, whose choices may differ from lemma to lemma for simplicity of presentation.

B.1.1 Step 1: preliminaries

We know from (22) that the overparametrization error $\tilde{O}_t$ is negligible compared to the signals $\tilde{S}_t$ and $\sigma_{\min}(X_\star)$. This, combined with the decomposition (19), reveals the desired approximation $(X_t^\top X_t + \lambda I)^{-1} \approx \big(V_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t) V_t^\top + \lambda I\big)^{-1}$. This approximation is formalized in the lemma below.

Lemma 16.

If πœ† β‰₯ 4 ​ ( β€– 𝑂 ~ 𝑑 β€– 2 ∨ 2 ​ β€– 𝑁 ~ 𝑑 β€– ​ β€– 𝑂 ~ 𝑑 β€– ) for some 𝑑 , then

( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1

( 𝑉 𝑑 ​ ( 𝑆 ~ 𝑑 ⊀ ​ 𝑆 ~ 𝑑 + 𝑁 ~ 𝑑 ⊀ ​ 𝑁 ~ 𝑑 ) ​ 𝑉 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1

+ ( 𝑉 𝑑 ​ ( 𝑆 ~ 𝑑 ⊀ ​ 𝑆 ~ 𝑑 + 𝑁 ~ 𝑑 ⊀ ​ 𝑁 ~ 𝑑 ) ​ 𝑉 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ​ 𝐸 𝑑 16 . π‘Ž ​ ( 𝑉 𝑑 ​ ( 𝑆 ~ 𝑑 ⊀ ​ 𝑆 ~ 𝑑 + 𝑁 ~ 𝑑 ⊀ ​ 𝑁 ~ 𝑑 ) ​ 𝑉 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1

( 𝑉 𝑑 ​ ( 𝑆 ~ 𝑑 ⊀ ​ 𝑆 ~ 𝑑 + 𝑁 ~ 𝑑 ⊀ ​ 𝑁 ~ 𝑑 ) ​ 𝑉 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ​ ( 𝐼 + 𝐸 𝑑 16 . 𝑏 )

(57)

where the error terms 𝐸 𝑑 16 . π‘Ž , 𝐸 𝑑 16 . 𝑏 can be expressed as

𝐸 𝑑 16 . π‘Ž

( 𝑉 𝑑 , βŸ‚ ​ 𝑂 ~ 𝑑 ⊀ ​ 𝑂 ~ 𝑑 ​ 𝑉 𝑑 , βŸ‚ ⊀ + 𝑉 𝑑 ​ 𝑁 ~ 𝑑 ⊀ ​ 𝑂 ~ 𝑑 ​ 𝑉 𝑑 , βŸ‚ ⊀ + 𝑉 𝑑 , βŸ‚ ​ 𝑂 ~ 𝑑 ⊀ ​ 𝑁 ~ 𝑑 ​ 𝑉 𝑑 ⊀ ) ​ 𝑄 1 ,

(58a)

𝐸 𝑑 16 . 𝑏

πœ† βˆ’ 1 ​ 𝐸 𝑑 16 . π‘Ž ​ 𝑄 2 ,

(58b)

for some matrices 𝑄 1 , 𝑄 2 such that max ⁑ { β€– 𝑄 1 β€– , β€– 𝑄 2 β€– } ≀ 2 .

Proof.

Expanding $X_t^\top X_t$ according to (19), we have

$$
X_t^\top X_t = V_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t) V_t^\top + V_{t,\perp} \tilde{O}_t^\top \tilde{O}_t V_{t,\perp}^\top + V_t \tilde{N}_t^\top \tilde{O}_t V_{t,\perp}^\top + V_{t,\perp} \tilde{O}_t^\top \tilde{N}_t V_t^\top.
$$

The conclusion readily follows from Lemma 9 by setting therein $A = V_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t) V_t^\top + \lambda I$ and $B = V_{t,\perp} \tilde{O}_t^\top \tilde{O}_t V_{t,\perp}^\top + V_t \tilde{N}_t^\top \tilde{O}_t V_{t,\perp}^\top + V_{t,\perp} \tilde{O}_t^\top \tilde{N}_t V_t^\top$, where the condition $\|A^{-1} B\| \le 1/2$ is satisfied since

$$
\|A^{-1} B\| \le \sigma_{\min}(A)^{-1} \|B\| \le \lambda^{-1} \big(\|\tilde{O}_t\|^2 + 2 \|\tilde{O}_t\| \|\tilde{N}_t\|\big) \le 1/2.
$$

∎

Moreover, the dominant term on the right-hand side of (57) can be equivalently written as

$$
\big(V_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t) V_t^\top + \lambda I\big)^{-1}
= \big(V_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I) V_t^\top + \lambda V_{t,\perp} V_{t,\perp}^\top\big)^{-1}
= V_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} V_t^\top + \lambda^{-1} V_{t,\perp} V_{t,\perp}^\top.
\tag{59}
$$

When the misalignment error $\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\|$ is small, we expect $(\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} \approx (\tilde{S}_t^\top \tilde{S}_t + \lambda I)^{-1}$. This is formalized in the following lemma, which establishes $\big(\tilde{S}_t \tilde{S}_t^\top + \tilde{S}_t \tilde{N}_t^\top \tilde{N}_t \tilde{S}_t^{-1} + \lambda I\big)^{-1} \approx (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1}$, due to the following approximation:

$$
(\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1}
= \tilde{S}_t^{-1} \big(\tilde{S}_t \tilde{S}_t^\top + \tilde{S}_t \tilde{N}_t^\top \tilde{N}_t \tilde{S}_t^{-1} + \lambda I\big)^{-1} \tilde{S}_t
\approx \tilde{S}_t^{-1} (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t
= (\tilde{S}_t^\top \tilde{S}_t + \lambda I)^{-1}.
$$
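Both exact algebraic identities above, (59) and the conjugation by $\tilde{S}_t$, are easy to confirm numerically (an illustrative check with random stand-ins for $\tilde{S}_t$, $\tilde{N}_t$, and $V_t$; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
r, r_star, lam = 6, 3, 0.2

S = rng.standard_normal((r_star, r_star)) + 2 * np.eye(r_star)  # invertible, plays S~_t
N = 0.1 * rng.standard_normal((r_star, r_star))                 # small, plays N~_t
Sinv = np.linalg.inv(S)
I_rs, I_r = np.eye(r_star), np.eye(r)

# Conjugation: (S^T S + N^T N + lam I)^{-1} = S^{-1} (S S^T + S N^T N S^{-1} + lam I)^{-1} S.
lhs = np.linalg.inv(S.T @ S + N.T @ N + lam * I_rs)
rhs = Sinv @ np.linalg.inv(S @ S.T + S @ N.T @ N @ Sinv + lam * I_rs) @ S
assert np.allclose(lhs, rhs)

# Identity (59): inverting through a partial orthonormal basis V (and its complement).
Vfull, _ = np.linalg.qr(rng.standard_normal((r, r)))
V, Vperp = Vfull[:, :r_star], Vfull[:, r_star:]
Mid = S.T @ S + N.T @ N
lhs59 = np.linalg.inv(V @ Mid @ V.T + lam * I_r)
rhs59 = V @ np.linalg.inv(Mid + lam * I_rs) @ V.T + (Vperp @ Vperp.T) / lam
assert np.allclose(lhs59, rhs59)
```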

Lemma 17.

If $\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\| \le \sigma_{\min}(X_\star)/16$ for some $t$, then

$$
\big(\tilde{S}_t \tilde{S}_t^\top + \tilde{S}_t \tilde{N}_t^\top \tilde{N}_t \tilde{S}_t^{-1} + \lambda I\big)^{-1} = (I + E_t^{17}) (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1},
\tag{60}
$$

where the error term $E_t^{17}$ is a matrix defined as

$$
E_t^{17} = \kappa^2 \|X_\star\|^{-2} \big\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\big\|\, Q_1 (\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star) Q_2,
\tag{61}
$$

where $Q_1, Q_2$ are matrices of appropriate dimensions satisfying $\|Q_1\| \le 1$, $\|Q_2\| \le 2$. In particular, we have

$$
|\!|\!| E_t^{17} |\!|\!| \le 2 \kappa^2 \|X_\star\|^{-2} \big\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\big\| \cdot |\!|\!| \tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star |\!|\!|,
\tag{62}
$$

where $|\!|\!| \cdot |\!|\!|$ can be either the operator norm or the Frobenius norm.

Proof.

In order to apply Lemma 9, set $A = \tilde{S}_t \tilde{S}_t^\top + \lambda I$ and $B = \tilde{S}_t \tilde{N}_t^\top \tilde{N}_t \tilde{S}_t^{-1}$; it is straightforward to verify that

$$
\|A^{-1} B\| = \big\|(\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t \tilde{N}_t^\top \tilde{N}_t \tilde{S}_t^{-1}\big\| \le \big\|\tilde{N}_t \tilde{S}_t^{-1}\big\|^2 \le \big\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\big\|^2 \|\Sigma_\star^{-1}\|^2 \le (1/16)^2,
$$

where we use the obvious fact that $\|(\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t \tilde{S}_t^\top\| \le 1$. Applying Lemma 9, we obtain

$$
\begin{aligned}
&\big(\tilde{S}_t \tilde{S}_t^\top + \tilde{S}_t \tilde{N}_t^\top \tilde{N}_t \tilde{S}_t^{-1} + \lambda I\big)^{-1} - (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \\
&\quad = (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t \tilde{N}_t^\top \tilde{N}_t \tilde{S}_t^{-1} Q (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \\
&\quad = (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t \tilde{S}_t^\top \Sigma_\star^{-1} (\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star)^\top (\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star) \Sigma_\star^{-1} Q (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1}
\end{aligned}
$$

for some matrix $Q$ with $\|Q\| \le 2$. Since one may further write this difference as

$$
\|\Sigma_\star^{-1}\|^2 \big\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\big\| \cdot (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t \tilde{S}_t^\top \frac{\Sigma_\star^{-1}}{\|\Sigma_\star^{-1}\|} \frac{(\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star)^\top}{\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\|} (\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star) \frac{\Sigma_\star^{-1}}{\|\Sigma_\star^{-1}\|} Q (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1},
$$

the conclusion follows by setting $E_t^{17}$ as in (61) with

$$
Q_1 = (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t \tilde{S}_t^\top \frac{\Sigma_\star^{-1}}{\|\Sigma_\star^{-1}\|} \frac{(\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star)^\top}{\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\|}, \qquad
Q_2 = \frac{\Sigma_\star^{-1}}{\|\Sigma_\star^{-1}\|} Q.
$$

The last inequality (62) is then a direct consequence of (61). ∎

B.1.2 Step 2: a key recursion

Recalling the definition of $\Delta_t$ in (48), we can rewrite the update equation (7) as

$$
X_{t+1} = X_t - \eta (X_t X_t^\top - M_\star) X_t (X_t^\top X_t + \lambda I)^{-1} + \eta \Delta_t X_t (X_t^\top X_t + \lambda I)^{-1}.
\tag{63}
$$

Multiplying both sides of (63) by $U_\star^\top$ on the left, we obtain

$$
\begin{aligned}
S_{t+1} &= S_t - \eta S_t X_t^\top X_t (X_t^\top X_t + \lambda I)^{-1} + \eta \Sigma_\star^2 S_t (X_t^\top X_t + \lambda I)^{-1} + \eta U_\star^\top \Delta_t X_t (X_t^\top X_t + \lambda I)^{-1} \\
&= (1-\eta) S_t + \eta (\Sigma_\star^2 + \lambda I + U_\star^\top \Delta_t U_\star) S_t (X_t^\top X_t + \lambda I)^{-1} + \eta U_\star^\top \Delta_t U_{\star,\perp} N_t (X_t^\top X_t + \lambda I)^{-1}.
\end{aligned}
\tag{64}
$$

Similarly, multiplying both sides of (63) by $U_{\star,\perp}^\top$, we obtain

$$
\begin{aligned}
N_{t+1} &= N_t \big(I - \eta X_t^\top X_t (X_t^\top X_t + \lambda I)^{-1}\big) + \eta U_{\star,\perp}^\top \Delta_t X_t (X_t^\top X_t + \lambda I)^{-1} \\
&= (1-\eta) N_t + \eta \lambda N_t (X_t^\top X_t + \lambda I)^{-1} + \eta U_{\star,\perp}^\top \Delta_t U_\star S_t (X_t^\top X_t + \lambda I)^{-1} + \eta U_{\star,\perp}^\top \Delta_t U_{\star,\perp} N_t (X_t^\top X_t + \lambda I)^{-1}.
\end{aligned}
\tag{65}
$$
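The algebraic rewrites (64) and (65) hold exactly for any symmetric $\Delta_t$, which can be confirmed numerically (a standalone sketch with random stand-ins for the iterate, the ground truth, and $\Delta_t$):

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, r_star, eta, lam = 8, 4, 2, 0.3, 0.5

# Ground truth M_* = U_* Sigma_*^2 U_*^T and a generic iterate X_t.
Ufull, _ = np.linalg.qr(rng.standard_normal((n, n)))
U, Uperp = Ufull[:, :r_star], Ufull[:, r_star:]
Sig2 = np.diag(rng.uniform(1.0, 2.0, r_star) ** 2)
M = U @ Sig2 @ U.T
X = rng.standard_normal((n, r))
D = rng.standard_normal((n, n)); D = D + D.T       # plays the role of Delta_t
G = np.linalg.inv(X.T @ X + lam * np.eye(r))       # preconditioner (X^T X + lam I)^{-1}

# One step in the rewritten form (63).
Xnext = X - eta * (X @ X.T - M) @ X @ G + eta * D @ X @ G

S, N = U.T @ X, Uperp.T @ X
# Identity (64) for the signal part S_{t+1} = U_*^T X_{t+1}.
rhs_S = ((1 - eta) * S
         + eta * (Sig2 + lam * np.eye(r_star) + U.T @ D @ U) @ S @ G
         + eta * U.T @ D @ Uperp @ N @ G)
assert np.allclose(U.T @ Xnext, rhs_S)
# Identity (65) for the complement part N_{t+1} = U_{*,perp}^T X_{t+1}.
rhs_N = ((1 - eta) * N + eta * lam * N @ G
         + eta * Uperp.T @ D @ U @ S @ G
         + eta * Uperp.T @ D @ Uperp @ N @ G)
assert np.allclose(Uperp.T @ Xnext, rhs_N)
```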

These expressions motivate the need to study the terms $S_t (X_t^\top X_t + \lambda I)^{-1}$ and $N_t (X_t^\top X_t + \lambda I)^{-1}$, which we formalize in the following lemma.

Lemma 18.

Under the same setting as Lemma 13, we have

$$
S_t (X_t^\top X_t + \lambda I)^{-1} = (I + E_t^{17}) (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t V_t^\top + E_t^{18.a},
\tag{66a}
$$

$$
N_t (X_t^\top X_t + \lambda I)^{-1} = \tilde{N}_t \tilde{S}_t^{-1} (I + E_t^{17}) (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t V_t^\top + \lambda^{-1} \tilde{O}_t V_{t,\perp}^\top + E_t^{18.b},
\tag{66b}
$$

where $E_t^{17}$ is given in (61), and the error terms $E_t^{18.a}, E_t^{18.b}$ can be expressed as

$$
E_t^{18.a} = \kappa \lambda^{-1} \|X_\star\|^{-1} \|\tilde{O}_t\|\, Q_1 (\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star)^\top Q_2,
\tag{67a}
$$

$$
E_t^{18.b} = \big(\tilde{N}_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} V_t^\top + \lambda^{-1} \tilde{O}_t V_{t,\perp}^\top\big) E_t^{16.b} = \lambda^{-1} \big(\|\tilde{N}_t\| Q_3 + \|\tilde{O}_t\| Q_4\big) E_t^{16.b},
\tag{67b}
$$

for some matrices $\{Q_i\}_{1 \le i \le 4}$ with spectral norm bounded by $2$, and $E_t^{16.b}$ defined in (58b).

Proof.

To begin, combining Lemma 16 and the discussion thereafter (cf. (57)–(59)) with the fact that $\tilde{S}_t = S_t V_t$, we have, for some matrix $Q$ with $\|Q\| \le 2$,

$$
\begin{aligned}
S_t (X_t^\top X_t + \lambda I)^{-1}
&= \tilde{S}_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} V_t^\top (I + E_t^{16.b}) \\
&= \tilde{S}_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} V_t^\top + \tilde{S}_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} \lambda^{-1} \tilde{N}_t^\top \tilde{O}_t Q \\
&= \big(\tilde{S}_t \tilde{S}_t^\top + \tilde{S}_t \tilde{N}_t^\top \tilde{N}_t \tilde{S}_t^{-1} + \lambda I\big)^{-1} \tilde{S}_t V_t^\top
+ \tilde{S}_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} \tilde{S}_t^\top (\tilde{N}_t \tilde{S}_t^{-1})^\top (\tilde{O}_t / \lambda) Q.
\end{aligned}
\tag{68}
$$

Note that the condition of Lemma 16 can be verified as follows: since

$$
\begin{aligned}
\|\tilde{O}_t\| &\le C_{3.b}^{-C_{3.b}} \kappa^{-3} \cdot \|X_\star\| \cdot \sigma_{\min}\big((\Sigma_\star^2 + \lambda I)^{-1/2}\big) \cdot \|\tilde{S}_t\| \le C_{3.b}^{-C_{3.b}} C_{3.a}\, \sigma_{\min}(X_\star), \\
\|\tilde{N}_t\| &\le \big\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\big\| \cdot \|\Sigma_\star^{-1}\| \cdot \|\tilde{S}_t\| \le c_3 \kappa^{-C_\delta/2} \|X_\star\| \cdot \frac{C_{3.a} \kappa^3 \|X_\star\|}{\sigma_{\min}(X_\star)} \le c_3 C_{3.a}\, \sigma_{\min}(X_\star),
\end{aligned}
$$

provided $C_\delta \ge 6$, the bounds $c_3 \lesssim c_\delta / c_\lambda$ and $C_{3.a} \lesssim c_\lambda^{-1/2}$ imply that when we choose $C_\alpha$ large enough (depending on $c_\lambda, c_\delta$),

$$
2 \|\tilde{N}_t\| \|\tilde{O}_t\| \vee \|\tilde{O}_t\|^2 \le \lambda / 4,
$$

as desired.

Now the first term in (68) can be handled by invoking Lemma 17, whose condition is verified by $\|\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star\| \le c_3 \kappa^{-(C_\delta/2 - 1)} \sigma_{\min}(X_\star) \le \sigma_{\min}(X_\star)/16$, provided $C_\delta \ge 2$ and $c_3 \le 1/16$ by choosing $c_\delta$ sufficiently small (depending on $c_\lambda$). Namely,

$$
\big(\tilde{S}_t \tilde{S}_t^\top + \tilde{S}_t \tilde{N}_t^\top \tilde{N}_t \tilde{S}_t^{-1} + \lambda I\big)^{-1} \tilde{S}_t V_t^\top = (I + E_t^{17}) (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t V_t^\top.
$$

For the second term, noting that

$$
\big\|\tilde{S}_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} \tilde{S}_t^\top\big\| \le \big\|\tilde{S}_t (\tilde{S}_t^\top \tilde{S}_t + \lambda I)^{-1} \tilde{S}_t^\top\big\| \le 1,
$$

it can be expressed as

$$
\lambda^{-1} \|\tilde{O}_t\|\, \tilde{S}_t (\tilde{S}_t^\top \tilde{S}_t + \lambda I)^{-1} \tilde{S}_t^\top (\tilde{N}_t \tilde{S}_t^{-1})^\top (\tilde{O}_t / \|\tilde{O}_t\|) Q
= \kappa \lambda^{-1} \|X_\star\|^{-1} \|\tilde{O}_t\|\, Q_1 (\tilde{N}_t \tilde{S}_t^{-1} \Sigma_\star)^\top Q_2
$$

for $Q_1 = \tilde{S}_t (\tilde{S}_t^\top \tilde{S}_t + \lambda I)^{-1} \tilde{S}_t^\top \cdot \kappa^{-1} \|X_\star\| \Sigma_\star^{-1}$ with $\|Q_1\| \le 1$, and $Q_2 = (\tilde{O}_t / \|\tilde{O}_t\|) Q$, which satisfies $\|Q_2\| \le \|Q\| \le 2$. Applying the above two bounds to (68) yields (66a).

Similarly, moving to (66b), it follows that

$$
\begin{aligned}
N_t (X_t^\top X_t + \lambda I)^{-1}
&= \big(\tilde{N}_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} V_t^\top + \lambda^{-1} \tilde{O}_t V_{t,\perp}^\top\big) (I + E_t^{16.b}) \\
&= \tilde{N}_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} V_t^\top + \lambda^{-1} \tilde{O}_t V_{t,\perp}^\top + E_t^{18.b},
\end{aligned}
\tag{69}
$$

where

$$
E_t^{18.b} = \big(\tilde{N}_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} V_t^\top + \lambda^{-1} \tilde{O}_t V_{t,\perp}^\top\big) E_t^{16.b} = \lambda^{-1} \big(\|\tilde{N}_t\| Q_3 + \|\tilde{O}_t\| Q_4\big) E_t^{16.b}
$$

for some matrices $Q_3, Q_4$ with $\|Q_3\|, \|Q_4\| \le 1$. In the last line we used $\|(\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1}\| \le \lambda^{-1}$. For the first term of (69), we use Lemma 17 and obtain

$$
\tilde{N}_t (\tilde{S}_t^\top \tilde{S}_t + \tilde{N}_t^\top \tilde{N}_t + \lambda I)^{-1} V_t^\top
= \tilde{N}_t \tilde{S}_t^{-1} \big(\tilde{S}_t \tilde{S}_t^\top + \tilde{S}_t \tilde{N}_t^\top \tilde{N}_t \tilde{S}_t^{-1} + \lambda I\big)^{-1} \tilde{S}_t V_t^\top
= \tilde{N}_t \tilde{S}_t^{-1} (I + E_t^{17}) (\tilde{S}_t \tilde{S}_t^\top + \lambda I)^{-1} \tilde{S}_t V_t^\top.
$$

This yields the representation in (66b). ∎

B.1.3Step 3: proofs associated with 𝑆 𝑑 + 1 .

With the help of Lemma 18, we are ready to prove (53a) and the associated norm bounds (54a), (54b), and (54f). To begin with, we plug (66a), (66b) into (64) and use 𝑆 𝑑

𝑆 ~ 𝑑 ​ 𝑉 𝑑 ⊀ to obtain

𝑆 𝑑 + 1

( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 + 𝐸 𝑑 π‘Ž ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ) ​ 𝑆 ~ 𝑑 ​ 𝑉 𝑑 ⊀ + πœ‚ ​ 𝐸 𝑑 𝑏 ,

where the error terms 𝐸 𝑑 π‘Ž and 𝐸 𝑑 𝑏 are

𝐸 𝑑 π‘Ž

≔ π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 ​ π‘ˆ ⋆ + ( Ξ£ ⋆ 2 + π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 ​ π‘ˆ ⋆ + πœ† ​ 𝐼 ) ​ 𝐸 𝑑 17 + π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 ​ π‘ˆ ⋆ , βŸ‚ ​ 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ ( 𝐼 + 𝐸 𝑑 17 ) ,

𝐸 𝑑 𝑏

≔ ( Ξ£ ⋆ 2 + π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 ​ π‘ˆ ⋆ + πœ† ​ 𝐼 ) ​ 𝐸 𝑑 18 . π‘Ž + π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 ​ π‘ˆ ⋆ , βŸ‚ ​ ( πœ† βˆ’ 1 ​ 𝑂 ~ 𝑑 ​ 𝑉 𝑑 , βŸ‚ ⊀ + 𝐸 𝑑 18 . 𝑏 ) .

This establishes the identity (53a). To control β€– | 𝐸 𝑑 π‘Ž | β€– , we observe that

β€– | 𝐸 𝑑 π‘Ž | β€–

≀ β€– | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 | β€– + β€– Ξ£ ⋆ 2 + π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 ​ π‘ˆ ⋆ + πœ† ​ 𝐼 β€– β‹… β€– | 𝐸 𝑑 17 | β€– + β€– | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 | β€– β‹… β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– β‹… β€– Ξ£ ⋆ βˆ’ 1 β€– β‹… ( 1 + β€– 𝐸 𝑑 17 β€– )

≀ ( 1 + 𝑐 12 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 + 𝑐 πœ† ) ​ β€– 𝑋 ⋆ β€– 2 β‹… β€– | 𝐸 𝑑 17 | β€– + β€– | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 | β€– + 𝑐 3 ​ πœ… βˆ’ 𝐢 𝛿 / 2 ​ β€– 𝑋 ⋆ β€– β‹… 𝜎 min βˆ’ 1 ​ ( 𝑋 ⋆ ) β‹… ( 1 + β€– 𝐸 𝑑 17 β€– ) β‹… β€– | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 | β€–

≀ 2 ​ β€– 𝑋 ⋆ β€– 2 β‹… β€– | 𝐸 𝑑 17 | β€– + ( 1 + 𝑐 3 ​ ( 1 + β€– 𝐸 𝑑 17 β€– ) ) ​ β€– | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 | β€– ,

where the second line follows from Lemma 12 and Equations (12b), (22c); the last line holds since 𝑐 12 , 𝑐 πœ† are sufficiently small and 𝐢 𝛿 is sufficiently large. Now we invoke the bound (62) in Lemma 17 to see

β€– | 𝐸 𝑑 17 | β€– ≀ 2 ​ πœ… 2 ​ β€– 𝑋 ⋆ β€– βˆ’ 2 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– ​ β€– | 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ | β€–

≀ 2 ​ 𝑐 3 ​ πœ… 2 ​ πœ… βˆ’ 𝐢 𝛿 / 2 ​ β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– | 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ | β€–

≀ 2 ​ 𝑐 3 ​ πœ… βˆ’ 6 ​ β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– | 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ | β€– ,

where the last line follows again by choosing sufficiently large 𝐢 𝛿 . Furthermore, since β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– ≀ 𝑐 3 ​ πœ… βˆ’ 𝐢 𝛿 / 2 ​ β€– 𝑋 ⋆ β€– for small enough 𝑐 3 , we obtain β€– 𝐸 𝑑 17 β€– ≀ 1 . Combining these inequalities yields the claimed bound

β€– | 𝐸 𝑑 π‘Ž | β€– ≀ 2 ​ 𝑐 3 ​ πœ… βˆ’ 4 ​ β€– 𝑋 ⋆ β€– β‹… β€– | 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ | ​ β€– + 2 β€– ​ | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 | β€– .

The bounds on β€– | 𝐸 𝑑 𝑏 | β€– and β€– 𝐸 𝑑 𝑏 β€– can be proved in a similar way, utilizing the bound for β€– 𝑂 ~ 𝑑 β€– in (24). In fact, a computation similar to the above shows

β€– | 𝐸 𝑑 𝑏 | β€– ≀ 2 β€– 𝑋 ⋆ β€– 2 β‹… β€– | 𝐸 18 . π‘Ž | β€– + πœ† βˆ’ 1 β€– Ξ” 𝑑 β€– β‹… β€– | 𝑂 ~ 𝑑 | β€– + β€– Ξ” 𝑑 β€– β‹… β€– | 𝐸 18 . 𝑏 | β€–

≀ 2 πœ… πœ† βˆ’ 1 β‹… β€– 𝑋 ⋆ β€– β‹… β€– 𝑂 ~ 𝑑 β€– β‹… β€– 𝑄 1 β€– β‹… β€– 𝑄 2 β€– β‹… β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + 100 𝑐 πœ† βˆ’ 1 𝜎 min βˆ’ 1 ( 𝑀 ⋆ ) 𝑐 12 πœ… βˆ’ 2 𝐢 𝛿 / 3 β€– 𝑋 ⋆ β€– 2 β‹… β€– | 𝑂 ~ 𝑑 | β€– + 8 πœ† βˆ’ 2 𝑐 12 πœ… βˆ’ 2 𝐢 𝛿 / 3 ( β€– 𝑁 ~ 𝑑 β€– + β€– 𝑂 ~ 𝑑 β€– ) β€– 𝑁 ~ 𝑑 β€– β‹… β€– | 𝑂 ~ 𝑑 | β€–

≀ 800 πœ… 7 𝑐 πœ† βˆ’ 1 β€– 𝑋 ⋆ β€– βˆ’ 1 β€– 𝑂 ~ 𝑑 β€– β‹… β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + 1 / ( 48 ( 𝐢 max + 1 ) πœ… ) β‹… β€– | 𝑂 ~ 𝑑 | β€– .

Here, 𝐢 max is the constant given by Lemma 3. Similarly, we have

β€– 𝐸 𝑑 𝑏 β€– ≀ 800 πœ… 7 𝑐 πœ† βˆ’ 1 β€– 𝑋 ⋆ β€– βˆ’ 1 β€– 𝑂 ~ 𝑑 β€– β‹… β€– 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ β€– + 1 / ( 48 ( 𝐢 max + 1 ) πœ… ) β‹… β€– 𝑂 ~ 𝑑 β€– .

The bound (54f) now follows directly from the bound of β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– in Lemma 3, provided 𝑐 𝛿 is sufficiently small and 𝐢 𝛿 is sufficiently large. To prove (54b), we note that

β€– | 𝐴 | β€– ≀ βˆšπ‘› β€– 𝐴 β€–

(70)

for any unitarily invariant norm | | | β‹… | | | and real matrix 𝐴 ∈ ℝ 𝑝 Γ— π‘ž with 𝑝 ∨ π‘ž ≀ 𝑛 (which can be easily verified when | | | β‹… | | | is βˆ₯ β‹… βˆ₯ or βˆ₯ β‹… βˆ₯ π–₯ ). Thus

β€– | 𝐸 𝑑 𝑏 | β€– ≀ ( 800 πœ… 7 𝑐 πœ† βˆ’ 1 𝑐 3 πœ… βˆ’ 𝐢 𝛿 / 2 + 1 / ( 24 ( 𝐢 max + 1 ) πœ… ) ) βˆšπ‘› β€– 𝑂 ~ 𝑑 β€– ≀ ( β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ) 3 / 4 𝜎 min ( 𝑆 ~ 𝑑 )

(71)

where the last inequality follows from the control of β€– 𝑂 ~ 𝑑 β€– given by (3) provided 𝑐 3 is sufficiently small and 𝐢 3 . 𝑏 therein is sufficiently large. This establishes the first inequality in (54b), and the second inequality therein follows directly from (3).
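The norm comparison (70) invoked above is elementary; the following short check (illustrative, not part of the paper) confirms numerically that the factor βˆšπ‘› suffices for the two norms of interest, the spectral norm and the Frobenius norm.

```python
import numpy as np

# Numerical check of the comparison in (70): for A in R^{p x q} with
# max(p, q) <= n, both the spectral and the Frobenius norm of A are bounded
# by sqrt(n) times the spectral norm ||A||.
rng = np.random.default_rng(0)
p, q = 7, 5
n = max(p, q)
A = rng.standard_normal((p, q))
op = np.linalg.norm(A, 2)          # spectral norm ||A||
fro = np.linalg.norm(A, 'fro')     # Frobenius norm ||A||_F
assert op <= np.sqrt(n) * op + 1e-12
assert fro <= np.sqrt(n) * op + 1e-12
```

For the Frobenius norm this follows from β€– 𝐴 β€– π–₯ ≀ √(rank(𝐴)) β€– 𝐴 β€– and rank(𝐴) ≀ 𝑝 ∧ π‘ž ≀ 𝑛 .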

B.1.4 Step 4: proofs associated with 𝑁 ~ 𝑑 + 1 .

Now we move on to prove the identity (53b), and the norm controls (54c), (54d), (54e), and (54g) associated with the misalignment term 𝑁 ~ 𝑑 + 1 . Plugging (66a), (66b) into (65) and using the decomposition 𝑁 𝑑 = 𝑁 ~ 𝑑 𝑉 𝑑 ⊀ + 𝑂 ~ 𝑑 𝑉 𝑑 , βŸ‚ ⊀ , we have

𝑁 𝑑 + 1 = 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 ( ( 1 βˆ’ πœ‚ ) 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 + πœ‚ 𝐸 𝑑 𝑐 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 𝑆 ~ 𝑑 𝑉 𝑑 ⊀ + πœ‚ 𝐸 𝑑 𝑒 ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 𝑆 ~ 𝑑 𝑉 𝑑 ⊀ + 𝑂 ~ 𝑑 𝑉 𝑑 , βŸ‚ ⊀ + πœ‚ 𝐸 𝑑 𝑑 ,

where the error terms are defined to be

𝐸 𝑑 𝑐 ≔ πœ† 𝐸 𝑑 17 ,

𝐸 𝑑 𝑑 ≔ ( πœ† 𝐼 + π‘ˆ ⋆ , βŸ‚ ⊀ Ξ” 𝑑 π‘ˆ ⋆ , βŸ‚ ) 𝐸 𝑑 18 . 𝑏 + πœ† βˆ’ 1 π‘ˆ ⋆ , βŸ‚ ⊀ Ξ” 𝑑 π‘ˆ ⋆ , βŸ‚ 𝑂 ~ 𝑑 𝑉 𝑑 , βŸ‚ ⊀ + π‘ˆ ⋆ , βŸ‚ ⊀ Ξ” 𝑑 π‘ˆ ⋆ 𝐸 𝑑 18 . π‘Ž ,

𝐸 𝑑 𝑒 ≔ π‘ˆ ⋆ , βŸ‚ ⊀ Ξ” 𝑑 π‘ˆ ⋆ ( 𝐼 + 𝐸 𝑑 17 ) + π‘ˆ ⋆ , βŸ‚ ⊀ Ξ” 𝑑 π‘ˆ ⋆ , βŸ‚ 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 ( 𝐼 + 𝐸 𝑑 17 ) .

This establishes the decomposition (53b). The remaining norm controls follow from the expressions above and similar computation as we have done for 𝑆 𝑑 + 1 . For the sake of brevity, we omit the details.

B.2 Proof of Lemma 14

Use the identity (53a) in Lemma 13 and the fact that 𝑉 𝑑 and 𝑉 𝑑 , βŸ‚ have orthogonal columns to obtain

𝑆 𝑑 + 1 ​ 𝑉 𝑑 + 𝑆 𝑑 + 1 ​ 𝑉 𝑑 , βŸ‚ ​ 𝑄

( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 + 𝐸 𝑑 π‘Ž ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ) ​ 𝑆 ~ 𝑑 + πœ‚ ​ 𝐸 𝑑 𝑏 ​ ( 𝑉 𝑑 + 𝑉 𝑑 , βŸ‚ ​ 𝑄 )

( 𝐼 + πœ‚ ​ 𝐸 𝑑 14 ) ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ) ​ 𝑆 ~ 𝑑

( 𝐼 + πœ‚ ​ 𝐸 𝑑 14 ) ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 + πœ‚ ​ Ξ£ ⋆ 2 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ​ 𝑆 ~ 𝑑 ,

(72)

where 𝐸 𝑑 14 is defined to be

𝐸 𝑑 14 ≔ ( 𝐸 𝑑 π‘Ž ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 + 𝐸 𝑑 𝑏 ( 𝑉 𝑑 + 𝑉 𝑑 , βŸ‚ 𝑄 ) 𝑆 ~ 𝑑 βˆ’ 1 ) ( ( 1 βˆ’ πœ‚ ) 𝐼 + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 ) βˆ’ 1

= 𝐸 𝑑 π‘Ž ( ( 1 βˆ’ πœ‚ ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) ) βˆ’ 1 + 𝐸 𝑑 𝑏 ( 𝑉 𝑑 + 𝑉 𝑑 , βŸ‚ 𝑄 ) 𝑆 ~ 𝑑 βˆ’ 1 ( ( 1 βˆ’ πœ‚ ) 𝐼 + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 ) βˆ’ 1

≕ 𝑇 1 + 𝑇 2 ,

where the invertibility of 𝑆 ~ 𝑑 follows from Lemma 3, and the invertibility of ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 follows from (113).

Since ( 1 βˆ’ πœ‚ ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βͺ° πœ† ​ 𝐼 and πœ† β‰₯ 1 100 ​ 𝑐 πœ† ​ 𝜎 min ​ ( 𝑀 ⋆ ) by (12b), we have

β€– 𝑇 1 β€– ≀ πœ† βˆ’ 1 ​ β€– 𝐸 𝑑 π‘Ž β€– ≀ 100 ​ 𝑐 πœ† βˆ’ 1 ​ 𝜎 min βˆ’ 1 ​ ( 𝑀 ⋆ ) ​ β€– 𝐸 𝑑 π‘Ž β€– .

In view of the bound (54a) on β€– 𝐸 𝑑 π‘Ž β€– in Lemma 13, we further have

β€– 𝑇 1 β€–

≀ 100 ​ 𝑐 πœ† βˆ’ 1 ​ 𝜎 min βˆ’ 2 ​ ( 𝑋 ⋆ ) ​ ( πœ… βˆ’ 4 ​ β€– 𝑋 ⋆ β€– β‹… β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– + β€– Ξ” 𝑑 β€– )

≀ 100 ​ 𝑐 πœ† βˆ’ 1 ​ πœ… 2 ​ β€– 𝑋 ⋆ β€– βˆ’ 2 ​ ( πœ… βˆ’ 4 ​ 𝑐 3 ​ πœ… βˆ’ 𝐢 𝛿 / 2 + 𝑐 12 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ) ​ β€– 𝑋 ⋆ β€– 2

≀ 1 / ( 400 ( 𝐢 3 . π‘Ž + 1 ) 4 πœ… 5 ) ,

where the second inequality follows from (22c) in Lemma 3 and Lemma 12, and the last inequality holds as long as 𝑐 3 and 𝑐 12 are sufficiently small and 𝐢 𝛿 is sufficiently large (by first fixing 𝑐 πœ† and then choosing 𝑐 𝛿 to be sufficiently small).

The term 𝑇 2 can be controlled in a similar way. Since β€– 𝐴 ​ 𝐡 β€– ≀ β€– 𝐴 β€– β‹… β€– 𝐡 β€– , one has

β€– 𝑇 2 β€– ≀ β€– 𝐸 𝑑 𝑏 β€– β‹… ( β€– 𝑉 𝑑 β€– + β€– 𝑉 𝑑 , βŸ‚ β€– β€– 𝑄 β€– ) β‹… β€– 𝑆 ~ 𝑑 βˆ’ 1 β€– β‹… 𝜎 min βˆ’ 1 ( ( 1 βˆ’ πœ‚ ) 𝐼 + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 )

≀ ( i ) 3 β€– 𝐸 𝑑 𝑏 β€– β‹… 𝜎 min βˆ’ 1 ( 𝑆 ~ 𝑑 ) β‹… πœ… / ( 1 βˆ’ πœ‚ ) ≀ ( ii ) 6 πœ… ( β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ) 3 / 4 ≀ ( iii ) 1 / ( 400 ( 𝐢 3 . π‘Ž + 1 ) 4 πœ… 5 ) .

Here, (i) follows from the bound (113) and the facts that β€– 𝑉 𝑑 β€– ∨ β€– 𝑉 𝑑 , βŸ‚ β€– ≀ 1 , β€– 𝑄 β€– ≀ 2 ; (ii) arises from the control (54b) on β€– 𝐸 𝑑 𝑏 β€– in Lemma 13 as well as the condition πœ‚ ≀ 𝑐 πœ‚ ≀ 1 / 2 ; and (iii) follows from the implication (3) of Lemma 3.

The proof is completed by summing up the bounds on β€– 𝑇 1 β€– and β€– 𝑇 2 β€– .

B.3 Proof of Lemma 15

Similar to the proof of Lemma 14, we can use the identity (53b) in Lemma 13 and the fact that 𝑉 𝑑 and 𝑉 𝑑 , βŸ‚ have orthogonal columns to obtain

𝑁 𝑑 + 1 ​ 𝑉 𝑑 + 𝑁 𝑑 + 1 ​ 𝑉 𝑑 , βŸ‚ ​ 𝑄

𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 + πœ‚ ​ 𝐸 𝑑 𝑐 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ​ 𝑆 ~ 𝑑 + πœ‚ ​ 𝐸 𝑑 15 . 𝑐

𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ ( 𝐼 + πœ‚ ​ 𝐸 𝑑 15 . π‘Ž ) ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ​ 𝑆 ~ 𝑑 + πœ‚ ​ 𝐸 𝑑 15 . 𝑐 ,

(73)

where the error terms are defined to be

𝐸 𝑑 15 . 𝑐 ≔ 𝐸 𝑑 𝑒 ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 𝑆 ~ 𝑑 + πœ‚ βˆ’ 1 𝑂 ~ 𝑑 𝑄 + 𝐸 𝑑 𝑑 ( 𝑉 𝑑 + 𝑉 𝑑 , βŸ‚ 𝑄 ) ,

(74)

𝐸 𝑑 15 . π‘Ž ≔ 𝐸 𝑑 𝑐 ( ( 1 βˆ’ πœ‚ ) 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 .

(75)

Combine (73) and (72) to arrive at

( 𝑁 𝑑 + 1 ​ 𝑉 𝑑 + 𝑁 𝑑 + 1 ​ 𝑉 𝑑 , βŸ‚ ​ 𝑄 ) ​ ( 𝑆 𝑑 + 1 ​ 𝑉 𝑑 + 𝑆 𝑑 + 1 ​ 𝑉 𝑑 , βŸ‚ ​ 𝑄 ) βˆ’ 1

𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ ( 𝐼 + πœ‚ ​ 𝐸 𝑑 15 . π‘Ž ) ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 + πœ‚ ​ Ξ£ ⋆ 2 ) βˆ’ 1 ​ ( 𝐼 + πœ‚ ​ 𝐸 𝑑 14 ) βˆ’ 1 + πœ‚ ​ 𝐸 𝑑 15 . 𝑏 ,

(76)

where, using

( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 + πœ‚ ​ Ξ£ ⋆ 2 ) βˆ’ 1

( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ) βˆ’ 1 ,

we have

𝐸 𝑑 15 . 𝑏 ≔ 𝐸 𝑑 15 . 𝑐 𝑆 ~ 𝑑 βˆ’ 1 ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) ( ( 1 βˆ’ πœ‚ ) 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 + πœ‚ Ξ£ ⋆ 2 ) βˆ’ 1 ( 𝐼 + πœ‚ 𝐸 𝑑 14 ) βˆ’ 1

= 𝐸 𝑑 𝑒 ( ( 1 βˆ’ πœ‚ ) 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 + πœ‚ Ξ£ ⋆ 2 ) βˆ’ 1 ( 𝐼 + πœ‚ 𝐸 𝑑 14 ) βˆ’ 1 + πœ‚ βˆ’ 1 𝑂 ~ 𝑑 𝑄 𝑆 ~ 𝑑 βˆ’ 1 ( ( 1 βˆ’ πœ‚ ) 𝐼 + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 ) βˆ’ 1 ( 𝐼 + πœ‚ 𝐸 𝑑 14 ) βˆ’ 1 + 𝐸 𝑑 𝑑 ( 𝑉 𝑑 + 𝑉 𝑑 , βŸ‚ 𝑄 ) 𝑆 ~ 𝑑 βˆ’ 1 ( ( 1 βˆ’ πœ‚ ) 𝐼 + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 ) βˆ’ 1 ( 𝐼 + πœ‚ 𝐸 𝑑 14 ) βˆ’ 1

≕ 𝑇 1 + 𝑇 2 + 𝑇 3 .

It remains to bound β€– 𝐸 𝑑 15 . π‘Ž β€– and β€– | 𝐸 𝑑 15 . 𝑏 | β€– . By (54c), we have

β€– 𝐸 15 . π‘Ž β€– ≀ πœ† βˆ’ 1 ​ β€– 𝐸 𝑑 𝑐 β€–

≀ 100 ​ 𝑐 πœ† βˆ’ 1 ​ 𝜎 min βˆ’ 2 ​ ( 𝑋 ⋆ ) β‹… πœ… βˆ’ 4 ​ β€– 𝑋 ⋆ β€– ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€–

≀ 100 ​ 𝑐 πœ† βˆ’ 1 ​ 𝑐 3 ​ πœ… βˆ’ 2 ​ πœ… βˆ’ 𝐢 𝛿 / 2

≀ 1 200 ​ ( 𝐢 3 . π‘Ž + 1 ) 4 ​ πœ… 5 ,

where the penultimate inequality follows from (22c) and the last inequality holds with the proviso that 𝑐 3 is sufficiently small and 𝐢 𝛿 is sufficiently large.

Now we move to bound β€– | 𝐸 𝑑 15 . 𝑏 | β€– . To this end, it is helpful to note that β€– ( 𝐼 + πœ‚ 𝐸 𝑑 14 ) βˆ’ 1 β€– ≀ 2 , which follows from Lemma 14, where we established that β€– 𝐸 𝑑 14 β€– ≀ 1 / 2 . As a result of this relation, we obtain

β€– | 𝑇 1 | β€– ≀ 2 πœ† βˆ’ 1 β€– | 𝐸 𝑑 𝑒 | β€– ,

β€– | 𝑇 2 | β€– ≀ 2 πœ‚ βˆ’ 1 β€– | 𝑂 ~ 𝑑 | β€– β‹… β€– 𝑄 β€– β‹… β€– 𝑆 ~ 𝑑 βˆ’ 1 β€– β‹… β€– ( ( 1 βˆ’ πœ‚ ) 𝐼 + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 ) βˆ’ 1 β€– ,

β€– | 𝑇 3 | β€– ≀ 2 β€– | 𝐸 𝑑 𝑑 | β€– β‹… ( 1 + β€– 𝑄 β€– ) β‹… β€– 𝑆 ~ 𝑑 βˆ’ 1 β€– β‹… β€– ( ( 1 βˆ’ πœ‚ ) 𝐼 + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 ) βˆ’ 1 β€– .

Similar to the control of 𝑇 1 in the proof of Lemma 14, we can take the condition πœ† β‰₯ ( 1 / 100 ) πœ… βˆ’ 4 𝑐 πœ† 𝜎 min 2 ( 𝑋 ⋆ ) and the bound (54e) collectively to see that

β€– | 𝑇 1 | β€– ≀ 400 𝑐 πœ† βˆ’ 1 πœ… 6 β€– 𝑋 ⋆ β€– βˆ’ 2 β€– | π‘ˆ ⋆ ⊀ Ξ” 𝑑 | β€– + 1 / ( 64 ( 𝐢 3 . π‘Ž + 1 ) 2 πœ… 4 β€– 𝑋 ⋆ β€– ) β‹… β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– .

Regarding the terms 𝑇 2 and 𝑇 3 , we see from (113) that

β€– ( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ) βˆ’ 1 β€– ≀ πœ… 1 βˆ’ πœ‚ ≀ 2 ​ πœ… ,

as long πœ‚ is sufficiently small. Recalling the assumption β€– 𝑄 β€– ≀ 2 , this allows us to obtain

β€– | 𝑇 2 | β€– ≀ 8 πœ‚ βˆ’ 1 πœ… β€– | 𝑂 ~ 𝑑 | β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ≀ 8 πœ‚ βˆ’ 1 πœ… βˆšπ‘› β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ,

β€– | 𝑇 3 | β€– ≀ 12 πœ… β€– | 𝐸 𝑑 𝑑 | β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ,

where the first inequality again uses the elementary fact β€– | 𝑂 ~ 𝑑 | β€– ≀ βˆšπ‘› β€– 𝑂 ~ 𝑑 β€– in (70).

The desired bounds then follow from plugging in the bounds (54d) and (24).

Appendix C Proofs for Phase I

The goal of this section is to prove Lemma 3 in an inductive manner. We achieve this goal in two steps. In Section C.1, we find an iteration number 𝑑 1 ≀ 𝑇 min / 16 such that the claim (22) is true at 𝑑 1 . This establishes the base case. Then in Section C.2, we prove the induction step, namely if the claim (22) holds for some iteration 𝑑 β‰₯ 𝑑 1 , we aim to show that (22) continues to hold for the iteration 𝑑 + 1 . These two steps taken collectively finish the proof of Lemma 3.

C.1 Establishing the base case: finding a valid 𝑑 1

The following lemma ensures the existence of such an iteration number 𝑑 1 .

Lemma 19.

Under the same setting as Theorem 2, there exists some 𝑑 1 ≀ 𝑇 min / 16 such that (21) holds and (22) holds with 𝑑 = 𝑑 1 .

The rest of this subsection is devoted to the proof of this lemma.

Define an auxiliary sequence

𝑋 ^ 𝑑 ≔ ( 𝐼 + ( πœ‚ / πœ† ) π’œ βˆ— π’œ ( 𝑀 ⋆ ) ) 𝑑 𝑋 0 ,

(77)

which can be viewed as power iterations on the matrix π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) from the initialization 𝑋 0 .
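To make the power-iteration viewpoint concrete, the sketch below compares the actual iterates with the auxiliary iterates (77) on a toy instance. It uses our reading of the update rule (7) and the idealized choice π’œ βˆ— π’œ = identity (standing in for a sensing operator with perfect RIP); all parameter values are illustrative, not the paper's.

```python
import numpy as np

# Illustrative sketch (not the paper's experiment): with the idealized sensing
# operator A*A = identity, run our reading of the ScaledGD(lambda) update (7),
#   X_{t+1} = X_t + eta * (M_star - X_t X_t^T) X_t (X_t^T X_t + lam*I)^{-1},
# and compare against the auxiliary power iterates X_hat_t of (77).
rng = np.random.default_rng(1)
n, r, r_star = 8, 3, 2
U = np.linalg.qr(rng.standard_normal((n, r_star)))[0]
M_star = U @ np.diag([2.0, 1.0]) @ U.T            # ground truth, rank r_star
eta, lam, alpha = 0.2, 0.1, 1e-9                  # illustrative parameters
X = alpha * rng.standard_normal((n, r))           # small random initialization
X_hat = X.copy()
P = np.eye(n) + (eta / lam) * M_star              # power-iteration matrix
for _ in range(6):                                # stay in the small-size phase
    X = X + eta * (M_star - X @ X.T) @ X @ np.linalg.inv(X.T @ X + lam * np.eye(r))
    X_hat = P @ X_hat
# The deviation is of higher order in alpha, in line with Lemma 20.
rel_gap = np.linalg.norm(X - X_hat) / np.linalg.norm(X_hat)
assert rel_gap < 1e-5
```

For a tiny initialization size Ξ±, the nonlinear correction terms are cubic in β€– 𝑋 𝑑 β€– , so the two sequences stay close throughout the early phase.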

In what follows, we first establish that the true iterates { 𝑋 𝑑 } stay close to the auxiliary iterates { 𝑋 ^ 𝑑 } as long as the initialization size 𝛼 is small; see Lemma 20. This proximity then allows us to invoke the result in stoger2021small (see Lemma 21) to establish Lemma 19. For the rest of the appendices, we work on the following event given in (18):

β„° ≔ { β€– 𝐺 β€– ≀ 𝐢 𝐺 } ∩ { 𝜎 min βˆ’ 1 ( π‘ˆ ^ ⊀ 𝐺 ) ≀ ( 2 𝑛 ) 𝐢 𝐺 } .

Step 1: controlling distance between 𝑋 𝑑 and 𝑋 ^ 𝑑 .

The following lemma guarantees the closeness between the two iterates { 𝑋 𝑑 } and { 𝑋 ^ 𝑑 } , with the proof deferred to Appendix C.1.1. Recall that 𝐢 𝐺 is the constant defined in the event β„° in (18), and 𝑐 πœ† is the constant given in Theorem 2.

Lemma 20.

Suppose that πœ† β‰₯ 1 100 ​ 𝑐 πœ† ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) . For any πœƒ ∈ ( 0 , 1 ) , there exists a large enough constant 𝐾

𝐾 ​ ( πœƒ , 𝑐 πœ† , 𝐢 𝐺 )

0 such that the following holds: As long as 𝛼 obeys

log ⁑ β€– 𝑋 ⋆ β€– 𝛼 β‰₯ 𝐾 max ⁑ ( πœ‚ , πœ… βˆ’ 2 ) ​ log ⁑ ( 2 ​ πœ… ​ 𝑛 ) β‹… ( 1 + log ⁑ ( 1 + πœ‚ πœ† ​ β€– π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) β€– ) ) ,

(78)

one has for all 𝑑 ≀ 1 πœƒ ​ πœ‚ ​ log ⁑ ( πœ… ​ 𝑛 ) :

βˆ₯ 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 βˆ₯

≀ 𝑑 ​ ( 1 + πœ‚ πœ† ​ βˆ₯ π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) βˆ₯ ) 𝑑 ​ 𝛼 2 β€– 𝑋 ⋆ β€– .

(79)

Moreover, βˆ₯ 𝑋 𝑑 βˆ₯ ≀ βˆ₯ 𝑋 ⋆ βˆ₯ for all such 𝑑 .

Step 2: borrowing a lemma from stoger2021small.

Compared to the original sequence 𝑋 𝑑 , the behavior of the power iterates 𝑋 ^ 𝑑 is much easier to analyze. Now that we have sufficient control over β€– 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 β€– , it is possible to show that 𝑋 𝑑 has the desired properties in Lemma 19 by first establishing the corresponding property of 𝑋 ^ 𝑑 and then invoking a standard matrix perturbation argument. Fortunately, such a strategy has been implemented by stoger2021small and wrapped into the following helper lemma.

Denote

𝑠 𝑗 ≔ 𝜎 𝑗 ( 𝐼 + ( πœ‚ / πœ† ) π’œ βˆ— π’œ ( 𝑀 ⋆ ) ) = 1 + ( πœ‚ / πœ† ) 𝜎 𝑗 ( π’œ βˆ— π’œ ( 𝑀 ⋆ ) ) , 𝑗 = 1 , 2 , … , 𝑛 ,

and recall that π‘ˆ ^ (resp. π‘ˆ 𝑋 ~ 𝑑 ) is an orthonormal basis of the eigenspace associated with the π‘Ÿ ⋆ largest eigenvalues of π’œ βˆ— π’œ ( 𝑀 ⋆ ) (resp. 𝑋 ~ 𝑑 ).

Lemma 21.

There exists some small universal constant 𝑐 21 > 0 such that the following hold. Assume that for some 𝛾 ≀ 𝑐 21 ,

βˆ₯ ( ℐ βˆ’ π’œ βˆ— ​ π’œ ) ​ ( 𝑀 ⋆ ) βˆ₯ ≀ 𝛾 ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) ,

(80)

and furthermore,

πœ™ ≔ 𝛼 ​ β€– 𝐺 β€– ​ 𝑠 π‘Ÿ ⋆ + 1 𝑑 + β€– 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 β€– 𝛼 ​ 𝜎 min ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ​ 𝑠 π‘Ÿ ⋆ 𝑑 ≀ 𝑐 21 ​ πœ… βˆ’ 2 .

(81)

Then there exists some universal constant 𝐢 21 > 0 such that the following hold:

𝜎 min ​ ( 𝑆 ~ 𝑑 )

β‰₯ 𝛼 4 ​ 𝜎 min ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ​ 𝑠 π‘Ÿ ⋆ 𝑑 ,

(82a)

β€– 𝑂 ~ 𝑑 β€–

≀ 𝐢 21 ​ πœ™ ​ 𝛼 ​ 𝜎 min ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ​ 𝑠 π‘Ÿ ⋆ 𝑑 ,

(82b)

β€– π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 β€–

≀ 𝐢 21 ​ ( 𝛾 + πœ™ ) ,

(82c)

where 𝑋 ~ 𝑑 ≔ 𝑋 𝑑 𝑉 𝑑 ∈ ℝ 𝑛 Γ— π‘Ÿ ⋆ .

Proof of Lemma 21.

This follows from the claims of stoger2021small by noting that βˆ₯ 𝑂 ~ 𝑑 βˆ₯ = βˆ₯ π‘ˆ ⋆ , βŸ‚ ⊀ 𝑋 𝑑 𝑉 𝑑 , βŸ‚ βˆ₯ ≀ βˆ₯ 𝑋 𝑑 𝑉 𝑑 , βŸ‚ βˆ₯ for (82b). ∎

Step 3: completing the proof.

Now, with the help of Lemma 21, we are ready to prove Lemma 19. We start with verifying the two assumptions in Lemma 21.

Verifying assumption (80).

By the RIP in (9), Lemma 8, and the condition of 𝛿 in (10), we have

βˆ₯ ( ℐ βˆ’ π’œ βˆ— π’œ ) ( 𝑀 ⋆ ) βˆ₯ ≀ π‘Ÿ ⋆ 𝛿 βˆ₯ 𝑀 ⋆ βˆ₯ ≀ 𝑐 𝛿 πœ… βˆ’ ( 𝐢 𝛿 βˆ’ 2 ) 𝜎 min 2 ( 𝑋 ⋆ )

: 𝛾 𝜎 min 2 ( 𝑋 ⋆ ) .

(83)

Here 𝛾

𝑐 𝛿 ​ πœ… βˆ’ ( 𝐢 𝛿 βˆ’ 2 ) ≀ 𝑐 21 , as 𝑐 𝛿 is assumed to be sufficiently small.

Verifying assumption (81).

By Weyl’s inequality and (83), we have

| 𝑠 𝑗 βˆ’ 1 βˆ’ ( πœ‚ / πœ† ) 𝜎 𝑗 ( 𝑀 ⋆ ) | ≀ ( πœ‚ / πœ† ) βˆ₯ ( ℐ βˆ’ π’œ βˆ— π’œ ) ( 𝑀 ⋆ ) βˆ₯ ≀ ( πœ‚ / πœ† ) 𝑐 𝛿 πœ… βˆ’ ( 𝐢 𝛿 βˆ’ 2 ) 𝜎 min 2 ( 𝑋 ⋆ ) ≀ ( 100 𝑐 𝛿 / 𝑐 πœ† ) πœ‚ ,

where the last inequality follows from the condition πœ† β‰₯ ( 1 / 100 ) πœ… βˆ’ 4 𝑐 πœ† 𝜎 min 2 ( 𝑋 ⋆ ) . Furthermore, using the condition πœ† ≀ 𝑐 πœ† 𝜎 min 2 ( 𝑋 ⋆ ) assumed in (12b), the above bound implies that, for some 𝐢 = 𝐢 ( 𝑐 πœ† , 𝑐 𝛿 ) > 0 ,

𝑠 1 ≀ 1 + ( πœ‚ / πœ† ) β€– 𝑀 ⋆ β€– + ( 100 𝑐 𝛿 / 𝑐 πœ† ) πœ‚ ≀ 1 + 𝐢 πœ‚ πœ… 6 ,

(84a)

𝑠 π‘Ÿ ⋆ β‰₯ 1 + ( πœ‚ / πœ† ) 𝜎 min 2 ( 𝑋 ⋆ ) βˆ’ ( 100 𝑐 𝛿 / 𝑐 πœ† ) πœ‚ β‰₯ 1 + πœ‚ / ( 2 𝑐 πœ† ) ,

(84b)

𝑠 π‘Ÿ ⋆ ≀ 1 + ( πœ‚ / πœ† ) 𝜎 min 2 ( 𝑋 ⋆ ) + ( 100 𝑐 𝛿 / 𝑐 πœ† ) πœ‚ ≀ 1 + 2 πœ‚ 𝜎 min 2 ( 𝑋 ⋆ ) / πœ† ,

(84c)

𝑠 π‘Ÿ ⋆ + 1 ≀ 1 + ( 100 𝑐 𝛿 / 𝑐 πœ† ) πœ‚ ≀ 1 + πœ‚ / ( 4 𝑐 πœ† ) ,

(84d)

where we use the fact that 𝜎 π‘Ÿ ⋆ + 1 ( 𝑀 ⋆ ) = 0 , and 𝑐 𝛿 ≀ 1 / 400 . Consequently we have 𝑠 π‘Ÿ ⋆ / 𝑠 π‘Ÿ ⋆ + 1 β‰₯ 1 + 𝑐 β€² πœ‚ for some 𝑐 β€² = 𝑐 β€² ( 𝑐 πœ† ) > 0 , assuming 𝑐 πœ‚ ≀ 𝑐 πœ† . Thus for any large constant 𝐿 > 0 , there is some constant 𝑐 β€²β€² = 𝑐 β€²β€² ( 𝑐 β€² ) > 0 such that, setting 𝐿 β€² = 𝑐 β€²β€² 𝐿 log ⁑ ( 𝐿 ) , we have

( 𝑠 π‘Ÿ ⋆ / 𝑠 π‘Ÿ ⋆ + 1 ) 𝑑 β‰₯ ( 𝐿 ​ πœ… ​ 𝑛 ) 𝐿 , βˆ€ 𝑑 β‰₯ 𝐿 β€² πœ‚ ​ log ⁑ ( πœ… ​ 𝑛 ) .

On the event β„° given in (18), we can choose 𝐿 large enough so that 𝐿 β‰₯ 2 ​ 𝐢 𝐺 , hence β€– 𝐺 β€– ≀ 𝐿 and 𝜎 min βˆ’ 1 ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ≀ ( 2 ​ 𝑛 ) 𝐿 / 2 . Summarizing these inequalities, we see for 𝑑 β‰₯ 𝐿 β€² πœ‚ ​ log ⁑ ( πœ… ​ 𝑛 ) ,

𝛼 ​ β€– 𝐺 β€– ​ 𝑠 π‘Ÿ ⋆ + 1 𝑑 𝛼 ​ 𝜎 min ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ​ 𝑠 π‘Ÿ ⋆ 𝑑

≀ 𝐿 ​ 𝜎 min βˆ’ 1 ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ​ ( 𝑠 π‘Ÿ ⋆ + 1 / 𝑠 π‘Ÿ ⋆ ) 𝑑

≀ 𝐿 ​ ( 2 ​ 𝑛 ) 𝐿 / 2 ​ ( 𝐿 ​ πœ… ​ 𝑛 ) βˆ’ 𝐿 ≀ ( 𝐿 ​ πœ… ​ 𝑛 ) βˆ’ 𝐿 / 2 .

(85)

Furthermore, invoking Lemma 20 with πœƒ = 1 / ( 2 𝐿 β€² ) (note that (78) is implied by the assumption (12c), where 𝐢 𝛼 is assumed sufficiently large, considering πœ† β‰₯ ( 1 / 100 ) 𝑐 πœ† 𝜎 min 2 ( 𝑋 ⋆ ) and β€– π’œ βˆ— π’œ ( 𝑀 ⋆ ) β€– ≀ β€– 𝑀 ⋆ β€– + 𝛾 𝜎 min 2 ( 𝑋 ⋆ ) ≀ 2 β€– 𝑋 ⋆ β€– 2 by (83)), we obtain for any 𝑑 ≀ ( 1 / ( πœƒ πœ‚ ) ) log ⁑ ( πœ… 𝑛 ) = ( 2 𝐿 β€² / πœ‚ ) log ⁑ ( πœ… 𝑛 ) that β€– 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 β€– ≀ 𝑑 𝑠 1 𝑑 𝛼 2 / βˆ₯ 𝑋 ⋆ βˆ₯ . This implies

β€– 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 β€– / ( 𝛼 𝜎 min ( π‘ˆ ^ ⊀ 𝐺 ) 𝑠 π‘Ÿ ⋆ 𝑑 ) ≀ ( 𝑠 1 / 𝑠 π‘Ÿ ⋆ ) 𝑑 𝜎 min βˆ’ 1 ( π‘ˆ ^ ⊀ 𝐺 ) 𝛼 / β€– 𝑋 ⋆ β€–

≀ 𝑠 1 𝑑 ​ 𝜎 min βˆ’ 1 ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ​ 𝛼 / β€– 𝑋 ⋆ β€–

≀ exp ⁑ ( 𝑑 ​ log ⁑ ( 𝑠 1 ) + 𝐿 ​ log ⁑ ( 𝐿 ​ πœ… ​ 𝑛 ) ) ​ 𝛼 / β€– 𝑋 ⋆ β€– ≀ ( 𝐿 ​ πœ… ​ 𝑛 ) βˆ’ 𝐿 / 2

(86)

where the second inequality follows from (84b), the penultimate inequality follows from our choice of 𝐿 which ensured 𝜎 min βˆ’ 1 ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ≀ ( 2 ​ 𝑛 ) 𝐿 / 2 , and the last inequality follows from (84a), our choice 𝑑 ≀ 2 ​ 𝐿 β€² πœ‚ ​ log ⁑ ( πœ… ​ 𝑛 ) and our assumption (12c) on 𝛼 which implies 𝛼 / β€– 𝑋 ⋆ β€– ≀ ( 2 ​ πœ… ​ 𝑛 ) βˆ’ 𝐢 𝛼 , given that 𝐢 𝛼 is sufficiently large, e.g. 𝐢 𝛼 β‰₯ 𝐢 ​ ( 𝐿 , 𝑐 πœ† , 𝑐 πœ‚ ) . It may also be inferred from the above arguments that 𝐿 can be made arbitrarily large by increasing 𝐢 𝛼 .

Combining the above arguments, we conclude that for any 𝑑 ∈ [ ( 𝐿 β€² / πœ‚ ) ​ log ⁑ ( πœ… ​ 𝑛 ) , ( 2 ​ 𝐿 β€² / πœ‚ ) ​ log ⁑ ( πœ… ​ 𝑛 ) ] , both of (85), (86) hold, hence the condition in (81) can be verified by

πœ™

𝛼 ​ β€– 𝐺 β€– ​ 𝑠 π‘Ÿ ⋆ + 1 𝑑 + β€– 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 β€– 𝛼 ​ 𝜎 min ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ​ 𝑠 π‘Ÿ ⋆ 𝑑

≀ 2 ​ ( 𝐿 ​ πœ… ​ 𝑛 ) βˆ’ 𝐿 / 2

(87)

≀ 𝑐 21 ​ πœ… βˆ’ 2 ,

by choosing 𝐿 sufficiently large.

This completes the verification of both assumptions of Lemma 21. Upon noting that the upper threshold of 𝑑 satisfies ( 2 ​ 𝐿 β€² / πœ‚ ) ​ log ⁑ ( πœ… ​ 𝑛 ) ≀ 𝑇 min / 16 , we will now invoke the conclusions of Lemma 21 to prove Lemma 19 for some 𝑑 ∈ [ ( 𝐿 β€² / πœ‚ ) ​ log ⁑ ( πœ… ​ 𝑛 ) , 𝑇 min / 16 ] .

Proof of bound (21).

This can be inferred from (82a) in the following way. Recalling that 𝜎 min ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) β‰₯ ( 2 ​ 𝑛 ) βˆ’ 𝐢 𝐺 on the event β„° , and 𝑠 π‘Ÿ ⋆ β‰₯ 1 by (84b), we obtain from (82a) that

𝜎 min ​ ( 𝑆 ~ 𝑑 1 ) β‰₯ 1 4 ​ 𝛼 ​ ( 2 ​ 𝑛 ) βˆ’ 𝐢 𝐺 β‰₯ 𝛼 2 / βˆ₯ 𝑋 ⋆ βˆ₯ ,

given the condition (12c) which guarantees

𝛼 / β€– 𝑋 ⋆ β€– ≀ ( 2 𝑛 ) βˆ’ 𝐢 𝛼 / πœ‚ ≀ ( 1 / 4 ) ( 2 𝑛 ) βˆ’ 𝐢 𝐺 ,

as long as πœ‚ ≀ 𝑐 πœ‚ ≀ 1 and 𝐢 𝛼 β‰₯ 𝐢 𝐺 + 2 . The proof is complete.

Proof of bound (22a).

We combine (82a), (82b), and (87) to obtain

βˆ₯ 𝑂 ~ 𝑑 1 βˆ₯ / 𝜎 min ( 𝑆 ~ 𝑑 1 ) ≀ 4 𝐢 21 πœ™ ≀ 4 𝐢 21 ( 𝐿 πœ… 𝑛 ) βˆ’ 𝐿 / 2 ≀ ( 𝐿 πœ… 𝑛 / 2 ) βˆ’ 𝐿 / 2 ,

where the last inequality follows from taking 𝐿 sufficiently large. We further note that (12b) implies

𝜎 min ​ ( 𝑆 ~ 𝑑 1 ) ≀ β€– Ξ£ ⋆ 2 + πœ† ​ 𝐼 β€– 1 / 2 ​ 𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 1 )

≀ ( 𝑐 πœ† + 1 ) 1 / 2 ​ β€– 𝑋 ⋆ β€– ​ 𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 1 )

≀ 2 ​ β€– 𝑋 ⋆ β€– ​ 𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 1 ) ,

assuming 𝑐 πœ† ≀ 1 , hence

βˆ₯ 𝑂 ~ 𝑑 1 βˆ₯ / 𝜎 min ( ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 1 ) ≀ 2 β€– 𝑋 ⋆ β€– ( 𝐿 πœ… 𝑛 / 2 ) βˆ’ 𝐿 / 2 ≀ ( 𝐢 3 . 𝑏 πœ… 𝑛 ) βˆ’ 𝐢 3 . 𝑏 β€– 𝑋 ⋆ β€– ,

as desired, with 𝐢 3 . 𝑏 = 𝐿 / 4 as long as 𝐿 is sufficiently large. It is also clear that 𝐢 3 . 𝑏 can be made arbitrarily large by enlarging 𝐢 𝛼 , as 𝐿 can be.

Proof of bound (22b).

We apply (82b) to yield

βˆ₯ 𝑂 ~ 𝑑 1 βˆ₯ ≀ 𝐢 21 ​ πœ™ ​ 𝛼 ​ 𝜎 min ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ​ 𝑠 π‘Ÿ ⋆ 𝑑 1 ≀ 𝐢 𝐺 ​ 𝐢 21 ​ ( 𝐿 ​ πœ… ​ 𝑛 ) βˆ’ 𝐿 / 2 ​ ( 1 + 2 ​ πœ‚ 𝑐 πœ† ) 𝑑 1 ​ 𝛼 ≀ 𝛼 5 / 6 ​ βˆ₯ 𝑋 ⋆ βˆ₯ 1 / 6 ,

where the second inequality follows from 𝜎 min ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ≀ β€– 𝐺 β€– ≀ 𝐢 𝐺 by assumption and from (84c); the last inequality follows from 𝑑 1 ≀ ( 2 ​ 𝐿 β€² / πœ‚ ) ​ log ⁑ ( πœ… ​ 𝑛 ) and from the condition (12c) on 𝛼 , provided that 𝐢 𝛼 is sufficiently large.

Proof of bound (22c).

We apply (82c) to yield that

βˆ₯ π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 + 1 βˆ₯ ≀ 𝐢 21 ​ ( 𝛾 + πœ™ ) ≀ 𝑐 𝛿 𝑐 πœ† ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ,

using the bounds of 𝛾 and πœ™ in (83) and (87), provided that 𝑐 πœ† ≀ 1 2 ​ min ⁑ ( 1 , 𝐢 21 βˆ’ 1 ) and 𝐿 β‰₯ 2 ​ ( 𝐢 𝛿 + 1 ) . To further bound β€– 𝑁 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 βˆ’ 1 ​ Ξ£ ⋆ β€– we need the following lemma.

Lemma 22.

Assume 𝑆 ~ 𝑑 is invertible, and at least one of the following is true: (i) β€– π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 β€– ≀ 1 / 4 ; (ii) β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– ≀ πœ… βˆ’ 1 ​ β€– 𝑋 ⋆ β€– / 4 . Then

πœ… βˆ’ 1 ​ β€– 𝑋 ⋆ β€– ​ β€– π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 β€– ≀ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– ≀ 2 ​ β€– 𝑋 ⋆ β€– ​ β€– π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 β€– .

The proof is postponed to Section C.1.2. Returning to the proof of bound (22c), the above lemma yields

β€– 𝑁 ~ 𝑑 + 1 𝑆 ~ 𝑑 + 1 βˆ’ 1 Ξ£ ⋆ β€– ≀ 2 ( 𝑐 𝛿 / 𝑐 πœ† ) β€– 𝑋 ⋆ β€– πœ… βˆ’ 2 𝐢 𝛿 / 3 ≀ 𝑐 3 β€– 𝑋 ⋆ β€– πœ… βˆ’ 2 𝐢 𝛿 / 3 ,

for some 𝑐 3 ≲ 𝑐 𝛿 / 𝑐 πœ† , as desired.

Proof of bound (22d).

We have

βˆ₯ 𝑆 ~ 𝑑 1 βˆ₯ = βˆ₯ π‘ˆ ⋆ ⊀ 𝑋 𝑑 1 𝑉 𝑑 1 βˆ₯ ≀ βˆ₯ 𝑋 𝑑 1 βˆ₯ ≀ βˆ₯ 𝑋 ⋆ βˆ₯ ,

where the last step follows from Lemma 20.

C.1.1 Proof of Lemma 20

We prove the claim (79) by induction and also show that β€– 𝑋 𝑑 β€– ≀ β€– 𝑋 ⋆ β€– follows from (79). For the base case 𝑑 = 0 , it holds by definition. Assume that (79) holds for some 𝑑 ≀ ( 1 / ( πœƒ πœ‚ ) ) log ⁑ ( πœ… 𝑛 ) βˆ’ 1 . We aim to prove that (i) βˆ₯ 𝑋 𝑑 βˆ₯ ≀ βˆ₯ 𝑋 ⋆ βˆ₯ and that (ii) the inequality (79) continues to hold for 𝑑 + 1 .

Proof of βˆ₯ 𝑋 𝑑 βˆ₯ ≀ βˆ₯ 𝑋 ⋆ βˆ₯ .

By the induction hypothesis we know

βˆ₯ 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 βˆ₯ ≀ 𝑑 ( 1 + ( πœ‚ / πœ† ) βˆ₯ π’œ βˆ— π’œ ( 𝑀 ⋆ ) βˆ₯ ) 𝑑 𝛼 2 / β€– 𝑋 ⋆ β€– .

In view of the constraint (78) on 𝛼 and the restriction 𝑑 ≀ 1 πœƒ ​ πœ‚ ​ log ⁑ ( πœ… ​ 𝑛 ) , we have

𝑑 ​ 𝛼 βˆ₯ 𝑋 ⋆ βˆ₯ ≀ 1 πœƒ ​ πœ‚ ​ log ⁑ ( πœ… ​ 𝑛 ) β‹… πœ‚ 𝐾 ​ 1 log ⁑ ( πœ… ​ 𝑛 )

1 𝐾 ​ πœƒ ≀ 1

as long as 𝐾

𝐾 ​ ( πœƒ , 𝑐 πœ† , 𝐢 𝐺 ) is sufficiently large. This further implies

β€– 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 β€– ≀ ( 𝑑 𝛼 / β€– 𝑋 ⋆ β€– ) ( 1 + ( πœ‚ / πœ† ) β€– π’œ βˆ— π’œ ( 𝑀 ⋆ ) β€– ) 𝑑 𝛼 ≀ ( 1 + ( πœ‚ / πœ† ) β€– π’œ βˆ— π’œ ( 𝑀 ⋆ ) β€– ) 𝑑 𝛼 .

On the other hand, since βˆ₯ 𝑋 0 βˆ₯ ≀ 𝐢 𝐺 ​ 𝛼 under the event β„° (cf. (18)), in view of (77), we have

β€– 𝑋 ^ 𝑑 β€– ≀ ( 1 + ( πœ‚ / πœ† ) β€– π’œ βˆ— π’œ ( 𝑀 ⋆ ) β€– ) 𝑑 β€– 𝑋 0 β€– ≀ 𝐢 𝐺 ( 1 + ( πœ‚ / πœ† ) β€– π’œ βˆ— π’œ ( 𝑀 ⋆ ) β€– ) 𝑑 𝛼 .

Thus for a large enough 𝐾 = 𝐾 ( πœƒ , 𝑐 πœ† , 𝐢 𝐺 ) , we have

β€– 𝑋 𝑑 β€– ≀ β€– 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 β€– + β€– 𝑋 ^ 𝑑 β€– ≀ ( 1 + ( πœ‚ / πœ† ) β€– π’œ βˆ— π’œ ( 𝑀 ⋆ ) β€– ) 𝑑 ( 𝐢 𝐺 + 1 ) 𝛼 ≀ ( 𝑐 πœ† / 200 ) β‹… πœ… βˆ’ 4 β€– 𝑋 ⋆ β€– ,

(88)

where the last inequality follows from the condition on 𝑑 and the choice of 𝛼 in (78):

log ⁑ ( βˆ₯ 𝑋 ⋆ βˆ₯ / 𝛼 ) β‰₯ log ⁑ ( 200 ( 𝐢 𝐺 + 1 ) πœ… 4 / 𝑐 πœ† ) + 𝑑 log ⁑ ( 1 + ( πœ‚ / πœ† ) β€– π’œ βˆ— π’œ ( 𝑀 ⋆ ) β€– ) .

The inequality (88) clearly implies βˆ₯ 𝑋 𝑑 βˆ₯ ≀ βˆ₯ 𝑋 ⋆ βˆ₯ .

Proof of (79) at the induction step.

The proof builds on a key recursive relation on βˆ₯ 𝑋 𝑑 + 1 βˆ’ 𝑋 ^ 𝑑 + 1 βˆ₯ , from which the induction follows readily from our assumption.

Step 1: building a recursive relation on βˆ₯ 𝑋 𝑑 + 1 βˆ’ 𝑋 ^ 𝑑 + 1 βˆ₯ .

By definition (77), we have 𝑋 ^ 𝑑 + 1 = ( 𝐼 + ( πœ‚ / πœ† ) π’œ βˆ— π’œ ( 𝑀 ⋆ ) ) 𝑋 ^ 𝑑 , which implies the following decomposition:

𝑋 𝑑 + 1 βˆ’ 𝑋 ^ 𝑑 + 1 = [ 𝑋 𝑑 + 1 βˆ’ ( 𝐼 + ( πœ‚ / πœ† ) π’œ βˆ— π’œ ( 𝑀 ⋆ ) ) 𝑋 𝑑 ] ⏟ ≕ 𝑇 1 + ( 𝐼 + ( πœ‚ / πœ† ) π’œ βˆ— π’œ ( 𝑀 ⋆ ) ) ( 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 ) ⏟ ≕ 𝑇 2 .

(89)

We shall control each term separately.

β€’ The second term 𝑇 2 can be trivially bounded as

β€– 𝑇 2 β€– = βˆ₯ ( 𝐼 + ( πœ‚ / πœ† ) π’œ βˆ— π’œ ( 𝑀 ⋆ ) ) ( 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 ) βˆ₯ ≀ ( 1 + ( πœ‚ / πœ† ) βˆ₯ π’œ βˆ— π’œ ( 𝑀 ⋆ ) βˆ₯ ) βˆ₯ 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 βˆ₯ .

(90)

β€’ Turning to the first term 𝑇 1 , by the update rule (7) of 𝑋 𝑑 + 1 and the triangle inequality, we further have

β€– 𝑇 1 β€– = β€– 𝑋 𝑑 + 1 βˆ’ ( 𝐼 + ( πœ‚ / πœ† ) π’œ βˆ— π’œ ( 𝑀 ⋆ ) ) 𝑋 𝑑 β€– ≀ β€– πœ‚ π’œ βˆ— π’œ ( 𝑋 𝑑 𝑋 𝑑 ⊀ ) 𝑋 𝑑 ( 𝑋 𝑑 ⊀ 𝑋 𝑑 + πœ† 𝐼 ) βˆ’ 1 β€– + β€– πœ‚ π’œ βˆ— π’œ ( 𝑀 ⋆ ) 𝑋 𝑑 ( ( 𝑋 𝑑 ⊀ 𝑋 𝑑 + πœ† 𝐼 ) βˆ’ 1 βˆ’ πœ† βˆ’ 1 𝐼 ) β€– .

(91)

Since β€– ( 𝑋 𝑑 ⊀ 𝑋 𝑑 + πœ† 𝐼 ) βˆ’ 1 β€– ≀ πœ† βˆ’ 1 , it follows that the first term in (91) can be bounded by

β€– πœ‚ ​ π’œ βˆ— ​ π’œ ​ ( 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ ) ​ 𝑋 𝑑 ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 β€– ≀ πœ‚ πœ† ​ β€– π’œ βˆ— ​ π’œ ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 ) β€– ​ β€– 𝑋 𝑑 β€– .

In addition, since ( 𝑐 πœ† / 200 β‹… πœ… βˆ’ 4 β€– 𝑋 ⋆ β€– ) 2 ≀ ( 1 / 200 ) 𝑐 πœ† πœ… βˆ’ 4 𝜎 min 2 ( 𝑋 ⋆ ) ≀ πœ† / 2 by the condition πœ† β‰₯ ( 1 / 100 ) πœ… βˆ’ 4 𝑐 πœ† 𝜎 min 2 ( 𝑋 ⋆ ) , we have by (88) that βˆ₯ 𝑋 𝑑 βˆ₯ 2 ≀ πœ† / 2 . Therefore, invoking Lemma 9 implies that

( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 βˆ’ πœ† βˆ’ 1 ​ 𝐼

πœ† βˆ’ 2 ​ 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 ​ 𝑄 , for some  ​ 𝑄 ​ with  ​ βˆ₯ 𝑄 βˆ₯ ≀ 2 .

As a result, the second term in (91) can be bounded by

β€– πœ‚ ​ π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) ​ 𝑋 𝑑 ​ ( ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 βˆ’ πœ† βˆ’ 1 ​ 𝐼 ) β€– ≀ 2 ​ πœ‚ πœ† 2 ​ β€– π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) β€– ​ β€– 𝑋 𝑑 β€– 3 .

Combining the above two inequalities leads to

β€– 𝑇 1 β€– ≀ πœ‚ πœ† ​ ( β€– π’œ βˆ— ​ π’œ ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 ) β€– + 2 πœ† ​ β€– π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) β€– ​ β€– 𝑋 𝑑 β€– 2 ) ​ β€– 𝑋 𝑑 β€– .

In view of Lemma 8, we know βˆ₯ π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) βˆ₯ ≲ π‘Ÿ ⋆ ​ βˆ₯ 𝑀 ⋆ βˆ₯ and βˆ₯ π’œ βˆ— ​ π’œ ​ ( 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ ) βˆ₯ ≲ π‘Ÿ ​ βˆ₯ 𝑋 𝑑 βˆ₯ 2 . Plugging these relations into the previous bound leads to

β€– 𝑇 1 β€– ≲ ( πœ‚ π‘Ÿ / πœ† ) ( 1 + βˆ₯ 𝑀 ⋆ βˆ₯ / πœ† ) βˆ₯ 𝑋 𝑑 βˆ₯ 3 ≲ ( πœ‚ πœ… 12 π‘Ÿ / β€– 𝑀 ⋆ β€– ) β€– 𝑋 𝑑 β€– 3 ,

(92)

where the last inequality follows from πœ† ≳ πœ… βˆ’ 4 ​ 𝜎 min 2 ​ ( 𝑋 ⋆ )

πœ… βˆ’ 6 ​ β€– 𝑀 ⋆ β€– (cf. (12b)).

Putting the bounds on 𝑇 1 and 𝑇 2 together leads to

βˆ₯ 𝑋 𝑑 + 1 βˆ’ 𝑋 ^ 𝑑 + 1 βˆ₯ ≀ ( 1 + ( πœ‚ / πœ† ) β€– π’œ βˆ— π’œ ( 𝑀 ⋆ ) β€– ) βˆ₯ 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 βˆ₯ + ( 𝐢 πœ‚ πœ… 12 π‘Ÿ / β€– 𝑀 ⋆ β€– ) β€– 𝑋 𝑑 β€– 3

(93)

for some constant 𝐢 = 𝐢 ( 𝑐 πœ† ) > 0 .

Step 2: finishing the induction.

By the bound of β€– 𝑋 𝑑 β€– in (88), it suffices to prove

𝑑 ​ ( 1 + πœ‚ πœ† ​ βˆ₯ π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) βˆ₯ ) 𝑑 + 1 ​ 𝛼 2 β€– 𝑋 ⋆ β€– + 𝐢 ​ ( 𝐢 𝐺 + 1 ) 3 ​ πœ‚ ​ πœ… 12 ​ π‘Ÿ β€– 𝑋 ⋆ β€– 2 ​ ( 1 + πœ‚ πœ† ​ βˆ₯ π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) βˆ₯ ) 3 ​ 𝑑 ​ 𝛼 3

≀ ( 𝑑 + 1 ) ​ ( 1 + πœ‚ πœ† ​ βˆ₯ π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) βˆ₯ ) 𝑑 + 1 ​ 𝛼 2 β€– 𝑋 ⋆ β€– .

This is equivalent to

𝐢 ​ ( 𝐢 𝐺 + 1 ) 3 ​ πœ‚ ​ πœ… 12 ​ π‘Ÿ ​ ( 1 + πœ‚ πœ† ​ β€– π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) β€– ) 2 ​ 𝑑 βˆ’ 1 ≀ β€– 𝑋 ⋆ β€– 𝛼 ,

which again follows readily from our assumption 𝑑 ≀ 1 πœƒ ​ πœ‚ ​ log ⁑ ( πœ… ​ 𝑛 ) and the assumption (78) on 𝛼 which implies

log ⁑ ( β€– 𝑋 ⋆ β€– / 𝛼 ) β‰₯ ( 2 𝑑 βˆ’ 1 ) log ⁑ ( 1 + ( πœ‚ / πœ† ) β€– π’œ βˆ— π’œ ( 𝑀 ⋆ ) β€– ) + 12 log ⁑ πœ… + log ⁑ 𝑛 + 𝐾 β‰₯ ( 2 𝑑 βˆ’ 1 ) log ⁑ ( 1 + ( πœ‚ / πœ† ) β€– π’œ βˆ— π’œ ( 𝑀 ⋆ ) β€– ) + 12 log ⁑ ( 𝑛 πœ… π‘Ÿ ) + log ⁑ ( 𝐢 ( 𝐢 𝐺 + 1 ) 3 ) ,

provided 𝐾 = 𝐾 ( πœƒ , 𝑐 πœ† , 𝐢 𝐺 ) is sufficiently large. The proof is complete.

C.1.2 Proof of Lemma 22

We begin with the following observation:

𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1

π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 ​ Ξ£ 𝑋 ~ 𝑑 ​ 𝑉 𝑋 ~ 𝑑 ⊀ ​ 𝑉 𝑋 ~ 𝑑 ​ Ξ£ 𝑋 ~ 𝑑 βˆ’ 1 ​ ( π‘ˆ ⋆ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 ) βˆ’ 1

π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 ​ ( π‘ˆ ⋆ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 ) βˆ’ 1

(94)

where we use: (i) 𝑁 ~ 𝑑 = π‘ˆ ⋆ , βŸ‚ ⊀ ( π‘ˆ 𝑋 ~ 𝑑 Ξ£ 𝑋 ~ 𝑑 𝑉 𝑋 ~ 𝑑 ⊀ ) and 𝑆 ~ 𝑑 = π‘ˆ ⋆ ⊀ π‘ˆ 𝑋 ~ 𝑑 Ξ£ 𝑋 ~ 𝑑 𝑉 𝑋 ~ 𝑑 ⊀ ; (ii) 𝑋 ~ 𝑑 has full column rank π‘Ÿ ⋆ since 𝑆 ~ 𝑑 is invertible, and hence 𝑉 𝑋 ~ 𝑑 has rank π‘Ÿ ⋆ and Ξ£ 𝑋 ~ 𝑑 , π‘ˆ ⋆ ⊀ π‘ˆ 𝑋 ~ 𝑑 are also invertible.

We will show that the above quantity is small if (and only if) π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 is small.

Turning to the proof, we first show that (ii) implies (i), thus it suffices to prove the lemma under the condition (i). In fact, by virtue of (94) we have

β€– π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 β€– ≀ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 β€– ​ β€– π‘ˆ ⋆ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 β€– ≀ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 β€– ≀ 𝜎 min ​ ( 𝑋 ⋆ ) βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– ,

where we used β€– π‘ˆ ⋆ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 β€– ≀ β€– π‘ˆ ⋆ β€– ​ β€– π‘ˆ 𝑋 ~ 𝑑 β€– ≀ 1 . Consequently, β€– π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 β€– ≀ 1 / 4 if β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– ≀ πœ… βˆ’ 1 ​ β€– 𝑋 ⋆ β€– / 4 , as claimed.

We proceed to show that the conclusion holds assuming condition (i). The first inequality has already been established above. For the second inequality, using (94) again, it suffices to prove β€– ( π‘ˆ ⋆ ⊀ π‘ˆ 𝑋 ~ 𝑑 ) βˆ’ 1 β€– ≀ 2 , which is in turn equivalent to 𝜎 min ( π‘ˆ ⋆ ⊀ π‘ˆ 𝑋 ~ 𝑑 ) β‰₯ 1 / 2 . Now note that π‘ˆ 𝑋 ~ 𝑑 = π‘ˆ ⋆ π‘ˆ ⋆ ⊀ π‘ˆ 𝑋 ~ 𝑑 + π‘ˆ ⋆ , βŸ‚ π‘ˆ ⋆ , βŸ‚ ⊀ π‘ˆ 𝑋 ~ 𝑑 , thus

𝜎 min ​ ( π‘ˆ ⋆ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 )

𝜎 π‘Ÿ ⋆ ​ ( π‘ˆ ⋆ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 )

β‰₯ 𝜎 π‘Ÿ ⋆ ​ ( π‘ˆ ⋆ ​ π‘ˆ ⋆ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 )

β‰₯ 𝜎 π‘Ÿ ⋆ ​ ( π‘ˆ 𝑋 ~ 𝑑 ) βˆ’ β€– π‘ˆ ⋆ , βŸ‚ ​ π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 β€–

β‰₯ 1 βˆ’ β€– π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 β€– β‰₯ 3 / 4 .

In the last line, we used 𝜎 π‘Ÿ ⋆ ( π‘ˆ 𝑋 ~ 𝑑 ) = 1 , which follows from π‘ˆ 𝑋 ~ 𝑑 being an 𝑛 Γ— π‘Ÿ ⋆ orthonormal matrix, and the assumption (i). This completes the proof.
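The observation (94) underlying this proof is purely algebraic and can be verified numerically; the snippet below (illustrative only, with a generic random instance) checks it for a rank- π‘Ÿ ⋆ matrix.

```python
import numpy as np

# Numerical check of the observation (94): writing S~ = U_star^T Xt and
# N~ = U_star_perp^T Xt for a generic rank-r matrix Xt, one has
#   N~ S~^{-1} = (U_star_perp^T U_X) (U_star^T U_X)^{-1},
# where U_X is an orthonormal basis of the column space of Xt.
rng = np.random.default_rng(3)
n, r = 6, 2
Q_full = np.linalg.qr(rng.standard_normal((n, n)))[0]
U_star, U_star_perp = Q_full[:, :r], Q_full[:, r:]
Xt = rng.standard_normal((n, r))                  # generic, rank r
U_X = np.linalg.qr(Xt)[0]                         # orthonormal column basis
S = U_star.T @ Xt
N = U_star_perp.T @ Xt
lhs = N @ np.linalg.inv(S)
rhs = (U_star_perp.T @ U_X) @ np.linalg.inv(U_star.T @ U_X)
assert np.allclose(lhs, rhs)
```

Writing 𝑋 𝑑 = π‘ˆ 𝑋 𝑅 for an invertible 𝑅 , the factor 𝑅 cancels in 𝑁 ~ 𝑆 ~ βˆ’ 1 , which is exactly the content of (94).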

C.2 Establishing the induction step

The claimed invertibility of 𝑆 ~ 𝑑 follows from induction and from Lemma 4. In fact, by (21) we know 𝑆 ~ 𝑑 1 is invertible, and by Lemma 4 we know that if 𝑆 ~ 𝑑 is invertible, 𝑆 ~ 𝑑 + 1 would also be invertible since 𝑆 ~ 𝑑 (resp. 𝑆 ~ 𝑑 + 1 ) has the same invertibility as ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 ​ 𝑆 ~ 𝑑 (resp. ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 ​ 𝑆 ~ 𝑑 + 1 ). For the rest of the proof we focus on establishing (22) by induction.

For the induction step we need to understand the one-step behaviors of β€– 𝑂 ~ 𝑑 β€– , β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– , and β€– 𝑆 ~ 𝑑 β€– , which are supplied by the following lemmas.

Lemma 23.

For any 𝑑 such that (22) holds,

β€– 𝑂 ~ 𝑑 + 1 β€– ≀ ( 1 + πœ‚ / ( 12 𝐢 max πœ… ) ) β€– 𝑂 ~ 𝑑 β€– .

(95)

Lemma 24.

For any 𝑑 such that (22) holds, setting 𝑍 𝑑 ≔ Ξ£ ⋆ βˆ’ 1 ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) Ξ£ ⋆ βˆ’ 1 , there exists some universal constant 𝐢 24 > 0 such that

β€– | 𝑁 ~ 𝑑 + 1 𝑆 ~ 𝑑 + 1 βˆ’ 1 Ξ£ ⋆ | β€– ≀ ( 1 βˆ’ πœ‚ / ( 3 ( β€– 𝑍 𝑑 β€– + πœ‚ ) ) ) β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + πœ‚ 𝐢 24 πœ… 6 / ( 𝑐 πœ† β€– 𝑋 ⋆ β€– ) β‹… β€– | π‘ˆ ⋆ ⊀ Ξ” 𝑑 | β€– + πœ‚ ( β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ) 1 / 2 β€– 𝑋 ⋆ β€– .

(96)

In particular, if 𝑐 3 β‰₯ 100 𝐢 24 ( 𝐢 3 . π‘Ž + 1 ) 4 𝑐 𝛿 / 𝑐 πœ† , then β€– 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ β€– ≀ 𝑐 3 πœ… βˆ’ 𝐢 𝛿 / 2 β€– 𝑋 ⋆ β€– implies β€– 𝑁 ~ 𝑑 + 1 𝑆 ~ 𝑑 + 1 βˆ’ 1 Ξ£ ⋆ β€– ≀ 𝑐 3 πœ… βˆ’ 𝐢 𝛿 / 2 β€– 𝑋 ⋆ β€– .

Lemma 25.

For any 𝑑 such that (22) holds,

β€– 𝑆 ~ 𝑑 + 1 β€– ≀ ( 1 βˆ’ πœ‚ / 2 ) β€– 𝑆 ~ 𝑑 β€– + 100 𝑐 πœ† βˆ’ 1 / 2 πœ‚ πœ… 3 β€– 𝑋 ⋆ β€– .

(97)

In particular, if 𝐢 3 . π‘Ž

200 ​ 𝑐 πœ† βˆ’ 1 / 2 , then β€– 𝑆 ~ 𝑑 β€– ≀ 𝐢 3 . π‘Ž ​ πœ… 3 ​ β€– 𝑋 ⋆ β€– implies β€– 𝑆 ~ 𝑑 + 1 β€– ≀ 𝐢 3 . π‘Ž ​ πœ… 3 ​ β€– 𝑋 ⋆ β€– .

We now return to the induction step. Recall that we need to show that (22a)–(22d) hold for 𝑑 + 1 . It is clear that (22b)–(22d) hold for 𝑑 + 1 by the induction hypothesis and the above lemmas. It remains to prove (22a). To this end we distinguish two cases: 𝜎 min ( ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 ) ≀ 1 / 3 and 𝜎 min ( ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 ) > 1 / 3 . In the former case, (22a) for 𝑑 + 1 follows from Lemma 23 and Lemma 4 (to be proved in Appendix D.1), which imply (provided 𝐢 max β‰₯ 2 )

β€– 𝑂 ~ 𝑑 + 1 β€– / 𝜎 min ( ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 + 1 ) ≀ ( 1 + πœ‚ / ( 4 𝐢 max πœ… ) ) / ( 1 + πœ‚ / 8 ) β‹… β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ( ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 ) ≀ β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ( ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 ) ,

as desired. In the latter case where 𝜎 min ( ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 ) > 1 / 3 , one may apply the first part of Lemma 4 to deduce that 𝜎 min ( ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 + 1 ) β‰₯ 1 / 10 (given that πœ‚ ≀ 𝑐 πœ‚ for some sufficiently small constant 𝑐 πœ‚ ). This, combined with (22b) for 𝑑 + 1 (already proved), yields the desired inequality (22a) for 𝑑 + 1 , given our assumption (12c) on the smallness of 𝛼 . This completes the proof.

C.2.1 Proof of Lemma 23

If π‘Ÿ

π‘Ÿ ⋆ , then we have β€– 𝑂 ~ 𝑑 β€–

0 for all 𝑑 β‰₯ 0 . The conclusion follows trivially. Therefore, we only consider the case when π‘Ÿ

π‘Ÿ ⋆ . By definition, we have

𝑂 ~ 𝑑 + 1 = 𝑁 𝑑 + 1 𝑉 𝑑 + 1 , βŸ‚ = 𝑁 𝑑 + 1 𝑉 𝑑 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 , βŸ‚ + 𝑁 𝑑 + 1 𝑉 𝑑 , βŸ‚ 𝑉 𝑑 , βŸ‚ ⊀ 𝑉 𝑑 + 1 , βŸ‚ = βˆ’ 𝑁 𝑑 + 1 𝑉 𝑑 ( 𝑆 𝑑 + 1 𝑉 𝑑 ) βˆ’ 1 𝑆 𝑑 + 1 𝑉 𝑑 , βŸ‚ 𝑉 𝑑 , βŸ‚ ⊀ 𝑉 𝑑 + 1 , βŸ‚ + 𝑁 𝑑 + 1 𝑉 𝑑 , βŸ‚ 𝑉 𝑑 , βŸ‚ ⊀ 𝑉 𝑑 + 1 , βŸ‚ ,

where the last equality uses the fact that 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 , βŸ‚ = βˆ’ ( 𝑆 𝑑 + 1 𝑉 𝑑 ) βˆ’ 1 𝑆 𝑑 + 1 𝑉 𝑑 , βŸ‚ 𝑉 𝑑 , βŸ‚ ⊀ 𝑉 𝑑 + 1 , βŸ‚ . To see this, note that

𝑆 𝑑 + 1 𝑉 𝑑 + 1 , βŸ‚ = 0 ⟹ 𝑆 𝑑 + 1 𝑉 𝑑 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 , βŸ‚ = βˆ’ 𝑆 𝑑 + 1 𝑉 𝑑 , βŸ‚ 𝑉 𝑑 , βŸ‚ ⊀ 𝑉 𝑑 + 1 , βŸ‚ .

Left-multiplying both sides by ( 𝑆 𝑑 + 1 𝑉 𝑑 ) βˆ’ 1 yields the desired identity. Note that the invertibility of 𝑆 𝑑 + 1 𝑉 𝑑 follows from the invertibility of 𝑆 ~ 𝑑 by inserting 𝑄 = 0 in Lemma 14.

By Lemma 13, we immediately obtain that 𝑆 𝑑 + 1 𝑉 𝑑 , βŸ‚ = πœ‚ 𝐸 𝑑 𝑏 𝑉 𝑑 , βŸ‚ and 𝑁 𝑑 + 1 𝑉 𝑑 , βŸ‚ = 𝑂 ~ 𝑑 + πœ‚ 𝐸 𝑑 𝑑 𝑉 𝑑 , βŸ‚ , where β€– 𝐸 𝑑 𝑏 β€– ∨ β€– 𝐸 𝑑 𝑑 β€– ≀ 1 / ( 24 𝐢 max πœ… ) β‹… β€– 𝑂 ~ 𝑑 β€– . Assume for now that

β€– 𝑁 𝑑 + 1 ​ 𝑉 𝑑 ​ ( 𝑆 𝑑 + 1 ​ 𝑉 𝑑 ) βˆ’ 1 β€– ≀ 1 .

(98)

In addition, since β€– 𝑉 𝑑 , βŸ‚ ⊀ 𝑉 𝑑 + 1 , βŸ‚ β€– ≀ 1 (both factors being orthonormal matrices), we have

β€– 𝑂 ~ 𝑑 + 1 β€– ≀ β€– 𝑂 ~ 𝑑 β€– + πœ‚ β€– 𝑁 𝑑 + 1 𝑉 𝑑 ( 𝑆 𝑑 + 1 𝑉 𝑑 ) βˆ’ 1 β€– β€– 𝐸 𝑑 𝑏 β€– + πœ‚ β€– 𝐸 𝑑 𝑑 β€– ≀ ( 1 + πœ‚ / ( 12 𝐢 max πœ… ) ) β€– 𝑂 ~ 𝑑 β€– ,

as desired. It remains to prove (98).

Proof of bound (98).

This can be done by plugging 𝑄 = 0 into Lemma 15 and bounding the resulting expression. This (in fact, a much stronger inequality) will be done in detail in the proof of Lemma 24, presented in Section C.2.2. Indeed, the resulting expression is the same as (103) there (albeit with different values of 𝐸 𝑑 14 . π‘Ž , 𝐸 𝑑 15 . π‘Ž , 𝐸 𝑑 15 . 𝑏 , which do not affect the proof). Following the same strategy used to control (103) there, we may show that β€– 𝑁 𝑑 + 1 𝑉 𝑑 ( 𝑆 𝑑 + 1 𝑉 𝑑 ) βˆ’ 1 Ξ£ ⋆ β€– enjoys the same bound (108) as β€– 𝑁 ~ 𝑑 + 1 𝑆 ~ 𝑑 + 1 βˆ’ 1 Ξ£ ⋆ β€– , the right-hand side of which is less than πœ… βˆ’ 1 β€– 𝑋 ⋆ β€– = β€– Ξ£ ⋆ βˆ’ 1 β€– βˆ’ 1 given (22c) and (22d). Thus β€– 𝑁 𝑑 + 1 𝑉 𝑑 ( 𝑆 𝑑 + 1 𝑉 𝑑 ) βˆ’ 1 β€– ≀ β€– 𝑁 𝑑 + 1 𝑉 𝑑 ( 𝑆 𝑑 + 1 𝑉 𝑑 ) βˆ’ 1 Ξ£ ⋆ β€– β€– Ξ£ ⋆ βˆ’ 1 β€– ≀ 1 as claimed.

C.2.2 Proof of Lemma 24

Denoting 𝑋 ~ 𝑑 ≔ 𝑋 𝑑 𝑉 𝑑 , we have 𝑁 ~ 𝑑 = π‘ˆ ⋆ , βŸ‚ ⊀ 𝑋 ~ 𝑑 and 𝑆 ~ 𝑑 = π‘ˆ ⋆ ⊀ 𝑋 ~ 𝑑 . Suppose for the moment that

β€– ( 𝑉 𝑑 ⊀ ​ 𝑉 𝑑 + 1 ) βˆ’ 1 β€– ≀ 2 ,

(99)

whose proof is deferred to the end of this section. We can write the update equation of 𝑋 ~ 𝑑 as

𝑋 ~ 𝑑 + 1 = 𝑋 𝑑 + 1 𝑉 𝑑 + 1 = 𝑋 𝑑 + 1 𝑉 𝑑 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 + 𝑋 𝑑 + 1 𝑉 𝑑 , βŸ‚ 𝑉 𝑑 , βŸ‚ ⊀ 𝑉 𝑑 + 1 = ( 𝑋 𝑑 + 1 𝑉 𝑑 + 𝑋 𝑑 + 1 𝑉 𝑑 , βŸ‚ 𝑉 𝑑 , βŸ‚ ⊀ 𝑉 𝑑 + 1 ( 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 ) βˆ’ 1 ) 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 .

(100)

Left-multiplying both sides of (100) by π‘ˆ ⋆ , βŸ‚ ⊀ (or π‘ˆ ⋆ ⊀ ), we obtain

𝑁 ~ 𝑑 + 1 = ( 𝑁 𝑑 + 1 𝑉 𝑑 + 𝑁 𝑑 + 1 𝑉 𝑑 , βŸ‚ 𝑄 ) 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 ,

(101a)

𝑆 ~ 𝑑 + 1 = ( 𝑆 𝑑 + 1 𝑉 𝑑 + 𝑆 𝑑 + 1 𝑉 𝑑 , βŸ‚ 𝑄 ) 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 ,

(101b)

where we define 𝑄 ≔ 𝑉 𝑑 , βŸ‚ ⊀ ​ 𝑉 𝑑 + 1 ​ ( 𝑉 𝑑 ⊀ ​ 𝑉 𝑑 + 1 ) βˆ’ 1 . Consequently, we arrive at

𝑁 ~ 𝑑 + 1 𝑆 ~ 𝑑 + 1 βˆ’ 1 = ( 𝑁 𝑑 + 1 𝑉 𝑑 + 𝑁 𝑑 + 1 𝑉 𝑑 , βŸ‚ 𝑄 ) ( 𝑆 𝑑 + 1 𝑉 𝑑 + 𝑆 𝑑 + 1 𝑉 𝑑 , βŸ‚ 𝑄 ) βˆ’ 1 .

(102)

Since β€– 𝑄 β€– ≀ 2 (which is an immediate implication of (99)), we can invoke Lemma 15 to obtain

𝑁 ~ 𝑑 + 1 𝑆 ~ 𝑑 + 1 βˆ’ 1 Ξ£ ⋆ = 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 ( 𝐼 + πœ‚ 𝐸 𝑑 15 . π‘Ž ) 𝐴 𝑑 ( 𝐴 𝑑 + πœ‚ Ξ£ ⋆ 2 ) βˆ’ 1 ( 𝐼 + πœ‚ 𝐸 𝑑 14 ) βˆ’ 1 Ξ£ ⋆ + πœ‚ 𝐸 𝑑 15 . 𝑏 Ξ£ ⋆ = 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ ( 𝐼 + πœ‚ Ξ£ ⋆ βˆ’ 1 𝐸 𝑑 15 . π‘Ž Ξ£ ⋆ ) 𝐻 𝑑 ( 𝐻 𝑑 + πœ‚ 𝐼 ) βˆ’ 1 ( 𝐼 + πœ‚ Ξ£ ⋆ βˆ’ 1 𝐸 𝑑 14 Ξ£ ⋆ ) βˆ’ 1 + πœ‚ 𝐸 𝑑 15 . 𝑏 Ξ£ ⋆ ,

(103)

where for simplicity of notation, we denote

𝐴 𝑑 ≔ ( 1 βˆ’ πœ‚ ) ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 , and 𝐻 𝑑 ≔ Ξ£ ⋆ βˆ’ 1 ​ 𝐴 𝑑 ​ Ξ£ ⋆ βˆ’ 1 .

In addition, we have

β€– 𝐸 𝑑 14 β€– + β€– 𝐸 𝑑 15 . π‘Ž β€–

≀ 1 64 ​ πœ… 5 ,

β€– | 𝐸 𝑑 15 . 𝑏 | β€–

≀ 800 ​ 𝑐 πœ† βˆ’ 1 ​ πœ… 2 ​ β€– 𝑋 ⋆ β€– βˆ’ 2 ​ β€– | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 | ​ β€– + 1 64 ​ ( 𝐢 3 . π‘Ž + 1 ) 2 ​ πœ… 5 ​ β€– 𝑋 ⋆ β€– β€– ​ | 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ | β€– + 1 64 ​ ( β€– 𝑂 ~ 𝑑 β€– 𝜎 min ​ ( 𝑆 ~ 𝑑 ) ) 2 / 3 .

Moreover, it is clear that πœ‚ ≀ 𝑐 πœ‚ ≀ 1 ≀ πœ… 4 since πœ… β‰₯ 1 , and that β€– 𝐻 𝑑 β€– ≀ πœ… 2 ​ ( 1 + β€– 𝑆 ~ 𝑑 β€– 2 / β€– 𝑋 ⋆ β€– 2 ) ≀ ( 𝐢 3 . π‘Ž + 1 ) 2 ​ πœ… 4 . Hence we have

β€– 𝐻 𝑑 β€– + πœ‚ ≀ 2 ​ ( 𝐢 3 . π‘Ž + 1 ) 2 ​ πœ… 4

which implies

β€– 𝐸 𝑑 14 β€– + β€– 𝐸 𝑑 15 . π‘Ž β€– ≀ 1 24 ​ πœ… ​ 1 β€– 𝐻 𝑑 β€– + πœ‚ .

(104)

Similarly we may also show

β€– | 𝐸 𝑑 15 . 𝑏 | β€– ≀ 800 𝑐 πœ† βˆ’ 1 πœ… 2 β€– 𝑋 ⋆ β€– βˆ’ 2 β€– | π‘ˆ ⋆ ⊀ Ξ” 𝑑 | β€– + 1 / ( 12 ( β€– 𝐻 𝑑 β€– + πœ‚ ) β€– 𝑋 ⋆ β€– ) β‹… β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + ( 1 / 2 ) ( β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ) 2 / 3 .

(105)

Since 𝐻 𝑑 is obviously positive definite, we have

β€– 𝐻 𝑑 ​ ( 𝐻 𝑑 + πœ‚ ​ 𝐼 ) βˆ’ 1 β€– ≀ 1 βˆ’ πœ‚ β€– 𝐻 𝑑 β€– + πœ‚ .

(106)

Thus

β€– | 𝑁 ~ 𝑑 + 1 𝑆 ~ 𝑑 + 1 βˆ’ 1 Ξ£ ⋆ | β€–

≀ ( 1 βˆ’ πœ‚ / ( β€– 𝐻 𝑑 β€– + πœ‚ ) ) ( 1 βˆ’ πœ‚ πœ… β€– 𝐸 𝑑 14 β€– ) βˆ’ 1 ( 1 + πœ‚ πœ… β€– 𝐸 𝑑 15 . π‘Ž β€– ) β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + πœ‚ β€– | 𝐸 𝑑 15 . 𝑏 | β€– β€– 𝑋 ⋆ β€–

≀ ( 1 βˆ’ πœ‚ / ( β€– 𝐻 𝑑 β€– + πœ‚ ) ) ( 1 + ( 1 / 12 ) πœ‚ / ( β€– 𝐻 𝑑 β€– + πœ‚ ) ) 2 β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + 800 πœ‚ πœ… 2 / ( 𝑐 πœ† β€– 𝑋 ⋆ β€– ) β‹… β€– | π‘ˆ ⋆ ⊀ Ξ” 𝑑 | β€– + ( 1 / 12 ) πœ‚ / ( β€– 𝐻 𝑑 β€– + πœ‚ ) β‹… β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + ( 1 / 2 ) πœ‚ ( β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ) 2 / 3 β€– 𝑋 ⋆ β€–

≀ ( 1 βˆ’ ( 5 / 6 ) πœ‚ / ( β€– 𝐻 𝑑 β€– + πœ‚ ) ) β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + ( 1 / 12 ) πœ‚ / ( β€– 𝐻 𝑑 β€– + πœ‚ ) β‹… β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + 800 πœ‚ πœ… 2 / ( 𝑐 πœ† β€– 𝑋 ⋆ β€– ) β‹… β€– | π‘ˆ ⋆ ⊀ Ξ” 𝑑 | β€– + ( 1 / 2 ) πœ‚ ( β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ) 2 / 3 β€– 𝑋 ⋆ β€–

≀ ( 1 βˆ’ ( 3 / 4 ) πœ‚ / ( β€– 𝐻 𝑑 β€– + πœ‚ ) ) β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + 800 πœ‚ πœ… 2 / ( 𝑐 πœ† β€– 𝑋 ⋆ β€– ) β‹… β€– | π‘ˆ ⋆ ⊀ Ξ” 𝑑 | β€– + ( 1 / 2 ) πœ‚ ( β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ) 2 / 3 β€– 𝑋 ⋆ β€–

≀ ( 1 βˆ’ ( 3 / 4 ) πœ‚ / ( β€– 𝑍 𝑑 β€– + πœ‚ ) ) β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + 800 πœ‚ πœ… 2 / ( 𝑐 πœ† β€– 𝑋 ⋆ β€– ) β‹… β€– | π‘ˆ ⋆ ⊀ Ξ” 𝑑 | β€– + ( 1 / 2 ) πœ‚ ( β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ) 2 / 3 β€– 𝑋 ⋆ β€– ,

(107)

where in the second inequality we used ( 1 βˆ’ π‘₯ ) βˆ’ 1 ≀ 1 + 2 π‘₯ for 0 ≀ π‘₯ ≀ 1 / 2 , in the penultimate inequality we used the elementary fact ( 1 βˆ’ π‘₯ ) ( 1 + π‘₯ / 12 ) 2 ≀ 1 βˆ’ ( 5 / 6 ) π‘₯ for π‘₯ ∈ [ 0 , 1 ] , and in the last inequality we used the fact

β€– 𝐻 𝑑 β€– = β€– Ξ£ ⋆ βˆ’ 1 ( ( 1 βˆ’ πœ‚ ) 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) Ξ£ ⋆ βˆ’ 1 β€– ≀ β€– Ξ£ ⋆ βˆ’ 1 ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) Ξ£ ⋆ βˆ’ 1 β€– = β€– 𝑍 𝑑 β€– .

The desired inequality (96) follows from the above inequality by setting 𝐢 24 = 800 .

For the remaining claim, we apply the conclusion of the first part with | | | β‹… | | | taken to be the spectral norm βˆ₯ β‹… βˆ₯ . Then we note the following bounds:

(i)

β€– 𝑍 𝑑 β€– ≀ β€– Ξ£ ⋆ βˆ’ 1 β€– 2 ​ ( β€– 𝑆 ~ 𝑑 β€– 2 + πœ† ) ≀ ( 𝐢 3 . π‘Ž + 1 ) 2 ​ πœ… 4 by (22d) and (12b) (since we may choose 𝑐 πœ† ≀ 1 );

(ii)

πœ‚ ≀ 𝑐 πœ‚ ≀ ( 𝐢 3 . π‘Ž + 1 ) 2 ​ πœ… 4 ;

(iii)

β€– π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€– ≀ β€– Ξ” 𝑑 β€– ≀ 16 ​ ( 𝐢 3 . π‘Ž + 1 ) 2 ​ 𝑐 𝛿 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ​ β€– 𝑋 ⋆ β€– 2 by Lemma 12;

(iv)

( β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ​ ( 𝑆 ~ 𝑑 ) ) 1 / 2 ≀ 𝑐 𝛿 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 by (22a), if we choose 𝐢 𝛼 β‰₯ 3 ​ 𝑐 𝛿 βˆ’ 1 + 3 ​ 𝐢 𝛿 + 3 .

These together imply

β€– 𝑁 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 βˆ’ 1 ​ Ξ£ ⋆ β€– ≀ ( 1 βˆ’ πœ‚ 6 ​ ( 𝐢 3 . π‘Ž + 1 ) 2 ​ πœ… 4 ) ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– + πœ‚ ​ 16 ​ 𝐢 24 ​ πœ… 2 𝑐 πœ† ​ ( 𝐢 3 . π‘Ž + 1 ) 2 ​ 𝑐 𝛿 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ​ β€– 𝑋 ⋆ β€– + πœ‚ ​ 𝑐 𝛿 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ​ β€– 𝑋 ⋆ β€– .

(108)

The conclusion follows easily by plugging in β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– ≀ 𝑐 3 ​ πœ… βˆ’ 𝐢 𝛿 / 2 ​ β€– 𝑋 ⋆ β€– and using πœ… 6 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ≀ πœ… βˆ’ 𝐢 𝛿 / 2 when 𝐢 𝛿 is sufficiently large.

Proof of bound (99).

First, we observe that it is equivalent to show that 𝜎 min ( 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 ) β‰₯ 1 / 2 . But from 𝑉 𝑑 + 1 𝑉 𝑑 + 1 ⊀ + 𝑉 𝑑 + 1 , βŸ‚ 𝑉 𝑑 + 1 , βŸ‚ ⊀ = 𝐼 we have

𝜎 min ( 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 ) = 𝜎 π‘Ÿ ⋆ ( 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 ) β‰₯ 𝜎 π‘Ÿ ⋆ ( 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 𝑉 𝑑 + 1 ⊀ ) = 𝜎 π‘Ÿ ⋆ ( 𝑉 𝑑 ⊀ βˆ’ 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 , βŸ‚ 𝑉 𝑑 + 1 , βŸ‚ ⊀ ) β‰₯ 𝜎 π‘Ÿ ⋆ ( 𝑉 𝑑 ⊀ ) βˆ’ β€– 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 , βŸ‚ 𝑉 𝑑 + 1 , βŸ‚ ⊀ β€– β‰₯ 1 βˆ’ β€– 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 , βŸ‚ β€– ,

where the last inequality follows from 𝜎 π‘Ÿ ⋆ ​ ( 𝑉 𝑑 ⊀ )

1 (since 𝑉 𝑑 ∈ ℝ π‘Ÿ Γ— π‘Ÿ ⋆ is orthonormal) and from that β€– 𝑉 𝑑 ⊀ ​ 𝑉 𝑑 + 1 , βŸ‚ ​ 𝑉 𝑑 + 1 , βŸ‚ ⊀ β€– ≀ β€– 𝑉 𝑑 ⊀ ​ 𝑉 𝑑 + 1 , βŸ‚ β€– . This implies that, to show 𝜎 min ​ ( 𝑉 𝑑 ⊀ ​ 𝑉 𝑑 + 1 ) β‰₯ 1 / 2 , it suffices to prove β€– 𝑉 𝑑 ⊀ ​ 𝑉 𝑑 + 1 , βŸ‚ β€– ≀ 1 / 2 .

Next we prove that β€– 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 , βŸ‚ β€– ≀ 1 / 2 . Recall that by definition we have 𝑆 𝑑 + 1 𝑉 𝑑 + 1 , βŸ‚ = 0 . Right-multiplying both sides of (53a) by 𝑉 𝑑 + 1 , βŸ‚ , we obtain

0 = ( ( 1 βˆ’ πœ‚ ) 𝐼 + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 + 𝐸 𝑑 π‘Ž ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 ) 𝑆 ~ 𝑑 ( 𝑉 𝑑 ⊀ 𝑉 𝑑 + 1 , βŸ‚ ) + πœ‚ 𝐸 𝑑 𝑏 𝑉 𝑑 + 1 , βŸ‚ ,

hence

β€– 𝑉 𝑑 ⊀ ​ 𝑉 𝑑 + 1 , βŸ‚ β€– ≀ πœ‚ ​ β€– 𝐸 𝑑 𝑏 ​ 𝑉 𝑑 + 1 , βŸ‚ β€– ​ β€– 𝑆 ~ 𝑑 βˆ’ 1 β€– ​ β€– ( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 + 𝐸 𝑑 π‘Ž ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ) βˆ’ 1 β€– .

By (54b) we have

β€– 𝐸 𝑑 𝑏 𝑉 𝑑 + 1 , βŸ‚ β€– β€– 𝑆 ~ 𝑑 βˆ’ 1 β€– ≀ β€– 𝐸 𝑑 𝑏 β€– / 𝜎 min ( 𝑆 ~ 𝑑 ) ≀ 1 / ( 10 πœ… ) ,

thus it suffices to show

πœ‚ ​ β€– ( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 + 𝐸 𝑑 π‘Ž ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ) βˆ’ 1 β€– ≀ 5 ​ πœ… ,

(109)

or equivalently,

𝜎 min ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 + 𝐸 𝑑 π‘Ž ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ) β‰₯ πœ‚ 5 ​ πœ… .

(110)

To this end, we write

( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 + 𝐸 𝑑 π‘Ž ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1

( 𝐼 + πœ‚ ​ 𝐸 𝑑 π‘Ž ​ ( ( 1 βˆ’ πœ‚ ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ) βˆ’ 1 ) ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 )

(111)

and control the two terms separately.

β€’

To control the first factor, starting from (54a) we may deduce

β€– 𝐸 𝑑 π‘Ž β€–
≀ πœ… βˆ’ 4 ​ β€– 𝑋 ⋆ β€– ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– + β€– π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€–

≀ πœ… βˆ’ 4 ​ β€– 𝑋 ⋆ β€– ​ 𝑐 3 ​ πœ… βˆ’ 𝐢 𝛿 / 2 ​ β€– 𝑋 ⋆ β€– + 𝑐 12 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ​ β€– 𝑋 ⋆ β€– 2

≀ πœ… βˆ’ 2 ​ β€– 𝑋 ⋆ β€– 2 / 2

𝜎 min 2 ​ ( 𝑋 ⋆ ) / 2 ,

where the second inequality follows from (22c) and Lemma 12; the last inequality follows from choosing 𝑐 𝛿 sufficiently small (recall that 𝑐 3 , 𝑐 12 ≲ 𝑐 𝛿 / 𝑐 πœ† ) and 𝐢 𝛿 sufficiently large. Furthermore, since 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ is positive semidefinite, we have

β€– ( ( 1 βˆ’ πœ‚ ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ) βˆ’ 1 β€– ≀ πœ‚ βˆ’ 1 ​ 𝜎 min βˆ’ 2 ​ ( Ξ£ ⋆ )

πœ‚ βˆ’ 1 ​ 𝜎 min βˆ’ 2 ​ ( 𝑋 ⋆ ) ,

hence

𝜎 min ​ ( 1 + πœ‚ ​ 𝐸 𝑑 π‘Ž ​ ( ( 1 βˆ’ πœ‚ ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ) βˆ’ 1 )

β‰₯ 1 βˆ’ πœ‚ ​ β€– 𝐸 𝑑 π‘Ž β€– ​ β€– ( ( 1 βˆ’ πœ‚ ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ) βˆ’ 1 β€–

β‰₯ 1 βˆ’ πœ‚ β‹… 𝜎 min 2 ​ ( 𝑋 ⋆ ) 2 β‹… πœ‚ βˆ’ 1 ​ 𝜎 min βˆ’ 2 ​ ( 𝑋 ⋆ )

1 / 2 .

(112)

β€’

Now we control the second factor. By Lemma 10 we have

𝜎 min ​ ( 1 βˆ’ πœ‚ + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 )

( 1 βˆ’ πœ‚ ) ​ 𝜎 min ​ ( 𝐼 + πœ‚ 1 βˆ’ πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 )

β‰₯ ( 1 βˆ’ πœ‚ ) ​ ( β€– Ξ£ ⋆ 2 + πœ† ​ 𝐼 β€– 𝜎 min ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ) βˆ’ 1 / 2

( 1 βˆ’ πœ‚ ) ​ ( β€– 𝑋 ⋆ β€– 2 + πœ† 𝜎 min 2 ​ ( 𝑋 ⋆ ) + πœ† ) βˆ’ 1 / 2 .

It is easy to check that the function πœ† ↦ ( π‘Ž + πœ† ) / ( 𝑏 + πœ† ) is decreasing on [ 0 , ∞ ) for π‘Ž β‰₯ 𝑏 > 0 , thus

β€– 𝑋 ⋆ β€– 2 + πœ† 𝜎 min 2 ​ ( 𝑋 ⋆ ) + πœ† ≀ β€– 𝑋 ⋆ β€– 2 𝜎 min 2 ​ ( 𝑋 ⋆ )

πœ… 2 ,

which implies

𝜎 min ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ) β‰₯ 1 βˆ’ πœ‚ πœ… .

(113)
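The monotonicity claim above is elementary, since ( π‘Ž + πœ† ) / ( 𝑏 + πœ† ) has derivative ( 𝑏 βˆ’ π‘Ž ) / ( 𝑏 + πœ† ) 2 ≀ 0 when π‘Ž β‰₯ 𝑏 . A short numerical illustration (hypothetical values standing in for β€– 𝑋 ⋆ β€– 2 and 𝜎 min 2 ( 𝑋 ⋆ ) ):

```python
import numpy as np

# (a + lam)/(b + lam) is decreasing in lam >= 0 whenever a >= b > 0,
# so it is maximized at lam = 0, yielding the kappa^2 bound used above.
a, b = 9.0, 2.0                       # stand-ins for ||X_star||^2, sigma_min^2(X_star)
lam = np.linspace(0.0, 50.0, 10**4)
ratio = (a + lam) / (b + lam)

assert np.all(np.diff(ratio) <= 1e-15)   # monotone decreasing along the grid
assert np.all(ratio <= a / b)            # bounded by the lam = 0 value (= kappa^2)
```

This is why the regularization πœ† can only improve the effective conditioning in (113).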

Plugging (113) and (112) into (111) yields

𝜎 min ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 + 𝐸 𝑑 π‘Ž ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ) β‰₯ 1 βˆ’ πœ‚ 2 ​ πœ… β‰₯ πœ‚ 5 ​ πœ… ,

(114)

where the last inequality follows from the assumption πœ‚ ≀ 𝑐 πœ‚ . This shows (110) as desired, thereby completing the proof.

C.2.3 Proof of Lemma 25

Combine (101b) and Lemma 14 to see that

β€– 𝑆 ~ 𝑑 + 1 β€–

≀ β€– 𝑆 𝑑 + 1 ​ 𝑉 𝑑 + 𝑆 𝑑 + 1 ​ 𝑉 𝑑 , βŸ‚ ​ 𝑄 β€–

≀ β€– 1 + πœ‚ ​ 𝐸 𝑑 14 β€– β‹… β€– ( 1 βˆ’ πœ‚ ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) 1 / 2 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 / 2 β€– β‹… β€– ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 β€–

≀ ( 1 + πœ‚ ​ β€– 𝐸 𝑑 14 β€– ) ​ ( ( 1 βˆ’ πœ‚ ) ​ ( β€– 𝑆 ~ 𝑑 β€– 2 + πœ† ) 1 / 2 + 4 ​ πœ‚ ​ πœ† βˆ’ 1 / 2 ​ β€– 𝑋 ⋆ β€– 2 ) ​ ( β€– 𝑆 ~ 𝑑 β€– 2 + πœ† ) βˆ’ 1 / 2 ​ β€– 𝑆 ~ 𝑑 β€–

≀ ( 1 + πœ‚ 4 ) ​ ( ( 1 βˆ’ πœ‚ ) ​ β€– 𝑆 ~ 𝑑 β€– + 4 ​ πœ‚ ​ β€– 𝑋 ⋆ β€– 2 ​ β€– 𝑆 ~ 𝑑 β€– πœ† ​ ( β€– 𝑆 ~ 𝑑 β€– 2 + πœ† ) )

≀ ( 1 βˆ’ πœ‚ 2 ) ​ β€– 𝑆 ~ 𝑑 β€– + 5 ​ πœ‚ ​ β€– 𝑋 ⋆ β€– 2 πœ† ,

(115)

where the third line follows from β€– Ξ£ ⋆ 2 + πœ† 𝐼 β€– ≀ ( 1 + πœ† ) β€– 𝑋 ⋆ β€– 2 ≀ 2 β€– 𝑋 ⋆ β€– 2 assuming 𝑐 πœ† ≀ 1 , and from the fact that the singular values of ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 are ( 𝜎 𝑗 2 ( 𝑆 ~ 𝑑 ) + πœ† ) βˆ’ 1 / 2 𝜎 𝑗 ( 𝑆 ~ 𝑑 ) , 𝑗 = 1 , … , π‘Ÿ ⋆ , each of which is bounded by ( β€– 𝑆 ~ 𝑑 β€– 2 + πœ† ) βˆ’ 1 / 2 β€– 𝑆 ~ 𝑑 β€– since 𝜎 ↦ ( 𝜎 2 + πœ† ) βˆ’ 1 / 2 𝜎 is increasing and β€– 𝑆 ~ 𝑑 β€– is the largest singular value of 𝑆 ~ 𝑑 . In the fourth line, we used the error bound β€– 𝐸 𝑑 14 β€– ≀ 1 / 4 , and the last line follows from the elementary inequalities 1 + πœ‚ / 4 ≀ ( 1 βˆ’ πœ‚ / 2 ) ( 1 βˆ’ πœ‚ ) βˆ’ 1 and 1 + πœ‚ / 4 ≀ 5 / 4 given that πœ‚ ≀ 𝑐 πœ‚ for a sufficiently small constant 𝑐 πœ‚ > 0 . The conclusion readily follows from the above inequality and the assumption πœ† β‰₯ ( 1 / 100 ) πœ… βˆ’ 4 𝑐 πœ† 𝜎 min 2 ( 𝑋 ⋆ ) .

Appendix D Proofs for Phase II

This section collects the proofs for Phase II.

D.1 Proof of Lemma 4

Since β€– 𝑉 𝑑 + 1 ⊀ ​ 𝑉 𝑑 β€– ≀ 1 , we have

𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 + 1 )
β‰₯ 𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 + 1 ​ 𝑉 𝑑 + 1 ⊀ ​ 𝑉 𝑑 )

𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 𝑑 + 1 ​ 𝑉 𝑑 ) ,

where the second equality follows from 𝑆 𝑑 + 1 = 𝑆 ~ 𝑑 + 1 𝑉 𝑑 + 1 ⊀ (cf. (31)). Apply Lemma 14 with 𝑄 = 0 to see that

𝑆 𝑑 + 1 𝑉 𝑑 = ( 𝐼 + πœ‚ 𝐸 𝑑 14 ) ( ( 1 βˆ’ πœ‚ ) 𝐼 + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 ) 𝑆 ~ 𝑑 ,

(116)

where 𝐸 𝑑 14 ∈ ℝ π‘Ÿ ⋆ Γ— π‘Ÿ ⋆ satisfies β€– 𝐸 𝑑 14 β€– ≀ 1 / ( 200 ( 𝐢 3 . π‘Ž + 1 ) 4 πœ… 5 ) . To simplify the notation, we denote

π‘Œ 𝑑 ≔ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 ,

which allows us to write (116) as

( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 𝑑 + 1 ​ 𝑉 𝑑

( 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝐸 𝑑 14 ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) 1 / 2 ) ​ ( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( π‘Œ 𝑑 ​ π‘Œ 𝑑 ⊀ + πœ† ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 ) βˆ’ 1 ) ​ π‘Œ 𝑑 .

(117)

Note that

β€– ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝐸 𝑑 14 ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) 1 / 2 β€–

≀ β€– ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 β€– β‹… β€– ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) 1 / 2 β€– β‹… β€– 𝐸 𝑑 14 β€–

≀ πœ… ​ β€– 𝑋 ⋆ β€– βˆ’ 1 β‹… ( 2 ​ β€– 𝑋 ⋆ β€– ) β‹… β€– 𝐸 𝑑 14 β€–

≀ 2 ​ πœ… β‹… 1 200 ​ ( 𝐢 3 . π‘Ž + 1 ) 4 ​ πœ… 5 ≀ 1 / 32 ,

(118)

where in the second inequality we used πœ† ≀ 𝑐 πœ† ​ β€– 𝑀 ⋆ β€– ≀ β€– 𝑋 ⋆ β€– 2 as 𝑐 πœ† ≀ 1 , and in the third inequality we used the claimed bound of β€– 𝐸 𝑑 14 β€– . Therefore, it follows that

𝜎 min ​ ( 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝐸 𝑑 14 ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) 1 / 2 ) β‰₯ 1 βˆ’ πœ‚ / 32 .

(119)

On the other hand, using 𝜎 min ( 𝐴 𝐡 ) β‰₯ 𝜎 min ( 𝐴 ) 𝜎 min ( 𝐡 ) for square matrices 𝐴 , 𝐡 , it is clear that

𝜎 min ​ ( ( ( 1 βˆ’ πœ‚ ) ​ 𝐼 + πœ‚ ​ ( π‘Œ 𝑑 ​ π‘Œ 𝑑 ⊀ + πœ† ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 ) βˆ’ 1 ) ​ π‘Œ 𝑑 ) β‰₯ ( 1 βˆ’ πœ‚ ) ​ 𝜎 min ​ ( π‘Œ 𝑑 ) ,

which in turn implies that

𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 𝑑 + 1 ​ 𝑉 𝑑 ) β‰₯ ( 1 βˆ’ πœ‚ / 32 ) ​ ( 1 βˆ’ πœ‚ ) ​ 𝜎 min ​ ( π‘Œ 𝑑 ) β‰₯ ( 1 βˆ’ 2 ​ πœ‚ ) ​ 𝜎 min ​ ( π‘Œ 𝑑 ) ,

as long as πœ‚ ≀ 𝑐 πœ‚ for some sufficiently small constant 𝑐 πœ‚ . This proves the first part of Lemma 4.

Now we move to the second part assuming 𝜎 min ​ ( π‘Œ 𝑑 ) ≀ 1 / 3 . Using the assumption πœ† ≀ 𝑐 πœ† ​ 𝜎 min ​ ( 𝑀 ⋆ ) , we see that

β€– πœ† ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 β€– ≀ 𝑐 πœ† .

Given that 𝑐 πœ† is sufficiently small (such that 𝑐 πœ† ≀ 𝑐 11 , where 𝑐 11 is the positive constant in Lemma 11), one may apply Lemma 11 with π‘Œ

π‘Œ 𝑑 and Ξ›

πœ† ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 to obtain

𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 𝑑 + 1 ​ 𝑉 𝑑 )

β‰₯ 𝜎 min ​ ( 𝐼 + πœ‚ ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝐸 𝑑 14 ​ ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) 1 / 2 ) ​ ( 1 + 1 6 ​ πœ‚ ) ​ 𝜎 min ​ ( π‘Œ 𝑑 )

β‰₯ ( i ) ​ ( 1 βˆ’ πœ‚ / 32 ) ​ ( 1 + 1 6 ​ πœ‚ ) ​ 𝜎 min ​ ( π‘Œ 𝑑 ) ​ β‰₯ ( ii ) ​ ( 1 + 1 8 ​ πœ‚ ) ​ 𝜎 min ​ ( π‘Œ 𝑑 ) ,

where (i) uses (119), and (ii) follows as long as πœ‚ ≀ 𝑐 πœ‚ for some sufficiently small constant 𝑐 πœ‚ . The desired conclusion follows.

D.2 Proof of Corollary 1

We will prove a strengthened version of (25), that is

𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 ) β‰₯ 1 / 10 .

(120)

It is clear that (120) implies (25). Indeed, for each 𝑒 ∈ ℝ π‘Ÿ ⋆ , by taking 𝑣 = ( Ξ£ ⋆ 2 + πœ† 𝐼 ) 1 / 2 𝑒 , we have

𝑒 ⊀ 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ 𝑒 = 𝑣 ⊀ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑣 β‰₯ 1 10 β€– 𝑣 β€– 2 β‰₯ 1 10 𝑒 ⊀ Ξ£ ⋆ 2 𝑒 ,

which implies (25). It then boils down to establishing (120).
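The substitution argument above can be sanity-checked numerically: whenever 𝜎 min ( ( Ξ£ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ) β‰₯ 𝑐 , the quadratic form 𝑒 ⊀ 𝑆 𝑆 ⊀ 𝑒 dominates 𝑐 2 𝑒 ⊀ Ξ£ 2 𝑒 . A random instance (illustrative sizes, with a generic constant 𝑐 in place of the specific value used in the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
r, lam = 4, 0.2
Sigma = np.diag(rng.uniform(1.0, 3.0, r))        # stand-in for Sigma_star
S = rng.standard_normal((r, r))                  # stand-in for S_tilde_t

M = np.diag((np.diag(Sigma) ** 2 + lam) ** -0.5)  # (Sigma^2 + lam I)^{-1/2}
c = np.linalg.svd(M @ S, compute_uv=False).min()  # sigma_min((Sigma^2 + lam I)^{-1/2} S)

# With v = (Sigma^2 + lam I)^{1/2} u, one gets
# u^T S S^T u = v^T M S S^T M v >= c^2 ||v||^2 >= c^2 u^T Sigma^2 u.
for _ in range(100):
    u = rng.standard_normal(r)
    assert u @ S @ S.T @ u >= c**2 * (u @ Sigma**2 @ u) - 1e-9
```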

Step 1: establishing the claim for a midpoint 𝑑 2 .

From Lemma 3 we know that

𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 1 ) β‰₯ β€– Ξ£ ⋆ 2 + πœ† ​ 𝐼 β€– βˆ’ 1 / 2 ​ 𝜎 min ​ ( 𝑆 ~ 𝑑 1 ) ​ β‰₯ ( i ) ​ ( 𝑐 πœ† + 1 ) βˆ’ 1 / 2 ​ β€– 𝑋 ⋆ β€– βˆ’ 1 β‹… 𝛼 2 / β€– 𝑋 ⋆ β€– β‰₯ 1 3 ​ ( 𝛼 / β€– 𝑋 ⋆ β€– ) 2 ,

where (i) follows from the assumption (12b) and Lemma 3, and the last inequality follows by choosing 𝑐 πœ† ≀ 1 . By the second part of Lemma 4, starting from 𝑑 1 , whenever 𝜎 min ( ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 ) < 1 / 10 < 1 / 3 , this quantity increases exponentially at rate at least ( 1 + πœ‚ / 8 ) . On the other hand, it is easy to verify, given that πœ‚ ≀ 𝑐 πœ‚ is sufficiently small, that

( 1 + πœ‚ 8 ) 16 πœ‚ ​ log ⁑ ( 3 10 ​ β€– 𝑋 ⋆ β€– 2 𝛼 2 ) β‰₯ 3 ​ β€– 𝑋 ⋆ β€– 2 10 ​ 𝛼 2 β‰₯ 1 10 ​ 1 𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 1 ) .

Therefore, it takes at most ( 16 / πœ‚ ) log ( 3 β€– 𝑋 ⋆ β€– 2 / ( 10 𝛼 2 ) ) ≀ 𝑇 min / 16 additional iterations for 𝜎 min ( ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 ) to reach at least 1 / 10 . Equivalently, for some 𝑑 2 : 𝑑 1 ≀ 𝑑 2 ≀ 𝑑 1 + 𝑇 min / 16 , we have

𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 2 ) β‰₯ 1 / 10 .

Step 2: establishing the claim for all 𝑑 ∈ [ 𝑑 2 , 𝑇 max ] .

It remains to show that (120) continues to hold for all 𝑑 ∈ [ 𝑑 2 , 𝑇 max ] . We prove this by induction on 𝑑 .

Assume that (120) holds for some 𝑑 ∈ [ 𝑑 2 , 𝑇 max βˆ’ 1 ] . We show that it will also hold for 𝑑 + 1 . We divide the proof into two cases.

Case 1.

If 𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 ) ≀ 1 / 3 , we deduce from the second part of Lemma 4 that

𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 + 1 ) β‰₯ ( 1 + πœ‚ 8 ) ​ 𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 ) β‰₯ 𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 ) ,

which by the induction hypothesis is no less than 1 / 10 , as desired.

Case 2.

If 𝜎 min ( ( Ξ£ ⋆ 2 + πœ† 𝐼 ) βˆ’ 1 / 2 𝑆 ~ 𝑑 ) > 1 / 3 , the first part of Lemma 4 yields

𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 + 1 ) β‰₯ ( 1 βˆ’ 2 ​ πœ‚ ) ​ 𝜎 min ​ ( ( Ξ£ ⋆ 2 + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 ) β‰₯ ( 1 βˆ’ 2 ​ πœ‚ ) / 3 ,

which is greater than 1 / 10 provided πœ‚ ≀ 𝑐 πœ‚ ≀ 1 / 100 , as desired.

Combining the two cases completes the proof.

D.3 Proof of Lemma 5

For simplicity, in this section we denote

Ξ“ 𝑑 ≔ Ξ£ ⋆ βˆ’ 1 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ Ξ£ ⋆ βˆ’ 1 βˆ’ 𝐼 = Ξ£ ⋆ βˆ’ 1 ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) Ξ£ ⋆ βˆ’ 1 .

(121)

It turns out that Lemma 5 follows naturally from the following technical lemma, whose proof is deferred to the end of this section.

Lemma 26.

For any 𝑑 : 𝑑 2 ≀ 𝑑 ≀ 𝑇 max , one has

β€– | Ξ“ 𝑑 + 1 | β€– ≀ ( 1 βˆ’ πœ‚ ) ​ β€– | Ξ“ 𝑑 | ​ β€– + πœ‚ ​ 𝐢 26 ​ πœ… 6 β€– 𝑋 ⋆ β€– 2 β€– ​ | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 | β€– + 1 16 ​ πœ‚ ​ β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– | 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ | β€– + πœ‚ ​ ( β€– 𝑂 ~ 𝑑 β€– β€– 𝑋 ⋆ β€– ) 7 / 12 ,

(122)

where 𝐢 26 ≲ 𝑐 πœ† βˆ’ 1 / 2 is some positive constant and | | | β‹… | | | can either be the Frobenius norm or the spectral norm.

From Lemma 12, we know that β€– π‘ˆ ⋆ ⊀ Ξ” 𝑑 β€– ≀ β€– Ξ” 𝑑 β€– ≀ β€– 𝑋 ⋆ β€– 2 / ( 300 𝐢 26 πœ… 4 ) as 𝑐 𝛿 is sufficiently small. Similarly, β€– 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ β€– ≀ β€– 𝑋 ⋆ β€– / 100 and ( β€– 𝑂 ~ 𝑑 β€– / β€– 𝑋 ⋆ β€– ) 7 / 12 ≀ 1 / 300 by Lemma 3. Applying Lemma 26 with the spectral norm, we prove Lemma 5 as desired.

Proof of Lemma 26.

We start by rewriting (53a) as

𝑆 𝑑 + 1 = ( ( 1 βˆ’ πœ‚ ) 𝐼 + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 ) 𝑆 ~ 𝑑 𝑉 𝑑 ⊀ + πœ‚ 𝐸 𝑑 𝑔 = ( 𝐼 βˆ’ πœ‚ ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 + πœ‚ ( Ξ£ ⋆ 2 + πœ† 𝐼 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 ) 𝑆 ~ 𝑑 𝑉 𝑑 ⊀ + πœ‚ 𝐸 𝑑 𝑔 = ( 𝐼 βˆ’ πœ‚ ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 ) 𝑆 ~ 𝑑 𝑉 𝑑 ⊀ + πœ‚ 𝐸 𝑑 𝑔 ,

(123)

where

𝐸 𝑑 𝑔 ≔ 𝐸 𝑑 π‘Ž ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 𝑆 ~ 𝑑 𝑉 𝑑 ⊀ + 𝐸 𝑑 𝑏 .

(124)

By Corollary 1, we have 𝜎 min 2 ( 𝑆 ~ 𝑑 ) β‰₯ ( 1 / 100 ) 𝜎 min ( 𝑀 ⋆ ) for 𝑑 ∈ [ 𝑑 2 , 𝑇 max ] , so

β€– ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ​ 𝑆 ~ 𝑑 ​ 𝑉 𝑑 ⊀ β€– ≀ β€– ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 / 2 β€– ​ β€– ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 / 2 ​ 𝑆 ~ 𝑑 β€– ≀ 𝜎 min βˆ’ 1 ​ ( 𝑆 ~ 𝑑 ) ≲ 1 / 𝜎 min ​ ( 𝑋 ⋆ ) .

Combined with the error bounds (54a), (54b), we have for some universal constant 𝐢 > 0 that

β€– | 𝐸 𝑑 𝑔 | β€– ≀ β€– | 𝐸 𝑑 π‘Ž | β€– β€– ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ + πœ† 𝐼 ) βˆ’ 1 𝑆 ~ 𝑑 𝑉 𝑑 ⊀ β€– + β€– | 𝐸 𝑑 𝑏 | β€– ≀ 𝐢 πœ… β€– 𝑋 ⋆ β€– βˆ’ 1 β€– | π‘ˆ ⋆ ⊀ Ξ” 𝑑 | β€– + 𝐢 𝑐 13 πœ… βˆ’ 5 β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + 𝐢 β€– 𝑂 ~ 𝑑 β€– 3 / 4 β€– 𝑋 ⋆ β€– 1 / 4 .

(125)

Step 1: deriving a recursion of Ξ“ 𝑑 .

Define

𝐴 𝑑 := ( 𝐼 βˆ’ πœ‚ ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) βˆ’ 1 ) ​ 𝑆 ~ 𝑑 ​ 𝑉 𝑑 ⊀ .

Then we can rewrite (123) as 𝐴 𝑑 = 𝑆 𝑑 + 1 βˆ’ πœ‚ 𝐸 𝑑 𝑔 , and by rearranging 𝐴 𝑑 𝐴 𝑑 ⊀ = ( 𝑆 𝑑 + 1 βˆ’ πœ‚ 𝐸 𝑑 𝑔 ) ( 𝑆 𝑑 + 1 βˆ’ πœ‚ 𝐸 𝑑 𝑔 ) ⊀ in view of (31), it follows that

𝑆 ~ 𝑑 + 1 𝑆 ~ 𝑑 + 1 ⊀ = 𝑆 𝑑 + 1 𝑆 𝑑 + 1 ⊀ = 𝐴 𝑑 𝐴 𝑑 ⊀ + πœ‚ ( β€– 𝑆 𝑑 + 1 β€– + β€– 𝐸 𝑑 𝑔 β€– ) ( 𝐸 𝑑 𝑔 𝑄 1 + 𝑄 2 𝐸 𝑑 𝑔 ⊀ ) =: 𝐴 𝑑 𝐴 𝑑 ⊀ + πœ‚ 𝐸 𝑑 𝑓

for some matrices 𝑄 1 , 𝑄 2 with β€– 𝑄 1 β€– , β€– 𝑄 2 β€– ≀ 1 . Applying the map ( β‹… ) ↦ Ξ£ ⋆ βˆ’ 1 ( β‹… ) Ξ£ ⋆ βˆ’ 1 βˆ’ 𝐼 to both sides of the above equation, we obtain

Ξ“ 𝑑 + 1 = ( 𝐼 βˆ’ πœ‚ Ξ“ 𝑑 ( 𝐼 + Ξ“ 𝑑 + πœ† Ξ£ ⋆ βˆ’ 2 ) βˆ’ 1 ) ( Ξ“ 𝑑 + 𝐼 ) ( 𝐼 βˆ’ πœ‚ ( 𝐼 + Ξ“ 𝑑 + πœ† Ξ£ ⋆ βˆ’ 2 ) βˆ’ 1 Ξ“ 𝑑 ) βˆ’ 𝐼 + πœ‚ Ξ£ ⋆ βˆ’ 1 𝐸 𝑑 𝑓 Ξ£ ⋆ βˆ’ 1 ,

(126)

where we recall the definition of Ξ“ 𝑑 in (121).

Step 2: simplifying the recursion.

Note that 𝜎 min ( Ξ£ ⋆ βˆ’ 1 𝑆 ~ 𝑑 ) β‰₯ 1 / 10 implies 𝐼 + Ξ“ 𝑑 βͺ° ( 1 / 100 ) 𝐼 . From our assumption πœ† ≀ 𝑐 πœ† 𝜎 min ( 𝑀 ⋆ ) , it follows that β€– πœ† Ξ£ ⋆ βˆ’ 2 β€– ≀ 𝑐 πœ† ≀ 1 / 200 ≀ ( 1 / 2 ) 𝜎 min ( 𝐼 + Ξ“ 𝑑 ) , thus by virtue of Lemma 9 we have

( 𝐼 + Ξ“ 𝑑 + πœ† ​ Ξ£ ⋆ βˆ’ 2 ) βˆ’ 1

( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 + ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 ​ ( 𝑐 πœ† ​ 𝑄 β€² ) ​ ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 ,

for some matrix 𝑄 β€² with β€– 𝑄 β€² β€– ≀ 2 . Plugging this into (126) yields

Ξ“ 𝑑 + 1 = ( 𝐼 βˆ’ πœ‚ Ξ“ 𝑑 ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 ) ( Ξ“ 𝑑 + 𝐼 ) ( 𝐼 βˆ’ πœ‚ ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 Ξ“ 𝑑 ) βˆ’ 𝐼 + πœ‚ 𝐸 𝑑 β„Ž + πœ‚ Ξ£ ⋆ βˆ’ 1 𝐸 𝑑 𝑓 Ξ£ ⋆ βˆ’ 1 = ( 1 βˆ’ 2 πœ‚ ) Ξ“ 𝑑 + πœ‚ 2 Ξ“ 𝑑 2 ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 + πœ‚ 𝐸 𝑑 β„Ž + πœ‚ Ξ£ ⋆ βˆ’ 1 𝐸 𝑑 𝑓 Ξ£ ⋆ βˆ’ 1 ,

(127)

where the additional error term 𝐸 𝑑 β„Ž is defined by

𝐸 𝑑 β„Ž ≔ Ξ“ 𝑑 ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 ( 𝑐 πœ† 𝑄 β€² ) ( 𝐼 βˆ’ πœ‚ Ξ“ 𝑑 ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 ) + ( 𝐼 βˆ’ πœ‚ Ξ“ 𝑑 ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 ) ( 𝑐 πœ† 𝑄 β€² ) ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 Ξ“ 𝑑 + πœ‚ Ξ“ 𝑑 ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 ( 𝑐 πœ† 𝑄 β€² ) ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 2 ( 𝑐 πœ† 𝑄 β€² ) ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 Ξ“ 𝑑 .

(128)

Step 3: controlling the error terms.

We now control the error terms in (127) separately.

β€’

By (22d) we have β€– 𝑆 𝑑 + 1 β€– ≀ 𝐢 3 . π‘Ž ​ πœ… ​ β€– 𝑋 ⋆ β€– , and by controlling the right hand side of (125) using (22c), (24), and (50) in Lemma 12, it is evident that β€– 𝐸 𝑑 𝑔 β€– ≀ πœ… ​ β€– 𝑋 ⋆ β€– . Hence, the term 𝐸 𝑑 𝑓 obeys

β€– | 𝐸 𝑑 𝑓 | β€–

≀ ( 𝐢 3 . π‘Ž + 1 ) ​ πœ… 3 ​ β€– 𝑋 ⋆ β€– β‹… β€– | 𝐸 𝑑 𝑔 | β€–

≀ 𝐢 β€² ​ 𝐢 3 . π‘Ž ​ ( πœ… 4 ​ β€– | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 | β€– + 𝑐 13 ​ πœ… βˆ’ 2 ​ β€– 𝑋 ⋆ β€– ​ β€– | 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ | β€– + πœ… ​ β€– 𝑂 ~ 𝑑 β€– 3 / 4 ​ β€– 𝑋 ⋆ β€– 5 / 4 ) ,

(129)

where 𝐢 β€²

0 is again some universal constant.

β€’

Since Ξ“ 𝑑 βͺ° ( 1 / 100 ) 𝐼 βˆ’ 𝐼 = βˆ’ ( 99 / 100 ) 𝐼 as already proved, it is easy to see that β€– ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 β€– ≀ 𝐢 and β€– Ξ“ 𝑑 ( 𝐼 + Ξ“ 𝑑 ) βˆ’ 1 β€– ≀ 𝐢 for some universal constant 𝐢 > 0 . Thus,

β€– | 𝐸 𝑑 β„Ž | β€– ≀ 2 ​ 𝑐 πœ† ​ 𝐢 ​ ( 1 + πœ‚ ​ 𝐢 ) ​ β€– 𝑄 β€² β€– β‹… β€– | Ξ“ 𝑑 | β€– + πœ‚ ​ 𝑐 πœ† 2 ​ 𝐢 4 ​ β€– 𝑄 β€² β€– 2 ​ β€– | Ξ“ 𝑑 | β€– ≀ 1 2 ​ β€– | Ξ“ 𝑑 | β€– ,

(130)

where the last line follows by using β€– 𝑄 β€² β€– ≀ 2 and by choosing 𝑐 πœ† , 𝑐 πœ‚ sufficiently small.

β€’

We still need to control πœ‚ 2 ​ Ξ“ 𝑑 2 ​ ( 1 + Ξ“ 𝑑 ) βˆ’ 1 . This can be accomplished by invoking β€– Ξ“ 𝑑 ​ ( 1 + Ξ“ 𝑑 ) βˆ’ 1 β€– ≀ 𝐢 again. In fact, we have

πœ‚ 2 ​ β€– | Ξ“ 𝑑 2 ​ ( 1 + Ξ“ 𝑑 ) βˆ’ 1 | β€– ≀ πœ‚ β‹… πœ‚ ​ β€– Ξ“ 𝑑 ​ ( 1 + Ξ“ 𝑑 ) βˆ’ 1 β€– β‹… β€– | Ξ“ 𝑑 | β€– ≀ πœ‚ β‹… πœ‚ ​ 𝐢 ​ β€– | Ξ“ 𝑑 | β€– ≀ πœ‚ 2 ​ β€– | Ξ“ 𝑑 | β€–

(131)

provided that πœ‚ ≀ 𝑐 πœ‚ is sufficiently small.

Plugging (129), (130), (131) into (127), we readily obtain

β€– | Ξ“ 𝑑 + 1 | β€–

≀ ( 1 βˆ’ 2 πœ‚ ) β€– | Ξ“ 𝑑 | β€– + ( πœ‚ / 2 ) β€– | Ξ“ 𝑑 | β€– + ( πœ‚ / 2 ) β€– | Ξ“ 𝑑 | β€– + πœ‚ πœ… 2 β€– 𝑋 ⋆ β€– βˆ’ 2 β€– | 𝐸 𝑑 𝑓 | β€–

≀ ( 1 βˆ’ πœ‚ ) β€– | Ξ“ 𝑑 | β€– + πœ‚ 𝐢 β€² 𝐢 3 . π‘Ž πœ… 4 β€– 𝑋 ⋆ β€– βˆ’ 2 β€– | π‘ˆ ⋆ ⊀ Ξ” 𝑑 | β€– + πœ‚ 𝑐 13 𝐢 β€² 𝐢 3 . π‘Ž β€– 𝑋 ⋆ β€– βˆ’ 1 β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + πœ‚ 𝐢 β€² 𝐢 3 . π‘Ž πœ… 3 β€– 𝑂 ~ 𝑑 β€– 3 / 4 β€– 𝑋 ⋆ β€– βˆ’ 3 / 4

≀ ( 1 βˆ’ πœ‚ ) β€– | Ξ“ 𝑑 | β€– + πœ‚ 𝐢 26 πœ… 4 β€– 𝑋 ⋆ β€– βˆ’ 2 β€– | π‘ˆ ⋆ ⊀ Ξ” 𝑑 | β€– + ( 1 / 16 ) πœ‚ β€– 𝑋 ⋆ β€– βˆ’ 1 β€– | 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ | β€– + πœ‚ ( β€– 𝑂 ~ 𝑑 β€– / β€– 𝑋 ⋆ β€– ) 7 / 12 ,

where in the last line we set 𝐢 26 = 𝐢 β€² 𝐢 3 . π‘Ž , chose 𝑐 13 sufficiently small, and used (24). Finally, note that 𝐢 26 ≲ 𝐢 3 . π‘Ž ≲ 𝑐 πœ† βˆ’ 1 / 2 as desired.

D.4 Proof of Corollary 2

From Lemma 5, it is elementary (e.g., by induction on 𝑑 ) to show that

β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– ≀ ( 1 βˆ’ πœ‚ ) 𝑑 βˆ’ 𝑑 2 ​ β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 2 ​ 𝑆 ~ 𝑑 2 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– + 1 100 , βˆ€ 𝑑 ∈ [ 𝑑 2 , 𝑇 max ] .

(132)
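The induction behind (132) is the standard unrolling of a contractive recursion with a small additive term; a toy numerical check (illustrative constants, not from the paper):

```python
import numpy as np

# A recursion a_{t+1} <= (1 - eta) a_t + eta * b unrolls to
# a_t <= (1 - eta)^t a_0 + b, which is the shape of the bound (132).
eta, b, a0, T = 0.1, 0.01, 5.0, 200
a = a0
for t in range(1, T + 1):
    a = (1 - eta) * a + eta * b          # equality case of the recursion
    assert a <= (1 - eta) ** t * a0 + b + 1e-12
```

In (132), the role of 𝑏 is played by the residual level 1 / 100 contributed by the small error terms of Lemma 5.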

Suppose for the moment that

β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 2 ​ 𝑆 ~ 𝑑 2 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– ≀ 𝐢 3 . π‘Ž 2 ​ πœ… 4 ,

(133)

where 𝐢 3 . π‘Ž is given in Lemma 3. Then given that πœ‚ ≀ 𝑐 πœ‚ for some sufficiently small 𝑐 πœ‚ , we have log ⁑ ( 1 βˆ’ πœ‚ ) β‰₯ βˆ’ πœ‚ / 2 . As a result, if 𝑑 3 βˆ’ 𝑑 2 β‰₯ 8 ​ log ⁑ ( 10 ​ 𝐢 3 . π‘Ž ​ πœ… ) / πœ‚ β‰₯ log ⁑ ( 𝐢 3 . π‘Ž βˆ’ 2 ​ πœ… βˆ’ 4 / 100 ) / log ⁑ ( 1 βˆ’ πœ‚ ) , we have ( 1 βˆ’ πœ‚ ) 𝑑 3 βˆ’ 𝑑 2 ≀ 𝐢 3 . π‘Ž βˆ’ 2 ​ πœ… βˆ’ 4 / 100 . When 𝐢 min is sufficiently large we may choose such 𝑑 3 which simultaneously satisfies 𝑑 3 ≀ 𝑑 2 + 𝑇 min / 16 ≀ 𝑇 max since 8 ​ log ⁑ ( 10 ​ 𝐢 3 . π‘Ž ​ πœ… ) / πœ‚ ≀ 𝐢 min 32 ​ πœ‚ ​ log ⁑ ( β€– 𝑋 ⋆ β€– / 𝛼 )

𝑇 min / 32 . Invoking (132), we obtain

β€– Ξ£ ⋆ βˆ’ 1 ( 𝑆 ~ 𝑑 3 𝑆 ~ 𝑑 3 ⊀ βˆ’ Ξ£ ⋆ 2 ) Ξ£ ⋆ βˆ’ 1 β€– ≀ ( 𝐢 3 . π‘Ž βˆ’ 2 πœ… βˆ’ 4 / 100 ) ( 𝐢 3 . π‘Ž 2 πœ… 4 ) + 1 / 100 = 1 / 50 ≀ 1 / 10 ,

(134)

which implies the desired bound (27).

Proof of inequality (133).

It is straightforward to verify that

β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 2 ​ 𝑆 ~ 𝑑 2 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– ≀ max ⁑ ( β€– Ξ£ ⋆ βˆ’ 1 ​ 𝑆 ~ 𝑑 2 β€– 2 βˆ’ 1 , 1 βˆ’ 𝜎 min 2 ​ ( Ξ£ ⋆ βˆ’ 1 ​ 𝑆 ~ 𝑑 2 ) ) ,

which combined with (22d) implies that

β€– Ξ£ ⋆ βˆ’ 1 𝑆 ~ 𝑑 2 β€– 2 βˆ’ 1 ≀ β€– Ξ£ ⋆ βˆ’ 1 β€– 2 β€– 𝑆 ~ 𝑑 2 β€– 2 ≀ 𝜎 min βˆ’ 2 ( 𝑋 ⋆ ) 𝐢 3 . π‘Ž 2 πœ… 2 β€– 𝑋 ⋆ β€– 2 = 𝐢 3 . π‘Ž 2 πœ… 4 .

In addition, by Corollary 1 we have

1 βˆ’ 𝜎 min 2 ( Ξ£ ⋆ βˆ’ 1 𝑆 ~ 𝑑 2 ) ≀ 1 βˆ’ 1 / 10 = 9 / 10 .

Choosing 𝐢 3 . π‘Ž sufficiently large (say 𝐢 3 . π‘Ž β‰₯ 1 ) yields 𝐢 3 . π‘Ž 2 ​ πœ… 4 β‰₯ 9 / 10 , and hence the claim (133).

Appendix E Proofs for Phase III

To characterize the behavior of β€– 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯ , it is particularly helpful to consider the following decomposition into three error terms related to the signal term, the misalignment term, and the overparametrization term.

Lemma 27.

For all 𝑑 β‰₯ 𝑑 3 , as long as β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– ≀ 1 / 10 , one has

β€– 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯ ≀ 4 ​ β€– 𝑋 ⋆ β€– 2 ​ ( β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ ) + 4 ​ β€– 𝑋 ⋆ β€– ​ β€– 𝑂 ~ 𝑑 β€– .

Note that the overparametrization error β€– 𝑂 ~ 𝑑 β€– stays small, as stated in (22b) and (24). Therefore we only need to track the shrinkage of the first two terms β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ , which is the focus of the lemma below.

Lemma 28.

For any 𝑑 : 𝑑 3 ≀ 𝑑 ≀ 𝑇 max , one has

β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯

≀ ( 1 βˆ’ πœ‚ 10 ) ​ ( β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ ) + πœ‚ ​ ( β€– 𝑂 ~ 𝑑 β€– β€– 𝑋 ⋆ β€– ) 1 / 2 .

(135)

In particular, β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– ≀ 1 / 10 for all 𝑑 such that 𝑑 3 ≀ 𝑑 ≀ 𝑇 max .

We now show how Lemma 6 is implied by the above two lemmas. To begin with, we apply Lemma 28 repeatedly to obtain the following bound for all 𝑑 ∈ [ 𝑑 3 , 𝑇 max ] :

β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯

≀ ( 1 βˆ’ πœ‚ 10 ) 𝑑 βˆ’ 𝑑 3 ( βˆ₯ Ξ£ ⋆ βˆ’ 1 ( 𝑆 ~ 𝑑 3 𝑆 ~ 𝑑 3 ⊀ βˆ’ Ξ£ ⋆ 2 ) Ξ£ ⋆ βˆ’ 1 βˆ₯ π–₯ + βˆ₯ 𝑋 ⋆ βˆ₯ βˆ’ 1 βˆ₯ 𝑁 ~ 𝑑 3 𝑆 ~ 𝑑 3 βˆ’ 1 Ξ£ ⋆ βˆ₯ π–₯ ) + 10 max 𝑑 3 ≀ 𝜏 ≀ 𝑑 ( β€– 𝑂 ~ 𝜏 β€– β€– 𝑋 ⋆ β€– ) 1 / 2 ,

(136)

which motivates us to control the error at time 𝑑 3 .
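The step from the one-step bound (135) to the unrolled bound (136) rests on the fact that the geometric series βˆ‘ π‘˜ πœ‚ ​ ( 1 βˆ’ πœ‚ / 10 ) π‘˜ sums to at most 10. A minimal numeric sketch of this unrolling, with placeholder values:

```python
import numpy as np

# Worst-case iteration of (135): a_{t+1} = (1 - eta/10) a_t + eta * b_t,
# checked against the unrolled bound (136). Values are illustrative.
rng = np.random.default_rng(0)
eta, T = 0.1, 200
a0 = 1.0
b = rng.uniform(0.0, 1e-3, size=T)   # stand-ins for (||O_tau|| / ||X*||)^{1/2}

a = a0
for t in range(T):
    a = (1 - eta / 10) * a + eta * b[t]

# Since sum_k eta * (1 - eta/10)^k <= 10, the driving terms contribute <= 10 max b.
bound = (1 - eta / 10) ** T * a0 + 10 * b.max()
print(a <= bound + 1e-12)  # True
```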

We know from Corollary 2 that β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 3 ​ 𝑆 ~ 𝑑 3 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– ≀ 1 / 10 . Since Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 3 ​ 𝑆 ~ 𝑑 3 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 is an π‘Ÿ ⋆ Γ— π‘Ÿ ⋆ matrix, we have β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 3 ​ 𝑆 ~ 𝑑 3 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– π–₯ ≀ √ π‘Ÿ ⋆ / 10 . In addition, we infer from (22c) that

β€– 𝑁 ~ 𝑑 3 ​ 𝑆 ~ 𝑑 3 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ ≀ √ π‘Ÿ ⋆ ​ β€– 𝑁 ~ 𝑑 3 ​ 𝑆 ~ 𝑑 3 βˆ’ 1 ​ Ξ£ ⋆ β€– ≀ √ π‘Ÿ ⋆ ​ 𝑐 3 ​ πœ… βˆ’ 𝐢 𝛿 / 2 ​ β€– 𝑋 ⋆ β€– ≀ √ π‘Ÿ ⋆ ​ β€– 𝑋 ⋆ β€– / 10 ,

as long as 𝑐 3 is sufficiently small. Combine the above two bounds to arrive at the conclusion that

β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 3 ​ 𝑆 ~ 𝑑 3 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 3 ​ 𝑆 ~ 𝑑 3 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ ≀ √ π‘Ÿ ⋆ / 10 + β€– 𝑋 ⋆ β€– βˆ’ 1 β‹… √ π‘Ÿ ⋆ ​ β€– 𝑋 ⋆ β€– / 10 = √ π‘Ÿ ⋆ / 5 .

(137)

Combining the two inequalities (136) and (137) yields for all 𝑑 ∈ [ 𝑑 3 , 𝑇 max ]

βˆ₯ Ξ£ ⋆ βˆ’ 1 ( 𝑆 ~ 𝑑 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) Ξ£ ⋆ βˆ’ 1 βˆ₯ π–₯ + βˆ₯ 𝑋 ⋆ βˆ₯ βˆ’ 1 βˆ₯ 𝑁 ~ 𝑑 𝑆 ~ 𝑑 βˆ’ 1 Ξ£ ⋆ βˆ₯ π–₯ ≀ ( 1 / 5 ) ​ ( 1 βˆ’ πœ‚ / 10 ) 𝑑 βˆ’ 𝑑 3 ​ √ π‘Ÿ ⋆ + 10 max 𝑑 3 ≀ 𝜏 ≀ 𝑑 ( β€– 𝑂 ~ 𝜏 β€– / β€– 𝑋 ⋆ β€– ) 1 / 2 .

We can then invoke Lemma 27 to see that

β€– 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯

≀ ( 4 ​ β€– 𝑋 ⋆ β€– 2 / 5 ) ​ ( 1 βˆ’ πœ‚ / 10 ) 𝑑 βˆ’ 𝑑 3 ​ √ π‘Ÿ ⋆ + 40 ​ βˆ₯ 𝑋 ⋆ βˆ₯ 2 ​ max 𝑑 3 ≀ 𝜏 ≀ 𝑑 ( β€– 𝑂 ~ 𝜏 β€– / β€– 𝑋 ⋆ β€– ) 1 / 2 + 4 ​ βˆ₯ 𝑋 ⋆ βˆ₯ ​ βˆ₯ 𝑂 ~ 𝑑 βˆ₯

≀ ( 1 βˆ’ πœ‚ / 10 ) 𝑑 βˆ’ 𝑑 3 ​ √ π‘Ÿ ⋆ ​ βˆ₯ 𝑀 ⋆ βˆ₯ + 80 ​ βˆ₯ 𝑀 ⋆ βˆ₯ ​ max 𝑑 3 ≀ 𝜏 ≀ 𝑑 ( β€– 𝑂 ~ 𝜏 β€– / β€– 𝑋 ⋆ β€– ) 1 / 2 ,

where in the last line we use β€– 𝑂 ~ 𝑑 β€– ≀ β€– 𝑋 ⋆ β€– , an implication of (24). To see this, the assumption (12c) implies that 𝛼 ≀ β€– 𝑋 ⋆ β€– as long as πœ‚ ≀ 1 / 2 and 𝐢 𝛼 β‰₯ 4 , which in turn implies β€– 𝑂 ~ 𝑑 β€– ≀ 𝛼 2 / 3 ​ β€– 𝑋 ⋆ β€– 1 / 3 ≀ β€– 𝑋 ⋆ β€– . This completes the proof of the first part of Lemma 6 with 𝑐 6 = 1 / 10 .

For the second part of Lemma 6, notice that

8 𝑐 6 βˆ’ 1 max 𝑑 3 ≀ 𝜏 ≀ 𝑇 max ( βˆ₯ 𝑂 ~ 𝜏 βˆ₯ / βˆ₯ 𝑋 ⋆ βˆ₯ ) 1 / 2 ≀ 1 2 ( 𝛼 β€– 𝑋 ⋆ β€– ) 1 / 3

by (24), thus

β€– 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯ ≀ ( 1 βˆ’ 𝑐 6 ​ πœ‚ ) 𝑑 βˆ’ 𝑑 3 ​ π‘Ÿ ⋆ ​ β€– 𝑀 ⋆ β€– + 1 2 ​ ( 𝛼 β€– 𝑋 ⋆ β€– ) 1 / 3

for 𝑑 3 ≀ 𝑑 ≀ 𝑇 max . There exists some iteration number 𝑑 4 : 𝑑 3 ≀ 𝑑 4 ≀ 𝑑 3 + 2 𝑐 6 ​ πœ‚ ​ log ⁑ ( β€– 𝑋 ⋆ β€– / 𝛼 ) ≀ 𝑑 3 + 𝑇 min / 16 such that

( 1 βˆ’ 𝑐 6 ​ πœ‚ ) 𝑑 4 βˆ’ 𝑑 3 ≀ ( 𝛼 β€– 𝑋 ⋆ β€– ) 2 ≀ 1 2 ​ π‘Ÿ ⋆ ​ ( 𝛼 β€– 𝑋 ⋆ β€– ) 1 / 3 ,

where the last inequality is due to (12c). It is then clear that 𝑑 4 has the property claimed in the lemma.

E.1Proof of Lemma 27

Starting from (51), we may deduce

β€– 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯

≀ β€– 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 β€– π–₯ + 2 ​ β€– 𝑆 ~ 𝑑 β€– ​ β€– 𝑁 ~ 𝑑 β€– π–₯ + β€– 𝑁 ~ 𝑑 β€– ​ β€– 𝑁 ~ 𝑑 β€– π–₯ + β€– 𝑂 ~ 𝑑 β€– ​ β€– 𝑂 ~ 𝑑 β€– π–₯

≀ β€– 𝑋 ⋆ β€– 2 ​ ( β€– Ξ£ ⋆ βˆ’ 1 ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ ​ Ξ£ ⋆ βˆ’ 1 βˆ’ 𝐼 β€– π–₯ + 2 ​ β€– Ξ£ ⋆ βˆ’ 1 ​ 𝑆 ~ 𝑑 β€– 2 ​ β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ + √ 𝑛 ​ ( β€– 𝑂 ~ 𝑑 β€– / β€– 𝑋 ⋆ β€– ) 2 )

≀ 4 ​ β€– 𝑋 ⋆ β€– 2 ​ ( β€– Ξ£ ⋆ βˆ’ 1 ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ ​ Ξ£ ⋆ βˆ’ 1 βˆ’ 𝐼 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ + β€– 𝑂 ~ 𝑑 β€– β€– 𝑋 ⋆ β€– ) ,

(138)

where the penultimate line used β€– 𝑂 ~ 𝑑 β€– π–₯ ≀ √ 𝑛 ​ β€– 𝑂 ~ 𝑑 β€– , and the last line follows from β€– Ξ£ ⋆ βˆ’ 1 ​ 𝑆 ~ 𝑑 β€– 2 = β€– Ξ£ ⋆ βˆ’ 1 ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ ​ Ξ£ ⋆ βˆ’ 1 β€– ≀ 1 + β€– Ξ£ ⋆ βˆ’ 1 ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ ​ Ξ£ ⋆ βˆ’ 1 βˆ’ 𝐼 β€– ≀ 2 (recall that β€– Ξ£ ⋆ βˆ’ 1 ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ ​ Ξ£ ⋆ βˆ’ 1 βˆ’ 𝐼 β€– ≀ 1 / 10 by assumption) and from (24).

E.2Proof of Lemma 28

Recall the definition of Ξ“ 𝑑 from (121):

Ξ“ 𝑑 ≔ Ξ£ ⋆ βˆ’ 1 ​ 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ ​ Ξ£ ⋆ βˆ’ 1 βˆ’ 𝐼 .

Fix any 𝑑 ∈ [ 𝑑 3 , 𝑇 max ] . If (135) were true for all 𝜏 ∈ [ 𝑑 3 , 𝑑 ] , then, taking into account that β€– 𝑂 ~ 𝜏 β€– / β€– 𝑋 ⋆ β€– ≀ 1 / 10000 for all 𝜏 ∈ [ 𝑑 3 , 𝑇 max ] by (24), we could show by induction that β€– Ξ“ 𝜏 β€– ≀ 1 / 10 for all 𝜏 ∈ [ 𝑑 3 , 𝑑 ] . Thus it suffices to assume β€– Ξ“ 𝑑 β€– ≀ 1 / 10 and prove (135).

Apply Lemma 26 with Frobenius norm to obtain

β€– Ξ“ 𝑑 + 1 β€– π–₯ ≀ ( 1 βˆ’ πœ‚ ) ​ β€– Ξ“ 𝑑 β€– π–₯ + πœ‚ ​ 𝐢 26 ​ πœ… 4 β€– 𝑋 ⋆ β€– 2 ​ β€– π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€– π–₯ + 1 16 ​ πœ‚ ​ β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ + πœ‚ ​ ( β€– 𝑂 ~ 𝑑 β€– β€– 𝑋 ⋆ β€– ) 7 / 12 ,

(139)

In addition, Lemma 24 tells us that

β€– 𝑁 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ ≀ ( 1 βˆ’ πœ‚ / ( 3 ​ ( β€– 𝑍 𝑑 β€– + πœ‚ ) ) ) ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ + πœ‚ ​ 𝐢 24 ​ πœ… 6 / ( 𝑐 πœ† ​ β€– 𝑋 ⋆ β€– ) ​ β€– π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€– π–₯ + πœ‚ ​ ( β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ​ ( 𝑆 ~ 𝑑 ) ) 2 / 3 ​ β€– 𝑋 ⋆ β€– ,

where 𝑍 𝑑 ≔ Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) ​ Ξ£ ⋆ βˆ’ 1 . It is easy to check that β€– 𝑍 𝑑 β€– ≀ 1 + β€– Ξ“ 𝑑 β€– + 𝑐 πœ† ≀ 2 as β€– Ξ“ 𝑑 β€– ≀ 1 / 10 and 𝑐 πœ† is sufficiently small. In addition, one has 𝜎 min ​ ( 𝑆 ~ 𝑑 ) 2 β‰₯ ( 1 βˆ’ β€– Ξ“ 𝑑 β€– ) ​ 𝜎 min ​ ( 𝑋 ⋆ ) 2 and β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ​ ( 𝑆 ~ 𝑑 ) ≀ ( 2 ​ πœ… ) βˆ’ 24 . Combine these relationships together to arrive at

β€– 𝑁 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ ≀ ( 1 βˆ’ πœ‚ 8 ) ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ + πœ‚ ​ 𝐢 24 ​ πœ… 6 𝑐 πœ† ​ β€– 𝑋 ⋆ β€– ​ β€– π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€– π–₯ + 1 2 ​ πœ‚ ​ β€– 𝑋 ⋆ β€– ​ ( β€– 𝑂 ~ 𝑑 β€– β€– 𝑋 ⋆ β€– ) 7 / 12 .

(140)

Summing up (139), (140), we obtain

β€– Ξ“ 𝑑 + 1 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯

≀ ( 1 βˆ’ πœ‚ 8 ) ​ ( β€– Ξ“ 𝑑 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ ) + πœ‚ ​ 2 ​ ( 𝐢 24 + 𝐢 26 ​ 𝑐 πœ† ) ​ πœ… 8 𝑐 πœ† ​ β€– 𝑋 ⋆ β€– 2 ​ β€– π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€– π–₯ + 2 ​ πœ‚ ​ ( β€– 𝑂 ~ 𝑑 β€– β€– 𝑋 ⋆ β€– ) 7 / 12 .

(141)

This is close to our desired conclusion, but we would need to eliminate β€– π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€– π–₯ . To this end we observe

β€– π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€– π–₯

≀ π‘Ÿ ⋆ ​ β€– Ξ” 𝑑 β€–

≀ 8 ​ 𝛿 ​ π‘Ÿ ⋆ ​ ( β€– 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 β€– π–₯ + β€– 𝑆 ~ 𝑑 β€– ​ β€– 𝑁 ~ 𝑑 β€– π–₯ + 𝑛 ​ β€– 𝑂 ~ 𝑑 β€– 2 )

≀ 16 ​ 𝑐 𝛿 ​ πœ… βˆ’ 4 ​ β€– 𝑋 ⋆ β€– 2 ​ ( β€– Ξ“ 𝑑 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ + ( β€– 𝑂 ~ 𝑑 β€– β€– 𝑋 ⋆ β€– ) 2 / 3 ) ,

where the first line follows from π‘ˆ ⋆ being of rank π‘Ÿ ⋆ , the second line follows from Lemma 12, and the last line follows from (10) and from controlling the sum inside the brackets in a similar way as (138).

The conclusion follows from plugging the above inequality into (141), noting that 𝑐 𝛿 can be chosen sufficiently small and that β€– 𝑂 ~ 𝑑 β€– / β€– 𝑋 ⋆ β€– is sufficiently small due to (24).

E.3Proof of Proposition 2

Recall that in the proof of Lemma 23 (Appendix C.2.1), we have shown

β€– 𝑂 ~ 𝑑 + 1 β€– ≀ β€– 𝑂 ~ 𝑑 β€– + πœ‚ ​ β€– 𝑁 𝑑 + 1 ​ 𝑉 𝑑 ​ ( 𝑆 𝑑 + 1 ​ 𝑉 𝑑 ) βˆ’ 1 β€– β‹… β€– 𝐸 𝑑 𝑏 β€– + πœ‚ ​ β€– 𝐸 𝑑 𝑑 β€– .

(142)

This, along with all the conclusions in Section 3 (Lemma 3, Lemma 4, Lemma 6) and in their proofs, holds for all 𝑑 ≀ 𝑇 max . However, it is clear from the proof that these conclusions continue to hold for 𝑑 ≀ 𝜏 , where 𝜏 is the minimal number such that

β€– 𝑂 ~ 𝜏 + 1 β€– > 𝛼 7 / 10 ​ β€– 𝑋 ⋆ β€– 3 / 10 ,

(143)

cf. (24). In other words, β€– 𝑂 ~ 𝑑 β€– ≀ 𝛼 7 / 10 ​ β€– 𝑋 ⋆ β€– 3 / 10 for all 𝑑 ≀ 𝜏 . By Lemma 6 extended to the stopping time 𝜏 , we have for 𝑑 4 ≀ 𝑑 ≀ 𝜏 that

β€– 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯ ≀ 𝛼 1 / 3 ​ β€– 𝑋 ⋆ β€– 5 / 3 .

(144)

We recall that Lemma 6 was derived from Lemma 28. Following the same derivation, this time controlling the term

β€– Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯

directly using Lemma 28 instead of passing to β€– 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ β€– , we find that for 𝑑 4 ≀ 𝑑 ≀ 𝜏 , the following stronger conclusion holds:

β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– ≀ ( 𝛼 β€– 𝑋 ⋆ β€– ) 1 / 3 .

(145)

Returning to the recursive inequality (142), we bound each term, this time using (98), (71), and a similar bound for 𝐸 𝑑 𝑑 , to obtain for all 𝑑 4 ≀ 𝑑 ≀ 𝜏 that:

β€– 𝑂 ~ 𝑑 + 1 β€–

≀ β€– 𝑂 ~ 𝑑 β€– + 𝐢 ​ πœ‚ ​ πœ… 𝐢 ​ β€– 𝑋 ⋆ β€– βˆ’ 1 ​ ( β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– + β€– 𝑂 ~ 𝑑 β€– ) ​ β€– 𝑂 ~ 𝑑 β€–

≀ β€– 𝑂 ~ 𝑑 β€– + 𝐢 ​ πœ‚ ​ πœ… 𝐢 ​ [ ( 𝛼 β€– 𝑋 ⋆ β€– ) 1 / 3 + ( 𝛼 β€– 𝑋 ⋆ β€– ) 7 / 10 ] ​ β€– 𝑂 ~ 𝑑 β€–

≀ ( 1 + πœ‚ ​ ( 𝛼 β€– 𝑋 ⋆ β€– ) 3 / 10 ) ​ β€– 𝑂 ~ 𝑑 β€–

where 𝐢 > 0 is a universal constant; the second line follows from (145) and from the fact that β€– 𝑂 ~ 𝑑 β€– ≀ 𝛼 7 / 10 ​ β€– 𝑋 ⋆ β€– 3 / 10 for 𝑑 ≀ 𝜏 , and the last line follows from (12c).

By induction on 𝑑 , it is easy to see

β€– 𝑂 ~ 𝜏 + 1 β€–

≀ ( 1 + πœ‚ ​ ( 𝛼 β€– 𝑋 ⋆ β€– ) 3 / 10 ) 𝜏 βˆ’ 𝑇 max ​ β€– 𝑂 ~ 𝑇 max β€–

≀ ( 1 + πœ‚ ​ ( 𝛼 β€– 𝑋 ⋆ β€– ) 3 / 10 ) 𝜏 βˆ’ 𝑇 max ​ 𝛼 3 / 4 ​ β€– 𝑋 ⋆ β€– 1 / 4 ,

where the last inequality follows from (24). Plugging this back into (143), we readily obtain

𝜏 βˆ’ 𝑇 max β‰₯ 𝑐 ​ log ⁑ ( β€– 𝑋 ⋆ β€– 𝛼 ) log ⁑ ( 1 + πœ‚ ​ ( 𝛼 β€– 𝑋 ⋆ β€– ) 3 / 10 ) β‰₯ 2 ​ 𝑐 ​ log ⁑ ( β€– 𝑋 ⋆ β€– 𝛼 ) πœ‚ ​ ( 𝛼 / β€– 𝑋 ⋆ β€– ) 3 / 10 β‰₯ ( β€– 𝑋 ⋆ β€– 𝛼 ) 3 / 10 ,

where 𝑐 = 3 / 4 βˆ’ 7 / 10 > 0 is a universal constant, and the last two inequalities follow from (12a) and (12c). This completes the proof.

Appendix FProofs for the noisy and the approximate low-rank settings

Both Theorem 4 and Theorem 5 can be viewed as special cases of the following theorem.

Theorem 6.

Assume that the iterates 𝑋 𝑑 of ScaledGD( πœ† ) obey

𝑋 𝑑 + 1 = 𝑋 𝑑 βˆ’ πœ‚ ​ ( π’œ βˆ— ​ π’œ ​ ( 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ ) βˆ’ 𝐸 ) ​ 𝑋 𝑑 ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 ,

(146)

for some matrix 𝐸 ∈ ℝ 𝑛 Γ— 𝑛 , where 𝑀 ⋆ = 𝑋 ⋆ ​ 𝑋 ⋆ ⊀ ∈ ℝ 𝑛 Γ— 𝑛 is a positive semidefinite matrix of rank π‘Ÿ ⋆ with 𝑋 ⋆ ∈ ℝ 𝑛 Γ— π‘Ÿ ⋆ . Assume further that

β€– 𝐸 β€– ≀ 𝑐 𝜎 ​ πœ… βˆ’ 𝐢 𝜎 ​ β€– 𝑀 ⋆ β€–

(147)

for some sufficiently small universal constant 𝑐 𝜎 > 0 and some sufficiently large universal constant 𝐢 𝜎 > 0 . Then the following holds with high probability (with respect to the realization of the random initialization 𝐺 ). Under Assumptions 1 and 2, there exist universal constants 𝐢 min > 0 and 𝐢 6 > 0 , such that for some 𝑇 ≀ 𝑇 min ≔ ( 𝐢 min / πœ‚ ) ​ log ⁑ ( β€– 𝑋 ⋆ β€– / 𝛼 ) , the iterates of (146) obey

β€– 𝑋 𝑇 ​ 𝑋 𝑇 ⊀ βˆ’ 𝑀 ⋆ β€–

≀ max ⁑ ( πœ€ ​ β€– 𝑀 ⋆ β€– , 𝐢 6 ​ πœ… 4 ​ β€– π‘ˆ ⋆ ⊀ ​ 𝐸 β€– ) ,

β€– 𝑋 𝑇 ​ 𝑋 𝑇 ⊀ βˆ’ 𝑀 ⋆ β€– π–₯

≀ max ⁑ ( πœ€ ​ β€– 𝑀 ⋆ β€– , 𝐢 6 ​ πœ… 4 ​ β€– π‘ˆ ⋆ ⊀ ​ 𝐸 β€– π–₯ ) .

The proof is postponed to Appendix G. The rest of this appendix is devoted to showing how to deduce Theorem 4 and Theorem 5 from Theorem 6.
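To make the update (146) concrete, here is a minimal numerical sketch. It assumes the idealized sensing operator π’œ βˆ— ​ π’œ = ℐ (so the RIP holds trivially) and 𝐸 = 0 ; the dimensions, step size πœ‚ , damping πœ† , and initialization scale 𝛼 are illustrative choices, not those prescribed by the theorem.

```python
import numpy as np

# Minimal sketch of the update (146) with A*A = Identity and E = 0
# (an idealized sensing operator; sizes and step sizes are illustrative).
rng = np.random.default_rng(1)
n, r_star, r = 20, 2, 5            # true rank r_star, overparameterized rank r
X_star = rng.standard_normal((n, r_star))
M_star = X_star @ X_star.T

eta, lam, alpha = 0.3, 1e-3, 1e-3
X = alpha * rng.standard_normal((n, r))   # small random initialization

for _ in range(3000):
    grad = (X @ X.T - M_star) @ X          # A*A(X X^T - M*) X with A*A = I, E = 0
    precond = np.linalg.inv(X.T @ X + lam * np.eye(r))
    X = X - eta * grad @ precond           # damped preconditioned step

err = np.linalg.norm(X @ X.T - M_star) / np.linalg.norm(M_star)
print(err < 1e-3)  # True
```

Despite the overparameterized rank �
r and the tiny random initialization, the damped preconditioner drives the relative error down, illustrating the claimed convergence in this idealized instance.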

F.1Proof of Theorem 4

In the noisy setting, the update rule (14) of ScaledGD( πœ† ) can be written as

𝑋 𝑑 + 1 = 𝑋 𝑑 βˆ’ πœ‚ ​ ( π’œ βˆ— ​ π’œ ​ ( 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ ) βˆ’ 𝐸 ) ​ 𝑋 𝑑 ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 ,

(148)

where

𝐸 ≔ π’œ βˆ— ​ ( πœ‰ )

βˆ‘ 𝑖

1 π‘š πœ‰ 𝑖 ​ 𝐴 𝑖 .

(149)

We use the following classical lemma to show that the matrix 𝐸 defined above fulfills the assumption of Theorem 6.

Lemma 29.

Under Assumption 1, the following holds with probability at least 1 βˆ’ 2 ​ exp ⁑ ( βˆ’ 𝑐 ​ 𝑛 ) .

β€– 𝐸 β€– ≀ 8 ​ 𝜎 ​ √ 𝑛 , β€– π‘ˆ ⋆ ⊀ ​ 𝐸 β€– π–₯ ≀ 8 ​ 𝜎 ​ √ ( 𝑛 ​ π‘Ÿ ⋆ ) .

Proof.

The first inequality can be found in candes2010noisymc, Lemma 1.1. The second inequality can be deduced from the first one as follows. Since π‘ˆ ⋆ ⊀ ​ 𝐸 has rank at most π‘Ÿ ⋆ , one has β€– π‘ˆ ⋆ ⊀ ​ 𝐸 β€– π–₯ ≀ √ π‘Ÿ ⋆ ​ β€– π‘ˆ ⋆ ⊀ ​ 𝐸 β€– ≀ √ π‘Ÿ ⋆ ​ β€– 𝐸 β€– ≀ 8 ​ 𝜎 ​ √ ( 𝑛 ​ π‘Ÿ ⋆ ) , as desired. ∎

The conclusion of Theorem 4 follows immediately by conditioning on the event that the inequalities in Lemma 29 hold, and then invoking Theorem 6.
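The rank argument used in the proof of Lemma 29 is easy to verify numerically; the dimensions below are arbitrary illustrative choices.

```python
import numpy as np

# Numeric check of the rank argument: for any U with r orthonormal columns,
# ||U^T E||_F <= sqrt(r) * ||U^T E|| <= sqrt(r) * ||E||.  Sizes are illustrative.
rng = np.random.default_rng(2)
n, r = 30, 4
U, _ = np.linalg.qr(rng.standard_normal((n, r)))  # orthonormal columns
E = rng.standard_normal((n, n))

fro = np.linalg.norm(U.T @ E, "fro")
op = np.linalg.norm(U.T @ E, 2)
ok = (fro <= np.sqrt(r) * op) and (op <= np.linalg.norm(E, 2))
print(ok)  # True
```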

F.2Proof of Theorem 5

In the approximately low-rank setting, the update rule of ScaledGD( πœ† ) can be written as

𝑋 𝑑 + 1 = 𝑋 𝑑 βˆ’ πœ‚ ​ ( π’œ βˆ— ​ π’œ ​ ( 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 π‘Ÿ ⋆ ) βˆ’ 𝐸 ) ​ 𝑋 𝑑 ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 ,

(150)

where

𝐸 ≔ π’œ βˆ— ​ π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) .

(151)

Recall that we assumed π’œ follows the Gaussian design in Theorem 5. One may show that the matrix 𝐸 defined above fulfills the assumption of Theorem 6 using random matrix theory, detailed below.

Lemma 30.

Under the assumptions on π’œ and π‘š in Theorem 5, the following holds with probability at least 1 βˆ’ 2 ​ exp ⁑ ( βˆ’ 𝑐 ​ 𝑛 ) .

β€– 𝐸 β€– ≀ 2 ​ β€– 𝑀 π‘Ÿ ⋆ β€² β€– + 16 ​ √ ( 𝑛 / π‘š ) ​ β€– 𝑀 π‘Ÿ ⋆ β€² β€– π–₯ , β€– π‘ˆ ⋆ ⊀ ​ 𝐸 β€– π–₯ ≀ 16 ​ β€– 𝑀 π‘Ÿ ⋆ β€² β€– π–₯ .

Proof.

For the first inequality, we use a standard covering argument. Let β„‹ be a 1 / 4 -net of π•Š 𝑛 βˆ’ 1 , which can be chosen to satisfy | β„‹ | ≀ 9 𝑛 . It is well known that

β€– π’œ βˆ— ​ π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) β€–

sup 𝑣 ∈ π•Š 𝑛 βˆ’ 1 | ⟨ 𝑣 , π’œ βˆ— ​ π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) ​ 𝑣 ⟩ | ≀ 2 ​ sup 𝑣 ∈ β„‹ | ⟨ 𝑣 , π’œ βˆ— ​ π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) ​ 𝑣 ⟩ | .

(152)

Note that ⟨ 𝑣 , π’œ βˆ— ​ π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) ​ 𝑣 ⟩ is an order- 2 Gaussian chaos, which can be bounded by standard methods (see e.g. candes2010noisymc), yielding

| ⟨ 𝑣 , π’œ βˆ— ​ π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) ​ 𝑣 ⟩ βˆ’ ⟨ 𝑣 , 𝑀 π‘Ÿ ⋆ β€² ​ 𝑣 ⟩ | = | ⟨ 𝑣 , π’œ βˆ— ​ π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) ​ 𝑣 ⟩ βˆ’ 𝔼 ​ ⟨ 𝑣 , π’œ βˆ— ​ π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) ​ 𝑣 ⟩ | ≀ 8 ​ √ ( 𝑛 / π‘š ) ​ β€– 𝑀 π‘Ÿ ⋆ β€² β€– π–₯

with probability at least 1 βˆ’ 2 ​ exp ⁑ ( βˆ’ 4 ​ 𝑛 ) . The desired inequality then follows from (152) and a union bound.

For the second inequality, we first note that the random vector π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) ∈ ℝ π‘š is Gaussian with law 𝒩 ​ ( 0 , 1 π‘š ​ β€– 𝑀 π‘Ÿ ⋆ β€² β€– π–₯ 2 ​ 𝐼 ) . A standard Gaussian concentration inequality implies β€– π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) β€– ≀ 2 ​ β€– 𝑀 π‘Ÿ ⋆ β€² β€– π–₯ with probability at least 1 βˆ’ 2 ​ exp ⁑ ( βˆ’ π‘š / 2 ) . To bound β€– π‘ˆ ⋆ ⊀ ​ π’œ βˆ— ​ π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) β€– π–₯ , the next step is to control the operator norm of π‘ˆ ⋆ ⊀ ​ π’œ βˆ— as an operator on the following spaces:

π‘ˆ ⋆ ⊀ π’œ βˆ— : ( ℝ π‘š , β„“ 2 ) β†’ ( ℝ π‘Ÿ ⋆ Γ— 𝑛 , βˆ₯ β‹… βˆ₯ π–₯ ) ⏟ ≕ β„³ .

In this sense, we may see that π‘ˆ ⋆ ⊀ ​ π’œ βˆ— is a Gaussian operator: the matrix form of this operator is an ( π‘Ÿ ⋆ ​ 𝑛 ) Γ— π‘š matrix whose 𝑖 -th column is the vectorization of π‘ˆ ⋆ ⊀ ​ 𝐴 𝑖 , which is i.i.d. Gaussian as 𝐴 𝑖 is. Assume the covariance of such a column is Ξ› 2 ∈ ℝ ( π‘Ÿ ⋆ ​ 𝑛 ) Γ— ( π‘Ÿ ⋆ ​ 𝑛 ) ; then the matrix form of π‘ˆ ⋆ ⊀ ​ π’œ βˆ— has the same distribution as Ξ› ​ 𝐺 , where 𝐺 is an ( π‘Ÿ ⋆ ​ 𝑛 ) Γ— π‘š random matrix with i.i.d. standard Gaussian entries. Again, a standard bound in random matrix theory (cf. (30a)) implies that β€– 𝐺 β€– ≀ 4 ​ ( √ π‘š + √ ( π‘Ÿ ⋆ ​ 𝑛 ) ) with probability at least 1 βˆ’ exp ⁑ ( βˆ’ 𝑐 ​ π‘š ) , given π‘š β‰₯ 𝐢 ​ 𝑛 ​ π‘Ÿ ⋆ as assumed in Theorem 5. Conditioning on this event, we have

β€– π‘ˆ ⋆ ⊀ ​ π’œ βˆ— β€– ≀ 4 ​ ( π‘š + π‘Ÿ ⋆ ​ 𝑛 ) ​ β€– Ξ› β€– .

To compute β€– Ξ› β€– , note that since Ξ› ​ 𝐺 has the same distribution as the matrix form of π‘ˆ ⋆ ⊀ ​ π’œ βˆ— , we have

β€– 𝔼 ​ ( π‘ˆ ⋆ ⊀ ​ π’œ βˆ— ​ π’œ ​ π‘ˆ ⋆ ) β€– β„³

β€– 𝔼 ​ ( Ξ› ​ 𝐺 ​ 𝐺 ⊀ ​ Ξ› ) β€–

β€– Ξ› ​ ( π‘š ​ 𝐼 ) ​ Ξ› β€–

π‘š ​ β€– Ξ› β€– 2 ,

where the norm βˆ₯ β‹… βˆ₯ β„³ denotes the operator norm for operators on β„³ . But 𝔼 ​ ( π’œ βˆ— ​ π’œ ) = ℐ , thus 𝔼 ​ ( π‘ˆ ⋆ ⊀ ​ π’œ βˆ— ​ π’œ ​ π‘ˆ ⋆ ) = π‘ˆ ⋆ ⊀ ​ π‘ˆ ⋆ = 𝐼 is the identity operator, hence β€– 𝔼 ​ ( π‘ˆ ⋆ ⊀ ​ π’œ βˆ— ​ π’œ ​ π‘ˆ ⋆ ) β€– β„³ = 1 . Plugging this into the above identity, we find β€– Ξ› β€– = 1 / √ π‘š . These together imply

β€– π‘ˆ ⋆ ⊀ ​ π’œ βˆ— β€– ≀ 4 ​ ( π‘š + π‘Ÿ ⋆ ​ 𝑛 ) β‹… 1 π‘š

4 ​ ( 1 + π‘Ÿ ⋆ ​ 𝑛 π‘š )

with probability at least 1 βˆ’ 2 ​ exp ⁑ ( βˆ’ 𝑐 ​ π‘š ) . The last quantity is less than 8 by the assumption π‘š β‰₯ 𝐢 ​ 𝑛 ​ π‘Ÿ ⋆ in Theorem 5. Therefore

β€– π‘ˆ ⋆ ⊀ ​ π’œ βˆ— ​ π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) β€– π–₯ ≀ β€– π‘ˆ ⋆ ⊀ ​ π’œ βˆ— β€– β‹… β€– π’œ ​ ( 𝑀 π‘Ÿ ⋆ β€² ) β€– ≀ 8 β‹… 2 ​ β€– 𝑀 π‘Ÿ ⋆ β€² β€– π–₯

16 ​ β€– 𝑀 π‘Ÿ ⋆ β€² β€– π–₯

with probability at least 1 βˆ’ exp ⁑ ( βˆ’ 𝑐 ​ π‘š ) , as desired. ∎

The conclusion of Theorem 5 follows immediately by conditioning on the event that the inequalities in Lemma 30 hold, and then invoking Theorem 6 with 𝑀 ⋆ substituted by 𝑀 π‘Ÿ ⋆ .

Appendix GProof of Theorem 6

The proof is based on a reduction to the noiseless setting. We begin with two heuristic observations that connect the generalized setting with the noiseless one, and make these observations formal later.

Observation 1: Phase I approximates power method for π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) + 𝐸 .

As in the noiseless setting, in the first few iterations we expect β€– 𝑋 𝑑 β€– to remain small, thus the update equation (14) can be approximated by

𝑋 𝑑 + 1 β‰ˆ ( 𝐼 + πœ‚ ​ ( π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) + 𝐸 ) ) ​ 𝑋 𝑑 .

This coincides with the update equation of power method for π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) + 𝐸 . Recall that in the noiseless setting, the first phase is also akin to power method, albeit for π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) . The key observation is that π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) + 𝐸 enjoys all the same properties of π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) that were required to establish Lemma 19. In fact, the only property of π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) used in the proof of Lemma 19 is

β€– ( π’œ βˆ— ​ π’œ βˆ’ ℐ ) ​ 𝑀 ⋆ β€– ≲ 𝑐 𝛿 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ,

but by the assumption (147), π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) + 𝐸 also satisfies

β€– π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) + 𝐸 βˆ’ 𝑀 ⋆ β€– ≀ β€– ( π’œ βˆ— ​ π’œ βˆ’ ℐ ) ​ 𝑀 ⋆ β€– + β€– 𝐸 β€– ≲ 𝑐 𝛿 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 .

Thus all conclusions of Lemma 19 remain valid in the generalized setting.
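This power-method picture can be sketched numerically. The block power iteration below uses illustrative sizes and a hypothetical small symmetric perturbation playing the role of ( π’œ βˆ— ​ π’œ βˆ’ ℐ ) ​ ( 𝑀 ⋆ ) + 𝐸 , with an orthonormalization step added purely for numerical stability:

```python
import numpy as np

# Sketch of Observation 1: for small X_t, the iteration
# X_{t+1} ~ (I + (eta/lam) M_hat) X_t is a (block) power method for
# M_hat = A*A(M*) + E.  Sizes and the perturbation are illustrative.
rng = np.random.default_rng(3)
n, r_star = 15, 2
X_star = rng.standard_normal((n, r_star))
Pert = rng.standard_normal((n, n))
M_hat = X_star @ X_star.T + 1e-3 * (Pert + Pert.T) / 2  # small symmetric "noise"

eta, lam = 0.1, 1e-2
X = 1e-6 * rng.standard_normal((n, r_star))
for _ in range(100):
    X = (np.eye(n) + (eta / lam) * M_hat) @ X
    X, _ = np.linalg.qr(X)            # orthonormalize for numerical stability

# The column space of X aligns with the top-r_star eigenvectors of M_hat.
_, V = np.linalg.eigh(M_hat)          # eigenvalues in ascending order
U_top = V[:, -r_star:]
align = np.linalg.svd(U_top.T @ X, compute_uv=False).min()
print(align > 0.99)  # True
```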

Observation 2: In Phase II and III, the update equation has the same form as that in the noiseless setting.

Set

Ξ” 𝑑 β€² ≔ Δ 𝑑 βˆ’ 𝐸 ,

then the update equation in the generalized setting can be expressed as

𝑋 𝑑 + 1 = 𝑋 𝑑 βˆ’ πœ‚ ​ ( 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ ) ​ 𝑋 𝑑 ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 + πœ‚ ​ Ξ” 𝑑 β€² ​ 𝑋 𝑑 ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 ,

which has the same form as the noiseless update equation (63), once we replace Ξ” 𝑑 there by Ξ” 𝑑 β€² . In the proof of Phase II, the only property of Ξ” 𝑑 we used is (50), which still holds for Ξ” 𝑑 β€² since β€– 𝐸 β€– is small. Thus the proof carries over directly to the generalized setting of Theorem 6. Moreover, in the proof of Phase III, the only places that control Ξ” 𝑑 in a manner different from (50) are (122) and (140). These equations require us to control ||| π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 ||| for some unitarily invariant norm ||| β‹… ||| . If we replace Ξ” 𝑑 by Ξ” 𝑑 β€² , we can bound in these equations that

β€– | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€² | β€– ≀ β€– | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 | β€– + β€– | π‘ˆ ⋆ ⊀ ​ 𝐸 | β€– .

Since any unitarily invariant norm ||| β‹… ||| is bounded by the operator norm up to a multiplicative constant (depending on the rank of the matrix), we may control ||| π‘ˆ ⋆ ⊀ ​ 𝐸 ||| using the assumption (147). Then we may combine (122) and (140) (assuming (140) also holds with the Frobenius norm replaced by ||| β‹… ||| ) to obtain

||| Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 ||| + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ ||| 𝑁 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 βˆ’ 1 ​ Ξ£ ⋆ |||

≀ ( 1 βˆ’ πœ‚ / 10 ) ​ ( ||| Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ βˆ’ Ξ£ ⋆ 2 ) ​ Ξ£ ⋆ βˆ’ 1 ||| + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ ||| 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ ||| ) + πœ‚ ​ 𝐢 ​ πœ… 4 ​ ||| π‘ˆ ⋆ ⊀ ​ 𝐸 ||| + πœ‚ ​ ( β€– 𝑂 ~ 𝑑 β€– / β€– 𝑋 ⋆ β€– ) 1 / 2 .

(153)

The conclusion of the theorem would immediately follow from the above inequality combined with Lemma 29 and Lemma 27, by taking | | | β‹… | | | to be the operator norm and the Frobenius norm.

Based on these observations, we formally state below the generalizations of key lemmas in the three phases required to prove Theorem 6. Most of them have identical proofs to their noiseless counterparts, and in such cases the proofs will be omitted. The few of them that require a slightly modified proof will be discussed in full detail.

G.1Generalization of Phase I

Our goal is to prove Lemma 3 in the generalized setting.

Lemma 31.

The conclusions of Lemma 3, along with its corollaries (3) and (24), still hold in the setting of Theorem 6.

As in the noiseless setting, this lemma is proved by establishing its two parts respectively: the base case, where we show that there exists some 𝑑 1 ≀ 𝑇 min / 16 such that (21) holds and (22) holds with 𝑑 = 𝑑 1 , and the induction step, where we show that (22) continues to hold for 𝑑 ∈ [ 𝑑 1 , 𝑇 max ] .

G.1.1Establishing the base case

We first show that Lemma 19 still holds in the generalized setting.

Lemma 32.

Under the same setting as Theorem 6, there exists some 𝑑 1 ≀ 𝑇 min / 16 such that (21) holds and (22) holds with 𝑑 = 𝑑 1 .

We prove this result in a slightly more general setting. We consider a general symmetric matrix 𝑀 ^ ∈ ℝ 𝑛 Γ— 𝑛 , and set

𝑋 ^ 𝑑 = ( 𝐼 + ( πœ‚ / πœ† ) ​ 𝑀 ^ ) 𝑑 ​ 𝑋 0 , 𝑑 = 0 , 1 , 2 , β‹―

We also denote

𝑠 𝑗 ≔ 𝜎 𝑗 ​ ( 𝐼 + ( πœ‚ / πœ† ) ​ 𝑀 ^ ) = 1 + ( πœ‚ / πœ† ) ​ 𝜎 𝑗 ​ ( 𝑀 ^ ) , 𝑗 = 1 , 2 , … , 𝑛 .

The treatment of the noiseless setting in Appendix C corresponds to the special case 𝑀 ^ = π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) . In the generalized setting, we choose 𝑀 ^ = π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) + 𝐸 . The following two lemmas are generalized from the lemmas in Appendix C, but have verbatim proofs, which are therefore omitted.

Lemma 33 (Generalization of Lemma 20).

Suppose that πœ† β‰₯ 1 100 ​ πœ… βˆ’ 4 ​ 𝑐 πœ† ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) . For any πœƒ ∈ ( 0 , 1 ) , there exists a large enough constant 𝐾

𝐾 ​ ( πœƒ , 𝑐 πœ† , 𝐢 𝐺 )

0 such that the following holds. As long as 𝛼 obeys

log ⁑ β€– 𝑋 ⋆ β€– 𝛼 β‰₯ 𝐾 max ⁑ ( πœ‚ , πœ… βˆ’ 2 ) ​ log ⁑ ( 2 ​ πœ… ​ 𝑛 ) β‹… ( 1 + log ⁑ ( 1 + πœ‚ πœ† ​ β€– 𝑀 ^ β€– ) ) ,

(154)

one has for all 𝑑 ≀ 1 πœƒ ​ πœ‚ ​ log ⁑ ( πœ… ​ 𝑛 ) :

βˆ₯ 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 βˆ₯ ≀ 𝑑 ​ ( 1 + ( πœ‚ / πœ† ) ​ βˆ₯ 𝑀 ^ βˆ₯ ) 𝑑 ​ 𝛼 2 / β€– 𝑋 ⋆ β€– .

(155)

Moreover, βˆ₯ 𝑋 𝑑 βˆ₯ ≀ βˆ₯ 𝑋 ⋆ βˆ₯ for all such 𝑑 .

Lemma 34 (Generalization of Lemma 21).

There exists some small universal constant 𝑐 34 > 0 such that the following holds. Assume that for some 𝛾 ≀ 𝑐 34 ,

βˆ₯ 𝑀 ^ βˆ’ 𝑀 ⋆ βˆ₯ ≀ 𝛾 ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) ,

(156)

and furthermore,

πœ™ ≔ 𝛼 ​ β€– 𝐺 β€– ​ 𝑠 π‘Ÿ ⋆ + 1 𝑑 + β€– 𝑋 𝑑 βˆ’ 𝑋 ^ 𝑑 β€– 𝛼 ​ 𝜎 min ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ​ 𝑠 π‘Ÿ ⋆ 𝑑 ≀ 𝑐 34 ​ πœ… βˆ’ 2 .

(157)

Then for some universal constant 𝐢 34 > 0 the following hold:

𝜎 min ​ ( 𝑆 ~ 𝑑 )

β‰₯ 𝛼 4 ​ 𝜎 min ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ​ 𝑠 π‘Ÿ ⋆ 𝑑 ,

(158a)

β€– 𝑂 ~ 𝑑 β€–

≀ 𝐢 34 ​ πœ™ ​ 𝛼 ​ 𝜎 min ​ ( π‘ˆ ^ ⊀ ​ 𝐺 ) ​ 𝑠 π‘Ÿ ⋆ 𝑑 ,

(158b)

β€– π‘ˆ ⋆ , βŸ‚ ⊀ ​ π‘ˆ 𝑋 ~ 𝑑 β€–

≀ 𝐢 34 ​ ( 𝛾 + πœ™ ) ,

(158c)

where 𝑋 ~ 𝑑 ≔ 𝑋 𝑑 ​ 𝑉 𝑑 ∈ ℝ 𝑛 Γ— π‘Ÿ ⋆ .

We are now ready to prove Lemma 32.

Proof of Lemma 32.

Recall that the generalized setting corresponds to 𝑀 ^ = π’œ βˆ— ​ π’œ ​ ( 𝑀 ⋆ ) + 𝐸 . The proof is mostly identical to the proof of Lemma 19. As in that proof, we first need to verify the two assumptions in Lemma 34. The rest of the proof goes through exactly the same, and is thus omitted here.

Verifying assumption (156).

By the RIP in (9), Lemma 8, the condition of 𝛿 in (10), and the assumption (147), we have

β€– 𝑀 ^ βˆ’ 𝑀 ⋆ β€– = βˆ₯ ( ℐ βˆ’ π’œ βˆ— ​ π’œ ) ​ ( 𝑀 ⋆ ) + 𝐸 βˆ₯ ≀ √ π‘Ÿ ⋆ ​ 𝛿 ​ β€– 𝑀 ⋆ β€– + 𝑐 𝜎 ​ πœ… βˆ’ 𝐢 𝜎 ​ β€– 𝑀 ⋆ β€– ≀ 𝑐 𝛿 ​ πœ… βˆ’ ( 𝐢 𝛿 βˆ’ 2 ) ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) + 𝑐 𝜎 ​ πœ… βˆ’ ( 𝐢 𝜎 βˆ’ 2 ) ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) ≕ 𝛾 ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) .

(159)

Here 𝛾 = 𝑐 𝛿 ​ πœ… βˆ’ ( 𝐢 𝛿 βˆ’ 2 ) + 𝑐 𝜎 ​ πœ… βˆ’ ( 𝐢 𝜎 βˆ’ 2 ) ≀ 𝑐 34 , as 𝑐 𝛿 and 𝑐 𝜎 are assumed to be sufficiently small.

Verifying assumption (157).

By Weyl’s inequality and (159), we have

| 𝑠 𝑗 βˆ’ 1 βˆ’ ( πœ‚ / πœ† ) ​ 𝜎 𝑗 ​ ( 𝑀 ⋆ ) | ≀ ( πœ‚ / πœ† ) ​ βˆ₯ 𝑀 ^ βˆ’ 𝑀 ⋆ βˆ₯ ≀ ( πœ‚ / πœ† ) ​ 𝛾 ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) ≀ ( 100 ​ ( 𝑐 𝛿 + 𝑐 𝜎 ) / 𝑐 πœ† ) ​ πœ‚ ,

where the last inequality follows from the condition πœ† β‰₯ ( 1 / 100 ) ​ 𝑐 πœ† ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) . Furthermore, using the condition πœ† ≀ 𝑐 πœ† ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) assumed in (12b), the above bound implies that, for some 𝐢 = 𝐢 ​ ( 𝑐 πœ† , 𝑐 𝜎 , 𝑐 𝛿 ) > 0 ,

𝑠 1 ≀ 1 + ( πœ‚ / πœ† ) ​ β€– 𝑀 ⋆ β€– + ( 100 ​ ( 𝑐 𝛿 + 𝑐 𝜎 ) / 𝑐 πœ† ) ​ πœ‚ ≀ 1 + 𝐢 ​ πœ‚ ​ πœ… 6 ,

(160a)

𝑠 π‘Ÿ ⋆ β‰₯ 1 + ( πœ‚ / πœ† ) ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) βˆ’ ( 100 ​ ( 𝑐 𝛿 + 𝑐 𝜎 ) / 𝑐 πœ† ) ​ πœ‚ β‰₯ 1 + ( πœ‚ / ( 2 ​ πœ† ) ) ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) ,

(160b)

𝑠 π‘Ÿ ⋆ ≀ 1 + ( πœ‚ / πœ† ) ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) + ( 100 ​ ( 𝑐 𝛿 + 𝑐 𝜎 ) / 𝑐 πœ† ) ​ πœ‚ ≀ 1 + ( 2 ​ πœ‚ / πœ† ) ​ 𝜎 min 2 ​ ( 𝑋 ⋆ ) ,

(160c)

𝑠 π‘Ÿ ⋆ + 1 ≀ 1 + ( 100 ​ ( 𝑐 𝛿 + 𝑐 𝜎 ) / 𝑐 πœ† ) ​ πœ‚ ≀ 1 + πœ‚ / ( 4 ​ 𝑐 πœ† ) ,

(160d)

where we use the fact that 𝜎 π‘Ÿ ⋆ + 1 ​ ( 𝑀 ⋆ ) = 0 and that 𝑐 𝛿 + 𝑐 𝜎 ≀ 1 / 400 . The rest of the verification is the same as the verification of (81) in the proof of Lemma 19. ∎
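The Weyl step used above can be checked numerically for the eigenvalues of the symmetric matrices involved; the sizes and perturbation scale below are illustrative.

```python
import numpy as np

# Numeric check of the Weyl step: perturbing M* by Delta moves every
# eigenvalue by at most ||Delta||, hence every s_j by at most
# (eta/lam) * ||M_hat - M*||.  Sizes are illustrative.
rng = np.random.default_rng(4)
n, r_star = 12, 3
X_star = rng.standard_normal((n, r_star))
M_star = X_star @ X_star.T
Delta = rng.standard_normal((n, n))
Delta = 1e-2 * (Delta + Delta.T) / 2     # small symmetric perturbation
M_hat = M_star + Delta

eig_star = np.sort(np.linalg.eigvalsh(M_star))
eig_hat = np.sort(np.linalg.eigvalsh(M_hat))
gap = np.abs(eig_hat - eig_star).max()
print(gap <= np.linalg.norm(Delta, 2) + 1e-12)  # True
```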

G.1.2Establishing the induction step

Following the proof of the noiseless setting, we would like to show that Lemmas 23, 24, 25 still hold in the generalized setting, which in turn relies entirely on Lemmas 13, 14, 15. Since Lemma 14 and Lemma 15 are both corollaries of Lemma 13, it suffices to prove the generalization of Lemma 13 in the generalized setting.

Lemma 35 (Generalization of Lemma 13).

Assume the update equation of 𝑋 𝑑 has the following form (cf. (63)):

𝑋 𝑑 + 1 = 𝑋 𝑑 βˆ’ πœ‚ ​ ( 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ ) ​ 𝑋 𝑑 ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 + πœ‚ ​ Ξ” 𝑑 β€² ​ 𝑋 𝑑 ​ ( 𝑋 𝑑 ⊀ ​ 𝑋 𝑑 + πœ† ​ 𝐼 ) βˆ’ 1 ,

where Ξ” 𝑑 β€² ∈ ℝ 𝑛 Γ— 𝑛 is some symmetric matrix satisfying β€– Ξ” 𝑑 β€² β€– ≀ 𝑐 12 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ​ β€– 𝑋 ⋆ β€– 2 . For any 𝑑 such that 𝑆 ~ 𝑑 is invertible and (22) holds, the equations (53a) and (53b) hold, where the error terms are bounded by (54b)–(54d) and the following modifications of (54a) and (54e):

β€– | 𝐸 𝑑 π‘Ž | β€–

≀ 2 ​ 𝑐 3 ​ πœ… βˆ’ 4 ​ β€– 𝑋 ⋆ β€– β‹… β€– | 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ | ​ β€– + 2 β€– ​ | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€² | β€– ,

(161a)

β€– | 𝐸 𝑑 𝑒 | β€–

≀ 2 ​ β€– | π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€² | β€– + 𝑐 12 ​ πœ… βˆ’ 5 ​ β€– 𝑋 ⋆ β€– β‹… β€– | 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ | β€– .

(161b)

The proof is identical to that of Lemma 13. Note that the noiseless setting corresponds to the special case Ξ” 𝑑 β€² = Ξ” 𝑑 = ( ℐ βˆ’ π’œ βˆ— ​ π’œ ) ​ ( 𝑀 ⋆ ) , while the generalized setting corresponds to Ξ” 𝑑 β€² = Ξ” 𝑑 βˆ’ 𝐸 = ( ℐ βˆ’ π’œ βˆ— ​ π’œ ) ​ ( 𝑀 ⋆ ) βˆ’ 𝐸 . To show that Lemma 35 is applicable to the generalized setting, we need to verify that this choice of Ξ” 𝑑 β€² guarantees the smallness of β€– Ξ” 𝑑 β€² β€– , which is proved in the following lemma.

Lemma 36 (Generalization of (50) in Lemma 12).

Under the same setting as Theorem 6, for any 𝑑 such that (22) holds, we have

β€– Ξ” 𝑑 β€² β€– ≀ 𝑐 12 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ​ β€– 𝑋 ⋆ β€– 2 .

Proof.

Combining (52) in the proof of Lemma 12 with the assumption β€– 𝐸 β€– ≀ 𝑐 𝜎 ​ πœ… βˆ’ 𝐢 𝜎 ​ β€– 𝑀 ⋆ β€– in (147), we obtain

β€– Ξ” 𝑑 β€² β€–

≀ 16 ​ 𝛿 ​ π‘Ÿ ⋆ ​ πœ… 2 ​ ( 𝐢 3 . π‘Ž 2 + 1 ) ​ β€– 𝑋 ⋆ β€– 2 + 𝑐 𝜎 ​ πœ… βˆ’ 𝐢 𝜎 ​ β€– 𝑋 ⋆ β€– 2

≀ ( 16 ​ 𝑐 𝛿 ​ πœ… βˆ’ 𝐢 𝛿 + 2 ​ ( 𝐢 3 . π‘Ž 2 + 1 ) 2 + 𝑐 𝜎 ​ πœ… βˆ’ 𝐢 𝜎 ) ​ β€– 𝑋 ⋆ β€– 2

≀ 𝑐 12 ​ πœ… βˆ’ 2 ​ 𝐢 𝛿 / 3 ​ β€– 𝑋 ⋆ β€– 2 ,

if we choose 𝐢 𝜎 β‰₯ 𝐢 𝛿 and 𝑐 𝜎 ≀ 𝑐 𝛿 , and note that 𝑐 12 = 32 ​ ( 𝐢 3 . π‘Ž + 1 ) 2 ​ 𝑐 𝛿 as defined in Lemma 12 (please refer to the argument after (52) for details). ∎

With these fundamental results in hand we can follow the same arguments as in the noiseless case to prove the following generalization of the lemmas in Appendix C.2.

Lemma 37.

The conclusions of Lemmas 23 and 25 still hold in the setting of Theorem 6. Moreover, the following modification of Lemma 24 holds in the setting of Theorem 6. For any 𝑑 such that (22) holds, setting 𝑍 𝑑 ≔ Ξ£ ⋆ βˆ’ 1 ​ ( 𝑆 ~ 𝑑 ​ 𝑆 ~ 𝑑 ⊀ + πœ† ​ 𝐼 ) ​ Ξ£ ⋆ βˆ’ 1 , there exists some universal constant 𝐢 24 > 0 such that

||| 𝑁 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 βˆ’ 1 ​ Ξ£ ⋆ ||| ≀ ( 1 βˆ’ πœ‚ / ( 3 ​ ( β€– 𝑍 𝑑 β€– + πœ‚ ) ) ) ​ ||| 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ ||| + πœ‚ ​ 𝐢 24 ​ πœ… 6 / ( 𝑐 πœ† ​ β€– 𝑋 ⋆ β€– ) ​ ||| π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€² ||| + πœ‚ ​ ( β€– 𝑂 ~ 𝑑 β€– / 𝜎 min ​ ( 𝑆 ~ 𝑑 ) ) 1 / 2 ​ β€– 𝑋 ⋆ β€– .

(162)

In particular, if 𝑐 3 = 100 ​ 𝐢 24 ​ ( 𝐢 3 . π‘Ž + 1 ) 4 ​ 𝑐 𝛿 / 𝑐 πœ† , then β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– ≀ 𝑐 3 ​ πœ… βˆ’ 𝐢 𝛿 / 2 ​ β€– 𝑋 ⋆ β€– implies β€– 𝑁 ~ 𝑑 + 1 ​ 𝑆 ~ 𝑑 + 1 βˆ’ 1 ​ Ξ£ ⋆ β€– ≀ 𝑐 3 ​ πœ… βˆ’ 𝐢 𝛿 / 2 ​ β€– 𝑋 ⋆ β€– .

By the arguments following Lemma 25, the above results are sufficient to prove the induction step, thereby completing the proof of Lemma 3 in the generalized setting.

G.2Generalization of Phase II

We will prove Lemma 4 and Lemma 5, the main results of Phase II, in the generalized setting.

Lemma 38.

The conclusions of Lemma 4 and Lemma 5, along with Corollary 1 and Corollary 2, still hold under the generalized setting of Theorem 6.

Tracking the proof of Phase II in Appendix D, one may verify that all proofs there hold verbatim in the generalized setting, with Lemma 35 in place of Lemma 13 (the proof also used Lemmas 14, 15, which are corollaries of Lemma 13, hence hold in the generalized setting given Lemma 35), except for Lemma 26, which should be substituted by the following generalization:

Lemma 39.

Under the same setting as Theorem 6, for any 𝑑 : 𝑑 2 ≀ 𝑑 ≀ 𝑇 max , one has

||| Ξ“ 𝑑 + 1 ||| ≀ ( 1 βˆ’ πœ‚ ) ​ ||| Ξ“ 𝑑 ||| + πœ‚ ​ 𝐢 26 ​ πœ… 4 / β€– 𝑋 ⋆ β€– 2 ​ ||| π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€² ||| + ( 1 / 16 ) ​ πœ‚ ​ β€– 𝑋 ⋆ β€– βˆ’ 1 ​ ||| 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ ||| + πœ‚ ​ ( β€– 𝑂 ~ 𝑑 β€– / β€– 𝑋 ⋆ β€– ) 7 / 12 ,

(163)

where 𝐢 26 ≲ 𝑐 πœ† βˆ’ 1 / 2 is some positive constant and ||| β‹… ||| can either be the Frobenius norm or the spectral norm.

The proof is identical to that of Lemma 26, thus is omitted here. Following the proof in Appendix D, these generalized results are sufficient to prove Lemma 38, thereby completing the proof of Phase II in the generalized setting.

G.3Generalization of Phase III

Our goal is to prove the following modification of Lemma 6 in the generalized setting.

Lemma 40 (Generalization of Lemma 6).

Under the same setting as Theorem 6, there exists some universal constant 𝑐 40 > 0 such that for any 𝑑 : 𝑑 3 ≀ 𝑑 ≀ 𝑇 max , with ||| β‹… ||| taken to be the operator norm βˆ₯ β‹… βˆ₯ or the Frobenius norm βˆ₯ β‹… βˆ₯ π–₯ , we have

||| 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ ||| ≀ ( 1 βˆ’ 𝑐 40 ​ πœ‚ ) 𝑑 βˆ’ 𝑑 3 ​ √ π‘Ÿ ⋆ ​ βˆ₯ 𝑀 ⋆ βˆ₯ + 𝑐 40 βˆ’ 1 ​ πœ… 4 ​ ||| π‘ˆ ⋆ ⊀ ​ 𝐸 ||| + 8 ​ 𝑐 40 βˆ’ 1 ​ βˆ₯ 𝑀 ⋆ βˆ₯ ​ max 𝑑 3 ≀ 𝜏 ≀ 𝑑 ( β€– 𝑂 ~ 𝜏 β€– / β€– 𝑋 ⋆ β€– ) 1 / 2 .

(164)

In particular, there exists an iteration number 𝑑 4 : 𝑑 3 ≀ 𝑑 4 ≀ 𝑑 3 + 𝑇 min / 16 such that for any 𝑑 ∈ [ 𝑑 4 , 𝑇 max ] , we have

β€– | 𝑋 𝑑 ​ 𝑋 𝑑 ⊀ βˆ’ 𝑀 ⋆ | β€– ≀ max ⁑ ( 𝛼 1 / 3 ​ β€– 𝑋 ⋆ β€– 5 / 3 , 𝑐 40 βˆ’ 1 ​ πœ… 4 ​ β€– | π‘ˆ ⋆ ⊀ ​ 𝐸 | β€– ) ≀ max ⁑ ( πœ€ ​ β€– 𝑀 ⋆ β€– , 𝑐 40 βˆ’ 1 ​ πœ… 4 ​ β€– | π‘ˆ ⋆ ⊀ ​ 𝐸 | β€– ) .

(165)

Here, πœ€ and 𝛼 are as stated in Theorem 2.

As in Appendix E, this will be accomplished by decomposing the error $|||X_t X_t^\top - M_\star|||$ using Lemma 27, and then controlling the components in the decomposition using Lemma 28. It is easy to check that the proof of Lemma 27 applies without modification to the generalized setting, and in fact works with the Frobenius norm replaced by any unitarily invariant norm. This leads to the following generalization.

Lemma 41 (Generalization of Lemma 27).

Under the same setting as Theorem 6, for all $t \ge t_3$, as long as $\|\Sigma_\star^{-1}(\widetilde{S}_t\widetilde{S}_t^\top - \Sigma_\star^2)\Sigma_\star^{-1}\| \le 1/10$, one has

$$|||X_t X_t^\top - M_\star||| \le 4\|X_\star\|^2\left(|||\Sigma_\star^{-1}(\widetilde{S}_t\widetilde{S}_t^\top - \Sigma_\star^2)\Sigma_\star^{-1}||| + \|X_\star\|^{-1}|||\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star|||\right) + 4\|X_\star\|\|\widetilde{O}_t\|.$$

It remains to prove the generalization of Lemma 28, stated below.

Lemma 42 (Generalization of Lemma 28).

Under the same setting as Theorem 6, there exists some universal constant $C_{42} > 0$ such that for any $t$ with $t_3 \le t \le T_{\max}$, with $|||\cdot|||$ taken to be the operator norm $\|\cdot\|$ or the Frobenius norm $\|\cdot\|_{\mathsf F}$, one has

$$\begin{aligned} &|||\Sigma_\star^{-1}(\widetilde{S}_{t+1}\widetilde{S}_{t+1}^\top - \Sigma_\star^2)\Sigma_\star^{-1}||| + \|X_\star\|^{-1}|||\widetilde{N}_{t+1}\widetilde{S}_{t+1}^{-1}\Sigma_\star||| \\ &\quad\le \left(1-\frac{\eta}{10}\right)\left(|||\Sigma_\star^{-1}(\widetilde{S}_t\widetilde{S}_t^\top - \Sigma_\star^2)\Sigma_\star^{-1}||| + \|X_\star\|^{-1}|||\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star|||\right) + \frac{\eta C_{42}\kappa^4}{\|X_\star\|^2}|||U_\star^\top E||| + \eta\left(\frac{\|\widetilde{O}_t\|}{\|X_\star\|}\right)^{1/2}. \end{aligned} \tag{166}$$

In particular, $\|\Sigma_\star^{-1}(\widetilde{S}_{t+1}\widetilde{S}_{t+1}^\top - \Sigma_\star^2)\Sigma_\star^{-1}\| \le 1/10$ for all $t$ with $t_3 \le t \le T_{\max}$.

We are now ready to prove Lemma 40 formally. As in the noiseless setting, we apply Lemma 42 repeatedly to obtain the following bound for all $t \in [t_3, T_{\max}]$:

$$\begin{aligned} &|||\Sigma_\star^{-1}(\widetilde{S}_t\widetilde{S}_t^\top - \Sigma_\star^2)\Sigma_\star^{-1}||| + \|X_\star\|^{-1}|||\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star||| \\ &\quad\le \left(1-\frac{\eta}{10}\right)^{t-t_3}\left(|||\Sigma_\star^{-1}(\widetilde{S}_{t_3}\widetilde{S}_{t_3}^\top - \Sigma_\star^2)\Sigma_\star^{-1}||| + \|X_\star\|^{-1}|||\widetilde{N}_{t_3}\widetilde{S}_{t_3}^{-1}\Sigma_\star|||\right) \\ &\qquad + \frac{10 C_{42}\kappa^4}{\|X_\star\|^2}|||U_\star^\top E||| + 10\max_{t_3\le\tau\le t}\left(\frac{\|\widetilde{O}_\tau\|}{\|X_\star\|}\right)^{1/2}, \end{aligned} \tag{167}$$
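The passage from the one-step contraction (166) to the accumulated bound (167) is a standard geometric-series argument: iterating $x_{t+1} \le (1-\eta/10)x_t + \eta b$ and using $\eta\sum_{k\ge 0}(1-\eta/10)^k = 10$ absorbs the per-step bias into a constant factor of $10$. A minimal numerical sketch with illustrative values (not the paper's constants):

```python
# Sanity check (illustrative constants): iterating
#   x_{t+1} <= (1 - eta/10) * x_t + eta * b
# yields x_t <= (1 - eta/10)^(t - t0) * x_{t0} + 10 * b,
# since eta * sum_k (1 - eta/10)^k = eta * 10 / eta = 10.
eta, b, x0, steps = 0.05, 0.3, 7.0, 400

x = x0
for _ in range(steps):
    x = (1 - eta / 10) * x + eta * b  # worst case: recursion holds with equality

closed_form = (1 - eta / 10) ** steps * x0 + 10 * b
assert x <= closed_form + 1e-9
```

The same bookkeeping, applied to the pair of error quantities tracked here, gives the factor-10 terms in (167).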

which motivates us to control the error at time $t_3$. With the same arguments as in the noiseless setting (cf. Equation (137) in Appendix E), we obtain

$$\|\Sigma_\star^{-1}(\widetilde{S}_{t_3}\widetilde{S}_{t_3}^\top - \Sigma_\star^2)\Sigma_\star^{-1}\|_{\mathsf F} + \|X_\star\|^{-1}\|\widetilde{N}_{t_3}\widetilde{S}_{t_3}^{-1}\Sigma_\star\|_{\mathsf F} \le \frac{r_\star}{5}.$$

Since the operator norm of a matrix never exceeds its Frobenius norm, the above inequality also holds with the Frobenius norm replaced by the operator norm. Recalling that in this lemma $|||\cdot|||$ is taken to be either the operator norm or the Frobenius norm, we have shown

$$|||\Sigma_\star^{-1}(\widetilde{S}_{t_3}\widetilde{S}_{t_3}^\top - \Sigma_\star^2)\Sigma_\star^{-1}||| + \|X_\star\|^{-1}|||\widetilde{N}_{t_3}\widetilde{S}_{t_3}^{-1}\Sigma_\star||| \le \frac{r_\star}{5}. \tag{168}$$
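The norm-comparison step above relies only on the general fact $\|A\| = \sigma_{\max}(A) \le \left(\sum_i \sigma_i(A)^2\right)^{1/2} = \|A\|_{\mathsf F}$; a quick numerical check:

```python
import numpy as np

# Sanity check: for any matrix A, the operator (spectral) norm is at most
# the Frobenius norm, since ||A||^2 = sigma_max^2 <= sum_i sigma_i^2 = ||A||_F^2.
rng = np.random.default_rng(0)
for _ in range(100):
    A = rng.standard_normal((6, 4))
    assert np.linalg.norm(A, 2) <= np.linalg.norm(A, 'fro') + 1e-12
```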

Combining the two inequalities (167) and (168) yields, for all $t \in [t_3, T_{\max}]$,

$$|||\Sigma_\star^{-1}(\widetilde{S}_t\widetilde{S}_t^\top - \Sigma_\star^2)\Sigma_\star^{-1}||| + \|X_\star\|^{-1}|||\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star||| \le \frac{1}{5}\left(1-\frac{\eta}{10}\right)^{t-t_3} r_\star + \frac{10 C_{42}\kappa^4}{\|X_\star\|^2}|||U_\star^\top E||| + 10\max_{t_3\le\tau\le t}\left(\frac{\|\widetilde{O}_\tau\|}{\|X_\star\|}\right)^{1/2}.$$

We can then invoke Lemma 41 to see that

$$\begin{aligned} |||X_t X_t^\top - M_\star||| &\le \frac{4\|X_\star\|^2}{5}\left(1-\frac{\eta}{10}\right)^{t-t_3} r_\star + 10 C_{42}\kappa^4\, |||U_\star^\top E||| + 40\|X_\star\|^2\max_{t_3\le\tau\le t}\left(\frac{\|\widetilde{O}_\tau\|}{\|X_\star\|}\right)^{1/2} + 4\|X_\star\|\|\widetilde{O}_t\| \\ &\le \left(1-\frac{\eta}{10}\right)^{t-t_3} r_\star\|M_\star\| + 10 C_{42}\kappa^4\, |||U_\star^\top E||| + 80\|M_\star\|\max_{t_3\le\tau\le t}\left(\frac{\|\widetilde{O}_\tau\|}{\|X_\star\|}\right)^{1/2}, \end{aligned}$$

where in the last line we use $\|\widetilde{O}_t\| \le \|X_\star\|$, an implication of (24), which holds in the generalized setting by Lemma 31. To see this, note that the assumption (12c) implies $\alpha \le \|X_\star\|$ as long as $\eta \le 1/2$ and $C_\alpha \ge 4$, which in turn implies $\|\widetilde{O}_t\| \le \alpha^{2/3}\|X_\star\|^{1/3} \le \|X_\star\|$. This completes the proof of the first part of Lemma 40 with $c_{40} = 1/(10 C_{42})$.

For the second part of Lemma 40, notice that

$$8 c_{40}^{-1} \max_{t_3 \le \tau \le T_{\max}}\left(\frac{\|\widetilde{O}_\tau\|}{\|X_\star\|}\right)^{1/2} \le \frac{1}{2}\left(\frac{\alpha}{\|X_\star\|}\right)^{1/3}$$

by (24), thus

$$|||X_t X_t^\top - M_\star||| \le (1 - c_{40}\eta)^{t-t_3}\, r_\star\|M_\star\| + c_{40}^{-1}\kappa^4\, |||U_\star^\top E||| + \frac{1}{2}\left(\frac{\alpha}{\|X_\star\|}\right)^{1/3}\|M_\star\|$$

for $t_3 \le t \le T_{\max}$. There exists some iteration number $t_4$ with $t_3 \le t_4 \le t_3 + \frac{2}{c_{40}\eta}\log(\|X_\star\|/\alpha) \le t_3 + T_{\min}/16$ such that

$$(1 - c_{40}\eta)^{t_4 - t_3} \le \left(\frac{\alpha}{\|X_\star\|}\right)^2 \le \frac{1}{2 r_\star}\left(\frac{\alpha}{\|X_\star\|}\right)^{1/3},$$

where the last inequality is due to (12c). It is then clear that $t_4$ has the property claimed in the lemma.
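The existence of such a $t_4$ reduces to elementary arithmetic: since $1-x \le e^{-x}$, running $\lceil \frac{2}{c_{40}\eta}\log(\|X_\star\|/\alpha)\rceil$ iterations drives the contraction factor below $(\alpha/\|X_\star\|)^2$. A sketch with illustrative values (the constants are placeholders, not the paper's):

```python
import math

# Sanity check (illustrative values): with
#   k = ceil(2 * log(normX / alpha) / (c40 * eta))
# iterations, the contraction factor satisfies
#   (1 - c40*eta)^k <= (alpha/normX)^2,
# using 1 - x <= exp(-x).
c40, eta, normX, alpha = 0.02, 0.1, 5.0, 1e-3

k = math.ceil(2 * math.log(normX / alpha) / (c40 * eta))
assert (1 - c40 * eta) ** k <= (alpha / normX) ** 2
```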

G.3.1 Proof of Lemma 42

The idea is the same as in the proof of Lemma 28. Fix any $t \in [t_3, T_{\max}]$. If (166) held for all $\tau \in [t_3, t]$, then, taking into account that $\|\widetilde{O}_\tau\|/\|X_\star\| \le 1/10000$ for all $\tau \in [t_3, T_{\max}]$ by (24) (which still holds in the generalized setting according to Lemma 32), we could show by induction that $\|\Gamma_\tau\| \le 1/10$ for all $\tau \in [t_3, t]$. Thus it suffices to assume $\|\Gamma_t\| \le 1/10$ and prove (166).

Apply Lemma 39 to obtain

$$|||\Gamma_{t+1}||| \le (1-\eta)\,|||\Gamma_t||| + \frac{\eta C_{26}\kappa^4}{\|X_\star\|^2}|||U_\star^\top \Delta_t'||| + \frac{\eta}{16}\|X_\star\|^{-1}|||\widetilde{N}_t \widetilde{S}_t^{-1}\Sigma_\star||| + \eta\left(\frac{\|\widetilde{O}_t\|}{\|X_\star\|}\right)^{7/12}. \tag{169}$$

In addition, Lemma 37 tells us that

$$|||\widetilde{N}_{t+1}\widetilde{S}_{t+1}^{-1}\Sigma_\star||| \le \left(1-\frac{\eta}{3(\|Z_t\|+\eta)}\right)|||\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star||| + \frac{\eta C_{24}\kappa^4}{c_\lambda\|X_\star\|}|||U_\star^\top\Delta_t'||| + \eta\left(\frac{\|\widetilde{O}_t\|}{\sigma_{\min}(\widetilde{S}_t)}\right)^{2/3}\|X_\star\|,$$

where $Z_t := \Sigma_\star^{-1}(\widetilde{S}_t\widetilde{S}_t^\top + \lambda I)\Sigma_\star^{-1}$. It is easy to check that $\|Z_t\| \le 1 + \|\Gamma_t\| + c_\lambda \le 2$ since $\|\Gamma_t\| \le 1/10$ and $c_\lambda$ is sufficiently small. In addition, one has $\sigma_{\min}(\widetilde{S}_t)^2 \ge (1-\|\Gamma_t\|)\sigma_{\min}(X_\star)^2$ and $\|\widetilde{O}_t\|/\sigma_{\min}(\widetilde{S}_t) \le (2\kappa)^{-24}$. Combining these relationships, we arrive at

$$|||\widetilde{N}_{t+1}\widetilde{S}_{t+1}^{-1}\Sigma_\star||| \le \left(1-\frac{\eta}{8}\right)|||\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star||| + \frac{\eta C_{24}\kappa^2}{c_\lambda\|X_\star\|}|||U_\star^\top\Delta_t'||| + \frac{\eta}{2}\|X_\star\|\left(\frac{\|\widetilde{O}_t\|}{\|X_\star\|}\right)^{7/12}. \tag{170}$$
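To see the bound $\|Z_t\| \le 1 + \|\Gamma_t\| + c_\lambda$ concretely: writing $Z_t = I + \Gamma_t + \lambda\Sigma_\star^{-2}$ and assuming the damping level satisfies $\lambda \le c_\lambda\,\sigma_{\min}(X_\star)^2$ (an assumption made here to match the role of $c_\lambda$; the paper's precise choice of $\lambda$ governs this), the triangle inequality gives the claim, since $\|\lambda\Sigma_\star^{-2}\| = \lambda/\sigma_{\min}(X_\star)^2 \le c_\lambda$. A numerical sketch:

```python
import numpy as np

# Sanity check of ||Z|| <= 1 + ||Gamma|| + c_lam, assuming the damping level
# lam = c_lam * sigma_min^2 (illustrative setup).  Writing
#   Z = Sigma^{-1}(S S^T + lam I)Sigma^{-1} = I + Gamma + lam * Sigma^{-2}
# with Gamma = Sigma^{-1} S S^T Sigma^{-1} - I, the triangle inequality gives
# the bound, since ||lam * Sigma^{-2}|| = lam / sigma_min^2 = c_lam.
rng = np.random.default_rng(1)
r, c_lam = 4, 0.01
sigma = np.sort(rng.uniform(1.0, 3.0, size=r))[::-1]  # singular values
Sigma = np.diag(sigma)
S = rng.standard_normal((r, r))
lam = c_lam * sigma.min() ** 2

Sinv = np.linalg.inv(Sigma)
Z = Sinv @ (S @ S.T + lam * np.eye(r)) @ Sinv
Gamma = Sinv @ S @ S.T @ Sinv - np.eye(r)
assert np.linalg.norm(Z, 2) <= 1 + np.linalg.norm(Gamma, 2) + c_lam + 1e-12
```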

Summing up (169) and (170), we obtain

$$\begin{aligned} &|||\Gamma_{t+1}||| + \|X_\star\|^{-1}|||\widetilde{N}_{t+1}\widetilde{S}_{t+1}^{-1}\Sigma_\star||| \\ &\quad\le \left(1-\frac{\eta}{8}\right)\left(|||\Gamma_t||| + \|X_\star\|^{-1}|||\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star|||\right) + \frac{2\eta(C_{24}+C_{26}c_\lambda)\kappa^6}{c_\lambda\|X_\star\|^2}|||U_\star^\top\Delta_t'||| + 2\eta\left(\frac{\|\widetilde{O}_t\|}{\|X_\star\|}\right)^{7/12} \\ &\quad\le \left(1-\frac{\eta}{8}\right)\left(|||\Gamma_t||| + \|X_\star\|^{-1}|||\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star|||\right) + \frac{2\eta(C_{24}+C_{26}c_\lambda)\kappa^8}{c_\lambda\|X_\star\|^2}\left(|||U_\star^\top\Delta_t||| + |||U_\star^\top E|||\right) + 2\eta\left(\frac{\|\widetilde{O}_t\|}{\|X_\star\|}\right)^{7/12}. \end{aligned} \tag{171}$$

This is close to our desired conclusion, but we still need to eliminate $|||U_\star^\top\Delta_t|||$. To this end we shall need the following lemma.

Lemma 43.

If $|||\cdot|||$ is taken to be the operator norm $\|\cdot\|$ or the Frobenius norm $\|\cdot\|_{\mathsf F}$, then under the same setting as Lemma 42, one has

$$|||U_\star^\top\Delta_t||| \le 32 c_\delta \kappa^{-6}\|X_\star\|^2\left(|||\Gamma_t||| + \|X_\star\|^{-1}|||\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star||| + \left(\frac{\|\widetilde{O}_t\|}{\|X_\star\|}\right)^{2/3}\right). \tag{172}$$

Returning to the proof of Lemma 42, the conclusion follows from applying the above lemma to the term $|||U_\star^\top\Delta_t|||$ in (171), noting that $c_\delta$ can be chosen sufficiently small such that

$$\frac{2(C_{24}+C_{26}c_\lambda)}{c_\lambda}\cdot 32 c_\delta < \frac{1}{16},$$

and that $\|\widetilde{O}_t\|/\|X_\star\|$ is sufficiently small due to (24), which still holds in the generalized setting by virtue of Lemma 32.
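The required smallness of $c_\delta$ is elementary: the left-hand side of the displayed inequality is linear in $c_\delta$, so any $c_\delta$ below a threshold determined by $C_{24}$, $C_{26}$, and $c_\lambda$ works. A sketch with placeholder constants (not the paper's actual values):

```python
# Sanity check with illustrative constants (C24, C26, c_lam are placeholders,
# not the paper's values): c_delta can always be taken small enough that
#   2 * (C24 + C26 * c_lam) / c_lam * 32 * c_delta < 1/16.
C24, C26, c_lam = 3.0, 5.0, 0.01
coef = 2 * (C24 + C26 * c_lam) / c_lam * 32
c_delta = 1 / (32 * coef)  # any c_delta < 1/(16*coef) works; take half of that
assert coef * c_delta < 1 / 16
```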

G.3.2 Proof of Lemma 43

Observe that $U_\star^\top\Delta_t$ has rank at most $r_\star$, thus

$$\|U_\star^\top\Delta_t\| \le \|\Delta_t\|, \qquad \|U_\star^\top\Delta_t\|_{\mathsf F} \le \sqrt{r_\star}\,\|\Delta_t\|.$$

On the other hand, from Lemma 12, we know

$$\begin{aligned} \|\Delta_t\| &\le 8\delta\left(\|\widetilde{S}_t\widetilde{S}_t^\top - \Sigma_\star^2\|_{\mathsf F} + \|\widetilde{S}_t\|\|\widetilde{N}_t\|_{\mathsf F} + n\|\widetilde{O}_t\|^2\right) \\ &\le 16 c_\delta r_\star^{-1/2}\kappa^{-4}\|X_\star\|^2\left(\|\Gamma_t\|_{\mathsf F} + \|X_\star\|^{-1}\|\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star\|_{\mathsf F} + \left(\frac{\|\widetilde{O}_t\|}{\|X_\star\|}\right)^{2/3}\right) \\ &\le 32 c_\delta \kappa^{-4}\|X_\star\|^2\left(\|\Gamma_t\| + \|X_\star\|^{-1}\|\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star\| + \left(\frac{\|\widetilde{O}_t\|}{\|X_\star\|}\right)^{2/3}\right), \end{aligned}$$

where the penultimate line follows from (10) and from controlling the sum inside the brackets in a similar way to (138), and the last line follows from $\Gamma_t := \Sigma_\star^{-1}\widetilde{S}_t\widetilde{S}_t^\top\Sigma_\star^{-1} - I$ being a matrix of rank at most $r_\star + 1$, which implies $\|\Gamma_t\|_{\mathsf F} \le \sqrt{r_\star+1}\,\|\Gamma_t\|$, and similarly $\|\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star\|_{\mathsf F} \le \sqrt{r_\star}\,\|\widetilde{N}_t\widetilde{S}_t^{-1}\Sigma_\star\|$. The conclusion then follows from bounding $\|U_\star^\top\Delta_t\|$ and $\|U_\star^\top\Delta_t\|_{\mathsf F}$ separately. We have

β€– π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€– ≀ β€– Ξ” 𝑑 β€– ≀ 32 ​ 𝑐 𝛿 ​ πœ… βˆ’ 4 ​ β€– 𝑋 ⋆ β€– 2 ​ ( β€– Ξ“ 𝑑 β€– + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– + ( β€– 𝑂 ~ 𝑑 β€– β€– 𝑋 ⋆ β€– ) 2 / 3 ) ,

and

β€– π‘ˆ ⋆ ⊀ ​ Ξ” 𝑑 β€– π–₯
≀ π‘Ÿ ⋆ ​ β€– Ξ” 𝑑 β€–

≀ π‘Ÿ ⋆ β‹… 16 ​ 𝑐 𝛿 ​ π‘Ÿ ⋆ βˆ’ 1 / 2 ​ πœ… βˆ’ 4 ​ β€– 𝑋 ⋆ β€– 2 ​ ( β€– Ξ“ 𝑑 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ + ( β€– 𝑂 ~ 𝑑 β€– β€– 𝑋 ⋆ β€– ) 2 / 3 )

16 ​ 𝑐 𝛿 ​ πœ… βˆ’ 4 ​ β€– 𝑋 ⋆ β€– 2 ​ ( β€– Ξ“ 𝑑 β€– π–₯ + β€– 𝑋 ⋆ β€– βˆ’ 1 ​ β€– 𝑁 ~ 𝑑 ​ 𝑆 ~ 𝑑 βˆ’ 1 ​ Ξ£ ⋆ β€– π–₯ + ( β€– 𝑂 ~ 𝑑 β€– β€– 𝑋 ⋆ β€– ) 2 / 3 ) .

Combining the above two inequalities proves that (172) holds with $|||\cdot|||$ taken to be either the operator norm or the Frobenius norm.

G.4 Proof of Theorem 6

Combining Lemma 32, Lemma 38, and Lemma 40, the final $t_4$ given by Lemma 40 is no more than $4 \times T_{\min}/16 \le T_{\min}/2$; thus (165) holds for all $t \in [T_{\min}/2, T_{\max}]$, and in particular for some $T \le T_{\min}$. Plugging into (165) the bound on $|||U_\star^\top E|||$ given by Lemma 29, with $|||\cdot|||$ taken to be the operator norm $\|\cdot\|$ or the Frobenius norm $\|\cdot\|_{\mathsf F}$, we obtain the desired conclusion.
