Title: Graph Neural Preconditioners for Iterative Solutions of Sparse Linear Systems
URL Source: https://arxiv.org/html/2406.00809
Published Time: Tue, 28 Jan 2025 02:05:01 GMT
Abstract
Preconditioning is at the heart of iterative solutions of large, sparse linear systems of equations in scientific disciplines. Several algebraic approaches, which access no information beyond the matrix itself, are widely studied and used, but ill-conditioned matrices remain very challenging. We take a machine learning approach and propose using graph neural networks as a general-purpose preconditioner. They show attractive performance for many problems and can be used when the mainstream preconditioners perform poorly. Empirical evaluation on over 800 matrices suggests that the construction time of these graph neural preconditioners (GNPs) is more predictable and can be much shorter than that of other widely used ones, such as ILU and AMG, while their execution time is shorter than that of a Krylov method used as the preconditioner, such as in inner-outer GMRES. GNPs have a strong potential for solving large-scale, challenging algebraic problems arising not only from partial differential equations, but also from economics, statistics, graphs, and optimization, to name a few.
1 Introduction
Iterative methods are commonly used to solve large, sparse linear systems of equations $\mathbf{A}\mathbf{x}=\mathbf{b}$. These methods typically build a Krylov subspace, onto which the original system is projected, such that an approximate solution within the subspace is extracted. For example, GMRES (Saad & Schultz, 1986), one of the most popular methods in practice, builds an orthonormal basis $\{\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_m\}$ of the $m$-dimensional Krylov subspace by using the Arnoldi process and defines the approximate solution $\mathbf{x}_m\in\mathbf{x}_0+\operatorname{span}(\{\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_m\})$, such that the residual norm $\|\mathbf{b}-\mathbf{A}\mathbf{x}_m\|_2$ is minimized.
The effectiveness of Krylov methods heavily depends on the conditioning of the matrix $\mathbf{A}$. Hence, designing a good preconditioner is crucial in practice. In some cases (e.g., solving partial differential equations, PDEs), the problem structure provides additional information that aids the development of a preconditioner applicable to this problem or a class of similar problems. For example, multigrid preconditioners are particularly effective for Poisson-like problems with a mesh. In other cases, little information is known beyond the matrix itself. An example is the SuiteSparse matrix collection (Davis & Hu, 2011), which contains thousands of sparse matrices for benchmarking numerical linear algebra algorithms. In these cases, a general-purpose (also called "algebraic") preconditioner is desirable; examples include ILU (Saad, 1994), approximate inverse (Chow & Saad, 1998), and algebraic multigrid (AMG) (Ruge & Stüben, 1987). However, it remains very challenging to design one that performs well universally.
In this work, we propose to use neural networks as a general-purpose preconditioner. Specifically, we consider preconditioned Krylov methods for solving problems in the form $\mathbf{A}\mathbf{M}\mathbf{u}=\mathbf{b}$, where $\mathbf{M}\approx\mathbf{A}^{-1}$ is the preconditioner, $\mathbf{u}$ is the new unknown, and $\mathbf{x}=\mathbf{M}\mathbf{u}$ is the recovered solution. We design a graph neural network (GNN) (Zhou et al., 2020; Wu et al., 2021) to assume the role of $\mathbf{M}$ and develop a training method to learn $\mathbf{M}$ from select $(\mathbf{b},\mathbf{x})$ pairs. This proposal is inspired by the universal approximation property of neural networks (Hornik et al., 1989) and encouraged by their widespread success in artificial intelligence (Bommasani et al., 2021).
Because a neural network is nonlinear in nature, our preconditioner $\mathbf{M}$ is no longer a linear operator. We cannot build a Krylov subspace for $\mathbf{A}\mathbf{M}$, since a Krylov subspace is defined only for linear operators. Hence, we focus on flexible variants of the preconditioned Krylov methods instead. Specifically, we use flexible GMRES (FGMRES) (Saad, 1993), which allows $\mathbf{M}$ to be different in every Arnoldi step. In this framework, all that matters is the subspace from which the approximate solution is extracted. A nonlinear operator $\mathbf{M}$ can also be used to build this subspace, and hence the neural preconditioner is applicable to FGMRES.
We use a graph neural network because of the natural connection between a sparse matrix and the graph adjacency matrix, similar to how AMG interprets the coefficient matrix with a graph. Many GNN architectures access the graph structure through $\mathbf{A}$-multiplications, and they become polynomials of $\mathbf{A}$ when the nonlinearity is omitted (Wu et al., 2019; Chen et al., 2020). This observation connects a GNN as an operator with the polynomial approximation of $\mathbf{A}^{-1}$, the latter of which is an essential ingredient in the convergence theory of Krylov methods. Interestingly, a side benefit of this access pattern of $\mathbf{A}$ is that the GNN is applicable not only to sparse matrices, but also to structured matrices that admit fast $\mathbf{A}$-multiplications (such as hierarchical matrices (Hackbusch, 1999; Hackbusch & Börm, 2002; Chen & Stein, 2023), Toeplitz matrices (Chan & Jin, 2007; Chen et al., 2014), unassembled finite-element matrices (Ciarlet, 2002), and Gauss-Newton matrices (Liu et al., 2023)).
The proposed graph neural preconditioner (GNP) has a few advantages. It performs comparatively much better for ill-conditioned problems, by learning the matrix inverse from data; this mitigates the restrictive modeling of the sparsity pattern (as in ILU and approximate inverse), the insufficient quality of polynomial approximation (as in using Krylov methods as the preconditioner), and the challenge of smoothing over a non-geometric mesh (as in AMG). Additionally, empirical evaluation suggests that the construction time of GNP, inherited from the predictability of neural network training, is more predictable than that of ILU and AMG, and that the execution time of GNP is much shorter than that of GMRES used as a preconditioner, which may be bottlenecked by the orthogonalization in the Arnoldi process.
1.1 Contributions
Our work has a few technical contributions. First, we offer a convergence analysis for FGMRES. Although this method is well known and used in practice, little theoretical work has been done, in part because the tools for analyzing Krylov subspaces are not readily applicable (the subspace is no longer Krylov and we have no isomorphism with the space of polynomials). Instead, our analysis is a posteriori, based on the computed subspace.
Second, we propose an effective approach to training the neural preconditioner; in particular, training data generation. While it is straightforward to train the neural network in an online-learning manner (by preparing streaming, random input-output pairs in the form of $(\mathbf{b},\mathbf{x})$), it leaves open the question of what randomness best suits the purpose. We consider the bottom eigen-subspace of $\mathbf{A}$ when defining the sampling distribution and show the effectiveness of this approach.
Third, we develop a scale-equivariant GNN as the preconditioner. Because of the way the training data are generated, the GNN inputs $\mathbf{b}$ can have vastly different scales; meanwhile, the ground-truth operator $\mathbf{A}^{-1}$ is equivariant to the scaling of the inputs. Hence, to facilitate training, we design the GNN with an inductive bias that obeys scale-equivariance.
Fourth, we adopt a novel evaluation protocol for general-purpose preconditioners. The common practice evaluates a preconditioner on a limited number of problems; on many occasions, these problems come from a certain type of PDE or application, which constitutes only a portion of the use cases. To broadly evaluate a new kind of general-purpose preconditioner and understand its strengths and weaknesses, we propose to test it on a large number of matrices commonly used by the community. To this end, we use the SuiteSparse collection and perform evaluation on a substantial portion of it (all non-SPD, real matrices falling inside a size interval; over 800 matrices from 50 application areas in this work). To streamline the evaluation, we define evaluation criteria, limit the tuning of hyperparameters, and study statistics over the distribution of matrices.
1.2 Related Work
It is important to put the proposed GNP in context. The idea of using a neural network to construct a preconditioner emerged recently. The relevant approaches learn the nonzero entries of the incomplete factors (Li et al., 2023; Häusner et al., 2023) or their correction (Trifonov et al., 2024), or of the approximate inverse (Bånkestad et al., 2024). In these approaches, the neural network does not directly approximate the mapping from $\mathbf{b}$ to $\mathbf{x}$ as ours does; rather, it is used to fill in the nonzeros of a predefined sparsity structure. A drawback of these approaches is that the predefined sparsity imposes a nonzero lower bound on the approximation error of $\mathbf{A}^{-1}$, which (and whose factors) is generally not sparse. Moreover, although the neural network can be trained on a PDE problem (by varying the coefficients, grid sizes, or boundary conditions) and generalizes well to the same problem, it is unlikely that a single network can work well for all problems and matrices encountered in practice.
Approaches wherein the neural network directly approximates the matrix inverse are more akin to physics-informed neural networks (PINNs) (Raissi et al., 2019) and neural operators (NOs) (Rudikov et al., 2024; Kovachki et al., 2022; Li et al., 2020; 2021; Lu et al., 2021). However, the learning of PINNs requires a PDE whose residual forms part of the training loss, whereas NOs consider infinite-dimensional function spaces, with PDEs being the application. In our case of a general-purpose preconditioner, there is no underlying PDE; and not every matrix problem relates to function spaces with additional properties to exploit (e.g., spatial coordinates, smoothness, and decay of the Green's function). For example, one may be interested in finding the commute times between a node and all other nodes in a graph; this problem can be solved by solving a linear system with respect to the graph Laplacian matrix. The only information we exploit is the matrix itself.
2 Method
Let $\mathbf{A}\in\mathbb{R}^{n\times n}$. From now on, $\mathbf{M}$ is an operator $\mathbb{R}^n\to\mathbb{R}^n$ and we write $\mathbf{M}(\mathbf{v})$ to mean applying the operator to $\mathbf{v}\in\mathbb{R}^n$. This notation includes the special case $\mathbf{M}(\mathbf{v})=\mathbf{M}\mathbf{v}$ when $\mathbf{M}$ is a matrix.
2.1 Flexible GMRES
A standard preconditioned Krylov solver solves the linear system $\mathbf{A}\mathbf{M}\mathbf{u}=\mathbf{b}$ by viewing $\mathbf{A}\mathbf{M}$ as the new matrix and building a Krylov subspace, from which an approximate solution $\mathbf{u}_m$ is extracted and $\mathbf{x}_m$ is recovered from $\mathbf{M}\mathbf{u}_m$. It is important to note that a subspace, by definition, is linear. However, applying a nonlinear operator $\mathbf{M}$ to a linear vector space does not result in a vector space spanned by the mapped basis. Hence, the standard convergence theory breaks down for $\mathbf{x}_m=\mathbf{M}(\mathbf{u}_m)$.
We resort to flexible variants of Krylov solvers (Saad, 1993; Notay, 2000; Chen et al., 2016) and focus on flexible GMRES for simplicity. FGMRES was designed so that the matrix $\mathbf{M}$ can change in each Arnoldi step; we extend it to a fixed, but nonlinear, operator $\mathbf{M}$.
Algorithm 1 in Section B summarizes the solver. The algorithm assumes a restart length $m$. Starting from an initial guess $\mathbf{x}_0$ with residual vector $\mathbf{r}_0$ and 2-norm $\beta$, FGMRES runs $m$ Arnoldi steps, resulting in the relation
$$\mathbf{A}\mathbf{Z}_m=\mathbf{V}_m\mathbf{H}_m+h_{m+1,m}\mathbf{v}_{m+1}\mathbf{e}_m^{\top}=\mathbf{V}_{m+1}\overline{\mathbf{H}}_m, \qquad (1)$$
where $\mathbf{V}_m=[\mathbf{v}_1,\ldots,\mathbf{v}_m]\in\mathbb{R}^{n\times m}$ is orthonormal, $\mathbf{Z}_m=[\mathbf{z}_1,\ldots,\mathbf{z}_m]\in\mathbb{R}^{n\times m}$ contains the $m$ columns $\mathbf{z}_j=\mathbf{M}(\mathbf{v}_j)$, $\mathbf{H}_m=[h_{ij}]\in\mathbb{R}^{m\times m}$ is upper Hessenberg, and $\overline{\mathbf{H}}_m=[h_{ij}]\in\mathbb{R}^{(m+1)\times m}$ extends $\mathbf{H}_m$ with an additional row at the bottom. The approximate solution $\mathbf{x}_m$ is computed in the form $\mathbf{x}_m\in\mathbf{x}_0+\operatorname{span}(\{\mathbf{z}_1,\ldots,\mathbf{z}_m\})=\{\mathbf{x}_0+\mathbf{Z}_m\mathbf{y}\}$, such that the residual norm $\|\mathbf{b}-\mathbf{A}\mathbf{x}_m\|_2=\|\beta\mathbf{e}_1-\overline{\mathbf{H}}_m\mathbf{y}\|_2$ is minimized. If the residual norm is sufficiently small, the solution is accepted; otherwise, $\mathbf{x}_m$ becomes the initial guess of the next Arnoldi cycle and the process repeats.
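To make the procedure concrete, the following minimal NumPy sketch (our own simplification of Algorithm 1; it omits breakdown handling and convergence checks) shows one FGMRES restart cycle in which the preconditioner `M` may be an arbitrary, possibly nonlinear, operator such as a neural network:

```python
import numpy as np

def fgmres_cycle(A, M, b, x0, m=10):
    """One FGMRES restart cycle with a (possibly nonlinear) preconditioner M."""
    n = b.shape[0]
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    V = np.zeros((n, m + 1))
    Z = np.zeros((n, m))
    H = np.zeros((m + 1, m))
    V[:, 0] = r0 / beta
    for j in range(m):
        Z[:, j] = M(V[:, j])            # z_j = M(v_j)
        w = A @ Z[:, j]
        for i in range(j + 1):          # modified Gram-Schmidt orthogonalization
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)  # assumes no breakdown (nonzero norm)
        V[:, j + 1] = w / H[j + 1, j]
    # Minimize ||beta*e_1 - Hbar_m y||_2 and form x_m = x_0 + Z_m y.
    rhs = np.zeros(m + 1)
    rhs[0] = beta
    y, *_ = np.linalg.lstsq(H, rhs, rcond=None)
    return x0 + Z @ y
```

A plain (unpreconditioned) run corresponds to `M = lambda v: v`, in which case `Z` coincides with the Krylov basis `V[:, :m]`.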
We provide a convergence analysis below. All proofs are in Section C.
Theorem 1.
Assume that FGMRES is run without restart and without breakdown. We use the tilde notation to denote the counterpart quantities when FGMRES is run with a fixed preconditioner matrix $\widetilde{\mathbf{M}}$. For any $\widetilde{\mathbf{M}}$ such that $\mathbf{A}\widetilde{\mathbf{M}}$ can be diagonalized, as in $\mathbf{X}^{-1}\mathbf{A}\widetilde{\mathbf{M}}\mathbf{X}=\bm{\Lambda}=\operatorname{diag}(\lambda_1,\ldots,\lambda_n)$, the residual $\mathbf{r}_m=\mathbf{b}-\mathbf{A}\mathbf{x}_m$ satisfies
$$\|\mathbf{r}_m\|_2 \leq \kappa_2(\mathbf{X})\,\epsilon^{(m)}(\bm{\Lambda})\,\|\mathbf{r}_0\|_2 + \|\mathbf{Q}_m\mathbf{Q}_m^{\top}-\widetilde{\mathbf{Q}}_m\widetilde{\mathbf{Q}}_m^{\top}\|_2\,\|\mathbf{r}_0\|_2, \qquad (2)$$

where $\kappa_2$ denotes the 2-norm condition number, $\mathbf{Q}_m$ (resp. $\widetilde{\mathbf{Q}}_m$) is the thin-Q factor of $\mathbf{A}\mathbf{Z}_m$ (resp. $\mathbf{A}\widetilde{\mathbf{Z}}_m$), $\mathbb{P}_m$ denotes the space of degree-$m$ polynomials, and

$$\epsilon^{(m)}(\bm{\Lambda})=\min_{p\in\mathbb{P}_m,\; p(0)=1}\ \max_{i=1,\ldots,n}|p(\lambda_i)|.$$
Let us consider the two terms on the right-hand side of (2). The first term, $\kappa_2(\mathbf{X})\,\epsilon^{(m)}(\bm{\Lambda})\,\|\mathbf{r}_0\|_2$, is the standard convergence result of GMRES applied to the matrix $\mathbf{A}\widetilde{\mathbf{M}}$. The factor $\epsilon^{(m)}(\bm{\Lambda})$ is approximately exponential in $m$ when the eigenvalues $\lambda_i$ are located in an ellipse that excludes the origin; see Saad (2003, Corollary 6.33). Furthermore, when the eigenvalues are closer to each other than to the origin, the base of the exponential is close to zero, resulting in rapid convergence. Meanwhile, when $\mathbf{A}\widetilde{\mathbf{M}}$ is close to normal, $\kappa_2(\mathbf{X})$ is close to 1.
What $\mathbf{M}$ affects is the second term, $\|\mathbf{Q}_m\mathbf{Q}_m^{\top}-\widetilde{\mathbf{Q}}_m\widetilde{\mathbf{Q}}_m^{\top}\|_2\,\|\mathbf{r}_0\|_2$. This term does not show convergence on the surface, but one can always find an $\widetilde{\mathbf{M}}$ such that it vanishes, assuming no breakdown. When $\mathbf{M}$ is close to $\mathbf{A}^{-1}$, $\mathbf{A}\mathbf{Z}_m$ is a small perturbation of $\mathbf{V}_m$ by definition. Then, (1) suggests that $\overline{\mathbf{H}}_m$ is close to the identity matrix for all $m$, and so is $\mathbf{H}_n$.
Corollary 2.
Assume that FGMRES is run without restart and without breakdown. On completion, let $\mathbf{H}_n$ be diagonalizable, as in $\mathbf{Y}^{-1}\mathbf{H}_n\mathbf{Y}=\bm{\Sigma}=\operatorname{diag}(\sigma_1,\ldots,\sigma_n)$. Then, the residual norm satisfies

$$\|\mathbf{r}_m\|_2 \leq \kappa_2(\mathbf{Y})\,\epsilon^{(m)}(\bm{\Sigma})\,\|\mathbf{r}_0\|_2 \qquad \text{for all } m. \qquad (3)$$
See Section D for an approximately exponential convergence of $\|\mathbf{r}_m\|_2$ due to $\epsilon^{(m)}(\bm{\Sigma})$.
Note that while our results assume infinite precision, another analysis (Arioli & Duff, 2009) was conducted in finite-precision arithmetic. That analysis uses the machine precision to bound the residual norm for a sufficiently large $m$, asserting backward stability of FGMRES; in contrast, we obtain the convergence rate under infinite precision.
2.2 Training Data Generation
We generate a set of $(\mathbf{b},\mathbf{x})$ pairs to train $\mathbf{M}:\mathbf{b}\to\mathbf{x}$. Unlike neural operators, where data generation is costly (e.g., requiring solving PDEs), which limits the training set size, for linear systems we can sample $\mathbf{x}$ from a certain distribution and obtain $\mathbf{b}=\mathbf{A}\mathbf{x}$ with negligible cost, which creates a streaming training set of $(\mathbf{b},\mathbf{x})$ pairs without size limit.¹

¹ Neural operators learn the mapping from the boundary condition to the solution. Strictly speaking, while $\mathbf{x}$ is the solution, $\mathbf{b}$ is not the boundary condition.
There is no free lunch. One may sample $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_n)$, which leads to $\mathbf{b}\sim\mathcal{N}(\mathbf{0},\mathbf{A}\mathbf{A}^{\top})$. This distribution is skewed toward the dominant eigen-subspace of $\mathbf{A}\mathbf{A}^{\top}$. Hence, using samples from it for training may result in poor performance of the preconditioner when applied to inputs lying close to the bottom eigen-subspace. One may alternatively want the training data $\mathbf{b}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_n)$, which covers all possibilities of the preconditioner input uniformly on a sphere. However, in this case, $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{A}^{-1}\mathbf{A}^{-\top})$ is also spectrally skewed, causing difficulty in network training.
We resort to the Arnoldi process to obtain an approximation of $\mathbf{A}^{-1}\mathbf{A}^{-\top}$, which strikes a balance. Specifically, we run the Arnoldi process for $m$ steps without a preconditioner, which results in a simplification of (1):
$$\mathbf{A}\mathbf{V}_m=\mathbf{V}_m\mathbf{H}_m+h_{m+1,m}\mathbf{v}_{m+1}\mathbf{e}_m^{\top}=\mathbf{V}_{m+1}\overline{\mathbf{H}}_m. \qquad (4)$$
Let the singular value decomposition of $\overline{\mathbf{H}}_m$ be $\mathbf{W}_m\mathbf{S}_m\mathbf{Z}_m^{\top}$, where $\mathbf{W}_m\in\mathbb{R}^{(m+1)\times m}$ and $\mathbf{Z}_m\in\mathbb{R}^{m\times m}$ are orthonormal and $\mathbf{S}_m\in\mathbb{R}^{m\times m}$ is diagonal. We define
$$\mathbf{x}=\mathbf{V}_m\mathbf{Z}_m\mathbf{S}_m^{-1}\bm{\epsilon},\qquad \bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_m). \qquad (5)$$
Then, with simple algebra we obtain that
$$\mathbf{x}\in\operatorname{range}(\mathbf{V}_m),\qquad \mathbf{x}\sim\mathcal{N}(\mathbf{0},\bm{\Sigma}_m^{\mathbf{x}}),\qquad \bm{\Sigma}_m^{\mathbf{x}}=(\mathbf{V}_m\overline{\mathbf{H}}_m^{+})(\mathbf{V}_m\overline{\mathbf{H}}_m^{+})^{\top},$$

$$\mathbf{b}\in\operatorname{range}(\mathbf{V}_{m+1}),\qquad \mathbf{b}\sim\mathcal{N}(\mathbf{0},\bm{\Sigma}_m^{\mathbf{b}}),\qquad \bm{\Sigma}_m^{\mathbf{b}}=(\mathbf{V}_{m+1}\mathbf{W}_m)(\mathbf{V}_{m+1}\mathbf{W}_m)^{\top}.$$
The covariance of $\mathbf{b}$, $\bm{\Sigma}_m^{\mathbf{b}}$, is a projector, because $\mathbf{V}_{m+1}$ and $\mathbf{W}_m$ are orthonormal. As $m$ grows, this covariance tends to the identity $\mathbf{I}_n$. By the theory of the Arnoldi process (Morgan, 2002), $\mathbf{V}_{m+1}$ contains better and better approximations of the extreme eigenvectors as $m$ increases. Hence, $\mathcal{N}(\mathbf{0},\bm{\Sigma}_m^{\mathbf{b}})$ has a high density close to the bottom eigen-subspace of $\mathbf{A}$. Meanwhile, because $\bm{\Sigma}_m^{\mathbf{b}}$ is low-rank, the distribution has zero density along many directions. Therefore, we sample $\mathbf{x}$ from both $\mathcal{N}(\mathbf{0},\bm{\Sigma}_m^{\mathbf{x}})$ and $\mathcal{N}(\mathbf{0},\mathbf{I}_n)$ to form each training batch, so as to cover more important subspace directions.
We note that the use of Gaussian distributions brings the convenience of analyzing the input/output distributions in general terms. Other distributions may be more effective if knowledge of $\mathbf{x}$ exists and can be exploited.
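To make the sampling procedure concrete, the following sketch (a NumPy illustration with our own function names, not the authors' released code) runs the plain Arnoldi process of (4), takes the SVD of $\overline{\mathbf{H}}_m$, and draws a training batch that mixes samples from (5) with standard Gaussian $\mathbf{x}$:

```python
import numpy as np

def arnoldi(A, m, seed=0):
    """Plain Arnoldi process (no preconditioner): A V_m = V_{m+1} Hbar_m."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    v = rng.standard_normal(n)
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):                 # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)        # assumes no breakdown
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

def sample_training_pairs(A, V, H, batch=16, rng=None):
    """Half of the batch follows Eq. (5); the other half uses x ~ N(0, I_n)."""
    if rng is None:
        rng = np.random.default_rng()
    n, m = A.shape[0], H.shape[1]
    W, S, Zt = np.linalg.svd(H, full_matrices=False)   # Hbar_m = W_m S_m Z_m^T
    xs = []
    for _ in range(batch // 2):                        # Eq. (5): x = V_m Z_m S_m^{-1} eps
        eps = rng.standard_normal(m)
        xs.append(V[:, :m] @ (Zt.T @ (eps / S)))
    for _ in range(batch - batch // 2):
        xs.append(rng.standard_normal(n))
    X = np.stack(xs, axis=1)
    B = A @ X                                          # right-hand sides at negligible cost
    return B, X
```

Each column pair `(B[:, k], X[:, k])` is one training example $(\mathbf{b},\mathbf{x})$ with $\mathbf{b}=\mathbf{A}\mathbf{x}$.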
2.3 Scale-Equivariant Graph Neural Preconditioner
We now define the neural network $\mathbf{M}$ that approximates $\mathbf{A}^{-1}$. Naturally, a (sparse) matrix $\mathbf{A}=[a_{ij}]$ admits a graph interpretation, similar to how AMG interprets the coefficient matrix with a graph. Treating $\mathbf{A}$ as the graph adjacency matrix allows us to use GNNs (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Xu et al., 2019) to parameterize the preconditioner. These neural networks perform graph convolutions in a neighborhood of each node, reducing the quadratic pairwise cost by ignoring nodes far away from the neighborhood.
Our neural network is based on the graph convolutional network (GCN) (Kipf & Welling, 2017). A GCN layer is defined as $\text{GCONV}(\mathbf{X})=\operatorname{ReLU}(\widehat{\mathbf{A}}\mathbf{X}\mathbf{W})$, where $\widehat{\mathbf{A}}\in\mathbb{R}^{n\times n}$ is some normalization of $\mathbf{A}$, $\mathbf{X}\in\mathbb{R}^{n\times d_{\text{in}}}$ is the input data, and $\mathbf{W}\in\mathbb{R}^{d_{\text{in}}\times d_{\text{out}}}$ is a learnable parameter.
Figure 1: Our GNN architecture for the preconditioner $\mathbf{M}$. The nonlinearity $\sigma$ is $\operatorname{ReLU}$. The component $s$ denotes the scale-equivariant operator (8).
Our neural network, as illustrated in Figure 1, enhances GCN in a few ways:
1. We normalize $\mathbf{A}$ differently. Specifically, we define
$$\widehat{\mathbf{A}}=\mathbf{A}/\gamma \quad\text{where}\quad \gamma=\min\Big\{\max_i \sum_j |a_{ij}|,\ \max_j \sum_i |a_{ij}|\Big\}. \qquad (6)$$
The factor $\gamma$ is an upper bound on the spectral radius of $\mathbf{A}$, based on the Gershgorin circle theorem (Gerschgorin, 1931). This avoids the division by zero that may occur in the standard GCN normalization, because the nonzeros of $\mathbf{A}$ can be both positive and negative.
2. We add a residual connection to GCONV, to allow stacking a deeper GNN without suffering from the over-smoothing problem (Chen et al., 2020). Specifically, we define a Res-GCONV layer as
$$\text{Res-GCONV}(\mathbf{X})=\operatorname{ReLU}(\mathbf{X}\mathbf{U}+\widehat{\mathbf{A}}\mathbf{X}\mathbf{W}). \qquad (7)$$
Moreover, we let the parameters $\mathbf{U}$ and $\mathbf{W}$ be square matrices of dimension $d\times d$, so that the input/output dimensions do not change across layers. We stack $L$ such layers.
3. We use an MLP encoder (resp. MLP decoder) before (resp. after) the $L$ Res-GCONV layers to lift (resp. project) the dimensions.
Additionally, the main novelty of the architecture is scale-equivariance. Note that the operator to approximate, $\mathbf{A}^{-1}$, is scale-equivariant, but a general neural network is not guaranteed to be so. Because the input $\mathbf{b}$ may have very different scales (due to the sampling approach considered in the preceding subsection), to facilitate training, we design a parameter-free scaling $s$ and back-scaling $s^{-1}$:
$$s(\cdot)=\frac{\sqrt{n}}{\tau}\,\cdot \qquad\text{and}\qquad s^{-1}(\cdot)=\frac{\tau}{\sqrt{n}}\,\cdot\,, \qquad\text{where }\tau=\|\mathbf{b}\|_2. \qquad (8)$$
We apply $s$ at the beginning and $s^{-1}$ at the end of the neural network. Effectively, we restrict the input space of the neural network from the full $\mathbb{R}^n$ to a more controllable space, the sphere $\sqrt{n}\,\mathbb{S}^{n-1}$. Clearly, $s$ guarantees scale-equivariance; i.e., $\mathbf{M}(\alpha\mathbf{b})=\alpha\mathbf{M}(\mathbf{b})$ for any scalar $\alpha\neq 0$.
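As an illustration of the architecture just described, here is a minimal PyTorch sketch (single input vector, no batching; class and argument names are ours) combining the scaling $s$, the MLP encoder/decoder, and $L$ Res-GCONV layers:

```python
import torch
import torch.nn as nn

class ResGConv(nn.Module):
    """Res-GCONV layer, Eq. (7): ReLU(X U + A_hat X W)."""
    def __init__(self, d):
        super().__init__()
        self.U = nn.Linear(d, d, bias=False)
        self.W = nn.Linear(d, d, bias=False)

    def forward(self, A_hat, X):
        return torch.relu(self.U(X) + torch.sparse.mm(A_hat, self.W(X)))

class GNP(nn.Module):
    """Scale-equivariant graph neural preconditioner M(b) ~ A^{-1} b."""
    def __init__(self, A_hat, d=16, hidden=32, num_layers=8):
        super().__init__()
        self.A_hat = A_hat                      # sparse torch tensor, A / gamma as in Eq. (6)
        self.encoder = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, d))
        self.layers = nn.ModuleList([ResGConv(d) for _ in range(num_layers)])
        self.decoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, b):
        n = b.shape[0]
        tau = torch.linalg.norm(b)              # scaling s, Eq. (8)
        X = (b * (n ** 0.5 / tau)).unsqueeze(-1)
        X = self.encoder(X)                     # MLP lifts node features to dimension d
        for layer in self.layers:
            X = layer(self.A_hat, X)
        x = self.decoder(X).squeeze(-1)         # MLP projects back to one value per node
        return x * (tau / n ** 0.5)             # back-scaling s^{-1}
```

The scaling pair implements Eq. (8): the input is mapped onto the sphere $\sqrt{n}\,\mathbb{S}^{n-1}$ before the network, and the output is scaled back afterwards.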
3 Evaluation Methodology and Experiment Setting
Problems. Our evaluation strives to cover a large number of problems from as diverse a set of application areas as possible. To this end, we turn to the SuiteSparse matrix collection (https://sparse.tamu.edu), a widely used benchmark in numerical linear algebra. We select all square, real-valued, non-SPD matrices whose number of rows falls between 1K and 100K and whose number of nonzeros is fewer than 2M. This selection results in 867 matrices from 50 application areas. Some applications involve PDEs (such as computational fluid dynamics), while others come from graph, optimization, economics, and statistics problems.
To facilitate solving the linear systems, we prescale each $\mathbf{A}$ by $\gamma$ defined in (6). This scaling is a form of normalization, which prevents the entries of the matrix from being exceedingly small or large overall. All experiments assume the ground-truth solution $\mathbf{x}=\mathbf{1}$.² FGMRES starts with $\mathbf{x}_0=\mathbf{0}$.

² This is a common practice. Avoid setting $\mathbf{x}$ to be Gaussian random, because this (partially) matches the training data generation, resulting in the inability to test out-of-distribution cases.
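For illustration, the per-matrix setup can be sketched as follows (assuming a matrix downloaded from SuiteSparse in Matrix Market format; the file name is hypothetical):

```python
import numpy as np
from scipy.io import mmread

A = mmread("matrix.mtx").tocsr()     # hypothetical file; any SuiteSparse .mtx works

# Prescale by gamma from Eq. (6): min of the max absolute row sum and column sum.
abs_A = abs(A)
gamma = min(abs_A.sum(axis=1).max(), abs_A.sum(axis=0).max())
A = A / gamma

x_true = np.ones(A.shape[0])         # ground-truth solution x = 1
b = A @ x_true                       # right-hand side
x0 = np.zeros(A.shape[0])            # FGMRES initial guess
```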
Compared methods. We compare GNP with three widely used general-purpose preconditioners (ILU, AMG, and GMRES; using GMRES to precondition GMRES is also called inner-outer GMRES) and one simple-to-implement but weak preconditioner (Jacobi). The endeavor of preconditioning over 800 matrices from different domains limits the choices, but these preconditioners are readily usable with Python software support.
ILU comes from scipy.sparse.linalg.spilu, which in turn comes from SuperLU (Li, 2005; Li & Shao, 2010) and implements the thresholded version of ILU (Saad, 1994). We use the default drop tolerance and fill factor without tuning. AMG comes from PyAMG (Bell et al., 2023). Specifically, we use pyamg.blackbox.solver().aspreconditioner as the preconditioner, with the configuration of the blackbox solver computed from pyamg.blackbox.solver_configuration. GMRES is self-implemented. For (the inner) GMRES, we use 10 iterations and stop it when it reaches a relative residual norm tolerance of 1e-6. For (the outer) FGMRES, the restart cycle is 10.
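For reference, the baseline preconditioners can be set up roughly as follows (a sketch using the library entry points named above; `A` is the prescaled sparse matrix, defaults are left untouched, and exact keyword arguments may differ across library versions):

```python
import scipy.sparse.linalg as spla
import pyamg

# ILU via SuperLU with the default drop tolerance and fill factor.
ilu = spla.spilu(A.tocsc())
M_ilu = spla.LinearOperator(A.shape, ilu.solve)

# AMG via PyAMG's blackbox solver, configured automatically from the matrix.
config = pyamg.blackbox.solver_configuration(A)
M_amg = pyamg.blackbox.solver(config).aspreconditioner()
```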
Note that sophisticated preconditioners require tremendous effort to implement robustly. Hence, we prioritize ones that are more robust and that come with continued support. For example, we also experimented with AmgX (Naumov et al., 2015) but found that, while faster, it throws more errors than PyAMG does. See Section F.4 for details.
Stopping criteria. The stopping criteria of all linear system solutions are rtol = 1e-8 and maxiters = 100. For time comparisons, we additionally run experiments by enabling timeout and disabling maxiters. We set the timeout to be the maximum solution time among all preconditioners when using maxiters = 100 as the stopping criterion.
Evaluation metrics. Evaluating solution quality and comparing different methods over a large number of problems require programmable metrics, but defining these metrics is harder than it appears. Comparing the number of iterations alone is insufficient, because the per-iteration time of each method is different. However, to compare time, one would wait until the solution reaches rtol, which cannot be guaranteed for every problem. Thus, we can use maxiters and timeout to terminate a solution. However, one solution may reach maxiters early (in time) and attain poor accuracy, while another solution may reach timeout very late but attain good accuracy. Which one is better is debatable.
Hence, we propose two novel metrics, depending on the use of the stopping criteria. The first one, area under the relative residual norm curve with respect to iterations, is defined as
$$\text{Iter-AUC}=\sum_{i=0}^{\text{iters}}\big[\log_{10}r_i-\log_{10}\texttt{rtol}\big],\qquad r_i=\|\mathbf{b}-\mathbf{A}\mathbf{x}_i\|_2/\|\mathbf{b}\|_2,$$
where iters is the actual number of iterations when FGMRES stops and $r_i$ is the relative residual norm at iteration $i$. We take the logarithm because residual plots are typically drawn in log scale. This metric compares methods based on the iteration count, taking into account the history (e.g., convergence speed at different stages) rather than the final error alone. It is used when the stopping criteria are rtol and maxiters.
The second metric, area under the relative residual norm curve with respect to time, is defined as
$$\text{Time-AUC}=\int_{0}^{T}\big[\log_{10}r(t)-\log_{10}\texttt{rtol}\big]\,dt\approx\sum_{i=1}^{\text{iters}}\big[\log_{10}r_i-\log_{10}\texttt{rtol}\big]\,(t_i-t_{i-1}),$$
where $T$ is the elapsed time when FGMRES stops, $r(t)$ is the relative residual norm at time $t$, and $t_i$ is the elapsed time at iteration $i$. This metric compares methods based on the solution time, taking into account the history of the errors. It is used when the stopping criteria are rtol and timeout.
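Both metrics can be computed directly from the recorded convergence history; a small sketch (function names are ours) assuming `res_history` holds $r_0,\ldots,r_{\text{iters}}$ and `times` holds the corresponding elapsed times $t_0,\ldots,t_{\text{iters}}$:

```python
import numpy as np

def iter_auc(res_history, rtol=1e-8):
    """Area under the log10 relative-residual curve vs. iteration count."""
    r = np.asarray(res_history)
    return float(np.sum(np.log10(r) - np.log10(rtol)))

def time_auc(res_history, times, rtol=1e-8):
    """Area under the log10 relative-residual curve vs. elapsed time."""
    r = np.asarray(res_history)
    t = np.asarray(times)
    return float(np.sum((np.log10(r[1:]) - np.log10(rtol)) * np.diff(t)))
```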
Hyperparameters. To feasibly train over 800 GNNs, we use the same set of hyperparameters for each of them, without tuning. There is strong potential that the GNN performance can be greatly improved with careful tuning. Our purpose is to understand the general behavior of GNP, particularly its strengths and weaknesses. We use $L=8$ Res-GCONV layers, set the layer input/output dimension to 16, and use 2-layer MLPs with hidden dimension 32 for lifting/projection. We use Adam (Kingma & Ba, 2015) as the optimizer, set the learning rate to 1e-3, and train for 2000 steps with a batch size of 16. We apply neither dropout nor weight decay. Because data are sampled from the same distribution, we use the model at the best training epoch as the preconditioner, without resorting to a validation set.
We use the $\ell_1$ residual norm $\|\mathbf{A}\mathbf{M}(\mathbf{b})-\mathbf{A}\mathbf{x}\|_1$ as the training loss, as is common practice in robust regression. We use $m=40$ Arnoldi steps when sampling the $(\mathbf{x},\mathbf{b})$ pairs according to (5). Among the 16 pairs in a batch, 8 pairs follow (5) and 8 pairs follow $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_n)$.
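Putting the pieces together, a minimal training loop consistent with these hyperparameters might look as follows (a sketch reusing the earlier illustrative helpers `sample_training_pairs` and the `GNP` module; `A_torch` is a sparse torch copy of the prescaled matrix, and `V`, `H` come from the Arnoldi sketch with $m=40$):

```python
import torch

model = GNP(A_torch)                                  # GNP from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_loss, best_state = float("inf"), None
for step in range(2000):
    B, X = sample_training_pairs(A, V, H, batch=16)   # half Eq. (5), half x ~ N(0, I)
    B = torch.as_tensor(B, dtype=torch.float32)
    X = torch.as_tensor(X, dtype=torch.float32)
    loss = 0.0
    for k in range(B.shape[1]):
        pred = model(B[:, k])                         # M(b)
        diff = torch.sparse.mm(A_torch, (pred - X[:, k]).unsqueeze(-1))
        loss = loss + diff.abs().sum()                # l1 loss ||A M(b) - A x||_1
    loss = loss / B.shape[1]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < best_loss:                       # keep the best model as the preconditioner
        best_loss = loss.item()
        best_state = {name: p.detach().clone() for name, p in model.state_dict().items()}
model.load_state_dict(best_state)
```

The trained model can then serve as the nonlinear preconditioner in the FGMRES sketch above, e.g., `M = lambda v: model(torch.as_tensor(v, dtype=torch.float32)).detach().numpy()`.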
Compute environment. Our experiments are conducted on a machine with one Tesla V100 (16GB) GPU, 96 Intel Xeon 2.40GHz cores, and 386GB main memory. All code is implemented in Python with PyTorch. All computations (training and inference) use the GPU whenever supported.
4 Results
We organize the experiment results by key questions of interest.
How does GNP perform compared with traditional preconditioners? We consider three factors: convergence speed, runtime speed, and preconditioner construction cost.
Figure 2: Percentage of problems on which each preconditioner performs the best.
| Percentile | by #iter | by time |
|---|---|---|
| 25% | 1.91e+00 | 1.18e+00 |
| 50% | 6.74e+00 | 2.33e+00 |
| 75% | 6.78e+03 | 8.16e+01 |
| 100% | 1.89e+10 | 1.91e+10 |
Table 1: Distribution of the residual-norm ratio between the second best preconditioner and GNP, when GNP performs the best. Distribution is described by using percentiles.
Figure 3: Preconditioner construction time and solution time (using maxiters to stop). The construction time of Jacobi is negligible and not shown. GMRES does not require construction.
In Figure 2 we show the percentage of problems on which each preconditioner performs the best, with respect to iteration counts and solution time (see the Iter-AUC and Time-AUC metrics defined in the preceding section). GNP performs the best for a substantial portion of the problems in both metrics. In comparison, ILU and AMG perform the best for more problems, while Jacobi, GMRES (as a preconditioner), and no preconditioner perform the best for fewer problems in the time metric.
That ILU and AMG are the best for more problems does not compromise the competitiveness of GNP. For one, they are less robust and sometimes bear a significantly higher construction cost, as will be revealed later. Moreover, GNP can be used when other preconditioners are less satisfactory. In Table 1, we summarize the distribution of the residual-norm ratio between the second-best preconditioner and GNP, when GNP is the best. The ratio means a factor-of-$x$ decrease in the residual norm if GNP is used in place of other preconditioners. Under the Iter-AUC metric, a factor of 6780x decrease is seen at the 75th percentile, and in the best case, a decrease of more than ten orders of magnitude is seen. The behavior is similar under the Time-AUC metric. These significant reductions manifest the usefulness of GNP.
In Figure 3 and Figure 8 of Section F.2, we plot the preconditioner construction time and solution time for each matrix and each preconditioner. Here, the solution is terminated by maxiters rather than timeout (so the times differ across matrices). Two observations can be made. First, the training time of GNP is nearly proportional to the matrix size, while the construction time of ILU and AMG is hardly predictable. Even though for many of the problems this time is only a fraction of that of GNP, there exist quite a few cases where it is significantly longer. In the worst case, constructing the AMG preconditioner is more than an order of magnitude more costly than constructing GNP.
Second, there is a clear gap between the construction and solution time for GNP, but no such gap exists for ILU and AMG. One is tempted to compare the overall time, but one preconditioner can be used for any number of right-hand sides, and hence different conclusions may be reached depending on this number. When it is large, the solution time dominates, in which case GNP is faster than ILU, AMG, and GMRES in most cases.
On what problems does GNP perform the best? We expand the granularity of Figure 2 into Figure 4, overlaid with the distributions of the matrices regarding the size, the condition number, and the application area, respectively. One finding is that GNP is particularly useful for ill-conditioned matrices (i.e., those with a condition number $\geq 10^{16}$). ILU is less competitive on these problems, possibly because it focuses on the sparsity structure, whose connection with the spectrum is less clear. Another finding is that GNP is particularly useful in three application areas: chemical process simulation problems, economic problems, and eigenvalue/model reduction problems. These areas constitute many problems in the dataset; both the number and the proportion of problems in which GNP performs the best are substantially high.
On the other hand, the matrix size does not much affect GNP performance. One sees this more clearly in Figure 9 of Section F.3, which is a "proportion" version of Figure 4: the proportion of problems on which GNP performs the best is relatively flat across matrix sizes. Similarly, symmetry does not appear to play a distinguishing role either: GNP performs the best on 7.6% of the symmetric matrices and 17.5% of the nonsymmetric matrices.
Figure 4: Breakdown of best preconditioners with respect to matrix sizes, condition numbers, and application areas. Only the application areas with the largest numbers of problems are shown. The last bar in the middle plot is for condition number $\geq 10^{16}$.
Table 2: Failures of preconditioners (count and proportion).
How robust is GNP? We summarize in Table 2 the number of failures for each preconditioner. GNP and GMRES are highly robust. We see that neural network training does not cause any trouble, which is a practical advantage. In contrast, ILU fails for nearly half of the problems, AMG fails for nearly 8%, and Jacobi fails for over 6%. According to the error log, the common failures of ILU are that "(f)actor is exactly singular" and that "matrix is singular … in file ilu_dpivotL.c", while the common failure of AMG is "array … contain(s) infs or NaNs". Meanwhile, solution failures occur when the residual norm tracked by the QR factorization of the upper Hessenberg $\overline{\mathbf{H}}_m$ fails to match the actual residual norm. While ILU and AMG perform the best for more problems than GNP does, safeguarding their robustness is challenging, rendering GNP more attractive.
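As an illustration of this safeguard, here is a minimal sketch that flags a solution failure by comparing the tracked residual norm against the explicitly computed one. The slack factor and tolerance handling are assumptions for illustration, not the exact check in our benchmarking framework.

```python
import numpy as np

def solution_failed(A, x, b, tracked_resnorm, rtol=1e-8, slack=10.0):
    # Explicitly recompute the residual from the returned solution.
    true_resnorm = np.linalg.norm(b - A @ x)
    # Flag a failure when the explicit residual disagrees with the recurrence
    # (QR-tracked) residual by more than a generous slack factor.
    return true_resnorm > slack * max(tracked_resnorm, rtol * np.linalg.norm(b))
```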
What does the convergence history look like? In Figure 5, we show a few examples where GNP performs the best. Note that in these cases, ILU fails for most of the problems, in either the construction or the solution phase. The examples indicate that the convergence of GNP either tracks that of other preconditioners at a similar rate (Simon/venkat25) or becomes significantly faster. Notably, on VanVelzen/std1_Jac3, GNP uses only four iterations to reach 1e-8, while other preconditioners take 100 iterations and still cannot decrease the relative residual below 1e-2.
Figure 5: Example convergence histories.
We additionally plot the convergence curves with respect to time in Figure 10 of Section F.6. An important finding is that GMRES as a preconditioner generally takes much longer to run than GNP. This is because GMRES is bottlenecked by the orthogonalization process, which overshadows the subtler cost comparison between the many matrix-vector multiplications in GMRES and the fewer matrix-matrix multiplications in GNP.
Are the proposed training-data generation and the scale-equivariance design necessary? In Figure 6, we compare the proposed designs against alternatives, for both the training loss and the relative residual norm. In the comparison of training-data generation, using $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_n)$ leads to lower losses, while using $\mathbf{b}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_n)$ leads to higher losses. This is expected, because the former expects similar outputs from the neural network, which is easier to train, while the latter expects drastically different outputs, which makes the network difficult to train. Using $\mathbf{x}$ from the proposed mixture leads to losses lying in between. This choice leads to the best preconditioning performance: more problems attain lower residual norms (especially near rtol). In other words, the proposed training-data generation strikes the best balance between ease of neural network training and quality of the resulting preconditioner.
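To make the three generation strategies concrete, the following sketch shows how the inputs for a training step could be produced, assuming the network is trained to map $\mathbf{b}$ to an approximation of $\mathbf{A}^{-1}\mathbf{b}$. With $\mathbf{x}$-sampling the pair $(\mathbf{b},\mathbf{x})=(\mathbf{A}\mathbf{x},\mathbf{x})$ is exact without any solve; with $\mathbf{b}$-sampling no exact solution is available, so a residual-style loss on $\mathbf{A}\,M(\mathbf{b})-\mathbf{b}$ would be used instead. The 50/50 alternation used for the mixture here is an assumption for illustration, not the paper's exact recipe.

```python
import numpy as np

def training_step_inputs(A, rng, strategy):
    n = A.shape[0]
    if strategy == "x":
        x = rng.standard_normal(n)      # sample the target solution
        return A @ x, x                 # supervised pair: input b = A x, target x
    if strategy == "b":
        b = rng.standard_normal(n)      # sample the right-hand side
        return b, None                  # no target; train on residual ||A M(b) - b||
    # "mixture": alternate between the two generation modes
    return training_step_inputs(A, rng, "x" if rng.random() < 0.5 else "b")
```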
In the comparison of using the scaling operator $s$ versus not, we see that the training behaviors barely differ between the two choices, but using scaling offers the advantage that more problems attain lower residual norms. These observations corroborate the choices we make in the neural network design and training-data generation.
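One common way to obtain scale equivariance, i.e., $M(\alpha\mathbf{v})=\alpha M(\mathbf{v})$, is to normalize the input before the network and rescale the output afterwards, as in the sketch below; whether this matches the exact form of the scaling operator $s$ is an assumption for illustration.

```python
import numpy as np

def scale_equivariant(net):
    # Wrap an arbitrary vector-to-vector network so that scaling the input by
    # alpha scales the output by alpha as well.
    def M(v):
        s = np.linalg.norm(v)
        if s == 0.0:
            return v
        return s * net(v / s)
    return M
```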
Figure 6: Left: comparison of training data generation; right: comparison of scale-equivariance.
5 Discussions and Conclusions
We have presented a GNN approach to preconditioning Krylov solvers. To our knowledge, this is the first work to use a neural network as an approximation to the matrix inverse without exploiting information beyond the matrix itself (i.e., an algebraic preconditioner). Compared with traditional algebraic preconditioners, this approach is robust, has predictable construction costs, and is the most effective on many problems. We evaluate the approach on more than 800 matrices from 50 application areas, the widest coverage in the literature.
One common skepticism about neural network approaches to preconditioning is that network training is generally costly and, in our case, the network can be used for only one $\mathbf{A}$. We have shown that, to the contrary, our training cost can sometimes be much lower than the construction cost of widely used preconditioners such as ILU and AMG, because network training is more predictable while the other preconditioners suffer from manipulating irregular nonzero patterns. Moreover, a benefit of neural networks is that compute infrastructure and library support are already in place, which facilitates implementation and offers robustness, as opposed to other algebraic preconditioners, whose implementation is challenging and whose robustness is extremely difficult to achieve. Additionally, traditional algebraic preconditioners are constructed for only one $\mathbf{A}$ as well, so they are no more flexible than ours.
While one may hope that a single neural network can solve many, if not all, linear systems (different $\mathbf{A}$'s), as recent approaches (PINN, NO, or learning the incomplete factors) appear to suggest, it is unrealistic to expect a single network to have the capacity to learn the matrix inverse for all matrices. Existing approaches work on a distribution of linear systems for the same problem and generalize within that problem (e.g., by varying the grid size or PDE coefficients). For GNP to achieve the same, we may fine-tune a trained network at a cost lower than training from scratch, or augment the GNN input with extra information (e.g., PDE coefficients). It is unlikely that any approach remains effective beyond the problem it is trained on.
Our work can be extended along many avenues. First, an immediate follow-up is the preconditioning of SPD matrices. While a recent work on PDEs (Rudikov et al., 2024) shows promise for the combined use of NOs and flexible CG (Notay, 2000), it is relatively fragile with respect to large variations of the preconditioner on other problems. We speculate that a split preconditioner can work more robustly and that some form of autoencoder network better serves this purpose.
Second, while GNP is trained for an individual matrix in this work for simplicity, it is possible to extend it to a sequence of evolving matrices, such as in sequential problems or time stepping. Continual learning (Wang et al., 2024), which is concerned with continuously adapting a trained neural network to slowly evolving data distributions, can amortize the initial preconditioner construction cost over more and more linear systems in the sequence.
Third, the GPU memory limits the size of the matrices and the complexity of the neural networks that we can experiment with. Future work can explore multi-GPU and/or distributed training of GNNs (Kaler et al., 2022; 2023) for scaling GNP to even larger matrices. Training on large graphs typically uses neighborhood sampling to mitigate the "neighborhood explosion" problem in GNNs (Hamilton et al., 2017; Chen et al., 2018). Some sampling approaches, such as layer-wise sampling (Chen et al., 2018), are tied to sketching the matrix product $\widehat{\mathbf{A}}\mathbf{X}$ inside the graph convolution layer (7), with certain theoretical guarantees (Chen & Luss, 2018).
Fourth, while we use the same set of hyperparameters for evaluation, nothing prevents problem-specific hyperparameter tuning when one works on an application, especially a challenging one for which no general-purpose preconditioner works sufficiently well. We expect that this paper's results, based on a naive hyperparameter setting, can be quickly improved by the community. Needless to say, improving the neural network architecture can push the success of GNP much further.
Acknowledgments
The author would like to thank eight anonymous reviewers whose comments and suggestions helped significantly improve this paper.
References
- Arioli & Duff (2009) M. Arioli and I. S. Duff. Using FGMRES to obtain backward stability in mixed precision. Electron. Trans. Numer. Anal., 33:31–44, 2009.
- Bell et al. (2023) Nathan Bell, Luke N. Olson, Jacob Schroder, and Ben Southworth. PyAMG: Algebraic multigrid solvers in Python. Journal of Open Source Software, 8(87):5495, 2023.
- Bommasani et al. (2021) Rishi Bommasani et al. On the opportunities and risks of foundation models. Preprint arXiv:2108.07258, 2021.
- Bånkestad et al. (2024) Maria Bånkestad, Jennifer Andersson, Sebastian Mair, and Jens Sjölund. Ising on the graph: Task-specific graph subsampling via the Ising model. Preprint arXiv:2402.10206, 2024.
- Chan & Jin (2007) Raymond Hon-Fu Chan and Xiao-Qing Jin. An Introduction to Iterative Toeplitz Solvers. SIAM, 2007.
- Chen & Luss (2018) Jie Chen and Ronny Luss. Stochastic gradient descent with biased but consistent gradient estimators. Preprint arXiv:1807.11880, 2018.
- Chen & Stein (2023) Jie Chen and Michael L. Stein. Linear-cost covariance functions for Gaussian random fields. Journal of the American Statistical Association, 118(541):147–164, 2023.
- Chen et al. (2014) Jie Chen, Tom L. H. Li, and Mihai Anitescu. A parallel linear solver for multilevel Toeplitz systems with possibly several right-hand sides. Parallel Computing, 40(8):408–424, 2014.
- Chen et al. (2016) Jie Chen, Lois Curfman McInnes, and Hong Zhang. Analysis and practical use of flexible BiCGStab. Journal of Scientific Computing, 68(2):803–825, 2016.
- Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.
- Chen et al. (2020) Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In ICML, 2020.
- Chow & Saad (1998) Edmond Chow and Yousef Saad. Approximate inverse preconditioners via sparse-sparse iterations. SIAM Journal on Scientific Computing, 19(3):995–1023, 1998.
- Ciarlet (2002) Philippe G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 2002.
- Davis & Hu (2011) Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1–25, 2011.
- Gerschgorin (1931) S. Gerschgorin. Über die Abgrenzung der Eigenwerte einer Matrix. Izv. Akad. Nauk. USSR Otd. Fiz.-Mat. Nauk, 6:749–754, 1931.
- Hackbusch (1999) W. Hackbusch. A sparse matrix arithmetic based on $\mathcal{H}$-matrices, part I: Introduction to $\mathcal{H}$-matrices. Computing, 62(2):89–108, 1999.
- Hackbusch & Börm (2002) W. Hackbusch and S. Börm. Data-sparse approximation by adaptive $\mathcal{H}^2$-matrices. Computing, 69(1):1–35, 2002.
- Hamilton et al. (2017) William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
- Häusner et al. (2023) Paul Häusner, Ozan Öktem, and Jens Sjölund. Neural incomplete factorization: learning preconditioners for the conjugate gradient method. Preprint arXiv:2305.16368, 2023.
- Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
- Kaler et al. (2022) Tim Kaler, Nickolas Stathas, Anne Ouyang, Alexandros-Stavros Iliopoulos, Tao B. Schardl, Charles E. Leiserson, and Jie Chen. Accelerating training and inference of graph neural networks with fast sampling and pipelining. In MLSys, 2022.
- Kaler et al. (2023) Tim Kaler, Alexandros-Stavros Iliopoulos, Philip Murzynowski, Tao B. Schardl, Charles E. Leiserson, and Jie Chen. Communication-efficient graph neural networks with probabilistic neighborhood expansion analysis and caching. In MLSys, 2023.
- Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
- Kovachki et al. (2022) Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces. Journal of Machine Learning Research, 23:1–97, 2022.
- Li (2005) Xiaoye S. Li. An overview of SuperLU: Algorithms, implementation, and user interface. ACM Trans. Mathematical Software, 31(3):302–325, 2005.
- Li & Shao (2010) Xiaoye S. Li and Meiyue Shao. A supernodal approach to incomplete LU factorization with partial pivoting. ACM Trans. Mathematical Software, 37(4), 2010.
- Li et al. (2023) Yichen Li, Peter Yichen Chen, Tao Du, and Wojciech Matusik. Learning preconditioner for conjugate gradient PDE solvers. In ICML, 2023.
- Li et al. (2020) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. Preprint arXiv:2003.03485, 2020.
- Li et al. (2021) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In ICLR, 2021.
- Liu et al. (2023) Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. Preprint arXiv:2305.14342, 2023.
- Lu et al. (2021) Lu Lu, Pengzhan Jin, and George Em Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3:218–229, 2021.
- Manteuffel et al. (2018) Thomas A. Manteuffel, John Ruge, and Ben S. Southworth. Nonsymmetric algebraic multigrid based on local approximate ideal restriction ($\ell$AIR). SIAM Journal on Scientific Computing, 40(6):A4105–A4130, 2018.
- Manteuffel et al. (2019) Thomas A. Manteuffel, Steffen Münzenmaier, John Ruge, and Ben Southworth. Nonsymmetric reduction-based algebraic multigrid. SIAM Journal on Scientific Computing, 41(5):S242–S268, 2019.
- Morgan (2002) Ronald B. Morgan. GMRES with deflated restarting. SIAM Journal on Scientific Computing, 24(1):20–37, 2002.
- Naumov et al. (2015) M. Naumov, M. Arsaev, P. Castonguay, J. Cohen, J. Demouth, J. Eaton, S. Layton, N. Markovskiy, I. Reguly, N. Sakharnykh, V. Sellappan, and R. Strzodka. AmgX: A library for GPU accelerated algebraic multigrid and preconditioned iterative methods. SIAM Journal on Scientific Computing, 37(5):S602–S626, 2015.
- Notay (2000) Yvan Notay. Flexible conjugate gradients. SIAM Journal on Scientific Computing, 22(4):1444–1460, 2000.
- Raissi et al. (2019) M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
- Rudikov et al. (2024) Alexander Rudikov, Vladimir Fanaskov, Ekaterina Muravleva, Yuri M. Laevsky, and Ivan Oseledets. Neural operators meet conjugate gradients: The FCG-NO method for efficient PDE solving. In ICML, 2024.
- Ruge & Stüben (1987) J. W. Ruge and K. Stüben. Algebraic multigrid. In Stephen F. McCormick (ed.), Multigrid Methods, Frontiers in Applied Mathematics, chapter 4. SIAM, 1987.
- Saad (1994) Y. Saad. ILUT: A dual threshold incomplete LU factorization. Numerical Linear Algebra with Applications, 1(4):387–402, 1994.
- Saad (1993) Youcef Saad. A flexible inner-outer preconditioned GMRES algorithm. SIAM Journal on Scientific Computing, 14(2):461–469, 1993.
- Saad & Schultz (1986) Youcef Saad and Martin H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7(3):856–869, 1986.
- Saad (2003) Yousef Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, second edition, 2003.
- Trifonov et al. (2024) Vladislav Trifonov, Alexander Rudikov, Oleg Iliev, Ivan Oseledets, and Ekaterina Muravleva. Learning from linear algebra: A graph neural network approach to preconditioner design for conjugate gradient solvers. Preprint arXiv:2405.15557, 2024.
- Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
- Wang et al. (2024) Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024.
- Wu et al. (2019) Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr., Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. In ICML, 2019.
- Wu et al. (2021) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2021.
- Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.
- Zhou et al. (2020) Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020.
Appendix A Supporting Code
Appendix B FGMRES
FGMRES is summarized in Algorithm 1.
Algorithm 1 FGMRES with $\mathbf{M}$ being a nonlinear operator
1: Let $\mathbf{x}_0$ be given. Define $\overline{\mathbf{H}}_m \in \mathbb{R}^{(m+1)\times m}$ and initialize all its entries $h_{ij}$ to zero
2: loop until maxiters is reached
3: Compute $\mathbf{r}_0 = \mathbf{b} - \mathbf{A}\mathbf{x}_0$, $\beta = \|\mathbf{r}_0\|_2$, and $\mathbf{v}_1 = \mathbf{r}_0/\beta$
4: for $j = 1, \ldots, m$ do
5: Compute $\mathbf{z}_j = \mathbf{M}(\mathbf{v}_j)$ and $\mathbf{w} = \mathbf{A}\mathbf{z}_j$
6: for $i = 1, \ldots, j$ do
7: Compute $h_{ij} = \mathbf{w}^{\top}\mathbf{v}_i$ and $\mathbf{w} \leftarrow \mathbf{w} - h_{ij}\mathbf{v}_i$
8: end for
9: Compute $h_{j+1,j} = \|\mathbf{w}\|_2$ and $\mathbf{v}_{j+1} = \mathbf{w}/h_{j+1,j}$
10: end for
11: Define $\mathbf{Z}_m = [\mathbf{z}_1, \ldots, \mathbf{z}_m]$ and compute $\mathbf{x}_m = \mathbf{x}_0 + \mathbf{Z}_m\mathbf{y}_m$, where $\mathbf{y}_m = \operatorname{argmin}_{\mathbf{y}} \|\beta\mathbf{e}_1 - \overline{\mathbf{H}}_m\mathbf{y}\|_2$
12: If $\|\mathbf{b} - \mathbf{A}\mathbf{x}_m\|_2 < \texttt{tol}$, exit the loop; otherwise, set $\mathbf{x}_0 \leftarrow \mathbf{x}_m$
13: end loop
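For readers who prefer code, below is a minimal Python sketch of Algorithm 1, assuming $\mathbf{A}$ is a SciPy sparse matrix and the preconditioning operator M is an arbitrary callable (for GNP, a forward pass of the trained network). The least-squares solve for $\mathbf{y}_m$ uses a dense solver instead of the incremental QR update used in practice; the restart length, maxiters, tol, and the Jacobi operator in the usage example are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def fgmres(A, b, M, m=30, maxiters=10, tol=1e-8):
    n = b.shape[0]
    x0 = np.zeros(n)
    for _ in range(maxiters):
        r0 = b - A @ x0
        beta = np.linalg.norm(r0)
        if beta < tol:
            break
        H = np.zeros((m + 1, m))      # upper Hessenberg \bar{H}_m
        V = np.zeros((n, m + 1))      # Arnoldi basis v_1, ..., v_{m+1}
        Z = np.zeros((n, m))          # preconditioned directions z_j = M(v_j)
        V[:, 0] = r0 / beta
        k = m
        for j in range(m):
            Z[:, j] = M(V[:, j])
            w = A @ Z[:, j]
            for i in range(j + 1):    # modified Gram-Schmidt orthogonalization
                H[i, j] = w @ V[:, i]
                w -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] < 1e-14:   # lucky breakdown
                k = j + 1
                break
            V[:, j + 1] = w / H[j + 1, j]
        e1 = np.zeros(k + 1)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)
        x0 = x0 + Z[:, :k] @ y
        if np.linalg.norm(b - A @ x0) < tol:
            break
    return x0

# Usage sketch: Jacobi preconditioning expressed as a (here, linear) operator M.
A = sp.random(200, 200, density=0.05, format='csr') + 10 * sp.eye(200)
b = np.random.default_rng(0).standard_normal(200)
M = lambda v: v / A.diagonal()
x = fgmres(A, b, M)
print(np.linalg.norm(b - A @ x))
```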
Appendix C Proofs
Proof of Theorem 1.
Write $\|\mathbf{r}_m\|_2 \leq \|\widetilde{\mathbf{r}}_m\|_2 + \|\mathbf{r}_m - \widetilde{\mathbf{r}}_m\|_2$. It is well known that
$$\|\widetilde{\mathbf{r}}_m\|_2 \leq \kappa_2(\mathbf{X})\,\epsilon^{(m)}(\bm{\Lambda})\,\|\mathbf{r}_0\|_2;$$
see, e.g., Saad (2003, Proposition 6.32). On the other hand, because $\mathbf{r}_m = \mathbf{r}_0 - \mathbf{A}\mathbf{Z}_m\mathbf{y}_m$ where $\mathbf{y}_m$ minimizes $\|\mathbf{r}_0 - \mathbf{A}\mathbf{Z}_m\mathbf{y}\|_2$, we have $\mathbf{r}_m = \mathbf{r}_0 - (\mathbf{A}\mathbf{Z}_m)(\mathbf{A}\mathbf{Z}_m)^{+}\mathbf{r}_0 = \mathbf{r}_0 - \mathbf{Q}_m\mathbf{Q}_m^{\top}\mathbf{r}_0$. Therefore, $\|\mathbf{r}_m - \widetilde{\mathbf{r}}_m\|_2 = \|\mathbf{Q}_m\mathbf{Q}_m^{\top}\mathbf{r}_0 - \widetilde{\mathbf{Q}}_m\widetilde{\mathbf{Q}}_m^{\top}\mathbf{r}_0\|_2 \leq \|\mathbf{Q}_m\mathbf{Q}_m^{\top} - \widetilde{\mathbf{Q}}_m\widetilde{\mathbf{Q}}_m^{\top}\|_2\,\|\mathbf{r}_0\|_2$. ∎
Proof of Corollary 2.
By taking $\widetilde{\mathbf{M}} = \mathbf{A}^{-1}\mathbf{V}_n\mathbf{H}_n\mathbf{V}_n^{\top}$ and noting that $\mathbf{V}_n^{\top} = \mathbf{V}_n^{-1}$, we have
$$(\mathbf{V}_n\mathbf{Y})^{-1}\mathbf{A}\widetilde{\mathbf{M}}(\mathbf{V}_n\mathbf{Y}) = \operatorname{diag}(\sigma_1,\ldots,\sigma_n).$$
Then, following Theorem 1, $\|\mathbf{r}_m\|_2 \leq \kappa_2(\mathbf{V}_n\mathbf{Y})\,\epsilon^{(m)}(\bm{\Sigma})\,\|\mathbf{r}_0\|_2$. We conclude the proof by noting that $\kappa_2(\mathbf{V}_n\mathbf{Y}) = \kappa_2(\mathbf{Y})$. ∎
Appendix D Approximately Exponential Convergence of $\|\mathbf{r}_m\|_2$
We could bound the minimax polynomial $\epsilon^{(m)}$ by using Chebyshev polynomials to obtain an (approximately) exponential convergence of the residual norm $\|\mathbf{r}_m\|_2$ for large $m$.
Assume that all the eigenvalues $\sigma_i$ of $\bm{\Sigma}$ are enclosed by an ellipse $E(c,d,a)$ which excludes the origin, where $c$ is the center, $d$ is the focal distance, and $a$ is the major semi-axis. We have
$$\epsilon^{(m)}(\bm{\Sigma}) \leq \min_{p\in\mathbb{P}_m,\, p(0)=1}\ \max_{\sigma\in E(c,d,a)} |p(\sigma)|,$$
by the fact that the maximum modulus of a complex analytic function is attained on the boundary of the domain. Then, we invoke Eqn. (6.119) of Saad (2003) and obtain
$$\epsilon^{(m)}(\bm{\Sigma}) \leq \frac{C_m(a/d)}{C_m(c/d)},$$
where $C_m$ denotes the (complex) Chebyshev polynomial of degree $m$. When $m$ is large,
$$\frac{C_m(a/d)}{C_m(c/d)} \approx \left(\frac{a+\sqrt{a^2-d^2}}{c+\sqrt{c^2-d^2}}\right)^m,$$
which gives an (approximately) exponential function in $m$. Applying Corollary 2, we conclude that when $m$ is large,
$$\|\mathbf{r}_m\|_2 \lessapprox \kappa_2(\mathbf{Y})\,\|\mathbf{r}_0\|_2\left(\frac{a+\sqrt{a^2-d^2}}{c+\sqrt{c^2-d^2}}\right)^m.$$
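For intuition, the asymptotic convergence factor in this bound can be evaluated numerically, as in the sketch below; the ellipse parameters are illustrative, not derived from any matrix in the dataset.

```python
import numpy as np

def convergence_factor(c, d, a):
    # rho = (a + sqrt(a^2 - d^2)) / (c + sqrt(c^2 - d^2)); the complex square
    # root handles ellipses whose major axis is vertical (d^2 > a^2 or c^2).
    num = a + np.sqrt(complex(a**2 - d**2))
    den = c + np.sqrt(complex(c**2 - d**2))
    return abs(num / den)

rho = convergence_factor(c=1.0, d=0.5, a=0.8)
print(rho)        # per-iteration reduction factor (< 1 in this example)
print(rho ** 50)  # approximate residual reduction after m = 50 iterations
```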
Appendix E Scalability
Based on the architecture outlined in Figure 1, the computational cost of GNP is dominated by matrix-matrix multiplications, where the left matrix is either $\mathbf{A}$ (e.g., in GCONV) or a tall matrix with $n$ rows (e.g., in the MLP and Lin layers and in GCONV). This forward analysis holds for both training and inference, because backpropagation uses the same matrix-matrix multiplications. Hence, the computational cost of GNP scales as $O(\operatorname{nz}(\mathbf{A}))$, the same as that of the Krylov solver.
Practical considerations, on the other hand, are more sensitive to the constants hidden inside the big-O notation. For example, one needs to store many (but still a constant number of) vectors of length $n$, due to the batch size, the hidden dimensions of the neural network, and automatic differentiation. Denote by $c$ this constant; then, the GPU faces a storage pressure of $\operatorname{nz}(\mathbf{A}) + cn$. Once this amount exceeds the memory capacity of the GPU, one either performs training (and even inference) using the CPU only, offloads some data to the CPU and transfers it back to the GPU on demand, or undertakes more laborious engineering by distributing $\mathbf{A}$ and other data across multiple GPUs. For the last option (multi-GPU and/or distributed training), the GNN training literature brings in additional techniques (such as sampling and mini-batching the graph) to accelerate the computation (Kaler et al., 2022; 2023).
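A back-of-the-envelope sketch of this storage pressure, assuming double-precision CSR storage for $\mathbf{A}$ and an illustrative constant $c$, may help when deciding among these options.

```python
def gnp_memory_gb(nnz, n, c=64, bytes_per_value=8):
    # CSR needs a value and a 4-byte column index per nonzero, plus n+1 row pointers.
    a_bytes = nnz * (bytes_per_value + 4) + (n + 1) * 4
    # c length-n work vectors (batch, hidden dimensions, autodiff buffers).
    vec_bytes = c * n * bytes_per_value
    return (a_bytes + vec_bytes) / 1e9

# e.g., a matrix with 10 million rows and 100 million nonzeros
print(gnp_memory_gb(nnz=100_000_000, n=10_000_000))
```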
Appendix F Additional Experiment Results
F.1 Additional Metrics for Comparing Preconditioners
In addition to the Iter-AUC and Time-AUC metrics, one may be interested in using the final relative residual norm to compare the preconditioners. To this end, we summarize the results under this residual metric in Figure 7, as an extension of Figure 2. Note that for each problem, we solve the linear system twice: once using rtol and maxiters as the stopping criteria, and once using rtol and timeout. Hence, there are two charts in Figure 7, depending on how the iterations are terminated.
Figure 7: Percentage of problems on which each preconditioner performs the best.
We see that the observations are similar across these two termination scenarios. Compared to the Iter-AUC and Time-AUC metrics, GNP is the best for a similar portion of problems under the residual metric. The portion of problems on which ILU and GMRES (as a preconditioner) perform the best increases under this metric, while the portion on which AMG performs the best decreases. Jacobi and no preconditioner remain less competitive.
F.2 Preconditioner Construction Time and Solution Time
In Figure 8, we plot the preconditioner construction time and solution time for each matrix and each preconditioner. This figure is a counterpart of Figure 3 with the log scale removed from the y-axis. One sees that the GNP construction time is nearly proportional to the matrix size.
Figure 8: Preconditioner construction time and solution time (using maxiters to stop). The y-axis is in linear scale.
F.3 Breakdown of Best Preconditioners With Respect to Matrix Distributions
In Figure 9, we plot the proportion of problems on which each preconditioner performs the best, for different matrix sizes, condition numbers, and application areas. This figure is a counterpart of Figure 4. The main observation is that the proportion of problems on which GNP performs the best is relatively flat across matrix sizes. Other observations follow closely from those of Figure 4; namely, GNP is particularly useful for ill-conditioned problems and for certain application areas (chemical process simulation problems, economic problems, and eigenvalue/model reduction problems).
Figure 9: Proportion of problems on which each preconditioner performs the best, for different matrix sizes, condition numbers, and application areas. Only the application areas with the largest numbers of problems are shown. The last bar in the middle plot is for condition number $\geq 10^{16}$. Note that not every matrix has a known condition number.
F.4 Comparison between PyAMG and AmgX
Besides PyAMG, another notable implementation of AMG (which supports use as a preconditioner and supports GPUs) is AmgX (https://github.com/NVIDIA/AMGX). Here, we compare the two implementations in terms of robustness and speed.
Unlike PyAMG, which is a blackbox, AmgX leaves the choice of the AMG method and the parameters to the user. We used the driver examples/amgx_capi.c and the configuration src/configs/FGMRES_AGGREGATION.json, which appears to be the most reasonable for the diverse matrices and application areas at hand. This setting uses aggregation-based AMG rather than classical AMG, which is also consistent with the preference expressed by PyAMG.
Here are some findings.
AmgX is less robust than PyAMG. Table 3 shows that AmgX causes more failures during both the construction and the solution phase. In fact, there were three cases where AmgX hung, which hampered automated benchmarking.
Table 3: Failures of two implementations of AMG.
Despite the lack of robustness, the construction of AmgX is much faster than that of PyAMG. For AmgX, the distribution of the preconditioner construction time has a minimum of 0.006s, a median of 0.017s, and a maximum of 4.895s. In contrast, the distribution for PyAMG has a minimum of 0.005s, a median of 0.286s, and a maximum of 8505.1s. Nevertheless, there are still 16% of the cases where the PyAMG preconditioner is faster to construct than the AmgX preconditioner.
We skip the comparison of the solution time. This is because in AmgX, the Krylov solver and the preconditioner are bundled inside a C implementation. A fair comparison requires isolating the preconditioner from the solver, because our benchmarking framework implements the solver in Python. However, this isolation is beyond our scope due to the complexity of the implementation, such as the data structures and CUDA handling involved. Nevertheless, we expect the solution time of AmgX to be faster than that of PyAMG, based on the observations made for preconditioner construction.
F.5 Changing the Default PyAMG Solver
The blackbox solver of PyAMG, pyamg.blackbox.solver, uses the classical-style smoothed aggregation method pyamg.aggregation.smoothed_aggregation_solver (SA). For nonsymmetric matrices, however, the AIR method works better (Manteuffel et al., 2019; 2018). In this experiment, we modify the blackbox solver by calling pyamg.classical.air_solver for nonsymmetric matrices.
We first tried the default options of AIR but encountered "zero diagonal encountered in Jacobi" errors. These errors were caused by the use of jc_jacobi as the post-smoother. We then replaced the smoothers with the defaults used by SA (gauss_seidel_nr). In this case, the solver hung on the problem FIDAP/ex40. Before this, 105 problems were solved, among which 83 were nonsymmetric. The results for these 83 systems are summarized in Table 4, suggesting that AIR is indeed better, though not always.
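A sketch of the modified solver selection is given below; the keyword names and smoother option format are assumptions about the PyAMG interface rather than a verified reproduction of the experiment.

```python
from pyamg.aggregation import smoothed_aggregation_solver
from pyamg.classical import air_solver

def build_solver(A):
    # Use AIR for nonsymmetric matrices and smoothed aggregation otherwise,
    # replacing AIR's default smoothers with gauss_seidel_nr as described above.
    nonsymmetric = (A != A.T).nnz > 0
    if nonsymmetric:
        smoother = ('gauss_seidel_nr', {'sweep': 'symmetric', 'iterations': 2})
        ml = air_solver(A, presmoother=smoother, postsmoother=smoother)
    else:
        ml = smoothed_aggregation_solver(A)
    return ml.aspreconditioner(cycle='V')
```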
Table 4: Counts of problems when comparing AIR with SA for nonsymmetric matrices.
Overall, we conclude that AIR has not exhibited its full potential for nonsymmetric matrices. We speculate that a robust implementation of AIR is challenging, which might be why the authors did not use it for the blackbox solver in the first place.
F.6 Convergence Histories
In Figure 10, we show the convergence histories of the systems appearing in Figure 5, with respect to both iteration count and time. We also plot the training curves. The short solution time of GNP is notable, especially when compared with the solution time of GMRES as a preconditioner. The training loss generally stays at the level of 1e-2 to 1e-3. For some problems (e.g., VanVelzen_std1_Jac3), training does not seem to progress well, but the training loss is already at a satisfactory level in the first place, and hence the preconditioner can still be rather effective. We speculate that this behavior is caused by a favorable architectural inductive bias in the GNN.
Figure 10: Convergence of the linear system solutions and training history of the preconditioners, for the matrices shown in Figure5.
To provide further examples, we show the convergence histories of all systems in the application areas identified in Figure 4 as favoring GNP: chemical process simulation problems (Figures 11–13), economic problems (Figures 14–16), and eigenvalue/model reduction problems (Figures 17–18). On many occasions, when GNP performs the best, it outperforms the second best substantially. It can even decrease the relative residual norm below rtol while other preconditioners barely solve the system.
Figure 11: Convergence of the linear system solutions for chemical process simulation problems (1/3).
Figure 12: Convergence of the linear system solutions for chemical process simulation problems (2/3).
Figure 13: Convergence of the linear system solutions for chemical process simulation problems (3/3).
Figure 14: Convergence of the linear system solutions for economic problems (1/3).
Figure 15: Convergence of the linear system solutions for economic problems (2/3).
Figure 16: Convergence of the linear system solutions for economic problems (3/3).
Figure 17: Convergence of the linear system solutions for eigenvalue/model reduction problems (1/2).
Figure 18: Convergence of the linear system solutions for eigenvalue/model reduction problems (2/2).