Title: Graph Neural Preconditioners for Iterative Solutions of Sparse Linear Systems
URL Source: https://arxiv.org/html/2406.00809
Published Time: Tue, 28 Jan 2025 02:05:01 GMT
Abstract
Preconditioning is at the heart of iterative solutions of large, sparse linear systems of equations in scientific disciplines. Several algebraic approaches, which access no information beyond the matrix itself, are widely studied and used, but ill-conditioned matrices remain very challenging. We take a machine learning approach and propose using graph neural networks as a general-purpose preconditioner. They show attractive performance for many problems and can be used when the mainstream preconditioners perform poorly. Empirical evaluation on over 800 matrices suggests that the construction time of these graph neural preconditioners (GNPs) is more predictable and can be much shorter than that of other widely used ones, such as ILU and AMG, while their execution time is shorter than that of a Krylov method used as the preconditioner, such as in inner-outer GMRES. GNPs have a strong potential for solving large-scale, challenging algebraic problems arising not only from partial differential equations, but also from economics, statistics, graphs, and optimization, to name a few.
1 Introduction
Iterative methods are commonly used to solve large, sparse linear systems of equations $\mathbf{A}\mathbf{x}=\mathbf{b}$. These methods typically build a Krylov subspace, onto which the original system is projected, such that an approximate solution within the subspace is extracted. For example, GMRES (Saad & Schultz, 1986), one of the most popular methods in practice, builds an orthonormal basis $\{\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_m\}$ of the $m$-dimensional Krylov subspace by using the Arnoldi process and defines the approximate solution $\mathbf{x}_m\in\mathbf{x}_0+\operatorname{span}(\{\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_m\})$, such that the residual norm $\|\mathbf{b}-\mathbf{A}\mathbf{x}_m\|_2$ is minimized.
The effectiveness of Krylov methods heavily depends on the conditioning of the matrix $\mathbf{A}$. Hence, designing a good preconditioner is crucial in practice. In some cases (e.g., solving partial differential equations, PDEs), the problem structure provides additional information that aids the development of a preconditioner applicable to this problem or a class of similar problems. For example, multigrid preconditioners are particularly effective for Poisson-like problems with a mesh. In other cases, little information is known beyond the matrix itself. An example is the SuiteSparse matrix collection (Davis & Hu, 2011), which contains thousands of sparse matrices for benchmarking numerical linear algebra algorithms. In these cases, a general-purpose (also called "algebraic") preconditioner is desirable; examples include ILU (Saad, 1994), approximate inverse (Chow & Saad, 1998), and algebraic multigrid (AMG) (Ruge & Stüben, 1987). However, it remains very challenging to design one that performs well universally.
In this work, we propose to use neural networks as a general-purpose preconditioner. Specifically, we consider preconditioned Krylov methods for solving problems in the form $\mathbf{A}\mathbf{M}\mathbf{u}=\mathbf{b}$, where $\mathbf{M}\approx\mathbf{A}^{-1}$ is the preconditioner, $\mathbf{u}$ is the new unknown, and $\mathbf{x}=\mathbf{M}\mathbf{u}$ is the recovered solution. We design a graph neural network (GNN) (Zhou et al., 2020; Wu et al., 2021) to assume the role of $\mathbf{M}$ and develop a training method to learn $\mathbf{M}$ from select $(\mathbf{b},\mathbf{x})$ pairs. This proposal is inspired by the universal approximation property of neural networks (Hornik et al., 1989) and encouraged by their widespread success in artificial intelligence (Bommasani et al., 2021).
Because a neural network is nonlinear in nature, our preconditioner $\mathbf{M}$ is no longer a linear operator. We cannot build a Krylov subspace for $\mathbf{A}\mathbf{M}$, since a Krylov subspace is defined only for linear operators. Hence, we focus on flexible variants of the preconditioned Krylov methods instead. Specifically, we use flexible GMRES (FGMRES) (Saad, 1993), which allows $\mathbf{M}$ to be different in every Arnoldi step. In this framework, all that matters is the subspace from which the approximate solution is extracted. A nonlinear operator $\mathbf{M}$ can also be used to build this subspace, and hence the neural preconditioner is applicable to FGMRES.
We use a graph neural network because of the natural connection between a sparse matrix and the graph adjacency matrix, similar to how AMG interprets the coefficient matrix with a graph. Many GNN architectures access the graph structure through $\mathbf{A}$-multiplications, and they become polynomials of $\mathbf{A}$ when the nonlinearity is omitted (Wu et al., 2019; Chen et al., 2020). This observation connects a GNN as an operator with the polynomial approximation of $\mathbf{A}^{-1}$, the latter of which is an essential ingredient in the convergence theory of Krylov methods. Interestingly, a side benefit of this access pattern of $\mathbf{A}$ is that the GNN is applicable not only to sparse matrices, but also to structured matrices that admit fast $\mathbf{A}$-multiplications (such as hierarchical matrices (Hackbusch, 1999; Hackbusch & Börm, 2002; Chen & Stein, 2023), Toeplitz matrices (Chan & Jin, 2007; Chen et al., 2014), unassembled finite-element matrices (Ciarlet, 2002), and Gauss-Newton matrices (Liu et al., 2023)).
The proposed graph neural preconditioner (GNP) has a few advantages. It performs comparatively much better for ill-conditioned problems, by learning the matrix inverse from data; this mitigates the restrictive modeling of the sparsity pattern (as in ILU and approximate inverse), the insufficient quality of polynomial approximation (as in using Krylov methods as the preconditioner), and the challenge of smoothing over a non-geometric mesh (as in AMG). Additionally, empirical evaluation suggests that the construction time of GNP, inherited from the predictability of neural network training, is more predictable than that of ILU and AMG, and that the execution time of GNP is much shorter than that of GMRES used as a preconditioner, which may be bottlenecked by the orthogonalization in the Arnoldi process.
1.1 Contributions
Our work has a few technical contributions. First, we offer a convergence analysis for FGMRES. Although this method is well known and used in practice, little theoretical work has been done, in part because the tools for analyzing Krylov subspaces are not readily applicable (the subspace is no longer Krylov and we have no isomorphism with the space of polynomials). Instead, our analysis is a posteriori, based on the computed subspace.
Second, we propose an effective approach to training the neural preconditioner; in particular, training data generation. While it is straightforward to train the neural network in an online-learning manner (by preparing streaming, random input-output pairs in the form of $(\mathbf{b},\mathbf{x})$), it leaves open the question of what randomness best suits the purpose. We consider the bottom eigen-subspace of $\mathbf{A}$ when defining the sampling distribution and show the effectiveness of this approach.
Third, we develop a scale-equivariant GNN as the preconditioner. Because of the way the training data are generated, the GNN inputs $\mathbf{b}$ can have vastly different scales; meanwhile, the ground-truth operator $\mathbf{A}^{-1}$ is equivariant to the scaling of the inputs. Hence, to facilitate training, we design the GNN with an inductive bias that obeys scale-equivariance.
Fourth, we adopt a novel evaluation protocol for general-purpose preconditioners. The common practice evaluates a preconditioner on a limited number of problems; on many occasions, these problems come from a certain type of PDE or application, which constitutes only a portion of the use cases. To broadly evaluate a new kind of general-purpose preconditioner and understand its strengths and weaknesses, we propose to test it on a large number of matrices commonly used by the community. To this end, we use the SuiteSparse collection and perform evaluation on a substantial portion of it (all non-SPD, real matrices falling inside a size interval; over 800 matrices from 50 application areas in this work). To streamline the evaluation, we define evaluation criteria, limit the tuning of hyperparameters, and study statistics over the distribution of matrices.
1.2 Related Work
It is important to put the proposed GNP in context. The idea of using a neural network to construct a preconditioner emerged recently. The relevant approaches learn the nonzero entries of the incomplete factors (Li et al., 2023; Häusner et al., 2023) or their correction (Trifonov et al., 2024), or of the approximate inverse (Bånkestad et al., 2024). In these approaches, the neural network does not directly approximate the mapping from $\mathbf{b}$ to $\mathbf{x}$ as ours does; rather, it is used to fill in the nonzeros of a predefined sparsity structure. A drawback of these approaches is that the predefined sparsity imposes a nonzero lower bound on the approximation error of $\mathbf{A}^{-1}$, which (and whose factors) is generally not sparse. Moreover, although the neural network can be trained on a PDE problem (by varying the coefficients, grid sizes, or boundary conditions) and generalizes well to the same problem, it is unlikely that a single network can work well for all problems and matrices encountered in practice.
Approaches wherein the neural network directly approximates the matrix inverse are more akin to physics-informed neural networks (PINNs) (Raissi et al., 2019) and neural operators (NOs) (Rudikov et al., 2024; Kovachki et al., 2022; Li et al., 2020; 2021; Lu et al., 2021). However, the learning of PINNs requires a PDE whose residual forms part of the training loss, whereas NOs consider infinite-dimensional function spaces, with PDEs being the application. In our case of a general-purpose preconditioner, there is no underlying PDE; and not every matrix problem relates to function spaces with additional properties to exploit (e.g., spatial coordinates, smoothness, and decay of the Green's function). For example, one may be interested in finding the commute times between a node and all other nodes in a graph; this problem can be solved by solving a linear system with respect to the graph Laplacian matrix. The only information we exploit is the matrix itself.
2 Method
Let $\mathbf{A}\in\mathbb{R}^{n\times n}$. From now on, $\mathbf{M}$ is an operator $\mathbb{R}^n\to\mathbb{R}^n$ and we write $\mathbf{M}(\mathbf{v})$ to mean applying the operator to $\mathbf{v}\in\mathbb{R}^n$. This notation includes the special case $\mathbf{M}(\mathbf{v})=\mathbf{M}\mathbf{v}$ when $\mathbf{M}$ is a matrix.
2.1 Flexible GMRES
A standard preconditioned Krylov solver solves the linear system $\mathbf{A}\mathbf{M}\mathbf{u}=\mathbf{b}$ by viewing $\mathbf{A}\mathbf{M}$ as the new matrix and building a Krylov subspace, from which an approximate solution $\mathbf{u}_m$ is extracted and $\mathbf{x}_m$ is recovered from $\mathbf{M}\mathbf{u}_m$. It is important to note that a subspace, by definition, is linear. However, applying a nonlinear operator $\mathbf{M}$ to a linear vector space does not result in a vector space spanned by the mapped basis. Hence, the standard convergence theory breaks down for $\mathbf{x}_m=\mathbf{M}(\mathbf{u}_m)$.
We resort to flexible variants of Krylov solvers (Saad, 1993; Notay, 2000; Chen et al., 2016) and focus on flexible GMRES for simplicity. FGMRES was designed so that the matrix $\mathbf{M}$ can change in each Arnoldi step; we extend it to a fixed, but nonlinear, operator $\mathbf{M}$.
Algorithm 1 in Section B summarizes the solver. The algorithm assumes a restart length $m$. Starting from an initial guess $\mathbf{x}_0$ with residual vector $\mathbf{r}_0$ and 2-norm $\beta$, FGMRES runs $m$ Arnoldi steps, resulting in the relation
$$\mathbf{A}\mathbf{Z}_m=\mathbf{V}_m\mathbf{H}_m+h_{m+1,m}\mathbf{v}_{m+1}\mathbf{e}_m^{\top}=\mathbf{V}_{m+1}\overline{\mathbf{H}}_m, \qquad (1)$$
where $\mathbf{V}_m=[\mathbf{v}_1,\ldots,\mathbf{v}_m]\in\mathbb{R}^{n\times m}$ is orthonormal, $\mathbf{Z}_m=[\mathbf{z}_1,\ldots,\mathbf{z}_m]\in\mathbb{R}^{n\times m}$ contains the $m$ columns $\mathbf{z}_j=\mathbf{M}(\mathbf{v}_j)$, $\mathbf{H}_m=[h_{ij}]\in\mathbb{R}^{m\times m}$ is upper Hessenberg, and $\overline{\mathbf{H}}_m=[h_{ij}]\in\mathbb{R}^{(m+1)\times m}$ extends $\mathbf{H}_m$ with an additional row at the bottom. The approximate solution $\mathbf{x}_m$ is computed in the form $\mathbf{x}_m\in\mathbf{x}_0+\operatorname{span}(\{\mathbf{z}_1,\ldots,\mathbf{z}_m\})=\{\mathbf{x}_0+\mathbf{Z}_m\mathbf{y}\}$, such that the residual norm $\|\mathbf{b}-\mathbf{A}\mathbf{x}_m\|_2=\|\beta\mathbf{e}_1-\overline{\mathbf{H}}_m\mathbf{y}\|_2$ is minimized. If the residual norm is sufficiently small, the solution is accepted; otherwise, $\mathbf{x}_m$ becomes the initial guess of the next Arnoldi cycle and the process repeats.
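To make the procedure concrete, the following minimal NumPy sketch (our own simplification of Algorithm 1; it omits breakdown handling and convergence checks) shows one FGMRES restart cycle in which the preconditioner `M` may be an arbitrary, possibly nonlinear, operator such as a neural network:

```python
import numpy as np

def fgmres_cycle(A, M, b, x0, m=10):
    """One FGMRES restart cycle with a (possibly nonlinear) preconditioner M."""
    n = b.shape[0]
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    V = np.zeros((n, m + 1))
    Z = np.zeros((n, m))
    H = np.zeros((m + 1, m))
    V[:, 0] = r0 / beta
    for j in range(m):
        Z[:, j] = M(V[:, j])            # z_j = M(v_j)
        w = A @ Z[:, j]
        for i in range(j + 1):          # modified Gram-Schmidt orthogonalization
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)  # assumes no breakdown (nonzero norm)
        V[:, j + 1] = w / H[j + 1, j]
    # Minimize ||beta*e_1 - Hbar_m y||_2 and form x_m = x_0 + Z_m y.
    rhs = np.zeros(m + 1)
    rhs[0] = beta
    y, *_ = np.linalg.lstsq(H, rhs, rcond=None)
    return x0 + Z @ y
```

A plain (unpreconditioned) run corresponds to `M = lambda v: v`, in which case `Z` coincides with the Krylov basis `V[:, :m]`.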
We provide a convergence analysis below. All proofs are in Section C.
Theorem 1.
Assume that FGMRES is run without restart and without breakdown. We use the tilde notation to denote the counterpart quantities when FGMRES is run with a fixed preconditioner matrix $\widetilde{\mathbf{M}}$. For any $\widetilde{\mathbf{M}}$ such that $\mathbf{A}\widetilde{\mathbf{M}}$ can be diagonalized, as in $\mathbf{X}^{-1}\mathbf{A}\widetilde{\mathbf{M}}\mathbf{X}=\bm{\Lambda}=\operatorname{diag}(\lambda_1,\ldots,\lambda_n)$, the residual $\mathbf{r}_m=\mathbf{b}-\mathbf{A}\mathbf{x}_m$ satisfies
$$\|\mathbf{r}_m\|_2 \leq \kappa_2(\mathbf{X})\,\epsilon^{(m)}(\bm{\Lambda})\,\|\mathbf{r}_0\|_2 + \|\mathbf{Q}_m\mathbf{Q}_m^{\top}-\widetilde{\mathbf{Q}}_m\widetilde{\mathbf{Q}}_m^{\top}\|_2\,\|\mathbf{r}_0\|_2, \qquad (2)$$

where $\kappa_2$ denotes the 2-norm condition number, $\mathbf{Q}_m$ (resp. $\widetilde{\mathbf{Q}}_m$) is the thin-Q factor of $\mathbf{A}\mathbf{Z}_m$ (resp. $\mathbf{A}\widetilde{\mathbf{Z}}_m$), $\mathbb{P}_m$ denotes the space of degree-$m$ polynomials, and

$$\epsilon^{(m)}(\bm{\Lambda})=\min_{p\in\mathbb{P}_m,\; p(0)=1}\ \max_{i=1,\ldots,n}|p(\lambda_i)|.$$
Let us consider the two terms on the right-hand side of (2). The first term, $\kappa_2(\mathbf{X})\,\epsilon^{(m)}(\bm{\Lambda})\,\|\mathbf{r}_0\|_2$, is the standard convergence result of GMRES applied to the matrix $\mathbf{A}\widetilde{\mathbf{M}}$. The factor $\epsilon^{(m)}(\bm{\Lambda})$ is approximately exponential in $m$ when the eigenvalues $\lambda_i$ are located in an ellipse that excludes the origin; see Saad (2003, Corollary 6.33). Furthermore, when the eigenvalues are closer to each other than to the origin, the base of the exponential is close to zero, resulting in rapid convergence. Meanwhile, when $\mathbf{A}\widetilde{\mathbf{M}}$ is close to normal, $\kappa_2(\mathbf{X})$ is close to 1.
What $\mathbf{M}$ affects is the second term, $\|\mathbf{Q}_m\mathbf{Q}_m^{\top}-\widetilde{\mathbf{Q}}_m\widetilde{\mathbf{Q}}_m^{\top}\|_2\,\|\mathbf{r}_0\|_2$. This term does not show convergence on the surface, but one can always find an $\widetilde{\mathbf{M}}$ such that it vanishes, assuming no breakdown. When $\mathbf{M}$ is close to $\mathbf{A}^{-1}$, $\mathbf{A}\mathbf{Z}_m$ is a small perturbation of $\mathbf{V}_m$ by definition. Then, (1) suggests that $\overline{\mathbf{H}}_m$ is close to the identity matrix for all $m$, and so is $\mathbf{H}_n$.
Corollary 2.
Assume that FGMRES is run without restart and without breakdown. On completion, let $\mathbf{H}_n$ be diagonalizable, as in $\mathbf{Y}^{-1}\mathbf{H}_n\mathbf{Y}=\bm{\Sigma}=\operatorname{diag}(\sigma_1,\ldots,\sigma_n)$. Then, the residual norm satisfies

$$\|\mathbf{r}_m\|_2 \leq \kappa_2(\mathbf{Y})\,\epsilon^{(m)}(\bm{\Sigma})\,\|\mathbf{r}_0\|_2 \qquad \text{for all } m. \qquad (3)$$
See Section D for an approximately exponential convergence of $\|\mathbf{r}_m\|_2$ due to $\epsilon^{(m)}(\bm{\Sigma})$.
Note that while our results assume infinite precision, another analysis (Arioli & Duff, 2009) was conducted in finite-precision arithmetic. That analysis uses the machine precision to bound the residual norm for a sufficiently large $m$, asserting backward stability of FGMRES; in contrast, we obtain the convergence rate under infinite precision.
2.2 Training Data Generation
We generate a set of $(\mathbf{b},\mathbf{x})$ pairs to train $\mathbf{M}:\mathbf{b}\to\mathbf{x}$. Unlike neural operators, where data generation is costly (e.g., requiring solving PDEs), which limits the training set size, for linear systems we can sample $\mathbf{x}$ from a certain distribution and obtain $\mathbf{b}=\mathbf{A}\mathbf{x}$ with negligible cost, which creates a streaming training set of $(\mathbf{b},\mathbf{x})$ pairs without size limit.¹

¹ Neural operators learn the mapping from the boundary condition to the solution. Strictly speaking, while $\mathbf{x}$ is the solution, $\mathbf{b}$ is not the boundary condition.
There is no free lunch. One may sample $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_n)$, which leads to $\mathbf{b}\sim\mathcal{N}(\mathbf{0},\mathbf{A}\mathbf{A}^{\top})$. This distribution is skewed toward the dominant eigen-subspace of $\mathbf{A}\mathbf{A}^{\top}$. Hence, using samples from it for training may result in poor performance of the preconditioner when applied to inputs lying close to the bottom eigen-subspace. One may alternatively want the training data $\mathbf{b}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_n)$, which covers all possibilities of the preconditioner input uniformly on a sphere. However, in this case, $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{A}^{-1}\mathbf{A}^{-\top})$ is also spectrally skewed, causing difficulty in network training.
We resort to the Arnoldi process to obtain an approximation of $\mathbf{A}^{-1}\mathbf{A}^{-\top}$, which strikes a balance. Specifically, we run the Arnoldi process for $m$ steps without a preconditioner, which results in a simplification of (1):
$$\mathbf{A}\mathbf{V}_m=\mathbf{V}_m\mathbf{H}_m+h_{m+1,m}\mathbf{v}_{m+1}\mathbf{e}_m^{\top}=\mathbf{V}_{m+1}\overline{\mathbf{H}}_m. \qquad (4)$$
Let the singular value decomposition of $\overline{\mathbf{H}}_m$ be $\mathbf{W}_m\mathbf{S}_m\mathbf{Z}_m^{\top}$, where $\mathbf{W}_m\in\mathbb{R}^{(m+1)\times m}$ and $\mathbf{Z}_m\in\mathbb{R}^{m\times m}$ are orthonormal and $\mathbf{S}_m\in\mathbb{R}^{m\times m}$ is diagonal. We define
$$\mathbf{x}=\mathbf{V}_m\mathbf{Z}_m\mathbf{S}_m^{-1}\bm{\epsilon},\qquad \bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_m). \qquad (5)$$
Then, with simple algebra we obtain that
$$\mathbf{x}\in\operatorname{range}(\mathbf{V}_m),\qquad \mathbf{x}\sim\mathcal{N}(\mathbf{0},\bm{\Sigma}_m^{\mathbf{x}}),\qquad \bm{\Sigma}_m^{\mathbf{x}}=(\mathbf{V}_m\overline{\mathbf{H}}_m^{+})(\mathbf{V}_m\overline{\mathbf{H}}_m^{+})^{\top},$$

$$\mathbf{b}\in\operatorname{range}(\mathbf{V}_{m+1}),\qquad \mathbf{b}\sim\mathcal{N}(\mathbf{0},\bm{\Sigma}_m^{\mathbf{b}}),\qquad \bm{\Sigma}_m^{\mathbf{b}}=(\mathbf{V}_{m+1}\mathbf{W}_m)(\mathbf{V}_{m+1}\mathbf{W}_m)^{\top}.$$
The covariance of $\mathbf{b}$, $\bm{\Sigma}_m^{\mathbf{b}}$, is a projector, because $\mathbf{V}_{m+1}$ and $\mathbf{W}_m$ are orthonormal. As $m$ grows, this covariance tends to the identity $\mathbf{I}_n$. By the theory of the Arnoldi process (Morgan, 2002), $\mathbf{V}_{m+1}$ contains better and better approximations of the extreme eigenvectors as $m$ increases. Hence, $\mathcal{N}(\mathbf{0},\bm{\Sigma}_m^{\mathbf{b}})$ has a high density close to the bottom eigen-subspace of $\mathbf{A}$. Meanwhile, because $\bm{\Sigma}_m^{\mathbf{b}}$ is low-rank, the distribution has zero density along many directions. Therefore, we sample $\mathbf{x}$ from both $\mathcal{N}(\mathbf{0},\bm{\Sigma}_m^{\mathbf{x}})$ and $\mathcal{N}(\mathbf{0},\mathbf{I}_n)$ to form each training batch, so as to cover more important subspace directions.
We note that the use of Gaussian distributions brings the convenience of analyzing the input/output distributions in general terms. Other distributions may be more effective if knowledge of $\mathbf{x}$ exists and can be exploited.
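To make the sampling procedure concrete, the following sketch (a NumPy illustration with our own function names, not the authors' released code) runs the plain Arnoldi process of (4), takes the SVD of $\overline{\mathbf{H}}_m$, and draws a training batch that mixes samples from (5) with standard Gaussian $\mathbf{x}$:

```python
import numpy as np

def arnoldi(A, m, seed=0):
    """Plain Arnoldi process (no preconditioner): A V_m = V_{m+1} Hbar_m."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    v = rng.standard_normal(n)
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):                 # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)        # assumes no breakdown
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

def sample_training_pairs(A, V, H, batch=16, rng=None):
    """Half of the batch follows Eq. (5); the other half uses x ~ N(0, I_n)."""
    if rng is None:
        rng = np.random.default_rng()
    n, m = A.shape[0], H.shape[1]
    W, S, Zt = np.linalg.svd(H, full_matrices=False)   # Hbar_m = W_m S_m Z_m^T
    xs = []
    for _ in range(batch // 2):                        # Eq. (5): x = V_m Z_m S_m^{-1} eps
        eps = rng.standard_normal(m)
        xs.append(V[:, :m] @ (Zt.T @ (eps / S)))
    for _ in range(batch - batch // 2):
        xs.append(rng.standard_normal(n))
    X = np.stack(xs, axis=1)
    B = A @ X                                          # right-hand sides at negligible cost
    return B, X
```

Each column pair `(B[:, k], X[:, k])` is one training example $(\mathbf{b},\mathbf{x})$ with $\mathbf{b}=\mathbf{A}\mathbf{x}$.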
2.3 Scale-Equivariant Graph Neural Preconditioner
We now define the neural network $\mathbf{M}$ that approximates $\mathbf{A}^{-1}$. Naturally, a (sparse) matrix $\mathbf{A}=[a_{ij}]$ admits a graph interpretation, similar to how AMG interprets the coefficient matrix with a graph. Treating $\mathbf{A}$ as the graph adjacency matrix allows us to use GNNs (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Xu et al., 2019) to parameterize the preconditioner. These neural networks perform graph convolutions in a neighborhood of each node, reducing the quadratic pairwise cost by ignoring nodes far away from the neighborhood.
Our neural network is based on the graph convolutional network (GCN) (Kipf & Welling, 2017). A GCN layer is defined as $\text{GCONV}(\mathbf{X})=\operatorname{ReLU}(\widehat{\mathbf{A}}\mathbf{X}\mathbf{W})$, where $\widehat{\mathbf{A}}\in\mathbb{R}^{n\times n}$ is some normalization of $\mathbf{A}$, $\mathbf{X}\in\mathbb{R}^{n\times d_{\text{in}}}$ is the input data, and $\mathbf{W}\in\mathbb{R}^{d_{\text{in}}\times d_{\text{out}}}$ is a learnable parameter.
Figure 1: Our GNN architecture for the preconditioner $\mathbf{M}$. The nonlinearity $\sigma$ is $\operatorname{ReLU}$. The component $s$ denotes the scale-equivariant operator (8).
Our neural network, as illustrated in Figure 1, enhances GCN in a few ways:
1. We normalize $\mathbf{A}$ differently. Specifically, we define
$$\widehat{\mathbf{A}}=\mathbf{A}/\gamma \quad\text{where}\quad \gamma=\min\Big\{\max_i \sum_j |a_{ij}|,\ \max_j \sum_i |a_{ij}|\Big\}. \qquad (6)$$
The factor $\gamma$ is an upper bound on the spectral radius of $\mathbf{A}$, based on the Gershgorin circle theorem (Gerschgorin, 1931). This avoids the division by zero that may occur in the standard GCN normalization, because the nonzeros of $\mathbf{A}$ can be both positive and negative.
2. We add a residual connection to GCONV, to allow stacking a deeper GNN without suffering from the over-smoothing problem (Chen et al., 2020). Specifically, we define a Res-GCONV layer as
$$\text{Res-GCONV}(\mathbf{X})=\operatorname{ReLU}(\mathbf{X}\mathbf{U}+\widehat{\mathbf{A}}\mathbf{X}\mathbf{W}). \qquad (7)$$
Moreover, we let the parameters $\mathbf{U}$ and $\mathbf{W}$ be square matrices of dimension $d\times d$, so that the input/output dimensions do not change across layers. We stack $L$ such layers.
3. We use an MLP encoder (resp. MLP decoder) before (resp. after) the $L$ Res-GCONV layers to lift (resp. project) the dimensions.
Additionally, the main novelty of the architecture is scale-equivariance. Note that the operator to approximate, $\mathbf{A}^{-1}$, is scale-equivariant, but a general neural network is not guaranteed to be so. Because the input $\mathbf{b}$ may have very different scales (due to the sampling approach considered in the preceding subsection), to facilitate training, we design a parameter-free scaling $s$ and back-scaling $s^{-1}$:
$$s(\cdot)=\frac{\sqrt{n}}{\tau}\,\cdot \qquad\text{and}\qquad s^{-1}(\cdot)=\frac{\tau}{\sqrt{n}}\,\cdot\,, \qquad\text{where }\tau=\|\mathbf{b}\|_2. \qquad (8)$$
We apply $s$ at the beginning and $s^{-1}$ at the end of the neural network. Effectively, we restrict the input space of the neural network from the full $\mathbb{R}^n$ to a more controllable space, the sphere $\sqrt{n}\,\mathbb{S}^{n-1}$. Clearly, $s$ guarantees scale-equivariance; i.e., $\mathbf{M}(\alpha\mathbf{b})=\alpha\mathbf{M}(\mathbf{b})$ for any scalar $\alpha\neq 0$.
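As an illustration of the architecture just described, here is a minimal PyTorch sketch (single input vector, no batching; class and argument names are ours) combining the scaling $s$, the MLP encoder/decoder, and $L$ Res-GCONV layers:

```python
import torch
import torch.nn as nn

class ResGConv(nn.Module):
    """Res-GCONV layer, Eq. (7): ReLU(X U + A_hat X W)."""
    def __init__(self, d):
        super().__init__()
        self.U = nn.Linear(d, d, bias=False)
        self.W = nn.Linear(d, d, bias=False)

    def forward(self, A_hat, X):
        return torch.relu(self.U(X) + torch.sparse.mm(A_hat, self.W(X)))

class GNP(nn.Module):
    """Scale-equivariant graph neural preconditioner M(b) ~ A^{-1} b."""
    def __init__(self, A_hat, d=16, hidden=32, num_layers=8):
        super().__init__()
        self.A_hat = A_hat                      # sparse torch tensor, A / gamma as in Eq. (6)
        self.encoder = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, d))
        self.layers = nn.ModuleList([ResGConv(d) for _ in range(num_layers)])
        self.decoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, b):
        n = b.shape[0]
        tau = torch.linalg.norm(b)              # scaling s, Eq. (8)
        X = (b * (n ** 0.5 / tau)).unsqueeze(-1)
        X = self.encoder(X)                     # MLP lifts node features to dimension d
        for layer in self.layers:
            X = layer(self.A_hat, X)
        x = self.decoder(X).squeeze(-1)         # MLP projects back to one value per node
        return x * (tau / n ** 0.5)             # back-scaling s^{-1}
```

The scaling pair implements Eq. (8): the input is mapped onto the sphere $\sqrt{n}\,\mathbb{S}^{n-1}$ before the network, and the output is scaled back afterwards.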
3 Evaluation Methodology and Experiment Setting
Problems. Our evaluation strives to cover a large number of problems from as diverse a set of application areas as possible. To this end, we turn to the SuiteSparse matrix collection (https://sparse.tamu.edu), a widely used benchmark in numerical linear algebra. We select all square, real-valued, non-SPD matrices whose number of rows falls between 1K and 100K and whose number of nonzeros is fewer than 2M. This selection results in 867 matrices from 50 application areas. Some applications involve PDEs (such as computational fluid dynamics), while others come from graph, optimization, economics, and statistics problems.
To facilitate solving the linear systems, we prescale each $\mathbf{A}$ by $\gamma$ defined in (6). This scaling is a form of normalization, which prevents the entries of the matrix from being exceedingly small or large overall. All experiments assume the ground-truth solution $\mathbf{x}=\mathbf{1}$.² FGMRES starts with $\mathbf{x}_0=\mathbf{0}$.

² This is a common practice. Avoid setting $\mathbf{x}$ to be Gaussian random, because this (partially) matches the training data generation, resulting in the inability to test out-of-distribution cases.
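For illustration, the per-matrix setup can be sketched as follows (assuming a matrix downloaded from SuiteSparse in Matrix Market format; the file name is hypothetical):

```python
import numpy as np
from scipy.io import mmread

A = mmread("matrix.mtx").tocsr()     # hypothetical file; any SuiteSparse .mtx works

# Prescale by gamma from Eq. (6): min of the max absolute row sum and column sum.
abs_A = abs(A)
gamma = min(abs_A.sum(axis=1).max(), abs_A.sum(axis=0).max())
A = A / gamma

x_true = np.ones(A.shape[0])         # ground-truth solution x = 1
b = A @ x_true                       # right-hand side
x0 = np.zeros(A.shape[0])            # FGMRES initial guess
```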
Compared methods. We compare GNP with three widely used general-purpose preconditioners (ILU, AMG, and GMRES; using GMRES to precondition GMRES is also called inner-outer GMRES) and one simple-to-implement but weak preconditioner (Jacobi). The endeavor of preconditioning over 800 matrices from different domains limits the choices, but these preconditioners are readily usable with Python software support.
ILU comes from scipy.sparse.linalg.spilu, which in turn comes from SuperLU (Li, 2005; Li & Shao, 2010) and implements the thresholded version of ILU (Saad, 1994). We use the default drop tolerance and fill factor without tuning. AMG comes from PyAMG (Bell et al., 2023). Specifically, we use pyamg.blackbox.solver().aspreconditioner as the preconditioner, with the configuration of the blackbox solver computed from pyamg.blackbox.solver_configuration. GMRES is self-implemented. For (the inner) GMRES, we use 10 iterations and stop it when it reaches a relative residual norm tolerance of 1e-6. For (the outer) FGMRES, the restart cycle is 10.
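For reference, the baseline preconditioners can be set up roughly as follows (a sketch using the library entry points named above; `A` is the prescaled sparse matrix, defaults are left untouched, and exact keyword arguments may differ across library versions):

```python
import scipy.sparse.linalg as spla
import pyamg

# ILU via SuperLU with the default drop tolerance and fill factor.
ilu = spla.spilu(A.tocsc())
M_ilu = spla.LinearOperator(A.shape, ilu.solve)

# AMG via PyAMG's blackbox solver, configured automatically from the matrix.
config = pyamg.blackbox.solver_configuration(A)
M_amg = pyamg.blackbox.solver(config).aspreconditioner()
```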
Note that sophisticated preconditioners require tremendous effort to implement robustly. Hence, we prioritize ones that are more robust and that come with continued support. For example, we also experimented with AmgX (Naumov et al., 2015) but found that, while faster, it throws more errors than PyAMG does. See Section F.4 for details.
Stopping criteria. The stopping criteria of all linear system solutions are rtol = 1e-8 and maxiters = 100. For time comparisons, we additionally run experiments by enabling timeout and disabling maxiters. We set the timeout to be the maximum solution time among all preconditioners when using maxiters = 100 as the stopping criterion.
Evaluation metrics. Evaluating solution quality and comparing different methods over a large number of problems require programmable metrics, but defining these metrics is harder than it appears. Comparing the number of iterations alone is insufficient, because the per-iteration time of each method is different. However, to compare time, one would wait until the solution reaches rtol, which cannot be guaranteed for every problem. Thus, we can use maxiters and timeout to terminate a solution. However, one solution may reach maxiters early (in time) and attain poor accuracy, while another solution may reach timeout very late but attain good accuracy. Which one is better is debatable.
Hence, we propose two novel metrics, depending on the use of the stopping criteria. The first one, area under the relative residual norm curve with respect to iterations, is defined as
$$\text{Iter-AUC}=\sum_{i=0}^{\text{iters}}\big[\log_{10}r_i-\log_{10}\texttt{rtol}\big],\qquad r_i=\|\mathbf{b}-\mathbf{A}\mathbf{x}_i\|_2/\|\mathbf{b}\|_2,$$
where iters is the actual number of iterations when FGMRES stops and $r_i$ is the relative residual norm at iteration $i$. We take the logarithm because residual plots are typically drawn in log scale. This metric compares methods based on the iteration count, taking into account the history (e.g., convergence speed at different stages) rather than the final error alone. It is used when the stopping criteria are rtol and maxiters.
The second metric, area under the relative residual norm curve with respect to time, is defined as
$$\text{Time-AUC}=\int_{0}^{T}\big[\log_{10}r(t)-\log_{10}\texttt{rtol}\big]\,dt\approx\sum_{i=1}^{\text{iters}}\big[\log_{10}r_i-\log_{10}\texttt{rtol}\big]\,(t_i-t_{i-1}),$$
where $T$ is the elapsed time when FGMRES stops, $r(t)$ is the relative residual norm at time $t$, and $t_i$ is the elapsed time at iteration $i$. This metric compares methods based on the solution time, taking into account the history of the errors. It is used when the stopping criteria are rtol and timeout.
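Both metrics can be computed directly from the recorded convergence history; a small sketch (function names are ours) assuming `res_history` holds $r_0,\ldots,r_{\text{iters}}$ and `times` holds the corresponding elapsed times $t_0,\ldots,t_{\text{iters}}$:

```python
import numpy as np

def iter_auc(res_history, rtol=1e-8):
    """Area under the log10 relative-residual curve vs. iteration count."""
    r = np.asarray(res_history)
    return float(np.sum(np.log10(r) - np.log10(rtol)))

def time_auc(res_history, times, rtol=1e-8):
    """Area under the log10 relative-residual curve vs. elapsed time."""
    r = np.asarray(res_history)
    t = np.asarray(times)
    return float(np.sum((np.log10(r[1:]) - np.log10(rtol)) * np.diff(t)))
```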
Hyperparameters. To feasibly train over 800 GNNs, we use the same set of hyperparameters for each of them, without tuning. There is strong potential that the GNN performance can be greatly improved with careful tuning. Our purpose is to understand the general behavior of GNP, particularly its strengths and weaknesses. We use $L=8$ Res-GCONV layers, set the layer input/output dimension to 16, and use 2-layer MLPs with hidden dimension 32 for lifting/projection. We use Adam (Kingma & Ba, 2015) as the optimizer, set the learning rate to 1e-3, and train for 2000 steps with a batch size of 16. We apply neither dropout nor weight decay. Because data are sampled from the same distribution, we use the model at the best training epoch as the preconditioner, without resorting to a validation set.
We use the $\ell_1$ residual norm $\|\mathbf{A}\mathbf{M}(\mathbf{b})-\mathbf{A}\mathbf{x}\|_1$ as the training loss, as is common practice in robust regression. We use $m=40$ Arnoldi steps when sampling the $(\mathbf{x},\mathbf{b})$ pairs according to (5). Among the 16 pairs in a batch, 8 pairs follow (5) and 8 pairs follow $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_n)$.
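Putting the pieces together, a minimal training loop consistent with these hyperparameters might look as follows (a sketch reusing the earlier illustrative helpers `sample_training_pairs` and the `GNP` module; `A_torch` is a sparse torch copy of the prescaled matrix, and `V`, `H` come from the Arnoldi sketch with $m=40$):

```python
import torch

model = GNP(A_torch)                                  # GNP from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_loss, best_state = float("inf"), None
for step in range(2000):
    B, X = sample_training_pairs(A, V, H, batch=16)   # half Eq. (5), half x ~ N(0, I)
    B = torch.as_tensor(B, dtype=torch.float32)
    X = torch.as_tensor(X, dtype=torch.float32)
    loss = 0.0
    for k in range(B.shape[1]):
        pred = model(B[:, k])                         # M(b)
        diff = torch.sparse.mm(A_torch, (pred - X[:, k]).unsqueeze(-1))
        loss = loss + diff.abs().sum()                # l1 loss ||A M(b) - A x||_1
    loss = loss / B.shape[1]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < best_loss:                       # keep the best model as the preconditioner
        best_loss = loss.item()
        best_state = {name: p.detach().clone() for name, p in model.state_dict().items()}
model.load_state_dict(best_state)
```

The trained model can then serve as the nonlinear preconditioner in the FGMRES sketch above, e.g., `M = lambda v: model(torch.as_tensor(v, dtype=torch.float32)).detach().numpy()`.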
Compute environment. Our experiments are conducted on a machine with one Tesla V100 (16GB) GPU, 96 Intel Xeon 2.40GHz cores, and 386GB main memory. All code is implemented in Python with PyTorch. All computations (training and inference) use the GPU whenever supported.
4 Results
We organize the experiment results by key questions of interest.
How does GNP perform compared with traditional preconditioners? We consider three factors: convergence speed, runtime speed, and preconditioner construction cost.
Figure 2: Percentage of problems on which each preconditioner performs the best.
| Percentile | by #iter | by time |
|---|---|---|
| 25% | 1.91e+00 | 1.18e+00 |
| 50% | 6.74e+00 | 2.33e+00 |
| 75% | 6.78e+03 | 8.16e+01 |
| 100% | 1.89e+10 | 1.91e+10 |
Table 1: Distribution of the residual-norm ratio between the second best preconditioner and GNP, when GNP performs the best. Distribution is described by using percentiles.
Figure 3: Preconditioner construction time and solution time (using maxiters to stop). The construction time of Jacobi is negligible and not shown. GMRES does not require construction.
In Figure 2 we show the percentage of problems on which each preconditioner performs the best, with respect to iteration counts and solution time (see the Iter-AUC and Time-AUC metrics defined in the preceding section). GNP performs the best for a substantial portion of the problems in both metrics. In comparison, ILU and AMG perform the best for more problems, while Jacobi, GMRES (as a preconditioner), and no preconditioner perform the best for fewer problems in the time metric.
That ILU and AMG are the best for more problems does not compromise the competitiveness of GNP. For one, they are less robust and sometimes bear a significantly higher construction cost, as will be revealed later. Moreover, GNP can be used when other preconditioners are less satisfactory. In Table 1, we summarize the distribution of the residual-norm ratio between the second-best preconditioner and GNP, when GNP is the best. The ratio means a factor-of-$x$ decrease in the residual norm if GNP is used in place of other preconditioners. Under the Iter-AUC metric, a factor of 6780x decrease is seen at the 75th percentile, and in the best case, a decrease of more than ten orders of magnitude is seen. The behavior is similar under the Time-AUC metric. These significant reductions manifest the usefulness of GNP.
In Figure 3 and Figure 8 of Section F.2, we plot the preconditioner construction time and solution time for each matrix and each preconditioner. Here, the solution is terminated by maxiters rather than timeout (so the times differ across matrices). Two observations can be made. First, the training time of GNP is nearly proportional to the matrix size, while the construction time of ILU and AMG is hardly predictable. Even though for many of the problems this time is only a fraction of that of GNP, there exist quite a few cases where it is significantly longer. In the worst case, constructing the AMG preconditioner is more than an order of magnitude more costly than constructing GNP.
Second, there is a clear gap between the construction and solution time for GNP, but no such gap exists for ILU and AMG. One is tempted to compare the overall time, but one preconditioner can be used for any number of right-hand sides, and hence different conclusions may be reached depending on this number. When it is large, the solution time dominates, in which case GNP is faster than ILU, AMG, and GMRES in most cases.
On what problems does GNP perform the best? We expand the granularity of Figure 2 into Figure 4, overlaid with the distributions of the matrices regarding the size, the condition number, and the application area, respectively. One finding is that GNP is particularly useful for ill-conditioned matrices (i.e., those with a condition number $\geq 10^{16}$). ILU is less competitive on these problems, possibly because it focuses on the sparsity structure, whose connection with the spectrum is less clear. Another finding is that GNP is particularly useful in three application areas: chemical process simulation problems, economic problems, and eigenvalue/model reduction problems. These areas constitute many problems in the dataset; both the number and the proportion of problems in which GNP performs the best are substantially high.
On the other hand, the matrix size does not much affect GNP performance. One sees this more clearly in Figure 9 of Section F.3, which is a "proportion" version of Figure 4: the proportion of problems on which GNP performs the best is relatively flat across matrix sizes. Similarly, symmetry does not appear to play a distinguishing role either: GNP performs the best on 7.6% of the symmetric matrices and 17.5% of the nonsymmetric matrices.
Figure 4: Breakdown of best preconditioners with respect to matrix sizes, condition numbers, and application areas. Only the application areas with the largest numbers of problems are shown. The last bar in the middle plot is for condition number $\geq 10^{16}$.
Table 2: Failures of preconditioners (count and proportion).
How robust is GNP? We summarize in Table 2 the number of failures for each preconditioner. GNP and GMRES are highly robust. We see that neural network training does not cause any trouble, which is a practical advantage. In contrast, ILU fails for nearly half of the problems, AMG fails for nearly 8%, and Jacobi fails for over 6%. According to the error log, the common failures of ILU are that "(f)actor is exactly singular" and that "matrix is singular … in file ilu_dpivotL.c", while the common failure of AMG is "array … contain(s) infs or NaNs". Meanwhile, solution failures occur when the residual norm tracked by the QR factorization of the upper Hessenberg $\overline{\mathbf{H}}_m$ fails to match the actual residual norm. While ILU and AMG perform the best for more problems than GNP does, safeguarding their robustness is challenging, rendering GNP more attractive.
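As an illustration of this safeguard, here is a minimal sketch that flags a solution failure by comparing the tracked residual norm against the explicitly computed one. The slack factor and tolerance handling are assumptions for illustration, not the exact check in our benchmarking framework.

```python
import numpy as np

def solution_failed(A, x, b, tracked_resnorm, rtol=1e-8, slack=10.0):
    # Explicitly recompute the residual from the returned solution.
    true_resnorm = np.linalg.norm(b - A @ x)
    # Flag a failure when the explicit residual disagrees with the recurrence
    # (QR-tracked) residual by more than a generous slack factor.
    return true_resnorm > slack * max(tracked_resnorm, rtol * np.linalg.norm(b))
```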
What does the convergence history look like? In Figure 5, we show a few examples where GNP performs the best. Note that in these cases, ILU fails for most of the problems, in either the construction or the solution phase. The examples indicate that the convergence of GNP either tracks that of other preconditioners at a similar rate (Simon/venkat25) or becomes significantly faster. Notably, on VanVelzen/std1_Jac3, GNP uses only four iterations to reach 1e-8, while other preconditioners take 100 iterations and still cannot decrease the relative residual below 1e-2.
Figure 5: Example convergence histories.
We additionally plot the convergence curves with respect to time in Figure 10 of Section F.6. An important finding is that GMRES as a preconditioner generally takes much longer to run than GNP. This is because GMRES is bottlenecked by the orthogonalization process, which overshadows the subtler cost comparison between the many matrix-vector multiplications in GMRES and the fewer matrix-matrix multiplications in GNP.
Are the proposed training-data generation and the scale-equivariance design necessary? In Figure 6, we compare the proposed designs against alternatives, for both the training loss and the relative residual norm. In the comparison of training-data generation, using $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_n)$ leads to lower losses, while using $\mathbf{b}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_n)$ leads to higher losses. This is expected, because the former expects similar outputs from the neural network, which is easier to train, while the latter expects drastically different outputs, which makes the network difficult to train. Using $\mathbf{x}$ from the proposed mixture leads to losses lying in between. This choice leads to the best preconditioning performance: more problems attain lower residual norms (especially near rtol). In other words, the proposed training-data generation strikes the best balance between ease of neural network training and quality of the resulting preconditioner.
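To make the three generation strategies concrete, the following sketch shows how the inputs for a training step could be produced, assuming the network is trained to map $\mathbf{b}$ to an approximation of $\mathbf{A}^{-1}\mathbf{b}$. With $\mathbf{x}$-sampling the pair $(\mathbf{b},\mathbf{x})=(\mathbf{A}\mathbf{x},\mathbf{x})$ is exact without any solve; with $\mathbf{b}$-sampling no exact solution is available, so a residual-style loss on $\mathbf{A}\,M(\mathbf{b})-\mathbf{b}$ would be used instead. The 50/50 alternation used for the mixture here is an assumption for illustration, not the paper's exact recipe.

```python
import numpy as np

def training_step_inputs(A, rng, strategy):
    n = A.shape[0]
    if strategy == "x":
        x = rng.standard_normal(n)      # sample the target solution
        return A @ x, x                 # supervised pair: input b = A x, target x
    if strategy == "b":
        b = rng.standard_normal(n)      # sample the right-hand side
        return b, None                  # no target; train on residual ||A M(b) - b||
    # "mixture": alternate between the two generation modes
    return training_step_inputs(A, rng, "x" if rng.random() < 0.5 else "b")
```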
In the comparison of using the scaling operator $s$ versus not, we see that the training behaviors barely differ between the two choices, but using scaling offers the advantage that more problems attain lower residual norms. These observations corroborate the choices we make in the neural network design and training-data generation.
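One common way to obtain scale equivariance, i.e., $M(\alpha\mathbf{v})=\alpha M(\mathbf{v})$, is to normalize the input before the network and rescale the output afterwards, as in the sketch below; whether this matches the exact form of the scaling operator $s$ is an assumption for illustration.

```python
import numpy as np

def scale_equivariant(net):
    # Wrap an arbitrary vector-to-vector network so that scaling the input by
    # alpha scales the output by alpha as well.
    def M(v):
        s = np.linalg.norm(v)
        if s == 0.0:
            return v
        return s * net(v / s)
    return M
```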
Figure 6: Left: comparison of training data generation; right: comparison of scale-equivariance.
5 Discussions and Conclusions
We have presented a GNN approach to preconditioning Krylov solvers. To our knowledge, this is the first work to use a neural network as an approximation to the matrix inverse without exploiting information beyond the matrix itself (i.e., an algebraic preconditioner). Compared with traditional algebraic preconditioners, this approach is robust, has predictable construction costs, and is the most effective on many problems. We evaluate the approach on more than 800 matrices from 50 application areas, the widest coverage in the literature.
One common skepticism about neural network approaches to preconditioning is that network training is generally costly and, in our case, the network can be used for only one $\mathbf{A}$. We have shown that, to the contrary, our training cost can sometimes be much lower than the construction cost of widely used preconditioners such as ILU and AMG, because network training is more predictable while the other preconditioners suffer from manipulating irregular nonzero patterns. Moreover, a benefit of neural networks is that compute infrastructure and library support are already in place, which facilitates implementation and offers robustness, as opposed to other algebraic preconditioners, whose implementation is challenging and whose robustness is extremely difficult to achieve. Additionally, traditional algebraic preconditioners are constructed for only one $\mathbf{A}$ as well, so they are no more flexible than ours.
While one may hope that a single neural network can solve many, if not all, linear systems (different $\mathbf{A}$'s), as recent approaches (PINN, NO, or learning the incomplete factors) appear to suggest, it is unrealistic to expect a single network to have the capacity to learn the matrix inverse for all matrices. Existing approaches work on a distribution of linear systems for the same problem and generalize within that problem (e.g., by varying the grid size or PDE coefficients). For GNP to achieve the same, we may fine-tune a trained network at a cost lower than training from scratch, or augment the GNN input with extra information (e.g., PDE coefficients). It is unlikely that any approach remains effective beyond the problem it is trained on.
Our work can be extended along many avenues. First, an immediate follow-up is the preconditioning of SPD matrices. While a recent work on PDEs (Rudikov et al., 2024) shows promise for the combined use of NOs and flexible CG (Notay, 2000), it is relatively fragile with respect to large variations of the preconditioner on other problems. We speculate that a split preconditioner can work more robustly and that some form of autoencoder network better serves this purpose.
Second, while GNP is trained for an individual matrix in this work for simplicity, it is possible to extend it to a sequence of evolving matrices, such as in sequential problems or time stepping. Continual learning (Wang et al., 2024), which is concerned with continuously adapting a trained neural network to slowly evolving data distributions, can amortize the initial preconditioner construction cost over more and more linear systems in the sequence.
Third, the GPU memory limits the size of the matrices and the complexity of the neural networks that we can experiment with. Future work can explore multi-GPU and/or distributed training of GNNs (Kaler et al., 2022; 2023) for scaling GNP to even larger matrices. Training on large graphs typically uses neighborhood sampling to mitigate the "neighborhood explosion" problem in GNNs (Hamilton et al., 2017; Chen et al., 2018). Some sampling approaches, such as layer-wise sampling (Chen et al., 2018), are tied to sketching the matrix product $\widehat{\mathbf{A}}\mathbf{X}$ inside the graph convolution layer (7), with certain theoretical guarantees (Chen & Luss, 2018).
Fourth, while we use the same set of hyperparameters for evaluation, nothing prevents problem-specific hyperparameter tuning when one works on an application, especially a challenging one for which no general-purpose preconditioner works sufficiently well. We expect that this paper's results, based on a naive hyperparameter setting, can be quickly improved by the community. Needless to say, improving the neural network architecture can push the success of GNP much further.
Acknowledgments
The author would like to thank eight anonymous reviewers whose comments and suggestions helped significantly improve this paper.
References
- Arioli & Duff (2009) M. Arioli and I. S. Duff. Using FGMRES to obtain backward stability in mixed precision. Electron. Trans. Numer. Anal., 33:31–44, 2009.
- Bell et al. (2023) Nathan Bell, Luke N. Olson, Jacob Schroder, and Ben Southworth. PyAMG: Algebraic multigrid solvers in Python. Journal of Open Source Software, 8(87):5495, 2023.
- Bommasani et al. (2021) Rishi Bommasani et al. On the opportunities and risks of foundation models. Preprint arXiv:2108.07258, 2021.
- Bånkestad et al. (2024) Maria Bånkestad, Jennifer Andersson, Sebastian Mair, and Jens Sjölund. Ising on the graph: Task-specific graph subsampling via the Ising model. Preprint arXiv:2402.10206, 2024.
- Chan & Jin (2007) Raymond Hon-Fu Chan and Xiao-Qing Jin. An Introduction to Iterative Toeplitz Solvers. SIAM, 2007.
- Chen & Luss (2018) Jie Chen and Ronny Luss. Stochastic gradient descent with biased but consistent gradient estimators. Preprint arXiv:1807.11880, 2018.
- Chen & Stein (2023) Jie Chen and Michael L. Stein. Linear-cost covariance functions for Gaussian random fields. Journal of the American Statistical Association, 118(541):147–164, 2023.
- Chen et al. (2014) Jie Chen, Tom L. H. Li, and Mihai Anitescu. A parallel linear solver for multilevel Toeplitz systems with possibly several right-hand sides. Parallel Computing, 40(8):408–424, 2014.
- Chen et al. (2016) Jie Chen, Lois Curfman McInnes, and Hong Zhang. Analysis and practical use of flexible BiCGStab. Journal of Scientific Computing, 68(2):803–825, 2016.
- Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.
- Chen et al. (2020) Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In ICML, 2020.
- Chow & Saad (1998) Edmond Chow and Yousef Saad. Approximate inverse preconditioners via sparse-sparse iterations. SIAM Journal on Scientific Computing, 19(3):995–1023, 1998.
- Ciarlet (2002) Philippe G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 2002.
- Davis & Hu (2011) Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1–25, 2011.
- Gerschgorin (1931) S. Gerschgorin. Über die Abgrenzung der Eigenwerte einer Matrix. Izv. Akad. Nauk. USSR Otd. Fiz.-Mat. Nauk, 6:749–754, 1931.
- Hackbusch (1999) W. Hackbusch. A sparse matrix arithmetic based on $\mathcal{H}$-matrices, part I: Introduction to $\mathcal{H}$-matrices. Computing, 62(2):89–108, 1999.
- Hackbusch & Börm (2002) W. Hackbusch and S. Börm. Data-sparse approximation by adaptive $\mathcal{H}^2$-matrices. Computing, 69(1):1–35, 2002.
- Hamilton et al. (2017) William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
- Häusner et al. (2023) Paul Häusner, Ozan Öktem, and Jens Sjölund. Neural incomplete factorization: learning preconditioners for the conjugate gradient method. Preprint arXiv:2305.16368, 2023.
- Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
- Kaler et al. (2022) Tim Kaler, Nickolas Stathas, Anne Ouyang, Alexandros-Stavros Iliopoulos, Tao B. Schardl, Charles E. Leiserson, and Jie Chen. Accelerating training and inference of graph neural networks with fast sampling and pipelining. In MLSys, 2022.
- Kaler et al. (2023) Tim Kaler, Alexandros-Stavros Iliopoulos, Philip Murzynowski, Tao B. Schardl, Charles E. Leiserson, and Jie Chen. Communication-efficient graph neural networks with probabilistic neighborhood expansion analysis and caching. In MLSys, 2023.
- Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
- Kovachki et al. (2022) Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces. Journal of Machine Learning Research, 23:1–97, 2022.
- Li (2005) Xiaoye S. Li. An overview of SuperLU: Algorithms, implementation, and user interface. ACM Trans. Mathematical Software, 31(3):302–325, 2005.
- Li & Shao (2010) Xiaoye S. Li and Meiyue Shao. A supernodal approach to incomplete LU factorization with partial pivoting. ACM Trans. Mathematical Software, 37(4), 2010.
- Li et al. (2023) Yichen Li, Peter Yichen Chen, Tao Du, and Wojciech Matusik. Learning preconditioner for conjugate gradient PDE solvers. In ICML, 2023.
- Li et al. (2020) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. Preprint arXiv:2003.03485, 2020.
- Li et al. (2021) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In ICLR, 2021.
- Liu et al. (2023) Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. Preprint arXiv:2305.14342, 2023.
- Lu et al. (2021) Lu Lu, Pengzhan Jin, and George Em Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3:218–229, 2021.
- Manteuffel et al. (2018) Thomas A. Manteuffel, John Ruge, and Ben S. Southworth. Nonsymmetric algebraic multigrid based on local approximate ideal restriction ($\ell$AIR). SIAM Journal on Scientific Computing, 40(6):A4105–A4130, 2018.
- Manteuffel et al. (2019) Thomas A. Manteuffel, Steffen Münzenmaier, John Ruge, and Ben Southworth. Nonsymmetric reduction-based algebraic multigrid. SIAM Journal on Scientific Computing, 41(5):S242–S268, 2019.
- Morgan (2002) Ronald B. Morgan. GMRES with deflated restarting. SIAM Journal on Scientific Computing, 24(1):20–37, 2002.
- Naumov et al. (2015) M. Naumov, M. Arsaev, P. Castonguay, J. Cohen, J. Demouth, J. Eaton, S. Layton, N. Markovskiy, I. Reguly, N. Sakharnykh, V. Sellappan, and R. Strzodka. AmgX: A library for GPU accelerated algebraic multigrid and preconditioned iterative methods. SIAM Journal on Scientific Computing, 37(5):S602–S626, 2015.
- Notay (2000) Yvan Notay. Flexible conjugate gradients. SIAM Journal on Scientific Computing, 22(4):1444–1460, 2000.
- Raissi et al. (2019) M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
- Rudikov et al. (2024) Alexander Rudikov, Vladimir Fanaskov, Ekaterina Muravleva, Yuri M. Laevsky, and Ivan Oseledets. Neural operators meet conjugate gradients: The FCG-NO method for efficient PDE solving. In ICML, 2024.
- Ruge & Stüben (1987) J. W. Ruge and K. Stüben. Algebraic multigrid. In Stephen F. McCormick (ed.), Multigrid Methods, Frontiers in Applied Mathematics, chapter 4. SIAM, 1987.
- Saad (1994) Y. Saad. ILUT: A dual threshold incomplete LU factorization. Numerical Linear Algebra with Applications, 1(4):387–402, 1994.
- Saad (1993) Youcef Saad. A flexible inner-outer preconditioned GMRES algorithm. SIAM Journal on Scientific Computing, 14(2):461–469, 1993.
- Saad & Schultz (1986) Youcef Saad and Martin H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7(3):856–869, 1986.
- Saad (2003) Yousef Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, second edition, 2003.
- Trifonov et al. (2024) Vladislav Trifonov, Alexander Rudikov, Oleg Iliev, Ivan Oseledets, and Ekaterina Muravleva. Learning from linear algebra: A graph neural network approach to preconditioner design for conjugate gradient solvers. Preprint arXiv:2405.15557, 2024.
- Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
- Wang et al. (2024) Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024.
- Wu et al. (2019) Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr., Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. In ICML, 2019.
- Wu et al. (2021) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2021.
- Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.
- Zhou et al. (2020) Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020.
Appendix A Supporting Code
Appendix B FGMRES
FGMRES is summarized in Algorithm 1.
Algorithm 1 FGMRES with $\mathbf{M}$ being a nonlinear operator
1: Let $\mathbf{x}_0$ be given. Define $\overline{\mathbf{H}}_m \in \mathbb{R}^{(m+1)\times m}$ and initialize all its entries $h_{ij}$ to zero
2: loop until maxiters is reached
3: Compute $\mathbf{r}_0 = \mathbf{b} - \mathbf{A}\mathbf{x}_0$, $\beta = \|\mathbf{r}_0\|_2$, and $\mathbf{v}_1 = \mathbf{r}_0/\beta$
4: for $j = 1, \ldots, m$ do
5: Compute $\mathbf{z}_j = \mathbf{M}(\mathbf{v}_j)$ and $\mathbf{w} = \mathbf{A}\mathbf{z}_j$
6: for $i = 1, \ldots, j$ do
7: Compute $h_{ij} = \mathbf{w}^{\top}\mathbf{v}_i$ and $\mathbf{w} \leftarrow \mathbf{w} - h_{ij}\mathbf{v}_i$
8: end for
9: Compute $h_{j+1,j} = \|\mathbf{w}\|_2$ and $\mathbf{v}_{j+1} = \mathbf{w}/h_{j+1,j}$
10: end for
11: Define $\mathbf{Z}_m = [\mathbf{z}_1, \ldots, \mathbf{z}_m]$ and compute $\mathbf{x}_m = \mathbf{x}_0 + \mathbf{Z}_m\mathbf{y}_m$, where $\mathbf{y}_m = \operatorname{argmin}_{\mathbf{y}} \|\beta\mathbf{e}_1 - \overline{\mathbf{H}}_m\mathbf{y}\|_2$
12: If $\|\mathbf{b} - \mathbf{A}\mathbf{x}_m\|_2 < \texttt{tol}$, exit the loop; otherwise, set $\mathbf{x}_0 \leftarrow \mathbf{x}_m$
13: end loop
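For readers who prefer code, below is a minimal Python sketch of Algorithm 1, assuming $\mathbf{A}$ is a SciPy sparse matrix and the preconditioning operator M is an arbitrary callable (for GNP, a forward pass of the trained network). The least-squares solve for $\mathbf{y}_m$ uses a dense solver instead of the incremental QR update used in practice; the restart length, maxiters, tol, and the Jacobi operator in the usage example are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def fgmres(A, b, M, m=30, maxiters=10, tol=1e-8):
    n = b.shape[0]
    x0 = np.zeros(n)
    for _ in range(maxiters):
        r0 = b - A @ x0
        beta = np.linalg.norm(r0)
        if beta < tol:
            break
        H = np.zeros((m + 1, m))      # upper Hessenberg \bar{H}_m
        V = np.zeros((n, m + 1))      # Arnoldi basis v_1, ..., v_{m+1}
        Z = np.zeros((n, m))          # preconditioned directions z_j = M(v_j)
        V[:, 0] = r0 / beta
        k = m
        for j in range(m):
            Z[:, j] = M(V[:, j])
            w = A @ Z[:, j]
            for i in range(j + 1):    # modified Gram-Schmidt orthogonalization
                H[i, j] = w @ V[:, i]
                w -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] < 1e-14:   # lucky breakdown
                k = j + 1
                break
            V[:, j + 1] = w / H[j + 1, j]
        e1 = np.zeros(k + 1)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)
        x0 = x0 + Z[:, :k] @ y
        if np.linalg.norm(b - A @ x0) < tol:
            break
    return x0

# Usage sketch: Jacobi preconditioning expressed as a (here, linear) operator M.
A = sp.random(200, 200, density=0.05, format='csr') + 10 * sp.eye(200)
b = np.random.default_rng(0).standard_normal(200)
M = lambda v: v / A.diagonal()
x = fgmres(A, b, M)
print(np.linalg.norm(b - A @ x))
```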
Appendix C Proofs
Proof of Theorem 1.
Write $\|\mathbf{r}_m\|_2 \leq \|\widetilde{\mathbf{r}}_m\|_2 + \|\mathbf{r}_m - \widetilde{\mathbf{r}}_m\|_2$. It is well known that
$$\|\widetilde{\mathbf{r}}_m\|_2 \leq \kappa_2(\mathbf{X})\,\epsilon^{(m)}(\bm{\Lambda})\,\|\mathbf{r}_0\|_2;$$
see, e.g., Saad (2003, Proposition 6.32). On the other hand, because $\mathbf{r}_m = \mathbf{r}_0 - \mathbf{A}\mathbf{Z}_m\mathbf{y}_m$ where $\mathbf{y}_m$ minimizes $\|\mathbf{r}_0 - \mathbf{A}\mathbf{Z}_m\mathbf{y}\|_2$, we have $\mathbf{r}_m = \mathbf{r}_0 - (\mathbf{A}\mathbf{Z}_m)(\mathbf{A}\mathbf{Z}_m)^{+}\mathbf{r}_0 = \mathbf{r}_0 - \mathbf{Q}_m\mathbf{Q}_m^{\top}\mathbf{r}_0$. Therefore, $\|\mathbf{r}_m - \widetilde{\mathbf{r}}_m\|_2 = \|\mathbf{Q}_m\mathbf{Q}_m^{\top}\mathbf{r}_0 - \widetilde{\mathbf{Q}}_m\widetilde{\mathbf{Q}}_m^{\top}\mathbf{r}_0\|_2 \leq \|\mathbf{Q}_m\mathbf{Q}_m^{\top} - \widetilde{\mathbf{Q}}_m\widetilde{\mathbf{Q}}_m^{\top}\|_2\,\|\mathbf{r}_0\|_2$. ∎
Proof of Corollary 2.
By taking $\widetilde{\mathbf{M}} = \mathbf{A}^{-1}\mathbf{V}_n\mathbf{H}_n\mathbf{V}_n^{\top}$ and noting that $\mathbf{V}_n^{\top} = \mathbf{V}_n^{-1}$, we have
$$(\mathbf{V}_n\mathbf{Y})^{-1}\mathbf{A}\widetilde{\mathbf{M}}(\mathbf{V}_n\mathbf{Y}) = \operatorname{diag}(\sigma_1,\ldots,\sigma_n).$$
Then, following Theorem 1, $\|\mathbf{r}_m\|_2 \leq \kappa_2(\mathbf{V}_n\mathbf{Y})\,\epsilon^{(m)}(\bm{\Sigma})\,\|\mathbf{r}_0\|_2$. We conclude the proof by noting that $\kappa_2(\mathbf{V}_n\mathbf{Y}) = \kappa_2(\mathbf{Y})$. ∎
Appendix D Approximately Exponential Convergence of $\|\mathbf{r}_m\|_2$
We could bound the minimax polynomial $\epsilon^{(m)}$ by using Chebyshev polynomials to obtain an (approximately) exponential convergence of the residual norm $\|\mathbf{r}_m\|_2$ for large $m$.
Assume that all the eigenvalues $\sigma_i$ of $\bm{\Sigma}$ are enclosed by an ellipse $E(c,d,a)$ which excludes the origin, where $c$ is the center, $d$ is the focal distance, and $a$ is the major semi-axis. We have
$$\epsilon^{(m)}(\bm{\Sigma}) \leq \min_{p\in\mathbb{P}_m,\, p(0)=1}\ \max_{\sigma\in E(c,d,a)} |p(\sigma)|,$$
by the fact that the maximum modulus of a complex analytic function is attained on the boundary of the domain. Then, we invoke Eqn. (6.119) of Saad (2003) and obtain
$$\epsilon^{(m)}(\bm{\Sigma}) \leq \frac{C_m(a/d)}{C_m(c/d)},$$
where $C_m$ denotes the (complex) Chebyshev polynomial of degree $m$. When $m$ is large,
$$\frac{C_m(a/d)}{C_m(c/d)} \approx \left(\frac{a+\sqrt{a^2-d^2}}{c+\sqrt{c^2-d^2}}\right)^m,$$
which gives an (approximately) exponential function in $m$. Applying Corollary 2, we conclude that when $m$ is large,
$$\|\mathbf{r}_m\|_2 \lessapprox \kappa_2(\mathbf{Y})\,\|\mathbf{r}_0\|_2\left(\frac{a+\sqrt{a^2-d^2}}{c+\sqrt{c^2-d^2}}\right)^m.$$
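For intuition, the asymptotic convergence factor in this bound can be evaluated numerically, as in the sketch below; the ellipse parameters are illustrative, not derived from any matrix in the dataset.

```python
import numpy as np

def convergence_factor(c, d, a):
    # rho = (a + sqrt(a^2 - d^2)) / (c + sqrt(c^2 - d^2)); the complex square
    # root handles ellipses whose major axis is vertical (d^2 > a^2 or c^2).
    num = a + np.sqrt(complex(a**2 - d**2))
    den = c + np.sqrt(complex(c**2 - d**2))
    return abs(num / den)

rho = convergence_factor(c=1.0, d=0.5, a=0.8)
print(rho)        # per-iteration reduction factor (< 1 in this example)
print(rho ** 50)  # approximate residual reduction after m = 50 iterations
```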
Appendix E Scalability
Based on the architecture outlined in Figure 1, the computational cost of GNP is dominated by matrix-matrix multiplications, where the left matrix is either $\mathbf{A}$ (e.g., in GCONV) or a tall matrix with $n$ rows (e.g., in the MLP and Lin layers and in GCONV). This forward analysis holds for both training and inference, because backpropagation uses the same matrix-matrix multiplications. Hence, the computational cost of GNP scales as $O(\operatorname{nz}(\mathbf{A}))$, the same as that of the Krylov solver.
Practical considerations, on the other hand, are more sensitive to the constants hidden inside the big-O notation. For example, one needs to store many (but still a constant number of) vectors of length $n$, due to the batch size, the hidden dimensions of the neural network, and automatic differentiation. Denote by $c$ this constant; then, the GPU faces a storage pressure of $\operatorname{nz}(\mathbf{A}) + cn$. Once this amount exceeds the memory capacity of the GPU, one either performs training (and even inference) using the CPU only, offloads some data to the CPU and transfers it back to the GPU on demand, or undertakes more laborious engineering by distributing $\mathbf{A}$ and other data across multiple GPUs. For the last option (multi-GPU and/or distributed training), the GNN training literature brings in additional techniques (such as sampling and mini-batching the graph) to accelerate the computation (Kaler et al., 2022; 2023).
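A back-of-the-envelope sketch of this storage pressure, assuming double-precision CSR storage for $\mathbf{A}$ and an illustrative constant $c$, may help when deciding among these options.

```python
def gnp_memory_gb(nnz, n, c=64, bytes_per_value=8):
    # CSR needs a value and a 4-byte column index per nonzero, plus n+1 row pointers.
    a_bytes = nnz * (bytes_per_value + 4) + (n + 1) * 4
    # c length-n work vectors (batch, hidden dimensions, autodiff buffers).
    vec_bytes = c * n * bytes_per_value
    return (a_bytes + vec_bytes) / 1e9

# e.g., a matrix with 10 million rows and 100 million nonzeros
print(gnp_memory_gb(nnz=100_000_000, n=10_000_000))
```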
Appendix F Additional Experiment Results
F.1 Additional Metrics for Comparing Preconditioners
In addition to the Iter-AUC and Time-AUC metrics, one may be interested in using the final relative residual norm to compare the preconditioners. To this end, we summarize the results under this residual metric in Figure 7, as an extension of Figure 2. Note that for each problem, we solve the linear system twice: once using rtol and maxiters as the stopping criteria, and once using rtol and timeout. Hence, there are two charts in Figure 7, depending on how the iterations are terminated.
Figure 7: Percentage of problems on which each preconditioner performs the best.
We see that the observations are similar across these two termination scenarios. Compared to the Iter-AUC and Time-AUC metrics, GNP is the best for a similar portion of problems under the residual metric. The portion of problems on which ILU and GMRES (as a preconditioner) perform the best increases under this metric, while the portion on which AMG performs the best decreases. Jacobi and no preconditioner remain less competitive.
F.2 Preconditioner Construction Time and Solution Time
In Figure 8, we plot the preconditioner construction time and solution time for each matrix and each preconditioner. This figure is a counterpart of Figure 3 with the log scale removed from the y-axis. One sees that the GNP construction time is nearly proportional to the matrix size.
Figure 8: Preconditioner construction time and solution time (using maxiters to stop). The y-axis is in linear scale.
F.3 Breakdown of Best Preconditioners With Respect to Matrix Distributions
In Figure 9, we plot the proportion of problems on which each preconditioner performs the best, for different matrix sizes, condition numbers, and application areas. This figure is a counterpart of Figure 4. The main observation is that the proportion of problems on which GNP performs the best is relatively flat across matrix sizes. Other observations follow closely from those of Figure 4; namely, GNP is particularly useful for ill-conditioned problems and for certain application areas (chemical process simulation problems, economic problems, and eigenvalue/model reduction problems).
Figure 9: Proportion of problems on which each preconditioner performs the best, for different matrix sizes, condition numbers, and application areas. Only the application areas with the largest numbers of problems are shown. The last bar in the middle plot is for condition number $\geq 10^{16}$. Note that not every matrix has a known condition number.
F.4 Comparison between PyAMG and AmgX
Besides PyAMG, another notable implementation of AMG (which supports use as a preconditioner and supports GPUs) is AmgX (https://github.com/NVIDIA/AMGX). Here, we compare the two implementations in terms of robustness and speed.
Unlike PyAMG, which is a blackbox, AmgX leaves the choice of the AMG method and the parameters to the user. We used the driver examples/amgx_capi.c and the configuration src/configs/FGMRES_AGGREGATION.json, which appears to be the most reasonable for the diverse matrices and application areas at hand. This setting uses aggregation-based AMG rather than classical AMG, which is also consistent with the preference expressed by PyAMG.
Here are some findings.
AmgX is less robust than PyAMG. Table 3 shows that AmgX causes more failures during both the construction and the solution phase. In fact, there were three cases where AmgX hung, which hampered automated benchmarking.
Table 3: Failures of two implementations of AMG.
Despite the lack of robustness, the construction of AmgX is much faster than that of PyAMG. For AmgX, the distribution of the preconditioner construction time has a minimum of 0.006s, a median of 0.017s, and a maximum of 4.895s. In contrast, the distribution for PyAMG has a minimum of 0.005s, a median of 0.286s, and a maximum of 8505.1s. Nevertheless, there are still 16% of the cases where the PyAMG preconditioner is faster to construct than the AmgX preconditioner.
We skip the comparison of the solution time. This is because in AmgX, the Krylov solver and the preconditioner are bundled inside a C implementation. A fair comparison requires isolating the preconditioner from the solver, because our benchmarking framework implements the solver in Python. However, this isolation is beyond our scope due to the complexity of the implementation, such as the data structures and CUDA handling involved. Nevertheless, we expect the solution time of AmgX to be faster than that of PyAMG, based on the observations made for preconditioner construction.
F.5 Changing the Default PyAMG Solver
The blackbox solver of PyAMG, pyamg.blackbox.solver, uses the classical-style smoothed aggregation method pyamg.aggregation.smoothed_aggregation_solver (SA). For nonsymmetric matrices, however, the AIR method works better (Manteuffel et al., 2019; 2018). In this experiment, we modify the blackbox solver by calling pyamg.classical.air_solver for nonsymmetric matrices.
We first tried the default options of AIR but encountered "zero diagonal encountered in Jacobi" errors. These errors were caused by the use of jc_jacobi as the post-smoother. We then replaced the smoothers with the defaults used by SA (gauss_seidel_nr). In this case, the solver hung on the problem FIDAP/ex40. Before this, 105 problems were solved, among which 83 were nonsymmetric. The results for these 83 systems are summarized in Table 4, suggesting that AIR is indeed better, though not always.
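A sketch of the modified solver selection is given below; the keyword names and smoother option format are assumptions about the PyAMG interface rather than a verified reproduction of the experiment.

```python
from pyamg.aggregation import smoothed_aggregation_solver
from pyamg.classical import air_solver

def build_solver(A):
    # Use AIR for nonsymmetric matrices and smoothed aggregation otherwise,
    # replacing AIR's default smoothers with gauss_seidel_nr as described above.
    nonsymmetric = (A != A.T).nnz > 0
    if nonsymmetric:
        smoother = ('gauss_seidel_nr', {'sweep': 'symmetric', 'iterations': 2})
        ml = air_solver(A, presmoother=smoother, postsmoother=smoother)
    else:
        ml = smoothed_aggregation_solver(A)
    return ml.aspreconditioner(cycle='V')
```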
Table 4: Counts of problems when comparing AIR with SA for nonsymmetric matrices.
Overall, we conclude that AIR has not exhibited its full potential for nonsymmetric matrices. We speculate that a robust implementation of AIR is challenging, which might be why the authors did not use it for the blackbox solver in the first place.
F.6 Convergence Histories
In Figure 10, we show the convergence histories of the systems appearing in Figure 5, with respect to both iteration count and time. We also plot the training curves. The short solution time of GNP is notable, especially when compared with the solution time of GMRES as a preconditioner. The training loss generally stays at the level of 1e-2 to 1e-3. For some problems (e.g., VanVelzen_std1_Jac3), training does not seem to progress well, but the training loss is already at a satisfactory level in the first place, and hence the preconditioner can still be rather effective. We speculate that this behavior is caused by a favorable architectural inductive bias in the GNN.
Figure 10: Convergence of the linear system solutions and training history of the preconditioners, for the matrices shown in Figure5.
To provide further examples, we show the convergence histories of all systems in the application areas identified in Figure 4 as favoring GNP: chemical process simulation problems (Figures 11–13), economic problems (Figures 14–16), and eigenvalue/model reduction problems (Figures 17–18). On many occasions, when GNP performs the best, it outperforms the second best substantially. It can even decrease the relative residual norm below rtol while other preconditioners barely solve the system.
Figure 11: Convergence of the linear system solutions for chemical process simulation problems (1/3).
Figure 12: Convergence of the linear system solutions for chemical process simulation problems (2/3).
Figure 13: Convergence of the linear system solutions for chemical process simulation problems (3/3).
Figure 14: Convergence of the linear system solutions for economic problems (1/3).
Figure 15: Convergence of the linear system solutions for economic problems (2/3).
Figure 16: Convergence of the linear system solutions for economic problems (3/3).
Figure 17: Convergence of the linear system solutions for eigenvalue/model reduction problems (1/2).
Figure 18: Convergence of the linear system solutions for eigenvalue/model reduction problems (2/2).