
Title: Tight Certification of Adversarially Trained Neural Networks via Nonconvex Low-Rank Semidefinite Relaxations

URL Source: https://arxiv.org/html/2211.17244

Tight Certification of Adversarially Trained Neural Networks via Nonconvex Low-Rank Semidefinite Relaxations

Hong-Ming Chiu, Richard Y. Zhang

Abstract

Adversarial training is well-known to produce high-quality neural network models that are empirically robust against adversarial perturbations. Nevertheless, once a model has been adversarially trained, one often desires a certification that the model is truly robust against all future attacks. Unfortunately, when faced with adversarially trained models, all existing approaches have significant trouble making certifications that are strong enough to be practically useful. Linear programming (LP) techniques in particular face a “convex relaxation barrier” that prevent them from making high-quality certifications, even after refinement with mixed-integer linear programming (MILP) and branch-and-bound (BnB) techniques. In this paper, we propose a nonconvex certification technique, based on a low-rank restriction of a semidefinite programming (SDP) relaxation. The nonconvex relaxation makes strong certifications comparable to much more expensive SDP methods, while optimizing over dramatically fewer variables comparable to much weaker LP methods. Despite nonconvexity, we show how off-the-shelf local optimization algorithms can be used to achieve and to certify global optimality in polynomial time. Our experiments find that the nonconvex relaxation almost completely closes the gap towards exact certification of adversarially trained models.

Machine Learning

1 Introduction

Figure 1: Certified adversarial training vs. $\ell_2$ attacks. While PGD attacks indicate an empirical upper-bound of 77.4% (red), state-of-the-art LP-based verifiers face a "convex relaxation barrier" (pink shading) that prevents them from certifying lower-bounds better than 20.9% (blue). Even an award-winning state-of-the-art branch-and-bound verifier like $\alpha,\beta$-CROWN cannot significantly improve past 67.2% in reasonable time. Our nonconvex relaxation overcomes the convex relaxation barrier, certifying a lower-bound of 76.2% that almost fully closes the empirical-certified gap. (See Section 5 for details.)

To make neural network models robust to adversarial perturbation attacks, one popular strategy, known as adversarial training (Kurakin et al., 2016; Goodfellow et al., 2015; Madry et al., 2018; Shafahi et al., 2019; Wong et al., 2019), is to attack a pre-trained model, and then to re-train the model with the training set augmented or replaced by the attack. Despite its simplicity, adversarial training works remarkably well in practice. For example, the robust models adversarially trained by Madry et al. (2018) back in 2017 remain essentially unbroken in 2022, after more than four years of white-box penetration testing by researchers worldwide. Later, Shafahi et al. (2019) and Wong et al. (2019) extended the idea to train robust ImageNet classifiers that achieve a level of accuracy similar to nonrobust classifiers, within a comparable amount of training time.

Nevertheless, adversarial training is an empirical strategy that does not promise a truly robust model. Given a model that has been made empirically robust through adversarial training, one often desires a formal mathematical proof or certification that the model is truly robust against all future attacks. Unfortunately, when faced with an adversarially trained model, all existing approaches have significant trouble making certifications that are strong enough to be practically useful. A model that achieves 77.4% test accuracy on adversarial inputs might only have a certified robust accuracy of 20.9%, using state-of-the-art methods (Weng et al., 2018a; Zhang et al., 2018b; Salman et al., 2019) based on a linear programming (LP) relaxation (Wong & Kolter, 2018) of the ReLU activation (see Figure 1).

In fact, recent work by Salman et al. (2019) suggests that it is fundamentally impossible for any method based on the LP relaxation to make substantially better certifications for adversarially trained models. Even mixed-integer linear programming (MILP) techniques like branch-and-bound and cutting planes (Tjeng et al., 2019; Xu et al., 2021; Wang et al., 2021), which in theory are capable of exact certification given unlimited time, cannot in practice significantly close the gap left by the LP relaxation within reasonable time. Applying $\alpha,\beta$-CROWN (Zhang et al., 2018a; Wang et al., 2021; Xu et al., 2021; Zhang et al., 2022), the winning entry in the International Verification of Neural Networks Competition (VNN-COMP) of 2021 and 2022, only improves the certified robust accuracy to 67.2% before timing out, still leaving an unsatisfactory optimality gap of 10.2%. There appears to be an insurmountable "convex relaxation barrier" between the high degree of robustness that is empirically observed for adversarially trained models, and the low degree of robustness that can be rigorously certified via the LP relaxation and the mixed-integer programming methods built on top of it.

One promising direction for overcoming the barrier faced by the LP relaxation is to develop methods based on the semidefinite programming (SDP) relaxation of Raghunathan et al. (2018b). Indeed, early experiments (Raghunathan et al., 2018a, b; Zhang, 2020; Dathathri et al., 2020; Batten et al., 2021) all suggest that the SDP relaxation can make significantly stronger certifications than the LP relaxation. However, the cost of solving the SDP relaxation, while technically still polynomial time, is so high as to be completely inaccessible. At its core, the SDP relaxation requires optimizing over an $n\times n$ matrix variable, where $n$ is equal to the total number of ReLU activations, plus the dimension of the input layer. The fundamental difficulty is the need to store and to optimize over $n^2$ variables: even a single-layer MNIST classifier with 200 ReLU activations requires storing and optimizing over the $\approx 10^6$ elements of a $986\times 986$ matrix, which is already nearing the limit of state-of-the-art SDP solvers like MOSEK (2019), whose worst-case runtime scales as $O(n^6)$. Despite widespread speculation, it is currently unknown whether SDP-based methods will truly be able to provide tight certification for adversarially trained models.

Contributions

This paper proposes a nonconvex certification technique, based on a low-rank restriction of the usual convex SDP relaxation of the ReLU activation. Rigorously, we show that the nonconvex relaxation is always at least as tight as the SDP relaxation, yet it optimizes over only the $nr$ elements of an $n\times r$ rectangular matrix. If a small value of $r$ can be used (we later validate this experimentally), then the nonconvex relaxation can make strong certifications comparable to the SDP relaxation, while optimizing over a dramatically smaller number of variables comparable to the LP relaxation.

The nonconvexity of the relaxation poses a serious issue. The correctness of our certification hinges critically on our ability to solve a nonconvex verification problem to global optimality, and then to certify this global optimality, but large-scale optimization algorithms are only capable of achieving and certifying local optimality. In this paper, we establish that if a local minimum of the verification problem is also global, then under a mild constraint qualification, the Lagrange multipliers that certify its local optimality are also guaranteed to certify its global optimality via Lagrangian duality (see Proposition 3.1 and Theorem 4.3). Conversely, if the local minimum is non-global, then under the same mild constraint qualification, the Lagrange multipliers generate a direction of global improvement (see Theorem 4.4), thereby ensuring that local optimization will eventually achieve certified global optimality.

Our experiments provide empirical confirmation of our theoretical claims. We re-examine the models originally used by Salman et al. (2019) to demonstrate the existence of a "convex relaxation barrier" for the LP relaxation. Using a relaxation rank $r$ of no more than 10, we re-certify these models using our nonconvex relaxation, in time comparable to that of the best-possible LP relaxation. Our results find that even a basic nonconvex relaxation offers a significant reduction in conservatism. Augmenting the nonconvex relaxation with bound propagation (as is commonly done for the LP relaxation) allows us to almost fully close the gap towards exact certification (see Figure 1).

Related work

Robustness certification methods can be broadly divided into exact and conservative methods. Exact methods based on mixed-integer linear programming (MILP) (Tjeng et al., 2019; Xu et al., 2021) and Satisfiability Modulo Theories (SMT) (Katz et al., 2017) can make necessary and sufficient certifications of robustness, but have worst-case runtimes that scale exponentially with the number of activations. Conservative methods can decline to certify a robust model, but have polynomial worst-case runtime, and therefore tend to be much more scalable in practice. Today, most state-of-the-art certification methods are conservative methods based on a triangle-shaped LP relaxation of the ReLU activation function introduced by Wong & Kolter (2018). In particular, Weng et al. (2018b, a) and Zhang et al. (2018a) proposed techniques for strengthening these LP-based relaxations by propagating tighter layer-wise upper- and lower-bounds on the ReLU activation function. Later work by Wang et al. (2021) and Xu et al. (2021) progressively refines these bounds using MILP and BnB techniques. Salman et al. (2019) proposed an optimal LP relaxation that unifies all the existing bound-propagating LP-based relaxation methods. This last paper pointed out that even the optimal LP relaxation has a gap that cannot be improved; they refer to this inherent looseness as the "convex relaxation barrier". Another line of conservative methods is based on SDP relaxations of the ReLU gate (Raghunathan et al., 2018b; Dathathri et al., 2020). Later work by Batten et al. (2021) further tightens the SDP relaxation using linear cut constraints.

Our proposed approach can be interpreted as an application of the Burer–Monteiro approach (Burer et al., 2002; Burer & Monteiro, 2005) and the Riemannian staircase (Boumal et al., 2016, 2020) for solving the SDP relaxation of Raghunathan et al. (2018b). Here, we emphasize that the rigorous applicability of these prior techniques hinges critically on the linear independence constraint qualification (LICQ), a highly restrictive condition that is difficult to verify in practice. If LICQ does not hold, then the Riemannian staircase can get stuck at a non-LICQ point, so rigorous global guarantees are lost. In the existing literature, LICQ is often taken as a strong blanket assumption (Rosen et al., 2014; Carlone et al., 2015; Cohen et al., 2019; Rosen et al., 2019), but this reduces the Riemannian staircase from a provable algorithm to an empirical heuristic. In this paper, we formally establish LICQ for the verification problem in Lemma 4.2. Assuming that local optimization does not get stuck at the "corner" of the ReLU (the same assumption that allows ReLU models to be trained via gradient descent), it immediately follows that our nonconvex relaxation can be globally optimized in polynomial time.

2 Background

Notations

We use $(x_1,\dots,x_\ell)$ to denote the vertical concatenation of $x_1,\dots,x_\ell$, with $x_k$ stacked on top of $x_{k+1}$. We use $(\{x_k\}_{k=1}^{\ell})$ as a shorthand notation for $(x_1,\dots,x_\ell)$. We use the square brackets $x[i]$ and $X[i,j]$ to denote indexing. The size-$n$ identity matrix is written $I_n$; we suppress the subscript $n$ whenever it can be inferred from context. We write $\mathbf{e}_i$ to denote the $i$-th canonical basis vector, i.e. the $i$-th column of the appropriate identity matrix. We write $\operatorname{diag}(X)$ to extract the diagonal from the matrix $X$, and $\operatorname{diag}(x)$ to convert a length-$n$ vector into an $n\times n$ diagonal matrix. The $i$-th largest eigenvalue of a matrix $X$ is denoted $\lambda_i(X)$. We use $\odot$ to denote the elementwise product.

Consider the task of classifying a data point $\hat{x}\in\mathbb{R}^p$ as belonging to the $\hat{c}$-th of $q$ classes. The standard approach is to train a classifier model $f:\mathbb{R}^p\to\mathbb{R}^q$ such that the prediction vector $f(\hat{x})$ takes on its maximum value at the $\hat{c}$-th element, as in $f(\hat{x})[\hat{c}] > f(\hat{x})[c]$ for all incorrect labels $c\neq\hat{c}$. In this paper, we focus our attention on $\ell$-layer feedforward ReLU-based neural networks, defined recursively as

$$f(x) \equiv W_\ell x_\ell + b_\ell, \qquad x_{k+1} = \max\{0,\, W_k x_k + b_k\}$$

for $k\in\{1,2,\dots,\ell-1\}$, where $x_1\equiv x$ is the input. Throughout the paper, we will use $n_k$ to denote the number of neurons at the $k$-th layer, and $n = \sum_{k=1}^{\ell} n_k$ to denote the total number of neurons. Note that our convention includes the neurons at the input layer, i.e. $x_1\equiv x$, but excludes those at the output layer, i.e. $f(x)\equiv W_\ell x_\ell + b_\ell$.
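For concreteness, the recursion above can be sketched in a few lines of NumPy; the layer sizes and random weights below are illustrative assumptions, not any model from the paper:

```python
import numpy as np

def forward(x, weights, biases):
    """Evaluate an l-layer feedforward ReLU network f(x).
    weights = [W1, ..., Wl], biases = [b1, ..., bl]; the final layer
    is affine (no ReLU), matching f(x) = Wl @ xl + bl."""
    xk = x  # x1 = x is the input layer
    for Wk, bk in zip(weights[:-1], biases[:-1]):
        xk = np.maximum(0.0, Wk @ xk + bk)  # x_{k+1} = max{0, Wk xk + bk}
    return weights[-1] @ xk + biases[-1]

# Tiny example: p = 4 inputs, one hidden layer of 3 neurons, q = 2 classes.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)
x_hat = rng.standard_normal(4)
pred = forward(x_hat, [W1, W2], [b1, b2])
c_hat = int(np.argmax(pred))  # predicted class
```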

To compute an adversarial example $x \approx \hat{x}$, the standard approach is to apply projected gradient descent (PGD) to the following semi-targeted attack problem, first introduced by Carlini & Wagner (2017):

$$\phi[c] = \min_{x = (x_1,\dots,x_\ell)\in\mathbb{R}^n} \; w_\ell^T x_\ell + w_0 \qquad \text{(A)}$$

$$\text{s.t.}\quad x_{k+1} = \max\{0,\, W_k x_k + b_k\}, \qquad \|x_1 - \hat{x}\| \le \rho,$$

for all $k\in\{1,2,\dots,\ell-1\}$, where $w_\ell = (\mathbf{e}_{\hat{c}} - \mathbf{e}_c)^T W_\ell$ and $w_0 = (\mathbf{e}_{\hat{c}} - \mathbf{e}_c)^T b_\ell$.¹

¹To simplify presentation, we focus our attention on the $\ell_2$ norm, and assume $w_0 = 0$ without loss of generality. Our results can be extended to the $\ell_\infty$ norm and are included in the appendix.

Robustness to adversarial perturbations can be certified by verifying that (A) achieves a positive global minimum $\phi[c] > 0$ for every incorrect class $c \neq \hat{c}$. The smallest of these global minima, written $\phi^\star = \min_{c\neq\hat{c}} \phi[c]$, is a robustness margin that measures how robust the model is to adversarial perturbations. The more positive the robustness margin $\phi^\star$, the better the model is able to resist misclassification.
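A minimal PGD sketch for this attack problem with a one-hidden-layer network ($\ell = 2$); the step size, iteration count, and shapes are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def pgd_attack(W1, b1, w2, x_hat, rho, steps=200, lr=0.05):
    """Minimize w2^T x2 over the l2 ball ||x1 - x_hat|| <= rho, where
    x2 = max{0, W1 x1 + b1}, by projected gradient descent."""
    x1 = x_hat.copy()
    for _ in range(steps):
        pre = W1 @ x1 + b1
        grad = W1.T @ (w2 * (pre > 0))   # d/dx1 of w2^T relu(W1 x1 + b1)
        x1 = x1 - lr * grad
        delta = x1 - x_hat               # project back onto the l2 ball
        nrm = np.linalg.norm(delta)
        if nrm > rho:
            x1 = x_hat + delta * (rho / nrm)
    x2 = np.maximum(0.0, W1 @ x1 + b1)
    return x1, w2 @ x2                   # attack point, upper bound on phi[c]

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
w2 = rng.standard_normal(3)
x_hat = rng.standard_normal(4)
x_adv, phi_ub = pgd_attack(W1, b1, w2, x_hat, rho=0.5)
```

If `phi_ub` is negative for some incorrect class, PGD has found a successful attack; if it stays positive, a certification method is still needed to rule out all attacks.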

3 Rank-constrained SDP relaxation

Our goal in this paper is to develop better lower-bounds on the semi-targeted attack problem (A). Following existing work on the SDP relaxation, we begin by substituting the rank-1 SDP reformulation of the ReLU activation (Raghunathan et al., 2018b; Zhang, 2020) into (A). But instead of deleting the rank-1 constraint altogether, we propose to slightly relax it to a rank-$r$ constraint with a bounded trace, where $1 \le r \le n+1$, resulting in a family of nonconvex relaxations

$$\phi_r[c] = \min_{X\in\mathbb{S}^{n+1}} \; w_\ell^T x_\ell \qquad \text{(3)}$$

subject to

$$\operatorname{tr}(X_{1,1}) - 2x_1^T\hat{x}_1 + \|\hat{x}_1\|^2 x_0 \le \rho^2 x_0, \qquad (y_0)$$

$$x_{k+1} \ge 0, \qquad x_{k+1} \ge W_k x_k + b_k x_0, \qquad (y_{k,1},\, y_{k,2})$$

$$\operatorname{diag}(X_{k+1,k+1} - W_k X_{k,k+1} - b_k x_{k+1}^T) = 0, \qquad (z_k)$$

$$x_0 = 1, \qquad \operatorname{tr}(X) \le R^2, \qquad (z_0,\, \mu)$$

for all $k\in\{1,2,\dots,\ell-1\}$, whose optimization variable $X$ is an $(n+1)\times(n+1)$ rank-$r$ constrained symmetric positive semidefinite matrix

$$X = \begin{bmatrix} x_0 & x_1^T & \cdots & x_\ell^T \\ x_1 & X_{1,1} & \cdots & X_{1,\ell} \\ \vdots & \vdots & \ddots & \vdots \\ x_\ell & X_{1,\ell}^T & \cdots & X_{\ell,\ell} \end{bmatrix} \succeq 0, \qquad \operatorname{rank}(X) \le r.$$

Here, we indicate in parentheses the dual variable associated with each constraint. We will assume throughout the paper that the trace bound $R$ has been chosen large enough that $\operatorname{tr}(X^\star) < R^2$, or equivalently $\mu^\star = 0$, holds at optimality. It follows from Zhang (2020) that the $r = 1$ instance of (3) coincides with (A) exactly. Because our rank upper-bound only enlarges the feasible set as $r$ grows, each subsequent instance provides a lower-bound on the previous one:

$$\phi[c] = \phi_1[c] \ge \phi_2[c] \ge \cdots \ge \phi_{n+1}[c].$$

Finally, setting $r = n+1$ has the same effect as deleting the rank constraint altogether. Therefore, the $r = n+1$ instance coincides with the convex semidefinite relaxation originally proposed by Raghunathan et al. (2018b).

For relaxation ranks $r < n+1$, the corresponding nonconvex instances of (3) are in general NP-hard to solve to global optimality. Even if we are provided with a globally optimal solution $X^\star$, there is generally no way to (rigorously) tell that $X^\star$ is indeed globally optimal. The most we can say is that $X^\star$ provides an upper-bound on the global minimum of (3). Unfortunately, this upper-bound is not helpful towards our goal of lower-bounding (A).

Instead, we will derive a lower-bound on (3) via Lagrangian duality, which also serves as a valid lower-bound on (A). Our motivating insight is that all instances of (3), including the nonconvex instances with $r < n+1$, share the same convex Lagrangian dual. Define dual variables $y = (y_0, \{y_{k,1}, y_{k,2}\}_{k=1}^{\ell-1}) \ge 0$, $z = (z_0, \{z_k\}_{k=1}^{\ell-1})$, and $\mu \le 0$ to correspond to the linear constraints in (3) as shown in parentheses. Then, the dual problem of (3) is written:

$$\max_{y\ge 0,\, z,\, \mu\le 0} \; z_0 + R^2\mu \quad \text{s.t.} \quad S(y,z) \succeq \mu I, \qquad \text{(SDD)}$$

in which the components of the slack matrix

$$S(y,z) \equiv \frac{1}{2}\begin{bmatrix}
s_0 & s_1^T & s_2^T & \cdots & s_\ell^T \\
s_1 & S_{1,1} & S_{1,2} & & \\
s_2 & S_{1,2}^T & S_{2,2} & \ddots & \\
\vdots & & \ddots & \ddots & S_{\ell-1,\ell} \\
s_\ell & & & S_{\ell-1,\ell}^T & S_{\ell,\ell}
\end{bmatrix}$$

are written

$$s_0 = 2\left[y_0(\|\hat{x}\|^2 - \rho^2) + \sum_{k=1}^{\ell-1} b_k^T y_{k,2} - z_0\right],$$

$$s_1 = W_1^T y_{1,2} - 2\hat{x} y_0,$$

$$s_{k+1} = W_{k+1}^T y_{k+1,2} - (Z_k b_k + y_{k,1} + y_{k,2}),$$

$$s_\ell = w_\ell - [Z_{\ell-1} b_{\ell-1} + y_{\ell-1,1} + y_{\ell-1,2}],$$

$$S_{1,1} = 2y_0 I, \qquad S_{k,k+1} = -W_k^T Z_k, \qquad S_{k+1,k+1} = 2Z_k,$$

where $Z_k = \operatorname{diag}(z_k)$. Here, $s_{k+1}$ is defined for all $k\in\{1,\dots,\ell-2\}$, and $S_{k,k+1}$ and $S_{k+1,k+1}$ are defined for all $k\in\{1,\dots,\ell-1\}$. We therefore obtain the following lower-bound on the semi-targeted attack problem in (A), which is valid for any choice of multipliers $(y,z)$.

Proposition 3.1 (Dual lower-bound).

Let $X^\star$ denote the global solution of (3) with rank $r \ge 1$ and $\operatorname{tr}(X^\star) < R^2$. Then, any dual multipliers $y = (y_0, \{y_{k,1}, y_{k,2}\}_{k=1}^{\ell-1})$ and $z = (z_0, \{z_k\}_{k=1}^{\ell-1})$ that satisfy $y \ge 0$ provide the following lower-bound:

$$\phi[c] \ge \phi_r[c] \ge z_0 + R^2\cdot\min\{0,\, \lambda_{\min}[S(y,z)]\}.$$
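As a concrete (hypothetical) illustration for a one-hidden-layer network ($\ell = 2$), the bound of Proposition 3.1 can be evaluated with a few lines of NumPy; all shapes, names, and multiplier values below are illustrative assumptions, not a real certificate:

```python
import numpy as np

def slack_matrix(W1, b1, w2, x_hat, rho, y0, y11, y12, z0, z1):
    """Assemble S(y, z) for a one-hidden-layer network (l = 2),
    following the block formulas in the text."""
    n1, = x_hat.shape
    n2, = b1.shape
    Z1 = np.diag(z1)
    s0 = 2 * (y0 * (x_hat @ x_hat - rho**2) + b1 @ y12 - z0)
    s1 = W1.T @ y12 - 2 * y0 * x_hat
    s2 = w2 - (Z1 @ b1 + y11 + y12)      # s_l with l = 2
    S = np.zeros((1 + n1 + n2, 1 + n1 + n2))
    S[0, 0] = s0
    S[0, 1:1+n1] = s1; S[1:1+n1, 0] = s1
    S[0, 1+n1:] = s2;  S[1+n1:, 0] = s2
    S[1:1+n1, 1:1+n1] = 2 * y0 * np.eye(n1)   # S_{1,1}
    S[1:1+n1, 1+n1:] = -W1.T @ Z1             # S_{1,2}
    S[1+n1:, 1:1+n1] = (-W1.T @ Z1).T
    S[1+n1:, 1+n1:] = 2 * Z1                  # S_{2,2}
    return 0.5 * S

def dual_lower_bound(S, z0, R):
    """Proposition 3.1: phi[c] >= z0 + R^2 * min(0, lambda_min(S))."""
    lam_min = np.linalg.eigvalsh(S)[0]   # eigvalsh sorts ascending
    return z0 + R**2 * min(0.0, lam_min)

rng = np.random.default_rng(0)
n1, n2 = 4, 3
W1 = rng.standard_normal((n2, n1)); b1 = rng.standard_normal(n2)
w2 = rng.standard_normal(n2); x_hat = rng.standard_normal(n1)
y0, y11, y12 = 0.5, rng.random(n2), rng.random(n2)   # y >= 0
z0, z1 = 0.1, rng.standard_normal(n2)
S = slack_matrix(W1, b1, w2, x_hat, 1.0, y0, y11, y12, z0, z1)
bound = dual_lower_bound(S, z0, R=10.0)
```

Any feasible $(y, z)$ yields a valid bound; the solver's job (Section 4) is to find multipliers that make it tight.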

Let $X^\star$ denote the globally optimal solution for the convex instance of (3) with $r = n+1$. It turns out that strong duality holds in this convex case, meaning that there exist optimal multipliers $y^\star, z^\star$ that exactly satisfy

$$\phi[c] \ge \phi_{n+1}[c] = z_0^\star + R^2\cdot\min\{0,\, \lambda_{\min}[S(y^\star,z^\star)]\}$$

and therefore certify the global optimality of $X^\star$ via Proposition 3.1. Now, suppose that the convex solution $X^\star$ is in fact low-rank, as in $r^\star = \operatorname{rank}(X^\star) \ll n$. The statement below says that the nonconvex instance of (3) with $r = r^\star$ also admits optimal multipliers $y^\star, z^\star$ that certify global optimality.

Theorem 3.2 (Existence of global optimality certificate).

Let $r^\star = \operatorname{rank}(X^\star)$ and $\operatorname{tr}(X^\star) < R^2$, where $X^\star$ denotes the maximum-rank solution to the convex instance of (3) with $r = n+1$. Then, there exist optimal multipliers $y^\star = (y_0^\star, \{y_{k,1}^\star, y_{k,2}^\star\}_{k=1}^{\ell-1})$ and $z^\star = (z_0^\star, \{z_k^\star\}_{k=1}^{\ell-1})$ that satisfy $y^\star \ge 0$ and the following:

$$\phi[c] \ge \phi_{r^\star}[c] = z_0^\star + R^2\cdot\min\{0,\, \lambda_{\min}[S(y^\star,z^\star)]\}.$$

In the following section, we use the approach of Burer & Monteiro (2003) and Boumal et al. (2016, 2020) to constructively compute the optimal multipliers $y^\star, z^\star$ that are asserted to exist by Theorem 3.2. In turn, plugging $y^\star, z^\star$ into Proposition 3.1 produces a tight lower-bound on the semi-targeted attack problem (A), thereby achieving the original goal of this section.

4 Solution via Nonlinear Programming

In order to expose the underlying degrees of freedom in the rank-$r$ matrix $X$, we reformulate problem (3) into the low-rank factorization form first proposed by Burer & Monteiro (2003):

$$\phi_r[c] = \min_{u_0,\, u,\, V} \; u_0\cdot(w_\ell^T u_\ell) \qquad \text{(BM-$r$)}$$

subject to

$$\|u_1 - u_0\hat{x}\|^2 + \|V_1\|^2 \le \rho^2, \qquad u_0^2 = 1,$$

$$u_0\cdot u_{k+1} \ge 0, \qquad u_0\cdot(u_{k+1} - W_k u_k - b_k u_0) \ge 0,$$

$$\operatorname{diag}\!\left[(u_{k+1} - W_k u_k - b_k u_0)u_{k+1}^T + (V_{k+1} - W_k V_k)V_{k+1}^T\right] = 0,$$

$$u_0^2 + \sum_{k=1}^{\ell-1}\left(\|u_k\|^2 + \|V_k\|^2\right) \le R^2,$$

for all $k\in\{1,\dots,\ell-1\}$, over the optimization variables $u_0\in\mathbb{R}$, $u = (u_1,\dots,u_\ell)\in\mathbb{R}^n$, and $V = (V_1,\dots,V_\ell)\in\mathbb{R}^{n\times(r-1)}$. Problem (BM-$r$) is obtained by substituting the following into (3):

$$X = \begin{bmatrix} u_0 & 0 \\ u_1 & V_1 \\ \vdots & \vdots \\ u_\ell & V_\ell \end{bmatrix}\begin{bmatrix} u_0 & 0 \\ u_1 & V_1 \\ \vdots & \vdots \\ u_\ell & V_\ell \end{bmatrix}^T = UU^T. \qquad (1)$$

The equivalence between the two problems follows because every $(n+1)\times(n+1)$ matrix $X$ of rank $r$ can be factored as $X = LL^T$, where the low-rank Cholesky factor $L$ is both lower-triangular and of dimensions $(n+1)\times r$. The advantage of the formulation (BM-$r$) is that it reduces the number of explicit variables from the $\frac{1}{2}n(n+1)\approx\frac{1}{2}n^2$ in the original matrix $X$ to the $nr+1\approx nr$ in the factor matrix $U$, while also allowing the positive semidefinite constraint $X\succeq 0$ to be enforced for free. For moderate values of $r\ll n$, the resulting instance of (BM-$r$) contains just $O(n)$ variables and constraints.
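The variable-count saving behind this factorization is easy to check numerically; the following standalone sketch (with arbitrary dimensions, not tied to any model in the paper) confirms that $X = UU^T$ is automatically positive semidefinite with rank at most $r$:

```python
import numpy as np

n, r = 200, 5                        # n+1 ~ neurons + 1; r = relaxation rank
rng = np.random.default_rng(0)
U = rng.standard_normal((n + 1, r))  # factor matrix: ~n*r explicit variables
X = U @ U.T                          # PSD and rank <= r by construction

eigs = np.linalg.eigvalsh(X)
assert eigs.min() >= -1e-8           # X >= 0 is enforced "for free"
assert np.linalg.matrix_rank(X) <= r

# Variable counts: ~n^2/2 entries in symmetric X vs ~n*r entries in U.
full_vars = (n + 1) * (n + 2) // 2
factored_vars = (n + 1) * r
print(full_vars, factored_vars)
```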

We propose solving (BM-$r$) as an instance of the standard-form nonlinear program,

$$\min_{\|x\|\le R} f(x) \quad \text{s.t.} \quad g(x)\le 0, \quad h(x) = 0, \qquad \text{(NLP)}$$

using a high-performance general-purpose solver like fmincon or knitro. These are primal-dual solvers, designed to output a primal point $x = (u_0, u, \operatorname{vec}(V))$ that is first-order optimal, along with dual multipliers $y = (y_0, \{y_{k,1}, y_{k,2}\}_{k=1}^{\ell-1})$ and $z = (z_0, \{z_k\}_{k=1}^{\ell-1})$ that certify the first-order optimality of $x$. Below, the notion of first-order optimality is taken from (Nocedal & Wright, 2006, Theorem 12.3), and the notion of certifiability follows from the proof of (Nocedal & Wright, 2006, Theorem 12.1).

Definition 4.1 (Certifiably first-order optimal).

The point $x$ is said to be first-order optimal if it satisfies the constraints $g(x)\le 0$ and $h(x) = 0$, and there exists no escape path $x(t)$ that begins at $x(0) = x$ and makes a first-order improvement to the objective while satisfying all constraints, as in

$$f(x(t)) \le f(x) - \delta t, \qquad g(x(t)) \le 0, \qquad h(x(t)) = 0,$$

for all $t\in[0,\epsilon)$ with sufficiently small $\delta > 0$ and $\epsilon > 0$. Additionally, $x$ is said to be certifiably first-order optimal if there exist dual multipliers $y$ and $z$ that satisfy the Karush–Kuhn–Tucker (KKT) equations:

$$\nabla f(x) + \nabla g(x)\, y + \nabla h(x)\, z = 0, \qquad y\odot g(x) = 0, \qquad y \ge 0.$$
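To make the KKT conditions of Definition 4.1 concrete, here is a sketch of a numerical KKT residual check; `grad_f`, `Jg`, `Jh`, and the toy problem below are hypothetical illustrations, not the paper's verification problem:

```python
import numpy as np

def kkt_residual(grad_f, Jg, Jh, g_val, y, z):
    """Residual of the KKT system in Definition 4.1:
    stationarity      grad f(x) + grad g(x) y + grad h(x) z = 0,
    complementarity   y .* g(x) = 0,
    dual feasibility  y >= 0.
    Jg, Jh are the Jacobians of g and h at x (one row per constraint)."""
    stationarity = grad_f + Jg.T @ y + Jh.T @ z
    complementarity = y * g_val
    dual_infeas = np.minimum(y, 0.0)   # nonzero entries violate y >= 0
    return (np.linalg.norm(stationarity)
            + np.linalg.norm(complementarity)
            + np.linalg.norm(dual_infeas))

# Toy check: min x^2 s.t. x >= 1, written as g(x) = 1 - x <= 0 (no h).
# At x = 1 the unique multiplier is y = 2, so the residual vanishes.
res = kkt_residual(np.array([2.0]), np.array([[-1.0]]),
                   np.zeros((0, 1)), np.array([0.0]),
                   np.array([2.0]), np.zeros(0))
```

A near-zero residual is exactly what a primal-dual solver returns as its certificate of first-order optimality.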

Our main idea is to simply take the dual multipliers $y, z$ computed by the nonlinear programming solver, round them as $y \leftarrow \max\{0, y\}$ to ensure that $y \ge 0$, and then plug them back into Proposition 3.1. Our main result is that, if $x$ is globally optimal and satisfies a mild constraint qualification, then the corresponding dual multipliers $y, z$ exist and are unique. Therefore, if $x$ is indeed globally optimal, then the dual multipliers $y, z$ that certify the local optimality of $x$ must also coincide with the optimal multipliers $y^\star, z^\star$ that were asserted to exist earlier in Theorem 3.2.

Lemma 4.2 (Nonzero preactivation).

Suppose we have $x = (u_0, u_1, \dots, u_\ell, \operatorname{vec}(V_1), \dots, \operatorname{vec}(V_\ell))$ that satisfies

$$\mathbf{e}_i^T(W_k u_k + b_k u_0) \neq 0, \qquad \mathbf{e}_i^T W_k V_k \neq 0, \qquad \text{(NPCQ)}$$

for all $k\in\{1,\dots,\ell-1\}$ and $i\in\{1,\dots,n_{k+1}\}$. Then, $x$ is first-order optimal if and only if there exist dual multipliers $y$ and $z$ to certify $x$ as being first-order optimal. Moreover, the choice of dual multipliers $y, z$ is unique.

Theorem 4.3 (Zero duality gap).

Let $r \ge r^\star$, where $r^\star$ is defined in Theorem 3.2. If $x$ is globally optimal and satisfies (NPCQ), then the dual multipliers $y$ and $z$ that certify $x$ to be first-order optimal must also certify $x$ to be globally optimal, as in

$$\phi_r[c] = u_0\cdot(w_\ell^T u_\ell) = z_0 + R^2\cdot\min\{0,\, \lambda_{\min}[S(y,z)]\}.$$

Conversely, if a first-order optimal point $x$ satisfies the constraint qualification but is not globally optimal, then the dual multipliers $y, z$ generate a direction of global improvement towards the global minimum. The key idea is to lift $x$ to a higher relaxation rank $r_+ = r + 1$, at which it becomes a saddle point. The statement below gives a direction to escape the saddle point and decrease the objective.

Theorem 4.4 (Escape lifted saddle point).

Let $x$ be certifiably first-order optimal for (BM-$r$) with dual multipliers $(y, z)$. If $x$ satisfies $\gamma = -\lambda_{\min}[S(y,z)] > 0$, (NPCQ), and $\|x\| < R$, then the eigenvector $\xi = (\xi_0, \xi_1, \xi_2, \dots, \xi_\ell)$ that satisfies $\xi^T S(y,z)\,\xi = -\gamma\|\xi\|^2$ implicitly defines an escape path $x_+(t) = (u_0, \{u_{k,+}(t)\}_{k=1}^{\ell}, \{V_{k,+}(t)\}_{k=1}^{\ell})$ with

$$u_{k,+}(t) = u_k + O(t^2), \qquad V_{k,+}(t) = [V_k,\, 0] + t\cdot[0,\; u_k\xi_0/u_0 + \xi_k] + O(t^2)$$

that makes a second-order improvement to the objective while satisfying all constraints, as in

$$f(x_+(t)) \le f(x) - t^2\gamma, \qquad g(x_+(t)) \le 0, \qquad h(x_+(t)) = 0$$

for all $t\in[0,\epsilon)$ with sufficiently small but nonzero $\epsilon > 0$.

Algorithm 1 Summary of proposed algorithm

Input: Initial relaxation rank $r \ge 2$. Weights $W_1,\dots,W_\ell$ and biases $b_1,\dots,b_\ell$. Original input $\hat{x}$, true label $\hat{c}$, target label $c$, and perturbation size $\rho$. Variable radius bound $R$.

Output: Lower-bound $\phi_{\mathrm{lb}}[c] \le \phi[c]$ on the optimal value of the semi-targeted attack problem (A).

Algorithm:

In practice, it suffices to move along the straight path $\tilde{u}_{k,+}(t) = u_k$ and $\tilde{V}_{k,+}(t) = [V_k,\, 0] + t\cdot[0,\; u_k\xi_0/u_0 + \xi_k]$, and then solve for feasibility $g(x) \le 0$ and $h(x) = 0$. Concretely, after computing a first-order optimal $x$, we increment the relaxation rank to $r_+ = r + 1$, and initialize the nonlinear programming solver using the lifted point $\tilde{x}_+(\epsilon)$ as the initial primal point and the old multipliers $y, z$ as the initial dual multipliers. If this arrives at the global optimum, then the corresponding $y, z$ must certify $x$ as being so. Otherwise, we repeat the rank-lifting procedure.
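The lifting step described above amounts to appending one new column to each factor block $V_k$; a minimal sketch under assumed shapes (the function name `lift_factor` and all dimensions are illustrative, and $u_0 = 1$ here):

```python
import numpy as np

def lift_factor(u0, u, V, xi0, xi, t):
    """Lift a rank-r factor to rank r+1 along the straight escape path:
    u stays fixed, and V gains the new column t * (u * xi0 / u0 + xi)."""
    new_col = t * (u * xi0 / u0 + xi)
    return np.column_stack([V, new_col])

rng = np.random.default_rng(0)
n, r = 6, 3
u0 = 1.0
u = rng.standard_normal(n)
V = rng.standard_normal((n, r - 1))       # rank-r factor block
xi0, xi = 0.2, rng.standard_normal(n)     # escape eigenvector components

V_plus = lift_factor(u0, u, V, xi0, xi, t=0.0)
# At t = 0 the lifted point reproduces the old factor exactly (new column
# is zero); a small t > 0 then decreases the objective by ~t^2 * gamma
# per Theorem 4.4, before the solver restores exact feasibility.
assert np.allclose(V_plus[:, :-1], V) and np.allclose(V_plus[:, -1], 0)
```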

Progressively lifting the relaxation rank $r$, we find in our experience that it takes no more than $r \le 10$ to reduce the duality gap to values of $10^{-8}$. To rigorously guarantee a zero duality gap, however, can require a relaxation rank on the order of $r = O(\sqrt{n})$ (Boumal et al., 2020), irrespective of the value of $r^\star$. Indeed, counterexamples exist for relaxation ranks $r$ that are even slightly smaller than this threshold (Waldspurger & Waters, 2020; O'Carroll et al., 2022; Zhang, 2022). As a purely theoretical result, the ability to achieve a zero duality gap implies that the nonconvex relaxation (with $r \ge r^\star$) can be solved in polynomial time (and is therefore not NP-hard). Of course, setting $r = O(\sqrt{n})$ would also force us to optimize over $O(n^{3/2})$ variables, thereby offsetting much of our computational advantage over the usual convex SDP relaxation in practice.

Algorithm 1 summarizes our proposed approach in pseudocode form, and introduces a number of small practical refinements.

Table 1: Robustness verification for neural networks. We compare the number of images certified as robust by BM-Full, BM, $\alpha,\beta$-CROWN, LP-Full, CROWN-Ada and Fast-Lip within the first 1000 images for normally and robustly trained networks. The upper bound (denoted UB) on the true number of robust images is obtained by PGD.

| Network | $\ell_2$ Radius | PGD (UB) | BM-Full (Robust / Time) | BM (Robust / Time) | $\alpha,\beta$-CROWN (Robust / Time) | LP-Full (Robust / Time) | CROWN-Ada (Robust / Time) | Fast-Lip (Robust / Time) |
|---|---|---|---|---|---|---|---|---|
| ADV-MNIST | 1.0 | 774 | **762** / 47s | 757 / 28s | 672 / 24s | 209 / 8s | 45 / 12ms | 30 / 13ms |
| ADV-MNIST | 1.3 | 614 | **569** / 38s | 559 / 28s | 399 / 94s | 25 / 10s | 7 / 9ms | 2 / 12ms |
| ADV-MNIST | 1.5 | 471 | **411** / 56s | 392 / 20s | 248 / 138s | 11 / 11s | 1 / 9ms | 1 / 13ms |
| LPD-MNIST | 1.0 | 755 | **730** / 218s | 708 / 29s | 641 / 10s | 411 / 16s | 120 / 10ms | 66 / 13ms |
| LPD-MNIST | 1.3 | 612 | **514** / 129s | 474 / 26s | 430 / 33s | 61 / 19s | 16 / 10ms | 8 / 13ms |
| LPD-MNIST | 1.5 | 505 | **391** / 98s | 350 / 23s | 316 / 64s | 23 / 20s | 5 / 10ms | 2 / 14ms |
| NOR-MNIST | 0.3 | 916 | **911** / 128s | 866 / 21s | 797 / 23s | 728 / 8s | 420 / 9ms | 348 / 12ms |
| NOR-MNIST | 0.5 | 732 | **696** / 127s | 534 / 27s | 424 / 159s | 232 / 16s | 46 / 7ms | 27 / 13ms |
| NOR-MNIST | 0.7 | 485 | **381** / 156s | 187 / 30s | 124 / 253s | 37 / 19s | 0 / 13ms | 0 / 17ms |

5 Experiments

We identically reproduce three models from Salman et al. (2019), two of which were trained to be robust against an $\ell_\infty$ adversary. We compare the performance of our proposed verifier BM, which is based on solving (BM-$r$), and BM-Full, which extends BM with layer-wise preactivation bounds (see Appendix A for details), against state-of-the-art LP-based verifiers for certifying robustness against an $\ell_2$ adversary. Our experiments certifying the robustness of the same models against an $\ell_\infty$ adversary are deferred to the appendix.

Methods.

BM and BM-Full denote the proposed method without and with preactivation bounds, respectively. The source code for BM and BM-Full is available at https://github.com/Hong-Ming/BM-r. PGD denotes the projected gradient descent algorithm for finding the upper bound on (A). We compare BM and BM-Full to three state-of-the-art LP-based verifiers: CROWN-Ada of Zhang et al. (2018b), Fast-Lip of Weng et al. (2018a), and LP-Full of Salman et al. (2019). CROWN-Ada and Fast-Lip are both large-scale LP verifiers with linear complexity with respect to the number of activations; the implementations that we used were taken directly from the authors' project page (https://github.com/IBM/CROWN-Robustness-Certification). LP-Full is the optimal LP verifier that uses the tightest possible preactivation bounds by solving LP problems for each hidden neuron; its complexity is cubic with respect to the number of activations. We reimplemented this algorithm to work with $\ell_2$ adversaries, and then validated our implementation against that of the authors (https://github.com/Hadisalman/robust-verify-benchmark) on $\ell_\infty$ adversaries. The preactivation bounds in BM-Full are set to coincide with those used in LP-Full. We also compare our methods against the state-of-the-art branch-and-bound verifier $\alpha,\beta$-CROWN (Zhang et al., 2018a; Wang et al., 2021; Xu et al., 2021; Zhang et al., 2022), whose implementation is taken from the authors' project page (https://github.com/Verified-Intelligence/alpha-beta-CROWN). We set the timeout of $\alpha,\beta$-CROWN to 300s.

Figure 2: Lower bounds on the robustness margin. We take BM-Full, BM, LP-Full, CROWN-Ada and Fast-Lip to compute their average lower bound on (A), and then compare them to the average PGD upper bound on (A) over a wide range of $\ell_2$ perturbation radii. (Left.) NOR-MNIST. (Middle.) ADV-MNIST. (Right.) LPD-MNIST.

Setup.

We use an Apple laptop with an M1 Pro chip (10-core CPU, 16-core GPU) and 32GB of RAM. We implemented BM and BM-Full in MATLAB. The nonconvex problem (BM-$r$) is solved using the trust-region interior-point solver knitro (Byrd et al., 2006) with a warm-start strategy.

Models.

We perform simulations on three models: NOR-MNIST, ADV-MNIST and LPD-MNIST. All three models were trained by Salman et al. (2019) and are taken directly from the authors' project page (https://github.com/Hadisalman/robust-verify-benchmark). In particular, the architecture and the numerical values of the weights for NOR-MNIST, ADV-MNIST and LPD-MNIST are identical to NOR-MLP-B, ADV-MLP-B and LPD-MLP-B, respectively, in Salman et al. (2019). All three models are fully-connected feedforward neural networks with 2 hidden layers of 100 neurons each, and were trained on the MNIST dataset with different training procedures. NOR-MNIST was trained normally using the cross-entropy loss and is used as a control. ADV-MNIST was adversarially trained against the PGD attack using the method of Madry et al. (2018), with an $\ell_\infty$ radius of 0.1. LPD-MNIST was robustly trained via the adversarial polytope method of Wong & Kolter (2018), with the size of the adversarial polytope also set to 0.1.

5.1 Robustness verification on neural network inputs.

We certify the robustness of our three models against an ℓ 2 adversary using our proposed method, and compare their performance against the existing state-of-the-art. In each trial, we fix an attack radius 𝜌, and mark a correctly-classified image 𝑥̂ as robust if the lower bound on (A) is positive with respect to all incorrect classes, i.e. 0 < 𝜙 lb[𝑐] ≤ 𝜙[𝑐] for all 𝑐 ≠ 𝑐̂, and mark an image 𝑥̂ as not robust if an attack is found by PGD, i.e. 𝜙[𝑐] ≤ 𝜙 ub[𝑐] < 0 for some 𝑐 ≠ 𝑐̂. We mark the image as status unknown if a determination cannot be made either way. The lower bound for BM and BM-Full is obtained by Proposition 3.1.
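The per-image decision rule above can be sketched as follows. This is an illustrative Python sketch (our implementation is in MATLAB); the function name `certify_image` and its interface are ours, not part of the released code.

```python
def certify_image(phi_lb, phi_ub, true_class):
    """Classify one image as 'robust', 'not robust', or 'unknown'.

    phi_lb[c] is a certified lower bound and phi_ub[c] a PGD upper
    bound on the semi-targeted attack margin phi[c] for each class c.
    """
    others = [c for c in range(len(phi_lb)) if c != true_class]
    if all(phi_lb[c] > 0 for c in others):
        return "robust"       # every incorrect class is provably safe
    if any(phi_ub[c] < 0 for c in others):
        return "not robust"   # PGD found an attack for some class
    return "unknown"          # bounds are inconclusive either way
```

Note that "robust" requires all incorrect classes to be safe simultaneously, while a single negative PGD upper bound suffices for "not robust".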

Results and discussions.

Table 1 shows the number of images certified robust among the first 1000 correctly classified images using BM-Full, BM, 𝛼,𝛽-CROWN, LP-Full, CROWN-Ada and Fast-Lip. The average computation time per image for each verifier is also shown. In addition, to gauge the efficacy of our verifiers, we benchmark our results against an upper bound on the true number of robust images (denoted UB in the table), namely the number of images that are not marked as not robust by PGD. From the table, we see that our nonconvex verifiers consistently outperform all the other verifiers in all cases, with time complexity only 5 to 10 times higher than LP-Full. Despite the higher time complexity, our verifiers are the only ones that can still verify a reasonable number of images under larger perturbation radii. Notably, for small perturbations, our verifiers can certify nearly all images that cannot be attacked by PGD, leaving very few images with status unknown.

5.2 Tightness of our lower bound

Since robust verification is an NP-hard problem, all relaxation methods must become loose for sufficiently large radius. Fortunately, robust verification is only needed when the PGD upper bound on (A) is positive; therefore, we only need the relaxation to be tight while the PGD upper bound is still positive. In this experiment, we analyze the gap between our lower bound in Proposition 3.1 and the PGD upper bound over a wide range of perturbation radii. To accurately measure this gap, we average each bound over all 9 incorrect classes of the first 10 correctly classified images in the test set, for a total of 90 samples.

Results and discussions.

Figure 2 shows the average PGD upper bound, and the average lower bound on (A) computed from BM, BM-Full, LP-Full, CROWN-Ada and Fast-Lip. Our lower bounds are significantly tighter than all the other verifiers across a wide range of ℓ 2 perturbation radii. Most importantly, our lower bounds remain tight in regions where the PGD upper bound is still positive. We reiterate that it is not possible for our nonconvex verifiers to be exact at a large perturbation radius, because exact verification is NP-hard and our algorithm is polynomial-time. Nonetheless, so long as we remain tight as the upper bound crosses the zero line, our certification methods will be very close to exact.

6 Conclusion

In this work, we presented a neural network certification technique based on a nonconvex low-rank restriction of the SDP relaxation. Our experiments find that the method is able to overcome the convex relaxation barrier (Salman et al., 2019) with a runtime only a small constant factor (5-10×) worse than the existing state-of-the-art. Our results showed that even the basic nonconvex relaxation, BM, offers a significant reduction in the relaxation gap, while augmenting it with bound propagation, as in BM-Full, allows us to almost fully close the gap towards exact certification.

Acknowledgments

The authors thank Zico Kolter for insightful discussions and pointers to the literature. Financial support for this work was provided by the NSF CAREER Award ECCS-2047462 and C3.ai Inc. and the Microsoft Corporation via the C3.ai Digital Transformation Institute.

References

- Batten, B., Kouvaros, P., Lomuscio, A., and Zheng, Y. Efficient neural network verification via layer-based semidefinite relaxations and linear cuts. In IJCAI, pp. 2184–2190, 2021.
- Boumal, N., Voroninski, V., and Bandeira, A. The non-convex Burer-Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pp. 2757–2765, 2016.
- Boumal, N., Voroninski, V., and Bandeira, A. S. Deterministic guarantees for Burer-Monteiro factorizations of smooth semidefinite programs. Communications on Pure and Applied Mathematics, 73(3):581–608, 2020.
- Burer, S. and Monteiro, R. D. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
- Burer, S. and Monteiro, R. D. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–444, 2005.
- Burer, S., Monteiro, R. D., and Zhang, Y. Rank-two relaxation heuristics for max-cut and other binary quadratic programs. SIAM Journal on Optimization, 12(2):503–521, 2002.
- Byrd, R. H., Nocedal, J., and Waltz, R. A. Knitro: An integrated package for nonlinear optimization. In Large-Scale Nonlinear Optimization, pp. 35–59. Springer, 2006.
- Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.
- Carlone, L., Rosen, D. M., Calafiore, G., Leonard, J. J., and Dellaert, F. Lagrangian duality in 3D SLAM: Verification techniques and optimal solutions. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 125–132. IEEE, 2015.
- Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320, 2019.
- Dathathri, S., Dvijotham, K., Kurakin, A., Raghunathan, A., Uesato, J., Bunel, R., Shankar, S., Steinhardt, J., Goodfellow, I., Liang, P., et al. Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming. In Advances in Neural Information Processing Systems, 2020.
- Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.
- Katz, G., Barrett, C., Dill, D. L., Julian, K., and Kochenderfer, M. J. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97–117. Springer, 2017.
- Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
- Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
- MOSEK, A. The MOSEK optimization toolbox for MATLAB manual, 2019. URL https://docs.mosek.com/9.0/toolbox.pdf.
- Nocedal, J. and Wright, S. Numerical Optimization. Springer Science & Business Media, 2006.
- O'Carroll, L., Srinivas, V., and Vijayaraghavan, A. The Burer-Monteiro SDP method can fail even above the Barvinok-Pataki bound. In Advances in Neural Information Processing Systems, 2022.
- Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018a.
- Raghunathan, A., Steinhardt, J., and Liang, P. S. Semidefinite relaxations for certifying robustness to adversarial examples. Advances in Neural Information Processing Systems, 31, 2018b.
- Rosen, D. M., Kaess, M., and Leonard, J. J. RISE: An incremental trust-region method for robust online sparse least-squares estimation. IEEE Transactions on Robotics, 30(5):1091–1108, 2014.
- Rosen, D. M., Carlone, L., Bandeira, A. S., and Leonard, J. J. SE-Sync: A certifiably correct algorithm for synchronization over the special Euclidean group. The International Journal of Robotics Research, 38(2-3):95–125, 2019.
- Salman, H., Yang, G., Zhang, H., Hsieh, C.-J., and Zhang, P. A convex relaxation barrier to tight robustness verification of neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- Shafahi, A., Najibi, M., Ghiasi, M. A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! Advances in Neural Information Processing Systems, 32, 2019.
- Tjeng, V., Xiao, K., and Tedrake, R. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations, 2019.
- Waldspurger, I. and Waters, A. Rank optimality for the Burer-Monteiro factorization. SIAM Journal on Optimization, 30(3):2577–2602, 2020.
- Wang, S., Zhang, H., Xu, K., Lin, X., Jana, S., Hsieh, C.-J., and Kolter, J. Z. Beta-CROWN: Efficient bound propagation with per-neuron split constraints for neural network robustness verification. Advances in Neural Information Processing Systems, 34:29909–29921, 2021.
- Weng, L., Zhang, H., Chen, H., Song, Z., Hsieh, C.-J., Daniel, L., Boning, D., and Dhillon, I. Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning, pp. 5276–5285. PMLR, 2018a.
- Weng, T.-W., Zhang, H., Chen, P.-Y., Yi, J., Su, D., Gao, Y., Hsieh, C.-J., and Daniel, L. Evaluating the robustness of neural networks: An extreme value theory approach. arXiv preprint arXiv:1801.10578, 2018b.
- Wong, E. and Kolter, Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5286–5295, 2018.
- Wong, E., Rice, L., and Kolter, J. Z. Fast is better than free: Revisiting adversarial training. In International Conference on Learning Representations, 2019.
- Xu, K., Zhang, H., Wang, S., Wang, Y., Jana, S., Lin, X., and Hsieh, C.-J. Fast and complete: Enabling complete neural network verification with rapid and massively parallel incomplete verifiers. In International Conference on Learning Representations (ICLR), 2021.
- Zhang, H., Weng, T.-W., Chen, P.-Y., Hsieh, C.-J., and Daniel, L. Efficient neural network robustness certification with general activation functions. Advances in Neural Information Processing Systems, 31, 2018a.
- Zhang, H., Weng, T.-W., Chen, P.-Y., Hsieh, C.-J., and Daniel, L. Efficient neural network robustness certification with general activation functions. Advances in Neural Information Processing Systems, 31, 2018b.
- Zhang, H., Wang, S., Xu, K., Li, L., Li, B., Jana, S., Hsieh, C.-J., and Kolter, J. Z. General cutting planes for bound-propagation-based neural network verification. Advances in Neural Information Processing Systems, 2022.
- Zhang, R. Y. On the tightness of semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, 2020.
- Zhang, R. Y. Improved global guarantees for the nonconvex Burer-Monteiro factorization via rank overparameterization. arXiv preprint arXiv:2207.01789, 2022.
Appendix A Implementation details for BM and BM-Full

In this section, we present the implementation details for our two proposed methods: BM, the nonconvex relaxation (BM-𝑟) proposed in the main paper; and BM-Full, an extension of BM obtained by adding preactivation bounds on each hidden neuron in (BM-𝑟). We focus our attention on how to efficiently implement both methods to verify ℓ 2 and ℓ ∞ adversaries for neural networks trained on the MNIST dataset.

This section consists of three parts. First, we describe the valid input set constraint that we add to (BM-𝑟) in order to certify MNIST images, and a few constraints in (BM-𝑟) that can be simplified to improve efficiency. Second, in Appendices A.1 to A.4, we summarize the practical and efficient formulations for BM and BM-Full with respect to both ℓ 2 and ℓ ∞ perturbations. Third, in Appendix A.5, we present more details on how to efficiently solve BM and BM-Full using the procedure described in Algorithm 1.

Valid input set constraints

For models trained on the MNIST dataset, we add an extra constraint $0 \le u_0 \cdot u_1 \le 1$ to (BM-𝑟), because MNIST images are normalized to lie between 0 and 1 during training and testing. Notice that adding this extra inequality constraint does not alter any theoretical results in this paper, as it only adds an extra term to $s_1$ in the slack matrix $S(y,z)$.

Simplify constraints in (BM- 𝑟 )

In our practical implementation, we fix $u_0 = 1$ in (BM-𝑟). The reason is twofold. First, by fixing $u_0 = 1$, most constraints in (BM-𝑟) become linear, which reduces the time complexity of our algorithm significantly. Second, the dual variable $z_0$, which is associated with the constraint $u_0 = 1$, can be recovered via the KKT condition $S(y,z)\,U = 0$.

A.1 Efficient formulation of BM for ℓ 2 norm

We now turn to the practical aspect of implementing our proposed method BM. In particular, to verify ℓ 2 adversaries of neural networks trained on MNIST dataset, we solve (BM- 𝑟 ) in the following form

$$
\phi_r[c] = \min_{u,V} \; w_\ell^T u_\ell \qquad \text{(BM-}\ell_2\text{)}
$$

subject to, for $k \in \{1,\dots,\ell-1\}$:

$$
\begin{aligned}
&\|u_1 - \hat{x}\|^2 + \|V_1\|^2 \le \rho^2, && (y_0)\\
&u_1 \ge 0, \quad u_1 \le 1, && (y_{0,1},\, y_{0,2})\\
&u_{k+1} \ge 0, \quad u_{k+1} - W_k u_k - b_k \ge 0, && (y_{k,1},\, y_{k,2})\\
&\operatorname{diag}\bigl[(u_{k+1} - W_k u_k - b_k)u_{k+1}^T + (V_{k+1} - W_k V_k)V_{k+1}^T\bigr] = 0, && (z_k)\\
&1 + \sum_{k=1}^{\ell-1}\bigl(\|u_k\|^2 + \|V_k\|^2\bigr) \le R^2. && (\mu)
\end{aligned}
$$

Notice that we have substituted $u_0 = 1$, added the valid input set constraints $u_1 \ge 0$ and $u_1 \le 1$, and assigned their associated dual variables $y_{0,1}$ and $y_{0,2}$. In Definition A.1, we summarize how to evaluate the slack matrix $S(y,z)$ and the dual variable $z_0$ in (SDD) using the primal and dual solution of (BM-$\ell_2$) in order to calculate the bound in Proposition 3.1.
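To make the constraint structure of (BM-$\ell_2$) concrete, the sketch below evaluates the violation of each constraint group at a candidate point $(u, V)$. This is an illustrative Python/NumPy sketch (our implementation is in MATLAB); the function name `bm_l2_residuals` and its interface are ours, and we read the trace-ball sum as running over $k = 1, \dots, \ell-1$ as in the formulation above.

```python
import numpy as np

def bm_l2_residuals(u, V, W, b, x_hat, rho, R):
    """Maximum violation of each (BM-l2) constraint group at a point.

    u[k] (a vector) and V[k] (a matrix with r columns) are the factor
    variables for layers k = 1, ..., l (0-indexed here); W and b hold
    the l-1 hidden weight matrices and biases; x_hat and rho describe
    the input ball. All returned residuals are 0 at a feasible point.
    """
    res = {}
    # Input ball: ||u_1 - x_hat||^2 + ||V_1||^2 <= rho^2
    res["input_ball"] = max(0.0, float(np.sum((u[0] - x_hat) ** 2)
                                       + np.sum(V[0] ** 2) - rho ** 2))
    # Valid input box: 0 <= u_1 <= 1
    res["input_box"] = max(0.0, float(np.max(u[0] - 1)), float(np.max(-u[0])))
    lin, quad = 0.0, 0.0
    for k in range(len(W)):
        pre = W[k] @ u[k] + b[k]
        # Linear ReLU constraints: u_{k+1} >= 0, u_{k+1} >= W_k u_k + b_k
        lin = max(lin, float(np.max(pre - u[k + 1])), float(np.max(-u[k + 1])))
        # Quadratic ReLU constraint (the diag[...] = 0 rows)
        D = (u[k + 1] - pre) * u[k + 1] + np.sum(
            (V[k + 1] - W[k] @ V[k]) * V[k + 1], axis=1)
        quad = max(quad, float(np.max(np.abs(D))))
    res["relu_linear"] = lin
    res["relu_quadratic"] = quad
    # Trace ball: 1 + sum_{k=1}^{l-1} (||u_k||^2 + ||V_k||^2) <= R^2
    total = 1.0 + sum(float(np.sum(uk ** 2) + np.sum(Vk ** 2))
                      for uk, Vk in zip(u[:-1], V[:-1]))
    res["trace_ball"] = max(0.0, total - R ** 2)
    return res
```

For example, setting $u_1 = \hat{x}$, $u_{k+1} = \max\{0, W_k u_k + b_k\}$ and $V_k = 0$ (a lifted feasible point of the original network) makes every residual zero.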

Definition A.1.

Let $y = (y_0, \{y_{k,1}, y_{k,2}\}_{k=0}^{\ell-1})$, $z = (\{z_k\}_{k=1}^{\ell-1})$, $u = (u_1,\dots,u_\ell)$ and $V = (V_1,\dots,V_\ell)$ be any certifiably first-order optimal point of (BM-$\ell_2$). Each component of the slack matrix $S(y,z)$ and the dual variable $z_0$ in (SDD) can be evaluated as

$$
\begin{aligned}
s_0 &= -\sum_{k=1}^{\ell} s_k^T u_k, & z_0 &= y_0\bigl(\|\hat{x}\|^2 - \rho^2\bigr) - \mathbf{1}^T y_{0,2} + \sum_{k=1}^{\ell-1} b_k^T y_{k,2} - \tfrac{1}{2}s_0,\\
s_1 &= W_1^T y_{1,2} - 2\hat{x}y_0 - y_{0,1} + y_{0,2}, & s_\ell &= w_\ell - \bigl[Z_{\ell-1}b_{\ell-1} + y_{\ell-1,1} + y_{\ell-1,2}\bigr],\\
s_{k+1} &= W_{k+1}^T y_{k+1,2} - \bigl(Z_k b_k + y_{k,1} + y_{k,2}\bigr) && \text{for } k \in \{1,\dots,\ell-2\},\\
S_{1,1} &= 2y_0 I, \quad S_{k,k+1} = -W_k^T Z_k, \quad S_{k+1,k+1} = 2Z_k && \text{for } k \in \{1,\dots,\ell-1\},
\end{aligned}
$$

where $Z_k = \operatorname{diag}(z_k)$ for all $k$.
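As an illustration of the block structure in Definition A.1, the following NumPy sketch assembles the symmetric block-tridiagonal part of $S(y,z)$ from the blocks $S_{1,1}$, $S_{k,k+1}$ and $S_{k+1,k+1}$. The helper name `assemble_slack_matrix` is ours, and the linear terms $s_k$ are omitted for brevity.

```python
import numpy as np

def assemble_slack_matrix(y0, z, W):
    """Assemble the block-tridiagonal part of S(y, z) for (BM-l2).

    Blocks follow Definition A.1: S_{1,1} = 2*y0*I, S_{k,k+1} = -W_k^T Z_k,
    S_{k+1,k+1} = 2*Z_k, with Z_k = diag(z_k). Off-diagonal blocks are
    mirrored so that S is symmetric (S_{k+1,k} = S_{k,k+1}^T).
    """
    sizes = [W[0].shape[1]] + [Wk.shape[0] for Wk in W]
    offs = np.cumsum([0] + sizes)
    S = np.zeros((offs[-1], offs[-1]))
    S[offs[0]:offs[1], offs[0]:offs[1]] = 2.0 * y0 * np.eye(sizes[0])
    for k, (Wk, zk) in enumerate(zip(W, z)):
        Zk = np.diag(zk)
        i, j = offs[k], offs[k + 1]
        B = -Wk.T @ Zk                       # block S_{k,k+1}
        S[i:i + sizes[k], j:j + sizes[k + 1]] = B
        S[j:j + sizes[k + 1], i:i + sizes[k]] = B.T
        S[j:j + sizes[k + 1], j:j + sizes[k + 1]] = 2.0 * Zk
    return S
```

The resulting matrix inherits the sparsity of the network: only blocks coupling consecutive layers are nonzero, which is what makes operations with $S(y,z)$ cheap.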

A.2 Efficient formulation of BM-Full for ℓ 2 norm

We now describe the practical formulation for BM-Full. Let $lb_k$ and $ub_k$ denote the lower and upper bounds on the preactivation neurons in the $k$-th layer. Together, $lb_k$ and $ub_k$ give us the postactivation bound constraints $\max\{lb_k, 0\} \le x_k \le \max\{ub_k, 0\}$ for each postactivation neuron $x_k$ in (3). To incorporate these bound constraints into (BM-$r$), we first rewrite each of them as an elementwise $\ell_2$ constraint in which $\mathbf{e}_i^T x_k$ is restricted to an $\ell_2$ norm ball centered at $\frac{1}{2}\mathbf{e}_i^T(\max\{ub_k,0\} + \max\{lb_k,0\})$ with radius $\frac{1}{2}\mathbf{e}_i^T(\max\{ub_k,0\} - \max\{lb_k,0\})$:

$$
\max\{lb_k,0\} \le x_k \le \max\{ub_k,0\} \iff \|\mathbf{e}_i^T x_k - \mathbf{e}_i^T \hat{x}_k\|^2 \le \rho_k^2 \quad \text{for all } i \in \{1,\dots,n_k\},
$$

where $\hat{x}_k = \frac{1}{2}(\max\{ub_k,0\} + \max\{lb_k,0\})$ and $\rho_k = \frac{1}{2}(\max\{ub_k,0\} - \max\{lb_k,0\})$. The above elementwise $\ell_2$ constraint has the following Burer-Monteiro formulation

$$
\operatorname{diag}\bigl[(u_{k+1} - \hat{x}_{k+1})(u_{k+1} - \hat{x}_{k+1})^T + V_{k+1}V_{k+1}^T\bigr] \le \rho_{k+1}^2
$$

for $k \in \{1,\dots,\ell-1\}$. In turn, to verify neural networks trained on the MNIST dataset with respect to $\ell_2$ perturbations, we add the above bound constraints into (BM-$r$) and solve the following

$$
\phi_r[c] = \min_{u,V} \; w_\ell^T u_\ell \qquad \text{(BM-Full-}\ell_2\text{)}
$$

subject to, for $k \in \{1,\dots,\ell-1\}$:

$$
\begin{aligned}
&\|u_1 - \hat{x}\|^2 + \|V_1\|^2 \le \rho^2, && (y_0)\\
&u_1 \ge 0, \quad u_1 \le 1, && (y_{0,1},\, y_{0,2})\\
&\operatorname{diag}\bigl[(u_{k+1} - \hat{x}_{k+1})(u_{k+1} - \hat{x}_{k+1})^T + V_{k+1}V_{k+1}^T\bigr] \le \rho_{k+1}^2, && (y_k)\\
&u_{k+1} - W_k u_k - b_k \ge 0, && (y_{k,2})\\
&\operatorname{diag}\bigl[(u_{k+1} - W_k u_k - b_k)u_{k+1}^T + (V_{k+1} - W_k V_k)V_{k+1}^T\bigr] = 0, && (z_k)\\
&1 + \sum_{k=1}^{\ell-1}\bigl(\|u_k\|^2 + \|V_k\|^2\bigr) \le R^2. && (\mu)
\end{aligned}
$$

Notice that we delete the constraints $u_k \ge 0$ because they overlap with the bound constraints $\max\{lb_k,0\} \le u_k$. In Definition A.2, we summarize how to evaluate the slack matrix $S(y,z)$ and the dual variable $z_0$ in (SDD) using the primal and dual solution of (BM-Full-$\ell_2$) in order to calculate the bound in Proposition 3.1.

Definition A.2.

Let $y = (y_0, y_{0,1}, y_{0,2}, \{y_k, y_{k,2}\}_{k=1}^{\ell-1})$, $z = (\{z_k\}_{k=1}^{\ell-1})$, $u = (u_1,\dots,u_\ell)$ and $V = (V_1,\dots,V_\ell)$ be any certifiably first-order optimal point of (BM-Full-$\ell_2$). Each component of the slack matrix $S(y,z)$ and the dual variable $z_0$ in (SDD) can be evaluated as

$$
\begin{aligned}
s_0 &= -\sum_{k=1}^{\ell} s_k^T u_k, & z_0 &= y_0\bigl(\|\hat{x}\|^2 - \rho^2\bigr) + \sum_{k=1}^{\ell-1} y_k^T\bigl(\hat{x}_{k+1}^2 - \rho_{k+1}^2\bigr) - \mathbf{1}^T y_{0,2} + \sum_{k=1}^{\ell-1} b_k^T y_{k,2} - \tfrac{1}{2}s_0,\\
s_1 &= W_1^T y_{1,2} - 2\hat{x}y_0 - y_{0,1} + y_{0,2}, & s_\ell &= w_\ell - \bigl[Z_{\ell-1}b_{\ell-1} + 2\hat{X}_\ell y_{\ell-1} + y_{\ell-1,2}\bigr],\\
s_{k+1} &= W_{k+1}^T y_{k+1,2} - \bigl(Z_k b_k + 2\hat{X}_{k+1}y_k + y_{k,2}\bigr) && \text{for } k \in \{1,\dots,\ell-2\},\\
S_{1,1} &= 2y_0 I, \quad S_{k,k+1} = -W_k^T Z_k, \quad S_{k+1,k+1} = 2\bigl(Z_k + \hat{X}_k\bigr) && \text{for } k \in \{1,\dots,\ell-1\},
\end{aligned}
$$

where $Z_k = \operatorname{diag}(z_k)$ and $\hat{X}_k = \operatorname{diag}(\hat{x}_k)$ for all $k$.
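The box-to-ball conversion of Appendix A.2 reduces to a one-line computation. The sketch below is illustrative (the function name `box_to_ball` is ours); it maps preactivation bounds $(lb_k, ub_k)$ to the center $\hat{x}_k$ and elementwise radius $\rho_k$ used in (BM-Full-$\ell_2$).

```python
import numpy as np

def box_to_ball(lb, ub):
    """Center/radius form of the postactivation box of Appendix A.2.

    The box max{lb, 0} <= x_k <= max{ub, 0} equals the set of points
    within elementwise distance rho_k of x_hat_k, where
    x_hat_k = (max{ub, 0} + max{lb, 0}) / 2 and
    rho_k   = (max{ub, 0} - max{lb, 0}) / 2.
    """
    lo, hi = np.maximum(lb, 0.0), np.maximum(ub, 0.0)
    return (hi + lo) / 2.0, (hi - lo) / 2.0

# A neuron with preactivation bounds [-1, 3] has postactivation
# range [0, 3]: center 1.5, radius 1.5.
x_hat_k, rho_k = box_to_ball(np.array([-1.0, 0.5]), np.array([3.0, 2.0]))
```

Note that a provably inactive neuron ($ub_k \le 0$) yields $\hat{x}_k = \rho_k = 0$, i.e. the constraint pins the postactivation to zero.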

A.3 Efficient formulation of BM for ℓ ∞ norm

We now describe how to implement BM for verifying $\ell_\infty$ adversaries. In the case of MNIST images, the $\ell_\infty$ norm ball constraint on the input $x_1$, i.e. $\|x_1 - \hat{x}\|_\infty \le \rho$, can be combined with the valid input set constraint $0 \le x_1 \le 1$. Specifically, combining the two constraints yields $\max\{0, \hat{x} - \rho\} \le x_1 \le \min\{1, \hat{x} + \rho\}$. Similar to the postactivation bound constraints in BM-Full, this constraint can be written as an elementwise $\ell_2$ norm constraint:

$$
\max\{0,\hat{x}-\rho\} \le x_1 \le \min\{1,\hat{x}+\rho\} \iff \|\mathbf{e}_i^T x_1 - \mathbf{e}_i^T \hat{x}_1\|^2 \le \rho_1^2 \quad \text{for all } i \in \{1,\dots,n_1\},
$$

where $\hat{x}_1 = \frac{1}{2}(\min\{1,\hat{x}+\rho\} + \max\{0,\hat{x}-\rho\})$ and $\rho_1 = \frac{1}{2}(\min\{1,\hat{x}+\rho\} - \max\{0,\hat{x}-\rho\})$. The above constraint yields the following Burer-Monteiro formulation

$$
\operatorname{diag}\bigl[(u_1 - \hat{x}_1)(u_1 - \hat{x}_1)^T + V_1 V_1^T\bigr] \le \rho_1^2.
$$

In turn, to verify neural networks trained on MNIST with respect to $\ell_\infty$ perturbations, we solve (BM-$r$) in the following form

$$
\phi_r[c] = \min_{u,V} \; w_\ell^T u_\ell \qquad \text{(BM-}\ell_\infty\text{)}
$$

subject to, for $k \in \{1,\dots,\ell-1\}$:

$$
\begin{aligned}
&\operatorname{diag}\bigl[(u_1 - \hat{x}_1)(u_1 - \hat{x}_1)^T + V_1 V_1^T\bigr] \le \rho_1^2, && (y_0)\\
&u_{k+1} \ge 0, \quad u_{k+1} - W_k u_k - b_k \ge 0, && (y_{k,1},\, y_{k,2})\\
&\operatorname{diag}\bigl[(u_{k+1} - W_k u_k - b_k)u_{k+1}^T + (V_{k+1} - W_k V_k)V_{k+1}^T\bigr] = 0, && (z_k)\\
&1 + \sum_{k=1}^{\ell-1}\bigl(\|u_k\|^2 + \|V_k\|^2\bigr) \le R^2. && (\mu)
\end{aligned}
$$

Notice that the $\ell_\infty$ norm constraint $\|u_1 - \hat{x}\|_\infty \le \rho$ has been combined with the valid input set constraint $0 \le u_1 \le 1$. In Definition A.3, we summarize how to evaluate the slack matrix $S(y,z)$ and the dual variable $z_0$ in (SDD) using the primal and dual solution of (BM-$\ell_\infty$) in order to calculate the bound in Proposition 3.1.
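The combination of the $\ell_\infty$ ball with the valid input box described above can be sketched in a few lines. This is an illustrative sketch (the function name `linf_input_interval` is ours), producing the center $\hat{x}_1$ and elementwise radius $\rho_1$ used in (BM-$\ell_\infty$).

```python
import numpy as np

def linf_input_interval(x_hat, rho):
    """Center/radius of the clipped l_inf input interval of Appendix A.3.

    Intersecting ||x_1 - x_hat||_inf <= rho with the valid input box
    0 <= x_1 <= 1 gives max{0, x_hat - rho} <= x_1 <= min{1, x_hat + rho},
    whose center is x_hat_1 and elementwise radius is rho_1.
    """
    lo = np.maximum(x_hat - rho, 0.0)
    hi = np.minimum(x_hat + rho, 1.0)
    return (hi + lo) / 2.0, (hi - lo) / 2.0
```

For pixels near the boundary of the valid range (close to 0 or 1), the clipped interval is strictly smaller than the raw $\ell_\infty$ ball, so the combined constraint is tighter than enforcing the two constraints separately via the relaxation.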

Definition A.3.

Let $y = (y_0, \{y_{k,1}, y_{k,2}\}_{k=1}^{\ell-1})$, $z = (\{z_k\}_{k=1}^{\ell-1})$, $u = (u_1,\dots,u_\ell)$ and $V = (V_1,\dots,V_\ell)$ be any certifiably first-order optimal point of (BM-$\ell_\infty$). Each component of the slack matrix $S(y,z)$ and the dual variable $z_0$ in (SDD) can be evaluated as

$$
\begin{aligned}
s_0 &= -\sum_{k=1}^{\ell} s_k^T u_k, & z_0 &= y_0^T\bigl(\hat{x}_1^2 - \rho_1^2\bigr) + \sum_{k=1}^{\ell-1} b_k^T y_{k,2} - \tfrac{1}{2}s_0,\\
s_1 &= W_1^T y_{1,2} - 2Y_0\hat{x}, & s_\ell &= w_\ell - \bigl[Z_{\ell-1}b_{\ell-1} + y_{\ell-1,1} + y_{\ell-1,2}\bigr],\\
s_{k+1} &= W_{k+1}^T y_{k+1,2} - \bigl(Z_k b_k + y_{k,1} + y_{k,2}\bigr) && \text{for } k \in \{1,\dots,\ell-2\},\\
S_{1,1} &= 2Y_0, \quad S_{k,k+1} = -W_k^T Z_k, \quad S_{k+1,k+1} = 2Z_k && \text{for } k \in \{1,\dots,\ell-1\},
\end{aligned}
$$

where $Z_k = \operatorname{diag}(z_k)$ for all $k$ and $Y_0 = \operatorname{diag}(y_0)$.

A.4 Efficient formulation of BM-Full for ℓ ∞ norm

Combining the results in Appendices A.2 and A.3, to verify $\ell_\infty$ adversaries via BM-Full, we solve (BM-$r$) in the following form

$$
\phi_r[c] = \min_{u,V} \; w_\ell^T u_\ell \qquad \text{(BM-Full-}\ell_\infty\text{)}
$$

subject to, for $k \in \{1,\dots,\ell-1\}$:

$$
\begin{aligned}
&\operatorname{diag}\bigl[(u_1 - \hat{x}_1)(u_1 - \hat{x}_1)^T + V_1 V_1^T\bigr] \le \rho_1^2, && (y_0)\\
&\operatorname{diag}\bigl[(u_{k+1} - \hat{x}_{k+1})(u_{k+1} - \hat{x}_{k+1})^T + V_{k+1}V_{k+1}^T\bigr] \le \rho_{k+1}^2, && (y_k)\\
&u_{k+1} - W_k u_k - b_k \ge 0, && (y_{k,2})\\
&\operatorname{diag}\bigl[(u_{k+1} - W_k u_k - b_k)u_{k+1}^T + (V_{k+1} - W_k V_k)V_{k+1}^T\bigr] = 0, && (z_k)\\
&1 + \sum_{k=1}^{\ell-1}\bigl(\|u_k\|^2 + \|V_k\|^2\bigr) \le R^2, && (\mu)
\end{aligned}
$$

where $\hat{x}_1$ and $\rho_1$ are defined in Appendix A.3, and $\hat{x}_k$ and $\rho_k$ for $k \in \{2,\dots,\ell-1\}$ are defined in Appendix A.2. Notice that we again delete the constraints $u_k \ge 0$ because they overlap with the bound constraints $\max\{lb_k,0\} \le u_k$. In Definition A.4, we summarize how to evaluate the slack matrix $S(y,z)$ and the dual variable $z_0$ in (SDD) using the primal and dual solution of (BM-Full-$\ell_\infty$) in order to calculate the bound in Proposition 3.1.

Definition A.4.

Let $y = (y_0, \{y_k, y_{k,2}\}_{k=1}^{\ell-1})$, $z = (\{z_k\}_{k=1}^{\ell-1})$, $u = (u_1,\dots,u_\ell)$ and $V = (V_1,\dots,V_\ell)$ be any certifiably first-order optimal point of (BM-Full-$\ell_\infty$). Each component of the slack matrix $S(y,z)$ and the dual variable $z_0$ in (SDD) can be evaluated as

$$
\begin{aligned}
s_0 &= -\sum_{k=1}^{\ell} s_k^T u_k, & z_0 &= \sum_{k=0}^{\ell-1} y_k^T\bigl(\hat{x}_{k+1}^2 - \rho_{k+1}^2\bigr) + \sum_{k=1}^{\ell-1} b_k^T y_{k,2} - \tfrac{1}{2}s_0,\\
s_1 &= W_1^T y_{1,2} - 2Y_0\hat{x}, & s_\ell &= w_\ell - \bigl[Z_{\ell-1}b_{\ell-1} + 2\hat{X}_\ell y_{\ell-1} + y_{\ell-1,2}\bigr],\\
s_{k+1} &= W_{k+1}^T y_{k+1,2} - \bigl(Z_k b_k + 2\hat{X}_{k+1}y_k + y_{k,2}\bigr) && \text{for } k \in \{1,\dots,\ell-2\},\\
S_{1,1} &= 2Y_0, \quad S_{k,k+1} = -W_k^T Z_k, \quad S_{k+1,k+1} = 2\bigl(Z_k + \hat{X}_k\bigr) && \text{for } k \in \{1,\dots,\ell-1\},
\end{aligned}
$$

where $Z_k = \operatorname{diag}(z_k)$ and $\hat{X}_k = \operatorname{diag}(\hat{x}_k)$ for all $k$, and $Y_0 = \operatorname{diag}(y_0)$.

A.5 Efficient algorithm for solving BM and BM-Full

The efficient formulations described in Appendices A.1 to A.4 can be solved using a procedure similar to the one described in Algorithm 1. In this section, we focus our attention on the practical and efficient algorithm for solving (BM-$\ell_2$). We start by describing the initialization scheme for the primal variables $u_k$ and $V_k$ in (BM-$\ell_2$), and then we summarize the efficient procedure for solving (BM-$\ell_2$) in Algorithm 2. Notice that Algorithm 2 can be easily extended to (BM-Full-$\ell_2$), (BM-$\ell_\infty$) and (BM-Full-$\ell_\infty$).

Initialize the primal variables.

We initialize each $u_k$ and $V_k$ in (BM-$\ell_2$) as close to optimal as possible, as follows. First, apply PGD to estimate $x_1,\dots,x_\ell$ in the following semi-targeted attack problem (A-$\ell_2$), which is the original semi-targeted attack problem (A) with the valid input set constraint:

$$
\phi[c] = \min_{x_1,\dots,x_\ell} \; w_\ell^T x_\ell \quad \text{s.t.} \quad x_{k+1} = \max\{0,\, W_k x_k + b_k\},\quad 0 \le x_1 \le 1,\quad \|x_1 - \hat{x}\| \le \rho. \qquad \text{(A-}\ell_2\text{)}
$$

Second, initialize each $u_k$ to $x_k$; since (BM-$\ell_2$) is a nonconvex relaxation of (A-$\ell_2$), $x_k$ is usually a good initialization for $u_k$. Finally, initialize each $V_k$ to a random matrix with small entries. The small initialization of each $V_k$ reduces the degree of constraint violation at the initial point.
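The warm-start described above can be sketched as follows. This is a simplified illustrative sketch: the PGD step here is a plain gradient step with $\ell_2$-ball and box projections, not necessarily the exact attack used in our experiments, and the function name and step-size defaults are ours.

```python
import numpy as np

def pgd_warm_start(W, b, w_out, x_hat, rho, r, steps=200, lr=0.05, seed=0):
    """Warm-start for (BM-l2): run a simple l2-projected gradient attack
    to estimate x_1, ..., x_l, then set u_k = x_k and V_k to small random
    matrices with r columns."""
    rng = np.random.default_rng(seed)
    x = x_hat.copy()
    for _ in range(steps):
        # Forward pass, recording ReLU masks for the gradient
        masks, h = [], x
        for Wk, bk in zip(W, b):
            pre = Wk @ h + bk
            masks.append(pre > 0)
            h = np.maximum(pre, 0.0)
        # Backward pass: gradient of w_out^T x_l with respect to the input
        g = w_out
        for Wk, m in zip(reversed(W), reversed(masks)):
            g = Wk.T @ (g * m)
        # Descend (we minimize the margin), then project onto ball and box
        x = x - lr * g
        d = x - x_hat
        nrm = np.linalg.norm(d)
        if nrm > rho:
            x = x_hat + d * (rho / nrm)
        x = np.clip(x, 0.0, 1.0)
    # Final trajectory x_1, ..., x_l and the factor initialization
    acts, h = [x], x
    for Wk, bk in zip(W, b):
        h = np.maximum(Wk @ h + bk, 0.0)
        acts.append(h)
    u0 = acts
    V0 = [1e-3 * rng.standard_normal((a.shape[0], r)) for a in acts]
    return u0, V0
```

Because the returned $u_k$ trace a feasible trajectory of the network and each $V_k$ is near zero, the initial point nearly satisfies the constraints of (BM-$\ell_2$).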

Algorithm 2 Efficient algorithm for (BM- ℓ 2 )

Input: Initial relaxation rank 𝑟 ≥ 2 . Weights 𝑊 1 , … , 𝑊 ℓ and biases 𝑏 1 , … , 𝑏 ℓ . Original input 𝑥 ^ , true label 𝑐 ^ , target label 𝑐 , and perturbation size 𝜌 . Variable radius bound 𝑅 .

Output: Lower-bound 𝜙 lb ⁢ [ 𝑐 ] ≤ 𝜙 ⁢ [ 𝑐 ] on the optimal value of the semi-targeted attack problem (A- ℓ 2 ).

Initialization: Initialize the primal variable $x_+ = (\{x_k\}_{k=1}^{\ell}, \{\operatorname{vec}(M_k)\}_{k=1}^{\ell})$, where $x_1,\dots,x_\ell$ are estimated by solving (A-$\ell_2$) via PGD, and each $M_k$ is a random matrix with small entries that has the same shape as $V_k$. Initialize the dual variables $y = (y_0, \{y_{k,1}, y_{k,2}\}_{k=0}^{\ell-1}) = 0$ and $z = (\{z_k\}_{k=1}^{\ell-1}) = 0$.

Algorithm:

Appendix B Additional experiment for ℓ 2 norm

Model architectures

In this experiment, we consider three deeper feedforward ReLU networks trained on the MNIST dataset: MLP-4×100, MLP-6×100 and MLP-9×100. All three networks are adversarially trained using Madry et al. (2018) with ℓ ∞ radius equal to 0.1 (we train all three models using the code available at https://github.com/locuslab/convex_adversarial/blob/master/examples/mnist.py). MLP-4×100 has 4 hidden layers of 100 neurons each, MLP-6×100 has 6, and MLP-9×100 has 9.

B.1 Tightness plots for deeper neural networks

It is known that the relaxation gap generally increases with the number of layers in the neural network. To demonstrate the performance of our proposed methods on deeper networks, in this experiment we measure the relaxation gap of BM-Full and BM on models of three different depths.

Results and discussions.

Figure 3 plots the average bounds against the ℓ 2 perturbation radius for three MNIST networks with 4, 6 and 9 hidden layers of 100 neurons each. Notably, BM becomes loose for MLP-6×100 and becomes looser than LP-Full for MLP-9×100. This result is expected and is consistent with Zhang (2020); in particular, the SDP relaxation of the ReLU gate, without any bound constraints on the preactivations, does become loose over multiple layers. On the other hand, BM-Full remains significantly tighter than LP-Full in all cases. Furthermore, since the preactivation bounds used in BM-Full and LP-Full are the same in this experiment, Figure 2 and Figure 3 suggest that, with the same quality of preactivation bounds, BM-Full yields a tighter relaxation than LP-Full. Based on this finding, we note that the preactivation bounds for BM-Full can also be computed via nonconvex relaxation methods, which should yield tighter bounds on the preactivations and hence further reduce the relaxation gap of BM-Full. One example would be to recursively apply BM-Full to compute the upper and lower bound on each neuron; however, such a method can be extremely computationally expensive even for small-to-medium-sized networks. We leave BM-Full with better preactivation bounds to future work.

Figure 3: Lower bound on (A) with different network depths (ℓ 2 norm). We compute the average lower bound on (A) for three models with 4, 6 and 9 hidden layers of 100 neurons each, respectively. The upper bound on the robustness margin is estimated via PGD. Observe that the gap between the PGD upper bound and the lower bound from BM-Full is significantly smaller than that from LP-Full. We note that without bound propagation, BM does get loose when the number of hidden layers exceeds 6. (Left.) MLP-4×100. (Middle.) MLP-6×100. (Right.) MLP-9×100.

B.2 Visualizing adversarial attacks and robustness verification

To illustrate why robustness verification is important in image classification, we perform a case study on an image in the test set using the model ADV-MNIST. In particular, we focus on how ℓ 2 adversaries look in practice for four perturbation radii 𝜌 ∈ {0.5, 1.0, 1.7, 2.0}, as well as their corresponding lower bounds on (A) computed from our nonconvex and LP-based verifiers. We choose the ℓ 2 norm over the ℓ ∞ norm for this experiment because the ℓ 2 norm allows perturbations to be concentrated on a small group of pixels, which produces adversaries that are more perceptually consistent with the original dataset when the perturbation radius is large.

Results and discussions.

Figure 4 shows: the adversarial attacks targeting the second most probable class of an MNIST image of class "1"; the probability of the top-2 classes for the attacked image (calculated using the softmax function); the PGD upper bound; and the lower bounds from the different verifiers. For the MNIST image shown in the figure, the second most probable class is "7", and its adversarial attacks are computed via PGD.

The first and second columns of Figure 4 correspond to attacks within two small ℓ 2 perturbation radii 𝜌 ∈ {0.5, 1.0}. We see that every verifier is able to verify robustness for 𝜌 = 0.5; however, only BM-Full and BM are able to verify robustness for 𝜌 = 1.0. All the other verifiers fail because their corresponding lower bounds become loose. In both cases, the adversaries look very similar to the original image.

The third and fourth columns of Figure 4 correspond to attacks within two large ℓ 2 perturbation radii 𝜌 ∈ {1.7, 2.0}. Notice that BM-Full and BM can still verify robustness for 𝜌 = 1.7, even though the image is at the boundary of becoming not robust. In addition, the image is not robust for 𝜌 = 2.0, as the PGD upper bound is negative. Notably, both images start gaining features of the digit "7".

Figure 4: Visualizing ℓ 2 adversarial examples for ADV-MNIST. We take an image in the test set to compare the capability of different verifiers for certifying robustness within four different ℓ 2 radii 𝜌 ∈ {0.5, 1.0, 1.7, 2.0}. Each column in the figure shows (left to right): (Column 1.) The image is robust for 𝜌 = 0.5, and it can be certified by all verifiers. (Column 2.) The image is robust for 𝜌 = 1.0, but it can only be certified by our verifiers, BM-Full and BM. (Column 3.) The image is close to becoming not robust for 𝜌 = 1.7, but it can still be certified by our verifiers. (Column 4.) The image is not robust for 𝜌 = 2.0.

Appendix C Additional experiment for ℓ ∞ norm

In this section, we apply BM and BM-Full to the same experiments as in Section 5 and Appendix B.1, now with respect to ℓ ∞ perturbations. As in the ℓ 2 experiments, we set the preactivation bounds in BM-Full to be the same as those in LP-Full. The neural network models used in this section are defined in Section 5 and Appendix B.

C.1 Robust verification for ℓ ∞ adversaries

We apply BM and BM-Full to verify the robustness of the first 1000 correctly classified images. The simulation settings in this experiment are the same as those in Table 1.

Results and discussions.

Table 2 shows the number of images certified as robust among the first 1000 correctly classified images. We see that BM-Full consistently outperforms the LP-based verifiers in all cases. BM outperforms LP-Full on ADV-MNIST and NOR-MNIST but achieves similar performance on LPD-MNIST. This is because LPD-MNIST is robustly trained by maximizing the dual lower bound of an LP relaxation over a convex outer polytope (Wong & Kolter, 2018); therefore, LP-based verifiers tend to work well for models that are robustly trained using this method. However, we note that even though models trained using Wong & Kolter (2018) may be efficiently verified via LP-based verifiers, the robust training method of Wong & Kolter (2018) is generally more conservative than the method of Madry et al. (2018), as it optimizes over a convex outer polytope, which results in lower test accuracy in the final model. In this experiment, LPD-MNIST, the model trained using Wong & Kolter (2018), has test accuracy 95.91%, and ADV-MNIST, the model trained using Madry et al. (2018), has accuracy 96.67%. The performance of our verifiers, BM-Full and BM, is invariant to both robust training methods, achieving a similar level of tightness on all three models in Table 2. We provide a thorough analysis of the tightness of BM-Full and BM in the next experiment.

Table 2: Robustness verification for neural networks. We compare the number of images certified as robust by BM-Full, BM, α,β-CROWN, LP-Full, CROWN-Ada and Fast-Lip within the first 1000 images for normally and robustly trained networks. The upper bound (denoted UB) on the true number of robust images is obtained by PGD. Each cell reports the certified count with the solve time in parentheses; the best certified count in each row is shown in bold.

| Network | ℓ∞ Radius | PGD (UB) | BM-Full | BM | α,β-CROWN | LP-Full | CROWN-Ada | Fast-Lip |
|---|---|---|---|---|---|---|---|---|
| ADV-MNIST | 0.10 | 831 | **791** (87s) | 760 (124s) | 791 (2s) | 314 (16s) | 8 (10ms) | 4 (13ms) |
| ADV-MNIST | 0.13 | 731 | 632 (102s) | 574 (144s) | **673** (11s) | 46 (17s) | 1 (9ms) | 0 (13ms) |
| ADV-MNIST | 0.15 | 626 | 484 (127s) | 366 (125s) | **535** (22s) | 11 (15s) | 0 (9ms) | 0 (14ms) |
| LPD-MNIST | 0.10 | 868 | **855** (125s) | 828 (163s) | 818 (0.2s) | 829 (13s) | 589 (12ms) | 540 (10ms) |
| LPD-MNIST | 0.13 | 791 | **768** (104s) | 713 (126s) | 743 (0.2s) | 689 (13s) | 154 (10ms) | 120 (11ms) |
| LPD-MNIST | 0.15 | 727 | **672** (120s) | 597 (132s) | 672 (0.3s) | 545 (12s) | 43 (9ms) | 32 (12ms) |
| NOR-MNIST | 0.02 | 910 | **898** (99s) | 859 (116s) | 881 (1.4s) | 686 (3s) | 130 (9ms) | 88 (12ms) |
| NOR-MNIST | 0.03 | 775 | **729** (143s) | 617 (107s) | 713 (16s) | 278 (4s) | 8 (8ms) | 3 (12ms) |
| NOR-MNIST | 0.05 | 401 | **267** (229s) | 127 (154s) | 238 (43s) | 10 (5s) | 0 (12ms) | 0 (17ms) |

C.2 Tightness plots

We apply BM-Full and BM to compute the average lower bound on (A) with respect to the ℓ∞ perturbation radius. The simulation settings in this experiment are the same as those in Figure 2.

Results and discussions.

As demonstrated in Figure 5, both BM-Full and BM achieve tighter lower bounds than the LP-based verifiers for all three models, across a wide range of ℓ∞ perturbation radii. However, the gap between BM (green line) and LP-Full (blue line) is much smaller for LPD-MNIST than for NOR-MNIST and ADV-MNIST, especially within the interval 0.1 to 0.15. This explains why BM and LP-Full achieve similar performance on LPD-MNIST in Table 2.

Figure 5: Lower bounds on the robustness margin. We use BM-Full, BM, LP-Full, CROWN-Ada and Fast-Lip to compute their average lower bounds on (A), and then compare them to the average PGD upper bound on (A) over a wide range of ℓ∞ perturbation radii. (Left.) NOR-MNIST. (Middle.) ADV-MNIST. (Right.) LPD-MNIST.

C.3 Tightness plots for deeper neural networks

We apply BM-Full and BM to compute the average lower bound on (A) for three robustly trained MNIST models of different depths. The simulation settings in this experiment are the same as those in Figure 3.

Results and discussions.

We plot the average lower bound on (A) for three MNIST models of different depths. Similar to the results in Figure 3, we see that BM becomes loose when the network has more than 6 layers; this is expected, as the SDP relaxation, without any bound propagation, does become loose for networks with many layers (Zhang, 2020). Although BM becomes loose on deeper networks, BM-Full remains significantly tighter than LP-Full in all cases. We again emphasize that the preactivation bounds used in BM-Full are the same as those in LP-Full in this experiment; those bounds could be tightened using the nonconvex relaxation techniques proposed in this paper, which we defer to future work.

Figure 6: Lower bound on (A) with different network depths (ℓ∞ norm). We compute the average lower bound on (A) for three models with 4, 6, 9 hidden layers of 100 neurons each, respectively. Observe that BM-Full is significantly tighter than LP-Full in all cases. We note that without bound propagation, BM does become loose when the number of hidden layers is more than 6. (Left.) MLP-4×100. (Middle.) MLP-6×100. (Right.) MLP-9×100.

Appendix D Derivation of the dual problem (SDD)

Recall that the primal problem (3) is written (with the corresponding Lagrange multipliers given in parentheses) as follows:

$$
\phi_r[c] \;=\; \min_{X \in \mathbb{S}^{n+1}} \; w_\ell^T x_\ell + w_0 x_0 \tag{SDP-$r$}
$$
$$
\begin{aligned}
\text{s.t.}\quad & \operatorname{tr}(X_{1,1}) - 2x_1^T\hat{x} + x_0\|\hat{x}\|^2 \le x_0\rho^2, \qquad x_0 = 1 && (y_0,\; z_0)\\
& x_{k+1} \ge 0, \qquad x_{k+1} \ge W_k x_k + b_k x_0 && (y_{k,1},\; y_{k,2})\\
& \operatorname{diag}\!\left(X_{k+1,k+1} - W_k X_{k,k+1} - b_k x_{k+1}^T\right) = 0 && (z_k)\\
& \operatorname{tr}(X) \le R^2 && (\mu)\\
& X = \begin{bmatrix} x_0 & x_1^T & \cdots & x_\ell^T\\ x_1 & X_{1,1} & \cdots & X_{1,\ell}\\ \vdots & \vdots & \ddots & \vdots\\ x_\ell & X_{1,\ell}^T & \cdots & X_{\ell,\ell}\end{bmatrix} \succeq 0, \qquad \operatorname{rank}(X) \le r && (S)
\end{aligned}
$$

for all 𝑘 ∈ { 1 , … , ℓ − 1 } . This problem can be viewed as a generic instance of the following primal problem

$$
\min_{X \succeq 0,\; \operatorname{rank}(X)\le r} \langle F, X\rangle \quad \text{s.t.}\quad \mathcal{G}_0(X) \le 0,\quad \mathcal{G}_k(X) \le 0,\quad \mathcal{H}_0(X) + 1 = 0,\quad \mathcal{H}_k(X) = 0,\quad \operatorname{tr}(X) \le R^2
$$

for all $k \in \{1, \dots, \ell-1\}$. Here, we implicitly define $F$ to satisfy $\langle F, X\rangle = w_\ell^T x_\ell + w_0 x_0$, and the linear constraint operators are respectively

$$
\begin{aligned}
\mathcal{G}_0(X) &= \operatorname{tr}(X_{1,1}) - 2x_1^T\hat{x} + (\|\hat{x}\|^2 - \rho^2)x_0, &
\mathcal{G}_k(X) &= \begin{bmatrix} -x_{k+1} \\ W_k x_k + b_k x_0 - x_{k+1} \end{bmatrix},\\
\mathcal{H}_0(X) &= -x_0, &
\mathcal{H}_k(X) &= \operatorname{diag}\!\left(X_{k+1,k+1} - W_k X_{k,k+1} - b_k x_{k+1}^T\right).
\end{aligned}
$$

Notice that $S$ is the dual variable associated with the constraint $X \succeq 0$; it satisfies $S \succeq 0$ and $\operatorname{rank}(X) + \operatorname{rank}(S) \le n+1$ at optimality. Setting $y = (y_0, \{y_{k,1}, y_{k,2}\}_{k=1}^{\ell-1}) \ge 0$, $z = (z_0, \{z_k\}_{k=1}^{\ell-1})$ and $\mu \le 0$, the Lagrangian of (3) reads

$$
\begin{aligned}
\mathcal{L}(X, y, z, \mu, S) &= \langle F, X\rangle + \langle z_0, \mathcal{H}_0(X) + 1\rangle + \langle y_0, \mathcal{G}_0(X)\rangle + \sum_{k=1}^{\ell-1}\left[\left\langle \begin{bmatrix} y_{k,1}\\ y_{k,2}\end{bmatrix}, \mathcal{G}_k(X)\right\rangle + \langle z_k, \mathcal{H}_k(X)\rangle\right]\\
&\qquad - \mu\left(\operatorname{tr} X - R^2\right) - \langle S, X\rangle\\
&= z_0 + R^2\mu + \left\langle F + \mathcal{G}_0^T(y_0) + \mathcal{H}_0^T(z_0) + \sum_{k=1}^{\ell-1}\left[\mathcal{G}_k^T(y_{k,1}, y_{k,2}) + \mathcal{H}_k^T(z_k)\right] - S - \mu I,\; X\right\rangle\\
&= z_0 + R^2\mu + \langle S(y,z) - S - \mu I,\; X\rangle,
\end{aligned}
$$

where we use the superscript "$T$" to indicate the adjoint operators. Setting $S = S(y,z) - \mu I \succeq 0$ yields the dual problem

$$
\max_{y \ge 0,\; z,\; \mu \le 0} \; z_0 + R^2\mu \quad \text{s.t.} \quad S(y,z) \succeq \mu I. \tag{SDD}
$$
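Any feasible dual point certifies the bound $z_0 + R^2\mu$, and an infeasible one can still be salvaged by shifting with the most negative eigenvalue of the slack matrix, as in Proposition 3.1 below. A minimal sketch of this safe-lower-bound computation (the small slack matrices, `z0` and `R` here are hypothetical stand-ins for the quantities defined above):

```python
import numpy as np

def dual_lower_bound(S, z0, R):
    """Safe lower bound z0 + R^2 * min(0, lambda_min(S)) from a candidate
    dual slack matrix S; valid even when S is not positive semidefinite."""
    lam_min = np.linalg.eigvalsh(S)[0]  # eigvalsh returns ascending eigenvalues
    return z0 + R**2 * min(0.0, lam_min)

# A PSD slack matrix certifies the bound z0 itself.
S_psd = np.array([[2.0, 0.5], [0.5, 1.0]])
print(dual_lower_bound(S_psd, z0=3.0, R=10.0))   # prints 3.0 (lambda_min > 0)

# An indefinite slack matrix is penalized by its most negative eigenvalue,
# giving roughly 3.0 - 100 * 0.1 = -7.
S_indef = np.array([[1.0, 0.0], [0.0, -0.1]])
print(dual_lower_bound(S_indef, z0=3.0, R=10.0))
```

The eigenvalue shift is what makes the bound usable in practice: the solver's dual iterates need not be exactly feasible for the certificate to remain valid.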

Finally, we derive an explicit expression for the slack matrix

$$
S(y,z) \equiv \frac{1}{2}\begin{bmatrix}
s_0 & s_1^T & s_2^T & \cdots & s_\ell^T\\
s_1 & S_{1,1} & S_{1,2} & & \\
s_2 & S_{1,2}^T & S_{2,2} & \ddots & \\
\vdots & & \ddots & \ddots & S_{\ell-1,\ell}\\
s_\ell & & & S_{\ell-1,\ell}^T & S_{\ell,\ell}
\end{bmatrix}
= F + \mathcal{G}_0^T(y_0) + \mathcal{H}_0^T(z_0) + \sum_{k=1}^{\ell-1}\left[\mathcal{G}_k^T(y_{k,1}, y_{k,2}) + \mathcal{H}_k^T(z_k)\right].
$$

Here, we write out the adjoint operators in terms of the scalar, vector, and matrix components of 𝑋 :

$$
\begin{aligned}
\langle X, F\rangle &= x_0 \cdot w_0 + \langle x_\ell, w_\ell\rangle\\
\langle X, \mathcal{H}_0^T(z_0)\rangle &= x_0 \cdot (-z_0)\\
\langle X, \mathcal{G}_0^T(y_0)\rangle &= x_0 \cdot y_0(\|\hat{x}\|^2 - \rho^2) + \langle x_1, -2y_0\hat{x}\rangle + \langle X_{1,1}, y_0 I\rangle\\
\langle X, \mathcal{G}_k^T(y_{k,1}, y_{k,2})\rangle &= x_0 \cdot y_{k,2}^T b_k + \left\langle \begin{bmatrix} x_k\\ x_{k+1}\end{bmatrix}, \begin{bmatrix} W_k^T y_{k,2}\\ -(y_{k,1} + y_{k,2})\end{bmatrix}\right\rangle\\
\langle X, \mathcal{H}_k^T(z_k)\rangle &= \langle x_{k+1}, -Z_k b_k\rangle + \left\langle \begin{bmatrix} X_{k,k+1}\\ X_{k+1,k+1}\end{bmatrix}, \begin{bmatrix} -W_k^T Z_k\\ Z_k\end{bmatrix}\right\rangle
\end{aligned}
$$

where $Z_k = \operatorname{diag}(z_k)$. Isolating the scalar terms in $x_0$, we obtain the expression for $s_0$:

$$
s_0 = 2\left[w_0 + y_0(\|\hat{x}\|^2 - \rho^2) + \sum_{k=1}^{\ell-1} b_k^T y_{k,2} - z_0\right].
$$

Isolating the vector terms 𝑥 1 , 𝑥 2 , … , 𝑥 ℓ , we see that the following indeed holds for 𝑠 1 , … , 𝑠 ℓ

$$
\begin{aligned}
s_1 &= W_1^T y_{1,2} - 2\hat{x}y_0, \qquad s_\ell = w_\ell - \left[Z_{\ell-1} b_{\ell-1} + y_{\ell-1,1} + y_{\ell-1,2}\right],\\
s_{k+1} &= W_{k+1}^T y_{k+1,2} - \left(Z_k b_k + y_{k,1} + y_{k,2}\right) \quad \text{for } k \in \{1, \dots, \ell-2\}.
\end{aligned}
$$

Finally, isolating the matrix terms 𝑋 𝑖 , 𝑗 for 𝑖 , 𝑗 ∈ { 1 , … , ℓ } , we have

$$
S_{1,1} = 2y_0 I, \qquad S_{k,k+1} = -W_k^T Z_k, \qquad S_{k+1,k+1} = 2Z_k \quad \text{for } k \in \{1, \dots, \ell-1\}.
$$

Appendix E Proof of the Main Results

E.1 Proof of Proposition 3.1: the dual lower bounds for (3)

Recall from the previous section that we have rewritten (3) in the following generic form:
$$
\min_{X \succeq 0,\; \operatorname{rank}(X)\le r} \langle F, X\rangle \quad \text{s.t.}\quad \mathcal{G}_0(X) \le 0,\quad \mathcal{G}_k(X) \le 0,\quad \mathcal{H}_0(X) + 1 = 0,\quad \mathcal{H}_k(X) = 0,\quad \operatorname{tr}(X) \le R^2.
$$

Now we are ready to prove 3.1.

Proof.

Let $X^\star$ denote the global solution of (3) with rank $r \ge 1$ and $\operatorname{tr}(X^\star) \le R^2$. Then, for any dual multipliers $y = (y_0, \{y_{k,1}, y_{k,2}\}_{k=1}^{\ell-1})$ and $z = (z_0, \{z_k\}_{k=1}^{\ell-1})$ that satisfy $y \ge 0$, we have
$$
\begin{aligned}
\phi[c] \ge \phi_r[c] &= \langle F, X^\star\rangle\\
&= \left\langle S(y,z) - \mathcal{G}_0^T(y_0) - \mathcal{H}_0^T(z_0) - \sum_{k=1}^{\ell-1}\left[\mathcal{G}_k^T(y_{k,1}, y_{k,2}) + \mathcal{H}_k^T(z_k)\right],\; X^\star\right\rangle\\
&= \langle S(y,z), X^\star\rangle - \underbrace{\left\langle \begin{bmatrix}\mathcal{G}_0(X^\star)\\ \vdots\\ \mathcal{G}_{\ell-1}(X^\star)\end{bmatrix}, y\right\rangle}_{\le 0} - \underbrace{\left\langle \begin{bmatrix}\mathcal{H}_0(X^\star)\\ \vdots\\ \mathcal{H}_{\ell-1}(X^\star)\end{bmatrix}, z\right\rangle}_{= -z_0}\\
&\ge z_0 + \langle S(y,z), X^\star\rangle\\
&\ge z_0 + R^2 \cdot \min\{0, \lambda_{\min}[S(y,z)]\}. \qquad\blacksquare
\end{aligned}
$$

E.2 Proof of the dual lower bound for (BM- 𝑟 )

We start by putting (BM-$r$) into the standard form of a nonlinear program (NLP). In the previous section, we rewrote the original SDP problem (3) in the following generic form:
$$
\min_{X \succeq 0,\; \operatorname{rank}(X)\le r} \langle F, X\rangle \quad \text{s.t.}\quad \mathcal{G}_0(X) \le 0,\quad \mathcal{G}_k(X) \le 0,\quad \mathcal{H}_0(X) + 1 = 0,\quad \mathcal{H}_k(X) = 0,\quad \operatorname{tr}(X) \le R^2.
$$

Every $(n+1)\times(n+1)$ positive semidefinite matrix $X$ of rank at most $r$ admits an $(n+1)\times r$ lower-triangular Cholesky factorization $U$ satisfying $X = UU^T$. Substituting
$$
X = \begin{bmatrix} u_0^2 & u_0 \cdot u^T\\ u_0 \cdot u & uu^T + VV^T\end{bmatrix}
= \begin{bmatrix} u_0 & 0\\ u_1 & V_1\\ \vdots & \vdots\\ u_\ell & V_\ell\end{bmatrix}
\begin{bmatrix} u_0 & 0\\ u_1 & V_1\\ \vdots & \vdots\\ u_\ell & V_\ell\end{bmatrix}^T = UU^T,
$$

as in (1), we obtain the generic form of our proposed Burer–Monteiro formulation (BM- 𝑟 )

$$
\min_{\|U\| \le R} \langle F, UU^T\rangle \quad \text{s.t.} \quad \mathcal{G}_0(UU^T) \le 0, \quad \mathcal{G}_k(UU^T) \le 0, \quad \mathcal{H}_0(UU^T) + 1 = 0, \quad \mathcal{H}_k(UU^T) = 0.
$$
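As a quick sanity check on the substitution above, the following sketch (with hypothetical sizes $n = 4$, $r = 3$) confirms numerically that $U = [u_0, 0; u, V]$ reproduces the blocks $u_0^2$, $u_0 u$ and $uu^T + VV^T$ of $X = UU^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 4, 3                           # hypothetical dimensions
u0 = 1.0                              # scalar x0-block of the factorization
u = rng.standard_normal(n)            # first column, lower block
V = rng.standard_normal((n, r - 1))   # remaining r-1 columns

# Assemble U = [u0, 0; u, V] and form X = U U^T.
U = np.block([[np.array([[u0]]), np.zeros((1, r - 1))],
              [u[:, None], V]])
X = U @ U.T

# Check the three blocks of X against the closed-form substitution.
assert np.isclose(X[0, 0], u0**2)
assert np.allclose(X[1:, 0], u0 * u)
assert np.allclose(X[1:, 1:], np.outer(u, u) + V @ V.T)
assert np.linalg.matrix_rank(X) <= r  # rank(X) <= r by construction
print("blocks match")
```

The factorization enforces both $X \succeq 0$ and $\operatorname{rank}(X) \le r$ for free, which is precisely why the Burer–Monteiro formulation can drop those two constraints.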

To solve (BM-$r$) as an instance of the standard-form nonlinear program (NLP), define

$$
x \equiv \begin{bmatrix} u_0\\ u_1\\ \vdots\\ u_\ell\\ \operatorname{vec}(V_1)\\ \vdots\\ \operatorname{vec}(V_\ell)\end{bmatrix}, \quad
f(x) \equiv \langle F, UU^T\rangle, \quad
g(x) \equiv \underbrace{\begin{bmatrix} \mathcal{G}_0(UU^T)\\ \mathcal{G}_1(UU^T)\\ \vdots\\ \mathcal{G}_{\ell-1}(UU^T)\end{bmatrix}}_{\mathcal{G}(UU^T)}, \quad
h(x) \equiv \underbrace{\begin{bmatrix} \mathcal{H}_0(UU^T)\\ \mathcal{H}_1(UU^T)\\ \vdots\\ \mathcal{H}_{\ell-1}(UU^T)\end{bmatrix}}_{\mathcal{H}(UU^T)} + \underbrace{\begin{bmatrix} 1\\ 0\\ \vdots\\ 0\end{bmatrix}}_{h_0};
$$

we obtain the standard-form nonlinear program (NLP)
$$
\min_{\|x\| \le R} f(x) \quad \text{s.t.} \quad g(x) \le 0, \quad h(x) = 0.
$$
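Off-the-shelf solvers accept exactly this standard form. As an illustration only (a toy $f$, $g$, $h$ rather than the actual (BM-$r$) data), one can hand $\min f$ s.t. $g \le 0$, $h = 0$ to scipy's SLSQP solver, which returns a KKT point of the kind discussed next:

```python
import numpy as np
from scipy.optimize import minimize

# Toy standard-form NLP: min x0 + x1  s.t.  x0^2 + x1^2 - 1 <= 0,  x0 - x1 = 0.
f = lambda x: x[0] + x[1]
g = lambda x: x[0]**2 + x[1]**2 - 1.0   # g(x) <= 0
h = lambda x: x[0] - x[1]               # h(x) = 0

# SLSQP uses the convention fun(x) >= 0 for inequalities, so we pass -g.
cons = [{"type": "ineq", "fun": lambda x: -g(x)},
        {"type": "eq",   "fun": h}]
res = minimize(f, x0=np.array([0.5, 0.5]), method="SLSQP", constraints=cons)

# Known solution: x = (-1/sqrt(2), -1/sqrt(2)), objective -sqrt(2).
print(res.x, res.fun)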

Similar to the previous section, let $y \ge 0$, $z$ and $\mu \le 0$ denote the Lagrange multipliers associated with $g(x) \le 0$, $h(x) = 0$ and $\|x\|^2 \le R^2$, respectively. The corresponding Lagrangian function reads
$$
\mathcal{L}(x, y, z, \mu) = f(x) + y^T g(x) + z^T h(x) - \mu(\|x\|^2 - R^2) = z_0 + \mu R^2 + \langle S(y,z) - \mu I,\; U(x)U(x)^T\rangle \tag{2}
$$

where $S(y,z) = F + \mathcal{G}^T(y) + \mathcal{H}^T(z)$ and $U(x)$ is a matricization operator
$$
U(x) \equiv U(u_0, u, \operatorname{vec}(V)) = \begin{bmatrix} u_0 & 0\\ u & V\end{bmatrix}.
$$

General-purpose nonlinear programming solvers work by computing a feasible primal point $x$ and dual multipliers $y, z, \mu$ that satisfy the first-order optimality conditions (known as the Karush–Kuhn–Tucker (KKT) conditions)
$$
\nabla_x \mathcal{L}(x, y, z, \mu) = 0, \quad y \ge 0, \quad y \odot g(x) = 0, \quad \mu \le 0, \quad \mu \cdot (\|x\|^2 - R^2) = 0, \tag{FOC}
$$

and also attempt to achieve the second-order optimality condition (known as the projected Hessian condition)

$$
\dot{x}^T \nabla^2_{xx} \mathcal{L}(x, y, z, \mu)\, \dot{x} \ge 0 \quad \text{for all } \dot{x} \in \mathcal{C}(x, y) \tag{SOC}
$$

in which the critical cone is defined

$$
\mathcal{C}(x, y) = \left\{\dot{x} :
\begin{array}{ll}
\nabla g_i(x)^T \dot{x} \ge 0 & \text{for all } i \text{ with } g_i(x) = 0,\\
\nabla g_i(x)^T \dot{x} = 0 & \text{for all } i \text{ with } y_i > 0,\\
\nabla h_j(x)^T \dot{x} = 0 & \text{for all } j,\\
2x^T \dot{x} \ge 0 & \text{if } \|x\|^2 = R^2,\\
2x^T \dot{x} = 0 & \text{if } \mu < 0
\end{array}\right\}. \tag{3}
$$

However, in the constrained optimization setting, a local minimum $x^\star$ need not satisfy (FOC) and (SOC).⁶ Solvers tend to fail catastrophically when converging towards a point that does not satisfy (FOC) and (SOC), either by diverging to infinity or by cycling through nonsensical solutions. Instead, a notion of constraint qualification is required to ensure convergence. Of all possibilities, the LICQ is one of the stronger conditions that allow strong guarantees to be made.

⁶For an explicit counterexample, consider $f(x) = x$, $g(x) = x^2$ and $h(x) = 0$.
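The footnoted counterexample is easy to check by hand: with $f(x) = x$ and $g(x) = x^2 \le 0$, the only feasible point is $x = 0$ (which is therefore the minimizer), yet stationarity $\nabla f(0) + y\,\nabla g(0) = 1 + y \cdot 0 = 0$ has no solution $y \ge 0$. A short numeric confirmation:

```python
import numpy as np

# Counterexample: f(x) = x, g(x) = x^2 <= 0. Feasible set = {0}, minimizer x* = 0.
df = lambda x: 1.0          # gradient of f
dg = lambda x: 2.0 * x      # gradient of g
x_star = 0.0

# Stationarity residual |df + y*dg| at x* = 0 equals 1 for EVERY multiplier y,
# so no KKT point exists: the constraint gradient dg(0) = 0 vanishes.
residuals = [abs(df(x_star) + y * dg(x_star)) for y in np.linspace(0.0, 1e6, 5)]
print(residuals)  # all entries equal 1.0
```

This is exactly the failure mode that a constraint qualification such as (LICQ) rules out.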

Definition E.1 (LICQ).

We say that a given $x$ satisfies the linear independence constraint qualification (LICQ) if the following holds:
$$
\nabla g(x)\,y + \nabla h(x)\,z + 2\mu \cdot x = 0, \quad g(x) \odot y = 0, \quad \mu \cdot (\|x\|^2 - R^2) = 0 \quad\Longleftrightarrow\quad y = 0, \; z = 0, \; \mu = 0. \tag{LICQ}
$$

One of our main theoretical contributions in this paper is to state conditions under which a point $x$ satisfies (LICQ).

Lemma E.2 (LICQ for (BM- 𝑟 )).

If $x$ satisfies the nonzero preactivation condition (NPCQ), then $x$ satisfies (LICQ).

We defer the proof of E.2 to Appendix F. E.2 implies that for every local minimum $x^\star$, there exists a unique set of dual variables $(y^\star, z^\star, \mu^\star)$ such that $(x^\star, y^\star, z^\star, \mu^\star)$ is guaranteed to satisfy (FOC) and (SOC); therefore, nonlinear programming solvers are guaranteed to work.

Corollary E.3 (Dual lower bound of (BM- 𝑟 )).

Let $x^\star = (u_0, \{u_k\}_{k=1}^{\ell-1}, \{\operatorname{vec}(V_k)\}_{k=1}^{\ell-1})$ denote a local minimum of the Burer–Monteiro problem (BM-$r$) that satisfies the nonzero preactivation condition (NPCQ). Then, there exist unique dual multipliers $y = (y_0, \{y_{k,1}, y_{k,2}\}_{k=1}^{\ell-1})$ and $z = (z_0, \{z_k\}_{k=1}^{\ell-1})$ satisfying $y \ge 0$ that provide the following lower bound:
$$
\phi[c] \ge \phi_r[c] \ge z_0 + R^2 \cdot \min\{0, \lambda_{\min}[S(y,z)]\}.
$$

Proof.

Since $x^\star$ is a local minimum of (BM-$r$) that satisfies the nonzero preactivation condition (NPCQ), it follows from E.2 that the point $x^\star$ is first-order optimal, and the corresponding KKT equations (FOC) yield a unique set of dual multipliers $(y, z, \mu)$ that certify the above lower bound on the global minimum. ∎

E.3 Proof of 4.4: Escaping a lifted saddle point

Now, let us explain how LICQ leads to our desired results in 4.4. We start by proving two technical lemmas regarding the first- and second-order optimality of (BM-$r$).

Lemma E.4 (First-order optimality).

Let $x = (u_0, \{u_k\}_{k=1}^{\ell-1}, \{\operatorname{vec}(V_k)\}_{k=1}^{\ell-1})$ and $y, z, \mu$ satisfy $\nabla_x \mathcal{L}(x, y, z, \mu) = 0$. Then, the slack matrix $S(y,z) = F + \mathcal{G}^T(y) + \mathcal{H}^T(z)$ satisfies the following:

1. $S(y,z)\,U(x) = 0$.
2. $S(y,z) - \mu I = \begin{bmatrix} u^T/u_0\\ -I\end{bmatrix}(S_2 - \mu I)\begin{bmatrix} u^T/u_0\\ -I\end{bmatrix}^T$ for some matrix $S_2$.

Proof.

Let $S(y,z) = \begin{bmatrix} s_0 & s_1^T\\ s_1 & S_2\end{bmatrix}$; the Lagrangian (2) can then be written as
$$
\begin{aligned}
\mathcal{L}(x, y, z, \mu) &= z_0 + \mu R^2 + \left\langle \begin{bmatrix} u_0 & 0\\ u & V\end{bmatrix}\begin{bmatrix} u_0 & 0\\ u & V\end{bmatrix}^T, \begin{bmatrix} s_0 - \mu & s_1^T\\ s_1 & S_2 - \mu I\end{bmatrix}\right\rangle\\
&= z_0 + \mu(R^2 - u_0^2) + u_0^2 s_0 + 2u_0 s_1^T u + \langle uu^T + VV^T,\; S_2 - \mu I\rangle.
\end{aligned}
$$

The condition $\nabla_x \mathcal{L}(x, y, z, \mu) = 0$ is equivalent to setting the following three Jacobians to zero:
$$
\frac{\partial \mathcal{L}}{\partial u_0} = 2\left(u_0(s_0 - \mu) + s_1^T u\right), \qquad
\frac{\partial \mathcal{L}}{\partial u} = 2\left(u_0 s_1^T + u^T(S_2 - \mu I)\right), \qquad
\frac{\partial \mathcal{L}}{\partial V} = 2V^T(S_2 - \mu I).
$$

Substituting the above into the equation below, it follows that
$$
S(y,z)\,U(x) = \begin{bmatrix} s_0 - \mu & s_1^T\\ s_1 & S_2 - \mu I\end{bmatrix}\begin{bmatrix} u_0 & 0\\ u & V\end{bmatrix}
= \begin{bmatrix} u_0(s_0 - \mu) + s_1^T u & s_1^T V\\ s_1 u_0 + (S_2 - \mu I)u & (S_2 - \mu I)V\end{bmatrix} = 0,
$$
where $s_1^T V = 0$ because $s_1^T = -u^T(S_2 - \mu I)/u_0$ and $(S_2 - \mu I)V = 0$.

Similarly, substituting $s_0 = -s_1^T u/u_0 + \mu$ and $s_1 = -(S_2 - \mu I)u/u_0$ yields
$$
S(y,z) = \begin{bmatrix} s_0 & s_1^T\\ s_1 & S_2\end{bmatrix}
= \begin{bmatrix} u^T(S_2 - \mu I)u/u_0^2 + \mu & -u^T(S_2 - \mu I)/u_0\\ -(S_2 - \mu I)u/u_0 & S_2\end{bmatrix}
= \begin{bmatrix} u^T/u_0\\ -I\end{bmatrix}(S_2 - \mu I)\begin{bmatrix} u^T/u_0\\ -I\end{bmatrix}^T + \mu I. \qquad\blacksquare
$$
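The factorization just derived can be checked numerically: choosing $S_2$, $\mu$, $u$, $u_0$ at random and defining $s_1$, $s_0$ from the stationarity equations, the assembled slack matrix factors exactly as stated (the sizes below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5                                  # hypothetical dimension
u0 = 1.0
u = rng.standard_normal(n)
mu = -0.3
S2 = rng.standard_normal((n, n)); S2 = (S2 + S2.T) / 2   # arbitrary symmetric S2

# Stationarity forces s1 and s0 in terms of S2, mu, u, u0 (proof of E.4).
s1 = -(S2 - mu * np.eye(n)) @ u / u0
s0 = -s1 @ u / u0 + mu

S = np.block([[np.array([[s0]]), s1[None, :]], [s1[:, None], S2]])
B = np.vstack([u[None, :] / u0, -np.eye(n)])              # B = [u^T/u0; -I]

# Claim 2 of the lemma: S = B (S2 - mu I) B^T + mu I.
assert np.allclose(S, B @ (S2 - mu * np.eye(n)) @ B.T + mu * np.eye(n + 1))
print("factorization verified")
```

The check confirms that the whole $(n+1)\times(n+1)$ slack matrix is determined, up to the shift $\mu I$, by the smaller block $S_2 - \mu I$ compressed through $B = [u^T/u_0;\, -I]$.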

Lemma E.5 (Rank-deficient second-order optimality).

Given $y, z$, let $x = (u_0, \{u_k\}_{k=1}^{\ell-1}, \{\operatorname{vec}(V_k)\}_{k=1}^{\ell-1})$ satisfy $\nabla_x \mathcal{L}(x, y, z, \mu) = 0$. If there exist unit vectors $\psi \in \mathbb{R}^{r-1}$, $\|\psi\| = 1$ and $(\xi_0, \xi_1) \in \mathbb{R}\times\mathbb{R}^n$, $\|(\xi_0, \xi_1)\| = 1$ such that
$$
V\psi = 0, \qquad 2\begin{bmatrix}\xi_0\\ \xi_1\end{bmatrix}^T\left(S(y,z) - \mu I\right)\begin{bmatrix}\xi_0\\ \xi_1\end{bmatrix} = -\gamma < 0,
$$
where $S(y,z) = F + \mathcal{G}^T(y) + \mathcal{H}^T(z)$, then the vector $\dot{x} = (0, 0_n, \{\operatorname{vec}(\dot{V}_k)\}_{k=1}^{\ell-1})$ with $\dot{V}_k = [(u_k/u_0)\xi_0 - \xi_1]\psi^T$ satisfies
$$
\nabla g(x)^T \dot{x} = 0, \qquad \nabla h(x)^T \dot{x} = 0, \qquad \dot{x}^T \nabla^2_{xx}\mathcal{L}(x, y, z)\,\dot{x} = -\gamma.
$$

Proof.

Note that, in general, a function of the form $f(x) = \langle F, UU^T\rangle$ with $U(x) = \begin{bmatrix} u_0 & 0\\ u & V\end{bmatrix}$ has directional derivatives
$$
\nabla f(x)^T \dot{x} = \langle F, U(x)U(\dot{x})^T + U(\dot{x})U(x)^T\rangle, \qquad
\dot{x}^T \nabla^2 f(x)\,\dot{x} = 2\langle F, U(\dot{x})U(\dot{x})^T\rangle.
$$

For our specific choice of $\dot{x}$, we can verify that
$$
U(x)\,U(\dot{x})^T = \begin{bmatrix} u_0 & 0\\ u & V\end{bmatrix}\begin{bmatrix} 0 & 0\\ 0 & [(u/u_0)\xi_0 - \xi_1]\psi^T\end{bmatrix}^T
= \begin{bmatrix} 0 & 0\\ 0 & V\psi\,[(u/u_0)\xi_0 - \xi_1]^T\end{bmatrix} = 0.
$$

Since $g_i(x) = \langle G_i, UU^T\rangle$ and $h_j(x) = h_{0,j} + \langle H_j, UU^T\rangle$ for some matrices $G_i, H_j$ and scalars $h_{0,j}$, it follows that $\nabla g_i(x)^T \dot{x} = 0$ and $\nabla h_j(x)^T \dot{x} = 0$ for all $i$ and $j$.

Next, given that the Lagrangian (2) is written
$$
\mathcal{L}(x, y, z, \mu) = \langle h_0, z\rangle + \mu R^2 + \langle S(y,z) - \mu I,\; U(x)U(x)^T\rangle,
$$
the second-order directional derivative for our specific choice of $\dot{x}$ reads
$$
\dot{x}^T \nabla^2_{xx}\mathcal{L}(x, y, z, \mu)\,\dot{x} = 2\langle S(y,z) - \mu I,\; U(\dot{x})U(\dot{x})^T\rangle
= 2\begin{bmatrix} 0\\ \xi_2\end{bmatrix}^T\left(S(y,z) - \mu I\right)\begin{bmatrix} 0\\ \xi_2\end{bmatrix},
$$

where $\xi_2 = (u/u_0)\xi_0 - \xi_1$. It follows from E.4 that when $\nabla_x \mathcal{L}(x, y, z, \mu) = 0$, the slack matrix satisfies
$$
S(y,z) - \mu I = \begin{bmatrix} u^T/u_0\\ -I\end{bmatrix}(S_2 - \mu I)\begin{bmatrix} u^T/u_0\\ -I\end{bmatrix}^T,
$$
and therefore
$$
\begin{aligned}
2\begin{bmatrix} 0\\ \xi_2\end{bmatrix}^T\left(S(y,z) - \mu I\right)\begin{bmatrix} 0\\ \xi_2\end{bmatrix}
&= 2\begin{bmatrix} 0\\ \xi_2\end{bmatrix}^T\begin{bmatrix} u^T/u_0\\ -I\end{bmatrix}(S_2 - \mu I)\begin{bmatrix} u^T/u_0\\ -I\end{bmatrix}^T\begin{bmatrix} 0\\ \xi_2\end{bmatrix}\\
&= 2\begin{bmatrix} \xi_0\\ \xi_1\end{bmatrix}^T\begin{bmatrix} u^T/u_0\\ -I\end{bmatrix}(S_2 - \mu I)\begin{bmatrix} u^T/u_0\\ -I\end{bmatrix}^T\begin{bmatrix} \xi_0\\ \xi_1\end{bmatrix}\\
&= 2\begin{bmatrix} \xi_0\\ \xi_1\end{bmatrix}^T\left(S(y,z) - \mu I\right)\begin{bmatrix} \xi_0\\ \xi_1\end{bmatrix} = -\gamma. \qquad\blacksquare
\end{aligned}
$$

Lemma E.6 (Critical cone).

Let $\Omega$ denote a set of feasible points of (BM-$r$) where (LICQ) holds. If $x \in \Omega$ and $(y, z, \mu)$ satisfy (FOC), then there exists a continuously differentiable path $x(t)$ with initial position $x(0) = x$ that satisfies
$$
f(x(t)) = \mathcal{L}(x(t), y, z), \qquad x(t) \in \Omega \quad \text{for all } t \in [0, \epsilon)
$$
if and only if $\dot{x}(0) \in \mathcal{C}(x, y)$. Moreover, if $g(x)$ and $h(x)$ are $k$-times continuously differentiable, then $x(t)$ is also $k$-times continuously differentiable.

We are now ready to prove the escape result.

Proof.

Let $x$ be first-order optimal for (BM-$r$) with dual multipliers $(y, z, \mu)$. If $x$ satisfies nonzero preactivation, and $\gamma = -\lambda_{\min}[S(y,z) - \mu I] > 0$, then the eigenvector $\xi = (\xi_0, \xi_1)$ satisfying $\xi^T(S(y,z) - \mu I)\xi = -\gamma\|\xi\|^2$ implicitly defines an escape path
$$
u_{k,+}(t) = u_k + O(t^2), \qquad V_{k,+}(t) = [V_k, 0] + t\cdot[0,\; (u_k/u_0)\xi_0 - \xi_k] + O(t^2)
$$
so that $x_+(t) = (u_0, u(t), V(t))$ is feasible for sufficiently small $t \ge 0$, and along which the objective makes a decrement:
$$
w_\ell^T u_{\ell,+}(t) = w_\ell^T u_\ell - 2t^2\gamma + O(t^3) \quad \text{for all } t \in [0, \epsilon). \qquad\blacksquare
$$

Appendix F Proof of Constraint Qualification

In this section, we provide the proof of E.2. Specifically, we show that (LICQ) holds for (BM-$r$) under the assumption of nonzero preactivation (NPCQ). Throughout this section, we assume that $R$ is chosen large enough that $\|x\| < R$ holds at optimality with strict inequality, which in turn implies that the corresponding dual variable satisfies $\mu = 0$.

We start by showing that (LICQ) holds for single-neuron and single-layer ReLU networks.

Lemma F.1 (Single neuron).

Define $\alpha_0, \alpha, \bar{\alpha} \in \mathbb{R}$ and $\beta, \bar{\beta} \in \mathbb{R}^{r-1}$. Let $x = (\alpha_0, \alpha, \bar{\alpha}, \beta, \bar{\beta})$ satisfy $g_1(x) \ge 0$, $g_2(x) \ge 0$ and $h_1(x) = 0$, where
$$
g_1(x) = \alpha_0\bar{\alpha}, \qquad g_2(x) = \alpha_0(\bar{\alpha} - \alpha - \gamma\alpha_0), \qquad h_1(x) = \bar{\alpha}(\bar{\alpha} - \alpha - \gamma\alpha_0) + \langle\bar{\beta}, \bar{\beta} - \beta\rangle.
$$
Suppose that $\alpha_0^2 = 1$, $\alpha + \gamma\alpha_0 \ne 0$ and $\beta \ne 0$. Then, the following holds
$$
\nabla g_1(x)\,y_1 + \nabla g_2(x)\,y_2 + \nabla h_1(x)\,z_1 = 0, \tag{4a}
$$
$$
g_1(x) \odot y_1 = g_2(x) \odot y_2 = 0, \tag{4b}
$$
if and only if $y_1 = y_2 = z_1 = 0$.

Proof.

Explicitly computing the gradient terms in (4a), we can write (4a) and (4b) as
$$
\begin{bmatrix}
\bar{\alpha} & \bar{\alpha} - \alpha - 2\gamma\alpha_0 & -\bar{\alpha}\gamma\\
0 & -\alpha_0 & -\bar{\alpha}\\
\alpha_0 & \alpha_0 & 2\bar{\alpha} - \alpha - \gamma\alpha_0\\
0 & 0 & -\bar{\beta}\\
0 & 0 & 2\bar{\beta} - \beta\\
\alpha_0\bar{\alpha} & 0 & 0\\
0 & \alpha_0(\bar{\alpha} - \alpha - \gamma\alpha_0) & 0
\end{bmatrix}
\begin{bmatrix} y_1\\ y_2\\ z_1\end{bmatrix} = 0.
$$
The goal is to show that the above matrix has full column rank. To simplify the proof, we delete the first, second and fourth rows, as these rows are clearly dependent on the remaining rows. Deleting those three rows reveals the desired claim as equivalent to the following:
$$
\begin{bmatrix}
\alpha_0 & \alpha_0 & 2\bar{\alpha} - \alpha - \gamma\alpha_0\\
0 & 0 & 2\bar{\beta} - \beta\\
\alpha_0\bar{\alpha} & 0 & 0\\
0 & \alpha_0(\bar{\alpha} - \alpha - \gamma\alpha_0) & 0
\end{bmatrix}
\begin{bmatrix} y_1\\ y_2\\ z_1\end{bmatrix} = 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} y_1\\ y_2\\ z_1\end{bmatrix} = 0. \tag{5}
$$

Next, completing the square
$$
h_1(x) = \bar{\alpha}(\bar{\alpha} - \alpha - \gamma\alpha_0) + \langle\bar{\beta}, \bar{\beta} - \beta\rangle
= \left\|(\bar{\alpha}, \bar{\beta}) - \tfrac{1}{2}(\alpha + \gamma\alpha_0, \beta)\right\|^2 - \left\|\tfrac{1}{2}(\alpha + \gamma\alpha_0, \beta)\right\|^2
$$
reveals that
$$
h_1(x) = 0 \quad\Longrightarrow\quad \left\|(2\bar{\alpha} - \alpha - \gamma\alpha_0,\; 2\bar{\beta} - \beta)\right\| = \left\|(\alpha + \gamma\alpha_0,\; \beta)\right\|. \tag{6}
$$
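The completing-the-square identity behind (6) can be verified on arbitrary data; the values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha0, gamma = 1.0, 0.5
alpha, alpha_bar = rng.standard_normal(), rng.standard_normal()
beta, beta_bar = rng.standard_normal(3), rng.standard_normal(3)

# h1 = alpha_bar*(alpha_bar - alpha - gamma*alpha0) + <beta_bar, beta_bar - beta>
h1 = alpha_bar * (alpha_bar - alpha - gamma * alpha0) + beta_bar @ (beta_bar - beta)

# Completed square: ||v - c||^2 - ||c||^2 with v = (alpha_bar, beta_bar)
# and c = (alpha + gamma*alpha0, beta) / 2.
v = np.concatenate([[alpha_bar], beta_bar])
c = np.concatenate([[alpha + gamma * alpha0], beta]) / 2
assert np.isclose(h1, np.linalg.norm(v - c)**2 - np.linalg.norm(c)**2)
print("identity holds")
```

Expanding $\|v - c\|^2 - \|c\|^2 = \|v\|^2 - 2v^Tc$ recovers $h_1$ term by term, which is all that (6) relies on.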

If additionally $\alpha_0\bar{\alpha} = \max\{0, \alpha_0(\alpha + \gamma\alpha_0)\}$, i.e. when $g_1(x) = 0$, $g_2(x) = 0$, or $g_1(x) = g_2(x) = 0$, then substituting $g_1(x)g_2(x) = \alpha_0^2\bar{\alpha}(\bar{\alpha} - \alpha - \gamma\alpha_0) = \bar{\alpha}(\bar{\alpha} - \alpha - \gamma\alpha_0) = 0$ into (6) further yields
$$
\alpha_0\bar{\alpha} = \max\{0, \alpha_0(\alpha + \gamma\alpha_0)\}, \quad h_1(x) = 0 \quad\Longrightarrow\quad \|2\bar{\beta} - \beta\| = \|\beta\|. \tag{7}
$$

Finally, from $|g_1(x) - g_2(x)| = |\alpha_0(\alpha + \gamma\alpha_0)| = |\alpha + \gamma\alpha_0| > 0$, it follows that we cannot have both $g_1(x) = 0$ and $g_2(x) = 0$ at the same time. We proceed by verifying that (5) holds true in each of the other three cases, one at a time:

If $g_1(x) = 0$ and $g_2(x) > 0$, then $y_2 = 0$. Substituting $y_2 = 0$ into (5), it follows from $\|2\bar{\beta} - \beta\| = \|\beta\| > 0$ via (7) and $\alpha_0^2 = 1$ that
$$
\begin{bmatrix}
\alpha_0 & 2\bar{\alpha} - \alpha - \gamma\alpha_0\\
0 & 2\bar{\beta} - \beta\\
\alpha_0\bar{\alpha} & 0
\end{bmatrix}
\begin{bmatrix} y_1\\ z_1\end{bmatrix} = 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} y_1\\ z_1\end{bmatrix} = 0.
$$

If $g_1(x) > 0$ and $g_2(x) = 0$, then $y_1 = 0$. Substituting $y_1 = 0$ into (5), it again follows from $\|2\bar{\beta} - \beta\| = \|\beta\| > 0$ via (7) and $\alpha_0^2 = 1$ that
$$
\begin{bmatrix}
\alpha_0 & 2\bar{\alpha} - \alpha - \gamma\alpha_0\\
0 & 2\bar{\beta} - \beta\\
\alpha_0(\bar{\alpha} - \alpha - \gamma\alpha_0) & 0
\end{bmatrix}
\begin{bmatrix} y_2\\ z_1\end{bmatrix} = 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} y_2\\ z_1\end{bmatrix} = 0.
$$

Finally, if $g_1(x) > 0$ and $g_2(x) > 0$, then $y_1 = y_2 = 0$. Substituting $y_1 = y_2 = 0$ into (5), it follows from $\|(2\bar{\alpha} - \alpha - \gamma\alpha_0,\; 2\bar{\beta} - \beta)\| = \|(\alpha + \gamma\alpha_0,\; \beta)\| > 0$ via (6) and $\alpha_0^2 = 1$ that
$$
\begin{bmatrix} 2\bar{\alpha} - \alpha - \gamma\alpha_0\\ 2\bar{\beta} - \beta\end{bmatrix} z_1 = 0
\quad\Longleftrightarrow\quad z_1 = 0. \qquad\blacksquare
$$
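The full-column-rank claim (5) can also be probed numerically. Below we pick hypothetical values satisfying the hypotheses of F.1 ($\alpha_0^2 = 1$, $\alpha + \gamma\alpha_0 \ne 0$, $\beta \ne 0$, $h_1(x) = 0$; scalar $\beta$ corresponding to $r = 2$) and confirm that the $4\times3$ matrix in (5) has rank 3:

```python
import numpy as np

# Hypothetical data satisfying the hypotheses of Lemma F.1 (r = 2, so beta is scalar).
alpha0, gamma = 1.0, 0.5
alpha_bar, beta_bar, beta = 1.0, 0.7, 0.3
# Choose alpha so that the constraint h1(x) = 0 holds exactly.
alpha = alpha_bar - gamma * alpha0 + beta_bar * (beta_bar - beta) / alpha_bar

h1 = alpha_bar * (alpha_bar - alpha - gamma * alpha0) + beta_bar * (beta_bar - beta)
assert abs(h1) < 1e-12                    # constraint holds
assert alpha + gamma * alpha0 != 0 and beta != 0 and alpha0**2 == 1

M = np.array([
    [alpha0, alpha0, 2*alpha_bar - alpha - gamma*alpha0],
    [0.0,    0.0,    2*beta_bar - beta],
    [alpha0*alpha_bar, 0.0, 0.0],
    [0.0, alpha0*(alpha_bar - alpha - gamma*alpha0), 0.0],
])
assert np.linalg.matrix_rank(M) == 3      # full column rank => only y1 = y2 = z1 = 0
print("matrix (5) has full column rank")
```

Full column rank is precisely the statement that the homogeneous system in (5) admits only the trivial multiplier $(y_1, y_2, z_1) = 0$.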

Lemma F.2 (Single layer).

Define $u_0 \in \mathbb{R}$, $u \in \mathbb{R}^n$, $\bar{u} \in \mathbb{R}^{\bar{n}}$, $V \in \mathbb{R}^{n\times(r-1)}$ and $\bar{V} \in \mathbb{R}^{\bar{n}\times(r-1)}$. Let $x = (u_0, u, \bar{u}, \operatorname{vec}(V), \operatorname{vec}(\bar{V}))$ satisfy $g_1(x) \ge 0$, $g_2(x) \ge 0$ and $h_1(x) = 0$, where
$$
g_1(x) = u_0 \cdot \bar{u}, \qquad g_2(x) = u_0 \cdot (\bar{u} - Wu - bu_0), \qquad h_1(x) = \operatorname{diag}\!\left[(\bar{u} - Wu - bu_0)\bar{u}^T + (\bar{V} - WV)\bar{V}^T\right].
$$
Suppose that $u_0^2 = 1$, $\mathbf{e}_i^T(Wu + bu_0) \ne 0$ and $\mathbf{e}_i^T WV \ne 0$ hold for all $i \in \{1, 2, \dots, \bar{n}\}$. Then, the following holds
$$
\nabla g_1(x)\,y_1 + \nabla g_2(x)\,y_2 + \nabla h_1(x)\,z_1 = 0, \tag{8a}
$$
$$
g_1(x) \odot y_1 = g_2(x) \odot y_2 = 0, \tag{8b}
$$
if and only if $y_1 = y_2 = z_1 = 0$.

Proof.

Let $V = [v_1 \;\cdots\; v_{r-1}]$ and $\bar{V} = [\bar{v}_1 \;\cdots\; \bar{v}_{r-1}]$. Explicitly writing out the gradient terms in (8a) and stacking them together with (8b), we have
$$
\begin{bmatrix}
\bar{u}^T & (\bar{u} - Wu - 2bu_0)^T & -\bar{u}^T\operatorname{diag}(b)\\
0 & -u_0\cdot W^T & -W^T\operatorname{diag}(\bar{u})\\
u_0\cdot I & u_0\cdot I & \operatorname{diag}(2\bar{u} - Wu - bu_0)\\
0 & 0 & -W^T\operatorname{diag}(\bar{v}_1)\\
\vdots & \vdots & \vdots\\
0 & 0 & -W^T\operatorname{diag}(\bar{v}_{r-1})\\
0 & 0 & \operatorname{diag}(2\bar{v}_1 - Wv_1)\\
\vdots & \vdots & \vdots\\
0 & 0 & \operatorname{diag}(2\bar{v}_{r-1} - Wv_{r-1})\\
u_0\cdot\operatorname{diag}(\bar{u}) & 0 & 0\\
0 & u_0\cdot\operatorname{diag}(\bar{u} - Wu - bu_0) & 0
\end{bmatrix}
\begin{bmatrix} y_1\\ y_2\\ z_1\end{bmatrix} = 0.
$$

Similar to F.1, deleting dependent rows allows us to restate the desired claim as the following:
$$
\begin{bmatrix}
u_0\cdot I & u_0\cdot I & \operatorname{diag}(2\bar{u} - Wu - bu_0)\\
0 & 0 & \operatorname{diag}(2\bar{v}_1 - Wv_1)\\
\vdots & \vdots & \vdots\\
0 & 0 & \operatorname{diag}(2\bar{v}_{r-1} - Wv_{r-1})\\
u_0\cdot\operatorname{diag}(\bar{u}) & 0 & 0\\
0 & u_0\cdot\operatorname{diag}(\bar{u} - Wu - bu_0) & 0
\end{bmatrix}
\begin{bmatrix} y_1\\ y_2\\ z_1\end{bmatrix} = 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} y_1\\ y_2\\ z_1\end{bmatrix} = 0. \tag{9}
$$

Collecting the rows corresponding to each $(\mathbf{e}_i^T y_1, \mathbf{e}_i^T y_2, \mathbf{e}_i^T z_1) = (y_{1,i}, y_{2,i}, z_{1,i})$, we see that (9) holds true if and only if the following holds true for all $i \in \{1, 2, \dots, \bar{n}\}$:
$$
\begin{bmatrix}
\alpha_0 & \alpha_0 & 2\bar{\alpha}_i - \alpha_i - \gamma_i\alpha_0\\
0 & 0 & 2\bar{\beta}_i - \beta_i\\
\alpha_0\bar{\alpha}_i & 0 & 0\\
0 & \alpha_0(\bar{\alpha}_i - \alpha_i - \gamma_i\alpha_0) & 0
\end{bmatrix}
\begin{bmatrix} y_{1,i}\\ y_{2,i}\\ z_{1,i}\end{bmatrix} = 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} y_{1,i}\\ y_{2,i}\\ z_{1,i}\end{bmatrix} = 0, \tag{10}
$$

where
$$
\alpha_0 = u_0, \quad \alpha_i = \mathbf{e}_i^T Wu, \quad \bar{\alpha}_i = \mathbf{e}_i^T\bar{u}, \quad \beta_i^T = \mathbf{e}_i^T WV, \quad \bar{\beta}_i^T = \mathbf{e}_i^T\bar{V}, \quad \gamma_i = \mathbf{e}_i^T b.
$$

By hypothesis, $u_0^2 = \alpha_0^2 = 1$, $\mathbf{e}_i^T(Wu + bu_0) = \alpha_i + \gamma_i\alpha_0 \ne 0$ and $\mathbf{e}_i^T WV = \beta_i^T \ne 0$; it then follows from F.1 that (10) holds true for all $i$. This proves the lemma. ∎

The results of F.1 and F.2 extend readily to the multiple-layer case.

Lemma F.3 (Multiple layers).

Define $u_0 \in \mathbb{R}$, $u = (u_1, \dots, u_\ell) \in \mathbb{R}^n$, $V = (V_1, \dots, V_\ell) \in \mathbb{R}^{n\times(r-1)}$. Let $x = (u_0, \{u_k\}_{k=1}^{\ell}, \{\operatorname{vec}(V_k)\}_{k=1}^{\ell})$ satisfy $g_{k,1}(x) \ge 0$, $g_{k,2}(x) \ge 0$ and $h_k(x) = 0$ for all $k \in \{1, \dots, \ell-1\}$, where
$$
g_{k,1}(x) = u_0 \cdot u_{k+1}, \qquad g_{k,2}(x) = u_0 \cdot (u_{k+1} - W_k u_k - b_k u_0),
$$
$$
h_k(x) = \operatorname{diag}\!\left[(u_{k+1} - W_k u_k - b_k u_0)u_{k+1}^T + (V_{k+1} - W_k V_k)V_{k+1}^T\right].
$$
Suppose that $u_0^2 = 1$, $\mathbf{e}_i^T(W_k u_k + b_k u_0) \ne 0$ and $\mathbf{e}_i^T W_k V_k \ne 0$ hold for all $k \in \{1, \dots, \ell-1\}$ and $i \in \{1, 2, \dots, n_{k+1}\}$. Then, the following holds
$$
\sum_{k=1}^{\ell-1}\left[\nabla g_{k,1}(x)\,y_{k,1} + \nabla g_{k,2}(x)\,y_{k,2} + \nabla h_k(x)\,z_k\right] = 0, \tag{11a}
$$
$$
g_{k,1}(x) \odot y_{k,1} = g_{k,2}(x) \odot y_{k,2} = 0 \quad \text{for all } k \in \{1, 2, \dots, \ell-1\}, \tag{11b}
$$
if and only if $y_{k,1} = y_{k,2} = z_k = 0$ for all $k \in \{1, 2, \dots, \ell-1\}$.

Proof.

Let us assume $r = 2$ without loss of generality. As in the proof of F.2, we start by writing (11a) and (11b) as a matrix-vector product. Let $\frac{\partial f(x)}{\partial y}$ denote the gradient of $f(x)$ with respect to the variable $y$, and let $x_k = (u_0, u_k, u_{k+1}, \operatorname{vec}(V_k), \operatorname{vec}(V_{k+1}))$. Analogous to the one-layer case, for each $k \in \{1, \dots, \ell-1\}$, we define the following block matrices:
$$
\begin{bmatrix}
\dfrac{\partial g_{k,1}(x)}{\partial x_k} & \dfrac{\partial g_{k,2}(x)}{\partial x_k} & \dfrac{\partial h_k(x)}{\partial x_k}\\
\operatorname{diag}(g_{k,1}(x)) & 0 & 0\\
0 & \operatorname{diag}(g_{k,2}(x)) & 0
\end{bmatrix}
= \begin{bmatrix} a_k^T\\ B_k\\ C_k\\ D_k\\ E_k\\ F_k\end{bmatrix}, \qquad
\mathbf{M}_k = \begin{bmatrix} C_k\\ E_k\\ F_k\end{bmatrix}.
$$

Notice that $\mathbf{M}_k$ has the same structure as in (9). Each block above is assigned as follows:
$$
\begin{aligned}
a_k^T &= \frac{\partial(g_{k,1}, g_{k,2}, h_k)}{\partial u_0} = \begin{bmatrix} u_{k+1}^T & (u_{k+1} - W_k u_k - 2b_k u_0)^T & -u_{k+1}^T\operatorname{diag}(b_k)\end{bmatrix},\\
B_k &= \frac{\partial(g_{k,1}, g_{k,2}, h_k)}{\partial u_k} = \begin{bmatrix} 0 & -u_0\cdot W_k^T & -W_k^T\operatorname{diag}(u_{k+1})\end{bmatrix},\\
C_k &= \frac{\partial(g_{k,1}, g_{k,2}, h_k)}{\partial u_{k+1}} = \begin{bmatrix} u_0\cdot I & u_0\cdot I & \operatorname{diag}(2u_{k+1} - W_k u_k - b_k u_0)\end{bmatrix},\\
D_k &= \frac{\partial(g_{k,1}, g_{k,2}, h_k)}{\partial\operatorname{vec}(V_k)} = \begin{bmatrix} 0 & 0 & -W_k^T\operatorname{diag}(v_{k+1,1})\\ \vdots & \vdots & \vdots\\ 0 & 0 & -W_k^T\operatorname{diag}(v_{k+1,r-1})\end{bmatrix},\\
E_k &= \frac{\partial(g_{k,1}, g_{k,2}, h_k)}{\partial\operatorname{vec}(V_{k+1})} = \begin{bmatrix} 0 & 0 & \operatorname{diag}(2v_{k+1,1} - W_k v_{k,1})\\ \vdots & \vdots & \vdots\\ 0 & 0 & \operatorname{diag}(2v_{k+1,r-1} - W_k v_{k,r-1})\end{bmatrix},\\
F_k &= \begin{bmatrix} u_0\cdot\operatorname{diag}(u_{k+1}) & 0 & 0\\ 0 & u_0\cdot\operatorname{diag}(u_{k+1} - W_k u_k - b_k u_0) & 0\end{bmatrix}.
\end{aligned}
$$

Since each $g_{k,1}(x), g_{k,2}(x), h_k(x)$ depends only on $x_k$, it follows that (11a) and (11b) can be written as a matrix-vector product in which the corresponding matrix admits a block tri-diagonal structure. This allows us to restate the desired claim as the following:
$$
\begin{bmatrix}
a_1^T & a_2^T & \cdots & a_{\ell-1}^T\\
B_1 & & & \\
C_1 & B_2 & & \\
& C_2 & \ddots & \\
& & \ddots & B_{\ell-1}\\
& & & C_{\ell-1}\\
D_1 & & & \\
E_1 & D_2 & & \\
& E_2 & \ddots & \\
& & \ddots & D_{\ell-1}\\
& & & E_{\ell-1}\\
F_1 & & & \\
& F_2 & & \\
& & \ddots & \\
& & & F_{\ell-1}
\end{bmatrix}
\begin{bmatrix} \lambda_1\\ \lambda_2\\ \vdots\\ \lambda_{\ell-1}\end{bmatrix} = 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} \lambda_1\\ \lambda_2\\ \vdots\\ \lambda_{\ell-1}\end{bmatrix} = 0, \tag{12}
$$
where $\lambda_k = (y_{k,1}, y_{k,2}, z_k)$.

We now proceed to show that (12) is true. Focus first on the rows in which $C_{\ell-1}$, $E_{\ell-1}$ and $F_{\ell-1}$ appear: these blocks are the only nonzero blocks in their rows, so the left-hand side of (12) implies $\mathbf{M}_{\ell-1}\lambda_{\ell-1} = 0$. Given that $\mathbf{e}_i^T(W_{\ell-1}u_{\ell-1} + b_{\ell-1}u_0) \ne 0$ and $\mathbf{e}_i^T W_{\ell-1}V_{\ell-1} \ne 0$ hold for all $i \in \{1, 2, \dots, n_\ell\}$ by hypothesis, it follows from F.2 that $\mathbf{M}_{\ell-1}\lambda_{\ell-1} = 0 \Leftrightarrow \lambda_{\ell-1} = 0$.

Next, substitute $\lambda_{\ell-1} = 0$ back into the left-hand side of (12) to eliminate the $(\ell-1)$-th block-column. Repeating the same process from the $(\ell-2)$-th block-column all the way down to the first shows that $\mathbf{M}_{\ell-2}\lambda_{\ell-2} = 0 \Leftrightarrow \lambda_{\ell-2} = 0$, $\mathbf{M}_{\ell-3}\lambda_{\ell-3} = 0 \Leftrightarrow \lambda_{\ell-3} = 0$, $\dots$, $\mathbf{M}_1\lambda_1 = 0 \Leftrightarrow \lambda_1 = 0$. This proves that the left-hand side of (12) does indeed imply the right-hand side under our stated hypotheses, as desired. ∎

Theorem F.4 (LICQ for (BM- 𝑟 )).

Define $u_0 \in \mathbb{R}$, $u = (u_1, \dots, u_\ell) \in \mathbb{R}^n$, $V = (V_1, \dots, V_\ell) \in \mathbb{R}^{n\times(r-1)}$. Let $x = (u_0, \{u_k\}_{k=1}^{\ell}, \{\operatorname{vec}(V_k)\}_{k=1}^{\ell})$ satisfy the constraints in (BM-$r$): $g_0(x) \ge 0$, $g_{k,1}(x) \ge 0$, $g_{k,2}(x) \ge 0$, $h_0(x) = 0$ and $h_k(x) = 0$ for all $k \in \{1, \dots, \ell-1\}$, where
$$
g_0(x) = \rho^2 - \|u_1 - \hat{x}u_0\|^2 - \|V_1\|^2, \qquad h_0(x) = u_0^2 - 1,
$$
$$
g_{k,1}(x) = u_0 \cdot u_{k+1}, \qquad g_{k,2}(x) = u_0 \cdot (u_{k+1} - W_k u_k - b_k u_0),
$$
$$
h_k(x) = \operatorname{diag}\!\left[(u_{k+1} - W_k u_k - b_k u_0)u_{k+1}^T + (V_{k+1} - W_k V_k)V_{k+1}^T\right].
$$
Suppose that $x$ satisfies (NPCQ), i.e. $\mathbf{e}_i^T(W_k u_k + b_k u_0) \ne 0$ and $\mathbf{e}_i^T W_k V_k \ne 0$ hold for all $k \in \{1, 2, \dots, \ell-1\}$ and $i \in \{1, 2, \dots, n_{k+1}\}$. Then, $x$ satisfies (LICQ), stated as the following:
$$
\nabla g_0(x)\,y_0 + \nabla h_0(x)\,z_0 + \sum_{k=1}^{\ell-1}\left[\nabla g_{k,1}(x)\,y_{k,1} + \nabla g_{k,2}(x)\,y_{k,2} + \nabla h_k(x)\,z_k\right] = 0, \tag{LICQ-a}
$$
$$
g_0(x) \odot y_0 = 0, \qquad g_{k,1}(x) \odot y_{k,1} = g_{k,2}(x) \odot y_{k,2} = 0 \quad \text{for all } k \in \{1, \dots, \ell-1\}, \tag{LICQ-b}
$$
if and only if $y_0 = z_0 = 0$ and $y_{k,1} = y_{k,2} = z_k = 0$ for all $k \in \{1, \dots, \ell-1\}$.

Proof.

Let us assume $r = 2$ and $\rho > 0$ without loss of generality. Similar to F.3, (LICQ-a) and (LICQ-b) admit a block tri-diagonal structure which allows us to restate the desired claim as the following:
$$
\begin{bmatrix}
2u_0 & a_0 & a_1^T & a_2^T & \cdots & a_{\ell-1}^T\\
& b_0 & B_1 & & & \\
& & C_1 & B_2 & & \\
& & & C_2 & \ddots & \\
& & & & \ddots & B_{\ell-1}\\
& & & & & C_{\ell-1}\\
& d_0 & D_1 & & & \\
& & E_1 & D_2 & & \\
& & & E_2 & \ddots & \\
& & & & \ddots & D_{\ell-1}\\
& & & & & E_{\ell-1}\\
& f_0 & F_1 & & & \\
& & & F_2 & & \\
& & & & \ddots & \\
& & & & & F_{\ell-1}
\end{bmatrix}
\begin{bmatrix} z_0\\ y_0\\ \lambda_1\\ \lambda_2\\ \vdots\\ \lambda_{\ell-1}\end{bmatrix} = 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} z_0\\ y_0\\ \lambda_1\\ \lambda_2\\ \vdots\\ \lambda_{\ell-1}\end{bmatrix} = 0, \tag{13}
$$
in which $\lambda_k = (y_{k,1}, y_{k,2}, z_k)$. The blocks $(a_k, B_k, C_k, D_k, E_k, F_k)$ for $k \in \{1, \dots, \ell-1\}$ are defined in F.3, and the blocks $(a_0, b_0, d_0, f_0)$ are written
$$
a_0 = \frac{\partial g_0}{\partial u_0} = 2(\hat{x}^T u_1 - \|\hat{x}\|^2 u_0), \qquad
b_0 = \frac{\partial g_0}{\partial u_1} = 2(u_1 - \hat{x}u_0), \qquad
d_0 = \frac{\partial g_0}{\partial\operatorname{vec}(V_1)} = 2\operatorname{vec}(V_1), \qquad
f_0 = g_0(x).
$$

Under the stated assumptions, we can verify that the matrix on the left-hand side of (13) indeed has full column rank. First, we apply F.3 to show that
$$
\mathbf{M}_{\ell-1}\lambda_{\ell-1} = 0 \Leftrightarrow \lambda_{\ell-1} = 0, \qquad \mathbf{M}_{\ell-2}\lambda_{\ell-2} = 0 \Leftrightarrow \lambda_{\ell-2} = 0, \qquad \dots, \qquad \mathbf{M}_1\lambda_1 = 0 \Leftrightarrow \lambda_1 = 0.
$$
Substituting $\lambda_{\ell-1} = \lambda_{\ell-2} = \cdots = \lambda_1 = 0$ allows us to simplify (13) as

$$
\begin{bmatrix} 2u_0 & a_0\\ 0 & b_0\\ 0 & d_0\\ 0 & f_0\end{bmatrix}
\begin{bmatrix} z_0\\ y_0\end{bmatrix} = 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} z_0\\ y_0\end{bmatrix} = 0. \tag{14}
$$

To show that the matrix on the left-hand side of (14) has full column rank, we consider two cases:

If $g_0(x) = 0$, it follows from $\|u_1 - \hat{x}u_0\|^2 + \|V_1\|^2 = \|(u_1 - \hat{x}u_0, \operatorname{vec}(V_1))\|^2 = \|\tfrac{1}{2}(b_0, d_0)\|^2 = \rho^2 > 0$ and $u_0^2 = 1$ that
$$
\begin{bmatrix} 2u_0 & a_0\\ 0 & b_0\\ 0 & d_0\\ 0 & f_0\end{bmatrix}
\begin{bmatrix} z_0\\ y_0\end{bmatrix} = 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} z_0\\ y_0\end{bmatrix} = 0.
$$

If $g_0(x) > 0$, then $y_0 = 0$. Substituting $y_0 = 0$ into (14), it then follows from $u_0^2 = 1$ that
$$
2u_0 z_0 = 0 \quad\Longleftrightarrow\quad z_0 = 0. \qquad\blacksquare
$$

We can now extend F.4 to show that (BM-$\ell_2$) satisfies (LICQ) under one extra mild assumption, $\|V_1\| \ne 0$. The proof is summarized in the following corollary.

Corollary F.5 (LICQ for (BM- ℓ 2 )).

Define $u_0$, $u$ and $V$ as in F.4. Let $x = (u_0, \{u_k\}_{k=1}^{\ell}, \{\operatorname{vec}(V_k)\}_{k=1}^{\ell})$ satisfy the constraints in (BM-$\ell_2$): $g_0(x) \ge 0$, $g_{k,1}(x) \ge 0$, $g_{k,2}(x) \ge 0$ and $h_k(x) = 0$ for all $k \in \{0, \dots, \ell-1\}$, where
$$
g_0(x) = \rho^2 - \|u_1 - \hat{x}u_0\|^2 - \|V_1\|^2, \qquad h_0(x) = u_0^2 - 1,
$$
$$
g_{0,1}(x) = u_0 \cdot u_1, \qquad g_{0,2}(x) = u_0 \cdot (u_0 - u_1),
$$
$$
g_{k,1}(x) = u_0 \cdot u_{k+1}, \qquad g_{k,2}(x) = u_0 \cdot (u_{k+1} - W_k u_k - b_k u_0) \quad \text{for all } k \in \{1, \dots, \ell-1\},
$$
$$
h_k(x) = \operatorname{diag}\!\left[(u_{k+1} - W_k u_k - b_k u_0)u_{k+1}^T + (V_{k+1} - W_k V_k)V_{k+1}^T\right] \quad \text{for all } k \in \{1, \dots, \ell-1\}.
$$
Suppose that $x$ satisfies (NPCQ) and $\|V_1\| \ne 0$. Then, $x$ satisfies (LICQ).

Proof.

By F.4, to prove this corollary it suffices to show the following:
$$
\begin{bmatrix}
2u_0 & a_0 & u_1^T & 2u_0 - u_1^T\\
0 & b_0 & u_0 I & -u_0 I\\
0 & d_0 & 0 & 0\\
0 & f_0 & 0 & 0\\
0 & 0 & g_{0,1}(x) & 0\\
0 & 0 & 0 & g_{0,2}(x)
\end{bmatrix}
\begin{bmatrix} z_0\\ y_0\\ y_{0,1}\\ y_{0,2}\end{bmatrix} = 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} z_0\\ y_0\\ y_{0,1}\\ y_{0,2}\end{bmatrix} = 0,

where $(a_0, b_0, d_0, f_0)$ are defined in F.4. By hypothesis, $\|\tfrac{1}{2}d_0\| = \|V_1\| \ne 0$. This allows us to restate the desired claim as the following:
$$
\begin{bmatrix}
2u_0 & u_1^T & 2u_0 - u_1^T\\
0 & u_0 I & -u_0 I\\
0 & g_{0,1}(x) & 0\\
0 & 0 & g_{0,2}(x)
\end{bmatrix}
\begin{bmatrix} z_0\\ y_{0,1}\\ y_{0,2}\end{bmatrix} = 0
\quad\Longleftrightarrow\quad
\begin{bmatrix} z_0\\ y_{0,1}\\ y_{0,2}\end{bmatrix} = 0. \tag{15}
$$

Notice that we cannot jointly have $\mathbf{e}_i^T g_{0,1}(x) = \mathbf{e}_i^T g_{0,2}(x) = 0$ for any $i \in \{1, \dots, n_1\}$. Hence, each pair $(\mathbf{e}_i^T y_{0,1}, \mathbf{e}_i^T y_{0,2})$ falls into one of only three possible cases: $\mathbf{e}_i^T g_{0,1}(x) = 0$, $\mathbf{e}_i^T g_{0,2}(x) \ne 0$; $\mathbf{e}_i^T g_{0,1}(x) \ne 0$, $\mathbf{e}_i^T g_{0,2}(x) = 0$; and $\mathbf{e}_i^T g_{0,1}(x) \ne 0$, $\mathbf{e}_i^T g_{0,2}(x) \ne 0$. In all three cases, it is clear that (15) holds because $u_0^2 = 1$. ∎
