Title: Efficient Global Optimization of Two-layer ReLU Networks: Quadratic-time Algorithms and Adversarial Training. (This work is an extension of [10] and was supported by grants from ONR and NSF.)

URL Source: https://arxiv.org/html/2201.01965



License: CC BY-SA 4.0. arXiv:2201.01965v2 [cs.LG] 17 Jun 2025.


Efficient Global Optimization of Two-layer ReLU Networks: Quadratic-time Algorithms and Adversarial Training

Yatong Bai (Department of Mechanical Engineering, University of California, Berkeley; yatong_bai@berkeley.edu), Tanmay Gautam (Department of Electrical Engineering and Computer Science, University of California, Berkeley; tgautam23@berkeley.edu), and Somayeh Sojoudi (Department of Mechanical Engineering and Department of Electrical Engineering and Computer Science, University of California, Berkeley; sojoudi@berkeley.edu)

Abstract

The non-convexity of the artificial neural network (ANN) training landscape brings optimization difficulties. While traditional back-propagation gradient-based algorithms are effective in certain cases, they can become stuck at spurious local minima and are sensitive to initializations and hyperparameters. Recent work has shown that training a ReLU-activated ANN can be reformulated as a convex program, bringing hope to globally optimizing interpretable ANNs. However, naïvely solving the convex training formulation has exponential complexity, and even an approximation heuristic requires cubic time. In this work, we characterize the quality of this approximation and develop two efficient algorithms that train ANNs with global convergence guarantees. The first algorithm is based on the alternating direction method of multipliers (ADMM). It can solve both the exact convex formulation and the approximate counterpart, and generalizes to a family of convex training formulations. Linear global convergence is achieved, and the initial several iterations often yield a solution with high prediction accuracy. When solving the approximate formulation, the per-iteration time complexity is quadratic. The second algorithm, based on the “sampled convex programs” theory, solves unconstrained convex formulations and converges to an approximately globally optimal classifier. The non-convexity of the ANN training landscape is exacerbated when adversarial training is considered. We apply robust convex optimization theory to convex training and develop convex formulations that train ANNs robust to adversarial inputs. Our analysis explicitly focuses on one-hidden-layer fully connected ANNs, but can extend to more sophisticated architectures.

Keywords: Robust Optimization, Convex Optimization, Adversarial Training, Neural Networks

AMS subject classifications: 68Q25, 82C32, 49M29, 46N10, 62M45

1 Introduction

The artificial neural network (ANN) is one of the most powerful and popular machine learning tools. Optimizing a typical ANN with non-linear activation functions and a finite width requires solving non-convex optimization problems. Traditionally, training ANNs relies on stochastic gradient descent (SGD) back-propagation [52]. Despite its tremendous empirical success, this algorithm is only guaranteed to converge to a local minimum when applied to the non-convex ANN training objective. While SGD back-propagation can converge to a global optimizer for one-hidden-layer “rectified linear unit (ReLU)”-activated networks when the considered network is wide enough [58, 21] or when the inputs follow a Gaussian distribution [16], spurious local minima can exist in general applications. Moreover, the non-convexity of the training landscape and the properties of back-propagation SGD cause the issues listed below:

Poor interpretability: With SGD, it is hard to monitor the training status. For example, when the progress slows down, we may or may not be close to a local minimum, and the local minimum may be spurious.

High sensitivity to hyperparameters: Back-propagation SGD has several important hyperparameters to tune. Every parameter is crucial to the performance, but selecting the parameters can be difficult. SGD is also sensitive to the initialization [31].

Vanishing / exploding gradients: With back-propagation, the gradient at shallower layers can be tiny (or huge) if the deeper layer weights are tiny (or huge).

While more advanced back-propagation optimizers such as Adam [39] can alleviate the above issues, avoiding them entirely can be hard. Since convex programs possess the desirable property that all local minima are global, the existing works have considered convexifying the ANN training problem [13, 8, 6]. More recently, Pilanci and Ergen proposed “convex training” and derived a convex optimization problem with the same global minimum as the non-convex cost function of a one-hidden-layer fully connected ReLU ANN, enabling global ANN optimization [49]. The favorable properties of convex optimization make convex training immune to back-propagation deficiencies. Convex training also extends to more complex ANNs such as convolutional neural networks (CNNs) [25], deeper networks [24], and vector-output networks [53]. This work begins with one-hidden-layer ANNs for simplicity, and extends to a family of convex ANN training formulations, including the results for two-hidden-layer sub-networks [24, 26] and one-hidden-layer networks with batch normalization [27]. Due to space restrictions, the extensions are presented in Appendix A. Moreover, [12] designed a layer-wise training scheme that concatenates one-hidden-layer ANNs into a deep network, where each layer provably reduces the training error. This approach can be combined with this work, ultimately leading toward training deep networks with convex optimization.

Unfortunately, the $\mathcal{O}(d^3 r^3 (n/r)^{3r})$ computational complexity of the convex training formulation introduced in [49] is exponential in the data matrix rank and prohibitively high. This complexity arises for the following two reasons:

The size of the convex program grows exponentially in the training data matrix rank $r$. This exponential relationship is inherent due to the large number of possible ReLU activation patterns, and is thus hard to reduce. Fortunately, this problem is not a deal-breaker in practice: [49] has shown that a heuristic approximation that forms much smaller convex programs performs surprisingly well. In this work, we analyze this approximation and theoretically show that for a given level of suboptimality, the required size of the convex training programs is linear in the number of training data points $n$.

The convex training formulation is constrained. A naïve algorithm choice for solving a general constrained convex optimization is the interior-point method (IPM) with a cubic per-step computational complexity. This paper develops more efficient algorithms that exploit the problem structure and achieve lower computational cost. Specifically, an algorithm based on the alternating direction method of multipliers (ADMM) with a quadratic per-iteration complexity, as well as a Sampled Convex Program (SCP)-based algorithm with a linear per-iteration complexity, are introduced.

Detailed comparisons among the ADMM-based algorithm, the SCP-based algorithm, the original convex training algorithm in [49], and back-propagation SGD are presented in Table 1. While IPM can converge to a highly accurate solution with fewer iterations, ADMM can rapidly reach a medium-precision solution, which is often sufficient for machine learning tasks. Compared with SGD back-propagation, ADMM has a higher theoretical complexity but is guaranteed to converge linearly to a global optimum, enabling efficient global optimization.

Table 1: Comparisons between the proposed ANN training methods and related methods. The middle column is the per-epoch complexity when the squared loss is considered. $n$ is the number of training points; $d$ is the data dimension; $r$ is the training data matrix rank.

| Method | Complexity | Global convergence |
| --- | --- | --- |
| IPM [49] | $\mathcal{O}(d^3 r^3 (n/r)^{3r})$ † | Superlinear, to the global optimum |
| ADMM (exact) | $\mathcal{O}(d^2 r^2 (n/r)^{2r})$ † | Rapid to a moderate accuracy; linear to the global optimum |
| ADMM (approximate) | $\mathcal{O}(n^2 d^2)$ § | Rapid to a moderate accuracy; linear to an approximate global optimum |
| SCP | $\mathcal{O}(n^2)$ § | Toward an approximate global optimum; $\mathcal{O}(1/T)$ rate for weakly convex loss; linear for strongly convex loss |
| SGD back-propagation | $\mathcal{O}(mnd)$ ‡ / $\mathcal{O}(n^2 d)$ † | No spurious valleys if $m \geq 2n + 2$; no general results |

†: Toward the theoretically minimum loss – further increasing network width will not reduce the training loss;

§: Toward a fixed desired level of suboptimality in the sense defined in Theorem 2.2;

‡: For an arbitrary network width $m$. Since there exists a globally optimal neural network with no more than $n + 1$ active hidden-layer neurons [58], the $\mathcal{O}(mnd)$ bound for SGD back-propagation evaluates to $\mathcal{O}(n^2 d)$.

Prior literature has considered applying the ADMM method to ANN training [56, 57]. These works used ADMM to separate the activations and the weights of each layer, enabling parallel computing. While the ADMM algorithm in [57] converges at an $\mathcal{O}(1/t)$ rate ($t$ is the number of iterations) to a critical point of the augmented Lagrangian of the training formulation, there is no guarantee that this critical point is a global optimizer of the ANN training loss. In contrast, this paper uses ADMM as an efficient convex optimization algorithm and introduces an entirely different splitting scheme based on the convex formulations conceived in [49]. More importantly, our ADMM algorithm provably converges to a globally optimal classifier.

A parallel line of work has focused on making convex training more efficient. Specifically, [24, 26] use linear penalty functions to derive unconstrained formulations for convex training. When the strengths of the penalizations are chosen appropriately, the formulations are exact. However, the penalization strengths can be difficult to select, since a good choice depends on the optimization landscape of the problem, which is generally unknown. Note that the solutions found via this penalty method can be used to initialize our ADMM algorithm. During the review period of this work, Mishkin et al. [46] independently proposed a method to accelerate convex training. The similarities and differences between this work and [46] are discussed at the end of Section 3.

Combining the SCP analysis and the convex training framework leads to a further simplified convex training program that solves unconstrained convex optimization problems. This SCP-based method converges to an approximate global optimum. The scale of the SCP convex training formulation can be larger than the convex problem solved in the ADMM algorithm. However, the unconstrained nature enables the use of gradient methods, whose per-iteration complexities are lower than ADMM. The similarities between the SCP-based algorithm and extreme learning machines (ELMs [34, 28]) show that the training of a sparse ELM can be regarded as a convex relaxation of the training of an ANN, providing insights into the hidden sparsity of neural networks. Due to space restrictions, this result is presented in Appendix B.

Another major challenge of ANNs is their vulnerability to adversarial attacks. When the input is perturbed in a carefully designed way that does not significantly alter human perception, ANNs can be tricked into unsafe/incorrect/misaligned outputs drastically different from their normal behaviors. Such a vulnerability has been observed in computer vision [55, 47, 29] and controls [36]. As ANNs become popularized in safety-critical applications, it is crucial to analyze their adversarial robustness. While there have been studies on robustness certification [2, 44, 4, 9] and achieving certified robustness at test time via “randomized smoothing” [19, 3], efficiently achieving robustness via training remains an important topic. To this end, “adversarial training” [41, 29, 35] is one of the most effective ways to train robust classifiers, compared with other methods such as obfuscated gradients [7]. Specifically, adversarial training replaces the standard loss function with an “adversarial loss” and solves a highly challenging bi-level min-max optimization problem.

Adversarial training further exacerbates the aforementioned issues of SGD back-propagation, which arise mostly due to the non-convexity. As a result, adversarial training can be fragile and volatile in practice, and convergence properties are pessimistic. Therefore, extending convex training to adversarial training is crucial. In our conference paper [10], we built upon the above results to develop “convex adversarial training”, explicitly focusing on the cases of hinge loss (for binary classification) and squared loss (for regression). We theoretically showed that solving the proposed robust convex optimizations trains robust ANNs and empirically demonstrated the efficacy and advantages over traditional methods. This work extends the analysis to the binary cross-entropy loss and discusses the extensibility to more complex ANN architectures (Section 4.5 and Appendix E.2).

Previously, researchers have applied convex relaxation techniques to adversarial training. They obtained convex certifications [51, 59] that upper-bounded the inner maximization of adversarial training and used weak duality to develop robust loss functions. Despite the convex relaxation, the resulting training formulations generally remained non-convex, leaving the fundamental challenges unresolved. In contrast, we apply robust optimization techniques to the entire min-max adversarial training formulation and obtain convex training problems.

The main contributions of this work are summarized below:

A theoretical evaluation of a relaxation that enables tractable convex training (Section 2);

Efficient algorithms to accelerate convex (standard) training (Section 3; Appendix A);

An extension of the convex adversarial training formulation for one-hidden-layer scalar-output ReLU neural networks (Section 4).

1.1 Notations

Throughout this work, we focus on fully connected ANNs with one ReLU-activated hidden layer and a scalar output, defined as

$$\hat{y} = \sum_{j=1}^{m} \big( X u_j + b_j \mathbf{1}_n \big)_+ \alpha_j,$$

where $X \in \mathbb{R}^{n \times d}$ is the input data matrix with $n$ data points in $\mathbb{R}^d$ and $\hat{y} \in \mathbb{R}^n$ is the ANN output vector. We use $y \in \mathbb{R}^n$ to denote the corresponding training target output. The vectors $u_1, \ldots, u_m \in \mathbb{R}^d$ are the weights of the $m$ hidden-layer neurons, the scalars $b_1, \ldots, b_m \in \mathbb{R}$ are the hidden-layer bias terms, and the scalars $\alpha_1, \ldots, \alpha_m \in \mathbb{R}$ represent the output-layer weights. The symbol $(\cdot)_+ = \max\{0, \cdot\}$ indicates the ReLU activation function, which sets all negative entries of a vector or a matrix to zero. The symbol $\mathbf{1}_n$ defines a column vector with all entries being 1, where the subscript $n$ denotes the dimension of this vector. The $n$-dimensional identity matrix is denoted by $I_n$.
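For concreteness, the forward map above can be sketched in a few lines of NumPy; the function name `relu_net_forward` and the random test data are ours, not part of the paper:

```python
import numpy as np

def relu_net_forward(X, U, b, alpha):
    """Forward pass of a one-hidden-layer ReLU network.

    X: (n, d) data matrix; U: (d, m) hidden-layer weights (column j is u_j);
    b: (m,) hidden-layer biases; alpha: (m,) output-layer weights.
    Returns y_hat = sum_j (X u_j + b_j 1_n)_+ alpha_j, an (n,) vector.
    """
    pre_act = X @ U + b                       # (n, m): one column per hidden neuron
    return np.maximum(pre_act, 0.0) @ alpha   # ReLU, then output-layer combination

rng = np.random.default_rng(0)
n, d, m = 5, 3, 4
X = rng.standard_normal((n, d))
U = rng.standard_normal((d, m))
b = rng.standard_normal(m)
alpha = rng.standard_normal(m)
y_hat = relu_net_forward(X, U, b, alpha)
```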

Furthermore, for a vector $q \in \mathbb{R}^n$, $\operatorname{sgn}(q) \in \{-1, 0, 1\}^n$ denotes the signs of the entries of $q$, and $[q \geq 0]$ denotes the boolean vector in $\{0, 1\}^n$ with ones at the locations of the non-negative entries of $q$ and zeros at the remaining locations. The symbol $\operatorname{diag}(q)$ denotes the diagonal matrix $Q \in \mathbb{R}^{n \times n}$ with $Q_{ii} = q_i$ for all $i$ and $Q_{ij} = 0$ for all $i \neq j$. For a vector $q \in \mathbb{R}^n$ and a scalar $b \in \mathbb{R}$, the inequality $q \geq b$ means that $q_i \geq b$ for all $i \in [n]$. The symbol $\odot$ denotes the Hadamard product between two vectors, and $\|\cdot\|_p$ denotes the $\ell_p$-norm. For a matrix $A$ with $(i,j)$th entry $a_{ij}$, the max norm $\|A\|_{\max}$ is defined as $\max_{ij} |a_{ij}|$. For a set $\mathcal{A}$, the notation $|\mathcal{A}|$ denotes its cardinality, and $\Pi_{\mathcal{A}}(\cdot)$ denotes the projection onto $\mathcal{A}$. The notation $\operatorname{prox}_f$ denotes the proximal operator associated with a function $f(\cdot)$. The notation $R \sim \mathcal{N}(0, I_n)$ indicates that the random variable $R \in \mathbb{R}^n$ is a standard normal random vector, and $\operatorname{Unif}(\mathcal{S}^{n-1})$ denotes the uniform distribution on the $(n-1)$-sphere. For $P \in \mathbb{N}_+$, we define $[P]$ as the set $\{a \in \mathbb{N}_+ \mid a \leq P\}$, where $\mathbb{N}_+$ is the set of positive integers.

2 Practical Convex ANN Training

2.1 Prior Work – Convex ANN Training

We define the problem of training the above ANN with an $\ell_2$-regularized convex loss function $\ell(\hat{y}, y)$ as:

$$\min_{(u_j, \alpha_j, b_j)_{j=1}^{m}} \ \ell\Big( \sum_{j=1}^{m} \big( X u_j + b_j \mathbf{1}_n \big)_+ \alpha_j, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{m} \big( \|u_j\|_2^2 + b_j^2 + \alpha_j^2 \big),$$

where $\beta > 0$ is a regularization parameter. Without loss of generality, we assume that $b_j = 0$ for all $j \in [m]$. We can safely make this simplification because concatenating a column of ones to the data matrix $X$ absorbs the bias terms. The simplified training problem is then:

$$\text{(1)} \quad \min_{(u_j, \alpha_j)_{j=1}^{m}} \ \ell\Big( \sum_{j=1}^{m} \big( X u_j \big)_+ \alpha_j, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{m} \big( \|u_j\|_2^2 + \alpha_j^2 \big).$$

Consider the set of diagonal matrices $\{\operatorname{diag}([X u \geq 0]) \mid u \in \mathbb{R}^d\}$, and let its distinct elements be denoted as $D_1, \ldots, D_P$. The constant $P$ equals the number of regions into which $\mathbb{R}^d$ is partitioned by the hyperplanes that pass through the origin and are perpendicular to the rows of $X$ [49]. Intuitively, $P$ can be regarded as the number of possible ReLU activation patterns associated with $X$.

Consider the convex optimization problem

$$\text{(2)} \quad \begin{aligned} \min_{(v_i, w_i)_{i=1}^{P}} \ & \ell\Big( \sum_{i=1}^{P} D_i X (v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{P} \big( \|v_i\|_2 + \|w_i\|_2 \big) \\ \text{s.t.} \ & (2 D_i - I_n) X v_i \geq 0, \quad (2 D_i - I_n) X w_i \geq 0, \quad \forall i \in [P], \end{aligned}$$

and its dual formulation

$$\text{(3)} \quad \max_{v} \ -\ell^*(v) \quad \text{s.t.} \quad \big| v^\top (X u)_+ \big| \leq \beta, \ \forall u : \|u\|_2 \leq 1,$$

which is a convex semi-infinite program, where $\ell^*(v) = \max_z \ z^\top v - \ell(z, y)$ is the Fenchel conjugate function. The next theorem, borrowed from Pilanci and Ergen's paper [49], explains the relationship between the non-convex training problem Eq. 1, the convex problem Eq. 2, and the dual problem Eq. 3 when the ANN is sufficiently wide.

Theorem 2.1 ([49]).

Let $(v_i^\star, w_i^\star)_{i=1}^{P}$ denote a solution of Eq. 2 and define $m^\star := |\{i : v_i^\star \neq 0\}| + |\{i : w_i^\star \neq 0\}|$. Suppose that the ANN width $m$ is at least $m^\star$, where $m^\star$ is upper-bounded by $n + 1$. If the loss function $\ell(\cdot, y)$ is convex, then Eq. 1, Eq. 2, and Eq. 3 share the same optimal objective. The optimal network weights $(u_j^\star, \alpha_j^\star)_{j=1}^{m}$ can be recovered using the formulas

$$\text{(4)} \quad \big( u_{j_{1i}}^\star, \alpha_{j_{1i}}^\star \big) = \bigg( \frac{v_i^\star}{\sqrt{\|v_i^\star\|_2}}, \, \sqrt{\|v_i^\star\|_2} \bigg) \ \text{ if } v_i^\star \neq 0; \qquad \big( u_{j_{2i}}^\star, \alpha_{j_{2i}}^\star \big) = \bigg( \frac{w_i^\star}{\sqrt{\|w_i^\star\|_2}}, \, -\sqrt{\|w_i^\star\|_2} \bigg) \ \text{ if } w_i^\star \neq 0,$$

where the remaining $m - m^\star$ neurons are chosen to have zero weights.
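The recovery map in Eq. 4 can be sketched directly in NumPy. The helper name `recover_neurons` is ours, and we assume the balanced split in which $\|u_j\|_2 = |\alpha_j|$; in either convention, each recovered pair satisfies $u_j \alpha_j = v_i^\star$ (or $-w_i^\star$):

```python
import numpy as np

def recover_neurons(v_list, w_list):
    """Recover (u_j, alpha_j) pairs from a solution of the convex program:
    nonzero v_i gives (v_i / sqrt(||v_i||_2), sqrt(||v_i||_2)),
    nonzero w_i gives (w_i / sqrt(||w_i||_2), -sqrt(||w_i||_2)).
    Zero blocks produce no neuron (they correspond to zero weights)."""
    neurons = []
    for v in v_list:
        nv = np.linalg.norm(v)
        if nv > 0:
            neurons.append((v / np.sqrt(nv), np.sqrt(nv)))
    for w in w_list:
        nw = np.linalg.norm(w)
        if nw > 0:
            neurons.append((w / np.sqrt(nw), -np.sqrt(nw)))
    return neurons

# Tiny illustration: one active v-block, one zero block, one active w-block.
v_list = [np.array([3.0, 4.0]), np.zeros(2)]
w_list = [np.array([0.0, 2.0])]
neurons = recover_neurons(v_list, w_list)
```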

The worst-case computational complexity of solving Eq. 2 for the case of squared loss is $\mathcal{O}(d^3 r^3 (n/r)^{3r})$ using standard interior-point solvers [49]. Here, $r$ is the rank of the data matrix $X$, and in many cases $r = d$. Such a complexity is polynomial in $n$, significantly better than previous methods, but is exponential in $r$, and thus still prohibitively high for many practical applications. The high complexity is due to the large number of $D_i$ matrices, which is upper-bounded by $\min\{ 2^n, \, 2r (e(n-1)/r)^r \}$ [49].

2.2 A Practical Convex Training Algorithm

Algorithm 1: Practical convex training

1. Generate $P_s$ distinct diagonal matrices via $D_h \leftarrow \operatorname{diag}([X a_h \geq 0])$, where $a_h \sim \mathcal{N}(0, I_d)$ i.i.d. for all $h \in [P_s]$.
2. Solve

$$\text{(5)} \quad \begin{aligned} p_{s1}^\star = \min_{(v_h, w_h)_{h=1}^{P_s}} \ & \ell\Big( \sum_{h=1}^{P_s} D_h X (v_h - w_h), \, y \Big) + \beta \sum_{h=1}^{P_s} \big( \|v_h\|_2 + \|w_h\|_2 \big) \\ \text{s.t.} \ & (2 D_h - I_n) X v_h \geq 0, \quad (2 D_h - I_n) X w_h \geq 0, \quad \forall h \in [P_s]. \end{aligned}$$

3. Recover $u_1, \ldots, u_{m_s}$ and $\alpha_1, \ldots, \alpha_{m_s}$ from the solution $(v_{sh}^\star, w_{sh}^\star)_{h=1}^{P_s}$ of Eq. 5 using Eq. 4.

A natural direction for mitigating this high complexity is to reduce the number of $D_i$ matrices by sampling a subset of them. This idea leads to Algorithm 1, which approximately solves the training problem and can train ANNs with widths much less than $m^\star$. Algorithm 1 is an instance of the approximation described in [49, Remark 3.3], but [49] did not provide theoretical insights regarding its level of suboptimality. The following theorem bridges the gap by providing a probabilistic bound on the suboptimality of the ANN trained with Algorithm 1.
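Step 1 of Algorithm 1 (sampling distinct activation patterns) can be sketched as follows; the helper name and the rejection-style deduplication loop are ours:

```python
import numpy as np

def sample_activation_patterns(X, P_s, rng):
    """Sample P_s distinct ReLU activation patterns D_h = diag([X a_h >= 0]),
    with a_h drawn i.i.d. from N(0, I_d), as in step 1 of Algorithm 1.
    Returns a (P_s, n) boolean array whose rows are the diagonals of the D_h."""
    n, d = X.shape
    patterns = set()
    # Keep drawing Gaussian directions; duplicate patterns are discarded.
    # (A production implementation would cap the number of draws, since
    # fewer than P_s distinct patterns may exist for small data matrices.)
    while len(patterns) < P_s:
        a = rng.standard_normal(d)
        patterns.add(tuple(X @ a >= 0))
    return np.array(sorted(patterns))

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
D = sample_activation_patterns(X, P_s=8, rng=rng)
```

Each row of `D` is the diagonal of one $D_h$; forming `np.diag(row)` recovers the matrix when needed.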

Theorem 2.2.

Consider an additional diagonal matrix $D_{P_s + 1}$ sampled uniformly, and construct

$$\text{(6)} \quad \begin{aligned} p_{s2}^\star = \min_{(v_h, w_h)_{h=1}^{P_s + 1}} \ & \ell\Big( \sum_{h=1}^{P_s + 1} D_h X (v_h - w_h), \, y \Big) + \beta \sum_{h=1}^{P_s + 1} \big( \|v_h\|_2 + \|w_h\|_2 \big) \\ \text{s.t.} \ & (2 D_h - I_n) X v_h \geq 0, \quad (2 D_h - I_n) X w_h \geq 0, \quad \forall h \in [P_s + 1]. \end{aligned}$$

It holds that $p_{s2}^\star \leq p_{s1}^\star$. Furthermore, if $P_s \geq \min\big\{ \frac{n+1}{\psi \xi} - 1, \, \frac{2}{\xi}\big( n + 1 - \log \psi \big) \big\}$, where $\psi$ and $\xi$ are preset confidence-level constants between 0 and 1, then with probability at least $1 - \xi$, it holds that $\mathbb{P}\{ p_{s2}^\star < p_{s1}^\star \} \leq \psi$.

The proof of Theorem 2.2 is presented in Section F.1. Intuitively, Theorem 2.2 shows that when $P_s$ is large, sampling an additional matrix $D_{P_s + 1}$ will, with high probability, not reduce the training loss. One can apply this bound recursively $T$ times to show that the solution with $P_s$ matrices is close to the solution with $P_s + T$ matrices for an arbitrary number $T$. Thus, while the theorem does not directly bound the gap between the approximated optimization problem and its exact counterpart, it states that the optimality gap due to sampling is not too large for a suitable value of $P_s$, and the trained ANN is nearly optimal.
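The sample-size requirement of Theorem 2.2 is easy to evaluate numerically; a small sketch (the helper name `required_P_s` is ours) confirms the linear scaling in $n$ for fixed confidence levels:

```python
import math

def required_P_s(n, psi, xi):
    """Evaluate the sample-size bound of Theorem 2.2:
    P_s >= min{ (n + 1)/(psi * xi) - 1,  (2/xi) * (n + 1 - log(psi)) }."""
    bound1 = (n + 1) / (psi * xi) - 1
    bound2 = (2.0 / xi) * (n + 1 - math.log(psi))
    return min(bound1, bound2)

# For fixed psi and xi, the bound grows linearly in n:
b_100 = required_P_s(100, psi=0.1, xi=0.1)
b_200 = required_P_s(200, psi=0.1, xi=0.1)
```

With these confidence levels the second bound is active, so doubling $n$ from 100 to 200 increases the requirement by exactly $(2/\xi) \cdot 100 = 2000$ samples.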

Compared with the exponential relationship between $P$ and $r$, a satisfactory value of $P_s$ is linear in $n$ and independent of $r$. Thus, when $r$ is large, solving the approximated formulation Eq. 5 is significantly (exponentially) more efficient than solving the exact formulation Eq. 2. On the other hand, Algorithm 1 is no longer deterministic due to the stochastic sampling of the $D_h$ matrices, and its optimal objective values upper-bound that of Eq. 2.

Since the confidence constants $\psi$ and $\xi$ are no greater than one, Theorem 2.2 only applies to overparameterized ANNs, where $P_s \geq n$. Although [49] has shown that there exists a globally optimal neural network whose width is at most $n + 1$, which may make Theorem 2.2 seem loose by comparison, our theorem bounds a different quantity and is meaningful. Specifically, the bound in [49] does not provide a method that scales linearly: while a globally optimal neural network narrower than $n + 1$ exists, finding such an ANN requires solving a convex program with an exponential number of constraints. In contrast, Theorem 2.2 characterizes the optimality of a convex program with a manageable number of constraints. In practice, selecting $P_s$ is equivalent to choosing the ANN width. While Theorem 2.2 provides a guideline on how $P_s$ should scale with $n$, selecting a much smaller $P_s$ is not necessarily an issue: our experiments in Section 5.1 show that even when $P_s$ is much less than $n$ (which is in turn much less than $P$), Algorithm 1 still reliably trains high-performance classifiers.

3 An ADMM Algorithm for Global ANN Training

The convex ReLU ANN training program Eq. 2 may be solved with the IPM. The IPM is an iterative algorithm that repeatedly performs Newton updates. Each Newton update requires solving a linear system, which has a cubic complexity, hindering the application of IPM to large-scale optimization problems. Unfortunately, large-scale problems are ubiquitous in machine learning. This section proposes an algorithm based on the ADMM that breaks the optimization problem Eq. 2 into smaller subproblems that are easier to solve. Moreover, when $\ell(\cdot)$ is the squared loss, each subproblem has a closed-form solution. We will show that the complexity of each ADMM iteration is linear in $n$ and quadratic in $d$ and $P$, and that the number of ADMM steps required to reach a desired precision is logarithmic in the precision level. When other convex loss functions are used, a closed-form solution may not always exist; we illustrate that iterative methods can solve the subproblems for general convex losses efficiently. In Appendix A, we show that the ADMM algorithm extends to a family of convex training formulations.

Define $F_i := D_i X$ and $G_i := (2 D_i - I_n) X$ for all $i \in [P]$. Furthermore, we introduce $v_i$, $w_i$, $s_i$, and $t_i$ as slack variables and let $v_i = u_i$, $w_i = z_i$, $s_i = G_i v_i$, and $t_i = G_i w_i$. For a vector $q = (q_1, \ldots, q_n) \in \mathbb{R}^n$, define the indicator function of the positive quadrant $\mathbb{I}_{\geq 0}$ as

$$\mathbb{I}_{\geq 0}(q) := \begin{cases} 0 & \text{if } q_i \geq 0, \ \forall i \in [n]; \\ +\infty & \text{otherwise.} \end{cases}$$

The convex training formulation Eq. 2 can be reformulated as a convex optimization problem with positive-quadrant indicator functions and linear equality constraints:

$$\text{(7)} \quad \begin{aligned} \min_{(v_i, w_i, s_i, t_i, u_i, z_i)_{i=1}^{P}} \ & \ell\Big( \sum_{i=1}^{P} F_i (u_i - z_i), \, y \Big) + \beta \sum_{i=1}^{P} \|v_i\|_2 + \beta \sum_{i=1}^{P} \|w_i\|_2 + \sum_{i=1}^{P} \mathbb{I}_{\geq 0}(s_i) + \sum_{i=1}^{P} \mathbb{I}_{\geq 0}(t_i) \\ \text{s.t.} \ & G_i u_i - s_i = 0, \quad G_i z_i - t_i = 0, \quad v_i - u_i = 0, \quad w_i - z_i = 0, \quad \forall i \in [P]. \end{aligned}$$

Next, we simplify the notation by concatenating the matrices. Define

$$u := \begin{bmatrix} u_1^\top \cdots u_P^\top & z_1^\top \cdots z_P^\top \end{bmatrix}^\top, \quad v := \begin{bmatrix} v_1^\top \cdots v_P^\top & w_1^\top \cdots w_P^\top \end{bmatrix}^\top, \quad s := \begin{bmatrix} s_1^\top \cdots s_P^\top & t_1^\top \cdots t_P^\top \end{bmatrix}^\top,$$

$$F := \begin{bmatrix} F_1 \cdots F_P & -F_1 \cdots -F_P \end{bmatrix}, \quad \text{and} \quad G := \operatorname{blkdiag}(G_1, \cdots, G_P, G_1, \cdots, G_P),$$

where $\operatorname{blkdiag}(\cdot, \ldots, \cdot)$ denotes the block diagonal matrix formed by the submatrices in the parentheses. The formulation Eq. 7 is then equivalent to the compact notation

$$\text{(8)} \quad \min_{v, s, u} \ \ell(F u, y) + \beta \|v\|_{2,1} + \mathbb{I}_{\geq 0}(s) \quad \text{s.t.} \quad \begin{bmatrix} I_{2dP} \\ G \end{bmatrix} u - \begin{bmatrix} v \\ s \end{bmatrix} = 0,$$

where $\|\cdot\|_{2,1}$ denotes the $\ell_1$-$\ell_2$ mixed norm that induces group sparsity, i.e., $\|v\|_{2,1} = \sum_{i=1}^{P} \big( \|v_i\|_2 + \|w_i\|_2 \big)$, and $I_{2dP}$ is the identity matrix in $\mathbb{R}^{2dP \times 2dP}$. The corresponding augmented Lagrangian of Eq. 8 is:

$$L(u, v, s, \nu, \lambda) := \ell(F u, y) + \beta \|v\|_{2,1} + \mathbb{I}_{\geq 0}(s) + \frac{\rho}{2} \big( \|u - v + \lambda\|_2^2 - \|\lambda\|_2^2 \big) + \frac{\rho}{2} \big( \|G u - s + \nu\|_2^2 - \|\nu\|_2^2 \big),$$

where $\lambda := \big[ \lambda_{11}^\top \cdots \lambda_{1P}^\top \ \lambda_{21}^\top \cdots \lambda_{2P}^\top \big]^\top \in \mathbb{R}^{2dP}$ and $\nu := \big[ \nu_{11}^\top \cdots \nu_{1P}^\top \ \nu_{21}^\top \cdots \nu_{2P}^\top \big]^\top \in \mathbb{R}^{2nP}$ are dual variables, and $\rho > 0$ is a fixed penalty parameter [32].
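Assembling the stacked operators is mechanical; a NumPy sketch follows (the helper name `build_F_G` is ours, and dense matrices are used purely for illustration; at scale one would exploit the block structure and sparsity):

```python
import numpy as np

def build_F_G(X, patterns):
    """Assemble F = [F_1 ... F_P  -F_1 ... -F_P]  (shape (n, 2dP)) and
    G = blkdiag(G_1, ..., G_P, G_1, ..., G_P)     (shape (2nP, 2dP)),
    where F_i = D_i X and G_i = (2 D_i - I_n) X for diagonal D_i.
    `patterns` is a (P, n) 0/1 array whose rows are the diagonals of D_i."""
    n, d = X.shape
    F_blocks, G_blocks = [], []
    for diag in patterns:
        D = np.diag(diag.astype(float))
        F_blocks.append(D @ X)
        G_blocks.append((2 * D - np.eye(n)) @ X)
    P = len(F_blocks)
    F = np.hstack(F_blocks + [-Fb for Fb in F_blocks])
    G = np.zeros((2 * n * P, 2 * d * P))
    for i, Gb in enumerate(G_blocks + G_blocks):   # place blocks on the diagonal
        G[i * n:(i + 1) * n, i * d:(i + 1) * d] = Gb
    return F, G

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
patterns = np.unique((X @ rng.standard_normal((3, 4)) >= 0).T, axis=0)
F, G = build_F_G(X, patterns)
```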

Algorithm 2: An ADMM algorithm for the convex ANN training problem.

1. repeat
2.  Primal update:

$$\text{(3.3a)} \quad u^{k+1} = \arg\min_{u} \ \ell(F u, y) + \frac{\rho}{2} \|u - v^k + \lambda^k\|_2^2 + \frac{\rho}{2} \|G u - s^k + \nu^k\|_2^2$$

3.  Primal update:

$$\text{(3.3b)} \quad \begin{bmatrix} v^{k+1} \\ s^{k+1} \end{bmatrix} = \arg\min_{v, s} \ \beta \|v\|_{2,1} + \mathbb{I}_{\geq 0}(s) + \frac{\rho}{2} \|u^{k+1} - v + \lambda^k\|_2^2 + \frac{\rho}{2} \|G u^{k+1} - s + \nu^k\|_2^2$$

4.  Dual update:

$$\text{(3.3c)} \quad \begin{bmatrix} \lambda^{k+1} \\ \nu^{k+1} \end{bmatrix} = \begin{bmatrix} \lambda^k + \frac{\gamma_a}{\rho} \big( u^{k+1} - v^{k+1} \big) \\ \nu^k + \frac{\gamma_a}{\rho} \big( G u^{k+1} - s^{k+1} \big) \end{bmatrix}$$

5. end repeat

We can apply the ADMM iterations described in Algorithm 2 to globally optimize Eq. 8. Here, $\gamma_a > 0$ is a step-size constant. As will be shown next, Eq. 3.3b and Eq. 3.3c have simple closed-form solutions. The update Eq. 3.3a has a closed-form solution when $\ell(\cdot)$ is the squared loss, and can be efficiently solved numerically for general convex loss functions. When we apply ADMM to solve the approximated convex training formulation Eq. 5, Algorithm 2 becomes a subalgorithm of Algorithm 1. The following theorem certifies the linear convergence of the ADMM algorithm, with the proof provided in Section F.2:

Theorem 3.1.

If $\ell(\hat{y}, y)$ is strictly convex and continuously differentiable with a uniformly Lipschitz continuous gradient with respect to $\hat{y}$, then the sequence $\{(u^k, v^k, s^k, \lambda^k, \nu^k)\}$ generated by Algorithm 2 converges linearly to an optimal primal-dual solution of Eq. 8, provided that the step size $\gamma_a$ is sufficiently small.

Many popular loss functions satisfy the conditions of Theorem 3.1. Examples include the squared loss (for regression) and the binary cross-entropy loss coupled with the tanh or the sigmoid output activation (for binary classification).

3.1 $s$ and $v$ Updates

The update step Eq. 3.3b can be separated for $v^{k+1}$ and $s^{k+1}$ as:

$$\text{(3.4a)} \quad v^{k+1} = \arg\min_{v} \ \beta \|v\|_{2,1} + \frac{\rho}{2} \|u^{k+1} - v + \lambda^k\|_2^2;$$

$$\text{(3.4b)} \quad s^{k+1} = \arg\min_{s} \ \mathbb{I}_{\geq 0}(s) + \|G u^{k+1} - s + \nu^k\|_2^2 = \arg\min_{s \geq 0} \ \|G u^{k+1} - s + \nu^k\|_2^2.$$

Note that Eq. 3.4a can be separated for each $v_i$ and $w_i$ (allowing parallelization) and solved analytically using the formulas

$$v_i^{k+1} = \arg\min_{v} \ \beta \|v\|_2 + \frac{\rho}{2} \big\| u_i^{k+1} - v + \lambda_{1i}^k \big\|_2^2 = \operatorname{prox}_{\frac{\beta}{\rho}\|\cdot\|_2}\big( u_i^{k+1} + \lambda_{1i}^k \big) = \Bigg( 1 - \frac{\beta}{\rho \cdot \big\| u_i^{k+1} + \lambda_{1i}^k \big\|_2} \Bigg)_+ \big( u_i^{k+1} + \lambda_{1i}^k \big), \quad \forall i \in [P],$$

$$w_i^{k+1} = \arg\min_{w} \ \beta \|w\|_2 + \frac{\rho}{2} \big\| z_i^{k+1} - w + \lambda_{2i}^k \big\|_2^2 = \operatorname{prox}_{\frac{\beta}{\rho}\|\cdot\|_2}\big( z_i^{k+1} + \lambda_{2i}^k \big) = \Bigg( 1 - \frac{\beta}{\rho \cdot \big\| z_i^{k+1} + \lambda_{2i}^k \big\|_2} \Bigg)_+ \big( z_i^{k+1} + \lambda_{2i}^k \big), \quad \forall i \in [P],$$

where $\operatorname{prox}_{\frac{\beta}{\rho}\|\cdot\|_2}$ denotes the proximal operator of the function $f(\cdot) = \frac{\beta}{\rho}\|\cdot\|_2$. The computational complexity of finding $v_i$ and $w_i$ is $\mathcal{O}(d)$. Similarly, Eq. 3.4b can also be separated for each $s_i$ and $t_i$ and solved analytically using the formulas

$$s_i^{k+1} = \arg\min_{s_i \geq 0} \ \big\| G_i u_i^{k+1} - s_i + \nu_{1i}^k \big\|_2^2 = \Pi_{\geq 0}\big( G_i u_i^{k+1} + \nu_{1i}^k \big) = \big( G_i u_i^{k+1} + \nu_{1i}^k \big)_+, \quad \forall i \in [P];$$

$$t_i^{k+1} = \arg\min_{t_i \geq 0} \ \big\| G_i z_i^{k+1} - t_i + \nu_{2i}^k \big\|_2^2 = \Pi_{\geq 0}\big( G_i z_i^{k+1} + \nu_{2i}^k \big) = \big( G_i z_i^{k+1} + \nu_{2i}^k \big)_+, \quad \forall i \in [P],$$

where $\Pi_{\geq 0}$ denotes the projection onto the non-negative quadrant. The computational complexity of finding $s_i$ and $t_i$ is $\mathcal{O}(n)$. The updates Eq. 3.4a and Eq. 3.4b can be performed in $\mathcal{O}(nP + dP)$ time in total.
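Both closed-form updates are one-line operations per block. A sketch, with the shifted points $u_i^{k+1} + \lambda_{1i}^k$ and $G_i u_i^{k+1} + \nu_{1i}^k$ assumed to be precomputed (helper names are ours):

```python
import numpy as np

def prox_group_l2(x, tau):
    """Block soft-thresholding: the proximal operator of tau * ||.||_2,
    i.e. (1 - tau / ||x||_2)_+ * x, which returns 0 when ||x||_2 <= tau."""
    norm = np.linalg.norm(x)
    if norm <= tau:
        return np.zeros_like(x)
    return (1.0 - tau / norm) * x

def project_nonneg(x):
    """Projection onto the non-negative quadrant: (x)_+."""
    return np.maximum(x, 0.0)

# One block of the v update: v_i <- prox_{(beta/rho) ||.||_2}(u_i + lambda_1i),
# and one block of the s update: s_i <- (G_i u_i + nu_1i)_+.
beta, rho = 0.5, 1.0
v_new = prox_group_l2(np.array([3.0, -4.0]), beta / rho)
s_new = project_nonneg(np.array([-1.0, 2.0]))
```

Each call costs $\mathcal{O}(d)$ or $\mathcal{O}(n)$ respectively, matching the per-block complexities above, and the blocks are independent, so they can be dispatched in parallel.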

3.2 $u$ Updates

The 𝑢 update step depends on the specific structure of ℓ ⁢ ( ⋅ ) . For the squared loss, the 𝑢 update step can be solved in closed form. For many other loss functions, the update can be performed with numerical methods.

3.2.1 Squared Loss

The squared loss $\ell(\hat{y}, y) = \frac{1}{2}\|\hat{y} - y\|_2^2$ is a commonly used loss function in machine learning. It is widely used for regression tasks, but can also be used for classification. For the squared loss, Eq. 3.3a amounts to

$$\text{(11)} \quad u^{k+1} = \arg\min_{u} \ \Big\{ \frac{1}{2}\|F u - y\|_2^2 + \frac{\rho}{2} \|u - v^k + \lambda^k\|_2^2 + \frac{\rho}{2} \|G u - s^k + \nu^k\|_2^2 \Big\}.$$

Setting the gradient with respect to $u$ to zero yields

$$\text{(12)} \quad \Big( I + \frac{1}{\rho} F^\top F + G^\top G \Big) u^{k+1} = \frac{1}{\rho} F^\top y + v^k - \lambda^k + G^\top s^k - G^\top \nu^k.$$

Therefore, the $u$ update can be performed by solving the linear system Eq. 12 in each iteration. While solving a linear system $A x = b$ for a square matrix $A$ has a cubic time complexity in general, a quadratic per-iteration complexity can be achieved by exploiting the structure of Eq. 12. Specifically, the matrix $I + \frac{1}{\rho} F^\top F + G^\top G$ is symmetric, positive definite, and fixed throughout the ADMM iterations. In general, we can solve $A x = b$ for a symmetric $A \in \mathbb{S}^{2dP \times 2dP}$ with $A \succ 0$ and $b \in \mathbb{R}^{2dP}$ via the following procedure:

1. Perform the Cholesky decomposition $A = L L^\top$, where $L$ is lower-triangular (cubic complexity in $2dP$);
2. Solve $L \hat{b} = b$ by forward substitution (quadratic complexity in $2dP$);
3. Solve $L^\top x = \hat{b}$ by back substitution (quadratic complexity in $2dP$).

Throughout the ADMM iterations, the first step only needs to be performed once, while the second and third steps are required for every iteration. Since the dimension of the matrix ( 𝐼 + 1 𝜌 ⁢ 𝐹 ⊤ ⁢ 𝐹 + 𝐺 ⊤ ⁢ 𝐺 ) is 2 ⁢ 𝑑 ⁢ 𝑃 × 2 ⁢ 𝑑 ⁢ 𝑃 , the per-iteration time complexity of the 𝑢 update is 𝒪 ⁢ ( 𝑑 2 ⁢ 𝑃 2 ) , making it the most time-consuming step of our algorithm when 𝑑 and 𝑃 are large. Thus, the overall complexity of a full ADMM primal-dual iteration for squared loss is 𝒪 ⁢ ( 𝑛 ⁢ 𝑃 + 𝑑 2 ⁢ 𝑃 2 ) , which is quadratic. In contrast, the linear system for IPM’s Newton updates can be different for each iteration, and thus each iteration has a cubic complexity. Hence, the proposed ADMM method achieves a notable speed improvement over IPM.
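This factor-once, solve-many pattern can be sketched with SciPy's Cholesky routines (toy dimensions stand in for $2dP$; the data is made up for illustration):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
dim = 6                                   # stands in for 2*d*P
F = rng.standard_normal((10, dim))
G = rng.standard_normal((10, dim))
rho = 1.0

# A = I + (1/rho) F^T F + G^T G is symmetric positive definite and does not
# change across ADMM iterations, so it is factored once (cubic cost) ...
A = np.eye(dim) + (1.0 / rho) * F.T @ F + G.T @ G
chol = cho_factor(A)

# ... and each iteration reuses the factor via two triangular solves
# (quadratic cost); only the right-hand side of Eq. (12) changes.
for _ in range(3):                        # stand-in for the ADMM loop
    b = rng.standard_normal(dim)
    u_next = cho_solve(chol, b)
    assert np.allclose(A @ u_next, b)     # the solve is exact up to round-off
```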

When the approximated formulation Eq. 5 is considered and $P_s$ diagonal matrices are sampled in place of the full set of $P$ matrices, obtaining a given level of optimality requires $P_s$ to be linear in $n$, as discussed in Section 2. Coupled with the above analysis, we obtain an overall $\mathcal{O}(d^2 n^2)$ per-iteration complexity, a significant improvement over the $\mathcal{O}\big(d^3 r^3 (\frac{n}{r})^{3r}\big)$ per-iteration complexity in [49]. The total computational complexity for reaching a point $u^k$ satisfying $\| u^k - u^\star \|_2 \le \epsilon_a$ is $\mathcal{O}(d^2 n^2 \log(1/\epsilon_a))$, where $u^\star$ is an optimal solution for $u$ and $\epsilon_a > 0$ is a predefined precision threshold. In Section 5.2, we use numerical experiments to demonstrate that the ADMM algorithm's high efficiency enables convex ANN training for image classification tasks for the first time. Moreover, our experiments show that a high prediction accuracy only requires moderate optimization precision, which can be reached within a few ADMM iterations.

3.2.2 General Convex Loss Functions

When a general convex loss function $\ell(\hat{y}, y)$ is considered, a closed-form solution to Eq. 3.3a does not always exist, and one may need to use iterative methods, such as gradient descent, to solve Eq. 3.3a. However, for large-scale problems, a full gradient evaluation is prohibitively expensive. To address this issue, we exploit the symmetric and separable structure of each $u_i$ and $z_i$ in Eq. 3.3a and propose a randomized block coordinate descent (RBCD) method in Algorithm 3. Steps 5 and 6 of Algorithm 3 are derived via the differentiation chain rule. Note that Eq. 3.3a is always strongly convex because its second term is strongly convex while the first and third terms are convex. Hence, our RBCD algorithm converges linearly [43, Theorem 1]. The theoretical convergence rate is faster when the convexity of Eq. 3.3a is stronger and $P$ is smaller.

1: Initialize $\hat{y} = \sum_{i=1}^{P} F_i (u_i - z_i)$;
2: Fix $\tilde{s}_i = G_i^\top (s_i - \nu_{1i})$ and $\tilde{t}_i = G_i^\top (t_i - \nu_{2i})$ for all $i \in [P]$;
3: Select accuracy thresholds $\tau > 0$ and $\varphi > 0$;
4: repeat
5:   $\tilde{y} \leftarrow \nabla_{\hat{y}} \ell(\hat{y}, y)$;
6:   Uniformly select $i$ from $[P]$ at random;
7:   $u_i^+ \leftarrow u_i - \gamma_r F_i^\top \tilde{y} - \gamma_r \rho \big( u_i - v_i + \lambda_{1i} + G_i^\top G_i u_i - \tilde{s}_i \big)$;
8:   $z_i^+ \leftarrow z_i + \gamma_r F_i^\top \tilde{y} - \gamma_r \rho \big( z_i - w_i + \lambda_{2i} + G_i^\top G_i z_i - \tilde{t}_i \big)$;
9:   $\hat{y}^+ \leftarrow \hat{y} + F_i \big( (u_i^+ - z_i^+) - (u_i - z_i) \big)$;
10: until $\| \nabla_u L(u, v, s, \nu, \lambda) \|_2 \le \varphi \max\{ \tau, \| u \|_2 \}$

Algorithm 3: Randomized Block Coordinate Descent (RBCD). The superscript $+$ denotes the updated quantities for each iteration; $\gamma_r$ denotes the step size.

In practice, the RBCD step size 𝛾 𝑟 can be adaptively chosen via the backtracking line search. While Algorithm 3 updates one block in each iteration, it is also possible to update multiple blocks at once by sampling multiple indices. Moreover, each iteration can use the gradient associated with a random portion of the dataset as a surrogate for the entire dataset.
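For concreteness, one RBCD iteration (steps 5-9 of Algorithm 3) can be sketched for the squared loss, whose gradient with respect to $\hat{y}$ is simply $\hat{y} - y$. All names below are illustrative, not the authors' implementation:

```python
import numpy as np

def rbcd_step(u, z, y_hat, y, F, XtX, v, w, lam1, lam2, s_t, t_t, rho, gamma, rng):
    """One RBCD iteration for the squared loss.

    u, z, v, w, lam1, lam2, s_t, t_t are lists of d-vectors (one block per
    i in [P]); F is a list of n-by-d matrices; XtX stands in for G_i^T G_i."""
    y_tilde = y_hat - y                        # step 5: squared-loss gradient wrt y_hat
    i = int(rng.integers(len(F)))              # step 6: pick a random block
    u_new = u[i] - gamma * F[i].T @ y_tilde - gamma * rho * (
        u[i] - v[i] + lam1[i] + XtX @ u[i] - s_t[i])          # step 7
    z_new = z[i] + gamma * F[i].T @ y_tilde - gamma * rho * (
        z[i] - w[i] + lam2[i] + XtX @ z[i] - t_t[i])          # step 8
    y_hat = y_hat + F[i] @ ((u_new - z_new) - (u[i] - z[i]))  # step 9: O(nd)
    u[i], z[i] = u_new, z_new
    return u, z, y_hat

# Tiny usage example with random data (n = 5, d = 3, P = 2).
rng = np.random.default_rng(0)
n, d, P = 5, 3, 2
X = rng.standard_normal((n, d))
F = [rng.standard_normal((n, d)) for _ in range(P)]
u = [rng.standard_normal(d) for _ in range(P)]
z = [rng.standard_normal(d) for _ in range(P)]
v = [rng.standard_normal(d) for _ in range(P)]
w = [rng.standard_normal(d) for _ in range(P)]
zeros = [np.zeros(d) for _ in range(P)]
y = rng.standard_normal(n)
y_hat = sum(F[i] @ (u[i] - z[i]) for i in range(P))           # step 1
u, z, y_hat = rbcd_step(u, z, y_hat, y, F, X.T @ X, v, w,
                        zeros, zeros, zeros, zeros, rho=1.0, gamma=0.01, rng=rng)
# The incrementally maintained prediction stays consistent with the blocks.
assert np.allclose(y_hat, sum(F[i] @ (u[i] - z[i]) for i in range(P)))
```

Maintaining $\hat{y}$ incrementally in step 9 is what keeps each iteration cheap: only the selected block's contribution to the prediction is recomputed.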

Furthermore, it holds that $G_i^\top G_i = X^\top X$ for all $i \in [P]$. To see this, recall that $G_i = (2 D_i - I_n) X$ by definition. Since $(2 D_i - I_n)$ is a diagonal matrix with all entries equal to $\pm 1$, it holds that $(2 D_i - I_n)^\top (2 D_i - I_n) = I_n$, and thus $G_i^\top G_i = X^\top (2 D_i - I_n)^\top (2 D_i - I_n) X = X^\top X$. Therefore, we pre-compute $X^\top X$, removing the need to compute $G_i^\top G_i$ in each iteration. The most expensive steps of each RBCD update thus have the following complexities:

| Operation | Complexity |
| --- | --- |
| $F_i^\top \tilde{y}$ | $\mathcal{O}(nd)$ |
| $F_i \big( (u_i^+ - z_i^+) - (u_i - z_i) \big)$ | $\mathcal{O}(nd)$ |
| $(X^\top X) u_i$ | $\mathcal{O}(d^2)$ |
| $(X^\top X) z_i$ | $\mathcal{O}(d^2)$ |
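The identity $G_i^\top G_i = X^\top X$ behind this precomputation is easy to verify numerically (a toy check with random data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.standard_normal((n, d))
a = rng.standard_normal(d)

D = np.diag((X @ a >= 0).astype(float))   # an activation-pattern matrix D_i
S = 2.0 * D - np.eye(n)                   # diagonal with +/-1 entries
G_i = S @ X                               # G_i = (2 D_i - I_n) X

assert np.allclose(S.T @ S, np.eye(n))    # (2 D_i - I_n)^T (2 D_i - I_n) = I_n
assert np.allclose(G_i.T @ G_i, X.T @ X)  # so G_i^T G_i = X^T X, independent of i
```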

While it can be costly to solve Eq. 3.3a to a high accuracy using iterative methods, especially during the early iterations of ADMM, [23, Proposition 6] has shown that even when Eq. 3.3a is solved approximately, as long as the accuracy threshold $\varphi$ of each ADMM iteration forms a convergent sequence, the ADMM algorithm still converges to the global optimum of Eq. 8. Each iterative solution of the $u$-update subproblem can also take advantage of warm-starting by initializing at the result of the previous ADMM iteration. Accordingly, we alternate between one ADMM update and several RBCD updates.

Compared with the parallel independent work [46], our method shares some connections but is overall distinct. The authors of [46] considered two approaches, one using an unconstrained relaxation of the constrained convex training formulation and the other directly tackling the constrained formulation. While [46] also proposes to reformulate the constraints into an augmented Lagrangian, it uses a separation scheme different from ours. Specifically, we separate the group-sparse regularization in addition to the constraints, whereas [46] only separates the constraints. As a result, our ADMM separation allows the primal update subproblem Eq. 3.3a to be solved in closed form for the case of the squared loss, whereas [46] requires the FISTA algorithm for the primal update step. For general loss functions, our separation embeds strong convexity into the subproblem Eq. 3.3a, allowing the randomized block coordinate descent (RBCD) subroutine to converge linearly. Furthermore, our ADMM algorithm also achieves linear convergence, whereas [46] establishes a slower $\mathcal{O}(\frac{1}{\epsilon \delta})$ dual convergence rate.

4 Convex Adversarial Training

The inherent difficulties with adversarial training can be addressed by taking advantage of the convex training framework and the related algorithms.

4.1 Adversarial Training Background

A classifier is considered robust against adversarial perturbations if it assigns the same class to all inputs within a perturbation set. The perturbation set defines the allowed input distortion: an unlimited distortion breaks even the most robust models and is also impractical, since it can be easily detected and rejected. We consider an $\ell_\infty$-bounded perturbation set with radius $\epsilon > 0$, a common problem formulation proposed in [29]:

$\mathcal{X} = \big\{ X + \Delta \in \mathbb{R}^{n \times d} \;\big|\; \Delta = [\delta_1, \dots, \delta_n]^\top, \; \delta_k \in \mathbb{R}^d, \; \| \delta_k \|_\infty \le \epsilon, \; \forall k \in [n] \big\}.

We consider the “white box” setting, where the adversary has complete knowledge about the ANN. A common method for training robust classifiers is to minimize the loss associated with the worst-case perturbation, i.e., the attack resulting in the maximum loss within the perturbation set. More concretely, we solve the following min-max problem proposed in [45]:

(13) $\min_{(u_j, \alpha_j)_{j=1}^m} \Big( \max_{\Delta : X + \Delta \in \mathcal{X}} \ell\Big( \sum_{j=1}^m \big( (X + \Delta) u_j \big)_+ \alpha_j, \; y \Big) + \frac{\beta}{2} \sum_{j=1}^m \big( \| u_j \|_2^2 + \alpha_j^2 \big) \Big).$

This process of “training with adversarial data” is often referred to as “adversarial training”, as opposed to “standard training” that trains on clean, unperturbed data. In the prior literature, Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) were commonly used to numerically solve the inner maximization of Eq. 13 and generate adversarial examples [45]. Specifically, PGD generates adversarial examples 𝑥 ~ by running the iterations

(14) $\tilde{x}^{t+1} = \Pi_{\mathcal{X}}\Big( \tilde{x}^t + \gamma_p \cdot \mathrm{sgn}\Big( \nabla_x \ell\Big( \sum_{j=1}^m (x^\top u_j)_+ \alpha_j, \; y \Big) \Big) \Big)$

for $t = 0, 1, \dots, T$, where $\tilde{x}^t$ is the perturbed data vector at the $t$th iteration, $\gamma_p > 0$ is the step size, and $T \ge 1$ is the number of iterations. The initial vector $\tilde{x}^0$ is the unperturbed data $x$. FGSM can be regarded as a special case of PGD where $T = 1$.
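A minimal PGD sketch under the squared loss (the loss choice, network shapes, and all names are illustrative; Eq. 14 applies to any differentiable loss):

```python
import numpy as np

def pgd_attack(x, y, U, alpha, eps, step, T):
    """l_inf-bounded PGD (Eq. 14): ascend on the sign of the gradient, then
    project back onto the eps-ball around the clean input x. FGSM is T = 1."""
    x_t = x.copy()
    for _ in range(T):
        pre = x_t @ U                               # (m,) pre-activations
        y_hat = np.maximum(pre, 0.0) @ alpha        # network output
        # d y_hat / dx = sum_j 1[x^T u_j > 0] u_j alpha_j; squared-loss chain rule
        grad = (y_hat - y) * (U @ (alpha * (pre > 0)))
        x_t = x_t + step * np.sign(grad)            # signed gradient ascent step
        x_t = x + np.clip(x_t - x, -eps, eps)       # projection onto the l_inf ball
    return x_t

rng = np.random.default_rng(0)
d, m = 4, 3
x = rng.standard_normal(d)
U = rng.standard_normal((d, m))
alpha = rng.standard_normal(m)
x_adv = pgd_attack(x, y=1.0, U=U, alpha=alpha, eps=0.1, step=0.05, T=5)
assert np.max(np.abs(x_adv - x)) <= 0.1 + 1e-12    # stays inside the perturbation set
```

The projection step is what keeps every iterate inside $\mathcal{X}$; without it, the signed ascent steps would accumulate beyond the $\epsilon$-ball.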

4.2 The Convex Adversarial Training Formulation

While adversarial training with PGD adversaries has demonstrated some success, this approach suffers from several limitations. Since the optimization landscape is generally non-concave in the perturbation $\Delta$, there is no guarantee that PGD will find the true worst-case adversary. Furthermore, traditional adversarial training solves complicated bi-level min-max optimization problems, exacerbating the instability of non-convex ANN training. Our experiments show that back-propagation gradient methods can struggle to converge when solving Eq. 13. Moreover, solving the bi-level optimization Eq. 13 requires an algorithm with a computationally cumbersome nested-loop structure. To overcome these difficulties, we leverage Theorem 2.1 to re-characterize Eq. 13 as robust, convex upper-bound problems that can be solved globally and efficiently.

We first develop a result about adversarial training involving general convex loss functions. The connection between the convex training objective and the non-convex ANN loss function holds only when the linear constraints in Eq. 2 are satisfied. For adversarial training, we need this connection to hold at all perturbed data matrices 𝑋 + Δ ∈ 𝒳 . Otherwise, if some matrix 𝑋 + Δ violates the linear constraints, then this perturbation Δ can correspond to a low convex objective value but a high actual loss. To ensure the correctness of the convex reformulation throughout 𝒳 , we introduce some robust constraints below.

Since the $D_i$ matrices in Eq. 2 reflect the ReLU patterns of $X$, these matrices can change when $X$ is perturbed. Therefore, we include all distinct diagonal matrices $\mathrm{diag}([(X + \Delta) u \ge 0])$ that can be obtained for all $u \in \mathbb{R}^d$ and all $\Delta : X + \Delta \in \mathcal{U}$, denoted as $D_1, \dots, D_{\hat{P}}$, where $\hat{P}$ is the total number of such matrices. Since $D_1, \dots, D_{\hat{P}}$ include the matrices $D_1, \dots, D_P$ in Eq. 2, we have $\hat{P} \ge P$. While $\hat{P}$ is at most $2^n$ in the worst case, since $\epsilon$ is often small, we expect $\hat{P}$ to be relatively close to $P$, where $P \le 2r \big( \frac{e(n-1)}{r} \big)^r$ as discussed above.

Finally, we replace the objective of the convex standard training formulation Eq. 2 with its robust counterpart, giving rise to the optimization problem

(5.4a) $\min_{(v_i, w_i)_{i=1}^{\hat{P}}} \Big( \max_{\Delta : X + \Delta \in \mathcal{U}} \ell\Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_i - w_i), \; y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \| v_i \|_2 + \| w_i \|_2 \big) \Big)$

(5.4b) $\text{s.t.} \quad \min_{\Delta : X + \Delta \in \mathcal{U}} (2 D_i - I_n)(X + \Delta) v_i \ge 0, \quad \min_{\Delta : X + \Delta \in \mathcal{U}} (2 D_i - I_n)(X + \Delta) w_i \ge 0, \quad \forall i \in [\hat{P}],$

where $\mathcal{U}$ is any convex additive perturbation set. The next theorem shows that Section 4.2 is an upper bound on the robust loss function Eq. 13, with the proof provided in Section F.5.

Theorem 4.1.

Let $(v^\star_{\mathrm{rob},i}, w^\star_{\mathrm{rob},i})_{i=1}^{\hat{P}}$ denote a solution of Section 4.2 and define $\hat{m}^\star$ as $|\{ i : v^\star_{\mathrm{rob},i} \neq 0 \}| + |\{ i : w^\star_{\mathrm{rob},i} \neq 0 \}|$. When the ANN width $m$ satisfies $m \ge \hat{m}^\star$, the optimization problem Section 4.2 provides an upper bound on the non-convex adversarial training problem Eq. 13. The robust ANN weights $(u^\star_{\mathrm{rob},j}, \alpha^\star_{\mathrm{rob},j})_{j=1}^{\hat{m}}$ can be recovered using Eq. 4.

When the perturbation set is zero, Theorem 4.1 reduces to Theorem 2.1. In light of Theorem 4.1, we use the optimization Section 4.2 as a surrogate for the optimization Eq. 13 to train the ANN. As will be shown in Section 4.3, an approximation to Section 4.2 can be applied to train ANNs with widths much smaller than $\hat{m}^\star$.

The robust constraints in Eq. 5.4b force all points within the perturbation set to be feasible. Intuitively, for every 𝑗 ∈ [ 𝑚 ^ ⋆ ] , Eq. 5.4b forces the ReLU activation pattern sgn ( ( 𝑋 + Δ ) ⁢ 𝑢 rob 𝑗 ⋆ ) to stay the same for all Δ such that 𝑋 + Δ ∈ 𝒰 . Moreover, if Δ rob ⋆ denotes a solution to the inner maximization in Eq. 5.4a, then 𝑋 + Δ rob ⋆ corresponds to the worst-case adversarial inputs for the recovered ANN.

Corollary 4.2.

For the perturbation set $\mathcal{X}$, the constraints in Eq. 5.4b are equivalent to

(16) $(2 D_i - I_n) X v_i \ge \epsilon \| v_i \|_1, \quad (2 D_i - I_n) X w_i \ge \epsilon \| w_i \|_1, \quad \forall i \in [\hat{P}].$

The proof of Corollary 4.2 is provided in Section F.6. Note that the left side of each inequality in Eq. 16 is a vector while the right side is a scalar, which means that each element of the corresponding vector should be greater than or equal to that scalar.
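This vector-versus-scalar reading of Eq. 16 can be expressed as a small feasibility check (a hypothetical helper for illustration, not part of the paper's code):

```python
import numpy as np

def robust_constraint_holds(X, D, v, eps):
    """Every entry of (2 D_i - I_n) X v_i must clear the scalar eps * ||v_i||_1."""
    n = X.shape[0]
    lhs = (2.0 * D - np.eye(n)) @ X @ v
    return bool(np.all(lhs >= eps * np.abs(v).sum()))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
a = rng.standard_normal(3)
D = np.diag((X @ a >= 0).astype(float))   # pattern matching the weight a

# v = 0 is always robustly feasible (0 >= 0), so the feasible set is nonempty.
assert robust_constraint_holds(X, D, np.zeros(3), eps=0.1)
# With eps = 0 the constraint reduces to the standard (non-robust) one, which
# a weight with matching activation pattern satisfies: (2D - I) X a = |X a| >= 0.
assert robust_constraint_holds(X, D, a, eps=0.0)
```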

We will show that the new problem can be efficiently solved in important cases. Specifically, Section 4.2 reduces to a classic convex optimization problem when ℓ ⁢ ( 𝑦 ^ , 𝑦 ) is the hinge loss, the squared loss, or the binary cross-entropy loss. Due to space restrictions, the result for the squared loss is presented in Section E.1.

4.3Practical Convex Adversarial Training Algorithm

Since Theorem 2.2 does not rely on assumptions about the matrix 𝑋 , it applies to an arbitrary 𝑋 + Δ matrix, and naturally extends to the convex adversarial training formulation Section 4.2. Therefore, an approximation to Section 4.2 can be applied to train robust ANNs with widths much less than 𝑚 ^ ⋆ . Similar to the strategy rendered in Algorithm 1, we use a subset of the 𝐷 𝑖 matrices for practical adversarial training. Since the 𝐷 𝑖 matrices depend on the perturbation Δ , we also add randomness to the data matrix 𝑋 in the sampling process to cover 𝐷 𝑖 matrices associated with different perturbations, leading to Algorithm 4. 𝑃 𝑎 and 𝑆 are preset parameters that determine the number of random weight samples, with 𝑃 𝑎 × 𝑆 ≥ 𝑃 𝑠 .
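The sampling loop of Algorithm 4 can be sketched in a few lines (illustrative; de-duplicating via a set of diagonal patterns is one simple choice among many):

```python
import numpy as np

def sample_robust_patterns(X, eps, P_a, S, P_s, seed=0):
    """Sample up to P_s distinct ReLU activation patterns, probing both the
    clean data X and random sign-perturbed copies X + eps * sgn(R)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    patterns = set()
    for _ in range(P_a):
        a = rng.standard_normal(d)                   # random hidden weight a_h
        candidates = [X] + [
            X + eps * np.sign(rng.standard_normal((n, d))) for _ in range(S - 1)
        ]
        for Xp in candidates:
            patterns.add(tuple((Xp @ a >= 0).astype(int)))   # diag of D_hj
            if len(patterns) >= P_s:                 # enough distinct matrices
                return [np.diag(p) for p in patterns]
    return [np.diag(p) for p in patterns]

Ds = sample_robust_patterns(np.random.default_rng(1).standard_normal((6, 2)),
                            eps=0.2, P_a=4, S=3, P_s=5)
assert len(Ds) <= 5
assert all(set(np.unique(D)) <= {0.0, 1.0} for D in Ds)   # 0/1 diagonal matrices
```

Perturbing $X$ with random sign matrices probes vertex-like corners of the perturbation set, which is where new activation patterns are most likely to appear.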

1: for $h = 1$ to $P_a$ do
2:   $a_h \sim \mathcal{N}(0, I_d)$ i.i.d.
3:   $D_{h1} \leftarrow \mathrm{diag}([X a_h \ge 0])$
4:   for $j = 2$ to $S$ do
5:     $R_{hj} \leftarrow [r_1, \dots, r_d]$, where $r_\kappa \sim \mathcal{N}(\mathbf{0}, I_n)$, $\forall \kappa \in [d]$
6:     $D_{hj} \leftarrow \mathrm{diag}([\bar{X}_{hj} a_h \ge 0])$, where $\bar{X}_{hj} \leftarrow X + \epsilon \cdot \mathrm{sgn}(R_{hj})$
7:     Discard repeated $D_{hj}$ matrices
8:     break if $P_s$ distinct $D_{hj}$ matrices have been generated
9:   end for
10: end for
11: Solve

(17) $\min_{(v_h, w_h)_{h=1}^{P_s}} \Big( \max_{\Delta : X + \Delta \in \mathcal{U}} \ell\Big( \sum_{h=1}^{P_s} D_h (X + \Delta)(v_h - w_h), \; y \Big) + \beta \sum_{h=1}^{P_s} \big( \| v_h \|_2 + \| w_h \|_2 \big) \Big)$
$\text{s.t.} \quad \min_{\Delta : X + \Delta \in \mathcal{U}} (2 D_h - I_n)(X + \Delta) v_h \ge 0, \quad \min_{\Delta : X + \Delta \in \mathcal{U}} (2 D_h - I_n)(X + \Delta) w_h \ge 0, \quad \forall h \in [P_s].$

12: Recover $u_1, \dots, u_{m_s}$ and $\alpha_1, \dots, \alpha_{m_s}$ from the solution $(v^\star_{\mathrm{rob}s,h}, w^\star_{\mathrm{rob}s,h})_{h=1}^{P_s}$ of Eq. 17 using Eq. 4.

Algorithm 4: Practical convex adversarial training

4.4 Convex Hinge Loss Adversarial Training

While the inner maximization of the robust problem Section 4.2 is still hard to solve in general, it is tractable for some loss functions. The simplest case is the piecewise-linear hinge loss $\ell(\hat{y}, y) = (1 - \hat{y} \odot y)_+$, which is widely used for classification. Here, we focus on binary classification with $y \in \{-1, 1\}^n$.

Consider the training problem for a one-hidden-layer ANN with ℓ 2 regularized hinge loss:

(18) $\min_{(u_j, \alpha_j)_{j=1}^m} \Big( \frac{1}{n} \cdot \mathbf{1}^\top \Big( \mathbf{1} - y \odot \sum_{j=1}^m (X u_j)_+ \alpha_j \Big)_+ + \frac{\beta}{2} \sum_{j=1}^m \big( \| u_j \|_2^2 + \alpha_j^2 \big) \Big).$

The adversarial training problem considering the ℓ ∞ -bounded adversarial data perturbation set 𝒳 is:

(19) $\min_{(u_j, \alpha_j)_{j=1}^m} \Big( \max_{\Delta : X + \Delta \in \mathcal{X}} \frac{1}{n} \cdot \mathbf{1}^\top \Big( \mathbf{1} - y \odot \sum_{j=1}^m \big( (X + \Delta) u_j \big)_+ \alpha_j \Big)_+ + \frac{\beta}{2} \sum_{j=1}^m \big( \| u_j \|_2^2 + \alpha_j^2 \big) \Big).$

Applying Theorem 4.1 and Corollary 4.2 leads to the following formulation as an upper bound on Eq. 19:

(20) $\min_{(v_i, w_i)_{i=1}^{\hat{P}}} \Big( \max_{\Delta : X + \Delta \in \mathcal{X}} \frac{1}{n} \cdot \mathbf{1}^\top \Big( \mathbf{1} - y \odot \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_i - w_i) \Big)_+ + \beta \sum_{i=1}^{\hat{P}} \big( \| v_i \|_2 + \| w_i \|_2 \big) \Big)$
$\text{s.t.} \quad (2 D_i - I_n) X v_i \ge \epsilon \| v_i \|_1, \quad (2 D_i - I_n) X w_i \ge \epsilon \| w_i \|_1, \quad \forall i \in [\hat{P}].$

For the purpose of generating the 𝐷 1 , … , 𝐷 𝑃 ^ matrices, instead of enumerating an infinite number of points in 𝒳 , we only need to enumerate all vertices of 𝒳 , which is finite. This is because the solution Δ hinge ⋆ to the inner maximum always occurs at a vertex of 𝒳 , as will be shown in Theorem 4.3. Solving the inner maximization of Section 4.4 in closed form leads to the next theorem, whose proof is provided in Section F.7.

Theorem 4.3.

For the binary classification problem, the inner maximum of Section 4.4 is attained at $\Delta_{\mathrm{hinge}}^\star = -\epsilon \cdot \mathrm{sgn}\big( \sum_{i=1}^{\hat{P}} D_i \, y \, (v_i - w_i)^\top \big)$, and the bi-level optimization problem Section 4.4 is equivalent to the classic optimization problem:

(21) $\min_{(v_i, w_i)_{i=1}^{\hat{P}}} \frac{1}{n} \sum_{k=1}^n \Big( 1 - y_k \sum_{i=1}^{\hat{P}} d_{ik} \, x_k^\top (v_i - w_i) + \epsilon \Big\| \sum_{i=1}^{\hat{P}} d_{ik} (v_i - w_i) \Big\|_1 \Big)_+ + \beta \sum_{i=1}^{\hat{P}} \big( \| v_i \|_2 + \| w_i \|_2 \big)$
$\text{s.t.} \quad (2 D_i - I_n) X v_i \ge \epsilon \| v_i \|_1, \quad (2 D_i - I_n) X w_i \ge \epsilon \| w_i \|_1, \quad \forall i \in [\hat{P}],$

where $d_{ik}$ denotes the $k$th diagonal element of $D_i$.

The problem Theorem 4.3 is a finite-dimensional convex program that upper-bounds Eq. 19, the robust counterpart of Eq. 18. We can thus solve Theorem 4.3 to robustly train the ANN.
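The closed-form adversary in Theorem 4.3 amounts to a single matrix operation; a sketch with illustrative names:

```python
import numpy as np

def worst_case_hinge_perturbation(Ds, y, vs, ws, eps):
    """Delta*_hinge = -eps * sgn(sum_i D_i y (v_i - w_i)^T). Every entry lies
    in {-eps, 0, eps}, i.e. the maximizer sits at a vertex of the l_inf set."""
    M = sum(D @ np.outer(y, v - w) for D, v, w in zip(Ds, vs, ws))  # (n, d)
    return -eps * np.sign(M)

rng = np.random.default_rng(0)
n, d, P_hat = 5, 3, 2
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
Ds = [np.diag((X @ rng.standard_normal(d) >= 0).astype(float)) for _ in range(P_hat)]
vs = [rng.standard_normal(d) for _ in range(P_hat)]
ws = [rng.standard_normal(d) for _ in range(P_hat)]

Delta = worst_case_hinge_perturbation(Ds, y, vs, ws, eps=0.1)
assert Delta.shape == (n, d)
assert np.all(np.isin(Delta, [-0.1, 0.0, 0.1]))   # a vertex of the perturbation set
```

Because the maximizer is always such a sign pattern, only the finitely many vertices of $\mathcal{X}$ matter when generating the $D_i$ matrices, as noted above.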

4.5 Convex Binary Cross-Entropy Loss Adversarial Training

The binary cross-entropy loss is also widely used in binary classification. Here, we consider a scalar-output ANN with a scaled tanh output layer for binary classification with $y \in \{0, 1\}^n$. The loss function $\ell(\cdot)$ in this case is $\ell(\hat{y}, y) = -2 \hat{y}^\top y + \mathbf{1}^\top \log(e^{2\hat{y}} + 1)$. The non-convex adversarial training formulation considering the $\ell_\infty$-bounded data uncertainty set $\mathcal{X}$ is then:

(22) $\min_{(u_j, \alpha_j)_{j=1}^m} \Big( \max_{\| \Delta \|_{\max} \le \epsilon} \frac{1}{n} \sum_{k=1}^n \big( -2 \hat{y}_k y_k + \log(e^{2\hat{y}_k} + 1) \big) \Big) + \frac{\beta}{2} \sum_{j=1}^m \big( \| u_j \|_2^2 + \alpha_j^2 \big), \quad \text{where } \hat{y} := \sum_{j=1}^m \big( (X + \Delta) u_j \big)_+ \alpha_j.$

Applying Theorem 4.1 and Corollary 4.2 leads to the following optimization problem as an upper bound on Eq. 22:

(23) $\min_{(v_i, w_i)_{i=1}^{\hat{P}}} \Big( \max_{\| \Delta \|_{\max} \le \epsilon} \frac{1}{n} \sum_{k=1}^n \big( -2 \hat{y}_k y_k + \log(e^{2\hat{y}_k} + 1) \big) \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \| v_i \|_2 + \| w_i \|_2 \big)$
$\text{s.t.} \quad (2 D_i - I_n) X v_i \ge \epsilon \| v_i \|_1, \quad (2 D_i - I_n) X w_i \ge \epsilon \| w_i \|_1, \quad \forall i \in [\hat{P}],$
$\hat{y}_k = \sum_{i=1}^{\hat{P}} d_{ik} \, x_k^\top (v_i - w_i) + \sum_{i=1}^{\hat{P}} d_{ik} \, \delta_k^\top (v_i - w_i).$

Consider the convex optimization formulation

(24) $\min_{(v_i, w_i)_{i=1}^{\hat{P}}} \frac{1}{n} \Big( \sum_{k=1}^n f \circ g_k\big( \{ v_i, w_i \}_{i=1}^{\hat{P}} \big) \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \| v_i \|_2 + \| w_i \|_2 \big)$
$\text{s.t.} \quad (2 D_i - I_n) X v_i \ge \epsilon \| v_i \|_1, \quad (2 D_i - I_n) X w_i \ge \epsilon \| w_i \|_1, \quad \forall i \in [\hat{P}],$

where

$f(u) := \log(e^{2u} + 1), \qquad g_k\big( \{ v_i, w_i \}_{i=1}^{\hat{P}} \big) := -(2 y_k - 1) \sum_{i=1}^{\hat{P}} d_{ik} \, x_k^\top (v_i - w_i) + \epsilon \cdot \Big\| \sum_{i=1}^{\hat{P}} d_{ik} (v_i - w_i) \Big\|_1, \quad \forall k \in [n].$

The next theorem establishes the equivalence between Section 4.5 and Eq. 23. The proof is provided in Section F.9.

Theorem 4.4.

The optimization Section 4.5 is a convex program that is equivalent to the bi-level optimization Eq. 23, and can be used as a surrogate for Eq. 22 to train robust ANNs. The worst-case perturbation is $\Delta_{\mathrm{BCE}}^\star = -\epsilon \cdot \mathrm{sgn}\big( (2y - 1) \sum_{i=1}^{\hat{P}} D_i (v_i - w_i)^\top \big)$.

Note that the worst-case perturbation occurs at the same location as for the hinge loss case, which is a vertex in 𝒳 . Thus, for the purpose of generating the 𝐷 1 , … , 𝐷 𝑃 ^ matrices, we again only need to enumerate all vertices of 𝒳 instead of all points in 𝒳 .

5 Numerical Experiments

Due to space restrictions, we focus on binary classification with the hinge loss, and defer the squared loss results to Section C.5.

5.1 Approximated Convex Standard Training

Figure 1: Analyzing the effect of $P_s$ on convex standard training. (a) A randomized 2-dimensional dataset used in this experiment. The red crosses are positive training points and the white circles are negative training points. The region classified as positive is in blue, whereas the negative region is in black. (b) The optimized training loss for each $P_s$. When $P_s$ reaches 128, the mean and variance of the optimized loss become very small.

In this subsection, we use numerical experiments to demonstrate the efficacy of practical standard training (Algorithm 1) and to show the level of suboptimality of the ANN trained using Algorithm 1. The experiment was performed on a randomly generated dataset with $n = 40$ and $d = 2$ shown in Fig. 1(a). The upper bound on the number of ReLU activation patterns is $4 \big( \frac{e(39)}{2} \big)^2 \approx 11239$. We ran Algorithm 1 to train ANNs using the hinge loss with the number of $D_h$ matrices equal to $4, 8, 16, \dots, 2048$ and compared the optimized losses. We repeated this experiment 15 times for each setting, and plotted the loss in Fig. 1(b). The error bars show the loss values achieved in the best and the worst runs. When there are more than 128 matrices (much fewer than the theoretical bound on $P$), Algorithm 1 yields consistent and favorable results. Further increasing the number of $D$ matrices does not produce a significantly lower loss. By Theorem 2.2, $P_s = 128$ corresponds to $\psi \xi = 0.318$.

5.2 The ADMM Convex Training Algorithm

We now present the experiment results with the ADMM training algorithm. We use Algorithm 2 to solve the approximate convex training formulation Eq. 5 with the sampled 𝐷 ℎ matrices. In Section D.1, we discuss our experiments’ ADMM hyperparameter settings and present guidelines on selecting them.

5.2.1 Squared Loss (Closed-Form $u$ Updates) – Convergence

For the case of the squared loss, the closed-form solution Eq. 12 is used for the $u$ updates. We first demonstrate the convergence of the proposed ADMM algorithm using illustrative random data with dimensions $n = 6$, $d = 5$, and $P_s = 8$. CVX [30] with the IPM-based MOSEK solver [5] was used to compute the optimal objective of Eq. 2 as the ground truth.

In the figures, we use $l_{\mathrm{CVX}}^\star$ to denote the CVX optimal objective and $l_{\mathrm{ADMM}}^\star$ to denote the objective that ADMM converges to as the number of iterations $k$ goes to infinity. There are several ways to calculate the training loss obtained by ADMM. For fair comparisons among ADMM, CVX, and SGD, we use Eq. 4 to recover the ANN weights $(u_j, \alpha_j)_{j=1}^m$ from the ADMM optimization variables $(v_h^k, w_h^k)_{h=1}^{P_s}$, and use $(u_j, \alpha_j)_{j=1}^m$ to calculate the true non-convex training loss Eq. 1. The loss at each iteration calculated via this method is denoted as $l_{\mathrm{ADMM}}^{u,\alpha}$, and the ADMM solution $l_{\mathrm{ADMM}}^\star$ is also calculated via this method. At each iteration, we also compute the convex objective of Eq. 2 using $(v_h^k, w_h^k)_{h=1}^{P_s}$, denoted as $l_{\mathrm{ADMM}}^{v,w}$. Since ADMM uses dual variables to enforce the constraints, while the ADMM solution is feasible as $k$ goes to infinity, the intermediate iterates may not be feasible. When the constraints in Eq. 2 are satisfied, it holds that $l_{\mathrm{ADMM}}^{u,\alpha} = l_{\mathrm{ADMM}}^{v,w}$; otherwise, the two may differ. The gap between $l_{\mathrm{ADMM}}^{u,\alpha}$ and $l_{\mathrm{ADMM}}^{v,w}$ indirectly characterizes the feasibility of the ADMM intermediate solutions: when this gap is small, $(v_h^k, w_h^k)_{h=1}^{P_s}$ should be almost feasible; when it is large, the constraints may be severely violated.

Figure 2: Gaps between the cost returned by ADMM during the first 25 iterations and the true optimal cost over five independent runs. (a) $l_{\mathrm{ADMM}}^{u,\alpha} - l_{\mathrm{ADMM}}^\star$; (b) $l_{\mathrm{ADMM}}^{u,\alpha} - l_{\mathrm{CVX}}^\star$; (c) $| l_{\mathrm{ADMM}}^{u,\alpha} - l_{\mathrm{ADMM}}^{v,w} |$.

While it can be expensive for ADMM to converge to a high precision (note that the algorithm is guaranteed to converge linearly to a global minimum given ample computation time, according to Theorem 3.1), an approximate solution is usually sufficient for achieving a high validation accuracy, since decreasing the training loss excessively could induce overfitting. Therefore, when performing the experiments, we apply early stopping [50], a common training technique that improves generalization. Figs. 2(a) and 2(b) show that a precision of $10^{-3}$ can be achieved within 25 iterations. Moreover, Fig. 2(c) shows that the solution after 25 iterations violates the constraints only insignificantly. This behavior of "converging rapidly in the first several steps and slowing down (to a linear rate) afterward" is typical for the ADMM algorithm. As will be shown next, a medium-accuracy solution returned by only a few ADMM iterations can achieve a better prediction performance than the CVX solution. In Section C.1, we present empirical results that demonstrate the asymptotic convergence properties of ADMM.

To visualize how the prediction performance achieved by the model changes as the ADMM iteration progresses, we run the ADMM iterations on the “mammographic masses” dataset from the UCI Machine Learning Repository [22], and record the prediction accuracy on the validation set at each iteration. 70% of the dataset is randomly selected as the training set, and the other 30% is used as the validation set. Fig. 3 plots the difference between the ADMM accuracy and the CVX accuracy at each iteration. In all experiments, all variables in the ADMM algorithm are initialized to be zero.

Figure 3: Comparing the ANNs trained with ADMM and with CVX over ten independent runs on the mammographic masses dataset. (a) $\mathrm{Accuracy}_{\mathrm{ADMM}} - \mathrm{Accuracy}_{\mathrm{CVX}}$ (positive means the ADMM solution outperforms CVX). (b) Fig. 3(a) zoomed in to the first five iterations.

All ten runs achieve superior validation accuracy throughout the first 200 iterations compared with the CVX baseline. Even the first five iterations outperform the baseline, with the best run outperforming CVX by 6%. After about 80 iterations, the accuracy stabilizes at around 2% to 4% better than CVX. In conclusion, the prediction performance of the classifiers trained by ADMM is superior even when only a few iterations are run.

Figure 4: Analyzing the effect of $n$ and $P_s$ on ADMM convex training with the MNIST dataset. (a) Average validation accuracy for each $n$. (b) Average CPU wall time for each $n$. (c) Average validation accuracy for each $P_s$. (d) Average CPU wall time for each $P_s$.

5.2.2 Squared Loss (Closed-Form $u$ Updates) – Complexity

To demonstrate the computational complexity of the proposed ADMM method, we used it to train ANNs on the downsampled MNIST handwritten digits dataset with $d = 100$. The task was to perform binary classification between digits "2" and "8". We first fix $P_s = 8$ and vary $n$ from 100 to 11809. We independently repeat the experiment five times for each $n$ setting, and present the average results in Figs. 4(a) and 4(b). In each experiment, ADMM is allowed to run six iterations, which is sufficient to train an accurate ANN. For all choices of $n$ except $n = 100$, the ANNs trained with ADMM attain higher accuracy than the CVX networks. This is because while ADMM and CVX solve the same problem, the medium-precision solution from ADMM generalizes better than the high-precision CVX solution. More importantly, as $n$ increases, the CPU time required for CVX grows much faster than ADMM's execution time, which increases linearly in $n$. While it is also theoretically possible to run the IPM to a medium precision, even a few IPM iterations become too expensive when $n$ is large. Moreover, since the IPM uses barrier functions to approximate the constraints, a medium-precision solution produced by the IPM may have feasibility issues, while the ADMM solution sequence generally has good feasibility, as illustrated in Fig. 2.

Similarly, we fix $n = 1000$ and vary $P_s$ from 4 to 50. The average results over five runs are shown in Figs. 4(c) and 4(d). Once again, the proposed ADMM algorithm achieves a higher accuracy for each $P_s$, and the average CPU time of ADMM grows much more slowly than the CVX CPU time. When $P_s$ is 20, all five CVX runs achieve low validation accuracy, possibly because the structure of the true underlying distribution cannot be well approximated by a combination of 20 linear classifiers. Figs. 4(c) and 4(d) also show that the CPU time scales quadratically with $P_s$, confirming our theoretical analysis of the $\mathcal{O}(n P_s + d^2 P_s^2)$ per-iteration complexity.

5.2.3 Squared Loss (Closed-Form $u$ Updates) – MNIST, Fashion MNIST, and CIFAR-10

We now demonstrate the effectiveness of the proposed ADMM algorithm on all images of "2" and "8" in the MNIST dataset without downsampling ($n = 11809$ and $d = 784$). The parameter $P_s$ was chosen to be 24, corresponding to a network width of at most 48. The prediction accuracy on the validation set, the training loss, and the CPU time are shown in Table 2. The baseline method "CVX" corresponds to using CVX to globally optimize the ANN by solving Eq. 2, while "Back-prop" denotes the conventional method that performs an SGD local search on the non-convex cost function Eq. 1.

Table 2 shows that the training loss returned by ADMM is higher than the true optimal cost but lower than the back-propagation solution. Note that the difference between the ADMM training loss and the CVX loss is due to the early stopping strategy applied to ADMM. ADMM will converge to the true global optimum given sufficient computation time, but we prematurely terminate the algorithm once the validation accuracy becomes satisfactory so that the rapid initial convergence of ADMM can be fully exploited. In contrast, back-propagation does not have this guarantee due to the non-convexity of Eq. 1. Moreover, back-propagation is highly sensitive to the initialization and the hyperparameters. While ADMM also requires a pre-specified step size $\gamma_a$, it is much more stable: its convergence to a primal optimum does not depend on the step size [14, Appendix A]. An optimal step size speeds up the training, but a suboptimal step size is also acceptable.

ADMM achieves a higher validation accuracy than both CVX and back-propagation SGD. Once again, while ADMM and CVX solve the same problem, the CVX solution suffers from overfitting and thus cannot generalize well to the validation data.

The training time of ADMM is considerably shorter than CVX. Specifically, assembling the matrix 𝐼 + 1 𝜌 ⁢ 𝐹 ⊤ ⁢ 𝐹 + 𝐺 ⊤ ⁢ 𝐺 required 22% of the time, and the Cholesky decomposition needed 34% of the time, while each ADMM iteration only took 4.4% of the time. Thus, running more ADMM iterations will not considerably increase the training time.

Table 2: Average experiment results with the squared loss on the MNIST dataset over five independent runs. We run 10 ADMM iterations for each setting.

| Method | Validation Accuracy | CPU Time (s) | Training Loss | Global Convergence |
| --- | --- | --- | --- | --- |
| Back-prop | 98.86% | 74.09 | 422.4 | No |
| CVX | 70.99% | 14879 | 1.146 | Yes |
| ADMM | 98.90% | 802.2 | 223.2 | Yes |

Table 3: Average experiment results with the squared loss over five independent runs.

Fashion MNIST (42 ADMM iterations, $P_s$ set to 18)

| Method | Validation Accuracy | CPU Time (s) | Training Loss |
| --- | --- | --- | --- |
| Back-prop | 99.04% (.0735%) | 183.6 | 175.1 (4.246) |
| ADMM | 98.73% (.0200%) | 167.1 | 129.7 (13.24) |
| Back-prop (DS) | 98.34% (.0917%) | 18.31 | 433.0 (10.40) |
| ADMM (DS) | 98.80% (.0585%) | 6.840 | 380.1 (17.74) |

Downsampled CIFAR-10 (30 ADMM iterations, $P_s$ set to 18)

| Method | Validation Accuracy | CPU Time (s) | Training Loss |
| --- | --- | --- | --- |
| Back-prop (DS) | 90.90% (.305%) | 122.7 | 991.5 (11.68) |
| ADMM (DS) | 86.89% (.132%) | 118.6 | 607.6 (10.76) |

• "DS" denotes image downsampling with a stride of 2. The numbers in the parentheses are the standard deviations over five runs.

Note that the ADMM algorithm is theoretically guaranteed to converge to an approximate global minimum, whereas back-propagation does not have this property.

Figure 5: The learning curves of the closed-form ADMM algorithm and back-propagation gradient descent. The flat parts of the ADMM curves represent the pre-processing time.

Next, we compare ADMM with back-propagation on the more challenging Fashion MNIST [60] and CIFAR-10 datasets. For Fashion MNIST, we perform binary classification between the "pullover" and the "bag" classes on both the full data (𝑛 = 12000, 𝑑 = 784) and downsampled data (𝑛 = 12000, 𝑑 = 196). For CIFAR-10, we perform binary classification between "birds" and "ships", and downsample the images to 16 × 16 × 3. The results are presented in Table 3, and we plot the training loss with respect to time in Fig. 5. The results show that ADMM converges faster and achieves a lower loss within the same allowed time, even though it requires preprocessing before the iterations start. However, on these datasets, the classifiers learned via back-propagation generalize better to the validation set. Gradient descent is known to have favorable properties for machine learning, where solutions with similar losses can have vastly different generalization behaviors. For applications where training data is abundant, ADMM is well-suited since the generalization gap would be small.

We also note that ADMM is extremely efficient on the downsampled Fashion MNIST dataset: when the data dimension is smaller, the faster convergence of ADMM outweighs the higher complexity associated with the decomposition. This result suggests that ADMM is particularly suitable for data with a dimension of around 200.

5.2.4 Binary Cross-Entropy Loss (Iterative 𝑢 Updates) – MNIST

To verify the efficacy of using the RBCD method to solve Eq. 3.3a, we similarly experiment with the binary cross-entropy loss coupled with a tanh output activation. The resulting loss function is ℓ(𝑦̂, 𝑦) = −2𝑦̂⊤𝑦 + 𝟏⊤ log(𝑒^(2𝑦̂) + 𝟏). Since the augmented Lagrangian's gradient in the stopping condition of Algorithm 3 is difficult to obtain, we use the objective improvement amount as a surrogate.
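As a concrete check of this formula, the following sketch (our own illustration, not the paper's code) evaluates ℓ(𝑦̂, 𝑦) = −2𝑦̂⊤𝑦 + 𝟏⊤ log(𝑒^(2𝑦̂) + 𝟏) in a numerically stable softplus form; the function name is hypothetical.

```python
import numpy as np

def bce_tanh_loss(y_hat, y):
    """Loss l(y_hat, y) = -2 y_hat^T y + 1^T log(exp(2 y_hat) + 1).

    Element-wise this is -z*y + log(1 + e^z) with z = 2*y_hat, i.e. the
    logistic (binary cross-entropy) loss evaluated at the logits 2*y_hat.
    The softplus term is computed in its overflow-safe form
    max(z, 0) + log(1 + e^{-|z|})."""
    z = 2.0 * np.asarray(y_hat, dtype=float)
    softplus = np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))
    return float(-z @ np.asarray(y, dtype=float) + softplus.sum())
```

For moderate 𝑦̂ this agrees with the naive expression, but it does not overflow for large |𝑦̂|, which matters when this loss appears inside the RBCD inner iterations.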

Table 4: Average experiment results with the binary cross-entropy loss over five runs. The main advantage of ADMM-RBCD is its theoretically guaranteed global convergence.

MNIST (34 ADMM iterations, 𝑃 𝑠 set to 24)

| Method | Validation Accuracy | CPU Time (s) | Training Loss |
| --- | --- | --- | --- |
| Back-prop | 98.91% | 62.06 | 437.6 |
| CVX | 98.21% | 14217 | 1.007 |
| ADMM-RBCD | 98.89% | 555.8 | 310.3 |

The experiment results are shown in Table 4. On the MNIST dataset, the ADMM-RBCD algorithm achieves a high validation accuracy while requiring 94.6% less training time than globally optimizing the cost function Eq. 2 with CVX. ADMM-RBCD also requires less time to reach a comparable accuracy than the closed-form ADMM method with the squared loss. On the other hand, ADMM-RBCD is still slower than back-propagation local search, trading training speed for the global convergence guarantee. The extremely slow pace of CVX precludes its application to even medium-scale problems, while ADMM-RBCD makes convex training much more practical by balancing efficiency and optimality.

5.2.5 GPU Acceleration

The success of modern deep learning relies on the parallelized computing enabled by GPUs. Using GPUs to accelerate the proposed ADMM algorithm is straightforward. All operations required in the ADMM algorithm (Algorithm 2) are already implemented in existing GPU-supporting deep learning libraries like PyTorch [48]. Specifically, Eq. 3.3c consists of parallelizable algebraic operations, and we have shown that Eq. 3.3b reduces to parallelizable element-wise operations. If the RBCD algorithm is used to solve Eq. 3.3a, then all operations are again parallelizable (as is the case for traditional back-propagation gradient descent), and auto-differentiation can be used to obtain the closed-form gradients.

To verify the effectiveness of GPU acceleration and show that ADMM-RBCD scales to wider neural networks and higher dimensions with the help of GPUs, we use the method to train binary classifiers with 𝑃 𝑠 set to 120 on the CIFAR-10 dataset. The average validation accuracy over five runs is 91.23%. On a MacBook Pro laptop computer, this task takes 474.5 seconds on average. Repeating the experiment on an Nvidia V100 GPU only requires 24.64 seconds, which is a 19.25x speed-up.

5.2.6 Summary of ADMM Experiment Results

Based on the above experiment results, we summarize some advantages of our ADMM methods below:

1. While the closed-form ADMM algorithm has a higher theoretical complexity than back-propagation, it is guaranteed to converge linearly to a global optimum if allowed to run for a sufficiently long time, enabling efficient global optimization of neural networks. Back-propagation does not have this property.
2. The closed-form ADMM algorithm often converges rapidly in the first few iterations. Since a moderately accurate solution is sufficient for many machine learning tasks, this fast initial convergence is highly advantageous.
3. For datasets with a relatively small number of dimensions, the closed-form ADMM algorithm is more efficient than back-propagation (as shown in Table 3), since the faster convergence outweighs the increased complexity.
4. Compared with closed-form ADMM, ADMM-RBCD applies to general convex loss functions and scales better to wide ANNs, but is less efficient, as illustrated in Table 4. ADMM-RBCD is thus a trade-off between CVX (high solution quality) and back-propagation (high efficiency), while maintaining the theoretically provable global convergence.

In summary, the proposed ADMM method is particularly suited for applications where:

- abundant training data exists (a low empirical risk translates to a low true risk);
- accuracy is more important than computational efficiency;
- the number of dimensions is not too large.

5.3 Convex Adversarial Training

All experiment results in this section are obtained using CVX with the MOSEK solver based on the interior-point method.

5.3.1 Hinge Loss Convex Adversarial Training – 2D Illustration

Figure 6: Visualization of the binary decision boundaries in a 2-dimensional space. Algorithm 4 fits the perturbation boxes while the standard training fits the training points. Red crosses: positive training points; red circles: negative training points. Blue region: classified as positive; black region: classified as negative. The white box around each training point marks the ℓ∞ perturbation bound, and the white dot at a vertex of each box marks the worst-case perturbation.

To analyze the decision boundaries obtained from convex adversarial training, we ran Algorithm 1 and Algorithm 4 on 34 random points in a two-dimensional space for binary classification. The algorithms were run with the parameters 𝑃 𝑠 = 360 and 𝜖 = 0.08. A bias term was included by concatenating a column of ones to the data matrix 𝑋. The decision boundaries shown in Fig. 6 confirm that Algorithm 4 fits the perturbation boxes as designed, coinciding with the theoretical prediction [45, Figure 3]. In Section C.4, we compare the decision boundaries of convex training and back-propagation methods, and discuss how the regularization strength 𝛽 affects the decision boundaries. In Section C.3, we compare the convex and the non-convex optimization landscapes and demonstrate robustness certificates around the training data.

5.3.2 Hinge Loss Convex Adversarial Training – Image Classification

We now verify the real-world performance of the proposed convex training methods on a subset of the CIFAR-10 image classification dataset [40] for binary classification between "birds" and "ships". The subset consists of 600 images downsampled to 𝑑 = 7 × 7 × 3 = 147. We use clean data and adversarial data generated with FGSM and PGD to compare Algorithm 1, Algorithm 4, traditional back-propagation standard training (abbreviated as GD-std), and the widely used adversarial training method that uses FGSM or PGD to solve the inner maximization of Eq. 19 and back-propagation to solve the outer minimization (abbreviated as GD-FGSM and GD-PGD). The implementation details of FGSM and PGD are discussed in Section D.2.
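As background on the attack used here, the following sketch (ours, not the paper's implementation) applies one FGSM step [29] to a one-hidden-layer ReLU network 𝑓(𝑥) = ∑ⱼ (𝑥⊤𝑢ⱼ)₊𝛼ⱼ under the squared loss; all variable names are illustrative.

```python
import numpy as np

def fgsm_attack(x, y, U, alpha, eps):
    """One FGSM step on f(x) = sum_j relu(x @ u_j) * alpha_j
    with squared loss l = 0.5 * (f(x) - y)^2.

    U: (m, d) hidden weights, alpha: (m,) output weights.
    Returns x + eps * sign(grad_x l), an l_inf-bounded perturbation."""
    pre = U @ x                        # hidden pre-activations, shape (m,)
    f = np.maximum(pre, 0.0) @ alpha   # network output (scalar)
    # d f / d x = sum_j 1[pre_j > 0] * alpha_j * u_j
    grad_f = U.T @ ((pre > 0).astype(float) * alpha)
    grad_loss = (f - y) * grad_f
    return x + eps * np.sign(grad_loss)
```

PGD iterates such steps with a smaller step size and projects back onto the 𝜖-ball after each step, which is why GD-PGD is markedly slower than GD-FGSM in the results below.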

Table 5: Average optimal objective and accuracy on clean and adversarial data over seven runs on the CIFAR-10 dataset. The standard deviations across the runs are shown in parentheses.

| Method | Clean accuracy | FGSM adv. | PGD adv. | Objective | CPU Time (s) |
| --- | --- | --- | --- | --- | --- |
| GD-std | 79.56% (.414%) | 47.09% (.4290%) | 45.60% (.4796%) | .3146 | 108.4 |
| GD-FGSM | 75.30% (3.10%) | 61.03% (4.763%) | 60.99% (4.769%) | .8370 | 154.9 |
| GD-PGD | 76.56% (.604%) | 62.48% (.2215%) | 62.44% (.1988%) | .8220 | 1764 |
| Algorithm 1 | 81.01% (.809%) | .4857% (.1842%) | .3571% (.1239%) | 6.910 × 10⁻³ | 37.77 |
| Algorithm 4 | 78.36% (.325%) | 66.95% (.4564%) | 66.81% (.4862%) | .6511 | 1544 |

Table 5 shows the results on our CIFAR-10 subset. Convex standard training (Algorithm 1) achieves a higher clean accuracy and a much lower training loss than GD-std, supporting the findings of Theorem 2.2. The non-robust convex-trained model is highly sensitive to adversarial perturbations. This is because standard training has no control over the loss of the perturbed inputs, and the high optimization accuracy of convex training exacerbates this issue, making convex adversarial training (Algorithm 4) paramount. As shown in Table 5, Algorithm 4 achieves a higher accuracy on clean and adversarial data alike compared to GD-FGSM and GD-PGD. While Algorithm 4 solves the upper-bound problem Theorem 4.3, it returns a lower training objective than GD-FGSM and GD-PGD, showing that back-propagation fails to find an optimal network. In addition to achieving superior results and higher observed stability, Algorithm 1 and Algorithm 4 are theoretically guaranteed to converge to their global optima, hence particularly suitable for safety-critical applications.

We also compare the aforementioned SDP relaxation adversarial training method [51] and the LP relaxation method [59] against our work on the CIFAR-10 subset. While an iteration of the LP or the SDP method is faster than a GD-PGD iteration, the ANNs trained with the LP or SDP method achieve worse accuracy and robustness than those trained with Algorithm 4: the LP method achieves a 74.05% clean accuracy and a 58.65% PGD accuracy, whereas the SDP method achieves 73.35% on clean data and 40.45% on PGD adversaries. These results indicate that Algorithm 4 trains more robust ANNs and that the LP and SDP relaxations can be extremely loose and unstable. While [51, 59] applied the convex relaxation method to the adversarial training problem, their training formulations are non-convex.

The presence of an ℓ 1 norm term in the upper-bound formulations Theorem 4.3 and Section 4.5 indicates that adversarial training with a small 𝜖 has a regularizing effect, which can improve generalization, supporting the finding of [41]. In the above experiments, Algorithm 4 outperforms Algorithm 1 on adversarial data, highlighting the contribution of Algorithm 4: a novel convex adversarial training procedure that reliably trains robust ANNs.

6 Concluding Remarks

In this paper, we used the SCP theory to characterize the quality of the solution obtained from an approximation method, providing theoretical insights into practical convex training. We then developed a separating scheme and applied the ADMM algorithm to a family of convex training formulations. When combined with the approximation method, the algorithm achieves a quadratic per-iteration computational complexity and a linear convergence towards an approximate global optimum. We also introduced a simpler unconstrained convex training formulation based on an SCP relaxation. The characterization of its solution quality shows that ELMs are convex relaxations to ANNs. Compared to traditional back-propagation, our training algorithms possess theoretical convergence rate guarantees and enjoy the absence of spurious local minima. Compared with naïvely solving the convex training formulation with general-purpose solvers, our algorithms have much-improved complexities, making a significant step towards practical convex training.

On the robustness side, we used the robust convex optimization analysis to derive convex programs that train adversarially robust ANNs. Compared with traditional adversarial training methods, including GD-FGSM and GD-PGD, the favorable properties of convex optimization endow convex adversarial training with the following advantages:

- Global convergence to an upper bound: convex adversarial training provably converges to an upper bound on the globally optimal cost, offering superior interpretability.
- Guaranteed adversarial robustness on training data: as shown in Theorem 4.3, the inner maximization of the robust loss function is solved exactly.
- Hyperparameter-free: Algorithm 4 can automatically determine its step size with line search, not requiring any preset parameters.
- Immune to vanishing/exploding gradients: the convex training method avoids this problem entirely because it does not rely on back-propagation.

Overall, our analysis makes it easier and more efficient to train interpretable and robust ANNs with global convergence guarantees, facilitating safety-critical ANN applications.

References

[1] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd, A rewriting system for convex optimization problems, Journal of Control and Decision, 5 (2018), pp. 42–60.
[2] B. G. Anderson, Z. Ma, J. Li, and S. Sojoudi, Tightened convex relaxations for neural network robustness certification, in IEEE Conference on Decision and Control, 2020.
[3] B. G. Anderson and S. Sojoudi, Certified robustness via locally biased randomized smoothing, in Learning for Dynamics and Control Conference, 2022.
[4] B. G. Anderson and S. Sojoudi, Data-driven certification of neural networks with random input noise, IEEE Transactions on Control of Network Systems, (2022).
[5] M. ApS, The MOSEK optimization toolbox for MATLAB manual, Version 9.0, 2019.
[6] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee, Understanding deep neural networks with rectified linear units, in International Conference on Learning Representations, 2018.
[7] A. Athalye, N. Carlini, and D. Wagner, Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples, in International Conference on Machine Learning, 2018.
[8] F. Bach, Breaking the curse of dimensionality with convex neural networks, Journal of Machine Learning Research, 18 (2017), pp. 1–53.
[9] Y. Bai, B. G. Anderson, A. Kim, and S. Sojoudi, Improving the accuracy-robustness trade-off of classifiers via adaptive smoothing, arXiv preprint arXiv:2301.12554, (2023).
[10] Y. Bai, T. Gautam, Y. Gai, and S. Sojoudi, Practical convex formulation of robust one-hidden-layer neural network training, in American Control Conference, 2022.
[11] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
[12] E. Belilovsky, M. Eickenberg, and E. Oyallon, Greedy layerwise learning can scale to ImageNet, in International Conference on Machine Learning, 2019.
[13] Y. Bengio, N. Roux, P. Vincent, O. Delalleau, and P. Marcotte, Convex neural networks, in Annual Conference on Neural Information Processing Systems, 2006.
[14] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3 (2011), pp. 1–122.
[15] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge University Press, 2004.
[16] A. Brutzkus and A. Globerson, Globally optimal gradient descent for a ConvNet with Gaussian inputs, in International Conference on Machine Learning, 2017.
[17] G. Calafiore and M. C. Campi, Uncertain convex programs: randomized solutions and confidence levels, Mathematical Programming, 102 (2005), pp. 25–46.
[18] M. C. Campi, S. Garatti, and M. Prandini, The scenario approach for systems and control design, Annual Reviews in Control, 33 (2009), pp. 149–157.
[19] J. Cohen, E. Rosenfeld, and Z. Kolter, Certified adversarial robustness via randomized smoothing, in International Conference on Machine Learning, 2019.
[20] S. Diamond and S. Boyd, CVXPY: A Python-embedded modeling language for convex optimization, Journal of Machine Learning Research, 17 (2016), pp. 1–5.
[21] S. S. Du, X. Zhai, B. Poczos, and A. Singh, Gradient descent provably optimizes over-parameterized neural networks, in International Conference on Learning Representations, 2019.
[22] D. Dua and C. Graff, UCI machine learning repository, 2017.
[23] J. Eckstein and W. Yao, Approximate ADMM algorithms derived from Lagrangian splitting, Computational Optimization and Applications, 68 (2017), pp. 363–405.
[24] T. Ergen and M. Pilanci, Global optimality beyond two layers: Training deep ReLU networks via convex programs, in International Conference on Machine Learning, 2021.
[25] T. Ergen and M. Pilanci, Implicit convex regularizers of CNN architectures: Convex optimization of two- and three-layer networks in polynomial time, in International Conference on Learning Representations, 2021.
[26] T. Ergen and M. Pilanci, Path regularization: A convexity and sparsity inducing regularization for parallel ReLU networks, in Advances in Neural Information Processing Systems, 2023, pp. 59761–59786.
[27] T. Ergen, A. Sahiner, B. Ozturkler, J. M. Pauly, M. Mardani, and M. Pilanci, Demystifying batch normalization in ReLU networks: Equivalent convex optimization models and implicit regularization, in International Conference on Learning Representations, 2022.
[28] C. Gallicchio and S. Scardapane, Deep randomized neural networks, 2020.
[29] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and harnessing adversarial examples, in International Conference on Learning Representations, 2015.
[30] M. Grant and S. Boyd, CVX: Matlab software for disciplined convex programming, version 2.1, 2014.
[31] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in IEEE International Conference on Computer Vision, 2015.
[32] M. R. Hestenes, Multiplier and gradient methods, Journal of Optimization Theory and Applications, 4 (1969), pp. 303–320.
[33] M. Hong and Z. Luo, On the linear convergence of the alternating direction method of multipliers, Mathematical Programming, 162 (2017), pp. 165–199.
[34] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in IEEE International Joint Conference on Neural Networks, 2004.
[35] R. Huang, B. Xu, D. Schuurmans, and C. Szepesvári, Learning with a strong adversary, arXiv preprint arXiv:1511.03034, (2015).
[36] S. H. Huang, N. Papernot, I. J. Goodfellow, Y. Duan, and P. Abbeel, Adversarial attacks on neural network policies, in International Conference on Learning Representations, 2017.
[37] B. Igelnik and Y. Pao, Stochastic choice of basis functions in adaptive function approximation and the functional-link net, IEEE Transactions on Neural Networks, 6 (1995), pp. 1320–1329.
[38] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in International Conference on Machine Learning, 2015.
[39] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, in International Conference on Learning Representations, 2015.
[40] A. Krizhevsky, Learning multiple layers of features from tiny images, https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf, 2009.
[41] A. Kurakin, I. J. Goodfellow, and S. Bengio, Adversarial machine learning at scale, in International Conference on Learning Representations, 2017.
[42] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324.
[43] Z. Lu and L. Xiao, On the complexity analysis of randomized block-coordinate descent methods, Mathematical Programming, 152 (2015), pp. 615–642.
[44] Z. Ma and S. Sojoudi, A sequential framework towards an exact SDP verification of neural networks, in International Conference on Data Science and Advanced Analytics, 2021.
[45] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, Towards deep learning models resistant to adversarial attacks, in International Conference on Learning Representations, 2018.
[46] A. Mishkin, A. Sahiner, and M. Pilanci, Fast convex optimization for two-layer ReLU networks: Equivalent model classes and cone decompositions, in International Conference on Machine Learning, 2022, pp. 15770–15816.
[47] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, DeepFool: A simple and accurate method to fool deep neural networks, in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[48] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in Advances in Neural Information Processing Systems, 2019.
[49] M. Pilanci and T. Ergen, Neural networks are convex regularizers: Exact polynomial-time convex optimization formulations for two-layer networks, in International Conference on Machine Learning, 2020.
[50] L. Prechelt, Early stopping – but when?, in Neural Networks: Tricks of the Trade – Second Edition, vol. 7700, 2012, pp. 53–67.
[51] A. Raghunathan, J. Steinhardt, and P. Liang, Certified defenses against adversarial examples, in International Conference on Learning Representations, 2018.
[52] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature, 323 (1986), pp. 533–536.
[53] A. Sahiner, T. Ergen, J. M. Pauly, and M. Pilanci, Vector-output ReLU neural network problems are copositive programs: Convex analysis of two layer networks and polynomial-time algorithms, in International Conference on Learning Representations, 2021.
[54] M. Sion, On general minimax theorems, Pacific Journal of Mathematics, 8 (1958), pp. 171–176.
[55] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, Intriguing properties of neural networks, in International Conference on Learning Representations, 2014.
[56] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein, Training neural networks without gradients: A scalable ADMM approach, in International Conference on Machine Learning, 2016.
[57] J. Wang, F. Yu, X. Chen, and L. Zhao, ADMM for efficient deep learning with global convergence, in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
[58] Y. Wang, J. Lacotte, and M. Pilanci, The hidden convex optimization landscape of regularized two-layer ReLU networks: an exact characterization of optimal solutions, in International Conference on Learning Representations, 2022.
[59] E. Wong and Z. Kolter, Provable defenses against adversarial examples via the convex outer adversarial polytope, in International Conference on Machine Learning, 2018.
[60] H. Xiao, K. Rasul, and R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747, (2017).

Appendix A: Extending the ADMM Approach to More Sophisticated ReLU Networks

Since the emergence of convex training, convex formulations have been developed to train various types of neural networks. Most formulations share the structure

(25)  min_{𝐰, 𝐰′}  ℓ(𝐅(𝐰 − 𝐰′), 𝑦) + 𝛽(∥𝐰∥₂,₁ + ∥𝐰′∥₂,₁)  s.t.  𝐆𝐰 ≥ 0, 𝐆𝐰′ ≥ 0,

where 𝐅 and 𝐆 are matrices formed by the training data matrix 𝑋 and those matrices that represent all possible ReLU activation patterns, ∥ ⋅ ∥ 2 , 1 denotes the norm that is a mixture of the ℓ 2 norm and the ℓ 1 norm under some partition scheme, and 𝐰 and 𝐰 ′ are the optimization variables from which the neural network weights can be recovered.

Algorithm 2 can be extended to all convex training formulations with this structure by first reforming the problem into the equality-constrained form

(26)  min_{𝐮, 𝐮′, 𝐰, 𝐰′, 𝐬, 𝐬′}  ℓ(𝐅(𝐮 − 𝐮′), 𝑦) + 𝛽(∥𝐰∥₂,₁ + ∥𝐰′∥₂,₁) + 𝕀≥0(𝐬) + 𝕀≥0(𝐬′)  s.t.  𝐮 = 𝐰,  𝐮′ = 𝐰′,  𝐆𝐮 = 𝐬,  𝐆𝐮′ = 𝐬′,

and constructing the augmented Lagrangian

𝐿(𝐮, 𝐮′, 𝐰, 𝐰′, 𝐬, 𝐬′, 𝜆, 𝜆′, 𝜈, 𝜈′) := ℓ(𝐅(𝐮 − 𝐮′), 𝑦) + 𝛽(∥𝐰∥₂,₁ + ∥𝐰′∥₂,₁) + 𝕀≥0(𝐬) + 𝕀≥0(𝐬′) + (𝜌/2)(∥𝐮 − 𝐰 + 𝜆∥₂² + ∥𝐮′ − 𝐰′ + 𝜆′∥₂² + ∥𝐆𝐮 − 𝐬 + 𝜈∥₂² + ∥𝐆𝐮′ − 𝐬′ + 𝜈′∥₂² − ∥𝜆∥₂² − ∥𝜆′∥₂² − ∥𝜈∥₂² − ∥𝜈′∥₂²),

where (𝜆, 𝜆′) and (𝜈, 𝜈′) are again dual variables and 𝜌 > 0 is a fixed penalty parameter. Minimizing over (𝐰, 𝐰′), (𝐮, 𝐮′), and (𝐬, 𝐬′) separately in an alternating manner and performing dual updates on (𝜆, 𝜆′) and (𝜈, 𝜈′) gives us an ADMM algorithm that tackles Eq. 25.
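To make the alternating structure concrete, here is a minimal numpy sketch of one such ADMM scheme (our illustration, not Algorithm 2 itself) for the simplified single-block problem min_𝐰 ½∥𝐅𝐰 − 𝑦∥₂² + 𝛽∥𝐰∥₂,₁ s.t. 𝐆𝐰 ≥ 0, with the primed variables dropped for brevity and all names hypothetical. The 𝐰-update is a group soft-threshold, the 𝐬-update is a projection onto the nonnegative orthant, and the 𝐮-update solves a linear system whose matrix is factorized once, mirroring the one-time Cholesky preprocessing discussed in Section 5.

```python
import numpy as np

def admm_sketch(F, G, y, beta=0.1, rho=1.0, group=1, iters=500):
    """ADMM for min_w 0.5*||F w - y||^2 + beta*||w||_{2,1}
    s.t. G w >= 0, split as u = w, G u = s, s >= 0 (scaled duals lam, nu)."""
    d = F.shape[1]
    m = G.shape[0]
    u, w, lam = np.zeros(d), np.zeros(d), np.zeros(d)
    s, nu = np.zeros(m), np.zeros(m)
    # One-time Cholesky factorization of the u-update system matrix.
    M = F.T @ F + rho * (np.eye(d) + G.T @ G)
    L = np.linalg.cholesky(M)
    for _ in range(iters):
        # u-update: quadratic minimization via the cached factor.
        rhs = F.T @ y + rho * (w - lam) + rho * G.T @ (s - nu)
        u = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # w-update: group soft-thresholding of u + lam, threshold beta/rho.
        w = np.zeros(d)
        for g in range(0, d, group):
            v = (u + lam)[g:g + group]
            nv = np.linalg.norm(v)
            if nv > beta / rho:
                w[g:g + group] = (1.0 - beta / (rho * nv)) * v
        # s-update: element-wise projection onto the nonnegative orthant.
        s = np.maximum(G @ u + nu, 0.0)
        # Dual ascent on the scaled multipliers.
        lam = lam + u - w
        nu = nu + G @ u - s
    return w
```

Each iteration costs only matrix-vector products plus two triangular solves, which is why running more iterations barely increases the training time once the factorization is cached.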

A.1 Two-Hidden-Layer Sub-Networks

We now discuss extending our methods to deeper and more practical ANN architectures. In [24], the authors have shown that training multiple two-hidden-layer ReLU sub-networks with a weight decay regularization is equivalent to solving a higher-dimensional convex problem with sparsity induced by group ℓ 1 regularization.

Consider an architecture with 𝐾 parallel sub-networks, each of which is a two-hidden-layer ReLU network. The neural network output can be parameterized as ∑_{𝑘=1}^{𝐾} ((𝑋𝐰₁ₖ)₊𝐰₂ₖ)₊𝑤₃ₖ, where 𝐰₁ₖ ∈ ℝ^{𝑚₀×𝑚₁}, 𝐰₂ₖ ∈ ℝ^{𝑚₁×𝑚₂}, and 𝑤₃ₖ ∈ ℝ^{𝑚₃} are the hidden and output layer weights of the 𝑘th sub-network. Note that 𝑚₀ = 𝑑, whereas 𝑚₁ and 𝑚₂ denote the numbers of neurons in the first and the second hidden layers. The regularized training problem is formalized as

(27)  min_𝜃  ℓ(∑_{𝑘=1}^{𝐾} ((𝑋𝐰₁ₖ)₊𝐰₂ₖ)₊𝑤₃ₖ, 𝑦) + (𝛽/2) ∑_{𝑘=1}^{𝐾} (∥𝐰₂ₖ∥_𝐹² + ∥𝑤₃ₖ∥₂²),

where 𝛽 > 0 is a regularization parameter. In [24], it has been shown that the non-convex training problem (27) can be equivalently stated as the following convex problem:

min_{𝐰, 𝐰′}  ℓ(𝑋̃(𝐰 − 𝐰′), 𝑦) + 𝛽(∥𝐰∥₂,₁ + ∥𝐰′∥₂,₁)

(28)  s.t.  vec([(2𝐃₁ᵢⱼ − 𝐼ₙ)𝑋; (2𝐃₂ₗ − 𝐼ₙ)𝐃₁ᵢⱼ𝑋][𝐰ᵢⱼₗ⁺; 𝐰ᵢⱼₗ⁻]) ≥ 0,  ∀𝑖 ∈ [𝑃₁], 𝑗 ∈ [𝑚₁], 𝑙 ∈ [𝑃₂],
vec([(2𝐃₁ᵢⱼ − 𝐼ₙ)𝑋; (2𝐃₂ₗ − 𝐼ₙ)𝐃₁ᵢⱼ𝑋][𝐰ᵢⱼₗ⁺′; 𝐰ᵢⱼₗ⁻′]) ≥ 0,  ∀𝑖 ∈ [𝑃₁], 𝑗 ∈ [𝑚₁], 𝑙 ∈ [𝑃₂],

where:

- the vectors 𝐰, 𝐰′ ∈ ℝ^{2𝑑𝑚₁𝑃₁𝑃₂} are constructed by concatenating {𝐰ᵢⱼₗ±} and {𝐰ᵢⱼₗ±′} over all 𝑖 ∈ [𝑃₁], 𝑗 ∈ [𝑚₁], 𝑙 ∈ [𝑃₂], and both signs ±, respectively;
- considering all 𝐰̄ ∈ ℝ^𝑑, 𝐰₁ ∈ ℝ^{𝑑×𝑚₁}, and 𝐰₂ ∈ ℝ^{𝑚₁}, 𝑃₁ denotes the total number of possible sign patterns of 𝑋𝐰̄, and 𝑃₂ denotes the number of possible sign patterns of (𝑋𝐰₁)₊𝐰₂;
- the fixed diagonal binary mask matrices 𝐃₁ᵢⱼ ∈ ℝ^{𝑛×𝑛} and 𝐃₂ₗ ∈ ℝ^{𝑛×𝑛} with 𝑖 ∈ [𝑃₁], 𝑗 ∈ [𝑚₁], 𝑙 ∈ [𝑃₂] encode all possible ReLU activation patterns;
- for a vector 𝐮 ∈ ℝ^{𝑑𝑃}, the notation ∥𝐮∥₂,₁ := ∑_{𝑖=1}^{𝑃} ∥𝐮ᵢ∥₂ denotes the 𝑑-dimensional group norm, with 𝐮ᵢ being the 𝑖th 𝑑-dimensional partition of 𝐮;
- 𝑋̃ₛ is defined as [𝐃₂₁𝐃₁₁₁𝑋  ⋯  𝐃₂ₗ𝐃₁ᵢⱼ𝑋  ⋯  𝐃₂𝑃₂𝐃₁𝑃₁𝑚₁𝑋], and 𝑋̃ is defined as the block-diagonal matrix [𝑋̃ₛ, 0; 0, 𝑋̃ₛ].
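The group norm used above can be sketched in a few lines (an illustration of the definition, with a hypothetical function name):

```python
import numpy as np

def group_norm_2_1(u, d):
    """||u||_{2,1} = sum_i ||u_i||_2, where u is split into
    consecutive d-dimensional partitions u_1, ..., u_P."""
    blocks = np.asarray(u, dtype=float).reshape(-1, d)
    return float(np.linalg.norm(blocks, axis=1).sum())
```

Its proximal operator, group soft-thresholding, is precisely what keeps the 𝐰-update of the ADMM scheme cheap.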

We observe that both the objective function and the constraint set of Section A.1 follow the same structure as Eq. 25, namely, the objective consists of a convex loss with ℓ₁-ℓ₂ group regularization, and the feasible set is defined by linear inequality constraints. Specifically, Section A.1 can be represented in the equality-constrained form below:

min_{𝐰, 𝐰′, 𝐮, 𝐮′, (𝐬, 𝐬′)ᵢⱼₗ}  ℓ(𝑋̃(𝐮 − 𝐮′), 𝑦) + 𝛽(∥𝐰∥₂,₁ + ∥𝐰′∥₂,₁) + ∑_{𝑖,𝑗,𝑙} (𝕀≥0(𝐬ᵢⱼₗ) + 𝕀≥0(𝐬ᵢⱼₗ′))

s.t.  𝐮 = 𝐰,  𝐮′ = 𝐰′,

(29)  𝐬ᵢⱼₗ = vec([(2𝐃₁ᵢⱼ − 𝐼ₙ)𝑋; (2𝐃₂ₗ − 𝐼ₙ)𝐃₁ᵢⱼ𝑋][𝐰ᵢⱼₗ⁺; 𝐰ᵢⱼₗ⁻]),  ∀𝑖 ∈ [𝑃₁], 𝑗 ∈ [𝑚₁], 𝑙 ∈ [𝑃₂],

𝐬ᵢⱼₗ′ = vec([(2𝐃₁ᵢⱼ − 𝐼ₙ)𝑋; (2𝐃₂ₗ − 𝐼ₙ)𝐃₁ᵢⱼ𝑋][𝐰ᵢⱼₗ⁺′; 𝐰ᵢⱼₗ⁻′]),  ∀𝑖 ∈ [𝑃₁], 𝑗 ∈ [𝑚₁], 𝑙 ∈ [𝑃₂],

which is a special case of Eq. 26. The ADMM algorithm thus extends to Section A.1, the convex training problem for architectures consisting of parallel two-hidden-layer ReLU networks.

The work [26] has similarly analyzed three-layer ReLU networks, but considers an alternative regularization technique — path regularization. Since the convex training formulation with path regularization also follows the structure of Eq. 25, our ADMM algorithm similarly applies.

A.2 One-Hidden-Layer Networks with Batch Normalization

In [27], exact convex representations of weight-decay regularized ReLU networks with (full-batch) batch normalization (BN) have been introduced. While [27] provides discussions on training deeper neural networks with BN, the paper only presents convex training formulations for the one-hidden-layer case. Consider a one-hidden-layer scalar-output ReLU network with the weights 𝐰⁽¹⁾ ∈ ℝ^{𝑚₀×𝑚₁} and 𝐰⁽²⁾ ∈ ℝ^{𝑚₁}, where 𝑚₀ = 𝑑 is the input dimension and 𝑚₁ is the network width. Let 𝑋 ∈ ℝ^{𝑛×𝑑} denote the training data matrix and 𝑦 ∈ ℝ^𝑛 the label vector. The regularized training problem of this network with BN is given by

(30)  min_{𝐰⁽¹⁾, 𝐰⁽²⁾, 𝛾, 𝛼}  ℓ((BN_{𝛾,𝛼}(𝑋𝐰⁽¹⁾))₊𝐰⁽²⁾, 𝑦) + (𝛽/2)(∥𝛾∥₂² + ∥𝛼∥₂² + ∥𝐰⁽¹⁾∥_𝐹² + ∥𝐰⁽²⁾∥₂²),

where ℓ is a convex loss function and BN 𝛾 , 𝛼 ⁢ ( ⋅ ) represents the BN operator associated with a scaling parameter 𝛾 and a shifting parameter 𝛼 [38]. The non-convex training problem Eq. 30 can be equivalently cast as the convex optimization problem

(31)  min_{𝐰ᵢ, 𝐰ᵢ′ ∈ ℝ^{𝑟+1}}  ℓ(∑_{𝑖=1}^{𝑃} 𝐷ᵢ𝐔′(𝐰ᵢ − 𝐰ᵢ′), 𝑦) + 𝛽 ∑_{𝑖=1}^{𝑃} (∥𝐰ᵢ∥₂ + ∥𝐰ᵢ′∥₂)  s.t.  (2𝐷ᵢ − 𝐼ₙ)𝐔′𝐰ᵢ ≥ 0, (2𝐷ᵢ − 𝐼ₙ)𝐔′𝐰ᵢ′ ≥ 0, ∀𝑖 ∈ [𝑃],

where the diagonal matrices 𝐷₁, …, 𝐷_𝑃 represent all ReLU activation patterns associated with 𝑋𝐰 for an arbitrary weight vector 𝐰 ∈ ℝ^𝑑, and 𝑃 denotes the cardinality of the set of all possible 𝐷 matrices. Furthermore, 𝐔 ∈ ℝ^{𝑛×𝑟} and 𝐔′ ∈ ℝ^{𝑛×(𝑟+1)} are computed using the compact singular value decomposition (SVD) of the zero-mean data matrix, where 𝑟 = rank(𝑋). More specifically, (𝐼ₙ − (1/𝑛)𝟏𝟏⊤)𝑋 = 𝐔Σ𝐕⊤ and 𝐔′ = [𝐔, (1/√𝑛)𝟏] [27].

Note that Eq. 31 has the same structure as the convex reformulation of the standard one-hidden-layer ReLU network training problem Eq. 2. The main difference is that in Eq. 31, 𝐔 ′ plays the role of the data matrix 𝑋 . As such, Algorithm 2 and the convex adversarial training analyses extend to the convex training formulation of ReLU networks with BN without modifications.
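The construction of 𝐔′ can be sketched as follows (our illustration; we take the appended column as (1/√𝑛)𝟏, which makes the columns of 𝐔′ orthonormal — this normalization convention is our reading of [27], and the function name is hypothetical):

```python
import numpy as np

def build_U_prime(X):
    """Compact SVD of the zero-mean data matrix:
    (I - (1/n) 1 1^T) X = U Sigma V^T, then U' = [U, (1/sqrt(n)) 1].

    Since the centered matrix's column space is orthogonal to the
    all-ones vector, the columns of U' are orthonormal."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0, keepdims=True)   # same as (I - 1 1^T / n) X
    U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
    r = int(np.sum(sigma > 1e-10))           # numerical rank of X_c
    return np.hstack([U[:, :r], np.ones((n, 1)) / np.sqrt(n)])
```

With 𝐔′ in place of the data matrix 𝑋, the ADMM machinery above applies verbatim.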

Appendix B: SCP-Based Convex Training

While the practical training formulation Eq. 5 and the ADMM algorithm (Algorithm 2) vastly improve the efficiency and the practicality of globally optimizing ANNs, the complexity of the aforementioned methods can still be too high for large-scale machine learning problems due to the complicated structure of Eq. 2. In this section, we propose a “sampled convex program (SCP)”-based alternative approach to approximately globally optimize scalar-output one-hidden-layer ANNs. This approach constructs scalable unconstrained convex optimization problems with simpler structures. Unconstrained convex optimization problems are much easier to numerically solve compared to constrained ones. Scalable and simple first-order methods can be easily applied to unconstrained convex programs, while the same cannot be said for constrained optimization in general due to feasibility issues.

Compared with the ADMM approach in Algorithm 2, the SCP approach is easier to implement and has a lower per-iteration complexity. The trade-off is that while Algorithm 2 can be applied to find the exact global minimum of Eq. 1 (albeit with an exponential complexity with respect to the data matrix rank), the SCP approach only finds an approximate global solution. In the approximate case, the qualities of the ADMM solution and the SCP solution can both be characterized.

B.1 One-Shot Sampling of Hidden-Layer Weights

The paper [49] has shown that the non-convex training formulation Eq. 1 has the same global optimum as

(32)
$$p^\star = \min_{(u_j,\alpha_j)_{j=1}^m} \; \ell\Big(\sum_{j=1}^m (Xu_j)_+ \alpha_j,\, y\Big) + \beta\sum_{j=1}^m |\alpha_j| \quad \mathrm{s.t.}\;\; \lVert u_j\rVert_2 \leq 1, \;\forall j\in[m].$$

Note that we can replace the constraint set $\{u \mid \lVert u\rVert_2 \le 1\}$ with $\{u \mid \lVert u\rVert_2 = 1\}$ without changing the optimum. This is because for any pair $(u_j,\alpha_j)$ such that $\lVert u_j\rVert_2 < 1$, replacing $(u_j,\alpha_j)$ with the scaled weights $\big(\frac{u_j}{\lVert u_j\rVert_2},\, \lVert u_j\rVert_2\cdot\alpha_j\big)$ reduces the regularization term of Eq. 32 while keeping the loss term unchanged. Therefore, the optimal $u_j^\star$ must satisfy $\lVert u_j^\star\rVert_2 = 1$.

To approximate the semi-infinite program Eq. 32, we randomly sample a total of $N$ vectors, namely $u_1,\dots,u_N$, uniformly on the unit $\ell_2$ norm sphere $\mathcal{S}^{d-1}$. It is well known that such a procedure can be performed by sampling $\hat{u}_i \sim \mathcal{N}(0, I_d)$ for all $i\in[N]$ and projecting each $\hat{u}_i$ onto the unit $\ell_2$ norm sphere via $u_i = \hat{u}_i / \lVert\hat{u}_i\rVert_2$. Next, $u_1,\dots,u_N$ are used to construct the following SCP:

(33)
$$p_{s3}^\star = \min_{(\alpha_i)_{i=1}^N} \; \ell\Big(\sum_{i=1}^N (Xu_i)_+ \alpha_i,\, y\Big) + \beta\sum_{i=1}^N |\alpha_i|,$$

where the sampled hidden-layer weights $(u_i)_{i=1}^N$ are fixed.

The finite-dimensional unconstrained convex formulation Eq. 33 is a relaxation of Eq. 32, and can be used as a surrogate for the optimization problem Eq. 1 to approximately globally optimize one-hidden-layer ANNs. The formulation Eq. 33 optimizes the ANN's output layer while freezing the hidden layer. When the squared loss $\ell(\hat{y}, y) = \frac{1}{2}\lVert\hat{y}-y\rVert_2^2$ is considered, Eq. 33 is a Lasso regression problem. Intuitively, the sampled hidden-layer weights map the training data points into a higher-dimensional space. While some of the sampled weights will inevitably be far from the optimal weights for the ANN, the $\ell_1$ regularization term promotes sparsity, encouraging zero output weights that "disable" the suboptimal hidden neurons.
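To make the construction concrete, the following sketch (our illustration, not the paper's code) samples hidden-layer weights uniformly on the unit sphere and forms the feature matrix whose Lasso fit corresponds to Eq. 33 under the squared loss.

```python
import numpy as np

def sample_relu_features(X, N, rng):
    """Map data through N randomly sampled unit-norm hidden weights.

    Returns H = [(X u_1)_+ ... (X u_N)_+] in R^{n x N} together with the
    sampled weights U (rows u_i). With the squared loss, Eq. 33 becomes the
    Lasso problem min_a 0.5*||H a - y||_2^2 + beta*||a||_1.
    """
    d = X.shape[1]
    U_hat = rng.standard_normal((N, d))
    U = U_hat / np.linalg.norm(U_hat, axis=1, keepdims=True)  # project to sphere
    return np.maximum(X @ U.T, 0.0), U
```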

The SCP training formulation Eq. 33 recovers the training problems of one-hidden-layer random vector functional link (RVFL) [37] and ELM. Such an equivalence shows that training an ELM is a convex relaxation of ANN training. Compared with traditional ELMs, Eq. 33 contains a sparsity-promoting regularization, and requires a different initialization of the untrained hidden layer weights, providing insights into the implicit sparsity-seeking property of ANNs.

The method in this subsection is referred to as “one-shot sampling” because all hidden layer weights are sampled in advance, in contrast with the iterative sampling procedure described in Section B.2. The ANNs trained with Eq. 33 can be suboptimal in terms of empirical loss compared with the network that globally minimizes the non-convex cost function, but are expected to be close to the optimal classifier. The next theorem characterizes the level of suboptimality of the SCP optimizer, with the proof provided in Section F.3.

Theorem B.1.

Suppose that an additional hidden neuron $u_{N+1}$ is uniformly sampled on the unit Euclidean norm sphere to augment the ANN. Consider the following formulation to train the augmented network:

(34)
$$p_{s4}^\star = \min_{(\alpha_i)_{i=1}^{N+1}} \; \ell\Big(\sum_{i=1}^{N+1} (Xu_i)_+ \alpha_i,\, y\Big) + \beta\sum_{i=1}^{N+1} |\alpha_i|.$$

It holds that $p_{s4}^\star \le p_{s3}^\star$. Furthermore, if $N \ge \min\big\{\frac{n+1}{\psi\xi} - 1,\; \frac{2}{\xi}(n+1-\log\psi)\big\}$, where $\psi$ and $\xi$ are preset confidence-level constants between 0 and 1, then with probability no smaller than $1-\xi$, it holds that $\mathbb{P}\{p_{s4}^\star < p_{s3}^\star\} \le \psi$.

Intuitively, this bound means that uniformly sampling another hidden layer weight 𝑢 𝑁 + 1 on the unit norm sphere will not improve the training loss with high probability. For a fixed level of suboptimality, the required scale of the SCP formulation Eq. 33 has a linear relationship with respect to the number of training data points. Somewhat surprisingly, from the perspective of the probabilistic optimality, the bound provided by Theorem B.1 is the same as the bound associated with Algorithm 1 presented in Theorem 2.2, because both bounds are obtained via the SCP analysis framework.

The main advantage of the SCP-based training approach is that the unconstrained optimization Eq. 33 is much easier and faster to solve than the constrained optimization Eq. 5. The iterative soft-thresholding algorithm (ISTA) [11] and its accelerated or stochastic variants can be readily applied to solve Eq. 33. Specifically, ISTA converges at a linear rate if $\ell\big(\sum_{i=1}^N (Xu_i)_+\alpha_i, y\big)$ is strongly convex in each $\alpha_i$, and converges at a $\mathcal{O}(1/T)$ rate in weakly convex cases, where $T$ is the iteration count. As a result, with the same amount of computational resources, one can solve Eq. 33 with $N \gg P_s$, allowing for training wider networks (with stronger representation power) within a reasonable amount of time. Numerical experiments in Section C.2 verify that the SCP relaxation Eq. 33 can train larger-scale classifiers with a reasonable computing effort.

When $\ell(\cdot)$ is the squared loss, the SCP formulation Eq. 33 evaluates to $\min_\alpha \frac{1}{2}\lVert H\alpha - y\rVert_2^2 + \beta\lVert\alpha\rVert_1$, where $H = [(Xu_1)_+ \;\cdots\; (Xu_N)_+] \in \mathbb{R}^{n\times N}$ and $\alpha = (\alpha_1,\dots,\alpha_N) \in \mathbb{R}^N$. The ISTA update is then $\alpha^+ = \mathrm{prox}_{\gamma_s\beta\lVert\cdot\rVert_1}\big(\alpha - \gamma_s H^\top H\alpha + \gamma_s H^\top y\big)$, where $\mathrm{prox}_{\gamma_s\beta\lVert\cdot\rVert_1}(\cdot)$ evaluates elementwise to $\mathrm{sgn}(\cdot)\max(|\cdot| - \gamma_s\beta,\, 0)$, $\alpha^+$ denotes the updated $\alpha$ at each iteration, and $\gamma_s$ is a step size that can be determined with backtracking line search. Since $H^\top H$ and $H^\top y$ are fixed and only need to be calculated once, the per-iteration complexity is $\mathcal{O}(N^2)$. Since $N$ is linear in $n$ for a fixed solution quality (see Theorem B.1), the per-iteration complexity amounts to $\mathcal{O}(n^2)$, and the overall complexity amounts to $\mathcal{O}(n^2\log(1/\epsilon_a))$ and $\mathcal{O}(n^2/\epsilon_a)$ for strongly and weakly convex loss functions, respectively, where $\epsilon_a$ is the desired optimization precision.
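The ISTA recursion above can be sketched as follows. This is an illustrative implementation; for simplicity it uses a fixed step size $\gamma_s = 1/\lambda_{\max}(H^\top H)$ in place of backtracking line search.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau*||.||_1: sgn(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista_lasso(H, y, beta, num_iters=500):
    """ISTA for min_a 0.5*||H a - y||_2^2 + beta*||a||_1.

    H^T H and H^T y are precomputed once, so each iteration costs O(N^2).
    """
    G, b = H.T @ H, H.T @ y
    L = np.linalg.eigvalsh(G).max()          # Lipschitz constant of the gradient
    gamma = 1.0 / max(L, 1e-12)              # fixed step size (no line search)
    alpha = np.zeros(H.shape[1])
    for _ in range(num_iters):
        alpha = soft_threshold(alpha - gamma * (G @ alpha - b), gamma * beta)
    return alpha
```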

Theorem 2.2 also implies that when the neural network is wide, the hidden layer weights are less important than the output layer weights. The role of the hidden layers is to map the data to features in higher-dimensional spaces, facilitating the output layer to extract the most important information.

B.2 Iterative Sampling of Hidden-Layer Weights

While the efficacy of SCP-based convex training with a one-shot sampling of the hidden layer neurons can be proved theoretically and experimentally, the probabilistic optimality bound provided in Theorem B.1 may be too conservative in some cases. To provide a more accurate and robust estimation of the level of suboptimality of the SCP relaxation Eq. 33, we propose a scheme (Algorithm 5) that iteratively samples hidden layer neurons used in Eq. 33 to train classifiers.

The convex semi-infinite training formulation Eq. 32 has the dual problem [49, Appendix A.4]

(35)
$$d^\star = \max_{v\in\mathbb{R}^n} -\ell^*(v) \quad \mathrm{s.t.}\;\; |v^\top (Xu)_+| \le \beta, \;\forall u: \lVert u\rVert_2 \le 1,$$

where $\ell^*(\cdot)$ is the Fenchel conjugate function defined as $\ell^*(v) = \max_z z^\top v - \ell(z, y)$. When $m \ge m^*$, where $m^*$ is upper-bounded by $n+1$, strong duality holds: $p^\star = d^\star$. Moreover, the dual problem Eq. 35 is a convex semi-infinite problem, which is a category of uncertain convex programs (UCP) [17].

We then use the sampled vectors $u_1,\dots,u_N$ to construct the following SCP that approximates the UCP Eq. 35:

(36)
$$d_{s3}^\star = \max_{v\in\mathbb{R}^n} -\ell^*(v) \quad \mathrm{s.t.}\;\; |v^\top (Xu_i)_+| \le \beta, \;\forall i\in[N].$$

Similarly, strong duality holds between Eq. 36 and Eq. 33, i.e., $p_{s3}^\star = d_{s3}^\star$. The level of suboptimality of the dual solution $v^\star$ of Eq. 36 can be easily verified by checking the feasibility of $v^\star$ for the UCP Eq. 35.

While it is easier to check the quality of the dual solution, it is desirable to solve the primal problem Eq. 33 because the primal is unconstrained and thus easier to solve. Suppose that $(\alpha_i^\star)_{i=1}^N$ is a solution to Eq. 33. By following the procedure described in Section F.4, one can recover the optimal dual variable $v^\star$ from $(\alpha_i^\star)_{i=1}^N$ by exploiting the strong duality between Eq. 33 and Eq. 36. Next, we independently sample another set of $N_1$ hidden-layer weights $(u_i^1)_{i=1}^{N_1} \sim \mathrm{Unif}(\mathcal{S}^{d-1})$ and check whether $|v^{\star\top}(Xu_i^1)_+| \le \beta$ for each $i\in[N_1]$. If $|v^{\star\top}(Xu_i^1)_+| > \beta$ for a particular $i$, then adding $u_i^1$ to the sampled constraint set of Eq. 36 will change (reduce) the value of $d_{s3}^\star$ and thereby reduce the relaxation gap between $p_{s3}^\star$ and $p^\star$. In other words, by incorporating $u_i^1$ as another hidden-layer node, the considered ANN can be improved.

Define the notations

$$Z_i := \begin{cases} 1 & \text{if } |v^{\star\top}(Xu_i^1)_+| > \beta, \\ 0 & \text{otherwise,} \end{cases} \quad \forall i\in[N_1], \qquad \bar{Z} := \frac{1}{N_1}\sum_{i=1}^{N_1} Z_i,$$

and

$$\theta := \mathbb{E}_{u\sim\mathrm{Unif}(\mathcal{S}^{d-1})}[Z_i] = \mathbb{P}_{u\sim\mathrm{Unif}(\mathcal{S}^{d-1})}\big[|v^{\star\top}(Xu)_+| > \beta\big].$$

By Hoeffding's inequality, it holds that $\mathbb{P}(\theta - \bar{Z} \ge t) \le \exp(-2N_1 t^2)$. Therefore, with probability at least $1-\xi$, it holds that $\theta \le \bar{Z} + \sqrt{\frac{\log(1/\xi)}{2N_1}}$, where $\xi\in(0,1]$. In other words, by evaluating the feasibility of the additional set of hidden-layer weights $u_1^1,\dots,u_{N_1}^1$, one can obtain a probabilistic bound on the level of suboptimality of the solution to Eq. 36 constructed with $u_1,\dots,u_N$: as long as $\bar{Z} + \sqrt{\frac{\log(1/\xi)}{2N_1}} \le \psi$ for a constant $\psi\in(0,1]$, it holds that $\theta \le \psi$ with probability at least $1-\xi$.
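The feasibility-check certificate can be sketched as below (our illustration). It draws $N_1$ fresh unit-norm weights, computes the empirical violation rate $\bar{Z}$, and returns the Hoeffding-based upper bound on $\theta$.

```python
import numpy as np

def suboptimality_certificate(v_star, X, beta, N1, xi, rng):
    """Estimate theta = P(|v*^T (X u)_+| > beta) for u ~ Unif(S^{d-1}).

    Returns the empirical violation rate Z_bar over N1 fresh samples and the
    one-sided Hoeffding bound Z_bar + sqrt(log(1/xi) / (2 N1)), which upper
    bounds theta with probability at least 1 - xi.
    """
    d = X.shape[1]
    U_hat = rng.standard_normal((N1, d))
    U = U_hat / np.linalg.norm(U_hat, axis=1, keepdims=True)
    Z = np.abs(np.maximum(X @ U.T, 0.0).T @ v_star) > beta   # indicators Z_i
    Z_bar = Z.mean()
    return Z_bar, Z_bar + np.sqrt(np.log(1.0 / xi) / (2.0 * N1))
```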

We now introduce a scheme (Algorithm 5) for training scalar-output fully connected ReLU ANNs to an arbitrary degree of suboptimality by repeating the evaluation and sampling procedure. Let $T$ denote the total number of iterations of Algorithm 5, $U_t$ denote the total number of hidden-layer neurons at iteration $t$, and $N_t$ denote the number of hidden-layer neurons sampled at iteration $t$. In light of Theorem B.1, the solution $(\alpha_i^\star)_{i=1}^{U_T}$ yielded by Algorithm 5 satisfies the following property with probability at least $1-\xi$: if an additional vector $\tilde{u}$ is uniformly sampled on the unit Euclidean norm sphere $\mathcal{S}^{d-1}$, then adding $\tilde{u}$ to the set of hidden-layer weights used in Eq. 33 will not improve the training loss of the ANN, with probability at least $1-\psi$.
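Before stating the procedure formally as Algorithm 5, here is a minimal sketch of the full loop, assuming the squared loss (so the dual variable is the residual $v^t = y - \sum_i (Xu_i^t)_+\alpha_i^t$) and plain ISTA for the inner Lasso subproblem; the constants `N0`, `max_outer`, and `ista_iters` are illustrative choices, not values from the paper.

```python
import numpy as np

def iterative_sampling_train(X, y, beta, N0=50, psi=0.1, xi=0.05,
                             max_outer=20, ista_iters=300, seed=0):
    """Sketch of the iterative sampling scheme (squared loss)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]

    def sample_sphere(N):
        U_hat = rng.standard_normal((N, d))
        return U_hat / np.linalg.norm(U_hat, axis=1, keepdims=True)

    def lasso_ista(H):
        G, b = H.T @ H, H.T @ y
        gamma = 1.0 / max(np.linalg.eigvalsh(G).max(), 1e-12)
        a = np.zeros(H.shape[1])
        for _ in range(ista_iters):
            a = a - gamma * (G @ a - b)
            a = np.sign(a) * np.maximum(np.abs(a) - gamma * beta, 0.0)  # prox
        return a

    U = sample_sphere(N0)                          # initial hidden weights
    for _ in range(max_outer):
        H = np.maximum(X @ U.T, 0.0)
        alpha = lasso_ista(H)                      # solve Eq. 33
        v = y - H @ alpha                          # dual variable (squared loss)
        Nt = U.shape[0]
        U_new = sample_sphere(Nt)                  # fresh candidate weights
        viol = np.abs(np.maximum(X @ U_new.T, 0.0).T @ v) > beta
        # stop once the violation rate certifies psi-suboptimality
        if viol.mean() + np.sqrt(np.log(1.0 / xi) / (2 * Nt)) <= psi:
            break
        U = np.vstack([U, U_new[viol]])            # keep only the violators
    H = np.maximum(X @ U.T, 0.0)
    alpha = lasso_ista(H)                          # final fit on all kept weights
    return U, alpha
```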

Algorithm 5: Convex ANN training based on iteratively sampling hidden-layer weights

1: Let $t = 0$; sample $\hat{u}_1^0,\dots,\hat{u}_{N_0}^0 \sim \mathcal{N}(0, I_d)$ i.i.d., and let $u_i^0 = \hat{u}_i^0 / \lVert\hat{u}_i^0\rVert_2$ for all $i\in[N_0]$.
2: Construct $\mathcal{U}^0 := \{u_1^0,\dots,u_{N_0}^0\}$; let $U_0 = N_0$.
3: repeat
4: &nbsp;&nbsp;&nbsp;&nbsp;Solve $(\alpha_i^t)_{i=1}^{U_t} = \arg\min_{(\alpha_i)_{i=1}^{U_t}} \ell\big(\sum_{i=1}^{U_t}(Xu_i^t)_+\alpha_i,\, y\big) + \beta\sum_{i=1}^{U_t}|\alpha_i|$, the same formulation as Eq. 33.
5: &nbsp;&nbsp;&nbsp;&nbsp;Update $v^t = y - \sum_{i=1}^{U_t}(Xu_i^t)_+\alpha_i^t$.
6: &nbsp;&nbsp;&nbsp;&nbsp;Sample $\hat{u}_1^{t+1},\dots,\hat{u}_{N_{t+1}}^{t+1} \sim \mathcal{N}(0, I_d)$ i.i.d., and let $\bar{u}_i^{t+1} = \hat{u}_i^{t+1}/\lVert\hat{u}_i^{t+1}\rVert_2$ for all $i\in[N_{t+1}]$.
7: &nbsp;&nbsp;&nbsp;&nbsp;Construct $\mathcal{E}^{t+1} = \big\{\bar{u}_i^{t+1} \,\big|\, |v^{t\top}(X\bar{u}_i^{t+1})_+| > \beta\big\}$, the set of newly sampled weight vectors that tighten the dual constraint.
8: &nbsp;&nbsp;&nbsp;&nbsp;Construct $\mathcal{U}^{t+1} = \mathcal{U}^t \cup \mathcal{E}^{t+1}$ and rename all vectors in $\mathcal{U}^{t+1}$ as $u_1^{t+1},\dots,u_{U_{t+1}}^{t+1}$, where $U_{t+1}$ is the cardinality of $\mathcal{U}^{t+1}$.
9: &nbsp;&nbsp;&nbsp;&nbsp;$t \leftarrow t+1$.
10: until $\frac{|\mathcal{E}^t|}{N_t} + \sqrt{\frac{\log(1/\xi)}{2N_t}} \le \psi$ and/or $U_{t-1} \ge \frac{n+1}{\psi\xi} - 1$, where $\psi$ and $\xi$ are preset thresholds.

Appendix C: Additional Experiments

C.1 ADMM Asymptotic Convergence

In this part of the appendix, we present empirical evidence demonstrating the asymptotic convergence properties of ADMM (Algorithm 2). We use the same data as in Section 5.2.1, and the experiment settings are presented in Section D.1.

Figure 7: Gap between the cost returned by ADMM at each iteration and the true optimal cost for five independent runs. (a) $l_{\mathrm{ADMM}}^{u,\alpha} - l_{\mathrm{ADMM}}^\star$; (b) $l_{\mathrm{ADMM}}^{u,\alpha} - l_{\mathrm{CVX}}^\star$; (c) $|l_{\mathrm{ADMM}}^{u,\alpha} - l_{\mathrm{ADMM}}^{v,w}|$.

Fig. 7(a) shows that the training loss converges to a stationary value at a linear rate, verifying the findings of Theorem 3.1. Note that the $D_h$ matrices randomly generated in the five runs are different, resulting in different optimization landscapes and different linear convergence bounds. Fig. 7(b) shows that ADMM converges towards the CVX ground truth, verifying the correctness of the ADMM solution. Fig. 7(c) shows that $l_{\mathrm{ADMM}}^{v,w}$ and $l_{\mathrm{ADMM}}^{u,\alpha}$ remain close throughout the ADMM iterations, implying that $v_i$ and $w_i$ violate the constraints of Eq. 2 only insignificantly at every step. Together, these figures confirm that the ADMM algorithm optimizes Eq. 1 effectively as designed. The learning curves of the five runs differ considerably because the random $D_h$ matrices induce markedly different optimization landscapes. However, as illustrated in Fig. 2, the initial rapid convergence behavior is very consistent.

C.2 The SCP Convex Training Formulation

In this subsection, we demonstrate the efficacy of SCP-relaxed training with one-shot random sampling of $u_1,\dots,u_N$ and explore the effect of the number of sampled weights $N$. We independently sample different numbers of hidden-layer weights and use the SCP training formulation Eq. 33 to train ANNs on the "mammographic masses" dataset [22]. We remove instances containing NaNs and randomly select 70% of the data for the training set and 30% for the test set, resulting in $n = 581$ and $d = 5$. We use two different regularization strengths: $\beta = 10^{-4}$ and $\beta = 10^{-2}$. The training loss and the test accuracy for each $N$ setting are plotted in Fig. 8. The ANN training process is stochastic due to the randomly generated hidden-layer weights $u_j$ and the random splitting of training and test sets. We use CVXPY and the MOSEK solver to solve the underlying optimization problem Eq. 33. We perform 20 independent trials for each $N$ and average the results.

Figure 8: Average accuracy and average cost with different choices of $N$ for two selections of the regularization strength: (a) $\beta = 10^{-4}$; (b) $\beta = 10^{-2}$.

For both regularization settings, adding more sampled hidden-layer weights makes the SCP approximation more refined and therefore decreases the training loss. When the regularization strength $\beta$ is $10^{-4}$, the test accuracy increases, peaks, and then decreases as $N$ increases. The accuracy drops when $N$ is large, possibly because of overfitting caused by a lack of sparsity. As a comparison, training ANNs using Algorithm 1 with $P_s$ set to 120 achieves an average accuracy of 79.80% and an average training loss of 0.2428 on the same dataset. Directly optimizing the non-convex cost function Eq. 1 using gradient-descent back-propagation with the width $m$ set to $2P_s = 240$ achieves an 81.14% average test accuracy and a 0.3560 average cost. Thus, with a proper choice of $N$, the prediction performance of the SCP convex training approach is on par with Algorithm 1 and traditional back-propagation SGD. When the regularization strength $\beta$ is $10^{-2}$, the test accuracy of the ANNs trained with the SCP method generally increases with $N$.

To verify the performance of the proposed training approach on larger-scale data, we use the SCP method to train ANNs on the MNIST handwritten digits database [42] for binary classification between digits "2" and "8" ($d = 784$ and $n = 11809$) using the binary cross-entropy loss. The SCP training formulation Eq. 33 is solved with the ISTA algorithm [11]. With the number of sampled weights $N$ set to 39365 (a much larger value than $P_s$ in the ADMM experiments, corresponding to an optimality level of $\xi\psi = 0.3$), the SCP formulation Eq. 33 achieves a test accuracy of 99.45%. Compared with the ADMM approach discussed in Section 3, the SCP formulation is able to train much wider ANNs with a similar amount of computational power. In summary, this result demonstrates the performance and efficiency advantages of the SCP formulation Eq. 33 for medium and large machine learning problems.

C.3 Hinge Loss Convex Adversarial Training – The Optimization Landscape

This subsection shows that the convex loss landscape and the non-convex landscape overlap within an ℓ ∞ -norm-bounded additive perturbation set around a training point 𝑥 𝑘 , and thereby verifies that the convex objective Eq. 5.4a provides an exact certification of the non-convex loss function at training data points.

Figure 9: Illustrations of the optimization landscapes of the convex and non-convex training formulations. (a) The loss landscape of the convex objective $\ell_{\mathrm{convex}}$ for $\lVert\delta\rVert_\infty \le 0.3$. (b) The loss landscape of the non-convex objective $\ell_{\mathrm{nonconvex}}$ for $\lVert\delta\rVert_\infty \le 0.3$. (c) $\ell_{\mathrm{convex}} - \ell_{\mathrm{nonconvex}}$ for $\lVert\delta\rVert_\infty \le 0.3$. (d) $\ell_{\mathrm{convex}} - \ell_{\mathrm{nonconvex}}$ zoomed into $\lVert\delta\rVert_\infty \le 0.08$.

The visualizations are based on the 2-dimensional experiment described in Section 5.3.1. We use Algorithm 4 to train a robust ANN on the 2-dimensional dataset with $\epsilon = 0.08$, $P_s = 360$, and $\beta = 10^{-9}$. We then randomly select one of the training points $x_k$ and plot the loss around $x_k$ for the convex objective Eq. 5.4a and the non-convex objective Eq. 13. Specifically, for $\lVert\delta\rVert_\infty \le 0.3$, we plot

$$\ell_{\mathrm{convex}} = \Big(1 - y_k \cdot \sum_{i=1}^{P} d_{ik}\,(x_k+\delta)^\top(v_i^\star - w_i^\star)\Big)_+ \;\text{ and }\; \ell_{\mathrm{nonconvex}} = \Big(1 - y_k \cdot \sum_{j=1}^{m} \big((x_k+\delta)^\top u_j^\star\big)_+ \alpha_j^\star\Big)_+,$$

where $d_{ik}$ is the $k$th diagonal entry of $D_i$ and $y_k$ is the training label corresponding to $x_k$. Moreover, $v_i^\star, w_i^\star$ are the optimizers returned by Algorithm 4, and $u_j^\star$ and $\alpha_j^\star$ are the ANN weights recovered from $v_i^\star$ and $w_i^\star$ with Eq. 4. The plots are shown in Figs. 9(a) and 9(b).

For a clearer visualization, we also plot $\ell_{\mathrm{convex}} - \ell_{\mathrm{nonconvex}}$ in Fig. 9(c) and zoom into the $\ell_\infty$ norm ball with radius $\epsilon = 0.08$ in Fig. 9(d). When $\ell_{\mathrm{convex}} - \ell_{\mathrm{nonconvex}}$ is zero, the convex objective provides an exact certificate for the non-convex loss function. Fig. 9(d) shows that for $\lVert\delta\rVert_\infty \le 0.08$, the difference is zero, supporting the finding that for ANNs trained with Algorithm 4, the convex objective offers an exact certificate around the training points.

C.4 Hinge Loss Convex Adversarial Training with Different Regularizations

Figure 10: Decision boundaries obtained from various methods with $\beta$ set to $10^{-9}$, $10^{-6}$, $10^{-3}$, and $10^{-2}$.

We now compare the decision boundaries obtained from the convex training algorithms and back-propagation algorithms. As shown in Fig. 10, the two standard training methods (Algorithm 1 and GD-std) learned decision boundaries that separated the training points but failed to separate the perturbation boxes. Note that Algorithm 1 learned slightly more sophisticated boundaries while GD-std learned near-linear boundaries that were very close to one of the positive training points × .

The convex adversarial training method given by Algorithm 4 learns boundaries that separate all perturbation boxes when $\beta$ is $10^{-3}$, $10^{-6}$, or $10^{-9}$. This behavior matches the theoretical illustration of adversarial training [45, Figure 3] and verifies that Algorithm 4 works as intended. When the regularization is too strong ($\beta = 10^{-2}$), the robust boundary becomes smoothed out and very similar to the standard training boundaries. The traditional adversarial training method GD-PGD learns boundaries that separate most perturbation boxes. However, the boundaries cut through the box at around $(1, -1)$ when $\beta$ is $10^{-3}$, $10^{-6}$, or $10^{-9}$. This behavior is likely caused by GD-PGD's worse convergence due to the non-convexity. When $\beta$ is too large, the GD-PGD boundary also becomes smoothed out.

C.5 Squared Loss Convex Adversarial Training

Figure 11: The true relationship between the data $x$ and the targets $y$ used in the illustrative example in Section C.5. The training ($n = 8$ points) and test ($n = 100$ points) sets are uniformly sampled from the distribution.

Figure 12: The robust training approach Eq. 40 outperforms the standard approach for different $\epsilon \in \{0.1, \dots, 0.9\}$ on the dataset studied in Section C.5.

The performance of the proposed robust optimization problem Eq. 40 is compared with the standard training problem Eq. 2 on an illustrative 1-dimensional dataset. Fig. 11 shows the true relationship between the data vector 𝑋 and the target output 𝑦 . Training data are constructed by uniformly sampling eight points from this distribution, and test data are constructed by uniformly sampling 100 points. A bias term is included by concatenating a column of ones to 𝑋 .

The training and test procedure is repeated for 100 trials with convex standard training (Algorithm 1). For convex adversarial training (Algorithm 4), we vary the perturbation radius $\epsilon$ over $\{0.1, \dots, 0.9\}$ and carry out the training and test procedure for ten trials for each $\epsilon$. Fig. 12 reports the average test mean square error (MSE) for each setup.

The adversarial training procedure outperforms standard training for all 𝜖 choices. We further observe that the average MSE is the lowest at 𝜖 ≈ 0.3 . This behavior arises as the robust problem attempts to account for all points within the uncertainty interval around the sampled training points. When 𝜖 is too small, the robust problem approaches the standard training problem. Larger values of 𝜖 cause the uncertainty interval to overestimate the constant regions of the true distribution, increasing the MSE.

Appendix D: Experiment Setting Details

D.1 ADMM Hyperparameters

Table 6: Hyperparameter settings used for the ADMM experiments.

| | Fig. 7 | Fig. 2 | Fig. 3 | Fig. 4 | Table 2 | Table 3 | Table 4 (ADMM-RBCD) |
|---|---|---|---|---|---|---|---|
| $\rho$ | 0.4 | 0.4 | 0.1 | 0.1 | 0.1 | 0.4 | 0.01 |
| $\gamma_a$ | 0.01 | 0.4 | 0.1 | 0.1 | 0.1 | 0.16 | 0.01 |
| $\beta$ | 0.0005 | 0.0005 | 0.0005 | 0.0001 | 0.001 | 0.001 | 0.001 |

The proposed ADMM algorithm has two hyperparameters: a penalty hyperparameter $\rho$ and a step size $\gamma_a$. The hyperparameters used in the experiments in this paper are shown in Table 6. In most experiments, we select $\gamma_a = \rho$, a common choice for the ADMM algorithm. The penalty parameter $\rho$ controls the level of infeasibility of $v$ and $w$. Note that while ADMM is guaranteed to converge to an optimal feasible solution, the optimization variables may be infeasible at intermediate steps. The feasibility of $v_i$ and $w_i$ for Eq. 2 is emphasized when $\rho$ is large, while a low objective value is emphasized when $\rho$ is small. For the purpose of finding optimal $u_j$ and $\alpha_j$ that minimize Eq. 1, a balance between feasibility and a low objective is required. In practice, if there exists a significant gap between the objective of Eq. 2 and the training loss Eq. 1, then $\rho$ should be increased. If the objective of Eq. 2 struggles to decrease, then $\rho$ should be decreased.

D.2 FGSM and PGD Details

The hinge loss has a flat part with zero gradient. To generate adversarial examples even in this part, we treat it as the "leaky hinge loss" via the model $\max\{\zeta(1-\hat{y}\cdot y),\; 1-\hat{y}\cdot y\}$, where $\zeta\to 0^+$. Hence, the PGD update Eq. 14 amounts to

$$\tilde{x}^{t+1} = \Pi_{\mathcal{X}}\Big(\tilde{x}^t - \gamma_p \cdot \mathrm{sgn}\Big(y\cdot \sum_{j:\, x^\top u_j \ge 0} (u_j\alpha_j)\Big)\Big), \qquad \tilde{x}^0 = x,$$

where the projection step can be performed by clipping the coordinates that deviate more than $\epsilon$ from $x$. In the following experiments, we use $\gamma_p = \epsilon/30$ and run PGD for $T = 40$ steps. On the other hand, the FGSM calculation can again be regarded as the special case of PGD where $T = 1$.
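The PGD update above can be sketched as follows (our illustration, not reference code). `U` stacks the hidden-layer weights $u_j$ as rows, the $\ell_\infty$ projection $\Pi_{\mathcal{X}}$ is implemented by clipping to the $\epsilon$-box around the clean input, and the ReLU activation pattern is refreshed at the current iterate in this sketch.

```python
import numpy as np

def pgd_hinge_attack(x, y, U, alpha, eps, T=40):
    """PGD on the (leaky) hinge loss of a one-hidden-layer ReLU network."""
    gamma_p = eps / 30.0                              # step size from the text
    x_adv = x.copy()                                  # x_tilde^0 = x
    for _ in range(T):
        active = U @ x_adv >= 0.0                     # current ReLU pattern
        grad_dir = y * (U[active].T @ alpha[active])  # y * sum_j u_j alpha_j
        x_adv = x_adv - gamma_p * np.sign(grad_dir)   # signed descent on y * f(x)
        x_adv = np.clip(x_adv, x - eps, x + eps)      # projection Pi_X
    return x_adv

def fgsm_hinge_attack(x, y, U, alpha, eps):
    """FGSM viewed as the special case of PGD with T = 1."""
    return pgd_hinge_attack(x, y, U, alpha, eps, T=1)
```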

Appendix E: Convex Adversarial Training Extensions

E.1 Convex Squared Loss Adversarial Training

The squared loss $\ell(\hat{y}, y) = \frac{1}{2}\lVert\hat{y}-y\rVert_2^2$ is another commonly used loss function in machine learning. Consider the non-convex training problem of a one-hidden-layer ReLU ANN trained with the $\ell_2$-regularized squared loss:

(37)
$$\min_{(u_j,\alpha_j)_{j=1}^m} \; \frac{1}{2}\Big\lVert \sum_{j=1}^m (Xu_j)_+\alpha_j - y \Big\rVert_2^2 + \frac{\beta}{2}\sum_{j=1}^m \big(\lVert u_j\rVert_2^2 + \alpha_j^2\big).$$

Coupling this nominal problem with the perturbation set $\mathcal{X}$ gives the robust counterpart

(38)
$$\min_{(u_j,\alpha_j)_{j=1}^m} \Big( \max_{\Delta:\, X+\Delta\in\mathcal{X}} \frac{1}{2}\Big\lVert \sum_{j=1}^m ((X+\Delta)u_j)_+\alpha_j - y \Big\rVert_2^2 + \frac{\beta}{2}\sum_{j=1}^m \big(\lVert u_j\rVert_2^2 + \alpha_j^2\big) \Big).$$

Applying Theorem 4.1 and Corollary 4.2 leads to the following formulation as an upper bound on Eq. 38:

(39)
$$\min_{(v_i,w_i)_{i=1}^{\hat{P}}} \Big( \max_{\Delta:\, X+\Delta\in\mathcal{X}} \frac{1}{2}\Big\lVert \sum_{i=1}^{\hat{P}} D_i(X+\Delta)(v_i-w_i) - y \Big\rVert_2^2 + \beta\sum_{i=1}^{\hat{P}} \big(\lVert v_i\rVert_2 + \lVert w_i\rVert_2\big) \Big)$$
$$\mathrm{s.t.}\quad (2D_i - I_n)Xv_i \ge \epsilon\lVert v_i\rVert_1, \quad (2D_i - I_n)Xw_i \ge \epsilon\lVert w_i\rVert_1, \quad \forall i\in[\hat{P}].$$

Solving the maximization over $\Delta$ in closed form leads to the next result, with the proof provided in Section F.8.

Theorem E.1.

The optimization problem Eq. 39 is equivalent to the convex program:

(40)
$$\min_{(v_i,w_i)_{i=1}^{\hat{P}},\, a,\, z} \; a + \beta\sum_{i=1}^{\hat{P}} \big(\lVert v_i\rVert_2 + \lVert w_i\rVert_2\big)$$
$$\mathrm{s.t.}\quad (2D_i - I_n)Xv_i \ge \epsilon\lVert v_i\rVert_1, \quad (2D_i - I_n)Xw_i \ge \epsilon\lVert w_i\rVert_1, \quad \forall i\in[\hat{P}],$$
$$z_k \ge \Big| \sum_{i=1}^{\hat{P}} D_{ik}\, x_k^\top (v_i - w_i) - y_k \Big| + \epsilon\Big\lVert \sum_{i=1}^{\hat{P}} D_{ik}(v_i - w_i) \Big\rVert_1, \quad \forall k\in[n],$$
$$z_{n+1} \ge \Big|\frac{2a-1}{4}\Big|, \qquad \lVert z\rVert_2 \le \frac{2a+1}{4}.$$
Problem Eq. 40 is a convex optimization that can train robust ANNs. However, directly using Eq. 40 for adversarial training can be intractable due to the large number of constraints that arise when we include all 𝐷 𝑖 matrices associated with all Δ such that 𝑋 + Δ ∈ 𝒳 . To this end, one can use the approximation in Algorithm 4 and sample a subset of the diagonal matrices 𝐷 1 , … , 𝐷 𝑃 𝑠 . As before, the optimality gap can be characterized with Theorem 2.2.

E.2 Convex Adversarial Training for ConvNets

While our discussions explicitly focus on one-hidden-layer scalar-output ReLU networks, the derived training methods can be used for more sophisticated ANN architectures. As discussed above, greedily training one-hidden-layer ANNs leads to a well-performing deep network [12]. Leveraging recent works that reformulate the training of more complex ANNs as convex programs [25, 24, 53], our analysis can also extend to those ANNs because most convex training formulations share similar structures. Specifically, these convex training formulations rely on binary matrices to represent ReLU activation patterns and on convex (and often linear) constraints to enforce those patterns, with different regularizations revealing the sparsity properties of different architectures. Coupling layer-wise training [12] with SCP convex training recovers multi-layer ELMs.

As an example, we now extend our convex adversarial training analysis to various CNN formulations used in [25].

The paper [25] shows that the convex ANN training approach extends to various CNN architectures. Taking advantage of this result, the convex adversarial training formulations similarly generalize. In this part of the appendix, we change our notations to align with [25]. For example, the robust counterpart of the average pooling two-layer CNN convex training formulation (cf. Equations (4) and (26) in [25]) is:

$$\min_{\{v_i,w_i\}_{i=1}^{P_{\mathrm{conv}}}} \Big( \max_{X_k\in\mathcal{X}_k} \ell\Big(\sum_{i=1}^{P_{\mathrm{conv}}} \sum_{k=1}^{K} \bar{D}_i^k X_k (w_i - v_i),\, \mathbf{y}\Big) + \beta\sum_{i=1}^{P_{\mathrm{conv}}} \big(\lVert v_i\rVert_2 + \lVert w_i\rVert_2\big) \Big)$$
$$\mathrm{s.t.}\quad \min_{X_k\in\mathcal{X}_k} (2\bar{D}_i^k - I_n)X_k w_i \ge 0, \quad \min_{X_k\in\mathcal{X}_k} (2\bar{D}_i^k - I_n)X_k v_i \ge 0, \quad \forall i, k,$$

where $v_i, w_i \in \mathbb{R}^{\bar{d}}$ for all $i\in[P_{\mathrm{conv}}]$ and $\bar{d}$ is the convolutional filter size. Moreover, $X_k$ is the $k$th patch of the data matrix $X$ and $\mathcal{X}_k$ is the corresponding perturbation set of the patch $X_k$. Furthermore, $\{\bar{D}_1,\dots,\bar{D}_{P_{\mathrm{conv}}}\}$ is the set formed by all diagonal binary matrices that represent possible ReLU activation patterns associated with $\mathbf{M} := [X_1^\top \,\cdots\, X_{P_{\mathrm{conv}}}^\top]^\top$, and $\bar{D}_i^k$ denotes the $k$th $\bar{d}$-by-$\bar{d}$ diagonal block of $\bar{D}_i$.

The next step would be to show that the above formulation is equivalent to a classic convex optimization. Note that each robust constraint is an LP subproblem that can be solved in closed form, which means that the robust constraints can be cast as equivalent classic constraints. When $\ell(\cdot)$ is the squared loss, the above equation becomes a robust second-order cone program (SOCP), which is known to be a convex optimization problem (similar to Eq. 39). Otherwise, if $\ell(\cdot)$ is monotonically increasing or decreasing in the CNN output $\hat{\mathbf{y}}$ (examples include the hinge loss and the binary cross-entropy loss), the inner maximization problem

$$\arg\max_{X_k\in\mathcal{X}_k} \ell\Big(\sum_{i=1}^{P_{\mathrm{conv}}} \sum_{k=1}^{K} \bar{D}_i^k X_k (w_i - v_i),\, \mathbf{y}\Big)$$

reduces to

$$\arg\max_{X_k\in\mathcal{X}_k} \sum_{i=1}^{P_{\mathrm{conv}}} \sum_{k=1}^{K} \bar{D}_i^k X_k (w_i - v_i) \;\text{ or }\; \arg\min_{X_k\in\mathcal{X}_k} \sum_{i=1}^{P_{\mathrm{conv}}} \sum_{k=1}^{K} \bar{D}_i^k X_k (w_i - v_i),$$

which are LPs that can be solved in closed form. Substituting the closed-form solution yields the desired convex adversarial training formulations.

Similarly, for max pooling two-layer CNNs, the robust counterpart becomes (cf. Equation (7) of [25]):

$$\min_{\{v_i,w_i\}_{i=1}^{P_{\mathrm{conv}}}} \Big( \max_{X_k\in\mathcal{X}_k} \ell\Big(\sum_{i=1}^{P_{\mathrm{conv}}} \sum_{k=1}^{K} \bar{D}_i^k X_k (w_i - v_i),\, \mathbf{y}\Big) + \beta\sum_{i=1}^{P_{\mathrm{conv}}} \big(\lVert v_i\rVert_2 + \lVert w_i\rVert_2\big) \Big)$$
$$\mathrm{s.t.}\quad \min_{X_k\in\mathcal{X}_k} (2\bar{D}_i^k - I_n)X_k w_i \ge 0, \quad \min_{X_k\in\mathcal{X}_k} (2\bar{D}_i^k - I_n)X_k v_i \ge 0, \quad \forall i, k,$$
$$\min_{X_k\in\mathcal{X}_k} \bar{D}_i^k X_k v_i \ge \max_{X_j\in\mathcal{X}_j} \bar{D}_i^k X_j v_i, \quad \forall i, j, k,$$
$$\min_{X_k\in\mathcal{X}_k} \bar{D}_i^k X_k w_i \ge \max_{X_j\in\mathcal{X}_j} \bar{D}_i^k X_j w_i, \quad \forall i, j, k,$$

where each additional robust constraint is an LP subproblem solvable in closed form.

The same robust optimization techniques can be applied to the three-layer CNNs (see Equation (11) in [25]) to derive corresponding convex adversarial training formulations. In general, the convex standard training formulations for different NNs / CNNs share very similar structures. Therefore, many convex standard training formulations can be "robustified" by recasting them as mini-max formulations. Whether these mini-max formulations can be reformulated as classic convex optimization problems depends on the specific structures of the problems. For the CNNs with two or three layers considered in [25], such classic convex formulations can be derived.

Similarly, the ADMM splitting scheme, discussed in Section 3, also applies to the above CNN formulations. The CNN training formulations also belong to the family of convex training formulations outlined in Eq. 25, and can be similarly split into loss function terms, regularization terms, and linear inequality constraints.

E.3 $\ell_p$ Norm-Bounded Perturbation Set for Hinge Loss

Theorem 4.3 can be extended to the following $\ell_p$ norm-bounded perturbation set:

$$\tilde{\mathcal{X}} = \big\{ X+\Delta \in \mathbb{R}^{n\times d} \;\big|\; \Delta = [\delta_1 \,\cdots\, \delta_n]^\top,\; \lVert\delta_k\rVert_p \le \epsilon,\; \forall k\in[n] \big\}.$$

In the case of performing binary classification with a hinge-loss ANN, the convex adversarial training problem then becomes:

(41)
$$\min_{(v_i,w_i)_{i=1}^{\hat{P}}} \; \frac{1}{n}\sum_{k=1}^{n} \Big( 1 - y_k \sum_{i=1}^{\hat{P}} d_{ik}\, x_k^\top (v_i - w_i) + \epsilon\cdot\Big\lVert \sum_{i=1}^{\hat{P}} d_{ik}(v_i - w_i) \Big\rVert_{p*} \Big)_+ + \beta\sum_{i=1}^{\hat{P}} \big(\lVert v_i\rVert_2 + \lVert w_i\rVert_2\big)$$
$$\mathrm{s.t.}\quad (2D_i - I_n)Xv_i \ge \epsilon\lVert v_i\rVert_{p*}, \quad (2D_i - I_n)Xw_i \ge \epsilon\lVert w_i\rVert_{p*}, \quad \forall i\in[\hat{P}],$$

where $D_1,\dots,D_{\hat{P}}$ are all distinct diagonal matrices associated with $\mathrm{diag}([Xu \ge 0])$ for all possible $u\in\mathbb{R}^d$ and all $X+\Delta$ at the boundary of $\tilde{\mathcal{X}}$. Note that $\lVert\cdot\rVert_{p*}$ is the dual norm of $\lVert\cdot\rVert_p$.

Appendix F: Proofs

F.1 Proof of Theorem 2.2

We start by recasting the semi-infinite constraint of the dual formulation Eq. 3 as $\max_{\lVert u\rVert_2\le 1} |v^\top(Xu)_+| \le \beta$ and obtain

$$\max_{\lVert u\rVert_2\le 1} |v^\top(Xu)_+| = \max_{\lVert u\rVert_2\le 1} \big|v^\top \mathrm{diag}([Xu\ge 0])\,Xu\big| = \max_{i\in[P]} \Big( \max_{\substack{\lVert u\rVert_2\le 1 \\ (2D_i-I_n)Xu\ge 0}} |v^\top D_i X u| \Big),$$

where the last equality holds by the definition of the $D_i$ matrices: $D_1,\dots,D_P$ are all distinct matrices that can be formed by $\mathrm{diag}([Xu\ge 0])$ for some $u\in\mathbb{R}^d$. The constraint $(2D_i - I_n)Xu \ge 0$ is equivalent to $D_iXu \ge 0$ and $(I_n - D_i)Xu \le 0$, which forces $D_i = \mathrm{diag}([Xu\ge 0])$ to hold.

Therefore, the dual formulation Eq. 3 can be recast as

(42) max 𝑣 − ℓ ∗ ⁢ ( 𝑣 ) s . t . ⁡ max ∥ 𝑢 ∥ 2 ≤ 1

( 2 ⁢ 𝐷 𝑖 − 𝐼 𝑛 ) ⁢ 𝑋 ⁢ 𝑢 ≥ 0 ⁢ | 𝑣 ⊤ ⁢ 𝐷 𝑖 ⁢ 𝑋 ⁢ 𝑢 | ≤ 𝛽 , ∀ 𝑖 ∈ [ 𝑃 ] .

To form a tractable convex program that provides an approximation to Eq. 42, one can independently sample a subset of the diagonal matrices. One possible sampling procedure is presented in Algorithm 1. The sampled matrices, denoted as 𝐷 1 , … , 𝐷 𝑃 𝑠 , can be used to construct the relaxed problem:

(43) 𝑑 𝑠 ⁢ 1 ⋆

max 𝑣 − ℓ ∗ ⁢ ( 𝑣 ) s . t . ⁡ max ∥ 𝑢 ∥ 2 ≤ 1

( 2 ⁢ 𝐷 ℎ − 𝐼 𝑛 ) ⁢ 𝑋 ⁢ 𝑢 ≥ 0 ⁢ | 𝑣 ⊤ ⁢ 𝐷 ℎ ⁢ 𝑋 ⁢ 𝑢 | ≤ 𝛽 , ∀ ℎ ∈ [ 𝑃 𝑠 ] .

The optimization problem Eq. 43 is convex with respect to 𝑣 . [49] has shown that Eq. 42 has the same optimal objective as its dual problem Eq. 2. By following precisely the same derivation, it can be shown that Eq. 43 has the same optimal objective as Eq. 5 and 𝑝 𝑠 ⁢ 1 ⋆

𝑑 𝑠 ⁢ 1 ⋆ . Moreover, if an additional diagonal matrix 𝐷 𝑃 𝑠 + 1 is independently randomly sampled to form Eq. 6, then we also have 𝑝 𝑠 ⁢ 2 ⋆

𝑑 𝑠 ⁢ 2 ⋆ , where

𝑑 𝑠 ⁢ 2 ⋆

max 𝑣 − ℓ ∗ ⁢ ( 𝑣 ) s . t . ⁡ max ∥ 𝑢 ∥ 2 ≤ 1

( 2 ⁢ 𝐷 ℎ − 𝐼 𝑛 ) ⁢ 𝑋 ⁢ 𝑢 ≥ 0 ⁢ | 𝑣 ⊤ ⁢ 𝐷 ℎ ⁢ 𝑋 ⁢ 𝑢 | ≤ 𝛽 , ∀ ℎ ∈ [ 𝑃 𝑠 + 1 ] .

Thus, the level of suboptimality of Eq. 43 compared with Eq. 42 is the level of suboptimality of Eq. 5 compared with Eq. 2. Notice that by introducing a slack variable 𝑤 ∈ ℝ , Eq. 42 can be represented as an instance of the UCP with 𝑛 + 1 optimization variables, defined in [17]:

max 𝑣 , 𝑤 : 𝑤 ≤ − ℓ ∗ ⁢ ( 𝑣 ) ⁡ 𝑤 s . t . ⁡ max ∥ 𝑢 ∥ 2 ≤ 1

( 2 ⁢ 𝐷 𝑖 − 𝐼 𝑛 ) ⁢ 𝑋 ⁢ 𝑢 ≥ 0 ⁢ | 𝑣 ⊤ ⁢ 𝐷 𝑖 ⁢ 𝑋 ⁢ 𝑢 | ≤ 𝛽 , ∀ 𝑖 ∈ [ 𝑃 ] .

The relaxed problem Eq. 43 can be regarded as a corresponding SCP. Suppose that 𝑤 ⋆ , 𝑣 ⋆ is a solution to the sampled convex problem Eq. 43. It can be concluded from [17, Theorem 1] and [18, Theorem 1] that if 𝑃 𝑠 ≥ min ⁡ { 𝑛 + 1 𝜓 ⁢ 𝜉 − 1 , 2 𝜉 ⁢ ( 𝑛 + 1 − log ⁡ 𝜓 ) } , then 𝑣 ⋆ satisfies the original constraints of the UCP Eq. 42 with high probability. Specifically, with probability no smaller than 1 − 𝜉 , we have

ℙ ⁢ { 𝐷 ∈ 𝒟 : max ∥ 𝑢 ∥ 2 ≤ 1

( 2 ⁢ 𝐷 − 𝐼 𝑛 ) ⁢ 𝑋 ⁢ 𝑢 ≥ 0 ⁢ | 𝑣 ⋆ ⊤ ⁢ 𝐷 ⁢ 𝑋 ⁢ 𝑢 |

𝛽 } ≤ 𝜓 .

where 𝒟 denotes the set of all diagonal matrices that can be formed by diag ⁢ ( [ 𝑋 ⁢ 𝑢 ≥ 0 ] ) for some 𝑢 ∈ ℝ 𝑑 , which is the set formed by 𝐷 1 , … , 𝐷 𝑃 .
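In practice, the set $\mathcal{D}$ is subsampled by drawing random vectors $u$ and recording the induced ReLU activation patterns. A minimal sketch of such a sampling step (illustrative of the spirit of Algorithm 1, not its exact pseudocode):

```python
import numpy as np

def sample_d_matrices(X, num_samples, seed=0):
    """Collect distinct diagonal matrices D = diag([Xu >= 0])
    realized by random Gaussian vectors u."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    patterns = {}
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        key = tuple((X @ u >= 0).astype(int))
        patterns.setdefault(key, np.diag(np.asarray(key, dtype=float)))
    return list(patterns.values())
```

Every returned $D$ is diagonal with entries in $\{0, 1\}$ and, by construction, is realizable by some $u$, so the constraint $(2D - I_n) X u \ge 0$ is feasible for it.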

Since $D_{P_s + 1}$ is randomly sampled from $\mathcal{D}$, we have

$$
\mathbb{P} \Big\{ D \in \mathcal{D} : \max_{\substack{\|u\|_2 \le 1 \\ (2D - I_n) X u \ge 0}} |v^{\star\top} D X u| > \beta \Big\} = \mathbb{P} \Big\{ \max_{\substack{\|u\|_2 \le 1 \\ (2D_{P_s+1} - I_n) X u \ge 0}} |v^{\star\top} D_{P_s+1} X u| > \beta \Big\}.
$$

Thus, with probability no smaller than $1 - \xi$, it holds that

$$
\mathbb{P} \Big\{ \max_{\substack{\|u\|_2 \le 1 \\ (2D_{P_s+1} - I_n) X u \ge 0}} |v^{\star\top} D_{P_s+1} X u| > \beta \Big\} \le \psi.
$$

Moreover, $d_{s2}^\star < d_{s1}^\star$ if and only if $|v^{\star\top} D_{P_s+1} X u| > \beta$ for some feasible $u$, with $d_{s2}^\star = d_{s1}^\star$ otherwise. The proof is completed by noting that $p_{s1}^\star = d_{s1}^\star$ and $p_{s2}^\star = d_{s2}^\star$. □

F.2 Proof of Theorem 3.1

We start by rewriting Eq. 8 as

$$
(44)\quad \min_{v, s, u : \, s \ge 0} \; f_1(u) + f_2(v, s) \quad \text{s.t.} \quad E_1 u - E_2 \begin{bmatrix} v \\ s \end{bmatrix} = 0,
$$

where $f_1(u) = \ell(Fu, y)$, $f_2(v, s) = \beta \|v\|_{2,1}$, $E_1 = \begin{bmatrix} I \\ G \end{bmatrix}$, and $E_2 = I$.

Furthermore, let $L(u, v, s, \nu, \lambda)$ denote the augmented Lagrangian:

$$
L(u, v, s, \nu, \lambda) \coloneqq f_1(u) + \beta \|v\|_{2,1} + \mathbb{I}_{\ge 0}(s) + \frac{\rho}{2} \big( \|u - v + \lambda\|_2^2 - \|\lambda\|_2^2 \big) + \frac{\rho}{2} \big( \|G u - s + \nu\|_2^2 - \|\nu\|_2^2 \big).
$$
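For concreteness, the augmented Lagrangian above can be evaluated as follows; the group structure used for $\|\cdot\|_{2,1}$ and all names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def aug_lagrangian(u, v, s, nu, lam, F, G, y, beta, rho, loss):
    """Evaluate L(u, v, s, nu, lam). `v` is a 2-D array whose rows are
    the groups of the l_{2,1} norm; `loss` maps (F @ u, y) to a scalar."""
    if np.any(s < 0):                       # indicator I_{>=0}(s)
        return np.inf
    vflat = v.ravel()
    return (loss(F @ u, y) + beta * np.sum(np.linalg.norm(v, axis=1))
            + 0.5 * rho * (np.linalg.norm(u - vflat + lam) ** 2
                           - np.linalg.norm(lam) ** 2)
            + 0.5 * rho * (np.linalg.norm(G @ u - s + nu) ** 2
                           - np.linalg.norm(nu) ** 2))
```

At a feasible point ($v$ equal to $u$, $s = Gu \ge 0$) with zero dual variables, the penalty terms vanish and $L$ reduces to $f_1 + f_2$, which is a useful sanity check when implementing Algorithm 2.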

Theorem 3.1 in [33] shows that the ADMM algorithm converges linearly when the objective satisfies seven conditions. We show that these conditions are all satisfied for Eq. 44 under the assumptions of Theorem 3.1 in this paper:

(a) Eq. 44 attains a global solution because the feasible set of the equivalent problem Eq. 2 is non-empty.

(b) We can decompose $f_1(u)$ into $g_1(Fu) \coloneqq \ell(Fu, y)$ and $h_1(u) \coloneqq 0$, and define $h_2(\cdot) \coloneqq f_2(\cdot)$. When the loss $\ell(\hat{y}, y)$ is convex with respect to $\hat{y}$, the functions $g_1(\cdot)$, $h_1(\cdot)$, and $h_2(\cdot)$ are all convex and continuous.

(c) When $\ell(\hat{y}, y)$ is strictly convex and continuously differentiable with a uniformly Lipschitz continuous gradient with respect to $\hat{y}$, the function $g_1(\cdot)$ is strictly convex and continuously differentiable with a uniformly Lipschitz continuous gradient.

(d) The epigraph of $h_1(\cdot) \equiv 0$ is a polyhedral set. Moreover, $h_2(v, s) = \|v\|_{2,1} = \sum_{i=1}^{P} \big( \|v_i\|_2 + \|w_i\|_2 \big)$ by definition.

(e) The constant function $h_1(\cdot)$ is trivially finite. Furthermore, for all $u, v, s$ that make $L(u, v, s, \nu, \lambda)$ finite, it must hold that $f_1(u) < +\infty$, $v$ is finite, and $s \ge 0$. Therefore, $h_2(\cdot)$ must be finite.

(f) $E_1$ and $E_2$ both have full column rank since the identity matrix has full column rank.

(g) When $\|u\| \to \infty$, we have $L(u, v, s, \nu, \lambda) \to \infty$. Hence, the solution to Eq. 3.3a must be finite as long as the initial points $u^0, v^0, s^0, \lambda^0, \nu^0$ are finite. The solutions to Eq. 3.3b and Eq. 3.3c are also finite, since their closed-form solutions are derived in Section 3.1. Therefore, the sequence $\{(u^k, v^k, s^k, \lambda^k, \nu^k)\}$ is bounded. Thus, there exist finite $u_{\max}, v_{\max}, s_{\max}$ such that Eq. 44 is equivalent to the formulation below:

$$
(45)\quad \min_{v, s, u : \, s \ge 0} \; f_1(u) + f_2(v, s) \quad \text{s.t.} \quad E_1 u - E_2 \begin{bmatrix} v \\ s \end{bmatrix} = 0, \quad \|u\|_\infty \le u_{\max}, \; \|v\|_\infty \le v_{\max}, \; \|s\|_\infty \le s_{\max}.
$$

Furthermore, the ADMM algorithm that solves Eq. 45 is equivalent to Algorithm 2. The feasible set of Eq. 45 is a compact polyhedral set formed by the $\ell_\infty$ norm constraints, the non-negativity constraints, and the linear equality constraints.

Thus, by [33, Theorem 3.1], the desired result holds when the step size $\gamma_a$ is sufficiently small. □

F.3 Proof of Theorem B.1

As discussed in Section B.2, strong duality holds between Eq. 32 and Eq. 35, as well as between Eq. 33 and Eq. 36. Here, we introduce a slack variable $w$ and cast Eq. 35 as a canonical uncertain convex program with $n+1$ optimization variables and a linear objective, where $n$ is the number of training data points:

$$
\min_{(v, w) \in \mathcal{F}} \; w \quad \text{s.t.} \quad f(v, w, u) \coloneqq |v^\top (Xu)_+| - \beta \le 0, \quad \forall u \in \mathcal{G},
$$

where

$$
\mathcal{F} = \big\{ v \in \mathbb{R}^n, \, w \in \mathbb{R} \;\big|\; \|y - v\|_2^2 - 2w \le 0 \big\}, \qquad \mathcal{G} = \big\{ u \;\big|\; \|u\|_2 = 1 \big\}.
$$

By leveraging [17, Theorem 1] and [18, Theorem 1], we can conclude that if $N \ge \min\big\{ \frac{n+1}{\psi \gamma} - 1, \; \frac{2}{\gamma} (n + 1 - \log \psi) \big\}$, then with probability no smaller than $1 - \gamma$, the solution $v^\star$ to the randomized problem Eq. 36 satisfies $\mathbb{P}\{ u : \|u\|_2 = 1, \, |v^{\star\top} (Xu)_+| > \beta \} \le \psi$. Since $u_{N+1}$ is randomly generated on the Euclidean norm sphere via a uniform distribution, it holds that $\mathbb{P}\{ |v^{\star\top} (X u_{N+1})_+| > \beta \} \le \psi$.

Consider the following dual formulation with the newly sampled hidden neuron $u_{N+1}$ included:

$$
(46)\quad d_{s4}^\star = \max_{v \in \mathbb{R}^n} \; -\ell^*(v) \quad \text{s.t.} \quad |v^\top (X u_i)_+| \le \beta, \quad \forall i \in [N+1].
$$

Since Eq. 46 and Eq. 36 share the same objective, it holds that $d_{s4}^\star < d_{s3}^\star$ if and only if $|v^{\star\top} (X u_{N+1})_+| > \beta$, with $d_{s4}^\star = d_{s3}^\star$ otherwise. The proof is completed by recalling that $p_{s3}^\star = d_{s3}^\star$ and $p_{s4}^\star = d_{s4}^\star$ due to strong duality. □

F.4 Details About the Strong Duality Between Eq. 36 and Eq. 33

F.4.1 General Loss Functions

In this part of the appendix, we explicitly derive the relationship between the optimal solutions $(\alpha_i^\star)_{i=1}^N$ and $v^\star$ for the purpose of recovering the dual optimizers from the primal optimizers.

The SCP training formulation Eq. 33 is equivalent to the following constrained optimization:

$$
(47)\quad \min_{r, (\alpha_i)_{i=1}^N} \; \ell(r, y) + \beta \sum_{i=1}^N |\alpha_i| \quad \text{s.t.} \quad r = \sum_{i=1}^N (X u_i)_+ \alpha_i,
$$

and a solution to Eq. 33 is also optimal for Eq. 47. The optimization problem Eq. 47 is in turn equivalent to the minimax problem

$$
(48)\quad \min_{r, (\alpha_i)_{i=1}^N} \; \Big( \max_{v} \; \ell(r, y) + \beta \sum_{i=1}^N |\alpha_i| + v^\top \Big( \sum_{i=1}^N (X u_i)_+ \alpha_i - r \Big) \Big).
$$

The outer minimization is convex over $r$ and $(\alpha_i)_{i=1}^N$, while the inner maximization is concave over $v$. Thus, by Sion's minimax theorem [54], the optimization problem Eq. 48 is equivalent to:

$$
\begin{aligned}
& \max_{v} \; \Big( \min_{r} \big( \ell(r, y) - v^\top r \big) + \min_{(\alpha_i)_{i=1}^N} \Big( \beta \sum_{i=1}^N |\alpha_i| + v^\top \sum_{i=1}^N (X u_i)_+ \alpha_i \Big) \Big) \\
=\;& \max_{v} \; \Big( -\max_{r} \big( v^\top r - \ell(r, y) \big) \Big) \quad \text{s.t.} \quad |v^\top (X u_i)_+| \le \beta, \; \forall i \in [N] \\
=\;& \max_{v} \; -\ell^*(v) \quad \text{s.t.} \quad |v^\top (X u_i)_+| \le \beta, \; \forall i \in [N],
\end{aligned}
$$

which is Eq. 36. The first equality holds because

$$
\min_{(\alpha_i)_{i=1}^N} \Big( \beta \sum_{i=1}^N |\alpha_i| + v^\top \sum_{i=1}^N (X u_i)_+ \alpha_i \Big) = \begin{cases} 0, & |v^\top (X u_i)_+| \le \beta, \; \forall i \in [N], \\ -\infty, & \text{otherwise}. \end{cases}
$$

Therefore, with the optimal $(\alpha_i^\star)_{i=1}^N$, one can calculate $r^\star$ via $r^\star = \sum_{i=1}^N (X u_i)_+ \alpha_i^\star$, and recover $v^\star$ by solving the following LP:

$$
v^\star = \arg\max_{v} \; -v^\top r^\star \quad \text{s.t.} \quad |v^\top (X u_i)_+| \le \beta, \; \forall i \in [N].
$$

F.4.2 Squared Loss

In this part, we derive the relationship between $(\alpha_i^\star)_{i=1}^N$ and $v^\star$ from the Karush–Kuhn–Tucker (KKT) conditions in the special case of the squared loss. In this case, the SCP training formulation Eq. 33 reduces to

$$
\min_{(\alpha_i)_{i=1}^N} \; \frac{1}{2} \Big\| \sum_{i=1}^N (X u_i)_+ \alpha_i - y \Big\|_2^2 + \beta \sum_{i=1}^N |\alpha_i|,
$$

which is equivalent to

$$
(49)\quad \min_{r, (\alpha_i)_{i=1}^N} \; \frac{1}{2} \|r\|_2^2 + \beta \sum_{i=1}^N |\alpha_i| \quad \text{s.t.} \quad r = \sum_{i=1}^N (X u_i)_+ \alpha_i - y.
$$

By introducing a dual vector variable $v \in \mathbb{R}^n$, we can write the Lagrangian of Eq. 49 as:

$$
L_{\mathrm{SCP}}\big(v, r, (\alpha_i)_{i=1}^N\big) = \frac{1}{2} \|r\|_2^2 + \beta \sum_{i=1}^N |\alpha_i| + v^\top \Big( r - \sum_{i=1}^N (X u_i)_+ \alpha_i + y \Big) = \Big( \frac{1}{2} r^\top + v^\top \Big) r + \Big( \beta \sum_{i=1}^N |\alpha_i| - v^\top \sum_{i=1}^N (X u_i)_+ \alpha_i \Big) + v^\top y.
$$

$L_{\mathrm{SCP}}\big(v, r, (\alpha_i)_{i=1}^N\big)$ is smooth with respect to $r$. Thus, by the Lagrangian stationarity condition, at the optimum we must have $\nabla_r L\big(v^\star, r^\star, (\alpha_i^\star)_{i=1}^N\big) = r^\star + v^\star = 0$. By the primal feasibility condition, we must have $r^\star = \sum_{i=1}^N (X u_i)_+ \alpha_i^\star - y$. Thus, at the optimum, $v^\star = y - \sum_{i=1}^N (X u_i)_+ \alpha_i^\star$.
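The relationship above can be checked numerically: solve the squared-loss SCP (a lasso-type problem in $\alpha$) with a simple proximal-gradient (ISTA) loop — our choice of solver here, not the paper's — and verify that $v^\star = y - \sum_i (Xu_i)_+ \alpha_i^\star$ satisfies the dual constraints $|v^{\star\top}(Xu_i)_+| \le \beta$. All data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N, beta = 20, 5, 8, 0.1
X = rng.standard_normal((n, d))
U = rng.standard_normal((d, N))          # sampled hidden neurons u_1..u_N
A = np.maximum(X @ U, 0.0)               # columns are (X u_i)_+
y = rng.standard_normal(n)

# ISTA on 0.5 * ||A a - y||^2 + beta * ||a||_1
alpha = np.zeros(N)
step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant
for _ in range(20000):
    z = alpha - step * (A.T @ (A @ alpha - y))
    alpha = np.sign(z) * np.maximum(np.abs(z) - step * beta, 0.0)

v_star = y - A @ alpha                   # recovered dual variable
print(np.max(np.abs(A.T @ v_star)), beta)
```

At the optimum, the printed maximum of $|v^{\star\top}(Xu_i)_+|$ should not exceed $\beta$ (up to solver tolerance), which is exactly the dual feasibility in Eq. 36.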

F.5 Proof of Theorem 4.1

Before proceeding with the proof, we first present the following result borrowed from [49].

Lemma F.1. For a given data matrix $X$ and $(v_i, w_i)_{i=1}^P$, if $(2D_i - I_n) X v_i \ge 0$ and $(2D_i - I_n) X w_i \ge 0$ for all $i \in [P]$, then we can recover the corresponding ANN weights $(u_{v,w}^j, \alpha_{v,w}^j)_{j=1}^{m^\star}$ using the formulas in Eq. 4, and it holds that

$$
(50)\quad \ell \Big( \sum_{i=1}^P D_i X (v_i - w_i), \, y \Big) + \beta \sum_{i=1}^P \big( \|v_i\|_2 + \|w_i\|_2 \big) = \ell \Big( \sum_{j=1}^{m^\star} (X u_{v,w}^j)_+ \alpha_{v,w}^j, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{m^\star} \big( \|u_{v,w}^j\|_2^2 + \alpha_{v,w}^{j\,2} \big).
$$

Theorem 2.1 implies that the non-convex cost function Eq. 1 has the same optimal objective value as the following finite-dimensional convex optimization problem:

$$
(51)\quad q^\star = \min_{(v_i, w_i)_{i=1}^P} \; \ell \Big( \sum_{i=1}^P D_i X (v_i - w_i), \, y \Big) + \beta \sum_{i=1}^P \big( \|v_i\|_2 + \|w_i\|_2 \big) \quad \text{s.t.} \quad (2D_i - I_n) X v_i \ge 0, \; (2D_i - I_n) X w_i \ge 0, \; \forall i \in [P],
$$

where $D_1, \dots, D_P$ are all of the matrices in the set $\mathcal{D}$, defined as the set of all distinct diagonal matrices $\mathrm{diag}([Xu \ge 0])$ that can be obtained for all possible $u \in \mathbb{R}^d$. We recall that the optimal neural network weights can be recovered using Eq. 4.

Consider the following optimization problem:

$$
(52)\quad \tilde{q}^\star = \min_{(v_i, w_i)_{i=1}^{\tilde{P}}} \; \ell \Big( \sum_{i=1}^{\tilde{P}} D_i X (v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\tilde{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big) \quad \text{s.t.} \quad (2D_i - I_n) X v_i \ge 0, \; (2D_i - I_n) X w_i \ge 0, \; \forall i \in [\tilde{P}],
$$

where additional $D$ matrices, denoted as $D_{P+1}, \dots, D_{\tilde{P}}$, are introduced. These additional matrices are still diagonal with each entry being either 0 or 1, but they do not belong to $\mathcal{D}$. They represent "infeasible hyperplanes": sign patterns that cannot be achieved by $Xu$ for any $u \in \mathbb{R}^d$.

Lemma F.2. It holds that $\tilde{q}^\star = q^\star$, meaning that the optimization problem Eq. 52 has the same optimal objective as Eq. 51.

The proof of Lemma F.2 is given in Section F.10.

The robust minimax training problem Eq. 13 considers an uncertain data matrix $X + \Delta$. Different values of $X + \Delta$ within the perturbation set $\mathcal{U}$ can result in different $D$ matrices. Now, we define $\hat{\mathcal{D}} = \bigcup_{\Delta} \mathcal{D}_\Delta$, where $\mathcal{D}_\Delta$ is the set of diagonal matrices for a particular $\Delta$ such that $X + \Delta \in \mathcal{U}$. By construction, we have $\mathcal{D}_\Delta \subseteq \hat{\mathcal{D}}$ for every $\Delta$ such that $X + \Delta \in \mathcal{U}$. Thus, if we define $D_1, \dots, D_{\hat{P}}$ as all matrices in $\hat{\mathcal{D}}$, then for every $\Delta$ with the property $X + \Delta \in \mathcal{U}$, the optimization problem

$$
(53)\quad \min_{(v_i, w_i)_{i=1}^{\hat{P}}} \; \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big) \quad \text{s.t.} \quad (2D_i - I_n)(X + \Delta) v_i \ge 0, \; (2D_i - I_n)(X + \Delta) w_i \ge 0, \; \forall i \in [\hat{P}]
$$

is equivalent to

$$
\min_{(u_j, \alpha_j)_{j=1}^m} \; \ell \Big( \sum_{j=1}^m \big( (X + \Delta) u_j \big)_+ \alpha_j, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^m \big( \|u_j\|_2^2 + \alpha_j^2 \big)
$$

as long as $m \ge \hat{m}^\star$, where $\hat{m}^\star = |\{ i : v_i^\star(\Delta) \ne 0 \}| + |\{ i : w_i^\star(\Delta) \ne 0 \}|$ and $(v_i^\star(\Delta), w_i^\star(\Delta))_{i=1}^{\hat{P}}$ denotes an optimal point of Eq. 53.

Now, we focus on the minimax training problem with a convex objective given by

$$
(54)\quad \min_{(v_i, w_i)_{i=1}^{\hat{P}} \in \mathcal{F}} \; \Big( \max_{\Delta : X + \Delta \in \mathcal{U}} \; \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big) \;\; \text{s.t.} \;\; (2D_i - I_n)(X + \Delta) v_i \ge 0, \; (2D_i - I_n)(X + \Delta) w_i \ge 0, \; \forall i \in [\hat{P}] \Big),
$$

where $\mathcal{F}$ is defined as:

$$
\mathcal{F} = \Big\{ (v_i, w_i)_{i=1}^{\hat{P}} \;\Big|\; \exists \Delta : X + \Delta \in \mathcal{U} \;\; \text{s.t.} \;\; (2D_i - I_n)(X + \Delta) v_i \ge 0, \; (2D_i - I_n)(X + \Delta) w_i \ge 0, \; \forall i \in [\hat{P}] \Big\}.
$$

The feasible set $\mathcal{F}$ is introduced to avoid the situation where the inner maximization over $\Delta$ is infeasible and the objective becomes $-\infty$, leaving the outer minimization problem unbounded.

Moreover, consider the following problem:

$$
(55)\quad \min_{(v_i, w_i)_{i=1}^{\hat{P}}} \; \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta_{v,w}^\star)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big) \quad \text{s.t.} \quad (2D_i - I_n)(X + \Delta_{v,w}^\star) v_i \ge 0, \; (2D_i - I_n)(X + \Delta_{v,w}^\star) w_i \ge 0, \; \forall i \in [\hat{P}],
$$

where $\Delta_{v,w}^\star$ is the optimal point of $\max_{\Delta : X + \Delta \in \mathcal{U}} \ell \big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_i - w_i), y \big)$. Note that, compared with Eq. 54, the inequality constraints are dropped from this inner maximization.

The optimization problem Eq. 54 gives a lower bound on Eq. 55. To prove this, we first rewrite Eq. 55 as $\min_{(v_i, w_i)_{i=1}^{\hat{P}}} f\big( (v_i, w_i)_{i=1}^{\hat{P}} \big)$, where

$$
f\big( (v_i, w_i)_{i=1}^{\hat{P}} \big) = \begin{cases} \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta_{v,w}^\star)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big), & (2D_i - I_n)(X + \Delta_{v,w}^\star) v_i \ge 0 \text{ and } (2D_i - I_n)(X + \Delta_{v,w}^\star) w_i \ge 0, \; \forall i \in [\hat{P}], \\ \infty, & \text{otherwise}. \end{cases}
$$
Now, we analyze Eq. 54 by considering the following three cases.

Case 1: For some $(v_i, w_i)_{i=1}^{\hat{P}}$, $\Delta_{v,w}^\star$ is optimal for the inner maximization of Eq. 54 and the inequality constraints are inactive. This happens whenever $\Delta_{v,w}^\star$ is feasible for the particular choice of $(v_i, w_i)_{i=1}^{\hat{P}}$; in other words, $(2D_i - I_n)(X + \Delta_{v,w}^\star) v_i \ge 0$ and $(2D_i - I_n)(X + \Delta_{v,w}^\star) w_i \ge 0$ hold true for all $i \in [\hat{P}]$. For these $(v_i, w_i)_{i=1}^{\hat{P}}$, we have:

$$
\Big( \max_{\Delta : X + \Delta \in \mathcal{U}} \; \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big) \;\; \text{s.t.} \;\; (2D_i - I_n)(X + \Delta) v_i \ge 0, \; (2D_i - I_n)(X + \Delta) w_i \ge 0, \; \forall i \in [\hat{P}] \Big) = \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta_{v,w}^\star)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big).
$$

Case 2: For some $(v_i, w_i)_{i=1}^{\hat{P}}$, $\Delta_{v,w}^\star$ is infeasible, while some $\Delta$ within the perturbation bound satisfies the inequality constraints. Suppose that, among the feasible $\Delta$'s,

$$
\tilde{\Delta}_{v,w}^\star = \arg\max_{\Delta : X + \Delta \in \mathcal{U}} \; \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big) \quad \text{s.t.} \quad (2D_i - I_n)(X + \Delta) v_i \ge 0, \; (2D_i - I_n)(X + \Delta) w_i \ge 0, \; \forall i \in [\hat{P}].
$$

In this case,

$$
\Big( \max_{\Delta : X + \Delta \in \mathcal{U}} \; \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big) \;\; \text{s.t.} \;\; (2D_i - I_n)(X + \Delta) v_i \ge 0, \; (2D_i - I_n)(X + \Delta) w_i \ge 0, \; \forall i \in [\hat{P}] \Big) = \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \tilde{\Delta}_{v,w}^\star)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big).
$$

Case 3: For all other $(v_i, w_i)_{i=1}^{\hat{P}}$, the objective value is $+\infty$ since they do not belong to $\mathcal{F}$.

Therefore, Eq. 54 can be rewritten as $\min_{(v_i, w_i)_{i=1}^{\hat{P}}} g\big( (v_i, w_i)_{i=1}^{\hat{P}} \big)$, where

$$
g\big( (v_i, w_i)_{i=1}^{\hat{P}} \big) = \begin{cases} \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta_{v,w}^\star)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big), & (2D_i - I_n)(X + \Delta_{v,w}^\star) v_i \ge 0 \text{ and } (2D_i - I_n)(X + \Delta_{v,w}^\star) w_i \ge 0, \; \forall i \in [\hat{P}]; \\ \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \tilde{\Delta}_{v,w}^\star)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big), & \exists j : (2D_j - I_n)(X + \Delta_{v,w}^\star) v_j < 0 \text{ or } (2D_j - I_n)(X + \Delta_{v,w}^\star) w_j < 0, \text{ while } \exists \Delta : (2D_i - I_n)(X + \Delta) v_i \ge 0 \text{ and } (2D_i - I_n)(X + \Delta) w_i \ge 0, \; \forall i \in [\hat{P}]; \\ \infty, & \text{otherwise}. \end{cases}
$$

Hence, $g\big( (v_i, w_i)_{i=1}^{\hat{P}} \big) = f\big( (v_i, w_i)_{i=1}^{\hat{P}} \big)$ for all $(v_i, w_i)_{i=1}^{\hat{P}}$ belonging to the first and the third cases, and $g\big( (v_i, w_i)_{i=1}^{\hat{P}} \big) < f\big( (v_i, w_i)_{i=1}^{\hat{P}} \big)$ for all $(v_i, w_i)_{i=1}^{\hat{P}}$ belonging to the second case. Thus, $\min_{(v_i, w_i)_{i=1}^{\hat{P}}} g\big( (v_i, w_i)_{i=1}^{\hat{P}} \big) \le \min_{(v_i, w_i)_{i=1}^{\hat{P}}} f\big( (v_i, w_i)_{i=1}^{\hat{P}} \big)$. This concludes that Eq. 54 is a lower bound on Eq. 55.

Let $(v_{\mathrm{minimax}}^{i\star}, w_{\mathrm{minimax}}^{i\star})_{i=1}^{\hat{P}}$ denote an optimal point of Eq. 55. It is possible that, for some $\Delta : X + \Delta \in \mathcal{U}$, the constraints $(2D_i - I_n)(X + \Delta) v_{\mathrm{minimax}}^{i\star} \ge 0$ and $(2D_i - I_n)(X + \Delta) w_{\mathrm{minimax}}^{i\star} \ge 0$ are not satisfied for all $i \in [\hat{P}]$. In light of Lemma F.1, at those $\Delta$ where such constraints are violated, the convex problem Eq. 55 does not reflect the cost of the ANN. For these infeasible $\Delta$, the input-label pairs $(X + \Delta, y)$ can have a high cost in the ANN and potentially become the worst-case adversary. However, these $\Delta$ are ignored in Eq. 55 due to the infeasibility. Since adversarial training aims to minimize the cost over the worst-case adversaries generated upon the training data, whereas Eq. 55 may sometimes miss the worst-case adversaries, Eq. 55 does not fully accomplish the task of adversarial training. In fact, by applying Theorem 2.1 and Lemma F.2, it can be verified that Eq. 54 and Eq. 55 are lower bounds on Eq. 13 as long as $m \ge \hat{m}^\star$:

$$
\begin{aligned}
& \min_{(u_j, \alpha_j)_{j=1}^m} \; \Big( \max_{\Delta : X + \Delta \in \mathcal{U}} \; \ell \Big( \sum_{j=1}^m \big( (X + \Delta) u_j \big)_+ \alpha_j, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^m \big( \|u_j\|_2^2 + \alpha_j^2 \big) \Big) \\
\ge\; & \min_{(u_j, \alpha_j)_{j=1}^m} \; \ell \Big( \sum_{j=1}^m \big( (X + \Delta_{v,w}^\star) u_j \big)_+ \alpha_j, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^m \big( \|u_j\|_2^2 + \alpha_j^2 \big) \\
=\; & \Big( \min_{(v_i, w_i)_{i=1}^{\hat{P}}} \; \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta_{v,w}^\star)(v_i - w_i), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big) \;\; \text{s.t.} \;\; (2D_i - I_n)(X + \Delta_{v,w}^\star) v_i \ge 0, \; (2D_i - I_n)(X + \Delta_{v,w}^\star) w_i \ge 0, \; \forall i \in [\hat{P}] \Big).
\end{aligned}
$$

To address the feasibility issue, we can apply robust optimization techniques ([15], Section 4.4.2) and replace the constraints in Eq. 55 with robust convex constraints, which leads to Section 4.2. Let $\big( (v_{\mathrm{rob}}^{i\star}, w_{\mathrm{rob}}^{i\star})_{i=1}^{\hat{P}}, \Delta_{\mathrm{rob}}^\star \big)$ denote an optimal point of Section 4.2, and let $(u_{\mathrm{rob}}^{j\star}, \alpha_{\mathrm{rob}}^{j\star})_{j=1}^{\hat{m}^\star}$ be the ANN weights recovered from $(v_{\mathrm{rob}}^{i\star}, w_{\mathrm{rob}}^{i\star})_{i=1}^{\hat{P}}$ with Eq. 4, where $\hat{m}^\star$ is the number of nonzero weights. In light of Lemma F.1, since the constraints $(2D_i - I_n)(X + \Delta) v_{\mathrm{rob}}^{i\star} \ge 0$ and $(2D_i - I_n)(X + \Delta) w_{\mathrm{rob}}^{i\star} \ge 0$ for all $i \in [\hat{P}]$ apply to all $X + \Delta \in \mathcal{U}$, every $X + \Delta \in \mathcal{U}$ satisfies the equality

$$
\ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_{\mathrm{rob}}^{i\star} - w_{\mathrm{rob}}^{i\star}), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_{\mathrm{rob}}^{i\star}\|_2 + \|w_{\mathrm{rob}}^{i\star}\|_2 \big) = \ell \Big( \sum_{j=1}^{\hat{m}^\star} \big( (X + \Delta) u_{\mathrm{rob}}^{j\star} \big)_+ \alpha_{\mathrm{rob}}^{j\star}, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{\hat{m}^\star} \big( \|u_{\mathrm{rob}}^{j\star}\|_2^2 + \alpha_{\mathrm{rob}}^{j\star\,2} \big).
$$

Thus, since

$$
\Delta_{\mathrm{rob}}^\star = \arg\max_{\Delta : X + \Delta \in \mathcal{U}} \; \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_{\mathrm{rob}}^{i\star} - w_{\mathrm{rob}}^{i\star}), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_{\mathrm{rob}}^{i\star}\|_2 + \|w_{\mathrm{rob}}^{i\star}\|_2 \big),
$$

we have

$$
\Delta_{\mathrm{rob}}^\star = \arg\max_{\Delta : X + \Delta \in \mathcal{U}} \; \ell \Big( \sum_{j=1}^{\hat{m}^\star} \big( (X + \Delta) u_{\mathrm{rob}}^{j\star} \big)_+ \alpha_{\mathrm{rob}}^{j\star}, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{\hat{m}^\star} \big( \|u_{\mathrm{rob}}^{j\star}\|_2^2 + \alpha_{\mathrm{rob}}^{j\star\,2} \big),
$$

giving rise to:

$$
\begin{aligned}
& \ell \Big( \sum_{i=1}^{\hat{P}} D_i (X + \Delta_{\mathrm{rob}}^\star)(v_{\mathrm{rob}}^{i\star} - w_{\mathrm{rob}}^{i\star}), \, y \Big) + \beta \sum_{i=1}^{\hat{P}} \big( \|v_{\mathrm{rob}}^{i\star}\|_2 + \|w_{\mathrm{rob}}^{i\star}\|_2 \big) \\
=\; & \ell \Big( \sum_{j=1}^{\hat{m}^\star} \big( (X + \Delta_{\mathrm{rob}}^\star) u_{\mathrm{rob}}^{j\star} \big)_+ \alpha_{\mathrm{rob}}^{j\star}, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{\hat{m}^\star} \big( \|u_{\mathrm{rob}}^{j\star}\|_2^2 + \alpha_{\mathrm{rob}}^{j\star\,2} \big) \\
=\; & \max_{\Delta : X + \Delta \in \mathcal{U}} \; \ell \Big( \sum_{j=1}^{\hat{m}^\star} \big( (X + \Delta) u_{\mathrm{rob}}^{j\star} \big)_+ \alpha_{\mathrm{rob}}^{j\star}, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{\hat{m}^\star} \big( \|u_{\mathrm{rob}}^{j\star}\|_2^2 + \alpha_{\mathrm{rob}}^{j\star\,2} \big) \\
\ge\; & \min_{(u_j, \alpha_j)_{j=1}^{\hat{m}^\star}} \; \Big( \max_{\Delta : X + \Delta \in \mathcal{U}} \; \ell \Big( \sum_{j=1}^{\hat{m}^\star} \big( (X + \Delta) u_j \big)_+ \alpha_j, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{\hat{m}^\star} \big( \|u_j\|_2^2 + \alpha_j^2 \big) \Big).
\end{aligned}
$$

Therefore, Section 4.2 is an upper bound on Eq. 13. □

F.6 Proof of Corollary 4.2

Define $E_i = 2D_i - I_n$ for all $i \in [\hat{P}]$. Each $E_i$ is a diagonal matrix whose diagonal elements are either $-1$ or $1$. Therefore, for each $i \in [\hat{P}]$, we can analyze the robust constraint $\min_{\Delta : X + \Delta \in \mathcal{U}} E_i (X + \Delta) v_i \ge 0$ element-wise (for each data point). Let $e_{ik}$ denote the $k$th diagonal element of $E_i$ and $\delta_{ik}^\top$ denote the $k$th row of the $\Delta$ that appears in the $i$th constraint. We then have:

$$
(56)\quad \min_{\|\delta_{ik}\|_\infty \le \epsilon} \; e_{ik} (x_k^\top + \delta_{ik}^\top) v_i = e_{ik} x_k^\top v_i + \min_{\|\delta_{ik}\|_\infty \le \epsilon} \; e_{ik} \delta_{ik}^\top v_i \ge 0.
$$

The minimum of the above optimization problem is achieved at $\delta_{ik}^{\star\star} = -\epsilon \cdot \mathrm{sgn}(e_{ik} v_i) = -\epsilon \cdot e_{ik} \cdot \mathrm{sgn}(v_i)$.

Note that as $\epsilon$ approaches 0, $\delta_{ik}^{\star\star}$ and the $\Delta_{\mathrm{rob}}^\star$ in Theorem 4.1 both approach 0, which means that the gap between the convex robust problem Theorem 4.3 and the non-convex adversarial training problem Eq. 19 diminishes. Substituting $\delta_{ik}^{\star\star}$ into Eq. 56 yields

$$
e_{ik} x_k^\top v_i - \epsilon \|e_{ik} v_i\|_1 = e_{ik} x_k^\top v_i - \epsilon \|v_i\|_1 \ge 0.
$$

Vertically concatenating $e_{ik} x_k^\top v_i - \epsilon \|v_i\|_1 \ge 0$ over $k \in [n]$ gives the vectorized representation $E_i X v_i - \epsilon \|v_i\|_1 \ge 0$ for each $i \in [\hat{P}]$, which leads to Eq. 16. Since the constraints on $w$ are exactly the same, $\min_{\Delta : X + \Delta \in \mathcal{U}} E_i (X + \Delta) w_i \ge 0$ is likewise equivalent to $E_i X w_i - \epsilon \|w_i\|_1 \ge 0$ for all $i \in [\hat{P}]$.
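The closed form above admits a quick numerical sanity check (synthetic data; numpy only): the minimizer $\delta^{\star\star} = -\epsilon \, \mathrm{sgn}(e\,v)$ attains the value $e\,x^\top v - \epsilon \|v\|_1$, and no sampled feasible $\delta$ can go below it.

```python
import numpy as np

rng = np.random.default_rng(1)
d, eps = 6, 0.3
x = rng.standard_normal(d)
v = rng.standard_normal(d)
e = -1.0                                  # a diagonal entry of E_i, in {-1, +1}

closed_form = e * x @ v - eps * np.abs(v).sum()
delta_star = -eps * np.sign(e * v)        # the minimizer derived above
achieved = e * (x + delta_star) @ v

# random feasible perturbations never go below the closed form
samples = rng.uniform(-eps, eps, size=(10000, d))
sampled_min = np.min(e * (x[None, :] + samples) @ v)
print(np.isclose(achieved, closed_form), sampled_min >= closed_form)
```

The same check applies verbatim to the $w_i$ constraints, since they have identical structure.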

F.7 Proof of Theorem 4.3

The regularization term is independent of $\Delta$ and can thus be ignored when analyzing the inner maximization. Each $D_i$ is diagonal with diagonal elements in $\{0, 1\}$. Therefore, the inner maximization of Section 4.4 can be analyzed element-wise (i.e., by independently maximizing the cost at each data point).

The maximization of the loss at each data point is:

$$
(57)\quad \max_{\|\delta_k\|_\infty \le \epsilon} \; \Big( 1 - y_k \sum_{i=1}^P d_{ik} (x_k^\top + \delta_k^\top)(v_i - w_i) \Big)_+,
$$

where $d_{ik}$ is the $k$th diagonal element of $D_i$ and $\delta_k^\top$ is the $k$th row of $\Delta$. One can write:

$$
\max_{\|\delta_k\|_\infty \le \epsilon} \; \Big( 1 - y_k \sum_{i=1}^P d_{ik} (x_k^\top + \delta_k^\top)(v_i - w_i) \Big)_+ = \Big( \max_{\|\delta_k\|_\infty \le \epsilon} \; 1 - y_k \sum_{i=1}^P d_{ik} (x_k^\top + \delta_k^\top)(v_i - w_i) \Big)_+ = \Big( 1 - y_k \sum_{i=1}^P d_{ik} x_k^\top (v_i - w_i) - \min_{\|\delta_k\|_\infty \le \epsilon} \; \delta_k^\top y_k \sum_{i=1}^P d_{ik} (v_i - w_i) \Big)_+.
$$

The optimal solution of $\min_{\|\delta_k\|_\infty \le \epsilon} \delta_k^\top y_k \sum_{i=1}^P d_{ik} (v_i - w_i)$ is $\delta_{\mathrm{hinge}}^{k\star} = -\epsilon \cdot \mathrm{sgn}\big( y_k \sum_{i=1}^P d_{ik} (v_i - w_i)^\top \big)$, or, equivalently, in matrix form:

$$
\Delta_{\mathrm{hinge}}^\star = -\epsilon \cdot \mathrm{sgn}\Big( \sum_{i=1}^P D_i \, y \, (v_i - w_i)^\top \Big).
$$

Substituting $\delta_{\mathrm{hinge}}^{k\star}$ into Eq. 57, we find the optimal objective of the optimization problem Eq. 57 to be

$$
\Big( 1 - y_k \sum_{i=1}^P d_{ik} x_k^\top (v_i - w_i) + \epsilon \Big\| y_k \sum_{i=1}^P d_{ik} (v_i - w_i) \Big\|_1 \Big)_+ = \Big( 1 - y_k \sum_{i=1}^P d_{ik} x_k^\top (v_i - w_i) + \epsilon |y_k| \Big\| \sum_{i=1}^P d_{ik} (v_i - w_i) \Big\|_1 \Big)_+.
$$

Therefore, the overall loss function is:

$$
\frac{1}{n} \sum_{k=1}^n \Big( 1 - y_k \sum_{i=1}^P d_{ik} x_k^\top (v_i - w_i) + \epsilon |y_k| \Big\| \sum_{i=1}^P d_{ik} (v_i - w_i) \Big\|_1 \Big)_+.
$$

In the case of binary classification, $y \in \{-1, 1\}^n$, and thus $|y_k| = 1$ for all $k \in [n]$. Therefore, the above is equivalent to

$$
(58)\quad \frac{1}{n} \sum_{k=1}^n \Big( 1 - y_k \sum_{i=1}^P d_{ik} x_k^\top (v_i - w_i) + \epsilon \Big\| \sum_{i=1}^P d_{ik} (v_i - w_i) \Big\|_1 \Big)_+,
$$

which is the objective of Theorem 4.3. This completes the proof. □
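The worst-case perturbation derived above can be checked numerically on one data point (toy data; numpy only): $\delta_{\mathrm{hinge}}^{k\star} = -\epsilon \, \mathrm{sgn}(y_k g)$ with $g = \sum_i d_{ik}(v_i - w_i)$ attains the closed-form value, and no sampled feasible $\delta$ exceeds it.

```python
import numpy as np

rng = np.random.default_rng(2)
d, P, eps = 5, 3, 0.2
x = rng.standard_normal(d)
y_k = 1.0                                        # binary label in {-1, +1}
d_k = rng.integers(0, 2, size=P).astype(float)   # kth diagonal entries of D_1..D_P
V = rng.standard_normal((P, d))                  # rows: v_i - w_i

g = d_k @ V                                      # g = sum_i d_ik (v_i - w_i)
hinge = lambda delta: max(0.0, 1.0 - y_k * (x + delta) @ g)

delta_star = -eps * np.sign(y_k * g)             # worst-case perturbation
closed_form = max(0.0, 1.0 - y_k * x @ g + eps * np.abs(g).sum())

best_sampled = max(hinge(rng.uniform(-eps, eps, d)) for _ in range(5000))
print(np.isclose(hinge(delta_star), closed_form), best_sampled <= closed_form)
```

Averaging this closed form over all $n$ data points gives exactly the objective in Eq. 58.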

F.8 Proof of Theorem E.1

We first exploit the structure of Eq. 39 and reformulate it as the following robust second-order cone program (SOCP) by introducing a slack variable $a \in \mathbb{R}$:

$$
(59)\quad \min_{(v_i, w_i)_{i=1}^{\hat{P}}, \, a} \; a + \beta \sum_{i=1}^{\hat{P}} \big( \|v_i\|_2 + \|w_i\|_2 \big)
$$
$$
\text{s.t.} \quad (2D_i - I_n) X v_i \ge \epsilon \|v_i\|_1, \; (2D_i - I_n) X w_i \ge \epsilon \|w_i\|_1, \; \forall i \in [\hat{P}],
$$
$$
\max_{\Delta : X + \Delta \in \mathcal{X}} \; \Bigg\| \begin{bmatrix} \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_i - w_i) - y \\ \frac{2a - 1}{4} \end{bmatrix} \Bigg\|_2 \le \frac{2a + 1}{4}.
$$

Then, we need to establish the equivalence between Eq. 59 and Eq. 40. To this end, we consider the constraints of Eq. 59 and argue that these can be recast as the constraints given in Eq. 40. One can write:

$$
\max_{\Delta : X + \Delta \in \mathcal{X}} \Bigg\| \begin{bmatrix} \sum_{i=1}^{\hat{P}} D_i (X + \Delta)(v_i - w_i) - y \\ \frac{2a - 1}{4} \end{bmatrix} \Bigg\|_2 = \max_{\|\delta_k\|_\infty \le \epsilon, \, \forall k \in [n]} \Bigg\| \begin{bmatrix} \sum_{i=1}^{\hat{P}} d_{i1} (x_1^\top - \delta_1^\top)(v_i - w_i) - y_1 \\ \vdots \\ \sum_{i=1}^{\hat{P}} d_{in} (x_n^\top - \delta_n^\top)(v_i - w_i) - y_n \\ \frac{2a - 1}{4} \end{bmatrix} \Bigg\|_2 = \max_{\|\delta_k\|_\infty \le \epsilon, \, \forall k \in [n]} \Bigg( \sum_{k=1}^n \Big( \sum_{i=1}^{\hat{P}} d_{ik} (x_k^\top - \delta_k^\top)(v_i - w_i) - y_k \Big)^2 + \Big( \frac{2a - 1}{4} \Big)^2 \Bigg)^{\frac{1}{2}} \le \frac{2a + 1}{4},
$$

where $d_{ik}$ is the $k$th diagonal element of $D_i$ and $\delta_k^\top$ is the $k$th row of $\Delta$. The above constraint can be rewritten by introducing slack variables $z \in \mathbb{R}^{n+1}$ as

$$
z_k \ge \Big| \sum_{i=1}^{\hat{P}} d_{ik} x_k^\top (v_i - w_i) - y_k \Big| + \epsilon \Big\| \sum_{i=1}^{\hat{P}} d_{ik} (v_i - w_i) \Big\|_1, \; \forall k \in [n], \qquad z_{n+1} \ge \Big| \frac{2a - 1}{4} \Big|, \qquad \|z\|_2 \le \frac{2a + 1}{4}.
$$
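The pair $\frac{2a-1}{4}$, $\frac{2a+1}{4}$ is the standard rotated-cone trick: since $\big(\frac{2a+1}{4}\big)^2 - \big(\frac{2a-1}{4}\big)^2 = \frac{a}{2}$, the constraint $\big\| [z; \frac{2a-1}{4}] \big\|_2 \le \frac{2a+1}{4}$ holds exactly when $\|z\|_2^2 \le \frac{a}{2}$ (for $a \ge 0$), which is how the quadratic loss is absorbed into the linear objective term $a$. A quick numeric check of this equivalence (synthetic data; the exact scaling depends on how the squared loss is normalized in Eq. 39):

```python
import numpy as np

def cone_ok(z, a):
    """Check || [z; (2a-1)/4] ||_2 <= (2a+1)/4."""
    return np.sqrt(np.sum(z**2) + ((2*a - 1) / 4) ** 2) <= (2*a + 1) / 4

rng = np.random.default_rng(3)
for _ in range(1000):
    z = rng.standard_normal(4)
    a = rng.uniform(0.0, 5.0)
    # the cone constraint holds iff ||z||^2 <= a/2
    assert cone_ok(z, a) == (np.sum(z**2) <= a / 2)
```

This is why Eq. 59 is a legitimate SOCP: the only nonlinearity left is the second-order cone itself.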

F.9 Proof of Theorem 4.4

The inner maximization of Eq. 23 can be analyzed separately for each $y_k$. For every index $k$ such that $y_k = 0$, the term $-2\hat{y}_k y_k + \log(e^{2\hat{y}_k} + 1)$ increases monotonically with respect to $\hat{y}_k$. Thus, we need to find the $\delta_k$ that maximizes $\hat{y}_k$ in order to maximize the objective. Therefore, the worst-case adversary $\delta_k^\star$ is

$$
(60)\quad \delta_{k : y_k = 0}^\star = \arg\max_{\|\delta_k\|_\infty \le \epsilon} \; \Big( \sum_{i=1}^{\hat{P}} d_{ik} \, \delta_k^\top (v_i - w_i) \Big) = \epsilon \cdot \mathrm{sgn}\Big( \sum_{i=1}^{\hat{P}} d_{ik} (v_i - w_i)^\top \Big).
$$

For each index $k$ such that $y_k = 1$, the term $-2\hat{y}_k y_k + \log(e^{2\hat{y}_k} + 1)$ decreases monotonically with respect to $\hat{y}_k$. Thus, we need to minimize $\hat{y}_k$. Therefore,

$$
(61)\quad \delta_{k : y_k = 1}^\star = \arg\min_{\|\delta_k\|_\infty \le \epsilon} \; \Big( \sum_{i=1}^{\hat{P}} d_{ik} \, \delta_k^\top (v_i - w_i) \Big) = -\epsilon \cdot \mathrm{sgn}\Big( \sum_{i=1}^{\hat{P}} d_{ik} (v_i - w_i)^\top \Big).
$$

The two cases can be combined as $\delta_k^\star = -\epsilon \cdot \mathrm{sgn}\big( (2y_k - 1) \sum_{i=1}^{\hat{P}} d_{ik} (v_i - w_i)^\top \big)$. Concatenating $\delta_1^\star, \dots, \delta_n^\star$ back into matrix form yields the worst-case perturbation matrix

$$
\Delta_{\mathrm{BCE}}^\star = -\epsilon \cdot \mathrm{sgn}\Big( \sum_{i=1}^{\hat{P}} D_i \, (2y - \mathbf{1}) \, (v_i - w_i)^\top \Big).
$$

Moreover, notice that the objective is separable between the indices $k$ with $y_k = 0$ and those with $y_k = 1$:

$$
\begin{aligned}
\sum_{k=1}^n \big( -2\hat{y}_k y_k + \log(e^{2\hat{y}_k} + 1) \big) &= \sum_{k : y_k = 1} \big( -2\hat{y}_k + \log(e^{2\hat{y}_k} + 1) \big) + \sum_{k : y_k = 0} \log(e^{2\hat{y}_k} + 1) \\
&= \sum_{k : y_k = 1} \log\Big( \frac{e^{2\hat{y}_k} + 1}{e^{2\hat{y}_k}} \Big) + \sum_{k : y_k = 0} \log(e^{2\hat{y}_k} + 1) \\
&= \sum_{k : y_k = 1} \log(e^{-2\hat{y}_k} + 1) + \sum_{k : y_k = 0} \log(e^{2\hat{y}_k} + 1) \\
&= \sum_{k : y_k = 1} \log\Big( \exp\Big( -2 \sum_{i=1}^{\hat{P}} d_{ik} x_k^\top (v_i - w_i) + 2\epsilon \cdot \Big\| \sum_{i=1}^{\hat{P}} d_{ik} (v_i - w_i) \Big\|_1 \Big) + 1 \Big) \qquad (62) \\
&\quad + \sum_{k : y_k = 0} \log\Big( \exp\Big( 2 \sum_{i=1}^{\hat{P}} d_{ik} x_k^\top (v_i - w_i) + 2\epsilon \cdot \Big\| \sum_{i=1}^{\hat{P}} d_{ik} (v_i - w_i) \Big\|_1 \Big) + 1 \Big) \qquad (63) \\
&= \sum_{k=1}^n \log\Big( \exp\Big( 2 \Big( (1 - 2y_k) \sum_{i=1}^{\hat{P}} d_{ik} x_k^\top (v_i - w_i) + \epsilon \cdot \Big\| \sum_{i=1}^{\hat{P}} d_{ik} (v_i - w_i) \Big\|_1 \Big) \Big) + 1 \Big) \\
&= \sum_{k=1}^n f \circ g_k \big( \{v_i, w_i\}_{i=1}^{\hat{P}} \big),
\end{aligned}
$$

where Eq. 62 and Eq. 63 are obtained by substituting in Eq. 60 and Eq. 61, and $f(\cdot)$, $g_k(\cdot)$ are defined in Section 4.5. Replacing the term $\sum_{k=1}^n \big( -2\hat{y}_k y_k + \log(e^{2\hat{y}_k} + 1) \big)$ in Eq. 23 with $\sum_{k=1}^n f \circ g_k \big( \{v_i, w_i\}_{i=1}^{\hat{P}} \big)$ yields the formulation in Section 4.5. Since the function $f(\cdot)$ is convex non-decreasing and $g_k(\cdot)$ is convex, the optimization problem in Section 4.5 is convex. □
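Two steps in this proof are easy to sanity-check numerically (toy data; numpy only): the log-identity $-2\hat{y} + \log(e^{2\hat{y}} + 1) = \log(e^{-2\hat{y}} + 1)$ used for the $y_k = 1$ terms, and the combined worst-case sign $\delta_k^\star = -\epsilon \, \mathrm{sgn}\big( (2y_k - 1) g \big)$ with $g = \sum_i d_{ik}(v_i - w_i)$:

```python
import numpy as np

rng = np.random.default_rng(4)
yhat = rng.standard_normal(100)
# the rewriting used for the y_k = 1 terms above
lhs = -2 * yhat + np.log(np.exp(2 * yhat) + 1)
rhs = np.log(np.exp(-2 * yhat) + 1)
print(np.allclose(lhs, rhs))

# combined worst-case sign: delta* = -eps * sgn((2 y_k - 1) g)
eps, d = 0.1, 6
g = rng.standard_normal(d)            # stands in for sum_i d_ik (v_i - w_i)
for y_k in (0.0, 1.0):
    delta = -eps * np.sign((2 * y_k - 1) * g)
    # y_k = 0 wants to maximize delta @ g; y_k = 1 wants to minimize it
    extreme = eps * np.abs(g).sum() if y_k == 0 else -eps * np.abs(g).sum()
    assert np.isclose(delta @ g, extreme)
```

Both checks confirm the substitutions that produce Eq. 62 and Eq. 63.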

F.10 Proof of Lemma F.2

According to [49], recovering the ANN weights by substituting Eq. 4 into Eq. 51 leads to

$$
q^\star = \min_{(v_i, w_i)_{i=1}^P} \; \ell \Big( \sum_{i=1}^P D_i X (v_i - w_i), \, y \Big) + \beta \sum_{i=1}^P \big( \|v_i\|_2 + \|w_i\|_2 \big) = \min_{(u_j, \alpha_j)_{j=1}^{m^\star}} \; \ell \Big( \sum_{j=1}^{m^\star} (X u_j)_+ \alpha_j, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{m^\star} \big( \|u_j\|_2^2 + \alpha_j^2 \big).
$$

Similarly, we can recover the network weights from the solution $(\tilde{v}_i^\star, \tilde{w}_i^\star)_{i=1}^{\tilde{P}}$ of Eq. 52 using

$$
(64)\quad (\tilde{u}_{j_1^i}, \tilde{\alpha}_{j_1^i}) = \Big( \frac{\tilde{v}_i^\star}{\sqrt{\|\tilde{v}_i^\star\|_2}}, \; \sqrt{\|\tilde{v}_i^\star\|_2} \Big), \qquad (\tilde{u}_{j_2^i}, \tilde{\alpha}_{j_2^i}) = \Big( \frac{\tilde{w}_i^\star}{\sqrt{\|\tilde{w}_i^\star\|_2}}, \; -\sqrt{\|\tilde{w}_i^\star\|_2} \Big), \qquad \forall i \in [\tilde{P}].
$$

Unlike in Eq. 4, zero weights are not discarded in Eq. 64. For simplicity, we use $\tilde{u}_1, \dots, \tilde{u}_{\tilde{m}^\star}$ to refer to the hidden-layer weights and $\tilde{\alpha}_1, \dots, \tilde{\alpha}_{\tilde{m}^\star}$ to refer to the output-layer weights recovered using Eq. 64. Since $(\tilde{v}_i^\star, \tilde{w}_i^\star)_{i=1}^{\tilde{P}}$ is a solution to Eq. 52, it satisfies $(2D_i - I_n) X \tilde{v}_i^\star \ge 0$ and $(2D_i - I_n) X \tilde{w}_i^\star \ge 0$ for all $i \in [\tilde{P}]$. Thus, we can apply Lemma F.1 to obtain:

$$
\begin{aligned}
\tilde{q}^\star &= \ell \Big( \sum_{i=1}^{\tilde{P}} D_i X (\tilde{v}_i^\star - \tilde{w}_i^\star), \, y \Big) + \beta \sum_{i=1}^{\tilde{P}} \big( \|\tilde{v}_i^\star\|_2 + \|\tilde{w}_i^\star\|_2 \big) = \ell \Big( \sum_{j=1}^{\tilde{m}^\star} (X \tilde{u}_j^\star)_+ \tilde{\alpha}_j^\star, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{\tilde{m}^\star} \big( \|\tilde{u}_j^\star\|_2^2 + \tilde{\alpha}_j^{\star\,2} \big) \\
&\ge \min_{(u_j, \alpha_j)_{j=1}^{\tilde{m}^\star}} \; \ell \Big( \sum_{j=1}^{\tilde{m}^\star} (X u_j)_+ \alpha_j, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{\tilde{m}^\star} \big( \|u_j\|_2^2 + \alpha_j^2 \big).
\end{aligned}
$$

Since $\tilde{P} \ge P$, $m^\star \le 2P$, and $\tilde{m}^\star = 2\tilde{P}$, we have $\tilde{m}^\star \ge m^\star$. Therefore, according to Section 2 and Theorem 6 of [49], we have:

$$
q^\star = \min_{(u_j, \alpha_j)_{j=1}^{m^\star}} \; \ell \Big( \sum_{j=1}^{m^\star} (X u_j)_+ \alpha_j, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{m^\star} \big( \|u_j\|_2^2 + \alpha_j^2 \big) = \min_{(u_j, \alpha_j)_{j=1}^{\tilde{m}^\star}} \; \ell \Big( \sum_{j=1}^{\tilde{m}^\star} (X u_j)_+ \alpha_j, \, y \Big) + \frac{\beta}{2} \sum_{j=1}^{\tilde{m}^\star} \big( \|u_j\|_2^2 + \alpha_j^2 \big) \le \tilde{q}^\star.
$$

The inequality $q^\star \le \tilde{q}^\star$ reflects that an ANN with more than $m^\star$ hidden neurons yields the same optimized loss as the ANN with $m^\star$ neurons.

Note that Eq. 52 can always attain $q^\star$ by substituting in the optimal solution of Eq. 51 and assigning zeros to all other additional $v_i$ and $w_i$, implying that $q^\star \ge \tilde{q}^\star$. Since $q^\star$ is both an upper bound and a lower bound on $\tilde{q}^\star$, we have $\tilde{q}^\star = q^\star$. Therefore, as long as all matrices in $\mathcal{D}$ are included, the existence of redundant matrices does not change the optimal objective value. □
