
Title: Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion

URL Source: https://arxiv.org/html/2302.03775


License: arXiv.org perpetual non-exclusive license

arXiv:2302.03775v3 [cs.LG] 07 Aug 2025

Ashok Cutkosky, Boston University, Boston, MA, ashok@cutkosky.com

Harsh Mehta, Google Research, Mountain View, CA, harshm@google.com

Francesco Orabona, Boston University, Boston, MA, francesco@orabona.com

Abstract

We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a $(\delta,\epsilon)$-stationary point from $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient queries to $O(\epsilon^{-3}\delta^{-1})$, which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of $O(\epsilon^{-1.5}\delta^{-0.5})$. Our techniques also recover all optimal or best-known results for finding $\epsilon$-stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.

1 Introduction

Algorithms for non-convex optimization are some of the most important tools in modern machine learning, as training neural networks requires optimizing a non-convex objective. Given the abundance of data in many domains, the time to train a neural network is the current bottleneck to having bigger and more powerful machine learning models. Motivated by this need, the past few years have seen an explosion of research focused on understanding non-convex optimization (Ghadimi & Lan, 2013; Carmon et al., 2017; Arjevani et al., 2019; 2020; Carmon et al., 2019; Fang et al., 2018). Despite significant progress, key issues remain unaddressed.

In this paper, we work to minimize a potentially non-convex objective $F:\mathbb{R}^d\to\mathbb{R}$ which we can only access in some stochastic or “noisy” manner. As motivation, consider $F(\mathbf{x}) \triangleq \mathbb{E}_{\mathbf{z}}[f(\mathbf{x},\mathbf{z})]$, where $\mathbf{x}$ can represent the model weights, $\mathbf{z}$ a minibatch of i.i.d. examples, and $f$ the loss of a model with parameters $\mathbf{x}$ on the minibatch $\mathbf{z}$. In keeping with standard empirical practice, we will restrict ourselves to first-order algorithms (gradient-based optimization).

The vast majority of prior analyses of non-convex optimization algorithms impose various smoothness conditions on the objective (Ghadimi & Lan, 2013; Carmon et al., 2017; Allen-Zhu, 2018; Tripuraneni et al., 2018; Fang et al., 2018; Zhou et al., 2018; Fang et al., 2019; Cutkosky & Orabona, 2019; Li & Orabona, 2019; Cutkosky & Mehta, 2020; Zhang et al., 2020a; Karimireddy et al., 2020; Levy et al., 2021; Faw et al., 2022; Liu et al., 2022). One motivation for smoothness assumptions is that they allow for a convenient surrogate for global minimization: rather than finding a global minimum of a neural network’s loss surface (which may be intractable), we can hope to find an $\epsilon$-stationary point, i.e., a point $\mathbf{x}$ such that $\|\nabla F(\mathbf{x})\| \le \epsilon$. By now, the fundamental limits on first-order smooth non-convex optimization are well understood: Stochastic Gradient Descent (SGD) will find an $\epsilon$-stationary point in $O(\epsilon^{-4})$ iterations, which is the optimal rate (Arjevani et al., 2019). Moreover, if $F$ happens to be second-order smooth, SGD requires only $O(\epsilon^{-3.5})$ iterations, which is also optimal (Fang et al., 2019; Arjevani et al., 2020). These optimality results motivate the popularity of SGD and its variants in practice (Kingma & Ba, 2014; Loshchilov & Hutter, 2016; 2018; Goyal et al., 2017; You et al., 2019).

Unfortunately, many standard neural network architectures are non-smooth (e.g., architectures incorporating ReLUs or max-pools cannot be smooth). As a result, these analyses can only provide intuition about what might occur when an algorithm is deployed in practice: the theorems themselves do not apply (see Patel & Berahas (2022) for examples of failure of SGD in non-smooth settings, or Li et al. (2021) for further discussion of assumptions). Despite the obvious need for non-smooth analyses, recent results suggest that even approaching a neighborhood of a stationary point may be impossible for non-smooth objectives (Kornowski & Shamir, 2022b). Nevertheless, optimization clearly is possible in practice, which suggests that we may need to rethink our assumptions and goals in order to understand non-smooth optimization.

Fortunately, Zhang et al. (2020b) recently considered an alternative definition of stationarity that is tractable even for non-smooth objectives and which has attracted much interest (Davis et al., 2021; Tian et al., 2022; Kornowski & Shamir, 2022a; Tian & So, 2022; Jordan et al., 2022). Roughly speaking, instead of $\|\nabla F(\mathbf{x})\| \le \epsilon$, we ask that there is a random variable $\mathbf{y}$ supported in a ball of radius $\delta$ about $\mathbf{x}$ such that $\|\mathbb{E}[\nabla F(\mathbf{y})]\| \le \epsilon$. We call such an $\mathbf{x}$ a $(\delta,\epsilon)$-stationary point, so that the previous definition ($\|\nabla F(\mathbf{x})\| \le \epsilon$) is a $(0,\epsilon)$-stationary point. The current best-known complexity for identifying a $(\delta,\epsilon)$-stationary point is $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient evaluations.

In this paper, we significantly improve this result: we can identify a $(\delta,\epsilon)$-stationary point with only $O(\epsilon^{-3}\delta^{-1})$ stochastic gradient evaluations. Moreover, we also show that this rate is optimal. Our primary technique is a novel online-to-non-convex conversion: a connection between non-convex stochastic optimization and online learning, which is a classical field of learning theory that already has a deep literature (Cesa-Bianchi & Lugosi, 2006; Hazan, 2019; Orabona, 2019). In particular, we show that an online learning algorithm that provides a shifting regret bound can be used to decide the update step, when fed with linear losses constructed using the stochastic gradients of the function $F$. By establishing this connection, we open new avenues for algorithm design in non-convex optimization and also motivate new research directions in online learning.

In sum, we make the following contributions:

• A reduction from non-convex non-smooth stochastic optimization to online learning: better online learning algorithms result in faster non-convex optimization. Applying this reduction to standard online learning algorithms allows us to identify a $(\delta,\epsilon)$-stationary point in $O(\epsilon^{-3}\delta^{-1})$ stochastic gradient evaluations. The previous best-known rate in this setting was $O(\epsilon^{-4}\delta^{-1})$.

• We show that the $O(\epsilon^{-3}\delta^{-1})$ rate is optimal for all $\delta,\epsilon$ such that $\epsilon \le O(\delta)$.

Additionally, we prove important corollaries for smooth 𝐹 :

• The $O(\epsilon^{-3}\delta^{-1})$ complexity implies the optimal $O(\epsilon^{-4})$ and $O(\epsilon^{-3.5})$ respective complexities for finding $(0,\epsilon)$-stationary points of smooth or second-order smooth objectives.

• For deterministic and second-order smooth objectives, we obtain a rate of $O(\epsilon^{-3/2}\delta^{-1/2})$, which implies the best-known $O(\epsilon^{-7/4})$ complexity for finding $(0,\epsilon)$-stationary points.

2 Definitions and Setup

Here, we formally introduce our setting and notation. We are interested in optimizing real-valued functions $F:\mathcal{H}\to\mathbb{R}$ where $\mathcal{H}$ is a real Hilbert space (e.g., usually $\mathcal{H}=\mathbb{R}^d$). We assume $F^\star \triangleq \inf_{\mathbf{x}} F(\mathbf{x}) > -\infty$. We assume that $F$ is differentiable, but we do not assume that $F$ is smooth. All norms $\|\cdot\|$ are the Hilbert space norm (i.e., the 2-norm) unless otherwise specified. As mentioned in the introduction, the motivating example to keep in mind in our development is the case $F(\mathbf{x}) = \mathbb{E}_{\mathbf{z}}[f(\mathbf{x},\mathbf{z})]$.

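Our algorithms access information about $F$ through a stochastic gradient oracle $\mathrm{Grad}:\mathcal{H}\times\mathcal{Z}\to\mathcal{H}$. Given a point $\mathbf{x}$ in $\mathcal{H}$, the oracle will sample an i.i.d. random variable $\mathbf{z}\in\mathcal{Z}$ and return $\mathrm{Grad}(\mathbf{x},\mathbf{z})\in\mathcal{H}$ such that $\mathbb{E}[\mathrm{Grad}(\mathbf{x},\mathbf{z})] = \nabla F(\mathbf{x})$ and $\mathrm{Var}(\mathrm{Grad}(\mathbf{x},\mathbf{z})) \le \sigma^2$.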

In the following, we only consider functions satisfying the following mild regularity condition.

Definition 1.

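We define a differentiable function $F:\mathcal{H}\to\mathbb{R}$ to be well-behaved if for all $\mathbf{x},\mathbf{y}\in\mathcal{H}$, it holds that

$$F(\mathbf{y}) - F(\mathbf{x}) = \int_0^1 \langle \nabla F(\mathbf{x} + t(\mathbf{y}-\mathbf{x})),\, \mathbf{y}-\mathbf{x}\rangle\, \mathrm{d}t.$$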

If 𝐹 happens to be differentiable and locally Lipschitz, then this assumption is simply the Fundamental Theorem of Calculus. Under this assumption, our results can be applied to improve the past results on non-smooth stochastic optimization. In fact, Proposition 2 (proof in Appendix A) below shows that for the wide class of functions that are locally Lipschitz (but possibly non-differentiable), applying an arbitrarily small perturbation to the function is sufficient to ensure both differentiability and well-behavedness. This result works via standard perturbation arguments similar to those used previously by Davis et al. (2021) (see also Bertsekas (1973); Duchi et al. (2012); Flaxman et al. (2005) for similar techniques in the convex setting). In practice we suspect that such perturbation arguments are unnecessary: intuitively an algorithm is unlikely to query a point of non-differentiability (see also Bianchi et al. (2022) for some formal evidence for this idea).

Proposition 2.

Let $F:\mathbb{R}^d\to\mathbb{R}$ be locally Lipschitz with stochastic oracle $\mathrm{Grad}$ such that $\mathbb{E}_{\mathbf{z}}[\mathrm{Grad}(\mathbf{x},\mathbf{z})] = \nabla F(\mathbf{x})$ whenever $F$ is differentiable. We have two cases:

• If $F$ is differentiable everywhere, then $F$ is well-behaved.

• If $F$ is not differentiable everywhere, let $p>0$ be an arbitrary number and let $\mathbf{u}$ be a random vector in $\mathbb{R}^d$ uniformly distributed on the unit ball. Define $\hat F(\mathbf{x}) \triangleq \mathbb{E}_{\mathbf{u}}[F(\mathbf{x}+p\mathbf{u})]$. Then, $\hat F$ is differentiable and well-behaved, and the oracle $\widehat{\mathrm{Grad}}(\mathbf{x},(\mathbf{z},\mathbf{u})) = \mathrm{Grad}(\mathbf{x}+p\mathbf{u},\mathbf{z})$ is a stochastic gradient oracle for $\hat F$. Moreover, $F$ is differentiable at $\mathbf{x}+p\mathbf{u}$ with probability 1 and if $F$ is $G$-Lipschitz, then $|\hat F(\mathbf{x}) - F(\mathbf{x})| \le pG$ for all $\mathbf{x}$.
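The perturbed oracle in the second case of Proposition 2 can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the toy objective $F(\mathbf{x})=\sum_i |x_i|$, the noiseless base oracle `abs_grad`, and the helper names are all assumptions made for the example.

```python
import math
import random

def sample_unit_ball(d, rng):
    """Draw a point uniformly from the unit Euclidean ball in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    radius = rng.random() ** (1.0 / d)  # radius density proportional to r^(d-1)
    return [radius * v / norm for v in g]

def smoothed_oracle(grad, x, p, rng):
    """One call to Grad_hat(x, (z, u)) = Grad(x + p*u, z); the base oracle here is noiseless."""
    u = sample_unit_ball(len(x), rng)
    return grad([xi + p * ui for xi, ui in zip(x, u)])

def abs_grad(x):
    # Gradient of the toy objective F(x) = |x_1| + ... + |x_d| (defined almost everywhere).
    return [1.0 if xi > 0 else -1.0 for xi in x]

rng = random.Random(0)
# At the kink x = 0 the perturbed queries land on either side with equal probability,
# so the averaged oracle output approximates the gradient of the smoothed function (0 here).
samples = [smoothed_oracle(abs_grad, [0.0], 0.1, rng) for _ in range(20000)]
mean_grad = sum(s[0] for s in samples) / len(samples)
```

The query point is almost surely a point of differentiability, which is why the base oracle never needs to handle the kink itself.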

Remark 3.

We explicitly note that our results cover the case in which 𝐹 is directionally differentiable and we have access to a stochastic directional gradient oracle, as considered by Zhang et al. (2020b). This is a less standard oracle Grad ( 𝐱 , 𝐯 , 𝐳 ) that outputs 𝐠 such that 𝔼 [ ⟨ 𝐠 , 𝐯 ⟩ ] is the directional derivative of 𝐹 in the direction 𝐯 . This setting is subtly different (although a directional derivative oracle is a gradient oracle at all points for which 𝐹 is continuously differentiable). In order to keep technical complications to a minimum, in the main text we consider the simpler stochastic gradient oracle discussed above. In Appendix H, we show that our results and techniques also apply using directional gradient oracles with only superficial modification.

2.1 $(\delta,\epsilon)$-Stationary Points

Now, let us define our notion of ( 𝛿 , 𝜖 ) -stationary point. This definition is essentially the same as used in Zhang et al. (2020b); Davis et al. (2021); Tian et al. (2022). It is in fact mildly more stringent since we restrict to distributions of finite support and require an “unbiasedness” condition in order to make eventual connections to second-order smooth objectives easier.

Definition 4.

A point $\mathbf{x}$ is a $(\delta,\epsilon)$-stationary point of an almost-everywhere differentiable function $F$ if there is a finite subset $S$ of the ball of radius $\delta$ centered at $\mathbf{x}$ such that for $\mathbf{y}$ selected uniformly at random from $S$, $\mathbb{E}[\mathbf{y}] = \mathbf{x}$ and $\|\mathbb{E}[\nabla F(\mathbf{y})]\| \le \epsilon$.

As a counterpart to this definition, we also define:

Definition 5.

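Given a point $\mathbf{x}$, a number $\delta > 0$ and an almost-everywhere differentiable function $F$, define

$$\|\nabla F(\mathbf{x})\|_\delta \triangleq \inf_{\substack{S\subset B(\mathbf{x},\delta),\\ \frac{1}{|S|}\sum_{\mathbf{y}\in S}\mathbf{y} = \mathbf{x}}} \left\| \frac{1}{|S|}\sum_{\mathbf{y}\in S}\nabla F(\mathbf{y}) \right\|.$$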
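To make Definitions 4 and 5 concrete, the sketch below evaluates one candidate set $S$ for a one-dimensional example. The function name `averaged_gradient_norm`, the tolerance, and the choice $F(y)=|y|$ are illustrative assumptions; any admissible $S$ only gives an upper bound on the infimum in Definition 5.

```python
import math

def averaged_gradient_norm(x, S, grad, delta, tol=1e-9):
    """Return ||mean of grad over S|| if S is admissible in Definition 5, else None."""
    d = len(x)
    # Every point of S must lie in the ball B(x, delta).
    for y in S:
        if math.sqrt(sum((yi - xi) ** 2 for yi, xi in zip(y, x))) > delta + tol:
            return None
    # The points of S must average to x.
    mean = [sum(y[i] for y in S) / len(S) for i in range(d)]
    if any(abs(mean[i] - x[i]) > tol for i in range(d)):
        return None
    grads = [grad(y) for y in S]
    avg = [sum(g[i] for g in grads) / len(S) for i in range(d)]
    return math.sqrt(sum(a * a for a in avg))

def abs_grad(y):
    # Gradient of F(y) = |y| in one dimension, away from the kink at 0.
    return [1.0 if y[0] > 0 else -1.0]

x = [0.05]
S = [[-0.05], [0.15]]          # two points straddling the kink, averaging to x
value = averaged_gradient_norm(x, S, abs_grad, delta=0.1)
```

Here the gradient at `x` itself has norm 1, yet the two gradients in `S` cancel, so `x` is a $(0.1, 0)$-stationary point of $|y|$ in the sense of Definition 4.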

Let’s also state an immediate corollary of Proposition 2 that converts a guarantee on a randomized smoothed function to one on the original function. This result is also immediate from Theorem 3.1 of Lin et al. (2022).

Corollary 6.

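Let $F:\mathbb{R}^d\to\mathbb{R}$ be $G$-Lipschitz. For $\epsilon > 0$, let $p \le \delta$ and let $\mathbf{u}$ be a random vector in $\mathbb{R}^d$ uniformly distributed on the unit ball. Define $\hat F(\mathbf{x}) \triangleq \mathbb{E}_{\mathbf{u}}[F(\mathbf{x}+p\mathbf{u})]$. If a point $\mathbf{x}$ satisfies $\|\nabla \hat F(\mathbf{x})\|_\delta \le \epsilon$, then $\|\nabla F(\mathbf{x})\|_{2\delta} \le \epsilon$.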

Our ultimate goal is to use $N$ stochastic gradient evaluations of $F$ to identify a point $\mathbf{x}$ with as small a value of $\mathbb{E}[\|\nabla F(\mathbf{x})\|_\delta]$ as possible. For the rest of this paper we will consider exclusively the case of well-behaved and differentiable objectives $F$. We focus our development on this conceptually simpler case in order to simplify the proofs as much as possible. However, due to Proposition 2 and Corollary 6, our results will immediately extend from differentiable $F$ to those $F$ that are locally Lipschitz and for which $\mathrm{Grad}(\mathbf{x},\mathbf{z})$ returns an unbiased estimate of $\nabla F(\mathbf{x})$ whenever $F$ is differentiable at $\mathbf{x}$.

2.2 Online Learning

Here, we very briefly introduce the setting of online linear learning with shifting competitors, which will be the core of our online-to-non-convex conversion. We refer the interested reader to Cesa-Bianchi & Lugosi (2006); Hazan (2019); Orabona (2019) for a comprehensive introduction to online learning. In the online learning setting, the learning process goes on in rounds. In each round the algorithm outputs a point $\boldsymbol{\Delta}_t$ in a feasible set $V$, then receives a linear loss function $\ell_t(\cdot) = \langle \mathbf{g}_t, \cdot\rangle$ and pays $\ell_t(\boldsymbol{\Delta}_t)$. The goal of the algorithm is to minimize the static regret over $T$ rounds, defined as the difference between its cumulative loss and that of an arbitrary comparison vector $\mathbf{u}\in V$:

$$R_T(\mathbf{u}) \triangleq \sum_{t=1}^{T} \langle \mathbf{g}_t,\, \boldsymbol{\Delta}_t - \mathbf{u}\rangle.$$

With no stochastic assumption, it is possible to design online algorithms that guarantee that the regret is upper bounded by $O(\sqrt{T})$. In this work, we frequently make use of a more challenging objective: minimizing the $K$-shifting regret. This is the regret with respect to an arbitrary sequence of $K$ vectors $\mathbf{u}^1,\dots,\mathbf{u}^K\in V$ that changes every $T$ iterations:

$$R_T(\mathbf{u}^1,\dots,\mathbf{u}^K) \triangleq \sum_{k=1}^{K}\ \sum_{n=(k-1)T+1}^{kT} \langle \mathbf{g}_n,\, \boldsymbol{\Delta}_n - \mathbf{u}^k\rangle. \tag{1}$$

It should be intuitive that resetting the online algorithm every $T$ iterations can achieve a shifting regret of $O(K\sqrt{T})$.

2.3 Related Work

In addition to the papers discussed in the introduction, here we discuss further related work.

In this paper we build on top of the definition of $(\delta,\epsilon)$-stationary points proposed by Zhang et al. (2020b). There, they prove a complexity rate of $O(\epsilon^{-4}\delta^{-1})$ for stochastic Lipschitz functions, which we improve to $O(\epsilon^{-3}\delta^{-1})$ and prove the optimality of this result.

In a concurrent work, Chen et al. (2023) consider the setting of zeroth-order stochastic optimization (i.e., evaluation of function values only rather than gradients) and achieve a similar asymptotic rate of $O(d^{3/2}\epsilon^{-3}\delta^{-1})$ by applying variance reduction to a smoothed version of the objective $F$. This result is an intriguing contrast to many zeroth-order algorithms based on such smoothing in that the algorithm is not obtained by applying smoothing to a first-order algorithm. More recently, Kornowski & Shamir (2023) improved the dimension dependence in the rate for zeroth-order optimization to $O(d\,\epsilon^{-3}\delta^{-1})$ by employing the algorithm described in this paper in concert with a refined analysis of the smoothing operation.

The idea to reduce machine learning to online learning was pioneered by Cesa-Bianchi et al. (2004) with the online-to-batch conversion. There is also previous work exploring the possibility of transforming non-convex problems into online learning ones. Ghai et al. (2022) provide some conditions under which online gradient descent on non-convex losses is equivalent to a convex online mirror descent. Hazan et al. (2017) define a notion of regret which can be used to find approximate stationary points of smooth objectives. Zhuang et al. (2019) transform the problem of tuning learning rates in stochastic non-convex optimization into an online learning problem. Our proposed approach differs from all the ones above in applying to non-smooth objectives. Moreover, as discussed in the next section, we employ online learning algorithms with shifting regret (Herbster & Warmuth, 1998) to generate the updates (i.e., the differences between successive iterates), rather than the iterates themselves.

3 Online-to-Non-Convex Conversion

In this section, we explain the online-to-non-convex conversion. The core idea transforms the minimization of a non-convex and non-smooth function into the problem of minimizing the shifting regret over linear losses. In particular, consider an optimization algorithm that updates a previous iterate $\mathbf{x}_{n-1}$ by moving in a direction $\boldsymbol{\Delta}_n$: $\mathbf{x}_n = \mathbf{x}_{n-1} + \boldsymbol{\Delta}_n$. For example, SGD sets $\boldsymbol{\Delta}_n = -\eta\,\mathbf{g}_{n-1} = -\eta\cdot\mathrm{Grad}(\mathbf{x}_{n-1},\mathbf{z}_{n-1})$ for a learning rate $\eta$. Instead, we let an online learning algorithm $\mathcal{A}$ decide the update direction $\boldsymbol{\Delta}_n$, using linear losses $\ell_n(\mathbf{x}) = \langle \mathbf{g}_n, \mathbf{x}\rangle$.

The motivation behind essentially all first-order algorithms is that $F(\mathbf{x}_{n-1}+\boldsymbol{\Delta}_n) - F(\mathbf{x}_{n-1}) \approx \langle \mathbf{g}_n, \boldsymbol{\Delta}_n\rangle$. This suggests that $\boldsymbol{\Delta}_n$ should be chosen to minimize the inner product $\langle \mathbf{g}_n, \boldsymbol{\Delta}_n\rangle$. However, we are faced with two difficulties. The first difficulty is the approximation error in the first-order expansion. The second is the fact that $\boldsymbol{\Delta}_n$ needs to be chosen before $\mathbf{g}_n$ is revealed, so that $\boldsymbol{\Delta}_n$ needs in some sense to “predict the future”. Typical analyses of algorithms such as SGD use the remainder form of Taylor’s theorem to address both difficulties simultaneously for smooth objectives, but in our non-smooth case this is not a valid approach. Instead, we tackle these difficulties independently. We overcome the first difficulty using the same randomized scaling trick employed by Zhang et al. (2020b): define $\mathbf{g}_n$ to be a gradient evaluated not at $\mathbf{x}_{n-1}$ or $\mathbf{x}_{n-1}+\boldsymbol{\Delta}_n$, but at a random point along the line segment connecting the two. Then for a well-behaved function we will have $F(\mathbf{x}_{n-1}+\boldsymbol{\Delta}_n) - F(\mathbf{x}_{n-1}) = \mathbb{E}[\langle \mathbf{g}_n, \boldsymbol{\Delta}_n\rangle]$. The second difficulty is where online learning shines: online learning algorithms are specifically designed to predict completely arbitrary sequences of vectors as accurately as possible.
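The identity behind the randomized scaling trick can be checked numerically. The sketch below is an illustration under assumptions: it uses the one-dimensional function $F(x)=|x|^{1.5}$ (differentiable everywhere, but with a non-Lipschitz derivative at 0), a noiseless gradient, and a midpoint rule in place of sampling $s$ uniformly.

```python
import math

def F(x):
    # Differentiable everywhere, but the derivative is not Lipschitz at 0.
    return abs(x) ** 1.5

def dF(x):
    return 1.5 * math.copysign(math.sqrt(abs(x)), x)

def expected_linearization(x_prev, step, n=10000):
    """Midpoint-rule estimate of E_s[ dF(x_prev + s*step) * step ] for s ~ Uniform[0, 1]."""
    total = 0.0
    for i in range(n):
        s = (i + 0.5) / n
        total += dF(x_prev + s * step) * step
    return total / n

x_prev, step = -1.0, 3.0
exact_change = F(x_prev + step) - F(x_prev)      # F(x_{n-1} + Delta_n) - F(x_{n-1})
estimate = expected_linearization(x_prev, step)  # gradient taken at a random point on the segment
endpoint_guess = dF(x_prev) * step               # first-order expansion at x_{n-1} only
```

Averaging the gradient over the segment reproduces the change in $F$, whereas the gradient at the left endpoint alone even predicts the wrong sign for this step across the kink.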

The previous intuition is formalized in Algorithm 1 and the following result, which we will elaborate on in Theorem 8 before yielding our main result in Corollary 9.

Theorem 7.

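Suppose $F$ is well-behaved. Define $\nabla_n = \int_0^1 \nabla F(\mathbf{x}_{n-1} + s\boldsymbol{\Delta}_n)\,\mathrm{d}s$. Then, with the notation in Algorithm 1 and for any sequence of vectors $\mathbf{u}_1,\dots,\mathbf{u}_N$, we have the equality:

$$F(\mathbf{x}_M) = F(\mathbf{x}_0) + \sum_{n=1}^{M}\langle \mathbf{g}_n, \boldsymbol{\Delta}_n - \mathbf{u}_n\rangle + \sum_{n=1}^{M}\langle \nabla_n - \mathbf{g}_n, \boldsymbol{\Delta}_n\rangle + \sum_{n=1}^{M}\langle \mathbf{g}_n, \mathbf{u}_n\rangle.$$

Moreover, if we let $s_n$ be independent random variables uniformly distributed in $[0,1]$, then we have

$$\mathbb{E}[F(\mathbf{x}_M)] = F(\mathbf{x}_0) + \mathbb{E}\left[\sum_{n=1}^{M}\langle \mathbf{g}_n, \boldsymbol{\Delta}_n - \mathbf{u}_n\rangle\right] + \mathbb{E}\left[\sum_{n=1}^{M}\langle \mathbf{g}_n, \mathbf{u}_n\rangle\right].$$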

Proof.

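By the well-behavedness of $F$, we have

$$\begin{aligned}
F(\mathbf{x}_n) - F(\mathbf{x}_{n-1}) &= \int_0^1 \langle \nabla F(\mathbf{x}_{n-1} + s(\mathbf{x}_n - \mathbf{x}_{n-1})),\, \mathbf{x}_n - \mathbf{x}_{n-1}\rangle\,\mathrm{d}s \\
&= \int_0^1 \langle \nabla F(\mathbf{x}_{n-1} + s\boldsymbol{\Delta}_n),\, \boldsymbol{\Delta}_n\rangle\,\mathrm{d}s \\
&= \langle \nabla_n, \boldsymbol{\Delta}_n\rangle \\
&= \langle \mathbf{g}_n, \boldsymbol{\Delta}_n - \mathbf{u}_n\rangle + \langle \nabla_n - \mathbf{g}_n, \boldsymbol{\Delta}_n\rangle + \langle \mathbf{g}_n, \mathbf{u}_n\rangle.
\end{aligned}$$

Now, sum over $n$ and telescope to obtain the stated bound.

For the second statement, simply observe that by definition we have $\mathbb{E}[\mathbf{g}_n] = \int_0^1 \nabla F(\mathbf{x}_{n-1} + s\boldsymbol{\Delta}_n)\,\mathrm{d}s = \nabla_n$. ∎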

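Algorithm 1 Online-to-Non-Convex Conversion

Input: Initial point $\mathbf{x}_0$, $K\in\mathbb{N}$, $T\in\mathbb{N}$, online learning algorithm $\mathcal{A}$.

Set $M = K\cdot T$

for $n = 1 \dots M$ do

    Get $\boldsymbol{\Delta}_n$ from $\mathcal{A}$

    Set $\mathbf{x}_n = \mathbf{x}_{n-1} + \boldsymbol{\Delta}_n$

    Generate $s_n\in[0,1]$ // usually uniformly random, see Theorem statements for precise settings.

    Set $\mathbf{w}_n = \mathbf{x}_{n-1} + s_n\boldsymbol{\Delta}_n$

    Sample random $\mathbf{z}_n$

    Generate gradient $\mathbf{g}_n = \mathrm{Grad}(\mathbf{w}_n,\mathbf{z}_n)$

    Send $\mathbf{g}_n$ to $\mathcal{A}$ as gradient

end for

Set $\mathbf{w}_t^k = \mathbf{w}_{(k-1)T+t}$ for $k = 1,\dots,K$ and $t = 1,\dots,T$

Set $\overline{\mathbf{w}}^k = \frac{1}{T}\sum_{t=1}^{T}\mathbf{w}_t^k$ for $k = 1,\dots,K$

Return $\{\overline{\mathbf{w}}^1,\dots,\overline{\mathbf{w}}^K\}$

3.1 Guarantees for Non-Smooth Non-Convex Functions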

The primary value of Theorem 7 is that the term $\sum_{n=1}^{M}\langle \mathbf{g}_n, \boldsymbol{\Delta}_n - \mathbf{u}_n\rangle$ is exactly the regret of an online learning algorithm: lower regret clearly translates to a smaller bound on $F(\mathbf{x}_M)$. Next, by carefully choosing $\mathbf{u}_n$, we will be able to relate the term $\sum_{n=1}^{M}\langle \mathbf{g}_n, \mathbf{u}_n\rangle$ to the gradient averages that appear in the definition of $(\delta,\epsilon)$-stationarity. Formalizing these ideas, we have the following:

Theorem 8.

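Assume $F$ is well-behaved. With the notation in Algorithm 1, set $s_n$ to be a random variable sampled uniformly from $[0,1]$. Set $T,K\in\mathbb{N}$ and $M = KT$. Define $\mathbf{u}^k = -D\,\dfrac{\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)}{\left\|\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)\right\|}$ for some $D>0$ for $k = 1,\dots,K$. Finally, suppose $\mathrm{Var}(\mathbf{g}_n)\le\sigma^2$. Then:

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)\right\|\right] \le \frac{F(\mathbf{x}_0)-F^\star}{DM} + \frac{\mathbb{E}[R_T(\mathbf{u}^1,\dots,\mathbf{u}^K)]}{DM} + \frac{\sigma}{\sqrt{T}}.$$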

Proof.

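From Theorem 7, taking $\mathbf{u}_n = \mathbf{u}^k$ for every $n$ in the $k$-th block of $T$ iterations, we have

$$\mathbb{E}[F(\mathbf{x}_M)] = F(\mathbf{x}_0) + \mathbb{E}[R_T(\mathbf{u}^1,\dots,\mathbf{u}^K)] + \mathbb{E}\left[\sum_{n=1}^{M}\langle \mathbf{g}_n, \mathbf{u}_n\rangle\right].$$

Now, since $\mathbf{u}^k = -D\,\dfrac{\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)}{\left\|\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)\right\|}$, $\mathbb{E}[\mathbf{g}_n] = \mathbb{E}[\nabla F(\mathbf{w}_n)]$, and $\mathrm{Var}(\mathbf{g}_n)\le\sigma^2$, we have

$$\begin{aligned}
\mathbb{E}\left[\sum_{n=1}^{M}\langle \mathbf{g}_n, \mathbf{u}_n\rangle\right]
&\le \mathbb{E}\left[\sum_{k=1}^{K}\left\langle \sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k),\, \mathbf{u}^k\right\rangle\right] + \mathbb{E}\left[D\sum_{k=1}^{K}\left\|\sum_{t=1}^{T}\left(\nabla F(\mathbf{w}_t^k) - \mathbf{g}_{T(k-1)+t}\right)\right\|\right] \\
&\le \mathbb{E}\left[\sum_{k=1}^{K}\left\langle \sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k),\, \mathbf{u}^k\right\rangle\right] + D\sigma K\sqrt{T} \\
&= \mathbb{E}\left[-\sum_{k=1}^{K} DT\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)\right\|\right] + D\sigma K\sqrt{T}.
\end{aligned}$$

Putting this all together, we have

$$F^\star \le \mathbb{E}[F(\mathbf{x}_M)] \le F(\mathbf{x}_0) + \mathbb{E}[R_T(\mathbf{u}^1,\dots,\mathbf{u}^K)] + \sigma DK\sqrt{T} - DT\sum_{k=1}^{K}\mathbb{E}\left[\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)\right\|\right].$$

Dividing by $KDT = DM$ and reordering, we have the stated bound. ∎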

We now instantiate Theorem 8 with the simplest online learning algorithm: online gradient descent (OGD) (Zinkevich, 2003). OGD takes as input a radius $D$ and a step size $\eta$ and makes the update $\boldsymbol{\Delta}_{n+1} = \Pi_{\|\boldsymbol{\Delta}\|\le D}[\boldsymbol{\Delta}_n - \eta\,\mathbf{g}_n]$ with $\boldsymbol{\Delta}_1 = 0$. The standard analysis shows that if $\mathbb{E}[\|\mathbf{g}_n\|^2]\le G^2$ for all $n$, then with $\eta = \frac{D}{G\sqrt{T}}$, OGD will ensure$^1$ static regret $\mathbb{E}[R_T(\mathbf{u})]\le DG\sqrt{T}$ for any $\mathbf{u}$ satisfying $\|\mathbf{u}\|\le D$. Thus, by resetting the algorithm every $T$ iterations, we achieve $\mathbb{E}[R_T(\mathbf{u}^1,\dots,\mathbf{u}^K)]\le KDG\sqrt{T}$. This powerful guarantee for all sequences is characteristic of online learning. We are now free to optimize the remaining parameters $K$ and $D$ to achieve our main result, presented in Corollary 9.
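The regret guarantee for projected OGD and its reset variant can be checked on a synthetic sequence. The sketch below is illustrative only: the gradient sequence is random and rescaled so that $\|\mathbf{g}_t\|\le G$ holds deterministically (a stronger condition than the second-moment bound in the text), and the function names are assumptions.

```python
import math
import random

def project_ball(v, D):
    """Euclidean projection onto {v : ||v|| <= D}."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= D else [c * D / norm for c in v]

def ogd_regret(grads, D, G):
    """Run OGD from Delta_1 = 0 with eta = D / (G sqrt(T)); return regret vs the best fixed u."""
    T, d = len(grads), len(grads[0])
    eta = D / (G * math.sqrt(T))
    delta_t = [0.0] * d
    loss = 0.0
    for g in grads:
        loss += sum(gi * di for gi, di in zip(g, delta_t))      # pay <g_t, Delta_t>
        delta_t = project_ball([di - eta * gi for di, gi in zip(delta_t, g)], D)
    g_sum = [sum(g[i] for g in grads) for i in range(d)]
    best = -D * math.sqrt(sum(c * c for c in g_sum))            # min over ||u|| <= D of <sum g, u>
    return loss - best

rng = random.Random(1)
D, G, T, K = 0.5, 1.0, 200, 5
blocks = []
for _ in range(K):
    block = []
    for _ in range(T):
        g = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        scale = max(1.0, math.sqrt(g[0] ** 2 + g[1] ** 2) / G)
        block.append([c / scale for c in g])                    # enforce ||g|| <= G
    blocks.append(block)
# Resetting OGD at the start of each block makes the K-shifting regret a sum of block regrets.
shifting_regret = sum(ogd_regret(block, D, G) for block in blocks)
```

Because each block is compared against its own best comparator, `shifting_regret` is the worst case of equation (1) for this sequence, and it stays below $KDG\sqrt{T}$.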

Corollary 9.

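Suppose we have a budget of $N$ gradient evaluations. Under the assumptions of Theorem 8, suppose in addition $\mathbb{E}[\|\mathbf{g}_n\|^2]\le G^2$ and that $\mathcal{A}$ guarantees $\|\boldsymbol{\Delta}_n\|\le D$ for some user-specified $D$ for all $n$ and ensures the worst-case $K$-shifting regret bound $\mathbb{E}[R_T(\mathbf{u}^1,\dots,\mathbf{u}^K)]\le DGK\sqrt{T}$ for all $\|\mathbf{u}^k\|\le D$ (e.g., as achieved by the OGD algorithm that is reset every $T$ iterations). Let $\delta>0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil\left(\frac{GN\delta}{F(\mathbf{x}_0)-F^\star}\right)^{2/3}\right\rceil, \frac{N}{2}\right)$, and $K = \left\lfloor\frac{N}{T}\right\rfloor$. Then, for all $k$ and $t$, we have $\|\overline{\mathbf{w}}^k - \mathbf{w}_t^k\|\le\delta$.

Moreover, we have the inequality

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)\right\|\right] \le \frac{2(F(\mathbf{x}_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(\mathbf{x}_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\ \frac{6G}{\sqrt{N}}\right),$$

which implies

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\|\nabla F(\overline{\mathbf{w}}^k)\|_\delta\right] \le \frac{2(F(\mathbf{x}_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(\mathbf{x}_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\ \frac{6G}{\sqrt{N}}\right).$$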

Before providing the proof, let us discuss the implications. Notice that if we select $\hat{\mathbf{w}}$ at random from $\{\overline{\mathbf{w}}^1,\dots,\overline{\mathbf{w}}^K\}$, then we clearly have $\mathbb{E}[\|\nabla F(\hat{\mathbf{w}})\|_\delta] = \mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\|\nabla F(\overline{\mathbf{w}}^k)\|_\delta\right]$. Therefore, the Corollary asserts that for a function $F$ with $F(\mathbf{x}_0) - \inf F(\mathbf{x}) \le \gamma$ with a stochastic gradient oracle whose second moment is bounded by $G^2$, a properly instantiated Algorithm 1 finds a $(\delta,\epsilon)$-stationary point in $N = O(G\gamma\epsilon^{-3}\delta^{-1})$ gradient evaluations. In Section 7, we will provide a lower bound showing that this rate is optimal essentially whenever $\delta G^2 \ge \epsilon\gamma$. Together, the Corollary and the lower bound provide a nearly complete characterization of the complexity of finding $(\delta,\epsilon)$-stationary points in the stochastic setting.
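The parameter choices in Corollary 9 are easy to tabulate for a given budget. The helper below is a hypothetical convenience function, not part of the paper; it assumes $N\ge 2$, uses integer division as a stand-in for $N/2$, and takes `gap` as any known upper bound on $F(\mathbf{x}_0)-F^\star$.

```python
import math

def corollary9_parameters(N, delta, G, gap):
    """Return (T, K, D, bound) for budget N, radius delta, gradient bound G, gap >= F(x0) - F*."""
    # N // 2 is an integer stand-in for the N/2 cap in the statement (assumption).
    T = min(math.ceil((G * N * delta / gap) ** (2.0 / 3.0)), N // 2)
    K = N // T                      # number of blocks, floor(N / T)
    D = delta / T                   # per-step radius, so each block has diameter at most delta
    bound = 2 * gap / (delta * N) + max(
        5 * G ** (2 / 3) * gap ** (1 / 3) / (N * delta) ** (1 / 3),
        6 * G / math.sqrt(N),
    )
    return T, K, D, bound

T, K, D, bound = corollary9_parameters(N=10**6, delta=0.1, G=1.0, gap=1.0)
```

With these illustrative inputs the block length is a few thousand steps, and `K * T` uses at least half of the budget, matching the inequality $KT \ge N - T \ge N/2$ used in the proof.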

It is also interesting to note that the bound does not appear to improve if the gradients are deterministic. Specifically, in the assumptions for Corollary 9, we could try to relax $\mathbb{E}[\|\mathbf{g}_t\|^2]\le G^2$ to $\|\nabla F(\mathbf{w}_t)\|\le G$ and $\mathrm{Var}(\mathbf{g}_t)\le\sigma^2$ for some $\sigma$. We might then hope to improve the bound as $\sigma\to 0$ by taking advantage of the $\sigma$-dependency in Theorem 8. However, it turns out that the $\sigma$-dependency in Corollary 9 is dominated by a dependency on $G$ coming from the regret bound of OGD. This highlights an interesting open question: is it actually possible to improve in the deterministic setting? It is conceivable that the answer is “no”: in the non-smooth convex optimization setting, it is well-known that the optimal rates for stochastic and deterministic optimization are the same (see, e.g., Bubeck (2015) for proofs of both upper and lower bounds).

Remark 10.

We conjecture that by employing martingale concentration, the above can be extended to identify a $\left(\delta,\ O\left(\frac{G^{2/3}(F(\mathbf{x}_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}\right)\right)$-stationary point with high probability, although we do not establish such a result here.

It is also interesting to explicitly write the update of the overall algorithm:

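$$\begin{aligned}
\mathbf{x}_n &= \mathbf{x}_{n-1} + \boldsymbol{\Delta}_n \\
\mathbf{g}_n &= \mathrm{Grad}(\mathbf{x}_n + (s_n - 1)\boldsymbol{\Delta}_n,\ \mathbf{z}_n) \\
\boldsymbol{\Delta}_{n+1} &= \mathrm{clip}_D(\boldsymbol{\Delta}_n - \eta\,\mathbf{g}_n)
\end{aligned}$$

where $\mathrm{clip}_D(\mathbf{x}) = \mathbf{x}\min\left(\frac{D}{\|\mathbf{x}\|}, 1\right)$. The sign in the last line matches the OGD update $\boldsymbol{\Delta}_n - \eta\,\mathbf{g}_n$ given earlier. In words, the update is reminiscent of the SGD update with momentum and clipping. The primary different element is the fact that the stochastic gradient is taken on a slightly perturbed $\mathbf{x}_n$.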
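Putting the pieces together, Algorithm 1 with OGD as the online learner fits in a short loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy objective $F(\mathbf{w})=\sum_i|w_i|$, the Gaussian gradient noise, the value of `G`, and all function names are illustrative, and the update uses the OGD sign $\boldsymbol{\Delta}_n - \eta\,\mathbf{g}_n$.

```python
import math
import random

def clip(v, D):
    """clip_D(v) = v * min(D / ||v||, 1)."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= D else [c * D / norm for c in v]

def online_to_nonconvex(grad, x0, K, T, D, eta, rng):
    """Algorithm 1 with OGD (reset every T rounds) as the online learner."""
    x = list(x0)
    blocks = []
    for _ in range(K):
        delta_n = [0.0] * len(x)                  # reset: Delta = 0 at the start of each block
        block = []
        for _ in range(T):
            x_prev = x
            x = [xi + di for xi, di in zip(x_prev, delta_n)]        # x_n = x_{n-1} + Delta_n
            s = rng.random()                                        # s_n ~ Uniform[0, 1]
            w = [xi + s * di for xi, di in zip(x_prev, delta_n)]    # w_n = x_{n-1} + s_n Delta_n
            g = grad(w, rng)                                        # g_n = Grad(w_n, z_n)
            delta_n = clip([di - eta * gi for di, gi in zip(delta_n, g)], D)
            block.append(w)
        blocks.append(block)
    # Block averages w_bar^k, the candidate (delta, epsilon)-stationary points.
    averages = [[sum(w[i] for w in block) / T for i in range(len(x0))] for block in blocks]
    return blocks, averages

def noisy_grad(w, rng):
    # Stochastic gradient of the toy objective F(w) = sum_i |w_i| (an illustrative choice).
    return [(1.0 if wi > 0 else -1.0) + rng.gauss(0.0, 0.1) for wi in w]

rng = random.Random(0)
K, T, delta = 20, 50, 0.5
D = delta / T
G = 2.0                                   # rough bound on ||g|| for this toy oracle (assumption)
eta = D / (G * math.sqrt(T))
blocks, averages = online_to_nonconvex(noisy_grad, [1.0, -2.0], K, T, D, eta, rng)
```

Since every step has norm at most $D=\delta/T$, each query point in a block lies within $\delta$ of that block's average, which is the geometric fact the proof of Corollary 9 relies on.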

Proof of Corollary 9.

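Since $\mathcal{A}$ guarantees $\|\boldsymbol{\Delta}_n\|\le D$, for all $n < n' \le n+T-1$, we have

$$\begin{aligned}
\|\mathbf{w}_n - \mathbf{w}_{n'}\| &= \|\mathbf{x}_n - (1-s_n)\boldsymbol{\Delta}_n - \mathbf{x}_{n'-1} - s_{n'}\boldsymbol{\Delta}_{n'}\| \\
&\le \left\|\sum_{i=n+1}^{n'-1}\boldsymbol{\Delta}_i\right\| + \|\boldsymbol{\Delta}_n\| + \|\boldsymbol{\Delta}_{n'}\| \\
&\le D\left((n'-1)-(n+1)+1\right) + 2D \le DT.
\end{aligned}$$

Therefore, we clearly have $\|\mathbf{w}_t^k - \overline{\mathbf{w}}^k\| \le DT = \delta$.

Note that from the choice of $K$ and $T$ we have $M = KT \ge N - T \ge N/2$. So, for the second fact, notice that $\mathrm{Var}(\mathbf{g}_n) \le \mathbb{E}[\|\mathbf{g}_n\|^2] \le G^2$ for all $n$. Thus, applying Theorem 8 in concert with the additional assumption $\mathbb{E}[R_T(\mathbf{u}^1,\dots,\mathbf{u}^K)]\le DGK\sqrt{T}$, we have:

$$\begin{aligned}
\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)\right\|\right]
&\le \frac{2(F(\mathbf{x}_0)-F^\star)}{DN} + \frac{2KDG\sqrt{T}}{DN} + \frac{G}{\sqrt{T}} \\
&\le \frac{2T(F(\mathbf{x}_0)-F^\star)}{\delta N} + \frac{3G}{\sqrt{T}} \\
&\le \max\left(\frac{5G^{2/3}(F(\mathbf{x}_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\ \frac{6G}{\sqrt{N}}\right) + \frac{2(F(\mathbf{x}_0)-F^\star)}{\delta N},
\end{aligned}$$

where the last inequality is due to the choice of $T$.

Finally, observe that $\|\mathbf{w}_t^k - \overline{\mathbf{w}}^k\|\le\delta$ for all $t$ and $k$, and also that $\overline{\mathbf{w}}^k = \frac{1}{T}\sum_{t=1}^{T}\mathbf{w}_t^k$. Therefore $S = \{\mathbf{w}_1^k,\dots,\mathbf{w}_T^k\}$ satisfies the conditions in the infimum in Definition 5 so that $\|\nabla F(\overline{\mathbf{w}}^k)\|_\delta \le \left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)\right\|$. ∎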

4 Bounds for the $L_1$ Norm

It is a well-known trick in the online learning literature that running a separate instance of an online learning algorithm on each coordinate of $\boldsymbol{\Delta}$ yields regret bounds with respect to $L_1$ norms of the linear costs (e.g., as in AdaGrad (Duchi et al., 2010; McMahan & Streeter, 2010)). For example, we can run the online gradient descent algorithm with a separate learning rate for each coordinate: $\Delta_{n+1,i} = \Pi_{[-D_\infty, D_\infty]}[\Delta_{n,i} - \eta_i g_{n,i}]$. The regret of this procedure is simply the sum of the regrets of each of the individual algorithms. In particular, if $\mathbb{E}[g_{n,i}^2]\le G_i^2$, then setting $\eta_i = \frac{D_\infty}{G_i\sqrt{T}}$ yields the regret bound $\mathbb{E}[R_T(\mathbf{u})] \le D_\infty\sqrt{T}\sum_{i=1}^{d}G_i$ for any $\mathbf{u}$ satisfying $\|\mathbf{u}\|_\infty\le D_\infty$. By employing such online algorithms with our online-to-non-convex conversion, we can obtain a guarantee on the $L_1$ norm of the gradients.
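The per-coordinate variant can be checked the same way as plain OGD. The sketch below is an illustration with assumed names and synthetic gradients; it enforces $|g_{t,i}|\le G_i$ deterministically, and the sum in the bound runs over the coordinates of the gradient.

```python
import math
import random

def coordinatewise_ogd_regret(grads, D_inf, G):
    """Per-coordinate OGD on [-D_inf, D_inf]; returns regret against the best u in the box."""
    T, d = len(grads), len(grads[0])
    eta = [D_inf / (G[i] * math.sqrt(T)) for i in range(d)]     # one step size per coordinate
    delta_t = [0.0] * d
    loss = 0.0
    for g in grads:
        loss += sum(gi * di for gi, di in zip(g, delta_t))
        delta_t = [max(-D_inf, min(D_inf, delta_t[i] - eta[i] * g[i])) for i in range(d)]
    g_sum = [sum(g[i] for g in grads) for i in range(d)]
    best = -D_inf * sum(abs(c) for c in g_sum)   # min over ||u||_inf <= D_inf of <sum g, u>
    return loss - best

rng = random.Random(2)
G = [1.0, 0.1, 0.01]                              # per-coordinate gradient scales
T, D_inf = 400, 0.2
grads = [[rng.uniform(-Gi, Gi) for Gi in G] for _ in range(T)]
regret = coordinatewise_ogd_regret(grads, D_inf, G)
l1_bound = D_inf * math.sqrt(T) * sum(G)
```

The comparator minimizes over an $L_\infty$ box, so the best loss is $-D_\infty$ times the $L_1$ norm of the summed gradients, which is where the $L_1$ guarantee comes from.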

Definition 11.

A point $\mathbf{x}$ is a $(\delta,\epsilon)$-stationary point with respect to the $L_1$ norm of an almost-everywhere differentiable function $F$ if there exists a finite subset $S$ of the $L_\infty$ ball of radius $\delta$ centered at $\mathbf{x}$ such that if $\mathbf{y}$ is selected uniformly at random from $S$, $\mathbb{E}[\mathbf{y}] = \mathbf{x}$ and $\|\mathbb{E}[\nabla F(\mathbf{y})]\|_1 \le \epsilon$.

As a counterpart to this definition, we define:

Definition 12.

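Given a point $\mathbf{x}$, a number $\delta > 0$ and an almost-everywhere differentiable function $F$, define

$$\|\nabla F(\mathbf{x})\|_{1,\delta} \triangleq \inf_{\substack{S\subset B_\infty(\mathbf{x},\delta),\\ \frac{1}{|S|}\sum_{\mathbf{y}\in S}\mathbf{y} = \mathbf{x}}} \left\| \frac{1}{|S|}\sum_{\mathbf{y}\in S}\nabla F(\mathbf{y}) \right\|_1.$$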

We now can state a theorem similar to Corollary 9. Given that the proof is also very similar, we defer it to Appendix G.

Theorem 13.

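Suppose we have a budget of $N$ gradient evaluations. Assume $F:\mathbb{R}^d\to\mathbb{R}$ is well-behaved. With the notation in Algorithm 1, set $s_n$ to be a random variable sampled uniformly from $[0,1]$. Set $T,K\in\mathbb{N}$ and $M = KT$. Assume that $\mathbb{E}[g_{n,i}^2]\le G_i^2$ for $i = 1,\dots,d$ for all $n$. Assume that $\mathcal{A}$ guarantees $\|\boldsymbol{\Delta}_n\|_\infty\le D_\infty$ for some user-specified $D_\infty$ for all $n$ and ensures the $K$-shifting regret bound $\mathbb{E}[R_T(\mathbf{u}^1,\dots,\mathbf{u}^K)] \le D_\infty K\sqrt{T}\sum_{i=1}^{d}G_i$ for all $\|\mathbf{u}^k\|_\infty\le D_\infty$. Let $\delta>0$ be an arbitrary number. Set $D_\infty = \delta/T$, $T = \min\left(\left\lceil\left(\frac{N\delta\sum_{i=1}^{d}G_i}{F(\mathbf{x}_0)-F^\star}\right)^{2/3}\right\rceil, \frac{N}{2}\right)$, and $K = \left\lfloor\frac{N}{T}\right\rfloor$. Then we have:

$$\frac{1}{K}\sum_{k=1}^{K}\|\nabla F(\overline{\mathbf{w}}^k)\|_{1,\delta} \le \frac{2(F(\mathbf{x}_0)-F^\star)}{\delta N} + \max\left(\frac{5\left(\sum_{i=1}^{d}G_i\right)^{2/3}(F(\mathbf{x}_0)-F^\star)^{1/3}}{(N\delta)^{1/3}},\ \frac{6\sum_{i=1}^{d}G_i}{\sqrt{N}}\right).$$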

Let’s compare this result with Corollary 9. For a fair comparison, we set $G_i$ and $G$ such that $\sum_{i=1}^{d}G_i^2 = G^2$. Then, we can lower bound $\|\cdot\|_\delta$ with $\frac{1}{\sqrt{d}}\|\cdot\|_{1,\delta}$. Hence, under the assumption $\mathbb{E}[\|\mathbf{g}_n\|^2] = \sum_{i=1}^{d}\mathbb{E}[g_{n,i}^2] \le \sum_{i=1}^{d}G_i^2 = G^2$, Corollary 9 implies $\frac{1}{K}\sum_{k=1}^{K}\|\nabla F(\overline{\mathbf{w}}^k)\|_{1,\delta} \le O\left(\frac{G^{2/3}\sqrt{d}}{(N\delta)^{1/3}}\right)$.

Now, let us see what would happen if we instead employed Theorem 13. First, observe that $\sum_{i=1}^{d}G_i \le \sqrt{d}\sqrt{\sum_{i=1}^{d}G_i^2} \le \sqrt{d}\,G$. Substituting this expression into Theorem 13 now gives an upper bound on $\|\cdot\|_{1,\delta}$ that is $O\left(\frac{d^{1/3}G^{2/3}}{(N\delta)^{1/3}}\right)$, which is better than the one we could obtain from Corollary 9 under the same assumptions.

5 From Non-smooth to Smooth Guarantees

Let us now see what our results imply for smooth objectives. The following two propositions show that for smooth $F$, a $(\delta,\epsilon)$-stationary point is automatically a $(0,\epsilon')$-stationary point for some appropriate $\epsilon'$. The proofs are in Appendix E.

Proposition 14.

Suppose that $F$ is $H$-smooth (that is, $\nabla F$ is $H$-Lipschitz) and $\mathbf{x}$ also satisfies $\|\nabla F(\mathbf{x})\|_\delta \le \epsilon$. Then, $\|\nabla F(\mathbf{x})\| \le \epsilon + H\delta$.

Proposition 15.

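Suppose that $F$ is $J$-second-order-smooth (that is, $\|\nabla^2 F(\mathbf{x}) - \nabla^2 F(\mathbf{y})\|_{\mathrm{op}} \le J\|\mathbf{x}-\mathbf{y}\|$ for all $\mathbf{x}$ and $\mathbf{y}$). Suppose also that $\mathbf{x}$ satisfies $\|\nabla F(\mathbf{x})\|_\delta \le \epsilon$. Then, $\|\nabla F(\mathbf{x})\| \le \epsilon + \frac{J}{2}\delta^2$.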

Now, recall that Corollary 9 shows that we can find a $(\delta,\epsilon)$ stationary point in $O(\epsilon^{-3}\delta^{-1})$ iterations. Thus, Proposition 14 implies that by setting $\delta = \epsilon/H$, we can find a $(0,\epsilon)$-stationary point of an $H$-smooth objective $F$ in $O(\epsilon^{-4})$ iterations, which matches the (optimal) guarantee of standard SGD (Ghadimi & Lan, 2013; Arjevani et al., 2019). Further, Proposition 15 shows that by setting $\delta = \sqrt{\epsilon/J}$, we can find a $(0,\epsilon)$-stationary point of a $J$-second-order-smooth objective in $O(\epsilon^{-3.5})$ iterations. This matches the performance of more refined SGD variants and is also known to be tight (Fang et al., 2019; Cutkosky & Mehta, 2020; Arjevani et al., 2020). In summary: the online-to-non-convex conversion also recovers the optimal results for smooth stochastic losses.
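The two substitutions can be written out as a short worked calculation. This is a sketch that keeps only the dependence on $\epsilon$, $H$ and $J$ and treats the $O(\epsilon^{-3}\delta^{-1})$ rate of Corollary 9 as given.

```latex
\begin{align*}
\delta = \epsilon/H:\quad
  &\|\nabla F(\mathbf{x})\| \le \epsilon + H\delta = 2\epsilon,
  &\epsilon^{-3}\delta^{-1} &= H\,\epsilon^{-4},\\
\delta = \sqrt{\epsilon/J}:\quad
  &\|\nabla F(\mathbf{x})\| \le \epsilon + \tfrac{J}{2}\delta^{2} = \tfrac{3}{2}\epsilon,
  &\epsilon^{-3}\delta^{-1} &= \sqrt{J}\,\epsilon^{-7/2}.
\end{align*}
```

In both cases the resulting point is stationary up to a constant multiple of $\epsilon$, which is absorbed into the $O(\cdot)$ notation.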

6 Deterministic and Smooth Case

We will now consider the case in which the oracle is non-stochastic (that is, $\text{Grad}(\mathbf{x},\mathbf{z}) = \nabla F(\mathbf{x})$ for all $\mathbf{z}$, $\mathbf{x}$) and $F$ is $H$-smooth (i.e. $\nabla F$ is $H$-Lipschitz). We will show that optimistic online algorithms (Rakhlin & Sridharan, 2013; Hazan & Kale, 2010) achieve rates matching the optimal deterministic results. In particular, we consider online algorithms that ensure static regret:

$$R_T(\mathbf{u}) \le O\left(D\sqrt{\sum_{t=1}^{T}\|\mathbf{h}_t-\mathbf{g}_t\|^2}\right), \tag{2}$$

for some “hint” vectors $\mathbf{h}_t$. In Appendix B, we provide an explicit construction of such an algorithm for completeness. The standard setting for the hints is $\mathbf{h}_t = \mathbf{g}_{t-1}$. As explained in Section 2.2, to obtain a $K$-shifting regret it will be enough to reset the algorithm every $T$ iterations.

Theorem 16.

Suppose we have a budget of $N$ gradient evaluations and an online algorithm $\mathcal{A}_{\text{static}}$ that guarantees $\|\boldsymbol{\Delta}_n\| \le D$ for all $n$ and ensures the optimistic regret bound $R_T(\mathbf{u}) \le CD\sqrt{\sum_{t=1}^{T}\|\mathbf{g}_t-\mathbf{g}_{t-1}\|^2}$ for some constant $C$, where we define $\mathbf{g}_0 = \mathbf{0}$. In Algorithm 1, set $\mathcal{A}$ to be $\mathcal{A}_{\text{static}}$ that is reset every $T$ rounds. Let $\delta > 0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil \frac{(C\delta^2 H N)^{2/5}}{(F(\mathbf{x}_0)-F^\star)^{2/5}} \right\rceil, \frac{N}{2}\right)$, and $K = \lfloor N/T \rfloor$. Finally, suppose that $F$ is $H$-smooth and that the gradient oracle is deterministic (that is, $\mathbf{g}_n = \nabla F(\mathbf{w}_n)$). Then we have:

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\|\nabla F(\overline{\mathbf{w}}^k)\|_\delta\right] \le \frac{2CG_1}{N} + \frac{2(F(\mathbf{x}_0)-F^\star)}{\delta N} + \max\left(\frac{6(CH)^{2/5}(F(\mathbf{x}_0)-F^\star)^{3/5}}{\delta^{1/5}N^{3/5}},\ \frac{17C\delta H}{N^{3/2}}\right).$$

Note that the expectation here encompasses only the randomness in the choice of $s_t^k$, because the gradient oracle is assumed to be deterministic. Theorem 16 finds a $(\delta,\epsilon)$ stationary point in $O(\epsilon^{-5/3}\delta^{-1/3})$ iterations. Thus, by setting $\delta = \epsilon/H$, Proposition 14 shows we can find a $(0,\epsilon)$ stationary point in $O(\epsilon^{-2})$ iterations, which matches the standard optimal rate (Carmon et al., 2021).

Proof.

First, observe that for all 𝑘 , 𝑡 , ‖ 𝐰 ¯ 𝑘 − 𝐰 𝑡 𝑘 ‖ ≤ 𝛿 . This holds for precisely the same reason that it holds in Corollary 9.

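Next, observe that for $k = 1$ we have

$$\begin{aligned}
R_T(\mathbf{u}^k) &\le CD\sqrt{\sum_{t=1}^{T}\|\mathbf{g}_t^k-\mathbf{g}_{t-1}^k\|^2}\\
&\le CD\sqrt{G_1^2+\sum_{t=2}^{T}\|\nabla F(\mathbf{w}_t^k)-\nabla F(\mathbf{w}_{t-1}^k)\|^2}\\
&\le CD\sqrt{G_1^2+\sum_{t=2}^{T}H^2\|\mathbf{w}_t^k-\mathbf{w}_{t-1}^k\|^2}\\
&\le CD\sqrt{G_1^2+4H^2TD^2} \le CDG_1+2CD^2H\sqrt{T}.
\end{aligned}$$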

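Similarly, for $k > 1$, we observe that

$$\begin{aligned}
\sum_{t=1}^{T}\|\mathbf{g}_t^k-\mathbf{g}_{t-1}^k\|^2 &\le \sum_{t=2}^{T}\|\nabla F(\mathbf{w}_t^k)-\nabla F(\mathbf{w}_{t-1}^k)\|^2+\|\nabla F(\mathbf{w}_1^k)-\nabla F(\mathbf{w}_T^{k-1})\|^2\\
&\le H^2\left(\|\mathbf{w}_1^k-\mathbf{w}_T^{k-1}\|^2+\sum_{t=2}^{T}\|\mathbf{w}_t^k-\mathbf{w}_{t-1}^k\|^2\right)\\
&\le 4TH^2D^2.
\end{aligned}$$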

Thus, we have

$$R_T(\mathbf{u}^k) \le CD\sqrt{\sum_{t=1}^{T}\|\mathbf{g}_t^k-\mathbf{g}_{t-1}^k\|^2} \le CD\sqrt{4H^2TD^2} \le 2CD^2H\sqrt{T}.$$

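Now, applying Theorem 8 in concert with the above bounds on $R_T(\mathbf{u}^k)$, we have

$$\begin{aligned}
\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)\right\|\right] &\le \frac{2(F(\mathbf{x}_0)-F^\star)}{DN}+\frac{2CDG_1+4CKD^2H\sqrt{T}}{DN}\\
&\le \frac{2T(F(\mathbf{x}_0)-F^\star)}{\delta N}+\frac{2CG_1}{N}+\frac{4C\delta H}{T^{3/2}}\\
&\le \max\left(\frac{6(CH)^{2/5}(F(\mathbf{x}_0)-F^\star)^{3/5}}{\delta^{1/5}N^{3/5}},\ \frac{17C\delta H}{N^{3/2}}\right)+\frac{2CG_1}{N}+\frac{2(F(\mathbf{x}_0)-F^\star)}{\delta N}.
\end{aligned}$$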

Recalling that ‖ 𝐰 𝑡 𝑘 − 𝐰 ¯ 𝑘 ‖ ≤ 𝛿 , the conclusion follows. ∎
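The last inequality of the proof packs a case analysis on $T$ into one line. The following is a sketch of how the constants 6 and 17 can arise from the stated choice of $T$; the shorthands $\Phi$ and $X$ are introduced here for illustration and are not notation from the paper, and the second case assumes $N/2 \le X$, ignoring the rounding in $\lceil X \rceil$.

```latex
% Shorthand (assumed here): \Phi = F(x_0) - F^\star,  X = (C \delta^2 H N / \Phi)^{2/5},
% so that T = \min(\lceil X \rceil, N/2).
\begin{align*}
\textbf{Case } T = \lceil X \rceil \le X + 1:\quad
  \frac{2T\Phi}{\delta N} &\le \frac{2X\Phi}{\delta N} + \frac{2\Phi}{\delta N}
   = \frac{2(CH)^{2/5}\Phi^{3/5}}{\delta^{1/5}N^{3/5}} + \frac{2\Phi}{\delta N},\\
  \frac{4C\delta H}{T^{3/2}} &\le \frac{4C\delta H}{X^{3/2}}
   = \frac{4(CH)^{2/5}\Phi^{3/5}}{\delta^{1/5}N^{3/5}},
  \qquad 2 + 4 = 6.\\[4pt]
\textbf{Case } T = N/2 \le X:\quad
  \frac{2T\Phi}{\delta N} &= \frac{\Phi}{\delta} \le \frac{2^{5/2}\,C\delta H}{N^{3/2}},
  \qquad
  \frac{4C\delta H}{T^{3/2}} = \frac{2^{7/2}\,C\delta H}{N^{3/2}},\\
  2^{5/2} + 2^{7/2} &\approx 16.97 \le 17.
\end{align*}
```

In the first case the leftover term $2\Phi/(\delta N)$ produced by the ceiling is exactly the additive term that appears outside the maximum in the theorem statement.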

6.1 Better Results with Second-Order Smoothness

When $F$ is $J$-second-order-smooth (i.e., $\nabla^2 F$ is $J$-Lipschitz) we can do even better. First, observe that if $F$ is $J$-second-order-smooth, then by Proposition 15 the $O(\epsilon^{-5/3}\delta^{-1/3})$ iteration complexity of Theorem 16 implies an $O(\epsilon^{-11/6})$ iteration complexity for finding $(0,\epsilon)$ stationary points by setting $\delta = \sqrt{\epsilon/J}$. This already improves upon the $O(\epsilon^{-2})$ result for smooth losses, but we can improve still further. The key idea is to generate more informative hints $\mathbf{h}_t$. If we can make $\mathbf{h}_t \approx \mathbf{g}_t$, then by (2), we can achieve smaller regret and so a better guarantee.

To do so, we abandon randomization: instead of choosing $s_n$ randomly, we just set $s_n = 1/2$. This setting still allows $F(\mathbf{x}_n) \approx F(\mathbf{x}_{n-1}) + \langle \mathbf{g}_n, \boldsymbol{\Delta}_n\rangle$ with very little error when $F$ is second-order-smooth. By inspecting the optimistic mirror descent update formula, we can identify an $\mathbf{h}_t$ with $\|\mathbf{h}_t-\mathbf{g}_t\| \le O(1/N)$ using $O(\log(N))$ gradient queries. This more advanced online learning algorithm is presented in Algorithm 2 (full analysis in Appendix C).

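Overall, Algorithm 2’s update has an “implicit” flavor:

$$\boldsymbol{\Delta}_n = \Pi_{\|\boldsymbol{\Delta}\|\le D}\left[\boldsymbol{\Delta}_{n-1}-\frac{\mathbf{g}_n}{2H}\right], \qquad \mathbf{g}_n = \nabla F(\mathbf{x}_{n-1}+\boldsymbol{\Delta}_n/2).$$
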
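Algorithm 2 Optimistic Mirror Descent with Careful Hints

Input: Learning rate $\eta$, number $Q$ ($Q$ will be $O(\log N)$), function $F$, horizon $T$, radius $D$
Receive initial iterate $\mathbf{x}_0$
Set $\boldsymbol{\Delta}'_1 = \mathbf{0}$
for $t = 1 \dots T$ do
  Set $\mathbf{h}_t^0 = \nabla F(\mathbf{x}_{t-1})$
  for $i = 1 \dots Q$ do
    Set $\mathbf{h}_t^i = \nabla F\left(\mathbf{x}_{t-1}+\frac{1}{2}\Pi_{\|\boldsymbol{\Delta}\|\le D}\left[\boldsymbol{\Delta}'_t-\eta\mathbf{h}_t^{i-1}\right]\right)$
  end for
  Set $\mathbf{h}_t = \mathbf{h}_t^Q$
  Output $\boldsymbol{\Delta}_t = \Pi_{\|\boldsymbol{\Delta}\|\le D}\left[\boldsymbol{\Delta}'_t-\eta\mathbf{h}_t\right]$
  Receive $t$th gradient $\mathbf{g}_t$
  Set $\boldsymbol{\Delta}'_{t+1} = \Pi_{\|\boldsymbol{\Delta}\|\le D}\left[\boldsymbol{\Delta}'_t-\eta\mathbf{g}_t\right]$
end for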
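The inner loop of Algorithm 2 can be read as a fixed-point iteration for the hint. The sketch below illustrates this on a hypothetical toy quadratic (the objective, the starting vectors and the helper names `project_ball`, `careful_hint` and `toy_grad` are all assumptions made for illustration, not part of the paper). With $\eta = 1/(2H)$ the map being iterated is a contraction with factor at most $H \cdot \frac{1}{2} \cdot \eta = \frac{1}{4}$, so a logarithmic number of iterations already makes the hint nearly equal to the gradient that will be observed.

```python
import math


def project_ball(v, radius):
    """Euclidean projection of the vector v onto the L2 ball of the given radius."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= radius else [c * radius / norm for c in v]


def careful_hint(grad, x_prev, delta_prime, eta, radius, num_iters):
    """Inner loop of Algorithm 2: h <- grad(x_prev + 0.5 * Proj(delta_prime - eta * h))."""
    h = grad(x_prev)  # h_t^0
    for _ in range(num_iters):
        delta = project_ball([d - eta * hi for d, hi in zip(delta_prime, h)], radius)
        h = grad([x + 0.5 * s for x, s in zip(x_prev, delta)])
    return h


# Hypothetical toy objective F(x) = 0.5 * sum(a_i * x_i^2); it is H-smooth with H = max(a_i).
CURVATURES = [2.0, 0.5, 1.0]
H = max(CURVATURES)


def toy_grad(x):
    return [a * c for a, c in zip(CURVATURES, x)]


x_prev = [1.0, -2.0, 0.5]       # plays the role of x_{t-1}
delta_prime = [0.3, -0.1, 0.2]  # plays the role of Delta'_t
eta, radius = 1.0 / (2.0 * H), 1.0

hint = careful_hint(toy_grad, x_prev, delta_prime, eta, radius, num_iters=20)
# Output Delta_t, then the gradient g_t observed at the midpoint x_{t-1} + Delta_t / 2.
delta = project_ball([d - eta * h for d, h in zip(delta_prime, hint)], radius)
g = toy_grad([x + 0.5 * s for x, s in zip(x_prev, delta)])
hint_error = math.sqrt(sum((a - b) ** 2 for a, b in zip(hint, g)))
print(hint_error)
```

Because $\mathbf{g}_t$ is obtained by applying the same map once more to $\mathbf{h}_t$, the printed error is the distance between two consecutive iterates, which shrinks geometrically in the number of inner iterations.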

With this refined online algorithm, we can show the following convergence guarantee, whose proof is in Appendix D.

Theorem 17.

In Algorithm 1, assume that $\mathbf{g}_n = \nabla F(\mathbf{w}_n)$, and set $s_n = \frac{1}{2}$. Use Algorithm 2 restarted every $T$ rounds as $\mathcal{A}$. Let $\delta > 0$ be an arbitrary number. Set $T = \min\left(\left\lceil \frac{(\delta^2(H+J\delta)N)^{1/3}}{(F(\mathbf{x}_0)-F^\star)^{1/3}} \right\rceil, N/2\right)$ and $K = \lfloor N/T \rfloor$. In Algorithm 2, set $\eta = 1/(2H)$, $D = \delta/T$, and $Q = \lceil \log_2(NG/(HD)) \rceil$. Finally, suppose that $F$ is $J$-second-order-smooth. Then, the following facts hold:

1.

For all 𝑘 , 𝑡 , ‖ 𝐰 ¯ 𝑘 − 𝐰 𝑡 𝑘 ‖ ≤ 𝛿 .

2.

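We have the inequality

$$\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla F(\mathbf{w}_t^k)\right\| \le \frac{4G}{N}+\frac{2(F(\mathbf{x}_0)-F^\star)}{N\delta}+\frac{3(H+J\delta)^{1/3}(F(\mathbf{x}_0)-F^\star)^{2/3}}{\delta^{1/3}N^{2/3}}+\frac{10\delta(H+J\delta)}{N^2}.$$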

3.

With $\delta = \frac{H^{1/7}(F(\mathbf{x}_0)-F(\mathbf{x}_N))^{2/7}}{J^{3/7}N^{2/7}}$, we have

$$\frac{1}{K}\sum_{k=1}^{K}\|\nabla F(\overline{\mathbf{w}}^k)\| \le O\left(\frac{J^{1/7}H^{2/7}(F(\mathbf{x}_0)-F^\star)^{4/7}}{N^{4/7}}\right).$$

Moreover, the total number of gradient queries consumed is $NQ = O(N\log(N))$.

This result finds a ( 𝛿 , 𝜖 ) stationary point in 𝑂 ~ ( 𝜖 − 3 / 2 𝛿 − 1 / 2 ) iterations. Via Proposition 15, this translates to 𝑂 ~ ( 𝜖 − 7 / 4 ) iterations for finding a ( 0 , 𝜖 ) stationary point, matching the best known rate (up to a logarithmic factor) (Carmon et al., 2017). Note that this may not be optimal: the best lower bound is Ω ( 𝜖 − 12 / 7 ) (Carmon et al., 2021). Intriguingly, our technique seems distinct from previous work, which usually relies on acceleration and detecting or exploiting negative curvature (Carmon et al., 2017; Agarwal et al., 2016; Carmon et al., 2018; Li & Lin, 2022).
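The exponents quoted in this section all come from substituting $\delta \propto \epsilon^{p}$ into a rate of the form $\epsilon^{a}\delta^{b}$. A small helper makes the arithmetic easy to recheck; the function name `stationary_exponent` is an illustrative assumption, and constants and logarithmic factors are ignored.

```python
from fractions import Fraction


def stationary_exponent(eps_exp, delta_exp, delta_power):
    """Exponent a with eps**eps_exp * delta**delta_exp = eps**a when delta is proportional to eps**delta_power."""
    return Fraction(eps_exp) + Fraction(delta_exp) * Fraction(delta_power)


# Theorem 16 rate eps^(-5/3) * delta^(-1/3) with delta = eps / H (Proposition 14).
smooth = stationary_exponent(Fraction(-5, 3), Fraction(-1, 3), 1)
# Same rate with delta = sqrt(eps / J) (Proposition 15).
second_order_via_thm16 = stationary_exponent(Fraction(-5, 3), Fraction(-1, 3), Fraction(1, 2))
# Theorem 17 rate eps^(-3/2) * delta^(-1/2) with delta = sqrt(eps / J).
second_order_via_thm17 = stationary_exponent(Fraction(-3, 2), Fraction(-1, 2), Fraction(1, 2))

print(smooth, second_order_via_thm16, second_order_via_thm17)  # prints: -2 -11/6 -7/4
```

The three values reproduce the $\epsilon^{-2}$, $\epsilon^{-11/6}$ and $\epsilon^{-7/4}$ rates discussed above.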

7 Lower Bounds

In this section, we show that the $O(\epsilon^{-3}\delta^{-1})$ complexity achieved in Corollary 9 is tight. We do this by a simple extension of the lower bound for stochastic smooth non-convex optimization of Arjevani et al. (2019). We provide an informal statement and proof sketch below. The formal result (Theorem 28) and its proof are provided in Appendix F.

Theorem 18 (informal).

There is a universal constant $C$ such that for any $\delta$, $\epsilon$, $\gamma$ and $G \ge C\sqrt{\epsilon\gamma/\delta}$, for any first-order algorithm $\mathcal{A}$, there is a $G$-Lipschitz, $C^\infty$ function $F:\mathbb{R}^d\to\mathbb{R}$ for some $d$ with $F(0)-\inf_{\mathbf{x}}F(\mathbf{x}) \le \gamma$ and a stochastic first-order gradient oracle for $F$ whose outputs $\mathbf{g}$ satisfy $\mathbb{E}[\|\mathbf{g}\|^2] \le G^2$ such that $\mathcal{A}$ requires $\Omega(G^2\gamma/(\delta\epsilon^3))$ stochastic oracle queries to identify a point $\mathbf{x}$ with $\mathbb{E}[\|\nabla F(\mathbf{x})\|_\delta] \le \epsilon$.

Proof sketch.

The construction of Arjevani et al. (2019) provides, for any $\sigma$, a function $F$ and stochastic oracle whose outputs have variance at most $\sigma^2$ such that $F$ is $H$-smooth, $O(\sqrt{H\gamma})$-Lipschitz and $\mathcal{A}$ requires $\Omega(\sigma^2H\gamma/\epsilon^4)$ oracle queries to find a point $\mathbf{x}$ with $\|\nabla F(\mathbf{x})\| \le 2\epsilon$. By setting $H = \frac{\epsilon}{\delta}$ and $\sigma = G/\sqrt{2}$, this becomes an $O(\sqrt{\epsilon\gamma/\delta})$-Lipschitz function, and so is at most $G/\sqrt{2}$-Lipschitz. Thus, the second moment of the gradient oracle is at most $G^2/2+G^2/2 = G^2$. Further, the algorithm requires $\Omega(G^2\gamma/(\delta\epsilon^3))$ queries to find a point $\mathbf{x}$ with $\|\nabla F(\mathbf{x})\| \le 2\epsilon$. Now, if $\|\nabla F(\mathbf{x})\|_\delta \le \epsilon$, then since $F$ is $H = \frac{\epsilon}{\delta}$-smooth, by Proposition 14, $\|\nabla F(\mathbf{x})\| \le \epsilon+\delta H = 2\epsilon$. Thus, we see that we need $\Omega(G^2\gamma/(\delta\epsilon^3))$ queries to find a point with $\|\nabla F(\mathbf{x})\|_\delta \le \epsilon$ as desired. ∎
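The parameter substitution in the proof sketch can be checked line by line. The calculation below is a sketch that assumes the readings $\sigma = G/\sqrt{2}$ and an $O(\sqrt{H\gamma})$ Lipschitz constant; the symbol $c$ for the hidden constant is introduced here for illustration only.

```latex
\begin{align*}
\text{query complexity:}\quad
  \frac{\sigma^2 H \gamma}{\epsilon^4}\bigg|_{H=\epsilon/\delta,\ \sigma^2=G^2/2}
   &= \frac{G^2}{2}\cdot\frac{\epsilon}{\delta}\cdot\frac{\gamma}{\epsilon^4}
    = \frac{G^2\gamma}{2\,\delta\,\epsilon^3},\\
\text{Lipschitz constant:}\quad
  c\sqrt{H\gamma} = c\sqrt{\frac{\epsilon\gamma}{\delta}} &\le \frac{G}{\sqrt{2}}
  \iff G \ge \sqrt{2}\,c\,\sqrt{\frac{\epsilon\gamma}{\delta}},\\
\text{second moment:}\quad
  \mathbb{E}\|\mathbf{g}\|^2 = \|\nabla F(\mathbf{x})\|^2 + \mathbb{E}\|\mathbf{g}-\nabla F(\mathbf{x})\|^2
   &\le \frac{G^2}{2} + \frac{G^2}{2} = G^2.
\end{align*}
```

The second line shows where the condition $G \ge C\sqrt{\epsilon\gamma/\delta}$ in the theorem comes from, with $C = \sqrt{2}\,c$ under these assumptions; the third line uses that the oracle is unbiased.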

8 Conclusion

We have presented a new online-to-non-convex conversion technique that applies online learning algorithms to non-convex and non-smooth stochastic optimization. When used with online gradient descent, this achieves the optimal 𝜖 − 3 𝛿 − 1 complexity for finding ( 𝛿 , 𝜖 ) stationary points.

These results suggest new directions for work in online learning. Much past work is motivated by the online-to-batch conversion relating static regret to convex optimization. We employ switching regret for non-convex optimization. More refined analysis may be possible via generalizations such as strongly adaptive or dynamic regret (Daniely et al., 2015; Jun et al., 2017; Zhang et al., 2018; Jacobsen & Cutkosky, 2022; Cutkosky, 2020; Lu et al., 2022; Luo et al., 2022; Zhang et al., 2021; Baby & Wang, 2022; Zhang et al., 2022). Moreover, our analysis assumes perfect tuning of constants (e.g., 𝐷 , 𝑇 , 𝐾 ) for simplicity. In practice, we would prefer to adapt to unknown parameters, motivating new applications and problems for adaptive online learning, which is already an area of active current investigation (see, e.g., Orabona & Pál, 2015; Hoeven et al., 2018; Cutkosky & Orabona, 2018; Cutkosky, 2019; Mhammedi & Koolen, 2020; Chen et al., 2021; Sachs et al., 2022; Zhang & Cutkosky, 2022; Wang et al., 2022). We hope that this expertise can be applied in the non-convex setting as well.

Finally, our results leave an important question unanswered: the current best-known algorithm for deterministic non-smooth optimization still requires $O(\epsilon^{-3}\delta^{-1})$ iterations to find a $(\delta,\epsilon)$-stationary point (Zhang et al., 2020b). We achieve this same result even in the stochastic case. Thus it is natural to wonder if the deterministic rate is tight. For example, is the $O(\epsilon^{-3/2}\delta^{-1/2})$ complexity we achieve in the smooth setting also achievable in the non-smooth setting? Intriguingly, prior work (Kornowski & Shamir, 2022a; Jordan et al., 2022) shows that randomization is necessary, even if the gradient oracle itself is deterministic.

Acknowledgements

The authors would like to thank Zijian Liu for identifying an error in the original proof of Theorem 18.

Ashok Cutkosky is supported by the National Science Foundation grant CCF-2211718 as well as a Google gift. Francesco Orabona is supported by the National Science Foundation under the grants no. 2022446 “Foundations of Data Science Institute” and no. 2046096 “CAREER: Parameter-free Optimization Algorithms for Machine Learning”.

References

Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.

Allen-Zhu, Z. Natasha 2: Faster non-convex optimization than SGD. In Advances in neural information processing systems, pp. 2675–2686, 2018.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Sekhari, A., and Sridharan, K. Second-order information in non-convex stochastic optimization: Power and limitations. In Conference on Learning Theory, pp. 242–299, 2020.

Baby, D. and Wang, Y.-X. Optimal dynamic regret in proper online learning with strongly convex losses and beyond. In International Conference on Artificial Intelligence and Statistics, pp. 1805–1845. PMLR, 2022.

Bertsekas, D. P. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12(2):218–231, 1973.

Bianchi, P., Hachem, W., and Schechtman, S. Convergence of constant step stochastic gradient descent for non-smooth non-convex functions. Set-Valued and Variational Analysis, 30(3):1117–1147, 2022.

Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. “convex until proven guilty”: Dimension-free acceleration of gradient descent on non-convex functions. In International Conference on Machine Learning, pp. 654–663. PMLR, 2017.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points I. Mathematical Programming, pp. 1–50, 2019.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points II: first-order methods. Mathematical Programming, 185(1-2), 2021.

Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press, 2006.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning algorithms. Information Theory, IEEE Transactions on, 50(9):2050–2057, 2004.

Chen, L., Luo, H., and Wei, C.-Y. Impossible tuning made possible: A new expert algorithm and its applications. In Conference on Learning Theory, pp. 1216–1259. PMLR, 2021.

Chen, L., Xu, J., and Luo, L. Faster gradient-free algorithms for nonsmooth nonconvex stochastic optimization. 2023.

Clarke, F. H. Optimization and nonsmooth analysis. SIAM, 1990.

Cutkosky, A. Combining online learning guarantees. In Proceedings of the Thirty-Second Conference on Learning Theory, pp. 895–913, 2019.

Cutkosky, A. Parameter-free, dynamic, and strongly-adaptive online learning. In International Conference on Machine Learning, volume 2, 2020.

Cutkosky, A. and Mehta, H. Momentum improves normalized SGD. In International Conference on Machine Learning, 2020.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach spaces. In Conference On Learning Theory, pp. 1493–1529, 2018.

Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, pp. 15210–15219, 2019.

Daniely, A., Gonen, A., and Shalev-Shwartz, S. Strongly adaptive online learning. In International Conference on Machine Learning, pp. 1405–1411. PMLR, 2015.

Davis, D., Drusvyatskiy, D., Lee, Y. T., Padmanabhan, S., and Ye, G. A gradient sampling method with complexity guarantees for lipschitz functions in high and low dimensions. arXiv preprint arXiv:2112.06969, 2021.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), pp. 257–269, 2010.

Duchi, J. C., Bartlett, P. L., and Wainwright, M. J. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689–699, 2018.

Fang, C., Lin, Z., and Zhang, T. Sharp analysis for nonconvex sgd escaping from saddle points. In Conference on Learning Theory, pp. 1192–1234, 2019.

Faw, M., Tziotis, I., Caramanis, C., Mokhtari, A., Shakkottai, S., and Ward, R. The power of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance. In Conference on Learning Theory, pp. 313–355. PMLR, 2022.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 385–394, 2005.

Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Ghai, U., Lu, Z., and Hazan, E. Non-convex online learning via algorithmic equivalence. arXiv preprint arXiv:2205.15235, 2022.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Hazan, E. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.

Hazan, E. and Kale, S. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine learning, 80(2-3):165–188, 2010.

Hazan, E., Singh, K., and Zhang, C. Efficient regret minimization in non-convex games. In International Conference on Machine Learning, pp. 1433–1441. PMLR, 2017.

Herbster, M. and Warmuth, M. K. Tracking the best regressor. In Proceedings of the eleventh annual conference on Computational learning theory, pp. 24–31, 1998.

Hoeven, D., Erven, T., and Kotłowski, W. The many faces of exponential weights in online learning. In Conference On Learning Theory, pp. 2067–2092. PMLR, 2018.

Jacobsen, A. and Cutkosky, A. Parameter-free mirror descent. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp. 4160–4211. PMLR, 2022.

Jordan, M. I., Lin, T., and Zampetakis, M. On the complexity of deterministic nonsmooth and nonconvex optimization. arXiv preprint arXiv:2209.12463, 2022.

Jun, K.-S., Orabona, F., Wright, S., and Willett, R. Improved strongly adaptive online learning using coin betting. In Artificial Intelligence and Statistics, pp. 943–951. PMLR, 2017.

Karimireddy, S. P., Jaggi, M., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kornowski, G. and Shamir, O. On the complexity of finding small subgradients in nonsmooth optimization. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022a.

Kornowski, G. and Shamir, O. Oracle complexity in nonsmooth nonconvex optimization. Journal of Machine Learning Research, 23(314):1–44, 2022b.

Kornowski, G. and Shamir, O. An algorithm with optimal dimension-dependence for zero-order nonsmooth nonconvex stochastic optimization. arXiv preprint arXiv:2307.04504, 2023.

Levy, K., Kavis, A., and Cevher, V. Storm+: Fully adaptive SGD with recursive momentum for nonconvex optimization. Advances in Neural Information Processing Systems, 34:20571–20582, 2021.

Li, H. and Lin, Z. Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the $O(\epsilon^{-7/4})$ complexity. In International Conference on Machine Learning. PMLR, 2022.

Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983–992. PMLR, 2019.

Li, X., Zhuang, Z., and Orabona, F. A second look at exponential and cosine step sizes: Simplicity, adaptivity, and performance. In International Conference on Machine Learning, pp. 6553–6564. PMLR, 2021.

Lin, T., Zheng, Z., and Jordan, M. Gradient-free methods for deterministic and stochastic nonsmooth nonconvex optimization. Advances in Neural Information Processing Systems, 35:26160–26175, 2022.

Liu, Z., Nguyen, T. D., Nguyen, T. H., Ene, A., and Nguyen, H. L. META-STORM: Generalized fully-adaptive variance reduced SGD for unbounded functions. arXiv preprint arXiv:2209.14853, 2022.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.

Lu, Z., Xia, W., Arora, S., and Hazan, E. Adaptive gradient methods with local guarantees. arXiv preprint arXiv:2203.01400, 2022.

Luo, H., Zhang, M., Zhao, P., and Zhou, Z.-H. Corralling a larger band of bandits: A case study on switching regret for linear bandits. In Conference on Learning Theory, 2022.

McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pp. 244–256, 2010.

Mhammedi, Z. and Koolen, W. M. Lipschitz and comparator-norm adaptivity in online learning. Conference on Learning Theory, pp. 2858–2887, 2020.

Orabona, F. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Orabona, F. and Pál, D. Scale-free algorithms for online linear optimization. In Chaudhuri, K., Gentile, C., and Zilles, S. (eds.), Algorithmic Learning Theory, pp. 287–301. Springer International Publishing, 2015.

Patel, V. and Berahas, A. S. Gradient descent in the absence of global lipschitz continuity of the gradients: Convergence, divergence and limitations of its continuous approximation. arXiv preprint arXiv:2210.02418, 2022.

Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Conference on Learning Theory (COLT), pp. 993–1019, 2013.

Sachs, S., Hadiji, H., van Erven, T., and Guzmán, C. Between stochastic and adversarial online convex optimization: Improved regret bounds via smoothness. arXiv preprint arXiv:2202.07554, 2022.

Stein, E. M. and Shakarchi, R. Real analysis: measure theory, integration, and Hilbert spaces. Princeton University Press, 2009.

Tian, L. and So, A. M.-C. No dimension-free deterministic algorithm computes approximate stationarities of lipschitzians. arXiv preprint arXiv:2210.06907, 2022.

Tian, L., Zhou, K., and So, A. M.-C. On the finite-time complexity and practical computation of approximate stationarity concepts of lipschitz functions. In International Conference on Machine Learning, pp. 21360–21379. PMLR, 2022.

Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. Stochastic cubic regularization for fast nonconvex optimization. In Advances in neural information processing systems, pp. 2899–2908, 2018.

Wang, G., Hu, Z., Muthukumar, V., and Abernethy, J. Adaptive oracle-efficient online learning. arXiv preprint arXiv:2210.09385, 2022.

You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.

Zhang, J. and Cutkosky, A. Parameter-free regret in high probability with heavy tails. In Advances in Neural Information Processing Systems, 2022.

Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S., Kumar, S., and Sra, S. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020a.

Zhang, J., Lin, H., Jegelka, S., Sra, S., and Jadbabaie, A. Complexity of finding stationary points of nonconvex nonsmooth functions. In International Conference on Machine Learning, 2020b.

Zhang, L., Lu, S., and Zhou, Z.-H. Adaptive online learning in dynamic environments. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 1330–1340, 2018.

Zhang, L., Wang, G., Tu, W.-W., Jiang, W., and Zhou, Z.-H. Dual adaptivity: A universal algorithm for minimizing the adaptive regret of convex functions. Advances in Neural Information Processing Systems, 34:24968–24980, 2021.

Zhang, Z., Cutkosky, A., and Paschalidis, I. Adversarial tracking control via strongly adaptive online learning with memory. In International Conference on Artificial Intelligence and Statistics, pp. 8458–8492. PMLR, 2022.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduction for nonconvex optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3925–3936, 2018.

Zhuang, Z., Cutkosky, A., and Orabona, F. Surrogate losses for online learning of stepsizes in stochastic non-convex optimization. In International Conference on Machine Learning, pp. 7664–7672, 2019.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.

Appendix A Proof of Proposition 2

First, we state a technical lemma that will be used to prove Proposition 2.

Lemma 19.

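Let $F:\mathbb{R}^d\to\mathbb{R}$ be locally Lipschitz. Then, $F$ is differentiable almost everywhere, is Lipschitz on all compact sets, and for all $\mathbf{v}\in\mathbb{R}^d$, $\mathbf{x}\mapsto\langle\nabla F(\mathbf{x}),\mathbf{v}\rangle$ is integrable on all compact sets. Finally, for any compact measurable set $D\subset\mathbb{R}^d$, the vector $\mathbf{w} = \int_D\nabla F(\mathbf{x})\,\mathrm{d}\mathbf{x}$ is well-defined and the operator $\rho(\mathbf{v}) = \int_D\langle\nabla F(\mathbf{x}),\mathbf{v}\rangle\,\mathrm{d}\mathbf{x}$ is linear and equal to $\langle\mathbf{w},\mathbf{v}\rangle$.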

Proof.

First, observe that since 𝐹 is locally Lipschitz, for every point 𝐱 ∈ ℝ 𝑑 with rational coordinates there is a neighborhood 𝑈 𝐱 of 𝐱 on which 𝐹 is Lipschitz. Thus, by Rademacher’s theorem, 𝐹 is differentiable almost everywhere in 𝑈 𝐱 . Since the set of points with rational coordinates is dense in ℝ 𝑑 , ℝ 𝑑 is equal to the countable union ⋃ 𝑈 𝐱 . Thus, since the set of points of non-differentiability of 𝐹 in 𝑈 𝐱 is measure zero, the total set of points of non-differentiability is a countable union of sets of measure zero and so must be measure zero. Thus 𝐹 is differentiable almost everywhere. This implies that 𝐹 is differentiable at 𝐱 + 𝑝 𝐮 with probability 1.

Next, observe that for any compact set $S\subset\mathbb{R}^d$, for every point $\mathbf{x}\in S$ with rational coordinates, $F$ is Lipschitz on some neighborhood $U_{\mathbf{x}}$ containing $\mathbf{x}$ with Lipschitz constant $G_{\mathbf{x}}$. Since $S$ is compact, there is a finite set $\mathbf{x}_1,\dots,\mathbf{x}_K$ such that $S \subseteq \bigcup_i U_{\mathbf{x}_i}$. Therefore, $F$ is $\max_i G_{\mathbf{x}_i}$-Lipschitz on $S$ and so $F$ is Lipschitz on every compact set.

Now, for almost all $\mathbf{x}$, for all $\mathbf{v}$ we have that the limit $\lim_{\delta\to 0}\frac{F(\mathbf{x}+\delta\mathbf{v})-F(\mathbf{x})}{\delta}$ exists and is equal to $\langle\nabla F(\mathbf{x}),\mathbf{v}\rangle$ by definition of differentiability. Further, on any compact set $S$ we have that $\frac{|F(\mathbf{x}+\delta\mathbf{v})-F(\mathbf{x})|}{\delta} \le L$ for some $L$ for all $\mathbf{x}\in S$. Therefore, by the bounded convergence theorem (see e.g., Stein & Shakarchi (2009, Theorem 1.4)), we have that $\langle\nabla F(\mathbf{x}),\mathbf{v}\rangle$ is integrable on $S$.

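Next, we prove the linearity of the operator $\rho$. Observe that for any vectors $\mathbf{v}$ and $\mathbf{w}$, and scalar $c$, by linearity of integration, we have

$$\begin{aligned}
\int_D\langle\nabla F(\mathbf{x}),c\mathbf{v}+\mathbf{w}\rangle\,\mathrm{d}\mathbf{x} &= \int_D\langle\nabla F(\mathbf{x}),c\mathbf{v}\rangle+\langle\nabla F(\mathbf{x}),\mathbf{w}\rangle\,\mathrm{d}\mathbf{x}\\
&= \int_D c\langle\nabla F(\mathbf{x}),\mathbf{v}\rangle+\langle\nabla F(\mathbf{x}),\mathbf{w}\rangle\,\mathrm{d}\mathbf{x}\\
&= c\int_D\langle\nabla F(\mathbf{x}),\mathbf{v}\rangle\,\mathrm{d}\mathbf{x}+\int_D\langle\nabla F(\mathbf{x}),\mathbf{w}\rangle\,\mathrm{d}\mathbf{x}.
\end{aligned}$$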

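For the remaining statement, given that $\mathbb{R}^d$ is finite dimensional, there must exist $\mathbf{w}\in\mathbb{R}^d$ such that $\rho(\mathbf{v}) = \langle\mathbf{w},\mathbf{v}\rangle$ for all $\mathbf{v}$. Further, $\mathbf{w}$ is uniquely determined by $\langle\mathbf{w},\mathbf{e}_i\rangle$ for $i = 1,\dots,d$ where $\mathbf{e}_i$ indicates the $i$th standard basis vector. Then, since $\nabla F(\mathbf{x})_i = \langle\nabla F(\mathbf{x}),\mathbf{e}_i\rangle$ is integrable on compact sets, we have

$$\langle\mathbf{w},\mathbf{e}_i\rangle = \int_D\langle\nabla F(\mathbf{x}),\mathbf{e}_i\rangle\,\mathrm{d}\mathbf{x} = \int_D\nabla F(\mathbf{x})_i\,\mathrm{d}\mathbf{x},$$

which is the definition of $\int_D\nabla F(\mathbf{x})\,\mathrm{d}\mathbf{x}$ when the integral is defined. ∎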

We can now prove the Proposition.

Proof of Proposition 2.

Since $F$ is locally Lipschitz, by Lemma 19, $F$ is Lipschitz on compact sets. Therefore, $F$ must be Lipschitz on the line segment connecting $\mathbf{x}$ and $\mathbf{y}$. Thus the function $k(t) = F(\mathbf{x}+t(\mathbf{y}-\mathbf{x}))$ is absolutely continuous on $[0,1]$. As a result, $k'$ is integrable on $[0,1]$ and

$$F(\mathbf{y})-F(\mathbf{x}) = k(1)-k(0) = \int_0^1 k'(t)\,\mathrm{d}t = \int_0^1\langle\nabla F(\mathbf{x}+t(\mathbf{y}-\mathbf{x})),\mathbf{y}-\mathbf{x}\rangle\,\mathrm{d}t$$

by the Fundamental Theorem of Calculus (see, e.g., Stein & Shakarchi (2009, Theorem 3.11)).

Now, we tackle the case that 𝐹 is not differentiable everywhere. Notice that the last statement of the Proposition is an immediate consequence of Lipschitzness. So, we focus on showing the remaining parts.

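Now, by Lemma 19, we have that $F$ is differentiable almost everywhere. Further, $\mathbf{g}_{\mathbf{x}} = \mathbb{E}_{\mathbf{u}}[\nabla F(\mathbf{x}+p\mathbf{u})]$ exists and satisfies for all $\mathbf{v}\in\mathbb{R}^d$:

$$\langle\mathbf{g}_{\mathbf{x}},\mathbf{v}\rangle = \mathbb{E}_{\mathbf{u}}[\langle\nabla F(\mathbf{x}+p\mathbf{u}),\mathbf{v}\rangle].$$

Notice also that

$$\mathbb{E}_{\mathbf{z},\mathbf{u}}[\widehat{\text{Grad}}(\mathbf{x},(\mathbf{z},\mathbf{u}))] = \mathbb{E}_{\mathbf{u}}\mathbb{E}_{\mathbf{z}}[\text{Grad}(\mathbf{x}+p\mathbf{u},\mathbf{z})] = \mathbb{E}_{\mathbf{u}}[\nabla F(\mathbf{x}+p\mathbf{u})] = \mathbf{g}_{\mathbf{x}}.$$

So, it remains to show that $\hat{F}$ is differentiable and $\nabla\hat{F}(\mathbf{x}) = \mathbf{g}_{\mathbf{x}}$.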

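Now, let $\mathbf{x}$ be an arbitrary element of $\mathbb{R}^d$ and let $\mathbf{v}_1,\mathbf{v}_2,\dots$ be any sequence of vectors such that $\lim_{n\to\infty}\mathbf{v}_n = 0$ and $\|\mathbf{v}_i\| \le p$ for all $i$. Then, since the ball of radius $2p$ centered at $\mathbf{x}$ is compact, $F$ is $L$-Lipschitz inside this ball for some $L$. Then, we have

$$\lim_{n\to\infty}\frac{\hat{F}(\mathbf{x}+\mathbf{v}_n)-\hat{F}(\mathbf{x})-\langle\mathbf{g}_{\mathbf{x}},\mathbf{v}_n\rangle}{\|\mathbf{v}_n\|} = \lim_{n\to\infty}\mathbb{E}_{\mathbf{u}}\left[\frac{F(\mathbf{x}+\mathbf{v}_n+p\mathbf{u})-F(\mathbf{x}+p\mathbf{u})-\langle\nabla F(\mathbf{x}+p\mathbf{u}),\mathbf{v}_n\rangle}{\|\mathbf{v}_n\|}\right].$$

Now, observe $\frac{|F(\mathbf{x}+\mathbf{v}_i+p\mathbf{u})-F(\mathbf{x}+p\mathbf{u})|}{\|\mathbf{v}_i\|} \le L$. Further, for almost all $\mathbf{u}$, $F$ is differentiable at $\mathbf{x}+p\mathbf{u}$ so that $\lim_{n\to\infty}\frac{F(\mathbf{x}+\mathbf{v}_n+p\mathbf{u})-F(\mathbf{x}+p\mathbf{u})-\langle\nabla F(\mathbf{x}+p\mathbf{u}),\mathbf{v}_n\rangle}{\|\mathbf{v}_n\|} = 0$ for almost all $\mathbf{u}$. Thus, by the bounded convergence theorem, we have

$$\lim_{n\to\infty}\frac{\hat{F}(\mathbf{x}+\mathbf{v}_n)-\hat{F}(\mathbf{x})-\langle\mathbf{g}_{\mathbf{x}},\mathbf{v}_n\rangle}{\|\mathbf{v}_n\|} = 0,$$

which shows that $\mathbf{g}_{\mathbf{x}} = \nabla\hat{F}(\mathbf{x})$.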

Finally, observe that since 𝐹 is Lipschitz on compact sets, 𝐹 ^ must be also, and so by the first part of the proposition, 𝐹 ^ is well-behaved. ∎

Appendix B Analysis of (Optimistic) Online Gradient Descent

Optimistic Online Gradient Descent (in its simplest form) is described by Algorithm 3. Here we collect the standard analysis of the algorithm for completeness. None of this analysis is new, and more refined versions can be found in a variety of sources (e.g. Chen et al. (2021)).

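Algorithm 3 Optimistic Mirror Descent

Input: Regularizer function $\phi$, domain $V$, time horizon $T$, learning rate $\eta$.
$\hat{\mathbf{w}}_1 = \mathbf{0}$
for $t = 1 \dots T$ do
  Generate “hint” $\mathbf{h}_t$
  Set $\mathbf{w}_t = \mathop{\text{argmin}}_{\mathbf{x}\in V}\ \langle\mathbf{h}_t,\mathbf{x}\rangle+\frac{1}{2\eta}\|\mathbf{x}-\hat{\mathbf{w}}_t\|^2$
  Output $\mathbf{w}_t$ and receive loss vector $\mathbf{g}_t$
  Set $\hat{\mathbf{w}}_{t+1} = \mathop{\text{argmin}}_{\mathbf{x}\in V}\ \langle\mathbf{g}_t,\mathbf{x}\rangle+\frac{1}{2\eta}\|\mathbf{x}-\hat{\mathbf{w}}_t\|^2$
end for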

We will analyze only a simple version of this algorithm, that is when 𝑉 is an 𝐿 2 ball of radius 𝐷 in some real Hilbert space (such as ℝ 𝑑 ). Then, Algorithm 3 satisfies the following guarantee.

Proposition 20.

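Let $V = \{\mathbf{x}:\|\mathbf{x}\| \le D\} \subset \mathcal{H}$ for some real Hilbert space $\mathcal{H}$. Then, for all $\mathbf{u}\in V$, Algorithm 3 ensures

$$\sum_{t=1}^{T}\langle\mathbf{g}_t,\mathbf{w}_t-\mathbf{u}\rangle \le \frac{D^2}{2\eta}+\sum_{t=1}^{T}\frac{\eta}{2}\|\mathbf{g}_t-\mathbf{h}_t\|^2.$$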

Proof.

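By Chen et al. (2021, Lemma 15), instantiated with the scaled squared Euclidean distance $\frac{1}{2\eta}\|\mathbf{x}-\mathbf{y}\|^2$ as Bregman divergence, we have

$$
\begin{aligned}
\langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{u}\rangle
&\le \langle \mathbf{g}_t - \mathbf{h}_t, \mathbf{w}_t - \hat{\mathbf{w}}_{t+1}\rangle + \frac{1}{2\eta}\|\mathbf{u} - \hat{\mathbf{w}}_t\|^2 - \frac{1}{2\eta}\|\mathbf{u} - \hat{\mathbf{w}}_{t+1}\|^2 - \frac{1}{2\eta}\|\hat{\mathbf{w}}_{t+1} - \mathbf{w}_t\|^2 - \frac{1}{2\eta}\|\mathbf{w}_t - \hat{\mathbf{w}}_t\|^2 \\
&\le \frac{\eta\|\mathbf{g}_t - \mathbf{h}_t\|^2}{2} + \frac{\|\mathbf{w}_t - \hat{\mathbf{w}}_{t+1}\|^2}{2\eta} + \frac{1}{2\eta}\|\mathbf{u} - \hat{\mathbf{w}}_t\|^2 - \frac{1}{2\eta}\|\mathbf{u} - \hat{\mathbf{w}}_{t+1}\|^2 - \frac{1}{2\eta}\|\hat{\mathbf{w}}_{t+1} - \mathbf{w}_t\|^2 - \frac{\|\mathbf{w}_t - \hat{\mathbf{w}}_t\|^2}{2\eta} \\
&\le \frac{\eta\|\mathbf{g}_t - \mathbf{h}_t\|^2}{2} + \frac{1}{2\eta}\|\mathbf{u} - \hat{\mathbf{w}}_t\|^2 - \frac{1}{2\eta}\|\mathbf{u} - \hat{\mathbf{w}}_{t+1}\|^2 ,
\end{aligned}
$$

where the second line follows from Young's inequality. Summing over $t$ and telescoping, we have

$$
\begin{aligned}
\sum_{t=1}^{T} \langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{u}\rangle
&\le \frac{1}{2\eta}\|\mathbf{u} - \hat{\mathbf{w}}_1\|^2 - \frac{1}{2\eta}\|\mathbf{u} - \hat{\mathbf{w}}_{T+1}\|^2 + \sum_{t=1}^{T} \frac{\eta\|\mathbf{g}_t - \mathbf{h}_t\|^2}{2} \\
&\le \frac{\|\mathbf{u} - \hat{\mathbf{w}}_1\|^2}{2\eta} + \sum_{t=1}^{T} \frac{\eta\|\mathbf{g}_t - \mathbf{h}_t\|^2}{2}
\le \frac{D^2}{2\eta} + \sum_{t=1}^{T} \frac{\eta\|\mathbf{g}_t - \mathbf{h}_t\|^2}{2} . \qquad ∎
\end{aligned}
$$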

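In the case that the hints $\mathbf{h}_t$ are not present (that is, $\mathbf{h}_t = \mathbf{0}$), the algorithm becomes online gradient descent (Zinkevich, 2003). In this case, assuming $\mathbb{E}[\|\mathbf{g}_t\|^2] \le G^2$ and setting $\eta = \frac{D}{G\sqrt{T}}$, we obtain $\mathbb{E}[R_T(\mathbf{u})] \le DG\sqrt{T}$ for all $\mathbf{u}$ such that $\|\mathbf{u}\| \le D$.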
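As an illustration of the guarantee above, the following is a minimal sketch of optimistic online gradient descent on an $L_2$ ball, assuming the quadratic regularizer with step size `eta` so that each argmin reduces to a projected gradient step. The function names, the choice of hint (the previous loss vector), and the random linear losses are assumptions made for this example only.

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto {x : ||x|| <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def optimistic_ogd(loss_vectors, radius, eta):
    """Run optimistic OGD on a fixed sequence of loss vectors.

    Hint h_t is the previous loss vector (zero at t = 1); this hint choice
    is an assumption of the sketch, not prescribed by the analysis.
    Returns the played points w_t and the hints h_t.
    """
    dim = loss_vectors.shape[1]
    w_hat = np.zeros(dim)
    hint = np.zeros(dim)
    played, hints = [], []
    for g in loss_vectors:
        # argmin_x <h, x> + ||x - w_hat||^2 / (2 eta) over the ball
        w = project_l2_ball(w_hat - eta * hint, radius)
        played.append(w)
        hints.append(hint)
        # argmin_x <g, x> + ||x - w_hat||^2 / (2 eta) over the ball
        w_hat = project_l2_ball(w_hat - eta * g, radius)
        hint = g
    return np.array(played), np.array(hints)

rng = np.random.default_rng(0)
T, dim, D, eta = 200, 5, 2.0, 0.1
G = rng.normal(size=(T, dim))
W, Hn = optimistic_ogd(G, D, eta)

# Regret against the best fixed comparator in the ball for linear losses.
u_star = -D * G.sum(axis=0) / np.linalg.norm(G.sum(axis=0))
regret = float(np.sum(G * (W - u_star)))
bound = D**2 / (2 * eta) + 0.5 * eta * float(np.sum((G - Hn) ** 2))
```

Here `regret` should never exceed `bound`, whatever the loss sequence; passing all-zero hints instead recovers plain online gradient descent.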

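Appendix C Algorithm 2 and Regret Guarantee

Theorem 21.

Let $F$ be an $H$-smooth and $G$-Lipschitz function. Then, when $Q = \left\lceil \log_2\left(\sqrt{NG/(HD)}\right) \right\rceil$, Algorithm 2 with $\mathbf{x}_t = \mathbf{x}_{t-1} + \frac{1}{2}\boldsymbol{\Delta}_t$, $\mathbf{g}_t = \nabla F(\mathbf{x}_t)$, and $\eta \le \frac{1}{2H}$ ensures for all $\|\mathbf{u}\| \le D$

$$
\sum_{t=1}^{T} \langle \mathbf{g}_t, \boldsymbol{\Delta}_t - \mathbf{u}\rangle \le \frac{HD^2}{2} + \frac{2GTD}{N} .
$$

Furthermore, a total of at most $T\left\lceil \log_2\left(\sqrt{NG/(HD)}\right) \right\rceil$ gradient evaluations are required.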

Proof.

The count of gradient evaluations is immediate from inspection of the algorithm, so it remains only to prove the regret bound.

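First, we observe that the choices of $\boldsymbol{\Delta}_t$ specified by Algorithm 2 correspond to the values of $\mathbf{w}_t$ produced by Algorithm 3 when $\psi(\mathbf{w}) = \frac{1}{2\eta}\|\mathbf{w}\|^2$. This can be verified by direct calculation (recalling that $D_\psi(\mathbf{x}, \mathbf{y}) = \frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\eta}$).

Therefore, by Proposition 20, we have

$$
\sum_{t=1}^{T} \langle \mathbf{g}_t, \boldsymbol{\Delta}_t - \mathbf{u}\rangle \le \frac{D^2}{2\eta} + \sum_{t=1}^{T} \frac{\eta}{2}\|\mathbf{g}_t - \mathbf{h}_t\|^2 . \tag{3}
$$

So, our primary task is to show that $\|\mathbf{g}_t - \mathbf{h}_t\|$ is small. To this end, recall that $\mathbf{g}_t = \nabla F(\mathbf{w}_t) = \nabla F(\mathbf{x}_{t-1} + \boldsymbol{\Delta}_t/2)$.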

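Now, we define $\mathbf{h}_t^{Q+1} = \nabla F\left(\mathbf{x}_{t-1} + \frac{1}{2}\Pi_{\|\boldsymbol{\Delta}\| \le D}\left[\boldsymbol{\Delta}_t' - \eta \mathbf{h}_t^{Q}\right]\right)$ (which simply continues the recursive definition of $\mathbf{h}_t^i$ in Algorithm 2 for one more step). Then, we claim that for all $0 \le i \le Q$, $\|\mathbf{h}_t^{i+1} - \mathbf{h}_t^{i}\| \le \frac{1}{2^i}\|\mathbf{h}_t^1 - \mathbf{h}_t^0\|$. We establish the claim by induction on $i$. First, for $i = 0$ the claim holds by definition. Now suppose $\|\mathbf{h}_t^{i} - \mathbf{h}_t^{i-1}\| \le \frac{1}{2^{i-1}}\|\mathbf{h}_t^1 - \mathbf{h}_t^0\|$ for some $i$. Then, we have

$$
\begin{aligned}
\|\mathbf{h}_t^{i+1} - \mathbf{h}_t^{i}\|
&= \left\|\nabla F\left(\mathbf{x}_{t-1} + \tfrac{1}{2}\Pi_{\|\boldsymbol{\Delta}\| \le D}\left[\boldsymbol{\Delta}_t' - \eta \mathbf{h}_t^{i}\right]\right) - \nabla F\left(\mathbf{x}_{t-1} + \tfrac{1}{2}\Pi_{\|\boldsymbol{\Delta}\| \le D}\left[\boldsymbol{\Delta}_t' - \eta \mathbf{h}_t^{i-1}\right]\right)\right\| \\
&\le \frac{H}{2}\left\|\Pi_{\|\boldsymbol{\Delta}\| \le D}\left[\boldsymbol{\Delta}_t' - \eta \mathbf{h}_t^{i}\right] - \Pi_{\|\boldsymbol{\Delta}\| \le D}\left[\boldsymbol{\Delta}_t' - \eta \mathbf{h}_t^{i-1}\right]\right\| && \text{($H$-smoothness of $F$)} \\
&\le \frac{H\eta}{2}\|\mathbf{h}_t^{i} - \mathbf{h}_t^{i-1}\| && \text{(projection is a contraction)} \\
&\le \frac{1}{2}\|\mathbf{h}_t^{i} - \mathbf{h}_t^{i-1}\| && \text{(using $\eta \le \tfrac{1}{H}$)} \\
&\le \frac{1}{2^{i}}\|\mathbf{h}_t^{1} - \mathbf{h}_t^{0}\| && \text{(induction assumption)},
\end{aligned}
$$

so that the claim holds.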

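Now, since $\mathbf{h}_t = \mathbf{h}_t^{Q}$, we have $\boldsymbol{\Delta}_t = \Pi_{\|\boldsymbol{\Delta}\| \le D}\left[\boldsymbol{\Delta}_t' - \eta \mathbf{h}_t^{Q}\right]$. Therefore $\mathbf{g}_t = \nabla F(\mathbf{x}_t) = \nabla F(\mathbf{x}_{t-1} + \boldsymbol{\Delta}_t/2) = \mathbf{h}_t^{Q+1}$. Thus,

$$
\|\mathbf{g}_t - \mathbf{h}_t^{Q}\| = \|\mathbf{h}_t^{Q+1} - \mathbf{h}_t^{Q}\| \le \frac{1}{2^{Q}}\|\mathbf{h}_t^{1} - \mathbf{h}_t^{0}\| \le \frac{2G}{2^{Q}} ,
$$

where in the last inequality we used the fact that $F$ is $G$-Lipschitz. So, for $Q = \left\lceil \log_2\left(\sqrt{NG/(HD)}\right) \right\rceil$, we have $\|\mathbf{g}_t - \mathbf{h}_t^{Q}\| \le 2G\sqrt{\frac{HD}{NG}}$ for all $t$. The result now follows by substituting into equation (3). ∎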
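The halving argument in the proof can be observed numerically. The sketch below iterates the hint recursion $\mathbf{h}^{i+1} = \nabla F(\mathbf{x}_{t-1} + \frac{1}{2}\Pi[\boldsymbol{\Delta}' - \eta \mathbf{h}^{i}])$ for a hypothetical test function $F(\mathbf{x}) = \sum_i \sin(x_i)$, whose gradient is 1-Lipschitz ($H = 1$). The starting hint `h0`, the test function, and all names are assumptions for this example, since Algorithm 2 itself is not reproduced here.

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto {x : ||x|| <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def refine_hints(grad_f, x_prev, delta_prime, eta, radius, num_rounds, h0):
    """Return [h^0, ..., h^num_rounds] produced by the hint recursion."""
    hints = [h0]
    for _ in range(num_rounds):
        step = project_l2_ball(delta_prime - eta * hints[-1], radius)
        hints.append(grad_f(x_prev + 0.5 * step))
    return hints

H = 1.0                      # smoothness constant of F(x) = sum(sin(x_i))
grad_f = np.cos              # gradient of the hypothetical test function
rng = np.random.default_rng(0)
x_prev = rng.normal(size=5)
delta_prime = rng.normal(size=5)
hints = refine_hints(grad_f, x_prev, delta_prime, eta=1.0 / H,
                     radius=1.0, num_rounds=8, h0=np.zeros(5))

# gaps[i] = ||h^{i+1} - h^i||, which the proof bounds by 2^{-i} * gaps[0]
gaps = [float(np.linalg.norm(b - a)) for a, b in zip(hints, hints[1:])]
```

Each extra round costs one gradient evaluation, which is where the $T\lceil \log_2(\cdot) \rceil$ evaluation count in Theorem 21 comes from.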

Appendix D Proof of Theorem 17

Proof of Theorem 17.

Once more, the first part of the result is immediate from the fact that ‖ 𝚫 𝑛 ‖ ≤ 𝐷 . So, we proceed to show the second part.

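Define $\nabla_n = \int_0^1 \nabla F(\mathbf{x}_{n-1} + s\boldsymbol{\Delta}_n)\,\mathrm{d}s$. Then, we have

$$
\begin{aligned}
\left|\langle \nabla_n - \mathbf{g}_n, \boldsymbol{\Delta}_n\rangle\right|
&= \left|\left\langle \int_0^1 \nabla F(\mathbf{x}_{n-1} + s\boldsymbol{\Delta}_n) - \nabla F(\mathbf{x}_{n-1} + \tfrac{1}{2}\boldsymbol{\Delta}_n)\,\mathrm{d}s,\ \boldsymbol{\Delta}_n\right\rangle\right| \\
&\le D\left\|\int_0^1 \nabla F(\mathbf{x}_{n-1} + s\boldsymbol{\Delta}_n) - \nabla F(\mathbf{x}_{n-1} + \tfrac{1}{2}\boldsymbol{\Delta}_n)\,\mathrm{d}s\right\| \\
&= D\Big\|\int_0^1 \Big(\nabla F(\mathbf{x}_{n-1} + s\boldsymbol{\Delta}_n) - \nabla F(\mathbf{x}_{n-1} + \tfrac{1}{2}\boldsymbol{\Delta}_n) - \nabla^2 F(\mathbf{x}_{n-1} + \tfrac{1}{2}\boldsymbol{\Delta}_n)\boldsymbol{\Delta}_n(s - \tfrac{1}{2})\Big) \\
&\qquad\quad + \nabla^2 F(\mathbf{x}_{n-1} + \tfrac{1}{2}\boldsymbol{\Delta}_n)\boldsymbol{\Delta}_n(s - \tfrac{1}{2})\,\mathrm{d}s\Big\| \\
&= D\left\|\int_0^1 \nabla F(\mathbf{x}_{n-1} + s\boldsymbol{\Delta}_n) - \nabla F(\mathbf{x}_{n-1} + \tfrac{1}{2}\boldsymbol{\Delta}_n) - \nabla^2 F(\mathbf{x}_{n-1} + \tfrac{1}{2}\boldsymbol{\Delta}_n)\boldsymbol{\Delta}_n(s - \tfrac{1}{2})\,\mathrm{d}s\right\| && \text{(observing that $\textstyle\int_0^1 (s - \tfrac{1}{2})\,\mathrm{d}s = 0$)} \\
&\le D\int_0^1 \frac{J}{2}\|\boldsymbol{\Delta}_n\|^2 (s - \tfrac{1}{2})^2\,\mathrm{d}s \le \frac{JD^3}{48} && \text{(using second-order smoothness)}.
\end{aligned}
$$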

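In Theorem 7, set $\mathbf{u}_n$ to be equal to $\mathbf{u}^1$ for the first $T$ iterations, $\mathbf{u}^2$ for the second $T$ iterations and so on. In other words, $\mathbf{u}_n = \mathbf{u}^{\lceil n/T \rceil}$ for $n = 1, \dots, M$. So, we have

$$
\begin{aligned}
F(\mathbf{x}_M) - F(\mathbf{x}_0)
&= R_T(\mathbf{u}^1, \dots, \mathbf{u}^K) + \sum_{n=1}^{M} \langle \nabla_n - \mathbf{g}_n, \boldsymbol{\Delta}_n\rangle + \sum_{n=1}^{M} \langle \mathbf{g}_n, \mathbf{u}_n\rangle \\
&\le R_T(\mathbf{u}^1, \dots, \mathbf{u}^K) + \frac{NJD^3}{48} + \sum_{n=1}^{M} \langle \mathbf{g}_n, \mathbf{u}_n\rangle .
\end{aligned}
$$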

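Now, set $\mathbf{u}^k = -D\,\frac{\sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k)}{\left\|\sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k)\right\|}$. Then, by Theorem 21, we have that $R_T(\mathbf{u}^k) \le \frac{HD^2}{2} + \frac{2TGD}{N}$. Therefore:

$$
F(\mathbf{x}_M) \le F(\mathbf{x}_0) + \frac{HD^2K}{2} + 2GD + \frac{MJD^3}{48} - DT\sum_{k=1}^{K} \left\|\frac{1}{T}\sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k)\right\| .
$$

Hence, we obtain

$$
\frac{1}{K}\sum_{k=1}^{K} \left\|\frac{1}{T}\sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k)\right\| \le \frac{F(\mathbf{x}_0) - F(\mathbf{x}_M)}{MD} + \frac{HD}{2T} + \frac{2G}{M} + \frac{JD^2}{48} .
$$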

Note that from the choice of $K$ and $T$ we have $M = KT \ge N - T \ge N/2$. So, using $D = \delta/T$, we can upper bound the r.h.s. with

$$
\frac{2T(F(\mathbf{x}_0) - F^\star)}{N\delta} + \frac{H\delta}{2T^2} + \frac{4G}{N} + \frac{J\delta^2}{2T^2} ,
$$

and with $T = \min\left(\left\lceil \frac{(\delta^2(H + J\delta)N)^{1/3}}{(F(\mathbf{x}_0) - F^\star)^{1/3}} \right\rceil, N/2\right)$:

$$
\le \frac{3(H + J\delta)^{1/3}(F(\mathbf{x}_0) - F^\star)^{2/3}}{\delta^{1/3}N^{2/3}} + \frac{4G}{N} + \frac{2(F(\mathbf{x}_0) - F^\star)}{N\delta} + \frac{10\,\delta(H + J\delta)}{N^2} .
$$

Now, the third fact follows by observing that Proposition 15 implies that

$$
\frac{1}{K}\sum_{k=1}^{K} \|\nabla F(\bar{\mathbf{w}}^k)\| \le \frac{1}{K}\sum_{k=1}^{K} \left\|\frac{1}{T}\sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k)\right\| + \frac{J\delta^2}{2} .
$$

Now, substituting the specified value of $\delta$ completes the identity. Finally, the count of the number of gradient evaluations is a direct calculation. ∎

Appendix E Proofs for Section 5

See Proposition 14.

Proof.

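Let $S \subset B(\mathbf{x}, \delta)$ with $\mathbf{x} = \frac{1}{|S|}\sum_{\mathbf{y} \in S} \mathbf{y}$. By $H$-smoothness, for all $\mathbf{y} \in S$, $\|\nabla F(\mathbf{y}) - \nabla F(\mathbf{x})\| \le H\|\mathbf{y} - \mathbf{x}\| \le H\delta$. Therefore, we have

$$
\left\|\frac{1}{|S|}\sum_{\mathbf{y} \in S} \nabla F(\mathbf{y})\right\| = \left\|\nabla F(\mathbf{x}) + \frac{1}{|S|}\sum_{\mathbf{y} \in S} \left(\nabla F(\mathbf{y}) - \nabla F(\mathbf{x})\right)\right\| \ge \|\nabla F(\mathbf{x})\| - H\delta .
$$

Now, since $\|\nabla F(\mathbf{x})\|_\delta \le \epsilon$, for any $p > 0$, there is a set $S$ such that $\left\|\frac{1}{|S|}\sum_{\mathbf{y} \in S} \nabla F(\mathbf{y})\right\| \le \epsilon + p$. Thus, $\|\nabla F(\mathbf{x})\| \le \epsilon + H\delta + p$ for any $p > 0$, which implies $\|\nabla F(\mathbf{x})\| \le \epsilon + H\delta$. ∎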

See Proposition 15.

Proof.

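The proof is similar to that of Proposition 14. Let $S \subset B(\mathbf{x}, \delta)$ with $\mathbf{x} = \frac{1}{|S|}\sum_{\mathbf{y} \in S} \mathbf{y}$. By $J$-second-order-smoothness, for all $\mathbf{y} \in S$, we have

$$
\begin{aligned}
\left\|\nabla F(\mathbf{y}) - \nabla F(\mathbf{x}) - \nabla^2 F(\mathbf{x})(\mathbf{y} - \mathbf{x})\right\|
&= \left\|\int_0^1 \left(\nabla^2 F(\mathbf{x} + t(\mathbf{y} - \mathbf{x})) - \nabla^2 F(\mathbf{x})\right)(\mathbf{y} - \mathbf{x})\,\mathrm{d}t\right\| \\
&\le \int_0^1 tJ\|\mathbf{y} - \mathbf{x}\|^2\,\mathrm{d}t = \frac{J\|\mathbf{y} - \mathbf{x}\|^2}{2} \le \frac{J\delta^2}{2} .
\end{aligned}
$$

Further, since $\frac{1}{|S|}\sum_{\mathbf{y} \in S} \mathbf{y} = \mathbf{x}$, we have $\frac{1}{|S|}\sum_{\mathbf{y} \in S} \nabla^2 F(\mathbf{x})(\mathbf{y} - \mathbf{x}) = 0$. Therefore, we have

$$
\left\|\frac{1}{|S|}\sum_{\mathbf{y} \in S} \nabla F(\mathbf{y})\right\| = \left\|\nabla F(\mathbf{x}) + \frac{1}{|S|}\sum_{\mathbf{y} \in S} \left(\nabla F(\mathbf{y}) - \nabla F(\mathbf{x}) - \nabla^2 F(\mathbf{x})(\mathbf{y} - \mathbf{x})\right)\right\| \ge \|\nabla F(\mathbf{x})\| - \frac{J\delta^2}{2} .
$$

Now, since $\|\nabla F(\mathbf{x})\|_\delta \le \epsilon$, for any $p > 0$, there is a set $S$ such that $\left\|\frac{1}{|S|}\sum_{\mathbf{y} \in S} \nabla F(\mathbf{y})\right\| \le \epsilon + p$. Thus, $\|\nabla F(\mathbf{x})\| \le \epsilon + \frac{J}{2}\delta^2 + p$ for any $p > 0$, which implies $\|\nabla F(\mathbf{x})\| \le \epsilon + \frac{J}{2}\delta^2$. ∎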
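The two inequalities at the heart of these proofs can be checked numerically. The sketch below uses the hypothetical test function $F(\mathbf{x}) = \sum_i \sin(x_i)$, which is 1-smooth and 1-second-order-smooth, together with a random point set $S$; both the function and the helper name are assumptions for this example.

```python
import numpy as np

def averaged_gradient_bounds(grad_f, points, H, J):
    """Compare ||grad F(mean of S)|| with the averaged-gradient bounds.

    Returns (gradient norm at the mean, first-order bound, second-order bound),
    where delta is the smallest radius such that S lies in B(mean, delta).
    """
    x_bar = points.mean(axis=0)
    delta = float(np.max(np.linalg.norm(points - x_bar, axis=1)))
    avg_grad_norm = float(np.linalg.norm(grad_f(points).mean(axis=0)))
    grad_norm = float(np.linalg.norm(grad_f(x_bar)))
    return grad_norm, avg_grad_norm + H * delta, avg_grad_norm + 0.5 * J * delta**2

rng = np.random.default_rng(0)
S = rng.normal(scale=0.3, size=(20, 4))   # rows are the points y in S
# np.cos applied row-wise is the gradient of F(x) = sum(sin(x_i)); H = J = 1.
grad_norm, first_order_bound, second_order_bound = averaged_gradient_bounds(
    np.cos, S, H=1.0, J=1.0)
```

For small $\delta$ the second-order bound is the tighter of the two, which is why second-order smoothness yields the improved rates.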

Appendix F Lower Bounds

Our lower bounds are constructed via a mild alteration to the arguments of Arjevani et al. (2019) for lower bounds on finding $(0, \epsilon)$-stationary points of smooth functions with a stochastic gradient oracle. At a high level, we show that since a $(\delta, \epsilon)$-stationary point of an $H$-smooth loss is also a $(0, H\delta + \epsilon)$-stationary point, a lower bound on the complexity of the latter implies a lower bound on the complexity of the former. The lower bound of Arjevani et al. (2019) is proved by constructing a distribution over "hard" functions such that no algorithm can quickly find a $(0, \epsilon)$-stationary point of a randomly selected function. Unfortunately, these "hard" functions are not Lipschitz. Fortunately, they take the form $F(\mathbf{x}) + \eta\|\mathbf{x}\|^2$ where $F$ is Lipschitz and smooth, so that the "non-Lipschitz" part is solely contained in the quadratic term. We show that one can replace the quadratic term $\|\mathbf{x}\|^2$ with a Lipschitz function that is quadratic for sufficiently small $\mathbf{x}$ but proportional to $\|\mathbf{x}\|$ for larger values. Our proof consists of carefully reproducing the argument of Arjevani et al. (2019) to show that this modification does not cause any problems. We emphasize that almost all of this development can be found in more detail in Arjevani et al. (2019). We merely restate here the minimum results required to verify our modification to their construction.

F.1 Definitions and Results from Arjevani et al. (2019)

A randomized first-order algorithm is a distribution $P_S$ supported on a set $S$ and a sequence of measurable mappings $A_i(s, \mathbf{g}_1, \dots, \mathbf{g}_{i-1}) \to \mathbb{R}^d$ with $s \in S$ and $\mathbf{g}_i \in \mathbb{R}^d$. Given a stochastic gradient oracle $\text{Grad} : \mathbb{R}^d \times Z \to \mathbb{R}^d$, a distribution $P_Z$ supported on $Z$ and an i.i.d. sample $(\mathbf{z}_1, \dots, \mathbf{z}_n) \sim P_Z$, we define the iterates of $A$ recursively by:

$$
\begin{aligned}
\mathbf{x}_1 &= A_1(s), \\
\mathbf{x}_i &= A_i\left(s, \text{Grad}(\mathbf{x}_1, \mathbf{z}_1), \text{Grad}(\mathbf{x}_2, \mathbf{z}_2), \dots, \text{Grad}(\mathbf{x}_{i-1}, \mathbf{z}_{i-1})\right) .
\end{aligned}
$$

So, $\mathbf{x}_i$ is a function of $s$ and $\mathbf{z}_1, \dots, \mathbf{z}_{i-1}$. We define $\mathcal{A}_{\text{rand}}$ to be the set of such sequences of mappings.

Now, in the notation of Arjevani et al. (2019), we define the "progress function"

$$
\text{prog}_c(\mathbf{x}) = \max\{i : |\mathbf{x}_i| > c\} .
$$

Further, a stochastic gradient oracle $\text{Grad}$ can be called a probability-$p$ zero-chain if $\text{prog}_0(\text{Grad}(\mathbf{x}, \mathbf{z})) = \text{prog}_{1/4}(\mathbf{x}) + 1$ for all $\mathbf{x}$ with probability at least $1 - p$, and $\text{prog}_0(\text{Grad}(\mathbf{x}, \mathbf{z})) \le \text{prog}_{1/4}(\mathbf{x}) + 1$ with probability 1.

Next, let $F_T : \mathbb{R}^T \to \mathbb{R}$ be the function defined by Lemma 2 of Arjevani et al. (2019). Restating their Lemma, this function satisfies:

Lemma 22 (Lemma 2 of Arjevani et al. (2019)).

There exists a function $F_T : \mathbb{R}^T \to \mathbb{R}$ that satisfies:

1. $F_T(0) = 0$ and $\inf F_T(\mathbf{x}) \ge -\gamma_0 T$ for $\gamma_0 = 12$.

2. $\nabla F_T(\mathbf{x})$ is $H_0$-Lipschitz, with $H_0 = 152$.

3. For all $\mathbf{x}$, $\|\nabla F_T(\mathbf{x})\|_\infty \le G_0$ with $G_0 = 23$.

4. For all $\mathbf{x}$, $\text{prog}_0(\nabla F_T(\mathbf{x})) \le \text{prog}_{1/2}(\mathbf{x}) + 1$.

5. If $\text{prog}_1(\mathbf{x}) < T$, then $\|\nabla F_T(\mathbf{x})\| \ge \left|\nabla F_T(\mathbf{x})_{\text{prog}_1(\mathbf{x}) + 1}\right| \ge 1$.

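We associate with this function $F_T$ the stochastic gradient oracle $\text{Grad}_T(\mathbf{x}, z) : \mathbb{R}^T \times \{0, 1\} \to \mathbb{R}^T$ where $z$ is Bernoulli($p$):

$$
\text{Grad}_T(\mathbf{x}, z)_i =
\begin{cases}
\nabla F_T(\mathbf{x})_i , & \text{if } i \ne \text{prog}_{1/4}(\mathbf{x}), \\
\dfrac{z\,\nabla F_T(\mathbf{x})_i}{p} , & \text{if } i = \text{prog}_{1/4}(\mathbf{x}).
\end{cases}
$$

It is clear that $\mathbb{E}_z[\text{Grad}_T(\mathbf{x}, z)] = \nabla F_T(\mathbf{x})$.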
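To make the progress function and the unbiasedness claim concrete, here is a small sketch. It assumes the strict-inequality convention $|x_i| > c$ with $\text{prog}_c = 0$ when no coordinate qualifies, and it takes the rescaled coordinate as an explicit argument rather than committing to a particular indexing; the function names and toy vectors are hypothetical.

```python
import numpy as np

def prog(x, c):
    """Largest 1-based index i with |x_i| > c, or 0 if there is none."""
    idx = np.flatnonzero(np.abs(x) > c)
    return int(idx[-1]) + 1 if idx.size else 0

def rescaled_coordinate_oracle(grad, coord, z, p):
    """Copy of `grad` whose 1-based coordinate `coord` is multiplied by z / p.

    With z ~ Bernoulli(p) the output is an unbiased estimate of `grad`.
    A `coord` outside 1..len(grad) leaves the vector unchanged.
    """
    out = np.array(grad, dtype=float)
    if 1 <= coord <= out.size:
        out[coord - 1] *= z / p
    return out

x = np.array([0.9, 0.3, 0.1, 0.0])
grad = np.array([1.0, -2.0, 0.5, 0.25])   # stand-in for a true gradient
p = 0.2
coord = prog(x, 0.25)

# Exact expectation over z: P(z = 1) = p and P(z = 0) = 1 - p.
expected = (p * rescaled_coordinate_oracle(grad, coord, 1, p)
            + (1 - p) * rescaled_coordinate_oracle(grad, coord, 0, p))
```

The noisy coordinate is either zeroed out or inflated by $1/p$, which is what makes progress along the chain slow while keeping the oracle unbiased.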

This construction is so far identical to that in Arjevani et al. (2019), and so we have by their Lemma 3:

Lemma 23 (Lemma 3 of Arjevani et al. (2019)).

$\text{Grad}_T$ is a probability-$p$ zero chain, has variance $\mathbb{E}[\|\text{Grad}_T(\mathbf{x}, \mathbf{z}) - \nabla F_T(\mathbf{x})\|^2] \le G_0^2/p$, and $\|\text{Grad}_T(\mathbf{x}, z)\| \le \frac{G_0}{p} + G_0\sqrt{T}$.

Proof.

The probability-$p$ zero-chain and variance statements are directly from Arjevani et al. (2019). For the bound on $\|\text{Grad}_T\|$, observe that $\text{Grad}_T(\mathbf{x}, \mathbf{z}) = \nabla F_T(\mathbf{x})$ in all but one coordinate. In that one coordinate, $\text{Grad}_T(\mathbf{x}, \mathbf{z})$ is at most $\frac{\|\nabla F_T(\mathbf{x})\|_\infty}{p} \le \frac{G_0}{p}$. Thus, the bound follows by the triangle inequality. ∎

Next, for any matrix $U \in \mathbb{R}^{d \times T}$ with orthonormal columns, we define $F_{T,U} : \mathbb{R}^d \to \mathbb{R}$ by:

$$
F_{T,U}(\mathbf{x}) = F_T(U^\top \mathbf{x}) .
$$

The associated stochastic gradient oracle is:

$$
\text{Grad}_{T,U}(\mathbf{x}, \mathbf{z}) = U\,\text{Grad}_T(U^\top \mathbf{x}, \mathbf{z}) .
$$

Now, we restate Lemma 5 of Arjevani et al. (2019):

Lemma 24 (Lemma 5 of Arjevani et al. (2019)).

Let $R > 0$ and suppose $A \in \mathcal{A}_{\text{rand}}$ is such that $A$ produces iterates $\mathbf{x}_t$ with $\|\mathbf{x}_t\| \le R$. Let $d \ge \left\lceil 18\frac{R^2 T}{p}\log\frac{2T^2}{pc} \right\rceil$. Suppose $U$ is chosen uniformly at random from the set of $d \times T$ matrices with orthonormal columns. Let $\text{Grad}$ be a probability-$p$ zero chain and let $\text{Grad}_U(\mathbf{x}, \mathbf{z}) = U\,\text{Grad}(U^\top \mathbf{x}, \mathbf{z})$. Let $\mathbf{x}_1, \mathbf{x}_2, \dots$ be the iterates of $A$ when provided the stochastic gradient oracle $\text{Grad}_U$. Then with probability at least $1 - c$ (over the randomness of $U$, the oracle, and also the seed $s$ of $A$):

$$
\text{prog}_{1/4}(U^\top \mathbf{x}_t) < T \quad \text{for all } t \le \frac{T - \log(2/c)}{2p} .
$$

F.2 Defining the "Hard" Instance

Now, we for the first time diverge from the construction of Arjevani et al. (2019) (albeit only slightly). Their construction uses a "shrinking function" $\rho_{R,d} : \mathbb{R}^d \to \mathbb{R}^d$ given by $\rho_{R,d}(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{1 + \|\mathbf{x}\|^2/R^2}}$ as well as an additional quadratic term to overcome the limitation of bounded iterates. We cannot tolerate the non-Lipschitz quadratic term, so we replace it with a Lipschitz version $q_{B,d}(\mathbf{x}) = \mathbf{x}^\top \rho_{B,d}(\mathbf{x}) = \frac{\|\mathbf{x}\|^2}{\sqrt{1 + \|\mathbf{x}\|^2/B^2}}$. Intuitively, $q_{B,d}$ behaves like $\|\mathbf{x}\|^2$ for small enough $\mathbf{x}$, but behaves like $B\|\mathbf{x}\|$ for large $\|\mathbf{x}\|$. Overall, we consider the function:

$$
\hat{F}_{T,U}(\mathbf{x}) = F_{T,U}(\rho_{R,d}(\mathbf{x})) + \eta\, q_{B,d}(\mathbf{x}) = F_T(U^\top \rho_{R,d}(\mathbf{x})) + \eta\, q_{B,d}(\mathbf{x}) .
$$

The stochastic gradient oracle associated with $\hat{F}_{T,U}(\mathbf{x})$ is

$$
\widehat{\text{Grad}}_{T,U}(\mathbf{x}, \mathbf{z}) = J[\rho_{R,d}](\mathbf{x})^\top U\,\text{Grad}_T(U^\top \rho_{R,d}(\mathbf{x}), \mathbf{z}) + \eta \nabla q_{B,d}(\mathbf{x}) ,
$$

where $J[f](\mathbf{x})$ indicates the Jacobian of the function $f$ evaluated at $\mathbf{x}$.

A description of the relevant properties of $q_B$ is provided in Section F.3.

Next we produce a variant on Lemma 6 from Arjevani et al. (2019). This is the most delicate part of our alteration, although the proof is still almost identical to that of Arjevani et al. (2019).

Lemma 25 (variant on Lemma 6 of Arjevani et al. (2019)).

Let $R = B = 60 G_0\sqrt{T}$. Let $\eta = 1/10$ and $c \in (0, 1)$ and $p \in (0, 1)$ and $T \in \mathbb{N}$. Set $d = \left\lceil 18\frac{R^2 T}{p}\log\frac{2T^2}{pc} \right\rceil$ and let $U$ be sampled uniformly from the set of $d \times T$ matrices with orthonormal columns. Define $\hat{F}_{T,U}$ and $\widehat{\text{Grad}}_{T,U}$ as above. Suppose $A \in \mathcal{A}_{\text{rand}}$ and let $\mathbf{x}_1, \mathbf{x}_2, \dots$ be the iterates of $A$ when provided with $\widehat{\text{Grad}}_{T,U}$ as input. Then with probability at least $1 - c$:

$$
\|\nabla \hat{F}_{T,U}(\mathbf{x}_t)\| \ge 1/2 \quad \text{for all } t \le \frac{T - \log(2/c)}{2p} .
$$

Proof.

Define $\mathbf{y}_i = \rho_{R,d}(\mathbf{x}_i)$. Recall the definition:

$$
\text{Grad}_{T,U}(\mathbf{y}, \mathbf{z}) = U\,\text{Grad}_T(U^\top \mathbf{y}, \mathbf{z}) .
$$

Then observe that $\widehat{\text{Grad}}_{T,U}(\mathbf{x}, \mathbf{z})$ can be computed from $\mathbf{x}$ and $\text{Grad}_{T,U}(\mathbf{y}, \mathbf{z})$:

$$
\widehat{\text{Grad}}_{T,U}(\mathbf{x}, \mathbf{z}) = J[\rho_{R,d}](\mathbf{x})^\top \text{Grad}_{T,U}(\mathbf{y}, \mathbf{z}) + \eta \nabla q_{B,d}(\mathbf{x}) .
$$

Therefore, we may consider the $\mathbf{y}_i$ to be the iterates of some different algorithm $A^y \in \mathcal{A}_{\text{rand}}$ applied to the oracle $\text{Grad}_{T,U}(\mathbf{y}, \mathbf{z})$ ($A^y$ computes $\widehat{\text{Grad}}_{T,U}(\mathbf{x}, \mathbf{z})$ from $\text{Grad}_{T,U}(\mathbf{y}, \mathbf{z})$, and then applies the original algorithm $A$ to get $\mathbf{x}$ and then $\rho_{R,d}$ to get $\mathbf{y}$).

Furthermore, it is clear from the definition of $\rho_{R,d}$ that $\|\mathbf{y}_i\| = \|\rho_{R,d}(\mathbf{x}_i)\| \le R$ for all $i$.

All together, this implies that $A^y$ satisfies the conditions of Lemma 24, and so we have that with probability at least $1 - c$:

$$
\text{prog}_{1/4}(U^\top \mathbf{y}_t) < T \quad \text{for all } t \le \frac{T - \log(2/c)}{2p} .
$$

Now, our goal is to show that $\|\nabla \hat{F}_{T,U}(\mathbf{x}_i)\| \ge 1/2$. We consider two cases: either $\|\mathbf{x}_i\| > R/2$ or not.

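First, observe that for $\mathbf{x}_i$ with $\|\mathbf{x}_i\| > R/2$, we have:

$$
\begin{aligned}
\|\nabla \hat{F}_{T,U}(\mathbf{x}_i)\|
&\ge \eta\|\nabla q_{B,d}(\mathbf{x}_i)\| - \|J[\rho_{R,d}](\mathbf{x}_i)\|_{\text{op}}\,\|\nabla F_T(U^\top \mathbf{y}_i)\| \\
&\ge \frac{\eta\|\mathbf{x}_i\|}{\sqrt{1 + \|\mathbf{x}_i\|^2/B^2}} - G_0\sqrt{T} && \text{(using $\|J[\rho_{R,d}](\mathbf{x}_i)\|_{\text{op}} \le 1$, see Arjevani et al. (2019) Lemma 15, and Proposition 29 part 3)} \\
&\ge \frac{\eta B}{\sqrt{5}} - G_0\sqrt{T} && \text{(using $B = R$ and $\|\mathbf{x}_i\| > R/2$)} \\
&\ge \frac{\eta B}{3} - G_0\sqrt{T} \\
&= G_0\sqrt{T} && \text{(recalling $B = R = 60 G_0\sqrt{T}$ and $\eta = 1/10$)} \\
&\ge 1/2 && \text{(recalling $G_0 = 23$)}.
\end{aligned}
$$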

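Alternatively, suppose $\|\mathbf{x}_i\| \le R/2$. Then, let us set $j = \text{prog}_1(U^\top \mathbf{y}_i) + 1 \le T$ (the inequality follows since $\text{prog}_1 \le \text{prog}_{1/4}$). Then, if $\mathbf{u}_j$ indicates the $j$th column of $U$, Lemma 22 implies:

$$
|\langle \mathbf{u}_j, \mathbf{y}_i\rangle| < 1 , \qquad |\langle \mathbf{u}_j, \nabla F_{T,U}(\mathbf{y}_i)\rangle| \ge 1 .
$$

Next, by direct calculation we have $J[\rho_R](\mathbf{x}_i) = \frac{I - \rho_R(\mathbf{x}_i)\rho_R(\mathbf{x}_i)^\top/R^2}{\sqrt{1 + \|\mathbf{x}_i\|^2/R^2}} = \frac{I - \mathbf{y}_i\mathbf{y}_i^\top/R^2}{\sqrt{1 + \|\mathbf{x}_i\|^2/R^2}}$ so that:

$$
\begin{aligned}
\langle \mathbf{u}_j, \nabla \hat{F}_{T,U}(\mathbf{x}_i)\rangle
&= \langle \mathbf{u}_j, J[\rho_R](\mathbf{x}_i)^\top \nabla F_{T,U}(\mathbf{y}_i)\rangle + \eta\langle \mathbf{u}_j, \nabla q_B(\mathbf{x}_i)\rangle \\
&= \frac{\langle \mathbf{u}_j, \nabla F_{T,U}(\mathbf{y}_i)\rangle}{\sqrt{1 + \|\mathbf{x}_i\|^2/R^2}} - \frac{\langle \mathbf{u}_j, \mathbf{y}_i\rangle\langle \mathbf{y}_i, \nabla F_{T,U}(\mathbf{y}_i)\rangle/R^2}{\sqrt{1 + \|\mathbf{x}_i\|^2/R^2}} + \eta\langle \mathbf{u}_j, \nabla q_B(\mathbf{x}_i)\rangle .
\end{aligned}
$$

Now, by Proposition 29, we have $\nabla q_B(\mathbf{x}_i) = \left(2 - \frac{\|\mathbf{y}_i\|^2}{B^2}\right)\mathbf{y}_i$. So, recalling $R = B$:

$$
\langle \mathbf{u}_j, \nabla \hat{F}_{T,U}(\mathbf{x}_i)\rangle = \frac{\langle \mathbf{u}_j, \nabla F_{T,U}(\mathbf{y}_i)\rangle}{\sqrt{1 + \|\mathbf{x}_i\|^2/R^2}} - \frac{\langle \mathbf{u}_j, \mathbf{y}_i\rangle\langle \mathbf{y}_i, \nabla F_{T,U}(\mathbf{y}_i)\rangle/R^2}{\sqrt{1 + \|\mathbf{x}_i\|^2/R^2}} + \eta\left(2 - \frac{\|\mathbf{y}_i\|^2}{R^2}\right)\langle \mathbf{u}_j, \mathbf{y}_i\rangle .
$$

Observing that $\|\mathbf{y}_i\| \le \|\mathbf{x}_i\| \le R/2$ and $|\langle \mathbf{u}_j, \mathbf{y}_i\rangle| < 1$:

$$
\begin{aligned}
|\langle \mathbf{u}_j, \nabla \hat{F}_{T,U}(\mathbf{x}_i)\rangle|
&\ge \frac{2}{\sqrt{5}}|\langle \mathbf{u}_j, \nabla F_{T,U}(\mathbf{y}_i)\rangle| - \frac{\|\nabla F_{T,U}(\mathbf{y}_i)\|}{2R} - 2\eta \\
&\ge \frac{2}{\sqrt{5}} - \frac{G_0\sqrt{T}}{2R} - 2\eta && \text{(using $|\langle \mathbf{u}_j, \nabla F_{T,U}(\mathbf{y}_i)\rangle| \ge 1$ and $\|\nabla F_{T,U}(\mathbf{y}_i)\| \le G_0\sqrt{T}$)} \\
&= \frac{2}{\sqrt{5}} - \frac{1}{120} - \frac{1}{5} > 1/2 && \text{(with $R = 60 G_0\sqrt{T}$ and $\eta = 1/10$)}. \qquad ∎
\end{aligned}
$$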

Next, we observe some basic facts about the function 𝐹 ^ 𝑇 , 𝑈 :

Lemma 26 (variation on Lemma 7 in Arjevani et al. (2019)).

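With the settings of $R$, $B$, $\eta$ in Lemma 25, the function $\hat{F}_{T,U}$ satisfies:

1. $\hat{F}_{T,U}(0) - \inf \hat{F}_{T,U}(\mathbf{x}) \le \gamma_0 T = 12T$.

2. $\|\nabla \hat{F}_{T,U}(\mathbf{x})\| \le G_0\sqrt{T} + 3\eta B \le 437\sqrt{T}$ for all $\mathbf{x}$.

3. $\nabla \hat{F}_{T,U}(\mathbf{x})$ is $(H_0 + 3 + 8\eta)$-Lipschitz, and $H_0 + 3 + 8\eta \le 156$.

4. $\|\widehat{\text{Grad}}_{T,U}(\mathbf{x}, \mathbf{z})\| \le \frac{G_0}{p} + G_0\sqrt{T} + 3\eta B \le \frac{23}{p} + 437\sqrt{T}$ with probability 1.

5. $\widehat{\text{Grad}}_{T,U}$ has variance at most $\frac{G_0^2}{p} \le \frac{23^2}{p}$.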

Proof.

1. This property follows immediately from the fact that $F_T(0) - \inf F_T(\mathbf{x}) \le \gamma_0 T$.

2. Since $\rho_R$ is 1-Lipschitz for all $R$ and $q_B$ is $3B$-Lipschitz (see Proposition 29), $\hat{F}_{T,U}(\mathbf{x})$ is $(G_0\sqrt{T} + 3\eta B)$-Lipschitz, where $G_0 = 23$, $B = 60 G_0\sqrt{T}$, and $\eta = 1/10$.

3. By assumption, $R \ge \max(H_0, 1)$. Thus, by Arjevani et al. (2019, Lemma 16), $\nabla F_T(\rho_R(\mathbf{x}))$ is $(H_0 + 3)$-Lipschitz and so $\nabla \hat{F}_{T,U}$ is $(H_0 + 3 + 8\eta)$-Lipschitz by Proposition 29.

4. Since $\|\text{Grad}_T\| \le \frac{G_0}{p} + G_0\sqrt{T}$, and $J[\rho_R](\mathbf{x})^\top U$ has operator norm at most 1, the bound follows.

5. Just as in the previous part, since $\text{Grad}_T$ has variance $G_0^2/p$ and $J[\rho_R](\mathbf{x})^\top U$ has operator norm at most 1, the bound follows.

∎

Now, we are finally in a position to prove:

Theorem 27.

Given any $\gamma$, $H$, $\epsilon$, and $\sigma$ such that $\frac{\gamma H}{48 \cdot 156\,\epsilon^2} \ge 1$, there exists a distribution over functions $F$ and stochastic first-order oracles $\text{Grad}$ such that with probability 1, $F$ is $H$-smooth, $F(0) - \inf F(\mathbf{x}) \le \gamma$, $F$ is $11\sqrt{H\gamma}$-Lipschitz and $\text{Grad}$ has variance $\sigma^2$, and for any algorithm $A$ in $\mathcal{A}_{\text{rand}}$, with probability at least $1 - c$, when provided a randomly selected $\text{Grad}$, $A$ requires at least $\Omega\left(\frac{\gamma H \sigma^2}{\epsilon^4}\right)$ iterations to output a point $\mathbf{x}$ with $\mathbb{E}[\|\nabla F(\mathbf{x})\|] \le \epsilon$.

Proof.

From Lemma 26 and Lemma 25, we have a distribution over functions $F$ and first-order oracles such that with probability 1, $F$ is $437\sqrt{T}$-Lipschitz, $F$ is $156$-smooth, $F(0) - \inf F(\mathbf{x}) \le 12T$, $\text{Grad}$ has variance at most $G_0^2/p = 23^2/p$, and with probability at least $1 - c$,

$$
\|\nabla F(\mathbf{x}_t)\| \ge 1/2 \quad \text{for all } t \le \frac{T - \log(2/c)}{2p} .
$$

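Now, set $\lambda = \frac{156}{H} \cdot 2\epsilon$, $T = \left\lfloor \frac{156\gamma}{12 H \lambda^2} \right\rfloor = \left\lfloor \frac{\gamma H}{48 \cdot 156\,\epsilon^2} \right\rfloor \ge \frac{\gamma H}{96 \cdot 156\,\epsilon^2}$ and $p = \min\left(\frac{23^2 H^2 \lambda^2}{156^2 \sigma^2}, 1\right)$. Then, define

$$
F_\lambda(\mathbf{x}) = \frac{H\lambda^2}{156} F(\mathbf{x}/\lambda) .
$$

Then $F_\lambda$ is $\frac{H\lambda^2}{156} \cdot \frac{1}{\lambda^2} \cdot 156 = H$-smooth, $F_\lambda(0) - \inf_{\mathbf{x}} F_\lambda(\mathbf{x}) \le 12 \cdot T \cdot \frac{H\lambda^2}{156} \le \gamma$, and $F_\lambda$ is $\frac{437\sqrt{T} H \lambda}{156} \le 11\sqrt{H\gamma}$-Lipschitz. We can construct an oracle $\text{Grad}_\lambda$ from $\text{Grad}$ by:

$$
\text{Grad}_\lambda(\mathbf{x}, \mathbf{z}) = \frac{H\lambda}{156}\,\text{Grad}(\mathbf{x}/\lambda, \mathbf{z}) ,
$$

so that if $p < 1$, we have:

$$
\mathbb{E}\left[\|\text{Grad}_\lambda(\mathbf{x}, \mathbf{z}) - \nabla F_\lambda(\mathbf{x})\|^2\right] \le \frac{H^2\lambda^2}{156^2} \cdot \frac{23^2}{p} = \sigma^2 .
$$

Alternatively, if $p = 1$, clearly the variance is 0.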

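Further, since an oracle for $F_\lambda$ can be constructed from the oracle for $F$, if we run $A$ on $F_\lambda$, with probability at least $1 - c$,

$$
\|\nabla F_\lambda(\mathbf{x}_t)\| = \frac{H\lambda}{156}\|\nabla F(\mathbf{x}_t)\| \ge \epsilon \quad \text{for all } t \le \frac{T - \log(2/c)}{2p} .
$$

Finally, we calculate:

$$
\frac{T - \log(2/c)}{2p} \ge 1.5 \cdot 10^{-8} \cdot \frac{\sigma^2 \gamma H}{\epsilon^4} - 3 \cdot 10^{-4}\,\frac{\sigma^2 \log(2/c)}{\epsilon^2} .
$$

Thus, there exists a constant $K$ and an $\epsilon_0$ such that for all $\epsilon < \epsilon_0$,

$$
\mathbb{E}[\|\nabla F_\lambda(\mathbf{x}_t)\|] \ge \epsilon \quad \text{for all } t \le K\frac{H\gamma\sigma^2}{\epsilon^4} . \qquad ∎
$$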
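The rescaling arithmetic in this proof is easy to mis-track, so the following sketch recomputes the parameters for one arbitrary set of inputs and evaluates the resulting sub-optimality gap, Lipschitz constant and oracle variance of the rescaled function. The function name and the sample values of $\gamma$, $H$, $\epsilon$, $\sigma$ are assumptions chosen only so that the precondition $\frac{\gamma H}{48 \cdot 156\,\epsilon^2} \ge 1$ holds and $p < 1$.

```python
import math

def theorem27_parameters(gamma, H, eps, sigma):
    """Scale lambda, chain length T and Bernoulli parameter p from the proof."""
    lam = (156.0 / H) * 2.0 * eps
    T = math.floor(156.0 * gamma / (12.0 * H * lam**2))
    p = min(23.0**2 * H**2 * lam**2 / (156.0**2 * sigma**2), 1.0)
    return lam, T, p

gamma, H, eps, sigma = 50.0, 2.0, 0.01, 5.0
lam, T, p = theorem27_parameters(gamma, H, eps, sigma)

# Properties of F_lambda(x) = (H lam^2 / 156) F(x / lam), using the constants
# 12 T (gap), 437 sqrt(T) (Lipschitz) and 23^2 / p (variance) of the base F.
gap = 12.0 * T * H * lam**2 / 156.0
lipschitz = 437.0 * math.sqrt(T) * H * lam / 156.0
variance = (H * lam / 156.0) ** 2 * 23.0**2 / p
```

With these inputs the gap stays below $\gamma$, the Lipschitz constant stays below $11\sqrt{H\gamma}$, and the variance equals $\sigma^2$ up to rounding.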

From this result, we have our main lower bound (the formal version of Theorem 18):

Theorem 28.

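For any $\delta$, $\epsilon$, $\gamma$, $G \ge 11\sqrt{2}\sqrt{\frac{\epsilon\gamma}{\delta}}$, there is a distribution over $G$-Lipschitz $C^\infty$ functions $F$ with $F(0) - \inf F(\mathbf{x}) \le \gamma$ and stochastic gradient oracles $\text{Grad}$ with $\mathbb{E}[\|\text{Grad}(\mathbf{x}, \mathbf{z})\|^2] \le G^2$ such that for any algorithm $A \in \mathcal{A}_{\text{rand}}$, if $A$ is provided as input a randomly selected oracle $\text{Grad}$, $A$ will require $\Omega(G^2\gamma/\delta\epsilon^3)$ iterations to identify a point $\mathbf{x}$ with $\mathbb{E}[\|\nabla F(\mathbf{x})\|_\delta] \le \epsilon$.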

Proof.

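From Theorem 27, for any $H$, $\epsilon'$, $\gamma$, and $\sigma$ we have a distribution over $C^\infty$ functions $F$ and oracles $\text{Grad}$ such that $F$ is $H$-smooth, $11\sqrt{H\gamma}$-Lipschitz and $F(0) - \inf F(\mathbf{x}) \le \gamma$ and $\text{Grad}$ has variance $\sigma^2$ such that $A$ requires $\Omega(H\gamma\sigma^2/\epsilon'^4)$ iterations to output a point $\mathbf{x}$ such that $\mathbb{E}[\|\nabla F(\mathbf{x})\|] \le \epsilon'$. Set $\sigma = G/\sqrt{2}$, $H = \epsilon/\delta$ and $\epsilon' = 2\epsilon$. Then, we see that $\text{Grad}$ has variance $G^2/2$, and $F$ is $11\sqrt{H\gamma} = 11\sqrt{\frac{\epsilon\gamma}{\delta}} \le G/\sqrt{2}$-Lipschitz so that $\mathbb{E}[\|\text{Grad}(\mathbf{x}, \mathbf{z})\|^2] \le G^2$. Further, by Proposition 14, if $\|\nabla F(\mathbf{x})\|_\delta \le \epsilon$, then $\|\nabla F(\mathbf{x})\| \le \epsilon + H\delta = 2\epsilon$. Therefore, since $A$ cannot output a point $\mathbf{x}$ with $\mathbb{E}[\|\nabla F(\mathbf{x})\|] \le \epsilon' = 2\epsilon$ in less than $\Omega(H\gamma\sigma^2/\epsilon'^4)$ iterations, we see that $A$ also cannot output a point $\mathbf{x}$ with $\mathbb{E}[\|\nabla F(\mathbf{x})\|_\delta] \le \epsilon$ in less than $\Omega(H\gamma\sigma^2/\epsilon'^4) = \Omega(\gamma G^2/\epsilon^3\delta)$ iterations. ∎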

F.3 Definition and Properties of $q_B$

Consider the function $q_{B,d} : \mathbb{R}^d \to \mathbb{R}$ defined by

$$
q_{B,d}(\mathbf{x}) = \frac{\|\mathbf{x}\|^2}{\sqrt{1 + \|\mathbf{x}\|^2/B^2}} = \mathbf{x}^\top \rho_{B,d}(\mathbf{x}) .
$$

This function has the following properties, all of which follow from direct calculation:

Proposition 29.

𝑞 𝐵 , 𝑑 satisfies:

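1. $\nabla q_{B,d}(\mathbf{x}) = \dfrac{2\mathbf{x}}{\sqrt{1 + \|\mathbf{x}\|^2/B^2}} - \dfrac{\mathbf{x}\|\mathbf{x}\|^2}{B^2(1 + \|\mathbf{x}\|^2/B^2)^{3/2}} = \left(2 - \dfrac{\|\rho_B(\mathbf{x})\|^2}{B^2}\right)\rho_B(\mathbf{x})$.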
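2. $\nabla^2 q_{B,d}(\mathbf{x}) = \dfrac{1}{\sqrt{1 + \|\mathbf{x}\|^2/B^2}}\left(2I - \dfrac{4\mathbf{x}\mathbf{x}^\top}{B^2(1 + \|\mathbf{x}\|^2/B^2)} - \dfrac{\|\mathbf{x}\|^2 I}{B^2(1 + \|\mathbf{x}\|^2/B^2)} + \dfrac{3\|\mathbf{x}\|^2\mathbf{x}\mathbf{x}^\top}{B^4(1 + \|\mathbf{x}\|^2/B^2)^2}\right)$.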

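3. $\dfrac{\|\mathbf{x}\|}{\sqrt{1 + \|\mathbf{x}\|^2/B^2}} \le \|\nabla q_{B,d}(\mathbf{x})\| \le \dfrac{3\|\mathbf{x}\|}{\sqrt{1 + \|\mathbf{x}\|^2/B^2}} \le 3B$.

4. $\|\nabla^2 q_{B,d}(\mathbf{x})\|_{\text{op}} \le \dfrac{8}{\sqrt{1 + \|\mathbf{x}\|^2/B^2}} \le 8$.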
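The gradient formula and the norm bounds of part 3 can be sanity-checked with finite differences. The sketch below implements $\rho_B$, $q_B$ and the closed-form gradient $\left(2 - \|\rho_B(\mathbf{x})\|^2/B^2\right)\rho_B(\mathbf{x})$; the helper names, the value of $B$, and the random test point are assumptions for this example.

```python
import numpy as np

def rho(x, R):
    """Shrinking map x / sqrt(1 + ||x||^2 / R^2); its norm is always below R."""
    return x / np.sqrt(1.0 + np.dot(x, x) / R**2)

def q(x, B):
    """Lipschitz surrogate for ||x||^2: q_B(x) = <x, rho_B(x)>."""
    return float(np.dot(x, rho(x, B)))

def grad_q(x, B):
    """Closed-form gradient (2 - ||rho_B(x)||^2 / B^2) * rho_B(x)."""
    r = rho(x, B)
    return (2.0 - np.dot(r, r) / B**2) * r

def numerical_grad(f, x, step=1e-6):
    """Central finite-difference gradient, used only as a check."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = step
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * step)
    return grad

B = 3.0
rng = np.random.default_rng(1)
x = rng.normal(scale=5.0, size=6)
g = grad_q(x, B)
g_num = numerical_grad(lambda v: q(v, B), x)
shrunk_norm = float(np.linalg.norm(rho(x, B)))   # ||x|| / sqrt(1 + ||x||^2 / B^2)
```

Because $\|\rho_B(\mathbf{x})\| < B$, the factor $2 - \|\rho_B(\mathbf{x})\|^2/B^2$ lies in $(1, 2]$, which gives both sides of the bound in part 3.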

Appendix G Proof of Theorem 13

First, we state and prove a theorem analogous to Theorem 8.

Theorem 30.

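Assume $F : \mathbb{R}^d \to \mathbb{R}$ is well-behaved. In Algorithm 1, set $s_n$ to be a random variable sampled uniformly from $[0, 1]$. Set $T, K \in \mathbb{N}$ and $M = KT$. For $i = 1, \dots, d$, set $u_i^k = -D_\infty \frac{\sum_{t=1}^{T} \frac{\partial F(\mathbf{w}_t^k)}{\partial x_i}}{\left|\sum_{t=1}^{T} \frac{\partial F(\mathbf{w}_t^k)}{\partial x_i}\right|}$ for some $D_\infty > 0$. Finally, suppose $\text{Var}(g_{n,i}) \le \sigma_i^2$ for $i = 1, \dots, d$. Then, we have

$$
\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K} \left\|\frac{1}{T}\sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k)\right\|_1\right] \le \frac{F(\mathbf{x}_0) - F^\star}{D_\infty M} + \frac{\mathbb{E}[R_T(\mathbf{u}^1, \dots, \mathbf{u}^K)]}{D_\infty M} + \frac{\sum_{i=1}^{d} \sigma_i}{\sqrt{T}} .
$$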

Proof.

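In Theorem 7, set $\mathbf{u}_n$ to be equal to $\mathbf{u}^1$ for the first $T$ iterations, $\mathbf{u}^2$ for the second $T$ iterations and so on. In other words, $\mathbf{u}_n = \mathbf{u}^{\lceil n/T \rceil}$ for $n = 1, \dots, M$.

From Theorem 7, we have

$$
\mathbb{E}[F(\mathbf{x}_M)] = F(\mathbf{x}_0) + \mathbb{E}[R_T(\mathbf{u}^1, \dots, \mathbf{u}^K)] + \mathbb{E}\left[\sum_{n=1}^{M} \langle \mathbf{g}_n, \mathbf{u}_n\rangle\right] .
$$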

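Now, since $u_i^k = -D_\infty \frac{\sum_{t=1}^{T} \frac{\partial F(\mathbf{w}_t^k)}{\partial x_i}}{\left|\sum_{t=1}^{T} \frac{\partial F(\mathbf{w}_t^k)}{\partial x_i}\right|}$, $\mathbb{E}[\mathbf{g}_n] = \nabla F(\mathbf{w}_n)$, and $\text{Var}(g_{n,i}) \le \sigma_i^2$ for $i = 1, \dots, d$, we have

$$
\begin{aligned}
\mathbb{E}\left[\sum_{n=1}^{M} \langle \mathbf{g}_n, \mathbf{u}_n\rangle\right]
&\le \mathbb{E}\left[\sum_{k=1}^{K} \left\langle \sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k), \mathbf{u}^k\right\rangle + D_\infty \sum_{k=1}^{K} \left\|\sum_{t=1}^{T} \left(\nabla F(\mathbf{w}_t^k) - \mathbf{g}_{(k-1)T+t}\right)\right\|_1\right] \\
&\le \mathbb{E}\left[\sum_{k=1}^{K} \left\langle \sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k), \mathbf{u}^k\right\rangle\right] + D_\infty K\sqrt{T}\sum_{i=1}^{d} \sigma_i \\
&= \mathbb{E}\left[-\sum_{k=1}^{K} D_\infty T\left\|\frac{1}{T}\sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k)\right\|_1\right] + D_\infty K\sqrt{T}\sum_{i=1}^{d} \sigma_i .
\end{aligned}
$$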

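Putting this all together, we have

$$
F^\star \le \mathbb{E}[F(\mathbf{x}_M)] \le F(\mathbf{x}_0) + \mathbb{E}[R_T(\mathbf{u}^1, \dots, \mathbf{u}^K)] + D_\infty K\sqrt{T}\sum_{i=1}^{d} \sigma_i - D_\infty T\sum_{k=1}^{K} \mathbb{E}\left[\left\|\frac{1}{T}\sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k)\right\|_1\right] .
$$

Dividing by $KTD_\infty = D_\infty M$ and reordering, we have the stated bound. ∎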

We can now prove Theorem 13.

Proof of Theorem 13.

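Since $\mathcal{A}$ guarantees $\|\boldsymbol{\Delta}_n\|_\infty \le D_\infty$, for all $n < n' \le T + n - 1$, we have

$$
\begin{aligned}
\|\mathbf{w}_n - \mathbf{w}_{n'}\|_\infty
&= \|\mathbf{x}_n - (1 - s_n)\boldsymbol{\Delta}_n - \mathbf{x}_{n'-1} - s_{n'}\boldsymbol{\Delta}_{n'}\|_\infty \\
&\le \left\|\sum_{i=n+1}^{n'-1} \boldsymbol{\Delta}_i\right\|_\infty + \|\boldsymbol{\Delta}_n\|_\infty + \|\boldsymbol{\Delta}_{n'}\|_\infty \\
&\le D_\infty\left((n' - 1) - (n + 1) + 1\right) + 2D_\infty \\
&= D_\infty(n' - n + 1) \\
&\le D_\infty T .
\end{aligned}
$$

Therefore, we clearly have $\|\mathbf{w}_t^k - \bar{\mathbf{w}}^k\|_\infty \le D_\infty T = \delta$.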

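Note that from the choice of $K$ and $T$ we have $M = KT \ge N - T \ge N/2$. Now, observe that $\text{Var}(g_{n,i}) \le \mathbb{E}[g_{n,i}^2] \le G_i^2$. Thus, applying Theorem 30 in concert with the additional assumption $\mathbb{E}[R_T(\mathbf{u}^1, \dots, \mathbf{u}^K)] \le D_\infty K\sqrt{T}\sum_{i=1}^{d} G_i$, we have

$$
\begin{aligned}
\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K} \left\|\frac{1}{T}\sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k)\right\|_1\right]
&\le 2\frac{F(\mathbf{x}_0) - F^\star}{D_\infty N} + 2\frac{K D_\infty \sqrt{T}\sum_{i=1}^{d} G_i}{D_\infty N} + \frac{\sum_{i=1}^{d} G_i}{\sqrt{T}} \\
&\le \frac{2T(F(\mathbf{x}_0) - F^\star)}{\delta N} + \frac{3\sum_{i=1}^{d} G_i}{\sqrt{T}} \\
&\le \max\left(\frac{5\left(\sum_{i=1}^{d} G_i\right)^{2/3}(F(\mathbf{x}_0) - F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6\sum_{i=1}^{d} G_i}{\sqrt{N}}\right) + \frac{2(F(\mathbf{x}_0) - F^\star)}{\delta N} ,
\end{aligned}
$$

where the last inequality is due to the choice of $T$.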

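Now to conclude, observe that $\|\mathbf{w}_t^k - \bar{\mathbf{w}}^k\|_\infty \le \delta$ for all $t$ and $k$, and also that $\bar{\mathbf{w}}^k = \frac{1}{T}\sum_{t=1}^{T} \mathbf{w}_t^k$. Therefore $S = \{\mathbf{w}_1^k, \dots, \mathbf{w}_T^k\}$ satisfies the conditions in the infimum in Definition 12 so that $\|\nabla F(\bar{\mathbf{w}}^k)\|_{1,\delta} \le \left\|\frac{1}{T}\sum_{t=1}^{T} \nabla F(\mathbf{w}_t^k)\right\|_1$. ∎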

Appendix H Directional Derivative Setting

In the main text, our algorithms make use of a stochastic gradient oracle. However, the prior work of Zhang et al. (2020b) instead considers a stochastic directional gradient oracle. This is a less common setup, and other works (e.g., Davis et al. (2021)) have also taken our route of tackling non-smooth optimization via an oracle that returns gradients at points of differentiability.

Nevertheless, all our results extend easily to the exact setting of Zhang et al. (2020b) in which 𝐹 is Lipschitz and directionally differentiable and we have access to a stochastic directional gradient oracle rather than a stochastic gradient oracle. To quantify this setting, we need a bit more notation which we copy directly from Zhang et al. (2020b) below:

First, from Clarke (1990) and Zhang et al. (2020b), the generalized directional derivative of a function $F$ in a direction $\mathbf{d}$ is

$$
F^\circ(\mathbf{x}, \mathbf{d}) = \limsup_{\mathbf{y} \to \mathbf{x},\ t \downarrow 0} \frac{F(\mathbf{y} + t\mathbf{d}) - F(\mathbf{y})}{t} . \tag{4}
$$

Further, the generalized gradient is the set

$$
\partial F(\mathbf{x}) = \{\mathbf{g} : \langle \mathbf{g}, \mathbf{d}\rangle \le F^\circ(\mathbf{x}, \mathbf{d}) \text{ for all } \mathbf{d}\} .
$$

Finally, $F : \mathbb{R}^d \to \mathbb{R}$ is Hadamard directionally differentiable in the direction $\mathbf{v} \in \mathbb{R}^d$ if for any function $\psi : \mathbb{R}_+ \to \mathbb{R}^d$ such that $\lim_{t \to 0} \frac{\psi(t) - \psi(0)}{t} = \mathbf{v}$ and $\psi(0) = \mathbf{x}$, the following limit exists:

$$
\lim_{t \to 0} \frac{F(\psi(t)) - F(\mathbf{x})}{t} .
$$

If $F$ is Hadamard directionally differentiable, then the above limit is denoted $F'(\mathbf{x}, \mathbf{v})$. When $F$ is Hadamard directionally differentiable for all $\mathbf{x}$ and $\mathbf{v}$, then we say simply that $F$ is directionally differentiable.

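With these definitions, a stochastic directional oracle for a Lipschitz, directionally differentiable, and bounded from below function $F$ is an oracle $\text{Grad}(\mathbf{x}, \mathbf{v}, \mathbf{z})$ that outputs $\mathbf{g} \in \partial F(\mathbf{x})$ such that $\langle \mathbf{g}, \mathbf{v}\rangle = F'(\mathbf{x}, \mathbf{v})$. In this case, Zhang et al. (2020b) shows (Lemma 3) that $F$ satisfies an alternative notion of well-behavedness:

$$
F(\mathbf{y}) - F(\mathbf{x}) = \int_0^1 \left\langle \mathbb{E}[\text{Grad}(\mathbf{x} + t(\mathbf{y} - \mathbf{x}), \mathbf{y} - \mathbf{x}, \mathbf{z})], \mathbf{y} - \mathbf{x}\right\rangle \mathrm{d}t . \tag{5}
$$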

Next, we define:

Definition 31.

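A point $\mathbf{x}$ is a $(\delta, \epsilon)$ stationary point of $F$ for the generalized gradient if there is a set of points $S$ contained in the ball of radius $\delta$ centered at $\mathbf{x}$ such that for $\mathbf{y}$ selected uniformly at random from $S$, $\mathbb{E}[\mathbf{y}] = \mathbf{x}$ and for all $\mathbf{y}$ there is a choice of $\mathbf{g}_{\mathbf{y}} \in \partial F(\mathbf{y})$ such that $\|\mathbb{E}[\mathbf{g}_{\mathbf{y}}]\| \le \epsilon$.

Similarly, we have the definition:

Definition 32.

Given a point $\mathbf{x}$, and a number $\delta > 0$, define:

$$
\|\partial F(\mathbf{x})\|_\delta \triangleq \inf_{\substack{S \subset B(\mathbf{x}, \delta),\ \frac{1}{|S|}\sum_{\mathbf{y} \in S} \mathbf{y} = \mathbf{x}, \\ \mathbf{g}_{\mathbf{y}} \in \partial F(\mathbf{y})}} \left\|\frac{1}{|S|}\sum_{\mathbf{y} \in S} \mathbf{g}_{\mathbf{y}}\right\| .
$$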

In fact, whenever a locally Lipschitz function $F$ is differentiable at a point $\mathbf{x}$, we have that $\nabla F(\mathbf{x}) \in \partial F(\mathbf{x})$, so that $\|\partial F(\mathbf{x})\|_\delta \le \|\nabla F(\mathbf{x})\|_\delta$. Thus our results in the main text also bound $\|\partial F(\mathbf{x})\|_\delta$. However, while a gradient oracle is also a directional derivative oracle, a directional derivative oracle is only guaranteed to be a gradient oracle if $F$ is continuously differentiable at the queried point $\mathbf{x}$. This technical issue means that when we have access to a directional derivative oracle rather than a gradient oracle, we will only bound $\|\partial F(\mathbf{x})\|_\delta$ rather than $\|\nabla F(\mathbf{x})\|_\delta$.

Despite this technical complication, our overall strategy is essentially identical. The key observation is that the only time at which we used the properties of the gradient previously was when we invoked well-behavedness of 𝐹 . When we have a directional derivative instead of the gradient, the alternative notion of well-behavedness in (5) will play an identical role. Thus, our approach is simply to replace the call to Grad ( 𝐰 𝑛 , 𝐳 𝑛 ) in Algorithm 1 with a call instead to Grad ( 𝐰 𝑛 , 𝚫 𝑛 , 𝐳 𝑛 ) (see Algorithm 4). With this change, all of our analysis in the main text applies almost without modification. Essentially, we only need to change notation in a few places to reflect the updated definitions.

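Algorithm 4 Online-to-Non-Convex Conversion (directional derivative oracle version)

Input: Initial point $\mathbf{x}_0$, $K \in \mathbb{N}$, $T \in \mathbb{N}$, online learning algorithm $\mathcal{A}$, $s_n$ for all $n$.

1. Set $M = K \cdot T$.
2. For $n = 1, \dots, M$:
   - Get $\boldsymbol{\Delta}_n$ from $\mathcal{A}$.
   - Set $\mathbf{x}_n = \mathbf{x}_{n-1} + \boldsymbol{\Delta}_n$.
   - Set $\mathbf{w}_n = \mathbf{x}_{n-1} + s_n\boldsymbol{\Delta}_n$.
   - Sample random $\mathbf{z}_n$.
   - Generate directional derivative $\mathbf{g}_n = \text{Grad}(\mathbf{w}_n, \boldsymbol{\Delta}_n, \mathbf{z}_n)$.
   - Send $\mathbf{g}_n$ to $\mathcal{A}$ as gradient.
3. Set $\mathbf{w}_t^k = \mathbf{w}_{(k-1)T+t}$ for $k = 1, \dots, K$ and $t = 1, \dots, T$.
4. Set $\bar{\mathbf{w}}^k = \frac{1}{T}\sum_{t=1}^{T} \mathbf{w}_t^k$ for $k = 1, \dots, K$.
5. Return $\{\bar{\mathbf{w}}^1, \dots, \bar{\mathbf{w}}^K\}$.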
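The steps of Algorithm 4 can be sketched as follows. This is a minimal illustration, assuming that the online learner $\mathcal{A}$ is projected online gradient descent restarted every $T$ rounds, that $F(\mathbf{x}) = \|\mathbf{x}\|_1$ is the objective, and that the oracle adds Gaussian noise to a generalized gradient. The function names, the noise level, and the step size are hypothetical choices, not values prescribed by the text.

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto {x : ||x|| <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def l1_directional_oracle(w, direction, rng, noise=0.1):
    """Noisy generalized gradient of F(x) = ||x||_1.

    At a kink (w_i == 0) the entry sign(direction_i) is used, so that the
    noiseless output g satisfies <g, direction> = F'(w, direction).
    """
    subgrad = np.where(w != 0.0, np.sign(w), np.sign(direction))
    return subgrad + noise * rng.normal(size=w.shape)

def online_to_nonconvex(oracle, x0, K, T, D, eta, rng):
    """Algorithm 4 with OGD (reset every T rounds) as the online learner."""
    x = np.array(x0, dtype=float)
    delta = np.zeros_like(x)
    queries = []
    for n in range(K * T):
        if n % T == 0:
            delta = np.zeros_like(x)      # restart the learner for each block
        s = rng.uniform()                 # s_n ~ Uniform[0, 1]
        w = x + s * delta                 # w_n = x_{n-1} + s_n * Delta_n
        x = x + delta                     # x_n = x_{n-1} + Delta_n
        g = oracle(w, delta, rng)         # g_n = Grad(w_n, Delta_n, z_n)
        queries.append(w)
        delta = project_l2_ball(delta - eta * g, D)   # OGD update of Delta
    W = np.stack(queries).reshape(K, T, -1)
    return W.mean(axis=1), W              # block averages and all query points

rng = np.random.default_rng(0)
K, T, delta_radius = 5, 20, 0.5
D = delta_radius / T
w_bar, W = online_to_nonconvex(l1_directional_oracle, np.ones(3), K, T, D,
                               eta=D / np.sqrt(T), rng=rng)
```

Because every update satisfies $\|\boldsymbol{\Delta}_n\| \le D = \delta/T$, each query point in a block stays within $\delta$ of the block average that is returned.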

To begin this notational update, the counterpart to Theorem 7 is:

Theorem 33.

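Suppose $F$ is Lipschitz and directionally differentiable. With the notation in Algorithm 4, if we let $s_n$ be independent random variables uniformly distributed in $[0, 1]$, then for any sequence of vectors $\mathbf{u}_1, \dots, \mathbf{u}_M$, we have the equality:

$$
\mathbb{E}[F(\mathbf{x}_M)] = F(\mathbf{x}_0) + \mathbb{E}\left[\sum_{n=1}^{M} \langle \mathbf{g}_n, \boldsymbol{\Delta}_n - \mathbf{u}_n\rangle\right] + \mathbb{E}\left[\sum_{n=1}^{M} \langle \mathbf{g}_n, \mathbf{u}_n\rangle\right] .
$$

Proof.

$$
\begin{aligned}
F(\mathbf{x}_n) - F(\mathbf{x}_{n-1})
&= \int_0^1 \left\langle \mathbb{E}[\text{Grad}(\mathbf{x}_{n-1} + s(\mathbf{x}_n - \mathbf{x}_{n-1}), \mathbf{x}_n - \mathbf{x}_{n-1}, \mathbf{z}_n)], \mathbf{x}_n - \mathbf{x}_{n-1}\right\rangle \mathrm{d}s \\
&= \mathbb{E}[\langle \mathbf{g}_n, \boldsymbol{\Delta}_n\rangle] \\
&= \mathbb{E}[\langle \mathbf{g}_n, \boldsymbol{\Delta}_n - \mathbf{u}_n\rangle + \langle \mathbf{g}_n, \mathbf{u}_n\rangle] ,
\end{aligned}
$$

where in the second line we have used the definition $\mathbf{g}_n = \text{Grad}(\mathbf{x}_{n-1} + s_n(\mathbf{x}_n - \mathbf{x}_{n-1}), \mathbf{x}_n - \mathbf{x}_{n-1}, \mathbf{z}_n)$, the assumption that $s_n$ is uniform on $[0, 1]$, and Fubini's theorem (as $\text{Grad}$ is bounded by Lipschitzness of $F$). Now, sum over $n$ and telescope to obtain the stated bound.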

∎

Next, we have the following analog of Theorem 8:

Theorem 34.

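With the notation in Algorithm 4, set $s_n$ to be a random variable sampled uniformly from $[0, 1]$. Set $T, K \in \mathbb{N}$ and $M = KT$. Define $\nabla_t^k = \mathbb{E}[\mathbf{g}_{(k-1)T+t}]$. Define $\mathbf{u}^k = -D\,\frac{\sum_{t=1}^{T} \nabla_t^k}{\left\|\sum_{t=1}^{T} \nabla_t^k\right\|}$ for some $D > 0$ for $k = 1, \dots, K$. Finally, suppose $\text{Var}(\mathbf{g}_n) = \sigma^2$. Then:

$$
\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K} \left\|\frac{1}{T}\sum_{t=1}^{T} \nabla_t^k\right\|\right] \le \frac{F(\mathbf{x}_0) - F^\star}{DM} + \frac{\mathbb{E}[R_T(\mathbf{u}^1, \dots, \mathbf{u}^K)]}{DM} + \frac{\sigma}{\sqrt{T}} .
$$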

Proof.

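The proof is essentially identical to that of Theorem 8. In Theorem 33, set $\mathbf{u}_n$ to be equal to $\mathbf{u}^1$ for the first $T$ iterations, $\mathbf{u}^2$ for the second $T$ iterations and so on. In other words, $\mathbf{u}_n = \mathbf{u}^{\lceil n/T \rceil}$ for $n = 1, \dots, M$. So, we have

$$
\mathbb{E}[F(\mathbf{x}_M)] = F(\mathbf{x}_0) + \mathbb{E}[R_T(\mathbf{u}^1, \dots, \mathbf{u}^K)] + \mathbb{E}\left[\sum_{n=1}^{M} \langle \mathbf{g}_n, \mathbf{u}_n\rangle\right] .
$$

Now, since $\mathbf{u}^k = -D\,\frac{\sum_{t=1}^{T} \nabla_t^k}{\left\|\sum_{t=1}^{T} \nabla_t^k\right\|}$, and $\text{Var}(\mathbf{g}_n) = \sigma^2$, we have

$$
\begin{aligned}
\mathbb{E}\left[\sum_{n=1}^{M} \langle \mathbf{g}_n, \mathbf{u}_n\rangle\right]
&\le \mathbb{E}\left[\sum_{k=1}^{K} \left\langle \sum_{t=1}^{T} \nabla_t^k, \mathbf{u}^k\right\rangle\right] + \mathbb{E}\left[D\sum_{k=1}^{K} \left\|\sum_{t=1}^{T} \left(\nabla_t^k - \mathbf{g}_{(k-1)T+t}\right)\right\|\right] \\
&\le \mathbb{E}\left[\sum_{k=1}^{K} \left\langle \sum_{t=1}^{T} \nabla_t^k, \mathbf{u}^k\right\rangle\right] + D\sigma K\sqrt{T} \\
&= \mathbb{E}\left[-\sum_{k=1}^{K} DT\left\|\frac{1}{T}\sum_{t=1}^{T} \nabla_t^k\right\|\right] + D\sigma K\sqrt{T} .
\end{aligned}
$$

Putting this all together, we have

$$
F^\star \le \mathbb{E}[F(\mathbf{x}_M)] \le F(\mathbf{x}_0) + \mathbb{E}[R_T(\mathbf{u}^1, \dots, \mathbf{u}^K)] + \sigma DK\sqrt{T} - DT\sum_{k=1}^{K} \mathbb{E}\left[\left\|\frac{1}{T}\sum_{t=1}^{T} \nabla_t^k\right\|\right] .
$$

Dividing by $KDT = DM$ and reordering, we have the stated bound. ∎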

Finally, we instantiate Theorem 34 with online gradient descent to obtain the analog of Corollary 9. This result establishes that the online-to-non-convex conversion finds a $(\delta, \epsilon)$ critical point in $O(1/\epsilon^3\delta)$ iterations, even when using a directional derivative oracle. Further, our lower bound construction makes use of continuously differentiable functions, for which the directional derivative oracle and the standard gradient oracle must coincide. Thus the $O(1/\epsilon^3\delta)$ complexity is optimal in this setting as well.

Corollary 35.

Suppose we have a budget of $N$ gradient evaluations. Under the assumptions and notation of Theorem 34, suppose in addition that $\mathbb{E}[\|\mathbf{g}_n\|^2] \le G^2$ and that $\mathcal{A}$ guarantees $\|\boldsymbol{\Delta}_n\| \le D$ for some user-specified $D$ for all $n$ and ensures the worst-case $K$-shifting regret bound $\mathbb{E}[R_T(\mathbf{u}^1,\dots,\mathbf{u}^K)] \le DGK\sqrt{T}$ for all $\|\mathbf{u}^k\| \le D$ (e.g., as achieved by the OGD algorithm that is reset every $T$ iterations). Let $\delta > 0$ be an arbitrary number. Set $D = \delta/T$, $T = \min\left(\left\lceil\left(\frac{GN\delta}{F(\mathbf{x}_0)-F^\star}\right)^{2/3}\right\rceil, \frac{N}{2}\right)$, and $K = \lfloor N/T \rfloor$. Then, for all $k$ and $t$, $\|\overline{\mathbf{w}}^k - \mathbf{w}_t^k\| \le \delta$.

Moreover, we have the inequality

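$$
\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla_t^k\right\|\right] \le \frac{2(F(\mathbf{x}_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(\mathbf{x}_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt{N}}\right),
$$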

which implies

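$$
\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\partial F(\overline{\mathbf{w}}^k)\right\|_\delta\right] \le \frac{2(F(\mathbf{x}_0)-F^\star)}{\delta N} + \max\left(\frac{5G^{2/3}(F(\mathbf{x}_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt{N}}\right).
$$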

Proof.

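Since $\mathcal{A}$ guarantees $\|\boldsymbol{\Delta}_n\| \le D$, for all $n < n' \le T + n - 1$, we have

$$
\begin{aligned}
\|\mathbf{w}_n - \mathbf{w}_{n'}\|
&= \left\|\mathbf{x}_n - (1 - s_n)\boldsymbol{\Delta}_n - \mathbf{x}_{n'-1} - s_{n'}\boldsymbol{\Delta}_{n'}\right\| \\
&\le \left\|\sum_{i=n+1}^{n'-1}\boldsymbol{\Delta}_i\right\| + \|\boldsymbol{\Delta}_n\| + \|\boldsymbol{\Delta}_{n'}\| \\
&\le D\big((n'-1) - (n+1) + 1\big) + 2D \\
&= D(n' - n + 1) \le DT.
\end{aligned}
$$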

Therefore, we clearly have $\|\mathbf{w}_t^k - \overline{\mathbf{w}}^k\| \le DT = \delta$.
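The distance bound above depends only on the step-size constraint $\|\boldsymbol{\Delta}_n\| \le D$, so it can be checked empirically. The following is a minimal sketch, assuming projected OGD restarted every $T$ steps as the online learner; the toy oracle `noisy_l1_grad`, the helper names, and all numeric values are hypothetical choices made for illustration, not part of the paper.

```python
import math
import random


def project_ball(v, radius):
    """Euclidean projection of v onto the ball of the given radius."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm <= radius:
        return list(v)
    return [c * radius / norm for c in v]


def run_conversion(grad_oracle, x0, K, T, D, eta, rng):
    """Online-to-non-convex conversion with projected OGD reset every T steps.

    Returns the query points w_n grouped into K blocks of length T.
    """
    x = list(x0)
    blocks = []
    for _ in range(K):
        step = [0.0] * len(x0)            # OGD restart: first update is 0
        block = []
        for _ in range(T):
            s = rng.random()              # s_n drawn uniformly from [0, 1]
            w = [xi + s * di for xi, di in zip(x, step)]   # w_n = x_{n-1} + s_n * step
            x = [xi + di for xi, di in zip(x, step)]       # x_n = x_{n-1} + step
            g = grad_oracle(w, rng)
            block.append(w)
            # OGD update followed by projection onto the ball of radius D
            step = project_ball([di - eta * gi for di, gi in zip(step, g)], D)
        blocks.append(block)
    return blocks


def noisy_l1_grad(w, rng):
    """Hypothetical oracle: a subgradient of sum_i |w_i| plus Gaussian noise."""
    return [math.copysign(1.0, wi) + 0.1 * rng.gauss(0.0, 1.0) for wi in w]


def max_deviation(blocks):
    """Largest distance between a query point and the average of its block."""
    worst = 0.0
    for block in blocks:
        dim = len(block[0])
        mean = [sum(w[i] for w in block) / len(block) for i in range(dim)]
        for w in block:
            worst = max(worst, math.dist(w, mean))
    return worst


rng = random.Random(0)
delta, T, K = 0.05, 20, 10
blocks = run_conversion(noisy_l1_grad, [1.0, -2.0, 0.5], K, T, delta / T,
                        eta=0.01, rng=rng)
print(max_deviation(blocks) <= delta + 1e-12)
```

The check holds regardless of the oracle or the learning rate, because only the projection step is used to bound how far the iterates move within a block.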

Note that from the choice of $K$ and $T$ we have $M = KT \ge N - T \ge N/2$. So, for the second fact, notice that $\mathrm{Var}(\mathbf{g}_n) \le G^2$ for all $n$. Thus, applying Theorem 34 in concert with the additional assumption $\mathbb{E}[R_T(\mathbf{u}^1,\dots,\mathbf{u}^K)] \le DGK\sqrt{T}$, we have:

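$$
\begin{aligned}
\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left\|\frac{1}{T}\sum_{t=1}^{T}\nabla_t^k\right\|\right]
&\le \frac{2(F(\mathbf{x}_0)-F^\star)}{DN} + \frac{2KDG\sqrt{T}}{DN} + \frac{G}{\sqrt{T}} \\
&\le \frac{2T(F(\mathbf{x}_0)-F^\star)}{\delta N} + \frac{3G}{\sqrt{T}} \\
&\le \max\left(\frac{5G^{2/3}(F(\mathbf{x}_0)-F^\star)^{1/3}}{(N\delta)^{1/3}}, \frac{6G}{\sqrt{N}}\right) + \frac{2(F(\mathbf{x}_0)-F^\star)}{\delta N},
\end{aligned}
$$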

where the last inequality is due to the choice of 𝑇 .
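To make the parameter choices concrete, the sketch below computes $T$, $K$, and $D$ from a budget $N$ and compares the intermediate expression $2T(F(\mathbf{x}_0)-F^\star)/(\delta N) + 3G/\sqrt{T}$ with the final bound for one set of inputs. The function name `corollary_parameters`, the argument `gap` (standing for $F(\mathbf{x}_0)-F^\star$), and the numeric values are assumptions made for illustration. The code uses integer division for $N/2$ so that $T$ is an integer, which is also an assumption.

```python
import math


def corollary_parameters(N, G, delta, gap):
    """Return (T, K, D, bound) for a budget of N gradient evaluations.

    gap stands for F(x_0) - F_star. Assumes N >= 2 and gap > 0.
    """
    if N < 2 or gap <= 0:
        raise ValueError("need N >= 2 and gap > 0")
    t_uncapped = math.ceil((G * N * delta / gap) ** (2.0 / 3.0))
    T = max(1, min(t_uncapped, N // 2))   # integer version of min(ceil(...), N / 2)
    K = N // T                            # number of blocks, floor(N / T)
    D = delta / T                         # per-step radius, so that D * T = delta
    bound = 2.0 * gap / (delta * N) + max(
        5.0 * G ** (2.0 / 3.0) * gap ** (1.0 / 3.0) / (N * delta) ** (1.0 / 3.0),
        6.0 * G / math.sqrt(N),
    )
    return T, K, D, bound


# Hypothetical inputs chosen only to exercise the formulas.
N, G, delta, gap = 10_000, 1.0, 0.1, 1.0
T, K, D, bound = corollary_parameters(N, G, delta, gap)
intermediate = 2.0 * T * gap / (delta * N) + 3.0 * G / math.sqrt(T)
print(T, K, D, intermediate, bound)
```

For these inputs the intermediate expression stays below the final bound, which is consistent with the last inequality of the proof. This is a single numerical instance and not a substitute for the argument.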

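Finally, observe that $\|\mathbf{w}_t^k - \overline{\mathbf{w}}^k\| \le \delta$ for all $t$ and $k$, and also that $\overline{\mathbf{w}}^k = \frac{1}{T}\sum_{t=1}^{T}\mathbf{w}_t^k$. Therefore $S = \{\mathbf{w}_1^k,\dots,\mathbf{w}_T^k\}$ satisfies the conditions in the infimum in Definition 32 so that $\|\partial F(\overline{\mathbf{w}}^k)\|_\delta \le \left\|\frac{1}{T}\sum_{t=1}^{T}\nabla_t^k\right\|$. ∎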
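To see how the bound of Corollary 35 translates into the $O(\epsilon^{-3}\delta^{-1})$ iteration count, one can ask how large $N$ must be for each term on the right-hand side to be at most $\epsilon/2$. A sketch of the arithmetic, writing $\Gamma = F(\mathbf{x}_0) - F^\star$ (shorthand introduced here):

$$
\begin{aligned}
\frac{5G^{2/3}\Gamma^{1/3}}{(N\delta)^{1/3}} \le \frac{\epsilon}{2}
&\iff N \ge \frac{1000\,G^2\,\Gamma}{\delta\,\epsilon^3}, \\
\frac{6G}{\sqrt{N}} \le \frac{\epsilon}{2}
&\iff N \ge \frac{144\,G^2}{\epsilon^2}, \\
\frac{2\Gamma}{\delta N} \le \frac{\epsilon}{2}
&\iff N \ge \frac{4\,\Gamma}{\delta\,\epsilon}.
\end{aligned}
$$

Taking $N$ at least as large as the maximum of the three thresholds makes the right-hand side at most $\epsilon$. For small $\epsilon$ and fixed $G$, $\Gamma$, and $\delta$, the first threshold dominates, which gives $N = O(\epsilon^{-3}\delta^{-1})$.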

