102 kB

Title: Scalable Frank-Wolfe on Generalized Self-concordant Functions via Simple Steps

URL Source: https://arxiv.org/html/2105.13913

Markdown Content: Back to arXiv

This is experimental HTML to improve accessibility. We invite you to report rendering errors. Use Alt+Y to toggle on accessible reporting links and Alt+Shift+Y to toggle off. Learn more about this project and help improve conversions.

Why HTML? Report Issue Back to Abstract Download PDF 1Introduction 2Frank-Wolfe Convergence Guarantees 3Away-step and Blended Pairwise Conditional Gradients 4Computational experiments 5A stateless simple step variant License: arXiv.org perpetual non-exclusive license arXiv:2105.13913v8 [math.OC] 08 Apr 2024 Scalable Frank-Wolfe on Generalized Self-concordant Functions via Simple Steps \nameAlejandro Carderera \emailalejandro.carderera@gatech.edu \addrDepartment of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, USA \AND\nameMathieu Besançon \emailbesancon@zib.de \addrZuse Institute Berlin, Germany, and Univ. Grenoble Alpes, Inria, LIG, Grenoble, France \AND\nameSebastian Pokutta \emailpokutta@zib.de \addrInstitute of Mathematics Technische Universität Berlin and Zuse Institute Berlin, Germany Abstract

Generalized self-concordance is a key property present in the objective function of many important learning problems. We establish the convergence rate of a simple Frank-Wolfe variant that uses the open-loop step size strategy 𝛾 𝑡

2 / ( 𝑡 + 2 ) , obtaining a 𝒪 ⁢ ( 1 / 𝑡 ) convergence rate for this class of functions in terms of primal gap and Frank-Wolfe gap, where 𝑡 is the iteration count. This avoids the use of second-order information or the need to estimate local smoothness parameters of previous work. We also show improved convergence rates for various common cases, e.g., when the feasible region under consideration is uniformly convex or polyhedral.

1Introduction

Constrained convex optimization is the cornerstone of many machine learning problems. We consider such problems, formulated as:

min 𝐱 ∈ 𝒳 ⁡ 𝑓 ⁢ ( 𝐱 ) ,

(1.1)

where 𝑓 : ℝ 𝑛 → ℝ ∪ { + ∞ } is a generalized self-concordant function and 𝒳 ⊆ ℝ 𝑛 is a compact convex set. When computing projections onto the feasible regions as required in, e.g., projected gradient descent, is prohibitive, Frank-Wolfe (FW) [Frank & Wolfe, 1956] algorithms (a.k.a. Conditional Gradients (CG) [Levitin & Polyak, 1966]) are the algorithms of choice, relying on Linear Minimization Oracles (LMO) at each iteration to solve Problem (1.1). The analysis of their convergence often relies on the assumption that the gradient is Lipschitz-continuous. This assumption does not necessarily hold for generalized self-concordant functions, an important class of functions whose growth can be unbounded.

1.1Related work

In the classical analysis of Newton’s method, when the Hessian of 𝑓 is assumed to be Lipschitz continuous and the function is strongly convex, one arrives at a convergence rate for the algorithm that depends on the Euclidean structure of ℝ 𝑛 , despite the fact that the algorithm is affine-invariant. This motivated the introduction of self-concordant functions in Nesterov & Nemirovskii [1994], functions for which the third derivative is bounded by the second-order derivative, with which one can obtain an affine-invariant convergence rate for the aforementioned algorithm. More importantly, many of the barrier functions used in interior-point methods are self-concordant, which extends the use of polynomial-time interior-point methods to many settings of interest.

Self-concordant functions have received strong interest in recent years due to the attractive properties that they allow to prove for many statistical estimation settings [Marteau-Ferey et al., 2019, Ostrovskii & Bach, 2021]. The original definition of self-concordance has been expanded and generalized since its inception, as many objective functions of interest have self-concordant-like properties without satisfying the strict definition of self-concordance. For example, the logistic loss function used in logistic regression is not strictly self-concordant, but it fits into a class of pseudo-self-concordant functions, which allows one to obtain similar properties and bounds as those obtained for self-concordant functions [Bach, 2010]. This was also the case in Ostrovskii & Bach [2021] and Tran-Dinh et al. [2015], in which more general properties of these pseudo-self-concordant functions were established. This was fully formalized in Sun & Tran-Dinh [2019], in which the concept of generalized self-concordant functions was introduced, along with key bounds, properties, and variants of Newton methods for the unconstrained setting which make use of this property.

Most algorithms that aim to solve Problem (1.1) assume access to second-order information, as this often allows the algorithms to make monotonic progress, remain inside the domain of 𝑓 , and often, converge quadratically when close enough to the optimum. Recently, several lines of work have focused on using Frank-Wolfe algorithm variants to solve these types of problems in the projection-free setting, for example constructing second-order approximations to a self-concordant 𝑓 using first and second-order information, and minimizing these approximations over 𝒳 using the Frank-Wolfe algorithm [Liu et al., 2020]. Other approaches, such as the ones presented in Dvurechensky et al. [2020] (later extended in Dvurechensky et al. [2022]), apply the Frank-Wolfe algorithm to a generalized self-concordant 𝑓 , using first and second-order information about the function to guarantee that the step sizes are so that the iterates do not leave the domain of 𝑓 , and monotonic progress is made. An additional Frank-Wolfe variant in that work, in the spirit of Garber & Hazan [2016], utilizes first and second order information about 𝑓 , along with a Local Linear Optimization Oracle for 𝒳 , to obtain a linear convergence rate in primal gap over polytopes given in inequality description. The authors in Dvurechensky et al. [2022] also present an additional Frank-Wolfe variant which does not use second-order information, and uses the backtracking line search of Pedregosa et al. [2020] to estimate local smoothness parameters at a given iterate. Other specialized Frank-Wolfe algorithms have been developed for specific problems involving generalized self-concordant functions, such as the Frank-Wolfe variant developed for marginal inference with concave maximization [Krishnan et al., 2015], the variant developed in Zhao & Freund [2023] for 𝜃 -homogeneous barrier functions, or the application for phase retrieval in Odor et al. [2016], where the Frank-Wolfe algorithm is shown to converge on a self-concordant non-Lipschitz smooth objective.

1.2Contribution

The contributions of this paper are detailed below and summarized in Table 1.

Simple FW variant for generalized self-concordant functions

We show that a small variation of the original Frank-Wolfe algorithm [Frank & Wolfe, 1956] with an open-loop step size of the form 𝛾 𝑡

2 / ( 𝑡 + 2 ) , where 𝑡 is the iteration count is all that is needed to achieve a convergence rate of 𝒪 ⁢ ( 1 / 𝑡 ) in primal gap; this also answers an open question posed in Dvurechensky et al. [2022]. Our variation ensures monotonic progress while employing an open-loop strategy which, together with the iterates being convex combinations, ensures that we do not leave the domain of 𝑓 . In contrast to other methods that depend on either a line search or second-order information, our variant uses only a linear minimization oracle, zeroth-order and first-order information and a domain oracle for 𝑓 ⁢ ( 𝐱 ) . The assumption of the latter oracle is very mild and was also implicitly assumed in several of the algorithms presented in Dvurechensky et al. [2022]. As such, our iterations are much cheaper than those in previous work, while essentially achieving the same convergence rates for Problem (1.1).

Moreover, our variant relying on the open-loop step size 𝛾 𝑡

2 / ( 𝑡 + 2 ) allows us to establish a 𝒪 ⁢ ( 1 / 𝑡 ) convergence rate for the Frank-Wolfe gap, is agnostic, i.e., does not need to estimate local smoothness parameters, and is parameter-free, leading to convergence rates and oracle complexities that are independent of any tuning parameters.

Algorithm Convergence Reference 1 𝐬𝐭 -order / Requirements Primal gap FW gap LS free? FW-GSC 𝒪 ⁢ ( 1 / 𝜀 ) [Dvurechensky et al., 2022, Alg.2] ✗ / ✓ SOO LBTFW-GSC 𝒪 ⁢ ( 1 / 𝜀 ) [Dvurechensky et al., 2022, Alg.3] ✓ / ✗ ZOO, DO MBTFW-GSC 𝒪 ⁢ ( 1 / 𝜀 ) [Dvurechensky et al., 2022, Alg.5] ✗ / ✓ ZOO, SOO, DO FW-LLOO 𝒪 ⁢ ( log ⁡ 1 / 𝜀 ) [Dvurechensky et al., 2022, Alg.7] ✗ / ✓ 𝑃 ⁢ ( 𝒳 ) , LLOO, SOO ASFW-GSC 𝒪 ⁢ ( log ⁡ 1 / 𝜀 ) [Dvurechensky et al., 2022, Alg.8] ✗ / ✓ 𝑃 ⁢ ( 𝒳 ) , SOO M-FW 𝒪 ⁢ ( 1 / 𝜀 )

𝒪 ⁢ ( 1 / 𝜀 ) This work ✓ / ✓ ZOO, DO B-{AFW/BPCG} 𝒪 ⁢ ( log ⁡ 1 / 𝜀 )

𝒪 ⁢ ( log ⁡ 1 / 𝜀 ) This work ✓ / ✗ 𝑃 ⁢ ( 𝒳 ) , ZOO, DO Table 1: Number of iterations needed to achieve an 𝜀 -optimal solution for Problem 1.1. We denote line search by LS, zeroth-order oracle by ZOO, second-order oracle by SOO, domain oracle by DO, local linear optimization oracle by LLOO, and the assumption that 𝒳 is polyhedral by 𝑃 ⁢ ( 𝒳 ) . The oracles listed under the Requirements column are the additional oracles required, other than the first-order oracle (FOO) and the linear minimization oracle (LMO) which all algorithms use. Faster rates in common special cases

We also obtain improved convergence rates when the optimum is contained in the interior of 𝒳 ∩ dom ⁢ ( 𝑓 ) , or when the set 𝒳 is uniformly or strongly convex, using the backtracking line search of Pedregosa et al. [2020]. We also show that the Away-step Frank-Wolfe Wolfe [1970], Lacoste-Julien & Jaggi [2015] and the Blended Pairwise Conditional Gradient Tsuji et al. [2022] can use the aforementioned line search to achieve linear rates over polytopes. For clarity we want to stress that any linear rate over polytopes has to depend also on the ambient dimension of the polytope; this applies to our linear rates and those in Table 1 established elsewhere (see Diakonikolas et al. [2020]). In contrast, the 𝒪 ⁢ ( 1 / 𝜀 ) rates are dimension-independent.

Numerical experiments

We provide numerical experiments that showcase the performance of the algorithms on generalized self-concordant objectives to complement the theoretical results. In particular, they highlight that the simple step size strategy we propose is competitive with and sometimes outperforms other variants on many instances.

After publication of our initial draft, in a revision of their original work, Dvurechensky et al. [2022] added an analysis of the Away-step Frank-Wolfe algorithm which is complementary to ours (considering a slightly different setup and regimes) and was conducted independently; we have updated the tables to include these additional results.

1.3Preliminaries and Notation

We denote the domain of 𝑓 as dom ⁢ ( 𝑓 )

def { 𝐱 ∈ ℝ 𝑛 , 𝑓 ⁢ ( 𝐱 ) < + ∞ } and the (potentially non-unique) minimizer of Problem (1.1) by 𝐱 * . Moreover, we denote the primal gap and the Frank-Wolfe gap at 𝐱 ∈ 𝒳 ∩ dom ⁢ ( 𝑓 ) as ℎ ⁢ ( 𝐱 )

def 𝑓 ⁢ ( 𝐱 ) − 𝑓 ⁢ ( 𝐱 * ) and 𝑔 ⁢ ( 𝐱 )

def max 𝐯 ∈ 𝒳 ⁡ ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐱 − 𝐯 ⟩ , respectively. We use ∥ ⋅ ∥ , ∥ ⋅ ∥ 𝐻 , and ⟨ ⋅ , ⋅ ⟩ to denote the Euclidean norm, the matrix norm induced by a symmetric positive definite matrix 𝐻 ∈ ℝ 𝑛 × 𝑛 , and the Euclidean inner product, respectively. We denote the diameter of 𝒳 as 𝐷

def max 𝐱 , 𝐲 ∈ 𝒳 ⁡ ‖ 𝐱 − 𝐲 ‖ . Given a non-empty set 𝒳 ⊂ ℝ 𝑛 we refer to its boundary as Bd ⁡ ( 𝒳 ) and to its interior as Int ⁡ ( 𝒳 ) . We use Δ 𝑛 to denote the probability simplex of dimension 𝑛 . Given a compact convex set 𝒞 ⊆ dom ⁢ ( 𝑓 ) we denote:

𝐿 𝑓 𝒞

max 𝐮 ∈ 𝒞 , 𝐝 ∈ ℝ 𝑛 ⁡ ‖ 𝐝 ‖ ∇ 2 𝑓 ⁢ ( 𝐮 ) 2 ‖ 𝐝 ‖ 2 2 , 𝜇 𝑓 𝒞

min 𝐮 ∈ 𝒞 , 𝐝 ∈ ℝ 𝑛 ⁡ ‖ 𝐝 ‖ ∇ 2 𝑓 ⁢ ( 𝐮 ) 2 ‖ 𝐝 ‖ 2 2 .

We assume access to:

Domain Oracle (DO): Given 𝐱 ∈ 𝒳 , return whether 𝐱 ∈ dom ⁢ ( 𝑓 ) .

Zeroth-Order Oracle (ZOO): Given 𝐱 ∈ dom ⁢ ( 𝑓 ) , return 𝑓 ⁢ ( 𝐱 ) .

First-Order Oracle (FOO): Given 𝐱 ∈ dom ⁢ ( 𝑓 ) , return ∇ 𝑓 ⁢ ( 𝐱 ) .

Linear Minimization Oracle (LMO): Given 𝐝 ∈ ℝ 𝑛 , return argmin 𝐱 ∈ 𝒳 ⟨ 𝐱 , 𝐝 ⟩ .

The FOO and LMO oracles are standard in the FW literature. The ZOO oracle is often implicitly assumed to be included with the FOO oracle; we make this explicit here for clarity. Finally, the DO oracle is motivated by the properties of generalized self-concordant functions. It is reasonable to assume the availability of the DO oracle: following the definition of the function codomain, one could simply evaluate 𝑓 at 𝐱 and assert 𝑓 ⁢ ( 𝐱 ) < + ∞ , thereby combining the DO and ZOO oracles into one oracle. However, in many cases testing the membership of 𝐱 ∈ dom ⁢ ( 𝑓 ) is computationally less demanding than the function evaluation.

Remark 1.1.

Requiring access to a zeroth-order and domain oracle are mild assumptions, that were also implicitly assumed in one of the three FW-variants presented in Dvurechensky et al. [2022] when computing the step size according to the strategy from Pedregosa et al. [2020]; see 5 in Algorithm 4. The remaining two variants ensure that 𝐱 ∈ dom ⁢ ( 𝑓 ) by using second-order information about 𝑓 , which we explicitly do not rely on.

The following example motivates the use of Frank-Wolfe algorithms in the context of generalized self-concordant functions. We present more examples in the computational results.

Example 1.2 (Intersection of a convex set with a polytope).

Consider Problem (1.1) where 𝒳

𝒫 ∩ 𝒞 , 𝒫 is a polytope over which we can minimize a linear function efficiently, and 𝒞 is a convex compact set for which one can easily build a barrier function.

(a) (b) (c) Figure 1:Minimizing 𝑓 ⁢ ( 𝐱 ) over 𝒫 ∩ 𝒞 , versus minimizing the sum of 𝑓 ⁢ ( 𝐱 ) and Φ 𝒞 ⁢ ( 𝐱 ) over 𝒫 for two different penalty values 𝜇 ′ and 𝜇 such that 𝜇 ′ ≫ 𝜇 .

Solving a linear optimization problem over 𝒳 may be extremely expensive. In light of this, we can incorporate 𝒞 into the problem through the use of a barrier penalty in the objective function, minimizing instead 𝑓 ⁢ ( 𝐱 ) + 𝜇 ⁢ Φ 𝒞 ⁢ ( 𝐱 ) where Φ 𝒞 ⁢ ( 𝐱 ) is a log-barrier function for 𝒞 and 𝜇 is a parameter controlling the penalization. This reformulation is illustrated in Figure 1. Note that if the original objective function is generalized self-concordant, so is the new objective function (see Proposition 1 in Sun & Tran-Dinh [2019]). We assume that computing the gradient of 𝑓 ⁢ ( 𝐱 ) + 𝜇 ⁢ Φ 𝒞 ⁢ ( 𝐱 ) is roughly as expensive as computing the gradient for 𝑓 ⁢ ( 𝐱 ) and solving an LP over 𝒫 is inexpensive relative to solving an LP over 𝒫 ∩ 𝒞 . The 𝜇 parameter can be driven down to 0 after a solution converges in a warm-starting procedure similar to interior-point methods, ensuring convergence to the true optimum.

An additional advantage of this transformation of the problem is the solution structure. Running Frank-Wolfe on the set 𝒫 ∩ 𝒞 can select a large number of extremal points from Bd ⁡ ( 𝒞 ) if 𝒞 is non-polyhedral. In contrast, 𝒫 has a finite number of vertices, a small subset of which will be selected throughout the optimization procedure. The same solution as that of the original problem can thus be constructed as a convex combination of a small number of vertices of 𝒫 , improving sparsity and interpretability in many applications.

The following definition formalizes the setting of Problem (1.1).

Definition 1.3 (Generalized self-concordant function).

Let 𝑓 ∈ 𝐶 3 ⁢ ( dom ⁢ ( 𝑓 ) ) be a closed convex function with dom ⁢ ( 𝑓 ) ⊆ ℝ 𝑛 open. Then 𝑓 is ( 𝑀 , 𝜈 ) generalized self-concordant if:

| ⟨ D 3 ⁡ 𝑓 ⁢ ( 𝐱 ) ⁢ [ 𝐰 ] ⁢ 𝐮 , 𝐮 ⟩ | ≤ 𝑀 ⁢ ‖ 𝐮 ‖ ∇ 2 𝑓 ⁢ ( 𝐱 ) 2 ⁢ ‖ 𝐰 ‖ ∇ 2 𝑓 ⁢ ( 𝐱 ) 𝜈 − 2 ⁢ ‖ 𝐰 ‖ 2 3 − 𝜈 ,

for any 𝐱 ∈ dom ⁢ ( 𝑓 ) and 𝐮 , 𝐰 ∈ ℝ 𝑛 , where

D 3 ⁡ 𝑓 ⁢ ( 𝐱 ) ⁢ [ 𝐰 ]

lim 𝛼 → 0 𝛼 − 1 ⁢ ( ∇ 2 𝑓 ⁢ ( 𝐱 + 𝛼 ⁢ 𝐰 ) − ∇ 2 𝑓 ⁢ ( 𝐱 ) ) .

2Frank-Wolfe Convergence Guarantees

We establish convergence rates for a Frank-Wolfe variant with an open-loop step size strategy for generalized self-concordant functions. The Monotonic Frank-Wolfe (M-FW) algorithm presented in Algorithm 1 is a rather simple, but powerful modification of the standard Frank-Wolfe algorithm, with the only difference that before taking a step, we verify if 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ∈ dom ⁢ ( 𝑓 ) , and if so, we check whether moving to the next iterate provides primal progress.

Algorithm 1 Monotonic Frank-Wolfe (M-FW) 1:Point 𝐱 0 ∈ 𝒳 ∩ dom ⁢ ( 𝑓 ) , function 𝑓 2:for 𝑡

0 to … do 3: 𝐯 𝑡 ← argmin 𝐯 ∈ 𝒳 ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐯 ⟩ , 𝛾 𝑡 ← 2 / ( 𝑡 + 2 ) 4: 𝐱 𝑡 + 1 ← 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) 5: if 𝐱 𝑡 + 1 ∉ dom ⁢ ( 𝑓 ) or 𝑓 ⁢ ( 𝐱 𝑡 + 1 )

𝑓 ⁢ ( 𝐱 𝑡 ) then 6: 𝐱 𝑡 + 1 ← 𝐱 𝑡 7: end if 8:end for

Note, that the open-loop step size rule 2 / ( 𝑡 + 2 ) does not guarantee monotonic primal progress for the vanilla Frank-Wolfe algorithm in general. If either of these two checks fails, we simply do not move: the algorithm sets 𝐱 𝑡 + 1

𝐱 𝑡 in Line 6 of Algorithm 1. As customary, we assume short-circuit evaluation of the logical conditions in Algorithm 1, i.e., if the first condition in Line 5 is true, then the second condition is not even checked, and the algorithm directly goes to Line 6. This minor modification of the vanilla Frank-Wolfe algorithm enables us to use the monotonicity of the iterates in the proofs to come, at the expense of at most one extra function evaluation per iteration. Note that if we set 𝐱 𝑡 + 1

𝐱 𝑡 , we do not need to call the FOO or LMO oracle at iteration 𝑡 + 1 , as we can simply reuse ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) and 𝐯 𝑡 . This effectively means that between successive iterations in which we search for an acceptable value of 𝛾 𝑡 , we only need to call the zeroth-order and domain oracle.

In order to establish the main convergence results for the algorithm, we lower bound the progress per iteration with the help of Proposition 2.1.

Proposition 2.1 (Proposition 10, Sun & Tran-Dinh [2019]).

Given a ( 𝑀 , 𝜈 ) generalized self-concordant function, then for 𝜈 ≥ 2 , we have that:

𝑓 ⁢ ( 𝐲 ) − 𝑓 ⁢ ( 𝐱 ) − ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐲 − 𝐱 ⟩ ≤ 𝜔 𝜈 ⁢ ( 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) ) ⁢ ‖ 𝐲 − 𝐱 ‖ ∇ 2 𝑓 ⁢ ( 𝐱 ) 2 ,

(2.1)

where the inequality holds if and only if 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) < 1 for 𝜈

2 , and we have that,

𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 )

def { 𝑀 ⁢ ‖ 𝐲 − 𝐱 ‖
if ⁢ 𝜈

( 𝜈 2 − 1 ) ⁢ 𝑀 ⁢ ‖ 𝐲 − 𝐱 ‖ 3 − 𝜈 ⁢ ‖ 𝐲 − 𝐱 ‖ ∇ 2 𝑓 ⁢ ( 𝐱 ) 𝜈 − 2

if ⁢ 𝜈

2 ,

(2.2)

where:

𝜔 𝜈 ⁢ ( 𝜏 )

def { 𝑒 𝜏 − 𝜏 − 1 𝜏 2
if ⁢ 𝜈

− 𝜏 − 𝑙𝑛 ⁢ ( 1 − 𝜏 ) 𝜏 2
if ⁢ 𝜈

( 1 − 𝜏 ) ⁢ 𝑙𝑛 ⁢ ( 1 − 𝜏 ) + 𝜏 𝜏 2
if ⁢ 𝜈

( 𝜈 − 2 4 − 𝜈 ) ⁢ 1 𝜏 ⁢ [ 𝜈 − 2 2 ⁢ ( 3 − 𝜈 ) ⁢ 𝜏 ⁢ ( ( 1 − 𝜏 ) 2 ⁢ ( 3 − 𝜈 ) 2 − 𝜈 − 1 ) − 1 ]

otherwise.

The inequality shown in Eq. 2.1 is very similar to the one that we would obtain if the gradient of 𝑓 was Lipschitz continuous, however, while the Lipschitz continuity of the gradient leads to an inequality that holds globally for all 𝐱 , 𝐲 ∈ dom ⁢ ( 𝑓 ) , the inequality in (2.1) only holds for 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) < 1 . Moreover, there are two other important differences, the norm used in (2.1) is now the norm defined by the Hessian at 𝐱 𝑡 instead of the ℓ 2 norm, and the term multiplying the norm is 𝜔 𝜈 ⁢ ( 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) ) instead of 1 / 2 . We deal with the latter issue by bounding 𝜔 𝜈 ⁢ ( 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) ) with a constant that depends on 𝜈 for any 𝐱 , 𝐲 ∈ dom ⁢ ( 𝑓 ) such that 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) ≤ 1 / 2 , as shown in Remark 2.2.

Remark 2.2.

As d ⁡ 𝜔 𝜈 ⁢ ( 𝜏 ) / d ⁡ 𝜏

0 for 𝜏 < 1 and 𝜈 ≥ 2 , then 𝜔 𝜈 ⁢ ( 𝜏 ) ≤ 𝜔 𝜈 ⁢ ( 1 / 2 ) for 𝜏 ≤ 1 / 2 .

Due to the fact that we use a simple step size 𝛾 𝑡

2 / ( 𝑡 + 2 ) , that we make monotonic progress, and we ensure that the iterates are inside dom ⁢ ( 𝑓 ) , careful accounting allows us to bound the number of iterations until 𝑑 𝜈 ⁢ ( 𝐱 𝑡 , 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ) ≤ 1 / 2 . Before formalizing the convergence rate, we first review a lemma needed in the proof.

Lemma 2.3 (Proposition 7,Sun & Tran-Dinh [2019]).

Let 𝑓 be a generalized self-concordant function with 𝜈 > 2 . If 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) < 1 and 𝐱 ∈ dom ⁢ ( 𝑓 ) then 𝐲 ∈ dom ⁢ ( 𝑓 ) . For the case 𝜈

2 we have that dom ⁢ ( 𝑓 )

ℝ 𝑛 .

Putting all these things together allows us to obtain a convergence rate for Algorithm 1.

Theorem 2.4.

Suppose 𝒳 is a compact convex set and 𝑓 is a ( 𝑀 , 𝜈 ) generalized self-concordant function with 𝜈 ≥ 2 , and define the compact set

ℒ 0

def { 𝐱 ∈ dom ⁢ ( 𝑓 ) ∩ 𝒳 ∣ 𝑓 ⁢ ( 𝐱 ) ≤ 𝑓 ⁢ ( 𝐱 0 ) } .

Then, the Monotonic Frank-Wolfe algorithm (Algorithm 1) satisfies:

ℎ ⁢ ( 𝐱 𝑡 ) ≤ 4 ⁢ ( 𝑇 𝜈 + 1 ) 𝑡 + 1 ⁢ max ⁡ { ℎ ⁢ ( 𝐱 0 ) , 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) } .

(2.3)

for 𝑡 ≥ 𝑇 𝜈 , 𝑇 𝜈 is defined as:

𝑇 𝜈

def { ⌈ 4 ⁢ 𝑀 ⁢ 𝐷 ⌉ − 2
if ⁢ 𝜈

⌈ 2 ⁢ 𝑀 ⁢ 𝐷 ⁢ ( 𝐿 𝑓 ℒ 0 ) 𝜈 / 2 − 1 ⁢ ( 𝜈 − 2 ) ⌉ − 2

𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 .

(2.4)

Otherwise it holds that ℎ ⁢ ( 𝐱 𝑡 ) ≤ ℎ ⁢ ( 𝐱 0 ) for 𝑡 < 𝑇 𝜈 .

Proof.

As the algorithm makes monotonic progress and moves towards points such that 𝐱 𝑡 ∈ 𝒳 , then 𝐱 𝑡 ∈ ℒ 0 for 𝑡 ≥ 0 . As the smoothness parameter of 𝑓 is bounded over ℒ 0 , we have from the properties of smooth functions that the bound ‖ 𝐝 ‖ ∇ 2 𝑓 ⁢ ( 𝐱 𝑡 ) 2 / ‖ 𝐝 ‖ 2 ≤ 𝐿 𝑓 ℒ 0 holds for any 𝐝 ∈ ℝ 𝑛 . Particularizing for 𝐝

𝐱 𝑡 − 𝐯 𝑡 and noting that ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ ≤ 𝐷 leads to ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ ∇ 2 𝑓 ⁢ ( 𝐱 𝑡 ) 2 ≤ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 . We then define 𝑇 𝜈 with (2.4). Using the definition shown in (2.2) we have that for 𝑡 ≥ 𝑇 𝜈 then 𝑑 ⁢ ( 𝐱 𝑡 , 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ) ≤ 1 / 2 . This fact, along with the fact that 𝐱 𝑡 ∈ dom ⁢ ( 𝑓 ) (by monotonicity) allows us to claim that 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ∈ dom ⁢ ( 𝑓 ) , by application of Lemma 2.3. This means that the non-zero step size 𝛾 𝑡 will ensure that 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ∈ dom ⁢ ( 𝑓 ) in Line 5 of Algorithm 1. Moreover, it allows us to use the bound between the function value at points 𝐱 𝑡 and 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) in (2.1) of Proposition 2.1, which holds for 𝑑 ⁢ ( 𝐱 𝑡 , 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ) < 1 . With this we can estimate the primal progress we can guarantee for 𝑡 ≥ 𝑇 𝜈 if we move from 𝐱 𝑡 to 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) :

ℎ ⁢ ( 𝐱 𝑡 + 1 )

ℎ ⁢ ( 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) )

≤ ℎ ⁢ ( 𝐱 𝑡 ) − 𝛾 𝑡 ⁢ 𝑔 ⁢ ( 𝐱 𝑡 ) + 𝛾 𝑡 2 ⁢ 𝜔 𝜈 ⁢ ( 𝑑 𝜈 ⁢ ( 𝐱 𝑡 , 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ) ) ⁢ ‖ 𝐯 𝑡 − 𝐱 𝑡 ‖ ∇ 2 𝑓 ⁢ ( 𝐱 𝑡 ) 2

≤ ℎ ⁢ ( 𝐱 𝑡 ) ⁢ ( 1 − 𝛾 𝑡 ) + 𝛾 𝑡 2 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ,

where the second inequality follows from the upper bound on the primal gap via the FW gap 𝑔 ⁢ ( 𝐱 𝑡 ) , the application of Remark 2.2 as for 𝑡 ≥ 𝑇 𝜈 we have that 𝑑 𝜈 ⁢ ( 𝐱 𝑡 , 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ) ≤ 1 / 2 , and from the fact that 𝐱 𝑡 ∈ ℒ 0 for all 𝑡 ≥ 0 . With the previous chain of inequalities we can bound the primal progress for 𝑡 ≥ 𝑇 𝜈 as

ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) )

≥ 𝛾 𝑡 ⁢ ℎ ⁢ ( 𝐱 𝑡 ) − 𝛾 𝑡 2 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) .

(2.5)

From these facts we can prove the convergence rate shown in (2.3) by induction. The base case 𝑡

𝑇 𝜈 holds trivially by the fact that using monotonicity we have that ℎ ⁢ ( 𝐱 𝑇 𝜈 ) ≤ ℎ ⁢ ( 𝐱 0 ) . Assuming the claim is true for some 𝑡 ≥ 𝑇 𝜈 we distinguish two cases. Case 𝛾 𝑡 ⁢ ℎ ⁢ ( 𝐱 𝑡 ) − 𝛾 𝑡 2 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) > 0 : Focusing on the first case, we can plug the previous inequality into (2.5) to find that 𝛾 𝑡 guarantees primal progress, that is, ℎ ⁢ ( 𝐱 𝑡 ) > ℎ ⁢ ( 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ) with the step size 𝛾 𝑡 , and so we know that we will not go into Line 6 of Algorithm 1, and we have that ℎ ⁢ ( 𝐱 𝑡 + 1 )

ℎ ⁢ ( 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ) . Thus using the induction hypothesis and plugging in the expression for 𝛾 𝑡

2 / ( 𝑡 + 2 ) into (2.5) we have:

ℎ ⁢ ( 𝐱 𝑡 + 1 )

≤ 4 ⁢ max ⁡ { ℎ ⁢ ( 𝐱 0 ) , 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) } ⁢ ( ( 𝑇 𝜈 + 1 ) ⁢ 𝑡 ( 𝑡 + 1 ) ⁢ ( 𝑡 + 2 ) + 1 ( 𝑡 + 2 ) 2 )

≤ 4 ⁢ ( 𝑇 𝜈 + 1 ) 𝑡 + 2 ⁢ max ⁡ { ℎ ⁢ ( 𝐱 0 ) , 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) } ,

where we use that ( 𝑇 𝜈 + 1 ) ⁢ 𝑡 / ( 𝑡 + 1 ) + 1 / ( 𝑡 + 2 ) ≤ 𝑇 𝜈 + 1 for all 𝑡 ≥ 0 and any 𝑡 ≥ 𝑇 𝜈 . Case 𝛾 𝑡 ⁢ ℎ ⁢ ( 𝐱 𝑡 ) − 𝛾 𝑡 2 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ≤ 0 : In this case, we cannot guarantee that the step size 𝛾 𝑡 provides primal progress by plugging into (2.5), and so we cannot guarantee if a step size of 𝛾 𝑡 will be accepted and we will have 𝐱 𝑡 + 1

𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) , or we will simply have 𝐱 𝑡 + 1

𝐱 𝑡 , that is, we may go into Line 6 of Algorithm 1. Nevertheless, if we reorganize the expression 𝛾 𝑡 ⁢ ℎ ⁢ ( 𝐱 𝑡 ) − 𝛾 𝑡 2 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ≤ 0 , by monotonicity we will have that:

ℎ ⁢ ( 𝐱 𝑡 + 1 ) ≤ ℎ ⁢ ( 𝐱 𝑡 ) ≤ 2 𝑡 + 2 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ≤ 4 ⁢ ( 𝑇 𝜈 + 1 ) 𝑡 + 2 ⁢ max ⁡ { ℎ ⁢ ( 𝐱 0 ) , 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) } .

Where the last inequality holds as 2 ≤ 4 ⁢ ( 𝑇 𝜈 + 1 ) for any 𝑇 𝜈 ≥ 0 . ∎

One of the quantities that we have used in the proof of Theorem 2.4 is 𝐿 𝑓 ℒ 0 . Note that the function 𝑓 is 𝐿 𝑓 ℒ 0 -smooth over ℒ 0 . One could wonder why we have bothered to use the bound on the Bregman divergence in Proposition 2.1 for a ( 𝑀 , 𝜈 ) -generalized self-concordant function, instead of simply using the bounds from the 𝐿 𝑓 ℒ 0 -smoothness of 𝑓 over ℒ 0 . The reason is that the upper bound on the Bregman divergence in Proposition 2.1 applies for any 𝐱 , 𝐲 ∈ dom ⁢ ( 𝑓 ) such that 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) < 1 , and we can easily bound the number of iterations 𝑇 𝜈 it takes for the step size 𝛾 𝑡

2 / ( 𝑡 + 2 ) to verify both 𝐱 𝑡 , 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ∈ dom ⁢ ( 𝑓 ) and 𝑑 𝜈 ⁢ ( 𝐱 𝑡 , 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ) < 1 for 𝑡 ≥ 𝑇 𝜈 . However, in order to apply the bound on the Bregman divergence from 𝐿 𝑓 ℒ 0 -smoothness we need 𝐱 𝑡 , 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ∈ ℒ 0 , and while it is easy to show by monotonicity that 𝐱 𝑡 ∈ ℒ 0 , there is no straightforward way to prove that for some 𝑇 ~ 𝜈 we have that 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ∈ ℒ 0 for all 𝑡 ≥ 𝑇 ~ 𝜈 , i.e., that from some point onward a step with a non-zero step size is taken (that is, we do not go into Line 6 of Algorithm 1) that guarantees primal progress.

Remark 2.5.

In the case where 𝜈

2 we can easily bound the primal gap ℎ ⁢ ( 𝐱 1 ) , as in this setting dom ⁢ ( 𝑓 )

ℝ 𝑛 , which leads to ℎ ⁢ ( 𝐱 1 ) ≤ 𝐿 𝑓 𝒳 ⁢ 𝐷 2 from (2.5), regardless of whether we set 𝐱 1

𝐱 0 or 𝐱 1

𝐯 0 . Moreover, as the upper bound on the Bregman divergence holds for 𝜈

2 regardless of the value of 𝑑 2 ⁢ ( 𝐱 , 𝐲 ) , we can modify the proof of Theorem 2.4 to obtain a convergence rate of the form:

ℎ ⁢ ( 𝐱 𝑡 ) ≤ 2 𝑡 + 1 ⁢ 𝐿 𝑓 𝒳 ⁢ 𝐷 2 ⁢ 𝜔 2 ⁢ ( 𝑀 ⁢ 𝐷 ) ⁢ ∀ 𝑡 ≥ 1 ,

which is reminiscient of the 𝒪 ⁢ ( 𝐿 𝑓 𝒳 ⁢ 𝐷 2 / 𝑡 ) rate of the original Frank-Wolfe algorithm for the smooth and convex case.

Furthermore, with this simple step size we can also prove a convergence rate for the Frank-Wolfe gap, as shown in Theorem 2.6. More specifically, the minimum of the Frank-Wolfe gap over the run of the algorithm converges at a rate of 𝒪 ⁢ ( 1 / 𝑡 ) . The idea of the proof is very similar to the one in Jaggi [2013]. In a nutshell, as the primal progress per iteration is directly related to the step size times the Frank-Wolfe gap, we know that the Frank-Wolfe gap cannot remain indefinitely above a given value, as otherwise we would obtain a large amount of primal progress, which would make the primal gap become negative. This is formalized in Theorem 2.6.

Theorem 2.6.

Suppose 𝒳 is a compact convex set and 𝑓 is a ( 𝑀 , 𝜈 ) generalized self-concordant function with 𝜈 ≥ 2 . Then if the Monotonic Frank-Wolfe algorithm (Algorithm 1) is run for 𝑇 ≥ 𝑇 𝜈 + 6 iterations, we will have that:

min 1 ≤ 𝑡 ≤ 𝑇 ⁡ 𝑔 ⁢ ( 𝐱 𝑡 ) ≤ 𝒪 ⁢ ( 1 / 𝑇 ) ,

where 𝑇 𝜈 is defined as:

𝑇 𝜈

def { ⌈ 4 ⁢ 𝑀 ⁢ 𝐷 ⌉ − 2
if ⁢ 𝜈

⌈ 2 ⁢ 𝑀 ⁢ 𝐷 ⁢ ( 𝐿 𝑓 ℒ 0 ) 𝜈 / 2 − 1 ⁢ ( 𝜈 − 2 ) ⌉ − 2

𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 .

(2.6) Proof.

In order to prove the claim, we focus on the iterations 𝑡 such that:

𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ − 2 ≤ 𝑡 ≤ 𝑇 − 2 ,

(2.7)

where 𝑇 𝜈 is defined in (2.6). Note that as we assume that 𝑇 ≥ 𝑇 𝜈 + 6 , we know that 𝑇 𝜈 ≤ 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ − 2 , and so for iterations 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ − 2 ≤ 𝑡 ≤ 𝑇 − 2 we know that 𝑑 𝜈 ⁢ ( 𝐱 𝑡 , 𝐱 𝑡 + 1 ) ≤ 1 / 2 , and so:

ℎ ⁢ ( 𝐱 𝑡 + 1 )

≤ ℎ ⁢ ( 𝐱 𝑡 ) − 𝛾 𝑡 ⁢ 𝑔 ⁢ ( 𝐱 𝑡 ) + 𝛾 𝑡 2 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) .

(2.8)

In a very similar fashion as was done in the proof of Theorem 2.4, we divide the proof into two different cases. Case − 𝛾 𝑡 ⁢ 𝑔 ⁢ ( 𝐱 𝑡 ) + 𝛾 𝑡 2 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ≥ 0 for some 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ − 2 ≤ 𝑡 ≤ 𝑇 − 2 : Reordering the inequality above we therefore know that there exists a 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ − 2 ≤ 𝐾 ≤ 𝑇 − 2 such that:

𝑔 ⁢ ( 𝐱 𝐾 )
≤ 2 2 + 𝐾 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 )

≤ 2 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 )

6 2 ⁢ 𝑇 𝜈 + 𝑇 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ,

where the second inequality follows from the fact that 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ − 2 ≤ 𝐾 . This leads to min 1 ≤ 𝑡 ≤ 𝑇 ⁡ 𝑔 ⁢ ( 𝐱 𝑡 ) ≤ 𝑔 ⁢ ( 𝐱 𝐾 ) ≤ 6 2 ⁢ 𝑇 𝜈 + 𝑇 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) . Case − 𝛾 𝑡 ⁢ 𝑔 ⁢ ( 𝐱 𝑡 ) + 𝛾 𝑡 2 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) < 0 for all 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ − 2 ≤ 𝑡 ≤ 𝑇 − 2 : Using the inequality above and plugging into (2.8) allows us to conclude that all steps 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ − 2 ≤ 𝑡 ≤ 𝑇 − 2 will produce primal progress using the step size 𝛾 𝑡 , and so as we know that 𝐱 𝑡 + 1 ∈ dom ⁢ ( 𝑓 ) by Lemma 2.3, then for all 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ − 2 ≤ 𝑡 ≤ 𝑇 − 2 we will take a non-zero step size determined by 𝛾 𝑡 , as 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ∈ dom ⁢ ( 𝑓 ) and 𝑓 ⁢ ( 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ) < 𝑓 ⁢ ( 𝐱 𝑡 ) in 5 of Algorithm 1. Consequently, summing up (2.8) from 𝑡 min

def 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ − 2 to 𝑡 max

def 𝑇 − 2 we have that:

ℎ ⁢ ( 𝐱 𝑡 max + 1 )
≤ ℎ ⁢ ( 𝐱 𝑡 min ) − ∑ 𝑡

𝑡 min 𝑡 max 𝛾 𝑡 ⁢ 𝑔 ⁢ ( 𝐱 𝑡 ) + 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ⁢ ∑ 𝑡

𝑡 min 𝑡 max 𝛾 𝑡 2

(2.9)

≤ ℎ ⁢ ( 𝐱 𝑡 min ) − 2 ⁢ min 𝑡 min ≤ 𝑡 ≤ 𝑡 max ⁡ 𝑔 ⁢ ( 𝐱 𝑡 ) ⁢ ∑ 𝑡

𝑡 min 𝑡 max 1 2 + 𝑡 + 4 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ⁢ ∑ 𝑡

𝑡 min 𝑡 max 1 ( 2 + 𝑡 ) 2

(2.10)

≤ ℎ ⁢ ( 𝐱 𝑡 min ) − 2 ⁢ min 1 ≤ 𝑡 ≤ 𝑇 ⁡ 𝑔 ⁢ ( 𝐱 𝑡 ) ⁢ 𝑡 max − 𝑡 min + 1 2 + 𝑡 max + 4 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ⁢ 𝑡 max − 𝑡 min + 1 ( 2 + 𝑡 min ) 2

(2.11)

≤ 4 ⁢ ( 𝑇 𝜈 + 1 𝑡 min + 1 + 𝑡 max − 𝑡 min + 1 ( 2 + 𝑡 min ) 2 ) ⁢ max ⁡ { ℎ ⁢ ( 𝐱 0 ) , 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) }

(2.12)

− 2 ⁢ min 1 ≤ 𝑡 ≤ 𝑇 ⁡ 𝑔 ⁢ ( 𝐱 𝑡 ) ⁢ 𝑡 max − 𝑡 min + 1 2 + 𝑡 max .

(2.13)

Note that (2.10) stems from the fact that min 𝑡 min ≤ 𝑡 ≤ 𝑡 max ⁡ 𝑔 ⁢ ( 𝐱 𝑡 ) ≤ 𝑔 ⁢ ( 𝐱 𝑡 ) for any 𝑡 min ≤ 𝑡 ≤ 𝑡 max , and from plugging 𝛾 𝑡

2 / ( 2 + 𝑡 ) , and (2.11) follows from the fact that − 1 / ( 2 + 𝑡 ) ≤ − 1 / ( 2 + 𝑡 max ) and 1 / ( 2 + 𝑡 ) ≤ 1 / ( 2 + 𝑡 min ) for all 𝑡 min ≤ 𝑡 ≤ 𝑡 max . The last inequality, shown in (2.12) and (2.13) arises from plugging in the upper bound on the primal gap ℎ ⁢ ( 𝐱 𝑡 min ) from Theorem 2.4 and collecting terms. If we plug in the specific values of 𝑡 max and 𝑡 min this leads to:

ℎ ⁢ ( 𝐱 𝑇 − 1 )

≤ 12 ⁢ ( 𝑇 𝜈 + 1 2 ⁢ 𝑇 𝜈 + 𝑇 − 3 + 2 ⁢ 𝑇 − 2 ⁢ 𝑇 𝜈 + 3 ( 2 ⁢ 𝑇 𝜈 + 𝑇 ) 2 ) ⁢ max ⁡ { ℎ ⁢ ( 𝐱 0 ) , 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) }

(2.14)

− 2 3 ⁢ min 1 ≤ 𝑡 ≤ 𝑇 ⁡ 𝑔 ⁢ ( 𝐱 𝑡 ) ⁢ 𝑇 − 𝑇 𝜈 𝑇 .

(2.15)

We establish our claim using proof by contradiction. Assuming that:

min 1 ≤ 𝑡 ≤ 𝑇 ⁡ 𝑔 ⁢ ( 𝐱 𝑡 )

18 ⁢ 𝑇 𝑇 − 𝑇 𝜈 ⁢ ( 𝑇 𝜈 + 1 2 ⁢ 𝑇 𝜈 + 𝑇 − 3 + 2 ⁢ 𝑇 − 2 ⁢ 𝑇 𝜈 + 3 ( 2 ⁢ 𝑇 𝜈 + 𝑇 ) 2 ) ⁢ max ⁡ { ℎ ⁢ ( 𝐱 0 ) , 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) }

results, together with the bound from (2.15) in ℎ ⁢ ( 𝐱 𝑇 − 1 ) < 0 , which is the desired contradiction, as the primal gap cannot be negative. Therefore we must have that:

min 1 ≤ 𝑖 ≤ 𝑇 ⁡ 𝑔 ⁢ ( 𝐱 𝑖 )
≤ 18 ⁢ 𝑇 𝑇 − 𝑇 𝜈 ⁢ ( 𝑇 𝜈 + 1 2 ⁢ 𝑇 𝜈 + 𝑇 − 3 + 2 ⁢ 𝑇 − 2 ⁢ 𝑇 𝜈 + 3 ( 2 ⁢ 𝑇 𝜈 + 𝑇 ) 2 ) ⁢ max ⁡ { ℎ ⁢ ( 𝐱 0 ) , 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) }

𝒪 ⁢ ( 1 / 𝑇 ) .

This completes the proof. ∎

Remark 2.7.

Note that the Monotonic Frank-Wolfe algorithm (Algorithm 1) performs at most one ZOO, FOO, DO, and LMO oracle call per iteration. This means that Theorems 2.4 and 2.6 effectively bound the number of ZOO, FOO, DO, and LMO oracle calls needed to achieve a target primal gap or Frank-Wolfe gap accuracy 𝜀 as a function of 𝑇 𝜈 and 𝜀 ; note that 𝑇 𝜈 is independent of 𝜀 . This is an important difference with respect to existing bounds, as the existing Frank-Wolfe-style first-order algorithms for generalized self-concordant functions in the literature that utilize various types of line searches may perform more than one ZOO or DO call per iteration in the line search. This means that the convergence bounds in terms of iteration count of these algorithms are only informative when considering the number of FOO and LMO calls that are needed to reach a target accuracy in primal gap, and do not directly provide any information regarding the number of ZOO or DO calls that are needed. In order to bound the latter two quantities one typically needs additional technical tools. For example, for the backtracking line search of Pedregosa et al. [2020], one can use [Pedregosa et al., 2020, Theorem 1, Appendix C], or a slightly modified version of [Nesterov, 2013, Lemma 4], to find a bound for the number of ZOO or DO calls that are needed to find an 𝜀 -optimal solution. Note that these bounds depend on user-defined initialization or tuning parameters.

Remark 2.8.

In practice, a halving strategy for the step size is preferred for the implementation of the Monotonic Frank-Wolfe algorithm, as opposed to the step size implementation shown in Algorithm 1. This halving strategy, which is shown in Algorithm 2, helps deal with the case in which a large number of consecutive step sizes 𝛾 𝑡 are rejected either because 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ∉ dom ⁢ ( 𝑓 ) or 𝑓 ⁢ ( 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) )

𝑓 ⁢ ( 𝐱 𝑡 ) , and helps avoid the need to potentially call the zeroth-order or domain oracle a large number of times in these cases. The halving strategy in Algorithm 2 results in a step size that is at most a factor of 2 smaller than the one that would have been accepted with the original strategy, i.e., that would have ensured that 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ∈ dom ⁢ ( 𝑓 ) and 𝑓 ⁢ ( 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ) ≤ 𝑓 ⁢ ( 𝐱 𝑡 ) , in the standard Monotonic Frank-Wolfe algorithm in Algorithm 1. However, the number of zeroth-order or domain oracles that would be needed to find this step size that satisfies both 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ∈ dom ⁢ ( 𝑓 ) and 𝑓 ⁢ ( 𝐱 𝑡 + 1 ) ≤ 𝑓 ⁢ ( 𝐱 𝑡 ) is logarithmic for the Monotonic Frank-Wolfe variant shown in Algorithm 2, when compared to the number needed for the Monotonic Frank-Wolfe variant without halving shown in Algorithm 1. Note that the convergence properties established throughout the paper for the Monotonic Frank-Wolfe algorithm in Algorithm 1 also hold for the variant in Algorithm 2; with the only difference being that we lose a very small constant factor (e.g., at most a factor of 2 for the standard case) in the convergence rate.

Algorithm 2 Halving M-FW 1:Point 𝐱 0 ∈ 𝒳 ∩ dom ⁢ ( 𝑓 ) , function 𝑓 2:Iterates 𝐱 1 , … ∈ 𝒳 3: 𝜓 − 1 ← 0 4:for 𝑡

0 to … do 5: 𝐯 𝑡 ← argmin 𝐯 ∈ 𝒳 ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐯 ⟩ 6: 𝜓 𝑡 ← 𝜓 𝑡 − 1 7: 𝛾 𝑡 ← 2 1 − 𝜓 𝑡 / ( 𝑡 + 2 ) 8: 𝐱 𝑡 + 1 ← 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) 9: while 𝐱 𝑡 + 1 ∉ dom ⁢ ( 𝑓 ) or 𝑓 ⁢ ( 𝐱 𝑡 + 1 )

𝑓 ⁢ ( 𝐱 𝑡 ) do 10: 𝜓 𝑡 ← 𝜓 𝑡 + 1 11: 𝛾 𝑡 ← 2 1 − 𝜓 𝑡 / ( 𝑡 + 2 ) 12: 𝐱 𝑡 + 1 ← 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) 13: end while 14:end for

In Table 2 we provide a detailed complexity comparison between the Monotonic Frank-Wolfe (M-FW) algorithm (Algorithm 1), and other comparable algorithms in the literature.

Algorithm SOO calls FOO calls ZOO calls LMO calls DO calls FW-GSC [Dvurechensky et al., 2022, Alg.2] 𝒪 ⁢ ( 1 / 𝜀 )

𝒪 ⁢ ( 1 / 𝜀 )

LBTFW-GSC ‡ [Dvurechensky et al., 2022, Alg.3] 𝒪 ⁢ ( 1 / 𝜀 )

𝒪 ⁢ ( 1 / 𝜀 )

MBTFW-GSC ‡ [Dvurechensky et al., 2022, Alg.5] 𝒪 ⁢ ( 1 / 𝜀 )

𝒪 ⁢ ( 1 / 𝜀 )

M-FW † [This work] 𝒪 ⁢ ( 1 / 𝜀 )

𝒪 ⁢ ( 1 / 𝜀 )

𝒪 ⁢ ( 1 / 𝜀 ) Table 2: Complexity comparison: Number of iterations needed to reach a solution with ℎ ⁢ ( 𝐱 ) below 𝜀 for Problem 1.1. We use the superscript † to indicate that the same complexities hold for reaching an 𝜀 -optimal solution in 𝑔 ⁢ ( 𝐱 ) . The superscript ‡ is used to indicate that constants in the convergence bounds depend on user-defined inputs; the other algorithms are parameter-free.

We note that the LBTFW-GSC algorithm from Dvurechensky et al. [2022] is in essence the Frank-Wolfe algorithm with a modified version of the backtracking line search of Pedregosa et al. [2020]. In the next section, we provide improved convergence guarantees for various cases of interest for this algorithm, which we refer to as the Frank-Wolfe algorithm with Backtrack (B-FW) for simplicity.

2.1Improved convergence guarantees

We will now establish improved convergence rates for various special cases. We focus on two different settings to obtain improved convergence rates; in the first, we assume that 𝐱 * ∈ Int ⁡ ( 𝒳 ∩ dom ⁢ ( 𝑓 ) ) (Section 2.1.1), and in the second we assume that 𝒳 is strongly or uniformly convex (Section 2.1.2). The algorithm in this section is a slightly modified Frank-Wolfe algorithm with the adaptive line search technique of Pedregosa et al. [2020] (shown for reference in Algorithm 3 and 4). This is the same algorithm used in Dvurechensky et al. [2022], however, we show improved convergence rates in several settings of interest. Note that the adaptive line search technique of Pedregosa et al. [2020] requires user-defined inputs or parameters, which means that the algorithms in this section are not parameter-free. The parameter 𝑀 of Algorithm 4 corresponds to a local estimate of the Lipschitz constant of 𝑓 , the stopping condition defining the admissible step size requires the function decrease to be greater than the one derived from the quadratic model built from the Lipschitz estimate 𝑀 and gradient, hence ensuring monotonicity.

Algorithm 3 B-FW 1:Point 𝐱 0 ∈ 𝒳 ∩ dom ⁢ ( 𝑓 ) , function 𝑓 , initial smoothness estimate 𝐿 − 1 2:Iterates 𝐱 1 , … ∈ 𝒳 3:for 𝑡

0 to … do 4: 𝐯 𝑡 ← argmin 𝐯 ∈ 𝒳 ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐯 ⟩ 5: 𝛾 𝑡 , 𝐿 𝑡 ← Backtrack ( 𝑓 , 𝐱 𝑡 , 𝐯 𝑡 − 𝐱 𝑡 , 𝐿 𝑡 − 1 , 1 ) 6: 𝐱 𝑡 + 1 ← 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) 7:end for Algorithm 4 Backtrack ( 𝑓 , 𝐱 , 𝐝 , 𝐿 𝑡 − 1 , 𝛾 max ) (line search of Pedregosa et al. [2020]) 1:Point 𝐱 ∈ 𝒳 ∩ dom ⁢ ( 𝑓 ) , 𝐯 ∈ ℝ 𝑛 , function 𝑓 , estimate 𝐿 𝑡 − 1 , step size 𝛾 max 2: 𝛾 , 𝑀 3:Choose 𝜏 > 1 , 𝜂 ≤ 1 and 𝑀 ∈ [ 𝜂 ⁢ 𝐿 𝑡 − 1 , 𝐿 𝑡 − 1 ] 4: 𝛾

min ⁡ { − ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐝 ⟩ / ( 𝑀 ⁢ ‖ 𝐝 ‖ 2 ) , 𝛾 max } 5:while 𝐱 + 𝛾 ⁢ 𝐝 ∉ dom ⁢ ( 𝑓 ) or 𝑓 ⁢ ( 𝐱 + 𝛾 ⁢ 𝐝 ) − 𝑓 ⁢ ( 𝐱 ) > 𝑀 ⁢ 𝛾 2 2 ⁢ ‖ 𝐝 ‖ 2 + 𝛾 ⁢ ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐝 ⟩ do 6: 𝑀

𝜏 ⁢ 𝑀 7: 𝛾

min ⁡ { − ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐝 ⟩ / ( 𝑀 ⁢ ‖ 𝐝 ‖ 2 ) , 𝛾 max } 8:end while 2.1.1Optimum contained in the interior

We first focus on the assumption that 𝐱 * ∈ Int ⁡ ( 𝒳 ∩ dom ⁢ ( 𝑓 ) ) , obtaining improved rates when we use the FW algorithm coupled with the adaptive step size strategy from Pedregosa et al. [2020] (see Algorithm 4). This assumption is reasonable if for example Bd ⁡ ( 𝒳 ) ⊈ dom ⁢ ( 𝑓 ) , and Int ⁡ ( 𝒳 ) ⊆ dom ⁢ ( 𝑓 ) . That is to say, we will have that 𝐱 * ∈ Int ⁡ ( 𝒳 ∩ dom ⁢ ( 𝑓 ) ) if for example we use logarithmic barrier functions to encode a set of constraints, and we have that dom ⁢ ( 𝑓 ) is a proper subset of 𝒳 . In this case the optimum is guaranteed to be in Int ⁡ ( 𝒳 ∩ dom ⁢ ( 𝑓 ) ) .

The analysis in this case is reminiscent of the one in the seminal work of Guélat & Marcotte [1986], and is presented in Subsection 2.1.1. Note that we can upper-bound the value of 𝐿 𝑡 for 𝑡 ≥ 0 by 𝐿 ~

def max ⁡ { 𝜏 ⁢ 𝐿 𝑓 ℒ 0 , 𝐿 − 1 } , where 𝜏

1 is the backtracking parameter and 𝐿 − 1 is the initial smoothness estimate in Algorithm 4. Before proving the main theoretical results of this section, we first review some auxiliary results that allow us to prove linear convergence in this setting.

Proposition 2.9 (Proposition 3, Sun & Tran-Dinh [2019]).

Let 𝑓 be generalized self-concordant with 𝜈 ≥ 2 and dom ⁢ ( 𝑓 ) not contain any straight line, then the Hessian ∇ 2 𝑓 ⁢ ( 𝐱 ) is non-degenerate at all points 𝐱 ∈ dom ⁢ ( 𝑓 ) .

Note that the assumption that dom ⁢ ( 𝑓 ) does not contain any straight line is without loss of generality as we can simply modify the function outside of our compact convex feasible region so that it holds.

Proposition 2.10 (Proposition 2.16, Braun et al. [2022]).

If there exists an 𝑟

0 such that ℬ ⁢ ( 𝐱 * , 𝑟 ) ⊆ 𝒳 ∩ dom ⁢ ( 𝑓 ) , then for all 𝐱 ∈ 𝒳 ∩ dom ⁢ ( 𝑓 ) we have that:

𝑔 ⁢ ( 𝐱 ) ‖ 𝐱 − 𝐯 ‖ ≥ 𝑟 𝐷 ⁢ ‖ ∇ 𝑓 ⁢ ( 𝐱 ) ‖ ≥ 𝑟 𝐷 ⁢ ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐱 − 𝐱 * ⟩ ‖ 𝐱 − 𝐱 * ‖ ,

where 𝐯

argmin 𝐲 ∈ 𝒳 ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐲 ⟩ and 𝑔 ⁢ ( 𝐱 ) is the Frank-Wolfe gap.

With these tools at hand, we show that the Frank-Wolfe algorithm with the backtracking step-size strategy converges at a linear rate.

Theorem 2.11.

Let 𝑓 be a ( 𝑀 , 𝜈 ) generalized self-concordant function with 𝜈 ≥ 2 , let dom ⁢ ( 𝑓 ) not contain any straight line and define the compact set

ℒ 0

def { 𝐱 ∈ dom ⁢ ( 𝑓 ) ∩ 𝒳 ∣ 𝑓 ⁢ ( 𝐱 ) ≤ 𝑓 ⁢ ( 𝐱 0 ) } .

Furthermore, let 𝑟

0 be the largest value such that ℬ ⁢ ( 𝐱 * , 𝑟 ) ⊆ 𝒳 ∩ dom ⁢ ( 𝑓 ) . Then, the Frank-Wolfe algorithm (Algorithm 3) with the backtracking strategy of Pedregosa et al. [2020] results in a linear primal gap convergence rate of the form:

ℎ ⁢ ( 𝐱 𝑡 )

≤ ℎ ⁢ ( 𝐱 0 ) ⁢ ( 1 − 𝜇 𝑓 ℒ 0 2 ⁢ 𝐿 ~ ⁢ ( 𝑟 𝐷 ) 2 ) 𝑡 ,

for 𝑡 ≥ 1 , where 𝐿 ~

def max ⁡ { 𝜏 ⁢ 𝐿 𝑓 ℒ 0 , 𝐿 − 1 } , 𝜏

1 is the backtracking parameter, 𝐿 − 1 is the initial smoothness estimate in Algorithm 4.

Proof.

As the backtracking line search makes monotonic primal progress, we know that for 𝑡 ≥ 0 we will have that 𝐱 𝑡 ∈ ℒ 0 . Since dom ⁢ ( 𝑓 ) does not contain any straight line by assumption, we know from Proposition 2.9 that for all 𝐱 ∈ dom ⁢ ( 𝑓 ) , the Hessian is non-degenerate and therefore 𝜇 𝑓 ℒ 0

0 . This allows us to claim that for any 𝐱 , 𝐲 ∈ ℒ 0 we have that:

𝑓 ⁢ ( 𝐱 ) − 𝑓 ⁢ ( 𝐲 ) − ⟨ ∇ 𝑓 ⁢ ( 𝐲 ) , 𝐱 − 𝐲 ⟩

≥ 𝜇 𝑓 ℒ 0 2 ⁢ ‖ 𝐱 − 𝐲 ‖ 2 .

(2.16)

The backtracking line search in Algorithm 4 will either output a point 𝛾 𝑡

1 or 𝛾 𝑡 < 1 . In any case, Algorithm 4 will find and output a smoothness estimate 𝐿 𝑡 and a step size 𝛾 𝑡 such that for 𝐱 𝑡 + 1

𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) we have that:

𝑓 ⁢ ( 𝐱 𝑡 + 1 ) − 𝑓 ⁢ ( 𝐱 𝑡 ) ≤ 𝐿 𝑡 ⁢ 𝛾 𝑡 2 2 ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 2 − 𝛾 𝑡 ⁢ 𝑔 ⁢ ( 𝐱 𝑡 ) .

(2.17)

In the case where 𝛾 𝑡

1 we know by observing Line 7 of Algorithm 4 that 𝑔 ⁢ ( 𝐱 𝑡 ) ≥ 𝐿 𝑡 ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 2 , and so plugging into (2.17) we arrive at ℎ ⁢ ( 𝐱 𝑡 + 1 ) ≤ ℎ ⁢ ( 𝐱 𝑡 ) / 2 . In the case where 𝛾 𝑡

𝑔 ⁢ ( 𝐱 𝑡 ) / ( 𝐿 𝑡 ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 2 ) < 1 , we have that 𝑔 ⁢ ( 𝐱 𝑡 ) < 𝐿 𝑡 ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 2 , which leads to ℎ ⁢ ( 𝐱 𝑡 + 1 ) ≤ ℎ ⁢ ( 𝐱 𝑘 ) − 𝑔 ⁢ ( 𝐱 𝑡 ) 2 / ( 2 ⁢ 𝐿 𝑡 ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 2 ) , when plugging the expression for the step size in the progress bound in (2.17). In this last case where 𝛾 𝑡 < 1 we have the following contraction for the primal gap:

ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 + 1 )

≥ 𝑔 ⁢ ( 𝐱 𝑡 ) 2 2 ⁢ 𝐿 𝑡 ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 2

≥ 𝑟 2 𝐷 2 ⁢ ‖ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) ‖ 2 2 ⁢ 𝐿 𝑡

≥ 𝜇 𝑓 ℒ 0 𝐿 ~ ⁢ 𝑟 2 𝐷 2 ⁢ ℎ ⁢ ( 𝐱 𝑡 ) ,

where we have used the inequality that involves the central term and the leftmost term in Proposition 2.10, and the last inequality stems from the bound ℎ ⁢ ( 𝐱 𝑡 ) ≤ ‖ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) ‖ 2 / ( 2 ⁢ 𝜇 𝑓 ℒ 0 ) for 𝜇 𝑓 ℒ 0 -strongly convex functions. Putting the above bounds together we have that:

ℎ ⁢ ( 𝐱 𝑡 + 1 )

≤ ℎ ⁢ ( 𝐱 𝑡 ) ⁢ ( 1 − 1 2 ⁢ min ⁡ { 1 , 𝜇 𝑓 ℒ 0 𝐿 ~ ⁢ ( 𝑟 𝐷 ) 2 } )

≤ ℎ ⁢ ( 𝐱 𝑡 ) ⁢ ( 1 − 𝜇 𝑓 ℒ 0 2 ⁢ 𝐿 ~ ⁢ ( 𝑟 𝐷 ) 2 ) ,

which completes the proof. ∎

The previous bound depends on the largest positive 𝑟 such that ℬ ⁢ ( 𝐱 * , 𝑟 ) ⊆ 𝒳 ∩ dom ⁢ ( 𝑓 ) , which can be arbitrarily small. Note also that the previous proof uses the lower bound of the Bregman divergence from the 𝜇 𝑓 ℒ 0 -strong convexity of the function over ℒ 0 to obtain linear convergence. Note that this bound is local as this 𝜇 𝑓 ℒ 0 -strong convexity holds only inside ℒ 0 , and is only of use because the step size of Algorithm 4 automatically ensures that if 𝐱 𝑡 ∈ ℒ 0 and 𝐝 𝑡 is a descent direction, then 𝐱 𝑡 + 𝛾 𝑡 ⁢ 𝐝 𝑡 ∈ ℒ 0 . This is in contrast with Algorithm 1, in which the step size 𝛾 𝑡

2 / ( 2 + 𝑡 ) did not automatically ensure monotonicity in primal gap, and this had to be enforced by setting 𝐱 𝑡 + 1

𝐱 𝑡 if 𝑓 ⁢ ( 𝐱 𝑡 + 𝛾 𝑡 ⁢ 𝐝 𝑡 ) > 𝑓 ⁢ ( 𝐱 𝑡 ) , where 𝐝 𝑡

𝐯 𝑡 − 𝐱 𝑡 . If we were to have used the lower bound on the Bregman divergence from [Sun & Tran-Dinh, 2019, Proposition 10] in the proof, which states that:

𝑓 ⁢ ( 𝐲 ) − 𝑓 ⁢ ( 𝐱 ) − ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐲 − 𝐱 ⟩ ≥ 𝜔 𝜈 ⁢ ( − 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) ) ⁢ ‖ 𝐲 − 𝐱 ‖ ∇ 2 𝑓 ⁢ ( 𝐱 ) 2 ,

for any 𝐱 , 𝐲 ∈ dom ⁢ ( 𝑓 ) and any 𝜈 ≥ 2 , we would have arrived at a bound that holds over all dom ⁢ ( 𝑓 ) . However, in order to arrive at a usable bound, and armed only with the knowledge that the Hessian is non-degenerate if dom ⁢ ( 𝑓 ) does not contain any straight line, and that 𝐱 , 𝐲 ∈ ℒ 0 , we would have had to write:

𝜔 𝜈 ⁢ ( − 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) ) ⁢ ‖ 𝐱 − 𝐲 ‖ ∇ 2 𝑓 ⁢ ( 𝐲 ) 2 ≥ 𝜇 𝑓 ℒ 0 ⁢ 𝜔 𝜈 ⁢ ( − 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) ) ⁢ ‖ 𝐱 − 𝐲 ‖ 2 ,

where the inequality follows from the definition of 𝜇 𝑓 ℒ 0 . It is easy to see that as 𝑑 ⁢ 𝜔 𝜈 ⁢ ( 𝜏 ) / 𝑑 ⁢ 𝜏 > 0 by Remark 2.2, we have that 1 / 2

𝜔 𝜈 ⁢ ( 0 ) ≥ 𝜔 𝜈 ⁢ ( − 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) ) . This results in a bound:

𝑓 ⁢ ( 𝐲 ) − 𝑓 ⁢ ( 𝐱 ) − ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐲 − 𝐱 ⟩ ≥ 𝜇 𝑓 ℒ 0 ⁢ 𝜔 𝜈 ⁢ ( − 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) ) ⁢ ‖ 𝐱 − 𝐲 ‖ 2 .

(2.18)

When we compare the bounds obtained from local strong convexity in (2.16) and that obtained directly from generalized self-concordance in (2.18), we can see that the former is tighter than the latter, albeit local. For this reason, we have used the former bound in the proof of Theorem 2.11.

2.1.2Strongly convex or uniformly convex sets

Next, we recall the definition of uniformly convex sets, used in Kerdreux et al. [2021], which will allow us to obtain improved convergence rates for the FW algorithm over uniformly convex feasible regions.

In order to prove convergence rate results for the case where the feasible region is ( 𝜅 , 𝑝 ) -uniformly convex, we first review the definition of the ( 𝜅 , 𝑝 ) -uniform convexity of a set (see Definition 2.12), as well as a useful lemma that allows us to go from contractions to convergence rates.

Definition 2.12 ( ( 𝜅 , 𝑞 ) -uniformly convex set).

Given two positive numbers 𝜅 and 𝑞 , we say the set 𝒳 ⊆ ℝ 𝑛 is ( 𝜅 , 𝑞 ) -uniformly convex with respect to a norm ∥ ⋅ ∥ if for any 𝐱 , 𝐲 ∈ 𝒳 , 0 ≤ 𝛾 ≤ 1 , and 𝐳 ∈ ℝ 𝑛 with ‖ 𝐳 ‖

1 we have that:

𝐲 + 𝛾 ⁢ ( 𝐱 − 𝐲 ) + 𝛾 ⁢ ( 1 − 𝛾 ) ⋅ 𝜅 ⁢ ‖ 𝐱 − 𝐲 ‖ 𝑞 ⁢ 𝐳 ∈ 𝒳 .

The previous definition allows us to obtain a scaling inequality very similar to the one shown in Proposition 2.10, which is key to proving the following convergence rates, and can be implicitly found in Kerdreux et al. [2021] and Garber & Hazan [2016].

Proposition 2.13.

Let 𝒳 ⊆ ℝ 𝑛 be ( 𝜅 , 𝑞 ) -uniformly convex, then for all 𝐱 ∈ 𝒳 :

𝑔 ⁢ ( 𝐱 ) ‖ 𝐱 − 𝐯 ‖ 𝑞 ≥ 𝜅 ⁢ ‖ ∇ 𝑓 ⁢ ( 𝐱 ) ‖ ,

where 𝐯

argmin 𝐮 ∈ 𝒳 ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐮 ⟩ , and 𝑔 ⁢ ( 𝐱 ) is the Frank-Wolfe gap.

The next lemma that will be presented is an extension of the one used in [Kerdreux et al., 2021, Lemma A.1] (see also Temlyakov [2015]), and allows us to go from per-iteration contractions to convergence rates.

Lemma 2.14.

We denote a sequence of nonnegative numbers by { ℎ 𝑡 } 𝑡 . Let 𝑐 0 , 𝑐 1 , 𝑐 2 and 𝛼 be positive numbers such that 𝑐 1 < 1 , ℎ 1 ≤ 𝑐 0 and ℎ 𝑡 − ℎ 𝑡 + 1 ≥ ℎ 𝑡 ⁢ min ⁡ { 𝑐 1 , 𝑐 2 ⁢ ℎ 𝑡 𝛼 } for 𝑡 ≥ 1 , then:

ℎ 𝑡 ≤ { 𝑐 0 ⁢ ( 1 − 𝑐 1 ) 𝑡 − 1

if ⁢ 1 ≤ 𝑡 ≤ 𝑡 0

( 𝑐 1 / 𝑐 2 ) 1 / 𝛼 ( 1 + 𝑐 1 ⁢ 𝛼 ⁢ ( 𝑡 − 𝑡 0 ) ) 1 / 𝛼

𝒪 ⁢ ( 𝑡 − 1 / 𝛼 )

𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 .

where

𝑡 0

def max ⁡ { 1 , ⌊ log 1 − 𝑐 1 ⁡ ( ( 𝑐 1 / 𝑐 2 ) 1 / 𝛼 𝑐 0 ) ⌋ } .

This allows us to convert the per-iteration contractions to convergence rates.

Theorem 2.15.

Suppose 𝒳 is a compact ( 𝜅 , 𝑞 ) -uniformly convex set and 𝑓 is a ( 𝑀 , 𝜈 ) generalized self-concordant function with 𝜈 ≥ 2 . Furthermore, assume that min 𝐱 ∈ 𝒳 ⁡ ‖ ∇ 𝑓 ⁢ ( 𝐱 ) ‖ ≥ 𝐶 . Then, the Frank-Wolfe algorithm with Backtrack (Algorithm 3) results in a convergence:

ℎ 𝑡 ≤ { ℎ ⁢ ( 𝐱 0 ) ⁢ ( 1 − 1 2 ⁢ min ⁡ { 1 , 𝜅 ⁢ 𝐶 𝐿 ~ } ) 𝑡
if ⁢ 𝑞

ℎ ⁢ ( 𝐱 0 ) 2 𝑡

if ⁢ 𝑞

2 , 1 ≤ 𝑡 ≤ 𝑡 0

( 𝐿 ~ 𝑞 / ( 𝜅 ⁢ 𝐶 ) 2 ) 1 / ( 𝑞 − 2 ) ( 1 + ( 𝑞 − 2 ) ⁢ ( 𝑡 − 𝑡 0 ) / ( 2 ⁢ 𝑞 ) ) 𝑞 / ( 𝑞 − 2 )

𝒪 ⁢ ( 𝑡 − 𝑞 / ( 𝑞 − 2 ) )

if ⁢ 𝑞

2 , 𝑡

𝑡 0 ,

for 𝑡 ≥ 1 , where:

𝑡 0

max ⁡ { 1 , ⌊ log 1 / 2 ⁡ ( ( 𝐿 ~ 𝑞 / ( 𝜅 ⁢ 𝐶 ) 2 ) 1 / ( 𝑞 − 2 ) ℎ ⁢ ( 𝐱 0 ) ) ⌋ } .

and 𝐿 ~

def max ⁡ { 𝜏 ⁢ 𝐿 𝑓 ℒ 0 , 𝐿 − 1 } , where 𝜏

1 is the backtracking parameter, 𝐿 − 1 is the initial smoothness estimate in Algorithm 4, and

𝐿 𝑓 ℒ 0

max 𝐮 ∈ ℒ 0 , 𝐝 ∈ ℝ 𝑛 ⁡ ‖ 𝐝 ‖ ∇ 2 𝑓 ⁢ ( 𝐮 ) 2 / ‖ 𝐝 ‖ 2 2 .

Proof.

At iteration 𝑡 , the backtracking line search strategy finds through successive function evaluations a 𝐿 𝑡

0 such that:

ℎ ⁢ ( 𝐱 𝑡 + 1 )

≤ ℎ ⁢ ( 𝐱 𝑡 ) − 𝛾 𝑡 ⁢ 𝑔 ⁢ ( 𝐱 𝑡 ) + 𝐿 𝑡 ⁢ 𝛾 𝑡 2 2 ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 2 .

Finding the 𝛾 𝑡 that maximizes the right-hand side of the previous inequality leads to:

𝛾 𝑡

min ⁡ { 1 , 𝑔 ⁢ ( 𝐱 𝑡 ) / ( 𝐿 𝑡 ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 2 ) } ,

which is the step size ultimately taken by the algorithm at iteration 𝑡 . Note that if 𝛾 𝑡

1 this means that 𝑔 ⁢ ( 𝐱 𝑡 ) ≥ 𝐿 𝑡 ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 2 , which when plugged into the inequality above leads to ℎ ⁢ ( 𝐱 𝑡 + 1 ) ≤ ℎ ⁢ ( 𝐱 𝑡 ) / 2 . Conversely, for 𝛾 𝑡 < 1 we have that ℎ ⁢ ( 𝐱 𝑡 + 1 ) ≤ ℎ ⁢ ( 𝐱 𝑡 ) − 𝑔 ⁢ ( 𝐱 𝑡 ) 2 / ( 2 ⁢ 𝐿 𝑡 ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 2 ) . Focusing on this case and using the bounds 𝑔 ⁢ ( 𝐱 𝑡 ) ≥ ℎ ⁢ ( 𝐱 𝑡 ) and 𝑔 ⁢ ( 𝐱 𝑡 ) ≥ 𝜅 ⁢ ‖ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) ‖ ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 𝑞 from Proposition 2.13 leads to:

ℎ ⁢ ( 𝐱 𝑡 + 1 )

≤ ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 ) 2 − 2 / 𝑞 ⁢ ( 𝜅 ⁢ ‖ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) ‖ ) 2 / 𝑞 2 ⁢ 𝐿 𝑡

(2.19)

≤ ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 ) 2 − 2 / 𝑞 ⁢ ( 𝜅 ⁢ 𝐶 ) 2 / 𝑞 2 ⁢ 𝐿 ~ ,

(2.20)

where the last inequality simply comes from the bound on the gradient norm, and the fact that 𝐿 𝑡 ≤ 𝐿 ~ , for 𝐿 ~

def max ⁡ { 𝜏 ⁢ 𝐿 𝑓 ℒ 0 , 𝐿 − 1 } , where 𝜏

1 is the backtracking parameter and 𝐿 − 1 is the initial smoothness estimate in Algorithm 4. Reordering this expression and putting together the two cases we have that:

ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 + 1 )

≥ ℎ ⁢ ( 𝐱 𝑡 ) ⁢ min ⁡ { 1 2 , ( 𝜅 ⁢ 𝐶 ) 2 / 𝑞 2 ⁢ 𝐿 ~ ⁢ ℎ ⁢ ( 𝐱 𝑡 ) 1 − 2 / 𝑞 } .

For the case where 𝑞

2 we get a linear contraction in primal gap. Using Lemma 2.14 to go from a contraction to a convergence rate for 𝑞

2 we have that:

ℎ 𝑡 ≤ { ℎ ⁢ ( 𝐱 0 ) ⁢ ( 1 − 1 2 ⁢ min ⁡ { 1 , 𝜅 ⁢ 𝐶 𝐿 ~ } ) 𝑡
if ⁢ 𝑞

ℎ ⁢ ( 𝐱 0 ) 2 𝑡

if ⁢ 𝑞

2 , 1 ≤ 𝑡 ≤ 𝑡 0

( 𝐿 ~ 𝑞 / ( 𝜅 ⁢ 𝐶 ) 2 ) 1 / ( 𝑞 − 2 ) ( 1 + ( 𝑞 − 2 ) ⁢ ( 𝑡 − 𝑡 0 ) / ( 2 ⁢ 𝑞 ) ) 𝑞 / ( 𝑞 − 2 )

𝒪 ⁢ ( 𝑡 − 𝑞 / ( 𝑞 − 2 ) )

if ⁢ 𝑞

2 , 𝑡

𝑡 0 ,

for 𝑡 ≥ 1 , where:

𝑡 0

max ⁡ { 1 , ⌊ log 1 / 2 ⁡ ( ( 𝐿 ~ 𝑞 / ( 𝜅 ⁢ 𝐶 ) 2 ) 1 / ( 𝑞 − 2 ) ℎ ⁢ ( 𝐱 0 ) ) ⌋ } ,

which completes the proof. ∎

However, in the general case, we cannot assume that the norm of the gradient is bounded away from zero over 𝒳 . We deal with the general case using local strong convexity in Theorem 2.16.

Theorem 2.16.

Suppose 𝒳 is a compact ( 𝜅 , 𝑞 ) -uniformly convex set and 𝑓 is a ( 𝑀 , 𝜈 ) generalized self-concordant function with 𝜈 ≥ 2 for which domain does not contain any straight line. Then, the Frank-Wolfe algorithm with Backtrack (Algorithm 3) results in a convergence:

ℎ 𝑡 ≤ { ℎ ⁢ ( 𝐱 0 ) 2 𝑡

if ⁢ 1 ≤ 𝑡 ≤ 𝑡 0

( 𝐿 ~ 𝑞 / ( 𝜅 2 ⁢ 𝜇 𝑓 ℒ 0 ) ) 1 / ( 𝑞 − 1 ) ( 1 + ( 𝑞 − 1 ) ⁢ ( 𝑡 − 𝑡 0 ) / ( 2 ⁢ 𝑞 ) ) 𝑞 / ( 𝑞 − 1 )

𝒪 ⁢ ( 𝑡 − 𝑞 / ( 𝑞 − 1 ) )

if ⁢ 𝑡

𝑡 0 ,

for 𝑡 ≥ 1 , where:

ℒ 0

{ 𝐱 ∈ dom ⁢ ( 𝑓 ) ∩ 𝒳 ∣ 𝑓 ⁢ ( 𝐱 ) ≤ 𝑓 ⁢ ( 𝐱 0 ) }

𝑡 0

max ⁡ { 1 , ⌊ log 1 / 2 ⁡ ( ( 𝐿 ~ 𝑞 / ( 𝜅 2 ⁢ 𝜇 𝑓 ℒ 0 ) ) 1 / ( 𝑞 − 1 ) ℎ ⁢ ( 𝐱 0 ) ) ⌋ }

and 𝐿 ~

def max ⁡ { 𝜏 ⁢ 𝐿 𝑓 ℒ 0 , 𝐿 − 1 } , where 𝜏

1 is the backtracking parameter, 𝐿 − 1 is the initial smoothness estimate in Algorithm 4.

Proof.

As the algorithm makes monotonic primal progress we have that 𝐱 𝑡 ∈ ℒ 0 for 𝑡 ≥ 0 . The proof proceeds very similarly as before, except for the fact that now we have to bound ‖ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) ‖ using 𝜇 𝑓 ℒ 0 -strong convexity for points 𝐱 𝑡 , 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ∈ ℒ 0 . Continuing from (2.19) for the case where 𝛾 𝑡 < 1 and using the fact that, given 𝑓 𝑢 the unconstrained minimum of 𝑓 , ℎ ⁢ ( 𝐱 𝑡 ) ≤ 𝑓 ⁢ ( 𝐱 𝑡 ) − 𝑓 𝑢 ≤ ‖ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) ‖ 2 / ( 2 ⁢ 𝜇 𝑓 ℒ 0 ) we have that:

ℎ ⁢ ( 𝐱 𝑡 + 1 )

≤ ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 ) 2 − 2 / 𝑞 ⁢ ( 𝜅 ⁢ ‖ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) ‖ ) 2 / 𝑞 2 ⁢ 𝐿 𝑡

≤ ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 ) 2 − 1 / 𝑞 ⁢ 𝜅 2 / 𝑞 ⁢ ( 𝜇 𝑓 ℒ 0 ) 1 / 𝑞 ⁢ 2 1 / 𝑞 − 1 𝐿 ~ ,

where we have also used the bound 𝐿 𝑡 ≤ 𝐿 ~ in the last equation. This leads us to a contraction, together with the case where 𝛾 𝑡

1 , which is unchanged from the previous proofs, of the form:

ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 + 1 )

≥ ℎ ⁢ ( 𝐱 𝑡 ) ⁢ min ⁡ { 1 2 , 𝜅 2 / 𝑞 ⁢ ( 𝜇 𝑓 ℒ 0 ) 1 / 𝑞 ⁢ 2 1 / 𝑞 − 1 𝐿 ~ ⁢ ℎ ⁢ ( 𝐱 𝑡 ) 1 − 1 / 𝑞 } .

Using again Lemma 2.14 to go from a contraction to a convergence rate for 𝑞

2 we have that:

ℎ 𝑡 ≤ { ℎ ⁢ ( 𝐱 0 ) 2 𝑡

if ⁢ 1 ≤ 𝑡 ≤ 𝑡 0

( 𝐿 ~ 𝑞 / ( 𝜅 2 ⁢ 𝜇 𝑓 ℒ 0 ) ) 1 / ( 𝑞 − 1 ) ( 1 + ( 𝑞 − 1 ) ⁢ ( 𝑡 − 𝑡 0 ) / ( 2 ⁢ 𝑞 ) ) 𝑞 / ( 𝑞 − 1 )

𝒪 ⁢ ( 𝑡 − 𝑞 / ( 𝑞 − 1 ) )

if ⁢ 𝑡

𝑡 0 ,

for 𝑡 ≥ 1 , where:

𝑡 0

max ⁡ { 1 , ⌊ log 1 / 2 ⁡ ( ( 𝐿 ~ 𝑞 / ( 𝜅 2 ⁢ 𝜇 𝑓 ℒ 0 ) ) 1 / ( 𝑞 − 1 ) ℎ ⁢ ( 𝐱 0 ) ) ⌋ } ,

which completes the proof. ∎

In Table 3 we provide an oracle complexity breakdown for the Frank-Wolfe algorithm with Backtrack (B-FW), also referred to as LBTFW-GSC in Dvurechensky et al. [2022], when minimizing over a ( 𝜅 , 𝑞 ) -uniformly convex set.

Algorithm Assumptions Oracle calls Reference B-FW/LBTFW-GSC ‡
𝐱 * ∈ Int ⁡ ( 𝒳 ∩ dom ⁢ ( 𝑓 ) )
𝒪 ⁢ ( log ⁡ 1 / 𝜀 ) This work B-FW/LBTFW-GSC ‡
min 𝐱 ∈ 𝒳 ⁡ ‖ ∇ 𝑓 ⁢ ( 𝐱 ) ‖ > 0 , 𝑞

𝒪 ⁢ ( log ⁡ 1 / 𝜀 ) This work B-FW/LBTFW-GSC ‡

min 𝐱 ∈ 𝒳 ⁡ ‖ ∇ 𝑓 ⁢ ( 𝐱 ) ‖

0 , 𝑞

𝒪 ⁢ ( 𝜀 − ( 𝑞 − 2 ) / 𝑞 ) This work B-FW/LBTFW-GSC ‡ No straight lines in dom ⁢ ( 𝑓 )

𝒪 ⁢ ( 𝜀 − ( 𝑞 − 1 ) / 𝑞 ) This work Table 3: Complexity comparison for B-FW (Algorithm 3) when minimizing over a ( 𝜅 , 𝑞 ) -uniformly convex set: Number of iterations needed to reach an 𝜀 -optimal solution in ℎ ⁢ ( 𝐱 ) for Problem 1.1 in several cases of interest. We use the superscript ‡ to indicate that constants in the convergence bounds depend on user-defined inputs. Oracle calls refer simultaneously to FOO, ZOO, LMO, and DO calls. 3Away-step and Blended Pairwise Conditional Gradients

When the domain 𝒳 is a polytope, one can obtain linear convergence in primal gap for a generalized self-concordant function using the well known Away-step Frank-Wolfe (AFW) algorithm [Guélat & Marcotte, 1986, Lacoste-Julien & Jaggi, 2015] shown in Algorithm 5 and the more recent Blended Pairwise Conditional Gradients (BPCG) algorithm [Tsuji et al., 2022] with the adaptive step size of Pedregosa et al. [2020]. We use 𝒮 𝑡 to denote the active set at iteration 𝑡 , that is, the set of vertices of the polytope that gives rise to 𝐱 𝑡 as a convex combination with positive weights.

For AFW, we can see that the algorithm either chooses to perform what is known as a Frank-Wolfe step in Line 7 of Algorithm 5 if the Frank-Wolfe gap 𝑔 ⁢ ( 𝐱 ) is greater than the away gap ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐚 𝑡 − 𝐱 𝑡 ⟩ or an Away step in 9 of Algorithm 5 otherwise. Similarly for BPCG, the algorithm performs a Frank-Wolfe step in Line 7 of Algorithm 6 if the Frank-Wolfe gap is greater than the pairwise gap ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐚 𝑡 − 𝐬 𝑡 ⟩ . For simplicity of exposition, we made both algorithms start with a vertex of of 𝒳 in dom ⁢ ( 𝑓 ) . Although this could be too restrictive for some applications (e.g. when the function includes a barrier of the polytope), it is easy to warm-start the active set with an initial combination of vertices.

Algorithm 5 Away-step Frank-Wolfe (B-AFW) with the step size of Pedregosa et al. [2020] 1:Vertex 𝐱 0 ∈ dom ⁢ ( 𝑓 ) of 𝒳 , function 𝑓 , initial smoothness estimate 𝐿 − 1 2: 𝒮 0 ← { 𝐱 0 } , 𝝀 0 ← { 1 } 3:for 𝑡

0 to … do 4: 𝐯 𝑡 ← argmin 𝐯 ∈ 𝒳 ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐯 ⟩ 5: 𝐚 𝑡 ← argmax 𝐯 ∈ 𝒮 𝑡 ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐯 ⟩ 6: if ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐱 𝑡 − 𝐯 𝑡 ⟩ ≥ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐚 𝑡 − 𝐱 𝑡 ⟩ then 7: 𝐝 𝑡 ← 𝐯 𝑡 − 𝐱 𝑡 , 𝛾 max ← 1 8: else 9: 𝐝 𝑡 ← 𝐱 𝑡 − 𝐚 𝑡 , 𝛾 max ← 𝝀 𝑡 ⁢ ( 𝐚 𝑡 ) / ( 1 − 𝝀 𝑡 ⁢ ( 𝐚 𝑡 ) ) 10: end if 11: 𝛾 𝑡 , 𝐿 𝑡 ← Backtrack ( 𝑓 , 𝐱 𝑡 , 𝐝 𝑡 , ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐿 𝑡 − 1 , 𝛾 max ) 12: 𝐱 𝑡 + 1 ← 𝐱 𝑡 + 𝛾 𝑡 ⁢ 𝐝 𝑡 13: Update 𝒮 𝑡 and 𝝀 𝑡 to 𝒮 𝑡 + 1 and 𝝀 𝑡 + 1 14:end for Algorithm 6 Blended Pairwise Conditional Gradients (B-BPCG) with the step size of Pedregosa et al. [2020] 1:Vertex 𝐱 0 ∈ dom ⁢ ( 𝑓 ) of 𝒳 , function 𝑓 , initial smoothness estimate 𝐿 − 1 2: 𝒮 0 ← { 𝐱 0 } , 𝝀 0 ← { 1 } 3:for 𝑡

0 to … do 4: 𝐯 𝑡 ← argmin 𝐯 ∈ 𝒳 ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐯 ⟩ 5: 𝐚 𝑡 ← argmax 𝐯 ∈ 𝒮 𝑡 ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐯 ⟩ 6: 𝐬 𝑡 ← argmin 𝐯 ∈ 𝒮 𝑡 ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐯 ⟩ 7: if ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐱 𝑡 − 𝐯 𝑡 ⟩ ≥ ⟨ ∇ 𝑓 ⁢ ( 𝐚 𝑡 ) , 𝐬 𝑡 − 𝐚 𝑡 ⟩ then 8: 𝐝 𝑡 ← 𝐯 𝑡 − 𝐱 𝑡 , 𝛾 max ← 1 9: else 10: 𝐝 𝑡 ← 𝐚 𝑡 − 𝐬 𝑡 , 𝛾 max ← 𝝀 𝑡 ⁢ ( 𝐚 𝑡 ) 11: end if 12: 𝛾 𝑡 , 𝐿 𝑡 ← Backtrack ( 𝑓 , 𝐱 𝑡 , 𝐝 𝑡 , ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐿 𝑡 − 1 , 𝛾 max ) 13: 𝐱 𝑡 + 1 ← 𝐱 𝑡 + 𝛾 𝑡 ⁢ 𝐝 𝑡 14: Update 𝒮 𝑡 and 𝝀 𝑡 to 𝒮 𝑡 + 1 and 𝝀 𝑡 + 1 15:end for

Both proofs of linear convergence follow closely from Pedregosa et al. [2020], Lacoste-Julien & Jaggi [2015] but leverage generalized self-concordant instead of smoothness and strong convexity. One of the key inequalities used in the proof is a scaling inequality from Lacoste-Julien & Jaggi [2015] very similar to the one shown in Proposition 2.10 and Proposition 2.13, which we state next:

Proposition 3.1.

Let 𝒳 ⊆ ℝ 𝑛 be a polytope, and denote by 𝒮 the set of vertices of the polytope 𝒳 that gives rise to 𝐱 ∈ 𝒳 as a convex combination with positive weights, then for all 𝐲 ∈ 𝒳 :

⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐚 − 𝐯 ⟩ ≥ 𝛿 ⁢ ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐱 − 𝐲 ⟩ ‖ 𝐱 − 𝐲 ‖ ,

where 𝐯

argmin 𝐮 ∈ 𝒳 ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐮 ⟩ , 𝐚

argmax 𝐮 ∈ 𝒮 ⟨ ∇ 𝑓 ⁢ ( 𝐱 ) , 𝐮 ⟩ , and 𝛿

0 is the pyramidal width of 𝒳 .

Theorem 3.2.

Suppose 𝒳 is a polytope and 𝑓 is a ( 𝑀 , 𝜈 ) generalized self-concordant function with 𝜈 ≥ 2 for which the domain does not contain any straight line. Then, both AFW and BPCG with Backtrack achieve a convergence rate:

ℎ ⁢ ( 𝐱 𝑡 ) ≤ ℎ ⁢ ( 𝐱 0 ) ⁢ ( 1 − 𝜇 𝑓 ℒ 0 4 ⁢ 𝐿 ~ ⁢ ( 𝛿 𝐷 ) 2 ) ⌈ ( 𝑡 − 1 ) / 2 ⌉ ,

where 𝛿 is the pyramidal width of the polytope 𝒳 , 𝐿 ~

def max ⁡ { 𝜏 ⁢ 𝐿 𝑓 ℒ 0 , 𝐿 − 1 } , 𝜏

1 is the backtracking parameter, 𝐿 − 1 is the initial smoothness estimate in Algorithm 4.

Proof.

Proceeding very similarly as in the proof of Theorem 2.11, we have that as the backtracking line search makes monotonic primal progress, we know that for 𝑡 ≥ 0 we will have that 𝐱 𝑡 ∈ ℒ 0 . As the function is 𝜇 𝑓 ℒ 0 -strongly convex over ℒ 0 , we can use the appropriate inequalities from strong convexity in the progress bounds. Using this aforementioned property, together with the scaling inequality of Proposition 3.1 results in:

ℎ ⁢ ( 𝐱 𝑡 )

𝑓 ⁢ ( 𝐱 𝑡 ) − 𝑓 ⁢ ( 𝐱 * )

≤ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐱 𝑡 − 𝐱 * ⟩ 2 2 ⁢ 𝜇 𝑓 ℒ 0 ⁢ ‖ 𝐱 𝑡 − 𝐱 * ‖ 2

≤ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐚 𝑡 − 𝐯 𝑡 ⟩ 2 2 ⁢ 𝜇 𝑓 ℒ 0 ⁢ 𝛿 2 .

(3.1)

The first inequality comes from the 𝜇 𝑓 ℒ 0 -strong convexity over ℒ 0 (see, e.g., [Braun et al., 2022, Lemma 2.13]), and the second inequality comes from applying Proposition 3.1 with 𝐲

𝐱 * . For AFW, we can expand the expression of the numerator of the bound in (3.1):

𝑓 ⁢ ( 𝐱 𝑡 ) − 𝑓 ⁢ ( 𝐱 * )

≤ ( ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐚 𝑡 − 𝐱 𝑡 ⟩ + ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐱 𝑡 − 𝐯 𝑡 ⟩ ) 2 2 ⁢ 𝜇 𝑓 ℒ 0 ⁢ 𝛿 2 .

(3.2)

Note that if the Frank-Wolfe step is chosen in Line 7, then:

− ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩

⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐱 𝑡 − 𝐯 𝑡 ⟩ ≥ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐚 𝑡 − 𝐱 𝑡 ⟩ ,

otherwise, if an away step is chosen in Line 9, then:

⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐱 𝑡 − 𝐯 𝑡 ⟩ < ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐚 𝑡 − 𝐱 𝑡 ⟩

− ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ .

In both cases, we have that:

𝑓 ⁢ ( 𝐱 𝑡 ) − 𝑓 ⁢ ( 𝐱 * )

≤ 2 ⁢ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ 2 𝜇 𝑓 ℒ 0 ⁢ 𝛿 2 .

(3.3)

For BPCG, we can directly exploit [Tsuji et al., 2022, Lemma 3.5], which establishes the following bound at every iteration of the algorithm:

− 2 ⁢ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ ≥ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐚 𝑡 − 𝐯 𝑡 ⟩ ,

resulting in the same inequality (3.3).1 Note that using a similar reasoning, as ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐱 𝑡 − 𝐯 𝑡 ⟩

𝑔 ⁢ ( 𝐱 𝑡 ) , in both cases it holds that:

ℎ ⁢ ( 𝐱 𝑡 ) ≤ 𝑔 ⁢ ( 𝐱 𝑡 ) ≤ − ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ .

(3.4)

As in the preceding proofs, the backtracking line search in Algorithm 4 will either output a point 𝛾 𝑡

𝛾 max or 𝛾 𝑡 < 𝛾 max . In any case, for both AFW and BPCG, and regardless of the type of step taken, Algorithm 4 will find and output a smoothness estimate 𝐿 𝑡 and a step size 𝛾 𝑡 such that:

ℎ ⁢ ( 𝐱 𝑡 + 1 ) − ℎ ⁢ ( 𝐱 𝑡 ) ≤ 𝐿 𝑡 ⁢ 𝛾 𝑡 2 2 ⁢ ‖ 𝐝 𝑡 ‖ 2 + 𝛾 𝑡 ⁢ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ .

(3.5)

As before, we will have two cases differentiating whether the step size 𝛾 𝑡 is maximal. If 𝛾 𝑡

𝛾 max we know by observing Line 7 of Algorithm 4 that:

− ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ ≥ 𝛾 max ⁢ 𝐿 𝑡 ⁢ ‖ 𝐝 𝑡 ‖ 2 ,

which combined with (3.5) results in:

ℎ ⁢ ( 𝐱 𝑡 + 1 ) − ℎ ⁢ ( 𝐱 𝑡 ) ≤ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ ⁢ 𝛾 max 2 .

In the case where 𝛾 𝑡 < 𝛾 max , we have:

− ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ < 𝛾 max ⁢ 𝐿 𝑡 ⁢ ‖ 𝐝 𝑡 ‖ 2 ,

𝛾 𝑡

− ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ / ( 𝐿 𝑡 ⁢ ‖ 𝐝 𝑡 ‖ 2 ) .

Plugging the expression of 𝛾 𝑡 into (3.5) yields

ℎ ⁢ ( 𝐱 𝑡 + 1 ) − ℎ ⁢ ( 𝐱 𝑡 ) ≤ − ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ 2 / ( 2 ⁢ 𝐿 𝑡 ⁢ ‖ 𝐝 𝑡 ‖ 2 ) .

In any case, we can rewrite (3.5) as:

ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 + 1 )

≥ min ⁡ { − ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ ⁢ 𝛾 max 2 , ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ 2 2 ⁢ 𝐿 𝑡 ⁢ ‖ 𝐝 𝑡 ‖ 2 } .

(3.6)

We can now use the inequality in (3.3) to bound the second term in the minimization component of (3.6), and (3.4) to bound the first term. This leads to:

ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 + 1 )

≥ ℎ ⁢ ( 𝐱 𝑡 ) ⁢ min ⁡ { 𝛾 max 2 , 𝜇 𝑓 ℒ 0 ⁢ 𝛿 2 4 ⁢ 𝐿 𝑡 ⁢ ‖ 𝐝 𝑡 ‖ 2 }

(3.7)

≥ ℎ ⁢ ( 𝐱 𝑡 ) ⁢ min ⁡ { 𝛾 max 2 , 𝜇 𝑓 ℒ 0 ⁢ 𝛿 2 4 ⁢ 𝐿 ~ ⁢ 𝐷 2 } ,

(3.8)

where in the last inequality we use ‖ 𝐝 𝑡 ‖ ≤ 𝐷 and 𝐿 𝑡 ≤ 𝐿 ~ for all 𝑡 . It remains to bound 𝛾 max away from zero to obtain the linear convergence bound. For Frank-Wolfe steps, we immediately have 𝛾 max

1 , but for away or pairwise steps, there is no straightforward way of bounding 𝛾 max away from zero. One of the key insights from Lacoste-Julien & Jaggi [2015] is that instead of bounding 𝛾 max away from zero for all steps up to iteration 𝑡 , we can instead bound the number of away steps with a step size 𝛾 𝑡

𝛾 max up to iteration 𝑡 , which are steps that reduce the cardinality of the active set 𝒮 𝑡 and satisfy ℎ ⁢ ( 𝐱 𝑡 + 1 ) ≤ ℎ ⁢ ( 𝐱 𝑡 ) . The same argument is used in Tsuji et al. [2022] to prove the convergence of BPCG. This leads us to consider only the progress provided by the remaining steps, which are Frank-Wolfe steps and away steps for AFW or pairwise steps for BPCG with 𝛾 𝑡 < 𝛾 max . For a number of steps 𝑡 , only at most half of these steps could have been away steps with 𝛾 𝑡

𝛾 max , as we cannot drop more vertices from the active set than the number of vertices we could have potentially picked up with Frank-Wolfe steps. For the remaining ⌈ ( 𝑡 − 1 ) / 2 ⌉ steps, we know that:

ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 + 1 ) ≥ ℎ ⁢ ( 𝐱 𝑡 ) ⁢ 𝜇 𝑓 ℒ 0 ⁢ 𝛿 2 4 ⁢ 𝐿 ~ ⁢ 𝐷 2 .

Therefore, we have that the primal gap satisfies:

ℎ ⁢ ( 𝐱 𝑡 ) ≤ ℎ ⁢ ( 𝐱 0 ) ⁢ ( 1 − 𝜇 𝑓 ℒ 0 ⁢ 𝛿 2 4 ⁢ 𝐿 ~ ⁢ 𝐷 2 ) ⌈ ( 𝑡 − 1 ) / 2 ⌉ .

∎

We can make use of the proof of convergence in primal gap to prove linear convergence in Frank-Wolfe gap. In order to do so, we recall a quantity formally defined in Kerdreux et al. [2019] but already implicitly used earlier in Lacoste-Julien & Jaggi [2015] as:

𝑤 ⁢ ( 𝐱 𝑡 , 𝒮 𝑡 )

def max 𝐮 ∈ 𝒮 𝑡 , 𝐯 ∈ 𝒳 ⁡ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐮 − 𝐯 ⟩

max 𝐮 ∈ 𝒮 𝑡 ⁡ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐮 − 𝐱 𝑡 ⟩ + max 𝐯 ∈ 𝒳 ⁡ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐱 𝑡 − 𝐯 ⟩

max 𝐮 ∈ 𝒮 𝑡 ⁡ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐮 − 𝐱 𝑡 ⟩ + 𝑔 ⁢ ( 𝐱 𝑡 ) .

Note that 𝑤 ⁢ ( 𝐱 𝑡 , 𝒮 𝑡 ) provides an upper bound on the Frank-Wolfe gap as the first term in the definition, the so-called away gap, is positive.

Theorem 3.3.

Suppose 𝒳 is a polytope and 𝑓 is a ( 𝑀 , 𝜈 ) generalized self-concordant function with 𝜈 ≥ 2 for which the domain does not contain any straight line. Then, AFW and BPCG with Backtrack both contract the Frank-Wolfe gap linearly, i.e., min 1 ≤ 𝑡 ≤ 𝑇 ⁡ 𝑔 ⁢ ( 𝐱 𝑡 ) ≤ 𝜀 after 𝑇

𝒪 ⁢ ( log ⁡ 1 / 𝜀 ) iterations.

Proof.

We observed in the proof of Theorem 3.2 that regardless of the type of step chosen in AFW and BPCG, the following holds:

− 2 ⁢ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ ≥ ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐱 𝑡 − 𝐯 𝑡 ⟩ + ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐚 𝑡 − 𝐱 𝑡 ⟩

𝑤 ⁢ ( 𝐱 𝑡 , 𝒮 𝑡 ) .

On the other hand, we also have that ℎ ⁢ ( 𝐱 𝑡 ) − ℎ ⁢ ( 𝐱 𝑡 + 1 ) ≤ ℎ ⁢ ( 𝐱 𝑡 ) . Plugging these bounds into the right-hand side and the left hand side of 3.6 in Theorem 3.2, and using the fact that ‖ 𝐝 𝑡 ‖ ≤ 𝐷 we have that:

min ⁡ { 𝑤 ⁢ ( 𝐱 𝑡 , 𝒮 𝑡 ) ⁢ 𝛾 max 4 , 𝑤 ⁢ ( 𝐱 𝑡 , 𝒮 𝑡 ) 2 8 ⁢ 𝐿 𝑡 ⁢ 𝐷 2 } ≤ ℎ ⁢ ( 𝐱 𝑡 ) ≤ ℎ ⁢ ( 𝐱 0 ) ⁢ ( 1 − 𝜇 𝑓 ℒ 0 4 ⁢ 𝐿 ~ ⁢ ( 𝛿 𝐷 ) 2 ) ⌈ ( 𝑡 − 1 ) / 2 ⌉ ,

where the second inequality follows from the convergence bound on the primal gap from Theorem 3.2. Considering the steps that are not away steps with 𝛾 𝑡

𝛾 max as in the proof of Theorem 3.2, leads us to:

𝑔 ⁢ ( 𝐱 𝑡 ) ≤ 𝑤 ⁢ ( 𝐱 𝑡 , 𝒮 𝑡 ) ≤ 4 ⁢ ℎ ⁢ ( 𝐱 0 ) ⁢ max ⁡ { 1 , 𝐿 ~ ⁢ 𝐷 2 2 ⁢ ℎ ⁢ ( 𝐱 0 ) } ⁢ ( 1 − 𝜇 𝑓 ℒ 0 4 ⁢ 𝐿 ~ ⁢ ( 𝛿 𝐷 ) 2 ) ⌊ ( 𝑡 − 1 ) / 4 ⌋ .

∎

In Table 4 we provide a detailed complexity comparison between the Backtracking AFW (B-AFW) Algorithm 5, and other comparable algorithms in the literature.

Algorithm SOO calls FOO calls ZOO calls LMO calls DO calls FW-LLOO [Dvurechensky et al., 2022, Alg.7] 𝒪 ⁢ ( log ⁡ 1 / 𝜀 )

𝒪 ⁢ ( log ⁡ 1 / 𝜀 )

𝒪 ⁢ ( log ⁡ 1 / 𝜀 ) * ASFW-GSC [Dvurechensky et al., 2022, Alg.8] 𝒪 ⁢ ( log ⁡ 1 / 𝜀 )

𝒪 ⁢ ( log ⁡ 1 / 𝜀 )

B-AFW/B-BPCG † ‡

𝒪 ⁢ ( log ⁡ 1 / 𝜀 )

𝒪 ⁢ ( log ⁡ 1 / 𝜀 ) Table 4: Complexity comparison: Number of iterations needed to reach a solution with ℎ ⁢ ( 𝐱 ) below 𝜀 for Problem 1.1 for Frank-Wolfe-type algorithms in the literature. The asterisk on FW-LLOO highlights the fact that the procedure is different from the standard LMO procedure. The complexity shown for the FW-LLOO, ASFW-GSC, and B-AFW algorithms only apply to polyhedral domains, with the additional requirement that for the former two we need an explicit polyhedral representation of the domain (see Assumption 3 in Dvurechensky et al. [2022]), whereas the latter only requires an LMO. The requirement that we have an explicit polyhedral representation may be limiting, for instance for the matching polytope over non-bipartite graphs, as the size of the polyhedral representation in this case depends exponentially on the number of nodes of the graph [Rothvoß, 2017]. We use the superscript † to indicate that the same complexities hold when reaching an 𝜀 -optimal solution in 𝑔 ⁢ ( 𝐱 ) , and the superscript ‡ to indicate that constants in the convergence bounds depend on user-defined inputs. 4Computational experiments

We showcase the performance of the M-FW algorithm, the second-order step size and the LLOO algorithm from Dvurechensky et al. [2022] (denoted by GSC-FW and LLOO in the figures) and the Frank-Wolfe and the Away-step Frank-Wolfe algorithm with the backtracking stepsize of Pedregosa et al. [2020], denoted by B-FW and B-AFW respectively. We ran all experiments on a server with 8 Intel Xeon 3.50GHz CPUs and 32GB RAM in single-threaded mode in Julia 1.6.0 with the FrankWolfe.jl package [Besançon et al., 2022]. The data sets used in the problem instances can be found in Carderera et al. [2021], the code used for the experiments can be found in Carderera et al.. When running the adaptive step size from Pedregosa et al. [2020], the only parameter that we need to set is the initial smoothness estimate 𝐿 − 1 . We use the initialization proposed in Pedregosa et al. [2020], namely:

𝐿 − 1

‖ ∇ 𝑓 ⁢ ( 𝐱 0 ) − ∇ 𝑓 ⁢ ( 𝐱 0 + 𝜀 ⁢ ( 𝐯 0 − 𝐱 0 ) ) ‖ / ( 𝜀 ⁢ ‖ 𝐯 0 − 𝐱 0 ‖ )

with 𝜀 set to 10 − 3 . The scaling parameters 𝜏

2 , 𝜂

0.9 are left at their default values as proposed in Pedregosa et al. [2020] and also used in Dvurechensky et al. [2022].

We also use the vanilla FW algorithm denoted by FW, which is simply Algorithm 1 without Lines 5 and 6 using the traditional 𝛾 𝑡

2 / ( 𝑡 + 2 ) open-loop step size rule. Note that there are no formal convergence guarantees for this algorithm when applied to Problem (1.1). All figures show the evolution of the ℎ ⁢ ( 𝐱 𝑡 ) and 𝑔 ⁢ ( 𝐱 𝑡 ) against 𝑡 and time with a log-log scale. As in Dvurechensky et al. [2022] we implemented the LLOO based variant only for the portfolio optimization instance Δ 𝑛 ; for the other examples, the oracle implementation was not implemented due to the need to estimate non-trivial parameters.

As can be seen in all experiments, the Monotonic Frank-Wolfe algorithm is very competitive, outperforming previously proposed variants in both in progress per iteration and time. The only other algorithm that is sometimes faster is the Away-step Frank-Wolfe variant, which however depends on an active set and can therefore induce up a quadratic time and memory overhead, potentially rendering the method inattractive for very large-scale settings.

Portfolio optimization. We consider the portfolio problem with logarithmic returns 𝑓 ⁢ ( 𝐱 )

− ∑ 𝑡

1 𝑝 log ⁡ ( ⟨ 𝐫 𝑡 , 𝐱 ⟩ ) , where 𝑝 denotes the number of periods and 𝒳

Δ 𝑛 . The results are shown in Figure 2 with all methods and in Figure 3 on larger instances with first-order methods only. We use the revenue data 𝐫 𝑡 from Dvurechensky et al. [2022] and add instances generated in a similar fashion from independent Normal random entries with dimension 1000, 2000, and 5000, and from a Log-normal distribution with ( 𝜇

0.0 , 𝜎

0.5 ) .

Figure 2:Portfolio optimization: Convergence of ℎ ⁢ ( 𝐱 𝑡 ) and 𝑔 ⁢ ( 𝐱 𝑡 ) vs. 𝑡 and wall-clock time. 𝑛

1000 . (a) 𝑛

2000 (b) 𝑛

5000 Figure 3:Portfolio optimization: Convergence of ℎ ⁢ ( 𝐱 𝑡 ) and 𝑔 ⁢ ( 𝐱 𝑡 ) vs. 𝑡 and wall-clock time.

Logistic regression. One of the motivating examples for the development of a theory of generalized self-concordant function is the logistic loss function, as it does not match the definition of a standard self-concordant function but shares many of its characteristics. We consider a design matrix with rows 𝐚 𝑖 ∈ ℝ 𝑛 with 1 ≤ 𝑖 ≤ 𝑁 and a vector 𝐲 ∈ { − 1 , 1 } 𝑁 and formulate a logistic regression problem with elastic net regularization, in a similar fashion to Liu et al. [2020], with 𝑓 ⁢ ( 𝐱 )

1 / 𝑁 ⁢ ∑ 𝑖

1 𝑁 log ⁡ ( 1 + exp ⁡ ( − 𝑦 𝑖 ⁢ ⟨ 𝐱 , 𝐚 𝑖 ⟩ ) ) + 𝜇 / 2 ⁢ ‖ 𝐱 ‖ 2 , and 𝒳 is the ℓ 1 ball of radius 𝜌 . The results can be seen in Figure 4.

(a)a4a: ( 𝑁 , 𝑛 )

( 4781 , 121 ) (b)a8a: ( 𝑁 , 𝑛 )

( 22696 , 123 ) Figure 4:Logistic regression: Convergence of ℎ ⁢ ( 𝐱 𝑡 ) and 𝑔 ⁢ ( 𝐱 𝑡 ) vs. 𝑡 and wall-clock time for instances of the LIBSVM dataset.

Birkhoff polytope. All applications previously considered all have in common a computationally inexpensive LMO that returns highly sparse vertices. To complement the results, we consider a logistic regression problem over the Birkhoff polytope, where the LMO is implemented with the Hungarian algorithm, and is not as inexpensive as in the other examples. We use a quadratic regularization parameter 𝜇

100 / 𝑁 where 𝑁 is the number of samples. The results are presented in Figure 5.

Figure 5:Birkhoff polytope: Convergence of ℎ ⁢ ( 𝐱 𝑡 ) and 𝑔 ⁢ ( 𝐱 𝑡 ) vs. 𝑡 and wall-clock time on a2a: ( 𝑁 , 𝑛 )

( 2265 , 114 ) . Monotonic step size: the numerical case

The computational experiments highlighted that the Monotonic Frank-Wolfe performs well in terms of iteration count and time against other Frank-Wolfe and Away-step Frank-Wolfe variants. Another advantage of a simple step size computation procedure is its numerical stability. On some instances, an ill-conditioned gradient can lead to a plateau of the primal and/or dual progress. Even worse, some step-size strategies do not guarantee monotonicity and can result in the primal value increasing over some iterations. The numerical issue that causes this phenomenon is illustrated by running the methods of the FrankWolfe.jl package over the same instance using 64 -bits floating-point numbers and Julia BigFloat types (which support arithmetic in arbitrary precision to remove numerical issues).

(a)64-bit floating point (b)Arbitrary precision using BigFloat Figure 6:Ill-conditioned portfolio optimization problem in dimension 𝑛

2000 .

In Fig. 6, we compare the primal and dual gap progress of different algorithms on a portfolio instance. In the finite-precision execution, we observe a plateau of the dual gap for both M-FW and B-AFW. The primal value however worsens after the iteration where B-AFW reaches its dual gap plateau. In contrast, M-FW reaches a plateau in both primal and dual gap at a certain iteration. Note that the primal value at the point where the plateau is hit is already below 𝜀 float ⁢ 64 , the square root of the machine precision. In arbitrary-precision arithmetic, instead of reaching a plateau or deteriorating, B-AFW closes the dual gap tolerance and terminates before other methods. Although this observation (made on several instances of the portfolio optimization problem) only impacts ill-conditioned problems, it suggests M-FW may be a good candidate for a numerically robust default implementation of Frank-Wolfe algorithms.

5A stateless simple step variant

The simple step-size strategy presented in Algorithm 2 ensures monotonicity and domain-respecting iterates by maintaining a “memory” 𝜙 𝑡 which is the number of performed halvings. The number of halvings to reach an accepted step is bounded, but the corresponding factor 2 𝜙 𝑡 is carried over in all following iterations, which may slow down progress. We propose an alternative step size that still ensures the monotonicity and domain-preserving properties, but does not carry over information from one iteration to the next.

Algorithm 7 Stateless Monotonic Frank-Wolfe 1:Point 𝐱 0 ∈ 𝒳 ∩ dom ⁢ ( 𝑓 ) , function 𝑓 2:Iterates 𝐱 1 , … ∈ 𝒳 3:for 𝑡

0 to … do 4: 𝐯 𝑡 ← argmin 𝐯 ∈ 𝒳 ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐯 ⟩ 5: 𝛾 𝑡 ← 2 / ( 𝑡 + 2 ) 6: 𝐱 𝑡 + 1 ← 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) 7: while 𝐱 𝑡 + 1 ∉ dom ⁢ ( 𝑓 ) or 𝑓 ⁢ ( 𝐱 𝑡 + 1 )

𝑓 ⁢ ( 𝐱 𝑡 ) do 8: 𝛾 𝑡 ← 𝛾 𝑡 / 2 9: 𝐱 𝑡 + 1 ← 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) 10: end while 11:end for

Note that Algorithm 7, presented above, is stateless since it is equivalent to Algorithm 2 with 𝜙 𝑡 reset to zero between every outer iteration. This resetting step also implies that the per-iteration convergence rate of the stateless step is at least as good as the simple step, at the potential cost of a bounded number of halvings, with associated ZOO and DOO calls at each iteration. Finally, we point out that the stateless step-size strategy can be viewed as a particular instance of a backtracking line search where the initial step size estimate is the agnostic step size 2 / ( 𝑡 + 2 ) .

We compare the two strategies on a random and badly conditioned problem with objective function:

𝑥 ⊤ ⁢ 𝑄 ⁢ 𝑥 + ⟨ 𝑏 , 𝑥 ⟩ + 𝜇 ⁢ 𝜑 ℝ + ⁢ ( 𝑥 )

where 𝑄 is a symmetric positive definite matrix with log-normally distributed eigenvalues and 𝜑 ℝ + ⁢ ( ⋅ ) is the log-barrier function of the positive orthant. We optimize instances of this function over the ℓ 1 -norm ball in dimension 30 and 1000 . The results are shown in Figure 7. On both of these instances, the simple step progress is slowed down or even seems stalled in comparison to the stateless version because a lot of halving steps were done in the early iterations for the simple step size, which penalizes progress over the whole run. The stateless step-size does not suffer from this problem, however, because the halvings have to be performed at multiple iterations when using the stateless step-size strategy, the per iteration cost of the stateless step-size is about three times that of the simple step-size. Future work will consider additional restart conditions, not only on 𝜙 𝑡 of Algorithm 2, but also on the base step-size strategy employed, similar to Kerdreux et al. [2019].

(a) 𝑛

30 (b) 𝑛

1000 Figure 7:Stateless step size: comparison of the stateless and simple steps on badly conditioned problems. Conclusion

We introduced FQ variants based on open-loop step sizes 𝛾 𝑡

2 / ( 𝑡 + 2 ) to obtain a 𝒪 ⁢ ( 1 / 𝑡 ) convergence rate for generalized self-concordant functions in terms of primal and FW gaps. This algorithm neither requires second-order information, nor line searches, and allows us to bound the number of zeroth-, first-order oracle, domain oracle, and linear oracle calls needed to obtain the target accuracy. We also show improved convergence rates for several variants in various cases of interest and prove that the AFW [Wolfe, 1970, Lacoste-Julien & Jaggi, 2015] and BPCG Tsuji et al. [2022] algorithms coupled with the backtracking line search of Pedregosa et al. [2020] can achieve linear convergence rates over polytopes when minimizing generalized self-concordant functions.

Acknowledgements

Research reported in this paper was partially supported through the Research Campus Modal funded by the German Federal Ministry of Education and Research (fund numbers 05M14ZAM,05M20ZBM) and the Deutsche Forschungsgemeinschaft (DFG) through the DFG Cluster of Excellence MATH+. We would like to thank the anonymous reviewers for their suggestions and comments.

References Bach [2010] ↑ Bach, F.Self-concordant analysis for logistic regression.Electronic Journal of Statistics, 4:384–414, 2010. Besançon et al. [2022] ↑ Besançon, M., Carderera, A., and Pokutta, S.FrankWolfe.jl: a high-performance and flexible toolbox for Frank-Wolfe algorithms and conditional gradients.INFORMS Journal on Computing, 34(5):2611–2620, 2022. Braun et al. [2022] ↑ Braun, G., Carderera, A., Combettes, C. W., Hassani, H., Karbasi, A., Mokhtari, A., and Pokutta, S.Conditional gradient methods.arXiv preprint arXiv:2211.14103, 2022. [4] ↑ Carderera, A., Besançon, M., and Pokutta, S.Frank-wolfe for generalized self-concordant functions - code repository.https://github.com/ZIB-IOL/fw-generalized-selfconcordant. Carderera et al. [2021] ↑ Carderera, A., Besançon, M., and Pokutta, S.Frank-Wolfe for Generalized Self-Concordant Functions - Problem Instances, May 2021.URL https://doi.org/10.5281/zenodo.4836009. Diakonikolas et al. [2020] ↑ Diakonikolas, J., Carderera, A., and Pokutta, S.Locally accelerated conditional gradients.In Proceedings of the 23th International Conference on Artificial Intelligence and Statistics, pp. 1737–1747. PMLR, 2020. Dvurechensky et al. [2020] ↑ Dvurechensky, P., Ostroukhov, P., Safin, K., Shtern, S., and Staudigl, M.Self-concordant analysis of Frank-Wolfe algorithms.In Proceedings of the 37th International Conference on Machine Learning, pp. 2814–2824. PMLR, 2020. Dvurechensky et al. [2022] ↑ Dvurechensky, P., Safin, K., Shtern, S., and Staudigl, M.Generalized self-concordant analysis of Frank–Wolfe algorithms.Mathematical Programming, pp. 1–69, 2022. Frank & Wolfe [1956] ↑ Frank, M. and Wolfe, P.An algorithm for quadratic programming.Naval research logistics quarterly, 3(1-2):95–110, 1956. Garber & Hazan [2016] ↑ Garber, D. and Hazan, E.A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization.SIAM Journal on Optimization, 26(3):1493–1528, 2016. Guélat & Marcotte [1986] ↑ Guélat, J. and Marcotte, P.Some comments on Wolfe’s ‘away step’.Mathematical Programming, 35(1):110–119, 1986. Jaggi [2013] ↑ Jaggi, M.Revisiting Frank-Wolfe: Projection-free sparse convex optimization.In Proceedings of the 30th International Conference on Machine Learning, pp. 427–435. PMLR, 2013. Kerdreux et al. [2019] ↑ Kerdreux, T., d’Aspremont, A., and Pokutta, S.Restarting Frank-Wolfe.In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pp. 1275–1283. PMLR, 2019. Kerdreux et al. [2021] ↑ Kerdreux, T., d’Aspremont, A., and Pokutta, S.Projection-free optimization on uniformly convex sets.In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, pp. 19–27. PMLR, 2021. Krishnan et al. [2015] ↑ Krishnan, R. G., Lacoste-Julien, S., and Sontag, D.Barrier Frank-Wolfe for Marginal Inference.In Proceedings of the 28th Conference in Neural Information Processing Systems. PMLR, 2015. Lacoste-Julien & Jaggi [2015] ↑ Lacoste-Julien, S. and Jaggi, M.On the global linear convergence of Frank-Wolfe optimization variants.In Proceedings of the 29th Conference on Neural Information Processing Systems, pp. 566–575. PMLR, 2015. Levitin & Polyak [1966] ↑ Levitin, E. S. and Polyak, B. T.Constrained minimization methods.USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966. Liu et al. [2020] ↑ Liu, D., Cevher, V., and Tran-Dinh, Q.A newton frank–wolfe method for constrained self-concordant minimization.Journal of Global Optimization, pp. 1–27, 2020. Marteau-Ferey et al. [2019] ↑ Marteau-Ferey, U., Ostrovskii, D., Bach, F., and Rudi, A.Beyond least-squares: Fast rates for regularized empirical risk minimization through self-concordance.In Proceedings of the 32nd Conference on Learning Theory, pp. 2294–2340. PMLR, 2019. Nesterov [2013] ↑ Nesterov, Y.Gradient methods for minimizing composite functions.Mathematical Programming, 140(1):125–161, 2013. Nesterov & Nemirovskii [1994] ↑ Nesterov, Y. and Nemirovskii, A.Interior-point polynomial algorithms in convex programming.SIAM, 1994. Odor et al. [2016] ↑ Odor, G., Li, Y.-H., Yurtsever, A., Hsieh, Y.-P., Tran-Dinh, Q., El Halabi, M., and Cevher, V.Frank-wolfe works for non-lipschitz continuous gradient objectives: scalable poisson phase retrieval.In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6230–6234. Ieee, 2016. Ostrovskii & Bach [2021] ↑ Ostrovskii, D. M. and Bach, F.Finite-sample analysis of M-estimators using self-concordance.Electronic Journal of Statistics, 15(1):326–391, 2021. Pedregosa et al. [2020] ↑ Pedregosa, F., Negiar, G., Askari, A., and Jaggi, M.Linearly convergent Frank–Wolfe with backtracking line-search.In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. PMLR, 2020. Rothvoß [2017] ↑ Rothvoß, T.The matching polytope has exponential extension complexity.Journal of the ACM (JACM), 64(6):1–19, 2017. Sun & Tran-Dinh [2019] ↑ Sun, T. and Tran-Dinh, Q.Generalized self-concordant functions: a recipe for Newton-type methods.Mathematical Programming, 178(1):145–213, 2019. Temlyakov [2015] ↑ Temlyakov, V.Greedy approximation in convex optimization.Constructive Approximation, 41(2):269–296, 2015. Tran-Dinh et al. [2015] ↑ Tran-Dinh, Q., Li, Y.-H., and Cevher, V.Composite convex minimization involving self-concordant-like cost functions.In Modelling, Computation and Optimization in Information Systems and Management Sciences, pp. 155–168. Springer, 2015. Tsuji et al. [2022] ↑ Tsuji, K. K., Tanaka, K., and Pokutta, S.Pairwise conditional gradients without swap steps and sparser kernel herding.In International Conference on Machine Learning, pp. 21864–21883. PMLR, 2022. Wolfe [1970] ↑ Wolfe, P.Convergence theory in nonlinear programming.In Integer and Nonlinear Programming, pp. 1–36. North-Holland, Amsterdam, 1970. Zhao & Freund [2023] ↑ Zhao, R. and Freund, R. M.Analysis of the Frank–Wolfe method for convex composite optimization involving a logarithmically-homogeneous barrier.Mathematical Programming, 199(1-2):123–163, 2023. Generated by L A T E xml Instructions for reporting errors

We are continuing to improve HTML versions of papers, and your feedback helps enhance accessibility and mobile support. To report errors in the HTML that will help us improve conversion and rendering, choose any of the methods listed below:

Click the "Report Issue" button. Open a report feedback form via keyboard, use "Ctrl + ?". Make a text selection and click the "Report Issue for Selection" button near your cursor. You can use Alt+Y to toggle on and Alt+Shift+Y to toggle off accessible reporting links at each section.

Our team has already identified the following issues. We appreciate your time reviewing and reporting rendering errors we may not have found yet. Your efforts will help us improve the HTML versions for all readers, because disability should not be a barrier to accessing research. Thank you for your continued support in championing open access for all.

Have a free development cycle? Help support accessibility at arXiv! Our collaborators at LaTeXML maintain a list of packages that need conversion, and welcome developer contributions.

Report Issue Report Issue for Selection

Xet Storage Details

Size:: 102 kB
Xet hash:: 9f0061b921a74e9cd3ee5f2fe7b0e13587f677151c23987e0e0aed5da2f2fce0

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.

Generalized self-concordance is a key property present in the objective function of many important learning problems. We establish the convergence rate of a simple Frank-Wolfe variant that uses the open-loop step size strategy 𝛾 𝑡

We show that a small variation of the original Frank-Wolfe algorithm [Frank & Wolfe, 1956] with an open-loop step size of the form 𝛾 𝑡

Moreover, our variant relying on the open-loop step size 𝛾 𝑡

We denote the domain of 𝑓 as dom ⁢ ( 𝑓 )

def { 𝐱 ∈ ℝ 𝑛 , 𝑓 ⁢ ( 𝐱 ) < + ∞ } and the (potentially non-unique) minimizer of Problem (1.1) by 𝐱 * . Moreover, we denote the primal gap and the Frank-Wolfe gap at 𝐱 ∈ 𝒳 ∩ dom ⁢ ( 𝑓 ) as ℎ ⁢ ( 𝐱 )

def 𝑓 ⁢ ( 𝐱 ) − 𝑓 ⁢ ( 𝐱 * ) and 𝑔 ⁢ ( 𝐱 )

𝐿 𝑓 𝒞

max 𝐮 ∈ 𝒞 , 𝐝 ∈ ℝ 𝑛 ⁡ ‖ 𝐝 ‖ ∇ 2 𝑓 ⁢ ( 𝐮 ) 2 ‖ 𝐝 ‖ 2 2 , 𝜇 𝑓 𝒞

Consider Problem (1.1) where 𝒳

D 3 ⁡ 𝑓 ⁢ ( 𝐱 ) ⁢ [ 𝐰 ]

Algorithm 1 Monotonic Frank-Wolfe (M-FW) 1:Point 𝐱 0 ∈ 𝒳 ∩ dom ⁢ ( 𝑓 ) , function 𝑓 2:for 𝑡

Note, that the open-loop step size rule 2 / ( 𝑡 + 2 ) does not guarantee monotonic primal progress for the vanilla Frank-Wolfe algorithm in general. If either of these two checks fails, we simply do not move: the algorithm sets 𝐱 𝑡 + 1

𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 )

def { 𝑀 ⁢ ‖ 𝐲 − 𝐱 ‖ if ⁢ 𝜈

𝜔 𝜈 ⁢ ( 𝜏 )

def { 𝑒 𝜏 − 𝜏 − 1 𝜏 2 if ⁢ 𝜈

− 𝜏 − 𝑙𝑛 ⁢ ( 1 − 𝜏 ) 𝜏 2 if ⁢ 𝜈

( 1 − 𝜏 ) ⁢ 𝑙𝑛 ⁢ ( 1 − 𝜏 ) + 𝜏 𝜏 2 if ⁢ 𝜈

Due to the fact that we use a simple step size 𝛾 𝑡

Let 𝑓 be a generalized self-concordant function with 𝜈 > 2 . If 𝑑 𝜈 ⁢ ( 𝐱 , 𝐲 ) < 1 and 𝐱 ∈ dom ⁢ ( 𝑓 ) then 𝐲 ∈ dom ⁢ ( 𝑓 ) . For the case 𝜈

2 we have that dom ⁢ ( 𝑓 )

ℒ 0

𝑇 𝜈

def { ⌈ 4 ⁢ 𝑀 ⁢ 𝐷 ⌉ − 2 if ⁢ 𝜈

ℎ ⁢ ( 𝐱 𝑡 + 1 )

From these facts we can prove the convergence rate shown in (2.3) by induction. The base case 𝑡

ℎ ⁢ ( 𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) ) . Thus using the induction hypothesis and plugging in the expression for 𝛾 𝑡

𝐱 𝑡 + 𝛾 𝑡 ⁢ ( 𝐯 𝑡 − 𝐱 𝑡 ) , or we will simply have 𝐱 𝑡 + 1

In the case where 𝜈

2 we can easily bound the primal gap ℎ ⁢ ( 𝐱 1 ) , as in this setting dom ⁢ ( 𝑓 )

ℝ 𝑛 , which leads to ℎ ⁢ ( 𝐱 1 ) ≤ 𝐿 𝑓 𝒳 ⁢ 𝐷 2 from (2.5), regardless of whether we set 𝐱 1

𝐱 0 or 𝐱 1

𝐯 0 . Moreover, as the upper bound on the Bregman divergence holds for 𝜈

𝑇 𝜈

def { ⌈ 4 ⁢ 𝑀 ⁢ 𝐷 ⌉ − 2 if ⁢ 𝜈

𝑔 ⁢ ( 𝐱 𝐾 ) ≤ 2 2 + 𝐾 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ≤ 2 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 )

def 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ − 2 to 𝑡 max

ℎ ⁢ ( 𝐱 𝑡 max + 1 ) ≤ ℎ ⁢ ( 𝐱 𝑡 min ) − ∑ 𝑡

𝑡 min 𝑡 max 𝛾 𝑡 ⁢ 𝑔 ⁢ ( 𝐱 𝑡 ) + 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ⁢ ∑ 𝑡

≤ ℎ ⁢ ( 𝐱 𝑡 min ) − 2 ⁢ min 𝑡 min ≤ 𝑡 ≤ 𝑡 max ⁡ 𝑔 ⁢ ( 𝐱 𝑡 ) ⁢ ∑ 𝑡

𝑡 min 𝑡 max 1 2 + 𝑡 + 4 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) ⁢ ∑ 𝑡

Note that (2.10) stems from the fact that min 𝑡 min ≤ 𝑡 ≤ 𝑡 max ⁡ 𝑔 ⁢ ( 𝐱 𝑡 ) ≤ 𝑔 ⁢ ( 𝐱 𝑡 ) for any 𝑡 min ≤ 𝑡 ≤ 𝑡 max , and from plugging 𝛾 𝑡

min 1 ≤ 𝑖 ≤ 𝑇 ⁡ 𝑔 ⁢ ( 𝐱 𝑖 ) ≤ 18 ⁢ 𝑇 𝑇 − 𝑇 𝜈 ⁢ ( 𝑇 𝜈 + 1 2 ⁢ 𝑇 𝜈 + 𝑇 − 3 + 2 ⁢ 𝑇 − 2 ⁢ 𝑇 𝜈 + 3 ( 2 ⁢ 𝑇 𝜈 + 𝑇 ) 2 ) ⁢ max ⁡ { ℎ ⁢ ( 𝐱 0 ) , 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) }

Algorithm 2 Halving M-FW 1:Point 𝐱 0 ∈ 𝒳 ∩ dom ⁢ ( 𝑓 ) , function 𝑓 2:Iterates 𝐱 1 , … ∈ 𝒳 3: 𝜓 − 1 ← 0 4:for 𝑡

Algorithm 3 B-FW 1:Point 𝐱 0 ∈ 𝒳 ∩ dom ⁢ ( 𝑓 ) , function 𝑓 , initial smoothness estimate 𝐿 − 1 2:Iterates 𝐱 1 , … ∈ 𝒳 3:for 𝑡

𝜏 ⁢ 𝑀 7: 𝛾

The analysis in this case is reminiscent of the one in the seminal work of Guélat & Marcotte [1986], and is presented in Subsection 2.1.1. Note that we can upper-bound the value of 𝐿 𝑡 for 𝑡 ≥ 0 by 𝐿 ~

where 𝐯

ℒ 0

for 𝑡 ≥ 1 , where 𝐿 ~

The backtracking line search in Algorithm 4 will either output a point 𝛾 𝑡

1 or 𝛾 𝑡 < 1 . In any case, Algorithm 4 will find and output a smoothness estimate 𝐿 𝑡 and a step size 𝛾 𝑡 such that for 𝐱 𝑡 + 1

In the case where 𝛾 𝑡

1 we know by observing Line 7 of Algorithm 4 that 𝑔 ⁢ ( 𝐱 𝑡 ) ≥ 𝐿 𝑡 ⁢ ‖ 𝐱 𝑡 − 𝐯 𝑡 ‖ 2 , and so plugging into (2.17) we arrive at ℎ ⁢ ( 𝐱 𝑡 + 1 ) ≤ ℎ ⁢ ( 𝐱 𝑡 ) / 2 . In the case where 𝛾 𝑡

2 / ( 2 + 𝑡 ) did not automatically ensure monotonicity in primal gap, and this had to be enforced by setting 𝐱 𝑡 + 1

𝐱 𝑡 if 𝑓 ⁢ ( 𝐱 𝑡 + 𝛾 𝑡 ⁢ 𝐝 𝑡 ) > 𝑓 ⁢ ( 𝐱 𝑡 ) , where 𝐝 𝑡

where the inequality follows from the definition of 𝜇 𝑓 ℒ 0 . It is easy to see that as 𝑑 ⁢ 𝜔 𝜈 ⁢ ( 𝜏 ) / 𝑑 ⁢ 𝜏 > 0 by Remark 2.2, we have that 1 / 2

Given two positive numbers 𝜅 and 𝑞 , we say the set 𝒳 ⊆ ℝ 𝑛 is ( 𝜅 , 𝑞 ) -uniformly convex with respect to a norm ∥ ⋅ ∥ if for any 𝐱 , 𝐲 ∈ 𝒳 , 0 ≤ 𝛾 ≤ 1 , and 𝐳 ∈ ℝ 𝑛 with ‖ 𝐳 ‖

where 𝐯

( 𝑐 1 / 𝑐 2 ) 1 / 𝛼 ( 1 + 𝑐 1 ⁢ 𝛼 ⁢ ( 𝑡 − 𝑡 0 ) ) 1 / 𝛼

𝑡 0

ℎ 𝑡 ≤ { ℎ ⁢ ( 𝐱 0 ) ⁢ ( 1 − 1 2 ⁢ min ⁡ { 1 , 𝜅 ⁢ 𝐶 𝐿 ~ } ) 𝑡 if ⁢ 𝑞

( 𝐿 ~ 𝑞 / ( 𝜅 ⁢ 𝐶 ) 2 ) 1 / ( 𝑞 − 2 ) ( 1 + ( 𝑞 − 2 ) ⁢ ( 𝑡 − 𝑡 0 ) / ( 2 ⁢ 𝑞 ) ) 𝑞 / ( 𝑞 − 2 )

𝑡 0

and 𝐿 ~

𝐿 𝑓 ℒ 0

𝛾 𝑡

which is the step size ultimately taken by the algorithm at iteration 𝑡 . Note that if 𝛾 𝑡

where the last inequality simply comes from the bound on the gradient norm, and the fact that 𝐿 𝑡 ≤ 𝐿 ~ , for 𝐿 ~

For the case where 𝑞

ℎ 𝑡 ≤ { ℎ ⁢ ( 𝐱 0 ) ⁢ ( 1 − 1 2 ⁢ min ⁡ { 1 , 𝜅 ⁢ 𝐶 𝐿 ~ } ) 𝑡 if ⁢ 𝑞

( 𝐿 ~ 𝑞 / ( 𝜅 ⁢ 𝐶 ) 2 ) 1 / ( 𝑞 − 2 ) ( 1 + ( 𝑞 − 2 ) ⁢ ( 𝑡 − 𝑡 0 ) / ( 2 ⁢ 𝑞 ) ) 𝑞 / ( 𝑞 − 2 )

𝑡 0

( 𝐿 ~ 𝑞 / ( 𝜅 2 ⁢ 𝜇 𝑓 ℒ 0 ) ) 1 / ( 𝑞 − 1 ) ( 1 + ( 𝑞 − 1 ) ⁢ ( 𝑡 − 𝑡 0 ) / ( 2 ⁢ 𝑞 ) ) 𝑞 / ( 𝑞 − 1 )

ℒ 0

{ 𝐱 ∈ dom ⁢ ( 𝑓 ) ∩ 𝒳 ∣ 𝑓 ⁢ ( 𝐱 ) ≤ 𝑓 ⁢ ( 𝐱 0 ) } 𝑡 0

and 𝐿 ~

where we have also used the bound 𝐿 𝑡 ≤ 𝐿 ~ in the last equation. This leads us to a contraction, together with the case where 𝛾 𝑡

( 𝐿 ~ 𝑞 / ( 𝜅 2 ⁢ 𝜇 𝑓 ℒ 0 ) ) 1 / ( 𝑞 − 1 ) ( 1 + ( 𝑞 − 1 ) ⁢ ( 𝑡 − 𝑡 0 ) / ( 2 ⁢ 𝑞 ) ) 𝑞 / ( 𝑞 − 1 )

𝑡 0

def { 𝑀 ⁢ ‖ 𝐲 − 𝐱 ‖
if ⁢ 𝜈

def { 𝑒 𝜏 − 𝜏 − 1 𝜏 2
if ⁢ 𝜈

− 𝜏 − 𝑙𝑛 ⁢ ( 1 − 𝜏 ) 𝜏 2
if ⁢ 𝜈

( 1 − 𝜏 ) ⁢ 𝑙𝑛 ⁢ ( 1 − 𝜏 ) + 𝜏 𝜏 2
if ⁢ 𝜈

def { ⌈ 4 ⁢ 𝑀 ⁢ 𝐷 ⌉ − 2
if ⁢ 𝜈

def { ⌈ 4 ⁢ 𝑀 ⁢ 𝐷 ⌉ − 2
if ⁢ 𝜈

𝑔 ⁢ ( 𝐱 𝐾 )
≤ 2 2 + 𝐾 ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 )

≤ 2 𝑇 𝜈 + ⌈ ( 𝑇 − 𝑇 𝜈 ) / 3 ⌉ ⁢ 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 )

ℎ ⁢ ( 𝐱 𝑡 max + 1 )
≤ ℎ ⁢ ( 𝐱 𝑡 min ) − ∑ 𝑡

min 1 ≤ 𝑖 ≤ 𝑇 ⁡ 𝑔 ⁢ ( 𝐱 𝑖 )
≤ 18 ⁢ 𝑇 𝑇 − 𝑇 𝜈 ⁢ ( 𝑇 𝜈 + 1 2 ⁢ 𝑇 𝜈 + 𝑇 − 3 + 2 ⁢ 𝑇 − 2 ⁢ 𝑇 𝜈 + 3 ( 2 ⁢ 𝑇 𝜈 + 𝑇 ) 2 ) ⁢ max ⁡ { ℎ ⁢ ( 𝐱 0 ) , 𝐿 𝑓 ℒ 0 ⁢ 𝐷 2 ⁢ 𝜔 𝜈 ⁢ ( 1 / 2 ) }

ℎ 𝑡 ≤ { ℎ ⁢ ( 𝐱 0 ) ⁢ ( 1 − 1 2 ⁢ min ⁡ { 1 , 𝜅 ⁢ 𝐶 𝐿 ~ } ) 𝑡
if ⁢ 𝑞

ℎ 𝑡 ≤ { ℎ ⁢ ( 𝐱 0 ) ⁢ ( 1 − 1 2 ⁢ min ⁡ { 1 , 𝜅 ⁢ 𝐶 𝐿 ~ } ) 𝑡
if ⁢ 𝑞

{ 𝐱 ∈ dom ⁢ ( 𝑓 ) ∩ 𝒳 ∣ 𝑓 ⁢ ( 𝐱 ) ≤ 𝑓 ⁢ ( 𝐱 0 ) }

𝑡 0

Algorithm Assumptions Oracle calls Reference B-FW/LBTFW-GSC ‡
𝐱 * ∈ Int ⁡ ( 𝒳 ∩ dom ⁢ ( 𝑓 ) )
𝒪 ⁢ ( log ⁡ 1 / 𝜀 ) This work B-FW/LBTFW-GSC ‡
min 𝐱 ∈ 𝒳 ⁡ ‖ ∇ 𝑓 ⁢ ( 𝐱 ) ‖ > 0 , 𝑞

− ⟨ ∇ 𝑓 ⁢ ( 𝐱 𝑡 ) , 𝐝 𝑡 ⟩ < 𝛾 max ⁢ 𝐿 𝑡 ⁢ ‖ 𝐝 𝑡 ‖ 2 ,

𝛾 𝑡