A Bregman firmly nonexpansive proximal operator for baryconvex optimization

Mastane Achab
Deep Gambit Limited, Masdar City, Abu Dhabi, UAE
www.deepgambit.com, mastane@deepgambit.com

arXiv:2411.00928v3 [math.OC], 11 Apr 2025. Source: https://arxiv.org/html/2411.00928
License: arXiv.org perpetual non-exclusive license

Abstract

We present a generalization of the proximal operator defined through a convex combination of convex objectives, where the coefficients are updated in a minimax fashion. We prove that this new operator is Bregman firmly nonexpansive with respect to a Bregman divergence that combines Euclidean and information geometries; and that its fixed points are given by the critical points of a certain nonconvex function. Finally, we derive the associated continuous flows.

Notations

The Euclidean norm of any vector $x \in \mathbb{R}^m$ ($m \geq 1$) is denoted $\|x\|$. For any integer $S \geq 2$, we denote by $\mathbf{1}_S$ the all-ones vector of size $S$ and by $\Delta_S$ the probability simplex:

$$\Delta_S := \left\{ q = (q_1, \dots, q_S) \in [0,1]^S : q_1 + \dots + q_S = 1 \right\}.$$

The Kullback-Leibler divergence (Kullback and Leibler [1951]) will be denoted by "$D_{\mathrm{KL}}$" throughout the paper: for any $q, r \in \mathring{\Delta}_S$, $D_{\mathrm{KL}}(r \| q) := \sum_{s=1}^{S} r_s \log\left(\frac{r_s}{q_s}\right)$. Let $h(q) := \sum_{s=1}^{S} q_s \log(q_s)$ be the negative entropy function defined over $\mathring{\Delta}_S$; its gradient is $\nabla h(q) = (1 + \log(q_s))_s$, and the softargmax function is

$$\sigma : \xi = (\xi_1, \dots, \xi_S) \in \mathbb{R}^S \mapsto (\nabla h)^{-1}\left(\xi - \log\Big(\sum_{s=1}^{S} e^{\xi_s - 1}\Big) \mathbf{1}_S\right).$$

Given a differentiable function $\ell = (\ell_1, \dots, \ell_S) : \mathbb{R}^m \to \mathbb{R}^S$, we denote by $J_\ell$ its Jacobian matrix. Finally, given $(x, q) \in \mathbb{R}^m \times \Delta_S$, we refer to the vector $J_\ell(x)^\top q = \sum_{s=1}^{S} q_s \nabla \ell_s(x)$ as the "$q$-barygradient of $\ell$ at $x$".
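In code, these notations translate directly. The following pure-Python sketch (the function names are ours) implements the softargmax $\sigma$, the KL divergence, and the $q$-barygradient; note that since $(\nabla h)^{-1}(u) = e^{u-1}$ componentwise, $\sigma$ reduces to the usual softmax:

```python
import math

def softargmax(xi):
    # sigma(xi) = (grad h)^{-1}(xi - log(sum_s e^{xi_s - 1}) 1_S);
    # since (grad h)^{-1}(u) = e^{u - 1} componentwise, this is the softmax.
    mx = max(xi)  # shift by the max for numerical stability
    exps = [math.exp(v - mx) for v in xi]
    z = sum(exps)
    return [e / z for e in exps]

def kl(r, q):
    # D_KL(r || q) = sum_s r_s log(r_s / q_s), for r, q in the open simplex
    return sum(rs * math.log(rs / qs) for rs, qs in zip(r, q))

def barygradient(grads, q):
    # q-barygradient of l at x: J_l(x)^T q = sum_s q_s * grad l_s(x),
    # where `grads` lists the gradient vectors grad l_s(x)
    m = len(grads[0])
    return [sum(qs * g[i] for qs, g in zip(q, grads)) for i in range(m)]
```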

1 Proximal operator

In this article we present a generalization of the convex optimization formalism (Boyd and Vandenberghe [2004]) that we call baryconvex optimization, since it involves weighted convex objectives where the weights are learned in a minimax fashion.

Definition 1 (Generalized proximal operator).

Let $\ell = (\ell_1, \dots, \ell_S) : \mathbb{R}^m \to \mathbb{R}^S$ for $m \geq 1$, $S \geq 2$, where $\ell_s$ is a differentiable convex function for each $s \in \{1, \dots, S\}$. Given $\lambda > 0$, we define our generalized proximal operator $\mathrm{prox}_{\lambda\ell}$ as follows: for all $(x, q) \in \mathbb{R}^m \times \mathring{\Delta}_S$,

$$\mathrm{prox}_{\lambda\ell}(x, q) := \operatorname*{arg\,minimax}_{(z, r) \in \mathbb{R}^m \times \mathring{\Delta}_S} H_{x,q}(z, r) := r^\top \ell(z) + \frac{1}{2\lambda} \|z - x\|^2 - \frac{1}{\lambda} D_{\mathrm{KL}}(r \| q).$$

First notice that in the degenerate case $S = 1$, the probability simplex is reduced to the singleton $\Delta_1 = \{1\}$, and we recover the standard proximal operator with a single convex loss function, whose minimizers are exactly the fixed points of the prox. This paper proposes to extend well-known convex optimization methods such as the proximal point algorithm (PPA, see Rockafellar [1976]) and gradient descent (GD, see Boyd and Vandenberghe [2004]) to our general setting with $S \geq 2$.

Question: Can we compute a fixed point (if it exists) of the generalized prox in Definition 1?

Answer: Yes, by leveraging a Bregman geometry that combines Euclidean and simplex structures, as will be shown in the rest of the paper.

Saddle point

We point out that the function $(z, r) \mapsto H_{x,q}(z, r)$ is strongly convex-concave (i.e. strongly convex in $z$ and strongly concave in $r$, see e.g. Boyd and Vandenberghe [2004]) and admits a unique saddle point $(x', q') = \mathrm{prox}_{\lambda\ell}(x, q)$ characterized by the stationarity condition

$$\nabla H_{x,q}(x', q') = \begin{pmatrix} \mathbf{0}_m \\ -\frac{c}{\lambda} \mathbf{1}_S \end{pmatrix} \quad \text{with } c = \log\Big(\sum_s e^{\log(q'_s) - \lambda \ell_s(x')}\Big) - \log\Big(\sum_s e^{\log(q_s)}\Big). \tag{1}$$
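Although the paper does not spell out an algorithm for evaluating the prox itself, the stationarity system suggests a simple numerical scheme: alternate a gradient step on $z$ with the exact maximization in $r$ (for fixed $z$, maximizing $r^\top \ell(z) - \frac{1}{\lambda} D_{\mathrm{KL}}(r \| q)$ over the simplex gives $r_s \propto q_s e^{\lambda \ell_s(z)}$ in closed form). The sketch below is our own illustration, not a method from the paper:

```python
import math

def prox_saddle(losses, grads, x, q, lam, iters=2000, eta=0.1):
    # Heuristic saddle-point solver for H_{x,q}(z, r) (our illustration):
    # gradient descent in z, exact (closed-form) maximization in r.
    z, r = list(x), list(q)
    for _ in range(iters):
        # grad_z H = sum_s r_s grad l_s(z) + (z - x) / lam
        g = [sum(rs * gs(z)[i] for rs, gs in zip(r, grads)) + (z[i] - x[i]) / lam
             for i in range(len(z))]
        z = [zi - eta * lam * gi for zi, gi in zip(z, g)]
        # argmax_r H for fixed z: r_s proportional to q_s * exp(lam * l_s(z))
        logits = [math.log(qs) + lam * f(z) for qs, f in zip(q, losses)]
        mx = max(logits)
        w = [math.exp(t - mx) for t in logits]
        r = [wi / sum(w) for wi in w]
    return z, r
```

For instance, with $S = 2$, $m = 1$, $\ell_1(z) = (z-1)^2$ and $\ell_2(z) = (z+1)^2$, the iteration converges to the unique saddle point of $H_{x,q}$.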

Further, by the minimax theorem, we have:

$$\min_z \max_r H_{x,q}(z, r) = H_{x,q}(x', q') = \max_r \min_z H_{x,q}(z, r). \tag{2}$$

Tensorization

Consider two generalized proximal operators (with the same step-size $\lambda > 0$) characterized for $i \in \{1, 2\}$ by $\ell^{(i)} : \mathbb{R}^m \to \mathbb{R}^{S_i}$ with $S_i \geq 2$. We can define the outer sum $\ell^{(1)} \oplus \ell^{(2)} : \mathbb{R}^m \to \mathbb{R}^{S_1 \times S_2}$ as $[\ell^{(1)} \oplus \ell^{(2)}]_{j,k} := \ell^{(1)}_j + \ell^{(2)}_k$, and for probabilities $q^{(i)} \in \Delta_{S_i}$ the outer product $q^{(1)} \otimes q^{(2)} \in \Delta_{S_1 \times S_2}$ as $[q^{(1)} \otimes q^{(2)}]_{j,k} := q^{(1)}_j q^{(2)}_k$, for all $(j, k) \in \{1, \dots, S_1\} \times \{1, \dots, S_2\}$. We have the following property concerning the proximity operator of $\ell^{(1)} \oplus \ell^{(2)}$.
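These two operations are straightforward to implement. A small sketch (ours), representing the $S_1 \times S_2$ grids as nested lists, where the entries of `l1` and `l2` stand for the evaluated loss values $\ell^{(1)}_j(x)$ and $\ell^{(2)}_k(x)$ at a fixed point $x$:

```python
def outer_sum(l1, l2):
    # [l1 (+) l2]_{j,k} = l1_j + l2_k
    return [[a + b for b in l2] for a in l1]

def outer_product(q1, q2):
    # [q1 (x) q2]_{j,k} = q1_j * q2_k: a product distribution on {1..S1} x {1..S2}
    return [[a * b for b in q2] for a in q1]
```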

Proposition 2 (Closure property).

If

$$(x', q') = \mathrm{prox}_{\lambda \ell^{(1)} \oplus \ell^{(2)}}\big(x, q^{(1)} \otimes q^{(2)}\big),$$

then there exist $\tilde{q}^{(i)} \in \Delta_{S_i}$ such that $q' = \tilde{q}^{(1)} \otimes \tilde{q}^{(2)}$.

The proof of Proposition 2 is straightforward using Eq. (1). In other words, the next distribution $q'$ remains an outer product, and we deduce that the subset $\mathbb{R}^m \times (\Delta_{S_1} \otimes \Delta_{S_2}) \subseteq \mathbb{R}^m \times \Delta_{S_1 \times S_2}$ is closed under $\mathrm{prox}_{\lambda \ell^{(1)} \oplus \ell^{(2)}}$:

$$\mathrm{prox}_{\lambda \ell^{(1)} \oplus \ell^{(2)}}\left(\mathbb{R}^m \times (\Delta_{S_1} \otimes \Delta_{S_2})\right) \subseteq \mathbb{R}^m \times (\Delta_{S_1} \otimes \Delta_{S_2}). \tag{3}$$

This closure property can even be extended to any multi-index set $\mathcal{T} \subseteq \{1, \dots, S\}^N$ with $[\bigoplus_{k=1}^N \ell^{(k)}]_\tau := \sum_{k=1}^N \ell^{(k)}_{\tau_k}$ and $[\bigotimes_{k=1}^N q^{(k)}]_\tau \propto \prod_{k=1}^N q^{(k)}_{\tau_k}$ for all $\tau = (\tau_1, \dots, \tau_N) \in \mathcal{T}$. In particular, this makes it possible to optimize models that are based on circular convolutions (Achab [2022], Achab [2023]).

In the next sections, we propose to generalize some key components of the convex analysis toolbox (the firm nonexpansiveness property, see Bauschke and Combettes [2011], and the PPA and GD methods) in order to find a fixed point of $\mathrm{prox}_{\lambda\ell}$ in the general case $S \geq 2$.

2 Bregman firm nonexpansiveness

We recall from Brohé and Tossings [2000] and Bauschke et al. [2003] that an operator $T$ is Bregman firmly nonexpansive (BFNE) with respect to $f$ if

$$\langle Tx - Ty, \nabla f(Tx) - \nabla f(Ty) \rangle \leq \langle Tx - Ty, \nabla f(x) - \nabla f(y) \rangle, \quad \forall x, y.$$

Furthermore, if a BFNE operator has a fixed point $x^* = Tx^*$, then any sequence obtained by recursively applying $T$, namely $x_{k+1} = Tx_k$, converges to a fixed point. Our main result (Theorem 4 below) states that our generalized proximal operator introduced in Section 1 is BFNE with respect to a hybrid Bregman divergence mixing the squared Euclidean and the KL divergences.

Definition 3 (Euclidean+KL Bregman divergence).

Let the function $f$ be defined for all $(x, q) \in \mathbb{R}^m \times \mathring{\Delta}_S$ as follows:

$$f(x, q) := \frac{1}{2} \|x\|^2 + h(q),$$

and its corresponding Bregman divergence:

$$D_f\left((x, q), (x', q')\right) = \frac{1}{2} \|x - x'\|^2 + D_{\mathrm{KL}}(q \| q').$$
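A direct implementation of this hybrid divergence (a small sketch of ours, for vectors and simplex points given as Python lists):

```python
import math

def bregman_f(x, q, x2, q2):
    # D_f((x, q), (x', q')) = 0.5 * ||x - x'||^2 + D_KL(q || q')
    sq = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x2))
    kl = sum(a * math.log(a / b) for a, b in zip(q, q2))
    return sq + kl
```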

Theorem 4 (BFNE).

Let $\mathrm{prox}_{\lambda\ell}$ and $f$ be as defined in Definitions 1 and 3 respectively. Then, $\mathrm{prox}_{\lambda\ell}$ is Bregman firmly nonexpansive with respect to $f$.

Proof.

For $q \in \mathring{\Delta}_S$, we have by the convexity of $x \mapsto q^\top \ell(x)$:

$$q^\top \ell(z) - q^\top \ell(x) \geq q^\top J_\ell(x) (z - x), \tag{4}$$

and, similarly, for any other $r \in \mathring{\Delta}_S$:

$$r^\top \ell(x) - r^\top \ell(z) \geq r^\top J_\ell(z) (x - z). \tag{5}$$

Then, by summing Eqs. (4) and (5), it holds:

$$\left(J_\ell(z)^\top r - J_\ell(x)^\top q\right)^\top (z - x) \geq q^\top \ell(x) - q^\top \ell(z) + r^\top \ell(z) - r^\top \ell(x)$$

$$\iff \left\langle \begin{pmatrix} z \\ r \end{pmatrix} - \begin{pmatrix} x \\ q \end{pmatrix}, \begin{pmatrix} J_\ell(z)^\top r \\ -\ell(z) \end{pmatrix} - \begin{pmatrix} J_\ell(x)^\top q \\ -\ell(x) \end{pmatrix} \right\rangle \geq 0. \tag{6}$$

From the stationarity condition (1) satisfied by the saddle point $(x', q') = \mathrm{prox}_{\lambda\ell}(x, q)$ of the function $H_{x,q}$:

$$\nabla H_{x,q}(x', q') = \begin{pmatrix} \mathbf{0}_m \\ -\frac{c}{\lambda} \mathbf{1}_S \end{pmatrix} \iff \begin{cases} J_\ell(x')^\top q' + \frac{1}{\lambda}(x' - x) = \mathbf{0}_m \\ \ell(x') - \frac{1}{\lambda}\left(\nabla h(q') - \nabla h(q)\right) = -\frac{c}{\lambda} \mathbf{1}_S \end{cases} \iff \begin{cases} x = x' + \lambda J_\ell(x')^\top q' \\ \nabla h(q) + c \mathbf{1}_S = \nabla h(q') - \lambda \ell(x'). \end{cases} \tag{7}$$

We are now ready to prove that $\mathrm{prox}_{\lambda\ell}$ is BFNE w.r.t. $f$. For $x, z \in \mathbb{R}^m$, $q, r \in \mathring{\Delta}_S$ and $(x', q') = \mathrm{prox}_{\lambda\ell}(x, q)$, $(z', r') = \mathrm{prox}_{\lambda\ell}(z, r)$, substituting Eq. (7) (the constant terms $c \mathbf{1}_S$ cancel in the inner product since $q' - r'$ sums to zero):

$$\begin{aligned}
&\left\langle \mathrm{prox}_{\lambda\ell}(x, q) - \mathrm{prox}_{\lambda\ell}(z, r), \nabla f(x, q) - \nabla f(z, r) \right\rangle \\
&= \left\langle \begin{pmatrix} x' \\ q' \end{pmatrix} - \begin{pmatrix} z' \\ r' \end{pmatrix}, \begin{pmatrix} x \\ \nabla h(q) \end{pmatrix} - \begin{pmatrix} z \\ \nabla h(r) \end{pmatrix} \right\rangle \\
&= \left\langle \begin{pmatrix} x' - z' \\ q' - r' \end{pmatrix}, \begin{pmatrix} x' + \lambda J_\ell(x')^\top q' - z' - \lambda J_\ell(z')^\top r' \\ \nabla h(q') - \lambda \ell(x') - \nabla h(r') + \lambda \ell(z') \end{pmatrix} \right\rangle \\
&= \|x' - z'\|^2 + \langle q' - r', \nabla h(q') - \nabla h(r') \rangle \\
&\quad + \lambda \langle x' - z', J_\ell(x')^\top q' - J_\ell(z')^\top r' \rangle + \lambda \langle q' - r', -\ell(x') + \ell(z') \rangle \\
&\geq \|x' - z'\|^2 + \langle q' - r', \nabla h(q') - \nabla h(r') \rangle \\
&= \left\langle \begin{pmatrix} x' \\ q' \end{pmatrix} - \begin{pmatrix} z' \\ r' \end{pmatrix}, \nabla f(x', q') - \nabla f(z', r') \right\rangle,
\end{aligned} \tag{8}$$

where the inequality comes from Eq. (6). ∎

We highlight that Theorem 4 generalizes the fact that the classic proximal operator is firmly nonexpansive, since $D_f$ reduces to the squared Euclidean Bregman divergence in the convex scenario $S = 1$. Moreover, the next result shows that our prox can also be written as a generalized resolvent. Indeed, we recall from Eckstein [1993], Bauschke et al. [2003] and Borwein et al. [2011] that an $f$-resolvent is equal to $(\nabla f + \lambda A)^{-1} \circ \nabla f$ for some monotone operator $A$. This definition extends the classic notion of resolvent, namely $(I + \lambda A)^{-1}$ (which corresponds to the particular case $f = \frac{1}{2}\|\cdot\|^2$), to a general Bregman divergence $D_f$.

Proposition 5 ($f$-resolvent).

Consider the notations introduced in Definition 1.

(i) The operator $A(x, q) := \begin{pmatrix} J_\ell(x)^\top q \\ -\ell(x) \end{pmatrix}$ is monotone.

(ii) Our prox operator is an $f$-resolvent:

$$\mathrm{prox}_{\lambda\ell} = \left(\nabla f + \lambda A - \begin{pmatrix} \mathbf{0}_m \\ \mathrm{LSE}(\nabla h - \lambda \ell) \mathbf{1}_S \end{pmatrix}\right)^{-1} \circ \left(\nabla f - \begin{pmatrix} \mathbf{0}_m \\ \mathrm{LSE}(\nabla h) \mathbf{1}_S \end{pmatrix}\right),$$

with $A$ from (i), $f$ from Definition 3, and $\mathrm{LSE}(\xi) := \log\left(\sum_s e^{\xi_s - 1}\right)$.

Proof.

(i) follows from the inequality in Eq. (6), while (ii) derives from the stationarity condition (Eq. 1) of the saddle point $(x', q') = \mathrm{prox}_{\lambda\ell}(x, q)$ of the function $H_{x,q}$. ∎

PPA and fixed points

Theorem 4 implies that the generalized proximal point algorithm $(x_{k+1}, q_{k+1}) = \mathrm{prox}_{\lambda\ell}(x_k, q_k)$ converges to a fixed point $(x^*, q^*)$ of the prox, if there exists any. Such a fixed point is characterized by:

$$(x^*, q^*) = \mathrm{prox}_{\lambda\ell}(x^*, q^*) \iff \begin{cases} J_\ell(x^*)^\top q^* = 0 \\ q^*_s = \dfrac{q^*_s e^{-\lambda \ell_s(x^*)}}{\sum_{s'} q^*_{s'} e^{-\lambda \ell_{s'}(x^*)}} \quad (\forall 1 \leq s \leq S), \end{cases} \tag{9}$$

which means that the $q^*$-barygradient of $\ell$ at $x^*$ is equal to zero and that:

$$\ell_1(x^*) = \dots = \ell_S(x^*). \tag{10}$$

3 Critical points

An immediate consequence of Equations (9)-(10) is that the set of fixed points $\mathrm{Fix}(\mathrm{prox}_{\lambda\ell})$ of our generalized proximal operator $f$-mirrors the set of critical points of the following function: for all $(x, \bar{\xi}) \in \mathbb{R}^m \times \mathbb{R}^{S-1}$ and $\xi := (\bar{\xi}, 0) \in \mathbb{R}^S$,

$$\bar{F}(x, \bar{\xi}) := \sum_{s=1}^{S-1} \bar{\sigma}_s(\bar{\xi}) \ell_s(x) + \left(1 - \sum_{s=1}^{S-1} \bar{\sigma}_s(\bar{\xi})\right) \ell_S(x) = \sigma(\xi)^\top \ell(x), \tag{11}$$

where

$$\begin{pmatrix} \bar{\sigma}(\bar{\xi}) \\ \frac{1}{1 + \sum_{s=1}^{S-1} e^{\bar{\xi}_s}} \end{pmatrix} = \sigma(\xi). \tag{12}$$

Before looking at the critical points of $\bar{F}$, let us first derive the expression of the Fisher information matrix (FIM) of our family of discrete distributions (see e.g. Amari [1998], Amari [2016]).

Proposition 6 (FIM).

The Fisher information matrix $\mathbf{I}(\bar{\xi})$ of the family of distributions $\{\sigma(\xi) : \xi = (\bar{\xi}, 0), \bar{\xi} \in \mathbb{R}^{S-1}\}$ is given by:

$$\forall 1 \leq i, j \leq S-1, \quad [\mathbf{I}(\bar{\xi})]_{i,j} = \delta_{i,j} \bar{\sigma}_i(\bar{\xi}) - \bar{\sigma}_i(\bar{\xi}) \bar{\sigma}_j(\bar{\xi}),$$

with $\delta_{i,j}$ denoting Kronecker deltas. Equivalently,

$$\mathbf{I}(\bar{\xi}) = \mathrm{Diag}(\bar{\sigma}(\bar{\xi})) - \bar{\sigma}(\bar{\xi}) \bar{\sigma}(\bar{\xi})^\top \quad \text{and} \quad \mathbf{I}(\bar{\xi})^{-1} = \mathrm{Diag}(\bar{\sigma}(\bar{\xi}))^{-1} + \frac{1}{1 - \sum_{s=1}^{S-1} \bar{\sigma}_s(\bar{\xi})} \mathbf{1}_{S-1} \mathbf{1}_{S-1}^\top.$$

Proof.

By definition:

$$[\mathbf{I}(\bar{\xi})]_{i,j} = \mathbb{E}_{\tau \sim \sigma(\xi)}\left[\frac{\partial \log \sigma_\tau(\xi)}{\partial \bar{\xi}_i} \frac{\partial \log \sigma_\tau(\xi)}{\partial \bar{\xi}_j}\right] = \sum_{\tau=1}^{S} \sigma_\tau(\xi) \left[\frac{\partial \log \sigma_\tau(\xi)}{\partial \bar{\xi}_i} \frac{\partial \log \sigma_\tau(\xi)}{\partial \bar{\xi}_j}\right].$$

For all $1 \leq i \leq S-1$, $1 \leq \tau \leq S$, we have:

$$\frac{\partial \log \sigma_\tau(\xi)}{\partial \bar{\xi}_i} = \delta_{i,\tau} - \bar{\sigma}_i(\bar{\xi}).$$

Hence,

$$\frac{\partial \log \sigma_\tau(\xi)}{\partial \bar{\xi}_i} \frac{\partial \log \sigma_\tau(\xi)}{\partial \bar{\xi}_j} = \left[\delta_{i,\tau} - \bar{\sigma}_i(\bar{\xi})\right]\left[\delta_{j,\tau} - \bar{\sigma}_j(\bar{\xi})\right] = \delta_{i,\tau} \delta_{j,\tau} - \delta_{i,\tau} \bar{\sigma}_j(\bar{\xi}) - \delta_{j,\tau} \bar{\sigma}_i(\bar{\xi}) + \bar{\sigma}_i(\bar{\xi}) \bar{\sigma}_j(\bar{\xi}),$$

from which we deduce that

$$[\mathbf{I}(\bar{\xi})]_{i,j} = \delta_{i,j} \bar{\sigma}_i(\bar{\xi}) - \bar{\sigma}_i(\bar{\xi}) \bar{\sigma}_j(\bar{\xi}).$$

∎

In particular, the FIM defines the Fisher-Rao metric (Rao [1992]).
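Proposition 6 is easy to check numerically. The sketch below (ours) builds $\mathbf{I}(\bar{\xi})$ and the claimed inverse as nested lists, so that one can verify that their product is the identity:

```python
import math

def sigma_bar(xi_bar):
    # first S-1 coordinates of sigma((xi_bar, 0))
    z = 1.0 + sum(math.exp(v) for v in xi_bar)
    return [math.exp(v) / z for v in xi_bar]

def fim(xi_bar):
    # I(xi_bar) = Diag(sigma_bar) - sigma_bar sigma_bar^T
    s = sigma_bar(xi_bar)
    n = len(s)
    return [[(s[i] if i == j else 0.0) - s[i] * s[j] for j in range(n)]
            for i in range(n)]

def fim_inv(xi_bar):
    # Diag(sigma_bar)^{-1} + (1 / (1 - sum_s sigma_bar_s)) * ones ones^T
    s = sigma_bar(xi_bar)
    n = len(s)
    last = 1.0 - sum(s)  # this is the S-th coordinate of sigma(xi)
    return [[(1.0 / s[i] if i == j else 0.0) + 1.0 / last for j in range(n)]
            for i in range(n)]
```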

Given that

$$\nabla \bar{F}(x, \bar{\xi}) = \begin{pmatrix} J_\ell(x)^\top \sigma(\xi) \\ \mathbf{I}(\bar{\xi}) \bar{\ell}(x) \end{pmatrix} \quad \text{with } \bar{\ell} := \begin{pmatrix} \ell_1 - \ell_S \\ \vdots \\ \ell_{S-1} - \ell_S \end{pmatrix}, \tag{13}$$

we can now state our result concerning the set of critical points of $\bar{F}$.

Theorem 7.

Recall that for any $\bar{\xi} \in \mathbb{R}^{S-1}$, we denote $\xi = (\bar{\xi}, 0) \in \mathbb{R}^S$. Then,

$$\mathrm{Fix}(\mathrm{prox}_{\lambda\ell}) = \left\{ (x, \sigma(\xi)) : \nabla \bar{F}(x, \bar{\xi}) = 0 \right\}.$$

Proof.

We have:

$$\nabla \bar{F}(x, \bar{\xi}) = \begin{pmatrix} J_\ell(x)^\top \sigma(\xi) \\ \mathbf{I}(\bar{\xi}) \bar{\ell}(x) \end{pmatrix} = 0 \iff \begin{cases} J_\ell(x)^\top \sigma(\xi) = 0 \\ \bar{\ell}(x) \in \ker(\mathbf{I}(\bar{\xi})) = \{\mathbf{0}_{S-1}\} \end{cases} \iff (x, \sigma(\xi)) \in \mathrm{Fix}(\mathrm{prox}_{\lambda\ell}). \tag{14}$$

∎

We now provide a result concerning the critical values of $\bar{F}$, i.e. the values taken at critical points.

Proposition 8.

All critical values of $\bar{F}$ are identical.

Proof.

Assume $(x_1, \bar{\xi}_1), (x_2, \bar{\xi}_2) \in \mathbb{R}^{m+S-1}$ are two critical points of $\bar{F}$: for $i \in \{1, 2\}$,

$$\nabla \bar{F}(x_i, \bar{\xi}_i) = \begin{pmatrix} J_\ell(x_i)^\top \sigma(\xi_i) \\ \mathbf{I}(\bar{\xi}_i) \bar{\ell}(x_i) \end{pmatrix} = 0.$$

Thus $x_i$ minimizes the convex function $x \mapsto \sigma(\xi_i)^\top \ell(x)$, and we know from Theorem 7 that

$$\ell_1(x_i) = \dots = \ell_S(x_i). \tag{15}$$

We deduce that:

$$\bar{F}(x_1, \bar{\xi}_1) \leq \bar{F}(x_2, \bar{\xi}_1) = \sigma(\xi_1)^\top \ell(x_2) = \sigma(\xi_2)^\top \ell(x_2) = \bar{F}(x_2, \bar{\xi}_2), \tag{16}$$

and similarly that $\bar{F}(x_2, \bar{\xi}_2) \leq \bar{F}(x_1, \bar{\xi}_1)$. ∎

Proposition 8 generalizes the property that a convex function has at most one critical value. Furthermore, Eq. (16) shows that if there exists $s \in \{1, \dots, S\}$ such that $\ell_s$ is strictly convex, then necessarily $x_1 = x_2$.
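As a concrete illustration (our toy example, not from the paper), take $m = 1$, $S = 2$, $\ell_1(x) = (x-1)^2$ and $\ell_2(x) = (x+1)^2$. Then $\ell_1(x^*) = \ell_2(x^*)$ forces $x^* = 0$, and the zero-barygradient condition $q_1 \nabla\ell_1(0) + q_2 \nabla\ell_2(0) = 0$ gives $q^* = (\tfrac{1}{2}, \tfrac{1}{2})$, i.e. $\bar{\xi}^* = 0$. A finite-difference check confirms that $(x^*, \bar{\xi}^*)$ is a critical point of $\bar{F}$ with critical value $\ell_1(0) = 1$:

```python
import math

# toy losses (ours): l1(x) = (x-1)^2, l2(x) = (x+1)^2
l = [lambda x: (x - 1.0) ** 2, lambda x: (x + 1.0) ** 2]

def F_bar(x, xi_bar):
    # F_bar(x, xi_bar) = sigma(xi)^T l(x) with xi = (xi_bar, 0) and S = 2
    s1 = math.exp(xi_bar) / (math.exp(xi_bar) + 1.0)
    return s1 * l[0](x) + (1.0 - s1) * l[1](x)

def grad_F_bar(x, xi_bar, eps=1e-6):
    # central finite differences of F_bar
    gx = (F_bar(x + eps, xi_bar) - F_bar(x - eps, xi_bar)) / (2.0 * eps)
    gxi = (F_bar(x, xi_bar + eps) - F_bar(x, xi_bar - eps)) / (2.0 * eps)
    return gx, gxi
```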

Riemannian Hessian matrix

From now on, let us assume that every $\ell_s$ is twice continuously differentiable. For $p \in \mathbb{R}^{S-1}$, denote by $\mathbf{T}(p) \in \mathbb{R}^{(S-1) \times (S-1) \times (S-1)}$ the tensor derivative of $\mathbf{C}(p) := \mathrm{Diag}(p) - p p^\top$ given by

$$\forall (i, j, k) \in \{1, \dots, S-1\}^3, \quad [\mathbf{T}(p)]_{i,j,k} := \frac{\partial [\mathbf{C}(p)]_{i,j}}{\partial p_k} = \delta_{i,j} \delta_{i,k} - p_j \delta_{i,k} - p_i \delta_{j,k}. \tag{17}$$

The Hessian matrix of $\bar{F}$ admits the following expression:

$$\nabla^2 \bar{F}(x, \bar{\xi}) = \begin{pmatrix} \sum_s \sigma_s(\xi) \nabla^2 \ell_s(x) & J_{\bar{\ell}}(x)^\top \mathbf{I}(\bar{\xi}) \\ \mathbf{I}(\bar{\xi}) J_{\bar{\ell}}(x) & [\mathbf{T}(\bar{\sigma}(\bar{\xi})) \times_2 \bar{\ell}(x)] \, \mathbf{I}(\bar{\xi}) \end{pmatrix}, \tag{18}$$

where the matrix with entries $[\mathbf{T}(p) \times_2 v]_{i,k} := \sum_j [\mathbf{T}(p)]_{i,j,k} v_j$ is the second-mode contraction of the tensor $\mathbf{T}(p)$ with a vector $v$.
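The tensor $\mathbf{T}(p)$ and the mode-2 contraction are simple to implement, and Eq. (17) can be checked against finite differences of $\mathbf{C}(p)$ (a sketch of ours):

```python
def C(p):
    # C(p) = Diag(p) - p p^T
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

def T(p):
    # [T(p)]_{i,j,k} = d[C(p)]_{i,j} / dp_k
    #               = delta_ij delta_ik - p_j delta_ik - p_i delta_jk
    n = len(p)
    d = lambda a, b: 1.0 if a == b else 0.0
    return [[[d(i, j) * d(i, k) - p[j] * d(i, k) - p[i] * d(j, k)
              for k in range(n)] for j in range(n)] for i in range(n)]

def mode2(Tp, v):
    # second-mode contraction: [T x_2 v]_{i,k} = sum_j T_{i,j,k} v_j
    n = len(v)
    return [[sum(Tp[i][j][k] * v[j] for j in range(n)) for k in range(n)]
            for i in range(n)]
```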

Letting

$$\mathbf{M}(x, \bar{\xi}) := \begin{pmatrix} I_m & \mathbf{0}_{m \times (S-1)} \\ \mathbf{0}_{(S-1) \times m} & \mathbf{I}(\bar{\xi}) \end{pmatrix} \tag{19}$$

be the $x$-Euclidean $\bar{\xi}$-Fisher-Rao product Riemannian metric, we deduce that the Riemannian Hessian $\tilde{\nabla}^2 \bar{F}$ with respect to $\mathbf{M}$ is given by:

$$\tilde{\nabla}^2 \bar{F}(x, \bar{\xi}) = \nabla^2 \bar{F}(x, \bar{\xi}) - \begin{pmatrix} \mathbf{0}_{m \times m} & \mathbf{0}_{m \times (S-1)} \\ \mathbf{0}_{(S-1) \times m} & \left[\sum_k \Gamma^k_{i,j}(\bar{\xi}) \frac{\partial \bar{F}}{\partial \bar{\xi}_k}(x, \bar{\xi})\right]_{i,j} \end{pmatrix} \tag{20}$$

with Christoffel symbols provided below.

Proposition 9 (Christoffel symbols).

$$\Gamma^k_{i,j}(\bar{\xi}) = \frac{1}{2} \left\{ \delta_{i,j} \delta_{i,k} - \delta_{i,j} \bar{\sigma}_i(\bar{\xi}) - \delta_{i,k} \bar{\sigma}_j(\bar{\xi}) - \delta_{j,k} \bar{\sigma}_i(\bar{\xi}) + 2 \bar{\sigma}_i(\bar{\xi}) \bar{\sigma}_j(\bar{\xi}) \right\}.$$

Proof.

We consider the distributions $\sigma(\xi)$ that form an exponential family with potential function $\psi(\bar{\xi}) = \log\big(1 + \sum_{s=1}^{S-1} e^{\bar{\xi}_s}\big)$ (see page 80 in Amari [2012]). Recall that $[\mathbf{I}(\bar{\xi})]_{i,j} = \frac{\partial^2 \psi}{\partial \bar{\xi}_i \partial \bar{\xi}_j}(\bar{\xi})$, from which we get the Christoffel symbols of the first kind (see pages 41 and 106 in Amari [2012]):

$$\Gamma_{i,j,k}(\bar{\xi}) = \frac{1}{2} \frac{\partial^3 \psi}{\partial \bar{\xi}_i \partial \bar{\xi}_j \partial \bar{\xi}_k}(\bar{\xi}) = \frac{1}{2} \frac{\partial [\mathbf{I}(\bar{\xi})]_{i,j}}{\partial \bar{\xi}_k} = \frac{1}{2} \left\{ \delta_{i,j} \delta_{i,k} \bar{\sigma}_i(\bar{\xi}) - \delta_{i,j} \bar{\sigma}_i(\bar{\xi}) \bar{\sigma}_k(\bar{\xi}) - \delta_{i,k} \bar{\sigma}_i(\bar{\xi}) \bar{\sigma}_j(\bar{\xi}) - \delta_{j,k} \bar{\sigma}_i(\bar{\xi}) \bar{\sigma}_j(\bar{\xi}) + 2 \bar{\sigma}_i(\bar{\xi}) \bar{\sigma}_j(\bar{\xi}) \bar{\sigma}_k(\bar{\xi}) \right\}.$$

We deduce (see e.g. Weisstein [2003]) that

$$\Gamma^k_{i,j}(\bar{\xi}) = \sum_{s=1}^{S-1} [\mathbf{I}(\bar{\xi})^{-1}]_{k,s} \Gamma_{i,j,s}(\bar{\xi}) = \frac{1}{2} \left\{ \delta_{i,j} \delta_{i,k} - \delta_{i,j} \bar{\sigma}_i(\bar{\xi}) - \delta_{i,k} \bar{\sigma}_j(\bar{\xi}) - \delta_{j,k} \bar{\sigma}_i(\bar{\xi}) + 2 \bar{\sigma}_i(\bar{\xi}) \bar{\sigma}_j(\bar{\xi}) \right\}.$$

∎

Remark 1 (Potential function).

One easily verifies that the Riemannian Hessian matrix of $(x, \bar{\xi}) \mapsto \frac{1}{2} \|x\|^2 + \psi(\bar{\xi})$, with the log-partition function $\psi(\bar{\xi}) = \log\big(1 + \sum_{s=1}^{S-1} e^{\bar{\xi}_s}\big)$, is exactly $\mathbf{M}(x, \bar{\xi})$. Indeed, the correction term involving the Christoffel symbols vanishes (namely $\sum_k \Gamma^k_{i,j} \frac{\partial \psi}{\partial \bar{\xi}_k} = 0$).

Proposition 10 (Riemannian Hessian).

The Riemannian Hessian of $\bar{F}$ with respect to $\mathbf{M}$ is equal to

$$\tilde{\nabla}^2 \bar{F}(x, \bar{\xi}) = \begin{pmatrix} \sum_s \sigma_s(\xi) \nabla^2 \ell_s(x) & J_{\bar{\ell}}(x)^\top \mathbf{I}(\bar{\xi}) \\ \mathbf{I}(\bar{\xi}) J_{\bar{\ell}}(x) & \mathbf{H}(x, \bar{\xi}) \end{pmatrix},$$

where

$$\mathbf{H} := \frac{1}{2} \left\{ \mathrm{Diag}(\bar{\sigma}) \mathbf{D} - \mathbf{D} \bar{\sigma} \bar{\sigma}^\top - \bar{\sigma} \bar{\sigma}^\top \mathbf{D} \right\} \quad \text{with } \mathbf{D} := \mathrm{Diag}\big(\bar{\ell} - \mathbf{1}_{S-1} \bar{\sigma}^\top \bar{\ell}\big).$$

Proof.

By combining Eqs. (18) and (20) with Proposition 9 and noticing that

$$[\mathbf{T}(\bar{\sigma}(\bar{\xi})) \times_2 \bar{\ell}(x)] \, \mathbf{I}(\bar{\xi}) = 2 \left[\sum_k \Gamma^k_{i,j}(\bar{\xi}) \frac{\partial \bar{F}}{\partial \bar{\xi}_k}(x, \bar{\xi})\right]_{i,j} = 2 \mathbf{H}(x, \bar{\xi}). \tag{21}$$

∎

Remark 2 (Inertia and saddle point).

Proposition 10 shows that the Riemannian Hessian of $\bar{F}$ is in general not positive semi-definite (by using the Haynsworth inertia additivity formula, if at least one $\ell_s$ satisfies $\nabla^2 \ell_s \succ \mathbf{0}$), and hence that $\bar{F}$ is not necessarily geodesically convex (see Udriste [2013] or Theorem 11.23 in Boumal [2023]). More precisely, $\tilde{\nabla}^2 \bar{F}(x, \bar{\xi})$ is congruent to (and thus, by Sylvester's law of inertia, has the same inertia as) the block-diagonal matrix $\mathrm{Diag}(\mathbf{B}_1, \mathbf{B}_2)$ with positive definite matrix $\mathbf{B}_1 = \sum_s \sigma_s(\xi) \nabla^2 \ell_s(x)$ and its Schur complement $\mathbf{B}_2 = \mathbf{H}(x, \bar{\xi}) - \mathbf{I}(\bar{\xi}) J_{\bar{\ell}}(x) \mathbf{B}_1^{-1} J_{\bar{\ell}}(x)^\top \mathbf{I}(\bar{\xi})$. In particular, for $(x, \bar{\xi}) = (x^*, \bar{\xi}^*)$ a critical point of $\bar{F}$, we have $\mathbf{H}(x^*, \bar{\xi}^*) = \mathbf{0}$ and $\mathbf{B}_2 \preceq \mathbf{0}$, from which we deduce that $(x^*, \bar{\xi}^*)$ must be a saddle point.

4 Barygradient flows

4.1 Barygradient min-max flow

Now let us generalize the gradient flow ordinary differential equation (ODE) by letting $\lambda \to 0$ in our generalized PPA.

Definition 11.

Let $F(x, q) := q^\top \ell(x)$. We define the barygradient min-max flow ODE as

$$\dot{\zeta}(t) = -\begin{pmatrix} I_m & 0 \\ 0 & -I_S \end{pmatrix} \nabla F\left((\nabla f)^{-1}\left(\zeta(t) - \log\Big(\sum_s e^{\xi_s(t) - 1}\Big) \begin{pmatrix} \mathbf{0}_m \\ \mathbf{1}_S \end{pmatrix}\right)\right) + \gamma(t) \begin{pmatrix} \mathbf{0}_m \\ \mathbf{1}_S \end{pmatrix},$$

where $\zeta = (x, \xi) : \mathbb{R}_+ \to \mathbb{R}^m \times \mathbb{R}^S$ and

$$\gamma(t) := \frac{\sum_s [\dot{\xi}_s(t) - \ell_s(x(t))] e^{\xi_s(t) - 1}}{\sum_s e^{\xi_s(t) - 1}} = q(t)^\top \left[\dot{\xi}(t) - \ell(x(t))\right],$$

with $q(t) := \sigma(\xi(t))$.

We point out that

$$\begin{pmatrix} I_m & 0 \\ 0 & -I_S \end{pmatrix} \nabla F\left((\nabla f)^{-1}\left(\zeta(t) - \log\Big(\sum_s e^{\xi_s(t) - 1}\Big) \begin{pmatrix} \mathbf{0}_m \\ \mathbf{1}_S \end{pmatrix}\right)\right) = A(x(t), q(t)).$$

Monotonicity analysis

Contrary to the classic gradient flow, the function $F(x(t), q(t))$ is not necessarily nonincreasing along the flow. Indeed,

$$\frac{d}{dt} F(x(t), q(t)) = \frac{d}{dt}\left[(\nabla h)^{-1}\left(\xi(t) - \log\Big(\sum_s e^{\xi_s(t) - 1}\Big) \mathbf{1}_S\right)\right]^\top \ell(x(t)) + q(t)^\top \frac{d}{dt} \ell(x(t)),$$

where

$$\frac{d}{dt}\left[(\nabla h)^{-1}\left(\xi(t) - \log\Big(\sum_s e^{\xi_s(t) - 1}\Big) \mathbf{1}_S\right)\right] = [\nabla^2 h(q(t))]^{-1} \dot{\xi}(t) - \frac{\sum_s \dot{\xi}_s(t) e^{\xi_s(t) - 1}}{\sum_s e^{\xi_s(t) - 1}} q(t)$$

and $\frac{d}{dt} \ell(x(t)) = J_\ell(x(t)) \dot{x}(t)$. Hence,

$$\frac{d}{dt} F(x(t), q(t)) = \underbrace{\ell(x(t))^\top [\nabla^2 h(q(t))]^{-1} \ell(x(t)) - F(x(t), q(t))^2}_{\mathrm{Var}_{\tau \sim q(t)}(\ell_\tau(x(t)))} - \|J_\ell(x(t))^\top q(t)\|^2,$$

which is not necessarily nonpositive.
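This derivative formula can be verified numerically. The sketch below (our toy example: $m = 1$, $S = 2$, $\ell_1(x) = (x-1)^2$, $\ell_2(x) = (x+1)^2$) takes one explicit Euler step of the flow in the gauge $\gamma(t) = 0$ (self-consistent with Definition 11, since then $\dot{\xi} = \ell(x)$ and $q^\top[\dot{\xi} - \ell(x)] = 0$) and compares a finite-difference $\frac{d}{dt} F$ with $\mathrm{Var}_{\tau \sim q}(\ell_\tau(x)) - \|J_\ell(x)^\top q\|^2$; the value is positive here, illustrating that $F$ can indeed increase along the min-max flow:

```python
import math

def losses(x): return [(x - 1.0) ** 2, (x + 1.0) ** 2]
def grads(x):  return [2.0 * (x - 1.0), 2.0 * (x + 1.0)]

def softmax(xi):
    mx = max(xi)
    e = [math.exp(v - mx) for v in xi]
    return [v / sum(e) for v in e]

x, xi, dt = 0.3, [0.4, -0.2], 1e-5
q = softmax(xi)
ell = losses(x)
F = sum(qs * ls for qs, ls in zip(q, ell))
var = sum(qs * ls * ls for qs, ls in zip(q, ell)) - F * F  # Var_{tau~q}(l_tau(x))
g = sum(qs * gs for qs, gs in zip(q, grads(x)))            # barygradient (m = 1)
predicted = var - g * g                                    # claimed dF/dt

# one Euler step of the min-max flow (gauge gamma = 0): xdot = -g, xidot = +l(x)
x2 = x + dt * (-g)
xi2 = [v + dt * ls for v, ls in zip(xi, ell)]
q2 = softmax(xi2)
F2 = sum(qs * ls for qs, ls in zip(q2, losses(x2)))
fd = (F2 - F) / dt  # finite-difference approximation of dF/dt
```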

Entropy dynamics

Denote $\chi(t) := h(q(t))$. Then,

$$\dot{\chi}(t) = \dot{q}(t)^\top \nabla h(q(t)) = \left\{[\nabla^2 h(q(t))]^{-1} \dot{\xi}(t) - \left[q(t)^\top \dot{\xi}(t)\right] q(t)\right\}^\top \left\{\xi(t) - \log\Big(\sum_s e^{\xi_s(t) - 1}\Big) \mathbf{1}_S\right\} = \xi(t)^\top \underbrace{\left[\mathrm{Diag}(q(t)) - q(t) q(t)^\top\right]}_{\mathrm{Cov}(q(t))} \ell(x(t)),$$

where $\mathrm{Cov}(q(t))$ denotes the covariance matrix of the categorical distribution $q(t)$.

Remark 3.

The barygradient flow can be equivalently rewritten as the following pseudo-Riemannian flow:

$$\dot{\zeta}(t) = -\begin{pmatrix} I_m & 0 \\ 0 & -\mathrm{Cov}(q(t))^\dagger \end{pmatrix} \nabla \tilde{F}(\zeta(t)) + \left[\gamma(t) + \frac{\mathbf{1}_S^\top \ell(x(t))}{S}\right] \begin{pmatrix} \mathbf{0}_m \\ \mathbf{1}_S \end{pmatrix},$$

where $\dagger$ denotes the Moore-Penrose pseudoinverse (see Meyer [1973]) and

$$\tilde{F}(x, \xi) := \sigma(\xi)^\top \ell(x).$$

4.2 Barygradient min-min flow

Similarly, we define the barygradient min-min flow as follows.

Definition 12.

Let $F(x, q) := q^\top \ell(x)$. We define the barygradient min-min flow ODE as

$$\begin{aligned}
\dot{\zeta}(t) &= -\nabla F\left((\nabla f)^{-1}\left(\zeta(t) - \log\Big(\sum_s e^{\xi_s(t) - 1}\Big) \begin{pmatrix} \mathbf{0}_m \\ \mathbf{1}_S \end{pmatrix}\right)\right) + \gamma(t) \begin{pmatrix} \mathbf{0}_m \\ \mathbf{1}_S \end{pmatrix} \\
&= -\begin{pmatrix} I_m & 0 \\ 0 & \mathrm{Cov}(q(t))^\dagger \end{pmatrix} \nabla \tilde{F}(\zeta(t)) + \left[\gamma(t) - \frac{\mathbf{1}_S^\top \ell(x(t))}{S}\right] \begin{pmatrix} \mathbf{0}_m \\ \mathbf{1}_S \end{pmatrix},
\end{aligned}$$

where

$$\gamma(t) := q(t)^\top \left[\dot{\xi}(t) + \ell(x(t))\right].$$

As in the min-max case, we can study the dynamics of $F(x(t), q(t))$ and of the negentropy $h(q(t))$. Indeed, we have:

$$\frac{d}{dt} F(x(t), q(t)) = -\underbrace{\mathrm{Var}_{\tau \sim q(t)}(\ell_\tau(x(t)))}_{\ell(x(t))^\top \mathrm{Cov}(q(t)) \ell(x(t))} - \|J_\ell(x(t))^\top q(t)\|^2 \leq 0,$$

which shows that $F(x(t), q(t))$ is nonincreasing. For the negentropy we have:

$$\frac{d}{dt} h(q(t)) = -\xi(t)^\top \mathrm{Cov}(q(t)) \ell(x(t)).$$
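The monotonicity of $F$ along the min-min flow is easy to observe numerically. Below, a small explicit-Euler simulation (our toy example again: $\ell_1(x) = (x-1)^2$, $\ell_2(x) = (x+1)^2$) in the gauge $\gamma(t) = 0$, for which $\dot{x} = -J_\ell(x)^\top q$ and $\dot{\xi} = -\ell(x)$; the recorded values of $F(x(t), q(t))$ decrease along the trajectory:

```python
import math

def losses(x): return [(x - 1.0) ** 2, (x + 1.0) ** 2]
def grads(x):  return [2.0 * (x - 1.0), 2.0 * (x + 1.0)]

def softmax(xi):
    mx = max(xi)
    e = [math.exp(v - mx) for v in xi]
    return [v / sum(e) for v in e]

x, xi, eta = 0.5, [0.0, 0.0], 0.01
history = []
for _ in range(500):
    q = softmax(xi)
    ell = losses(x)
    history.append(sum(qs * ls for qs, ls in zip(q, ell)))  # F(x(t), q(t))
    x -= eta * sum(qs * gs for qs, gs in zip(q, grads(x)))  # xdot = -J^T q
    xi = [v - eta * ls for v, ls in zip(xi, ell)]           # xidot = -l(x)
```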

Acknowledgments

The author thanks Adil Salim and Massil Achab for helpful discussions on convex analysis.

References

- Achab [2022]: Mastane Achab. Checkered regression. 2022.
- Achab [2023]: Mastane Achab. Beyond log-concavity: Theory and algorithm for sum-log-concave optimization. arXiv preprint arXiv:2309.15298, 2023.
- Amari [1998]: Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
- Amari [2012]: Shun-ichi Amari. Differential-Geometrical Methods in Statistics, volume 28. Springer Science & Business Media, 2012.
- Amari [2016]: Shun-ichi Amari. Information Geometry and Its Applications, volume 194. Springer, 2016.
- Bauschke and Combettes [2011]: Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.
- Bauschke et al. [2003]: Heinz H. Bauschke, Jonathan M. Borwein, and Patrick L. Combettes. Bregman monotone optimization algorithms. SIAM Journal on Control and Optimization, 42(2):596-636, 2003.
- Borwein et al. [2011]: Jonathan M. Borwein, Simeon Reich, and Shoham Sabach. A characterization of Bregman firmly nonexpansive operators using a new monotonicity concept. Journal of Nonlinear and Convex Analysis, 12:161-183, 2011.
- Boumal [2023]: Nicolas Boumal. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press, 2023.
- Boyd and Vandenberghe [2004]: Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- Brohé and Tossings [2000]: Myrana Brohé and Patricia Tossings. Perturbed proximal point algorithm with nonquadratic kernel. Serdica Mathematical Journal, 26:177-206, 2000.
- Cesa-Bianchi and Lugosi [2006]: Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- Eckstein [1993]: Jonathan Eckstein. Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming. Mathematics of Operations Research, 18(1):202-226, 1993.
- Kullback and Leibler [1951]: Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.
- Meyer [1973]: Carl D. Meyer, Jr. Generalized inversion of modified matrices. SIAM Journal on Applied Mathematics, 24(3):315-323, 1973.
- Rao [1992]: C. Radhakrishna Rao. Information and the accuracy attainable in the estimation of statistical parameters. In Breakthroughs in Statistics: Foundations and Basic Theory, pages 235-247. Springer, 1992.
- Rockafellar [1976]: R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877-898, 1976.
- Udriste [2013]: Constantin Udriste. Convex Functions and Optimization Methods on Riemannian Manifolds, volume 297. Springer Science & Business Media, 2013.
- Weisstein [2003]: Eric W. Weisstein. Christoffel symbol of the second kind. https://mathworld.wolfram.com/, 2003.
