Solving Football by Exploiting Equilibrium Structure of 2p0s Differential Games with One-Sided Information
Source: https://arxiv.org/html/2502.00560
Mukesh Ghimire (Arizona State University, Tempe, AZ 85281, USA; mghimire@asu.edu), Lei Zhang (Purdue University, West Lafayette, IN 47907, USA; zhan5814@purdue.edu), Zhe Xu (Arizona State University, Tempe, AZ 85281, USA; xzhe1@asu.edu), Yi Ren (Arizona State University, Tempe, AZ 85281, USA; yiren@asu.edu)

Corresponding author.

Abstract
For a two-player imperfect-information extensive-form game (IIEFG) with 𝐾 time steps and a player action space of size 𝑈, the game-tree complexity is 𝑈^{2𝐾}, causing existing IIEFG solvers to struggle with large or infinite (𝑈, 𝐾), e.g., differential games with continuous action spaces. To partially address this scalability challenge, we focus on an important class of 2p0s games where the informed player (P1) knows the payoff while the uninformed player (P2) only has a belief over the set of 𝐼 possible payoffs. Such games encompass a wide range of scenarios in sports, defense, cybersecurity, and finance. We prove that under mild conditions, P1's (resp. P2's) equilibrium strategy at any infostate concentrates on at most 𝐼 (resp. 𝐼+1) action prototypes. When 𝐼 ≪ 𝑈, this equilibrium structure causes the game-tree complexity to collapse to 𝐼^𝐾 for P1 when P2 plays pure best responses, and (𝐼+1)^𝐾 for P2 in a dual game where P1 plays pure best responses. We then show that exploiting this structure in standard learning modes, i.e., model-free multiagent reinforcement learning and model predictive control, is straightforward, leading to significant improvements in learning accuracy and efficiency over SOTA IIEFG solvers. Our demonstration solves a 22-player football game (𝐾 = 10, 𝑈 = ∞) where the attacking team has to strategically conceal their intention until a critical moment in order to exploit their information advantage. Code is available here.
1 Introduction
The strength of game solvers has grown rapidly in the last decade, beating elite-level human players in Chess (Silver et al., 2017a), Go (Silver et al., 2017b), Poker (Brown & Sandholm, 2019; Brown et al., 2020b), Diplomacy (FAIR† et al., 2022), and Stratego (Perolat et al., 2022), among other games of increasing complexity. However, most existing solvers with proven convergence, whether based on regret matching (Tammelin, 2014; Burch et al., 2014; Moravčík et al., 2017; Brown et al., 2020b; Lanctot et al., 2009) or on gradient descent-ascent (McMahan, 2011; Perolat et al., 2021; Sokota et al., 2022; Cen et al., 2021; Vieillard et al., 2020), have computational complexities that increase with the size of the finite action set, and suffer from game-tree complexities growing exponentially with both the action size 𝑈 and the tree depth 𝐾. Real-world games, however, often have continuous actions and unfold in continuous time and state spaces, making them differential in nature. Directly applying existing solvers to differential games would require either insightful action-state-time abstraction or enormous compute. Neither is readily available.
Figure 1: (a) An IIEFG with 𝑈 actions per player per infostate and 𝐾 time steps has a game-tree complexity of 𝑈^{2𝐾}. For 2p0s1 with 𝐼 payoff types, deterministic dynamics, and Isaacs' condition, we show that the NE is 𝐼-atomic for P1 and (𝐼+1)-atomic for P2, leading to a game-tree complexity of 𝐼^𝐾 for P1 in the primal game where P2 plays best responses, and (𝐼+1)^𝐾 for P2 in the dual game where P1 plays best responses. (b) American football with 22 players and continuous action spaces (𝑈 = ∞), with 𝐾 = 10 time steps. P1 (red) attacks with two private game types (𝐼 = 2): the running back (RB) power-runs through the space created by blockers, or the quarterback (QB) throws the ball to the leading wide receiver (WR). See animation. (c) At NE, P1 conceals the type until 0.5 sec., similar to the reported 1.0 sec. Due to the significant tree-size reduction, the game can be solved in 30 minutes.
In this paper, we address this scalability challenge for an important subset of 2p0s differential games where the informed player (P1) knows the payoff type while the uninformed player (P2) only holds a public belief 𝑝0 ∈ Δ(𝐼) over the finite set of 𝐼 possible types. At the beginning of the game, nature draws a game type according to 𝑝0 and informs P1 of the type. As the game progresses, the public belief 𝑝 about the true game type is updated from 𝑝0 via Bayes' rule, based on the action sequence taken by P1 and his strategy. P1's (resp. P2's) goal is to minimize (resp. maximize) the expected payoff over 𝑝0. Due to the zero-sum nature, P1 may need to delay information release or manipulate P2's belief to take full advantage of the information asymmetry. While restricted, such games represent a wide range of attack-defense scenarios, including football set-pieces where the attacker has private information about which play is to be executed, and missile defense where multiple potential targets are concerned. The setting of one-sided information, i.e., P1 knows everything about P2, is necessary for P2 to derive defense strategies in risk-sensitive games. We call this focused set of games "2p0s1".
We claim the following contributions:
• We prove two unique Nash equilibrium (NE) structures for 2p0s1: (1) the equilibrium behavioral strategy for P1 (resp. P2) is 𝐼-atomic (resp. (𝐼+1)-atomic) on their continuous action space, and (2) the equilibrium strategies for P1 and P2 can be computed via separated primal and dual reformulations of the game. Together, these structures collapse the game-tree complexity to at most 𝐼^𝐾 for P1 and (𝐼+1)^𝐾 for P2. In comparison, solving the same game through the lens of IIEFG would have a game-tree complexity of 𝑈^{2𝐾}, with a discretized action size of 𝑈 (Fig. 1a).
• We demonstrate how this structural knowledge can significantly accelerate game solving: (1) for value and policy approximation settings where the ground-truth NE is available, we achieve qualitative improvements in solution accuracy and efficiency over SOTA normal- and behavioral-form solvers (CFR+, MMD, Deep-CFR, JPSPG, PPO, R-NaD); (2) we further demonstrate the practical value of the equilibrium structure of 2p0s1 by solving an American football setting where the attacking team needs to strategically conceal their true intention between "RB power push" and "QB throw". While this IIEFG has a complexity of 10^{440} even with a coarse mesh of 10 values per action dimension, the atomic structure of its NE makes the game solvable in less than 30 minutes on an M1 Pro MacBook. See Fig. 1(b,c).
2 Related Work

2p0s games with incomplete information. Games where players have missing information only about the game type are called incomplete-information games. These are a special case of imperfect-information games in which nature only plays a chance move at the beginning (Harsanyi, 1967). The seminal work of Aumann et al. (1995) developed equilibrium strategies for repeated games with one-sided incomplete information through the "Cav u" theorem, which reveals that belief-manipulating behavioral strategies optimize the value through value convexification. Building on top of this, De Meyer (1996) introduced a dual game reformulation in which the behavioral strategy of P2 becomes Markov. This technique helped Cardaliaguet (2007) and Ghimire et al. (2024) establish the value existence proof for 2p0s differential games with incomplete information. Unlike repeated games, where belief manipulation occurs only in the first round, differential games may have multiple critical time-state-belief points at which belief manipulation is required to achieve equilibrium, depending on the specifications of system dynamics, payoffs, and state constraints (Ghimire et al., 2024). Our paper builds on top of Cardaliaguet (2007) and Ghimire et al. (2024), providing a thorough analysis of the convexification nature of behavioral strategies and, as a consequence, their atomic structure.
IIEFGs. IIEFGs are partially observable stochastic games (POSGs) with finite horizons. Significant efforts have been devoted to approximating equilibria of large IIEFGs with finite 𝑈 (Koller & Megiddo, 1992; Billings et al., 2003; Gilpin & Sandholm, 2006; Gilpin et al., 2007; Sandholm, 2010; Brown & Sandholm, 2019; Brown et al., 2020a; Perolat et al., 2022; Schmid et al., 2023). Regret matching algorithms (Zinkevich et al., 2007; Lanctot et al., 2009; Abernethy et al., 2011; Brown et al., 2019; Tammelin, 2014; Johanson et al., 2012) have computational complexity 𝒪(𝑈𝜀^{−2}) to reach an 𝜀-Nash equilibrium, while gradient-based solvers (McMahan, 2011; Perolat et al., 2021; Sokota et al., 2022) have 𝒪(ln(𝑈)𝜀^{−1}ln(𝜀^{−1})) to reach an 𝜀-QRE. All these complexities increase with 𝑈, and convergence guarantees exist only if the NE lies in the interior of the simplex Δ(𝑈). Critically, this latter assumption does not hold for 2p0s1, as we explain in Sec. 4. For games with continuous action spaces, existing pseudo-gradient based approaches (Martin & Sandholm, 2023, 2024) lack convergence guarantees and perform poorly in our case studies in both normal- and extensive-form settings.
Multigrid for value approximation. Since solving 2p0s1 essentially requires solving a Hamilton-Jacobi PDE, we briefly review multigrid methods for accelerating PDE solving (Trottenberg et al., 2000). In a typical V-cycle solver (Braess & Hackbusch, 1983), a brief fine-mesh solve is first performed; the residual is then restricted to a coarser mesh, where a correction is solved and prolongated back to the fine mesh. Essentially, multigrid uses coarse solves to cheaply reduce the low-frequency approximation errors in the PDE solution, leaving only high-frequency errors to be resolved on the fine mesh. Multigrid has been successfully applied to solving Hamilton-Jacobi PDEs for optimal control and differential games (Han & Wan, 2013), although characterizing its computational complexity for such nonlinear PDEs remains to be done. For value approximation, we extend multigrid to the Hamilton-Jacobi PDEs underlying 2p0s1.
3 Problem Statement
We denote by Δ ( 𝐼 ) the simplex in ℝ 𝐼 , [ 𝐼 ] := { 1 , … , 𝐼 } , 𝑎 [ 𝑖 ] the 𝑖 th element of vector 𝑎 , ∇ 𝑝 𝑉 and ∂ 𝑝 𝑉 the respective gradient and subgradient of function 𝑉 with respect to 𝑝 . ∥ ⋅ ∥ 2 is the 𝑙 2 -norm and ‖ 𝑣 ‖ 𝐴 := 𝑣 𝑇 𝐴 𝑣 . Consider a time-invariant dynamical system that defines the evolution of the joint state 𝑥 ∈ 𝒳 ⊆ ℝ 𝑑 𝑥 of P1 and P2 with control inputs 𝑢 ∈ 𝒰 ⊆ ℝ 𝑑 𝑢 and 𝑣 ∈ 𝒱 ⊆ ℝ 𝑑 𝑣 , respectively:
𝑥̇(𝑡) = 𝑓(𝑥(𝑡), 𝑢, 𝑣). (1)
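The solvers discussed later operate on a time-discretized version of Eq. 1 (the forward-Euler step 𝑥′ = 𝑥 + 𝜏𝑓(𝑥, 𝑢, 𝑣) used in the primal DP of Sec. 4). A minimal rollout sketch; the single-integrator 𝑓 in the example is a hypothetical stand-in, not one of the paper's systems:

```python
# Forward-Euler rollout of the joint dynamics x_dot = f(x, u, v) (Eq. 1).

def rollout(x0, controls, tau, f):
    """Integrate the joint state under a sequence of (u, v) control pairs."""
    x = list(x0)
    for u, v in controls:
        dx = f(x, u, v)
        x = [xi + tau * dxi for xi, dxi in zip(x, dx)]  # x' = x + tau * f(x, u, v)
    return x

# Hypothetical example: each player drives their own 1-D position, f(x, u, v) = (u, v).
f = lambda x, u, v: [u, v]
x_T = rollout([0.0, 0.0], [(1.0, -1.0)] * 4, 0.25, f)   # four steps of tau = 0.25
assert x_T == [1.0, -1.0]
```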
The initial belief of the players is set to nature's distribution over the game type. Denote by {ℋ𝑟^𝑖}^𝐼 the joint sets of 𝐼 behavioral strategies of P1, and by 𝒵𝑟 the set of behavioral strategies of P2. The subscript 𝑟 indicates the random nature of behavioral strategies: at any infostate (𝑡, 𝑥, 𝑝) ∈ [0, 𝑇] × 𝒳 × Δ(𝐼) and with type 𝑖, P1 draws an action based on 𝜂𝑖 ∈ ℋ𝑟^𝑖, which is a probability measure over 𝒰 parameterized by (𝑡, 𝑥, 𝑝). P2's strategy 𝜁 ∈ 𝒵𝑟 is a probability measure over 𝒱 independent of 𝑖. Denote by 𝑋_{𝑡1}^{𝑡0,𝑥0,𝜂𝑖,𝜁} the random state reached at 𝑡1 ∈ (𝑡0, 𝑇] from (𝑡0, 𝑥0) following (𝜂𝑖, 𝜁) and the system dynamics in Eq. 1. With mild abuse of notation, let (𝜂𝑖(𝑡), 𝜁(𝑡)) denote the random controls at time 𝑡 induced by (𝜂𝑖, 𝜁). P1 accumulates a running cost 𝑙𝑖(𝜂𝑖(𝑡), 𝜁(𝑡)) during the game and receives a terminal cost 𝑔𝑖(𝑋_𝑇^{𝑡0,𝑥0,𝜂𝑖,𝜁}). Together, a type-𝑖 P1 minimizes:

𝐽𝑖(𝑡0, 𝑥0; 𝜂𝑖, 𝜁) := 𝔼_{𝜂𝑖,𝜁} [ 𝑔𝑖(𝑋_𝑇^{𝑡0,𝑥0,𝜂𝑖,𝜁}) + ∫_{𝑡0}^{𝑇} 𝑙𝑖(𝜂𝑖(𝑠), 𝜁(𝑠)) 𝑑𝑠 ],

while P2 maximizes 𝐽(𝑡0, 𝑥0, 𝑝0; {𝜂𝑖}, 𝜁) = 𝔼_{𝑖∼𝑝0}[𝐽𝑖]. We say the game has a value 𝑉 if and only if the upper value 𝑉⁺(𝑡0, 𝑥0, 𝑝0) = inf_{{𝜂𝑖}} sup_{𝜁} 𝐽 and the lower value 𝑉⁻(𝑡0, 𝑥0, 𝑝0) = sup_{𝜁} inf_{{𝜂𝑖}} 𝐽 are equal: 𝑉 = 𝑉⁺ = 𝑉⁻. The game is proven to have a value under the following sufficient conditions (Cardaliaguet, 2007):
A1. 𝒰 ⊆ ℝ^{𝑑𝑢} and 𝒱 ⊆ ℝ^{𝑑𝑣} are compact and finite-dimensional.
A2. 𝑓: 𝒳 × 𝒰 × 𝒱 → 𝒳 is 𝐶¹ and has bounded value and first-order derivatives.
A3. 𝑔𝑖: 𝒳 → ℝ and 𝑙𝑖: 𝒰 × 𝒱 → ℝ are bounded and Lipschitz continuous.
A4. Isaacs' condition holds for the Hamiltonian 𝐻: 𝒳 × ℝ^{𝑑𝑥} → ℝ:

𝐻(𝑥, 𝜉) := min_{𝑢∈𝒰} max_{𝑣∈𝒱} 𝑓(𝑥, 𝑢, 𝑣)^⊤𝜉 + 𝑙𝑖(𝑢, 𝑣) = max_{𝑣∈𝒱} min_{𝑢∈𝒰} 𝑓(𝑥, 𝑢, 𝑣)^⊤𝜉 + 𝑙𝑖(𝑢, 𝑣). (2)
Isaacs’ condition allows any complete-information versions of this game to have pure Nash equilibria, including nonrevealing games where neither player knows the actual game type.
A5. Both players have full knowledge of 𝑓, {𝑔𝑖}_{𝑖=1}^{𝐼}, {𝑙𝑖}_{𝑖=1}^{𝐼}, and 𝑝0. Control inputs and states are fully observable. Players have perfect recall.

Our goal is to compute a Nash equilibrium (NE) ({𝜂𝑖}†, 𝜁†) that attains 𝑉, given the game 𝐺 = {𝒳, (𝒰, 𝒱), ({𝑔𝑖}, {𝑙𝑖}), 𝑓, 𝑇}.
4 A Primal-Dual Reformulation of the Game
In this section, we introduce discrete-time primal and dual reformulations of 𝐺 , denoted by 𝐺 𝜏 and 𝐺 𝜏 ∗ , respectively, for which dynamic programming principles (DP) exist. We show that P1’s equilibrium behavioral strategy { 𝜂 𝑖 , 𝜏 † } in 𝐺 𝜏 is 𝐼 -atomic, i.e., 𝜂 𝑖 , 𝜏 † concentrates on at most 𝐼 actions in 𝒰 , and P2’s strategy 𝜁 𝜏 † in 𝐺 𝜏 ∗ is ( 𝐼 + 1 ) -atomic. Then we show that ( { 𝜂 𝑖 , 𝜏 † } , 𝜁 𝜏 † ) approaches the Nash equilibrium of the differential game 𝐺 as the time interval 𝜏 → 0 + .
The primal game 𝐺 𝜏 .
𝐺 𝜏 is a discrete-time Stackelberg version of 𝐺 where P2 plays a pure best response after P1 announces his next action. Let the value of 𝐺 𝜏 be 𝑉 𝜏 . 𝑉 𝜏 satisfies the following DP for ( 𝑡 , 𝑥 , 𝑝 ) ∈ [ 0 , 𝑇 ] × 𝒳 × Δ ( 𝐼 ) :
𝑉𝜏(𝑡, 𝑥, 𝑝) = min_{{𝜂𝑖}∈{ℋ𝑟}^𝐼} 𝔼_{𝑖∼𝑝, 𝑢∼𝜂𝑖} [ max_{𝑣∈𝒱} 𝑉𝜏(𝑡+𝜏, 𝑥′(𝑢,𝑣), 𝑝′(𝑢)) + 𝜏𝑙𝑖(𝑢,𝑣) ], (3)

with the terminal boundary condition 𝑉𝜏(𝑇, 𝑥, 𝑝) = ∑𝑖 𝑝[𝑖]𝑔𝑖(𝑥). For small enough 𝜏, 𝑥′(𝑢,𝑣) = 𝑥 + 𝜏𝑓(𝑥,𝑢,𝑣). 𝑝′(𝑢) is the Bayes update of the public belief after P1 announces 𝑢: 𝑝′(𝑢)[𝑖] = 𝜂𝑖(𝑢; 𝑡,𝑥,𝑝)𝑝[𝑖]/𝜂̄(𝑢; 𝑡,𝑥,𝑝), where 𝜂̄(𝑢; 𝑡,𝑥,𝑝) = ∑_{𝑖∈[𝐼]} 𝜂𝑖(𝑢; 𝑡,𝑥,𝑝)𝑝[𝑖] is the marginal over 𝒰 across types. Note that P2's equilibrium behavioral strategy cannot be derived from Eq. 3.
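The Bayes update of the public belief can be sketched in a few lines; this is a minimal illustration (the names bayes_update and eta_u are ours, not from the paper's codebase), where the marginal is taken with respect to the current belief so that the posterior normalizes:

```python
# Posterior public belief after P1 announces action u (Bayes' rule in the primal DP).

def bayes_update(p, eta_u):
    """p: prior belief over the I types (sums to 1).
    eta_u: eta_u[i] = probability that a type-i P1 plays u, i.e. eta_i(u; t, x, p)."""
    marginal = sum(eta_i * p_i for eta_i, p_i in zip(eta_u, p))  # eta_bar(u; t, x, p)
    if marginal == 0.0:
        return list(p)  # u has zero probability under the strategy; keep the prior
    return [eta_i * p_i / marginal for eta_i, p_i in zip(eta_u, p)]

p = [0.5, 0.5]
# If both types play u with equal probability, the belief does not move (non-revealing):
assert bayes_update(p, [0.3, 0.3]) == [0.5, 0.5]
# If only type 1 ever plays u, observing u fully reveals the type:
assert bayes_update(p, [0.6, 0.0]) == [1.0, 0.0]
```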
The dual game 𝐺 𝜏 ∗ .
For P2's strategy, we need a separate DP in which P2 announces his next action and P1 best-responds to it. We do so by first introducing the convex conjugate 𝑉* of the value:
𝑉*(𝑡0, 𝑥0, 𝑝̂0) := max_{𝑝} 𝑝^{𝑇}𝑝̂0 − 𝑉(𝑡0, 𝑥0, 𝑝) = max_{𝑝} 𝑝^{𝑇}𝑝̂0 − sup_{𝜁∈𝒵𝑟} inf_{{𝜂𝑖}∈{ℋ𝑟}^𝐼} 𝔼_{𝜂𝑖,𝜁,𝑖}[𝐽𝑖] (4)
= max_{𝑝} inf_{𝜁∈𝒵𝑟} sup_{{𝜂𝑖}∈{ℋ𝑟}^𝐼} 𝑝^{𝑇}𝑝̂0 − 𝔼_{𝜂𝑖,𝜁,𝑖}[𝐽𝑖] = inf_{𝜁∈𝒵𝑟} sup_{𝜂∈ℋ} max_{𝑖∈[𝐼]} { 𝑝̂0[𝑖] − 𝔼𝜁[𝐽𝑖] }.
Eq. 4 describes a dual game 𝐺 ∗ with complete information, where the strategy space of P1 becomes ℋ × [ 𝐼 ] , i.e., the game type is now chosen by P1 rather than the nature. We prove in App. A that P2’s equilibrium in the dual game is also an equilibrium in the primal game if 𝑝 ^ 0 ∈ ∂ 𝑝 𝑉 ( 𝑡 0 , 𝑥 0 , 𝑝 0 ) , and such 𝑝 ^ 0 [ 𝑖 ] represents the loss of type- 𝑖 P1 should he play best responses to P2’s equilibrium strategy. Therefore 𝑝 ^ 0 [ 𝑖 ] − 𝔼 𝜁 [ 𝐽 𝑖 ] measures P2’s risk, and P2’s equilibrium strategy minimizes the worst-case risk across all game types. We now introduce a discrete-time version of the dual game 𝐺 𝜏 ∗ where P1 plays a pure best response after P2 announces their action at each time step. Let the value of 𝐺 𝜏 ∗ be 𝑉 𝜏 ∗ , the dual DP is:
𝑉𝜏*(𝑡, 𝑥, 𝑝̂) = min_{𝜁, 𝑝̂′(⋅)} 𝔼_{𝑣∼𝜁} [ max_{𝑢∈𝒰} 𝑉𝜏*(𝑡+𝜏, 𝑥′(𝑢,𝑣), 𝑝̂′(𝑣) − 𝜏𝑙(𝑢,𝑣)) ], (5)

with the terminal boundary condition 𝑉𝜏*(𝑇, 𝑥, 𝑝̂) = max_{𝑖∈[𝐼]} { 𝑝̂[𝑖] − 𝑔𝑖(𝑥) }. Here 𝑝̂′(⋅): 𝒱 → ℝ^{𝐼} is constrained by 𝔼_{𝑣∼𝜁}[𝑝̂′(𝑣)] = 𝑝̂ (similar to the martingale nature of 𝑝 in 𝐺𝜏), and 𝑙(𝑢,𝑣)[𝑖] = 𝑙𝑖(𝑢,𝑣).
Equilibrium strategies of 𝐺 𝜏 and 𝐺 𝜏 ∗ are atomic.
Our first theoretical result is Thm. 4.1, which states that P1’s strategy that solves 𝐺 𝜏 is 𝐼 -atomic, and P2’s strategy that solves 𝐺 𝜏 ∗ is ( 𝐼 + 1 ) -atomic:
Theorem 4.1.
The RHS of Eq. 3 can be reformulated as

min_{{𝑢𝑘},{𝛼𝑖𝑘}} max_{{𝑣𝑘}} ∑_{𝑘=1}^{𝐼} 𝜆𝑘 ( 𝑉(𝑡+𝜏, 𝑥+𝜏𝑓(𝑥,𝑢𝑘,𝑣𝑘), 𝑝𝑘) + 𝜏𝔼_{𝑖∼𝑝𝑘}[𝑙𝑖(𝑢𝑘,𝑣𝑘)] ) (P1)

s.t. 𝑢𝑘 ∈ 𝒰, 𝑣𝑘 ∈ 𝒱, 𝛼𝑖𝑘 ∈ [0,1], ∑_{𝑘=1}^{𝐼} 𝛼𝑖𝑘 = 1, 𝜆𝑘 = ∑_{𝑖=1}^{𝐼} 𝛼𝑖𝑘𝑝[𝑖], 𝑝𝑘[𝑖] = 𝛼𝑖𝑘𝑝[𝑖]/𝜆𝑘, ∀𝑖, 𝑘 ∈ [𝐼],

i.e., 𝜂𝑖,𝜏† concentrates on the actions {𝑢𝑘}_{𝑘=1}^{𝐼} for 𝑖 ∈ [𝐼]. The RHS of Eq. 5 can be reformulated as

min_{{𝑣𝑘},{𝜆𝑘},{𝑝̂𝑘}} max_{{𝑢𝑘}} ∑_{𝑘=1}^{𝐼+1} 𝜆𝑘 𝑉*(𝑡+𝜏, 𝑥+𝜏𝑓(𝑥,𝑢𝑘,𝑣𝑘), 𝑝̂𝑘 − 𝜏𝑙(𝑢𝑘,𝑣𝑘)) (P2)

s.t. 𝑢𝑘 ∈ 𝒰, 𝑣𝑘 ∈ 𝒱, 𝜆𝑘 ∈ [0,1], ∑_{𝑘=1}^{𝐼+1} 𝜆𝑘𝑝̂𝑘 = 𝑝̂, ∑_{𝑘=1}^{𝐼+1} 𝜆𝑘 = 1, 𝑘 ∈ [𝐼+1].
Proof sketch.
(1) Using Isaacs' condition, we show that P2's best response in Eq. 3 is implicitly governed by P1's action 𝑢, and 𝑢 is in turn governed by the posterior belief 𝑝′. (2) With this insight, we can rewrite Eq. 3 as 𝑉𝜏(𝑡, 𝑥, 𝑝) = min_{𝜈} ∫_{Δ(𝐼)} 𝑉̃𝜏(𝑡, 𝑥, 𝑝′) 𝜈(𝑑𝑝′), where we control a pushforward density 𝜈(𝑑𝑝′) for the posterior belief to be 𝑝′. 𝜈 is subject to ∫_{Δ(𝐼)} 𝑝′ 𝜈(𝑑𝑝′) = 𝑝. For each 𝑝′, (𝑢(𝑝′), 𝑣(𝑝′)) is found by solving 𝑉̃𝜏(𝑡, 𝑥, 𝑝′) = min_{𝑢∈𝒰} max_{𝑣∈𝒱} 𝑉𝜏(𝑡+𝜏, 𝑥′, 𝑝′) + 𝜏𝔼_{𝑖∼𝑝′}[𝑙𝑖]. Here 𝑉̃𝜏(𝑡, 𝑥, 𝑝′) is P1's loss should both players play pure strategies at (𝑡, 𝑥, 𝑝′) within [𝑡, 𝑡+𝜏), and its existence is guaranteed by Isaacs' condition. Since playing pure strategies does not change the public belief, we call 𝑉̃𝜏 the non-revealing value. (3) We can then show that min_{𝜈} ∫_{Δ(𝐼)} 𝑉̃𝜏(𝑝′) 𝜈(𝑑𝑝′) is a convexification of 𝑉̃𝜏 on Δ(𝐼). Since convexification requires at most 𝐼 vertices in Δ(𝐼), 𝜈 is at most 𝐼-atomic. Since 𝜈 determines 𝜂𝑖,𝜏†, the latter is also 𝐼-atomic. (4) A similar argument can be made for 𝐺𝜏*, in which case the convexification is with respect to the dual variable 𝑝̂. Since 𝑝̂ is defined on ℝ^{𝐼} rather than constrained to Δ(𝐼), 𝜁𝜏† is at most (𝐼+1)-atomic. Proof in App. B.
Figure 2: Value convexification causes the NE to be atomic (an example).
To support the proof sketch, we use Fig. 2 to illustrate why an equilibrium behavioral strategy is atomic. Here 𝐼 = 2, so the value is defined on Δ(2) for fixed (𝑡, 𝑥). To solve min_{𝜈} ∫_{Δ(2)} 𝑉̃𝜏(𝑝′) 𝜈(𝑑𝑝′), we scan the non-revealing value 𝑉̃𝜏 across Δ(2). One notices that if 𝑉̃𝜏 is not convex in 𝑝, it is always possible for P1 to achieve a lower loss by convexifying 𝑉̃𝜏 through a mixed strategy, leading to P1. In the figure, P1 identifies [𝜆𝑎, 𝜆𝑏]^⊤ ∈ Δ(2) and {𝑝𝑎, 𝑝𝑏} such that 𝜆𝑎𝑝𝑎 + 𝜆𝑏𝑝𝑏 = 𝑝. Solving the non-revealing pure NE (𝑢𝑘, 𝑣𝑘) for each 𝑘 ∈ {𝑎, 𝑏}, P1's mixed strategy that convexifies 𝑉̃𝜏 is to play action 𝑢𝑘 with probability 𝛼𝑖𝑘 = 𝑝𝑘[𝑖]𝜆𝑘/𝑝[𝑖] if he is type-𝑖. By announcing this strategy, the public belief shifts to 𝑝𝑘 via Bayes' rule when P2 observes action 𝑢𝑘. As a result, P1 receives 𝑉(𝑝) = 𝜆𝑎𝑉̃𝜏(𝑝𝑎) + 𝜆𝑏𝑉̃𝜏(𝑝𝑏) on the convex hull of 𝑉̃𝜏(𝑝).
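The convexification step can be sketched numerically for 𝐼 = 2, where Δ(2) reduces to a scalar 𝑝 ∈ [0, 1]. The non-revealing value below is a made-up concave example (so information should be revealed), not a value from the paper, and the brute-force grid search stands in for the actual minimax solve:

```python
# Convexify a non-revealing value at belief p by searching over two-point splits
# {p_a, p_b} with weights (lam_a, lam_b) satisfying lam_a*p_a + lam_b*p_b = p.

def convexify(V_nr, p, grid):
    """Return (value, (p_a, p_b), (lam_a, lam_b)) of the best split of p."""
    best = (V_nr(p), (p, p), (1.0, 0.0))          # no split: play non-revealing
    for pa in grid:
        for pb in grid:
            if pa < p < pb:
                lam_a = (pb - p) / (pb - pa)      # martingale weights
                lam_b = 1.0 - lam_a
                val = lam_a * V_nr(pa) + lam_b * V_nr(pb)
                if val < best[0]:
                    best = (val, (pa, pb), (lam_a, lam_b))
    return best

V_nr = lambda p: 1.0 - 4.0 * (p - 0.5) ** 2       # hypothetical concave bump
grid = [i / 100 for i in range(101)]
val, (pa, pb), (la, lb) = convexify(V_nr, 0.5, grid)
assert val < V_nr(0.5)                            # splitting strictly lowers P1's loss
assert abs(la * pa + lb * pb - 0.5) < 1e-12       # posteriors average back to p
```

If 𝑉̃ is already convex in 𝑝, the search finds no profitable split and the strategy stays pure (non-revealing), matching the discussion above.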
Remarks.
(1) The atomic structure of the equilibrium strategies was first discovered for 2p0s repeated games with one-sided information (Aumann et al., 1995; De Meyer, 1996). Our new contribution is in explaining its presence in differential games and in demonstrating its significant utility in improving the effectiveness of multiagent reinforcement learning and model predictive control schemes. (2) P1's actions in 2p0s1 simultaneously advance the system state and achieve signaling. The deterministic dynamics allow precise belief control (e.g., (𝑝𝑎, 𝑝𝑏) in Fig. 2) for value convexification. For games with stochastic dynamics/observations, however, P1 will not be able to precisely control the belief, and the convex Bellman operators (P1 and P2) become lower bounds of the value. Finding tight upper-bounding operators that preserve atomic NEs is left for future work.
( { 𝜂 𝑖 , 𝜏 † } , 𝜁 𝜏 † ) approaches NE of 𝐺 .
Next we present Thm. 4.2, which proves that ({𝜂𝑖,𝜏†}, 𝜁𝜏†) computed from P1 and P2 approaches the equilibrium of 𝐺 when 𝜏 is sufficiently small. The theorem uses P1's loss in 𝐺 when they play the NE of 𝐺𝜏: 𝑉1(𝑡, 𝑥, 𝑝) := max_{𝜁∈𝒵𝑟} 𝐽(𝑡, 𝑥, 𝑝; {𝜂𝑖,𝜏†}, 𝜁), and P2's loss in 𝐺* when they play the NE of 𝐺𝜏*: 𝑉1*(𝑡, 𝑥, 𝑝̂) := max_{{𝜂𝑖}∈{ℋ𝑟}^𝐼} 𝐽*(𝑡, 𝑥, 𝑝̂; {𝜂𝑖}, 𝜁𝜏†).
Theorem 4.2.
If A1–A5 hold, there exists 𝐶 > 0 such that 𝑉1(𝑡, 𝑥, 𝑝) − 𝑉(𝑡, 𝑥, 𝑝) ∈ [0, 𝐶(𝑇−𝑡)𝜏] for any (𝑡, 𝑥, 𝑝) ∈ [0, 𝑇] × 𝒳 × Δ(𝐼), and 𝑉1*(𝑡, 𝑥, 𝑝̂) − 𝑉*(𝑡, 𝑥, 𝑝̂) ∈ [0, 𝐶(𝑇−𝑡)𝜏] for any (𝑡, 𝑥, 𝑝̂) ∈ [0, 𝑇] × 𝒳 × ℝ^{𝐼}.
Proof sketch.
𝑉 ≤ 𝑉1 because P1's NE in 𝐺𝜏 is not an NE in 𝐺 and 𝑉 is unique. Compared with 𝐺, P2 has an advantage in 𝐺𝜏 since he plays best responses to the actions to be played by P1; thus 𝑉1 ≤ 𝑉𝜏. Now we just need to show max_{𝑥,𝑝} |𝑉(𝑡, 𝑥, 𝑝) − 𝑉𝜏(𝑡, 𝑥, 𝑝)| ≤ 𝐶(𝑇−𝑡)𝜏. This is done through a recursion that leverages (1) the consistency property of the Bellman backup (denoted by 𝑇𝜏[𝑉], P1): |𝑉(𝑡, 𝑥, 𝑝) − 𝑇𝜏[𝑉](𝑡+𝜏, 𝑥, 𝑝)| ≤ 𝐶𝜏², i.e., backing up the value via P1 causes an error quadratic in 𝜏, and (2) the boundary condition 𝑉(𝑇, 𝑥, 𝑝) = 𝑉𝜏(𝑇, 𝑥, 𝑝). The consistency property is derived from the fact that values of first-order Hamilton-Jacobi equations are Lipschitz and 𝜔-semiconcave under A1–A5. The same technique can be applied to 𝐺𝜏*. Full proof in App. C. We note that Cardaliaguet (2009) provided a recursion sketch similar to ours without explaining the quadratic error or the consistency property.
With Thms. 4.1 and 4.2, and given that 𝐺 is differential (𝜏 → 0⁺), we have now proved that NEs of 𝐺 are atomic and can be approximated via Eqs. P1 and P2 with sufficiently small 𝜏.
5 Incorporating the Atomic Structure into MARL and Control
We study the efficacy of atomic NEs when applied to (1) value approximation, (2) model-free MARL, and (3) control settings. For (1), we approximate value and strategies in the entire [ 0 , 𝑇 ] × 𝒳 × Δ ( 𝐼 ) through P 1 and P 2 , and introduce multigrid to further reduce the compute. We compare with discrete-action solvers CFR+, MMD, CFR-BR, and continuous-action solver JPSPG. For (2) and (3), we solve NE strategies for a fixed initial ( 𝑡 0 , 𝑥 0 , 𝑝 0 ) . We compare with discrete-action solvers MMD, PPO, and R-NaD. We call the proposed continuous-action mixed-strategy solvers CAMS, CAMS-DRL, and CAMS-MPC for value approximation, MARL, and control, respectively.
5.1 Value approximation

CAMS for 𝐺𝜏. We discretize time as {𝑘𝜏}_{𝑘=0}^{𝐾}, 𝜏 = 𝑇/𝐾. Let 𝒮 = {(𝑥, 𝑝)𝑖}_{𝑖∈[|𝒮|]} be a collocation set. We solve P1 starting from 𝑡 = (𝐾−1)𝜏 at all collocation points in 𝒮. The resultant nonconvex-nonconcave minimax problems have size (𝒪(𝐼(𝐼+𝑑𝑢)), 𝒪(𝐼𝑑𝑣)) and are solved by DS-GDA (Zheng et al., 2023), which guarantees sublinear convergence on nonconvex-nonconcave problems. To generalize the value and strategies across 𝒳 × Δ(𝐼), a value network is trained on the minimax solutions and used to formulate the next round of minimax problems at 𝑡 − 𝜏. 𝐺𝜏* is solved similarly.
Computational challenge.
From Thm. 4.2, a large 𝐾 (small 𝜏) is required for strategies derived from P1 and P2 to be good approximations of the NE. Yet suppressing the value prediction error at 𝑡 = 0 requires a computational complexity exponential in 𝐾. Specifically, let 𝑉̂0(𝑥, 𝑝): 𝒳 × Δ(𝐼) → ℝ be the trained value network at 𝑡 = 0; we have the following result (see proof in App. D):
Theorem 5.1.
Given 𝐾, a minimax approximation error 𝜖 > 0, and a prediction error threshold 𝛿 > 0, there exists 𝐶 ≥ 1 such that, with a computational complexity of 𝒪(𝐾³𝐶^{2𝐾}𝐼²𝜖^{−4}𝛿^{−2}), CAMS achieves

max_{(𝑥,𝑝)∈𝒳×Δ(𝐼)} |𝑉̂0(𝑥, 𝑝) − 𝑉(0, 𝑥, 𝑝)| ≤ 𝛿. (6)
A similar result applies to the dual game. Zanette et al. (2019) discussed a linear value approximator that achieves 𝐶 = 1. However, their method requires solving a linear program (LP) for every inference of 𝑉̂𝑡(𝑥, 𝑝) if (𝑥, 𝑝) does not belong to the training set 𝒮. In our context, incorporating their method would require auto-differentiating through the LP solver for each descent and ascent step of the minimax solve, which turned out to be too expensive. To this end, we introduce a multigrid scheme to reduce the cost for games with large 𝐾.
Multigrid.
Since strategies at time 𝑡 are implicitly nonlinear functions of the value at 𝑡 + 𝜏, the Hamilton-Jacobi PDEs underlying P1 and P2 are nonlinear. Our method extends the Full Approximation Scheme (FAS) used for solving nonlinear PDEs (Trottenberg et al., 2000; Henson et al., 2003). A two-grid FAS has four steps: (1) restrict the fine-grid approximation and its residual; (2) solve the coarse-grid problem using the fine-grid residual; (3) compute the coarse-grid correction; (4) prolong the coarse-grid correction to the fine grid and add it to the fine-grid approximation. For conciseness, we focus on the primal problem. Let 𝑉̂𝑡^𝑙 be the value network for time 𝑡 on grid size (time interval) 𝑙. The restriction operator ℛ𝑙 maps from a finer grid with size 𝑙 to a coarser one with size 2𝑙: ℛ𝑙(𝑉̂𝑡^𝑙) = (𝑉̂𝑡^𝑙 + 𝑉̂_{𝑡+𝑙}^𝑙)/2. Similarly, we define the prolongation operator 𝒫2𝑙 as 𝒫2𝑙(𝑉̂𝑡^{2𝑙}) = 𝑉̂𝑡^{2𝑙} if 𝑡 ∈ 𝒯2𝑙 and 𝑉̂_{𝑡+𝑙}^{2𝑙} otherwise, where 𝒯2𝑙 := {𝑛⋅2𝑙: 𝑛 ∈ ℕ0, 𝑛 < 𝑇/2𝑙}. Let 𝕆𝑙(𝑡, 𝑥, 𝑝; 𝑉̂) solve P1 at (𝑡, 𝑥, 𝑝), where 𝑉̂ is the value at 𝑡 + 𝑙, and output an approximation of 𝑉(𝑡, 𝑥, 𝑝). The dataset {(𝑡, 𝑥^{(𝑗)}, 𝑝^{(𝑗)}, 𝕆𝑙(𝑡, 𝑥^{(𝑗)}, 𝑝^{(𝑗)}; 𝑉̂_{𝑡+𝑙}^𝑙))} is used to train 𝑉̂𝑡^𝑙(⋅, ⋅). Let 𝑟𝑡^𝑙(𝑥, 𝑝) = 𝑉̂𝑡^𝑙(𝑥, 𝑝) − 𝕆𝑙(𝑡, 𝑥, 𝑝; 𝑉̂_{𝑡+𝑙}^𝑙) be the residual. To achieve 𝑟𝑡^𝑙(𝑥, 𝑝) ≈ 0 for all (𝑡, 𝑥, 𝑝) ∈ 𝒯𝑙 × 𝒳 × Δ(𝐼), we restrict the fine-grid approximations and residuals to the coarse grid and solve for the corrections. Let 𝑒𝑡^{2𝑙}(𝑥, 𝑝) be the correction on grid 2𝑙 at (𝑡, 𝑥, 𝑝). The coarse-grid problem is

ℛ𝑙𝑟𝑡^𝑙 (residual) = [𝕆2𝑙(𝑡, 𝑥, 𝑝; ℛ𝑙𝑉̂_{𝑡+2𝑙}^𝑙 + 𝑒_{𝑡+2𝑙}^{2𝑙}) − (ℛ𝑙𝑉̂𝑡^𝑙 + 𝑒𝑡^{2𝑙}(𝑥, 𝑝))] (coarse-grid eq. w/ corrections) − [𝕆2𝑙(ℛ𝑙𝑉̂_{𝑡+2𝑙}^𝑙) − ℛ𝑙𝑉̂𝑡^𝑙] (coarse-grid eq. w/o corrections), (7)

where 𝑒𝑡^{2𝑙}(𝑥, 𝑝) is computed backward from 𝑇 − 2𝑙 using 𝑒𝑇^{2𝑙} = 0:

𝑒𝑡^{2𝑙}(𝑥, 𝑝) = 𝕆2𝑙(𝑡, 𝑥, 𝑝; ℛ𝑙𝑉̂_{𝑡+2𝑙}^𝑙 + 𝑒_{𝑡+2𝑙}^{2𝑙}) − 𝕆2𝑙(ℛ𝑙𝑉̂_{𝑡+2𝑙}^𝑙) − ℛ𝑙𝑟𝑡^𝑙. (8)

This correction ensures consistency: if 𝑉̂𝑡^𝑙 = 𝑉(𝑡, ⋅, ⋅) for all 𝑡 ∈ 𝒯𝑙, then 𝑒𝑡^{2𝑙}(⋅, ⋅) = 0 for all 𝑡 ∈ 𝒯2𝑙. The coarse-grid corrections are prolonged to the fine grid to update the fine-grid value approximation. Note that from Eq. 8, computing the coarse correction in our case requires two separate minimax calls with similar loss formulations. We further accelerate the multigrid solver by warm-starting these minimax problems using the recorded minimax solutions from the fine grid (obtained during the residual computation).
5.2 Model-free multiagent reinforcement learning
Recent studies (Rudolph et al., 2025; Sokota et al., 2022) showed that policy gradient methods for MARL, such as PPO and MMD, can effectively solve IIEFGs. We show that the atomic structure can be directly applied to this unified model-free framework with minimal code changes, while yielding significant solution improvement for 2p0s1. For CAMS-DRL, the policy network of P1 takes in the infostate ( 𝑡 , 𝑥 , 𝑝 ) and outputs 𝐼 logit vectors ℓ 𝑘 ∈ ℝ 𝐼 and 𝐼 action prototypes 𝜇 𝑘 ∈ 𝒰 for 𝑘 ∈ [ 𝐼 ] . Each logit vector ℓ 𝑘 is transformed via softmax to define the behavioral strategy of type- 𝑖 P1, i.e., the probability 𝜂 𝑖 ( 𝜇 𝑘 ; 𝑡 , 𝑥 , 𝑝 ) of choosing action 𝜇 𝑘 . The policy network for P2 takes in ( 𝑡 , 𝑥 , 𝑝 ) and outputs a single logit ℓ ∈ ℝ 𝐼 + 1 and action prototypes 𝜇 𝑘 for 𝑘 ∈ [ 𝐼 + 1 ] . We directly solve the NE of these policy models using PPO and MMD.
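The output heads described above can be sketched in numpy; the linear map W is a hypothetical placeholder for P1's policy network (the paper does not specify this architecture detail), and only the output shapes (𝐼 logit vectors plus 𝐼 action prototypes, with a softmax per type) follow the text:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

I, d_u = 2, 2                                      # number of types, action dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(I * I + I * d_u, 1 + 2 + I))  # hypothetical linear "network"; x is 2-D here

def p1_policy(t, x, p):
    """Map infostate (t, x, p) to I action prototypes and, per type i, a softmax
    distribution over those prototypes (an I-atomic behavioral strategy)."""
    feat = np.concatenate(([t], x, p))             # infostate features
    out = W @ feat
    logits = out[: I * I].reshape(I, I)            # row i: logits of type-i P1
    prototypes = out[I * I:].reshape(I, d_u)       # I candidate actions in U
    probs = np.stack([softmax(row) for row in logits])
    return prototypes, probs

prototypes, probs = p1_policy(0.0, np.zeros(2), np.array([0.5, 0.5]))
assert prototypes.shape == (I, d_u)
assert np.allclose(probs.sum(axis=1), 1.0)         # each type's strategy is a distribution
```

P2's head is analogous with a single logit vector in ℝ^{𝐼+1} and 𝐼+1 prototypes; in CAMS-DRL these heads would be the outputs of a trained network rather than a fixed random map.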
5.3 Model predictive control
When the dynamics 𝑓 are known and an initial state (𝑡0, 𝑥0, 𝑝0) is given, we can formulate the primal (resp. dual) minimax objective as the sum over 𝐼^𝐾 (resp. (𝐼+1)^𝐾) game-tree paths (Fig. 1a). CAMS-MPC builds the computational graph for the entire tree, where the policy networks follow CAMS-DRL, and applies a minimax solver (e.g., DS-GDA) to this differentiable loss. This method is feasible for small 𝐼 thanks to the atomic structure of NEs. Since P1 in 2p0s1 games reveals their payoff mid-game, the game tree further collapses after information revelation. E.g., Hexner's game (see below) has a proven game-tree complexity of 𝐼, where P1 plays a fixed nonrevealing strategy before splitting into 𝐼 type-dependent revealing strategies. Due to this collapse, modeling strategies using neural networks turns out to be more effective for convergence to NEs than infostate-wise parameterization.
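The path count behind the MPC objective is easy to verify: with an 𝐼-atomic strategy at each of 𝐾 steps, every root-to-leaf path of the primal tree is a length-𝐾 sequence of atom indices. A sketch (path weights, which would come from the 𝜆𝑘's, are omitted):

```python
from itertools import product

I, K = 2, 10
paths = list(product(range(I), repeat=K))   # one atom index per time step
assert len(paths) == I ** K == 1024         # primal tree
assert len(list(product(range(I + 1), repeat=K))) == (I + 1) ** K == 59049  # dual tree
```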
6 Empirical Validation

Hexner's game.
We introduce Hexner's game (Hexner, 1979; Ghimire et al., 2024) to compare CAMS variants with baselines on solution quality and computational cost. This game has an analytical NE (see proof in App. F.1). The dynamics are 𝑥̇𝑗 = 𝐴𝑗𝑥𝑗 + 𝐵𝑗𝑢𝑗 for 𝑗 ∈ [2], where 𝑥𝑗 ∈ 𝒳𝑗, 𝑢𝑗 ∈ 𝒰𝑗, and 𝐴𝑗 and 𝐵𝑗 are known matrices. The target state of P1 is 𝑧𝜃, where 𝜃 is drawn with distribution 𝑝0 from Θ, |Θ| = 𝐼, and 𝑧 ∈ ℝ^{𝑑𝑥} is fixed and public. The expected payoff to P1 is:

𝐽({𝜂𝑖}, 𝜁) = 𝔼_{𝑖∼𝑝0} [ ∫_{0}^{𝑇} ‖𝜂𝑖(𝑡)‖²_{𝑅1} − ‖𝜁(𝑡)‖²_{𝑅2} 𝑑𝑡 + ‖𝑥1(𝑇) − 𝑧𝜃𝑖‖²_{𝐾1} − ‖𝑥2(𝑇) − 𝑧𝜃𝑖‖²_{𝐾2} ],

where 𝑅1, 𝑅2, 𝐾1, 𝐾2 ≻ 0 are control- and state-penalty matrices. The goal of P1 is to get closer to the target 𝑧𝜃 than P2. To exploit his information advantage, P1 needs to decide when to home in on, and thereby reveal, the target. Analytical NE: there exists a critical time 𝑡𝑟 := 𝑡𝑟(𝑇, {𝐴𝑗}, {𝐵𝑗}, {𝑅𝑗}, {𝐾𝑗}). If 𝑡𝑟 ∈ (0, 𝑇), P1 moves towards 𝔼[𝜃] as if he does not know the actual target until 𝑡𝑟, when he fully reveals the target, i.e., value convexification happens at 𝑡𝑟. If 𝑡𝑟 ≤ 0, P1 homes in on the actual target from 𝑡 = 0. P2's NE mirrors P1's. See proof in App. F.1.
Figure 3: (a-c) Comparisons b/w CAMS, JPSPG, CFR+, MMD, and CFR-BR-Primal on the 1-stage Hexner's game. (d) Comparison b/w CAMS, JPSPG, and DeepCFR on the 4-stage Hexner's game w/ similar compute.

6.1 Effect of atomic NE on value approximation

𝑈-invariant convergence of CAMS.
We use a normal-form Hexner's game with 𝜏 = 𝑇 and a fixed initial state 𝑥0 ∈ 𝒳 to demonstrate that the baseline algorithms suffer from increasing discrete action sizes while CAMS does not. The baselines are SOTA normal-form solvers including CFR+ (Tammelin, 2014), MMD (Sokota et al., 2022), JPSPG (Martin & Sandholm, 2024), and a modified CFR-BR (Johanson et al., 2012) (dubbed CFR-BR-Primal), where we focus on converging P1's strategy and only compute P2's best responses. Among these, only JPSPG naturally handles continuous action spaces. All baselines except JPSPG are standard implementations in OpenSpiel (Lanctot et al., 2019). The normal-form primal game has a trivial ground-truth strategy where P1 goes directly to his target. For visualization, we use 𝑑𝑥 = 4 (position and velocity in 2D). For baselines except JPSPG, we use discrete action sets defined by 4 grid sizes so that 𝑈 = |𝒰𝑗| ∈ {16, 36, 64, 144}. All algorithms terminate when a threshold on 𝑁𝑎𝑠ℎ𝐶𝑜𝑛𝑣(𝜋𝑖) = max_{𝜋𝑖′} 𝑉𝑖(𝜋𝑖′) − 𝑉𝑖(𝜋𝑖) is met. For conciseness, we only consider solving P1's strategy and thus use P1's NashConv. We set the threshold to 10^{−3} for baselines and use a more stringent threshold of 10^{−5} for CAMS. We then use DeepCFR and JPSPG as baselines for a 4-stage game where 𝑇 = 1 and 𝜏 = 0.25. DeepCFR was run for 1000 CFR iterations with 10 traversals for 𝑈 = 9, and for 100 iterations with 5 traversals for 𝑈 = 16. The wall-time costs for game solving are 17 hours using CAMS (baseline), 24 hours for JPSPG, and 29 hours (𝑈 = 9) and 34 hours (𝑈 = 16) using DeepCFR, all on an A100 GPU. More details on experiment settings can be found in App. G.3. Furthermore, JPSPG was run for 2⋅10^8 iterations, where each iteration consists of solving a game with a random initial state and type and performing a strategy update. More details in App. G.4.
Fig. 3 summarizes the comparisons with baselines. For the normal-form game, we compare both computational cost and the expected action error $\varepsilon$ from the ground-truth action of P1: $\varepsilon(x_0) := \mathbb{E}_{i \sim p_0}\big[\sum_{k=1}^{|\mathcal{U}|} \alpha_k^i \|u_k - u_i^*(x_0)\|_2\big]$, where $u_i^*(x_0)$ is the ground truth for type $i$ at $x_0$. In the 4-stage game, we compare the expected action errors at each time step: $\bar\varepsilon_t := \mathbb{E}_{x_t \sim \pi}[\varepsilon(x_t)]$, where $\pi$ is the strategy learned by the respective learning method. For each strategy, we estimate $\{\bar\varepsilon_t\}_{t=1}^{4}$ by generating 100 trajectories with initial states uniformly sampled from $\mathcal{X}$. In terms of computational cost, all baselines (except JPSPG) have complexity and wall-time costs increasing with $U$, while CAMS is invariant to $U$. With similar or less compute, CAMS achieves significantly better strategies than DeepCFR and JPSPG in the 4-stage game. Sample trajectories for the 4-stage game are visualized in App. G. Fig. 4 compares the ground-truth vs. approximated NEs for 10-stage games with different initial states. While approximation errors exist, CAMS successfully learns the target-concealing behavior of P1. Averaging over 50 trajectories derived from CAMS, P1 conceals the target until $t_r = 0.60\,s \pm 0.06\,s$ (compared to the ground-truth $t_r = 0.5\,s$).
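The expected action error $\varepsilon(x_0)$ above can be sketched directly from its definition; here `alpha[i, k]` is type-$i$ P1's probability on action prototype `U[k]` and `u_star[i]` is the type-$i$ ground-truth action (shapes and the squared-distance choice are illustrative assumptions):

```python
import numpy as np

def expected_action_error(alpha, U, u_star, p0):
    """eps(x0) = E_{i~p0}[ sum_k alpha[i,k] * ||u_k - u*_i||^2 ].
    alpha: (I, K) per-type action distribution over prototypes U: (K, d)."""
    I, K = alpha.shape
    err = 0.0
    for i in range(I):
        dists = np.sum((U - u_star[i]) ** 2, axis=1)  # ||u_k - u*_i||^2
        err += p0[i] * (alpha[i] @ dists)             # weight by type prob.
    return err
```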
Table 1: Runtime comparison of CAMS with and without multigrid

| time steps | no multigrid | multigrid ↓ |
|---|---|---|
| 4 | 9.3 hrs | 2.3 hrs |
| 10 | 27.6 hrs | 10.9 hrs |
| 16 | 46.2 hrs | 17.8 hrs |

CAMS scalability with multigrid.
We solve Hexner’s games using CAMS with and without multigrid to demonstrate the scalability of our approach and the effect of multigrid. We report the runtime on one H100 GPU for 4-, 10-, and 16-stage games in Tab. 1. We run the 2-level multigrid on the 4- and 10-stage games, and 4-level multigrid on the 16-stage game (see Alg. 1 for pseudo code). We report resulting trajectories in App. E.
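The coarse-to-fine warm-starting idea behind the multigrid runs can be sketched as below; `solve_level` is a hypothetical per-resolution solver (the actual multigrid levels and transfer operators follow Alg. 1 in the paper, which this does not reproduce):

```python
def multigrid_solve(solve_level, levels=(4, 10, 16)):
    """Coarse-to-fine warm starting over time resolutions (a sketch).
    solve_level(K, init): hypothetical solver for a K-stage game that
    optionally warm-starts from a coarser strategy."""
    strategy = None
    for K in levels:
        # refine: solve at resolution K, warm-started from the coarse level
        strategy = solve_level(K, init=strategy)
    return strategy
```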
Figure 4: (a) Hexner's game schematics: one goal is selected out of two possible goals, Goal-1 and Goal-2, and communicated to P1. (b-e) Sample trajectories for the primal game (b-c), where P1 plays Nash and P2 plays best response, and the primal-dual game (d-e), where both players play Nash. Dotted lines are ground-truth Nash. Color shades indicate the evolution of the public belief (Pr[Goal is 1]). The filled magenta circle represents the true goal. Initial position pairs are marked with the same markers.

6.2 Effect of atomic NE on model-free MARL
For model-free MARL, we compare CAMS-DRL with MMD, PPO, and R-NaD with discrete actions of size 100, formed by pairing each of ten linearly spaced $x$-direction accelerations in $(-1, 1)$ with ten $y$-direction accelerations in $(-4, 4)$. We test policy convergence in the normal-form game by comparing P1's learned policy every 8192 steps (corresponding to 1 iteration in Fig. 5a) against the ground-truth policy. MMD, PPO, and R-NaD are implemented as in Rudolph et al. (2025). Results in Fig. 5, which uses the same comparison metric as outlined in Sec. 6.1, show that CAMS-DRL approximates NEs accurately, while PPO and MMD fail completely. R-NaD, on the other hand, performs better than MMD and PPO with little to no hyperparameter tuning. This contradicts observations in Rudolph et al. (2025). We conjecture that PG methods such as PPO and MMD have difficulty converging to non-interior solutions, which is the case in 2p0s1 due to the atomic structure of NEs. We performed the same comparison for a 4-stage game, with an action space of size 100 composed by pairing linearly spaced $x$- and $y$-direction accelerations, all in $(-12, 12)$. Results in Fig. 5b show significant improvement in solution accuracy when the atomic structure is exploited.
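One way to form the discrete action grid described above is a Cartesian product of linearly spaced accelerations; whether the endpoints of the stated intervals are included is an assumption here (we include them):

```python
import numpy as np

# 10 x 10 grid pairing x-accelerations in (-1, 1) with
# y-accelerations in (-4, 4); endpoints included (an assumption).
ax = np.linspace(-1.0, 1.0, 10)
ay = np.linspace(-4.0, 4.0, 10)
actions = np.array([(a, b) for a in ax for b in ay])  # shape (100, 2)
```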
Figure 5: Comparisons b/w CAMS-DRL and standard PG methods on (a) 1-stage and (b) 4-stage games.

6.3 Effect of atomic NE on MPC

Hexner's game.
With known dynamics and $I = 2$, a 10-stage Hexner's (primal) game has at most $2^{10}$ rollouts, making exact policy gradient computation feasible. Using DS-GDA, convergence to the ground-truth NE takes only 5 minutes on an M1 Pro MacBook. See results in App. H.
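The descent-ascent idea can be illustrated with plain simultaneous gradient descent-ascent on a toy minimax objective; this is a simplified stand-in for DS-GDA (Zheng et al., 2023), which additionally uses doubly smoothed updates that this sketch omits:

```python
def gda(grad_x, grad_y, x, y, lr=0.05, steps=2000):
    """Plain simultaneous gradient descent-ascent on min_x max_y f(x, y)."""
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x = x - lr * gx   # descent step for the minimizer
        y = y + lr * gy   # ascent step for the maximizer
    return x, y

# toy game: min_x max_y  x^2 + 2*x*y - y^2, saddle point at (0, 0)
x, y = gda(lambda x, y: 2 * x + 2 * y,   # df/dx
           lambda x, y: 2 * x - 2 * y,   # df/dy
           x=1.0, y=-1.0)
```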
Football game.
The independence of the game-tree complexity from the action space allows us to solve games beyond what IIEFG solvers can afford. Here we model an 11-vs-11 American Football play as a short, two-team pursuit-evasion game in a 2D plane. The horizontal axis represents "downfield" progress and the vertical axis is sideline-to-sideline. Each player follows double-integrator kinematics and chooses acceleration at discrete time steps. Physical contact is captured by a smooth "merge" weight that grows as opponents approach, softly blending their velocities and attenuating their ability to apply new acceleration, so tackles emerge continuously rather than through hard impulses. The offense has two types ($I = 2$): an "RB power push" that prefers the RB to advance straight upfield, and a "QB throw" that rewards whichever offensive player gets furthest downfield. With loose bounds on velocity and acceleration, the soft tackle dynamics is control-affine, allowing Isaacs' condition to hold. Together with differentiable rollouts, the game settings satisfy A1-5, leading to the atomic structure of the NE. With $K = 10$ and known differentiable dynamics, it is then feasible to solve P1's (and then P2's) strategy by applying DS-GDA directly to the minimax objective constituting all $I^K$ (resp. $(I+1)^K$) rollouts. See App. I for detailed game settings. The convergence takes approximately 30 minutes on an M1 Pro MacBook. The resulting plays are summarized in Fig. 1b,c and animated in the GitHub repo. Qualitatively, the results resemble real football tactics: the offense either tries to push through the defense (aka inside zone play), or goes out wide by faking a move (aka waggle play), while the defense, in response, either closes in on the ball carrier or stays back to guard. Importantly, our results also show that the offense conceals its play selection for 0.5 seconds, which is comparable to coaching analyses that estimate roughly a one-second window before the play becomes clearer (Grabowski, 2020).
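The soft "merge" dynamics can be sketched as below for one player; the sigmoid shape, contact radius, and blending coefficients are assumptions for illustration, not the paper's App. I values:

```python
import numpy as np

def merged_step(x, v, u, opp_x, opp_v, dt=0.1, radius=1.0):
    """One Euler step with a soft-tackle merge (illustrative sketch).
    A smooth weight w in (0, 1) grows as the opponent approaches,
    blending velocities and attenuating new acceleration, so contact
    is continuous rather than impulsive."""
    d = np.linalg.norm(x - opp_x)
    w = 1.0 / (1.0 + np.exp((d - radius) / 0.2))   # sigmoid in distance
    v_new = (1 - w) * v + w * 0.5 * (v + opp_v)    # blend velocities
    v_new = v_new + (1 - w) * dt * u               # attenuated control
    return x + dt * v_new, v_new
```

Far from any opponent ($w \approx 0$) this reduces to the free double integrator; at contact ($w \approx 1$) the pair shares an averaged velocity and new acceleration has little effect.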
7 Conclusion
Unlike general IIEFGs where NE strategies are distributions over the entire action space, we showed that 2p0s1 games enjoy a much simpler NE structure when P1 can precisely control the public belief, leading to exponentially reduced game-tree complexity. We demonstrated the utilities of this NE structure in solving games with continuous action spaces in model-free and model-based modes, in terms of computational cost and solution quality. Our methods enable fast approximation of deceptive and counter-deception team strategies, e.g., in sports, missile/drone defense, and risk-sensitive robotics applications, tailored for specific team dynamics, action spaces, and task specifications.
References Abernethy et al. (2011) Jacob Abernethy, Peter L Bartlett, and Elad Hazan.Blackwell approachability and no-regret learning are equivalent.In Proceedings of the 24th Annual Conference on Learning Theory, pp. 27–46. JMLR Workshop and Conference Proceedings, 2011. Amos et al. (2017) Brandon Amos, Lei Xu, and J. Zico Kolter.Input convex neural networks.In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 146–155. PMLR, 2017. Aumann et al. (1995) Robert J Aumann, Michael Maschler, and Richard E Stearns.Repeated games with incomplete information.MIT press, 1995. Bardi et al. (1997) Martino Bardi, Italo Capuzzo Dolcetta, et al.Optimal control and viscosity solutions of Hamilton-Jacobi-Bellman equations, volume 12.Springer, 1997. Billings et al. (2003) Darse Billings, Neil Burch, Aaron Davidson, Robert Holte, Jonathan Schaeffer, Terence Schauenberg, and Duane Szafron.Approximating game-theoretic optimal strategies for full-scale poker.In IJCAI, volume 3, pp. 661, 2003. Braess & Hackbusch (1983) Dietrich Braess and Wolfgang Hackbusch.A new convergence proof for the multigrid method including the v-cycle.SIAM journal on numerical analysis, 20(5):967–975, 1983. Brown & Sandholm (2019) Noam Brown and Tuomas Sandholm.Superhuman ai for multiplayer poker.Science, 365(6456):885–890, 2019. Brown et al. (2019) Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm.Deep counterfactual regret minimization.In International conference on machine learning, pp. 793–802. PMLR, 2019. Brown et al. (2020a) Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong.Combining deep reinforcement learning and search for imperfect-information games.Advances in Neural Information Processing Systems, 33:17057–17069, 2020a. Brown et al. 
(2020b) Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong.Combining deep reinforcement learning and search for imperfect-information games.Advances in Neural Information Processing Systems, 33:17057–17069, 2020b. Burch et al. (2014) Neil Burch, Michael Johanson, and Michael Bowling.Solving imperfect information games using decomposition.In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014. Cannarsa & Sinestrari (2004) Piermarco Cannarsa and Carlo Sinestrari.Semiconcave functions, Hamilton-Jacobi equations, and optimal control, volume 58.Springer Science & Business Media, 2004. Cardaliaguet (2007) Pierre Cardaliaguet.Differential games with asymmetric information.SIAM journal on Control and Optimization, 46(3):816–838, 2007. Cardaliaguet (2009) Pierre Cardaliaguet.Numerical approximation and optimal strategies for differential games with lack of information on one side.Advances in Dynamic Games and Their Applications: Analytical and Numerical Developments, pp. 1–18, 2009. Cen et al. (2021) Shicong Cen, Yuting Wei, and Yuejie Chi.Fast policy extragradient methods for competitive games with entropy regularization.Advances in Neural Information Processing Systems, 34:27952–27964, 2021. De Meyer (1996) Bernard De Meyer.Repeated games, duality and the central limit theorem.Mathematics of Operations Research, 21(1):237–251, 1996. FAIR† et al. (2022) Meta Fundamental AI Research Diplomacy Team FAIR†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al.Human-level play in the game of diplomacy by combining language models with strategic reasoning.Science, 378(6624):1067–1074, 2022. Ghimire et al. 
(2024) Mukesh Ghimire, Lei Zhang, Zhe Xu, and Yi Ren.State-constrained zero-sum differential games with one-sided information.In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 15512–15539. PMLR, 21–27 Jul 2024.URL https://proceedings.mlr.press/v235/ghimire24a.html. Gilpin & Sandholm (2006) Andrew Gilpin and Tuomas Sandholm.Finding equilibria in large sequential games of imperfect information.In Proceedings of the 7th ACM conference on Electronic commerce, pp. 160–169, 2006. Gilpin et al. (2007) Andrew Gilpin, Samid Hoda, Javier Pena, and Tuomas Sandholm.Gradient-based algorithms for finding nash equilibria in extensive form games.In Internet and Network Economics: Third International Workshop, WINE 2007, San Diego, CA, USA, December 12-14, 2007. Proceedings 3, pp. 57–69. Springer, 2007. Grabowski (2020) Keith Grabowski.Wake forest’s slow mesh rpo for explosive plays, Oct 2020.URL https://coachandcoordinator.com/2020/10/wake-forests-slow-mesh-rpo-for-explosive-plays/?utm_source=chatgpt.com. Han & Wan (2013) Dong Han and Justin WL Wan.Multigrid methods for second order hamilton–jacobi–bellman and hamilton–jacobi–bellman–isaacs equations.SIAM Journal on Scientific Computing, 35(5):S323–S344, 2013. Harsanyi (1967) John C Harsanyi.Games with incomplete information played by “bayesian” players, i–iii part i. the basic model.Management science, 14(3):159–182, 1967. Henson et al. (2003) Van Henson et al.Multigrid methods nonlinear problems: an overview.Computational imaging, 5016:36–48, 2003. Hexner (1979) G Hexner.A differential game of incomplete information.Journal of Optimization Theory and Applications, 28:213–232, 1979. Jacot et al. 
(2018) Arthur Jacot, Franck Gabriel, and Clément Hongler.Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018. Johanson et al. (2012) Michael Johanson, Nolan Bard, Neil Burch, and Michael Bowling.Finding optimal abstract strategies in extensive-form games.In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pp. 1371–1379, 2012. Koller & Megiddo (1992) Daphne Koller and Nimrod Megiddo.The complexity of two-person zero-sum games in extensive form.Games and economic behavior, 4(4):528–552, 1992. Lanctot et al. (2009) Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling.Monte carlo sampling for regret minimization in extensive games.Advances in neural information processing systems, 22, 2009. Lanctot et al. (2019) Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Paul Muller, Timo Ewalds, Ryan Faulkner, János Kramár, Bart De Vylder, Brennan Saeta, James Bradbury, David Ding, Sebastian Borgeaud, Matthew Lai, Julian Schrittwieser, Thomas Anthony, Edward Hughes, Ivo Danihelka, and Jonah Ryan-Davis.OpenSpiel: A framework for reinforcement learning in games.CoRR, abs/1908.09453, 2019.URL http://arxiv.org/abs/1908.09453. Martin & Sandholm (2023) Carlos Martin and Tuomas Sandholm.Finding mixed-strategy equilibria of continuous-action games without gradients using randomized policy networks.In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pp. 2844–2852, 2023. Martin & Sandholm (2024) Carlos Martin and Tuomas Sandholm.Joint-perturbation simultaneous pseudo-gradient.arXiv preprint arXiv:2408.09306, 2024. 
McMahan (2011) Brendan McMahan.Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization.In Geoffrey Gordon, David Dunson, and Miroslav Dudík (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pp. 525–533, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.URL https://proceedings.mlr.press/v15/mcmahan11b.html. Moravčík et al. (2017) Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisỳ, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling.Deepstack: Expert-level artificial intelligence in heads-up no-limit poker.Science, 356(6337):508–513, 2017. Perolat et al. (2021) Julien Perolat, Remi Munos, Jean-Baptiste Lespiau, Shayegan Omidshafiei, Mark Rowland, Pedro Ortega, Neil Burch, Thomas Anthony, David Balduzzi, Bart De Vylder, et al.From poincaré recurrence to convergence in imperfect information games: Finding equilibrium via regularization.In International Conference on Machine Learning, pp. 8525–8535. PMLR, 2021. Perolat et al. (2022) Julien Perolat, Bart De Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T Connor, Neil Burch, Thomas Anthony, et al.Mastering the game of stratego with model-free multiagent reinforcement learning.Science, 378(6623):990–996, 2022. Rudolph et al. (2025) Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, and Samuel Sokota.Reevaluating policy gradient methods for imperfect-information games.arXiv preprint arXiv:2502.08938, 2025. Sandholm (2010) Tuomas Sandholm.The state of solving large incomplete-information games, and application to poker.Ai Magazine, 31(4):13–32, 2010. Schmid et al. 
(2023) Martin Schmid, Matej Moravčík, Neil Burch, Rudolf Kadlec, Josh Davidson, Kevin Waugh, Nolan Bard, Finbarr Timbers, Marc Lanctot, G Zacharias Holland, et al.Student of games: A unified learning algorithm for both perfect and imperfect information games.Science Advances, 9(46):eadg3256, 2023. Silver et al. (2017a) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al.Mastering chess and shogi by self-play with a general reinforcement learning algorithm.arXiv preprint arXiv:1712.01815, 2017a. Silver et al. (2017b) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al.Mastering the game of go without human knowledge.nature, 550(7676):354–359, 2017b. Sokota et al. (2022) Samuel Sokota, Ryan D’Orazio, J Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer.A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games.arXiv preprint arXiv:2206.05825, 2022. Tammelin (2014) Oskari Tammelin.Solving large imperfect information games using cfr+.arXiv preprint arXiv:1407.5042, 2014. Trottenberg et al. (2000) Ulrich Trottenberg, Cornelius W Oosterlee, and Anton Schuller.Multigrid.Elsevier, 2000. Vieillard et al. (2020) Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, and Matthieu Geist.Leverage the average: an analysis of kl regularization in reinforcement learning.Advances in Neural Information Processing Systems, 33:12163–12174, 2020. Zanette et al. (2019) Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill.Limiting extrapolation in linear approximate value iteration.Advances in Neural Information Processing Systems, 32, 2019. Zheng et al. 
(2023) Taoli Zheng, Linglingzhi Zhu, Anthony Man-Cho So, José Blanchet, and Jiajin Li. Universal gradient descent ascent method for nonconvex-nonconcave minimax optimization. Advances in Neural Information Processing Systems, 36:54075–54110, 2023. Zinkevich et al. (2007) Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. Advances in neural information processing systems, 20, 2007.

Appendix A: Connection between Primal and Dual Games
Here we use the infinitely-repeated game setting to explain the connection between the primal and dual games and the interpretation of the dual variable 𝑝 ^ . Please see Theorem 2.2 in De Meyer (1996) and the extension to differential games in Cardaliaguet (2007).
Game setting.
In an infinitely-repeated normal-form game with one-sided information, there is a set of $I$ possible payoff tables $\{A_i\}_{i=1}^{I}$. A table is initially drawn from $p_0$ and shown to P1, while P2 only knows $p_0$. At stage $t \in [T]$, the players draw actions from their strategies $\eta_i$ and $\zeta$, respectively, under the current belief $p_t$. The actions are public, which triggers belief updates, yet the actual payoffs are hidden (otherwise P2 could easily infer the true payoff table). At the terminal time $T$, an average payoff is received by P1 from P2. This game is proven to have a value (Aumann et al., 1995) for finite and infinite $T$. Here we only consider the latter case, where the value is defined as
$$V(p_0) = \max_{\{\eta_i\}_{i=1}^{I}} \min_{\zeta} \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{i \sim p_0}\left[\eta_i^\top A_i \zeta\right]. \tag{9}$$
Notice that this is a special case of the differential games we are interested in, where the incomplete information is in the running loss, the action spaces are discrete, and the dynamics are the identity.
Primal-dual properties.
Let the primal game be $G(p)$ for $p \in \Delta(I)$, the dual game be $G^*(\hat p)$ for $\hat p \in \mathbb{R}^I$, and let $\{\eta_i\}_{i=1}^{I}$ be the set of strategies for P1 and $\zeta$ the strategy for P2, with $\eta_i \in \Delta(d_u)$ and $\zeta \in \Delta(d_v)$. We note that P1's strategy $\{\eta_i\}_{i=1}^{I}$ can also be represented jointly in terms of $\pi := \{\pi_{ij}\}_{I, d_u}$ such that $\sum_{j}^{d_u} \pi_{ij} = p[i]$ and $\eta_i[j] = \pi_{ij}/p[i]$, i.e., nature's distribution is the marginal of $\pi$ and P1's strategy the conditional of $\pi$. Let $G^i_{\eta\zeta}$ be the payoff to P1 of type $i$ for strategy profile $(\eta, \zeta)$. From De Meyer (1996) we have the following results connecting $G(p)$ and $G^*(\hat p)$:

1. If $\pi$ is Nash for P1 in $G(p)$ and $\hat p \in \partial V(p)$, then $\{\eta_i\}_{i=1}^{I}$ is also Nash for P1 in $G^*(\hat p)$.
2. If $\pi$ is Nash for P1 in $G^*(\hat p)$ and $p$ is induced by $\pi$, then $p \in \partial V^*(\hat p)$ and $\pi$ is Nash for P1 in $G(p)$.
3. If $\zeta$ is Nash for P2 in $G^*(\hat p)$ and $p \in \partial V^*(\hat p)$, then $\zeta$ is also Nash for P2 in $G(p)$.
4. If $\zeta$ is Nash for $G(p)$, and we let $\hat p_i := \max_{\eta \in \Delta(d_u)} G^i_{\eta\zeta}$ and $\hat p := [\hat p_1, \ldots, \hat p_I]^\top$, then $p \in \partial V^*(\hat p)$ and $\zeta$ is also Nash for P2 in $G^*(\hat p)$.

From the last two properties we have: if $\zeta$ is Nash for $G(p)$ and $G^*(\hat p)$, then $\hat p[i] = \max_{\eta \in \Delta(d_u)} G^i_{\eta\zeta}$, i.e., $\hat p[i]$ is the payoff to type $i$ if P1 plays a best response for that type against P2's Nash strategy.
Appendix B: Proof of Theorem 4.1

Theorem 4.1 (A splitting reformulation of the primal and dual DPs). The RHS of Eq. 3 can be reformulated as
$$\min_{\{u_k\}, \{\alpha_k^i\}} \max_{\{v_k\}} \; \sum_{k=1}^{I} \lambda_k \Big( V(t+\tau, x_k, p_k) + \tau\, \mathbb{E}_{i \sim p_k}\big[l^i(u_k, v_k)\big] \Big) \tag{P1}$$
$$\text{s.t.}\quad u_k \in \mathcal{U}, \quad x_k = \mathrm{ODE}(x, \tau, u_k, v_k; f), \quad v_k \in \mathcal{V}, \quad \alpha_k^i \in [0, 1], \quad \sum_{k=1}^{I} \alpha_k^i = 1, \quad \lambda_k = \sum_{i=1}^{I} \alpha_k^i\, p[i], \quad p_k[i] = \frac{\alpha_k^i\, p[i]}{\lambda_k}, \quad \forall i, k \in [I].$$
And the RHS of Eq. 5 can be reformulated as
$$\min_{\{v_k\}, \{\lambda_k\}, \{\hat p_k\}} \max_{\{u_k\}} \; \sum_{k=1}^{I+1} \lambda_k\, V^*\big(t+\tau, x_k, \hat p_k - \tau\, l(u_k, v_k)\big) \tag{P2}$$
$$\text{s.t.}\quad u_k \in \mathcal{U}, \quad v_k \in \mathcal{V}, \quad x_k = \mathrm{ODE}(x, \tau, u_k, v_k; f), \quad \lambda_k \in [0, 1], \quad \sum_{k=1}^{I+1} \lambda_k \hat p_k = \hat p, \quad \sum_{k=1}^{I+1} \lambda_k = 1, \quad k \in [I+1].$$
Proof.
Recall that the primal DP is:
$$V_\tau(t_0, x_0, p) = \min_{\{\eta^i\}} \mathbb{E}_{i \sim p,\, u \sim \eta^i}\Big[\max_{v \in \mathcal{V}} \big\{ V_\tau(t_0+\tau, x'(u,v), p'(u)) + \tau\, l^i(u,v) \big\}\Big] \tag{10}$$
$$= \min_{\{\eta^i\}} \int_{\mathcal{U}} \bar\eta(u) \max_{v \in \mathcal{V}} \big\{ V_\tau(t_0+\tau, x'(u,v), p'(u)) + \tau\, \mathbb{E}_{i \sim p'(u)}[l^i(u,v)] \big\}\, du = \min_{\{\eta^i\}} \int_{\mathcal{U}} \bar\eta(u)\, a(u, p'(u))\, du,$$
where the last equality uses Isaacs' condition to reduce $v$ to an implicit function of $u$:
$$a(u, p'(u)) = \max_{v \in \mathcal{V}} V_\tau(t_0+\tau, x'(u,v), p'(u)) + \tau\, \mathbb{E}_{i \sim p'(u)}[l^i(u,v)]. \tag{11}$$
Now we introduce a pushforward measure $\nu$ on $\Delta(I)$ for any $E \subset \Delta(I)$: $\nu(E) = \int_{\{u : p'(u) \in E\}} \bar\eta(u)\, du$. Let $\eta_{p'}$ be the conditional measure on $\mathcal{U}$ for each $p'$. Then we have
$$\min_{\{\eta^i\}} \int_{\mathcal{U}} \bar\eta(u)\, a(u, p'(u))\, du = \min_{\nu} \int_{\Delta(I)} \min_{\eta_{p'}} \Big[\int_{p'(u) = p'} a(u, p')\, \eta_{p'}(du)\Big] \nu(dp') = \min_{\nu} \int_{\Delta(I)} \min_{u \in \mathcal{U}} a(u, p')\, \nu(dp') = \min_{\nu} \int_{\Delta(I)} \tilde a(p')\, \nu(dp').$$
This leads to the following reformulation of $V_\tau$:
$$V_\tau(t_0, x_0, p) = \min_{\nu} \int_{\Delta(I)} \tilde a(p')\, \nu(dp') \quad \text{s.t.}\quad \mathbb{E}_\nu[p'] = p. \tag{12}$$
Now we show that the RHS of Eq. 12 computes the convexification of $\tilde a(p')$ at $p' = p$.
To do so, we first show $V_\tau$ is convex on $\Delta(I)$: Let two probability measures $\nu_1$ and $\nu_2$ be the solutions for $V_\tau(t_0, x_0, p_1)$ and $V_\tau(t_0, x_0, p_2)$, respectively. For any $\theta \in [0,1]$ and $p_\theta = \theta p_1 + (1-\theta) p_2$, the mixture $\nu_\theta := \theta \nu_1 + (1-\theta)\nu_2$ satisfies $\int_{\Delta(I)} p'\, \nu_\theta(dp') = \theta \int_{\Delta(I)} p'\, \nu_1(dp') + (1-\theta) \int_{\Delta(I)} p'\, \nu_2(dp') = p_\theta$ and is a feasible solution to Eq. 12 for $p_\theta$. Therefore
$$V(p_\theta) \le \int_{\Delta(I)} \tilde a(p')\, \nu_\theta(dp') = \theta V(p_1) + (1-\theta) V(p_2).$$
Then we show $V(p) \le \tilde a(p)$ for any $p \in \Delta(I)$: The Dirac measure $\delta_p$ concentrated at $p$ satisfies $\int_{\Delta(I)} p'\, \delta_p(dp') = p$. Thus $\delta_p$ is a feasible solution to Eq. 12, and by definition $V(p) \le \int_{\Delta(I)} \tilde a(p')\, \delta_p(dp') = \tilde a(p)$.
Lastly, we show $V$ is the largest convex minorant of $\tilde a$: Let $h$ be any convex function on $\Delta(I)$ such that $h(p) \le \tilde a(p)$ for all $p$. Given $p \in \Delta(I)$, for any probability measure $\nu$ that satisfies $\int_{\Delta(I)} p'\, \nu(dp') = p$, we have
$$h(p) = h\Big(\int_{\Delta(I)} p'\, \nu(dp')\Big) \le \int_{\Delta(I)} h(p')\, \nu(dp') \le \int_{\Delta(I)} \tilde a(p')\, \nu(dp').$$
Since this inequality holds for arbitrary $\nu$, including the optimal ones that define $V(p)$ through Eq. 12, it follows that $h(p) \le V(p)$. With these, $V(\cdot)$ in Eq. 12 is the convexification of $\tilde a(\cdot)$.
Since convexification in $\Delta(I)$ requires at most $I$ vertices, the $\nu^*$ that solves Eq. 12 is $I$-atomic. We denote by $\{p_k\}_{k \in [I]}$ the set of "splitting" points that carry non-zero probability mass under $\nu^*$, and let $\lambda_k := \nu^*(p_k)$. By Isaacs' condition, $\arg\min_{u \in \mathcal{U}} a(u, p)$ is non-empty for any $p \in \Delta(I)$, and therefore each $p_k$ is associated with (at least) one action in $\arg\min_{u \in \mathcal{U}} a(u, p_k)$. As a result, $\{\eta^i\}$ is also concentrated on a common set of $I$ actions in $\mathcal{U}$. Specifically, denoting this set by $\{u_k\}_{k \in [I]}$, we have $\alpha_k^i := \eta^i(u_k) = \lambda_k p_k[i] / p[i]$. Thus we reach P1. The same proof technique can be applied to the dual DP to derive P2. ∎
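The $I$-atomic convexification at the heart of this argument can be checked numerically for $I = 2$, where $\Delta(2)$ is parameterized by $p = \Pr[\text{type } 1] \in [0, 1]$; the brute-force grid search below is an illustrative sketch, not the solver used in the paper:

```python
import numpy as np

def convexify_1d(a_tilde, p, grid=None):
    """Convex envelope of a_tilde over Delta(2) at belief p: search
    splitting points p1 <= p <= p2 with lam*p1 + (1-lam)*p2 = p."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 201)
    best = a_tilde(p)                      # Dirac (no-split) candidate
    for p1 in grid[grid <= p]:
        for p2 in grid[grid >= p]:
            if p2 - p1 < 1e-9:
                continue
            lam = (p2 - p) / (p2 - p1)     # weight on p1; mean stays at p
            best = min(best, lam * a_tilde(p1) + (1 - lam) * a_tilde(p2))
    return best

# concave non-revealing value: envelope is the chord through the endpoints,
# so splitting to the vertices (full revelation) is optimal
value = convexify_1d(lambda q: q * (1 - q), 0.5)
```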
Appendix C: Proof of Theorem 4.2

Theorem 4.2. For any $(t, x, p) \in [0,T] \times \mathcal{X} \times \Delta(I)$, if A1-5 hold, then there exists $C > 0$ such that
$$V(t, x, p) \le \max_{\zeta \in \mathcal{Z}} J(t, x, p; \{\eta^{i,\tau}\}^\dagger, \zeta) \le V(t, x, p) + C(T-t)\tau. \tag{13}$$
Similarly, for any $(t, x, \hat p) \in [0,T] \times \mathcal{X} \times \mathbb{R}^I$,
$$V^*(t, x, \hat p) \le \max_{\{\eta^i\} \in \{\mathcal{H}^i\}^I} J^*(t, x, \hat p; \{\eta^i\}, \zeta_\tau^\dagger) \le V^*(t, x, \hat p) + C(T-t)\tau. \tag{14}$$

C.1 Settings

We recall all necessary settings for the proof.
From Thm. 4.1, the Bellman backup for $G_\tau$ is P1, which can be written as an operator $T_\tau$:
$$T_\tau[V_\tau](t+\tau, x, p) := V_\tau(t, x, p) = \min_{\{\lambda_k\}, \{p_k\}} \sum_{k=1}^{I} \lambda_k\, \tilde V_\tau(t, x, p_k), \tag{15}$$
where $\tilde V_\tau(t, x, p_k) = \min_{u \in \mathcal{U}} \max_{v \in \mathcal{V}} V_\tau(t+\tau, x + \tau f(x, u, v), p_k) + \tau\, \mathbb{E}_{i \sim p_k}[l^i(u, v)]$ is the non-revealing value at $(t, x, p_k)$, and $\sum_{k=1}^{I} \lambda_k = 1$, $\sum_{k=1}^{I} \lambda_k p_k = p$, $p_k \in \Delta(I)$, $\forall k \in [I]$.
From Cardaliaguet (2007), the value $V$ of the original game $G$ is the unique viscosity solution to the following Hamilton-Jacobi equation
$$\nabla_t w + H(t, x, \nabla_x w) = 0, \tag{16}$$
with terminal boundary $w(T, x, p) = \sum_{i=1}^{I} p[i]\, g_i(x)$, and is convex in $p$.
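The inner non-revealing value $\tilde V_\tau$ in Eq. 15 can be sketched for discrete action sets by direct min-max enumeration; the full backup would then convexify this quantity over the splitting beliefs $p_k$, as in P1 (function signatures and the discretization are illustrative assumptions):

```python
import numpy as np

def nonrevealing_value(V_next, t, x, pk, U, Vv, f, losses, tau):
    """V~_tau(t, x, pk) = min_u max_v [ V(t + tau, x + tau*f(x,u,v), pk)
    + tau * E_{i~pk} l^i(u, v) ] over discrete action sets (a sketch)."""
    best_u = np.inf
    for u in U:                        # minimizer's discrete actions
        worst_v = -np.inf
        for v in Vv:                   # maximizer's discrete actions
            x_next = x + tau * f(x, u, v)
            exp_loss = sum(pk[i] * losses[i](u, v) for i in range(len(pk)))
            worst_v = max(worst_v, V_next(t + tau, x_next, pk) + tau * exp_loss)
        best_u = min(best_u, worst_v)
    return best_u
```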
C.2 Preliminary lemmas
The following lemmas will be used in the main proof.
Lemma C.1 (Value properties).
𝑉 is spatially Lipschitz and 𝜔 -semiconcave.
Proof.
Spatial Lipschitzness of 𝑉 is proved in Cardaliaguet (2007). A function 𝑉 : 𝒳 → ℝ is 𝜔 -semiconcave on 𝒳 if
𝑉 ( 𝑦 ) ≤ 𝑉 ( 𝑥 ) + 𝑝 ⊤ ( 𝑦 − 𝑥 ) + 𝜔 2 ‖ 𝑦 − 𝑥 ‖ 2 2 ∀ 𝑥 , 𝑦 ∈ 𝒳 , 𝑝 ∈ ∂ + 𝑉 ,
(17)
where ∂ + 𝑉 is the supergradient of 𝑉 . Verbally, between two points on its support, 𝑉 bends up at most quadratically with a curvature 𝜔 . Semiconcavity of value functions for first-order Hamilton–Jacobi equations is proved in Bardi et al. (1997) and Cannarsa & Sinestrari (2004) (Thm. 3.3.7). ∎
Lemma C.2 (Quadratic contact).
For any ( 𝑢 , 𝑣 , 𝑝 ) with ‖ 𝑓 ( 𝑥 , 𝑢 , 𝑣 ) ‖ 2 ≤ 𝐹 and 𝑝 ∈ Δ ( 𝐼 ) , let 𝜙 ∈ 𝐶 2 be a test function touching 𝑉 at ( 𝑡 , 𝑥 , 𝑝 ) (a local max. or min.). Then
| 𝑉 ( 𝑡 + 𝜏 , 𝑥 + 𝜏 𝑓 , 𝑝 ) − 𝜙 ( 𝑡 + 𝜏 , 𝑥 + 𝜏 𝑓 , 𝑝 ) | ≤ 𝐶 𝜏 2 , 𝐶 := 1 2 ( 𝜔 + ‖ ∇ 𝑥 2 𝜙 ‖ ∞ ) ( 1 + 𝐹 ) 2 .
(18) Proof.
By semiconcavity of $V$ in $x$ there exists a super-derivative $\hat p \in \partial_x^+ V(t, x, p)$ with $\hat p = \nabla_x \phi(t, x, p)$. For $\Delta := \tau f(x, u, v)$,
𝑉 ( 𝑡 + 𝜏 , 𝑥 + Δ , 𝑝 ) − 𝜙 ( 𝑡 + 𝜏 , 𝑥 + Δ , 𝑝 ) ≤ 𝜔 2 ‖ Δ ‖ 2 2 + 1 2 ‖ ∇ 𝑥 2 𝜙 ‖ ∞ ( 𝜏 + ‖ Δ ‖ 2 ) 2 .
The same bound with 𝑉 and 𝜙 swapped proves Eq. 18. ∎
Lemma C.3 (1‐Lipschitz property).
Let 𝑔 , ℎ be two bounded functions on Δ ( 𝐼 ) . Let 𝑉 𝑒 𝑥 ( 𝑔 ) be the convex envelope of 𝑔 . Then for any 𝑝 ∈ Δ ( 𝐼 )
| 𝑉 𝑒 𝑥 ( 𝑔 ) ( 𝑝 ) − 𝑉 𝑒 𝑥 ( ℎ ) ( 𝑝 ) | ≤ sup 𝑝 | 𝑔 ( 𝑝 ) − ℎ ( 𝑝 ) | .
Proof.
Let $\delta := \sup_p |g(p) - h(p)|$, i.e., $g - \delta \le h \le g + \delta$ for all $p$. Then $\mathrm{Vex}(g - \delta) \le \mathrm{Vex}(h) \le \mathrm{Vex}(g + \delta)$. Since $\mathrm{Vex}(g - \delta) = \mathrm{Vex}(g) - \delta$, we have $|\mathrm{Vex}(g) - \mathrm{Vex}(h)| \le \delta = \sup_p |g(p) - h(p)|$. ∎
Lemma C.4 (Quadratic local error).
For every touching 𝐶 2 test function 𝜙 and every ( 𝑡 , 𝑥 , 𝑝 ) ∈ [ 0 , 𝑇 − 𝜏 ] × 𝒳 × Δ ( 𝐼 )
| 𝑇 𝜏 [ 𝑉 ] ( 𝑡 , 𝑥 , 𝑝 ) − 𝑇 𝜏 [ 𝜙 ] ( 𝑡 , 𝑥 , 𝑝 ) | ≤ 𝐶 𝜏 2 ,
with the constant 𝐶 independent of ( 𝑡 , 𝑥 , 𝑝 ) .
Proof.
Let 𝜙 ~ ∈ 𝐶 2 be a test function touching 𝑉 ~ at ( 𝑡 , 𝑥 , 𝑝 ) , where 𝑉 ~ is the non-revealing value. Lemma C.2 gives | 𝜙 ~ ( 𝑡 + 𝜏 , 𝑥 + 𝜏 𝑓 , 𝑝 ) − 𝑉 ~ ( 𝑡 + 𝜏 , 𝑥 + 𝜏 𝑓 , 𝑝 ) | ≤ 𝐶 𝜏 2 . Apply Lemma C.3 and use the fact that 𝜙 ( 𝑉 ) is a convexification of 𝜙 ~ ( 𝑉 ~ ) to get | 𝑇 𝜏 [ 𝑉 ] − 𝑇 𝜏 [ 𝜙 ] | ≤ 𝐶 𝜏 2 . ∎
Lemma C.5 (Consistency). There exists a constant $C > 0$, independent of $(t, x, p, \tau)$, such that
$$|V(t, x, p) - T_\tau[V](t+\tau, x, p)| \le C\tau^2 \tag{19}$$
for any $(t, x, p) \in [0,T] \times \mathcal{X} \times \Delta(I)$.
Proof.
Let $\phi \in C^2$ be a test function touching $V$ at $(t_0, x_0, p_0)$. Taylor expansion gives:
$$T_\tau[\phi](t_0+\tau, x_0, p_0) = \phi(t_0, x_0, p_0) + \tau \sum_{k=1}^{I} \lambda_k \big(\partial_t \tilde\phi(t_0, x_0, p_k) + H_k\big) + \epsilon_\tau, \tag{20}$$
where $\tilde\phi$ is the non-revealing value and $(u_k, v_k, \lambda_k, p_k)$ is the $k$th splitting point, so that $\phi(t_0, x_0, p_0) = \sum_{k=1}^{I} \lambda_k \tilde\phi(t_0, x_0, p_k)$, $H_k = \nabla_x \tilde\phi(t_0, x_0, p_k)^\top f(x_0, u_k, v_k) + \sum_{i=1}^{I} p_k[i]\, l^i(u_k, v_k)$, and $\epsilon_\tau \le C_{exp}\tau^2$.
First let $\phi$ and $\tilde\phi$ touch $V$ from above, i.e., $V \le \phi \le \tilde\phi$ with $V(t_0) = \phi(t_0)$. From the viscosity property, we have $\partial_t \tilde\phi(t_0, x_0, p_k) + H_k \ge 0$ for $k \in [I]$. Combining with Lem. C.4, we have
$$T_\tau[V](t_0+\tau, x_0, p_0) \ge \phi(t_0, x_0, p_0) - C\tau^2. \tag{21}$$
Since $V(t_0, x_0, p_0) = \phi(t_0, x_0, p_0)$, we have
$$V(t_0, x_0, p_0) \le T_\tau[V](t_0+\tau, x_0, p_0) + C\tau^2. \tag{22}$$
We can similarly introduce 𝜙 ′ and 𝜙 ~ ′ touching 𝑉 from below to have
𝑉 ( 𝑡 0 , 𝑥 0 , 𝑝 0 ) ≥ 𝑇 𝜏 [ 𝑉 ] ( 𝑡 0 + 𝜏 , 𝑥 0 , 𝑝 0 ) − 𝐶 𝜏 2 .
(23)
∎
Lemma C.6 (Non-expansive Bellman backup).
Let 𝑔 and ℎ be two bounded functions defined on [ 0 , 𝑇 ] × 𝒳 × Δ ( 𝐼 ) . Then we have
| 𝑇 𝜏 [ 𝑔 ] ( 𝑡 ′ , 𝑥 , 𝑝 ) − 𝑇 𝜏 [ ℎ ] ( 𝑡 ′ , 𝑥 , 𝑝 ) | ≤ sup 𝑥 ′ ∈ 𝒳 , 𝑝 ′ ∈ Δ ( 𝐼 ) | 𝑔 ( 𝑡 ′ , 𝑥 ′ , 𝑝 ′ ) − ℎ ( 𝑡 ′ , 𝑥 ′ , 𝑝 ′ ) |
(24)
for any ( 𝑡 ′ , 𝑥 , 𝑝 ) ∈ [ 0 , 𝑇 ] × 𝒳 × Δ ( 𝐼 ) .
Proof.
Let $\delta = \sup |g - h|$. Then $g \le h + \delta$ and $h \le g + \delta$. By monotonicity of $T_\tau$: $T_\tau[g] \le T_\tau[h + \delta]$. It can be shown that $T_\tau[\cdot + \delta] = T_\tau[\cdot] + \delta$ for a constant $\delta$, since a constant shift of the value function passes directly through the backup. So $T_\tau[g] \le T_\tau[h] + \delta$. Similarly, $T_\tau[h] \le T_\tau[g] + \delta$. Thus $|T_\tau[g] - T_\tau[h]| \le \delta$. ∎
Lemma C.7 (Value gap between $h$ and $h_\tau$). Let $V(t, x, p)$ and $V_\tau(t, x, p)$ be the values of $h$ and $h_\tau$, respectively, at any $(t, x, p) \in [0,T] \times \mathcal{X} \times \Delta(I)$. We have
$$|V(t, x, p) - V_\tau(t, x, p)| \le C(T-t)\tau$$
for some constant $C > 0$ independent of $(t, x, p, \tau)$.

Proof. Let $e(t, x, p) = V(t, x, p) - V_\tau(t, x, p)$ be the error function. Let $t_k = T - k\tau$ for $k \in [N]$, with $N = T/\tau$. Define $E_k = \sup_{x,p} |e(t_k, x, p)|$. Note that $E_0 = \sup_{x,p} |V(T, x, p) - V_\tau(T, x, p)| = 0$ due to the terminal boundary condition.
Consider the error at time $t_k$ ($k \ge 1$):
$$e(t_k, x, p) = V(t_k, x, p) - V_\tau(t_k, x, p) = V(t_k, x, p) - T_\tau[V_\tau](t_{k-1}, x, p) \quad (\text{using } t_k + \tau = t_{k-1}).$$
Using the consistency bounds from Eq. 19:
$$e(t_k, x, p) \le \big(T_\tau[V](t_{k-1}, x, p) + C\tau^2\big) - T_\tau[V_\tau](t_{k-1}, x, p) = \big(T_\tau[V](t_{k-1}, x, p) - T_\tau[V_\tau](t_{k-1}, x, p)\big) + C\tau^2,$$
$$e(t_k, x, p) \ge \big(T_\tau[V](t_{k-1}, x, p) - C\tau^2\big) - T_\tau[V_\tau](t_{k-1}, x, p) = \big(T_\tau[V](t_{k-1}, x, p) - T_\tau[V_\tau](t_{k-1}, x, p)\big) - C\tau^2.$$
From Lem. C.6, we have $|T_\tau[V](t_{k-1}) - T_\tau[V_\tau](t_{k-1})| \le E_{k-1}$, i.e.,
$$-E_{k-1} \le T_\tau[V](t_{k-1}) - T_\tau[V_\tau](t_{k-1}) \le E_{k-1}.$$
Substituting this into the bounds for $e(t_k)$:
$$-E_{k-1} - C\tau^2 \le e(t_k, x, p) \le E_{k-1} + C\tau^2.$$
Combining these gives
$$|e(t_k, x, p)| \le E_{k-1} + C\tau^2.$$
Since this holds for all $(x, p)$, we can take the supremum over $(x, p)$:
$$E_k = \sup_{x,p} |e(t_k, x, p)| \le E_{k-1} + C\tau^2.$$
We thus have the recursion $E_k \le E_{k-1} + C\tau^2$ with $E_0 = 0$, which leads to $E_k \le kC\tau^2$. Using $k = (T - t_k)/\tau$,
$$E_k \le \frac{T - t_k}{\tau}\, C\tau^2 = C(T - t_k)\tau.$$
Since $E_k = \|V(t_k) - V_\tau(t_k)\|_\infty$, we have shown that for any discrete time $t_k = T - k\tau$:
$$\sup_{x,p} |V(t_k, x, p) - V_\tau(t_k, x, p)| \le C(T - t_k)\tau.$$
∎
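The recursion in the proof is easy to check numerically. The sketch below treats the inequality as equality at its worst case and shows how $N = T/\tau$ local $O(\tau^2)$ errors accumulate into a global error linear in $\tau$; the constants are illustrative.

```python
# The recursion E_k <= E_{k-1} + C*tau^2 with E_0 = 0 accumulates N = T/tau
# local O(tau^2) truncation errors into a global O(tau) error (worst case).
C, T = 2.0, 1.0

def global_error(tau):
    N = round(T / tau)        # number of backward steps from T to 0
    return N * C * tau**2     # E_N <= N*C*tau^2 = C*T*tau

e1, e2 = global_error(0.1), global_error(0.05)
print(round(e1, 12), round(e2, 12))   # 0.2 0.1 -- halving tau halves the bound
```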
C.3 Proof of Theorem 4.2
Proof.
Our proof focuses on the primal game. The same technique can be applied to the dual game. First, it is easy to see 𝑉 ( 𝑡 , 𝑥 , 𝑝 ) ≤ max 𝜁 ∈ 𝒵 𝐽 ( 𝑡 , 𝑥 , 𝑝 ; { 𝜂 𝑖 , 𝜏 † } , 𝜁 ) because { 𝜂 𝑖 , 𝜏 † } is not necessarily NE in 𝐺 . So we just need to prove
max 𝜁 ∈ 𝒵 𝐽 ( 𝑡 , 𝑥 , 𝑝 ; { 𝜂 𝑖 , 𝜏 † } , 𝜁 ) ≤ 𝑉 ( 𝑡 , 𝑥 , 𝑝 ) + 𝐶 ( 𝑇 − 𝑡 ) 𝜏 ,
(25)
i.e., applying { 𝜂 𝑖 , 𝜏 † } (which solves 𝐺 𝜏 ) to 𝐺 will yield a value not far from the true value of 𝐺 .
To do so, we note that
max 𝜁 ∈ 𝒵 𝐽 ( 𝑡 , 𝑥 , 𝑝 ; { 𝜂 𝑖 , 𝜏 † } , 𝜁 ) ≤ 𝑉 𝜏 ( 𝑡 , 𝑥 , 𝑝 ) .
(26)
This is because in 𝐺 𝜏 , P2 moves after P1 and has an advantage. Then we just need to use Lem. C.7 to reach Eq. 25. ∎
Appendix D Prediction Error of Value Approximation
Here we show that CAMS shares the same exponential error propagation as standard approximate value iteration (AVI). The only difference from AVI is that the measurement error in CAMS comes from numerical approximation of the minimax problems rather than randomness in state transitions and rewards. To start, let the true value be $V(t, x, p)$. Following Zanette et al. (2019), the prediction error $\epsilon_t^{bias} := \max_{x,p} |\hat{V}_t(x, p) - V(t, x, p)|$ is affected by (1) the prediction error $\epsilon_{t+\tau}^{bias}$ propagated back from $t+\tau$; (2) the minimax error $\epsilon_t^{minmax}$ caused by limited iterations in solving the minimax problem at each collocation point: $\epsilon_t^{minmax} = \max_{(x,p)\in\mathcal{S}_t} |\tilde{V}(t, x, p) - V(t, x, p)|$; and (3) the approximation error due to the fact that $V(t, \cdot, \cdot)$ may not lie in the model hypothesis space $\mathcal{V}_t$ of $\hat{V}_t$: $\epsilon_t^{app} = \max_{x,p} \min_{\hat{V}_t \in \mathcal{V}_t} |\hat{V}_t(x, p) - V(t, x, p)|$.
Approximation error.
For simplicity, we will abuse notation by using $x$ in place of $(x, p)$ and omit the time dependence of variables when possible. In practice we consider $\hat{V}_t$ to be neural networks that share the same architecture and hypothesis space. Note that $\hat{V}_T(\cdot) = V(T, \cdot)$ is analytically defined by the boundary condition and thus $\epsilon_T^{app} = \epsilon_T^{bias} = 0$. To enable the analysis on neural networks, we adopt the assumption that $\hat{V}$ is infinitely wide and that the resultant neural tangent kernel (NTK) is positive definite. Therefore, from NTK analysis (Jacot et al., 2018), $\hat{V}$ can be considered a kernel machine equipped with a kernel function $r(x^{(i)}, x^{(j)}) := \langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle$ defined by a feature map $\phi: \mathcal{X} \to \mathbb{R}^{d_\phi}$. Given training data $\mathcal{S} = \{(x^{(i)}, V^{(i)})\}$, let $r(x)[i] := r(x^{(i)}, x)$, $R_{ij} := r(x^{(i)}, x^{(j)})$, $V_\mathcal{S} := [V^{(1)}, \dots, V^{(N)}]^\top$, $\Phi_\mathcal{S} := [\phi(x^{(1)}), \dots, \phi(x^{(N)})]$, and $w_\mathcal{S} := \Phi_\mathcal{S}(\Phi_\mathcal{S}^\top \Phi_\mathcal{S})^{-1} V_\mathcal{S}$ be the model parameters learned from $\mathcal{S}$; then
$$\hat{V}(x) = r(x)^\top R^{-1} V_\mathcal{S} = \langle \phi(x), w_\mathcal{S} \rangle \quad (27)$$
is a linear model in the feature space. Let $\theta_\phi(x) := r(x)^\top R^{-1}$ and $C := \max_x \|\theta_\phi(x)\|_1$. Further, let $\mathcal{S}^\dagger := \arg\min_\mathcal{S} |\langle \phi(x), w_\mathcal{S} \rangle - V(x)|$ and $w^\dagger := w_{\mathcal{S}^\dagger}$, i.e., $w^\dagger$ represents the best hypothetical model given sample size $N$. Since $N$ is finite, the data-dependent hypothesis space induces an approximation error $\epsilon_t^{app} := \max_x |\langle \phi(x), w^\dagger \rangle - V(x)|$. From standard RKHS analysis, we have $\epsilon_t^{app} \propto N^{-\frac{1}{2}}$.
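As a concrete illustration of Eq. 27, the sketch below replaces the NTK with an ordinary Gaussian kernel (an assumption made only so the example is self-contained; the toy target function is also our choice) and checks that the resulting kernel machine, a linear model in feature space, interpolates the training data when the Gram matrix is positive definite.

```python
import numpy as np

rng = np.random.default_rng(1)

def kernel(a, b, ell=0.3):
    """Gaussian kernel as a stand-in for the (fixed, PD) NTK."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell**2))

# Collocation points x^(i) and target values V^(i) (toy target, for illustration).
Xs = rng.uniform(-1, 1, size=(30, 2))
Vs = np.sin(np.pi * Xs[:, 0]) * Xs[:, 1]

R = kernel(Xs, Xs)                                   # Gram matrix R_ij
alpha = np.linalg.solve(R + 1e-10 * np.eye(30), Vs)  # R^{-1} V_S (tiny jitter)

def V_hat(x):
    """Kernel predictor V_hat(x) = r(x)^T R^{-1} V_S  (Eq. 27)."""
    return kernel(x, Xs) @ alpha

# A PD kernel interpolates the training data, so the fit error at S is ~0.
print(bool(np.abs(V_hat(Xs) - Vs).max() < 1e-4))  # True
```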
Error propagation.
Recall that we approximately solve $\mathbf{P}_1$ at each collocation point. Let $z := \{\lambda, p, u, v\}$ be the collection of variables and $\tilde{z}$ be the approximated saddle point resulting from DS-GDA. Let $\tilde{V}(t, x, \tilde{z})$ be the approximate value at $(t, x)$ and let $V(t, x, z^*)$ be the value at the true saddle point $z^*$. Lemma D.1 bounds the error of $\tilde{V}(t, x, \tilde{z})$:
Lemma D.1.
$$\max_x |\tilde{V}(t, x, \tilde{z}) - V(t, x, z^*)| \le \epsilon_{t+\tau}^{bias} + \epsilon_t^{minmax}.$$
Proof.
Note that $\sum_{k=1}^{I} \lambda_k = 1$. Then
$$\max_x |\tilde{V}(t, x, \tilde{z}) - V(t, x, z^*)| \le \max_x |\tilde{V}(t, x, \tilde{z}) - \tilde{V}(t, x, z^*)| + \max_x |\tilde{V}(t, x, z^*) - V(t, x, z^*)| \quad (28)$$
$$\le \epsilon_t^{minmax} + \max_x \Big|\sum_{k=1}^{I} \lambda_k \big(\tilde{V}(t+\tau, x', p_k) - V(t+\tau, x', p_k)\big)\Big| \le \epsilon_t^{minmax} + \epsilon_{t+\tau}^{bias}.$$
∎
Now we can combine this measurement error with the inherent approximation error 𝜖 𝑡 𝑎 𝑝 𝑝 to reach the following bound on the prediction error 𝜖 𝑡 𝑏 𝑖 𝑎 𝑠 :
Lemma D.2.
$$\max_x |\hat{V}_t(x) - V(t, x)| \le C_t\big(\epsilon_t^{minmax} + \epsilon_{t+\tau}^{bias} + \epsilon_t^{app}\big) + \epsilon_t^{app}.$$
Proof.
$$\max_x |\hat{V}_t(x) - V(t, x)| \le \max_x |\hat{V}_t(x) - \langle \phi(x), w^\dagger \rangle| + \max_x |\langle \phi(x), w^\dagger \rangle - V(t, x)| \quad (29)$$
$$\le \max_x |\langle \theta_\phi(x), \tilde{V}(t, x) - V(t, x) \rangle| + \max_x |\langle \theta_\phi(x), V(t, x) - \phi(x)^\top w^\dagger \rangle| + \epsilon_t^{app} \le C\big(\epsilon_t^{minmax} + \epsilon_{t+\tau}^{bias} + \epsilon_t^{app}\big) + \epsilon_t^{app}.$$
∎
Lem. D.3 characterizes the propagation of error:
Lemma D.3.
Let $\epsilon_t^{app} \le \epsilon^{app}$, $\epsilon_t^{minmax} \le \epsilon^{minmax}$, and $C_t \le C$ for all $t \in [T]$. If $\epsilon_T^{app} = 0$, then $\epsilon_0^{bias} \le TC^T\big(\epsilon^{app} + C(\epsilon^{minmax} + \epsilon^{app})\big)$.
Proof.
Using Lem. D.2 and by induction, we have
$$\epsilon_0^{bias} \le \big(\epsilon^{app} + C(\epsilon^{minmax} + \epsilon^{app})\big)\frac{1 - C^T}{1 - C} \le TC^T\big(\epsilon^{app} + C(\epsilon^{minmax} + \epsilon^{app})\big). \quad (30)$$
∎
We can now characterize the computational complexity of the baseline algorithm through Thm. D.4, by taking into account the number of DS-GDA iterations and the per-iteration complexity:
Theorem D.4.
For a fixed $T$ and some error threshold $\delta > 0$, with a computational complexity of $\mathcal{O}(T^3 C^{2T} I^2 \epsilon^{-4} \delta^{-2})$, CAMS achieves
$$\max_{(x,p)\in\mathcal{X}\times\Delta(I)} |\hat{V}_0(x, p) - V(0, x, p)| \le \delta. \quad (31)$$
Proof.
From Lem. D.3 and using the fact that $\epsilon^{app} \propto N^{-1/2}$, achieving a prediction error of $\delta$ at $t = 0$ requires $N = \mathcal{O}(C^{2T} T^2 \delta^{-2})$. CAMS solves $TN$ minimax problems, each requiring a worst-case $\mathcal{O}(\epsilon^{-4})$ iterations, and each iteration requires computing gradients of dimension $\mathcal{O}(I^2)$, treating the dimensionalities of the action spaces as constants. This leads to a total complexity of $\mathcal{O}(T^3 C^{2T} I^2 \epsilon^{-4} \delta^{-2})$. ∎
Appendix E Multigrid Algorithms and Results
Alg. 1 presents the 𝑛 -level multigrid algorithm.
Input: $k_{\max}$, $k_{\min}$, $\mathbb{O}$ (minimax solver), $T$ (time horizon), $N$ (number of data points), $\mathcal{R}$ (restriction operator), $\mathcal{P}$ (prolongation operator)
Initialize: $\mathcal{T}_l \leftarrow [0, l, 2l, \dots, T-l]$ for all $l \in \{2^{-k_{\max}}, \dots, 2^{-k_{\min}}\}$; $L \leftarrow 2^{-k_{\min}}$
Initialize: value networks $\hat{V}_t^l$ for all $t \in \mathcal{T}_l$, $l \in \{2^{-k_{\max}}, \dots, 2^{-k_{\min}}\}$; policy set $\Pi \leftarrow \emptyset$
while resources not exhausted or until convergence do
    $R \leftarrow \emptyset$, $E^L \leftarrow \emptyset$, $\mathcal{S} \leftarrow \emptyset$
    Initialize coarsest-grid correction networks $\varepsilon_t^L$ for all $t \in \mathcal{T}_L$
    $\mathcal{S}[t] \leftarrow$ sample $N$ points $(t, x, p)$ for all $t \in \mathcal{T}_{k_{\max}}$
    // down-cycle
    for $k \leftarrow k_{\max}$ down to $k_{\min}+1$ do
        Compute target via $\mathbb{O}_k$ (initialized with $\pi_t$ if $\Pi[k] \neq \emptyset$), and store updated policies $\pi_t$ in $\Pi[k]$, for all $t \in \mathcal{T}_k$
        Compute residuals $r_k[t]$ for all $t \in \mathcal{T}_k$
        if $k \neq k_{\max}$ then $r_t^k \leftarrow r_t^k + \mathcal{R}\, r_t^{k+1}$ for all $t \in \mathcal{T}_k$
        Store $r_t^k$ in $R[k]$
    // coarse-solve backwards in time
    for $t \leftarrow T-L$ to $0$ do
        $e_t^L \leftarrow \mathbb{O}_L(\mathcal{R}\hat{V}_{t+L}^l + \varepsilon_{t+L}^L) - \mathbb{O}_L(\mathcal{R}\hat{V}_{t+L}^l) - \mathcal{R}\, r_t^{k_{\min}+1}$  // $e_T^L = 0$, $\varepsilon_T^L = \emptyset$
        Store $e_t^L$ in $E^L$; fit $\varepsilon_t^L$ to $e_t^L$
    // up-cycle
    for $k \leftarrow k_{\min}+1$ to $k_{\max}$ do
        $e_t^k \leftarrow \mathcal{P}(e_t^{k-1})$ for all $t \in \mathcal{T}_k$
        Update $\hat{V}_t^k \leftarrow \hat{V}_t^k + e_t^k$
    // post-smoothing (for all $t$'s and $l$'s)
    target, $\pi_t \leftarrow \mathbb{O}_l(\hat{V}_{t+l}^l)$ (initialized with $\pi_t$); fit $\hat{V}_t^l$ to target and replace $\pi_t$ in $\Pi[l]$
Algorithm 1: $n$-Level Multigrid for Value Approximation
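To make the restrict / coarse-solve / prolongate / correct pattern of Alg. 1 concrete, the sketch below runs one textbook two-grid correction cycle on a 1D Poisson problem. This is NOT the paper's time-multigrid for value functions; the problem, grids, and operators are illustrative stand-ins chosen so the cycle is fully runnable.

```python
import numpy as np

def laplacian(n, h):
    """Matrix for -u'' = f on n-1 interior points with zero boundaries."""
    A = 2 * np.eye(n - 1) - np.eye(n - 1, k=1) - np.eye(n - 1, k=-1)
    return A / h**2

def smooth(u, f, A, iters=3, w=2/3):
    """Weighted-Jacobi smoothing (the analogue of Alg. 1's post-smoothing)."""
    D = np.diag(A)
    for _ in range(iters):
        u = u + w * (f - A @ u) / D
    return u

def restrict(r):
    """Full-weighting restriction R (fine residual -> coarse grid)."""
    return 0.25 * (r[0:-2:2] + 2 * r[1:-1:2] + r[2::2])

def prolong(ec):
    """Linear-interpolation prolongation P (coarse correction -> fine grid)."""
    ef = np.zeros(2 * len(ec) + 1)
    ef[1::2] = ec
    ef[2:-1:2] = 0.5 * (ec[:-1] + ec[1:])
    ef[0], ef[-1] = 0.5 * ec[0], 0.5 * ec[-1]
    return ef

n, h = 16, 1.0 / 16
A, Ac = laplacian(n, h), laplacian(n // 2, 2 * h)
f = np.sin(np.pi * np.linspace(0, 1, n + 1)[1:-1])

u = smooth(np.zeros(n - 1), f, A)                  # pre-smoothing
r = f - A @ u                                      # fine-grid residual
u = u + prolong(np.linalg.solve(Ac, restrict(r)))  # coarse-grid correction
u = smooth(u, f, A)                                # post-smoothing
print(bool(np.linalg.norm(f - A @ u) / np.linalg.norm(f) < 0.2))  # residual drops
```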
In Fig. 6 we compare trajectories learned via the multigrid approach against the ground truth. The learned trajectories closely resemble the ground truth, with P1 successfully concealing his payoff type until a critical time. In Fig. 7 we visualize the NE trajectories of P2 obtained by solving the dual game.
Figure 6: Comparison of trajectories generated using the value learned via the multigrid method vs. the ground truth.
Figure 7: Trajectories when P1 and P2 play their respective primal and dual games. P1's actions result from the primal value function, whereas P2's actions result from the dual value function. Both primal and dual values are learned using the multigrid approach.
Appendix F Analytical Examples
In this section, we walk through the derivation of analytical NEs for two problems: Hexner’s game and a zero-sum variant of the classic beer-quiche game. The former is differential where players take actions simultaneously; the latter is dynamic and turn-based. Both have one-sided payoff information and finite time horizons. These examples are reproduced from Ghimire et al. (2024) with permission.
F.1 Hexner's Game: Analytical Solution
Here we discuss the solution to Hexner's game using the primal and dual formulations (i.e., $\mathbf{P}_1$ and $\mathbf{P}_2$) of the differential game proposed in Hexner (1979). Consider two players with linear dynamics
$$\dot{x}_i = A_i x_i + B_i u_i,$$
for $i = 1, 2$, where $x_i(t) \in \mathbb{R}^{d_x}$ are system states, $u_i(t) \in \mathcal{U}$ are control inputs belonging to the admissible set $\mathcal{U}$, and $A_i, B_i \in \mathbb{R}^{d_x \times d_x}$. Let $\theta \in \{-1, 1\}$ be Player 1's type, unknown to Player 2, and let $p_\theta$ be Nature's probability distribution over $\theta$. Considering that the game is to be played infinitely many times, the payoff is an expectation over $\theta$:
$$J(u_1, u_2) = \mathbb{E}_\theta\left[\int_0^T \left(\|u_1\|_{R_1}^2 - \|u_2\|_{R_2}^2\right) dt + \|x_1(T) - z\theta\|_{K_1(T)}^2 - \|x_2(T) - z\theta\|_{K_2(T)}^2\right], \quad (32)$$
where $z \in \mathbb{R}^{d_x}$, $R_1$ and $R_2$ are continuous, positive-definite matrix-valued functions, and $K_1(T)$ and $K_2(T)$ are positive semi-definite matrices. All parameters are publicly known except for $\theta$, which remains private. Player 1's objective is to get closer to the target $z\theta$ than Player 2. However, since Player 2 can deduce $\theta$ indirectly through Player 1's control actions, Player 1 may initially employ a non-revealing strategy: acting as though he only knows the prior distribution $p_\theta$ (rather than the true $\theta$) to hide the information, before eventually revealing $\theta$.
First, it can be shown that each player's control has a 1D representation, denoted by $\tilde{\theta}_i \in \mathbb{R}$, through
$$u_i = -R_i^{-1} B_i^\top K_i x_i + R_i^{-1} B_i^\top K_i \Phi_i z \tilde{\theta}_i,$$
for $i = 1, 2$, where $\dot{\Phi}_i = A_i \Phi_i$ with boundary condition $\Phi_i(T) = I$, and
$$\dot{K}_i = -A_i^\top K_i - K_i A_i + K_i^\top B_i R_i^{-1} B_i^\top K_i.$$
Then define the quantity
$$d_i = z^\top \Phi_i^\top K_i B_i R_i^{-1} B_i^\top K_i^\top \Phi_i z. \quad (33)$$
With these, the game can be reformulated with the following payoff function:
$$J(t, \tilde{\theta}_1, \tilde{\theta}_2) = \mathbb{E}_\theta\left[\int_t^T \left(\tilde{\theta}_1(\tau) - \theta\right)^2 d_1(\tau) - \left(\tilde{\theta}_2(\tau) - \theta\right)^2 d_2(\tau)\, d\tau\right], \quad (34)$$
where $d_1$, $d_2$, $p_\theta$ are common knowledge; $\theta$ is known only to Player 1; and the scalar $\tilde{\theta}_1$ (resp. $\tilde{\theta}_2$) is Player 1's (resp. Player 2's) strategy. We consider two player types $\theta \in \{-1, 1\}$ and therefore $p_\theta \in \Delta(2)$.
Then, defining the critical time
$$t_r = \arg\min_t \int_0^t \left(d_1(s) - d_2(s)\right) ds,$$
we have the following equilibrium:
$$\tilde{\theta}_1(s) = \tilde{\theta}_2(s) = 0 \quad \forall s \in [0, t_r], \quad (35)$$
$$\tilde{\theta}_1(s) = \tilde{\theta}_2(s) = \theta \quad \forall s \in (t_r, T]. \quad (36)$$
To solve the game via the primal-dual formulation, we introduce a few quantities. First, introduce time stamps $[T_k]_{k=1}^{2r}$ as the roots of the time-dependent function $d_1 - d_2$, with $T_0 = 0$, $T_{2q+1} = t_r$, and $T_{2r+1} = T$. Without loss of generality, assume:
$$d_1 - d_2 < 0 \quad \forall t \in (T_{2k}, T_{2k+1}),\ \forall k = 0, \dots, r, \quad (37)$$
$$d_1 - d_2 \ge 0 \quad \forall t \in [T_{2k-1}, T_{2k}],\ \forall k = 1, \dots, r. \quad (38)$$
Also introduce $D_k := \int_{T_k}^{T_{k+1}} (d_1 - d_2)\, ds$ and
$$\tilde{D}_k = \begin{cases} \tilde{D}_{k+1} + D_k, & \text{if } \tilde{D}_{k+1} + D_k < 0 \\ 0, & \text{otherwise}, \end{cases} \quad (39)$$
with $\tilde{D}_{2r+1} = 0$.
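The backward recursion in Eq. 39 is easy to evaluate; below is a minimal sketch with hypothetical values of $D_k$ (they do not come from any particular $d_1, d_2$ profiles).

```python
# Backward recursion of Eq. 39 for D~_k; the D_k values below are hypothetical.
D = [-0.3, 0.2, -0.4, 0.1, -0.2, 0.5]     # D_k = integral of (d1 - d2) on [T_k, T_{k+1}]

D_tilde = [0.0] * (len(D) + 1)            # terminal condition: D~_{2r+1} = 0
for k in range(len(D) - 1, -1, -1):
    s = D_tilde[k + 1] + D[k]
    D_tilde[k] = s if s < 0 else 0.0      # clip at zero per Eq. 39

print([round(d, 6) for d in D_tilde])     # [-0.6, -0.3, -0.5, -0.1, -0.2, 0.0, 0.0]
```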
The following properties are necessary (see Ghimire et al. (2024) for details):
1. $\int_{T_k}^{T_{2q+1}} (d_1 - d_2)\, ds = \sum_{j=k}^{2q} D_j < 0$, $\forall k = 0, \dots, 2q$;
2. $\int_{T_{2q+1}}^{T_k} (d_1 - d_2)\, ds = \sum_{j=2q+1}^{k-1} D_j > 0$, $\forall k = 2q+2, \dots, 2r+1$;
3. $\tilde{D}_{2q+2} + D_{2q+1} > 0$;
4. $\tilde{D}_k < 0$, $\forall k < 2q+1$.
Primal game.
We start with $V(T, p) = 0$, where $p := p_\theta[1] = \Pr(\theta = -1)$. The Hamiltonian is
$$H(p) = \min_{\tilde{\theta}_1} \max_{\tilde{\theta}_2} \mathbb{E}_\theta\left[(\tilde{\theta}_1 - \theta)^2 d_1 - (\tilde{\theta}_2 - \theta)^2 d_2\right] = 4p(1-p)(d_1 - d_2).$$
The optimal actions for the Hamiltonian are $\tilde{\theta}_1 = \tilde{\theta}_2 = 1 - 2p$. From the Bellman backup, we get
$$V(T_k, p) = 4p(1-p)\tilde{D}_k.$$
Therefore, at $T_{2q+1}$, we have
$$V(T_{2q+1}, p) = \mathrm{Vex}_p\left(V(T_{2q+2}, p) + 4p(1-p)D_{2q+1}\right) = \mathrm{Vex}_p\left(4p(1-p)(\tilde{D}_{2q+2} + D_{2q+1})\right).$$
Noticing that $\tilde{D}_{2q+2} + D_{2q+1} > 0$ (property 3) and $\tilde{D}_k < 0$ for all $k < 2q+1$ (property 4), $T_{2q+1}$ is the first time at which the term inside the convexification operator, i.e., $4p(1-p)(\tilde{D}_{2q+2} + D_{2q+1})$, becomes concave. Therefore, splitting of the belief happens at $T_{2q+1}$ with $p_1 = 0$ and $p_2 = 1$. Player 1 plays $\tilde{\theta}_1 = -1$ (resp. $\tilde{\theta}_1 = 1$) with probability 1 if $\theta = -1$ (resp. $\theta = 1$), i.e., Player 1 reveals its type. This result is consistent with Hexner's.
Dual game.
To find Player 2's strategy, we need to derive the conjugate value, which satisfies
$$V^*(t, \hat{p}) = \begin{cases} \max_{i \in \{1,2\}} \hat{p}[i] & \forall t \ge T_{2q+1} \\ \hat{p}[2] - \tilde{D}_t\left(1 - \frac{\hat{p}[1] - \hat{p}[2]}{4\tilde{D}_t}\right)^2 & \forall t < T_{2q+1},\ 4\tilde{D}_t \le \hat{p}[1] - \hat{p}[2] \le -4\tilde{D}_t \\ \hat{p}[1] & \forall t < T_{2q+1},\ \hat{p}[1] - \hat{p}[2] \ge -4\tilde{D}_t \\ \hat{p}[2] & \forall t < T_{2q+1},\ \hat{p}[1] - \hat{p}[2] < 4\tilde{D}_t. \end{cases}$$
Here $\hat{p} \in \nabla_{p_\theta} V(0, p_\theta)$ and $V(0, p_\theta) = 4p[1]p[2]\tilde{D}_0$. For any particular $p^* \in \Delta(2)$, from the definition of the subgradient, we have $\hat{p}[1]p^*[1] + \hat{p}[2]p^*[2] = 4p^*[1]p^*[2]\tilde{D}_0$ and $\hat{p}[1] - \hat{p}[2] = 4(p^*[2] - p^*[1])\tilde{D}_0$. Solving these gives $\hat{p} = [4p^*[2]^2\tilde{D}_0,\ 4p^*[1]^2\tilde{D}_0]^\top$. Therefore $\hat{p}[1] - \hat{p}[2] = 4\tilde{D}_0(1 - 2p^*[1]) \in [4\tilde{D}_0, -4\tilde{D}_0]$, and
$$V^*(0, \hat{p}) = \hat{p}[2] - \tilde{D}_0\left(1 - \frac{\hat{p}[1] - \hat{p}[2]}{4\tilde{D}_0}\right)^2.$$
Notice that $V^*(t, \hat{p})$ is convex in $\hat{p}$ since $\tilde{D}_0 < 0$ (property 4) for all $t \in [0, T]$. Therefore, there is no splitting of $\hat{p}$ during the dual game, i.e., $\tilde{\theta}_2 = 1 - 2p$. This result is also consistent with Hexner (1979).
F.2 Example of a Turn-Based Game
We present a zero-sum variant of the classic beer-quiche game, which is a turn-based incomplete-information game with a perfect Bayesian equilibrium. Unlike in Hexner's game, Player 1 in the beer-quiche game wants to maximize his payoff while Player 2 wants to minimize it; hence, the Vex becomes a Cav.
Figure 8:Zero-Sum Beer-Quiche Game
We solve the game through backward induction (from $t = 2, 1, 0$) of its primal and dual values (denoted by $V$ and $V^*$, respectively). Players 1 and 2 make their respective decisions at $t = 0$ and $t = 1$, and the game ends at $t = 2$. The state $x$ at time $t$ encodes the history of actions taken until $t$.
Primal game: First, we compute the equilibrium strategy of Player 1 using the primal value. At the terminal time step ($t = 2$), based on Fig. 8, the value for Player 1 is:
$$V(2, x, p) = \begin{cases} 4p_T - 2 & \text{if } x = (B, b) \\ p_T & \text{if } x = (B, d) \\ 2p_T - 1 & \text{if } x = (Q, b) \\ 2 - 2p_T & \text{if } x = (Q, d). \end{cases} \quad (40)$$
At the intermediate time step ($t = 1$), it is Player 2's turn to act. The value is therefore a function of Player 1's action at $t = 0$ and Player 2's current action, and for the same reason the value is not a concavification (Cav) of the RHS term:
$$V(1, x, p) = \min_{v \in \{b, d\}} V(2, (x, v), p). \quad (41)$$
We can find the best responses of Player 2 for both actions of Player 1. This leads to
$$V(1, x, p) = \begin{cases} p_T & \text{if } x = B,\ 3p_T - 2 \ge 0 \quad (v^* = d) \\ 4p_T - 2 & \text{if } x = B,\ 3p_T - 2 < 0 \quad (v^* = b) \\ 2 - 2p_T & \text{if } x = Q,\ 4p_T - 3 \ge 0 \quad (v^* = d) \\ 2p_T - 1 & \text{if } x = Q,\ 4p_T - 3 < 0 \quad (v^* = b). \end{cases} \quad (42)$$
Finally, at the beginning of the game ($t = 0$), we have
$$V(0, \emptyset, p) = \mathrm{Cav}\left(\max_{u \in \{B, Q\}} V(1, u, p)\right). \quad (43)$$
The Cav is achieved by taking the concave hull with respect to $p_T$:
$$V(0, \emptyset, p) = \begin{cases} 5p_T/2 - 1 & \text{if } p_T < 2/3 \\ p_T & \text{if } p_T \ge 2/3. \end{cases} \quad (44)$$
When $p_T \in [0, 2/3)$,
$$V(0, \emptyset, p) = \lambda \max_u V(1, u, p_1) + (1 - \lambda) \max_u V(1, u, p_2),$$
where $p_1 = [0, 1]^\top$, $p_2 = [2/3, 1/3]^\top$, and $\lambda p_1 + (1 - \lambda) p_2 = p$.
Therefore, when $p_T = 1/3$, $\lambda = 1/2$, and Player 1's strategy is:
$$\Pr(u = Q \mid T) = \frac{\lambda p_1[1]}{p[1]} = 0, \qquad \Pr(u = Q \mid W) = \frac{\lambda p_1[2]}{p[2]} = 3/4, \quad (45)$$
$$\Pr(u = B \mid T) = \frac{(1 - \lambda) p_2[1]}{p[1]} = 1, \qquad \Pr(u = B \mid W) = \frac{(1 - \lambda) p_2[2]}{p[2]} = 1/4.$$
Dual game: To solve for the equilibrium strategy of Player 2, we first derive the dual variable $\hat{p} \in \partial_p V(0, \emptyset, p)$ for $p = [1/3, 2/3]^\top$. By definition, $\hat{p}^\top p$ defines the concave hull of $V(0, \emptyset, p)$, and therefore we have
$$[1/3,\ 2/3]\, \hat{p} = V(0, \emptyset, p) = -1/6, \quad (46)$$
$$[0,\ 1]\, \hat{p} = V(0, \emptyset, [0, 1]) = -1.$$
This leads to $\hat{p} = [3/2, -1]^\top$.
At the terminal time, we have
$$V^*(2, x, \hat{p}) = \min\{\hat{p}[1] - g_T(x),\ \hat{p}[2] - g_W(x)\} \quad (47)$$
$$= \begin{cases} \min\{\hat{p}[1] - 2,\ \hat{p}[2] + 2\} & \text{if } x = (B, b) \\ \min\{\hat{p}[1] - 1,\ \hat{p}[2]\} & \text{if } x = (B, d) \\ \min\{\hat{p}[1] - 1,\ \hat{p}[2] + 1\} & \text{if } x = (Q, b) \\ \min\{\hat{p}[1],\ \hat{p}[2] - 2\} & \text{if } x = (Q, d). \end{cases}$$
At $t = 1$, we have
$$V^*(1, u, \hat{p}) = \mathrm{Cav}_{\hat{p}}\left(\max_v V^*(2, (u, v), \hat{p})\right). \quad (48)$$
When $u = B$, the conjugate value is the concave hull of a piecewise-linear function:
$$V^*(1, B, \hat{p}) = \mathrm{Cav}_{\hat{p}}\left(\begin{cases} \hat{p}[1] - 1 & \text{if } \hat{p}[2] \ge \hat{p}[1] - 1 \quad (v^* = d) \\ \hat{p}[2] & \text{if } \hat{p}[2] \in [\hat{p}[1] - 2,\ \hat{p}[1] - 1) \quad (v^* = b) \\ \hat{p}[1] - 2 & \text{if } \hat{p}[2] \in [\hat{p}[1] - 4,\ \hat{p}[1] - 2) \quad (v^* = d) \\ \hat{p}[2] + 2 & \text{if } \hat{p}[2] < \hat{p}[1] - 4 \quad (v^* = b) \end{cases}\right) \quad (49)$$
$$= \begin{cases} \hat{p}[1] - 1 & \text{if } \hat{p}[2] \ge \hat{p}[1] - 1 \quad (v^* = d) \\ \frac{2}{3}\hat{p}[1] + \frac{1}{3}\hat{p}[2] - \frac{2}{3} & \text{if } \hat{p}[2] \in [\hat{p}[1] - 4,\ \hat{p}[1] - 1) \quad (\text{mixed strategy}) \\ \hat{p}[2] + 2 & \text{if } \hat{p}[2] < \hat{p}[1] - 4 \quad (v^* = b). \end{cases}$$
For $\hat{p} = [3/2, -1]^\top$, which satisfies $\hat{p}[2] \in [\hat{p}[1] - 4,\ \hat{p}[1] - 1)$, Player 2 follows a mixed strategy determined by $\{\lambda_1, \lambda_2, \lambda_3\} \in \Delta(3)$ and $\hat{p}_j \in \mathbb{R}^2$ for $j = 1, 2, 3$ such that:
(i) at least one of the $\hat{p}_j$, $j = 1, 2, 3$, satisfies $\hat{p}[2] = \hat{p}[1] - 1$ and another satisfies $\hat{p}[2] = \hat{p}[1] - 4$; these conditions are necessary for $V^*(1, B, \hat{p})$ to be a concave hull:
$$V^*(1, B, \hat{p}) = \sum_{j=1}^{3} \lambda_j \max_v V^*(2, (B, v), \hat{p}_j); \quad (50)$$
(ii) $\sum_{j=1}^{3} \lambda_j \hat{p}_j = \hat{p}$.
These conditions lead to $\lambda_1 = 1/2$ and $\lambda_2 + \lambda_3 = 1/2$. Therefore, when Player 1 picks beer, Player 2 chooses to defer and bully with equal probability.
When $u = Q$, we similarly have
$$V^*(1, Q, \hat{p}) = \begin{cases} \hat{p}[1] & \text{if } \hat{p}[2] \ge \hat{p}[1] + 2 \quad (v^* = d) \\ \dots & \text{if } \hat{p}[2] \in [\hat{p}[1] - 2,\ \hat{p}[1] + 2) \quad (\text{mixed strategy}) \\ \hat{p}[2] + 1 & \text{if } \hat{p}[2] < \hat{p}[1] - 2 \quad (v^* = b). \end{cases} \quad (51)$$
The derivation of the concave hull when $\hat{p}[2] \in [\hat{p}[1] - 2,\ \hat{p}[1] + 2)$ is omitted because, for $\hat{p} = [3/2, -1]^\top$, $V^*(1, Q, \hat{p}) = \hat{p}[2] + 1 = 0$ with $v^* = b$, i.e., if Player 1 picks quiche, Player 2 chooses to bully with probability 1.
Appendix G Hexner's Game Settings, Baselines, and Ground Truth
G.1 Game Settings
The players move in an arena bounded between $[-1, 1]$ in all directions. All games in the paper follow 2D/3D point dynamics: $\dot{x}_j = A x_j + B u_j$, where $x_j$ is a vector of position and velocity and $u_j$ is the action of player $j$. Note that we use $u$ and $v$ in the optimization problems $\mathbf{P}_1$ and $\mathbf{P}_2$ to represent Player 1's and Player 2's actions, respectively. The type-independent effort loss for each player $j$ is defined as $l_j(u_j) = u_j^\top R_j u_j$, where $R_1 = \mathrm{diag}(0.05, 0.025)$ and $R_2 = \mathrm{diag}(0.05, 0.1)$. For the higher-dimensional case, $R_1 = \mathrm{diag}(0.05, 0.05, 0.025)$ and $R_2 = \mathrm{diag}(0.05, 0.05, 0.1)$. Note that, in the incomplete-information case, P1 is able to obtain a better payoff by hiding the target, because P2 incurs a higher effort cost and hence cannot accelerate as fast as P1.
G.2 Ground Truth for Hexner's Game
For the 4-stage and 10-stage Hexner's games, there exist analytical solutions to the equilibrium policies via solving the HJB equations for the respective players:
$$u_j = -R_j^{-1} B_j^\top K_j x_j + R_j^{-1} B_j^\top K_j \Phi_j z \tilde{\theta}_j,$$
based on the reformulation outlined below, in which each player's action $\tilde{\theta}_j \in \mathbb{R}$ becomes 1D and is decoupled from the state. Here $\Phi_j$ is a state-transition matrix that solves $\dot{\Phi}_j = A_j \Phi_j$ with $\Phi_j(T)$ being the identity matrix, and $K_j$ is the solution to a continuous-time differential Riccati equation:
$$\dot{K}_j = -A_j^\top K_j - K_j A_j + K_j^\top B_j R_j^{-1} B_j^\top K_j. \quad (52)$$
Finally, defining
$$d_j = z^\top \Phi_j^\top K_j B_j R_j^{-1} B_j^\top K_j^\top \Phi_j z$$
and the critical time
$$t_r = \arg\min_t \int_0^t (d_1(s) - d_2(s))\, ds,$$
the equilibrium controls follow
$$\tilde{\theta}_j(t) = \begin{cases} 0, & t \in [0, t_r] \\ \theta, & t \in (t_r, T]. \end{cases}$$
As explained in Sec. 6, P1 chooses $\tilde{\theta}_1 = 0$ until the critical time $t_r$, and P2 follows.
Note that in order to compute the ground truth when time is discretized with some $\tau$, we need the discrete counterpart of Eq. 52, namely the discrete-time Riccati difference equation, and compute the matrices $K$ recursively.
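A minimal sketch of such a backward Riccati recursion is shown below. The matrices (a single-axis discretized double integrator), the step $\tau = 0.1$, the horizon, and the terminal weight are illustrative assumptions, not the paper's exact game settings; the state cost is taken as zero, matching the effort-only running cost.

```python
import numpy as np

# Backward iteration of the discrete-time Riccati difference equation.
# Matrices, tau, horizon, and K(T) below are illustrative assumptions.
tau = 0.1
A = np.array([[1.0, tau], [0.0, 1.0]])    # single-axis double integrator
B = np.array([[0.5 * tau**2], [tau]])
R = np.array([[0.05]])

K = np.eye(2)                             # terminal condition K(T)
for _ in range(10):                       # K_t from K_{t+tau}, backward in time
    gain = np.linalg.inv(R + B.T @ K @ B)           # (R + B^T K B)^{-1}
    K = A.T @ (K - K @ B @ gain @ B.T @ K) @ A      # Riccati difference step
print(bool(np.allclose(K, K.T)))          # K stays symmetric: True
```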
G.3 OpenSpiel Implementations and Hyperparameters
We use OpenSpiel (Lanctot et al., 2019), a collection of environments and algorithms for solving single- and multi-agent games. We select OpenSpiel due to its ease of access and the availability of a wide range of algorithms. The first step is to write the game environment with simultaneous moves for the stage game and the multi-stage games (with 4 decision nodes). Note that to learn the policy, the algorithms in OpenSpiel require conversion from a simultaneous to a sequential game, which can be done with a built-in method.
In the single-stage game, P1 has two information states representing his type, and P2 has only one information state (i.e., the starting position of the game, which is fixed). In the case of the 4-stage game, the information state (or infostate) is a vector consisting of P1's type (2-D: [0, 1] for type-1, [1, 0] for type-2), the states of the players (8-D), and the actions of the players at each time step ($4 \times 2 \times U$). The 2-D "type" vector for P2 is populated with 0 as she has no access to P1's type. For example, the infostate at the final decision node for a type-1 P1 could be $[0, 1, x^{(8)}, \mathbb{1}_{u_0}^{(U)}, \mathbb{1}_{d_0}^{(U)}, \cdots, \mathbb{1}_{d_2}^{(U)}, \mathbf{0}^{(U)}, \mathbf{0}^{(U)}]$, and $[0, 0, x^{(8)}, \mathbb{1}_{u_0}^{(U)}, \mathbb{1}_{d_0}^{(U)}, \cdots, \mathbb{1}_{d_2}^{(U)}, \mathbf{0}^{(U)}, \mathbf{0}^{(U)}]$ for P2, where $u_k, d_k$ represent the indices of the actions at the $k$th decision node, $k = 0, 1, 2, 3$.
The hyperparameters for DeepCFR are listed in Table 2.

Table 2: Hyperparameters for DeepCFR Training

| Hyperparameter | Value |
| --- | --- |
| Policy Network Layers | (256, 256) |
| Advantage Network Layers | (256, 256) |
| Number of Iterations | 1000 (100 for $U = 16$) |
| Number of Traversals | 5 (10 for $U = 16$) |
| Learning Rate | 1e-3 |
| Advantage Network Batch Size | 1024 |
| Policy Network Batch Size | 10000 (5000 for $U = 16$) |
| Memory Capacity | 1e7 (1e5 for $U = 16$) |
| Advantage Network Train Steps | 1000 |
| Policy Network Train Steps | 5000 |
| Re-initialize Advantage Networks | True |

G.4 Joint-Perturbation Simultaneous Pseudo-Gradient (JPSPG)
The core idea in the JPSPG algorithm is the use of a pseudo-gradient instead of the actual gradient of the utility to update players' strategies. By perturbing the parameters of the utility function (which include the strategy), an unbiased estimator of the gradient of a smoothed version of the original utility function is obtained. Computing the pseudo-gradient can often be cheaper and faster than computing the exact gradient, and it is also suitable in scenarios where the utility (or objective) functions are "black-box" or unknown. Building on the pseudo-gradient, Martin & Sandholm (2024) proposed a method that estimates the pseudo-gradient with respect to all players' strategies simultaneously; instead of multiple calls, the pseudo-gradient can thus be estimated in a single evaluation. More formally, let $\mathbf{f}: \mathbb{R}^d \to \mathbb{R}^n$ be a vector-valued function. Then its smoothed version is defined as
$$\mathbf{f}_\sigma(\mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim \mu}\, \mathbf{f}(\mathbf{x} + \sigma \mathbf{z}), \quad (53)$$
where $\mu$ is the $d$-dimensional standard normal distribution and $\sigma \neq 0 \in \mathbb{R}$ is a scalar. Then, extending the pseudo-gradient of a scalar-valued function to a vector-valued function, we have the following pseudo-Jacobian:
$$\nabla \mathbf{f}_\sigma(\mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim \mu}\, \frac{1}{\sigma} \mathbf{f}(\mathbf{x} + \sigma \mathbf{z}) \otimes \mathbf{z}, \quad (54)$$
where $\otimes$ is the tensor product.
Typically, in a game, the utility function returns a utility for each player given their strategies. Let $\mathbf{u}: \mathbb{R}^{n \times d} \to \mathbb{R}^n$ be the utility function of a game with $n$ players, where each player has a $d$-dimensional strategy. The simultaneous gradient of $\mathbf{u}$ is then a function $\mathbf{v}: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$; that is, row $i$ of $\mathbf{v}(\mathbf{u})$ is the gradient of the utility of player $i$ with respect to its own strategy, $\mathbf{v}_i = \nabla_i \mathbf{u}_i$. As a result, we can write $\mathbf{v}$ concisely as $\mathbf{v} = \mathrm{diag}(\nabla \mathbf{u})$, where $\nabla$ is the Jacobian. With these, we have:
$$\mathbf{v}_\sigma(\mathbf{x}) = \mathrm{diag}(\nabla \mathbf{u}_\sigma(\mathbf{x})) \quad (55)$$
$$= \mathrm{diag}\left(\mathbb{E}_{\mathbf{z} \sim \mu}\, \frac{1}{\sigma} \mathbf{u}(\mathbf{x} + \sigma \mathbf{z}) \otimes \mathbf{z}\right) = \mathbb{E}_{\mathbf{z} \sim \mu}\, \frac{1}{\sigma} \mathrm{diag}\big(\mathbf{u}(\mathbf{x} + \sigma \mathbf{z}) \otimes \mathbf{z}\big) = \mathbb{E}_{\mathbf{z} \sim \mu}\, \frac{1}{\sigma} \mathbf{u}(\mathbf{x} + \sigma \mathbf{z}) \odot \mathbf{z},$$
where $\odot$ is the element-wise product, the last equality following from the fact that $\mathrm{diag}(\mathbf{a} \otimes \mathbf{b}) = \mathbf{a} \odot \mathbf{b}$. Hence, by evaluating Eq. 55 once, we get the pseudo-gradients associated with all players, making the cost of an evaluation constant, as opposed to linear, in the number of players.
Once the pseudo-gradients are evaluated, the players update their strategies in the direction of the pseudo-gradient, each player seeking to maximize its respective utility.
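The single-evaluation estimator of Eq. 55 can be sketched as follows on an illustrative two-player zero-sum bilinear game (the utility function is our choice for testing, not from Martin & Sandholm); each player's block of the perturbation $\mathbf{z}$ multiplies its own utility.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative zero-sum bilinear game: u_1(x) = x_1 . x_2, u_2(x) = -x_1 . x_2.
def u(x):                         # x: (n, d) strategies; returns utility per player
    s = float(x[0] @ x[1])
    return np.array([s, -s])

def pseudo_grad(x, sigma=1e-3, samples=20000):
    """Monte-Carlo estimate of v_sigma(x) = E[ (1/sigma) u(x + sigma z) . z ] (Eq. 55)."""
    g = np.zeros_like(x)
    for _ in range(samples):
        z = rng.standard_normal(x.shape)
        g += u(x + sigma * z)[:, None] * z / sigma   # each row uses its own z block
    return g / samples

x = np.array([[1.0, 0.0], [0.0, 2.0]])   # x_1 . x_2 = 0 keeps estimator variance low
est = pseudo_grad(x)
exact = np.stack([x[1], -x[0]])          # grad_1 u_1 = x_2, grad_2 u_2 = -x_1
print(bool(np.abs(est - exact).max() < 0.15))  # one evaluation per sample, all players
```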
JPSPG Implementation.
In games with discrete action spaces, where the strategy is a probability distribution over the actions, JPSPG can be applied directly to obtain a mixed strategy. For continuous-action games, however, a standard implementation would yield a pure-strategy solution rather than a mixed one. To compute a mixed strategy, we can instead use a neural network as the strategy, with added randomness whose parameters are learned, as described in Martin & Sandholm (2023, 2024). We similarly define two strategy networks, one per player, whose outputs are scaled to the respective action bounds with a hyperbolic tangent (tanh) activation on the final layer. The inputs to the strategy networks (single-hidden-layer neural networks with 64 neurons and an output dimension equal to that of the action space) are the state of the player and a random variable whose mean and variance are trainable parameters. We follow the architecture outlined by Martin & Sandholm (2024) in their implementation of continuous-action Goofspiel. We would like to thank the authors for providing an example implementation of JPSPG on a normal-form game.
In the normal-form Hexner's game, P1's state is $\mathbf{x}_1 = \{x_1, y_1, \text{type}\}$ and P2's state is $\mathbf{x}_2 = \{x_2, y_2\}$, where $x_i$ and $y_i$ denote the x-y coordinates of player $i$. In the 4-stage case, we also include the x-y velocities in the state and append the history of actions chosen by both P1 and P2 to the input of the strategy network. As an example, P1's input at the very last decision step is the vector $[x_1, y_1, v_{x_1}, v_{y_1}, x_2, y_2, v_{x_2}, v_{y_2}, \text{type}, u_{1x}, u_{1y}, d_{1x}, d_{1y}, u_{2x}, u_{2y}, d_{2x}, d_{2y}, u_{3x}, u_{3y}, d_{3x}, d_{3y}] \in \mathbb{R}^{21}$, where $u_j$ and $d_j$ represent the actions of P1 and P2, respectively, at the $j$th decision point. P2's input is the same without the type information, making it a vector in $\mathbb{R}^{20}$.
G.5 Sample Trajectories
Here we present sample trajectories from three different initial states for each P1 type. The policies learned by CAMS result in trajectories that are significantly closer to the ground truth than those of the other two algorithms.
Figure 9: Trajectories generated using CAMS (primal game), DeepCFR, and JPSPG. The initial position pairs are marked with the same marker and the final positions with a star. The trajectories from CAMS are close to the ground truth while those from DeepCFR and JPSPG are not.
G.6 Value Network Training Details
Data Sampling:
At each time step, we first collect training data by solving the optimization problem ($\mathbf{P}_1$ or $\mathbf{P}_2$). Positions are sampled uniformly from $[-1, 1]$ and velocities from $[-\bar{v}_t, \bar{v}_t]$, computed as $\bar{v}_t = t \times u_{max}$, where $u_{max}$ is the maximum acceleration. For the unconstrained game, $u_{max} = 12$ for both P1 and P2. For the constrained case, $u_x^{max} = 6$, $u_y^{max} = 12$ for P1 and $u_x^{max} = 6$, $u_y^{max} = 4$ for P2. During training, the velocities are normalized to $[-1, 1]$. The belief $p$ is then sampled uniformly from $[0, 1]$. For the dual value, we first determine the upper and lower bounds of $\hat{p}$ by computing the subgradient $\partial_p V(t_0, \cdot, \cdot)$ and then sample uniformly from $[\hat{p}^-, \hat{p}^+]$.
Training:
We briefly discuss the training procedure for the value networks. As mentioned in the main paper, both the primal and dual value functions are convex with respect to $p$ and $\hat{p}$, respectively. As a result, we use Input Convex Neural Networks (ICNN) (Amos et al., 2017) as the network architecture. Starting from $T - \tau$, the solutions of the optimization problem $\mathbf{P}_1$ for sampled $(X, p)$ are saved, and the convex value network is fit to the saved training data. The model parameters are saved and then used in the optimization step at $T - 2\tau$. This is repeated until the value function at $t = 0$ is fit. The inputs to the primal value network are the joint state $X$, containing the positions and velocities of the players, and the belief $p$.
The process for training the dual value is similar to that for the primal value. The inputs to the dual value network are the joint state $X$, containing the positions and velocities of the players, and the dual variable $\hat{p}$.
Appendix H Solving Hexner's Game via MPC
Here we solve a 2D Hexner's primal game with $I = 2$, $K = 10$, and other settings following App. G. With these settings and using the equilibrium in App. G.2, the true type-revelation time is $t_r = 0.5$ seconds. We directly solve the minimax problem by autodiffing the gradient of the sum of payoffs over the $2^{10}$ paths of the game tree. At each infostate along each path, P1's strategy is modeled by a neural network that takes in $(t, x, p)$ and outputs $I$ action prototypes and an $I$-by-$I$ logit matrix encoding the type-dependent probabilities of taking each action prototype; P2's best response is modeled by a separate neural network that takes in $(t, x, p)$ and outputs a single action. With a DS-GDA solver, the search successfully converges to the ground-truth equilibrium. Fig. 10 illustrates the NE trajectories for one particular initial state and the corresponding belief dynamics.
Figure 10: 2D Hexner's game solved by MPC.
Appendix I A Differentiable 11-vs-11 American Football Game
We model a single running/pass play as a 2p0s game between the offense (P1) and defense (P2) teams. Each player is a point mass with double-integrator dynamics on a 2D plane. Time is discretised with macro step Δ𝑡 = 𝜏 and 𝐾 = 𝑇/𝜏 steps, and each macro step is resolved by 𝑛_sub semi-implicit Euler substeps for stability.
State, controls, and bounds.
Let 𝑁 = 11 be the number of players per team. Offense positions and velocities are 𝑋^(1), 𝑉^(1) ∈ ℝ^(𝑁×2); defense positions and velocities are 𝑋^(2), 𝑉^(2) ∈ ℝ^(𝑁×2). We pack them into a state vector 𝑥 = [𝑋^(1), 𝑉^(1), 𝑋^(2), 𝑉^(2)] ∈ ℝ^(8𝑁). At each step, the teams apply accelerations 𝑈_1, 𝑈_2 ∈ ℝ^(𝑁×2) (stacked later as 𝑢 = [𝑢_1; 𝑢_2] ∈ ℝ^(4𝑁)). Kinematic saturations enforce a playable box of half-width BOX_POS and box-limited speeds and accelerations BOX_VEL, BOX_ACC by componentwise clamping after each substep.
Differentiable tackle dynamics (smooth contact and merge).
During a substep with duration 𝛿𝑡 = 𝜏/𝑛_sub, we first compute a soft, pairwise "stickiness" weight between an attacker 𝑖 and a defender 𝑗:

$$w_{ij} = \sigma\big(k_{\text{tackle}}\,(r_{\text{thr}}^2 - \|X_i^{(1)} - X_j^{(2)}\|^2)\big),$$

where 𝜎(𝑧) = 1/(1 + 𝑒^(−𝑧)), 𝑘_tackle sets the steepness, and 𝑟_thr² = MERGE_RADIUS². These weights form 𝑊 ∈ [0, 1]^(𝑁×𝑁). We then compute velocity "sharing" and contact accelerations via convex averaging across opponents:
$$\hat V_i^{(1)} = \frac{V_i^{(1)} + \sum_j w_{ij} V_j^{(2)}}{1 + \sum_j w_{ij}}, \qquad \hat V_j^{(2)} = \frac{V_j^{(2)} + \sum_i w_{ij} V_i^{(1)}}{1 + \sum_i w_{ij}},$$

$$A_{c,i}^{(1)} = \frac{\sum_j w_{ij} A_{c,j}^{(2)}}{1 + \sum_j w_{ij}}, \qquad A_{c,j}^{(2)} = \frac{\sum_i w_{ij} A_{c,i}^{(1)}}{1 + \sum_i w_{ij}},$$
with 𝐴 c ( ⋅ ) initialised at zero so the first pass merely defines a contact baseline. This produces smooth, differentiable coupling without hard impulses.
To blend control and contact accelerations we form state-dependent merge probabilities
$$p_{m,i}^{(1)} = 1 - \exp\Big(-\sum_j w_{ij}\Big), \qquad p_{m,j}^{(2)} = 1 - \exp\Big(-\sum_i w_{ij}\Big),$$
and set
$$A_i^{(1)} = (1 - p_{m,i}^{(1)})\,U_{1,i} + p_{m,i}^{(1)} A_{c,i}^{(1)}, \qquad A_j^{(2)} = (1 - p_{m,j}^{(2)})\,U_{2,j} + p_{m,j}^{(2)} A_{c,j}^{(2)}.$$
We then perform a semi-implicit Euler update with the shared velocities 𝑉 ^ :
$$V^{(1)} \leftarrow \mathrm{clip}\big(\hat V^{(1)} + A^{(1)}\delta t,\ \pm\texttt{BOX\_VEL}\big), \qquad V^{(2)} \leftarrow \mathrm{clip}\big(\hat V^{(2)} + A^{(2)}\delta t,\ \pm\texttt{BOX\_VEL}\big),$$
$$X^{(1)} \leftarrow \mathrm{clip}\big(X^{(1)} + V^{(1)}\delta t,\ \pm\texttt{BOX\_POS}\big), \qquad X^{(2)} \leftarrow \mathrm{clip}\big(X^{(2)} + V^{(2)}\delta t,\ \pm\texttt{BOX\_POS}\big).$$
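A minimal NumPy sketch of one tackle-dynamics substep, combining the stickiness weights, velocity sharing, merge probabilities, and the clamped semi-implicit Euler update described above. The contact accelerations are initialised at zero as in the text, and the constants (`k_tackle`, `merge_radius`, box limits, `dt`) are placeholder values, not the paper's defaults.

```python
import numpy as np

def substep(X1, V1, X2, V2, U1, U2, dt=0.02,
            k_tackle=5.0, merge_radius=0.2, box_vel=1.0, box_pos=2.0):
    # Soft pairwise stickiness weights W[i, j] between attacker i and defender j.
    d2 = ((X1[:, None, :] - X2[None, :, :])**2).sum(-1)            # (N, N) squared distances
    W = 1.0 / (1.0 + np.exp(-k_tackle * (merge_radius**2 - d2)))   # sigmoid

    s1 = W.sum(axis=1, keepdims=True)       # sum_j w_ij, per attacker
    s2 = W.sum(axis=0, keepdims=True).T     # sum_i w_ij, per defender

    # Convex velocity sharing across opponents.
    V1h = (V1 + W @ V2) / (1.0 + s1)
    V2h = (V2 + W.T @ V1) / (1.0 + s2)

    # Contact accelerations initialised at zero (first-pass baseline).
    Ac1 = np.zeros_like(V1)
    Ac2 = np.zeros_like(V2)

    # Merge probabilities blend control vs. contact acceleration.
    pm1 = 1.0 - np.exp(-s1)
    pm2 = 1.0 - np.exp(-s2)
    A1 = (1.0 - pm1) * U1 + pm1 * Ac1
    A2 = (1.0 - pm2) * U2 + pm2 * Ac2

    # Semi-implicit Euler update with componentwise box clamping.
    V1n = np.clip(V1h + A1 * dt, -box_vel, box_vel)
    V2n = np.clip(V2h + A2 * dt, -box_vel, box_vel)
    X1n = np.clip(X1 + V1n * dt, -box_pos, box_pos)
    X2n = np.clip(X2 + V2n * dt, -box_pos, box_pos)
    return X1n, V1n, X2n, V2n
```

When opponents are far apart, W ≈ 0, so the update reduces to a plain clamped double integrator; near contact, the shared velocities and merge probabilities take over smoothly.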
Control-affine analysis.
Fix a macro time 𝑘 and a substep, and treat the current state ( 𝑋 ( 1 ) , 𝑉 ( 1 ) , 𝑋 ( 2 ) , 𝑉 ( 2 ) ) as given. The weights 𝑊 , the merge probabilities 𝑝 m ( ⋅ ) , the shared velocities 𝑉 ^ ( ⋅ ) , and the contact terms 𝐴 c ( ⋅ ) are functions of the state only at that substep. Consequently,
$$A^{(1)} = \underbrace{(1 - p_m^{(1)})}_{\text{state-only}} \odot U_1 + \underbrace{p_m^{(1)} \odot A_c^{(1)}}_{\text{state-only}}, \qquad A^{(2)} = (1 - p_m^{(2)}) \odot U_2 + p_m^{(2)} \odot A_c^{(2)}.$$
The semi-implicit update is affine in ( 𝐴 ( 1 ) , 𝐴 ( 2 ) ) , hence affine in ( 𝑈 1 , 𝑈 2 ) :
$$x_{k+1} = f(x_k) + B_1(x_k)\,u_1 + B_2(x_k)\,u_2,$$
where the “input matrices” 𝐵 1 , 𝐵 2 are diagonal masks with entries ( 1 − 𝑝 m ( ⋅ ) ) 𝛿 𝑡 in the velocity rows and ( 1 − 𝑝 m ( ⋅ ) ) 𝛿 𝑡 2 in the corresponding position rows, all depending only on 𝑥 𝑘 . Thus the map is control-affine for any fixed state, and globally piecewise control-affine due to the velocity/position clamping at the box limits; the latter introduces non-smooth but almost-everywhere differentiable saturations.
Tackle probability and running cost.
We summarise the likelihood of a tackle against the ball-carrier (RB) via a differentiable probabilistic OR across all defenders. Let "rb" index the RB on offense; then, with the same 𝑊,

$$p_{\text{tackle}} = 1 - \prod_{j=1}^{N} (1 - w_{\text{rb},j}).$$
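The probabilistic OR reduces to one line; here `w_rb` denotes the RB's row of 𝑊.

```python
import numpy as np

def p_tackle(w_rb):
    # Probabilistic OR over defenders: 1 - prod_j (1 - w_rb[j]).
    # Equals 0 if no defender is close, and approaches 1 as any w_rb[j] -> 1.
    return 1.0 - np.prod(1.0 - w_rb)
```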
The running loss at a macro step is
$$\ell_{\text{run}} = \frac{0.1}{2}\,\tau\,\big(\mathrm{vec}(U_1)^\top R_1\,\mathrm{vec}(U_1) - \mathrm{vec}(U_2)^\top R_2\,\mathrm{vec}(U_2)\big) + \lambda_{\text{tackle}}\,p_{\text{tackle}},$$

with 𝑅_1 = 𝑅_2 = 𝐼_(4𝑁) in our defaults, a small control weight to encourage purposeful motion, and 𝜆_tackle the penalty weight for the RB being tackled.
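A sketch of the running loss with identity control-weight matrices, so the quadratic forms reduce to squared Frobenius norms; this assumes the 0.1/2 reading of the coefficient, and `tau` / `lam_tackle` are placeholder defaults.

```python
import numpy as np

def running_loss(U1, U2, tau=0.1, lam_tackle=1.0, p_tack=0.0):
    # (0.1/2) * tau * (||U1||^2 - ||U2||^2) + lam_tackle * p_tackle,
    # with R1 = R2 = I so vec(U)^T R vec(U) = sum of squared entries.
    quad = 0.05 * tau * ((U1**2).sum() - (U2**2).sum())
    return quad + lam_tackle * p_tack
```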
Terminal payoffs: power-push (RB) vs. QB throw.
The hidden type 𝑖^⋆ ∈ {0, 1} selects the objective. For the power-push run (𝑖^⋆ = 0), let (𝑥_rb, 𝑦_rb) denote the RB coordinates and 𝛼_in = −0.8. The terminal loss is

$$L_{\text{term}}^{\text{run}} = -\big(x_{\text{rb}} + \alpha_{\text{in}} |y_{\text{rb}}|\big),$$

which rewards downfield progress while softly encouraging an inside lane. For the QB throw (𝑖^⋆ = 1), we reward the deepest downfield offensive player, regardless of role:

$$L_{\text{term}}^{\text{throw}} = -\max_{i \in \{1,\dots,N\}} X_{i,x}^{(1)}.$$
The implemented terminal function is

$$L_{\text{term}} = \begin{cases} L_{\text{term}}^{\text{run}}, & i^\star = 0,\\[2pt] L_{\text{term}}^{\text{throw}}, & i^\star = 1. \end{cases}$$

The overall zero-sum loss is the sum of running losses over 𝑘 = 0, …, 𝐾 − 1 plus 𝐿_term.
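The type-selected terminal loss can be sketched as below; `ALPHA_IN` follows the text, while `rb_idx` is an assumed index of the RB within the offense array.

```python
import numpy as np

ALPHA_IN = -0.8  # inside-lane weight from the text

def terminal_loss(X1, i_star, rb_idx=0):
    # X1: (N, 2) offense positions, column 0 downfield (x), column 1 lateral (y).
    if i_star == 0:
        # Power-push run: reward RB downfield progress, softly favor an inside lane.
        x_rb, y_rb = X1[rb_idx]
        return -(x_rb + ALPHA_IN * abs(y_rb))
    else:
        # QB throw: reward the deepest downfield offensive player, regardless of role.
        return -X1[:, 0].max()
```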
Initial lineup.
For 𝑁 = 11, we instantiate a realistic I-formation offense against a 4–3 base defense in a normalized field window. Coordinates use 𝑥 as downfield (increasing towards the defense) and 𝑦 as lateral. The offense aligns its line on the line of scrimmage at 𝑥 = LINEUP_OFF_X, with O-line 𝑦 coordinates {−0.80, −0.40, 0.00, 0.40, 0.80} labelled LT, LG, C, RG, RT, a tight end at 𝑦 = 1.10 (right), wide receivers at 𝑦 = ±1.45 at the same 𝑥, a quarterback at 𝑥 = LINEUP_OFF_X − 0.20, a fullback at 𝑥 = LINEUP_OFF_X − 0.30, 𝑦 = 0.20, and the running back at 𝑥 = LINEUP_OFF_X − 0.40, 𝑦 = 0.00. The defense places a four-man line at 𝑥 = LINEUP_DEF_X with 𝑦 ∈ {−0.60, −0.20, 0.20, 0.60}, three linebackers at 𝑥 = LINEUP_DEF_X − 0.15, 𝑦 ∈ {−0.80, 0.00, 0.80}, cornerbacks slightly pressed at 𝑥 = LINEUP_DEF_X + 0.05, 𝑦 = ±1.45, and two safeties deep at 𝑥 = LINEUP_DEF_X − 0.45, 𝑦 ∈ {−0.90, 0.90}.
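The lineup can be assembled directly from the coordinates above. `LINEUP_OFF_X` and `LINEUP_DEF_X` are assumed normalized values (the text does not give them), and the QB's lateral coordinate is assumed to be 0.

```python
import numpy as np

LINEUP_OFF_X, LINEUP_DEF_X = -1.0, -0.7  # assumed normalized scrimmage positions

def initial_lineup():
    # Offense: I-formation.
    off = [(LINEUP_OFF_X, y) for y in (-0.80, -0.40, 0.00, 0.40, 0.80)]   # LT LG C RG RT
    off += [(LINEUP_OFF_X, 1.10)]                                          # TE (right)
    off += [(LINEUP_OFF_X, -1.45), (LINEUP_OFF_X, 1.45)]                   # WRs
    off += [(LINEUP_OFF_X - 0.20, 0.00)]                                   # QB (y assumed 0)
    off += [(LINEUP_OFF_X - 0.30, 0.20)]                                   # FB
    off += [(LINEUP_OFF_X - 0.40, 0.00)]                                   # RB
    # Defense: 4-3 base.
    de = [(LINEUP_DEF_X, y) for y in (-0.60, -0.20, 0.20, 0.60)]           # D-line
    de += [(LINEUP_DEF_X - 0.15, y) for y in (-0.80, 0.00, 0.80)]          # LBs
    de += [(LINEUP_DEF_X + 0.05, -1.45), (LINEUP_DEF_X + 0.05, 1.45)]      # CBs
    de += [(LINEUP_DEF_X - 0.45, y) for y in (-0.90, 0.90)]                # safeties
    return np.array(off), np.array(de)
```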
Generated on Thu Oct 9 17:20:21 2025 by LaTeXML