Title: Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers
URL Source: https://arxiv.org/html/2106.09435
License: arXiv.org perpetual non-exclusive license. arXiv:2106.09435v3 [cs.MA] 18 Apr 2024

Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers

Luke Marris, Paul Muller, Marc Lanctot, Karl Tuyls, Thore Graepel

Abstract
Two-player, constant-sum games are well studied in the literature, but there has been limited progress outside of this setting. We propose Joint Policy-Space Response Oracles (JPSRO), an algorithm for training agents in n-player, general-sum extensive form games, which provably converges to an equilibrium. We further suggest correlated equilibria (CE) as promising meta-solvers, and propose a novel solution concept Maximum Gini Correlated Equilibrium (MGCE), a principled and computationally efficient family of solutions for solving the correlated equilibrium selection problem. We conduct several experiments using CE meta-solvers for JPSRO and demonstrate convergence on n-player, general-sum games.
Keywords: Machine Learning, ICML

1 Introduction

Recent success in tackling two-player, constant-sum games (Silver et al., 2016; Vinyals et al., 2019) has outpaced progress in n-player, general-sum games despite a lot of interest (Jaderberg et al., 2019; Berner et al., 2019; Brown & Sandholm, 2019; Lockhart et al., 2020; Gray et al., 2020; Anthony et al., 2020). One reason is that Nash equilibrium (NE) (Nash, 1951) is tractable and interchangeable in the two-player, constant-sum setting, but becomes intractable (Daskalakis et al., 2009) and potentially non-interchangeable in n-player and general-sum settings. The problem of selecting from multiple solutions is known as the equilibrium selection problem (Goldberg et al., 2013; Avis et al., 2010; Harsanyi & Selten, 1988).
Outside of normal form (NF) games, this problem setting arises in multi-agent training when dealing with empirical games (also called meta-games), where a game payoff tensor is populated with expected outcomes between agents playing an extensive form (EF) game, for example the StarCraft League (Vinyals et al., 2019) and Policy-Space Response Oracles (PSRO) (Lanctot et al., 2017), a recent variant of which reached state-of-the-art results in Stratego Barrage (McAleer et al., 2020).
In this work we propose using correlated equilibrium (CE) (Aumann, 1974) and coarse correlated equilibrium (CCE) as a suitable target equilibrium space for n-player, general-sum games. The (C)CE solution concept has two main benefits over NE: firstly, it provides a mechanism for players to correlate their actions to arrive at mutually higher payoffs, and secondly, it is computationally tractable to compute solutions for n-player, general-sum games (Daskalakis et al., 2009). We provide a tractable approach to select from the space of (C)CEs (MG), and a novel training framework that converges to this solution (JPSRO). The result is a set of tools for theoretically solving any complete information multi-agent problem. These tools are amenable to scaling approaches, including utilizing reinforcement learning, function approximation, and online solution solvers; however, we leave this to future work.
In Section 2 we provide background on a) correlated equilibrium (CE), an important generalization of NE, b) coarse correlated equilibrium (CCE) (Moulin & Vial, 1978), a similar solution concept, and c) PSRO, a powerful multi-agent training algorithm. In Section 3 we propose novel solution concepts called Maximum Gini (Coarse) Correlated Equilibrium (MG(C)CE) and in Section 4 we thoroughly explore its properties including tractability, scalability, invariance, and a parameterized family of solutions. In Section 5 we propose a novel training algorithm, Joint Policy-Space Response Oracles (JPSRO), to train policies on n-player, general-sum extensive form games. JPSRO requires the solution of a meta-game, and we propose using MG(C)CE as a meta-solver. We prove that the resulting algorithm converges to a normal form (C)CE in the extensive form game. In Section 6 we conduct an empirical study and show convergence rates and social welfare across a variety of games including n-player, general-sum, and common-payoff games.
An important area of related work is α-Rank (Omidshafiei et al., 2019), which also aims to provide a tractable alternative solution in normal form games. It gives similar solutions to NE in the two-player, constant-sum setting; however, it is not directly related to NE or (C)CE. α-Rank has also been applied to ranking agents and as a meta-solver for PSRO (Muller et al., 2020). MG(C)CE is inspired by Maximum Entropy Correlated Equilibria (MECE) (Ortiz et al., 2007), an entropy-maximizing CE based on Shannon's entropy, which is harder to compute than Gini impurity.
Another important area of related work concerns optimization based approaches (von Stengel & Forges, 2008; Dudik & Gordon, 2012; Farina et al., 2019a) and no-regret approaches (Celli et al., 2019, 2020; Morrill et al., 2021). These approaches identify specific subsets or supersets of (C)CE in the extensive-form game by constructing constraint programs or by local regret minimization using the full representation of the information state space. In contrast, the oracle approach can iteratively identify meta-games with smaller support that summarize the strategic complexity of the game compactly.
2 Preliminaries
This section introduces correlated equilibrium and the multi-agent training algorithm PSRO.
2.1 Correlated Equilibrium
Each player, $p$, in a game has a set of actions $a_p \in \mathcal{A}_p$ (also known as pure strategies) available to it. Let $N$ be the number of players in a game. Let $\mathcal{A} = \otimes_p \mathcal{A}_p$ be the joint action space and $a = (a_1, \dots, a_N) \in \mathcal{A}$ be a joint action.

Let us index quantities relating to all players apart from player $p$ as $-p = \{1, \dots, p-1, p+1, \dots, N\}$. Let $\sigma(a) = \sigma(a_p, a_{-p})$ be the probability that joint action $a \in \mathcal{A}$ is played in a game. Let $\sigma(a_p) = \sum_{a_{-p} \in \mathcal{A}_{-p}} \sigma(a_p, a_{-p})$ be the marginal probability of player $p$ taking action $a_p \in \mathcal{A}_p$. Let $\sigma(a_{-p} | a_p)$ be the conditional probability that the other players play $a_{-p} \in \mathcal{A}_{-p}$ when player $p$ plays $a_p \in \mathcal{A}_p$. $\sigma$ without arguments should be interpreted as a vector of size $[|\mathcal{A}|]$.

Let $G_p : \mathcal{A} \to \mathbb{R}$ be the payoff function for player $p$ when players play the joint action $a \in \mathcal{A}$. The full game payoff $G$ can therefore be defined by a tensor of shape $[N, |\mathcal{A}_1|, \dots, |\mathcal{A}_N|]$. A normal form game is defined by the tuple $\mathcal{G} = (G, \mathcal{A})$.

Define $A_p(a_p', a_p, a_{-p}) = G_p(a_p', a_{-p}) - G_p(a_p, a_{-p})$ as the advantage of player $p$ switching action from $a_p$ to $a_p'$, when other players play $a_{-p}$. This can be represented as a matrix, $A_p$, of shape $[|\mathcal{A}_p|(|\mathcal{A}_p| - 1), |\mathcal{A}|]$, since we do not need to compare an action with itself. The matrix is sparse and a fraction of $\frac{1}{|\mathcal{A}_p|}$ of its elements are non-zero. We use $A$, with shape $[\sum_p |\mathcal{A}_p|(|\mathcal{A}_p| - 1), |\mathcal{A}|]$, to denote the concatenation of the $A_p$ into a two-dimensional matrix.

A correlated equilibrium (CE) is a joint mixed strategy $\sigma(a)$ such that no player $p$ has payoff to gain from unilaterally choosing to play another action $a_p'$ instead of $a_p$. An approximate correlated equilibrium ($\epsilon$-CE) is one where the gain from switching actions is no more than $\epsilon$. When $\epsilon = 0$, the standard CE is recovered. This relationship is described mathematically in Equation (1), $\forall p \in \mathcal{P}, a_p' \neq a_p \in \mathcal{A}_p$. In matrix form, we can simply write $A \sigma \leq \epsilon$.

$$\sum_{a_{-p}} \sigma(a_{-p}, a_p) \, A_p(a_p', a_p, a_{-p}) \leq \epsilon \tag{1}$$
Together, $A_p$ and $\epsilon$ represent the CE linear inequality constraints. Mathematically these are equations of a plane, and separate the joint action space $\sigma(a)$ into half-spaces. Together these half-spaces intersect to form a convex polytope of valid CE solutions.
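As a concrete illustration, the stacked advantage matrix $A$ and the feasibility check $A \sigma \leq \epsilon$ can be sketched in a few lines. This is a minimal sketch for a two-player game; the payoff numbers encode the traffic lights game discussed later in this section, and the encoding `G[p][a1][a2]` and helper name are our own:

```python
import numpy as np

# Traffic lights payoffs (our own encoding): action 0 = go, 1 = wait.
G = np.array([
    [[-10.0, 1.0], [0.0, 0.0]],   # player 1 (row) payoffs
    [[-10.0, 0.0], [1.0, 0.0]],   # player 2 (column) payoffs
])
n_act = [2, 2]

def ce_advantage_matrix(G, n_act):
    """Stack the per-player advantage matrices A_p into A (CE constraints)."""
    rows = []
    for p in range(2):
        for a in range(n_act[p]):            # recommended action a_p
            for a_dev in range(n_act[p]):    # deviation action a_p'
                if a_dev == a:
                    continue
                row = np.zeros(n_act[0] * n_act[1])
                for b in range(n_act[1 - p]):  # other player's action
                    joint = (a, b) if p == 0 else (b, a)
                    dev = (a_dev, b) if p == 0 else (b, a_dev)
                    col = joint[0] * n_act[1] + joint[1]
                    # advantage of switching a_p -> a_p' at this joint action
                    row[col] = G[p][dev] - G[p][joint]
                rows.append(row)
    return np.array(rows)

A = ce_advantage_matrix(G, n_act)
# Joint distributions over (GG, GW, WG, WW):
coin_flip = np.array([0.0, 0.5, 0.5, 0.0])  # flip a coin for who goes
uniform = np.full(4, 0.25)
print(A @ coin_flip)  # every entry <= 0: the coin flip is a valid CE
print(A @ uniform)    # has positive entries: uniform is not a CE
```

Note that each row of `A` here has only 2 of 4 entries non-zero, matching the $\frac{1}{|\mathcal{A}_p|}$ sparsity fraction stated above.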
Of special interest are valid CEs that factorize into their marginals, $\sigma(a) = \prod_p \sigma(a_p)$; these correspond to NE solutions. All NEs are also CEs. Since an NE always exists (when there are finite players and actions) (Nash, 1951), a CE always exists. An NE is always on the boundary of the polytope for non-trivial games (Nau et al., 2004). Any convex combination of CEs is also a CE.
CEs provide a richer set of solutions than NEs. The maximum sum of social welfare over CEs is at least that of any NE. In particular, this allows more intuitive solutions to anti-coordination games such as chicken and traffic lights. Consider the traffic lights example: a symmetric, general-sum, two-player game consisting of two actions, go ($G$) and wait ($W$). $(G, G)$ results in a crash, in $(W, W)$ no progress is made, and $(G, W)$ and $(W, G)$ result in progress for one of the players. Figure 1 shows the NE and CE solution space for the traffic lights game. The mixed NE solution $(G, W) = (\frac{1}{11}, \frac{10}{11})$ is clearly unsatisfactory ($\frac{1}{121}$ crashing and $\frac{100}{121}$ waiting). One could argue that the best solution is to have players flip a coin to decide who waits and who goes. It turns out that this solution is a valid CE and is in fact the unique solution of $\min \epsilon$-MGCE, a novel solution concept that we introduce later in Section 4.3.
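The mixed NE statistics quoted above are easy to verify with exact arithmetic (a quick check; the payoff encoding is our own):

```python
from fractions import Fraction

# Each player independently plays Go with probability 1/11, Wait with 10/11.
g = Fraction(1, 11)
w = 1 - g

crash = g * g        # both go: the (-10, -10) outcome
both_wait = w * w    # no progress is made

# Payoff to a player: -10 on a crash, 1 when going while the opponent waits.
value_go = g * (-10) + w * 1    # expected payoff of playing Go vs the mixture
value_wait = Fraction(0)        # Wait always yields 0
expected = g * value_go + w * value_wait

# value_go == value_wait: the indifference condition confirming a mixed NE.
print(crash, both_wait, expected)  # 1/121, 100/121, 0
```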
Figure 1: The solution landscape for the traffic lights game. Panel (a) shows the CE convex polytope [figure omitted]. The remaining panels give joint distributions over $(G, W) \times (G, W)$ (rows: row player; columns: column player), with player payoffs in parentheses:

(b) Payoff table: $(G,G) = (-10, -10)$; $(G,W) = (1, 0)$; $(W,G) = (0, 1)$; $(W,W) = (0, 0)$.
(c) MECE $(0, 0)$: $[0.033, 0.334; 0.334, 0.299]$.
(d) Mixed NE $(0, 0)$: $[\frac{1}{121}, \frac{10}{121}; \frac{10}{121}, \frac{100}{121}]$.
(e) Row NE $(1, 0)$: $[0, 1; 0, 0]$.
(f) Col NE $(0, 1)$: $[0, 0; 1, 0]$.
(g) MGCE $(0, 0)$: $[0.033, 0.327; 0.327, 0.313]$.
(h) F-MGCE $(0.34, 0.34)$: $[0, 0.34; 0.34, 0.32]$.
(i) MWCE $(0.5, 0.5)$: $[0, 0.5; 0.5, 0]$.

The solid polytope shows the space of CE joint strategies, and the dotted surface shows factorizable joint strategies. NEs are where the surface and polytope intersect. There are three unsatisfying NEs: mixed spends most of its time waiting and does not avoid crashing; the others favour only the row or column player. One MWCE provides a better solution (note that Row NE and Col NE, and any mixture of the two, are also MWCE solutions). The center of the tetrahedron is the uniform distribution, and the MECE and MGCE attempt to be near this point. The dashed lines correspond to the family of solutions permitted by MGCE and MECE when varying the approximation parameter $\epsilon$. Both have $(GW, WG) = (0.5, 0.5)$ as the $\min \epsilon$ solution. Player payoffs are given in parentheses.
Correlation is achieved via a trusted external entity (correlation device) which samples a joint action from a public CE joint distribution. Each player is given their action in secret. The properties of the CE mean that no individual player is motivated to deviate from the suggested action. If there are deviation actions with equal payoff available, the distribution is a weak equilibrium. If instead the suggested actions are strictly better than the alternatives, the distribution is a strict equilibrium. Distributions that produce actions that are not better than all alternatives are called approximate equilibria, and the maximum gain that can be obtained by an agent unilaterally deviating from any suggested action is described by $\epsilon > 0$. Weak and strict equilibria have associated gain $\epsilon = 0$ and $\epsilon < 0$, respectively. Mathematically, the effect of reducing $\epsilon$ is to shrink the volume of the CE polytope. We can choose $\epsilon$ when solving for a CE, and show in Section 4.3 how $\epsilon$ can be used to parameterise a family of MGCE solutions.
There are two important solution concepts in the space of CEs. The first is Maximum Welfare Correlated Equilibrium (MWCE), which is defined as the CE that maximises the sum of all players' payoffs. An MWCE can be obtained by solving a linear program; however, the MWCE may not be unique and therefore does not fully solve the equilibrium selection problem (e.g. constant-sum game solutions all have equal payoff). The second is Maximum Entropy Correlated Equilibrium (MECE) (Ortiz et al., 2007), which maximises Shannon's entropy (Shannon, 1948) as an objective. MECE also shares some interesting properties with MGCE, such as computational scalability when the solution is full-support (positive probability mass everywhere). Drawbacks of this approach are that the literature does not provide algorithms for when the solution is general-support (non-negative probability), and that maximising Shannon's entropy can be complex.
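Since an MWCE is the solution of a linear program, any LP solver can compute one. A minimal sketch for the traffic lights game: the CE constraint rows are derived by hand for this 2x2 game, and SciPy's `linprog` stands in for a generic LP solver:

```python
import numpy as np
from scipy.optimize import linprog

# CE constraint rows A @ sigma <= 0 for the traffic lights game,
# joint actions ordered (GG, GW, WG, WW); derived by hand.
A = np.array([
    [10.0, -1.0, 0.0, 0.0],    # row player: recommended G, deviate to W
    [0.0, 0.0, -10.0, 1.0],    # row player: recommended W, deviate to G
    [10.0, 0.0, -1.0, 0.0],    # column player: recommended G, deviate to W
    [0.0, -10.0, 0.0, 1.0],    # column player: recommended W, deviate to G
])
welfare = np.array([-20.0, 1.0, 1.0, 0.0])  # summed payoffs per joint action

res = linprog(
    c=-welfare,                        # linprog minimises, so negate welfare
    A_ub=A, b_ub=np.zeros(4),          # epsilon = 0: exact CE
    A_eq=np.ones((1, 4)), b_eq=[1.0],  # probabilities sum to one
    bounds=[(0, 1)] * 4,
)
print(res.x, -res.fun)  # an MWCE with welfare 1.0
```

Note the optimum is not unique: any distribution splitting all mass between $(G, W)$ and $(W, G)$ attains welfare 1, which is exactly the non-uniqueness drawback mentioned above.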
Finally, coarse correlated equilibrium (CCE) (Moulin & Vial, 1978) (Section A.3) is a simpler solution concept that contains CE as a subset: NE ⊆ CE ⊆ CCE. Intuitively, a joint distribution is a CCE if no player wishes to deviate before receiving a recommendation signal.
The solution concepts discussed so far apply to normal form (NF) games, and are therefore sometimes prefixed as such in the literature (NFCE and NFCCE) to disambiguate them from their extensive form (EF) counterparts (EFCE (von Stengel & Forges, 2008) and EFCCE (Farina et al., 2019a)). This distinction is important because, although EF solutions are a natural choice in EF games, NF solutions can also be applied in EF games by using whole policies $\pi_p \in \Pi_p$ in place of actions $a_p \in \mathcal{A}_p$. These solutions are subsets of one another: NFCE ⊆ EFCE ⊆ EFCCE ⊆ NFCCE (von Stengel & Forges, 2008); therefore NFCE is the most restrictive correlation device, while NFCCE is the least restrictive and is therefore capable of achieving the highest welfare. The best correlation device to use is a matter of debate in the literature. However, we note that NF solutions are interesting in EF games because a) they permit the highest welfare, and b) they only require communicating recommendations once before the game starts (as opposed to EF(C)CEs, which require communication at every timestep). (J)PSRO trains sets of policies and converges to an NF equilibrium. Therefore, all equilibria discussed in this work are NF and we do not use a prefix going forward.
2.2 Policy-Space Response Oracles (PSRO)
Policy-Space Response Oracles (PSRO) (Lanctot et al., 2017) (Algorithm 1) is an iterative population based training method for multi-agent learning that generalizes other well known algorithms such as fictitious play (FP) (Brown, 1951), fictitious self play (FSP) (Heinrich et al., 2015) and double oracle (DO) (McMahan et al., 2003).
PSRO finds a set of policies, $(\pi_p \in \Pi_p)_{p=1..N}$, and a distribution over this set for each player, $(\sigma_p)_{p=1..N}$. The distribution converges to an NE in two-player, zero-sum games, and PSRO has recently been extended to converge to other types of equilibria (Muller et al., 2020; McAleer et al., 2021). This work is in line with these developments, studying convergence of a variant of PSRO with joint policy distributions and (C)CE meta-solvers in n-player, general-sum games.
PSRO consists of a response oracle that estimates the best response (BR) to a joint distribution of policies. Commonly the response oracle is either a reinforcement learning (RL) agent or a method that computes the exact BR. The component that determines the distribution of policies that the oracle responds to is called the meta-solver (MS). The MS operates on the meta-game (MG), which is a payoff tensor estimated by measuring the expected return (ER) of policies against one another. This is an NF game, but instead of strategies corresponding to actions, $a$, they correspond to policies, $\pi$. The set of deterministic policies can be huge and the set of stochastic policies is infinite; therefore PSRO only considers a subset of game policies: the ones found by the BR over all iterations so far. Different MSs result in different algorithms: the uniform distribution results in FSP, and the NE distribution results in an extension of DO.
3 MG(C)CE and its Computation
The set of (C)CEs forms a convex polytope, and therefore any strictly convex function could uniquely select amongst this set. The literature provides only one such example: MECE (Ortiz et al., 2007), which has a number of appealing properties but was found to be slow to solve for large games. There is a gap in the literature for a more tractable approach, and we propose to use the Gini impurity (GI) (Breiman et al., 1984; Bishop, 2006). GI is a member of the Tsallis entropy family, a generalized entropy that is equivalent to GI under a certain parameterization. It is maximized when the probability mass function is uniform, $\sigma = \frac{1}{|\mathcal{A}|}$, and minimized when all mass is on a single outcome. GI is popular in decision tree classification algorithms because it is easy to compute (Breiman et al., 1984). We call the resulting solution concept Maximum Gini (Coarse) Correlated Equilibrium (MG(C)CE). This approach has connections to maximum margin (Cortes & Vapnik, 1995) and maximum entropy (Jaynes, 1957). The derivations (Section C.2) follow standard optimization theory.
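For intuition, Gini impurity is cheap to evaluate; an illustrative check that it is maximized at the uniform distribution and minimized at a pure joint action (the example distributions are our own choices):

```python
def gini_impurity(sigma):
    """Gini impurity 1 - sigma^T sigma of a probability vector."""
    return 1.0 - sum(s * s for s in sigma)

uniform = [0.25, 0.25, 0.25, 0.25]
pure = [1.0, 0.0, 0.0, 0.0]
coin_flip = [0.0, 0.5, 0.5, 0.0]

print(gini_impurity(uniform))    # 0.75: the maximum over 4 outcomes
print(gini_impurity(pure))       # 0.0: all mass on a single outcome
print(gini_impurity(coin_flip))  # 0.5
```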
3.1 Quadratic Program
The Gini impurity is defined as $1 - \sigma^T \sigma$, and the MG(C)CE is denoted $\sigma^*$. We use an equivalent standard form objective, $-\frac{1}{2} \sigma^T \sigma$. The most basic form of the problem can be expressed directly as a quadratic program (QP), consisting of a quadratic objective function (Equation (2)) and linear constraints (Equations (3) and (4)).

Gini objective:
$$\max_\sigma \; -\tfrac{1}{2} \sigma^T \sigma \quad \text{s.t.} \tag{2}$$

(C)CE constraints:
$$A_p \sigma \leq \epsilon \quad \forall p \tag{3}$$

Probability constraints:
$$\sigma \geq 0, \quad e^T \sigma = 1 \tag{4}$$

where $e$ is a constant vector of ones.
QPs are a well studied problem class and many techniques may be used to solve them, including convex and quadratic optimization software, such as CVXPY (Diamond & Boyd, 2016; Agrawal et al., 2018) and OSQP (Stellato et al., 2020).
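The text names CVXPY and OSQP; any QP solver works. Below is a dependency-light sketch using SciPy's SLSQP on the traffic lights game, with the CE constraint rows derived by hand. The solution matches Figure 1(g); the closed form $(\frac{7}{214}, \frac{70}{214}, \frac{70}{214}, \frac{67}{214}) \approx (0.033, 0.327, 0.327, 0.313)$ is our own derivation from the active constraints:

```python
import numpy as np
from scipy.optimize import minimize

# CE constraint rows A @ sigma <= 0 for the traffic lights game,
# joint actions ordered (GG, GW, WG, WW); derived by hand.
A = np.array([
    [10.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, -10.0, 1.0],
    [10.0, 0.0, -1.0, 0.0],
    [0.0, -10.0, 0.0, 1.0],
])

# Equations (2)-(4): minimize 0.5 sigma^T sigma over the CE polytope.
res = minimize(
    fun=lambda s: 0.5 * s @ s,
    jac=lambda s: s,
    x0=np.full(4, 0.25),                                 # start at uniform
    method="SLSQP",
    bounds=[(0.0, 1.0)] * 4,                             # sigma >= 0
    constraints=[
        {"type": "ineq", "fun": lambda s: -(A @ s)},     # A sigma <= 0
        {"type": "eq", "fun": lambda s: s.sum() - 1.0},  # sums to one
    ],
)
print(res.x)  # ~ (0.033, 0.327, 0.327, 0.313): the MGCE of Figure 1(g)
```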
3.2 Primal and Dual Forms
The primal objective that we wish to optimize is $\min_\sigma \max_{\alpha, \beta, \lambda} L^P(\sigma, \alpha, \beta, \lambda) = L^P_{\alpha, \beta, \lambda}$, where $L^P_{\alpha, \beta, \lambda}$ is the primal Lagrangian function, $\alpha_p \geq 0$ are the dual variable vectors corresponding to the per-player (C)CE inequality constraints (Equation (3)), $\beta \geq 0$ is the dual variable vector corresponding to the distribution inequality constraints (Equation (4)), and $\lambda$ is the dual variable corresponding to the distribution equality constraint (Equation (4)). By augmenting the dual variables, $\alpha = [\alpha_1, \dots, \alpha_N]$, and the constraints matrix, $A = [A_1, \dots, A_N]$, we can write the primal objective compactly as:

$$L^P_{\alpha, \beta, \lambda} = \tfrac{1}{2} \sigma^T \sigma + \alpha^T (A \sigma - \epsilon) - \beta^T \sigma + \lambda (e^T \sigma - 1), \tag{5}$$

where the constant vector of ones with appropriate size is denoted by $e$, and $\epsilon$ is a vector populated with the approximation parameter. We can also formulate a simplified dual version of the optimization as:

$$L^D_{\alpha, \beta} = -\tfrac{1}{2} \alpha^T A C A^T \alpha + u^T A^T \alpha - \epsilon^T \alpha - \tfrac{1}{2} \beta^T C \beta - u^T \beta + \alpha^T A C \beta + \tfrac{1}{2} u^T u, \tag{6}$$

where $C = I - e u^T$ normalizes by the mean, and $u = \frac{1}{|\mathcal{A}|} e$ is the uniform vector. The optimal primal solution $\sigma^*$ can be recovered from the optimal dual variables $\alpha^*$ and $\beta^*$ using

$$\sigma^* = u - C A^T \alpha^* + C \beta^*. \tag{7}$$
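The matrix $C$ appearing above is just the mean-removing (centering) operator; a small numerical check of the properties used in the derivation (the dimension and test vector are chosen arbitrarily):

```python
import numpy as np

n = 4
e = np.ones(n)                   # vector of ones
u = e / n                        # uniform vector u = e / |A|
C = np.eye(n) - np.outer(e, u)   # C = I - e u^T

x = np.array([0.4, 0.3, 0.2, 0.1])
print(C @ x)                  # x with its mean subtracted from every entry
print(C @ e)                  # constant vectors are annihilated
print(np.allclose(C @ C, C))  # C is a projector onto zero-mean vectors
```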
The full-support assumption states that all joint probabilities have some positive mass, $\sigma > 0$. In this scenario, the dual variable vector corresponding to the non-negative probability constraint is zero, $\beta = 0$. Therefore we can define simplified primal and dual objectives:

$$L^P_{\alpha, \lambda} = \tfrac{1}{2} \sigma^T \sigma + \alpha^T (A \sigma - \epsilon) + \lambda (e^T \sigma - 1) \tag{8}$$

$$L^D_\alpha = -\tfrac{1}{2} \alpha^T A C A^T \alpha + u^T A^T \alpha - \epsilon^T \alpha + \tfrac{1}{2} u^T u \tag{9}$$

$$\sigma^* = u - C A^T \alpha^* \tag{10}$$

4 Properties of MG(C)CE
In this section we discuss some of the properties of $\epsilon$-MG(C)CE. Section C contains the proofs for this section.
4.1 Equilibrium Selection Problem
There are two levels of coordination: the first is selecting an equilibrium before play commences, and the second is selecting actions during play. Both NEs and (C)CEs require agreement on which equilibrium is being played (Goldberg et al., 2013; Avis et al., 2010; Harsanyi & Selten, 1988): for (C)CEs this is a joint action probability distribution, and for NEs this is also a joint action probability distribution, one that conveniently factors into stochastic strategies for each player. Therefore, at this level of coordination, both NEs and (C)CEs are similar. We refer to this coordination problem as the equilibrium selection problem (Harsanyi & Selten, 1988). At action selection time, only (C)CEs require further coordination. NEs are factorizable, so players can sample independently without further coordination. (C)CEs rely on a central correlation device that recommends actions from the equilibrium that was previously agreed upon.

This means that neither NEs nor (C)CEs can be directly used prescriptively in n-player, general-sum games. These solution concepts specify which subsets of joint strategies are in equilibrium, but do not specify how decentralized agents should select amongst them. Furthermore, the presence of a correlation device does not make (C)CEs prescriptive, because the agents still need a mechanism to agree on the distribution the correlation device samples from. This coordination problem can be cast as one that is more computational in nature: what rules allow an equilibrium to be uniquely (and perhaps de-centrally) selected?
This highlights the main drawback of MW(C)CE, which does not select unique solutions (for example, in constant-sum games all solutions have maximum welfare). One selection criterion for NEs is maximum entropy Nash equilibrium (MENE) (Balduzzi et al., 2018); however, outside of the two-player, constant-sum setting, these are generally not easy to compute (Daskalakis et al., 2009). CEs exist in a convex polytope, so any strictly convex function can select among them. Maximum entropy correlated equilibrium (MECE) (Ortiz et al., 2007) is limited to full-support solutions, which may not exist when $\epsilon = 0$, and can be hard to solve in practice. Therefore, there is a gap in the literature for a computationally tractable, unique solution concept, and this work proposes that MG(C)CE fills this gap.
Theorem 1 (Uniqueness and Existence).
MG(C)CE provides a unique solution to the equilibrium selection problem and always exists.
4.2 Scalable Representation
MG(C)CE can provide solutions with general support and, similar to MECE, MG(C)CE permits a scalable representation when the solution is full-support. Under this scenario, the distribution inequality dual variables, $\beta$, are inactive and equal to zero, so they can be dropped, and the $\alpha$ variables can fully parameterize the solution.
Theorem 2 (Scalable Representation).
The MG(C)CE, $\sigma^*$, has the following forms:

General support: $\sigma^* = u - C A^T \alpha^* + C \beta^*$ (11)

Full support: $\sigma^* = u - C A^T \alpha^*$ (12)

where $e$ is a vector of ones, $|\mathcal{A}| = \prod_p |\mathcal{A}_p|$, $C = I - e u^T$, and $u = \frac{1}{|\mathcal{A}|} e$ are constants. $\alpha^* \geq 0$ and $\beta^* \geq 0$ are the optimal dual variables of the solution, corresponding to the (C)CE and distribution inequality constraints respectively.

Let $|\mathcal{A}_p|$ be the number of actions available to player $p$; the total number of joint actions is $|\mathcal{A}| = \prod_p |\mathcal{A}_p|$. For each element of $\sigma$ there is a corresponding $\beta$ dual variable. The number of $\alpha$ dual variables is no more than the number of pair permutations, $\sum_p |\mathcal{A}_p|(|\mathcal{A}_p| - 1)$, for CEs, or the number of actions, $\sum_p |\mathcal{A}_p|$, for CCEs. Clearly, games with three or more players and many actions have $\sum_p |\mathcal{A}_p|(|\mathcal{A}_p| - 1) \ll \prod_p |\mathcal{A}_p|$ for CEs and $\sum_p |\mathcal{A}_p| \ll \prod_p |\mathcal{A}_p|$ for CCEs, allowing a very scalable parameterization if the full-support assumption holds. Furthermore, the optimal $\alpha^*$ are sparse, so we can discard rows from $A$, in a similar spirit to SVMs (Cortes & Vapnik, 1995).
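To make the scaling concrete, a back-of-the-envelope count for a hypothetical game with three players and ten actions each:

```python
import math

n_actions = [10, 10, 10]  # |A_p| for each of 3 players

joint = math.prod(n_actions)                      # |A| = 1000 entries in sigma
ce_alphas = sum(n * (n - 1) for n in n_actions)   # pair permutations: 270
cce_alphas = sum(n_actions)                       # one per action: 30

print(joint, ce_alphas, cce_alphas)  # 1000 270 30
```

With full support, the CCE parameterization needs only 30 dual variables instead of 1000 joint probabilities, and the gap widens rapidly with more players.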
For CEs, full-support is not possible when an action is strictly dominated by another. This case can be easily mitigated by iterated elimination of strictly dominated strategies (IESDS) (Fudenberg & Tirole, 1991). This also has the desirable property of simplifying the optimization. In a similar argument, when actions are repeated (having the same payoffs), only one need be retained with appropriate modifications to the optimization.
Among the set of $\epsilon$-MG(C)CEs there always exists one with full support. Note that any infinitesimal positive $\epsilon$ permits a full-support (C)CE, but the $\epsilon$-MG(C)CE does not necessarily select it. An upper bound on the $\epsilon$ which permits a full-support solution is given by Theorem 3.

Theorem 3 (Existence of Full-Support $\epsilon$-MG(C)CE).
For all games, there exists an $\epsilon \leq \max(Au)$ such that a full-support $\epsilon$-MG(C)CE exists. A uniform solution, $u$, always exists when $\max(Au) \leq \epsilon$. When $\epsilon < \max(Au)$, the solution is non-uniform.
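Theorem 3's bound is easy to evaluate numerically. For the traffic lights game (CE constraint rows derived by hand), $\max(Au) = 2.25$, so the uniform distribution is itself a valid $\epsilon$-CE for any $\epsilon \geq 2.25$:

```python
import numpy as np

# CE constraint rows for the traffic lights game (order GG, GW, WG, WW).
A = np.array([
    [10.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, -10.0, 1.0],
    [10.0, 0.0, -1.0, 0.0],
    [0.0, -10.0, 0.0, 1.0],
])
u = np.full(4, 0.25)       # uniform joint distribution

eps_bound = (A @ u).max()  # max(A u)
print(eps_bound)           # 2.25: u is an epsilon-CE for any epsilon >= 2.25
```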
4.3 Family of Solutions
$\epsilon$-MG(C)CE provides an intuitive way to control the strictness of the equilibrium via the approximation parameter, $\epsilon$, which parameterizes a family of unique solutions. Positive $\epsilon$ expands the solution set and results in a higher Gini impurity solution, at the expense of lower payoff and an approximate equilibrium. Negative $\epsilon$ shrinks the solution set to achieve a strict equilibrium and higher payoff, at the expense of Gini impurity. This may also be a more robust solution (Wald, 1939, 1945; Ben-Tal et al., 2009) if the payoff is uncertain.
It is worth emphasizing a set of particularly interesting solutions within this family. Firstly, the standard MG(C)CE, with $\epsilon = 0$, provides a weak equilibrium for non-trivial games (Theorem 4). Secondly, an edge case with positive $\epsilon$ is the $\max(Au)$-MG(C)CE, which guarantees a uniform distribution solution. Converging to uniform when increasing $\epsilon$ is a desirable property (principle of insufficient reason) (Savage, 1954; Sinn, 1980; Jaynes, 1957). Thirdly, note that all $\epsilon < \max(Au)$ are guaranteed to have a non-uniform distribution (Theorem 3); therefore, a $\frac{1}{2}\max(Au)$-MG(C)CE could be an interesting way to regularise an MGCE towards a uniform distribution. Fourthly, because our algorithms are particularly scalable when the solution is full-support, working out the minimum $\epsilon$ such that a full-support solution exists, the full-$\epsilon$-MG(C)CE, would be useful. Finally, the solution with the smallest feasible $\epsilon$ is the $\min \epsilon$-MG(C)CE. This solution has the lowest entropy of the family but the highest payoff, and constitutes the strictest equilibrium. Refer to Figure 1 for the family of solutions for the traffic lights game.
Theorem 4.
For non-trivial games (Nau et al., 2004), the MG(C)CE lies on the boundary of the polytope and hence is a weak equilibrium.
Since the $\epsilon$ is deterministically known for the $\max(Au)$-MG(C)CE, $\frac{1}{2}\max(Au)$-MG(C)CE, and MG(C)CE solutions, we can solve for these using the standard solvers discussed in Section 3. For the $\min \epsilon$-MG(C)CE we can tweak our optimization procedure to solve for this case directly, by simply including an $\epsilon$ term (with unit coefficient) in the minimization objective. We use bisection search to find the full-$\epsilon$-MG(C)CE.
4.4 Invariance
An important concept in decision theory, called cardinal utility (Mas-Colell et al., 1995), is that offsetting and positively scaling each player's payoff does not change the properties of the game. A notable solution concept that does not have this property is MW(C)CE.
Theorem 5 (Affine Payoff Transformation Invariance).
If $\sigma^*$ is the $\epsilon$-MG(C)CE of a game, $\mathcal{G}$, then for each player $p$ independently we can transform the payoff tensors, $\tilde{G}_p = m_p G_p + o_p$, and the approximation vector, $\tilde{\epsilon}_p = m_p \epsilon_p$, for some positive scale $m_p$ and real offset $o_p$ scalars, without changing the solution. Furthermore, rows of the advantage matrix, $A$, and the approximation vector, $\epsilon$, can be scaled independently without changing the MG(C)CE.
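The first claim of the theorem can be spot-checked: an affine transform of one player's payoffs scales that player's advantage rows (the offset cancels in the difference), so the feasible set is unchanged once $\epsilon$ is scaled accordingly. A sketch using the row player of the traffic lights game (the helper and the scale/offset values are our own):

```python
import numpy as np

def row_player_advantage(G1):
    """Advantage rows for the row player of a 2x2 game (order GG, GW, WG, WW)."""
    return np.array([
        [G1[1, 0] - G1[0, 0], G1[1, 1] - G1[0, 1], 0.0, 0.0],  # G -> W
        [0.0, 0.0, G1[0, 0] - G1[1, 0], G1[0, 1] - G1[1, 1]],  # W -> G
    ])

G1 = np.array([[-10.0, 1.0], [0.0, 0.0]])  # row player payoffs
m, o = 2.0, 3.0                            # positive scale, arbitrary offset

base = row_player_advantage(G1)
transformed = row_player_advantage(m * G1 + o)
print(np.allclose(transformed, m * base))  # offset cancels, scale remains
```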
4.5 Computationally Tractable
In general, finding NEs is a hard problem (Daskalakis et al., 2009). While solving for any valid (C)CE is simple (a basic feasible solution of a linear constraint problem) (Matoušek & Gärtner, 2006), and finding a (C)CE with a linear objective is an LP, solving for a particular (C)CE can be hard. For example, MECE (Ortiz et al., 2007) requires optimizing a constrained nonlinear objective. α-Rank can be solved in cubic time in the number of pure joint strategies, $O(|\mathcal{A}|^3)$.
MG(C)CE, however, is the solution to a quadratic program, and therefore can be solved in polynomial time. Furthermore, if the assumption is made that the solution is full-support, the algorithm's variables scale better than the number of $\sigma$ parameters.

Space requirements are dominated by the storage of the advantage matrix, $A$, which requires $O(N |\mathcal{A}_p| |\mathcal{A}|)$ space when exploiting sparsity. Computation is also on the order of $O(N |\mathcal{A}_p| |\mathcal{A}|)$ for gradient computation, exploiting sparsity. The number of variables depends on whether we are solving the general-support, $|\mathcal{A}| + \sum_p |\mathcal{A}_p|^2$, or full-support, $\sum_p |\mathcal{A}_p|^2$, version. It is possible to make use of sparse matrix implementations, and only efficient matrix-vector multiplications are required to compute the derivatives.
5 Joint PSRO
JPSRO (Algorithm 2) is a novel extension to Policy-Space Response Oracles (PSRO) (Lanctot et al., 2017) (Algorithm 1) with full mixed joint policies to enable coordination among policies. Although a conceptually straightforward extension, careful attention is needed to a) develop suitable best response (BR) operators, b) develop tractable joint distribution meta-solvers (MS), c) evaluate the set of policies found so far, and d) develop convergence proofs.
We use the notation of Section 2.1, but with policies in place of actions. Let $(\Pi_p^\diamond)_{p=1..N}$ be the sets of all policies of the extensive form game available to each player, and let $\Pi^\diamond = \otimes_p \Pi_p^\diamond$ be the set of all joint policies. JPSRO is an iteration-based algorithm; let $\{\pi_p^{t,j}, \dots\} = \Pi_p^t$ be the set of new policies found at iteration $t$ for player $p$, with $j$ indexing an individual policy within that set. The set of all policies found so far for player $p$ is denoted $\Pi_p^{0:t}$, and the set of joint policies is denoted $\Pi^{0:t} = \otimes_p \Pi_p^{0:t}$. The expected return (ER), an NF game $(G_p^{0:t})_{p=1..N}$, is tracked for each joint policy found so far, such that $G_p^{0:t}(\pi)$ is the expected return to player $p$ when playing joint policy $\pi$. We also define $G_p^\diamond$ to be the payoff over all possible joint policies.

The MS is a function taking in the ER and returning a joint distribution, $\sigma^t$, over $\Pi^{0:t}$, such that $\sigma^t(\pi)$ is the probability of playing joint policy $\pi \in \Pi^{0:t}$ at iteration $t$. The BR operator finds a policy which maximizes the expected return over the opponents' mixed joint policies, $\pi_{-p} \in \Pi_{-p}^{0:t}$. This mixture is defined in terms of the MS joint distribution, $\sigma^t$.
Algorithm 1 Two-Player PSRO
1: $\Pi_1^0, \Pi_2^0 \leftarrow \{\pi_1^0\}, \{\pi_2^0\}$
2: $G^0 \leftarrow$ ER($\Pi^0$)
3: $\sigma_1^0, \sigma_2^0 \leftarrow$ MS($G^0$)
4: for $t \in \{1, \dots\}$ do
5:&nbsp;&nbsp;&nbsp; $\pi_1^t, \Delta_1^t \leftarrow$ BR($\Pi_2^{t-1}, \sigma_2^{t-1}$)
6:&nbsp;&nbsp;&nbsp; $\pi_2^t, \Delta_2^t \leftarrow$ BR($\Pi_1^{t-1}, \sigma_1^{t-1}$)
7:&nbsp;&nbsp;&nbsp; $\Pi_1^t, \Pi_2^t \leftarrow \Pi_1^{t-1} \cup \{\pi_1^t\}, \Pi_2^{t-1} \cup \{\pi_2^t\}$
8:&nbsp;&nbsp;&nbsp; $G^t \leftarrow$ ER($\Pi^t$)
9:&nbsp;&nbsp;&nbsp; $\sigma_1^t, \sigma_2^t \leftarrow$ MS($G^t$)
10:&nbsp;&nbsp; if $\Delta_1^t + \Delta_2^t = 0$ then
11:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; break
12: return ($\Pi_1^{0:t}, \Pi_2^{0:t}$), ($\sigma_1^t, \sigma_2^t$)

Algorithm 2 JPSRO
1: $\Pi_1^0, \dots, \Pi_N^0 \leftarrow \{\pi_1^0\}, \dots, \{\pi_N^0\}$
2: $G^0 \leftarrow$ ER($\Pi^0$)
3: $\sigma^0 \leftarrow$ MS($G^0$)
4: for $t \in \{1, \dots\}$ do
5:&nbsp;&nbsp;&nbsp; for $p \in \{1, \dots, N\}$ do
6:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\{\pi_p^{t,1}, \dots\}, \{\Delta_p^{t,1}, \dots\} \leftarrow$ BR$_p$($\Pi^{0:t-1}, \sigma^{t-1}$)
7:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\Pi_p^{0:t} \leftarrow \Pi_p^{0:t-1} \cup \{\pi_p^{t,1}, \dots\}$
8:&nbsp;&nbsp;&nbsp; $G^{0:t} \leftarrow$ ER($\Pi^{0:t}$)
9:&nbsp;&nbsp;&nbsp; $\sigma^t \leftarrow$ MS($G^{0:t}$)
10:&nbsp;&nbsp; if $\sum_{p,j} \Delta_p^{t,j} = 0$ then
11:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; break
12: return $\Pi^{0:t}, \sigma^t$

5.1 Best Response Operators
At iteration $t+1$ each set, $\Pi_p^{0:t}$, can be expanded using either a CCE or a CE best response (BR) operator. The type of BR operator used determines the type of equilibrium that JPSRO converges to (Section 5.4).
JPSRO(CCE)
There is a single BR objective for each player, which expands the player policy set, $\Pi_p^{0:t+1} = \Pi_p^{0:t} \cup \Pi_p^{t+1}$, where $\Pi_p^{t+1} = \{\text{BR}_p^{t+1}\}$ and $\sigma(\pi_{-p}) = \sum_{\pi_p \in \Pi_p^{0:t}} \sigma(\pi_p, \pi_{-p})$.

$$\text{BR}_p^{t+1} \in \operatorname*{argmax}_{\pi_p^\dagger \in \Pi_p^\dagger} \sum_{\pi_{-p} \in \Pi_{-p}^{0:t}} \sigma^t(\pi_{-p})\, G_p^\dagger(\pi_p^\dagger, \pi_{-p})$$
The CCE BR attempts to exploit the joint distribution with the responder's own policy preferences marginalized out.
JPSRO(CE)
There is a BR for each possible recommendation a player can receive, $\Pi_p^{0:t+1} = \Pi_p^{0:t} \cup \Pi_p^{t+1}$, where $\Pi_p^{t+1} = \{(\text{BR}_p^{t+1}(\pi_p^i))_{i=1..|\Pi_p^{0:t}|}\}$.

$$\text{BR}_p^{t+1}(\pi_p) \in \operatorname*{argmax}_{\pi_p^\dagger \in \Pi_p^\dagger} \sum_{\pi_{-p} \in \Pi_{-p}^{0:t}} \sigma^t(\pi_{-p} \mid \pi_p)\, G_p^\dagger(\pi_p^\dagger, \pi_{-p})$$
Therefore the CE BR attempts to exploit each policy-conditional "slice". In practice, we only calculate a BR for positive-support policies (similar to Rectified Nash (Balduzzi et al., 2019)). Computing the argmax of the BRs can be achieved through RL or by exactly traversing the game tree.
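The two operators can be sketched concretely on the normal-form meta-game. The following is a minimal pure-Python illustration (function and variable names are ours, not from the paper; an exact meta-game payoff tensor is assumed): the only difference between the two BRs is whether the responder's own recommendation is marginalized out or conditioned on.

```python
def cce_best_response(payoff, sigma, p, num_policies):
    """CCE BR: exploit sigma with player p's own recommendation marginalized out.

    payoff[joint] -> tuple of per-player returns; sigma[joint] -> probability.
    `joint` is a tuple of policy indices, one per player.
    Returns the index of a best-responding policy for player p.
    """
    # Marginal over opponent joints: sigma(pi_{-p}) = sum_{pi_p} sigma(pi_p, pi_{-p}).
    marginal = {}
    for joint, prob in sigma.items():
        opp = joint[:p] + joint[p + 1:]
        marginal[opp] = marginal.get(opp, 0.0) + prob

    def value(cand):
        return sum(prob * payoff[opp[:p] + (cand,) + opp[p:]][p]
                   for opp, prob in marginal.items())

    return max(range(num_policies[p]), key=value)


def ce_best_responses(payoff, sigma, p, num_policies):
    """CE BR: one best response per positive-support recommendation to player p."""
    brs = {}
    for rec in range(num_policies[p]):
        # Conditional sigma(pi_{-p} | pi_p = rec), up to normalization (the
        # normalizing constant does not change the argmax).
        cond = {j[:p] + j[p + 1:]: prob for j, prob in sigma.items() if j[p] == rec}
        if sum(cond.values()) == 0.0:
            continue  # only respond to recommendations with positive support

        def value(cand):
            return sum(prob * payoff[opp[:p] + (cand,) + opp[p:]][p]
                       for opp, prob in cond.items())

        brs[rec] = max(range(num_policies[p]), key=value)
    return brs
```

For a coordination meta-game where $\sigma$ correlates both players on matching policies, the CE BR correctly returns a different best response per recommendation, while the CCE BR sees only the opponent marginal.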
5.2 Meta-Solvers
We propose that (C)CEs are good candidates as meta-solvers (MSs). They are more tractable than NEs and can enable coordination to maximize payoff between cooperative agents. In particular, we propose three flavours of equilibrium MSs. Firstly, greedy MSs (such as MW(C)CE) select the highest-payoff equilibria and attempt to improve further upon them. Secondly, maximum-entropy MSs (such as MG(C)CE) attempt to be robust against many policies by spreading weight. Finally, random samplers (such as RV(C)CE) attempt to explore by probing the extreme points of the equilibrium polytope. Note that these MSs search through the equilibrium subspace, not the full policy space, and this restriction is a powerful way of achieving convergence. Note also that since CEs $\subseteq$ CCEs, one can use CE MSs with JPSRO(CCE).
5.3 Evaluation
Measuring convergence to NE (NE gap, Lanctot et al. (2017)) is suitable in two-player, constant-sum games. However, it is not rich enough in cooperative settings. We propose to measure convergence to (C)CE ((C)CE gap, Section E.4) in the full extensive-form game. A gap, $\Delta$, of zero implies convergence to an equilibrium. We also measure the expected value obtained by each player, because convergence to an equilibrium does not imply a high value. Both gap and value metrics need to be evaluated under a meta-distribution. Using the same distribution as the MS may be unsuitable because MSs do not necessarily result in equilibria, may be random, or may maximize entropy. Therefore we may also want to evaluate under other distributions such as MW(C)CE, because it constitutes an equilibrium and maximizes value. A final relevant measurement is the number of unique policies found over time. The goal of an MS is to expand the policy space (by proposing a joint policy to best respond to). If it fails to find novel policies at an acceptable rate, this could be evidence that it is not performing well. Not all novel policies are useful, so caution should be exercised when interpreting this metric. If using a (C)CE MS and the gap is positive, the BR is guaranteed to be a novel policy.
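The CCE gap on the normal-form meta-game can be sketched in a few lines. This is an illustrative pure-Python helper (names are ours; the paper evaluates the gap in the full extensive-form game, which this does not attempt):

```python
def cce_gap(payoff, sigma, num_policies):
    """Sum over players of the best unilateral deviation gain against sigma.

    payoff[joint] -> tuple of per-player returns; sigma[joint] -> probability.
    A gap of zero certifies that sigma is a CCE of the (meta-)game.
    """
    gap = 0.0
    for p, n_p in enumerate(num_policies):
        # Expected return to player p under the joint distribution.
        expected = sum(prob * payoff[joint][p] for joint, prob in sigma.items())
        # Best fixed deviation, evaluated against sigma's opponent marginal.
        best_dev = max(
            sum(prob * payoff[joint[:p] + (cand,) + joint[p + 1:]][p]
                for joint, prob in sigma.items())
            for cand in range(n_p))
        gap += max(0.0, best_dev - expected)
    return gap
```

On a small coordination meta-game, a distribution correlated on the matching joint policies has zero gap, while a miscoordinated point mass has a strictly positive gap.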
5.4 Convergence to Equilibria
JPSRO(CCE) converges to a CCE and JPSRO(CE) converges to a CE. We provide a sketch of the proofs here; for full proofs see Section E.5.
Theorem 6 (CCE Convergence).
When using a CCE meta-solver and CCE best response in JPSRO(CCE) the mixed joint policy converges to a CCE under the meta-solver distribution.
Theorem 7 (CE Convergence).
When using a CE meta-solver and CE best response in JPSRO(CE) the mixed joint policy converges to a CE under the meta-solver distribution.
Proof.
A (C)CE MS provides a distribution that is in equilibrium over the set of joint policies found so far, $\Pi^{0:t}$. For the algorithm to have converged, it needs to also be in equilibrium over the set of all possible joint policies, $\Pi^\dagger$. This is the case when the BR fails to find a novel policy with nonzero gap. Policies that have been found before have, by definition of (C)CE, zero gap. All behavioural policies can be defined in terms of a mixture of deterministic policies. Therefore, given that there are finitely many deterministic policies, the algorithm will converge. ∎
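The stopping condition in the proof above is exactly the termination test of Algorithm 2. A schematic pure-Python sketch of the outer loop, with `expected_returns`, `meta_solver`, and `best_responses` left abstract (all names here are illustrative, not the paper's API):

```python
def jpsro(initial_policies, expected_returns, meta_solver, best_responses,
          max_iters=100):
    """Schematic JPSRO loop: expand policy sets until BRs yield no positive gap.

    initial_policies: list of per-player policy sets.
    expected_returns: builds the normal-form meta-game over joint policies.
    meta_solver:      maps the meta-game to a joint distribution (e.g. a (C)CE).
    best_responses:   returns (new_policies, gaps) for player p against sigma.
    """
    policies = [list(ps) for ps in initial_policies]
    meta_game = expected_returns(policies)
    sigma = meta_solver(meta_game)
    for _ in range(max_iters):
        total_gap = 0.0
        for p in range(len(policies)):
            new, gaps = best_responses(meta_game, sigma, p)
            total_gap += sum(gaps)
            for pol in new:
                if pol not in policies[p]:
                    policies[p].append(pol)
        meta_game = expected_returns(policies)
        sigma = meta_solver(meta_game)
        if total_gap == 0.0:  # every BR already has zero gap: converged
            break
    return policies, sigma
```

With finitely many deterministic policies, the policy sets can only grow a bounded number of times, which is the finiteness argument the proof relies on.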
6 CEs and CCEs as Joint Meta-Solvers
We evaluate a number of (C)CE MSs in JPSRO on pure competition, pure cooperation, and general-sum games (Section H). All games used are available in OpenSpiel (Lanctot et al., 2019). More thorough descriptions of the games used can be found in Section F. We use an exact BR oracle, and exactly evaluate policies in the meta-game by traversing the game tree, to precisely isolate the MS's contribution to the algorithm.
We compare against common MSs, including uniform, $\alpha$-Rank (Omidshafiei et al., 2019; Muller et al., 2020), Projected Replicator Dynamics (PRD) (Lanctot et al., 2017), which is an NE approximator, and random vertex (coarse) correlated equilibrium (RV(C)CE), which randomly selects a solution on the vertices of the (C)CE polytope. We also include random joint and random Dirichlet solvers as baselines. We treat the solutions to the MSs as full joint distributions. Random solvers were evaluated with five seeds and we plot the mean. When evaluating, we measure equilibrium gaps under their own MS distribution and under MW(C)CE to provide a consistent and value-maximizing comparison. Experiments were run for up to 6 hours, after which they were terminated.
Kuhn Poker (Kuhn, 1950; Southey et al., 2009; Lanctot, 2014) is a zero-sum poker game with only two actions per player. The two-player variant is solvable with PSRO; however, the three-player version benefits from JPSRO. The results in Figure 2(a) show rapid convergence to equilibrium.
Trade Comm is a two-player, common-payoff trading game, where players attempt to coordinate on a compatible trade. This game is difficult because it requires searching over a large number of policies to find a compatible mapping, and can easily fall into a sub-optimal equilibrium. Figure 2(b) shows a remarkable dominance of CCE MSs. It is clear that traditional PSRO MSs cannot cope with this cooperative setting.
Sheriff (Farina et al., 2019b) is a two-player, general-sum negotiation game. It consists of bargaining rounds between a smuggler, who is motivated to import contraband without getting caught, and a sheriff, who is motivated to find contraband or accept bribes. Figure 2(c) shows that JPSRO is capable of finding the optimal value.
Figure 2: JPSRO(CCE) on various games. (a) CCE gap on three-player Kuhn Poker: several MSs converge to within numerical accuracy (data is clipped) of a CCE. (b) Value sum on three-item Trade Comm: the approximate CCE MS was not sufficient to converge in this game; however, all valid CCE MSs were able to converge to the optimal value sum. (c) Value sum on Sheriff: the optimal maximum welfare of other solution concepts are included to highlight the appeal of using NFCCE. Additional metrics can be found in Section H. MGCCE is consistently a good choice of MS over the games tested.

7 Discussion
There has been significant recent interest in solving the equilibrium selection problem (Ortiz et al., 2007; Omidshafiei et al., 2019). This paper provides a novel approach which is computationally tractable, supports general-support solutions, and has favourable scaling properties when the solution is full-support.
The new solution concept MG(C)CE is rooted in the powerful principles of entropy and margin maximisation. It is therefore a simple solution that makes limited assumptions and is robust to many possible counter-strategies (Jaynes, 1957). MG(C)CE defines a family of unique solutions parameterized by $\epsilon$, which can control the properties of the distribution. We have compared it to other NE, CE, and $\alpha$-Rank solutions, shown it has several advantages over these approaches, and found it performs very well across a variety of games.
PSRO has proved to be a formidable learning algorithm in two-player, constant-sum games, and JPSRO, with (C)CE MSs, is showing promising results on n-player, general-sum games. The secret to the success of these methods seems to lie in (C)CEs' ability to compress the search space of opponent policies to an expressive and non-exploitable subset. For example, no dominated policies are part of CEs, and during execution there are no policies a player would rather deviate to. For (C)CE MSs, if there is a value-improving BR it is guaranteed to be a novel policy.
There is a rich polytope of possible equilibria to choose from; however, an MS must pick one at each time step. There are three competing properties which are important in this regard: exploitation, robustness, and exploration. For exploitation, maximum-welfare equilibria appear to be useful. However, to prevent JPSRO from stalling in a local equilibrium it is essential to randomize over multiple solutions satisfying the maximum-welfare criterion. To produce robust BRs, entropy-maximizing MSs (such as MG(C)CE) have better empirical value and convergence than the uniform MS. For exploration, we can randomly select a valid equilibrium at each iteration, which outperforms random joint and random Dirichlet by a significant margin (similar to AlphaStar's "exploiter policies" (Vinyals et al., 2019)). Furthermore, one could also switch between MSs at each iteration to achieve the best mix of exploitation and exploration.
Another strength of (C)CE MSs is that they appear to perform well across many different games, with different numbers of players and payoff properties.
8 Conclusions
We have shown that JPSRO converges to an NF(C)CE over joint policies in extensive form and stochastic games. Furthermore, there is empirical evidence that some MSs also result in high value equilibria over a variety of games. We argue that (C)CEs are an important concept in evaluating policies in n-player, general-sum games and thoroughly evaluate several MSs. Finally, we believe that both MG(C)CE and JPSRO can scale to large problems, by using stochastic online MSs for the former and exploiting function approximation and RL for the latter.
9 Acknowledgements
Special thanks to Shayegan Omidshafiei for help with πΌ -Rank related discussions, Thomas Anthony for helpful comments and critiques of a draft of the paper, Gabriele Farina for providing maximum welfare values for the Sheriff game, and the anonymous ICML reviewers whose thoughtful feedback strengthened the paper considerably.
References

- Agrawal, A., Verschueren, R., Diamond, S., and Boyd, S. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.
- Anthony, T., Eccles, T., Tacchetti, A., Kramár, J., Gemp, I., Hudson, T. C., Porcel, N., Lanctot, M., Pérolat, J., Everett, R., Werpachowski, R., Singh, S., Graepel, T., and Bachrach, Y. Learning to play no-press Diplomacy with best response policy iteration, 2020.
- Aumann, R. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1(1):67–96, 1974.
- Avis, D., Rosenberg, G. D., Savani, R., and von Stengel, B. Enumeration of Nash equilibria for two-player games. Economic Theory, 42(1):9–37, 2010.
- Balduzzi, D., Tuyls, K., Perolat, J., and Graepel, T. Re-evaluating evaluation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS), pp. 3272–3283, Red Hook, NY, USA, 2018. Curran Associates Inc.
- Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W. M., Pérolat, J., Jaderberg, M., and Graepel, T. Open-ended learning in symmetric zero-sum games. CoRR, abs/1901.08106, 2019. URL http://arxiv.org/abs/1901.08106.
- Ben-Tal, A., Ghaoui, L., and Nemirovski, A. Robust Optimization. Princeton Series in Applied Mathematics. Princeton University Press, 2009. ISBN 9781400831050.
- Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., and Zhang, S. Dota 2 with large scale deep reinforcement learning. CoRR, abs/1912.06680, 2019. URL http://arxiv.org/abs/1912.06680.
- Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738.
- Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.
- Brown, G. W. Iterative solutions of games by fictitious play. 1951.
- Brown, N. and Sandholm, T. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019. doi: 10.1126/science.aay2400.
- Byrd, R., Lu, P., Nocedal, J., and Zhu, C. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16:1190–1208, 1995.
- Celli, A., Marchesi, A., Bianchi, T., and Gatti, N. Learning to correlate in multi-player general-sum sequential games, 2019.
- Celli, A., Marchesi, A., Farina, G., and Gatti, N. No-regret learning dynamics for extensive-form correlated equilibrium, 2020.
- Cortes, C. and Vapnik, V. Support-vector networks. In Machine Learning, pp. 273–297, 1995.
- Daskalakis, C., Goldberg, P., and Papadimitriou, C. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39:195–259, 2009.
- Diamond, S. and Boyd, S. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
- Dudik, M. and Gordon, G. A sampling-based approach to computing equilibria in succinct extensive-form games, 2012.
- Farina, G., Bianchi, T., and Sandholm, T. Coarse correlation in extensive-form games, 2019a.
- Farina, G., Ling, C. K., Fang, F., and Sandholm, T. Correlation in extensive-form games: Saddle-point formulation and benchmarks. In Conference on Neural Information Processing Systems (NeurIPS), 2019b.
- Fudenberg, D. and Tirole, J. Game Theory. MIT Press, 1991.
- Gerschgorin, S. Über die Abgrenzung der Eigenwerte einer Matrix. 1931.
- Goldberg, P. W., Papadimitriou, C. H., and Savani, R. The complexity of the homotopy method, equilibrium selection, and Lemke-Howson solutions. ACM Transactions on Economics and Computation (TEAC), 1(2):1–25, 2013.
- Gray, J., Lerer, A., Bakhtin, A., and Brown, N. Human-level performance in no-press Diplomacy via equilibrium search, 2020.
- Harsanyi, J. and Selten, R. A General Theory of Equilibrium Selection in Games, volume 1. The MIT Press, 1 edition, 1988.
- Havrda, J. and Charvat, F. Quantification method of classification processes: Concept of structural a-entropy. Kybernetika, 1967.
- Heinrich, J., Lanctot, M., and Silver, D. Fictitious self-play in extensive-form games. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), 2015.
- Hoehn, B., Southey, F., Holte, R. C., and Bulitko, V. Effective short-term opponent exploitation in simplified poker.
- Jaderberg, M., Czarnecki, W., Dunning, I., Marris, L., Lever, G., Castañeda, A., Beattie, C., Rabinowitz, N., Morcos, A., Ruderman, A., Sonnerat, N., Green, T., Deason, L., Leibo, J., Silver, D., Hassabis, D., Kavukcuoglu, K., and Graepel, T. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
- Jaynes, E. T. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.
- Kaur, M. and Buttar, G. A brief review on different measures of entropy, 2019.
- Kuhn, H. W. A simplified two-person poker. Contributions to the Theory of Games, 1:97–103, 1950.
- Lanctot, M. Further developments of extensive-form replicator dynamics using the sequence-form representation, 2014.
- Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T. A unified game-theoretic approach to multiagent reinforcement learning. In NIPS, 2017.
- Lanctot, M., Lockhart, E., Lespiau, J.-B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., Hennes, D., Morrill, D., Muller, P., Ewalds, T., Faulkner, R., Kramár, J., De Vylder, B., Saeta, B., Bradbury, J., Ding, D., Borgeaud, S., Lai, M., Schrittwieser, J., Anthony, T., Hughes, E., Danihelka, I., and Ryan-Davis, J. OpenSpiel: A framework for reinforcement learning in games. CoRR, 2019.
- Savage, L. J. The Foundations of Statistics. 1954.
- Lockhart, E., Burch, N., Bard, N., Borgeaud, S., Eccles, T., Smaira, L., and Smith, R. Human-agent cooperation in bridge bidding, 2020.
- Mas-Colell, A., Whinston, M. D., and Green, J. R. Microeconomic Theory. Oxford University Press, 1995.
- Matoušek, J. and Gärtner, B. Understanding and Using Linear Programming (Universitext). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 3540306978.
- McAleer, S., Lanier, J., Fox, R., and Baldi, P. Pipeline PSRO: A scalable approach for finding approximate Nash equilibria in large games. In Neural Information Processing Systems 33, 2020.
- McAleer, S., Lanier, J. B., Baldi, P., and Fox, R. XDO: A double oracle algorithm for extensive-form games. CoRR, abs/2103.06426, 2021. URL https://arxiv.org/abs/2103.06426.
- McMahan, H. B., Gordon, G. J., and Blum, A. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 536–543, 2003.
- Morrill, D., D'Orazio, R., Sarfati, R., Lanctot, M., Wright, J. R., Greenwald, A., and Bowling, M. Hindsight and sequential rationality of correlated play. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), 2021.
- Moulin, H. and Vial, J.-P. Strategically zero-sum games: the class of games whose completely mixed equilibria cannot be improved upon. International Journal of Game Theory, 7(3-4):201–221, 1978.
- Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., Wang, Z., Lever, G., Heess, N., Graepel, T., and Munos, R. A generalized training approach for multiagent learning. In International Conference on Learning Representations, 2020.
- Nash, J. Non-cooperative games. Annals of Mathematics, 54(2):286–295, 1951.
- Nau, R., Canovas, S. G., and Hansen, P. On the geometry of Nash equilibria and correlated equilibria. International Journal of Game Theory, 32(4):443–453, 2004.
- O'Leary, D. P. A generalized conjugate gradient algorithm for solving a class of quadratic programming problems. Linear Algebra and its Applications, 34:371–399, 1980.
- Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R. α-Rank: Multi-agent evaluation by evolution. Scientific Reports, 9(1):9937, 2019.
- Ortiz, L. E., Schapire, R. E., and Kakade, S. M. Maximum entropy correlated equilibria. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, pp. 347–354, San Juan, Puerto Rico, 2007. PMLR.
- Polyak, B. The conjugate gradient method in extremal problems. USSR Computational Mathematics and Mathematical Physics, 9:94–112, 1969.
- Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
- Shannon, C. E. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
- Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Sinn, H.-W. A rehabilitation of the principle of insufficient reason. The Quarterly Journal of Economics, 94(3):493–506, 1980.
- Sokota, S., Lockhart, E., Timbers, F., Davoodi, E., D'Orazio, R., Burch, N., Schmid, M., Bowling, M., and Lanctot, M. Solving common-payoff games with approximate policy iteration, 2021.
- Southey, F., Hoehn, B., and Holte, R. Effective short-term opponent exploitation in simplified poker. Machine Learning, 74:159–189, 2009.
- Stellato, B., Banjac, G., Goulart, P., Bemporad, A., and Boyd, S. OSQP: An operator splitting solver for quadratic programs. Mathematical Programming Computation, 2020.
- Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479–487, 1988.
- Vinyals, O., Babuschkin, I., Czarnecki, W., Mathieu, M., Dudzik, A., Chung, J., Choi, D., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J., Jaderberg, M., and Silver, D. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575, 2019.
- von Stengel, B. and Forges, F. Extensive-form correlated equilibrium: Definition and computational complexity. Mathematics of Operations Research, 33(4):1002–1022, 2008.
- Wald, A. Contributions to the theory of statistical estimation and testing hypotheses. Annals of Mathematical Statistics, 10(4):299–326, 1939.
- Wald, A. Statistical decision functions which minimize the maximum risk. Annals of Mathematics, 46:265–280, 1945.
- Wang, Y. and Xia, S. Unifying attribute splitting criteria of decision trees by Tsallis entropy. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2507–2511, 2017.

Appendix A Correlated Equilibrium
In this section we define the differences between two competing definitions of approximate correlated equilibrium ($\epsilon$-CE) and define the related solution concept of approximate coarse correlated equilibrium ($\epsilon$-CCE).
A.1 Correlated Equilibrium
An approximate correlated equilibrium ($\epsilon$-CE) is one where the advantage of a single player unilaterally switching away from a recommended action is no more than $\epsilon$. When $\epsilon = 0$, the standard CE is recovered. Define $A_p(a_p', a_p, a_{-p}) = G_p(a_p', a_{-p}) - G_p(a_p, a_{-p})$ as the advantage for player $p$ of switching action from $a_p$ to $a_p'$ when the other players play $a_{-p}$. This relationship is described mathematically in Equation (13).

$$\sum_{a_{-p}} \sigma(a_{-p} \mid a_p)\, A_p(a_p', a_p, a_{-p}) \leq \epsilon_p \quad (13)$$

$$\sum_{a_{-p}} \sigma(a_{-p}, a_p)\, A_p(a_p', a_p, a_{-p}) \leq \sigma(a_p)\, \epsilon_p \quad (14)$$

$$\sum_{a_{-p}} \sigma(a_{-p}, a_p)\, \left( A_p(a_p', a_p, a_{-p}) - \epsilon_p \right) \leq 0 \quad (15)$$

$$\forall p \in \mathcal{P},\ a_p' \neq a_p \in \mathcal{A}_p$$
Together, $A_p$ and $\epsilon$ represent the CE linear inequality constraints. Mathematically these are equations of planes, which separate the mixed joint probability $\sigma(a)$ into half-spaces. These half-spaces intersect to form a convex polytope of valid CE solutions.
A.2 Alternate Form Correlated Equilibrium
Sometimes another definition of CEs is used which is not equivalent to the definition above when $\epsilon \neq 0$. We call Equation (16) the alternate approximate CE. In matrix form, we can simply write $A \sigma \leq \epsilon$.

$$\sum_{a_{-p}} \sigma(a_{-p}, a_p)\, A_p(a_p', a_p, a_{-p}) \leq \epsilon_p \quad (16)$$

$$\forall p \in \mathcal{P},\ a_p' \neq a_p \in \mathcal{A}_p$$
This form is often easier to deal with computationally (particularly with $\min \epsilon$-MG(C)CE) because it does not require dealing with a conditional distribution, and the approximation term is independent of probabilities. We use this definition throughout this work, although MGCE is still well defined under the former definition.
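The alternate-form constraints $A\sigma \leq \epsilon$ are straightforward to build explicitly. A minimal pure-Python sketch (helper names are ours), checked on the classic game of chicken, where the uniform mixture over the three non-crash outcomes is a well-known CE:

```python
import itertools

def ce_constraint_rows(payoffs, num_actions):
    """Rows of the alternate-form CE constraint matrix A: one row per player p,
    recommended action a_p, and deviation a_p' != a_p, so a CE (epsilon = 0)
    satisfies A sigma <= 0.

    payoffs[joint] -> tuple of per-player payoffs; joints are action tuples.
    Each row is returned as a dict mapping a joint action to its coefficient.
    """
    joints = list(itertools.product(*[range(n) for n in num_actions]))
    rows = []
    for p, n_p in enumerate(num_actions):
        for a_p in range(n_p):
            for dev in range(n_p):
                if dev == a_p:
                    continue
                row = {}
                for joint in joints:
                    if joint[p] != a_p:
                        continue  # constraint only weights a_p's "slice"
                    deviated = joint[:p] + (dev,) + joint[p + 1:]
                    row[joint] = payoffs[deviated][p] - payoffs[joint][p]
                rows.append(row)
    return rows

def is_ce(payoffs, sigma, num_actions, tol=1e-9):
    """Check A sigma <= 0 for every CE constraint row."""
    return all(
        sum(coef * sigma.get(joint, 0.0) for joint, coef in row.items()) <= tol
        for row in ce_constraint_rows(payoffs, num_actions))
```

Action 0 is "dare" and action 1 is "chicken" in the test below; a point mass on the crash outcome violates the constraints, while the uniform mixture over the other three outcomes satisfies them.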
A.3 Coarse Correlated Equilibrium
The coarse correlated equilibrium (CCE) is a looser solution concept where a player must decide whether to play the correlation device's recommendation before they receive the recommendation. Define $A_p(a_p', a) = G_p(a_p', a_{-p}) - G_p(a)$ as the gain from deviating before an action has been recommended.

$$\sum_{a} \sigma(a)\, A_p(a_p', a) \leq \epsilon_p \quad (17)$$

$$\sum_{a} \sigma(a)\, \left( A_p(a_p', a) - \epsilon_p \right) \leq 0 \quad (18)$$

$$\forall p \in \mathcal{P},\ a_p' \in \mathcal{A}_p$$
Note that this can be derived from the correlated equilibrium by mixing over all possible actions, $a_p$, that an agent can take.
$$\sum_{a_p} \sigma(a_p) \sum_{a_{-p}} \sigma(a_{-p} \mid a_p)\, A_p(a_p', a_p, a_{-p}) \leq \sum_{a_p} \sigma(a_p)\, \epsilon_p$$

$$\sum_{a_p} \sum_{a_{-p}} \sigma(a_{-p}, a_p)\, A_p(a_p', a_p, a_{-p}) \leq \epsilon_p$$

$$\sum_{a} \sigma(a_{-p}, a_p)\, \left( G_p(a_p', a_{-p}) - G_p(a_p, a_{-p}) \right) \leq \epsilon_p$$

$$\sum_{a} \sigma(a)\, A_p(a_p', a) \leq \epsilon_p$$
Note that if one wished to solve for MGCCE instead, one simply replaces the CE advantage matrix $A^{CE}$ with $A^{CCE}$, without any other changes.
Appendix B Generalized Entropy
Shannon's entropy (Shannon, 1948), $I_S$, is a familiar quantity and is described as a measure of "information gain". The Gini impurity (Breiman et al., 1984; Bishop, 2006) is a measure of the probability of misclassifying a sample of a discrete random variable, if that sample were randomly classified according to its own probability mass function: $I_G = \sum_i p_i \sum_{j \neq i} p_j = 1 - \sum_i p_i^2$.
Both Shannon's entropy and the Gini impurity are maximized when the probability mass function is uniform, $p_i = \frac{1}{|p|}$, and minimized when all mass is on a single outcome. Both metrics are used in decision tree classification algorithms, with Gini being more popular because it is easier to compute (Breiman et al., 1984).
In physics, there has been recent interest in non-extensive entropies, which have been found to better model certain physical properties. One such entropy is the Tsallis entropy, $I_T = \frac{1 - \sum_i p_i^q}{q - 1}$ (Tsallis, 1988; Havrda et al., 1967; Wang & Xia, 2017; Kaur & Buttar, 2019), parameterized by a real $q$.
A notable property of the Tsallis entropy is that it is non-additive. Assume we have two independent variables $A$ and $B$, with joint probability $p(A, B) = p(A)\, p(B)$; then the combined Tsallis entropy of this system is $I_T(A, B) = I_T(A) + I_T(B) + (1 - q)\, I_T(A)\, I_T(B)$. The quantity $(1 - q)$ is therefore a measure of the departure from additivity, with additivity recovered in the limit $q \to 1$, which corresponds to the additive Shannon entropy. The Gini impurity is recovered when $q = 2$. Therefore, the Gini impurity is a non-extensive generalized entropy.
Appendix C Proofs of MG(C)CE Properties

C.1 Uniqueness and Existence

Theorem 1 (Uniqueness and Existence).
MG(C)CE provides a unique solution to the equilibrium selection problem and always exists.

Proof.
The problem is a maximization of a strictly concave objective subject to linear constraints, and therefore has a unique solution. Existence follows from the fact that a CE always exists. ∎
C.2 Scalable Representation

Theorem 2 (Scalable Representation).
The maximum Gini (C)CE, $\sigma^*$, has the following forms:

General support: $\sigma^* = q - C A^T \alpha^* + C \beta^*$ $\quad$ (19)

Full support: $\sigma^* = q - C A^T \alpha^*$ $\quad$ (20)

where $\mathbf{1}$ is a vector of ones, $|\sigma| = \prod_p |\mathcal{A}_p|$, $C = I - q \mathbf{1}^T$, and $q = \frac{1}{|\sigma|} \mathbf{1}$ are constants. $\alpha^* \geq 0$ and $\beta^* \geq 0$ are the optimal dual variables of the solution, corresponding to the (C)CE and distribution inequality constraints respectively.
Proof.
Start with the equation we call the primal Lagrangian form.

$$L_{\sigma}^{\alpha, \beta, \lambda} = \tfrac{1}{2} \sigma^T \sigma + \alpha^T (A \sigma - \epsilon) - \beta^T \sigma + \lambda (\mathbf{1}^T \sigma - 1) \quad (21)$$

To construct the dual Lagrangian we first take derivatives with respect to the primal variable $\sigma$, and set them equal to zero.

$$\frac{\partial L_{\sigma}^{\alpha, \beta, \lambda}}{\partial \sigma} = \sigma + (A^T \alpha - \beta + \lambda \mathbf{1}) = 0 \implies \sigma = -A^T \alpha + \beta - \lambda \mathbf{1} \quad (22)$$

This can be substituted back into the primal Lagrangian.

$$L^{\alpha, \beta, \lambda} = -\tfrac{1}{2} [A^T \alpha - \beta + \lambda \mathbf{1}]^T [A^T \alpha - \beta + \lambda \mathbf{1}] - \alpha^T \epsilon - \lambda$$

Taking derivatives with respect to $\lambda$:

$$\frac{\partial L^{\alpha, \beta, \lambda}}{\partial \lambda} = -|\sigma| \lambda - \mathbf{1}^T A^T \alpha + \mathbf{1}^T \beta - 1 = 0 \implies \lambda = \frac{1}{|\sigma|} \left( -\mathbf{1}^T A^T \alpha + \mathbf{1}^T \beta - 1 \right) \quad (23)$$

Now substitute $\lambda$ back. Remember that there are non-negativity constraints $\alpha \geq 0$ and $\beta \geq 0$, so one cannot easily solve for $\beta$ to reduce this expression further. By defining $C = I - q \mathbf{1}^T$ and $q = \frac{1}{|\sigma|} \mathbf{1}$ (the uniform distribution), and noting $\mathbf{1}^T C = 0$ and $C^T C = C$, we arrive at the general support dual Lagrangian form.
$$L^{\alpha, \beta} = -\tfrac{1}{2} \left[ C A^T \alpha - C \beta - q \right]^T \left[ C A^T \alpha - C \beta - q \right] - \alpha^T \epsilon + q^T A^T \alpha - q^T \beta + \frac{1}{|\sigma|}$$

$$= -\tfrac{1}{2} \alpha^T A C A^T \alpha + q^T A^T \alpha - \alpha^T \epsilon - \tfrac{1}{2} \beta^T C \beta - q^T \beta + \alpha^T A C \beta + \tfrac{1}{2} q^T q \quad (24)$$
By combining Equations (22) and (23), we arrive at an equation that describes the relationship between the primal and dual parameters.

$$\sigma^* = q - C A^T \alpha^* + C \beta^* \quad (25)$$

It is advantageous to try to obtain a more compact representation. We can achieve this if we assume $\sigma$ has full support. In this case $\beta = 0$, because none of the $\sigma \geq 0$ constraints are active, and we obtain Equation (26), the full support dual Lagrangian form.

$$L^{\alpha} = -\tfrac{1}{2} \alpha^T A C A^T \alpha + q^T A^T \alpha - \epsilon^T \alpha + \tfrac{1}{2} q^T q \quad (26)$$

$$\sigma^* = q - C A^T \alpha^* \quad (27)$$

∎
Theorem 3 (Existence of Full-Support $\epsilon$-MG(C)CE).
For all games, there exists an $\epsilon \leq \max(A q)$ such that a full-support $\epsilon$-MG(C)CE exists. The uniform solution, $q$, always exists when $\max(A q) \leq \epsilon$. When $\epsilon < \max(A q)$, the solution is non-uniform.

Proof.
Note that $A \sigma \leq \epsilon \Leftrightarrow A C \sigma + A q \leq \epsilon$ (since any distribution satisfies $\sigma = C \sigma + q$, and $C q = 0$), and that $q$ is the uniform distribution with maximum possible Gini impurity. When $\max(A q) \leq \epsilon$ the inequality always holds with $\sigma = q$, and the inequality cannot hold with $\sigma = q$ when $\epsilon < \max(A q)$. ∎
C.3 Family

Table 1: Family of MG(C)CE solutions.

| MG(C)CE | $\epsilon$ | Properties |
| --- | --- | --- |
| $\max(Aq)$ $\epsilon$-MG(C)CE | $\max(Aq)$ | Uniform, highest entropy, lowest payoff |
| $\tfrac{1}{2}\max(Aq)$ $\epsilon$-MG(C)CE | $\tfrac{1}{2}\max(Aq)$ | Between uniform and (C)CE |
| full $\epsilon$-MG(C)CE | $\leq \max(Aq)$ | Minimum $\epsilon$ such that MG(C)CE is full-support |
| MG(C)CE | $0$ | Weak (C)CE, NE in two-player constant-sum |
| $\min \epsilon$-MG(C)CE | $\leq 0$ | Strictest (C)CE, lowest entropy, highest payoff |

Theorem 4.
For non-trivial games, the MG(C)CE lies on the boundary of the polytope and hence is a weak equilibrium.
Proof.
The MG(C)CE objective pulls the solution toward the uniform distribution. If the uniform distribution is not a (C)CE, the MG(C)CE lies on the boundary of the (C)CE polytope, and by definition is weak. If the uniform distribution is a (C)CE, then it is also an NE (because it factorizes), and for a non-trivial game it lies on the boundary of the polytope by (Nau et al., 2004). ∎
Table 1 summarizes the family of solutions that make up MG(C)CE. Note that a similar family can be defined for ME(C)CE.
C.4 Invariance
Theorem 5 (Affine Payoff Transformation Invariance).
If $\sigma^\ast$ is the $\epsilon$-MG(C)CE of a game, $\mathcal{G}$, then for each player $p$ independently we can transform the payoff tensor $\tilde{G}_p = c_p G_p + d_p$ and approximation vector $\tilde{\epsilon}_p = c_p \epsilon_p$, for some positive scalar $c_p$ and real scalar $d_p$, without changing the solution.
Furthermore, if a game $\mathcal{G}$ has (C)CE constraint matrix $A$ and bound vector $\epsilon$, then each row can be scaled independently without changing the MG(C)CE.
Proof.
The only way that a game's payoffs, $G$, influence the solution is via the (C)CE constraint matrices $A_p$, whose rows are defined as differences between action payoffs. It is easy to see that the constant $d_p$ cancels immediately:
$\tilde{A}_{p,a,a'} = \tilde{G}_p(a'_p, a_{-p}) - \tilde{G}_p(a_p, a_{-p}) = c_p \left( G_p(a'_p, a_{-p}) - G_p(a_p, a_{-p}) \right)$ (28)
Notice that $A$ always appears alongside the dual variables $\alpha$. Therefore any scale in $\tilde{A}\tilde{\alpha} = c A \tilde{\alpha}$ can be counteracted by $\tilde{\alpha} = \alpha / c$, without changing the nature of the optimization.
Similarly, not only does $\alpha_p$ appear alongside $A_p$, each of its elements appears alongside a particular row of $A_p$. Therefore not only can the whole of $A_p$ be scaled by a positive factor, each row of $A_p$ can be scaled individually. Intuitively, each row of the (C)CE constraint matrix defines the equation of a plane in the simplex, and planes are unaltered when scaled by a positive factor. We may exploit this property to better condition the optimization problem. ∎
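The row-scaling argument above is easy to check numerically. The sketch below uses a hypothetical 2x2 payoff matrix and a simplified deviation-gain constructor (not the paper's code) to verify that an affine payoff transform only scales the constraint rows by the positive factor, with the offset cancelling:

```python
import numpy as np

# Hypothetical 2x2 payoff matrix for one player (columns index the opponent action).
G = np.array([[1., 3.], [0., 2.]])

def cce_rows(G):
    """Simplified CCE constraint rows: entry for deviation `dev` and joint action
    (a0, a1) is G(dev, a1) - G(a0, a1), i.e. the gain from deviating to `dev`."""
    return np.array([[G[dev, a1] - G[a0, a1] for a0 in range(2) for a1 in range(2)]
                     for dev in range(2)])

# An affine transform G -> c G + d scales the rows by c; the offset d cancels.
c, d = 2.0, 7.0
assert np.allclose(cce_rows(c * G + d), c * cce_rows(G))
```

Scaling the bound vector by the same $c$ then leaves the feasible polytope, and hence the solution, unchanged.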
Appendix D MGCE Computation
There are several tricks that can be employed to simplify the nature of the computation problem.
D.1Bounded Gradient Methods
It is easy to formulate gradient algorithms to solve for the MG(C)CE. It is most convenient to work in the reduced dual form of the problem, as it enforces the probability equality constraint automatically, allows for making the full-support assumption, and does not require any projection routines. The computations involve sparse matrices, so appropriate sparse data structures should be used. The dual variables have a non-negativity constraint, which is sometimes referred to as a box or bound constraint in the literature.
For gradient ascent, initialize $\alpha_0 = 0$, $\beta_0 = 0$, and update the variables according to their gradients, where $\mathrm{NN}(x) = \max(0, x)$ ensures the variables remain non-negative.
$\alpha_{t+1} \leftarrow \mathrm{NN}\left[\alpha_t + \gamma\left(-A C A^\top \alpha_t + A u - \varepsilon + A C \beta_t\right)\right]$
$\beta_{t+1} \leftarrow \mathrm{NN}\left[\beta_t + \gamma\left(-C \beta_t - u + C^\top A^\top \alpha_t\right)\right]$ (29)
If we assume the solution is full-support, we can simplify the dual version even further by dropping the $\beta$ variable updates.
$\alpha_{t+1} \leftarrow \mathrm{NN}\left[\alpha_t + \gamma\left(-A C A^\top \alpha_t + A u - \varepsilon\right)\right]$ (30)
Second-order derivatives are also easily computed, allowing the use of bounded second-order line-search optimizers such as L-BFGS-B (Byrd et al., 1995). Other techniques such as momentum (Rumelhart et al., 1986), preconditioning the rows of the $A$ matrix, and iterated elimination of strictly dominated strategies of the payoff matrix will also help. An efficient conjugate gradient method can be adapted from Polyak's algorithm (Polyak, 1969; O'Leary, 1980), a conjugate gradient method modified to support bound constraints that is proven to converge in finitely many iterations.
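To make the update rule concrete, here is a minimal sketch of the projected gradient ascent of Equation (30) on a toy game. The payoffs (a matching-pennies-style zero-sum game), step size, starting point, and iteration count are illustrative assumptions; for this game the maximum-Gini CCE is the uniform joint distribution:

```python
import numpy as np

# Hypothetical 2x2 zero-sum game (matching-pennies payoffs).
G = [np.array([[1., -1.], [-1., 1.]]),    # player 0
     np.array([[-1., 1.], [1., -1.]])]    # player 1

# CCE deviation-gain matrix A: one row per (player, deviation action), one
# column per joint action (a0, a1), entry G_p(dev, a_-p) - G_p(a).
rows = []
for p in range(2):
    for dev in range(2):
        row = np.zeros(4)
        for a0 in range(2):
            for a1 in range(2):
                gain = (G[p][dev, a1] if p == 0 else G[p][a0, dev]) - G[p][a0, a1]
                row[2 * a0 + a1] = gain
        rows.append(row)
A = np.stack(rows)

n = A.shape[1]
u = np.ones(n) / n                        # uniform distribution
C = np.eye(n) - np.outer(u, np.ones(n))   # C = I - u e^T
eps = np.zeros(A.shape[0])                # epsilon = 0: exact CCE

# Projected gradient ascent on the full-support dual, Eq. (30).
alpha = np.array([0.3, 0.1, 0.2, 0.05])   # arbitrary non-negative start
gamma = 0.05                              # below 2 / lambda_max for this game
for _ in range(5000):
    alpha = np.maximum(0.0, alpha + gamma * (-A @ C @ A.T @ alpha + A @ u - eps))

sigma = u - C @ A.T @ alpha               # primal recovery, Eq. (27)
```

Here `sigma` converges to the uniform joint distribution, which is feasible ($A\sigma \le \varepsilon$) and maximizes Gini impurity for this game.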
D.2 Payoff Reductions
There are two methods which could be used to reduce the size of the payoff tensor, and hence the complexity of the game to be solved: repeated action elimination and dominated action elimination.
Repeated Action Elimination:
Consider a payoff tensor with repeated strategies (identical payoffs). This represents a redundancy in the game formulation, so we can keep only one of these actions and appropriately modify the objective to account for the alteration. Let $n_a$ be the number of repeats of each action after elimination (i.e. $n_a = 1$ if all were unique). Define $r = \otimes_p n_p$ as the flattened repeat count, which is the same size as $\sigma$, and $\tilde{r}_p = \otimes_{p'} \{1 \text{ if } p' = p \text{ else } n_{p'}\}$. Then the constraints become $r^\top \sigma = 1$ and $A_p(\sigma \circ \tilde{r}_p) \le \epsilon_p$, and the objective becomes $1 - \sigma^\top(\sigma \circ r)$. This has the dual effect of reducing the number of variables and constraints in the problem and, more importantly, breaking the symmetry of repeated terms, which several solvers can struggle with. It is important to run this procedure before eliminating dominated actions, because repeated actions by definition do not dominate one another.
Dominated Action Elimination:
Strictly dominated strategies can be pruned from the payoff without affecting the results, because dominated strategies can never have non-zero support in CEs where $\epsilon \le 0$. Any CE solution with non-positive $\epsilon$ can exploit this reduction.
The nature of JPSRO means that it is common for actions to be repeated (best responders can produce the same output over multiple distributions) and actions to be strictly dominated by others (as the algorithm finds better and better policies).
D.3 Eigenvalue Normalization
Some methods, such as gradient methods, benefit from the eigenvalues of the problem being similar in magnitude. We found empirically that re-normalizing by the $L_2$ norm of the rows of the constraint matrix resulted in eigenvalues close to 1. By Theorem 5 this is a valid procedure.
D.4 Dual Optimal Learning Rate
For the dual form of the objective there is an optimal constant learning rate based on the eigenvalues of the Hessian. Calculating the eigenvalues exactly may be too computationally expensive; we can instead obtain an upper bound. A good choice of learning rate that is guaranteed to converge is $\gamma = \tfrac{2}{\lambda^+_{\max} + \lambda^+_{\min}} \ge \tfrac{2}{\max_i \sum_j |D_{ij}| + \min_i \sum_j |D_{ij}|}$, where $D$ is the Hessian of the dual form, so the right-hand side is a safe practical choice. A proof follows below.
Proof.
$C$ is idempotent and positive semi-definite. For any $B$, $BB^\top$ is positive semi-definite; therefore $(AC)(AC)^\top = A C A^\top$ is positive semi-definite. This is the first of the block diagonals of the Hessian, $D$, which is therefore singular, symmetric, and positive semi-definite.
It is known that the best choice of constant learning rate in this setting is $\gamma = \tfrac{2}{\lambda^+_{\max} + \lambda^+_{\min}}$. Because the Hessian is not full rank and is positive semi-definite, $\lambda_{\min} = 0$, so we need the smallest non-zero eigenvalue, $\lambda^+_{\min}$. By the Gerschgorin circle theorem (Gerschgorin, 1931), bounds on the eigenvalues of a positive semi-definite matrix are:
$\lambda_{\max} \le \max_i \sum_j |D_{ij}|$ (31)
$\lambda^+_{\min} \le \min_i \sum_j |D_{ij}|$ (32)
∎
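A minimal numeric sanity check of this bound, using a hypothetical constraint matrix standing in for $A$:

```python
import numpy as np

# Hypothetical CCE constraint matrix A (deviation gains of a toy 2x2 game).
A = np.array([[0., 0., 2., -2.],
              [-2., 2., 0., 0.],
              [0., -2., 0., 2.],
              [2., 0., -2., 0.]])
n = A.shape[1]
C = np.eye(n) - np.ones((n, n)) / n   # C = I - (1/|sigma|) e e^T
D = A @ C @ A.T                       # dual Hessian block

# Gerschgorin bound: every eigenvalue is at most the largest absolute row sum.
row_sums = np.abs(D).sum(axis=1)
gamma = 2.0 / (row_sums.max() + row_sums.min())   # safe constant learning rate
assert np.linalg.eigvalsh(D).max() <= row_sums.max() + 1e-9
```

The resulting `gamma` is never larger than the optimal rate, so convergence is preserved at the cost of possibly slower progress.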
D.5 $\min\epsilon$-MG(C)CE
The previous formulations assume that $\epsilon$ is given as a hyper-parameter. If we want to directly find the minimum $\epsilon$ that produces a valid maximum-Gini solution, we must also optimize over $\epsilon$. The insight here is that the derivative of the objective function with respect to the approximation parameter must always dominate the derivative of the objective function with respect to the distribution.
$\frac{\partial L}{\partial \epsilon} \ge \sigma^\top \frac{\partial L}{\partial \sigma} = -\sigma^\top \sigma \ge -1$ (33)
Therefore an additional objective term of $-2\epsilon$ is sufficient to ensure this condition holds.
$L(\sigma, \alpha, \beta, \lambda) = \tfrac{1}{2}\sigma^\top\sigma + 2\epsilon + \alpha^\top(A\sigma - \epsilon) - \beta^\top\sigma + \lambda(e^\top\sigma - 1)$ (34)
$L(\alpha, \beta, \epsilon) = 2\epsilon - \tfrac{1}{2}\alpha^\top A C A^\top \alpha + u^\top A^\top \alpha - \epsilon^\top \alpha - \tfrac{1}{2}\beta^\top C \beta - u^\top \beta + \alpha^\top A C \beta + \tfrac{1}{2} u^\top u$ (35)
$\sigma^\ast = u - C A^\top \alpha^\ast + C \beta^\ast$ (36)
Appendix E Joint PSRO
While the concept of JPSRO is straightforward, careful attention needs to be paid to a) formulating best response operators, b) creating suitable MSs, c) defining evaluation metrics, and d) establishing convergence. We discuss these in detail in this section.
E.1 Meta-Game Estimation
There are two strategies for estimating the meta-game (a normal form payoff tensor populated by the returns of all the policies); exact sampling and empirical sampling.
Exact Sampling:
The exact return is computed for each player by traversing the entire game tree. This is only suitable for small games, or when using deterministic policies that cannot reach the majority of the game tree.
Empirical Sampling:
For larger games, or situations where the policy cannot be easily queried (for example when using a policy that depends on internal state like an LSTM) we may have to estimate the return through sampling.
In this work we used exact sampling so we could conduct an exact study into the performance of different MSs without introducing noise from other sources. However, we believe this approach can be scaled with empirical sampling, as has been achieved with PSRO.
E.2 Meta-Solvers
Many of the traditional PSRO solvers produce factorizable solutions; equivalently, their joint probabilities can be marginalized without losing any information.
Uniform:
This solver places equal probability mass over each policy it has found so far. PSRO using a uniform distribution is also known as Fictitious Self Play (FSP) (Heinrich et al., 2015). A key advantage of this approach is that it is not necessary to compute the meta-game to obtain this distribution. It is proven to slowly converge in the two-player, constant-sum setting.
Nash Equilibrium (NE):
The well-known solution concept (Nash, 1951), when used in PSRO, is called Double Oracle (DO) (McMahan et al., 2003). NE is difficult to compute for n-player, general-sum games, and is equivalent to the CCE in two-player, zero-sum games, so we did not benchmark against this MS.
Projected Replicator Dynamics (PRD):
An evolutionary method of approximating NE, introduced in (Lanctot et al., 2017).
There are a number of solvers which produce full joint distributions. We describe some we think are relevant here. Note that all factorizable solutions mentioned previously can be trivially promoted to full distributions.
$\alpha$-Rank:
A solution concept based on the stationary distribution of a Markov chain (Omidshafiei et al., 2019). $\alpha$-Rank has been studied before in the context of PSRO (Muller et al., 2020); however, the authors marginalized over the distribution.
Maximum Welfare (C)CE (MW(C)CE):
A non-unique linear formulation that maximizes the sum of payoffs over all players. In the case where there are multiple (C)CEs with maximum welfare we can define a maximum entropy version to spread weight, MEMW(C)CE, and a random version to select one at random, RMW(C)CE. We use the latter as an MS baseline in experiments.
Random Vertex (C)CE (RV(C)CE):
A linear formulation. In our implementation we formulate the standard linear (C)CE problem and randomly sample a linear cost function from the unit ball. Note that this selects a random vertex on the (C)CE polytope and is not sampling from within the polytope volume or elsewhere on the polytope surface.
Maximum Entropy (C)CE (ME(C)CE):
A unique nonlinear convex formulation that maximizes the Shannon entropy of the resulting distribution (Ortiz et al., 2007). We do not evaluate this solution concept in this work due to computational difficulties when scaling to large payoff tensors; however, we expect its performance to be similar to MG(C)CE.
Maximum Gini (C)CE (MG(C)CE):
A unique quadratic convex formulation that maximizes the Gini impurity (a form of Tsallis entropy), introduced in this work.
Random Dirichlet:
Sample a distribution randomly from a Dirichlet distribution with $\alpha = 1$. This has not been used in the literature before, but we believe it acts as a good (naive) baseline against RV(C)CE.
Random Joint:
Sample a single joint policy from the set. This has not been used in the literature before either, but we believe it acts as a good (naive) baseline against RV(C)CE.
In previous work joint solvers have been used (Muller et al., 2020), however the authors marginalized the distributions so they could be used in classic PSRO.
E.3 Joint Best Responders
We provide two best response operators for JPSRO. The first is required to converge to a CCE in policy space (when using CCE meta-solvers). The second is required to converge to a CE in policy space (when using CE meta-solvers).
JPSRO(CCE):
At each iteration there is a single BR objective for each player, which expands the player policy set, $\Pi_p^{0:t+1} = \Pi_p^{0:t} \cup \Pi_p^{t+1}$, where $\Pi_p^{t+1} = \{\mathrm{BR}_p^{t+1}\}$ and $\sigma(\pi_{-p}) = \sum_{\pi_p \in \Pi_p^{0:t}} \sigma(\pi_p, \pi_{-p})$.
$\mathrm{BR}_p^{t+1} \in \operatorname{argmax}_{\pi^\ast_p \in \Pi^\ast_p} \sum_{\pi_{-p} \in \Pi_{-p}^{0:t}} \sigma^t(\pi_{-p}) G_p(\pi^\ast_p, \pi_{-p})$
Therefore, the CCE BR attempts to exploit the joint distribution with the responder's own policy preferences marginalized out, resulting in a joint distribution over the other players' policies. This means that a player is best responding to a weighted mixture of up to $\prod_{p' \ne p} |\Pi_{p'}^t|$ joint opponent policies. This is an upper bound because $\sigma$ is often sparse.
JPSRO(CE):
There is a BR for each possible recommendation a player can receive, $\Pi_p^{0:t+1} = \Pi_p^{0:t} \cup \Pi_p^{t+1}$, where $\Pi_p^{t+1} = \{\mathrm{BR}_p^{t+1}(\pi_p^i) : i = 1 \dots |\Pi_p^{0:t}|\}$.
$\mathrm{BR}_p^{t+1}(\pi_p) \in \operatorname{argmax}_{\pi^\ast_p \in \Pi^\ast_p} \sum_{\pi_{-p} \in \Pi_{-p}^{0:t}} \sigma^t(\pi_{-p} \mid \pi_p) G_p(\pi^\ast_p, \pi_{-p})$
Therefore the CE BR attempts to exploit each policy-conditional "slice". In practice, we only calculate a BR for policies with positive support. Computing the argmax of the BRs can be achieved through RL or by exactly traversing the game tree. Similarly, each BR responds to a weighted mixture of up to $\prod_{p' \ne p} |\Pi_{p'}^t|$ joint opponent policies.
Notice that if the distribution is factorizable (like an NE), then the CE BR is equal for all recommended policies, and furthermore is equal to the CCE BR, illuminating the connection to PSRO's BR operator.
The best response is independent of the best-responding player's own policy. We can compute the argmax in a number of ways; two common ways are exact best response and reinforcement learning.
Exact Best Response:
Maintain exact tabular policies and compute a best response against the joint policies for each player, through maximizing value by traversing the game tree. We employ this approach in this work to allow us to compare meta-solvers without introducing noise from approximate BRs. This method is only suitable for small games, or when using only deterministic policies.
RL:
In this setting, the learning algorithms train against randomly sampled joint-policies according to π , and do standard value maximization. Both on-policy (such as Policy Gradient) and off-policy (such as Q-Learning) are suitable learning algorithms. Function approximation may also be used. This approach has been used extensively in PSRO before.
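The two BR operators can be sketched over a small meta-game. The payoff tables `M` and distribution `sigma` below are hypothetical, and the argmax is restricted to the existing meta-policies purely for illustration (in JPSRO it ranges over the full policy space):

```python
import numpy as np

# Hypothetical meta-game: M[p][i, j] = return of player p when player 0 plays
# its i-th policy and player 1 its j-th; sigma is the joint meta-distribution.
M = [np.array([[1., 3.], [0., 2.]]),
     np.array([[2., 0.], [3., 1.]])]
sigma = np.array([[0.5, 0.25], [0.25, 0.0]])

def cce_br(p):
    """CCE BR: respond to sigma with the responder marginalized out."""
    opp = sigma.sum(axis=p)                        # marginal over opponent policies
    vals = M[p] @ opp if p == 0 else M[p].T @ opp  # value of each candidate policy
    return int(vals.argmax())

def ce_br(p, rec):
    """CE BR for recommendation `rec`: respond to sigma conditioned on rec."""
    slice_ = sigma[rec, :] if p == 0 else sigma[:, rec]
    if slice_.sum() == 0.0:                        # recommendation never issued
        return None
    cond = slice_ / slice_.sum()                   # sigma(pi_-p | pi_p)
    vals = M[p] @ cond if p == 0 else M[p].T @ cond
    return int(vals.argmax())
```

If `sigma` factorizes, every conditional slice equals the marginal, and `ce_br` returns the same index as `cce_br` for all recommendations, mirroring the connection to PSRO noted above.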
E.4 Evaluation Metrics
For two-player, constant-sum games there is a clear evaluation metric: how close the players are to the unique Nash equilibrium (measured by NEGap, defined below). However, outside of this narrow setting it is unclear how to fairly evaluate the policies that have been found, for a number of reasons: there may be multiple equilibria, and equilibria do not necessarily have good payoff. A combination of high payoff and stability is indicative of a strong set of policies. In this section we describe a number of metrics that could help describe the strength of the resulting joint policies.
Value:
This describes the undiscounted return for each player at the root state of a game when following a joint policy, mixed under a joint distribution.
$v_p(\sigma) = \sum_{\pi \in \Pi} \sigma(\pi) G_p(\pi) = \mathbb{E}_{\pi \sim \sigma}\left[G_p(\pi)\right]$
$v_p(\sigma(\cdot \mid \pi_p)) = \sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_{-p} \mid \pi_p) G_p(\pi_p, \pi_{-p}) = \mathbb{E}_{\pi_{-p} \sim \sigma(\cdot \mid \pi_p)}\left[G_p(\pi_p, \pi_{-p})\right]$
NE Gap:
This quantity describes how close joint policies are to an NE (referred to as NashConv in (Lanctot et al., 2017)) under $\sigma$. It is only defined for marginal distributions over policies.
$\mathrm{NEGap}_p(\sigma) = \sum_{\pi \in \Pi} \sigma(\pi) G_p(\mathrm{BR}_p, \pi_{-p}) - v_p(\sigma) = \mathbb{E}_{\pi \sim \sigma}\left[G_p(\mathrm{BR}_p, \pi_{-p})\right] - v_p(\sigma)$
$\mathrm{NEGap}(\sigma) = \sum_p \mathrm{NEGap}_p(\sigma)$ (37)
CCE Gap:
This quantity describes how close joint policies are to a coarse correlated equilibrium (CCE) under $\sigma$. The origins of this metric can be deduced from studying the CCE BR operator.
$\mathrm{CCEGap}_p(\sigma) = \left\lfloor \sum_{\pi \in \Pi} \sigma(\pi) G_p(\mathrm{BR}_p, \pi_{-p}) - v_p(\sigma) \right\rfloor_+ = \left\lfloor \mathbb{E}_{\pi \sim \sigma}\left[G_p(\mathrm{BR}_p, \pi_{-p})\right] - v_p(\sigma) \right\rfloor_+$
$\mathrm{CCEGap}(\sigma) = \sum_p \mathrm{CCEGap}_p(\sigma)$
where $\lfloor x \rfloor_+ = \max(0, x)$ is the non-negative operator. Note that it is possible for a best response over all joint strategies to have lower value than playing according to the joint distribution for a given player (because a BR is blind to the best-responding player's correlation with the opponent policies, and deviating from this correlation can hurt performance).
CE Gap:
This quantity describes how close joint policies are to a correlated equilibrium (CE) under $\sigma$.
$\mathrm{CEGap}_p(\sigma, \pi_p) = \left\lfloor \sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_{-p} \mid \pi_p) G_p(\mathrm{BR}_p(\pi_p), \pi_{-p}) - v_p(\sigma(\cdot \mid \pi_p)) \right\rfloor_+ = \left\lfloor \mathbb{E}_{\pi_{-p} \sim \sigma(\cdot \mid \pi_p)}\left[G_p(\mathrm{BR}_p(\pi_p), \pi_{-p})\right] - v_p(\sigma(\cdot \mid \pi_p)) \right\rfloor_+$
$\mathrm{CEGap}_p(\sigma) = \sum_{\pi_p \in \Pi_p} \sigma(\pi_p) \mathrm{CEGap}_p(\sigma, \pi_p)$
$\mathrm{CEGap}(\sigma) = \sum_p \mathrm{CEGap}_p(\sigma)$
Unique Policy:
Each iteration of JPSRO(CCE) produces $n$ new policies (one for each player), and JPSRO(CE) produces up to the number of policies found so far. These are best responses to the joint mixture of existing policies; however, they are not guaranteed to be distinct from previously found policies. The number of unique policies found so far can be a good indicator of how efficiently a meta-solver produces new policies.
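For a two-player normal-form meta-game, the CCEGap metric reduces to a few lines. The payoffs below are a hypothetical prisoner's-dilemma-style example, not from the paper's experiments:

```python
import numpy as np

def cce_gap(G, sigma):
    """CCEGap(sigma) = sum_p floor(E[G_p(BR_p, pi_-p)] - v_p(sigma))_+ (2 players)."""
    gap = 0.0
    for p in range(2):
        value = float((sigma * G[p]).sum())            # v_p(sigma)
        opp = sigma.sum(axis=p)                        # marginal over the opponent
        dev = G[p] @ opp if p == 0 else G[p].T @ opp   # value of each pure deviation
        gap += max(0.0, float(dev.max()) - value)      # floor(.)_+ = max(0, .)
    return gap

# Hypothetical prisoner's-dilemma-style meta-game payoffs.
G = [np.array([[3., 0.], [5., 1.]]),
     np.array([[3., 5.], [0., 1.]])]
dd = np.array([[0., 0.], [0., 1.]])    # all mass on the joint action (1, 1)
uniform = np.full((2, 2), 0.25)
```

Placing all mass on the mutual-defection joint action yields a gap of zero (it is a CCE of this game), while the uniform joint distribution yields a strictly positive gap.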
E.5 Proof of JPSRO Convergence
We provide two convergence proofs for JPSRO: firstly, when using CCE meta-solvers with a CCE best response operator, which we refer to as JPSRO(CCE), and secondly, when using CE meta-solvers with a CE best response operator, which we refer to as JPSRO(CE). Note that, in order to avoid possibly undefined values of $\sigma^t(\pi_{-p} \mid \pi_p)$, we use the formulation of correlated equilibria in terms of joint probabilities instead of conditional ones; the definitions being equivalent, the conclusions are as well. Note that we also assume that $\forall p, t$, $|\mathrm{BR}_p^t| \ne 0$ and, $\forall \pi_p$ s.t. $\sigma^t(\pi_p) \ne 0$, $|\mathrm{BR}_p^t(\pi_p)| \ne 0$, i.e. every time a best response should be computed, it is. We discuss a relaxation of these conditions, and why it is useful, in Section E.5.3.
E.5.1 Proof of JPSRO(CCE)
Theorem 6 (CCE Convergence).
When using a CCE meta-solver and CCE best response in JPSRO(CCE), the mixed joint policy converges to a CCE under the meta-solver distribution.
We recall the definition of coarse correlated equilibria. For a joint probability $\sigma$, a joint policy set $\Pi = \otimes_p \Pi_p$, where $\Pi_p$ is the set of valid policies of player $p$ and $\otimes$ is the Cartesian product, and a payoff function $G$, such that $G_p(\pi)$ is the payoff of player $p$ when all players play according to $\pi$, a coarse correlated equilibrium is a joint distribution $\sigma$ over $\Pi$ such that, for any player $p$ and any policy $\pi'_p$ of player $p$,
$\sum_{\pi \in \Pi} \sigma(\pi) G_p(\pi'_p, \pi_{-p}) \le \sum_{\pi \in \Pi} \sigma(\pi) G_p(\pi)$ (38)
In other words, a CCE is a distribution from which no player has an incentive to unilaterally deviate before being assigned their action. From this definition of CCEs, we derive the definition of CCEGap, which measures the above gap over all players:
$\mathrm{CCEGap}(\sigma) = \sum_p \left\lfloor \max_{\pi'_p} \sum_{\pi \in \Pi} \sigma(\pi) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi)\right) \right\rfloor_+$
where $\lfloor x \rfloor_+ = \max(0, x)$; this $\lfloor \cdot \rfloor_+$ term is necessary because the gap is potentially negative, as one can see from Equation (38). From this definition, we introduce the following lemma:
Lemma 1 (Game CCE and CCEGap).
We have the following equivalence:
(i) $\sigma$ is a CCE of the game
(ii) $\mathrm{CCEGap}(\sigma) = 0$
Proof.
Let us first prove (i) ⟹ (ii). Suppose $\sigma$ is a CCE. Then for any player $p$ and any policy $\pi'_p$ of player $p$,
$\sum_{\pi \in \Pi} \sigma(\pi) G_p(\pi'_p, \pi_{-p}) \le \sum_{\pi \in \Pi} \sigma(\pi) G_p(\pi)$
therefore, by subtracting the right-hand term and taking the maximum over $\pi'_p \in \Pi_p$,
$\max_{\pi'_p} \sum_{\pi \in \Pi} \sigma(\pi) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi)\right) \le 0$
and so
$\left\lfloor \max_{\pi'_p} \sum_{\pi \in \Pi} \sigma(\pi) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi)\right) \right\rfloor_+ = 0$
Summing this last equality over all players yields (ii).
Let us now prove (ii) ⟹ (i). Suppose that $\sigma$ is such that $\mathrm{CCEGap}(\sigma) = 0$. Then, for all $p$,
$\max_{\pi'_p} \sum_{\pi \in \Pi} \sigma(\pi) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi)\right) \le 0$ (39)
For all $\pi''_p \in \Pi_p$ we have
$\sum_{\pi \in \Pi} \sigma(\pi) G_p(\pi''_p, \pi_{-p}) \le \max_{\pi'_p} \sum_{\pi \in \Pi} \sigma(\pi) G_p(\pi'_p, \pi_{-p})$
and therefore, by subtracting $\sum_{\pi \in \Pi} \sigma(\pi) G_p(\pi)$ and using Equation (39),
$\sum_{\pi \in \Pi} \sigma(\pi) \left(G_p(\pi''_p, \pi_{-p}) - G_p(\pi)\right) \le 0$
Rearranging the terms yields the proof. ∎
The context of JPSRO motivates us to expand and overload the definition of CCEGap. Let us denote by $\Pi^\ast$ the policies of the extensive form game, and by $\Pi^{0:t}$ all the policies found by JPSRO by iteration $t$. We immediately have, for all $t$, $\Pi^{0:t} \subseteq \Pi^\ast$. We expand CCEGap via, for all $t$,
$\mathrm{CCEGap}(\sigma, \Pi^\ast, \Pi^{0:t}) = \sum_p \left\lfloor \max_{\pi^\ast_p \in \Pi^\ast_p} \sum_{\pi \in \Pi^{0:t}} \sigma(\pi) \left(G_p(\pi^\ast_p, \pi_{-p}) - G_p(\pi)\right) \right\rfloor_+$
The only difference is the search space of $\pi^\ast_p$, which now lives within $\Pi^\ast$, while the policies used in the sum live in $\Pi^{0:t}$. It is nevertheless easy to see that this new definition characterizes CCEs of $\Pi^\ast$ (and not of $\Pi^{0:t}$), albeit a restricted class, since $\Pi^{0:t} \subseteq \Pi^\ast$ and one can extend $\sigma$ to be zero over $\Pi^\ast \setminus \Pi^{0:t}$. Let us now prove Theorem 6.
Proof.
To prove that JPSRO with a CCE meta-solver, JPSRO(CCE), converges to a CCE, we need only prove one thing: that JPSRO(CCE) is unable to produce new policies if and only if it has reached a CCE of the extensive form game. Provided this is true, and since all games have a finite number of deterministic policies, JPSRO(CCE) cannot produce new policies forever, and therefore eventually can only produce already-discovered policies.
Note that the joint distribution $\sigma^t$ of JPSRO(CCE) is by construction a CCE over $\Pi^{0:t}$ for all $t$ (when using a CCE meta-solver). It is nevertheless not necessarily a CCE of $\Pi^\ast$.
Let us now suppose that JPSRO(CCE) has not produced any new policy for any player at iteration $t$. Given the JPSRO(CCE) formulation, we can therefore restrict the search space of policies from $\Pi^\ast$ to $\Pi^{0:t}$ in the CCEGap max term, since the max of the expression is reached in $\Pi^{0:t}$, and we thus rewrite the CCEGap definition:
$\sum_p \left\lfloor \max_{\pi'_p \in \Pi^\ast_p} \sum_{\pi \in \Pi^{0:t}} \sigma^t(\pi) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi)\right) \right\rfloor_+ = \sum_p \left\lfloor \max_{\pi'_p \in \Pi_p^{0:t}} \sum_{\pi \in \Pi^{0:t}} \sigma^t(\pi) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi)\right) \right\rfloor_+$
But since $\sigma^t$ is a CCE over $\Pi^{0:t}$, the second term is null. Therefore $\mathrm{CCEGap}(\sigma^t, \Pi^\ast, \Pi^{0:t}) = 0$, and according to Lemma 1, $\sigma^t$ is therefore a CCE over $\Pi^\ast$, which concludes the proof. ∎
E.5.2 Proof of JPSRO(CE)
Theorem 7 (CE Convergence).
When using a CE meta-solver and CE best response in JPSRO(CE), the mixed joint policy converges to a CE under the meta-solver distribution.
We recall the definition of correlated equilibria. Keeping the same notation as above, a correlated equilibrium is a joint distribution $\sigma$ over $\Pi$ such that, for any player $p$ and any policies $\pi_p, \pi'_p$ of player $p$,
$\sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) G_p(\pi'_p, \pi_{-p}) \le \sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) G_p(\pi_p, \pi_{-p})$
In other words, a CE is a distribution from which no player has an incentive to unilaterally deviate even after having been assigned their action. CEs are therefore stronger than CCEs, and the result CEs $\subseteq$ CCEs follows easily from the above inequality. From this definition of CEs, we derive the definition of CEGap, which measures the above gap over all players.
$\mathrm{CEGap}(\sigma) = \sum_p \sum_{\pi_p \in \Pi_p} \left\lfloor \max_{\pi'_p} \sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi_p, \pi_{-p})\right) \right\rfloor_+$
From this definition, we conclude the following lemma:
Lemma 2 (Game CE and CEGap).
We have the following equivalence:
(i) $\sigma$ is a CE of the game
(ii) $\mathrm{CEGap}(\sigma) = 0$
Proof.
Let us first prove (i) ⟹ (ii). Let $\sigma$ be a CE of the game. Therefore, for all $p$ and all $\pi_p, \pi'_p \in \Pi_p$,
$\sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) G_p(\pi'_p, \pi_{-p}) \le \sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) G_p(\pi_p, \pi_{-p})$
therefore
$\sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi_p, \pi_{-p})\right) \le 0$
which is true for all $\pi'_p \in \Pi_p$, so it is also true for the max over them:
$\max_{\pi'_p \in \Pi_p} \sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi_p, \pi_{-p})\right) \le 0$
$\Rightarrow \left\lfloor \max_{\pi'_p \in \Pi_p} \sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi_p, \pi_{-p})\right) \right\rfloor_+ = 0$
Summing over all $p$ and $\pi_p$ yields (ii). Therefore (i) ⟹ (ii).
Let us now suppose that $\sigma$ is such that $\mathrm{CEGap}(\sigma) = 0$. Thus
$\sum_p \sum_{\pi_p \in \Pi_p} \left\lfloor \max_{\pi'_p} \sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi_p, \pi_{-p})\right) \right\rfloor_+ = 0$
Given the presence of the positivity operator $\lfloor \cdot \rfloor_+$, we deduce that for all $p$ and all $\pi_p, \pi'_p \in \Pi_p$,
$\sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi_p, \pi_{-p})\right) \le 0$
We therefore deduce
$\sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) G_p(\pi'_p, \pi_{-p}) \le \sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) G_p(\pi_p, \pi_{-p})$
which concludes the proof. ∎
Once again, the CEGap definition is extended:
$\mathrm{CEGap}(\sigma, \Pi^\ast, \Pi^{0:t}) = \sum_p \sum_{\pi_p \in \Pi_p^{0:t}} \left\lfloor \max_{\pi^\ast_p \in \Pi^\ast_p} \sum_{\pi_{-p} \in \Pi_{-p}^{0:t}} \sigma^t(\pi_p, \pi_{-p}) \left(G_p(\pi^\ast_p, \pi_{-p}) - G_p(\pi_p, \pi_{-p})\right) \right\rfloor_+$
It is once again easy to see that $\mathrm{CEGap}(\sigma, \Pi^\ast, \Pi^{0:t})$ characterizes CEs of $\Pi^\ast$.
With Lemma 2 proven, we now prove Theorem 7.
Proof.
Once again, it is sufficient to prove that JPSRO(CE) stops producing new policies if and only if it has reached a CE of the extensive form game, the rest of the argument being supplied by the finiteness of the game, which forces JPSRO(CE) to eventually stop producing new policies.
Let us now suppose that JPSRO(CE) has not produced any new policy for any player at iteration $t$. This means that for all $\pi_p \in \Pi_p^t$,
$\max_{\pi^\ast_p \in \Pi^\ast_p} \sum_{\pi_{-p} \in \Pi_{-p}^{0:t}} \sigma(\pi_p, \pi_{-p}) G_p(\pi^\ast_p, \pi_{-p}) = \max_{\pi'_p \in \Pi_p^t} \sum_{\pi_{-p} \in \Pi_{-p}^{0:t}} \sigma(\pi_p, \pi_{-p}) G_p(\pi'_p, \pi_{-p})$
We subtract $\sum_{\pi_{-p} \in \Pi_{-p}^t} \sigma(\pi_p, \pi_{-p}) G_p(\pi_p, \pi_{-p})$ from both expressions, apply $\lfloor \cdot \rfloor_+$, sum over $\pi_p \in \Pi_p^t$ and $p$, and finally apply the fact that $\sigma$ is a CE of the restricted game to obtain
$\mathrm{CEGap}(\sigma, \Pi^\ast, \Pi^{0:t}) = \sum_p \sum_{\pi_p \in \Pi_p} \left\lfloor \max_{\pi'_p \in \Pi_p^t} \sum_{\pi_{-p} \in \Pi_{-p}} \sigma(\pi_p, \pi_{-p}) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi_p, \pi_{-p})\right) \right\rfloor_+ = 0$
which, by extension, is also true for the CEGap over the extensive form game. By Lemma 2, $\sigma$ is therefore a CE of the extensive form game, which concludes the proof. ∎
E.5.3 Relaxation of Proof Requirements
Our definition of best responses (BRs) is that they are functions that return a set of policies which maximize their value against a given objective. There are two reasons to return a set of policies. Firstly, the max of a given objective can be reached at different points, so returning a set enables us to potentially include them all. Secondly, using sets also enables us to potentially set some of the BR outputs to $\emptyset$; concretely, this means that no policy is computed by the BR in that case, which saves compute time and memory. The proofs shown so far rely on each BR having cardinality greater than or equal to 1, which means that at least one new policy is computed every time the BR operator is called. We can relax this condition into the following conditions, which we prove are sufficient (but not necessary) for convergence.
CCE-Condition:
$\forall T \ge 0$, $\forall p$, $\exists t > T$, $|\mathrm{BR}_p^t| \ge 1$,
i.e. each player receives an infinity of best responses.
CE-Condition:
$\forall T > 0$, $\forall p$, $\forall \pi_p$, $\exists t > T$, either $\forall t' \ge t$, $\sigma^{t'}(\pi_p) = 0$, or $|\mathrm{BR}_p^t(\pi_p)| \ge 1$,
i.e. any policy of any player is either never selected by the CE meta-solver after some time, or is considered for a best response an infinite number of times.
Solver-Condition:
$\forall t$, $\forall t' \ge t$, if $\forall p$, $\forall \pi_p \in \Pi_p^{0:t'}$, $\pi_p \in \Pi_p^{0:t}$, then $\forall \pi \in \Pi^{0:t}$ (or $\pi \in \Pi^{t'}$), $\sigma^t(\pi) = \sigma^{t'}(\pi)$: if no new policy has been added to the pool between $t$ and $t'$, the amount of mass granted to each policy by the solver does not change, i.e. repeating policies does not affect solver outputs, and the solver's outputs are constant given the same pools.
The rest of this section presents the relaxed theorems and their proofs, and discusses why such a relaxation is of interest.
Relaxed Theorems and Proofs
Theorem 8 (Relaxed CCE-Convergence).
When using a CCE meta-solver and CCE best response in JPSRO(CCE), under the CCE-Condition and the Solver-Condition, the mixed joint policy converges to a CCE under the meta-solver distribution.
Proof.
Let us suppose the CCE-Condition and the Solver-Condition hold. We have that JPSRO(CCE) will necessarily be able to produce new policies until it reaches a CCE. Let us prove this: while $\mathrm{CCEGap}(\sigma^t, \Pi^\ast, \Pi^{0:t}) > 0$, JPSRO(CCE) is able to add at least one new policy to its pool. Indeed, let $t \ge 0$ be such that $\mathrm{CCEGap}(\sigma^t, \Pi^\ast, \Pi^{0:t}) > 0$. Then there exists at least one $p$ such that
$\max_{\pi'_p \in \Pi^\ast_p} \sum_{\pi \in \Pi^{0:t}} \sigma^t(\pi) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi)\right) > 0$.
Let us select one such $p$ with minimal $t' \ge t$ such that $|\mathrm{BR}_p^{t'}| \ge 1$, i.e. the first best response with positive CCEGap to be added to the pool at or after $t$; such a $t'$ exists because we assume the CCE-Condition. Let us suppose that no new policies have been added to the pool between $t$ and $t'$. Then, since no new best response has been added to the pool between $t$ and $t'$, $\sigma^t = \sigma^{t'}$ because we assume the Solver-Condition, and therefore $\forall \pi' \in \mathrm{BR}_p^{t'}$,
$\sum_{\pi \in \Pi^{0:t}} \sigma^t(\pi) \left(G_p(\pi', \pi_{-p}) - G_p(\pi)\right) > 0$.
We have that necessarily $\mathrm{BR}_p^{t'} \cap \Pi_p^{0:t} = \emptyset$, as otherwise $\sigma^t$ would not be a CCE of $\Pi^{0:t}$: indeed, since $\sigma^t$ is a CCE of $\Pi^{0:t}$, $\mathrm{CCEGap}(\sigma^t, \Pi^{0:t}, \Pi^{0:t}) = 0$, and thus $\forall p$, $\pi'_p \in \Pi_p^{0:t}$,
$\sum_{\pi \in \Pi^{0:t}} \sigma^t(\pi) \left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi)\right) \le 0$,
thus new best responses can be added to the pool. We therefore have that $\mathrm{CCEGap}(\sigma^t, \Pi^\ast, \Pi^{0:t}) > 0$ implies that at least one new policy can be found by JPSRO.
Thus a new best response can always be added to the pool (and will always be added, given the CCE-Condition) while $\sigma^t$ is not a CCE of the extensive form game. Therefore, if JPSRO(CCE) is unable to add any new policy to the pool (which has to be verified over all players, or measured through CCEGap), then it must be at a CCE, which concludes the proof. ∎
Theorem 9 (Relaxed CE-Convergence).
When using a CE meta-solver and CE best response in JPSRO(CE), under CE-Condition and Solver-Condition, the mixed joint policy converges to a CE under the meta-solver distribution.
Proof.
Let us suppose CE-Condition and Solver-Condition. We show that JPSRO(CE) will necessarily be able to produce new policies until it reaches a CE: while $\mathrm{CEGap}(\sigma^t, \Pi^\infty, \Pi^{0:t}) > 0$, JPSRO(CE) is able to add at least one new policy to its pool. Indeed, let $t \geq 0$ be such that $\mathrm{CEGap}(\sigma^t, \Pi^\infty, \Pi^{0:t}) > 0$. Then there exists at least one pair $p, \pi_p$ with $\sigma^t(\pi_p) > 0$ such that
$$\max_{\pi'_p \in \Pi^\infty_p} \sum_{\pi_{-p} \in \Pi^{0:t}_{-p}} \sigma^t(\pi_p, \pi_{-p})\left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi_p, \pi_{-p})\right) > 0.$$
By CE-Condition, we have that either new policies have been added to the pool before any such $p, \pi_p$ has been selected, or there exists $t' \geq t$ such that $|\mathrm{BR}_p^{t'}(\pi_p)| \geq 1$. Indeed, if no new best response has been added to the pool by $t' \geq t$, Solver-Condition implies that for all these $p, \pi_p$ with $\sigma^t(\pi_p) > 0$ we also have $\sigma^{t'}(\pi_p) > 0$, hence there exists $t'$ with $|\mathrm{BR}_p^{t'}(\pi_p)| \geq 1$. Let us select the minimal such $t'$ over all $p, \pi_p$ with $\mathrm{CEGap}_p(\sigma^t, \Pi^\infty, \Pi^{0:t})(\pi_p) > 0$.
Let us suppose that no new policies have been added to the pool between $t$ and $t'$. Then, since no new best response has been added to the pool between $t$ and $t'$, Solver-Condition gives $\sigma^t = \sigma^{t'}$, and therefore for all $\pi' \in \mathrm{BR}_p^{t'}(\pi_p)$,
$$\sum_{\pi_{-p} \in \Pi^{0:t}_{-p}} \sigma^t(\pi_p, \pi_{-p})\left(G_p(\pi', \pi_{-p}) - G_p(\pi_p, \pi_{-p})\right) > 0.$$
We have that necessarily $\mathrm{BR}_p^{t'}(\pi_p) \cap \Pi_p^{0:t} = \emptyset$, as otherwise $\sigma^t$ would not be a CE of $\Pi^{0:t}$: indeed, since $\sigma^t$ is a CE of $\Pi^{0:t}$, for all $p$, $\pi_p \in \Pi_p^{0:t}$, and $\pi'_p \in \Pi_p^{0:t}$,
$$\sum_{\pi_{-p} \in \Pi^{0:t}_{-p}} \sigma^t(\pi_p, \pi_{-p})\left(G_p(\pi'_p, \pi_{-p}) - G_p(\pi_p, \pi_{-p})\right) \leq 0.$$
Thus new best responses can be added to the pool. We therefore have that $\mathrm{CEGap}(\sigma^t, \Pi^\infty, \Pi^{0:t}) > 0$ implies that at least one new policy can be found by JPSRO.
Thus a new best response can always be added to the pool while $\sigma^t$ is not a CE of the extensive form game, and will always be added since we suppose CE-Condition. Therefore, if JPSRO(CE) is unable to add any new policy to the pool (which has to be verified over all players, or measured through CEGap), then it must be at a CE, which concludes the proof. ∎
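Unlike the CCE case, the CE deviation test conditions on the player's own recommendation. A minimal sketch of the CEGap computation for a two-player restricted normal-form game, under the same assumed NumPy representation as before (function name ours):

```python
import numpy as np

def ce_gap(sigma, payoffs):
    """Largest conditional benefit of deviating after observing one's own
    recommendation (the CE gap), for a two-player restricted game.

    sigma:   joint distribution over restricted joint policies, shape (n1, n2).
    payoffs: [G0, G1], per-player payoff matrices, each of shape (n1, n2).
    """
    gap = 0.0
    # Player 0: for each recommended row r with positive mass, compare obeying
    # against the best swap r -> r' under the conditional opponent distribution
    # (unnormalized, as in the CEGap definition).
    for r in range(sigma.shape[0]):
        if sigma[r].sum() <= 0:
            continue
        obey = payoffs[0][r] @ sigma[r]
        best = (payoffs[0] @ sigma[r]).max()
        gap = max(gap, best - obey)
    # Player 1 symmetrically, over recommended columns.
    for c in range(sigma.shape[1]):
        if sigma[:, c].sum() <= 0:
            continue
        obey = payoffs[1][:, c] @ sigma[:, c]
        best = (payoffs[1].T @ sigma[:, c]).max()
        gap = max(gap, best - obey)
    return gap
```

Because the deviation may depend on the recommendation, $\mathrm{CEGap} \geq \mathrm{CCEGap}$ on any fixed distribution, consistent with CEs being a subset of CCEs.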
Discussion on Relaxation
These relaxed conditions matter especially for JPSRO(CE), which has potentially exponential complexity in the number of policies to keep: if the solver spreads mass on all policies at each iteration, then the number of policies in each player's pool at iteration $t$ is at least $1 + \sum_{i=1}^{t} 2^i = 2^{t+1} - 1$.
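The geometric bound above can be checked numerically; a tiny illustrative sketch (the function name is ours):

```python
def pool_size_lower_bound(t):
    """Worst-case pool size per player at iteration t for JPSRO(CE):
    one initial policy plus 2**i new best responses added at iteration i."""
    return 1 + sum(2**i for i in range(1, t + 1))

# The geometric series collapses to the closed form 2**(t+1) - 1.
assert all(pool_size_lower_bound(t) == 2**(t + 1) - 1 for t in range(12))
```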
Given that the policies produced for one player at the same iteration are potentially similar (even identical), a number of modifications could be imagined to keep JPSRO(CE) tractable. For example: (a) randomly select only one $\pi_p$ from which to best respond for each player at each iteration, (b) compute a best response for only one randomly chosen $\pi_p$ of one randomly chosen player $p$ at each iteration, or (c) compute all best responses, but only add the one with the largest gap to the pool.
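The mitigation strategies above can be sketched as a single selection step. This is a minimal illustration with hypothetical names (`gaps`, `select_best_responses`), not the paper's implementation:

```python
import random

def select_best_responses(gaps, mode="largest"):
    """Pick which CE best responses to add to one player's pool.

    gaps: dict mapping recommended policy -> (best_response, gap).
    mode: 'random_policy' -> best respond to one random recommendation
                             with positive gap (strategies a/b),
          'largest'       -> keep only the BR with the largest gap (c),
          'all'           -> keep every BR with positive gap.
    """
    positive = {pi: br for pi, (br, g) in gaps.items() if g > 0}
    if not positive:
        return []  # no improving deviation: this player is done for now
    if mode == "random_policy":
        return [random.choice(list(positive.values()))]
    if mode == "largest":
        best_pi = max(gaps, key=lambda pi: gaps[pi][1])
        return [gaps[best_pi][0]]
    return list(positive.values())
```

Note that strategies (a) and (b) still require all gaps to be computed eventually if convergence is to be measured through CEGap.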
Note that it is necessary to impose a condition on the solver (although an alternate Solver-Condition could be formulated). To illustrate this, let us imagine an oscillation between the best response chooser and the solver. Consider a two-player game for which, in JPSRO(CCE), the best response operator computes a best response only for player 1 on even $t$ (and only for player 2 on odd $t$). Suppose also that the current restricted game has two CCEs. The first (CCE1) is not "expandable" for player 1 but is for player 2 (i.e. player 1's best response is already in the pool, but player 2's is not). The second (CCE2) is expandable for player 1 but not for player 2. If the CCE solver outputs CCE1 on even $t$ and CCE2 on odd $t$, then the algorithm never produces new policies, and therefore never converges.
Of course, the conditions provided are sufficient but not necessary. In the case where the randomizations of the best response and meta-solver outputs are decorrelated, it makes intuitive sense that the algorithm should also converge with probability 1, which one can prove with a more involved argument.
Appendix F Games
We study several games with JPSRO: Kuhn Poker, Trade Comm, and Sheriff. These cover three-player, general-sum, and common-payoff games. Implementations of all the games are available in OpenSpiel (Lanctot et al., 2019).
Kuhn Poker:
A simplified n-player, zero-sum, sequential, imperfect information version of poker, played with $n + 1$ cards. In each round of the game, every remaining player antes one chip and one card is dealt to each player. Each player then has two choices: bet one chip or check. If a player bets, the other players have the option to call or fold. Among the players that bet, the one with the highest card wins; if all players check, the player with the highest card wins. The original two-player game is described in (Kuhn, 1950), and an n-player extension in (Lanctot, 2014). Additional information about the game (such as equilibria) can be found in (Hoehn et al.,).
Trade Comm:
A simple two-player, common-payoff trading game (Sokota et al., 2021). Each player receives, in secret, one of $I$ different items. The first player then makes one of $I$ utterances to the second player, and vice versa. Each player then privately chooses one of $I^2$ trades; if the trades are compatible, both players receive a reward of 1, otherwise both receive 0. The goal of the players is therefore to find a bijection between the items, the utterances, and the trade proposals. There are $I^4$ deterministic policies per player, and good learning algorithms will be able to search over these policies. Because the game is common-payoff, it is very transitive and has many dominated strategies; however, there are multiple strategies with equal payoff, and therefore many equilibria in partially explored policy space. It is for this reason that many learning algorithms get stuck exploiting sub-optimal policies they have already found.
Sheriff:
A simplified two-player, general-sum version of the board game Sheriff of Nottingham (Farina et al., 2019b). The game pits a smuggler, who is motivated to import contraband without getting caught, against a sheriff, who is motivated to either find contraband or accept bribes. The players negotiate a bribe over several rounds, after which the bribe is accepted or rejected. If the sheriff finds contraband, the smuggler pays a fine; if no contraband is found, the sheriff must pay compensation to the smuggler. The smuggler also gains value from smuggling goods. The game has different optimal values under the NFCCE, EFCCE, EFCE, and NFCE solution concepts.
Appendix G JPSRO Hyper-parameters
There are a number of ways of implementing JPSRO in practice through various hyper-parameters.
Best Response:
We use an exact best response calculation that assigns uniform probability over valid actions in states with zero reach probability. However, other best response approaches will also work, including reinforcement learning (which we leave to future work).
Pool Type:
The data structure used to store the policies found so far can be either a set or a multiset. Using a set ensures that all policies are unique and appear only once, even if multiple iterations produce the same best response policy. Some meta-solvers rely on repeated policies being present for convergence (for example, the uniform meta-solver can converge in two-player, zero-sum games because the repeated policies tend towards a NE). In this case, using a multiset is more suitable. This parameterization is only relevant when using tabular policies, which can be checked for equality.
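As a minimal illustration (our sketch, not OpenSpiel's implementation), the set/multiset distinction can be captured by a small pool class keyed on a hashable encoding of each tabular policy:

```python
from collections import Counter

class PolicyPool:
    """Stores discovered policies; a multiset keeps repeats, a set dedupes."""

    def __init__(self, multiset=True):
        self.multiset = multiset
        self.counts = Counter()

    def add(self, policy_key):
        """policy_key: any hashable encoding of a tabular policy."""
        if self.multiset or policy_key not in self.counts:
            self.counts[policy_key] += 1

    def size(self):
        return sum(self.counts.values())
```

With `multiset=True`, adding the same best response twice yields a pool of size 2; with `multiset=False`, the pool stays at size 1, which changes the weights a uniform meta-solver assigns.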
Player Updates Per Iteration:
It is not necessary to find the best response for all players at every iteration. Other strategies, such as cycling through players or randomly selecting a player, will work too; it is sufficient that, over time, all players are updated. Updating a single player at a time is more efficient when minimizing the number of best responses necessary for convergence; updating all players, however, can be done in parallel.
Best Responses Per Iteration:
When computing the CE best response, each player has several best responses to calculate. It is not necessary to compute them all and, even if they are all computed, it is not necessary to add them all to the pool of policies. Best responses can be selected at random for calculation, and only those with nonzero gap need be added, or perhaps only the one with the largest gap. In order to measure convergence to a CE, however, all best responses (and their gaps) must be computed.
Policy Initialization:
Policies can be initialized in any manner and the algorithm will converge to an equilibrium under any initial condition. However, the initial policies do determine the space of reachable equilibria (for example, it may not be possible to find the MWCE from all initial policies). JPSRO works, without limitation, using only deterministic policies; however, stochastic policies are supported too. A stochastic uniform policy over valid actions is a reasonable default.
Best Response Type:
The most important parameterization is picking one of the two best response types: CE and CCE. The resulting algorithm is named either JPSRO(CE) or JPSRO(CCE) respectively.
Meta-Solvers:
The second most important parameterization is the type of meta-solver to use (Table 2). An important constraint is that JPSRO(CE) is only guaranteed to converge under CE meta-solvers. JPSRO(CCE) must use CCE meta-solvers (noting that CEs are a subset of CCEs).
Table 2: Summary of meta-solvers used during experiments and their properties. We use the normalized $\epsilon$ for naming; for example, $\frac{1}{100}\epsilon$-MGCE means $\frac{1}{100}\max(A\epsilon)\,\epsilon$-MGCE.

| Meta-Solver | Joint | CCE | CE | Max Val | Max Ent | Rand |
|---|---|---|---|---|---|---|
| Uniform | | | | | ✓ | |
| PRD | | | | | | |
| α-Rank | ✓ | | | | | |
| Rand Dirichlet | ✓ | | | | | ✓ |
| Rand Joint | ✓ | | | | | ✓ |
| RMWCCE | ✓ | ✓ | | ✓ | | ✓ |
| RVCCE | ✓ | ✓ | | | | ✓ |
| $\frac{1}{100}\epsilon$-MGCCE | ✓ | | | | | |
| $\epsilon$-MGCCE | ✓ | ✓ | | | ✓ | |
| $\min\epsilon$-MGCCE | ✓ | ✓ | | | ✓ | |
| RMWCE | ✓ | ✓ | ✓ | ✓ | | ✓ |
| RVCE | ✓ | ✓ | ✓ | | | ✓ |
| $\frac{1}{100}\epsilon$-MGCE | ✓ | | | | | |
| $\epsilon$-MGCE | ✓ | ✓ | ✓ | | ✓ | |
| $\min\epsilon$-MGCE | ✓ | ✓ | ✓ | | ✓ | |

Appendix H Experiments
We conduct experiments over three extensive form games to demonstrate the versatility of the algorithm on n-player, general-sum games. For each game we run both the JPSRO(CCE) and JPSRO(CE) algorithms under all suitable meta-solvers and baselines.
For JPSRO(CCE), we initialize using uniform policies, update all players at every iteration, and use multi-sets for the pool. For JPSRO(CE), we initialize using uniform policies, update all players at every iteration, only add the highest-gap BR to the pool for each player at each iteration, and use multi-sets for the pool. For random meta-solvers we repeat the experiment five times and show the average, otherwise the experiment is deterministic. The experiments were run for 6 hours, after which any that had not finished were truncated.
In order to measure performance, we track five metrics:
1.
The gap to equilibrium under a maximum welfare equilibrium (MW(C)CE) distribution. This describes how close the algorithm is to finding a set of joint policies that are in exact equilibrium in the extensive form game.
2.
The gap to equilibrium under the meta-solverβs distribution. This is the gap that JPSRO theoretically converges to when using (C)CEs.
3.
The value of the game to the players under the MW(C)CE distribution.
4.
The value of the game to the players under the meta-solverβs distribution.
5.
The number of unique policies found so far.
Ultimately, the algorithm should be finding high-value joint policies that are in equilibrium, over a variety of games. The first game is a purely competitive, three-player game called Kuhn Poker (Figure 3). The second game is a purely cooperative, common-payoff game called Trade Comm (Figure 4). The final game is a general-sum game called Sheriff (Figure 5).
Appendix I Open Source Code
An open source implementation of JPSRO is available in OpenSpiel (Lanctot et al., 2019) at https://github.com/deepmind/open_spiel/blob/master/open_spiel/python/examples/jpsro.py.
Appendix J Necessity of Population Based Training
In the absence of a correlating signal, a single joint policy is, in general, insufficient to represent a correlated equilibrium. To see this, let us consider the Traffic Light game (Figure 1(b)). One possible correlated equilibrium consists in recommending (G, W) half of the time, and (W,G) the other half.
Let us now consider this game as an extensive-form, partial-information game, where the row player first chooses their action, and the column player then chooses their own without knowing the action chosen by the row player. In the absence of a correlating signal, it is impossible for the column player to know which action the row player has played, and therefore playing (G, W) or (W, G) becomes impossible, as the column player is unable to change their action as a function of the action taken by the row player.
Therefore, without modifying the game and observation space to add a correlating signal, convergence to a correlated equilibrium necessarily requires a distribution over joint policies. Population Based Training (PBT), a set of methods that slowly grow the space of (joint) policies, therefore appears to be the appropriate framework to converge to (C)CEs without adding correlating signals to the considered game.
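The Traffic Light argument can be made concrete with illustrative payoffs of our own choosing (the exact values in the paper's Figure 1(b) may differ): mutual Go crashes, mutual Wait wastes time, and going while the other waits rewards the mover. The sketch below verifies that the 50/50 correlating device over (G, W) and (W, G) is incentive compatible, i.e. a correlated equilibrium.

```python
import numpy as np

# Illustrative Traffic Light payoffs (assumed, not the paper's exact numbers).
# Rows/columns are indexed as 0 = Go, 1 = Wait.
G0 = np.array([[-100.0,  1.0],
               [   0.0, -1.0]])  # row player's payoffs
G1 = G0.T                        # symmetric game: column player mirrors

# Correlating device: half (Go, Wait), half (Wait, Go).
sigma = np.array([[0.0, 0.5],
                  [0.5, 0.0]])

def is_ce(sigma, G0, G1, tol=1e-12):
    """Check that obeying every recommendation with positive mass is a
    conditional best response for both players."""
    for r in range(2):  # row player recommendations
        if sigma[r].sum() > 0:
            obey = G0[r] @ sigma[r]
            if (G0 @ sigma[r]).max() > obey + tol:
                return False
    for c in range(2):  # column player recommendations
        if sigma[:, c].sum() > 0:
            obey = G1[:, c] @ sigma[:, c]
            if (G1.T @ sigma[:, c]).max() > obey + tol:
                return False
    return True
```

A recommended Wait implies the other driver was told Go, so deviating to Go risks the crash payoff; this is exactly the correlation that no single joint policy can realize without a signal.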
Figure 3: JPSRO(CCE) and JPSRO(CE) on three-player Kuhn Poker. All (C)CE MSs, PRD, and α-Rank find joint policies capable of supporting equilibrium (although α-Rank was slow and was terminated after 6 hours). This is some evidence that classic MSs designed for the two-player, zero-sum setting can generalize well to the three-player, zero-sum setting.

Figure 4: JPSRO(CCE) and JPSRO(CE) on three-item Trade Comm. In JPSRO(CCE), $\frac{1}{100}\min$-MGCCE fails to find the maximum welfare equilibrium; however, all other (C)CE MSs find it. Unexpectedly, α-Rank performs well on this game, while all other classic MSs fail to make progress on this purely cooperative game. Performing well on this game requires exploration, so the random joint MS is able to make progress, albeit naively and slowly.

Figure 5: JPSRO(CCE) and JPSRO(CE) on Sheriff. This game is interesting because it is general-sum and different solution concepts have different optimal maximum welfare values. The maximum welfare NFCCE is 13.64 for the smuggler and 2.0 for the sheriff, which JPSRO(CCE) successfully finds, while the maximum welfare NFCE is 0.82 for the smuggler and 0.0 for the sheriff, which JPSRO(CE) successfully finds. This demonstrates the appeal of using NFCCE as a target equilibrium. Interestingly, for this game, $\frac{1}{100}\epsilon$-MG(C)CE was able to produce BRs of high enough quality to converge, which is evidence that scaled methods that only approximate (C)CEs may be enough in some settings. RMWCCE converged to an equilibrium, but not the welfare maximizing one, providing evidence that greedy MSs are not always suitable. By a similar argument, $\min\epsilon$-MGCCE did not reach the maximum welfare solution within the allocated number of iterations. RV(C)CE is efficient at finding novel policies, but ones of limited utility. PRD and α-Rank perform well and find the maximum welfare (C)CE equilibria.