
Title: Efficient Reinforcement Learning for Global Decision Making in the Presence of Local Agents at Scale

URL Source: https://arxiv.org/html/2403.00222

Markdown Content:

Emile Anand, Computing and Mathematical Sciences, California Institute of Technology, eanand@caltech.edu

Guannan Qu, Department of Electrical and Computer Engineering, Carnegie Mellon University, gqu@andrew.cmu.edu

Work done while author was a visiting student at Carnegie Mellon University.

Abstract

We study reinforcement learning for global decision-making in the presence of local agents, where the global decision-maker makes decisions affecting all local agents, and the objective is to learn a policy that maximizes the joint rewards of all the agents. Such problems find many applications, e.g. demand response, EV charging, and queueing. In this setting, scalability has been a long-standing challenge due to the size of the state space, which can be exponential in the number of agents. This work proposes the SUBSAMPLE-Q algorithm, where the global agent subsamples $k \leq n$ local agents to compute a policy in time that is polynomial in $k$. We show that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k} + \epsilon_{k,m})$ as the number of subsampled agents $k$ increases, where $\epsilon_{k,m}$ is the Bellman noise. Finally, we validate the theory through numerical simulations in a demand-response setting and a queueing setting.

1 Introduction

Global decision-making for local agents, where a global agent makes decisions that affect a large number of local agents, is a classical problem that has been widely studied in many forms (Foster et al., 2022; Qin et al., 2023; Foster et al., 2023) and can be found in many applications, e.g. network optimization, power management, and electric vehicle charging (Kim & Giannakis, 2017; Zhang & Pavone, 2016; Molzahn et al., 2017). However, a critical challenge is the uncertain nature of the underlying system, which can be very hard to model precisely. Reinforcement Learning (RL) has shown impressive performance in a wide array of applications, such as the game of Go (Silver et al., 2016), autonomous driving (Kiran et al., 2022), and robotics (Kober et al., 2013). More recently, RL has emerged as a powerful tool for learning to control unknown systems (Ghai et al., 2023; Lin et al., 2023; 2024a; 2024b), and hence has great potential for decision-making in multi-agent systems, including the problem of global decision-making for local agents.

However, RL for multi-agent systems becomes intractable as the number of agents increases, due to the curse of dimensionality (Blondel & Tsitsiklis, 2000). For instance, RL algorithms such as tabular 𝑄-learning and temporal difference (TD) learning require storing a 𝑄-function (Bertsekas & Tsitsiklis, 1996; Powell, 2007) that is as large as the state-action space. Even if the individual agents’ state space is small, the global state space can take values from a set whose size is exponential in the number of agents. In the case where the system’s rewards are not discounted, reinforcement learning on multi-agent systems is provably NP-hard (Blondel & Tsitsiklis, 2000). This problem of scalability has been observed in a variety of settings (Papadimitriou & Tsitsiklis, 1999; Guestrin et al., 2003). A promising line of research that has emerged over recent years constrains the problem to a networked instance to enforce local interactions between agents (Lin et al., 2020; 2021; Qu et al., 2020b; Jing et al., 2022; Chu et al., 2020). This has led to scalable algorithms where each agent only needs to consider the agents in its neighborhood to derive approximately optimal solutions. However, these results do not apply to our setting, where one global agent interacts with many local agents. This can be viewed as a star graph, where the neighborhood of the central decision-making agent is large.

Beyond the networked formulation, another exciting line of work that addresses this intractability is mean-field RL (Yang et al., 2018). The mean-field RL approach assumes that all the agents are homogeneous in their state and action spaces, which allows the interactions between agents to be approximated by a representative “mean” agent. This reduces the complexity of 𝑄 -learning to polynomial in the number of agents, and learns an approximately optimal policy where the approximation error decays with the number of agents (Gu et al., 2021; 2022a). However, mean-field RL does not directly transfer to our setting as the global decision-making agent is heterogeneous to the local agents. Further, when the number of local agents is large, it might still be impractical to store a polynomially-large 𝑄 -table (where the polynomial’s degree is the size of the state space for a single agent). This motivates the following fundamental question: can we design a fast and competitive policy-learning algorithm for a global decision-making agent in a system with many local agents?

Contributions. We answer this question affirmatively. Our key contributions are outlined below.

• Subsampling Algorithm. We propose SUBSAMPLE-Q, an algorithm designed to address the challenge of global decision-making in systems with a large number of pseudo-heterogeneous local agents. We model the problem as a Markov Decision Process with a global decision-making agent and $n$ local agents. SUBSAMPLE-Q (Algorithms 1 and 2) first chooses $k$ local agents to learn a deterministic policy $\hat{\pi}_{k,m}^{\mathrm{est}}$, where $m$ is the number of samples used to update the $Q$-function’s estimates, by performing mean-field value iteration on the $k$ local agents to learn $\hat{Q}_{k,m}^{\mathrm{est}}$, which can be viewed as a smaller $Q$-function of size polynomial in $k$, instead of polynomial in $n$ (as done in the mean-field RL literature). It then deploys a stochastic policy $\pi_{k,m}^{\mathrm{est}}$ that chooses $k$ local agents, uniformly at random, at each step to find an action for the global agent using $\hat{\pi}_{k,m}^{\mathrm{est}}$.

• Theoretical Guarantee. Theorem 3.4 shows that the performance gap between $\pi_{k,m}^{\mathrm{est}}$ and the optimal policy $\pi^*$ is $O\left(\frac{1}{\sqrt{k}} + \epsilon_{k,m}\right)$, where $\epsilon_{k,m}$ is the Bellman noise in $\hat{Q}_{k,m}^{\mathrm{est}}$. The choice of $k$ reveals a fundamental trade-off between the size of the $Q$-table stored and the optimality of $\pi_{k,m}^{\mathrm{est}}$. For $k = O(\log n)$, SUBSAMPLE-Q runs in time polylogarithmic in $n$, creating an exponential speedup from the previously best-known polytime mean-field RL methods, with a decaying optimality gap.

• Numerical Simulations. We demonstrate the effectiveness of SUBSAMPLE-Q in a power system demand-response problem in Example 5.1, and in a queueing problem in Example 5.2. A key inspiration of our approach is the power-of-two-choices in the queueing theory literature (Mitzenmacher & Sinclair, 1996), where a dispatcher subsamples two queues to make decisions. Our work generalizes this to a broader decision-making problem.

While our result is theoretical in nature, it is our hope that SUBSAMPLE-Q will lead to further investigation into the power of sampling in Markov games and inspire practical algorithms.

2 Preliminaries

Notation. For $k, m \in \mathbb{N}$ where $k \leq m$, let $\binom{[m]}{k}$ denote the set of $k$-sized subsets of $[m] = \{1, \dots, m\}$. Let $\overline{[m]} = \{0\} \cup [m]$. For any vector $z \in \mathbb{R}^d$, let $\|z\|_1$ and $\|z\|_\infty$ denote the standard $\ell_1$ and $\ell_\infty$ norms of $z$ respectively. Let $\|\mathbf{A}\|_1$ denote the matrix $\ell_1$-norm of $\mathbf{A} \in \mathbb{R}^{n \times m}$. Given a collection of variables $s_1, \dots, s_n$, the shorthand $s_\Delta$ denotes the set $\{s_i : i \in \Delta\}$ for $\Delta \subseteq [n]$. We use $\tilde{O}(\cdot)$ to suppress polylogarithmic factors in all problem parameters except $n$. For any discrete measurable space $(\mathcal{S}, \mathcal{F})$, the total variation distance between probability measures $\mu_1, \mu_2$ is given by

$$\mathrm{TV}(\mu_1, \mu_2) = \frac{1}{2} \sum_{s \in \mathcal{S}} |\mu_1(s) - \mu_2(s)|.$$

Finally, for $C \subset \mathbb{R}$, $\Pi_C : \mathbb{R} \to C$ denotes a projection onto $C$ in $\ell_1$-norm.
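As a small illustration of this notation, the following Python sketch enumerates the $k$-sized subsets of $[m]$ and evaluates the total variation distance between two probability mass functions. The helper names `k_subsets` and `tv_distance` and the two toy distributions are assumptions made for this example; they do not come from the paper.

```python
from itertools import combinations


def k_subsets(m, k):
    """All k-sized subsets of [m] = {1, ..., m}, each as a sorted tuple."""
    return list(combinations(range(1, m + 1), k))


def tv_distance(mu1, mu2):
    """Total variation distance between two pmfs given as dicts state -> probability."""
    support = set(mu1) | set(mu2)
    return 0.5 * sum(abs(mu1.get(s, 0.0) - mu2.get(s, 0.0)) for s in support)


# Toy inputs (hypothetical): subsets of [4] of size 2, and two pmfs on two states.
subsets = k_subsets(4, 2)
mu1 = {"idle": 0.5, "busy": 0.5}
mu2 = {"idle": 0.25, "busy": 0.75}
distance = tv_distance(mu1, mu2)
```

Representing a distribution as a dictionary keeps the sum over the discrete state space explicit, which mirrors the definition above.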

2.1 Problem Formulation

Problem Statement. We consider a system of $n + 1$ agents given by $\mathcal{N} = \{0\} \cup [n]$. Let agent $0$ be the “global agent” decision-maker, and agents $[n]$ be the “local” agents. In this model, each agent $i \in [n]$ is associated with a state $s_i \in \mathcal{S}_l$, where $\mathcal{S}_l$ is the local agent’s state space. The global agent is associated with a state $s_g \in \mathcal{S}_g$ and action $a_g \in \mathcal{A}_g$, where $\mathcal{S}_g$ is the global agent’s state space and $\mathcal{A}_g$ is the global agent’s action space. The global state of all agents is given by $(s_g, s_1, \dots, s_n) \in \mathcal{S} := \mathcal{S}_g \times \mathcal{S}_l^n$. At each time-step $t$, the next state for all the agents is independently generated by stochastic transition kernels $P_g : \mathcal{S}_g \times \mathcal{S}_g \times \mathcal{A}_g \to [0, 1]$ and $P_l : \mathcal{S}_l \times \mathcal{S}_l \times \mathcal{S}_g \to [0, 1]$ as follows:

$$s_g(t+1) \sim P_g(\cdot \mid s_g(t), a_g(t)), \tag{1}$$

$$s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)), \quad \forall i \in [n]. \tag{2}$$

The global agent selects 𝑎 𝑔 ⁢ ( 𝑡 ) ∈ 𝒜 𝑔 . Next, the agents receive a structured reward 𝑟 : 𝒮 × 𝒜 𝑔 → ℝ , given by Equation 3, where the choice of functions 𝑟 𝑔 and 𝑟 𝑙 is flexible and application-specific.

$$r(s, a_g) = \underbrace{r_g(s_g, a_g)}_{\text{global component}} + \underbrace{\frac{1}{n} \sum_{i \in [n]} r_l(s_i, s_g)}_{\text{local component}} \tag{3}$$

We define a policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A}_g)$ as a map from states to distributions of actions such that $a_g \sim \pi(\cdot \mid s)$. When a policy is executed, it generates a trajectory $(s^0, a_g^0, r^0), \dots, (s^T, a_g^T, r^T)$ via the process $a_g^t \sim \pi(s^t)$, $s^{t+1} \sim (P_g, P_l)(s^t, a_g^t)$, initialized at $s^0 \sim d_0$. We write $\mathbb{P}^\pi[\cdot]$ and $\mathbb{E}^\pi[\cdot]$ to denote the law and corresponding expectation for the trajectory under this process. The goal of the problem is then to learn a policy $\pi$ that maximizes the value function $V^\pi : \mathcal{S} \to \mathbb{R}$, which is the expected discounted reward for each $s \in \mathcal{S}$ given by

$$V^\pi(s) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s(t), a_g(t)) \,\middle|\, s(0) = s\right],$$

where $\gamma \in (0, 1)$ is a discounting factor. We write $\pi^*$ for the optimal deterministic policy, which maximizes $V^\pi(s)$ at all states. This model characterizes a crucial decision-making process in the presence of multiple agents where the information of all local agents is concentrated towards the decision maker, the global agent. So, the goal of the problem is to learn an approximately optimal policy while jointly minimizing the sample and computational complexities of learning the policy.
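To make the dynamics in Equations 1 and 2 and the structured reward in Equation 3 concrete, here is a minimal simulation sketch. The two-state toy kernels `P_g`, `P_l`, the reward functions `r_g`, `r_l`, and all helper names are hypothetical choices made for illustration only; the paper leaves these application-specific. The last function gives a single-trajectory, finite-horizon Monte Carlo estimate of the discounted return that $V^\pi$ averages over.

```python
import random


def sample(pmf, rng):
    """Draw one outcome from a dict outcome -> probability."""
    outcomes = list(pmf)
    return rng.choices(outcomes, weights=[pmf[o] for o in outcomes], k=1)[0]


def step(s_g, s_local, a_g, P_g, P_l, rng):
    """One transition: global agent via Eq. 1, each local agent independently via Eq. 2."""
    next_g = sample(P_g[(s_g, a_g)], rng)
    next_local = [sample(P_l[(s_i, s_g)], rng) for s_i in s_local]
    return next_g, next_local


def reward(s_g, s_local, a_g, r_g, r_l):
    """Structured reward of Eq. 3: global component plus average local component."""
    return r_g(s_g, a_g) + sum(r_l(s_i, s_g) for s_i in s_local) / len(s_local)


def discounted_return(policy, s_g, s_local, P_g, P_l, r_g, r_l, gamma, horizon, rng):
    """Truncated discounted return of one sampled trajectory."""
    total = 0.0
    for t in range(horizon):
        a_g = policy(s_g, s_local)
        total += gamma ** t * reward(s_g, s_local, a_g, r_g, r_l)
        s_g, s_local = step(s_g, s_local, a_g, P_g, P_l, rng)
    return total


# Hypothetical toy model: the global state copies the last action, and each
# local agent moves to the current global state with probability 0.9.
P_g = {(sg, a): {a: 1.0} for sg in (0, 1) for a in (0, 1)}
P_l = {(si, sg): {sg: 0.9, 1 - sg: 0.1} for si in (0, 1) for sg in (0, 1)}
r_g = lambda s_g, a_g: 0.0
r_l = lambda s_i, s_g: 1.0 if s_i == 1 else 0.0

value = discounted_return(lambda s_g, s_local: 1, 0, [0] * 10,
                          P_g, P_l, r_g, r_l, 0.9, 50, random.Random(0))
```

Averaging `discounted_return` over many seeds would approximate $V^\pi(s)$ for the fixed policy used here, up to the truncation of the horizon.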

We make the following standard assumptions:

Assumption 2.1 (Finite state/action spaces).

We assume that the state spaces of all the agents and the action space of the global agent are finite: | 𝒮 𝑙 | , | 𝒮 𝑔 | , | 𝒜 𝑔 | < ∞ .

Assumption 2.2 (Bounded rewards).

The global and local components of the reward function are bounded. Specifically, ‖ 𝑟 𝑔 ⁢ ( ⋅ , ⋅ ) ‖ ∞ ≤ 𝑟 ~ 𝑔 , and ‖ 𝑟 𝑙 ⁢ ( ⋅ , ⋅ ) ‖ ∞ ≤ 𝑟 ~ 𝑙 . Then, ‖ 𝑟 ⁢ ( ⋅ , ⋅ ) ‖ ∞ ≤ 𝑟 ~ 𝑔 + 𝑟 ~ 𝑙 := 𝑟 ~ .

Definition 2.1 ( 𝜖 -optimal policy).

Given a policy simplex Π , a policy 𝜋 ∈ Π is 𝜖 -optimal if for all 𝑠 ∈ 𝒮 , 𝑉 𝜋 ⁢ ( 𝑠 ) ≥ sup 𝜋 ∗ ∈ Π 𝑉 𝜋 ∗ ⁢ ( 𝑠 ) − 𝜖 .

Remark 2.2.

While this model requires the $n$ local agents to have homogeneous transition and reward functions, it allows heterogeneous initial states, which captures a pseudo-heterogeneous setting. For this, we assign a type to each local agent by letting $\mathcal{S}_l = \mathcal{Z} \times \bar{\mathcal{S}}_l$, where $\mathcal{Z}$ is a set of different types for each local agent, which is treated as part of the state for each local agent. This type state will be heterogeneous and will remain unchanged throughout the transitions. Hence, the transition and reward function will be different for different types of agents. Further, by letting $s_g \in \mathcal{S}_g := \prod_{z \in \mathcal{Z}} [\bar{S}_g]_z$ and $a_g \in \mathcal{A}_g := \prod_{z \in \mathcal{Z}} [\bar{A}_g]_z$ correspond to a state/action vector where each element corresponds to a type $z \in \mathcal{Z}$, the global agent can uniquely signal agents of each type.

2.2 Related Work

This paper relates to two major lines of work which we describe below.

Multi-agent RL (MARL). MARL has a rich history starting with early works on Markov games used to characterize the decision-making process (Shapley, 1953; Littman, 1994), which can be regarded as a multi-agent extension to the Markov Decision Process (MDP). MARL has since been actively studied (Zhang et al., 2021) in a broad range of settings, such as cooperative and competitive agents. MARL is most similar to the category of “succinctly described” MDPs (Blondel & Tsitsiklis, 2000) where the state/action space is a product space formed by the individual state/action spaces of multiple agents, and where the agents interact to maximize an objective function. Our work, which can be viewed as an essential stepping stone to MARL, also shares the curse of dimensionality.

A line of celebrated works (Qu et al., 2020b; Chu et al., 2020; Lin et al., 2020; 2021; Jing et al., 2022) constrain the problem to networked instances to enforce local agent interactions and find policies that maximize the objective function, which is the expected cumulative discounted reward. By exploiting Gamarnik’s spatial exponential decay property from combinatorial optimization (Gamarnik et al., 2009), they overcome the curse of dimensionality by truncating the problem to only searching over the policy space derived from the local neighborhood of agents that are at most $\kappa$ away from each other to find an $O(\rho^{\kappa+1})$ approximation of the maximized objective function for $\rho \in (0, 1)$. However, since their algorithms have a complexity that is exponential in the size of the neighborhood, they are only tractable for sparse graphs. Therefore, these algorithms do not apply to our decision-making problem, which can be viewed as a dense star graph (see Appendix A). The recently popular work on V-learning (Jin et al., 2021) reduces the dependence on the product action space to an additive dependence. However, since our work focuses on the action of the global decision-maker, the complexity in the action space is already minimal. Instead, our work focuses on reducing the complexity of the joint state space, which has not been generally accomplished for dense networks.

Mean-Field RL. Under assumptions of homogeneity in the state/action spaces of the agents, the problem of densely networked multi-agent RL was answered in Yang et al. (2018); Gu et al. (2021; 2022a; 2022b); Subramanian et al. (2022), which approximate the learning problem with a mean-field control approach where the approximation error scales in $O(1/\sqrt{n})$. To overcome the problem of designing algorithms on probability measure spaces, they study MARL under Pareto optimality and use the (functional) strong law of large numbers to consider a lifted state/action space with a representative agent where the rewards and dynamics of the system are aggregated. Cui & Koeppl (2022); Hu et al. (2023); Carmona et al. (2023) introduce heterogeneity to the mean-field approach using graphon mean-field games; however, there is a loss in topological information when using graphons to approximate finite graphs, as graphons correspond to infinitely large adjacency matrices. Additionally, graphon mean-field RL imposes a critical assumption of the existence of graphon sequences that converge in cut-norm to the problem instance. Another mean-field RL approach that partially introduces heterogeneity is in a line of work considering major and minor agents. This has been well studied in the competitive setting (Carmona & Zhu, 2016; Carmona & Wang, 2016). In the cooperative setting, Mondal et al. (2022); Cui et al. (2023) are most related to our work. They collectively consider a setting with $k$ classes of homogeneous agents, but their mean-field analytic approaches do not converge to the optimal policy upon introducing a global decision-making agent. Typically, these works require Lipschitz continuity assumptions on the reward functions, which we relax in our work. Finally, the algorithms underlying mean-field RL have a runtime that is polynomial in $n$, whereas our SUBSAMPLE-Q algorithm has a runtime that is polynomial in $k$.

Other Related Works. A line of work has similarly exploited the star-shaped network in cooperative multi-agent systems. Min et al. (2023); Chaudhari et al. (2024) studied the communication complexity and mixing times of various learning settings with purely homogeneous agents, and Do et al. (2023) studied the setting of heterogeneous linear contextual bandits to yield a no-regret guarantee. We extend this work to the more challenging setting of reinforcement learning.

2.3 Technical Background

Q-learning. To provide background for the analysis in this paper, we review a few key technical concepts in RL. At the core of the standard Q-learning framework (Watkins & Dayan, 1992) for offline-RL is the $Q$-function $Q : \mathcal{S} \times \mathcal{A}_g \to \mathbb{R}$. Intuitively, $Q$-learning seeks to produce a policy $\pi^*(\cdot \mid s)$ that maximizes the expected infinite horizon discounted reward. For any policy $\pi$,

$$Q^\pi(s, a) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s(t), a(t)) \,\middle|\, s(0) = s, a(0) = a\right].$$

One approach to learn the optimal policy $\pi^*(\cdot \mid s)$ is dynamic programming, where the $Q$-function is iteratively updated using value-iteration: $Q^0(s, a) = 0$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}_g$. Then, for all $t \in [T]$, $Q^{t+1}(s, a) = \mathcal{T} Q^t(s, a)$, where $\mathcal{T}$ is the Bellman operator defined as

$$\mathcal{T} Q^t(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s_g' \sim P_g(\cdot \mid s_g, a),\; s_i' \sim P_l(\cdot \mid s_i, s_g),\; \forall i \in [n]} \max_{a' \in \mathcal{A}_g} Q^t(s', a').$$

The Bellman operator $\mathcal{T}$ satisfies a $\gamma$-contractive property, ensuring the existence of a unique fixed-point $Q^*$ such that $\mathcal{T} Q^* = Q^*$, by the Banach-Caccioppoli fixed-point theorem (Banach, 1922). Here, the optimal policy is the deterministic greedy policy $\pi^* : \mathcal{S}_g \times \mathcal{S}_l^n \to \mathcal{A}_g$, where $\pi^*(s) = \arg\max_{a \in \mathcal{A}_g} Q^*(s, a)$. However, in this solution, the complexity of a single update to the $Q$-function is $O(|\mathcal{S}_g| |\mathcal{S}_l|^n |\mathcal{A}_g|)$, which grows exponentially with $n$. For practical purposes, even for small $n$, this complexity renders $Q$-learning impractical (see Example 5.2).
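The value-iteration recursion above can be sketched on a generic finite MDP as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the dictionary-based transition format, and the two-state toy MDP are all hypothetical. The helper `joint_table_size` simply evaluates the $|\mathcal{S}_g| |\mathcal{S}_l|^n |\mathcal{A}_g|$ count quoted in the text.

```python
def joint_table_size(n_sg, n_sl, n_ag, n):
    """Number of entries in the exact Q-table: |S_g| * |S_l|^n * |A_g|."""
    return n_sg * n_sl ** n * n_ag


def value_iteration(states, actions, r, P, gamma, T):
    """Apply the Bellman operator T times starting from Q^0 = 0.

    P[(s, a)] is a dict mapping next states to probabilities.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(T):
        Q = {
            (s, a): r(s, a) + gamma * sum(
                p * max(Q[(s2, a2)] for a2 in actions)
                for s2, p in P[(s, a)].items()
            )
            for s in states
            for a in actions
        }
    return Q


# Hypothetical toy MDP: the next state equals the chosen action,
# and being in state 1 pays reward 1.
states, actions = [0, 1], [0, 1]
P = {(s, a): {a: 1.0} for s in states for a in actions}
Q = value_iteration(states, actions, lambda s, a: float(s == 1), P, gamma=0.5, T=60)
greedy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```

Because each sweep touches every state-action pair, the per-update cost is proportional to the table size, which is why the exponential growth of `joint_table_size` in `n` is the bottleneck.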

Mean-field Transformation. To address this, Yang et al. (2018); Gu et al. (2021) developed a mean-field approach which, under assumptions of homogeneity in the agents, considers the distribution function $F_{[n]} : \mathcal{S}_l \to \mathbb{R}$ given by

$$F_{[n]}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{s_i = x\}, \quad \text{for } x \in \mathcal{S}_l.$$

Define $\Theta_n = \{b/n : b \in \overline{[n]}\}$. With abuse of notation, let $F_{[n]} \in \Theta_n^{|\mathcal{S}_l|}$ be a vector storing the proportion of agents in each state. As the local agents are homogeneous, the $Q$-function is permutation-invariant in the local agents, as permuting the labels of local agents with the same state will not change the global agent’s decision. Hence, the $Q$-function only depends on $s_{[n]}$ through $F_{[n]}$: $Q(s_g, s_{[n]}, a_g) = \hat{Q}(s_g, F_{[n]}, a_g)$. Here, $\hat{Q} : \mathcal{S}_g \times \Theta_n^{|\mathcal{S}_l|} \times \mathcal{A}_g \to \mathbb{R}$ is a reparameterized $Q$-function learned by mean-field value iteration, where $\hat{Q}^0(s_g, F_{[n]}, a_g) = 0$ for all $(s, a_g) \in \mathcal{S} \times \mathcal{A}_g$, and for all $t \in [T]$, $\hat{Q}^{t+1}(s_g, F_{[n]}, a_g) = \hat{\mathcal{T}} \hat{Q}^t(s_g, F_{[n]}, a_g)$. Here, $\hat{\mathcal{T}}$ is the Bellman operator in distribution space, which is given by Equation 4:

$$\hat{\mathcal{T}} \hat{Q}^t(s_g, F_{[n]}, a_g) = r(s, a_g) + \gamma \, \mathbb{E}_{s_g' \sim P_g(\cdot \mid s_g, a_g),\; s_i' \sim P_l(\cdot \mid s_i, s_g),\; \forall i \in [n]} \max_{a_g' \in \mathcal{A}_g} \hat{Q}^t(s_g', F_{[n]}', a_g'). \tag{4}$$

Then, since $\mathcal{T}$ has a $\gamma$-contractive property, so does $\hat{\mathcal{T}}$; hence $\hat{\mathcal{T}}$ has a unique fixed-point $\hat{Q}^*$ such that $\hat{Q}^*(s_g, F_{[n]}, a_g) = Q^*(s_g, s_{[n]}, a_g)$. Finally, the optimal policy is the deterministic greedy policy $\hat{\pi}^*(s_g, F_{[n]}) = \arg\max_{a_g \in \mathcal{A}_g} \hat{Q}^*(s_g, F_{[n]}, a_g)$. Here, the complexity of a single update to the $\hat{Q}$-function is $O(|\mathcal{S}_g| |\mathcal{A}_g| n^{|\mathcal{S}_l|})$, which scales polynomially in $n$.
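The gap between the exponential joint state space and the polynomial mean-field state space can be checked by direct counting. The sketch below uses the standard stars-and-bars count for the number of distinct proportion vectors, which is at most $(n+1)^{|\mathcal{S}_l|}$ and therefore consistent with the $n^{|\mathcal{S}_l|}$ order quoted above. The function names and the choice $|\mathcal{S}_l| = 3$ are illustrative assumptions.

```python
from math import comb


def joint_states(n, n_sl):
    """Number of joint local configurations, |S_l|^n."""
    return n_sl ** n


def mean_field_states(n, n_sl):
    """Number of distinct proportion vectors F_[n]: non-negative integer
    count vectors of length |S_l| that sum to n (stars and bars)."""
    return comb(n + n_sl - 1, n_sl - 1)


# Each row: (n, exponential count, exact mean-field count, polynomial upper bound).
rows = [(n, joint_states(n, 3), mean_field_states(n, 3), (n + 1) ** 3)
        for n in (5, 10, 20)]
for row in rows:
    print(row)
```

Even at this small scale the first count grows far faster than the second, which is the saving that permutation invariance buys.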

However, for practical purposes, for larger values of 𝑛 , the update complexity of mean-field value iteration can still be computationally intensive, and a subpolynomial-time policy learning algorithm would be desirable. Hence, we introduce the SUBSAMPLE-Q algorithm in Section 3 to attain this.

3 Method and Theoretical Results

3.1 Proposed Method: SUBSAMPLE-Q

In this work, we propose algorithm SUBSAMPLE-Q to overcome the $\mathrm{poly}(n)$ update time of mean-field $Q$-learning. In our algorithm, the global agent randomly samples a subset of local agents $\Delta \sim \mathcal{U}\binom{[n]}{k}$ for $k \in [n]$, that is, uniformly from the $k$-sized subsets of $[n]$. It ignores all other agents $[n] \setminus \Delta$ and uses an empirical mean-field value iteration to learn the $Q$-function $\hat{Q}_k^*$ and policy $\hat{\pi}_{k,m}^{\mathrm{est}}$ for this surrogate system of $k$ local agents. The surrogate reward gained by the system at each time step is $r_\Delta : \mathcal{S} \times \mathcal{A}_g \to \mathbb{R}$, given by Equation 5:

$$r_\Delta(s, a_g) = r_g(s_g, a_g) + \frac{1}{|\Delta|} \sum_{i \in \Delta} r_l(s_i, s_g). \tag{5}$$

We then derive a randomized policy $\pi_{k,m}^{\mathrm{est}}$ which samples $\Delta \sim \mathcal{U}\binom{[n]}{k}$ at each time-step to derive action $a_g \leftarrow \hat{\pi}_{k,m}^{\mathrm{est}}(s_g, s_\Delta)$. We show that the policy $\pi_{k,m}^{\mathrm{est}}$ converges to the optimal policy $\pi^*$ as $k \to n$ and $m \to \infty$ in Theorem 3.4. More formally, we present Algorithm 1 (SUBSAMPLE-Q: Learning) and Algorithm 2 (SUBSAMPLE-Q: Execution), which we describe below. A characterization that is crucial to our result is the notion of empirical distribution.

Definition 3.1 (Empirical Distribution Function).

For any population ( 𝑠 1 , … , 𝑠 𝑛 ) ∈ 𝒮 𝑙 𝑛 , define the empirical distribution function 𝐹 𝑠 Δ : 𝒮 𝑙 → ℝ for Δ ⊆ [ 𝑛 ] by:

$$F_{s_\Delta}(x) := \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathbb{1}\{s_i = x\}. \tag{6}$$
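Definition 3.1 translates directly into code. The sketch below is illustrative only: the function name, the use of 0-based agent indices for $\Delta$, and the toy population are assumptions. Exact fractions are used so that each entry lies in $\{0, 1/k, \dots, 1\}$ without floating-point error.

```python
from collections import Counter
from fractions import Fraction


def empirical_distribution(s_local, delta, local_states):
    """F_{s_Delta} as a tuple of exact fractions, ordered like local_states.

    s_local is the list of local states; delta holds 0-based agent indices.
    """
    counts = Counter(s_local[i] for i in delta)
    return tuple(Fraction(counts[x], len(delta)) for x in local_states)


# Hypothetical population of four local agents with states in {"a", "b", "c"}.
local_states = ("a", "b", "c")
s_local = ["a", "b", "a", "c"]
F = empirical_distribution(s_local, (0, 1, 2), local_states)

# Relabeling the sampled agents leaves the empirical distribution unchanged.
F_permuted = empirical_distribution(["b", "a", "a", "c"], (0, 1, 2), local_states)
```

The equality of `F` and `F_permuted` is the permutation invariance that lets the $Q$-function depend on the sampled agents only through their empirical distribution.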

Since the local agents in the system are homogeneous in their state spaces, transitions, and reward functions, the $Q$-function is permutation-invariant in the local agents, as permuting the labels of local agents with the same state does not change the global agent’s decision-making process. Define $\Theta_k = \{b/k : b \in \overline{[k]}\}$. Then, $\hat{Q}_k$ depends on $s_\Delta$ through $F_{s_\Delta} \in \Theta_k^{|\mathcal{S}_l|}$. We denote this by Equation 7:

$$\hat{Q}_k(s_g, s_\Delta, a_g) = \hat{Q}_k(s_g, F_{s_\Delta}, a_g), \qquad Q(s_g, s_{[n]}, a_g) = \hat{Q}_n(s_g, F_{s_{[n]}}, a_g). \tag{7}$$

Algorithm 1 (Offline learning). We empirically learn the optimal mean-field Q-function for a subsystem with $k$ local agents that we denote by $\hat{Q}_{k,m}^{\mathrm{est}} : \mathcal{S}_g \times \Theta_k^{|\mathcal{S}_l|} \times \mathcal{A}_g \to \mathbb{R}$, where $m$ is the sample size. As in Section 2.3, we set $\hat{Q}_{k,m}^0(s_g, F_{s_\Delta}, a_g) = 0$ for all $s_g \in \mathcal{S}_g$, $F_{s_\Delta} \in \Theta_k^{|\mathcal{S}_l|}$, $a_g \in \mathcal{A}_g$. For $t \in \mathbb{N}$, we set $\hat{Q}_{k,m}^{t+1}(s_g, F_{s_\Delta}, a_g) = \hat{\mathcal{T}}_{k,m} \hat{Q}_{k,m}^t(s_g, F_{s_\Delta}, a_g)$, where $\hat{\mathcal{T}}_{k,m}$ is the empirically adapted Bellman operator defined for $k \leq n$ and $m \in \mathbb{N}$ in Equation 8. $\hat{\mathcal{T}}_{k,m}$ draws $m$ random samples $s_g^j \sim P_g(\cdot \mid s_g, a_g)$ for $j \in [m]$ and $s_i^j \sim P_l(\cdot \mid s_i, s_g)$ for $j \in [m]$, $i \in \Delta$. Here, the operator $\hat{\mathcal{T}}_{k,m}$ is:

$$\hat{\mathcal{T}}_{k,m} \hat{Q}_{k,m}^t(s_g, F_{s_\Delta}, a_g) = r_\Delta(s, a_g) + \frac{\gamma}{m} \sum_{j \in [m]} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_{k,m}^t(s_g^j, F_{s_\Delta^j}, a_g'). \tag{8}$$

$\hat{\mathcal{T}}_{k,m}$ satisfies a $\gamma$-contraction property (see Lemma A.10). So, Algorithm 1 (SUBSAMPLE-Q: Learning) performs mean-field value iteration where it repeatedly applies $\hat{\mathcal{T}}_{k,m}$ to the same $\Delta \subseteq [n]$ until $\hat{Q}_{k,m}$ converges to its fixed point $\hat{Q}_{k,m}^{\mathrm{est}}$ satisfying $\hat{\mathcal{T}}_{k,m} \hat{Q}_{k,m}^{\mathrm{est}} = \hat{Q}_{k,m}^{\mathrm{est}}$. We then obtain a deterministic policy $\hat{\pi}_{k,m}^{\mathrm{est}} : \mathcal{S}_g \times \Theta_k^{|\mathcal{S}_l|} \to \mathcal{A}_g$ given by $\hat{\pi}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}) = \arg\max_{a_g \in \mathcal{A}_g} \hat{Q}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}, a_g)$.
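The learning phase described above can be sketched in a few dozen lines. This is a hedged illustration rather than the authors' implementation. It makes three assumptions that are not stated in the text: the empirical distribution is stored as a vector of integer counts (equivalent to $F_{s_\Delta}$ after dividing by $k$), the $m$ next-state samples for each table entry are drawn once and reused in every iteration so that the operator is a fixed map, and the two-state toy model at the bottom is invented for the example.

```python
import random


def count_vectors(k, n_states):
    """All length-n_states tuples of non-negative integers summing to k."""
    if n_states == 1:
        return [(k,)]
    return [(c,) + rest for c in range(k + 1)
            for rest in count_vectors(k - c, n_states - 1)]


def subsample_q_learn(S_g, S_l, A_g, P_g, P_l, r_g, r_l, k, m, gamma, T, seed=0):
    """Empirical mean-field value iteration for a surrogate system of k local agents."""
    rng = random.Random(seed)

    def draw(pmf):
        outcomes = list(pmf)
        return rng.choices(outcomes, weights=[pmf[o] for o in outcomes])[0]

    dists = count_vectors(k, len(S_l))
    keys = [(sg, c, a) for sg in S_g for c in dists for a in A_g]

    # Assumption: draw the m next-state samples once per entry and reuse them.
    samples = {}
    for sg, c, a in keys:
        draws = []
        for _ in range(m):
            nxt = [0] * len(S_l)
            for idx, cnt in enumerate(c):
                for _ in range(cnt):
                    nxt[S_l.index(draw(P_l[(S_l[idx], sg)]))] += 1
            draws.append((draw(P_g[(sg, a)]), tuple(nxt)))
        samples[(sg, c, a)] = draws

    def r_delta(sg, c, a):
        # Surrogate reward of Eq. 5, written in terms of state counts.
        return r_g(sg, a) + sum(cnt * r_l(S_l[i], sg) for i, cnt in enumerate(c)) / k

    Q = dict.fromkeys(keys, 0.0)
    for _ in range(T):
        # One application of the empirical operator of Eq. 8 to every entry.
        Q = {(sg, c, a): r_delta(sg, c, a) + gamma / m * sum(
                 max(Q[(sg2, c2, a2)] for a2 in A_g)
                 for sg2, c2 in samples[(sg, c, a)])
             for sg, c, a in keys}
    policy = {(sg, c): max(A_g, key=lambda a: Q[(sg, c, a)])
              for sg in S_g for c in dists}
    return Q, policy


# Hypothetical toy model: the global state copies the action, local agents follow
# the global state with probability 0.9, and local state 1 pays reward 1.
S_g, S_l, A_g = [0, 1], [0, 1], [0, 1]
P_g = {(sg, a): {a: 1.0} for sg in S_g for a in A_g}
P_l = {(si, sg): {sg: 0.9, 1 - sg: 0.1} for si in S_l for sg in S_g}
r_g = lambda s_g, a_g: 0.0
r_l = lambda s_i, s_g: 1.0 if s_i == 1 else 0.0
Q, policy = subsample_q_learn(S_g, S_l, A_g, P_g, P_l, r_g, r_l,
                              k=2, m=50, gamma=0.5, T=40)
```

The table has one entry per (global state, count vector, action) triple, so its size depends on $k$ and not on the total number of local agents $n$.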

Algorithm 2 (Online implementation). Here, Algorithm 2 (SUBSAMPLE-Q: Execution) randomly samples $\Delta \sim \mathcal{U}\binom{[n]}{k}$ at each time step and uses action $a_g \sim \hat{\pi}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta})$ to get reward $r(s, a_g)$. This procedure of first sampling $\Delta$ and then applying $\hat{\pi}_{k,m}^{\mathrm{est}}$ is denoted by a stochastic policy $\pi_{k,m}^{\mathrm{est}}(a_g \mid s)$:

$$\pi_{k,m}^{\mathrm{est}}(a_g \mid s) = \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \mathbb{1}\left(\hat{\pi}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}) = a_g\right). \tag{9}$$

Then, each agent transitions to its next state based on Equations 1 and 2.
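The execution phase can be sketched as a short rollout loop. As before, this is an illustration under assumptions: the function name `execute`, the count-vector encoding of $F_{s_\Delta}$, the callable `policy_hat` standing in for $\hat{\pi}_{k,m}^{\mathrm{est}}$, and the toy model are all hypothetical. Note that the reward is computed on all $n$ local agents (Equation 3), while the action only looks at the $k$ sampled agents.

```python
import random


def execute(policy_hat, s_g, s_local, S_l, k, P_g, P_l, r_g, r_l, gamma, rounds, seed=0):
    """Roll out the subsampled policy and return the discounted total reward."""
    rng = random.Random(seed)

    def draw(pmf):
        outcomes = list(pmf)
        return rng.choices(outcomes, weights=[pmf[o] for o in outcomes])[0]

    n = len(s_local)
    total = 0.0
    for t in range(rounds):
        # Fresh uniform k-subset of the n local agents, without replacement.
        delta = rng.sample(range(n), k)
        counts = tuple(sum(1 for i in delta if s_local[i] == x) for x in S_l)
        a_g = policy_hat(s_g, counts)
        # Reward of Eq. 3 on the full system at time t.
        total += gamma ** t * (r_g(s_g, a_g) + sum(r_l(s_i, s_g) for s_i in s_local) / n)
        # Eqs. 1-2: the right-hand side is evaluated with the current s_g.
        s_g, s_local = (draw(P_g[(s_g, a_g)]),
                        [draw(P_l[(s_i, s_g)]) for s_i in s_local])
    return total


# Hypothetical toy model and a stand-in policy that always plays action 1.
S_l = [0, 1]
P_g = {(sg, a): {a: 1.0} for sg in (0, 1) for a in (0, 1)}
P_l = {(si, sg): {sg: 0.9, 1 - sg: 0.1} for si in S_l for sg in (0, 1)}
r_g = lambda s_g, a_g: 0.0
r_l = lambda s_i, s_g: 1.0 if s_i == 1 else 0.0
total_reward = execute(lambda s_g, counts: 1, 0, [0] * 20, S_l, 5,
                       P_g, P_l, r_g, r_l, gamma=0.9, rounds=30)
```

In practice `policy_hat` would be a lookup into the table learned offline; only $k$ local states are read per step, so the per-step decision cost does not grow with $n$.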

Algorithm 1 SUBSAMPLE-Q: Learning

Require: A multi-agent system as described in Section 2. Parameter $T$ for the number of iterations in the initial value iteration step. Sampling parameters $k \in [n]$ and $m \in \mathbb{N}$. Discount parameter $\gamma \in (0, 1)$. Oracle $\mathcal{O}$ to sample $s_g' \sim P_g(\cdot \mid s_g, a_g)$ and $s_i' \sim P_l(\cdot \mid s_i, s_g)$ for all $i \in [n]$.

1: Uniformly choose $\Delta \subseteq [n]$ such that $|\Delta| = k$.
2: Set $\hat{Q}_{k,m}^0(s_g, F_{s_\Delta}, a_g) = 0$, for $s_g \in \mathcal{S}_g$, $F_{s_\Delta} \in \Theta_k^{|\mathcal{S}_l|}$, $a_g \in \mathcal{A}_g$, where $\Theta_k = \{b/k : b \in \overline{[k]}\}$.
3: for $t = 0$ to $T - 1$ do
4: $\hat{Q}_{k,m}^{t+1}(s_g, F_{s_\Delta}, a_g) = \hat{\mathcal{T}}_{k,m} \hat{Q}_{k,m}^t(s_g, F_{s_\Delta}, a_g)$, for all $s_g \in \mathcal{S}_g$, $F_{s_\Delta} \in \Theta_k^{|\mathcal{S}_l|}$, $a_g \in \mathcal{A}_g$
5: For all $(s_g, F_{s_\Delta}) \in \mathcal{S}_g \times \Theta_k^{|\mathcal{S}_l|}$, let $\hat{\pi}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}) = \arg\max_{a_g \in \mathcal{A}_g} \hat{Q}_{k,m}^T(s_g, F_{s_\Delta}, a_g)$.

Algorithm 2 SUBSAMPLE-Q: Execution

Require: A multi-agent system as described in Section 2. Parameter $T'$ for the number of rounds in the game. Hyperparameter $k \in [n]$. Discount parameter $\gamma$. Policy $\hat{\pi}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta})$.

1: Initialize $(s_g(0), s_{[n]}(0)) \sim s_0$, where $s_0$ is a distribution on the initial global state $(s_g, s_{[n]})$.
2: Initialize the total reward: $R_0 \leftarrow 0$.
3: Policy $\pi_{k,m}^{\mathrm{est}}(s)$:
4: for $t = 0$ to $T'$ do
5: Sample $\Delta$ uniformly at random from $\binom{[n]}{k}$.
6: Let $a_g(t) \sim \hat{\pi}_{k,m}^{\mathrm{est}}(s_g(t), s_\Delta(t))$.
7: Let $s_g(t+1) \sim P_g(\cdot \mid s_g(t), a_g(t))$ and $s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t))$, for all $i \in [n]$.
8: $R_{t+1} = R_t + \gamma^t \cdot r(s(t), a_g(t))$

Remark 3.2.

Algorithm 1 assumes the existence of a generative model $\mathcal{O}$ (Kearns & Singh, 1998) to sample $s_g' \sim P_g(\cdot \mid s_g, a_g)$ and $s_i' \sim P_l(\cdot \mid s_i, s_g)$. This is generalizable to the online reinforcement learning setting using techniques from Jin et al. (2018), and we leave this for future investigations.

3.2 Theoretical Guarantee

This subsection shows that the value of the expected discounted cumulative reward produced by 𝜋 𝑘 , 𝑚 est is approximately optimal, where the optimality gap decays as 𝑘 → 𝑛 and 𝑚 → ∞ .

Bellman noise. We first introduce the notion of Bellman noise, which is used in the main theorem. First, note that $\hat{\mathcal{T}}_{k,m}$ is an unbiased estimator of the generalized adapted Bellman operator $\hat{\mathcal{T}}_k$,

$$\hat{\mathcal{T}}_k \hat{Q}_k(s_g, F_{s_\Delta}, a_g) = r_\Delta(s, a_g) + \gamma \, \mathbb{E}_{s_g' \sim P_g(\cdot \mid s_g, a_g),\; s_i' \sim P_l(\cdot \mid s_i, s_g),\; \forall i \in \Delta} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k(s_g', F_{s_\Delta'}, a_g'). \tag{10}$$

For all $s_g \in \mathcal{S}_g$, $F_{s_\Delta} \in \Theta_k^{|\mathcal{S}_l|}$, $a_g \in \mathcal{A}_g$, set $\hat{Q}_k^0(s_g, F_{s_\Delta}, a_g) = 0$. For $t \in \mathbb{N}$, let $\hat{Q}_k^{t+1} = \hat{\mathcal{T}}_k \hat{Q}_k^t$, where $\hat{\mathcal{T}}_k$ is defined for $k \leq n$ in Equation 10. Similarly to $\hat{\mathcal{T}}_{k,m}$, $\hat{\mathcal{T}}_k$ satisfies a $\gamma$-contraction property (Lemma A.9) with fixed-point $\hat{Q}_k^*$. By the law of large numbers, $\lim_{m \to \infty} \hat{\mathcal{T}}_{k,m} = \hat{\mathcal{T}}_k$. Hence, the gap $\|\hat{Q}_{k,m}^{\mathrm{est}} - \hat{Q}_k^*\|_\infty$ converges to $0$ as $m \to \infty$. For finite $m$, $\|\hat{Q}_{k,m}^{\mathrm{est}} - \hat{Q}_k^*\|_\infty =: \epsilon_{k,m}$ is called the Bellman noise. Bounding $\epsilon_{k,m}$ has been well studied in the literature. One such bound is:

Lemma 3.3 (Theorem 1 of Li et al. (2022)).

For all $k \in [n]$ and $m \in \mathbb{N}$, where $m$ is the number of samples in Equation 8, there exists a Bellman noise $\epsilon_{k,m}$ such that

$$\|\hat{\mathcal{T}}_{k,m} \hat{Q}_{k,m}^{\mathrm{est}} - \hat{\mathcal{T}}_k \hat{Q}_k^*\|_\infty = \|\hat{Q}_{k,m}^{\mathrm{est}} - \hat{Q}_k^*\|_\infty \leq \epsilon_{k,m} \leq O(1/\sqrt{m}).$$

With the above preparations, we are now primed to present our main result: a bound on the optimality gap for our learned policy 𝜋 𝑘 , 𝑚 est that decays with 𝑘 . Section 4 outlines the proof of Theorem 3.4.

Theorem 3.4.

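For any state $s \in \mathcal{S}_g \times \mathcal{S}_l^n$,

$$V^{\pi^*}(s) - V^{\pi_{k,m}^{\mathrm{est}}}(s) \leq \frac{2\tilde{r}}{(1-\gamma)^2} \left( \sqrt{\frac{n-k+1}{2nk} \ln\left(2 |\mathcal{S}_l| |\mathcal{A}_g| \sqrt{k}\right)} + \frac{1}{\sqrt{k}} \right) + \frac{2\epsilon_{k,m}}{1-\gamma}.$$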

Corollary 3.5.

Theorem 3.4 implies an asymptotically decaying optimality gap for our learned policy $\pi_{k,m}^{\mathrm{est}}$. Further, from Lemma 3.3, $\epsilon_{k,m} \leq O(1/\sqrt{m})$. Hence,

$$V^{\pi^*}(s) - V^{\pi_{k,m}^{\mathrm{est}}}(s) \leq \tilde{O}\left(1/\sqrt{k} + 1/\sqrt{m}\right). \tag{11}$$

Discussion 3.6.

The size of $\hat{Q}_{k,m}(s_g, F_{s_\Delta}, a_g)$ is $O(|\mathcal{S}_g| |\mathcal{A}_g| k^{|\mathcal{S}_l|})$. From Theorem 3.4, as $k \to n$, the optimality gap decays, revealing a trade-off in the choice of $k$ between the size of the $Q$-function and the optimality of the policy $\pi_{k,m}^{\mathrm{est}}$. We demonstrate this trade-off further in our experiments. For $k = O(\log n)$ and $m \to \infty$, we get an exponential speedup on the complexity from mean-field value iteration (from $\mathrm{poly}(n)$ to $\mathrm{poly}(\log n)$), and a super-exponential speedup from traditional value-iteration (from $\exp(n)$ to $\mathrm{poly}(\log n)$), with a decaying $O(1/\sqrt{\log n})$ optimality gap. This gives a competitive policy-learning algorithm with polylogarithmic run-time.
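The trade-off in the choice of $k$ can be tabulated numerically. The sketch below is an illustration under a stated assumption: it evaluates the Theorem 3.4 bound in its square-root form, $\frac{2\tilde{r}}{(1-\gamma)^2}\big(\sqrt{\tfrac{n-k+1}{2nk}\ln(2|\mathcal{S}_l||\mathcal{A}_g|\sqrt{k})} + \tfrac{1}{\sqrt{k}}\big) + \tfrac{2\epsilon_{k,m}}{1-\gamma}$, with the Bellman noise passed in as a free parameter. The function names and the problem sizes are hypothetical.

```python
import math


def q_table_size(n_sg, n_ag, n_sl, k):
    """Order of the subsampled Q-table size: |S_g| * |A_g| * k^{|S_l|}."""
    return n_sg * n_ag * k ** n_sl


def optimality_gap_bound(n, k, r_tilde, gamma, n_sl, n_ag, eps_km=0.0):
    """Right-hand side of the Theorem 3.4 bound (assumed square-root form)."""
    sampling = math.sqrt((n - k + 1) / (2 * n * k)
                         * math.log(2 * n_sl * n_ag * math.sqrt(k)))
    return (2 * r_tilde / (1 - gamma) ** 2 * (sampling + 1 / math.sqrt(k))
            + 2 * eps_km / (1 - gamma))


# Hypothetical problem sizes: n = 100 local agents, |S_g| = |A_g| = 2, |S_l| = 3.
n = 100
table = [(k, q_table_size(2, 2, 3, k), optimality_gap_bound(n, k, 1.0, 0.9, 3, 2))
         for k in (1, 4, 16, 64, 100)]
for k, size, gap in table:
    print(k, size, round(gap, 2))
```

Larger `k` buys a smaller bound at the cost of a larger table. The bound is a worst-case guarantee and can be loose for small $k$ or discount factors close to 1.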

Discussion 3.7.

One could replace the 𝑄 -learning algorithm with an arbitrary value-based RL method that learns 𝑄 ^ 𝑘 with function approximation (Sutton et al., 1999) such as deep 𝑄 -networks (Silver et al., 2016). Doing so introduces a further error that factors into the bound in Corollary 3.5.

4 Proof Outline

This section details an outline for the proof of Theorem 3.4, as well as some key ideas. At a high level, our SUBSAMPLE-Q framework in Algorithms 1 and 2 recovers exact mean-field $Q$-learning (and therefore, traditional value iteration) when $k = n$ and as $m \to \infty$. Further, as $k \to n$, $\hat{Q}_k^*$ should intuitively get closer to $Q^*$, from which the optimal policy is derived. Thus, the proof is divided into three steps. We first prove a Lipschitz continuity bound between $\hat{Q}_k^*$ and $\hat{Q}_n^*$ in terms of the total variation (TV) distance between $F_{s_\Delta}$ and $F_{s_{[n]}}$. Secondly, we bound the TV distance between $F_{s_\Delta}$ and $F_{s_{[n]}}$. Finally, we bound the value differences between $\pi_{k,m}^{\mathrm{est}}$ and $\pi^*$ by bounding $Q^*(s, \pi^*) - Q^*(s, \hat{\pi}_{k,m}^{\mathrm{est}})$ and then using the performance difference lemma (Kakade & Langford, 2002).

Step 1: Lipschitz Continuity Bound. To compare $\hat{Q}_k^*(s_g, F_{s_\Delta}, a_g)$ with $Q^*(s, a_g)$, we prove a Lipschitz continuity bound between $\hat{Q}_k^*(s_g, F_{s_\Delta}, a_g)$ and $\hat{Q}_{k'}^*(s_g, F_{s_{\Delta'}}, a_g)$ with respect to the TV distance between the empirical distributions of $s_\Delta \in \binom{s_{[n]}}{k}$ and $s_{\Delta'} \in \binom{s_{[n]}}{k'}$. Specifically, we show:

Theorem 4.1 (Lipschitz continuity in 𝑄 ^ 𝑘 ∗ ).

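For all $(s, a_g) \in \mathcal{S} \times \mathcal{A}_g$, $\Delta \in \binom{[n]}{k}$ and $\Delta' \in \binom{[n]}{k'}$,

$$\left| \hat{Q}_k^*(s_g, F_{s_\Delta}, a_g) - \hat{Q}_{k'}^*(s_g, F_{s_{\Delta'}}, a_g) \right| \leq \frac{2}{1-\gamma} \|r_l(\cdot, \cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}}).$$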

We defer the proof of Theorem 4.1 to Appendix C.6. See Figure 3 for a comparison between the 𝑄 ^ 𝑘 ∗ learning and estimation process, and the exact 𝑄 -learning framework.

Step 2: Bounding Total Variation (TV) Distance.

We bound the TV distance between $F_{s_\Delta}$ and $F_{s_{[n]}}$, where $\Delta \sim \mathcal{U}\binom{[n]}{k}$. Bounding this TV distance is equivalent to bounding the discrepancy between the empirical distribution and the distribution of the underlying finite population. Since each $i \in \Delta$ is chosen uniformly at random and without replacement, standard concentration inequalities do not apply as they require the random variables to be i.i.d. Further, standard TV distance bounds that use the KL divergence produce a suboptimal decay as $|\Delta| \to n$ (Lemma C.7). Therefore, we prove the following probabilistic result, which generalizes the Dvoretzky–Kiefer–Wolfowitz (DKW) concentration inequality (Dvoretzky et al., 1956) to the regime of sampling without replacement:

Theorem 4.2.

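Given a finite population $\mathcal{X} = (x_1, \dots, x_n)$ for $\mathcal{X} \in \mathcal{S}_l^n$, let $\Delta \subseteq [n]$ be a uniformly random sample from $\mathcal{X}$ of size $k$ chosen without replacement. Fix $\epsilon > 0$. Then:

$$\Pr\left[ \sup_{x \in \mathcal{S}_l} \left| \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathbb{1}\{x_i = x\} - \frac{1}{n} \sum_{i \in [n]} \mathbb{1}\{x_i = x\} \right| \leq \epsilon \right] \geq 1 - 2 |\mathcal{S}_l| \, e^{-\frac{2 |\Delta| n \epsilon^2}{n - |\Delta| + 1}}.$$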

Then, by Theorem 4.2 and the definition of TV distance from Section 2, we have that for $\delta \in (0, 1]$,

$$\Pr\left( \mathrm{TV}\left( F_{s_\Delta}, F_{s_{[n]}} \right) \le \sqrt{ \frac{n - |\Delta| + 1}{8 n |\Delta|} \ln \frac{2 |\mathcal{S}_l|}{\delta} } \right) \ge 1 - \delta. \tag{12}$$
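The concentration statement of Theorem 4.2 can be sanity-checked by simulation. The sketch below is an illustration, not the authors' experiment: it draws a hypothetical population of local states, repeatedly subsamples $k$ of them without replacement, and counts how often the largest per-state frequency deviation exceeds the $\epsilon$ obtained by setting the failure probability of Theorem 4.2 equal to $\delta$. The function names, the population, and the trial count are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def sup_deviation(sample, population, support):
    """Largest gap, over states x, between sample and population frequencies."""
    s, p = Counter(sample), Counter(population)
    return max(abs(s[x] / len(sample) - p[x] / len(population)) for x in support)


def epsilon_for(n, k, num_states, delta):
    """Solve 2 * num_states * exp(-2*k*n*eps^2 / (n - k + 1)) = delta for eps."""
    return math.sqrt((n - k + 1) / (2 * n * k) * math.log(2 * num_states / delta))


def violation_rate(n, k, support, delta, trials=2000, seed=0):
    """Fraction of subsamples whose deviation exceeds epsilon_for(...)."""
    rng = random.Random(seed)
    # Hypothetical population: n local states drawn once and then held fixed.
    population = [rng.choice(support) for _ in range(n)]
    eps = epsilon_for(n, k, len(support), delta)
    bad = 0
    for _ in range(trials):
        sample = rng.sample(population, k)  # uniform, without replacement
        bad += sup_deviation(sample, population, support) > eps
    return bad / trials
```

If the theorem holds, `violation_rate(n, k, support, delta)` should stay at or below `delta`, and `epsilon_for` should shrink to nearly zero as `k` approaches `n`, which is the behaviour an i.i.d. bound would miss.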

We then apply this result to our global decision-making problem by studying the rate of decay of the objective function between our learned policy 𝜋 𝑘 , 𝑚 est and the optimal policy 𝜋 ∗ (Theorem 3.4).

Step 3: Performance Difference Lemma to Complete the Proof. As a consequence of the prior two steps and Lemma 3.3, 𝑄 ∗ ⁢ ( 𝑠 , 𝑎 𝑔 ′ ) and 𝑄 ^ 𝑘 , 𝑚 est ⁢ ( 𝑠 𝑔 , 𝐹 𝑠 Δ , 𝑎 𝑔 ′ ) become similar as 𝑘 → 𝑛 (see Theorem C.6). We further prove that the value generated by their policies 𝜋 ∗ and 𝜋 𝑘 , 𝑚 est must also be very close (where the residue shrinks as 𝑘 → 𝑛 ). We then use the well-known performance difference lemma (Kakade & Langford, 2002) which we restate and explain in D.2 in the appendix. A crucial theorem needed to use the performance difference lemma is a bound on 𝑄 ∗ ⁢ ( 𝑠 ′ , 𝜋 ∗ ⁢ ( 𝑠 ′ ) ) − 𝑄 ∗ ⁢ ( 𝑠 ′ , 𝜋 ^ 𝑘 , 𝑚 est ⁢ ( 𝑠 𝑔 ′ , 𝑠 Δ ′ ) ) . Therefore, we formulate and prove Theorem 4.3 which yields a probabilistic bound on this difference, where the randomness is over the choice of Δ ∈ ( [ 𝑛 ] 𝑘 ) .

Theorem 4.3.

For a fixed $s' \in \mathcal{S} := \mathcal{S}_g \times \mathcal{S}_l^n$ and for $\delta \in (0, 1]$, with probability at least $1 - 2 |\mathcal{A}_g| \delta$:

$$Q^*(s', \pi^*(s')) - Q^*\left( s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s_\Delta'}) \right) \le \frac{2 \| r_l(\cdot, \cdot) \|_\infty}{1 - \gamma} \sqrt{ \frac{n - k + 1}{2 n k} \ln\left( \frac{2 |\mathcal{S}_l|}{\delta} \right) } + 2 \epsilon_{k,m}.$$

We defer the proof of Theorem 4.3, together with the choice of optimal values for the parameters $\delta_1, \delta_2$, to D.5 in the Appendix. Using Theorem 4.3 and the performance difference lemma leads to Theorem 3.4.
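To get a feel for how the bound of Theorem 4.3 behaves as $k$ grows, one can simply evaluate its right-hand side. The helper below is a sketch under the assumption that the sampling term is the square root implied by Theorem 4.2; the function name and the example parameter values are hypothetical.

```python
import math


def optimality_gap_bound(n, k, num_local_states, r_max, gamma, delta, eps_km=0.0):
    """Evaluate the right-hand side of Theorem 4.3 for one choice of k.

    r_max stands for the sup-norm of the local reward r_l, and eps_km for the
    Bellman noise; both are inputs here, not quantities computed by this code.
    """
    sampling = math.sqrt(
        (n - k + 1) / (2 * n * k) * math.log(2 * num_local_states / delta)
    )
    return 2 * r_max / (1 - gamma) * sampling + 2 * eps_km
```

For instance, evaluating `optimality_gap_bound(50, k, 30, 1.0, 0.9, 0.05)` for increasing `k` shows the sampling term shrinking monotonically, while the `2 * eps_km` term is unaffected by `k`.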

5 Experiments

This section provides examples and numerical simulation results to validate our theoretical framework. All numerical experiments were run on a 3-core CPU server equipped with 12 GB of RAM. We chose parameters with complexity sufficient only to validate the theory, such as the computational speedups, the pseudo-heterogeneity of each local agent, and the decaying optimality gap.

Example 5.1 (Demand-Response (DR)).

DR is a pathway in the transformation towards a sustainable electricity grid where users (local agents) are compensated to lower their electricity consumption to a level set by a regulator (global agent). DR has applications ranging from pricing strategies for EV charging stations, to regulating the supply of any product in a market with fluctuating demands, to maximizing the efficiency of allocating resources. We ran a small-scale simulation with $n = 8$ local agents, and a large-scale simulation with $n = 50$ local agents, where the goal was to learn an optimal policy for the global agent to moderate supply in the presence of fluctuating demand.

Let each local agent $i \in [n]$ have a state $s_i(t) = (\psi_i, s_i^*(t), \bar{s}_i(t)) \in \mathcal{S}_l := \Psi \times \mathcal{D}_a \times \mathcal{D}_c \subseteq \mathbb{Z}^3$. Here, $\psi_i$ is the agent's type, $s_i^*(t)$ is agent $i$'s consumption, and $\bar{s}_i(t)$ is its desired consumption level. Let $s_g(t) \in \mathcal{S}_g$ and $a_g(t) \in \mathcal{A}_g$, where $s_g(t)$ is the DR signal (target consumption set by the regulator). The global agent transition is given by $s_g(t+1) = \Pi_{\mathcal{S}_g}(s_g(t) + a_g(t))$, i.e., $a_g(t)$ changes the DR signal. Then, $s_i(t+1) = (\psi_i, s_i^*(t+1), \bar{s}_i(t+1))$, where intuitively, $\bar{s}_i(t+1)$ fluctuates based on $\psi_i$ and $\bar{s}_i(t)$. If $\bar{s}_i(t) < s_g(t)$, then $s_i^*(t+1) = \bar{s}_i(t)$ (the local agent chases its desired consumption). If not, the local agent either follows $\bar{s}_i(t)$ or reduces its consumption to match $s_g(t)$. Formally, if $\psi_i = 1$, then $\bar{s}_i(t+1) = \bar{s}_i(t) + \mathcal{U}\{0, 1\}$. If $\psi_i = 2$, then $\bar{s}_i(t+1) = \mathcal{U}\{\mathcal{D}_c\}$. If $\bar{s}_i(t) \le s_g(t)$, then $s_i^*(t+1) = \bar{s}_i(t)$. If $\bar{s}_i(t) > s_g(t)$, then $s_i^*(t+1) = \Pi_{\mathcal{D}_c}\left[ \bar{s}_i(t) + (s_g(t) - s_i^*(t)) \, \mathcal{U}\{0, 1\} \right]$. The reward of the system at each step is given by $r_g(s_g, a_g) = 15 / s_g - \mathbb{1}\{a_g = -1\}$ and $r_l(s_i, s_g) = s_i^* - \frac{1}{2} \mathbb{1}\{s_i^* > s_g\}$. We set $\mathcal{D}_a = \mathcal{D}_c = [5]$, $\Psi = \{1, 2\}$, $\gamma = 0.9$, $m = 50$, and the length of the decision game to be $T' = 300$.
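The demand-response dynamics above translate fairly directly into a simulator. The following is a minimal sketch rather than the authors' code. Several details are not fixed by the text and are assumptions here: $\mathcal{S}_g = \{1, \dots, 5\}$, $\mathcal{A}_g = \{-1, 0, 1\}$, clipping of the type-1 desired consumption into $\mathcal{D}_c$, a stage reward equal to $r_g$ plus the average of the local rewards, and randomly drawn initial states.

```python
import random


def clip(x, lo=1, hi=5):
    """Projection onto {lo, ..., hi}; used for both S_g (assumed) and D_c = [5]."""
    return max(lo, min(hi, x))


def step_global(s_g, a_g):
    """s_g(t+1) = projection of s_g(t) + a_g(t) onto S_g."""
    return clip(s_g + a_g)


def step_local(state, s_g, rng):
    """One transition of a local agent with state (type, consumption, desired)."""
    psi, s_star, s_bar = state
    if psi == 1:
        next_bar = clip(s_bar + rng.randint(0, 1))  # clipping is an assumption
    else:
        next_bar = rng.randint(1, 5)  # uniform over D_c
    if s_bar <= s_g:
        next_star = s_bar  # the agent chases its desired consumption
    else:
        next_star = clip(s_bar + (s_g - s_star) * rng.randint(0, 1))
    return (psi, next_star, next_bar)


def reward_global(s_g, a_g):
    return 15 / s_g - (1 if a_g == -1 else 0)


def reward_local(state, s_g):
    _, s_star, _ = state
    return s_star - 0.5 * (1 if s_star > s_g else 0)


def rollout(n, horizon, policy, gamma=0.9, seed=0):
    """Discounted return of `policy`, a callable (s_g, local_states) -> a_g."""
    rng = random.Random(seed)
    s_g = 3  # arbitrary initial DR signal (assumption)
    agents = [(rng.choice((1, 2)), rng.randint(1, 5), rng.randint(1, 5))
              for _ in range(n)]
    total = 0.0
    for t in range(horizon):
        a_g = policy(s_g, agents)
        # Assumed stage reward: global reward plus the mean local reward.
        r = reward_global(s_g, a_g) + sum(reward_local(s, s_g) for s in agents) / n
        total += gamma ** t * r
        agents = [step_local(s, s_g, rng) for s in agents]
        s_g = step_global(s_g, a_g)
    return total
```

A call such as `rollout(8, 300, lambda s_g, agents: 0)` evaluates the trivial policy that never changes the DR signal, which gives a simple baseline against which a learned policy could be compared.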

We use $T = 300$ empirical adapted Bellman iterations for the small-scale simulation, and $T = 50$ iterations for the large-scale simulation. For the small-scale simulation, Figure 1a illustrates the polynomial speedup of Algorithm 1 (note that $k = n$ exactly recovers mean-field value iteration (Yang et al., 2018), which we treat as our baseline comparison). Figure 1b plots the reward-optimality gap for varying $k$, illustrating that the gap decreases monotonically as $k \to n$, as shown in Theorem 3.4. Figure 1c plots the cumulative reward of the large-scale experiment. We observe that the rewards (on average) grow monotonically as they obey our worst-case guarantee in Theorem 3.4.

Example 5.2 (Queueing).

We model a system with $n$ queues, where $s_i(t) \in \mathcal{S}_l := \mathbb{N}$ denotes the number of jobs in queue $i \in [n]$ at time $t$. We model the job allocation mechanism as a global agent with $s_g(t) \in \mathcal{S}_g = \mathcal{A}_g = [n]$, where $s_g(t)$ denotes the queue to which the next job should be delivered. We choose the state transitions to capture the stochastic job arrival and departure: $s_g(t+1) = a_g(t)$, and $s_i(t+1) = \min\{c, \max\{0, s_i(t) + \mathbb{1}\{s_g(t) = i\} - \mathrm{Bern}(p)\}\}$. For the rewards, we set $r_g(s_g(t), a_g(t)) = 0$ and $r_l(s_i(t), s_g(t)) = -s_i(t) - 10 \cdot \mathbb{1}\{s_i(t) > c\}$, where $p = 0.8$ is the probability of finishing a job, $c = 30$ is the capacity of each queue, and $\gamma = 0.9$.
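The queueing transition and reward can be written as two short functions. This is an illustrative sketch of the stated dynamics only; the function names are hypothetical, and queues are indexed from 1 to match the text.

```python
import random


def step_queues(queues, s_g, a_g, p=0.8, c=30, rng=random):
    """Advance all queues one step and return (next_queues, next_s_g).

    queues[i - 1] holds the number of jobs in queue i. The queue named by
    s_g receives one job, each queue completes a job with probability p,
    and the result is clipped to [0, c].
    """
    next_queues = []
    for i, jobs in enumerate(queues, start=1):
        arrival = 1 if s_g == i else 0
        departure = 1 if rng.random() < p else 0  # Bern(p) draw
        next_queues.append(min(c, max(0, jobs + arrival - departure)))
    return next_queues, a_g  # s_g(t+1) = a_g(t)


def local_reward(jobs, c=30):
    """r_l = -jobs - 10 * 1{jobs > c}."""
    return -jobs - 10 * (1 if jobs > c else 0)
```

Because the transition clips each queue at `c`, the penalty branch of `local_reward` is never triggered along trajectories generated by `step_queues`; it only matters for states supplied from elsewhere.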

This simulation ran on a system of $n = 50$ local agents. The goal was to learn an optimal policy for a dispatcher that routes incoming jobs to queues. We ran Algorithm 1 for $T = 300$ empirical adapted Bellman iterations with $m = 30$, and ran Algorithm 2 for $T' = 100$ iterations. Figure 2 illustrates the log-scale reward-optimality gap for varying $k$, showing that the gap decreases monotonically as $k \to n$ with a decay rate that is consistent with the $O(1/\sqrt{k})$ upper bound in Theorem 3.4.

Figure 1: Demand-Response simulation. a) Computation time to learn $\hat{\pi}_{k,m}^{\mathrm{est}}$ for $k \le n = 8$. b) Reward optimality gap (log scale) with $\pi_{k,m}^{\mathrm{est}}$ running 300 iterations for $k \le n = 8$. c) Discounted cumulative rewards for $k \le n = 50$. We note that $k = n$ recovers the mean-field RL iteration solution.

Figure 2: Reward optimality gap (log scale) with $\pi_{k,m}^{\mathrm{est}}$ running 300 iterations.

6 Conclusion, Limitations, and Future Work

Conclusion. This work considers a global decision-making agent in the presence of $n$ local homogeneous agents. We propose SUBSAMPLE-Q, which derives a policy $\pi_{k,m}^{\mathrm{est}}$, where $k \le n$ and $m \in \mathbb{N}$ are tunable parameters, and show that $\pi_{k,m}^{\mathrm{est}}$ converges to the optimal policy $\pi^*$ with a decay rate of $O(1/\sqrt{k} + \epsilon_{k,m})$, where $\epsilon_{k,m}$ is the Bellman noise. To establish the result, we develop an adapted Bellman operator $\hat{\mathcal{T}}_k$, show a Lipschitz-continuity result for $\hat{Q}_k^*$, and generalize the DKW inequality to sampling without replacement. Finally, we validate our theoretical result through numerical experiments.

Limitations and Future Work. We recognize several future directions. This model studies a ‘star-graph’ setting to model a single source of density. It would be fascinating to extend it to general graphs; we believe expander-graph decomposition methods (Anand & Umans, 2023) are amenable to this. Another direction is to find connections between our sub-sampling method and algorithms in federated learning, where the rewards can be stochastic, and to incorporate learning rates (Lin et al., 2021) to attain numerical stability. Another limitation of this work is that we have only partially resolved the problem for truly heterogeneous local agents, by adding a ‘type’ property to each local agent to model some pseudo-heterogeneity in the state space of each agent. Additionally, it would be interesting to extend this work to the online setting without a generative oracle simulator. Finally, our model assumes finite state/action spaces as in the fundamental tabular MDP setting. To increase the applicability of the model, it would be interesting to replace the $Q$-learning algorithm with deep $Q$-learning or another value-based RL method where the state/action spaces can be continuous.

7 Acknowledgements

This work is supported by a research assistantship at Carnegie Mellon University and a fellowship from the Caltech Associates. We thank ComputeX for allowing usage of their server to run numerical experiments and gratefully acknowledge insightful conversations with Yiheng Lin, Ishani Karmarkar, Elia Gorokhovsky, David Hou, Sai Maddipatla, Alexis Wang, and Chris Zhou.

References

Anand & Umans (2023) Emile Anand and Chris Umans. Pseudorandomness of the sticky random walk. arXiv preprint arXiv:2307.11104, 2023.

Anand et al. (2024) Emile Anand, Jan van den Brand, Mehrdad Ghadiri, and Daniel J. Zhang. The Bit Complexity of Dynamic Algebraic Formulas and Their Determinants. In Karl Bringmann, Martin Grohe, Gabriele Puppis, and Ola Svensson (eds.), 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024), volume 297 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 10:1–10:20, Dagstuhl, Germany, 2024. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. ISBN 978-3-95977-322-5. doi: 10.4230/LIPIcs.ICALP.2024.10.

Banach (1922) Stefan Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922.

Bertsekas & Tsitsiklis (1996) Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1st edition, 1996. ISBN 1886529108.

Blondel & Tsitsiklis (2000) Vincent D. Blondel and John N. Tsitsiklis. A Survey of Computational Complexity Results in Systems and Control. Automatica, 36(9):1249–1274, 2000. ISSN 0005-1098. doi: https://doi.org/10.1016/S0005-1098(00)00050-9.

Carmona & Wang (2016) Rene Carmona and Peiqi Wang. Finite State Mean Field Games with Major and Minor Players, 2016.

Carmona et al. (2023) René Carmona, Mathieu Laurière, and Zongjun Tan. Model-free Mean-Field Reinforcement Learning: Mean-field MDP and mean-field Q-learning. The Annals of Applied Probability, 33(6B):5334–5381, 2023. doi: 10.1214/23-AAP1949.

Carmona & Zhu (2016) René Carmona and Xiuneng Zhu. A probabilistic approach to mean field games with major and minor players. The Annals of Applied Probability, 26(3):1535–1580, 2016. ISSN 10505164.

Chaudhari et al. (2024) Shreyas Chaudhari, Srinivasa Pranav, Emile Anand, and José M. F. Moura. Peer-to-peer learning dynamics of wide neural networks, 2024.

Chen & Theja Maguluri (2022) Zaiwei Chen and Siva Theja Maguluri. Sample complexity of policy-based methods under off-policy sampling and linear function approximation. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera (eds.), Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 11195–11214. PMLR, 28–30 Mar 2022.

Chu et al. (2020) Tianshu Chu, Sandeep Chinchali, and Sachin Katti. Multi-agent Reinforcement Learning for Networked System Control. In International Conference on Learning Representations, 2020.

Cui & Koeppl (2022) Kai Cui and Heinz Koeppl. Learning Graphon Mean Field Games and Approximate Nash Equilibria. In International Conference on Learning Representations, 2022.

Cui et al. (2023) Kai Cui, Christian Fabian, and Heinz Koeppl. Multi-Agent Reinforcement Learning via Mean Field Control: Common Noise, Major Agents and Approximation Properties, 2023.

Do et al. (2023) Anh Do, Thanh Nguyen-Tang, and Raman Arora. Multi-Agent Learning with Heterogeneous Linear Contextual Bandits. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Dvoretzky et al. (1956) A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic Minimax Character of the Sample Distribution Function and of the Classical Multinomial Estimator. The Annals of Mathematical Statistics, 27(3):642–669, 1956. doi: 10.1214/aoms/1177728174.

Foster et al. (2022) Dylan J Foster, Alexander Rakhlin, Ayush Sekhari, and Karthik Sridharan. On the Complexity of Adversarial Decision Making. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.

Foster et al. (2023) Dylan J Foster, Noah Golowich, Jian Qian, Alexander Rakhlin, and Ayush Sekhari. Model-Free Reinforcement Learning with the Decision-Estimation Coefficient. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Gamarnik et al. (2009) David Gamarnik, David Goldberg, and Theophane Weber. Correlation Decay in Random Decision Networks, 2009.

Ghai et al. (2023) Udaya Ghai, Arushi Gupta, Wenhan Xia, Karan Singh, and Elad Hazan. Online Nonstochastic Model-Free Reinforcement Learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Gu et al. (2021) Haotian Gu, Xin Guo, Xiaoli Wei, and Renyuan Xu. Mean-Field Controls with Q-Learning for Cooperative MARL: Convergence and Complexity Analysis. SIAM Journal on Mathematics of Data Science, 3(4):1168–1196, 2021. doi: 10.1137/20M1360700.

Gu et al. (2022a) Haotian Gu, Xin Guo, Xiaoli Wei, and Renyuan Xu. Dynamic Programming Principles for Mean-Field Controls with Learning, 2022a.

Gu et al. (2022b) Haotian Gu, Xin Guo, Xiaoli Wei, and Renyuan Xu. Mean-Field Multi-Agent Reinforcement Learning: A Decentralized Network Approach, 2022b.

Guestrin et al. (2003) Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient Solution Algorithms for Factored MDPs. J. Artif. Int. Res., 19(1):399–468, oct 2003. ISSN 1076-9757.

Hoeffding (1963) Wassily Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58(301):13–30, 1963. ISSN 01621459.

Hu et al. (2023) Yuanquan Hu, Xiaoli Wei, Junji Yan, and Hengxi Zhang. Graphon Mean-Field Control for Cooperative Multi-Agent Reinforcement Learning. Journal of the Franklin Institute, 360(18):14783–14805, 2023. ISSN 0016-0032.

Jin et al. (2018) Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Jin et al. (2021) Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning – a simple, efficient, decentralized algorithm for multiagent rl, 2021.

Jing et al. (2022) Gangshan Jing, He Bai, Jemin George, Aranya Chakrabortty, and Piyush. K. Sharma. Distributed Cooperative Multi-Agent Reinforcement Learning with Directed Coordination Graph. In 2022 American Control Conference (ACC), pp. 3273–3278, 2022. doi: 10.23919/ACC53348.2022.9867152.

Kakade & Langford (2002) Sham Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. In Claude Sammut and Achim Hoffman (eds.), Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), pp. 267–274, San Francisco, CA, USA, 2002. Morgan Kauffman. ISBN 1-55860-873-7.

Kearns & Singh (1998) Michael Kearns and Satinder Singh. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms. In M. Kearns, S. Solla, and D. Cohn (eds.), Advances in Neural Information Processing Systems, volume 11. MIT Press, 1998.

Kim & Giannakis (2017) Seung-Jun Kim and Geogios B. Giannakis. An Online Convex Optimization Approach to Real-time Energy Pricing for Demand Response. IEEE Transactions on Smart Grid, 8(6):2784–2793, 2017. doi: 10.1109/TSG.2016.2539948.

Kiran et al. (2022) B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep Reinforcement Learning for Autonomous Driving: A Survey. IEEE Transactions on Intelligent Transportation Systems, 23(6):4909–4926, 2022. doi: 10.1109/TITS.2021.3054625.

Kober et al. (2013) Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement Learning in Robotics: A Survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013. doi: 10.1177/0278364913495721.

Li et al. (2022) Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction. IEEE Transactions on Information Theory, 68(1):448–473, 2022. doi: 10.1109/TIT.2021.3120096.

Lin et al. (2020) Yiheng Lin, Guannan Qu, Longbo Huang, and Adam Wierman. Distributed Reinforcement Learning in Multi-Agent Networked Systems. CoRR, abs/2006.06555, 2020.

Lin et al. (2021) Yiheng Lin, Guannan Qu, Longbo Huang, and Adam Wierman. Multi-Agent Reinforcement Learning in Stochastic Networked Systems. In Thirty-fifth Conference on Neural Information Processing Systems, 2021.

Lin et al. (2023) Yiheng Lin, James A. Preiss, Emile Anand, Yingying Li, Yisong Yue, and Adam Wierman. Online adaptive policy selection in time-varying systems: No-regret via contractive perturbations. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 53508–53521. Curran Associates, Inc., 2023.

Lin et al. (2024a) Yiheng Lin, James A Preiss, Fengze Xie, Emile Anand, Soon-Jo Chung, Yisong Yue, and Adam Wierman. Online policy optimization in unknown nonlinear systems. arXiv preprint arXiv:2404.13009, 2024a.

Lin et al. (2024b) Yiheng Lin, James A. Preiss, Fengze Xie, Emile Anand, Soon-Jo Chung, Yisong Yue, and Adam Wierman. Online policy optimization in unknown nonlinear systems. In Shipra Agrawal and Aaron Roth (eds.), Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pp. 3475–3522. PMLR, 30 Jun–03 Jul 2024b.

Littman (1994) Michael L. Littman. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Machine learning proceedings, Elsevier, pp. 157–163, 1994.

Massart (1990) P. Massart. The Tight Constant in the Dvoretzky-Kiefer-Wolfowitz Inequality. The Annals of Probability, 18(3):1269–1283, 1990. doi: 10.1214/aop/1176990746.

Min et al. (2023) Yifei Min, Jiafan He, Tianhao Wang, and Quanquan Gu. Cooperative Multi-Agent Reinforcement Learning: Asynchronous Communication and Linear Function Approximation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 24785–24811. PMLR, 23–29 Jul 2023.

Mitzenmacher & Sinclair (1996) Michael David Mitzenmacher and Alistair Sinclair. The Power of Two Choices in Randomized Load Balancing. PhD thesis, University of California, Berkeley, 1996. AAI9723118.

Molzahn et al. (2017) Daniel K. Molzahn, Florian Dörfler, Henrik Sandberg, Steven H. Low, Sambuddha Chakrabarti, Ross Baldick, and Javad Lavaei. A Survey of Distributed Optimization and Control Algorithms for Electric Power Systems. IEEE Transactions on Smart Grid, 8(6):2941–2962, 2017. doi: 10.1109/TSG.2017.2720471.

Mondal et al. (2022) Washim Uddin Mondal, Mridul Agarwal, Vaneet Aggarwal, and Satish V. Ukkusuri. On the Approximation of Cooperative Heterogeneous Multi-Agent Reinforcement Learning (MARL) Using Mean Field Control (MFC). Journal of Machine Learning Research, 23(1), jan 2022. ISSN 1532-4435.

Naaman (2021) Michael Naaman. On the Tight Constant in the Multivariate Dvoretzky–Kiefer–Wolfowitz Inequality. Statistics & Probability Letters, 173:109088, 2021. ISSN 0167-7152. doi: https://doi.org/10.1016/j.spl.2021.109088.

Papadimitriou & Tsitsiklis (1999) Christos H. Papadimitriou and John N. Tsitsiklis. The Complexity of Optimal Queuing Network Control. Mathematics of Operations Research, 24(2):293–305, 1999. ISSN 0364765X, 15265471.

Powell (2007) Warren B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality (Wiley Series in Probability and Statistics). Wiley-Interscience, USA, 2007. ISBN 0470171553.

Qin et al. (2023) Aoyang Qin, Feng Gao, Qing Li, Song-Chun Zhu, and Sirui Xie. Learning non-Markovian Decision-Making from State-only Sequences. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Qu et al. (2020a) Guannan Qu, Yiheng Lin, Adam Wierman, and Na Li. Scalable Multi-Agent Reinforcement Learning for Networked Systems with Average Reward. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020a. Curran Associates Inc. ISBN 9781713829546.

Qu et al. (2020b) Guannan Qu, Adam Wierman, and Na Li. Scalable Reinforcement Learning of Localized Policies for Multi-Agent Networked Systems. In Alexandre M. Bayen, Ali Jadbabaie, George Pappas, Pablo A. Parrilo, Benjamin Recht, Claire Tomlin, and Melanie Zeilinger (eds.), Proceedings of the 2nd Conference on Learning for Dynamics and Control, volume 120 of Proceedings of Machine Learning Research, pp. 256–266. PMLR, 10–11 Jun 2020b.

Serfling (1974) R. J. Serfling. Probability Inequalities for the Sum in Sampling without Replacement. The Annals of Statistics, 2(1):39–48, 1974. ISSN 00905364.

Shapley (1953) L. S. Shapley. Stochastic Games*. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953. doi: 10.1073/pnas.39.10.1095.

Silver et al. (2016) David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587):484–489, January 2016. ISSN 1476-4687. doi: 10.1038/nature16961.

Subramanian et al. (2022) Sriram Ganapathi Subramanian, Matthew E. Taylor, Mark Crowley, and Pascal Poupart. Decentralized mean field games, 2022.

Sutton et al. (1999) Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller (eds.), Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999.

Tsybakov (2008) Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519.

Watkins & Dayan (1992) Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, May 1992. ISSN 1573-0565. doi: 10.1007/BF00992698.

Yang et al. (2018) Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean Field Multi-Agent Reinforcement Learning. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5571–5580. PMLR, 10–15 Jul 2018.

Zhang et al. (2021) Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms, 2021.

Zhang & Pavone (2016) Rick Zhang and Marco Pavone. Control of Robotic Mobility-on-Demand Systems: A Queueing-Theoretical Perspective. The International Journal of Robotics Research, 35(1-3):186–203, 2016. doi: 10.1177/0278364915581863.

Outline of the Appendices.

• Appendix A presents additional definitions and remarks that support the main body.

• Appendices B–C contain a detailed proof of the Lipschitz continuity bound in Theorem 4.1 and the total variation distance bound in Theorem 4.2.

• Appendix D contains a detailed proof of the main result in Theorem 3.4.

Appendix A Mathematical Background and Additional Remarks

Definition A.1 (Lipschitz continuity).

Given two metric spaces ( 𝒳 , 𝑑 𝒳 ) and ( 𝒴 , 𝑑 𝒴 ) and a constant 𝐿 ∈ ℝ + , a mapping 𝑓 : 𝒳 → 𝒴 is 𝐿 -Lipschitz continuous if for all 𝑥 , 𝑦 ∈ 𝒳 , 𝑑 𝒴 ⁢ ( 𝑓 ⁢ ( 𝑥 ) , 𝑓 ⁢ ( 𝑦 ) ) ≤ 𝐿 ⋅ 𝑑 𝒳 ⁢ ( 𝑥 , 𝑦 ) .

Theorem A.2 (Banach-Caccioppoli fixed-point theorem (Banach, 1922)).

Consider the complete metric space $(\mathcal{X}, d_\mathcal{X})$, and $T: \mathcal{X} \to \mathcal{X}$ such that $T$ is a $\gamma$-Lipschitz continuous mapping for $\gamma \in (0, 1)$. Then, by the Banach-Caccioppoli fixed-point theorem, there exists a unique fixed point $x^* \in \mathcal{X}$ for which $T(x^*) = x^*$. Additionally, $x^* = \lim_{s \to \infty} T^s(x_0)$ for any $x_0 \in \mathcal{X}$.
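As a one-dimensional illustration of Theorem A.2, the sketch below iterates a scalar contraction until successive iterates agree to a tolerance. The map $T(x) = 0.9x + 1$ used in the usage note is a hypothetical example chosen because its fixed point, $x^* = 10$, is easy to verify by hand; the helper name and stopping rule are also assumptions.

```python
def fixed_point(T, x0, tol=1e-10, max_iter=10_000):
    """Iterate x <- T(x) from x0 until successive iterates differ by <= tol.

    Returns (approximate fixed point, number of applications of T).
    """
    x = x0
    for i in range(max_iter):
        nxt = T(x)
        if abs(nxt - x) <= tol:
            return nxt, i + 1
        x = nxt
    raise RuntimeError("no convergence within max_iter iterations")
```

Calling `fixed_point(lambda x: 0.9 * x + 1, 0.0)` and `fixed_point(lambda x: 0.9 * x + 1, 100.0)` reaches the same limit from both starting points, which is the uniqueness claim of the theorem in miniature.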

For convenience, we restate below the various Bellman operators under consideration.

Definition A.3 (Bellman Operator 𝒯 ).

$$\mathcal{T} Q^t(s, a_g) := r_{[n]}(s, a_g) + \gamma \, \mathbb{E}_{\substack{s_g' \sim P_g(\cdot | s_g, a_g), \\ s_i' \sim P_l(\cdot | s_i, s_g), \ \forall i \in [n]}} \max_{a_g' \in \mathcal{A}_g} Q^t(s', a_g') \tag{13}$$

Definition A.4 (Adapted Bellman Operator $\hat{\mathcal{T}}_k$).

The adapted Bellman operator updates a smaller 𝑄 function (which we denote by 𝑄 ^ 𝑘 ), for a surrogate system with the global agent and 𝑘 ∈ [ 𝑛 ] local agents, using mean-field value iteration:

$$\hat{\mathcal{T}}_k \hat{Q}_k^t(s_g, F_{s_\Delta}, a_g) := r_\Delta(s, a_g) + \gamma \, \mathbb{E}_{\substack{s_g' \sim P_g(\cdot | s_g, a_g), \\ s_i' \sim P_l(\cdot | s_i, s_g), \ \forall i \in \Delta}} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^t(s_g', F_{s_\Delta'}, a_g') \tag{14}$$

Definition A.5 (Empirical Adapted Bellman Operator $\hat{\mathcal{T}}_{k,m}$).

The empirical adapted Bellman operator 𝒯 ^ 𝑘 , 𝑚 empirically estimates the adapted Bellman operator update using mean-field value iteration by drawing 𝑚 random samples of 𝑠 𝑔 ∼ 𝑃 𝑔 ( ⋅ | 𝑠 𝑔 , 𝑎 𝑔 ) and 𝑠 𝑖 ∼ 𝑃 𝑙 ( ⋅ | 𝑠 𝑖 , 𝑠 𝑔 ) for 𝑖 ∈ Δ , where for 𝑗 ∈ [ 𝑚 ] , the 𝑗 ’th random sample is given by 𝑠 𝑔 𝑗 and 𝑠 Δ 𝑗 :

$$\hat{\mathcal{T}}_{k,m} \hat{Q}_{k,m}^t(s_g, F_{s_\Delta}, a_g) := r_\Delta(s, a_g) + \frac{\gamma}{m} \sum_{j \in [m]} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_{k,m}^t(s_g^j, F_{s_\Delta^j}, a_g') \tag{15}$$

Remark A.6.

We remark on the following relationships between the variants of the Bellman operators from Definitions A.3, A.4 and A.5. First, by the law of large numbers, we have $\lim_{m \to \infty} \hat{\mathcal{T}}_{k,m} = \hat{\mathcal{T}}_k$, where the error decays in $O(1/\sqrt{m})$ by the Chernoff bound. Secondly, by comparing Definition A.4 and Definition A.3, we have $\hat{\mathcal{T}}_n = \mathcal{T}$.
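A single application of the empirical adapted Bellman operator from Definition A.5 at one state-action triple can be sketched as follows. The representation is an assumption made for illustration: $F_{s_\Delta}$ is stored as the sorted tuple of the $k$ sampled local states (which determines the empirical distribution for fixed $k$), `Q` is a dictionary keyed by `(s_g, F, a_g)`, and `reward` and `sample_next` are hypothetical callables standing in for $r_\Delta$ and the generative oracle.

```python
def empirical_backup(Q, s_g, s_delta, a_g, reward, sample_next, actions, gamma, m):
    """Return r_Delta + (gamma / m) * sum_j max_a' Q(s_g^j, F_{s_Delta^j}, a').

    sample_next(s_g, s_delta, a_g) must return one sampled next
    (global state, tuple of local states); it is called m times.
    """
    total = 0.0
    for _ in range(m):
        s_g_next, s_delta_next = sample_next(s_g, s_delta, a_g)
        key = tuple(sorted(s_delta_next))  # order-free summary of the sample
        total += max(Q[(s_g_next, key, a)] for a in actions)
    return reward(s_g, s_delta, a_g) + gamma * total / m
```

Replacing the average over `m` oracle draws with an exact expectation would give the adapted operator of Definition A.4; the sketch makes visible that the only difference between the two is this Monte Carlo estimate.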

Lemma A.7.

For any $\Delta \subseteq [n]$ such that $|\Delta| = k$, suppose $0 \le r_\Delta(s, a_g) \le \tilde{r}$. Then, $\hat{Q}_k^t \le \frac{\tilde{r}}{1 - \gamma}$.

Proof.

We prove this by induction on $t \in \mathbb{N}$. The base case is satisfied as $\hat{Q}_k^0 = 0$. Assume that $\| \hat{Q}_k^{t} \|_\infty \le \frac{\tilde{r}}{1 - \gamma}$. We bound $\hat{Q}_k^{t+1}$ from the Bellman update at each time step as follows, for all $s_g \in \mathcal{S}_g$, $F_{s_\Delta} \in \Theta_k^{|\mathcal{S}_l|}$, $a_g \in \mathcal{A}_g$:

$$\begin{aligned} \hat{Q}_k^{t+1}(s_g, F_{s_\Delta}, a_g) &= r_\Delta(s, a_g) + \gamma \, \mathbb{E}_{\substack{s_g' \sim P_g(\cdot | s_g, a_g), \\ s_i' \sim P_l(\cdot | s_i, s_g), \ \forall i \in \Delta}} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^t(s_g', F_{s_\Delta'}, a_g') \\ &\le \tilde{r} + \gamma \max_{\substack{a_g' \in \mathcal{A}_g, \ s_g' \in \mathcal{S}_g, \\ F_{s_\Delta'} \in \Theta_k^{|\mathcal{S}_l|}}} \hat{Q}_k^t(s_g', F_{s_\Delta'}, a_g') \\ &\le \tilde{r} + \frac{\gamma \tilde{r}}{1 - \gamma} = \frac{\tilde{r}}{1 - \gamma} \end{aligned}$$

Here, the first inequality follows by noting that the maximum value of a random variable is at least as large as its expectation. The second inequality follows from the inductive hypothesis.∎

Remark A.8.

Lemma A.7 is independent of the choice of $k$. Therefore, for $k = n$, this implies an identical bound on $Q^t$. An argument similar to that of Lemma A.7 implies an identical bound on $\hat{Q}_{k,m}^t$.

Recall that the original Bellman operator 𝒯 satisfies a 𝛾 -contractive property under the infinity norm. We similarly show that 𝒯 ^ 𝑘 and 𝒯 ^ 𝑘 , 𝑚 satisfy a 𝛾 -contractive property under infinity norm in Lemma A.9 and Lemma A.10.

Lemma A.9.

𝒯 ^ 𝑘 satisfies the 𝛾 -contractive property under infinity norm:

‖ 𝒯 ^ 𝑘 ⁢ 𝑄 ^ 𝑘 ′ − 𝒯 ^ 𝑘 ⁢ 𝑄 ^ 𝑘 ‖ ∞ ≤ 𝛾 ⁢ ‖ 𝑄 ^ 𝑘 ′ − 𝑄 ^ 𝑘 ‖ ∞

Proof.

Suppose we apply $\hat{\mathcal{T}}_k$ to $\hat{Q}_k(s_g, F_{s_\Delta}, a_g)$ and $\hat{Q}_k'(s_g, F_{s_\Delta}, a_g)$ for $|\Delta| = k$. Then:

$$\begin{aligned} \| \hat{\mathcal{T}}_k \hat{Q}_k' - \hat{\mathcal{T}}_k \hat{Q}_k \|_\infty &= \gamma \max_{\substack{s_g \in \mathcal{S}_g, \ a_g \in \mathcal{A}_g, \\ F_{s_\Delta} \in \Theta_k^{|\mathcal{S}_l|}}} \left| \mathbb{E}_{\substack{s_g' \sim P_g(\cdot | s_g, a_g), \\ s_i' \sim P_l(\cdot | s_i, s_g), \\ \forall s_i' \in s_\Delta'}} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k'(s_g', F_{s_\Delta'}, a_g') - \mathbb{E}_{\substack{s_g' \sim P_g(\cdot | s_g, a_g), \\ s_i' \sim P_l(\cdot | s_i, s_g), \\ \forall s_i' \in s_\Delta'}} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k(s_g', F_{s_\Delta'}, a_g') \right| \\ &\le \gamma \max_{\substack{s_g' \in \mathcal{S}_g, \ F_{s_\Delta'} \in \Theta_k^{|\mathcal{S}_l|}, \\ a_g' \in \mathcal{A}_g}} \left| \hat{Q}_k'(s_g', F_{s_\Delta'}, a_g') - \hat{Q}_k(s_g', F_{s_\Delta'}, a_g') \right| \\ &= \gamma \| \hat{Q}_k' - \hat{Q}_k \|_\infty \end{aligned}$$

The equality implicitly cancels the common $r_\Delta(s, a_g)$ terms from each application of the adapted-Bellman operator. The inequality follows from Jensen’s inequality, maximizing over the actions, and bounding the expected value with the maximizers of the random variables. The last line recovers the definition of infinity norm. ∎

Lemma A.10.

𝒯 ^ 𝑘 , 𝑚 satisfies the 𝛾 -contractive property under infinity norm.

Proof.

Similarly to Lemma A.9, suppose we apply $\hat{\mathcal{T}}_{k,m}$ to $\hat{Q}_{k,m}(s_g, F_{s_\Delta}, a_g)$ and $\hat{Q}_{k,m}'(s_g, F_{s_\Delta}, a_g)$. Then:

$$\begin{aligned} \| \hat{\mathcal{T}}_{k,m} \hat{Q}_{k,m} - \hat{\mathcal{T}}_{k,m} \hat{Q}_{k,m}' \|_\infty &= \frac{\gamma}{m} \left\| \sum_{j \in [m]} \left( \max_{a_g' \in \mathcal{A}_g} \hat{Q}_{k,m}(s_g^j, F_{s_\Delta^j}, a_g') - \max_{a_g' \in \mathcal{A}_g} \hat{Q}_{k,m}'(s_g^j, F_{s_\Delta^j}, a_g') \right) \right\|_\infty \\ &\le \gamma \max_{\substack{a_g' \in \mathcal{A}_g, \ s_g' \in \mathcal{S}_g, \\ s_\Delta' \in \mathcal{S}_l^k}} \left| \hat{Q}_{k,m}(s_g', F_{s_\Delta'}, a_g') - \hat{Q}_{k,m}'(s_g', F_{s_\Delta'}, a_g') \right| \\ &\le \gamma \| \hat{Q}_{k,m} - \hat{Q}_{k,m}' \|_\infty \end{aligned}$$

The first inequality uses the triangle inequality and the general property $\left| \max_{a \in A} f(a) - \max_{a \in A} g(a) \right| \le \max_{a \in A} | f(a) - g(a) |$. In the last line, we recover the definition of infinity norm.∎
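The $\gamma$-contraction of Lemmas A.9 and A.10, and the boundedness of Lemma A.7, can be checked numerically on a small tabular problem. The sketch below uses a generic Bellman optimality operator on a randomly generated MDP as a stand-in; it is not the adapted operator $\hat{\mathcal{T}}_k$ itself, which has the same form on the surrogate state space $(s_g, F_{s_\Delta})$. All names and the random model are hypothetical.

```python
import random


def bellman(Q, R, P, gamma):
    """Tabular Bellman optimality operator.

    Q[s][a] and R[s][a] are floats; P[s][a] is a list of next-state
    probabilities that sums to one.
    """
    n_s, n_a = len(R), len(R[0])
    V = [max(Q[s]) for s in range(n_s)]
    return [[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
             for a in range(n_a)] for s in range(n_s)]


def sup_norm_diff(Q1, Q2):
    """Infinity-norm distance between two Q tables of the same shape."""
    return max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))


def random_mdp(n_s, n_a, rng):
    """Random rewards in [0, 1) and random transition probabilities."""
    R = [[rng.random() for _ in range(n_a)] for _ in range(n_s)]
    P = []
    for _ in range(n_s):
        rows = []
        for _ in range(n_a):
            w = [rng.random() for _ in range(n_s)]
            total = sum(w)
            rows.append([x / total for x in w])
        P.append(rows)
    return R, P
```

For any two tables `Q1` and `Q2`, `sup_norm_diff(bellman(Q1, R, P, g), bellman(Q2, R, P, g))` should not exceed `g * sup_norm_diff(Q1, Q2)`, and iterating `bellman` from the all-zero table should never exceed $1/(1-\gamma)$ when rewards lie in $[0, 1)$.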

Remark A.11.

Intuitively, the $\gamma$-contractive property of $\hat{\mathcal{T}}_k$ and $\hat{\mathcal{T}}_{k,m}$ causes the distance between any two $\hat{Q}_k$ (respectively $\hat{Q}_{k,m}$) functions to shrink by a factor of $\gamma$ at each time step, so that repeated applications of the corresponding Bellman operator converge to a unique fixed point by the Banach-Caccioppoli fixed-point theorem. We introduce these fixed points in Definitions A.12 and A.13.

Definition A.12 ( 𝑄 ^ 𝑘 ∗ ).

Suppose $\hat{Q}_k^0 := 0$ and let $\hat{Q}_k^{t+1}(s_g, F_{s_\Delta}, a_g) = \hat{\mathcal{T}}_k \hat{Q}_k^t(s_g, F_{s_\Delta}, a_g)$ for $t \in \mathbb{N}$. Denote the fixed point of $\hat{\mathcal{T}}_k$ by $\hat{Q}_k^*$, such that $\hat{\mathcal{T}}_k \hat{Q}_k^*(s_g, F_{s_\Delta}, a_g) = \hat{Q}_k^*(s_g, F_{s_\Delta}, a_g)$.

Definition A.13 ( 𝑄 ^ 𝑘 , 𝑚 est ).

Suppose $\hat{Q}_{k,m}^0 := 0$ and let $\hat{Q}_{k,m}^{t+1}(s_g, F_{s_\Delta}, a_g) = \hat{\mathcal{T}}_{k,m} \hat{Q}_{k,m}^t(s_g, F_{s_\Delta}, a_g)$ for $t \in \mathbb{N}$. Denote the fixed point of $\hat{\mathcal{T}}_{k,m}$ by $\hat{Q}_{k,m}^{\mathrm{est}}$, such that $\hat{\mathcal{T}}_{k,m} \hat{Q}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}, a_g) = \hat{Q}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}, a_g)$.

Furthermore, recall the assumption on our empirical approximation of 𝑄 ^ 𝑘 ∗ :

Lemma 3.3. For all 𝑘 ∈ [ 𝑛 ] and 𝑚 ∈ ℕ , we assume that:

‖ 𝑄 ^ 𝑘 , 𝑚 est − 𝑄 ^ 𝑘 ∗ ‖ ∞ ≤ 𝜖 𝑘 , 𝑚

Corollary A.14.

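Observe that by applying the $\gamma$-contractive property recursively for $T$ time steps:

$$\| \hat{Q}_k^* - \hat{Q}_k^T \|_\infty \le \gamma^T \cdot \| \hat{Q}_k^* - \hat{Q}_k^0 \|_\infty \tag{16}$$

$$\| \hat{Q}_{k,m}^{\mathrm{est}} - \hat{Q}_{k,m}^T \|_\infty \le \gamma^T \cdot \| \hat{Q}_{k,m}^{\mathrm{est}} - \hat{Q}_{k,m}^0 \|_\infty \tag{17}$$

Further, noting that $\hat{Q}_k^0 = \hat{Q}_{k,m}^0 := 0$, $\| \hat{Q}_k^* \|_\infty \le \frac{\tilde{r}}{1 - \gamma}$, and $\| \hat{Q}_{k,m}^{\mathrm{est}} \|_\infty \le \frac{\tilde{r}}{1 - \gamma}$ from Lemma A.7:

$$\| \hat{Q}_k^* - \hat{Q}_k^T \|_\infty \le \gamma^T \frac{\tilde{r}}{1 - \gamma} \tag{18}$$

$$\| \hat{Q}_{k,m}^{\mathrm{est}} - \hat{Q}_{k,m}^T \|_\infty \le \gamma^T \frac{\tilde{r}}{1 - \gamma} \tag{19}$$

Remark A.15.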

Corollary A.14 characterizes the error decay between 𝑄 ^ 𝑘 𝑇 and 𝑄 ^ 𝑘 ∗ as well as between 𝑄 ^ 𝑘 , 𝑚 𝑇 and 𝑄 ^ 𝑘 , 𝑚 est and shows that it decays exponentially in the number of corresponding Bellman iterations with the 𝛾 𝑇 multiplicative factor.
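The exponential decay in (18) and (19) gives a direct rule of thumb for how many Bellman iterations suffice for a target accuracy. The helper below, a sketch with a hypothetical name, solves $\gamma^T \tilde{r} / (1 - \gamma) \le \text{tol}$ for the smallest integer $T$.

```python
import math


def iterations_needed(r_max, gamma, tol):
    """Smallest T such that gamma**T * r_max / (1 - gamma) <= tol."""
    start_gap = r_max / (1 - gamma)  # bound on the initial error, from Lemma A.7
    if tol >= start_gap:
        return 0
    return math.ceil(math.log(start_gap / tol) / math.log(1 / gamma))
```

With $\gamma = 0.9$, as in both experiments, each additional factor of ten in accuracy costs roughly 22 more iterations, since $\ln 10 / \ln(1/0.9) \approx 21.9$.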

Furthermore, we characterize the greedy policies obtained from $Q^*$, $\hat{Q}_k^*$, and $\hat{Q}_{k,m}^{\mathrm{est}}$.

Definition A.16 ( 𝜋 ∗ ).

The greedy policy derived from 𝑄 ∗ is

𝜋 ∗ ⁢ ( 𝑠 ) := arg ⁡ max 𝑎 𝑔 ∈ 𝒜 𝑔 ⁡ 𝑄 ∗ ⁢ ( 𝑠 , 𝑎 𝑔 ) .

Definition A.17 ( 𝜋 ^ 𝑘 ∗ ).

The greedy policy from 𝑄 ^ 𝑘 ∗ is

𝜋 ^ 𝑘 ∗ ⁢ ( 𝑠 𝑔 , 𝐹 𝑠 Δ ) := arg ⁡ max 𝑎 𝑔 ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 ∗ ⁢ ( 𝑠 𝑔 , 𝐹 𝑠 Δ , 𝑎 𝑔 ) .

Definition A.18 ( 𝜋 ^ 𝑘 , 𝑚 est ).

The greedy policy from 𝑄 ^ 𝑘 , 𝑚 est is given by

𝜋 ^ 𝑘 , 𝑚 est ⁢ ( 𝑠 𝑔 , 𝐹 𝑠 Δ ) := arg ⁡ max 𝑎 𝑔 ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 , 𝑚 est ⁢ ( 𝑠 𝑔 , 𝐹 𝑠 Δ , 𝑎 𝑔 ) .

Figure 3 details the analytic flow on how we use the empirical adapted Bellman operator to perform value iteration on 𝑄 ^ 𝑘 , 𝑚 to get 𝑄 ^ 𝑘 , 𝑚 est which approximates 𝑄 ∗ .

[Figure 3: the chain $\hat{Q}_{k,m}^0(s_g, F_{s_\Delta}, a_g)$, $\hat{Q}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}, a_g)$, $\hat{Q}_k^*(s_g, F_{s_\Delta}, a_g)$, $\hat{Q}_n^*(s_g, F_{s_{[n]}}, a_g)$, $Q^*(s_g, s_{[n]}, a_g)$, with consecutive terms linked by steps (1), (2), (3), and (4).]

Figure 3: Flow of the algorithm and relevant analyses in learning $Q^*$. Here, (1) follows by performing Algorithm 1 (SUBSAMPLE-Q: Learning) on $\hat{Q}_{k,m}^0$. (2) follows from Lemma 3.3. (3) follows from the Lipschitz continuity and total variation distance bounds in Theorems 4.1 and 4.2. Finally, (4) follows from noting that $\hat{Q}_n^* = Q^*$.

Algorithm 3 provides a stable implementation of Algorithm 1: SUBSAMPLE-Q: Learning, where we incorporate a sequence of learning rates $\{\eta_t\}_{t \in [T]}$ into the framework of Watkins & Dayan (1992). Algorithm 3 is also provably numerically stable under fixed-point arithmetic Anand et al. (2024).

Algorithm 3 Stable (Practical) Implementation of Algorithm 1: SUBSAMPLE-Q: Learning

Require: A multi-agent system as described in Section 2. Parameter $T$ for the number of iterations in the initial value iteration step. Hyperparameter $k \in [n]$. Discount parameter $\gamma \in (0, 1)$. Oracle $\mathcal{O}$ to sample $s_g' \sim P_g(\cdot \mid s_g, a_g)$ and $s_i' \sim P_l(\cdot \mid s_i, s_g)$ for all $i \in [n]$. Sequence of learning rates $\{\eta_t\}_{t \in [T]}$ where $\eta_t \in (0, 1]$.

1: Choose any $\Delta \subseteq [n]$ such that $|\Delta| = k$.

2: Set $\hat{Q}_{k,m}^0(s_g, F_{s_\Delta}, a_g) = 0$ for $s_g \in \mathcal{S}_g$, $F_{s_\Delta} \in \Theta_k^{|\mathcal{S}_l|}$, $a_g \in \mathcal{A}_g$.

3: for $t = 1$ to $T$ do

4: for $(s_g, F_{s_\Delta}) \in \mathcal{S}_g \times \Theta_k^{|\mathcal{S}_l|}$ do

5: for $a_g \in \mathcal{A}_g$ do

6: $\hat{Q}_{k,m}^{t+1}(s_g, F_{s_\Delta}, a_g) \leftarrow (1 - \eta_t)\, \hat{Q}_{k,m}^t(s_g, F_{s_\Delta}, a_g) + \eta_t\, \hat{\mathcal{T}}_{k,m} \hat{Q}_{k,m}^t(s_g, F_{s_\Delta}, a_g)$

7: For all $(s_g, F_{s_\Delta}) \in \mathcal{S}_g \times \Theta_k^{|\mathcal{S}_l|}$, let the approximate policy be $\hat{\pi}_{k,m}^T(s_g, F_{s_\Delta}) = \arg\max_{a_g \in \mathcal{A}_g} \hat{Q}_{k,m}^T(s_g, F_{s_\Delta}, a_g)$.
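The damped update in step 6 of Algorithm 3 can be sketched on a generic tabular MDP. This is only an illustration under stated assumptions: the exact Bellman operator of a small random MDP stands in for the empirical adapted operator $\hat{\mathcal{T}}_{k,m}$, and the arrays `R`, `P`, the constant step size, and all function names are hypothetical.

```python
import numpy as np

def bellman_operator(Q, R, P, gamma):
    """(TQ)(s, a) = R[s, a] + gamma * sum_s' P[s, a, s'] * max_a' Q[s', a']."""
    return R + gamma * P @ Q.max(axis=1)

def damped_value_iteration(R, P, gamma, etas):
    """Iterate Q <- (1 - eta_t) * Q + eta_t * TQ, starting from Q = 0."""
    Q = np.zeros_like(R)
    for eta in etas:
        Q = (1.0 - eta) * Q + eta * bellman_operator(Q, R, P, gamma)
    return Q

# Hypothetical toy MDP with 4 states and 3 actions.
rng = np.random.default_rng(0)
R = rng.random((4, 3))
P = rng.random((4, 3, 4))
P /= P.sum(axis=2, keepdims=True)  # each P[s, a, :] is a distribution

Q = damped_value_iteration(R, P, gamma=0.9, etas=[0.5] * 400)
greedy_policy = Q.argmax(axis=1)  # analogue of step 7
```

With a constant step size $\eta$, the damped map is a contraction with modulus $1 - \eta + \eta\gamma$, so the iterates still approach the fixed point of the operator, only more slowly than undamped value iteration.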

Notably, $\hat{Q}_{k,m}^t$ in Algorithm 3 converges due to a similar $\gamma$-contractive property as in Lemma A.9, given an appropriately conditioned sequence of learning rates $\eta_t$:

Theorem A.19.

As $T \to \infty$, if $\sum_{t=1}^{T} \eta_t = \infty$ and $\sum_{t=1}^{T} \eta_t^2 < \infty$, then $Q$-learning converges to the optimal $Q$ function asymptotically with probability 1.
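As a quick numerical illustration of the two step-size conditions in Theorem A.19, the sketch below tabulates partial sums for the schedule $\eta_t = 1/t$. This schedule is an assumed example (the paper does not prescribe it), chosen because its sum diverges while its sum of squares stays below $\pi^2/6$.

```python
import math

def step_size_partial_sums(T):
    """Partial sums of eta_t and eta_t^2 for the schedule eta_t = 1/t."""
    sum_eta = sum(1.0 / t for t in range(1, T + 1))
    sum_eta_sq = sum(1.0 / t**2 for t in range(1, T + 1))
    return sum_eta, sum_eta_sq

for T in (10, 1_000, 100_000):
    s1, s2 = step_size_partial_sums(T)
    # s1 keeps growing like log(T); s2 stays bounded by pi^2 / 6.
    print(T, round(s1, 3), round(s2, 5))
```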

Furthermore, finite-time guarantees with the learning rate and sample complexity have been shown recently in Chen & Theja Maguluri (2022), which when adapted to our 𝑄 ^ 𝑘 , 𝑚 framework in Algorithm 3 yields:

Theorem A.20 (Chen & Theja Maguluri (2022)).

For all $t \in [T]$ and $\epsilon > 0$, if $\eta_t = (1-\gamma)^4 \epsilon^2$ and $T = k^{|\mathcal{S}_l|} |\mathcal{S}_g| |\mathcal{A}_g| / ((1-\gamma)^5 \epsilon^2)$,

$$\|\hat{Q}_{k,m}^T - \hat{Q}_{k,m}^{\mathrm{est}}\| \le \epsilon.$$

This global decision-making problem can be viewed as a generalization of the network setting to a specific type of dense graph: the star graph (Figure 4). We briefly elaborate more on this connection below.

Definition A.21 (Star Graph 𝑆 𝑛 ).

For 𝑛 ∈ ℕ , the star graph 𝑆 𝑛 is the complete bipartite graph 𝐾 1 , 𝑛 .

𝑆 𝑛 captures the graph density notion by saturating the set of neighbors for the central node. Furthermore, it models interactions between agents identically to our setting, where the central node is a global agent and the peripheral nodes are local agents. The cardinality of the search space simplex for the optimal policy is | 𝒮 𝑔 | ⁢ | 𝒮 𝑙 | 𝑛 ⁢ | 𝒜 𝑔 | , which is exponential in 𝑛 . Hence, this problem cannot be naively modeled by an MDP: we need to exploit the symmetry of the local agents. This intuition allows our subsampling algorithm to run in polylogarithmic time (in 𝑛 ). Further, works that leverage the exponential decaying property that truncates the search space for policies over immediate neighborhoods of agents still rely on the assumption that the graph neighborhood for the agent is sparse Lin et al. (2021); Qu et al. (2020a; b); Lin et al. (2020); however, the graph 𝑆 𝑛 violates this local sparsity condition; hence, previous methods do not apply to this problem instance.
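The gap between the full table and a subsampled one can be made concrete with a short calculation. The sketch below compares the $|\mathcal{S}_g||\mathcal{S}_l|^n|\mathcal{A}_g|$ count from the text with the number of entries needed when $k$ local agents are summarized by their empirical distribution. Counting empirical distributions by stars and bars, $\binom{k + |\mathcal{S}_l| - 1}{|\mathcal{S}_l| - 1}$, is a standard combinatorial fact used here as an assumption; it is not a formula stated in the paper, and the function names are hypothetical.

```python
from math import comb

def joint_table_size(n, num_global_states, num_local_states, num_actions):
    """|S_g| * |S_l|^n * |A_g|: one entry per joint state and global action."""
    return num_global_states * num_local_states**n * num_actions

def subsampled_table_size(k, num_global_states, num_local_states, num_actions):
    """Entries when k agents are summarized by an empirical distribution."""
    # Number of ways to split k agents among the local states (stars and bars).
    num_empirical = comb(k + num_local_states - 1, num_local_states - 1)
    return num_global_states * num_empirical * num_actions

print(joint_table_size(50, 2, 3, 2))       # exponential in n
print(subsampled_table_size(10, 2, 3, 2))  # polynomial in k
```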

Figure 4: Star graph $S_n$ (nodes labeled 0, 1, 2, 3, …, $n$).

Appendix B Proof of Lipschitz-Continuity Bound

This section proves the Lipschitz-continuity bound Theorem 4.1 between 𝑄 ^ 𝑘 ∗ and 𝑄 ∗ in Theorem B.2 and includes a framework to compare 1 ( 𝑛 𝑘 ) ⁢ ∑ Δ ∈ ( [ 𝑛 ] 𝑘 ) 𝑄 ^ 𝑘 ∗ ⁢ ( 𝑠 𝑔 , 𝐹 𝑠 Δ , 𝑎 𝑔 ) and 𝑄 ∗ ⁢ ( 𝑠 , 𝑎 𝑔 ) in Lemma B.12. The following definition will be relevant to the proof of Theorem 4.1.

Definition B.1.

[Joint Stochastic Kernels] The joint stochastic kernel on $(s_g, s_\Delta)$ for $\Delta \subseteq [n]$ where $|\Delta| = k$ is defined as $\mathcal{J}_k : \mathcal{S}_g \times \mathcal{S}_l^k \times \mathcal{S}_g \times \mathcal{A}_g \times \mathcal{S}_l^k \to [0, 1]$, where

$$\mathcal{J}_k(s_g', s'_\Delta \mid s_g, a_g, s_\Delta) := \Pr[(s_g', s'_\Delta) \mid s_g, a_g, s_\Delta] \tag{20}$$

Theorem B.2 ($\hat{Q}_k^T$ is $\left(\sum_{t=0}^{T-1} 2\gamma^t\right) \|r_l(\cdot,\cdot)\|_\infty$-Lipschitz continuous with respect to $F_{s_\Delta}$ in total variation distance).

Suppose $\Delta, \Delta' \subseteq [n]$ such that $|\Delta| = k$ and $|\Delta'| = k'$. Then:

$$|\hat{Q}_k^T(s_g, F_{s_\Delta}, a_g) - \hat{Q}_{k'}^T(s_g, F_{s_{\Delta'}}, a_g)| \le \left(\sum_{t=0}^{T-1} 2\gamma^t\right) \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}})$$
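The two quantities on the right-hand side of Theorem B.2 are easy to compute for small examples. The following sketch, with hypothetical helper names and toy state lists, builds the empirical distributions $F_{s_\Delta}$ and $F_{s_{\Delta'}}$, their total variation distance, and the Lipschitz constant $\left(\sum_{t=0}^{T-1} 2\gamma^t\right)\|r_l\|_\infty$.

```python
from collections import Counter

def empirical_distribution(states, support):
    """F_{s_Delta}: fraction of sampled agents in each local state."""
    counts = Counter(states)
    return [counts[x] / len(states) for x in support]

def tv_distance(p, q):
    """Total variation distance as half the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def lipschitz_constant(gamma, T, r_l_max):
    """(sum_{t=0}^{T-1} 2 * gamma^t) * ||r_l||_inf."""
    return sum(2 * gamma**t for t in range(T)) * r_l_max

# Hypothetical local states of two subsampled groups over support {0, 1}.
F_delta = empirical_distribution([0, 0, 1, 1], support=[0, 1])
F_delta_prime = empirical_distribution([0, 1, 1, 1], support=[0, 1])
bound = lipschitz_constant(0.5, 2, 1.0) * tv_distance(F_delta, F_delta_prime)
```

The constant is a partial geometric sum, so it never exceeds $2\|r_l\|_\infty/(1-\gamma)$ regardless of $T$.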

Proof.

We prove this inductively. Note that $\hat{Q}_k^0(\cdot,\cdot,\cdot) = \hat{Q}_{k'}^0(\cdot,\cdot,\cdot) = 0$ from the initialization step in Algorithm 1, which proves the claim for $T = 0$ since $\mathrm{TV}(\cdot,\cdot) \ge 0$. For the remainder of this proof, we adopt the shorthand $\mathbb{E}_{s_g', s'_\Delta}$ to refer to $\mathbb{E}_{s_g' \sim P_g(\cdot \mid s_g, a_g),\ s_i' \sim P_l(\cdot \mid s_i, s_g),\ \forall i \in \Delta}$.

Then, at $T = 1$:

$$
\begin{aligned}
&|\hat{Q}_k^1(s_g, F_{s_\Delta}, a_g) - \hat{Q}_{k'}^1(s_g, F_{s_{\Delta'}}, a_g)| \\
&= |\hat{\mathcal{T}}_k \hat{Q}_k^0(s_g, F_{s_\Delta}, a_g) - \hat{\mathcal{T}}_{k'} \hat{Q}_{k'}^0(s_g, F_{s_{\Delta'}}, a_g)| \\
&= \Big| r(s_g, F_{s_\Delta}, a_g) + \gamma\, \mathbb{E}_{s_g', s'_\Delta} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^0(s_g', F_{s'_\Delta}, a_g') \\
&\qquad - r(s_g, F_{s_{\Delta'}}, a_g) - \gamma\, \mathbb{E}_{s_g', s'_{\Delta'}} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_{k'}^0(s_g', F_{s'_{\Delta'}}, a_g') \Big| \\
&= |r(s_g, F_{s_\Delta}, a_g) - r(s_g, F_{s_{\Delta'}}, a_g)| \\
&= \Big| \frac{1}{k} \sum_{i \in \Delta} r_l(s_g, s_i) - \frac{1}{k'} \sum_{i \in \Delta'} r_l(s_g, s_i) \Big| \\
&= \big| \mathbb{E}_{s_l \sim F_{s_\Delta}} r_l(s_g, s_l) - \mathbb{E}_{s_l' \sim F_{s_{\Delta'}}} r_l(s_g, s_l') \big|
\end{aligned}
$$

In the first and second equalities, we use the time evolution property of $\hat{Q}_k^1$ and $\hat{Q}_{k'}^1$ by applying the adapted Bellman operators $\hat{\mathcal{T}}_k$ and $\hat{\mathcal{T}}_{k'}$ to $\hat{Q}_k^0$ and $\hat{Q}_{k'}^0$, respectively, and expanding. In the third and fourth equalities, we note that $\hat{Q}_k^0(\cdot,\cdot,\cdot) = \hat{Q}_{k'}^0(\cdot,\cdot,\cdot) = 0$, and subtract the common ‘global component’ of the reward function.

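Then, noting the general property that for any function $f : \mathcal{X} \to \mathcal{Y}$ for $|\mathcal{X}| < \infty$ we can write $f(x) = \sum_{y \in \mathcal{X}} f(y) \mathbb{1}\{y = x\}$, we have:

$$
\begin{aligned}
&|\hat{Q}_k^1(s_g, F_{s_\Delta}, a_g) - \hat{Q}_{k'}^1(s_g, F_{s_{\Delta'}}, a_g)| \\
&= \Big| \mathbb{E}_{s_l \sim F_{s_\Delta}} \Big[ \sum_{z \in \mathcal{S}_l} r_l(s_g, z) \mathbb{1}\{s_l = z\} \Big] - \mathbb{E}_{s_l' \sim F_{s_{\Delta'}}} \Big[ \sum_{z \in \mathcal{S}_l} r_l(s_g, z) \mathbb{1}\{s_l' = z\} \Big] \Big| \\
&= \Big| \sum_{z \in \mathcal{S}_l} r_l(s_g, z) \cdot \big( \mathbb{E}_{s_l \sim F_{s_\Delta}} \mathbb{1}\{s_l = z\} - \mathbb{E}_{s_l' \sim F_{s_{\Delta'}}} \mathbb{1}\{s_l' = z\} \big) \Big| \\
&= \Big| \sum_{z \in \mathcal{S}_l} r_l(s_g, z) \cdot \big( F_{s_\Delta}(z) - F_{s_{\Delta'}}(z) \big) \Big| \\
&\le \Big| \max_{z \in \mathcal{S}_l} r_l(s_g, z) \Big| \cdot \sum_{z \in \mathcal{S}_l} |F_{s_\Delta}(z) - F_{s_{\Delta'}}(z)| \\
&\le 2 \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}})
\end{aligned}
$$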

The second equality follows from the linearity of expectations, and the third equality follows by noting that for any random variable $X \sim \mathcal{X}$, $\mathbb{E}_X \mathbb{1}[X = x] = \Pr[X = x]$. Then, the first inequality follows from an application of the triangle inequality and the Cauchy-Schwarz inequality, and the second inequality follows by the definition of total variation distance. Thus, when $T = 1$, $\hat{Q}$ is $(2\|r_l(\cdot,\cdot)\|_\infty)$-Lipschitz continuous with respect to total variation distance, proving the base case.

We now assume that for 𝑇 ≤ 𝑡 ′ ∈ ℕ :

$$|\hat{Q}_k^T(s_g, F_{s_\Delta}, a_g) - \hat{Q}_{k'}^T(s_g, F_{s_{\Delta'}}, a_g)| \le \left(\sum_{t=0}^{T-1} 2\gamma^t\right) \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}})$$

Then, inductively we have:

$$
\begin{aligned}
&|\hat{Q}_k^{T+1}(s_g, F_{s_\Delta}, a_g) - \hat{Q}_{k'}^{T+1}(s_g, F_{s_{\Delta'}}, a_g)| \\
&\le \Big| \frac{1}{|\Delta|} \sum_{i \in \Delta} r_l(s_g, s_i) - \frac{1}{|\Delta'|} \sum_{i \in \Delta'} r_l(s_g, s_i) \Big| \\
&\quad + \gamma \Big| \mathbb{E}_{s_g', s'_\Delta} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^T(s_g', F_{s'_\Delta}, a_g') - \mathbb{E}_{s_g', s'_{\Delta'}} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_{k'}^T(s_g', F_{s'_{\Delta'}}, a_g') \Big| \\
&\le 2 \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}}) \\
&\quad + \gamma \Big| \mathbb{E}_{s_g', s'_\Delta} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^T(s_g', F_{s'_\Delta}, a_g') - \mathbb{E}_{s_g', s'_{\Delta'}} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_{k'}^T(s_g', F_{s'_{\Delta'}}, a_g') \Big|
\end{aligned}
$$

In the first equality, we use the time evolution property of 𝑄 ^ 𝑘 𝑇 + 1 and 𝑄 ^ 𝑘 ′ 𝑇 + 1 by applying the adapted-Bellman operators 𝒯 ^ 𝑘 and 𝒯 ^ 𝑘 ′ to 𝑄 ^ 𝑘 𝑇 and 𝑄 ^ 𝑘 ′ 𝑇 , respectively. We then expand and use the triangle inequality. In the first term of the second inequality, we use our Lipschitz bound from the base case. For the second term, we now rewrite the expectation over the states 𝑠 𝑔 ′ , 𝑠 Δ ′ , 𝑠 Δ ′ ′ into an expectation over the joint transition probabilities 𝒥 𝑘 and 𝒥 𝑘 ′ from Definition B.1.

Therefore, using the shorthand 𝔼 ( 𝑠 𝑔 ′ , 𝑠 Δ ′ ) ∼ 𝒥 𝑘 to denote 𝔼 ( 𝑠 𝑔 ′ , 𝑠 Δ ′ ) ∼ 𝒥 𝑘 ( ⋅ , ⋅ | 𝑠 𝑔 , 𝑎 𝑔 , 𝑠 Δ ) , we have:

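$$
\begin{aligned}
&|\hat{Q}_k^{T+1}(s_g, F_{s_\Delta}, a_g) - \hat{Q}_{k'}^{T+1}(s_g, F_{s_{\Delta'}}, a_g)| \\
&\le 2 \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}}) \\
&\quad + \gamma \Big| \mathbb{E}_{(s_g', s'_\Delta) \sim \mathcal{J}_k} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^T(s_g', F_{s'_\Delta}, a_g') - \mathbb{E}_{(s_g', s'_{\Delta'}) \sim \mathcal{J}_{k'}} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_{k'}^T(s_g', F_{s'_{\Delta'}}, a_g') \Big| \\
&\le 2 \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}}) \\
&\quad + \gamma \max_{a_g' \in \mathcal{A}_g} \Big| \mathbb{E}_{(s_g', s'_\Delta) \sim \mathcal{J}_k} \hat{Q}_k^T(s_g', F_{s'_\Delta}, a_g') - \mathbb{E}_{(s_g', s'_{\Delta'}) \sim \mathcal{J}_{k'}} \hat{Q}_{k'}^T(s_g', F_{s'_{\Delta'}}, a_g') \Big| \\
&\le 2 \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}}) + \gamma \left(\sum_{\tau=0}^{T-1} 2\gamma^\tau\right) \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}}) \\
&= \left(\sum_{\tau=0}^{T} 2\gamma^\tau\right) \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}})
\end{aligned}
$$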

In the first inequality, we rewrite the expectations over the states as the expectation over the joint transition probabilities. The second inequality then follows from Lemma B.9. To apply Lemma B.9, we superficially conflate the joint expectation over $(s_g, s_{\Delta \cup \Delta'})$ and reduce it back to the original form of its expectation. Finally, the third inequality follows from Lemma B.3.

Then, by induction, the claim is proven. ∎
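For completeness, the last step of the induction, which combines the base-case constant with the inductive constant, is a re-indexing of a geometric sum. The worked identity below is a standard manipulation rather than a new result, and the final inequality (valid for $\gamma \in (0,1)$) is added here only to show that the Lipschitz constant stays bounded uniformly in $T$.

```latex
2 + \gamma \sum_{\tau=0}^{T-1} 2\gamma^{\tau}
  = 2\gamma^{0} + \sum_{\tau=1}^{T} 2\gamma^{\tau}
  = \sum_{\tau=0}^{T} 2\gamma^{\tau}
  = \frac{2\,(1-\gamma^{T+1})}{1-\gamma}
  \le \frac{2}{1-\gamma}.
```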

Lemma B.3.

For all $T \in \mathbb{N}$, for any $a_g, a_g' \in \mathcal{A}_g$, $s_g \in \mathcal{S}_g$, $s_\Delta \in \mathcal{S}_l^k$, and for all joint stochastic kernels $\mathcal{J}_k$ as defined in Definition B.1, we have that $\mathbb{E}_{(s_g', s'_\Delta) \sim \mathcal{J}_k(\cdot,\cdot \mid s_g, a_g, s_\Delta)} \hat{Q}_k^T(s_g', F_{s'_\Delta}, a_g')$ is $\left(\sum_{t=0}^{T-1} 2\gamma^t\right) \|r_l(\cdot,\cdot)\|_\infty$-Lipschitz continuous with respect to $F_{s_\Delta}$ in total variation distance:

$$
\begin{aligned}
&\Big| \mathbb{E}_{(s_g', s'_\Delta) \sim \mathcal{J}_k(\cdot,\cdot \mid s_g, a_g, s_\Delta)} \hat{Q}_k^T(s_g', F_{s'_\Delta}, a_g') - \mathbb{E}_{(s_g', s'_{\Delta'}) \sim \mathcal{J}_{k'}(\cdot,\cdot \mid s_g, a_g, s_{\Delta'})} \hat{Q}_{k'}^T(s_g', F_{s'_{\Delta'}}, a_g') \Big| \\
&\le \left(\sum_{\tau=0}^{T-1} 2\gamma^\tau\right) \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}})
\end{aligned}
$$

Proof.

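We prove this inductively. At $T = 0$, the statement is true since $\hat{Q}_k^0(\cdot,\cdot,\cdot) = \hat{Q}_{k'}^0(\cdot,\cdot,\cdot) = 0$ and $\mathrm{TV}(\cdot,\cdot) \ge 0$. For $T = 1$, applying the adapted Bellman operator yields:

$$
\begin{aligned}
&\Big| \mathbb{E}_{(s_g', s'_\Delta) \sim \mathcal{J}_k(\cdot,\cdot \mid s_g, a_g, s_\Delta)} \hat{Q}_k^1(s_g', F_{s'_\Delta}, a_g') - \mathbb{E}_{(s_g', s'_{\Delta'}) \sim \mathcal{J}_{k'}(\cdot,\cdot \mid s_g, a_g, s_{\Delta'})} \hat{Q}_{k'}^1(s_g', F_{s'_{\Delta'}}, a_g') \Big| \\
&= \Big| \mathbb{E}_{(s_g', s'_{\Delta \cup \Delta'}) \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot,\cdot \mid s_g, a_g, s_{\Delta \cup \Delta'})} \Big[ \frac{1}{|\Delta|} \sum_{i \in \Delta} r_l(s_g', s_i') - \frac{1}{|\Delta'|} \sum_{i \in \Delta'} r_l(s_g', s_i') \Big] \Big| \\
&= \Big| \mathbb{E}_{(s_g', s'_{\Delta \cup \Delta'}) \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot,\cdot \mid s_g, a_g, s_{\Delta \cup \Delta'})} \Big[ \sum_{z \in \mathcal{S}_l} r_l(s_g', z) \cdot \big( F_{s'_\Delta}(z) - F_{s'_{\Delta'}}(z) \big) \Big] \Big|
\end{aligned}
$$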

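Similarly to Theorem B.2, we implicitly write the result as an expectation over the reward functions and use the general property that for any function $f : \mathcal{X} \to \mathcal{Y}$ for $|\mathcal{X}| < \infty$, we can write $f(x) = \sum_{y \in \mathcal{X}} f(y) \mathbb{1}\{y = x\}$. Then, taking the expectation over the indicator variable yields the second equality. As a shorthand, let $\mathfrak{D}$ denote the distribution of $s_g' \sim \sum_{s'_{\Delta \cup \Delta'} \in \mathcal{S}_l^{|\Delta \cup \Delta'|}} \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot, s'_{\Delta \cup \Delta'} \mid s_g, a_g, s_{\Delta \cup \Delta'})$. Then, by the law of total expectation,

$$
\begin{aligned}
&\Big| \mathbb{E}_{(s_g', s'_\Delta) \sim \mathcal{J}_k(\cdot,\cdot \mid s_g, a_g, s_\Delta)} \hat{Q}_k^1(s_g', F_{s'_\Delta}, a_g') - \mathbb{E}_{(s_g', s'_{\Delta'}) \sim \mathcal{J}_{k'}(\cdot,\cdot \mid s_g, a_g, s_{\Delta'})} \hat{Q}_{k'}^1(s_g', F_{s'_{\Delta'}}, a_g') \Big| \\
&= \Big| \mathbb{E}_{s_g' \sim \mathfrak{D}} \sum_{z \in \mathcal{S}_l} r_l(s_g', z)\, \mathbb{E}_{s'_{\Delta \cup \Delta'} \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot \mid s_g', s_g, a_g, s_{\Delta \cup \Delta'})} \big( F_{s'_\Delta}(z) - F_{s'_{\Delta'}}(z) \big) \Big| \\
&\le 2 \|r_l(\cdot,\cdot)\|_\infty \cdot \mathbb{E}_{s_g' \sim \mathfrak{D}} \mathrm{TV}\big( \mathbb{E}_{s'_{\Delta \cup \Delta'} \mid s_g'} F_{s'_\Delta},\ \mathbb{E}_{s'_{\Delta \cup \Delta'} \mid s_g'} F_{s'_{\Delta'}} \big) \\
&\le 2 \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}})
\end{aligned}
$$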

In the ensuing inequalities, we first use Jensen’s inequality and the triangle inequality to pull out 𝔼 𝑠 𝑔 ′ ⁢ ∑ 𝑧 ∈ 𝒮 𝑙 from the absolute value, and then use Cauchy-Schwarz to further factor ‖ 𝑟 𝑙 ⁢ ( ⋅ , ⋅ ) ‖ ∞ . The second inequality follows from Lemma B.5 and does not have a dependence on 𝑠 𝑔 ′ thus eliminating 𝔼 𝑠 𝑔 ′ and proving the base case.

We now assume that for 𝑇 ≤ 𝑡 ′ ∈ ℕ , for all joint stochastic kernels 𝒥 𝑘 and 𝒥 𝑘 ′ , and for all 𝑎 𝑔 ′ ∈ 𝒜 𝑔 :

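$$
\begin{aligned}
&\Big| \mathbb{E}_{(s_g', s'_\Delta) \sim \mathcal{J}_k(\cdot,\cdot \mid s_g, a_g, s_\Delta)} \hat{Q}_k^T(s_g', F_{s'_\Delta}, a_g') - \mathbb{E}_{(s_g', s'_{\Delta'}) \sim \mathcal{J}_{k'}(\cdot,\cdot \mid s_g, a_g, s_{\Delta'})} \hat{Q}_{k'}^T(s_g', F_{s'_{\Delta'}}, a_g') \Big| \\
&\le \left(\sum_{t=0}^{T-1} 2\gamma^t\right) \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}})
\end{aligned}
$$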

For the remainder of the proof, we adopt the shorthand 𝔼 ( 𝑠 𝑔 ′ , 𝑠 Δ ′ ) ∼ 𝒥 to denote 𝔼 ( 𝑠 𝑔 ′ , 𝑠 Δ ′ ) ∼ 𝒥 | Δ | ( ⋅ , ⋅ | 𝑠 𝑔 , 𝑎 𝑔 , 𝑠 Δ ) , and 𝔼 ( 𝑠 𝑔 ′′ , 𝑠 Δ ′′ ) ∼ 𝒥 to denote 𝔼 ( 𝑠 𝑔 ′′ , 𝑠 Δ ′′ ) ∼ 𝒥 | Δ | ( ⋅ , ⋅ | 𝑠 𝑔 ′ , 𝑎 𝑔 ′ , 𝑠 Δ ′ ) .

Then, inductively, we have:

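$$
\begin{aligned}
&\Big| \mathbb{E}_{(s_g', s'_\Delta) \sim \mathcal{J}} \hat{Q}_k^{T+1}(s_g', F_{s'_\Delta}, a_g') - \mathbb{E}_{(s_g', s'_{\Delta'}) \sim \mathcal{J}} \hat{Q}_{k'}^{T+1}(s_g', F_{s'_{\Delta'}}, a_g') \Big| \\
&= \Big| \mathbb{E}_{(s_g', s'_{\Delta \cup \Delta'}) \sim \mathcal{J}} \Big[ r(s_g', s'_\Delta, a_g') - r(s_g', s'_{\Delta'}, a_g') \\
&\qquad + \gamma\, \mathbb{E}_{(s_g'', s''_{\Delta \cup \Delta'}) \sim \mathcal{J}} \Big[ \max_{a_g'' \in \mathcal{A}_g} \hat{Q}_k^T(s_g'', F_{s''_\Delta}, a_g'') - \max_{a_g'' \in \mathcal{A}_g} \hat{Q}_{k'}^T(s_g'', F_{s''_{\Delta'}}, a_g'') \Big] \Big] \Big| \\
&\le 2 \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}}) \\
&\qquad + \gamma \Big| \mathbb{E}_{(s_g', s'_{\Delta \cup \Delta'}) \sim \mathcal{J}} \Big[ \mathbb{E}_{(s_g'', s''_{\Delta \cup \Delta'}) \sim \mathcal{J}} \Big[ \max_{a_g'' \in \mathcal{A}_g} \hat{Q}_k^T(s_g'', F_{s''_\Delta}, a_g'') - \max_{a_g'' \in \mathcal{A}_g} \hat{Q}_{k'}^T(s_g'', F_{s''_{\Delta'}}, a_g'') \Big] \Big] \Big|
\end{aligned}
$$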

Here, we expand out 𝑄 ^ 𝑘 𝑇 + 1 and 𝑄 ^ 𝑘 ′ 𝑇 + 1 using the adapted Bellman operator. In the ensuing inequality, we apply the triangle inequality and bound the first term using the base case. Then, note that

𝔼 ( 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ) ∼ 𝒥 ( ⋅ , ⋅ | 𝑠 𝑔 , 𝑎 𝑔 , 𝑠 Δ ∪ Δ ′ ) ⁢ 𝔼 ( 𝑠 𝑔 ′′ , 𝑠 Δ ∪ Δ ′ ′′ ) ∼ 𝒥 ( ⋅ , ⋅ | 𝑠 𝑔 ′ , 𝑎 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ) ⁢ max 𝑎 𝑔 ′′ ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 𝑇 ⁢ ( 𝑠 𝑔 ′′ , 𝐹 𝑠 Δ ′′ , 𝑎 𝑔 ′′ )

is, for some stochastic function 𝒥 | Δ ∪ Δ ′ | ′ , equal to

𝔼 ( 𝑠 𝑔 ′′ , 𝑠 Δ ∪ Δ ′ ′′ ) ∼ 𝒥 | Δ ∪ Δ ′ | ′ ( ⋅ , ⋅ | 𝑠 𝑔 , 𝑎 𝑔 , 𝑠 Δ ∪ Δ ′ ) ⁢ max 𝑎 𝑔 ′′ ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 𝑇 ⁢ ( 𝑠 𝑔 ′′ , 𝐹 𝑠 Δ ′′ , 𝑎 𝑔 ′′ ) ,

where 𝒥 ′ is implicitly a function of 𝑎 𝑔 ′ which is fixed from the beginning.

In the special case where $a_g = a_g'$, we can derive an explicit form of $\mathcal{J}'$, which we show in Lemma B.11. As a shorthand, we denote $\mathbb{E}_{(s_g'', s''_{\Delta \cup \Delta'}) \sim \mathcal{J}'_{|\Delta \cup \Delta'|}(\cdot,\cdot \mid s_g, a_g, s_{\Delta \cup \Delta'})}$ by $\mathbb{E}_{(s_g'', s''_{\Delta \cup \Delta'}) \sim \mathcal{J}'}$.

Therefore,

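$$
\begin{aligned}
&\Big| \mathbb{E}_{(s_g', s'_\Delta) \sim \mathcal{J}} \hat{Q}_k^{T+1}(s_g', F_{s'_\Delta}, a_g') - \mathbb{E}_{(s_g', s'_{\Delta'}) \sim \mathcal{J}} \hat{Q}_{k'}^{T+1}(s_g', F_{s'_{\Delta'}}, a_g') \Big| \\
&\le 2 \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}}) + \gamma \Big| \mathbb{E}_{(s_g'', s''_{\Delta \cup \Delta'}) \sim \mathcal{J}'} \max_{a_g'' \in \mathcal{A}_g} \hat{Q}_k^T(s_g'', F_{s''_\Delta}, a_g'') \\
&\qquad - \mathbb{E}_{(s_g'', s''_{\Delta \cup \Delta'}) \sim \mathcal{J}'} \max_{a_g'' \in \mathcal{A}_g} \hat{Q}_{k'}^T(s_g'', F_{s''_{\Delta'}}, a_g'') \Big| \\
&\le 2 \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}}) + \gamma \max_{a_g'' \in \mathcal{A}_g} \Big| \mathbb{E}_{(s_g'', s''_{\Delta \cup \Delta'}) \sim \mathcal{J}'} \hat{Q}_k^T(s_g'', F_{s''_\Delta}, a_g'') \\
&\qquad - \mathbb{E}_{(s_g'', s''_{\Delta \cup \Delta'}) \sim \mathcal{J}'} \hat{Q}_{k'}^T(s_g'', F_{s''_{\Delta'}}, a_g'') \Big| \\
&\le 2 \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}}) + \gamma \left(\sum_{t=0}^{T-1} 2\gamma^t\right) \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}}) \\
&= \left(\sum_{t=0}^{T} 2\gamma^t\right) \|r_l(\cdot,\cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}})
\end{aligned}
$$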

The second inequality follows from Lemma B.9 where we set the joint stochastic kernel to be 𝒥 | Δ ∪ Δ ′ | ′ . In the ensuing lines, we concentrate the expectation towards the relevant terms and use the induction assumption for the transition probability functions 𝒥 𝑘 ′ and 𝒥 𝑘 ′ ′ . This proves the lemma. ∎

Remark B.4.

Given a joint transition probability function 𝒥 | Δ ∪ Δ ′ | as defined in Definition B.1, we can recover the transition function for a single agent 𝑖 ∈ Δ ∪ Δ ′ given by 𝒥 1 using the law of total probability and the conditional independence between 𝑠 𝑖 and 𝑠 𝑔 ∪ 𝑠 [ 𝑛 ] ∖ 𝑖 in Equation 21. This characterization is crucial in Lemma B.5 and Lemma B.6.

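$$\mathcal{J}_1(\cdot \mid s_g', s_g, a_g, s_i) = \sum_{s'_{\Delta \cup \Delta' \setminus i} \in \mathcal{S}_l^{|\Delta \cup \Delta'| - 1}} \mathcal{J}_{|\Delta \cup \Delta'|}(s'_{\Delta \cup \Delta' \setminus i}, s_i' \mid s_g', s_g, a_g, s_{\Delta \cup \Delta'}) \tag{21}$$

Lemma B.5.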

Given a joint transition probability 𝒥 | Δ ∪ Δ ′ | as defined in Definition B.1,

TV ⁢ ( 𝔼 𝑠 Δ ∪ Δ ′ ′ ∼ 𝒥 | Δ ∪ Δ ′ | ( ⋅ | 𝑠 𝑔 ′ , 𝑠 𝑔 , 𝑎 𝑔 , 𝑠 Δ ∪ Δ ′ ) ⁢ 𝐹 𝑠 Δ ′ , 𝔼 𝑠 Δ ∪ Δ ′ ′ ∼ 𝒥 | Δ ∪ Δ ′ | ( ⋅ | 𝑠 𝑔 ′ , 𝑠 𝑔 , 𝑎 𝑔 , 𝑠 Δ ∪ Δ ′ ) ⁢ 𝐹 𝑠 Δ ′ ′ ) ≤ TV ⁢ ( 𝐹 𝑠 Δ , 𝐹 𝑠 Δ ′ )

Proof.

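Note that from Lemma B.6:

$$
\begin{aligned}
\mathbb{E}_{s'_{\Delta \cup \Delta'} \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot,\cdot \mid s_g', s_g, a_g, s_{\Delta \cup \Delta'})} F_{s'_\Delta}
&= \mathbb{E}_{s'_\Delta \sim \mathcal{J}_{|\Delta|}(\cdot,\cdot \mid s_g', s_g, a_g, s_\Delta)} F_{s'_\Delta} \\
&= \mathcal{J}_1(\cdot \mid s_g(t+1), s_g(t), a_g(t), \cdot)\, F_{s_\Delta}
\end{aligned}
$$

Then, by expanding the TV distance in $\ell_1$-norm:

$$
\begin{aligned}
&\mathrm{TV}\big( \mathbb{E}_{s'_{\Delta \cup \Delta'} \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot \mid s_g', s_g, a_g, s_{\Delta \cup \Delta'})} F_{s'_\Delta},\ \mathbb{E}_{s'_{\Delta \cup \Delta'} \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot \mid s_g', s_g, a_g, s_{\Delta \cup \Delta'})} F_{s'_{\Delta'}} \big) \\
&= \frac{1}{2} \big\| \mathcal{J}_1(\cdot \mid s_g(t+1), s_g(t), a_g(t), \cdot)\, F_{s_\Delta} - \mathcal{J}_1(\cdot \mid s_g(t+1), s_g(t), a_g(t), \cdot)\, F_{s_{\Delta'}} \big\|_1 \\
&\le \big\| \mathcal{J}_1(\cdot \mid s_g(t+1), s_g(t), a_g(t), \cdot) \big\|_1 \cdot \frac{1}{2} \|F_{s_\Delta} - F_{s_{\Delta'}}\|_1 \\
&\le \frac{1}{2} \|F_{s_\Delta} - F_{s_{\Delta'}}\|_1 \\
&= \mathrm{TV}(F_{s_\Delta}, F_{s_{\Delta'}})
\end{aligned}
$$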

In the first inequality, we factorize ∥ 𝒥 1 ( ⋅ | 𝑠 𝑔 ( 𝑡 + 1 ) , 𝑠 𝑔 ( 𝑡 ) , 𝑎 𝑔 ( 𝑡 ) ) ∥ 1 from the ℓ 1 -normed expression by the sub-multiplicativity of the matrix norm. Finally, since 𝒥 1 is a column-stochastic matrix, we bound its norm by 1 to recover the total variation distance between 𝐹 𝑠 Δ and 𝐹 𝑠 Δ ′ . ∎
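The key step in the proof of Lemma B.5, that applying a column-stochastic matrix cannot increase total variation distance, can be checked numerically. The sketch below uses randomly generated matrices and distributions as stand-ins for $\mathcal{J}_1$, $F_{s_\Delta}$, and $F_{s_{\Delta'}}$; the helper names and the random setup are assumptions for illustration only.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance as half the l1 distance."""
    return 0.5 * np.abs(p - q).sum()

def random_column_stochastic(rng, d):
    """Random d x d matrix whose columns each sum to 1."""
    M = rng.random((d, d))
    return M / M.sum(axis=0, keepdims=True)

def random_distribution(rng, d):
    p = rng.random(d)
    return p / p.sum()

rng = np.random.default_rng(0)
d = 5
results = []
for _ in range(1000):
    M = random_column_stochastic(rng, d)
    p, q = random_distribution(rng, d), random_distribution(rng, d)
    # Pair of (TV after applying M, TV before applying M).
    results.append((tv_distance(M @ p, M @ q), tv_distance(p, q)))
```

The induced $\ell_1$ operator norm of a matrix is its maximum absolute column sum, which equals 1 for a column-stochastic matrix; this is the fact the proof relies on.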

Lemma B.6.

Given the joint transition probability 𝒥 𝑘 from Definition B.1:

$$\mathbb{E}_{s_{\Delta \cup \Delta'}(t+1) \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot \mid s_g(t+1), s_g(t), a_g(t), s_{\Delta \cup \Delta'}(t))} F_{s_\Delta(t+1)} := \mathcal{J}_1(\cdot \mid s_g(t+1), s_g(t), a_g(t), \cdot)\, F_{s_\Delta(t)}$$

Proof.

First, observe that for all 𝑥 ∈ 𝒮 𝑙 :

$$
\begin{aligned}
&\mathbb{E}_{s_{\Delta \cup \Delta'}(t+1) \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot \mid s_g(t+1), s_g(t), a_g(t), s_{\Delta \cup \Delta'}(t))} F_{s_\Delta(t+1)}(x) \\
&= \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathbb{E}_{s_{\Delta \cup \Delta'}(t+1) \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot \mid s_g(t+1), s_g(t), a_g(t), s_{\Delta \cup \Delta'}(t))} \mathbb{1}(s_i(t+1) = x) \\
&= \frac{1}{|\Delta|} \sum_{i \in \Delta} \Pr[s_i(t+1) = x \mid s_g(t+1), s_g(t), a_g(t), s_{\Delta \cup \Delta'}(t)] \\
&= \frac{1}{|\Delta|} \sum_{i \in \Delta} \Pr[s_i(t+1) = x \mid s_g(t+1), s_g(t), a_g(t), s_i(t)] \\
&= \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathcal{J}_1(x \mid s_g(t+1), s_g(t), a_g(t), s_i(t))
\end{aligned}
$$

In the first line, we expand on the definition of $F_{s_\Delta(t+1)}(x)$. Finally, we note that $s_i(t+1)$ is conditionally independent of $s_{\Delta \cup \Delta' \setminus i}$, which yields the equality above. Then, aggregating across every entry $x \in \mathcal{S}_l$,

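$$
\begin{aligned}
\mathbb{E}_{s_{\Delta \cup \Delta'}(t+1) \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot \mid s_g(t+1), s_g(t), a_g(t), s_{\Delta \cup \Delta'}(t))} F_{s_\Delta(t+1)}
&= \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathcal{J}_1(\cdot \mid s_g(t+1), s_g(t), a_g(t), \cdot)\, \vec{\mathbb{1}}_{s_i(t)} \\
&= \mathcal{J}_1(\cdot \mid s_g(t+1), s_g(t), a_g(t), \cdot)\, F_{s_\Delta(t)}
\end{aligned}
$$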

Notably, every 𝑥 corresponds to a choice of rows in 𝒥 1 ( ⋅ | 𝑠 𝑔 ( 𝑡 + 1 ) , 𝑠 𝑔 ( 𝑡 ) , 𝑎 𝑔 ( 𝑡 ) , ⋅ ) and every choice of 𝑠 𝑖 ⁢ ( 𝑡 ) corresponds to a choice of columns in 𝒥 1 ( ⋅ | 𝑠 𝑔 ( 𝑡 + 1 ) , 𝑠 𝑔 ( 𝑡 ) , 𝑎 𝑔 ( 𝑡 ) , ⋅ ) , making 𝒥 1 ( ⋅ | 𝑠 𝑔 ( 𝑡 + 1 ) , 𝑠 𝑔 ( 𝑡 ) , 𝑎 𝑔 ( 𝑡 ) , ⋅ ) column-stochastic. This yields the claim.∎

Lemma B.7.

The total variation distance between the expected empirical distribution of 𝑠 Δ ⁢ ( 𝑡 + 1 ) and 𝑠 Δ ′ ⁢ ( 𝑡 + 1 ) is linearly bounded by the total variation distance of the empirical distributions of 𝑠 Δ ⁢ ( 𝑡 ) and 𝑠 Δ ′ ⁢ ( 𝑡 ) , for Δ , Δ ′ ⊆ [ 𝑛 ] :

$$\mathrm{TV}\Big( \mathbb{E}_{\substack{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)) \\ \forall i \in \Delta}} F_{s_\Delta(t+1)},\ \mathbb{E}_{\substack{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)) \\ \forall i \in \Delta'}} F_{s_{\Delta'}(t+1)} \Big) \le \mathrm{TV}(F_{s_\Delta(t)}, F_{s_{\Delta'}(t)})$$

Proof.

We expand the total variation distance measure in $\ell_1$-norm and utilize the result from Lemma B.10 that $\mathbb{E}_{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)),\ \forall i \in \Delta} F_{s_\Delta(t+1)} = P_l(\cdot \mid \cdot, s_g(t))\, F_{s_\Delta(t)}$ as follows:

$$
\begin{aligned}
&\mathrm{TV}\Big( \mathbb{E}_{\substack{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)) \\ \forall i \in \Delta}} F_{s_\Delta(t+1)},\ \mathbb{E}_{\substack{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)) \\ \forall i \in \Delta'}} F_{s_{\Delta'}(t+1)} \Big) \\
&= \frac{1}{2} \Big\| \mathbb{E}_{\substack{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)) \\ \forall i \in \Delta}} F_{s_\Delta(t+1)} - \mathbb{E}_{\substack{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)) \\ \forall i \in \Delta'}} F_{s_{\Delta'}(t+1)} \Big\|_1 \\
&= \frac{1}{2} \big\| P_l(\cdot \mid \cdot, s_g(t))\, F_{s_\Delta(t)} - P_l(\cdot \mid \cdot, s_g(t))\, F_{s_{\Delta'}(t)} \big\|_1 \\
&\le \| P_l(\cdot \mid \cdot, s_g(t)) \|_1 \cdot \frac{1}{2} \| F_{s_\Delta(t)} - F_{s_{\Delta'}(t)} \|_1 \\
&= \| P_l(\cdot \mid \cdot, s_g(t)) \|_1 \cdot \mathrm{TV}(F_{s_\Delta(t)}, F_{s_{\Delta'}(t)})
\end{aligned}
$$

In the last line, we recover the total variation distance from the ℓ 1 norm. Finally, by the column stochasticity of 𝑃 𝑙 ( ⋅ | ⋅ , 𝑠 𝑔 ) , we have that ∥ 𝑃 𝑙 ( ⋅ | ⋅ , 𝑠 𝑔 ) ∥ 1 ≤ 1 , which then implies

$$\mathrm{TV}\Big( \mathbb{E}_{\substack{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)) \\ \forall i \in \Delta}} F_{s_\Delta(t+1)},\ \mathbb{E}_{\substack{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)) \\ \forall i \in \Delta'}} F_{s_{\Delta'}(t+1)} \Big) \le \mathrm{TV}(F_{s_\Delta(t)}, F_{s_{\Delta'}(t)})$$

This proves the lemma.∎

Remark B.8.

Lemma B.7 can be viewed as an irreducibility and aperiodicity result on the finite-state Markov chain whose state space is given by $\mathcal{S} = \mathcal{S}_g \times \mathcal{S}_l^n$. Let $\{s_t\}_{t \in \mathbb{N}}$ denote the sequence of states visited by this Markov chain where the transitions are induced by the transition functions $P_g, P_l$. Through this, Lemma B.7 describes an ergodic behavior of the Markov chain.

Lemma B.9.

The absolute difference between the expected maximums of $\hat{Q}_k$ and $\hat{Q}_{k'}$ is at most the maximum of the absolute difference between $\hat{Q}_k$ and $\hat{Q}_{k'}$, where the expectations are taken over any joint distribution of states $\mathcal{J}$, and the maximums are taken over the actions.

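$$
\begin{aligned}
&\Big| \mathbb{E}_{(s_g', s'_{\Delta \cup \Delta'}) \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot,\cdot \mid s_g, a_g, s_{\Delta \cup \Delta'})} \Big[ \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^T(s_g', F_{s'_\Delta}, a_g') - \max_{a_g' \in \mathcal{A}_g} \hat{Q}_{k'}^T(s_g', F_{s'_{\Delta'}}, a_g') \Big] \Big| \\
&\le \max_{a_g' \in \mathcal{A}_g} \Big| \mathbb{E}_{(s_g', s'_{\Delta \cup \Delta'}) \sim \mathcal{J}_{|\Delta \cup \Delta'|}(\cdot,\cdot \mid s_g, a_g, s_{\Delta \cup \Delta'})} \Big[ \hat{Q}_k^T(s_g', F_{s'_\Delta}, a_g') - \hat{Q}_{k'}^T(s_g', F_{s'_{\Delta'}}, a_g') \Big] \Big|
\end{aligned}
$$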

Proof.

𝑎 𝑔 ∗ := arg ⁡ max 𝑎 𝑔 ′ ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ , 𝑎 𝑔 ′ ) , 𝑎 ~ 𝑔 ∗ := arg ⁡ max 𝑎 𝑔 ′ ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 ′ 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ ′ , 𝑎 𝑔 ′ )

For the remainder of this proof, we adopt the shorthand 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ to refer to 𝔼 ( 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ) ∼ 𝒥 | Δ ∪ Δ ′ | ( ⋅ , ⋅ | 𝑠 𝑔 , 𝑎 𝑔 , 𝑠 Δ ∪ Δ ′ ) .

Then, if 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ max 𝑎 𝑔 ′ ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ , 𝑎 𝑔 ′ ) − 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ max 𝑎 𝑔 ′ ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 ′ 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ ′ , 𝑎 𝑔 ′ )

0 , we have:

| 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ max 𝑎 𝑔 ′ ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ , 𝑎 𝑔 ′ ) − 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ max 𝑎 𝑔 ′ ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 ′ 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ ′ , 𝑎 𝑔 ′ ) |

𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ 𝑄 ^ 𝑘 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ , 𝑎 𝑔 ∗ ) − 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ 𝑄 ^ 𝑘 ′ 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ ′ , 𝑎 ~ 𝑔 ∗ )

≤ 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ 𝑄 ^ 𝑘 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ , 𝑎 𝑔 ∗ ) − 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ 𝑄 ^ 𝑘 ′ 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ ′ , 𝑎 𝑔 ∗ )

≤ max 𝑎 𝑔 ′ ∈ 𝒜 𝑔 ⁡ | 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ 𝑄 ^ 𝑘 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ , 𝑎 𝑔 ′ ) − 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ 𝑄 ^ 𝑘 ′ 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ ′ , 𝑎 𝑔 ′ ) |

Similarly, if 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ max 𝑎 𝑔 ′ ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ , 𝑎 𝑔 ′ ) − 𝔼 𝑠 𝑔 ′ , 𝑠 Δ ∪ Δ ′ ′ ⁢ max 𝑎 𝑔 ′ ∈ 𝒜 𝑔 ⁡ 𝑄 ^ 𝑘 ′ 𝑇 ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ ′ , 𝑎 𝑔 ′ ) < 0 , an analogous argument by replacing 𝑎 𝑔 ∗ with 𝑎 ~ 𝑔 ∗ yields an identical bound. ∎

Lemma B.10.

For all 𝑡 ∈ ℕ and Δ ⊆ [ 𝑛 ] ,

$$\mathbb{E}_{\substack{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)) \\ \forall i \in \Delta}} [F_{s_\Delta(t+1)}] = P_l(\cdot \mid \cdot, s_g(t))\, F_{s_\Delta(t)}$$

Proof.

For all 𝑥 ∈ 𝒮 𝑙 :

$$
\begin{aligned}
\mathbb{E}_{\substack{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)) \\ \forall i \in \Delta}} [F_{s_\Delta(t+1)}(x)]
&:= \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathbb{E}_{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t))} [\mathbb{1}(s_i(t+1) = x)] \\
&= \frac{1}{|\Delta|} \sum_{i \in \Delta} \Pr[s_i(t+1) = x \mid s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t))] \\
&= \frac{1}{|\Delta|} \sum_{i \in \Delta} P_l(x \mid s_i(t), s_g(t))
\end{aligned}
$$

In the first line, we are writing out the definition of $F_{s_\Delta(t+1)}(x)$ and using the conditional independence in the evolutions of $\Delta \setminus i$ and $i$. In the second line, we use the fact that for any random variable $X \in \mathcal{X}$, $\mathbb{E}_X \mathbb{1}[X = x] = \Pr[X = x]$. In line 3, we observe that the above probability can be written as an entry of the local transition matrix $P_l$. Then, aggregating across every entry $x \in \mathcal{S}_l$, we have that:

$$
\begin{aligned}
\mathbb{E}_{\substack{s_i(t+1) \sim P_l(\cdot \mid s_i(t), s_g(t)) \\ \forall i \in \Delta}} [F_{s_\Delta(t+1)}]
&= \frac{1}{|\Delta|} \sum_{i \in \Delta} P_l(\cdot \mid s_i(t), s_g(t)) \\
&= \frac{1}{|\Delta|} \sum_{i \in \Delta} P_l(\cdot \mid \cdot, s_g(t))\, \vec{\mathbb{1}}_{s_i(t)} \\
&=: P_l(\cdot \mid \cdot, s_g(t))\, F_{s_\Delta(t)}
\end{aligned}
$$

Here, $\vec{\mathbb{1}}_{s_i(t)} \in \{0, 1\}^{|\mathcal{S}_l|}$ such that $\vec{\mathbb{1}}_{s_i(t)}$ is 1 at the index corresponding to $s_i(t)$, and is 0 everywhere else. The last equality follows since $P_l(\cdot \mid \cdot, s_g(t))$ is a column-stochastic matrix which yields that $P_l(\cdot \mid \cdot, s_g(t))\, \vec{\mathbb{1}}_{s_i(t)} = P_l(\cdot \mid s_i(t), s_g(t))$, thus proving the lemma. ∎

Lemma B.11.

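For any joint transition probability function on $s_g, s_\Delta$, where $|\Delta| = k$, given by $\mathcal{J}_k : \mathcal{S}_g \times \mathcal{S}_l^{|\Delta|} \times \mathcal{S}_g \times \mathcal{A}_g \times \mathcal{S}_l^{|\Delta|} \to [0, 1]$, we have:

$$
\begin{aligned}
&\mathbb{E}_{(s_g', s'_\Delta) \sim \mathcal{J}_k(\cdot,\cdot \mid s_g, a_g, s_\Delta)} \Big[ \mathbb{E}_{(s_g'', s''_\Delta) \sim \mathcal{J}_k(\cdot,\cdot \mid s_g', a_g, s'_\Delta)} \max_{a_g'' \in \mathcal{A}_g} \hat{Q}_k^T(s_g'', F_{s''_\Delta}, a_g'') \Big] \\
&= \mathbb{E}_{(s_g'', s''_\Delta) \sim \mathcal{J}_k^2(\cdot,\cdot \mid s_g, a_g, s_\Delta)} \max_{a_g'' \in \mathcal{A}_g} \hat{Q}_k^T(s_g'', F_{s''_\Delta}, a_g'')
\end{aligned}
$$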

Proof.

We start by expanding the expectations:

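$$
\begin{aligned}
&\mathbb{E}_{(s_g', s'_\Delta) \sim \mathcal{J}_k(\cdot,\cdot \mid s_g, a_g, s_\Delta)} \Big[ \mathbb{E}_{(s_g'', s''_\Delta) \sim \mathcal{J}_k(\cdot,\cdot \mid s_g', a_g, s'_\Delta)} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^T(s_g'', F_{s''_\Delta}, a_g') \Big] \\
&= \sum_{(s_g', s'_\Delta) \in \mathcal{S}_g \times \mathcal{S}_l^{|\Delta|}} \sum_{(s_g'', s''_\Delta) \in \mathcal{S}_g \times \mathcal{S}_l^{|\Delta|}} \mathcal{J}_k[s_g', s'_\Delta, s_g, a_g, s_\Delta]\, \mathcal{J}_k[s_g'', s''_\Delta, s_g', a_g, s'_\Delta] \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^T(s_g'', F_{s''_\Delta}, a_g') \\
&= \sum_{(s_g'', s''_\Delta) \in \mathcal{S}_g \times \mathcal{S}_l^{|\Delta|}} \mathcal{J}_k^2[s_g'', s''_\Delta, s_g, a_g, s_\Delta] \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^T(s_g'', F_{s''_\Delta}, a_g') \\
&= \mathbb{E}_{(s_g'', s''_\Delta) \sim \mathcal{J}_k^2(\cdot,\cdot \mid s_g, a_g, s_\Delta)} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^T(s_g'', F_{s''_\Delta}, a_g')
\end{aligned}
$$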

The right-stochasticity of $\mathcal{J}_k$ implies the right-stochasticity of $\mathcal{J}_k^2$. Further, observe that $\mathcal{J}_k[s_g', s'_\Delta, s_g, a_g, s_\Delta]\, \mathcal{J}_k[s_g'', s''_\Delta, s_g', a_g, s'_\Delta]$ denotes the probability of the transitions $(s_g, s_\Delta) \to (s_g', s'_\Delta) \to (s_g'', s''_\Delta)$ with action $a_g$ at each step, where the joint state evolution is governed by $\mathcal{J}_k$. Thus, $\sum_{(s_g', s'_\Delta) \in \mathcal{S}_g \times \mathcal{S}_l^{|\Delta|}} \mathcal{J}_k[s_g', s'_\Delta, s_g, a_g, s_\Delta]\, \mathcal{J}_k[s_g'', s''_\Delta, s_g', a_g, s'_\Delta]$ is the stochastic probability function corresponding to the two-step evolution of the joint states from $(s_g, s_\Delta)$ to $(s_g'', s''_\Delta)$ under the action $a_g$, which is equivalent to $\mathcal{J}_k^2[s_g'', s''_\Delta, s_g, a_g, s_\Delta]$. In the third equality, we recover the definition of the expectation, where the joint probabilities are taken over $\mathcal{J}_k^2$. ∎
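The identity in Lemma B.11 is the Chapman-Kolmogorov relation for a fixed action, and it can be illustrated with a small matrix computation. In the sketch below, the joint state $(s_g, s_\Delta)$ is flattened into a single index, `J` is a hypothetical row-stochastic matrix playing the role of $\mathcal{J}_k$ under a fixed $a_g$, and `v` stands in for $\max_{a_g'} \hat{Q}_k^T$; none of these arrays come from the paper.

```python
import numpy as np

def two_step_expectation(J, v):
    """Nested expectation E_{x' ~ J(.|x)} E_{x'' ~ J(.|x')} v(x'') for each x."""
    return J @ (J @ v)

def squared_kernel_expectation(J, v):
    """Single expectation E_{x'' ~ J^2(.|x)} v(x'') for each x."""
    return np.linalg.matrix_power(J, 2) @ v

rng = np.random.default_rng(0)
J = rng.random((5, 5))
J /= J.sum(axis=1, keepdims=True)  # row-stochastic: each row is a distribution
v = rng.random(5)

nested = two_step_expectation(J, v)
squared = squared_kernel_expectation(J, v)
```

The rows of `J @ J` again sum to 1, which mirrors the statement that right-stochasticity of $\mathcal{J}_k$ carries over to $\mathcal{J}_k^2$.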

The following lemma bounds the average difference between 𝑄 ^ 𝑘 𝑇 (across every choice of Δ ∈ ( [ 𝑛 ] 𝑘 ) ) and 𝑄 ∗ and shows that the difference decays to 0 as 𝑇 → ∞ .

Lemma B.12.

For all 𝑠 ∈ 𝒮 𝑔 × 𝒮 [ 𝑛 ] , and for all 𝑎 𝑔 ∈ 𝒜 𝑔 , we have:

𝑄 ∗ ⁢ ( 𝑠 , 𝑎 𝑔 ) − 1 ( 𝑛 𝑘 ) ⁢ ∑ Δ ∈ ( [ 𝑛 ] 𝑘 ) 𝑄 ^ 𝑘 𝑇 ⁢ ( 𝑠 𝑔 , 𝐹 𝑠 Δ , 𝑎 𝑔 ) ≤ 𝛾 𝑇 ⁢ 𝑟 ~ 1 − 𝛾

Proof.

We bound the differences between 𝑄 ^ 𝑘 𝑇 at each Bellman iteration of our approximation to 𝑄 ∗ .

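$$
\begin{aligned}
&Q^*(s, a_g) - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \hat{Q}_k^T(s_g, F_{s_\Delta}, a_g) \\
&= \mathcal{T} Q^*(s, a_g) - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \hat{\mathcal{T}}_k \hat{Q}_k^{T-1}(s_g, F_{s_\Delta}, a_g) \\
&= r_{[n]}(s_g, s_{[n]}, a_g) + \gamma\, \mathbb{E}_{\substack{s_g' \sim P_g(\cdot \mid s_g, a_g), \\ s_i' \sim P_l(\cdot \mid s_i, s_g),\ \forall i \in [n]}} \max_{a_g' \in \mathcal{A}_g} Q^*(s', a_g') \\
&\quad - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \Big[ r_{[\Delta]}(s_g, s_\Delta, a_g) + \gamma\, \mathbb{E}_{\substack{s_g' \sim P_g(\cdot \mid s_g, a_g), \\ s_i' \sim P_l(\cdot \mid s_i, s_g),\ \forall i \in \Delta}} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^{T-1}(s_g', F_{s'_\Delta}, a_g') \Big]
\end{aligned}
$$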

Next, observe that $r_{[n]}(s_g, s_{[n]}, a_g) = \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} r_{[\Delta]}(s_g, s_\Delta, a_g)$. To prove this, we write:

$$
\begin{aligned}
\frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} r_{[\Delta]}(s_g, s_\Delta, a_g)
&= \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \Big( r_g(s_g, a_g) + \frac{1}{k} \sum_{i \in \Delta} r_l(s_i, s_g) \Big) \\
&= r_g(s_g, a_g) + \frac{\binom{n-1}{k-1}}{k \binom{n}{k}} \sum_{i \in [n]} r_l(s_i, s_g) \\
&= r_g(s_g, a_g) + \frac{1}{n} \sum_{i \in [n]} r_l(s_i, s_g) := r_{[n]}(s_g, s_{[n]}, a_g)
\end{aligned}
$$

In the second equality, we reparameterized the sum to count the number of times each 𝑟 𝑙 ⁢ ( 𝑠 𝑖 , 𝑠 𝑔 ) was added for each 𝑖 ∈ Δ , and in the last equality, we expanded and simplified the binomial coefficients. Therefore:
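The counting argument above can be verified by brute force for small $n$. The sketch below enumerates all $k$-subsets, averages the subsampled local reward term, and compares it with the full average. The reward values and the function name are hypothetical, and the global reward $r_g$ is omitted because it is identical on both sides.

```python
from itertools import combinations
from math import comb, isclose

def average_subsampled_local_reward(local_rewards, k):
    """Average over all k-subsets Delta of (1/k) * sum_{i in Delta} r_l(s_i, s_g)."""
    n = len(local_rewards)
    total = sum(
        sum(local_rewards[i] for i in delta) / k
        for delta in combinations(range(n), k)
    )
    return total / comb(n, k)

# Hypothetical values of r_l(s_i, s_g) for n = 5 local agents.
local_rewards = [0.3, 1.2, -0.5, 2.0, 0.7]
full_average = sum(local_rewards) / len(local_rewards)
subsampled_averages = [
    average_subsampled_local_reward(local_rewards, k) for k in range(1, 6)
]
```

Each agent appears in exactly $\binom{n-1}{k-1}$ of the $\binom{n}{k}$ subsets, and $\binom{n-1}{k-1} / \left(k \binom{n}{k}\right) = 1/n$, which is the simplification used in the last equality.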

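$$
\begin{aligned}
&\sup_{(s, a_g) \in \mathcal{S} \times \mathcal{A}_g} \Big[ Q^*(s, a_g) - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \hat{Q}_k^T(s_g, F_{s_\Delta}, a_g) \Big] \\
&= \sup_{(s, a_g) \in \mathcal{S} \times \mathcal{A}_g} \Big[ \mathcal{T} Q^*(s, a_g) - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \hat{\mathcal{T}}_k \hat{Q}_k^{T-1}(s_g, F_{s_\Delta}, a_g) \Big] \\
&= \gamma \sup_{(s, a_g) \in \mathcal{S} \times \mathcal{A}_g} \Big[ \mathbb{E}_{\substack{s_g' \sim P_g(\cdot \mid s_g, a_g) \\ s_i' \sim P_l(\cdot \mid s_i, s_g) \\ \forall i \in [n]}} \max_{a_g' \in \mathcal{A}_g} Q^*(s', a_g') - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \mathbb{E}_{\substack{s_g' \sim P_g(\cdot \mid s_g, a_g) \\ s_i' \sim P_l(\cdot \mid s_i, s_g) \\ \forall i \in \Delta}} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^{T-1}(s_g', F_{s'_\Delta}, a_g') \Big] \\
&= \gamma \sup_{(s, a_g) \in \mathcal{S} \times \mathcal{A}_g} \mathbb{E}_{\substack{s_g' \sim P_g(\cdot \mid s_g, a_g), \\ s_i' \sim P_l(\cdot \mid s_i, s_g),\ \forall i \in [n]}} \Big[ \max_{a_g' \in \mathcal{A}_g} Q^*(s', a_g') - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \max_{a_g' \in \mathcal{A}_g} \hat{Q}_k^{T-1}(s_g', F_{s'_\Delta}, a_g') \Big] \\
&\le \gamma \sup_{(s, a_g) \in \mathcal{S} \times \mathcal{A}_g} \mathbb{E}_{\substack{s_g' \sim P_g(\cdot \mid s_g, a_g), \\ s_i' \sim P_l(\cdot \mid s_i, s_g),\ \forall i \in [n]}} \max_{a_g' \in \mathcal{A}_g} \Big[ Q^*(s', a_g') - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \hat{Q}_k^{T-1}(s_g', F_{s'_\Delta}, a_g') \Big] \\
&\le \gamma \sup_{(s', a_g') \in \mathcal{S} \times \mathcal{A}_g} \Big[ Q^*(s', a_g') - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \hat{Q}_k^{T-1}(s_g', F_{s'_\Delta}, a_g') \Big]
\end{aligned}
$$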

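We justify the first inequality by noting the following general property of positive vectors $v, v'$ with $v \succeq v'$, which follows from the triangle inequality:

$$
\begin{aligned}
\Big\| v - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} v' \Big\|_\infty
&\ge \Big| \|v\|_\infty - \Big\| \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} v' \Big\|_\infty \Big| \\
&= \|v\|_\infty - \Big\| \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} v' \Big\|_\infty \\
&\ge \|v\|_\infty - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \|v'\|_\infty
\end{aligned}
$$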

Therefore:

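$$
\begin{aligned}
&Q^*(s, a_g) - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \hat{Q}_k^T(s_g, F_{s_\Delta}, a_g) \\
&\le \gamma^T \sup_{(s', a_g') \in \mathcal{S} \times \mathcal{A}_g} \Big[ Q^*(s', a_g') - \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \hat{Q}_k^0(s_g', F_{s'_\Delta}, a_g') \Big] \\
&= \gamma^T \frac{\tilde{r}}{1 - \gamma}
\end{aligned}
$$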

The first inequality follows from the 𝛾 -contraction property of the update procedure, and the ensuing equality follows from our bound on the maximum possible value of 𝑄 from Lemma A.7 and noting that 𝑄 ^ 𝑘 0 := 0 . Therefore, as 𝑇 → ∞ , 𝑄 ∗ ⁢ ( 𝑠 , 𝑎 𝑔 ) − 1 ( 𝑛 𝑘 ) ⁢ ∑ Δ ∈ ( [ 𝑛 ] 𝑘 ) 𝑄 ^ 𝑇 ⁢ ( 𝑠 𝑔 , 𝐹 𝑠 Δ , 𝑎 𝑔 ) → 0 , which proves the lemma.∎

Appendix C Bounding Total Variation Distance

As | Δ | → 𝑛 , the total variation (TV) distance between the empirical distribution of 𝑠 [ 𝑛 ] and 𝑠 Δ goes to 0 . We formalize this notion and prove this statement by obtaining tight bounds on the difference and showing that this error decays quickly.

Remark C.1.

First, observe that if $\Delta$ is an independent random variable uniformly supported on $\binom{[n]}{k}$, then $s_\Delta$ is also an independent random variable uniformly supported on the global state $\binom{s_{[n]}}{k}$. To see this, let $\psi_1 : [n] \to \mathcal{S}_l$ where $\psi_1(i) = s_i$. This naturally extends to $\psi_k : [n]^k \to \mathcal{S}_l^k$ given by $\psi_k(i_1, \dots, i_k) = (s_{i_1}, \dots, s_{i_k})$, for all $k \in [n]$. Then, the independence of $\Delta$ implies the independence of the generated $\sigma$-algebra. Further, $\psi_k$ (which is a Lebesgue measurable function of a $\sigma$-algebra) is a sub-algebra, implying that $s_\Delta$ must also be an independent random variable.

For reference, we present the multidimensional Dvoretzky-Kiefer-Wolfowitz (DKW) inequality Dvoretzky et al. (1956); Massart (1990); Naaman (2021), which bounds the difference between the empirical distribution functions of $s_\Delta$ and $s_{[n]}$ when each element of $\Delta$, for $|\Delta| = k$, is sampled uniformly at random from $[n]$ with replacement.

Theorem C.2 (Dvoretzky-Kiefer-Wolfowitz (DKW) inequality Dvoretzky et al. (1956)).

Assume that $\mathcal{S}_l \subset \mathbb{R}^d$. Then, by the multi-dimensional version of the DKW inequality Naaman (2021), for any $\epsilon > 0$, the following statement holds when $\Delta \subseteq [n]$ is sampled uniformly with replacement:

$$\Pr\left[ \sup_{x \in \mathcal{S}_l} \left| \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathbb{1}\{s_i = x\} - \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{s_i = x\} \right| < \epsilon \right] \ge 1 - d(n+1) e^{-2|\Delta|\epsilon^2}$$

We give an analogous bound for the case when Δ is sampled uniformly from [ 𝑛 ] without replacement. However, our bound does not have a dependency on 𝑑 , the dimension of 𝒮 𝑙 which allows us to consider non-numerical state-spaces.

Before giving the proof, we add a remark on this problem. Intuitively, when samples are chosen without replacement from a finite population, the marginal distribution, when conditioned on the random variable chosen, takes the running empirical distribution closer to the true distribution with high probability. However, we need a uniform probabilistic bound on the error that adapts to worst-case marginal distributions and decays with 𝑘 .

Recall the landmark results of Hoeffding and Serfling in Hoeffding (1963) and Serfling (1974) which we restate below.

Lemma C.3 (Lemma 4, Hoeffding).

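Given a finite population, for any convex and continuous function $f : \mathbb{R} \to \mathbb{R}$, if $X = \{x_1, \dots, x_k\}$ denotes a sample with replacement and $Y = \{y_1, \dots, y_k\}$ denotes a sample without replacement, then the sum of the sample drawn without replacement is dominated in the convex order:

$$\mathbb{E} f\left( \sum_{i \in Y} i \right) \le \mathbb{E} f\left( \sum_{i \in X} i \right)$$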

Lemma C.4 (Corollary 1.1, Serfling).

Suppose the finite set $\mathcal{X} \subset \mathbb{R}$ with $|\mathcal{X}| = n$ is contained in $[a, b]$. Let $X = (x_1, \dots, x_k)$ be a random sample of $\mathcal{X}$ of size $k$ chosen uniformly and without replacement. Denote $\mu := \frac{1}{n} \sum_{i=1}^{n} x_i$. Then:

$$\Pr\left[ \left| \frac{1}{k} \sum_{i=1}^{k} x_i - \mu \right| > \epsilon \right] < 2 \exp\left( -\frac{2 k \epsilon^2}{(b-a)^2 \left( 1 - \frac{k-1}{n} \right)} \right)$$

We now present a sampling without replacement analog of the DKW inequality.

Theorem C.5 (Sampling without replacement analogue of the DKW inequality).

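Consider a finite population $\mathcal{X} = (x_1, \dots, x_n) \in \mathcal{S}_l^n$. Let $\Delta \subseteq [n]$ be a random sample of size $k$ chosen uniformly and without replacement.

Then, for all $x \in \mathcal{S}_l$:

$$\Pr\left[ \sup_{x \in \mathcal{S}_l} \left| \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathbb{1}\{x_i = x\} - \frac{1}{n} \sum_{i \in [n]} \mathbb{1}\{x_i = x\} \right| < \epsilon \right] \ge 1 - 2 |\mathcal{S}_l| \exp\left( -\frac{2 |\Delta| n \epsilon^2}{n - |\Delta| + 1} \right)$$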

Proof.

For each $x \in \mathcal{S}_l$, define the “$x$-surrogate population” of indicator variables as

$$\bar{\mathcal{X}}_x = \left( \mathbb{1}\{x_1 = x\}, \dots, \mathbb{1}\{x_n = x\} \right) \in \{0, 1\}^n \tag{22}$$

Since the maximal difference between any two elements in this surrogate population is $1$, we set $b - a = 1$ in Lemma C.4 when applied to $\bar{\mathcal{X}}_x$ to get:

$$\Pr\left[ \left| \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathbb{1}\{x_i = x\} - \frac{1}{n} \sum_{i \in [n]} \mathbb{1}\{x_i = x\} \right| < \epsilon \right] \ge 1 - 2 \exp\left( -\frac{2 |\Delta| n \epsilon^2}{n - |\Delta| + 1} \right)$$

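In the above equation, the probability is over $\Delta \in \binom{[n]}{k}$, and the bound holds for each $x \in \mathcal{S}_l$. Therefore, the randomness is only over $\Delta$. Then, by a union bound over $x \in \mathcal{S}_l$, we have:

$$\begin{aligned}
\Pr\left[ \sup_{x \in \mathcal{S}_l} \left| \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathbb{1}\{x_i = x\} - \frac{1}{n} \sum_{i \in [n]} \mathbb{1}\{x_i = x\} \right| < \epsilon \right]
&= \Pr\left[ \bigcap_{x \in \mathcal{S}_l} \left\{ \left| \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathbb{1}\{x_i = x\} - \frac{1}{n} \sum_{i \in [n]} \mathbb{1}\{x_i = x\} \right| < \epsilon \right\} \right] \\
&\ge 1 - \sum_{x \in \mathcal{S}_l} \Pr\left[ \left| \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathbb{1}\{x_i = x\} - \frac{1}{n} \sum_{i \in [n]} \mathbb{1}\{x_i = x\} \right| \ge \epsilon \right] \\
&\ge 1 - 2 |\mathcal{S}_l| \exp\left( -\frac{2 |\Delta| n \epsilon^2}{n - |\Delta| + 1} \right)
\end{aligned}$$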

This proves the claim.∎
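
As an informal sanity check of Theorem C.5, the sketch below draws uniform size-$k$ subsets without replacement from a synthetic finite population and compares the observed frequency of the event $\sup_x |F_{s_\Delta}(x) - F_{s_{[n]}}(x)| < \epsilon$ with the stated lower bound. The population, the parameter values, and all function names are illustrative assumptions, not part of the original analysis.

```python
import math
import random
from collections import Counter


def sup_deviation(population, sample):
    """sup over states x of |empirical freq in sample - empirical freq in population|."""
    n, k = len(population), len(sample)
    pop_counts, samp_counts = Counter(population), Counter(sample)
    # Every sampled state also appears in the population, so these keys suffice.
    return max(abs(samp_counts[x] / k - pop_counts[x] / n) for x in pop_counts)


def dkw_without_replacement_bound(n, k, num_states, eps):
    """Lower bound on Pr[sup deviation < eps] as stated in Theorem C.5."""
    return 1.0 - 2.0 * num_states * math.exp(-2.0 * k * n * eps**2 / (n - k + 1))


def empirical_coverage(population, k, eps, trials, rng):
    """Fraction of uniform size-k subsets whose sup deviation is below eps."""
    hits = 0
    for _ in range(trials):
        sample = rng.sample(population, k)  # uniform, without replacement
        hits += sup_deviation(population, sample) < eps
    return hits / trials


rng = random.Random(0)
num_states, n, k, eps = 4, 200, 120, 0.1  # assumed toy parameters
population = [rng.randrange(num_states) for _ in range(n)]
bound = dkw_without_replacement_bound(n, k, num_states, eps)
coverage = empirical_coverage(population, k, eps, trials=2000, rng=rng)
print(f"Theorem C.5 lower bound: {bound:.4f}, empirical frequency: {coverage:.4f}")
```

The empirical frequency is expected to sit well above the bound, since both Serfling's inequality and the union bound over $\mathcal{S}_l$ are conservative.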

Then, combining the Lipschitz continuity bound from Theorem 4.1 and the total variation distance bound from Theorem 4.2 yields Theorem C.6.

Theorem C.6.

For all $s_g \in \mathcal{S}_g$, $(s_1, \dots, s_n) \in \mathcal{S}_l^n$, and $a_g \in \mathcal{A}_g$, we have that with probability at least $1 - \delta$:

$$\left| \hat{Q}_k^T(s_g, F_{s_\Delta}, a_g) - \hat{Q}_n^T(s_g, F_{s_{[n]}}, a_g) \right| \le \frac{2 \|r_l(\cdot, \cdot)\|_\infty}{1 - \gamma} \sqrt{\frac{n - |\Delta| + 1}{8 n |\Delta|} \ln\left( 2 |\mathcal{S}_l| / \delta \right)}$$

Proof.

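By the definition of total variation distance, observe that

$$\mathrm{TV}(F_{s_\Delta}, F_{s_{[n]}}) \le \epsilon \iff \sup_{x \in \mathcal{S}_l} \left| F_{s_\Delta} - F_{s_{[n]}} \right| < 2\epsilon \tag{23}$$

Then, let $\mathcal{X} = \mathcal{S}_l$ be the finite population in Theorem C.5 and recall the Lipschitz-continuity of $\hat{Q}_k^T$ from Theorem B.2:

$$\begin{aligned}
\left| \hat{Q}_k^T(s_g, F_{s_\Delta}, a_g) - \hat{Q}_n^T(s_g, F_{s_{[n]}}, a_g) \right|
&\le \left( \sum_{t=0}^{T-1} 2\gamma^t \right) \|r_l(\cdot, \cdot)\|_\infty \cdot \mathrm{TV}(F_{s_\Delta}, F_{s_{[n]}}) \\
&\le \frac{2}{1 - \gamma} \|r_l(\cdot, \cdot)\|_\infty \cdot \epsilon
\end{aligned}$$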

By setting the error parameter in Theorem C.5 to $2\epsilon$, we find that the event in Equation 23 occurs with probability at least $1 - 2 |\mathcal{S}_l| e^{-8 |\Delta| n \epsilon^2 / (n - |\Delta| + 1)}$. Hence:

$$\Pr\left[ \left| \hat{Q}_k^T(s_g, F_{s_\Delta}, a_g) - \hat{Q}_n^T(s_g, F_{s_{[n]}}, a_g) \right| \le \frac{2\epsilon}{1 - \gamma} \|r_l(\cdot, \cdot)\|_\infty \right] \ge 1 - 2 |\mathcal{S}_l| \exp\left( -\frac{8 n |\Delta| \epsilon^2}{n - |\Delta| + 1} \right)$$

Finally, we set the probability to $1 - \delta$ and solve for $\epsilon$, which yields

$$\epsilon = \sqrt{\frac{n - |\Delta| + 1}{8 n |\Delta|} \ln\left( 2 |\mathcal{S}_l| / \delta \right)}.$$

This proves the theorem.∎
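
The last step of this proof is a simple inversion, and it can be checked numerically. The sketch below (function names and the sample values of $n$, $k$, $|\mathcal{S}_l|$, $\delta$ are assumptions made for illustration) computes $\epsilon$ from $\delta$, plugs it back into the failure probability $2|\mathcal{S}_l| e^{-8 n |\Delta| \epsilon^2 / (n - |\Delta| + 1)}$, and evaluates the resulting bound on the $Q$-gap from Theorem C.6.

```python
import math


def tv_epsilon(n, k, num_states, delta):
    """epsilon solving 2|S_l| exp(-8 n k eps^2 / (n - k + 1)) = delta."""
    return math.sqrt((n - k + 1) / (8.0 * n * k) * math.log(2.0 * num_states / delta))


def failure_probability(n, k, num_states, eps):
    """Failure probability of the TV event in Equation 23 at error level eps."""
    return 2.0 * num_states * math.exp(-8.0 * n * k * eps**2 / (n - k + 1))


def q_gap_bound(n, k, num_states, delta, r_l_max, gamma):
    """Right-hand side of Theorem C.6, with r_l_max standing in for ||r_l||_inf."""
    return 2.0 * r_l_max / (1.0 - gamma) * tv_epsilon(n, k, num_states, delta)


# Assumed toy values: the bound shrinks as more local agents are subsampled.
for k in (5, 20, 80):
    print(k, q_gap_bound(n=100, k=k, num_states=5, delta=0.05, r_l_max=1.0, gamma=0.9))
```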

The following lemma is not used in the main result; however, we include it to demonstrate why popular TV-distance bounding methods using the Kullback-Leibler (KL) divergence and the Bretagnolle-Huber inequality Tsybakov (2008) only yield results with a suboptimal subtractive decay of $|\Delta| / n$. In comparison, Theorem 4.2 achieves a stronger multiplicative decay of $1 / \sqrt{|\Delta|}$.

Lemma C.7.

$$\mathrm{TV}(F_{s_\Delta}, F_{s_{[n]}}) \le \sqrt{1 - |\Delta| / n}$$

Proof.

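By the symmetry of the total variation distance, we have $\mathrm{TV}(F_{s_{[n]}}, F_{s_\Delta}) = \mathrm{TV}(F_{s_\Delta}, F_{s_{[n]}})$.

From the Bretagnolle-Huber inequality Tsybakov (2008) we have that $\mathrm{TV}(f, g) \le \sqrt{1 - e^{-D_{\mathrm{KL}}(f \| g)}}$. Here, $D_{\mathrm{KL}}(f \| g)$ is the Kullback-Leibler (KL) divergence between probability distributions $f$ and $g$ over the sample space, which we denote by $\mathcal{X}$, and is given by

$$D_{\mathrm{KL}}(f \| g) := \sum_{x \in \mathcal{X}} f(x) \ln \frac{f(x)}{g(x)} \tag{24}$$

Thus, from Equation 24:

$$\begin{aligned}
D_{\mathrm{KL}}(F_{s_\Delta} \| F_{s_{[n]}})
&= \sum_{x \in \mathcal{S}_l} \left( \frac{1}{|\Delta|} \sum_{i \in \Delta} \mathbb{1}\{s_i = x\} \right) \ln \frac{n \sum_{i \in \Delta} \mathbb{1}\{s_i = x\}}{|\Delta| \sum_{i \in [n]} \mathbb{1}\{s_i = x\}} \\
&= \frac{1}{|\Delta|} \sum_{x \in \mathcal{S}_l} \left( \sum_{i \in \Delta} \mathbb{1}\{s_i = x\} \right) \ln \frac{n}{|\Delta|} + \frac{1}{|\Delta|} \sum_{x \in \mathcal{S}_l} \left( \sum_{i \in \Delta} \mathbb{1}\{s_i = x\} \right) \ln \frac{\sum_{i \in \Delta} \mathbb{1}\{s_i = x\}}{\sum_{i \in [n]} \mathbb{1}\{s_i = x\}} \\
&= \ln \frac{n}{|\Delta|} + \frac{1}{|\Delta|} \sum_{x \in \mathcal{S}_l} \left( \sum_{i \in \Delta} \mathbb{1}\{s_i = x\} \right) \ln \frac{\sum_{i \in \Delta} \mathbb{1}\{s_i = x\}}{\sum_{i \in [n]} \mathbb{1}\{s_i = x\}} \\
&\le \ln(n / |\Delta|)
\end{aligned}$$

In the third line, we note that $\sum_{x \in \mathcal{S}_l} \sum_{i \in \Delta} \mathbb{1}\{s_i = x\} = |\Delta|$, since each local agent contained in $\Delta$ must have some state contained in $\mathcal{S}_l$. In the last line, we note that $\sum_{i \in \Delta} \mathbb{1}\{s_i = x\} \le \sum_{i \in [n]} \mathbb{1}\{s_i = x\}$ for all $x \in \mathcal{S}_l$, and thus the summation of logarithmic terms in the third line is nonpositive. Finally, using this bound in the Bretagnolle-Huber inequality gives $e^{-D_{\mathrm{KL}}} \ge |\Delta| / n$ and hence the lemma.∎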
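
The two inequalities in this proof can be checked on synthetic data. The sketch below is an illustration under assumed inputs (a random population over a small state set; all helper names are hypothetical): it computes the empirical distributions of a subsample and of the full population, and returns the total variation distance and KL divergence alongside the bounds $\sqrt{1 - |\Delta|/n}$ and $\ln(n/|\Delta|)$. Total variation is taken here as half the $\ell_1$ distance, which is the convention under which the Bretagnolle-Huber inequality is usually stated.

```python
import math
import random
from collections import Counter


def empirical_pmf(states, support):
    """Empirical distribution of `states` over the ordered `support`."""
    counts = Counter(states)
    return [counts[x] / len(states) for x in support]


def total_variation(p, q):
    """TV distance as half the l1 distance between two pmfs."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q); terms with p(x) = 0 contribute 0 by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def lemma_c7_quantities(population, k, rng):
    """Return (TV, KL, ln(n/k), sqrt(1 - k/n)) for one uniform size-k subsample."""
    n = len(population)
    support = sorted(set(population))
    delta = rng.sample(range(n), k)  # indices chosen without replacement
    p = empirical_pmf([population[i] for i in delta], support)
    q = empirical_pmf(population, support)
    return total_variation(p, q), kl_divergence(p, q), math.log(n / k), math.sqrt(1.0 - k / n)


rng = random.Random(1)
population = [rng.randrange(5) for _ in range(60)]
for k in (5, 30, 60):
    tv, kl, kl_bound, tv_bound = lemma_c7_quantities(population, k, rng)
    print(f"k={k}: TV={tv:.3f} (bound {tv_bound:.3f}), KL={kl:.3f} (bound {kl_bound:.3f})")
```

At $k = n$ both bounds collapse to $0$, which matches the endpoint behavior this lemma is meant to illustrate.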

Appendix D Using the Performance Difference Lemma to Bound the Optimality Gap

Recall from Definition A.13 that the fixed-point of the empirical adapted Bellman operator 𝒯 ^ 𝑘 , 𝑚 is 𝑄 ^ 𝑘 , 𝑚 est . Further, recall from Lemma 3.3 that ‖ 𝑄 ^ 𝑘 ∗ − 𝑄 ^ 𝑘 , 𝑚 est ‖ ∞ ≤ 𝜖 𝑘 , 𝑚 .

Lemma D.1.

Fix $s \in \mathcal{S} := \mathcal{S}_g \times \mathcal{S}_l^n$. Suppose we are given a $T$-length sequence of i.i.d. random variables $\Delta_1, \dots, \Delta_T$, distributed uniformly over the support $\binom{[n]}{k}$. Further, suppose we are given a fixed sequence $\delta_1, \dots, \delta_T \in (0, 1)$. Then, for each action $a_g \in \mathcal{A}_g$ and for $i \in [T]$, define events $B_i^{a_g}$ such that:

$$B_i^{a_g} := \left\{ \left| Q^*(s_g, s_{[n]}, a_g) - \hat{Q}_{k,m}^{\mathrm{est}}(s_g, F_{s_{\Delta_i}}, a_g) \right| > \sqrt{\frac{n - k + 1}{8 k n} \ln \frac{2 |\mathcal{S}_l|}{\delta_i}} \cdot \frac{2}{1 - \gamma} \|r_l(\cdot, \cdot)\|_\infty + \epsilon_{k,m} \right\}$$

Next, for $i \in [T]$, we define “bad events” $B_i$ such that $B_i = \bigcup_{a_g \in \mathcal{A}_g} B_i^{a_g}$, and we denote $B = \bigcup_{i=1}^{T} B_i$. Then, the probability that no “bad event” occurs is:

$$\Pr[\bar{B}] \ge 1 - |\mathcal{A}_g| \sum_{i=1}^{T} \delta_i$$

Proof.

$$\begin{aligned}
\left| Q^*(s_g, s_{[n]}, a_g) - \hat{Q}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}, a_g) \right|
&\le \left| Q^*(s_g, s_{[n]}, a_g) - \hat{Q}_k^*(s_g, F_{s_\Delta}, a_g) \right| + \left| \hat{Q}_k^*(s_g, F_{s_\Delta}, a_g) - \hat{Q}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}, a_g) \right| \\
&\le \left| Q^*(s_g, s_{[n]}, a_g) - \hat{Q}_k^*(s_g, F_{s_\Delta}, a_g) \right| + \epsilon_{k,m}
\end{aligned}$$

The first inequality above follows from the triangle inequality, and the second inequality uses $|\hat{Q}_k^*(s_g, F_{s_\Delta}, a_g) - \hat{Q}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}, a_g)| \le \|\hat{Q}_k^* - \hat{Q}_{k,m}^{\mathrm{est}}\|_\infty \le \epsilon_{k,m}$, where $\epsilon_{k,m}$ is defined in Lemma 3.3. Then, from Theorem C.6, we have that with probability at least $1 - \delta_i$,

$$\left| Q^*(s_g, s_{[n]}, a_g) - \hat{Q}_k^*(s_g, F_{s_\Delta}, a_g) \right| \le \sqrt{\frac{n - k + 1}{8 n k} \ln \frac{2 |\mathcal{S}_l|}{\delta_i}} \cdot \frac{2}{1 - \gamma} \|r_l(\cdot, \cdot)\|_\infty$$

So, for each $a_g \in \mathcal{A}_g$, event $B_i^{a_g}$ occurs with probability at most $\delta_i$. Thus, by repeated applications of the union bound, we get:

$$\Pr[\bar{B}] \ge 1 - \sum_{i=1}^{T} \sum_{a_g \in \mathcal{A}_g} \Pr[B_i^{a_g}] \ge 1 - |\mathcal{A}_g| \sum_{i=1}^{T} \max_{a_g \in \mathcal{A}_g} \Pr[B_i^{a_g}]$$

Finally, substituting $\Pr[B_i^{a_g}] \le \delta_i$ yields the lemma. ∎

Recall that for any 𝑠 ∈ 𝒮 := 𝒮 𝑔 × 𝒮 𝑙 𝑛 ≅ 𝒮 𝑔 , the policy function 𝜋 𝑘 , 𝑚 est ⁢ ( 𝑠 ) is defined as a uniformly random element in the maximal set of 𝜋 ^ 𝑘 , 𝑚 est evaluated on all possible choices of Δ . Formally:

$$\pi_{k,m}^{\mathrm{est}}(s) \sim \mathcal{U}\left\{ \hat{\pi}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}) : \Delta \in \binom{[n]}{k} \right\} \tag{25}$$
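
To make the execution of this policy concrete, here is a minimal sketch, assuming a tabular setting in which the learned function $\hat{Q}_{k,m}^{\mathrm{est}}$ is available as a Python callable `q_est(s_g, f_delta, a_g)`. That callable, the helper names, and the toy inputs are hypothetical stand-ins rather than the authors' implementation. The sketch draws $\Delta$ uniformly from $\binom{[n]}{k}$, forms $F_{s_\Delta}$, and acts greedily, which matches the average over $\Delta$ used in the proof of Theorem D.3; ties in the arg-max are broken by order of `actions`.

```python
import random
from collections import Counter


def empirical_distribution(local_states, delta, state_space):
    """F_{s_Delta}: fraction of the sampled agents in each local state."""
    counts = Counter(local_states[i] for i in delta)
    return tuple(counts[x] / len(delta) for x in state_space)


def subsampled_policy(s_g, local_states, k, q_est, actions, state_space, rng):
    """Sample Delta uniformly without replacement, then act greedily on q_est."""
    delta = rng.sample(range(len(local_states)), k)
    f_delta = empirical_distribution(local_states, delta, state_space)
    return max(actions, key=lambda a_g: q_est(s_g, f_delta, a_g))


def toy_q(s_g, f_delta, a_g):
    """Hypothetical Q-function: prefers the action closest to the share of agents in state 1."""
    return -abs(a_g - f_delta[1])


rng = random.Random(0)
action = subsampled_policy(
    s_g=0, local_states=[0, 1, 1, 1], k=2, q_est=toy_q,
    actions=(0, 1), state_space=(0, 1), rng=rng,
)
print("chosen global action:", action)
```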

We now use the celebrated performance difference lemma from Kakade & Langford (2002), restated below for convenience in Theorem D.2, to bound the value functions generated between 𝜋 𝑘 , 𝑚 est and 𝜋 ∗ .

Theorem D.2 (Performance Difference Lemma).

Given policies $\pi_1, \pi_2$, with corresponding value functions $V^{\pi_1}, V^{\pi_2}$:

$$V^{\pi_1}(s) - V^{\pi_2}(s) = \frac{1}{1 - \gamma} \mathbb{E}_{\substack{s' \sim d_s^{\pi_1} \\ a' \sim \pi_1(\cdot | s')}} \left[ A^{\pi_2}(s', a') \right]$$

Here, $A^{\pi_2}(s', a') := Q^{\pi_2}(s', a') - V^{\pi_2}(s')$ and $d_s^{\pi_1}(s') = (1 - \gamma) \sum_{h=0}^{\infty} \gamma^h \Pr_h^{\pi_1}[s', s]$, where $\Pr_h^{\pi_1}[s', s]$ is the probability of $\pi_1$ reaching state $s'$ at time step $h$ starting from state $s$.
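
The identity can be verified numerically on a small tabular MDP. The following sketch is a toy check under assumed inputs (a two-state, two-action MDP with made-up transition probabilities and rewards; function names are hypothetical): it evaluates both policies, accumulates the discounted visitation distribution $d_s^{\pi_1}$, and compares the two sides of the lemma.

```python
def policy_evaluation(P, R, pi, gamma, iters=2000):
    """Iterate the Bellman expectation operator to get V^pi and Q^pi.

    P[s][a][t] is the probability of moving s -> t under a, R[s][a] is the
    reward, and pi[s][a] is the probability that pi picks a in s.
    """
    S, A = len(P), len(P[0])
    V = [0.0] * S
    for _ in range(iters):
        V = [sum(pi[s][a] * (R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(S)))
                 for a in range(A)) for s in range(S)]
    Q = [[R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(S)) for a in range(A)]
         for s in range(S)]
    return V, Q


def discounted_visitation(P, pi, gamma, start, horizon=2000):
    """d_s^pi(s') = (1 - gamma) * sum_h gamma^h Pr_h[s' | start], truncated at `horizon`."""
    S, A = len(P), len(P[0])
    dist = [1.0 if s == start else 0.0 for s in range(S)]
    d = [0.0] * S
    for h in range(horizon):
        for s in range(S):
            d[s] += (1.0 - gamma) * gamma**h * dist[s]
        dist = [sum(dist[s] * pi[s][a] * P[s][a][t] for s in range(S) for a in range(A))
                for t in range(S)]
    return d


def pdl_right_hand_side(P, R, pi1, pi2, gamma, start):
    """(1 / (1 - gamma)) * E_{s' ~ d^{pi1}, a' ~ pi1} [A^{pi2}(s', a')]."""
    V2, Q2 = policy_evaluation(P, R, pi2, gamma)
    d = discounted_visitation(P, pi1, gamma, start)
    S, A = len(P), len(P[0])
    adv = sum(d[s] * pi1[s][a] * (Q2[s][a] - V2[s]) for s in range(S) for a in range(A))
    return adv / (1.0 - gamma)


# Assumed toy MDP: 2 states, 2 actions.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.0, 2.0]]
pi1 = [[1.0, 0.0], [0.0, 1.0]]   # deterministic policy
pi2 = [[0.5, 0.5], [0.5, 0.5]]   # uniformly random policy
gamma, start = 0.9, 0
V1, _ = policy_evaluation(P, R, pi1, gamma)
V2, _ = policy_evaluation(P, R, pi2, gamma)
lhs = V1[start] - V2[start]
rhs = pdl_right_hand_side(P, R, pi1, pi2, gamma, start)
print(f"V^pi1 - V^pi2 = {lhs:.6f}, PDL right-hand side = {rhs:.6f}")
```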

Theorem D.3 (Bounding value difference).

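For any $s \in \mathcal{S} := \mathcal{S}_g \times \mathcal{S}_l^n$ and $(\delta_1, \delta_2) \in (0, 1]^2$, we have:

$$V^{\pi^*}(s) - V^{\pi_{k,m}^{\mathrm{est}}}(s) \le \frac{2 \|r_l(\cdot, \cdot)\|_\infty}{(1 - \gamma)^2} \sqrt{\frac{n - k + 1}{2 n k} \ln \frac{2 |\mathcal{S}_l|}{\delta_1}} + \frac{2 \tilde{r}}{(1 - \gamma)^2} |\mathcal{A}_g| \delta_1 + \frac{2 \epsilon_{k,m}}{1 - \gamma}$$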

Proof.

Note that by definition of the advantage function, we have:

$$\begin{aligned}
\mathbb{E}_{a' \sim \pi_{k,m}^{\mathrm{est}}(\cdot | s')} A^{\pi^*}(s', a')
&= \mathbb{E}_{a' \sim \pi_{k,m}^{\mathrm{est}}(\cdot | s')} \left[ Q^{\pi^*}(s', a') - V^{\pi^*}(s') \right] \\
&= \mathbb{E}_{a' \sim \pi_{k,m}^{\mathrm{est}}(\cdot | s')} \left[ Q^{\pi^*}(s', a') - \mathbb{E}_{a \sim \pi^*(\cdot | s')} Q^{\pi^*}(s', a) \right] \\
&= \mathbb{E}_{a' \sim \pi_{k,m}^{\mathrm{est}}(\cdot | s')} \mathbb{E}_{a \sim \pi^*(\cdot | s')} \left[ Q^{\pi^*}(s', a') - Q^{\pi^*}(s', a) \right]
\end{aligned}$$

Since $\pi^*$ is a deterministic policy, we can write:

$$\begin{aligned}
\mathbb{E}_{a' \sim \pi_{k,m}^{\mathrm{est}}(\cdot | s')} \mathbb{E}_{a \sim \pi^*(\cdot | s')} A^{\pi^*}(s', a')
&= \mathbb{E}_{a' \sim \pi_{k,m}^{\mathrm{est}}(\cdot | s')} \left[ Q^{\pi^*}(s', a') - Q^{\pi^*}(s', \pi^*(s')) \right] \\
&= \frac{1}{\binom{n}{k}} \sum_{\Delta \in \binom{[n]}{k}} \left[ Q^{\pi^*}(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) - Q^{\pi^*}(s', \pi^*(s')) \right]
\end{aligned}$$

Then, by the linearity of expectations and the performance difference lemma (while noting that $Q^{\pi^*}(\cdot, \cdot) = Q^*(\cdot, \cdot)$):

$$\begin{aligned}
V^{\pi^*}(s) - V^{\pi_{k,m}^{\mathrm{est}}}(s)
&= \frac{1}{1 - \gamma} \sum_{\Delta \in \binom{[n]}{k}} \frac{1}{\binom{n}{k}} \mathbb{E}_{s' \sim d_s^{\pi_{k,m}^{\mathrm{est}}}} \left[ Q^{\pi^*}(s', \pi^*(s')) - Q^{\pi^*}(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) \right] \\
&= \frac{1}{1 - \gamma} \sum_{\Delta \in \binom{[n]}{k}} \frac{1}{\binom{n}{k}} \mathbb{E}_{s' \sim d_s^{\pi_{k,m}^{\mathrm{est}}}} \left[ Q^*(s', \pi^*(s')) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) \right]
\end{aligned}$$

Next, we use Lemma D.4 to bound this difference (where the probability distribution function of $\mathcal{D}$ is set as $d_s^{\pi_{k,m}^{\mathrm{est}}}$ as defined in Theorem D.2) while letting $\delta_1 = \delta_2$:

$$\begin{aligned}
V^{\pi^*}(s) - V^{\pi_{k,m}^{\mathrm{est}}}(s)
&\le \frac{1}{1 - \gamma} \sum_{\Delta \in \binom{[n]}{k}} \frac{1}{\binom{n}{k}} \left[ \frac{2 \|r_l(\cdot, \cdot)\|_\infty}{1 - \gamma} \sqrt{\frac{n - k + 1}{2 n k} \ln \frac{2 |\mathcal{S}_l|}{\delta_1}} + \frac{2 \tilde{r}}{1 - \gamma} |\mathcal{A}_g| \delta_1 + 2 \epsilon_{k,m} \right] \\
&\le \frac{2 \|r_l(\cdot, \cdot)\|_\infty}{(1 - \gamma)^2} \sqrt{\frac{n - k + 1}{2 n k} \ln \frac{2 |\mathcal{S}_l|}{\delta_1}} + \frac{2 \tilde{r}}{(1 - \gamma)^2} |\mathcal{A}_g| \delta_1 + \frac{2 \epsilon_{k,m}}{1 - \gamma}
\end{aligned}$$

This proves the theorem. ∎

Lemma D.4.

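For any arbitrary distribution $\mathcal{D}$ of states $\mathcal{S} := \mathcal{S}_g \times \mathcal{S}_l^n$, for any $\Delta \in \binom{[n]}{k}$ and for $\delta_1, \delta_2 \in (0, 1]$, we have:

$$\begin{aligned}
&\mathbb{E}_{s' \sim \mathcal{D}} \left[ Q^*(s', \pi^*(s')) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) \right] \\
&\quad \le \frac{2 \|r_l(\cdot, \cdot)\|_\infty}{1 - \gamma} \sqrt{\frac{n - k + 1}{8 n k}} \left( \sqrt{\ln \frac{2 |\mathcal{S}_l|}{\delta_1}} + \sqrt{\ln \frac{2 |\mathcal{S}_l|}{\delta_2}} \right) + \frac{\tilde{r}}{1 - \gamma} |\mathcal{A}_g| (\delta_1 + \delta_2) + 2 \epsilon_{k,m}
\end{aligned}$$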

Proof.

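Denote $\zeta_{k,m}^{s,\Delta} := Q^*(s, \pi^*(s)) - Q^*(s, \hat{\pi}_{k,m}^{\mathrm{est}}(s_g, F_{s_\Delta}))$. We define the indicator function $\mathcal{I} : \mathcal{S} \times \mathbb{N} \times (0, 1] \times (0, 1] \to \{0, 1\}$ by:

$$\mathcal{I}(s, k, \delta_1, \delta_2) = \mathbb{1}\left\{ \zeta_{k,m}^{s,\Delta} \le \frac{2 \|r_l(\cdot, \cdot)\|_\infty}{1 - \gamma} \sqrt{\frac{n - k + 1}{8 n k}} \left( \sqrt{\ln \frac{2 |\mathcal{S}_l|}{\delta_1}} + \sqrt{\ln \frac{2 |\mathcal{S}_l|}{\delta_2}} \right) + 2 \epsilon_{k,m} \right\}$$

We then study the expected difference between $Q^*(s', \pi^*(s'))$ and $Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta}))$. Observe that:

$$\begin{aligned}
\mathbb{E}_{s' \sim \mathcal{D}} \left[ \zeta_{k,m}^{s',\Delta} \right]
&= \mathbb{E}_{s' \sim \mathcal{D}} \left[ Q^*(s', \pi^*(s')) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) \right] \\
&= \mathbb{E}_{s' \sim \mathcal{D}} \left[ \mathcal{I}(s', k, \delta_1, \delta_2) \left( Q^*(s', \pi^*(s')) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) \right) \right] \\
&\quad + \mathbb{E}_{s' \sim \mathcal{D}} \left[ \left( 1 - \mathcal{I}(s', k, \delta_1, \delta_2) \right) \left( Q^*(s', \pi^*(s')) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) \right) \right]
\end{aligned}$$

Here, we have used the general property for a random variable $X$ and constant $c$ that $\mathbb{E}[X] = \mathbb{E}[X \mathbb{1}\{X \le c\}] + \mathbb{E}[(1 - \mathbb{1}\{X \le c\}) X]$. Then,

$$\begin{aligned}
&\mathbb{E}_{s' \sim \mathcal{D}} \left[ Q^*(s', \pi^*(s')) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) \right] \\
&\quad \le \frac{2 \|r_l(\cdot, \cdot)\|_\infty}{1 - \gamma} \sqrt{\frac{n - k + 1}{8 n k}} \left( \sqrt{\ln \frac{2 |\mathcal{S}_l|}{\delta_1}} + \sqrt{\ln \frac{2 |\mathcal{S}_l|}{\delta_2}} \right) + 2 \epsilon_{k,m} + \frac{\tilde{r}}{1 - \gamma} \left( 1 - \mathbb{E}_{s' \sim \mathcal{D}} \mathcal{I}(s', k, \delta_1, \delta_2) \right) \\
&\quad \le \frac{2 \|r_l(\cdot, \cdot)\|_\infty}{1 - \gamma} \sqrt{\frac{n - k + 1}{8 n k}} \left( \sqrt{\ln \frac{2 |\mathcal{S}_l|}{\delta_1}} + \sqrt{\ln \frac{2 |\mathcal{S}_l|}{\delta_2}} \right) + 2 \epsilon_{k,m} + \frac{\tilde{r}}{1 - \gamma} |\mathcal{A}_g| (\delta_1 + \delta_2)
\end{aligned}$$

For the first term in the first inequality, we use $\mathbb{E}[X \mathbb{1}\{X \le c\}] \le c$. For the second term, we trivially bound $Q^*(s', \pi^*(s')) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta}))$ by the maximum value $Q^*$ can take, which is $\frac{\tilde{r}}{1 - \gamma}$ by Lemma A.7. For the second inequality, we use the fact that the expectation of an indicator function is the probability of the underlying event, and bound the probability of the complementary event by $|\mathcal{A}_g| (\delta_1 + \delta_2)$ using Lemma D.5, which yields the claim.∎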

Lemma D.5.

For a fixed $s' \in \mathcal{S} := \mathcal{S}_g \times \mathcal{S}_l^n$, for any $\Delta \in \binom{[n]}{k}$, and for $\delta_1, \delta_2 \in (0, 1]$, we have that with probability at least $1 - |\mathcal{A}_g| (\delta_1 + \delta_2)$:

$$Q^*(s', \pi^*(s')) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) \le \frac{2 \|r_l(\cdot, \cdot)\|_\infty}{1 - \gamma} \sqrt{\frac{n - k + 1}{8 n k}} \left( \sqrt{\ln \frac{2 |\mathcal{S}_l|}{\delta_1}} + \sqrt{\ln \frac{2 |\mathcal{S}_l|}{\delta_2}} \right) + 2 \epsilon_{k,m}$$

Proof.

$$\begin{aligned}
Q^*(s', \pi^*(s')) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta}))
&= Q^*(s', \pi^*(s')) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) \\
&\quad + \hat{Q}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta}, \pi^*(s')) - \hat{Q}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta}, \pi^*(s')) \\
&\quad + \hat{Q}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta}, \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) - \hat{Q}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta}, \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta}))
\end{aligned}$$

By the monotonicity of the absolute value and the triangle inequality, we have:

$$\begin{aligned}
Q^*(s', \pi^*(s')) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta}))
&\le \left| Q^*(s', \pi^*(s')) - \hat{Q}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta}, \pi^*(s')) \right| \\
&\quad + \left| \hat{Q}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta}, \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) - Q^*(s', \hat{\pi}_{k,m}^{\mathrm{est}}(s_g', F_{s'_\Delta})) \right|
\end{aligned}$$

The above inequality crucially uses the fact that the residual term 𝑄 ^ 𝑘 , 𝑚 est ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ , 𝜋 ∗ ⁢ ( 𝑠 ′ ) ) − 𝑄 ^ 𝑘 , 𝑚 est ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ , 𝜋 ^ 𝑘 , 𝑚 est ⁢ ( 𝑠 𝑔 ′ , 𝐹 𝑠 Δ ′ ) ) ≤ 0 , since 𝜋 ^ 𝑘 , 𝑚 est is the optimal greedy policy for 𝑄 ^ 𝑘 , 𝑚 est . Finally, applying the error bound derived in Lemma D.1 for two timesteps completes the proof. ∎

Corollary D.6.

Optimizing parameters in Theorem D.3 yields:

$$V^{\pi^*}(s) - V^{\pi_{k,m}^{\mathrm{est}}}(s) \le \frac{2 \tilde{r}}{(1 - \gamma)^2} \left( \sqrt{\frac{n - k + 1}{2 n k} \ln\left( 2 |\mathcal{S}_l| |\mathcal{A}_g| \sqrt{k} \right)} + \frac{1}{\sqrt{k}} \right) + \frac{2 \epsilon_{k,m}}{1 - \gamma}$$

Proof.

Recall from Theorem D.3 that:

$$V^{\pi^*}(s) - V^{\pi_{k,m}^{\mathrm{est}}}(s) \le \frac{2 \|r_l(\cdot, \cdot)\|_\infty}{(1 - \gamma)^2} \sqrt{\frac{n - k + 1}{2 n k} \ln \frac{2 |\mathcal{S}_l|}{\delta_1}} + \frac{2 \tilde{r}}{(1 - \gamma)^2} |\mathcal{A}_g| \delta_1 + \frac{2 \epsilon_{k,m}}{1 - \gamma}$$

Note $\|r_l(\cdot, \cdot)\|_\infty \le \tilde{r}$ from Assumption 2.2. Then,

$$V^{\pi^*}(s) - V^{\pi_{k,m}^{\mathrm{est}}}(s) \le \frac{2 \tilde{r}}{(1 - \gamma)^2} \left( \sqrt{\frac{n - k + 1}{2 n k} \ln \frac{2 |\mathcal{S}_l|}{\delta_1}} + |\mathcal{A}_g| \delta_1 \right) + \frac{2 \epsilon_{k,m}}{1 - \gamma}$$

Finally, setting $\delta_1 = \frac{1}{k^{1/2} |\mathcal{A}_g|}$ yields the claim.∎
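
The substitution in this proof can be checked mechanically. The sketch below (function names and numerical values are illustrative assumptions, with $\|r_l\|_\infty$ replaced by $\tilde{r}$ as in the proof) evaluates the Theorem D.3 bound at $\delta_1 = 1 / (\sqrt{k} |\mathcal{A}_g|)$ and compares it with the closed form of Corollary D.6, then prints how the bound shrinks as $k$ grows.

```python
import math


def theorem_d3_bound(n, k, num_states, num_actions, r_tilde, gamma, eps_km, delta1):
    """Theorem D.3 right-hand side with ||r_l||_inf upper-bounded by r_tilde."""
    stat = math.sqrt((n - k + 1) / (2.0 * n * k) * math.log(2.0 * num_states / delta1))
    lead = 2.0 * r_tilde / (1.0 - gamma) ** 2
    return lead * (stat + num_actions * delta1) + 2.0 * eps_km / (1.0 - gamma)


def corollary_d6_bound(n, k, num_states, num_actions, r_tilde, gamma, eps_km):
    """Corollary D.6 right-hand side, i.e. delta1 = 1 / (sqrt(k) * |A_g|)."""
    root_k = math.sqrt(k)
    stat = math.sqrt(
        (n - k + 1) / (2.0 * n * k) * math.log(2.0 * num_states * num_actions * root_k)
    )
    lead = 2.0 * r_tilde / (1.0 - gamma) ** 2
    return lead * (stat + 1.0 / root_k) + 2.0 * eps_km / (1.0 - gamma)


# Assumed toy values for n, |S_l|, |A_g|, r_tilde, gamma, and eps_{k,m}.
n, num_states, num_actions, r_tilde, gamma, eps_km = 1000, 5, 3, 1.0, 0.9, 0.0
for k in (10, 100, 1000):
    print(k, corollary_d6_bound(n, k, num_states, num_actions, r_tilde, gamma, eps_km))
```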

Corollary D.7.

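Therefore, from Corollary D.6, we have:

$$\begin{aligned}
V^{\pi^*}(s) - V^{\pi_{k,m}^{\mathrm{est}}}(s)
&\le O\left( \frac{\tilde{r}}{\sqrt{k} (1 - \gamma)^2} \sqrt{\ln\left( 2 |\mathcal{S}_l| |\mathcal{A}_g| \sqrt{k} \right)} + \frac{\epsilon_{k,m}}{1 - \gamma} \right) \\
&= \tilde{O}\left( \frac{\tilde{r} (1 - \gamma)^{-2}}{\sqrt{k}} + \frac{\epsilon_{k,m}}{1 - \gamma} \right)
\end{aligned}$$

This yields the bound from Theorem 3.4.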

Appendix E Additional Discussions

Discussion E.1 (Tighter Endpoint Analysis).

Our theoretical result shows that $V^{\pi^*}(s) - V^{\pi_{k,m}^{\mathrm{est}}}(s)$ decays on the order of $O(1 / \sqrt{k} + \epsilon_{k,m})$. For $k = n$, this bound is actually suboptimal since $\hat{Q}_k^*$ becomes $Q^*$. However, placing $|\Delta| = n$ in our weaker TV bound in Lemma C.7 recovers a total variation distance of $0$ when $k = n$, which gives the optimal endpoint bound.

Discussion E.2 (Choice of 𝑘 ).

Discussion 3.6 previously discussed the tradeoff in $k$ between the polynomial-in-$k$ complexity of learning the $\hat{Q}_k$ function and the $O(1 / \sqrt{k})$ decay in the optimality gap. That discussion promoted $k = O(\log n)$ as a means to balance the tradeoff. However, the “correct” choice of $k$ truly depends on the amount of compute available, as well as the accuracy desired from the method. If sufficient compute is available, we recommend setting $k = \Omega(n)$, as it will yield a policy closer to optimal. Conversely, when $n$ is large, $k = O(\log n)$ would be the minimum $k$ recommended to realize any asymptotic decay of the optimality gap.

Generated on Tue Oct 22 18:48:42 2024 by LaTeXML
