
Title: Asynchronous Communication and Linear Function Approximation

URL Source: https://arxiv.org/html/2305.06446

Cooperative Multi-Agent Reinforcement Learning: Asynchronous Communication and Linear Function Approximation

Yifei Min, Jiafan He, Tianhao Wang, Quanquan Gu

Abstract

We study multi-agent reinforcement learning in the setting of episodic Markov decision processes, where multiple agents cooperate via communication through a central server. We propose a provably efficient algorithm based on value iteration that enables asynchronous communication while ensuring the advantage of cooperation with low communication overhead. With linear function approximation, we prove that our algorithm enjoys an $\tilde{\mathcal{O}}(d^{3/2}H^2\sqrt{K})$ regret with $\tilde{\mathcal{O}}(dHM^2)$ communication complexity, where $d$ is the feature dimension, $H$ is the horizon length, $M$ is the total number of agents, and $K$ is the total number of episodes. We also provide a lower bound showing that a minimal $\Omega(dM)$ communication complexity is required to improve the performance through collaboration.

Markov Decision Process, reinforcement learning, multi-agent, federated learning, distributed

1 Introduction

Multi-agent Reinforcement Learning (RL) has been successfully applied in various application scenarios, such as robotics (Williams et al., 2016; Liu et al., 2019; Ding et al., 2020; Liu et al., 2020), games (Vinyals et al., 2017; Berner et al., 2019; Jaderberg et al., 2019; Ye et al., 2020), and many other real-world systems and settings (Bazzan, 2009; Yu et al., 2014, 2020; Fei & Xu, 2022; Min et al., 2022b; Xu et al., 2023b). In particular, in the cooperative setting, agents benefit from collaboration via (in)direct communication among each other. This requires the RL algorithm to coordinate the communication effectively and flexibly, in order to fully exploit the advantage of cooperation. Towards this goal, in this paper, we study cooperative multi-agent RL with asynchronous communication, and show that the same performance as single-agent methods can be achieved with an efficient communication strategy.

We focus on the so-called parallel RL setting (Kretchmar, 2002; Grounds & Kudenko, 2007), where agents interact with the environment in parallel to solve a common problem. More specifically, we consider a model of episodic Markov decision processes (MDPs) called linear MDPs (Yang & Wang, 2019; Jin et al., 2020), where both the transition probability and reward functions are linear in some known $d$-dimensional feature mapping. We assume there are $M$ agents, which share the same underlying MDP model but interact with the environment independently in parallel. The agents cannot communicate directly with each other, and information exchange is realized only through a central server. We emphasize that in our setting, the communication between the agents and the server is not required to be synchronous, and any communication is initiated solely by the agent, thus providing flexibility for practical needs. The goal of the agents is to achieve a low total regret with as little communication as possible.

Notably, a recent work by Dubey & Pentland (2021) studied cooperative multi-agent RL with linear MDPs. They proposed a cooperative variant of the LSVI-UCB algorithm (Jin et al., 2020) named Coop-LSVI, which achieves an $\tilde{O}(d^{3/2}H^2\sqrt{K})$ regret¹ with $O(dHM^3)$ communication complexity. However, their algorithm mandates the participation of all agents in a round-robin fashion, which is impractical as it imposes a stringent synchronous constraint on the agents' interaction with the environment and their communication with the server. It is possible that some agents might be temporarily unavailable in a round, or that the connection with the server is disrupted due to infrastructure failure. These anomalies demand that the algorithm be resilient to irregular participation patterns of the agents.

¹ Their original result is written as $\tilde{O}(d^{3/2}H^2\sqrt{MT})$. Here $d$ is the feature dimension, $H$ is the horizon length, $M$ is the total number of agents, and $K$ is the total number of episodes. Because of their round-robin-type participation, their $MT$ is equivalent to $K$, which is the total number of episodes under our notation.

To this end, we propose an asynchronous version of LSVI-UCB (Jin et al., 2020). We eliminate the synchronous constraint by carefully designing a determinant-based criterion for deciding whether or not an agent needs to communicate with the server and update the local model. This criterion depends only on the local data of each agent, and more importantly, the communication triggered by one agent with the server does not affect any other agents. As a comparison, in the Coop-LSVI algorithm in Dubey & Pentland (2021), if some agents decide to communicate with the server, the algorithm will execute a mandated aggregation of data from all agents. As a result, our algorithm is considerably more flexible and practical, though this presents new challenges in analyzing its performance theoretically.

| Setting | Algorithm | Regret | Communication | Low-switching | Allows asynchronous communication |
| --- | --- | --- | --- | --- | --- |
| Single-agent | LSVI-UCB (Jin et al., 2020) | $d^{3/2}H^2\sqrt{K}$ | N/A | ✘ | N/A |
| Multi-agent | Coop-LSVI (Dubey & Pentland, 2021) | $d^{3/2}H^2\sqrt{K}$ | $dHM^3$ | ✓ | ✘ |
| Multi-agent | Async-Coop-LSVI-UCB (ours) | $d^{3/2}H^2\sqrt{K}$ | $dHM^2$ | ✓ | ✓ |

Table 1: Comparison of our result with baseline methods for linear MDPs. Our result achieves regret comparable to that of the single-agent setting under low communication complexity. Here $d$ is the dimension of the feature mapping, $H$ is the horizon length, $M$ is the number of agents, and $K$ is the total number of episodes by all agents. Logarithmic factors are hidden from the regret and the communication complexity.

As mentioned before, the participation order of the agents can be arbitrary and irregular, resulting in some agents having the latest aggregated information from the server while others may have outdated information. This issue of information asymmetry prohibits direct adaptation of existing analyses for LSVI-type algorithms, such as those in Jin et al. (2020); Dubey & Pentland (2021). To address this, we need to carefully calibrate the communication criterion, so as to balance and regulate the information asymmetry. We achieve this by examining the quantitative relationship between each agent's local information and the virtual universal information, yielding a simple yet effective communication coefficient. The final result confirms the efficiency of the proposed algorithm (see Theorem 5.1).

Besides the positive result, we further investigate the fundamental limit of cooperative multi-agent RL. Inspired by the construction of hard-to-learn instances for federated bandits in Wang et al. (2020); He et al. (2022a), we characterize the minimum amount of communication complexity required to surpass the performance of a single-agent method² (see Theorem 5.5).

² Here by 'single-agent methods' we mean that all agents independently run a single-agent algorithm without communication.

The main contributions of this paper are summarized as follows:

We propose a provably efficient algorithm (Algorithm 2) for cooperative multi-agent RL with asynchronous communication under episodic linear MDPs (Yang & Wang, 2019; Jin et al., 2020). Our algorithm allows an arbitrary participation order of the agents and independent communication between the agents and the server, making it significantly more flexible than the existing algorithm in Dubey & Pentland (2021) for the synchronous setting. A comparison with baseline methods is presented in Table 1.

We prove that under standard assumptions, the proposed algorithm enjoys an $\tilde{\mathcal{O}}(d^{3/2}H^2\sqrt{K})$ regret with $\tilde{\mathcal{O}}(dHM^2)$ communication complexity. Our theoretical analysis identifies and resolves the information asymmetry due to asynchronous communication, which may be of independent interest.

We also provide a lower bound for the communication complexity, showing that an $\Omega(dM)$ complexity is necessary to improve over single-agent methods through collaboration. To the best of our knowledge, this is the first result on communication complexity for learning multi-agent MDPs.

Notation.

We denote $[n] := \{1, 2, \ldots, n\}$ for any positive integer $n$. We use $\mathbf{I}$ to denote the $d \times d$ identity matrix. We use $\mathcal{O}$ to hide universal constants and $\tilde{\mathcal{O}}$ to further hide poly-logarithmic terms. For any vector $\mathbf{x} \in \mathbb{R}^d$ and positive semi-definite matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$, we denote $\|\mathbf{x}\|_{\boldsymbol{\Sigma}} := \sqrt{\mathbf{x}^\top \boldsymbol{\Sigma} \mathbf{x}}$. For any $a, b, c \in \mathbb{R} \cup \{\pm\infty\}$ with $a \le b$, we use the shorthand $[c]_{[a,b]}$ to denote the truncation (or projection) of $c$ into the interval $[a, b]$, i.e., $[c]_{[a,b]} := \mathrm{argmin}_{c' \in [a,b]} |c - c'|$. A comprehensive clarification of notation is also provided in Appendix A.
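The truncation operator $[\cdot]_{[a,b]}$ reappears in the Q-function estimates of Section 4. As a one-line illustrative sketch (the helper name is ours, not the paper's):

```python
def truncate(c, a, b):
    """[c]_{[a, b]}: the point of [a, b] closest to c (works with +-inf)."""
    return max(a, min(c, b))
```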

2 Related Work Multi-Agent RL.

Various algorithms with convergence guarantees have been developed for multi-agent RL (Zhang et al., 2018b; Wai et al., 2018; Zhang et al., 2018a), e.g., federated versions of TD and Q-learning (Khodadadian et al., 2022), and policy gradient with fault tolerance (Fan et al., 2021). In contrast, in this work we study algorithms with low regret guarantees for cooperative multi-agent RL with asynchronous communication.

As mentioned above, we focus on the homogeneous setting where the underlying MDP for every agent is the same. There are also existing works on cooperative multi-agent RL with non-stationary environments and/or heterogeneity (Lowe et al., 2017; Yu et al., 2021; Kuba et al., 2022; Liu et al., 2022; Jin et al., 2022). Besides homogeneous parallel linear MDPs, Dubey & Pentland (2021) further studied heterogeneous parallel linear MDPs (i.e., the underlying MDPs can differ from agent to agent) and Markov games in linear multi-agent MDPs. These generalized setups are beyond the scope of the current paper, and we leave it as future work to study algorithms compatible with asynchronous communication in these settings.

We consider multi-agent RL with linear function approximation to accommodate large state and action spaces. More powerful deep learning techniques have been used for federated RL in Clemente et al. (2017); Espeholt et al. (2018); Horgan et al. (2018); Nair et al. (2015); Zhuo et al. (2019). We refer the reader to Qi et al. (2021) for a recent survey on federated RL. Our work is also related to the broader context of distributed learning, where a collective of agents collaborate towards a common objective (Bottou, 2010; Dean et al., 2012; Littman & Boyan, 2013; Li et al., 2014; Liang et al., 2018; Hoffman et al., 2020; Ding et al., 2022; Zhan et al., 2021, 2022; Xu et al., 2023a). Interested readers may refer to the survey article by Verbraeken et al. (2020).

RL with Linear Function Approximation.

Function approximation techniques in RL enable extension beyond the restricted setting of tabular MDP. Recent years have especially witnessed rapid progress in the research of single-agent RL with linear function approximation, among which two major parallel lines of work (for online RL) focus on linear MDPs (Yang & Wang, 2019; Jin et al., 2020; Zanette et al., 2020; Neu & Pike-Burke, 2020; He et al., 2021; Wang et al., 2021; Hu et al., 2022; He et al., 2022b; Agarwal et al., 2022; Lu et al., 2023) and linear mixture MDPs (Modi et al., 2020; Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2021b; Cai et al., 2020; Zhou et al., 2021a; Zhang et al., 2021; Kim et al., 2021; Min et al., 2022a; Zhang et al., 2022; Zhou & Gu, 2022), respectively.

In this paper, we follow the design of the LSVI-UCB algorithm (Jin et al., 2020) to devise an asynchronous algorithm for cooperative linear MDP. Indeed, our algorithmic design can also be carried over to tabular MDPs and linear mixture MDPs, which will be discussed later in Section 4.

3 Preliminaries

In this section, we first provide the formal definition of linear MDPs, and then introduce our model of cooperative multi-agent linear MDPs with asynchronous communication.

3.1 Linear MDPs

Episodic MDPs are a classic family of models in RL (Sutton & Barto, 2018). Let $\mathcal{S}$ be the state space, $\mathcal{A}$ be the action space, and $H$ be the horizon length. Each episode starts from some initial state $s_1 \in \mathcal{S}$. For step $h = 1, 2, \ldots, H$, the agent at state $s_h \in \mathcal{S}$ takes some action $a_h \in \mathcal{A}$, and receives a reward $r_h(s_h, a_h)$, where $r_h: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function at step $h$. Then the environment transitions to the next state $s_{h+1} \sim \mathbb{P}_h(\cdot \mid s_h, a_h)$, where $\mathbb{P}_h: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the transition probability function for step $h$. The strategy by which the agent interacts with the environment is called a policy; a policy $\pi$ consists of $H$ mappings $\pi_h: \mathcal{S} \to \mathcal{A}$ for every $h \in [H]$. The agent will run for $K$ episodes in total. The goal of the agent is then to find, through this online process, the optimal policy that maximizes the cumulative reward across an episode.

In this work we consider the time-inhomogeneous linear MDP setting where the transition probabilities and the reward functions can be parametrized as linear functions of a known feature mapping $\phi$. This is a popular setting considered by various authors (Bradtke & Barto, 1996; Melo & Ribeiro, 2007; Yang & Wang, 2019; Jin et al., 2020; Min et al., 2021; Yin et al., 2021). The formal definition is given by the following assumption.

Assumption 3.1 (Linear MDPs, Yang & Wang 2019; Jin et al. 2020).

$\mathrm{MDP}(\mathcal{S}, \mathcal{A}, H, \{r_h\}_{h=1}^H, \{\mathbb{P}_h\}_{h=1}^H)$ is called a linear MDP with a known feature mapping $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$, if for any $h \in [H]$, there exist $\boldsymbol{\gamma}_h$ and $\boldsymbol{\mu}_h \in \mathbb{R}^d$, such that for any state-action pair $(s, a) \in \mathcal{S} \times \mathcal{A}$,

$$\mathbb{P}_h(\cdot \mid s, a) = \langle \phi(s, a), \boldsymbol{\mu}_h(\cdot) \rangle, \qquad r_h(s, a) = \langle \phi(s, a), \boldsymbol{\gamma}_h \rangle, \tag{3.1}$$

where $\max\{\|\boldsymbol{\mu}_h(\mathcal{S})\|_2, \|\boldsymbol{\gamma}_h\|_2\} \le \sqrt{d}$ for all $h \in [H]$.

We assume that at any step $h \in [H]$, for any state-action pair $(s_h, a_h) \in \mathcal{S} \times \mathcal{A}$, the reward received by the agent is given by $r_h(s_h, a_h)$. Without loss of generality, we assume $0 \le r_h(s, a) \le 1$ and $\|\phi(s, a)\|_2 \le 1$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. We assume $\mathcal{A}$ is large but finite, while $\mathcal{S}$ is possibly infinite.
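A useful consequence of Assumption 3.1, noted by Jin et al. (2020), is that action-value functions of any policy are themselves linear in the features: for any policy $\pi$ and step $h$,

$$Q_h^\pi(s, a) = r_h(s, a) + \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s, a)}\big[V_{h+1}^\pi(s')\big] = \Big\langle \phi(s, a),\; \boldsymbol{\gamma}_h + \int_{\mathcal{S}} V_{h+1}^\pi(s')\, \mathrm{d}\boldsymbol{\mu}_h(s') \Big\rangle,$$

so learning $Q_h^\pi$ reduces to estimating a $d$-dimensional weight vector, which is exactly what the least-squares updates in Section 4 exploit.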

3.2 Cooperative Multi-agent RL with Asynchronous Communication

We assume there is a group of $M$ agents. The process proceeds in an episodic fashion, where the total number of episodes is $K$. In each episode $k$, there is only one active agent participating, and we denote this agent by $m_k$. This agent adopts a policy $\pi_{m_k,k}$ and starts from an initial state $s_{k,1}$. For each step $h \in [H]$, agent $m_k$ picks an action $a_{k,h} \in \mathcal{A}$ according to $a_{k,h} \sim \pi_{k,h}(\cdot \mid s_{k,h})$, receives a reward $r_{k,h} = r_h(s_{k,h}, a_{k,h})$, and transitions to the next state $s_{k,h+1}$. Episode $k$ terminates when agent $m_k$ reaches $s_{k,H+1}$, and there is zero reward at step $H+1$. Note that we consider the homogeneous-agent setting, where each agent has the same transition kernel and reward functions.

Asynchronous Communication.

In the multi-agent setting, the agents need to communicate (i.e., share data) to collaboratively learn the underlying optimal policy while minimizing the regret. Without communication, the problem would reduce to $M$ separate single-agent linear MDP problems. This would lead to a worst-case regret of order $\tilde{\mathcal{O}}(M\sqrt{K/M})$, which suffers from an extra $\sqrt{M}$ factor as compared to the $\tilde{\mathcal{O}}(\sqrt{K})$ regret in the single-agent $K$-episode setting (Jin et al., 2020). In the following sections we will show that this extra factor can be avoided at the cost of a small number of communication rounds.
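To spell out the arithmetic behind the extra factor (keeping only the dependence on $M$ and $K$): if the $K$ episodes are split evenly, each agent runs $K/M$ episodes on its own and incurs $\tilde{\mathcal{O}}(\sqrt{K/M})$ regret, so

$$\sum_{m=1}^{M} \tilde{\mathcal{O}}\big(\sqrt{K/M}\big) = \tilde{\mathcal{O}}\big(M\sqrt{K/M}\big) = \tilde{\mathcal{O}}\big(\sqrt{M} \cdot \sqrt{K}\big),$$

a $\sqrt{M}$ blow-up over the centralized $\tilde{\mathcal{O}}(\sqrt{K})$ rate.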

Figure 1: Illustration of Star-shaped Communication Network

We now describe our communication protocol as follows: In this paper we assume the existence of a central server through which all the agents can share their local data (Figure 1). Each agent can communicate with the server by uploading local data and downloading global data from the server. This is also known as the star-shaped communication network (Wang et al., 2020; Dubey & Pentland, 2020; He et al., 2022a).

Moreover, each agent can decide whether to trigger a communication with the server or not. Specifically, at the end of episode 𝑘 , the active agent 𝑚 𝑘 can choose whether to upload its local trajectory to the central server and download all the data uploaded to the server by that time. The communication complexity is measured by the total number of communication rounds between the agents and the server.

Importantly, we consider an asynchronous setting satisfying the following two properties:

(i) Full participation or round-robin-type participation is not required.

(ii) The communication between one agent and the server will not cause a mandatory download for other agents.

This setting is much more flexible than the synchronous setting, where no offline agent is allowed. As a comparison, in Dubey & Pentland (2021), all agents are required to participate in a round-robin fashion. Our setting is more general than the synchronous setting, since it gives the agents the extra flexibility to decide whether to participate or not. The pseudo-code of the communication protocol is summarized in Algorithm 1.

Algorithm 1 Communication Protocol
1: for $k = 1, \ldots, K$ do
2:   Agent $m_k$ is active
3:   Receive $s_{k,1}$ from the environment
4:   for $h = 1, \ldots, H$ do
5:     Take action $a_{k,h} \sim \pi_{k,h}(\cdot \mid s_{k,h})$
6:     Receive $s_{k,h+1} \sim \mathbb{P}_h(\cdot \mid s_{k,h}, a_{k,h})$
7:     Receive reward $r_{k,h}$
8:   end for
9:   if communication triggered then
10:     Send local data to SERVER
11:     Download from SERVER
12:     Update policy using all available data
13:   end if
14: end for

Communication Complexity and Switching Cost.
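To make the data flow concrete, here is a minimal Python sketch of the star-shaped protocol. The class and method names are ours and purely illustrative; the communication trigger is left abstract here (Algorithm 2 instantiates it with a determinant-based criterion).

```python
class Server:
    """Central server: a passive store that aggregates uploaded data."""
    def __init__(self):
        self.data = []                 # all trajectories uploaded so far

    def upload(self, local_data):
        self.data.extend(local_data)

    def download(self):
        return list(self.data)         # snapshot of the global dataset


class Agent:
    """An agent syncs only when its own trigger fires; other agents are
    unaffected, matching properties (i)-(ii) of the asynchronous setting."""
    def __init__(self, server):
        self.server = server
        self.local_new = []            # trajectories not yet uploaded
        self.known = []                # data backing the current policy

    def observe(self, trajectory):
        self.local_new.append(trajectory)

    def maybe_sync(self, triggered):
        if triggered:                  # agent-side decision (Line 9 above)
            self.server.upload(self.local_new)
            self.local_new = []
            self.known = self.server.download()
            # ...recompute the local policy from self.known here
```

Observe that `maybe_sync` mutates only the calling agent's state: one agent's upload never forces a download or a policy change on any other agent.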

We define the communication complexity of an algorithm to be the total number of rounds of communication (i.e. one upload and download operation) between any agent and the central server. Note that some papers also use the number of bits to measure the communication complexity (Wang et al., 2020). Here we follow the notation of communication complexity in Dubey & Pentland (2021).

The policy switching cost refers to the number of times the agents change their policies (Kalai & Vempala, 2005; Abbasi-Yadkori et al., 2011). In the RL setting where the agents choose to adopt a greedy policy according to their estimated Q-functions, the switching cost is also equal to the number of times the estimated Q-functions are updated.

Learning Objective.

For any policy $\pi = \{\pi_h\}_{h=1}^H$, we define the corresponding value functions as

$$V_h^\pi(s) := \mathbb{E}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, \pi_{h'}(s_{h'})) \,\Big|\, s_h = s\Big], \quad \forall h \in [H].$$

Since the horizon $H$ is finite and the action space $\mathcal{A}$ is also finite, there exists some optimal policy $\pi^\star$ such that

$$V_h^{\pi^\star}(s) = \sup_{\pi} V_h^\pi(s),$$

and we denote $V_h^\star := V_h^{\pi^\star}$. Due to the space limit, more detailed definitions of the value functions and Q-functions can be found in Appendix A.

The objective of all agents is to collaboratively minimize the aggregated cumulative regret defined as

$$R(K) := \sum_{k=1}^{K} \big[ V_1^\star(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1}) \big], \tag{3.2}$$

where $m_k$ is the active agent in episode $k$, $\pi_{m_k,k} = \{\pi_{m_k,k,h}\}_{h=1}^H$ is the policy adopted by agent $m_k$ in episode $k$, and $s_{k,1}$ is the initial state of episode $k$.

4 The Proposed Algorithm

Now we proceed to present our proposed algorithm, as displayed in Algorithm 2. After explaining the detailed design of Algorithm 2, we will also discuss possible extensions to other MDP settings in Section 4.2. Here, for notational convenience, we abbreviate $\phi_{k,h} := \phi(s_{k,h}, a_{k,h})$.

4.1 Algorithmic Design

Algorithm 2 adopts an execute-then-update framework at a high level: in each episode $k \in [K]$, there is one active agent denoted by $m = m_k$ (Line 4 in Algorithm 2); here we omit the subscript $k$ for better readability. Each episode involves an interaction phase (Lines 6-13) and an event-triggered communication and policy update phase (Lines 14-26).

Phase I: Interaction.

Agent $m$ will first interact with the environment by executing the greedy policy with respect to its current Q-function estimates $\{Q_{m,k,h}\}_{h=1}^H$ (Lines 7 & 8), and collect the data from the trajectory of the current episode (Lines 9 & 10). Then the agent will update its local dataset and covariance matrices (Lines 11 & 12).

Phase II: Communication and Policy Update.

The second phase involving communication and policy update is triggered by a determinant-based criterion (Line 14). Once the criterion is satisfied, the agent will upload all the accumulated local data to the central server (Line 18), and then download all the available data from the server (Line 20). Using this latest dataset, the agent then updates its Q-function estimates (Line 21-23).

More specifically, the Q-function estimate is obtained by backward least-squares value iteration following Jin et al. (2020): given the estimate $Q_{m,k+1,h+1}$ for step $h+1$, we solve for the $\mathbf{w}_{m,k+1,h}$ that minimizes the Bellman error in terms of a ridge linear regression:

$$\mathbf{w}_{m,k+1,h} = \boldsymbol{\Lambda}_{m,k+1,h}^{-1} \sum_{\tau \in \mathcal{D}_{m,k,h}} \phi(s_{\tau,h}, a_{\tau,h}) \cdot \Big[ r_{\tau,h} + \max_{a} Q_{m,k+1,h+1}(s_{\tau,h+1}, a) \Big]. \tag{4.1}$$

In the above summation we use $\tau \in \mathcal{D}_{m,k,h}$ to denote the collection of indices of all the episodes whose data is available to agent $m$ by the end of episode $k$. Recall from Line 11 of Algorithm 2 that the original definition of $\mathcal{D}_{m,k,h}$ is the dataset of trajectories available to agent $m$. Here we slightly abuse the definition of $\mathcal{D}_{m,k,h}$ to reflect the fact that $\mathbf{w}_{m,k+1,h}$ is computed using only the trajectories available to agent $m$ by the end of episode $k$. The Q-function estimate for step $h$ is given by

$$Q_{m,k+1,h}(\cdot,\cdot) = \big[ \phi(\cdot,\cdot)^\top \mathbf{w}_{m,k+1,h} + \Gamma_{m,k+1,h}(\cdot,\cdot) \big]_{[0,\, H-h+1]}, \tag{4.2}$$

where $\Gamma_{m,k+1,h}$ is a bonus term that ensures the optimism of $Q_{m,k+1,h}$, and is defined by

$$\Gamma_{m,k+1,h}(\cdot,\cdot) = \beta \cdot \|\phi(\cdot,\cdot)\|_{\boldsymbol{\Lambda}_{m,k+1,h}^{-1}}. \tag{4.3}$$
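As a concrete (hypothetical) rendering of this update, the following numpy sketch computes the ridge solution of (4.1) and the bonus of (4.3) for a single step $h$. The truncation to $[0, H-h+1]$ from (4.2) and all bookkeeping across agents are omitted, and the function names are ours, not the paper's.

```python
import numpy as np

def lsvi_update(Phi, rewards, next_values, lam=1.0, beta=1.0):
    """One backward LSVI step, a sketch of (4.1) and (4.3).

    Phi:         (n, d) array; rows are phi(s_tau, a_tau) for the
                 episodes available to the agent
    rewards:     (n,) array of r_tau
    next_values: (n,) array of max_a Q_{h+1}(s_{tau, h+1}, a)
    Returns the ridge-regression weights w and a bonus function.
    """
    n, d = Phi.shape
    Lambda = lam * np.eye(d) + Phi.T @ Phi        # covariance matrix
    targets = rewards + next_values               # regression targets
    w = np.linalg.solve(Lambda, Phi.T @ targets)  # w = Lambda^{-1} sum phi * target
    Lambda_inv = np.linalg.inv(Lambda)

    def bonus(phi):
        # beta * ||phi||_{Lambda^{-1}}, the optimism bonus of (4.3)
        return beta * np.sqrt(phi @ Lambda_inv @ phi)

    return w, bonus
```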

Otherwise if the criterion is not satisfied, then the local data as well as the Q-function estimates remain unchanged for this agent (Line 27).

Discussion on The Criterion.

The determinant-based criterion is a common and important technique in single-agent contextual bandits (Abbasi-Yadkori et al., 2011; Ruan et al., 2021) and RL with linear function approximation (Zhou et al., 2021c; Wang et al., 2021; Min et al., 2022a), where it is often used to reduce the policy switching cost. In our multi-agent linear MDP setting, we instead apply it to determine the appropriate time for communication and the corresponding policy update. A similar idea has been adopted by other works on multi-agent bandit and RL problems (Wang et al., 2020; Li & Wang, 2022; He et al., 2022a; Dubey & Pentland, 2020, 2021).

In our Algorithm 2, the criterion is adjusted by a parameter $\alpha$ that controls the communication frequency: a larger $\alpha$ indicates less frequent communication, and a smaller $\alpha$ implies otherwise. As will be shown in Section 5, $\alpha$ determines the trade-off between the total number of communication rounds (or, equivalently, policy updates) and the total regret of Algorithm 2. With a proper choice of $\alpha$, we show that our algorithm achieves a regret nearly identical to that under the single-agent setting (Jin et al., 2020) at a low communication complexity which depends only logarithmically on $K$.
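To illustrate the trigger numerically, here is a toy sketch (`should_communicate` is our name, mirroring the condition on Line 14 of Algorithm 2):

```python
import numpy as np

def should_communicate(Lambda, Lambda_loc, alpha):
    """Fire when the local data would grow the covariance determinant
    by more than a (1 + alpha) factor, as on Line 14 of Algorithm 2."""
    ratio = np.linalg.det(Lambda + Lambda_loc) / np.linalg.det(Lambda)
    return ratio > 1 + alpha

# Toy d = 2 example: one new observation along the first coordinate.
phi = np.array([1.0, 0.0])
Lambda_loc = np.outer(phi, phi)

print(should_communicate(np.eye(2), Lambda_loc, alpha=0.5))             # True: ratio 2 > 1.5
print(should_communicate(np.diag([10.0, 1.0]), Lambda_loc, alpha=0.5))  # False: ratio 1.1 < 1.5
```

The second call shows why communication becomes rare over time: once a direction is well explored, further observations along it barely change the determinant.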

Algorithm 2 Asynchronous Multi-agent LSVI
1: Input: number of episodes $K$, $\beta$, $\alpha$
2: Initialize: $\boldsymbol{\Lambda}_{m,1,h} \leftarrow \lambda \mathbf{I}_{d \times d}$, $\mathbf{w}_{m,1,h} \leftarrow \mathbf{0}$, $Q_{m,1,h} \leftarrow \big[\beta [\phi(\cdot,\cdot)^\top (\boldsymbol{\Lambda}_{m,1,h})^{-1} \phi(\cdot,\cdot)]^{1/2}\big]_{[0,H-h+1]}$, $\boldsymbol{\Lambda}^{\mathrm{loc}}_{m,0,h} \leftarrow \mathbf{0}$, $\mathcal{D}_{m,0,h} \leftarrow \emptyset$, $\forall (m,h) \in [M] \times [H]$
3: for $k = 1, \ldots, K$ do
4:   Agent $m = m_k$ is active
5:   Receive $s_{k,1}$ from the environment
6:   for $h = 1, \ldots, H$ do
7:     $\pi_{m,k,h}(\cdot) \leftarrow \mathrm{argmax}_{a \in \mathcal{A}} Q_{m,k,h}(\cdot, a)$
8:     Take action $a_{k,h} \sim \pi_{m,k,h}(\cdot \mid s_{k,h})$
9:     Receive next state $s_{k,h+1} \sim \mathbb{P}_h(\cdot \mid s_{k,h}, a_{k,h})$
10:     Receive reward $r_{k,h}$
11:     $\mathcal{D}_{m,k,h} \leftarrow \mathcal{D}_{m,k-1,h} \cup \{s_{k,h}, a_{k,h}, s_{k,h+1}, r_{k,h}\}$
12:     $\boldsymbol{\Lambda}^{\mathrm{loc}}_{m,k,h} \leftarrow \boldsymbol{\Lambda}^{\mathrm{loc}}_{m,k-1,h} + \phi_{k,h}\phi_{k,h}^\top$
13:   end for
14:   if $\exists h$ s.t. $\det\big(\boldsymbol{\Lambda}_{m,k,h} + \boldsymbol{\Lambda}^{\mathrm{loc}}_{m,k,h}\big) / \det\big(\boldsymbol{\Lambda}_{m,k,h}\big) > 1 + \alpha$ then
15:     $Q_{m,k+1,H+1} \leftarrow 0$
16:     for $h = H, H-1, \ldots, 1$ do
17:       Agent $m$ sends local data to SERVER:
18:       $\boldsymbol{\Lambda}^{\mathrm{ser}}_{k,h} \leftarrow \boldsymbol{\Lambda}^{\mathrm{ser}}_{k,h} + \boldsymbol{\Lambda}^{\mathrm{loc}}_{m,k,h}$, $\mathcal{D}^{\mathrm{ser}}_{k,h} \leftarrow \mathcal{D}^{\mathrm{ser}}_{k,h} \cup \mathcal{D}_{m,k,h}$
19:       SERVER sends global data back to agent $m$:
20:       $\boldsymbol{\Lambda}_{m,k+1,h} \leftarrow \boldsymbol{\Lambda}^{\mathrm{ser}}_{k,h}$, $\mathcal{D}_{m,k,h} \leftarrow \mathcal{D}^{\mathrm{ser}}_{k,h}$
21:       Update estimate $\mathbf{w}_{m,k+1,h}$ by (4.1)
22:       Update bonus $\Gamma_{m,k+1,h}$ by (4.3)
23:       Update Q-function estimate $Q_{m,k+1,h}$ by (4.2)
24:     end for
25:     Reset $\boldsymbol{\Lambda}^{\mathrm{loc}}_{m,k,h} \leftarrow \mathbf{0}$, $\forall h \in [H]$
26:   else
27:     $Q_{m,k+1,h} \leftarrow Q_{m,k,h}$, $\boldsymbol{\Lambda}_{m,k+1,h} \leftarrow \boldsymbol{\Lambda}_{m,k,h}$, $\boldsymbol{\Lambda}^{\mathrm{ser}}_{k+1,h} \leftarrow \boldsymbol{\Lambda}^{\mathrm{ser}}_{k,h}$, $\mathcal{D}^{\mathrm{ser}}_{k+1,h} \leftarrow \mathcal{D}^{\mathrm{ser}}_{k,h}$, $\forall h \in [H]$
28:   end if
29:   for all other inactive agents $m' \ne m$ do
30:     $Q_{m',k+1,h} \leftarrow Q_{m',k,h}$, $\boldsymbol{\Lambda}_{m',k+1,h} \leftarrow \boldsymbol{\Lambda}_{m',k,h}$, $\forall h \in [H]$
31:   end for
32: end for

4.2 Extension to Other MDP Settings

We remark that though designed for linear MDPs, Algorithm 2 can be easily extended to other MDP settings.

Note that for any tabular MDP with small state and action spaces, we can represent it using the one-hot feature mapping as discussed in Example 2.1 in Jin et al. (2020). Thus Algorithm 2 can be applied directly to tabular MDPs, and the determinant-based criterion in Line 14 would become a criterion based on the visitation count for every state-action pair. However, we anticipate that Theorem 5.1 would in this case produce a regret upper bound that is suboptimal in $|\mathcal{S}|$ and $|\mathcal{A}|$, possibly because the one-hot feature mapping is not a good representation.

Moreover, the algorithm design can also be applied to linear mixture MDPs, where we would have a ternary feature mapping $\phi: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^d$. Then, for example, the UCRL-VTR algorithm (Jia et al., 2020; Ayoub et al., 2020) can be adapted to the asynchronous cooperative setting by designing a similar communication criterion based on the covariance matrix defined over $\phi_V(\cdot,\cdot) := \sum_{s \in \mathcal{S}} \phi(s, \cdot, \cdot) V(s)$, instead of the $\boldsymbol{\Lambda}$ defined over only the feature mapping in our case of linear MDPs.
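For intuition, the aggregated feature $\phi_V$ can be sketched in a few lines for a finite state space (names are illustrative; for simplicity, $\phi$ here returns a plain list):

```python
def phi_V(phi, V, states, s, a):
    """Sketch of phi_V(s, a) = sum_{s'} phi(s, a, s') * V(s') for a
    linear mixture MDP over a finite state space.
    phi(s, a, s_next) -> length-d feature list; V(s_next) -> float."""
    d = len(phi(s, a, states[0]))
    out = [0.0] * d
    for s_next in states:
        feat = phi(s, a, s_next)
        for i in range(d):
            out[i] += feat[i] * V(s_next)
    return out
```

For instance, if $\phi(s, a, s')$ one-hot encodes $s'$, then $\phi_V(s, a)$ is simply the vector $(V(s'))_{s'}$ of next-state values.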

There have also been other more advanced algorithms for linear MDPs (Hu et al., 2022; He et al., 2022b; Agarwal et al., 2022) that exploit variance information to further reduce the dependence of the regret bound on problem parameters. We leave it as future work to develop and analyze variance-aware variants of Algorithm 2.

5 Main Results

We now present the main theoretical results. We provide the regret upper bound for Algorithm 2 in Theorem 5.1, and compare it with existing related results. Then as a complement, in Theorem 5.5, we provide a lower bound on the communication complexity for cooperative linear MDPs.

5.1 Regret Upper Bound

The following theorem provides the regret upper bound of Algorithm 2.

Theorem 5.1 (Regret Upper Bound).

Under Assumption 3.1, there exists some universal constant $c_\beta$ such that by choosing

$$\beta = c_\beta \, dH\tilde{C}\left[\sqrt{\log\left(2 + \frac{K}{\delta \min\{1, \lambda, \alpha\lambda\}}\right)} + \sqrt{\log\big(Hd\tilde{C}\big)}\right],$$

where $\tilde{C} := \sqrt{M\alpha} + \sqrt{1 + M\alpha}$, then with probability at least $1 - \delta$, the regret of Algorithm 2 can be bounded as

$$\mathcal{O}\left(\beta \sqrt{1 + M\alpha} \cdot H \sqrt{dK \log\big(2dK/(\min\{1, \lambda\}\delta)\big)}\right).$$

Moreover, the communication complexity and the policy switching cost (in number of rounds) are upper bounded by

$$\mathcal{O}\big(dH(M + 1/\alpha)\log(1 + K/(\lambda d))\big).$$

Remark 5.2.

Theorem 5.1 indicates that by setting the parameters $\alpha = 1/M^2$ and $\lambda = 1$ in Algorithm 2, the regret upper bound can be simplified to $\tilde{\mathcal{O}}(d^{3/2}H^2\sqrt{K})$, and the communication complexity is bounded by $\tilde{\mathcal{O}}(dHM^2)$.

Remark 5.3.

We compare our upper bound with the best known result by Dubey & Pentland (2021). In their Theorem 1, Dubey & Pentland (2021) present an $\tilde{\mathcal{O}}(d^{3/2}H^2\sqrt{MT})$ regret upper bound for the homogeneous-agent setting (i.e., the same transition kernel and reward functions shared among all agents), which is identical to the multi-agent linear MDP setting considered in our paper. However, their communication protocol is synchronous in a round-robin fashion (see line 4 of their Algorithm 1), and thus $MT$ under their setting is equal to our $K$. Therefore, by Theorem 5.1, our regret upper bound for the asynchronous setting matches that for the synchronous setting.

Remark 5.4.

Our result generalizes that of the multi-agent linear bandit setting (He et al., 2022a) and the single-agent linear MDP setting (Jin et al., 2020). Specifically, with $H = 1$, our upper bound becomes $\tilde{\mathcal{O}}(d\sqrt{K})$ and the number of communication rounds becomes $\tilde{\mathcal{O}}(dM^2)$. Note that we save a $\sqrt{d}$ factor in the regret bound compared to the original $d^{3/2}$ dependence from Theorem 5.1, since there is no covering issue when $H = 1$. Both the regret and the communication complexity reduce to those under the bandit setting (He et al., 2022a). When $M = 1$, our regret reduces to $\tilde{\mathcal{O}}(d^{3/2}H^2\sqrt{K})$, matching that of Jin et al. (2020).

5.2 Regret Lower Bound

Theorem 5.5 (Regret Lower Bound).

Suppose $d, H \ge 2$ and the number of episodes satisfies $K \ge dM$. Then for any algorithm Alg with expected communication complexity less than $dM/11400$, there exists a linear MDP such that the expected regret of Alg is at least $\Omega(H\sqrt{dMK})$.

Remark 5.6.

Theorem 5.5 suggests that, for any algorithm Alg with communication complexity $o(dM)$, the regret is no better than $\Omega(\sqrt{MK})$. On the other hand, if each agent performs the LSVI-UCB algorithm (Jin et al., 2020) on its own, the total regret of the $M$ agents is upper bounded by $\sum_{m=1}^{M} \tilde{O}(\sqrt{K_m}) = \tilde{O}(\sqrt{MK})$, where $K_m$ is the number of episodes in which agent $m$ is active. Thus, in order to improve the performance through collaboration and remove the dependency on the number of agents $M$, an $\Omega(dM)$ communication complexity is necessary.
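The bound $\sum_{m=1}^{M} \tilde{O}(\sqrt{K_m}) = \tilde{O}(\sqrt{MK})$ hides a standard Cauchy–Schwarz step (our own elaboration, not part of the paper's statement):

$$\sum_{m=1}^{M} \sqrt{K_m} \le \sqrt{M \sum_{m=1}^{M} K_m} = \sqrt{MK},$$

with equality when the episodes are split evenly across agents, i.e., $K_m = K/M$.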

Remark 5.7.

Though Theorem 5.5 requires the number of stages $H \ge 2$, it is not difficult to extend the result to $H = 1$ with a stochastic reward function $r_h(s, a)$. In this situation, Theorem 5.5 reduces to a bandit problem with adversarial contexts, and improves the communication lower bound in He et al. (2022a) by a factor of $d$. We also compare our result with the communication lower bound in Amani et al. (2022). That work measures the communication complexity in bits, which is strictly larger than the round count under our definition, and also provides an $\Omega(dM)$ communication complexity (in bits) for stochastic contexts.

6 Overview of the Analysis

In this section, we discuss the technical challenges of analyzing Algorithm 2 and our solutions.

6.1 Technical Challenges

The asynchronous communication protocol causes a unique challenge in the theoretical analysis. To illustrate the challenge, let us first recall the synchronous setting with round-robin-type participation, as studied in Dubey & Pentland (2021). Note that under this setting, the order of participation is fixed. This implies that if $\phi_{k,h}$ is uploaded to the server, then for all $k' < k$, the vectors $\phi_{k',h}$ must also have already been uploaded to the server. In sharp contrast, this important condition is no longer satisfied under the asynchronous setting. We name the violation of this condition the information asymmetry issue.

Technically, this issue causes two consequences. Recall from the analysis of LSVI-UCB that the final regret bound depends on two technical lemmas: the concentration of self-normalized martingales, and the elliptical potential lemma (Abbasi-Yadkori et al., 2011; Jin et al., 2020). The first lemma determines the width of the confidence region (i.e., $\beta$), and the second is crucial for bounding the sum of the bonus terms (i.e., $\sum_{k=1}^{K} \|\phi_{k,h}\|_{\boldsymbol{\Lambda}_{m,k,h}^{-1}}$). Both lemmas require a well-defined and fixed order of the $\phi_{k,h}$ vectors in the collected data in order to be applied. Unfortunately, such an order does not exist in the asynchronous setting, since the data receiving process of the server and every agent is stochastic. The analysis of this stochastic process is also prohibitive because our agents have full freedom to decide whether to participate. Therefore, this arbitrary pattern in the data collected by Algorithm 2 forbids the direct application of these two tools.

To address this information asymmetry issue, we develop a novel form of the self-normalized martingale concentration lemma (Lemma B.4) and an asynchronous elliptical potential lemma (Lemma 6.4). The main idea is a refined comparison between the local covariance matrix $\boldsymbol{\Lambda}_{m,k,h}$ and the universal covariance matrix under the partial ordering defined by positive semi-definiteness. In the remainder of this section, we overview some key steps and define the important quantities behind the upper bound in Theorem 5.1. Full details are included in Appendix B.
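To make the issue concrete, here is a minimal numpy sketch (ours, not from the paper; the sizes and the 60% availability rate are arbitrary) of why the positive semi-definite ordering is the right tool: a local covariance matrix built from a random subset of the data is dominated by the full-information matrix in the PSD order, so bonuses computed locally over-explore but never under-explore.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, lam = 4, 200, 1.0
phis = rng.normal(size=(K, d)) / np.sqrt(d)  # feature vectors with small norm

# Full-information covariance matrix: lambda * I + sum of all outer products.
Lambda_all = lam * np.eye(d) + phis.T @ phis

# Asynchronous local covariance: a random subset of episodes is missing,
# modeling data that has not yet been communicated to this agent.
mask = rng.random(K) < 0.6
Lambda_loc = lam * np.eye(d) + phis[mask].T @ phis[mask]

# Lambda_all - Lambda_loc is a sum of outer products, hence PSD:
assert np.linalg.eigvalsh(Lambda_all - Lambda_loc).min() >= -1e-10

# Consequently the local bonus dominates the full-information bonus pointwise:
phi = phis[-1]
bonus_loc = np.sqrt(phi @ np.linalg.solve(Lambda_loc, phi))
bonus_all = np.sqrt(phi @ np.linalg.solve(Lambda_all, phi))
assert bonus_loc >= bonus_all
```

The hard part, which the comparison lemmas below address, is controlling how much larger the local bonus can be.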

6.2 Key Ingredients of the Proof

For any agent $m \in [M]$ and episode $k \in [K]$, define the following indices:

- $m_k$: the active agent of episode $k$. Note that in Algorithm 2 we write $m$ instead of $m_k$ due to space limits.
- $t_k(m)$: for any agent $m$, $t_k(m) \le k$ is the last episode by the end of which agent $m$ adopts a newly updated policy. If no policy update has occurred by the end of episode $k$, then by default $t_k(m) = 1$.

For the participating agent $m_k$ in episode $k$, its adopted Q-function $Q_{m_k,k,h}$ equals $Q_{m_k,t_k,h}$ for all $h \in [H]$, according to the definition. In the following, we write $t_k = t_k(m_k)$ whenever there is no confusion. A comprehensive clarification of notation is provided in Appendix A.

Regret Decomposition.

By Definition 3.2, the regret is

$$\begin{aligned}
R(K) &:= \sum_{k=1}^{K}\bigl[V_1^{\star}(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1})\bigr] \\
&\le \sum_{k=1}^{K}\bigl[V_{m_k,k,1}(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1})\bigr] \qquad (6.1) \\
&= \sum_{k=1}^{K}\bigl[V_{m_k,t_k(m_k),1}(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1})\bigr]. \qquad (6.2)
\end{aligned}$$

The inequality is from the following optimism property, which is standard for UCB-type algorithms.

Lemma 6.1 (Optimism).

Under the setting of Theorem 5.1, on the event of Lemma B.5, for all 𝑘 ∈ [ 𝐾 ] , ℎ ∈ [ 𝐻 ] , and ( 𝑠 , 𝑎 ) ∈ 𝒮 × 𝒜 , we have 𝑄 ℎ ⋆ ⁢ ( 𝑠 , 𝑎 ) ≤ 𝑄 𝑚 𝑘 , 𝑘 , ℎ ⁢ ( 𝑠 , 𝑎 ) .

Proof of Lemma 6.1.

See Appendix C.2. ∎

The last step in (6.2) follows from the definition of $t_k(m_k)$. We then further decompose the terms as

$$\begin{aligned}
&V_{m_k,t_k,h}(s_{k,h}) - V_h^{\pi_{m_k,k}}(s_{k,h}) \\
&\quad= Q_{m_k,t_k,h}(s_{k,h}, a_{k,h}) - Q_h^{\pi_{m_k,k}}(s_{k,h}, a_{k,h}) \\
&\quad\le \phi_{k,h}^{\top}\mathbf{w}_{m_k,t_k,h} + \beta\sqrt{\phi_{k,h}^{\top}\boldsymbol{\Lambda}_{m_k,t_k,h}^{-1}\phi_{k,h}} - \phi_{k,h}^{\top}\mathbf{w}_h^{\pi_{m_k,k}}.
\end{aligned}$$

To analyze the above, we establish the following result.

Lemma 6.2.

Suppose we choose $\beta$ as

$$\beta = c_{\beta} H d\,\tilde{C}\sqrt{\log\Bigl(2 + \tfrac{K}{\delta\min\{1,\lambda,\alpha\lambda\}}\Bigr) + \log\bigl(Hd\tilde{C}\bigr)},$$

where $\tilde{C} := \sqrt{M\alpha} + \sqrt{1 + M\alpha}$ and $c_{\beta}$ is some universal constant. For any fixed policy $\pi$, on the event of Lemma B.5, for any $k \in [K]$, $h \in [H]$, and $(s,a) \in \mathcal{S} \times \mathcal{A}$, it holds that

$$\bigl|\phi(s,a)^{\top}\bigl(\mathbf{w}_{m_k,t_k,h} - \mathbf{w}_h^{\pi}\bigr) - \mathbb{P}_h\bigl[V_{m_k,t_k,h+1} - V_{h+1}^{\pi}\bigr](s,a)\bigr| \le \beta\sqrt{\phi(s,a)^{\top}\boldsymbol{\Lambda}_{m_k,t_k,h}^{-1}\phi(s,a)}.$$

Proof of Lemma 6.2.

See Appendix C.1. ∎

Taming Information Asymmetry. Lemma 6.2 serves a purpose similar to Lemma B.4 in Jin et al. (2020). However, its proof is more involved due to the discrepancy between $\boldsymbol{\Lambda}_{m_k,t_k,h}$ and $\lambda\mathbf{I} + \sum_{k'=1}^{k-1}\phi_{k',h}\phi_{k',h}^{\top}$ under the asynchronous setting. In other words, a random proportion of information is missing from the covariance matrix $\boldsymbol{\Lambda}_{m_k,t_k,h}$, which causes the information asymmetry issue.

To circumvent this issue, we establish a delicate comparison of the covariance matrices (B.3) via several auxiliary matrices (Appendix A.1). By doing so, we can bound the discrepancy between each $\boldsymbol{\Lambda}_{m_k,t_k,h}$ and the full information matrix $\lambda\mathbf{I} + \sum_{k'=1}^{k-1}\phi_{k',h}\phi_{k',h}^{\top}$, and then apply the classical concentration argument for self-normalized martingales.

Lemma 6.2 further allows us to apply the standard recursive relation for LSVI-type algorithms (Jin et al., 2020).

Lemma 6.3 (Recursion).

Define $\xi_{k,h} := V_{m_k,t_k,h}(s_{k,h}) - V_h^{\pi_{m_k,k}}(s_{k,h})$. On the event of Lemma B.5, it holds that

$$\xi_{k,h} \le \xi_{k,h+1} + \bigl(\mathbb{E}[\xi_{k,h+1} \mid s_{k,h}, a_{k,h}] - \xi_{k,h+1}\bigr) + 2\beta\sqrt{\phi_{k,h}^{\top}\boldsymbol{\Lambda}_{m_k,t_k,h}^{-1}\phi_{k,h}}.$$

Proof of Lemma 6.3.

See Appendix C.3. ∎

Asynchronous Elliptical Potential Lemma. Finally, Lemma 6.3 allows us to bound the regret by the sum of the bonus terms. However, the standard elliptical potential lemma (Abbasi-Yadkori et al., 2011) still does not apply due to the information asymmetry issue. To this end, we propose an asynchronous elliptical potential lemma to facilitate the analysis.

Lemma 6.4 (Asynchronous Elliptical Potential).

Let

$$B_k := \sum_{h=1}^{H} 2\beta\sqrt{\phi(s_{k,h}, a_{k,h})^{\top}\boldsymbol{\Lambda}_{m_k,t_k,h}^{-1}\phi(s_{k,h}, a_{k,h})}.$$

Under the same assumptions as Theorem 5.1, it holds that

$$\sum_{k=1}^{K}\min\Bigl\{V_{m_k,t_k(m_k),1}(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1}),\ B_k\Bigr\} \le \mathcal{O}\Bigl(\beta\sqrt{1+M\alpha}\, H\sqrt{dK\log\bigl(2dK/(\min\{1,\lambda\}\delta)\bigr)}\Bigr).$$

Proof of Lemma 6.4.

See Appendix C.8. ∎
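For intuition, the following numpy sketch (our illustration, not from the paper; the problem sizes are arbitrary) numerically checks the classical synchronous elliptical potential bound that Lemma 6.4 generalizes: with features of norm at most one, the clipped bonus sum is at most $2d\log(1 + K/(\lambda d))$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, lam = 5, 500, 1.0
Lambda = lam * np.eye(d)
total = 0.0
for _ in range(K):
    phi = rng.normal(size=d)
    phi /= max(1.0, np.linalg.norm(phi))  # enforce ||phi|| <= 1
    # Clipped squared bonus at this step, then rank-one update of Lambda.
    total += min(1.0, phi @ np.linalg.solve(Lambda, phi))
    Lambda += np.outer(phi, phi)

# Classical elliptical potential bound (Abbasi-Yadkori et al., 2011):
#   sum_k min{1, ||phi_k||^2_{Lambda_k^{-1}}} <= 2 d log(1 + K / (lam * d)).
bound = 2 * d * np.log(1 + K / (lam * d))
assert 0.0 < total <= bound
```

In the asynchronous setting the matrix used at episode $k$ is not this running $\Lambda_k$ but a lagged, agent-dependent version, which is where the extra $\sqrt{1+M\alpha}$ factor in Lemma 6.4 enters.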

With all the above steps, we can finally establish the regret upper bound in Theorem 5.1. The remaining details are provided in Appendix B.

7 Conclusion and Future Work

In this paper, we propose a provably efficient algorithm for cooperative multi-agent RL with asynchronous communication, together with a novel theoretical analysis that resolves the information asymmetry challenge induced by asynchronous communication. We also provide a lower bound on the communication complexity required in this setting.

There are several possible directions for future work. First, the current lower bound in Theorem 5.5 is arguably not tight, and a novel construction is needed to exhibit the fundamental trade-off between reducing communication complexity and lowering regret. Second, for the asynchronous communication model, it is important to incorporate other aspects such as agent heterogeneity, environment non-stationarity, privacy, and more general function approximation.

Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers and area chair for their helpful comments. JH and QG are partially supported by the National Science Foundation CAREER Award 1906169 and the Sloan Research Fellowship. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24:2312–2320, 2011. Agarwal et al. (2022) Agarwal, A., Jin, Y., and Zhang, T. VOQL: Towards optimal regret in model-free rl with nonlinear function approximation. arXiv preprint arXiv:2212.06069, 2022. Amani et al. (2022) Amani, S., Lattimore, T., György, A., and Yang, L. F. Distributed contextual linear bandits with minimax optimal communication cost. arXiv preprint arXiv:2205.13170, 2022. Ayoub et al. (2020) Ayoub, A., Jia, Z., Szepesvari, C., Wang, M., and Yang, L. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pp. 463–474. PMLR, 2020. Bazzan (2009) Bazzan, A. L. Opportunities for multiagent systems and multiagent reinforcement learning in traffic control. Autonomous Agents and Multi-Agent Systems, 18(3):342–375, 2009. Berner et al. (2019) Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019. Bottou (2010) Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010: 19th International Conference on Computational Statistics, Paris, France, August 22-27, 2010, Keynote, Invited and Contributed Papers, pp. 177–186. Springer, 2010. Bradtke & Barto (1996) Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1):33–57, 1996. Cai et al. (2020) Cai, Q., Yang, Z., Jin, C., and Wang, Z. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pp. 1283–1294. PMLR, 2020. Clemente et al. (2017) Clemente, A. V., Castejón, H. N., and Chandra, A. Efficient parallel methods for deep reinforcement learning.
arXiv preprint arXiv:1705.04862, 2017. Dean et al. (2012) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., et al. Large scale distributed deep networks. Advances in neural information processing systems, 25, 2012. Ding et al. (2020) Ding, G., Koh, J. J., Merckaert, K., Vanderborght, B., Nicotra, M. M., Heckman, C., Roncone, A., and Chen, L. Distributed reinforcement learning for cooperative multi-robot object manipulation. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp.  1831–1833, 2020. Ding et al. (2022) Ding, G., Li, Z., Wu, Y., Yang, X., Aliasgari, M., and Xu, H. Towards an efficient client selection system for federated learning. In 15th International Conference on Cloud Computing, CLOUD 2022, pp.  13–21. Springer, 2022. Dubey & Pentland (2020) Dubey, A. and Pentland, A. Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33:6003–6014, 2020. Dubey & Pentland (2021) Dubey, A. and Pentland, A. Provably efficient cooperative multi-agent reinforcement learning with function approximation. arXiv preprint arXiv:2103.04972, 2021. Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International conference on machine learning, pp. 1407–1416. PMLR, 2018. Fan et al. (2021) Fan, X., Ma, Y., Dai, Z., Jing, W., Tan, C., and Low, B. K. H. Fault-tolerant federated reinforcement learning with theoretical guarantee. Advances in Neural Information Processing Systems, 34:1007–1021, 2021. Fei & Xu (2022) Fei, Y. and Xu, R. Cascaded gaps: Towards logarithmic regret for risk-sensitive reinforcement learning. In International Conference on Machine Learning, pp. 6392–6417. PMLR, 2022. Grounds & Kudenko (2007) Grounds, M. and Kudenko, D. 
Parallel reinforcement learning with linear function approximation. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems, pp.  1–3, 2007. He et al. (2021) He, J., Zhou, D., and Gu, Q. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning. PMLR, 2021. He et al. (2022a) He, J., Wang, T., Min, Y., and Gu, Q. A simple and provably efficient algorithm for asynchronous federated contextual linear bandits. In Advances in Neural Information Processing Systems, 2022a. He et al. (2022b) He, J., Zhao, H., Zhou, D., and Gu, Q. Nearly minimax optimal reinforcement learning for linear markov decision processes. arXiv preprint arXiv:2212.06132, 2022b. Hoffman et al. (2020) Hoffman, M. W., Shahriari, B., Aslanides, J., Barth-Maron, G., Momchev, N., Sinopalnikov, D., Stańczyk, P., Ramos, S., Raichuk, A., Vincent, D., et al. Acme: A research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979, 2020. Horgan et al. (2018) Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018. Hu et al. (2022) Hu, P., Chen, Y., and Huang, L. Nearly minimax optimal reinforcement learning with linear function approximation. In International Conference on Machine Learning, pp. 8971–9019. PMLR, 2022. Jaderberg et al. (2019) Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019. Jia et al. (2020) Jia, Z., Yang, L., Szepesvari, C., and Wang, M. Model-based reinforcement learning with value-targeted regression. In Learning for Dynamics and Control, pp.  666–686. PMLR, 2020. Jin et al. 
(2020) Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp.  2137–2143. PMLR, 2020. Jin et al. (2022) Jin, H., Peng, Y., Yang, W., Wang, S., and Zhang, Z. Federated reinforcement learning with environment heterogeneity. In International Conference on Artificial Intelligence and Statistics, pp.  18–37. PMLR, 2022. Kalai & Vempala (2005) Kalai, A. and Vempala, S. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005. Khodadadian et al. (2022) Khodadadian, S., Sharma, P., Joshi, G., and Maguluri, S. T. Federated reinforcement learning: Linear speedup under markovian sampling. In International Conference on Machine Learning, pp. 10997–11057. PMLR, 2022. Kim et al. (2021) Kim, Y., Yang, I., and Jun, K.-S. Improved regret analysis for variance-adaptive linear bandits and horizon-free linear mixture mdps. arXiv preprint arXiv:2111.03289, 2021. Kretchmar (2002) Kretchmar, R. M. Parallel reinforcement learning. In The 6th World Conference on Systemics, Cybernetics, and Informatics. Citeseer, 2002. Kuba et al. (2022) Kuba, J. G., Chen, R., Wen, M., Wen, Y., Sun, F., Wang, J., and Yang, Y. Trust region policy optimisation in multi-agent reinforcement learning. In International Conference on Learning Representations, 2022. Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020. Li & Wang (2022) Li, C. and Wang, H. Asynchronous upper confidence bound algorithms for federated linear bandits. In International Conference on Artificial Intelligence and Statistics, pp.  6529–6553. PMLR, 2022. Li et al. (2014) Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. Advances in Neural Information Processing Systems, 27, 2014. Liang et al. 
(2018) Liang, E., Liaw, R., Nishihara, R., Moritz, P., Fox, R., Goldberg, K., Gonzalez, J., Jordan, M., and Stoica, I. Rllib: Abstractions for distributed reinforcement learning. In International Conference on Machine Learning, pp. 3053–3062. PMLR, 2018. Littman & Boyan (2013) Littman, M. and Boyan, J. A distributed reinforcement learning scheme for network routing. In Proceedings of the international workshop on applications of neural networks to telecommunications, pp.  55–61. Psychology Press, 2013. Liu et al. (2019) Liu, B., Wang, L., and Liu, M. Lifelong federated reinforcement learning: a learning architecture for navigation in cloud robotic systems. IEEE Robotics and Automation Letters, 4(4):4555–4562, 2019. Liu et al. (2020) Liu, D., Cui, Y., Cao, Z., and Chen, Y. Indoor navigation for mobile agents: A multimodal vision fusion model. In 2020 International Joint Conference on Neural Networks (IJCNN), pp.  1–8. IEEE, 2020. Liu et al. (2022) Liu, D., Shah, V., Boussif, O., Meo, C., Goyal, A., Shu, T., Mozer, M., Heess, N., and Bengio, Y. Stateful active facilitator: Coordination and environmental heterogeneity in cooperative multi-agent reinforcement learning. arXiv preprint arXiv:2210.03022, 2022. Lowe et al. (2017) Lowe, R., Wu, Y. I., Tamar, A., Harb, J., Pieter Abbeel, O., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017. Lu et al. (2023) Lu, M., Min, Y., Wang, Z., and Yang, Z. Pessimism in the face of confounders: Provably efficient offline reinforcement learning in partially observable markov decision processes. In International Conference on Learning Representations, 2023. Melo & Ribeiro (2007) Melo, F. S. and Ribeiro, M. I. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pp.  308–322. Springer, 2007. Min et al. (2021) Min, Y., Wang, T., Zhou, D., and Gu, Q. 
Variance-aware off-policy evaluation with linear function approximation. Advances in neural information processing systems, 34:7598–7610, 2021. Min et al. (2022a) Min, Y., He, J., Wang, T., and Gu, Q. Learning stochastic shortest path with linear function approximation. In International Conference on Machine Learning, pp. 15584–15629. PMLR, 2022a. Min et al. (2022b) Min, Y., Wang, T., Xu, R., Wang, Z., Jordan, M., and Yang, Z. Learn to match with no regret: Reinforcement learning in markov matching markets. In Advances in Neural Information Processing Systems, 2022b. Modi et al. (2020) Modi, A., Jiang, N., Tewari, A., and Singh, S. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pp.  2010–2020. PMLR, 2020. Nair et al. (2015) Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015. Neu & Pike-Burke (2020) Neu, G. and Pike-Burke, C. A unifying view of optimism in episodic reinforcement learning. Advances in Neural Information Processing Systems, 33, 2020. Qi et al. (2021) Qi, J., Zhou, Q., Lei, L., and Zheng, K. Federated reinforcement learning: techniques, applications, and open challenges. arXiv preprint arXiv:2108.11887, 2021. Ruan et al. (2021) Ruan, Y., Yang, J., and Zhou, Y. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp.  74–87, 2021. Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Verbraeken et al. (2020) Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., and Rellermeyer, J. S. A survey on distributed machine learning. Acm computing surveys (csur), 53(2):1–33, 2020. 
Vinyals et al. (2017) Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. Starcraft ii: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017. Wai et al. (2018) Wai, H.-T., Yang, Z., Wang, Z., and Hong, M. Multi-agent reinforcement learning via double averaging primal-dual optimization. Advances in Neural Information Processing Systems, 31, 2018. Wang et al. (2021) Wang, T., Zhou, D., and Gu, Q. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. In Advances in Neural Information Processing Systems, 2021. Wang et al. (2020) Wang, Y., Hu, J., Chen, X., and Wang, L. Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations, 2020. Williams et al. (2016) Williams, G., Drews, P., Goldfain, B., Rehg, J. M., and Theodorou, E. A. Aggressive driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp.  1433–1440. IEEE, 2016. Xu et al. (2023a) Xu, H., Liu, P., Guan, B., Wang, Q., Da Silva, D., and Hu, L. Achieving online and scalable information integrity by harnessing social spam correlations. IEEE Access, 2023a. Xu et al. (2023b) Xu, R., Min, Y., Wang, T., Jordan, M. I., Wang, Z., and Yang, Z. Finding regularized competitive equilibria of heterogeneous agent macroeconomic models via reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp.  375–407. PMLR, 2023b. Yang & Wang (2019) Yang, L. and Wang, M. Sample-optimal parametric q-learning using linearly additive features. In International Conference on Machine Learning, pp. 6995–7004, 2019. Ye et al. (2020) Ye, D., Chen, G., Zhang, W., Chen, S., Yuan, B., Liu, B., Chen, J., Liu, Z., Qiu, F., Yu, H., et al. 
Towards playing full moba games with deep reinforcement learning. Advances in Neural Information Processing Systems, 33:621–632, 2020. Yin et al. (2021) Yin, M., Bai, Y., and Wang, Y.-X. Near-optimal offline reinforcement learning via double variance reduction. In Advances in Neural Information Processing Systems, 2021. Yu et al. (2021) Yu, C., Velu, A., Vinitsky, E., Wang, Y., Bayen, A., and Wu, Y. The surprising effectiveness of ppo in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955, 2021. Yu et al. (2020) Yu, S., Chen, X., Zhou, Z., Gong, X., and Wu, D. When deep reinforcement learning meets federated learning: Intelligent multitimescale resource management for multiaccess edge computing in 5g ultradense network. IEEE Internet of Things Journal, 8(4):2238–2251, 2020. Yu et al. (2014) Yu, T., Wang, H., Zhou, B., Chan, K. W., and Tang, J. Multi-agent correlated equilibrium q ( 𝜆 ) learning for coordinated smart generation control of interconnected power grids. IEEE transactions on power systems, 30(4):1669–1679, 2014. Zanette et al. (2020) Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pp. 10978–10989. PMLR, 2020. Zhan et al. (2021) Zhan, C., Ghaderibaneh, M., Sahu, P., and Gupta, H. Deepmtl: Deep learning based multiple transmitter localization. In IEEE 22nd International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), 2021. doi: 10.1109/WoWMoM51794.2021.00017. Zhan et al. (2022) Zhan, C., Ghaderibaneh, M., Sahu, P., and Gupta, H. Deepmtl pro: Deep learning based multiple transmitter localization and power estimation. Pervasive and Mobile Computing, 2022. doi: 10.1016/j.pmcj.2022.101582. Zhang et al. (2018a) Zhang, K., Yang, Z., and Basar, T. Networked multi-agent reinforcement learning in continuous spaces. In 2018 IEEE conference on decision and control (CDC), pp. 2771–2776. IEEE, 2018a. 
Zhang et al. (2018b) Zhang, K., Yang, Z., Liu, H., Zhang, T., and Basar, T. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, pp. 5872–5881. PMLR, 2018b. Zhang et al. (2021) Zhang, Z., Yang, J., Ji, X., and Du, S. S. Variance-aware confidence set: Variance-dependent bound for linear bandits and horizon-free bound for linear mixture mdp. arXiv preprint arXiv:2101.12745, 2021. Zhang et al. (2022) Zhang, Z., Ji, X., and Du, S. Horizon-free reinforcement learning in polynomial time: the power of stationary policies. In Conference on Learning Theory, pp. 3858–3904. PMLR, 2022. Zhou & Gu (2022) Zhou, D. and Gu, Q. Computationally efficient horizon-free reinforcement learning for linear mixture mdps. arXiv preprint arXiv:2205.11507, 2022. Zhou et al. (2021a) Zhou, D., Gu, Q., and Szepesvari, C. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory. PMLR, 2021a. Zhou et al. (2021b) Zhou, D., He, J., and Gu, Q. Provably efficient reinforcement learning for discounted mdps with feature mapping. In International Conference on Machine Learning. PMLR, 2021b. Zhou et al. (2021c) Zhou, D., He, J., and Gu, Q. Provably efficient reinforcement learning for discounted mdps with feature mapping. In International Conference on Machine Learning, pp. 12793–12802. PMLR, 2021c. Zhuo et al. (2019) Zhuo, H. H., Feng, W., Lin, Y., Xu, Q., and Yang, Q. Federated deep reinforcement learning. arXiv preprint arXiv:1901.08277, 2019.

Appendix A Clarification of Notation

In this section, we give a comprehensive clarification of the notation used in the algorithm and the analysis.

Throughout the paper, we use 𝒪 ⁢ ( ⋅ ) to hide problem-independent universal constants and 𝒪 ~ ⁢ ( ⋅ ) to further hide logarithmic factors. We use ( ⋅ ) [ 𝑎 , 𝑏 ] to denote the truncation of values into the range [ 𝑎 , 𝑏 ] .

We also present the following table of notations. The 𝜋 in the superscript can be replaced by 𝜋 𝑘 or 𝜋 ⋆ , where the former refers to the policy in episode 𝑘 , and the latter refers to the optimal policy.

Table 2: Notation

| Notation | Meaning |
| --- | --- |
| $m_k$ | the active agent in episode $k$ |
| $\pi_{m,k} = \{\pi_{m,k,h}\}_{h=1}^{H}$ | the policy of agent $m$ at episode $k$ (regardless of agent $m$ being active or not) |
| $\{Q_{m,k,h}\}_{h=1}^{H}$ | Q-functions of agent $m$ at episode $k$ in Algorithm 2 |
| $\{V_{m,k,h}\}_{h=1}^{H}$ | value functions of agent $m$ at episode $k$ in Algorithm 2, where $V_{m,k,h}(\cdot) = \max_a Q_{m,k,h}(\cdot, a)$ |
| $\{V_h^{\pi}\}_{h=1}^{H}$ | value functions under policy $\pi$ |
| $\phi_{k,h}$ | $\phi_{k,h} = \phi(s_{k,h}, a_{k,h})$ for $k \in [K]$ and $h \in [H]$ |
| $\mathbf{w}_{m,k,h},\ \boldsymbol{\Lambda}_{m,k,h}$ | underlying parameter and covariance matrix of $Q_{m,k,h}$ in Algorithm 2 |

Indices of available episodes

Recall from lines 11 and 20 of Algorithm 2 that the original definition of $\mathcal{D}_{m,k,h}$ is the dataset of trajectories available to agent $m$ by the end of episode $k$. However, in our analysis we may use $\tau \in \mathcal{D}_{m,k,h}$ to denote the collection of indices of all episodes whose trajectories are available to agent $m$ by the end of episode $k$. For example, in (4.1), we sum over the indices $\tau \in \mathcal{D}_{m,k,h}$ to reflect that the parameter $\mathbf{w}_{m,k+1,h}$ is computed using only the trajectories available (either the agent's own local trajectories or downloaded ones) by the end of episode $k$. This is a slight abuse of the definition of $\mathcal{D}_{m,k,h}$; since the $\tau \in \mathcal{D}_{m,k,h}$ notation is only needed for summations of this kind in the proof, it will not cause further confusion.

Value and Q-functions

For any policy $\pi = \{\pi_h\}_{h=1}^{H}$, we define the corresponding value functions as

$$V_h^{\pi}(s) := \mathbb{E}\Bigl[\sum_{h'=h}^{H} r_{h'}\bigl(s_{h'}, \pi_{h'}(s_{h'})\bigr) \,\Big|\, s_h = s\Bigr], \quad \forall h \in [H].$$

The Q-functions are defined as

$$Q_h^{\pi}(s,a) := r_h(s,a) + \mathbb{E}\Bigl[\sum_{h'=h+1}^{H} r_{h'}\bigl(s_{h'}, \pi_{h'}(s_{h'})\bigr) \,\Big|\, s_h = s,\ a_h = a\Bigr], \quad \forall h \in [H].$$

Since the horizon $H$ is finite and the action space $\mathcal{A}$ is also finite, there exists some optimal policy $\pi^{\star}$ such that

$$V_h^{\pi^{\star}}(s) = \sup_{\pi} V_h^{\pi}(s).$$

We denote $V_h^{\pi^{\star}} = V_h^{\star}$. Furthermore, the above definitions imply the following Bellman equations:

$$Q_h^{\pi}(s,a) = r_h(s,a) + \mathbb{P}_h V_{h+1}^{\pi}(s,a), \qquad V_h^{\pi}(s) = Q_h^{\pi}\bigl(s, \pi_h(s)\bigr), \quad \forall h \in [H],$$

where $V_{H+1}^{\pi}(\cdot) = 0$.
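The Bellman equations above can be evaluated by backward induction. The following is a minimal tabular sketch (our illustration; the MDP and the policy are randomly generated) that computes $V^{\pi}$ and $Q^{\pi}$ for a fixed policy $\pi$.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, H = 3, 2, 4  # states, actions, horizon (illustrative sizes)
P = rng.dirichlet(np.ones(S), size=(H, S, A))  # P[h, s, a] = next-state distribution
r = rng.random(size=(H, S, A))                  # r[h, s, a] in [0, 1]
pi = rng.integers(A, size=(H, S))               # a fixed deterministic policy

# Backward induction: V_{H+1}^pi = 0 and, for h = H, ..., 1,
#   Q_h^pi(s, a) = r_h(s, a) + P_h V_{h+1}^pi(s, a),  V_h^pi(s) = Q_h^pi(s, pi_h(s)).
V = np.zeros((H + 1, S))
Q = np.zeros((H, S, A))
for h in reversed(range(H)):
    Q[h] = r[h] + P[h] @ V[h + 1]
    V[h] = Q[h][np.arange(S), pi[h]]

# The Bellman equations hold by construction, and values are bounded by H:
assert np.allclose(Q[0], r[0] + P[0] @ V[1])
assert 0.0 <= V[0].min() and V[0].max() <= H
```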

Multi-value quantities.

The following quantities from Algorithm 2 can take two different values within an episode due to the policy update. In our analysis, they refer to their values at the end of episode $k$, unless otherwise stated:

$\mathcal{D}_{m,k,h}$; $\mathcal{D}_{k,h}^{\mathrm{ser}}$; $\boldsymbol{\Lambda}_{k,h}^{\mathrm{ser}}$; $\boldsymbol{\Lambda}_{m,k,h}^{\mathrm{loc}}$.

Indices of episodes.

The following indices of episodes are necessary for the analysis under the asynchronous setting:

- $t_k(m)$: $t_k(m) \le k$ is the last episode by the end of which agent $m$ adopts a newly updated policy. If no policy update has occurred by the end of episode $k$, then by default $t_k(m) = 1$.
- $n_k(m)$: $n_k(m) < t_k(m)$ is the most recent episode before $k$ at which agent $m$ updates its policy. This newly updated policy is executed for the first time at episode $t_k(m)$. If agent $m$ has not updated its policy before episode $k$, then by default $n_k(m) = 0$.
- $N_k(m)$: $N_k(m) \le k$ is the last episode in which agent $m$ participates, up to the end of episode $k$. For example, if agent $m$ participates in episode $k$, then $N_k(m) = k$.

The above definitions imply $0 \le n_k(m) < t_k(m) \le N_k(m) \le k$.
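As an illustration (ours, not part of the paper), the bookkeeping for these indices can be sketched as follows; the participation trace is hypothetical, and we assume a policy computed at the end of one episode is first executed the next time that agent participates.

```python
# Hypothetical trace: (episode k, active agent m_k, does m_k update its policy?).
trace = [(1, 0, True), (2, 1, False), (3, 0, False), (4, 1, True), (5, 0, False)]
M, K = 2, 5

t = {m: 1 for m in range(M)}  # t_k(m): last episode a newly updated policy was adopted
n = {m: 0 for m in range(M)}  # n_k(m): episode at which that policy was computed
N = {m: 0 for m in range(M)}  # N_k(m): last episode agent m participated
pending = {m: None for m in range(M)}  # computed-but-not-yet-executed policy, if any

for k, m_k, updates_policy in trace:
    N[m_k] = k
    if pending[m_k] is not None:
        # The policy computed at episode pending[m_k] is executed now for the
        # first time, so the (n, t) pair advances together.
        n[m_k], t[m_k] = pending[m_k], k
        pending[m_k] = None
    if updates_policy:
        # Agent m_k recomputes its policy at the end of episode k.
        pending[m_k] = k

# The ordering 0 <= n_k(m) < t_k(m) <= N_k(m) <= k holds for every agent:
for m in range(M):
    assert 0 <= n[m] < t[m] <= N[m] <= K
```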

A.1 Auxiliary Matrices

We further define a few notations that will be used extensively in the proof.

Universal information.

We define the following matrix of universal information up to the beginning of episode $k$:

$$\boldsymbol{\Lambda}_{k,h}^{\mathrm{all}} = \lambda\mathbf{I} + \sum_{\tau=1}^{k-1}\phi_{\tau,h}\phi_{\tau,h}^{\top}, \quad \forall h \in [H]. \qquad (\text{A.1})$$

Personal information.

We define the information uploaded by agent $m$ up to episode $k$ as

$$\boldsymbol{\Lambda}_{m,k,h}^{\mathrm{up}} = \sum_{\tau=1,\ m_{\tau}=m}^{n_k(m)}\phi_{\tau,h}\phi_{\tau,h}^{\top}, \quad \forall h \in [H]. \qquad (\text{A.2})$$

Since quantities such as $\boldsymbol{\Lambda}_{m,k,h}^{\mathrm{loc}}$ can take two different values during episode $k$ due to the policy update, in the following we assume all such quantities refer to their values at the end of each episode. The matrix $\boldsymbol{\Lambda}_{m,k,h}^{\mathrm{loc}}$ can be rewritten as

$$\boldsymbol{\Lambda}_{m,k,h}^{\mathrm{loc}} = \sum_{\tau=n_k(m)+1,\ m_{\tau}=m}^{k}\phi_{\tau,h}\phi_{\tau,h}^{\top}, \quad \forall h \in [H]. \qquad (\text{A.3})$$

Appendix B Proof of Regret Upper Bound

B.1 Basic Properties of the LSVI-type Algorithm

In this section, we list a few basic lemmas for our LSVI-type algorithm. Most of these lemmas are modified from those in Jin et al. (2020). These lemmas are crucial to our regret upper bound.

Lemma B.1 (Lemma B.1 in Jin et al. 2020).

Under Assumption 3.1, for any policy $\pi$ and any $h \in [H]$, let $\mathbf{w}_h^{\pi}$ be such that $Q_h^{\pi}(\cdot,\cdot) = \langle\phi(\cdot,\cdot), \mathbf{w}_h^{\pi}\rangle$. Then for all $h \in [H]$, it holds that

$$\|\mathbf{w}_h^{\pi}\| \le 2H\sqrt{d}.$$

Lemma B.2.

Under Assumption 3.1, for any $k \in [K]$ and $h \in [H]$, the estimated parameter $\mathbf{w}_{m,k,h}$ in Algorithm 2 satisfies

$$\|\mathbf{w}_{m,k,h}\| \le 2H\sqrt{dk/\lambda}.$$

Proof of Lemma B.2.

See Appendix C.4. ∎

Recall the auxiliary matrices defined in Appendix A.1. The following result describes the partial ordering between them.

Lemma B.3 (Covariance matrix ordering).

Under the setting of Theorem 5.1, it holds that

$$\lambda\mathbf{I} + \sum_{m' \in [M]}\boldsymbol{\Lambda}_{m',k,h}^{\mathrm{up}} \succeq \frac{1}{\alpha}\boldsymbol{\Lambda}_{m,k,h}^{\mathrm{loc}}, \quad \forall (k,h,m) \in [K] \times [H] \times [M]. \qquad (\text{B.1})$$

Furthermore, for some $1 < \underline{t} \le \bar{t} \le K$, suppose agent $m$ is the only participating agent within these episodes (i.e., $m_k = m$ for all $k \in [\underline{t}, \bar{t}]$), and agent $m$ communicates with the server only at episode $k = \underline{t}$ during $[\underline{t}, \bar{t}]$. Then for all $k \in [\underline{t}+1, \bar{t}]$, it holds that

$$\boldsymbol{\Lambda}_{m,k,h} \succeq \frac{1}{1 + M\alpha}\boldsymbol{\Lambda}_{k,h}^{\mathrm{all}}, \quad \forall h \in [H]. \qquad (\text{B.2})$$

Proof of Lemma B.3.

See Appendix C.5. ∎
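A small numpy sanity check (ours; the participation pattern and the upload episodes $n_k(m)$ are hypothetical) of how the auxiliary matrices fit together: the uploaded matrix (A.2) and the local matrix (A.3) of each agent partition that agent's data, so summing them over agents and adding the regularizer recovers the universal matrix (A.1).

```python
import numpy as np

rng = np.random.default_rng(3)
d, K, M, lam = 3, 60, 2, 1.0
phis = rng.normal(size=(K, d))
agent = rng.integers(M, size=K)  # m_tau: the agent active in episode tau
n_k = {0: 20, 1: 35}             # hypothetical last-upload episodes n_k(m)
eps = np.arange(1, K + 1)        # episode indices tau = 1, ..., K

def cov(idx):
    # Sum of outer products phi_tau phi_tau^T over the selected episodes.
    X = phis[idx]
    return X.T @ X

# Uploaded information (A.2) and local information (A.3) for each agent:
Lambda_up = {m: cov((agent == m) & (eps <= n_k[m])) for m in range(M)}
Lambda_loc = {m: cov((agent == m) & (eps > n_k[m])) for m in range(M)}

# Summing over agents and adding lambda * I recovers the universal matrix
# (A.1) built from all K episodes:
Lambda_all = lam * np.eye(d) + cov(np.ones(K, dtype=bool))
recon = lam * np.eye(d) + sum(Lambda_up[m] + Lambda_loc[m] for m in range(M))
assert np.allclose(recon, Lambda_all)
```

Lemma B.3 then quantifies how far any single agent's matrix can fall below this universal matrix in the PSD order.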

The following two lemmas provide the concentration of self-normalized martingales in the asynchronous setting: the first applies to a fixed function $V$, and the second applies to the $V_{m_k,t_k,h+1}$ function in Algorithm 2 via a covering argument. With a proper choice of $\alpha$, the bounds reduce to $\tilde{\mathcal{O}}(H\sqrt{d})$ and $\tilde{\mathcal{O}}(Hd)$, respectively. These match the results in the single-agent case (Jin et al., 2020).

Lemma B.4.

Under the setting of Theorem 5.1, for any fixed $V \in \mathcal{V}$, with probability at least $1-\delta$, for any $k \in [K]$ and $h \in [H]$, it holds that

$$\Bigl\|\sum_{\tau=1,\ \tau \in \mathcal{D}_{t_k,h}^{\mathrm{ser}}}^{t_k - 1}\phi_{\tau,h}\bigl[V(s_{\tau,h+1}) - \mathbb{P}_h V(s_{\tau,h}, a_{\tau,h})\bigr]\Bigr\|_{\boldsymbol{\Lambda}_{m_k,t_k,h}^{-1}} \le 2\bigl(\sqrt{M\alpha} + \sqrt{1+M\alpha}\bigr) \cdot H \cdot \sqrt{\log\Bigl(\bigl(\tfrac{K+\alpha\lambda}{\alpha\lambda}\bigr)^{d/2}\Bigr) + \log\Bigl(\bigl(\tfrac{K+\lambda}{\lambda}\bigr)^{d/2}\Bigr) + 2\log\tfrac{1}{\delta}}.$$

Proof of Lemma B.4.

See Appendix C.6. ∎

Lemma B.5.

Under the setting of Theorem 5.1, with probability at least $1-\delta$, for any $k \in [K]$ and $h \in [H]$, it holds that

$$\Bigl\|\sum_{\tau=1,\ \tau \in \mathcal{D}_{t_k,h}^{\mathrm{ser}}}^{t_k - 1}\phi_{\tau,h}\bigl[V_{m_k,t_k,h+1}(s_{\tau,h+1}) - \mathbb{P}_h V_{m_k,t_k,h+1}(s_{\tau,h}, a_{\tau,h})\bigr]\Bigr\|_{\boldsymbol{\Lambda}_{m_k,t_k,h}^{-1}} \le C\bigl(\sqrt{M\alpha} + \sqrt{1+M\alpha}\bigr)H\sqrt{\iota} + C\,dH/\lambda,$$

where $C$ is some universal constant and

$$\iota := d\log\Bigl(\tfrac{K+\alpha\lambda}{\alpha\lambda}\Bigr) + d\log\Bigl(\tfrac{K+\lambda}{\lambda}\Bigr) + \log\tfrac{1}{\delta} + d\log\Bigl(2 + \tfrac{8k^3}{\lambda}\Bigr) + d^2\log\Bigl(1 + \tfrac{8\beta^2 k^2}{\lambda\, d^{1.5} H^2}\Bigr).$$

Proof of Lemma B.5.

See Appendix C.7. ∎

The next lemma is used to bound the number of communication rounds of Algorithm 2. We first divide the episodes into epochs.

Lemma B.6.

For each $i \ge 0$, define $\tilde{K}_i := \min\{k\in[K] : \exists h\in[H] \text{ s.t. } \det(\boldsymbol{\Lambda}^{\mathrm{all}}_{k,h}) \ge 2^i\lambda^d\}$. Divide all episodes into epochs, where epoch $i \ge 0$ is given by $\{\tilde{K}_i, \tilde{K}_i+1, \cdots, \tilde{K}_{i+1}-1\}$. Then within any epoch $i$, the total number of communication rounds is upper bounded by $\mathcal{O}(H(M+1/\alpha))$.

Proof of Lemma B.6.

The proof follows from a modified version of the argument for Lemma 6.2 in He et al. (2022a). Different from He et al. (2022a), by Line 14 of Algorithm 2, a communication is triggered if any of the $H$ determinant conditions is satisfied. As a result, the number of communication rounds is at most $H$ times the upper bound in Lemma 6.2 of He et al. (2022a). ∎
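The determinant-based reasoning above can be illustrated numerically. The following is a minimal sketch (our illustration, not Algorithm 2 itself) of a single-agent, single-$h$ version of the trigger from Line 14: upload whenever $\det(\boldsymbol{\Lambda}^{\mathrm{ser}} + \boldsymbol{\Lambda}^{\mathrm{loc}}) > (1+\alpha)\det(\boldsymbol{\Lambda}^{\mathrm{ser}})$. The feature stream and the constants `d, K, alpha, lam` are synthetic assumptions.

```python
import numpy as np

# Sketch of the determinant-based communication trigger (one agent, one h index):
# an agent uploads once det(Lambda_ser + Lambda_loc) > (1 + alpha) * det(Lambda_ser).
rng = np.random.default_rng(0)
d, K, alpha, lam = 4, 500, 0.5, 1.0

Lambda_ser = lam * np.eye(d)   # covariance already uploaded to the server
Lambda_loc = np.zeros((d, d))  # local covariance accumulated since last upload
n_comm = 0

for k in range(K):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)  # ||phi||_2 <= 1 as in Assumption 3.1
    Lambda_loc += np.outer(phi, phi)
    if np.linalg.det(Lambda_ser + Lambda_loc) > (1 + alpha) * np.linalg.det(Lambda_ser):
        Lambda_ser += Lambda_loc            # upload local data
        Lambda_loc = np.zeros((d, d))
        n_comm += 1

# Each upload multiplies det(Lambda_ser) by more than (1 + alpha), while det is
# capped by (lam + K)^d, so the number of uploads is logarithmic in K.
cap = d * np.log((lam + K) / lam) / np.log(1 + alpha)
print(n_comm, cap)
```

As the final comment notes, the logarithmic cap on the number of uploads is exactly the mechanism behind the epoch counting in Lemma B.6.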

B.2 Proof of Theorem 5.1 Proof of Theorem 5.1.

We first prove the regret upper bound. By Lemma 6.1, the regret can be upper bounded as

$$\begin{aligned}
R(K) &= \sum_{k=1}^{K}\bigl[V_1^\star(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1})\bigr] \\
&\le \sum_{k=1}^{K}\bigl[V_{m_k,k,1}(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1})\bigr] \\
&= \sum_{k=1}^{K}\bigl[V_{m_k,t_k(m_k),1}(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1})\bigr] \\
&\le \sum_{k=1}^{K}\sum_{h=1}^{H}\bigl(\mathbb{E}[\xi_{k,h+1}\mid s_{k,h},a_{k,h}] - \xi_{k,h+1}\bigr) \\
&\quad + \sum_{k=1}^{K}\min\biggl\{V_{m_k,t_k(m_k),1}(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1}),\ \sum_{h=1}^{H}2\beta\sqrt{\phi(s_{k,h},a_{k,h})^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s_{k,h},a_{k,h})}\biggr\},
\end{aligned} \tag{B.3}$$

where the first inequality is by Lemma 6.1, and the second inequality is by Lemma 6.3. The minimum in the last step might seem odd at first, but will turn out to be necessary later. Bounding the first term above is straightforward using martingale concentration (Jin et al., 2020). Specifically, by the definition of $\xi_{k,h}$ in Lemma 6.3, the first term can be written as

$$\sum_{k=1}^{K}\sum_{h=1}^{H}\bigl(\mathbb{E}[\xi_{k,h+1}\mid s_{k,h},a_{k,h}] - \xi_{k,h+1}\bigr) = \sum_{k=1}^{K}\sum_{h=1}^{H}\Bigl(\mathbb{E}\bigl[V_{m_k,t_k,h+1}(s_{k,h+1}) - V^{\pi_{m_k,k}}_{h+1}(s_{k,h+1})\,\big|\,s_{k,h},a_{k,h}\bigr] - \bigl[V_{m_k,t_k,h+1}(s_{k,h+1}) - V^{\pi_{m_k,k}}_{h+1}(s_{k,h+1})\bigr]\Bigr).$$

The above summation can be viewed as the sum of a martingale difference sequence, since $V_{m_k,t_k,h+1}$ and $V^{\pi_{m_k,k}}_{h+1}$ are independent of the observations in episode $k$. Since $|V_{m_k,t_k,h+1}(s_{k,h+1}) - V^{\pi_{m_k,k}}_{h+1}(s_{k,h+1})| \le 2H$, by the Azuma-Hoeffding inequality, with probability at least $1-\delta$, for all $k,h$, it holds that

$$\sum_{k=1}^{K}\sum_{h=1}^{H}\bigl(\mathbb{E}[\xi_{k,h+1}\mid s_{k,h},a_{k,h}] - \xi_{k,h+1}\bigr) \le 2H^{3/2}\sqrt{K\log(2/\delta)}, \tag{B.4}$$

where $\{\xi_{k,h}\}_{(k,h)\in[K]\times[H]}$ are defined in Lemma 6.3. For the second term, note that instead of bounding $2\beta\sum_{k=1}^{K}\sum_{h=1}^{H}\sqrt{\phi(s_{k,h},a_{k,h})^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s_{k,h},a_{k,h})}$ directly, we construct a new term involving a minimum between the bonus and the per-episode regret bound $V_{m_k,t_k(m_k),1}(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1})$. The reason is that the sum of bonuses alone cannot be bounded using the standard elliptical potential argument (Abbasi-Yadkori et al., 2011), due to the asynchronous nature of the communication protocol. Its analysis turns out to be much more involved and is therefore summarized separately in Lemma 6.4. Now, combining Lemma 6.4 and (B.4), we finish the proof of the regret upper bound.

The proof of the communication complexity of Algorithm 2 is straightforward given the simple form of our determinant-based criterion. By Lemma B.6, it remains to bound the number of epochs. Recall from Assumption 3.1 that ‖ 𝜙 ⁢ ( ⋅ , ⋅ ) ‖ ≤ 1 . This implies that, for any ℎ ∈ [ 𝐻 ] ,

$$\det(\boldsymbol{\Lambda}^{\mathrm{all}}_{K,h}) \le \Bigl(\lambda + \frac{1}{d}\sum_{k=1}^{K}\|\phi_{k,h}\|_2^2\Bigr)^d \le \lambda^d\Bigl(1+\frac{K}{\lambda d}\Bigr)^d.$$

By the definition of $\tilde{K}_i$ from Lemma B.6, in order for $\tilde{K}_i$ to be well-defined (i.e., for the set in its definition to be non-empty), $i$ should satisfy

$$2^i\lambda^d \le \lambda^d\Bigl(1+\frac{K}{\lambda d}\Bigr)^d,$$

which implies $i \le d\log(1+K/(\lambda d))/\log 2$. Together with Lemma B.6, the total number of communication rounds is upper bounded by $H(M+1/\alpha)\cdot d\log(1+K/(\lambda d))/\log 2$ up to some constant factor. This finishes the proof. ∎

Appendix C Proof of Technical Lemmas C.1 Proof of Lemma 6.2 Proof of Lemma 6.2.

Recall the definition of $\mathbf{w}_{m,k+1,h}$ from (4.1), and the definition of $n_k(\cdot)$ from Appendix A. Since $\mathbf{w}_{m_k,k,h} = \mathbf{w}_{m_k,t_k,h}$ is computed using all the trajectories available to agent $m_k$ by the beginning of episode $t_k$, we can write

$$\mathbf{w}_{m_k,t_k,h} = \boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k(m_k),h}}^{t_k-1}\phi_{\tau,h}\cdot\bigl[r_{\tau,h} + V_{m_k,t_k,h+1}(s_{\tau,h+1})\bigr],$$

where $\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k(m_k),h}$ denotes all the data uploaded to the server by the end of episode $n_k(m_k)$. Note that this is well-defined since $n_k(m_k)$ is the most recent episode before $k$ when agent $m_k$ updates its policy, and therefore its local data is also included in $\mathcal{D}^{\mathrm{ser}}_{n_k(m_k),h}$. In the following we simply write $n_k$ instead of $n_k(m_k)$ since there is no confusion. We then write

$$\begin{aligned}
\mathbf{w}_{m_k,t_k,h} - \mathbf{w}^\pi_h &= \boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k,h}}^{t_k-1}\phi_{\tau,h}\cdot\bigl[r_{\tau,h} + V_{m_k,t_k,h+1}(s_{\tau,h+1})\bigr] - \mathbf{w}^\pi_h \\
&= \boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\biggl\{-\lambda\mathbf{w}^\pi_h + \sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k,h}}^{t_k-1}\phi_{\tau,h}\bigl[V_{m_k,t_k,h+1}(s_{\tau,h+1}) - \mathbb{P}_h V^\pi_{h+1}(s_{\tau,h},a_{\tau,h})\bigr]\biggr\} \\
&= \underbrace{-\lambda\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\mathbf{w}^\pi_h}_{\mathbf{v}_1} + \underbrace{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k,h}}^{t_k-1}\phi_{\tau,h}\bigl[V_{m_k,t_k,h+1}(s_{\tau,h+1}) - \mathbb{P}_h V_{m_k,t_k,h+1}(s_{\tau,h},a_{\tau,h})\bigr]}_{\mathbf{v}_2} \\
&\quad + \underbrace{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k,h}}^{t_k-1}\phi_{\tau,h}\,\mathbb{P}_h\bigl[V_{m_k,t_k,h+1} - V^\pi_{h+1}\bigr](s_{\tau,h},a_{\tau,h})}_{\mathbf{v}_3}.
\end{aligned} \tag{C.1}$$

For the first term, we have

$$|\phi(s,a)^\top\mathbf{v}_1| \le \sqrt{\lambda}\,\|\mathbf{w}^\pi_h\|_2\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s,a)} \le 2H\sqrt{d\lambda}\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s,a)}, \tag{C.2}$$

where the first step is by 𝚲 𝑚 𝑘 , 𝑡 𝑘 , ℎ − 1 ≼ 𝜆 − 1 ⁢ 𝐈 and the second step is by Lemma B.1. For the second term, we have

$$|\phi(s,a)^\top\mathbf{v}_2| \le \underbrace{\biggl\|\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k,h}}^{t_k-1}\phi_{\tau,h}\bigl[V_{m_k,t_k,h+1}(s_{\tau,h+1}) - \mathbb{P}_h V_{m_k,t_k,h+1}(s_{\tau,h},a_{\tau,h})\bigr]\biggr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}}}_{\chi}\cdot\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s,a)}. \tag{C.3}$$

For the third term, we have

$$\begin{aligned}
\phi(s,a)^\top\mathbf{v}_3 &= \biggl\langle\phi(s,a),\ \boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k,h}}^{t_k-1}\phi_{\tau,h}\,\mathbb{P}_h\bigl[V_{m_k,t_k,h+1} - V^\pi_{h+1}\bigr](s_{\tau,h},a_{\tau,h})\biggr\rangle \\
&= \biggl\langle\phi(s,a),\ \boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k,h}}^{t_k-1}\phi_{\tau,h}\phi_{\tau,h}^\top\int\bigl[V_{m_k,t_k,h+1} - V^\pi_{h+1}\bigr](s')\,\mathrm{d}\boldsymbol{\mu}_h(s')\biggr\rangle \\
&= \Bigl\langle\phi(s,a),\ \int\bigl[V_{m_k,t_k,h+1} - V^\pi_{h+1}\bigr](s')\,\mathrm{d}\boldsymbol{\mu}_h(s')\Bigr\rangle - \lambda\Bigl\langle\phi(s,a),\ \boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\int\bigl[V_{m_k,t_k,h+1} - V^\pi_{h+1}\bigr](s')\,\mathrm{d}\boldsymbol{\mu}_h(s')\Bigr\rangle \\
&= \mathbb{P}_h\bigl[V_{m_k,t_k,h+1} - V^\pi_{h+1}\bigr](s,a) - \lambda\Bigl\langle\phi(s,a),\ \boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\int\bigl[V_{m_k,t_k,h+1} - V^\pi_{h+1}\bigr](s')\,\mathrm{d}\boldsymbol{\mu}_h(s')\Bigr\rangle \\
&\le \mathbb{P}_h\bigl[V_{m_k,t_k,h+1} - V^\pi_{h+1}\bigr](s,a) + 2H\sqrt{d\lambda}\cdot\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s,a)},
\end{aligned} \tag{C.5}$$

where the last step holds because $\|\boldsymbol{\mu}_h\| \le \sqrt{d}$ by Assumption 3.1. Combining (C.1), (C.2), (C.3) and (C.5), we have

$$\bigl|\phi(s,a)^\top(\mathbf{w}_{m_k,t_k,h} - \mathbf{w}^\pi_h) - \mathbb{P}_h\bigl[V_{m_k,t_k,h+1} - V^\pi_{h+1}\bigr](s,a)\bigr| \le \bigl(4H\sqrt{d\lambda} + \chi\bigr)\cdot\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s,a)}. \tag{C.6}$$

It remains to show that the choice of $\beta$ satisfies

$$4H\sqrt{d\lambda} + \chi \le \beta.$$

By Lemma B.5, it suffices to show

$$4H\sqrt{d\lambda} + C\bigl(\sqrt{M\alpha}+\sqrt{1+M\alpha}\bigr)H\sqrt{\iota} + C\,dH/\sqrt{\lambda} \le \beta.$$

Plugging in the choice of 𝛽 and the definition of 𝜄 from Lemma B.5 and simplifying the expression, it suffices to show that there exists 𝑐 𝛽 such that

$$C\Bigl[\log\Bigl(2+\frac{K}{\delta\min\{1,\lambda,\alpha\lambda\}}\Bigr) + \log\bigl(c_\beta Hd(\sqrt{M\alpha}+\sqrt{1+M\alpha})\bigr)\Bigr] \le \frac{c_\beta}{2}\Bigl[\log\Bigl(2+\frac{K}{\delta\min\{1,\lambda,\alpha\lambda\}}\Bigr) + \log\bigl(Hd(\sqrt{M\alpha}+\sqrt{1+M\alpha})\bigr)\Bigr],$$

where 𝐶 is some universal constant. The existence of such 𝑐 𝛽 is clear. Therefore, we conclude that

$$\bigl|\phi(s,a)^\top(\mathbf{w}_{m_k,t_k,h} - \mathbf{w}^\pi_h) - \mathbb{P}_h\bigl[V_{m_k,t_k,h+1} - V^\pi_{h+1}\bigr](s,a)\bigr| \le \beta\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s,a)}. \qquad ∎$$

C.2 Proof of Lemma 6.1 Proof of Lemma 6.1.

The proof follows from the same induction argument as in Lemma B.5 of (Jin et al., 2020). For completeness we present the proof here. For step $H$, by Lemma 6.2, we have

$$\bigl|\phi(s,a)^\top\mathbf{w}_{m_k,t_k,H} - Q^\star_H(s,a)\bigr| \le \beta\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,H}\phi(s,a)},$$

since $V_{m_k,t_k,H+1} = V^\star_{H+1} = 0$. This implies

$$Q^\star_H(s,a) \le \phi(s,a)^\top\mathbf{w}_{m_k,t_k,H} + \beta\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,H}\phi(s,a)} \le Q_{m_k,t_k,H}(s,a).$$

Now suppose we have proved 𝑄 ℎ + 1 ⋆ ⁢ ( 𝑠 , 𝑎 ) ≤ 𝑄 𝑚 𝑘 , 𝑡 𝑘 , ℎ + 1 ⁢ ( 𝑠 , 𝑎 ) . Then by Lemma 6.2 again, we have

$$Q^\star_h(s,a) + \mathbb{P}_h\bigl[V_{m_k,t_k,h+1} - V^\star_{h+1}\bigr](s,a) - \phi(s,a)^\top\mathbf{w}_{m_k,t_k,h} \le \beta\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s,a)},$$

which implies

$$\begin{aligned}
Q^\star_h(s,a) &\le \phi(s,a)^\top\mathbf{w}_{m_k,t_k,h} + \beta\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s,a)} - \mathbb{P}_h\bigl[V_{m_k,t_k,h+1} - V^\star_{h+1}\bigr](s,a) \\
&\le \phi(s,a)^\top\mathbf{w}_{m_k,t_k,h} + \beta\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s,a)},
\end{aligned}$$

where the last step is by the induction hypothesis that 𝑉 𝑚 𝑘 , 𝑡 𝑘 , ℎ + 1 − 𝑉 ℎ + 1 ⋆ ≥ 0 . Therefore, we conclude that

$$Q^\star_h(s,a) = \min\bigl\{H-h+1,\ Q^\star_h(s,a)\bigr\} \le \min\Bigl\{H-h+1,\ \phi(s,a)^\top\mathbf{w}_{m_k,t_k,h} + \beta\sqrt{\phi(s,a)^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s,a)}\Bigr\} = Q_{m_k,t_k,h}(s,a) = Q_{m_k,k,h}(s,a). \qquad ∎$$

C.3 Proof of Lemma 6.3 Proof of Lemma 6.3.

Lemma 6.2 and the definition of $Q_{m_k,t_k,h}$ imply that for any $k,h$,

$$Q_{m_k,t_k,h}(s_{k,h},a_{k,h}) - Q^{\pi_{m_k,k}}_h(s_{k,h},a_{k,h}) \le \mathbb{P}_h\bigl[V_{m_k,t_k,h+1} - V^{\pi_{m_k,k}}_{h+1}\bigr](s_{k,h},a_{k,h}) + 2\beta\sqrt{\phi_{k,h}^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi_{k,h}}.$$

By the definition of $V_{m_k,t_k,h+1}$ and $V^{\pi_{m_k,k}}_{h+1}$, we have $\xi_{k,h} = Q_{m_k,t_k,h}(s_{k,h},a_{k,h}) - Q^{\pi_{m_k,k}}_h(s_{k,h},a_{k,h})$, and it follows that

$$\xi_{k,h} \le \mathbb{E}\bigl[\xi_{k,h+1}\mid s_{k,h},a_{k,h}\bigr] + 2\beta\sqrt{\phi(s_{k,h},a_{k,h})^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s_{k,h},a_{k,h})}. \qquad ∎$$

C.4 Proof of Lemma B.2 Proof of Lemma B.2.

The proof follows from that of Lemma B.2 in (Jin et al., 2020). Specifically, the estimated parameters 𝐰 𝑚 , 𝑘 , ℎ take the same form as 𝐰 ℎ 𝑘 ’s in (Jin et al., 2020) if we re-index the vectors 𝜙 𝑘 , ℎ ’s that are available to agent 𝑚 at episode  𝑘 . ∎

C.5 Proof of Lemma B.3 Proof of Lemma B.3.

Fix some episode $k$ and agent $m$. Recall from Appendix A that $N_k(m) \le k$ is the last episode in which agent $m$ participates up until the end of episode $k$. If agent $m$ communicates with the server in episode $N_k(m)$, then

$$\lambda\mathbf{I} + \sum_{m'\in[M]}\boldsymbol{\Lambda}^{\mathrm{up}}_{m',k,h} \succeq \mathbf{0} = \boldsymbol{\Lambda}^{\mathrm{loc}}_{m,N_k(m),h} = \boldsymbol{\Lambda}^{\mathrm{loc}}_{m,k,h}.$$

If agent $m$ does not communicate with the server in episode $N_k(m)$, then by Line 14 of Algorithm 2, it holds that, for all $h\in[H]$,

det ( 𝚲 𝑚 , 𝑁 𝑘 ⁢ ( 𝑚 ) , ℎ + 𝚲 𝑚 , 𝑁 𝑘 ⁢ ( 𝑚 ) , ℎ loc ) ≤ ( 1 + 𝛼 ) ⁢ det ( 𝚲 𝑚 , 𝑁 𝑘 ⁢ ( 𝑚 ) , ℎ ) .

Since no further participation happens during $[N_k(m)+1, k]$, the above implies

det ( 𝚲 𝑚 , 𝑘 , ℎ + 𝚲 𝑚 , 𝑘 , ℎ loc ) ≤ ( 1 + 𝛼 ) ⁢ det ( 𝚲 𝑚 , 𝑘 , ℎ ) .

Applying Lemma E.2, we get

$$\frac{\mathbf{x}^\top(\boldsymbol{\Lambda}_{m,k,h} + \boldsymbol{\Lambda}^{\mathrm{loc}}_{m,k,h})\mathbf{x}}{\mathbf{x}^\top\boldsymbol{\Lambda}_{m,k,h}\mathbf{x}} \le \frac{\det(\boldsymbol{\Lambda}_{m,k,h} + \boldsymbol{\Lambda}^{\mathrm{loc}}_{m,k,h})}{\det(\boldsymbol{\Lambda}_{m,k,h})} \le 1+\alpha,$$

and it follows that

$$\mathbf{x}^\top\boldsymbol{\Lambda}^{\mathrm{loc}}_{m,k,h}\mathbf{x} \le \alpha\,\mathbf{x}^\top\boldsymbol{\Lambda}_{m,k,h}\mathbf{x}.$$

Finally, we conclude that

$$\lambda\mathbf{I} + \sum_{m'\in[M]}\boldsymbol{\Lambda}^{\mathrm{up}}_{m',k,h} \succeq \boldsymbol{\Lambda}_{m,k,h} \succeq \frac{1}{\alpha}\boldsymbol{\Lambda}^{\mathrm{loc}}_{m,k,h},$$

where the first step follows from the fact that 𝚲 𝑚 , 𝑘 , ℎ is downloaded at some episode 𝑛 𝑘 ⁢ ( 𝑚 ) < 𝑘 , and the definition of 𝚲 𝑚 ′ , 𝑘 , ℎ up from (A.2). This proves (B.1).

To show (B.2), suppose that agent $m$ communicates with the server at episode $\underline{t}$, and is the active agent for $k\in[\underline{t},\bar{t}]$. Applying (B.1) to all $M$ agents and averaging, we have

$$\lambda\mathbf{I} + \sum_{m'\in[M]}\boldsymbol{\Lambda}^{\mathrm{up}}_{m',k,h} \succeq \frac{1}{\alpha M}\sum_{m'\in[M]}\boldsymbol{\Lambda}^{\mathrm{loc}}_{m',k,h},$$

and it follows that, for $k\in[\underline{t}+1,\bar{t}]$,

$$\begin{aligned}
\boldsymbol{\Lambda}_{m,k,h} &= \lambda\mathbf{I} + \sum_{m'\in[M]}\boldsymbol{\Lambda}^{\mathrm{up}}_{m',\underline{t}+1,h} = \lambda\mathbf{I} + \sum_{m'\in[M]}\boldsymbol{\Lambda}^{\mathrm{up}}_{m',k,h} \\
&\succeq \frac{1}{1+\alpha M}\Bigl(\lambda\mathbf{I} + \sum_{m'\in[M]}\boldsymbol{\Lambda}^{\mathrm{up}}_{m',k,h} + \sum_{m'\in[M]}\boldsymbol{\Lambda}^{\mathrm{loc}}_{m',k,h}\Bigr) = \frac{1}{1+\alpha M}\boldsymbol{\Lambda}^{\mathrm{all}}_{k,h}.
\end{aligned}$$

Here the first step follows from the definition of $\boldsymbol{\Lambda}^{\mathrm{up}}_{m',\underline{t}+1,h}$ from (A.2) and the assumption that agent $m$ communicates with the server in episode $\underline{t}$. The second step holds because agent $m$ is the only active agent during $[\underline{t},\bar{t}]$, and thus no further upload is made by any agent during episodes $[\underline{t}+1,\bar{t}]$. The last step follows from the definition of $\boldsymbol{\Lambda}^{\mathrm{all}}_{k,h}$ and the same assumption (no other agent can upload during $[\underline{t},\bar{t}]$). This finishes the proof of (B.2) and that of Lemma B.3. ∎

C.6 Proof of Lemma B.4 Proof of Lemma B.4.

Define $\eta_{\tau,h} := V(s_{\tau,h+1}) - \mathbb{P}_h V(s_{\tau,h},a_{\tau,h})$, and

$$\mathbf{u}^{\mathrm{up}}_{k,h}(m') := \sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{k,h},\,m_\tau=m'}^{k}\phi_{\tau,h}\,\eta_{\tau,h}, \qquad \mathbf{u}^{\mathrm{loc}}_{k,h}(m') := \sum_{\tau=1,\,\tau\notin\mathcal{D}^{\mathrm{ser}}_{k,h},\,m_\tau=m'}^{k}\phi_{\tau,h}\,\eta_{\tau,h}, \tag{C.7}$$

for all 𝑚 ′ ∈ [ 𝑀 ] and 𝑘 , ℎ ∈ [ 𝐾 ] × [ 𝐻 ] . Then we have

$$\begin{aligned}
\biggl\|\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k,h}}^{t_k-1}\phi_{\tau,h}\bigl[V(s_{\tau,h+1}) - \mathbb{P}_h V(s_{\tau,h},a_{\tau,h})\bigr]\biggr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}} &= \biggl\|\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k,h}}^{t_k-1}\phi_{\tau,h}\,\eta_{\tau,h}\biggr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}} = \biggl\|\sum_{m'=1}^{M}\mathbf{u}^{\mathrm{up}}_{n_k,h}(m')\biggr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}} \\
&= \biggl\|\sum_{m'=1}^{M}\mathbf{u}^{\mathrm{up}}_{n_k,h}(m') + \sum_{m'=1}^{M}\mathbf{u}^{\mathrm{loc}}_{n_k,h}(m') - \sum_{m'=1}^{M}\mathbf{u}^{\mathrm{loc}}_{n_k,h}(m')\biggr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}} \\
&\le \underbrace{\biggl\|\sum_{m'=1}^{M}\mathbf{u}^{\mathrm{up}}_{n_k,h}(m') + \sum_{m'=1}^{M}\mathbf{u}^{\mathrm{loc}}_{n_k,h}(m')\biggr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}}}_{(\mathrm{I})} + \underbrace{\biggl\|\sum_{m'=1}^{M}\mathbf{u}^{\mathrm{loc}}_{n_k,h}(m')\biggr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}}}_{(\mathrm{II})},
\end{aligned} \tag{C.8}$$

where the first and the second steps are by the definition of 𝜂 𝜏 , ℎ and 𝐮 𝑛 𝑘 , ℎ up , and the last step is by triangle inequality.

For ( I ) , we have that with probability at least 1 − 𝛿 , for all 𝑘 ,

$$\begin{aligned}
(\mathrm{I}) &= \biggl\|\sum_{\tau=1}^{n_k}\phi_{\tau,h}\,\eta_{\tau,h}\biggr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}} = \biggl\|\sum_{\tau=1}^{n_k}\phi_{\tau,h}\,\eta_{\tau,h}\biggr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,n_k+1,h}} \\
&\le \sqrt{1+\alpha M}\,\biggl\|\sum_{\tau=1}^{n_k}\phi_{\tau,h}\,\eta_{\tau,h}\biggr\|_{(\boldsymbol{\Lambda}^{\mathrm{all}}_{n_k+1,h})^{-1}} \le \sqrt{1+\alpha M}\cdot\sqrt{4H^2\Bigl[\log\Bigl(\Bigl(\tfrac{K+\lambda}{\lambda}\Bigr)^{d/2}\Bigr) + \log\tfrac{1}{\delta}\Bigr]}.
\end{aligned} \tag{C.9}$$

Here the first inequality is given by (B.2) in Lemma B.3. The second inequality is derived by applying Theorem E.3 with $|\eta_{\tau,h}| \le 2H$ (according to Line 23), and

$$\det(\boldsymbol{\Lambda}^{\mathrm{all}}_{n_k+1,h}) \le \bigl(\|\boldsymbol{\Lambda}^{\mathrm{all}}_{n_k+1,h}\|_2\bigr)^d \le \Bigl\|\sum_{\tau=1}^{n_k}\phi_{\tau,h}\phi_{\tau,h}^\top + \lambda\mathbf{I}\Bigr\|_2^d \le (K+\lambda)^d.$$

For $(\mathrm{II})$, first note that for any $m\in[M]$, Lemma B.3 implies

$$\boldsymbol{\Lambda}_{m_k,t_k,h} = \lambda\mathbf{I} + \sum_{m'=1}^{M}\boldsymbol{\Lambda}^{\mathrm{up}}_{m',n_k(m_k),h} \succeq \frac{1}{\alpha}\boldsymbol{\Lambda}^{\mathrm{loc}}_{m,n_k(m_k),h}.$$

It follows that for any $m'\in[M]$,

$$\boldsymbol{\Lambda}_{m_k,t_k,h} \succeq \frac{\lambda\mathbf{I}}{2} + \frac{1}{2\alpha}\boldsymbol{\Lambda}^{\mathrm{loc}}_{m',n_k(m_k),h} = \frac{1}{2\alpha}\bigl(\alpha\lambda\mathbf{I} + \boldsymbol{\Lambda}^{\mathrm{loc}}_{m',n_k(m_k),h}\bigr),$$

and thus

$$\begin{aligned}
\bigl\|\mathbf{u}^{\mathrm{loc}}_{n_k,h}(m')\bigr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}} &\le \bigl\|\mathbf{u}^{\mathrm{loc}}_{n_k,h}(m')\bigr\|_{\bigl[\frac{1}{2\alpha}(\alpha\lambda\mathbf{I}+\boldsymbol{\Lambda}^{\mathrm{loc}}_{m',n_k(m_k),h})\bigr]^{-1}} = \sqrt{2\alpha}\,\bigl\|\mathbf{u}^{\mathrm{loc}}_{n_k,h}(m')\bigr\|_{(\alpha\lambda\mathbf{I}+\boldsymbol{\Lambda}^{\mathrm{loc}}_{m',n_k(m_k),h})^{-1}} \\
&\le \sqrt{2\alpha}\cdot\sqrt{4H^2\Bigl[\log\Bigl(\Bigl(\tfrac{K+\alpha\lambda}{\alpha\lambda}\Bigr)^{d/2}\Bigr) + \log\tfrac{1}{\delta}\Bigr]},
\end{aligned}$$

where the last step holds by Theorem E.3, the definition of $\mathbf{u}^{\mathrm{loc}}_{n_k,h}(m')$ from (C.7), and the definition of $\boldsymbol{\Lambda}^{\mathrm{loc}}_{m',n_k(m_k),h}$ from (A.3). We then conclude that

$$(\mathrm{II}) \le 2\sqrt{M\alpha}\cdot H\cdot\sqrt{\log\Bigl(\Bigl(\tfrac{K+\alpha\lambda}{\alpha\lambda}\Bigr)^{d/2}\Bigr) + \log\tfrac{1}{\delta}}. \tag{C.10}$$

Combining (C.8), (C.9) and (C.10), we conclude that

$$\biggl\|\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{n_k,h}}^{t_k-1}\phi_{\tau,h}\bigl[V(s_{\tau,h+1}) - \mathbb{P}_h V(s_{\tau,h},a_{\tau,h})\bigr]\biggr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}} \le 2\bigl(\sqrt{M\alpha}+\sqrt{1+M\alpha}\bigr)\cdot H\cdot\sqrt{\log\Bigl(\Bigl(\tfrac{K+\alpha\lambda}{\alpha\lambda}\Bigr)^{d/2}\Bigr)+\log\Bigl(\Bigl(\tfrac{K+\lambda}{\lambda}\Bigr)^{d/2}\Bigr)+2\log\tfrac{1}{\delta}}. \qquad ∎$$

C.7 Proof of Lemma B.5

With Lemma B.4 established, the proof of Lemma B.5 relies on the classical $\ell_\infty$ covering argument for linear MDPs, developed in Lemma B.3 of Jin et al. (2020).

Lemma C.1 (Lemma D.6, Jin et al. 2020).

Let $\mathcal{V}$ denote a class of functions from $\mathcal{S}$ to $\mathbb{R}$ such that each $V\in\mathcal{V}$ can be parametrized as

$$V(\cdot) = \Bigl[\max_{a\in\mathcal{A}}\Bigl(\phi(\cdot,a)^\top\mathbf{w} + \beta\sqrt{\phi(\cdot,a)^\top\boldsymbol{\Lambda}^{-1}\phi(\cdot,a)}\Bigr)\Bigr]_{[0,\,H-h+1]},$$

where $[\,\cdot\,]_{[0,H-h+1]}$ denotes truncation to the interval $[0, H-h+1]$, and the parameters $(\mathbf{w},\boldsymbol{\Lambda},\beta)$ satisfy $\|\mathbf{w}\|\le W$, $0\le\beta\le B$, and $\boldsymbol{\Lambda}\succeq\lambda\mathbf{I}$ for some $\lambda>0$. Suppose $\|\phi(\cdot,\cdot)\|\le 1$. Then the $\epsilon$-covering number of $\mathcal{V}$ with respect to the $\ell_\infty$ norm satisfies

$$\log\mathcal{N}_\epsilon \le d\log(1+4W/\epsilon) + d^2\log\bigl[1+8d^{1/2}B^2/(\lambda\epsilon^2)\bigr].$$

We can now prove Lemma B.5 using Lemma B.4 and Lemma C.1.

Proof of Lemma B.5.

We first fix an $\epsilon$-net of $\mathcal{V}$. For each $V\in\mathcal{V}$, there exists some $\tilde{V}$ in the $\epsilon$-net such that $\|V-\tilde{V}\|_\infty\le\epsilon$. Applying a union bound over the $\epsilon$-net together with Lemma B.4, we get that with probability at least $1-\delta$, for each $V\in\mathcal{V}$,

$$\begin{aligned}
&\biggl\|\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{t_k,h}}^{t_k-1}\phi_{\tau,h}\bigl[V(s_{\tau,h+1}) - \mathbb{P}_h V(s_{\tau,h},a_{\tau,h})\bigr]\biggr\|^2_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}} \\
&\le 8\bigl(\sqrt{M\alpha}+\sqrt{1+M\alpha}\bigr)^2 H^2\Bigl(\log\Bigl(\Bigl(\tfrac{K+\alpha\lambda}{\alpha\lambda}\Bigr)^{d/2}\Bigr)+\log\Bigl(\Bigl(\tfrac{K+\lambda}{\lambda}\Bigr)^{d/2}\Bigr)+2\log\tfrac{\mathcal{N}_\epsilon}{\delta}\Bigr) \\
&\quad + 2\biggl\|\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{t_k,h}}^{t_k-1}\phi_{\tau,h}\bigl[\Delta V(s_{\tau,h+1}) - \mathbb{P}_h\Delta V(s_{\tau,h},a_{\tau,h})\bigr]\biggr\|^2_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}} \\
&\le 8\bigl(\sqrt{M\alpha}+\sqrt{1+M\alpha}\bigr)^2 H^2\Bigl(\log\Bigl(\Bigl(\tfrac{K+\alpha\lambda}{\alpha\lambda}\Bigr)^{d/2}\Bigr)+\log\Bigl(\Bigl(\tfrac{K+\lambda}{\lambda}\Bigr)^{d/2}\Bigr)+2\log\tfrac{\mathcal{N}_\epsilon}{\delta}\Bigr) + \frac{8k^2\epsilon^2}{\lambda},
\end{aligned}$$

where the last step follows from $\|\Delta V\|_\infty = \|V-\tilde{V}\|_\infty \le \epsilon$ and $\boldsymbol{\Lambda}_{m_k,t_k,h} \succeq \lambda\mathbf{I}$. Finally, by Lemma B.2, we have $\|\mathbf{w}_{m,k,h}\| \le 2H\sqrt{dk/\lambda}$. Plugging in the bound for $\mathcal{N}_\epsilon$ from Lemma C.1, we get

$$\biggl\|\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{t_k,h}}^{t_k-1}\phi_{\tau,h}\bigl[V(s_{\tau,h+1}) - \mathbb{P}_h V(s_{\tau,h},a_{\tau,h})\bigr]\biggr\|^2_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}} \le 8\bigl(\sqrt{M\alpha}+\sqrt{1+M\alpha}\bigr)^2 H^2\,\iota' + \frac{8k^2\epsilon^2}{\lambda},$$

where

$$\iota' := \frac{d}{2}\log\Bigl(\tfrac{K+\alpha\lambda}{\alpha\lambda}\Bigr) + \frac{d}{2}\log\Bigl(\tfrac{K+\lambda}{\lambda}\Bigr) + 2\log\tfrac{1}{\delta} + d\log\Bigl(2+\tfrac{8H^2dk}{\lambda\epsilon^2}\Bigr) + 2d^2\log\Bigl(1+\tfrac{8\sqrt{d}\,\beta^2}{\lambda\epsilon^2}\Bigr).$$

We choose $\epsilon = dH/k$, and conclude that

$$\biggl\|\sum_{\tau=1,\,\tau\in\mathcal{D}^{\mathrm{ser}}_{t_k,h}}^{t_k-1}\phi_{\tau,h}\bigl[V(s_{\tau,h+1}) - \mathbb{P}_h V(s_{\tau,h},a_{\tau,h})\bigr]\biggr\|_{\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}} \le C\bigl(\sqrt{M\alpha}+\sqrt{1+M\alpha}\bigr)H\sqrt{\iota} + C\,dH/\sqrt{\lambda},$$

where

$$\iota = d\log\Bigl(\tfrac{K+\alpha\lambda}{\alpha\lambda}\Bigr) + d\log\Bigl(\tfrac{K+\lambda}{\lambda}\Bigr) + \log\tfrac{1}{\delta} + d\log\Bigl(2+\tfrac{8k^3}{\lambda}\Bigr) + d^2\log\Bigl(1+\tfrac{8\beta^2k^2}{\lambda d^{1.5}H^2}\Bigr). \qquad ∎$$

C.8 Proof of Lemma 6.4 Lemma C.2 (Repeat of Lemma 6.4).

Under the same assumptions as Theorem 5.1, it holds that

$$\sum_{k=1}^{K}\min\biggl\{V_{m_k,t_k(m_k),1}(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1}),\ \sum_{h=1}^{H}2\beta\sqrt{\phi(s_{k,h},a_{k,h})^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s_{k,h},a_{k,h})}\biggr\} \le \mathcal{O}\Bigl(\beta\sqrt{1+M\alpha}\,H\sqrt{dK\log\bigl(2dK/(\min\{1,\lambda\}\delta)\bigr)}\Bigr).$$

Proof of Lemma 6.4.

Suppose that agents communicate with the server at episodes $0 = k_0 < k_1 < \cdots < k_N = K+1$. Here $k_0 = 0$ and $k_N = K+1$ are imaginary episodes created for notational convenience. The first step is to use a reordering trick to argue that it suffices to consider the case where there is only one active agent in the episodes $[k_i, k_{i+1}-1]$, that is, $m_{k_i} = m_{k_i+1} = \cdots = m_{k_{i+1}-1}$.

To see why this is the case, suppose an agent $m$ communicates with the server at some episodes $k_1$ and $k_2$. Then the order of actions between $k_1$ and $k_2$ will not affect agent $m$'s covariance matrix and dataset at episode $k_1$ or $k_2$, and thus will not affect the estimated Q-function updated at the end of episodes $k_1$ and $k_2$. Furthermore, agent $m$'s participation in the episodes $[k_1+1, k_2-1]$ will also not affect the other agents' estimated Q-functions, since agent $m$ does not upload any new trajectory there. Given the above rationale, we can reorder all the episodes in a way such that each agent communicates with the server and keeps participating until the next agent kicks in to communicate with the server. The fundamental reason is that each agent only performs local data updates between two communications, which does not affect any other agent. Consequently, this reordering is always valid under the current communication protocol of Algorithm 2.

Hence, in the following we only consider the case where the communication episodes are $0 = k_0 < k_1 < \cdots < k_N = K+1$, and $m_{k_i} = m_{k_i+1} = \cdots = m_{k_{i+1}-1}$ for each $i = 0,\cdots,N-1$. The summation of bonuses can thus be rephrased as

$$\begin{aligned}
&\sum_{k=1}^{K}\min\biggl\{V_{m_k,t_k(m_k),1}(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1}),\ \sum_{h=1}^{H}2\beta\sqrt{\phi(s_{k,h},a_{k,h})^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s_{k,h},a_{k,h})}\biggr\} \\
&\le \underbrace{2\beta\sum_{i=0}^{N-1}\sum_{k=k_i+1}^{k_{i+1}-1}\sum_{h=1}^{H}\sqrt{\phi_{k,h}^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi_{k,h}}}_{\mathrm{I}} \\
&\quad + \underbrace{2\beta\sum_{i=1}^{N-1}\min\biggl\{V_{m_{k_i},t_{k_i},1}(s_{k_i,1}) - V_1^{\pi_{m_{k_i},k_i}}(s_{k_i,1}),\ \sum_{h=1}^{H}\sqrt{\phi_{k_i,h}^\top\boldsymbol{\Lambda}^{-1}_{m_{k_i},t_{k_i},h}\phi_{k_i,h}}\biggr\}}_{\mathrm{II}}.
\end{aligned} \tag{C.11}$$

To bound I, by (B.2) in Lemma B.3 it holds that

$$\mathrm{I} \le 2\beta\sum_{i=0}^{N-1}\sum_{k=k_i+1}^{k_{i+1}-1}\sum_{h=1}^{H}\sqrt{1+M\alpha}\,\|\phi_{k,h}\|_{(\boldsymbol{\Lambda}^{\mathrm{all}}_{k,h})^{-1}} \le 2\beta\sqrt{1+M\alpha}\sum_{k=1}^{K}\sum_{h=1}^{H}\|\phi_{k,h}\|_{(\boldsymbol{\Lambda}^{\mathrm{all}}_{k,h})^{-1}}. \tag{C.12}$$

To bound II, we apply a refined analysis adapted from He et al. (2022a). Specifically, we define the following episode indices

$$\tilde{K}_i := \min\bigl\{k\in[K] : \exists h\in[H]\ \text{s.t.}\ \det(\boldsymbol{\Lambda}^{\mathrm{all}}_{k,h}) \ge 2^i\lambda^d\bigr\},$$

and define $N'$ to be the largest integer such that $\tilde{K}_{N'}$ is well-defined. For each interval $[\tilde{K}_i, \tilde{K}_{i+1})$, consider an arbitrary agent $m\in[M]$. Suppose that during this interval agent $m$ communicates with the server at episodes $k_{i,1} < k_{i,2} < \cdots < k_{i,z}$. Note that here we assume there are at least two communication rounds for $m$; the case of zero or one communication round is straightforward, as will be shown shortly. Now, for $j = 2,\cdots,z$, agent $m$ is active at episodes $k_{i,j-1}$ and $k_{i,j}$. As a result, we can apply (B.2) in Lemma B.3 with our reordering trick, and get that

$$\sum_{h=1}^{H}\|\phi_{k_{i,j},h}\|_{\boldsymbol{\Lambda}^{-1}_{m,k_{i,j},h}} \le \sum_{h=1}^{H}\|\phi_{k_{i,j},h}\|_{\boldsymbol{\Lambda}^{-1}_{m,k_{i,j-1}+1,h}} \le \sqrt{1+M\alpha}\sum_{h=1}^{H}\|\phi_{k_{i,j},h}\|_{(\boldsymbol{\Lambda}^{\mathrm{all}}_{k_{i,j-1}+1,h})^{-1}},$$

where the first step is by $\boldsymbol{\Lambda}^{-1}_{m,k_{i,j},h} \preceq \boldsymbol{\Lambda}^{-1}_{m,k_{i,j-1}+1,h}$. Furthermore, by the definition of $\tilde{K}_i$, it holds that $\det(\boldsymbol{\Lambda}^{\mathrm{all}}_{\tilde{K}_{i+1}-1,h})/\det(\boldsymbol{\Lambda}^{\mathrm{all}}_{k_{i,j-1}+1,h}) \le 2$, which implies

$$\sum_{h=1}^{H}\|\phi_{k_{i,j},h}\|_{\boldsymbol{\Lambda}^{-1}_{m,k_{i,j},h}} \le \sqrt{2}\sqrt{1+M\alpha}\sum_{h=1}^{H}\|\phi_{k_{i,j},h}\|_{(\boldsymbol{\Lambda}^{\mathrm{all}}_{\tilde{K}_{i+1}-1,h})^{-1}} \le \sqrt{2}\sqrt{1+M\alpha}\sum_{h=1}^{H}\|\phi_{k_{i,j},h}\|_{(\boldsymbol{\Lambda}^{\mathrm{all}}_{k_{i,j},h})^{-1}}. \tag{C.13}$$

The second step above holds since $k_{i,j} \le \tilde{K}_{i+1}-1$. For the episodes $k_{i,1}$ (i.e., $j=1$), we can trivially bound the corresponding term by $\max\bigl[V_{m_{k_i},t_{k_i},1}(\cdot) - V_1^{\pi_{m_{k_i},k_i}}(\cdot)\bigr] \le 2H$. Together with (C.11), (C.12) and (C.13), we have

$$\begin{aligned}
&\sum_{k=1}^{K}\min\biggl\{V_{m_k,t_k(m_k),1}(s_{k,1}) - V_1^{\pi_{m_k,k}}(s_{k,1}),\ \sum_{h=1}^{H}2\beta\sqrt{\phi(s_{k,h},a_{k,h})^\top\boldsymbol{\Lambda}^{-1}_{m_k,t_k,h}\phi(s_{k,h},a_{k,h})}\biggr\} \\
&\le 2HN' + 2\beta\sqrt{2(1+M\alpha)}\sum_{i=0}^{N-1}\sum_{k=k_i}^{k_{i+1}-1}\sum_{h=1}^{H}\|\phi_{k,h}\|_{(\boldsymbol{\Lambda}^{\mathrm{all}}_{k,h})^{-1}} \\
&\le 2HN' + 4\beta\sqrt{1+M\alpha}\,H\sqrt{dK\log\bigl(2dK/(\min\{1,\lambda\}\delta)\bigr)},
\end{aligned}$$

where the last step follows from the standard elliptical potential argument (Abbasi-Yadkori et al., 2011; Jin et al., 2020). To bound $N'$, by Assumption 3.1 it holds that

$$\det(\boldsymbol{\Lambda}^{\mathrm{all}}_{k,h}) \le (\lambda+K)^d,$$

and therefore $N' \le d\log(1+K/\lambda)/\log 2$. This finishes the proof. ∎
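The elliptical potential step used above can be checked numerically. The following sketch (synthetic unit-norm features with $\lambda = 1$, our assumptions) verifies the bound $\sum_{k=1}^{K}\|\phi_k\|_{\boldsymbol{\Lambda}_k^{-1}} \le \sqrt{2dK\log(1+K/(\lambda d))}$, which follows from the potential lemma combined with the Cauchy-Schwarz inequality.

```python
import numpy as np

# Elliptical potential sanity check: Lambda_k is built from all features up to
# episode k-1; the summed bonus grows only like sqrt(d K log K), not like K.
rng = np.random.default_rng(1)
d, K, lam = 5, 300, 1.0

Lam = lam * np.eye(d)
total = 0.0
for k in range(K):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)  # ||phi||_2 <= 1
    total += np.sqrt(phi @ np.linalg.solve(Lam, phi))
    Lam += np.outer(phi, phi)

# Potential lemma: sum ||phi||^2 <= 2 d log(1 + K/(lam d)); Cauchy-Schwarz
# then gives the sqrt(K * ...) bound on the sum of the square roots.
bound = np.sqrt(2 * d * K * np.log(1 + K / (lam * d)))
print(total, bound)
```

With $\lambda \ge 1$ each term is at most $1$, so the `min` with $1$ in the classical statement is inactive here.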

Appendix D Lower bound

To prove the lower bound, we construct a series of hard-to-learn MDPs as follows. For each hard-to-learn MDP, the state space $\mathcal{S}$ consists of $d/2$ different states, $\mathcal{S} = \{s_1,\ldots,s_{d/2-2}, g_0, g_1\}$, where $\{s_1,\ldots,s_{d/2-2}\}$ are possible initial states and $\{g_0, g_1\}$ are absorbing states. The action space $\mathcal{A}$ consists of only two actions $\{a_0, a_1\}$. For each stage $h\in[H]$, the agent always receives reward $1$ at state $g_0$ and reward $0$ at all other states. For the stochastic transition process, each initial state $s_i$ transitions to one of the absorbing states $g_0$ or $g_1$, and stays at that absorbing state afterwards. Since the state and action spaces are finite, these hard-to-learn tabular MDPs can be further represented as linear MDPs with dimension $|\mathcal{S}|\times|\mathcal{A}| = d$.

Now, for each initial state $s_i$, the selection of an action $a\in\{a_0,a_1\}$ can be viewed as a $2$-armed bandit problem with Bernoulli-type reward ($0$ for reaching absorbing state $g_1$ and $H-1$ for reaching absorbing state $g_0$), and we have the following lemmas:
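The tabular-to-linear embedding claimed above can be made concrete with the standard one-hot construction (our illustration; the paper only asserts that the dimension is $|\mathcal{S}|\times|\mathcal{A}| = d$). The transition probabilities `p_good` below are placeholders; in the hard instance they would additionally depend on the chosen action.

```python
import numpy as np

# One-hot linear-MDP embedding of a tabular MDP: phi(s,a) = e_{(s,a)}, and
# mu_h is the matrix whose row (s,a) is the transition distribution P_h(.|s,a).
d_param = 8                                   # target linear-MDP dimension d
states = [f"s{i}" for i in range(1, d_param // 2 - 1)] + ["g0", "g1"]  # d/2 states
actions = ["a0", "a1"]
pairs = [(s, a) for s in states for a in actions]

def phi(s, a):
    v = np.zeros(len(pairs))
    v[pairs.index((s, a))] = 1.0
    return v

# Each initial state moves to g0 or g1 and is absorbed; p_good is a placeholder
# (action-independent here, unlike the actual hard instance).
p_good = {s: 0.5 for s in states if s.startswith("s")}
def P(s, a):
    out = np.zeros(len(states))
    if s in ("g0", "g1"):
        out[states.index(s)] = 1.0            # absorbing states
    else:
        out[states.index("g0")] = p_good[s]
        out[states.index("g1")] = 1.0 - p_good[s]
    return out

mu = np.stack([P(s, a) for (s, a) in pairs])  # shape (d, |S|)
# Linear-MDP identity P_h(.|s,a) = <phi(s,a), mu_h> holds exactly:
print(np.allclose(phi("s1", "a0") @ mu, P("s1", "a0")))
```

The dimension count matches the text: $|\mathcal{S}| = d/2$ states times $2$ actions gives $d$ one-hot features.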

Lemma D.1 (Theorem 9.1 in Lattimore & Szepesvári 2020).

For any $2$-armed Bernoulli bandit problem, there exists an algorithm (e.g., the MOSS algorithm in Section 9.1 of Lattimore & Szepesvári (2020)) with expected regret $\mathbb{E}[\mathrm{Regret}(T)] \le 40\sqrt{T}$.

The original Theorem 9.1 holds for multi-armed bandits with sub-Gaussian noise; we only need the result for $2$-armed Bernoulli bandits.

Lemma D.2 (Lemma D.2 in Wang et al. 2020).

For any algorithm Alg and any $T$, there exists a $2$-armed Bernoulli bandit instance such that the regret is lower bounded by $\mathbb{E}[\mathrm{Regret}(T)] \ge \sqrt{T}/75$.

The lemma in Wang et al. (2020) extends the result for Gaussian bandits (Lattimore & Szepesvári, 2020) to Bernoulli bandits and holds for general multi-armed bandit problems. For this lower bound, we only need the result for $2$-armed bandits.

We now prove Theorem 5.5, which extends the lower bound results of Wang et al. (2020, Theorem 2) and He et al. (2022a, Theorem 5.3) from bandits to MDPs.

Proof of Theorem 5.5.

We divide the $K$ episodes into $d/2$ different epochs. For each epoch $i$ (from episode $2(i-1)K/d+1$ to episode $2iK/d$), we set the initial state to $s_i$ and let each agent $m\in[M]$ be active for $2K/(dM)$ rounds (where we assume $2K/(dM)$ is an integer for simplicity). We now analyze the regret $\mathbb{E}[\mathrm{Regret}_{i,\mathbf{Alg}_i}]$ for each epoch $i$.

For each epoch $i$ and any multi-agent reinforcement learning algorithm Alg, we construct an auxiliary algorithm $\mathbf{Alg}_i$ as follows: each agent $m\in[M]$ runs Alg until its first communication with the server after epoch $i-1$. After that communication, agent $m$ discards all previous information and runs the algorithm from Lemma D.1 (e.g., the MOSS algorithm in Lattimore & Szepesvári (2020)).

In this case, for each agent $m\in[M]$, $\mathbf{Alg}_i$ can only communicate with the server before epoch $i$, which can only provide information about the previous states $s_1,\ldots,s_{i-1}$. Since the agent cannot receive any information about state $s_i$ from other agents, the performance of $\mathbf{Alg}_i$ in epoch $i$ reduces to that of a single-agent bandit algorithm.

Now, we consider the hard-to-learn Bernoulli bandits in Lemma D.2 with $T = 2K/(dM)$ rounds. Since $\mathbf{Alg}_i$ reduces to a single-agent bandit algorithm with Bernoulli-type reward ($0$ or $H-1$), Lemma D.2 implies that the expected regret of agent $m$ under $\mathbf{Alg}_i$ is lower bounded by

$$\mathbb{E}[\mathrm{Regret}_{i,m,\mathbf{Alg}_i}] \ge (H-1)\sqrt{T}/75. \tag{D.1}$$

Taking the sum of (D.1) over all agents 𝑚 ∈ [ 𝑀 ] , we obtain

$$\mathbb{E}[\mathrm{Regret}_{i,\mathbf{Alg}_i}] = \sum_{m=1}^{M}\mathbb{E}[\mathrm{Regret}_{i,m,\mathbf{Alg}_i}] \ge M(H-1)\sqrt{T}/75. \tag{D.2}$$

For each agent $m\in[M]$, let $\delta_{i,m}$ denote the probability that agent $m$ communicates with the server during epoch $i$. Notice that before the communication, $\mathbf{Alg}_i$ behaves identically to the original Alg, and the corresponding regret of $\mathbf{Alg}_i$ is upper bounded by $\mathbb{E}[\mathrm{Regret}_{i,m,\mathbf{Alg}}]$. After the communication during epoch $i$, $\mathbf{Alg}_i$ performs the near-optimal algorithm from Lemma D.1, which provides a $40(H-1)\sqrt{T}$ regret guarantee. Combining these results, the expected regret of agent $m$ under $\mathbf{Alg}_i$ is upper bounded by

$$\mathbb{E}[\mathrm{Regret}_{i,m,\mathbf{Alg}_i}] \le \mathbb{E}[\mathrm{Regret}_{i,m,\mathbf{Alg}}] + 40\,\delta_{i,m}(H-1)\sqrt{T}. \tag{D.3}$$

Taking the sum of (D.3) over all agents 𝑚 ∈ [ 𝑀 ] , we obtain

$$\begin{aligned}
\mathbb{E}[\mathrm{Regret}_{i,\mathbf{Alg}_i}] &= \sum_{m=1}^{M}\mathbb{E}[\mathrm{Regret}_{i,m,\mathbf{Alg}_i}] \\
&\le \sum_{m=1}^{M}\mathbb{E}[\mathrm{Regret}_{i,m,\mathbf{Alg}}] + \Bigl(\sum_{m=1}^{M}\delta_{i,m}\Bigr)\cdot 40(H-1)\sqrt{T} \\
&= \mathbb{E}[\mathrm{Regret}_{i,\mathbf{Alg}}] + 40\,\delta_i(H-1)\sqrt{T},
\end{aligned} \tag{D.4}$$

where $\delta_i = \sum_{m=1}^{M}\delta_{i,m}$ is the expected communication complexity during epoch $i$. Summing the regret bounds (D.2) and (D.4) over all epochs $i\in[d/2]$, we have

$$\sum_{i=1}^{d/2}\mathbb{E}[\mathrm{Regret}_{i,\mathbf{Alg}_i}] \ge dM(H-1)\sqrt{T}/150,$$

$$\sum_{i=1}^{d/2}\mathbb{E}[\mathrm{Regret}_{i,\mathbf{Alg}_i}] \le \sum_{i=1}^{d/2}\Bigl(\mathbb{E}[\mathrm{Regret}_{i,\mathbf{Alg}}] + 40\,\delta_i(H-1)\sqrt{T}\Bigr) = \mathbb{E}[\mathrm{Regret}_{\mathbf{Alg}}] + 40\,\delta(H-1)\sqrt{T},$$

where $\delta = \sum_{i=1}^{d/2}\delta_i$ denotes the total expected communication complexity. Combining these results, for any algorithm Alg with expected communication complexity $\delta \le dM/12000 = O(dM)$, we have

$$\mathbb{E}[\mathrm{Regret}_{\mathbf{Alg}}] \ge dM(H-1)\sqrt{T}/150 - 40\,\delta(H-1)\sqrt{T} \ge dM(H-1)\sqrt{T}/300 = \Omega\bigl(H\sqrt{dMK}\bigr).$$

This finishes the proof of Theorem 5.5. ∎
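The final arithmetic step can be verified exactly with rational arithmetic (working per unit of $dM$, since every term scales linearly in $dM(H-1)\sqrt{T}$):

```python
from fractions import Fraction

# Check: if delta <= d*M/12000, then
#   d*M/150 - 40*delta >= d*M/300   (all in units of (H-1)*sqrt(T)).
dM = Fraction(1)                  # normalize d*M to 1; everything scales linearly
delta_max = dM / 12000            # largest allowed communication complexity
lower = dM / 150 - 40 * delta_max # worst-case remaining lower bound
print(lower, dM / 300)
```

With $T = 2K/(dM)$ we have $dM\sqrt{T} = \sqrt{2dMK}$, which yields the stated $\Omega(H\sqrt{dMK})$ rate.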

Appendix E Auxiliary Lemmas Lemma E.1 (Lemma D.1 in Jin et al. 2020).

Let $\boldsymbol{\Lambda}_t = \lambda\mathbf{I} + \sum_{\tau=1}^{t}\phi_\tau\phi_\tau^\top$, where $\phi_\tau\in\mathbb{R}^d$ for all $\tau$ and $\lambda > 0$. Then

$$\sum_{\tau=1}^{t}\phi_\tau^\top\boldsymbol{\Lambda}_t^{-1}\phi_\tau \le d.$$
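Lemma E.1 follows from a trace identity, which the following sketch (random features, our choice of constants) verifies numerically: $\sum_\tau \phi_\tau^\top\boldsymbol{\Lambda}_t^{-1}\phi_\tau = \mathrm{tr}(\boldsymbol{\Lambda}_t^{-1}(\boldsymbol{\Lambda}_t - \lambda\mathbf{I})) = d - \lambda\,\mathrm{tr}(\boldsymbol{\Lambda}_t^{-1}) \le d$.

```python
import numpy as np

# Numerical check of Lemma E.1 via the trace identity.
rng = np.random.default_rng(2)
d, t, lam = 6, 50, 0.3
phis = rng.normal(size=(t, d))

Lam = lam * np.eye(d) + phis.T @ phis      # Lambda_t = lam*I + sum phi phi^T
inv = np.linalg.inv(Lam)
total = sum(p @ inv @ p for p in phis)     # sum of phi^T Lambda^{-1} phi

# Equals d - lam * tr(Lambda^{-1}), hence always at most d.
print(total, d - lam * np.trace(inv))
```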

Lemma E.2 (Lemma 12 in Abbasi-Yadkori et al. (2011)).

Suppose $\mathbf{A},\mathbf{B}\in\mathbb{R}^{d\times d}$ are positive definite matrices such that $\mathbf{A}\succeq\mathbf{B}$. Then for any $\mathbf{x}\in\mathbb{R}^d$, $\|\mathbf{x}\|_{\mathbf{A}} \le \|\mathbf{x}\|_{\mathbf{B}}\cdot\sqrt{\det(\mathbf{A})/\det(\mathbf{B})}$.
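Lemma E.2 is exactly the inequality invoked in Appendix C.5 to convert the determinant trigger into a quadratic-form bound. A quick numerical check (random positive definite matrices, our construction):

```python
import numpy as np

# Check ||x||_A <= ||x||_B * sqrt(det(A)/det(B)) for A >= B > 0 (PSD order).
rng = np.random.default_rng(3)
d = 5
M = rng.normal(size=(d, d))
B = M @ M.T + np.eye(d)          # positive definite B
N = rng.normal(size=(d, d))
A = B + N @ N.T                  # A = B + PSD term, so A >= B

ratio = np.sqrt(np.linalg.det(A) / np.linalg.det(B))
worst = 0.0
for _ in range(100):
    x = rng.normal(size=d)
    lhs = np.sqrt(x @ A @ x)
    rhs = np.sqrt(x @ B @ x) * ratio
    worst = max(worst, lhs / rhs)
print(worst)  # should never exceed 1 (up to floating-point error)
```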

E.1 Concentration Inequalities Theorem E.3 (Hoeffding-type inequality for self-normalized martingales (Abbasi-Yadkori et al., 2011)).

Let $\{\eta_t\}_{t=1}^{\infty}$ be a real-valued stochastic process. Let $\{\mathcal{F}_t\}_{t=0}^{\infty}$ be a filtration such that $\eta_t$ is $\mathcal{F}_t$-measurable. Assume that conditioned on $\mathcal{F}_{t-1}$, $\eta_t$ is zero-mean and $R$-subgaussian for some $R > 0$, i.e.,

$$\forall\lambda\in\mathbb{R},\quad \mathbb{E}\bigl[e^{\lambda\eta_t}\mid\mathcal{F}_{t-1}\bigr] \le e^{\lambda^2R^2/2}.$$

Let $\{\phi_t\}_{t=1}^{\infty}$ be an $\mathbb{R}^d$-valued stochastic process where $\phi_t$ is $\mathcal{F}_{t-1}$-measurable. Assume $\boldsymbol{\Lambda}_0$ is a $d\times d$ positive definite matrix, and let $\boldsymbol{\Lambda}_t = \boldsymbol{\Lambda}_0 + \sum_{s=1}^{t}\phi_s\phi_s^\top$. Then, for any $\delta > 0$, with probability at least $1-\delta$, for all $t \ge 0$,

$$\Bigl\|\sum_{s=1}^{t}\phi_s\eta_s\Bigr\|^2_{\boldsymbol{\Lambda}_t^{-1}} \le 2R^2\log\Bigl(\frac{\det(\boldsymbol{\Lambda}_t)^{1/2}\det(\boldsymbol{\Lambda}_0)^{-1/2}}{\delta}\Bigr).$$
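A simulation sketch of Theorem E.3 (our synthetic setup, Gaussian noise with $R=1$): since the bound holds uniformly over $t$ with probability at least $1-\delta$, a fixed random seed makes the check reproducible.

```python
import numpy as np

# Self-normalized martingale bound, simulated with standard normal noise (R = 1).
rng = np.random.default_rng(4)
d, T, lam, delta, R = 4, 1000, 1.0, 0.01, 1.0

Lam0 = lam * np.eye(d)
Lam = Lam0.copy()
S = np.zeros(d)
ok = True
for t in range(T):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)   # features chosen before observing eta
    eta = rng.normal()           # zero-mean, 1-subgaussian noise
    S += phi * eta
    Lam += np.outer(phi, phi)
    bound = 2 * R**2 * np.log(
        np.sqrt(np.linalg.det(Lam) / np.linalg.det(Lam0)) / delta
    )
    ok = ok and (S @ np.linalg.solve(Lam, S) <= bound)
print(ok)
```

In practice the bound is quite loose, which is consistent with the slack constants appearing throughout the proofs above.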

Lemma E.4 (Lemma D.4 in Jin et al. 2020).

Let $\mathcal{V}$ be a function class such that any $V\in\mathcal{V}$ maps from $\mathcal{S}$ to $\mathbb{R}$ with $\|V\|_\infty \le R$. Let $\{\mathcal{F}_t\}_{t=0}^{\infty}$ be a filtration. Let $\{s_t\}_{t=1}^{\infty}$ be a stochastic process in the space $\mathcal{S}$ such that $s_t$ is $\mathcal{F}_t$-measurable. Let $\{\phi_t\}_{t=0}^{\infty}$ be an $\mathbb{R}^d$-valued stochastic process such that $\phi_t$ is $\mathcal{F}_{t-1}$-measurable and $\|\phi_t\|_2 \le 1$. Let $\boldsymbol{\Lambda}_k = \lambda\mathbf{I} + \sum_{t=1}^{k-1}\phi_t\phi_t^\top$. Then for any $\delta > 0$, with probability at least $1-\delta$, for any $k$ and any $V\in\mathcal{V}$, we have

$$\Bigl\|\sum_{t=1}^{k-1}\phi_t\bigl[V(s_t) - \mathbb{E}[V(s_t)\mid\mathcal{F}_{t-1}]\bigr]\Bigr\|^2_{\boldsymbol{\Lambda}_k^{-1}} \le 4R^2\Bigl[\frac{d}{2}\log\Bigl(\frac{k+\lambda}{\lambda}\Bigr) + \log\frac{\mathcal{N}^{\mathcal{V}}_\epsilon}{\delta}\Bigr] + \frac{8k^2\epsilon^2}{\lambda},$$

where $\mathcal{N}^{\mathcal{V}}_\epsilon$ is the $\epsilon$-covering number of $\mathcal{V}$ with respect to the $\ell_\infty$ distance.
