Achieving Tractable Minimax Optimal Regret in Average Reward MDPs
Abstract
A novel algorithm is presented for learning average-reward MDPs with minimax optimal regret bounds that do not require prior knowledge of the optimal bias function's span.
In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of mathrm{O}(mathrm{sp(h^*) S A T}), where sp(h^*) is the span of the optimal bias function h^*, S times A is the size of the state-action space and T the number of learning steps. Remarkably, our algorithm does not require prior information on sp(h^*). Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.
Get this paper in your agent:
hf papers read 2406.01234 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper