Title: Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits
URL Source: https://arxiv.org/html/2310.00968
License: arXiv.org perpetual non-exclusive license. arXiv:2310.00968v2 [cs.LG] 15 Oct 2024

Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits

Qiwei Di¹, Tao Jin²†, Yue Wu¹, Heyang Zhao¹, Farzad Farnoud², Quanquan Gu¹†
¹Department of Computer Science, University of California, Los Angeles
²Department of Computer Science, University of Virginia
qiwei2000@cs.ucla.edu, taoj@virginia.edu, ywu@cs.ucla.edu, hyzhao@cs.ucla.edu, farzad@virginia.edu, qgu@cs.ucla.edu
Equal Contribution; †Co-corresponding Authors

Abstract
Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such as ranking, information retrieval, and recommendation systems. While substantial efforts have been made to minimize the cumulative regret in dueling bandits, a notable gap in the current research is the absence of regret bounds that account for the inherent uncertainty in pairwise comparisons between the dueling arms. Intuitively, greater uncertainty suggests a higher level of difficulty in the problem. To bridge this gap, this paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM). We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde O\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$, where $\sigma_t$ is the variance of the pairwise comparison in round $t$, $d$ is the dimension of the context vectors, and $T$ is the time horizon. Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde O(d)$ regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms.
1 Introduction
The multi-armed bandit (MAB) model has undergone comprehensive examination as a framework for decision-making with uncertainty. Within this framework, an agent has to select one specific βarmβ to pull in each round, and receives a stochastic reward as feedback. The objective is to maximize the cumulative reward accumulated over all rounds. While the MAB model provides a robust foundation for various applications, the reality is that many real-world tasks present an intractably large action space coupled with intricate contextual information. Consequently, this challenge has led to the proposal of the (linear) contextual bandit model, where the reward is intricately linked to both the context associated with the selected arm and the underlying reward function. A series of work into the linear contextual bandits has led to efficient algorithms such as LinUCB (Li et al., 2010; Chu et al., 2011) and OFUL (Abbasi-Yadkori et al., 2011).
In scenarios where feedback is based on subjective human experiences β a phenomenon evident in fields such as information retrieval (Yue & Joachims, 2009), ranking (Minka et al., 2018), crowdsourcing (Chen et al., 2013), and Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) β preferential choices emerge as a more natural and intuitive form of feedback compared with numerical evaluations. The rationale behind preference feedback lies in the fact that numerical scores can exhibit significant variability among individuals, resulting in noisy and poorly calibrated rewards. On the contrary, a binary signal from preferential feedback remains independent of scale and is thus more reliable. This distinction gives rise to a specialized variant of the MAB problem known as dueling bandits (Yue et al., 2012). In this setting, the agent simultaneously pulls two arms and receives binary preferential feedback, which essentially indicates the outcome of a comparison between the chosen arms. A line of works proposed efficient and practical algorithms for multi-armed dueling bandits based on upper confidence bound (UCB) (Zoghi et al., 2014; 2015) or Thompson sampling (Wu & Liu, 2016). Similar to linear contextual bandits, considerable effort has been invested in developing efficient algorithms that minimize the cumulative regret for the contextual dueling bandits (Saha, 2021; Bengs et al., 2022).
Intuitively, the variance of the noise in the feedback signal determines the difficulty of the problem. To illustrate, consider an extreme case where the feedback of a linear contextual bandit is noiseless (i.e., the variance is zero). A learner can recover the underlying reward function precisely by exploring each dimension only once, and suffers only an $\tilde O(d)$ regret in total, where $d$ is the dimension of the context vector. This motivates a series of works on establishing variance-aware regret bounds for multi-armed bandits (e.g., Audibert et al., 2009; Mukherjee et al., 2017) and contextual bandits (e.g., Zhou et al., 2021; Zhang et al., 2021b; Kim et al., 2022; Zhao et al., 2023b;a). This observation also remains valid in the dueling bandit scenario. In particular, the binary preferential feedback is typically assumed to follow a Bernoulli distribution with mean $p$. The variance reaches its maximum when $p$ is close to $1/2$, a situation that is undesirable in human feedback applications, as it indicates a high level of disagreement or indecision. Therefore, maintaining a low variance in comparisons is usually preferred, and variance-dependent dueling algorithms are desirable because they can potentially outperform algorithms that only enjoy worst-case regret guarantees. This leads to the following research question:
Can we design a dueling bandit algorithm with a variance-aware regret bound?
We give an affirmative answer to this question by studying the dueling bandit problem with a contextualized generalized linear model, which is in the same setting as Saha (2021); Bengs et al. (2022). We summarize our contributions as follows:
- We propose a new algorithm, named VACDB, to obtain a variance-aware regret guarantee. This algorithm is built upon several innovative designs, including (1) adaptation of multi-layered estimators to generalized linear models where the mean and variance are coupled (i.e., the Bernoulli distribution), and (2) a symmetric arm selection rule that naturally aligns with the actual reward-maximization objective in dueling bandits.
- We prove that our algorithm enjoys a variance-aware regret bound $\tilde O\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$, where $\sigma_t$ is the variance of the comparison in round $t$. Our algorithm is computationally efficient and does not require any prior knowledge of the variance level, which is unavailable in the dueling bandit scenario. In the deterministic case, our regret bound becomes $\tilde O(d)$, showcasing a remarkable improvement over previous works. When the variances of the pairwise comparisons are the same across different pairs of arms, our regret reduces to the worst-case regret of $\tilde O(d\sqrt{T})$, which matches the lower bound $\Omega(d\sqrt{T})$ proved in Bengs et al. (2022).
- We compare our algorithm with many strong baselines on synthetic data. Our experiments demonstrate the empirical advantage of the proposed algorithm in terms of regret and adaptiveness when faced with environments of varying variances.
- As an additional outcome of our research, we identified an unrigorous argument in the existing analysis of the MLE for generalized linear bandits. To rectify this issue, we provide a rigorous proof based on Brouwer's invariance of domain (Brouwer, 1911), which is discussed further in Appendix D.
Notation

In this paper, we use plain letters such as $x$ to denote scalars, lowercase bold letters such as $\mathbf{x}$ to denote vectors, and uppercase bold letters such as $\mathbf{A}$ to denote matrices. For a vector $\mathbf{x}$, $\|\mathbf{x}\|_2$ denotes its $\ell_2$-norm. The weighted $\ell_2$-norm associated with a positive-definite matrix $\mathbf{A}$ is defined as $\|\mathbf{x}\|_{\mathbf{A}} = \sqrt{\mathbf{x}^\top\mathbf{A}\mathbf{x}}$. For two symmetric matrices $\mathbf{A}$ and $\mathbf{B}$, we use $\mathbf{A} \succeq \mathbf{B}$ to denote that $\mathbf{A} - \mathbf{B}$ is positive semidefinite. We use $\mathbb{1}$ to denote the indicator function and $\mathbf{0}$ to denote the zero vector. For a positive integer $N$, we use $[N]$ to denote $\{1, 2, \ldots, N\}$. We use $\mathbf{x}_{1:t}$ to denote the set $\{\mathbf{x}_i\}_{1 \le i \le t}$. We use standard asymptotic notation including $O(\cdot)$, $\Omega(\cdot)$, $\Theta(\cdot)$, while $\tilde O(\cdot)$, $\tilde\Omega(\cdot)$, $\tilde\Theta(\cdot)$ additionally hide logarithmic factors.
2 Related Work
Multi-Armed Bandits and Contextual Bandits. The multi-armed bandit problem involves an agent making sequential decisions among multiple arms based on the observation of stochastic reward, with the goal of maximizing the cumulative rewards over time. It has been widely studied, including works such as Lai et al. (1985); Lai (1987); Auer (2002); Auer et al. (2002); Kalyanakrishnan et al. (2012); Lattimore & SzepesvΓ‘ri (2020); Agrawal & Goyal (2012). To deal with large decision spaces with potentially infinitely many actions or to utilize contextual information, extensive studies have been conducted in contextual bandits. Some work focused on contextual linear bandits, where the mean reward of an arm is a linear function of some feature vectors, including algorithms such as LinUCB/SupLinUCB (Chu et al., 2011), OFUL (Abbasi-Yadkori et al., 2011). Other works, such as (Filippi et al., 2010; Li et al., 2017; Jun et al., 2017), studied the generalized linear bandits where the mean reward is from a generalized linear model (GLM).
Dueling Bandits. The problem of dueling bandits is a variant of the multi-armed bandits, where the stochastic reward is replaced by a pairwise preference. This model was first proposed in Yue et al. (2012). Many works (Zoghi et al., 2014; Komiyama et al., 2015) studied this problem assuming the existence of a Condorcet winner, i.e., one arm that beats all the other arms. There are also works on other types of winners such as the Copeland winner (Zoghi et al., 2015; Wu & Liu, 2016; Komiyama et al., 2016), Borda winner (Jamieson et al., 2015; Falahatgar et al., 2017; Heckel et al., 2018; Saha et al., 2021; Wu et al., 2023) and von Neumann winner (Ramamohan et al., 2016; Dudík et al., 2015; Balsubramani et al., 2016). Similar to the idea of contextual bandits, some works considered regret minimization for dueling bandits with context information. Kumagai (2017) studied the contextual dueling bandit problem where the feedback is based on a cost function, proposing a stochastic mirror descent algorithm and proving a regret upper bound under strong convexity and smoothness assumptions. Saha (2021) proposed algorithms and lower bounds for contextual preference bandits with a logistic link function, considering pairwise and subsetwise preferences, respectively. Bengs et al. (2022) further extended this to the contextual linear stochastic transitivity model, allowing an arbitrary comparison function, and provided efficient algorithms along with a matching lower bound for the weak regret. For a recent comprehensive survey of dueling bandits, please refer to Bengs et al. (2021). Our work studies the same model as Saha (2021); Bengs et al. (2022).
Variance-Aware Bandits. It has been shown empirically that leveraging variance information in multi-armed bandit algorithms can yield performance benefits (Auer et al., 2002). In light of this, Audibert et al. (2009) proposed an algorithm named UCBV, which is based on Bernstein's inequality equipped with the empirical variance. It provided the first analysis of variance-aware algorithms, demonstrating an improved regret bound. EUCBV (Mukherjee et al., 2017) is another variance-aware algorithm that employs an elimination strategy, incorporating variance estimates to determine the confidence bounds of the arms. For linear bandits, Zhou et al. (2021) proposed a Bernstein-type concentration inequality for self-normalized martingales and designed an algorithm named Weighted OFUL. This approach uses a weighted ridge regression scheme, discounting each sample's contribution to the estimator by its variance. In particular, they proved a variance-dependent regret upper bound, which was later improved by Zhou & Gu (2022). These two works assumed knowledge of the variance information. Without knowing the variances, Zhang et al. (2021a) and Kim et al. (2022) obtained variance-dependent regret bounds by constructing variance-aware confidence sets. Zhao et al. (2023b) proposed an algorithm named MOR-UCB with the idea of partitioning the observed data into several layers and grouping samples with similar variances into the same layer. A similar idea was used in Zhao et al. (2023a) to design a SupLin-type algorithm, SAVE. It assigns collected samples to $L$ layers according to their estimated variances, where each layer's variance upper bound is twice that of the layer below. In this way, within each layer, the estimated variance of one sample is at most twice that of any other. Their algorithm is computationally tractable with a variance-dependent regret bound based on a Freedman-type concentration inequality and adaptive variance-aware exploration.
3 Problem Setup
In this work, we consider a preferential feedback model with contextual information. In this model, an agent learns through sequential interactions with its environment over a series of rounds indexed by $t \in [T]$, where $T$ is the total number of rounds. In each round $t$, the agent is presented with a finite set of alternatives, each characterized by its associated feature in the contextual set $\mathcal{A}_t \subseteq \mathbb{R}^d$. Following the convention in bandit theory, we refer to these alternatives as arms. Both the number of alternatives and the contextual set $\mathcal{A}_t$ can vary with the round index $t$. Afterward, the agent selects a pair of arms, with features $(\mathbf{x}_t, \mathbf{y}_t)$ respectively. The environment then compares the two selected arms and returns stochastic feedback $o_t \in \{0, 1\}$, which informs the agent which arm is preferred: when $o_t = 1$ (resp. $o_t = 0$), the arm with feature $\mathbf{x}_t$ (resp. $\mathbf{y}_t$) wins.
We assume that the stochastic feedback $o_t$ follows a Bernoulli distribution whose expected value $p_t$ is determined by a generalized linear model (GLM). To be more specific, let $\mu(\cdot)$ be a fixed link function that is monotonically increasing and satisfies $\mu(x) + \mu(-x) = 1$. We assume the existence of an unknown parameter $\boldsymbol{\theta}^* \in \mathbb{R}^d$ which generates the preference probability given two contextual vectors, i.e.,

$$\mathbb{P}(o_t = 1) = \mathbb{P}(\text{arm with } \mathbf{x}_t \text{ is preferred over arm with } \mathbf{y}_t) = p_t = \mu\big((\mathbf{x}_t - \mathbf{y}_t)^\top\boldsymbol{\theta}^*\big).$$
This model is the same as the linear stochastic transitivity (LST) model in Bengs et al. (2022), which includes the Bradley-Terry-Luce (BTL) model (Hunter, 2003; Luce, 1959), the Thurstone-Mosteller model (Thurstone, 1994), and the exponential noise model as special cases. Please refer to Bengs et al. (2022) for details. The preference model studied in Saha (2021) can be treated as a special case where the link function is logistic.
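As a concrete illustration, the BTL special case (logistic link) can be simulated as follows. The helper names, dimension, and random draws are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu(x):
    """Logistic link for the BTL model; satisfies mu(x) + mu(-x) = 1."""
    return 1.0 / (1.0 + np.exp(-x))

d = 5
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)  # Assumption 3.1: ||theta*||_2 <= 1

def duel(x, y):
    """Sample o_t ~ Bernoulli(mu((x - y)^T theta*)); o_t = 1 means x wins."""
    p = mu((x - y) @ theta_star)
    return rng.binomial(1, p)

x, y = rng.standard_normal(d), rng.standard_normal(d)
o = duel(x, y)  # binary preferential feedback
```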
We make the following assumption on the boundedness of the true parameter $\boldsymbol{\theta}^*$ and the feature vectors.

Assumption 3.1.

$\|\boldsymbol{\theta}^*\|_2 \le 1$. There exists a constant $A > 0$ such that for all $t \in [T]$ and all $\mathbf{x} \in \mathcal{A}_t$, $\|\mathbf{x}\|_2 \le A$.
Additionally, we make the following assumption on the link function $\mu$, which is common in the study of generalized linear contextual bandits (Filippi et al., 2010; Li et al., 2017).

Assumption 3.2.

The link function $\mu$ is differentiable. Furthermore, its first derivative $\dot\mu$ satisfies $\kappa_\mu \le \dot\mu(\cdot) \le L_\mu$ for some constants $L_\mu, \kappa_\mu > 0$.
We define the random noise $\epsilon_t = o_t - p_t$. Since the stochastic feedback $o_t$ adheres to the Bernoulli distribution with expected value $p_t$, we have $\epsilon_t \in \{-p_t, 1 - p_t\}$, and hence $|\epsilon_t| \le 1$. Furthermore, we make the following assumptions:

$$\mathbb{E}[\epsilon_t \mid \mathbf{x}_{1:t}, \mathbf{y}_{1:t}, \epsilon_{1:t-1}] = 0, \qquad \mathbb{E}[\epsilon_t^2 \mid \mathbf{x}_{1:t}, \mathbf{y}_{1:t}, \epsilon_{1:t-1}] = \sigma_t^2.$$
Intuitively, $p_t$ reflects the difficulty associated with comparing the two arms:

- When $p_t$ is around $1/2$, the arms are quite similar, making the comparison challenging. Under this circumstance, the variance $\sigma_t^2$ tends toward a constant, reaching its maximum value of $1/4$.
- On the contrary, as $p_t$ approaches 0 or 1, one arm is distinctly preferable over the other, which simplifies the comparison. In such scenarios, the variance $\sigma_t^2$ decreases significantly toward 0.
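The two regimes above follow directly from the Bernoulli variance $\sigma_t^2 = p_t(1 - p_t)$. A quick numerical check (purely illustrative, not from the paper):

```python
import numpy as np

def bernoulli_variance(p):
    """Variance of the Bernoulli feedback o_t with mean p_t."""
    return p * (1.0 - p)

ps = np.linspace(0.0, 1.0, 101)
variances = bernoulli_variance(ps)

# The variance peaks at p = 1/2 with value 1/4 and vanishes as p -> 0 or 1.
print(ps[np.argmax(variances)], variances.max())
```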
The learning objective is to minimize the cumulative average regret defined as

$$\mathrm{Regret}(T) = \frac{1}{2}\sum_{t=1}^{T}\big[2\,\mathbf{x}_t^{*\top}\boldsymbol{\theta}^* - (\mathbf{x}_t + \mathbf{y}_t)^\top\boldsymbol{\theta}^*\big], \tag{3.1}$$

where $\mathbf{x}_t^* = \operatorname{argmax}_{\mathbf{x} \in \mathcal{A}_t}\mathbf{x}^\top\boldsymbol{\theta}^*$ is the contextual/feature vector of the optimal arm in round $t$. This definition is the same as the average regret studied in Saha (2021); Bengs et al. (2022). Note that in Bengs et al. (2022), besides the average regret, they also studied another type of regret, called weak regret. Since the weak regret is smaller than the average regret, the regret bound proved in our paper immediately implies a bound on the weak regret.
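The average regret (3.1) is straightforward to evaluate in simulation. The helper below is a hypothetical sketch (the paper does not prescribe an implementation); note that selecting the optimal arm against itself incurs zero regret under (3.1).

```python
import numpy as np

def average_regret(theta_star, arm_sets, chosen_pairs):
    """Cumulative average regret (3.1): half the gap between twice the
    optimal score and the summed scores of each chosen pair."""
    total = 0.0
    for arms, (x, y) in zip(arm_sets, chosen_pairs):
        best = max(a @ theta_star for a in arms)
        total += 0.5 * (2.0 * best - (x @ theta_star + y @ theta_star))
    return total

theta = np.array([1.0, 0.0])
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
# Round 1: optimal arm vs. itself -> zero regret.
# Round 2: optimal arm vs. the suboptimal arm -> regret 1/2.
r = average_regret(theta, [arms, arms], [(arms[0], arms[0]), (arms[0], arms[1])])
```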
4 Algorithm

4.1 Overview of the Algorithm

Algorithm 1: Variance-Aware Contextual Dueling Bandit (VACDB)

```
 1: Require: α > 0, L ← ⌈log₂(1/α)⌉, κ_μ, L_μ.
 2: Initialize: for ℓ ∈ [L], Σ̂_{1,ℓ} ← 2^{-2ℓ} λ I, θ̂_{1,ℓ} ← 0, Ψ_{1,ℓ} ← ∅,
      β̂_{1,ℓ} ← 2^{-ℓ}(1 + 1/κ_μ)
 3: for t = 1, …, T do
 4:   Observe 𝒜_t
 5:   Let 𝒜_{t,1} ← 𝒜_t, ℓ ← 1
 6:   while x_t, y_t are not specified do
 7:     if ‖x − y‖_{Σ̂_{t,ℓ}⁻¹} ≤ α for all x, y ∈ 𝒜_{t,ℓ} then
 8:       Choose (x_t, y_t) ← argmax_{x,y ∈ 𝒜_{t,ℓ}} {(x + y)ᵀ θ̂_{t,ℓ} + β̂_{t,ℓ} ‖x − y‖_{Σ̂_{t,ℓ}⁻¹}}
            and observe o_t = 1(x_t ≻ y_t)                        // Exploitation (Lines 7-9)
 9:       Keep the same index sets at all layers: Ψ_{t+1,ℓ'} ← Ψ_{t,ℓ'} for all ℓ' ∈ [L]
10:     else if ‖x − y‖_{Σ̂_{t,ℓ}⁻¹} ≤ 2^{-ℓ} for all x, y ∈ 𝒜_{t,ℓ} then
11:       𝒜_{t,ℓ+1} ← {x ∈ 𝒜_{t,ℓ} : xᵀ θ̂_{t,ℓ} ≥ max_{x' ∈ 𝒜_{t,ℓ}} x'ᵀ θ̂_{t,ℓ} − 2^{-ℓ} β̂_{t,ℓ}}
12:       ℓ ← ℓ + 1                                               // Elimination (Lines 10-12)
13:     else
14:       Choose x_t, y_t such that ‖x_t − y_t‖_{Σ̂_{t,ℓ}⁻¹} > 2^{-ℓ}
            and observe o_t = 1(x_t ≻ y_t)                        // Exploration (Lines 14-16)
15:       Compute the weight w_t ← 2^{-ℓ} / ‖x_t − y_t‖_{Σ̂_{t,ℓ}⁻¹}
16:       Update the index sets: Ψ_{t+1,ℓ} ← Ψ_{t,ℓ} ∪ {t} and Ψ_{t+1,ℓ'} ← Ψ_{t,ℓ'}
            for all ℓ' ∈ [L] \ {ℓ}
17:     end if
18:   end while
19:   For ℓ ∈ [L] such that Ψ_{t+1,ℓ} ≠ Ψ_{t,ℓ}, update
        Σ̂_{t+1,ℓ} ← Σ̂_{t,ℓ} + w_t² (x_t − y_t)(x_t − y_t)ᵀ
20:   Calculate the MLE θ̂_{t+1,ℓ} by solving the equation
        2^{-2ℓ} λ κ_μ θ + Σ_{i ∈ Ψ_{t+1,ℓ}} w_i² (μ((x_i − y_i)ᵀ θ) − o_i)(x_i − y_i) = 0
21:   Compute β̂_{t+1,ℓ} according to (4.3)
22:   For ℓ ∈ [L] such that Ψ_{t+1,ℓ} = Ψ_{t,ℓ}, let Σ̂_{t+1,ℓ} ← Σ̂_{t,ℓ},
        θ̂_{t+1,ℓ} ← θ̂_{t,ℓ}, β̂_{t+1,ℓ} ← β̂_{t,ℓ}
23: end for
```
In this section, we present our algorithm, named VACDB, in Algorithm 1. Our algorithm shares a similar structure with Sta'D in Saha (2021) and SupCoLSTIM in Bengs et al. (2022). The core of our algorithm is a sequential arm-elimination process: from Line 6 to Line 18, it conducts arm selection with a layered elimination procedure. Arms are progressively eliminated across layers, with increased exploration precision in the subsequent layers. Starting at layer $\ell = 1$, our algorithm runs a loop comprising three primary conditional phases: Exploitation (Lines 7-9), Elimination (Lines 10-12), and Exploration (Lines 14-16). When all arm pairs within a particular layer have low uncertainty, the elimination procedure begins, dropping the arms with suboptimal estimated values. This elimination process applies an adaptive bonus radius based on variance information; a more comprehensive discussion can be found in Section 4.3. Subsequently, the algorithm advances to a higher layer, where exploration is conducted over the set remaining after elimination. Upon encountering a layer with arm pairs of higher uncertainty than desired, our algorithm explores them and receives the feedback. Once comprehensive exploration has been achieved across layers and the uncertainty of all remaining arm pairs is small enough, our algorithm leverages the estimated parameters in the last layer to select the best arm from the remaining arms. For a detailed discussion of the selection policy, please refer to Section 4.4. After arm selection in the exploration phase, the estimator of the current layer is updated (Lines 19-22) using the regularized MLE, which is discussed in more detail in Section 4.2. Note that our algorithm maintains an index set $\Psi_{t,\ell}$ for each layer, comprising all rounds before round $t$ in which the algorithm conducted exploration in layer $\ell$. As a result, for each exploration step, only one of the estimators $\hat{\boldsymbol{\theta}}_{t,\ell}$ needs to be updated. Furthermore, our algorithm updates the covariance matrix $\hat{\boldsymbol{\Sigma}}_{t,\ell}$ used to estimate uncertainty (Line 19).
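The per-round logic of the three phases can be sketched as follows. This is a simplified, illustrative rendering of a single round: pair enumeration is brute-force, and `Sigma_inv`, `theta_hat`, `beta_hat` are assumed to be maintained per layer elsewhere. It is not the paper's reference implementation.

```python
import numpy as np

def one_round(arms, Sigma_inv, theta_hat, beta_hat, alpha, L):
    """One round of the layered exploit / eliminate / explore decision.
    Sigma_inv, theta_hat, beta_hat: dicts mapping a layer index to that
    layer's inverse covariance, parameter estimate, and confidence radius."""
    active = list(arms)
    for ell in range(1, L + 1):
        def unc(x, y):  # uncertainty ||x - y|| in the layer's norm
            z = x - y
            return float(np.sqrt(z @ Sigma_inv[ell] @ z))
        pairs = [(x, y) for x in active for y in active]
        if all(unc(x, y) <= alpha for x, y in pairs):
            # Exploitation: symmetric score plus an uncertainty bonus (Line 8)
            pick = max(pairs, key=lambda p: (p[0] + p[1]) @ theta_hat[ell]
                       + beta_hat[ell] * unc(*p))
            return pick, "exploit", ell
        if all(unc(x, y) <= 2.0 ** (-ell) for x, y in pairs):
            # Elimination: keep arms close to the empirical leader (Line 11)
            best = max(x @ theta_hat[ell] for x in active)
            active = [x for x in active
                      if x @ theta_hat[ell] >= best - 2.0 ** (-ell) * beta_hat[ell]]
            continue
        # Exploration: duel a pair that is still too uncertain (Line 14)
        x, y = max(pairs, key=lambda p: unc(*p))
        return (x, y), "explore", ell
    return (active[0], active[0]), "exploit", L
```

When the covariance of every layer is already well-conditioned (uncertainty below `alpha` everywhere), the round terminates immediately in the exploitation branch at layer 1.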
4.2 Regularized MLE
Most of the previous work adopts standard MLE techniques to maintain an estimator of $\boldsymbol{\theta}^*$ in the generalized linear bandit model (Filippi et al., 2010; Li et al., 2017), which requires an initial exploration phase to ensure a balanced input dataset across $\mathbb{R}^d$ for the MLE. In the dueling bandit setting, where the feedback in each round can be seen as a generalized linear reward, Saha (2021); Bengs et al. (2022) also applied a similar MLE in their algorithms. As a result, a random initial exploration phase is likewise inherited to ensure that the MLE equation has a unique solution. However, in our setting, where the decision set varies across rounds and may even be chosen arbitrarily by the environment, this initial exploration phase cannot be directly applied to control the minimum eigenvalue of the covariance matrix.
To resolve this issue, we introduce a regularized MLE for contextual dueling bandits, which is better behaved in the face of extreme input data and does not require an additional exploration phase in the starting rounds. Specifically, the regularized MLE is the solution of the following equation:

$$\lambda\boldsymbol{\theta} + \sum_i w_i^2\big(\mu((\mathbf{x}_i - \mathbf{y}_i)^\top\boldsymbol{\theta}) - o_i\big)(\mathbf{x}_i - \mathbf{y}_i) = \mathbf{0}, \tag{4.1}$$
where we add the regularization term $\lambda\boldsymbol{\theta}$ to make sure that the estimator changes mildly. From the theoretical viewpoint, our proposed regularization term leads to a non-singularity guarantee for the covariance matrix. Additionally, we add weights to obtain a tighter concentration inequality. Concretely, with a suitable choice of the parameters in each layer and a Freedman-type inequality first introduced in Zhao et al. (2023a), we can prove a concentration inequality for the estimator in the $\ell$-th layer:

$$\|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_{t,\ell}\|_{\hat{\boldsymbol{\Sigma}}_{t,\ell}} \le \frac{2^{-\ell}}{\kappa_\mu}\bigg[16\sqrt{\sum_{i \in \Psi_{t,\ell}} w_i^2\sigma_i^2\log(4t^2L/\delta)} + 6\log(4t^2L/\delta)\bigg] + 2^{-\ell}. \tag{4.2}$$

This upper bound scales with $2^{-\ell}$, which arises from our choice of the weights.
The regularized MLE can be formulated as a finite-sum offline optimization problem. For many widely used models, such as the Bradley-Terry-Luce (BTL) model (Hunter, 2003; Luce, 1959), the regularized MLE is a strongly convex and smooth optimization problem. We can solve it using accelerated gradient descent (Nesterov, 2003) and SVRG (Johnson & Zhang, 2013), both of which achieve a linear rate of convergence. This can mitigate the scalability issues caused by the increasing number of iterations. The regularized MLE can also be solved by an online learning algorithm such as in Jun et al. (2017) and Zhao et al. (2023b), where additional effort is required for the analysis.
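As a minimal sketch, with the logistic (BTL) link the left-hand side of (4.1) is the gradient of a $\lambda$-strongly-convex objective, so even plain gradient descent (rather than the accelerated variants mentioned above) converges linearly. The step size and iteration count below are illustrative choices.

```python
import numpy as np

def mu(x):
    """Logistic link, the BTL special case."""
    return 1.0 / (1.0 + np.exp(-x))

def regularized_mle(Z, o, w, lam=1.0, lr=0.05, iters=5000):
    """Solve (4.1): lam*theta + sum_i w_i^2 (mu(z_i @ theta) - o_i) z_i = 0,
    where z_i = x_i - y_i, by gradient descent on the corresponding
    regularized negative log-likelihood."""
    theta = np.zeros(Z.shape[1])
    for _ in range(iters):
        grad = lam * theta + (w**2 * (mu(Z @ theta) - o)) @ Z
        theta -= lr * grad
    return theta

rng = np.random.default_rng(1)
Z = rng.standard_normal((20, 3))            # z_i = x_i - y_i
o = rng.binomial(1, 0.5, size=20).astype(float)
w = np.ones(20)
theta_hat = regularized_mle(Z, o, w)

# Stationarity: the left-hand side of (4.1) vanishes at the solution.
residual = 1.0 * theta_hat + (w**2 * (mu(Z @ theta_hat) - o)) @ Z
```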
4.3 Multi-layer Structure with Variance-Aware Confidence Radius
Due to the multi-layered structure of our algorithm, the construction of the confidence set is of paramount importance. Our algorithm distinguishes itself from prior multi-layered algorithms (Saha, 2021; Bengs et al., 2022) primarily through a variance-aware adaptive selection of the confidence radius, which helps achieve a variance-aware regret bound. Intuitively, we should choose the confidence radius $\hat\beta_{t,\ell}$ based on the concentration inequality (4.2). However, it depends on the true variances $\sigma_i$, of which we have no prior knowledge. To address this issue, we estimate them using the estimator $\hat{\boldsymbol{\theta}}_{t,\ell}$. We choose

$$\hat\beta_{t,\ell} := \frac{16 \cdot 2^{-\ell}}{\kappa_\mu}\sqrt{\big(8\,\widehat{\mathrm{Var}}_{t,\ell} + 18\log(4(t+1)^2L/\delta)\big)\log(4t^2L/\delta)} + \frac{6 \cdot 2^{-\ell}}{\kappa_\mu}\log(4t^2L/\delta) + 2^{-\ell+1}, \tag{4.3}$$

where

$$\widehat{\mathrm{Var}}_{t,\ell} := \begin{cases} \displaystyle\sum_{i \in \Psi_{t,\ell}} w_i^2\big(o_i - \mu((\mathbf{x}_i - \mathbf{y}_i)^\top\hat{\boldsymbol{\theta}}_{t,\ell})\big)^2, & 2^{\ell} \ge 64(L_\mu/\kappa_\mu)\log(4(t+1)^2L/\delta), \\[4pt] |\Psi_{t,\ell}|, & \text{otherwise}. \end{cases}$$
The varied selections of $\widehat{\mathrm{Var}}_{t,\ell}$ arise from the fact that our variance estimator becomes more accurate at higher layers. For the low layers, we employ the natural upper bound $\sigma_i \le 1$. Note that this situation arises only $\Theta(\log\log(T/\delta))$ times, which is a small portion of the total number of layers $L = \Theta(\log T)$. In our proof, we deal with the two cases separately. Due to the limited space available here, the full proof can be found in Appendix E.
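A minimal sketch of the two-case estimator, assuming the logistic link; the boolean `layer_is_high` stands in for the threshold test $2^\ell \ge 64(L_\mu/\kappa_\mu)\log(4(t+1)^2L/\delta)$, and all names are illustrative:

```python
import numpy as np

def mu(x):
    return 1.0 / (1.0 + np.exp(-x))

def estimated_variance(Z, o, w, theta_hat, layer_is_high):
    """Var-hat for one layer: on high layers, the weighted sum of squared
    residuals; on low layers, the crude count |Psi| (i.e., each per-sample
    standard deviation is bounded by 1)."""
    if layer_is_high:
        resid = o - mu(Z @ theta_hat)
        return float(np.sum(w**2 * resid**2))
    return float(len(o))
```

Since the exploration weights satisfy $w_i < 1$ and each residual lies in $[-1, 1]$, the high-layer statistic never exceeds the low-layer fallback.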
4.4 Symmetric Arm Selection
In this subsection, we focus on the arm selection policy described in Line 8. To our knowledge, this policy is new and has not been studied in prior work on the (generalized) linear dueling bandit problem. In detail, suppose that in round $t$ we have an estimator $\hat{\boldsymbol{\theta}}_t$ that lies in a high-probability confidence set

$$\{\boldsymbol{\theta} : \|\boldsymbol{\theta} - \boldsymbol{\theta}^*\|_{\hat{\boldsymbol{\Sigma}}_t} \le \beta_t\},$$

where $\hat{\boldsymbol{\Sigma}}_t = \lambda\mathbf{I} + \sum_{i=1}^{t-1}(\mathbf{x}_i - \mathbf{y}_i)(\mathbf{x}_i - \mathbf{y}_i)^\top$. Our choice of arms can be written as

$$(\mathbf{x}_t, \mathbf{y}_t) = \operatorname*{argmax}_{\mathbf{x},\mathbf{y} \in \mathcal{A}_t}\big[(\mathbf{x} + \mathbf{y})^\top\hat{\boldsymbol{\theta}}_t + \beta_t\|\mathbf{x} - \mathbf{y}\|_{\hat{\boldsymbol{\Sigma}}_t^{-1}}\big]. \tag{4.4}$$
Intuitively, we utilize ( π± + π² ) β€ β’ π½ ^ π‘ as the estimated score and incorporate an exploration bonus dependent on β π± β π² β πΊ ^ π‘ β 1 . Our symmetric selection of arms aligns with the nature of dueling bandits where the order of arms does not matter. Here we compare it with several alternative arm selection criteria that have appeared in previous works.
The MaxInP algorithm in Saha (2021) builds a so-called "promising" set that includes the optimal arm:

$$\mathcal{C}_t = \big\{\mathbf{x} \in \mathcal{A}_t \mid (\mathbf{x} - \mathbf{y})^\top\hat{\boldsymbol{\theta}}_t + \beta_t\|\mathbf{x} - \mathbf{y}\|_{\hat{\boldsymbol{\Sigma}}_t^{-1}} \ge 0,\ \forall\mathbf{y} \in \mathcal{A}_t\big\}.$$

It chooses the symmetric arm pair from the set $\mathcal{C}_t$ that has the highest pairwise score variance (the maximally informative pair), i.e.,

$$(\mathbf{x}_t, \mathbf{y}_t) = \operatorname*{argmax}_{\mathbf{x},\mathbf{y} \in \mathcal{C}_t}\|\mathbf{x} - \mathbf{y}\|_{\hat{\boldsymbol{\Sigma}}_t^{-1}}.$$
The Sta'D algorithm in Saha (2021) uses an asymmetric arm selection criterion, which selects the first arm with the highest estimated score, i.e.,

$$\mathbf{x}_t = \operatorname*{argmax}_{\mathbf{x} \in \mathcal{A}_t}\mathbf{x}^\top\hat{\boldsymbol{\theta}}_t.$$

Following this, it selects the second arm as the toughest competitor to $\mathbf{x}_t$, with a bonus term related to $\|\mathbf{x}_t - \mathbf{y}\|_{\hat{\boldsymbol{\Sigma}}_t^{-1}}$, i.e.,

$$\mathbf{y}_t = \operatorname*{argmax}_{\mathbf{y} \in \mathcal{A}_t}\mathbf{y}^\top\hat{\boldsymbol{\theta}}_t + 2\beta_t\|\mathbf{x}_t - \mathbf{y}\|_{\hat{\boldsymbol{\Sigma}}_t^{-1}}. \tag{4.5}$$

A similar arm selection criterion has also been used in the CoLSTIM algorithm (Bengs et al., 2022). We can show that these two alternative arm selection policies lead to a comparable regret decomposition and admit similar regret upper bounds. A more detailed analysis can be found in Appendix C.
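The symmetric rule (4.4) and the asymmetric Sta'D-style rule (4.5) can be contrasted in a few lines. This brute-force sketch over a finite arm set is illustrative only; the arm set and parameters are made up for the example.

```python
import numpy as np

def symmetric_select(arms, theta_hat, Sigma_inv, beta):
    """Rule (4.4): jointly maximize (x + y)^T theta_hat + beta * ||x - y||."""
    def score(x, y):
        z = x - y
        return (x + y) @ theta_hat + beta * np.sqrt(z @ Sigma_inv @ z)
    return max(((x, y) for x in arms for y in arms), key=lambda p: score(*p))

def asymmetric_select(arms, theta_hat, Sigma_inv, beta):
    """Rule (4.5): greedy first arm, then its toughest competitor."""
    x = max(arms, key=lambda a: a @ theta_hat)
    def score(y):
        z = x - y
        return y @ theta_hat + 2.0 * beta * np.sqrt(z @ Sigma_inv @ z)
    return x, max(arms, key=score)

arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.6, 0.6])]
theta_hat = np.array([1.0, 0.2])
Sigma_inv = np.eye(2)

# With beta = 0 both rules collapse to the greedy pair (best arm vs. itself).
sx, sy = symmetric_select(arms, theta_hat, Sigma_inv, beta=0.0)
ax, ay = asymmetric_select(arms, theta_hat, Sigma_inv, beta=0.0)
```

Note that the symmetric score is invariant to swapping $\mathbf{x}$ and $\mathbf{y}$, matching the unordered nature of a duel, whereas (4.5) treats the two positions differently.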
5 Main Results

5.1 Variance-aware Regret Bound
In this section, we summarize our main results in the following theorem.
Theorem 5.1.

If we set $\alpha = T^{-3/2}$, then with probability at least $1 - 2\delta$, the regret of Algorithm 1 is bounded as

$$\mathrm{Regret}(T) = \tilde O\bigg(\frac{d}{\kappa_\mu}\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\Big(\frac{L_\mu^2}{\kappa_\mu^2} + \frac{1}{\kappa_\mu}\Big)\bigg).$$

This regret can be divided into two parts, corresponding to the regret incurred in the exploration steps (Line 14) and the exploitation steps (Line 8). The exploitation-induced regret is always $\tilde O(1)$, as shown in (5.1), and is thus absorbed by the big-O notation. The total regret is dominated by the exploration-induced regret, which mainly depends on the total variance $\sum_{t=1}^{T}\sigma_t^2$. Note that the comparisons during the exploration steps only happen between non-identical arms ($\mathbf{x}_t \ne \mathbf{y}_t$).
Remark 5.2.

To show the advantage of variance awareness, consider the extreme case where the comparisons are deterministic. More specifically, for any two arms with contextual vectors $\mathbf{x}$ and $\mathbf{y}$, the comparison between them is determined by $o_t = \mathbb{1}\{\mathbf{x}_t^\top\boldsymbol{\theta}^* > \mathbf{y}_t^\top\boldsymbol{\theta}^*\}$ and thus has zero variance. Our algorithm adapts to the zero variance, and the regret becomes $\tilde O(d)$, which is optimal since recovering the parameter $\boldsymbol{\theta}^* \in \mathbb{R}^d$ requires exploring each dimension.
Remark 5.3.

The setting we study is quite general: the arm set is time-varying, and therefore the variance can vary across time and arms. When we restrict our setting to a special case with uniform variance for all pairwise comparisons, i.e., $\sigma_t^2 = \sigma^2$ for all $t$, our upper bound becomes $\tilde O(\sigma d\sqrt{T})$. This results in a regret bound that does not depend on the random variables $\sigma_t^2$.
Remark 5.4.

In the worst-case scenario, where the variance of each arm comparison is upper bounded by $1/4$, our regret upper bound becomes $\tilde O(d\sqrt{T})$, which matches the regret lower bound $\Omega(d\sqrt{T})$ for dueling bandits with exponentially many arms proved in Bengs et al. (2022), up to logarithmic factors. This regret bound also recovers the regret bounds of MaxInP (Saha, 2021) and CoLSTIM (Bengs et al., 2022). Compared with Sta'D (Saha, 2021) and SupCoLSTIM (Bengs et al., 2022), our regret bound is on par with theirs provided the number of arms $K$ is large. More specifically, their regret upper bounds are $\tilde O(\sqrt{dT\log K})$. When $K$ is exponential in $d$, their regret bound becomes $\tilde O(d\sqrt{T})$, which is of the same order as ours.
Remark 5.5.

Notably, Bengs et al. (2022) assume that the context vectors span the whole $d$-dimensional Euclidean space, which is essential for their initial exploration phase. In our work, we replace the initial exploration phase with a regularizer, thus relaxing this assumption.
5.2 Proof Sketch of Theorem 5.1
As described in Section 4, the arm selection happens in two places: the exploration part (Lines 14-16) and the exploitation part (Lines 8-9). Given the update rule of the index sets, each step within the exploration part is included in the final index set $\Psi_{T+1,\ell}$ of a single layer $\ell$. Conversely, steps within the exploitation part fall into $[T] \setminus \cup_{\ell \in [L]}\Psi_{T+1,\ell}$. Using this division, we can decompose the regret into:

$$\mathrm{Regret}(T) = \frac{1}{2}\Bigg[\underbrace{\sum_{t \in [T] \setminus (\cup_{\ell \in [L]}\Psi_{T+1,\ell})}\big(2\mathbf{x}_t^{*\top}\boldsymbol{\theta}^* - (\mathbf{x}_t^\top\boldsymbol{\theta}^* + \mathbf{y}_t^\top\boldsymbol{\theta}^*)\big)}_{\text{exploitation}} + \underbrace{\sum_{\ell \in [L]}\sum_{t \in \Psi_{T+1,\ell}}\big(2\mathbf{x}_t^{*\top}\boldsymbol{\theta}^* - (\mathbf{x}_t^\top\boldsymbol{\theta}^* + \mathbf{y}_t^\top\boldsymbol{\theta}^*)\big)}_{\text{exploration}}\Bigg].$$
We bound the incurred regret of each part separately.
For any round $t \in [T]\setminus\cup_{\ell\in[L]}\Psi_{T+1,\ell}$, the condition for exploitation implies the existence of a layer $\ell_t$ such that $\|\mathbf{x}_i-\mathbf{y}_i\|_{\hat{\Sigma}_{t,\ell}^{-1}} \le \alpha$ for all $\mathbf{x}_i, \mathbf{y}_i \in \mathcal{A}_{t,\ell}$. Using the Cauchy-Schwarz inequality and the MLE described in Section 4.2, we can show that the regret incurred in round $t$ is at most $3\hat{\beta}_{t,\ell_t}\alpha$. Combining the simple upper bound $\hat{\beta}_{t,\ell_t} \le \tilde{O}(\sqrt{d})$ with $\alpha = T^{-3/2}$, the regret for one exploitation round does not exceed $\tilde{O}(1/T)$. Consequently, the cumulative regret satisfies
$$\sum_{t\in[T]\setminus(\cup_{\ell\in[L]}\Psi_{T+1,\ell})}\bigl(2\mathbf{x}_t^{*\top}\boldsymbol{\theta}^* - (\mathbf{x}_t^\top\boldsymbol{\theta}^* + \mathbf{y}_t^\top\boldsymbol{\theta}^*)\bigr) \le \tilde{O}(1), \tag{5.1}$$
which is a low-order term in the total regret.
In the exploration part, the regret is the cumulative regret incurred within each layer. We analyze the low and high layers separately. For $\ell \le \ell^* = \bigl\lceil\log_2\bigl(64(L_\mu/\kappa_\mu)\log(4(T+1)^2L/\delta)\bigr)\bigr\rceil$, the incurred regret can be upper bounded via the number of rounds in the layer:
$$\sum_{t\in\Psi_{T+1,\ell}}\bigl(2\mathbf{x}_t^{*\top}\boldsymbol{\theta}^* - (\mathbf{x}_t^\top\boldsymbol{\theta}^* + \mathbf{y}_t^\top\boldsymbol{\theta}^*)\bigr) \le 4|\Psi_{T+1,\ell}|.$$
Moreover, $|\Psi_{T+1,\ell}|$ can be upper bounded by
$$|\Psi_{T+1,\ell}| \le 2^{2\ell}\,d\log\bigl(1 + 2^{2\ell}BT/\lambda\bigr) \le O\Bigl(\tfrac{L_\mu^2}{\kappa_\mu^2}\,d\log\bigl(1 + 2^{2\ell^*}BT/\lambda\bigr)\log\bigl(4(T+1)^2L/\delta\bigr)\Bigr). \tag{5.2}$$
Thus the total regret for layers $\ell \le \ell^*$ is bounded by $\tilde{O}(d)$. For $\ell > \ell^*$, we can bound the cumulative regret incurred in each layer with the following lemma.
Lemma 5.6.
With high probability, for all $\ell \in [L]\setminus\{1\}$, the regret incurred by the index set $\Psi_{T+1,\ell}$ is bounded by
$$\sum_{t\in\Psi_{T+1,\ell}}\bigl(2\mathbf{x}_t^{*\top}\boldsymbol{\theta}^* - (\mathbf{x}_t^\top\boldsymbol{\theta}^* + \mathbf{y}_t^\top\boldsymbol{\theta}^*)\bigr) \le \tilde{O}\bigl(d\,2^{\ell}\,\hat{\beta}_{T,\ell-1}\bigr).$$
By summing up the regret over all layers, we can upper bound the total regret for layers $\ell > \ell^*$ as
$$\sum_{\ell\in[L]\setminus[\ell^*]}\sum_{t\in\Psi_{T+1,\ell}}\bigl(2\mathbf{x}_t^{*\top}\boldsymbol{\theta}^* - (\mathbf{x}_t^\top\boldsymbol{\theta}^* + \mathbf{y}_t^\top\boldsymbol{\theta}^*)\bigr) \le \tilde{O}\Biggl(\frac{d}{\kappa_\mu}\sqrt{\sum_{t=1}^{T}\sigma_t^2} + \frac{d}{\kappa_\mu}\Biggr).$$
We complete the proof of Theorem 5.1 by combining the regret of the different parts. For the detailed proof, please refer to Appendix E.
6 Experiments
Experiment Setup.
We study the proposed algorithm in simulation and compare it with algorithms also designed for contextual dueling bandits. Each experiment instance is simulated for $T = 4000$ rounds. The unknown parameter $\boldsymbol{\theta}^*$ to be estimated is generated at random and normalized to a unit vector. The feature dimension is set to $d = 5$. A total of $|\mathcal{A}_t| = 2^d$ distinct contextual vectors are generated from $\{-1, 1\}^d$. In each round, given the arm pair selected by the algorithm, a response is generated according to the random process defined in Section 3. For each experiment, a total of 128 repeated runs are carried out. We tune the confidence radius of each algorithm to showcase its best performance. The average cumulative regret is reported in Figure 1, along with the standard deviation in the shaded region. The link function $\mu(\cdot)$ is set to the logistic function. Our implementation is publicly available.
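The environment above can be reproduced in a few lines. The following sketch (all variable names are illustrative) simulates the comparison process of Section 3: $2^d$ contexts from $\{-1,1\}^d$, a random unit-norm $\boldsymbol{\theta}^*$, and logistic preference feedback.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

d = 5                                        # feature dimension, as in the experiments
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)     # normalize to a unit vector

# All 2^d contextual vectors from {-1, 1}^d.
arms = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def duel(i, j):
    """Sample a binary preference: 1 iff arm i beats arm j, with
    P(i beats j) = sigmoid((x_i - x_j)^T theta*), the logistic GLM of Section 3."""
    p = sigmoid((arms[i] - arms[j]) @ theta_star)
    return int(rng.random() < p)

def instant_regret(i, j):
    """Average suboptimality of the chosen pair (the regret notion used here)."""
    rewards = arms @ theta_star
    return rewards.max() - 0.5 * (rewards[i] + rewards[j])
```

Running a bandit algorithm against this simulator then amounts to calling `duel` on its chosen pair each round and accumulating `instant_regret`.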
Algorithms. We list the algorithms studied in this section as follows:
• MaxInP: Maximum Informative Pair by Saha (2021). It maintains an active set of possibly optimal arms each round. The pairs are chosen based on the maximum uncertainty in the difference between the two arms. Instead of using a warm-up period $\tau_0$ as in their definition, we initialize $\Sigma_0 = \lambda\mathbf{I}$ as regularization. With $\lambda = 0.001$, this approach empirically has no significant impact on regret performance compared to the warm-up method.
• MaxPairUCB: In this algorithm, we keep the MLE the same as in MaxInP. However, we eliminate the need for an active set of arms, and the pair of arms is picked according to the criterion defined in (4.4).
• CoLSTIM: This method is from Bengs et al. (2022). It first adds random perturbations to the estimated utility of each arm and picks the arm with the best perturbed estimate; the authors claim this step achieves better empirical performance. The second arm is chosen according to the criterion defined in (4.5).
• VACDB: The proposed variance-aware Algorithm 1 in this paper. $\alpha$ is set to its theoretical value according to Theorem 5.1. We note that for this specific experiment, $L = 4$ layers are enough to eliminate all suboptimal arms. Since data is not shared among layers, the estimator $\hat{\boldsymbol{\theta}}$ from the layer below is used to initialize the MLE of a layer when it is first reached, providing a rough initial estimate.
Figure 1: Experiments showing regret performance in various settings. (a) Comparison of the proposed algorithm with baselines. (b) Variance-awareness of the proposed algorithm.
Regret Comparison. In Figure 1(a), we first notice that the proposed method VACDB achieves lower regret than the other methods on average, demonstrating its efficiency. Second, the MaxPairUCB and CoLSTIM algorithms have a slight empirical edge over MaxInP, which can be partially explained by the discussion in Section 4.4: in MaxInP the chosen pair is based solely on uncertainty, while the other two methods choose at least one arm that maximizes the estimated reward.
Variance-Awareness. In Figure 1(b), we demonstrate the variance-awareness of our algorithm by scaling the unknown parameter $\boldsymbol{\theta}^*$. Note that the variance of a Bernoulli distribution with parameter $p$ is $\sigma^2 = p(1-p)$. To generate high- and low-variance instances, we scale the parameter $\boldsymbol{\theta}^*$ by a ratio $\alpha \in \{0.5, 1, 2, 4\}$. If $\alpha > 1$, then $p$ moves closer to 0 or 1, resulting in a lower-variance instance, and vice versa. In this plot, we show the results for the four scales in increasing order, which corresponds to decreasing the variance of each comparison. With decreasing variance, our algorithm suffers less regret, which corresponds to the decrease of the $\sigma_t$ terms in our main theorem.
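The scaling effect can be checked directly: for a fixed utility gap $z$ (chosen arbitrarily here for illustration), the comparison variance $p(1-p)$ with $p = \mu(\alpha z)$ shrinks as the scale $\alpha$ grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Variance of the Bernoulli comparison: sigma^2 = p(1 - p).
# Scaling theta* by alpha pushes p = sigmoid(alpha * z) toward 0 or 1,
# so a larger alpha yields a lower-variance instance, and vice versa.
z = 1.0  # an illustrative value of the gap (x - y)^T theta*
variances = {alpha: sigmoid(alpha * z) * (1.0 - sigmoid(alpha * z))
             for alpha in (0.5, 1, 2, 4)}
```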
7 Conclusion
We introduced a variance-aware method for contextual dueling bandits, realized in an adaptive algorithm called VACDB. Theoretical analysis shows a regret upper bound depending on the observed variances in each round, and the worst-case regret bound matches the lower bound. Additionally, we conducted simulation studies showing that the proposed algorithm adapts to instances with varying variance, as implied by the regret analysis. In the future, one possible direction is to consider subset-wise comparisons: in each round, a subset of $K$ arms is chosen from all arms, and the agent only observes the best arm of the chosen subset. The dueling bandits model in this work can be treated as the special case $K = 2$. Moreover, the preference probability is characterized by a generalized linear model, which may be a strong assumption for some real-world applications. We aim to generalize our results to broader nonlinear function classes, such as function classes with bounded eluder dimension (Russo & Van Roy, 2013).
Acknowledgements
We thank the anonymous reviewers and area chair for their helpful comments. QD, YW and QG are supported in part by the NSF grants CIF-1911168 and CPS-2312094. YW is also supported by UCLA Dissertation Year Fellowship. TJ and FF are supported in part by the NSF grant CIF-1908544. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.
References
Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.
Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39.1-39.26. JMLR Workshop and Conference Proceedings, 2012.
Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410:1876-1902, 2009.
Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
Akshay Balsubramani, Zohar Karnin, Robert E. Schapire, and Masrour Zoghi. Instance-dependent regret bounds for dueling bandits. In Conference on Learning Theory, pp. 336-360. PMLR, 2016.
Viktor Bengs, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, and Eyke Hüllermeier. Preference-based online learning with dueling bandits: A survey. Journal of Machine Learning Research, 22(7):1-108, 2021.
Viktor Bengs, Aadirupa Saha, and Eyke Hüllermeier. Stochastic contextual dueling bandits under linear stochastic transitivity models. In International Conference on Machine Learning, pp. 1764-1786. PMLR, 2022.
Luitzen E. J. Brouwer. Beweis der Invarianz des n-dimensionalen Gebiets. Mathematische Annalen, 71:305-313, 1911.
Xi Chen, Paul N. Bennett, Kevyn Collins-Thompson, and Eric Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 193-202, 2013.
Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 2011.
Miroslav Dudík, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. arXiv preprint arXiv:1502.06362, 2015.
Moein Falahatgar, Yi Hao, Alon Orlitsky, Venkatadheeraj Pichapati, and Vaishakh Ravindrakumar. Maxing and ranking with few assumptions. Advances in Neural Information Processing Systems, 30, 2017.
Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23, 2010.
David A. Freedman. On tail probabilities for martingales. The Annals of Probability, pp. 100-118, 1975.
Reinhard Heckel, Max Simchowitz, Kannan Ramchandran, and Martin Wainwright. Approximate ranking from pairwise comparisons. In International Conference on Artificial Intelligence and Statistics, pp. 1057-1066. PMLR, 2018.
David R. Hunter. MM algorithms for generalized Bradley-Terry models. Annals of Statistics, 32:384-406, 2003.
Kevin Jamieson, Sumeet Katariya, Atul Deshpande, and Robert Nowak. Sparse dueling bandits. In Artificial Intelligence and Statistics, pp. 416-424. PMLR, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.
Kwang-Sung Jun, Aniruddha Bhargava, Robert Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. Advances in Neural Information Processing Systems, 30, 2017.
Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, volume 12, pp. 655-662, 2012.
Yeoneung Kim, Insoon Yang, and Kwang-Sung Jun. Improved regret analysis for variance-adaptive linear bandits and horizon-free linear mixture MDPs. Advances in Neural Information Processing Systems, 35:1060-1072, 2022.
Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Conference on Learning Theory, pp. 1141-1154. PMLR, 2015.
Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Copeland dueling bandit problem: Regret lower bound, optimal algorithm, and computationally efficient algorithm. In International Conference on Machine Learning, pp. 1235-1244. PMLR, 2016.
Wataru Kumagai. Regret analysis for continuous dueling bandit. Advances in Neural Information Processing Systems, 30, 2017.
Tze Leung Lai. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pp. 1091-1114, 1987.
Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. doi: 10.1017/9781108571401.
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661-670, 2010.
Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pp. 2071-2080. PMLR, 2017.
R. Duncan Luce. Individual Choice Behavior. 1959.
Thomas P. Minka, Ryan Cleven, and Yordan Zaykov. TrueSkill 2: An improved Bayesian skill rating system. Microsoft Research, 2018.
Subhojyoti Mukherjee, Kolar Purushothama Naveen, Nandan Sudarsanam, and Balaraman Ravindran. Efficient-UCBV: An almost optimal algorithm using variance estimates. In AAAI Conference on Artificial Intelligence, 2017.
Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2003.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.
S. Ramamohan, A. Rajkumar, and Shivani Agarwal. Dueling bandits: Beyond Condorcet winners to general tournament solutions. In NIPS, 2016.
Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013.
Aadirupa Saha. Optimal algorithms for stochastic contextual preference bandits. In Neural Information Processing Systems, 2021.
Aadirupa Saha, Tomer Koren, and Yishay Mansour. Adversarial dueling bandits. arXiv preprint arXiv:2010.14563, 2021.
Louis Leon Thurstone. A law of comparative judgment. Psychological Review, 34:273-286, 1994.
Huasen Wu and Xin Liu. Double Thompson sampling for dueling bandits. Advances in Neural Information Processing Systems, 29, 2016.
Yue Wu, Tao Jin, Hao Lou, Farzad Farnoud, and Quanquan Gu. Borda regret minimization for generalized linear dueling bandits. arXiv preprint arXiv:2303.08816, 2023.
Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1201-1208, 2009.
Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538-1556, 2012.
Xiaohang Zhang, Guoliang Li, and Jianhua Feng. Crowdsourced top-k algorithms: An experimental evaluation. Proceedings of the VLDB Endowment, 9:612-623, 2016.
Zihan Zhang, Jiaqi Yang, Xiangyang Ji, and Simon S. Du. Improved variance-aware confidence sets for linear bandits and linear mixture MDP. Advances in Neural Information Processing Systems, 34:4342-4355, 2021a.
Zihan Zhang, Jiaqi Yang, Xiangyang Ji, and Simon Shaolei Du. Improved variance-aware confidence sets for linear bandits and linear mixture MDP. In Neural Information Processing Systems, 2021b.
Heyang Zhao, Jiafan He, Dongruo Zhou, Tong Zhang, and Quanquan Gu. Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency. arXiv preprint arXiv:2302.10371, 2023a.
Heyang Zhao, Dongruo Zhou, Jiafan He, and Quanquan Gu. Optimal online generalized linear regression with stochastic noise and its application to heteroscedastic bandits. In International Conference on Machine Learning, pp. 42259-42279. PMLR, 2023b.
Dongruo Zhou and Quanquan Gu. Computationally efficient horizon-free reinforcement learning for linear mixture MDPs. Advances in Neural Information Processing Systems, 35:36337-36349, 2022.
Dongruo Zhou, Quanquan Gu, and Csaba Szepesvári. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, pp. 4532-4576. PMLR, 2021.
Masrour Zoghi, Shimon Whiteson, Rémi Munos, and Maarten de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. arXiv preprint arXiv:1312.3393, 2014.
Masrour Zoghi, Zohar S. Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In NIPS, 2015.
Appendix A Comparison with Prior Works
In this section, we provide a detailed discussion of the layered design, drawing a comparison with Sta'D in Saha (2021) and SupCoLSTIM in Bengs et al. (2022). The general idea follows Auer (2002), which focuses on maintaining a set of "high-confidence promising arms". The algorithm operates differently in two distinct scenarios. If some pair $(\mathbf{x}_t, \mathbf{y}_t)$ in the current layer $\ell$ has high uncertainty, measured by $\|\mathbf{x}_t-\mathbf{y}_t\|_{\hat{\Sigma}_{t,\ell}^{-1}}$, we explore that arm pair. Conversely, once the desired accuracy is achieved, we eliminate suboptimal arms using our confidence set and proceed to a subsequent layer demanding greater accuracy. This process continues until we reach a sufficiently accurate high layer, at which point we make decisions based on the remaining arms in the confidence set and the estimated parameters $\hat{\boldsymbol{\theta}}_{t,\ell}$. In the final stage, Sta'D picks the first arm $\mathbf{x}_t$ as the one with the maximum estimated score, followed by its strongest challenger $\mathbf{y}_t$, which has the highest optimistic chance of beating $\mathbf{x}_t$. SupCoLSTIM adopts a similar policy and distinguishes itself with a randomized learning strategy, generating additive noise terms from an underlying perturbation distribution. Our arm selection is based on the symmetric arm selection policy described in Section 4.4. Sta'D and SupCoLSTIM choose the confidence radius $\hat{\beta}_{t,\ell}$ to be $2^{-\ell}$ in the $\ell$-th layer. In comparison, our choice of $\hat{\beta}_{t,\ell}$ is defined in (4.3). As we mention in Section 4.3, apart from the $2^{-\ell}$ dependency on the layer $\ell$, it also relies on the estimated variance. Such a variance-adaptive confidence radius is key to achieving the variance-aware regret bound.
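As a rough illustration of this layered control flow (not the paper's exact pseudocode; all names, thresholds, and the elimination rule here are simplifications), one round of the scheme can be sketched as:

```python
import numpy as np

def layered_select(arms, layers):
    """Illustrative sketch of the layered scheme described above.  Each entry
    of `layers` is a (theta_hat, Sigma_inv, beta_hat) triple for one layer.
    Walk up the layers: explore an uncertain pair if one exists, otherwise
    eliminate arms the confidence set rules out and demand more accuracy."""
    active = list(range(len(arms)))
    for ell, (theta_hat, Sigma_inv, beta_hat) in enumerate(layers, start=1):
        def unc(i, j):
            d = arms[i] - arms[j]
            return float(np.sqrt(d @ Sigma_inv @ d))
        # Explore if some pair of surviving arms is still too uncertain.
        for i in active:
            for j in active:
                if unc(i, j) > 2.0 ** (-ell):
                    return "explore", ell, (i, j)
        # Otherwise eliminate arms that are suboptimal at this layer's accuracy.
        scores = [float(arms[i] @ theta_hat) for i in active]
        best = max(scores)
        active = [i for i, s in zip(active, scores)
                  if s + beta_hat * 2.0 ** (-ell) >= best]
    # Accurate enough everywhere: exploit among the surviving arms.
    i = max(active, key=lambda k: float(arms[k] @ layers[-1][0]))
    return "exploit", len(layers), (i, i)

arms = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
theta = np.array([1.0, 0.0])
```

With a wide covariance the sketch explores; with a nearly collapsed one it eliminates the suboptimal arms and exploits.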
Appendix B Additional Experiment on Real-world Data
Figure 2: Regret comparison between VACDB and MaxInP on a real-world dataset.
To showcase the performance of our algorithm in a real-world setting, we use the EventTime dataset (Zhang et al., 2016). In this dataset, $K = 100$ historical events are compared in a pairwise fashion by crowd-sourced workers. The data contain binary responses indicating which of the two events a worker believes precedes the other. Neither the side information $\mathcal{A} = \{\mathbf{x}_i, i \in [K]\}$ nor the true parameter $\boldsymbol{\theta}^*$ is readily available in the dataset, so we estimate them from the pairwise comparison data. To achieve this, let $C_{ij}$, $i, j \in [K]$, be the number of times event $i$ precedes event $j$ as labeled by the workers. The following MLE is used:
$$\mathop{\mathrm{argmax}}_{\{\mathbf{x}_i\},\,\boldsymbol{\theta}}\ \sum_{i\in[K]}\sum_{j\in[K]} C_{ij}\log\bigl(\mu((\mathbf{x}_i-\mathbf{x}_j)^\top\boldsymbol{\theta})\bigr).$$
With the estimated $\mathcal{A}$ and $\boldsymbol{\theta}^*$, it is then possible to simulate the interactive process. We compare our algorithm VACDB with MaxInP in Figure 2: after about 2500 rounds, our algorithm starts to outperform MaxInP in terms of cumulative regret.
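Since the dataset provides only pairwise counts, a minimal stand-in for this estimation step fits scalar utilities $u_i$ (playing the role of $\mathbf{x}_i^\top\boldsymbol{\theta}$) by gradient ascent on the Bradley-Terry log-likelihood. This is a simplified sketch, not the exact joint MLE over $\{\mathbf{x}_i\}$ and $\boldsymbol{\theta}$ used in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_utilities(C, steps=2000, lr=0.5):
    """Gradient ascent on sum_{i,j} C[i, j] * log(sigmoid(u[i] - u[j]))."""
    K = C.shape[0]
    u = np.zeros(K)
    for _ in range(steps):
        p = sigmoid(u[:, None] - u[None, :])  # p[i, j] = P(event i beats j)
        W = C * (1.0 - p)                     # per-pair gradient weights
        grad = W.sum(axis=1) - W.sum(axis=0)
        u += lr * grad / max(C.sum(), 1.0)
        u -= u.mean()                         # utilities are shift-invariant
    return u

# Toy counts: event 0 precedes event 1 in 10 of 12 comparisons.
C = np.array([[0.0, 10.0], [2.0, 0.0]])
u = fit_utilities(C)
```

With these toy counts the fitted gap satisfies $\mathrm{sigmoid}(u_0 - u_1) \approx 10/12$, the empirical preference rate.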
Appendix C Discussion on Arm Selection Policies
In this section, we present a detailed discussion for Section 4.4. We assume that in round $t$, we have an estimator $\hat{\boldsymbol{\theta}}_t$, a covariance matrix $\Sigma_t = \lambda\mathbf{I} + \sum_{i=1}^{t-1}(\mathbf{x}_i-\mathbf{y}_i)(\mathbf{x}_i-\mathbf{y}_i)^\top$ and a concentration inequality with confidence radius $\beta_t$:
$$\|\hat{\boldsymbol{\theta}}_t - \boldsymbol{\theta}^*\|_{\Sigma_t} \le \beta_t. \tag{C.1}$$
The three arm selection methods can be described as follows:
Method 1:
Following Saha (2021), let $\mathcal{C}_t$ be
$$\mathcal{C}_t = \bigl\{\mathbf{x} \in \mathcal{A}_t \;\big|\; (\mathbf{x}-\mathbf{y})^\top\hat{\boldsymbol{\theta}}_t + \beta_t\|\mathbf{x}-\mathbf{y}\|_{\Sigma_t^{-1}} \ge 0,\ \forall\mathbf{y}\in\mathcal{A}_t\bigr\}.$$
Then $\mathbf{x}_t^* \in \mathcal{C}_t$, because for any $\mathbf{y} \in \mathcal{A}_t$,
$$\begin{aligned}
(\mathbf{x}_t^*-\mathbf{y})^\top\hat{\boldsymbol{\theta}}_t + \beta_t\|\mathbf{x}_t^*-\mathbf{y}\|_{\Sigma_t^{-1}}
&= (\mathbf{x}_t^*-\mathbf{y})^\top(\hat{\boldsymbol{\theta}}_t-\boldsymbol{\theta}^*) + (\mathbf{x}_t^*-\mathbf{y})^\top\boldsymbol{\theta}^* + \beta_t\|\mathbf{x}_t^*-\mathbf{y}\|_{\Sigma_t^{-1}} \\
&\ge \beta_t\|\mathbf{x}_t^*-\mathbf{y}\|_{\Sigma_t^{-1}} - \|\mathbf{x}_t^*-\mathbf{y}\|_{\Sigma_t^{-1}}\|\hat{\boldsymbol{\theta}}_t-\boldsymbol{\theta}^*\|_{\Sigma_t} \\
&\ge 0,
\end{aligned}$$
where the first inequality holds due to the Cauchy-Schwarz inequality and the fact that $\mathbf{x}_t^*$ is the optimal arm in round $t$. The second inequality holds due to (C.1).
The arms selected in round $t$ are $(\mathbf{x}_t, \mathbf{y}_t) = \mathop{\mathrm{argmax}}_{\mathbf{x},\mathbf{y}\in\mathcal{C}_t}\|\mathbf{x}-\mathbf{y}\|_{\Sigma_t^{-1}}$. Then the regret in round $t$ can be decomposed as
$$\begin{aligned}
2r_t &= 2\mathbf{x}_t^{*\top}\boldsymbol{\theta}^* - (\mathbf{x}_t+\mathbf{y}_t)^\top\boldsymbol{\theta}^* \\
&= (\mathbf{x}_t^*-\mathbf{x}_t)^\top\boldsymbol{\theta}^* + (\mathbf{x}_t^*-\mathbf{y}_t)^\top\boldsymbol{\theta}^* \\
&= (\mathbf{x}_t^*-\mathbf{x}_t)^\top(\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t) + (\mathbf{x}_t^*-\mathbf{x}_t)^\top\hat{\boldsymbol{\theta}}_t + (\mathbf{x}_t^*-\mathbf{y}_t)^\top(\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t) + (\mathbf{x}_t^*-\mathbf{y}_t)^\top\hat{\boldsymbol{\theta}}_t \\
&\le (\mathbf{x}_t^*-\mathbf{x}_t)^\top(\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t) + \beta_t\|\mathbf{x}_t^*-\mathbf{x}_t\|_{\Sigma_t^{-1}} + (\mathbf{x}_t^*-\mathbf{y}_t)^\top(\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t) + \beta_t\|\mathbf{x}_t^*-\mathbf{y}_t\|_{\Sigma_t^{-1}} \\
&\le \|\mathbf{x}_t^*-\mathbf{x}_t\|_{\Sigma_t^{-1}}\|\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t\|_{\Sigma_t} + \beta_t\|\mathbf{x}_t^*-\mathbf{x}_t\|_{\Sigma_t^{-1}} + \|\mathbf{x}_t^*-\mathbf{y}_t\|_{\Sigma_t^{-1}}\|\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t\|_{\Sigma_t} + \beta_t\|\mathbf{x}_t^*-\mathbf{y}_t\|_{\Sigma_t^{-1}} \\
&\le 2\beta_t\|\mathbf{x}_t^*-\mathbf{x}_t\|_{\Sigma_t^{-1}} + 2\beta_t\|\mathbf{x}_t^*-\mathbf{y}_t\|_{\Sigma_t^{-1}} \\
&\le 4\beta_t\|\mathbf{x}_t-\mathbf{y}_t\|_{\Sigma_t^{-1}},
\end{aligned}$$
where the first inequality holds because $\mathbf{x}_t, \mathbf{y}_t \in \mathcal{C}_t$. The second inequality holds due to the Cauchy-Schwarz inequality. The third inequality holds due to (C.1). The last inequality holds due to $\mathbf{x}_t^* \in \mathcal{C}_t$ and $(\mathbf{x}_t, \mathbf{y}_t) = \mathop{\mathrm{argmax}}_{\mathbf{x},\mathbf{y}\in\mathcal{C}_t}\|\mathbf{x}-\mathbf{y}\|_{\Sigma_t^{-1}}$.
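As a concrete, purely illustrative sketch of Method 1, the following helper keeps the arms whose optimistic comparison against every rival is non-negative and returns the most uncertain pair; `Sigma_inv` stands for $\Sigma_t^{-1}$ and all names are assumptions:

```python
import numpy as np

def maxinp_pair(arms, theta_hat, Sigma_inv, beta):
    """Sketch of Method 1: build the active set C_t, then pick the pair with
    maximal uncertainty ||x - y||_{Sigma^{-1}} within it."""
    def unc(x, y):
        d = x - y
        return float(np.sqrt(d @ Sigma_inv @ d))
    active = [i for i, x in enumerate(arms)
              if all((x - y) @ theta_hat + beta * unc(x, y) >= 0 for y in arms)]
    return max(((i, j) for i in active for j in active),
               key=lambda ij: unc(arms[ij[0]], arms[ij[1]]))

# Toy instance: two orthogonal arms, estimate favoring the first.
arms = np.array([[1.0, 0.0], [0.0, 1.0]])
theta_hat = np.array([1.0, 0.0])
```

A small radius leaves only the greedy arm active, while a large radius keeps both arms and selects the maximally uncertain pair.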
Method 2:
Following Bengs et al. (2022), we choose the first arm as
$$\mathbf{x}_t = \mathop{\mathrm{argmax}}_{\mathbf{x}\in\mathcal{A}_t}\ \mathbf{x}^\top\hat{\boldsymbol{\theta}}_t,$$
and then choose the second arm as
$$\mathbf{y}_t = \mathop{\mathrm{argmax}}_{\mathbf{y}\in\mathcal{A}_t}\ \mathbf{y}^\top\hat{\boldsymbol{\theta}}_t + 2\beta_t\|\mathbf{x}_t-\mathbf{y}\|_{\Sigma_t^{-1}}.$$
The regret in round π‘ can be decomposed as
$$\begin{aligned}
2r_t &= 2\mathbf{x}_t^{*\top}\boldsymbol{\theta}^* - (\mathbf{x}_t+\mathbf{y}_t)^\top\boldsymbol{\theta}^* \\
&= 2(\mathbf{x}_t^*-\mathbf{x}_t)^\top\boldsymbol{\theta}^* + (\mathbf{x}_t-\mathbf{y}_t)^\top\boldsymbol{\theta}^* \\
&= 2(\mathbf{x}_t^*-\mathbf{x}_t)^\top(\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t) + 2(\mathbf{x}_t^*-\mathbf{x}_t)^\top\hat{\boldsymbol{\theta}}_t + (\mathbf{x}_t-\mathbf{y}_t)^\top(\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t) + (\mathbf{x}_t-\mathbf{y}_t)^\top\hat{\boldsymbol{\theta}}_t \\
&\le 2\|\mathbf{x}_t^*-\mathbf{x}_t\|_{\Sigma_t^{-1}}\|\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t\|_{\Sigma_t} + (\mathbf{x}_t^*-\mathbf{x}_t)^\top\hat{\boldsymbol{\theta}}_t + \|\mathbf{x}_t-\mathbf{y}_t\|_{\Sigma_t^{-1}}\|\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t\|_{\Sigma_t} + (\mathbf{x}_t-\mathbf{y}_t)^\top\hat{\boldsymbol{\theta}}_t \\
&\le 2\beta_t\|\mathbf{x}_t^*-\mathbf{x}_t\|_{\Sigma_t^{-1}} + (\mathbf{x}_t^*-\mathbf{y}_t)^\top\hat{\boldsymbol{\theta}}_t + \beta_t\|\mathbf{x}_t-\mathbf{y}_t\|_{\Sigma_t^{-1}} \\
&\le \mathbf{y}_t^\top\hat{\boldsymbol{\theta}}_t + 2\beta_t\|\mathbf{x}_t-\mathbf{y}_t\|_{\Sigma_t^{-1}} - \mathbf{x}_t^{*\top}\hat{\boldsymbol{\theta}}_t + (\mathbf{x}_t^*-\mathbf{y}_t)^\top\hat{\boldsymbol{\theta}}_t + \beta_t\|\mathbf{x}_t-\mathbf{y}_t\|_{\Sigma_t^{-1}} \\
&= 3\beta_t\|\mathbf{x}_t-\mathbf{y}_t\|_{\Sigma_t^{-1}},
\end{aligned}$$
where the first inequality holds due to the Cauchy-Schwarz inequality and $\mathbf{x}_t^\top\hat{\boldsymbol{\theta}}_t \ge \mathbf{x}_t^{*\top}\hat{\boldsymbol{\theta}}_t$. The second inequality holds due to (C.1). The third inequality holds due to $\mathbf{y}_t = \mathop{\mathrm{argmax}}_{\mathbf{y}\in\mathcal{A}_t}\mathbf{y}^\top\hat{\boldsymbol{\theta}}_t + 2\beta_t\|\mathbf{x}_t-\mathbf{y}\|_{\Sigma_t^{-1}}$.
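Method 2 can be sketched in the same illustrative style (greedy first arm, optimistic second arm; all names are assumptions):

```python
import numpy as np

def colstim_pair(arms, theta_hat, Sigma_inv, beta):
    """Sketch of Method 2: the first arm is greedy with respect to theta_hat;
    the second maximizes its score plus an optimism bonus toward the first."""
    i = int(np.argmax(arms @ theta_hat))
    def bonus(y):
        d = arms[i] - y
        return float(np.sqrt(d @ Sigma_inv @ d))
    j = int(np.argmax([arms[k] @ theta_hat + 2.0 * beta * bonus(arms[k])
                       for k in range(len(arms))]))
    return i, j

# Toy instance: two orthogonal arms, estimate favoring the first.
arms = np.array([[1.0, 0.0], [0.0, 1.0]])
theta_hat = np.array([1.0, 0.0])
```

For a large radius the rival arm's bonus dominates and a genuine duel is played; for a small radius the greedy arm is compared against itself.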
Method 3:
In this method, we choose the two arms as
$$(\mathbf{x}_t, \mathbf{y}_t) = \mathop{\mathrm{argmax}}_{\mathbf{x},\mathbf{y}\in\mathcal{A}_t}\bigl[(\mathbf{x}+\mathbf{y})^\top\hat{\boldsymbol{\theta}}_t + \beta_t\|\mathbf{x}-\mathbf{y}\|_{\Sigma_t^{-1}}\bigr]. \tag{C.2}$$
Then the regret can be decomposed as
$$\begin{aligned}
2r_t &= 2\mathbf{x}_t^{*\top}\boldsymbol{\theta}^* - (\mathbf{x}_t+\mathbf{y}_t)^\top\boldsymbol{\theta}^* \\
&= (\mathbf{x}_t^*-\mathbf{x}_t)^\top\boldsymbol{\theta}^* + (\mathbf{x}_t^*-\mathbf{y}_t)^\top\boldsymbol{\theta}^* \\
&= (\mathbf{x}_t^*-\mathbf{x}_t)^\top(\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t) + (\mathbf{x}_t^*-\mathbf{y}_t)^\top(\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t) + (2\mathbf{x}_t^*-\mathbf{x}_t-\mathbf{y}_t)^\top\hat{\boldsymbol{\theta}}_t \\
&\le \|\mathbf{x}_t^*-\mathbf{x}_t\|_{\Sigma_t^{-1}}\|\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t\|_{\Sigma_t} + \|\mathbf{x}_t^*-\mathbf{y}_t\|_{\Sigma_t^{-1}}\|\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t\|_{\Sigma_t} + (2\mathbf{x}_t^*-\mathbf{x}_t-\mathbf{y}_t)^\top\hat{\boldsymbol{\theta}}_t \\
&\le \beta_t\|\mathbf{x}_t^*-\mathbf{x}_t\|_{\Sigma_t^{-1}} + \beta_t\|\mathbf{x}_t^*-\mathbf{y}_t\|_{\Sigma_t^{-1}} + (2\mathbf{x}_t^*-\mathbf{x}_t-\mathbf{y}_t)^\top\hat{\boldsymbol{\theta}}_t,
\end{aligned}$$
where the first inequality holds due to the Cauchy-Schwarz inequality and the second due to (C.1). Using (C.2), we have
$$(\mathbf{x}_t^*+\mathbf{x}_t)^\top\hat{\boldsymbol{\theta}}_t + \beta_t\|\mathbf{x}_t^*-\mathbf{x}_t\|_{\Sigma_t^{-1}} \le (\mathbf{x}_t+\mathbf{y}_t)^\top\hat{\boldsymbol{\theta}}_t + \beta_t\|\mathbf{x}_t-\mathbf{y}_t\|_{\Sigma_t^{-1}},$$
$$(\mathbf{x}_t^*+\mathbf{y}_t)^\top\hat{\boldsymbol{\theta}}_t + \beta_t\|\mathbf{x}_t^*-\mathbf{y}_t\|_{\Sigma_t^{-1}} \le (\mathbf{x}_t+\mathbf{y}_t)^\top\hat{\boldsymbol{\theta}}_t + \beta_t\|\mathbf{x}_t-\mathbf{y}_t\|_{\Sigma_t^{-1}}.$$
Adding the above two inequalities, we have
$$\beta_t\|\mathbf{x}_t^*-\mathbf{x}_t\|_{\Sigma_t^{-1}} + \beta_t\|\mathbf{x}_t^*-\mathbf{y}_t\|_{\Sigma_t^{-1}} \le (\mathbf{x}_t+\mathbf{y}_t-2\mathbf{x}_t^*)^\top\hat{\boldsymbol{\theta}}_t + 2\beta_t\|\mathbf{x}_t-\mathbf{y}_t\|_{\Sigma_t^{-1}}.$$
Therefore, the regret can be upper bounded by
$$2r_t \le 2\beta_t\|\mathbf{x}_t-\mathbf{y}_t\|_{\Sigma_t^{-1}}.$$
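The symmetric rule (C.2) admits an equally short sketch (all names are assumptions):

```python
import numpy as np

def symmetric_pair(arms, theta_hat, Sigma_inv, beta):
    """Sketch of Method 3, the symmetric rule (C.2): jointly maximize
    (x + y)^T theta_hat + beta * ||x - y||_{Sigma^{-1}} over all pairs."""
    def value(i, j):
        d = arms[i] - arms[j]
        return float((arms[i] + arms[j]) @ theta_hat
                     + beta * np.sqrt(d @ Sigma_inv @ d))
    n = len(arms)
    return max(((i, j) for i in range(n) for j in range(n)),
               key=lambda ij: value(*ij))

# Toy instance: two orthogonal arms, estimate favoring the first.
arms = np.array([[1.0, 0.0], [0.0, 1.0]])
theta_hat = np.array([1.0, 0.0])
```

The single objective trades off estimated reward of both arms against the informativeness of the comparison, which is the design choice discussed in Section 4.4.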
In conclusion, similar inequalities hold for all three arm selection policies. To obtain an upper bound on the regret, we sum the instantaneous regret over all rounds and apply Lemma G.1 to obtain the final result.
Appendix D A Rigorous Proof for the MLE
D.1 Discussion on the Weakness
In the proof of Lemma E.1, for completeness, we need to prove that (4.1) has a unique solution. Following Li et al. (2017), we define an auxiliary function $G: \mathbb{R}^d \to \mathbb{R}^d$ as
$$G(\boldsymbol{\theta}) = \lambda\boldsymbol{\theta} + \sum_i w_i^2\bigl[\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\boldsymbol{\theta}) - \mu((\mathbf{x}_i-\mathbf{y}_i)^\top\boldsymbol{\theta}^*)\bigr](\mathbf{x}_i-\mathbf{y}_i).$$
Using the condition that the minimum eigenvalue of the covariance matrix is strictly positive, we can prove that $G$ is injective and that $\hat{\boldsymbol{\theta}}$ being the solution of (4.1) is equivalent to $G(\hat{\boldsymbol{\theta}}) = Z$, where $Z$ is a quantity depending on the stochastic noise. In Li et al. (2017), there is a minor gap in asserting the existence and uniqueness of the solution $\hat{\boldsymbol{\theta}} = G^{-1}(Z)$ without confirming whether $Z$ lies in the range of $G$. We address this with the classical Brouwer invariance of domain theorem from algebraic topology:
Theorem D.1 (Brouwer 1911).
Let $U$ be an open subset of $\mathbb{R}^d$, and let $f: U \to \mathbb{R}^d$ be a continuous injective map. Then $f(U)$ is also open.
We complete the proof by showing that $G(\mathbb{R}^d)$ is both open and closed, hence all of $\mathbb{R}^d$, so that (4.1) has a unique solution.
D.2 A Detailed Proof
We will prove that the function $G$ is a bijection from $\mathbb{R}^d$ to $\mathbb{R}^d$. We first show that it is injective; the proof idea is similar to Theorem 1 in Li et al. (2017). By the mean value theorem, for any $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \in \mathbb{R}^d$, there exist $c \in [0,1]$ and $\bar{\boldsymbol{\theta}} = c\boldsymbol{\theta}_1 + (1-c)\boldsymbol{\theta}_2$ such that
$$\begin{aligned}
G(\boldsymbol{\theta}_1) - G(\boldsymbol{\theta}_2)
&= \lambda(\boldsymbol{\theta}_1-\boldsymbol{\theta}_2) + \sum_i w_i^2\bigl[\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\boldsymbol{\theta}_1) - \mu((\mathbf{x}_i-\mathbf{y}_i)^\top\boldsymbol{\theta}_2)\bigr](\mathbf{x}_i-\mathbf{y}_i) \\
&= \Bigl[\lambda\mathbf{I} + \sum_i w_i^2\,\dot{\mu}((\mathbf{x}_i-\mathbf{y}_i)^\top\bar{\boldsymbol{\theta}})(\mathbf{x}_i-\mathbf{y}_i)(\mathbf{x}_i-\mathbf{y}_i)^\top\Bigr](\boldsymbol{\theta}_1-\boldsymbol{\theta}_2).
\end{aligned}$$
We define $F(\bar{\boldsymbol{\theta}})$ as
$$F(\bar{\boldsymbol{\theta}}) = \lambda\mathbf{I} + \sum_i w_i^2\,\dot{\mu}((\mathbf{x}_i-\mathbf{y}_i)^\top\bar{\boldsymbol{\theta}})(\mathbf{x}_i-\mathbf{y}_i)(\mathbf{x}_i-\mathbf{y}_i)^\top.$$
Using $\dot{\mu}(\cdot) \ge \kappa_\mu > 0$ and $\inf_i w_i^2 > 0$, we have that $F(\bar{\boldsymbol{\theta}})$ is positive definite. Therefore, when $\boldsymbol{\theta}_1 \neq \boldsymbol{\theta}_2$, we have $G(\boldsymbol{\theta}_1) \neq G(\boldsymbol{\theta}_2)$; that is, $G$ is an injection from $\mathbb{R}^d$ to $\mathbb{R}^d$.
Next, we prove that $G$ is surjective. The classical Brouwer invariance of domain theorem (Theorem G.4) in algebraic topology indicates that $G$ is an open map, and thus $G(\mathbb{R}^d)$ is an open set. On the other hand, the minimum eigenvalue of $F(\bar{\boldsymbol{\theta}})$ is strictly positive. Therefore, $F(\bar{\boldsymbol{\theta}})$ is invertible, and we have
$$\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2 = F(\bar{\boldsymbol{\theta}})^{-1}\bigl[G(\boldsymbol{\theta}_1) - G(\boldsymbol{\theta}_2)\bigr]. \tag{D.1}$$
Let $\{G(\boldsymbol{\theta}_k)\}_{k=1}^\infty$ be a Cauchy sequence in $G(\mathbb{R}^d)$. Using (D.1) and the fact that $\lambda_{\min}(F(\bar{\boldsymbol{\theta}})) \ge \lambda > 0$, we have for any $k, m$,
$$\|\boldsymbol{\theta}_k - \boldsymbol{\theta}_m\|_2 \le \frac{1}{\lambda}\|G(\boldsymbol{\theta}_k) - G(\boldsymbol{\theta}_m)\|_2.$$
This inequality shows that $\{\boldsymbol{\theta}_k\}_{k=1}^\infty$ is also a Cauchy sequence. By the completeness of $\mathbb{R}^d$, the limit $\lim_{k\to\infty}\boldsymbol{\theta}_k = \boldsymbol{\theta}$ exists. By the continuity of the function $G$, we have
$$\lim_{k\to\infty} G(\boldsymbol{\theta}_k) = G(\boldsymbol{\theta}) \in G(\mathbb{R}^d).$$
Therefore, $G(\mathbb{R}^d)$ is also closed. We have shown that $G(\mathbb{R}^d)$ is both open and closed; since $\mathbb{R}^d$ is connected, it follows that $G(\mathbb{R}^d) = \mathbb{R}^d$, i.e., $G$ is surjective.
In conclusion, the function $G$ is invertible, and (4.1) has a unique solution.
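Numerically, the unique solution guaranteed above can be found by Newton's method on the score equation. This sketch assumes a regularized equation of the form $\lambda\boldsymbol{\theta} + \sum_i w_i^2(\mu(\mathbf{z}_i^\top\boldsymbol{\theta}) - o_i)\mathbf{z}_i = 0$ with the logistic link and $\mathbf{z}_i = \mathbf{x}_i - \mathbf{y}_i$ (an illustration, not the paper's exact (4.1)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def solve_mle(Z, o, w, lam=0.1, iters=50):
    """Newton's method for lam*theta + sum_i w_i^2 (sigmoid(z_i^T theta) - o_i) z_i = 0.
    The Jacobian lam*I + sum_i w_i^2 mu'(z_i^T theta) z_i z_i^T is positive
    definite, mirroring the uniqueness argument above."""
    d = Z.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        m = sigmoid(Z @ theta)
        g = lam * theta + Z.T @ (w ** 2 * (m - o))
        H = lam * np.eye(d) + Z.T @ ((w ** 2 * m * (1.0 - m))[:, None] * Z)
        theta -= np.linalg.solve(H, g)
    return theta

# Synthetic comparison data with an arbitrary ground-truth parameter.
rng = np.random.default_rng(1)
Z = rng.standard_normal((50, 3))
o = (rng.random(50) < sigmoid(Z @ np.array([1.0, -0.5, 0.2]))).astype(float)
w = np.ones(50)
theta_hat = solve_mle(Z, o, w)
```

The returned `theta_hat` drives the score equation's residual to (numerically) zero, i.e., it is the unique root whose existence the argument above establishes.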
Appendix E Proof of Theorem 5.1
In this section, we assume that (4.1) has a unique solution $\hat{\boldsymbol{\theta}}_{t+1,\ell}$, which is essential for our analysis; a detailed discussion is given in Appendix D.
We first need the concentration inequality for the MLE.
Lemma E.1.
With probability at least $1-\delta$, the following concentration inequality holds for all rounds $t \ge 2$ and layers $\ell \in [L]$ simultaneously:
$$\|\hat{\boldsymbol{\theta}}_{t,\ell} - \boldsymbol{\theta}^*\|_{\hat{\Sigma}_{t,\ell}} \le \frac{2^{-\ell}}{\kappa_\mu}\Biggl[16\sqrt{\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2\log(4t^2L/\delta)} + 6\log(4t^2L/\delta)\Biggr] + 2^{-\ell}.$$
With this lemma, the following event holds with high probability:
$$\mathcal{E} = \Biggl\{\|\hat{\boldsymbol{\theta}}_{t,\ell} - \boldsymbol{\theta}^*\|_{\hat{\Sigma}_{t,\ell}} \le \frac{2^{-\ell}}{\kappa_\mu}\Bigl[16\sqrt{\textstyle\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2\log(4t^2L/\delta)} + 6\log(4t^2L/\delta)\Bigr] + 2^{-\ell} \text{ for all } t, \ell\Biggr\}.$$
Lemma E.1 shows that $\mathbb{P}[\mathcal{E}] \ge 1-\delta$. For our choice of $\hat{\beta}_{t,\ell}$ defined in (4.3), we define the following event:
$$\mathcal{E}_{\mathrm{bonus}} = \Biggl\{\hat{\beta}_{t,\ell} \ge \frac{2^{-\ell}}{\kappa_\mu}\Bigl[16\sqrt{\textstyle\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2\log(4t^2L/\delta)} + 6\log(4t^2L/\delta)\Bigr] + 2^{-\ell} \text{ for all } t, \ell\Biggr\}.$$
The following two lemmas show that the event $\mathcal{E}_{\mathrm{bonus}}$ holds with high probability.
Lemma E.2.
With probability at least $1-\delta$, for all $t \ge 2$ and $\ell \in [L]$, the following two inequalities hold simultaneously:
$$\sum_{i\in\Psi_{t,\ell}} w_i^2\bar{\sigma}_i^2 \le 2\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 + \frac{14}{3}\log(4t^2L/\delta),$$
$$\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 \le \frac{3}{2}\sum_{i\in\Psi_{t,\ell}} w_i^2\bar{\sigma}_i^2 + \frac{7}{3}\log(4t^2L/\delta),$$
where $\bar{\sigma}_i^2$ denotes the empirical counterpart of the conditional variance $\sigma_i^2$.
Lemma E.3.
Suppose that the inequalities in Lemma E.2 and the event $\mathcal{E}$ hold. For all $t \ge 2$ and $\ell \in [L]$ such that $2^{\ell} \ge 64(L_\mu/\kappa_\mu)\log(4(t+1)^2L/\delta)$, the following inequalities hold:
$$\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 \le 8\sum_{i\in\Psi_{t,\ell}} w_i^2\bigl(o_i - \mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat{\boldsymbol{\theta}}_{t,\ell})\bigr)^2 + 18\log(4(t+1)^2L/\delta),$$
$$\sum_{i\in\Psi_{t,\ell}} w_i^2\bigl(o_i - \mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat{\boldsymbol{\theta}}_{t,\ell})\bigr)^2 \le 4\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 + 8\log(4(t+1)^2L/\delta).$$
Recall that with our choice of $\hat{\beta}_{t,\ell}$ in (4.3), the inequality in $\mathcal{E}_{\mathrm{bonus}}$ holds automatically when $2^{\ell} < 64(L_\mu/\kappa_\mu)\log(4(t+1)^2L/\delta)$. Combining Lemma E.2, Lemma E.3 and $\mathbb{P}[\mathcal{E}] \ge 1-\delta$, after taking a union bound, we obtain $\mathbb{P}[\mathcal{E}_{\mathrm{bonus}} \cap \mathcal{E}] \ge 1-2\delta$.
Lemma E.4.
Suppose the high-probability events $\mathcal{E}_{\mathrm{bonus}}$ and $\mathcal{E}$ hold. Then for all $t \ge 1$ and $\ell \in [L]$ such that the set $\mathcal{A}_{t,\ell}$ is defined, the contextual vector of the optimal arm $\mathbf{x}_t^*$ lies in $\mathcal{A}_{t,\ell}$.
We can then bound the regret incurred in each layer separately.
Lemma E.5.
Suppose the high-probability events $\mathcal{E}_{\mathrm{bonus}}$ and $\mathcal{E}$ hold. Then for all $\ell \in [L]\setminus\{1\}$, the regret incurred by the index set $\Psi_{T+1,\ell}$ is bounded by
$\sum_{i\in\Psi_{T+1,\ell}}\big(2\mathbf{x}_i^{*\top}\theta^* - (\mathbf{x}_i^\top\theta^* + \mathbf{y}_i^\top\theta^*)\big) \le \tilde O\big(d\,2^{\ell}\,\hat\beta_{T,\ell-1}\big).$
With all these lemmas, we can prove Theorem 5.1.
Proof of Theorem 5.1.
Conditioned on $\mathcal{E}_{\mathrm{bonus}}\cap\mathcal{E}$, let
$\ell^* := \big\lceil \log_2\big(64(L_\mu/\kappa_\mu)\log(4(T+1)^2L/\delta)\big)\big\rceil.$
Using the high-probability event $\mathcal{E}_{\mathrm{bonus}}$, Lemma E.4 and Lemma E.5, for any $\ell > \ell^*$, we have
$\sum_{i\in\Psi_{T+1,\ell}}\big(2\mathbf{x}_i^{*\top}\theta^* - (\mathbf{x}_i^\top\theta^* + \mathbf{y}_i^\top\theta^*)\big) \le \tilde O\big(d\,2^{\ell}\,\hat\beta_{T,\ell-1}\big) \le \tilde O\Big(\frac{d}{\kappa_\mu}\Big(\sqrt{\sum_{i\in\Psi_{T+1,\ell}} w_i^2\big(o_i-\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{T+1,\ell})\big)^2 + 1} + 1\Big)\Big) \le \tilde O\Big(\frac{d}{\kappa_\mu}\sqrt{\sum_{t=1}^T \sigma_t^2} + \frac{d}{\kappa_\mu} + 1\Big),$
(E.1)
where the first inequality holds due to Lemma E.5, the second inequality holds due to the definition (4.3), and the last inequality holds due to Lemma E.3 and $w_i \le 1$.
For $\ell \in [\ell^*]$, we have
$\sum_{i\in\Psi_{T+1,\ell}}\big(2\mathbf{x}_i^{*\top}\theta^* - (\mathbf{x}_i^\top\theta^* + \mathbf{y}_i^\top\theta^*)\big) \le 4|\Psi_{T+1,\ell}| = 2^{2\ell+2}\sum_{i\in\Psi_{T+1,\ell}} \|w_i(\mathbf{x}_i-\mathbf{y}_i)\|_{\hat\Sigma_{i,\ell}^{-1}}^2 \le 2^{2\ell+3}\,d\log\big(1+T/(d\lambda)\big) = \tilde O\big(d L_\mu^2/\kappa_\mu^2\big),$
(E.2)
where the first equality holds due to our choice of $w_i$ such that $\|w_i(\mathbf{x}_i-\mathbf{y}_i)\|_{\hat\Sigma_{i,\ell}^{-1}} = 2^{-\ell}$, the second inequality holds due to Lemma G.1, and the last equality holds due to $\ell \le \ell^*$.
For any $i \in [T]\setminus(\cup_{\ell\in[L]}\Psi_{T+1,\ell})$, we set $\ell_i$ as the value of the layer at which $\|\mathbf{x}-\mathbf{y}\|_{\hat\Sigma_{i,\ell_i}^{-1}} \le \alpha$ for all $\mathbf{x},\mathbf{y}\in\mathcal{D}_{i,\ell_i}$ and the while loop ends. By the choice of $\mathbf{x}_i,\mathbf{y}_i$ and $\mathbf{x}_i^*\in\mathcal{D}_{i,\ell_i}$ (Lemma E.4), we have
$2\mathbf{x}_i^{*\top}\hat\theta_{i,\ell_i} \le \mathbf{x}_i^\top\hat\theta_{i,\ell_i} + \mathbf{y}_i^\top\hat\theta_{i,\ell_i} + \hat\beta_{i,\ell_i}\|\mathbf{x}_i-\mathbf{y}_i\|_{\hat\Sigma_{i,\ell_i}^{-1}} \le \mathbf{x}_i^\top\hat\theta_{i,\ell_i} + \mathbf{y}_i^\top\hat\theta_{i,\ell_i} + \hat\beta_{i,\ell_i}\alpha,$
(E.3)
where the last inequality holds because $\|\mathbf{x}-\mathbf{y}\|_{\hat\Sigma_{i,\ell_i}^{-1}} \le \alpha$ for all $\mathbf{x},\mathbf{y}\in\mathcal{D}_{i,\ell_i}$. Then we have
$\sum_{i\in[T]\setminus(\cup_{\ell\in[L]}\Psi_{T+1,\ell})}\big(2\mathbf{x}_i^{*\top}\theta^* - (\mathbf{x}_i^\top\theta^* + \mathbf{y}_i^\top\theta^*)\big)$
$= \sum_{i\in[T]\setminus(\cup_{\ell\in[L]}\Psi_{T+1,\ell})}\Big(\big(2\mathbf{x}_i^{*\top}\theta^* - 2\mathbf{x}_i^{*\top}\hat\theta_{i,\ell_i}\big) + \big(\mathbf{x}_i^\top\hat\theta_{i,\ell_i} - \mathbf{x}_i^\top\theta^*\big) + \big(\mathbf{y}_i^\top\hat\theta_{i,\ell_i} - \mathbf{y}_i^\top\theta^*\big) + \big(2\mathbf{x}_i^{*\top}\hat\theta_{i,\ell_i} - (\mathbf{x}_i^\top\hat\theta_{i,\ell_i} + \mathbf{y}_i^\top\hat\theta_{i,\ell_i})\big)\Big)$
$\le \sum_{i\in[T]\setminus(\cup_{\ell\in[L]}\Psi_{T+1,\ell})}\Big(\big(\|\mathbf{x}_i^*-\mathbf{x}_i\|_{\hat\Sigma_{i,\ell_i}^{-1}} + \|\mathbf{x}_i^*-\mathbf{y}_i\|_{\hat\Sigma_{i,\ell_i}^{-1}}\big)\|\theta^*-\hat\theta_{i,\ell_i}\|_{\hat\Sigma_{i,\ell_i}} + \hat\beta_{i,\ell_i}\alpha\Big)$
$\le \sum_{i\in[T]\setminus(\cup_{\ell\in[L]}\Psi_{T+1,\ell})} 3\hat\beta_{i,\ell_i}\alpha \le T\cdot\tilde O(1/T) = \tilde O(1),$
(E.4)
where the first inequality holds due to the Cauchy-Schwarz inequality and (E.3). The second inequality holds due to $\|\mathbf{x}-\mathbf{y}\|_{\hat\Sigma_{i,\ell_i}^{-1}} \le \alpha$ for all $\mathbf{x},\mathbf{y}\in\mathcal{D}_{i,\ell_i}$, $\mathbf{x}_i^*\in\mathcal{D}_{i,\ell_i}$ (Lemma E.4) and Lemma E.1. The last inequality holds due to our choice of $\hat\beta_{i,\ell_i} \le \tilde O(\sqrt d)$ and $\alpha = 1/T^{3/2}$. Combining (E.1), (E.2), (E.4) together, we obtain
$\mathrm{Regret}(T) = \tilde O\Big(\frac{d}{\kappa_\mu}\sqrt{\sum_{t=1}^T\sigma_t^2} + d\Big(\frac{L_\mu^2}{\kappa_\mu^2} + \frac{1}{\kappa_\mu}\Big)\Big).$
β
Appendix F Proof of Lemmas in Section E
F.1 Proof of Lemma E.1
Proof of Lemma E.1.
For a fixed $\ell\in[L]$ and $t\in\Psi_{T+1,\ell}$ with $t\ge2$, we define some auxiliary quantities:
$G_{t,\ell}(\theta) := 2^{-2\ell}\lambda\kappa_\mu\theta + \sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\big[\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\theta) - \mu((\mathbf{x}_i-\mathbf{y}_i)^\top\theta^*)\big](\mathbf{x}_i-\mathbf{y}_i),$
$\epsilon_t := o_t - \mu((\mathbf{x}_t-\mathbf{y}_t)^\top\theta^*),$
$Z_{t,\ell} := \sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\epsilon_i(\mathbf{x}_i-\mathbf{y}_i).$
Recall from (4.1) that $\hat\theta_{t,\ell}$ is the solution to
$2^{-2\ell}\lambda\kappa_\mu\hat\theta_{t,\ell} + \sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\big(\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{t,\ell}) - o_i\big)(\mathbf{x}_i-\mathbf{y}_i) = 0.$
(F.1)
A simple transformation shows that (F.1) is equivalent to the following equation:
$G_{t,\ell}(\hat\theta_{t,\ell}) = 2^{-2\ell}\lambda\kappa_\mu\hat\theta_{t,\ell} + \sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\big[\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{t,\ell}) - \mu((\mathbf{x}_i-\mathbf{y}_i)^\top\theta^*)\big](\mathbf{x}_i-\mathbf{y}_i) = \sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\big[o_i - \mu((\mathbf{x}_i-\mathbf{y}_i)^\top\theta^*)\big](\mathbf{x}_i-\mathbf{y}_i) = Z_{t,\ell}.$
We have proved in Section D that $G_{t,\ell}$ is invertible, and thus $\hat\theta_{t,\ell} = G_{t,\ell}^{-1}(Z_{t,\ell})$. Moreover, we can see that $G_{t,\ell}(\theta^*) = 2^{-2\ell}\lambda\kappa_\mu\theta^*$. Recall that $\hat\Sigma_{t,\ell} = 2^{-2\ell}\lambda\kappa_\mu I + \sum_{i\in\mathcal{C}_{t,\ell}} w_i^2(\mathbf{x}_i-\mathbf{y}_i)(\mathbf{x}_i-\mathbf{y}_i)^\top$. We have
$\|G_{t,\ell}(\hat\theta_{t,\ell}) - G_{t,\ell}(\theta^*)\|_{\hat\Sigma_{t,\ell}^{-1}}^2 = (\hat\theta_{t,\ell}-\theta^*)^\top F(\bar\theta)\,\hat\Sigma_{t,\ell}^{-1}\,F(\bar\theta)\,(\hat\theta_{t,\ell}-\theta^*) \ge \kappa_\mu^2\,(\hat\theta_{t,\ell}-\theta^*)^\top\hat\Sigma_{t,\ell}(\hat\theta_{t,\ell}-\theta^*) = \kappa_\mu^2\|\hat\theta_{t,\ell}-\theta^*\|_{\hat\Sigma_{t,\ell}}^2,$
where the inequality holds because $\dot\mu(\cdot) \ge \kappa_\mu > 0$ and thus $F(\bar\theta) \succeq \kappa_\mu\hat\Sigma_{t,\ell}$. Using the triangle inequality, we have
$\|\hat\theta_{t,\ell}-\theta^*\|_{\hat\Sigma_{t,\ell}} \le 2^{-2\ell}\lambda\|\theta^*\|_{\hat\Sigma_{t,\ell}^{-1}} + \frac{1}{\kappa_\mu}\|Z_{t,\ell}\|_{\hat\Sigma_{t,\ell}^{-1}} \le 2^{-\ell}\sqrt{\lambda}\|\theta^*\|_2 + \frac{1}{\kappa_\mu}\|Z_{t,\ell}\|_{\hat\Sigma_{t,\ell}^{-1}}.$
To bound the $\|Z_{t,\ell}\|_{\hat\Sigma_{t,\ell}^{-1}}$ term, we use Lemma G.3. By the choice of $w_i$, for any $t\in\Psi_{T+1,\ell}$, we have
$\|w_t(\mathbf{x}_t-\mathbf{y}_t)\|_{\hat\Sigma_{t,\ell}^{-1}} = 2^{-\ell}\ \text{and}\ w_t \le 1.$
We also have
$\mathbb{E}[w_t^2\epsilon_t^2 \mid \mathcal{F}_t] \le w_t^2\,\mathbb{E}[\epsilon_t^2 \mid \mathcal{F}_t] \le w_t^2\sigma_t^2\ \text{and}\ |w_t\epsilon_t| \le |\epsilon_t| \le 1.$
Therefore, Lemma G.3 shows that with probability at least $1-\delta/L$, for all $t\in\Psi_{T+1,\ell}$, the following inequality holds:
$\|Z_{t,\ell}\|_{\hat\Sigma_{t,\ell}^{-1}} \le 16\cdot 2^{-\ell}\sqrt{\sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\sigma_i^2\log(4t^2L/\delta)} + 6\cdot 2^{-\ell}\log(4t^2L/\delta).$
Finally, we get
$\|\hat\theta_{t,\ell}-\theta^*\|_{\hat\Sigma_{t,\ell}} \le \frac{2^{-\ell}}{\kappa_\mu}\Big[16\sqrt{\sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\sigma_i^2\log(4t^2L/\delta)} + 6\log(4t^2L/\delta)\Big] + 2^{-\ell}.$
Taking a union bound over all $\ell\in[L]$ finishes the proof of Lemma E.1. ∎
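In practice, the estimator defined by the root equation (F.1) can be computed with Newton's method. Below is a minimal numerical sketch, assuming a logistic link $\mu(z) = 1/(1+e^{-z})$ and synthetic data; the function names, the data, and the scalar `reg` (standing in for the regularization coefficient $2^{-2\ell}\lambda\kappa_\mu$) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mu(z):
    # Logistic link mu(z) = 1/(1+exp(-z)); an illustrative choice of link.
    return 1.0 / (1.0 + np.exp(-z))

def solve_weighted_mle(Z, o, w, reg, iters=50):
    """Newton's method for the root of (F.1):
    reg * theta + sum_i w_i^2 * (mu(z_i^T theta) - o_i) * z_i = 0,
    where z_i = x_i - y_i is the comparison feature and o_i in {0, 1}."""
    d = Z.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = mu(Z @ theta)
        grad = reg * theta + Z.T @ (w ** 2 * (p - o))
        # Hessian of the strictly convex objective:
        # reg*I + sum_i w_i^2 * mu'(z_i^T theta) * z_i z_i^T, with mu' = p(1-p)
        hess = reg * np.eye(d) + (Z * (w ** 2 * p * (1.0 - p))[:, None]).T @ Z
        step = np.linalg.solve(hess, grad)
        theta -= step
        if np.linalg.norm(step) < 1e-10:
            break
    return theta

rng = np.random.default_rng(0)
theta_star = np.array([0.6, -0.3, 0.2])
Z = rng.normal(size=(500, 3))                 # z_i = x_i - y_i (synthetic)
o = rng.binomial(1, mu(Z @ theta_star))       # binary comparison outcomes
w = rng.uniform(0.5, 1.0, size=500)           # weights with w_i <= 1
theta_hat = solve_weighted_mle(Z, o, w, reg=1.0)
residual = 1.0 * theta_hat + Z.T @ (w ** 2 * (mu(Z @ theta_hat) - o))
print(np.linalg.norm(residual))               # near zero: theta_hat solves (F.1)
```

Because the root equation is the gradient of a ridge-regularized GLM loss, the objective is strictly convex and Newton's method converges rapidly from the zero initialization.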
F.2 Proof of Lemma E.2
Proof of Lemma E.2.
The proof of this lemma is similar to that of Lemma B.4 in Zhao et al. (2023a). For a fixed layer $\ell\in[L]$, using the definitions of $\epsilon_i$ and $\sigma_i$, we have
$\forall i \ge 1,\quad \mathbb{E}\big[\epsilon_i^2 - \sigma_i^2 \,\big|\, \mathbf{x}_{1:i},\mathbf{y}_{1:i},o_{1:i-1}\big] = 0.$
Therefore, we have
$\sum_{i\in\Psi_{t,\ell}} \mathbb{E}\big[w_i^2(\epsilon_i^2-\sigma_i^2)^2 \,\big|\, \mathbf{x}_{1:i},\mathbf{y}_{1:i},o_{1:i-1}\big] \le \sum_{i\in\Psi_{t,\ell}} \mathbb{E}\big[w_i^2\epsilon_i^4 \,\big|\, \mathbf{x}_{1:i},\mathbf{y}_{1:i},o_{1:i-1}\big] \le \sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2,$
where the last inequality holds due to the definition of $\sigma_i$ and $|\epsilon_i| \le 1$. Then using Lemma G.2 and taking a union bound over all $\ell\in[L]$, for all $t\ge2$, we have
$\Big|\sum_{i\in\Psi_{t,\ell}} w_i^2(\epsilon_i^2-\sigma_i^2)\Big| \le \sqrt{2\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2\log(4t^2L/\delta)} + \tfrac{2}{3}\cdot 2\log(4t^2L/\delta) \le \tfrac12\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 + \tfrac73\log(4t^2L/\delta),$
(F.2)
where we use Young's inequality $ab \le \tfrac12 a^2 + \tfrac12 b^2$. Finally, we finish the proof of Lemma E.2 by
$\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 = \Big|\sum_{i\in\Psi_{t,\ell}} w_i^2\epsilon_i^2 - \sum_{i\in\Psi_{t,\ell}} w_i^2(\epsilon_i^2-\sigma_i^2)\Big| \le \sum_{i\in\Psi_{t,\ell}} w_i^2\epsilon_i^2 + \Big|\sum_{i\in\Psi_{t,\ell}} w_i^2(\epsilon_i^2-\sigma_i^2)\Big| \le \sum_{i\in\Psi_{t,\ell}} w_i^2\epsilon_i^2 + \tfrac12\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 + \tfrac73\log(4t^2L/\delta),$
(F.3)
where the first inequality holds due to the triangle inequality. The second inequality holds due to (F.2). We also have
$\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 = \Big|\sum_{i\in\Psi_{t,\ell}} w_i^2\epsilon_i^2 - \sum_{i\in\Psi_{t,\ell}} w_i^2(\epsilon_i^2-\sigma_i^2)\Big| \ge \sum_{i\in\Psi_{t,\ell}} w_i^2\epsilon_i^2 - \Big|\sum_{i\in\Psi_{t,\ell}} w_i^2(\epsilon_i^2-\sigma_i^2)\Big| \ge \sum_{i\in\Psi_{t,\ell}} w_i^2\epsilon_i^2 - \tfrac12\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 - \tfrac73\log(4t^2L/\delta),$
which rearranges to the second inequality. The proof of this inequality is almost the same as that of (F.3). ∎
F.3 Proof of Lemma E.3
Proof of Lemma E.3.
For a fixed $\ell\in[L]$, Lemma E.2 indicates that
$\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 \le 2\sum_{i\in\Psi_{t,\ell}} w_i^2\epsilon_i^2 + \tfrac{14}{3}\log(4t^2L/\delta) \le \tfrac{14}{3}\log(4t^2L/\delta) + 4\sum_{i\in\Psi_{t,\ell}} w_i^2\big(o_i-\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{t,\ell})\big)^2 + 4\underbrace{\sum_{i\in\Psi_{t,\ell}} w_i^2\big(\epsilon_i-(o_i-\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{t,\ell}))\big)^2}_{(I)},$
(F.4)
where the second inequality holds due to the basic inequality $(a+b)^2 \le 2a^2+2b^2$ for all $a,b\in\mathbb{R}$. Using our definition of $\epsilon_i$, we have $o_i = \mu((\mathbf{x}_i-\mathbf{y}_i)^\top\theta^*) + \epsilon_i$. Thus, we have
$(I) = \sum_{i\in\Psi_{t,\ell}} w_i^2\big(\epsilon_i-(o_i-\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{t,\ell}))\big)^2 = \sum_{i\in\Psi_{t,\ell}} w_i^2\big(\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{t,\ell}) - \mu((\mathbf{x}_i-\mathbf{y}_i)^\top\theta^*)\big)^2 \le L_\mu^2\sum_{i\in\Psi_{t,\ell}} w_i^2\big((\mathbf{x}_i-\mathbf{y}_i)^\top(\hat\theta_{t,\ell}-\theta^*)\big)^2,$
(F.5)
where the last inequality holds because the first-order derivative of the function $\mu$ is upper bounded by $L_\mu$ (Assumption 3.2). Moreover, by expanding the square, we have
$(I) \le L_\mu^2\sum_{i\in\Psi_{t,\ell}} w_i^2\big((\mathbf{x}_i-\mathbf{y}_i)^\top(\hat\theta_{t,\ell}-\theta^*)\big)^2 = L_\mu^2(\hat\theta_{t,\ell}-\theta^*)^\top\Big(\sum_{i\in\Psi_{t,\ell}} w_i^2(\mathbf{x}_i-\mathbf{y}_i)(\mathbf{x}_i-\mathbf{y}_i)^\top\Big)(\hat\theta_{t,\ell}-\theta^*) \le L_\mu^2\|\hat\theta_{t,\ell}-\theta^*\|_{\hat\Sigma_{t,\ell}}^2,$
(F.6)
where the last inequality holds due to
$\hat\Sigma_{t,\ell} = 2^{-2\ell}\lambda\kappa_\mu I + \sum_{i\in\Psi_{t,\ell}} w_i^2(\mathbf{x}_i-\mathbf{y}_i)(\mathbf{x}_i-\mathbf{y}_i)^\top \succeq \sum_{i\in\Psi_{t,\ell}} w_i^2(\mathbf{x}_i-\mathbf{y}_i)(\mathbf{x}_i-\mathbf{y}_i)^\top.$
Combining (F.5), (F.6) and the event $\mathcal{E}$ (Lemma E.1), we have
$(I) \le 2^{-2\ell}\frac{L_\mu^2}{\kappa_\mu^2}\Big[16\sqrt{\sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\sigma_i^2\log(4(t+1)^2L/\delta)} + 6\log(4(t+1)^2L/\delta) + \kappa_\mu\Big]^2 \le 2^{-2\ell}\frac{L_\mu^2}{\kappa_\mu^2}\Big[512\log(4(t+1)^2L/\delta)\sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\sigma_i^2 + 2\big(6\log(4(t+1)^2L/\delta) + \kappa_\mu\big)^2\Big],$
where the last inequality holds due to the basic inequality $(a+b)^2 \le 2a^2+2b^2$ for all $a,b\in\mathbb{R}$. When $2^\ell \ge 64(L_\mu/\kappa_\mu)\log(4(t+1)^2L/\delta)$, we can further bound the above inequality by
$(I) \le \tfrac18\sum_{i\in\mathcal{C}_{t+1,\ell}} w_i^2\sigma_i^2 + \log(4(t+1)^2L/\delta).$
(F.7)
Substituting (F.7) into (F.4), we have
$\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 \le 4\sum_{i\in\Psi_{t,\ell}} w_i^2\big(o_i-\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{t,\ell})\big)^2 + 9\log(4(t+1)^2L/\delta) + \tfrac12\sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\sigma_i^2.$
Therefore, we prove the first inequality in Lemma E.3 as follows:
$\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 \le 8\sum_{i\in\Psi_{t,\ell}} w_i^2\big(o_i-\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{t,\ell})\big)^2 + 18\log(4(t+1)^2L/\delta).$
For the second inequality, we have
$\sum_{i\in\Psi_{t,\ell}} w_i^2\big(o_i-\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{t,\ell})\big)^2 \le 2\sum_{i\in\Psi_{t,\ell}} w_i^2\epsilon_i^2 + 2\underbrace{\sum_{i\in\Psi_{t,\ell}} w_i^2\big(\epsilon_i-(o_i-\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{t,\ell}))\big)^2}_{(I)}.$
Then we can bound it by
$\sum_{i\in\Psi_{t,\ell}} w_i^2\big(o_i-\mu((\mathbf{x}_i-\mathbf{y}_i)^\top\hat\theta_{t,\ell})\big)^2 \le 2\sum_{i\in\Psi_{t,\ell}} w_i^2\epsilon_i^2 + \tfrac14\sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\sigma_i^2 + 2\log(4(t+1)^2L/\delta) \le 2\Big(\tfrac32\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 + \tfrac73\log(4t^2L/\delta)\Big) + \tfrac14\sum_{i\in\mathcal{C}_{t,\ell}} w_i^2\sigma_i^2 + 2\log(4(t+1)^2L/\delta) \le 4\sum_{i\in\Psi_{t,\ell}} w_i^2\sigma_i^2 + 8\log(4(t+1)^2L/\delta),$
where the first inequality holds due to (F.7), and the second inequality holds due to Lemma E.2. This completes the proof of Lemma E.3. ∎
F.4 Proof of Lemma E.4
Proof of Lemma E.4.
We prove it by induction. For $\ell=1$, we initialize the set $\mathcal{D}_{t,1}$ to be $\mathcal{A}_t$, so trivially $\mathbf{x}_t^*\in\mathcal{D}_{t,1}$. Now we suppose $\mathcal{D}_{t,\ell}$ is defined and $\mathbf{x}_t^*\in\mathcal{D}_{t,\ell}$. By the way $\mathcal{D}_{t,\ell+1}$ is constructed, $\mathcal{D}_{t,\ell+1}$ is defined only when $\|\mathbf{x}-\mathbf{y}\|_{\hat\Sigma_{t,\ell}^{-1}} \le 2^{-\ell}$ for all $\mathbf{x},\mathbf{y}\in\mathcal{D}_{t,\ell}$.
Let $\mathbf{x}_{\max} = \mathrm{argmax}_{\mathbf{x}\in\mathcal{D}_{t,\ell}}\,\mathbf{x}^\top\hat\theta_{t,\ell}$. Then we have
$\mathbf{x}_t^{*\top}\hat\theta_{t,\ell} - \mathbf{x}_{\max}^\top\hat\theta_{t,\ell} = \big(\mathbf{x}_t^{*\top}\theta^* - \mathbf{x}_{\max}^\top\theta^*\big) + \big(\mathbf{x}_t^*-\mathbf{x}_{\max}\big)^\top\big(\hat\theta_{t,\ell}-\theta^*\big) \ge -\|\mathbf{x}_t^*-\mathbf{x}_{\max}\|_{\hat\Sigma_{t,\ell}^{-1}}\,\|\hat\theta_{t,\ell}-\theta^*\|_{\hat\Sigma_{t,\ell}},$
where the inequality holds due to the Cauchy-Schwarz inequality and the fact that $\mathbf{x}_t^* = \mathrm{argmax}_{\mathbf{x}\in\mathcal{A}_t}\,\mathbf{x}^\top\theta^*$. With the inductive hypothesis, we know $\mathbf{x}_t^*\in\mathcal{D}_{t,\ell}$. Thus we have $\|\mathbf{x}_t^*-\mathbf{x}_{\max}\|_{\hat\Sigma_{t,\ell}^{-1}} \le 2^{-\ell}$. Finally, with the inequality in Lemma E.1, we have
$\mathbf{x}_t^{*\top}\hat\theta_{t,\ell} \ge \max_{\mathbf{x}\in\mathcal{D}_{t,\ell}}\mathbf{x}^\top\hat\theta_{t,\ell} - 2^{-\ell}\hat\beta_{t,\ell}.$
Therefore, we have $\mathbf{x}_t^*\in\mathcal{D}_{t,\ell+1}$, and we complete the proof of Lemma E.4 by induction. ∎
F.5 Proof of Lemma E.5
Proof of Lemma E.5.
For any $i\in\Psi_{T+1,\ell}$, due to the definition of $\Psi_{T+1,\ell}$ and our choice of $\mathbf{x}_i,\mathbf{y}_i$ (Algorithm 1, Lines 14-16), we have $\mathbf{x}_i,\mathbf{y}_i\in\mathcal{D}_{i,\ell}$. Additionally, because the set $\mathcal{D}_{i,\ell}$ is defined, $\|\mathbf{x}-\mathbf{y}\|_{\hat\Sigma_{i,\ell-1}^{-1}} \le 2^{-\ell+1}$ for all $\mathbf{x},\mathbf{y}\in\mathcal{D}_{i,\ell-1}$. From Lemma E.4, we can see that $\mathbf{x}_i^*\in\mathcal{D}_{i,\ell}$. Combining these results, we have
$\|\mathbf{x}_i^*-\mathbf{x}_i\|_{\hat\Sigma_{i,\ell-1}^{-1}} \le 2^{-\ell+1}, \qquad \|\mathbf{x}_i^*-\mathbf{y}_i\|_{\hat\Sigma_{i,\ell-1}^{-1}} \le 2^{-\ell+1},$
(F.8)
where we use the inclusion property $\mathcal{D}_{i,\ell}\subseteq\mathcal{D}_{i,\ell-1}$. Moreover, $\mathbf{x}_i,\mathbf{x}_i^*\in\mathcal{D}_{i,\ell}$ shows that
$\mathbf{x}_i^\top\hat\theta_{i,\ell-1} \ge \max_{\mathbf{x}\in\mathcal{D}_{i,\ell-1}}\mathbf{x}^\top\hat\theta_{i,\ell-1} - 2^{-\ell+1}\hat\beta_{i,\ell-1} \ge \mathbf{x}_i^{*\top}\hat\theta_{i,\ell-1} - 2^{-\ell+1}\hat\beta_{i,\ell-1},$
(F.9)
where we use $\mathbf{x}_i^*\in\mathcal{D}_{i,\ell-1}$. Similarly, we have
$\mathbf{y}_i^\top\hat\theta_{i,\ell-1} \ge \mathbf{x}_i^{*\top}\hat\theta_{i,\ell-1} - 2^{-\ell+1}\hat\beta_{i,\ell-1}.$
(F.10)
Now we compute the regret incurred in round π .
$2\mathbf{x}_i^{*\top}\theta^* - (\mathbf{x}_i^\top\theta^* + \mathbf{y}_i^\top\theta^*) = (\mathbf{x}_i^*-\mathbf{x}_i)^\top\theta^* + (\mathbf{x}_i^*-\mathbf{y}_i)^\top\theta^*$
$\le (\mathbf{x}_i^*-\mathbf{x}_i)^\top\hat\theta_{i,\ell-1} + \big|(\mathbf{x}_i^*-\mathbf{x}_i)^\top(\hat\theta_{i,\ell-1}-\theta^*)\big| + (\mathbf{x}_i^*-\mathbf{y}_i)^\top\hat\theta_{i,\ell-1} + \big|(\mathbf{x}_i^*-\mathbf{y}_i)^\top(\hat\theta_{i,\ell-1}-\theta^*)\big|$
$\le 2^{-\ell+1}\hat\beta_{i,\ell-1} + \|\mathbf{x}_i^*-\mathbf{x}_i\|_{\hat\Sigma_{i,\ell-1}^{-1}}\|\hat\theta_{i,\ell-1}-\theta^*\|_{\hat\Sigma_{i,\ell-1}} + 2^{-\ell+1}\hat\beta_{i,\ell-1} + \|\mathbf{x}_i^*-\mathbf{y}_i\|_{\hat\Sigma_{i,\ell-1}^{-1}}\|\hat\theta_{i,\ell-1}-\theta^*\|_{\hat\Sigma_{i,\ell-1}}$
$\le 8\cdot 2^{-\ell}\hat\beta_{i,\ell-1},$
(F.11)
where the first inequality holds due to the basic inequality $x \le |x|$ for all $x\in\mathbb{R}$. The second inequality holds due to (F.9), (F.10) and the Cauchy-Schwarz inequality. The last inequality holds due to (F.8) and Lemma E.1. Now we can return to the summation of the regret over the index set $\Psi_{T+1,\ell}$.
$\sum_{i\in\Psi_{T+1,\ell}}\big(2\mathbf{x}_i^{*\top}\theta^* - (\mathbf{x}_i^\top\theta^* + \mathbf{y}_i^\top\theta^*)\big) \le \sum_{i\in\Psi_{T+1,\ell}} 8\cdot 2^{-\ell}\hat\beta_{i,\ell-1} \le 8\cdot 2^{-\ell}\hat\beta_{T,\ell-1}|\Psi_{T+1,\ell}| = 8\cdot 2^{\ell}\hat\beta_{T,\ell-1}\sum_{i\in\Psi_{T+1,\ell}}\|w_i(\mathbf{x}_i-\mathbf{y}_i)\|_{\hat\Sigma_{i,\ell}^{-1}}^2 \le 8\cdot 2^{\ell}\hat\beta_{T,\ell-1}\cdot 2d\log\big(1+2^{2\ell+2}T/\lambda\big),$
where the first inequality holds due to (F.11). The second inequality holds due to our choice of $w_i$ such that $\|w_i(\mathbf{x}_i-\mathbf{y}_i)\|_{\hat\Sigma_{i,\ell}^{-1}} = 2^{-\ell}$, which implies $|\Psi_{T+1,\ell}| = 2^{2\ell}\sum_{i\in\Psi_{T+1,\ell}}\|w_i(\mathbf{x}_i-\mathbf{y}_i)\|_{\hat\Sigma_{i,\ell}^{-1}}^2$. The last inequality holds due to Lemma G.1. Therefore, we complete the proof of Lemma E.5. ∎
Appendix G Auxiliary Lemmas
Lemma G.1 (Lemma 11, Abbasi-Yadkori et al. 2011).
For any $\lambda > 0$ and sequence $\{\mathbf{x}_k\}_{k=1}^K \subseteq \mathbb{R}^d$, define $\mathbf{Z}_k = \lambda I + \sum_{i=1}^{k-1}\mathbf{x}_i\mathbf{x}_i^\top$. Then, provided that $\|\mathbf{x}_k\|_2 \le L$ holds for all $k\in[K]$, we have
$\sum_{k=1}^K \min\big\{1, \|\mathbf{x}_k\|_{\mathbf{Z}_k^{-1}}^2\big\} \le 2d\log\big(1 + KL^2/(d\lambda)\big).$
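Lemma G.1 is a deterministic statement, so it can be sanity-checked numerically. The following sketch (with arbitrary illustrative choices of $d$, $K$, and $\lambda$) accumulates the left-hand side online and compares it against the bound:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, lam, L = 5, 2000, 1.0, 1.0
X = rng.normal(size=(K, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce ||x_k||_2 <= L

Z = lam * np.eye(d)                        # Z_1 = lambda * I
potential = 0.0
for x in X:
    quad = x @ np.linalg.solve(Z, x)       # ||x_k||^2 with respect to Z_k^{-1}
    potential += min(1.0, quad)
    Z += np.outer(x, x)                    # Z_{k+1} = Z_k + x_k x_k^T

bound = 2 * d * np.log(1 + K * L ** 2 / (d * lam))
print(potential <= bound)                  # True: the lemma holds deterministically
```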
Lemma G.2 (Freedman 1975).
Let $M, v > 0$ be fixed constants. Let $\{x_i\}_{i=1}^n$ be a stochastic process and $\{\mathcal{G}_i\}_{i\in[n]}$ a filtration such that for all $i\in[n]$, $x_i$ is $\mathcal{G}_i$-measurable, while almost surely
$\mathbb{E}[x_i \mid \mathcal{G}_{i-1}] = 0,\quad |x_i| \le M,\quad \sum_{i=1}^n\mathbb{E}[x_i^2 \mid \mathcal{G}_{i-1}] \le v.$
Then for any $\delta > 0$, with probability at least $1-\delta$, we have
$\sum_{i=1}^n x_i \le \sqrt{2v\log(1/\delta)} + \tfrac{2}{3}M\log(1/\delta).$
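As an illustration of Lemma G.2, the following Monte Carlo sketch checks that the deviation bound holds empirically in at least a $1-\delta$ fraction of trials. The increments are Rademacher variables scaled by $M$ (a zero-mean, bounded choice), and all constants are arbitrary illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, M, delta, trials = 1000, 1.0, 0.01, 2000
v = n * M ** 2                                   # sum of conditional variances
bound = np.sqrt(2 * v * np.log(1 / delta)) + (2 / 3) * M * np.log(1 / delta)
hits = 0
for _ in range(trials):
    x = rng.choice([-M, M], size=n)              # zero-mean increments, |x_i| <= M
    hits += (x.sum() <= bound)
print(hits / trials)                             # should be at least 1 - delta
```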
Lemma G.3 (Zhao et al. 2023a).
Let $\{\mathcal{G}_k\}_{k=1}^\infty$ be a filtration, and $\{\mathbf{x}_k,\eta_k\}_{k\ge1}$ a stochastic process such that $\mathbf{x}_k\in\mathbb{R}^d$ is $\mathcal{G}_k$-measurable and $\eta_k\in\mathbb{R}$ is $\mathcal{G}_{k+1}$-measurable. Let $R, L, \lambda > 0$ and $\boldsymbol\mu^*\in\mathbb{R}^d$. For $k\ge1$, let $y_k = \langle\boldsymbol\mu^*,\mathbf{x}_k\rangle + \eta_k$, where $\eta_k,\mathbf{x}_k$ satisfy
$\mathbb{E}[\eta_k \mid \mathcal{G}_k] = 0,\quad |\eta_k| \le R,\quad \sum_{i=1}^k\mathbb{E}[\eta_i^2 \mid \mathcal{G}_i] \le v_k\ \text{for all } k\ge1.$
For $k\ge1$, let $\mathbf{Z}_k = \lambda I + \sum_{i=1}^k\mathbf{x}_i\mathbf{x}_i^\top$, $\mathbf{b}_k = \sum_{i=1}^k y_i\mathbf{x}_i$, $\boldsymbol\mu_k = \mathbf{Z}_k^{-1}\mathbf{b}_k$, and
$\beta_k = 16\rho\sqrt{v_k\log(4k^2/\delta)} + 6\rho R\log(4k^2/\delta),$
where $\rho \ge \sup_{k\ge1}\|\mathbf{x}_k\|_{\mathbf{Z}_{k-1}^{-1}}$. Then, for any $0 < \delta < 1$, with probability at least $1-\delta$, we have
$\forall k\ge1,\quad \Big\|\sum_{i=1}^k\mathbf{x}_i\eta_i\Big\|_{\mathbf{Z}_k^{-1}} \le \beta_k,\quad \|\boldsymbol\mu_k - \boldsymbol\mu^*\|_{\mathbf{Z}_k} \le \beta_k + \sqrt\lambda\|\boldsymbol\mu^*\|_2.$
Theorem G.4 (Brouwer invariance of domain, Brouwer 1911).
Let $U$ be an open subset of $\mathbb{R}^d$, and let $f: U\to\mathbb{R}^d$ be a continuous injective map. Then $f(U)$ is also open.